Apache DataFusion Comet 0.8.0 Release
Posted on: Tue 06 May 2025 by pmc
The Apache DataFusion PMC is pleased to announce version 0.8.0 of the Comet subproject.
Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for improved performance and efficiency without requiring any code changes.
This release covers approximately six weeks of development work and is the result of merging 81 PRs from 11 contributors. See the change log for more information.
Release Highlights
Performance & Stability
- Up to 4x speedup in jobs using dropDuplicates, thanks to optimizations in the first_value and last_value aggregate functions in DataFusion 47.0.0.
- Introduction of a global Tokio runtime, which resolves potential deadlocks in certain multi-task scenarios.
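To see why aggregate performance matters for dropDuplicates: Spark's optimizer rewrites a deduplication over key columns into a group-by that keeps a first value for each remaining column. The following is a minimal pure-Python sketch of those semantics only. It is not Comet or DataFusion code, and the function name and sample rows are hypothetical.

```python
def drop_duplicates(rows, key_columns):
    """Keep one row per distinct key, taking the first values seen for the other columns."""
    first_values = {}
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        # setdefault stores the row only the first time this key appears,
        # which mirrors a first_value aggregate over the non-key columns.
        first_values.setdefault(key, row)
    return list(first_values.values())


# Hypothetical sample data, for illustration only.
events = [
    {"user": "a", "page": "home"},
    {"user": "b", "page": "cart"},
    {"user": "a", "page": "checkout"},
]
deduped = drop_duplicates(events, ["user"])
```

In a distributed engine the row that counts as "first" depends on how the data is partitioned and ordered, so which duplicate survives is not guaranteed unless the input is explicitly ordered.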
Native Shuffle Improvements
Significant enhancements to the native shuffle mechanism include:
- Lower memory usage by using interleave_record_batches instead of array builders.
- Support for complex types in shuffle data (note: hash partition expressions still require primitive types).
- Reclaimable shuffle files, reducing disk pressure.
- Respects spark.local.dir for temporary storage.
- Per-task shuffle metrics are now available, providing better visibility into execution behavior.
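The general idea behind an interleave-style kernel can be sketched in a few lines: instead of appending every row to a per-partition builder as it arrives, record which (batch, row) positions belong to each partition and gather the rows once when the partition is written. The sketch below is a conceptual pure-Python illustration under that assumption. It is not Comet's implementation, batches are modeled as plain lists, and the helper names are hypothetical.

```python
def partition_indices(batches, num_partitions, key):
    """Record, per partition, the (batch, row) positions assigned to it, without copying rows."""
    partitions = [[] for _ in range(num_partitions)]
    for batch_index, batch in enumerate(batches):
        for row_index, row in enumerate(batch):
            partitions[key(row) % num_partitions].append((batch_index, row_index))
    return partitions


def interleave(batches, indices):
    """Gather rows from several batches in the order given by (batch, row) pairs."""
    return [batches[batch_index][row_index] for batch_index, row_index in indices]


# Hypothetical input: two buffered batches of integer rows, split into two partitions.
buffered = [[1, 2, 3], [4, 5]]
positions = partition_indices(buffered, num_partitions=2, key=lambda value: value)
partition_outputs = [interleave(buffered, p) for p in positions]
```

Only the small index lists grow while input is buffered; the row data is copied a single time per output partition.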
Experimental Support for DataFusion’s Parquet Scan
It is now possible to configure Comet to use DataFusion’s Parquet reader instead of Comet’s current Parquet reader. This has the advantage of supporting complex types, and also has performance optimizations that are not present in Comet's existing reader.
This release continues the ongoing improvements and bug fixes for this reader and supports more use cases, but some known issues remain:
- There are schema coercion bugs for nested types containing INT96 columns, which can cause incorrect results.
- There are compatibility issues when reading integer values that are larger than their type annotation, such as the value 1024 being stored in a field annotated as int(8).
- A small number of Spark SQL tests remain unsupported (#1545).
To enable DataFusion’s Parquet reader, either set spark.comet.scan.impl=native_datafusion or set the environment variable COMET_PARQUET_SCAN_IMPL=native_datafusion.
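As a rough sketch of the two options above, the helpers below build either the spark-submit argument or the environment for a job launch. Only the setting name, the environment variable name, and the value come from this post; the helper names are hypothetical, and the sketch assumes Comet itself is already installed and enabled as described in the Getting Started guide.

```python
import os

# Names and value taken from the release notes.
SCAN_IMPL_KEY = "spark.comet.scan.impl"
SCAN_IMPL_ENV = "COMET_PARQUET_SCAN_IMPL"
NATIVE_DATAFUSION = "native_datafusion"


def datafusion_scan_conf_args():
    """Return spark-submit arguments that select DataFusion's Parquet reader."""
    return ["--conf", f"{SCAN_IMPL_KEY}={NATIVE_DATAFUSION}"]


def datafusion_scan_env(base_env=None):
    """Return a copy of the environment with the scan implementation variable set."""
    env = dict(os.environ if base_env is None else base_env)
    env[SCAN_IMPL_ENV] = NATIVE_DATAFUSION
    return env


conf_args = datafusion_scan_conf_args()
launch_env = datafusion_scan_env({})
```

For example, a launcher script could pass `["spark-submit", *conf_args, "job.py"]` to `subprocess.run`, where `job.py` stands in for your own application, or pass `env=datafusion_scan_env()` instead to use the environment variable.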
Updates to Supported Spark Versions
- Added support for Spark 3.5.5
- Dropped support for Spark 3.3.x
Getting Involved
The Comet project welcomes new contributors. We use the same Slack and Discord channels as the main DataFusion project and have a weekly DataFusion video call.
The easiest way to get involved is to test Comet with your current Spark jobs and file issues for any bugs or performance regressions that you find. See the Getting Started guide for instructions on downloading and installing Comet.
There are also many good first issues waiting for contributions.
Copyright 2025, The Apache Software Foundation, Licensed under the Apache License, Version 2.0.
Apache® and the Apache feather logo are trademarks of The Apache Software Foundation.