Apache DataFusion Comet 0.8.0 Release

Posted on: Tue 06 May 2025 by pmc

The Apache DataFusion PMC is pleased to announce version 0.8.0 of the Comet subproject.

Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for improved performance and efficiency without requiring any code changes.

This release covers approximately six weeks of development work and is the result of merging 81 PRs from 11 contributors. See the change log for more information.

Release Highlights

Performance & Stability

Up to 4x speedup in jobs using dropDuplicates, thanks to optimizations in the first_value and last_value aggregate functions in DataFusion 47.0.0.
Introduction of a global Tokio runtime, which resolves potential deadlocks in certain multi-task scenarios.
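
To see why first_value matters for dropDuplicates, it helps to view deduplication as a grouped aggregate: group by the key columns and keep the first value seen for every other column. The following is a minimal pure-Python sketch of that idea, not Comet or DataFusion code; the function name and sample rows are hypothetical.

```python
def drop_duplicates(rows, subset):
    """Keep one row per distinct key, first_value-style (illustrative only)."""
    first_seen = {}
    for row in rows:
        key = tuple(row[col] for col in subset)
        # Only the first row encountered for each key is retained.
        first_seen.setdefault(key, row)
    return list(first_seen.values())


rows = [
    {"user": "ann", "city": "Oslo"},
    {"user": "bob", "city": "Rome"},
    {"user": "ann", "city": "Lima"},
]
deduped = drop_duplicates(rows, subset=["user"])
print(deduped)
```

In a distributed engine the row that counts as "first" depends on input ordering, so this sketch only illustrates the shape of the aggregation.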

Native Shuffle Improvements

Significant enhancements to the native shuffle mechanism include:

Lower memory usage by using interleave_record_batches instead of array builders.
Support for complex types in shuffle data (note: hash partition expressions still require primitive types).
Reclaimable shuffle files, reducing disk pressure.
Native shuffle now respects spark.local.dir for temporary storage.
Per-task shuffle metrics are now available, providing better visibility into execution behavior.
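
The interleave approach mentioned above can be sketched in plain Python. Instead of appending every row to a per-partition builder as it arrives, the partitioner records only (batch index, row index) pairs and gathers the rows for each partition in one pass. This is an illustration under simplifying assumptions, not Comet's implementation: batches are dicts of column lists, a modulo stands in for the real hash function, and all names are hypothetical.

```python
from collections import defaultdict


def interleave(batches, indices):
    """Gather the rows named by (batch_index, row_index) pairs into a new batch."""
    columns = batches[0].keys()
    return {name: [batches[b][name][r] for b, r in indices] for name in columns}


def partition_indices(batches, key, num_partitions):
    """Record where each row belongs without copying any values yet."""
    plan = defaultdict(list)
    for b, batch in enumerate(batches):
        for r, value in enumerate(batch[key]):
            # Modulo is a stand-in for a real hash partitioner.
            plan[value % num_partitions].append((b, r))
    return plan


batches = [
    {"id": [0, 1, 2], "name": ["a", "b", "c"]},
    {"id": [3, 4], "name": ["d", "e"]},
]
plan = partition_indices(batches, key="id", num_partitions=2)
partitions = {p: interleave(batches, idx) for p, idx in sorted(plan.items())}
print(partitions[0])
print(partitions[1])
```

Until the gather step, the only extra state is the list of index pairs, which is much smaller than a set of partially filled builders holding copied values.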

Experimental Support for DataFusion’s Parquet Scan

It is now possible to configure Comet to use DataFusion’s Parquet reader instead of Comet’s current Parquet reader. DataFusion’s reader supports complex types and includes performance optimizations that are not present in Comet’s existing reader.

This release continues the ongoing improvements and bug fixes and supports more use cases, but some known issues remain:

There are schema coercion bugs for nested types containing INT96 columns, which can cause incorrect results.
There are compatibility issues when reading integer values that do not fit their type annotation, such as the value 1024 stored in a field annotated as int(8).
A small number of Spark SQL tests remain unsupported (#1545).
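
To make the int(8) issue above concrete: Parquet stores small integers in a wider physical type and annotates them with a logical bit width, so a file can contain a value that the annotation says should not exist. The sketch below only computes the range implied by an annotation and checks whether a value fits. It makes no claim about how any particular reader handles the out-of-range case, and the function names are hypothetical.

```python
def annotated_range(bit_width, signed=True):
    """Return the (min, max) values representable at the annotated bit width."""
    if signed:
        return -(1 << (bit_width - 1)), (1 << (bit_width - 1)) - 1
    return 0, (1 << bit_width) - 1


def fits_annotation(value, bit_width, signed=True):
    """Check whether a stored integer is within its annotated range."""
    low, high = annotated_range(bit_width, signed)
    return low <= value <= high


print(annotated_range(8))
print(fits_annotation(1024, 8))
print(fits_annotation(127, 8))
```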

To enable DataFusion’s Parquet reader, either set spark.comet.scan.impl=native_datafusion or set the environment variable COMET_PARQUET_SCAN_IMPL=native_datafusion.
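
As a rough sketch, the two options above could be wired into a launcher script like this. Only the configuration key, the environment variable, and the value native_datafusion come from this post; the helper names and the my_job.py script are hypothetical.

```python
import os

SCAN_IMPL_KEY = "spark.comet.scan.impl"
SCAN_IMPL_ENV = "COMET_PARQUET_SCAN_IMPL"


def scan_conf_args(impl="native_datafusion"):
    """Return spark-submit arguments that select the Parquet scan implementation."""
    return ["--conf", f"{SCAN_IMPL_KEY}={impl}"]


def scan_env(impl="native_datafusion", base_env=None):
    """Return a copy of the environment with the scan implementation set."""
    env = dict(os.environ if base_env is None else base_env)
    env[SCAN_IMPL_ENV] = impl
    return env


# Option 1: pass the setting as a Spark configuration property.
print(" ".join(["spark-submit", *scan_conf_args(), "my_job.py"]))

# Option 2: set the environment variable for the launched process.
print(scan_env(base_env={})[SCAN_IMPL_ENV])
```

Either option on its own is enough; the post describes them as alternatives.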

Updates to Supported Spark Versions

Added support for Spark 3.5.5
Dropped support for Spark 3.3.x

Getting Involved

The Comet project welcomes new contributors. We use the same Slack and Discord channels as the main DataFusion project and have a weekly DataFusion video call.

The easiest way to get involved is to test Comet with your current Spark jobs and file issues for any bugs or performance regressions that you find. See the Getting Started guide for instructions on downloading and installing Comet.

There are also many good first issues waiting for contributions.
