Apache DataFusion Comet 0.9.0 Release

Posted on: Tue 01 July 2025 by pmc

The Apache DataFusion PMC is pleased to announce version 0.9.0 of the Comet subproject.

Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for improved performance and efficiency without requiring any code changes.

This release covers approximately ten weeks of development work and is the result of merging 139 PRs from 24 contributors. See the change log for more information.

Release Highlights

Complex Type Support in Parquet Scans

Comet now supports complex types (Structs, Maps, and Arrays) when reading Parquet files. This functionality is not yet available when reading Parquet files from Apache Iceberg.

This functionality was only available in previous releases when manually specifying one of the new experimental scan implementations. Comet now automatically chooses the best scan implementation based on the input schema, and no longer requires manual configuration.

Complex Type Processing Improvements

Numerous improvements have been made to complex type support to ensure Spark-compatible behavior when casting between structs and accessing fields within deeply nested types.

Shuffle Improvements

Comet now accelerates a broader range of shuffle operations, leading to more queries running fully natively. In previous releases, some shuffle operations fell back to Spark to avoid some known bugs in Comet, and these bugs have now been fixed.

New Features

Comet 0.9.0 adds support for the following Spark expressions:

  • ArrayDistinct
  • ArrayMax
  • ArrayRepeat
  • ArrayUnion
  • BitCount
  • BitNot
  • Expm1
  • MapValues
  • Signum
  • ToPrettyString
  • map[]

Improved Spark SQL Test Coverage

Comet now passes 97% of the Spark SQL test suite, with more than 24,000 tests passing (based on testing against Spark 3.5.6). The remaining 3% of tests are ignored for various reasons, such as being too specific to Spark internals, or testing for features that are not relevant to Comet, such as whole-stage code generation, which is not needed when using a vectorized execution engine.

This release contains numerous bug fixes to achieve this coverage, including improved support for exchange reuse when AQE is enabled.

Module Passed Ignored Canceled Total
catalyst7,232517,238
core-19,18624669,438
core-22,64939303,042
core-31,757136161,909
hive-12,1741442,192
hive-2191424
hive-31,0581141,073
Total24,0758063124,912

Memory & Performance Tracing

Comet now provides a tracing feature for analyzing performance and off-heap versus on-heap memory usage. See the Comet Tracing Guide for more information.

Comet Tracing

Spark Compatibility

  • Spark 3.4.3 with JDK 11 & 17, Scala 2.12 & 2.13
  • Spark 3.5.4 through 3.5.6 with JDK 11 & 17, Scala 2.12 & 2.13
  • Experimental support for Spark 4.0.0 with JDK 17, Scala 2.13

We are looking for help from the community to fully support Spark 4.0.0. See EPIC: Support 4.0.0 for more information.

Note that Java 8 support was removed from this release because Apache Arrow no longer supports it.

Getting Involved

The Comet project welcomes new contributors. We use the same Slack and Discord channels as the main DataFusion project and have a weekly DataFusion video call.

The easiest way to get involved is to test Comet with your current Spark jobs and file issues for any bugs or performance regressions that you find. See the Getting Started guide for instructions on downloading and installing Comet.

There are also many good first issues waiting for contributions.

Copyright 2025, The Apache Software Foundation, Licensed under the Apache License, Version 2.0.
Apache® and the Apache feather logo are trademarks of The Apache Software Foundation.