Apache DataFusion Comet 0.6.0 Release
Posted on: Mon 17 February 2025 by pmc
The Apache DataFusion PMC is pleased to announce version 0.6.0 of the Comet subproject.
Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for improved performance and efficiency without requiring any code changes.
Comet runs on commodity hardware and aims to provide 100% compatibility with Apache Spark. Any operators or expressions that are not fully compatible will fall back to Spark unless explicitly enabled by the user. Refer to the compatibility guide for more information.
This release covers approximately four weeks of development work and is the result of merging 39 PRs from 12 contributors. See the change log for more information.
Starting with this release, we plan to release new versions of Comet more frequently, typically within 1-2 weeks of each major DataFusion release. The main motivation for this change is to better support downstream Rust projects that depend on the datafusion_comet_spark_expr crate.
Release Highlights
DataFusion Upgrade
- Comet 0.6.0 uses DataFusion 45.0.0
New Features
- Comet now supports array_join, array_intersect, and arrays_overlap. Note that these expressions are not yet guaranteed to be 100% compatible with Spark for all input data types, so these expressions are only enabled with the configuration setting spark.comet.expression.allowIncompatible=true.
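As a rough illustration of opting in, the sketch below assembles spark-submit arguments that turn on the incompatible expressions. The helper name comet_submit_args is hypothetical, and the spark.plugins and spark.comet.enabled values are assumptions based on a typical Comet installation; only spark.comet.expression.allowIncompatible comes from this release note. Check the Getting Started guide for the settings that apply to your deployment.

```python
# Hypothetical helper for illustration only; it is not part of Comet.
def comet_submit_args(allow_incompatible=False):
    """Return spark-submit --conf arguments for a Comet-enabled job."""
    conf = {
        # Assumed plugin class and enable flag for a typical Comet setup.
        "spark.plugins": "org.apache.spark.CometPlugin",
        "spark.comet.enabled": "true",
        # Setting named in this release note; gates array_join,
        # array_intersect, and arrays_overlap.
        "spark.comet.expression.allowIncompatible": str(allow_incompatible).lower(),
    }
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Print a command line that could be adapted for a real job.
    print(" ".join(["spark-submit"] + comet_submit_args(allow_incompatible=True)))
```

Leaving allow_incompatible at its default keeps the new array expressions on the Spark fallback path.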
Performance & Stability
- Metrics from native execution are now updated in Spark every 3 seconds by default, rather than for each batch being processed. The mechanism for passing the metrics via JNI is also more efficient.
- New memory pool options "fair unified" and "unbounded" have been added. See the Comet Tuning Guide for more information.
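The difference between per-batch and time-based metric updates can be sketched as follows. This is not Comet's implementation, which passes metrics from native code over JNI. The class ThrottledMetricsReporter, its method names, and the choice to accumulate per-batch deltas are all assumptions made for illustration.

```python
import time


class ThrottledMetricsReporter:
    """Accumulates per-batch metrics and publishes them at most once per interval."""

    def __init__(self, publish, interval_seconds=3.0, clock=time.monotonic):
        self._publish = publish            # callback that receives a dict of metrics
        self._interval = interval_seconds  # 3 seconds mirrors the default in this release
        self._clock = clock                # injectable so the behaviour can be tested
        self._last_publish = None
        self._pending = {}

    def on_batch(self, metrics):
        # Merge this batch's counters into the pending totals.
        for name, value in metrics.items():
            self._pending[name] = self._pending.get(name, 0) + value
        now = self._clock()
        # Publish on the first batch, then only once the interval has elapsed.
        if self._last_publish is None or now - self._last_publish >= self._interval:
            self.flush(now)

    def flush(self, now=None):
        # Publish whatever has accumulated and restart the interval.
        if self._pending:
            self._publish(dict(self._pending))
            self._pending.clear()
        self._last_publish = self._clock() if now is None else now
```

A caller would also invoke flush() when the task finishes so that the final partial interval is not lost.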
Bug Fixes
- Hashing of decimal values with precision <= 18 is now compatible with Spark
- Comet falls back to Spark when hashing decimals with precision > 18
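One way to see why precision 18 is a natural boundary: every decimal with at most 18 digits has an unscaled integer value that fits in a signed 64-bit integer, while 19 digits can overflow it. The sketch below only demonstrates that range argument. It does not reproduce Spark's or Comet's hash function, and the helper names are hypothetical.

```python
from decimal import Decimal

INT64_MAX = 2**63 - 1
INT64_MIN = -(2**63)


def unscaled_value(value: Decimal) -> int:
    """Return the digits of a finite Decimal as an integer, ignoring the decimal point."""
    sign, digits, _exponent = value.as_tuple()
    magnitude = int("".join(map(str, digits)))
    return -magnitude if sign else magnitude


def fits_in_int64(precision: int) -> bool:
    """True if every unscaled value with this many digits fits in a signed 64-bit integer."""
    largest = 10**precision - 1
    return INT64_MIN <= -largest and largest <= INT64_MAX


if __name__ == "__main__":
    print(unscaled_value(Decimal("123.45")))
    print(fits_in_int64(18), fits_in_int64(19))
```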
Getting Involved
The Comet project welcomes new contributors. We use the same Slack and Discord channels as the main DataFusion project and have a weekly DataFusion video call.
The easiest way to get involved is to test Comet with your current Spark jobs and file issues for any bugs or performance regressions that you find. See the Getting Started guide for instructions on downloading and installing Comet.
There are also many good first issues waiting for contributions.
Copyright 2025, The Apache Software Foundation, Licensed under the Apache License, Version 2.0.
Apache® and the Apache feather logo are trademarks of The Apache Software Foundation.