Apache DataFusion Comet 0.11.0 Release

Posted on: Tue 21 October 2025 by pmc

The Apache DataFusion PMC is pleased to announce version 0.11.0 of the Comet subproject.

Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for improved performance and efficiency without requiring any code changes.

This release covers approximately five weeks of development work and is the result of merging 131 PRs from 15 contributors. See the change log for more information.

Release Highlights¶

Parquet Modular Encryption Support¶

Spark supports Parquet Modular Encryption to independently encrypt column values and metadata. Furthermore, Spark supports custom encryption factories for users to provide their own key-management service (KMS) implementations. Thanks to a number of contributions in upstream DataFusion and arrow-rs, Comet now supports Parquet Modular Encryption with Spark KMS for native readers, enabling secure reading of encrypted Parquet files in production environments.

Improved Memory Management¶

Comet 0.11.0 introduces significant improvements to memory management, making it easier to deploy and more resilient to out-of-memory conditions:

Changed default memory pool: The default off-heap memory pool has been changed from greedy_unified to fair_unified, providing better memory fairness across operations
Off-heap deployment recommended: To simplify configuration and improve performance, Comet now expects to be deployed with Spark's off-heap memory configuration. On-heap memory is still available for development and debugging, but is not recommended for deployment
Better disk management: The DiskManager max_temp_directory_size is now configurable for better control over temporary disk usage
Enhanced safety: Memory pool operations now use checked arithmetic operations to prevent overflow issues

These changes make Comet significantly easier to configure and deploy in production environments.

Improved Apache Spark 4.0 Support¶

Comet has improved its support for Apache Spark 4.0.1 with several important enhancements:

Updated support from Spark 4.0.0 to Spark 4.0.1
Spark 4.0 is now included in the release build script
Expanded ANSI mode compatibility with several new implementations:
ANSI evaluation mode arithmetic operations
ANSI mode integral divide
ANSI mode rounding functions
ANSI mode remainder function

Spark 4.0 compatible jar files are now available on Maven Central. See the installation guide for instructions on using published jar files.

Complex Types for Columnar Shuffle¶

ashdnazg submitted a fantastic refactoring PR that simplified the logic for writing rows in Comet’s JVM-based, columnar shuffle. A benefit of this refactoring is better support for complex types (e.g., structs, lists, and arrays) in columnar shuffle. Comet no longer falls back to Spark to shuffle these types, enabling native acceleration for queries involving nested data structures. This enhancement significantly expands the range of queries that can benefit from Comet's columnar shuffle implementation.

RangePartitioning for Native Shuffle¶

Comet's native shuffle now supports RangePartitioning, providing better performance for operations that require range-based data distribution. Comet now matches Spark behavior for computing and distributing range boundaries, and serializes them to native execution for faster shuffle operations.

New Functionality¶

The following SQL functions are now supported:

weekday - Extract day of week from date
lpad - Left pad a string with column support for pad length
rpad - Right pad a string with column support and additional character support
reverse - Support for ArrayType input in addition to strings
count(distinct) - Native support without falling back to Spark
bit_get - Get bit value at position

New expression capabilities include:

Performance Improvements¶

Improved BroadcastExchangeExec conversion for better broadcast join performance
Use of DataFusion's native count_udaf instead of SUM(IF(expr IS NOT NULL, 1, 0))
New configuration from shared conf to reduce overhead
Buffered index writes to reduce system calls in shuffle operations

Comet 0.11.0 TPC-H Performance¶

Comet 0.11.0 continues to deliver significant performance improvements over Spark. In our TPC-H benchmarks, Comet reduced overall query runtime from 687 seconds to 302 seconds when processing 100 GB of Parquet data using a single 8-core executor, achieving a 2.2x speedup.

TPC-H Overall Performance

The performance gains are consistent across individual queries, with most queries showing substantial improvements:

TPC-H Query-by-Query Comparison

You can reproduce these benchmarks using our Comet Benchmarking Guide. We encourage you to run your own performance tests with your workloads.

Apache Iceberg Support¶

Updated support for Apache Iceberg 1.9.1
Additional Parquet-independent API improvements for Iceberg integration
Improved resource management in Iceberg reader instances

UX Improvements¶

Added plan conversion statistics to extended explain info for better observability
Improved fallback information to help users understand when and why Comet falls back to Spark
Added backtrace feature to simplify enabling native backtraces in CometNativeException
Native log level is now configurable via Comet configuration

Bug Fixes¶

Documentation Updates¶

Updated documentation for native shuffle configuration and tuning
Added documentation for ANSI mode support in various functions
Improved EC2 benchmarking guide
Split configuration guide into different sections (scan, exec, shuffle, etc.) for better organization
Various clarifications and improvements throughout the documentation

Spark Compatibility¶

Spark 3.4.3 with JDK 11 & 17, Scala 2.12 & 2.13
Spark 3.5.4 through 3.5.6 with JDK 11 & 17, Scala 2.12 & 2.13
Spark 4.0.1 with JDK 17, Scala 2.13

We are looking for help from the community to fully support Spark 4.0.1. See EPIC: Support 4.0.0 for more information.

Getting Involved¶

The Comet project welcomes new contributors. We use the same Slack and Discord channels as the main DataFusion project and have a weekly DataFusion video call.

The easiest way to get involved is to test Comet with your current Spark jobs and file issues for any bugs or performance regressions that you find. See the Getting Started guide for instructions on downloading and installing Comet.

There are also many good first issues waiting for contributions.

Comments

Copyright 2025, The Apache Software Foundation, Licensed under the Apache License, Version 2.0.
Apache® and the Apache feather logo are trademarks of The Apache Software Foundation.