<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Apache DataFusion Blog - pmc</title><link href="https://datafusion.apache.org/blog/" rel="alternate"/><link href="https://datafusion.apache.org/blog/feeds/pmc.atom.xml" rel="self"/><id>https://datafusion.apache.org/blog/</id><updated>2026-05-07T00:00:00+00:00</updated><entry><title>Apache DataFusion Comet 0.16.0 Release</title><link href="https://datafusion.apache.org/blog/2026/05/07/datafusion-comet-0.16.0" rel="alternate"/><published>2026-05-07T00:00:00+00:00</published><updated>2026-05-07T00:00:00+00:00</updated><author><name>pmc</name></author><id>tag:datafusion.apache.org,2026-05-07:/blog/2026/05/07/datafusion-comet-0.16.0</id><summary type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;p&gt;The Apache DataFusion PMC is pleased to announce version 0.16.0 of the &lt;a href="https://datafusion.apache.org/comet/"&gt;Comet&lt;/a&gt; subproject.&lt;/p&gt;
&lt;p&gt;Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
improved performance and efficiency without requiring any code changes.&lt;/p&gt;
&lt;p&gt;This release covers approximately three weeks of development …&lt;/p&gt;</summary><content type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;p&gt;The Apache DataFusion PMC is pleased to announce version 0.16.0 of the &lt;a href="https://datafusion.apache.org/comet/"&gt;Comet&lt;/a&gt; subproject.&lt;/p&gt;
&lt;p&gt;Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
improved performance and efficiency without requiring any code changes.&lt;/p&gt;
&lt;p&gt;This release covers approximately three weeks of development work and is the result of merging 115 PRs from 17
contributors. See the &lt;a href="https://github.com/apache/datafusion-comet/blob/main/dev/changelog/0.16.0.md"&gt;change log&lt;/a&gt; for more information.&lt;/p&gt;
&lt;h2 id="expanded-spark-4-support"&gt;Expanded Spark 4 Support&lt;a class="headerlink" href="#expanded-spark-4-support" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Spark 4 is a major theme of this release. Comet now ships first-class support for both Spark 4.0.2 and
Spark 4.1.1, with dedicated Maven profiles, shim sources, and CI matrices for each.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Spark 4.1.1&lt;/strong&gt;: New &lt;code&gt;spark-4.1&lt;/code&gt; Maven profile and shim sources, with Comet's PR test matrix and Spark SQL
  test suites enabled against Spark 4.1.1. The default Maven profile has been updated to Spark 4.1 / Scala 2.13
  to reflect that this is now the primary development target.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Shared 4.x shims&lt;/strong&gt;: Identical pieces of the Spark 4.0 and 4.1 shims have been consolidated into a shared
  &lt;code&gt;spark-4.x&lt;/code&gt; source tree, reducing duplication as more 4.x minor versions land.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Spark 4.0 / JDK 21&lt;/strong&gt;: Added a Spark 4.0 / JDK 21 CI profile to validate Comet on the JDK most users are
  expected to deploy with Spark 4.&lt;/li&gt;
&lt;/ul&gt;
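&lt;p&gt;For example, building Comet against a specific Spark line means selecting the corresponding Maven profile. The following is an illustrative invocation only (flag details may differ in your environment; see the Comet build documentation for the canonical commands):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Build against Spark 4.1 / Scala 2.13 (now the default profile)
mvn clean install -Pspark-4.1 -DskipTests

# Build against Spark 4.0
mvn clean install -Pspark-4.0 -DskipTests
&lt;/code&gt;&lt;/pre&gt;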
&lt;h3 id="adapting-to-spark-4-behavior-changes"&gt;Adapting to Spark 4 Behavior Changes&lt;a class="headerlink" href="#adapting-to-spark-4-behavior-changes" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Spark 4 introduced a number of type, planner, and on-disk format changes relative to Spark 3.x. Several
correctness fixes in this release bring Comet's behavior into line with these changes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;Variant&lt;/code&gt; type (new in Spark 4.0)&lt;/strong&gt;: Spark 4.0 added a new &lt;code&gt;Variant&lt;/code&gt; data type for semi-structured
  data. Comet does not yet read the shredded Variant on-disk format natively, and delegates these scans
  to Spark.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;String collation (new in Spark 4.0)&lt;/strong&gt;: Spark 4.0 added collation support for &lt;code&gt;StringType&lt;/code&gt;. Comet's
  native operators do not yet implement non-default collations, so hash join and sort-merge join reject
  collated string join keys, and shuffle, sort, and aggregate fall back to Spark when keys carry a
  non-default collation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Wider &lt;code&gt;TimestampNTZType&lt;/code&gt; usage&lt;/strong&gt;: Spark 4 uses &lt;code&gt;TimestampNTZType&lt;/code&gt; (timestamp without time zone) in
  more places than 3.x — for example, in expression return types and as the inferred type for some
  literal forms. This cycle Comet adds support for casts to and from &lt;code&gt;timestamp_ntz&lt;/code&gt; (including
  from string) and for &lt;code&gt;unix_timestamp&lt;/code&gt; over &lt;code&gt;TimestampNTZType&lt;/code&gt; inputs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;to_json&lt;/code&gt; and &lt;code&gt;array_compact&lt;/code&gt; (Spark 4.0)&lt;/strong&gt;: Spark 4.0 adjusted output formatting and return-type
  metadata for these expressions; Comet now matches the new behavior.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;BloomFilter V2 (new in Spark 4.1)&lt;/strong&gt;: Spark 4.1 introduced a new BloomFilter binary format with
  different bit-scattering. Comet now reads this format so that runtime filters produced by Spark 4.1
  remain usable in native execution.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Spark 4.1.1 analyzer refinements&lt;/strong&gt;: Spark 4.1.1 changed how struct projections handle the case where
  every requested child field is missing from the Parquet file, and how &lt;code&gt;allowDecimalPrecisionLoss&lt;/code&gt;
  flows through the &lt;code&gt;DecimalPrecision&lt;/code&gt; rule. Comet now preserves parent-struct nullness in the first
  case and the stored &lt;code&gt;allowDecimalPrecisionLoss&lt;/code&gt; flag in the second.&lt;/li&gt;
&lt;/ul&gt;
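&lt;p&gt;As a small illustration of the new &lt;code&gt;TimestampNTZType&lt;/code&gt; coverage, queries like the following can now evaluate natively rather than falling back to Spark (a hypothetical &lt;code&gt;spark-sql&lt;/code&gt; session; assumes Comet is enabled):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;spark-sql -e "SELECT CAST('2026-05-07 12:00:00' AS TIMESTAMP_NTZ)"
spark-sql -e "SELECT unix_timestamp(CAST('2026-05-07 12:00:00' AS TIMESTAMP_NTZ))"
&lt;/code&gt;&lt;/pre&gt;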
&lt;p&gt;Most of these behavior differences were caught because &lt;strong&gt;Comet runs the full Apache Spark SQL test suite
against each supported Spark version&lt;/strong&gt; — 3.4.3, 3.5.8, 4.0.2, and 4.1.1 — as part of CI. Running Spark's
own correctness tests through Comet's native execution path is what surfaces semantic shifts like
&lt;code&gt;TimestampNTZType&lt;/code&gt; propagation, ANSI-driven cast and overflow changes, BloomFilter V2 encoding, and the
4.1.1 analyzer rule changes, often before they show up in user workloads. As more Spark 4.x minor releases
land, this same harness is what gives us confidence that Comet keeps up.&lt;/p&gt;
&lt;h3 id="ansi-sql-semantics"&gt;ANSI SQL Semantics&lt;a class="headerlink" href="#ansi-sql-semantics" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Spark 4 enables ANSI SQL semantics by default. ANSI mode changes how arithmetic overflow, invalid casts,
division by zero, and similar error conditions are handled, and Spark itself now treats this as the standard
configuration rather than an opt-in.&lt;/p&gt;
&lt;p&gt;This is a critical area for any Spark accelerator: an engine that falls back to vanilla Spark whenever ANSI is
enabled effectively does not run on Spark 4 by default. &lt;strong&gt;Comet implements ANSI semantics for the expressions
it supports natively&lt;/strong&gt;, including arithmetic overflow checks, ANSI cast behavior, and &lt;code&gt;try_*&lt;/code&gt; variants.
Queries running with &lt;code&gt;spark.sql.ansi.enabled=true&lt;/code&gt; continue to be accelerated rather than falling back.&lt;/p&gt;
&lt;p&gt;See the &lt;a href="https://datafusion.apache.org/comet/user-guide/latest/compatibility/index.html"&gt;Comet Compatibility Guide&lt;/a&gt; for details on which expressions have full ANSI coverage.&lt;/p&gt;
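&lt;p&gt;In practice this means an ANSI-mode session can be launched with Comet as usual and still be accelerated. A minimal sketch, where &lt;code&gt;$COMET_JAR&lt;/code&gt; is a placeholder for the path to your Comet build:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;spark-shell \
  --jars $COMET_JAR \
  --conf spark.plugins=org.apache.spark.CometPlugin \
  --conf spark.sql.ansi.enabled=true
&lt;/code&gt;&lt;/pre&gt;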
&lt;h2 id="expanded-adaptive-execution-support"&gt;Expanded Adaptive Execution Support&lt;a class="headerlink" href="#expanded-adaptive-execution-support" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Modern Spark plans are adaptive: AQE re-plans stages at runtime, Dynamic Partition Pruning (DPP) prunes
fact-table partitions based on broadcast dimension filters, and &lt;code&gt;ReuseExchange&lt;/code&gt; and &lt;code&gt;ReuseSubquery&lt;/code&gt; ensure
that a broadcast or subquery referenced in multiple places executes only once. For star-schema workloads,
these mechanisms are not optional. They are often the difference between a query that reads 1% of the fact
table and one that reads all of it.&lt;/p&gt;
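&lt;p&gt;A sketch of the query shape that benefits, with hypothetical table and column names: a selective filter on a small dimension table lets DPP prune unmatched fact-table partitions at runtime instead of scanning the whole fact table:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;spark-sql -e "
  SELECT d.region, SUM(f.amount)
  FROM fact_sales f
  JOIN dim_store d ON f.store_id = d.store_id
  WHERE d.region = 'EU'
  GROUP BY d.region"
&lt;/code&gt;&lt;/pre&gt;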
&lt;p&gt;Prior to 0.16.0, Comet's native scans only partially participated in this machinery. &lt;code&gt;CometNativeScanExec&lt;/code&gt;
(the DataFusion-based native Parquet scan) fell back to Spark entirely whenever a DPP filter was present.
&lt;code&gt;CometIcebergNativeScanExec&lt;/code&gt; supported non-AQE DPP as of 0.15.0
(&lt;a href="https://github.com/apache/datafusion-comet/pull/3349"&gt;#3349&lt;/a&gt;), but without broadcast exchange reuse, so
the DPP subquery re-executed the dimension broadcast.&lt;/p&gt;
&lt;p&gt;Comet 0.16.0 closes both gaps and aligns the native Parquet and native Iceberg scans on a single DPP and
subquery-resolution path:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Non-AQE DPP for native Parquet, with broadcast exchange reuse&lt;/strong&gt;
  (&lt;a href="https://github.com/apache/datafusion-comet/pull/4011"&gt;#4011&lt;/a&gt;,
  &lt;a href="https://github.com/apache/datafusion-comet/pull/4037"&gt;#4037&lt;/a&gt;): A new &lt;code&gt;CometSubqueryBroadcastExec&lt;/code&gt; replaces
  Spark's &lt;code&gt;SubqueryBroadcastExec&lt;/code&gt; in DPP expressions and wraps a &lt;code&gt;CometBroadcastExchangeExec&lt;/code&gt;, so
  &lt;code&gt;ReuseExchangeAndSubquery&lt;/code&gt; can match the broadcast on the join side with the one in the DPP subquery,
  and the dimension table is broadcast exactly once.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AQE DPP for native Parquet&lt;/strong&gt; (&lt;a href="https://github.com/apache/datafusion-comet/pull/4112"&gt;#4112&lt;/a&gt;): Under AQE,
  Spark's &lt;code&gt;PlanAdaptiveDynamicPruningFilters&lt;/code&gt; cannot match Comet's broadcast hash join and would otherwise
  rewrite DPP to &lt;code&gt;TrueLiteral&lt;/code&gt;, disabling pruning. 0.16.0 intercepts &lt;code&gt;SubqueryAdaptiveBroadcastExec&lt;/code&gt; before
  Spark's rule runs, and applies Spark's decision tree in a Comet-aware rule that searches both the current
  stage and the root plan for a reusable broadcast. DPP subqueries are registered in AQE's shared
  &lt;code&gt;subqueryCache&lt;/code&gt; so cross-plan DPP (for example, a main query and a scalar subquery referencing the same
  dimension) deduplicates correctly. A narrower tagging-based fallback covers Spark 3.4, which lacks the
  &lt;code&gt;injectQueryStageOptimizerRule&lt;/code&gt; extension point.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AQE DPP broadcast reuse for native Iceberg&lt;/strong&gt;
  (&lt;a href="https://github.com/apache/datafusion-comet/pull/4215"&gt;#4215&lt;/a&gt;): Lifts &lt;code&gt;runtimeFilters&lt;/code&gt; to a top-level
  constructor field on &lt;code&gt;CometIcebergNativeScanExec&lt;/code&gt; (mirroring &lt;code&gt;BatchScanExec&lt;/code&gt;), so Spark's
  expression-rewrite passes can see and convert the DPP subquery. The same &lt;code&gt;CometSubqueryBroadcastExec&lt;/code&gt;
  machinery from the Parquet path now handles the Iceberg case.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scalar subquery pushdown and AQE subquery reuse&lt;/strong&gt;
  (&lt;a href="https://github.com/apache/datafusion-comet/pull/4053"&gt;#4053&lt;/a&gt;,
  &lt;a href="https://issues.apache.org/jira/browse/SPARK-43402"&gt;SPARK-43402&lt;/a&gt;): &lt;code&gt;CometNativeScanExec&lt;/code&gt; now participates
  in scalar subquery pushdown into Parquet data filters, and in AQE-time subquery deduplication via a new
  &lt;code&gt;CometReuseSubquery&lt;/code&gt; rule that re-applies Spark's &lt;code&gt;ReuseAdaptiveSubquery&lt;/code&gt; algorithm after Comet's node
  replacements.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Measured impact on TPC-DS:&lt;/strong&gt; 78 queries previously fell back to Spark whenever
DPP filters were planned, running only 30–50% natively. With native DPP in 0.16.0, the same queries run 80–97%
natively. Representative examples:&lt;/p&gt;
&lt;table class="table"&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Query&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;q1&lt;/td&gt;
&lt;td&gt;36%&lt;/td&gt;
&lt;td&gt;96%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;q4&lt;/td&gt;
&lt;td&gt;31%&lt;/td&gt;
&lt;td&gt;95%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;q31&lt;/td&gt;
&lt;td&gt;31%&lt;/td&gt;
&lt;td&gt;95%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;q74&lt;/td&gt;
&lt;td&gt;32%&lt;/td&gt;
&lt;td&gt;95%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;q92&lt;/td&gt;
&lt;td&gt;36%&lt;/td&gt;
&lt;td&gt;95%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Several Spark SQL DPP tests that Comet previously skipped have been re-enabled to validate Spark compatibility
and guard against regressions.&lt;/p&gt;
&lt;h2 id="improved-tpc-ds-benchmark-results"&gt;Improved TPC-DS Benchmark Results&lt;a class="headerlink" href="#improved-tpc-ds-benchmark-results" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;TPC-DS performance improved significantly compared to the 0.15.0 release, and Comet is now very close to 2x faster than Spark.&lt;/p&gt;
&lt;p&gt;&lt;img alt="TPC-DS Overall Performance" class="img-fluid" src="/blog/images/comet-0.16.0/tpcds_speedup.png" width="100%"/&gt;&lt;/p&gt;
&lt;p&gt;See the &lt;a href="https://datafusion.apache.org/comet/contributor-guide/benchmarking.html"&gt;Comet Benchmarking Guide&lt;/a&gt; for more details about these benchmark results.&lt;/p&gt;
&lt;h2 id="other-key-features"&gt;Other Key Features&lt;a class="headerlink" href="#other-key-features" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id="hash-join-improvements"&gt;Hash Join Improvements&lt;a class="headerlink" href="#hash-join-improvements" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;BuildRight&lt;/code&gt; + &lt;code&gt;LeftAnti&lt;/code&gt;&lt;/strong&gt; (&lt;a href="https://github.com/apache/datafusion-comet/pull/4073"&gt;#4073&lt;/a&gt;): Regular hash
  joins now support the &lt;code&gt;BuildRight&lt;/code&gt; + &lt;code&gt;LeftAnti&lt;/code&gt; combination, eliminating a common fallback path. Tests
  previously gated on &lt;code&gt;InjectRuntimeFilterSuite&lt;/code&gt; issues have been re-enabled.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="aggregation"&gt;Aggregation&lt;a class="headerlink" href="#aggregation" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;PartialMerge&lt;/code&gt; aggregation mode&lt;/strong&gt; (&lt;a href="https://github.com/apache/datafusion-comet/pull/4003"&gt;#4003&lt;/a&gt;): The
  &lt;code&gt;PartialMerge&lt;/code&gt; mode is now executed natively, allowing more multi-stage aggregation plans to remain in
  Comet without falling back to Spark.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;collect_set&lt;/code&gt;&lt;/strong&gt; (&lt;a href="https://github.com/apache/datafusion-comet/pull/3954"&gt;#3954&lt;/a&gt;): Native support for the
  &lt;code&gt;collect_set&lt;/code&gt; aggregate.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="new-expression-support"&gt;New Expression Support&lt;a class="headerlink" href="#new-expression-support" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;This release adds native support for the following Spark expressions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Math&lt;/strong&gt;: &lt;code&gt;Pi&lt;/code&gt;, &lt;code&gt;Cbrt&lt;/code&gt;, &lt;code&gt;Acosh&lt;/code&gt;, &lt;code&gt;Asinh&lt;/code&gt;, &lt;code&gt;Atanh&lt;/code&gt;, &lt;code&gt;ToDegrees&lt;/code&gt;, &lt;code&gt;ToRadians&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Date/time&lt;/strong&gt;: &lt;code&gt;timestamp_seconds&lt;/code&gt;, &lt;code&gt;unix_timestamp&lt;/code&gt; with &lt;code&gt;TimestampNTZType&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;String / URL&lt;/strong&gt;: &lt;code&gt;url_encode&lt;/code&gt;, &lt;code&gt;url_decode&lt;/code&gt;, &lt;code&gt;try_url_decode&lt;/code&gt;, &lt;code&gt;str_to_map&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Array / map&lt;/strong&gt;: &lt;code&gt;arrays_zip&lt;/code&gt;, &lt;code&gt;array_position&lt;/code&gt;, &lt;code&gt;array_union&lt;/code&gt;, &lt;code&gt;array_distinct&lt;/code&gt;, &lt;code&gt;arrays_overlap&lt;/code&gt;,
  &lt;code&gt;MapSort&lt;/code&gt; (Spark 4.0)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cast&lt;/strong&gt;: to and from &lt;code&gt;timestamp_ntz&lt;/code&gt;, including string to &lt;code&gt;timestamp_ntz&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;code&gt;array_insert&lt;/code&gt; and &lt;code&gt;array_compact&lt;/code&gt; have been audited and promoted to &lt;code&gt;Compatible&lt;/code&gt;.&lt;/p&gt;
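&lt;p&gt;A few of the newly supported expressions, shown in a hypothetical &lt;code&gt;spark-sql&lt;/code&gt; session (assumes Comet is enabled so these evaluate natively):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;spark-sql -e "SELECT url_encode('a b c')"
spark-sql -e "SELECT arrays_zip(array(1, 2), array('x', 'y'))"
spark-sql -e "SELECT array_position(array(10, 20, 30), 20)"
&lt;/code&gt;&lt;/pre&gt;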
&lt;h3 id="object-storage"&gt;Object Storage&lt;a class="headerlink" href="#object-storage" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;OpenDAL 0.56.0&lt;/strong&gt;: Picks up the latest OpenDAL release, including upstream object-store fixes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Profile credential chain&lt;/strong&gt;: &lt;code&gt;ProfileCredentialsProvider&lt;/code&gt; is now mapped to the AWS profile credential
  chain, matching the credential resolution behavior users expect.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="native-scan-improvements"&gt;Native Scan Improvements&lt;a class="headerlink" href="#native-scan-improvements" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Parquet field ID matching&lt;/strong&gt;: The &lt;code&gt;native_datafusion&lt;/code&gt; scan now supports field-ID-based column resolution,
  matching Spark's behavior for files written with field IDs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schema-mismatch errors&lt;/strong&gt;: &lt;code&gt;native_datafusion&lt;/code&gt; now throws &lt;code&gt;SchemaColumnConvertNotSupportedException&lt;/code&gt; on
  schema mismatch, allowing Spark's standard error handling to engage.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stricter type validation&lt;/strong&gt;: The &lt;code&gt;native_datafusion&lt;/code&gt; scan now detects incompatible decimal precision/scale
  and string/binary columns read as numeric, and delegates these reads to Spark.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="metrics-and-observability"&gt;Metrics and Observability&lt;a class="headerlink" href="#metrics-and-observability" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Spark UI task output metrics&lt;/strong&gt;: Native execution now reports task output metrics through the standard
  Spark UI path.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Iceberg input metrics&lt;/strong&gt;: Task-level &lt;code&gt;bytesRead&lt;/code&gt; is now reported for the Iceberg native scan, matching
  Comet's native Parquet scan.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Shuffle encode time&lt;/strong&gt;: Shuffle operations now track encode time as a separate metric, making it easier
  to attribute shuffle cost.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="stability-and-correctness"&gt;Stability and Correctness&lt;a class="headerlink" href="#stability-and-correctness" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Substring with negative start index&lt;/strong&gt;: Fixed a Spark incompatibility in &lt;code&gt;substring&lt;/code&gt; for negative start indices.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Strict floating-point comparison&lt;/strong&gt;: &lt;code&gt;RangePartitioning&lt;/code&gt; now honors &lt;code&gt;strictFloatingPoint&lt;/code&gt;, ensuring NaN
  and ±0.0 are partitioned consistently with Spark.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Broadcast / AQE coalescing&lt;/strong&gt;: Broadcast exchanges now bypass AQE partition coalescing, fixing plans that
  could otherwise be coalesced into invalid shapes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;JNI&lt;/strong&gt;: JNI local frame management has been hardened with explicit error handling.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Shuffle fallback logic&lt;/strong&gt;: Shuffle fallback decisions have been improved, with a new config to gate
  conversion of Spark shuffle to Comet shuffle when the child plan is non-Comet, and a fix to avoid
  redundant columnar shuffle when both parent and child are non-Comet.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="compatibility"&gt;Compatibility&lt;a class="headerlink" href="#compatibility" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Supported platforms include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Spark 3.4.3&lt;/strong&gt; with Java 11/17 and Scala 2.12/2.13&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Spark 3.5.8&lt;/strong&gt; with Java 11/17 and Scala 2.12/2.13&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Spark 4.0.2&lt;/strong&gt; with Java 17 and Scala 2.13&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Spark 4.1.1&lt;/strong&gt; with Java 17 and Scala 2.13&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;See the &lt;a href="https://datafusion.apache.org/comet/user-guide/latest/compatibility/spark-versions.html"&gt;Spark Version Compatibility&lt;/a&gt; page for known limitations specific to each version.&lt;/p&gt;
&lt;p&gt;This release continues to build on &lt;strong&gt;DataFusion 53.1&lt;/strong&gt; and &lt;strong&gt;Arrow 58.1&lt;/strong&gt;.&lt;/p&gt;
&lt;h2 id="get-started-with-comet-0160"&gt;Get Started with Comet 0.16.0&lt;a class="headerlink" href="#get-started-with-comet-0160" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Ready to try it out? Follow the &lt;a href="https://datafusion.apache.org/comet/user-guide/0.16/installation.html"&gt;Comet 0.16.0 Installation Guide&lt;/a&gt;
to get up and running, then point Comet at your existing Spark workloads — including Spark 4 with ANSI mode
enabled — and see the speedup for yourself.&lt;/p&gt;</content><category term="blog"/></entry><entry><title>Apache DataFusion Comet 0.15.0 Release</title><link href="https://datafusion.apache.org/blog/2026/04/18/datafusion-comet-0.15.0" rel="alternate"/><published>2026-04-18T00:00:00+00:00</published><updated>2026-04-18T00:00:00+00:00</updated><author><name>pmc</name></author><id>tag:datafusion.apache.org,2026-04-18:/blog/2026/04/18/datafusion-comet-0.15.0</id><summary type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;p&gt;The Apache DataFusion PMC is pleased to announce version 0.15.0 of the &lt;a href="https://datafusion.apache.org/comet/"&gt;Comet&lt;/a&gt; subproject.&lt;/p&gt;
&lt;p&gt;Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
improved performance and efficiency without requiring any code changes.&lt;/p&gt;
&lt;p&gt;This release covers approximately four weeks of development …&lt;/p&gt;</summary><content type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;p&gt;The Apache DataFusion PMC is pleased to announce version 0.15.0 of the &lt;a href="https://datafusion.apache.org/comet/"&gt;Comet&lt;/a&gt; subproject.&lt;/p&gt;
&lt;p&gt;Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
improved performance and efficiency without requiring any code changes.&lt;/p&gt;
&lt;p&gt;This release covers approximately four weeks of development work and is the result of merging 142 PRs from 19
contributors. See the &lt;a href="https://github.com/apache/datafusion-comet/blob/main/dev/changelog/0.15.0.md"&gt;change log&lt;/a&gt; for more information.&lt;/p&gt;
&lt;h2 id="performance"&gt;Performance&lt;a class="headerlink" href="#performance" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Comet 0.15.0 provides a 2x speedup for TPC-H @ SF1000 (1TB), resulting in 50% cost savings.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;That 2x speedup gives you a choice: finish the same Spark workload in half the time on the cluster you already
have, or match your current Spark performance on roughly half the resources. Either way, the gain translates
directly into lower cloud bills, reduced on-prem capacity, and lower energy usage, with no changes to your
existing Spark SQL, DataFrame, or PySpark code. Comet runs on commodity hardware: no GPUs, FPGAs, or other
specialized accelerators are required, so the savings come from better utilization of the infrastructure you
already run on.&lt;/p&gt;
&lt;p&gt;&lt;img alt="TPC-H Overall Performance" class="img-fluid" src="/blog/images/comet-0.15.0/tpch_allqueries.png" width="100%"/&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="TPC-H Query-by-Query Comparison" class="img-fluid" src="/blog/images/comet-0.15.0/tpch_queries_compare.png" width="100%"/&gt;&lt;/p&gt;
&lt;p&gt;See the &lt;a href="https://datafusion.apache.org/comet/contributor-guide/benchmarking.html"&gt;Comet Benchmarking Guide&lt;/a&gt; for
more details.&lt;/p&gt;
&lt;p&gt;Performance was a major theme of this release, with a series of targeted optimizations across the shuffle, scan,
and execution layers.&lt;/p&gt;
&lt;h3 id="reducing-jvmnative-boundary-overhead"&gt;Reducing JVM/Native Boundary Overhead&lt;a class="headerlink" href="#reducing-jvmnative-boundary-overhead" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Several changes in this release target the cost of crossing between the JVM and native sides, which can dominate
execution time in shuffle- and broadcast-heavy workloads:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Shuffle read path&lt;/strong&gt;: The native shuffle reader no longer uses FFI on the read side, removing a per-batch cost
  that was particularly visible in shuffle-heavy queries.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Broadcast exchanges&lt;/strong&gt;: Batches are now coalesced before broadcasting, reducing the number of small batches
  crossing the JVM/native boundary.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;FFI-safe operators&lt;/strong&gt;: More operators are marked as FFI-safe, avoiding unnecessary deep copies when crossing
  the JVM/native boundary.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="expanded-native-execution-coverage"&gt;Expanded Native Execution Coverage&lt;a class="headerlink" href="#expanded-native-execution-coverage" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Columnar-to-row (C2R)&lt;/strong&gt;: Native C2R conversion is now exercised for a broader set of query shapes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;auto&lt;/code&gt; scan mode&lt;/strong&gt;: The &lt;code&gt;auto&lt;/code&gt; scan mode now enables the &lt;code&gt;native_datafusion&lt;/code&gt; scan where supported, giving
  users the benefits of the native Parquet reader without having to explicitly opt in. This is part of the
  ongoing effort to make &lt;code&gt;native_datafusion&lt;/code&gt; the default Parquet path once the deprecation of
  &lt;code&gt;native_iceberg_compat&lt;/code&gt; completes.&lt;/li&gt;
&lt;/ul&gt;
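&lt;p&gt;The scan implementation remains configurable for users who want to pin a specific path. The configuration key below is as documented in the Comet user guide and is shown here as an assumption about your deployment; check the guide for your version:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Let Comet pick the best scan implementation (new default behavior)
--conf spark.comet.scan.impl=auto

# Or pin the DataFusion-based native Parquet scan explicitly
--conf spark.comet.scan.impl=native_datafusion
&lt;/code&gt;&lt;/pre&gt;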
&lt;h3 id="memory-management"&gt;Memory Management&lt;a class="headerlink" href="#memory-management" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Shared memory pools&lt;/strong&gt;: Unified memory pools are now shared across native execution contexts within a Spark
  task, improving memory accounting and reducing OOMs.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="object-storage-io"&gt;Object Storage I/O&lt;a class="headerlink" href="#object-storage-io" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Object store caching&lt;/strong&gt;: Object stores and bucket region lookups are cached, dramatically reducing DNS query
  volume on workloads that open many files.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;get_ranges&lt;/code&gt; performance&lt;/strong&gt;: Picked up an upstream &lt;code&gt;opendal&lt;/code&gt; fix that restores fast range reads from object
  storage.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Together, these changes reduce CPU and memory overhead for shuffle-heavy, broadcast-heavy, and
object-storage-bound workloads.&lt;/p&gt;
&lt;h2 id="native-iceberg-reader-enabled-by-default"&gt;Native Iceberg Reader Enabled by Default&lt;a class="headerlink" href="#native-iceberg-reader-enabled-by-default" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;This release marks a major milestone for Iceberg users: &lt;strong&gt;Comet's fully-native Iceberg reader is now enabled by
default&lt;/strong&gt;. Workloads that read Iceberg tables will automatically benefit from native Rust-based scans built on
iceberg-rust, with no additional configuration required.&lt;/p&gt;
&lt;p&gt;To support this change, the release bundles a broad set of Iceberg-focused improvements:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dynamic Partition Pruning (DPP)&lt;/strong&gt;: The native Iceberg reader supports DPP, allowing partition filters
  derived at runtime to prune Iceberg file scans and substantially reduce I/O for star-schema-style queries.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Correct classloader handling&lt;/strong&gt;: Iceberg classes are now loaded via the thread context classloader, resolving
  class-loading issues in environments where the executor classloader differs from the application classloader.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Continuous Iceberg CI&lt;/strong&gt;: Iceberg Spark integration tests now run on every PR and push to &lt;code&gt;main&lt;/code&gt;, providing
  continuous validation of the native Iceberg code path. Test diffs for Spark 3.4 were updated to keep the matrix
  green across supported Spark versions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;iceberg-rust upgrade&lt;/strong&gt;: Comet picks up the latest iceberg-rust, pulling in fixes for Parquet reader edge cases
  discovered in earlier testing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Refreshed documentation&lt;/strong&gt;: The Iceberg user guide has been rewritten to reflect current capabilities, and the
  contributor guide now documents how to run the Iceberg Spark test suites locally.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Users who need to fall back to the previous behavior can still opt out, but we encourage the community to exercise
the native reader and report any issues.&lt;/p&gt;
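&lt;p&gt;For example, opting out might look like the following in &lt;code&gt;spark-defaults.conf&lt;/code&gt; (the key name here is hypothetical; consult the Comet configuration reference for the exact setting):&lt;/p&gt;

```properties
# Hypothetical key name; check the Comet 0.16.0 configuration reference
# for the exact setting before relying on it.
spark.comet.scan.icebergNative.enabled=false
```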
&lt;h3 id="sort-merge-join-performance"&gt;Sort-Merge Join Performance&lt;a class="headerlink" href="#sort-merge-join-performance" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Comet relies heavily on sort-merge join (SMJ) because DataFusion's hash joins do not yet support spilling to
disk. For larger-than-memory joins, SMJ is the only viable path, making its performance critical for real-world
workloads at scale.&lt;/p&gt;
&lt;p&gt;DataFusion 53 includes several SMJ improvements that Comet 0.16.0 benefits from directly:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Zero-copy slicing&lt;/strong&gt; instead of the take kernel (&lt;a href="https://github.com/apache/datafusion/pull/20463"&gt;datafusion#20463&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Streaming output&lt;/strong&gt; instead of waiting for all input before emitting (&lt;a href="https://github.com/apache/datafusion/pull/20482"&gt;datafusion#20482&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cached row counts&lt;/strong&gt; to avoid O(n) recounting (&lt;a href="https://github.com/apache/datafusion/pull/20478"&gt;datafusion#20478&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Additional SMJ work is landing in upstream DataFusion and will arrive in a future Comet release:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Specialized semi/anti join stream (&lt;a href="https://github.com/apache/datafusion/pull/20806"&gt;datafusion#20806&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Batch deferred filtering with 20–50x improvements for near-unique LEFT and FULL joins (&lt;a href="https://github.com/apache/datafusion/pull/21184"&gt;datafusion#21184&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;DynComparator for ~5% TPC-H improvement (&lt;a href="https://github.com/apache/datafusion/pull/21484"&gt;datafusion#21484&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Vec-based filter state replacing HashMap (&lt;a href="https://github.com/apache/datafusion/pull/21517"&gt;datafusion#21517&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Full outer join correctness fix for NULL filter results (&lt;a href="https://github.com/apache/datafusion/pull/21660"&gt;datafusion#21660&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With these performance improvements, the next release of Comet will enable SMJ with filters by default.&lt;/p&gt;
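&lt;p&gt;For readers unfamiliar with SMJ: both inputs are sorted on the join key and advanced in lockstep, which keeps memory bounded. A minimal Python sketch of the algorithm (an illustration of the technique, not DataFusion's implementation):&lt;/p&gt;

```python
def sort_merge_join(left, right):
    """Inner-join two (key, value) lists already sorted by key, yielding
    matches incrementally rather than materializing all output at once."""
    i = j = 0
    while i != len(left) and j != len(right):
        lk, lv = left[i]
        rk, rv = right[j]
        if lk == rk:
            # Emit all right-side rows matching this left key.
            k = j
            while k != len(right) and right[k][0] == lk:
                yield (lk, lv, right[k][1])
                k += 1
            i += 1
        elif rk > lk:
            i += 1
        else:
            j += 1

left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right = [(2, "x"), (3, "y"), (4, "z")]
assert list(sort_merge_join(left, right)) == [
    (2, "b", "x"), (2, "c", "x"), (4, "d", "z")
]
```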
&lt;h2 id="other-key-features"&gt;Other Key Features&lt;a class="headerlink" href="#other-key-features" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id="new-expressions-and-function-support"&gt;New Expressions and Function Support&lt;a class="headerlink" href="#new-expressions-and-function-support" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;This release adds support for the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Date/time functions&lt;/strong&gt;: &lt;code&gt;days&lt;/code&gt;, &lt;code&gt;hours&lt;/code&gt;, &lt;code&gt;date_from_unix_date&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;String/JSON functions&lt;/strong&gt;: native &lt;code&gt;get_json_object&lt;/code&gt; with improved performance over the fallback path&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hash/math functions&lt;/strong&gt;: &lt;code&gt;bin&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Array functions&lt;/strong&gt;: &lt;code&gt;sort_array&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Window functions&lt;/strong&gt;: &lt;code&gt;LEAD&lt;/code&gt; and &lt;code&gt;LAG&lt;/code&gt; with &lt;code&gt;IGNORE NULLS&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Aggregates&lt;/strong&gt;: SQL &lt;code&gt;FILTER (WHERE ...)&lt;/code&gt; clauses now execute natively; &lt;code&gt;Corr&lt;/code&gt; aggregate enabled&lt;/li&gt;
&lt;/ul&gt;
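&lt;p&gt;As an illustration of the &lt;code&gt;IGNORE NULLS&lt;/code&gt; semantics (not Comet's implementation): &lt;code&gt;LAG&lt;/code&gt; with &lt;code&gt;IGNORE NULLS&lt;/code&gt; returns the closest preceding non-null value, which can be shown in plain Python:&lt;/p&gt;

```python
def lag_ignore_nulls(values):
    """For each row, return the closest preceding non-null value (else None),
    matching LAG(col) IGNORE NULLS OVER (ORDER BY ...) semantics."""
    out, last_non_null = [], None
    for v in values:
        out.append(last_non_null)
        if v is not None:
            last_non_null = v
    return out

assert lag_ignore_nulls([1, None, None, 4, None]) == [None, 1, 1, 1, 4]
```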
&lt;h3 id="expanded-metrics-and-observability"&gt;Expanded Metrics and Observability&lt;a class="headerlink" href="#expanded-metrics-and-observability" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Comet metrics can now be exposed through Spark's external monitoring system, making it easier to integrate Comet
execution statistics with existing observability dashboards. Native DataFusion scans also now report accurate
&lt;code&gt;filesScanned&lt;/code&gt; and &lt;code&gt;bytesScanned&lt;/code&gt; input metrics, matching Spark's native Parquet scan reporting.&lt;/p&gt;
&lt;h2 id="stability-and-correctness"&gt;Stability and Correctness&lt;a class="headerlink" href="#stability-and-correctness" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;A significant portion of this release is dedicated to stability and Spark compatibility. Highlights include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Cast string to timestamp&lt;/strong&gt;: Multiple fixes for UTC timestamps, timezone handling, special formats
  (&lt;code&gt;epoch&lt;/code&gt;, &lt;code&gt;now&lt;/code&gt;, etc.), and compatibility with Spark's semantics.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cast decimal to string&lt;/strong&gt;: Added legacy mode handling to match Spark's output formatting.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;String to decimal&lt;/strong&gt;: Support for full-width characters, null characters, and negative scale.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Decimal arithmetic&lt;/strong&gt;: Fixes for decimal division and additional test coverage for ANSI overflow handling,
  including scalar decimal overflow.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Array expressions&lt;/strong&gt;: Corrected &lt;code&gt;GetArrayItem&lt;/code&gt; null handling for dynamic indices; &lt;code&gt;array_append&lt;/code&gt; return type
  fixed and marked &lt;code&gt;Compatible&lt;/code&gt;; audited &lt;code&gt;array_insert&lt;/code&gt; for correctness; &lt;code&gt;array_compact&lt;/code&gt; marked &lt;code&gt;Compatible&lt;/code&gt;;
  array-to-array cast enabled.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DateTrunc/TimestampTrunc&lt;/strong&gt;: Fixed native crashes when the input is a literal.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ambiguous local times&lt;/strong&gt;: Correct handling of ambiguous and non-existent local times across DST transitions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Case-insensitive Parquet fields&lt;/strong&gt;: &lt;code&gt;native_datafusion&lt;/code&gt; now correctly detects duplicate/ambiguous fields in
  case-insensitive mode and falls back where appropriate.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Shuffle planning&lt;/strong&gt;: Shuffle fallback decisions are now "sticky" across planning passes, and Comet columnar
  shuffle is skipped for stages containing DPP scans to avoid mismatched partitioning.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Error propagation&lt;/strong&gt;: Native error messages are now propagated through &lt;code&gt;SparkException&lt;/code&gt; even when the
  &lt;code&gt;errorClass&lt;/code&gt; is empty, and file-not-found errors flow through the standard Spark error JSON path.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Trigonometric compatibility&lt;/strong&gt;: &lt;code&gt;tan&lt;/code&gt; and &lt;code&gt;atan2&lt;/code&gt; are now Spark-compatible.&lt;/li&gt;
&lt;/ul&gt;
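&lt;p&gt;To illustrate the special timestamp strings mentioned above: Spark accepts literals such as &lt;code&gt;epoch&lt;/code&gt; and &lt;code&gt;now&lt;/code&gt; alongside regular formats. A simplified Python sketch of that dispatch (illustrative only, not Comet's parser):&lt;/p&gt;

```python
from datetime import datetime, timezone

def parse_spark_timestamp(s):
    """Simplified sketch: handle a couple of Spark's special timestamp
    strings before falling back to one regular format."""
    s = s.strip().lower()
    if s == "epoch":
        return datetime(1970, 1, 1, tzinfo=timezone.utc)
    if s == "now":
        return datetime.now(timezone.utc)
    return datetime.strptime(s, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)

assert parse_spark_timestamp("epoch").timestamp() == 0.0
assert parse_spark_timestamp("1970-01-01 00:00:01").timestamp() == 1.0
```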
&lt;h2 id="dependency-upgrades"&gt;Dependency Upgrades&lt;a class="headerlink" href="#dependency-upgrades" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;This release upgrades to &lt;strong&gt;DataFusion 53.1&lt;/strong&gt; and &lt;strong&gt;Arrow 58.1&lt;/strong&gt;, and picks up the latest &lt;code&gt;iceberg-rust&lt;/code&gt; release
with additional reader fixes. The &lt;code&gt;jni&lt;/code&gt; crate was upgraded to 0.22.4.&lt;/p&gt;
&lt;h2 id="deprecations-and-removals"&gt;Deprecations and Removals&lt;a class="headerlink" href="#deprecations-and-removals" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;SupportsComet&lt;/code&gt; interface has been removed, along with the Java-based Iceberg integration path (which is
fully superseded by the native Iceberg reader). See &lt;a href="https://github.com/apache/datafusion-comet/issues/2921"&gt;comet#2921&lt;/a&gt;
for background on the decision to standardize on the native iceberg-rust integration. The &lt;code&gt;native_iceberg_compat&lt;/code&gt;
scan remains deprecated and is expected to be removed in a future release in favor of &lt;code&gt;native_datafusion&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id="compatibility"&gt;Compatibility&lt;a class="headerlink" href="#compatibility" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Supported platforms include Spark 3.4.3, 3.5.4–3.5.8, and Spark 4.0.x with various JDK and Scala combinations.&lt;/p&gt;
&lt;p&gt;The community encourages users to test Comet with existing Spark and Iceberg workloads and welcomes contributions
to ongoing development.&lt;/p&gt;
&lt;h2 id="get-started-with-comet-0150"&gt;Get Started with Comet 0.15.0&lt;a class="headerlink" href="#get-started-with-comet-0150" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Ready to try it out? Follow the &lt;a href="https://datafusion.apache.org/comet/user-guide/0.15/installation.html"&gt;Comet 0.15.0 Installation Guide&lt;/a&gt;
to get up and running, then point Comet at your existing Spark workloads and see the speedup for yourself.&lt;/p&gt;</content><category term="blog"/></entry><entry><title>Apache DataFusion 53.0.0 Released</title><link href="https://datafusion.apache.org/blog/2026/04/02/datafusion-53.0.0" rel="alternate"/><published>2026-04-02T00:00:00+00:00</published><updated>2026-04-02T00:00:00+00:00</updated><author><name>pmc</name></author><id>tag:datafusion.apache.org,2026-04-02:/blog/2026/04/02/datafusion-53.0.0</id><summary type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;p&gt;We are proud to announce the release of &lt;a href="https://crates.io/crates/datafusion/53.0.0"&gt;DataFusion 53.0.0&lt;/a&gt;. This post highlights
some of the major improvements since &lt;a href="https://datafusion.apache.org/blog/2026/01/12/datafusion-52.0.0/"&gt;DataFusion 52.0.0&lt;/a&gt;. The complete list of
changes is available in the &lt;a href="https://github.com/apache/datafusion/blob/branch-53/dev/changelog/53.0.0.md"&gt;changelog&lt;/a&gt;. Thanks to the &lt;a href="https://github.com/apache/datafusion/blob/branch-53/dev/changelog/53.0.0.md#credits"&gt;114 contributors&lt;/a&gt; for
making this release possible.&lt;/p&gt;
&lt;h2 id="performance-improvements"&gt;Performance Improvements 🚀&lt;a class="headerlink" href="#performance-improvements" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;img alt="Performance over time" class="img-fluid" src="/blog/images/datafusion-53.0.0/performance_over_time_clickbench.png" width="100%"/&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Figure 1&lt;/strong&gt;: Average …&lt;/p&gt;</summary><content type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;p&gt;We are proud to announce the release of &lt;a href="https://crates.io/crates/datafusion/53.0.0"&gt;DataFusion 53.0.0&lt;/a&gt;. This post highlights
some of the major improvements since &lt;a href="https://datafusion.apache.org/blog/2026/01/12/datafusion-52.0.0/"&gt;DataFusion 52.0.0&lt;/a&gt;. The complete list of
changes is available in the &lt;a href="https://github.com/apache/datafusion/blob/branch-53/dev/changelog/53.0.0.md"&gt;changelog&lt;/a&gt;. Thanks to the &lt;a href="https://github.com/apache/datafusion/blob/branch-53/dev/changelog/53.0.0.md#credits"&gt;114 contributors&lt;/a&gt; for
making this release possible.&lt;/p&gt;
&lt;h2 id="performance-improvements"&gt;Performance Improvements 🚀&lt;a class="headerlink" href="#performance-improvements" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;img alt="Performance over time" class="img-fluid" src="/blog/images/datafusion-53.0.0/performance_over_time_clickbench.png" width="100%"/&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Figure 1&lt;/strong&gt;: Average and median normalized execution times for DataFusion 53.0.0 on ClickBench queries, compared to previous releases.
Query times are normalized using the ClickBench definition. See the
&lt;a href="https://alamb.github.io/datafusion-benchmarking/"&gt;DataFusion Benchmarking Page&lt;/a&gt;
for more details.&lt;/p&gt;
&lt;p&gt;DataFusion 53 continues the project-wide focus on performance. This release
reduces planning overhead, skips more unnecessary I/O, and pushes more work
into earlier and cheaper stages of execution.&lt;/p&gt;
&lt;h3 id="limit-aware-parquet-row-group-pruning"&gt;&lt;code&gt;LIMIT&lt;/code&gt;-Aware Parquet Row Group Pruning&lt;a class="headerlink" href="#limit-aware-parquet-row-group-pruning" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;DataFusion 53 includes a new optimization that makes Parquet pruning aware of
&lt;code&gt;LIMIT&lt;/code&gt;. This optimization is described in full in the &lt;a href="https://datafusion.apache.org/blog/2026/03/20/limit-pruning/"&gt;limit pruning blog post&lt;/a&gt;. If
DataFusion can prove that an entire row group matches the predicate, and those
fully matching row groups contain enough rows to satisfy the &lt;code&gt;LIMIT&lt;/code&gt;, partially
matching row groups are skipped entirely.&lt;/p&gt;
&lt;figure&gt;
&lt;img alt="Pruning pipeline with limit pruning highlighted" class="img-fluid" src="/blog/images/limit-pruning/pruning-pipeline.svg" width="80%"/&gt;
&lt;figcaption&gt;&lt;b&gt;Figure 2&lt;/b&gt;: Limit pruning is inserted between row group and page index pruning.&lt;/figcaption&gt;
&lt;/figure&gt;
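&lt;p&gt;The decision rule can be sketched in a few lines of Python (an illustration under simplified assumptions, not DataFusion's code):&lt;/p&gt;

```python
def prune_with_limit(row_groups, limit):
    """Given per-row-group pruning results, skip partially matching groups
    once fully matching groups already cover the LIMIT (illustrative sketch).

    Each row group is (match, row_count) where match is "full", "partial",
    or "none".
    """
    full_rows = sum(n for m, n in row_groups if m == "full")
    if full_rows >= limit:
        # Fully matching groups alone satisfy the LIMIT: read only those.
        return [i for i, (m, _) in enumerate(row_groups) if m == "full"]
    # Otherwise partially matching groups must still be scanned.
    return [i for i, (m, _) in enumerate(row_groups) if m != "none"]

groups = [("full", 500), ("partial", 1000), ("full", 600), ("none", 800)]
assert prune_with_limit(groups, limit=1000) == [0, 2]  # 1100 full rows suffice
assert prune_with_limit(groups, limit=2000) == [0, 1, 2]
```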
&lt;p&gt;Thanks to &lt;a href="https://github.com/xudong963"&gt;@xudong963&lt;/a&gt; for implementing this feature. Related PRs: &lt;a href="https://github.com/apache/datafusion/pull/18868"&gt;#18868&lt;/a&gt;&lt;/p&gt;
&lt;h3 id="improved-filter-pushdown"&gt;Improved Filter Pushdown&lt;a class="headerlink" href="#improved-filter-pushdown" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;DataFusion 53 pushes filters down through more join types and through &lt;code&gt;UnionExec&lt;/code&gt;,
and expands support for pushing down &lt;a href="https://datafusion.apache.org/blog/2025/09/10/dynamic-filters"&gt;dynamic filters&lt;/a&gt;. More
pushdown means fewer rows flow into joins, repartitions, and later operators,
which reduces CPU, memory, and I/O.&lt;/p&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;SELECT *
FROM (
    SELECT *
    FROM t1
    LEFT ANTI JOIN t2 ON t1.k = t2.k
) a
JOIN t1 b ON a.k = b.k
WHERE b.v = 1;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now DataFusion can often transform the physical plan so filters and
&lt;a href="https://datafusion.apache.org/blog/2025/09/10/dynamic-filters"&gt;dynamic filters&lt;/a&gt; are pushed deeper into the plan, even through subqueries and
nested joins. In this example, the filter on &lt;code&gt;b.v&lt;/code&gt; helps produce dynamic filters
that can be pushed into both sides of the nested anti join.&lt;/p&gt;
&lt;figure&gt;
&lt;img alt="Before and after diagram of dynamic filter pushdown through a subquery with nested joins" class="img-fluid" src="/blog/images/datafusion-53.0.0/join-filter-pushdown.svg" width="100%"/&gt;
&lt;figcaption&gt;&lt;b&gt;Figure 3&lt;/b&gt;: DataFusion 53 pushes dynamic filters through subqueries and into both sides of nested joins.&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Thanks to &lt;a href="https://github.com/nuno-faria"&gt;@nuno-faria&lt;/a&gt;, &lt;a href="https://github.com/haohuaijin"&gt;@haohuaijin&lt;/a&gt;, and &lt;a href="https://github.com/jackkleeman"&gt;@jackkleeman&lt;/a&gt; for
driving this work. Related PRs: &lt;a href="https://github.com/apache/datafusion/pull/19918"&gt;#19918&lt;/a&gt;, &lt;a href="https://github.com/apache/datafusion/pull/20145"&gt;#20145&lt;/a&gt;, &lt;a href="https://github.com/apache/datafusion/pull/20192"&gt;#20192&lt;/a&gt;&lt;/p&gt;
&lt;h3 id="faster-query-planning"&gt;Faster Query Planning&lt;a class="headerlink" href="#faster-query-planning" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;DataFusion 53 improves query planning performance by making immutable pieces of
execution plans cheaper to clone. This helps applications that need extremely
low latency, plan many or complex queries, or use prepared statements or
parameterized queries. In some benchmarks, overall execution time drops from
roughly 4–5 ms to about 100 µs.&lt;/p&gt;
&lt;p&gt;Thanks to &lt;a href="https://github.com/askalt"&gt;@askalt&lt;/a&gt; for leading this work. Related PRs: &lt;a href="https://github.com/apache/datafusion/pull/19792"&gt;#19792&lt;/a&gt;, &lt;a href="https://github.com/apache/datafusion/pull/19893"&gt;#19893&lt;/a&gt;&lt;/p&gt;
&lt;h3 id="faster-functions"&gt;Faster Functions&lt;a class="headerlink" href="#faster-functions" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;DataFusion includes &lt;a href="https://datafusion.apache.org/user-guide/sql/scalar_functions.html"&gt;235 built-in functions&lt;/a&gt;. Improving the performance of these
functions benefits a wide range of workloads. This release improves the performance of 42 of those
functions, such as &lt;a href="https://github.com/apache/datafusion/pull/20295"&gt;strpos&lt;/a&gt;, &lt;a href="https://github.com/apache/datafusion/pull/20344"&gt;replace&lt;/a&gt;, &lt;a href="https://github.com/apache/datafusion/pull/20317"&gt;concat&lt;/a&gt;, &lt;a href="https://github.com/apache/datafusion/pull/20305"&gt;translate&lt;/a&gt;, &lt;a href="https://github.com/apache/datafusion/pull/20374"&gt;array_has&lt;/a&gt;,
&lt;a href="https://github.com/apache/datafusion/pull/20504"&gt;array_agg&lt;/a&gt;, &lt;a href="https://github.com/apache/datafusion/pull/19980"&gt;left&lt;/a&gt;, &lt;a href="https://github.com/apache/datafusion/pull/20069"&gt;right&lt;/a&gt;, and &lt;a href="https://github.com/apache/datafusion/pull/20097"&gt;case_when&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Thanks to the contributors who drove this work, especially &lt;a href="https://github.com/neilconway"&gt;@neilconway&lt;/a&gt;,
&lt;a href="https://github.com/theirix"&gt;@theirix&lt;/a&gt;, &lt;a href="https://github.com/lyne7-sc"&gt;@lyne7-sc&lt;/a&gt;, &lt;a href="https://github.com/kumarUjjawal"&gt;@kumarUjjawal&lt;/a&gt;, &lt;a href="https://github.com/pepijnve"&gt;@pepijnve&lt;/a&gt;, &lt;a href="https://github.com/zhangxffff"&gt;@zhangxffff&lt;/a&gt;, and
&lt;a href="https://github.com/UBarney"&gt;@UBarney&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="nested-field-pushdown"&gt;Nested Field Pushdown&lt;a class="headerlink" href="#nested-field-pushdown" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;DataFusion 53 pushes expressions such as &lt;code&gt;get_field&lt;/code&gt; down the plan and into data
sources. This is especially important for nested data such as structs in
Parquet files. Instead of reading an entire struct column and then extracting
the field of interest, DataFusion 53 pushes the field extraction into the scan.&lt;/p&gt;
&lt;p&gt;For example, the following query reads a struct column &lt;code&gt;s&lt;/code&gt; and extracts the
&lt;code&gt;label&lt;/code&gt; field for rows where the &lt;code&gt;value&lt;/code&gt; field is greater than 150:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;SELECT id, s['label']
FROM t
WHERE s['value'] &amp;gt; 150;
&lt;/code&gt;&lt;/pre&gt;
&lt;figure&gt;
&lt;img alt="Before and after diagram of field access pushdown into a data source" class="img-fluid" src="/blog/images/datafusion-53.0.0/field-access-pushdown.svg" width="80%"/&gt;
&lt;figcaption&gt;&lt;b&gt;Figure 4&lt;/b&gt;: DataFusion 53 pushes field-access expressions closer to the scan.&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Special thanks to &lt;a href="https://github.com/adriangb"&gt;@adriangb&lt;/a&gt; for designing and implementing this optimizer
work. Related PRs: &lt;a href="https://github.com/apache/datafusion/pull/20065"&gt;#20065&lt;/a&gt;, &lt;a href="https://github.com/apache/datafusion/pull/20117"&gt;#20117&lt;/a&gt;, &lt;a href="https://github.com/apache/datafusion/pull/20239"&gt;#20239&lt;/a&gt;&lt;/p&gt;
&lt;h2 id="new-features"&gt;New Features ✨&lt;a class="headerlink" href="#new-features" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;JSON Array File Support&lt;/strong&gt;: DataFusion 53 can now read JSON arrays such as
  &lt;code&gt;[{...}, {...}]&lt;/code&gt; directly as multiple rows, including streaming inputs from
  object stores. Thanks to &lt;a href="https://github.com/zhuqi-lucas"&gt;@zhuqi-lucas&lt;/a&gt; for implementing this feature.
  Related PRs: &lt;a href="https://github.com/apache/datafusion/pull/19924"&gt;#19924&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Support for &lt;code&gt;:&lt;/code&gt; operator&lt;/strong&gt;: DataFusion can plan queries such as
  &lt;code&gt;SELECT payload:'user_id' FROM events;&lt;/code&gt;, enabling better &lt;a href="https://parquet.apache.org/blog/2026/02/27/variant-type-in-apache-parquet-for-semi-structured-data/"&gt;Parquet Variant&lt;/a&gt;
  support via &lt;a href="https://github.com/datafusion-contrib/datafusion-variant"&gt;datafusion-variant&lt;/a&gt;. Thanks to &lt;a href="https://github.com/Samyak2"&gt;@Samyak2&lt;/a&gt;. Related PRs:
  &lt;a href="https://github.com/apache/datafusion/pull/20717"&gt;#20717&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;New SQL&lt;/strong&gt;: DataFusion supports additional set-comparison subqueries, null-aware
  anti join, and deletion predicates. Thanks to &lt;a href="https://github.com/waynexia"&gt;@waynexia&lt;/a&gt;, &lt;a href="https://github.com/viirya"&gt;@viirya&lt;/a&gt;, and
  &lt;a href="https://github.com/askalt"&gt;@askalt&lt;/a&gt; for key contributions in this area. Related PRs: &lt;a href="https://github.com/apache/datafusion/pull/19109"&gt;#19109&lt;/a&gt;,
  &lt;a href="https://github.com/apache/datafusion/pull/19635"&gt;#19635&lt;/a&gt;, &lt;a href="https://github.com/apache/datafusion/pull/20137"&gt;#20137&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Spark-Compatible Functions&lt;/strong&gt;: This release includes almost 20 new or improved
  Spark-compatible functions and behaviors in the &lt;a href="https://docs.rs/datafusion-spark/latest/datafusion_spark/index.html"&gt;datafusion-spark crate&lt;/a&gt;.
These include functions such as &lt;a href="https://github.com/apache/datafusion/pull/19699"&gt;collect_list&lt;/a&gt;, &lt;a href="https://github.com/apache/datafusion/pull/19845"&gt;date_diff&lt;/a&gt;,
  &lt;a href="https://github.com/apache/datafusion/pull/19880"&gt;from_utc_timestamp&lt;/a&gt;, &lt;a href="https://github.com/apache/datafusion/pull/20412"&gt;json_tuple&lt;/a&gt;, &lt;a href="https://github.com/apache/datafusion/pull/20440"&gt;arrays_zip&lt;/a&gt;, &lt;a href="https://github.com/apache/datafusion/pull/20479"&gt;bin&lt;/a&gt;, and &lt;a href="https://github.com/apache/datafusion/pull/20685"&gt;array_contains&lt;/a&gt;.
  Thanks to the contributors who drove this work, especially &lt;a href="https://github.com/cht42"&gt;@cht42&lt;/a&gt;,
  &lt;a href="https://github.com/CuteChuanChuan"&gt;@CuteChuanChuan&lt;/a&gt;, &lt;a href="https://github.com/SubhamSinghal"&gt;@SubhamSinghal&lt;/a&gt;, &lt;a href="https://github.com/kazantsev-maksim"&gt;@kazantsev-maksim&lt;/a&gt;, &lt;a href="https://github.com/unknowntpo"&gt;@unknowntpo&lt;/a&gt;,
  &lt;a href="https://github.com/aryan-212"&gt;@aryan-212&lt;/a&gt;, &lt;a href="https://github.com/hsiang-c"&gt;@hsiang-c&lt;/a&gt;, and &lt;a href="https://github.com/davidlghellin"&gt;@davidlghellin&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
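&lt;p&gt;As one example of the semantics involved, &lt;code&gt;arrays_zip&lt;/code&gt; pairs elements by position and pads the shorter input with nulls. In plain Python (a semantics sketch, not the datafusion-spark implementation):&lt;/p&gt;

```python
from itertools import zip_longest

def arrays_zip(a, b):
    """Pair elements by position, padding the shorter array with None,
    mirroring Spark's arrays_zip semantics (sketch only)."""
    return [list(pair) for pair in zip_longest(a, b)]

assert arrays_zip([1, 2, 3], ["a", "b"]) == [[1, "a"], [2, "b"], [3, None]]
```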
&lt;h2 id="stability-and-release-engineering"&gt;Stability and Release Engineering 🦺&lt;a class="headerlink" href="#stability-and-release-engineering" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The community spent significant time this release cycle stabilizing the release
branch and improving the release process. While such improvements are not as
headline-friendly as new features, they are highly important for real
deployments. We are discussing ways to improve the process on &lt;a href="https://github.com/apache/datafusion/issues/21034"&gt;#21034&lt;/a&gt; and would
welcome suggestions and contributions to help with release engineering work in
the future.&lt;/p&gt;
&lt;p&gt;Thanks to &lt;a href="https://github.com/comphead"&gt;@comphead&lt;/a&gt; for running this release, and to &lt;a href="https://github.com/jonathanc-n"&gt;@jonathanc-n&lt;/a&gt;, &lt;a href="https://github.com/alamb"&gt;@alamb&lt;/a&gt;,
&lt;a href="https://github.com/xanderbailey"&gt;@xanderbailey&lt;/a&gt;, &lt;a href="https://github.com/haohuaijin"&gt;@haohuaijin&lt;/a&gt;, &lt;a href="https://github.com/friendlymatthew"&gt;@friendlymatthew&lt;/a&gt;, &lt;a href="https://github.com/fwojciec"&gt;@fwojciec&lt;/a&gt;,
&lt;a href="https://github.com/Kontinuation"&gt;@Kontinuation&lt;/a&gt;, &lt;a href="https://github.com/nathanb9"&gt;@nathanb9&lt;/a&gt;, and many others who helped stabilize the release
branch.&lt;/p&gt;
&lt;h2 id="upgrade-notes"&gt;Upgrade Notes&lt;a class="headerlink" href="#upgrade-notes" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;DataFusion 53 includes some breaking changes, including updates to the SQL
parser, optimizer behavior, and some physical-plan APIs. Please see the &lt;a href="https://datafusion.apache.org/library-user-guide/upgrading/index.html"&gt;upgrade
guide&lt;/a&gt; and &lt;a href="https://github.com/apache/datafusion/blob/branch-53/dev/changelog/53.0.0.md"&gt;changelog&lt;/a&gt; for the full details before upgrading.&lt;/p&gt;
&lt;h2 id="known-issues"&gt;Known Issues&lt;a class="headerlink" href="#known-issues" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;A small number of issues were discovered after the 53.0.0 release,
and we expect to publish DataFusion 53.1.0 soon. See the &lt;a href="https://github.com/apache/datafusion/issues/21079"&gt;53.1.0 release tracking
issue&lt;/a&gt; for the latest status.&lt;/p&gt;
&lt;h2 id="thank-you"&gt;Thank You&lt;a class="headerlink" href="#thank-you" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Thank you to everyone in the DataFusion community who contributed code, reviews,
testing, bug reports, documentation, and release engineering work for 53.0.0.
This release contains direct contributions from 114 different people, and we are
grateful for the time and effort that everyone put in to make it happen.&lt;/p&gt;</content><category term="blog"/></entry><entry><title>Apache DataFusion Comet 0.14.0 Release</title><link href="https://datafusion.apache.org/blog/2026/03/18/datafusion-comet-0.14.0" rel="alternate"/><published>2026-03-18T00:00:00+00:00</published><updated>2026-03-18T00:00:00+00:00</updated><author><name>pmc</name></author><id>tag:datafusion.apache.org,2026-03-18:/blog/2026/03/18/datafusion-comet-0.14.0</id><summary type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;p&gt;The Apache DataFusion PMC is pleased to announce version 0.14.0 of the &lt;a href="https://datafusion.apache.org/comet/"&gt;Comet&lt;/a&gt; subproject.&lt;/p&gt;
&lt;p&gt;Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
improved performance and efficiency without requiring any code changes.&lt;/p&gt;
&lt;p&gt;This release covers approximately eight weeks of development …&lt;/p&gt;</summary><content type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;p&gt;The Apache DataFusion PMC is pleased to announce version 0.14.0 of the &lt;a href="https://datafusion.apache.org/comet/"&gt;Comet&lt;/a&gt; subproject.&lt;/p&gt;
&lt;p&gt;Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
improved performance and efficiency without requiring any code changes.&lt;/p&gt;
&lt;p&gt;This release covers approximately eight weeks of development work and is the result of merging 189 PRs from 21
contributors. See the &lt;a href="https://github.com/apache/datafusion-comet/blob/main/dev/changelog/0.14.0.md"&gt;change log&lt;/a&gt; for more information.&lt;/p&gt;
&lt;h2 id="key-features"&gt;Key Features&lt;a class="headerlink" href="#key-features" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id="native-iceberg-improvements"&gt;Native Iceberg Improvements&lt;a class="headerlink" href="#native-iceberg-improvements" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Comet's fully-native Iceberg integration received several enhancements:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Per-Partition Plan Serialization&lt;/strong&gt;: &lt;code&gt;CometExecRDD&lt;/code&gt; now supports per-partition plan data, reducing serialization
overhead for native Iceberg scans and enabling dynamic partition pruning (DPP).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Vended Credentials&lt;/strong&gt;: Native Iceberg scans now support passing vended credentials from the catalog, improving
integration with cloud storage services.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Upstream Reader Performance Improvements&lt;/strong&gt;: The Comet team contributed a number of
&lt;a href="https://iceberg.apache.org/blog/apache-iceberg-rust-0.9.0-release/#reader-performance-improvements"&gt;reader performance improvements&lt;/a&gt;
to iceberg-rust 0.9.0, which Comet now uses. These improvements benefit all iceberg-rust users.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Performance Optimizations&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Single-pass &lt;code&gt;FileScanTask&lt;/code&gt; validation for reduced planning overhead&lt;/li&gt;
&lt;li&gt;Configurable data file concurrency via &lt;code&gt;spark.comet.scan.icebergNative.dataFileConcurrencyLimit&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Channel-based executor thread parking instead of &lt;code&gt;yield_now()&lt;/code&gt; for reduced CPU overhead&lt;/li&gt;
&lt;li&gt;Reuse of &lt;code&gt;CometConf&lt;/code&gt; and native utility instances in batch decoding&lt;/li&gt;
&lt;/ul&gt;
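&lt;p&gt;For example, the new concurrency limit can be set alongside the native scan flag when submitting a Spark job (the value shown is illustrative; tune it for your storage backend):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-shell"&gt;--conf spark.comet.scan.icebergNative.enabled=true
--conf spark.comet.scan.icebergNative.dataFileConcurrencyLimit=8
&lt;/code&gt;&lt;/pre&gt;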
&lt;h3 id="native-columnar-to-row-conversion"&gt;Native Columnar-to-Row Conversion&lt;a class="headerlink" href="#native-columnar-to-row-conversion" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Comet now uses a native columnar-to-row (C2R) conversion by default. This
feature replaces Comet's JVM-based columnar-to-row transition with a native Rust implementation, reducing JVM memory overhead
when data flows from Comet's native execution back to Spark operators that require row-based input.&lt;/p&gt;
&lt;h3 id="new-expressions"&gt;New Expressions&lt;a class="headerlink" href="#new-expressions" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;This release adds support for the following expressions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Date/time functions: &lt;code&gt;make_date&lt;/code&gt;, &lt;code&gt;next_day&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;String functions: &lt;code&gt;right&lt;/code&gt;, &lt;code&gt;string_split&lt;/code&gt;, &lt;code&gt;luhn_check&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Math functions: &lt;code&gt;crc32&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Map functions: &lt;code&gt;map_contains_key&lt;/code&gt;, &lt;code&gt;map_from_entries&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Conversion functions: &lt;code&gt;to_csv&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Cast support: date to timestamp, numeric to timestamp, integer to binary, boolean to decimal, date to numeric&lt;/li&gt;
&lt;/ul&gt;
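&lt;p&gt;As an illustrative example, queries using these expressions can now execute natively in Comet instead of falling back to Spark:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;SELECT
  make_date(2026, 5, 7)            AS d,
  next_day(current_date(), 'Mon')  AS next_monday,
  right('DataFusion', 6)           AS suffix,
  crc32('comet')                   AS checksum,
  map_contains_key(map(1, 'a'), 1) AS has_key;
&lt;/code&gt;&lt;/pre&gt;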
&lt;h3 id="ansi-mode-error-messages"&gt;ANSI Mode Error Messages&lt;a class="headerlink" href="#ansi-mode-error-messages" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;ANSI SQL mode now produces proper error messages matching Spark's expected output, improving compatibility for
workloads that rely on strict SQL error handling.&lt;/p&gt;
&lt;h3 id="datafusion-configuration-passthrough"&gt;DataFusion Configuration Passthrough&lt;a class="headerlink" href="#datafusion-configuration-passthrough" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;DataFusion session-level configurations can now be set directly from Spark using the &lt;code&gt;spark.comet.datafusion.*&lt;/code&gt;
prefix. This enables tuning DataFusion internals such as batch sizes and memory limits without modifying Comet code.&lt;/p&gt;
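&lt;p&gt;For example, assuming the suffix after the prefix maps directly onto a DataFusion configuration key, DataFusion's batch size could be tuned like this (the value is illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-shell"&gt;--conf spark.comet.datafusion.execution.batch_size=16384
&lt;/code&gt;&lt;/pre&gt;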
&lt;h2 id="performance-improvements"&gt;Performance Improvements&lt;a class="headerlink" href="#performance-improvements" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;This release includes extensive performance optimizations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Sum aggregation&lt;/strong&gt;: Specialized implementations for each eval mode eliminate per-row mode checks&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Contains expression&lt;/strong&gt;: SIMD-based scalar pattern search for faster string matching&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Batch coalescing&lt;/strong&gt;: Reduced IPC schema overhead in &lt;code&gt;BufBatchWriter&lt;/code&gt; by coalescing small batches&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tokio runtime&lt;/strong&gt;: Worker threads now initialize from &lt;code&gt;spark.executor.cores&lt;/code&gt; for better resource utilization&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Decimal expressions&lt;/strong&gt;: Optimized decimal arithmetic operations&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row-to-columnar transition&lt;/strong&gt;: Improved performance for JVM shuffle data conversion&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Aligned pointer reads&lt;/strong&gt;: Optimized &lt;code&gt;SparkUnsafeRow&lt;/code&gt; field accessors using aligned memory reads&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="deprecations-and-removals"&gt;Deprecations and Removals&lt;a class="headerlink" href="#deprecations-and-removals" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The deprecated &lt;code&gt;native_comet&lt;/code&gt; scan mode has been removed. Use &lt;code&gt;native_datafusion&lt;/code&gt; instead. Note
that the &lt;code&gt;native_iceberg_compat&lt;/code&gt; scan is now deprecated and will be removed in a future release.&lt;/p&gt;
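&lt;p&gt;Assuming the scan implementation is selected via the &lt;code&gt;spark.comet.scan.impl&lt;/code&gt; setting, migrating off the removed mode looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-shell"&gt;--conf spark.comet.scan.impl=native_datafusion
&lt;/code&gt;&lt;/pre&gt;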
&lt;h2 id="compatibility"&gt;Compatibility&lt;a class="headerlink" href="#compatibility" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;This release upgrades to DataFusion 52.3, Arrow 57.3, and iceberg-rust 0.9.0. Published binaries now target
x86-64-v3 and neoverse-n1 CPU architectures for improved performance on modern hardware.&lt;/p&gt;
&lt;p&gt;Supported platforms include Spark 3.4.3, 3.5.4-3.5.8, and Spark 4.0.x with various JDK and Scala combinations.&lt;/p&gt;
&lt;p&gt;The community encourages users to test Comet with existing Spark workloads and welcomes contributions to ongoing development.&lt;/p&gt;</content><category term="blog"/></entry><entry><title>Apache DataFusion Comet 0.13.0 Release</title><link href="https://datafusion.apache.org/blog/2026/01/30/datafusion-comet-0.13.0" rel="alternate"/><published>2026-01-30T00:00:00+00:00</published><updated>2026-01-30T00:00:00+00:00</updated><author><name>pmc</name></author><id>tag:datafusion.apache.org,2026-01-30:/blog/2026/01/30/datafusion-comet-0.13.0</id><summary type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;p&gt;The Apache DataFusion PMC is pleased to announce version 0.13.0 of the &lt;a href="https://datafusion.apache.org/comet/"&gt;Comet&lt;/a&gt; subproject.&lt;/p&gt;
&lt;p&gt;Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
improved performance and efficiency without requiring any code changes.&lt;/p&gt;
&lt;p&gt;This release covers approximately eight weeks of development …&lt;/p&gt;</summary><content type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;p&gt;The Apache DataFusion PMC is pleased to announce version 0.13.0 of the &lt;a href="https://datafusion.apache.org/comet/"&gt;Comet&lt;/a&gt; subproject.&lt;/p&gt;
&lt;p&gt;Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
improved performance and efficiency without requiring any code changes.&lt;/p&gt;
&lt;p&gt;This release covers approximately eight weeks of development work and is the result of merging 169 PRs from 15
contributors. See the &lt;a href="https://github.com/apache/datafusion-comet/blob/main/dev/changelog/0.13.0.md"&gt;change log&lt;/a&gt; for more information.&lt;/p&gt;
&lt;h2 id="key-features"&gt;Key Features&lt;a class="headerlink" href="#key-features" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id="native-parquet-write-support-experimental"&gt;Native Parquet Write Support (Experimental)&lt;a class="headerlink" href="#native-parquet-write-support-experimental" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;This release introduces experimental native Parquet write capabilities, allowing Comet to intercept and execute Parquet write operations natively through DataFusion. Key capabilities include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;File commit protocol support for reliable writes&lt;/li&gt;
&lt;li&gt;Remote HDFS writing via OpenDAL integration&lt;/li&gt;
&lt;li&gt;Complex type support (arrays, maps, structs)&lt;/li&gt;
&lt;li&gt;Proper handling of object store settings&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To enable native Parquet writes, set:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;spark.comet.allowIncompatibleOp.DataWritingCommandExec=true
spark.comet.parquet.write.enabled=true
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: This feature is highly experimental and should not be used in production environments. It is currently categorized as a testing feature and is disabled by default.&lt;/p&gt;
&lt;h3 id="native-iceberg-improvements"&gt;Native Iceberg Improvements&lt;a class="headerlink" href="#native-iceberg-improvements" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Comet's fully-native Iceberg integration received significant enhancements in this release:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;REST Catalog Support&lt;/strong&gt;: Native Iceberg scans now support REST catalogs, enabling integration with catalog services like Apache Polaris and Tabular. Configure with:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-shell"&gt;--conf spark.sql.catalog.rest_cat=org.apache.iceberg.spark.SparkCatalog
--conf spark.sql.catalog.rest_cat.catalog-impl=org.apache.iceberg.rest.RESTCatalog
--conf spark.sql.catalog.rest_cat.uri=http://localhost:8181
--conf spark.comet.scan.icebergNative.enabled=true
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Session Token Authentication&lt;/strong&gt;: Added support for session tokens in native Iceberg scans for secure S3 access.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Performance Optimizations&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Deduplicated serialized metadata reducing memory overhead&lt;/li&gt;
&lt;li&gt;Switched from JSON to protobuf for partition value serialization&lt;/li&gt;
&lt;li&gt;Removed IcebergFileStream in favor of iceberg-rust's built-in parallelization&lt;/li&gt;
&lt;li&gt;Reduced metadata serialization points&lt;/li&gt;
&lt;li&gt;Added SchemaAdapter caching&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To enable fully-native Iceberg scanning:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;spark.comet.scan.icebergNative.enabled=true
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The native reader supports Iceberg table spec v1 and v2, all primitive and complex types, schema evolution, time travel, positional and equality deletes, filter pushdown, and various storage backends (local, HDFS, S3).&lt;/p&gt;
&lt;h3 id="native-csv-reading-experimental"&gt;Native CSV Reading (Experimental)&lt;a class="headerlink" href="#native-csv-reading-experimental" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Experimental support for native CSV file reading has been added, expanding Comet's file format capabilities beyond Parquet.&lt;/p&gt;
&lt;h3 id="new-expressions"&gt;New Expressions&lt;a class="headerlink" href="#new-expressions" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The release adds support for numerous expressions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Array functions: &lt;code&gt;explode&lt;/code&gt;, &lt;code&gt;explode_outer&lt;/code&gt;, &lt;code&gt;size&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Date/time functions: &lt;code&gt;unix_date&lt;/code&gt;, &lt;code&gt;date_format&lt;/code&gt;, &lt;code&gt;datediff&lt;/code&gt;, &lt;code&gt;last_day&lt;/code&gt;, &lt;code&gt;unix_timestamp&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;String functions: &lt;code&gt;left&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;JSON functions: &lt;code&gt;from_json&lt;/code&gt; (partial support)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="ansi-mode-support"&gt;ANSI Mode Support&lt;a class="headerlink" href="#ansi-mode-support" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Sum and average aggregate expressions now support ANSI mode for both integer and decimal inputs, enabling overflow checking in strict SQL mode.&lt;/p&gt;
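&lt;p&gt;For example, with ANSI mode enabled, an integer &lt;code&gt;SUM&lt;/code&gt; that overflows now raises an error natively rather than requiring a fallback to Spark (the table and column names are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;SET spark.sql.ansi.enabled=true;
-- on overflow, raises an arithmetic error instead of wrapping
SELECT sum(amount) FROM sales;
&lt;/code&gt;&lt;/pre&gt;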
&lt;h3 id="native-shuffle-improvements"&gt;Native Shuffle Improvements&lt;a class="headerlink" href="#native-shuffle-improvements" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Round-robin partitioning is now supported in native shuffle&lt;/li&gt;
&lt;li&gt;Spill metrics are now reported correctly&lt;/li&gt;
&lt;li&gt;Configurable shuffle writer buffer size via &lt;code&gt;spark.comet.shuffle.write.bufferSize&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
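&lt;p&gt;The new shuffle writer buffer size can be set alongside the existing shuffle option (the value shown is illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-shell"&gt;--conf spark.comet.exec.shuffle.enabled=true
--conf spark.comet.shuffle.write.bufferSize=1048576
&lt;/code&gt;&lt;/pre&gt;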
&lt;h2 id="performance-improvements"&gt;Performance Improvements&lt;a class="headerlink" href="#performance-improvements" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;This release includes extensive performance optimizations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;String to integer casting&lt;/strong&gt;: Significant speedups through optimized parsing&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;String functions&lt;/strong&gt;: Optimized &lt;code&gt;lpad&lt;/code&gt;/&lt;code&gt;rpad&lt;/code&gt; to remove unnecessary memory allocations&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Date operations&lt;/strong&gt;: Improved &lt;code&gt;normalize_nan&lt;/code&gt; and date truncate performance&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Query planning&lt;/strong&gt;: Cached query plans to avoid per-partition serialization overhead&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Memory efficiency&lt;/strong&gt;: Reduced GC pressure in protobuf serialization&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hash operations&lt;/strong&gt;: Optimized complex-type hash implementations including murmur3 support for nested types&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Runtime efficiency&lt;/strong&gt;: Eliminated busy-polling of the Tokio stream for plans without a CometScan&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Metrics overhead&lt;/strong&gt;: Reduced timer and syscall overhead in native shuffle writer&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="deprecations"&gt;Deprecations&lt;a class="headerlink" href="#deprecations" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;native_comet&lt;/code&gt; scan mode is now deprecated in favor of &lt;code&gt;native_iceberg_compat&lt;/code&gt; and will be removed in a future release. The &lt;code&gt;auto&lt;/code&gt; scan mode no longer falls back to &lt;code&gt;native_comet&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id="compatibility"&gt;Compatibility&lt;a class="headerlink" href="#compatibility" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;This release upgrades to DataFusion 51, Arrow 57, and the latest iceberg-rust. The minimum supported Rust version is now 1.88.&lt;/p&gt;
&lt;p&gt;Supported platforms include Spark 3.4.3, 3.5.4-3.5.7, and Spark 4.0.x with various JDK and Scala combinations.&lt;/p&gt;
&lt;p&gt;The community encourages users to test Comet with existing Spark workloads and welcomes contributions to ongoing development.&lt;/p&gt;</content><category term="blog"/></entry><entry><title>Apache DataFusion 52.0.0 Released</title><link href="https://datafusion.apache.org/blog/2026/01/12/datafusion-52.0.0" rel="alternate"/><published>2026-01-12T00:00:00+00:00</published><updated>2026-01-12T00:00:00+00:00</updated><author><name>pmc</name></author><id>tag:datafusion.apache.org,2026-01-12:/blog/2026/01/12/datafusion-52.0.0</id><summary type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;p&gt;We are proud to announce the release of &lt;a href="https://crates.io/crates/datafusion/52.0.0"&gt;DataFusion 52.0.0&lt;/a&gt;. This post highlights
some of the major improvements since &lt;a href="https://datafusion.apache.org/blog/2025/11/25/datafusion-51.0.0/"&gt;DataFusion 51.0.0&lt;/a&gt;. The complete list of
changes is available in the &lt;a href="https://github.com/apache/datafusion/blob/branch-52/dev/changelog/52.0.0.md"&gt;changelog&lt;/a&gt;. Thanks to the &lt;a href="https://github.com/apache/datafusion/blob/branch-52/dev/changelog/52.0.0.md#credits"&gt;121 contributors&lt;/a&gt; for
making this release possible.&lt;/p&gt;
&lt;h2 id="performance-improvements"&gt;Performance Improvements 🚀&lt;a class="headerlink" href="#performance-improvements" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We continue to …&lt;/p&gt;</summary><content type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;p&gt;We are proud to announce the release of &lt;a href="https://crates.io/crates/datafusion/52.0.0"&gt;DataFusion 52.0.0&lt;/a&gt;. This post highlights
some of the major improvements since &lt;a href="https://datafusion.apache.org/blog/2025/11/25/datafusion-51.0.0/"&gt;DataFusion 51.0.0&lt;/a&gt;. The complete list of
changes is available in the &lt;a href="https://github.com/apache/datafusion/blob/branch-52/dev/changelog/52.0.0.md"&gt;changelog&lt;/a&gt;. Thanks to the &lt;a href="https://github.com/apache/datafusion/blob/branch-52/dev/changelog/52.0.0.md#credits"&gt;121 contributors&lt;/a&gt; for
making this release possible.&lt;/p&gt;
&lt;h2 id="performance-improvements"&gt;Performance Improvements 🚀&lt;a class="headerlink" href="#performance-improvements" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We continue to make significant performance improvements in DataFusion as explained below.&lt;/p&gt;
&lt;h3 id="faster-case-expressions"&gt;Faster &lt;code&gt;CASE&lt;/code&gt; Expressions&lt;a class="headerlink" href="#faster-case-expressions" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;DataFusion 52 adds lookup-table-based evaluation for certain &lt;code&gt;CASE&lt;/code&gt; expressions,
avoiding repeated evaluation and accelerating common ETL patterns such as&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;CASE company
    WHEN 1 THEN 'Apple'
    WHEN 5 THEN 'Samsung'
    WHEN 2 THEN 'Motorola'
    WHEN 3 THEN 'LG'
    ELSE 'Other'
END
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is the final work in our &lt;code&gt;CASE&lt;/code&gt; performance epic (&lt;a href="https://github.com/apache/datafusion/issues/18075"&gt;#18075&lt;/a&gt;), which has
improved &lt;code&gt;CASE&lt;/code&gt; evaluation significantly. Related PRs: &lt;a href="https://github.com/apache/datafusion/pull/18183"&gt;#18183&lt;/a&gt;. Thanks to
&lt;a href="https://github.com/rluvaton"&gt;rluvaton&lt;/a&gt; and &lt;a href="https://github.com/pepijnve"&gt;pepijnve&lt;/a&gt; for the implementation. See the
&lt;a href="/blog/2026/02/02/datafusion_case/"&gt;Optimizing SQL CASE Expression Evaluation&lt;/a&gt; blog post for more details.&lt;/p&gt;
&lt;h3 id="minmax-aggregate-dynamic-filters"&gt;&lt;code&gt;MIN&lt;/code&gt;/&lt;code&gt;MAX&lt;/code&gt; Aggregate Dynamic Filters&lt;a class="headerlink" href="#minmax-aggregate-dynamic-filters" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;DataFusion now creates dynamic filters for queries with &lt;code&gt;MIN&lt;/code&gt;/&lt;code&gt;MAX&lt;/code&gt; aggregates
that have filters, but no &lt;code&gt;GROUP BY&lt;/code&gt;. These dynamic filters are used during scan
to prune files and rows as tighter bounds are discovered during execution, as
explained in the &lt;a href="https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/#hash-join-dynamic-filters"&gt;Dynamic Filtering Blog&lt;/a&gt;. For example, the following query:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;SELECT min(l_shipdate)
FROM lineitem
WHERE l_returnflag = 'R';
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;is now executed as if it were written like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;SELECT min(l_shipdate)
FROM lineitem
--  '__current_min' is updated dynamically during execution
WHERE l_returnflag = 'R' AND l_shipdate &amp;lt; __current_min;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Thanks to &lt;a href="https://github.com/2010YOUY01"&gt;2010YOUY01&lt;/a&gt; for implementing this feature, with reviews from
&lt;a href="https://github.com/martin-g"&gt;martin-g&lt;/a&gt;, &lt;a href="https://github.com/adriangb"&gt;adriangb&lt;/a&gt;, and &lt;a href="https://github.com/LiaCastaneda"&gt;LiaCastaneda&lt;/a&gt;. Related PRs: &lt;a href="https://github.com/apache/datafusion/pull/18644"&gt;#18644&lt;/a&gt;&lt;/p&gt;
&lt;h3 id="new-merge-join"&gt;New Merge Join&lt;a class="headerlink" href="#new-merge-join" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;DataFusion 52 includes a rewrite of the sort-merge join (SMJ) operator, with
speedups of three orders of magnitude in some pathological cases such as the
case in &lt;a href="https://github.com/apache/datafusion/issues/18487"&gt;#18487&lt;/a&gt;, which also affected &lt;a href="https://datafusion.apache.org/comet/"&gt;Apache Comet&lt;/a&gt; workloads. Benchmarks in
&lt;a href="https://github.com/apache/datafusion/pull/18875"&gt;#18875&lt;/a&gt; show dramatic gains for TPC-H Q21 (minutes to milliseconds) while
leaving other queries unchanged or modestly faster. Thanks to &lt;a href="https://github.com/mbutrovich"&gt;mbutrovich&lt;/a&gt; for
the implementation and reviews from &lt;a href="https://github.com/Dandandan"&gt;Dandandan&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="caching-improvements"&gt;Caching Improvements&lt;a class="headerlink" href="#caching-improvements" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;This release also includes several additional caching improvements.&lt;/p&gt;
&lt;p&gt;A new statistics cache for File Metadata avoids repeatedly (re)calculating
statistics for files. This significantly improves planning time
for certain queries. You can see the contents of the new cache using the
&lt;a href="https://datafusion.apache.org/user-guide/cli/functions.html#statistics-cache"&gt;statistics_cache&lt;/a&gt; function in the CLI:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;select * from statistics_cache();
+------------------+---------------------+-----------------+------------------------+---------+-----------------+-------------+--------------------+-----------------------+
| path             | file_modified       | file_size_bytes | e_tag                  | version | num_rows        | num_columns | table_size_bytes   | statistics_size_bytes |
+------------------+---------------------+-----------------+------------------------+---------+-----------------+-------------+--------------------+-----------------------+
| .../hits.parquet | 2022-06-25T22:22:22 | 14779976446     | 0-5e24d1ee16380-370f48 | NULL    | Exact(99997497) | 105         | Exact(36445943240) | 0                     |
+------------------+---------------------+-----------------+------------------------+---------+-----------------+-------------+--------------------+-----------------------+
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Thanks to &lt;a href="https://github.com/bharath-techie"&gt;bharath-techie&lt;/a&gt; and &lt;a href="https://github.com/nuno-faria"&gt;nuno-faria&lt;/a&gt; for implementing the statistics cache,
with reviews from &lt;a href="https://github.com/martin-g"&gt;martin-g&lt;/a&gt;, &lt;a href="https://github.com/alamb"&gt;alamb&lt;/a&gt;, and &lt;a href="https://github.com/alchemist51"&gt;alchemist51&lt;/a&gt;.
Related PRs: &lt;a href="https://github.com/apache/datafusion/pull/18971"&gt;#18971&lt;/a&gt;, &lt;a href="https://github.com/apache/datafusion/pull/19054"&gt;#19054&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;A prefix-aware list-files cache accelerates partition predicate evaluation for
Hive-partitioned tables.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;-- Read the hive partitioned dataset from Overture Maps (100s of Parquet files)
CREATE EXTERNAL TABLE overturemaps
STORED AS PARQUET LOCATION 's3://overturemaps-us-west-2/release/2025-12-17.0/';
-- Find all files where the path contains `theme=base` without requiring another LIST call
select count(*) from overturemaps where theme='base';
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can see the
contents of the new cache using the &lt;a href="https://datafusion.apache.org/user-guide/cli/functions.html#list-files-cache"&gt;list_files_cache&lt;/a&gt; function in the CLI:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;create external table overturemaps
stored as parquet
location 's3://overturemaps-us-west-2/release/2025-12-17.0/theme=base/type=infrastructure';
0 row(s) fetched.
&amp;gt; select table, path, metadata_size_bytes, expires_in, unnest(metadata_list)['file_size_bytes'] as file_size_bytes, unnest(metadata_list)['e_tag'] as e_tag from list_files_cache() limit 10;
+--------------+-----------------------------------------------------+---------------------+-----------------------------------+-----------------+---------------------------------------+
| table        | path                                                | metadata_size_bytes | expires_in                        | file_size_bytes | e_tag                                 |
+--------------+-----------------------------------------------------+---------------------+-----------------------------------+-----------------+---------------------------------------+
| overturemaps | release/2025-12-17.0/theme=base/type=infrastructure | 2750                | 0 days 0 hours 0 mins 25.264 secs | 999055952       | "35fc8fbe8400960b54c66fbb408c48e8-60" |
| overturemaps | release/2025-12-17.0/theme=base/type=infrastructure | 2750                | 0 days 0 hours 0 mins 25.264 secs | 975592768       | "8a16e10b722681cdc00242564b502965-59" |
...
| overturemaps | release/2025-12-17.0/theme=base/type=infrastructure | 2750                | 0 days 0 hours 0 mins 25.264 secs | 1016732378      | "6d70857a0473ed9ed3fc6e149814168b-61" |
| overturemaps | release/2025-12-17.0/theme=base/type=infrastructure | 2750                | 0 days 0 hours 0 mins 25.264 secs | 991363784       | "c9cafb42fcbb413f851691c895dd7c2b-60" |
| overturemaps | release/2025-12-17.0/theme=base/type=infrastructure | 2750                | 0 days 0 hours 0 mins 25.264 secs | 1032469715      | "7540252d0d67158297a67038a3365e0f-62" |
+--------------+-----------------------------------------------------+---------------------+-----------------------------------+-----------------+---------------------------------------+
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Thanks to &lt;a href="https://github.com/BlakeOrth"&gt;BlakeOrth&lt;/a&gt; and &lt;a href="https://github.com/Yuvraj-cyborg"&gt;Yuvraj-cyborg&lt;/a&gt; for implementing the list-files cache work,
with reviews from &lt;a href="https://github.com/gabotechs"&gt;gabotechs&lt;/a&gt;, &lt;a href="https://github.com/alamb"&gt;alamb&lt;/a&gt;, &lt;a href="https://github.com/alchemist51"&gt;alchemist51&lt;/a&gt;, &lt;a href="https://github.com/martin-g"&gt;martin-g&lt;/a&gt;, and &lt;a href="https://github.com/BlakeOrth"&gt;BlakeOrth&lt;/a&gt;.
Related PRs: &lt;a href="https://github.com/apache/datafusion/pull/18146"&gt;#18146&lt;/a&gt;, &lt;a href="https://github.com/apache/datafusion/pull/18855"&gt;#18855&lt;/a&gt;, &lt;a href="https://github.com/apache/datafusion/pull/19366"&gt;#19366&lt;/a&gt;, &lt;a href="https://github.com/apache/datafusion/pull/19298"&gt;#19298&lt;/a&gt;&lt;/p&gt;
&lt;h3 id="improved-hash-join-filter-pushdown"&gt;Improved Hash Join Filter Pushdown&lt;a class="headerlink" href="#improved-hash-join-filter-pushdown" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Starting in DataFusion 51, filtering information from &lt;code&gt;HashJoinExec&lt;/code&gt; is passed
dynamically to scans, as explained in the &lt;a href="https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/#hash-join-dynamic-filters"&gt;Dynamic Filtering Blog&lt;/a&gt; using a
technique referred to as &lt;a href="https://dl.acm.org/doi/10.1109/ICDE.2008.4497486"&gt;Sideways Information Passing&lt;/a&gt; in Database research
literature. The initial implementation passed min/max values for the join keys.
DataFusion 52 extends the optimization (&lt;a href="https://github.com/apache/datafusion/issues/17171"&gt;#17171&lt;/a&gt; / &lt;a href="https://github.com/apache/datafusion/pull/18393"&gt;#18393&lt;/a&gt;) to pass the
contents of the build side hash map. These filters are evaluated on the probe
side scan to prune files, row groups, and individual rows. When the build side
contains &lt;code&gt;20&lt;/code&gt; or fewer rows (configurable) the contents of the hash map are
transformed to an &lt;code&gt;IN&lt;/code&gt; expression and used for &lt;a href="https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/struct.PruningPredicate.html"&gt;statistics-based pruning&lt;/a&gt; which
can avoid reading entire files or row groups that contain no matching join keys.
Thanks to &lt;a href="https://github.com/adriangb"&gt;adriangb&lt;/a&gt; for implementing this feature, with reviews from
&lt;a href="https://github.com/LiaCastaneda"&gt;LiaCastaneda&lt;/a&gt;, &lt;a href="https://github.com/asolimando"&gt;asolimando&lt;/a&gt;, &lt;a href="https://github.com/comphead"&gt;comphead&lt;/a&gt;, and &lt;a href="https://github.com/mbutrovich"&gt;mbutrovich&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="major-features"&gt;Major Features ✨&lt;a class="headerlink" href="#major-features" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id="arrow-ipc-stream-file-support"&gt;Arrow IPC Stream file support&lt;a class="headerlink" href="#arrow-ipc-stream-file-support" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;DataFusion can now read Arrow IPC stream files (&lt;a href="https://github.com/apache/datafusion/pull/18457"&gt;#18457&lt;/a&gt;). This expands
interoperability with systems that emit Arrow streams directly, making it
simpler to ingest Arrow-native data without conversion. Thanks to &lt;a href="https://github.com/corasaurus-hex"&gt;corasaurus-hex&lt;/a&gt;
for implementing this feature, with reviews from &lt;a href="https://github.com/martin-g"&gt;martin-g&lt;/a&gt;, &lt;a href="https://github.com/Jefffrey"&gt;Jefffrey&lt;/a&gt;,
&lt;a href="https://github.com/jdcasale"&gt;jdcasale&lt;/a&gt;, &lt;a href="https://github.com/2010YOUY01"&gt;2010YOUY01&lt;/a&gt;, and &lt;a href="https://github.com/timsaucer"&gt;timsaucer&lt;/a&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;CREATE EXTERNAL TABLE ipc_events
STORED AS ARROW
LOCATION 's3://bucket/events.arrow';
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Related PRs: &lt;a href="https://github.com/apache/datafusion/pull/18457"&gt;#18457&lt;/a&gt;&lt;/p&gt;
&lt;h3 id="more-extensible-sql-planning-with-relationplanner"&gt;More Extensible SQL Planning with &lt;code&gt;RelationPlanner&lt;/code&gt;&lt;a class="headerlink" href="#more-extensible-sql-planning-with-relationplanner" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;DataFusion now has an API for extending the SQL planner for relations, as
explained in the &lt;a href="https://datafusion.apache.org/blog/2026/01/12/extending-sql/"&gt;Extending SQL in DataFusion Blog&lt;/a&gt;. In addition to the existing
expression and types extension points, this new API now allows extending &lt;code&gt;FROM&lt;/code&gt;
clauses. Using these APIs, it is straightforward to provide SQL support for
almost any dialect, including vendor-specific syntax. Example use cases include:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;-- Postgres-style JSON operators
SELECT payload-&amp;gt;'user'-&amp;gt;&amp;gt;'id' FROM logs;
-- MySQL-specific types
SELECT DATETIME '2001-01-01 18:00:00';
-- Statistical sampling
SELECT * FROM sensor_data TABLESAMPLE BERNOULLI(10 PERCENT);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Thanks to &lt;a href="https://github.com/geoffreyclaude"&gt;geoffreyclaude&lt;/a&gt; for implementing relation planner extensions, and to
&lt;a href="https://github.com/theirix"&gt;theirix&lt;/a&gt;, &lt;a href="https://github.com/alamb"&gt;alamb&lt;/a&gt;, &lt;a href="https://github.com/NGA-TRAN"&gt;NGA-TRAN&lt;/a&gt;, and &lt;a href="https://github.com/gabotechs"&gt;gabotechs&lt;/a&gt; for reviews and feedback on the
design. Related PRs: &lt;a href="https://github.com/apache/datafusion/pull/17843"&gt;#17843&lt;/a&gt;&lt;/p&gt;
&lt;h3 id="expression-evaluation-pushdown-to-scans"&gt;Expression Evaluation Pushdown to Scans&lt;a class="headerlink" href="#expression-evaluation-pushdown-to-scans" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;DataFusion now pushes down expression evaluation into TableProviders using 
&lt;a href="https://docs.rs/datafusion/52.0.0/datafusion/physical_expr_adapter/trait.PhysicalExprAdapter.html"&gt;PhysicalExprAdapter&lt;/a&gt;, replacing the older SchemaAdapter approach (&lt;a href="https://github.com/apache/datafusion/issues/14993"&gt;#14993&lt;/a&gt;,
&lt;a href="https://github.com/apache/datafusion/issues/16800"&gt;#16800&lt;/a&gt;). Predicates and expressions can now be customized for each
individual file schema, enabling additional optimizations such as support for
&lt;a href="https://github.com/apache/datafusion/issues/16116"&gt;Variant shredding&lt;/a&gt;. Thanks to &lt;a href="https://github.com/adriangb"&gt;adriangb&lt;/a&gt; for implementing PhysicalExprAdapter
and reworking pushdown to use it. Related PRs: &lt;a href="https://github.com/apache/datafusion/pull/18998"&gt;#18998&lt;/a&gt;, &lt;a href="https://github.com/apache/datafusion/pull/19345"&gt;#19345&lt;/a&gt;&lt;/p&gt;
&lt;h3 id="sort-pushdown-to-scans"&gt;Sort Pushdown to Scans&lt;a class="headerlink" href="#sort-pushdown-to-scans" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;DataFusion can now push sorts into data sources (&lt;a href="https://github.com/apache/datafusion/issues/10433"&gt;#10433&lt;/a&gt;, &lt;a href="https://github.com/apache/datafusion/pull/19064"&gt;#19064&lt;/a&gt;).
This allows table provider implementations to optimize based on
sort knowledge for certain query patterns. For example, the provided Parquet
data source now reverses the scan order of row groups and files when queried
for the opposite of the file's natural sort (e.g. &lt;code&gt;DESC&lt;/code&gt; when the files are sorted &lt;code&gt;ASC&lt;/code&gt;).
This reversal, combined with dynamic filtering, allows top-K queries with &lt;code&gt;LIMIT&lt;/code&gt;
on pre-sorted data to find the requested rows very quickly, pruning more files and row groups
without even scanning them. We have seen a ~30x performance improvement on
benchmark queries with pre-sorted data.
Thanks to &lt;a href="https://github.com/zhuqi-lucas"&gt;zhuqi-lucas&lt;/a&gt; and &lt;a href="https://github.com/xudong963"&gt;xudong963&lt;/a&gt; for this feature, with reviews from
&lt;a href="https://github.com/martin-g"&gt;martin-g&lt;/a&gt;, &lt;a href="https://github.com/adriangb"&gt;adriangb&lt;/a&gt;, and &lt;a href="https://github.com/alamb"&gt;alamb&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="tableprovider-supports-delete-and-update-statements"&gt;&lt;code&gt;TableProvider&lt;/code&gt; supports &lt;code&gt;DELETE&lt;/code&gt; and &lt;code&gt;UPDATE&lt;/code&gt; statements&lt;a class="headerlink" href="#tableprovider-supports-delete-and-update-statements" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The &lt;a href="https://docs.rs/datafusion/52.0.0/datafusion/datasource/trait.TableProvider.html"&gt;TableProvider&lt;/a&gt; trait now includes hooks for &lt;code&gt;DELETE&lt;/code&gt; and &lt;code&gt;UPDATE&lt;/code&gt;
statements, and the built-in MemTable implements them (&lt;a href="https://github.com/apache/datafusion/pull/19142"&gt;#19142&lt;/a&gt;). This lets
downstream implementations and storage engines plug in their own mutation logic.
See &lt;a href="https://docs.rs/datafusion/52.0.0/datafusion/datasource/trait.TableProvider.html#method.delete_from"&gt;TableProvider::delete_from&lt;/a&gt; and &lt;a href="https://docs.rs/datafusion/52.0.0/datafusion/datasource/trait.TableProvider.html#method.update"&gt;TableProvider::update&lt;/a&gt; for more details.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;DELETE FROM mem_table WHERE status = 'obsolete';
&lt;/code&gt;&lt;/pre&gt;
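&lt;p&gt;An &lt;code&gt;UPDATE&lt;/code&gt; statement follows the same pattern (hypothetical column
names and values):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;UPDATE mem_table SET status = 'archived' WHERE status = 'stale';
&lt;/code&gt;&lt;/pre&gt;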
&lt;p&gt;Thanks to &lt;a href="https://github.com/ethan-tyler"&gt;ethan-tyler&lt;/a&gt; for the implementation and &lt;a href="https://github.com/alamb"&gt;alamb&lt;/a&gt; and &lt;a href="https://github.com/adriangb"&gt;adriangb&lt;/a&gt; for
reviews.&lt;/p&gt;
&lt;h3 id="coalescebatchesexec-removed"&gt;&lt;code&gt;CoalesceBatchesExec&lt;/code&gt; Removed&lt;a class="headerlink" href="#coalescebatchesexec-removed" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The standalone &lt;code&gt;CoalesceBatchesExec&lt;/code&gt; operator existed to ensure batches were
large enough for subsequent vectorized execution, and was inserted after
filter-like operators such as &lt;code&gt;FilterExec&lt;/code&gt;, &lt;code&gt;HashJoinExec&lt;/code&gt;, and
&lt;code&gt;RepartitionExec&lt;/code&gt;. However, using a separate operator also blocked other
optimizations, such as pushing &lt;code&gt;LIMIT&lt;/code&gt; through joins, and made optimizer rules
more complex. In this release, we integrated the coalescing into the operators
themselves (&lt;a href="https://github.com/apache/datafusion/issues/18779"&gt;#18779&lt;/a&gt;) using Arrow's &lt;a href="https://docs.rs/arrow/57.2.0/arrow/compute/kernels/coalesce/"&gt;coalesce kernel&lt;/a&gt;. This reduces plan
complexity while keeping batch sizes efficient, and allows additional focused
optimization work in the Arrow kernel, such as &lt;a href="https://github.com/Dandandan"&gt;Dandandan&lt;/a&gt;'s recent work with
filtering in &lt;a href="https://github.com/apache/arrow-rs/pull/8951"&gt;arrow-rs/#8951&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Related PRs: &lt;a href="https://github.com/apache/datafusion/pull/18540"&gt;#18540&lt;/a&gt;, &lt;a href="https://github.com/apache/datafusion/pull/18604"&gt;#18604&lt;/a&gt;, &lt;a href="https://github.com/apache/datafusion/pull/18630"&gt;#18630&lt;/a&gt;, &lt;a href="https://github.com/apache/datafusion/pull/18972"&gt;#18972&lt;/a&gt;, &lt;a href="https://github.com/apache/datafusion/pull/19002"&gt;#19002&lt;/a&gt;, &lt;a href="https://github.com/apache/datafusion/pull/19342"&gt;#19342&lt;/a&gt;, &lt;a href="https://github.com/apache/datafusion/pull/19239"&gt;#19239&lt;/a&gt;.
Thanks to &lt;a href="https://github.com/Tim-53"&gt;Tim-53&lt;/a&gt;, &lt;a href="https://github.com/Dandandan"&gt;Dandandan&lt;/a&gt;, &lt;a href="https://github.com/jizezhang"&gt;jizezhang&lt;/a&gt;, and &lt;a href="https://github.com/feniljain"&gt;feniljain&lt;/a&gt; for implementing
this feature, with reviews from &lt;a href="https://github.com/Jefffrey"&gt;Jefffrey&lt;/a&gt;, &lt;a href="https://github.com/alamb"&gt;alamb&lt;/a&gt;, &lt;a href="https://github.com/martin-g"&gt;martin-g&lt;/a&gt;,
&lt;a href="https://github.com/geoffreyclaude"&gt;geoffreyclaude&lt;/a&gt;, &lt;a href="https://github.com/milenkovicm"&gt;milenkovicm&lt;/a&gt;, and &lt;a href="https://github.com/jizezhang"&gt;jizezhang&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="upgrade-guide-and-changelog"&gt;Upgrade Guide and Changelog&lt;a class="headerlink" href="#upgrade-guide-and-changelog" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;As always, upgrading to 52.0.0 should be straightforward for most users. Please review the
&lt;a href="https://datafusion.apache.org/library-user-guide/upgrading.html"&gt;Upgrade Guide&lt;/a&gt;
for details on breaking changes and code snippets to help with the transition.
For a comprehensive list of all changes, please refer to the &lt;a href="https://github.com/apache/datafusion/blob/branch-52/dev/changelog/52.0.0.md"&gt;changelog&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="about-datafusion"&gt;About DataFusion&lt;a class="headerlink" href="#about-datafusion" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; is an extensible query engine, written in &lt;a href="https://www.rust-lang.org/"&gt;Rust&lt;/a&gt;, that uses
&lt;a href="https://arrow.apache.org"&gt;Apache Arrow&lt;/a&gt; as its in-memory format. DataFusion is used by developers to
create new, fast, data-centric systems such as databases, dataframe libraries,
and machine learning and streaming applications. While &lt;a href="https://datafusion.apache.org/user-guide/introduction.html#project-goals"&gt;DataFusion's primary
design goal&lt;/a&gt; is to accelerate the creation of other data-centric systems, it
provides a reasonable experience directly out of the box as a &lt;a href="https://datafusion.apache.org/user-guide/dataframe.html"&gt;dataframe
library&lt;/a&gt;, &lt;a href="https://datafusion.apache.org/python/"&gt;Python library&lt;/a&gt;, and &lt;a href="https://datafusion.apache.org/user-guide/cli/"&gt;command-line SQL tool&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="how-to-get-involved"&gt;How to Get Involved&lt;a class="headerlink" href="#how-to-get-involved" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;DataFusion is not a project built or driven by a single person, company, or
foundation. Rather, our community of users and contributors works together to
build a shared technology that none of us could have built alone.&lt;/p&gt;
&lt;p&gt;If you are interested in joining us, we would love to have you. You can try out
DataFusion on some of your own data and projects and let us know how it goes,
contribute suggestions, documentation, bug reports, or a PR with documentation,
tests, or code. A list of open issues suitable for beginners is &lt;a href="https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22"&gt;here&lt;/a&gt;, and you
can find out how to reach us on the &lt;a href="https://datafusion.apache.org/contributor-guide/communication.html"&gt;communication doc&lt;/a&gt;.&lt;/p&gt;</content><category term="blog"/></entry><entry><title>Apache DataFusion Comet 0.12.0 Release</title><link href="https://datafusion.apache.org/blog/2025/12/04/datafusion-comet-0.12.0" rel="alternate"/><published>2025-12-04T00:00:00+00:00</published><updated>2025-12-04T00:00:00+00:00</updated><author><name>pmc</name></author><id>tag:datafusion.apache.org,2025-12-04:/blog/2025/12/04/datafusion-comet-0.12.0</id><summary type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;p&gt;The Apache DataFusion PMC is pleased to announce version 0.12.0 of the &lt;a href="https://datafusion.apache.org/comet/"&gt;Comet&lt;/a&gt; subproject.&lt;/p&gt;
&lt;p&gt;Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
improved performance and efficiency without requiring any code changes.&lt;/p&gt;
&lt;p&gt;This release covers approximately four weeks of development …&lt;/p&gt;</summary><content type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;p&gt;The Apache DataFusion PMC is pleased to announce version 0.12.0 of the &lt;a href="https://datafusion.apache.org/comet/"&gt;Comet&lt;/a&gt; subproject.&lt;/p&gt;
&lt;p&gt;Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
improved performance and efficiency without requiring any code changes.&lt;/p&gt;
&lt;p&gt;This release covers approximately four weeks of development work and is the result of merging 105 PRs from 13
contributors. See the &lt;a href="https://github.com/apache/datafusion-comet/blob/main/dev/changelog/0.12.0.md"&gt;change log&lt;/a&gt; for more information.&lt;/p&gt;
&lt;h2 id="release-highlights"&gt;Release Highlights&lt;a class="headerlink" href="#release-highlights" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id="experimental-native-apache-iceberg-scan-support"&gt;Experimental Native Apache Iceberg Scan Support&lt;a class="headerlink" href="#experimental-native-apache-iceberg-scan-support" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Comet has a new, experimental, &lt;a href="https://github.com/apache/datafusion-comet/pull/2528"&gt;native Iceberg scan&lt;/a&gt;. This work relies on &lt;a href="https://github.com/apache/iceberg-rust"&gt;iceberg-rust&lt;/a&gt; and the Parquet reader from &lt;a href="https://github.com/apache/arrow-rs"&gt;arrow-rs&lt;/a&gt; that Comet already uses to great effect. Comet’s &lt;a href="https://datafusion.apache.org/comet/user-guide/0.12/iceberg.html"&gt;existing Iceberg integration&lt;/a&gt; relies on a modified Iceberg Java build to accelerate Parquet decoding. This new approach allows unmodified Iceberg Java to handle query planning (&lt;em&gt;i.e.&lt;/em&gt;, catalog access, partition pruning, etc.), then Comet serializes Iceberg &lt;code&gt;FileScanTask&lt;/code&gt; objects directly to iceberg-rust, enabling native execution of Iceberg table scans through DataFusion.&lt;/p&gt;
&lt;p&gt;This represents a significant step forward in Comet's support for data lakehouse architectures and expands the range of workloads that can benefit from native acceleration. Please take a look at the PR and Comet’s documentation to understand the current limitations and try it on your workloads! We are eager for feedback on this approach.&lt;/p&gt;
&lt;h3 id="code-architecture-improvements"&gt;Code Architecture Improvements&lt;a class="headerlink" href="#code-architecture-improvements" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;This release includes significant refactoring to improve code maintainability and extensibility, and we will continue those efforts into 0.13.0 development:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Unified operator serialization&lt;/strong&gt;: The &lt;a href="https://github.com/apache/datafusion-comet/pull/2768"&gt;CometExecRule refactor&lt;/a&gt; unifies CometNativeExec creation with serialization through the new &lt;code&gt;CometOperatorSerde&lt;/code&gt; trait&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Expression serde refactoring&lt;/strong&gt;: Multiple PRs (&lt;a href="https://github.com/apache/datafusion-comet/pull/2738"&gt;#2738&lt;/a&gt;, &lt;a href="https://github.com/apache/datafusion-comet/pull/2741"&gt;#2741&lt;/a&gt;, &lt;a href="https://github.com/apache/datafusion-comet/pull/2791"&gt;#2791&lt;/a&gt;) moved expression serialization logic out of &lt;code&gt;QueryPlanSerde&lt;/code&gt; into specialized traits&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Aggregate expression improvements&lt;/strong&gt;: &lt;a href="https://github.com/apache/datafusion-comet/pull/2777"&gt;Added getSupportLevel to CometAggregateExpressionSerde trait&lt;/a&gt; for better aggregate function handling&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These architectural improvements make it easier for contributors to add new operators and expressions while reducing code complexity.&lt;/p&gt;
&lt;h3 id="new-sql-functions"&gt;New SQL Functions&lt;a class="headerlink" href="#new-sql-functions" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The following SQL functions are now supported:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2604"&gt;&lt;code&gt;concat&lt;/code&gt;&lt;/a&gt; - String concatenation&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2689"&gt;&lt;code&gt;abs&lt;/code&gt;&lt;/a&gt; - Absolute value&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2471"&gt;&lt;code&gt;sha1&lt;/code&gt;&lt;/a&gt; - SHA-1 hash function&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2755"&gt;&lt;code&gt;cot&lt;/code&gt;&lt;/a&gt; - Cotangent function&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2784"&gt;Hyperbolic trigonometric functions&lt;/a&gt; - sinh, cosh, tanh, and their inverse functions&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="new-operators"&gt;New Operators&lt;a class="headerlink" href="#new-operators" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2735"&gt;&lt;code&gt;CometLocalTableScanExec&lt;/code&gt;&lt;/a&gt; - Native support for local table scans, eliminating fallback to Spark for small, in-memory datasets&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="configuration-and-usability-improvements"&gt;Configuration and Usability Improvements&lt;a class="headerlink" href="#configuration-and-usability-improvements" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Simplified on-heap configuration&lt;/strong&gt;: &lt;a href="https://github.com/apache/datafusion-comet/pull/2599"&gt;Simplified on-heap memory configuration&lt;/a&gt; for easier setup&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Extended explain format&lt;/strong&gt;: &lt;a href="https://github.com/apache/datafusion-comet/pull/2644"&gt;Renamed and improved COMET_EXTENDED_EXPLAIN_FORMAT&lt;/a&gt; with better defaults&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Environment variable support&lt;/strong&gt;: &lt;a href="https://github.com/apache/datafusion-comet/pull/2722"&gt;Improved framework for setting configs with environment variables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Native config passing&lt;/strong&gt;: &lt;a href="https://github.com/apache/datafusion-comet/pull/2801"&gt;All Comet configs now passed to native plan&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Config categorization&lt;/strong&gt;: &lt;a href="https://github.com/apache/datafusion-comet/pull/2740"&gt;Categorized testing configs&lt;/a&gt; and added notes about known timezone issues&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Removed legacy configs&lt;/strong&gt;: &lt;a href="https://github.com/apache/datafusion-comet/pull/2786"&gt;Removed COMET_EXPR_ALLOW_INCOMPATIBLE config&lt;/a&gt; to simplify configuration&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="bug-fixes"&gt;Bug Fixes&lt;a class="headerlink" href="#bug-fixes" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;This release includes numerous bug fixes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2606"&gt;Fixed None.get in stringDecode&lt;/a&gt; when binary child cannot be converted&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2630"&gt;Proper fallback for lpad/rpad with unsupported arguments&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2634"&gt;Fixed trunc/date_trunc with unsupported format strings&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2675"&gt;Corrected single partition handling in native_datafusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2687"&gt;Fixed LeftSemi join handling&lt;/a&gt; - do not replace SMJ with HJ&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2718"&gt;Fixed CometLiteral class cast exception with arrays&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2716"&gt;Fixed missing SortOrder fallback reason in range partitioning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2728"&gt;Improved checkSparkMaybeThrows to compare results in success case&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2643"&gt;Fixed null handling in CometVector implementations&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="documentation-improvements"&gt;Documentation Improvements&lt;a class="headerlink" href="#documentation-improvements" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2668"&gt;Added FFI documentation to contributor guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2704"&gt;Updated contributor guide for adding new expressions&lt;/a&gt; and &lt;a href="https://github.com/apache/datafusion-comet/pull/2758"&gt;operators&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2587"&gt;Improved documentation layout&lt;/a&gt; and &lt;a href="https://github.com/apache/datafusion-comet/pull/2597"&gt;navigation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2783"&gt;Added prettier enforcement&lt;/a&gt; for consistent markdown formatting&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2779"&gt;CI check to ensure generated docs are in sync&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Various documentation updates for &lt;a href="https://github.com/apache/datafusion-comet/pull/2694"&gt;SortOrder expressions&lt;/a&gt;, &lt;a href="https://github.com/apache/datafusion-comet/pull/2742"&gt;LocalTableScan and WindowExec&lt;/a&gt;, and &lt;a href="https://github.com/apache/datafusion-comet/pull/2712"&gt;Spark SQL tests&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="dependency-updates"&gt;Dependency Updates&lt;a class="headerlink" href="#dependency-updates" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2574"&gt;Upgraded to Spark 3.5.7&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2605"&gt;Upgraded to DataFusion 50.3.0&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2608"&gt;Upgraded Parquet from 56.0.0 to 56.2.0&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Various other dependency updates via Dependabot&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="spark-compatibility"&gt;Spark Compatibility&lt;a class="headerlink" href="#spark-compatibility" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Spark 3.4.3 with JDK 11 &amp;amp; 17, Scala 2.12 &amp;amp; 2.13&lt;/li&gt;
&lt;li&gt;Spark 3.5.4 through 3.5.7 with JDK 11 &amp;amp; 17, Scala 2.12 &amp;amp; 2.13&lt;/li&gt;
&lt;li&gt;Spark 4.0.1 with JDK 17, Scala 2.13&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We are looking for help from the community to fully support Spark 4.0.1. See &lt;a href="https://github.com/apache/datafusion-comet/issues/1637"&gt;EPIC: Support 4.0.0&lt;/a&gt; for more information.&lt;/p&gt;
&lt;h2 id="getting-involved"&gt;Getting Involved&lt;a class="headerlink" href="#getting-involved" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The Comet project welcomes new contributors. We use the same &lt;a href="https://datafusion.apache.org/contributor-guide/communication.html#slack-and-discord"&gt;Slack and Discord&lt;/a&gt; channels as the main DataFusion
project and have a weekly &lt;a href="https://docs.google.com/document/d/1NBpkIAuU7O9h8Br5CbFksDhX-L9TyO9wmGLPMe0Plc8/edit?usp=sharing"&gt;DataFusion video call&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The easiest way to get involved is to test Comet with your current Spark jobs and file issues for any bugs or
performance regressions that you find. See the &lt;a href="https://datafusion.apache.org/comet/user-guide/installation.html"&gt;Getting Started&lt;/a&gt; guide for instructions on downloading and installing
Comet.&lt;/p&gt;
&lt;p&gt;There are also many &lt;a href="https://github.com/apache/datafusion-comet/contribute"&gt;good first issues&lt;/a&gt; waiting for contributions.&lt;/p&gt;</content><category term="blog"/></entry><entry><title>Apache DataFusion 51.0.0 Released</title><link href="https://datafusion.apache.org/blog/2025/11/25/datafusion-51.0.0" rel="alternate"/><published>2025-11-25T00:00:00+00:00</published><updated>2025-11-25T00:00:00+00:00</updated><author><name>pmc</name></author><id>tag:datafusion.apache.org,2025-11-25:/blog/2025/11/25/datafusion-51.0.0</id><summary type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;h2 id="introduction"&gt;Introduction&lt;a class="headerlink" href="#introduction" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We are proud to announce the release of &lt;a href="https://crates.io/crates/datafusion/51.0.0"&gt;DataFusion 51.0.0&lt;/a&gt;. This post highlights
some of the major improvements since &lt;a href="https://datafusion.apache.org/blog/2025/09/29/datafusion-50.0.0/"&gt;DataFusion 50.0.0&lt;/a&gt;. The complete list of
changes is available in the &lt;a href="https://github.com/apache/datafusion/blob/branch-51/dev/changelog/51.0.0.md"&gt;changelog&lt;/a&gt;. Thanks to the &lt;a href="https://github.com/apache/datafusion/blob/branch-51/dev/changelog/51.0.0.md#credits"&gt;128 contributors&lt;/a&gt; for
making this release possible.&lt;/p&gt;
&lt;h2 id="performance-improvements"&gt;Performance Improvements 🚀&lt;a class="headerlink" href="#performance-improvements" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We continue …&lt;/p&gt;</summary><content type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;h2 id="introduction"&gt;Introduction&lt;a class="headerlink" href="#introduction" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We are proud to announce the release of &lt;a href="https://crates.io/crates/datafusion/51.0.0"&gt;DataFusion 51.0.0&lt;/a&gt;. This post highlights
some of the major improvements since &lt;a href="https://datafusion.apache.org/blog/2025/09/29/datafusion-50.0.0/"&gt;DataFusion 50.0.0&lt;/a&gt;. The complete list of
changes is available in the &lt;a href="https://github.com/apache/datafusion/blob/branch-51/dev/changelog/51.0.0.md"&gt;changelog&lt;/a&gt;. Thanks to the &lt;a href="https://github.com/apache/datafusion/blob/branch-51/dev/changelog/51.0.0.md#credits"&gt;128 contributors&lt;/a&gt; for
making this release possible.&lt;/p&gt;
&lt;h2 id="performance-improvements"&gt;Performance Improvements 🚀&lt;a class="headerlink" href="#performance-improvements" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We continue to make significant performance improvements in DataFusion, both in
the core engine and in the Parquet reader.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Performance over time" class="img-fluid" src="/blog/images/datafusion-51.0.0/performance_over_time_clickbench.png" width="100%"/&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Figure 1&lt;/strong&gt;: Average and median normalized query execution times for ClickBench queries for DataFusion 51.0.0 compared to previous releases.
Query times are normalized using the ClickBench definition. See the
&lt;a href="https://alamb.github.io/datafusion-benchmarking/"&gt;DataFusion Benchmarking Page&lt;/a&gt;
for more details.&lt;/p&gt;
&lt;h3 id="faster-case-expression-evaluation"&gt;Faster &lt;code&gt;CASE&lt;/code&gt; expression evaluation&lt;a class="headerlink" href="#faster-case-expression-evaluation" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;This release builds on the &lt;a href="https://github.com/apache/datafusion/issues/18075"&gt;CASE performance epic&lt;/a&gt; with significant improvements.
Expressions short‑circuit earlier, reuse partial results, and avoid unnecessary
scattering, speeding up common ETL patterns. Thanks to &lt;a href="https://github.com/pepijnve"&gt;pepijnve&lt;/a&gt;, &lt;a href="https://github.com/chenkovsky"&gt;chenkovsky&lt;/a&gt;,
and &lt;a href="https://github.com/petern48"&gt;petern48&lt;/a&gt; for leading this effort. You can find more details in the
&lt;a href="/blog/2026/02/02/datafusion_case/"&gt;Optimizing SQL CASE Expression Evaluation&lt;/a&gt; blog post.&lt;/p&gt;
&lt;h3 id="better-defaults-for-remote-parquet-reads"&gt;Better Defaults for Remote Parquet Reads&lt;a class="headerlink" href="#better-defaults-for-remote-parquet-reads" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;By default, DataFusion now always fetches the last 512KB (configurable) of &lt;a href="https://parquet.apache.org/"&gt;Apache Parquet&lt;/a&gt; files,
which usually includes the footer and metadata (&lt;a href="https://github.com/apache/datafusion/issues/18118"&gt;#18118&lt;/a&gt;). This
change typically avoids two I/O requests for each Parquet file. While this
setting has existed in DataFusion for many years, it was not previously enabled
by default. Users can tune the number of bytes fetched in the initial I/O
request via the &lt;code&gt;datafusion.execution.parquet.metadata_size_hint&lt;/code&gt; &lt;a href="https://datafusion.apache.org/user-guide/configs.html"&gt;config setting&lt;/a&gt;. Thanks to
&lt;a href="https://github.com/zhuqi-lucas"&gt;zhuqi-lucas&lt;/a&gt; for leading this effort.&lt;/p&gt;
&lt;h3 id="faster-parquet-metadata-parsing"&gt;Faster Parquet metadata parsing&lt;a class="headerlink" href="#faster-parquet-metadata-parsing" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;DataFusion 51 also includes the latest Parquet reader from
&lt;a href="https://arrow.apache.org/blog/2025/10/30/arrow-rs-57.0.0/"&gt;Arrow Rust 57.0.0&lt;/a&gt;, which parses Parquet metadata significantly faster. This is
especially beneficial for workloads with many small Parquet files and scenarios
where startup time or low latency is important. You can read more about the upstream work by
&lt;a href="https://github.com/etseidl"&gt;etseidl&lt;/a&gt; and &lt;a href="https://github.com/jhorstmann"&gt;jhorstmann&lt;/a&gt; that enabled these improvements in the &lt;a href="https://arrow.apache.org/blog/2025/10/23/rust-parquet-metadata/"&gt;Faster Apache Parquet Footer Metadata Using a Custom Thrift Parser&lt;/a&gt; blog.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Metadata Parsing Performance Improvements in Arrow/Parquet 57" class="img-fluid" src="/blog/images/datafusion-51.0.0/arrow-57-metadata-parsing.png" width="100%"/&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Figure 2&lt;/strong&gt;: Metadata parsing performance improvements in Arrow/Parquet 57.0.0. &lt;/p&gt;
&lt;h2 id="new-features"&gt;New Features ✨&lt;a class="headerlink" href="#new-features" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id="decimal32decimal64-support"&gt;Decimal32/Decimal64 support&lt;a class="headerlink" href="#decimal32decimal64-support" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The new Arrow types &lt;code&gt;Decimal32&lt;/code&gt; and &lt;code&gt;Decimal64&lt;/code&gt; are now supported in DataFusion
(&lt;a href="https://github.com/apache/datafusion/pull/17501"&gt;#17501&lt;/a&gt;), including aggregations such as &lt;code&gt;SUM&lt;/code&gt;, &lt;code&gt;AVG&lt;/code&gt;, &lt;code&gt;MIN/MAX&lt;/code&gt;, and window
functions. Thanks to &lt;a href="https://github.com/AdamGS"&gt;AdamGS&lt;/a&gt; for leading this effort.&lt;/p&gt;
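&lt;p&gt;One way to try the new types from SQL is &lt;code&gt;arrow_cast&lt;/code&gt;. The type-name strings
below are a sketch that assumes the same naming convention as the existing
&lt;code&gt;Decimal128&lt;/code&gt; type:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;-- Cast literals to the new narrow decimal types
SELECT arrow_cast(1.25, 'Decimal32(5, 2)') AS d32,
       arrow_cast(1.25, 'Decimal64(10, 2)') AS d64;
&lt;/code&gt;&lt;/pre&gt;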
&lt;h3 id="sql-pipe-operators"&gt;SQL Pipe Operators&lt;a class="headerlink" href="#sql-pipe-operators" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;DataFusion now supports the SQL pipe operator syntax
(&lt;a href="https://github.com/apache/datafusion/pull/17278"&gt;#17278&lt;/a&gt;), enabling inline transforms such as:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;SELECT * FROM t
|&amp;gt; WHERE a &amp;gt; 10
|&amp;gt; ORDER BY b
|&amp;gt; LIMIT 5;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This syntax, &lt;a href="https://docs.cloud.google.com/bigquery/docs/reference/standard-sql/pipe-syntax"&gt;popularized by Google BigQuery&lt;/a&gt;, keeps multi-step transformations concise while preserving regular
SQL semantics. Thanks to &lt;a href="https://github.com/simonvandel"&gt;simonvandel&lt;/a&gt; for leading this effort.&lt;/p&gt;
&lt;h3 id="io-profiling-in-datafusion-cli"&gt;I/O Profiling in &lt;code&gt;datafusion-cli&lt;/code&gt;&lt;a class="headerlink" href="#io-profiling-in-datafusion-cli" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;a href="https://datafusion.apache.org/user-guide/cli/"&gt;datafusion-cli&lt;/a&gt; now has built-in instrumentation to trace object store calls
(&lt;a href="https://github.com/apache/datafusion/issues/17207"&gt;#17207&lt;/a&gt;). Toggle profiling
with the &lt;a href="https://datafusion.apache.org/user-guide/cli/usage.html#commands"&gt;\object_store_profiling command&lt;/a&gt; and inspect the exact &lt;code&gt;GET&lt;/code&gt;/&lt;code&gt;LIST&lt;/code&gt; requests issued during
query execution:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;DataFusion CLI v51.0.0
&amp;gt; \object_store_profiling trace
ObjectStore Profile mode set to Trace
&amp;gt; select count(*) from 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';
+----------+
| count(*) |
+----------+
| 1000000  |
+----------+
1 row(s) fetched.
Elapsed 0.367 seconds.

Object Store Profiling
Instrumented Object Store: instrument_mode: Trace, inner: HttpStore
2025-11-19T21:10:43.476121+00:00 operation=Head duration=0.069763s path=hits_compatible/athena_partitioned/hits_1.parquet
2025-11-19T21:10:43.545903+00:00 operation=Head duration=0.025859s path=hits_compatible/athena_partitioned/hits_1.parquet
2025-11-19T21:10:43.571768+00:00 operation=Head duration=0.025684s path=hits_compatible/athena_partitioned/hits_1.parquet
2025-11-19T21:10:43.597463+00:00 operation=Get duration=0.034194s size=524288 range: bytes=174440756-174965043 path=hits_compatible/athena_partitioned/hits_1.parquet
2025-11-19T21:10:43.705821+00:00 operation=Head duration=0.022029s path=hits_compatible/athena_partitioned/hits_1.parquet

Summaries:
+-----------+----------+-----------+-----------+-----------+-----------+-------+
| Operation | Metric   | min       | max       | avg       | sum       | count |
+-----------+----------+-----------+-----------+-----------+-----------+-------+
| Get       | duration | 0.034194s | 0.034194s | 0.034194s | 0.034194s | 1     |
| Get       | size     | 524288 B  | 524288 B  | 524288 B  | 524288 B  | 1     |
| Head      | duration | 0.022029s | 0.069763s | 0.035834s | 0.143335s | 4     |
| Head      | size     |           |           |           |           | 4     |
+-----------+----------+-----------+-----------+-----------+-----------+-------+
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This makes it far easier to diagnose slow remote scans and validate caching
strategies. Thanks to &lt;a href="https://github.com/BlakeOrth"&gt;BlakeOrth&lt;/a&gt; for leading this effort.&lt;/p&gt;
&lt;h3 id="describe-query"&gt;&lt;code&gt;DESCRIBE &amp;lt;query&amp;gt;&lt;/code&gt;&lt;a class="headerlink" href="#describe-query" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;DESCRIBE&lt;/code&gt; now works on arbitrary queries, returning the schema instead
of being an alias for &lt;code&gt;EXPLAIN&lt;/code&gt; (&lt;a href="https://github.com/apache/datafusion/issues/18234"&gt;#18234&lt;/a&gt;). This brings DataFusion in line with engines
like DuckDB and makes it easy to inspect the output schema of queries
without executing them. Thanks to &lt;a href="https://github.com/djanderson"&gt;djanderson&lt;/a&gt; for leading this effort.&lt;/p&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;DataFusion CLI v51.0.0
&amp;gt; create table t(a int, b varchar, c float) as values (1, 'a', 2.0);
0 row(s) fetched.
Elapsed 0.002 seconds.

&amp;gt; DESCRIBE SELECT a, b, SUM(c) FROM t GROUP BY a, b;

+-------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-------------+-----------+-------------+
| a           | Int32     | YES         |
| b           | Utf8View  | YES         |
| sum(t.c)    | Float64   | YES         |
+-------------+-----------+-------------+
3 row(s) fetched.
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id="named-arguments-in-sql-functions"&gt;Named arguments in SQL functions&lt;a class="headerlink" href="#named-arguments-in-sql-functions" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;DataFusion now understands &lt;a href="https://www.postgresql.org/docs/current/sql-syntax-calling-funcs.html"&gt;PostgreSQL-style named arguments&lt;/a&gt; (&lt;code&gt;param =&amp;gt; value&lt;/code&gt;)
for scalar, aggregate, and window functions (&lt;a href="https://github.com/apache/datafusion/issues/17379"&gt;#17379&lt;/a&gt;). You can mix positional and named
arguments in any order, and error messages now list parameter names to make
diagnostics clearer. UDF authors can also expose parameter names so their
functions benefit from the same syntax. Thanks to &lt;a href="https://github.com/timsaucer"&gt;timsaucer&lt;/a&gt; and &lt;a href="https://github.com/bubulalabu"&gt;bubulalabu&lt;/a&gt; for leading this effort.&lt;/p&gt;
&lt;p&gt;For example, you can pass arguments to functions like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;SELECT power(exponent =&amp;gt; 3.0, base =&amp;gt; 2.0);
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id="metrics-improvements"&gt;Metrics improvements&lt;a class="headerlink" href="#metrics-improvements" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The output of &lt;a href="https://datafusion.apache.org/user-guide/sql/explain.html#explain-analyze"&gt;EXPLAIN ANALYZE&lt;/a&gt; has been improved to include more metrics
about execution time and memory usage of each operator (&lt;a href="https://github.com/apache/datafusion/issues/18217"&gt;#18217&lt;/a&gt;).
You can learn more about these new metrics in the &lt;a href="https://datafusion.apache.org/user-guide/metrics.html"&gt;metrics user guide&lt;/a&gt;. Thanks to
&lt;a href="https://github.com/2010YOUY01"&gt;2010YOUY01&lt;/a&gt; for leading this effort.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;51.0.0&lt;/code&gt; release adds:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Configuration&lt;/strong&gt;: adds a new option &lt;code&gt;datafusion.explain.analyze_level&lt;/code&gt;, which can be set to &lt;code&gt;summary&lt;/code&gt; for a concise output or &lt;code&gt;dev&lt;/code&gt; for the full set of metrics (the previous default).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;For all major operators&lt;/strong&gt;: adds &lt;code&gt;output_bytes&lt;/code&gt;, reporting how many bytes of data each operator produces.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;FilterExec&lt;/strong&gt;: adds a &lt;code&gt;selectivity&lt;/code&gt; metric (&lt;code&gt;output_rows / input_rows&lt;/code&gt;) to show how effective the filter is.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AggregateExec&lt;/strong&gt;:&lt;ul&gt;
&lt;li&gt;adds detailed timing metrics for group-ID computation, aggregate argument evaluation, aggregation work, and emitting final results.&lt;/li&gt;
&lt;li&gt;adds a &lt;code&gt;reduction_factor&lt;/code&gt; metric (&lt;code&gt;output_rows / input_rows&lt;/code&gt;) to show how much grouping reduces the data.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;NestedLoopJoinExec&lt;/strong&gt;: adds a &lt;code&gt;selectivity&lt;/code&gt; metric (&lt;code&gt;output_rows / (left_rows * right_rows)&lt;/code&gt;) to show how many combinations actually pass the join condition.&lt;/li&gt;
&lt;li&gt;Several display formatting improvements were added to make &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; output easier to read.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For example, the following query:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;set datafusion.explain.analyze_level = summary

explain analyze 
select count(*) 
from 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet' 
where "URL" &amp;lt;&amp;gt; '';
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now shows easier-to-understand metrics such as:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-text"&gt; metrics=[
   output_rows=1000000, 
   elapsed_compute=16ns, 
   output_bytes=222.5 MB, 
   files_ranges_pruned_statistics=16 total → 16 matched, 
   row_groups_pruned_statistics=3 total → 3 matched, 
   row_groups_pruned_bloom_filter=3 total → 3 matched, 
   page_index_rows_pruned=0 total → 0 matched,
   bytes_scanned=33661364,
   metadata_load_time=4.243098ms, 
]
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id="upgrade-guide-and-changelog"&gt;Upgrade Guide and Changelog&lt;a class="headerlink" href="#upgrade-guide-and-changelog" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Upgrading to 51.0.0 should be straightforward for most users. Please review the
&lt;a href="https://datafusion.apache.org/library-user-guide/upgrading.html"&gt;Upgrade Guide&lt;/a&gt;
for details on breaking changes and code snippets to help with the transition.
For a comprehensive list of all changes, please refer to the &lt;a href="https://github.com/apache/datafusion/blob/branch-51/dev/changelog/51.0.0.md"&gt;changelog&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="about-datafusion"&gt;About DataFusion&lt;a class="headerlink" href="#about-datafusion" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; is an extensible query engine, written in &lt;a href="https://www.rust-lang.org/"&gt;Rust&lt;/a&gt;, that uses
&lt;a href="https://arrow.apache.org"&gt;Apache Arrow&lt;/a&gt; as its in-memory format. DataFusion is used by developers to
create new, fast, data-centric systems such as databases, dataframe libraries,
and machine learning and streaming applications. While &lt;a href="https://datafusion.apache.org/user-guide/introduction.html#project-goals"&gt;DataFusion’s primary
design goal&lt;/a&gt; is to accelerate the creation of other data-centric systems, it
provides a reasonable experience directly out of the box as a &lt;a href="https://datafusion.apache.org/user-guide/dataframe.html"&gt;dataframe
library&lt;/a&gt;, &lt;a href="https://datafusion.apache.org/python/"&gt;Python library&lt;/a&gt;, and &lt;a href="https://datafusion.apache.org/user-guide/cli/"&gt;command-line SQL tool&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;DataFusion's core thesis is that, as a community, together we can build much
more advanced technology than any of us as individuals or companies could build
alone. Without DataFusion, highly performant vectorized query engines would
remain the domain of a few large companies and world-class research
institutions. With DataFusion, we can all build on top of a shared foundation
and focus on what makes our projects unique.&lt;/p&gt;
&lt;h2 id="how-to-get-involved"&gt;How to Get Involved&lt;a class="headerlink" href="#how-to-get-involved" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;DataFusion is not a project built or driven by a single person, company, or
foundation. Rather, our community of users and contributors works together to
build a shared technology that none of us could have built alone.&lt;/p&gt;
&lt;p&gt;If you are interested in joining us, we would love to have you. You can try out
DataFusion on some of your own data and projects and let us know how it goes,
contribute suggestions, documentation, bug reports, or a PR with documentation,
tests, or code. A list of open issues suitable for beginners is &lt;a href="https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22"&gt;here&lt;/a&gt;, and you
can find out how to reach us on the &lt;a href="https://datafusion.apache.org/contributor-guide/communication.html"&gt;communication doc&lt;/a&gt;.&lt;/p&gt;</content><category term="blog"/></entry><entry><title>Apache DataFusion Comet 0.11.0 Release</title><link href="https://datafusion.apache.org/blog/2025/10/21/datafusion-comet-0.11.0" rel="alternate"/><published>2025-10-21T00:00:00+00:00</published><updated>2025-10-21T00:00:00+00:00</updated><author><name>pmc</name></author><id>tag:datafusion.apache.org,2025-10-21:/blog/2025/10/21/datafusion-comet-0.11.0</id><summary type="html">&lt;!--
--&gt;

&lt;p&gt;The Apache DataFusion PMC is pleased to announce version 0.11.0 of the &lt;a href="https://datafusion.apache.org/comet/"&gt;Comet&lt;/a&gt; subproject.&lt;/p&gt;
&lt;p&gt;Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
improved performance and efficiency without requiring any code changes.&lt;/p&gt;
&lt;p&gt;This release covers approximately five weeks of development …&lt;/p&gt;</summary><content type="html">&lt;!--
--&gt;

&lt;p&gt;The Apache DataFusion PMC is pleased to announce version 0.11.0 of the &lt;a href="https://datafusion.apache.org/comet/"&gt;Comet&lt;/a&gt; subproject.&lt;/p&gt;
&lt;p&gt;Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
improved performance and efficiency without requiring any code changes.&lt;/p&gt;
&lt;p&gt;This release covers approximately five weeks of development work and is the result of merging 131 PRs from 15
contributors. See the &lt;a href="https://github.com/apache/datafusion-comet/blob/main/dev/changelog/0.11.0.md"&gt;change log&lt;/a&gt; for more information.&lt;/p&gt;
&lt;h2 id="release-highlights"&gt;Release Highlights&lt;a class="headerlink" href="#release-highlights" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id="parquet-modular-encryption-support"&gt;Parquet Modular Encryption Support&lt;a class="headerlink" href="#parquet-modular-encryption-support" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Spark supports Parquet Modular Encryption to independently encrypt column values and metadata. Furthermore, Spark supports custom encryption factories for users to provide their own key-management service (KMS) implementations. Thanks to &lt;a href="https://github.com/apache/arrow-rs/issues/7278"&gt;a&lt;/a&gt; &lt;a href="https://github.com/apache/datafusion/issues/15216"&gt;number&lt;/a&gt; &lt;a href="https://github.com/apache/datafusion/pull/16351"&gt;of&lt;/a&gt; &lt;a href="https://github.com/apache/datafusion/pull/16779"&gt;contributions&lt;/a&gt; in upstream DataFusion and arrow-rs, Comet now &lt;a href="https://github.com/apache/datafusion-comet/pull/2447"&gt;supports Parquet Modular Encryption with Spark KMS&lt;/a&gt; for native readers, enabling secure reading of encrypted Parquet files in production environments.&lt;/p&gt;
&lt;h3 id="improved-memory-management"&gt;Improved Memory Management&lt;a class="headerlink" href="#improved-memory-management" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Comet 0.11.0 introduces significant improvements to memory management, making it easier to deploy and more resilient to out-of-memory conditions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Changed default memory pool&lt;/strong&gt;: The default off-heap memory pool has been &lt;a href="https://github.com/apache/datafusion-comet/pull/2526"&gt;changed from &lt;code&gt;greedy_unified&lt;/code&gt; to &lt;code&gt;fair_unified&lt;/code&gt;&lt;/a&gt;, providing better memory fairness across operations&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Off-heap deployment recommended&lt;/strong&gt;: To simplify configuration and improve performance, Comet now expects to be deployed with Spark's off-heap memory configuration. &lt;a href="https://github.com/apache/datafusion-comet/pull/2554"&gt;On-heap memory is still available&lt;/a&gt; for development and debugging, but is not recommended for deployment&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Better disk management&lt;/strong&gt;: The &lt;a href="https://github.com/apache/datafusion-comet/pull/2479"&gt;DiskManager &lt;code&gt;max_temp_directory_size&lt;/code&gt; is now configurable&lt;/a&gt; for better control over temporary disk usage&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Enhanced safety&lt;/strong&gt;: Memory pool operations now &lt;a href="https://github.com/apache/datafusion-comet/pull/2455"&gt;use checked arithmetic operations&lt;/a&gt; to prevent overflow issues&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These changes make Comet significantly easier to configure and deploy in production environments.&lt;/p&gt;
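&lt;p&gt;A minimal off-heap deployment might look like the sketch below. The memory size is a
placeholder, and the memory pool option name should be verified against the Comet tuning
guide for your version:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-text"&gt;spark-submit \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=8g \
  --conf spark.comet.exec.memoryPool=fair_unified \
  ...
&lt;/code&gt;&lt;/pre&gt;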
&lt;h3 id="improved-apache-spark-40-support"&gt;Improved Apache Spark 4.0 Support&lt;a class="headerlink" href="#improved-apache-spark-40-support" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Comet has improved its support for Apache Spark 4.0.1 with several important enhancements:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2414"&gt;Updated support from Spark 4.0.0 to Spark 4.0.1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2514"&gt;Spark 4.0 is now included in the release build script&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Expanded ANSI mode compatibility with several new implementations:&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2136"&gt;ANSI evaluation mode arithmetic operations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2421"&gt;ANSI mode integral divide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2542"&gt;ANSI mode rounding functions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2556"&gt;ANSI mode remainder function&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Spark 4.0 compatible jar files are now available on Maven Central. See the &lt;a href="https://datafusion.apache.org/comet/user-guide/0.11/installation.html#using-a-published-jar-file"&gt;installation guide&lt;/a&gt; for instructions on using published jar files.&lt;/p&gt;
&lt;h3 id="complex-types-for-columnar-shuffle"&gt;Complex Types for Columnar Shuffle&lt;a class="headerlink" href="#complex-types-for-columnar-shuffle" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;a href="https://github.com/ashdnazg"&gt;ashdnazg&lt;/a&gt; submitted a &lt;a href="https://github.com/apache/datafusion-comet/pull/2571"&gt;fantastic refactoring PR&lt;/a&gt; that simplified the logic for writing rows in Comet’s JVM-based, columnar shuffle. A benefit of this refactoring is better support for complex types (&lt;em&gt;e.g.,&lt;/em&gt; structs, lists, and arrays) in columnar shuffle. Comet no longer falls back to Spark to shuffle these types, enabling native acceleration for queries involving nested data structures. This enhancement significantly expands the range of queries that can benefit from Comet's columnar shuffle implementation.&lt;/p&gt;
&lt;h3 id="rangepartitioning-for-native-shuffle"&gt;RangePartitioning for Native Shuffle&lt;a class="headerlink" href="#rangepartitioning-for-native-shuffle" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Comet's native shuffle now &lt;a href="https://github.com/apache/datafusion-comet/pull/2258"&gt;supports RangePartitioning&lt;/a&gt;, providing better performance for operations that require range-based data distribution. Comet now matches Spark behavior for computing and distributing range boundaries, and serializes them to native execution for faster shuffle operations.&lt;/p&gt;
&lt;h3 id="new-functionality"&gt;New Functionality&lt;a class="headerlink" href="#new-functionality" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The following SQL functions are now supported:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2411"&gt;&lt;code&gt;weekday&lt;/code&gt;&lt;/a&gt; - Extract day of week from date&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2102"&gt;&lt;code&gt;lpad&lt;/code&gt;&lt;/a&gt; - Left pad a string with column support for pad length&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2099"&gt;&lt;code&gt;rpad&lt;/code&gt;&lt;/a&gt; - Right pad a string with &lt;a href="https://github.com/apache/datafusion-comet/pull/2436"&gt;column support and additional character support&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2481"&gt;&lt;code&gt;reverse&lt;/code&gt;&lt;/a&gt; - Support for ArrayType input in addition to strings&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2429"&gt;&lt;code&gt;count(distinct)&lt;/code&gt;&lt;/a&gt; - Native support without falling back to Spark&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2466"&gt;&lt;code&gt;bit_get&lt;/code&gt;&lt;/a&gt; - Get bit value at position&lt;/li&gt;
&lt;/ul&gt;
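&lt;p&gt;As a quick illustration of a few of the newly supported functions (standard Spark SQL
semantics apply):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;SELECT weekday(date '2025-10-21'),  -- 1 (Monday = 0, so Tuesday = 1)
       lpad('7', 3, '0'),           -- '007'
       bit_get(11, 2);              -- 0 (bit 2 of 0b1011, counting from the right)
&lt;/code&gt;&lt;/pre&gt;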
&lt;p&gt;New expression capabilities include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2181"&gt;Nested array literal support&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2425"&gt;Array-to-string cast support&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2472"&gt;Spark-compatible cast from integral to decimal types&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2490"&gt;Support for decimal type to boolean cast&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2316"&gt;More date part expressions&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="performance-improvements"&gt;Performance Improvements&lt;a class="headerlink" href="#performance-improvements" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2417"&gt;Improved BroadcastExchangeExec conversion&lt;/a&gt; for better broadcast join performance&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2407"&gt;Use of DataFusion's native &lt;code&gt;count_udaf&lt;/code&gt;&lt;/a&gt; instead of &lt;code&gt;SUM(IF(expr IS NOT NULL, 1, 0))&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2402"&gt;New configuration from shared conf&lt;/a&gt; to reduce overhead&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2579"&gt;Buffered index writes&lt;/a&gt; to reduce system calls in shuffle operations&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="comet-0110-tpc-h-performance"&gt;Comet 0.11.0 TPC-H Performance&lt;a class="headerlink" href="#comet-0110-tpc-h-performance" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Comet 0.11.0 continues to deliver significant performance improvements over Spark. In our &lt;a href="https://github.com/apache/datafusion-comet/pull/2596"&gt;TPC-H benchmarks&lt;/a&gt;, Comet reduced overall query runtime from 687 seconds to 302 seconds when processing 100 GB of Parquet data using a single 8-core executor, achieving a &lt;strong&gt;2.2x speedup&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="TPC-H Overall Performance" class="img-fluid" src="/blog/images/comet-0.11.0/tpch_allqueries.png" width="100%"/&gt;&lt;/p&gt;
&lt;p&gt;The performance gains are consistent across individual queries, with most queries showing substantial improvements:&lt;/p&gt;
&lt;p&gt;&lt;img alt="TPC-H Query-by-Query Comparison" class="img-fluid" src="/blog/images/comet-0.11.0/tpch_queries_compare.png" width="100%"/&gt;&lt;/p&gt;
&lt;p&gt;You can reproduce these benchmarks using our &lt;a href="https://datafusion.apache.org/comet/contributor-guide/benchmarking.html"&gt;Comet Benchmarking Guide&lt;/a&gt;. We encourage you to run your own performance tests with your workloads.&lt;/p&gt;
&lt;h3 id="apache-iceberg-support"&gt;Apache Iceberg Support&lt;a class="headerlink" href="#apache-iceberg-support" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2386"&gt;Updated support for Apache Iceberg 1.9.1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2442"&gt;Additional Parquet-independent API improvements&lt;/a&gt; for Iceberg integration&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2510"&gt;Improved resource management&lt;/a&gt; in Iceberg reader instances&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="ux-improvements"&gt;UX Improvements&lt;a class="headerlink" href="#ux-improvements" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2412"&gt;Added plan conversion statistics to extended explain info&lt;/a&gt; for better observability&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2450"&gt;Improved fallback information&lt;/a&gt; to help users understand when and why Comet falls back to Spark&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2515"&gt;Added &lt;code&gt;backtrace&lt;/code&gt; feature&lt;/a&gt; to simplify enabling native backtraces in &lt;code&gt;CometNativeException&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2379"&gt;Native log level is now configurable&lt;/a&gt; via Comet configuration&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="bug-fixes"&gt;Bug Fixes&lt;a class="headerlink" href="#bug-fixes" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2398"&gt;Resolved issues with reused broadcast plans in non-AQE mode&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2420"&gt;Fixed thread safety in &lt;code&gt;setNumPartitions&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2440"&gt;Improved error handling when resolving S3 bucket region&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2432"&gt;Fixed byte array literal casting issues&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2438"&gt;Corrected subquery filter pushdown behavior for the &lt;code&gt;native_datafusion&lt;/code&gt; scan&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="documentation-updates"&gt;Documentation Updates&lt;a class="headerlink" href="#documentation-updates" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2487"&gt;Updated documentation for native shuffle configuration and tuning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2496"&gt;Added documentation for ANSI mode support&lt;/a&gt; in various functions&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2474"&gt;Improved EC2 benchmarking guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion-comet/pull/2568"&gt;Split configuration guide into different sections&lt;/a&gt; (scan, exec, shuffle, etc.) for better organization&lt;/li&gt;
&lt;li&gt;Various clarifications and improvements throughout the documentation&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="spark-compatibility"&gt;Spark Compatibility&lt;a class="headerlink" href="#spark-compatibility" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Spark 3.4.3 with JDK 11 &amp;amp; 17, Scala 2.12 &amp;amp; 2.13&lt;/li&gt;
&lt;li&gt;Spark 3.5.4 through 3.5.6 with JDK 11 &amp;amp; 17, Scala 2.12 &amp;amp; 2.13&lt;/li&gt;
&lt;li&gt;Spark 4.0.1 with JDK 17, Scala 2.13&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We are looking for help from the community to fully support Spark 4.0.1. See &lt;a href="https://github.com/apache/datafusion-comet/issues/1637"&gt;EPIC: Support 4.0.0&lt;/a&gt; for more information.&lt;/p&gt;
&lt;h2 id="getting-involved"&gt;Getting Involved&lt;a class="headerlink" href="#getting-involved" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The Comet project welcomes new contributors. We use the same &lt;a href="https://datafusion.apache.org/contributor-guide/communication.html#slack-and-discord"&gt;Slack and Discord&lt;/a&gt; channels as the main DataFusion
project and have a weekly &lt;a href="https://docs.google.com/document/d/1NBpkIAuU7O9h8Br5CbFksDhX-L9TyO9wmGLPMe0Plc8/edit?usp=sharing"&gt;DataFusion video call&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The easiest way to get involved is to test Comet with your current Spark jobs and file issues for any bugs or
performance regressions that you find. See the &lt;a href="https://datafusion.apache.org/comet/user-guide/installation.html"&gt;Getting Started&lt;/a&gt; guide for instructions on downloading and installing
Comet.&lt;/p&gt;
&lt;p&gt;There are also many &lt;a href="https://github.com/apache/datafusion-comet/contribute"&gt;good first issues&lt;/a&gt; waiting for contributions.&lt;/p&gt;</content><category term="blog"/></entry><entry><title>Apache DataFusion 50.0.0 Released</title><link href="https://datafusion.apache.org/blog/2025/09/29/datafusion-50.0.0" rel="alternate"/><published>2025-09-29T00:00:00+00:00</published><updated>2025-09-29T00:00:00+00:00</updated><author><name>pmc</name></author><id>tag:datafusion.apache.org,2025-09-29:/blog/2025/09/29/datafusion-50.0.0</id><summary type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;!-- see https://github.com/apache/datafusion/issues/16347 for details --&gt;
&lt;h2 id="introduction"&gt;Introduction&lt;a class="headerlink" href="#introduction" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We are proud to announce the release of &lt;a href="https://crates.io/crates/datafusion/50.0.0"&gt;DataFusion 50.0.0&lt;/a&gt;. This blog post
highlights some of the major improvements since the release of &lt;a href="https://datafusion.apache.org/blog/2025/07/28/datafusion-49.0.0/"&gt;DataFusion
49.0.0&lt;/a&gt;. The complete list of changes is available in the &lt;a href="https://github.com/apache/datafusion/blob/branch-50/dev/changelog/50.0.0.md"&gt;changelog&lt;/a&gt;.
Thanks to &lt;a href="https://github.com/apache/datafusion/blob/branch-50/dev/changelog/50.0.0.md#credits"&gt;numerous contributors&lt;/a&gt; for making this release possible!&lt;/p&gt;
&lt;h2 id="performance-improvements"&gt;Performance …&lt;/h2&gt;</summary><content type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;!-- see https://github.com/apache/datafusion/issues/16347 for details --&gt;
&lt;h2 id="introduction"&gt;Introduction&lt;a class="headerlink" href="#introduction" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We are proud to announce the release of &lt;a href="https://crates.io/crates/datafusion/50.0.0"&gt;DataFusion 50.0.0&lt;/a&gt;. This blog post
highlights some of the major improvements since the release of &lt;a href="https://datafusion.apache.org/blog/2025/07/28/datafusion-49.0.0/"&gt;DataFusion
49.0.0&lt;/a&gt;. The complete list of changes is available in the &lt;a href="https://github.com/apache/datafusion/blob/branch-50/dev/changelog/50.0.0.md"&gt;changelog&lt;/a&gt;.
Thanks to &lt;a href="https://github.com/apache/datafusion/blob/branch-50/dev/changelog/50.0.0.md#credits"&gt;numerous contributors&lt;/a&gt; for making this release possible!&lt;/p&gt;
&lt;h2 id="performance-improvements"&gt;Performance Improvements 🚀&lt;a class="headerlink" href="#performance-improvements" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;DataFusion continues to focus on enhancing performance, as shown in ClickBench
and other benchmark results.&lt;/p&gt;
&lt;p&gt;&lt;img alt="ClickBench performance results over time for DataFusion" class="img-fluid" src="/blog/images/datafusion-50.0.0/performance_over_time_clickbench.png" width="100%"/&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Figure 1&lt;/strong&gt;: Average and median normalized query execution times for ClickBench queries for each git revision.
Query times are normalized using the ClickBench definition. See the 
&lt;a href="https://alamb.github.io/datafusion-benchmarking/"&gt;DataFusion Benchmarking Page&lt;/a&gt; 
for more details.&lt;/p&gt;
&lt;p&gt;Here are some noteworthy optimizations added since DataFusion 49:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Dynamic Filter Pushdown Improvements&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The dynamic filter pushdown optimization, which allows runtime filters to cut
down on the amount of data read, has been extended to support &lt;strong&gt;inner hash
joins&lt;/strong&gt;, dramatically improving performance when one relation is relatively
small or filtered by a highly selective predicate. More details can be found in
the &lt;a href="#dynamic-filter-pushdown-for-hash-joins"&gt;Dynamic Filter Pushdown for Hash Joins&lt;/a&gt; section below.
The dynamic filters in the TopK operator have also been improved in DataFusion
50.0.0, further increasing the effectiveness and efficiency of the optimization.
More details can be found in this
&lt;a href="https://github.com/apache/datafusion/pull/16433"&gt;ticket&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Nested Loop Join Optimization&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The nested loop join operator has been rewritten to reduce execution time and memory
usage by adopting a finer-grained approach. Specifically, we now limit the 
intermediate data size to around a single &lt;code&gt;RecordBatch&lt;/code&gt; for better memory
efficiency, and we have eliminated redundant conversions from the old 
implementation to further improve execution speed.
When evaluating this new approach in a microbenchmark, we measured up to a 5x
improvement in execution time and a 99% reduction in memory usage. More details and
results can be found in this
&lt;a href="https://github.com/apache/datafusion/pull/16996"&gt;ticket&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Parquet Metadata Caching&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;DataFusion now automatically caches the metadata of Parquet files (statistics,
page indexes, etc.), to avoid unnecessary disk/network round-trips. This is
especially useful when querying the same table multiple times over relatively
slow networks, allowing us to achieve an order of magnitude faster execution
time when running many small reads over large files. More information can be
found in the &lt;a href="#parquet-metadata-cache"&gt;Parquet Metadata Cache&lt;/a&gt; section.&lt;/p&gt;
&lt;h2 id="community-growth"&gt;Community Growth 📈&lt;a class="headerlink" href="#community-growth" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Between &lt;code&gt;49.0.0&lt;/code&gt; and &lt;code&gt;50.0.0&lt;/code&gt;, we continue to see our community grow:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Qi Zhu (&lt;a href="https://github.com/zhuqi-lucas"&gt;zhuqi-lucas&lt;/a&gt;) and Yoav Cohen
   (&lt;a href="https://github.com/yoavcloud"&gt;yoavcloud&lt;/a&gt;) became committers. See the
   &lt;a href="https://lists.apache.org/list.html?dev@datafusion.apache.org"&gt;mailing list&lt;/a&gt; for more details.&lt;/li&gt;
&lt;li&gt;In the &lt;a href="https://github.com/apache/arrow-datafusion"&gt;core DataFusion repo&lt;/a&gt; alone, we reviewed and accepted 318 PRs
   from 79 different committers, created over 235 issues, and closed 197 of them
   🚀. All changes are listed in the detailed &lt;a href="https://github.com/apache/datafusion/tree/main/dev/changelog"&gt;changelogs&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;DataFusion published several blogs, including &lt;em&gt;&lt;a href="https://datafusion.apache.org/blog/2025/08/15/external-parquet-indexes/"&gt;Using External Indexes, Metadata Stores, Catalogs and
   Caches to Accelerate Queries on Apache Parquet&lt;/a&gt;&lt;/em&gt;, &lt;em&gt;&lt;a href="https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/"&gt;Dynamic Filters:
   Passing Information Between Operators During Execution for 25x Faster
   Queries&lt;/a&gt;&lt;/em&gt;, and &lt;em&gt;&lt;a href="https://datafusion.apache.org/blog/2025/09/21/custom-types-using-metadata/"&gt;Implementing User Defined Types and Custom Metadata 
   in DataFusion&lt;/a&gt;&lt;/em&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;!--
# Unique committers
$ git shortlog -sn 49.0.0..50.0.0  . | wc -l
    79
# commits
$ git log --pretty=oneline 49.0.0..50.0.0  . | wc -l
    318

https://crates.io/crates/datafusion/49.0.0
DataFusion 49 released July 25, 2025

https://crates.io/crates/datafusion/50.0.0
DataFusion 50 released September 16, 2025

Issues created in this time: 117 open, 118 closed = 235 total
https://github.com/apache/datafusion/issues?q=is%3Aissue+created%3A2025-07-25..2025-09-16

Issues closed: 197
https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+closed%3A2025-07-25..2025-09-16

PRs merged in this time 371
https://github.com/apache/arrow-datafusion/pulls?q=is%3Apr+merged%3A2025-07-25..2025-09-16
--&gt;
&lt;h2 id="new-features"&gt;New Features ✨&lt;a class="headerlink" href="#new-features" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id="improved-spilling-sorts-for-larger-than-memory-datasets"&gt;Improved Spilling Sorts for Larger-than-Memory Datasets&lt;a class="headerlink" href="#improved-spilling-sorts-for-larger-than-memory-datasets" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;DataFusion has long been able to sort datasets that do not fit entirely in memory,
but still struggled with particularly large inputs or highly memory-constrained 
setups. Larger-than-memory sorts in DataFusion 50.0.0 have been improved with the recent introduction
of multi-level merge sorts (more details in the respective
&lt;a href="https://github.com/apache/datafusion/pull/15700"&gt;ticket&lt;/a&gt;). It is now
possible to execute almost any sorting query that would have previously triggered &lt;em&gt;out-of-memory&lt;/em&gt;
errors, by relying on disk spilling. Thanks to &lt;a href="https://github.com/rluvaton"&gt;Raz Luvaton&lt;/a&gt;, &lt;a href="https://github.com/2010YOUY01"&gt;Yongting You&lt;/a&gt;, and
&lt;a href="https://github.com/ding-young"&gt;ding-young&lt;/a&gt; for delivering this feature.&lt;/p&gt;
&lt;h3 id="dynamic-filter-pushdown-for-hash-joins"&gt;Dynamic Filter Pushdown for Hash Joins&lt;a class="headerlink" href="#dynamic-filter-pushdown-for-hash-joins" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The &lt;a href="https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/"&gt;dynamic filter pushdown
optimization&lt;/a&gt;
has been extended to inner hash joins, dramatically reducing the amount of
scanned data in some workloads—a technique sometimes referred to as
&lt;a href="https://www.cs.cmu.edu/~15721-f24/papers/Sideways_Information_Passing.pdf"&gt;&lt;em&gt;Sideways Information Passing&lt;/em&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;These filters are automatically applied to inner hash joins, while future work
will introduce them to other join types. &lt;/p&gt;
&lt;p&gt;For example, given a query that looks for a specific customer and
their orders, DataFusion can now filter the &lt;code&gt;orders&lt;/code&gt; relation based on the
&lt;code&gt;c_custkey&lt;/code&gt; of the target customer, reducing the amount of data
read from disk by orders of magnitude.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;-- retrieve the orders of the customer with c_phone = '25-989-741-2988'
SELECT *
FROM customer
JOIN orders ON c_custkey = o_custkey
WHERE c_phone = '25-989-741-2988';
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The following shows an execution plan in DataFusion 50.0.0 with this optimization:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;HashJoinExec
    DataSourceExec: &amp;lt;-- read customer
      predicate=c_phone@4 = 25-989-741-2988
      metrics=[output_rows=1, ...]
    DataSourceExec: &amp;lt;-- read orders
      -- dynamic filter is added here, filtering directly at scan time
      predicate=DynamicFilterPhysicalExpr [ o_custkey@1 &amp;gt;= 1 AND o_custkey@1 &amp;lt;= 1 ]
      -- the number of output rows is kept to a minimum
      metrics=[output_rows=11, ...]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because only a single customer matches the predicate in this query,
almost all rows from &lt;code&gt;orders&lt;/code&gt; are filtered out by the join.
In previous versions of DataFusion, the entire &lt;code&gt;orders&lt;/code&gt; relation would be
scanned to join with the target customer, but now the dynamic filter pushdown can
filter it right at the source, minimizing the amount of data decoded.&lt;/p&gt;
&lt;p&gt;More information can be found in the respective
&lt;a href="https://github.com/apache/datafusion/pull/16445"&gt;ticket&lt;/a&gt; and the next step will be to
&lt;a href="https://github.com/apache/datafusion/issues/16973"&gt;extend the dynamic filters to other types of joins&lt;/a&gt;, such as &lt;code&gt;LEFT&lt;/code&gt; and
&lt;code&gt;RIGHT&lt;/code&gt; outer joins. Thanks to &lt;a href="https://github.com/adriangb"&gt;Adrian Garcia Badaracco&lt;/a&gt;, &lt;a href="https://github.com/zhuqi-lucas"&gt;Qi Zhu&lt;/a&gt;, &lt;a href="https://github.com/xudong963"&gt;xudong963&lt;/a&gt;, &lt;a href="https://github.com/Dandandan"&gt;Daniël Heres&lt;/a&gt;, and &lt;a href="https://github.com/LiaCastaneda"&gt;Lía Adriana&lt;/a&gt;
for delivering this feature.&lt;/p&gt;
&lt;h3 id="parquet-metadata-cache"&gt;Parquet Metadata Cache&lt;a class="headerlink" href="#parquet-metadata-cache" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The metadata of Parquet files (statistics, page indexes, etc.) is now
automatically cached when using the built-in &lt;a href="https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTable.html"&gt;ListingTable&lt;/a&gt;, which reduces disk/network round-trips and repeated decoding
of the same information. With a simple microbenchmark that executes point reads
(e.g., &lt;code&gt;SELECT v FROM t WHERE k = x&lt;/code&gt;) over large files, we measured a 12x
improvement in execution time (more details can be found in the respective
&lt;a href="https://github.com/apache/datafusion/pull/16971"&gt;ticket&lt;/a&gt;). This optimization
is production ready and enabled by default (more details in the
&lt;a href="https://github.com/apache/datafusion/issues/17000"&gt;Epic&lt;/a&gt;).
Thanks to &lt;a href="https://github.com/nuno-faria"&gt;Nuno Faria&lt;/a&gt;, &lt;a href="https://github.com/jonathanc-n"&gt;Jonathan Chen&lt;/a&gt;, &lt;a href="https://github.com/shehabgamin"&gt;Shehab Amin&lt;/a&gt;, &lt;a href="https://github.com/comphead"&gt;Oleks V&lt;/a&gt;, &lt;a href="https://github.com/timsaucer"&gt;Tim Saucer&lt;/a&gt;, and &lt;a href="https://github.com/BlakeOrth"&gt;Blake Orth&lt;/a&gt; for delivering this feature.&lt;/p&gt;
&lt;p&gt;Here is an example of the metadata cache in action:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;-- disabling the metadata cache
&amp;gt; SET datafusion.runtime.metadata_cache_limit = '0M';

-- simple query (t.parquet: 100M rows, 3 cols)
&amp;gt; EXPLAIN ANALYZE SELECT * FROM 't.parquet' LIMIT 1;
DataSourceExec: ... metrics=[..., metadata_load_time=229.196422ms, ...]
Elapsed 0.246 seconds.

-- enabling the metadata cache
&amp;gt; SET datafusion.runtime.metadata_cache_limit = '50M';

&amp;gt; EXPLAIN ANALYZE SELECT * FROM 't.parquet' LIMIT 1;
DataSourceExec: ... metrics=[..., metadata_load_time=228.612µs, ...]
Elapsed 0.003 seconds. -- 82x improvement in this specific query
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The cache can be configured with the following runtime parameter:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;datafusion.runtime.metadata_cache_limit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The default &lt;a href="https://docs.rs/datafusion/latest/datafusion/execution/cache/cache_manager/trait.FileMetadataCache.html"&gt;&lt;code&gt;FileMetadataCache&lt;/code&gt;&lt;/a&gt; uses a
least-recently-used eviction algorithm and up to 50MB of memory.
If the underlying file changes, the cache is automatically invalidated.
Setting the limit to 0 will disable any metadata caching. As with most APIs in
DataFusion, users can provide their own behavior using a custom
&lt;a href="https://docs.rs/datafusion/50.0.0/datafusion/execution/cache/cache_manager/trait.FileMetadataCache.html"&gt;&lt;code&gt;FileMetadataCache&lt;/code&gt;&lt;/a&gt;
implementation when setting up the &lt;a href="https://docs.rs/datafusion/latest/datafusion/execution/runtime_env/struct.RuntimeEnv.html"&gt;&lt;code&gt;RuntimeEnv&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For users with custom &lt;a href="https://docs.rs/datafusion/latest/datafusion/catalog/trait.TableProvider.html"&gt;&lt;code&gt;TableProvider&lt;/code&gt;&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;If the custom provider uses the
&lt;a href="https://docs.rs/datafusion/latest/datafusion/datasource/file_format/parquet/struct.ParquetFormat.html"&gt;&lt;code&gt;ParquetFormat&lt;/code&gt;&lt;/a&gt;, caching will work
without any changes.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Otherwise the
&lt;a href="https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/parquet/struct.CachedParquetFileReaderFactory.html"&gt;&lt;code&gt;CachedParquetFileReaderFactory&lt;/code&gt;&lt;/a&gt;
can be provided when creating a
&lt;a href="https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/struct.ParquetSource.html"&gt;&lt;code&gt;ParquetSource&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Users can inspect the cache contents through the
&lt;a href="https://docs.rs/datafusion/latest/datafusion/execution/cache/cache_manager/trait.FileMetadataCache.html#tymethod.list_entries"&gt;&lt;code&gt;FileMetadataCache::list_entries&lt;/code&gt;&lt;/a&gt;
method, or with the
&lt;a href="https://datafusion.apache.org/user-guide/cli/functions.html#metadata-cache"&gt;&lt;code&gt;metadata_cache()&lt;/code&gt;&lt;/a&gt;
function in &lt;code&gt;datafusion-cli&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;&amp;gt; SELECT * FROM metadata_cache();
+---------------+-------------------------+-----------------+--------------------------+---------+---------------------+------+-----------------+
| path          | file_modified           | file_size_bytes | e_tag                    | version | metadata_size_bytes | hits | extra           |
+---------------+-------------------------+-----------------+--------------------------+---------+---------------------+------+-----------------+
| .../t.parquet | 2025-09-21T17:40:13.650 | 420827020       | 0-63f5331fb4458-19154f8c | NULL    | 44480534            | 27   | page_index=true |
+---------------+-------------------------+-----------------+--------------------------+---------+---------------------+------+-----------------+
1 row(s) fetched.
Elapsed 0.003 seconds.
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id="qualify-clause"&gt;&lt;code&gt;QUALIFY&lt;/code&gt; Clause&lt;a class="headerlink" href="#qualify-clause" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;DataFusion now supports the &lt;code&gt;QUALIFY&lt;/code&gt; SQL clause
(&lt;a href="https://github.com/apache/datafusion/pull/16933"&gt;#16933&lt;/a&gt;), which simplifies
filtering window function output (similar to how &lt;code&gt;HAVING&lt;/code&gt; filters
aggregation output).&lt;/p&gt;
&lt;p&gt;For example, filtering the output of the &lt;code&gt;rank()&lt;/code&gt; function previously
required a query like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;SELECT a, b, c
FROM (
   SELECT a, b, c, rank() OVER(PARTITION BY a ORDER BY b) as rk
   FROM t
)
WHERE rk = 1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The same query can now be written like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;SELECT a, b, c, rank() OVER(PARTITION BY a ORDER BY b) as rk
FROM t
QUALIFY rk = 1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Although it is not part of the SQL standard (yet), it has been gaining
adoption in several SQL analytical systems such as DuckDB, Snowflake, and
BigQuery. Thanks to &lt;a href="https://github.com/haohuaijin"&gt;Huaijin&lt;/a&gt; and &lt;a href="https://github.com/jonahgao"&gt;Jonah Gao&lt;/a&gt; for delivering this feature.&lt;/p&gt;
&lt;h3 id="filter-support-for-window-functions"&gt;&lt;code&gt;FILTER&lt;/code&gt; Support for Window Functions&lt;a class="headerlink" href="#filter-support-for-window-functions" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Continuing the theme, the &lt;code&gt;FILTER&lt;/code&gt; clause has been extended to support
&lt;a href="https://github.com/apache/datafusion/pull/17378"&gt;aggregate window functions&lt;/a&gt;.
It allows these functions to apply to specific rows without having to
rely on &lt;code&gt;CASE&lt;/code&gt; expressions, similar to what was already possible with regular
aggregate functions.&lt;/p&gt;
&lt;p&gt;For example, we can gather multiple distinct sets of values matching different
criteria with a single pass over the input:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;SELECT 
  ARRAY_AGG(c2) FILTER (WHERE c2 &amp;gt;= 2) OVER (...)     -- e.g. [2, 3, 4]
  ARRAY_AGG(CASE WHEN c2 &amp;gt;= 2 THEN c2 END) OVER (...) -- e.g. [NULL, NULL, 2, 3, 4]
...
FROM table
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Thanks to &lt;a href="https://github.com/geoffreyclaude"&gt;Geoffrey Claude&lt;/a&gt; and &lt;a href="https://github.com/Jefffrey"&gt;Jeffrey Vo&lt;/a&gt; for delivering this feature.&lt;/p&gt;
&lt;h3 id="configoptions-now-available-to-functions"&gt;&lt;code&gt;ConfigOptions&lt;/code&gt; Now Available to Functions&lt;a class="headerlink" href="#configoptions-now-available-to-functions" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;DataFusion 50.0.0 now passes session configuration parameters to User-Defined
Functions (UDFs) via
&lt;a href="https://docs.rs/datafusion/latest/datafusion/logical_expr/struct.ScalarFunctionArgs.html"&gt;ScalarFunctionArgs&lt;/a&gt;
(&lt;a href="https://github.com/apache/datafusion/pull/16970"&gt;#16970&lt;/a&gt;). This allows
behavior that varies based on runtime state; for example, time UDFs can use the
session-specified time zone instead of just UTC.&lt;/p&gt;
&lt;p&gt;Thanks to &lt;a href="https://github.com/Omega359"&gt;Bruce Ritchie&lt;/a&gt;, &lt;a href="https://github.com/findepi"&gt;Piotr Findeisen&lt;/a&gt;, &lt;a href="https://github.com/comphead"&gt;Oleks V&lt;/a&gt;, and &lt;a href="https://github.com/alamb"&gt;Andrew Lamb&lt;/a&gt; for delivering this feature.&lt;/p&gt;
&lt;h3 id="additional-apache-spark-compatible-functions"&gt;Additional Apache Spark Compatible Functions&lt;a class="headerlink" href="#additional-apache-spark-compatible-functions" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Finally, because Apache Spark is so widely used in analytical processing, many
DataFusion users want Spark-compatible behavior in their workloads, so DataFusion provides a
set of Spark-compatible functions in the &lt;a href="https://crates.io/crates/datafusion-spark"&gt;datafusion-spark&lt;/a&gt; crate.
You can read more about this project in the &lt;a href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/#new-datafusion-spark-crate"&gt;announcement&lt;/a&gt; and &lt;a href="https://github.com/apache/datafusion/issues/15914"&gt;epic&lt;/a&gt;.
DataFusion 50.0.0 adds several new such functions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion/pull/16936"&gt;&lt;code&gt;array&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion/pull/16942"&gt;&lt;code&gt;bit_get/bit_count&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion/pull/17179"&gt;&lt;code&gt;bitmap_count&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion/pull/17032"&gt;&lt;code&gt;crc32/sha1&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion/pull/17024"&gt;&lt;code&gt;date_add/date_sub&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion/pull/16946"&gt;&lt;code&gt;if&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion/pull/16828"&gt;&lt;code&gt;last_day&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion/pull/16962"&gt;&lt;code&gt;like/ilike&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion/pull/16848"&gt;&lt;code&gt;luhn_check&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion/pull/16829"&gt;&lt;code&gt;mod/pmod&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion/pull/16780"&gt;&lt;code&gt;next_day&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion/pull/16937"&gt;&lt;code&gt;parse_url&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion/pull/16924"&gt;&lt;code&gt;rint&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion/pull/17331"&gt;&lt;code&gt;width_bucket&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
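&lt;p&gt;Once the &lt;code&gt;datafusion-spark&lt;/code&gt; function packages are registered in a session (they are not part of the default function set), these functions can be called like any other scalar function. A hypothetical session might look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;-- assumes the datafusion-spark functions have been registered in this session
&amp;gt; SELECT last_day(DATE '2025-09-16');  -- last day of the given month
&amp;gt; SELECT pmod(-7, 3);                  -- always-positive modulus, as in Spark
&lt;/code&gt;&lt;/pre&gt;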
&lt;p&gt;Thanks to &lt;a href="https://github.com/davidlghellin"&gt;David López&lt;/a&gt;, &lt;a href="https://github.com/chenkovsky"&gt;Chen Chongchen&lt;/a&gt;, &lt;a href="https://github.com/Standing-Man"&gt;Alan Tang&lt;/a&gt;, &lt;a href="https://github.com/petern48"&gt;Peter Nguyen&lt;/a&gt;, and &lt;a href="https://github.com/SparkApplicationMaster"&gt;Evgenii Glotov&lt;/a&gt; for delivering these functions. We are looking for additional help
reviewing and implementing more functions; please reach out on the &lt;a href="https://github.com/apache/datafusion/issues/15914"&gt;epic&lt;/a&gt; if you are interested.&lt;/p&gt;
&lt;h2 id="known-issues-patchset"&gt;Known Issues / Patchset&lt;a class="headerlink" href="#known-issues-patchset" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;As DataFusion continues to mature, we regularly release patch versions to fix issues 
in major releases. Since the release of &lt;code&gt;50.0.0&lt;/code&gt;, we have identified a few
issues, and expect to release &lt;code&gt;50.1.0&lt;/code&gt; to address them. You can track progress
in this &lt;a href="https://github.com/apache/datafusion/issues/17594"&gt;ticket&lt;/a&gt;. &lt;/p&gt;
&lt;h2 id="upgrade-guide-and-changelog"&gt;Upgrade Guide and Changelog&lt;a class="headerlink" href="#upgrade-guide-and-changelog" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Upgrading to 50.0.0 should be straightforward for most users. Please review the
&lt;a href="https://datafusion.apache.org/library-user-guide/upgrading.html"&gt;Upgrade Guide&lt;/a&gt;
for details on breaking changes and code snippets to help with the transition.
Recently, some users have reported success automatically upgrading DataFusion by
pairing AI tools with the upgrade guide. For a comprehensive list of all
changes, please refer to the &lt;a href="https://github.com/apache/datafusion/blob/branch-50/dev/changelog/50.0.0.md"&gt;changelog&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="about-datafusion"&gt;About DataFusion&lt;a class="headerlink" href="#about-datafusion" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; is an extensible query engine, written in &lt;a href="https://www.rust-lang.org/"&gt;Rust&lt;/a&gt;, that uses
&lt;a href="https://arrow.apache.org"&gt;Apache Arrow&lt;/a&gt; as its in-memory format. DataFusion is used by developers to
create new, fast, data-centric systems such as databases, dataframe libraries,
and machine learning and streaming applications. While &lt;a href="https://datafusion.apache.org/user-guide/introduction.html#project-goals"&gt;DataFusion’s primary
design goal&lt;/a&gt; is to accelerate the creation of other data-centric systems, it
provides a reasonable experience directly out of the box as a &lt;a href="https://datafusion.apache.org/user-guide/dataframe.html"&gt;dataframe
library&lt;/a&gt;, &lt;a href="https://datafusion.apache.org/python/"&gt;Python library&lt;/a&gt;, and &lt;a href="https://datafusion.apache.org/user-guide/cli/"&gt;command-line SQL tool&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;DataFusion's core thesis is that, as a community, we can build much
more advanced technology than any individual or company could build
alone. Without DataFusion, highly performant vectorized query engines would
remain the domain of a few large companies and world-class research
institutions. With DataFusion, we can all build on top of a shared foundation
and focus on what makes our projects unique.&lt;/p&gt;
&lt;h2 id="how-to-get-involved"&gt;How to Get Involved&lt;a class="headerlink" href="#how-to-get-involved" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;DataFusion is not a project built or driven by a single person, company, or
foundation. Rather, our community of users and contributors works together to
build a shared technology that none of us could have built alone.&lt;/p&gt;
&lt;p&gt;If you are interested in joining us, we would love to have you. You can try out
DataFusion on your own data and projects and let us know how it goes, or
contribute suggestions, bug reports, or a PR with documentation,
tests, or code. A list of open issues suitable for beginners is &lt;a href="https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22"&gt;here&lt;/a&gt;, and you
can find out how to reach us on the &lt;a href="https://datafusion.apache.org/contributor-guide/communication.html"&gt;communication doc&lt;/a&gt;.&lt;/p&gt;</content><category term="blog"/></entry><entry><title>Apache DataFusion Comet 0.10.0 Release</title><link href="https://datafusion.apache.org/blog/2025/09/16/datafusion-comet-0.10.0" rel="alternate"/><published>2025-09-16T00:00:00+00:00</published><updated>2025-09-16T00:00:00+00:00</updated><author><name>pmc</name></author><id>tag:datafusion.apache.org,2025-09-16:/blog/2025/09/16/datafusion-comet-0.10.0</id><summary type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;p&gt;The Apache DataFusion PMC is pleased to announce version 0.10.0 of the &lt;a href="https://datafusion.apache.org/comet/"&gt;Comet&lt;/a&gt; subproject.&lt;/p&gt;
&lt;p&gt;Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
improved performance and efficiency without requiring any code changes.&lt;/p&gt;
&lt;p&gt;This release covers approximately ten weeks of development …&lt;/p&gt;</summary><content type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;p&gt;The Apache DataFusion PMC is pleased to announce version 0.10.0 of the &lt;a href="https://datafusion.apache.org/comet/"&gt;Comet&lt;/a&gt; subproject.&lt;/p&gt;
&lt;p&gt;Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
improved performance and efficiency without requiring any code changes.&lt;/p&gt;
&lt;p&gt;This release covers approximately ten weeks of development work and is the result of merging 183 PRs from 26
contributors. See the &lt;a href="https://github.com/apache/datafusion-comet/blob/main/dev/changelog/0.10.0.md"&gt;change log&lt;/a&gt; for more information.&lt;/p&gt;
&lt;h2 id="release-highlights"&gt;Release Highlights&lt;a class="headerlink" href="#release-highlights" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id="improved-support-for-apache-iceberg"&gt;Improved Support for Apache Iceberg&lt;a class="headerlink" href="#improved-support-for-apache-iceberg" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;It is now possible to use Comet with Apache Iceberg 1.8.1 to accelerate reads of Iceberg Parquet tables. Please refer to Comet's &lt;a href="https://datafusion.apache.org/comet/user-guide/latest/iceberg.html"&gt;Iceberg Guide&lt;/a&gt; for information on building Iceberg with Comet.&lt;/p&gt;
&lt;h3 id="improved-spark-400-support"&gt;Improved Spark 4.0.0 Support&lt;a class="headerlink" href="#improved-spark-400-support" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Comet no longer falls back to Spark for all queries when ANSI mode is enabled (which is the default in Spark 4.0.0). 
Instead, Comet now falls back to Spark only for the arithmetic and aggregate expressions that are affected by ANSI mode.&lt;/p&gt;
&lt;p&gt;Setting &lt;code&gt;spark.comet.ansi.ignore=true&lt;/code&gt; will override this behavior and force these expressions to continue to be 
accelerated by Comet. Full support for ANSI mode will be available in a future release.&lt;/p&gt;
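&lt;p&gt;For example (a minimal sketch; the exact submit command and the rest of the Comet configuration depend on your deployment):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-shell"&gt;# Keep Comet acceleration for ANSI-affected expressions. Results may
# differ from Spark's ANSI behavior in edge cases, hence the opt-in.
spark-submit \
  --conf spark.plugins=org.apache.spark.CometPlugin \
  --conf spark.comet.ansi.ignore=true \
  your-job.jar
&lt;/code&gt;&lt;/pre&gt;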
&lt;p&gt;Comet will now use the &lt;code&gt;native_iceberg_compat&lt;/code&gt; scan for Spark 4.0.0 in most cases, which supports reading complex types.&lt;/p&gt;
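&lt;p&gt;The scan implementation can also be pinned explicitly if needed (a sketch, assuming the &lt;code&gt;spark.comet.scan.impl&lt;/code&gt; setting; check the Comet user guide for the authoritative name and values):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-shell"&gt;# Force the complex-type-capable scan instead of letting Comet choose
--conf spark.comet.scan.impl=native_iceberg_compat
&lt;/code&gt;&lt;/pre&gt;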
&lt;h3 id="new-functionality"&gt;New Functionality&lt;a class="headerlink" href="#new-functionality" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The following SQL functions are now supported:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;array_min&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;map_entries&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;map_from_arrays&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;randn&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;from_unixtime&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;monotonically_increasing_id&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;spark_partition_id&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;try_add&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;try_divide&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;try_mod&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;try_multiply&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;try_subtract&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
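&lt;p&gt;The &lt;code&gt;try_*&lt;/code&gt; variants follow Spark's semantics of returning &lt;code&gt;NULL&lt;/code&gt; instead of raising an error on overflow or division by zero. For example:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;-- Returns NULL rather than failing, even under ANSI mode
SELECT try_divide(1, 0);
&lt;/code&gt;&lt;/pre&gt;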
&lt;p&gt;Other new features include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Support for array literals&lt;/li&gt;
&lt;li&gt;Support for limit with offset&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="ux-improvements"&gt;UX Improvements&lt;a class="headerlink" href="#ux-improvements" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Improved reporting of reasons why Comet cannot accelerate some operators and expressions&lt;/li&gt;
&lt;li&gt;New &lt;code&gt;spark.comet.logFallbackReasons.enabled&lt;/code&gt; configuration setting for logging all fallback reasons&lt;/li&gt;
&lt;li&gt;CometScan nodes in the physical plan now show which scan implementation is being used (&lt;code&gt;native_comet&lt;/code&gt;, 
&lt;code&gt;native_datafusion&lt;/code&gt;, or &lt;code&gt;native_iceberg_compat&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;
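&lt;p&gt;For example, fallback-reason logging can be enabled when submitting a job (a minimal sketch; adapt to your deployment):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-shell"&gt;spark-submit \
  --conf spark.plugins=org.apache.spark.CometPlugin \
  --conf spark.comet.logFallbackReasons.enabled=true \
  your-job.jar
&lt;/code&gt;&lt;/pre&gt;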
&lt;h3 id="bug-fixes"&gt;Bug Fixes&lt;a class="headerlink" href="#bug-fixes" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Improved memory safety for FFI transfers&lt;/li&gt;
&lt;li&gt;Fixed a double-free issue in the shuffle unified memory pool&lt;/li&gt;
&lt;li&gt;Fixed an FFI issue with non-zero offsets&lt;/li&gt;
&lt;li&gt;Fixed an issue with buffered reads from HDFS &lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="benchmarking"&gt;Benchmarking&lt;a class="headerlink" href="#benchmarking" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Scripts for benchmarks based on TPC-H and TPC-DS are now available in the repository under &lt;code&gt;dev/benchmarks&lt;/code&gt;.&lt;/p&gt;
&lt;h3 id="documentation-updates"&gt;Documentation Updates&lt;a class="headerlink" href="#documentation-updates" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;The documentation for supported &lt;a href="https://datafusion.apache.org/comet/user-guide/latest/operators.html"&gt;operators&lt;/a&gt; and &lt;a href="https://datafusion.apache.org/comet/user-guide/latest/expressions.html"&gt;expressions&lt;/a&gt; is now more complete, and Spark-compatibility status 
  per operator/expression is now documented.&lt;/li&gt;
&lt;li&gt;The documentation now contains a &lt;a href="https://datafusion.apache.org/comet/contributor-guide/roadmap.html"&gt;roadmap&lt;/a&gt; section.&lt;/li&gt;
&lt;li&gt;New &lt;a href="https://datafusion.apache.org/comet/gluten_comparison.html"&gt;guide&lt;/a&gt; comparing Comet with Apache Gluten (incubating) + Velox&lt;/li&gt;
&lt;li&gt;User guides are now available for multiple Comet versions&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="spark-compatibility"&gt;Spark Compatibility&lt;a class="headerlink" href="#spark-compatibility" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Spark 3.4.3 with JDK 11 &amp;amp; 17, Scala 2.12 &amp;amp; 2.13&lt;/li&gt;
&lt;li&gt;Spark 3.5.4 through 3.5.6 with JDK 11 &amp;amp; 17, Scala 2.12 &amp;amp; 2.13&lt;/li&gt;
&lt;li&gt;Experimental support for Spark 4.0.0 with JDK 17, Scala 2.13&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We are looking for help from the community to fully support Spark 4.0.0. See &lt;a href="https://github.com/apache/datafusion-comet/issues/1637"&gt;EPIC: Support 4.0.0&lt;/a&gt; for more information.&lt;/p&gt;
&lt;h2 id="getting-involved"&gt;Getting Involved&lt;a class="headerlink" href="#getting-involved" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The Comet project welcomes new contributors. We use the same &lt;a href="https://datafusion.apache.org/contributor-guide/communication.html#slack-and-discord"&gt;Slack and Discord&lt;/a&gt; channels as the main DataFusion
project and have a weekly &lt;a href="https://docs.google.com/document/d/1NBpkIAuU7O9h8Br5CbFksDhX-L9TyO9wmGLPMe0Plc8/edit?usp=sharing"&gt;DataFusion video call&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The easiest way to get involved is to test Comet with your current Spark jobs and file issues for any bugs or
performance regressions that you find. See the &lt;a href="https://datafusion.apache.org/comet/user-guide/installation.html"&gt;Getting Started&lt;/a&gt; guide for instructions on downloading and installing
Comet.&lt;/p&gt;
&lt;p&gt;There are also many &lt;a href="https://github.com/apache/datafusion-comet/contribute"&gt;good first issues&lt;/a&gt; waiting for contributions.&lt;/p&gt;</content><category term="blog"/></entry><entry><title>Apache DataFusion 49.0.0 Released</title><link href="https://datafusion.apache.org/blog/2025/07/28/datafusion-49.0.0" rel="alternate"/><published>2025-07-28T00:00:00+00:00</published><updated>2025-07-28T00:00:00+00:00</updated><author><name>pmc</name></author><id>tag:datafusion.apache.org,2025-07-28:/blog/2025/07/28/datafusion-49.0.0</id><summary type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;!-- see https://github.com/apache/datafusion/issues/16347 for details --&gt;
&lt;h2 id="introduction"&gt;Introduction&lt;a class="headerlink" href="#introduction" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We are proud to announce the release of &lt;a href="https://crates.io/crates/datafusion/49.0.0"&gt;DataFusion 49.0.0&lt;/a&gt;. This blog post highlights some of
the major improvements since the release of &lt;a href="https://datafusion.apache.org/blog/2025/07/18/datafusion-48.0.0/"&gt;DataFusion 48.0.0&lt;/a&gt;. The complete list of changes is available in the &lt;a href="https://github.com/apache/datafusion/blob/branch-49/dev/changelog/49.0.0.md"&gt;changelog&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="performance-improvements"&gt;Performance Improvements 🚀&lt;a class="headerlink" href="#performance-improvements" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;DataFusion continues to focus on enhancing performance, as …&lt;/p&gt;</summary><content type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;!-- see https://github.com/apache/datafusion/issues/16347 for details --&gt;
&lt;h2 id="introduction"&gt;Introduction&lt;a class="headerlink" href="#introduction" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We are proud to announce the release of &lt;a href="https://crates.io/crates/datafusion/49.0.0"&gt;DataFusion 49.0.0&lt;/a&gt;. This blog post highlights some of
the major improvements since the release of &lt;a href="https://datafusion.apache.org/blog/2025/07/18/datafusion-48.0.0/"&gt;DataFusion 48.0.0&lt;/a&gt;. The complete list of changes is available in the &lt;a href="https://github.com/apache/datafusion/blob/branch-49/dev/changelog/49.0.0.md"&gt;changelog&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="performance-improvements"&gt;Performance Improvements 🚀&lt;a class="headerlink" href="#performance-improvements" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;DataFusion continues to focus on enhancing performance, as shown in ClickBench and other benchmark results. &lt;/p&gt;
&lt;p&gt;&lt;img alt="ClickBench performance results over time for DataFusion" class="img-fluid" src="/blog/images/datafusion-49.0.0/performance_over_time_clickbench.png" width="100%"/&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Figure 1&lt;/strong&gt;: ClickBench performance improvements over time.
Average and median normalized query execution times for ClickBench queries for each git revision.
Query times are normalized using the ClickBench definition. Data and definitions are available on the
&lt;a href="https://alamb.github.io/datafusion-benchmarking/"&gt;DataFusion Benchmarking Page&lt;/a&gt;.&lt;/p&gt;
&lt;!--
NOTE: Andrew is working on gathering these numbers

&lt;img
src="/blog/images/datafusion-49.0.0/performance_over_time_planning.png"
width="80%"
class="img-fluid"
alt="Planning benchmark performance results over time for DataFusion"
/&gt;

**Figure 2**: Planning benchmark performance improved XXX between DataFusion 48.0.1 and DataFusion 49.0.0. Chart source: TODO
--&gt;
&lt;p&gt;Here are some noteworthy optimizations added since DataFusion 48:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Equivalence system upgrade:&lt;/strong&gt; The lower levels of the equivalence system, which is used to implement the
  optimizations described in &lt;a href="https://datafusion.apache.org/blog/2025/03/11/ordering-analysis"&gt;Using Ordering for Better Plans&lt;/a&gt;, were rewritten, leading to
  much faster planning times, especially for queries with a &lt;a href="https://github.com/apache/datafusion/pull/16217#pullrequestreview-2891941229"&gt;large number of columns&lt;/a&gt;. This change also prepares
  the way for more sophisticated sort-based optimizations in the future. (PR &lt;a href="https://github.com/apache/datafusion/pull/16217"&gt;#16217&lt;/a&gt; by &lt;a href="https://github.com/ozankabak"&gt;ozankabak&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Dynamic Filters and TopK pushdown&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;DataFusion now supports dynamic filters, which are refined as the query executes,
and physical filter pushdown. Together, these features improve the performance of
queries that use &lt;code&gt;LIMIT&lt;/code&gt; and &lt;code&gt;ORDER BY&lt;/code&gt; clauses, such as the following:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;SELECT *
FROM data
ORDER BY timestamp DESC
LIMIT 10
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;While the query above is simple, without dynamic filtering or knowing that the data
is already sorted by &lt;code&gt;timestamp&lt;/code&gt;, a query engine must decode &lt;em&gt;all&lt;/em&gt; of the data to
find the top 10 values. With the dynamic filters system, DataFusion applies an
increasingly selective filter during query execution. It checks the &lt;strong&gt;current&lt;/strong&gt;
top 10 values of the &lt;code&gt;timestamp&lt;/code&gt; column &lt;strong&gt;before&lt;/strong&gt; opening files or reading
Parquet Row Groups and Data Pages, which can skip older data very quickly.&lt;/p&gt;
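&lt;p&gt;Conceptually, once the operator has buffered ten rows, subsequent reads behave as if the scan carried a predicate like the one below. This is illustrative only: the filter is built and tightened internally during execution, and the literal shown is a hypothetical snapshot of the smallest timestamp currently in the top 10.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;-- Hypothetical dynamic-filter snapshot
SELECT *
FROM data
WHERE timestamp &amp;gt; '2025-07-01 00:00:00'  -- current 10th-largest value
ORDER BY timestamp DESC
LIMIT 10
&lt;/code&gt;&lt;/pre&gt;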
&lt;p&gt;Dynamic predicates are a common feature of advanced engines such as &lt;a href="https://docs.starburst.io/latest/admin/dynamic-filtering.html"&gt;Dynamic
Filters in Starburst&lt;/a&gt; and &lt;a href="https://www.snowflake.com/en/engineering-blog/optimizing-top-k-aggregation-snowflake/"&gt;Top-K Aggregation Optimization at Snowflake&lt;/a&gt;. The
technique drastically improves query performance (we've seen over a 1.5x
improvement for some TPC-H-style queries), especially in combination with late
materialization and columnar file formats such as Parquet. We &lt;a href="https://github.com/apache/datafusion/issues/15513"&gt;plan to write a
blog post&lt;/a&gt; explaining the details of this optimization in the future, and we expect to
use the same mechanism to implement additional optimizations such as &lt;a href="https://github.com/apache/datafusion/issues/7955"&gt;Sideways
Information Passing for joins&lt;/a&gt; (Issue
&lt;a href="https://github.com/apache/datafusion/issues/15037"&gt;#15037&lt;/a&gt; PR
&lt;a href="https://github.com/apache/datafusion/pull/15770"&gt;#15770&lt;/a&gt; by
&lt;a href="https://github.com/adriangb"&gt;adriangb&lt;/a&gt;).&lt;/p&gt;
&lt;h2 id="community-growth"&gt;Community Growth  📈&lt;a class="headerlink" href="#community-growth" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The last few months, between &lt;code&gt;46.0.0&lt;/code&gt; and &lt;code&gt;49.0.0&lt;/code&gt;, have seen our community grow:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;New PMC members and committers: &lt;a href="https://github.com/berkaysynnada"&gt;berkay&lt;/a&gt;, &lt;a href="https://github.com/xudong963"&gt;xudong963&lt;/a&gt; and &lt;a href="https://github.com/timsaucer"&gt;timsaucer&lt;/a&gt; joined the PMC.
   &lt;a href="https://github.com/blaginin"&gt;blaginin&lt;/a&gt;, &lt;a href="https://github.com/milenkovicm"&gt;milenkovicm&lt;/a&gt;, &lt;a href="https://github.com/adriangb"&gt;adriangb&lt;/a&gt; and &lt;a href="https://github.com/kosiew"&gt;kosiew&lt;/a&gt; joined as committers. See the &lt;a href="https://lists.apache.org/list.html?dev@datafusion.apache.org"&gt;mailing list&lt;/a&gt; for more details.&lt;/li&gt;
&lt;li&gt;In the &lt;a href="https://github.com/apache/arrow-datafusion"&gt;core DataFusion repo&lt;/a&gt; alone, we reviewed and accepted over 850 PRs from 172 different
   contributors, created over 669 issues, and closed 379 of them 🚀. All changes are listed in the detailed
   &lt;a href="https://github.com/apache/datafusion/tree/main/dev/changelog"&gt;changelogs&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;DataFusion published a number of blog posts, including &lt;a href="https://datafusion.apache.org/blog/2025/04/19/user-defined-window-functions"&gt;User defined Window Functions&lt;/a&gt;, &lt;a href="https://datafusion.apache.org/blog/2025/06/15/optimizing-sql-dataframes-part-one"&gt;Optimizing SQL (and DataFrames)
   in DataFusion part 1&lt;/a&gt;, &lt;a href="https://datafusion.apache.org/blog/2025/06/15/optimizing-sql-dataframes-part-two"&gt;part 2&lt;/a&gt;, &lt;a href="https://datafusion.apache.org/blog/2025/06/30/cancellation"&gt;Using Rust async for Query Execution and Cancelling Long-Running Queries&lt;/a&gt;, and
   &lt;a href="https://datafusion.apache.org/blog/2025/07/14/user-defined-parquet-indexes/"&gt;Embedding User-Defined Indexes in Apache Parquet Files&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;!--
# Unique committers
$ git shortlog -sn 46.0.0..49.0.0-rc1  .| wc -l
     172
# commits
$ git log --pretty=oneline 46.0.0..49.0.0-rc1 . | wc -l
     884


https://crates.io/crates/datafusion/49.0.0
DataFusion 49 released July 25, 2025

https://crates.io/crates/datafusion/46.0.0
DataFusion 46 released March 7, 2025

Issues created in this time: 290 open, 379 closed = 669 total
https://github.com/apache/datafusion/issues?q=is%3Aissue+created%3A2025-03-07..2025-07-25

Issues closed: 508
https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+closed%3A2025-03-07..2025-07-25

PRs merged in this time 874
https://github.com/apache/arrow-datafusion/pulls?q=is%3Apr+merged%3A2025-03-07..2025-07-25

--&gt;
&lt;h2 id="new-features"&gt;New Features ✨&lt;a class="headerlink" href="#new-features" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id="async-user-defined-functions"&gt;Async User-Defined Functions&lt;a class="headerlink" href="#async-user-defined-functions" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;It is now possible to write &lt;code&gt;async&lt;/code&gt; User-Defined Functions
(UDFs) in DataFusion that perform asynchronous
operations, such as network requests or database queries, without blocking the
execution of the query. This enables new use cases, such as
integrating with large language models (LLMs) or other external services, and we can't
wait to see what the community builds with it.&lt;/p&gt;
&lt;p&gt;See the &lt;a href="https://datafusion.apache.org/library-user-guide/functions/adding-udfs.html"&gt;documentation&lt;/a&gt; for more details and the &lt;a href="https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/async_udf.rs"&gt;async UDF example&lt;/a&gt; for
working code. &lt;/p&gt;
&lt;p&gt;You could, for example, implement a function &lt;code&gt;ask_llm&lt;/code&gt; that asks a large language model
(LLM) service a question based on the content of two columns.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;SELECT * 
FROM animal a
WHERE ask_llm(a.name, 'Is this animal furry?')")
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The implementation of an async UDF is almost identical to a normal
UDF, except that it must implement the &lt;code&gt;AsyncScalarUDFImpl&lt;/code&gt; trait in addition to &lt;code&gt;ScalarUDFImpl&lt;/code&gt; and
provide an &lt;code&gt;async&lt;/code&gt; implementation via &lt;code&gt;invoke_async_with_args&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;#[derive(Debug)]
struct AskLLM {
    signature: Signature,
}

#[async_trait]
impl AsyncScalarUDFImpl for AskLLM {
    /// The `invoke_async_with_args` method is similar to `invoke_with_args`,
    /// but it returns a `Future` that resolves to the result.
    ///
    /// Since this signature is `async`, it can do any `async` operations, such
    /// as network requests.
    async fn invoke_async_with_args(
        &amp;amp;self,
        args: ScalarFunctionArgs,
        options: &amp;amp;ConfigOptions,
    ) -&amp;gt; Result&amp;lt;ArrayRef&amp;gt; {
        // Converts the arguments to arrays for simplicity.
        let args = ColumnarValue::values_to_arrays(&amp;amp;args.args)?;
        let [column_of_interest, question] = take_function_args(self.name(), args)?;
        let client = Client::new();

        // Make a network request to a hypothetical LLM service
        let res = client
            .post(URI)
            .headers(get_llm_headers(options))
            .json(&amp;amp;req)
            .send()
            .await?
            .json::&amp;lt;LLMResponse&amp;gt;()
            .await?;

        let results = extract_results_from_llm_response(&amp;amp;res);

        Ok(Arc::new(results))
    }
}
&lt;/code&gt;&lt;/pre&gt;
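&lt;p&gt;Once defined, the async implementation is wrapped and registered like any other scalar UDF. The snippet below is a sketch based on the linked async UDF example (it assumes an &lt;code&gt;ask_llm&lt;/code&gt; instance of the &lt;code&gt;AskLLM&lt;/code&gt; struct and an existing &lt;code&gt;SessionContext&lt;/code&gt; named &lt;code&gt;ctx&lt;/code&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;// Wrap the AsyncScalarUDFImpl and register it with the SessionContext
let udf = AsyncScalarUDF::new(Arc::new(ask_llm));
ctx.register_udf(udf.into_scalar_udf());
&lt;/code&gt;&lt;/pre&gt;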
&lt;p&gt;(Issue &lt;a href="https://github.com/apache/datafusion/issues/6518"&gt;#6518&lt;/a&gt;,
&lt;a href="https://github.com/apache/datafusion/pull/14837"&gt;PR #14837&lt;/a&gt; from
&lt;a href="https://github.com/goldmedal"&gt;goldmedal&lt;/a&gt; 🏆)&lt;/p&gt;
&lt;h3 id="better-cancellation-for-certain-long-running-queries"&gt;Better Cancellation for Certain Long-Running Queries&lt;a class="headerlink" href="#better-cancellation-for-certain-long-running-queries" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;In rare cases, it was previously not possible to cancel long-running queries,
leading to unresponsiveness. Other projects would likely have fixed this issue
by treating the symptom, but &lt;a href="https://github.com/pepijnve"&gt;pepijnve&lt;/a&gt; and the DataFusion community worked together to
treat the root cause. The general solution required a deep understanding of the
DataFusion execution engine, Rust &lt;code&gt;Streams&lt;/code&gt;, and the tokio cooperative
scheduling model. The &lt;a href="https://github.com/apache/datafusion/pull/16398"&gt;resulting PR&lt;/a&gt; is a model of careful
community engineering and a great example of using Rust's &lt;code&gt;async&lt;/code&gt; ecosystem
to implement complex functionality. It even resulted in a &lt;a href="https://github.com/tokio-rs/tokio/pull/7405"&gt;contribution upstream to tokio&lt;/a&gt;
(since accepted). See the &lt;a href="https://datafusion.apache.org/blog/2025/06/30/cancellation"&gt;blog post&lt;/a&gt; for more details.&lt;/p&gt;
&lt;h3 id="metadata-for-user-defined-types-such-as-variant-and-geometry"&gt;Metadata for User Defined Types such as &lt;code&gt;Variant&lt;/code&gt; and &lt;code&gt;Geometry&lt;/code&gt;&lt;a class="headerlink" href="#metadata-for-user-defined-types-such-as-variant-and-geometry" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;User-defined types have been &lt;a href="https://github.com/apache/datafusion/issues/12644"&gt;a long-requested feature&lt;/a&gt;, and this release provides
the low-level APIs to support them efficiently.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Metadata handling in PRs &lt;a href="https://github.com/apache/datafusion/pull/15646"&gt;#15646&lt;/a&gt; and &lt;a href="https://github.com/apache/datafusion/pull/16170"&gt;#16170&lt;/a&gt; from &lt;a href="https://github.com/timsaucer"&gt;timsaucer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Pushdown of filters and expressions (see "Dynamic Filters and TopK pushdown" section above)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;We still have some work to do to fully support user-defined types, specifically
in documentation and testing, and we would
love your help in this area. If you are interested in contributing,
please see &lt;a href="https://github.com/apache/datafusion/issues/12644"&gt;issue #12644&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="parquet-modular-encryption"&gt;Parquet Modular Encryption&lt;a class="headerlink" href="#parquet-modular-encryption" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;DataFusion now supports reading and writing encrypted &lt;a href="https://parquet.apache.org/"&gt;Apache Parquet&lt;/a&gt; files with &lt;a href="https://parquet.apache.org/docs/file-format/data-pages/encryption/"&gt;modular
encryption&lt;/a&gt;. This allows users to encrypt specific columns in a Parquet file
using different keys, while still being able to read data without needing to
decrypt the entire file.&lt;/p&gt;
&lt;p&gt;Here is an example of how to configure DataFusion to read an encrypted Parquet
table with two columns, &lt;code&gt;double_field&lt;/code&gt; and &lt;code&gt;float_field&lt;/code&gt;, using modular
encryption:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;CREATE EXTERNAL TABLE encrypted_parquet_table
(
double_field double,
float_field float
)
STORED AS PARQUET LOCATION 'pq/' OPTIONS (
    -- encryption
    'format.crypto.file_encryption.encrypt_footer' 'true',
    'format.crypto.file_encryption.footer_key_as_hex' '30313233343536373839303132333435',  -- b"0123456789012345"
    'format.crypto.file_encryption.column_key_as_hex::double_field' '31323334353637383930313233343530', -- b"1234567890123450"
    'format.crypto.file_encryption.column_key_as_hex::float_field' '31323334353637383930313233343531', -- b"1234567890123451"
    -- decryption
    'format.crypto.file_decryption.footer_key_as_hex' '30313233343536373839303132333435', -- b"0123456789012345"
    'format.crypto.file_decryption.column_key_as_hex::double_field' '31323334353637383930313233343530', -- b"1234567890123450"
    'format.crypto.file_decryption.column_key_as_hex::float_field' '31323334353637383930313233343531', -- b"1234567890123451"
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(&lt;a href="https://github.com/apache/datafusion/issues/15216"&gt;Issue #15216&lt;/a&gt;,
&lt;a href="https://github.com/apache/datafusion/pull/16351"&gt;PR #16351&lt;/a&gt;
from &lt;a href="https://github.com/corwinjoy"&gt;corwinjoy&lt;/a&gt; and &lt;a href="https://github.com/adamreeve"&gt;adamreeve&lt;/a&gt;)&lt;/p&gt;
&lt;h3 id="support-for-within-group-for-ordered-set-aggregate-functions"&gt;Support for &lt;code&gt;WITHIN GROUP&lt;/code&gt; for Ordered-Set Aggregate Functions&lt;a class="headerlink" href="#support-for-within-group-for-ordered-set-aggregate-functions" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;DataFusion now supports the &lt;code&gt;WITHIN GROUP&lt;/code&gt; clause for &lt;a href="https://www.postgresql.org/docs/9.4/functions-aggregate.html#FUNCTIONS-ORDEREDSET-TABLE"&gt;ordered-set aggregate
functions&lt;/a&gt; such as &lt;code&gt;approx_percentile_cont&lt;/code&gt;, &lt;code&gt;percentile_cont&lt;/code&gt;, and
&lt;code&gt;percentile_disc&lt;/code&gt;, which lets users specify the ordering of the input rows the aggregate is computed over.&lt;/p&gt;
&lt;p&gt;For example, the following query computes the 50th percentile for the &lt;code&gt;temperature&lt;/code&gt; column
in the &lt;code&gt;city_data&lt;/code&gt; table, ordered by &lt;code&gt;date&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;SELECT
    percentile_disc(0.5) WITHIN GROUP (ORDER BY date) AS median_temperature
FROM city_data;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(Issue &lt;a href="https://github.com/apache/datafusion/issues/11732"&gt;#11732&lt;/a&gt;, 
PR &lt;a href="https://github.com/apache/datafusion/pull/13511"&gt;#13511&lt;/a&gt;,
by &lt;a href="https://github.com/Garamda"&gt;Garamda&lt;/a&gt;)&lt;/p&gt;
&lt;h3 id="compressed-spill-files"&gt;Compressed Spill Files&lt;a class="headerlink" href="#compressed-spill-files" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;DataFusion now supports compressing the files written to disk when spilling
larger-than-memory datasets while sorting and grouping. Using compression
can significantly reduce the
size of the intermediate files and improve performance when reading them back into memory.&lt;/p&gt;
&lt;p&gt;(Issue &lt;a href="https://github.com/apache/datafusion/issues/16130"&gt;#16130&lt;/a&gt;,
PR &lt;a href="https://github.com/apache/datafusion/pull/16268"&gt;#16268&lt;/a&gt;
by &lt;a href="https://github.com/ding-young"&gt;ding-young&lt;/a&gt;)&lt;/p&gt;
&lt;h3 id="support-for-regex_instr-function"&gt;Support for &lt;code&gt;REGEX_INSTR&lt;/code&gt; function&lt;a class="headerlink" href="#support-for-regex_instr-function" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;DataFusion now supports the [&lt;code&gt;REGEXP_INSTR&lt;/code&gt; function], which returns the position of a
regular expression match within a string.&lt;/p&gt;
&lt;p&gt;For example, to find the position of the first match of the regular expression
&lt;code&gt;C(.)(..)&lt;/code&gt; in the string &lt;code&gt;ABCDEF&lt;/code&gt;, you can use:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;&amp;gt; SELECT regexp_instr('ABCDEF', 'C(.)(..)');
+---------------------------------------------------------------+
| regexp_instr(Utf8("ABCDEF"),Utf8("C(.)(..)"))                 |
+---------------------------------------------------------------+
| 3                                                             |
+---------------------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(&lt;a href="https://github.com/apache/datafusion/issues/13009"&gt;Issue #13009&lt;/a&gt;,
&lt;a href="https://github.com/apache/datafusion/pull/15928"&gt;PR #15928&lt;/a&gt;
by &lt;a href="https://github.com/nirnayroy"&gt;nirnayroy&lt;/a&gt;)&lt;/p&gt;
&lt;h2 id="upgrade-guide-and-changelog"&gt;Upgrade Guide and Changelog&lt;a class="headerlink" href="#upgrade-guide-and-changelog" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Upgrading to 49.0.0 should be straightforward for most users. Please review the
&lt;a href="https://datafusion.apache.org/library-user-guide/upgrading.html"&gt;Upgrade Guide&lt;/a&gt;
for details on breaking changes and code snippets to help with the transition.
Recently, some users have reported success automatically upgrading DataFusion by
pairing AI tools with the upgrade guide. For a comprehensive list of all changes,
please refer to the &lt;a href="https://github.com/apache/datafusion/blob/branch-49/dev/changelog/49.0.0.md"&gt;changelog&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="about-datafusion"&gt;About DataFusion&lt;a class="headerlink" href="#about-datafusion" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; is an extensible query engine, written in &lt;a href="https://www.rust-lang.org/"&gt;Rust&lt;/a&gt;, that
uses &lt;a href="https://arrow.apache.org"&gt;Apache Arrow&lt;/a&gt; as its in-memory format. DataFusion is used by developers to
create new, fast, data-centric systems such as databases, dataframe libraries,
and machine learning and streaming applications. While &lt;a href="https://datafusion.apache.org/user-guide/introduction.html#project-goals"&gt;DataFusion’s primary design
goal&lt;/a&gt; is to accelerate the creation of other data-centric systems, it provides a
reasonable experience directly out of the box as a &lt;a href="https://datafusion.apache.org/user-guide/dataframe.html"&gt;dataframe library&lt;/a&gt;,
&lt;a href="https://datafusion.apache.org/python/"&gt;python library&lt;/a&gt;, and [command-line SQL tool].&lt;/p&gt;
&lt;p&gt;DataFusion's core thesis is that as a community, together we can build much more
advanced technology than any of us as individuals or companies could do alone.
Without DataFusion, highly performant vectorized query engines would remain
the domain of a few large companies and world-class research institutions.
With DataFusion, we can all build on top of a shared foundation and focus on
what makes our projects unique.&lt;/p&gt;
&lt;h2 id="how-to-get-involved"&gt;How to Get Involved&lt;a class="headerlink" href="#how-to-get-involved" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;DataFusion is not a project built or driven by a single person, company, or
foundation. Rather, our community of users and contributors works together to
build a shared technology that none of us could have built alone.&lt;/p&gt;
&lt;p&gt;If you are interested in joining us, we would love to have you. You can try out
DataFusion on some of your own data and projects and let us know how it goes,
contribute suggestions, documentation, bug reports, or a PR with documentation,
tests, or code. A list of open issues suitable for beginners is &lt;a href="https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22"&gt;here&lt;/a&gt;, and you
can find out how to reach us on the &lt;a href="https://datafusion.apache.org/contributor-guide/communication.html"&gt;communication doc&lt;/a&gt;.&lt;/p&gt;</content><category term="blog"/></entry><entry><title>Apache DataFusion 48.0.0 Released</title><link href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0" rel="alternate"/><published>2025-07-16T00:00:00+00:00</published><updated>2025-07-16T00:00:00+00:00</updated><author><name>PMC</name></author><id>tag:datafusion.apache.org,2025-07-16:/blog/2025/07/16/datafusion-48.0.0</id><summary type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;!-- see https://github.com/apache/datafusion/issues/16347 for details --&gt;
&lt;p&gt;We’re excited to announce the release of &lt;strong&gt;Apache DataFusion 48.0.0&lt;/strong&gt;! As always, this version packs in a wide range of 
improvements and fixes. You can find the complete details in the full 
&lt;a href="https://github.com/apache/datafusion/blob/branch-48/dev/changelog/48.0.0.md"&gt;changelog&lt;/a&gt;. We’ll highlight the most
important changes below and guide you through upgrading.&lt;/p&gt;
&lt;h2 id="breaking-changes"&gt;Breaking …&lt;/h2&gt;</summary><content type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;!-- see https://github.com/apache/datafusion/issues/16347 for details --&gt;
&lt;p&gt;We’re excited to announce the release of &lt;strong&gt;Apache DataFusion 48.0.0&lt;/strong&gt;! As always, this version packs in a wide range of 
improvements and fixes. You can find the complete details in the full 
&lt;a href="https://github.com/apache/datafusion/blob/branch-48/dev/changelog/48.0.0.md"&gt;changelog&lt;/a&gt;. We’ll highlight the most
important changes below and guide you through upgrading.&lt;/p&gt;
&lt;h2 id="breaking-changes"&gt;Breaking Changes&lt;a class="headerlink" href="#breaking-changes" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;DataFusion 48.0.0 brings a few &lt;strong&gt;breaking changes&lt;/strong&gt; that may require adjustments to your code as described in
the &lt;a href="https://datafusion.apache.org/library-user-guide/upgrading.html#datafusion-48-0-0"&gt;Upgrade Guide&lt;/a&gt;. Here are the most notable ones:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;datafusion.execution.collect_statistics&lt;/code&gt; defaults to &lt;code&gt;true&lt;/code&gt;: In DataFusion 48.0.0, the default value of this &lt;a href="https://datafusion.apache.org/user-guide/configs.html"&gt;configuration setting&lt;/a&gt; is now true, and DataFusion will collect and store statistics when a table is first created via &lt;code&gt;CREATE EXTERNAL TABLE&lt;/code&gt; or one of the &lt;code&gt;DataFrame::register_*&lt;/code&gt; APIs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;Expr::Literal&lt;/code&gt; has optional metadata: The &lt;code&gt;Expr::Literal&lt;/code&gt; variant now includes optional metadata, which allows 
  for carrying through Arrow field metadata to support extension types and other uses. This means code such as&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;match expr {
...
  Expr::Literal(scalar) =&amp;gt; ...
...
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Should be updated to:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;match expr {
...
  Expr::Literal(scalar, _metadata) =&amp;gt; ...
...
}
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;Expr::WindowFunction&lt;/code&gt; is now Boxed: &lt;code&gt;Expr::WindowFunction&lt;/code&gt; is now a &lt;code&gt;Box&amp;lt;WindowFunction&amp;gt;&lt;/code&gt; instead of a &lt;code&gt;WindowFunction&lt;/code&gt; 
  directly. This change was made to reduce the size of &lt;code&gt;Expr&lt;/code&gt; and improve performance when planning queries 
  (see details on &lt;a href="https://github.com/apache/datafusion/pull/16207"&gt;#16207&lt;/a&gt;).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;UDFs changed to use &lt;code&gt;FieldRef&lt;/code&gt; instead of &lt;code&gt;DataType&lt;/code&gt;: To support metadata handling and 
  prepare for extension types, UDF traits now use &lt;a href="https://docs.rs/arrow/latest/arrow/datatypes/type.FieldRef.html"&gt;FieldRef&lt;/a&gt; rather than a &lt;code&gt;DataType&lt;/code&gt;
  and nullability. &lt;code&gt;FieldRef&lt;/code&gt; contains the type and nullability, and additionally allows access to 
  metadata fields, which can be used for extension types.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Physical Expressions return &lt;code&gt;Field&lt;/code&gt;: Similarly to UDFs, in order to prepare for extension type support the 
  &lt;a href="https://docs.rs/datafusion/latest/datafusion/physical_expr/trait.PhysicalExpr.html"&gt;PhysicalExpr&lt;/a&gt; trait has been changed to return &lt;a href="https://docs.rs/arrow/latest/arrow/datatypes/struct.Field.html"&gt;Field&lt;/a&gt; rather than &lt;code&gt;DataType&lt;/code&gt;. To upgrade structs that 
  implement &lt;code&gt;PhysicalExpr&lt;/code&gt;, you need to implement the &lt;code&gt;return_field&lt;/code&gt; function.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;FileFormat::supports_filters_pushdown&lt;/code&gt; was replaced with &lt;code&gt;FileSource::try_pushdown_filters&lt;/code&gt; to support upcoming work to push down dynamic filters and physical filter pushdown. &lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;ParquetExec&lt;/code&gt;, &lt;code&gt;AvroExec&lt;/code&gt;, &lt;code&gt;CsvExec&lt;/code&gt;, &lt;code&gt;JsonExec&lt;/code&gt; removed: &lt;code&gt;ParquetExec&lt;/code&gt;, &lt;code&gt;AvroExec&lt;/code&gt;, &lt;code&gt;CsvExec&lt;/code&gt;, and &lt;code&gt;JsonExec&lt;/code&gt;
  were deprecated in DataFusion 46 and are removed in DataFusion 48.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
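&lt;p&gt;For the first change above, users who depend on the old behavior (for example, when registering tables
over many files must stay cheap) can restore it by flipping the setting back:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;-- revert to the pre-48.0.0 default of not collecting statistics
-- at table creation time
SET datafusion.execution.collect_statistics = false;
&lt;/code&gt;&lt;/pre&gt;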
&lt;h2 id="performance-improvements"&gt;Performance Improvements&lt;a class="headerlink" href="#performance-improvements" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;DataFusion 48.0.0 comes with some noteworthy performance enhancements:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Fewer unnecessary projections:&lt;/strong&gt; DataFusion now removes additional unnecessary &lt;code&gt;Projection&lt;/code&gt;s in queries. (PRs &lt;a href="https://github.com/apache/datafusion/pull/15787"&gt;#15787&lt;/a&gt;, &lt;a href="https://github.com/apache/datafusion/pull/15761"&gt;#15761&lt;/a&gt;,
  and &lt;a href="https://github.com/apache/datafusion/pull/15746"&gt;#15746&lt;/a&gt; by &lt;a href="https://github.com/xudong963"&gt;xudong963&lt;/a&gt;).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Accelerated string functions&lt;/strong&gt;: The &lt;code&gt;ascii&lt;/code&gt; function was optimized to significantly improve its performance
  (PR &lt;a href="https://github.com/apache/datafusion/pull/16087"&gt;#16087&lt;/a&gt; by &lt;a href="https://github.com/tlm365"&gt;tlm365&lt;/a&gt;). The &lt;code&gt;character_length&lt;/code&gt; function was optimized resulting in 
  &lt;a href="https://github.com/apache/datafusion/pull/15931#issuecomment-2848561984"&gt;up to 3x&lt;/a&gt; performance improvement (PR &lt;a href="https://github.com/apache/datafusion/pull/15931"&gt;#15931&lt;/a&gt; by &lt;a href="https://github.com/Dandandan"&gt;Dandandan&lt;/a&gt;)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Constant aggregate window expressions:&lt;/strong&gt; For unbounded aggregate window functions the result is the 
  same for all rows within a partition. DataFusion 48.0.0 avoids unnecessary computation for such queries, resulting in &lt;a href="https://github.com/apache/datafusion/pull/16234#issuecomment-2935960865"&gt;improved performance by 5.6x&lt;/a&gt;
  (PR &lt;a href="https://github.com/apache/datafusion/pull/16234"&gt;#16234&lt;/a&gt; by &lt;a href="https://github.com/suibianwanwank"&gt;suibianwanwank&lt;/a&gt;)&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
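&lt;p&gt;As an illustration of the constant window optimization above, an aggregate window function with an
unbounded frame and no &lt;code&gt;ORDER BY&lt;/code&gt; produces a single value per partition, so it only needs to be
computed once per partition rather than once per row (the &lt;code&gt;sales&lt;/code&gt; table here is hypothetical):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;-- every row within a region receives the same region_total
SELECT region, amount,
       sum(amount) OVER (PARTITION BY region) AS region_total
FROM sales;
&lt;/code&gt;&lt;/pre&gt;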
&lt;h2 id="highlighted-new-features"&gt;Highlighted New Features&lt;a class="headerlink" href="#highlighted-new-features" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id="new-datafusion-spark-crate"&gt;New &lt;code&gt;datafusion-spark&lt;/code&gt; crate&lt;a class="headerlink" href="#new-datafusion-spark-crate" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The DataFusion community has requested &lt;a href="https://spark.apache.org"&gt;Apache Spark&lt;/a&gt;-compatible functions for many years, but the current built-in function library is most similar to PostgreSQL, which leads to friction. Unfortunately, there are even functions with the same name but different signatures and/or return types in the two systems.&lt;/p&gt;
&lt;p&gt;One of the many uses of DataFusion is to enhance (e.g. &lt;a href="https://github.com/apache/datafusion-comet"&gt;Apache DataFusion Comet&lt;/a&gt;) 
or replace (e.g. &lt;a href="https://github.com/lakehq/sail"&gt;Sail&lt;/a&gt;) &lt;a href="https://spark.apache.org/"&gt;Apache Spark&lt;/a&gt;. To 
support the community requests and the use cases mentioned above, we have introduced a new
&lt;a href="https://crates.io/crates/datafusion-spark"&gt;datafusion-spark&lt;/a&gt; crate for DataFusion with spark-compatible functions so the 
community can collaborate to build this shared resource. There are several hundred functions to implement, and we are looking for help to &lt;a href="https://github.com/apache/datafusion/issues/15914"&gt;complete datafusion-spark Spark Compatible Functions&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To register all functions in &lt;code&gt;datafusion-spark&lt;/code&gt; you can use:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-Rust"&gt;    // Create a new session context
    let mut ctx = SessionContext::new();
    // register all spark functions with the context
    datafusion_spark::register_all(&amp;amp;mut ctx)?;
    // run a query. Note the `sha2` function is now available which
    // has Spark semantics
    let df = ctx.sql("SELECT sha2('The input String', 256)").await?;
    ...
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or, to use an individual function, you can do:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-Rust"&gt;use datafusion_expr::{col, lit};
use datafusion_spark::expr_fn::sha2;
// Create the expression `sha2(my_data, 256)`
let expr = sha2(col("my_data"), lit(256));
...
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Thanks to &lt;a href="https://github.com/shehabgamin"&gt;shehabgamin&lt;/a&gt; for the initial PR &lt;a href="https://github.com/apache/datafusion/pull/15168"&gt;#15168&lt;/a&gt; 
and many others for their help adding additional functions. Please consider 
helping &lt;a href="https://github.com/apache/datafusion/issues/15914"&gt;complete datafusion-spark Spark Compatible Functions&lt;/a&gt;. &lt;/p&gt;
&lt;h3 id="order-by-all-sql-support"&gt;&lt;code&gt;ORDER BY ALL sql&lt;/code&gt; support&lt;a class="headerlink" href="#order-by-all-sql-support" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Inspired by &lt;a href="https://duckdb.org/docs/stable/sql/query_syntax/orderby.html#order-by-all-examples"&gt;DuckDB&lt;/a&gt;, DataFusion 48.0.0 adds support for &lt;code&gt;ORDER BY ALL&lt;/code&gt;. This allows for easy ordering of all columns in a query:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;&amp;gt; set datafusion.sql_parser.dialect = 'DuckDB';
0 row(s) fetched.
&amp;gt; CREATE OR REPLACE TABLE addresses AS
    SELECT '123 Quack Blvd' AS address, 'DuckTown' AS city, '11111' AS zip
    UNION ALL
    SELECT '111 Duck Duck Goose Ln', 'DuckTown', '11111'
    UNION ALL
    SELECT '111 Duck Duck Goose Ln', 'Duck Town', '11111'
    UNION ALL
    SELECT '111 Duck Duck Goose Ln', 'Duck Town', '11111-0001';
0 row(s) fetched.
&amp;gt; SELECT * FROM addresses ORDER BY ALL;
+------------------------+-----------+------------+
| address                | city      | zip        |
+------------------------+-----------+------------+
| 111 Duck Duck Goose Ln | Duck Town | 11111      |
| 111 Duck Duck Goose Ln | Duck Town | 11111-0001 |
| 111 Duck Duck Goose Ln | DuckTown  | 11111      |
| 123 Quack Blvd         | DuckTown  | 11111      |
+------------------------+-----------+------------+
4 row(s) fetched.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Thanks to &lt;a href="https://github.com/PokIsemaine"&gt;PokIsemaine&lt;/a&gt; for PR &lt;a href="https://github.com/apache/datafusion/pull/15772"&gt;#15772&lt;/a&gt;&lt;/p&gt;
&lt;h3 id="ffi-support-for-aggregateudf-and-windowudf"&gt;FFI Support for &lt;code&gt;AggregateUDF&lt;/code&gt; and &lt;code&gt;WindowUDF&lt;/code&gt;&lt;a class="headerlink" href="#ffi-support-for-aggregateudf-and-windowudf" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;This improvement allows for using user defined aggregate and user defined window functions across FFI boundaries, which enables shared libraries to pass functions back and forth. This feature unlocks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Modules to provide DataFusion based FFI aggregates that can be reused in projects such as &lt;a href="https://github.com/apache/datafusion-python"&gt;datafusion-python&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Using the same aggregate and window functions without recompiling with different DataFusion versions.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This completes the work to add support for all UDF types to DataFusion's FFI bindings. Thanks to &lt;a href="https://github.com/timsaucer"&gt;timsaucer&lt;/a&gt;
for PRs &lt;a href="https://github.com/apache/datafusion/pull/16261"&gt;#16261&lt;/a&gt; and &lt;a href="https://github.com/apache/datafusion/pull/14775"&gt;#14775&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="reduced-size-of-expr-struct"&gt;Reduced size of &lt;code&gt;Expr&lt;/code&gt; struct&lt;a class="headerlink" href="#reduced-size-of-expr-struct" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The &lt;a href="https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.Expr.html"&gt;Expr&lt;/a&gt; struct is widely used across the DataFusion and downstream codebases. By &lt;code&gt;Box&lt;/code&gt;ing &lt;code&gt;WindowFunction&lt;/code&gt;s,  we reduced the size of &lt;code&gt;Expr&lt;/code&gt; by almost 50%, from &lt;code&gt;272&lt;/code&gt; to &lt;code&gt;144&lt;/code&gt; bytes. This reduction improved planning times between 10% and 20% and reduced memory usage. Thanks to &lt;a href="https://github.com/hendrikmakait"&gt;hendrikmakait&lt;/a&gt; for 
PR &lt;a href="https://github.com/apache/datafusion/pull/16207"&gt;#16207&lt;/a&gt;&lt;/p&gt;
&lt;h2 id="upgrade-guide-and-changelog"&gt;Upgrade Guide and Changelog&lt;a class="headerlink" href="#upgrade-guide-and-changelog" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Upgrading to 48.0.0 should be straightforward for most users, but do review
the &lt;a href="https://datafusion.apache.org/library-user-guide/upgrading.html#datafusion-48-0-0"&gt;Upgrade Guide for DataFusion 48.0.0&lt;/a&gt; for detailed
steps and code changes. The upgrade guide covers the breaking changes mentioned above and provides code snippets to help with the
transition. For a comprehensive list of all changes, please refer to the &lt;a href="https://github.com/apache/datafusion/blob/branch-48/dev/changelog/48.0.0.md"&gt;changelog&lt;/a&gt; 
for the 48.0.0 release. The changelog enumerates every merged PR in this release, including many smaller fixes and improvements 
that we couldn’t cover in this post.&lt;/p&gt;
&lt;h2 id="get-involved"&gt;Get Involved&lt;a class="headerlink" href="#get-involved" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Apache DataFusion is an open-source project, and we welcome involvement from anyone interested. Now is a great time to
take 48.0.0 for a spin: try it out on your workloads, and let us know if you encounter any issues or have suggestions.
You can report bugs or request features on our GitHub issue tracker, or better yet, submit a pull request. Join our
community discussions – whether you have questions, want to share how you’re using DataFusion, or are looking to
contribute, we’d love to hear from you. A list of open issues suitable for beginners
is &lt;a href="https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22"&gt;here&lt;/a&gt; and you
can find how to reach us on the &lt;a href="https://datafusion.apache.org/contributor-guide/communication.html"&gt;communication doc&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Happy querying!&lt;/p&gt;</content><category term="blog"/></entry><entry><title>Apache DataFusion 47.0.0 Released</title><link href="https://datafusion.apache.org/blog/2025/07/11/datafusion-47.0.0" rel="alternate"/><published>2025-07-11T00:00:00+00:00</published><updated>2025-07-11T00:00:00+00:00</updated><author><name>PMC</name></author><id>tag:datafusion.apache.org,2025-07-11:/blog/2025/07/11/datafusion-47.0.0</id><summary type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;!-- see https://github.com/apache/datafusion/issues/16347 for details --&gt;
&lt;p&gt;We’re excited to announce the release of &lt;strong&gt;Apache DataFusion 47.0.0&lt;/strong&gt;! This new version represents a significant
milestone for the project, packing in a wide range of improvements and fixes. You can find the complete details in the
full &lt;a href="https://github.com/apache/datafusion/blob/branch-47/dev/changelog/47.0.0.md"&gt;changelog&lt;/a&gt;. We’ll highlight the most
important changes below …&lt;/p&gt;</summary><content type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;!-- see https://github.com/apache/datafusion/issues/16347 for details --&gt;
&lt;p&gt;We’re excited to announce the release of &lt;strong&gt;Apache DataFusion 47.0.0&lt;/strong&gt;! This new version represents a significant
milestone for the project, packing in a wide range of improvements and fixes. You can find the complete details in the
full &lt;a href="https://github.com/apache/datafusion/blob/branch-47/dev/changelog/47.0.0.md"&gt;changelog&lt;/a&gt;. We’ll highlight the most
important changes below and guide you through upgrading.&lt;/p&gt;
&lt;p&gt;Note that DataFusion 47.0.0 was released in April 2025, but we are only now publishing the blog post due to 
limited bandwidth in the DataFusion community. We apologize for the delay and encourage you to come help us
accelerate the next release and announcements 
by &lt;a href="https://datafusion.apache.org/contributor-guide/communication.html"&gt;joining the community&lt;/a&gt;  🎣.&lt;/p&gt;
&lt;h2 id="breaking-changes"&gt;Breaking Changes&lt;a class="headerlink" href="#breaking-changes" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;DataFusion 47.0.0 brings a few &lt;strong&gt;breaking changes&lt;/strong&gt; that may require adjustments to your code as described in
the &lt;a href="https://datafusion.apache.org/library-user-guide/upgrading.html#datafusion-47-0-0"&gt;Upgrade Guide&lt;/a&gt;. Here are some notable ones:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion/pull/15466"&gt;Upgrades to arrow-rs and arrow-parquet 55.0.0 and object_store 0.12.0&lt;/a&gt;:
  Several APIs changed in the underlying &lt;code&gt;arrow&lt;/code&gt;, &lt;code&gt;parquet&lt;/code&gt; and &lt;code&gt;object_store&lt;/code&gt; libraries to use &lt;code&gt;u64&lt;/code&gt; instead of &lt;code&gt;usize&lt;/code&gt; to better support
  WASM. This requires occasionally converting between &lt;code&gt;usize&lt;/code&gt; and &lt;code&gt;u64&lt;/code&gt;, as well as changes to &lt;code&gt;ObjectStore&lt;/code&gt; implementations such as:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class="language-Rust"&gt;impl ObjectStore {
    ...

    // The range is now a u64 instead of usize
    async fn get_range(&amp;amp;self, location: &amp;amp;Path, range: Range&amp;lt;u64&amp;gt;) -&amp;gt; ObjectStoreResult&amp;lt;Bytes&amp;gt; {
        self.inner.get_range(location, range).await
    }

    ...

    // the lifetime is now 'static instead of '_ (meaning the captured closure can't contain references)
    // (this also applies to list_with_offset)
    fn list(&amp;amp;self, prefix: Option&amp;lt;&amp;amp;Path&amp;gt;) -&amp;gt; BoxStream&amp;lt;'static, ObjectStoreResult&amp;lt;ObjectMeta&amp;gt;&amp;gt; {
        self.inner.list(prefix)
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion/issues/14914"&gt;DisplayFormatType::TreeRender&lt;/a&gt;:
  Implementations of &lt;code&gt;ExecutionPlan&lt;/code&gt; must also provide a description in the &lt;code&gt;DisplayFormatType::TreeRender&lt;/code&gt; format to
  provide support for the new &lt;a href="https://datafusion.apache.org/user-guide/sql/explain.html#tree-format-default"&gt;tree style explains&lt;/a&gt;.
  This can be the same as the existing &lt;code&gt;DisplayFormatType::Default&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="performance-improvements"&gt;Performance Improvements&lt;a class="headerlink" href="#performance-improvements" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;DataFusion 47.0.0 comes with numerous performance enhancements across the board. Here are some of the noteworthy
optimizations in this release:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;FIRST_VALUE&lt;/code&gt; and &lt;code&gt;LAST_VALUE&lt;/code&gt;:&lt;/strong&gt; The &lt;code&gt;FIRST_VALUE&lt;/code&gt; and &lt;code&gt;LAST_VALUE&lt;/code&gt; functions now execute much faster on high-cardinality data, such as data with many groups or partitions. DataFusion 47.0.0 executes the following in &lt;strong&gt;7 seconds&lt;/strong&gt; compared to &lt;strong&gt;36 seconds&lt;/strong&gt; in DataFusion 46.0.0: &lt;code&gt;select id2, id4, first_value(v1 order by id2, id4) as r2 from '~/h2o_100m.parquet' group by id2, id4&lt;/code&gt; (h2o.ai dataset). (PRs &lt;a href="https://github.com/apache/datafusion/pull/15266"&gt;#15266&lt;/a&gt;
  and &lt;a href="https://github.com/apache/datafusion/pull/15542"&gt;#15542&lt;/a&gt; by &lt;a href="https://github.com/UBarney"&gt;UBarney&lt;/a&gt;).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;MIN&lt;/code&gt;, &lt;code&gt;MAX&lt;/code&gt; and &lt;code&gt;AVG&lt;/code&gt; for Durations:&lt;/strong&gt;  DataFusion executes aggregate queries up to 2.5x faster when they include &lt;code&gt;MIN&lt;/code&gt;, &lt;code&gt;MAX&lt;/code&gt; and &lt;code&gt;AVG&lt;/code&gt; on &lt;code&gt;Duration&lt;/code&gt; columns. 
  (PRs &lt;a href="https://github.com/apache/datafusion/pull/15322"&gt;#15322&lt;/a&gt; and &lt;a href="https://github.com/apache/datafusion/pull/15748"&gt;#15748&lt;/a&gt;
  by &lt;a href="https://github.com/shruti2522"&gt;shruti2522&lt;/a&gt;).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Short circuit evaluation for &lt;code&gt;AND&lt;/code&gt; and &lt;code&gt;OR&lt;/code&gt;:&lt;/strong&gt; DataFusion now eagerly skips the evaluation of
  the right operand if the left is known to be false (&lt;code&gt;AND&lt;/code&gt;) or true (&lt;code&gt;OR&lt;/code&gt;) in certain cases. For complex predicates, such as those with many &lt;code&gt;LIKE&lt;/code&gt; or &lt;code&gt;CASE&lt;/code&gt; expressions, this optimization results in
  &lt;a href="https://github.com/apache/datafusion/issues/11212#issuecomment-2753584617"&gt;significant performance improvements&lt;/a&gt; (up to 100x in extreme cases).
  (PRs &lt;a href="https://github.com/apache/datafusion/pull/15462"&gt;#15462&lt;/a&gt; and &lt;a href="https://github.com/apache/datafusion/pull/15694"&gt;#15694&lt;/a&gt;
  by &lt;a href="https://github.com/acking-you"&gt;acking-you&lt;/a&gt;).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;TopK optimization for partially sorted input:&lt;/strong&gt; Previous versions of DataFusion implemented early termination
  optimization (TopK) for fully sorted data. DataFusion 47.0.0 extends the optimization for partially sorted data, which is common in many real-world datasets, such as time-series data sorted by day but not within each day. 
  (PR &lt;a href="https://github.com/apache/datafusion/pull/15563"&gt;#15563&lt;/a&gt; by &lt;a href="https://github.com/geoffreyclaude"&gt;geoffreyclaude&lt;/a&gt;).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Disable re-validation of spilled files:&lt;/strong&gt; DataFusion no longer re-validates temporary spill files when reading them back. This validation is expensive and unnecessary, as the data is known to be valid when it is written out
  (PR &lt;a href="https://github.com/apache/datafusion/pull/15454"&gt;#15454&lt;/a&gt; by &lt;a href="https://github.com/zebsme"&gt;zebsme&lt;/a&gt;).&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
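&lt;p&gt;As an illustrative sketch of the short circuit evaluation above (the table and column names here are hypothetical), the cheap comparison on the left of the &lt;code&gt;AND&lt;/code&gt; can make evaluation of the expensive pattern match on the right unnecessary:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;-- When the left comparison is already known to be false,
-- the LIKE pattern on the right can be skipped entirely
SELECT count(*)
FROM events
WHERE status = 'active' AND payload LIKE '%connection%timeout%';
&lt;/code&gt;&lt;/pre&gt;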
&lt;h2 id="highlighted-new-features"&gt;Highlighted New Features&lt;a class="headerlink" href="#highlighted-new-features" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id="tree-style-explains"&gt;Tree style explains&lt;a class="headerlink" href="#tree-style-explains" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;In previous releases the &lt;a href="https://datafusion.apache.org/user-guide/sql/explain.html"&gt;EXPLAIN statement&lt;/a&gt; produced a formatted table
which was succinct and contained important details for implementers, but was often hard to read,
especially for queries that included joins or unions with multiple children.&lt;/p&gt;
&lt;p&gt;DataFusion 47.0.0 includes the new &lt;code&gt;EXPLAIN FORMAT TREE&lt;/code&gt; output (the default in
&lt;code&gt;datafusion-cli&lt;/code&gt;), rendered in a visual tree style that is much easier to
understand at a glance.&lt;/p&gt;
&lt;!-- SQL setup 
create table t1(ti int) as values (1), (2), (3);
create table t2(ti int) as values (1), (2), (3);
--&gt;
&lt;p&gt;Example of the new explain output:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;&amp;gt; explain select * from t1 inner join t2 on t1.ti=t2.ti;
+---------------+------------------------------------------------------------+
| plan_type     | plan                                                       |
+---------------+------------------------------------------------------------+
| physical_plan | ┌───────────────────────────┐                              |
|               | │    CoalesceBatchesExec    │                              |
|               | │    --------------------   │                              |
|               | │     target_batch_size:    │                              |
|               | │            8192           │                              |
|               | └─────────────┬─────────────┘                              |
|               | ┌─────────────┴─────────────┐                              |
|               | │        HashJoinExec       │                              |
|               | │    --------------------   ├──────────────┐               |
|               | │       on: (ti = ti)       │              │               |
|               | └─────────────┬─────────────┘              │               |
|               | ┌─────────────┴─────────────┐┌─────────────┴─────────────┐ |
|               | │       DataSourceExec      ││       DataSourceExec      │ |
|               | │    --------------------   ││    --------------------   │ |
|               | │         bytes: 112        ││         bytes: 112        │ |
|               | │       format: memory      ││       format: memory      │ |
|               | │          rows: 1          ││          rows: 1          │ |
|               | └───────────────────────────┘└───────────────────────────┘ |
|               |                                                            |
+---------------+------------------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Example of the &lt;code&gt;EXPLAIN FORMAT INDENT&lt;/code&gt; output for the same query:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;&amp;gt; explain format indent select * from t1 inner join t2 on t1.ti=t2.ti;
+---------------+----------------------------------------------------------------------+
| plan_type     | plan                                                                 |
+---------------+----------------------------------------------------------------------+
| logical_plan  | Inner Join: t1.ti = t2.ti                                            |
|               |   TableScan: t1 projection=[ti]                                      |
|               |   TableScan: t2 projection=[ti]                                      |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192                          |
|               |   HashJoinExec: mode=CollectLeft, join_type=Inner, on=[(ti@0, ti@0)] |
|               |     DataSourceExec: partitions=1, partition_sizes=[1]                |
|               |     DataSourceExec: partitions=1, partition_sizes=[1]                |
|               |                                                                      |
+---------------+----------------------------------------------------------------------+
2 row(s) fetched.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Thanks to &lt;a href="https://github.com/irenjj"&gt;irenjj&lt;/a&gt; for the initial work in PR &lt;a href="https://github.com/apache/datafusion/pull/14677"&gt;#14677&lt;/a&gt;
and many others for completing the &lt;a href="https://github.com/apache/datafusion/issues/14914"&gt;follow-up epic&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="sql-varchar-defaults-to-utf8view"&gt;SQL &lt;code&gt;VARCHAR&lt;/code&gt; defaults to Utf8View&lt;a class="headerlink" href="#sql-varchar-defaults-to-utf8view" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;In previous releases, when a &lt;code&gt;varchar&lt;/code&gt; column was created in SQL it was mapped to the &lt;a href="https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html#variant.Utf8"&gt;Utf8 Arrow data type&lt;/a&gt;. In this release,
SQL &lt;code&gt;varchar&lt;/code&gt; columns are mapped to the &lt;a href="https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html#variant.Utf8View"&gt;Utf8View Arrow data type&lt;/a&gt; by default, which is a more efficient representation of UTF-8 strings in Arrow.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;create table foo(x varchar);
0 row(s) fetched.

&amp;gt; describe foo;
+-------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-------------+-----------+-------------+
| x           | Utf8View  | YES         |
+-------------+-----------+-------------+
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Previous versions of DataFusion already used &lt;code&gt;Utf8View&lt;/code&gt; when reading Parquet files, and it is faster in most cases.&lt;/p&gt;
&lt;p&gt;Thanks to &lt;a href="https://github.com/zhuqi-lucas"&gt;zhuqi-lucas&lt;/a&gt; for PR &lt;a href="https://github.com/apache/datafusion/pull/15104"&gt;#15104&lt;/a&gt;&lt;/p&gt;
&lt;h3 id="context-propagation-in-spawned-tasks-for-tracing-logging-etc"&gt;Context propagation in spawned tasks (for tracing, logging, etc.)&lt;a class="headerlink" href="#context-propagation-in-spawned-tasks-for-tracing-logging-etc" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;This release introduces an API for propagating user-defined context (such as tracing spans,
logging, or metrics) across thread boundaries without depending on any specific instrumentation library.
You can use the &lt;a href="https://docs.rs/datafusion/latest/datafusion/common/runtime/trait.JoinSetTracer.html"&gt;JoinSetTracer&lt;/a&gt; API to instrument DataFusion plans with your own tracing or logging libraries, or
use pre-integrated community crates such as the &lt;a href="https://github.com/datafusion-contrib/datafusion-tracing"&gt;datafusion-tracing&lt;/a&gt; crate.&lt;/p&gt;
&lt;div class="text-center"&gt;
&lt;a href="https://github.com/datafusion-contrib/datafusion-tracing"&gt;
&lt;img alt="DataFusion telemetry project logo" class="img-fluid" src="/blog/images/datafusion-47.0.0/datafusion-telemetry.png" width="50%"/&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;p&gt;Previously, tasks spawned on new threads — such as those performing
repartitioning or Parquet file reads — could lose thread-local context, which is
often used in instrumentation libraries. A full example of how to use this new
API is available in the &lt;a href="https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/tracing.rs"&gt;DataFusion examples&lt;/a&gt;, and a simple example is shown below.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-Rust"&gt;/// Models a simple tracer. Calling `in_current_span()` and `in_scope()` saves thread-specific state
/// for the current span and must be called at the start of each new task or thread.
struct SpanTracer;

/// Implements the `JoinSetTracer` trait so we can inject instrumentation
/// for both async futures and blocking closures.
impl JoinSetTracer for SpanTracer {
    /// Instruments a boxed future to run in the current span. The future's
    /// return type is erased to `Box&amp;lt;dyn Any + Send&amp;gt;`, which we simply
    /// run inside the `Span::current()` context.
    fn trace_future(
        &amp;amp;self,
        fut: BoxFuture&amp;lt;'static, Box&amp;lt;dyn Any + Send&amp;gt;&amp;gt;,
    ) -&amp;gt; BoxFuture&amp;lt;'static, Box&amp;lt;dyn Any + Send&amp;gt;&amp;gt; {
        // Ensures any thread-local context is set in this future 
        fut.in_current_span().boxed()
    }

    /// Instruments a boxed blocking closure by running it inside the
    /// `Span::current()` context.
    fn trace_block(
        &amp;amp;self,
        f: Box&amp;lt;dyn FnOnce() -&amp;gt; Box&amp;lt;dyn Any + Send&amp;gt; + Send&amp;gt;,
    ) -&amp;gt; Box&amp;lt;dyn FnOnce() -&amp;gt; Box&amp;lt;dyn Any + Send&amp;gt; + Send&amp;gt; {
        let span = Span::current();
        // Ensures any thread-local context is set for this closure
        Box::new(move || span.in_scope(f))
    }
}

...
set_join_set_tracer(&amp;amp;SpanTracer).expect("Failed to set tracer");
...
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Thanks to &lt;a href="https://github.com/geoffreyclaude"&gt;geoffreyclaude&lt;/a&gt; for PR &lt;a href="https://github.com/apache/datafusion/issues/14914"&gt;#14914&lt;/a&gt;&lt;/p&gt;
&lt;h2 id="upgrade-guide-and-changelog"&gt;Upgrade Guide and Changelog&lt;a class="headerlink" href="#upgrade-guide-and-changelog" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Upgrading to 47.0.0 should be straightforward for most users, but do review
the &lt;a href="https://datafusion.apache.org/library-user-guide/upgrading.html#datafusion-47-0-0"&gt;Upgrade Guide for DataFusion 47.0.0&lt;/a&gt; for detailed
steps and code changes. The upgrade guide covers the breaking changes mentioned above and provides code snippets to help with the
transition. For a comprehensive list of all changes, please refer to the &lt;a href="https://github.com/apache/datafusion/blob/branch-47/dev/changelog/47.0.0.md"&gt;changelog&lt;/a&gt; for 47.0.0. The changelog
enumerates every merged PR in this release, including many smaller fixes and improvements that we couldn’t cover in this post.&lt;/p&gt;
&lt;h2 id="get-involved"&gt;Get Involved&lt;a class="headerlink" href="#get-involved" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Apache DataFusion is an open-source project, and we welcome involvement from anyone interested. Now is a great time to
take 47.0.0 for a spin: try it out on your workloads, and let us know if you encounter any issues or have suggestions.
You can report bugs or request features on our GitHub issue tracker, or better yet, submit a pull request. Join our
community discussions – whether you have questions, want to share how you’re using DataFusion, or are looking to
contribute, we’d love to hear from you. A list of open issues suitable for beginners
is &lt;a href="https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22"&gt;here&lt;/a&gt; and you
can find how to reach us on the &lt;a href="https://datafusion.apache.org/contributor-guide/communication.html"&gt;communication doc&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Happy querying!&lt;/p&gt;</content><category term="blog"/></entry><entry><title>Apache DataFusion Comet 0.9.0 Release</title><link href="https://datafusion.apache.org/blog/2025/07/01/datafusion-comet-0.9.0" rel="alternate"/><published>2025-07-01T00:00:00+00:00</published><updated>2025-07-01T00:00:00+00:00</updated><author><name>pmc</name></author><id>tag:datafusion.apache.org,2025-07-01:/blog/2025/07/01/datafusion-comet-0.9.0</id><summary type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;p&gt;The Apache DataFusion PMC is pleased to announce version 0.9.0 of the &lt;a href="https://datafusion.apache.org/comet/"&gt;Comet&lt;/a&gt; subproject.&lt;/p&gt;
&lt;p&gt;Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
improved performance and efficiency without requiring any code changes.&lt;/p&gt;
&lt;p&gt;This release covers approximately ten weeks of development …&lt;/p&gt;</summary><content type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;p&gt;The Apache DataFusion PMC is pleased to announce version 0.9.0 of the &lt;a href="https://datafusion.apache.org/comet/"&gt;Comet&lt;/a&gt; subproject.&lt;/p&gt;
&lt;p&gt;Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
improved performance and efficiency without requiring any code changes.&lt;/p&gt;
&lt;p&gt;This release covers approximately ten weeks of development work and is the result of merging 139 PRs from 24
contributors. See the &lt;a href="https://github.com/apache/datafusion-comet/blob/main/dev/changelog/0.9.0.md"&gt;change log&lt;/a&gt; for more information.&lt;/p&gt;
&lt;h2 id="release-highlights"&gt;Release Highlights&lt;a class="headerlink" href="#release-highlights" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id="complex-type-support-in-parquet-scans"&gt;Complex Type Support in Parquet Scans&lt;a class="headerlink" href="#complex-type-support-in-parquet-scans" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Comet now supports complex types (Structs, Maps, and Arrays) when reading Parquet files. This functionality is not
yet available when reading Parquet files from Apache Iceberg.&lt;/p&gt;
&lt;p&gt;In previous releases, this functionality was only available when manually specifying one of the new experimental
scan implementations. Comet now automatically chooses the best scan implementation based on the input schema and no
longer requires manual configuration.&lt;/p&gt;
&lt;h3 id="complex-type-processing-improvements"&gt;Complex Type Processing Improvements&lt;a class="headerlink" href="#complex-type-processing-improvements" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Numerous improvements have been made to complex type support to ensure Spark-compatible behavior when casting between
structs and accessing fields within deeply nested types.&lt;/p&gt;
&lt;h3 id="shuffle-improvements"&gt;Shuffle Improvements&lt;a class="headerlink" href="#shuffle-improvements" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Comet now accelerates a broader range of shuffle operations, leading to more queries running fully natively. In
previous releases, some shuffle operations fell back to Spark to avoid some known bugs in Comet, and these bugs have
now been fixed.&lt;/p&gt;
&lt;h3 id="new-features"&gt;New Features&lt;a class="headerlink" href="#new-features" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Comet 0.9.0 adds support for the following Spark expressions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;ArrayDistinct&lt;/li&gt;
&lt;li&gt;ArrayMax&lt;/li&gt;
&lt;li&gt;ArrayRepeat&lt;/li&gt;
&lt;li&gt;ArrayUnion&lt;/li&gt;
&lt;li&gt;BitCount&lt;/li&gt;
&lt;li&gt;BitNot&lt;/li&gt;
&lt;li&gt;Expm1&lt;/li&gt;
&lt;li&gt;MapValues&lt;/li&gt;
&lt;li&gt;Signum&lt;/li&gt;
&lt;li&gt;ToPrettyString&lt;/li&gt;
&lt;li&gt;map[]&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="improved-spark-sql-test-coverage"&gt;Improved Spark SQL Test Coverage&lt;a class="headerlink" href="#improved-spark-sql-test-coverage" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Comet now passes 97% of the Spark SQL test suite, with more than 24,000 tests passing (based on testing against
Spark 3.5.6). The remaining 3% of tests are ignored for various reasons, such as being too specific to Spark 
internals, or testing for features that are not relevant to Comet, such as whole-stage code generation, which
is not needed when using a vectorized execution engine.&lt;/p&gt;
&lt;p&gt;This release contains numerous bug fixes to achieve this coverage, including improved support for exchange reuse
when AQE is enabled.&lt;/p&gt;
&lt;style&gt;
  table {
    border-collapse: collapse;
    width: 100%;
    font-family: sans-serif;
  }

  th, td {
    text-align: left;
    padding: 8px 12px;
  }

  th {
    background-color: #f2f2f2;
    font-weight: bold;
  }

  td {
    border-bottom: 1px solid #ddd;
  }

  tbody tr:last-child td {
    font-weight: bold;
    border-top: 2px solid #000;
  }
&lt;/style&gt;
&lt;table class="table"&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Module&lt;/th&gt;
&lt;th&gt;Passed&lt;/th&gt;
&lt;th&gt;Ignored&lt;/th&gt;
&lt;th&gt;Canceled&lt;/th&gt;
&lt;th&gt;Total&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;catalyst&lt;/td&gt;&lt;td&gt;7,232&lt;/td&gt;&lt;td&gt;5&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;7,238&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;core-1&lt;/td&gt;&lt;td&gt;9,186&lt;/td&gt;&lt;td&gt;246&lt;/td&gt;&lt;td&gt;6&lt;/td&gt;&lt;td&gt;9,438&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;core-2&lt;/td&gt;&lt;td&gt;2,649&lt;/td&gt;&lt;td&gt;393&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;3,042&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;core-3&lt;/td&gt;&lt;td&gt;1,757&lt;/td&gt;&lt;td&gt;136&lt;/td&gt;&lt;td&gt;16&lt;/td&gt;&lt;td&gt;1,909&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;hive-1&lt;/td&gt;&lt;td&gt;2,174&lt;/td&gt;&lt;td&gt;14&lt;/td&gt;&lt;td&gt;4&lt;/td&gt;&lt;td&gt;2,192&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;hive-2&lt;/td&gt;&lt;td&gt;19&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;4&lt;/td&gt;&lt;td&gt;24&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;hive-3&lt;/td&gt;&lt;td&gt;1,058&lt;/td&gt;&lt;td&gt;11&lt;/td&gt;&lt;td&gt;4&lt;/td&gt;&lt;td&gt;1,073&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;&lt;strong&gt;24,075&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;&lt;strong&gt;806&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;&lt;strong&gt;31&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;&lt;strong&gt;24,912&lt;/strong&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id="memory-performance-tracing"&gt;Memory &amp;amp; Performance Tracing&lt;a class="headerlink" href="#memory-performance-tracing" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Comet now provides a tracing feature for analyzing performance and off-heap versus on-heap memory usage. See the
&lt;a href="https://datafusion.apache.org/comet/contributor-guide/tracing.html"&gt;Comet Tracing Guide&lt;/a&gt; for more information.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Comet Tracing" class="img-fluid" src="/blog/images/comet-0.9.0/tracing.png" width="100%"/&gt;&lt;/p&gt;
&lt;h3 id="spark-compatibility"&gt;Spark Compatibility&lt;a class="headerlink" href="#spark-compatibility" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Spark 3.4.3 with JDK 11 &amp;amp; 17, Scala 2.12 &amp;amp; 2.13&lt;/li&gt;
&lt;li&gt;Spark 3.5.4 through 3.5.6 with JDK 11 &amp;amp; 17, Scala 2.12 &amp;amp; 2.13&lt;/li&gt;
&lt;li&gt;Experimental support for Spark 4.0.0 with JDK 17, Scala 2.13&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We are looking for help from the community to fully support Spark 4.0.0. See &lt;a href="https://github.com/apache/datafusion-comet/issues/1637"&gt;EPIC: Support 4.0.0&lt;/a&gt; for more information.&lt;/p&gt;
&lt;p&gt;Note that Java 8 support was removed from this release because Apache Arrow no longer supports it.&lt;/p&gt;
&lt;h2 id="getting-involved"&gt;Getting Involved&lt;a class="headerlink" href="#getting-involved" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The Comet project welcomes new contributors. We use the same &lt;a href="https://datafusion.apache.org/contributor-guide/communication.html#slack-and-discord"&gt;Slack and Discord&lt;/a&gt; channels as the main DataFusion
project and have a weekly &lt;a href="https://docs.google.com/document/d/1NBpkIAuU7O9h8Br5CbFksDhX-L9TyO9wmGLPMe0Plc8/edit?usp=sharing"&gt;DataFusion video call&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The easiest way to get involved is to test Comet with your current Spark jobs and file issues for any bugs or
performance regressions that you find. See the &lt;a href="https://datafusion.apache.org/comet/user-guide/installation.html"&gt;Getting Started&lt;/a&gt; guide for instructions on downloading and installing
Comet.&lt;/p&gt;
&lt;p&gt;There are also many &lt;a href="https://github.com/apache/datafusion-comet/contribute"&gt;good first issues&lt;/a&gt; waiting for contributions.&lt;/p&gt;</content><category term="blog"/></entry><entry><title>Apache DataFusion Comet 0.8.0 Release</title><link href="https://datafusion.apache.org/blog/2025/05/06/datafusion-comet-0.8.0" rel="alternate"/><published>2025-05-06T00:00:00+00:00</published><updated>2025-05-06T00:00:00+00:00</updated><author><name>pmc</name></author><id>tag:datafusion.apache.org,2025-05-06:/blog/2025/05/06/datafusion-comet-0.8.0</id><summary type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;p&gt;The Apache DataFusion PMC is pleased to announce version 0.8.0 of the &lt;a href="https://datafusion.apache.org/comet/"&gt;Comet&lt;/a&gt; subproject.&lt;/p&gt;
&lt;p&gt;Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
improved performance and efficiency without requiring any code changes.&lt;/p&gt;
&lt;p&gt;This release covers approximately six weeks of development …&lt;/p&gt;</summary><content type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;p&gt;The Apache DataFusion PMC is pleased to announce version 0.8.0 of the &lt;a href="https://datafusion.apache.org/comet/"&gt;Comet&lt;/a&gt; subproject.&lt;/p&gt;
&lt;p&gt;Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
improved performance and efficiency without requiring any code changes.&lt;/p&gt;
&lt;p&gt;This release covers approximately six weeks of development work and is the result of merging 81 PRs from 11
contributors. See the &lt;a href="https://github.com/apache/datafusion-comet/blob/main/dev/changelog/0.8.0.md"&gt;change log&lt;/a&gt; for more information.&lt;/p&gt;
&lt;h2 id="release-highlights"&gt;Release Highlights&lt;a class="headerlink" href="#release-highlights" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id="performance-stability"&gt;Performance &amp;amp; Stability&lt;a class="headerlink" href="#performance-stability" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Up to 4x speedup in jobs using &lt;code&gt;dropDuplicates&lt;/code&gt;, thanks to optimizations in the &lt;code&gt;first_value&lt;/code&gt; and &lt;code&gt;last_value&lt;/code&gt;
  aggregate functions in DataFusion 47.0.0.&lt;/li&gt;
&lt;li&gt;Introduction of a global Tokio runtime, which resolves potential deadlocks in certain multi-task scenarios.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="native-shuffle-improvements"&gt;Native Shuffle Improvements&lt;a class="headerlink" href="#native-shuffle-improvements" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Significant enhancements to the native shuffle mechanism include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Lower memory usage by using &lt;code&gt;interleave_record_batches&lt;/code&gt; instead of array builders.&lt;/li&gt;
&lt;li&gt;Support for complex types in shuffle data (note: hash partition expressions still require primitive types).&lt;/li&gt;
&lt;li&gt;Shuffle files are now reclaimable, reducing disk pressure.&lt;/li&gt;
&lt;li&gt;Temporary shuffle storage now respects &lt;code&gt;spark.local.dir&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Per-task shuffle metrics are now available, providing better visibility into execution behavior.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="experimental-support-for-datafusions-parquet-scan"&gt;Experimental Support for DataFusion’s Parquet Scan&lt;a class="headerlink" href="#experimental-support-for-datafusions-parquet-scan" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;It is now possible to configure Comet to use DataFusion’s Parquet reader instead of Comet’s current Parquet reader. This
has the advantage of supporting complex types, and also has performance optimizations that are not present in Comet's
existing reader.&lt;/p&gt;
&lt;p&gt;This release brings further improvements and bug fixes and supports more use cases, but there are still
some known issues:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;There are schema coercion bugs for nested types containing INT96 columns, which can cause incorrect results.&lt;/li&gt;
&lt;li&gt;There are compatibility issues when reading integer values that are larger than their type annotation, such as the
  value 1024 being stored in a field annotated as int(8).&lt;/li&gt;
&lt;li&gt;A small number of Spark SQL tests remain unsupported (&lt;a href="https://github.com/apache/datafusion-comet/issues/1545"&gt;#1545&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To enable DataFusion’s Parquet reader, either set &lt;code&gt;spark.comet.scan.impl=native_datafusion&lt;/code&gt; or set the environment
variable &lt;code&gt;COMET_PARQUET_SCAN_IMPL=native_datafusion&lt;/code&gt;.&lt;/p&gt;
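&lt;p&gt;As a sketch of what this could look like when launching a job (the jar path and application file below are placeholders; see the Comet installation guide for the exact jar for your Spark version):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-shell"&gt;spark-submit \
  --jars comet-spark.jar \
  --conf spark.plugins=org.apache.spark.CometPlugin \
  --conf spark.comet.scan.impl=native_datafusion \
  my_app.py
&lt;/code&gt;&lt;/pre&gt;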
&lt;h2 id="updates-to-supported-spark-versions"&gt;Updates to Supported Spark Versions&lt;a class="headerlink" href="#updates-to-supported-spark-versions" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Added support for Spark 3.5.5&lt;/li&gt;
&lt;li&gt;Dropped support for Spark 3.3.x&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="getting-involved"&gt;Getting Involved&lt;a class="headerlink" href="#getting-involved" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The Comet project welcomes new contributors. We use the same &lt;a href="https://datafusion.apache.org/contributor-guide/communication.html#slack-and-discord"&gt;Slack and Discord&lt;/a&gt; channels as the main DataFusion
project and have a weekly &lt;a href="https://docs.google.com/document/d/1NBpkIAuU7O9h8Br5CbFksDhX-L9TyO9wmGLPMe0Plc8/edit?usp=sharing"&gt;DataFusion video call&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The easiest way to get involved is to test Comet with your current Spark jobs and file issues for any bugs or
performance regressions that you find. See the &lt;a href="https://datafusion.apache.org/comet/user-guide/installation.html"&gt;Getting Started&lt;/a&gt; guide for instructions on downloading and installing
Comet.&lt;/p&gt;
&lt;p&gt;There are also many &lt;a href="https://github.com/apache/datafusion-comet/contribute"&gt;good first issues&lt;/a&gt; waiting for contributions.&lt;/p&gt;</content><category term="blog"/></entry><entry><title>Apache DataFusion Comet 0.7.0 Release</title><link href="https://datafusion.apache.org/blog/2025/03/20/datafusion-comet-0.7.0" rel="alternate"/><published>2025-03-20T00:00:00+00:00</published><updated>2025-03-20T00:00:00+00:00</updated><author><name>pmc</name></author><id>tag:datafusion.apache.org,2025-03-20:/blog/2025/03/20/datafusion-comet-0.7.0</id><summary type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;p&gt;The Apache DataFusion PMC is pleased to announce version 0.7.0 of the &lt;a href="https://datafusion.apache.org/comet/"&gt;Comet&lt;/a&gt; subproject.&lt;/p&gt;
&lt;p&gt;Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
improved performance and efficiency without requiring any code changes.&lt;/p&gt;
&lt;p&gt;Comet runs on commodity hardware and aims to …&lt;/p&gt;</summary><content type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;p&gt;The Apache DataFusion PMC is pleased to announce version 0.7.0 of the &lt;a href="https://datafusion.apache.org/comet/"&gt;Comet&lt;/a&gt; subproject.&lt;/p&gt;
&lt;p&gt;Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
improved performance and efficiency without requiring any code changes.&lt;/p&gt;
&lt;p&gt;Comet runs on commodity hardware and aims to provide 100% compatibility with Apache Spark. Any operators or
expressions that are not fully compatible will fall back to Spark unless explicitly enabled by the user. Refer
to the &lt;a href="https://datafusion.apache.org/comet/user-guide/compatibility.html"&gt;compatibility guide&lt;/a&gt; for more information.&lt;/p&gt;
&lt;p&gt;This release covers approximately four weeks of development work and is the result of merging 46 PRs from 11
contributors. See the &lt;a href="https://github.com/apache/datafusion-comet/blob/main/dev/changelog/0.7.0.md"&gt;change log&lt;/a&gt; for more information.&lt;/p&gt;
&lt;h2 id="release-highlights"&gt;Release Highlights&lt;a class="headerlink" href="#release-highlights" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id="performance"&gt;Performance&lt;a class="headerlink" href="#performance" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Comet 0.7.0 is faster than the previous release, thanks to improvements in the native shuffle
implementation and performance gains in DataFusion 46.&lt;/p&gt;
&lt;p&gt;For single-node TPC-H at 100 GB, Comet now delivers a &lt;strong&gt;greater than 2x speedup&lt;/strong&gt; compared to Spark using the same 
CPU and RAM. Even with &lt;strong&gt;half the resources&lt;/strong&gt;, Comet still provides a measurable performance improvement.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Chart showing TPC-H benchmark results for Comet 0.7.0" class="img-fluid" src="/blog/images/comet-0.7.0/performance.png" width="100%"/&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;These benchmarks were performed on a Linux workstation with PCIe 5, AMD 7950X CPU (16 cores), 128 GB RAM, and data 
stored locally in Parquet format on NVMe storage. Spark was running in Kubernetes with hard memory limits.&lt;/em&gt;&lt;/p&gt;
&lt;h2 id="shuffle-improvements"&gt;Shuffle Improvements&lt;a class="headerlink" href="#shuffle-improvements" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;There are several improvements to shuffle in this release:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;When running in off-heap mode (the recommended approach), Comet was using the wrong memory allocator
  implementation for some types of shuffle operations, which could result in out-of-memory (OOM) errors rather than spilling to disk.&lt;/li&gt;
&lt;li&gt;The number of spill files is drastically reduced. In previous releases, each instance of ShuffleMapTask could 
  potentially create a new spill file for each output partition each time that spill was invoked. Comet now creates 
  a maximum of one spill file per output partition per instance of ShuffleMapTask, which is appended to in subsequent 
  spills.&lt;/li&gt;
&lt;li&gt;There was a flaw in the memory accounting which resulted in Comet requesting approximately twice the amount of 
  memory that was needed, resulting in premature spilling. This is now resolved.&lt;/li&gt;
&lt;li&gt;The metric for the number of spilled bytes is now accurate; it previously reported incorrect values.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="improved-hash-join-performance"&gt;Improved Hash Join Performance&lt;a class="headerlink" href="#improved-hash-join-performance" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;When using the &lt;code&gt;spark.comet.exec.replaceSortMergeJoin&lt;/code&gt; setting to replace sort-merge joins with hash joins, Comet 
will now do a better job of picking the optimal build side. Thanks to &lt;a href="https://github.com/hayman42"&gt;@hayman42&lt;/a&gt; for suggesting this, and thanks to the 
&lt;a href="https://github.com/apache/incubator-gluten/"&gt;Apache Gluten(incubating)&lt;/a&gt; project for the inspiration in implementing this feature.&lt;/p&gt;
&lt;h2 id="experimental-support-for-datafusions-parquet-scan"&gt;Experimental Support for DataFusion’s Parquet Scan&lt;a class="headerlink" href="#experimental-support-for-datafusions-parquet-scan" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;It is now possible to configure Comet to use DataFusion’s Parquet reader instead of Comet’s current Parquet reader. This 
has the advantage of supporting complex types, and also has performance optimizations that are not present in Comet's 
existing reader.&lt;/p&gt;
&lt;p&gt;Support should still be considered experimental, but most of Comet’s unit tests are now passing with the new reader. 
Known issues include handling of &lt;code&gt;INT96&lt;/code&gt; timestamps and unsigned bytes and shorts.&lt;/p&gt;
&lt;p&gt;To enable DataFusion’s Parquet reader, either set &lt;code&gt;spark.comet.scan.impl=native_datafusion&lt;/code&gt; or set the environment 
variable &lt;code&gt;COMET_PARQUET_SCAN_IMPL=native_datafusion&lt;/code&gt;.&lt;/p&gt;
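&lt;p&gt;For example, either of the following approaches should work (shown as a sketch for &lt;code&gt;spark-shell&lt;/code&gt;; adapt to your own job submission):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# via Spark configuration
spark-shell --conf spark.comet.scan.impl=native_datafusion

# or via an environment variable set before launching Spark
export COMET_PARQUET_SCAN_IMPL=native_datafusion
&lt;/code&gt;&lt;/pre&gt;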
&lt;h2 id="complex-type-support"&gt;Complex Type Support&lt;a class="headerlink" href="#complex-type-support" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;With DataFusion’s Parquet reader enabled, there is now some early support for reading structs from Parquet. This is 
not thoroughly tested yet. We would welcome additional testing from the community to help determine what is and isn’t 
working, as well as contributions to improve support for structs and other complex types. The tracking issue is 
&lt;a href="https://github.com/apache/datafusion-comet/issues/1043"&gt;https://github.com/apache/datafusion-comet/issues/1043&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="updates-to-supported-spark-versions"&gt;Updates to supported Spark versions&lt;a class="headerlink" href="#updates-to-supported-spark-versions" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Comet 0.7.0 is now tested against Spark 3.5.4 rather than 3.5.1&lt;/li&gt;
&lt;li&gt;This will be the last Comet release to support Spark 3.3.x&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="improved-tuning-guide"&gt;Improved Tuning Guide&lt;a class="headerlink" href="#improved-tuning-guide" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The &lt;a href="https://datafusion.apache.org/comet/user-guide/tuning.html"&gt;Comet Tuning Guide&lt;/a&gt; has been improved and now provides guidance on determining how much memory to allocate to 
Comet.&lt;/p&gt;
&lt;h2 id="getting-involved"&gt;Getting Involved&lt;a class="headerlink" href="#getting-involved" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The Comet project welcomes new contributors. We use the same &lt;a href="https://datafusion.apache.org/contributor-guide/communication.html#slack-and-discord"&gt;Slack and Discord&lt;/a&gt; channels as the main DataFusion
project and have a weekly &lt;a href="https://docs.google.com/document/d/1NBpkIAuU7O9h8Br5CbFksDhX-L9TyO9wmGLPMe0Plc8/edit?usp=sharing"&gt;DataFusion video call&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The easiest way to get involved is to test Comet with your current Spark jobs and file issues for any bugs or
performance regressions that you find. See the &lt;a href="https://datafusion.apache.org/comet/user-guide/installation.html"&gt;Getting Started&lt;/a&gt; guide for instructions on downloading and installing
Comet.&lt;/p&gt;
&lt;p&gt;There are also many &lt;a href="https://github.com/apache/datafusion-comet/contribute"&gt;good first issues&lt;/a&gt; waiting for contributions.&lt;/p&gt;</content><category term="blog"/></entry><entry><title>Apache DataFusion 45.0.0 Released</title><link href="https://datafusion.apache.org/blog/2025/02/20/datafusion-45.0.0" rel="alternate"/><published>2025-02-20T00:00:00+00:00</published><updated>2025-02-20T00:00:00+00:00</updated><author><name>pmc</name></author><id>tag:datafusion.apache.org,2025-02-20:/blog/2025/02/20/datafusion-45.0.0</id><summary type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;!-- see https://github.com/apache/datafusion/issues/11631 for details --&gt;
&lt;h2 id="introduction"&gt;Introduction&lt;a class="headerlink" href="#introduction" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We are very proud to announce &lt;a href="https://crates.io/crates/datafusion/45.0.0"&gt;DataFusion 45.0.0&lt;/a&gt;. This blog highlights some of the
many major improvements since we released &lt;a href="https://datafusion.apache.org/blog/2024/07/24/datafusion-40.0.0/"&gt;DataFusion 40.0.0&lt;/a&gt; and a preview of
what the community is thinking about in the next 6 months. It has been an exciting
period of development …&lt;/p&gt;</summary><content type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;!-- see https://github.com/apache/datafusion/issues/11631 for details --&gt;
&lt;h2 id="introduction"&gt;Introduction&lt;a class="headerlink" href="#introduction" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We are very proud to announce &lt;a href="https://crates.io/crates/datafusion/45.0.0"&gt;DataFusion 45.0.0&lt;/a&gt;. This blog highlights some of the
many major improvements since we released &lt;a href="https://datafusion.apache.org/blog/2024/07/24/datafusion-40.0.0/"&gt;DataFusion 40.0.0&lt;/a&gt; and a preview of
what the community is thinking about in the next 6 months. It has been an exciting
period of development for DataFusion!&lt;/p&gt;
&lt;p&gt;&lt;a href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; is an extensible query engine, written in &lt;a href="https://www.rust-lang.org/"&gt;Rust&lt;/a&gt;, that
uses &lt;a href="https://arrow.apache.org"&gt;Apache Arrow&lt;/a&gt; as its in-memory format. DataFusion is used by developers to
create new, fast data-centric systems such as databases, dataframe libraries,
machine learning and streaming applications. While &lt;a href="https://datafusion.apache.org/user-guide/introduction.html#project-goals"&gt;DataFusion’s primary design
goal&lt;/a&gt; is to accelerate the creation of other data-centric systems, it has a
reasonable experience directly out of the box as a &lt;a href="https://datafusion.apache.org/user-guide/dataframe.html"&gt;dataframe library&lt;/a&gt;,
&lt;a href="https://datafusion.apache.org/python/"&gt;python library&lt;/a&gt; and &lt;a href="https://datafusion.apache.org/user-guide/cli/"&gt;command line SQL tool&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;DataFusion's core thesis is that as a community, together we can build much more
advanced technology than any of us as individuals or companies could do alone. 
Without DataFusion, highly performant vectorized query engines would remain
the domain of a few large companies and world-class research institutions. 
With DataFusion, we can all build on top of a shared foundation, and focus on
what makes our projects unique.&lt;/p&gt;
&lt;h2 id="community-growth"&gt;Community Growth  📈&lt;a class="headerlink" href="#community-growth" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;In the last 6 months, between &lt;code&gt;40.0.0&lt;/code&gt; and &lt;code&gt;45.0.0&lt;/code&gt;, our community continues to
grow in new and exciting ways.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;We added several PMC members and new committers: &lt;a href="https://github.com/jayzhan211"&gt;@jayzhan211&lt;/a&gt; and &lt;a href="https://github.com/jonahgao"&gt;@jonahgao&lt;/a&gt; joined the PMC,
   &lt;a href="https://github.com/2010YOUY01"&gt;@2010YOUY01&lt;/a&gt;, &lt;a href="https://github.com/rachelint"&gt;@rachelint&lt;/a&gt;, &lt;a href="https://github.com/findepi/"&gt;@findpi&lt;/a&gt;, &lt;a href="https://github.com/iffyio"&gt;@iffyio&lt;/a&gt;, &lt;a href="https://github.com/goldmedal"&gt;@goldmedal&lt;/a&gt;, &lt;a href="https://github.com/Weijun-H"&gt;@Weijun-H&lt;/a&gt;, &lt;a href="https://github.com/Michael-J-Ward"&gt;@Michael-J-Ward&lt;/a&gt; and &lt;a href="https://github.com/korowa"&gt;@korowa&lt;/a&gt;
   joined as committers. See the &lt;a href="https://lists.apache.org/list.html?dev@datafusion.apache.org"&gt;mailing list&lt;/a&gt; for more details.&lt;/li&gt;
&lt;li&gt;In the &lt;a href="https://github.com/apache/arrow-datafusion"&gt;core DataFusion repo&lt;/a&gt; alone we reviewed and accepted almost 1600 PRs from 206 different
   contributors, created over 1100 issues and closed 751 of them 🚀. All changes are listed in the detailed
   &lt;a href="https://github.com/apache/datafusion/tree/main/dev/changelog"&gt;changelogs&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;DataFusion focused meetups happened in multiple cities around the world: &lt;a href="https://github.com/apache/datafusion/discussions/10341#discussioncomment-10110273"&gt;Hangzhou&lt;/a&gt;, &lt;a href="https://github.com/apache/datafusion/discussions/11431"&gt;Belgrade&lt;/a&gt;, &lt;a href="https://github.com/apache/datafusion/discussions/11213"&gt;New York&lt;/a&gt;, 
   &lt;a href="https://github.com/apache/datafusion/discussions/10348"&gt;Seattle&lt;/a&gt;, &lt;a href="https://github.com/apache/datafusion/discussions/12894"&gt;Chicago&lt;/a&gt;, &lt;a href="https://github.com/apache/datafusion/discussions/13165"&gt;Boston&lt;/a&gt; and &lt;a href="https://github.com/apache/datafusion/discussions/12988"&gt;Amsterdam&lt;/a&gt; as well as a Rust NYC meetup in NYC focused on DataFusion.&lt;/li&gt;
&lt;/ol&gt;
&lt;!--
$ git log --pretty=oneline 40.0.0..45.0.0 . | wc -l
     1532 (up from 1453)

$ git shortlog -sn 40.0.0..45.0.0 . | wc -l
     206 (up from 182)

https://crates.io/crates/datafusion/45.0.0
DataFusion 45 released Feb 7, 2025

https://crates.io/crates/datafusion/40.0.0
DataFusion 40 released July 12, 2024

Issues created in this time: 375 open, 751 closed (from 321 open, 781 closed)
https://github.com/apache/datafusion/issues?q=is%3Aissue+created%3A2024-07-12..2025-02-07

Issues closed: 956 (up from 911)
https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+closed%3A2024-07-12..2025-02-07

PRs merged in this time 1597 (up from 1490)
https://github.com/apache/arrow-datafusion/pulls?q=is%3Apr+merged%3A2024-07-12..2025-02-07

--&gt;
&lt;p&gt;DataFusion has put in an application to be part of &lt;a href="https://summerofcode.withgoogle.com/"&gt;Google Summer of Code&lt;/a&gt; with a 
&lt;a href="https://github.com/apache/datafusion/issues/14478"&gt;number of ideas&lt;/a&gt; for projects with mentors already selected. Additionally, &lt;a href="https://github.com/apache/datafusion/issues/14373"&gt;some ideas&lt;/a&gt; on
how to make DataFusion an ideal selection for university database projects such as the 
&lt;a href="https://15445.courses.cs.cmu.edu/spring2025/"&gt;CMU database classes&lt;/a&gt; have been put forward.&lt;/p&gt;
&lt;p&gt;In addition, DataFusion has been appearing publicly more and more, both online and offline. Here are some highlights:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;A &lt;a href="https://uwheel.rs/post/datafusion_uwheel/"&gt;demonstration of how uwheel&lt;/a&gt; is integrated into DataFusion&lt;/li&gt;
&lt;li&gt;Integrating StringView into DataFusion - &lt;a href="https://www.influxdata.com/blog/faster-queries-with-stringview-part-one-influxdb/"&gt;part 1&lt;/a&gt; and &lt;a href="https://www.influxdata.com/blog/faster-queries-with-stringview-part-two-influxdb/"&gt;part 2&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://techontherocks.show/3"&gt;Building streams&lt;/a&gt; with DataFusion&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.haoxp.xyz/posts/caching-datafusion"&gt;Caching in DataFusion&lt;/a&gt;: Don't read twice&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.haoxp.xyz/posts/parquet-to-arrow/"&gt;Parquet pruning in DataFusion&lt;/a&gt;: Read no more than you need&lt;/li&gt;
&lt;li&gt;DataFusion is one of &lt;a href="https://www.crn.com/news/software/2024/the-10-coolest-open-source-software-tools-of-2024?page=3"&gt;The 10 coolest open source software tools&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.denormalized.io/blog/building-databases"&gt;Building databases over a weekend&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="improved-performance"&gt;Improved Performance 🚀&lt;a class="headerlink" href="#improved-performance" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;DataFusion hit a milestone in its development by becoming &lt;a href="https://datafusion.apache.org/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/"&gt;the fastest single node engine&lt;/a&gt; 
for querying Apache Parquet files in the &lt;a href="https://benchmark.clickhouse.com/"&gt;ClickBench&lt;/a&gt; benchmark for the 43.0.0 release. A &lt;a href="https://github.com/apache/datafusion/issues/12821"&gt;lot
of work&lt;/a&gt; went into making this happen! While other engines have subsequently gotten faster,
displacing DataFusion from the top spot, DataFusion still remains near the top and we &lt;a href="https://github.com/apache/datafusion/issues/14586"&gt;are planning
more improvements&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="ClickBench performance results over time for DataFusion" class="img-fluid" src="/blog/images/datafusion-45.0.0/performance_over_time.png" width="100%"/&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Figure 1&lt;/strong&gt;: ClickBench performance improved by over 33% between DataFusion 33
(released Nov. 2023) and DataFusion 45 (released Feb. 2025).&lt;/p&gt;
&lt;p&gt;The task of &lt;a href="https://github.com/apache/datafusion/issues/10918"&gt;integrating&lt;/a&gt; the new &lt;a href="https://docs.rs/arrow/latest/arrow/array/struct.GenericByteViewArray.html"&gt;Arrow StringView&lt;/a&gt;, which significantly improves performance
for workloads that scan, filter, and group by variable-length string and binary data, was completed
and enabled by default in the past 6 months. The improvement is especially pronounced for Parquet
files due to &lt;a href="https://github.com/apache/arrow-rs/issues/5530"&gt;upstream work in the parquet reader&lt;/a&gt;. Kudos to &lt;a href="https://github.com/XiangpengHong"&gt;@XiangpengHong&lt;/a&gt;, &lt;a href="https://github.com/AriesDevil"&gt;@AriesDevil&lt;/a&gt;, 
&lt;a href="https://github.com/PsiACE"&gt;@PsiACE&lt;/a&gt;, &lt;a href="https://github.com/Weijun-H"&gt;@Weijun-H&lt;/a&gt;, &lt;a href="https://github.com/a10y"&gt;@a10y&lt;/a&gt;, and &lt;a href="https://github.com/RinChanNOWWW"&gt;@RinChanNOWWW&lt;/a&gt; for driving this project.&lt;/p&gt;
&lt;h2 id="improved-quality"&gt;Improved Quality 📋&lt;a class="headerlink" href="#improved-quality" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;DataFusion continues to improve overall in quality. In addition to ongoing bug
fixes, one of the most exciting improvements in the last 6 months was the addition of the 
&lt;a href="https://github.com/apache/datafusion/pull/13936"&gt;SQLite sqllogictest suite&lt;/a&gt; thanks to &lt;a href="https://github.com/Omega359"&gt;@Omega359&lt;/a&gt;. These tests run over 5 million 
SQL statements on every push to the main branch.&lt;/p&gt;
&lt;p&gt;Support for &lt;a href="https://github.com/apache/datafusion/pull/13651"&gt;explicitly checking logical plan invariants&lt;/a&gt; was added by &lt;a href="https://github.com/wiedld"&gt;@wiedld&lt;/a&gt; which 
can help catch implicit changes that might cause problems during upgrades.&lt;/p&gt;
&lt;p&gt;We have also started other quality initiatives to make it &lt;a href="https://github.com/apache/datafusion/issues/13525"&gt;easier to use DataFusion&lt;/a&gt;,
informed by &lt;a href="https://glaredb.com/"&gt;GlareDB&lt;/a&gt;'s experience, along with more &lt;a href="https://github.com/apache/datafusion/issues/13661"&gt;extensive prerelease testing&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="improved-documentation"&gt;Improved Documentation 📚&lt;a class="headerlink" href="#improved-documentation" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We continue to improve the documentation to make it easier to get started using DataFusion. 
During the last 6 months, two projects were initiated to migrate the function documentation away
from strictly static markdown files. First, &lt;a href="https://github.com/apache/datafusion/pull/12668"&gt;@Omega359&lt;/a&gt; added support for generating function
documentation from code, and &lt;a href="https://github.com/jonathanc-n"&gt;@jonathanc-n&lt;/a&gt; and others helped with the migration.
Then &lt;a href="https://github.com/comphead"&gt;@comphead&lt;/a&gt; led a project to &lt;a href="https://github.com/apache/datafusion/pull/12822"&gt;create a doc macro&lt;/a&gt; that makes function documentation even easier to write.
A special thanks to &lt;a href="https://github.com/Chen-Yuan-Lai"&gt;@Chen-Yuan-Lai&lt;/a&gt; for migrating many functions to
the new syntax.&lt;/p&gt;
&lt;p&gt;Additionally, the &lt;a href="https://github.com/apache/datafusion/pull/13877"&gt;examples&lt;/a&gt; were &lt;a href="https://github.com/apache/datafusion/pull/13905"&gt;refactored&lt;/a&gt; and &lt;a href="https://github.com/apache/datafusion/pull/13950"&gt;cleaned up&lt;/a&gt; to improve their usefulness.&lt;/p&gt;
&lt;h2 id="new-features"&gt;New Features ✨&lt;a class="headerlink" href="#new-features" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;There are too many new features in the last 6 months to list them all, but here
are some highlights:&lt;/p&gt;
&lt;h3 id="functions"&gt;Functions&lt;a class="headerlink" href="#functions" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Uniform Window Functions:  &lt;code&gt;BuiltInWindowFunctions&lt;/code&gt; was removed and all now use UDFs (&lt;a href="https://github.com/jcsherin"&gt;@jcsherin&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Uniform Aggregate Functions: &lt;code&gt;BuiltInAggregateFunctions&lt;/code&gt; was removed and all now use UDFs&lt;/li&gt;
&lt;li&gt;As mentioned above function documentation was extracted from the markdown files&lt;/li&gt;
&lt;li&gt;Some new functions and SQL support were added, including '&lt;a href="https://github.com/apache/datafusion/pull/13799"&gt;show functions&lt;/a&gt;', '&lt;a href="https://github.com/apache/datafusion/pull/11347"&gt;to_local_time&lt;/a&gt;',
  '&lt;a href="https://github.com/apache/datafusion/pull/12970"&gt;regexp_count&lt;/a&gt;', '&lt;a href="https://github.com/apache/datafusion/pull/11969"&gt;map_extract&lt;/a&gt;', '&lt;a href="https://github.com/apache/datafusion/pull/12211"&gt;array_distance&lt;/a&gt;', '&lt;a href="https://github.com/apache/datafusion/pull/12329"&gt;array_any_value&lt;/a&gt;', '&lt;a href="https://github.com/apache/datafusion/pull/12474"&gt;greatest&lt;/a&gt;',
  '&lt;a href="https://github.com/apache/datafusion/pull/13786"&gt;least&lt;/a&gt;', '&lt;a href="https://github.com/apache/datafusion/pull/14217"&gt;arrays_overlap&lt;/a&gt;'&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="ffi"&gt;FFI&lt;a class="headerlink" href="#ffi" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Foreign Function Interface work has started. This should allow for 
  &lt;a href="https://github.com/apache/datafusion/pull/12920"&gt;using table providers&lt;/a&gt; across languages and versions of DataFusion. This 
  is especially pertinent for integration with &lt;a href="https://delta-io.github.io/delta-rs/"&gt;delta-rs&lt;/a&gt; and other table formats.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="materialized-views"&gt;Materialized Views&lt;a class="headerlink" href="#materialized-views" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/suremarc"&gt;@suremarc&lt;/a&gt; has added a &lt;a href="https://github.com/datafusion-contrib/datafusion-materialized-views"&gt;materialized view implementation&lt;/a&gt; in datafusion-contrib 🚀&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="substrait"&gt;Substrait&lt;a class="headerlink" href="#substrait" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;A lot of work was put into improving and enhancing substrait support (&lt;a href="https://github.com/Blizzara"&gt;@Blizzara&lt;/a&gt;, &lt;a href="https://github.com/westonpace"&gt;@westonpace&lt;/a&gt;, &lt;a href="https://github.com/tokoko"&gt;@tokoko&lt;/a&gt;, &lt;a href="https://github.com/vbarua"&gt;@vbarua&lt;/a&gt;, &lt;a href="https://github.com/LatrecheYasser"&gt;@LatrecheYasser&lt;/a&gt;, &lt;a href="https://github.com/notfilippo"&gt;@notfilippo&lt;/a&gt; and others)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="looking-ahead-the-next-six-months"&gt;Looking Ahead: The Next Six Months 🔭&lt;a class="headerlink" href="#looking-ahead-the-next-six-months" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;One of the long term goals of &lt;a href="https://github.com/alamb"&gt;@alamb&lt;/a&gt;, DataFusion's PMC chair, has been to have 
&lt;a href="https://www.influxdata.com/blog/datafusion-2025-influxdb/"&gt;1000 DataFusion based projects&lt;/a&gt;. This may be the year that happens!&lt;/p&gt;
&lt;p&gt;The community has been &lt;a href="https://github.com/apache/datafusion/issues/14580"&gt;discussing what we will work on in the next six months&lt;/a&gt;.
Some major initiatives are likely to be:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;em&gt;Performance&lt;/em&gt;: A &lt;a href="https://github.com/apache/datafusion/issues/14482"&gt;number of items have been identified&lt;/a&gt; as areas that could use additional work&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Memory usage&lt;/em&gt;: Tracking and improving memory usage, statistics and spilling to disk &lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href="https://summerofcode.withgoogle.com/"&gt;Google Summer of Code&lt;/a&gt; (GSOC)&lt;/em&gt;: DataFusion is hopefully selected as a project and we start accepting and supporting student projects &lt;/li&gt;
&lt;li&gt;&lt;em&gt;FFI&lt;/em&gt;: Extending the FFI implementation to support all types of UDFs and SessionContext&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Spark Functions&lt;/em&gt;: A &lt;a href="https://github.com/apache/datafusion/issues/5600"&gt;proposal has been made to add a crate&lt;/a&gt; covering Spark-compatible built-in functions&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="how-to-get-involved"&gt;How to Get Involved&lt;a class="headerlink" href="#how-to-get-involved" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;DataFusion is not a project built or driven by a single person, company, or
foundation. Rather, our community of users and contributors work together to
build a shared technology that none of us could have built alone.&lt;/p&gt;
&lt;p&gt;If you are interested in joining us we would love to have you. You can try out
DataFusion on some of your own data and projects and let us know how it goes,
contribute suggestions, bug reports, or a PR with documentation,
tests, or code. A list of open issues suitable for beginners is &lt;a href="https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22"&gt;here&lt;/a&gt;, and you
can find how to reach us on the &lt;a href="https://datafusion.apache.org/contributor-guide/communication.html"&gt;communication doc&lt;/a&gt;.&lt;/p&gt;</content><category term="blog"/></entry><entry><title>Apache DataFusion Comet 0.6.0 Release</title><link href="https://datafusion.apache.org/blog/2025/02/17/datafusion-comet-0.6.0" rel="alternate"/><published>2025-02-17T00:00:00+00:00</published><updated>2025-02-17T00:00:00+00:00</updated><author><name>pmc</name></author><id>tag:datafusion.apache.org,2025-02-17:/blog/2025/02/17/datafusion-comet-0.6.0</id><summary type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;p&gt;The Apache DataFusion PMC is pleased to announce version 0.6.0 of the &lt;a href="https://datafusion.apache.org/comet/"&gt;Comet&lt;/a&gt; subproject.&lt;/p&gt;
&lt;p&gt;Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
improved performance and efficiency without requiring any code changes.&lt;/p&gt;
&lt;p&gt;Comet runs on commodity hardware and aims to …&lt;/p&gt;</summary><content type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;p&gt;The Apache DataFusion PMC is pleased to announce version 0.6.0 of the &lt;a href="https://datafusion.apache.org/comet/"&gt;Comet&lt;/a&gt; subproject.&lt;/p&gt;
&lt;p&gt;Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
improved performance and efficiency without requiring any code changes.&lt;/p&gt;
&lt;p&gt;Comet runs on commodity hardware and aims to provide 100% compatibility with Apache Spark. Any operators or
expressions that are not fully compatible will fall back to Spark unless explicitly enabled by the user. Refer
to the &lt;a href="https://datafusion.apache.org/comet/user-guide/compatibility.html"&gt;compatibility guide&lt;/a&gt; for more information.&lt;/p&gt;
&lt;p&gt;This release covers approximately four weeks of development work and is the result of merging 39 PRs from 12
contributors. See the &lt;a href="https://github.com/apache/datafusion-comet/blob/main/dev/changelog/0.6.0.md"&gt;change log&lt;/a&gt; for more information.&lt;/p&gt;
&lt;p&gt;Starting with this release, we plan to release new versions of Comet more frequently, typically within 1-2 weeks
of each major DataFusion release. The main motivation for this change is to better support downstream Rust projects
that depend on the &lt;a href="https://docs.rs/datafusion-comet-spark-expr/latest/datafusion_comet_spark_expr/"&gt;datafusion_comet_spark_expr&lt;/a&gt; crate.&lt;/p&gt;
&lt;h2 id="release-highlights"&gt;Release Highlights&lt;a class="headerlink" href="#release-highlights" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id="datafusion-upgrade"&gt;DataFusion Upgrade&lt;a class="headerlink" href="#datafusion-upgrade" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Comet 0.6.0 uses DataFusion 45.0.0&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="new-features"&gt;New Features&lt;a class="headerlink" href="#new-features" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Comet now supports &lt;code&gt;array_join&lt;/code&gt;, &lt;code&gt;array_intersect&lt;/code&gt;, and &lt;code&gt;arrays_overlap&lt;/code&gt;. Note that these expressions are not
  yet guaranteed to be 100% compatible with Spark for all input data types, so they are only enabled when the
  configuration setting &lt;code&gt;spark.comet.expression.allowIncompatible=true&lt;/code&gt; is set.&lt;/li&gt;
&lt;/ul&gt;
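&lt;p&gt;For example, assuming a standard Comet installation (the plugin class and application jar below are illustrative; see the installation guide for your deployment), the setting can be passed at submission time:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;spark-submit \
  --conf spark.plugins=org.apache.spark.CometPlugin \
  --conf spark.comet.expression.allowIncompatible=true \
  my-job.jar
&lt;/code&gt;&lt;/pre&gt;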
&lt;h3 id="performance-stability"&gt;Performance &amp;amp; Stability&lt;a class="headerlink" href="#performance-stability" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Metrics from native execution are now updated in Spark every 3 seconds by default, rather than after each
  processed batch. The mechanism for passing the metrics via JNI is also more efficient.&lt;/li&gt;
&lt;li&gt;New memory pool options "fair unified" and "unbounded" have been added. See the &lt;a href="https://datafusion.apache.org/comet/user-guide/tuning.html"&gt;Comet Tuning Guide&lt;/a&gt; for more information.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="bug-fixes"&gt;Bug Fixes&lt;a class="headerlink" href="#bug-fixes" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Hashing of decimal values with precision &amp;lt;= 18 is now compatible with Spark&lt;/li&gt;
&lt;li&gt;Comet falls back to Spark when hashing decimals with precision &amp;gt; 18&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="getting-involved"&gt;Getting Involved&lt;a class="headerlink" href="#getting-involved" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The Comet project welcomes new contributors. We use the same &lt;a href="https://datafusion.apache.org/contributor-guide/communication.html#slack-and-discord"&gt;Slack and Discord&lt;/a&gt; channels as the main DataFusion
project and have a weekly &lt;a href="https://docs.google.com/document/d/1NBpkIAuU7O9h8Br5CbFksDhX-L9TyO9wmGLPMe0Plc8/edit?usp=sharing"&gt;DataFusion video call&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The easiest way to get involved is to test Comet with your current Spark jobs and file issues for any bugs or
performance regressions that you find. See the &lt;a href="https://datafusion.apache.org/comet/user-guide/installation.html"&gt;Getting Started&lt;/a&gt; guide for instructions on downloading and installing
Comet.&lt;/p&gt;
&lt;p&gt;There are also many &lt;a href="https://github.com/apache/datafusion-comet/contribute"&gt;good first issues&lt;/a&gt; waiting for contributions.&lt;/p&gt;</content><category term="blog"/></entry><entry><title>Apache DataFusion Comet 0.5.0 Release</title><link href="https://datafusion.apache.org/blog/2025/01/17/datafusion-comet-0.5.0" rel="alternate"/><published>2025-01-17T00:00:00+00:00</published><updated>2025-01-17T00:00:00+00:00</updated><author><name>pmc</name></author><id>tag:datafusion.apache.org,2025-01-17:/blog/2025/01/17/datafusion-comet-0.5.0</id><summary type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;p&gt;The Apache DataFusion PMC is pleased to announce version 0.5.0 of the &lt;a href="https://datafusion.apache.org/comet/"&gt;Comet&lt;/a&gt; subproject.&lt;/p&gt;
&lt;p&gt;Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
improved performance and efficiency without requiring any code changes.&lt;/p&gt;
&lt;p&gt;Comet runs on commodity hardware and aims to …&lt;/p&gt;</summary><content type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;p&gt;The Apache DataFusion PMC is pleased to announce version 0.5.0 of the &lt;a href="https://datafusion.apache.org/comet/"&gt;Comet&lt;/a&gt; subproject.&lt;/p&gt;
&lt;p&gt;Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
improved performance and efficiency without requiring any code changes.&lt;/p&gt;
&lt;p&gt;Comet runs on commodity hardware and aims to provide 100% compatibility with Apache Spark. Any operators or
expressions that are not fully compatible will fall back to Spark unless explicitly enabled by the user. Refer
to the &lt;a href="https://datafusion.apache.org/comet/user-guide/compatibility.html"&gt;compatibility guide&lt;/a&gt; for more information.&lt;/p&gt;
&lt;p&gt;This release covers approximately 8 weeks of development work and is the result of merging 69 PRs from 15
contributors. See the &lt;a href="https://github.com/apache/datafusion-comet/blob/main/dev/changelog/0.5.0.md"&gt;change log&lt;/a&gt; for more information.&lt;/p&gt;
&lt;h2 id="release-highlights"&gt;Release Highlights&lt;a class="headerlink" href="#release-highlights" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id="performance"&gt;Performance&lt;a class="headerlink" href="#performance" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Comet 0.5.0 achieves a 1.9x speedup for single-node TPC-H @ 100 GB, an improvement from 1.7x in the previous release.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Chart showing TPC-H benchmark results for Comet 0.5.0" class="img-fluid" src="/blog/images/comet-0.5.0/tpch_allqueries.png" width="100%"/&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="Chart showing TPC-H benchmark results for Comet 0.5.0" class="img-fluid" src="/blog/images/comet-0.5.0/tpch_queries_compare.png" width="100%"/&gt;&lt;/p&gt;
&lt;p&gt;More benchmarking results can be found in the &lt;a href="https://datafusion.apache.org/comet/contributor-guide/benchmarking.html"&gt;Comet Benchmarking Guide&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="shuffle-improvements"&gt;Shuffle Improvements&lt;a class="headerlink" href="#shuffle-improvements" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Comet now supports multiple compression algorithms for compressing shuffle files. Previously, only ZSTD was supported
but Comet now also supports LZ4 and Snappy. The default is now LZ4, which matches the default in Spark. ZSTD may be
a better choice when the compression ratio is more important than CPU overhead.&lt;/p&gt;
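&lt;p&gt;As an illustration, the codec could be selected with a configuration setting along the following lines. The exact key name is an assumption here; consult the &lt;a href="https://datafusion.apache.org/comet/user-guide/tuning.html"&gt;Comet Tuning Guide&lt;/a&gt; for the setting used by your version:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# key name is illustrative -- verify against the tuning guide
--conf spark.comet.exec.shuffle.compression.codec=zstd
&lt;/code&gt;&lt;/pre&gt;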
&lt;p&gt;Previously, Comet used Arrow IPC to encode record batches into shuffle files. Although Arrow IPC is a good
general-purpose framework for serializing Arrow record batches, we found that we could get better performance using
a custom serialization approach optimized for Comet. One optimization is that the schema is encoded once per shuffle
operation rather than once per batch. There are some planned performance improvements in the Rust implementation of
Arrow IPC and Comet may switch back to Arrow IPC in the future.&lt;/p&gt;
&lt;p&gt;Comet provides two shuffle implementations. Comet native shuffle is the fastest and performs repartitioning in
native code. Comet columnar shuffle delegates to Spark to perform repartitioning and is used in cases where native
shuffle is not supported, such as with &lt;code&gt;RangePartitioning&lt;/code&gt;. Comet generally tries to use native shuffle first, then
columnar shuffle, and finally falls back to Spark if neither is supported. In previous releases, Comet would
sometimes fall back to Spark shuffle when native shuffle was not supported, missing opportunities to use columnar
shuffle. This bug is fixed in this release, but the fix currently requires the configuration setting
&lt;code&gt;spark.comet.exec.shuffle.fallbackToColumnar=true&lt;/code&gt;. This will be enabled by default in the next release.&lt;/p&gt;
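&lt;p&gt;Until it becomes the default, the fix can be enabled explicitly, for example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;--conf spark.comet.exec.shuffle.fallbackToColumnar=true
&lt;/code&gt;&lt;/pre&gt;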
&lt;h3 id="memory-management"&gt;Memory Management&lt;a class="headerlink" href="#memory-management" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Comet 0.4.0 required Spark to be configured to use off-heap memory. In this release, off-heap memory is no longer
required, and there are multiple options for configuring Comet to use on-heap memory instead. More details are
available in the &lt;a href="https://datafusion.apache.org/comet/user-guide/tuning.html"&gt;Comet Tuning Guide&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="spark-sql-metrics"&gt;Spark SQL Metrics&lt;a class="headerlink" href="#spark-sql-metrics" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Comet now provides detailed metrics for native shuffle, showing time for repartitioning, encoding and compressing,
and writing to disk.&lt;/p&gt;
&lt;h3 id="crate-reorganization"&gt;Crate Reorganization&lt;a class="headerlink" href="#crate-reorganization" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;One of the goals of the Comet project is to make Spark-compatible functionality available to other projects that
are based on DataFusion. In this release, many implementations of Spark-compatible expressions were moved from the
unpublished &lt;code&gt;datafusion-comet&lt;/code&gt; crate, which provides the native part of the Spark plugin, into the
&lt;code&gt;datafusion-comet-spark-expr&lt;/code&gt; crate. There is also ongoing work to reorganize this crate to move expressions into
subfolders named after the group name that Spark uses to organize expressions. For example, there are now subfolders
named &lt;code&gt;agg_funcs&lt;/code&gt;, &lt;code&gt;datetime_funcs&lt;/code&gt;, &lt;code&gt;hash_funcs&lt;/code&gt;, and so on.&lt;/p&gt;
&lt;h2 id="update-on-complex-type-support"&gt;Update on Complex Type Support&lt;a class="headerlink" href="#update-on-complex-type-support" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Good progress has been made with proof-of-concept work using DataFusion’s &lt;code&gt;ParquetExec&lt;/code&gt;, which has the advantage of
supporting complex types. This work is available on the &lt;code&gt;comet-parquet-exec&lt;/code&gt; branch, and the current focus is on
fixing test regressions, particularly regarding timestamp conversion issues.&lt;/p&gt;
&lt;h2 id="getting-involved"&gt;Getting Involved&lt;a class="headerlink" href="#getting-involved" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The Comet project welcomes new contributors. We use the same &lt;a href="https://datafusion.apache.org/contributor-guide/communication.html#slack-and-discord"&gt;Slack and Discord&lt;/a&gt; channels as the main DataFusion
project and have a weekly &lt;a href="https://docs.google.com/document/d/1NBpkIAuU7O9h8Br5CbFksDhX-L9TyO9wmGLPMe0Plc8/edit?usp=sharing"&gt;DataFusion video call&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The easiest way to get involved is to test Comet with your current Spark jobs and file issues for any bugs or
performance regressions that you find. See the &lt;a href="https://datafusion.apache.org/comet/user-guide/installation.html"&gt;Getting Started&lt;/a&gt; guide for instructions on downloading and installing
Comet.&lt;/p&gt;
&lt;p&gt;There are also many &lt;a href="https://github.com/apache/datafusion-comet/contribute"&gt;good first issues&lt;/a&gt; waiting for contributions.&lt;/p&gt;</content><category term="blog"/></entry><entry><title>Apache DataFusion Comet 0.4.0 Release</title><link href="https://datafusion.apache.org/blog/2024/11/20/datafusion-comet-0.4.0" rel="alternate"/><published>2024-11-20T00:00:00+00:00</published><updated>2024-11-20T00:00:00+00:00</updated><author><name>pmc</name></author><id>tag:datafusion.apache.org,2024-11-20:/blog/2024/11/20/datafusion-comet-0.4.0</id><summary type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;p&gt;The Apache DataFusion PMC is pleased to announce version 0.4.0 of the &lt;a href="https://datafusion.apache.org/comet/"&gt;Comet&lt;/a&gt; subproject.&lt;/p&gt;
&lt;p&gt;Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
improved performance and efficiency without requiring any code changes.&lt;/p&gt;
&lt;p&gt;Comet runs on commodity hardware and aims to …&lt;/p&gt;</summary><content type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;p&gt;The Apache DataFusion PMC is pleased to announce version 0.4.0 of the &lt;a href="https://datafusion.apache.org/comet/"&gt;Comet&lt;/a&gt; subproject.&lt;/p&gt;
&lt;p&gt;Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
improved performance and efficiency without requiring any code changes.&lt;/p&gt;
&lt;p&gt;Comet runs on commodity hardware and aims to provide 100% compatibility with Apache Spark. Any operators or
expressions that are not fully compatible will fall back to Spark unless explicitly enabled by the user. Refer
to the &lt;a href="https://datafusion.apache.org/comet/user-guide/compatibility.html"&gt;compatibility guide&lt;/a&gt; for more information.&lt;/p&gt;
&lt;p&gt;This release covers approximately six weeks of development work and is the result of merging 51 PRs from 10
contributors. See the &lt;a href="https://github.com/apache/datafusion-comet/blob/main/dev/changelog/0.4.0.md"&gt;change log&lt;/a&gt; for more information.&lt;/p&gt;
&lt;h2 id="release-highlights"&gt;Release Highlights&lt;a class="headerlink" href="#release-highlights" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id="performance-stability"&gt;Performance &amp;amp; Stability&lt;a class="headerlink" href="#performance-stability" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;There are a number of performance and stability improvements in this release. Here is a summary of some of the
larger changes. Current benchmarking results can be found in the &lt;a href="https://datafusion.apache.org/comet/contributor-guide/benchmarking.html"&gt;Comet Benchmarking Guide&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="unified-memory-management"&gt;Unified Memory Management&lt;a class="headerlink" href="#unified-memory-management" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Comet now uses a unified memory management approach that shares an off-heap memory pool with Apache Spark, resulting
in a much simpler configuration. Comet now requires &lt;code&gt;spark.memory.offHeap.enabled=true&lt;/code&gt;. This approach provides a
holistic view of memory usage in Spark and Comet and makes it easier to optimize system performance.&lt;/p&gt;
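&lt;p&gt;A minimal sketch of the required settings (the pool size is illustrative and should be tuned for your workload):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;--conf spark.memory.offHeap.enabled=true
--conf spark.memory.offHeap.size=8g
&lt;/code&gt;&lt;/pre&gt;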
&lt;h4 id="faster-joins"&gt;Faster Joins&lt;a class="headerlink" href="#faster-joins" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Apache Spark supports sort-merge and hash joins, which have similar performance characteristics. Spark defaults to
using sort-merge joins because they are less likely to result in OutOfMemory exceptions. In vectorized query
engines such as DataFusion, hash joins outperform sort-merge joins. Comet now has an experimental feature to
replace Spark sort-merge joins with hash joins for improved performance. This feature is experimental because
there is currently no spill-to-disk support in the hash join implementation. This feature can be enabled by
setting &lt;code&gt;spark.comet.exec.replaceSortMergeJoin=true&lt;/code&gt;.&lt;/p&gt;
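&lt;p&gt;For example, the feature can be tried on workloads where join inputs comfortably fit in memory:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# experimental: the hash join implementation cannot yet spill to disk
--conf spark.comet.exec.replaceSortMergeJoin=true
&lt;/code&gt;&lt;/pre&gt;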
&lt;h4 id="bloom-filter-aggregates"&gt;Bloom Filter Aggregates&lt;a class="headerlink" href="#bloom-filter-aggregates" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Spark’s optimizer can insert Bloom filter aggregations and filters to prune large result sets before a shuffle.
Previously, Comet fell back to Spark for the aggregation step. Comet now supports Bloom filter aggregations natively,
building on its existing support for testing Bloom filters. Users no longer need to set
&lt;code&gt;spark.sql.optimizer.runtime.bloomFilter.enabled=false&lt;/code&gt; when using Comet.&lt;/p&gt;
&lt;h4 id="complex-type-support"&gt;Complex Type support&lt;a class="headerlink" href="#complex-type-support" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;This release has the following improvements to complex type support:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Implemented &lt;code&gt;ArrayAppend&lt;/code&gt; and &lt;code&gt;GetArrayStructFields&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Implemented native cast between structs&lt;/li&gt;
&lt;li&gt;Implemented native cast from structs to string&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="roadmap"&gt;Roadmap&lt;a class="headerlink" href="#roadmap" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;One of the highest priority items on the roadmap is to add support for reading complex types (maps, structs, and arrays)
from Parquet sources, both when reading Parquet directly and from Iceberg.&lt;/p&gt;
&lt;p&gt;Comet currently has its own custom native code for decoding Parquet pages, native column readers for all of Spark’s
primitive types, and special handling for Spark-specific use cases such as timestamp rebasing and decimal type
promotion. This implementation does not yet support complex types. File IO, decryption, and decompression are handled
in JVM code, and Parquet pages are passed on to native code for decoding.&lt;/p&gt;
&lt;p&gt;Rather than add complex type support to this existing code, we are exploring two main options to allow us to
leverage more of the upstream Arrow and DataFusion code.&lt;/p&gt;
&lt;h3 id="use-datafusions-parquetexec"&gt;Use DataFusion’s ParquetExec&lt;a class="headerlink" href="#use-datafusions-parquetexec" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;For use cases where DataFusion can support reading a Parquet source, Comet could create a native plan that uses
DataFusion’s ParquetExec. We are investigating using DataFusion’s SchemaAdapter to handle some Spark-specific
handling of timestamps and decimals.&lt;/p&gt;
&lt;h3 id="use-arrows-parquet-batch-reader"&gt;Use Arrow’s Parquet Batch Reader&lt;a class="headerlink" href="#use-arrows-parquet-batch-reader" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;For use cases not supported by DataFusion’s ParquetExec, such as integrating with Iceberg, we are exploring
replacing our current native Parquet decoding logic with the Arrow readers provided by the Parquet crate.&lt;/p&gt;
&lt;p&gt;Iceberg already provides a vectorized Spark reader for Parquet. A &lt;a href="https://github.com/apache/iceberg/pull/9841"&gt;PR&lt;/a&gt; is open against Iceberg for adding a native
version based on Comet, and we hope to update this to leverage the improvements outlined above.&lt;/p&gt;
&lt;h2 id="getting-involved"&gt;Getting Involved&lt;a class="headerlink" href="#getting-involved" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The Comet project welcomes new contributors. We use the same &lt;a href="https://datafusion.apache.org/contributor-guide/communication.html#slack-and-discord"&gt;Slack and Discord&lt;/a&gt; channels as the main DataFusion
project and have a weekly &lt;a href="https://docs.google.com/document/d/1NBpkIAuU7O9h8Br5CbFksDhX-L9TyO9wmGLPMe0Plc8/edit?usp=sharing"&gt;DataFusion video call&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The easiest way to get involved is to test Comet with your current Spark jobs and file issues for any bugs or
performance regressions that you find. See the &lt;a href="https://datafusion.apache.org/comet/user-guide/installation.html"&gt;Getting Started&lt;/a&gt; guide for instructions on downloading and installing
Comet.&lt;/p&gt;
&lt;p&gt;There are also many &lt;a href="https://github.com/apache/datafusion-comet/contribute"&gt;good first issues&lt;/a&gt; waiting for contributions.&lt;/p&gt;</content><category term="blog"/></entry><entry><title>Apache DataFusion Comet 0.3.0 Release</title><link href="https://datafusion.apache.org/blog/2024/09/27/datafusion-comet-0.3.0" rel="alternate"/><published>2024-09-27T00:00:00+00:00</published><updated>2024-09-27T00:00:00+00:00</updated><author><name>pmc</name></author><id>tag:datafusion.apache.org,2024-09-27:/blog/2024/09/27/datafusion-comet-0.3.0</id><summary type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;p&gt;The Apache DataFusion PMC is pleased to announce version 0.3.0 of the &lt;a href="https://datafusion.apache.org/comet/"&gt;Comet&lt;/a&gt; subproject.&lt;/p&gt;
&lt;p&gt;Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
improved performance and efficiency without requiring any code changes.&lt;/p&gt;
&lt;p&gt;Comet runs on commodity hardware and aims to …&lt;/p&gt;</summary><content type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;p&gt;The Apache DataFusion PMC is pleased to announce version 0.3.0 of the &lt;a href="https://datafusion.apache.org/comet/"&gt;Comet&lt;/a&gt; subproject.&lt;/p&gt;
&lt;p&gt;Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
improved performance and efficiency without requiring any code changes.&lt;/p&gt;
&lt;p&gt;Comet runs on commodity hardware and aims to provide 100% compatibility with Apache Spark. Any operators or
expressions that are not fully compatible will fall back to Spark unless explicitly enabled by the user. Refer
to the &lt;a href="https://datafusion.apache.org/comet/user-guide/compatibility.html"&gt;compatibility guide&lt;/a&gt; for more information.&lt;/p&gt;
&lt;p&gt;This release covers approximately four weeks of development work and is the result of merging 57 PRs from 12 
contributors. See the &lt;a href="https://github.com/apache/datafusion-comet/blob/main/dev/changelog/0.3.0.md"&gt;change log&lt;/a&gt; for more information.&lt;/p&gt;
&lt;h2 id="release-highlights"&gt;Release Highlights&lt;a class="headerlink" href="#release-highlights" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id="binary-releases"&gt;Binary Releases&lt;a class="headerlink" href="#binary-releases" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Comet jar files are now published to Maven Central for amd64 and arm64 architectures (Linux only).&lt;/p&gt;
&lt;p&gt;Files can be found by searching for &lt;a href="https://central.sonatype.com/search?q=org.apache.datafusion"&gt;org.apache.datafusion on Maven Central&lt;/a&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Spark versions 3.3, 3.4, and 3.5 are supported.&lt;/li&gt;
&lt;li&gt;Scala versions 2.12 and 2.13 are supported.&lt;/li&gt;
&lt;/ul&gt;
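&lt;p&gt;As an illustration, a published jar could be pulled in via &lt;code&gt;--packages&lt;/code&gt;. The coordinates below are an assumption; verify the artifact name for your Spark and Scala versions on Maven Central:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# artifact coordinates are illustrative -- confirm on Maven Central
spark-submit --packages org.apache.datafusion:comet-spark-spark3.4_2.12:0.3.0 my-job.jar
&lt;/code&gt;&lt;/pre&gt;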
&lt;h3 id="new-features"&gt;New Features&lt;a class="headerlink" href="#new-features" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The following expressions are now supported natively:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;DateAdd&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;DateSub&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ElementAt&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;GetArrayElement&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ToJson&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="performance-stability"&gt;Performance &amp;amp; Stability&lt;a class="headerlink" href="#performance-stability" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Upgraded to DataFusion 42.0.0&lt;/li&gt;
&lt;li&gt;Reduced memory overhead by fixing several memory leaks&lt;/li&gt;
&lt;li&gt;Comet now falls back to Spark for queries that use dynamic partition pruning (DPP), to avoid performance
  regressions, because Comet does not yet have native support for DPP&lt;/li&gt;
&lt;li&gt;Improved performance when converting Spark columnar data to Arrow format&lt;/li&gt;
&lt;li&gt;Faster decimal sum and avg functions &lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="documentation-updates"&gt;Documentation Updates&lt;a class="headerlink" href="#documentation-updates" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Improved documentation for deploying Comet with Kubernetes and Helm in the &lt;a href="https://datafusion.apache.org/comet/user-guide/kubernetes.html"&gt;Comet Kubernetes Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;More detailed architectural overview of Comet scan and execution in the &lt;a href="https://datafusion.apache.org/comet/contributor-guide/plugin_overview.html"&gt;Comet Plugin Overview&lt;/a&gt; in the contributor guide&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="getting-involved"&gt;Getting Involved&lt;a class="headerlink" href="#getting-involved" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The Comet project welcomes new contributors. We use the same &lt;a href="https://datafusion.apache.org/contributor-guide/communication.html#slack-and-discord"&gt;Slack and Discord&lt;/a&gt; channels as the main DataFusion
project.&lt;/p&gt;
&lt;p&gt;The easiest way to get involved is to test Comet with your current Spark jobs and file issues for any bugs or
performance regressions that you find. See the &lt;a href="https://datafusion.apache.org/comet/user-guide/installation.html"&gt;Getting Started&lt;/a&gt; guide for instructions on downloading and installing
Comet.&lt;/p&gt;
&lt;p&gt;There are also many &lt;a href="https://github.com/apache/datafusion-comet/contribute"&gt;good first issues&lt;/a&gt; waiting for contributions.&lt;/p&gt;</content><category term="blog"/></entry><entry><title>Apache DataFusion Comet 0.2.0 Release</title><link href="https://datafusion.apache.org/blog/2024/08/28/datafusion-comet-0.2.0" rel="alternate"/><published>2024-08-28T00:00:00+00:00</published><updated>2024-08-28T00:00:00+00:00</updated><author><name>pmc</name></author><id>tag:datafusion.apache.org,2024-08-28:/blog/2024/08/28/datafusion-comet-0.2.0</id><summary type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;p&gt;The Apache DataFusion PMC is pleased to announce version 0.2.0 of the &lt;a href="https://datafusion.apache.org/comet/"&gt;Comet&lt;/a&gt; subproject.&lt;/p&gt;
&lt;p&gt;Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
improved performance and efficiency without requiring any code changes.&lt;/p&gt;
&lt;p&gt;Comet runs on commodity hardware and aims to …&lt;/p&gt;</summary><content type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;p&gt;The Apache DataFusion PMC is pleased to announce version 0.2.0 of the &lt;a href="https://datafusion.apache.org/comet/"&gt;Comet&lt;/a&gt; subproject.&lt;/p&gt;
&lt;p&gt;Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
improved performance and efficiency without requiring any code changes.&lt;/p&gt;
&lt;p&gt;Comet runs on commodity hardware and aims to provide 100% compatibility with Apache Spark. Any operators or
expressions that are not fully compatible will fall back to Spark unless explicitly enabled by the user. Refer
to the &lt;a href="https://datafusion.apache.org/comet/user-guide/compatibility.html"&gt;compatibility guide&lt;/a&gt; for more information.&lt;/p&gt;
&lt;p&gt;This release covers approximately four weeks of development work and is the result of merging 87 PRs from 14 
contributors. See the &lt;a href="https://github.com/apache/datafusion-comet/blob/main/dev/changelog/0.2.0.md"&gt;change log&lt;/a&gt; for more information.&lt;/p&gt;
&lt;h2 id="release-highlights"&gt;Release Highlights&lt;a class="headerlink" href="#release-highlights" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id="docker-images"&gt;Docker Images&lt;a class="headerlink" href="#docker-images" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Docker images are now available from the &lt;a href="https://github.com/apache/datafusion-comet/pkgs/container/datafusion-comet/265110454?tag=spark-3.4-scala-2.12-0.2.0"&gt;GitHub Container Registry&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="performance-improvements"&gt;Performance improvements&lt;a class="headerlink" href="#performance-improvements" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Native shuffle is now enabled by default&lt;/li&gt;
&lt;li&gt;Improved handling of decimal types&lt;/li&gt;
&lt;li&gt;Reduced some redundant copying of batches in Filter/Scan operations&lt;/li&gt;
&lt;li&gt;Optimized performance of count aggregates&lt;/li&gt;
&lt;li&gt;Optimized performance of &lt;code&gt;CASE&lt;/code&gt; expressions for specific uses:&lt;ul&gt;
&lt;li&gt;&lt;code&gt;CASE WHEN expr THEN column ELSE null END&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;CASE WHEN expr THEN literal ELSE literal END&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;Optimized performance of IS NOT NULL&lt;/li&gt;
&lt;/ul&gt;
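The two `CASE` specializations listed above amount to replacing per-row expression interpretation with a vectorized select over an already-computed boolean mask. A minimal sketch of the idea (illustrative only; this is not Comet's actual implementation, and the function name is made up):

```rust
// Conceptual sketch of the `CASE WHEN expr THEN literal ELSE literal END`
// specialization: evaluate the predicate once into a boolean mask for the
// whole batch, then select between the two literals in a tight loop,
// instead of re-interpreting the CASE expression for every row.
fn case_when_literals(mask: &[bool], then_val: i64, else_val: i64) -> Vec<i64> {
    mask.iter()
        .map(|&m| if m { then_val } else { else_val })
        .collect()
}

fn main() {
    // Predicate `x > 2` already evaluated into a mask for the batch [1, 3, 5, 0]
    let mask = [false, true, true, false];
    let out = case_when_literals(&mask, 10, -1);
    assert_eq!(out, vec![-1, 10, 10, -1]);
    println!("{:?}", out);
}
```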
&lt;h3 id="new-features"&gt;New Features&lt;a class="headerlink" href="#new-features" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Window operations now support count and sum aggregates&lt;/li&gt;
&lt;li&gt;CreateArray&lt;/li&gt;
&lt;li&gt;GetStructField&lt;/li&gt;
&lt;li&gt;Support nested types in hash join&lt;/li&gt;
&lt;li&gt;Basic implementation of RLIKE expression&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="current-performance"&gt;Current Performance&lt;a class="headerlink" href="#current-performance" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We use benchmarks derived from the industry standard TPC-H and TPC-DS benchmarks for tracking progress with
performance. The following charts show the time it takes to run the queries against 100 GB of data in
Parquet format using a single executor with eight cores. See the &lt;a href="https://datafusion.apache.org/comet/contributor-guide/benchmarking.html"&gt;Comet Benchmarking Guide&lt;/a&gt;
for details of the environment used for these benchmarks.&lt;/p&gt;
&lt;h3 id="benchmark-derived-from-tpc-h"&gt;Benchmark derived from TPC-H&lt;a class="headerlink" href="#benchmark-derived-from-tpc-h" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Comet 0.2.0 provides a 62% speedup compared to Spark. This is slightly better than the Comet 0.1.0 release.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Chart showing TPC-H benchmark results for Comet 0.2.0" class="img-fluid" src="/blog/images/comet-0.2.0/tpch_allqueries.png" width="100%"/&gt;&lt;/p&gt;
&lt;h3 id="benchmark-derived-from-tpc-ds"&gt;Benchmark derived from TPC-DS&lt;a class="headerlink" href="#benchmark-derived-from-tpc-ds" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Comet 0.2.0 provides a 21% speedup compared to Spark, which is a significant improvement compared to 
Comet 0.1.0, which did not provide any speedup for this benchmark.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Chart showing TPC-DS benchmark results for Comet 0.2.0" class="img-fluid" src="/blog/images/comet-0.2.0/tpcds_allqueries.png" width="100%"/&gt;&lt;/p&gt;
&lt;h2 id="getting-involved"&gt;Getting Involved&lt;a class="headerlink" href="#getting-involved" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The Comet project welcomes new contributors. We use the same &lt;a href="https://datafusion.apache.org/contributor-guide/communication.html#slack-and-discord"&gt;Slack and Discord&lt;/a&gt; channels as the main DataFusion
project.&lt;/p&gt;
&lt;p&gt;The easiest way to get involved is to test Comet with your current Spark jobs and file issues for any bugs or
performance regressions that you find. See the &lt;a href="https://datafusion.apache.org/comet/user-guide/installation.html"&gt;Getting Started&lt;/a&gt; guide for instructions on downloading and installing
Comet.&lt;/p&gt;
&lt;p&gt;There are also many &lt;a href="https://github.com/apache/datafusion-comet/contribute"&gt;good first issues&lt;/a&gt; waiting for contributions.&lt;/p&gt;</content><category term="blog"/></entry><entry><title>Apache DataFusion 40.0.0 Released</title><link href="https://datafusion.apache.org/blog/2024/07/24/datafusion-40.0.0" rel="alternate"/><published>2024-07-24T00:00:00+00:00</published><updated>2024-07-24T00:00:00+00:00</updated><author><name>pmc</name></author><id>tag:datafusion.apache.org,2024-07-24:/blog/2024/07/24/datafusion-40.0.0</id><summary type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;!-- see https://github.com/apache/datafusion/issues/9602 for details --&gt;
&lt;h2 id="introduction"&gt;Introduction&lt;a class="headerlink" href="#introduction" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We are proud to announce &lt;a href="https://crates.io/crates/datafusion/40.0.0"&gt;DataFusion 40.0.0&lt;/a&gt;. This blog highlights some of the
many major improvements since we released &lt;a href="https://datafusion.apache.org/blog/2024/01/19/datafusion-34.0.0/"&gt;DataFusion 34.0.0&lt;/a&gt; and a preview of
what the community is thinking about in the next 6 months. We are hoping to make
more regular blog posts …&lt;/p&gt;</summary><content type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;!-- see https://github.com/apache/datafusion/issues/9602 for details --&gt;
&lt;h2 id="introduction"&gt;Introduction&lt;a class="headerlink" href="#introduction" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We are proud to announce &lt;a href="https://crates.io/crates/datafusion/40.0.0"&gt;DataFusion 40.0.0&lt;/a&gt;. This blog highlights some of the
many major improvements since we released &lt;a href="https://datafusion.apache.org/blog/2024/01/19/datafusion-34.0.0/"&gt;DataFusion 34.0.0&lt;/a&gt; and a preview of
what the community is thinking about in the next 6 months. We are hoping to make
more regular blog posts -- if you are interested in helping write them, please
reach out!&lt;/p&gt;
&lt;p&gt;&lt;a href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; is an extensible query engine, written in &lt;a href="https://www.rust-lang.org/"&gt;Rust&lt;/a&gt;, that
uses &lt;a href="https://arrow.apache.org"&gt;Apache Arrow&lt;/a&gt; as its in-memory format. DataFusion is used by developers to
create new, fast data centric systems such as databases, dataframe libraries,
machine learning and streaming applications. While &lt;a href="https://datafusion.apache.org/user-guide/introduction.html#project-goals"&gt;DataFusion’s primary design
goal&lt;/a&gt; is to accelerate the creation of other data centric systems, it provides a
reasonable out-of-the-box experience as a &lt;a href="https://datafusion.apache.org/python/"&gt;dataframe library&lt;/a&gt; and
&lt;a href="https://datafusion.apache.org/user-guide/cli/"&gt;command line SQL tool&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;DataFusion's core thesis is that as a community, together we can build much more
advanced technology than any of us as individuals or companies could do alone. 
Without DataFusion, highly performant vectorized query engines would remain
the domain of a few large companies and world-class research institutions. 
With DataFusion, we can all build on top of a shared foundation, and focus on
what makes our projects unique.&lt;/p&gt;
&lt;h2 id="community-growth"&gt;Community Growth  📈&lt;a class="headerlink" href="#community-growth" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;In the last 6 months, between &lt;code&gt;34.0.0&lt;/code&gt; and &lt;code&gt;40.0.0&lt;/code&gt;, our community continues to
grow in new and exciting ways.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;DataFusion became a top level Apache Software Foundation project (read the
   &lt;a href="https://news.apache.org/foundation/entry/apache-software-foundation-announces-new-top-level-project-apache-datafusion"&gt;press release&lt;/a&gt; and &lt;a href="https://datafusion.apache.org/blog/2024/05/07/datafusion-tlp/"&gt;blog post&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;We added several PMC members and new
   committers: &lt;a href="https://github.com/comphead"&gt;@comphead&lt;/a&gt;, &lt;a href="https://github.com/mustafasrepo"&gt;@mustafasrepo&lt;/a&gt;, &lt;a href="https://github.com/ozankabak"&gt;@ozankabak&lt;/a&gt;, and &lt;a href="https://github.com/waynexia"&gt;@waynexia&lt;/a&gt; joined the PMC,
   &lt;a href="https://github.com/jonahgao"&gt;@jonahgao&lt;/a&gt; and &lt;a href="https://github.com/lewiszlw"&gt;@lewiszlw&lt;/a&gt; joined as committers. See the &lt;a href="https://lists.apache.org/list.html?dev@datafusion.apache.org"&gt;mailing list&lt;/a&gt; for
   more details.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://datafusion.apache.org/comet/"&gt;DataFusion Comet&lt;/a&gt; was &lt;a href="https://arrow.apache.org/blog/2024/03/06/comet-donation/"&gt;donated&lt;/a&gt; and is nearing its first release.&lt;/li&gt;
&lt;li&gt;In the &lt;a href="https://github.com/apache/arrow-datafusion"&gt;core DataFusion repo&lt;/a&gt; alone we reviewed and accepted almost 1500 PRs from 182 different
   committers, created over 1000 issues and closed 781 of them 🚀. This is up
   almost 50% from our last post (1000 PRs from 124 committers with 650 issues
   created in our last post) 🤯. All changes are listed in the detailed
   &lt;a href="https://github.com/apache/datafusion/blob/main/datafusion/CHANGELOG.md"&gt;CHANGELOG&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;DataFusion focused meetups happened or are happening in multiple cities 
   around the world: &lt;a href="https://github.com/apache/datafusion/discussions/8522"&gt;Austin&lt;/a&gt;, &lt;a href="https://github.com/apache/datafusion/discussions/10800"&gt;San Francisco&lt;/a&gt;, &lt;a href="https://www.huodongxing.com/event/5761971909400?td=1965290734055"&gt;Hangzhou&lt;/a&gt;, &lt;a href="https://github.com/apache/datafusion/discussions/11213"&gt;New York&lt;/a&gt;, and
   &lt;a href="https://github.com/apache/datafusion/discussions/11431"&gt;Belgrade&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Many new projects started in the &lt;a href="https://github.com/datafusion-contrib"&gt;datafusion-contrib&lt;/a&gt; organization, including
   &lt;a href="https://github.com/datafusion-contrib/datafusion-table-providers"&gt;Table Providers&lt;/a&gt;, &lt;a href="https://github.com/datafusion-contrib/datafusion-sqlancer"&gt;SQLancer&lt;/a&gt;, &lt;a href="https://github.com/datafusion-contrib/datafusion-functions-variant"&gt;Open Variant&lt;/a&gt;, &lt;a href="https://github.com/datafusion-contrib/datafusion-functions-json"&gt;JSON&lt;/a&gt;, and &lt;a href="https://github.com/datafusion-contrib/datafusion-orc"&gt;ORC&lt;/a&gt;.  &lt;/li&gt;
&lt;/ol&gt;
&lt;!--
$ git log --pretty=oneline 34.0.0..40.0.0 . | wc -l
     1453 (up from 1009)

$ git shortlog -sn 34.0.0..40.0.0 . | wc -l
      182 (up from 124)

https://crates.io/crates/datafusion/34.0.0
DataFusion 34 released Dec 17, 2023

https://crates.io/crates/datafusion/40.0.0
DataFusion 40 released July 12, 2024

Issues created in this time: 321 open, 781 closed (up from 214 open, 437 closed)
https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+created%3A2023-12-17..2024-07-12

Issues closed: 911 (up from 517)
https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+closed%3A2023-12-17..2024-07-12

PRs merged in this time 1490 (up from 908)
https://github.com/apache/arrow-datafusion/pulls?q=is%3Apr+merged%3A2023-12-17..2024-07-12

--&gt;
&lt;p&gt;In addition, DataFusion has been appearing publicly more and more, both online and offline. Here are some highlights:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="https://dl.acm.org/doi/10.1145/3626246.3653368"&gt;Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine&lt;/a&gt;, was presented in &lt;a href="https://2024.sigmod.org/"&gt;SIGMOD '24&lt;/a&gt;, one of the major database conferences&lt;/li&gt;
&lt;li&gt;As part of the trend to define "the POSIX of databases" in &lt;a href="https://db.cs.cmu.edu/papers/2024/whatgoesaround-sigmodrec2024.pdf"&gt;"What Goes Around Comes Around... And Around..."&lt;/a&gt; from Andy Pavlo and Mike Stonebraker&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.cpard.xyz/posts/datafusion/"&gt;"Why you should keep an eye on Apache DataFusion and its community"&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.tisonkun.org/2024/07/15/datafusion-meetup-san-francisco/"&gt;Apache DataFusion offline meetup in the Bay Area&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="improved-performance"&gt;Improved Performance 🚀&lt;a class="headerlink" href="#improved-performance" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Performance is a key feature of DataFusion, and the community continues to work
to keep DataFusion state of the art in this area. One major area DataFusion
improved is the time it takes to convert a SQL query into a plan that can be
executed. Planning is now almost 2x faster for TPC-DS and TPC-H queries, and
over 10x faster for some queries with many columns.&lt;/p&gt;
&lt;p&gt;Here is a chart showing the improvement due to the concerted effort of many
contributors including &lt;a href="https://github.com/jackwener"&gt;@jackwener&lt;/a&gt;, &lt;a href="https://github.com/alamb"&gt;@alamb&lt;/a&gt;, &lt;a href="https://github.com/Lordworms"&gt;@Lordworms&lt;/a&gt;, &lt;a href="https://github.com/dmitrybugakov"&gt;@dmitrybugakov&lt;/a&gt;,
&lt;a href="https://github.com/appletreeisyellow"&gt;@appletreeisyellow&lt;/a&gt;, &lt;a href="https://github.com/ClSlaid"&gt;@ClSlaid&lt;/a&gt;, &lt;a href="https://github.com/rohitrastogi"&gt;@rohitrastogi&lt;/a&gt;, &lt;a href="https://github.com/emgeee"&gt;@emgeee&lt;/a&gt;, &lt;a href="https://github.com/kevinmingtarja"&gt;@kevinmingtarja&lt;/a&gt;,
and &lt;a href="https://github.com/peter-toth"&gt;@peter-toth&lt;/a&gt; over several months (see &lt;a href="https://github.com/apache/arrow-datafusion/issues/8045"&gt;ticket&lt;/a&gt; for more details)&lt;/p&gt;
&lt;p&gt;&lt;img src="/blog/images/datafusion-40.0.0/improved-planning-time.png" width="700"/&gt;&lt;/p&gt;
&lt;p&gt;DataFusion is now up to 40% faster for queries that &lt;code&gt;GROUP BY&lt;/code&gt; a single string
or binary column due to a &lt;a href="https://github.com/apache/datafusion/pull/8827"&gt;specialization for single
Utf8/LargeUtf8/Binary/LargeBinary&lt;/a&gt;. We are working on improving performance when
there are multiple variable-length columns in the &lt;code&gt;GROUP BY&lt;/code&gt; clause.&lt;/p&gt;
&lt;p&gt;We are also in the final phases of &lt;a href="https://github.com/apache/datafusion/issues/10918"&gt;integrating&lt;/a&gt; the new &lt;a href="https://docs.rs/arrow/latest/arrow/array/struct.GenericByteViewArray.html"&gt;Arrow StringView&lt;/a&gt;
which significantly improves performance for workloads that scan, filter and
group by variable length string and binary data. We expect the improvement to be
especially pronounced for Parquet files due to &lt;a href="https://github.com/apache/arrow-rs/issues/5530"&gt;upstream work in the parquet
reader&lt;/a&gt;. Kudos to &lt;a href="https://github.com/XiangpengHong"&gt;@XiangpengHong&lt;/a&gt;, &lt;a href="https://github.com/AriesDevil"&gt;@AriesDevil&lt;/a&gt;, &lt;a href="https://github.com/PsiACE"&gt;@PsiACE&lt;/a&gt;, &lt;a href="https://github.com/Weijun-H"&gt;@Weijun-H&lt;/a&gt;,
&lt;a href="https://github.com/a10y"&gt;@a10y&lt;/a&gt;, and &lt;a href="https://github.com/RinChanNOWWW"&gt;@RinChanNOWWW&lt;/a&gt; for driving this project.&lt;/p&gt;
&lt;h2 id="improved-quality"&gt;Improved Quality 📋&lt;a class="headerlink" href="#improved-quality" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;DataFusion continues to improve overall in quality. In addition to ongoing bug
fixes, one of the most exciting improvements is the addition of a new &lt;a href="https://github.com/datafusion-contrib/datafusion-sqlancer"&gt;SQLancer&lt;/a&gt;
based &lt;a href="https://github.com/apache/datafusion/issues/11030"&gt;DataFusion Fuzzing&lt;/a&gt; suite thanks to &lt;a href="https://github.com/2010YOUY01"&gt;@2010YOUY01&lt;/a&gt; that has already found
several bugs and thanks to &lt;a href="https://github.com/jonahgao"&gt;@jonahgao&lt;/a&gt;, &lt;a href="https://github.com/tshauck"&gt;@tshauck&lt;/a&gt;, &lt;a href="https://github.com/xinlifoobar"&gt;@xinlifoobar&lt;/a&gt;,
&lt;a href="https://github.com/LorrensP-2158466"&gt;@LorrensP-2158466&lt;/a&gt; for fixing them so fast.&lt;/p&gt;
&lt;h2 id="improved-documentation"&gt;Improved Documentation 📚&lt;a class="headerlink" href="#improved-documentation" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We continue to improve the documentation to make it easier to get started using DataFusion with
the &lt;a href="https://datafusion.apache.org/library-user-guide/index.html"&gt;Library Users Guide&lt;/a&gt;, &lt;a href="https://docs.rs/datafusion/latest/datafusion/index.html"&gt;API documentation&lt;/a&gt;, and &lt;a href="https://github.com/apache/datafusion/tree/main/datafusion-examples"&gt;Examples&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Some notable new examples include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/sql_analysis.rs"&gt;sql_analysis.rs&lt;/a&gt; to analyse SQL queries with DataFusion structures (thanks &lt;a href="https://github.com/LorrensP-2158466"&gt;@LorrensP-2158466&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/function_factory.rs"&gt;function_factory.rs&lt;/a&gt; to create custom functions via SQL (thanks &lt;a href="https://github.com/milenkovicm"&gt;@milenkovicm&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/plan_to_sql.rs"&gt;plan_to_sql.rs&lt;/a&gt; to generate SQL from DataFusion Expr and LogicalPlan (thanks &lt;a href="https://github.com/edmondop"&gt;@edmondop&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/parquet_index.rs"&gt;parquet_index.rs&lt;/a&gt; and &lt;a href="https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_parquet_index.rs"&gt;advanced_parquet_index.rs&lt;/a&gt; for parquet indexing, described more below (thanks &lt;a href="https://github.com/alamb"&gt;@alamb&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="new-features"&gt;New Features ✨&lt;a class="headerlink" href="#new-features" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;There are too many new features in the last 6 months to list them all, but here
are some highlights:&lt;/p&gt;
&lt;h1 id="sql"&gt;SQL&lt;a class="headerlink" href="#sql" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;Support for &lt;code&gt;UNNEST&lt;/code&gt; (thanks &lt;a href="https://github.com/duongcongtoai"&gt;@duongcongtoai&lt;/a&gt;, &lt;a href="https://github.com/JasonLi-cn"&gt;@JasonLi-cn&lt;/a&gt; and &lt;a href="https://github.com/jayzhan211"&gt;@jayzhan211&lt;/a&gt;) &lt;/li&gt;
&lt;li&gt;Support for &lt;a href="https://github.com/apache/datafusion/issues/462"&gt;Recursive CTEs&lt;/a&gt; (thanks &lt;a href="https://github.com/jonahgao"&gt;@jonahgao&lt;/a&gt; and &lt;a href="https://github.com/matthewgapp"&gt;@matthewgapp&lt;/a&gt;) &lt;/li&gt;
&lt;li&gt;Support for &lt;code&gt;CREATE FUNCTION&lt;/code&gt; (see below) &lt;/li&gt;
&lt;li&gt;Many new SQL functions&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;DataFusion now has much improved support for structured types such &lt;code&gt;STRUCT&lt;/code&gt;,
&lt;code&gt;LIST&lt;/code&gt;/&lt;code&gt;ARRAY&lt;/code&gt; and &lt;code&gt;MAP&lt;/code&gt;. For example, you can now create &lt;code&gt;STRUCT&lt;/code&gt; literals 
in SQL like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;&amp;gt; select {'foo': {'bar': 2}};
+--------------------------------------------------------------+
| named_struct(Utf8("foo"),named_struct(Utf8("bar"),Int64(2))) |
+--------------------------------------------------------------+
| {foo: {bar: 2}}                                              |
+--------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.002 seconds.
&lt;/code&gt;&lt;/pre&gt;
&lt;h1 id="sql-unparser-sql-formatter"&gt;SQL Unparser (SQL Formatter)&lt;a class="headerlink" href="#sql-unparser-sql-formatter" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;DataFusion now supports converting &lt;code&gt;Expr&lt;/code&gt;s and &lt;code&gt;LogicalPlan&lt;/code&gt;s BACK to SQL text.
This can be useful in query federation to push predicates down into other
systems that only accept SQL, and for building systems that generate SQL.&lt;/p&gt;
&lt;p&gt;For example, you can now convert a logical expression back to SQL text:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;// Form a logical expression that represents the SQL "a &amp;lt; 5 OR a = 8"
let expr = col("a").lt(lit(5)).or(col("a").eq(lit(8)));
// convert the expression back to SQL text
let sql = expr_to_sql(&amp;amp;expr)?.to_string();
assert_eq!(sql, "a &amp;lt; 5 OR a = 8");
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can also do more complex things, such as parsing SQL, modifying the plan, and converting
it back to SQL:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;let df = ctx
  // Use SQL to read some data from the parquet file
  .sql("SELECT int_col, double_col, CAST(date_string_col as VARCHAR) FROM alltypes_plain")
  .await?;
// Programmatically add new filters `id &amp;gt; 1 and tinyint_col &amp;lt; double_col`
let df = df.filter(col("id").gt(lit(1)).and(col("tinyint_col").lt(col("double_col"))))?;
// Convert the new logical plan back to SQL
let sql = plan_to_sql(df.logical_plan())?.to_string();
assert_eq!(
    sql,
    "SELECT alltypes_plain.int_col, alltypes_plain.double_col, CAST(alltypes_plain.date_string_col AS VARCHAR) \
     FROM alltypes_plain WHERE ((alltypes_plain.id &amp;gt; 1) AND (alltypes_plain.tinyint_col &amp;lt; alltypes_plain.double_col))"
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;See the &lt;a href="https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/plan_to_sql.rs"&gt;Plan to SQL example&lt;/a&gt; or the APIs &lt;a href="https://docs.rs/datafusion/latest/datafusion/sql/unparser/fn.expr_to_sql.html"&gt;expr_to_sql&lt;/a&gt; and &lt;a href="https://docs.rs/datafusion/latest/datafusion/sql/unparser/fn.plan_to_sql.html"&gt;plan_to_sql&lt;/a&gt; for more details.&lt;/p&gt;
&lt;h1 id="low-level-apis-for-fast-parquet-access-indexing"&gt;Low Level APIs for Fast Parquet Access (indexing)&lt;a class="headerlink" href="#low-level-apis-for-fast-parquet-access-indexing" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;As Parquet files grow ever more prevalent, supporting efficient access to files
stored remotely on object storage is increasingly important. Part of doing this efficiently
is minimizing the number of object store requests made by caching metadata and
skipping over parts of the file that are not needed (e.g. via an index).&lt;/p&gt;
&lt;p&gt;DataFusion's Parquet reader has long internally supported &lt;a href="https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/"&gt;advanced predicate
pushdown&lt;/a&gt; by reading the parquet metadata from the file footer and pruning based
on row group and data page statistics. DataFusion now also supports users
supplying their own low-level pruning information via the &lt;code&gt;ParquetAccessPlan&lt;/code&gt;
API.&lt;/p&gt;
&lt;p&gt;This API can be used along with index information to selectively skip decoding
parts of the file. For example, Spice AI used this feature to add &lt;a href="https://github.com/spiceai/spiceai/pull/1891"&gt;efficient
support&lt;/a&gt; for reading from DeltaLake tables and handling &lt;a href="https://docs.delta.io/latest/delta-deletion-vectors.html"&gt;deletion vectors&lt;/a&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-text"&gt;        ┌───────────────────────┐   If the RowSelection does not include any
        │          ...          │   rows from a particular Data Page, that
        │                       │   Data Page is not fetched or decoded.
        │ ┌───────────────────┐ │   Note this requires a PageIndex
        │ │     ┌──────────┐  │ │
Row     │ │     │DataPage 0│  │ │                 ┌────────────────────┐
Groups  │ │     └──────────┘  │ │                 │                    │
        │ │     ┌──────────┐  │ │                 │    ParquetExec     │
        │ │ ... │DataPage 1│ ◀┼ ┼ ─ ─ ─           │  (Parquet Reader)  │
        │ │     └──────────┘  │ │      └ ─ ─ ─ ─ ─│                    │
        │ │     ┌──────────┐  │ │                 │ ╔═══════════════╗  │
        │ │     │DataPage 2│  │ │ If only rows    │ ║ParquetMetadata║  │
        │ │     └──────────┘  │ │ from DataPage 1 │ ╚═══════════════╝  │
        │ └───────────────────┘ │ are selected,   └────────────────────┘
        │                       │ only DataPage 1
        │          ...          │ is fetched and
        │                       │ decoded
        │ ╔═══════════════════╗ │
        │ ║  Thrift metadata  ║ │
        │ ╚═══════════════════╝ │
        └───────────────────────┘
         Parquet File
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;See the &lt;a href="https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/parquet_index.rs"&gt;parquet_index.rs&lt;/a&gt; and &lt;a href="https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_parquet_index.rs"&gt;advanced_parquet_index.rs&lt;/a&gt; examples for more details. &lt;/p&gt;
&lt;p&gt;Thanks to &lt;a href="https://github.com/alamb"&gt;@alamb&lt;/a&gt; and &lt;a href="https://github.com/Ted-Jiang"&gt;@Ted-Jiang&lt;/a&gt; for this feature.  &lt;/p&gt;
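<p>An access plan can be thought of as a per-row-group decision list that an external index computes before the scan begins. A dependency-free sketch of that shape (all types here are hypothetical models; DataFusion's actual <code>ParquetAccessPlan</code> API differs in detail):</p>

```rust
// Hypothetical model of a Parquet access plan: for each row group, the
// index decides whether to scan it fully, skip it entirely, or decode only
// selected row ranges. This mirrors the shape of DataFusion's
// ParquetAccessPlan but is not the real API.
#[derive(Debug, PartialEq)]
#[allow(dead_code)]
enum RowGroupAccess {
    Scan,
    Skip,
    Selection(Vec<std::ops::Range<usize>>), // row ranges to decode
}

// Prune any row group whose [min, max] statistics cannot contain `value`.
fn plan_from_index(min_max: &[(i64, i64)], value: i64) -> Vec<RowGroupAccess> {
    min_max
        .iter()
        .map(|&(min, max)| {
            if value < min || value > max {
                RowGroupAccess::Skip
            } else {
                RowGroupAccess::Scan
            }
        })
        .collect()
}

fn main() {
    // Three row groups with min/max statistics for some column.
    let stats = [(0, 9), (10, 19), (20, 29)];
    let plan = plan_from_index(&stats, 15);
    assert_eq!(
        plan,
        vec![RowGroupAccess::Skip, RowGroupAccess::Scan, RowGroupAccess::Skip]
    );
}
```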
&lt;h2 id="building-systems-is-easier-with-datafusion"&gt;Building Systems is Easier with DataFusion 🛠️&lt;a class="headerlink" href="#building-systems-is-easier-with-datafusion" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;In addition to many incremental API improvements, there are several new APIs that make
it easier to build systems on top of DataFusion:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Faster and easier to use &lt;a href="https://docs.rs/datafusion/latest/datafusion/common/tree_node/trait.TreeNode.html#overview"&gt;TreeNode API&lt;/a&gt; for traversing and manipulating plans and expressions.&lt;/li&gt;
&lt;li&gt;All functions now use the same &lt;a href="https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html"&gt;Scalar User Defined Function API&lt;/a&gt;, making it easier to customize
  DataFusion's behavior without sacrificing performance. See &lt;a href="https://github.com/apache/arrow-datafusion/issues/8045"&gt;ticket&lt;/a&gt; for more details.&lt;/li&gt;
&lt;li&gt;DataFusion can now be compiled to &lt;a href="https://github.com/apache/datafusion/discussions/9834"&gt;WASM&lt;/a&gt;. &lt;/li&gt;
&lt;/ul&gt;
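<p>The core of the TreeNode API is a bottom-up transform that rewrites children before parents. A minimal self-contained analogue of that operation (the real trait additionally reports whether anything changed and offers top-down and pruning variants):</p>

```rust
// Minimal analogue of a TreeNode-style bottom-up rewrite: recurse into the
// children first, then apply the closure to the node itself. DataFusion's
// real TreeNode trait is richer (Transformed results, recursion control).
#[derive(Debug, PartialEq, Clone)]
enum Expr {
    Literal(i64),
    Add(Box<Expr>, Box<Expr>),
}

fn transform_up(expr: Expr, f: &dyn Fn(Expr) -> Expr) -> Expr {
    let rewritten = match expr {
        Expr::Add(l, r) => Expr::Add(
            Box::new(transform_up(*l, f)),
            Box::new(transform_up(*r, f)),
        ),
        leaf => leaf,
    };
    f(rewritten)
}

fn main() {
    // Constant-fold Add(Literal, Literal) nodes: (1 + 2) + 3 becomes 6,
    // because the inner Add is folded before the outer one is visited.
    let expr = Expr::Add(
        Box::new(Expr::Add(
            Box::new(Expr::Literal(1)),
            Box::new(Expr::Literal(2)),
        )),
        Box::new(Expr::Literal(3)),
    );
    let folded = transform_up(expr, &|e| match e {
        Expr::Add(l, r) => match (*l, *r) {
            (Expr::Literal(a), Expr::Literal(b)) => Expr::Literal(a + b),
            (l, r) => Expr::Add(Box::new(l), Box::new(r)),
        },
        other => other,
    });
    assert_eq!(folded, Expr::Literal(6));
}
```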
&lt;h1 id="user-defined-sql-parsing-extensions"&gt;User Defined SQL Parsing Extensions&lt;a class="headerlink" href="#user-defined-sql-parsing-extensions" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;As of DataFusion 40.0.0, you can use the &lt;code&gt;ExprPlanner&lt;/code&gt; trait to extend
DataFusion's SQL planner to support custom operators or syntax.&lt;/p&gt;
&lt;p&gt;For example the &lt;a href="https://github.com/datafusion-contrib/datafusion-functions-json"&gt;datafusion-functions-json&lt;/a&gt; project uses this API to support
JSON operators in SQL queries. It provides a custom implementation for
planning JSON operators such as &lt;code&gt;-&amp;gt;&lt;/code&gt; and &lt;code&gt;-&amp;gt;&amp;gt;&lt;/code&gt; with code like:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;struct MyCustomPlanner;

impl ExprPlanner for MyCustomPlanner {
    // Provide custom implementation for planning binary operators
    // such as `-&amp;gt;` and `-&amp;gt;&amp;gt;`
    fn plan_binary_op(
        &amp;amp;self,
        expr: RawBinaryExpr,
        _schema: &amp;amp;DFSchema,
    ) -&amp;gt; Result&amp;lt;PlannerResult&amp;lt;RawBinaryExpr&amp;gt;&amp;gt; {
        match &amp;amp;expr.op {
           BinaryOperator::Arrow =&amp;gt; { /* plan -&amp;gt; operator */ }
           BinaryOperator::LongArrow =&amp;gt; { /* plan -&amp;gt;&amp;gt; operator */ }
           _ =&amp;gt; { /* fall back to DataFusion's default planning */ }
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;
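&lt;p&gt;Stripped of DataFusion's types, the dispatch pattern above reduces to matching on an operator and returning either a planned replacement or the original input so the default planner can handle it. Here is a runnable sketch with simplified stand-in types (the names are illustrative, not DataFusion's API):&lt;/p&gt;

```rust
// Simplified stand-ins for DataFusion's RawBinaryExpr / PlannerResult.
#[derive(Debug, PartialEq)]
enum Op {
    Arrow,     // ->
    LongArrow, // ->>
    Plus,      // an operator this planner does not handle
}

#[derive(Debug, PartialEq)]
enum Planned {
    // The planner produced a function call, e.g. json_get(lhs, rhs).
    Rewritten(String),
    // The planner declined; fall through to the default planner.
    Original(Op),
}

fn plan_binary_op(op: Op, lhs: &str, rhs: &str) -> Planned {
    match op {
        Op::Arrow => Planned::Rewritten(format!("json_get({lhs}, {rhs})")),
        Op::LongArrow => Planned::Rewritten(format!("json_get_str({lhs}, {rhs})")),
        // Any operator we do not recognize is returned unchanged.
        other => Planned::Original(other),
    }
}

fn main() {
    assert_eq!(
        plan_binary_op(Op::Arrow, "col", "'key'"),
        Planned::Rewritten("json_get(col, 'key')".to_string())
    );
    assert_eq!(
        plan_binary_op(Op::LongArrow, "col", "'key'"),
        Planned::Rewritten("json_get_str(col, 'key')".to_string())
    );
    assert_eq!(plan_binary_op(Op::Plus, "a", "b"), Planned::Original(Op::Plus));
    println!("ok");
}
```

&lt;p&gt;Declining unrecognized operators is what lets several planners coexist: each one handles only the syntax it understands.&lt;/p&gt;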
&lt;p&gt;Thanks to &lt;a href="https://github.com/samuelcolvin"&gt;@samuelcolvin&lt;/a&gt;, &lt;a href="https://github.com/jayzhan211"&gt;@jayzhan211&lt;/a&gt; and &lt;a href="https://github.com/dharanad"&gt;@dharanad&lt;/a&gt; for helping make this
feature happen.&lt;/p&gt;
&lt;h1 id="pluggable-support-for-create-function"&gt;Pluggable Support for &lt;code&gt;CREATE FUNCTION&lt;/code&gt;&lt;a class="headerlink" href="#pluggable-support-for-create-function" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;DataFusion's new [&lt;code&gt;FunctionFactory&lt;/code&gt;] API let's users provide a handler for
&lt;code&gt;CREATE FUNCTION&lt;/code&gt; SQL statements. This feature lets you build systems that
support defining functions in SQL such as&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;-- SQL based functions
CREATE FUNCTION my_func(DOUBLE, DOUBLE) RETURNS DOUBLE
    RETURN $1 + $2
;

-- ML Models
CREATE FUNCTION iris(FLOAT[]) RETURNS FLOAT[] 
LANGUAGE TORCH AS 'models:/iris@champion';

-- WebAssembly
CREATE FUNCTION func(FLOAT[]) RETURNS FLOAT[] 
LANGUAGE WASM AS 'func.wasm'
&lt;/code&gt;&lt;/pre&gt;
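&lt;p&gt;For the SQL-bodied case, one strategy a &lt;code&gt;FunctionFactory&lt;/code&gt; handler can take is macro-style expansion: substitute each call site's arguments for the positional parameters (&lt;code&gt;$1&lt;/code&gt;, &lt;code&gt;$2&lt;/code&gt;, ...) in the declared body. A simplified, hypothetical sketch of that substitution step (not DataFusion's actual implementation):&lt;/p&gt;

```rust
// Hypothetical sketch: expand a SQL-bodied "macro" function by substituting
// call-site arguments for positional parameters in the declared body.
fn expand_sql_macro(body: &str, args: &[&str]) -> String {
    let mut out = body.to_string();
    // Replace higher-numbered parameters first so "$12" is not clobbered
    // by a "$1" substitution.
    for (i, arg) in args.iter().enumerate().rev() {
        out = out.replace(&format!("${}", i + 1), arg);
    }
    out
}

fn main() {
    // A function declared as RETURN $1 + $2, invoked as my_func(price, tax):
    assert_eq!(expand_sql_macro("$1 + $2", &["price", "tax"]), "price + tax");
    println!("ok");
}
```

&lt;p&gt;A production implementation would rewrite the parsed expression tree rather than the SQL text, but the expansion idea is the same.&lt;/p&gt;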
&lt;p&gt;Huge thanks to &lt;a href="https://github.com/milenkovicm"&gt;@milenkovicm&lt;/a&gt; for this feature. There is an example of how to
make macro-like functions in &lt;a href="https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/function_factory.rs"&gt;function_factory.rs&lt;/a&gt;. It would be
great if &lt;a href="https://github.com/apache/datafusion/issues/9326"&gt;someone made a demo&lt;/a&gt; showing how to create WASM functions this way 🎣.&lt;/p&gt;
&lt;h2 id="looking-ahead-the-next-six-months"&gt;Looking Ahead: The Next Six Months 🔭&lt;a class="headerlink" href="#looking-ahead-the-next-six-months" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The community has been &lt;a href="https://github.com/apache/datafusion/issues/11442"&gt;discussing what we will work on in the next six months&lt;/a&gt;.
Some major initiatives from that discussion are:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Performance&lt;/em&gt;: Improve the speed of &lt;a href="https://github.com/apache/arrow-datafusion/issues/7000"&gt;aggregating "high cardinality"&lt;/a&gt;
  data when there are many (e.g. millions of) distinct groups, as well as additional
  ideas to improve Parquet performance.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Modularity&lt;/em&gt;: Make DataFusion even more modular by completely unifying
   built-in and user-defined &lt;a href="https://github.com/apache/datafusion/issues/8708"&gt;aggregate functions&lt;/a&gt; and &lt;a href="https://github.com/apache/datafusion/issues/8709"&gt;window functions&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;LogicalTypes&lt;/em&gt;: &lt;a href="https://github.com/apache/datafusion/issues/11513"&gt;Introduce Logical Types&lt;/a&gt; to make it easier to use
   different encodings like &lt;code&gt;StringView&lt;/code&gt;, &lt;code&gt;RunEnd&lt;/code&gt; and &lt;code&gt;Dictionary&lt;/code&gt; arrays as well
   as user defined types. Thanks &lt;a href="https://github.com/notfilippo"&gt;@notfilippo&lt;/a&gt; for driving this. &lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Improved Documentation&lt;/em&gt;: Write blog posts and videos explaining
   how to use DataFusion for real-world use cases.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Testing&lt;/em&gt;: Improve CI infrastructure and test coverage, more fuzz
   testing, and better functional and performance regression testing.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h1 id="how-to-get-involved"&gt;How to Get Involved&lt;a class="headerlink" href="#how-to-get-involved" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;DataFusion is not a project built or driven by a single person, company, or
foundation. Rather, our community of users and contributors work together to
build a shared technology that none of us could have built alone.&lt;/p&gt;
&lt;p&gt;If you are interested in joining us we would love to have you. You can try out
DataFusion on some of your own data and projects and let us know how it goes,
contribute suggestions, documentation, bug reports, or a PR with documentation,
tests or code. A list of open issues suitable for beginners is &lt;a href="https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22"&gt;here&lt;/a&gt; and you
can find how to reach us on the &lt;a href="https://datafusion.apache.org/contributor-guide/communication.html"&gt;communication doc&lt;/a&gt;.&lt;/p&gt;</content><category term="blog"/></entry><entry><title>Apache DataFusion Comet 0.1.0 Release</title><link href="https://datafusion.apache.org/blog/2024/07/20/datafusion-comet-0.1.0" rel="alternate"/><published>2024-07-20T00:00:00+00:00</published><updated>2024-07-20T00:00:00+00:00</updated><author><name>pmc</name></author><id>tag:datafusion.apache.org,2024-07-20:/blog/2024/07/20/datafusion-comet-0.1.0</id><summary type="html">&lt;!--
--&gt;

&lt;p&gt;The Apache DataFusion PMC is pleased to announce the first official source release of the &lt;a href="https://datafusion.apache.org/comet/"&gt;Comet&lt;/a&gt; subproject.&lt;/p&gt;
&lt;p&gt;Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
improved performance and efficiency without requiring any code changes.&lt;/p&gt;
&lt;p&gt;Comet runs on commodity hardware and aims …&lt;/p&gt;</summary><content type="html">&lt;!--
--&gt;

&lt;p&gt;The Apache DataFusion PMC is pleased to announce the first official source release of the &lt;a href="https://datafusion.apache.org/comet/"&gt;Comet&lt;/a&gt; subproject.&lt;/p&gt;
&lt;p&gt;Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
improved performance and efficiency without requiring any code changes.&lt;/p&gt;
&lt;p&gt;Comet runs on commodity hardware and aims to provide 100% compatibility with Apache Spark. Any operators or
expressions that are not fully compatible will fall back to Spark unless explicitly enabled by the user. Refer
to the &lt;a href="https://datafusion.apache.org/comet/user-guide/compatibility.html"&gt;compatibility guide&lt;/a&gt; for more information.&lt;/p&gt;
&lt;p&gt;This release covers five months of development work since the project was &lt;a href="https://datafusion.apache.org/blog/2024/03/06/comet-donation/"&gt;donated&lt;/a&gt; to the Apache DataFusion
project and is the result of merging 343 PRs from 41 contributors. See the &lt;a href="https://github.com/apache/datafusion-comet/blob/main/dev/changelog/0.1.0.md"&gt;change log&lt;/a&gt; for more information.&lt;/p&gt;
&lt;p&gt;This first release supports 15 &lt;a href="https://datafusion.apache.org/comet/user-guide/datatypes.html#"&gt;data types&lt;/a&gt;, 13 &lt;a href="https://datafusion.apache.org/comet/user-guide/operators.html#"&gt;operators&lt;/a&gt;, and 106 &lt;a href="https://datafusion.apache.org/comet/user-guide/expressions.html#"&gt;expressions&lt;/a&gt;. Comet is compatible with Apache
Spark versions 3.3, 3.4, and 3.5. There is also experimental support for preview versions of Spark 4.0.&lt;/p&gt;
&lt;h2 id="project-status"&gt;Project Status&lt;a class="headerlink" href="#project-status" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The project's recent focus has been on fixing correctness and stability issues and implementing additional
native operators and expressions so that a broader range of queries can be executed natively.&lt;/p&gt;
&lt;p&gt;Here are some of the highlights since the project was donated:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Implemented native support for:&lt;ul&gt;
&lt;li&gt;SortMergeJoin&lt;/li&gt;
&lt;li&gt;HashJoin&lt;/li&gt;
&lt;li&gt;BroadcastHashJoin&lt;/li&gt;
&lt;li&gt;Columnar Shuffle&lt;/li&gt;
&lt;li&gt;More aggregate expressions&lt;/li&gt;
&lt;li&gt;Window aggregates&lt;/li&gt;
&lt;li&gt;Many Spark-compatible CAST expressions&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;Implemented a simple Spark Fuzz Testing utility to find correctness issues&lt;/li&gt;
&lt;li&gt;Published a &lt;a href="https://datafusion.apache.org/comet/user-guide/overview.html"&gt;User Guide&lt;/a&gt; and &lt;a href="https://datafusion.apache.org/comet/contributor-guide/contributing.html"&gt;Contributors Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Created a &lt;a href="https://github.com/apache/datafusion-benchmarks"&gt;DataFusion Benchmarks&lt;/a&gt; repository with scripts and documentation for running benchmarks derived
  from TPC-H and TPC-DS with DataFusion and Comet&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="current-performance"&gt;Current Performance&lt;a class="headerlink" href="#current-performance" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Comet already delivers a modest performance speedup for many queries, enabling faster data processing and
shorter time-to-insights.&lt;/p&gt;
&lt;p&gt;We use benchmarks derived from the industry standard TPC-H and TPC-DS benchmarks for tracking progress with
performance. The following chart shows the time it takes to run the 22 TPC-H queries against 100 GB of data in
Parquet format using a single executor with eight cores. See the &lt;a href="https://datafusion.apache.org/comet/contributor-guide/benchmarking.html"&gt;Comet Benchmarking Guide&lt;/a&gt;
for details of the environment used for these benchmarks.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Chart showing TPC-H benchmark results for Comet 0.1.0" class="img-fluid" src="/blog/images/comet-0.1.0/tpch_allqueries.png" width="100%"/&gt;&lt;/p&gt;
&lt;p&gt;Comet reduces the overall execution time from 626 seconds to 407 seconds, a 54% speedup (1.54x faster).&lt;/p&gt;
&lt;p&gt;Running the same queries with DataFusion standalone using the same number of cores results in a 3.9x speedup
compared to Spark. Although this isn’t a fair comparison (for example, DataFusion does not implement shuffle and does not match Spark
semantics in some cases), it does give some idea of the potential future performance of
Comet. Comet aims to provide a 2x-4x speedup for a wide range of queries once more operators and expressions
can run natively.&lt;/p&gt;
&lt;p&gt;The following chart shows how much Comet currently accelerates each query from the benchmark.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Chart showing TPC-H benchmark results for Comet 0.1.0" class="img-fluid" src="/blog/images/comet-0.1.0/tpch_queries_speedup.png" width="100%"/&gt;&lt;/p&gt;
&lt;p&gt;These benchmarks can be reproduced in any environment using the documentation in the &lt;a href="https://datafusion.apache.org/comet/contributor-guide/benchmarking.html"&gt;Comet Benchmarking Guide&lt;/a&gt;. We
encourage you to run these benchmarks in your environment or, even better, try Comet out with your existing Spark jobs.&lt;/p&gt;
&lt;h2 id="roadmap"&gt;Roadmap&lt;a class="headerlink" href="#roadmap" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Comet is an open-source project, and contributors are welcome to work on any features they are interested in, but
here are some current focus areas.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Improve Performance &amp;amp; Reliability:&lt;ul&gt;
&lt;li&gt;Implement the remaining features needed so that all TPC-H queries can run entirely natively&lt;/li&gt;
&lt;li&gt;Implement spill support in SortMergeJoin&lt;/li&gt;
&lt;li&gt;Enable columnar shuffle by default&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;Fully support Spark version 4.0.0&lt;/li&gt;
&lt;li&gt;Support more Spark operators and expressions:&lt;ul&gt;
&lt;li&gt;We would like to support many more expressions natively in Comet, and this is a great place to start
    contributing. The contributors' guide has a section covering &lt;a href="https://datafusion.apache.org/comet/contributor-guide/adding_a_new_expression.html"&gt;adding support for new expressions&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Move more Spark expressions into the &lt;a href="https://crates.io/crates/datafusion-comet-spark-expr"&gt;datafusion-comet-spark-expr&lt;/a&gt; crate. Although the main focus of the Comet
  project is to provide an accelerator for Apache Spark, we also publish a standalone crate containing
  Spark-compatible expressions that can be used by any project using DataFusion, without adding any dependencies
  on the JVM or Apache Spark.&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;Release Process &amp;amp; Documentation:&lt;ul&gt;
&lt;li&gt;Implement a binary release process so that we can publish JAR files to Maven for all supported platforms&lt;/li&gt;
&lt;li&gt;Add documentation for running Spark and Comet in Kubernetes, and add example Dockerfiles.&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="getting-involved"&gt;Getting Involved&lt;a class="headerlink" href="#getting-involved" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The Comet project welcomes new contributors. We use the same &lt;a href="https://datafusion.apache.org/contributor-guide/communication.html#slack-and-discord"&gt;Slack and Discord&lt;/a&gt; channels as the main DataFusion
project, and there is a Comet community video call held every four weeks on Wednesdays at 11:30 a.m. Eastern Time,
which is 16:30 UTC during Eastern Standard Time and 15:30 UTC during Eastern Daylight Time. See the
&lt;a href="https://docs.google.com/document/d/1NBpkIAuU7O9h8Br5CbFksDhX-L9TyO9wmGLPMe0Plc8/edit?usp=sharing"&gt;Comet Community Meeting&lt;/a&gt; Google Document for the next scheduled meeting date, the video call link, and
recordings of previous calls.&lt;/p&gt;
&lt;p&gt;The easiest way to get involved is to test Comet with your current Spark jobs and file issues for any bugs or
performance regressions that you find. See the &lt;a href="https://datafusion.apache.org/comet/user-guide/installation.html"&gt;Getting Started&lt;/a&gt; guide for instructions on downloading and installing
Comet.&lt;/p&gt;
&lt;p&gt;There are also many &lt;a href="https://github.com/apache/datafusion-comet/contribute"&gt;good first issues&lt;/a&gt; waiting for contributions.&lt;/p&gt;</content><category term="blog"/></entry><entry><title>Announcing Apache Arrow DataFusion is now Apache DataFusion</title><link href="https://datafusion.apache.org/blog/2024/05/07/datafusion-tlp" rel="alternate"/><published>2024-05-07T00:00:00+00:00</published><updated>2024-05-07T00:00:00+00:00</updated><author><name>pmc</name></author><id>tag:datafusion.apache.org,2024-05-07:/blog/2024/05/07/datafusion-tlp</id><summary type="html">&lt;!--
--&gt;

&lt;h2 id="introduction"&gt;Introduction&lt;a class="headerlink" href="#introduction" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;TLDR; &lt;a href="https://arrow.apache.org/"&gt;Apache Arrow&lt;/a&gt; DataFusion --&amp;gt; &lt;a href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The Arrow PMC and newly created DataFusion PMC are happy to announce that as of
April 16, 2024 the Apache Arrow DataFusion subproject is now a top level
&lt;a href="https://www.apache.org/"&gt;Apache Software Foundation&lt;/a&gt; project.&lt;/p&gt;
&lt;h2 id="background"&gt;Background&lt;a class="headerlink" href="#background" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Apache DataFusion is a fast, extensible query engine for building …&lt;/p&gt;</summary><content type="html">&lt;!--
--&gt;

&lt;h2 id="introduction"&gt;Introduction&lt;a class="headerlink" href="#introduction" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;TLDR; &lt;a href="https://arrow.apache.org/"&gt;Apache Arrow&lt;/a&gt; DataFusion --&amp;gt; &lt;a href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The Arrow PMC and newly created DataFusion PMC are happy to announce that as of
April 16, 2024 the Apache Arrow DataFusion subproject is now a top level
&lt;a href="https://www.apache.org/"&gt;Apache Software Foundation&lt;/a&gt; project.&lt;/p&gt;
&lt;h2 id="background"&gt;Background&lt;a class="headerlink" href="#background" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Apache DataFusion is a fast, extensible query engine for building high-quality
data-centric systems in Rust, using the Apache Arrow in-memory format.&lt;/p&gt;
&lt;p&gt;When DataFusion was &lt;a href="https://arrow.apache.org/blog/2019/02/04/datafusion-donation/"&gt;donated to the Apache Software Foundation&lt;/a&gt; in 2019, the
DataFusion community was not large enough to stand on its own and the Arrow
project agreed to help support it. The community has grown significantly since
2019, benefiting immensely from being part of Arrow and following &lt;a href="https://www.apache.org/theapacheway/"&gt;The Apache
Way&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="why-now"&gt;Why now?&lt;a class="headerlink" href="#why-now" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The community &lt;a href="https://github.com/apache/datafusion/discussions/6475"&gt;discussed graduating to a top level project publicly&lt;/a&gt; for almost
a year, as the project seemed ready to stand on its own and would benefit from
more focused governance. For example, earlier in DataFusion's life many
contributed to both &lt;a href="https://github.com/apache/arrow-rs"&gt;arrow-rs&lt;/a&gt; and DataFusion, but as DataFusion has matured many
contributors, committers and PMC members focused more and more exclusively on
DataFusion.&lt;/p&gt;
&lt;h2 id="looking-forward"&gt;Looking forward&lt;a class="headerlink" href="#looking-forward" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The future looks bright. There are now &lt;a href="https://datafusion.apache.org/user-guide/introduction.html#known-users"&gt;tens of known projects built with
DataFusion&lt;/a&gt;, and that number continues to grow. We recently held our &lt;a href="https://github.com/apache/datafusion/discussions/8522"&gt;first in-person
meetup&lt;/a&gt;, passed &lt;a href="https://github.com/apache/datafusion/stargazers"&gt;5000 stars&lt;/a&gt; on GitHub, &lt;a href="https://github.com/apache/datafusion/issues/8373#issuecomment-2025133714"&gt;wrote a paper that was accepted
at SIGMOD 2024&lt;/a&gt;, and began work on &lt;a href="https://github.com/apache/datafusion-comet"&gt;Comet&lt;/a&gt;, an &lt;a href="https://spark.apache.org/"&gt;Apache Spark&lt;/a&gt; accelerator
&lt;a href="https://arrow.apache.org/blog/2024/03/06/comet-donation/"&gt;initially donated by Apple&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Thank you to everyone in the Arrow community who helped DataFusion grow and
mature over the years, and we look forward to continuing our collaboration as
peer projects. All future blogs and announcements will be posted on the &lt;a href="https://datafusion.apache.org/"&gt;Apache
DataFusion&lt;/a&gt; website.&lt;/p&gt;
&lt;h2 id="get-involved"&gt;Get Involved&lt;a class="headerlink" href="#get-involved" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;If you are interested in joining the community, we would love to have you join
us. Get in touch using the &lt;a href="https://datafusion.apache.org/contributor-guide/communication.html"&gt;Communication Doc&lt;/a&gt; and learn how to get involved in the
&lt;a href="https://datafusion.apache.org/contributor-guide/index.html"&gt;Contributor Guide&lt;/a&gt;. We welcome everyone to try DataFusion on their
own data and projects and let us know how it goes, contribute suggestions,
documentation, bug reports, or a PR with documentation, tests or code.&lt;/p&gt;</content><category term="blog"/></entry><entry><title>Announcing Apache Arrow DataFusion Comet</title><link href="https://datafusion.apache.org/blog/2024/03/06/comet-donation" rel="alternate"/><published>2024-03-06T00:00:00+00:00</published><updated>2024-03-06T00:00:00+00:00</updated><author><name>pmc</name></author><id>tag:datafusion.apache.org,2024-03-06:/blog/2024/03/06/comet-donation</id><summary type="html">&lt;!--
--&gt;

&lt;h1 id="introduction"&gt;Introduction&lt;a class="headerlink" href="#introduction" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;The Apache Arrow PMC is pleased to announce the donation of the &lt;a href="https://github.com/apache/arrow-datafusion-comet"&gt;Comet project&lt;/a&gt;,
a native Spark SQL Accelerator built on &lt;a href="https://arrow.apache.org/datafusion"&gt;Apache Arrow DataFusion&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Comet is an Apache Spark plugin that uses Apache Arrow DataFusion to
accelerate Spark workloads. It is designed as a drop-in
replacement for Spark's JVM …&lt;/p&gt;</summary><content type="html">&lt;!--
--&gt;

&lt;h1 id="introduction"&gt;Introduction&lt;a class="headerlink" href="#introduction" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;The Apache Arrow PMC is pleased to announce the donation of the &lt;a href="https://github.com/apache/arrow-datafusion-comet"&gt;Comet project&lt;/a&gt;,
a native Spark SQL Accelerator built on &lt;a href="https://arrow.apache.org/datafusion"&gt;Apache Arrow DataFusion&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Comet is an Apache Spark plugin that uses Apache Arrow DataFusion to
accelerate Spark workloads. It is designed as a drop-in
replacement for Spark's JVM based SQL execution engine and offers significant
performance improvements for some workloads as shown below.&lt;/p&gt;
&lt;figure class="text-center"&gt;
&lt;img alt="Fig 1: Adaptive Arrow schema architecture overview." class="img-fluid" src="/blog/images/datafusion-comet/comet-architecture.png"/&gt;
&lt;figcaption&gt;
&lt;b&gt;Figure 1&lt;/b&gt;: With Comet, users interact with the same Spark ecosystem, tools
    and APIs such as Spark SQL. Queries still run through Spark's query optimizer and planner. 
    However, the execution is delegated to Comet,
    which is significantly faster and more resource efficient than a JVM based
    implementation.
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Comet is one of a growing class of projects that aim to accelerate Spark using
native columnar engines such as the proprietary &lt;a href="https://www.databricks.com/product/photon"&gt;Databricks Photon Engine&lt;/a&gt; and
open source projects &lt;a href="https://incubator.apache.org/projects/gluten.html"&gt;Gluten&lt;/a&gt;, &lt;a href="https://github.com/NVIDIA/spark-rapids"&gt;Spark RAPIDS&lt;/a&gt;, and &lt;a href="https://github.com/kwai/blaze"&gt;Blaze&lt;/a&gt; (also built using
DataFusion).&lt;/p&gt;
&lt;p&gt;Comet was originally implemented at Apple and the engineers who worked on the
project are also significant contributors to Arrow and DataFusion. Bringing 
Comet into the Apache Software Foundation will accelerate its development and 
grow its community of contributors and users.&lt;/p&gt;
&lt;h1 id="get-involved"&gt;Get Involved&lt;a class="headerlink" href="#get-involved" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;Comet is still in the early stages of development and we would love to have you
join us and help shape the project. We are working on an initial release, and 
expect to post another update with more details at that time.&lt;/p&gt;
&lt;p&gt;Before then, here are some ways to get involved:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Learn more by visiting the &lt;a href="https://github.com/apache/arrow-datafusion-comet"&gt;Comet project&lt;/a&gt; page and reading the &lt;a href="https://lists.apache.org/thread/0q1rb11jtpopc7vt1ffdzro0omblsh0s"&gt;mailing list
  discussion&lt;/a&gt; about the initial donation.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Help us plan out the &lt;a href="https://github.com/apache/arrow-datafusion-comet/issues/19"&gt;roadmap&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Try out the project and provide feedback, file issues, and contribute code.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;</content><category term="blog"/></entry><entry><title>Apache Arrow DataFusion 34.0.0 Released, Looking Forward to 2024</title><link href="https://datafusion.apache.org/blog/2024/01/19/datafusion-34.0.0" rel="alternate"/><published>2024-01-19T00:00:00+00:00</published><updated>2024-01-19T00:00:00+00:00</updated><author><name>pmc</name></author><id>tag:datafusion.apache.org,2024-01-19:/blog/2024/01/19/datafusion-34.0.0</id><summary type="html">&lt;!--
--&gt;

&lt;h2 id="introduction"&gt;Introduction&lt;a class="headerlink" href="#introduction" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We recently &lt;a href="https://crates.io/crates/datafusion/34.0.0"&gt;released DataFusion 34.0.0&lt;/a&gt;. This blog highlights some of the major
improvements since we &lt;a href="https://arrow.apache.org/blog/2023/06/24/datafusion-25.0.0/"&gt;released DataFusion 26.0.0&lt;/a&gt; (spoiler alert: there are many)
and a preview of where the community plans to focus in the next 6 months.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://arrow.apache.org/datafusion/"&gt;Apache Arrow DataFusion&lt;/a&gt; is an extensible query …&lt;/p&gt;</summary><content type="html">&lt;!--
--&gt;

&lt;h2 id="introduction"&gt;Introduction&lt;a class="headerlink" href="#introduction" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We recently &lt;a href="https://crates.io/crates/datafusion/34.0.0"&gt;released DataFusion 34.0.0&lt;/a&gt;. This blog highlights some of the major
improvements since we &lt;a href="https://arrow.apache.org/blog/2023/06/24/datafusion-25.0.0/"&gt;released DataFusion 26.0.0&lt;/a&gt; (spoiler alert: there are many)
and a preview of where the community plans to focus in the next 6 months.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://arrow.apache.org/datafusion/"&gt;Apache Arrow DataFusion&lt;/a&gt; is an extensible query engine, written in &lt;a href="https://www.rust-lang.org/"&gt;Rust&lt;/a&gt;, that
uses &lt;a href="https://arrow.apache.org"&gt;Apache Arrow&lt;/a&gt; as its in-memory format. DataFusion is used by developers to
create new, fast data-centric systems such as databases, dataframe libraries,
machine learning and streaming applications. While &lt;a href="https://arrow.apache.org/datafusion/user-guide/introduction.html#project-goals"&gt;DataFusion’s primary design
goal&lt;/a&gt; is to accelerate creating other data-centric systems, it also provides a
reasonable out-of-the-box experience as a &lt;a href="https://arrow.apache.org/datafusion-python/"&gt;dataframe library&lt;/a&gt; and
&lt;a href="https://arrow.apache.org/datafusion/user-guide/cli.html"&gt;command line SQL tool&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This may also be our last update on the Apache Arrow Site. Future
updates will likely be on the DataFusion website as we are working to &lt;a href="https://github.com/apache/arrow-datafusion/discussions/6475"&gt;graduate
to a top level project&lt;/a&gt; (Apache Arrow DataFusion → Apache DataFusion!) which
will help focus governance and project growth. Also exciting: our &lt;a href="https://github.com/apache/arrow-datafusion/discussions/8522"&gt;first
DataFusion in-person meetup&lt;/a&gt; is planned for March 2024.&lt;/p&gt;
&lt;p&gt;DataFusion is very much a community endeavor. Our core thesis is that as a
community we can build much more advanced technology than any of us as
individuals or companies could alone. In the last 6 months between &lt;code&gt;26.0.0&lt;/code&gt; and
&lt;code&gt;34.0.0&lt;/code&gt;, community growth has been strong. We accepted and reviewed over a
thousand PRs from 124 different committers, created over 650 issues and closed 517
of them.
You can find a list of all changes in the detailed &lt;a href="https://github.com/apache/arrow-datafusion/blob/main/datafusion/CHANGELOG.md"&gt;CHANGELOG&lt;/a&gt;.&lt;/p&gt;
&lt;!--
$ git log --pretty=oneline 26.0.0..34.0.0 . | wc -l
     1009

$ git shortlog -sn 26.0.0..34.0.0 . | wc -l
      124

https://crates.io/crates/datafusion/26.0.0
DataFusion 26 released June 7, 2023

https://crates.io/crates/datafusion/34.0.0
DataFusion 34 released Dec 17, 2023

Issues created in this time: 214 open, 437 closed
https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+created%3A2023-06-23..2023-12-17

Issues closes: 517
https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+closed%3A2023-06-23..2023-12-17+

PRs merged in this time 908
https://github.com/apache/arrow-datafusion/pulls?q=is%3Apr+merged%3A2023-06-23..2023-12-17
--&gt;
&lt;h1 id="improved-performance"&gt;Improved Performance 🚀&lt;a class="headerlink" href="#improved-performance" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;Performance is a key feature of DataFusion: version &lt;code&gt;34.0.0&lt;/code&gt; is
more than 2x faster on &lt;a href="https://benchmark.clickhouse.com/"&gt;ClickBench&lt;/a&gt; than version &lt;code&gt;25.0.0&lt;/code&gt;, as shown below:&lt;/p&gt;
&lt;!--
  Scripts: https://github.com/alamb/datafusion-duckdb-benchmark/tree/datafusion-25-34
  Spreadsheet: https://docs.google.com/spreadsheets/d/1FtI3652WIJMC5LmJbLfT3G06w0JQIxEPG4yfMafexh8/edit#gid=1879366976
  Average runtime on 25.0.0: 7.2s (for the queries that actually ran)
  Average runtime on 34.0.0: 3.6s (for the same queries that ran in 25.0.0)
--&gt;
&lt;figure class="text-center"&gt;
&lt;img alt="Fig 1: Adaptive Arrow schema architecture overview." class="img-fluid" src="/blog/images/datafusion-34.0.0/compare-new.png"/&gt;
&lt;figcaption&gt;
&lt;b&gt;Figure 1&lt;/b&gt;: Performance improvement between &lt;code&gt;25.0.0&lt;/code&gt; and &lt;code&gt;34.0.0&lt;/code&gt; on ClickBench.
    Note that DataFusion &lt;code&gt;25.0.0&lt;/code&gt; could not run several queries due to
    unsupported SQL (Q9, Q11, Q12, Q14) or memory requirements (Q33).
  &lt;/figcaption&gt;
&lt;/figure&gt;
&lt;figure class="text-center"&gt;
&lt;img alt="Fig 1: Adaptive Arrow schema architecture overview." class="img-fluid" src="/blog/images/datafusion-34.0.0/compare.png"/&gt;
&lt;figcaption&gt;
&lt;b&gt;Figure 2&lt;/b&gt;: Total query runtime for DataFusion &lt;code&gt;34.0.0&lt;/code&gt; and DataFusion &lt;code&gt;25.0.0&lt;/code&gt;.
  &lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Here are some specific enhancements we have made to improve performance:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arrow.apache.org/blog/2023/08/05/datafusion_fast_grouping/"&gt;2-3x better aggregation performance with many distinct groups&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Partially ordered grouping / streaming grouping&lt;/li&gt;
&lt;li&gt;Specialized operator for "TopK" &lt;code&gt;ORDER BY LIMIT XXX&lt;/code&gt; queries&lt;/li&gt;
&lt;li&gt;Specialized operator for &lt;code&gt;min(col) GROUP BY .. ORDER BY min(col) LIMIT XXX&lt;/code&gt; queries&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/arrow-datafusion/pull/8126"&gt;Improved join performance&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Elimination of redundant sorting with sort-order-aware optimizers&lt;/li&gt;
&lt;/ul&gt;
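&lt;p&gt;As an illustrative sketch (the table and column names here are hypothetical), the "TopK" optimization applies to queries of this shape, which previously required sorting all input rows:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;-- Only the top 10 rows by event_time need to be kept in memory,
-- rather than sorting the entire table
SELECT event_time, url
FROM hits
ORDER BY event_time DESC
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;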
&lt;h1 id="new-features"&gt;New Features ✨&lt;a class="headerlink" href="#new-features" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h1&gt;
&lt;h2 id="dml-insert-creating-files"&gt;DML / Insert / Creating Files&lt;a class="headerlink" href="#dml-insert-creating-files" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;DataFusion now supports writing data in parallel, to individual or multiple
files, using &lt;code&gt;Parquet&lt;/code&gt;, &lt;code&gt;CSV&lt;/code&gt;, &lt;code&gt;JSON&lt;/code&gt;, &lt;code&gt;ARROW&lt;/code&gt; and user defined formats.
&lt;a href="https://github.com/apache/arrow-datafusion/pull/7655"&gt;Benchmark results&lt;/a&gt; show improvements up to 5x in some cases.&lt;/p&gt;
&lt;p&gt;As with reading, data can now be written to any &lt;code&gt;ObjectStore&lt;/code&gt;
implementation, including AWS S3, Azure Blob Storage, GCP Cloud Storage, local
files, and user defined implementations. While reading from &lt;a href="https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTable.html#features"&gt;hive style
partitioned tables&lt;/a&gt; has long been supported, it is now possible to write to such
tables as well.&lt;/p&gt;
&lt;p&gt;For example, to write to a local file:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;❯ CREATE EXTERNAL TABLE awesome_table(x INT) STORED AS PARQUET LOCATION '/tmp/my_awesome_table';
0 rows in set. Query took 0.003 seconds.

❯ INSERT INTO awesome_table SELECT x * 10 FROM my_source_table;
+-------+
| count |
+-------+
| 3     |
+-------+
1 row in set. Query took 0.024 seconds.
&lt;/code&gt;&lt;/pre&gt;
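&lt;p&gt;Writing to a hive style partitioned table can be sketched as follows (the table, column, and path names here are hypothetical):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;-- Declare a table partitioned by the "year" column
❯ CREATE EXTERNAL TABLE partitioned_table(x INT, year INT)
  STORED AS PARQUET
  PARTITIONED BY (year)
  LOCATION '/tmp/partitioned_table/';

-- Rows are routed to per-partition directories such as
-- /tmp/partitioned_table/year=2023/
❯ INSERT INTO partitioned_table SELECT x, 2023 FROM my_source_table;
&lt;/code&gt;&lt;/pre&gt;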
&lt;p&gt;You can also write to files with the &lt;code&gt;COPY&lt;/code&gt; command, similar to DuckDB’s &lt;code&gt;COPY&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;❯ COPY (SELECT x + 1 FROM my_source_table) TO '/tmp/output.json';
+-------+
| count |
+-------+
| 3     |
+-------+
1 row in set. Query took 0.014 seconds.
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code class="language-shell"&gt;$ cat /tmp/output.json
{"x":1}
{"x":2}
{"x":3}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id="improved-struct-and-array-support"&gt;Improved &lt;code&gt;STRUCT&lt;/code&gt; and &lt;code&gt;ARRAY&lt;/code&gt; support&lt;a class="headerlink" href="#improved-struct-and-array-support" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;DataFusion &lt;code&gt;34.0.0&lt;/code&gt; has much improved &lt;code&gt;STRUCT&lt;/code&gt; and &lt;code&gt;ARRAY&lt;/code&gt;
support, including a full range of &lt;a href="https://arrow.apache.org/datafusion/user-guide/sql/scalar_functions.html#struct-functions"&gt;struct functions&lt;/a&gt; and &lt;a href="https://arrow.apache.org/datafusion/user-guide/sql/scalar_functions.html#array-functions"&gt;array functions&lt;/a&gt;.&lt;/p&gt;
&lt;!--
❯ create table my_table as values ([1,2,3]), ([2]), ([4,5]);
--&gt;
&lt;p&gt;For example, you can now use &lt;code&gt;[]&lt;/code&gt; syntax and &lt;code&gt;array_length&lt;/code&gt; to access and inspect arrays:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;❯ SELECT column1, 
         column1[1] AS first_element, 
         array_length(column1) AS len 
  FROM my_table;
+-----------+---------------+-----+
| column1   | first_element | len |
+-----------+---------------+-----+
| [1, 2, 3] | 1             | 3   |
| [2]       | 2             | 1   |
| [4, 5]    | 4             | 2   |
+-----------+---------------+-----+
&lt;/code&gt;&lt;/pre&gt;
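&lt;p&gt;Struct fields can be accessed with the same &lt;code&gt;[]&lt;/code&gt; syntax. For the example below, a table of structs could be created with the &lt;code&gt;struct&lt;/code&gt; function, which assigns field names &lt;code&gt;c0&lt;/code&gt;, &lt;code&gt;c1&lt;/code&gt;, and so on (a sketch; real data would normally come from files):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;❯ CREATE TABLE my_table AS VALUES (struct('foo', 1)), (struct('bar', 2));
&lt;/code&gt;&lt;/pre&gt;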
&lt;pre&gt;&lt;code class="language-sql"&gt;❯ SELECT column1, column1['c0'] FROM  my_table;
+------------------+----------------------+
| column1          | my_table.column1[c0] |
+------------------+----------------------+
| {c0: foo, c1: 1} | foo                  |
| {c0: bar, c1: 2} | bar                  |
+------------------+----------------------+
2 rows in set. Query took 0.002 seconds.
&lt;/code&gt;&lt;/pre&gt;
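&lt;p&gt;Arrays can also be built and modified directly with the array functions; a brief sketch:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;-- Construct an array, then append an element to it
❯ SELECT array_append(make_array(1, 2, 3), 4);
&lt;/code&gt;&lt;/pre&gt;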
&lt;h2 id="other-features"&gt;Other Features&lt;a class="headerlink" href="#other-features" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Other notable features include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Support for aggregating datasets that exceed memory size, with &lt;a href="https://github.com/apache/arrow-datafusion/pull/7400"&gt;group by spill to disk&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;All operators, including joins, now track and limit their memory consumption&lt;/li&gt;
&lt;/ul&gt;
&lt;h1 id="building-systems-is-easier-with-datafusion"&gt;Building Systems is Easier with DataFusion 🛠️&lt;a class="headerlink" href="#building-systems-is-easier-with-datafusion" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h1&gt;
&lt;h2 id="documentation"&gt;Documentation&lt;a class="headerlink" href="#documentation" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;It is easier than ever to get started using DataFusion with the
new &lt;a href="https://arrow.apache.org/datafusion/library-user-guide/index.html"&gt;Library Users Guide&lt;/a&gt; as well as the significantly improved &lt;a href="https://docs.rs/datafusion/latest/datafusion/index.html"&gt;API documentation&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="user-defined-window-and-table-functions"&gt;User Defined Window and Table Functions&lt;a class="headerlink" href="#user-defined-window-and-table-functions" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;In addition to DataFusion's &lt;a href="https://arrow.apache.org/datafusion/library-user-guide/adding-udfs.html#adding-a-scalar-udf"&gt;User Defined Scalar Functions&lt;/a&gt;, and &lt;a href="https://arrow.apache.org/datafusion/library-user-guide/adding-udfs.html#adding-an-aggregate-udf"&gt;User Defined Aggregate Functions&lt;/a&gt;, DataFusion now supports &lt;a href="https://arrow.apache.org/datafusion/library-user-guide/adding-udfs.html#adding-a-window-udf"&gt;User Defined Window Functions&lt;/a&gt; 
 and &lt;a href="https://arrow.apache.org/datafusion/library-user-guide/adding-udfs.html#adding-a-user-defined-table-function"&gt;User Defined Table Functions&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For example, &lt;code&gt;datafusion-cli&lt;/code&gt; implements a DuckDB style &lt;code&gt;parquet_metadata&lt;/code&gt;
function as a user defined table function (&lt;a href="https://github.com/apache/arrow-datafusion/blob/3f219bc929cfd418b0e3d3501f8eba1d5a2c87ae/datafusion-cli/src/functions.rs#L222-L248"&gt;source code here&lt;/a&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;❯ SELECT 
      path_in_schema, row_group_id, row_group_num_rows, stats_min, stats_max, total_compressed_size 
FROM 
      parquet_metadata('hits.parquet')
WHERE path_in_schema = '"WatchID"' 
LIMIT 3;

+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
| path_in_schema | row_group_id | row_group_num_rows | stats_min           | stats_max           | total_compressed_size |
+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
| "WatchID"      | 0            | 450560             | 4611687214012840539 | 9223369186199968220 | 3883759               |
| "WatchID"      | 1            | 612174             | 4611689135232456464 | 9223371478009085789 | 5176803               |
| "WatchID"      | 2            | 344064             | 4611692774829951781 | 9223363791697310021 | 3031680               |
+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
3 rows in set. Query took 0.053 seconds.
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id="growth-of-datafusion"&gt;Growth of DataFusion 📈&lt;a class="headerlink" href="#growth-of-datafusion" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;DataFusion has been appearing more publicly in the wild. For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;New projects built using DataFusion, such as &lt;a href="https://lancedb.com/"&gt;lancedb&lt;/a&gt;, &lt;a href="https://glaredb.com/"&gt;GlareDB&lt;/a&gt;, &lt;a href="https://www.arroyo.dev/"&gt;Arroyo&lt;/a&gt;, and &lt;a href="https://github.com/cmu-db/optd"&gt;optd&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Public talks such as &lt;a href="https://www.youtube.com/watch?v=AJU9rdRNk9I"&gt;Apache Arrow Datafusion: Vectorized
  Execution Framework For Maximum Performance&lt;/a&gt; at &lt;a href="https://www.bagevent.com/event/8432178"&gt;CommunityOverCode Asia 2023&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Blog posts such as &lt;a href="https://www.synnada.ai/blog/apache-arrow-arrow-datafusion-ai-native-data-infra-an-interview-with-our-ceo-ozan"&gt;Apache Arrow, Arrow/DataFusion, AI-native Data Infra&lt;/a&gt;,
  &lt;a href="https://www.influxdata.com/blog/flight-datafusion-arrow-parquet-fdap-architecture-influxdb/"&gt;Flight, DataFusion, Arrow, and Parquet: Using the FDAP Architecture to build InfluxDB 3.0&lt;/a&gt;, and
  &lt;a href="https://www.linkedin.com/pulse/guide-user-defined-functions-apache-arrow-datafusion-dade-aderemi/"&gt;A Guide to User-Defined Functions in Apache Arrow DataFusion&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We have also &lt;a href="https://github.com/apache/arrow-datafusion/issues/6782"&gt;submitted a paper&lt;/a&gt; to &lt;a href="https://2024.sigmod.org/"&gt;SIGMOD 2024&lt;/a&gt;, one of the
premiere database conferences, describing DataFusion in a formal, technical
style and making the case that it is possible to create a modular and extensible query engine
without sacrificing performance. We hope this paper helps people
evaluating DataFusion for their needs understand it better.&lt;/p&gt;
&lt;h1 id="datafusion-in-2024"&gt;DataFusion in 2024 🥳&lt;a class="headerlink" href="#datafusion-in-2024" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;Some major initiatives from contributors we know of this year are:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Modularity&lt;/em&gt;: Make DataFusion even more modular, such as &lt;a href="https://github.com/apache/arrow-datafusion/issues/8045"&gt;unifying
   built in and user functions&lt;/a&gt;, making it easier to customize 
   DataFusion's behavior.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Community Growth&lt;/em&gt;: Graduate to our own top level Apache project, and
   subsequently add more committers and PMC members to keep pace with project
   growth.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Use case white papers&lt;/em&gt;: Write blog posts and videos explaining
   how to use DataFusion for real-world use cases.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Testing&lt;/em&gt;: Improve CI infrastructure and test coverage, more fuzz
   testing, and better functional and performance regression testing.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Planning Time&lt;/em&gt;: Reduce the time taken to plan queries, both &lt;a href="https://github.com/apache/arrow-datafusion/issues/7698"&gt;wide
   tables of 1000s of columns&lt;/a&gt;, and in &lt;a href="https://github.com/apache/arrow-datafusion/issues/5637"&gt;general&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Aggregate Performance&lt;/em&gt;: Improve the speed of &lt;a href="https://github.com/apache/arrow-datafusion/issues/7000"&gt;aggregating "high cardinality"&lt;/a&gt; data
   when there are many (e.g. millions) of distinct groups.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Statistics&lt;/em&gt;: &lt;a href="https://github.com/apache/arrow-datafusion/issues/8227"&gt;Improved statistics handling&lt;/a&gt; with an eye towards more
   sophisticated expression analysis and cost models.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h1 id="how-to-get-involved"&gt;How to Get Involved&lt;a class="headerlink" href="#how-to-get-involved" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;If you are interested in contributing to DataFusion we would love to have you
join us. You can try out DataFusion on some of your own data and projects and
let us know how it goes, contribute suggestions, bug reports, or
a PR with documentation, tests, or code. A list of open issues
suitable for beginners is &lt;a href="https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;As the community grows, we are also looking to restart biweekly calls /
meetings. Timezones are always a challenge for such meetings, but we hope to
have two calls that can work for most attendees. If you are interested
in helping, or just want to say hi, please drop us a note via one of 
the methods listed in our &lt;a href="https://arrow.apache.org/datafusion/contributor-guide/communication.html"&gt;Communication Doc&lt;/a&gt;.&lt;/p&gt;</content><category term="blog"/></entry><entry><title>Apache Arrow DataFusion 26.0.0</title><link href="https://datafusion.apache.org/blog/2023/06/24/datafusion-25.0.0" rel="alternate"/><published>2023-06-24T00:00:00+00:00</published><updated>2023-06-24T00:00:00+00:00</updated><author><name>pmc</name></author><id>tag:datafusion.apache.org,2023-06-24:/blog/2023/06/24/datafusion-25.0.0</id><summary type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;p&gt;It has been a whirlwind 6 months of DataFusion development since &lt;a href="https://arrow.apache.org/blog/2023/01/19/datafusion-16.0.0"&gt;our
last update&lt;/a&gt;: the community has grown, many features have been added,
performance improved and we are &lt;a href="https://github.com/apache/arrow-datafusion/discussions/6475"&gt;discussing&lt;/a&gt; branching out to our own
top level Apache Project.&lt;/p&gt;
&lt;h2 id="background"&gt;Background&lt;a class="headerlink" href="#background" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://arrow.apache.org/datafusion/"&gt;Apache Arrow DataFusion&lt;/a&gt; is an extensible query engine and database
toolkit …&lt;/p&gt;</summary><content type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;p&gt;It has been a whirlwind 6 months of DataFusion development since &lt;a href="https://arrow.apache.org/blog/2023/01/19/datafusion-16.0.0"&gt;our
last update&lt;/a&gt;: the community has grown, many features have been added,
performance improved and we are &lt;a href="https://github.com/apache/arrow-datafusion/discussions/6475"&gt;discussing&lt;/a&gt; branching out to our own
top level Apache Project.&lt;/p&gt;
&lt;h2 id="background"&gt;Background&lt;a class="headerlink" href="#background" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://arrow.apache.org/datafusion/"&gt;Apache Arrow DataFusion&lt;/a&gt; is an extensible query engine and database
toolkit, written in &lt;a href="https://www.rust-lang.org/"&gt;Rust&lt;/a&gt;, that uses &lt;a href="https://arrow.apache.org"&gt;Apache Arrow&lt;/a&gt; as its in-memory
format.&lt;/p&gt;
&lt;p&gt;DataFusion, along with &lt;a href="https://calcite.apache.org"&gt;Apache Calcite&lt;/a&gt;, Facebook's &lt;a href="https://github.com/facebookincubator/velox"&gt;Velox&lt;/a&gt; and
similar technologies, is part of the next generation of "&lt;a href="https://www.usenix.org/publications/login/winter2018/khurana"&gt;Deconstructed
Database&lt;/a&gt;" architectures, where new systems are built on a foundation
of fast, modular components rather than as a single tightly integrated
system.&lt;/p&gt;
&lt;p&gt;While single tightly integrated systems such as &lt;a href="https://spark.apache.org/"&gt;Spark&lt;/a&gt;, &lt;a href="https://duckdb.org"&gt;DuckDB&lt;/a&gt; and
&lt;a href="https://www.pola.rs/"&gt;Pola.rs&lt;/a&gt; are great pieces of technology, our community believes that
anyone developing new data-heavy applications in the next 5 years, such as
those common in machine learning, will &lt;strong&gt;require&lt;/strong&gt; a high
performance, vectorized query engine to remain relevant. The only
practical way to gain access to such technology without investing many
millions of dollars to build a new tightly integrated engine is
through open source projects like DataFusion and similar enabling
technologies such as &lt;a href="https://arrow.apache.org"&gt;Apache Arrow&lt;/a&gt; and &lt;a href="https://www.rust-lang.org/"&gt;Rust&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;DataFusion is targeted primarily at developers creating other data
intensive analytics, and offers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;High performance, native, parallel streaming execution engine&lt;/li&gt;
&lt;li&gt;Mature &lt;a href="https://arrow.apache.org/datafusion/user-guide/sql/index.html"&gt;SQL support&lt;/a&gt;, featuring subqueries, window functions, grouping sets, and more&lt;/li&gt;
&lt;li&gt;Built in support for Parquet, Avro, CSV, JSON and Arrow formats and easy extension for others&lt;/li&gt;
&lt;li&gt;Native DataFrame API and &lt;a href="https://arrow.apache.org/datafusion-python/"&gt;python bindings&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.rs/datafusion/latest/datafusion/index.html"&gt;Well documented&lt;/a&gt; source code and architecture, designed to be customized to suit downstream project needs&lt;/li&gt;
&lt;li&gt;High quality, easy to use code &lt;a href="https://crates.io/crates/datafusion/versions"&gt;released every 2 weeks to crates.io&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Welcoming, open community, governed by the highly regarded and well understood &lt;a href="https://www.apache.org/"&gt;Apache Software Foundation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
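&lt;p&gt;As one illustration of the SQL support mentioned above, grouping sets compute several groupings in a single query (the table and column names here are hypothetical):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;-- Totals per region, per (region, product), and overall, in one pass
SELECT region, product, sum(sales)
FROM orders
GROUP BY GROUPING SETS ((region), (region, product), ());
&lt;/code&gt;&lt;/pre&gt;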
&lt;p&gt;The rest of this post highlights some of the improvements we have made
to DataFusion over the last 6 months and a preview of where we are
heading. You can see a list of all changes in the detailed
&lt;a href="https://github.com/apache/arrow-datafusion/blob/main/datafusion/CHANGELOG.md"&gt;CHANGELOG&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="even-better-performance"&gt;(Even) Better Performance&lt;a class="headerlink" href="#even-better-performance" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://voltrondata.com/resources/speeds-and-feeds-hardware-and-software-matter"&gt;Various&lt;/a&gt; benchmarks show DataFusion to be quite close or &lt;a href="https://github.com/tustvold/access-log-bench"&gt;even
faster&lt;/a&gt; to the state of the art in analytic performance (at the moment
this seems to be DuckDB). We continually work on improving performance
(see &lt;a href="https://github.com/apache/arrow-datafusion/issues/5546"&gt;#5546&lt;/a&gt; for a list) and would love additional help in this area.&lt;/p&gt;
&lt;p&gt;DataFusion now reads single large Parquet files significantly faster by
&lt;a href="https://github.com/apache/arrow-datafusion/pull/5057"&gt;parallelizing across multiple cores&lt;/a&gt;. Native speeds for reading JSON
and CSV files are also up to 2.5x faster thanks to improvements
upstream in arrow-rs &lt;a href="https://github.com/apache/arrow-rs/pull/3479#issuecomment-1384353159"&gt;JSON reader&lt;/a&gt; and &lt;a href="https://github.com/apache/arrow-rs/pull/3365"&gt;CSV reader&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Also, we have integrated the &lt;a href="https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-1/"&gt;arrow-rs Row Format&lt;/a&gt; into DataFusion resulting in up to &lt;a href="https://github.com/apache/arrow-datafusion/pull/6163"&gt;2-3x faster sorting and merging&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="improved-documentation-and-website"&gt;Improved Documentation and Website&lt;a class="headerlink" href="#improved-documentation-and-website" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Part of growing the DataFusion community is ensuring that DataFusion's
features are understood and that it is easy to contribute and
participate. To that end the &lt;a href="https://arrow.apache.org/datafusion/"&gt;website&lt;/a&gt; has been cleaned up, &lt;a href="https://docs.rs/datafusion/latest/datafusion/index.html#architecture"&gt;the
architecture guide&lt;/a&gt; expanded, the &lt;a href="https://arrow.apache.org/datafusion/contributor-guide/roadmap.html"&gt;roadmap&lt;/a&gt; updated, and several
overview talks created:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Apr 2023 &lt;em&gt;Query Engine&lt;/em&gt;: &lt;a href="https://youtu.be/NVKujPxwSBA"&gt;recording&lt;/a&gt; and &lt;a href="https://docs.google.com/presentation/d/1D3GDVas-8y0sA4c8EOgdCvEjVND4s2E7I6zfs67Y4j8/edit#slide=id.p"&gt;slides&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;April 2023 &lt;em&gt;Logical Plan and Expressions&lt;/em&gt;: &lt;a href="https://youtu.be/EzZTLiSJnhY"&gt;recording&lt;/a&gt; and &lt;a href="https://docs.google.com/presentation/d/1ypylM3-w60kVDW7Q6S99AHzvlBgciTdjsAfqNP85K30"&gt;slides&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;April 2023 &lt;em&gt;Physical Plan and Execution&lt;/em&gt;: &lt;a href="https://youtu.be/2jkWU3_w6z0"&gt;recording&lt;/a&gt; and &lt;a href="https://docs.google.com/presentation/d/1cA2WQJ2qg6tx6y4Wf8FH2WVSm9JQ5UgmBWATHdik0hg"&gt;slides&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="new-features"&gt;New Features&lt;a class="headerlink" href="#new-features" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id="more-streaming-less-memory"&gt;More Streaming, Less Memory&lt;a class="headerlink" href="#more-streaming-less-memory" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;We have made significant progress on the &lt;a href="https://github.com/apache/arrow-datafusion/issues/4285"&gt;streaming execution roadmap&lt;/a&gt;
such as &lt;a href="https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.ExecutionPlan.html#method.unbounded_output"&gt;unbounded datasources&lt;/a&gt;, &lt;a href="https://docs.rs/datafusion/latest/datafusion/physical_plan/aggregates/enum.GroupByOrderMode.html"&gt;streaming group by&lt;/a&gt;, sophisticated
&lt;a href="https://docs.rs/datafusion/latest/datafusion/physical_optimizer/global_sort_selection/index.html"&gt;sort&lt;/a&gt; and &lt;a href="https://docs.rs/datafusion/latest/datafusion/physical_optimizer/repartition/index.html"&gt;repartitioning&lt;/a&gt; improvements in the optimizer, and support
for &lt;a href="https://docs.rs/datafusion/latest/datafusion/physical_plan/joins/struct.SymmetricHashJoinExec.html"&gt;symmetric hash join&lt;/a&gt; (read more about that in the great &lt;a href="https://www.synnada.ai/blog/general-purpose-stream-joins-via-pruning-symmetric-hash-joins"&gt;Synnada
Blog Post&lt;/a&gt; on the topic). Together, these features 1) make it
easier to build streaming systems using DataFusion that can
incrementally generate output before (or without ever) seeing the end of the
input, and 2) allow general queries to use less memory and produce their
results sooner.&lt;/p&gt;
&lt;p&gt;We have also improved the runtime &lt;a href="https://docs.rs/datafusion/latest/datafusion/execution/memory_pool/index.html"&gt;memory management&lt;/a&gt; system so that
DataFusion now stays within its declared memory budget, &lt;a href="https://github.com/apache/arrow-datafusion/issues/3941"&gt;generating
runtime errors&lt;/a&gt; rather than exceeding it.&lt;/p&gt;
&lt;h3 id="dml-support-insert-delete-update-etc"&gt;DML Support (&lt;code&gt;INSERT&lt;/code&gt;, &lt;code&gt;DELETE&lt;/code&gt;, &lt;code&gt;UPDATE&lt;/code&gt;, etc)&lt;a class="headerlink" href="#dml-support-insert-delete-update-etc" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Part of building high performance data systems includes writing data,
and DataFusion supports several features for creating new files:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;INSERT INTO&lt;/code&gt; and &lt;code&gt;SELECT ... INTO&lt;/code&gt; support for memory backed and CSV tables&lt;/li&gt;
&lt;li&gt;New &lt;a href="https://docs.rs/datafusion/latest/datafusion/physical_plan/insert/trait.DataSink.html"&gt;API for writing data into TableProviders&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
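&lt;p&gt;A minimal sketch of the &lt;code&gt;INSERT INTO&lt;/code&gt; support for memory backed tables described above (the table and column names are hypothetical):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;❯ CREATE TABLE t(a INT);
❯ INSERT INTO t VALUES (1), (2), (3);
❯ SELECT sum(a) FROM t;
&lt;/code&gt;&lt;/pre&gt;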
&lt;p&gt;We are working on easier to use &lt;a href="https://github.com/apache/arrow-datafusion/issues/5654"&gt;COPY INTO&lt;/a&gt; syntax, better support
for writing parquet, JSON, and AVRO, and more -- see our &lt;a href="https://github.com/apache/arrow-datafusion/issues/6569"&gt;tracking epic&lt;/a&gt;
for more details.&lt;/p&gt;
&lt;h3 id="timestamp-and-intervals"&gt;Timestamp and Intervals&lt;a class="headerlink" href="#timestamp-and-intervals" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;One mark of the maturity of a SQL engine is how it handles the tricky
world of timestamp, date, time, and interval arithmetic. DataFusion is
feature complete in this area and behaves as you would expect,
supporting queries such as:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;SELECT now() + '1 month' FROM my_table;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We still have a long tail of &lt;a href="https://github.com/apache/arrow-datafusion/issues/3148"&gt;date and time improvements&lt;/a&gt;, which we are working on as well.&lt;/p&gt;
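&lt;p&gt;Other combinations of dates, timestamps, and intervals work similarly; for example (illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;-- Add an interval to a date literal
SELECT DATE '2023-06-24' + INTERVAL '1 month';
&lt;/code&gt;&lt;/pre&gt;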
&lt;h3 id="querying-structured-types-list-and-structs"&gt;Querying Structured Types (&lt;code&gt;List&lt;/code&gt; and &lt;code&gt;Struct&lt;/code&gt;s)&lt;a class="headerlink" href="#querying-structured-types-list-and-structs" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Arrow and Parquet &lt;a href="https://arrow.apache.org/blog/2022/10/08/arrow-parquet-encoding-part-2/"&gt;support nested data&lt;/a&gt; well and DataFusion lets you
easily query such &lt;code&gt;Struct&lt;/code&gt; and &lt;code&gt;List&lt;/code&gt; data. For example, you can use
DataFusion to read and query the &lt;a href="https://data.mendeley.com/datasets/ct8f9skv97"&gt;JSON Datasets for Exploratory OLAP -
Mendeley Data&lt;/a&gt; like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;----------
-- Explore structured data using SQL
----------
SELECT delete FROM 'twitter-sample-head-100000.parquet' WHERE delete IS NOT NULL limit 10;
+---------------------------------------------------------------------------------------------------------------------------+
| delete                                                                                                                    |
+---------------------------------------------------------------------------------------------------------------------------+
| {status: {id: {$numberLong: 135037425050320896}, id_str: 135037425050320896, user_id: 334902461, user_id_str: 334902461}} |
| {status: {id: {$numberLong: 134703982051463168}, id_str: 134703982051463168, user_id: 405383453, user_id_str: 405383453}} |
| {status: {id: {$numberLong: 134773741740765184}, id_str: 134773741740765184, user_id: 64823441, user_id_str: 64823441}}   |
| {status: {id: {$numberLong: 132543659655704576}, id_str: 132543659655704576, user_id: 45917834, user_id_str: 45917834}}   |
| {status: {id: {$numberLong: 133786431926697984}, id_str: 133786431926697984, user_id: 67229952, user_id_str: 67229952}}   |
| {status: {id: {$numberLong: 134619093570560002}, id_str: 134619093570560002, user_id: 182430773, user_id_str: 182430773}} |
| {status: {id: {$numberLong: 134019857527214080}, id_str: 134019857527214080, user_id: 257396311, user_id_str: 257396311}} |
| {status: {id: {$numberLong: 133931546469076993}, id_str: 133931546469076993, user_id: 124539548, user_id_str: 124539548}} |
| {status: {id: {$numberLong: 134397743350296576}, id_str: 134397743350296576, user_id: 139836391, user_id_str: 139836391}} |
| {status: {id: {$numberLong: 127833661767823360}, id_str: 127833661767823360, user_id: 244442687, user_id_str: 244442687}} |
+---------------------------------------------------------------------------------------------------------------------------+

----------
-- Select some deeply nested fields
----------
SELECT
  delete['status']['id']['$numberLong'] as delete_id,
  delete['status']['user_id'] as delete_user_id
FROM 'twitter-sample-head-100000.parquet' WHERE delete IS NOT NULL LIMIT 10;

+--------------------+----------------+
| delete_id          | delete_user_id |
+--------------------+----------------+
| 135037425050320896 | 334902461      |
| 134703982051463168 | 405383453      |
| 134773741740765184 | 64823441       |
| 132543659655704576 | 45917834       |
| 133786431926697984 | 67229952       |
| 134619093570560002 | 182430773      |
| 134019857527214080 | 257396311      |
| 133931546469076993 | 124539548      |
| 134397743350296576 | 139836391      |
| 127833661767823360 | 244442687      |
+--------------------+----------------+
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id="subqueries-all-the-way-down"&gt;Subqueries All the Way Down&lt;a class="headerlink" href="#subqueries-all-the-way-down" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;DataFusion can run many different subqueries by rewriting them to
joins. It has been able to run the full suite of TPC-H queries for at
least the last year, but recently we have implemented significant
improvements to this logic, sufficient to run almost all queries in
the TPC-DS benchmark as well.&lt;/p&gt;
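&lt;p&gt;As an illustration (using hypothetical &lt;code&gt;customer&lt;/code&gt; and &lt;code&gt;orders&lt;/code&gt; tables in the style of TPC-H), a correlated &lt;code&gt;EXISTS&lt;/code&gt; subquery such as the following is decorrelated into a semi join during planning:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;-- customers that have placed at least one order
SELECT c_name
FROM customer
WHERE EXISTS (
  SELECT 1 FROM orders WHERE orders.o_custkey = customer.c_custkey
);
&lt;/code&gt;&lt;/pre&gt;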
&lt;h2 id="community-and-project-growth"&gt;Community and Project Growth&lt;a class="headerlink" href="#community-and-project-growth" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The six months since &lt;a href="https://arrow.apache.org/blog/2023/01/19/datafusion-16.0.0"&gt;our last update&lt;/a&gt; saw significant growth in
the DataFusion community. Between versions &lt;code&gt;17.0.0&lt;/code&gt; and &lt;code&gt;26.0.0&lt;/code&gt;,
DataFusion merged 711 PRs from 107 distinct contributors, not
including all the work that goes into our core dependencies such as
&lt;a href="https://crates.io/crates/arrow"&gt;arrow&lt;/a&gt;,
&lt;a href="https://crates.io/crates/parquet"&gt;parquet&lt;/a&gt;, and
&lt;a href="https://crates.io/crates/object_store"&gt;object_store&lt;/a&gt;, that much of
the same community helps support.&lt;/p&gt;
&lt;p&gt;In addition, we have added 7 new committers and 1 new PMC member to
the Apache Arrow project, largely focused on DataFusion, and we
learned about some of the cool &lt;a href="https://arrow.apache.org/datafusion/user-guide/introduction.html#known-users"&gt;new systems&lt;/a&gt; that are using
DataFusion. Given the growth of the community and interest in the
project, we also clarified the &lt;a href="https://github.com/apache/arrow-datafusion/discussions/6441"&gt;mission statement&lt;/a&gt; and are
&lt;a href="https://github.com/apache/arrow-datafusion/discussions/6475"&gt;discussing&lt;/a&gt; "graduating" DataFusion to a new top-level
Apache Software Foundation project.&lt;/p&gt;
&lt;!--
$ git log --pretty=oneline 17.0.0..26.0.0 . | wc -l
     711

$ git shortlog -sn 17.0.0..26.0.0 . | wc -l
      107
--&gt;
&lt;h1 id="how-to-get-involved"&gt;How to Get Involved&lt;a class="headerlink" href="#how-to-get-involved" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;Kudos to everyone in the community who has contributed ideas,
discussions, bug reports, documentation and code. It is exciting to be
innovating on the next generation of database architectures together!&lt;/p&gt;
&lt;p&gt;If you are interested in contributing to DataFusion, we would love to
have you join us. You can try out DataFusion on some of your own
data and projects and let us know how it goes or contribute a PR with
documentation, tests or code. A list of open issues suitable for
beginners is &lt;a href="https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Check out our &lt;a href="https://arrow.apache.org/datafusion/contributor-guide/communication.html"&gt;Communication Doc&lt;/a&gt; for more ways to engage with the
community.&lt;/p&gt;</content><category term="blog"/></entry><entry><title>Apache Arrow DataFusion 16.0.0 Project Update</title><link href="https://datafusion.apache.org/blog/2023/01/19/datafusion-16.0.0" rel="alternate"/><published>2023-01-19T00:00:00+00:00</published><updated>2023-01-19T00:00:00+00:00</updated><author><name>pmc</name></author><id>tag:datafusion.apache.org,2023-01-19:/blog/2023/01/19/datafusion-16.0.0</id><summary type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;h1 id="introduction"&gt;Introduction&lt;a class="headerlink" href="#introduction" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;&lt;a href="https://arrow.apache.org/datafusion/"&gt;DataFusion&lt;/a&gt; is an extensible
query execution framework, written in &lt;a href="https://www.rust-lang.org/"&gt;Rust&lt;/a&gt;,
that uses &lt;a href="https://arrow.apache.org"&gt;Apache Arrow&lt;/a&gt; as its
in-memory format. It is targeted primarily at developers creating data
intensive analytics, and offers mature
&lt;a href="https://arrow.apache.org/datafusion/user-guide/sql/index.html"&gt;SQL support&lt;/a&gt;,
a DataFrame API, and many extension points.&lt;/p&gt;
&lt;p&gt;Systems based on DataFusion perform very well in benchmarks …&lt;/p&gt;</summary><content type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;h1 id="introduction"&gt;Introduction&lt;a class="headerlink" href="#introduction" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;&lt;a href="https://arrow.apache.org/datafusion/"&gt;DataFusion&lt;/a&gt; is an extensible
query execution framework, written in &lt;a href="https://www.rust-lang.org/"&gt;Rust&lt;/a&gt;,
that uses &lt;a href="https://arrow.apache.org"&gt;Apache Arrow&lt;/a&gt; as its
in-memory format. It is targeted primarily at developers creating data
intensive analytics, and offers mature
&lt;a href="https://arrow.apache.org/datafusion/user-guide/sql/index.html"&gt;SQL support&lt;/a&gt;,
a DataFrame API, and many extension points.&lt;/p&gt;
&lt;p&gt;Systems based on DataFusion perform very well in benchmarks,
especially considering they operate directly on parquet files rather
than first loading into a specialized format.  Some recent highlights
include &lt;a href="https://benchmark.clickhouse.com/"&gt;clickbench&lt;/a&gt; and the
&lt;a href="https://www.cloudfuse.io/dashboards/standalone-engines"&gt;Cloudfuse.io standalone query
engines&lt;/a&gt; page.&lt;/p&gt;
&lt;p&gt;DataFusion is also part of a longer term trend, articulated clearly by
&lt;a href="http://www.cs.cmu.edu/~pavlo/"&gt;Andy Pavlo&lt;/a&gt; in his &lt;a href="https://ottertune.com/blog/2022-databases-retrospective/"&gt;2022 Databases
Retrospective&lt;/a&gt;.
Database frameworks are proliferating and it is likely that all OLAP
DBMSs and other data heavy applications, such as machine learning,
will &lt;strong&gt;require&lt;/strong&gt; a vectorized, highly performant query engine in the next
5 years to remain relevant. The only practical way to make such
technology so widely available without many millions of dollars of
investment is through open source engines such as DataFusion or
&lt;a href="https://github.com/facebookincubator/velox"&gt;Velox&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The rest of this post describes the improvements made to DataFusion
over the last three months and some hints of where we are heading.&lt;/p&gt;
&lt;h2 id="community-growth"&gt;Community Growth&lt;a class="headerlink" href="#community-growth" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We again saw significant growth in the DataFusion community since &lt;a href="https://arrow.apache.org/blog/2022/10/25/datafusion-13.0.0/"&gt;our last update&lt;/a&gt;. There are some interesting metrics on &lt;a href="https://ossrank.com/p/1573-apache-arrow-datafusion"&gt;OSSRank&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The DataFusion 16.0.0 release consists of 543 PRs from 73 distinct contributors, not including all the work that goes into dependencies such as &lt;a href="https://crates.io/crates/arrow"&gt;arrow&lt;/a&gt;, &lt;a href="https://crates.io/crates/parquet"&gt;parquet&lt;/a&gt;, and &lt;a href="https://crates.io/crates/object_store"&gt;object_store&lt;/a&gt;, that much of the same community helps support. Thank you all for your help!&lt;/p&gt;
&lt;!--
$ git log --pretty=oneline 13.0.0..16.0.0 . | wc -l
     543

$ git shortlog -sn 13.0.0..16.0.0 . | wc -l
      73
--&gt;
&lt;p&gt;Several &lt;a href="https://github.com/apache/arrow-datafusion#known-uses"&gt;new systems based on DataFusion&lt;/a&gt; were recently added:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/GreptimeTeam/greptimedb"&gt;Greptime DB&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://synnada.ai/"&gt;Synnada&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/PRQL/prql-query"&gt;PRQL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/parseablehq/parseable"&gt;Parseable&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/splitgraph/seafowl"&gt;SeaFowl&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="performance"&gt;Performance 🚀&lt;a class="headerlink" href="#performance" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Performance and efficiency are core values for
DataFusion. While there is still a gap between DataFusion and best-of-breed,
tightly integrated systems such as &lt;a href="https://duckdb.org"&gt;DuckDB&lt;/a&gt;
and &lt;a href="https://www.pola.rs/"&gt;Polars&lt;/a&gt;, DataFusion is
closing the gap quickly. Performance highlights from the last three
months:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Up to 30% Faster Sorting and Merging using the new &lt;a href="https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-1/"&gt;Row Format&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/"&gt;Advanced predicate pushdown&lt;/a&gt;, directly on parquet, directly from object storage, enabling sub millisecond filtering. &lt;!-- Andrew nots: we should really get this turned on by default --&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;70%&lt;/code&gt; faster &lt;code&gt;IN&lt;/code&gt; expression evaluation (&lt;a href="https://github.com/apache/arrow-datafusion/issues/4057"&gt;#4057&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Sort and partition aware optimizations (&lt;a href="https://github.com/apache/arrow-datafusion/issues/3969"&gt;#3969&lt;/a&gt; and  &lt;a href="https://github.com/apache/arrow-datafusion/issues/4691"&gt;#4691&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Filter selectivity analysis (&lt;a href="https://github.com/apache/arrow-datafusion/issues/3868"&gt;#3868&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="runtime-resource-limits"&gt;Runtime Resource Limits&lt;a class="headerlink" href="#runtime-resource-limits" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Previously, DataFusion could potentially use unbounded amounts of memory for certain queries that included Sorts, Grouping or Joins.&lt;/p&gt;
&lt;p&gt;In version 16.0.0, it is possible to limit DataFusion's memory usage for Sorting and Grouping. We are looking for help adding similar limiting for Joins as well as expanding our algorithms to optionally spill to secondary storage. See &lt;a href="https://github.com/apache/arrow-datafusion/issues/3941"&gt;#3941&lt;/a&gt; for more detail.&lt;/p&gt;
&lt;h2 id="sql-window-functions"&gt;SQL Window Functions&lt;a class="headerlink" href="#sql-window-functions" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Window_function_(SQL)"&gt;SQL Window Functions&lt;/a&gt; are useful for a variety of analysis and DataFusion's implementation support expanded significantly:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Custom window frames such as &lt;code&gt;... OVER (ORDER BY ... RANGE BETWEEN 0.2 PRECEDING AND 0.2 FOLLOWING)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Unbounded window frames such as &lt;code&gt;... OVER (ORDER BY ... ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Support for the &lt;code&gt;NTILE&lt;/code&gt; window function (&lt;a href="https://github.com/apache/arrow-datafusion/issues/4676"&gt;#4676&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Support for &lt;code&gt;GROUPS&lt;/code&gt; mode (&lt;a href="https://github.com/apache/arrow-datafusion/issues/4155"&gt;#4155&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
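&lt;p&gt;For example, the new frame support and &lt;code&gt;NTILE&lt;/code&gt; can be combined as follows (sketched against a hypothetical &lt;code&gt;prices&lt;/code&gt; table with a numeric &lt;code&gt;price&lt;/code&gt; column):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;SELECT
  price,
  -- average over all rows whose price is within 0.2 of this row's price
  avg(price) OVER (
    ORDER BY price
    RANGE BETWEEN 0.2 PRECEDING AND 0.2 FOLLOWING
  ) AS smoothed_price,
  -- assign each row to one of four equal-sized buckets
  ntile(4) OVER (ORDER BY price) AS quartile
FROM prices;
&lt;/code&gt;&lt;/pre&gt;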
&lt;h1 id="improved-joins"&gt;Improved Joins&lt;a class="headerlink" href="#improved-joins" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;Joins are often the most complicated operations to handle well in
analytics systems, and DataFusion 16.0.0 offers significant improvements
such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A cost-based optimizer (CBO) automatically reorders join evaluations, selects algorithms (Merge / Hash), and picks the build side based on available statistics and join type (&lt;code&gt;INNER&lt;/code&gt;, &lt;code&gt;LEFT&lt;/code&gt;, etc.) (&lt;a href="https://github.com/apache/arrow-datafusion/issues/4219"&gt;#4219&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Fast non-&lt;code&gt;column=column&lt;/code&gt; equijoins such as &lt;code&gt;JOIN ON a.x + 5 = b.y&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Better performance on non-equijoins (&lt;a href="https://github.com/apache/arrow-datafusion/issues/4562"&gt;#4562&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
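&lt;p&gt;For instance, a join whose key is a computed expression rather than a bare column (hypothetical tables &lt;code&gt;a&lt;/code&gt; and &lt;code&gt;b&lt;/code&gt;) can now take the fast equijoin path:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;-- the left join key is the expression a.x + 5,
-- yet this still executes as a hash equijoin
SELECT a.x, b.y
FROM a JOIN b ON a.x + 5 = b.y;
&lt;/code&gt;&lt;/pre&gt;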
&lt;h1 id="streaming-execution"&gt;Streaming Execution&lt;a class="headerlink" href="#streaming-execution" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;One emerging use case for DataFusion is as a foundation for
streaming-first data platforms. An important prerequisite is support
for incremental execution of queries that can be computed
incrementally.&lt;/p&gt;
&lt;p&gt;With this release, DataFusion now supports the following streaming features:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Data ingestion from infinite files such as FIFOs (&lt;a href="https://github.com/apache/arrow-datafusion/issues/4694"&gt;#4694&lt;/a&gt;),&lt;/li&gt;
&lt;li&gt;Detection of pipeline-breaking queries in streaming use cases (&lt;a href="https://github.com/apache/arrow-datafusion/issues/4694"&gt;#4694&lt;/a&gt;),&lt;/li&gt;
&lt;li&gt;Automatic input swapping for joins so probe side is a data stream (&lt;a href="https://github.com/apache/arrow-datafusion/issues/4694"&gt;#4694&lt;/a&gt;),&lt;/li&gt;
&lt;li&gt;Intelligent elision of pipeline-breaking sort operations whenever possible (&lt;a href="https://github.com/apache/arrow-datafusion/issues/4691"&gt;#4691&lt;/a&gt;),&lt;/li&gt;
&lt;li&gt;Incremental execution for more types of queries; e.g. queries involving finite window frames (&lt;a href="https://github.com/apache/arrow-datafusion/issues/4777"&gt;#4777&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These are major steps forward, and we plan even more improvements over the next few releases.&lt;/p&gt;
&lt;h1 id="better-support-for-distributed-catalogs"&gt;Better Support for Distributed Catalogs&lt;a class="headerlink" href="#better-support-for-distributed-catalogs" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;16.0.0 adds enhanced support for asynchronous catalogs (&lt;a href="https://github.com/apache/arrow-datafusion/issues/4607"&gt;#4607&lt;/a&gt;)
to better support distributed metadata stores such as
&lt;a href="https://delta.io/"&gt;Delta.io&lt;/a&gt; and &lt;a href="https://iceberg.apache.org/"&gt;Apache
Iceberg&lt;/a&gt; which require asynchronous I/O
during planning to access remote catalogs. Previously, DataFusion
required synchronous access to all relevant catalog information.&lt;/p&gt;
&lt;h1 id="additional-sql-support"&gt;Additional SQL Support&lt;a class="headerlink" href="#additional-sql-support" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;SQL support continues to improve, including some of these highlights:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Add TPC-DS query planning regression tests &lt;a href="https://github.com/apache/arrow-datafusion/issues/4719"&gt;#4719&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Support for &lt;code&gt;PREPARE&lt;/code&gt; statement &lt;a href="https://github.com/apache/arrow-datafusion/issues/4490"&gt;#4490&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Automatic coercion between Date and Timestamp &lt;a href="https://github.com/apache/arrow-datafusion/issues/4726"&gt;#4726&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Support for type coercion between timestamp and utf8 &lt;a href="https://github.com/apache/arrow-datafusion/issues/4312"&gt;#4312&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Full support for time32 and time64 literal values (&lt;code&gt;ScalarValue&lt;/code&gt;) &lt;a href="https://github.com/apache/arrow-datafusion/issues/4156"&gt;#4156&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;New functions, including &lt;code&gt;uuid()&lt;/code&gt; &lt;a href="https://github.com/apache/arrow-datafusion/issues/4041"&gt;#4041&lt;/a&gt;, &lt;code&gt;current_time&lt;/code&gt; &lt;a href="https://github.com/apache/arrow-datafusion/issues/4054"&gt;#4054&lt;/a&gt;, &lt;code&gt;current_date&lt;/code&gt; &lt;a href="https://github.com/apache/arrow-datafusion/issues/4022"&gt;#4022&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Compressed CSV/JSON support &lt;a href="https://github.com/apache/arrow-datafusion/issues/3642"&gt;#3642&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
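&lt;p&gt;For example, the new &lt;code&gt;PREPARE&lt;/code&gt; support allows a parameterized statement to be defined with typed placeholder parameters (sketched against a hypothetical table &lt;code&gt;t&lt;/code&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;-- define a statement with a typed placeholder $1
PREPARE find_by_id(INT) AS
  SELECT * FROM t WHERE id = $1;
&lt;/code&gt;&lt;/pre&gt;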
&lt;p&gt;The community has also invested in new &lt;a href="https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/tests/sqllogictests/README.md"&gt;sqllogic based&lt;/a&gt; tests to keep improving DataFusion's quality with less effort.&lt;/p&gt;
&lt;h1 id="plan-serialization-and-substrait"&gt;Plan Serialization and Substrait&lt;a class="headerlink" href="#plan-serialization-and-substrait" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;DataFusion now supports serialization of physical plans using a custom protocol buffers format. In addition, we are adding initial support for &lt;a href="https://substrait.io/"&gt;Substrait&lt;/a&gt;, a cross-language serialization format for relational algebra.&lt;/p&gt;
&lt;h1 id="how-to-get-involved"&gt;How to Get Involved&lt;a class="headerlink" href="#how-to-get-involved" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;Kudos to everyone in the community who contributed ideas, discussions, bug reports, documentation and code. It is exciting to be building something so cool together!&lt;/p&gt;
&lt;p&gt;If you are interested in contributing to DataFusion, we would love to
have you join us. You can try out DataFusion on some of your own
data and projects and let us know how it goes or contribute a PR with
documentation, tests or code. A list of open issues suitable for
beginners is
&lt;a href="https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Check out our &lt;a href="https://arrow.apache.org/datafusion/community/communication.html"&gt;Communication Doc&lt;/a&gt; for more
ways to engage with the community.&lt;/p&gt;
&lt;h2 id="appendix-contributor-shoutout"&gt;Appendix: Contributor Shoutout&lt;a class="headerlink" href="#appendix-contributor-shoutout" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Here is a list of people who have contributed PRs to this project over the last three releases, derived from &lt;code&gt;git shortlog -sn 13.0.0..16.0.0 .&lt;/code&gt; Thank you all!&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;   113  Andrew Lamb
    58  jakevin
    46  Raphael Taylor-Davies
    30  Andy Grove
    19  Batuhan Taskaya
    19  Remzi Yang
    17  ygf11
    16  Burak
    16  Jeffrey
    16  Marco Neumann
    14  Kun Liu
    12  Yang Jiang
    10  mingmwang
     9  Daniël Heres
     9  Mustafa akur
     9  comphead
     9  mvanschellebeeck
     9  xudong.w
     7  dependabot[bot]
     7  yahoNanJing
     6  Brent Gardner
     5  AssHero
     4  Jiayu Liu
     4  Wei-Ting Kuo
     4  askoa
     3  André Calado Coroado
     3  Jie Han
     3  Jon Mease
     3  Metehan Yıldırım
     3  Nga Tran
     3  Ruihang Xia
     3  baishen
     2  Berkay Şahin
     2  Dan Harris
     2  Dongyan Zhou
     2  Eduard Karacharov
     2  Kikkon
     2  Liang-Chi Hsieh
     2  Marko Milenković
     2  Martin Grigorov
     2  Roman Nozdrin
     2  Tim Van Wassenhove
     2  r.4ntix
     2  unconsolable
     2  unvalley
     1  Ajaya Agrawal
     1  Alexander Spies
     1  ArkashaJavelin
     1  Artjoms Iskovs
     1  BoredPerson
     1  Christian Salvati
     1  Creampanda
     1  Data Psycho
     1  Francis Du
     1  Francis Le Roy
     1  LFC
     1  Marko Grujic
     1  Matt Willian
     1  Matthijs Brobbel
     1  Max Burke
     1  Mehmet Ozan Kabak
     1  Rito Takeuchi
     1  Roman Zeyde
     1  Vrishabh
     1  Zhang Li
     1  ZuoTiJia
     1  byteink
     1  cfraz89
     1  nbr
     1  xxchan
     1  yujie.zhang
     1  zembunia
     1  哇呜哇呜呀咦耶
&lt;/code&gt;&lt;/pre&gt;</content><category term="blog"/></entry><entry><title>Apache Arrow Ballista 0.9.0 Release</title><link href="https://datafusion.apache.org/blog/2022/10/28/ballista-0.9.0" rel="alternate"/><published>2022-10-28T00:00:00+00:00</published><updated>2022-10-28T00:00:00+00:00</updated><author><name>pmc</name></author><id>tag:datafusion.apache.org,2022-10-28:/blog/2022/10/28/ballista-0.9.0</id><summary type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;h1 id="introduction"&gt;Introduction&lt;a class="headerlink" href="#introduction" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;&lt;a href="https://github.com/apache/arrow-ballista"&gt;Ballista&lt;/a&gt; is an Arrow-native distributed SQL query engine implemented in Rust.&lt;/p&gt;
&lt;p&gt;Ballista 0.9.0 is now available and is the most significant release since the project was &lt;a href="http://arrow.apache.org/blog/2021/04/12/ballista-donation/"&gt;donated&lt;/a&gt; to Apache
Arrow in 2021.&lt;/p&gt;
&lt;p&gt;This release represents 4 weeks of work, with 66 commits from 14 contributors:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;    22  Andy …&lt;/code&gt;&lt;/pre&gt;</summary><content type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;h1 id="introduction"&gt;Introduction&lt;a class="headerlink" href="#introduction" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;&lt;a href="https://github.com/apache/arrow-ballista"&gt;Ballista&lt;/a&gt; is an Arrow-native distributed SQL query engine implemented in Rust.&lt;/p&gt;
&lt;p&gt;Ballista 0.9.0 is now available and is the most significant release since the project was &lt;a href="http://arrow.apache.org/blog/2021/04/12/ballista-donation/"&gt;donated&lt;/a&gt; to Apache
Arrow in 2021.&lt;/p&gt;
&lt;p&gt;This release represents 4 weeks of work, with 66 commits from 14 contributors:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;    22  Andy Grove
    12  yahoNanJing
     6  Daniël Heres
     4  Brent Gardner
     4  dependabot[bot]
     4  r.4ntix
     3  Stefan Stanciulescu
     3  mingmwang
     2  Ken Suenobu
     2  Yang Jiang
     1  Metehan Yıldırım
     1  Trent Feda
     1  askoa
     1  yangzhong
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id="release-highlights"&gt;Release Highlights&lt;a class="headerlink" href="#release-highlights" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The release notes below are not exhaustive and cover only selected highlights of the release. Many other bug fixes
and improvements have been made: we refer you to the &lt;a href="https://github.com/apache/arrow-ballista/blob/0.9.0-rc2/ballista/CHANGELOG.md"&gt;complete changelog&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="support-for-cloud-object-stores-and-distributed-file-systems"&gt;Support for Cloud Object Stores and Distributed File Systems&lt;a class="headerlink" href="#support-for-cloud-object-stores-and-distributed-file-systems" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;This is the first release of Ballista to have documented support for querying data from distributed file systems and
object stores. Currently, S3 and HDFS are supported. Support for Google Cloud Storage and Azure Blob Storage is planned
for the next release.&lt;/p&gt;
&lt;h3 id="flight-sql-jdbc-support"&gt;Flight SQL &amp;amp; JDBC support&lt;a class="headerlink" href="#flight-sql-jdbc-support" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The Ballista scheduler now implements the &lt;a href="https://arrow.apache.org/blog/2022/02/16/introducing-arrow-flight-sql/"&gt;Flight SQL protocol&lt;/a&gt;, enabling any compliant Flight SQL client
to connect to and run queries against a Ballista cluster.&lt;/p&gt;
&lt;p&gt;The Apache Arrow Flight SQL JDBC driver can be used to connect Business Intelligence tools to a Ballista cluster.&lt;/p&gt;
&lt;h3 id="python-bindings"&gt;Python Bindings&lt;a class="headerlink" href="#python-bindings" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;It is now possible to connect to a Ballista cluster from Python and execute queries using both the DataFrame and SQL
interfaces.&lt;/p&gt;
&lt;h3 id="scheduler-web-user-interface-and-rest-api"&gt;Scheduler Web User Interface and REST API&lt;a class="headerlink" href="#scheduler-web-user-interface-and-rest-api" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The scheduler now has a web user interface for monitoring queries. It is also possible to view graphical query plans
that show how the query was executed, along with metrics.&lt;/p&gt;
&lt;p&gt;&lt;img src="/blog/images/2022-10-28-ballista-web-ui.png" width="800"/&gt;&lt;/p&gt;
&lt;p&gt;The REST API that powers the user interface can also be accessed directly.&lt;/p&gt;
&lt;h3 id="simplified-kubernetes-deployment"&gt;Simplified Kubernetes Deployment&lt;a class="headerlink" href="#simplified-kubernetes-deployment" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Ballista now provides a &lt;a href="https://github.com/apache/arrow-ballista/tree/master/helm"&gt;Helm chart&lt;/a&gt; for simplified Kubernetes deployment.&lt;/p&gt;
&lt;h3 id="user-guide"&gt;User Guide&lt;a class="headerlink" href="#user-guide" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The user guide is published at &lt;a href="https://arrow.apache.org/ballista/"&gt;https://arrow.apache.org/ballista/&lt;/a&gt; and provides
deployment instructions for Docker, Docker Compose, and Kubernetes, as well as references for configuring and
tuning Ballista.&lt;/p&gt;
&lt;h2 id="roadmap"&gt;Roadmap&lt;a class="headerlink" href="#roadmap" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The Ballista community is currently focused on the following tasks for the next release:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Support for Azure Blob Storage and Google Cloud Storage&lt;/li&gt;
&lt;li&gt;Improve benchmark performance by implementing more query optimizations&lt;/li&gt;
&lt;li&gt;Improve scheduler web user interface&lt;/li&gt;
&lt;li&gt;Publish Docker images to GitHub Container Registry&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The detailed list of issues planned for the 0.10.0 release can be found in the &lt;a href="https://github.com/apache/arrow-ballista/issues/361"&gt;tracking issue&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="getting-involved"&gt;Getting Involved&lt;a class="headerlink" href="#getting-involved" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Ballista has a friendly community and we welcome contributions. A good place to start is to follow the instructions
in the &lt;a href="https://arrow.apache.org/ballista/"&gt;user guide&lt;/a&gt;, try using Ballista with your own SQL queries and ETL pipelines, and file issues
for any bugs or feature suggestions.&lt;/p&gt;
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;h1 id="introduction"&gt;Introduction&lt;a class="headerlink" href="#introduction" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;&lt;a href="https://arrow.apache.org/datafusion/"&gt;Apache Arrow DataFusion&lt;/a&gt; &lt;a href="https://crates.io/crates/datafusion"&gt;&lt;code&gt;13.0.0&lt;/code&gt;&lt;/a&gt; is released, and this blog contains an update on the project for the 5 months since our &lt;a href="https://arrow.apache.org/blog/2022/05/16/datafusion-8.0.0/"&gt;last update in May 2022&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;DataFusion is an extensible and embeddable query engine, written in Rust used to create modern, fast and efficient data pipelines, ETL …&lt;/p&gt;</summary><content type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;h1 id="introduction"&gt;Introduction&lt;a class="headerlink" href="#introduction" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;&lt;a href="https://arrow.apache.org/datafusion/"&gt;Apache Arrow DataFusion&lt;/a&gt; &lt;a href="https://crates.io/crates/datafusion"&gt;&lt;code&gt;13.0.0&lt;/code&gt;&lt;/a&gt; is released, and this blog contains an update on the project for the 5 months since our &lt;a href="https://arrow.apache.org/blog/2022/05/16/datafusion-8.0.0/"&gt;last update in May 2022&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;DataFusion is an extensible and embeddable query engine, written in Rust, that is used to create modern, fast, and efficient data pipelines, ETL processes, and database systems. You may want to check out DataFusion to extend your Rust project to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Add &lt;a href="https://arrow.apache.org/datafusion/user-guide/sql/sql_status.html"&gt;SQL support&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Add a &lt;a href="https://docs.rs/datafusion/13.0.0/datafusion/dataframe/struct.DataFrame.html"&gt;DataFrame API&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Support a domain-specific query language&lt;/li&gt;
&lt;li&gt;Easily and quickly read and process Parquet, JSON, Avro, or CSV data&lt;/li&gt;
&lt;li&gt;Read from remote object stores such as AWS S3, Azure Blob Storage, and Google Cloud Storage&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Even though DataFusion is 4 years "young," it has seen significant community growth in the last few months and the momentum continues to accelerate.&lt;/p&gt;
&lt;h1 id="background"&gt;Background&lt;a class="headerlink" href="#background" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;DataFusion is used as the engine in &lt;a href="https://github.com/apache/arrow-datafusion#known-uses"&gt;many open source and commercial projects&lt;/a&gt; and was one of the early open source projects to provide this capability. 2022 has validated our belief in the need for such a &lt;a href="https://docs.google.com/presentation/d/1iNX_35sWUakee2q3zMFPyHE4IV2nC3lkCK_H6Y2qK84/edit#slide=id.p"&gt;"LLVM for database and AI systems"&lt;/a&gt; &lt;a href="https://www.slideshare.net/AndrewLamb32/20220623-apache-arrow-and-datafusion-changing-the-game-for-implementing-database-systemspdf"&gt;(alternate link)&lt;/a&gt; with announcements such as the &lt;a href="https://engineering.fb.com/2022/08/31/open-source/velox/"&gt;release of Facebook's Velox&lt;/a&gt; engine, the major investments in &lt;a href="https://arrow.apache.org/docs/cpp/streaming_execution.html"&gt;Acero&lt;/a&gt;, as well as the continued popularity of &lt;a href="https://calcite.apache.org/"&gt;Apache Calcite&lt;/a&gt; and other similar technologies.&lt;/p&gt;
&lt;p&gt;While Velox and Acero focus on execution engines, DataFusion provides the entire suite of components needed to build most analytic systems, including a SQL frontend, a dataframe API, and extension points for just about everything. Some &lt;a href="https://github.com/apache/arrow-datafusion#known-uses"&gt;DataFusion users&lt;/a&gt; use a subset of the features, such as the frontend (e.g. &lt;a href="https://dask-sql.readthedocs.io/en/latest/"&gt;dask-sql&lt;/a&gt;) or the execution engine (e.g. &lt;a href="https://github.com/blaze-init/blaze"&gt;Blaze&lt;/a&gt;), and some use many different components to build both SQL-based and customized DSL-based systems, such as &lt;a href="https://github.com/influxdata/influxdb_iox/pulls"&gt;InfluxDB IOx&lt;/a&gt; and &lt;a href="https://github.com/vegafusion/vegafusion"&gt;VegaFusion&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;One of DataFusion’s advantages is its implementation in &lt;a href="https://www.rust-lang.org/"&gt;Rust&lt;/a&gt; and thus its easy integration with the broader Rust ecosystem. Rust continues to be a major source of benefit, from the &lt;a href="https://www.influxdata.com/blog/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/"&gt;ease of parallelization with the high quality and standardized &lt;code&gt;async&lt;/code&gt; ecosystem&lt;/a&gt; to its modern dependency management system and excellent performance.&lt;/p&gt;
&lt;h1 id="summary"&gt;Summary&lt;a class="headerlink" href="#summary" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;We have increased the frequency of DataFusion releases to monthly instead of quarterly. This
makes it easier for the increasing number of projects that now depend on DataFusion.&lt;/p&gt;
&lt;p&gt;We have also completed the "graduation" of &lt;a href="https://github.com/apache/arrow-ballista"&gt;Ballista to its own top-level arrow-ballista repository&lt;/a&gt;
which decouples the two projects and allows each project to move even faster.&lt;/p&gt;
&lt;p&gt;Along with numerous other bug fixes and smaller improvements, here are some of the major advances:&lt;/p&gt;
&lt;h1 id="improved-support-for-cloud-object-stores"&gt;Improved Support for Cloud Object Stores&lt;a class="headerlink" href="#improved-support-for-cloud-object-stores" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;DataFusion now supports many major cloud object stores (Amazon S3, Azure Blob Storage, and Google Cloud Storage) "out of the box" via the &lt;a href="https://crates.io/crates/object_store"&gt;object_store&lt;/a&gt; crate. Using this integration, DataFusion optimizes reading parquet files by reading only the parts of the files that are needed.&lt;/p&gt;
&lt;h2 id="advanced-sql"&gt;Advanced SQL&lt;a class="headerlink" href="#advanced-sql" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;DataFusion now supports correlated subqueries, by rewriting them as joins. See the &lt;a href="https://arrow.apache.org/datafusion/user-guide/sql/subqueries.html"&gt;Subquery&lt;/a&gt; page in the User Guide for more information.&lt;/p&gt;
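&lt;p&gt;For example, a correlated &lt;code&gt;EXISTS&lt;/code&gt; subquery such as the following (the table and column names are illustrative) can now be planned by rewriting it into an equivalent join on the correlated columns:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;-- find customers that have at least one order
SELECT c.id
FROM customers c
WHERE EXISTS (
  SELECT 1 FROM orders o WHERE o.customer_id = c.id
);

-- is planned as a semi join, roughly equivalent to:
SELECT c.id
FROM customers c
JOIN (SELECT DISTINCT customer_id FROM orders) o
  ON o.customer_id = c.id;
&lt;/code&gt;&lt;/pre&gt;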
&lt;p&gt;In addition to numerous other small improvements, the following SQL features are now supported:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;ROWS&lt;/code&gt;, &lt;code&gt;RANGE&lt;/code&gt;, &lt;code&gt;PRECEDING&lt;/code&gt; and &lt;code&gt;FOLLOWING&lt;/code&gt; in &lt;code&gt;OVER&lt;/code&gt; clauses &lt;a href="https://github.com/apache/arrow-datafusion/issues/3570"&gt;#3570&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ROLLUP&lt;/code&gt; and &lt;code&gt;CUBE&lt;/code&gt; grouping set expressions  &lt;a href="https://github.com/apache/arrow-datafusion/issues/2446"&gt;#2446&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;SUM DISTINCT&lt;/code&gt; aggregate support  &lt;a href="https://github.com/apache/arrow-datafusion/issues/2405"&gt;#2405&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;IN&lt;/code&gt; and &lt;code&gt;NOT IN&lt;/code&gt; Subqueries by rewriting them to &lt;code&gt;SEMI&lt;/code&gt; / &lt;code&gt;ANTI&lt;/code&gt; &lt;a href="https://github.com/apache/arrow-datafusion/issues/2885"&gt;#2421&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Non-equality predicates in the &lt;code&gt;ON&lt;/code&gt; clause of &lt;code&gt;LEFT&lt;/code&gt;, &lt;code&gt;RIGHT&lt;/code&gt;, and &lt;code&gt;FULL&lt;/code&gt; joins &lt;a href="https://github.com/apache/arrow-datafusion/issues/2591"&gt;#2591&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Exact &lt;code&gt;MEDIAN&lt;/code&gt; &lt;a href="https://github.com/apache/arrow-datafusion/issues/3009"&gt;#3009&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;GROUPING SETS&lt;/code&gt;/&lt;code&gt;CUBE&lt;/code&gt;/&lt;code&gt;ROLLUP&lt;/code&gt; &lt;a href="https://github.com/apache/arrow-datafusion/issues/2716"&gt;#2716&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
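&lt;p&gt;As an illustration of the new window frame support (the schema below is hypothetical), a three-row moving average can now be expressed directly in SQL:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT
  day,
  AVG(amount) OVER (
    ORDER BY day
    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
  ) AS moving_avg
FROM sales;
&lt;/code&gt;&lt;/pre&gt;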
&lt;h1 id="more-ddl-support"&gt;More DDL Support&lt;a class="headerlink" href="#more-ddl-support" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;Just as it is important to query, it is also important to give users the ability to define their data sources. We have added:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;CREATE VIEW&lt;/code&gt; &lt;a href="https://github.com/apache/arrow-datafusion/issues/2279"&gt;#2279&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;DESCRIBE &amp;lt;table&amp;gt;&lt;/code&gt; &lt;a href="https://github.com/apache/arrow-datafusion/issues/2642"&gt;#2642&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Custom / Dynamic table provider factories &lt;a href="https://github.com/apache/arrow-datafusion/issues/3311"&gt;#3311&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;SHOW CREATE TABLE&lt;/code&gt; support for views &lt;a href="https://github.com/apache/arrow-datafusion/issues/2830"&gt;#2830&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
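&lt;p&gt;Putting these together, a short &lt;code&gt;datafusion-cli&lt;/code&gt; session might look like the following (the table and file names are hypothetical):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;CREATE EXTERNAL TABLE hits STORED AS PARQUET LOCATION 'hits.parquet';
CREATE VIEW popular AS SELECT url, COUNT(*) AS views FROM hits GROUP BY url;
DESCRIBE popular;
SHOW CREATE TABLE popular;
&lt;/code&gt;&lt;/pre&gt;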
&lt;h1 id="faster-execution"&gt;Faster Execution&lt;a class="headerlink" href="#faster-execution" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;Performance is always an important goal for DataFusion, and this release includes a number of significant new optimizations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Optimizations of TopK (queries with a &lt;code&gt;LIMIT&lt;/code&gt; or &lt;code&gt;OFFSET&lt;/code&gt; clause):  &lt;a href="https://github.com/apache/arrow-datafusion/issues/3527"&gt;#3527&lt;/a&gt;, &lt;a href="https://github.com/apache/arrow-datafusion/issues/2521"&gt;#2521&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Reduce &lt;code&gt;left&lt;/code&gt;/&lt;code&gt;right&lt;/code&gt;/&lt;code&gt;full&lt;/code&gt; joins to &lt;code&gt;inner&lt;/code&gt; join &lt;a href="https://github.com/apache/arrow-datafusion/issues/2750"&gt;#2750&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Convert  cross joins to inner joins when possible &lt;a href="https://github.com/apache/arrow-datafusion/issues/3482"&gt;#3482&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Sort preserving &lt;code&gt;SortMergeJoin&lt;/code&gt; &lt;a href="https://github.com/apache/arrow-datafusion/issues/2699"&gt;#2699&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Improvements in group by and sort performance &lt;a href="https://github.com/apache/arrow-datafusion/issues/2375"&gt;#2375&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Adaptive &lt;code&gt;regex_replace&lt;/code&gt; implementation &lt;a href="https://github.com/apache/arrow-datafusion/issues/3518"&gt;#3518&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h1 id="optimizer-enhancements"&gt;Optimizer Enhancements&lt;a class="headerlink" href="#optimizer-enhancements" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;Internally the optimizer has been significantly enhanced as well.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Casting / coercion now happens during logical planning &lt;a href="https://github.com/apache/arrow-datafusion/issues/3396"&gt;#3185&lt;/a&gt; &lt;a href="https://github.com/apache/arrow-datafusion/issues/3636"&gt;#3636&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;More sophisticated expression analysis and simplification is available&lt;/li&gt;
&lt;/ul&gt;
&lt;h1 id="parquet"&gt;Parquet&lt;a class="headerlink" href="#parquet" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;The parquet reader can now read directly from parquet files on remote object storage &lt;a href="https://github.com/apache/arrow-datafusion/issues/2677"&gt;#2489&lt;/a&gt; &lt;a href="https://github.com/apache/arrow-datafusion/issues/3051"&gt;#3051&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Experimental support for “predicate pushdown” with late materialization after filtering during the scan (another blog post on this topic is coming soon).&lt;/li&gt;
&lt;li&gt;Support reading directly from AWS S3 and other object stores via &lt;code&gt;datafusion-cli&lt;/code&gt; &lt;a href="https://github.com/apache/arrow-datafusion/issues/3631"&gt;#3631&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
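&lt;p&gt;For example, with AWS credentials configured in the environment, &lt;code&gt;datafusion-cli&lt;/code&gt; can register and query a table backed by Parquet files in S3 (the bucket and path below are placeholders):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;CREATE EXTERNAL TABLE events
STORED AS PARQUET
LOCATION 's3://my-bucket/events/';

SELECT COUNT(*) FROM events;
&lt;/code&gt;&lt;/pre&gt;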
&lt;h1 id="datatype-support"&gt;DataType Support&lt;a class="headerlink" href="#datatype-support" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;Support for &lt;code&gt;TimestampTz&lt;/code&gt; &lt;a href="https://github.com/apache/arrow-datafusion/issues/3660"&gt;#3660&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Expanded support for the &lt;code&gt;Decimal&lt;/code&gt; type, including &lt;code&gt;IN&lt;/code&gt; lists and better built-in coercion&lt;/li&gt;
&lt;li&gt;Expanded support for date/time manipulation, such as the &lt;code&gt;date_bin&lt;/code&gt; built-in function, timestamp &lt;code&gt;+/-&lt;/code&gt; interval, and &lt;code&gt;TIME&lt;/code&gt; literal values &lt;a href="https://github.com/apache/arrow-datafusion/issues/3010"&gt;#3010&lt;/a&gt;, &lt;a href="https://github.com/apache/arrow-datafusion/issues/3110"&gt;#3110&lt;/a&gt;, &lt;a href="https://github.com/apache/arrow-datafusion/issues/3034"&gt;#3034&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Binary operations (&lt;code&gt;AND&lt;/code&gt;, &lt;code&gt;XOR&lt;/code&gt;, etc.): &lt;a href="https://github.com/apache/arrow-datafusion/issues/1619"&gt;#3037&lt;/a&gt; &lt;a href="https://github.com/apache/arrow-datafusion/issues/3430"&gt;#3420&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;IS TRUE/FALSE&lt;/code&gt; and &lt;code&gt;IS [NOT] UNKNOWN&lt;/code&gt; &lt;a href="https://github.com/apache/arrow-datafusion/issues/3235"&gt;#3235&lt;/a&gt;, &lt;a href="https://github.com/apache/arrow-datafusion/issues/3246"&gt;#3246&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="upcoming-work"&gt;Upcoming Work&lt;a class="headerlink" href="#upcoming-work" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;With the community growing and development accelerating, there is so much great stuff on the horizon. Some features we expect to land in the next few months:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/arrow-datafusion/issues/3462"&gt;Complete Parquet Pushdown&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/arrow-datafusion/issues/3148"&gt;Additional date/time support&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Cost models, Nested Join Optimizations, analysis framework &lt;a href="https://github.com/apache/arrow-datafusion/issues/128"&gt;#128&lt;/a&gt;, &lt;a href="https://github.com/apache/arrow-datafusion/issues/3843"&gt;#3843&lt;/a&gt;, &lt;a href="https://github.com/apache/arrow-datafusion/issues/3845"&gt;#3845&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h1 id="community-growth"&gt;Community Growth&lt;a class="headerlink" href="#community-growth" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;The DataFusion releases from 9.0.0 through 13.0.0 consist of 433 PRs from 64 distinct contributors. This does not count all the work that goes into our dependencies such as &lt;a href="https://crates.io/crates/arrow"&gt;arrow&lt;/a&gt;, &lt;a href="https://crates.io/crates/parquet"&gt;parquet&lt;/a&gt;, and &lt;a href="https://crates.io/crates/object_store"&gt;object_store&lt;/a&gt;, that much of the same community helps nurture.&lt;/p&gt;
&lt;!--
$ git log --pretty=oneline 9.0.0..13.0.0 . | wc -l
433

$ git shortlog -sn 9.0.0..13.0.0 . | wc -l
65
--&gt;
&lt;h1 id="how-to-get-involved"&gt;How to Get Involved&lt;a class="headerlink" href="#how-to-get-involved" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;Kudos to everyone in the community who contributed ideas, discussions, bug reports, documentation and code. It is exciting to be building something so cool together!&lt;/p&gt;
&lt;p&gt;If you are interested in contributing to DataFusion, we would love to
have you join us on our journey to create the most advanced open
source query engine. You can try out DataFusion on some of your own
data and projects and let us know how it goes or contribute a PR with
documentation, tests or code. A list of open issues suitable for
beginners is
&lt;a href="https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Check out our &lt;a href="https://arrow.apache.org/datafusion/community/communication.html"&gt;Communication Doc&lt;/a&gt; on more
ways to engage with the community.&lt;/p&gt;
&lt;h2 id="appendix-contributor-shoutout"&gt;Appendix: Contributor Shoutout&lt;a class="headerlink" href="#appendix-contributor-shoutout" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;To give a sense of the number of people who contribute to this project regularly, we present for your consideration the following list derived from &lt;code&gt;git shortlog -sn 9.0.0..13.0.0 .&lt;/code&gt; Thank you all again!&lt;/p&gt;
&lt;!-- Note: combined kmitchener and Kirk Mitchener --&gt;
&lt;pre&gt;&lt;code&gt;    87  Andy Grove
    71  Andrew Lamb
    29  Kun Liu
    29  Kirk Mitchener
    17  Wei-Ting Kuo
    14  Yang Jiang
    12  Raphael Taylor-Davies
    11  Batuhan Taskaya
    10  Brent Gardner
    10  Remzi Yang
    10  comphead
    10  xudong.w
     8  AssHero
     7  Ruihang Xia
     6  Dan Harris
     6  Daniël Heres
     6  Ian Alexander Joiner
     6  Mike Roberts
     6  askoa
     4  BaymaxHWY
     4  gorkem
     4  jakevin
     3  George Andronchik
     3  Sarah Yurick
     3  Stuart Carnie
     2  Dalton Modlin
     2  Dmitry Patsura
     2  JasonLi
     2  Jon Mease
     2  Marco Neumann
     2  yahoNanJing
     1  Adilet Sarsembayev
     1  Ayush Dattagupta
     1  Dezhi Wu
     1  Dhamotharan Sritharan
     1  Eduard Karacharov
     1  Francis Du
     1  Harbour Zheng
     1  Ismaël Mejía
     1  Jack Klamer
     1  Jeremy Dyer
     1  Jiayu Liu
     1  Kamil Konior
     1  Liang-Chi Hsieh
     1  Martin Grigorov
     1  Matthijs Brobbel
     1  Mehmet Ozan Kabak
     1  Metehan Yıldırım
     1  Morgan Cassels
     1  Nitish Tiwari
     1  Renjie Liu
     1  Rito Takeuchi
     1  Robert Pack
     1  Thomas Cameron
     1  Vrishabh
     1  Xin Hao
     1  Yijie Shen
     1  byteink
     1  kamille
     1  mateuszkj
     1  nvartolomei
     1  yourenawo
     1  Özgür Akkurt
&lt;/code&gt;&lt;/pre&gt;</content><category term="blog"/></entry><entry><title>Apache Arrow DataFusion 8.0.0 Release</title><link href="https://datafusion.apache.org/blog/2022/05/16/datafusion-8.0.0" rel="alternate"/><published>2022-05-16T00:00:00+00:00</published><updated>2022-05-16T00:00:00+00:00</updated><author><name>pmc</name></author><id>tag:datafusion.apache.org,2022-05-16:/blog/2022/05/16/datafusion-8.0.0</id><summary type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;h1 id="introduction"&gt;Introduction&lt;a class="headerlink" href="#introduction" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;&lt;a href="https://arrow.apache.org/datafusion/"&gt;DataFusion&lt;/a&gt; is an extensible query execution framework, written in Rust, that
uses Apache Arrow as its in-memory format.&lt;/p&gt;
&lt;p&gt;When you want to extend your Rust project with &lt;a href="https://arrow.apache.org/datafusion/user-guide/sql/sql_status.html"&gt;SQL support&lt;/a&gt;,
a DataFrame API, or the ability to read and process Parquet, JSON, Avro or CSV data, DataFusion is definitely worth …&lt;/p&gt;</summary><content type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;h1 id="introduction"&gt;Introduction&lt;a class="headerlink" href="#introduction" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;&lt;a href="https://arrow.apache.org/datafusion/"&gt;DataFusion&lt;/a&gt; is an extensible query execution framework, written in Rust, that
uses Apache Arrow as its in-memory format.&lt;/p&gt;
&lt;p&gt;When you want to extend your Rust project with &lt;a href="https://arrow.apache.org/datafusion/user-guide/sql/sql_status.html"&gt;SQL support&lt;/a&gt;,
a DataFrame API, or the ability to read and process Parquet, JSON, Avro or CSV data, DataFusion is definitely worth
checking out.&lt;/p&gt;
&lt;p&gt;DataFusion's SQL, &lt;code&gt;DataFrame&lt;/code&gt;, and manual &lt;code&gt;PlanBuilder&lt;/code&gt; API let users access a sophisticated query optimizer and
execution engine capable of fast, resource efficient, and parallel execution that takes optimal advantage of
today's multicore hardware. Being written in Rust means DataFusion can offer &lt;em&gt;both&lt;/em&gt; the memory safety of a managed language and
the resource efficiency of a compiled language.&lt;/p&gt;
&lt;p&gt;The Apache Arrow team is pleased to announce the DataFusion 8.0.0 release (and also the release of version 0.7.0 of
the Ballista subproject). This covers 3 months of development work and includes 279 commits from the following 49
distinct contributors.&lt;/p&gt;
&lt;!--
$ git log --pretty=oneline 7.0.0..8.0.0 datafusion datafusion-cli datafusion-examples ballista ballista-cli ballista-examples | wc -l
279

$ git shortlog -sn 7.0.0..8.0.0 datafusion datafusion-cli datafusion-examples ballista ballista-cli ballista-examples | wc -l
49

(feynman han, feynman.h, Feynman Han were assumed to be the same person)
--&gt;
&lt;pre&gt;&lt;code&gt;    39  Andy Grove
    33  Andrew Lamb
    21  DuRipeng
    20  Yijie Shen
    19  Yang Jiang
    17  Raphael Taylor-Davies
    11  Dan Harris
    11  Matthew Turner
    11  yahoNanJing
     9  dependabot[bot]
     8  jakevin
     6  Kun Liu
     5  Jiayu Liu
     4  Daniël Heres
     4  mingmwang
     4  xudong.w
     3  Carol (Nichols || Goulding)
     3  Dmitry Patsura
     3  Eduard Karacharov
     3  Jeremy Dyer
     3  Kaushik
     3  Rich
     3  comphead
     3  gaojun2048
     3  Feynman Han
     2  Jie Han
     2  Jon Mease
     2  Tim Van Wassenhove
     2  Yt
     2  Zhang Li
     2  silence-coding
     1  Alexander Spies
     1  George Andronchik
     1  Guillaume Balaine
     1  Hao Xin
     1  Jiacai Liu
     1  Jörn Horstmann
     1  Liang-Chi Hsieh
     1  Max Burke
     1  NaincyKumariKnoldus
     1  Nga Tran
     1  Patrick More
     1  Pierre Zemb
     1  Remzi Yang
     1  Sergey Melnychuk
     1  Stephen Carman
     1  doki
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The following sections highlight some of the changes in this release. Of course, many other bug fixes and
improvements have been made and we encourage you to check out the
&lt;a href="https://github.com/apache/arrow-datafusion/blob/8.0.0/datafusion/CHANGELOG.md"&gt;changelog&lt;/a&gt; for full details.&lt;/p&gt;
&lt;h1 id="summary"&gt;Summary&lt;a class="headerlink" href="#summary" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h1&gt;
&lt;h2 id="ddl-support"&gt;DDL Support&lt;a class="headerlink" href="#ddl-support" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;DDL support has been expanded to include the following commands for creating databases, schemas, and views. This
allows DataFusion to be used more effectively from the CLI.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;CREATE DATABASE&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;CREATE VIEW&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;CREATE SCHEMA&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;CREATE EXTERNAL TABLE&lt;/code&gt; now supports JSON files, &lt;code&gt;IF NOT EXISTS&lt;/code&gt;, and partition columns&lt;/li&gt;
&lt;/ul&gt;
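&lt;p&gt;For example (the path and column names are illustrative, and the exact syntax is documented in the user guide), an external partitioned JSON table can now be declared idempotently:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;CREATE EXTERNAL TABLE IF NOT EXISTS logs (
  message VARCHAR,
  level VARCHAR
)
STORED AS JSON
PARTITIONED BY (day)
LOCATION '/data/logs/';
&lt;/code&gt;&lt;/pre&gt;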
&lt;h2 id="sql-support"&gt;SQL Support&lt;a class="headerlink" href="#sql-support" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The SQL query planner now supports a number of new SQL features, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Subqueries&lt;/em&gt;: when used via &lt;code&gt;IN&lt;/code&gt;, &lt;code&gt;EXISTS&lt;/code&gt;, and as scalars&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Grouping sets&lt;/em&gt;: &lt;code&gt;CUBE&lt;/code&gt; and &lt;code&gt;ROLLUP&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Aggregate functions&lt;/em&gt;: &lt;code&gt;approx_percentile&lt;/code&gt;, &lt;code&gt;approx_percentile_cont&lt;/code&gt;, &lt;code&gt;approx_percentile_cont_with_weight&lt;/code&gt;, &lt;code&gt;approx_distinct&lt;/code&gt;, &lt;code&gt;approx_median&lt;/code&gt;, and &lt;code&gt;array&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;code&gt;null&lt;/code&gt; literals&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Bitwise operations&lt;/em&gt;: for example &lt;code&gt;|&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
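&lt;p&gt;As a quick illustration of the new approximate aggregates (the table and columns below are hypothetical), percentile and distinct-count queries over large tables become cheap:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT
  approx_percentile_cont(latency_ms, 0.95) AS p95,
  approx_distinct(user_id) AS unique_users
FROM requests;
&lt;/code&gt;&lt;/pre&gt;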
&lt;p&gt;There are also many bug fixes and improvements around normalizing identifiers consistently.&lt;/p&gt;
&lt;p&gt;We continue our tradition of incrementally releasing support for new
features as they are developed. Thus, while the physical plan may not yet
support all new features, it gets more complete each release. These
changes also make DataFusion an increasingly compelling choice for
projects looking for a SQL parser and query planner that can produce
optimized logical plans that can be translated to
their own execution engine.&lt;/p&gt;
&lt;h2 id="query-execution-internals"&gt;Query Execution &amp;amp; Internals&lt;a class="headerlink" href="#query-execution-internals" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;There are several notable improvements and new features in the query execution engine:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;ExecutionContext&lt;/code&gt; has been renamed to &lt;code&gt;SessionContext&lt;/code&gt; and now supports multi-tenancy&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;ExecutionPlan&lt;/code&gt; trait is no longer &lt;code&gt;async&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;A new serialization API for serializing plans to bytes (based on protobuf)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In addition, we have added several foundational features to drive even
more advanced query processing into DataFusion, focusing on running
arbitrary queries larger than available memory, and pushing the
envelope for performance of sorting, grouping, and joining even
further:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Morsel-Driven Scheduler based on &lt;a href="https://15721.courses.cs.cmu.edu/spring2016/papers/p743-leis.pdf"&gt;"Morsel-Driven Parallelism: A NUMA-Aware Query
  Evaluation Framework for the Many-Core Age"&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Consolidated object store implementation and integration with parquet decoding&lt;/li&gt;
&lt;li&gt;Memory Limited Spilling sort operator&lt;/li&gt;
&lt;li&gt;Memory Limited Sort-Merge join operator&lt;/li&gt;
&lt;li&gt;High performance JIT code generation for tuple comparisons&lt;/li&gt;
&lt;li&gt;Memory efficient Row Format&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="improved-file-support"&gt;Improved file support&lt;a class="headerlink" href="#improved-file-support" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;DataFusion now supports JSON, both for reading and writing. There are also new DataFrame methods for writing query
results to files in CSV, Parquet, and JSON format.&lt;/p&gt;
&lt;h2 id="ballista"&gt;Ballista&lt;a class="headerlink" href="#ballista" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Ballista continues to mature and now supports a wider range of operators and expressions. There are also improvements
to the scheduler to support UDFs, and there are some robustness improvements, such as cleaning up work directories
and persisting session configs to allow schedulers to restart and continue processing in-flight jobs.&lt;/p&gt;
&lt;h2 id="upcoming-work"&gt;Upcoming Work&lt;a class="headerlink" href="#upcoming-work" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Here are some of the initiatives that the community plans on working on prior to the next release.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;There is a &lt;a href="https://docs.google.com/document/d/1jNRbadyStSrV5kifwn0khufAwq6OnzGczG4z8oTQJP4/edit?usp=sharing"&gt;proposal to move Ballista to its own top-level arrow-ballista repository&lt;/a&gt;
 to decouple DataFusion and Ballista releases and to allow each project to have documentation better targeted at
  its particular audience.&lt;/li&gt;
&lt;li&gt;We plan to increase the frequency of DataFusion releases, moving from quarterly to monthly releases. This
  is driven by requests from the growing number of projects that now depend on DataFusion.&lt;/li&gt;
&lt;li&gt;There is ongoing work to implement new optimizer rules to rewrite queries containing subquery expressions as
  joins, to support a wider range of queries.&lt;/li&gt;
&lt;li&gt;The new scheduler based on morsel-driven execution will continue to evolve in this next release, with work to
  refine IO abstractions to improve performance and integration with the new scheduler.&lt;/li&gt;
&lt;li&gt;Improved performance for Sort, Grouping and Joins&lt;/li&gt;
&lt;/ul&gt;
&lt;h1 id="how-to-get-involved"&gt;How to Get Involved&lt;a class="headerlink" href="#how-to-get-involved" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;If you are interested in contributing to DataFusion and learning about state-of-the-art query processing, we would
love to have you join us on the journey! You can help by trying out DataFusion on some of your own data and projects
and letting us know how it goes, or by contributing a PR with documentation, tests, or code. A list of open issues suitable
for beginners is available &lt;a href="https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Check out our new &lt;a href="https://arrow.apache.org/datafusion/community/communication.html"&gt;Communication Doc&lt;/a&gt; for more
ways to engage with the community.&lt;/p&gt;
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;
&lt;h1 id="introduction"&gt;Introduction&lt;a class="headerlink" href="#introduction" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;Apache Arrow &lt;a href="https://arrow.apache.org/datafusion/"&gt;DataFusion&lt;/a&gt; is an extensible query execution framework, written in Rust, that uses &lt;a href="https://arrow.apache.org"&gt;Apache Arrow&lt;/a&gt; as its in-memory format.&lt;/p&gt;
&lt;p&gt;When you want to extend your Rust project with &lt;a href="https://arrow.apache.org/datafusion/user-guide/sql/sql_status.html"&gt;SQL support&lt;/a&gt;, a DataFrame API, or the ability to read and process Parquet, JSON, Avro or CSV data, DataFusion is …&lt;/p&gt;</summary><content type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;
&lt;h1 id="introduction"&gt;Introduction&lt;a class="headerlink" href="#introduction" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;Apache Arrow &lt;a href="https://arrow.apache.org/datafusion/"&gt;DataFusion&lt;/a&gt; is an extensible query execution framework, written in Rust, that uses &lt;a href="https://arrow.apache.org"&gt;Apache Arrow&lt;/a&gt; as its in-memory format.&lt;/p&gt;
&lt;p&gt;When you want to extend your Rust project with &lt;a href="https://arrow.apache.org/datafusion/user-guide/sql/sql_status.html"&gt;SQL support&lt;/a&gt;, a DataFrame API, or the ability to read and process Parquet, JSON, Avro or CSV data, DataFusion is definitely worth checking out. DataFusion's pluggable design makes it particularly easy to build extensions at various points.&lt;/p&gt;
&lt;p&gt;DataFusion's SQL, &lt;code&gt;DataFrame&lt;/code&gt;, and manual &lt;code&gt;PlanBuilder&lt;/code&gt; APIs let users access a sophisticated query optimizer and execution engine capable of fast, resource-efficient, parallel execution that takes optimal advantage of today's multicore hardware. Being written in Rust means DataFusion can offer &lt;em&gt;both&lt;/em&gt; the safety of dynamic languages and the resource efficiency of a compiled language.&lt;/p&gt;
&lt;p&gt;The DataFusion team is pleased to announce the creation of the &lt;a href="https://github.com/datafusion-contrib"&gt;DataFusion-Contrib&lt;/a&gt; GitHub organization to support and accelerate other projects.  While the core DataFusion library remains under Apache governance, the contrib organization provides a more flexible testing ground for new DataFusion features and a home for DataFusion extensions.  With this announcement, we are pleased to introduce the following inaugural DataFusion-Contrib repositories.&lt;/p&gt;
&lt;h2 id="datafusion-python"&gt;DataFusion-Python&lt;a class="headerlink" href="#datafusion-python" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;This &lt;a href="https://github.com/datafusion-contrib/datafusion-python"&gt;project&lt;/a&gt; provides Python bindings to the core Rust implementation of DataFusion, which allows users to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Work with familiar SQL or DataFrame APIs to run queries in a safe, multi-threaded environment, returning results in Python&lt;/li&gt;
&lt;li&gt;Create User Defined Functions and User Defined Aggregate Functions for complex operations&lt;/li&gt;
&lt;li&gt;Pay no overhead to copy between Python and underlying Rust execution engine (by way of Apache Arrow arrays)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="upcoming-enhancements"&gt;Upcoming enhancements&lt;a class="headerlink" href="#upcoming-enhancements" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The team is focusing on exposing more features from the underlying Rust implementation of DataFusion and improving documentation.&lt;/p&gt;
&lt;h3 id="how-to-install"&gt;How to install&lt;a class="headerlink" href="#how-to-install" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;From &lt;code&gt;pip&lt;/code&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-bash"&gt;pip install datafusion
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-bash"&gt;python -m pip install datafusion
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id="datafusion-objectstore-s3"&gt;DataFusion-ObjectStore-S3&lt;a class="headerlink" href="#datafusion-objectstore-s3" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;This &lt;a href="https://github.com/datafusion-contrib/datafusion-objectstore-s3"&gt;crate&lt;/a&gt; provides an &lt;code&gt;ObjectStore&lt;/code&gt; implementation for querying data stored in S3 or S3-compatible storage. This makes it almost as easy to query data that lives on S3 as data that lives in local files.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Ability to create &lt;code&gt;S3FileSystem&lt;/code&gt; to register as part of DataFusion &lt;code&gt;ExecutionContext&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Register files or directories stored on S3 with &lt;code&gt;ctx.register_listing_table&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="upcoming-enhancements_1"&gt;Upcoming enhancements&lt;a class="headerlink" href="#upcoming-enhancements_1" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The current priority is adding Python bindings for &lt;code&gt;S3FileSystem&lt;/code&gt;. After that, there will be async improvements as DataFusion adopts more async functionality, and we are also investigating S3 Select support.&lt;/p&gt;
&lt;h3 id="how-to-install_1"&gt;How to Install&lt;a class="headerlink" href="#how-to-install_1" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Add the following dependency to the &lt;code&gt;Cargo.toml&lt;/code&gt; of your Rust project that uses DataFusion:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-toml"&gt;datafusion-objectstore-s3 = "0.1.0"
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id="datafusion-substrait"&gt;DataFusion-Substrait&lt;a class="headerlink" href="#datafusion-substrait" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://substrait.io/"&gt;Substrait&lt;/a&gt; is an emerging standard that provides a cross-language serialization format for relational algebra (e.g. expressions and query plans).&lt;/p&gt;
&lt;p&gt;This &lt;a href="https://github.com/datafusion-contrib/datafusion-substrait"&gt;crate&lt;/a&gt; provides a Substrait producer and consumer for DataFusion.  A producer converts a DataFusion logical plan into a Substrait protobuf and a consumer does the reverse.&lt;/p&gt;
&lt;p&gt;Examples of how to use this crate can be found &lt;a href="https://github.com/datafusion-contrib/datafusion-substrait/blob/main/src/lib.rs"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="potential-use-cases"&gt;Potential Use Cases&lt;a class="headerlink" href="#potential-use-cases" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Replace custom DataFusion protobuf serialization.&lt;/li&gt;
&lt;li&gt;Make it easier to pass query plans over FFI boundaries, such as from Python to Rust&lt;/li&gt;
&lt;li&gt;Allow Apache Calcite query plans to be executed in DataFusion&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="datafusion-bigtable"&gt;DataFusion-BigTable&lt;a class="headerlink" href="#datafusion-bigtable" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;This &lt;a href="https://github.com/datafusion-contrib/datafusion-bigtable"&gt;crate&lt;/a&gt; implements &lt;a href="https://cloud.google.com/bigtable"&gt;Bigtable&lt;/a&gt; as a data source and physical executor for DataFusion queries. It currently supports both UTF-8 strings and 64-bit big-endian signed integers in Bigtable. From a SQL perspective it supports both simple and composite row keys with the &lt;code&gt;=&lt;/code&gt;, &lt;code&gt;IN&lt;/code&gt;, and &lt;code&gt;BETWEEN&lt;/code&gt; operators, as well as projection pushdown. The physical execution for queries is handled by this crate, while any subsequent aggregations, group-bys, or joins are handled in DataFusion.&lt;/p&gt;
&lt;h3 id="upcoming-enhancements_2"&gt;Upcoming Enhancements&lt;a class="headerlink" href="#upcoming-enhancements_2" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Predicate pushdown&lt;/li&gt;
&lt;li&gt;Value range&lt;/li&gt;
&lt;li&gt;Value Regex&lt;/li&gt;
&lt;li&gt;Timestamp range&lt;/li&gt;
&lt;li&gt;Multithreaded&lt;/li&gt;
&lt;li&gt;Partition aware execution&lt;/li&gt;
&lt;li&gt;Production ready&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="how-to-install_2"&gt;How to Install&lt;a class="headerlink" href="#how-to-install_2" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Add the following dependency to the &lt;code&gt;Cargo.toml&lt;/code&gt; of your Rust project that uses DataFusion:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-toml"&gt;datafusion-bigtable = "0.1.0"
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id="datafusion-hdfs"&gt;DataFusion-HDFS&lt;a class="headerlink" href="#datafusion-hdfs" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;This &lt;a href="https://github.com/datafusion-contrib/datafusion-objectstore-hdfs"&gt;crate&lt;/a&gt; introduces &lt;code&gt;HadoopFileSystem&lt;/code&gt; as a remote &lt;code&gt;ObjectStore&lt;/code&gt; which provides the ability to query HDFS files.  For HDFS access the &lt;a href="https://github.com/yahoNanJing/fs-hdfs"&gt;fs-hdfs&lt;/a&gt; library is used.&lt;/p&gt;
&lt;h2 id="datafusion-tokomak"&gt;DataFusion-Tokomak&lt;a class="headerlink" href="#datafusion-tokomak" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;This &lt;a href="https://github.com/datafusion-contrib/datafusion-tokomak"&gt;crate&lt;/a&gt; provides an e-graph based DataFusion optimization framework based on the Rust &lt;a href="https://egraphs-good.github.io"&gt;egg&lt;/a&gt; library.  An e-graph is a data structure that powers the equality saturation optimization technique.&lt;/p&gt;
&lt;p&gt;As context, the optimizer framework within DataFusion is currently &lt;a href="https://github.com/apache/arrow-datafusion/issues/1972"&gt;under review&lt;/a&gt; with the objective of implementing a more strategic long term solution that is more efficient and simpler to develop.&lt;/p&gt;
&lt;p&gt;Some of the benefits of using &lt;code&gt;egg&lt;/code&gt; within DataFusion are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Implements optimized algorithms that are hard to match with manually written optimization passes&lt;/li&gt;
&lt;li&gt;Makes it easy and less verbose to add optimization rules&lt;/li&gt;
&lt;li&gt;Plugin framework to add more complex optimizations&lt;/li&gt;
&lt;li&gt;egg does not depend on rule ordering and can reach a higher level of optimization by applying multiple rules at the same time until the rewrite converges&lt;/li&gt;
&lt;li&gt;Allows for cost-based optimizations&lt;/li&gt;
&lt;/ul&gt;
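&lt;p&gt;The fixpoint idea underlying equality saturation can be sketched in a few lines (a toy model: a real e-graph, as in &lt;code&gt;egg&lt;/code&gt;, keeps all equivalent forms of a term and extracts the best one, rather than rewriting a single expression destructively):&lt;/p&gt;

```python
# Toy model of "apply rewrite rules until nothing changes"; a real e-graph
# shares equivalent subterms instead of rewriting one expression in place.
def rewrite(expr):
    """Apply simple algebraic rules bottom-up: x * 1 -> x, x + 0 -> x."""
    if not isinstance(expr, tuple):
        return expr
    op, a, b = expr
    a, b = rewrite(a), rewrite(b)
    if op == "mul" and b == 1:
        return a
    if op == "add" and b == 0:
        return a
    return (op, a, b)

def saturate(expr):
    """Keep applying rules until no rule fires (a fixpoint)."""
    while True:
        new = rewrite(expr)
        if new == expr:
            return new
        expr = new

simplified = saturate(("mul", ("add", "y", 0), 1))  # (y + 0) * 1
```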
&lt;p&gt;This is an exciting new area for DataFusion with lots of opportunity for community involvement!&lt;/p&gt;
&lt;h2 id="datafusion-tui"&gt;DataFusion-Tui&lt;a class="headerlink" href="#datafusion-tui" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://github.com/datafusion-contrib/datafusion-tui"&gt;DataFusion-tui&lt;/a&gt;, aka &lt;code&gt;dft&lt;/code&gt;, provides a feature-rich terminal application for using DataFusion. It has drawn inspiration and several features from &lt;code&gt;datafusion-cli&lt;/code&gt;. In contrast to &lt;code&gt;datafusion-cli&lt;/code&gt;, the objective of this tool is to provide a lightweight SQL IDE experience for querying data with DataFusion. The following features are currently implemented:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Tab Management to provide clean and structured organization of DataFusion queries, results, &lt;code&gt;ExecutionContext&lt;/code&gt; information, and logs&lt;/li&gt;
&lt;li&gt;SQL Editor&lt;ul&gt;
&lt;li&gt;Text editor for writing SQL queries&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Query History&lt;ul&gt;
&lt;li&gt;History of executed queries, their execution time, and the number of returned rows&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ExecutionContext&lt;/code&gt; information&lt;ul&gt;
&lt;li&gt;Expose information on which physical optimizers are used and which &lt;code&gt;ExecutionConfig&lt;/code&gt; settings are set&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Logs&lt;ul&gt;
&lt;li&gt;Logs from &lt;code&gt;dft&lt;/code&gt;, DataFusion, and any dependent libraries&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Support for custom &lt;code&gt;ObjectStore&lt;/code&gt;s&lt;ul&gt;
&lt;li&gt;S3&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Preload DDL from &lt;code&gt;~/.datafusionrc&lt;/code&gt; to enable having local "database" available at startup&lt;/li&gt;
&lt;/ul&gt;
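&lt;p&gt;For example, a &lt;code&gt;~/.datafusionrc&lt;/code&gt; file might contain DDL such as the following (the table name and file path are hypothetical):&lt;/p&gt;

```sql
-- Hypothetical ~/.datafusionrc contents: DDL that dft executes at startup
-- so the table is available immediately in the SQL editor.
CREATE EXTERNAL TABLE taxi
STORED AS PARQUET
LOCATION 'data/taxi.parquet';
```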
&lt;h3 id="upcoming-enhancements_3"&gt;Upcoming Enhancements&lt;a class="headerlink" href="#upcoming-enhancements_3" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;SQL Editor&lt;/li&gt;
&lt;li&gt;Command to write query results to file&lt;/li&gt;
&lt;li&gt;Multiple SQL editor tabs&lt;/li&gt;
&lt;li&gt;Expose more information from &lt;code&gt;ExecutionContext&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;A help tab that provides information on functions&lt;/li&gt;
&lt;li&gt;Query custom &lt;code&gt;TableProvider&lt;/code&gt;s such as &lt;a href="https://github.com/delta-io/delta-rs"&gt;DeltaTable&lt;/a&gt; or &lt;a href="https://github.com/datafusion-contrib/datafusion-bigtable"&gt;BigTable&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="datafusion-streams"&gt;DataFusion-Streams&lt;a class="headerlink" href="#datafusion-streams" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://github.com/datafusion-contrib/datafusion-streams"&gt;DataFusion-Streams&lt;/a&gt; is a new testing ground for creating a &lt;code&gt;StreamProvider&lt;/code&gt; in DataFusion that will enable querying streaming data sources such as Apache Kafka. The implementation for this feature is currently being designed and is under active review. Once the design is finalized, the trait and attendant data structures will be added to the core DataFusion crate.&lt;/p&gt;
&lt;h2 id="datafusion-java"&gt;DataFusion-Java&lt;a class="headerlink" href="#datafusion-java" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;This &lt;a href="https://github.com/datafusion-contrib/datafusion-java"&gt;project&lt;/a&gt; created an initial set of Java bindings to DataFusion.  The project is currently in maintenance mode and is looking for maintainers to drive future development.&lt;/p&gt;
&lt;h1 id="how-to-get-involved"&gt;How to Get Involved&lt;a class="headerlink" href="#how-to-get-involved" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;If you are interested in contributing to DataFusion and learning about state-of-the-art query processing, we would love to have you join us on the journey! You can help by trying out DataFusion on some of your own data and projects and letting us know how it goes, or by contributing a PR with documentation, tests, or code. A list of open issues suitable for beginners is available &lt;a href="https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The best way to find out about creating new extensions within DataFusion-Contrib is to reach out on the &lt;code&gt;#arrow-rust&lt;/code&gt; channel of the Apache Software Foundation &lt;a href="https://join.slack.com/t/the-asf/shared_invite/zt-vlfbf7ch-HkbNHiU_uDlcH_RvaHv9gQ"&gt;Slack&lt;/a&gt; workspace.&lt;/p&gt;
&lt;p&gt;You can also check out our new &lt;a href="https://arrow.apache.org/datafusion/community/communication.html"&gt;Communication Doc&lt;/a&gt; for more ways to engage with the community.&lt;/p&gt;
&lt;p&gt;Links for each DataFusion-Contrib repository are provided above if you would like to contribute to those.&lt;/p&gt;</content><category term="blog"/></entry><entry><title>Apache Arrow DataFusion 7.0.0 Release</title><link href="https://datafusion.apache.org/blog/2022/02/28/datafusion-7.0.0" rel="alternate"/><published>2022-02-28T00:00:00+00:00</published><updated>2022-02-28T00:00:00+00:00</updated><author><name>pmc</name></author><id>tag:datafusion.apache.org,2022-02-28:/blog/2022/02/28/datafusion-7.0.0</id><summary type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;
&lt;h1 id="introduction"&gt;Introduction&lt;a class="headerlink" href="#introduction" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;&lt;a href="https://arrow.apache.org/datafusion/"&gt;DataFusion&lt;/a&gt; is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format.&lt;/p&gt;
&lt;p&gt;When you want to extend your Rust project with &lt;a href="https://arrow.apache.org/datafusion/user-guide/sql/sql_status.html"&gt;SQL support&lt;/a&gt;, a DataFrame API, or the ability to read and process Parquet, JSON, Avro or CSV data, DataFusion is definitely worth …&lt;/p&gt;</summary><content type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;
&lt;h1 id="introduction"&gt;Introduction&lt;a class="headerlink" href="#introduction" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;&lt;a href="https://arrow.apache.org/datafusion/"&gt;DataFusion&lt;/a&gt; is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format.&lt;/p&gt;
&lt;p&gt;When you want to extend your Rust project with &lt;a href="https://arrow.apache.org/datafusion/user-guide/sql/sql_status.html"&gt;SQL support&lt;/a&gt;, a DataFrame API, or the ability to read and process Parquet, JSON, Avro or CSV data, DataFusion is definitely worth checking out.&lt;/p&gt;
&lt;p&gt;DataFusion's SQL, &lt;code&gt;DataFrame&lt;/code&gt;, and manual &lt;code&gt;PlanBuilder&lt;/code&gt; APIs let users access a sophisticated query optimizer and execution engine capable of fast, resource-efficient, parallel execution that takes optimal advantage of today's multicore hardware. Being written in Rust means DataFusion can offer &lt;em&gt;both&lt;/em&gt; the safety of dynamic languages and the resource efficiency of a compiled language.&lt;/p&gt;
&lt;p&gt;The Apache Arrow team is pleased to announce the DataFusion 7.0.0 release. This covers 4 months of development work
and includes 195 commits from the following 37 distinct contributors.&lt;/p&gt;
&lt;!--
git log --pretty=oneline 5.0.0..6.0.0 datafusion datafusion-cli datafusion-examples | wc -l
     134

git shortlog -sn 5.0.0..6.0.0 datafusion datafusion-cli datafusion-examples | wc -l
      29

      Carlos and xudong963 are same individual
--&gt;
&lt;pre&gt;&lt;code&gt;    44  Andrew Lamb
    24  Kun Liu
    23  Jiayu Liu
    17  xudong.w
    11  Yijie Shen
     9  Matthew Turner
     7  Liang-Chi Hsieh
     5  Lin Ma
     4  Stephen Carman
     4  James Katz
     4  Dmitry Patsura
     4  QP Hou
     3  dependabot[bot]
     3  Remzi Yang
     3  Yang
     3  ic4y
     3  Dani&amp;euml;l Heres
     2  Andy Grove
     2  Raphael Taylor-Davies
     2  Jason Tianyi Wang
     2  Dan Harris
     2  Sergey Melnychuk
     1  Nitish Tiwari
     1  Dom
     1  Eduard Karacharov
     1  Javier Goday
     1  Boaz
     1  Marko Mikulicic
     1  Max Burke
     1  Carol (Nichols || Goulding)
     1  Phillip Cloud
     1  Rich
     1  Toby Hede
     1  Will Jones
     1  r.4ntix
     1  rdettai
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The following section highlights some of the improvements in this release. Of course, many other bug fixes and improvements have also been made and we refer you to the complete &lt;a href="https://github.com/apache/arrow-datafusion/blob/7.0.0/datafusion/CHANGELOG.md"&gt;changelog&lt;/a&gt; for the full detail.&lt;/p&gt;
&lt;h1 id="summary"&gt;Summary&lt;a class="headerlink" href="#summary" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;DataFusion Crate&lt;/li&gt;
&lt;li&gt;The DataFusion crate is being split into multiple crates to decrease compilation times and improve the development experience. Initially, &lt;code&gt;datafusion-common&lt;/code&gt; (the core DataFusion components) and &lt;code&gt;datafusion-expr&lt;/code&gt; (DataFusion expressions, functions, and operators) have been split out. There will be additional splits after the 7.0 release.&lt;/li&gt;
&lt;li&gt;Performance Improvements and Optimizations&lt;/li&gt;
&lt;li&gt;Arrow&amp;rsquo;s dyn scalar kernels are now used to enable efficient operations on &lt;code&gt;DictionaryArray&lt;/code&gt;s &lt;a href="https://github.com/apache/arrow-datafusion/pull/1685"&gt;#1685&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Switch from &lt;code&gt;std::sync::Mutex&lt;/code&gt; to &lt;code&gt;parking_lot::Mutex&lt;/code&gt; &lt;a href="https://github.com/apache/arrow-datafusion/pull/1720"&gt;#1720&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;New Features&lt;/li&gt;
&lt;li&gt;Support for memory tracking and spilling to disk&lt;ul&gt;
&lt;li&gt;MemoryManager and DiskManager &lt;a href="https://github.com/apache/arrow-datafusion/pull/1526"&gt;#1526&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Out of core sort &lt;a href="https://github.com/apache/arrow-datafusion/pull/1526"&gt;#1526&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;New metrics&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Gauge&lt;/code&gt; and &lt;code&gt;CurrentMemoryUsage&lt;/code&gt; &lt;a href="https://github.com/apache/arrow-datafusion/pull/1682"&gt;#1682&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Spill_count&lt;/code&gt; and &lt;code&gt;spilled_bytes&lt;/code&gt; &lt;a href="https://github.com/apache/arrow-datafusion/pull/1641"&gt;#1641&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;New math functions&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Approx_quantile&lt;/code&gt; &lt;a href="https://github.com/apache/arrow-datafusion/pull/1539"&gt;#1529&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;stddev&lt;/code&gt; and &lt;code&gt;variance&lt;/code&gt; (sample and population) &lt;a href="https://github.com/apache/arrow-datafusion/pull/1525"&gt;#1525&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;corr&lt;/code&gt; &lt;a href="https://github.com/apache/arrow-datafusion/pull/1561"&gt;#1561&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Support decimal type &lt;a href="https://github.com/apache/arrow-datafusion/pull/1394"&gt;#1394&lt;/a&gt;&lt;a href="https://github.com/apache/arrow-datafusion/pull/1407"&gt;#1407&lt;/a&gt;&lt;a href="https://github.com/apache/arrow-datafusion/pull/1408"&gt;#1408&lt;/a&gt;&lt;a href="https://github.com/apache/arrow-datafusion/pull/1431"&gt;#1431&lt;/a&gt;&lt;a href="https://github.com/apache/arrow-datafusion/pull/1483"&gt;#1483&lt;/a&gt;&lt;a href="https://github.com/apache/arrow-datafusion/pull/1554"&gt;#1554&lt;/a&gt;&lt;a href="https://github.com/apache/arrow-datafusion/pull/1640"&gt;#1640&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Support for reading Parquet files with evolved schemas &lt;a href="https://github.com/apache/arrow-datafusion/pull/1622"&gt;#1622&lt;/a&gt;&lt;a href="https://github.com/apache/arrow-datafusion/pull/1709"&gt;#1709&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Support for registering &lt;code&gt;DataFrame&lt;/code&gt; as table &lt;a href="https://github.com/apache/arrow-datafusion/pull/1699"&gt;#1699&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Support for the &lt;code&gt;substring&lt;/code&gt; function &lt;a href="https://github.com/apache/arrow-datafusion/pull/1621"&gt;#1621&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Support &lt;code&gt;array_agg(distinct ...)&lt;/code&gt; &lt;a href="https://github.com/apache/arrow-datafusion/pull/1579"&gt;#1579&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Support &lt;code&gt;sort&lt;/code&gt; on unprojected columns &lt;a href="https://github.com/apache/arrow-datafusion/pull/1415"&gt;#1415&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Additional Integration Points&lt;/li&gt;
&lt;li&gt;A new public Expression simplification API &lt;a href="https://github.com/apache/arrow-datafusion/pull/1717"&gt;#1717&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/datafusion-contrib"&gt;DataFusion-Contrib&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;A new GitHub organization created as a home for both &lt;code&gt;DataFusion&lt;/code&gt; extensions and as a testing ground for new features.&lt;ul&gt;
&lt;li&gt;Extensions&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/datafusion-contrib/datafusion-python"&gt;DataFusion-Python&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/datafusion-contrib/datafusion-java"&gt;DataFusion-Java&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/datafusion-contrib/datafusion-hdfs-native"&gt;DataFusion-hdfs-native&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/datafusion-contrib/datafusion-objectstore-s3"&gt;DataFusion-ObjectStore-s3&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;New Features&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/datafusion-contrib/datafusion-streams"&gt;DataFusion-Streams&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/jorgecarleitao/arrow2"&gt;Arrow2&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;An &lt;a href="https://github.com/apache/arrow-datafusion/tree/arrow2"&gt;Arrow2 Branch&lt;/a&gt; has been created.  There are ongoing discussions in &lt;a href="https://github.com/apache/arrow-datafusion/issues/1532"&gt;DataFusion&lt;/a&gt; and &lt;a href="https://github.com/apache/arrow-rs/issues/1176"&gt;arrow-rs&lt;/a&gt; about migrating &lt;code&gt;DataFusion&lt;/code&gt; to &lt;code&gt;Arrow2&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h1 id="documentation-and-roadmap"&gt;Documentation and Roadmap&lt;a class="headerlink" href="#documentation-and-roadmap" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;We are working to consolidate the documentation into the &lt;a href="https://arrow.apache.org/datafusion"&gt;official site&lt;/a&gt;. You can find more details there on topics such as the &lt;a href="https://arrow.apache.org/datafusion/user-guide/sql/index.html"&gt;SQL status&lt;/a&gt; and the &lt;a href="https://arrow.apache.org/datafusion/user-guide/introduction.html#introduction"&gt;user guide&lt;/a&gt;. This is also an area where we would love help from the broader community &lt;a href="https://github.com/apache/arrow-datafusion/issues/1821"&gt;#1821&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To provide transparency on DataFusion&amp;rsquo;s priorities to users and developers, a three-month roadmap will be published at the beginning of each quarter. It can be found &lt;a href="https://arrow.apache.org/datafusion/specification/roadmap.html"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h1 id="upcoming-attractions"&gt;Upcoming Attractions&lt;a class="headerlink" href="#upcoming-attractions" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;Ballista is gaining momentum, and several groups are now evaluating and contributing to the project.&lt;/li&gt;
&lt;li&gt;Some of the proposed improvements&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/arrow-datafusion/issues/1701"&gt;Improvements Overview&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/arrow-datafusion/issues/1675"&gt;Extensibility&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/arrow-datafusion/issues/1702"&gt;File system access&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/arrow-datafusion/issues/1704"&gt;Cluster state&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Continued improvements for working with limited resources and large datasets&lt;/li&gt;
&lt;li&gt;Memory-limited joins &lt;a href="https://github.com/apache/arrow-datafusion/issues/1599"&gt;#1599&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Sort-merge join &lt;a href="https://github.com/apache/arrow-datafusion/issues/141"&gt;#141&lt;/a&gt; &lt;a href="https://github.com/apache/arrow-datafusion/pull/1776"&gt;#1776&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Introduce row based bytes representation &lt;a href="https://github.com/apache/arrow-datafusion/pull/1708"&gt;#1708&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
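&lt;p&gt;For readers unfamiliar with the sort-merge join mentioned above, the core idea can be sketched in a few lines of Python. This is a hypothetical, simplified illustration of the algorithm, not DataFusion&amp;rsquo;s implementation (which operates on Arrow record batches with memory accounting).&lt;/p&gt;

```python
def sort_merge_join(left, right):
    """Inner-join two lists of (key, value) pairs: sort both inputs on the
    key, then advance two cursors in lockstep, emitting the cross product
    of each run of equal keys."""
    left = sorted(left, key=lambda kv: kv[0])
    right = sorted(right, key=lambda kv: kv[0])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # find the end of the run of equal keys on each side
            i2 = i
            while i2 < len(left) and left[i2][0] == lk:
                i2 += 1
            j2 = j
            while j2 < len(right) and right[j2][0] == rk:
                j2 += 1
            for a in range(i, i2):
                for b in range(j, j2):
                    out.append((lk, left[a][1], right[b][1]))
            i, j = i2, j2
    return out
```

&lt;p&gt;One reason this algorithm is attractive for limited-memory settings is that sorted runs can be produced and spilled independently, and the merge phase only needs small buffers for the current runs of equal keys.&lt;/p&gt;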
&lt;h1 id="how-to-get-involved"&gt;How to Get Involved&lt;a class="headerlink" href="#how-to-get-involved" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;If you are interested in contributing to DataFusion, and learning about state-of-the-art
query processing, we would love to have you join us on the journey! You
can help by trying out DataFusion on some of your own data and projects and letting us know how it goes, or by contributing a PR with documentation, tests or code. A list of open issues suitable for beginners is &lt;a href="https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Check out our new &lt;a href="https://arrow.apache.org/datafusion/community/communication.html"&gt;Communication Doc&lt;/a&gt; for more
ways to engage with the community.&lt;/p&gt;
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;
&lt;h1 id="introduction"&gt;Introduction&lt;a class="headerlink" href="#introduction" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;&lt;a href="https://arrow.apache.org/datafusion/"&gt;DataFusion&lt;/a&gt; is an embedded
query engine which leverages the unique features of
&lt;a href="https://www.rust-lang.org/"&gt;Rust&lt;/a&gt; and &lt;a href="https://arrow.apache.org/"&gt;Apache
Arrow&lt;/a&gt; to provide a system that is high
performance, easy to connect, easy to embed, and high quality.&lt;/p&gt;
&lt;p&gt;The Apache Arrow team is pleased to announce the DataFusion 6.0.0 release. This covers …&lt;/p&gt;</summary><content type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;
&lt;h1 id="introduction"&gt;Introduction&lt;a class="headerlink" href="#introduction" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;&lt;a href="https://arrow.apache.org/datafusion/"&gt;DataFusion&lt;/a&gt; is an embedded
query engine which leverages the unique features of
&lt;a href="https://www.rust-lang.org/"&gt;Rust&lt;/a&gt; and &lt;a href="https://arrow.apache.org/"&gt;Apache
Arrow&lt;/a&gt; to provide a system that is high
performance, easy to connect, easy to embed, and high quality.&lt;/p&gt;
&lt;p&gt;The Apache Arrow team is pleased to announce the DataFusion 6.0.0 release. This covers 4 months of development work
and includes 134 commits from the following 28 distinct contributors.&lt;/p&gt;
&lt;!--
git log --pretty=oneline 5.0.0..6.0.0 datafusion datafusion-cli datafusion-examples | wc -l
     134

git shortlog -sn 5.0.0..6.0.0 datafusion datafusion-cli datafusion-examples | wc -l
      29

      Carlos and xudong963 are same individual
--&gt;
&lt;pre&gt;&lt;code&gt;    28  Andrew Lamb
    26  Jiayu Liu
    13  xudong963
     9  rdettai
     9  QP Hou
     6  Matthew Turner
     5  Dani&amp;euml;l Heres
     4  Guillaume Balaine
     3  Francis Du
     3  Marco Neumann
     3  Jon Mease
     3  Nga Tran
     2  Yijie Shen
     2  Ruihang Xia
     2  Liang-Chi Hsieh
     2  baishen
     2  Andy Grove
     2  Jason Tianyi Wang
     1  Nan Zhu
     1  Antoine Wendlinger
     1  Kriszti&amp;aacute;n Sz&amp;udblac;cs
     1  Mike Seddon
     1  Conner Murphy
     1  Patrick More
     1  Taehoon Moon
     1  Tiphaine Ruy
     1  adsharma
     1  lichuan6
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The release notes below are not exhaustive and only expose selected highlights of the release. Many other bug fixes
and improvements have been made: we refer you to the complete
&lt;a href="https://github.com/apache/arrow-datafusion/blob/6.0.0/datafusion/CHANGELOG.md"&gt;changelog&lt;/a&gt;.&lt;/p&gt;
&lt;h1 id="new-website"&gt;New Website&lt;a class="headerlink" href="#new-website" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;Befitting a growing project, DataFusion now has its
&lt;a href="https://arrow.apache.org/datafusion/"&gt;own website&lt;/a&gt; hosted as part of the
main &lt;a href="https://arrow.apache.org"&gt;Apache Arrow Website&lt;/a&gt;.&lt;/p&gt;
&lt;h1 id="roadmap"&gt;Roadmap&lt;a class="headerlink" href="#roadmap" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;For the first time, the community gathered its thoughts on where we are
taking DataFusion into a public
&lt;a href="https://arrow.apache.org/datafusion/specification/roadmap.html"&gt;Roadmap&lt;/a&gt;.&lt;/p&gt;
&lt;h1 id="new-features"&gt;New Features&lt;a class="headerlink" href="#new-features" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;Runtime operator metrics collection framework&lt;/li&gt;
&lt;li&gt;Object store abstraction for unified access to local or remote storage&lt;/li&gt;
&lt;li&gt;Hive style table partitioning support, for Parquet, CSV, Avro and JSON files&lt;/li&gt;
&lt;li&gt;DataFrame API support for: &lt;code&gt;except&lt;/code&gt;, &lt;code&gt;intersect&lt;/code&gt;, &lt;code&gt;show&lt;/code&gt;, &lt;code&gt;limit&lt;/code&gt; and window functions&lt;/li&gt;
&lt;li&gt;SQL&lt;ul&gt;
&lt;li&gt;&lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; with runtime metrics&lt;/li&gt;
&lt;li&gt;&lt;code&gt;trim ( [ LEADING | TRAILING | BOTH ] [ FROM ] string text [, characters text ] )&lt;/code&gt; syntax&lt;/li&gt;
&lt;li&gt;Postgres style regular expression matching operators &lt;code&gt;~&lt;/code&gt;, &lt;code&gt;~*&lt;/code&gt;, &lt;code&gt;!~&lt;/code&gt;, and &lt;code&gt;!~*&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;SQL set operators &lt;code&gt;UNION&lt;/code&gt;, &lt;code&gt;INTERSECT&lt;/code&gt;, and &lt;code&gt;EXCEPT&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;cume_dist&lt;/code&gt;, &lt;code&gt;percent_rank&lt;/code&gt; window functions&lt;/li&gt;
&lt;li&gt;&lt;code&gt;digest&lt;/code&gt;, &lt;code&gt;blake2s&lt;/code&gt;, &lt;code&gt;blake2b&lt;/code&gt;, &lt;code&gt;blake3&lt;/code&gt; crypto functions&lt;/li&gt;
&lt;li&gt;HyperLogLog based &lt;code&gt;approx_distinct&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;is distinct from&lt;/code&gt; and &lt;code&gt;is not distinct from&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;CREATE TABLE AS SELECT&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Accessing elements of nested &lt;code&gt;Struct&lt;/code&gt; and &lt;code&gt;List&lt;/code&gt; columns (e.g. &lt;code&gt;SELECT struct_column['field_name'], array_column[0] FROM ...&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Boolean expressions in &lt;code&gt;CASE&lt;/code&gt; statement&lt;/li&gt;
&lt;li&gt;&lt;code&gt;DROP TABLE&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;VALUES&lt;/code&gt; list&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Support for Avro format&lt;/li&gt;
&lt;li&gt;Support for &lt;code&gt;ScalarValue::Struct&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Automatic schema inference for CSV files&lt;/li&gt;
&lt;li&gt;Better interactive editing support in &lt;code&gt;datafusion-cli&lt;/code&gt; as well as &lt;code&gt;psql&lt;/code&gt; style commands such as &lt;code&gt;\d&lt;/code&gt;, &lt;code&gt;\?&lt;/code&gt;, and &lt;code&gt;\q&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Generic constant evaluation and simplification framework&lt;/li&gt;
&lt;li&gt;Added a common subexpression elimination query plan optimization rule&lt;/li&gt;
&lt;li&gt;Python binding 0.4.0 with all DataFusion 6.0.0 features&lt;/li&gt;
&lt;/ul&gt;
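&lt;p&gt;As a rough illustration of the idea behind &lt;code&gt;approx_distinct&lt;/code&gt;, here is a minimal HyperLogLog sketch in Python. This is illustrative only and is not DataFusion&amp;rsquo;s implementation; the hash function, register count, and correction constants are simplified assumptions.&lt;/p&gt;

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog cardinality sketch with 2**p registers.
    Illustrative only -- a real implementation uses a faster hash and
    additional bias corrections."""

    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p               # number of registers
        self.registers = [0] * self.m

    def add(self, value):
        # 64-bit hash: the first p bits select a register, the remaining
        # bits are scanned for leading zeros (the "rank").
        h = int.from_bytes(hashlib.sha256(str(value).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)
        rest = h % (1 << (64 - self.p))
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        estimate = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # small-cardinality correction: fall back to linear counting
            estimate = self.m * math.log(self.m / zeros)
        return int(estimate)
```

&lt;p&gt;The appeal of the sketch is its fixed, small memory footprint: with &lt;code&gt;2**14&lt;/code&gt; registers the relative error is on the order of one percent, which is why an approximate distinct count can be dramatically cheaper than an exact one.&lt;/p&gt;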
&lt;p&gt;With these new features, we are also now passing TPC-H queries 8, 13 and 21.&lt;/p&gt;
&lt;p&gt;For the full list of new features with their relevant PRs, see the
&lt;a href="https://github.com/apache/arrow-datafusion/blob/6.0.0/datafusion/CHANGELOG.md"&gt;enhancements section&lt;/a&gt;
in the changelog.&lt;/p&gt;
&lt;h1 id="async-planning-and-decoupling-file-format-from-table-layout"&gt;&lt;code&gt;async&lt;/code&gt; planning and decoupling file format from table layout&lt;a class="headerlink" href="#async-planning-and-decoupling-file-format-from-table-layout" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;Driven by the need to support Hive style table partitioning, @rdettai
introduced the following design changes to the DataFusion core:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The code for reading specific file formats (&lt;code&gt;Parquet&lt;/code&gt;, &lt;code&gt;Avro&lt;/code&gt;, &lt;code&gt;CSV&lt;/code&gt;, and
&lt;code&gt;JSON&lt;/code&gt;) was separated from the logic that handles grouping sets of
files into execution partitions.&lt;/li&gt;
&lt;li&gt;The query planning process was made &lt;code&gt;async&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As a result, we are able to replace the old &lt;code&gt;Parquet&lt;/code&gt;, &lt;code&gt;CSV&lt;/code&gt; and &lt;code&gt;JSON&lt;/code&gt; table
providers with a single &lt;code&gt;ListingTable&lt;/code&gt; table provider.&lt;/p&gt;
&lt;p&gt;This also sets up DataFusion and its plug-in ecosystem to
support a wide range of catalogs and various object store implementations.
You can read more about this change in the
&lt;a href="https://docs.google.com/document/d/1Bd4-PLLH-pHj0BquMDsJ6cVr_awnxTuvwNJuWsTHxAQ"&gt;design document&lt;/a&gt;
and on the &lt;a href="https://github.com/apache/arrow-datafusion/pull/1010"&gt;arrow-datafusion#1010 PR&lt;/a&gt;.&lt;/p&gt;
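&lt;p&gt;To make Hive style partitioning concrete: partition column values are encoded in the directory names rather than inside the files themselves. The sketch below shows the general idea of recovering those values from a file path; the function name and structure are illustrative assumptions, not the &lt;code&gt;ListingTable&lt;/code&gt; code.&lt;/p&gt;

```python
def parse_hive_partitions(path, partition_cols):
    """Extract Hive-style key=value partition values from a file path,
    e.g. 'year=2021/month=11/data.parquet' yields year and month values
    without opening the file at all."""
    values = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, val = segment.partition("=")
            if key in partition_cols:
                values[key] = val
    return values
```

&lt;p&gt;Because the partition values come from the path, a query with a predicate on a partition column can skip entire directories before any file is read.&lt;/p&gt;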
&lt;h1 id="how-to-get-involved"&gt;How to Get Involved&lt;a class="headerlink" href="#how-to-get-involved" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;If you are interested in contributing to DataFusion, we would love to have you! You
can help by trying out DataFusion on some of your own data and projects, filing bug reports, and contributing
to the documentation, tests or code. A list of open issues suitable for
beginners is &lt;a href="https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22"&gt;here&lt;/a&gt;
and the full list is &lt;a href="https://github.com/apache/arrow-datafusion/issues"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Check out our new &lt;a href="https://arrow.apache.org/datafusion/community/communication.html"&gt;Communication Doc&lt;/a&gt; for more
ways to engage with the community.&lt;/p&gt;
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;p&gt;Ballista extends DataFusion to provide support for distributed queries. This is the first release of Ballista since 
the project was &lt;a href="https://arrow.apache.org/blog/2021/04/12/ballista-donation/"&gt;donated&lt;/a&gt; to the Apache Arrow project 
and includes 80 commits from 11 contributors.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git shortlog -sn 4.0.0..5.0.0 ballista/rust/client ballista/rust/core ballista/rust …&lt;/code&gt;&lt;/pre&gt;</summary><content type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;p&gt;Ballista extends DataFusion to provide support for distributed queries. This is the first release of Ballista since 
the project was &lt;a href="https://arrow.apache.org/blog/2021/04/12/ballista-donation/"&gt;donated&lt;/a&gt; to the Apache Arrow project 
and includes 80 commits from 11 contributors.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git shortlog -sn 4.0.0..5.0.0 ballista/rust/client ballista/rust/core ballista/rust/executor ballista/rust/scheduler
  27  Andy Grove
  15  Jiayu Liu
  12  Andrew Lamb
   8  Ximo Guanter
   6  Daniël Heres
   5  QP Hou
   2  Jorge Leitao
   1  Javier Goday
   1  K.I. (Dennis) Jung
   1  Mike Seddon
   1  sathis
&lt;/code&gt;&lt;/pre&gt;
&lt;!--
$ git log --pretty=oneline 4.0.0..5.0.0 ballista/rust/client ballista/rust/core ballista/rust/executor ballista/rust/scheduler ballista-examples/ | wc -l
80
--&gt;
&lt;p&gt;The release notes below are not exhaustive and only expose selected highlights of the release. Many other bug fixes 
and improvements have been made: we refer you to the &lt;a href="https://github.com/apache/arrow-datafusion/blob/5.0.0/ballista/CHANGELOG.md"&gt;complete changelog&lt;/a&gt;.&lt;/p&gt;
&lt;h1 id="performance-and-scalability"&gt;Performance and Scalability&lt;a class="headerlink" href="#performance-and-scalability" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;Ballista is now capable of running complex SQL queries at scale and supports scalable distributed joins. We have been
benchmarking using individual queries from the TPC-H benchmark at scale factors up to 1000 (1 TB). When running against
CSV files, performance is generally very close to that of DataFusion, and significantly faster in some cases because
the scheduler limits the number of concurrent tasks that run at any given time. Performance against large Parquet
datasets is currently not ideal due to some issues (&lt;a href="https://github.com/apache/arrow-datafusion/issues/867"&gt;#867&lt;/a&gt;,
&lt;a href="https://github.com/apache/arrow-datafusion/issues/868"&gt;#868&lt;/a&gt;) that we hope to resolve for the next release.&lt;/p&gt;
&lt;h1 id="new-features"&gt;New Features&lt;a class="headerlink" href="#new-features" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;The main new features in this release are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Ballista queries can now be executed by calling &lt;code&gt;DataFrame.collect()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;The shuffle mechanism has been re-implemented&lt;/li&gt;
&lt;li&gt;Distributed hash-partitioned joins are now supported&lt;/li&gt;
&lt;li&gt;KEDA autoscaling is supported&lt;/li&gt;
&lt;/ul&gt;
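&lt;p&gt;The idea behind distributed hash-partitioned joins is that both join inputs are shuffled by hashing the join key, so rows with equal keys from either side always land in the same partition and each partition can then be joined independently on a single executor. A minimal illustrative Python sketch (not Ballista&amp;rsquo;s shuffle implementation):&lt;/p&gt;

```python
def shuffle_partition(rows, key_index, num_partitions):
    """Assign each row (a tuple) to a partition by hashing its join key.
    Applying the same function to both join inputs guarantees matching
    keys end up in the same partition."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        p = hash(row[key_index]) % num_partitions
        partitions[p].append(row)
    return partitions
```

&lt;p&gt;Because the partitioning function is deterministic and shared by both sides, no partition ever needs to see rows from any other partition to produce its share of the join output.&lt;/p&gt;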
&lt;p&gt;To get started with Ballista, refer to the &lt;a href="https://docs.rs/ballista/0.5.0/ballista/"&gt;crate documentation&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Now that the basic functionality is in place, the focus for the next release will be to improve the performance and
scalability as well as improving the documentation.&lt;/p&gt;
&lt;h1 id="how-to-get-involved"&gt;How to Get Involved&lt;a class="headerlink" href="#how-to-get-involved" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;If you are interested in contributing to Ballista, we would love to have you! You
can help by trying out Ballista on some of your own data and projects, filing bug reports, and contributing
to the documentation, tests or code. A list of open issues suitable for
beginners is &lt;a href="https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22"&gt;here&lt;/a&gt;
and the full list is &lt;a href="https://github.com/apache/arrow-datafusion/issues"&gt;here&lt;/a&gt;.&lt;/p&gt;</content><category term="blog"/></entry><entry><title>Apache Arrow DataFusion 5.0.0 Release</title><link href="https://datafusion.apache.org/blog/2021/08/18/datafusion-5.0.0" rel="alternate"/><published>2021-08-18T00:00:00+00:00</published><updated>2021-08-18T00:00:00+00:00</updated><author><name>pmc</name></author><id>tag:datafusion.apache.org,2021-08-18:/blog/2021/08/18/datafusion-5.0.0</id><summary type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;
&lt;p&gt;The Apache Arrow team is pleased to announce the DataFusion 5.0.0 release. This covers 4 months of development work 
and includes 211 commits from the following 31 distinct contributors.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ git shortlog -sn 4.0.0..5.0.0 datafusion datafusion-cli datafusion-examples
    61  Jiayu Liu
    47  Andrew Lamb
    27 …&lt;/code&gt;&lt;/pre&gt;</summary><content type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;
&lt;p&gt;The Apache Arrow team is pleased to announce the DataFusion 5.0.0 release. This covers 4 months of development work 
and includes 211 commits from the following 31 distinct contributors.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ git shortlog -sn 4.0.0..5.0.0 datafusion datafusion-cli datafusion-examples
    61  Jiayu Liu
    47  Andrew Lamb
    27  Dani&amp;euml;l Heres
    13  QP Hou
    13  Andy Grove
     4  Javier Goday
     4  sathis
     3  Ruan Pearce-Authers
     3  Raphael Taylor-Davies
     3  Jorge Leitao
     3  Cui Wenzheng
     3  Mike Seddon
     3  Edd Robinson
     2  思维
     2  Liang-Chi Hsieh
     2  Michael Lu
     2  Parth Sarthy
     2  Patrick More
     2  Rich
     1  Charlie Evans
     1  Gang Liao
     1  Agata Naomichi
     1  Ritchie Vink
     1  Evan Chan
     1  Ruihang Xia
     1  Todd Treece
     1  Yichen Wang
     1  baishen
     1  Nga Tran
     1  rdettai
     1  Marco Neumann
&lt;/code&gt;&lt;/pre&gt;
&lt;!--
$ git log --pretty=oneline 4.0.0..5.0.0 datafusion datafusion-cli datafusion-examples | wc -l
     211
--&gt;
&lt;p&gt;The release notes below are not exhaustive and only expose selected highlights of the release. Many other bug fixes 
and improvements have been made: we refer you to the complete 
&lt;a href="https://github.com/apache/arrow-datafusion/blob/5.0.0/datafusion/CHANGELOG.md"&gt;changelog&lt;/a&gt;.&lt;/p&gt;
&lt;h1 id="performance"&gt;Performance&lt;a class="headerlink" href="#performance" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;There have been numerous performance improvements in this release. The following chart shows the relative 
performance of individual TPC-H queries compared to the previous release.&lt;/p&gt;
&lt;p&gt;&lt;i&gt;TPC-H @ scale factor 100, in parquet format. Concurrency 24.&lt;/i&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="/blog/images/2021-08-18-datafusion500perf.png"/&gt;&lt;/p&gt;
&lt;p&gt;We also extended support for more TPC-H queries: q7, q8, q9 and q13 are running successfully in DataFusion 5.0.&lt;/p&gt;
&lt;h1 id="new-features"&gt;New Features&lt;a class="headerlink" href="#new-features" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;Initial support for SQL-99 Analytics (WINDOW functions)&lt;/li&gt;
&lt;li&gt;Improved JOIN support: cross join, semi-join, anti-join, and fixes to null handling&lt;/li&gt;
&lt;li&gt;Improved EXPLAIN support&lt;/li&gt;
&lt;li&gt;Initial implementation of metrics in the physical plan&lt;/li&gt;
&lt;li&gt;Support for SELECT DISTINCT&lt;/li&gt;
&lt;li&gt;Support for JSON and NDJSON formatted inputs&lt;/li&gt;
&lt;li&gt;Query column with relations&lt;/li&gt;
&lt;li&gt;Added more datetime related functions: &lt;code&gt;now&lt;/code&gt;, &lt;code&gt;date_trunc&lt;/code&gt;, &lt;code&gt;to_timestamp_millis&lt;/code&gt;, &lt;code&gt;to_timestamp_micros&lt;/code&gt;, &lt;code&gt;to_timestamp_seconds&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Streaming &lt;code&gt;DataFrame.collect&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Support table column aliases&lt;/li&gt;
&lt;li&gt;Answer &lt;code&gt;count(*)&lt;/code&gt;, &lt;code&gt;min()&lt;/code&gt; and &lt;code&gt;max()&lt;/code&gt; queries using only statistics&lt;/li&gt;
&lt;li&gt;Non-equi-join filters in JOIN conditions&lt;/li&gt;
&lt;li&gt;Modulus operation&lt;/li&gt;
&lt;li&gt;Support group by column positions&lt;/li&gt;
&lt;li&gt;Added constant folding query optimizer&lt;/li&gt;
&lt;li&gt;Hash partitioned aggregation&lt;/li&gt;
&lt;li&gt;Added &lt;code&gt;random&lt;/code&gt; SQL function&lt;/li&gt;
&lt;li&gt;Implemented count distinct for floats and dictionary types&lt;/li&gt;
&lt;li&gt;Re-exported the arrow and parquet crates in DataFusion&lt;/li&gt;
&lt;li&gt;General row group pruning logic that&amp;rsquo;s agnostic to storage format&lt;/li&gt;
&lt;/ul&gt;
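&lt;p&gt;Answering &lt;code&gt;count(*)&lt;/code&gt;, &lt;code&gt;min()&lt;/code&gt; and &lt;code&gt;max()&lt;/code&gt; queries from statistics works because formats such as Parquet record per-file row counts and per-column min/max values, so these aggregates can be combined across files without scanning any data (assuming no filters and ignoring null-handling subtleties). A simplified, hypothetical sketch; the field names are illustrative, not DataFusion&amp;rsquo;s statistics API:&lt;/p&gt;

```python
def answer_from_stats(agg, column, file_stats):
    """Answer count(*)/min/max using only per-file statistics.
    file_stats is a list of dicts shaped like:
      {"num_rows": int, "min": {col: value}, "max": {col: value}}
    Assumes no row-level filters apply."""
    if agg == "count(*)":
        return sum(f["num_rows"] for f in file_stats)
    if agg == "min":
        return min(f["min"][column] for f in file_stats)
    if agg == "max":
        return max(f["max"][column] for f in file_stats)
    raise ValueError(f"cannot answer {agg} from statistics alone")
```

&lt;p&gt;Queries that reduce to these aggregates can therefore return in milliseconds regardless of table size.&lt;/p&gt;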
&lt;h1 id="how-to-get-involved"&gt;How to Get Involved&lt;a class="headerlink" href="#how-to-get-involved" title="Permanent link"&gt;&amp;para;&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;If you are interested in contributing to DataFusion, we would love to have you! You
can help by trying out DataFusion on some of your own data and projects, filing bug reports, and contributing
to the documentation, tests or code. A list of open issues suitable for
beginners is &lt;a href="https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22"&gt;here&lt;/a&gt; 
and the full list is &lt;a href="https://github.com/apache/arrow-datafusion/issues"&gt;here&lt;/a&gt;.&lt;/p&gt;</content><category term="blog"/></entry></feed>