Apache DataFusion Blog

Apache DataFusion Comet 0.16.0 Release

2026-05-07T00:00:00+00:00

The Apache DataFusion PMC is pleased to announce version 0.16.0 of the Comet subproject.

Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for improved performance and efficiency without requiring any code changes.

This release covers approximately three weeks of development work and is the result of merging 115 PRs from 17 contributors. See the change log for more information.

Expanded Spark 4 Support

Spark 4 is a major theme of this release. Comet now ships first-class support for both Spark 4.0.2 and Spark 4.1.1, with dedicated Maven profiles, shim sources, and CI matrices for each.

Spark 4.1.1: New spark-4.1 Maven profile and shim sources, with Comet's PR test matrix and Spark SQL test suites enabled against Spark 4.1.1. The default Maven profile has been updated to Spark 4.1 / Scala 2.13 to reflect that this is now the primary development target.
Shared 4.x shims: Identical pieces of the Spark 4.0 and 4.1 shims have been consolidated into a shared spark-4.x source tree, reducing duplication as more 4.x minor versions land.
Spark 4.0 / JDK 21: Added a Spark 4.0 / JDK 21 CI profile to validate Comet on the JDK most users are expected to deploy with Spark 4.

Adapting to Spark 4 Behavior Changes

Spark 4 introduced a number of type, planner, and on-disk format changes relative to Spark 3.x. Several correctness fixes this release bring Comet's behavior in line with these changes:

Variant type (new in Spark 4.0): Spark 4.0 added a new Variant data type for semi-structured data. Comet does not yet read the shredded Variant on-disk format natively, and delegates these scans to Spark.
String collation (new in Spark 4.0): Spark 4.0 added collation support for StringType. Comet's native operators do not yet implement non-default collations, so hash join and sort-merge join reject collated string join keys, and shuffle, sort, and aggregate fall back to Spark when keys carry a non-default collation.
Wider TimestampNTZType usage: Spark 4 uses TimestampNTZType (timestamp without time zone) in more places than 3.x — for example, in expression return types and as the inferred type for some literal forms. Comet adds support this cycle for cast to and from timestamp_ntz, cast from string to timestamp_ntz, and unix_timestamp over TimestampNTZType inputs.
to_json and array_compact (Spark 4.0): Spark 4.0 adjusted output formatting and return-type metadata for these expressions; Comet now matches the new behavior.
BloomFilter V2 (new in Spark 4.1): Spark 4.1 introduced a new BloomFilter binary format with different bit-scattering. Comet now reads this format so that runtime filters produced by Spark 4.1 remain usable in native execution.
Spark 4.1.1 analyzer refinements: Spark 4.1.1 changed how struct projections handle the case where every requested child field is missing from the Parquet file, and how allowDecimalPrecisionLoss flows through the DecimalPrecision rule. Comet now preserves parent-struct nullness in the first case and the stored allowDecimalPrecisionLoss flag in the second.

Most of these behavior differences were caught because Comet runs the full Apache Spark SQL test suite against each supported Spark version — 3.4.3, 3.5.8, 4.0.2, and 4.1.1 — as part of CI. Running Spark's own correctness tests through Comet's native execution path is what surfaces semantic shifts like TimestampNTZType propagation, ANSI-driven cast and overflow changes, BloomFilter V2 encoding, and the 4.1.1 analyzer rule changes, often before they show up in user workloads. As more Spark 4.x minor releases land, this same harness is what gives us confidence that Comet keeps up.

ANSI SQL Semantics

Spark 4 enables ANSI SQL semantics by default. ANSI mode changes how arithmetic overflow, invalid casts, division by zero, and similar error conditions are handled, and Spark itself now treats this as the standard configuration rather than an opt-in.

This is a critical area for any Spark accelerator: an engine that falls back to vanilla Spark whenever ANSI is enabled effectively does not run on Spark 4 by default. Comet implements ANSI semantics for the expressions it supports natively, including arithmetic overflow checks, ANSI cast behavior, and try_* variants. Queries running with spark.sql.ansi.enabled=true continue to be accelerated rather than falling back.

See the Comet Compatibility Guide for details on which expressions have full ANSI coverage.
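
To make the difference concrete, here is a minimal sketch of the three behaviors for 32-bit integer addition: legacy wrap-around, ANSI errors, and the NULL-returning try_* variant. The function names are hypothetical and this is not Comet's implementation; it only models the semantics described above using Rust's standard integer methods.

```rust
/// Legacy (non-ANSI) behavior: overflow silently wraps around.
fn add_legacy(a: i32, b: i32) -> i32 {
    a.wrapping_add(b)
}

/// ANSI behavior: overflow is reported as an error instead of a wrapped value.
fn add_ansi(a: i32, b: i32) -> Result<i32, String> {
    a.checked_add(b)
        .ok_or_else(|| format!("integer overflow: {a} + {b}"))
}

/// try_add-style behavior: overflow yields NULL (modeled here as None).
fn try_add(a: i32, b: i32) -> Option<i32> {
    a.checked_add(b)
}
```

An accelerator that only implemented the first function would have to fall back to Spark whenever ANSI mode is on, which is the default in Spark 4.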

Expanded Adaptive Execution Support

Modern Spark plans are adaptive: AQE re-plans stages at runtime, Dynamic Partition Pruning (DPP) prunes fact-table partitions based on broadcast dimension filters, and ReuseExchange and ReuseSubquery ensure that a broadcast or subquery referenced in multiple places executes only once. For star-schema workloads, these mechanisms are not optional. They are often the difference between a query that reads 1% of the fact table and one that reads all of it.
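
The idea behind DPP can be sketched in a few lines. The example below is a simplified model, not Spark or Comet code: FactPartition, dimension_keys, and prune_partitions are hypothetical names, and the "broadcast" is just a HashSet of join keys that survived the dimension filter.

```rust
use std::collections::HashSet;

/// One partition of a fact table, identified by its partition key value.
#[derive(Debug, Clone, PartialEq)]
struct FactPartition {
    date_key: i32,
    row_count: u64,
}

/// Dimension side: apply the dimension filter and collect the surviving join keys.
fn dimension_keys(dim_rows: &[(i32, &str)], wanted_quarter: &str) -> HashSet<i32> {
    dim_rows
        .iter()
        .filter(|(_, quarter)| *quarter == wanted_quarter)
        .map(|(key, _)| *key)
        .collect()
}

/// Fact side: keep only the partitions whose key appears in the dimension result.
fn prune_partitions(parts: &[FactPartition], keys: &HashSet<i32>) -> Vec<FactPartition> {
    parts
        .iter()
        .filter(|p| keys.contains(&p.date_key))
        .cloned()
        .collect()
}
```

With four fact partitions and a dimension filter that keeps a single key, only one partition is read. Every other partition is skipped before any of its rows are scanned.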

Prior to 0.16.0, Comet's native scans only partially participated in this machinery. CometNativeScanExec (the DataFusion-based native Parquet scan) fell back to Spark entirely whenever a DPP filter was present. CometIcebergNativeScanExec supported non-AQE DPP as of 0.15.0 (#3349), but without broadcast exchange reuse, so the DPP subquery re-executed the dimension broadcast.

Comet 0.16.0 closes both gaps and aligns the native Parquet and native Iceberg scans on a single DPP and subquery-resolution path:

Non-AQE DPP for native Parquet, with broadcast exchange reuse (#4011, #4037): A new CometSubqueryBroadcastExec replaces Spark's SubqueryBroadcastExec in DPP expressions and wraps a CometBroadcastExchangeExec, so ReuseExchangeAndSubquery matches the join side and the DPP subquery and broadcasts the dimension exactly once.
AQE DPP for native Parquet (#4112): Under AQE, Spark's PlanAdaptiveDynamicPruningFilters cannot match Comet's broadcast hash join and would otherwise rewrite DPP to TrueLiteral, disabling pruning. 0.16.0 intercepts SubqueryAdaptiveBroadcastExec before Spark's rule runs, and applies Spark's decision tree in a Comet-aware rule that searches both the current stage and the root plan for a reusable broadcast. DPP subqueries are registered in AQE's shared subqueryCache so cross-plan DPP (for example, a main query and a scalar subquery referencing the same dimension) deduplicates correctly. A narrower tagging-based fallback covers Spark 3.4, which lacks the injectQueryStageOptimizerRule extension point.
AQE DPP broadcast reuse for native Iceberg (#4215): Lifts runtimeFilters to a top-level constructor field on CometIcebergNativeScanExec (mirroring BatchScanExec), so Spark's expression-rewrite passes can see and convert the DPP subquery. The same CometSubqueryBroadcastExec machinery from the Parquet path now handles the Iceberg case.
Scalar subquery pushdown and AQE subquery reuse (#4053, SPARK-43402): CometNativeScanExec now participates in scalar subquery pushdown into Parquet data filters, and in AQE-time subquery deduplication via a new CometReuseSubquery rule that re-applies Spark's ReuseAdaptiveSubquery algorithm after Comet's node replacements.

Measured impact on TPC-DS: 78 queries previously fell back to Spark whenever DPP filters were planned, running 30–50% natively. With native DPP in 0.16.0, the same queries run 80–97% natively. Representative examples:

Query	Before	After
q1	36%	96%
q4	31%	95%
q31	31%	95%
q74	32%	95%
q92	36%	95%

Several Spark SQL DPP tests that Comet previously skipped are re-enabled to guarantee Spark compatibility and prevent regressions.

Improved TPC-DS Benchmark Results

TPC-DS performance improved significantly compared to the 0.15.0 release, and Comet is now nearly 2x faster than Spark on this benchmark.

See the Comet Benchmarking Guide for more details about these benchmark results.

Other Key Features

Hash Join Improvements

BuildRight + LeftAnti (#4073): Regular hash joins now support the BuildRight + LeftAnti combination, eliminating a common fallback path. Tests previously gated on InjectRuntimeFilterSuite issues have been re-enabled.

Aggregation

PartialMerge aggregation mode (#4003): The PartialMerge mode is now executed natively, allowing more multi-stage aggregation plans to remain in Comet without falling back to Spark.
collect_set (#3954): Native support for the collect_set aggregate.

New Expression Support

This release adds native support for the following Spark expressions:

Math: Pi, Cbrt, Acosh, Asinh, Atanh, ToDegrees, ToRadians
Date/time: timestamp_seconds, unix_timestamp with TimestampNTZType
String / URL: url_encode, url_decode, try_url_decode, str_to_map
Array / map: arrays_zip, array_position, array_union, array_distinct, arrays_overlap, MapSort (Spark 4.0)
Cast: string to timestamp_ntz, cast to and from timestamp_ntz

array_insert and array_compact have been audited and promoted to Compatible.

Object Storage

OpenDAL 0.56.0: Picks up the latest OpenDAL release, including upstream object-store fixes.
Profile credential chain: ProfileCredentialsProvider is now mapped to the AWS profile credential chain, matching the credential resolution behavior users expect.

Native Scan Improvements

Parquet field ID matching: The native_datafusion scan now supports field-ID-based column resolution, matching Spark's behavior for files written with field IDs.
Schema-mismatch errors: native_datafusion now throws SchemaColumnConvertNotSupportedException on schema mismatch, allowing Spark's standard error handling to engage.
Stricter type validation: The native_datafusion scan now detects incompatible decimal precision/scale and string/binary columns read as numeric, and delegates these reads to Spark.

Metrics and Observability

Spark UI task output metrics: Native execution now reports task output metrics through the standard Spark UI path.
Iceberg input metrics: Task-level bytesRead is now reported for the Iceberg native scan, matching Comet's native Parquet scan.
Shuffle encode time: Shuffle operations now track encode time as a separate metric, making it easier to attribute shuffle cost.

Stability and Correctness

Substring with negative start index: Fixed a Spark incompatibility in substring when the start index is negative.
Strict floating-point comparison: RangePartitioning now honors strictFloatingPoint, ensuring NaN and ±0.0 are partitioned consistently with Spark.
Broadcast / AQE coalescing: Broadcast exchanges now bypass AQE partition coalescing, fixing plans that could otherwise be coalesced into invalid shapes.
JNI: JNI local frame management has been hardened with explicit error handling.
Shuffle fallback logic: Shuffle fallback decisions have been improved, with a new config to gate conversion of Spark shuffle to Comet shuffle when the child plan is non-Comet, and a fix to avoid redundant columnar shuffle when both parent and child are non-Comet.
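
As an illustration of why NaN and ±0.0 need care in range partitioning, the sketch below uses a comparator that follows Spark's documented NaN semantics (NaN equals NaN and sorts after every other value, and -0.0 equals 0.0) to pick a partition from a list of upper bounds. The function names are hypothetical, and the sketch does not model the strictFloatingPoint flag itself or Comet's actual partitioner.

```rust
use std::cmp::Ordering;

/// Compare two doubles with NaN == NaN, NaN greater than everything else,
/// and -0.0 == 0.0.
fn spark_like_cmp(a: f64, b: f64) -> Ordering {
    match (a.is_nan(), b.is_nan()) {
        (true, true) => Ordering::Equal,
        (true, false) => Ordering::Greater,
        (false, true) => Ordering::Less,
        // Neither side is NaN, so partial_cmp always returns Some;
        // IEEE comparison already treats -0.0 and 0.0 as equal.
        (false, false) => a.partial_cmp(&b).unwrap_or(Ordering::Equal),
    }
}

/// Return the index of the first upper bound that is >= `value`,
/// or `upper_bounds.len()` if the value is above every bound.
fn partition_for(value: f64, upper_bounds: &[f64]) -> usize {
    upper_bounds
        .iter()
        .position(|&bound| spark_like_cmp(value, bound) != Ordering::Greater)
        .unwrap_or(upper_bounds.len())
}
```

With this comparator, -0.0 and 0.0 always land in the same partition and every NaN lands in the last one. A comparator that disagreed with Spark on either case could route equal keys to different partitions.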

Compatibility

Supported platforms include:

Spark 3.4.3 with Java 11/17 and Scala 2.12/2.13
Spark 3.5.8 with Java 11/17 and Scala 2.12/2.13
Spark 4.0.2 with Java 17 and Scala 2.13
Spark 4.1.1 with Java 17 and Scala 2.13

See the Spark Version Compatibility page for known limitations specific to each version.

This release continues to build on DataFusion 53.1 and Arrow 58.1.

Get Started with Comet 0.16.0

Ready to try it out? Follow the Comet 0.16.0 Installation Guide to get up and running, then point Comet at your existing Spark workloads — including Spark 4 with ANSI mode enabled — and see the speedup for yourself.

Apache DataFusion Comet 0.15.0 Release

2026-04-18T00:00:00+00:00

The Apache DataFusion PMC is pleased to announce version 0.15.0 of the Comet subproject.

Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for improved performance and efficiency without requiring any code changes.

This release covers approximately four weeks of development work and is the result of merging 142 PRs from 19 contributors. See the change log for more information.

Performance

Comet 0.15.0 provides a 2x speedup for TPC-H @ SF1000 (1TB), resulting in 50% cost savings.

That 2x speedup gives you a choice: finish the same Spark workload in half the time on the cluster you already have, or match your current Spark performance on roughly half the resources. Either way, the gain translates directly into lower cloud bills, reduced on-prem capacity, and lower energy usage, with no changes to your existing Spark SQL, DataFrame, or PySpark code. Comet runs on commodity hardware: no GPUs, FPGAs, or other specialized accelerators are required, so the savings come from better utilization of the infrastructure you already run on.

See the Comet Benchmarking Guide for more details.

Performance was a major theme of this release, with a series of targeted optimizations across the shuffle, scan, and execution layers.

Reducing JVM/Native Boundary Overhead

Several changes in this release target the cost of crossing between the JVM and native sides, which can dominate execution time in shuffle- and broadcast-heavy workloads:

Shuffle read path: The native shuffle reader no longer uses FFI on the read side, removing a per-batch cost that was particularly visible in shuffle-heavy queries.
Broadcast exchanges: Batches are now coalesced before broadcasting, reducing the number of small batches crossing the JVM/native boundary.
FFI-safe operators: More operators are marked as FFI-safe, avoiding unnecessary deep copies when crossing the JVM/native boundary.
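
The batch coalescing mentioned above can be sketched as follows. This is a simplified model that uses plain vectors in place of Arrow batches; coalesce_batches and target_rows are hypothetical names, and target_rows is assumed to be greater than zero.

```rust
/// Merge consecutive small batches until each output batch holds at least
/// `target_rows` rows (the final batch may be smaller). Row order is preserved.
fn coalesce_batches(batches: Vec<Vec<i64>>, target_rows: usize) -> Vec<Vec<i64>> {
    let mut out = Vec::new();
    let mut current: Vec<i64> = Vec::new();
    for batch in batches {
        current.extend(batch);
        if current.len() >= target_rows {
            // Emit the accumulated rows and start a fresh buffer.
            out.push(std::mem::take(&mut current));
        }
    }
    if !current.is_empty() {
        out.push(current);
    }
    out
}
```

Fewer, larger batches mean fewer boundary crossings, so any fixed per-batch cost is paid less often.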

Expanded Native Execution Coverage

Columnar-to-row (C2R): Native C2R conversion is now exercised for a broader set of query shapes.
auto scan mode: The auto scan mode now enables the native_datafusion scan where supported, giving users the benefits of the native Parquet reader without having to explicitly opt in. This is part of the ongoing effort to make native_datafusion the default Parquet path once the deprecation of native_iceberg_compat completes.

Memory Management

Shared memory pools: Unified memory pools are now shared across native execution contexts within a Spark task, improving memory accounting and reducing OOMs.

Object Storage I/O

Object store caching: Object stores and bucket region lookups are cached, dramatically reducing DNS query volume on workloads that open many files.
get_ranges performance: Picked up an upstream opendal fix that restores fast range reads from object storage.
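
The caching pattern behind the first item can be sketched as below. StoreClient and StoreCache are hypothetical stand-ins, not Comet or OpenDAL types; the point is only that the expensive setup runs once per bucket rather than once per file.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};

/// Stand-in for an object store client that is expensive to create.
#[derive(Debug)]
struct StoreClient {
    bucket: String,
}

/// Cache of clients keyed by bucket name, shared across file opens.
#[derive(Default)]
struct StoreCache {
    clients: Mutex<HashMap<String, Arc<StoreClient>>>,
    creations: AtomicUsize,
}

impl StoreCache {
    /// Return the cached client for `bucket`, creating it on first use only.
    fn get_or_create(&self, bucket: &str) -> Arc<StoreClient> {
        let mut clients = self.clients.lock().unwrap();
        if let Some(client) = clients.get(bucket) {
            return Arc::clone(client);
        }
        // A real implementation would do region lookup and client setup here.
        self.creations.fetch_add(1, Ordering::SeqCst);
        let client = Arc::new(StoreClient { bucket: bucket.to_string() });
        clients.insert(bucket.to_string(), Arc::clone(&client));
        client
    }

    /// Number of clients that were actually constructed.
    fn creations(&self) -> usize {
        self.creations.load(Ordering::SeqCst)
    }
}
```

Opening a thousand files in the same bucket through such a cache constructs one client, not a thousand.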

Together, these changes reduce CPU and memory overhead for shuffle-heavy, broadcast-heavy, and object-storage-bound workloads.

Native Iceberg Reader Enabled by Default

This release marks a major milestone for Iceberg users: Comet's fully-native Iceberg reader is now enabled by default. Workloads that read Iceberg tables will automatically benefit from native Rust-based scans built on iceberg-rust, with no additional configuration required.

To support this change, the release bundles a broad set of Iceberg-focused improvements:

Dynamic Partition Pruning (DPP): The native Iceberg reader supports DPP, allowing partition filters derived at runtime to prune Iceberg file scans and substantially reduce I/O for star-schema-style queries.
Correct classloader handling: Iceberg classes are now loaded via the thread context classloader, resolving class-loading issues in environments where the executor classloader differs from the application classloader.
Continuous Iceberg CI: Iceberg Spark integration tests now run on every PR and push to main, providing continuous validation of the native Iceberg code path. Test diffs for Spark 3.4 were updated to keep the matrix green across supported Spark versions.
iceberg-rust upgrade: Comet picks up the latest iceberg-rust, pulling in fixes for Parquet reader edge cases discovered in earlier testing.
Refreshed documentation: The Iceberg user guide has been rewritten to reflect current capabilities, and the contributor guide now documents how to run the Iceberg Spark test suites locally.

Users who need to fall back to the previous behavior can still opt out, but we encourage the community to exercise the native reader and report any issues.

Sort-Merge Join Performance

Comet relies heavily on sort-merge join (SMJ) because DataFusion's hash joins do not yet support spilling to disk. For larger-than-memory joins, SMJ is the only viable path, making its performance critical for real-world workloads at scale.

DataFusion 53 includes several SMJ improvements that Comet 0.15.0 benefits from directly:

Zero-copy slicing instead of the take kernel (datafusion#20463)
Streaming output instead of waiting for all input before emitting (datafusion#20482)
Cached row counts to avoid O(n) recounting (datafusion#20478)
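
To illustrate the first item, the sketch below contrasts a take-style gather, which copies every selected value, with slicing a contiguous run, which only borrows a range of the existing buffer. It uses plain Rust slices as a stand-in for Arrow arrays and is not the DataFusion code from the linked PR; take_values and slice_run are hypothetical names.

```rust
/// take-style gather: copies each selected element into a new buffer.
fn take_values(values: &[i64], indices: &[usize]) -> Vec<i64> {
    indices.iter().map(|&i| values[i]).collect()
}

/// Slice-style selection: for a contiguous run of rows, borrow the range
/// from the original buffer without copying any values.
fn slice_run(values: &[i64], offset: usize, len: usize) -> &[i64] {
    &values[offset..offset + len]
}
```

Both calls return the same values for a contiguous run, but the slice points into the original buffer. In a sort-merge join, matching rows often form such runs because both inputs are sorted.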

Additional SMJ work is landing in upstream DataFusion and will arrive in a future Comet release:

Specialized semi/anti join stream (datafusion#20806)
Batch deferred filtering with 20–50x improvements for near-unique LEFT and FULL joins (datafusion#21184)
DynComparator for ~5% TPC-H improvement (datafusion#21484)
Vec-based filter state replacing HashMap (datafusion#21517)
Full outer join correctness fix for NULL filter results (datafusion#21660)

With these performance improvements, the next release of Comet will enable SMJ with filters by default.

Other Key Features

New Expressions and Function Support

This release adds support for the following:

Date/time functions: days, hours, date_from_unix_date
String/JSON functions: native get_json_object with improved performance over the fallback path
Hash/math functions: bin
Array functions: sort_array
Window functions: LEAD and LAG with IGNORE NULLS
Aggregates: SQL FILTER (WHERE ...) clauses now execute natively; Corr aggregate enabled

Expanded Metrics and Observability

Comet metrics can now be exposed through Spark's external monitoring system, making it easier to integrate Comet execution statistics with existing observability dashboards. Native DataFusion scans also now report accurate filesScanned and bytesScanned input metrics, matching Spark's native Parquet scan reporting.

Stability and Correctness

A significant portion of this release is dedicated to stability and Spark compatibility. Highlights include:

Cast string to timestamp: Multiple fixes for UTC timestamps, timezone handling, special formats (epoch, now, etc.), and compatibility with Spark's semantics.
Cast decimal to string: Added legacy mode handling to match Spark's output formatting.
String to decimal: Support for full-width characters, null characters, and negative scale.
Decimal arithmetic: Fixes for decimal division and additional test coverage for ANSI overflow handling, including scalar decimal overflow.
Array expressions: Corrected GetArrayItem null handling for dynamic indices; array_append return type fixed and marked Compatible; audited array_insert for correctness; array_compact marked Compatible; array-to-array cast enabled.
DateTrunc/TimestampTrunc: Fixed native crashes when the input is a literal.
Ambiguous local times: Correct handling of ambiguous and non-existent local times across DST transitions.
Case-insensitive Parquet fields: native_datafusion now correctly detects duplicate/ambiguous fields in case-insensitive mode and falls back where appropriate.
Shuffle planning: Shuffle fallback decisions are now "sticky" across planning passes, and Comet columnar shuffle is skipped for stages containing DPP scans to avoid mismatched partitioning.
Error propagation: Native error messages are now propagated through SparkException even when the errorClass is empty, and file-not-found errors flow through the standard Spark error JSON path.
Trigonometric compatibility: tan and atan2 are now Spark-compatible.

Dependency Upgrades

This release upgrades to DataFusion 53.1 and Arrow 58.1, and picks up the latest iceberg-rust release with additional reader fixes. The jni crate was upgraded to 0.22.4.

Deprecations and Removals

The SupportsComet interface has been removed, along with the Java-based Iceberg integration path (which is fully superseded by the native Iceberg reader). See comet#2921 for background on the decision to standardize on the native iceberg-rust integration. The native_iceberg_compat scan remains deprecated and is expected to be removed in a future release in favor of native_datafusion.

Compatibility

Supported platforms include Spark 3.4.3, 3.5.4–3.5.8, and Spark 4.0.x with various JDK and Scala combinations.

The community encourages users to test Comet with existing Spark and Iceberg workloads and welcomes contributions to ongoing development.

Get Started with Comet 0.15.0

Ready to try it out? Follow the Comet 0.15.0 Installation Guide to get up and running, then point Comet at your existing Spark workloads and see the speedup for yourself.

Apache DataFusion 53.0.0 Released

2026-04-02T00:00:00+00:00

We are proud to announce the release of DataFusion 53.0.0. This post highlights some of the major improvements since DataFusion 52.0.0. The complete list of changes is available in the changelog. Thanks to the 114 contributors for making this release possible.

Performance Improvements 🚀

Figure 1: Average and median normalized execution times for DataFusion 53.0.0 on ClickBench queries, compared to previous releases. Query times are normalized using the ClickBench definition. See the DataFusion Benchmarking Page for more details.

DataFusion 53 continues the project-wide focus on performance. This release reduces planning overhead, skips more unnecessary I/O, and pushes more work into earlier and cheaper stages of execution.

`LIMIT`-Aware Parquet Row Group Pruning

DataFusion 53 includes a new optimization that makes Parquet pruning aware of LIMIT. The optimization is described in full in the limit pruning blog post. If DataFusion can prove that every row in some row groups matches the predicate, and those fully matching row groups contain enough rows to satisfy the LIMIT, partially matching row groups are skipped entirely.

Figure 2: Limit pruning is inserted between row group and page index pruning.

Thanks to @xudong963 for implementing this feature. Related PRs: #18868
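
The rule can be sketched as follows, assuming each remaining row group has already been classified by statistics as fully or partially matching. RowGroup, Match, and prune_for_limit are hypothetical names for illustration; this is not DataFusion's implementation.

```rust
/// How a row group relates to the predicate, as proven from statistics.
/// Row groups that cannot match at all are assumed to be pruned already.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Match {
    Full,
    Partial,
}

#[derive(Debug)]
struct RowGroup {
    id: usize,
    rows: usize,
    matches: Match,
}

/// Return the ids of the row groups that still need to be read.
fn prune_for_limit(groups: &[RowGroup], limit: usize) -> Vec<usize> {
    let fully_matching_rows: usize = groups
        .iter()
        .filter(|g| g.matches == Match::Full)
        .map(|g| g.rows)
        .sum();
    groups
        .iter()
        // Partial groups are only needed when the full groups cannot satisfy the LIMIT.
        .filter(|g| g.matches == Match::Full || fully_matching_rows < limit)
        .map(|g| g.id)
        .collect()
}
```

With fully matching groups holding 130 rows in total, a LIMIT of 100 reads only those groups, while a LIMIT of 200 still has to read the partially matching ones.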

Improved Filter Pushdown

DataFusion 53 pushes filters down through more join types and through UnionExec, and expands support for pushing down dynamic filters. More pushdown means fewer rows flow into joins, repartitions, and later operators, which reduces CPU, memory, and I/O.

For example:

SELECT *
FROM (
    SELECT *
    FROM t1
    LEFT ANTI JOIN t2 ON t1.k = t2.k
) a
JOIN t1 b ON a.k = b.k
WHERE b.v = 1;

Now DataFusion can often transform the physical plan so filters and dynamic filters are pushed deeper into the plan, even through subqueries and nested joins. In this example, the filter on b.v helps produce dynamic filters that can be pushed into both sides of the nested anti join.

Figure 3: DataFusion 53 pushes dynamic filters through subqueries and into both sides of nested joins.

Thanks to @nuno-faria, @haohuaijin, and @jackkleeman for driving this work. Related PRs: #19918, #20145, #20192

Faster Query Planning

DataFusion 53 improves query planning performance by making immutable pieces of execution plans cheaper to clone. This helps applications that need extremely low latency, plan many or complex queries, or use prepared statements or parameterized queries. In some benchmarks, overall execution time drops from roughly 4-5 ms to about 100 µs.
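
The general technique of sharing immutable state behind a reference-counted pointer can be sketched like this. PlanNode and PlanProperties are hypothetical types for illustration; the actual changes are in the PRs linked in this section.

```rust
use std::sync::Arc;

/// Immutable and potentially large description shared by copies of a plan node.
#[derive(Debug)]
struct PlanProperties {
    output_columns: Vec<String>,
}

#[derive(Debug, Clone)]
struct PlanNode {
    name: String,
    // Cloning an Arc bumps a reference count instead of copying the data.
    properties: Arc<PlanProperties>,
}

impl PlanNode {
    fn new(name: &str, output_columns: Vec<String>) -> Self {
        Self {
            name: name.to_string(),
            properties: Arc::new(PlanProperties { output_columns }),
        }
    }

    /// Derive a new node that reuses the same immutable properties.
    fn with_name(&self, name: &str) -> Self {
        Self {
            name: name.to_string(),
            properties: Arc::clone(&self.properties),
        }
    }
}
```

However many columns the properties describe, deriving a new node costs one small allocation for the name and one reference-count increment.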

Thanks to @askalt for leading this work. Related PRs: #19792, #19893

Faster Functions

DataFusion includes 235 built-in functions. Improving the performance of these functions benefits a wide range of workloads. This release improves the performance of 42 of those functions, such as strpos, replace, concat, translate, array_has, array_agg, left, right, and case_when.

Thanks to the contributors who drove this work, especially @neilconway, @theirix, @lyne7-sc, @kumarUjjawal, @pepijnve, @zhangxffff, and @UBarney.

Nested Field Pushdown

DataFusion 53 pushes expressions such as get_field down the plan and into data sources. This is especially important for nested data such as structs in Parquet files. Instead of reading an entire struct column and then extracting the field of interest, DataFusion 53 pushes the field extraction into the scan.

For example, the following query reads a struct column s and extracts the label field for rows where the value field is greater than 150:

SELECT id, s['label']
FROM t
WHERE s['value'] > 150;

Figure 4: DataFusion 53 pushes field-access expressions closer to the scan.

Special thanks to @adriangb for designing and implementing this optimizer work. Related PRs: #20065, #20117, #20239

New Features ✨

JSON Array File Support: DataFusion 53 can now read JSON arrays such as [{...}, {...}] directly as multiple rows, including streaming inputs from object stores. Thanks to @zhuqi-lucas for implementing this feature. Related PRs: #19924
Support for the : operator: DataFusion can plan queries such as SELECT payload:'user_id' FROM events;, enabling better Parquet Variant support via datafusion-variant. Thanks to @Samyak2. Related PRs: #20717
New SQL: DataFusion supports additional set-comparison subqueries, null-aware anti join, and deletion predicates. Thanks to @waynexia, @viirya, and @askalt for key contributions in this area. Related PRs: #19109, #19635, #20137
Spark-Compatible Functions: This release includes almost 20 new or improved Spark-compatible functions and behaviors in the datafusion-spark crate. It includes functions such as collect_list, date_diff, from_utc_timestamp, json_tuple, arrays_zip, bin, and array_contains. Thanks to the contributors who drove this work, especially @cht42, @CuteChuanChuan, @SubhamSinghal, @kazantsev-maksim, @unknowntpo, @aryan-212, @hsiang-c, and @davidlghellin.

Stability and Release Engineering 🦺

The community spent significant time this release cycle stabilizing the release branch and improving the release process. While such improvements are not as headline-friendly as new features, they are highly important for real deployments. We are discussing ways to improve the process on #21034 and would welcome suggestions and contributions to help with release engineering work in the future.

Thanks to @comphead for running this release, and to @jonathanc-n, @alamb, @xanderbailey, @haohuaijin, @friendlymatthew, @fwojciec, @Kontinuation, @nathanb9, and many others who helped stabilize the release branch.

Upgrade Notes

DataFusion 53 includes some breaking changes, including updates to the SQL parser, optimizer behavior, and some physical-plan APIs. Please see the upgrade guide and changelog for the full details before upgrading.

Known Issues

A small number of issues were discovered after the 53.0.0 release, and we expect to publish DataFusion 53.1.0 soon. See the 53.1.0 release tracking issue for the latest status.

Thank You

Thank you to everyone in the DataFusion community who contributed code, reviews, testing, bug reports, documentation, and release engineering work for 53.0.0. This release contains direct contributions from 114 different people, and we are grateful for the time and effort that everyone put in to make it happen.

Writing Custom Table Providers in Apache DataFusion

2026-03-31T00:00:00+00:00

One of DataFusion's greatest strengths is its extensibility. If your data lives in a custom format, behind an API, or in a system that DataFusion does not natively support, you can teach DataFusion to read it by implementing a custom table provider. This post walks through the three layers you need to understand to design a table provider and where planning and execution work should happen.

The Three Layers

When DataFusion executes a query against a table, three abstractions collaborate to produce results:

TableProvider -- Describes the table (schema, capabilities) and produces an execution plan when queried. This is part of the Logical Plan.
ExecutionPlan -- Describes how to compute the result: partitioning, ordering, and child plan relationships. This is part of the Physical Plan.
SendableRecordBatchStream -- The async stream that actually does the work, yielding RecordBatches one at a time.

Think of these as a funnel: TableProvider::scan() is called once during planning to create an ExecutionPlan, then ExecutionPlan::execute() is called once per partition to create a stream, and those streams are where rows are actually produced during execution.
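
The funnel can be modeled with a toy example that uses only the standard library. ToyTable and ToyPlan are hypothetical stand-ins, not DataFusion's traits: the real TableProvider and ExecutionPlan use async streams and Arrow RecordBatches, and take more arguments than shown here.

```rust
use std::ops::Range;
use std::sync::Arc;

/// A batch of rows; stands in for an Arrow RecordBatch.
type Batch = Vec<i64>;

/// Layer 2: describes how the result is computed, one row range per partition.
struct ToyPlan {
    data: Arc<Vec<i64>>,
    ranges: Vec<Range<usize>>,
}

impl ToyPlan {
    fn partition_count(&self) -> usize {
        self.ranges.len()
    }

    /// Layer 3: called once per partition. Rows are only produced when the
    /// returned iterator is consumed. `batch_size` must be greater than zero.
    fn execute(&self, partition: usize, batch_size: usize) -> Box<dyn Iterator<Item = Batch>> {
        let data = Arc::clone(&self.data);
        let range = self.ranges[partition].clone();
        let starts = range.clone().step_by(batch_size);
        Box::new(starts.map(move |start| {
            let end = (start + batch_size).min(range.end);
            data[start..end].to_vec()
        }))
    }
}

/// Layer 1: describes the table and produces a plan when queried.
struct ToyTable {
    data: Arc<Vec<i64>>,
    num_partitions: usize, // assumed to be greater than zero
}

impl ToyTable {
    /// Called once during planning. It only decides how to split the data;
    /// no rows are read here.
    fn scan(&self) -> ToyPlan {
        let n = self.data.len();
        let per_partition = (n + self.num_partitions - 1) / self.num_partitions;
        let ranges = (0..self.num_partitions)
            .map(|i| (i * per_partition).min(n)..((i + 1) * per_partition).min(n))
            .collect();
        ToyPlan { data: Arc::clone(&self.data), ranges }
    }
}
```

Calling scan() once and then execute() for each of the two partitions yields two independent streams, each producing its own batches. The real API follows the same shape.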

Background: Logical and Physical Planning

Before diving into the three layers, it helps to understand how DataFusion processes a query. There are several phases between a SQL string (or DataFrame call) and streaming results:

SQL / DataFrame API
  → Logical Plan          (abstract: what to compute)
  → Logical Optimization  (rewrite rules that preserve semantics)
  → Physical Plan         (concrete: how to compute it)
  → Physical Optimization (hardware- and data-aware rewrites)
  → Execution             (streaming RecordBatches)

Logical Planning

A logical plan describes what the query computes without specifying how. It is a tree of relational operators -- TableScan, Filter, Projection, Aggregate, Join, Sort, Limit, and so on. The logical optimizer rewrites this tree to reduce work while preserving the query's meaning. Some logical optimizations include:

Predicate pushdown -- moves filters as close to the data source as possible, so fewer rows flow through the rest of the plan.
Projection pruning -- eliminates columns that are never referenced downstream, reducing memory and I/O.
Expression simplification -- rewrites expressions like 1 = 1 or x AND true into simpler forms.
Subquery decorrelation -- converts correlated IN / EXISTS subqueries into more efficient semi-joins.
Limit pushdown -- pushes LIMIT earlier in the plan so operators produce less data.
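
As a small illustration of what a semantics-preserving rewrite looks like, the sketch below simplifies boolean AND expressions such as x AND true. The Expr enum here is a hypothetical three-variant toy, far smaller than DataFusion's own Expr type, and simplify is not DataFusion's simplifier.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(bool),
    And(Box<Expr>, Box<Expr>),
}

/// Rewrite an expression bottom-up into a simpler equivalent form.
fn simplify(expr: Expr) -> Expr {
    match expr {
        Expr::And(left, right) => match (simplify(*left), simplify(*right)) {
            // x AND true  ==>  x
            (Expr::Literal(true), other) | (other, Expr::Literal(true)) => other,
            // x AND false  ==>  false (also holds when x is NULL)
            (Expr::Literal(false), _) | (_, Expr::Literal(false)) => Expr::Literal(false),
            (l, r) => Expr::And(Box::new(l), Box::new(r)),
        },
        other => other,
    }
}
```

Each rule replaces a subtree with a cheaper one that evaluates to the same result for every input row. That property is what lets the optimizer apply such rules freely.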

Physical Planning

The physical planner converts the optimized logical plan into an ExecutionPlan tree -- the concrete plan that will actually run. This is where decisions like "use a hash join vs. a sort-merge join" or "how many partitions to scan" are made. The physical optimizer then refines this tree further with rewrites such as:

Distribution enforcement -- inserts RepartitionExec nodes so that data is partitioned correctly for joins and aggregations.
Sort enforcement -- inserts SortExec nodes where ordering is required, and removes them where the data is already sorted.
Join selection -- picks the most efficient join strategy based on statistics and table sizes.
Aggregate optimization -- combines partial and final aggregation stages, and can use exact statistics to skip scanning entirely.

Why This Matters for Table Providers

Your TableProvider sits at the boundary between logical and physical planning. During logical optimization, DataFusion determines which filters and projections could be pushed down to the source. When scan() is called during physical planning, those hints are passed to you. By implementing capabilities like supports_filters_pushdown, you influence what the optimizer can do -- and the metadata you declare in your ExecutionPlan (partitioning, ordering) directly affects which physical optimizations apply.
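
The decision a provider makes in supports_filters_pushdown can be modeled as a per-filter classification. In the sketch below, the PushDown enum mirrors the three variant names of DataFusion's TableProviderFilterPushDown (Exact, Inexact, Unsupported), but Filter, classify, and the column lists are hypothetical simplifications rather than the real API.

```rust
/// A toy filter type; the real method receives DataFusion Expr values.
#[derive(Debug, Clone, PartialEq)]
enum Filter {
    Eq { column: String, value: i64 },
    Like { column: String, pattern: String },
}

/// Local stand-in for DataFusion's TableProviderFilterPushDown.
#[derive(Debug, PartialEq)]
enum PushDown {
    /// The source applies the filter fully; the engine need not re-check it.
    Exact,
    /// The source may skip some data, but the engine must still apply the filter.
    Inexact,
    /// The source ignores the filter; the engine applies it after the scan.
    Unsupported,
}

fn classify(filter: &Filter, partition_columns: &[&str], stats_columns: &[&str]) -> PushDown {
    match filter {
        // Equality on a partition column selects whole partitions exactly.
        Filter::Eq { column, .. } if partition_columns.contains(&column.as_str()) => {
            PushDown::Exact
        }
        // Min/max statistics can rule out chunks of data but not individual rows.
        Filter::Eq { column, .. } if stats_columns.contains(&column.as_str()) => {
            PushDown::Inexact
        }
        _ => PushDown::Unsupported,
    }
}
```

Reporting Inexact is the safe choice when the source can only skip coarse chunks of data. The engine keeps its own filter above the scan, so correctness never depends on the source.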

Choosing the Right Starting Point

Not every custom data source requires implementing all three layers from scratch. DataFusion provides building blocks that let you plug in at whatever level makes sense:

If your data is...	Start with	You implement
Already in `RecordBatch`es in memory	MemTable	Nothing -- just construct it
An async stream of batches	StreamTable	A stream factory
A logical transformation of other tables	ViewTable wrapping a logical plan	The logical plan
A variant of an existing file format	ListingTable with a custom FileFormat wrapping an existing one	A thin `FileFormat` wrapper
Files in a custom format on disk or object storage	ListingTable with a custom FileFormat, FileSource, and FileOpener	The format, source, and opener
A custom source needing full control	`TableProvider` + `ExecutionPlan` + stream	All three layers

If your data is file-based, ListingTable handles file discovery, partition column inference, and plan construction -- you only need to implement FileFormat, FileSource, and FileOpener to describe how to read your files. See the custom_file_format example for a minimal wrapping approach, or use ParquetSource and ParquetOpener as a reference for a full custom implementation.

The rest of this post focuses on the full TableProvider + ExecutionPlan + stream path, which gives you complete control and applies to any data source.

Layer 1: TableProvider¶

A TableProvider represents a queryable data source. For a minimal read-only table, you need four methods:

#[async_trait::async_trait]
impl TableProvider for MyTable {
    fn as_any(&self) -> &dyn Any { self }

    fn schema(&self) -> SchemaRef {
        Arc::clone(&self.schema)
    }

    fn table_type(&self) -> TableType {
        TableType::Base
    }

    async fn scan(
        &self,
        state: &dyn Session,
        projection: Option<&Vec<usize>>,
        filters: &[Expr],
        limit: Option<usize>,
    ) -> Result<Arc<dyn ExecutionPlan>> {
        // Build and return an ExecutionPlan -- don't do any execution work here -- keep lightweight!
        Ok(Arc::new(MyExecPlan::new(
            Arc::clone(&self.schema),
            projection,
            limit,
        )))
    }
}

The scan method is the heart of TableProvider. It receives three pushdown hints from the optimizer, each reducing the amount of data your source needs to produce:

projection -- Which columns are needed. This reduces the width of the output. If your source supports it, read only these columns rather than the full schema.
filters -- Predicates the engine would like you to apply during the scan. This reduces the number of rows by skipping data that does not match. Implement supports_filters_pushdown to advertise which filters you can handle.
limit -- A row count cap. This also reduces the number of rows -- if you can stop reading early once you have produced enough rows, this avoids unnecessary work.
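As a rough illustration of what the projection and limit hints mean for a source, here is a sketch that uses only the standard library. The `Row` alias and the `scan_rows` helper are hypothetical stand-ins and not part of DataFusion; a real provider would apply the same idea while building its ExecutionPlan and stream.

```rust
/// Hypothetical in-memory row: one value per column of the full schema.
type Row = Vec<i64>;

/// Apply the `projection` and `limit` hints to an in-memory source.
/// `projection` lists column indices into the full schema, as in `scan()`.
fn scan_rows(
    rows: &[Row],
    projection: Option<&Vec<usize>>,
    limit: Option<usize>,
) -> Vec<Row> {
    rows.iter()
        // Stop early once `limit` rows are produced (no limit: read everything).
        .take(limit.unwrap_or(usize::MAX))
        .map(|row| match projection {
            // Keep only the requested columns, in the requested order.
            Some(indices) => indices.iter().map(|&i| row[i]).collect(),
            // No projection: return every column.
            None => row.clone(),
        })
        .collect()
}
```

Note that the output rows have the shape of the projected columns, not of the full schema; the schema reported by the plan has to change in the same way.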

You can also use the scan_with_args() variant that provides additional pushdown information for other advanced use cases.

Keep `scan()` Lightweight¶

This is a critical point: scan() runs during planning, not execution. It should return quickly. Best practice is to avoid performing I/O, network calls, or heavy computation here. The scan method's job is to describe how the data will be produced, not to produce it. All the real work belongs in the stream (Layer 3).

A common pitfall is to fetch data or open connections in scan(). This blocks the planning thread and can cause timeouts or deadlocks, especially if the query involves multiple tables or subqueries that all need to be planned before execution begins.

Existing Implementations to Learn From¶

DataFusion ships several TableProvider implementations that are excellent references:

MemTable -- Holds data in memory as Vec<RecordBatch>. The simplest possible provider; great for tests and small datasets.
StreamTable -- Wraps a user-provided stream factory. Useful when your data arrives as a continuous stream (e.g., from Kafka or a socket).
ListingTable -- The file-based data source behind DataFusion's built-in Parquet, CSV, and JSON support. Demonstrates sophisticated filter and projection pushdown, file pruning, and schema inference.
ViewTable -- Wraps a logical plan, representing a SQL view. Useful if your provider is best expressed as a transformation of other tables.

Layer 2: ExecutionPlan¶

An ExecutionPlan is a node in the physical query plan tree. Your table provider's scan() method returns one. The trait also requires Debug and DisplayAs implementations, which are omitted here for brevity. The required methods are:

impl ExecutionPlan for MyExecPlan {
    fn name(&self) -> &str { "MyExecPlan" }

    fn as_any(&self) -> &dyn Any { self }

    fn properties(&self) -> &PlanProperties {
        &self.properties
    }

    fn children(&self) -> Vec<&Arc<dyn ExecutionPlan>> {
        vec![]  // Leaf node -- no children
    }

    fn with_new_children(
        self: Arc<Self>,
        children: Vec<Arc<dyn ExecutionPlan>>,
    ) -> Result<Arc<dyn ExecutionPlan>> {
        assert!(children.is_empty());
        Ok(self)
    }

    fn execute(
        &self,
        partition: usize,
        context: Arc<TaskContext>,
    ) -> Result<SendableRecordBatchStream> {
        // This is where you build and return your stream
        // ...
    }
}

The key properties to set correctly in PlanProperties are output partitioning and output ordering.

Output partitioning tells the engine how many partitions your data has, which determines parallelism. If your source naturally partitions data (e.g., by file or by shard), expose that here.

Output ordering declares whether your data is naturally sorted. This enables the optimizer to avoid inserting a SortExec when a query requires ordered data. Getting this right can be a significant performance win.

Partitioning Strategies¶

Since execute() is called once per partition, partitioning directly controls the parallelism of your table scan. Each partition produces an independent stream that DataFusion schedules as a task on the tokio runtime. It is important to distinguish tasks from threads: tasks are lightweight units of async work that are multiplexed onto a thread pool. You can have many more tasks (partitions) than physical threads -- the runtime will interleave them efficiently as they await I/O or yield.

Start simple: match your data's natural layout. If you have 4 files, expose 4 partitions. If your source has 8 shards, expose 8 partitions. DataFusion will insert a RepartitionExec above your scan when downstream operators need a different distribution. You can also implement the repartitioned method on your ExecutionPlan to let DataFusion request a different partition count directly from your source, avoiding the extra operator entirely.

Consider how your data source naturally divides its data:

By file or object: If you are reading from S3, each file can be a partition. DataFusion will read them in parallel.
By shard or region: If your source is a sharded database, each shard maps naturally to a partition.
By key range: If your data is keyed (e.g., by timestamp or customer ID), you can split it into ranges.

Advanced: aligning with target_partitions. Once you have something working, you can tune further. Having too many partitions is not free: each partition adds scheduling overhead, and downstream operators may need to repartition the data anyway. The session configuration exposes a target partition count that reflects how many partitions the optimizer expects to work with:

async fn scan(
    &self,
    state: &dyn Session,
    projection: Option<&Vec<usize>>,
    filters: &[Expr],
    limit: Option<usize>,
) -> Result<Arc<dyn ExecutionPlan>> {
    let target_partitions = state.config().target_partitions();
    // Optionally coalesce or split partitions to match target_partitions.
    // ...
}

If your source produces data in exactly target_partitions partitions, the optimizer is less likely to insert a RepartitionExec above your scan. For small datasets, target_partitions may be set to 1, which avoids any repartitioning overhead entirely.
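One simple way to approach a target partition count is to group the natural units of the source (files, shards) into at most that many partitions. The sketch below assumes a round-robin assignment and a hypothetical `group_files` helper; it uses only the standard library and is not how DataFusion's own file grouping is implemented.

```rust
/// Distribute `files` round-robin into at most `target_partitions` groups.
/// Each group would become one partition of the scan.
fn group_files(files: Vec<String>, target_partitions: usize) -> Vec<Vec<String>> {
    // Never create more groups than files, and always at least one group.
    let n = target_partitions.max(1).min(files.len().max(1));
    let mut groups: Vec<Vec<String>> = vec![Vec::new(); n];
    for (i, file) in files.into_iter().enumerate() {
        groups[i % n].push(file);
    }
    groups
}
```

Round-robin ignores file sizes, so a source with very uneven files may prefer to balance groups by byte count instead.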

Advanced: declaring hash partitioning. If your source stores data pre-partitioned by a specific key (e.g., customer_id), you can declare this in your output partitioning. For a query like:

SELECT customer_id, SUM(amount)
FROM my_table
GROUP BY customer_id;

If you declare your output partitioning as Hash([customer_id], N), the optimizer recognizes that the data is already distributed correctly for the aggregation and eliminates the RepartitionExec that would otherwise appear in the plan. You can verify this with EXPLAIN (more on this below).

Conversely, if you report UnknownPartitioning, DataFusion cannot assume anything about how rows are distributed, so it inserts a repartitioning operator wherever a downstream operator requires a specific distribution.

Keep `execute()` Lightweight Too¶

Like scan(), the execute() method should construct and return a stream without doing heavy work. The actual data production happens when the stream is polled. Do not block on async operations here -- build the stream and let the runtime drive it.

Existing Implementations to Learn From¶

StreamingTableExec -- Executes a streaming table scan. It takes a stream factory (a closure that produces streams) and handles partitioning. Good reference for wrapping external streams.
DataSourceExec -- The execution plan behind DataFusion's built-in file scanning (Parquet, CSV, JSON). It demonstrates sophisticated partitioning, filter pushdown, and projection pushdown.

Layer 3: SendableRecordBatchStream¶

SendableRecordBatchStream is where the real work happens. It is defined as:

type SendableRecordBatchStream =
    Pin<Box<dyn RecordBatchStream<Item = Result<RecordBatch>> + Send>>;

This is an async stream of RecordBatches that can be sent across threads. When the DataFusion runtime polls this stream, your code runs: reading files, calling APIs, transforming data, etc.

Using RecordBatchStreamAdapter¶

The easiest way to create a SendableRecordBatchStream is with RecordBatchStreamAdapter. It bridges any futures::Stream<Item = Result<RecordBatch>> into the SendableRecordBatchStream type:

use datafusion::physical_plan::stream::RecordBatchStreamAdapter;
use futures::StreamExt; // provides flat_map on streams

fn execute(
    &self,
    partition: usize,
    context: Arc<TaskContext>,
) -> Result<SendableRecordBatchStream> {
    let schema = self.schema();
    let config = self.config.clone();

    let stream = futures::stream::once(async move {
        // ALL the heavy work happens here, inside the stream:
        // - Open connections
        // - Read data from external sources
        // - Transform and batch the results
        // Assumes fetch_data_from_source returns Result<Vec<RecordBatch>>.
        let batches = fetch_data_from_source(&config).await?;
        Ok(batches)
    })
    // Flatten the single Vec<RecordBatch> into a stream of individual batches.
    .flat_map(|result| match result {
        Ok(batches) => {
            futures::stream::iter(batches.into_iter().map(Ok).collect::<Vec<_>>())
        }
        Err(e) => futures::stream::iter(vec![Err(e)]),
    });

    Ok(Box::pin(RecordBatchStreamAdapter::new(schema, stream)))
}

Blocking Work: Use a Separate Thread Pool¶

If your stream performs blocking work -- such as blocking I/O, or CPU work that runs for hundreds of milliseconds without yielding -- you must avoid blocking the tokio async runtime. Short CPU work (e.g., parsing a batch in a few milliseconds) is fine to do inline as long as your code yields back to the runtime frequently. But for long-running synchronous work that cannot yield, offload to a dedicated thread pool and send results back through a channel:

fn execute(
    &self,
    partition: usize,
    context: Arc<TaskContext>,
) -> Result<SendableRecordBatchStream> {
    let schema = self.schema();
    let config = self.config.clone();

    let (tx, rx) = tokio::sync::mpsc::channel(2);

    // Spawn blocking work on a dedicated thread pool
    tokio::task::spawn_blocking(move || {
        let batches = generate_data(&config);
        for batch in batches {
            if tx.blocking_send(Ok(batch)).is_err() {
                break; // Receiver dropped, query was cancelled
            }
        }
    });

    let stream = tokio_stream::wrappers::ReceiverStream::new(rx);
    Ok(Box::pin(RecordBatchStreamAdapter::new(schema, stream)))
}

This pattern keeps the async runtime responsive while long-running synchronous work runs on its own threads. For a working example that shows how to configure separate thread pools for I/O and CPU work, see the thread_pools example in the DataFusion repository.
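The same producer and bounded-channel shape can be tried out without tokio. The following sketch is a standard-library analogue, assuming `std::sync::mpsc::sync_channel` in place of the tokio channel and a plain thread in place of `spawn_blocking`; the `spawn_producer` name and the `u64` items (standing in for record batches) are hypothetical.

```rust
use std::sync::mpsc::{sync_channel, Receiver};
use std::thread;

/// Start a producer thread that sends `total` items through a bounded channel.
fn spawn_producer(total: u64) -> Receiver<u64> {
    // Capacity 2: the producer blocks once two items are waiting,
    // which gives backpressure instead of unbounded buffering.
    let (tx, rx) = sync_channel(2);
    thread::spawn(move || {
        for item in 0..total {
            // `send` fails once the receiver is dropped, i.e. the consumer
            // went away (the analogue of a cancelled query).
            if tx.send(item).is_err() {
                break;
            }
        }
    });
    rx
}
```

The small channel capacity is the important design choice: it bounds how far the producer can run ahead of the consumer, which keeps memory use predictable.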

Where Should the Work Happen?¶

This table summarizes what belongs at each layer:

Layer	Runs During	Should Do	Should NOT Do
`TableProvider::scan()`	Planning	Build an `ExecutionPlan` with metadata	I/O, network calls, heavy computation
`ExecutionPlan::execute()`	Execution (once per partition)	Construct a stream, set up channels	Block on async work, read data
`RecordBatchStream` (polling)	Execution	All I/O, computation, data production	--

The guiding principle: push work as late as possible. Planning should be fast so the optimizer can do its job. Execution setup should be fast so all partitions can start promptly. The stream is where you spend time producing data.
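The principle can be demonstrated with an ordinary lazy iterator. This is a standard-library sketch only: `LazyPlan` and its `produced` counter are hypothetical, and the iterator plays the role that the record batch stream plays in DataFusion.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical plan node: holds only metadata, no data.
struct LazyPlan {
    rows: u64,
    /// Counts how many rows have actually been produced (for demonstration).
    produced: Rc<Cell<u64>>,
}

impl LazyPlan {
    /// Analogous to `execute()`: builds an iterator but produces nothing yet.
    fn execute(&self) -> impl Iterator<Item = u64> {
        let produced = Rc::clone(&self.produced);
        (0..self.rows).map(move |v| {
            // This closure runs only when the consumer pulls the next item.
            produced.set(produced.get() + 1);
            v * 2
        })
    }
}
```

Calling `execute()` leaves the counter at zero; it only advances as the consumer pulls items, so a consumer that stops after three rows pays for three rows.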

Why This Matters¶

When scan() does heavy work, several problems arise:

Planning becomes slow. If a query touches 10 tables and each scan() takes 500ms, planning alone takes 5 seconds before any data flows.
The work is single-threaded. scan() runs on a single thread during planning, so any work done there cannot benefit from the parallel execution that DataFusion provides across partitions.
The optimizer cannot help. The optimizer runs between planning and execution. If you have already fetched data during planning, optimizations like predicate pushdown or partition pruning cannot reduce the work.
Resource management breaks down. DataFusion manages concurrency and memory during execution. Work done during planning bypasses these controls.

Filter Pushdown: Doing Less Work¶

One of the most impactful optimizations you can add to a custom table provider is filter pushdown -- letting the source skip data that the query does not need, rather than reading everything and filtering it afterward.

How Filter Pushdown Works¶

When DataFusion plans a query with a WHERE clause, it passes the filter predicates to your scan() method as the filters parameter. By default, DataFusion assumes your provider cannot handle any filters and inserts a FilterExec node above your scan to apply them. But if your source can evaluate some predicates during scanning -- for example, by skipping files, partitions, or row groups that cannot match -- you can eliminate a huge amount of unnecessary I/O.

To opt in, implement supports_filters_pushdown:

fn supports_filters_pushdown(
    &self,
    filters: &[&Expr],
) -> Result<Vec<TableProviderFilterPushDown>> {
    Ok(filters.iter().map(|f| {
        match f {
            // We can fully evaluate equality filters on
            // the partition column at the source
            Expr::BinaryExpr(BinaryExpr {
                left, op: Operator::Eq, right
            }) if is_partition_column(left) || is_partition_column(right) => {
                TableProviderFilterPushDown::Exact
            }
            // All other filters: let DataFusion handle them
            _ => TableProviderFilterPushDown::Unsupported,
        }
    }).collect())
}

The three possible responses for each filter are:

Exact -- Your source guarantees that no output rows will have a false value for this predicate. Because the filter is fully evaluated at the source, DataFusion will not add a FilterExec for it.
Inexact -- Your source has the ability to reduce the data produced, but the output may still include rows that do not satisfy the predicate. For example, you might skip entire files based on metadata statistics but not filter individual rows within a file. DataFusion will still add a FilterExec above your scan to remove any remaining rows that slipped through.
Unsupported -- Your source ignores this filter entirely. DataFusion handles it.
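The effect of the three responses on the plan can be modelled in a few lines. The sketch below is a standard-library model, not DataFusion code: `PushDown` is a stand-in for TableProviderFilterPushDown, filters are plain strings, and `residual_filters` is a hypothetical helper showing which predicates still need a FilterExec.

```rust
/// Std-only stand-in for `TableProviderFilterPushDown`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PushDown {
    Exact,
    Inexact,
    Unsupported,
}

/// Return the filters that still have to be applied above the scan.
/// Only `Exact` filters can be dropped; the other two must be re-checked.
fn residual_filters<'a>(filters: &[&'a str], support: &[PushDown]) -> Vec<&'a str> {
    filters
        .iter()
        .zip(support)
        .filter(|(_, s)| **s != PushDown::Exact)
        .map(|(f, _)| *f)
        .collect()
}
```

Inexact and Unsupported look the same from above the scan; the difference is that an Inexact filter has already reduced how much data reaches the FilterExec.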

Why Filter Pushdown Matters¶

Consider a table with 1 billion rows partitioned by region, and a query:

SELECT * FROM events WHERE region = 'us-east-1' AND event_type = 'click';

Without filter pushdown: Your table provider reads all 1 billion rows across all regions. DataFusion then applies both filters, discarding the vast majority of the data.

With filter pushdown on region: Your scan() method sees the region = 'us-east-1' filter and constructs an execution plan that only reads the us-east-1 partition. If that partition holds 100 million rows, you have just eliminated 90% of the I/O. DataFusion still applies the event_type filter via FilterExec if you reported it as Unsupported.

Only Push Down Filters When the Data Source Can Do Better¶

DataFusion already pushes filters as close to the data source as possible, typically placing them directly above the scan. FilterExec is also highly optimized, with vectorized evaluation and type-specialized kernels for fast predicate evaluation.

Because of this, you should only implement filter pushdown when your data source can do strictly better -- for example, by avoiding I/O entirely through skipping files or partitions based on metadata. If your data source cannot eliminate I/O in this way, it is usually better to let DataFusion handle the filter, as its in-memory execution is already highly efficient.

Using EXPLAIN to Debug Your Table Provider¶

The EXPLAIN statement is your best tool for understanding what DataFusion is actually doing with your table provider. It shows the physical plan that DataFusion will execute, including any operators it inserted:

EXPLAIN SELECT * FROM events WHERE region = 'us-east-1' AND event_type = 'click';

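If you are using DataFrames, call .explain(false, false) to see the logical and physical plans. The two arguments are verbose and analyze, so .explain(true, false) prints the more detailed verbose output, and .explain(false, true) runs the query and reports the physical plan annotated with runtime metrics.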

Before filter pushdown, the plan might look like:

FilterExec: region@0 = us-east-1 AND event_type@1 = click
  MyExecPlan: partitions=50

Here DataFusion is reading all 50 partitions and filtering everything afterward. The FilterExec above your scan is doing all the predicate work.

After implementing pushdown for region (reported as Exact):

FilterExec: event_type@1 = click
  MyExecPlan: partitions=5, filter=[region = us-east-1]

Now your exec reads only the 5 partitions for us-east-1, and the remaining FilterExec only handles the event_type predicate. The region filter has been fully absorbed by your scan.

After implementing pushdown for both filters (both Exact):

MyExecPlan: partitions=5, filter=[region = us-east-1 AND event_type = click]

No FilterExec at all -- your source handles everything.

Similarly, EXPLAIN will reveal whether DataFusion is inserting unnecessary SortExec or RepartitionExec nodes that you could eliminate by declaring better output properties. Whenever your queries seem slower than expected, EXPLAIN is the first place to look.

A Complete Filter Pushdown Example¶

To make filter pushdown concrete, here is an illustrative example. Imagine a table provider that reads from a set of date-partitioned directories on disk (e.g., data/2026-03-01/, data/2026-03-02/, ...). Each directory contains one or more Parquet files for that date. By pushing down a filter on the date column, the provider can skip entire directories -- avoiding the I/O of listing and reading files that cannot possibly match the query.

/// A table provider backed by date-partitioned directories.
/// Each date directory contains data files; by filtering on the
/// `date` column we can skip entire directories of I/O.
#[derive(Debug)] // TableProvider requires Debug
struct DatePartitionedTable {
    schema: SchemaRef,
    /// Maps date strings ("2026-03-01") to directory paths
    partitions: HashMap<String, String>,
}

#[async_trait::async_trait]
impl TableProvider for DatePartitionedTable {
    fn as_any(&self) -> &dyn Any { self }
    fn schema(&self) -> SchemaRef { Arc::clone(&self.schema) }
    fn table_type(&self) -> TableType { TableType::Base }

    fn supports_filters_pushdown(
        &self,
        filters: &[&Expr],
    ) -> Result<Vec<TableProviderFilterPushDown>> {
        Ok(filters.iter().map(|f| {
            if Self::is_date_equality_filter(f) {
                // We can fully evaluate this: we will only read
                // directories matching the date, so no rows with
                // a different date will appear in the output.
                TableProviderFilterPushDown::Exact
            } else {
                TableProviderFilterPushDown::Unsupported
            }
        }).collect())
    }

    async fn scan(
        &self,
        _state: &dyn Session,
        projection: Option<&Vec<usize>>,
        filters: &[Expr],
        limit: Option<usize>,
    ) -> Result<Arc<dyn ExecutionPlan>> {
        // Determine which date partitions to read by inspecting
        // the pushed-down filters. This is the key optimization:
        // we decide *during planning* which directories to scan,
        // so that execution never touches irrelevant data.
        let dates_to_read: Vec<String> = self
            .extract_date_values(filters)
            .unwrap_or_else(||
                self.partitions.keys().cloned().collect()
            );

        let dirs: Vec<String> = dates_to_read
            .iter()
            .filter_map(|d| self.partitions.get(d).cloned())
            .collect();
        let num_dirs = dirs.len();

        Ok(Arc::new(DatePartitionedExec {
            schema: Arc::clone(&self.schema),
            directories: dirs,
            properties: PlanProperties::new(
                EquivalenceProperties::new(
                    Arc::clone(&self.schema),
                ),
                // One partition per date directory -- these
                // will be read in parallel.
                Partitioning::UnknownPartitioning(num_dirs),
                EmissionType::Incremental,
                Boundedness::Bounded,
            ),
        }))
    }
}

impl DatePartitionedTable {
    /// Check if a filter is an equality comparison on the `date` column.
    fn is_date_equality_filter(expr: &Expr) -> bool {
        // In practice, match on BinaryExpr { left, op: Eq, right }
        // and check if either side references the "date" column.
        // Simplified here for clarity.
        todo!("match on date equality expressions")
    }

    /// Extract date literal values from pushed-down equality filters.
    fn extract_date_values(&self, filters: &[Expr]) -> Option<Vec<String>> {
        // Parse filters like `date = '2026-03-01'` and return
        // the literal date strings. Returns None if no date
        // filters are present (meaning: read all partitions).
        todo!("extract date literals from filter expressions")
    }
}

The key insight is that the filter pushdown decision (supports_filters_pushdown) and the partition pruning (scan()) work together: the first tells DataFusion that a FilterExec is unnecessary for the date predicate, and the second ensures that only the relevant directories are scanned. The actual file reading happens later, in the stream produced by execute().
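The two helpers left as todo!() above could be filled in along the following lines. To keep the sketch runnable with only the standard library, it matches on a toy `ToyExpr` enum rather than DataFusion's Expr; the enum, `date_literal`, and this version of `extract_date_values` are all hypothetical, and a real implementation would match on the corresponding Expr variants instead.

```rust
/// Toy expression type standing in for DataFusion's `Expr` (hypothetical).
#[derive(Debug, Clone, PartialEq)]
enum ToyExpr {
    Column(String),
    Literal(String),
    Eq(Box<ToyExpr>, Box<ToyExpr>),
    Other,
}

/// If `expr` is `date = '<literal>'` (in either order), return the literal.
fn date_literal(expr: &ToyExpr) -> Option<&str> {
    let ToyExpr::Eq(left, right) = expr else {
        return None;
    };
    match (&**left, &**right) {
        (ToyExpr::Column(c), ToyExpr::Literal(v))
        | (ToyExpr::Literal(v), ToyExpr::Column(c))
            if c == "date" =>
        {
            Some(v.as_str())
        }
        _ => None,
    }
}

/// Dates to read; `None` means "no date filter, read every partition".
fn extract_date_values(filters: &[ToyExpr]) -> Option<Vec<String>> {
    let mut dates = filters.iter().filter_map(date_literal);
    let first = dates.next()?;
    // Filters are ANDed together, so two different dates can never both hold.
    if dates.any(|d| d != first) {
        Some(vec![])
    } else {
        Some(vec![first.to_string()])
    }
}
```

Because the filter is reported as Exact, the extraction has to be conservative: a contradictory pair of date filters yields no partitions rather than the union of both.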

Putting It All Together¶

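Here is a minimal but complete example of a custom table provider that generates data lazily during streaming. To stay short, it ignores the projection and filter hints; a real provider would also need to narrow its output schema to the requested columns: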

use std::any::Any;
use std::sync::Arc;

use arrow::array::Int64Array;
use arrow::datatypes::{DataType, Field, Schema, SchemaRef};
use arrow::record_batch::RecordBatch;
use datafusion::catalog::TableProvider;
use datafusion::common::Result;
use datafusion::datasource::TableType;
use datafusion::catalog::Session;
use datafusion::execution::{SendableRecordBatchStream, TaskContext};
use datafusion::logical_expr::Expr;
use datafusion::physical_expr::EquivalenceProperties;
use datafusion::physical_plan::execution_plan::{Boundedness, EmissionType};
use datafusion::physical_plan::stream::RecordBatchStreamAdapter;
use datafusion::physical_plan::{
    ExecutionPlan, Partitioning, PlanProperties,
};
use futures::stream;

/// A table provider that generates sequential numbers on demand.
#[derive(Debug)] // TableProvider requires Debug
struct CountingTable {
    schema: SchemaRef,
    num_partitions: usize,
    rows_per_partition: usize,
}

impl CountingTable {
    fn new(num_partitions: usize, rows_per_partition: usize) -> Self {
        let schema = Arc::new(Schema::new(vec![
            Field::new("partition", DataType::Int64, false),
            Field::new("value", DataType::Int64, false),
        ]));
        Self { schema, num_partitions, rows_per_partition }
    }
}

#[async_trait::async_trait]
impl TableProvider for CountingTable {
    fn as_any(&self) -> &dyn Any { self }
    fn schema(&self) -> SchemaRef { Arc::clone(&self.schema) }
    fn table_type(&self) -> TableType { TableType::Base }

    async fn scan(
        &self,
        _state: &dyn Session,
        projection: Option<&Vec<usize>>,
        _filters: &[Expr],
        limit: Option<usize>,
    ) -> Result<Arc<dyn ExecutionPlan>> {
        // Light work only: build the plan with metadata
        Ok(Arc::new(CountingExec {
            schema: Arc::clone(&self.schema),
            num_partitions: self.num_partitions,
            rows_per_partition: limit
                .unwrap_or(self.rows_per_partition)
                .min(self.rows_per_partition),
            properties: PlanProperties::new(
                EquivalenceProperties::new(Arc::clone(&self.schema)),
                Partitioning::UnknownPartitioning(self.num_partitions),
                EmissionType::Incremental,
                Boundedness::Bounded,
            ),
        }))
    }
}

#[derive(Debug)] // ExecutionPlan requires Debug
struct CountingExec {
    schema: SchemaRef,
    num_partitions: usize,
    rows_per_partition: usize,
    properties: PlanProperties,
}

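// ExecutionPlan also requires DisplayAs, which controls how the node
// is rendered in EXPLAIN output.
impl datafusion::physical_plan::DisplayAs for CountingExec {
    fn fmt_as(
        &self,
        _t: datafusion::physical_plan::DisplayFormatType,
        f: &mut std::fmt::Formatter,
    ) -> std::fmt::Result {
        write!(f, "CountingExec: partitions={}", self.num_partitions)
    }
}

impl ExecutionPlan for CountingExec {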
    fn name(&self) -> &str { "CountingExec" }
    fn as_any(&self) -> &dyn Any { self }
    fn properties(&self) -> &PlanProperties { &self.properties }
    fn children(&self) -> Vec<&Arc<dyn ExecutionPlan>> { vec![] }

    fn with_new_children(
        self: Arc<Self>,
        _children: Vec<Arc<dyn ExecutionPlan>>,
    ) -> Result<Arc<dyn ExecutionPlan>> {
        Ok(self)
    }

    fn execute(
        &self,
        partition: usize,
        _context: Arc<TaskContext>,
    ) -> Result<SendableRecordBatchStream> {
        let schema = Arc::clone(&self.schema);
        let rows = self.rows_per_partition;

        // The heavy work (data generation) happens inside the stream,
        // not here in execute().
        let batch_stream = stream::once(async move {
            let partitions = Int64Array::from(
                vec![partition as i64; rows],
            );
            let values = Int64Array::from(
                (0..rows as i64).collect::<Vec<_>>(),
            );
            let batch = RecordBatch::try_new(
                Arc::clone(&schema),
                vec![Arc::new(partitions), Arc::new(values)],
            )?;
            Ok(batch)
        });

        Ok(Box::pin(RecordBatchStreamAdapter::new(
            Arc::clone(&self.schema),
            batch_stream,
        )))
    }
}

Acknowledgements¶

I would like to thank Rerun.io for sponsoring the development of this work. Rerun is building a data visualization system for Physical AI and makes heavy use of DataFusion table providers for working with data analytics.

I would also like to thank the reviewers of this post for their helpful feedback and suggestions: @2010YOUY01, @adriangb, @alamb, @kevinjqliu, @Omega359, @pgwhalen, and @stuhood.

Get Involved¶

DataFusion is not a project built or driven by a single person, company, or foundation. Our community of users and contributors works together to build a shared technology that none of us could have built alone.

If you are interested in joining us, we would love to have you. You can try out DataFusion on some of your own data and projects and let us know how it goes, contribute suggestions, documentation, bug reports, or a PR with documentation, tests, or code. A list of open issues suitable for beginners is here, and you can find out how to reach us on the communication doc.

Turning LIMIT into an I/O Optimization: Inside DataFusion’s Multi-Layer Pruning Stack

2026-03-20T00:00:00+00:00

Xudong Wang, Massive

Reading data efficiently means touching as little data as possible. The fastest I/O is the I/O you never make. This sounds obvious, but making it happen in practice requires careful engineering at every layer of the query engine. Apache DataFusion achieves this through a multi-layer pruning pipeline — a series of stages that progressively narrow down the data before decoding a single row.

In this post, we describe a new optimization called limit pruning that makes this pipeline aware of SQL LIMIT clauses. By identifying row groups where every row is guaranteed to match the predicate, DataFusion can satisfy a LIMIT query without ever touching partially matching row groups — eliminating wasted I/O entirely.

For example, given a query like:

SELECT * FROM tracking_data
WHERE species LIKE 'Alpine%' AND s >= 50
LIMIT 3

If the pruning pipeline already knows that certain row groups fully satisfy the WHERE clause, those groups alone may contain enough rows to fill the LIMIT — making it unnecessary to scan anything else.

This work was inspired by the "Pruning for LIMIT Queries" section of Snowflake's paper Pruning in Snowflake: Working Smarter, Not Harder.

DataFusion's Pruning Pipeline¶

Before diving into limit pruning, let's understand the full pruning pipeline. DataFusion scans Parquet data through a series of increasingly fine-grained filters, each one eliminating data so the next stage processes less:

Figure 1: The three phases of DataFusion's pruning pipeline — from directories down to individual rows.

Phase 1: High-Level Discovery¶

Partition Pruning: The ListingTable component evaluates filters that depend only on partition columns — things like year, month, or region encoded in directory paths (e.g., s3://data/year=2024/month=01/). Irrelevant directories are eliminated before we even open a file.
File Stats Pruning: The FilePruner checks file-level min/max and null-count statistics. If these statistics prove that a file cannot satisfy the predicate, we drop it entirely — no need to read row group metadata.
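Partition pruning on directory paths can be pictured with a small standard-library sketch. The `partition_value` and `prune_paths` helpers are hypothetical and handle only a single equality on one partition key; ListingTable's actual logic is more general.

```rust
/// Extract the value of `key` from a hive-style path such as
/// `s3://data/year=2024/month=01/part-0.parquet`.
fn partition_value<'a>(path: &'a str, key: &str) -> Option<&'a str> {
    path.split('/')
        .filter_map(|segment| segment.split_once('='))
        .find(|(k, _)| *k == key)
        .map(|(_, v)| v)
}

/// Keep only the paths whose `key` partition equals `wanted`.
/// Paths without that key are kept, since nothing proves they cannot match.
fn prune_paths<'a>(paths: &[&'a str], key: &str, wanted: &str) -> Vec<&'a str> {
    paths
        .iter()
        .copied()
        .filter(|p| partition_value(p, key).map_or(true, |v| v == wanted))
        .collect()
}
```

No file is opened at any point: the decision is made from the path strings alone, which is why this stage is so cheap.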

Phase 2: Row Group Statistics¶

For each surviving file, DataFusion reads row group metadata and potentially bloom filters and classifies each row group into one of three states (the example data shown in the figures below, such as "Snow Vole" and "Alpine Ibex", is adapted from the Snowflake pruning paper):

Figure 2: Row groups are classified into three states based on their statistics.

Not Matching (Skipped): Statistics prove no rows can match. The row group is ignored completely.
Partially Matching: Statistics cannot rule out matching rows, but also cannot guarantee them. These groups might be scanned and verified row by row later.
Fully Matching: Statistics prove that every single row in the group satisfies the predicate. This state is key to making limit pruning possible.

Phase 3: Granular Pruning¶

The final phase goes even deeper:

Page Index Pruning: Parquet pages have their own min/max statistics. DataFusion uses these to skip individual data pages within a surviving row group.
Late Materialization (Row Filtering): Instead of decoding all columns at once, DataFusion decodes the cheapest, most selective columns first. It filters rows using those columns, then only decodes the remaining columns for surviving rows.
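The idea behind late materialization can be sketched with two plain slices standing in for encoded columns. This is a standard-library illustration only; `late_materialize` is a hypothetical helper, and "decoding" is reduced to a string copy.

```rust
/// Evaluate the predicate on a cheap column first, then materialize the
/// expensive column only for rows that passed.
fn late_materialize(
    filter_col: &[i64],
    expensive_col: &[&str],
    threshold: i64,
) -> Vec<String> {
    // Step 1: evaluate `filter_col >= threshold` to get the selected row indices.
    let selected: Vec<usize> = filter_col
        .iter()
        .enumerate()
        .filter(|(_, v)| **v >= threshold)
        .map(|(i, _)| i)
        .collect();
    // Step 2: "decode" the other column only for the selected rows.
    selected
        .iter()
        .map(|&i| expensive_col[i].to_string())
        .collect()
}
```

The saving grows with selectivity: if the predicate keeps 2 of 4 rows, only 2 values of the expensive column are ever materialized.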

The Problem: LIMIT Was Ignored¶

Before limit pruning, all of these stages worked well — but the pruning pipeline had no awareness of LIMIT. Consider a query like:

SELECT * FROM tracking_data
WHERE species LIKE 'Alpine%' AND s >= 50
LIMIT 3

Even when fully matched row groups alone contain enough rows to satisfy the LIMIT, DataFusion would still decode partially matching groups and filter out rows that did not match, wasting resources decoding rows just to immediately discard them.

Figure 3: Without limit awareness, partially matching groups are scanned and filtered even when fully matched groups already have enough rows. The left section shows 5 fully matched rows (enough to satisfy LIMIT 5), while the right section with the dashed red border represents a partially matching group that is still decoded — wasting CPU and I/O on rows that may not match at all.

If a fully matched group already holds the five rows that LIMIT 5 needs, why bother decoding groups where we're not even sure any rows qualify?

The Solution: Limit-Aware Pruning¶

The solution adds a new step in the pruning pipeline — right after row group pruning and before page index pruning:

Figure 4: Limit pruning is inserted between row group and page index pruning.

The idea is simple: if fully matched row groups already contain enough rows to satisfy the LIMIT, rewrite the access plan to scan only those groups and skip everything else.

This optimization is applied only when the query is a pure limit query with no ORDER BY, because reordering which groups we scan could change the output ordering of the results. In the implementation, this check is expressed as:

// Prune by limit if limit is set and order is not sensitive
if let (Some(limit), false) = (limit, preserve_order) {
    row_groups.prune_by_limit(limit, rg_metadata, &file_metrics);
}

Mechanism: Detecting Fully Matched Row Groups¶

The core insight is predicate negation. To determine if every row in a row group satisfies the predicate, we:

Negate the original predicate
Simplify the negated expression
Evaluate the negation against the row group's statistics
If the negation is pruned (proven impossible), then the original predicate holds for every row

Since DataFusion already had expression simplification (step 2) and statistics-based pruning (step 3), implementing this was relatively straightforward — the key addition was composing these existing capabilities with predicate negation.

Figure 5: If the negated predicate is impossible according to row group stats, all rows must match the original predicate.
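To make the negation idea concrete, here is a minimal sketch for a single predicate of the form `s >= threshold`. The `Stats` struct and the helper functions are hypothetical and exist only for this illustration; the sketch also assumes the column contains no nulls. DataFusion's real implementation handles arbitrary expressions through `PruningPredicate`.

```rust
/// Min/max statistics for one column in one row group (hypothetical type).
#[derive(Debug, Clone, Copy)]
struct Stats {
    min: i64,
    max: i64,
}

/// Could any row satisfy `s >= threshold`? `false` means the group can be pruned.
fn may_match_ge(stats: Stats, threshold: i64) -> bool {
    stats.max >= threshold
}

/// Could any row satisfy the negated predicate, `s < threshold`?
fn may_match_lt(stats: Stats, threshold: i64) -> bool {
    stats.min < threshold
}

/// A group is fully matched when the negated predicate is provably impossible.
fn is_fully_matched(stats: Stats, threshold: i64) -> bool {
    !may_match_lt(stats, threshold)
}
```

For `s >= 50`, a row group whose values range from 76 to 101 is fully matched, because no row can satisfy `s < 50`. A group ranging from 6 to 71 may contain matches but is not fully matched.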

In DataFusion's codebase, this logic lives in identify_fully_matched_row_groups (row_group_filter.rs):

fn identify_fully_matched_row_groups(
    &mut self,
    candidate_row_group_indices: &[usize],
    arrow_schema: &Schema,
    parquet_schema: &SchemaDescriptor,
    groups: &[RowGroupMetaData],
    predicate: &PruningPredicate,
    metrics: &ParquetFileMetrics,
) {
    // Create the inverted predicate: NOT(original)
    let inverted_expr = Arc::new(NotExpr::new(
        Arc::clone(predicate.orig_expr()),
    ));

    // Simplify: e.g., NOT(c1 = 0) → c1 != 0
    let simplifier = PhysicalExprSimplifier::new(arrow_schema);
    let Ok(inverted_expr) = simplifier.simplify(inverted_expr) else {
        return;
    };

    let Ok(inverted_predicate) = PruningPredicate::try_new(
        inverted_expr,
        Arc::clone(predicate.schema()),
    ) else {
        return;
    };

    // Evaluate inverted predicate against row group stats
    // (construction of `inverted_pruning_stats` is elided in this excerpt)
    let Ok(inverted_values) =
        inverted_predicate.prune(&inverted_pruning_stats)
    else {
        return;
    };

    for (i, &original_idx) in
        candidate_row_group_indices.iter().enumerate()
    {
        // If negation is pruned (false), all rows match original
        if !inverted_values[i] {
            self.is_fully_matched[original_idx] = true;
        }
    }
}

Mechanism: Rewriting the Access Plan¶

Once we know which row groups are fully matched, the limit pruning algorithm is straightforward:

Figure 6: The algorithm iterates fully matched groups, accumulating row counts until the limit is satisfied.

The implementation in prune_by_limit (row_group_filter.rs):

pub fn prune_by_limit(
    &mut self,
    limit: usize,
    rg_metadata: &[RowGroupMetaData],
    metrics: &ParquetFileMetrics,
) {
    let mut fully_matched_indexes: Vec<usize> = Vec::new();
    let mut fully_matched_rows: usize = 0;

    for &idx in self.access_plan.row_group_indexes().iter() {
        if self.is_fully_matched[idx] {
            fully_matched_indexes.push(idx);
            fully_matched_rows += rg_metadata[idx].num_rows() as usize;
            if fully_matched_rows >= limit {
                break;
            }
        }
    }

    // Rewrite the plan if we have enough rows
    if fully_matched_rows >= limit {
        let mut new_plan = ParquetAccessPlan::new_none(rg_metadata.len());
        for &idx in &fully_matched_indexes {
            new_plan.scan(idx);
        }
        self.access_plan = new_plan;
    }
}

Key properties of this algorithm:

It preserves the original row group ordering
If fully matched groups don't have enough rows, the plan is unchanged — no harm done
The cost is minimal: a single pass over the row group list

Case Study: Alpine Wildlife Query¶

Let's walk through a concrete example adapted from the Snowflake pruning paper. Given a wildlife tracking dataset with four row groups:

SELECT * FROM tracking_data
WHERE species LIKE 'Alpine%' AND s >= 50
LIMIT 3

Row Group	Species Range	S Range	State
RG1	Snow Vole, Brown Bear, Gray Wolf	7–133	Not Matching (no 'Alpine%')
RG2	Lynx, Red Fox, Alpine Bat	6–71	Partially Matching
RG3	Alpine Ibex, Alpine Goat, Alpine Sheep	76–101	Fully Matching
RG4	Mixed species	Mixed	Partially Matching

Figure 7: Before limit pruning, RG2 is scanned for zero hits. After limit pruning, only RG3 is scanned.

Before limit pruning: DataFusion scans RG2 (0 hits — wasted I/O), then RG3 (3 hits, early return). RG2 was decoded entirely for nothing.

With limit pruning: The system detects that RG3 has 3 fully matched rows, which satisfies LIMIT 3. It rewrites the access plan to scan only RG3, skipping RG2 and RG4 entirely. One row group scanned. Zero waste.
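The walkthrough can be reproduced with a small, self-contained model of the algorithm. The `State` enum and `plan_scan` function are hypothetical simplifications rather than DataFusion types, and the per-group row counts passed in are assumptions for illustration (only RG3's three rows come from the example).

```rust
/// Pruning state of a row group (hypothetical type for illustration).
#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    NotMatching,
    PartiallyMatching,
    FullyMatching,
}

/// Returns the indexes of the row groups to scan for a `LIMIT limit` query
/// without ORDER BY. Each group is described as (state, number of rows).
fn plan_scan(groups: &[(State, usize)], limit: usize) -> Vec<usize> {
    // Statistics pruning: drop groups that cannot match at all.
    let surviving: Vec<usize> = groups
        .iter()
        .enumerate()
        .filter(|(_, g)| g.0 != State::NotMatching)
        .map(|(idx, _)| idx)
        .collect();

    // Limit pruning: accumulate fully matched groups until the limit is met.
    let mut fully_matched = Vec::new();
    let mut rows = 0;
    for &idx in &surviving {
        let (state, num_rows) = groups[idx];
        if state == State::FullyMatching {
            fully_matched.push(idx);
            rows += num_rows;
            if rows >= limit {
                return fully_matched;
            }
        }
    }

    // Not enough guaranteed rows: keep the original plan.
    surviving
}
```

Describing RG1 through RG4 as not matching, partially matching, fully matching, and partially matching (zero-based indexes, three rows each by assumption), a limit of 3 yields a plan containing only RG3's index. A limit of 4 cannot be met by fully matched groups alone, so RG2, RG3, and RG4 all remain in the plan.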

Observing Limit Pruning via Metrics¶

DataFusion exposes limit pruning activity through query metrics. When running a query with EXPLAIN ANALYZE, you will see entries like:

row_groups_pruned_statistics=4 total → 3 matched -> 1 fully matched
limit_pruned_row_groups=3 total → 1 matched

This tells us:

4 row groups were evaluated, 3 survived statistics pruning, and 1 was identified as fully matching
Of the 3 row groups that entered limit pruning, only 1 survived; 2 were pruned by the limit optimization

Future Directions¶

There are two natural extensions of this work:

Page-Level Limit Pruning: Today, "fully matched" detection operates at the row group level. If we extend this to use page index statistics, we could stop decoding pages within a row group once the limit is met. This would pay dividends for wide row groups where only a few pages hold matching data.

Row Filter Hints: Even when a row group is fully matched, the current row filter still evaluates predicates row by row. If we pass the fully matched groups info into the row filter builder, we can skip predicate evaluation entirely for guaranteed groups — saving CPU cycles on predicate evaluation.

Summary¶

DataFusion's pruning pipeline trims redundant I/O from the partition level all the way down to individual rows. Limit pruning adds a new step that creates an early exit when fully matched row groups already satisfy the LIMIT. The result is fewer row groups scanned, less data decoded, and faster queries.

The key insights are:

Predicate negation can identify row groups where all rows match, not just those where "some might match"
Row count accumulation across fully matched groups enables early termination

About DataFusion¶

DataFusion's core thesis is that, as a community, together we can build much more advanced technology than any of us as individuals or companies could build alone.

How to Get Involved¶

If you are interested in contributing, we would love to have you. You can try out DataFusion on some of your own data and projects and let us know how it goes, contribute suggestions, documentation, bug reports, or a PR with documentation, tests, or code. A list of open issues suitable for beginners is here, and you can find out how to reach us on the communication doc.

Apache DataFusion Comet 0.14.0 Release

2026-03-18T00:00:00+00:00

The Apache DataFusion PMC is pleased to announce version 0.14.0 of the Comet subproject.

Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for improved performance and efficiency without requiring any code changes.

This release covers approximately eight weeks of development work and is the result of merging 189 PRs from 21 contributors. See the change log for more information.

Key Features¶

Native Iceberg Improvements¶

Comet's fully-native Iceberg integration received several enhancements:

Per-Partition Plan Serialization: CometExecRDD now supports per-partition plan data, reducing serialization overhead for native Iceberg scans and enabling dynamic partition pruning (DPP).

Vended Credentials: Native Iceberg scans now support passing vended credentials from the catalog, improving integration with cloud storage services.

Upstream Reader Performance Improvements: The Comet team contributed a number of reader performance improvements to iceberg-rust 0.9.0, which Comet now uses. These improvements benefit all iceberg-rust users.

Performance Optimizations:

Single-pass FileScanTask validation for reduced planning overhead
Configurable data file concurrency via spark.comet.scan.icebergNative.dataFileConcurrencyLimit
Channel-based executor thread parking instead of yield_now() for reduced CPU overhead
Reuse of CometConf and native utility instances in batch decoding

Native Columnar-to-Row Conversion¶

Comet now uses a native columnar-to-row (C2R) conversion by default. This feature replaces Comet's JVM-based columnar-to-row transition with a native Rust implementation, reducing JVM memory overhead when data flows from Comet's native execution back to Spark operators that require row-based input.

New Expressions¶

This release adds support for the following expressions:

Date/time functions: make_date, next_day
String functions: right, string_split, luhn_check
Math functions: crc32
Map functions: map_contains_key, map_from_entries
Conversion functions: to_csv
Cast support: date to timestamp, numeric to timestamp, integer to binary, boolean to decimal, date to numeric

ANSI Mode Error Messages¶

ANSI SQL mode now produces proper error messages matching Spark's expected output, improving compatibility for workloads that rely on strict SQL error handling.

DataFusion Configuration Passthrough¶

DataFusion session-level configurations can now be set directly from Spark using the spark.comet.datafusion.* prefix. This enables tuning DataFusion internals such as batch sizes and memory limits without modifying Comet code.

Performance Improvements¶

This release includes extensive performance optimizations:

Sum aggregation: Specialized implementations for each eval mode eliminate per-row mode checks
Contains expression: SIMD-based scalar pattern search for faster string matching
Batch coalescing: Reduced IPC schema overhead in BufBatchWriter by coalescing small batches
Tokio runtime: Worker threads now initialize from spark.executor.cores for better resource utilization
Decimal expressions: Optimized decimal arithmetic operations
Row-to-columnar transition: Improved performance for JVM shuffle data conversion
Aligned pointer reads: Optimized SparkUnsafeRow field accessors using aligned memory reads

Deprecations and Removals¶

The deprecated native_comet scan mode has been removed. Use native_datafusion instead. Note that the native_iceberg_compat scan mode is now deprecated and will be removed in a future release.

Compatibility¶

This release upgrades to DataFusion 52.3, Arrow 57.3, and iceberg-rust 0.9.0. Published binaries now target x86-64-v3 and neoverse-n1 CPU architectures for improved performance on modern hardware.

Supported platforms include Spark 3.4.3, 3.5.4-3.5.8, and Spark 4.0.x with various JDK and Scala combinations.

The community encourages users to test Comet with existing Spark workloads and welcomes contributions to ongoing development.

Optimizing SQL CASE Expression Evaluation

2026-02-02T00:00:00+00:00

SQL's CASE expression is one of the few explicit conditional evaluation constructs the language provides. It allows you to control which expression from a set of expressions is evaluated for each row based on arbitrary boolean expressions. Its deceptively simple syntax hides significant implementation complexity. Over the past few Apache DataFusion releases, a series of improvements to the CASE expression evaluator have been merged that reduce both CPU time and memory allocations. This post provides an overview of the original implementation, its performance bottlenecks, and the steps taken to address them.

Background: CASE Expression Evaluation¶

SQL supports two forms of CASE expressions:

Simple: CASE expr WHEN value1 THEN result1 WHEN value2 THEN result2 ... END
Searched: CASE WHEN condition1 THEN result1 WHEN condition2 THEN result2 ... END

The simple form evaluates an expression once for each input row and then tests that value against the expressions (typically constants) in each WHEN clause using equality comparisons.

Here's an example of the simple form:

CASE status
    WHEN 'pending' THEN 1
    WHEN 'active' THEN 2
    WHEN 'complete' THEN 3
    ELSE 0
END

In this CASE expression, status is evaluated once per row, and then its value is tested for equality with the values 'pending', 'active', and 'complete' in that order. The CASE expression evaluates to the value of the THEN expression corresponding to the first matching WHEN expression.

The searched CASE form is a more flexible variant. It evaluates completely independent boolean expressions for each branch. This allows you to test different columns with different operators per branch as shown in the following example:

CASE
    WHEN age > 65 THEN 'senior'
    WHEN childCount != 0 THEN 'parent'
    WHEN age < 21 THEN 'minor'
    ELSE 'adult'
END

In both forms, branches are evaluated sequentially with short-circuit semantics: for each row, once a WHEN condition matches, the corresponding THEN expression is evaluated. Any further branches are not evaluated for that row. This lazy evaluation model is critical for correctness. It lets you safely write CASE expressions like CASE WHEN d != 0 THEN n / d ELSE NULL END that are guaranteed to not trigger divide-by-zero errors.

Besides CASE, there are a few conditional scalar functions that provide similar, more restricted capabilities. These include COALESCE, IFNULL, and NVL2. You can consider each of these functions as the equivalent of a macro for CASE. For example, COALESCE(expr1, expr2, expr3) expands to:

CASE
  WHEN expr1 IS NOT NULL THEN expr1
  WHEN expr2 IS NOT NULL THEN expr2
  ELSE expr3
END

Since Apache DataFusion rewrites these conditional functions to their equivalent CASE expression, any optimizations related to CASE described in this post also apply to conditional function evaluation.

`CASE` Evaluation in DataFusion 50.0.0¶

For the remainder of this post, we'll be looking at 'searched CASE' evaluation. 'Simple CASE' uses a distinct, but very similar implementation. The same set of improvements has been applied to both.

The baseline implementation in DataFusion 50.0.0 evaluated CASE using a common, straightforward approach:

Start with an output array out with the same length as the input batch, filled with nulls. Additionally, create a bit vector remainder with the same length and each value set to true.
For each WHEN/THEN branch:
- Evaluate the WHEN condition for the remaining unmatched rows using PhysicalExpr::evaluate_selection, passing in the input batch and the remainder mask.
- If any rows matched, evaluate the THEN expression for those rows using PhysicalExpr::evaluate_selection.
- Merge the results into the out array using the zip kernel.
- Update the remainder mask to exclude the matched rows.
If there's an ELSE clause, evaluate it for any remaining unmatched rows and merge using zip.

Here's a simplified version of the Rust code for the original loop:

let mut out = new_null_array(&return_type, batch.num_rows());
let mut remainder = BooleanArray::from(vec![true; batch.num_rows()]);

for (when_expr, then_expr) in &self.when_then_expr {
    // Determine for which remaining rows the WHEN condition matches
    let when = when_expr.evaluate_selection(batch, &remainder)?
        .into_array(batch.num_rows())?;
    // Ensure any `NULL` values are treated as false
    let when_and_rem = and(&when, &remainder)?;

    if when_and_rem.true_count() == 0 {
        continue;
    }

    // Evaluate the THEN expression for matching rows
    let then_value = then_expr.evaluate_selection(batch, &when_and_rem)?;
    // Merge results into output array
    out = zip(&when_and_rem, &then_value, &out)?;
    // Update remainder mask to exclude matched rows
    remainder = and_not(&remainder, &when_and_rem)?;
}

Let's examine one iteration of this loop for the following CASE expression:

CASE
    WHEN col = 'b' THEN 100
    ELSE 200
END

Schematically, it will look as follows:

One iteration of the `CASE` evaluation loop

This implementation works perfectly fine, but there's significant room for optimization, mostly related to the usage of evaluate_selection. To understand why, we need to dig a little deeper into the implementation of that function. Here's a simplified version of it that captures the relevant parts:

pub trait PhysicalExpr {
    fn evaluate_selection(
        &self,
        batch: &RecordBatch,
        selection: &BooleanArray,
    ) -> Result<ColumnarValue> {
        // Reduce record batch to only include rows that match selection
        let filtered_batch = filter_record_batch(batch, selection)?;
        // Perform regular evaluation on filtered batch
        let filtered_result = self.evaluate(&filtered_batch)?;
        // Expand result array to match original batch length
        scatter(selection, filtered_result)
    }
}

Going back to the same example as before, the data flow in evaluate_selection looks like this:

evaluate_selection data flow

The evaluate_selection method first filters the input batch to only include rows that match the selection mask. It then calls the regular evaluate method using the filtered batch as input. Finally, to return a result array with the same number of rows as batch, the scatter function is called. This function produces a new array padded with null values for any rows that didn't match the selection mask.
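The filter, evaluate, scatter sequence can be modeled on plain vectors. The sketch below is an assumption-laden simplification: it uses `Vec<i64>` instead of Arrow arrays, a hypothetical expression `x * 10` instead of a real `PhysicalExpr`, and its function names only mirror the steps described above.

```rust
/// Keeps only the values whose selection bit is set.
fn filter(values: &[i64], selection: &[bool]) -> Vec<i64> {
    values
        .iter()
        .zip(selection)
        .filter(|(_, keep)| **keep)
        .map(|(v, _)| *v)
        .collect()
}

/// Expands a compact result back to the selection's length, padding with nulls.
fn scatter(selection: &[bool], compact: &[i64]) -> Vec<Option<i64>> {
    let mut next = compact.iter();
    selection
        .iter()
        .map(|&keep| if keep { next.next().copied() } else { None })
        .collect()
}

/// Models the three steps for a hypothetical expression `x * 10`.
fn evaluate_selection(values: &[i64], selection: &[bool]) -> Vec<Option<i64>> {
    // 1. Reduce the input to the selected rows.
    let filtered = filter(values, selection);
    // 2. Evaluate the expression on the reduced input.
    let evaluated: Vec<i64> = filtered.iter().map(|x| x * 10).collect();
    // 3. Expand the result back to the original length.
    scatter(selection, &evaluated)
}
```

Each call allocates three intermediate vectors (the filtered input, the evaluated result, and the scattered output), which is the per-branch cost the optimizations below target.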

So how can we improve the performance of the simple evaluation strategy and use of evaluate_selection?

Opportunity 1: Early Exit¶

The CASE evaluation loop always iterates through all branches, even when every row has already been matched. In queries where early branches match all rows, this results in unnecessary work being done for the remaining branches.

Opportunity 2: Optimize Repeated Filtering, Scattering, and Merging¶

Each iteration performs a number of operations that are very well-optimized, but still take up a significant amount of CPU time:

Filtering: PhysicalExpr::evaluate_selection filters the entire RecordBatch for each branch. For the WHEN expression, this is done even if the selection mask was entirely empty.
Scattering: PhysicalExpr::evaluate_selection scatters the filtered result back to the original RecordBatch length.
Merging: The zip kernel is called once per branch to merge partial results into the output array.

Each of these operations needs to allocate memory for new arrays and shuffle quite a bit of data around.

Opportunity 3: Filter only Necessary Columns¶

The PhysicalExpr::evaluate_selection method filters the entire record batch, including columns that the current branch's WHEN and THEN expressions don't reference. For wide tables (many columns) with narrow expressions (few column references), this is wasteful.

Suppose you have a table with 26 columns named a through z, and the following CASE expression:

CASE
  WHEN a > 1000 THEN 'large'
  WHEN a >= 0 THEN 'positive'
  ELSE 'negative'
END

The implementation would filter all 26 columns even though only a single column is needed for the entire CASE expression evaluation. Again this involves a non-negligible amount of allocation and data copying.

Performance Optimizations¶

Optimization 1: Short-Circuit Early Exit¶

The first optimization is straightforward. As soon as we detect that all rows of the batch have been matched, we break out of the evaluation loop:

let mut remainder_count = batch.num_rows();

for (when_expr, then_expr) in &self.when_then_expr {
    if remainder_count == 0 {
        break;  // All rows matched, exit early
    }

    // ... evaluate branch ...

    let when_match_count = when_value.true_count();
    remainder_count -= when_match_count;
}

Additionally, we avoid evaluating the ELSE clause when no rows remain:

if let Some(else_expr) = &self.else_expr {
    remainder = or(&base_nulls, &remainder)?;
    if remainder.true_count() > 0 {
        // ... evaluate else ...
    }
}

For queries where early branches match all rows, this eliminates unnecessary branch evaluations and ELSE clause processing.

This optimization was implemented by Pepijn Van Eeckhoudt (@pepijnve) in PR #17898

Optimization 2: Optimized Result Merging¶

The second optimization fundamentally restructures how the results of each loop iteration will be merged. The diagram below illustrates the optimized data flow when evaluating the CASE WHEN col = 'b' THEN 100 ELSE 200 END from before:

optimized evaluation loop

In the reworked implementation, the evaluate_selection function is no longer used. The key insight is that we can defer all merging until the end of the evaluation loop by tracking result provenance. This was implemented with the following changes:

Augment the input batch with a column containing row indices.
Reduce the augmented batch after each loop iteration to only contain the remaining rows.
Use the row index column to track which partial result array contains the value for each row.
Perform a single merge operation at the end instead of a zip operation after each loop iteration.

These changes make it unnecessary to scatter and zip results in each loop iteration. Instead, when all rows have been matched, we then merge the partial results using arrow_select::merge::merge_n.

The diagram below illustrates how merge_n works for an example where three WHEN/THEN branches produced results. The first branch produced the result A for row 2, the second produced B for row 1, and the third produced C and D for rows 4 and 5.

merge_n example

The merge_n algorithm scans through the indices array. For each non-empty cell, it takes one value from the corresponding values array. In the example above, we first encounter 1. This takes the first element from the values array with index 1, resulting in B. The next cell contains 0 which takes A, from the first array. Finally, we encounter 2 twice. This takes the first and second element from the last values array respectively.
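The scan described above can be sketched on plain vectors. This is a simplified model, not the signature of `arrow_select::merge::merge_n`: `merge_n_sketch` is a hypothetical name, and empty cells are represented as `None`.

```rust
/// Merges partial results: `indices[row]` names which values array holds the
/// value for that row, or is `None` if no branch produced one.
/// (Simplified model for illustration; not the arrow-rs API.)
fn merge_n_sketch<T: Clone>(values: &[Vec<T>], indices: &[Option<usize>]) -> Vec<Option<T>> {
    // One read cursor per values array; each array is consumed front to back.
    let mut cursors = vec![0usize; values.len()];
    indices
        .iter()
        .copied()
        .map(|slot| {
            slot.map(|array_idx| {
                let value = values[array_idx][cursors[array_idx]].clone();
                cursors[array_idx] += 1;
                value
            })
        })
        .collect()
}
```

Assuming the third row in the example was matched by no branch, the values arrays `[A]`, `[B]`, `[C, D]` with indices `1, 0, (empty), 2, 2` merge to `B, A, null, C, D`.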

This algorithm was initially implemented in DataFusion for the CASE implementation, but has since been generalized and moved into the arrow-rs crate as arrow_select::merge::merge_n.

This optimization was implemented by Pepijn Van Eeckhoudt (@pepijnve) in PR #18152

Optimization 3: Column Projection¶

The third optimization addresses the "filtering unused columns" overhead through projection.

Look at the following query example where the mailing_address table has the columns name, surname, street, number, city, state, country:

SELECT *, CASE WHEN country = 'USA' THEN state ELSE country END AS region
FROM mailing_address

You can see that the CASE expression only references the columns country and state, but because all columns are being queried, projection pushdown cannot reduce the number of columns being fed into the projection operator.

CASE evaluation without projection

During CASE evaluation, the batch must be filtered using the WHEN expression to evaluate the THEN expression values. As the diagram above shows, this filtering creates a reduced copy of all columns.

This unnecessary copying can be avoided by first narrowing the batch to only include the columns that are actually needed.

CASE evaluation with projection

At first glance, this might not seem beneficial, since we're introducing an additional processing step. Luckily projection of a record batch only requires a shallow copy of the record batch. The column arrays themselves are not copied, and the only work that is actually done is incrementing the reference counts of the columns.
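The shallow-copy behavior can be demonstrated with a toy batch whose columns are reference-counted. The `Batch` struct and `project` function below are hypothetical stand-ins for a record batch and its projection, assumed only for this sketch.

```rust
use std::sync::Arc;

/// A toy record batch: each column is a reference-counted array.
struct Batch {
    columns: Vec<Arc<Vec<i64>>>,
}

/// Narrows the batch to the given column indexes without copying column data.
fn project(batch: &Batch, indexes: &[usize]) -> Batch {
    Batch {
        // `Arc::clone` copies the pointer and bumps the reference count;
        // the underlying `Vec<i64>` is shared, not duplicated.
        columns: indexes
            .iter()
            .map(|&i| Arc::clone(&batch.columns[i]))
            .collect(),
    }
}
```

After projecting a single column, `Arc::ptr_eq` on the projected column and the original column returns `true`: both batches point at the same buffer, and only the reference count has changed.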

Impact: For wide tables with narrow CASE expressions, this dramatically reduces filtering overhead by removing the copying of unused columns.

This optimization was implemented by Pepijn Van Eeckhoudt (@pepijnve) in PR #18329

Optimization 4: Eliminating Scatter in Two-Branch Case¶

Some of the earlier examples in this post use expressions of the form CASE WHEN condition THEN expr1 ELSE expr2 END to explain how the general evaluation loop works. For this kind of two-branch CASE expression, Apache DataFusion has a more optimized implementation that unrolls the loop. This specialized ExpressionOrExpression fast path still used evaluate_selection() for both branches, relying on scatter and zip to combine the results, and so incurred the same performance overhead as the general implementation.

The revised implementation eliminates the use of evaluate_selection as follows:

// Compute the `WHEN` condition for the entire batch
let when_filter = create_filter(&when_value);

// Compute a compact array of `THEN` values for the matching rows
let then_batch = filter_record_batch(batch, &when_filter)?;
let then_value = then_expr.evaluate(&then_batch)?;

// Compute a compact array of `ELSE` values for the non-matching rows
let else_filter = create_filter(&not(&when_value)?);
let else_batch = filter_record_batch(batch, &else_filter)?;
let else_value = else_expr.evaluate(&else_batch)?;

This produces two compact arrays, one for the THEN values and one for the ELSE values, which are then merged with the merge function. In contrast to zip, merge does not require both of its value inputs to have the same length. Instead it requires that the sum of the length of the value inputs matches the length of the mask array.

merge example
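The difference from zip can be seen in a small model that works on plain slices. This is a sketch under stated assumptions: `merge_sketch` is a hypothetical name, not the signature of `arrow_select::merge::merge`, and null handling in the mask is ignored.

```rust
/// Interleaves two compact arrays according to `mask`: a `true` slot takes the
/// next THEN value, a `false` slot takes the next ELSE value.
/// (Simplified model for illustration; not the arrow-rs API.)
fn merge_sketch<T: Clone>(mask: &[bool], then_values: &[T], else_values: &[T]) -> Vec<T> {
    // Unlike zip, the inputs are compact: their lengths add up to the mask length.
    assert_eq!(mask.len(), then_values.len() + else_values.len());
    let mut then_iter = then_values.iter();
    let mut else_iter = else_values.iter();
    mask.iter()
        .map(|&is_then| {
            let next = if is_then {
                then_iter.next()
            } else {
                else_iter.next()
            };
            next.expect("mask does not agree with input lengths").clone()
        })
        .collect()
}
```

For CASE WHEN col = 'b' THEN 100 ELSE 200 END on a three-row batch where only the second row matches, the mask is `false, true, false`, the THEN array holds one value, and the ELSE array holds two.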

This eliminates unnecessary scatter operations and memory allocations for one of the most common CASE expression patterns.

Just like merge_n, this operation has been moved into arrow-rs as arrow_select::merge::merge.

This optimization was implemented by Pepijn Van Eeckhoudt (@pepijnve) in PR #18444

Optimization 5: Table Lookup of Constants¶

Up until now, we've discussed the implementations for generic CASE expressions that use non-constant expressions for both WHEN and THEN. Another common use of CASE is to perform a mapping from one set of constants to another. For instance, you can expand numeric constants to human-readable strings using the following CASE expression:

CASE status
  WHEN 0 THEN 'idle'
  WHEN 1 THEN 'running'
  WHEN 2 THEN 'paused'
  WHEN 3 THEN 'stopped'
  ELSE 'unknown'
END

A final CASE optimization recognizes this pattern and compiles the CASE expression into a hash table. Rather than evaluating the WHEN and THEN expressions, the input expression is evaluated once, and the result array is computed using a vectorized hash table lookup. This approach avoids the need to filter the input batch and combine partial results entirely. The result array is computed in a single pass over the input values, and the computation time does not grow significantly with the number of WHEN branches in the CASE expression.
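The idea can be sketched with a standard-library hash map. This is only an illustration of the technique: `lookup_case` is a hypothetical function, the table is rebuilt on every call for brevity, and DataFusion's actual implementation operates on Arrow arrays rather than slices.

```rust
use std::collections::HashMap;

/// Maps each status code to its label in a single pass over the input.
/// The table is built per call here for brevity; a real implementation
/// would build it once when the expression is planned.
fn lookup_case(status: &[i32]) -> Vec<&'static str> {
    let table: HashMap<i32, &'static str> = HashMap::from([
        (0, "idle"),
        (1, "running"),
        (2, "paused"),
        (3, "stopped"),
    ]);
    status
        .iter()
        // Values without a WHEN entry fall through to the ELSE result.
        .map(|s| table.get(s).copied().unwrap_or("unknown"))
        .collect()
}
```

Each input value costs one hash lookup regardless of how many WHEN branches the expression has, which is why the evaluation time stays roughly flat as branches are added.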

This optimization was implemented by Raz Luvaton (@rluvaton) in PR #18183

Results¶

The degree to which the performance optimizations described in this post will benefit your queries is highly dependent on both your data and your queries. To give some idea of the impact, we ran the following query on the TPC-H orders table with a scale factor of 100:

SELECT
    *,
    case o_orderstatus
        when 'O' then 'ordered'
        when 'F' then 'filled'
        when 'P' then 'pending'
        else 'other'
    end
from orders

This query was first run with DataFusion 50.0.0 to get a baseline measurement. The same query was then run with each optimization applied in turn. The recorded times are presented as the blue series in the chart below. The green series shows the time for a plain SELECT * FROM orders, to give an idea of the cost that adding a CASE expression to a query incurs. All measurements were made with a target partition count of 1.

Performance measurements

What you can see in the chart is that the effect of the various optimizations compounds up to the project measurement. Up to that point these results are applicable to any CASE expression. The final improvement in the hash measurement is only applicable to simple CASE expressions with constant WHEN and THEN expressions.

The cumulative effect of these optimizations is a 63-71% reduction in CPU time spent evaluating CASE expressions compared to the baseline.

Summary¶

Through a number of targeted optimizations, we've transformed CASE expression evaluation from a simple, but unoptimized implementation into a highly optimized one. The optimizations described in this post compound: a CASE expression on a wide table with multiple branches and early matches benefits from several of them simultaneously. The result is significantly reduced CPU time and memory allocation in SQL constructs that are essential for ETL-like queries.

About DataFusion¶

Apache DataFusion is an extensible query engine, written in Rust, that uses Apache Arrow as its in-memory format. DataFusion is used by developers to create new, fast, data-centric systems such as databases, dataframe libraries, and machine learning and streaming applications. While DataFusion’s primary design goal is to accelerate the creation of other data-centric systems, it provides a reasonable experience directly out of the box as a dataframe library, Python library, and command-line SQL tool.

DataFusion's core thesis is that, as a community, together we can build much more advanced technology than any of us as individuals or companies could build alone. Without DataFusion, highly performant vectorized query engines would remain the domain of a few large companies and world-class research institutions. With DataFusion, we can all build on top of a shared foundation and focus on what makes our projects unique.

How to Get Involved¶

DataFusion is not a project built or driven by a single person, company, or foundation. Rather, our community of users and contributors works together to build a shared technology that none of us could have built alone.

Apache DataFusion Comet 0.13.0 Release

2026-01-30T00:00:00+00:00

The Apache DataFusion PMC is pleased to announce version 0.13.0 of the Comet subproject.

Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for improved performance and efficiency without requiring any code changes.

This release covers approximately eight weeks of development work and is the result of merging 169 PRs from 15 contributors. See the change log for more information.

Key Features¶

Native Parquet Write Support (Experimental)¶

This release introduces experimental native Parquet write capabilities, allowing Comet to intercept and execute Parquet write operations natively through DataFusion. Key capabilities include:

File commit protocol support for reliable writes
Remote HDFS writing via OpenDAL integration
Complex type support (arrays, maps, structs)
Proper handling of object store settings

To enable native Parquet writes, set:

spark.comet.allowIncompatibleOp.DataWritingCommandExec=true
spark.comet.parquet.write.enabled=true

Note: This feature is highly experimental and should not be used in production environments. It is currently categorized as a testing feature and is disabled by default.

Native Iceberg Improvements¶

Comet's fully-native Iceberg integration received significant enhancements in this release:

REST Catalog Support: Native Iceberg scans now support REST catalogs, enabling integration with catalog services like Apache Polaris and Tabular. Configure with:

--conf spark.sql.catalog.rest_cat=org.apache.iceberg.spark.SparkCatalog
--conf spark.sql.catalog.rest_cat.catalog-impl=org.apache.iceberg.rest.RESTCatalog
--conf spark.sql.catalog.rest_cat.uri=http://localhost:8181
--conf spark.comet.scan.icebergNative.enabled=true

Session Token Authentication: Added support for session tokens in native Iceberg scans for secure S3 access.

Performance Optimizations:

Deduplicated serialized metadata reducing memory overhead
Switched from JSON to protobuf for partition value serialization
Removed IcebergFileStream in favor of iceberg-rust's built-in parallelization
Reduced metadata serialization points
Added SchemaAdapter caching

To enable fully-native Iceberg scanning:

spark.comet.scan.icebergNative.enabled=true

The native reader supports Iceberg table spec v1 and v2, all primitive and complex types, schema evolution, time travel, positional and equality deletes, filter pushdown, and various storage backends (local, HDFS, S3).

Native CSV Reading (Experimental)¶

Experimental support for native CSV file reading has been added, expanding Comet's file format capabilities beyond Parquet.

New Expressions¶

The release adds support for numerous expressions:

Array functions: explode, explode_outer, size
Date/time functions: unix_date, date_format, datediff, last_day, unix_timestamp
String functions: left
JSON functions: from_json (partial support)

ANSI Mode Support¶

Sum and average aggregate expressions now support ANSI mode for both integer and decimal inputs, enabling overflow checking in strict SQL mode.
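The difference between the two modes can be sketched in plain Rust. This is not Comet's implementation; the function names are hypothetical, and the sketch assumes that the non-ANSI integer sum wraps silently on overflow while ANSI mode reports an error.

```rust
/// Non-ANSI style sum: overflow wraps around silently (assumed legacy behavior).
fn sum_legacy(values: &[i64]) -> i64 {
    values.iter().fold(0i64, |acc, v| acc.wrapping_add(*v))
}

/// ANSI style sum: overflow is surfaced as an error instead of wrapping.
fn sum_ansi(values: &[i64]) -> Result<i64, String> {
    values.iter().try_fold(0i64, |acc, v| {
        acc.checked_add(*v)
            .ok_or_else(|| format!("arithmetic overflow adding {} to {}", v, acc))
    })
}
```

With the input `[i64::MAX, 1]`, `sum_legacy` wraps to `i64::MIN`, while `sum_ansi` returns an `Err`.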

Native Shuffle Improvements¶

Round-robin partitioning is now supported in native shuffle
Spill metrics are now reported correctly
Configurable shuffle writer buffer size via spark.comet.shuffle.write.bufferSize

Performance Improvements¶

This release includes extensive performance optimizations:

String to integer casting: Significant speedups through optimized parsing
String functions: Optimized lpad/rpad to remove unnecessary memory allocations
Date operations: Improved normalize_nan and date truncate performance
Query planning: Cached query plans to avoid per-partition serialization overhead
Memory efficiency: Reduced GC pressure in protobuf serialization
Hash operations: Optimized complex-type hash implementations including murmur3 support for nested types
Runtime efficiency: Eliminated busy-polling of Tokio stream for plans without CometScan
Metrics overhead: Reduced timer and syscall overhead in native shuffle writer

Deprecations¶

The native_comet scan mode is now deprecated in favor of native_iceberg_compat and will be removed in a future release. The auto scan mode no longer falls back to native_comet.

Compatibility¶

This release upgrades to DataFusion 51, Arrow 57, and the latest iceberg-rust. The minimum supported Rust version is now 1.88.

Supported platforms include Spark 3.4.3, 3.5.4-3.5.7, and Spark 4.0.x with various JDK and Scala combinations.

The community encourages users to test Comet with existing Spark workloads and welcomes contributions to ongoing development.

Apache DataFusion 52.0.0 Released

2026-01-12T00:00:00+00:00

We are proud to announce the release of DataFusion 52.0.0. This post highlights some of the major improvements since DataFusion 51.0.0. The complete list of changes is available in the changelog. Thanks to the 121 contributors for making this release possible.

Performance Improvements 🚀¶

We continue to make significant performance improvements in DataFusion as explained below.

Faster `CASE` Expressions¶

DataFusion 52 adds lookup-table-based evaluation for certain CASE expressions. This avoids repeated evaluation and accelerates common ETL patterns such as:

CASE company
    WHEN 1 THEN 'Apple'
    WHEN 5 THEN 'Samsung'
    WHEN 2 THEN 'Motorola'
    WHEN 3 THEN 'LG'
    ELSE 'Other'
END

This is the final work in our CASE performance epic (#18075), which has improved CASE evaluation significantly. Related PR: #18183. Thanks to rluvaton and pepijnve for the implementation. See the Optimizing SQL CASE Expression Evaluation blog post for more details.
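As a rough illustration of the idea (not DataFusion's actual code), the sketch below contrasts testing each WHEN branch in order with a single hash probe into a table built once. The function names are made up for this example.

```rust
use std::collections::HashMap;

/// Baseline: compare the input against each WHEN literal in order.
fn case_sequential(company: i64) -> &'static str {
    let branches = [(1, "Apple"), (5, "Samsung"), (2, "Motorola"), (3, "LG")];
    for (when, then) in branches {
        if company == when {
            return then;
        }
    }
    "Other"
}

/// Lookup-table variant: build the literal -> result map once,
/// then do a single hash probe per row.
fn case_lookup(rows: &[i64]) -> Vec<&'static str> {
    let table: HashMap<i64, &'static str> =
        HashMap::from([(1, "Apple"), (5, "Samsung"), (2, "Motorola"), (3, "LG")]);
    rows.iter()
        .map(|c| table.get(c).copied().unwrap_or("Other"))
        .collect()
}
```

Both functions return the same answers. The lookup version does a constant amount of work per row regardless of how many WHEN branches there are.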

`MIN`/`MAX` Aggregate Dynamic Filters¶

DataFusion now creates dynamic filters for queries with MIN/MAX aggregates that have filters, but no GROUP BY. These dynamic filters are used during scan to prune files and rows as tighter bounds are discovered during execution, as explained in the Dynamic Filtering Blog. For example, the following query:

SELECT min(l_shipdate)
FROM lineitem
WHERE l_returnflag = 'R';

is now executed like this:

SELECT min(l_shipdate)
FROM lineitem
--  '__current_min' is updated dynamically during execution
WHERE l_returnflag = 'R' AND l_shipdate < __current_min;

Thanks to 2010YOUY01 for implementing this feature, with reviews from martin-g, adriangb, and LiaCastaneda. Related PRs: #18644
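To make the pruning concrete, here is a minimal sketch under simplifying assumptions: each file carries a minimum-value statistic for the column, files are scanned one after another, and the `FileChunk` type is hypothetical. A file is skipped when its minimum cannot beat the running minimum.

```rust
/// Hypothetical per-file statistic plus the values that pass the WHERE filter.
struct FileChunk {
    min_shipdate: i32,
    shipdates: Vec<i32>,
}

/// Returns the minimum value and the number of files actually scanned.
fn dynamic_min(files: &[FileChunk]) -> (Option<i32>, usize) {
    let mut current_min: Option<i32> = None;
    let mut scanned = 0;
    for file in files {
        // Dynamic filter: no row in this file can satisfy `l_shipdate < current_min`.
        if let Some(bound) = current_min {
            if file.min_shipdate >= bound {
                continue;
            }
        }
        scanned += 1;
        for &d in &file.shipdates {
            if current_min.map_or(true, |m| d < m) {
                current_min = Some(d);
            }
        }
    }
    (current_min, scanned)
}
```

The bound only tightens as execution proceeds, so files seen later are more likely to be skipped.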

New Merge Join¶

DataFusion 52 includes a rewrite of the sort-merge join (SMJ) operator, with speedups of three orders of magnitude in some pathological cases such as the case in #18487, which also affected Apache Comet workloads. Benchmarks in #18875 show dramatic gains for TPC-H Q21 (minutes to milliseconds) while leaving other queries unchanged or modestly faster. Thanks to mbutrovich for the implementation and reviews from Dandandan.

Caching Improvements¶

This release also includes several additional caching improvements.

A new statistics cache for File Metadata avoids repeatedly (re)calculating statistics for files. This significantly improves planning time for certain queries. You can see the contents of the new cache using the statistics_cache function in the CLI:

select * from statistics_cache();
+------------------+---------------------+-----------------+------------------------+---------+-----------------+-------------+--------------------+-----------------------+
| path             | file_modified       | file_size_bytes | e_tag                  | version | num_rows        | num_columns | table_size_bytes   | statistics_size_bytes |
+------------------+---------------------+-----------------+------------------------+---------+-----------------+-------------+--------------------+-----------------------+
| .../hits.parquet | 2022-06-25T22:22:22 | 14779976446     | 0-5e24d1ee16380-370f48 | NULL    | Exact(99997497) | 105         | Exact(36445943240) | 0                     |
+------------------+---------------------+-----------------+------------------------+---------+-----------------+-------------+--------------------+-----------------------+

Thanks to bharath-techie and nuno-faria for implementing the statistics cache, with reviews from martin-g, alamb, and alchemist51. Related PRs: #18971, #19054

A prefix-aware list-files cache accelerates evaluating partition predicates for Hive partitioned tables.

-- Read the hive partitioned dataset from Overture Maps (100s of Parquet files)
CREATE EXTERNAL TABLE overturemaps
STORED AS PARQUET LOCATION 's3://overturemaps-us-west-2/release/2025-12-17.0/';
-- Find all files where the path contains `theme=base` without requiring another LIST call
select count(*) from overturemaps where theme='base';

You can see the contents of the new cache using the list_files_cache function in the CLI:

create external table overturemaps
stored as parquet
location 's3://overturemaps-us-west-2/release/2025-12-17.0/theme=base/type=infrastructure';
0 row(s) fetched.
> select table, path, metadata_size_bytes, expires_in, unnest(metadata_list)['file_size_bytes'] as file_size_bytes, unnest(metadata_list)['e_tag'] as e_tag from list_files_cache() limit 10;
+--------------+-----------------------------------------------------+---------------------+-----------------------------------+-----------------+---------------------------------------+
| table        | path                                                | metadata_size_bytes | expires_in                        | file_size_bytes | e_tag                                 |
+--------------+-----------------------------------------------------+---------------------+-----------------------------------+-----------------+---------------------------------------+
| overturemaps | release/2025-12-17.0/theme=base/type=infrastructure | 2750                | 0 days 0 hours 0 mins 25.264 secs | 999055952       | "35fc8fbe8400960b54c66fbb408c48e8-60" |
| overturemaps | release/2025-12-17.0/theme=base/type=infrastructure | 2750                | 0 days 0 hours 0 mins 25.264 secs | 975592768       | "8a16e10b722681cdc00242564b502965-59" |
...
| overturemaps | release/2025-12-17.0/theme=base/type=infrastructure | 2750                | 0 days 0 hours 0 mins 25.264 secs | 1016732378      | "6d70857a0473ed9ed3fc6e149814168b-61" |
| overturemaps | release/2025-12-17.0/theme=base/type=infrastructure | 2750                | 0 days 0 hours 0 mins 25.264 secs | 991363784       | "c9cafb42fcbb413f851691c895dd7c2b-60" |
| overturemaps | release/2025-12-17.0/theme=base/type=infrastructure | 2750                | 0 days 0 hours 0 mins 25.264 secs | 1032469715      | "7540252d0d67158297a67038a3365e0f-62" |
+--------------+-----------------------------------------------------+---------------------+-----------------------------------+-----------------+---------------------------------------+

Thanks to BlakeOrth and Yuvraj-cyborg for implementing the list-files cache work, with reviews from gabotechs, alamb, alchemist51, martin-g, and BlakeOrth. Related PRs: #18146, #18855, #19366, #19298

Improved Hash Join Filter Pushdown¶

Starting in DataFusion 51, filtering information from HashJoinExec is passed dynamically to scans, as explained in the Dynamic Filtering Blog using a technique referred to as Sideways Information Passing in Database research literature. The initial implementation passed min/max values for the join keys. DataFusion 52 extends the optimization (#17171 / #18393) to pass the contents of the build side hash map. These filters are evaluated on the probe side scan to prune files, row groups, and individual rows. When the build side contains 20 or fewer rows (configurable) the contents of the hash map are transformed to an IN expression and used for statistics-based pruning which can avoid reading entire files or row groups that contain no matching join keys. Thanks to adriangb for implementing this feature, with reviews from LiaCastaneda, asolimando, comphead, and mbutrovich.
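The statistics-based pruning step can be sketched as follows. This is an illustration only: `RowGroupStats` and `prune_row_groups` are hypothetical names, and the keys are plain integers.

```rust
/// Hypothetical min/max statistics for the join key column of one row group.
struct RowGroupStats {
    min_key: i64,
    max_key: i64,
}

/// Keep only the row groups whose [min, max] range could contain a build-side key.
/// This is the effect of evaluating `key IN (build_keys...)` against statistics.
fn prune_row_groups(build_keys: &[i64], groups: &[RowGroupStats]) -> Vec<usize> {
    groups
        .iter()
        .enumerate()
        .filter(|(_, g)| build_keys.iter().any(|k| g.min_key <= *k && *k <= g.max_key))
        .map(|(i, _)| i)
        .collect()
}
```

A min/max-only filter would keep every row group between the smallest and largest build key. The IN list can also skip row groups that fall in the gaps between keys.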

Major Features ✨¶

Arrow IPC Stream file support¶

DataFusion can now read Arrow IPC stream files (#18457). This expands interoperability with systems that emit Arrow streams directly, making it simpler to ingest Arrow-native data without conversion. Thanks to corasaurus-hex for implementing this feature, with reviews from martin-g, Jefffrey, jdcasale, 2010YOUY01, and timsaucer.

CREATE EXTERNAL TABLE ipc_events
STORED AS ARROW
LOCATION 's3://bucket/events.arrow';

Related PRs: #18457

More Extensible SQL Planning with `RelationPlanner`¶

DataFusion now has an API for extending the SQL planner for relations, as explained in the Extending SQL in DataFusion Blog. In addition to the existing expression and types extension points, this new API now allows extending FROM clauses. Using these APIs it is straightforward to provide SQL support for almost any dialect, including vendor-specific syntax. Example use cases include:

-- Postgres-style JSON operators
SELECT payload->'user'->>'id' FROM logs;
-- MySQL-specific types
SELECT DATETIME '2001-01-01 18:00:00';
-- Statistical sampling
SELECT * FROM sensor_data TABLESAMPLE BERNOULLI(10 PERCENT);

Thanks to geoffreyclaude for implementing relation planner extensions, and to theirix, alamb, NGA-TRAN, and gabotechs for reviews and feedback on the design. Related PRs: #17843

Expression Evaluation Pushdown to Scans¶

DataFusion now pushes down expression evaluation into TableProviders using PhysicalExprAdapter, replacing the older SchemaAdapter approach (#14993, #16800). Predicates and expressions can now be customized for each individual file schema, opening up additional optimizations such as support for Variant shredding. Thanks to adriangb for implementing PhysicalExprAdapter and reworking pushdown to use it. Related PRs: #18998, #19345

Sort Pushdown to Scans¶

DataFusion can now push sorts into data sources (#10433, #19064). This allows table provider implementations to optimize based on sort knowledge for certain query patterns. For example, the provided Parquet data source now reverses the scan order of row groups and files when queried for the opposite of the file's natural sort (e.g. DESC when the files are sorted ASC). This reversal, combined with dynamic filtering, allows top-K queries with LIMIT on pre-sorted data to find the requested rows very quickly, pruning more files and row groups without even scanning them. We have seen a ~30x performance improvement on benchmark queries with pre-sorted data. Thanks to zhuqi-lucas and xudong963 for this feature, with reviews from martin-g, adriangb, and alamb.
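A simplified sketch of why reversing the scan helps top-K queries: assuming row groups are stored in ascending order and do not overlap, a `DESC ... LIMIT k` query can start from the last row group and stop as soon as it has `k` rows. The function below is illustrative and is not the Parquet reader's code.

```rust
/// Row groups are stored in ascending order and each inner Vec is sorted ascending.
/// Returns the top `k` values in descending order and how many row groups were read.
fn top_k_desc(row_groups: &[Vec<i32>], k: usize) -> (Vec<i32>, usize) {
    let mut out = Vec::with_capacity(k);
    let mut groups_read = 0;
    // Reverse the scan order: the last row group holds the largest values.
    for group in row_groups.iter().rev() {
        if out.len() == k {
            break;
        }
        groups_read += 1;
        let need = k - out.len();
        out.extend(group.iter().rev().take(need).copied());
    }
    (out, groups_read)
}
```

With three row groups and `k = 2`, only the last row group is read.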

`TableProvider` supports `DELETE` and `UPDATE` statements¶

The TableProvider trait now includes hooks for DELETE and UPDATE statements and the basic MemTable implements them (#19142). This lets downstream implementations and storage engines plug in their own mutation logic. See TableProvider::delete_from and TableProvider::update for more details.

Example:

DELETE FROM mem_table WHERE status = 'obsolete';

Thanks to ethan-tyler for the implementation and alamb and adriangb for reviews.

`CoalesceBatchesExec` Removed¶

The standalone CoalesceBatchesExec operator existed to ensure batches were large enough for subsequent vectorized execution, and was inserted after filter-like operators such as FilterExec, HashJoinExec, and RepartitionExec. However, using a separate operator also blocked other optimizations, such as pushing LIMIT through joins, and made optimizer rules more complex. In this release, we integrated the coalescing into the operators themselves (#18779) using Arrow's coalesce kernel. This reduces plan complexity while keeping batch sizes efficient, and allows additional focused optimization work in the Arrow kernel, such as Dandandan's recent work with filtering in arrow-rs/#8951.
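Conceptually, an operator that coalesces its own output buffers small batches until a target row count is reached. The sketch below shows that idea with plain vectors; the `Coalescer` type is hypothetical and does not reflect Arrow's kernel API.

```rust
/// Accumulates small batches and emits one only once `target` rows are buffered.
struct Coalescer {
    target: usize,
    buffer: Vec<i32>,
}

impl Coalescer {
    fn new(target: usize) -> Self {
        Coalescer { target, buffer: Vec::new() }
    }

    /// Push a (possibly tiny) batch; returns a combined batch once enough rows are buffered.
    fn push(&mut self, batch: Vec<i32>) -> Option<Vec<i32>> {
        self.buffer.extend(batch);
        if self.buffer.len() >= self.target {
            Some(std::mem::take(&mut self.buffer))
        } else {
            None
        }
    }

    /// Flush whatever remains at end of input.
    fn finish(&mut self) -> Option<Vec<i32>> {
        if self.buffer.is_empty() {
            None
        } else {
            Some(std::mem::take(&mut self.buffer))
        }
    }
}
```

A filter that produces batches of one or two rows would call `push` after each input batch and pass along only the combined batches, then call `finish` when its input is exhausted.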

Related PRs: #18540, #18604, #18630, #18972, #19002, #19342, #19239. Thanks to Tim-53, Dandandan, jizezhang, and feniljain for implementing this feature, with reviews from Jefffrey, alamb, martin-g, geoffreyclaude, milenkovicm, and jizezhang.

Upgrade Guide and Changelog¶

As always, upgrading to 52.0.0 should be straightforward for most users. Please review the Upgrade Guide for details on breaking changes and code snippets to help with the transition. For a comprehensive list of all changes, please refer to the changelog.

About DataFusion¶

Apache DataFusion is an extensible query engine, written in Rust, that uses Apache Arrow as its in-memory format. DataFusion is used by developers to create new, fast, data-centric systems such as databases, dataframe libraries, and machine learning and streaming applications. While DataFusion's primary design goal is to accelerate the creation of other data-centric systems, it provides a reasonable experience directly out of the box as a dataframe library, Python library, and command-line SQL tool.

How to Get Involved¶

Extending SQL in DataFusion: from ->> to TABLESAMPLE

2026-01-12T00:00:00+00:00

If you embed DataFusion in your product, your users will eventually run SQL that DataFusion does not recognize. Not because the query is unreasonable, but because SQL in practice includes many dialects and system-specific statements.

Suppose you store data as Parquet files on S3 and want users to attach an external catalog to query them. DataFusion has CREATE EXTERNAL TABLE for individual tables, but no built-in equivalent for catalogs. DuckDB has ATTACH, SQLite has its own variant, and maybe you really want something even more flexible:

CREATE EXTERNAL CATALOG my_lake
STORED AS iceberg
LOCATION 's3://my-bucket/warehouse'
OPTIONS ('region' 'eu-west-1');

This syntax does not exist in DataFusion today, but you can add it.

At the same time, many dialect gaps are smaller and show up in everyday queries:

-- Postgres-style JSON operators
SELECT payload->'user'->>'id' FROM logs;

-- MySQL-specific types
SELECT DATETIME '2001-01-01 18:00:00';

-- Statistical sampling
SELECT * FROM sensor_data TABLESAMPLE BERNOULLI(10 PERCENT);

You can implement all of these without forking DataFusion:

Parse new syntax (custom statements / dialect quirks)
Plan new semantics (expressions, types, FROM-clause constructs)
Execute new operators when rewrites are not sufficient

This post explains where and how to hook into each stage. For complete, working code, see the linked datafusion-examples.

Parse → Plan → Execute¶

DataFusion turns SQL into executable work in stages:

Parse: SQL text is parsed into an AST (Statement from sqlparser-rs)
Logical planning: SqlToRel converts the AST into a LogicalPlan
Physical planning: The PhysicalPlanner turns the logical plan into an ExecutionPlan

Each stage has extension points.

Figure 1: SQL flows through three stages: parsing, logical planning (via SqlToRel, where the Extension Planners hook in), and physical planning. Each stage has extension points: wrap the parser, implement planner traits, or add physical operators.

To choose the right extension point, look at where the query fails.

What fails?	What it looks like	Where to hook in
Parsing	`Expected: TABLE, found: CATALOG`	configure dialect or wrap `DFParser`
Planning	`This feature is not implemented: DATETIME`	`ExprPlanner`, `TypePlanner`, `RelationPlanner`
Execution	`No physical plan for TableSample`	`ExtensionPlanner` (+ physical operator)

We will follow that pipeline order.

1) Extending parsing: wrapping `DFParser` for custom statements¶

The CREATE EXTERNAL CATALOG syntax from the introduction fails at the parser because DataFusion only recognizes CREATE EXTERNAL TABLE. To support new statement-level syntax, you can wrap DFParser. Peek ahead in the token stream to detect your custom syntax, handle it yourself, and delegate everything else to DataFusion.

The custom_sql_parser.rs example demonstrates this pattern:

struct CustomParser<'a> { df_parser: DFParser<'a> }

impl<'a> CustomParser<'a> {
  pub fn parse_statement(&mut self) -> Result<CustomStatement> {
    // Peek tokens to detect CREATE EXTERNAL CATALOG
    if self.is_create_external_catalog() {
      return self.parse_create_external_catalog();
    }
    // Delegate everything else to DataFusion
    Ok(CustomStatement::DFStatement(Box::new(
      self.df_parser.parse_statement()?,
    )))
  }
}

You do not need to implement a full SQL parser. Reuse DataFusion's tokenizer and parser helpers to consume tokens, parse identifiers, and handle options—the example shows how.
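To show the peek-then-delegate control flow without pulling in DataFusion, here is a toy version that tokenizes on whitespace. The `Parsed` enum and `parse` function are hypothetical. A real wrapper would use the DataFusion tokenizer and parser helpers rather than `split_whitespace`.

```rust
/// Hypothetical outcome of the wrapper: a custom command, or SQL handed to the stock parser.
#[derive(Debug, PartialEq)]
enum Parsed {
    CreateExternalCatalog { name: String },
    Delegate(String),
}

fn parse(sql: &str) -> Parsed {
    // Toy tokenizer: whitespace only, no quoting or comments.
    let tokens: Vec<&str> = sql.split_whitespace().collect();
    // Peek at the first three tokens to detect the custom statement.
    let is_catalog = tokens.len() >= 4
        && tokens[..3]
            .iter()
            .zip(["CREATE", "EXTERNAL", "CATALOG"])
            .all(|(tok, kw)| tok.eq_ignore_ascii_case(kw));
    if is_catalog {
        Parsed::CreateExternalCatalog {
            name: tokens[3].trim_end_matches(';').to_string(),
        }
    } else {
        // Everything else goes to the regular parser unchanged.
        Parsed::Delegate(sql.to_string())
    }
}
```

`CREATE EXTERNAL TABLE ...` still takes the delegate path, because only the third keyword differs.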

Once parsed, the simplest integration is to treat custom statements as application commands:

match parser.parse_statement()? {
  CustomStatement::DFStatement(stmt) => ctx.sql(&stmt.to_string()).await?,
  CustomStatement::CreateExternalCatalog(stmt) => {
    handle_create_external_catalog(&ctx, stmt).await?
  }
}

This keeps the extension logic in your embedding application. The example includes a complete handle_create_external_catalog that registers tables from a location into a catalog, making them queryable immediately.

Full working example: custom_sql_parser.rs

2) Extending expression semantics: `ExprPlanner`¶

Once SQL parses, the next failure is often that DataFusion does not know what a particular expression means.

This is where dialect differences show up in day-to-day queries: operators like Postgres JSON arrows, vendor-specific functions, or small syntactic sugar that users expect to keep working when you switch engines.

ExprPlanner lets you define how specific SQL expressions become DataFusion Expr. Common examples:

Non-standard operators (JSON / geometry / regex operators)
Custom function syntaxes
Special identifier behavior

Example: Postgres JSON operators (`->`, `->>`)¶

The Postgres -> operator is a good illustration because it is widely used and parses only under the PostgreSQL dialect.

Configure the dialect:

let config = SessionConfig::new()
    .set_str("datafusion.sql_parser.dialect", "postgres");
let ctx = SessionContext::new_with_config(config);

Then implement ExprPlanner to map the parsed operator (BinaryOperator::Arrow) to DataFusion semantics:

fn plan_binary_op(&self, expr: RawBinaryExpr, _schema: &DFSchema)
  -> Result<PlannerResult<RawBinaryExpr>> {
  match expr.op {
    BinaryOperator::Arrow => Ok(Planned(/* your Expr */)),
    _ => Ok(Original(expr)),
  }
}

Return Planned(...) when you handled the expression; return Original(...) to pass it to the next planner.
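The hand-off between planners behaves like a chain of responsibility. The sketch below models it with strings instead of DataFusion's expression types. The enum, the `Planner` alias, and both planners are stand-ins invented for illustration.

```rust
/// Mirrors the two outcomes described above; a stand-in, not DataFusion's type.
enum PlannerResult {
    Planned(String),
    Original(String),
}

/// A planner takes the raw expression text and either plans it or hands it back.
type Planner = fn(String) -> PlannerResult;

/// Toy planner for a single `left op right` expression.
fn plan_op(expr: String, op: &str, func: &str) -> PlannerResult {
    let parts = expr
        .split_once(op)
        .map(|(l, r)| (l.trim().to_string(), r.trim().to_string()));
    match parts {
        Some((l, r)) => PlannerResult::Planned(format!("{}({}, {})", func, l, r)),
        None => PlannerResult::Original(expr),
    }
}

fn contains_planner(expr: String) -> PlannerResult {
    plan_op(expr, "@>", "contains")
}

fn json_arrow_planner(expr: String) -> PlannerResult {
    plan_op(expr, "->", "json_get")
}

/// Try each planner in order; the first one that returns `Planned` wins.
fn plan_with_chain(mut expr: String, planners: &[Planner]) -> Option<String> {
    for planner in planners {
        match planner(expr) {
            PlannerResult::Planned(plan) => return Some(plan),
            PlannerResult::Original(unchanged) => expr = unchanged,
        }
    }
    None
}
```

Returning the untouched expression in `Original` lets the next planner see exactly what the previous one saw.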

For a complete JSON implementation, see datafusion-functions-json. For a minimal end-to-end example in the DataFusion repo, see expr_planner_tests.

3) Extending type support: `TypePlanner`¶

After expressions, types are often the next thing to break. Schemas and DDL may reference types that DataFusion does not support out of the box, like MySQL's DATETIME.

Type planning tends to come up when interoperating with other systems. You want to accept DDL or infer schemas from external catalogs without forcing users to rewrite types.

TypePlanner maps SQL types to Arrow/DataFusion types:

impl TypePlanner for MyTypePlanner {
  fn plan_type(&self, sql_type: &ast::DataType) -> Result<Option<DataType>> {
    match sql_type {
      ast::DataType::Datetime(Some(3)) => Ok(Some(DataType::Timestamp(TimeUnit::Millisecond, None))),
      _ => Ok(None), // let the default planner handle it
    }
  }
}

It is installed when building session state:

let state = SessionStateBuilder::new()
  .with_default_features()
  .with_type_planner(Arc::new(MyTypePlanner))
  .build();

Once installed, if your CREATE EXTERNAL CATALOG statement exposes tables with MySQL types, DataFusion can interpret them correctly.

4) Extending the FROM clause: `RelationPlanner`¶

Some extensions change what a relation means, not just expressions or types. RelationPlanner (available starting in DataFusion 52) intercepts FROM-clause constructs while SQL is being converted into a LogicalPlan.

Once you have RelationPlanner, there are two main approaches to implementing your extension.

Strategy A: rewrite to existing operators (PIVOT / UNPIVOT)¶

If you can translate your syntax into relational algebra that DataFusion already supports, you can implement the feature with no custom physical operator.

PIVOT rotates rows into columns, and UNPIVOT does the reverse. Neither requires new execution logic: PIVOT is just GROUP BY with CASE expressions, and UNPIVOT is a UNION ALL of each column. The planner rewrites them accordingly:

match relation {
  TableFactor::Pivot { .. } => /* rewrite to GROUP BY + CASE */,
  TableFactor::Unpivot { .. } => /* rewrite to UNION ALL */,
  other => Original(other),
}

Because the output is a standard LogicalPlan, DataFusion's usual optimization and physical planning apply automatically.

Full working example: pivot_unpivot.rs

Strategy B: custom logical + physical (TABLESAMPLE)¶

Sometimes rewriting is not sufficient. TABLESAMPLE returns a random subset of rows from a table and is useful for approximations or debugging on large datasets. Because it requires runtime randomness, you cannot express it as a rewrite to existing operators. Instead, you need a custom logical node and physical operator to execute it.

The approach (shown in table_sample.rs):

RelationPlanner recognizes TABLESAMPLE and produces a custom logical node
That node gets wrapped in LogicalPlan::Extension
ExtensionPlanner converts it to a custom ExecutionPlan

In code:

// Logical planning: FROM t TABLESAMPLE (...)  ->  LogicalPlan::Extension(...)
let plan = LogicalPlan::Extension(Extension { node: Arc::new(TableSamplePlanNode { /* ... */ }) });

// Physical planning: TableSamplePlanNode  ->  SampleExec
if let Some(sample_node) = node.as_any().downcast_ref::<TableSamplePlanNode>() {
  return Ok(Some(Arc::new(SampleExec::try_new(input, /* bounds, seed */)?)));
}

This is the general pattern for custom FROM constructs that need runtime behavior.

Full working example: table_sample.rs
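For intuition about what such an operator does per row, here is a minimal Bernoulli sampler. It is a sketch, not the code from table_sample.rs: the `Lcg` generator is a tiny hand-rolled stand-in used only to keep the example dependency-free and deterministic, and it is not statistically robust.

```rust
/// Tiny deterministic generator so the sketch needs no external crates.
struct Lcg(u64);

impl Lcg {
    /// Returns a pseudo-random value in [0, 1).
    fn next_f64(&mut self) -> f64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Bernoulli sampling: keep each row independently with probability `fraction`.
fn bernoulli_sample<T: Clone>(rows: &[T], fraction: f64, seed: u64) -> Vec<T> {
    let mut rng = Lcg(seed);
    rows.iter()
        .filter(|_| rng.next_f64() < fraction)
        .cloned()
        .collect()
}
```

Fixing the seed makes the sample reproducible, which is handy when debugging a custom operator.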

Background: Origin of the API¶

RelationPlanner originally came out of trying to build MATCH_RECOGNIZE support in DataFusion as a Datadog hackathon project. MATCH_RECOGNIZE is a complex SQL feature for detecting patterns in sequences of rows, and it made sense to prototype as an extension first. At the time, DataFusion had no extension point at the right stage of SQL-to-rel planning to intercept and reinterpret relations.

@theirix's TABLESAMPLE work (#13563, #17633) demonstrated exactly where the gap was: their extension only worked when TABLESAMPLE appeared at the query root and any TABLESAMPLE inside a CTE or JOIN would error. That limitation motivated #17843, which introduced RelationPlanner to intercept relations at any nesting level. The same hook now supports PIVOT, UNPIVOT, TABLESAMPLE, and can translate dialect-specific FROM-clause syntax (for example, bridging Trino constructs into DataFusion plans).

This is how Datadog approaches compatibility work: build features in real systems first, then upstream the building blocks. A full MATCH_RECOGNIZE extension is now in progress, built on top of RelationPlanner, with the match_recognize.rs example as a starting point.

Summary: The Extensibility Workflow¶

DataFusion's SQL extensibility follows its processing pipeline. When building your own dialect extension, work incrementally:

Parse: Use a parser wrapper to intercept custom syntax in the token stream. Produce either a standard Statement or your own application-specific command.
Plan: Implement the planning traits (ExprPlanner, TypePlanner, RelationPlanner) to give your syntax meaning.
Execute: Prefer rewrites to existing operators (like PIVOT to CASE). Only add custom physical operators via ExtensionPlanner when you need specific runtime behavior like randomness or specialized I/O.

Debugging tips¶

Print the logical plan¶

let df = ctx.sql("SELECT * FROM t TABLESAMPLE (10 PERCENT)").await?;
println!("{}", df.logical_plan().display_indent());

Use `EXPLAIN`¶

EXPLAIN SELECT * FROM t TABLESAMPLE (10 PERCENT);

If your extension is not being invoked, it is usually visible in the logical plan first.

When hooks aren't enough¶

While these extension points cover the majority of dialect needs, some deep architectural areas still have limited or no hooks. If you are working in these parts of the SQL surface area, you may need to contribute upstream:

Statement-level planning: statement.rs
JOIN planning: relation/join.rs
TOP / FETCH clauses: select.rs, query.rs

Ideas to try¶

If you want to experiment with these extension points, here are a few suggestions:

Geometry operators (for example @>, <@) via ExprPlanner
Oracle NUMBER or SQL Server MONEY via TypePlanner
JSON_TABLE or semantic-layer style relations via RelationPlanner

Acknowledgements¶

Thank you to @jayzhan211 for designing and implementing the original ExprPlanner API (#11180), to @goldmedal for adding TypePlanner (#13294), and to @theirix for the TABLESAMPLE work (#13563, #17633) that helped shape RelationPlanner. Thank you to @alamb for driving DataFusion's extensibility philosophy and for feedback on this post.

Get Involved¶

Try it out: Implement one of the extension points and share your experience
File issues or join the conversation: GitHub for bugs and feature requests, Slack or Discord for discussion

Optimizing Repartitions in DataFusion: How I Went From Database Noob to Core Contribution

2025-12-15T00:00:00+00:00

Databases are some of the most complex yet interesting pieces of software. They are amazing pieces of abstraction: query engines optimize and execute complex plans, storage engines provide sophisticated infrastructure as the backbone of the system, while intricate file formats lay the groundwork for particular workloads. All of this is exposed by a user-friendly interface and query languages (typically a dialect of SQL).

Starting a journey learning about database internals can be daunting. With so many topics that are whole PhD degrees themselves, finding a place to start is difficult. In this blog post, I will share my early journey in the database world and a quick lesson on one of the first topics I dove into. If you are new to the space, this post will help you get your first foot into the database world, and if you are already a veteran, you may still learn something new.

Who Am I?¶

I am Gene Bordegaray (LinkedIn, GitHub), a recent computer science graduate from UCLA and software engineer at Datadog. Before starting my job, I had no real exposure to databases, only enough SQL knowledge to send CRUD requests and choose between a relational or NoSQL model in a systems design interview.

When I found out I would be on a team focusing on query engines and execution, I was excited but horrified. "Query engines?" From my experience, I typed SQL queries into pgAdmin and got responses without knowing the dark magic that happened under the hood.

With what seemed like an impossible task at hand, I began my favorite few months of learning.

Starting Out¶

I am no expert in databases or any of their subsystems, just someone who recently began learning about them. These are some tips I found useful when first starting.

Build a Foundation¶

The first thing I did, which I highly recommend, was watch Andy Pavlo's Intro To Database Systems course. This laid a great foundation for understanding how a database works from end-to-end at a high-level. It touches on topics ranging from file formats to query optimization, and it was helpful to have a general context for the whole system before diving deep into a single sector.

Narrow Your Scope¶

The next crucial step is to pick your niche to focus on. Database systems are so vast that trying to tackle the whole beast at once is a lost cause. If you want to effectively contribute to this space, you need to deeply understand the system you are working on, and you will have much better luck narrowing your scope.

When learning about the entire database stack at a high level, note what parts stick out as particularly interesting. For me, this focus is on query engines, more specifically, the physical planner and optimizer.

A "Slow" Start¶

The final piece of advice when starting, and I sound like a broken record, is to take your time to learn. This is not an easy sector of software to jump into; it will pay dividends to slow down and fully understand the system and why it is designed the way it is.

When making your first contributions to an open-source project, start very small but go as deep as you can. Don't leave any stone unturned. I did this by looking for simpler issues, such as formatting or simple bug fixes, and stepping through the entire data flow that relates to the issue, noting what each component is responsible for.

This will give you familiarity with the codebase and with using your tools, like your debugger, within the project.

Now that we have some general knowledge of database internals, a niche or subsystem we want to dive deeper into, and the mindset for acquiring knowledge before contributing, let's start with our first core issue.

Intro to DataFusion¶

As mentioned, the database subsystem I decided to explore was query engines. The query engine is responsible for interpreting, optimizing, and executing queries, aiming to do so as efficiently as possible.

My team was in the full swing of restructuring how query execution would work in our organization. The team decided we would use Apache DataFusion at the heart of our system, chosen for its blazing fast execution time for analytical workloads and vast extensibility. DataFusion is written in Rust and builds on top of Apache Arrow (another great project), a columnar memory format that enables it to efficiently process large volumes of data in memory.

This project offered a perfect environment for my first steps into databases: clear, production-ready Rust programming, a manageable codebase, high performance for a specific use case, and a welcoming community.

Parallel Execution in DataFusion¶

Before discussing this issue, it is essential to understand how DataFusion handles parallel execution.

DataFusion implements a vectorized Volcano Model, similar to other state-of-the-art engines such as ClickHouse. The Volcano Model is built on the idea that each operation is abstracted into an operator, and a DAG can represent an entire query. Each operator implements a next() function that returns a batch of tuples or a NULL marker if no data is available.
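A stripped-down version of this pull-based interface might look like the following. The trait and structs are illustrative only and are far simpler than DataFusion's actual execution plan and stream types. Here `None` plays the role of the end-of-data marker.

```rust
type Batch = Vec<i64>;

/// Each operator pulls from its child via `next()`; `None` signals end of stream.
trait Operator {
    fn next(&mut self) -> Option<Batch>;
}

/// Leaf operator that yields pre-built batches.
struct Scan {
    batches: std::vec::IntoIter<Batch>,
}

impl Operator for Scan {
    fn next(&mut self) -> Option<Batch> {
        self.batches.next()
    }
}

/// Keeps only the values greater than `threshold`.
struct Filter<O: Operator> {
    input: O,
    threshold: i64,
}

impl<O: Operator> Operator for Filter<O> {
    fn next(&mut self) -> Option<Batch> {
        let threshold = self.threshold;
        // Process a whole batch at a time (the "vectorized" part).
        self.input
            .next()
            .map(|batch| batch.into_iter().filter(|v| *v > threshold).collect())
    }
}
```

Calling `next()` on the root operator drives the whole pipeline: `Filter` pulls from `Scan`, and each call moves one batch through the plan.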

DataFusion achieves multi-core parallelism through the use of "exchange operators." Individual operators are implemented to use a single CPU core, and the RepartitionExec operator is responsible for distributing work across multiple processors.

What is Repartitioning?¶

Partitioning is a "divide-and-conquer" approach to executing a query. Each partition is a subset of the data that is being processed on a single core. Repartitioning is an operation that redistributes data across different partitions to balance workloads, reduce data skew, and increase parallelism. Two repartitioning methods are used in DataFusion: round-robin and hash.

Round-Robin Repartitioning¶

Round-robin repartitioning is the simplest partitioning strategy. Incoming data is processed in batches (chunks of rows), and these batches are distributed across partitions cyclically or sequentially, with each new batch assigned to the next available partition.

Round-robin repartitioning is useful when the data grouping isn't known or when aiming for an even distribution across partitions. Because it simply assigns batches in order without inspecting their contents, it is a low-overhead way to increase parallelism for downstream operations.
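As a rough sketch of the idea (not DataFusion's RepartitionExec implementation), round-robin assignment can be written as a loop that sends batch i to partition i modulo the partition count. The function name round_robin is hypothetical.

```rust
/// Assigns whole batches to partitions cyclically, without inspecting them.
fn round_robin<T>(batches: Vec<T>, num_partitions: usize) -> Vec<Vec<T>> {
    assert!(num_partitions > 0, "need at least one partition");
    let mut partitions: Vec<Vec<T>> = (0..num_partitions).map(|_| Vec::new()).collect();
    for (i, batch) in batches.into_iter().enumerate() {
        // Batch i goes to partition i mod N, so the load stays even.
        partitions[i % num_partitions].push(batch);
    }
    partitions
}

fn main() {
    let batches = vec!["b0", "b1", "b2", "b3", "b4"];
    // prints [["b0", "b3"], ["b1", "b4"], ["b2"]]
    println!("{:?}", round_robin(batches, 3));
}
```

Note that the batches are moved, never opened: the function is generic over the batch type precisely because it does not need to look inside.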

Hash Repartitioning¶

Hash repartitioning distributes data based on a hash function applied to one or more columns, called the partitioning key. Rows with the same hash value are placed in the same partition.

Hash repartitioning is useful when working with grouped data. Imagine you have a database containing information on company sales, and you are looking to find the total revenue each store produced. Hash repartitioning would make this query much more efficient. Rather than iterating over the data on a single thread and keeping a running sum for each store, it would be better to hash repartition on the store column and have multiple threads calculate individual store sales.

Note the benefit of hash as opposed to round-robin partitioning in this scenario. Hash repartitioning places all rows with the same store value in a single partition. Because of this property, we can compute the complete results for each store in parallel and merge them to get the final outcome. This parallel processing wouldn’t be possible with only round-robin partitioning, as the same store value may be spread across multiple partitions, so each partition would hold only partial results that cannot simply be concatenated into a correct final outcome.
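The store-revenue example can be sketched as follows. This is a simplified illustration under stated assumptions: rows are (store, revenue) tuples, the hash comes from Rust's standard DefaultHasher rather than whatever DataFusion uses, partitions are processed one after another instead of on separate threads, and all function names are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// One sales row: (store, revenue).
type Row = (&'static str, u64);

/// Picks a partition from the hash of the partitioning key (the store name).
fn partition_of(store: &str, num_partitions: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    store.hash(&mut hasher);
    (hasher.finish() % num_partitions as u64) as usize
}

/// Routes each row to the partition chosen by its key.
fn hash_repartition(rows: &[Row], num_partitions: usize) -> Vec<Vec<Row>> {
    let mut partitions: Vec<Vec<Row>> = vec![Vec::new(); num_partitions];
    for &(store, revenue) in rows {
        partitions[partition_of(store, num_partitions)].push((store, revenue));
    }
    partitions
}

/// Sums revenue per store within a single partition.
fn sum_by_store(partition: &[Row]) -> HashMap<&'static str, u64> {
    let mut totals = HashMap::new();
    for &(store, revenue) in partition {
        *totals.entry(store).or_insert(0) += revenue;
    }
    totals
}

/// No store spans two partitions, so per-partition totals are already final
/// and can be combined without re-aggregating.
fn total_revenue(rows: &[Row], num_partitions: usize) -> HashMap<&'static str, u64> {
    let mut result = HashMap::new();
    for partition in hash_repartition(rows, num_partitions) {
        // In a real engine each partition would be summed on its own core.
        result.extend(sum_by_store(&partition));
    }
    result
}

fn main() {
    let rows: [Row; 5] = [("north", 10), ("south", 5), ("north", 7), ("east", 1), ("south", 3)];
    let totals = total_revenue(&rows, 4);
    println!("north = {}", totals["north"]); // prints north = 17
}
```

The key step is result.extend: it is only safe because hashing guarantees that a store's rows never land in two different partitions.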

The Issue: Consecutive Repartitions¶

DataFusion contributors pointed out that consecutive repartition operators were being added to query plans, making them less efficient and more confusing to read (link to issue). This issue had stood for over a year, with some attempts to resolve it, but they fell short.

For some queries that required repartitioning, the plan would look along the lines of:

SELECT a, SUM(b) FROM data.parquet GROUP BY a;

Why Don’t We Want Consecutive Repartitions?¶

Repartitions would appear back-to-back in query plans, specifically a round-robin followed by a hash repartition.

Why is this such a big deal? Well, repartitions do not process the data; their purpose is to redistribute it in ways that enable more efficient computation for other operators. Having consecutive repartitions is counterintuitive because we are redistributing data, then immediately redistributing it again, making the first repartition pointless. While this didn't create extreme overhead for queries, since round-robin repartitioning does not copy data, just the pointers to batches, the behavior was unclear and unnecessary.

Optimally the plan should do one of two things:

If there is enough data to justify round-robin repartitioning, split the repartitions across a "worker" operator that leverages the redistributed data.
Otherwise, don't use any round-robin repartition and keep the hash repartition only in the middle of the two-stage aggregation.

As shown in the diagram for a large query plan above, the round-robin repartition takes place before the partial aggregation. This increases parallelism for this processing, which will yield great performance benefits in larger datasets.

Identifying the Bug¶

With an understanding of what the problem is, it is finally time to dive into isolating and identifying the bug.

No Code!¶

Before looking at any code, we can narrow the scope of where we should be looking. I found that tightening the boundaries of what you are looking for before reading any code is critical for being effective in large, complex codebases. If you are searching for a needle in a haystack, you will spend hours sifting through irrelevant code.

We can use what we know about the issue and provided tools to pinpoint where our search should begin. So far, we know the bug only exists where repartitioning is needed. Let's see how else we can narrow down our search.

From previous tickets, I was aware that DataFusion offered the EXPLAIN VERBOSE keywords. When put before a query, the CLI prints the logical and physical plan at each step of planning and optimization. Running this query:

EXPLAIN VERBOSE SELECT a, SUM(b) FROM data.parquet GROUP BY a;

we find a critical piece of information.

Physical Plan Before EnforceDistribution:

OutputRequirementExec: order_by=[], dist_by=Unspecified
  AggregateExec: mode=FinalPartitioned, gby=[a@0 as a], aggr=[sum(parquet_data.b)]
    AggregateExec: mode=Partial, gby=[a@0 as a], aggr=[sum(parquet_data.b)]
      DataSourceExec:
            file_groups={1 group: [[...]]}
            projection=[a, b]
            file_type=parquet

Physical Plan After EnforceDistribution:

OutputRequirementExec: order_by=[], dist_by=Unspecified
  AggregateExec: mode=FinalPartitioned, gby=[a@0 as a], aggr=[sum(parquet_data.b)]
    RepartitionExec: partitioning=Hash([a@0], 16), input_partitions=16
      RepartitionExec: partitioning=RoundRobinBatch(16), input_partitions=1 <-- EXTRA REPARTITION!
        AggregateExec: mode=Partial, gby=[a@0 as a], aggr=[sum(parquet_data.b)]
          DataSourceExec:
                file_groups={1 group: [[...]]}
                projection=[a, b]
                file_type=parquet

We have found the exact rule, EnforceDistribution, that is responsible for introducing the bug before reading a single line of code! Experienced maintainers of DataFusion would have known where to look before starting, but for a newbie, this is great information.

The Root Cause¶

With a single rule to read, isolating the issue is much simpler. The EnforceDistribution rule takes a physical query plan as input, iterates over each child analyzing its requirements, and decides where adding repartition nodes is beneficial.

A great place to start looking is before any repartitions are inserted, and where the program decides if adding a repartition above/below an operator is useful. With the help of handy function header comments, it was easy to identify that this is done in the get_repartition_requirement_status function. Here, DataFusion sets four fields indicating how the operator would benefit from repartitioning:

The operator's distribution requirement: what type of partitioning does it need from its children (hash, single, or unknown)?
If round-robin is theoretically beneficial: does the operator benefit from parallelism?
If our data indicates round-robin to be beneficial: do we have enough data to justify the overhead of repartitioning?
If hash repartitioning is necessary: is the parent an operator that requires all column values to be in the same partition, like an aggregate, and are we already hash-partitioned correctly?

Ok, great! We understand the different components DataFusion uses to indicate if repartitioning is beneficial. Now all that's left to do is see how repartitions are inserted.

This logic takes place in the main loop of this rule. I find it helpful to draw algorithms like these into logic trees; this tends to make things much more straightforward and approachable:

Boom! This is the root of our problem: we are inserting a round-robin repartition, then still inserting a hash repartition afterwards. This means that if an operator indicates it would benefit from both round-robin and hash repartitioning, consecutive repartitions will occur.

The Fix¶

The logic shown before is, of course, incorrect, and the conditions for adding hash and round-robin repartitioning should be mutually exclusive since an operator will never benefit from shuffling data twice.

Well, what is the correct logic?

Based on our lesson on hash repartitioning and the heuristics DataFusion uses to determine when repartitioning can benefit an operator, the fix is easy. In the sub-tree where an operator's parent requires hash partitioning:

If we are already hashed correctly, don't do anything. If we insert a round-robin, we will break the existing partitioning.
If a hash is required, just insert a hash repartition.

The new logic tree looks like this:

All that deep digging paid off: the fix came down to one condition (see the final PR for full details)!

Condition before:

if add_roundrobin {

Condition after:

if add_roundrobin && !hash_necessary {
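To see why this one condition removes the back-to-back repartitions, here is a heavily simplified model of the decision. It is a sketch, not the EnforceDistribution source: the RepartitionStatus struct, the Repartition enum, and both functions are hypothetical, and the two booleans stand in for the richer status fields described earlier.

```rust
/// Hypothetical summary of what the rule knows about one child of an operator.
#[derive(Clone, Copy)]
struct RepartitionStatus {
    /// Round-robin would help and there is enough data to justify it.
    add_roundrobin: bool,
    /// The parent needs hash partitioning and the child is not hashed correctly yet.
    hash_necessary: bool,
}

#[derive(Debug, PartialEq)]
enum Repartition {
    RoundRobin,
    Hash,
}

/// Old behavior: the two checks were independent, so both could fire.
/// Repartitions are listed in the order they are inserted (bottom-up).
fn plan_before(s: RepartitionStatus) -> Vec<Repartition> {
    let mut inserted = Vec::new();
    if s.add_roundrobin {
        inserted.push(Repartition::RoundRobin);
    }
    if s.hash_necessary {
        inserted.push(Repartition::Hash);
    }
    inserted
}

/// New behavior: skip round-robin whenever a hash repartition follows anyway.
fn plan_after(s: RepartitionStatus) -> Vec<Repartition> {
    let mut inserted = Vec::new();
    if s.add_roundrobin && !s.hash_necessary {
        inserted.push(Repartition::RoundRobin);
    }
    if s.hash_necessary {
        inserted.push(Repartition::Hash);
    }
    inserted
}

fn main() {
    let both = RepartitionStatus { add_roundrobin: true, hash_necessary: true };
    println!("before: {:?}", plan_before(both)); // prints before: [RoundRobin, Hash]
    println!("after:  {:?}", plan_after(both)); // prints after:  [Hash]
}
```

When only round-robin is beneficial, or only a hash is required, both versions behave identically; the change affects only the case where both flags are set.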

Results¶

This eliminated every consecutive repartition in the DataFusion test suite and benchmarks, reducing overhead, making plans clearer, and enabling further optimizations.

Plans became simpler:

Before:


ProjectionExec: expr=[env@0 as env, count(Int64(1))@1 as count(*)]
  AggregateExec: mode=FinalPartitioned, gby=[env@0 as env], aggr=[count(Int64(1))]
    CoalesceBatchesExec: target_batch_size=8192
      RepartitionExec: partitioning=Hash([env@0], 4), input_partitions=4
        RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1 <-- EXTRA REPARTITION!
          AggregateExec: mode=Partial, gby=[env@0 as env], aggr=[count(Int64(1))]
            DataSourceExec:
                file_groups={1 group: [[...]]}
                projection=[env]
                file_type=parquet

After:

ProjectionExec: expr=[env@0 as env, count(Int64(1))@1 as count(*)]
  AggregateExec: mode=FinalPartitioned, gby=[env@0 as env], aggr=[count(Int64(1))]
    CoalesceBatchesExec: target_batch_size=8192
      RepartitionExec: partitioning=Hash([env@0], 4), input_partitions=1
        AggregateExec: mode=Partial, gby=[env@0 as env], aggr=[count(Int64(1))]
          DataSourceExec:
                file_groups={1 group: [[...]]}
                projection=[env]
                file_type=parquet

For the standard TPC-H benchmark, speedups were small but consistent:

TPCH Benchmark

TPCH10 Benchmark

And there it is, our first core contribution for a database system!

From this experience there are two main points I would like to emphasize:

Deeply understand the system you are working on. It is not only fun to figure these things out, but it also pays off in the long run when having surface-level knowledge won't cut it.
Narrow down the scope of your work when starting your journey into databases. Find a project that interests you and provides an environment that enhances your early learning process. I have found that Apache DataFusion and its community have been an amazing first step, and I plan to continue learning about query engines here.

I hope you gained something from my experience and have fun learning about databases.

Acknowledgements¶

Thank you to Nga Tran for continuous mentorship and guidance, the DataFusion community, specifically Andrew Lamb, for lending me support throughout my work, and Datadog for providing the opportunity to work on such interesting systems.

Apache DataFusion Comet 0.12.0 Release

2025-12-04T00:00:00+00:00

The Apache DataFusion PMC is pleased to announce version 0.12.0 of the Comet subproject.

Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for improved performance and efficiency without requiring any code changes.

This release covers approximately four weeks of development work and is the result of merging 105 PRs from 13 contributors. See the change log for more information.

Release Highlights¶

Experimental Native Apache Iceberg Scan Support¶

Comet has a new, experimental, native Iceberg scan. This work relies on iceberg-rust and the Parquet reader from arrow-rs that Comet already uses to great effect. Comet’s existing Iceberg integration relies on a modified Iceberg Java build to accelerate Parquet decoding. This new approach allows unmodified Iceberg Java to handle query planning (i.e., catalog access, partition pruning, etc.), then Comet serializes Iceberg FileScanTask objects directly to iceberg-rust, enabling native execution of Iceberg table scans through DataFusion.

This represents a significant step forward in Comet's support for data lakehouse architectures and expands the range of workloads that can benefit from native acceleration. Please take a look at the PR and Comet’s documentation to understand the current limitations and try it on your workloads! We are eager for feedback on this approach.

Code Architecture Improvements¶

This release includes significant refactoring to improve code maintainability and extensibility, and we will continue those efforts into 0.13.0 development:

Unified operator serialization: The CometExecRule refactor unifies CometNativeExec creation with serialization through the new CometOperatorSerde trait
Expression serde refactoring: Multiple PRs (#2738, #2741, #2791) moved expression serialization logic out of QueryPlanSerde into specialized traits
Aggregate expression improvements: Added getSupportLevel to CometAggregateExpressionSerde trait for better aggregate function handling

These architectural improvements make it easier for contributors to add new operators and expressions while reducing code complexity.

New SQL Functions¶

The following SQL functions are now supported:

concat - String concatenation
abs - Absolute value
sha1 - SHA-1 hash function
cot - Cotangent function
Hyperbolic trigonometric functions - sinh, cosh, tanh, and their inverse functions

New Operators¶

CometLocalTableScanExec - Native support for local table scans, eliminating fallback to Spark for small, in-memory datasets

Configuration and Usability Improvements¶

Simplified on-heap configuration: Simplified on-heap memory configuration for easier setup
Extended explain format: Renamed and improved COMET_EXTENDED_EXPLAIN_FORMAT with better defaults
Environment variable support: Improved framework for setting configs with environment variables
Native config passing: All Comet configs now passed to native plan
Config categorization: Categorized testing configs and added notes about known timezone issues
Removed legacy configs: Removed COMET_EXPR_ALLOW_INCOMPATIBLE config to simplify configuration

Bug Fixes¶

This release includes numerous bug fixes:

Fixed None.get in stringDecode when binary child cannot be converted
Proper fallback for lpad/rpad with unsupported arguments
Fixed trunc/date_trunc with unsupported format strings
Corrected single partition handling in native_datafusion
Fixed LeftSemi join handling - do not replace SMJ with HJ
Fixed CometLiteral class cast exception with arrays
Fixed missing SortOrder fallback reason in range partitioning
Improved checkSparkMaybeThrows to compare results in success case
Fixed null handling in CometVector implementations

Documentation Improvements¶

Added FFI documentation to contributor guide
Updated contributor guide for adding new expressions and operators
Improved documentation layout and navigation
Added prettier enforcement for consistent markdown formatting
CI check to ensure generated docs are in sync
Various documentation updates for SortOrder expressions, LocalTableScan and WindowExec, and Spark SQL tests

Dependency Updates¶

Upgraded to Spark 3.5.7
Upgraded to DataFusion 50.3.0
Upgraded Parquet from 56.0.0 to 56.2.0
Various other dependency updates via Dependabot

Spark Compatibility¶

Spark 3.4.3 with JDK 11 & 17, Scala 2.12 & 2.13
Spark 3.5.4 through 3.5.7 with JDK 11 & 17, Scala 2.12 & 2.13
Spark 4.0.1 with JDK 17, Scala 2.13

We are looking for help from the community to fully support Spark 4.0.1. See EPIC: Support 4.0.0 for more information.

Getting Involved¶

The Comet project welcomes new contributors. We use the same Slack and Discord channels as the main DataFusion project and have a weekly DataFusion video call.

The easiest way to get involved is to test Comet with your current Spark jobs and file issues for any bugs or performance regressions that you find. See the Getting Started guide for instructions on downloading and installing Comet.

There are also many good first issues waiting for contributions.

Apache DataFusion 51.0.0 Released

2025-11-25T00:00:00+00:00

Introduction¶

We are proud to announce the release of DataFusion 51.0.0. This post highlights some of the major improvements since DataFusion 50.0.0. The complete list of changes is available in the changelog. Thanks to the 128 contributors for making this release possible.

Performance Improvements 🚀¶

We continue to make significant performance improvements in DataFusion, both in the core engine and in the Parquet reader.

Figure 1: Average and median normalized query execution times for ClickBench queries for DataFusion 51.0.0 compared to previous releases. Query times are normalized using the ClickBench definition. See the DataFusion Benchmarking Page for more details.

Faster `CASE` expression evaluation¶

This release builds on the CASE performance epic with significant improvements. Expressions short‑circuit earlier, reuse partial results, and avoid unnecessary scattering, speeding up common ETL patterns. Thanks to pepijnve, chenkovsky, and petern48 for leading this effort. You can find more details in the Optimizing SQL CASE Expression Evaluation blog post.

Better Defaults for Remote Parquet Reads¶

By default, DataFusion now always fetches the last 512KB (configurable) of Apache Parquet files, which usually includes the footer and metadata (#18118). This change typically avoids 2 I/O requests for each Parquet file. While this setting has existed in DataFusion for many years, it was not previously enabled by default. Users can tune the number of bytes fetched in the initial I/O request via the datafusion.execution.parquet.metadata_size_hint config setting. Thanks to zhuqi-lucas for leading this effort.
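The arithmetic behind the single tail read can be sketched as below. This is an illustration, not DataFusion's reader code: tail_range is a hypothetical helper, and the file size of 174,965,044 bytes is an assumption inferred from the inclusive end offset of the GET request in the object store trace shown in the I/O profiling section of this post.

```rust
use std::ops::Range;

/// Byte range (end-exclusive) of one read that fetches the tail of a file.
fn tail_range(file_size: u64, size_hint: u64) -> Range<u64> {
    // saturating_sub keeps the start at 0 for files smaller than the hint.
    file_size.saturating_sub(size_hint)..file_size
}

fn main() {
    let hint = 512 * 1024; // 524288 bytes, the default described above
    let r = tail_range(174_965_044, hint);
    // HTTP range headers use an inclusive end offset, hence the `- 1`.
    println!("bytes={}-{}", r.start, r.end - 1); // prints bytes=174440756-174965043
}
```

If the footer and metadata fit inside this range, no follow-up request is needed to locate or read them, which is where the saved round trips come from.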

Faster Parquet metadata parsing¶

DataFusion 51 also includes the latest Parquet reader from Arrow Rust 57.0.0, which parses Parquet metadata significantly faster. This is especially beneficial for workloads with many small Parquet files and scenarios where startup time or low latency is important. You can read more about the upstream work by etseidl and jhorstmann that enabled these improvements in the Faster Apache Parquet Footer Metadata Using a Custom Thrift Parser blog.

Figure 2: Metadata parsing performance improvements in Arrow/Parquet 57.0.0.

New Features ✨¶

Decimal32/Decimal64 support¶

The new Arrow types Decimal32 and Decimal64 are now supported in DataFusion (#17501), including aggregations such as SUM, AVG, MIN/MAX, and window functions. Thanks to AdamGS for leading this effort.

SQL Pipe Operators¶

DataFusion now supports the SQL pipe operator syntax (#17278), enabling inline transforms such as:

SELECT * FROM t
|> WHERE a > 10
|> ORDER BY b
|> LIMIT 5;

This syntax, popularized by Google BigQuery, keeps multi-step transformations concise while preserving regular SQL semantics. Thanks to simonvandel for leading this effort.

I/O Profiling in `datafusion-cli`¶

datafusion-cli now has built-in instrumentation to trace object store calls (#17207). Toggle profiling with the \object_store_profiling command and inspect the exact GET/LIST requests issued during query execution:

DataFusion CLI v51.0.0
> \object_store_profiling trace
ObjectStore Profile mode set to Trace
> select count(*) from 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';
+----------+
| count(*) |
+----------+
| 1000000  |
+----------+
1 row(s) fetched.
Elapsed 0.367 seconds.

Object Store Profiling
Instrumented Object Store: instrument_mode: Trace, inner: HttpStore
2025-11-19T21:10:43.476121+00:00 operation=Head duration=0.069763s path=hits_compatible/athena_partitioned/hits_1.parquet
2025-11-19T21:10:43.545903+00:00 operation=Head duration=0.025859s path=hits_compatible/athena_partitioned/hits_1.parquet
2025-11-19T21:10:43.571768+00:00 operation=Head duration=0.025684s path=hits_compatible/athena_partitioned/hits_1.parquet
2025-11-19T21:10:43.597463+00:00 operation=Get duration=0.034194s size=524288 range: bytes=174440756-174965043 path=hits_compatible/athena_partitioned/hits_1.parquet
2025-11-19T21:10:43.705821+00:00 operation=Head duration=0.022029s path=hits_compatible/athena_partitioned/hits_1.parquet

Summaries:
+-----------+----------+-----------+-----------+-----------+-----------+-------+
| Operation | Metric   | min       | max       | avg       | sum       | count |
+-----------+----------+-----------+-----------+-----------+-----------+-------+
| Get       | duration | 0.034194s | 0.034194s | 0.034194s | 0.034194s | 1     |
| Get       | size     | 524288 B  | 524288 B  | 524288 B  | 524288 B  | 1     |
| Head      | duration | 0.022029s | 0.069763s | 0.035834s | 0.143335s | 4     |
| Head      | size     |           |           |           |           | 4     |
+-----------+----------+-----------+-----------+-----------+-----------+-------+

This makes it far easier to diagnose slow remote scans and validate caching strategies. Thanks to BlakeOrth for leading this effort.

`DESCRIBE <query>`¶

DESCRIBE now works on arbitrary queries, returning the schema instead of being an alias for EXPLAIN (#18234). This brings DataFusion in line with engines like DuckDB and makes it easy to inspect the output schema of queries without executing them. Thanks to djanderson for leading this effort.

For example:

DataFusion CLI v51.0.0
> create table t(a int, b varchar, c float) as values (1, 'a', 2.0);
0 row(s) fetched.
Elapsed 0.002 seconds.

> DESCRIBE SELECT a, b, SUM(c) FROM t GROUP BY a, b;

+-------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-------------+-----------+-------------+
| a           | Int32     | YES         |
| b           | Utf8View  | YES         |
| sum(t.c)    | Float64   | YES         |
+-------------+-----------+-------------+
3 row(s) fetched.

Named arguments in SQL functions¶

DataFusion now understands PostgreSQL-style named arguments (param => value) for scalar, aggregate, and window functions (#17379). You can mix positional and named arguments in any order, and error messages now list parameter names to make diagnostics clearer. UDF authors can also expose parameter names so their functions benefit from the same syntax. Thanks to timsaucer and bubulalabu for leading this effort.

For example, you can pass arguments to functions like this:

SELECT power(exponent => 3.0, base => 2.0);

Metrics improvements¶

The output of EXPLAIN ANALYZE has been improved to include more metrics about execution time and memory usage of each operator (#18217). You can learn more about these new metrics in the metrics user guide. Thanks to 2010YOUY01 for leading this effort.

The 51.0.0 release adds:

Configuration: adds a new option datafusion.explain.analyze_level, which can be set to summary for a concise output or dev for the full set of metrics (the previous default).
For all major operators: adds output_bytes, reporting how many bytes of data each operator produces.
FilterExec: adds a selectivity metric (output_rows / input_rows) to show how effective the filter is.
AggregateExec: adds detailed timing metrics for group-ID computation, aggregate argument evaluation, aggregation work, and emitting final results, plus a reduction_factor metric (output_rows / input_rows) to show how much grouping reduces the data.
NestedLoopJoinExec: adds a selectivity metric (output_rows / (left_rows * right_rows)) to show how many combinations actually pass the join condition.
Several display formatting improvements were added to make EXPLAIN ANALYZE output easier to read.
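The ratio metrics in this list are simple to compute by hand when reading a plan. The sketch below restates the formulas from the list; the function names are hypothetical, and returning None for an empty input is an assumption of this example rather than documented DataFusion behavior.

```rust
/// output_rows / input_rows, as used by the selectivity and reduction_factor
/// metrics for FilterExec and AggregateExec.
fn row_ratio(output_rows: u64, input_rows: u64) -> Option<f64> {
    if input_rows == 0 {
        None // assumption: the ratio is undefined when there is no input
    } else {
        Some(output_rows as f64 / input_rows as f64)
    }
}

/// output_rows / (left_rows * right_rows), the nested loop join selectivity.
fn join_selectivity(output_rows: u64, left_rows: u64, right_rows: u64) -> Option<f64> {
    // Multiply as floats so very large inputs cannot overflow u64.
    let combinations = left_rows as f64 * right_rows as f64;
    if combinations == 0.0 {
        None
    } else {
        Some(output_rows as f64 / combinations)
    }
}

fn main() {
    // A filter that keeps 250 of 1000 rows.
    println!("{:?}", row_ratio(250, 1000)); // prints Some(0.25)
    // A join of 10 x 100 rows where 30 combinations match.
    println!("{:?}", join_selectivity(30, 10, 100)); // prints Some(0.03)
}
```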

For example, the following query:

set datafusion.explain.analyze_level = summary;

explain analyze 
select count(*) 
from 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet' 
where "URL" <> '';

Now shows easier-to-understand metrics such as:

 metrics=[
   output_rows=1000000, 
   elapsed_compute=16ns, 
   output_bytes=222.5 MB, 
   files_ranges_pruned_statistics=16 total → 16 matched, 
   row_groups_pruned_statistics=3 total → 3 matched, 
   row_groups_pruned_bloom_filter=3 total → 3 matched, 
   page_index_rows_pruned=0 total → 0 matched,
   bytes_scanned=33661364,
   metadata_load_time=4.243098ms, 
]

Upgrade Guide and Changelog¶

Upgrading to 51.0.0 should be straightforward for most users. Please review the Upgrade Guide for details on breaking changes and code snippets to help with the transition. For a comprehensive list of all changes, please refer to the changelog.

About DataFusion¶

How to Get Involved¶

Apache DataFusion Comet 0.11.0 Release

2025-10-21T00:00:00+00:00

The Apache DataFusion PMC is pleased to announce version 0.11.0 of the Comet subproject.

Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for improved performance and efficiency without requiring any code changes.

This release covers approximately five weeks of development work and is the result of merging 131 PRs from 15 contributors. See the change log for more information.

Release Highlights¶

Parquet Modular Encryption Support¶

Spark supports Parquet Modular Encryption to independently encrypt column values and metadata. Furthermore, Spark supports custom encryption factories for users to provide their own key-management service (KMS) implementations. Thanks to a number of contributions in upstream DataFusion and arrow-rs, Comet now supports Parquet Modular Encryption with Spark KMS for native readers, enabling secure reading of encrypted Parquet files in production environments.

Improved Memory Management¶

Comet 0.11.0 introduces significant improvements to memory management, making it easier to deploy and more resilient to out-of-memory conditions:

Changed default memory pool: The default off-heap memory pool has been changed from greedy_unified to fair_unified, providing better memory fairness across operations
Off-heap deployment recommended: To simplify configuration and improve performance, Comet now expects to be deployed with Spark's off-heap memory configuration. On-heap memory is still available for development and debugging, but is not recommended for deployment
Better disk management: The DiskManager max_temp_directory_size is now configurable for better control over temporary disk usage
Enhanced safety: Memory pool operations now use checked arithmetic operations to prevent overflow issues

These changes make Comet significantly easier to configure and deploy in production environments.
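As an illustration of what checked arithmetic buys in a memory pool, consider the sketch below. It is not Comet's memory pool: the Pool struct and its methods are hypothetical, and the only point is that checked_add and checked_sub turn overflow or underflow into an error instead of a silently wrapped counter.

```rust
/// Hypothetical fixed-size pool that tracks reserved bytes.
struct Pool {
    limit: usize,
    used: usize,
}

impl Pool {
    /// Reserves `additional` bytes, failing on overflow or when over the limit.
    fn try_grow(&mut self, additional: usize) -> Result<(), String> {
        // checked_add returns None on overflow instead of wrapping around.
        let new_used = self
            .used
            .checked_add(additional)
            .ok_or_else(|| "reservation overflows usize".to_string())?;
        if new_used > self.limit {
            return Err(format!("would use {} bytes, limit is {}", new_used, self.limit));
        }
        self.used = new_used;
        Ok(())
    }

    /// Releases `released` bytes, failing if more is released than was reserved.
    fn shrink(&mut self, released: usize) -> Result<(), String> {
        self.used = self
            .used
            .checked_sub(released)
            .ok_or_else(|| "released more than reserved".to_string())?;
        Ok(())
    }
}

fn main() {
    let mut pool = Pool { limit: 100, used: 0 };
    println!("{:?}", pool.try_grow(60)); // prints Ok(())
    // With wrapping arithmetic this could appear to fit; here it is rejected.
    println!("{}", pool.try_grow(usize::MAX).is_err()); // prints true
}
```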

Improved Apache Spark 4.0 Support¶

Comet has improved its support for Apache Spark 4.0.1 with several important enhancements:

Updated support from Spark 4.0.0 to Spark 4.0.1
Spark 4.0 is now included in the release build script
Expanded ANSI mode compatibility with several new implementations:
ANSI evaluation mode arithmetic operations
ANSI mode integral divide
ANSI mode rounding functions
ANSI mode remainder function

Spark 4.0 compatible jar files are now available on Maven Central. See the installation guide for instructions on using published jar files.

Complex Types for Columnar Shuffle¶

ashdnazg submitted a fantastic refactoring PR that simplified the logic for writing rows in Comet’s JVM-based, columnar shuffle. A benefit of this refactoring is better support for complex types (e.g., structs, lists, and arrays) in columnar shuffle. Comet no longer falls back to Spark to shuffle these types, enabling native acceleration for queries involving nested data structures. This enhancement significantly expands the range of queries that can benefit from Comet's columnar shuffle implementation.

RangePartitioning for Native Shuffle¶

Comet's native shuffle now supports RangePartitioning, providing better performance for operations that require range-based data distribution. Comet now matches Spark behavior for computing and distributing range boundaries, and serializes them to native execution for faster shuffle operations.
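Once the boundaries are known, assigning a row to a range partition is a binary search. The sketch below shows that lookup under stated assumptions: keys are plain i64 values, boundaries are sorted ascending, and a key equal to a boundary goes to the lower partition. The function name is hypothetical, and how Comet and Spark actually sample and compare boundaries is not shown here.

```rust
/// Returns the partition for `key`, given sorted upper bounds.
/// `n` bounds define `n + 1` partitions; a key equal to a bound stays in
/// the lower partition (an assumption of this sketch).
fn range_partition(key: i64, bounds: &[i64]) -> usize {
    // partition_point binary-searches for the first bound that is >= key,
    // which equals the number of bounds strictly below the key.
    bounds.partition_point(|bound| *bound < key)
}

fn main() {
    let bounds = [10, 20];
    for key in [5, 10, 15, 20, 25] {
        println!("{} -> partition {}", key, range_partition(key, &bounds));
    }
    // prints:
    // 5 -> partition 0
    // 10 -> partition 0
    // 15 -> partition 1
    // 20 -> partition 1
    // 25 -> partition 2
}
```

Because the partitions cover contiguous, ordered key ranges, sorting each partition independently yields a globally sorted result, which is why this scheme suits operations that need range-based distribution.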

New Functionality¶

The following SQL functions are now supported:

weekday - Extract day of week from date
lpad - Left pad a string with column support for pad length
rpad - Right pad a string with column support and additional character support
reverse - Support for ArrayType input in addition to strings
count(distinct) - Native support without falling back to Spark
bit_get - Get bit value at position

New expression capabilities include:

Performance Improvements¶

Improved BroadcastExchangeExec conversion for better broadcast join performance
Use of DataFusion's native count_udaf instead of SUM(IF(expr IS NOT NULL, 1, 0))
New configuration from shared conf to reduce overhead
Buffered index writes to reduce system calls in shuffle operations
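The count rewrite in the list above relies on the two forms agreeing on any non-empty input. A small sketch with NULLs modeled as Option::None makes that visible; both function names are hypothetical and neither is Comet code.

```rust
/// Direct count of non-null values, the shape a native count aggregate takes.
fn count_non_null(values: &[Option<i64>]) -> i64 {
    values.iter().filter(|v| v.is_some()).count() as i64
}

/// The older formulation: SUM(IF(expr IS NOT NULL, 1, 0)).
fn sum_if_non_null(values: &[Option<i64>]) -> i64 {
    values.iter().map(|v| if v.is_some() { 1 } else { 0 }).sum()
}

fn main() {
    let column = [Some(3), None, Some(8), None, Some(1)];
    println!("{}", count_non_null(&column)); // prints 3
    println!("{}", sum_if_non_null(&column)); // prints 3
}
```

The direct form avoids materializing an intermediate column of ones and zeros before aggregating.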

Comet 0.11.0 TPC-H Performance¶

Comet 0.11.0 continues to deliver significant performance improvements over Spark. In our TPC-H benchmarks, Comet reduced overall query runtime from 687 seconds to 302 seconds when processing 100 GB of Parquet data using a single 8-core executor, achieving a 2.2x speedup.

The performance gains are consistent across individual queries, with most queries showing substantial improvements:

You can reproduce these benchmarks using our Comet Benchmarking Guide. We encourage you to run your own performance tests with your workloads.

Apache Iceberg Support¶

Updated support for Apache Iceberg 1.9.1
Additional Parquet-independent API improvements for Iceberg integration
Improved resource management in Iceberg reader instances

UX Improvements¶

Added plan conversion statistics to extended explain info for better observability
Improved fallback information to help users understand when and why Comet falls back to Spark
Added backtrace feature to simplify enabling native backtraces in CometNativeException
Native log level is now configurable via Comet configuration

Bug Fixes¶

Documentation Updates¶

Updated documentation for native shuffle configuration and tuning
Added documentation for ANSI mode support in various functions
Improved EC2 benchmarking guide
Split configuration guide into different sections (scan, exec, shuffle, etc.) for better organization
Various clarifications and improvements throughout the documentation

Spark Compatibility¶

Spark 3.4.3 with JDK 11 & 17, Scala 2.12 & 2.13
Spark 3.5.4 through 3.5.6 with JDK 11 & 17, Scala 2.12 & 2.13
Spark 4.0.1 with JDK 17, Scala 2.13

We are looking for help from the community to fully support Spark 4.0.1. See EPIC: Support 4.0.0 for more information.

Getting Involved¶

The Comet project welcomes new contributors. We use the same Slack and Discord channels as the main DataFusion project and have a weekly DataFusion video call.

There are also many good first issues waiting for contributions.

Apache DataFusion 50.0.0 Released

2025-09-29T00:00:00+00:00

Introduction¶

We are proud to announce the release of DataFusion 50.0.0. This blog post highlights some of the major improvements since the release of DataFusion 49.0.0. The complete list of changes is available in the changelog. Thanks to numerous contributors for making this release possible!

Performance Improvements 🚀¶

DataFusion continues to focus on enhancing performance, as shown in ClickBench and other benchmark results.

Figure 1: Average and median normalized query execution times for ClickBench queries for each git revision. Query times are normalized using the ClickBench definition. See the DataFusion Benchmarking Page for more details.

Here are some noteworthy optimizations added since DataFusion 49:

Dynamic Filter Pushdown Improvements

The dynamic filter pushdown optimization, which allows runtime filters to cut down on the amount of data read, has been extended to support inner hash joins, dramatically improving performance when one relation is relatively small or filtered by a highly selective predicate. More details can be found in the Dynamic Filter Pushdown for Hash Joins section below. The dynamic filters in the TopK operator have also been improved in DataFusion 50.0.0, further increasing the effectiveness and efficiency of the optimization. More details can be found in this ticket.

Nested Loop Join Optimization

The nested loop join operator has been rewritten to reduce execution time and memory usage by adopting a finer-grained approach. Specifically, we now limit the intermediate data size to around a single RecordBatch for better memory efficiency, and we have eliminated redundant conversions from the old implementation to further improve execution speed. When evaluating this new approach in a microbenchmark, we measured up to a 5x improvement in execution time and a 99% reduction in memory usage. More details and results can be found in this ticket.

Parquet Metadata Caching

DataFusion now automatically caches the metadata of Parquet files (statistics, page indexes, etc.), to avoid unnecessary disk/network round-trips. This is especially useful when querying the same table multiple times over relatively slow networks, allowing us to achieve an order of magnitude faster execution time when running many small reads over large files. More information can be found in the Parquet Metadata Cache section.

Community Growth 📈¶

Between 49.0.0 and 50.0.0, we continue to see our community grow:

Qi Zhu (zhuqi-lucas) and Yoav Cohen (yoavcloud) became committers. See the mailing list for more details.
In the core DataFusion repo alone, we reviewed and accepted 318 PRs from 79 different committers, created over 235 issues, and closed 197 of them 🚀. All changes are listed in the detailed changelogs.
DataFusion published several blogs, including Using External Indexes, Metadata Stores, Catalogs and Caches to Accelerate Queries on Apache Parquet, Dynamic Filters: Passing Information Between Operators During Execution for 25x Faster Queries, and Implementing User Defined Types and Custom Metadata in DataFusion.

New Features ✨¶

Improved Spilling Sorts for Larger-than-Memory Datasets¶

DataFusion has long been able to sort datasets that do not fit entirely in memory, but still struggled with particularly large inputs or highly memory-constrained setups. Larger-than-memory sorts in DataFusion 50.0.0 have been improved with the recent introduction of multi-level merge sorts (more details in the respective ticket). It is now possible to execute almost any sorting query that would have previously triggered out-of-memory errors, by relying on disk spilling. Thanks to Raz Luvaton, Yongting You, and ding-young for delivering this feature.
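To give a feel for the idea behind a multi-level merge, here is a minimal, standard-library-only sketch. It is not DataFusion's implementation: the function names and the `max_fan_in` parameter are hypothetical, and the sorted runs live in memory here, whereas in a real spilling sort they would be files on disk. The point is only that when too many sorted runs exist to merge at once, they can be merged in groups over several passes.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merge several already-sorted runs into one sorted run with a k-way heap merge.
fn merge_runs(runs: &[Vec<i64>]) -> Vec<i64> {
    // Heap entries are (value, run index, position in run); `Reverse` turns
    // the max-heap into a min-heap so the smallest value is popped first.
    let mut heap = BinaryHeap::new();
    for (run_idx, run) in runs.iter().enumerate() {
        if let Some(&first) = run.first() {
            heap.push(Reverse((first, run_idx, 0usize)));
        }
    }
    let mut out = Vec::with_capacity(runs.iter().map(Vec::len).sum());
    while let Some(Reverse((value, run_idx, pos))) = heap.pop() {
        out.push(value);
        // Refill the heap from the run that just produced a value.
        if let Some(&next) = runs[run_idx].get(pos + 1) {
            heap.push(Reverse((next, run_idx, pos + 1)));
        }
    }
    out
}

/// Merge sorted runs while never reading more than `max_fan_in` runs at once
/// (hypothetical knob standing in for a memory budget). Each pass merges
/// groups of `max_fan_in` runs; passes repeat until a single run remains.
fn multi_level_merge(mut runs: Vec<Vec<i64>>, max_fan_in: usize) -> Vec<i64> {
    assert!(max_fan_in >= 2, "a fan-in of at least 2 is needed to make progress");
    while runs.len() > 1 {
        runs = runs.chunks(max_fan_in).map(merge_runs).collect();
    }
    runs.pop().unwrap_or_default()
}
```

With five runs and a fan-in of 2, the sketch needs three passes (5 runs, then 3, then 2, then 1). A larger fan-in means fewer passes but more runs open at the same time.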

Dynamic Filter Pushdown for Hash Joins¶

The dynamic filter pushdown optimization has been extended to inner hash joins, dramatically reducing the amount of scanned data in some workloads—a technique sometimes referred to as Sideways Information Passing.

These filters are automatically applied to inner hash joins, while future work will introduce them to other join types.

For example, given a query that looks for a specific customer and their orders, DataFusion can now filter the orders relation based on the c_custkey of the target customer, reducing the amount of data read from disk by orders of magnitude.

-- retrieve the orders of the customer with c_phone = '25-989-741-2988'
SELECT *
FROM customer
JOIN orders ON c_custkey = o_custkey
WHERE c_phone = '25-989-741-2988';

The following shows an execution plan in DataFusion 50.0.0 with this optimization:

HashJoinExec
    DataSourceExec: <-- read customer
      predicate=c_phone@4 = 25-989-741-2988
      metrics=[output_rows=1, ...]
    DataSourceExec: <-- read orders
      -- dynamic filter is added here, filtering directly at scan time
      predicate=DynamicFilterPhysicalExpr [ o_custkey@1 >= 1 AND o_custkey@1 <= 1 ]
      -- the number of output rows is kept to a minimum
      metrics=[output_rows=11, ...]

Because there is a single customer in this query, almost all rows from orders are filtered out by the join. In previous versions of DataFusion, the entire orders relation would be scanned to join with the target customer, but now the dynamic filter pushdown can filter it right at the source, minimizing the amount of data decoded.
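The mechanism can be approximated in a few lines of standard-library Rust. This is only a sketch under simplifying assumptions, not DataFusion's code: `KeyBounds` and `join_with_dynamic_filter` are hypothetical names, keys are plain `i64` values, and the "scan" is a loop over an in-memory slice. It shows the build side producing min/max bounds that are then applied to the probe side before any hash lookup happens.

```rust
use std::collections::HashMap;

/// Inclusive key bounds collected from the build side of the join.
#[derive(Debug, Clone, Copy, PartialEq)]
struct KeyBounds {
    min: i64,
    max: i64,
}

impl KeyBounds {
    /// Compute bounds from the build-side join keys; `None` if there are no keys.
    fn from_keys(keys: &[i64]) -> Option<Self> {
        let min = *keys.iter().min()?;
        let max = *keys.iter().max()?;
        Some(Self { min, max })
    }

    fn contains(&self, key: i64) -> bool {
        self.min <= key && key <= self.max
    }
}

/// Inner join of `(c_custkey, c_name)` rows with `(o_orderkey, o_custkey)` rows.
/// Returns the joined rows and the number of probe rows that passed the filter.
fn join_with_dynamic_filter(
    customers: &[(i64, &str)],
    orders: &[(i64, i64)],
) -> (Vec<(String, i64)>, usize) {
    // Build phase: hash the (small, already filtered) customer side.
    let build: HashMap<i64, &str> = customers.iter().copied().collect();
    let keys: Vec<i64> = build.keys().copied().collect();
    let Some(bounds) = KeyBounds::from_keys(&keys) else {
        return (Vec::new(), 0); // empty build side: an inner join produces nothing
    };

    let mut passed_filter = 0;
    let mut joined = Vec::new();
    for &(o_orderkey, o_custkey) in orders {
        // The "dynamic filter": rows outside the bounds never reach the join.
        if !bounds.contains(o_custkey) {
            continue;
        }
        passed_filter += 1;
        // Probe phase: the hash lookup is still needed, the bounds only prune.
        if let Some(name) = build.get(&o_custkey) {
            joined.push((name.to_string(), o_orderkey));
        }
    }
    (joined, passed_filter)
}
```

In a real engine the benefit comes from applying the bounds inside the scan, where file and row group statistics let whole chunks be skipped without being decoded; the per-row check above only illustrates the data flow.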

More information can be found in the respective ticket and the next step will be to extend the dynamic filters to other types of joins, such as LEFT and RIGHT outer joins. Thanks to Adrian Garcia Badaracco, Qi Zhu, xudong963, Daniël Heres, and Lía Adriana for delivering this feature.

Parquet Metadata Cache¶

The metadata of Parquet files (statistics, page indexes, etc.) is now automatically cached when using the built-in ListingTable, which reduces disk/network round-trips and repeated decoding of the same information. With a simple microbenchmark that executes point reads (e.g., SELECT v FROM t WHERE k = x) over large files, we measured a 12x improvement in execution time (more details can be found in the respective ticket). This optimization is production ready and enabled by default (more details in the Epic). Thanks to Nuno Faria, Jonathan Chen, Shehab Amin, Oleks V, Tim Saucer, and Blake Orth for delivering this feature.

Here is an example of the metadata cache in action:

-- disabling the metadata cache
> SET datafusion.runtime.metadata_cache_limit = '0M';

-- simple query (t.parquet: 100M rows, 3 cols)
> EXPLAIN ANALYZE SELECT * FROM 't.parquet' LIMIT 1;
DataSourceExec: ... metrics=[..., metadata_load_time=229.196422ms, ...]
Elapsed 0.246 seconds.

-- enabling the metadata cache
> SET datafusion.runtime.metadata_cache_limit = '50M';

> EXPLAIN ANALYZE SELECT * FROM 't.parquet' LIMIT 1;
DataSourceExec: ... metrics=[..., metadata_load_time=228.612µs, ...]
Elapsed 0.003 seconds. -- 82x improvement in this specific query

The cache can be configured with the following runtime parameter:

datafusion.runtime.metadata_cache_limit

The default FileMetadataCache uses a least-recently-used eviction algorithm and up to 50MB of memory. If the underlying file changes, the cache is automatically invalidated. Setting the limit to 0 will disable any metadata caching. As with most APIs in DataFusion, users can provide their own behavior using a custom FileMetadataCache implementation when setting up the RuntimeEnv.
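The behavior described above (a byte limit, least-recently-used eviction, invalidation when the file changes, and a limit of 0 disabling caching) can be illustrated with a small standard-library sketch. This is not the real FileMetadataCache trait or its default implementation; `MetadataCache`, its fields, and the use of a plain `u64` modification stamp are all assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical cache entry: serialized metadata plus the file's modification stamp.
struct Entry {
    modified: u64,
    metadata: Vec<u8>,
    last_used: u64,
}

/// A byte-limited, least-recently-used metadata cache (illustrative only).
struct MetadataCache {
    limit_bytes: usize,
    used_bytes: usize,
    clock: u64,
    entries: HashMap<String, Entry>,
}

impl MetadataCache {
    fn new(limit_bytes: usize) -> Self {
        Self { limit_bytes, used_bytes: 0, clock: 0, entries: HashMap::new() }
    }

    /// Return cached metadata only if the file is unchanged since it was cached.
    fn get(&mut self, path: &str, modified: u64) -> Option<&[u8]> {
        self.clock += 1;
        let stale = match self.entries.get(path) {
            Some(entry) => entry.modified != modified,
            None => return None,
        };
        if stale {
            // The underlying file changed: drop the outdated entry.
            if let Some(old) = self.entries.remove(path) {
                self.used_bytes -= old.metadata.len();
            }
            return None;
        }
        let now = self.clock;
        let entry = self.entries.get_mut(path)?;
        entry.last_used = now;
        Some(entry.metadata.as_slice())
    }

    fn put(&mut self, path: &str, modified: u64, metadata: Vec<u8>) {
        if let Some(old) = self.entries.remove(path) {
            self.used_bytes -= old.metadata.len();
        }
        // A limit of 0 disables caching; oversized entries are never stored.
        if self.limit_bytes == 0 || metadata.len() > self.limit_bytes {
            return;
        }
        // Evict least-recently-used entries until the new one fits.
        while self.used_bytes + metadata.len() > self.limit_bytes {
            let victim = self
                .entries
                .iter()
                .min_by_key(|(_, entry)| entry.last_used)
                .map(|(key, _)| key.clone());
            match victim {
                Some(key) => {
                    if let Some(old) = self.entries.remove(&key) {
                        self.used_bytes -= old.metadata.len();
                    }
                }
                None => break,
            }
        }
        self.clock += 1;
        self.used_bytes += metadata.len();
        let last_used = self.clock;
        self.entries.insert(path.to_string(), Entry { modified, metadata, last_used });
    }
}
```

The linear scan for the eviction victim keeps the sketch short; a production cache would typically track recency with a dedicated ordered structure.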

For users with custom TableProvider:

If the custom provider uses the ParquetFormat, caching will work without any changes.
Otherwise the CachedParquetFileReaderFactory can be provided when creating a ParquetSource.

Users can inspect the cache contents through the FileMetadataCache::list_entries method, or with the metadata_cache() function in datafusion-cli:

> SELECT * FROM metadata_cache();
+---------------+-------------------------+-----------------+--------------------------+---------+---------------------+------+-----------------+
| path          | file_modified           | file_size_bytes | e_tag                    | version | metadata_size_bytes | hits | extra           |
+---------------+-------------------------+-----------------+--------------------------+---------+---------------------+------+-----------------+
| .../t.parquet | 2025-09-21T17:40:13.650 | 420827020       | 0-63f5331fb4458-19154f8c | NULL    | 44480534            | 27   | page_index=true |
+---------------+-------------------------+-----------------+--------------------------+---------+---------------------+------+-----------------+
1 row(s) fetched.
Elapsed 0.003 seconds.

`QUALIFY` Clause¶

DataFusion now supports the QUALIFY SQL clause (#16933), which simplifies filtering window function output (similar to how HAVING filters aggregation output).

For example, filtering the output of the rank() function previously required a query like this:

SELECT a, b, c
FROM (
   SELECT a, b, c, rank() OVER(PARTITION BY a ORDER BY b) as rk
   FROM t
)
WHERE rk = 1

The same query can now be written like this:

SELECT a, b, c, rank() OVER(PARTITION BY a ORDER BY b) as rk
FROM t
QUALIFY rk = 1

Although it is not part of the SQL standard (yet), it has been gaining adoption in several SQL analytical systems such as DuckDB, Snowflake, and BigQuery. Thanks to Huaijin and Jonah Gao for delivering this feature.

`FILTER` Support for Window Functions¶

Continuing the theme, the FILTER clause has been extended to support aggregate window functions. It allows these functions to apply to specific rows without having to rely on CASE expressions, similar to what was already possible with regular aggregate functions.

For example, we can gather multiple distinct sets of values matching different criteria with a single pass over the input:

SELECT 
  ARRAY_AGG(c2) FILTER (WHERE c2 >= 2) OVER (...),    -- e.g. [2, 3, 4]
  ARRAY_AGG(CASE WHEN c2 >= 2 THEN c2 END) OVER (...) -- e.g. [NULL, NULL, 2, 3, 4]
...
FROM table
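The difference between the two expressions in the query above can be mimicked in plain Rust. This sketch ignores window frames entirely (it aggregates over the whole input, as if every row saw the same frame), and the function names are hypothetical; it only shows why the FILTER form yields a shorter array while the CASE form keeps a NULL for every non-matching row.

```rust
/// `ARRAY_AGG(c2) FILTER (WHERE c2 >= 2)`: rows failing the predicate are
/// never handed to the aggregate.
fn array_agg_filter(c2: &[i64]) -> Vec<i64> {
    c2.iter().copied().filter(|value| *value >= 2).collect()
}

/// `ARRAY_AGG(CASE WHEN c2 >= 2 THEN c2 END)`: every row contributes, and
/// non-matching rows contribute NULL (modeled here as `None`).
fn array_agg_case(c2: &[i64]) -> Vec<Option<i64>> {
    c2.iter()
        .map(|&value| if value >= 2 { Some(value) } else { None })
        .collect()
}
```

For the input `[0, 1, 2, 3, 4]`, the first function returns `[2, 3, 4]` and the second returns `[None, None, Some(2), Some(3), Some(4)]`, matching the comments in the SQL example.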

Thanks to Geoffrey Claude and Jeffrey Vo for delivering this feature.

`ConfigOptions` Now Available to Functions¶

DataFusion 50.0.0 now passes session configuration parameters to User-Defined Functions (UDFs) via ScalarFunctionArgs (#16970). This allows behavior that varies based on runtime state; for example, time UDFs can use the session-specified time zone instead of just UTC.

Thanks to Bruce Ritchie, Piotr Findeisen, Oleks V, and Andrew Lamb for delivering this feature.

Additional Apache Spark Compatible Functions¶

Finally, due to Apache Spark's impact on analytical processing, many DataFusion users desire Spark compatibility in their workloads, so DataFusion provides a set of Spark-compatible functions in the datafusion-spark crate. You can read more about this project in the announcement and epic. DataFusion 50.0.0 adds several new such functions:

Thanks to David López, Chen Chongchen, Alan Tang, Peter Nguyen, and Evgenii Glotov for delivering these functions. We are looking for additional help reviewing and implementing more functions; please reach out on the epic if you are interested.

Known Issues / Patchset¶

As DataFusion continues to mature, we regularly release patch versions to fix issues in major releases. Since the release of 50.0.0, we have identified a few issues, and expect to release 50.1.0 to address them. You can track progress in this ticket.

Upgrade Guide and Changelog¶

Upgrading to 50.0.0 should be straightforward for most users. Please review the Upgrade Guide for details on breaking changes and code snippets to help with the transition. Recently, some users have reported success automatically upgrading DataFusion by pairing AI tools with the upgrade guide. For a comprehensive list of all changes, please refer to the changelog.

About DataFusion¶

How to Get Involved¶

Implementing User Defined Types and Custom Metadata in DataFusion

2025-09-21T00:00:00+00:00

Apache DataFusion significantly improves support for user defined types and metadata. The user defined function APIs let users access metadata on the input columns to functions and produce metadata in the output.

User defined types == extension types¶

DataFusion directly uses Apache Arrow's DataTypes as its type system. This has several benefits: it is simple to explain, supports a rich set of both scalar and nested types, offers true zero-copy interoperability with other Arrow implementations, and has world-class library support (via arrow-rs). However, one challenge of directly using the Arrow type system is that there is no distinction between logical types and physical types. For example, the Arrow type system contains multiple types which can store "String"s (sequences of UTF-8 encoded bytes) such as Utf8, LargeUtf8, Dictionary(Utf8), and Utf8View.

However, Apache Arrow does provide extension types, a version of logical type information, which describe how to interpret data stored in one of the existing physical types. With the improved support for metadata in DataFusion 48.0.0, it is now easier to implement user defined types using Arrow extension types.

Metadata in Apache Arrow `Field`s¶

The Arrow specification defines Metadata as a map of key-value pairs of strings. This metadata is used to attach extension types and use case-specific context to a column of values. The Rust implementation of Apache Arrow, arrow-rs, stores metadata on Fields, but prior to DataFusion 48.0.0, many of DataFusion's internal APIs used DataTypes directly, and thus did not propagate metadata through all operations.

In previous versions of DataFusion, Field metadata was propagated through certain operations (e.g., renaming or selecting a column) but not through others (e.g., scalar, window, or aggregate function calls). In DataFusion 48.0.0 and later, all user defined functions are passed the full input Field information and can return Field information to the caller.

Supporting extension types was a key motivation for adding metadata to function processing. The same mechanism can store arbitrary metadata on the input and output fields, which supports other interesting use cases, as we describe later in this post.

Metadata handling¶

Arrow record batches carry a Schema in addition to the Arrow arrays. Each Field in this Schema contains a name, data type, nullability, and metadata. The metadata is specified as a map of key-value pairs of strings. In the new implementation, the input field information is passed to every user defined function during processing.

Figure 1: Relationship between a Record Batch, its schema, and the underlying arrays. There is a one-to-one relationship between each Field in the Schema and Array entry in the Columns.

It is often desirable to write a generic function for reuse. Prior versions of user defined functions only had access to the DataType of the input columns. This works well for some features that only rely on the types of data, but other use cases may need additional information that describes the data.

For example, suppose we wish to write a function that takes in a UUID and returns a string of the variant of the input field. We would want this function to be able to handle all of the string types and also a binary encoded UUID. Because the Arrow specification does not contain an unsigned 128-bit value, it is common to encode a UUID as a fixed size binary array where each element is 16 bytes long. With the metadata handling in DataFusion 48.0.0 we can validate during planning that the input data not only has the correct underlying data type, but that it also represents the right kind of data. The UUID example is a common one, and it is included in the canonical extension types that are now supported in DataFusion.

Another common application of metadata handling is understanding encoding of a blob of data. Suppose you have a column that contains image data. Most likely this data is stored as an array of u8 data. Without knowing a priori what the encoding of that blob of data is, you cannot ensure you are using the correct methods for decoding it. You may work around this by adding another column to your data source indicating the encoding, but this can be wasteful for systems where the encoding never changes. Instead, you could use metadata to specify the encoding for the entire column.

How to use metadata in user defined functions¶

When working with metadata for user defined scalar functions, there are typically two places in the function definition that require implementation.

Computing the return field from the arguments
Invocation

During planning, we will attempt to call the function return_field_from_args(). This will provide a list of input fields to the function and return the output field. To evaluate metadata on the input side, you can write a function similar to this example:

fn return_field_from_args(
    &self,
    args: ReturnFieldArgs,
) -> datafusion::common::Result<FieldRef> {
    if args.arg_fields.len() != 1 {
        return exec_err!("Incorrect number of arguments for uuid_version");
    }

    let input_field = &args.arg_fields[0];
    if &DataType::FixedSizeBinary(16) == input_field.data_type() {
        let Ok(CanonicalExtensionType::Uuid(_)) = input_field.try_canonical_extension_type()
        else {
            return exec_err!("Input field must contain the UUID canonical extension type");
        };
    }

    let is_nullable = args.arg_fields[0].is_nullable();

    Ok(Arc::new(Field::new(self.name(), DataType::UInt32, is_nullable)))
}

In this example, we take advantage of the fact that we already have support for extension types that evaluate metadata. If you were attempting to check for metadata other than extension type support, you could instead write a snippet such as:

    if &DataType::FixedSizeBinary(16) == input_field.data_type() {
        // Fail planning unless the field carries the expected metadata key
        let _ = input_field
            .metadata()
            .get("ARROW:extension:metadata")
            .ok_or(exec_datafusion_err!("Input field must contain the UUID canonical extension type"))?;
    }

If you are writing a user defined function that will instead return metadata on output you can add this directly into the Field that is the output of the return_field_from_args call. In our above example, we could change the return line to:

    Ok(Arc::new(
        Field::new(self.name(), DataType::UInt32, is_nullable).with_metadata(
            [("my_key".to_string(), "my_value".to_string())]
                .into_iter()
                .collect(),
        ),
    ))

By checking the metadata during the planning process, we can identify errors early in the query process. There are cases where we wish to have access to this metadata during execution as well. The function invoke_with_args in the user defined function takes the updated struct ScalarFunctionArgs. This now contains the input fields, which can be used to check for metadata. For example, you can do the following:

fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
    assert_eq!(args.arg_fields.len(), 1);
    let my_value = args.arg_fields[0]
        .metadata()
        .get("encoding_type");
    ...

In this snippet we have extracted an Option<&String> from the input field metadata which we can then use to determine which functions we might want to call. We could then parse the returned value to determine what type of encoding to use when evaluating the array in the arguments. Since return_field_from_args is not &mut self this check could not be performed during the planning stage.
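As a sketch of what "parsing the returned value" might look like, the following standard-library-only example maps a metadata value to a decoder choice. The `Encoding` enum, the `encoding_from_metadata` helper, and the `"png"` / `"jpeg"` values are hypothetical; the only thing taken from the snippet above is the `encoding_type` key and the fact that field metadata is a map from `String` to `String`.

```rust
use std::collections::HashMap;

/// Hypothetical encodings a blob column might declare in its field metadata.
#[derive(Debug, PartialEq)]
enum Encoding {
    Png,
    Jpeg,
}

/// Read the `encoding_type` key from field metadata and pick a decoder.
/// Unknown or missing values become errors instead of silently guessing.
fn encoding_from_metadata(metadata: &HashMap<String, String>) -> Result<Encoding, String> {
    match metadata.get("encoding_type").map(String::as_str) {
        Some("png") => Ok(Encoding::Png),
        Some("jpeg") => Ok(Encoding::Jpeg),
        Some(other) => Err(format!("unsupported encoding_type: {other}")),
        None => Err("field metadata has no encoding_type key".to_string()),
    }
}
```

Inside a real invoke_with_args, the map would come from `args.arg_fields[0].metadata()`, and the error strings would be wrapped in the function's own error type.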

The description in this section applies to scalar user defined functions, but equivalent support exists for aggregate and window functions.

Extension types¶

Extension types are one of the primary motivations for this enhancement in DataFusion 48.0.0. The official Rust implementation of Apache Arrow, arrow-rs, already contains support for the canonical extension types. This support includes helper functions such as try_canonical_extension_type() in the earlier example.

For a concrete example of how extension types can be used in DataFusion functions, there is an example repository that demonstrates using UUIDs. The UUID extension type specifies that the data are stored as a Fixed Size Binary of length 16. In the DataFusion core functions, we have the ability to generate string representations of UUIDs that match the version 4 specification. These are helpful, but a user may wish to do additional work with UUIDs where having them in the dense representation is preferable. Alternatively, the user may already have data with the binary encoding and we want to extract values such as the version, timestamp, or string representation.
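To make the dense representation concrete, here is a small standard-library sketch of the kind of work such functions do on the 16 raw bytes. These helpers are hypothetical and are not the code from the example repository; they rely only on the standard UUID layout, in which the version is stored in the high four bits of byte 6 and the textual form groups the hex digits as 8-4-4-4-12.

```rust
/// Extract the version number from a 16-byte UUID: the high nibble of byte 6.
fn uuid_version(bytes: &[u8; 16]) -> u32 {
    u32::from(bytes[6] >> 4)
}

/// Format 16 raw bytes as the canonical lower-case, hyphenated UUID string.
fn uuid_to_string(bytes: &[u8; 16]) -> String {
    // Two hex digits per byte gives a 32-character string.
    let hex: String = bytes.iter().map(|b| format!("{b:02x}")).collect();
    format!(
        "{}-{}-{}-{}-{}",
        &hex[0..8],
        &hex[8..12],
        &hex[12..16],
        &hex[16..20],
        &hex[20..32]
    )
}
```

A UDF wrapping these helpers would apply them to each 16-byte element of a FixedSizeBinary(16) array, after checking the extension type during planning as shown earlier.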

In the example repository we have created three user defined functions: UuidVersion, StringToUuid, and UuidToString. Each of these implements ScalarUDFImpl and can be used thusly:

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = create_context()?;

    // get a DataFrame from the context
    let mut df = ctx.table("t").await?;

    // Create the string UUIDs
    df = df.select(vec![uuid().alias("string_uuid")])?;

    // Convert string UUIDs to canonical extension UUIDs
    let string_to_uuid = ScalarUDF::new_from_impl(StringToUuid::default());
    df = df.with_column("uuid", string_to_uuid.call(vec![col("string_uuid")]))?;

    // Extract version number from canonical extension UUIDs
    let version = ScalarUDF::new_from_impl(UuidVersion::default());
    df = df.with_column("version", version.call(vec![col("uuid")]))?;

    // Convert back to a string
    let uuid_to_string = ScalarUDF::new_from_impl(UuidToString::default());
    df = df.with_column("string_round_trip", uuid_to_string.call(vec![col("uuid")]))?;

    df.show().await?;

    Ok(())
}

The example repository also contains a crate that demonstrates how to expose these UDFs to datafusion-python. This requires version 48.0.0 or later.

Other use cases¶

The metadata attached to the fields can be used to store any user data in key/value pairs. Some of the other use cases that have been identified include:

Creating output for downstream systems. One user of DataFusion produces data visualizations that are dependent upon metadata in record batch fields. By enabling metadata on output of user defined functions, we can now produce batches that are directly consumable by these systems.
Describing the relationships between columns of data. You can store data about how one column of data relates to another and use these during function evaluation. For example, in robotics it is common to use transforms to describe how to convert from one coordinate system to another. It can be convenient to send the function all the columns that contain transform information and then allow the function to determine which columns to use based on the metadata. This allows for encapsulation of the transform logic within the user function.
Storing logical types of the data model. InfluxDB uses field metadata to specify which columns are used for tags, times, and fields.

Based on the experience of the authors, we recommend caution when using metadata for use cases other than type extension. One issue that can arise is that as columns are used to compute new fields, some functions may pass through the metadata and the semantic meaning may change. For example, suppose you decided to use metadata to store some kind of statistics for the entire stream of record batches. Then you pass that column through a filter that removes many rows of data. Your statistics metadata may now be invalid, even though it was passed through the filter.

Similarly, if you use metadata to form relations between one column and another and the naming of the columns has changed at some point in your workflow, then the metadata may indicate an incorrect column of data it is referring to. This can be mitigated by not relying on column naming but rather adding additional metadata to all columns of interest.

Acknowledgements¶

We would like to thank Rerun.io for sponsoring the development of this work. Rerun.io is building a data visualization system for Physical AI and uses metadata to specify context about columns in Arrow record batches.

Conclusion¶

The enhanced metadata handling in DataFusion 48.0.0 is a significant step forward in the ability to handle more interesting types of data. Users can validate the input data matches the intent of the data to be processed, enable complex operations on binary data because we understand the encoding used, and use metadata to create new and interesting user defined data types. We can't wait to see what you build with it!

Get Involved¶

The DataFusion team is an active and engaging community and we would love to have you join us and help the project.

Here are some ways to get involved:

Learn more by visiting the DataFusion project page.
Try out the project and provide feedback, file issues, and contribute code.
Work on a good first issue.
Reach out to us via the communication doc.

Apache DataFusion Comet 0.10.0 Release

2025-09-16T00:00:00+00:00

The Apache DataFusion PMC is pleased to announce version 0.10.0 of the Comet subproject.

Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for improved performance and efficiency without requiring any code changes.

This release covers approximately ten weeks of development work and is the result of merging 183 PRs from 26 contributors. See the change log for more information.

Release Highlights¶

Improved Support for Apache Iceberg¶

It is now possible to use Comet with Apache Iceberg 1.8.1 to accelerate reads of Iceberg Parquet tables. Please refer to Comet's Iceberg Guide for information on building Iceberg with Comet.

Improved Spark 4.0.0 Support¶

Comet no longer falls back to Spark for all queries when ANSI mode is enabled (which is the default in Spark 4.0.0). Instead, Comet will now only fall back to Spark for arithmetic and aggregate expressions that support ANSI mode.

Setting spark.comet.ansi.ignore=true will override this behavior and force these expressions to continue to be accelerated by Comet. Full support for ANSI mode will be available in a future release.

Comet will now use the native_iceberg_compat scan for Spark 4.0.0 in most cases, which supports reading complex types.

New Functionality¶

The following SQL functions are now supported:

array_min
map_entries
map_from_array
randn
from_unixtime
monotonically_increasing_id
spark_partition_id
try_add
try_divide
try_mod
try_multiply
try_subtract

Other new features include:

Support for array literals
Support for limit with offset

UX Improvements¶

Improved reporting of reasons why Comet cannot accelerate some operators and expressions
New spark.comet.logFallbackReasons.enabled configuration setting for logging all fallback reasons
CometScan nodes in the physical plan now show which scan implementation is being used (native_comet, native_datafusion, or native_iceberg_compat)

Bug Fixes¶

Improved memory safety for FFI transfers
Fixed a double-free issue in the shuffle unified memory pool
Fixed an FFI issue with non-zero offsets
Fixed an issue with buffered reads from HDFS

Benchmarking¶

Benchmarking scripts for benchmarks based on TPC-H and TPC-DS are now available in the repository under dev/benchmarks.

Documentation Updates¶

The documentation for supported operators and expressions is now more complete, and Spark-compatibility status per operator/expression is now documented.
The documentation now contains a roadmap section.
New guide comparing Comet with Apache Gluten (incubating) + Velox
User guides are now available for multiple Comet versions

Spark Compatibility¶

Spark 3.4.3 with JDK 11 & 17, Scala 2.12 & 2.13
Spark 3.5.4 through 3.5.6 with JDK 11 & 17, Scala 2.12 & 2.13
Experimental support for Spark 4.0.0 with JDK 17, Scala 2.13

We are looking for help from the community to fully support Spark 4.0.0. See EPIC: Support 4.0.0 for more information.

Getting Involved¶

The Comet project welcomes new contributors. We use the same Slack and Discord channels as the main DataFusion project and have a weekly DataFusion video call.

There are also many good first issues waiting for contributions.

Dynamic Filters: Passing Information Between Operators During Execution for 25x Faster Queries

2025-09-10T00:00:00+00:00

This blog post introduces the query engine optimization techniques called TopK and dynamic filters. We describe the motivating use case, how these optimizations work, and how we implemented them with the Apache DataFusion community to improve performance by an order of magnitude for some query patterns.

Motivation and Results¶

The main commercial product at Pydantic, Logfire, is an observability platform built on DataFusion. One of the most common workflows / queries is "show me the last K traces" which translates to a query similar to:

SELECT * FROM records ORDER BY start_timestamp DESC LIMIT 1000;

We noticed this was pretty slow, even though DataFusion has long had the classic TopK optimization (described below). After implementing the dynamic filter techniques described in this blog, we saw performance improve by over 10x for this query pattern, and are applying the optimization to other queries and operators as well.

Let's look at some preliminary numbers, using ClickBench, which has the same pattern as our motivating example:

SELECT * FROM hits WHERE "URL" LIKE '%google%' ORDER BY "EventTime" LIMIT 10;

Figure 1: Execution times for ClickBench Q23 with and without dynamic filters (DF)¹, and late materialization (LM)² for different partitions / core usage. Dynamic filters alone (yellow) and late materialization alone (red) show a large improvement over the baseline (blue). When both optimizations are enabled (green) performance improves by up to 22x. See the appendix for more measurement details.

Background: TopK and Dynamic Filters¶

To explain how dynamic filters improve query performance, we first need to explain the so-called "TopK" optimization. To do so, we will use a simplified version of ClickBench Q23:

SELECT * 
FROM hits 
ORDER BY "EventTime"
LIMIT 10

A straightforward, though slow, plan to answer this query is shown in Figure 2.

Figure 2: Simple Query Plan for ClickBench Q23. Data flows in plans from the scan at the bottom to the limit at the top. This plan reads all 100M rows of the hits table, sorts them by EventTime, and then discards everything except the top 10 rows.

This naive plan requires substantial effort as all columns from all rows are decoded and sorted, even though only 10 are returned.

High-performance query engines typically avoid the expensive full sort with a specialized operator that tracks the current top rows using a heap, rather than sorting all the data. For example, this operator is called TopK in DataFusion, SortWithLimit in Snowflake, and topn in DuckDB. The plan for Q23 using this specialized operator is shown in Figure 3.

Figure 3: Query plan for Q23 in DataFusion using the TopK operator. This plan still reads all 100M rows of the hits table, but instead of first sorting them all by EventTime, the TopK operator keeps track of the current top 10 rows using a min/max heap. Credit to Visualgo for the heap icon

Figure 3 is better, but it still reads and decodes all 100M rows of the hits table, which is often unnecessary once we have found the top 10 rows. For example, because this query sorts in ascending order, if the current top 10 rows all have EventTime in 2024, then any subsequent rows with EventTime in 2025 or later can be skipped entirely without reading or decoding them. This technique is especially effective at skipping entire files or row groups if the top 10 values are in the first few files read, which is very common when the data insert order is approximately the same as the timestamp order.

Leveraging this insight is the key idea behind dynamic filters, which introduce a runtime mechanism for the TopK operator to provide the current top values to the scan operator, allowing it to skip unnecessary rows, entire files, or portions of files. The plan for Q23 with dynamic filters is shown in Figure 4.

Figure 4: Query plan for Q23 in DataFusion with specialized TopK operator and dynamic filters. The TopK operator provides the largest EventTime among the current top 10 rows to the scan operator, allowing it to skip rows with EventTime later than that value. The scan operator uses this dynamic filter to skip unnecessary files and rows, reducing the amount of data that needs to be read and processed.

Worked Example¶

To make dynamic filters more concrete, here is a fully worked example. Imagine we have a table records with a column start_timestamp and we are running the motivating query:

SELECT * 
FROM records 
ORDER BY start_timestamp DESC
LIMIT 3;

In this example, at some point during execution, the heap in the TopK operator will contain the actual 3 most recent values, which might be:

start_timestamp
2025-08-16T20:35:15.00Z
2025-08-16T20:35:14.00Z
2025-08-16T20:35:13.00Z

Since 2025-08-16T20:35:13.00Z is the smallest of these values, we know that any subsequent rows with start_timestamp less than or equal to this value cannot possibly be in the top 3, and can be skipped entirely. This knowledge is encoded in a filter of the form start_timestamp > '2025-08-16T20:35:13.00Z'. If we knew the correct timestamp value before starting the plan, we could simply write:

SELECT *
FROM records
WHERE start_timestamp > '2025-08-16T20:35:13.00Z'  -- Filter to skip rows
ORDER BY start_timestamp DESC
LIMIT 3;

And DataFusion's existing hierarchical pruning (described in this blog) would skip reading unnecessary files and row groups, and only decode the necessary rows.

However, when the query starts running we do not yet know the value '2025-08-16T20:35:13.00Z'. Instead, DataFusion now puts a dynamic filter into the plan, which you can think of as a function call like dynamic_filter(), something like this:

SELECT *
FROM records
WHERE dynamic_filter() -- Updated during execution as we know more
ORDER BY start_timestamp DESC
LIMIT 3;

In this case, dynamic_filter() initially has the value true (passes all rows) but will be progressively updated by the TopK operator as the query progresses to filter more and more rows. Note that while we are using SQL for illustrative purposes in this example, these optimizations are done at the physical plan (ExecutionPlan) level — and they apply equally to SQL, DataFrame APIs, and custom query languages built with DataFusion.
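
To make the "progressively updated" behavior concrete, here is a minimal, self-contained sketch in plain Rust. It is not DataFusion code: FileData, max_ts, and scan_top_k are hypothetical names, timestamps are plain i64 values, and the per-file maximum stands in for file-level statistics. The filter starts as None (the equivalent of true) and is tightened to start_timestamp > t each time the top 3 changes.

```rust
/// One input file with a hypothetical file-level statistic and its rows.
pub struct FileData {
    /// Largest `start_timestamp` in the file (known without decoding rows).
    pub max_ts: i64,
    /// The decoded `start_timestamp` values.
    pub rows: Vec<i64>,
}

/// Simulates `ORDER BY start_timestamp DESC LIMIT k` with a dynamic filter.
/// Returns the top-k values (largest first) and how many files were decoded.
pub fn scan_top_k(files: &[FileData], k: usize) -> (Vec<i64>, usize) {
    // Current top-k rows, kept sorted with the largest value first.
    let mut top: Vec<i64> = Vec::new();
    // The "dynamic filter": `None` means `true` (pass everything),
    // `Some(t)` means `start_timestamp > t`.
    let mut filter: Option<i64> = None;
    let mut files_decoded = 0;

    for file in files {
        // File-level pruning: no row in this file can pass the filter.
        if let Some(t) = filter {
            if file.max_ts <= t {
                continue;
            }
        }
        files_decoded += 1;

        for &ts in &file.rows {
            // Row-level filtering with the current filter value.
            if filter.map_or(false, |t| ts <= t) {
                continue;
            }
            top.push(ts);
            top.sort_unstable_by(|a, b| b.cmp(a));
            top.truncate(k);
            // Once k rows are held, the smallest of them becomes the new bound.
            if top.len() == k {
                filter = top.last().copied();
            }
        }
    }
    (top, files_decoded)
}
```

With four files whose maxima are 30, 25, 9, and 40, and k = 3, the third file is never decoded because its maximum is already below the bound established by the first two files.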

TopK + Dynamic Filters¶

As mentioned above, DataFusion has a specialized sort operator named TopK that only keeps K rows in memory. For a DESC sort order, each new input batch is compared against the current K largest values, and then the current K rows possibly get replaced with any new input rows that are larger. The code is here.
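
The heap-based bookkeeping can be sketched with the standard library alone. This is an illustration of the general technique, not DataFusion's TopK implementation: TopKLargest and its methods are hypothetical names, and rows are reduced to a single i64 sort key.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Keeps the `k` largest values seen so far (a `DESC ... LIMIT k` query).
pub struct TopKLargest {
    k: usize,
    // `Reverse` turns Rust's max-heap into a min-heap, so `peek()` returns
    // the smallest retained value, which is the one a new row must beat.
    heap: BinaryHeap<Reverse<i64>>,
}

impl TopKLargest {
    pub fn new(k: usize) -> Self {
        TopKLargest { k, heap: BinaryHeap::with_capacity(k + 1) }
    }

    /// Compares each value in the batch against the current top k.
    pub fn insert_batch(&mut self, batch: &[i64]) {
        for &v in batch {
            if self.heap.len() < self.k {
                self.heap.push(Reverse(v));
                continue;
            }
            // Heap is full: replace the smallest retained value if `v` beats it.
            let smallest = self.heap.peek().map(|r| r.0);
            if let Some(min) = smallest {
                if v > min {
                    self.heap.pop();
                    self.heap.push(Reverse(v));
                }
            }
        }
    }

    /// Smallest retained value once k values are held, otherwise `None`.
    /// This is the bound a dynamic filter could publish to the scan.
    pub fn threshold(&self) -> Option<i64> {
        if self.heap.len() == self.k {
            self.heap.peek().map(|r| r.0)
        } else {
            None
        }
    }

    /// Retained values, largest first.
    pub fn into_sorted_desc(self) -> Vec<i64> {
        // Ascending order of `Reverse<i64>` is descending order of the values.
        self.heap.into_sorted_vec().into_iter().map(|r| r.0).collect()
    }
}
```

Only k values are ever held in memory, and each new row costs at most one comparison plus an O(log k) heap update.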

Prior to dynamic filters, DataFusion had no early termination: it would read the entire records table even if it already had the top K rows because it still had to check that there were no rows that had larger start_timestamp. You can see how this is a problem if you have 2 years' worth of time-series data and the largest 1000 values of start_timestamp are likely within the first few files read. Even once the TopK operator has seen 1000 timestamps (e.g. on August 16th, 2025), DataFusion would still read all remaining files (e.g. even those that contain data only from 2024) just to make sure.

InfluxData optimized a similar query pattern in InfluxDB IOx using another operator called ProgressiveEvalExec. However, ProgressiveEvalExec requires that the data is already sorted and a careful analysis of ordering to prove that it can be used and still produce correct results. That is not the case for Logfire data (and many other datasets): data tends to be roughly sorted (e.g. if you append to files as you receive it) but that does not guarantee that it is fully sorted, either within or between files.

We discussed possible solutions with the community, and ultimately decided to implement generic "dynamic filters", which are general enough to be used in joins as well (see next section). Our implementation appears very similar to recently announced optimizations in closed-source, commercial systems such as Accelerating TopK Queries in Snowflake, or self-sharpening runtime filters in Alibaba Cloud's PolarDB, and we are excited that we can offer similar features in an open source query engine like DataFusion.

At the query plan level, Q23 looks like this before it is executed:

┌───────────────────────────┐
│       SortExec(TopK)      │
│    --------------------   │
│ EventTime@4 ASC NULLS LAST│
│                           │
│         limit: 10         │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│       DataSourceExec      │
│    --------------------   │
│         files: 100        │
│      format: parquet      │
│                           │
│         predicate:        │
│ CAST(URL AS Utf8View) LIKE│
│      %google% AND true    │
└───────────────────────────┘

Figure 5: Physical plan for ClickBench Q23 prior to execution. The dynamic filter is shown as true in the predicate field of the DataSourceExec operator.

The dynamic filter is updated by the SortExec(TopK) operator during execution as shown in Figure 6.

┌───────────────────────────┐
│       SortExec(TopK)      │
│    --------------------   │
│ EventTime@4 ASC NULLS LAST│
│                           │
│         limit: 10         │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│       DataSourceExec      │
│    --------------------   │
│         files: 100        │
│      format: parquet      │
│                           │
│         predicate:        │
│ CAST(URL AS Utf8View) LIKE│
│      %google% AND         │
│ EventTime < 1372713773.0  │
└───────────────────────────┘

Figure 6: Physical plan for ClickBench Q23 after execution. The dynamic filter has been updated to EventTime < 1372713773.0, which allows the DataSourceExec operator to skip files and rows that do not match the filter.

Hash Join + Dynamic Filters¶

We spent significant effort to make dynamic filters a general-purpose optimization (see the Extensibility section below for more details). Instead of a one-off optimization for TopK queries, we created a general mechanism for passing information between operators during execution that can be used in multiple contexts. We have already used the dynamic filter infrastructure to improve hash joins by implementing a technique called sideways information passing, which is similar to Bloom filter joins in Apache Spark. See issue #7955 for more details.

In a Hash Join, the query engine picks one input of the join to be the "build" input and the other input to be the "probe" side.

First, the build side is loaded into memory, and turned into a hash table.
Then, the probe side is scanned, and matching rows are found by looking in the hash table. Non-matching rows are discarded and thus joins often act as filters.

Many hash joins act as selective filters for rows from the probe side (when only a small number of rows are matched), so it is natural to use the same dynamic filter technique. DataFusion 50.0.0 pushes down knowledge of what keys exist on the build side into the scan of the probe side with a dynamic filter based on min/max join key values. For example, if the build side only has keys in the range [100, 200], then DataFusion will filter out all probe rows with keys outside that range during the scan.
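
The min/max idea fits in a few lines of standard Rust. The function names build_key_bounds and filter_probe are hypothetical, join keys are plain i64 values, and the sketch assumes an INNER join, so an empty build side means no probe row can match.

```rust
/// Inclusive min/max bounds of the build-side join keys.
/// Returns `None` when the build side is empty.
pub fn build_key_bounds(build_keys: &[i64]) -> Option<(i64, i64)> {
    let min = *build_keys.iter().min()?;
    let max = *build_keys.iter().max()?;
    Some((min, max))
}

/// Probe-side filter: keeps only keys inside the build-side bounds.
/// Keys inside the bounds may still fail the later hash table lookup.
pub fn filter_probe(probe_keys: &[i64], bounds: Option<(i64, i64)>) -> Vec<i64> {
    match bounds {
        Some((min, max)) => probe_keys
            .iter()
            .copied()
            .filter(|k| (min..=max).contains(k))
            .collect(),
        // Empty build side: an inner join produces no rows.
        None => Vec::new(),
    }
}
```

The bounds are a conservative filter: a key such as 175 passes the range check for build keys {100, 150, 200} even though it has no match, and the hash table lookup discards it afterwards.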

This simple approach is fast to evaluate and the filter improves performance significantly when combined with statistics pruning, late materialization, and other optimizations as shown in Figure 7.

Figure 7: Join performance with and without dynamic filters. In DataFusion 49.0.2 the join takes 2.5s, even with late materialization (LM) enabled. In DataFusion 50.0.0 with dynamic filters enabled (the default), the join takes only 0.5s, a 5x improvement. With both dynamic filters and late materialization, DataFusion 50.0.0 takes 0.1s, a 25x improvement. See this discussion for more details.

You can see dynamic join filters in action with the following example.

-- create two tables: small_table with 1K rows and large_table with 100K rows
COPY (SELECT i as k, i as v FROM generate_series(1, 1000) t(i)) TO 'small_table.parquet';
CREATE EXTERNAL TABLE small_table STORED AS PARQUET LOCATION 'small_table.parquet';
COPY (SELECT i as k FROM generate_series(1, 100000) t(i)) TO 'large_table.parquet';
CREATE EXTERNAL TABLE large_table STORED AS PARQUET LOCATION 'large_table.parquet';

-- Join the two tables, with a filter on small_table
EXPLAIN 
SELECT * 
FROM small_table JOIN large_table ON small_table.k = large_table.k 
WHERE small_table.v >= 50;

Note there are no filters on the large_table in the initial query, but a dynamic filter is introduced by DataFusion on the large_table scan. As the small_table is read and the hash table is built, the dynamic filter is updated to become more and more effective. Before execution, the plan looks like this:

+---------------+------------------------------------------------------------+
| plan_type     | plan                                                       |
+---------------+------------------------------------------------------------+
| physical_plan | ┌───────────────────────────┐                              |
|               | │    CoalesceBatchesExec    │                              |
|               | │    --------------------   │                              |
|               | │     target_batch_size:    │                              |
|               | │            8192           │                              |
|               | └─────────────┬─────────────┘                              |
|               | ┌─────────────┴─────────────┐                              |
|               | │        HashJoinExec       │                              |
|               | │    --------------------   ├──────────────┐               |
|               | │        on: (k = k)        │              │               |
|               | └─────────────┬─────────────┘              │               |
|               | ┌─────────────┴─────────────┐┌─────────────┴─────────────┐ |
|               | │   CoalescePartitionsExec  ││      RepartitionExec      │ |
|               | │                           ││    --------------------   │ |
|               | │                           ││ partition_count(in->out): │ |
|               | │                           ││          1 -> 16          │ |
|               | │                           ││                           │ |
|               | │                           ││    partitioning_scheme:   │ |
|               | │                           ││    RoundRobinBatch(16)    │ |
|               | └─────────────┬─────────────┘└─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐┌─────────────┴─────────────┐ |
|               | │    CoalesceBatchesExec    ││       DataSourceExec      │ |
|               | │    --------------------   ││    --------------------   │ |
|               | │     target_batch_size:    ││          files: 1         │ |
|               | │            8192           ││      format: parquet      │ |
|               | │                           ││      predicate: true      │ |
|               | └─────────────┬─────────────┘└───────────────────────────┘ |
|               | ┌─────────────┴─────────────┐                              |
|               | │         FilterExec        │                              |
|               | │    --------------------   │                              |
|               | │     predicate: v >= 50    │                              |
|               | └─────────────┬─────────────┘                              |
|               | ┌─────────────┴─────────────┐                              |
|               | │      RepartitionExec      │                              |
|               | │    --------------------   │                              |
|               | │ partition_count(in->out): │                              |
|               | │          1 -> 16          │                              |
|               | │                           │                              |
|               | │    partitioning_scheme:   │                              |
|               | │    RoundRobinBatch(16)    │                              |
|               | └─────────────┬─────────────┘                              |
|               | ┌─────────────┴─────────────┐                              |
|               | │       DataSourceExec      │                              |
|               | │    --------------------   │                              |
|               | │          files: 1         │                              |
|               | │      format: parquet      │                              |
|               | │     predicate: v >= 50    │                              |
|               | └───────────────────────────┘                              |
|               |                                                            |
+---------------+------------------------------------------------------------+

Figure 8: Physical plan for the join query before execution. The left input to the join is the build side, which scans small_table and applies the filter v >= 50. The right input to the join is the probe side, which scans large_table and has the dynamic filter (shown here as the placeholder true).

Dynamic Filter Extensibility: Custom `ExecutionPlan` Operators¶

We went to great efforts to ensure that dynamic filters are not a hardcoded black box that only works for internal operators. This is important not only for software maintainability, but also because DataFusion is used in many different contexts including advanced custom operators specialized for specific use cases.

Dynamic filter creation and pushdown are implemented as methods on the ExecutionPlan trait. Thus, it is possible for user-defined, custom ExecutionPlans to work with dynamic filters with little to no modification. We also provide an extensive library of helper structs and functions, so it often takes only 1-2 lines of code to implement filter pushdown support or a source of dynamic filters for custom operators.

This approach has already paid off, and we know of community members who have implemented support for dynamic filter pushdown using preview releases of DataFusion 50.0.0.

Design of Scan Operator Integration¶

A core design decision is to represent dynamic filters as Arc<dyn PhysicalExpr>, the same interface as all other expressions in DataFusion. This means that DataSourceExec and other scan operators do not require special logic to handle dynamic filters, and existing filter pushdown logic works without modification. We did add some new functionality to PhysicalExpr to make working with dynamic filters more performant for specific use cases:

PhysicalExpr::generation() -> u64: to track if a tree of filters has changed (e.g. it has a dynamic filter that has been updated). For example, if a predicate changes from c1 = 'a' AND DynamicFilter [ c2 > 1] to c1 = 'a' AND DynamicFilter [ c2 > 2] the generation value will also change so operators know if they should re-evaluate the filter against static data like file or row group level statistics. This is used in the ListingTable provider to do early termination of reading a file if the filter is updated mid scan to skip the entire file, without needlessly re-evaluating file level statistics on each batch.
PhysicalExpr::snapshot() -> Arc<dyn PhysicalExpr>: to create a snapshot of the filter at a given point in time. Dynamic filters use this to return the current value of their inner static filter. This can be used to serialize the filter across the network for distributed engines or pass to systems that support specific static filter patterns (e.g. stats pruning rewrites).

This is all implemented in the DynamicFilterPhysicalExpr struct.

Another important design point was handling concurrency and information flow. In early designs, the scan polled the source operators on every row / batch, which had significant overhead. The final design is a "push" model where the scan path has minimal locking and the write path (e.g. the TopK operator) is responsible for updating the filter. You can think of DynamicFilterPhysicalExpr as an Arc<RwLock<Arc<dyn PhysicalExpr>>>, which allows the TopK operator to update the filter without blocking the scan operator.
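
The mental model above can be sketched as follows. This is a simplified analogue rather than the actual DynamicFilterPhysicalExpr: SharedFilter, Predicate, and changed_since are hypothetical names, and the expression is reduced to an optional bound meaning col > bound.

```rust
use std::sync::{Arc, RwLock};

/// Simplified stand-in for a filter expression: `Some(b)` means `col > b`,
/// `None` is the initial `true` placeholder that passes every row.
pub type Predicate = Option<i64>;

struct Inner {
    predicate: Arc<Predicate>,
    generation: u64,
}

/// Filter shared between a producer (e.g. TopK) and a consumer (the scan).
#[derive(Clone)]
pub struct SharedFilter {
    inner: Arc<RwLock<Inner>>,
}

impl SharedFilter {
    pub fn new() -> Self {
        SharedFilter {
            inner: Arc::new(RwLock::new(Inner {
                predicate: Arc::new(None),
                generation: 0,
            })),
        }
    }

    /// Write path: the producer pushes a tighter predicate and bumps the
    /// generation so readers can tell that something changed.
    pub fn update(&self, bound: i64) {
        let mut guard = self.inner.write().unwrap();
        guard.predicate = Arc::new(Some(bound));
        guard.generation += 1;
    }

    /// Read path: the read lock is held only long enough to clone an `Arc`.
    pub fn snapshot(&self) -> Arc<Predicate> {
        let guard = self.inner.read().unwrap();
        Arc::clone(&guard.predicate)
    }

    pub fn generation(&self) -> u64 {
        let guard = self.inner.read().unwrap();
        guard.generation
    }

    /// Returns true only if the filter changed since `last_seen`, so a scan
    /// can skip re-checking file statistics when nothing is new.
    pub fn changed_since(&self, last_seen: &mut u64) -> bool {
        let current = self.generation();
        let changed = current != *last_seen;
        *last_seen = current;
        changed
    }
}
```

A scan would typically keep its own last_seen counter and re-evaluate file or row group statistics only when changed_since returns true, instead of on every batch.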

Future Work¶

Although we've made great progress and DataFusion now has one of the most advanced open-source dynamic filter / sideways information passing implementations that we know of, we see many areas of future improvement such as:

Support for more types of joins: This optimization is only implemented for INNER hash joins so far, but it could be implemented for other join algorithms (e.g. nested loop joins) and join types (e.g. LEFT OUTER JOIN).
Push down entire hash tables to the scan operator: Improve the representation of the dynamic filter beyond min/max values to improve performance for joins with many distinct matching keys that are not naturally ordered or have significant skew.
Use file level statistics to order files to match the ORDER BY clause as much as possible. This can help TopK dynamic filters be more effective at pruning by skipping more work earlier in the scan.
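
As a rough illustration of the last item, the sketch below orders files for an ORDER BY col DESC LIMIT k query so that files with the largest maximum are scanned first. FileStats and order_for_desc_scan are hypothetical names, and this is only one possible heuristic, not a description of planned DataFusion behavior.

```rust
/// Hypothetical per-file statistics for the `ORDER BY` column.
#[derive(Debug, Clone, PartialEq)]
pub struct FileStats {
    pub name: String,
    pub min: i64,
    pub max: i64,
}

/// Orders files so those most likely to contain the largest values come
/// first. Ties on `max` are broken by the larger `min`.
pub fn order_for_desc_scan(mut files: Vec<FileStats>) -> Vec<FileStats> {
    files.sort_by(|a, b| b.max.cmp(&a.max).then_with(|| b.min.cmp(&a.min)));
    files
}
```

Scanning in this order lets the TopK bound rise quickly, so later files are more likely to be pruned from their statistics alone.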

Acknowledgements¶

Thank you to Pydantic and InfluxData for supporting our work on DataFusion and open source in general. Thank you to zhuqi-lucas, xudong963, Dandandan, and LiaCastaneda, for helping with the dynamic join filter implementation and testing. Thank you to nuno-faria for providing join performance results and djanderson for their helpful review comments.

About the Authors¶

Adrian Garcia Badaracco is a Founding Engineer at Pydantic, and an Apache DataFusion committer.

Andrew Lamb is a Staff Engineer at InfluxData, and a member of the Apache DataFusion and Apache Arrow PMCs. He has been working on databases and related systems for more than 20 years.

About DataFusion¶

Apache DataFusion is an extensible query engine toolkit, written in Rust, that uses Apache Arrow as its in-memory format. DataFusion and similar technology are part of the next generation “Deconstructed Database” architectures, where new systems are built on a foundation of fast, modular components, rather than as a single tightly integrated system.

The DataFusion community is always looking for new contributors to help improve the project. If you are interested in learning more about how query execution works, help document or improve the DataFusion codebase, or just try it out, we would love for you to join us.

Footnotes¶

¹ Dynamic Filters (DF) refers to the optimization described in this blog post. The TopK operator will generate a filter that is applied to the scan operators, which will first be used to skip rows and then as we open new files (if there are more to open) it will be used to skip entire files that do not match the filter.

² Late Materialization (LM) refers to the optimization described in this blog post. Late Materialization is particularly effective when combined with dynamic filters as it can apply filters during a scan. Without late materialization, dynamic filters can only be used to prune row groups or entire files, which will be less effective if the files themselves are large or the top values are not in the first few files read.

Appendix¶

Queries and Data¶

Figure 1: ClickBench Q23¶

-- Data was downloaded using apache/datafusion -> benchmarks/bench.sh -> ./benchmarks/bench.sh data clickbench_partitioned
create external table hits stored as parquet location 'benchmarks/data/hits_partitioned';

-- Must set for ClickBench hits_partitioned dataset. See https://github.com/apache/datafusion/issues/16591
set datafusion.execution.parquet.binary_as_string = true;
-- Only matters if pushdown_filters is enabled but they don't get enabled together sadly
set datafusion.execution.parquet.reorder_filters = true;

set datafusion.execution.target_partitions = 1;  -- or set to 12 to use multiple cores
set datafusion.optimizer.enable_dynamic_filter_pushdown = false;
set datafusion.execution.parquet.pushdown_filters = false;

explain analyze
SELECT *
FROM hits
WHERE "URL" LIKE '%google%'
ORDER BY "EventTime"
LIMIT 10;

dynamic filters	late materialization	cores	time (s)
False	False	1	32.039
False	True	1	16.903
True	False	1	18.195
True	True	1	1.42
False	False	12	5.04
False	True	12	2.37
True	False	12	5.055
True	True	12	0.602

Using External Indexes, Metadata Stores, Catalogs and Caches to Accelerate Queries on Apache Parquet

2025-08-15T00:00:00+00:00

It is a common misconception that Apache Parquet requires (slow) reparsing of metadata and is limited to indexing structures provided by the format. In fact, caching parsed metadata and using custom external indexes along with Parquet's hierarchical data organization can significantly speed up query processing.

In this blog, I describe the role of external indexes, caches, and metadata stores in high performance systems, and demonstrate how to apply these concepts to Parquet processing using Apache DataFusion. Note this is an expanded version of the companion video and presentation.

Motivation¶

System designers choose between a pre-configured data system or the often daunting task of building their own custom data platform from scratch. For many users and use cases, one of the existing data systems will likely be good enough. However, traditional systems such as Apache Spark, DuckDB, ClickHouse, Hive, or Snowflake are each optimized for a certain set of tradeoffs between performance, cost, availability, interoperability, deployment target, cloud / on-premises, operational ease and many other factors.

For new, or especially demanding use cases, where no existing system makes your optimal tradeoffs, you can build your own custom data platform. Previously this was a long and expensive endeavor, but today, in the era of Composable Data Systems, it is increasingly feasible. High quality, open source building blocks such as Apache Parquet for storage, Apache Arrow for in-memory processing, and Apache DataFusion for query execution make it possible to quickly build custom data platforms optimized for your specific needs¹.

Introduction to External Indexes / Catalogs / Metadata Stores / Caches¶

Figure 1: Using external indexes to speed up queries in an analytic system. Given a user's query (Step 1), the system uses an external index (one that is not stored inline in the data files) to quickly find files that may contain relevant data (Step 2). Then, for each file, the system uses the external index to further narrow the required data to only those parts of each file (e.g. data pages) that are relevant (Step 3). Finally, the system reads only those parts of the file and returns the results to the user (Step 4).

In this blog, I use the term "index" to mean any structure that helps locate relevant data during processing. Figure 1 shows a high level overview of how external indexes are used to speed up queries.

All data systems typically store both the data itself and additional information (metadata) to more quickly find data relevant to a query. Metadata is often stored in structures with names like "index," "catalog" and "cache" and the terminology varies widely across systems.

There are many different types of indexes, types of content stored in indexes, strategies to keep indexes up to date, and ways to apply indexes during query processing. These differences each have their own set of tradeoffs, and thus different systems understandably make different choices depending on their use case. There is no one-size-fits-all solution for indexing. For example, Hive uses the Hive Metastore, Vertica uses a purpose-built Catalog, and open data lake systems typically use a table format such as Apache Iceberg or Delta Lake.

External Indexes store information separately ("external") to the data itself. External indexes are flexible and widely used, but require additional operational overhead to keep in sync with the data files. For example, if you add a new Parquet file to your data lake, you must also update the relevant external index to include information about the new file. Note, you can avoid the operational overhead of external indexes by using only the data files themselves, including Embedding User-Defined Indexes in Apache Parquet Files. However, this approach comes with its own set of tradeoffs such as increased file sizes and the need to update the data files to update the index.

Examples of information commonly stored in external indexes include:

Min/Max statistics
Bloom filters
Inverted indexes / Full Text indexes
Information needed to read the remote file (e.g. the schema, or Parquet footer metadata)

Examples of locations where external indexes can be stored include:

Separate files such as JSON or Parquet files.
Transactional databases such as PostgreSQL tables.
Distributed key-value stores such as Redis or Cassandra.
Local memory such as an in-memory hash map.

Using Apache Parquet for Storage¶

While the rest of this blog focuses on building custom external indexes using Parquet and DataFusion, I first briefly discuss why Parquet is a good choice for modern analytic systems. The research community frequently confuses limitations of a particular implementation of the Parquet format with the Parquet Format itself, and this confusion often obscures capabilities that make Parquet a good target for external indexes.

Apache Parquet's combination of good compression, high performance, high quality open source libraries, and wide ecosystem interoperability makes it a compelling choice when building new systems. While there are some niche use cases that may benefit from specialized formats, Parquet is typically the obvious choice. While recent proprietary file formats differ in details, they all use the same high level structure² as Parquet:

Metadata (typically at the end of the file)
Data divided into columns and then into horizontal slices (e.g. Parquet Row Groups and/or Data Pages).

The structure is so widespread because it enables the hierarchical pruning approach described in the next section. For example, the native ClickHouse MergeTree format consists of Parts (similar to Parquet files), and Granules (similar to Row Groups). The ClickHouse indexing strategy follows a classic hierarchical pruning approach that first locates the Parts and then the Granules that may contain relevant data for the query. This is exactly the same pattern as Parquet based systems, which first locate the relevant Parquet files and then the Row Groups / Data Pages within those files.

A common criticism of using Parquet is that it is not as performant as some new proposal. These criticisms typically cherry-pick a few queries and/or datasets and build a specialized index or data layout for that specific case. However, as I explain in the companion video of this blog, even for ClickBench⁶, the current benchmaxxing³ target of analytics vendors, there is less than a factor of two difference in performance between custom file formats and Parquet. The difference becomes even lower when using Parquet files that use the full range of existing Parquet features such as Column and Offset Indexes and Bloom Filters⁷. Compared to the low interoperability and expensive transcoding/loading step of alternate file formats, Parquet is hard to beat.

Hierarchical Pruning Overview¶

The key technique for optimizing query processing systems is skipping as much data as possible, as quickly as possible. Analytic systems typically use a hierarchical approach to progressively narrow the set of data that needs to be processed. The standard approach is shown in Figure 2:

Entire files are ruled out
Within each file, large sections (e.g. Row Groups) are ruled out
(Optionally) smaller sections (e.g. Data Pages) are ruled out
Finally, the system reads only the relevant data pages and applies the query predicate to the data

Figure 2: Hierarchical Pruning: The system first rules out files, then Row Groups, then Data Pages, and finally reads only the relevant data pages.

The process is hierarchical because the per-row computation required at the earlier stages (e.g. skipping an entire file) is lower than the computation required at later stages (apply predicates to the data). As mentioned before, while the details of what metadata is used and how that metadata is managed varies substantially across query systems, they almost all use a hierarchical pruning strategy.
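
The pattern can be sketched in plain Rust for a single predicate of the form col > bound. All names here (MinMax, RowGroup, DataFile, hierarchical_scan) are hypothetical, and the statistics are assumed to be available without decoding any rows.

```rust
/// Min/max statistics for one column over some span of rows.
#[derive(Clone, Copy)]
pub struct MinMax {
    pub min: i64,
    pub max: i64,
}

impl MinMax {
    /// Could any value in this span satisfy `col > bound`?
    pub fn may_match_gt(&self, bound: i64) -> bool {
        self.max > bound
    }
}

pub struct RowGroup {
    pub stats: MinMax,
    pub values: Vec<i64>,
}

pub struct DataFile {
    pub stats: MinMax,
    pub row_groups: Vec<RowGroup>,
}

/// Evaluates `col > bound`, pruning whole files first, then row groups, and
/// only then applying the predicate to individual values.
/// Returns the matching values and the number of row groups decoded.
pub fn hierarchical_scan(files: &[DataFile], bound: i64) -> (Vec<i64>, usize) {
    let mut out = Vec::new();
    let mut decoded = 0;
    // Stage 1: one comparison rules out an entire file.
    for file in files.iter().filter(|f| f.stats.may_match_gt(bound)) {
        // Stage 2: one comparison rules out an entire row group.
        for rg in file.row_groups.iter().filter(|rg| rg.stats.may_match_gt(bound)) {
            decoded += 1;
            // Stage 3: per-row work only for data that survived pruning.
            out.extend(rg.values.iter().copied().filter(|v| *v > bound));
        }
    }
    (out, decoded)
}
```

Each earlier stage costs a single comparison per file or row group, which is why it pays to prune at the coarsest level first.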

Apache Parquet Overview¶

This section provides a brief background on the organization of Apache Parquet files which is needed to fully understand the sections on implementing external indexes. If you are already familiar with Parquet, you can skip this section.

Logically, Parquet files are organized into Row Groups and Column Chunks as shown below.

Figure 3: Logical Parquet File Layout: Data is first divided into horizontal slices called Row Groups. The data is then stored column by column in Column Chunks. This arrangement allows efficient access to only the portions of columns needed for a query.

Physically, Parquet data is stored as a series of Data Pages along with metadata stored at the end of the file (in the footer), as shown below.

Figure 4: Physical Parquet File Layout: A typical Parquet file is composed of many data pages, which contain the raw encoded data, and a footer that stores metadata about the file, including the schema and the location of the relevant data pages, and optional statistics such as min/max values for each Column Chunk.

Parquet files are organized to minimize IO and processing using two key mechanisms:

Projection Pushdown: if a query needs only a subset of columns from a table, it only needs to read the pages for the relevant Column Chunks
Filter Pushdown: Similarly, given a query with a filter predicate such as WHERE C > 25, query engines can use statistics such as (but not limited to) the min/max values stored in the metadata to skip reading and decoding pages that cannot possibly match the predicate.

The high level mechanics of Parquet predicate pushdown are shown below:

Figure 5: Filter Pushdown in Parquet: query engines use the predicate, C > 25, from the query along with statistics from the metadata, to identify pages that may match the predicate which are read for further processing. Please refer to the Efficient Filter Pushdown blog for more details. NOTE the exact same pattern can be applied using information from external indexes, as described in the next sections.

Pruning Files with External Indexes¶

The first step in hierarchical pruning is quickly ruling out files that cannot match the query. For example, if a system expects to see queries that apply to a time range, it might create an external index to store the minimum and maximum time values for each file. Then, during query processing, the system can quickly rule out files that cannot possibly contain relevant data.

For example, if the user issues a query that only matches the last 7 days of data:

WHERE time > now() - interval '7 days'

then the index can quickly rule out files that only have data older than the most recent 7 days.

Figure 6: Step 1: File Pruning. Given a query predicate, systems use external indexes to quickly rule out files that cannot match the query. In this case, by consulting the index all but two files can be ruled out.
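
A minimal in-memory version of such an index might look like the sketch below. TimeRangeIndex, add_file, and files_after are hypothetical names, and time is represented as a plain i64 (for example a day number or epoch seconds) to keep the example self-contained.

```rust
use std::collections::BTreeMap;

/// Hypothetical external index: file name -> (min_time, max_time).
pub struct TimeRangeIndex {
    ranges: BTreeMap<String, (i64, i64)>,
}

impl TimeRangeIndex {
    pub fn new() -> Self {
        TimeRangeIndex { ranges: BTreeMap::new() }
    }

    /// Must be called whenever a new data file is added, which is the
    /// operational cost of keeping an external index in sync.
    pub fn add_file(&mut self, name: &str, min_time: i64, max_time: i64) {
        self.ranges.insert(name.to_string(), (min_time, max_time));
    }

    /// Files that may contain rows matching `time > cutoff`.
    /// Files whose newest row is at or before the cutoff are ruled out.
    pub fn files_after(&self, cutoff: i64) -> Vec<&str> {
        let mut candidates = Vec::new();
        for (name, &(_min_time, max_time)) in &self.ranges {
            if max_time > cutoff {
                candidates.push(name.as_str());
            }
        }
        candidates
    }
}
```

The lookup touches only the index, so files that are ruled out are never opened and their footers are never fetched.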

External indexes offer much faster lookups and lower I/O overhead than Parquet's built-in file-level indexes by skipping further processing for many data files. Without an external index, systems typically fall back to reading each file's footer to find files needed for further processing. Skipping per-file processing is especially important when reading from remote object stores such as S3, GCS or Azure Blob Store, where each request adds tens to hundreds of milliseconds of latency.

There are many different systems that use external indexes to find files such as Hive Metastore, Iceberg, Delta Lake, DuckLake, and Hive Style Partitioning⁴. Of course, each of these systems works well for its intended use cases, but if none meets your needs, or you want to experiment with different strategies, you can easily build your own external index using DataFusion.

Pruning Files with External Indexes Using DataFusion¶

To implement file pruning in DataFusion, you implement a custom TableProvider with the supports_filters_pushdown and scan methods. The supports_filters_pushdown method tells DataFusion which predicates can be used and the scan method uses those predicates with the external index to find the files that may contain data that matches the query.

The DataFusion repository contains a fully working and well-commented example, parquet_index.rs, of this technique that you can use as a starting point. The example creates a simple index that stores the min/max values for a column called value along with the file name. Then it runs the following query:

SELECT file_name, value FROM index_table WHERE value = 150

The custom IndexTableProvider's scan method uses the index to find files that may contain data matching the predicate as shown below:

impl TableProvider for IndexTableProvider {
    async fn scan(
        &self,
        state: &dyn Session,
        projection: Option<&Vec<usize>>,
        filters: &[Expr],
        limit: Option<usize>,
    ) -> Result<Arc<dyn ExecutionPlan>> {
        let df_schema = DFSchema::try_from(self.schema())?;
        // Combine all the filters into a single ANDed predicate
        let predicate = conjunction(filters.to_vec());

        // Use the index to find the files that might have data that matches the
        // predicate. Any file that can not have data that matches the predicate
        // will not be returned.
        let files = self.index.get_files(predicate.clone())?;

        let object_store_url = ObjectStoreUrl::parse("file://")?;
        let source = Arc::new(ParquetSource::default().with_predicate(predicate));
        let mut file_scan_config_builder =
            FileScanConfigBuilder::new(object_store_url, self.schema(), source)
                .with_projection(projection.cloned())
                .with_limit(limit);

        // Add the files to the scan config
        for file in files {
            file_scan_config_builder = file_scan_config_builder.with_file(
                PartitionedFile::new(file.path(), file.size()),
            );
        }
        Ok(DataSourceExec::from_data_source(
            file_scan_config_builder.build(),
        ))
    }
    ...
}

DataFusion handles the details of pushing down the filters to the TableProvider and the mechanics of reading the Parquet files, so you can focus on the system-specific details such as building, storing, and applying the index. While this example uses a standard min/max index, you can implement any indexing strategy you need, such as bloom filters, a full text index, or a more complex multidimensional index.

DataFusion also includes several libraries to help with common filtering and pruning tasks, such as:

A full and well-documented expression representation (Expr) and APIs for building, visiting, and rewriting query predicates.
Range Based Pruning (PruningPredicate) for cases where your index stores min/max values.
Expression simplification (ExprSimplifier) for simplifying predicates before applying them to the index.
Range analysis for predicates (cp_solver) for interval-based range analysis (e.g. col > 5 AND col < 10).
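
To make range-based pruning concrete, the following standalone sketch evaluates a tiny predicate language against min/max statistics. It does not use DataFusion's PruningPredicate API; `MinMax`, `Pred`, and `may_match` are hypothetical names invented for this illustration.

```rust
/// Min/max statistics for one file (or row group) of an integer column.
#[derive(Debug, Clone, Copy)]
struct MinMax {
    min: i64,
    max: i64,
}

/// A tiny predicate language over a single column, for illustration only.
#[derive(Debug, Clone)]
enum Pred {
    Eq(i64),
    Gt(i64),
    Lt(i64),
    And(Box<Pred>, Box<Pred>),
}

/// Returns false only when the statistics prove that no row can match.
/// Returning true means "might match", so the container must still be read.
fn may_match(pred: &Pred, stats: MinMax) -> bool {
    match pred {
        Pred::Eq(v) => stats.min <= *v && *v <= stats.max,
        Pred::Gt(v) => stats.max > *v,
        Pred::Lt(v) => stats.min < *v,
        Pred::And(l, r) => may_match(l, stats) && may_match(r, stats),
    }
}
```

For the query above (`value = 150`), a file with values in the range 0 to 100 can be skipped, while a file with values in the range 100 to 200 must still be scanned. The check is deliberately conservative: it never rules out a file that could contain a match.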

Pruning Parts of Parquet Files with External Indexes¶

Once the set of files to be scanned has been determined, the next step in the hierarchical pruning process is to further narrow down the data within each file. Similarly to the previous step, almost all advanced query processing systems use additional metadata to prune unnecessary parts of the file, such as Data Skipping Indexes in ClickHouse.

For Parquet-based systems, the most common strategy is using the built-in metadata such as min/max statistics and Bloom Filters. However, it is also possible to use external indexes for filtering WITHIN Parquet files as shown below.

Figure 7: Step 2: Pruning Parquet Row Groups and Data Pages. Given a query predicate, systems can use external indexes / metadata stores as well as Parquet's built-in structures to quickly rule out Row Groups and Data Pages that cannot match the query. In this case, the index has ruled out all but three data pages which must then be fetched for more processing.

Pruning Parts of Parquet Files with External Indexes using DataFusion¶

To implement pruning within Parquet files, you use the same TableProvider APIs as for pruning files. For each file your provider wants to scan, you provide an additional ParquetAccessPlan that tells DataFusion what parts of the file to read. This plan is then further refined by the DataFusion Parquet reader using the built-in Parquet metadata to potentially prune additional row groups and data pages during query execution. You can find a full working example in the advanced_parquet_index.rs example of the DataFusion repository.

Here is how you build a ParquetAccessPlan to scan only specific row groups and rows within those row groups.

// Default to scan all (4) row groups
let mut access_plan = ParquetAccessPlan::new_all(4);
access_plan.skip(0); // skip row group 0
// Specify scanning rows 100-200 and 350-400
// in row group 1 that has 1000 rows
let row_selection = RowSelection::from(vec![
   RowSelector::skip(100),
   RowSelector::select(100),
   RowSelector::skip(150),
   RowSelector::select(50),
   RowSelector::skip(600),  // skip last 600 rows
]);
access_plan.scan_selection(1, row_selection);
access_plan.skip(2); // skip row group 2
// all of row group 3 is scanned by default

The rows that are selected by the resulting plan look like this:

┌───────────────────┐
│                   │
│                   │  SKIP
│                   │
└───────────────────┘
     Row Group 0

┌───────────────────┐
│ ┌───────────────┐ │  SCAN ONLY ROWS
│ └───────────────┘ │  100-200
│ ┌───────────────┐ │  350-400
│ └───────────────┘ │
└───────────────────┘
     Row Group 1

┌───────────────────┐
│                   │
│                   │  SKIP
│                   │
└───────────────────┘
     Row Group 2

┌───────────────────┐
│                   │
│                   │  SCAN ALL ROWS
│                   │
└───────────────────┘
     Row Group 3
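
To see how a sequence of skip/select runs maps to absolute row positions, here is a small standalone sketch. It models the selectors with a hypothetical `Selector` enum rather than the real RowSelector type, and `selected_ranges` is an illustrative helper, not a library function.

```rust
use std::ops::Range;

/// Simplified stand-in for a skip/select run within one row group.
#[derive(Debug, Clone, Copy)]
enum Selector {
    Skip(usize),
    Select(usize),
}

/// Convert consecutive runs into half-open ranges of selected row offsets.
fn selected_ranges(selectors: &[Selector]) -> Vec<Range<usize>> {
    let mut ranges = Vec::new();
    let mut offset = 0; // position of the next row in the row group
    for s in selectors {
        match *s {
            Selector::Skip(n) => offset += n,
            Selector::Select(n) => {
                ranges.push(offset..offset + n);
                offset += n;
            }
        }
    }
    ranges
}
```

Applied to the five selectors used for row group 1 above, this yields the half-open ranges 100..200 and 350..400, and the run lengths add up to the 1000 rows in the row group.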

In the scan method, you return an ExecutionPlan that includes the ParquetAccessPlan for each file as shown below (again, slightly simplified for clarity):

impl TableProvider for IndexTableProvider {
    async fn scan(
        &self,
        state: &dyn Session,
        projection: Option<&Vec<usize>>,
        filters: &[Expr],
        limit: Option<usize>,
    ) -> Result<Arc<dyn ExecutionPlan>> {
        let indexed_file = &self.indexed_file;
        let predicate = self.filters_to_predicate(state, filters)?;

        // Use the external index to create a starting ParquetAccessPlan
        // that determines which row groups to scan based on the predicate
        let access_plan = self.create_plan(&predicate)?;

        let partitioned_file = indexed_file
            .partitioned_file()
            // provide the access plan to the DataSourceExec by
            // storing it as  "extensions" on PartitionedFile
            .with_extensions(Arc::new(access_plan) as _);

        let file_source = Arc::new(
            ParquetSource::default()
                // provide the predicate to the standard DataFusion source as well so
                // DataFusion's Parquet reader will apply row group pruning based on
                // the built-in Parquet metadata (min/max, bloom filters, etc) as well
                .with_predicate(predicate)
        );
        let file_scan_config =
            FileScanConfigBuilder::new(object_store_url, schema, file_source)
                .with_limit(limit)
                .with_projection(projection.cloned())
                .with_file(partitioned_file)
                .build();

        // Finally, put it all together into a DataSourceExec
        Ok(DataSourceExec::from_data_source(file_scan_config))
    }
    ...
}

Caching Parquet Metadata¶

It is often said that Parquet is unsuitable for low latency query systems because the footer must be read and parsed for each query. This is simply not true, and many systems use Parquet for low latency analytics and cache the parsed metadata in memory to avoid re-reading and re-parsing the footer for each query.

Caching Parquet Metadata using DataFusion¶

Reusing cached Parquet Metadata is also shown in the advanced_parquet_index.rs example. The example reads and caches the metadata for each file when the index is first built and then uses the cached metadata when reading the files during query execution.
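
The caching idea itself is independent of DataFusion and can be sketched in a few lines. In the sketch below, `ParsedMetadata` is a hypothetical stand-in for parsed footer metadata and `MetadataCache` is an illustrative name; the real example stores the actual Parquet metadata type.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Stand-in for parsed Parquet footer metadata (hypothetical).
#[derive(Debug, PartialEq)]
struct ParsedMetadata {
    num_row_groups: usize,
}

/// Cache of parsed metadata keyed by file name.
#[derive(Default)]
struct MetadataCache {
    entries: HashMap<String, Arc<ParsedMetadata>>,
    parse_count: usize, // how many times the expensive parse actually ran
}

impl MetadataCache {
    /// Return cached metadata, parsing the footer only on the first request.
    fn get_or_parse(
        &mut self,
        file_name: &str,
        parse: impl FnOnce() -> ParsedMetadata,
    ) -> Arc<ParsedMetadata> {
        if let Some(cached) = self.entries.get(file_name) {
            return Arc::clone(cached);
        }
        self.parse_count += 1;
        let parsed = Arc::new(parse());
        self.entries.insert(file_name.to_string(), Arc::clone(&parsed));
        parsed
    }
}
```

Sharing the metadata behind an `Arc` means every query gets a cheap pointer copy instead of a fresh parse. A production cache would also need an eviction and invalidation policy, which this sketch omits.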

(Note that thanks to Nuno Faria, Jonathan Chen, and Shehab Amin, the built-in ListingTable TableProvider included with DataFusion will cache Parquet metadata in the next release of DataFusion (50.0.0). See the mini epic for details.)

To avoid reparsing the metadata, first implement a custom ParquetFileReaderFactory as shown below, again slightly simplified for clarity:

impl ParquetFileReaderFactory for CachedParquetFileReaderFactory {
    fn create_reader(
        &self,
        _partition_index: usize,
        file_meta: FileMeta,
        metadata_size_hint: Option<usize>,
        _metrics: &ExecutionPlanMetricsSet,
    ) -> Result<Box<dyn AsyncFileReader + Send>> {
        let filename = file_meta.location();

        // Pass along the information to access the underlying storage
        // (e.g. S3, GCS, local filesystem, etc)
        let object_store = Arc::clone(&self.object_store);
        let mut inner =
            ParquetObjectReader::new(object_store, file_meta.object_meta.location)
                .with_file_size(file_meta.object_meta.size);

        // retrieve the pre-parsed metadata from the cache
        // (which was built when the index was built and is kept in memory)
        let metadata = self
            .metadata
            .get(&filename)
            // `expect` does not interpolate `{filename}`, so format the message explicitly
            .unwrap_or_else(|| panic!("metadata for file not found: {filename}"));

        // Return a ParquetReader that uses the cached metadata
        Ok(Box::new(ParquetReaderWithCache {
            filename,
            metadata: Arc::clone(metadata),
            inner,
        }))
    }
}

Then, in your TableProvider use the factory to avoid re-reading the metadata for each file:

impl TableProvider for IndexTableProvider {
    async fn scan(
        &self,
        state: &dyn Session,
        projection: Option<&Vec<usize>>,
        filters: &[Expr],
        limit: Option<usize>,
    ) -> Result<Arc<dyn ExecutionPlan>> {
        // Configure a factory interface to avoid re-reading the metadata for each file
        let reader_factory =
            CachedParquetFileReaderFactory::new(Arc::clone(&self.object_store))
                .with_file(indexed_file);

        // build the partitioned file (see example above for details)
        let partitioned_file = ...; 

        // Create the ParquetSource with the predicate and the factory
        let file_source = Arc::new(
            ParquetSource::default()
                // provide the factory to create Parquet reader without re-reading metadata
                .with_parquet_file_reader_factory(Arc::new(reader_factory)),
        );

        // Pass along the information needed to read the files
        let file_scan_config =
            FileScanConfigBuilder::new(object_store_url, schema, file_source)
                .with_limit(limit)
                .with_projection(projection.cloned())
                .with_file(partitioned_file)
                .build();

        // Finally, put it all together into a DataSourceExec
        Ok(DataSourceExec::from_data_source(file_scan_config))
    }
    ...
}

Conclusion¶

Parquet has the right structure for high performance analytics via hierarchical pruning, and it is straightforward to build external indexes to speed up queries using DataFusion without changing the file format. If you need to build a custom data platform, it has never been easier to build it with Parquet and DataFusion.

I am a firm believer that data systems of the future will be built on a foundation of modular, high quality, open source components such as Parquet, Arrow, and DataFusion. We should focus our efforts as a community on improving these components rather than building new file formats that are optimized for narrow use cases.

Come Join Us! 🎣

About the Author¶

Andrew Lamb is a Staff Engineer at InfluxData, and a member of the Apache DataFusion and Apache Arrow PMCs. He has been working on databases and related systems for more than 20 years.

About DataFusion¶

Acknowledgements¶

Thank you to Qi Zhu, Adam Reeve, Jigao Luo, Oleks V, Shehab Amin, Nuno Faria and Bruce Ritchie for their insightful feedback on this blog post.

Footnotes¶

1: This trend is described in more detail in the FDAP Stack blog

2: This layout is referred to as PAX in the database literature after the first research paper to describe the technique.

3: Benchmaxxing (verb): to add specific optimizations that only impact benchmark results and are not widely applicable to real world use cases.

4: Hive Style Partitioning is a simple and widely used form of indexing based on directory paths, where the directory structure is used to store information about the data in the files. For example, a directory structure like year=2025/month=08/day=15/ can be used to store data for a specific day and the system can quickly rule out directories that do not match the query predicate.

5: I am also convinced that we can speed up the process of parsing the Parquet footer with additional engineering effort (see Xiangpeng Hao's previous blog on the topic). Ed Seidl is beginning this effort. See the ticket for details.

6: ClickBench includes a wide variety of query patterns such as point lookups, filters of different selectivity, and aggregations.

7: For example, Qi Zhu was able to speed up reads by over 2x simply by rewriting the Parquet files with Offset Indexes and no compression (see issue #16149 comment for details). There is likely significant additional performance available by using Bloom Filters and re-sorting the data so that it is clustered in a more optimal way for the queries.

Apache DataFusion 49.0.0 Released

2025-07-28T00:00:00+00:00

Introduction¶

We are proud to announce the release of DataFusion 49.0.0. This blog post highlights some of the major improvements since the release of DataFusion 48.0.0. The complete list of changes is available in the changelog.

Performance Improvements 🚀¶

DataFusion continues to focus on enhancing performance, as shown in the ClickBench and other results.

Figure 1: ClickBench performance improvements over time. Average and median normalized query execution times for ClickBench queries for each git revision. Query times are normalized using the ClickBench definition. Data and definitions on the DataFusion Benchmarking Page.

Here are some noteworthy optimizations added since DataFusion 48:

Equivalence system upgrade: The lower levels of the equivalence system, which is used to implement the optimizations described in Using Ordering for Better Plans, were rewritten, leading to much faster planning times, especially for queries with a large number of columns. This change also prepares the way for more sophisticated sort-based optimizations in the future. (PR #16217 by ozankabak).

Dynamic Filters and TopK pushdown

DataFusion now supports dynamic filters, which are improved during query execution, and physical filter pushdown. Together, these features improve the performance of queries that use LIMIT and ORDER BY clauses, such as the following:

SELECT *
FROM data
ORDER BY timestamp DESC
LIMIT 10

While the query above is simple, without dynamic filtering or knowing that the data is already sorted by timestamp, a query engine must decode all of the data to find the top 10 values. With the dynamic filters system, DataFusion applies an increasingly selective filter during query execution. It checks the current top 10 values of the timestamp column before opening files or reading Parquet Row Groups and Data Pages, which can skip older data very quickly.

Dynamic predicates are a common feature of advanced engines such as Dynamic Filters in Starburst and Top-K Aggregation Optimization at Snowflake. The technique drastically improves query performance (we've seen over a 1.5x improvement for some TPC-H-style queries), especially in combination with late materialization and columnar file formats such as Parquet. We plan to write a blog post explaining the details of this optimization in the future, and we expect to use the same mechanism to implement additional optimizations such as Sideways Information Passing for joins (Issue #15037 PR #15770 by adriangb).
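
The core of the technique can be sketched without any query engine. The following standalone example is not DataFusion's implementation: `Chunk` is a hypothetical stand-in for a file, row group, or data page whose maximum timestamp is known from metadata, and `top_k_desc` is an illustrative name.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// A unit of data whose maximum timestamp is known without decoding it.
struct Chunk {
    max_timestamp: i64,
    timestamps: Vec<i64>,
}

/// Returns the `k` largest timestamps (descending) and the number of chunks
/// that were skipped without being decoded.
fn top_k_desc(chunks: &[Chunk], k: usize) -> (Vec<i64>, usize) {
    // Min-heap of the current top-k values. Its smallest element is the
    // threshold that any new value must beat.
    let mut heap: BinaryHeap<Reverse<i64>> = BinaryHeap::new();
    let mut skipped = 0;
    for chunk in chunks {
        // Dynamic filter: once k values are held, a chunk whose maximum does
        // not beat the current threshold cannot change the result.
        if heap.len() == k {
            let threshold = heap.peek().map(|r| r.0);
            if threshold.map_or(false, |t| chunk.max_timestamp <= t) {
                skipped += 1;
                continue;
            }
        }
        for &ts in &chunk.timestamps {
            heap.push(Reverse(ts));
            if heap.len() > k {
                heap.pop(); // drop the smallest, tightening the threshold
            }
        }
    }
    let mut result: Vec<i64> = heap.into_iter().map(|r| r.0).collect();
    result.sort_unstable_by(|a, b| b.cmp(a));
    (result, skipped)
}
```

The threshold only ever increases as more data is processed, which is why the filter becomes more selective during execution.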

Community Growth 📈¶

The last few months, between 46.0.0 and 49.0.0, have seen our community grow:

New PMC members and committers: berkay, xudong963 and timsaucer joined the PMC. blaginin, milenkovicm, adriangb and kosiew joined as committers. See the mailing list for more details.
In the core DataFusion repo alone, we reviewed and accepted over 850 PRs from 172 different contributors, created over 669 issues, and closed 379 of them 🚀. All changes are listed in the detailed changelogs.
DataFusion published a number of blog posts, including User defined Window Functions, Optimizing SQL (and DataFrames) in DataFusion part 1, part 2, Using Rust async for Query Execution and Cancelling Long-Running Queries, and Embedding User-Defined Indexes in Apache Parquet Files.

New Features ✨¶

Async User-Defined Functions¶

It is now possible to write async User-Defined Functions (UDFs) in DataFusion that perform asynchronous operations, such as network requests or database queries, without blocking the execution of the query. This enables new use cases, such as integrating with large language models (LLMs) or other external services, and we can't wait to see what the community builds with it.

See the documentation for more details and the async UDF example for working code.

You could, for example, implement a function ask_llm that asks a large language model (LLM) service a question based on the content of two columns.

SELECT * 
FROM animal a
WHERE ask_llm(a.name, 'Is this animal furry?')

The implementation of an async UDF is almost identical to a normal UDF, except that it must implement the AsyncScalarUDFImpl trait in addition to ScalarUDFImpl and provide an async implementation via invoke_async_with_args:

#[derive(Debug)]
struct AskLLM {
    signature: Signature,
}

#[async_trait]
impl AsyncScalarUDFImpl for AskLLM {
    /// The `invoke_async_with_args` method is similar to `invoke_with_args`,
    /// but it returns a `Future` that resolves to the result.
    ///
    /// Since this signature is `async`, it can do any `async` operations, such
    /// as network requests.
    async fn invoke_async_with_args(
        &self,
        args: ScalarFunctionArgs,
        options: &ConfigOptions,
    ) -> Result<ArrayRef> {
        // Converts the arguments to arrays for simplicity.
        let args = ColumnarValue::values_to_arrays(&args.args)?;
        let [column_of_interest, question] = take_function_args(self.name(), args)?;
        let client = Client::new();

        // Make a network request to a hypothetical LLM service
        let res = client
            .post(URI)
            .headers(get_llm_headers(options))
            .json(&req)
            .send()
            .await?
            .json::<LLMResponse>()
            .await?;

        let results = extract_results_from_llm_response(&res);

        Ok(Arc::new(results))
    }
}

(Issue #6518, PR #14837 from goldmedal 🏆)

Better Cancellation for Certain Long-Running Queries¶

In rare cases, it was previously not possible to cancel long-running queries, leading to unresponsiveness. Other projects would likely have fixed this issue by treating the symptom, but pepijnve and the DataFusion community worked together to treat the root cause. The general solution required a deep understanding of the DataFusion execution engine, Rust Streams, and the tokio cooperative scheduling model. The resulting PR is a model of careful community engineering and a great example of using Rust's async ecosystem to implement complex functionality. It even resulted in a contribution upstream to tokio (since accepted). See the blog post for more details.

Metadata for User Defined Types such as `Variant` and `Geometry`¶

User-defined types have been a long-requested feature, and this release provides the low-level APIs to support them efficiently.

Metadata handling in PRs #15646 and #16170 from timsaucer
Pushdown of filters and expressions (see "Dynamic Filters and TopK pushdown" section above)

We still have some work to do to fully support user-defined types, specifically in documentation and testing, and we would love your help in this area. If you are interested in contributing, please see issue #12644.

Parquet Modular Encryption¶

DataFusion now supports reading and writing encrypted Apache Parquet files with modular encryption. This allows users to encrypt specific columns in a Parquet file using different keys, while still being able to read data without needing to decrypt the entire file.

Here is an example of how to configure DataFusion to read an encrypted Parquet table with two columns, double_field and float_field, using modular encryption:

CREATE EXTERNAL TABLE encrypted_parquet_table
(
double_field double,
float_field float
)
STORED AS PARQUET LOCATION 'pq/' OPTIONS (
    -- encryption
    'format.crypto.file_encryption.encrypt_footer' 'true',
    'format.crypto.file_encryption.footer_key_as_hex' '30313233343536373839303132333435',  -- b"0123456789012345"
    'format.crypto.file_encryption.column_key_as_hex::double_field' '31323334353637383930313233343530', -- b"1234567890123450"
    'format.crypto.file_encryption.column_key_as_hex::float_field' '31323334353637383930313233343531', -- b"1234567890123451"
    -- decryption
    'format.crypto.file_decryption.footer_key_as_hex' '30313233343536373839303132333435', -- b"0123456789012345"
    'format.crypto.file_decryption.column_key_as_hex::double_field' '31323334353637383930313233343530', -- b"1234567890123450"
    'format.crypto.file_decryption.column_key_as_hex::float_field' '31323334353637383930313233343531', -- b"1234567890123451"
);

(Issue #15216, PR #16351 from corwinjoy and adamreeve)

Support for `WITHIN GROUP` for Ordered-Set Aggregate Functions¶

DataFusion now supports the WITHIN GROUP clause for ordered-set aggregate functions such as approx_percentile_cont, percentile_cont, and percentile_disc, which allows users to specify the precise order.

For example, the following query computes the 50th percentile (median) of the temperature column in the city_data table. The WITHIN GROUP (ORDER BY ...) clause names the expression whose ordered values the percentile is taken from:

SELECT
    percentile_disc(0.5) WITHIN GROUP (ORDER BY temperature) AS median_temperature
FROM city_data;

(Issue #11732, PR #13511, by Garamda)
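
To illustrate what an ordered-set aggregate computes, here is a minimal standalone sketch of a discrete percentile following the usual SQL definition (the first value, in sort order, whose cumulative distribution reaches the requested fraction). It is not DataFusion's implementation, and the function name simply mirrors the SQL function for readability.

```rust
/// Discrete percentile: the first value, in sort order, whose cumulative
/// distribution is >= `p`. Returns None for empty input or invalid `p`.
fn percentile_disc(values: &[i64], p: f64) -> Option<i64> {
    if values.is_empty() || !(0.0..=1.0).contains(&p) {
        return None;
    }
    let mut sorted = values.to_vec();
    sorted.sort_unstable(); // plays the role of WITHIN GROUP (ORDER BY ...)
    // 1-based rank of the first row whose cumulative distribution is >= p
    let rank = ((p * sorted.len() as f64).ceil() as usize).max(1);
    Some(sorted[rank - 1])
}
```

Unlike an interpolating percentile, the result is always one of the input values, which is why the ordering expression determines both the order and the value returned.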

Compressed Spill Files¶

DataFusion now supports compressing the files written to disk when spilling larger-than-memory datasets while sorting and grouping. Using compression can significantly reduce the size of the intermediate files and improve performance when reading them back into memory.

(Issue #16130, PR #16268 by ding-young)

Support for `REGEXP_INSTR` function¶

DataFusion now supports the REGEXP_INSTR function, which returns the position of a regular expression match within a string.

For example, to find the position of the first match of the regular expression C(.)(..) in the string ABCDEF, you can use:

> SELECT regexp_instr('ABCDEF', 'C(.)(..)');
+---------------------------------------------------------------+
| regexp_instr(Utf8("ABCDEF"),Utf8("C(.)(..)"))                 |
+---------------------------------------------------------------+
| 3                                                             |
+---------------------------------------------------------------+

(Issue #13009, PR #15928 by nirnayroy)

Upgrade Guide and Changelog¶

Upgrading to 49.0.0 should be straightforward for most users. Please review the Upgrade Guide for details on breaking changes and code snippets to help with the transition. Recently, some users have reported success automatically upgrading DataFusion by pairing AI tools with the upgrade guide. For a comprehensive list of all changes, please refer to the changelog.

About DataFusion¶

DataFusion's core thesis is that as a community, together we can build much more advanced technology than any of us as individuals or companies could do alone. Without DataFusion, highly performant vectorized query engines would remain the domain of a few large companies and world-class research institutions. With DataFusion, we can all build on top of a shared foundation and focus on what makes our projects unique.

How to Get Involved¶

Apache DataFusion 48.0.0 Released

2025-07-16T00:00:00+00:00

We’re excited to announce the release of Apache DataFusion 48.0.0! As always, this version packs in a wide range of improvements and fixes. You can find the complete details in the full changelog. We’ll highlight the most important changes below and guide you through upgrading.

Breaking Changes¶

DataFusion 48.0.0 brings a few breaking changes that may require adjustments to your code as described in the Upgrade Guide. Here are the most notable ones:

datafusion.execution.collect_statistics defaults to true: In DataFusion 48.0.0, the default value of this configuration setting is now true, and DataFusion will collect and store statistics when a table is first created via CREATE EXTERNAL TABLE or one of the DataFrame::register_* APIs.
Expr::Literal has optional metadata: The Expr::Literal variant now includes optional metadata, which allows for carrying through Arrow field metadata to support extension types and other uses. This means code such as

match expr {
...
  Expr::Literal(scalar) => ...
...
}

should be updated to:

match expr {
...
  Expr::Literal(scalar, _metadata) => ...
...
}

Expr::WindowFunction is now Boxed: Expr::WindowFunction is now a Box<WindowFunction> instead of a WindowFunction directly. This change was made to reduce the size of Expr and improve performance when planning queries (see details on #16207).
UDFs changed to use FieldRef instead of DataType: To support metadata handling and prepare for extension types, UDF traits now use FieldRef rather than a DataType and nullability. FieldRef contains the type and nullability, and additionally allows access to metadata fields, which can be used for extension types.
Physical Expression return Field: Similarly to UDFs, in order to prepare for extension type support the PhysicalExpr trait has been changed to return Field rather than DataType. To upgrade structs which implement PhysicalExpr you need to implement the return_field function.
FileFormat::supports_filters_pushdown was replaced with FileSource::try_pushdown_filters to support upcoming work to push down dynamic filters and physical filter pushdown.
ParquetExec, AvroExec, CsvExec, JsonExec removed: ParquetExec, AvroExec, CsvExec, and JsonExec were deprecated in DataFusion 46 and are removed in DataFusion 48.

Performance Improvements¶

DataFusion 48.0.0 comes with some noteworthy performance enhancements:

Fewer unnecessary projections: DataFusion now removes additional unnecessary Projections in queries. (PRs #15787, #15761, and #15746 by xudong963).
Accelerated string functions: The ascii function was optimized to significantly improve its performance (PR #16087 by tlm365). The character_length function was optimized, resulting in up to a 3x performance improvement (PR #15931 by Dandandan).
Constant aggregate window expressions: For aggregate window functions with an unbounded frame, the result is the same for all rows within a partition. DataFusion 48.0.0 avoids unnecessary computation for such queries, improving performance by 5.6x (PR #16234 by suibianwanwank).
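
The constant aggregate window optimization can be pictured with a small standalone sketch. This is only an illustration of the idea, not DataFusion's code, and `sum_over_unbounded` is a hypothetical name: when the window frame covers the whole partition, the aggregate is computed once and repeated for every row.

```rust
/// Models SUM(value) OVER (PARTITION BY ...) with a frame that spans the
/// whole partition: every row receives the same result.
fn sum_over_unbounded(partition: &[i64]) -> Vec<i64> {
    let total: i64 = partition.iter().sum(); // computed once per partition
    vec![total; partition.len()] // broadcast to every row
}
```

A naive evaluation would recompute the aggregate for each row's frame, which is quadratic in the partition size, whereas this approach is linear.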

Highlighted New Features¶

New `datafusion-spark` crate¶

The DataFusion community has requested Apache Spark-compatible functions for many years, but the current built-in function library is most similar to PostgreSQL, which leads to friction. Unfortunately, there are even functions with the same name but different signatures and/or return types in the two systems.

One of the many uses of DataFusion is to enhance (e.g. Apache DataFusion Comet) or replace (e.g. Sail) Apache Spark. To support the community requests and the use cases mentioned above, we have introduced a new datafusion-spark crate for DataFusion with Spark-compatible functions so the community can collaborate to build this shared resource. There are several hundred functions to implement, and we are looking for help to complete datafusion-spark Spark Compatible Functions.

To register all functions in datafusion-spark you can use:

// Create a new session context
let mut ctx = SessionContext::new();
// Register all Spark functions with the context
datafusion_spark::register_all(&mut ctx)?;
// Run a query. Note the `sha2` function is now available
// and has Spark semantics
let df = ctx.sql("SELECT sha2('The input String', 256)").await?;
...

Or, to use an individual function, you can do:

use datafusion_expr::{col, lit};
use datafusion_spark::expr_fn::sha2;
// Create the expression `sha2(my_data, 256)`
let expr = sha2(col("my_data"), lit(256));
...

Thanks to shehabgamin for the initial PR #15168 and many others for their help adding additional functions. Please consider helping complete datafusion-spark Spark Compatible Functions.

`ORDER BY ALL` SQL support¶

Inspired by DuckDB, DataFusion 48.0.0 adds support for ORDER BY ALL. This allows for easy ordering of all columns in a query:

> set datafusion.sql_parser.dialect = 'DuckDB';
0 row(s) fetched.
> CREATE OR REPLACE TABLE addresses AS
    SELECT '123 Quack Blvd' AS address, 'DuckTown' AS city, '11111' AS zip
    UNION ALL
    SELECT '111 Duck Duck Goose Ln', 'DuckTown', '11111'
    UNION ALL
    SELECT '111 Duck Duck Goose Ln', 'Duck Town', '11111'
    UNION ALL
    SELECT '111 Duck Duck Goose Ln', 'Duck Town', '11111-0001';
0 row(s) fetched.
> SELECT * FROM addresses ORDER BY ALL;
+------------------------+-----------+------------+
| address                | city      | zip        |
+------------------------+-----------+------------+
| 111 Duck Duck Goose Ln | Duck Town | 11111      |
| 111 Duck Duck Goose Ln | Duck Town | 11111-0001 |
| 111 Duck Duck Goose Ln | DuckTown  | 11111      |
| 123 Quack Blvd         | DuckTown  | 11111      |
+------------------------+-----------+------------+
4 row(s) fetched.

Thanks to PokIsemaine for PR #15772.
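
Conceptually, ORDER BY ALL sorts rows by the first column, then the second, and so on. The following standalone sketch reproduces the ordering of the example above using Rust tuples, which compare lexicographically. It assumes plain byte-wise string comparison, and `order_by_all` is a hypothetical helper, not a DataFusion API.

```rust
/// Sort rows by every column, left to right.
fn order_by_all(mut rows: Vec<(String, String, String)>) -> Vec<(String, String, String)> {
    // Tuples compare by the first field, then the second, then the third.
    rows.sort();
    rows
}
```

With byte-wise comparison, 'Duck Town' sorts before 'DuckTown' because the space character orders before 'T', which matches the result shown above.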

FFI Support for `AggregateUDF` and `WindowUDF`¶

This improvement allows for using user defined aggregate and user defined window functions across FFI boundaries, which enables shared libraries to pass functions back and forth. This feature unlocks:

Modules to provide DataFusion based FFI aggregates that can be reused in projects such as datafusion-python
Using the same aggregate and window functions without recompiling with different DataFusion versions.

This completes the work to add support for all UDF types to DataFusion's FFI bindings. Thanks to timsaucer for PRs #16261 and #14775.

Reduced size of `Expr` struct¶

The Expr struct is widely used across the DataFusion and downstream codebases. By boxing WindowFunction, we reduced the size of Expr by almost 50%, from 272 to 144 bytes. This reduction improved planning times by between 10% and 20% and reduced memory usage. Thanks to hendrikmakait for PR #16207.
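
The effect of boxing a large variant can be demonstrated with a small standalone sketch. The types below (`LargePayload`, `Unboxed`, `Boxed`) are hypothetical stand-ins and are much simpler than the real Expr: a Rust enum is at least as large as its largest variant, so moving a large payload behind a Box shrinks every value of the enum.

```rust
use std::mem::size_of;

/// Stand-in for a large payload, such as a window function description.
#[allow(dead_code)]
struct LargePayload {
    data: [u64; 32], // 256 bytes
}

/// The large payload is stored inline, so every value pays for it.
#[allow(dead_code)]
enum Unboxed {
    Literal(i64),
    Window(LargePayload),
}

/// The large payload lives on the heap; the enum stores only a pointer.
#[allow(dead_code)]
enum Boxed {
    Literal(i64),
    Window(Box<LargePayload>),
}

/// Returns (size of the inline enum, size of the boxed enum) in bytes.
fn sizes() -> (usize, usize) {
    (size_of::<Unboxed>(), size_of::<Boxed>())
}
```

The trade-off is an extra heap allocation and pointer indirection for the boxed variant, which is usually worthwhile when that variant is rare compared to the others.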

Upgrade Guide and Changelog¶

Upgrading to 48.0.0 should be straightforward for most users, but do review the Upgrade Guide for DataFusion 48.0.0 for detailed steps and code changes. The upgrade guide covers the breaking changes mentioned above and provides code snippets to help with the transition. For a comprehensive list of all changes, please refer to the changelog for the 48.0.0 release. The changelog enumerates every merged PR in this release, including many smaller fixes and improvements that we couldn’t cover in this post.

Get Involved¶

Apache DataFusion is an open-source project, and we welcome involvement from anyone interested. Now is a great time to take 48.0.0 for a spin: try it out on your workloads, and let us know if you encounter any issues or have suggestions. You can report bugs or request features on our GitHub issue tracker, or better yet, submit a pull request. Join our community discussions – whether you have questions, want to share how you’re using DataFusion, or are looking to contribute, we’d love to hear from you. A list of open issues suitable for beginners is here and you can find how to reach us on the communication doc.

Happy querying!

Embedding User-Defined Indexes in Apache Parquet Files

2025-07-14T00:00:00+00:00

It’s a common misconception that Apache Parquet files are limited to basic Min/Max/Null Count statistics and Bloom filters, and that adding more advanced indexes requires changing the specification or creating a new file format. In fact, footer metadata and offset-based addressing already provide everything needed to embed user-defined index structures within Parquet files without breaking compatibility with other Parquet readers.

Motivating Example: Imagine your data has a Nation column with dozens of distinct values across thousands of Parquet files. You execute:

  SELECT AVG(sales_amount)
  FROM sales
  WHERE nation = 'Singapore'
  GROUP BY year;

Relying on the min/max statistics from the Parquet format will be ineffective at pruning files when Nation spans "Argentina" through "Zimbabwe". Instead of relying on a Bloom Filter, you may want to store a list of every distinct Nation value in the file near the end. At query time, your engine will read that tiny list and skip any file that does not contain 'Singapore'. This special distinct value index can yield dramatically better file‑pruning performance for your engine, all while preserving full compatibility with standard Parquet readers.

In this post, we review how indexes are stored in the Apache Parquet format, explain the mechanism for storing user-defined indexes, and finally show how to read and write a user-defined index using Apache DataFusion.

Introduction¶

Apache Parquet is a popular columnar file format with well-understood, production-grade libraries for high‑performance analytics. Features like efficient encodings, column pruning, and predicate pushdown work well for many common query patterns. Apache DataFusion includes a highly optimized Parquet implementation and has excellent performance in general. However, some production query patterns require more than the statistics included in the Parquet format itself¹.

Many systems improve query performance using external indexes or other metadata in addition to Parquet. For example, Apache Iceberg's Scan Planning uses metadata stored in separate files or an in-memory cache, and the parquet_index.rs and advanced_parquet_index.rs examples in the DataFusion repository use external files for Parquet pruning (skipping).

External indexes are powerful and widespread, but they have some drawbacks:

Increased Cost and Operational Complexity: You need additional files and systems as well as the original Parquet.
Synchronization Risks: The external index may become out of sync with the Parquet data if you do not manage it carefully.

Proponents have even cited these drawbacks as justification for new file formats, such as Microsoft's Amudai.

However, Parquet is extensible with user-defined indexes: Parquet tolerates unknown bytes within the file body and permits arbitrary key/value pairs in its footer metadata. These two features enable embedding user-defined indexes directly in the file—no extra files, no format forks, and no compatibility breakage.

Parquet File Anatomy & Standard Index Structures¶

Logically, Parquet files contain row groups, each with column chunks, which in turn contain data pages. Physically, a Parquet file is a sequence of bytes that ends with Thrift-encoded footer metadata describing the file structure. The footer metadata includes the schema, row groups, column chunks, and other information required to read the file.

The Parquet format includes three main types² of optional index structures:

Min/Max/Null Count Statistics for each column chunk in a row group. Engines use these to quickly skip row groups that do not match a query predicate.
Page Index: Offsets, sizes, and statistics for each data page. Engines use these to quickly locate data pages without scanning all pages for a column chunk.
Bloom Filters: Data structure to quickly determine if a value is present in a column chunk without scanning any data pages. Particularly useful for equality and IN predicates.

Figure 1: Parquet file layout with standard index structures (as written by arrow-rs).

Only the Min/Max/Null Count Statistics are stored inline in the Parquet footer metadata. The Page Index and Bloom Filters are typically stored in the file body before the Thrift-encoded footer metadata. The locations of these index structures are recorded in the footer metadata, as shown in Figure 1. Parquet readers that do not understand these structures simply ignore them.
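To make the offset-based addressing concrete, here is a minimal sketch of how a reader can locate the footer metadata. It relies on the Parquet trailer layout (footer bytes, then a 4-byte little-endian footer length, then the `PAR1` magic). The helper name `footer_range` is hypothetical, and the buffer built in `main` is a synthetic stand-in whose "footer" is placeholder bytes rather than real Thrift.

```rust
use std::ops::Range;

/// Magic bytes that open and close every Parquet file.
const PARQUET_MAGIC: &[u8] = b"PAR1";

/// Return the byte range of the footer metadata inside `file`, or `None`
/// if the leading magic or the trailer is malformed.
fn footer_range(file: &[u8]) -> Option<Range<usize>> {
    // Smallest possible file: leading magic + 4-byte length + trailing magic.
    if file.len() < 12 || !file.starts_with(PARQUET_MAGIC) || !file.ends_with(PARQUET_MAGIC) {
        return None;
    }
    // The last 8 bytes are the footer length followed by the magic.
    let trailer_start = file.len() - 8;
    let mut len_bytes = [0u8; 4];
    len_bytes.copy_from_slice(&file[trailer_start..trailer_start + 4]);
    let footer_len = u32::from_le_bytes(len_bytes) as usize;
    // The footer ends exactly where the trailer begins.
    let footer_start = trailer_start.checked_sub(footer_len)?;
    if footer_start < PARQUET_MAGIC.len() {
        return None; // footer would overlap the leading magic
    }
    Some(footer_start..trailer_start)
}

fn main() {
    let body: &[u8] = b"data pages, page index, bloom filters";
    let footer: &[u8] = b"footer-metadata";

    let mut file = Vec::new();
    file.extend_from_slice(PARQUET_MAGIC);
    file.extend_from_slice(body);
    file.extend_from_slice(footer);
    file.extend_from_slice(&(footer.len() as u32).to_le_bytes());
    file.extend_from_slice(PARQUET_MAGIC);

    let range = footer_range(&file).expect("valid trailer");
    assert_eq!(&file[range], footer);
}
```

Because a reader starts from the end of the file and follows offsets recorded in the footer, any bytes in the body that no offset points to are never visited.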

Modern Parquet writers create these indexes automatically and provide APIs to control their generation and placement. For example, the Rust Parquet Library provides Parquet WriterProperties, EnabledStatistics, and BloomFilterPosition.

Embedding User Defined Indexes in Parquet Files¶

Embedding user-defined indexes in Parquet files is straightforward and follows the same principles as standard index structures⁶:

Serialize the index into a binary format and write it into the file body before the Thrift-encoded footer metadata.
Record the index location in the footer metadata as a key/value pair, such as "my_index_offset" -> "<byte-offset>".

Figure 2 shows the resulting file layout.

Figure 2: Parquet file layout with user-defined indexes.

Like standard index structures, user-defined indexes can be stored anywhere in the file body, such as after row group data or before the footer. There is no limit to the number of user-defined indexes, nor any restriction on their granularity: they can operate at the file, row group, page, or even row level. This flexibility enables a wide range of use cases, including:

Row group or page-level distinct sets: a finer-grained version of the file-level example in this blog.
HyperLogLog sketches for distinct value estimation, addressing a common criticism³ of Parquet’s lack of cardinality estimation.
Additional zone maps (small materialized aggregates) such as precomputed sums at the column chunk or data page level for faster query execution.
Histograms or samples at the row group or column chunk level for predicate selectivity estimates.

Example: Embedding a User Defined Distinct Value Index in Parquet Files¶

This section demonstrates how to embed a simple distinct value index in Parquet files and use it for file-level pruning (skipping) in DataFusion. The full example is available in the DataFusion repository at parquet_embedded_index.rs.

Note that the example requires arrow‑rs v55.2.0 or later, which includes the new “buffered write” API (apache/arrow-rs#7714) to keep the internal byte count in sync after appending index bytes immediately after data pages.

This example is intentionally simple for clarity, but you can adapt the same approach for any index type or data types. The high-level design is:

Define your index payload (e.g., bitmap, Bloom filter, sketch, distinct values list, etc.).
Serialize your index to bytes and append them into the Parquet file body before writing the footer.
Record the index location by adding a key/value entry (e.g., "my_index_offset" -> "<byte‑offset>") in the Parquet footer metadata.
Extend DataFusion with a custom TableProvider (or wrap the existing Parquet provider) to use the index.

The TableProvider simply reads the footer metadata to discover the index offset, seeks to that offset and deserializes the index, and then uses the index to speed up processing (e.g., skip files, row groups, data pages, etc.).

The resulting Parquet files remain fully compatible with other tools such as DuckDB and Spark, which simply ignore the unknown index bytes and key/value metadata.

Introduction to Distinct Value Indexes¶

A distinct value index stores the unique values of a specific column. This type of index is effective for columns with a small number of distinct values and can be used to quickly skip files that do not match the query. These indexes are popular in several engines, such as the "set" Skip Index in ClickHouse and the Distinct Value Cache in InfluxDB 3.0.

For example, if the files contain a column named Category like this:

Category

foo

bar

...

baz

foo

The distinct value index will contain the values foo, bar, and baz. In contrast, traditional min/max statistics would store only the minimum (bar) and maximum (foo) values, so a query like

SELECT * FROM t WHERE Category = 'bas'

cannot skip the file using min/max values because bas falls between bar and foo in lexicographic order, even though bas does not appear in the column.

This is a key benefit of a distinct value index: accurate filtering without requiring the column to be sorted, unlike min/max-based pruning which is most effective when data is ordered.
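The difference between the two pruning strategies can be sketched with plain Rust collections. The helper names `can_skip_with_min_max` and `can_skip_with_distinct` are hypothetical and exist only for this illustration; the values are the ones from the Category example above.

```rust
use std::collections::HashSet;

/// Min/max pruning: the file can be skipped only if `value` lies outside
/// the closed interval [min, max].
fn can_skip_with_min_max(min: &str, max: &str, value: &str) -> bool {
    value < min || value > max
}

/// Distinct-set pruning: the file can be skipped whenever `value` is absent.
fn can_skip_with_distinct(distinct: &HashSet<&str>, value: &str) -> bool {
    !distinct.contains(value)
}

fn main() {
    let column = ["foo", "bar", "baz", "foo"];
    let min = *column.iter().min().unwrap(); // "bar"
    let max = *column.iter().max().unwrap(); // "foo"
    let distinct: HashSet<&str> = column.iter().copied().collect();

    // 'bas' sorts between "bar" and "foo", so min/max cannot rule it out...
    assert!(!can_skip_with_min_max(min, max, "bas"));
    // ...but the distinct set proves that it never occurs in the column.
    assert!(can_skip_with_distinct(&distinct, "bas"));
}
```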

While not a traditional index structure like a B-tree, the distinct value set acts as a lightweight, embedded index that enables fast pruning and is especially effective for columns with low cardinality.

Supported Filters

Distinct value indexes are most effective for equality filters, such as:

WHERE category = 'foo'
WHERE category IN ('foo', 'bar')

They can also help with NOT IN and anti-joins, as long as the engine can evaluate them using the list of known distinct values.

However, these indexes are not suitable for range predicates (e.g., category > 'foo'), as they do not preserve any ordering information. For such cases, other structures such as min/max statistics or sorted data layouts may be more effective.
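As a rough sketch of how an engine might evaluate these filters against a distinct set, the hypothetical helpers below decide whether a file can be skipped for `IN` and `NOT IN` lists. An equality filter is the single-element case of `IN`. The sketch assumes the list contains no NULLs and is not taken from DataFusion.

```rust
use std::collections::HashSet;

/// `col IN (list)`: the file can be skipped when none of the listed
/// values occurs in the file.
fn can_skip_in(distinct: &HashSet<String>, list: &[&str]) -> bool {
    list.iter().all(|v| !distinct.contains(*v))
}

/// `col NOT IN (list)`: the file can be skipped when every value that
/// occurs in the file is excluded by the list.
fn can_skip_not_in(distinct: &HashSet<String>, list: &[&str]) -> bool {
    distinct.iter().all(|d| list.contains(&d.as_str()))
}

fn main() {
    let distinct: HashSet<String> =
        ["foo", "bar", "baz"].iter().map(|s| s.to_string()).collect();

    // 'foo' is present, so the file must be read.
    assert!(!can_skip_in(&distinct, &["foo", "qux"]));
    // Neither value is present, so the file can be skipped.
    assert!(can_skip_in(&distinct, &["qux", "quux"]));

    // Every value in the file is excluded, so no row can match.
    assert!(can_skip_not_in(&distinct, &["foo", "bar", "baz", "qux"]));
    // Rows with 'bar' or 'baz' still match, so the file must be read.
    assert!(!can_skip_not_in(&distinct, &["foo"]));
}
```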

We represent a distinct value index in Rust for our example as a simple HashSet<String>:

/// An index of distinct values for a single column
#[derive(Debug, Clone)]
struct DistinctIndex {
   inner: HashSet<String>,
}

File Layout with Distinct Value Index¶

In this example, we write a distinct value index for the Category column into the Parquet file body after all the data pages, and record the index location in the footer metadata. The resulting file layout looks like this:

                  ┌──────────────────────┐                           
                  │┌───────────────────┐ │                           
                  ││     DataPage      │ │                           
                  │└───────────────────┘ │                           
 Standard Parquet │┌───────────────────┐ │                           
 Data Pages       ││     DataPage      │ │                           
                  │└───────────────────┘ │                           
                  │        ...           │                           
                  │┌───────────────────┐ │                           
                  ││     DataPage      │ │                           
                  │└───────────────────┘ │                           
                  │┏━━━━━━━━━━━━━━━━━━━┓ │                           
Non standard      │┃                   ┃ │                           
index (ignored by │┃Custom Binary Index┃ │                           
other Parquet     │┃ (Distinct Values) ┃◀│─ ─ ─                      
readers)          │┃                   ┃ │     │                     
                  │┗━━━━━━━━━━━━━━━━━━━┛ │                           
Standard Parquet  │┏━━━━━━━━━━━━━━━━━━━┓ │     │  key/value metadata
Page Index        │┃    Page Index     ┃ │        contains location  
                  │┗━━━━━━━━━━━━━━━━━━━┛ │     │  of special index   
                  │╔═══════════════════╗ │                           
                  │║ Parquet Footer w/ ║ │     │                     
                  │║     Metadata      ║ ┼ ─ ─                       
                  │║ (Thrift Encoded)  ║ │                           
                  │╚═══════════════════╝ │                           
                  └──────────────────────┘

Serializing the Distinct‑Value Index¶

The example uses a simple newline‑separated UTF‑8 format as the binary format. The code to serialize the distinct index is shown below:

/// Magic bytes to identify our custom index format
const INDEX_MAGIC: &[u8] = b"IDX1";

/// Serialize the distinct index to a writer as bytes
fn serialize<W: Write + Send>(
   &self,
   arrow_writer: &mut ArrowWriter<W>,
) -> Result<()> {
   let serialized = self
           .inner
           .iter()
           .map(|s| s.as_str())
           .collect::<Vec<_>>()
           .join("\n");
   let index_bytes = serialized.into_bytes();

   // Set the offset for the index
   let offset = arrow_writer.bytes_written();
   let index_len = index_bytes.len() as u64;

   // Write the index magic and length to the file
   arrow_writer.write_all(INDEX_MAGIC)?;
   arrow_writer.write_all(&index_len.to_le_bytes())?;

   // Write the index bytes
   arrow_writer.write_all(&index_bytes)?;

   // Append metadata about the index to the Parquet file footer metadata
   arrow_writer.append_key_value_metadata(KeyValue::new(
      "distinct_index_offset".to_string(),
      offset.to_string(),
   ));
   Ok(())
}

This code does the following:

Creates a newline‑separated UTF‑8 string from the distinct values.
Writes a magic header (IDX1) and the length of the index.
Writes the index bytes to the file using the ArrowWriter API.
Records the index location by adding a key/value entry ("distinct_index_offset" -> <offset>) in the Parquet footer metadata.

Note: Use the ArrowWriter::write_all API to ensure the offsets in the footer metadata are correctly tracked.

Reading the Index¶

This code reads the distinct index from a Parquet file:

/// Read a `DistinctIndex` from a Parquet file
fn read_distinct_index(path: &Path) -> Result<DistinctIndex> {
    let file = File::open(path)?;

    let file_size = file.metadata()?.len();
    println!("Reading index from {} (size: {file_size})", path.display());

    let reader = SerializedFileReader::new(file.try_clone()?)?;
    let meta = reader.metadata().file_metadata();

    let offset = get_key_value(meta, "distinct_index_offset")
        .ok_or_else(|| ParquetError::General("Missing index offset".into()))?
        .parse::<u64>()
        .map_err(|e| ParquetError::General(e.to_string()))?;

    println!("Reading index at offset: {offset}");
    DistinctIndex::new_from_reader(file, offset)
}

This function:

Opens the Parquet footer metadata and extracts distinct_index_offset from the metadata.
Calls DistinctIndex::new_from_reader to read the index from the file at that offset.

DistinctIndex::new_from_reader actually reads the index as shown below:

 /// Read the distinct values index from a reader at the given offset
 pub fn new_from_reader<R: Read + Seek>(mut reader: R, offset: u64) -> Result<DistinctIndex> {
     reader.seek(SeekFrom::Start(offset))?;

     let mut magic_buf = [0u8; 4];
     reader.read_exact(&mut magic_buf)?;
     if magic_buf != INDEX_MAGIC {
         return exec_err!("Invalid index magic number at offset {offset}");
     }

     let mut len_buf = [0u8; 8];
     reader.read_exact(&mut len_buf)?;
     let stored_len = u64::from_le_bytes(len_buf) as usize;

     let mut index_buf = vec![0u8; stored_len];
     reader.read_exact(&mut index_buf)?;

     let Ok(s) = String::from_utf8(index_buf) else {
         return exec_err!("Invalid UTF-8 in index data");
     };

     Ok(Self {
         inner: s.lines().map(|s| s.to_string()).collect(),
     })
 }

This code:

Seeks to the offset of the index in the file.
Reads the magic bytes and checks they match IDX1.
Reads the length of the index and allocates a buffer.
Reads the index bytes, converts them to a String, and splits into lines to populate the HashSet<String>.
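The framing used above (magic bytes, little-endian length, newline-separated payload) can be exercised without Parquet at all. The following sketch round-trips the same layout through an in-memory `Cursor`; `write_index` and `read_index` are hypothetical stand-ins for the `ArrowWriter`-based code, and the leading bytes merely pretend to be data pages.

```rust
use std::collections::HashSet;
use std::io::{self, Cursor, Read, Seek, SeekFrom, Write};

const INDEX_MAGIC: &[u8] = b"IDX1";

/// Append the magic, the little-endian payload length, and the
/// newline-separated values; return the offset where the index starts.
fn write_index<W: Write + Seek>(out: &mut W, values: &HashSet<String>) -> io::Result<u64> {
    let offset = out.stream_position()?;
    let payload = values.iter().map(String::as_str).collect::<Vec<_>>().join("\n");
    out.write_all(INDEX_MAGIC)?;
    out.write_all(&(payload.len() as u64).to_le_bytes())?;
    out.write_all(payload.as_bytes())?;
    Ok(offset)
}

/// Seek to `offset`, validate the magic, and decode the values.
fn read_index<R: Read + Seek>(input: &mut R, offset: u64) -> io::Result<HashSet<String>> {
    input.seek(SeekFrom::Start(offset))?;
    let mut magic = [0u8; 4];
    input.read_exact(&mut magic)?;
    if &magic[..] != INDEX_MAGIC {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "bad index magic"));
    }
    let mut len = [0u8; 8];
    input.read_exact(&mut len)?;
    let mut payload = vec![0u8; u64::from_le_bytes(len) as usize];
    input.read_exact(&mut payload)?;
    let text = String::from_utf8(payload)
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
    Ok(text.lines().map(str::to_string).collect())
}

fn main() -> io::Result<()> {
    let mut file = Cursor::new(Vec::new());
    file.write_all(b"pretend these bytes are data pages")?;

    let values: HashSet<String> =
        ["foo", "bar", "baz"].iter().map(|s| s.to_string()).collect();
    // In the real example this offset is stored in the footer key/value metadata.
    let offset = write_index(&mut file, &values)?;

    let restored = read_index(&mut file, offset)?;
    assert_eq!(restored, values);
    Ok(())
}
```

One limitation of a newline-separated payload is that a value containing a newline would be split into two entries on read, so a production format would likely use length-prefixed values instead.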

Extending DataFusion’s `TableProvider`¶

To use the distinct index for file-level pruning, extend DataFusion's TableProvider to read the index and apply it during query execution:

impl TableProvider for DistinctIndexTable {
    /* ... */

    /// Prune files before reading: only keep files whose distinct set
    /// contains the filter value
    async fn scan(
        &self,
        _ctx: &dyn Session,
        _proj: Option<&Vec<usize>>,
        filters: &[Expr],
        _limit: Option<usize>,
    ) -> Result<Arc<dyn ExecutionPlan>> {
        // This example only handles filters of the form
        // `category = 'X'` where X is a string literal
        //
        // You can use `PruningPredicate` for much more general range and
        // equality analysis or write your own custom logic.
        let mut target: Option<&str> = None;

        if filters.len() == 1 {
            if let Expr::BinaryExpr(expr) = &filters[0] {
                if expr.op == Operator::Eq {
                    if let (
                        Expr::Column(c),
                        Expr::Literal(ScalarValue::Utf8(Some(v)), _),
                    ) = (&*expr.left, &*expr.right)
                    {
                        if c.name == "category" {
                            println!("Filtering for category: {v}");
                            target = Some(v);
                        }
                    }
                }
            }
        }
        // Determine which files to scan
        // files_and_index is a Vec<(String, DistinctIndex)>,
        // See the full example for how this is populated.
        let files_to_scan: Vec<_> = self
            .files_and_index
            .iter()
            .filter_map(|(f, distinct_index)| {
                // keep file if no target or target is in the distinct set
                if target.is_none() || distinct_index.contains(target?) {
                    Some(f)
                } else {
                    None
                }
            })
            .collect();

        // Build ParquetSource to actually read the files
        let url = ObjectStoreUrl::parse("file://")?;
        let source = Arc::new(ParquetSource::default().with_enable_page_index(true));
        let mut builder = FileScanConfigBuilder::new(url, self.schema.clone(), source);
        for file in files_to_scan {
            let path = self.dir.join(file);
            let len = std::fs::metadata(&path)?.len();
            // If the index contained information about row groups or pages,
            // you could also pass that information here to further prune
            // the data read from the file.
            let partitioned_file =
                PartitionedFile::new(path.to_str().unwrap().to_string(), len);
            builder = builder.with_file(partitioned_file);
        }
        Ok(DataSourceExec::from_data_source(builder.build()))
    }

    /// Tell DataFusion that we can handle filters on the "category" column
    fn supports_filters_pushdown(
        &self,
        fs: &[&Expr],
    ) -> Result<Vec<TableProviderFilterPushDown>> {
        // Mark as inexact since pruning is file‑granular
        Ok(vec![TableProviderFilterPushDown::Inexact; fs.len()])
    }
}

This code does the following:

Implements the scan method to filter files based on the distinct index.
Checks if the filter is an equality predicate on the category column.
If the target value is specified, checks if the distinct index contains that value.
Builds a FileScanConfig with only the files that match the filter.
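The file selection step in `scan` can be isolated from DataFusion and tested on its own. The sketch below is a simplified, hypothetical version that uses a plain `HashSet<String>` in place of `DistinctIndex`; the file names are invented for the illustration.

```rust
use std::collections::HashSet;

/// Keep every file when there is no usable filter; otherwise keep only the
/// files whose distinct set contains the target value.
fn files_to_scan<'a>(
    files_and_index: &'a [(String, HashSet<String>)],
    target: Option<&str>,
) -> Vec<&'a str> {
    files_and_index
        .iter()
        .filter(|(_, distinct)| target.map_or(true, |t| distinct.contains(t)))
        .map(|(name, _)| name.as_str())
        .collect()
}

/// Helper for building a distinct set from string literals.
fn distinct_set(vals: &[&str]) -> HashSet<String> {
    vals.iter().map(|s| s.to_string()).collect()
}

fn main() {
    let files = vec![
        ("a.parquet".to_string(), distinct_set(&["foo", "bar"])),
        ("b.parquet".to_string(), distinct_set(&["baz", "qux"])),
        ("c.parquet".to_string(), distinct_set(&["foo", "quux"])),
    ];

    // Only the files that contain 'foo' survive pruning.
    assert_eq!(files_to_scan(&files, Some("foo")), vec!["a.parquet", "c.parquet"]);
    // Without a recognized filter, nothing can be pruned.
    assert_eq!(files_to_scan(&files, None).len(), 3);
}
```

Pruning of this kind only removes whole files, which is why the provider reports the filter as inexact: the surviving files still contain non-matching rows, and DataFusion applies the filter again after the scan.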

Putting It All Together¶

To use the distinct index in a DataFusion query, write sample Parquet files with the embedded index, register the DistinctIndexTable provider, and run a query with a predicate that can be optimized by the index as shown below.

// Write sample files with embedded indexes
tmp_dir.iter().for_each(|(name, vals)| {
    write_file_with_index(&dir.join(name), vals).unwrap();
});

// Register provider and query
let provider = Arc::new(DistinctIndexTable::try_new(dir, schema.clone())?);
ctx.register_table("t", provider)?;

// Only files containing 'foo' will be scanned
let df = ctx.sql("SELECT * FROM t WHERE category = 'foo'").await?;
df.show().await?;

Verifying Compatibility with DuckDB¶

Even with extra bytes and unknown metadata keys, standard Parquet readers ignore the index. You can verify this by using another system such as DuckDB to read the Parquet files created in the example. DuckDB reads the files without any issues, ignoring the custom index and unknown footer metadata.

SELECT * FROM read_parquet('/tmp/parquet_index_data/*');
┌──────────┐
│ category │
│ varchar  │
├──────────┤
│ foo      │
│ bar      │
│ foo      │
│ baz      │
│ qux      │
│ foo      │
│ quux     │
│ quux     │
└──────────┘

Conclusion¶

In this post, we explained how index structures are stored in Apache Parquet, how to embed user-defined indexes without changing the format, and how to use user-defined indexes to speed up query processing.

Parquet-based systems can achieve significant performance improvements for almost any query pattern while still retaining broad compatibility, using user-defined embedded indexes, external indexes⁴ and rewriting files optimized for specific queries⁵. System designers can choose among the available options to make the appropriate trade-offs between operational complexity, performance, file size, and cost for their specific use cases.

We hope this post inspires you to explore custom indexes in Parquet files, rather than proposing new file formats and reimplementing existing features. The DataFusion community is excited to see how you use this feature in your projects!

About the Authors¶

Qi Zhu is a Senior Engineer at Cloudera, an active contributor to Apache DataFusion and Apache Arrow, a committer on Apache Hadoop and Apache YuniKorn. He has extensive experience in distributed systems, scheduling, and large-scale computing.

Jigao Luo is a 1.5-year PhD student at Systems Group @ TU Darmstadt. Regarding Parquet, he is an external contributor to NVIDIA RAPIDS cuDF, focusing on the GPU Parquet reader.

Andrew Lamb is a Staff Engineer at InfluxData, and a member of the Apache DataFusion and Apache Arrow PMCs. He has been working on databases and related systems for more than 20 years.

About DataFusion¶

Footnotes¶

1: A commonly cited example is highly selective predicates (e.g. category = 'foo') for which the built-in Bloom filters are not sufficient.

2: There are other index structures, but they are either 1) not widely supported (such as statistics in the page headers) or 2) not yet widely used in practice at the time of this writing (such as GeospatialStatistics and SizeStatistics).

3: Seamless Integration of Parquet Files into Data Processing. / Rey, Alice; Freitag, Michael; Neumann, Thomas. / BTW 2023

4: For more information about external indexes, see this talk and the parquet_index.rs and advanced_parquet_index.rs examples in the DataFusion repository.

5: For information about rewriting files to optimize for specific queries, such as resorting, repartitioning, and tuning data page and row group sizes, see XiangpengHao/liquid‑cache#227 and the conversation between JigaoLuo and XiangpengHao for details. We hope to make a future post about this topic.

6: An index can also be stored inline in the key-value metadata. This approach is simple to implement and ensures the index is available once the footer is read, without additional I/O. However, it requires the index to be serialized as a UTF-8 string, which may be less efficient and increases the size of the footer metadata, impacting all Parquet readers, even those that ignore the index.

Apache DataFusion 47.0.0 Released

2025-07-11T00:00:00+00:00

We’re excited to announce the release of Apache DataFusion 47.0.0! This new version represents a significant milestone for the project, packing in a wide range of improvements and fixes. You can find the complete details in the full changelog. We’ll highlight the most important changes below and guide you through upgrading.

Note that DataFusion 47.0.0 was released in April 2025, but we are only now publishing the blog post due to limited bandwidth in the DataFusion community. We apologize for the delay and encourage you to come help us accelerate the next release and announcements by joining the community 🎣.

Breaking Changes¶

DataFusion 47.0.0 brings a few breaking changes that may require adjustments to your code as described in the Upgrade Guide. Here are some notable ones:

Upgrades to arrow-rs and arrow-parquet 55.0.0 and object_store 0.12.0: Several APIs changed in the underlying arrow, parquet and object_store libraries to use a u64 instead of usize to better support WASM. This requires converting from usize to u64 occasionally as well as changes to ObjectStore implementations such as

impl ObjectStore {
    ...

    // The range is now a u64 instead of usize
    async fn get_range(&self, location: &Path, range: Range<u64>) -> ObjectStoreResult<Bytes> {
        self.inner.get_range(location, range).await
    }

    ...

    // the lifetime is now 'static instead of '_ (meaning the captured closure can't contain references)
    // (this also applies to list_with_offset)
    fn list(&self, prefix: Option<&Path>) -> BoxStream<'static, ObjectStoreResult<ObjectMeta>> {
        self.inner.list(prefix)
    }
}

DisplayFormatType::TreeRender: Implementations of ExecutionPlan must also provide a description in the DisplayFormatType::TreeRender format to provide support for the new tree style explains. This can be the same as the existing DisplayFormatType::Default.
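For the `usize` to `u64` change in the first item above, the conversions that callers typically need look roughly like the sketch below. The helper names `to_u64_range` and `to_usize_range` are hypothetical and are not part of arrow, parquet, or object_store.

```rust
use std::convert::TryFrom;
use std::ops::Range;

/// Widen an in-memory range to the `u64` form used for file offsets.
/// Lossless on targets where `usize` is at most 64 bits wide.
fn to_u64_range(range: Range<usize>) -> Range<u64> {
    range.start as u64..range.end as u64
}

/// Narrow a `u64` range so it can index a slice. Returns `None` when an
/// offset does not fit in `usize`, which can happen on 32-bit targets
/// such as WASM.
fn to_usize_range(range: Range<u64>) -> Option<Range<usize>> {
    let start = usize::try_from(range.start).ok()?;
    let end = usize::try_from(range.end).ok()?;
    Some(start..end)
}

fn main() {
    let data: &[u8] = b"hello world";

    let wide = to_u64_range(6..11);
    assert_eq!(wide, 6u64..11u64);

    let narrow = to_usize_range(wide).expect("offsets fit in usize");
    assert_eq!(&data[narrow], &b"world"[..]);
}
```

Using a checked conversion in the narrowing direction avoids silent truncation when a file is larger than the address space of the target.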

Performance Improvements¶

DataFusion 47.0.0 comes with numerous performance enhancements across the board. Here are some of the noteworthy optimizations in this release:

FIRST_VALUE and LAST_VALUE: FIRST_VALUE and LAST_VALUE functions execute much faster for data with high cardinality such as those with many groups or partitions. DataFusion 47.0.0 executes the following in 7 seconds compared to 36 seconds in DataFusion 46.0.0: select id2, id4, first_value(v1 order by id2, id4) as r2 from '~/h2o_100m.parquet' group by id2, id4 (h2o.ai dataset). (PRs #15266 and #15542 by UBarney).
MIN, MAX and AVG for Durations: DataFusion executes aggregate queries up to 2.5x faster when they include MIN, MAX and AVG on Duration columns. (PRs #15322 and #15748 by shruti2522).
Short circuit evaluation for AND and OR: DataFusion now eagerly skips the evaluation of the right operand if the left is known to be false (AND) or true (OR) in certain cases. For complex predicates, such as those with many LIKE or CASE expressions, this optimization results in significant performance improvements (up to 100x in extreme cases). (PRs #15462 and #15694 by acking-you).
TopK optimization for partially sorted input: Previous versions of DataFusion implemented early termination optimization (TopK) for fully sorted data. DataFusion 47.0.0 extends the optimization for partially sorted data, which is common in many real-world datasets, such as time-series data sorted by day but not within each day. (PR #15563 by geoffreyclaude).
Disable re-validation of spilled files: DataFusion no longer performs unnecessary re-validation of temporary spill files. The validation is expensive and unnecessary because the data is known to be valid when it was written out. (PR #15454 by zebsme).
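The short circuit optimization for AND listed above can be illustrated with a toy batch evaluator. This is only a sketch of the idea and not DataFusion's implementation: it uses plain `bool` values and ignores SQL NULL semantics, and `and_short_circuit` is a hypothetical name.

```rust
use std::cell::Cell;

/// Evaluate `left AND right` over a batch of rows. The (possibly expensive)
/// right-hand side is computed only if at least one row passed the left side.
fn and_short_circuit(left: &[bool], right: impl FnOnce() -> Vec<bool>) -> Vec<bool> {
    if left.iter().all(|l| !*l) {
        // Every row already failed, so the right side cannot change the result.
        return vec![false; left.len()];
    }
    let right = right();
    left.iter().zip(right).map(|(l, r)| *l && r).collect()
}

fn main() {
    // Count how often the expensive operand is evaluated.
    let calls = Cell::new(0);
    let expensive = || {
        calls.set(calls.get() + 1);
        vec![true, false, true]
    };

    // All rows fail the left side: the right side is never evaluated.
    assert_eq!(and_short_circuit(&[false, false, false], &expensive), vec![false; 3]);
    assert_eq!(calls.get(), 0);

    // Some rows pass the left side: the right side is evaluated once.
    assert_eq!(and_short_circuit(&[true, true, false], &expensive), vec![true, false, false]);
    assert_eq!(calls.get(), 1);
}
```

The OR case is symmetric: when every row of the left operand is already true, the right operand can be skipped and the result is all true.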

Highlighted New Features¶

Tree style explains¶

In previous releases, the EXPLAIN statement produced a formatted table that is succinct and contains important details for implementers, but is often hard to read, especially for queries with joins or unions that have multiple children.

DataFusion 47.0.0 includes the new EXPLAIN FORMAT TREE (default in datafusion-cli) rendered in a visual tree style that is much easier to quickly understand.

Example of the new explain output:

> explain select * from t1 inner join t2 on t1.ti=t2.ti;
+---------------+------------------------------------------------------------+
| plan_type     | plan                                                       |
+---------------+------------------------------------------------------------+
| physical_plan | ┌───────────────────────────┐                              |
|               | │    CoalesceBatchesExec    │                              |
|               | │    --------------------   │                              |
|               | │     target_batch_size:    │                              |
|               | │            8192           │                              |
|               | └─────────────┬─────────────┘                              |
|               | ┌─────────────┴─────────────┐                              |
|               | │        HashJoinExec       │                              |
|               | │    --------------------   ├──────────────┐               |
|               | │       on: (ti = ti)       │              │               |
|               | └─────────────┬─────────────┘              │               |
|               | ┌─────────────┴─────────────┐┌─────────────┴─────────────┐ |
|               | │       DataSourceExec      ││       DataSourceExec      │ |
|               | │    --------------------   ││    --------------------   │ |
|               | │         bytes: 112        ││         bytes: 112        │ |
|               | │       format: memory      ││       format: memory      │ |
|               | │          rows: 1          ││          rows: 1          │ |
|               | └───────────────────────────┘└───────────────────────────┘ |
|               |                                                            |
+---------------+------------------------------------------------------------+

Example of the EXPLAIN FORMAT INDENT output for the same query:

> explain format indent select * from t1 inner join t2 on t1.ti=t2.ti;
+---------------+----------------------------------------------------------------------+
| plan_type     | plan                                                                 |
+---------------+----------------------------------------------------------------------+
| logical_plan  | Inner Join: t1.ti = t2.ti                                            |
|               |   TableScan: t1 projection=[ti]                                      |
|               |   TableScan: t2 projection=[ti]                                      |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192                          |
|               |   HashJoinExec: mode=CollectLeft, join_type=Inner, on=[(ti@0, ti@0)] |
|               |     DataSourceExec: partitions=1, partition_sizes=[1]                |
|               |     DataSourceExec: partitions=1, partition_sizes=[1]                |
|               |                                                                      |
+---------------+----------------------------------------------------------------------+
2 row(s) fetched.

Thanks to irenjj for the initial work in PR #14677 and many others for completing the follow-up epic.

SQL `VARCHAR` defaults to Utf8View¶

In previous releases, SQL varchar columns were mapped to the Utf8 Arrow data type. In this release, they are mapped to the Utf8View Arrow data type by default, which is a more efficient representation of UTF-8 strings in Arrow.

> create table foo(x varchar);
0 row(s) fetched.

> describe foo;
+-------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-------------+-----------+-------------+
| x           | Utf8View  | YES         |
+-------------+-----------+-------------+

Previous versions of DataFusion used Utf8View when reading parquet files and it is faster in most cases.

Thanks to zhuqi-lucas for PR #15104.

Context propagation in spawned tasks (for tracing, logging, etc.)¶

This release introduces an API for propagating user-defined context (such as tracing spans, logging, or metrics) across thread boundaries without depending on any specific instrumentation library. You can use the JoinSetTracer API to instrument DataFusion plans with your own tracing or logging libraries, or use pre-integrated community crates such as the datafusion-tracing crate.

Previously, tasks spawned on new threads — such as those performing repartitioning or Parquet file reads — could lose thread-local context, which is often used in instrumentation libraries. A full example of how to use this new API is available in the DataFusion examples, and a simple example is shown below.

/// Models a simple tracer. Calling `in_current_span()` and `in_scope()` saves thread-specific state
/// for the current span and must be called at the start of each new task or thread.
struct SpanTracer;

/// Implements the `JoinSetTracer` trait so we can inject instrumentation
/// for both async futures and blocking closures.
impl JoinSetTracer for SpanTracer {
    /// Instruments a boxed future to run in the current span. The future's
    /// return type is erased to `Box<dyn Any + Send>`, which we simply
    /// run inside the `Span::current()` context.
    fn trace_future(
        &self,
        fut: BoxFuture<'static, Box<dyn Any + Send>>,
    ) -> BoxFuture<'static, Box<dyn Any + Send>> {
        // Ensures any thread-local context is set in this future 
        fut.in_current_span().boxed()
    }

    /// Instruments a boxed blocking closure by running it inside the
    /// `Span::current()` context.
    fn trace_block(
        &self,
        f: Box<dyn FnOnce() -> Box<dyn Any + Send> + Send>,
    ) -> Box<dyn FnOnce() -> Box<dyn Any + Send> + Send> {
        let span = Span::current();
        // Ensures any thread-local context is set for this closure
        Box::new(move || span.in_scope(f))
    }
}

...
set_join_set_tracer(&SpanTracer).expect("Failed to set tracer");
...

Thanks to geoffreyclaude for PR #14914

Upgrade Guide and Changelog¶

Upgrading to 47.0.0 should be straightforward for most users, but do review the Upgrade Guide for DataFusion 47.0.0 for detailed steps and code changes. The upgrade guide covers the breaking changes mentioned above and provides code snippets to help with the transition. For a comprehensive list of all changes, please refer to the changelog for 47.0.0. The changelog enumerates every merged PR in this release, including many smaller fixes and improvements that we couldn’t cover in this post.

Get Involved¶

Apache DataFusion is an open-source project, and we welcome involvement from anyone interested. Now is a great time to take 47.0.0 for a spin: try it out on your workloads, and let us know if you encounter any issues or have suggestions. You can report bugs or request features on our GitHub issue tracker, or better yet, submit a pull request. Join our community discussions – whether you have questions, want to share how you’re using DataFusion, or are looking to contribute, we’d love to hear from you. A list of open issues suitable for beginners is here and you can find how to reach us on the communication doc.

Happy querying!

Apache DataFusion Comet 0.9.0 Release

2025-07-01T00:00:00+00:00

The Apache DataFusion PMC is pleased to announce version 0.9.0 of the Comet subproject.

Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for improved performance and efficiency without requiring any code changes.

This release covers approximately ten weeks of development work and is the result of merging 139 PRs from 24 contributors. See the change log for more information.

Release Highlights¶

Complex Type Support in Parquet Scans¶

Comet now supports complex types (Structs, Maps, and Arrays) when reading Parquet files. This functionality is not yet available when reading Parquet files from Apache Iceberg.

This functionality was only available in previous releases when manually specifying one of the new experimental scan implementations. Comet now automatically chooses the best scan implementation based on the input schema, and no longer requires manual configuration.

Complex Type Processing Improvements¶

Numerous improvements have been made to complex type support to ensure Spark-compatible behavior when casting between structs and accessing fields within deeply nested types.

Shuffle Improvements¶

Comet now accelerates a broader range of shuffle operations, leading to more queries running fully natively. In previous releases, some shuffle operations fell back to Spark to avoid some known bugs in Comet, and these bugs have now been fixed.

New Features¶

Comet 0.9.0 adds support for the following Spark expressions:

ArrayDistinct
ArrayMax
ArrayRepeat
ArrayUnion
BitCount
BitNot
Expm1
MapValues
Signum
ToPrettyString
map[]

Improved Spark SQL Test Coverage¶

Comet now passes 97% of the Spark SQL test suite, with more than 24,000 tests passing (based on testing against Spark 3.5.6). The remaining 3% of tests are ignored for various reasons, such as being too specific to Spark internals, or testing for features that are not relevant to Comet, such as whole-stage code generation, which is not needed when using a vectorized execution engine.

This release contains numerous bug fixes to achieve this coverage, including improved support for exchange reuse when AQE is enabled.

Module	Passed	Ignored	Canceled	Total
catalyst	7,232	5	1	7,238
core-1	9,186	246	6	9,438
core-2	2,649	393	0	3,042
core-3	1,757	136	16	1,909
hive-1	2,174	14	4	2,192
hive-2	19	1	4	24
hive-3	1,058	11	4	1,073
Total	24,075	806	31	24,912

Memory & Performance Tracing¶

Comet now provides a tracing feature for analyzing performance and off-heap versus on-heap memory usage. See the Comet Tracing Guide for more information.

Spark Compatibility¶

Spark 3.4.3 with JDK 11 & 17, Scala 2.12 & 2.13
Spark 3.5.4 through 3.5.6 with JDK 11 & 17, Scala 2.12 & 2.13
Experimental support for Spark 4.0.0 with JDK 17, Scala 2.13

We are looking for help from the community to fully support Spark 4.0.0. See EPIC: Support 4.0.0 for more information.

Note that Java 8 support was removed from this release because Apache Arrow no longer supports it.

Getting Involved¶

The Comet project welcomes new contributors. We use the same Slack and Discord channels as the main DataFusion project and have a weekly DataFusion video call.

There are also many good first issues waiting for contributions.

Using Rust async for Query Execution and Cancelling Long-Running Queries

2025-06-30T00:00:00+00:00

Have you ever tried to cancel a query that just wouldn't stop? In this post, we'll review how Rust's async programming model works, how DataFusion uses that model for CPU intensive tasks, and how this is used to cancel queries. Then we'll review some cases where queries could not be canceled in DataFusion and what the community did to resolve the problem.

Understanding Rust's Async Model¶

DataFusion, somewhat unconventionally, uses the Rust async system and the Tokio task scheduler for CPU intensive processing. To really understand the cancellation problem, you first need to be familiar with Rust's asynchronous programming model, which is a bit different from what you might be used to from other ecosystems. Let's go over the basics as a refresher. If you're familiar with the ins and outs of Future and async you can skip this section.

Futures Are Inert¶

Rust's asynchronous programming model is built around the Future<T> trait. In contrast to, for instance, JavaScript's Promise or Java's Future, a Rust Future does not necessarily represent an actively running asynchronous job. Instead, a Future<T> represents a lazy calculation that only makes progress when explicitly asked to do so. This is done by calling the poll method of a Future. If nobody polls a Future explicitly, it is an inert object.

Calling Future::poll results in one of two options:

Poll::Pending if the evaluation is not yet complete, most often because it needs to wait for something like I/O before it can continue
Poll::Ready<T> when it has completed and produced a value

When a Future returns Pending, it saves its internal state so it can pick up where it left off the next time you poll it. This internal state management makes Rust's Futures memory-efficient and composable. Rather than freezing the full call stack leading to a certain point, only the relevant state to resume the future needs to be retained.

Additionally, a Future must set up the necessary signaling to notify the caller when it should call poll again, to avoid a busy-waiting loop. This is done using a Waker which the Future receives via the Context parameter of the poll function.

Manual implementations of Future are most often little finite state machines. Each state in the process of completing the calculation is modeled as a variant of an enum. Before a Future returns Pending, it bundles the data required to resume in an enum variant, stores that enum variant in itself, and then returns. While compact and efficient, the resulting code is often quite verbose.
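To make the state-machine idea concrete, here is a minimal sketch that uses only the standard library. The names `AddLater`, `NoopWake`, and `poll_to_completion` are hypothetical and exist only for this illustration. The busy polling loop is a stand-in for a real executor, which would wait for the Waker to fire instead of spinning.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical future that adds two numbers but needs two polls to finish.
/// Each enum variant is one state of the state machine.
enum AddLater {
    /// Not polled yet; holds the inputs.
    Start { a: u32, b: u32 },
    /// Polled once; holds only the data needed to resume.
    Waiting { sum: u32 },
    /// The result has already been handed out.
    Done,
}

impl Future for AddLater {
    type Output = u32;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u32> {
        // `AddLater` is `Unpin`, so we can get a plain mutable reference.
        let this = self.get_mut();
        match *this {
            AddLater::Start { a, b } => {
                // Store the resume state in ourselves, then request another poll.
                *this = AddLater::Waiting { sum: a + b };
                cx.waker().wake_by_ref();
                Poll::Pending
            }
            AddLater::Waiting { sum } => {
                *this = AddLater::Done;
                Poll::Ready(sum)
            }
            AddLater::Done => panic!("polled after completion"),
        }
    }
}

/// Waker that does nothing; enough for a loop that polls unconditionally.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Polls `fut` until it is ready and returns the output and the poll count.
fn poll_to_completion<F: Future + Unpin>(mut fut: F) -> (F::Output, usize) {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut polls = 0;
    loop {
        polls += 1;
        if let Poll::Ready(out) = Pin::new(&mut fut).poll(&mut cx) {
            return (out, polls);
        }
    }
}
```

Calling `poll_to_completion(AddLater::Start { a: 2, b: 3 })` walks the future through its states: the first poll moves it from `Start` to `Waiting` and returns Pending, and the second poll produces the sum.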

The async keyword was introduced to make life easier on Rust programmers. It provides elegant syntactic sugar for the manual state machine Future approach. When you write an async function or block, the compiler transforms linear code into a state machine based Future similar to the one described above for you. Since all the state management is compiler generated and hidden from sight, async code tends to be easier to write initially, more readable afterward, while maintaining the same underlying mechanics.

The await keyword complements async, pausing execution until a Future completes. When you .await a Future, you're essentially telling the compiler to generate code that:

Polls the Future with the current (implicit) asynchronous context
If poll returns Poll::Pending, saves the state of the Future so that it can resume at this point and returns Poll::Pending
If it returns Poll::Ready(value), continues execution with that value
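The steps above can be observed by polling an `async fn` by hand. The following is a sketch under stated assumptions: `YieldOnce`, `add_slowly`, `NoopWake`, and `block_on_counting` are hypothetical names, and the spinning loop only approximates what an executor does.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical leaf future: Pending on the first poll, Ready on the second.
struct YieldOnce {
    yielded: bool,
}

impl Future for YieldOnce {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.yielded {
            Poll::Ready(())
        } else {
            self.yielded = true;
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

/// Each `.await` is a point where the compiler-generated state machine
/// may return `Poll::Pending` and later resume.
async fn add_slowly(a: u32, b: u32) -> u32 {
    YieldOnce { yielded: false }.await; // first suspension point
    let sum = a + b;
    YieldOnce { yielded: false }.await; // second suspension point
    sum
}

/// Waker that does nothing; enough for a loop that polls unconditionally.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Polls `fut` until it is ready and returns the output and the poll count.
fn block_on_counting<F: Future>(fut: F) -> (F::Output, usize) {
    // Futures produced by `async fn` are not `Unpin`, so pin them on the heap.
    let mut fut = Box::pin(fut);
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut polls = 0;
    loop {
        polls += 1;
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return (out, polls);
        }
    }
}
```

With two suspension points, `block_on_counting(add_slowly(2, 3))` needs three polls: two that stop at an `.await` and one that runs to the end.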

From Futures to Streams¶

The futures crate extends the Future model with a trait named Stream. Stream<Item = T> represents a sequence of values that are each produced asynchronously rather than just a single value. It's the asynchronous equivalent of Iterator<Item = T>.

The Stream trait has one method named poll_next that returns:

Poll::Pending when the next value isn't ready yet, just like a Future would
Poll::Ready(Some(value)) when a new value is available
Poll::Ready(None) when the stream is exhausted

Under the hood, an implementation of Stream is very similar to a Future. Typically, they're also implemented as state machines, the main difference being that they produce multiple values rather than just one. Just like Future, a Stream is inert unless explicitly polled.
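As a rough illustration of the three possible results of poll_next, the sketch below defines a local `Stream` trait with the same shape as the one in the futures crate, so that it compiles with the standard library alone. `SlowRange`, `NoopWake`, and `collect_all` are hypothetical names for this example.

```rust
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Local stand-in with the same shape as `futures::Stream`.
trait Stream {
    type Item;
    fn poll_next(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>>;
}

/// Hypothetical stream yielding `next..end`. It returns Pending before
/// every item to mimic waiting for I/O.
struct SlowRange {
    next: u32,
    end: u32,
    ready: bool,
}

impl Stream for SlowRange {
    type Item = u32;

    fn poll_next(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<u32>> {
        if self.next == self.end {
            return Poll::Ready(None); // stream exhausted
        }
        if !self.ready {
            self.ready = true;
            cx.waker().wake_by_ref();
            return Poll::Pending; // next value not ready yet
        }
        self.ready = false;
        let item = self.next;
        self.next += 1;
        Poll::Ready(Some(item)) // a new value is available
    }
}

/// Waker that does nothing; enough for a loop that polls unconditionally.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Drains the stream, returning the items and how often it was Pending.
fn collect_all<S: Stream + Unpin>(mut stream: S) -> (Vec<S::Item>, usize) {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut items = Vec::new();
    let mut pending = 0;
    loop {
        match Pin::new(&mut stream).poll_next(&mut cx) {
            Poll::Pending => pending += 1,
            Poll::Ready(Some(item)) => items.push(item),
            Poll::Ready(None) => return (items, pending),
        }
    }
}
```

Nothing happens until `collect_all` starts polling, which is the inertness described above.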

Now that we understand the basics of Rust's async model, let's see how DataFusion leverages these concepts to execute queries.

How DataFusion Executes Queries¶

In DataFusion, the short version of how queries are executed is as follows (you can find more in-depth coverage of this in the DataFusion documentation):

First the query is compiled into a tree of ExecutionPlan nodes
ExecutionPlan::execute is called on the root of the tree.
This method returns a SendableRecordBatchStream (a pinned Box<dyn Stream<RecordBatch>>)
Stream::poll_next is called in a loop to get the results

In other words, the execution of a DataFusion query boils down to polling an asynchronous stream. Like all Stream implementations, we need to explicitly poll the stream for the query to make progress.

The Stream returned by ExecutionPlan::execute is actually the root of a tree of Streams that mostly mirrors the execution plan tree. Each stream tree node processes the record batches it gets from its children. The leaves of the tree produce record batches themselves.

Query execution progresses each time you call poll_next on the root stream. This call typically cascades down the tree, with each node calling poll_next on its children to get the data it needs to process.

Here's where the first signs of problems start to show up: some operations (like aggregations, sorts, or certain join phases) need to process a lot of data before producing any output. When poll_next encounters one of these operations, it might require substantial work before it can return a record batch.

Tokio and Cooperative Scheduling¶

We need to make a small detour now via Tokio's scheduler before we can get to the query cancellation problem. DataFusion makes use of the Tokio asynchronous runtime, which uses a cooperative scheduling model. This is fundamentally different from preemptive scheduling that you might be used to:

In preemptive scheduling, the system can interrupt a task at any time to run something else
In cooperative scheduling, tasks must voluntarily yield control back to the scheduler

This distinction is crucial for understanding our cancellation problem.

A task in Tokio is modeled as a Future which is passed to one of the task initiation functions like spawn. Tokio runs the task by calling Future::poll in a loop until it returns Poll::Ready. While that Future::poll call is running, Tokio has no way to forcibly interrupt it. The task must cooperate by periodically yielding control, either by returning Poll::Pending or Poll::Ready.

Similarly, when you try to abort a task by calling JoinHandle::abort(), the Tokio runtime can't immediately force it to stop. You're just telling Tokio: "When this task next yields control, don't call Future::poll anymore." If the task never yields, it can't be aborted.

The Cancellation Problem¶

With all the necessary background in place, now let's look at how the DataFusion CLI tries to run and cancel a query. The code below is a simplified version of what the CLI actually does:

fn exec_query() {
    let runtime: tokio::runtime::Runtime = ...;
    let stream: SendableRecordBatchStream = ...;

    runtime.block_on(async {
        tokio::select! {
            next_batch = stream.next() => ...
            _ = signal::ctrl_c() => ...,
        }
    })
}

First the CLI sets up a Tokio runtime instance. It then reads the query to execute from standard input or file and turns it into a Stream. Then it calls next on stream which is an async wrapper for poll_next. It passes this to the select! macro along with a ctrl-C handler.

The select! macro races these two Futures and completes when either one finishes. The intent is that when you press Ctrl+C, the signal::ctrl_c() Future should complete. The stream is cancelled when it is dropped as it is inert by itself and nothing will be able to call poll_next again.

But there's a catch: select! still follows cooperative scheduling rules. It polls each Future in sequence, and if the first one (our query) gets stuck in a long computation, it never gets around to polling the cancellation signal.

Imagine a query that needs to calculate something intensive, like sorting billions of rows. Unless the sorting Stream is written with care (which the one in DataFusion is), the poll_next call may take several minutes or even longer without returning. During this time, Tokio can't check if you've pressed Ctrl+C, and the query continues running despite your cancellation request.
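The effect can be reproduced with a toy executor that, like a cooperative scheduler, can only notice a cancellation request between two polls. This is a sketch and not how Tokio is implemented. `Work`, `NoopWake`, and `run_with_cancel` are hypothetical names, and `cancel_after_polls` models a Ctrl+C that arrives while the first poll is running.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical CPU-bound task: performs `total` units of work, doing at
/// most `per_poll` units each time it is polled.
struct Work {
    done: u64,
    total: u64,
    per_poll: u64,
}

impl Future for Work {
    type Output = u64;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u64> {
        let target = (self.done + self.per_poll).min(self.total);
        while self.done < target {
            self.done += 1; // stands in for real computation
        }
        if self.done == self.total {
            Poll::Ready(self.done)
        } else {
            cx.waker().wake_by_ref();
            Poll::Pending // yield control back to the executor
        }
    }
}

/// Waker that does nothing; enough for a loop that polls unconditionally.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Toy executor that checks for cancellation only between polls.
/// Returns the units of work completed when it stopped polling.
fn run_with_cancel(mut task: Work, cancel_after_polls: usize) -> u64 {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut polls = 0;
    loop {
        if polls >= cancel_after_polls {
            return task.done; // cancellation observed: stop polling
        }
        polls += 1;
        if let Poll::Ready(done) = Pin::new(&mut task).poll(&mut cx) {
            return done;
        }
    }
}
```

A task with `per_poll: 1_000` is stopped after its first slice of work, while a task whose `per_poll` equals `total` finishes everything before the executor gets a chance to look at the cancellation request.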

A Closer Look at Blocking Operators¶

Let's peel back a layer of the onion and look at what's happening in a blocking poll_next implementation. Here's a drastically simplified version of a COUNT(*) aggregation - something you might use in a query like SELECT COUNT(*) FROM table:

struct BlockingStream {
    // the input: an inner stream that is wrapped
    stream: SendableRecordBatchStream,
    count: usize,
    finished: bool,
}

impl Stream for BlockingStream {
    type Item = Result<RecordBatch>;
    fn poll_next(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
        if self.finished {
            // return None if we're finished
            return Poll::Ready(None);
        }

        loop {
            // poll the input stream to get the next batch if ready
            match ready!(self.stream.poll_next_unpin(cx)) {
                // increment the counter if we got a batch
                Some(Ok(batch)) => self.count += batch.num_rows(),
                // on end-of-stream, create a record batch for the counter
                None => {
                    self.finished = true;
                    return Poll::Ready(Some(Ok(create_record_batch(self.count))));
                }
                // pass on any errors verbatim
                Some(Err(e)) => return Poll::Ready(Some(Err(e))),
            }
        }
    }
}

How does this code work? Let's break it down step by step:

1. Initial check: We first check if we've already finished processing. If so, we return Ready(None) to signal the end of our stream:

if self.finished {
    return Poll::Ready(None);
}

2. Processing loop: If we're not done yet, we enter a loop to process incoming batches from our input stream:

loop {
    match ready!(self.stream.poll_next_unpin(cx)) {
        // Handle different cases...
    }
}

The ready! macro checks if the input stream returned Pending and if so, immediately returns Pending from our function as well.

3. Processing data: For each batch we receive, we simply add its row count to our running total:

Some(Ok(batch)) => self.count += batch.num_rows(),

4. End of input: When the child stream is exhausted (returns None), we calculate our final result and convert it into a record batch (omitted for brevity):

None => {
    self.finished = true;
    return Poll::Ready(Some(Ok(create_record_batch(self.count))));
}

5. Error handling: If we encounter an error, we pass it along immediately:

Some(Err(e)) => return Poll::Ready(Some(Err(e))),

This code looks perfectly reasonable at first glance. But there's a subtle issue lurking here: what happens if the input stream always returns Ready and never returns Pending?

In that case, the processing loop will keep running without returning Poll::Pending and thus never yield control back to Tokio's scheduler. This means we could be stuck in a single poll_next call for quite some time - exactly the scenario that prevents query cancellation from working!

So how do we solve this problem? Let's explore some strategies to ensure our operators yield control periodically.

Unblocking Operators¶

Now let's look at how we can ensure we return Pending every now and then.

Independent Cooperative Operators¶

One simple way to return Pending is to use a loop counter. We do the exact same thing as before, but each time we produce a batch we increment a counter. Once the counter reaches 128, we reset it and return Pending. The following example ensures we produce at most 128 batches before yielding.

struct CountingSourceStream {
    counter: usize,
}

impl Stream for CountingSourceStream {
    type Item = Result<RecordBatch>;

    fn poll_next(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
        if self.counter >= 128 {
            self.counter = 0;
            cx.waker().wake_by_ref();
            return Poll::Pending;
        }

        self.counter += 1;
        let batch = ...;
        Poll::Ready(Some(Ok(batch)))
    }
}

If CountingSourceStream were the input for the BlockingStream example above, the BlockingStream would receive a Pending periodically, causing it to yield too. Can we really solve the cancel problem simply by periodically yielding in source streams?

Unfortunately, no. Let's look at what happens when we start combining operators in more complex configurations. Suppose we create a plan like this.

A plan that merges two branches by alternating between them.

Each CountingSource produces a Pending every 128 batches. The Filter is a stream that drops a batch every 50 record batches. Merge is a simple combining operator that uses futures::stream::select to combine two streams.

When we set this stream in motion, the merge operator will poll the left and right branch in a round-robin fashion. The sources will each emit Pending every 128 batches, but since the Filter drops batches, they arrive out-of-phase at the merge operator. As a consequence the merge operator will always have the opportunity of polling the other stream when one returns Pending. The Merge stream thus is an always ready stream, even though the sources are yielding. If we use Merge as the input to our aggregating operator we're right back where we started.

Coordinated Cooperation¶

Wouldn't it be great if we could get all the operators to coordinate amongst each other? When one of them determines that it's time to yield, all the other operators agree and start returning Pending as well. That way our task would be coaxed towards yielding even if it tried to poll many different operators.

Luckily(?), the developers of Tokio ran into the exact same problem described above when network servers were under heavy load and came up with a solution. Back in 2020, Tokio 0.2.14 introduced a per-task operation budget. Rather than having individual counters littered throughout the code, the Tokio runtime itself manages a per-task counter which is decremented by Tokio resources. When the counter hits zero, all resources start returning Pending. The task will then yield, after which the Tokio runtime resets the counter.

To illustrate what this process looks like, let's have a look at the execution of the following query Stream tree when polled in a Tokio task.

Query plan for aggregating a sorted stream from two sources. Each source reads a stream of `RecordBatch`es, which are then merged into a single Stream by the `MergeStream` operator which is then aggregated by the `AggregateExec` operator. Arrows represent the data flow direction

If we assume a task budget of 1 unit, each time Tokio schedules the task, the following sequence of function calls results.

Tokio task budget system, assuming the task budget is set to 1, for the plan above.

The aggregation stream would try to poll the merge stream in a loop. The first iteration of the loop consumes the single unit of budget, and returns Ready. The second iteration polls the merge stream again which now tries to poll the second scan stream. Since there is no budget remaining Pending is returned. The merge stream may now try to poll the first source stream again, but since the budget is still depleted Pending is returned as well. The merge stream now has no other option than to return Pending itself as well, causing the aggregation to break out of its loop. The Pending result bubbles all the way up to the Tokio runtime, at which point the runtime regains control. When the runtime reschedules the task, it resets the budget and calls poll on the task Future again for another round of progress.

The key mechanism that makes this work well is the single task budget that's shared amongst all the scan streams. Once the budget is depleted, no streams can make any further progress without first returning control to Tokio. This causes all possible avenues the task has to make progress to return Pending, which results in the task being nudged towards yielding control.
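The shared-budget idea can be modeled in a few lines without Tokio. The sketch below is a simplification under stated assumptions: it uses plain methods instead of real Streams, and `Budget`, `Source`, `Merge`, and `run_task` are hypothetical names that do not correspond to Tokio or DataFusion types.

```rust
use std::cell::Cell;
use std::task::Poll;

/// Toy model of a per-task budget shared by every resource in the task.
struct Budget(Cell<u32>);

impl Budget {
    /// Consumes one unit, or reports Pending when the budget is depleted.
    fn poll_proceed(&self) -> Poll<()> {
        match self.0.get() {
            0 => Poll::Pending,
            n => {
                self.0.set(n - 1);
                Poll::Ready(())
            }
        }
    }
}

/// Source that is always ready unless the shared budget is depleted.
struct Source {
    remaining: u32,
}

impl Source {
    fn poll_next(&mut self, budget: &Budget) -> Poll<Option<u32>> {
        if self.remaining == 0 {
            return Poll::Ready(None);
        }
        if budget.poll_proceed().is_pending() {
            return Poll::Pending;
        }
        self.remaining -= 1;
        Poll::Ready(Some(1))
    }
}

/// Combines two sources, alternating which one is polled first.
struct Merge {
    left: Source,
    right: Source,
    left_first: bool,
}

impl Merge {
    fn poll_next(&mut self, budget: &Budget) -> Poll<Option<u32>> {
        self.left_first = !self.left_first;
        let (first, second) = if self.left_first {
            (&mut self.left, &mut self.right)
        } else {
            (&mut self.right, &mut self.left)
        };
        let mut pending = false;
        for source in [first, second] {
            match source.poll_next(budget) {
                Poll::Ready(Some(value)) => return Poll::Ready(Some(value)),
                Poll::Ready(None) => {}
                Poll::Pending => pending = true,
            }
        }
        if pending { Poll::Pending } else { Poll::Ready(None) }
    }
}

/// Sums the merged values. Returns the total and how often the task yielded.
fn run_task(mut merge: Merge, budget_per_schedule: u32) -> (u32, u32) {
    let budget = Budget(Cell::new(0));
    let (mut total, mut yields) = (0, 0);
    loop {
        // The "runtime" resets the budget every time it schedules the task.
        budget.0.set(budget_per_schedule);
        // The "aggregation" keeps polling until its input returns Pending.
        loop {
            match merge.poll_next(&budget) {
                Poll::Ready(Some(value)) => total += value,
                Poll::Ready(None) => return (total, yields),
                Poll::Pending => {
                    yields += 1;
                    break;
                }
            }
        }
    }
}
```

Because both sources draw from the same budget, the merge operator cannot dodge a Pending by switching to its other input: once the budget is gone, both inputs are Pending and the aggregation loop is forced to yield.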

As it turns out DataFusion was already using this mechanism implicitly. Every exchange-like operator (such as RepartitionExec) internally makes use of a Tokio multiple producer, single consumer Channel. When calling Receiver::recv for one of these channels, a unit of Tokio task budget is consumed. As a consequence, query plans that made use of exchange-like operators were already mostly cancelable. The plan cancellation bug only showed up when running parts of plans without such operators, such as when using a single core.

Now let's see how we can explicitly implement this budget-based approach in our own operators.

Depleting The Tokio Budget¶

Let's revisit our earlier source stream example and adapt it to use Tokio's budget system instead of a hand-rolled counter.

The examples given here make use of functions from the Tokio coop module that are still internal at the time of writing. PR #7405 on the Tokio project will make these accessible for external use. The current DataFusion code emulates these functions as well as possible using has_budget_remaining and consume_budget.

struct BudgetSourceStream {
}

impl Stream for BudgetSourceStream {
    type Item = Result<RecordBatch>;

    fn poll_next(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
        let coop = ready!(tokio::task::coop::poll_proceed(cx));
        let batch: Poll<Option<Self::Item>> = ...;
        if batch.is_ready() {
            coop.made_progress();
        }
        batch
    }
}

The Stream now goes through the following steps:

1. Try to consume budget: the first thing the operator does is use poll_proceed to try to consume a unit of budget. If the budget is depleted, this function will return Pending. Otherwise, we consumed one budget unit and we can continue.

let coop = ready!(tokio::task::coop::poll_proceed(cx));

2. Try to do some work: next we try to produce a record batch. That might not be possible if we're reading from some asynchronous resource that's not ready.

let batch: Poll<Option<Self::Item>> = ...;

3. Commit the budget consumption: finally, if we did produce a batch, we need to tell Tokio that we were able to make progress.

That's done by calling the made_progress method on the value poll_proceed returned.

if batch.is_ready() {
   coop.made_progress();
}

You might be wondering why the call to made_progress is necessary. This clever construct makes it easier to manage the budget. The value returned by poll_proceed will actually restore the budget to its original value when it is dropped unless made_progress is called. This ensures that if we exit early from our poll_next implementation by returning Pending, the budget we had consumed becomes available again. The task that invoked poll_next can then use that budget again to try to make some other Stream (or any resource for that matter) make progress.
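The restore-on-drop behavior is an instance of the RAII guard pattern. The sketch below models only the idea and is not Tokio's implementation: `Budget`, `RestoreOnPending`, `proceed`, and `try_step` are hypothetical names, and `Option` stands in for `Poll`.

```rust
use std::cell::Cell;

/// Toy task budget (an assumption for this sketch, not Tokio's type).
struct Budget(Cell<u32>);

/// Guard handed out when a unit is tentatively consumed. Dropping it gives
/// the unit back unless `made_progress` was called first.
struct RestoreOnPending<'a> {
    budget: &'a Budget,
    restore_to: Option<u32>,
}

impl Budget {
    /// Tentatively consumes one unit; `None` models `Poll::Pending`.
    fn proceed(&self) -> Option<RestoreOnPending<'_>> {
        let before = self.0.get();
        if before == 0 {
            return None;
        }
        self.0.set(before - 1);
        Some(RestoreOnPending { budget: self, restore_to: Some(before) })
    }
}

impl RestoreOnPending<'_> {
    /// Commits the consumption, so that dropping the guard restores nothing.
    fn made_progress(mut self) {
        self.restore_to = None;
    }
}

impl Drop for RestoreOnPending<'_> {
    fn drop(&mut self) {
        if let Some(value) = self.restore_to {
            self.budget.0.set(value);
        }
    }
}

/// Hypothetical operator step: consume budget, then try to produce a value.
fn try_step(budget: &Budget, input_ready: bool) -> Option<u32> {
    let coop = budget.proceed()?;
    if !input_ready {
        // Early exit: `coop` is dropped here and the unit is restored.
        return None;
    }
    coop.made_progress();
    Some(42)
}
```

The guard keeps every early return correct without explicit bookkeeping: a step that could not make progress leaves the budget untouched, and only a step that produced a value actually pays for it.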

Automatic Cooperation For All Operators¶

DataFusion 49.0.0 integrates the Tokio task budget based fix in all built-in source operators. This ensures that going forward, most queries will automatically be cancelable. See the PR for more details.

The design includes:

A new ExecutionPlan property that indicates if an operator participates in cooperative scheduling or not.
A new EnsureCooperative optimizer rule to inspect query plans and insert CooperativeExec nodes as needed to ensure custom source operators also participate.

These two changes combined already make it very unlikely you'll encounter any query that refuses to stop, even with custom operators. For those situations where the automatic mechanisms are still not sufficient, there's a new datafusion::physical_plan::coop module with utility functions that make it easy to adopt cooperative scheduling in your custom operators as well.

Acknowledgments¶

Thank you to Datadobi for sponsoring the development of this feature and to the DataFusion community contributors including Qi Zhu and Mehmet Ozan Kabak.

About DataFusion¶

Optimizing SQL (and DataFrames) in DataFusion, Part 1: Query Optimization Overview

2025-06-15T00:00:00+00:00

Note: this blog was originally published on the InfluxData blog

Introduction¶

Sometimes Query Optimizers are seen as a sort of black magic, “the most challenging problem in computer science,” according to Father Pavlo, or some behind-the-scenes player. We believe this perception is because:

One must implement the rest of a database system (data storage, transactions, SQL parser, expression evaluation, plan execution, etc.) before the optimizer becomes critical⁵.
Some parts of the optimizer are tightly tied to the rest of the system (e.g., storage or indexes), so many classic optimizers are described with system-specific terminology.
Some optimizer tasks, such as access path selection and join order are known challenges and not yet solved (practically)—maybe they really do require black magic 🤔.

However, Query Optimizers are no more complicated in theory or practice than other parts of a database system, as we will argue in a series of posts:

Part 1 (this post):

Review what a Query Optimizer is, what it does, and why you need one for SQL and DataFrames.
Describe how industrial Query Optimizers are structured and standard optimization classes.

Part 2:

Describe the optimization categories with examples and pointers to implementations.
Describe Apache DataFusion’s rationale and approach to query optimization, specifically for access path and join ordering.

After reading these blogs, we hope people will use DataFusion to:

Build their own system specific optimizers.
Perform practical academic research on optimization (especially researchers working on new optimizations / join ordering—looking at you CMU 15-799, next year).

Query Optimizer Background¶

The key pitch for querying databases, and likely the key to the longevity of SQL (despite people’s love/hate relationship; see SQL or Death? Seminar Series – Spring 2025), is that it disconnects the WHAT you want to compute from the HOW to do it. SQL is a declarative language: it describes what answers are desired. This contrasts with an imperative language such as Python, where you describe how to do the computation. The resulting flow is shown in Figure 1.

Figure 1: Query Execution: Users describe the answer they want using either SQL or a DataFrame. For SQL, a Query Planner translates the parsed query into an initial plan. The DataFrame API creates an initial plan directly. The initial plan is correct, but slow. Then, the Query Optimizer rewrites the initial plan into an optimized plan, which computes the same results but faster and more efficiently. Finally, the Execution Engine executes the optimized plan producing results.

SQL, DataFrames, LogicalPlan Equivalence¶

Given their name, it is not surprising that Query Optimizers can improve the performance of SQL queries. However, it is under-appreciated that this also applies to DataFrame style APIs.

Classic DataFrame systems such as pandas and Polars (by default) execute eagerly and thus have limited opportunities for optimization. However, more modern APIs such as Polars' lazy API, Apache Spark's DataFrame, and DataFusion's DataFrame are much faster as they use the design shown in Figure 1 and apply many query optimization techniques.

Example of Query Optimizer¶

This section motivates the value of a Query Optimizer with an example. Let’s say you have some observations of animal behavior, as illustrated in Table 1.

Table 1: Example observational data.

If a user wants to know the average population for some species in the last month, they can write a SQL query or a DataFrame such as the following:

SQL:

SELECT location, AVG(population)
FROM observations
WHERE species = 'contrarian spider' AND
  observation_time >= now() - interval '1 month'
GROUP BY location

DataFrame:

df.scan("observations")
  .filter(col("species").eq("contrarian spider"))
  .filter(col("observation_time").ge(now().sub(interval('1 month'))))
  .agg(vec![col("location")], vec![avg(col("population"))])

Within DataFusion, both the SQL and DataFrame are translated into the same LogicalPlan, a “tree of relational operators.” This is a fancy way of saying data flow graphs where the edges represent tabular data (rows + columns) and the nodes represent a transformation (see this DataFusion overview video for more details). The initial LogicalPlan for the queries above is shown in Figure 2.

Figure 2: Example initial LogicalPlan for SQL and DataFrame query. The plan is read from bottom to top, computing the results in each step.

The optimizer's job is to take this query plan and rewrite it into an alternate plan that computes the same results but faster, such as the one shown in Figure 3.

Figure 3: An example optimized plan that computes the same result as the plan in Figure 2 more efficiently. The diagram highlights where the optimizer has applied Projection Pushdown, Filter Pushdown, and Constant Evaluation. Note that this is a simplified example for explanatory purposes, and actual optimizers such as the one in DataFusion perform additional tasks such as choosing specific aggregation algorithms.

Query Optimizer Implementation¶

Industrial optimizers, such as DataFusion’s (source), ClickHouse (source, source), DuckDB (source), and Apache Spark (source), are implemented as a series of passes or rules that rewrite a query plan. The overall optimizer is composed of a sequence of these rules,⁶ as shown in Figure 4. The specific order of the rules also often matters, but we will not discuss this detail in this post.

A multi-pass design is standard because it helps:

Understand, implement, and test each pass in isolation
Easily extend the optimizer by adding new passes

Figure 4: Query Optimizers are implemented as a series of rules that each rewrite the query plan. Each rule’s algorithm is expressed as a transformation of a previous plan.
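To make the rule-pipeline idea concrete, here is a minimal sketch in plain Rust. It does not use DataFusion's actual optimizer API: the `Plan` enum, the `Rule` trait, and the two rules (`RemoveTrivialFilter`, `MergeLimits`) are hypothetical names invented for illustration. Each rule rewrites a plan into an equivalent plan, and the optimizer simply applies the rules in sequence.

```rust
/// A toy logical plan with just enough structure to show rule composition.
/// (Hypothetical type, not DataFusion's `LogicalPlan`.)
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan { table: String },
    Filter { predicate: String, input: Box<Plan> },
    Limit { n: usize, input: Box<Plan> },
}

/// One optimizer pass: takes a plan and returns an equivalent plan.
trait Rule {
    fn rewrite(&self, plan: Plan) -> Plan;
}

/// Hypothetical rule: drop filters whose predicate is the literal `true`.
struct RemoveTrivialFilter;

impl Rule for RemoveTrivialFilter {
    fn rewrite(&self, plan: Plan) -> Plan {
        match plan {
            Plan::Filter { predicate, input } if predicate == "true" => self.rewrite(*input),
            Plan::Filter { predicate, input } => Plan::Filter {
                predicate,
                input: Box::new(self.rewrite(*input)),
            },
            Plan::Limit { n, input } => Plan::Limit {
                n,
                input: Box::new(self.rewrite(*input)),
            },
            scan => scan,
        }
    }
}

/// Hypothetical rule: collapse directly nested limits into the smaller one.
struct MergeLimits;

impl Rule for MergeLimits {
    fn rewrite(&self, plan: Plan) -> Plan {
        match plan {
            Plan::Limit { n, input } => match self.rewrite(*input) {
                Plan::Limit { n: inner, input } => Plan::Limit { n: n.min(inner), input },
                other => Plan::Limit { n, input: Box::new(other) },
            },
            Plan::Filter { predicate, input } => Plan::Filter {
                predicate,
                input: Box::new(self.rewrite(*input)),
            },
            scan => scan,
        }
    }
}

/// The optimizer is just the rules applied one after another.
fn optimize(rules: &[Box<dyn Rule>], mut plan: Plan) -> Plan {
    for rule in rules {
        plan = rule.rewrite(plan);
    }
    plan
}
```

In this toy, running `RemoveTrivialFilter` before `MergeLimits` on `Limit 10 → Filter true → Limit 5 → Scan` yields `Limit 5 → Scan`, while the opposite order leaves two nested limits, which is a small illustration of why rule order can matter.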

There are three major classes of optimizations in industrial optimizers:

Always Optimizations: These are always good to do and thus are always applied. This class of optimization includes expression simplification, predicate pushdown, and limit pushdown. These optimizations are typically simple in theory, though they require nontrivial amounts of code and tests to implement in practice.
Engine Specific Optimizations: These optimizations take advantage of specific engine features, such as how expressions are evaluated or what particular hash or join implementations are available.
Access Path and Join Order Selection: These passes choose one access method per table and a join order for execution, typically using heuristics and a cost model to make tradeoffs between the options. Databases often have multiple ways to access the data (e.g., index scan or full-table scan), as well as many potential orders to combine (join) multiple tables. These methods compute the same result but can vary drastically in performance.

This brings us to the end of Part 1. In Part 2, we will explain these classes of optimizations in more detail and provide examples of how they are implemented in DataFusion and other systems.

About the Authors¶

Andrew Lamb is a Staff Engineer at InfluxData and an Apache DataFusion PMC member. A Database Optimizer connoisseur, he worked on the Vertica Analytic Database Query Optimizer for six years, has several granted US patents related to query optimization¹, co-authored several papers² about the topic (including in VLDB 2024³), and spent several weeks⁴ deeply geeking out about this topic with other experts (thank you Dagstuhl).

Mustafa Akur is a PhD Student at OHSU Knight Cancer Institute and an Apache DataFusion PMC member. He was previously a Software Developer at Synnada where he contributed significant features to the DataFusion optimizer, including many sort-based optimizations.

Notes¶

^[1] Modular Query Optimizer, US 8,312,027 · Issued Nov 13, 2012, Query Optimizer with schema conversion US 8,086,598 · Issued Dec 27, 2011

^[2] The Vertica Query Optimizer: The case for specialized Query Optimizers

^[3] https://www.vldb.org/pvldb/vol17/p1350-justen.pdf

^[4] https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/24101, https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/22111, https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/12321

^[5] And thus in academic classes, by the time you get around to an optimizer the semester is over and everyone is ready for the semester to be done. Once industrial systems mature to the point where the optimizer is a bottleneck, the shiny new-ness of the hype cycle has worn off and it is likely in the trough of disappointment.

^[6] Often systems will classify these passes into different categories, but I am simplifying here.

Optimizing SQL (and DataFrames) in DataFusion, Part 2: Optimizers in Apache DataFusion

2025-06-15T00:00:00+00:00

Note, this blog was originally published on the InfluxData blog.

In the first part of this post, we discussed what a Query Optimizer is, what role it plays, and described how industrial optimizers are organized. In this second post, we describe various optimizations that are found in Apache DataFusion and other industrial systems in more detail.

DataFusion contains high quality, full-featured implementations for Always Optimizations and Engine Specific Optimizations (defined in Part 1). Optimizers are implemented as rewrites of LogicalPlan in the logical optimizer or rewrites of ExecutionPlan in the physical optimizer. This design means the same optimizer passes are applied for SQL queries, DataFrame queries, as well as plans for other query language frontends such as InfluxQL in InfluxDB 3.0, PromQL in Greptime, and vega in VegaFusion.

Always Optimizations¶

Some optimizations are so important that they are found in almost all query engines and are typically the first implemented, as they provide the largest benefit for their cost (and performance is terrible without them).

Predicate/Filter Pushdown¶

Why: Stop carrying unneeded rows as early as possible.

What: Moves filters “down” in the plan so they run earlier during execution, as shown in Figure 1.

Example Implementations: DataFusion, DuckDB, ClickHouse

The earlier data is filtered out in the plan, the less work the rest of the plan has to do. Most mature databases aggressively use filter pushdown / early filtering combined with techniques such as partition and storage pruning (e.g. Parquet Row Group pruning) for performance.

An extreme, and somewhat contrived, example is the query

SELECT city, COUNT(*) FROM population GROUP BY city HAVING city = 'BOSTON';

Semantically, HAVING is evaluated after GROUP BY in SQL. However, computing the population of all cities and discarding everything except Boston is much slower than only computing the population for Boston and so most Query Optimizers will evaluate the filter before the aggregation.

Figure 1: Filter Pushdown. In (A) without filter pushdown, the operator processes more rows, reducing efficiency. In (B) with filter pushdown, the operator receives fewer rows, resulting in less overall work and leading to a faster and more efficient query.
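The `HAVING city = 'BOSTON'` example can be sketched as a tree rewrite. The following is a simplified illustration in plain Rust, not DataFusion's filter pushdown rule: the `Plan` enum and `push_down_filter` function are hypothetical, and the only case handled is an equality filter on the grouping column, which selects whole groups and can therefore run before the aggregate.

```rust
/// Toy plan nodes (hypothetical, for illustration only).
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan { table: String },
    /// Keep rows where `column = equals`.
    Filter { column: String, equals: String, input: Box<Plan> },
    /// Group by a single column (the aggregate expressions are omitted).
    Aggregate { group_by: String, input: Box<Plan> },
}

/// Move filters below aggregates when the filter only references the group key.
fn push_down_filter(plan: Plan) -> Plan {
    match plan {
        Plan::Filter { column, equals, input } => match push_down_filter(*input) {
            // A filter on the grouping column keeps or drops entire groups,
            // so applying it before the aggregate gives the same result.
            Plan::Aggregate { group_by, input: agg_input } if group_by == column => {
                Plan::Aggregate {
                    group_by,
                    input: Box::new(push_down_filter(Plan::Filter {
                        column,
                        equals,
                        input: agg_input,
                    })),
                }
            }
            // Otherwise leave the filter where it is.
            other => Plan::Filter { column, equals, input: Box::new(other) },
        },
        Plan::Aggregate { group_by, input } => Plan::Aggregate {
            group_by,
            input: Box::new(push_down_filter(*input)),
        },
        scan => scan,
    }
}
```

A real implementation must also reason about predicates that reference aggregate results (which cannot be pushed down) and about many more operator types; this sketch deliberately ignores those cases.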

Projection Pushdown¶

Why: Stop carrying unneeded columns as early as possible.

What: Pushes “projection” (keeping only certain columns) earlier in the plan, as shown in Figure 2.

Example Implementations: DataFusion, DuckDB, ClickHouse

Similarly to the motivation for Filter Pushdown, the earlier the plan stops doing something, the less work it does overall and thus the faster it runs. For Projection Pushdown, if columns are not needed later in a plan, copying the data to the output of other operators is unnecessary and the costs of copying can add up. For example, in Figure 3 of Part 1, the species column is only needed to evaluate the Filter within the scan and notes are never used, so it is unnecessary to copy them through the rest of the plan.

Projection Pushdown is especially effective and important for column store databases, where the storage format itself (such as Apache Parquet) supports efficiently reading only a subset of required columns, and is especially powerful in combination with filter pushdown. Projection Pushdown is still important, but less effective for row oriented formats such as JSON or CSV where each column in each row must be parsed even if it is not used in the plan.

Figure 2: In (A) without projection pushdown, the operator receives more columns, reducing efficiency. In (B) with projection pushdown, the operator receives fewer columns, leading to optimized execution.

Limit Pushdown¶

Why: The earlier the plan stops generating data, the less overall work it does, and some operators have more efficient limited implementations.

What: Pushes limits (maximum row counts) down in a plan as early as possible.

Example Implementations: DataFusion, DuckDB, ClickHouse, Spark (Window and Projection)

Queries often have a LIMIT or other clause that allows them to stop generating results early. The sooner they can stop execution, the more efficiently they will execute.

In addition, DataFusion and other systems have more efficient implementations of some operators that can be used if there is a limit. The classic example is replacing a full sort + limit with a TopK operator that only tracks the top values using a heap. Similarly, DataFusion’s Parquet reader stops fetching and opening additional files once the limit has been hit.

Figure 3: In (A), without limit pushdown, all data is sorted and everything except the first few rows is discarded. In (B), with limit pushdown, Sort is replaced with a TopK operator, which does much less work.
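The heap-based TopK idea can be sketched in a few lines of standard-library Rust. This is an illustration of the general technique rather than DataFusion's TopK operator; the function name `top_k` and the choice of `i64` values are assumptions made for the example. It behaves like `ORDER BY v DESC LIMIT k` while only ever holding `k` values in memory.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Return the `k` largest values in descending order.
fn top_k(values: impl IntoIterator<Item = i64>, k: usize) -> Vec<i64> {
    if k == 0 {
        return Vec::new();
    }
    // Min-heap (via `Reverse`) of the best `k` values seen so far;
    // the smallest of them sits at the top of the heap.
    let mut heap: BinaryHeap<Reverse<i64>> = BinaryHeap::with_capacity(k);
    for v in values {
        if heap.len() < k {
            heap.push(Reverse(v));
        } else if heap.peek().map_or(false, |&Reverse(smallest)| v > smallest) {
            // The new value beats the current k-th best: replace it.
            heap.pop();
            heap.push(Reverse(v));
        }
    }
    // Ascending order of `Reverse<i64>` is descending order of the values.
    heap.into_sorted_vec().into_iter().map(|Reverse(v)| v).collect()
}
```

Compared with sorting the whole input, this does O(n log k) comparisons and needs memory proportional to `k` rather than to the input size.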

Expression Simplification / Constant Folding¶

Why: Evaluating the same expression for each row when the value doesn’t change is wasteful.

What: Partially evaluates and/or algebraically simplifies expressions.

Example Implementations: DataFusion, DuckDB (has several rules such as constant folding, and comparison simplification), Spark

If an expression doesn’t change from row to row, it is better to evaluate the expression once during planning. This is a classic compiler technique and is also used in database systems.

For example, given a query that finds all values from the current year

SELECT … WHERE extract(year from time_column) = extract(year from now())

Evaluating extract(year from now()) on every row is much more expensive than evaluating it once at planning time, so that the query becomes a comparison to a constant:

SELECT … WHERE extract(year from time_column) = 2025

Furthermore, it is often possible to push such predicates into scans.
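As a rough sketch of how constant folding works on an expression tree, the following plain-Rust example folds additions of literals bottom-up. The `Expr` enum and `fold_constants` function are hypothetical and far simpler than DataFusion's expression simplifier; integer addition stands in here for any expression, such as `extract(year from now())`, whose inputs are all known at planning time.

```rust
/// A toy expression tree (hypothetical, for illustration only).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Literal(i64),
    Column(String),
    Add(Box<Expr>, Box<Expr>),
    Eq(Box<Expr>, Box<Expr>),
}

/// Replace sub-expressions whose inputs are all constants with their value.
fn fold_constants(expr: Expr) -> Expr {
    match expr {
        Expr::Add(l, r) => match (fold_constants(*l), fold_constants(*r)) {
            // Both sides are known at planning time: compute the result once.
            // Overflowing additions are left unfolded rather than guessed at.
            (Expr::Literal(a), Expr::Literal(b)) if a.checked_add(b).is_some() => {
                Expr::Literal(a + b)
            }
            (l, r) => Expr::Add(Box::new(l), Box::new(r)),
        },
        Expr::Eq(l, r) => Expr::Eq(
            Box::new(fold_constants(*l)),
            Box::new(fold_constants(*r)),
        ),
        // Literals and column references are already as simple as they get.
        leaf => leaf,
    }
}
```

For instance, `year = 2024 + 1` becomes `year = 2025`, so the per-row work is a single comparison against a constant.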

Rewriting `OUTER JOIN` → `INNER JOIN`¶

Why: INNER JOIN implementations are almost always faster (as they are simpler) than OUTER JOIN implementations, and INNER JOINs impose fewer restrictions on other optimizer passes (such as join reordering and additional filter pushdown).

What: In cases where it is known that NULL rows introduced by an OUTER JOIN will not appear in the results, it can be rewritten to an INNER JOIN.

Example Implementations: DataFusion, Spark, ClickHouse.

For example, given a query such as the following

SELECT …
FROM orders LEFT OUTER JOIN customer ON (orders.cid = customer.id)
WHERE customer.last_name = 'Lamb'

The LEFT OUTER JOIN keeps all rows in orders that don’t have a matching customer, but fills in the fields with null. All such rows will be filtered out by customer.last_name = 'Lamb', and thus an INNER JOIN produces the same answer. This is illustrated in Figure 4.

Figure 4: Rewriting OUTER JOIN to INNER JOIN. In (A) the original query contains an OUTER JOIN but also a filter on customer.last_name, which filters out all rows that might be introduced by the OUTER JOIN. In (B) the OUTER JOIN is converted to an INNER JOIN, so a more efficient implementation can be used.

Engine Specific Optimizations¶

As discussed in Part 1 of this blog, optimizers also contain a set of passes that are still always good to do, but are closely tied to the specifics of the query engine. This section describes some common types.

Subquery Rewrites¶

Why: Actually implementing subqueries by running a query for each row of the outer query is very expensive.

What: It is possible to rewrite subqueries as joins which often perform much better.

Example Implementations: DataFusion (one, two, three), Spark

Evaluating subqueries a row at a time is so expensive that execution engines in high performance analytic systems such as DataFusion and Vertica may not even support row-at-a-time evaluation given how terrible the performance would be. Instead, analytic systems rewrite such queries into joins which can perform 100s or 1000s of times faster for large datasets. However, transforming subqueries to joins requires “exotic” join semantics such as SEMI JOIN, ANTI JOIN and variations on how to treat equality with null.⁷

For a simple example, consider that a query like this:

SELECT customer.name 
FROM customer 
WHERE (SELECT sum(value) 
       FROM orders WHERE
       orders.cid = customer.id) > 10;

Can be rewritten like this:

SELECT customer.name 
FROM customer 
JOIN (
  SELECT orders.cid AS cid_inner, sum(value) s
  FROM orders
  GROUP BY orders.cid
 ) ON (customer.id = cid_inner AND s > 10);

We don’t have space to detail this transformation or why it is so much faster to run, but using this and many other transformations allows efficient subquery evaluation.

Optimized Expression Evaluation¶

Why: The capabilities of expression evaluation vary from system to system.

What: Optimize expression evaluation for the particular execution environment.

Example Implementations: There are many examples of this type of optimization, including DataFusion’s Common Subexpression Elimination, unwrap_cast, and identifying equality join predicates. DuckDB rewrites IN clauses, and SUM expressions. Spark also unwraps casts in binary comparisons, and adds special runtime filters.

To give a specific example of what DataFusion’s common subexpression elimination does, consider this query that refers to a complex expression multiple times:

SELECT date_bin('1 hour', time, '1970-01-01') 
FROM table 
WHERE date_bin('1 hour', time, '1970-01-01') >= '2025-01-01 00:00:00'
ORDER BY date_bin('1 hour', time, '1970-01-01')

Evaluating date_bin('1 hour', time, '1970-01-01') each time it is encountered is inefficient compared to calculating its result once and reusing that result when it is encountered again (similar to caching). This reuse is called Common Subexpression Elimination.

Some execution engines implement this optimization internally to their expression evaluation engine, but DataFusion represents it explicitly using a separate Projection plan node, as illustrated in Figure 5. Effectively, the query above is rewritten to the following

SELECT time_chunk 
FROM (SELECT date_bin('1 hour', time, '1970-01-01') as time_chunk 
      FROM table)
WHERE time_chunk >= '2025-01-01 00:00:00'
ORDER BY time_chunk

Figure 5: Adding a Projection to evaluate common complex sub expression decreases complexity for later stages.

Algorithm Selection¶

Why: Different engines have different specialized operators for certain operations.

What: Selects specific implementations from the available operators, based on properties of the query.

Example Implementations: DataFusion’s EnforceSorting pass uses sort optimized implementations, Spark’s rewrite to use a special operator for ASOF joins, and ClickHouse’s join algorithm selection such as when to use MergeJoin.

For example, DataFusion uses a TopK (source) operator rather than a full Sort if there is also a limit on the query. Similarly, when the data is suitably sorted, it may choose the more efficient PartialOrdered grouping operation (when the data is sorted on the group keys) or a MergeJoin.

Figure 6: An example of specialized operation for grouping. In (A), input data has no specified ordering and DataFusion uses a hashing-based grouping operator (source) to determine distinct groups. In (B), when the input data is ordered by the group keys, DataFusion uses a specialized grouping operator (source) to find boundaries that separate groups.
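The difference between the two grouping strategies in Figure 6 can be sketched with the standard library alone. Both functions below are hypothetical illustrations (not DataFusion's operators) that count rows per group: the hash-based version works on any input order but must keep every group in a hash table, while the sorted version only compares each key with the current group and can finish a group as soon as the key changes.

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Hash-based grouping: works for any input order.
fn count_hash_groups<K: Eq + Hash + Clone>(keys: &[K]) -> HashMap<K, usize> {
    let mut counts = HashMap::new();
    for key in keys {
        *counts.entry(key.clone()).or_insert(0) += 1;
    }
    counts
}

/// Ordered grouping: assumes `keys` is already sorted by the group key,
/// so a new group starts exactly where the key value changes.
fn count_sorted_groups<K: PartialEq + Clone>(keys: &[K]) -> Vec<(K, usize)> {
    let mut out: Vec<(K, usize)> = Vec::new();
    for key in keys {
        if let Some((current, count)) = out.last_mut() {
            if *current == *key {
                *count += 1;
                continue;
            }
        }
        // Key changed (or first row): the previous group is complete.
        out.push((key.clone(), 1));
    }
    out
}
```

The sorted variant needs no hashing and only constant state per in-progress group, which is why an optimizer that knows the input ordering may prefer it.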

Using Statistics Directly¶

Why: Using pre-computed statistics from a table, without actually reading or opening files, is much faster than processing data.

What: Replace calculations on data with the value from statistics.

Example Implementations: DataFusion, DuckDB

Some queries, such as the classic COUNT(*) from my_table used for data exploration can be answered using only statistics. Optimizers often have access to statistics for other reasons (such as Access Path and Join Order Selection) and statistics are commonly stored in analytic file formats. For example, the Metadata of Apache Parquet files stores MIN, MAX, and COUNT information.

Figure 7: When the aggregation result is already stored in the statistics, the query can be evaluated using the values from statistics without looking at any compressed data. The optimizer replaces the Aggregation operation with values from statistics.
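A minimal sketch of answering aggregates from statistics, assuming a hypothetical `FileStats` struct that loosely mirrors the kind of row-count and min/max information file metadata can carry (the struct and function names are invented for this example and are not a DataFusion or Parquet API):

```rust
/// Hypothetical per-file statistics for one column.
struct FileStats {
    num_rows: u64,
    /// Maximum value of the column in this file, if the writer recorded it.
    max: Option<i64>,
}

/// `COUNT(*)` can be answered by summing row counts, without reading data.
fn count_from_stats(files: &[FileStats]) -> u64 {
    files.iter().map(|f| f.num_rows).sum()
}

/// `MAX(col)` can be answered from statistics only if every non-empty file
/// recorded a max; otherwise return `None` and fall back to scanning.
fn max_from_stats(files: &[FileStats]) -> Option<i64> {
    let mut best: Option<i64> = None;
    for f in files {
        if f.num_rows == 0 {
            continue; // empty files cannot affect the result
        }
        let m = f.max?;
        best = Some(best.map_or(m, |b| b.max(m)));
    }
    best
}
```

The `None` case matters in practice: an optimizer can only substitute statistics when they are known to be exact and present for all of the data.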

Access Path and Join Order Selection¶

Overview¶

Last, but certainly not least, are optimizations that choose between plans with potentially (very) different performance. The major options in this category are

Join Order: In what order to combine tables using JOINs?
Access Paths: Which copy of the data or index should be read to find matching tuples?
Materialized View: Can the query be rewritten to use a materialized view (partially computed query results)? This topic deserves its own blog (or book) and we don’t discuss it further here.

Figure 8: Access Path and Join Order Selection in Query Optimizers. Optimizers use heuristics to enumerate some subset of potential join orders (shape) and access paths (color). The plan with the smallest estimated cost according to some cost model is chosen. In this case, Plan 2 with a cost of 180,000 is chosen for execution as it has the lowest estimated cost.

This class of optimizations is a hard problem for at least the following reasons:

Exponential Search Space: the number of potential plans increases exponentially as the number of joins and indexes increases.
Performance Sensitivity: Often different plans that are very similar in structure perform very differently. For example, swapping the input order to a hash join can result in 1000x or more (yes, a thousand-fold!) run time differences.
Cardinality Estimation Errors: Determining the optimal plan relies on cardinality estimates (e.g., how many rows will come out of each join). It is a known hard problem to estimate this cardinality, and in practice queries with as few as 3 joins often have large cardinality estimation errors.

Heuristics and Cost-Based Optimization¶

Industrial optimizers handle these problems using a combination of

Heuristics: to prune the search space and avoid considering plans that are (almost) never good. Examples include considering left-deep trees, or using Foreign Key / Primary Key relationships to pick the build side of a hash join.
Cost Model: Given the smaller set of candidate plans, the Optimizer then estimates their cost and picks the one with the lowest cost.

For some examples, you can read about Spark’s cost-based optimizer or look at the code for DataFusion’s join selection and DuckDB’s cost model and join order enumeration.
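The "estimate a cost for each candidate, then pick the cheapest" step can be sketched as follows. Everything here is an assumption made for illustration: the `Candidate` struct, the idea that the candidates differ only in which input is the hash join build side, and especially the cost formula, whose weights are invented and not taken from any real system.

```rust
/// A candidate hash join plan, described only by its estimated input sizes.
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    name: &'static str,
    /// Estimated rows on the side used to build the hash table.
    build_rows: u64,
    /// Estimated rows on the side streamed through the join.
    probe_rows: u64,
}

/// Hypothetical cost function: inserting a row into the hash table is
/// assumed to cost three times as much as probing with a row.
fn estimated_cost(c: &Candidate) -> u64 {
    3 * c.build_rows + c.probe_rows
}

/// Pick the candidate with the lowest estimated cost.
fn choose_plan(candidates: &[Candidate]) -> Option<&Candidate> {
    candidates.iter().min_by_key(|c| estimated_cost(c))
}
```

With an estimated 1,000,000 orders and 10,000 customers, building on `customer` costs 1,030,000 under this formula versus 3,010,000 for building on `orders`, so the former is chosen. The result is only as good as the row estimates that feed it, which is the cardinality estimation problem described above.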

However, the use of heuristics and (imprecise) cost models means optimizers must

Make deep assumptions about the execution environment: For example the heuristics often include assumptions that joins implement sideways information passing (RuntimeFilters), or that Join operators always preserve a particular input's order.
Use one particular objective function: There are almost always trade-offs between desirable plan properties, such as execution speed, memory use, and robustness in the face of cardinality estimation. Industrial optimizers typically have one cost function which attempts to balance between the properties or a series of hard to use indirect tuning knobs to control the behavior.
Require statistics: Typically cost models require up-to-date statistics, which can be expensive to compute, must be kept up to date as new data arrives, and often have trouble capturing the non-uniformity of real world datasets.

Join Ordering in DataFusion¶

DataFusion purposely does not include a sophisticated cost based optimizer. Instead, keeping with its design goals it provides a reasonable default implementation along with extension points to customize behavior.

Specifically, DataFusion includes

“Syntactic Optimizer” (joins in the order they are listed in the query⁸) with basic join re-ordering (source) to prevent join disasters.
Support for ColumnStatistics and Table Statistics
The framework for filter selectivity + join cardinality estimation.
APIs for easily rewriting plans, such as the TreeNode API and reordering joins

This combination of features along with custom optimizer passes lets users customize the behavior to their use case, such as custom indexes like uWheel and materialized views.

The rationale for including only a basic optimizer is that any one particular set of heuristics and cost model is unlikely to work well for the wide variety of DataFusion users because of the tradeoffs involved.

For example, some users may always have access to adequate resources, want the fastest query execution, and are willing to tolerate runtime errors or a performance cliff when there is insufficient memory. Other users, however, may be willing to accept a slower maximum performance in return for more predictable performance when running in a resource constrained environment. This approach is not universally agreed upon. One of us has previously argued the case for specialized optimizers in a more academic paper, and the topic comes up regularly in the DataFusion community (e.g. this recent comment).

Note: We are actively improving this part of the code to help people write their own optimizers (🎣 come help us define and implement it!)

Conclusion¶

Optimizers are awesome, and we hope these two posts have demystified what they are and how they are implemented in industrial systems. Like many modern query engine designs, the common techniques are well known, though they require substantial effort to get right. DataFusion’s industrial strength optimizers can and do serve many real world systems well and we expect that number to grow over time.

We also think DataFusion provides interesting opportunities for optimizer research. As we discussed, there are still unsolved problems such as optimal join ordering. Experiments in papers often use academic systems or modify optimizers in tightly integrated open source systems (for example, the recent POLARs paper uses DuckDB). However, using a tightly integrated system constrains the research to the set of heuristics and structure provided by that system. Hopefully DataFusion’s documentation, newly citeable SIGMOD paper, and modular design will encourage more broadly applicable research in this area.

And finally, as always, if you are interested in working on query engines and learning more about how they are designed and implemented, please join our community. We welcome first time contributors as well as long time participants to the fun of building a database together.

Notes¶

^[7] See Unnesting Arbitrary Queries from Neumann and Kemper for a more academic treatment.

^[8] One of my favorite terms I learned from Andy Pavlo’s CMU online lectures

Apache DataFusion Comet 0.8.0 Release

2025-05-06T00:00:00+00:00

The Apache DataFusion PMC is pleased to announce version 0.8.0 of the Comet subproject.

Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for improved performance and efficiency without requiring any code changes.

This release covers approximately six weeks of development work and is the result of merging 81 PRs from 11 contributors. See the change log for more information.

Release Highlights¶

Performance & Stability¶

Up to 4x speedup in jobs using dropDuplicates, thanks to optimizations in the first_value and last_value aggregate functions in DataFusion 47.0.0.
Introduction of a global Tokio runtime, which resolves potential deadlocks in certain multi-task scenarios.

Native Shuffle Improvements¶

Significant enhancements to the native shuffle mechanism include:

Lower memory usage through using interleave_record_batches instead of using array builders.
Support for complex types in shuffle data (note: hash partition expressions still require primitive types).
Reclaimable shuffle files, reducing disk pressure.
Respects spark.local.dir for temporary storage.
Per-task shuffle metrics are now available, providing better visibility into execution behavior.

Experimental Support for DataFusion’s Parquet Scan¶

It is now possible to configure Comet to use DataFusion’s Parquet reader instead of Comet’s current Parquet reader. This has the advantage of supporting complex types, and also has performance optimizations that are not present in Comet's existing reader.

This release continues with the ongoing improvements and bug fixes and supports more use cases, but there are still some known issues:

There are schema coercion bugs for nested types containing INT96 columns, which can cause incorrect results.
There are compatibility issues when reading integer values that are larger than their type annotation, such as the value 1024 being stored in a field annotated as int(8).
A small number of Spark SQL tests remain unsupported (#1545).

To enable DataFusion’s Parquet reader, either set spark.comet.scan.impl=native_datafusion or set the environment variable COMET_PARQUET_SCAN_IMPL=native_datafusion.

Updates to Supported Spark Versions¶

Added support for Spark 3.5.5
Dropped support for Spark 3.3.x

Getting Involved¶

The Comet project welcomes new contributors. We use the same Slack and Discord channels as the main DataFusion project and have a weekly DataFusion video call.

There are also many good first issues waiting for contributions.

User defined Window Functions in DataFusion

2025-04-19T00:00:00+00:00

Window functions are a powerful feature in SQL, allowing for complex analytical computations over a subset of data. However, efficiently implementing them, especially sliding windows, can be quite challenging. With Apache DataFusion's user-defined window functions, developers can easily take advantage of all the effort put into DataFusion's implementation.

In this post, we'll explore:

What window functions are and why they matter
Understanding sliding windows
The challenges of computing window aggregates efficiently
How to implement user-defined window functions in DataFusion

Understanding Window Functions in SQL¶

Imagine you're analyzing sales data and want insights without losing the finer details. This is where window functions come into play. Unlike GROUP BY, which condenses data, window functions let you retain each row while performing calculations over a defined range, like having a moving lens over your dataset.

Picture a business tracking daily sales. They need a running total to understand cumulative revenue trends without collapsing individual transactions. SQL makes this easy:

SELECT id, value, SUM(value) OVER (ORDER BY id) AS running_total
FROM sales;

example:
+------------+--------+-------------------------------+
|   Date     | Sales  | Rows Considered               |
+------------+--------+-------------------------------+
| Jan 01     | 100    | [100]                         |
| Jan 02     | 120    | [100, 120]                    |
| Jan 03     | 130    | [100, 120, 130]               |
| Jan 04     | 150    | [100, 120, 130, 150]          |
| Jan 05     | 160    | [100, 120, 130, 150, 160]     |
| Jan 06     | 180    | [100, 120, 130, 150, 160, 180]|
| Jan 07     | 170    | [100, ..., 170] (7 days)      |
| Jan 08     | 175    | [120, ..., 175]               |
+------------+--------+-------------------------------+

Figure 1: A row-by-row representation of how a 7-day moving average includes the previous 6 days and the current one.

This helps in analytical queries where we need cumulative sums, moving averages, or ranking without losing individual records.

User Defined Window Functions¶

DataFusion's built-in window functions such as first_value, rank and row_number serve many common use cases, but sometimes custom logic is needed, for example:

Calculating moving averages with complex conditions (e.g. exponential averages, integrals, etc)
Implementing a custom ranking strategy
Tracking non-standard cumulative logic

Thus, User-Defined Window Functions (UDWFs) allow developers to define their own behavior while allowing DataFusion to handle the calculations of the windows and grouping specified in the OVER clause.

Understanding Sliding Window¶

Sliding windows define a moving range of data over which aggregations are computed. Unlike simple cumulative functions, these windows are dynamically updated as new data arrives.

For instance, if we want a 7-day moving average of sales:

SELECT date, sales, 
       AVG(sales) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS moving_avg
FROM sales;

Here, each row’s result is computed based on the last 7 days, making it computationally intensive as data grows.
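To see why the naive approach is expensive and how work can be reused between rows, here is a small sketch of a moving average that maintains a running sum instead of re-adding the whole frame for every row. It illustrates the general idea only and is not how DataFusion evaluates `AVG`; the function name `moving_avg` and the use of `f64` are assumptions for the example.

```rust
/// Moving average over the current row and up to `preceding` rows before it,
/// mirroring `ROWS BETWEEN <preceding> PRECEDING AND CURRENT ROW`.
fn moving_avg(values: &[f64], preceding: usize) -> Vec<f64> {
    let mut out = Vec::with_capacity(values.len());
    let mut sum = 0.0;
    for (i, v) in values.iter().enumerate() {
        // Add the row entering the frame...
        sum += *v;
        // ...and subtract the row that just slid out of it, if any.
        if i > preceding {
            sum -= values[i - preceding - 1];
        }
        // Early rows have fewer than `preceding + 1` rows available.
        let frame_len = (i + 1).min(preceding + 1);
        out.push(sum / frame_len as f64);
    }
    out
}
```

Each row costs one addition and at most one subtraction regardless of the frame size, instead of summing all 7 values again. One caveat of this approach with floating point inputs is that repeated add/subtract can accumulate rounding error over long inputs.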

Why Computing Sliding Windows Is Hard¶

Imagine you’re at a café, and the barista is preparing coffee orders. If they made each cup from scratch without using pre-prepared ingredients, the process would be painfully slow. This is exactly the problem with naïve sliding window computations.

Computing sliding windows efficiently is tricky because:

High Computation Costs: Just like making coffee from scratch for each customer, recalculating aggregates for every row is expensive.
Data Shuffling: In large distributed systems, data must often be shuffled between nodes, causing delays—like passing orders between multiple baristas who don’t communicate efficiently.
State Management: Keeping track of past computations is like remembering previous orders without writing them down—error-prone and inefficient.

Many traditional query engines struggle to optimize these computations effectively, leading to sluggish performance.

How DataFusion Evaluates Window Functions Quickly¶

In the world of big data, every millisecond counts. Imagine you’re analyzing stock market data, tracking sensor readings from millions of IoT devices, or crunching through massive customer logs—speed matters. This is where DataFusion shines, making window function computations blazing fast. Let’s break down how it achieves this remarkable performance.

DataFusion implements the battle tested sort-based approach described in this paper which is also used in systems such as PostgreSQL and Vertica. The input is first sorted by both the PARTITION BY and ORDER BY expressions and then the WindowAggExec operator efficiently determines the partition boundaries and creates appropriate PartitionEvaluator instances.

The sort-based approach is well understood, scales to large data sets, and leverages DataFusion's highly optimized sort implementation. DataFusion minimizes resorting by leveraging the sort order tracking and optimizations described in the Using Ordering for Better Plans blog.

For example, given a query such as the following, which computes the starting, ending, and average price for each month:

SELECT 
  FIRST_VALUE(price) OVER (PARTITION BY date_bin('1 month', time) ORDER BY time ASC)  AS start_price, 
  FIRST_VALUE(price) OVER (PARTITION BY date_bin('1 month', time) ORDER BY time DESC) AS end_price,
  AVG(price)         OVER (PARTITION BY date_bin('1 month', time))                    AS avg_price
FROM quotes;

If the input data is not sorted, DataFusion will first sort the data by the date_bin and time and then WindowAggExec computes the partition boundaries and invokes the appropriate PartitionEvaluator API methods depending on the window definition in the OVER clause and the declared capabilities of the function.

For example, evaluating window_func(val) OVER (PARTITION BY col) on the following data:

col | val
--- + ----
 A  | 10
 A  | 10
 C  | 20
 D  | 30
 D  | 30

Will instantiate three PartitionEvaluators, one each for the partitions defined by col=A, col=C, and col=D.

col | val
--- + ----
 A  | 10     <--- partition 1
 A  | 10

col | val
--- + ----
 C  | 20     <--- partition 2

col | val
--- + ----
 D  | 30     <--- partition 3
 D  | 30
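Because the input is sorted on the PARTITION BY column, finding the partitions amounts to finding the positions where the key changes. The sketch below shows that idea with a hypothetical `partition_ranges` helper written against the standard library; it is not the code `WindowAggExec` uses.

```rust
use std::ops::Range;

/// Given a column already sorted by the PARTITION BY key, return the row
/// range covered by each partition.
fn partition_ranges<K: PartialEq>(sorted_keys: &[K]) -> Vec<Range<usize>> {
    let mut ranges = Vec::new();
    let mut start = 0;
    for i in 1..=sorted_keys.len() {
        // A partition ends at the end of the input or where the key changes.
        if i == sorted_keys.len() || sorted_keys[i] != sorted_keys[start] {
            ranges.push(start..i);
            start = i;
        }
    }
    ranges
}
```

For the `col` values `A, A, C, D, D` this yields the ranges `0..2`, `2..3`, and `3..5`, matching the three partitions shown above, each of which would get its own evaluator.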

Creating your own Window Function¶

DataFusion supports user-defined window functions (UDWFs), meaning you can bring your own window function logic using the exact same APIs and performance as the built-in functions.

For example, we will declare a user defined window function that computes a moving average.

use datafusion::arrow::{array::{ArrayRef, Float64Array, AsArray}, datatypes::Float64Type};
use datafusion::logical_expr::{PartitionEvaluator};
use datafusion::common::ScalarValue;
use datafusion::error::Result;
/// This implements the lowest level evaluation for a window function
///
/// It handles calculating the value of the window function for each
/// distinct values of `PARTITION BY`
#[derive(Clone, Debug)]
struct MyPartitionEvaluator {}

impl MyPartitionEvaluator {
    fn new() -> Self {
        Self {}
    }
}

Different evaluation methods are called depending on the various settings of WindowUDF and the query. In the first example, we use the simplest and most general, evaluate function. We will see how to use PartitionEvaluator for the other more advanced uses later in the article.

impl PartitionEvaluator for MyPartitionEvaluator {
    /// Tell DataFusion the window function varies based on the value
    /// of the window frame.
    fn uses_window_frame(&self) -> bool {
        true
    }

    /// This function is called once per input row.
    ///
    /// `range` specifies which indexes of `values` should be
    /// considered for the calculation.
    ///
    /// Note this is the SLOWEST, but simplest, way to evaluate a
    /// window function. It is much faster to implement
    /// evaluate_all or evaluate_all_with_rank, if possible
    fn evaluate(
        &mut self,
        values: &[ArrayRef],
        range: &std::ops::Range<usize>,
    ) -> Result<ScalarValue> {
        // Again, the input argument is an array of floating
        // point numbers to calculate a moving average
        let arr: &Float64Array = values[0].as_ref().as_primitive::<Float64Type>();

        let range_len = range.end - range.start;

        // our smoothing function will average all the values in the range
        let output = if range_len > 0 {
            let sum: f64 = arr.values().iter().skip(range.start).take(range_len).sum();
            Some(sum / range_len as f64)
        } else {
            None
        };

        Ok(ScalarValue::Float64(output))
    }
}

/// Create a `PartitionEvaluator` to evaluate this function on a new
/// partition.
fn make_partition_evaluator() -> Result<Box<dyn PartitionEvaluator>> {
    Ok(Box::new(MyPartitionEvaluator::new()))
}

Registering a Window UDF¶

To register a Window UDF, you need to wrap the function implementation in a WindowUDF struct and then register it with the SessionContext. DataFusion provides the create_udwf helper function to make this easier. There is also a lower-level API that offers more functionality but is more complex; it is documented in advanced_udwf.rs.

use datafusion::logical_expr::{Volatility, create_udwf};
use datafusion::arrow::datatypes::DataType;
use std::sync::Arc;

// here is where we define the UDWF. We also declare its signature:
let smooth_it = create_udwf(
    "smooth_it",
    DataType::Float64,
    Arc::new(DataType::Float64),
    Volatility::Immutable,
    Arc::new(make_partition_evaluator),
);

The create_udwf function takes five arguments:

The first argument is the name of the function. This is the name that will be used in SQL queries.
The second argument is the DataType of the input array (note: this is a single type, not a list of arrays). In this case, the function accepts a Float64 argument.
The third argument is the return type of the function. In this case, the function returns a Float64.
The fourth argument is the volatility of the function. In short, this is used to determine if the function’s performance can be optimized in some situations. In this case, the function is Immutable because it always returns the same value for the same input. A random number generator would be Volatile because it returns a different value for the same input.
The fifth argument is the function implementation. This is the function that we defined above.

That gives us a WindowUDF that we can register with the SessionContext:

use datafusion::execution::context::SessionContext;

let ctx = SessionContext::new();

ctx.register_udwf(smooth_it);

For example, if we have a cars.csv whose contents look like this:

car,speed,time
red,20.0,1996-04-12T12:05:03.000000000
red,20.3,1996-04-12T12:05:04.000000000
green,10.0,1996-04-12T12:05:03.000000000
green,10.3,1996-04-12T12:05:04.000000000
...

Then we can query it as follows:

use datafusion::datasource::file_format::options::CsvReadOptions;

#[tokio::main]
async fn main() -> Result<()> {

    let ctx = SessionContext::new();

    let smooth_it = create_udwf(
        "smooth_it",
        DataType::Float64,
        Arc::new(DataType::Float64),
        Volatility::Immutable,
        Arc::new(make_partition_evaluator),
    );
    ctx.register_udwf(smooth_it);

    // register csv table first
    let csv_path = "../../datafusion/core/tests/data/cars.csv".to_string();
    ctx.register_csv("cars", &csv_path, CsvReadOptions::default().has_header(true)).await?;

    // do query with smooth_it
    let df = ctx
        .sql(r#"
            SELECT
                car,
                speed,
                smooth_it(speed) OVER (PARTITION BY car ORDER BY time) as smooth_speed,
                time
            FROM cars
            ORDER BY car
        "#)
        .await?;

    // print the results
    df.show().await?;
    Ok(())
}

The output will look like this:

+-------+-------+--------------------+---------------------+
| car   | speed | smooth_speed       | time                |
+-------+-------+--------------------+---------------------+
| green | 10.0  | 10.0               | 1996-04-12T12:05:03 |
| green | 10.3  | 10.15              | 1996-04-12T12:05:04 |
| green | 10.4  | 10.233333333333334 | 1996-04-12T12:05:05 |
| green | 10.5  | 10.3               | 1996-04-12T12:05:06 |
| green | 11.0  | 10.440000000000001 | 1996-04-12T12:05:07 |
| green | 12.0  | 10.700000000000001 | 1996-04-12T12:05:08 |
| green | 14.0  | 11.171428571428573 | 1996-04-12T12:05:09 |
| green | 15.0  | 11.65              | 1996-04-12T12:05:10 |
| green | 15.1  | 12.033333333333333 | 1996-04-12T12:05:11 |
| green | 15.2  | 12.35              | 1996-04-12T12:05:12 |
| green | 8.0   | 11.954545454545455 | 1996-04-12T12:05:13 |
| green | 2.0   | 11.125             | 1996-04-12T12:05:14 |
| red   | 20.0  | 20.0               | 1996-04-12T12:05:03 |
| red   | 20.3  | 20.15              | 1996-04-12T12:05:04 |
...
...
+-------+-------+--------------------+---------------------+
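
The smooth_speed column is a running average because the query uses ORDER BY without an explicit frame. Assuming the standard SQL default frame, which the output above is consistent with, the frame for each row runs from the start of the partition through the current row. The Python sketch below reproduces the first few values for the green car; `evaluate` is a hypothetical stand-in that mirrors the arithmetic of the Rust method, not DataFusion code.

```python
def evaluate(values, frame):
    """Average values[frame.start:frame.stop]; None for an empty frame."""
    length = frame.stop - frame.start
    if length <= 0:
        return None
    return sum(values[frame.start:frame.stop]) / length


# First four speeds of the "green" partition, ordered by time.
green = [10.0, 10.3, 10.4, 10.5]

# Assumed default frame: partition start through the current row.
smooth = [evaluate(green, range(0, i + 1)) for i in range(len(green))]
print(smooth)
```

The results agree with the first four smooth_speed values in the table, up to floating point rounding.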

This gives you full flexibility to build domain-specific logic that plugs seamlessly into DataFusion’s engine — all without sacrificing performance.

Final Thoughts and Recommendations¶

Window functions may be common in SQL, but efficient and extensible window functions in engines are rare. While many databases support user-defined scalar and user-defined aggregate functions, user-defined window functions are not as common, and DataFusion makes them easier for everyone.

For anyone who is curious about DataFusion I highly recommend giving it a try. This post was designed to make it easier for new users to work with User Defined Window Functions by giving a few examples of how one might implement these.

When it comes to designing UDFs, I strongly recommend reviewing the Window functions documentation.

A heartfelt thank you to @alamb and @andygrove for their invaluable reviews and thoughtful feedback—they’ve been instrumental in shaping this post.

The Apache Arrow and Apache DataFusion communities are vibrant, welcoming, and full of passionate developers building something truly powerful. If you’re excited about high-performance analytics and want to be part of an open-source journey, I highly encourage you to explore the official documentation and dive into one of the many open issues. There’s never been a better time to get involved!

tpchgen-rs World’s fastest open source TPC-H data generator, written in Rust

2025-04-10T00:00:00+00:00

TLDR: TPC-H SF=100 in 1min using tpchgen-rs vs 30min+ with dbgen.

Three members of the Apache DataFusion community used Rust and open source development to build tpchgen-rs, a fully open TPC-H data generator over 20x faster than any other implementation we know of.

It is now possible to create the TPC-H SF=100 dataset in 72.23 seconds (1.4 GB/s 😎) on a MacBook Air M3 with 16GB of memory, compared to the classic dbgen which takes 30 minutes¹ (0.05 GB/s). On the same machine, it takes less than 2 minutes to create all 36 GB of SF=100 in Apache Parquet format, which takes 44 minutes using DuckDB. It is finally convenient and efficient to run TPC-H queries locally when testing analytical engines such as DataFusion.

Figure 1: Time to create TPC-H dataset for Scale Factor (see below) 1, 10, 100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core GCP VM with 88GB of memory. For Scale Factor (SF) 100 tpchgen takes 1 minute and 14 seconds and DuckDB takes 17 minutes and 48 seconds. For SF=1000, tpchgen takes 10 minutes and 26 seconds and uses about 5 GB of RAM at peak, and we could not measure DuckDB’s time as it requires 647 GB of RAM, more than the 88 GB that was available on our test machine. The testing methodology is in the documentation.

This blog explains what TPC-H is, how we ported the vintage C data generator to Rust (yes, RWIR) and optimized its performance over the course of a few weeks of part-time work. We began this project so we can easily generate TPC-H data in Apache DataFusion and GlareDB.

Try it for yourself¶

The tool is entirely open source under the Apache 2.0 license. Visit the tpchgen-rs repository or try it for yourself by running the following commands after installing Rust:

$ cargo install tpchgen-cli

# create SF=1 in classic TBL format
$ tpchgen-cli -s 1 

# create SF=10 in Parquet
$ tpchgen-cli -s 10 --format=parquet

What is TPC-H / dbgen?¶

The popular TPC-H benchmark (often referred to as TPCH) helps evaluate the performance of database systems on OLAP queries, the kind used to build BI dashboards.

TPC-H has become a de facto standard for analytic systems. While there are well-known limitations, as the data and queries do not represent many real-world use cases well, the majority of analytic database papers and industrial systems still use TPC-H query performance benchmarks as a baseline. You will inevitably find multiple results for “TPCH Performance <your favorite database>” in any search engine.

The benchmark was created at a time when access to high performance analytical systems was not widespread, so the Transaction Processing Performance Council defined a process of formal result verification. More recently, given the broad availability of free and open source database systems, it is common for users to run and verify TPC-H performance themselves.

TPC-H simulates a business environment with eight tables: REGION, NATION, SUPPLIER, CUSTOMER, PART, PARTSUPP, ORDERS, and LINEITEM. These tables are linked by foreign keys in a normalized schema representing a supply chain with parts, suppliers, customers and orders. The benchmark itself is 22 SQL queries containing joins, aggregations, and sorting operations.

The queries run against data created with dbgen, a program written in a pre-C99 dialect of C, which generates data in a format called TBL (example in Figure 2). dbgen creates data for each of the 8 tables for a certain Scale Factor, commonly abbreviated as SF. Example Scale Factors and corresponding dataset sizes are shown in Table 1. There is no theoretical upper bound on the Scale Factor.

103|2844|845|3|23|40177.32|0.01|0.04|N|O|1996-09-11|1996-09-18|1996-09-26|NONE|FOB|ironic accou|
229|10540|801|6|29|42065.66|0.04|0.00|R|F|1994-01-14|1994-02-16|1994-01-22|NONE|FOB|uriously pending |
263|2396|649|1|22|28564.58|0.06|0.08|R|F|1994-08-24|1994-06-20|1994-09-09|NONE|FOB|efully express fo|
327|4172|427|2|9|9685.53|0.09|0.05|A|F|1995-05-24|1995-07-11|1995-06-05|NONE|AIR| asymptotes are fu|
450|5627|393|4|40|61304.80|0.05|0.03|R|F|1995-03-20|1995-05-25|1995-04-14|NONE|RAIL|ve. asymptote|

Figure 2: Example TBL formatted output of dbgen for the LINEITEM table

Scale Factor	Data Size (TBL)	Data Size (Parquet)
0.1	103 MB	31 MB
1	1 GB	340 MB
10	10 GB	3.6 GB
100	107 GB	38 GB
1000	1089 GB	379 GB

Table 1: TPC-H data set sizes at different scale factors for both TBL and Apache Parquet.

Why do we need a new TPC-H Data generator?¶

Despite the known limitations of the TPC-H benchmark, it is so well known that it is used frequently in database performance analysis. To run TPC-H, you must first load the data, using dbgen, which is not ideal for several reasons:

You must find and compile a copy of the 15+ year old C program (for example electrum/tpch-dbgen)
dbgen requires substantial time (Figure 3) and is not able to use more than one core.
It outputs TBL format, which typically requires loading into your database (for example, here is how to do so in Apache DataFusion) prior to query.
The implementation makes substantial assumptions about the operating environment, making it difficult to extend or embed into other systems.²

Figure 3: Time to generate TPC-H data in TBL format. tpchgen is shown in blue. tpchgen restricted to a single core is shown in red. Unmodified dbgen is shown in green and dbgen modified to use -O3 optimization level is shown in yellow.

dbgen is so inconvenient and takes so long that vendors often provide preloaded TPC-H data, for example Snowflake Sample Data, Databricks Sample datasets and DuckDB Pre-Generated Data Sets.

In addition to pre-generated datasets, DuckDB also provides a TPC-H extension for generating TPC-H datasets within DuckDB. This is so much easier to use than the current alternatives that it leads many researchers and other thought leaders to use DuckDB to evaluate new ideas. For example, Wan Shen Lim explicitly mentioned the ease of creating the TPC-H dataset as one reason the first student project of CMU-799 Spring 2025 used DuckDB.

As beneficial as the DuckDB TPC-H extension is, it is non-ideal for several reasons:

Creates data in a proprietary format, which requires export to use in other systems.
Requires significant time (e.g. 17 minutes for Scale Factor 100).
Requires unnecessarily large amounts of memory (e.g. 71 GB for Scale Factor 100).

The above limitations make it impractical to generate Scale Factor 100 and above on laptops or standard workstations, though DuckDB offers pre-computed files for larger factors³.

Why Rust?¶

Realistically, we used Rust because we wanted to integrate the data generator into Apache DataFusion and GlareDB. However, we also believe Rust is superior to C/C++ because it offers comparable performance with much higher programmer productivity (Figure 4). Productivity in this case refers to the ease of optimizing and adding multithreading without introducing hard to debug memory safety or concurrency issues.

While Rust does allow unsafe access to memory (eliding bounds checking, for example), when required for performance, our implementation is entirely memory safe. The only unsafe code is used to skip UTF8 validation on known ASCII strings.

Figure 4: Lamb Theory of System Language Evolution from Boston University MiDAS Fall 2024 (Data Systems Seminar), recording. Special thanks to @KurtFehlhauer

How: The Journey¶

We did it together as a team in the open over the course of a few weeks. Wan Shen Lim inspired the project by pointing out the benefits of easy TPC-H dataset creation and suggesting we check out a Java port on February 11, 2025. Achraf made the first commit a few days later on February 16, Andrew and Sean started helping on March 8, 2025, and we released version 0.1 on March 30, 2025.

Optimizing Single Threaded Performance¶

Achraf completed the end-to-end conformance tests, to ensure correctness, and checked in an initial CLI on March 15, 2025.

On a MacBook Pro M3 (Nov 2023), the initial performance numbers were actually slower than the original Java implementation from which it was ported 😭. This wasn’t surprising, since the focus of the first version was to get a byte-for-byte compatible port, and we knew about the performance shortcomings and how to approach them.

Scale Factor	Time
1	0m10.307s
10	1m26.530s
100	14m56.986s

Table 2: Performance of running the initial tpchgen-cli, measured with time target/release/tpchgen-cli -s $SCALE_FACTOR

With this strong foundation we began optimizing the code using Rust’s low level memory management to improve performance while retaining memory safety. We spent several days obsessing over low level details and implemented a textbook-like list of optimizations:

Avoiding startup overhead
Not copying strings (many more PRs as well)
Using Rust’s zero overhead abstractions for dates
Using static strings (entirely safely with static lifetimes)
Using generics to avoid virtual function call overhead
Moving lookups from runtime to load time

At the time of writing, single threaded performance is now 2.5x-2.7x faster than the initial version, as shown in Table 3.

Scale Factor	Time	Times faster
1	0m4.079s	2.5x
10	0m31.616s	2.7x
100	5m28.083s	2.7x

Table 3: Single threaded tpchgen-cli performance, measured with time target/release/tpchgen-cli -s $SCALE_FACTOR --num-threads=1
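
The "Times faster" column can be reproduced from the `time` output in Tables 2 and 3. The Python sketch below parses those duration strings and divides them; `to_seconds` is a hypothetical helper written for this illustration.

```python
import re


def to_seconds(duration):
    """Convert a `time`-style duration such as '1m26.530s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", duration)
    if match is None:
        raise ValueError(f"unrecognized duration: {duration!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# Times keyed by Scale Factor, copied from Table 2 and Table 3.
initial = {1: "0m10.307s", 10: "1m26.530s", 100: "14m56.986s"}
optimized = {1: "0m4.079s", 10: "0m31.616s", 100: "5m28.083s"}

speedups = {
    sf: round(to_seconds(initial[sf]) / to_seconds(optimized[sf]), 1)
    for sf in initial
}
print(speedups)
```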

Multi-threading¶

Then we applied Rust’s fearless concurrency – with a single, small PR (272 net new lines) we updated the same memory safe code to run with multiple threads and consume bounded memory using tokio for the thread scheduler⁴.

As shown in Table 4, with this change, tpchgen-cli generates the full SF=100 dataset in 32 seconds (which is 3.3 GB/sec 🤯). Further investigation reveals that at SF=100 our generator is actually IO bound (which is not the case for dbgen or duckdb) – it creates data faster than can be written to an SSD. When writing to /dev/null tpchgen generates the entire dataset in 25 seconds (4 GB/s).

Scale Factor	Time	Times faster than initial implementation	Times faster than optimized single threaded
1	0m1.369s	7.3x	3x
10	0m3.828s	22.6x	8.2x
100	0m32.615s	27.5x	10x
100 (to /dev/null)	0m25.088s	35.7x	13.1x

Table 4: tpchgen-cli (multithreaded) performance measured with time target/release/tpchgen-cli -s $SCALE_FACTOR
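
The throughput figures quoted above follow from dividing the dataset size by the wall clock time. A quick Python check, assuming the 107 GB TBL size for SF=100 from Table 1 (the post's own rounding may use a slightly different size):

```python
TBL_SIZE_GB = 107  # SF=100 TBL size, taken from Table 1


def throughput_gb_per_s(size_gb, seconds):
    """Average generation rate in GB per second."""
    return size_gb / seconds


to_ssd = throughput_gb_per_s(TBL_SIZE_GB, 32.615)       # Table 4, SF=100
to_dev_null = throughput_gb_per_s(TBL_SIZE_GB, 25.088)  # Table 4, /dev/null

print(f"to SSD: {to_ssd:.1f} GB/s, to /dev/null: {to_dev_null:.1f} GB/s")
```

With these inputs the SSD run works out to about 3.3 GB/s and the /dev/null run to a little over 4 GB/s.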

Using Rust and async streams, the data generator is also fully streaming: memory use does not increase with increasing data size / scale factors⁵. The DuckDB generator seems to require far more memory than is commonly available on developer laptops and memory use increases with scale factor. With tpchgen-cli it is perfectly possible to create data for SF=10000 or larger on a machine with 16GB of memory (assuming sufficient storage capacity).
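
To make the idea of bounded memory concrete, here is a minimal Python sketch of one way to parallelize generation while capping the number of chunks held in memory. It is not the tpchgen-rs implementation, which uses tokio and async streams; `generate_chunk`, `write_all`, and `max_in_flight` are hypothetical names chosen for this illustration.

```python
import io
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def generate_chunk(part):
    """Hypothetical stand-in for generating one slice of a table."""
    return "".join(f"{part}|{i}|\n" for i in range(3))


def write_all(num_parts, out, max_in_flight=4):
    """Generate parts in parallel and write them in order.

    At most `max_in_flight` chunks are pending at any time, so memory
    use depends on the window size, not on the total number of parts.
    Returns the peak number of pending chunks.
    """
    peak = 0
    pending = deque()
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        for part in range(num_parts):
            pending.append(pool.submit(generate_chunk, part))
            peak = max(peak, len(pending))
            if len(pending) >= max_in_flight:
                # Block on the oldest chunk before scheduling more work.
                out.write(pending.popleft().result())
        while pending:
            out.write(pending.popleft().result())
    return peak


buf = io.StringIO()
peak = write_all(10, buf, max_in_flight=4)
```

Because the writer always waits on the oldest outstanding chunk, output order is preserved and the buffer count stays fixed however many parts are generated.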

Direct to parquet¶

At this point, tpchgen-cli could very quickly generate the TBL format. However, as described above, the TBL format is annoying to work with because:

It has no header
It is like a CSV but the delimiter is |
Each line ends with an extra | delimiter before the newline 🙄
No system that we know of can read it without additional configuration.
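
As an example of the extra configuration involved, this Python sketch reads a TBL line with the standard csv module by setting the delimiter to | and dropping the empty field produced by the trailing delimiter. `parse_tbl` is a hypothetical helper, and the sample line is the first row of Figure 2.

```python
import csv
import io

SAMPLE = (
    "103|2844|845|3|23|40177.32|0.01|0.04|N|O|"
    "1996-09-11|1996-09-18|1996-09-26|NONE|FOB|ironic accou|\n"
)


def parse_tbl(text):
    """Parse TBL text into rows of string fields (there is no header)."""
    reader = csv.reader(io.StringIO(text), delimiter="|", quoting=csv.QUOTE_NONE)
    # Each line ends with an extra '|', which yields one empty trailing field.
    return [row[:-1] for row in reader if row]


rows = parse_tbl(SAMPLE)
print(len(rows[0]), rows[0][0], rows[0][-1])
```

All fields come back as strings, so column names and types still have to be supplied separately.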

We next added support for CSV generation (special thanks @niebayes from Datalayers for finding and fixing bugs) which performs at the same speed as TBL. While CSV files are far more standard than TBL, they must still be parsed prior to load and automatic type inference may not deduce the types needed for the TPC-H benchmarks (e.g. floating point vs Decimal).
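
The floating point versus Decimal distinction matters because binary floats cannot represent most decimal fractions exactly. A small Python illustration, using made-up values rather than TPC-H data:

```python
from decimal import Decimal

# Illustrative values as they would appear as text in a CSV file.
raw = ["0.1", "0.2"]

# Inferred as floating point: the sum picks up a rounding error.
float_total = sum(float(x) for x in raw)

# Parsed as Decimal: the sum is exact.
decimal_total = sum(Decimal(x) for x in raw)

print(float_total == 0.3, decimal_total == Decimal("0.3"))
```

A reader that infers floating point for a monetary column can therefore produce slightly different aggregates than one that uses a Decimal type.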

What would be far more useful is a typed, efficient columnar format such as Apache Parquet which is supported by all modern query engines. So we made a tpchgen-arrow crate to create Apache Arrow arrays directly and then a small 300 line PR to feed those arrays to the Rust Parquet writer, again using tokio for parallelized work with bounded memory.

This approach was simple, fast and scalable, as shown in Table 5. Even though creating Parquet files is significantly more computationally expensive than TBL or CSV, tpchgen-cli creates the full SF=100 parquet format dataset in about 45 seconds.

Scale Factor	Time to generate Parquet	Speed compared to tbl generation
1	0m1.649s	0.8x
10	0m5.643s	0.7x
100	0m45.243s	0.7x
100 (to /dev/null)	0m45.153s	0.5x

Table 5: tpchgen-cli Parquet generation performance measured with time target/release/tpchgen-cli -s $SCALE_FACTOR --format=parquet

Conclusion 👊🎤¶

In just a few days, with some fellow database nerds and the power of Rust, we built something 10x better than what currently exists. We hope it inspires more research into analytical systems using the TPC-H dataset and that people build awesome things with it. For example, Sean has already added on-demand generation of tables to GlareDB. Please consider joining us and helping out at https://github.com/clflushopt/tpchgen-rs.

We met while working together on Apache DataFusion in various capacities. If you are looking for a community of like minded people hacking on databases, we welcome you to come join us. We are in the process of integrating this into DataFusion (see apache/datafusion#14608) if you are interested in helping 🎣

About the Authors:¶

Andrew Lamb (@alamb) is a Staff Engineer at InfluxData and a PMC member of Apache DataFusion and Apache Arrow.
Achraf B (@clflushopt) is a Software Engineer at Optable where he works on data infrastructure.
Sean Smith (@scsmithr) is the founder of GlareDB focused on building a fast analytics database.

Footnotes¶

1: Actual Time: 30:35

2: It is possible to embed the dbgen code, which appears to be the approach taken by DuckDB. This approach was tried in GlareDB (GlareDB/glaredb#3313), but ultimately shelved given the amount of effort needed to adapt and isolate the dbgen code.

3: It is pretty amazing to imagine the machine required to generate SF300 that had 1.8TB (!!) of RAM

4: We tried to use Rayon (see discussion here), but could not easily keep memory bounded.

5: tpchgen-cli memory usage is a function of the number of threads: each thread needs some buffer space

Apache DataFusion Python 46.0.0 Released

2025-03-30T00:00:00+00:00

We are happy to announce that datafusion-python 46.0.0 has been released. This release brings in all of the new features of the core DataFusion 46.0.0 library. Since the last blog post for datafusion-python 43.1.0, a large number of improvements have been made that can be found in the changelogs.

We highly recommend reviewing the upstream DataFusion 46.0.0 announcement.

Easier file reading¶

In these releases we have introduced two new ways to more easily read files into DataFrames.

PR #982 introduced a series of easier read functions for Parquet, JSON, CSV, and AVRO files. This introduces a concept of a global context that is available by default when using these methods. Now instead of creating a default Session Context and then calling the read methods, you can simply import these alternative read methods and begin working with your DataFrames. The example below shows how easy this approach is to use.

from datafusion.io import read_parquet
df = read_parquet(path="./examples/tpch/data/customer.parquet")

PR #980 adds a method for setting up a session context to use URL tables. With this enabled, you can use a path to a local file as a table name. An example of how to use this is demonstrated in the following snippet.

import datafusion
ctx = datafusion.SessionContext().enable_url_table()
df = ctx.table("./examples/tpch/data/customer.parquet")

Registering Table Views¶

DataFusion supports registering a logical plan as a view with a session context. This allows creating views in one part of your workflow and passing the session context to other places where that logical plan can be reused. This is a useful feature for building up complex workflows and for code clarity. PR #1016 enables this feature in datafusion-python.

For example, supposing you have a DataFrame called df1, you could use this code snippet to register the view and then use it in another place:

ctx.register_view("view1", df1)

And then in another portion of your code which has access to the same session context you can retrieve the DataFrame with:

df2 = ctx.table("view1")

Asynchronous Iteration of Record Batches¶

Retrieving a RecordBatch from a RecordBatchStream was a synchronous call, which would require the end user's code to wait for the data retrieval. This is described in Issue 974. We continue to support this as a synchronous iterator, but we have also added in the ability to retrieve the RecordBatch using the Python asynchronous anext function.
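
To show the difference between the two iteration styles without depending on the library, the Python sketch below uses a hypothetical stand-in class, `FakeBatchStream`, that supports both protocols. It is not the datafusion-python RecordBatchStream; it only illustrates how async iteration lets other tasks run while waiting for the next batch.

```python
import asyncio


class FakeBatchStream:
    """Hypothetical stand-in for a stream of record batches."""

    def __init__(self, batches):
        self._batches = iter(batches)

    # Synchronous protocol: the caller blocks until the batch is ready.
    def __iter__(self):
        return self

    def __next__(self):
        return next(self._batches)

    # Asynchronous protocol: the caller awaits each batch.
    def __aiter__(self):
        return self

    async def __anext__(self):
        await asyncio.sleep(0)  # yield to the event loop, as real I/O would
        try:
            return next(self._batches)
        except StopIteration:
            raise StopAsyncIteration from None


async def collect(stream):
    """Gather all batches using `async for`."""
    return [batch async for batch in stream]


sync_batches = list(FakeBatchStream([[1, 2], [3]]))
async_batches = asyncio.run(collect(FakeBatchStream([[1, 2], [3]])))
```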

Default ZSTD Compression for Parquet files¶

With PR #981, we change the saving of Parquet files to use zstd compression by default. Previously the default was uncompressed, causing excessive disk storage. Zstd is an excellent compression scheme that balances speed and compression ratio. Users can still save their Parquet files uncompressed by passing in the appropriate value to the compression argument when calling DataFrame.write_parquet.

UDF Decorators¶

In PRs #1040 and #1061 we add methods to make creating user defined functions easier and take advantage of Python decorators. With these PRs you can save a step from defining a method and then defining a udf of that method. Instead you can simply add the appropriate udf decorator. Similar methods exist for aggregate and window user defined functions.

import pyarrow as pa
from datafusion import udf

@udf([pa.int64(), pa.int64()], pa.bool_(), "stable")
def my_custom_function(
    age: pa.Array,
    favorite_number: pa.Array,
) -> pa.Array:
    pass

`uv` package management¶

uv is an extremely fast Python package manager, written in Rust. In the previous version of datafusion-python we had a combination of settings for PyPI and Conda. We have now switched to using uv as our primary method for dependency management.

For most users of DataFusion, this change will be transparent. You can still install via pip or conda. For developers, the instructions in the repository have been updated.

Code cleanup¶

In an effort to improve our code cleanliness and ensure we are following Python best practices, we use ruff to perform Python linting. Until now we enabled only a portion of the available linters. In PRs #1055 and #1062, we enabled many more of these linters and made code improvements to ensure we are following these recommendations.

Improved Jupyter Notebook rendering¶

Since PR #839 in DataFusion 41.0.0 we have been able to render DataFrames using HTML in Jupyter notebooks. This is a big improvement over the show command in environments that can render tables. In PR #1036 we went a step further and added a variety of features.

Now html tables are scrollable, vertically and horizontally.
When data are truncated, we report this to the user.
Instead of showing a small number of rows, we collect up to 2 megabytes of data to display. Since we have scrollable tables, we are able to make more data available to the user without sacrificing notebook usability.
We report explicitly when the DataFrame is empty. Previously we would not output anything for an empty table. This indicator is helpful to users to ensure their plans are written correctly. Sometimes a non-output can be overlooked.
For long output of data, we generate a collapsed view of the data with an option for the user to click on it to expand the data.

In the below view you can see an example of some of these features such as the expandable text and scroll bars.

Figure 1: With the html rendering enhancements, tables are more easily viewable in jupyter notebooks.

Extension Documentation¶

We have recently added Extension Documentation to the DataFusion in Python website. We have received many requests about how to better understand how to integrate DataFusion in Python with other Rust libraries. To address these questions we wrote an article about some of the difficulties that we encounter when using Rust libraries in Python and our approach to addressing them.

Migration Guide¶

During the upgrade from DataFusion 43.0.0 to DataFusion 44.0.0 as our upstream core dependency, we discovered a few changes were necessary within our repository and our unit tests. These notes serve to help guide users who may encounter similar issues when upgrading.

RuntimeConfig is now deprecated in favor of RuntimeEnvBuilder. The migration is fairly straightforward, and the corresponding classes have been marked as deprecated. For end users it should be simply a matter of changing the class name.
If you perform a concat of a string_view and string, it will now return a string_view instead of a string. This likely only impacts unit tests that are validating return types. In general, it is recommended to switch to using string_view whenever possible. You can see the blog articles String View Pt 1 and Pt 2 for more information on these performance improvements.
The function date_part now returns an int32 instead of a float64. This is likely only impactful to unit tests.
We have upgraded the Python minimum version to 3.9 since 3.8 is no longer officially supported.

Coming Soon¶

There is a lot of excitement around the upcoming work. This list is not comprehensive, but a glimpse into some of the upcoming work includes:

Reusable DataFusion UDFs: The way user defined functions are currently written in datafusion-python is slightly different from those written for the upstream Rust datafusion. The core ideas are usually the same, but it means it takes effort for users to re-implement functions already written for Rust projects to be usable in Python. Issue #1017 addresses this topic. Work is well underway to make it easier to expose these user functions through the FFI boundary. This means that the work that already exists in repositories such as those found in the datafusion-contrib project can be easily re-used in Python. This will provide a low effort way to expose significant functionality to the DataFusion in Python community.
Additional table providers: We have work well underway to provide a host of table providers to datafusion-python including: sqlite, duckdb, postgres, odbc, and mysql! In datafusion-contrib #279 we track the progress of this excellent work. Once complete, users will be able to pip install this library and get easy access to all of these table providers. This is another way we are leveraging the FFI work to greatly expand the usability of datafusion-python with relatively low effort.
External catalog and schema providers: For users who wish to go beyond table providers and have an entire custom catalog with schema, Issue #1091 tracks the progress of exposing this in Python. With this work, if you have already written a Rust based table catalog you will be able to interface it in Python similar to the work described for the table providers above.

This is only a sample of the great work that is being done. If there are features you would love to see, we encourage you to open an issue and join us as we build something wonderful.

Appreciation¶

We would like to thank everyone who has helped with these releases through their helpful conversations, code review, issue descriptions, and code authoring. We would especially like to thank the following authors of PRs who made these releases possible, listed in alphabetical order by username: @chenkovsky, @CrystalZhou0529, @ion-elgreco, @jsai28, @kevinjqliu, @kylebarron, @kosiew, @nirnayroy, and @Spaarsh.

Thank you!

Get Involved¶

The DataFusion Python team is an active and engaging community and we would love to have you join us and help the project.

Here are some ways to get involved:

Learn more by visiting the DataFusion Python project page.
Try out the project and provide feedback, file issues, and contribute code.
Join us on ASF Slack or the Arrow Rust Discord Server.

Apache DataFusion 46.0.0 Released

2025-03-24T00:00:00+00:00

We’re excited to announce the release of Apache DataFusion 46.0.0! This new version represents a significant milestone for the project, packing in a wide range of improvements and fixes. You can find the complete details in the full changelog. We’ll highlight the most important changes below and guide you through upgrading.

Breaking Changes¶

DataFusion 46.0.0 brings a few breaking changes that may require adjustments to your code as described in the Upgrade Guide. Here are the most notable ones:

Unified DataSourceExec Execution Plan: DataFusion 46.0.0 introduces a major refactor of scan operators. The separate file-format-specific execution plan nodes (ParquetExec, CsvExec, JsonExec, AvroExec, etc.) have been deprecated and merged into a single DataSourceExec plan. Format-specific logic is now encapsulated in new DataSource and FileSource traits. This change simplifies the execution model, but if you have code that directly references the old plan nodes, you’ll need to update it to use DataSourceExec (see the Upgrade Guide for examples of the new API).
Error Handling Improvements (DataFusionError::Collection): We began overhauling DataFusion’s approach to error handling. In this release, a new error variant DataFusionError::Collection (and related mechanisms) has been introduced to aggregate multiple errors into one. This is part of a broader effort to provide richer error context and reduce internal panics. As a result, some error types or messages have changed. Downstream code that matches on specific DataFusionError variants might need adjustment.

Performance Improvements¶

DataFusion 46.0.0 comes with a slew of performance enhancements across the board. Here are some of the noteworthy optimizations in this release:

Faster median() (no grouping): The median() aggregate function got a special fast path when used without a GROUP BY. By optimizing its accumulator, median calculation is about 2× faster in the single-group case. If you use MEDIAN() on large datasets (especially as a single value), you should notice reduced query times (PR #14399 by @2010YOUY01).
Optimized FIRST_VALUE/LAST_VALUE: The FIRST_VALUE and LAST_VALUE window functions have been improved by avoiding an internal sort of rows. Instead of sorting each partition, the implementation now uses a direct approach to pick the first/last element. This yields 10–100% performance improvement for these functions, depending on the scenario. Queries using FIRST_VALUE(...) OVER (PARTITION BY ... ORDER BY ...) will run faster, especially when partitions are large (PR #14402 by @blaginin).
repeat() String Function Boost: Repeating strings is now more efficient – the repeat(text, n) function was optimized by about 50%. This was achieved by reducing allocations and using a more efficient concatenation strategy. If you generate large repeated strings in queries, this can cut the time nearly in half (PR #14697 by @zjregee).
Ultra-fast uuid() UDF: The uuid() function (which generates random UUID strings) received a major speed-up. It’s now roughly 40× faster than before! The new implementation avoids unnecessary string copying and uses a more direct conversion to hex, making bulk UUID generation far more practical (PR #14675 by @simonvandel).
Accelerated chr() and to_hex(): Several scalar functions have been micro-optimized. The chr() function (which returns the character for a given ASCII code) is about 4× faster now, and the to_hex() function (which converts numbers to hex string) is roughly 2× faster. These improvements may be most noticeable in tight loops or when these functions are applied to large arrays of values (PR #14700 for chr, #14686 for to_hex by @simonvandel).
No More RowConverter in Grouped Ordering: We removed an inefficient step in the partial grouping algorithm. The GroupOrderingPartial operator no longer converts data to “row format” for each batch (via RowConverter). Instead, it uses a direct arrow-based approach to detect sort key changes. This eliminated overhead and yields a nice speedup for certain aggregation queries. (PR #14566 by @ctsk).
Predicate Pruning for NOT LIKE: DataFusion’s parquet reader can now prune row groups using NOT LIKE filters, similar to how it handles LIKE. This means if you have a filter such as column NOT LIKE 'prefix%', DataFusion can use min/max statistics to skip reading files/parts that can be determined to either entirely match or not match the predicate. In particular, a pattern like NOT LIKE 'X%' can skip data ranges that definitely start with "X". While a niche case, it contributes to query efficiency in those scenarios (PR #14567 by @UBarney).
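To make the NOT LIKE pruning rule concrete, here is a minimal Python sketch of the idea. It is not DataFusion's implementation: the RowGroupStats type and the can_skip_for_not_like_prefix helper are hypothetical names made up for illustration. The sketch relies on one fact about lexicographic ordering: if both the minimum and the maximum of a column start with a prefix, every value between them starts with it too.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Min/max statistics of one string column in one row group (hypothetical type)."""
    min_value: str
    max_value: str


def can_skip_for_not_like_prefix(stats: RowGroupStats, prefix: str) -> bool:
    """Return True when `col NOT LIKE '<prefix>%'` cannot match any row.

    If min and max both start with the prefix, all values in between do as
    well, so every row is rejected by the NOT LIKE filter.
    """
    return stats.min_value.startswith(prefix) and stats.max_value.startswith(prefix)


row_groups = [
    RowGroupStats("Xa", "Xz"),  # every value starts with "X": can be skipped
    RowGroupStats("Wa", "Xb"),  # mixed values: must be read
    RowGroupStats("Aa", "Bz"),  # no value starts with "X": must be read
]

# Row groups that still have to be read for `col NOT LIKE 'X%'`.
kept = [rg for rg in row_groups if not can_skip_for_not_like_prefix(rg, "X")]
```

Only the first row group is skipped, because it is the only one where the statistics prove that no row can pass the filter.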

Google Summer of Code 2025¶

Another exciting development: Apache DataFusion has been accepted as a mentoring organization for Google Summer of Code (GSoC) 2025! 🎉 This means that this summer, students from around the world will have the opportunity to contribute to DataFusion under the guidance of our committers. We have put together a list of project ideas that candidates can choose from.

If you’re interested, check out our GSoC Application Guidelines. We encourage students to reach out, discuss ideas with us, and apply.

Highlighted New Features¶

Improved Diagnostics¶

DataFusion 46.0.0 introduces a new SQL Diagnostics framework to make error messages more understandable. This comes in the form of new Diagnostic and DiagnosticEntry types, which allow the system to attach rich context (like source query text spans) to error messages. In practical terms, certain planner errors will now point to the exact location in your SQL query that caused the issue.

For example, if you reference an unknown table or miss a column in GROUP BY, the error message will include the query snippet causing the error. These diagnostics are meant for end-users of applications built on DataFusion, providing clearer messages instead of generic errors. Here’s an example:

Currently, diagnostics cover unresolved table/column references, missing GROUP BY columns, ambiguous references, wrong number of UNION columns, type mismatches, and a few others. Future releases will extend this to more error types. This feature should greatly ease debugging of complex SQL by pinpointing errors directly in the query text. We thank @eliaperantoni for his contributions in this project.

Unified DataSourceExec for Table Providers¶

As mentioned, DataFusion now uses a unified DataSourceExec for reading tables, which is both a breaking change and a feature. Why is this important? The new approach simplifies how custom table providers are integrated and optimized. Namely, the optimizer can treat file scans uniformly and push down filters/limits more consistently when there is one execution plan that handles all data sources. The new DataSourceExec is paired with a DataSource trait that encapsulates format-specific behaviors (Parquet, CSV, JSON, Avro, etc.) in a pluggable way.

All built-in sources (Parquet, CSV, Avro, Arrow, JSON, etc.) have been migrated to this framework. This unification makes the codebase cleaner and sets the stage for future enhancements (like consistent metadata handling and limit pushdown across all formats). Check out PR #14224 for design details. We thank @mertak-synnada and @ozankabak for their contributions.

FFI Support for Scalar UDFs¶

DataFusion’s Foreign Function Interface (FFI) has been extended to support user-defined scalar functions defined in external languages. In 46.0.0, you can now expose a custom scalar UDF through the FFI layer and use it in DataFusion as if it were built-in. This is particularly exciting for the Python bindings and other language integrations – it means you could define a function in Python (or C, etc.) and register it with DataFusion’s Rust core via the FFI crate. Thanks, @timsaucer!

New Statistics/Distribution Framework¶

This release, thanks mainly to @Fly-Style with contributions from @ozankabak and @berkaysynnada, includes the initial pieces of a redesigned statistics framework. DataFusion’s optimizer can now represent column data distributions using a new Distribution enum, instead of the old precision or range estimations. The supported distribution types currently include Uniform, Gaussian (normal), Exponential, Bernoulli, and a Generic catch-all.

For example, if a filter expression is applied to a column with a known uniform distribution range, the optimizer can propagate that to estimate result selectivity more accurately. Similarly, comparisons (=, >, etc.) on columns yield Bernoulli distributions (with true/false probabilities) in this model.
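The relationship between a uniform column and the Bernoulli result of a comparison can be sketched in a few lines of Python. This is only an illustration of the arithmetic, assuming a continuous uniform distribution; the Uniform and Bernoulli classes and the greater_than helper are hypothetical and do not mirror DataFusion's Rust API.

```python
from dataclasses import dataclass


@dataclass
class Uniform:
    """A column whose values are assumed uniform on [low, high]."""
    low: float
    high: float


@dataclass
class Bernoulli:
    """A boolean expression that is true with probability p."""
    p: float


def greater_than(col: Uniform, constant: float) -> Bernoulli:
    """Distribution of the predicate `col > constant`."""
    if constant <= col.low:
        return Bernoulli(1.0)  # every value passes
    if constant >= col.high:
        return Bernoulli(0.0)  # no value passes
    # Fraction of the interval that lies above the constant.
    return Bernoulli((col.high - constant) / (col.high - col.low))


# A column uniform on [0, 100] filtered with `col > 75`.
selectivity = greater_than(Uniform(0.0, 100.0), 75.0)
```

Here the estimated selectivity is 0.25, which an optimizer could use as the expected fraction of rows that survive the filter.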

This is a foundational change with many follow-on PRs underway. The immediate user-visible effect is limited (the optimizer didn't magically improve by an order of magnitude overnight), but it lays the groundwork for more advanced query planning in the future. Over time, as the statistics information encapsulated in Distributions gets integrated, DataFusion will be able to make smarter decisions, such as more aggressive parquet pruning and better join orderings, based on data distribution information. The core framework is now in place and is being hooked up to column and table level statistics.

Aggregate Monotonicity and Window Ordering¶

DataFusion 46.0.0 adds a new concept of set-monotonicity for certain transformations, which helps avoid unnecessary sort operations. In particular, the planner now understands when a window function introduces new orderings of data.

For example, DataFusion now recognizes that a window-aggregate like MAX on a column can produce a result that is monotonically increasing, even if the input column is unordered — depending on the window frame used.

Consider the following query:

SELECT MAX(c1) OVER (
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS max_c1
FROM c1_table
ORDER BY max_c1;

In earlier versions of DataFusion, this query would require an additional SortExec on max_c1 to satisfy the ORDER BY clause. However, with the new set-monotonicity logic, the planner knows that MAX(...) OVER (...) produces values that are not smaller than the previous row, making the extra sort redundant. This leads to more efficient query execution.
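The property the planner relies on is easy to check outside of DataFusion. The short Python sketch below computes a running maximum, which corresponds to MAX over the frame ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, on an unordered input (the sample values are made up) and confirms that the output is already in non-decreasing order.

```python
from itertools import accumulate

# Unordered input column (made-up sample values).
c1 = [3, 1, 4, 1, 5, 9, 2, 6]

# Running maximum: each output row is the max of all rows seen so far.
running_max = list(accumulate(c1, max))
# running_max is [3, 3, 4, 4, 5, 9, 9, 9]

# The output never decreases, so sorting it again would be redundant.
is_sorted = all(a <= b for a, b in zip(running_max, running_max[1:]))
```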

PR #14271 introduced the core monotonicity tracking for aggregates and window functions. PR #14813 improved ordering preservation within various window frame types and added extensive test coverage. Huge thanks to @berkaysynnada and @mertak-synnada for designing and implementing this optimizer enhancement!

UNION [ALL | DISTINCT] BY NAME Support¶

DataFusion now supports UNION BY NAME and UNION ALL BY NAME, which align columns by name instead of position. This matches functionality found in systems like Spark and DuckDB and simplifies combining heterogeneously ordered result sets.

You no longer need to rewrite column order manually — just write:

SELECT col1, col2 FROM t1
UNION ALL BY NAME
SELECT col2, col1 FROM t2;

Under the hood, this is supported by the new union_by_name() and union_by_name_distinct() plan builder methods.
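As a rough illustration of what aligning by name means, the Python sketch below reorders the right-hand rows to match the left-hand column order before concatenating. It is a simplification that assumes both inputs have exactly the same set of column names; the union_all_by_name function is a hypothetical helper, not the plan builder method mentioned above.

```python
def union_all_by_name(left_cols, left_rows, right_cols, right_rows):
    """Concatenate two row sets, matching columns by name instead of position."""
    if set(left_cols) != set(right_cols):
        raise ValueError("this sketch only handles identical column sets")
    # For each output column, find where it lives in the right-hand input.
    positions = [right_cols.index(name) for name in left_cols]
    reordered = [tuple(row[i] for i in positions) for row in right_rows]
    return left_cols, left_rows + reordered


# t1 has columns (col1, col2); t2 stores the same columns in the opposite order.
cols, rows = union_all_by_name(
    ["col1", "col2"], [(1, "a")],
    ["col2", "col1"], [("b", 2)],
)
# rows is [(1, "a"), (2, "b")]
```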

Thanks to @rkrishn7 for PR #14538.

New range() Table Function¶

A new table-valued function range(start, stop, step) has been added to make it easy to generate integer sequences — similar to PostgreSQL’s generate_series() or Spark’s range().

Example:

SELECT * FROM range(1, 10, 2);

This returns: 1, 3, 5, 7, 9. It’s great for testing, cross joins, surrogate keys, and more.

Thanks to @simonvandel for PR #14830.

Upgrade Guide and Changelog¶

Upgrading to 46.0.0 should be straightforward for most users, but do review the Upgrade Guide for DataFusion 46.0.0 for detailed steps and code changes. The upgrade guide covers the breaking changes mentioned (like replacing old exec nodes with DataSourceExec, updating UDF invocation to invoke_with_args, etc.) and provides code snippets to help with the transition. For a comprehensive list of all changes, please refer to the changelog for 46.0.0 (linked above and in the repository). The changelog enumerates every merged PR in this release, including many smaller fixes and improvements that we couldn’t cover in this post.

Get Involved¶

Apache DataFusion is an open-source project, and we welcome involvement from anyone interested. Now is a great time to take 46.0.0 for a spin: try it out on your workloads, and let us know if you encounter any issues or have suggestions. You can report bugs or request features on our GitHub issue tracker, or better yet, submit a pull request. Join our community discussions – whether you have questions, want to share how you’re using DataFusion, or are looking to contribute, we’d love to hear from you. A list of open issues suitable for beginners is here and you can find how to reach us on the communication doc.

Happy querying!

Efficient Filter Pushdown in Parquet

2025-03-21T00:00:00+00:00

Editor's Note: This blog was first published on Xiangpeng Hao's blog. Thanks to InfluxData for sponsoring this work as part of his PhD funding.

In the previous post, we discussed how Apache DataFusion prunes Apache Parquet files to skip irrelevant files/row_groups (sometimes also pages).

This post discusses how Parquet readers skip irrelevant rows while scanning data, leveraging Parquet's columnar layout by first reading only filter columns, and then selectively reading other columns only for matching rows.

Why filter pushdown in Parquet?¶

Below is an example query that reads sensor data with filters on date_time and location.

SELECT val, location 
FROM sensor_data 
WHERE date_time > '2025-03-11' AND location = 'office';

Parquet pruning skips irrelevant files/row_groups, while filter pushdown skips irrelevant rows. Without filter pushdown, all rows from location, val, and date_time columns are decoded before location='office' is evaluated. Filter pushdown is especially useful when the filter is selective, i.e., removes many rows.

In our setup, sensor data is aggregated by date — each day has its own Parquet file. At planning time, DataFusion prunes the unneeded Parquet files, i.e., 2025-03-10.parquet and 2025-03-11.parquet.

Once the files to read are located, DataFusion's current default implementation reads all the projected columns (sensor_id, val, and location) into Arrow RecordBatches, then applies the filters over location to get the final set of rows.

A better approach is called filter pushdown with late materialization, which evaluates filter conditions first and only decodes data that passes these conditions. In practice, this works by first processing only the filter columns (date_time and location), building a boolean mask of rows that satisfy our conditions, then using this mask to selectively decode only the relevant rows from other columns (sensor_id, val). This eliminates the waste of decoding rows that will be immediately filtered out.
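The two steps can be sketched in plain Python. This is a toy model rather than the Rust Parquet reader: CountingColumn is a hypothetical stand-in for an encoded column that counts how many values get decoded, and the sample data is made up.

```python
class CountingColumn:
    """Stand-in for an encoded column that counts decoded values."""

    def __init__(self, values):
        self._values = values
        self.decoded = 0

    def decode(self, rows):
        rows = list(rows)
        self.decoded += len(rows)
        return [self._values[i] for i in rows]


date_time = CountingColumn(["2025-03-12", "2025-03-12", "2025-03-13", "2025-03-13"])
location = CountingColumn(["office", "lab", "office", "home"])
val = CountingColumn([20.5, 21.0, 19.8, 22.1])
all_rows = range(4)

# Step 1: decode only the filter columns and build the boolean mask.
mask = [
    d > "2025-03-11" and loc == "office"
    for d, loc in zip(date_time.decode(all_rows), location.decode(all_rows))
]

# Step 2: decode the output columns only for rows that passed the filter.
selected = [i for i, keep in enumerate(mask) if keep]
result = list(zip(val.decode(selected), location.decode(selected)))
# val.decoded is 2 instead of 4; location is both a filter column and an
# output column, so this sketch decodes it in both steps (4 + 2 = 6 values).
```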

While simple in theory, practical implementations often make performance worse.

How can filter pushdown be slower?¶

At a high level, the Parquet reader first builds a filter mask -- essentially a boolean array indicating which rows meet the filter criteria -- and then uses this mask to selectively decode only the needed rows from the remaining columns in the projection.

Let's dig into details of how filter pushdown is implemented in the current Rust Parquet reader implementation, illustrated in the following figure.

Implementation of filter pushdown in Rust Parquet readers -- the first phase builds the filter mask, the second phase applies the filter mask to the other columns

The filter pushdown has two phases:

Build the filter mask (steps 1-3)
Use the filter mask to selectively decode other columns (steps 4-7), e.g., the output of step 3 is used as input for steps 5 and 7.

Within each phase, it takes three steps from Parquet to Arrow:

Decompress the Parquet pages using generic decompression algorithms like LZ4, Zstd, etc. (steps 1, 4, 6)
Decode the page content into Arrow format (steps 2, 5, 7)
Evaluate the filter over Arrow data (step 3)

In the figure above, we can see that location is decompressed and decoded twice, first when building the filter mask (steps 1, 2), and second when building the output (steps 4, 5). This happens for all columns that appear both in the filter and output.

The table below shows the corresponding CPU time on the ClickBench query 22:

+------------+--------+-------------+--------+
| Decompress | Decode | Apply filter| Others |
+------------+--------+-------------+--------+
| 206 ms     | 117 ms | 22 ms       | 48 ms  |
+------------+--------+-------------+--------+

Clearly, decompress/decode operations dominate the time spent. With filter pushdown, it needs to decompress/decode twice; but without filter pushdown, it only needs to do this once. This explains why filter pushdown is slower in some cases.
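A quick calculation over the numbers in the table shows how large that share is. The dictionary keys below are just labels for the table columns.

```python
# CPU time per stage, copied from the table above (milliseconds).
cpu_ms = {"decompress": 206, "decode": 117, "apply_filter": 22, "others": 48}

total_ms = sum(cpu_ms.values())  # 393 ms
decompress_decode_share = (cpu_ms["decompress"] + cpu_ms["decode"]) / total_ms
# About 0.82: decompression and decoding account for roughly 82% of CPU time.
```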

Note: Highly selective filters may skip the entire page; but as long as it reads one row from the page, it needs to decompress and often decode the entire page.

Attempt: cache filter columns¶

Intuitively, caching the filter columns and reusing them later could help.

But naively caching decoded pages consumes prohibitively high memory:

It needs to cache Arrow arrays, which are on average 4x larger than Parquet data.
It needs to cache the entire column chunk in memory, because in Phase 1 it builds filters over the whole column chunk and only uses the cached data in Phase 2.
The memory usage is proportional to the number of filter columns, which can be unboundedly high.

Worse, caching filter columns means it needs to read partially from Parquet and partially from cache, which is complex to implement, likely requiring a substantial change to the current implementation.

Feel the complexity: consider building a cache that properly handles nested columns, multiple filters, and filters with multiple columns.

Real solution¶

We need a solution that:

Is simple to implement, i.e., doesn't require thousands of lines of code.
Incurs minimal memory overhead.

This section describes my <700 LOC PR (with lots of comments and tests) that reduces total ClickBench time by 15%, with up to 2x lower latency for some queries, no obvious regression on other queries, and caches at most 2 pages (~2MB) per column in memory.

New decoding pipeline: building the filter mask and the output columns is interleaved in a single pass, allowing us to cache a minimal number of pages for a minimal amount of time

The new pipeline interleaves the previous two phases into a single pass, so that:

The page being decompressed is immediately used to build filter masks and output columns.
Decompressed pages are cached for minimal time; after one pass (steps 1-6), the cache memory is released for the next pass.

This allows the cache to only hold 1 page at a time, and to immediately discard the previous page after it's used, significantly reducing the memory requirement for caching.
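The difference between the two pipelines can be illustrated with a toy Python model. It is a sketch under several assumptions and not the code from the PR: zlib stands in for LZ4/Zstd, pages are comma-joined strings, PageReader is a hypothetical helper that counts decompressions, and the two-phase version never skips pages in its second phase (a real reader can).

```python
import zlib


def make_pages(values, page_size):
    """Split values into pages and compress each one."""
    return [
        zlib.compress(",".join(values[i:i + page_size]).encode())
        for i in range(0, len(values), page_size)
    ]


class PageReader:
    """Decompresses pages on demand and counts how often it does so."""

    def __init__(self, pages):
        self.pages = pages
        self.decompress_calls = 0

    def read(self, index):
        self.decompress_calls += 1
        return zlib.decompress(self.pages[index]).decode().split(",")


def two_phase(reader, predicate):
    # Phase 1: build the filter mask over the whole column chunk.
    mask = []
    for p in range(len(reader.pages)):
        mask.extend(predicate(v) for v in reader.read(p))
    # Phase 2: decompress every page again to produce the output.
    out, row = [], 0
    for p in range(len(reader.pages)):
        for v in reader.read(p):
            if mask[row]:
                out.append(v)
            row += 1
    return out


def interleaved(reader, predicate):
    out = []
    for p in range(len(reader.pages)):
        cached_page = reader.read(p)  # the only page held in the cache
        mask = [predicate(v) for v in cached_page]
        out.extend(v for v, keep in zip(cached_page, mask) if keep)
        # cached_page is replaced on the next iteration
    return out


pages = make_pages(["office", "lab", "office", "home", "lab", "office"], page_size=2)
old_reader, new_reader = PageReader(pages), PageReader(pages)
old_out = two_phase(old_reader, lambda v: v == "office")
new_out = interleaved(new_reader, lambda v: v == "office")
# Same output, but 6 decompressions for two_phase versus 3 for interleaved.
```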

What pages are cached?¶

You may have noticed that only location is cached, not val, because val is only used for output. More generally, only columns that appear both in the filter and output are cached, and at most 1 page is cached for each such column.

More examples:

SELECT val 
FROM sensor_data 
WHERE date_time > '2025-03-11' AND location = 'office';

In this case, no columns are cached, because val is not used for filtering.

SELECT COUNT(*) 
FROM sensor_data 
WHERE date_time > '2025-03-11' AND location = 'office';

In this case, again, no columns are cached, because the output projection is empty after query plan optimization.

Then why cache 2 pages per column instead of 1?¶

This is another real-world nuance regarding how Parquet lays out the pages.

Parquet by default encodes data using dictionary encoding, which writes a dictionary page as the first page of a column chunk, followed by the keys referencing the dictionary.

You can see this in action using parquet-viewer:

Parquet viewer shows the page layout of a column chunk

This means that to decode a page of data, it actually references two pages: the dictionary page and the data page.

This is why it caches 2 pages per column: one dictionary page and one data page. The data page slot will move forward as it reads the data; but the dictionary page slot always references the first page.

Two cached pages: one for the dictionary (pinned), one for data (moves as it reads the data)
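A small Python sketch shows why both pages are needed to decode dictionary-encoded data. The page contents are made up, and the cache dictionary with its two slots is only an illustration of the idea, not the data structure used in the PR.

```python
# First page of the column chunk: the dictionary of distinct values.
dictionary_page = ["home", "lab", "office"]
# Subsequent data pages hold keys that index into the dictionary.
data_pages = [[2, 1, 2], [0, 1, 2]]


def decode_page(dictionary, keys):
    """Decoding a data page requires the dictionary page as well."""
    return [dictionary[k] for k in keys]


# Two cache slots: the dictionary stays pinned, the data slot moves forward.
cache = {"dictionary": dictionary_page, "data": None}
decoded = []
for page in data_pages:
    cache["data"] = page  # replaces the previously cached data page
    decoded.extend(decode_page(cache["dictionary"], cache["data"]))
# decoded is ["office", "lab", "office", "home", "lab", "office"]
```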

How does it perform?¶

Here are my results on ClickBench on my AMD 9900X machine. The total time is reduced by 15%, with Q23 being 2.24x faster, and queries that get slower are likely due to noise.

┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ no-pushdown ┃ new-pushdown ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │      0.47ms │       0.43ms │ +1.10x faster │
│ QQuery 1     │     51.10ms │      50.10ms │     no change │
│ QQuery 2     │     68.23ms │      64.49ms │ +1.06x faster │
│ QQuery 3     │     90.68ms │      86.73ms │     no change │
│ QQuery 4     │    458.93ms │     458.59ms │     no change │
│ QQuery 5     │    522.06ms │     478.50ms │ +1.09x faster │
│ QQuery 6     │     49.84ms │      49.94ms │     no change │
│ QQuery 7     │     55.09ms │      55.77ms │     no change │
│ QQuery 8     │    565.26ms │     556.95ms │     no change │
│ QQuery 9     │    575.83ms │     575.05ms │     no change │
│ QQuery 10    │    164.56ms │     178.23ms │  1.08x slower │
│ QQuery 11    │    177.20ms │     191.32ms │  1.08x slower │
│ QQuery 12    │    591.05ms │     569.92ms │     no change │
│ QQuery 13    │    861.06ms │     848.59ms │     no change │
│ QQuery 14    │    596.20ms │     580.73ms │     no change │
│ QQuery 15    │    554.96ms │     548.77ms │     no change │
│ QQuery 16    │   1175.08ms │    1146.07ms │     no change │
│ QQuery 17    │   1150.45ms │    1121.49ms │     no change │
│ QQuery 18    │   2634.75ms │    2494.07ms │ +1.06x faster │
│ QQuery 19    │     90.15ms │      89.24ms │     no change │
│ QQuery 20    │    620.15ms │     591.67ms │     no change │
│ QQuery 21    │    782.38ms │     703.15ms │ +1.11x faster │
│ QQuery 22    │   1927.94ms │    1404.35ms │ +1.37x faster │
│ QQuery 23    │   8104.11ms │    3610.76ms │ +2.24x faster │
│ QQuery 24    │    360.79ms │     330.55ms │ +1.09x faster │
│ QQuery 25    │    290.61ms │     252.54ms │ +1.15x faster │
│ QQuery 26    │    395.18ms │     362.72ms │ +1.09x faster │
│ QQuery 27    │    891.76ms │     959.39ms │  1.08x slower │
│ QQuery 28    │   4059.54ms │    4137.37ms │     no change │
│ QQuery 29    │    235.88ms │     228.99ms │     no change │
│ QQuery 30    │    564.22ms │     584.65ms │     no change │
│ QQuery 31    │    741.20ms │     757.87ms │     no change │
│ QQuery 32    │   2652.48ms │    2574.19ms │     no change │
│ QQuery 33    │   2373.71ms │    2327.10ms │     no change │
│ QQuery 34    │   2391.00ms │    2342.15ms │     no change │
│ QQuery 35    │    700.79ms │     694.51ms │     no change │
│ QQuery 36    │    151.51ms │     152.93ms │     no change │
│ QQuery 37    │    108.18ms │      86.03ms │ +1.26x faster │
│ QQuery 38    │    114.64ms │     106.22ms │ +1.08x faster │
│ QQuery 39    │    260.80ms │     239.13ms │ +1.09x faster │
│ QQuery 40    │     60.74ms │      73.29ms │  1.21x slower │
│ QQuery 41    │     58.75ms │      67.85ms │  1.15x slower │
│ QQuery 42    │     65.49ms │      68.11ms │     no change │
└──────────────┴─────────────┴──────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary           ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (no-pushdown)    │ 38344.79ms │
│ Total Time (new-pushdown)   │ 32800.50ms │
│ Average Time (no-pushdown)  │   891.74ms │
│ Average Time (new-pushdown) │   762.80ms │
│ Queries Faster              │         13 │
│ Queries Slower              │          5 │
│ Queries with No Change      │         25 │
└─────────────────────────────┴────────────┘
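The headline numbers can be re-derived from the table with a couple of lines of Python; the variable names are just labels for the table entries.

```python
# Totals from the benchmark summary (milliseconds).
total_no_pushdown_ms = 38344.79
total_new_pushdown_ms = 32800.50

# Relative reduction in total time: about 0.145, i.e. roughly 15%.
reduction = 1 - total_new_pushdown_ms / total_no_pushdown_ms

# Query 23 from the per-query table: about 2.24x faster.
q23_speedup = 8104.11 / 3610.76
```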

Conclusion¶

Despite being simple in theory, filter pushdown in Parquet is non-trivial to implement. It requires understanding both the Parquet format and reader implementation details. The challenge lies in efficiently navigating through the dynamics of decoding, filter evaluation, and memory management.

If you are interested in this level of optimization and want to help test, document and implement this type of optimization, come find us in the DataFusion Community. We would love to have you.

Apache DataFusion Comet 0.7.0 Release

2025-03-20T00:00:00+00:00

The Apache DataFusion PMC is pleased to announce version 0.7.0 of the Comet subproject.

Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for improved performance and efficiency without requiring any code changes.

Comet runs on commodity hardware and aims to provide 100% compatibility with Apache Spark. Any operators or expressions that are not fully compatible will fall back to Spark unless explicitly enabled by the user. Refer to the compatibility guide for more information.

This release covers approximately four weeks of development work and is the result of merging 46 PRs from 11 contributors. See the change log for more information.

Release Highlights¶

Performance¶

Comet 0.7.0 has improved performance compared to the previous release due to improvements in the native shuffle implementation and performance improvements in DataFusion 46.

For single-node TPC-H at 100 GB, Comet now delivers a greater than 2x speedup compared to Spark using the same CPU and RAM. Even with half the resources, Comet still provides a measurable performance improvement.

These benchmarks were performed on a Linux workstation with PCIe 5, AMD 7950X CPU (16 cores), 128 GB RAM, and data stored locally in Parquet format on NVMe storage. Spark was running in Kubernetes with hard memory limits.

Shuffle Improvements¶

There are several improvements to shuffle in this release:

When running in off-heap mode (which is the recommended approach), Comet was using the wrong memory allocator implementation for some types of shuffle operation, which could result in OOM rather than spilling to disk. This is now fixed.
The number of spill files is drastically reduced. In previous releases, each instance of ShuffleMapTask could potentially create a new spill file for each output partition each time that spill was invoked. Comet now creates a maximum of one spill file per output partition per instance of ShuffleMapTask, which is appended to in subsequent spills.
There was a flaw in the memory accounting which resulted in Comet requesting approximately twice the amount of memory that was needed, resulting in premature spilling. This is now resolved.
The metric for number of spilled bytes is now accurate. It was previously reporting invalid information.

Improved Hash Join Performance¶

When using the spark.comet.exec.replaceSortMergeJoin setting to replace sort-merge joins with hash joins, Comet will now do a better job of picking the optimal build side. Thanks to @hayman42 for suggesting this, and thanks to the Apache Gluten (incubating) project for the inspiration in implementing this feature.

Experimental Support for DataFusion’s Parquet Scan¶

Support should still be considered experimental, but most of Comet’s unit tests are now passing with the new reader. Known issues include handling of INT96 timestamps and unsigned bytes and shorts.

To enable DataFusion’s Parquet reader, either set spark.comet.scan.impl=native_datafusion or set the environment variable COMET_PARQUET_SCAN_IMPL=native_datafusion.

Complex Type Support¶

With DataFusion’s Parquet reader enabled, there is now some early support for reading structs from Parquet. This is not thoroughly tested yet. We would welcome additional testing from the community to help determine what is and isn’t working, as well as contributions to improve support for structs and other complex types. The tracking issue is https://github.com/apache/datafusion-comet/issues/1043.

Updates to supported Spark versions¶

Comet 0.7.0 is now tested against Spark 3.5.4 rather than 3.5.1.
This will be the last Comet release to support Spark 3.3.x.

Improved Tuning Guide¶

The Comet Tuning Guide has been improved and now provides guidance on determining how much memory to allocate to Comet.

Getting Involved¶

The Comet project welcomes new contributors. We use the same Slack and Discord channels as the main DataFusion project and have a weekly DataFusion video call.

There are also many good first issues waiting for contributions.

Parquet Pruning in DataFusion: Read Only What Matters

2025-03-20T00:00:00+00:00

Editor's Note: This blog was first published on Xiangpeng Hao's blog. Thanks to InfluxData for sponsoring this work as part of his PhD funding.

Apache Parquet has become the industry standard for storing columnar data, and reading Parquet efficiently -- especially from remote storage -- is crucial for query performance.

Apache DataFusion implements advanced Parquet pruning techniques to effectively read only the data that matters for a given query.

Achieving high performance adds complexity. This post provides an overview of the techniques used in DataFusion to selectively read Parquet files.

The pipeline¶

The diagram below illustrates the Parquet reading pipeline in DataFusion, highlighting how data flows through various pruning stages before being converted to Arrow format:

Background: Parquet file structure¶

As shown in the figure above, each Parquet file has multiple row groups. Each row group contains a set of columns, and each column contains a set of pages.

Pages are the smallest units of data in Parquet files and typically contain compressed and encoded values for a specific column. This hierarchical structure enables efficient columnar access and forms the foundation for the pruning techniques we'll discuss.

Check out Querying Parquet with Millisecond Latency for more details on the Parquet file structure.

1. Read metadata¶

DataFusion first reads the Parquet metadata to understand the data in the file. Metadata often includes data schema, the exact location of each row group and column chunk, and their corresponding statistics (e.g., min/max values). It also optionally includes page-level stats and Bloom filters. This information is used to prune the file before reading the actual data.

Fetching metadata requires up to two network requests: one to read the footer size from the end of the file, and another to read the footer itself.
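The two reads follow from the Parquet file layout: a file ends with the serialized metadata, a 4-byte little-endian metadata length, and the magic bytes PAR1. The Python sketch below performs the two reads against an in-memory stand-in for a file. The footer content here is a placeholder (real footers are Thrift-encoded), and read_footer is a hypothetical helper, not DataFusion code.

```python
import io
import struct

MAGIC = b"PAR1"


def read_footer(f):
    """Fetch the raw footer bytes using two reads from the end of the file."""
    # Read 1: the last 8 bytes hold the footer length and the magic bytes.
    f.seek(-8, io.SEEK_END)
    tail = f.read(8)
    if tail[4:] != MAGIC:
        raise ValueError("not a Parquet file")
    (footer_len,) = struct.unpack("<I", tail[:4])
    # Read 2: the footer itself sits right before those 8 bytes.
    f.seek(-(8 + footer_len), io.SEEK_END)
    return f.read(footer_len)


# Build a fake file: magic, column data, footer, footer length, magic.
footer = b"serialized-metadata"  # placeholder, not real Thrift metadata
fake_file = io.BytesIO(
    MAGIC + b"column-data" + footer + struct.pack("<I", len(footer)) + MAGIC
)
raw_footer = read_footer(fake_file)
```

Against remote storage each read becomes a network request, which is why a reader that guesses a large enough suffix size up front can sometimes fetch both pieces in a single request.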

Decoding metadata is generally fast since it only requires parsing a small amount of data. However, for tables with hundreds or thousands of columns, the metadata can become quite large and decoding it can become a bottleneck. This is particularly noticeable when scanning many small files.

Reading metadata is latency-critical, so DataFusion allows users to cache metadata through the ParquetFileReaderFactory trait.

2. Prune by projection¶

The simplest yet perhaps most effective pruning is to read only the columns that are needed. This is because queries usually don't select all columns, e.g., SELECT a FROM table only reads column a. As a columnar format, Parquet allows DataFusion to only read the columns that are needed.

This projection pruning happens at the column level and can dramatically reduce I/O when working with wide tables where queries typically access only a small subset of columns.

3. Prune by row group stats and Bloom filters¶

Each row group has basic stats like min/max values for each column. DataFusion applies the query predicates to these stats to prune row groups, e.g., SELECT * FROM table WHERE a > 10 will only read row groups where a has a max value greater than 10.
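The a > 10 example can be sketched in a few lines of Python. The RowGroup type and prune_greater_than helper are hypothetical and the statistics are made up; DataFusion's actual pruning handles arbitrary predicates, nulls, and missing statistics.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    index: int
    min_a: int
    max_a: int


def prune_greater_than(row_groups, threshold):
    """Keep only row groups that may contain a row with a > threshold."""
    return [rg for rg in row_groups if rg.max_a > threshold]


row_groups = [
    RowGroup(0, min_a=1, max_a=8),    # max <= 10: no row can match, skipped
    RowGroup(1, min_a=5, max_a=15),   # some rows may match, read
    RowGroup(2, min_a=20, max_a=30),  # all rows match, read
    RowGroup(3, min_a=10, max_a=10),  # max == 10 fails a > 10, skipped
]
to_read = [rg.index for rg in prune_greater_than(row_groups, 10)]
# to_read is [1, 2]
```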

Sometimes min/max stats are too simple to prune effectively, so Parquet also supports Bloom filters. DataFusion uses Bloom filters when available.

Bloom filters are particularly effective for equality predicates (WHERE a = 10) and can significantly reduce the number of row groups that need to be read for point queries or queries with highly selective predicates.
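For readers unfamiliar with Bloom filters, here is a generic textbook version in Python. Parquet defines its own filter layout and hashing, so this sketch only illustrates the key property that makes them safe for pruning: a lookup may return a false positive, but never a false negative. The BloomFilter class, its parameters, and the sample row groups are all made up for illustration.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter backed by a Python int used as a bit set."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, value):
        # Derive several bit positions per value by seeding the hash input.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


# One filter per row group, built over the values of column a.
row_group_values = [[1, 2, 3], [10, 11, 12]]
filters = []
for values in row_group_values:
    bloom = BloomFilter()
    for v in values:
        bloom.add(v)
    filters.append(bloom)

# For `WHERE a = 10`, only row groups whose filter says "possibly" are read.
to_read = [i for i, bloom in enumerate(filters) if bloom.might_contain(10)]
```

The row group that actually contains 10 is always in to_read; the other one is skipped unless the filter happens to report a false positive.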

4. Prune by page stats¶

Parquet optionally supports page-level stats -- similar to row group stats but more fine-grained. DataFusion implements page pruning when the stats are present.

Page-level pruning provides an additional layer of filtering after row group pruning. It allows DataFusion to skip individual pages within a row group, further reducing the amount of data that needs to be read and decoded.

5. Read from storage¶

Now we (hopefully) have pruned the Parquet file into small ranges of bytes, i.e., the Access Plan. The last step is to make requests to fetch those bytes and decode them into Arrow RecordBatch.

Preview of coming attractions: filter pushdown¶

So far we have discussed techniques that prune the Parquet file using only the metadata, i.e., before reading the actual data.

Filter pushdown, also known as predicate pushdown or late materialization, is a technique that prunes data during scanning, with filters being generated and applied in the Parquet reader.

Unlike metadata-based pruning which works at the row group or page level, filter pushdown operates at the row level, allowing DataFusion to filter out individual rows that don't match the query predicates during the decoding process.

DataFusion implements filter pushdown but has not enabled it by default due to some performance regressions.

We are working to remove the remaining performance issues and enable it by default, which we will discuss in the next blog post.

Conclusion¶

DataFusion employs a multi-step approach to Parquet pruning, from column projection to row group stats, page stats, and potentially row-level filtering. Each step may reduce the amount of data to be read and processed, significantly improving query performance.

Using Ordering for Better Plans in Apache DataFusion

2025-03-11T00:00:00+00:00

Introduction¶

In this blog post, we explain when an ordering requirement of an operator is satisfied by its input data. This analysis is essential for order-based optimizations and is often more complex than one might initially think.

The ordering requirement of an operator describes how the input data to that operator must be sorted for the operator to compute the correct result. It is the job of the planner to make sure that these requirements are satisfied during execution (see DataFusion's EnforceSorting for an implementation of such a rule).

This type of analysis is useful in several situations, such as the following.

Removing Unnecessary Sorts¶

Imagine a user wants to execute the following query:

SELECT hostname, log_line 
FROM telemetry ORDER BY time ASC limit 10

If we don't know anything about the telemetry table, we need to sort it by time ASC and then retrieve the first 10 rows to get the correct result. However, if the table is already ordered by time ASC, we can simply retrieve the first 10 rows. This approach executes much faster and uses less memory than re-sorting the entire table, even when the TopK operator is used.

To avoid the sort, the query optimizer must determine that the data is already sorted. For simple queries the analysis is straightforward, but it gets complicated fast. For example, what if your data is sorted by [hostname, time ASC] and your query is

SELECT hostname, log_line 
FROM telemetry WHERE hostname = 'app.example.com' ORDER BY time ASC;

In this case a sort still isn't needed, but the analysis must reason about the sortedness of the stream when it knows hostname has a single value.
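The reasoning can be checked on a small example. The sketch below uses made-up rows and hypothetical helper names: the input is sorted by (hostname, time) but not by time alone, yet once hostname is restricted to a single value the remaining rows are ordered by time.

```rust
/// Times of the rows whose hostname equals `host`.
/// `rows` holds (hostname, time) pairs, assumed sorted by [hostname ASC, time ASC].
fn times_for_host(rows: &[(&str, u32)], host: &str) -> Vec<u32> {
    rows.iter()
        .filter(|(h, _)| *h == host)
        .map(|(_, t)| *t)
        .collect()
}

/// True when the values are in non-decreasing order.
fn is_sorted_asc(values: &[u32]) -> bool {
    values.windows(2).all(|w| w[0] <= w[1])
}
```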

Optimized Operator Implementations¶

As another use case, some operators can use the ordering information to change their underlying algorithm and execute more efficiently. Consider the following query:

SELECT COUNT(log_line) 
FROM telemetry GROUP BY hostname;

Most analytic systems, including DataFusion, by default implement such a query using a hash table keyed on values of hostname to store the counts. However, if the telemetry table is sorted by hostname, there are much more efficient algorithms for grouping on hostname values than hashing every value and storing it in memory. These algorithms can only be used when the input is sorted correctly. To see this in practice, check out the source for the ordered variant of the aggregation in DataFusion.
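A minimal sketch of grouping on sorted input follows. It is not DataFusion's ordered aggregation: `count_sorted_groups` is a hypothetical function that counts rows per key (it ignores the NULL handling of COUNT(log_line)) and assumes the keys arrive already sorted, so equal keys are adjacent and no hash table is needed.

```rust
/// Count rows per group, assuming `sorted_keys` is sorted so equal keys are adjacent.
/// Only the most recent group is ever updated; earlier groups are final.
fn count_sorted_groups<'a>(
    sorted_keys: impl IntoIterator<Item = &'a str>,
) -> Vec<(&'a str, usize)> {
    let mut out: Vec<(&'a str, usize)> = Vec::new();
    for key in sorted_keys {
        if let Some((current, count)) = out.last_mut() {
            if *current == key {
                *count += 1;
                continue;
            }
        }
        // The key changed: the previous group is complete, so start a new one.
        out.push((key, 1));
    }
    out
}
```

Since a group is complete as soon as the key changes, a real operator could emit it immediately instead of collecting all groups, which is what makes this approach friendly to streaming execution.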

Streaming-Friendly Execution¶

Stream processing aims to produce results as soon as they become available, ensuring minimal latency for real-time workloads. However, some operators need to consume all input data before producing any output. Consider the Sort operation: before it can start generating output, the algorithm must first process all input data. As a result, data flow halts whenever such an operator is encountered until all input is consumed. When a physical query plan contains such an operator (Sort, CrossJoin, etc.), we refer to this as pipeline breaking, meaning the query cannot be executed in a streaming fashion.

For a query to be executed in a streaming fashion, it must satisfy two conditions:

Logically Streamable
It should be possible to generate what the user wants in a streaming fashion. Consider the following query:

SELECT SUM(amount)  
FROM orders

Here, the user wants to compute the sum of all amounts in the orders table. By the nature of the query, this requires scanning the entire table to generate a result, making it impossible to execute in a streaming fashion.

Streaming Aware Planner
Being logically streamable does not guarantee that a query will execute in a streaming fashion. SQL is a declarative language, meaning it specifies WHAT the user wants. It is up to the planner to decide HOW to generate the result. In most cases there are many ways to compute the correct result for a given query. The query planner is responsible for choosing one (ideally the best^*) among all the alternatives to generate what the user asks for. If a plan contains a pipeline-breaking operator, the execution will not be streaming, even if the query is logically streamable. To generate truly streaming plans from logically streamable queries, the planner must carefully analyze the existing orderings in the source tables to ensure that the final plan does not contain any pipeline-breaking operators.

Analysis¶

Let's start by creating an example table that we will refer to throughout the post. This table models the input data of an operator for the analysis:

Example Virtual Table¶

amount	price	hostname	currency	time_bin	time	price_cloned	time_cloned
12	25	app.example.com	USD	08:00:00	08:01:30	25	08:01:30
12	26	app.example.com	USD	08:00:00	08:11:30	26	08:11:30
15	30	app.example.com	USD	08:00:00	08:41:30	30	08:41:30
15	32	app.example.com	USD	08:00:00	08:55:15	32	08:55:15
15	35	app.example.com	USD	09:00:00	09:10:23	35	09:10:23
20	18	app.example.com	USD	09:00:00	09:20:33	18	09:20:33
20	22	app.example.com	USD	09:00:00	09:40:15	22	09:40:15

How can a table have multiple orderings? At first glance it may seem counterintuitive for a table to have more than one valid ordering. However, such scenarios can arise during query execution. For example, consider the following query:
SELECT time, date_bin('1 hour', time, '1970-01-01') as time_bin  
FROM table;
If we know that the table is ordered by time ASC we can infer that time_bin ASC is also a valid ordering. This is because the date_bin function is monotonic, meaning it preserves the order of its input. DataFusion leverages these functional dependencies to infer new orderings as data flows through different query operators. For details on the implementation see the source code.
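The monotonicity argument can be illustrated with a simplified stand-in for date_bin. In this sketch, `bin_to_hour` is a hypothetical function that truncates a timestamp in seconds to the start of its hour; because it never decreases when its input increases, applying it to time-sorted data yields sorted output.

```rust
/// Truncate a timestamp (in seconds) to the start of its hour.
/// A simplified stand-in for date_bin('1 hour', ...), not the real function.
fn bin_to_hour(ts_secs: i64) -> i64 {
    ts_secs.div_euclid(3600) * 3600
}

/// True when the values are in non-decreasing order.
fn is_sorted_asc(values: &[i64]) -> bool {
    values.windows(2).all(|w| w[0] <= w[1])
}

/// Apply the binning function to every timestamp, preserving row order.
fn bin_all(times: &[i64]) -> Vec<i64> {
    times.iter().map(|&t| bin_to_hour(t)).collect()
}
```

Note that the binned column contains repeated values, so it is sorted but, unlike time, cannot by itself determine a unique row order.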

By inspection, you can see this table is sorted by the amount column, but it is also sorted by time and time_bin as well as the compound (time_bin, amount) and many other variations. While this example is an extreme case, real-world data often has multiple sort orders.

A naive approach for analyzing whether the ordering requirement of an operator is satisfied by its input would be:

Store all the valid orderings that the table satisfies.
Check whether the ordering requirement of the operator is among the valid orderings.

This naive algorithm works and is correct. However, the list of all valid orderings can be very long: its size grows exponentially with the number of columns. For the example table, here is a (small) subset of the valid orderings:

[amount ASC]
[amount ASC, price ASC]
[amount ASC, price_cloned ASC]
[hostname ASC, amount ASC, price_cloned ASC]
[amount ASC, hostname ASC, price_cloned ASC]
[amount ASC, price_cloned ASC, hostname ASC]
.
.
.

As the listing above shows, storing all valid orderings is wasteful because the list contains significant redundancy. Here are some observations which suggest that we can do much better:

Storing a prefix of another valid ordering is redundant. If the table satisfies the lexicographic ordering¹: [amount ASC, price ASC], it already satisfies ordering [amount ASC] trivially. Hence, once we store [amount ASC, price ASC] storing [amount ASC] is redundant.
Using all columns that are equal to each other in the listings is redundant. If we know the table is ordered by [amount ASC, price ASC], it is also ordered by [amount ASC, price_cloned ASC] since price and price_cloned are copies of each other. It is enough to use just one expression from each set of expressions that are exact copies of each other.
Constant expressions can be inserted anywhere in a valid ordering with an arbitrary direction (e.g. ASC, DESC). Hence, if the table is ordered by [amount ASC, price ASC], it is also ordered by:
[hostname ASC, amount ASC, price ASC],
[hostname DESC, amount ASC, price ASC],
[amount ASC, hostname ASC, price ASC],
.
.

This is clearly redundant. For this reason, it is better to avoid explicitly encoding constant expressions in valid sort orders.

In summary,

We should store only the longest lexicographic ordering (shouldn't use any prefix of it)
Using expressions that are exact copies of each other is redundant.
Ordering expressions shouldn't contain any constant expression.

Key Concepts for Analyzing Orderings¶

To address the shortcomings above, DataFusion needs to keep track of the following properties for the table:

Constant Expressions
Equivalent Expression Groups (will be explained shortly)
Succinct Valid Orderings (will be explained shortly)

Note: These properties are implemented in the EquivalenceProperties structure in DataFusion. Please see the source for more details.

These properties allow us to analyze whether the ordering requirement is satisfied by the data already.

1. Constant Expressions¶

Constant expressions are expressions that evaluate to the same value for every row. Although constant expressions may seem odd in a table, they can arise after operations like Filter or Join.

For instance in the example table:

Columns hostname and currency are constant because every row in the table has the same value ('app.example.com' for hostname, and 'USD' for currency) for these columns.

Note: Constant expressions can arise during query execution. For example, in the following query:
SELECT hostname FROM logs
WHERE hostname='app.example.com'
after filtering is done, for subsequent operators the hostname column will be constant.

2. Equivalent Expression Groups¶

Equivalent expression groups are expressions that always hold the same value across rows. These expressions can be thought of as clones of each other and may arise from operations like Filter, Join, or Projection.

In the example table, the expressions price and price_cloned form one equivalence group, and time and time_cloned form another equivalence group.

Note: Equivalent expression groups can arise during the query execution. For example, in the following query:
SELECT time, time as time_cloned FROM logs
after the projection is done, for subsequent operators time and time_cloned will form an equivalence group. As another example, in the following query:
SELECT employees.id, employees.name, departments.department_name FROM employees JOIN departments ON employees.department_id = departments.id;
after joining, employees.department_id and departments.id will form an equivalence group.

3. Succinct Encoding of Valid Orderings¶

Valid orderings are the orderings that the table already satisfies. However, as discussed before, naively listing them requires exponential space as the number of columns grows. Instead, we list all valid orderings after the following constraints are applied:

Do not use any constant expressions in the valid ordering construction.
Use only one entry (by convention the first entry) from each equivalent expression group.
A lexicographic ordering shouldn't contain any leading ordering² except in the first position³.
Do not use any prefix of a valid lexicographic ordering⁴.

After applying the first and second constraint, the example table simplifies to

amount	price	time_bin	time
12	25	08:00:00	08:01:30
12	26	08:00:00	08:11:30
15	30	08:00:00	08:41:30
15	32	08:00:00	08:55:15
15	35	09:00:00	09:10:23
20	18	09:00:00	09:20:33
20	22	09:00:00	09:40:15

Following the third and fourth constraints for the simplified table, the succinct valid orderings are:
[amount ASC, price ASC],
[time_bin ASC],
[time ASC]

How can DataFusion find orderings?
DataFusion's CREATE EXTERNAL TABLE has a WITH ORDER clause (see docs) to specify the known orderings of the table during table creation. For example, the following query:
CREATE EXTERNAL TABLE source (
    amount INT NOT NULL,
    price DOUBLE NOT NULL,
    time TIMESTAMP NOT NULL,
    ...
)
STORED AS CSV
WITH ORDER (time ASC)
WITH ORDER (amount ASC, price ASC)
LOCATION '/path/to/FILE_NAME.csv'
OPTIONS ('has_header' 'true');
communicates that the source table has the orderings: [time ASC] and [amount ASC, price ASC].
When orderings are communicated from the source, DataFusion tracks the orderings through each operator while optimizing the plan. An operator may:

Add new orderings (such as when the date_bin function is applied to the time column)
Remove orderings, if the operation doesn't preserve the ordering of the data at its input
Update equivalence groups
Update constant expressions

Figure 1 shows an example of how DataFusion generates an efficient plan for the query:
SELECT 
  row_number() OVER (ORDER BY time) as rn,
  time
FROM events
ORDER BY rn, time
using the orderings of the query intermediates.

Figure 1: DataFusion analyzes orderings of the sources and query intermediates to generate efficient plans

Table Properties¶

In summary, for the example table, the following properties correctly describe the sort properties:

Constant Expressions = hostname, currency
Equivalent Expression Groups = [price, price_cloned], [time, time_cloned]
Valid Orderings = [amount ASC, price ASC], [time_bin ASC], [time ASC]

Algorithm for Analyzing Ordering Requirements¶

After deriving these properties for the data, the following algorithm can be used to check whether an ordering requirement is satisfied by the table:

Prune constant expressions: Remove any constant expressions from the ordering requirement.
Normalize the requirement: Replace each expression in the ordering requirement with the first entry from its equivalence group.
De-duplicate expressions: If an expression appears more than once, remove duplicates, keeping only the first occurrence.
Match leading orderings: Check whether the leading ordering requirement⁵ matches one of the leading valid orderings⁶ of the table. If so:
- Remove the leading ordering requirement from the ordering requirement.
- Remove the matching leading valid ordering from the valid orderings of the table.
Iterate through the remaining expressions: Go back to step 4 until the ordering requirement is empty or the leading ordering requirement is not found among the leading valid orderings of the table.

If, at the end of the procedure above, the ordering requirement is an empty list, we can conclude that the requirement is satisfied by the table.
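The procedure above can be sketched in a few dozen lines. This is a simplified illustration rather than DataFusion's EquivalenceProperties implementation: `TableProperties`, `SortExpr`, `Dir`, `ordering_satisfied`, and `example_table` are hypothetical names, expressions are plain column names, and the sketch assumes that every valid ordering whose leading expression matches is advanced.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Dir {
    Asc,
    Desc,
}

/// A sort expression: column name plus direction.
type SortExpr = (&'static str, Dir);

/// The three properties tracked for a table (hypothetical, simplified type).
struct TableProperties {
    constants: Vec<&'static str>,
    equivalence_groups: Vec<Vec<&'static str>>,
    valid_orderings: Vec<Vec<SortExpr>>,
}

fn ordering_satisfied(props: &TableProperties, requirement: &[SortExpr]) -> bool {
    // Steps 1-3: prune constants, normalize to the first entry of each
    // equivalence group, and keep only the first occurrence of each expression.
    let mut req: Vec<SortExpr> = Vec::new();
    for &(expr, dir) in requirement {
        if props.constants.contains(&expr) {
            continue;
        }
        let normalized = props
            .equivalence_groups
            .iter()
            .find(|group| group.contains(&expr))
            .map(|group| group[0])
            .unwrap_or(expr);
        if !req.iter().any(|(e, _)| *e == normalized) {
            req.push((normalized, dir));
        }
    }

    // Steps 4-5: repeatedly match the leading requirement against the leading
    // valid orderings. `consumed[i]` counts how many leading expressions of
    // valid ordering `i` have been removed so far.
    let mut consumed = vec![0usize; props.valid_orderings.len()];
    for lead in &req {
        let mut matched = false;
        for (ordering, n) in props.valid_orderings.iter().zip(consumed.iter_mut()) {
            if ordering.get(*n) == Some(lead) {
                *n += 1;
                matched = true;
            }
        }
        if !matched {
            return false;
        }
    }
    true
}

/// Properties of the example table used throughout the post.
fn example_table() -> TableProperties {
    use Dir::Asc;
    TableProperties {
        constants: vec!["hostname", "currency"],
        equivalence_groups: vec![
            vec!["price", "price_cloned"],
            vec!["time", "time_cloned"],
        ],
        valid_orderings: vec![
            vec![("amount", Asc), ("price", Asc)],
            vec![("time_bin", Asc)],
            vec![("time", Asc)],
        ],
    }
}
```

With these properties, a requirement such as [time_cloned ASC, hostname ASC] is satisfied, while [price ASC] alone is not, because price is never a leading valid ordering until amount has been matched.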

Example Walkthrough¶

Let's say the user provided a query such as the following:

SELECT * FROM table
ORDER BY hostname DESC, amount ASC, time_bin ASC, price_cloned ASC, time ASC, currency ASC, price DESC;

Assume the input has the same properties explained above:

Constant Expressions = hostname, currency
Equivalent Expression Groups = [price, price_cloned], [time, time_cloned]
Succinct Valid Orderings = [amount ASC, price ASC], [time_bin ASC], [time ASC]

To remove a sort, the optimizer must check whether the ordering requirement [hostname DESC, amount ASC, time_bin ASC, price_cloned ASC, time ASC, currency ASC, price DESC] is satisfied by the properties.

Algorithm Steps¶

Prune constant expressions:
Remove hostname and currency from the requirement. The requirement becomes:
[amount ASC, time_bin ASC, price_cloned ASC, time ASC, price DESC].
Normalize using equivalent groups:
Replace price_cloned with price (time_cloned would likewise be replaced with time, but it does not appear in this requirement). The requirement becomes:
[amount ASC, time_bin ASC, price ASC, time ASC, price DESC].
De-duplicate expressions:
Since price appears twice, we simplify the requirement to:
[amount ASC, time_bin ASC, price ASC, time ASC] (keeping the first occurrence from the left side).
Match leading orderings:
Check if leading ordering requirement amount ASC is among the leading valid orderings: amount ASC, time_bin ASC, time ASC. Since this is the case, we remove amount ASC from both the ordering requirement and the valid orderings of the table.
Iterate through the remaining expressions: Now, the problem is converted from
"whether the requirement: [amount ASC, time_bin ASC, price ASC, time ASC] is satisfied by valid orderings: [amount ASC, price ASC], [time_bin ASC], [time ASC]"
into
"whether the requirement: [time_bin ASC, price ASC, time ASC] is satisfied by valid orderings: [price ASC], [time_bin ASC], [time ASC]"
We go back to step 4 until the ordering requirement list is exhausted or its length no longer decreases.

At the end of the steps above, we end up with an empty ordering requirement list. Given this, we can conclude that the table satisfies the ordering requirement and thus no sort is required.

Conclusion¶

In this post, we described the conditions under which an ordering requirement is satisfied based on the properties of a table. We introduced key concepts such as constant expressions, equivalence groups, and valid orderings, and used them to determine whether a given ordering requirement is satisfied by an input table.

This analysis plays a crucial role in:

Choosing more efficient algorithm variants
Generating streaming-friendly plans

The DataFusion query engine employs this analysis (and many more) during its planning stage to ensure correct and efficient query execution. We welcome you to come and join the project.

Appendix¶

^[1]Lexicographic order is a way of ordering sequences (like strings, list of expressions) based on the order of their components, similar to how words are ordered in a dictionary. It compares each element of the sequences one by one, from left to right.

^[2]Leading ordering is the first ordering in a lexicographic ordering list. As an example, for the ordering: [amount ASC, price ASC], leading ordering will be: amount ASC.

^[3]This means that if we know that [amount ASC] and [time ASC] are both valid orderings for the table, we shouldn't list [amount ASC, time ASC] or [time ASC, amount ASC] as valid orderings. These orderings can be deduced if we know that the table satisfies the orderings [amount ASC] and [time ASC].

^[4]This means that if [amount ASC, price ASC] is a valid ordering for the table, we shouldn't list [amount ASC] as a valid ordering. Its validity can be deduced from the ordering [amount ASC, price ASC].

^[5]Leading ordering requirement is the first ordering requirement in the list of lexicographic ordering requirement expressions. As an example, for the requirement: [amount ASC, time_bin ASC, price ASC, time ASC], the leading ordering requirement is: amount ASC.

^[6]Leading valid orderings are the first ordering of each valid ordering list in the table. As an example, for the valid orderings: [amount ASC, price ASC], [time_bin ASC], [time ASC], the leading valid orderings will be: amount ASC, time_bin ASC, time ASC.

^*Best depends on the use case. DataFusion has many flags to communicate what the user thinks the best plan is (e.g. streamable, fastest, lowest memory, etc.). See configurations for details.

Apache DataFusion 45.0.0 Released

2025-02-20T00:00:00+00:00

Introduction¶

We are very proud to announce DataFusion 45.0.0. This blog highlights some of the many major improvements since we released DataFusion 40.0.0 and a preview of what the community is thinking about in the next 6 months. It has been an exciting period of development for DataFusion!

Apache DataFusion is an extensible query engine, written in Rust, that uses Apache Arrow as its in-memory format. DataFusion is used by developers to create new, fast data centric systems such as databases, dataframe libraries, machine learning and streaming applications. While DataFusion’s primary design goal is to accelerate the creation of other data centric systems, it has a reasonable experience directly out of the box as a dataframe library, python library and command line SQL tool.

Community Growth 📈¶

In the last 6 months, between 40.0.0 and 45.0.0, our community has continued to grow in new and exciting ways.

We added several PMC members and new committers: @jayzhan211 and @jonahgao joined the PMC, @2010YOUY01, @rachelint, @findpi, @iffyio, @goldmedal, @Weijun-H, @Michael-J-Ward and @korowa joined as committers. See the mailing list for more details.
In the core DataFusion repo alone we reviewed and accepted almost 1600 PRs from 206 different committers, created over 1100 issues and closed 751 of them 🚀. All changes are listed in the detailed changelogs.
DataFusion focused meetups happened in multiple cities around the world: Hangzhou, Belgrade, New York, Seattle, Chicago, Boston and Amsterdam as well as a Rust NYC meetup in NYC focused on DataFusion.

DataFusion has put in an application to be part of Google Summer of Code with a number of ideas for projects with mentors already selected. Additionally, some ideas on how to make DataFusion an ideal selection for university database projects such as the CMU database classes have been put forward.

In addition, DataFusion has been appearing publicly more and more, both online and offline. Here are some highlights:

A demonstration of how uwheel is integrated into DataFusion
Integrating StringView into DataFusion - part 1 and part 2
Building streams with DataFusion
Caching in DataFusion: Don't read twice
Parquet pruning in DataFusion: Read no more than you need
DataFusion is one of The 10 coolest open source software tools
Building databases over a weekend

Improved Performance 🚀¶

DataFusion hit a milestone in its development by becoming the fastest single node engine for querying Apache Parquet files in the ClickBench benchmark for the 43.0.0 release. A lot of work went into making this happen! While other engines have subsequently gotten faster, displacing DataFusion from the top spot, DataFusion remains near the top and we are planning more improvements.

Figure 1: ClickBench performance improved over 33% between DataFusion 33 (released Nov. 2023) and DataFusion 45 (released Feb. 2025).

The task of integrating the new Arrow StringView, which significantly improves performance for workloads that scan, filter and group by variable length string and binary data, was completed and enabled by default in the past 6 months. The improvement is especially pronounced for Parquet files due to upstream work in the parquet reader. Kudos to @XiangpengHong, @AriesDevil, @PsiACE, @Weijun-H, @a10y, and @RinChanNOWWW for driving this project.

Improved Quality 📋¶

DataFusion continues to improve overall in quality. In addition to ongoing bug fixes, one of the most exciting improvements in the last 6 months was the addition of the SQLite sqllogictest suite thanks to @Omega359. These tests run over 5 million sql statements on every push to the main branch.

Support for explicitly checking logical plan invariants was added by @wiedld which can help catch implicit changes that might cause problems during upgrades.

We have also started other quality initiatives to make it easier to use DataFusion based on GlareDB's experience along with more extensive prerelease testing.

Improved Documentation 📚¶

We continue to improve the documentation to make it easier to get started using DataFusion. During the last 6 months two projects were initiated to migrate the function documentation away from strictly static markdown files. First, @Omega359 worked to allow function documentation to be generated from code, and @jonathanc-n and others helped with the migration. Then @comphead led a project to create a doc macro that makes function documentation even easier to write. A special thanks to @Chen-Yuan-Lai for migrating many functions to the new syntax.

Additionally, the examples were refactored and cleaned up to improve their usefulness.

New Features ✨¶

There are too many new features in the last 6 months to list them all, but here are some highlights:

Functions¶

Uniform Window Functions: BuiltInWindowFunctions was removed and all now use UDFs (@jcsherin)
Uniform Aggregate Functions: BuiltInAggregateFunctions was removed and all now use UDFs
As mentioned above function documentation was extracted from the markdown files
Some new functions and sql support were added including 'show functions', 'to_local_time', 'regexp_count', 'map_extract', 'array_distance', 'array_any_value', 'greatest', 'least', 'arrays_overlap'

FFI¶

Foreign Function Interface work has started. This should allow for using table providers across languages and versions of DataFusion. This is especially pertinent for integration with delta-rs and other table formats.

Materialized Views¶

@suremarc has added a materialized view implementation in datafusion-contrib 🚀

Substrait¶

A lot of work was put into improving and enhancing substrait support (@Blizzara, @westonpace, @tokoko, @vbarua, @LatrecheYasser, @notfilippo and others)

Looking Ahead: The Next Six Months 🔭¶

One of the long term goals of @alamb, DataFusion's PMC chair, has been to have 1000 DataFusion based projects. This may be the year that happens!

The community has been discussing what we will work on in the next six months. Some major initiatives are likely to be:

Performance: A number of items have been identified as areas that could use additional work
Memory usage: Tracking and improving memory usage, statistics and spilling to disk
Google Summer of Code (GSOC): If DataFusion is selected as a project, as we hope, we will start accepting and supporting student projects
FFI: Extending the FFI implementation to support all types of UDFs and SessionContext
Spark Functions: A proposal has been made to add a crate covering spark compatible builtin functions

How to Get Involved¶

DataFusion is not a project built or driven by a single person, company, or foundation. Rather, our community of users and contributors work together to build a shared technology that none of us could have built alone.

If you are interested in joining us we would love to have you. You can try out DataFusion on some of your own data and projects and let us know how it goes, contribute suggestions, documentation, bug reports, or a PR with documentation, tests or code. A list of open issues suitable for beginners is here and you can find how to reach us on the communication doc.

Apache DataFusion Comet 0.6.0 Release

2025-02-17T00:00:00+00:00

The Apache DataFusion PMC is pleased to announce version 0.6.0 of the Comet subproject.

Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for improved performance and efficiency without requiring any code changes.

This release covers approximately four weeks of development work and is the result of merging 39 PRs from 12 contributors. See the change log for more information.

Starting with this release, we now plan on releasing new versions of Comet more frequently, typically within 1-2 weeks of each major DataFusion release. The main motivation for this change is to better support downstream Rust projects that depend on the datafusion_comet_spark_expr crate.

Release Highlights¶

DataFusion Upgrade¶

Comet 0.6.0 uses DataFusion 45.0.0

New Features¶

Comet now supports array_join, array_intersect, and arrays_overlap. Note that these expressions are not yet guaranteed to be 100% compatible with Spark for all input data types, so these expressions are only enabled with the configuration setting spark.comet.expression.allowIncompatible=true.

Performance & Stability¶

Metrics from native execution are now updated in Spark every 3 seconds by default, rather than for each batch being processed. The mechanism for passing the metrics via JNI is also more efficient.
New memory pool options "fair unified" and "unbounded" have been added. See the Comet Tuning Guide for more information.

Bug Fixes¶

Hashing of decimal values with precision <= 18 is now compatible with Spark
Comet falls back to Spark when hashing decimals with precision > 18

Getting Involved¶

The Comet project welcomes new contributors. We use the same Slack and Discord channels as the main DataFusion project and have a weekly DataFusion video call.

There are also many good first issues waiting for contributions.

Apache DataFusion Ballista 43.0.0 Released

2025-02-02T00:00:00+00:00

We are pleased to announce version 43.0.0 of DataFusion Ballista. Ballista allows existing DataFusion applications to be scaled out on a cluster for use cases that are not practical to run on a single node.

Highlights of this release¶

Seamless Integration with DataFusion¶

The primary objective of this release has been to achieve a more seamless integration with the DataFusion ecosystem and try to achieve the same level of flexibility as DataFusion.

In recent months, our development efforts have been directed toward providing a robust and extensible Ballista API. This new API empowers end-users to tailor Ballista's core functionality to their specific use cases. As a result, we have deprecated several experimental features from the Ballista core, allowing users to reintroduce them as custom extensions outside the core framework. This shift reduces the maintenance burden on Ballista's core maintainers and paves the way for optional features, such as delta-rs support, to be added externally when needed.

The most significant enhancement in this release is the deprecation of BallistaContext, which has been superseded by the DataFusion SessionContext. This change enables DataFusion applications written in Rust to execute on a Ballista cluster with minimal modifications. Beyond simplifying migration and reducing maintenance overhead, this update introduces distributed write functionality to Ballista for the first time, significantly enhancing its capabilities.

use ballista::prelude::*;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {

  // Instead of creating classic SessionContext
  // let ctx = SessionContext::new();

  // create DataFusion SessionContext with ballista standalone cluster started
  // let ctx = SessionContext::standalone().await?;

  // create DataFusion SessionContext with ballista remote cluster started
  // (the constructor returns a Result, so propagate errors with `?`)
  let ctx = SessionContext::remote("df://localhost:50050").await?;

  // register the table
  ctx.register_csv("example", "tests/data/example.csv", CsvReadOptions::new()).await?;

  // create a plan to run a SQL query
  let df = ctx.sql("SELECT a, MIN(b) FROM example WHERE a <= b GROUP BY a LIMIT 100").await?;

  // execute and print results
  df.show().await?;
  Ok(())
}

Additionally, Ballista’s versioning scheme has been aligned with that of DataFusion, ensuring that Ballista's version number reflects the compatible DataFusion version.

At the moment there is still a feature gap between DataFusion and Ballista, which we will try to bridge in the future.

Removal of Experimental Features¶

Ballista had grown in scope to include several experimental features in various states of completeness. Some features have been removed from this release in an effort to strip Ballista back to its core and make it easier to maintain and extend.

Specifically, the caching subsystem, predefined object store registry, plugin subsystem, key-value stores for persistent scheduler state, and the UI have been removed.

Performance & Scalability¶

Ballista has benefited significantly from the advancements made in the DataFusion project over the past year. Benchmark results demonstrate notable improvements in performance, highlighting the impact of these enhancements:

Per query comparison:

Relative speedup:

The overall speedup is 2.9x

New Logo¶

Ballista now has a new logo, which is visually similar to other DataFusion projects.

Roadmap¶

Moving forward, Ballista will adopt the same release cadence as DataFusion, providing synchronized updates across the ecosystem. Currently, there is no established long-term roadmap for Ballista. A plan will be formulated in the coming months based on community feedback and the availability of additional maintainers.

In the short term, development efforts will concentrate on closing the feature gap between DataFusion and Ballista. Key priorities include implementing support for INSERT INTO, enabling table URL functionality, and achieving deeper integration with the Python ecosystem.

Apache DataFusion Comet 0.5.0 Release

2025-01-17T00:00:00+00:00

The Apache DataFusion PMC is pleased to announce version 0.5.0 of the Comet subproject.

Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for improved performance and efficiency without requiring any code changes.

This release covers approximately 8 weeks of development work and is the result of merging 69 PRs from 15 contributors. See the change log for more information.

Release Highlights¶

Performance¶

Comet 0.5.0 achieves a 1.9x speedup for single-node TPC-H @ 100 GB, an improvement from 1.7x in the previous release.

More benchmarking results can be found in the Comet Benchmarking Guide.

Shuffle Improvements¶

Comet now supports multiple compression algorithms for compressing shuffle files. Previously, only ZSTD was supported but Comet now also supports LZ4 and Snappy. The default is now LZ4, which matches the default in Spark. ZSTD may be a better choice when the compression ratio is more important than CPU overhead.

Previously, Comet used Arrow IPC to encode record batches into shuffle files. Although Arrow IPC is a good general-purpose framework for serializing Arrow record batches, we found that we could get better performance using a custom serialization approach optimized for Comet. One optimization is that the schema is encoded once per shuffle operation rather than once per batch. There are some planned performance improvements in the Rust implementation of Arrow IPC and Comet may switch back to Arrow IPC in the future.
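
To make the schema optimization concrete, here is a toy sketch of why writing the schema once per shuffle is smaller than writing it in front of every batch. It is not Comet's actual wire format: the JSON schema header, the length-prefixed framing, and the helper names (frame, encode_batch, schema_per_batch, schema_once) are all assumptions made for illustration.

```python
import json
import struct

# Hypothetical schema and batches; Comet's real format is different.
SCHEMA = {"fields": [{"name": "l_partkey", "type": "int64"},
                     {"name": "l_suppkey", "type": "int64"}]}
BATCHES = [[(1, 2), (3, 4)], [(5, 6)], [(7, 8), (9, 10)]]


def frame(payload: bytes) -> bytes:
    """Prefix a payload with its length as a 4-byte little-endian integer."""
    return struct.pack("<I", len(payload)) + payload


def encode_batch(rows) -> bytes:
    """Pack each (int64, int64) row into 16 bytes."""
    return b"".join(struct.pack("<qq", a, b) for a, b in rows)


def schema_per_batch(batches) -> bytes:
    """Write the schema header in front of every batch."""
    header = frame(json.dumps(SCHEMA).encode())
    return b"".join(header + frame(encode_batch(b)) for b in batches)


def schema_once(batches) -> bytes:
    """Write the schema header a single time, followed by all batches."""
    header = frame(json.dumps(SCHEMA).encode())
    return header + b"".join(frame(encode_batch(b)) for b in batches)


per_batch_size = len(schema_per_batch(BATCHES))
once_size = len(schema_once(BATCHES))
saved = per_batch_size - once_size
print(per_batch_size, once_size, saved)
```

With three batches the second layout saves exactly two schema headers, and the saving grows with the number of batches in the shuffle.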

Comet provides two shuffle implementations. Comet native shuffle is the fastest and performs repartitioning in native code. Comet columnar shuffle delegates to Spark to perform repartitioning and is used in cases where native shuffle is not supported, such as with RangePartitioning. Comet generally tries to use native shuffle first, then columnar shuffle, and finally falls back to Spark if neither is supported. There was a bug in previous releases where Comet would sometimes fall back to Spark shuffle if native shuffle was not supported and missed opportunities to use columnar shuffle. This bug was fixed in this release but currently requires the configuration setting spark.comet.exec.shuffle.fallbackToColumnar=true. This will be enabled by default in the next release.
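
The intended selection order described above can be sketched as a small decision function. This is only an illustration of the order of preference, not Comet's planner code: the Shuffle enum, choose_shuffle, and its two boolean parameters are hypothetical names.

```python
from enum import Enum


class Shuffle(Enum):
    COMET_NATIVE = "comet native shuffle"
    COMET_COLUMNAR = "comet columnar shuffle"
    SPARK = "spark shuffle"


def choose_shuffle(native_supported: bool, columnar_supported: bool) -> Shuffle:
    """Pick the fastest implementation that supports the operation."""
    if native_supported:
        return Shuffle.COMET_NATIVE
    if columnar_supported:
        return Shuffle.COMET_COLUMNAR
    return Shuffle.SPARK


# Example: a partitioning scheme that native shuffle does not support,
# but columnar shuffle does (the post names RangePartitioning as one).
range_partitioning_choice = choose_shuffle(
    native_supported=False, columnar_supported=True
)
print(range_partitioning_choice.value)
```

The bug described above corresponds to skipping the middle branch in some cases, so that an operation went straight from native shuffle to Spark shuffle.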

Memory Management¶

Comet 0.4.0 required Spark to be configured to use off-heap memory. In this release it is no longer required and there are multiple options for configuring Comet to use on-heap memory instead. More details are available in the Comet Tuning Guide.

Spark SQL Metrics¶

Comet now provides detailed metrics for native shuffle, showing time for repartitioning, encoding and compressing, and writing to disk.

Crate Reorganization¶

One of the goals of the Comet project is to make Spark-compatible functionality available to other projects that are based on DataFusion. In this release, many implementations of Spark-compatible expressions were moved from the unpublished datafusion-comet crate, which provides the native part of the Spark plugin, into the datafusion-comet-spark-expr crate. There is also ongoing work to reorganize this crate to move expressions into subfolders named after the group name that Spark uses to organize expressions. For example, there are now subfolders named agg_funcs, datetime_funcs, hash_funcs, and so on.

Update on Complex Type Support¶

Good progress has been made with proof-of-concept work using DataFusion’s ParquetExec, which has the advantage of supporting complex types. This work is available on the comet-parquet-exec branch, and the current focus is on fixing test regressions, particularly regarding timestamp conversion issues.

Getting Involved¶

The Comet project welcomes new contributors. We use the same Slack and Discord channels as the main DataFusion project and have a weekly DataFusion video call.

There are also many good first issues waiting for contributions.

Apache DataFusion Python 43.1.0 Released

2024-12-14T00:00:00+00:00

We are happy to announce that datafusion-python 43.1.0 has been released. This release brings in all of the new features of the core DataFusion 43.0.0 library. Since the last blog post for datafusion-python 40.1.0, a large number of improvements have been made that can be found in the changelogs.

We would like to point out four features that are particularly noteworthy.

Arrow PyCapsule import and export
User-Defined Window Functions
Foreign Table Providers
String View performance enhancements

Arrow PyCapsule import and export¶

Arrow has a stable C interface for moving data between different libraries, but difficulties sometimes arise when different Python libraries expose this interface through different methods, requiring developers to write function calls for each library they are attempting to work with. A better approach is to use the Arrow PyCapsule Interface, which gives a consistent method for exposing these data structures across libraries.

In PR #825, we introduced support for both importing and exporting Arrow data in datafusion-python. With this improvement, you can now use a single function call to import a table from any Python library that implements the Arrow PyCapsule Interface. Many popular libraries, such as Pandas and Polars, already support these interfaces.

Suppose you have Pandas and Polars DataFrames named df_pandas and df_polars, respectively:

from datafusion import SessionContext

ctx = SessionContext()
df_dfn1 = ctx.from_arrow(df_pandas)
df_dfn1.show()

df_dfn2 = ctx.from_arrow(df_polars)
df_dfn2.show()

One great thing about using this interface is that any new library that is developed using these stable interfaces will work out of the box with DataFusion!
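
The reason a single call can accept objects from different libraries is that the Arrow PyCapsule Interface is a duck-typed protocol: producers expose dunder methods such as `__arrow_c_stream__`, and consumers look for them. The sketch below illustrates that lookup in plain Python. It is not the logic inside ctx.from_arrow; arrow_capsule_methods and FakeTable are hypothetical names used only for this example.

```python
def arrow_capsule_methods(obj) -> list[str]:
    """Return the Arrow PyCapsule dunder methods that obj exposes."""
    names = ("__arrow_c_schema__", "__arrow_c_array__", "__arrow_c_stream__")
    return [name for name in names if callable(getattr(obj, name, None))]


class FakeTable:
    """Hypothetical stand-in for a DataFrame from any compliant library."""

    def __arrow_c_stream__(self, requested_schema=None):
        raise NotImplementedError("a real library returns a PyCapsule here")


found = arrow_capsule_methods(FakeTable())
missing = arrow_capsule_methods(object())
print(found, missing)
```

Any object for which the first call returns a non-empty list is a candidate for import, regardless of which library produced it.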

Additionally, DataFusion DataFrames allow for exporting via the PyCapsule interface. For example, converting a DataFrame to a PyArrow table is simply:

import pyarrow as pa
table = pa.table(df)

User-Defined Window Functions¶

In datafusion-python 42.0.0 we released support for User-Defined Window Functions in PR #880. For a detailed description of how these work, please see the online documentation for all user-defined functions. Additionally, the examples folder contains a complete example demonstrating the four different modes of operation of window functions within DataFusion.

Foreign Table Providers¶

In the core DataFusion 43.0.0 release, support was added for a Foreign Function Interface to table providers. This creates a stable way of sharing functionality across different libraries, similar to how the Arrow C data interface operates. This enables libraries such as delta lake and datafusion-contrib to write their own table providers in Rust and expose them in Python without requiring a Rust dependency on datafusion-python. This is important because it allows these libraries to operate with datafusion-python regardless of which version of datafusion they were built against.

Implementing this feature in a table provider is quite simple. There is a complete example in the examples folder, but the relevant code is here, exposed as a Python function via pyo3:

    fn __datafusion_table_provider__<'py>(
        &self,
        py: Python<'py>,
    ) -> PyResult<Bound<'py, PyCapsule>> {
        let name = CString::new("datafusion_table_provider").unwrap();

        let provider = self
            .create_table()
            .map_err(|e| PyRuntimeError::new_err(e.to_string()))?;
        let provider = FFI_TableProvider::new(Arc::new(provider), false);

        PyCapsule::new_bound(py, provider, Some(name.clone()))
    }

That's it! All of the work of converting the table provider to use the FFI interface is performed by the core library.

String View performance enhancements¶

In the core DataFusion 43.0.0 release, the option to enable StringView by default was turned on. This leads to some significant performance enhancements, but it may require some changes to users of datafusion-python.

To learn more about the excellent work on this feature please read part 1 and part 2 of the blog post describing how these enhancements can lead to 20-200% performance gains in some tests.

During our testing we identified some cases where we needed to adjust workflows to account for the fact that StringView is now the default type for string based operations. First, when performing manipulations on string objects there is a performance loss when needing to cast from string to string view or vice versa. To reap the best performance, ideally all of your string type data will use StringView. For most users this should be transparent. However if you specify a schema for reading or creating data, then you likely need to change from pa.string() to pa.string_view(). For our testing, this primarily happens during data loading operations and in unit tests.

If you wish to disable StringView as the default type to retain the old approach, you can do so following this example:

from datafusion import SessionContext
from datafusion import SessionConfig
config = SessionConfig({"datafusion.execution.parquet.schema_force_view_types": "false"})
ctx = SessionContext(config=config)

Appreciation¶

We would like to thank everyone who has helped with these releases through their helpful conversations, code review, issue descriptions, and code authoring. We would especially like to thank the following authors of PRs who made these releases possible, listed in alphabetical order by username: @andygrove, @drauschenbach, @emgeee, @ion-elgreco, @jcrist, @kosiew, @mesejo, @Michael-J-Ward, and @sir-sigurd.

Thank you!

Get Involved¶

The DataFusion Python team is an active and engaging community and we would love to have you join us and help the project.

Here are some ways to get involved:

Learn more by visiting the DataFusion Python project page.
Try out the project and provide feedback, file issues, and contribute code.

Apache DataFusion Comet 0.4.0 Release

2024-11-20T00:00:00+00:00

The Apache DataFusion PMC is pleased to announce version 0.4.0 of the Comet subproject.

Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for improved performance and efficiency without requiring any code changes.

This release covers approximately six weeks of development work and is the result of merging 51 PRs from 10 contributors. See the change log for more information.

Release Highlights¶

Performance & Stability¶

There are a number of performance and stability improvements in this release. Here is a summary of some of the larger changes. Current benchmarking results can be found in the Comet Benchmarking Guide.

Unified Memory Management¶

Comet now uses a unified memory management approach that shares an off-heap memory pool with Apache Spark, resulting in a much simpler configuration. Comet now requires spark.memory.offHeap.enabled=true. This approach provides a holistic view of memory usage in Spark and Comet and makes it easier to optimize system performance.

Faster Joins¶

Apache Spark supports sort-merge and hash joins, which have similar performance characteristics. Spark defaults to using sort-merge joins because they are less likely to result in OutOfMemory exceptions. In vectorized query engines such as DataFusion, hash joins outperform sort-merge joins. Comet now has an experimental feature to replace Spark sort-merge joins with hash joins for improved performance. This feature is experimental because there is currently no spill-to-disk support in the hash join implementation. This feature can be enabled by setting spark.comet.exec.replaceSortMergeJoin=true.
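
As a rough illustration, the settings named in this post could be collected and rendered as spark-submit arguments like this. The off-heap size is a placeholder assumption, to_spark_submit_args is a hypothetical helper, and the settings needed to load the Comet plugin itself are deliberately omitted.

```python
# Settings named in this post; the off-heap size is a placeholder value.
comet_conf = {
    "spark.memory.offHeap.enabled": "true",
    "spark.memory.offHeap.size": "16g",  # assumption: size this for your workload
    "spark.comet.exec.replaceSortMergeJoin": "true",  # experimental, no spill
}


def to_spark_submit_args(conf: dict[str, str]) -> list[str]:
    """Render a config mapping as spark-submit --conf arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


spark_submit_args = to_spark_submit_args(comet_conf)
print(" ".join(spark_submit_args))
```

Because the hash join cannot spill to disk yet, it is worth trying this setting on a representative workload before enabling it broadly.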

Bloom Filter Aggregates¶

Spark’s optimizer can insert Bloom filter aggregations and filters to prune large result sets before a shuffle. Previously, Comet supported Bloom filter testing natively but would fall back to Spark for the aggregation. Comet now has native support for Bloom filter aggregations as well. Users no longer need to set spark.sql.optimizer.runtime.bloomFilter.enabled=false when using Comet.
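
For readers unfamiliar with the technique, the sketch below shows the two halves involved: an aggregate step that folds join keys into a Bloom filter, and a test step that uses the filter to discard rows that cannot match. It is a minimal teaching implementation, not the one used by Spark or Comet; the BloomFilter class, its SHA-256 based hashing, and the sizes are assumptions chosen for brevity.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, item) -> list[int]:
        """Derive num_hashes bit positions for an item."""
        positions = []
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            positions.append(int.from_bytes(digest[:8], "little") % self.num_bits)
        return positions

    def add(self, item) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item) -> bool:
        """False means definitely absent; True means possibly present."""
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


# "Aggregate" step: fold the join keys of the small side into one filter.
small_side_keys = [1530, 6530, 5618]
bloom = BloomFilter()
for key in small_side_keys:
    bloom.add(key)

# "Test" step: drop rows from the large side that cannot possibly match.
large_side_keys = list(range(1500, 1600))
survivors = [k for k in large_side_keys if bloom.might_contain(k)]
print(len(large_side_keys), len(survivors))
```

A Bloom filter never produces false negatives, so every real match survives the test step; a few non-matching rows may also survive and are removed later by the join itself.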

Complex Type support¶

This release has the following improvements to complex type support:

Implemented ArrayAppend and GetArrayStructFields
Implemented native cast between structs
Implemented native cast from structs to string

Roadmap¶

One of the highest priority items on the roadmap is to add support for reading complex types (maps, structs, and arrays) from Parquet sources, both when reading Parquet directly and from Iceberg.

Comet currently has proprietary native code for decoding Parquet pages, native column readers for all of Spark’s primitive types, and special handling for Spark-specific use cases such as timestamp rebasing and decimal type promotion. This implementation does not yet support complex types. File IO, decryption, and decompression are handled in JVM code, and Parquet pages are passed on to native code for decoding.

Rather than add complex type support to this existing code, we are exploring two main options to allow us to leverage more of the upstream Arrow and DataFusion code.

Use DataFusion’s ParquetExec¶

For use cases where DataFusion can support reading a Parquet source, Comet could create a native plan that uses DataFusion’s ParquetExec. We are investigating using DataFusion’s SchemaAdapter to handle some Spark-specific handling of timestamps and decimals.

Use Arrow’s Parquet Batch Reader¶

For use cases not supported by DataFusion’s ParquetExec, such as integrating with Iceberg, we are exploring replacing our current native Parquet decoding logic with the Arrow readers provided by the Parquet crate.

Iceberg already provides a vectorized Spark reader for Parquet. A PR is open against Iceberg for adding a native version based on Comet, and we hope to update this to leverage the improvements outlined above.

Getting Involved¶

The Comet project welcomes new contributors. We use the same Slack and Discord channels as the main DataFusion project and have a weekly DataFusion video call.

There are also many good first issues waiting for contributions.

Comparing approaches to User Defined Functions in Apache DataFusion using Python

2024-11-19T00:00:00+00:00

Personal Context¶

For a few months now I’ve been working with Apache DataFusion, a fast query engine written in Rust. From my experience the language that nearly all data scientists are working in is Python. In general, data scientists often use Pandas for in-memory tasks and PySpark for larger tasks that require distributed processing.

In addition to DataFusion, there is another Rust based newcomer to the DataFrame world, Polars. The latter is growing extremely fast, and it serves many of the same use cases as DataFusion. For my use cases, I'm interested in DataFusion because I want to be able to build small scale tests rapidly and then scale them up to larger distributed systems with ease. I do recommend evaluating Polars for in-memory work.

Personally, I would love a single query approach that is fast for both in-memory usage and can extend to large batch processing to exploit parallelization. I think DataFusion, coupled with Ballista or DataFusion-Ray, may provide this solution.

As I’m testing, I’m primarily limiting my work to the datafusion-python project, a wrapper around the Rust DataFusion library. This wrapper gives you the speed advantages of keeping all of the data in the Rust implementation and the ergonomics of working in Python. Personally, I would prefer to work purely in Rust, but I also recognize that since the industry works in Python we should meet the people where they are.

User-Defined Functions¶

The focus of this post is User-Defined Functions (UDFs). The DataFusion library gives a lot of useful functions already for doing DataFrame manipulation. These are going to be similar to those you find in other DataFrame libraries. You’ll be able to do simple arithmetic, create substrings of columns, or find the average value across a group of rows. These cover most of the use cases you’ll need in a DataFrame.

However, there will always arise times when you want a custom function. With UDFs you open a world of possibilities in your code. Sometimes there simply isn’t an easy way to use built-in functions to achieve your goals.

In the following, I’m going to demonstrate two example use cases. These are based on real world problems I’ve encountered. Also I want to demonstrate the approach of “make it work, make it work well, make it work fast” that is a motto I’ve seen thrown around in data science.

I will demonstrate three approaches to writing UDFs. In order of increasing performance they are:

Writing a pure Python function to do your computation
Using the PyArrow libraries in Python to accelerate your processing
Writing a UDF in Rust and exposing it to Python

Additionally, I will demonstrate two variants of the Rust approach. The first will be nearly identical to the PyArrow approach, to simplify understanding how to connect the Rust code to Python. In the second, we will iterate through the input arrays ourselves, which gives the user even greater flexibility.

Here are the two example use cases, taken from my own work but generalized.

Use Case 1: Scalar Function¶

I have a DataFrame and a list of tuples that I’m interested in. I want to filter the DataFrame so that it only contains rows whose values in certain columns match one of those tuples.

To give a concrete example, we will use data generated for the TPC-H benchmarks. Suppose I have a table of sales line items. There are many columns, but I am interested in three: a part key (l_partkey), supplier key (l_suppkey), and return status (l_returnflag). I want only to return a DataFrame with a specific combination of these three values. That is, I want to know if part number 1530 from supplier 4031 was sold (not returned), so I want a specific combination of l_partkey = 1530, l_suppkey = 4031, and l_returnflag = 'N'. I have a small handful of these combinations I want to return.

Probably the most ergonomic way to do this without a UDF is to turn that list of tuples into a DataFrame itself, perform a join, and select the columns from the original DataFrame. If we were working in PySpark we would probably broadcast join the DataFrame created from the tuple list since it is tiny. In practice, I have found that with some DataFrame libraries performing a filter rather than a join can be significantly faster. This is worth profiling for your specific use case.
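
Conceptually, the join-based alternative is a semi join: build a hash set from the small list of tuples and probe it with the key columns of each row. Here is a plain Python sketch of that idea. The rows, the l_orderkey column, and the variable names are made up for illustration, and a real DataFrame library would do this on columnar data rather than on dictionaries.

```python
# Hypothetical line items; only the three key columns matter for matching.
rows = [
    {"l_orderkey": 1, "l_partkey": 1530, "l_suppkey": 4031, "l_returnflag": "N"},
    {"l_orderkey": 2, "l_partkey": 1530, "l_suppkey": 4031, "l_returnflag": "R"},
    {"l_orderkey": 3, "l_partkey": 6530, "l_suppkey": 1531, "l_returnflag": "N"},
    {"l_orderkey": 4, "l_partkey": 42, "l_suppkey": 7, "l_returnflag": "N"},
]

# Build side: the tiny table of combinations we care about.
wanted = {(1530, 4031, "N"), (6530, 1531, "N")}
key_columns = ("l_partkey", "l_suppkey", "l_returnflag")

# Probe side: keep every original row whose key tuple is in the set.
matched = [row for row in rows if tuple(row[c] for c in key_columns) in wanted]
matched_orders = [row["l_orderkey"] for row in matched]
print(matched_orders)
```

Only rows 1 and 3 survive; row 2 has the right part and supplier but was returned, so its tuple is not in the set.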

Use Case 2: Aggregate Function¶

I have a DataFrame with many values that I want to aggregate. I have already analyzed it and determined there is a noise level below which I do not want to include in my analysis. I want to compute a sum of only values that are above my noise threshold.

This can be done fairly easily without leaning on a User Defined Aggregate Function (UDAF). You can simply filter the DataFrame and then aggregate using the built-in sum function. Here, we implement it as a UDAF primarily as an example of how to write UDAFs. We will use the PyArrow compute approach.
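
Before looking at any library specifics, it may help to see the shape of the computation. The sketch below is a plain Python stand-in that loosely follows the update, state, merge, and evaluate steps an accumulator typically goes through. Python lists stand in for Arrow arrays, the SumAboveThreshold class is hypothetical, and nothing here is wired into datafusion.

```python
class SumAboveThreshold:
    """Sum only the values strictly greater than a noise threshold."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.total = 0.0

    def update(self, values: list[float]) -> None:
        """Fold one batch of input values into the running total."""
        self.total += sum(v for v in values if v > self.threshold)

    def state(self) -> list[float]:
        """Expose the partial result so another accumulator can merge it."""
        return [self.total]

    def merge(self, states: list[float]) -> None:
        """Combine partial totals produced by other accumulators."""
        self.total += sum(states)

    def evaluate(self) -> float:
        """Return the final aggregate value."""
        return self.total


# Two partitions are aggregated independently, then merged.
left, right = SumAboveThreshold(1.0), SumAboveThreshold(1.0)
left.update([0.5, 2.0, 3.0])
right.update([1.0, 4.0])
left.merge(right.state())
result = left.evaluate()
print(result)
```

The split between update and merge is what lets an engine aggregate partitions in parallel and combine the partial sums afterwards.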

Pure Python approach¶

The fastest way (developer time, not code time) for me to implement the scalar problem solution was to do something along the lines of “for each row, check whether the values of interest contain that tuple”. I’ve published this as an example in the datafusion-python repository. Here is an example of how this can be done:

import pyarrow as pa
from datafusion import col, udf

values_of_interest = [
    (1530, 4031, "N"),
    (6530, 1531, "N"),
    (5618, 619, "N"),
    (8118, 8119, "N"),
]

def is_of_interest_impl(
    partkey_arr: pa.Array,
    suppkey_arr: pa.Array,
    returnflag_arr: pa.Array,
) -> pa.Array:
    result = []
    for idx, partkey in enumerate(partkey_arr):
        partkey = partkey.as_py()
        suppkey = suppkey_arr[idx].as_py()
        returnflag = returnflag_arr[idx].as_py()
        value = (partkey, suppkey, returnflag)
        result.append(value in values_of_interest)

    return pa.array(result)

# Wrap our custom function with `datafusion.udf`, annotating expected 
# parameter and return types
is_of_interest = udf(
    is_of_interest_impl,
    [pa.int64(), pa.int64(), pa.utf8()],
    pa.bool_(),
    "stable",
)

df_udf_filter = df_lineitem.filter(
    is_of_interest(col("l_partkey"), col("l_suppkey"), col("l_returnflag"))
)

When working with a DataFusion UDF in Python, you define your function to take in some number of expressions. During the evaluation, these will get computed into their corresponding values and passed to your UDF as a PyArrow Array. We must return an Array also with the same number of elements (rows). So the UDF example just iterates through all of the arrays and checks to see if the tuple created from these columns matches any of those that we’re looking for.

I’ll repeat because this is something that tripped me up the first time I wrote a UDF for DataFusion: DataFusion UDFs, even scalar UDFs, process an array of values at a time, not a single row. This is different from some other DataFrame libraries and may require a slight change in mindset.
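
To make the difference in mindset concrete, here is a tiny sketch in plain Python, with a list standing in for an Arrow array. The function is called once for the whole column and must return a result of the same length. The names is_positive and calls are hypothetical and exist only for this illustration.

```python
calls = []  # records the batch size of every invocation


def is_positive(values: list[int]) -> list[bool]:
    """Array-at-a-time: receives a whole column, returns a whole column."""
    calls.append(len(values))
    return [v > 0 for v in values]


column = [3, -1, 0, 7]
mask = is_positive(column)  # invoked once per batch, not once per row
print(mask, calls)
```

A row-at-a-time library would have called the function four times here; the array-at-a-time model calls it once with four values.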

Some important lines here are the lines like partkey = partkey.as_py(). When we do this, we pay a heavy cost. Now instead of keeping the analysis in the Rust code, we have to take the values in the array and convert them over to Python objects. In this case we end up getting two numbers and a string as real Python objects, complete with reference counting and all. We are also iterating through the array in Python rather than in native Rust. Both of these will significantly slow down your code. Any time you cross the barrier where values inside the Rust arrays are converted into Python objects, or vice versa, you will pay a heavy cost in that transformation. You will want to design your UDFs to avoid this as much as possible.

Python approach using PyArrow compute¶

DataFusion uses Apache Arrow as its in-memory data format. This can be seen in the way that Arrow Arrays are passed into the UDFs. We can take advantage of the fact that PyArrow, the canonical Python Arrow implementation, provides a variety of useful functions. In the example below, we are only using a few of the boolean functions and the equality function. Each of these functions takes two arrays and analyzes them row by row. In the below example, we shift the logic around a little since we are now operating on an entire array of values instead of checking a single row ourselves.

import pyarrow.compute as pc

def udf_using_pyarrow_compute_impl(
    partkey_arr: pa.Array,
    suppkey_arr: pa.Array,
    returnflag_arr: pa.Array,
) -> pa.Array:
    results = None
    for partkey, suppkey, returnflag in values_of_interest:
        filtered_partkey_arr = pc.equal(partkey_arr, partkey)
        filtered_suppkey_arr = pc.equal(suppkey_arr, suppkey)
        filtered_returnflag_arr = pc.equal(returnflag_arr, returnflag)

        resultant_arr = pc.and_(filtered_partkey_arr, filtered_suppkey_arr)
        resultant_arr = pc.and_(resultant_arr, filtered_returnflag_arr)

        if results is None:
            results = resultant_arr
        else:
            results = pc.or_(results, resultant_arr)

    return results


udf_using_pyarrow_compute = udf(
    udf_using_pyarrow_compute_impl,
    [pa.int64(), pa.int64(), pa.utf8()],
    pa.bool_(),
    "stable",
)

df_udf_pyarrow_compute = df_lineitem.filter(
    udf_using_pyarrow_compute(col("l_partkey"), col("l_suppkey"), col("l_returnflag"))
)

The idea in the code above is that we iterate through each of the values of interest, which we expect to be a small list. For each column, we compare the value of interest to its corresponding array using pyarrow.compute.equal. This gives us three boolean arrays. A row matches the tuple only if it is true in all three arrays, so we combine them with pyarrow.compute.and_. The return value of the UDF must be true for every row that matches any tuple in the values of interest, so we take the result from the current loop iteration and combine it with the accumulated results using pyarrow.compute.or_.

From my benchmarking, switching from the approach of converting values into Python objects to this approach of using the PyArrow built-in functions leads to about a 10x speed improvement on this simple problem.

It’s worth noting that almost all of the PyArrow compute functions expect to take one or two arrays as their arguments. If you need to write a UDF that is evaluating three or more columns, you’ll need to do something akin to what we’ve shown here.
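
One way to generalize the pairwise chaining to any number of columns is to fold the columns through the binary function. The sketch below shows the pattern in plain Python, with lists of booleans standing in for Arrow boolean arrays and a hypothetical elementwise_and standing in for a two-argument compute function.

```python
from functools import reduce


def elementwise_and(left: list[bool], right: list[bool]) -> list[bool]:
    """Binary kernel: combine two equal-length boolean columns."""
    return [a and b for a, b in zip(left, right)]


def all_columns(*masks: list[bool]) -> list[bool]:
    """Fold any number of boolean columns through the binary kernel."""
    return reduce(elementwise_and, masks)


part_ok = [True, True, False]
supp_ok = [True, False, False]
flag_ok = [True, True, True]
combined = all_columns(part_ok, supp_ok, flag_ok)
print(combined)
```

Note that reduce raises a TypeError when given no columns at all, so a real UDF would want to handle that case explicitly.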

Rust UDF with Python wrapper¶

This is the most complicated approach, but has the potential to be the most performant. What we will do here is write a Rust function to perform our computation and then expose that function to Python. I know of two use cases where I would recommend this approach. The first is the case when the PyArrow compute functions are insufficient for your needs. Perhaps your code is too complex or could be greatly simplified if you pulled in some outside dependency. The second use case is when you have written a UDF that you’re sharing across multiple projects and have hardened the approach. It is possible that you can implement your function in Rust to give a speed improvement and then every project that is using this shared UDF will benefit from those updates.

When deciding to use this approach, it’s worth considering how much you think you’ll actually benefit from the Rust implementation to decide if it’s worth the additional effort to maintain and deploy the Python wheels you generate. It is certainly not necessary for every use case.

Due to the excellent work by the Python arrow team, we can simplify our work to needing only two dependencies on the Rust side, arrow-rs and pyo3. I have posted a minimal example. You’ll need maturin to build the project, and you must use release mode when building to get the expected performance.

maturin develop --release

When you write your UDF in Rust, you will generally need to take these steps:

Write a function description that takes in some number of Python generic objects.
Convert these objects to Arrow Arrays of the appropriate type(s).
Perform your computation and create a resultant Array.
Convert the array into a Python generic object.

For the conversion to and from Python objects, we can take advantage of the ArrayData::from_pyarrow_bound and ArrayData::to_pyarrow functions. All that remains is to perform your computation.

We are going to demonstrate doing this computation in two ways. The first is to mimic what we’ve done in the above approach using PyArrow. In the second we demonstrate iterating through the three arrays ourselves.

In our first approach, we can expect the performance to be nearly identical to when we used the PyArrow compute functions. On the Rust side we will have slightly less overhead but the heavy lifting portions of the code are essentially the same between this Rust implementation and the PyArrow approach above.

The reason for demonstrating this, even though it doesn’t provide a significant speedup over Python, is primarily to show how to make the transition from Python to Rust with a Python wrapper. In the second implementation you can see how we can iterate through all of the arrays ourselves.

In this first example, we are hard coding the values of interest, but in the following section we demonstrate passing these in during initialization.

#[pyfunction]
pub fn tuple_filter_fn(
    py: Python<'_>,
    partkey_expr: &Bound<'_, PyAny>,
    suppkey_expr: &Bound<'_, PyAny>,
    returnflag_expr: &Bound<'_, PyAny>,
) -> PyResult<Py<PyAny>> {
    let partkey_arr: PrimitiveArray<Int64Type> =
        ArrayData::from_pyarrow_bound(partkey_expr)?.into();
    let suppkey_arr: PrimitiveArray<Int64Type> =
        ArrayData::from_pyarrow_bound(suppkey_expr)?.into();
    let returnflag_arr: StringArray = ArrayData::from_pyarrow_bound(returnflag_expr)?.into();

    let values_of_interest = vec![
        (1530, 4031, "N".to_string()),
        (6530, 1531, "N".to_string()),
        (5618, 619, "N".to_string()),
        (8118, 8119, "N".to_string()),
    ];

    let mut res: Option<BooleanArray> = None;

    for (partkey, suppkey, returnflag) in &values_of_interest {
        let filtered_partkey_arr = BooleanArray::from_unary(&partkey_arr, |p| p == *partkey);
        let filtered_suppkey_arr = BooleanArray::from_unary(&suppkey_arr, |s| s == *suppkey);
        let filtered_returnflag_arr =
            BooleanArray::from_unary(&returnflag_arr, |s| s == returnflag);

        let part_and_supp = compute::and(&filtered_partkey_arr, &filtered_suppkey_arr)
            .map_err(|e| PyValueError::new_err(e.to_string()))?;
        let resultant_arr = compute::and(&part_and_supp, &filtered_returnflag_arr)
            .map_err(|e| PyValueError::new_err(e.to_string()))?;

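        // Propagate kernel errors to Python instead of silently dropping them.
        res = match res {
            Some(r) => Some(
                compute::or(&r, &resultant_arr)
                    .map_err(|e| PyValueError::new_err(e.to_string()))?,
            ),
            None => Some(resultant_arr),
        };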
    }

    res.unwrap().into_data().to_pyarrow(py)
}


#[pymodule]
fn tuple_filter_example(module: &Bound<'_, PyModule>) -> PyResult<()> {
    module.add_function(wrap_pyfunction!(tuple_filter_fn, module)?)?;
    Ok(())
}

To use this we use the udf function in datafusion-python just as before.

from datafusion import udf
import pyarrow as pa
from tuple_filter_example import tuple_filter_fn

udf_using_custom_rust_fn = udf(
    tuple_filter_fn,
    [pa.int64(), pa.int64(), pa.utf8()],
    pa.bool_(),
    "stable",
)

That's it! We've now got a third party Rust UDF with Python wrappers working with DataFusion's Python bindings!

Rust UDF with initialization¶

Looking at the code above, you can see that it hard codes the values we're interested in. Many types of UDFs don't require any additional data to be provided before they start the computation, but this one does, and hard coding that data is sloppy, so let's clean it up.

We want to write the function to take some additional data. A limitation of the UDFs we create is that they expect to operate on entire arrays of data at a time, so there is no obvious place to pass in extra data. We can get around this problem by creating an initializer for our UDF. We do this by defining a Rust struct that contains the data we need and implementing two methods on this struct, new and __call__. By doing this we create a Python object that is callable, so it can be the function we provide to udf.

#[pyclass]
pub struct TupleFilterClass {
    values_of_interest: Vec<(i64, i64, String)>,
}

#[pymethods]
impl TupleFilterClass {
    #[new]
    fn new(values_of_interest: Vec<(i64, i64, String)>) -> Self {
        Self {
            values_of_interest,
        }
    }

    fn __call__(
        &self,
        py: Python<'_>,
        partkey_expr: &Bound<'_, PyAny>,
        suppkey_expr: &Bound<'_, PyAny>,
        returnflag_expr: &Bound<'_, PyAny>,
    ) -> PyResult<Py<PyAny>> {
        let partkey_arr: PrimitiveArray<Int64Type> =
            ArrayData::from_pyarrow_bound(partkey_expr)?.into();
        let suppkey_arr: PrimitiveArray<Int64Type> =
            ArrayData::from_pyarrow_bound(suppkey_expr)?.into();
        let returnflag_arr: StringArray = ArrayData::from_pyarrow_bound(returnflag_expr)?.into();

        let mut res: Option<BooleanArray> = None;

        for (partkey, suppkey, returnflag) in &self.values_of_interest {
            let filtered_partkey_arr = BooleanArray::from_unary(&partkey_arr, |p| p == *partkey);
            let filtered_suppkey_arr = BooleanArray::from_unary(&suppkey_arr, |s| s == *suppkey);
            let filtered_returnflag_arr =
                BooleanArray::from_unary(&returnflag_arr, |s| s == returnflag);

            let part_and_supp = compute::and(&filtered_partkey_arr, &filtered_suppkey_arr)
                .map_err(|e| PyValueError::new_err(e.to_string()))?;
            let resultant_arr = compute::and(&part_and_supp, &filtered_returnflag_arr)
                .map_err(|e| PyValueError::new_err(e.to_string()))?;

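            // Propagate kernel errors to Python instead of silently dropping them.
            res = match res {
                Some(r) => Some(
                    compute::or(&r, &resultant_arr)
                        .map_err(|e| PyValueError::new_err(e.to_string()))?,
                ),
                None => Some(resultant_arr),
            };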
        }

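        // res is None only when values_of_interest is empty; report that
        // as a Python error rather than panicking.
        res.ok_or_else(|| PyValueError::new_err("values_of_interest must not be empty"))?
            .into_data()
            .to_pyarrow(py)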
    }
}

#[pymodule]
fn tuple_filter_example(module: &Bound<'_, PyModule>) -> PyResult<()> {
    module.add_class::<TupleFilterClass>()?;
    Ok(())
}

When you write this, you don't have to call your constructor new. The more important part is that you have #[new] designated on the function. With this you can provide any kinds of data you need during processing. Using this initializer in Python is fairly straightforward.

from datafusion import udf
import pyarrow as pa
from tuple_filter_example import TupleFilterClass

tuple_filter_class = TupleFilterClass(values_of_interest)

udf_using_custom_rust_fn_with_data = udf(
    tuple_filter_class,
    [pa.int64(), pa.int64(), pa.utf8()],
    pa.bool_(),
    "stable",
    name="tuple_filter_with_data"
)

When you use this approach you will need to provide a name argument to udf. This is because our class/struct does not get the __qualname__ attribute that the udf function is looking for. You can give this udf any name you choose.
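A plain-Python analogy may help show why the name is required. The sketch below uses a hypothetical CallableFilter class as a stand-in for the Rust-backed TupleFilterClass; it only demonstrates that functions carry a __qualname__ attribute while callable instances do not, and makes no claim about how udf is implemented internally.

```python
def plain_function(x):
    return x


class CallableFilter:
    """Hypothetical stand-in for a callable object such as TupleFilterClass."""

    def __call__(self, x):
        return x


callable_instance = CallableFilter()

# Functions expose __qualname__, so a default name can be derived from them.
function_has_name = hasattr(plain_function, "__qualname__")

# Instances do not: the attribute is available on the class, not on the object.
instance_has_name = hasattr(callable_instance, "__qualname__")

print(function_has_name, instance_has_name)
```

Because the instance has no such attribute, an explicit name has to be supplied for it.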

Rust UDF with direct iteration¶

The final version of our scalar UDF is one where we implement it in Rust and iterate through all of the arrays ourselves. If you are iterating through more than 3 arrays at a time I recommend looking at izip in the itertools crate. For ease of understanding and since we only have 3 arrays here I will just explicitly create my own tuple here.

#[pyclass]
pub struct TupleFilterDirectIterationClass {
    values_of_interest: Vec<(i64, i64, String)>,
}

#[pymethods]
impl TupleFilterDirectIterationClass {
    #[new]
    fn new(values_of_interest: Vec<(i64, i64, String)>) -> Self {
        Self { values_of_interest }
    }

    fn __call__(
        &self,
        py: Python<'_>,
        partkey_expr: &Bound<'_, PyAny>,
        suppkey_expr: &Bound<'_, PyAny>,
        returnflag_expr: &Bound<'_, PyAny>,
    ) -> PyResult<Py<PyAny>> {
        let partkey_arr: PrimitiveArray<Int64Type> =
            ArrayData::from_pyarrow_bound(partkey_expr)?.into();
        let suppkey_arr: PrimitiveArray<Int64Type> =
            ArrayData::from_pyarrow_bound(suppkey_expr)?.into();
        let returnflag_arr: StringArray = ArrayData::from_pyarrow_bound(returnflag_expr)?.into();

        let values_to_search: Vec<(&i64, &i64, &str)> = self.values_of_interest
            .iter()
            .map(|(a, b, c)| (a, b, c.as_str()))
            .collect();

        let values = partkey_arr
            .values()
            .iter()
            .zip(suppkey_arr.values().iter())
            .zip(returnflag_arr.iter())
            .map(|((a, b), c)| (a, b, c.unwrap_or_default()))
            .map(|v| values_to_search.contains(&v));

        let res: BooleanArray = BooleanBuffer::from_iter(values).into();

        res.into_data().to_pyarrow(py)
    }
}

We convert the values_of_interest into a vector of borrowed types so that we can do a fast search without allocating additional memory. The other option is to turn each returnflag into a String, but that memory allocation is unnecessary. After that we use two zip operations so that we can iterate over all three columns in a single pass. Since each zip returns a tuple of two elements, a quick map turns the nested pairs into the three-element tuple format we need. Also, StringArray uses a different buffer layout than the primitive arrays, so we iterate over it with iter() rather than values(), and treat a null entry as an empty string.
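The single-pass logic can be modeled in plain Python to make the control flow easier to follow. This is only a sketch over Python lists, not Arrow arrays; the function name tuple_filter and the sample data are made up for illustration. Python's zip accepts three iterables at once, so the nested-pair step from the Rust version is not needed here.

```python
from typing import List, Optional, Tuple


def tuple_filter(
    partkeys: List[int],
    suppkeys: List[int],
    returnflags: List[Optional[str]],
    values_of_interest: List[Tuple[int, int, str]],
) -> List[bool]:
    """Return one boolean per row: True when the row's tuple is of interest."""
    result = []
    # zip walks all three columns in lockstep, one row at a time.
    for partkey, suppkey, returnflag in zip(partkeys, suppkeys, returnflags):
        # Mirror unwrap_or_default(): a null string is treated as "".
        row = (partkey, suppkey, returnflag if returnflag is not None else "")
        result.append(row in values_of_interest)
    return result


mask = tuple_filter(
    [1, 2, 3],
    [10, 20, 30],
    ["N", None, "R"],
    [(1, 10, "N"), (3, 30, "A")],
)
print(mask)  # [True, False, False]
```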

User Defined Aggregate Function¶

Writing a user defined aggregate function or user defined window function is slightly more complex than scalar functions. This is because we must accumulate values and there is no guarantee that one batch will contain all the values we are aggregating over. For this we need to define an Accumulator which will do a few things.

Process a batch and compute an internal state
Share the state so that we can combine multiple batches
Merge the results across multiple batches
Return the final result

In the example below, we're going to look at customer orders and we want to know per customer ID, how much they have ordered total. We want to ignore small orders, which we define as anything under 5000.

from typing import List

from datafusion import Accumulator, col, udaf
import pyarrow as pa
import pyarrow.compute as pc

IGNORE_THRESHOLD = 5000.0


class AboveThresholdAccum(Accumulator):
    def __init__(self) -> None:
        self._sum = 0.0

    def update(self, values: pa.Array) -> None:
        over_threshold = pc.greater(values, pa.scalar(IGNORE_THRESHOLD))
        sum_above = pc.sum(values.filter(over_threshold)).as_py()
        if sum_above is None:
            sum_above = 0.0
        self._sum = self._sum + sum_above

    def merge(self, states: List[pa.Array]) -> None:
        self._sum = self._sum + pc.sum(states[0]).as_py()

    def state(self) -> List[pa.Scalar]:
        return [pa.scalar(self._sum)]

    def evaluate(self) -> pa.Scalar:
        return pa.scalar(self._sum)

sum_above_threshold = udaf(AboveThresholdAccum, [pa.float64()], pa.float64(), [pa.float64()], 'stable')

df_orders.aggregate([col("o_custkey")],[sum_above_threshold(col("o_totalprice")).alias("sales")]).show()

Since we are computing a sum, we can keep a single value as our internal state. Each call to update() processes a single array and updates the internal state, which we share through the state() function. When the input is processed in several pieces, these states may be combined with merge(). It is important to note that the states passed to merge() are arrays of the values returned from state(). The merge() function can be significantly different from update(), though in our example they are very similar.

One example of a user defined aggregate function where the update() and merge() operations are different is computing an average. In update() we would maintain a state that is both a sum and a count. state() would return a list of these two values, merge() would add together the sums and the counts from the other states, and evaluate() would divide the sum by the count to compute the final result.
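The averaging case can be sketched as follows. To stay runnable without any dependencies, this is a plain-Python model that uses lists where the real Accumulator interface uses PyArrow arrays and scalars; the class name AverageAccum is hypothetical and the method names simply mirror the ones used above.

```python
from typing import List, Optional


class AverageAccum:
    """Plain-Python model of an accumulator whose update() and merge() differ."""

    def __init__(self) -> None:
        self._sum = 0.0
        self._count = 0

    def update(self, values: List[float]) -> None:
        # Consume the raw input values from one batch.
        self._sum += sum(values)
        self._count += len(values)

    def state(self) -> List[float]:
        # Two state fields: the running sum and the running count.
        return [self._sum, self._count]

    def merge(self, states: List[List[float]]) -> None:
        # states[0] holds partial sums and states[1] partial counts,
        # with one entry per accumulator being merged.
        self._sum += sum(states[0])
        self._count += int(sum(states[1]))

    def evaluate(self) -> Optional[float]:
        # The division happens only once, at the very end.
        return self._sum / self._count if self._count else None


left, right = AverageAccum(), AverageAccum()
left.update([1.0, 2.0, 3.0])
right.update([10.0])

left_state, right_state = left.state(), right.state()
final = AverageAccum()
# Transpose the per-accumulator states into one list per state field.
final.merge([[left_state[0], right_state[0]], [left_state[1], right_state[1]]])
average = final.evaluate()
print(average)  # 4.0
```

Note that averaging the two partial averages (2.0 and 10.0) would give 6.0, which is why the state carries the sum and count separately.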

User Defined Window Functions¶

Writing a user defined window function is slightly more complex than an aggregate function due to the variety of ways that window functions are called. I recommend reviewing the online documentation for a description of which functions need to be implemented. The details of how to implement these generally follow the same patterns as described above for aggregate functions.

Performance Comparison¶

For the scalar functions above, we performed a timing evaluation, repeating the operation 100 times. For this simple example these are our results.

+-----------------------------+--------------+---------+
| approach                    | Average Time | Std Dev |
+-----------------------------+--------------+---------+
| python udf                  | 4.969        | 0.062   |
| simple filter               | 1.075        | 0.022   |
| explicit filter             | 0.685        | 0.063   |
| pyarrow compute             | 0.529        | 0.017   |
| arrow rust compute          | 0.511        | 0.034   |
| arrow rust compute as class | 0.502        | 0.011   |
| rust custom iterator        | 0.478        | 0.009   |
+-----------------------------+--------------+---------+

As expected, converting to Python objects gives by far the worst performance. As soon as we use functions that keep the data entirely on the native (Rust or C/C++) side, we see a nearly 10x speed improvement. Then, as we increase complexity from using PyArrow compute functions to implementing the UDF in Rust, we see incremental improvements. Our fastest approach, iterating through the arrays ourselves, operates nearly 10% faster than the PyArrow compute approach.
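The two headline ratios can be reproduced directly from the table. The snippet below only restates the average times listed above and does the arithmetic; the variable names are made up for illustration.

```python
# Average times copied from the table above.
timings = {
    "python udf": 4.969,
    "simple filter": 1.075,
    "explicit filter": 0.685,
    "pyarrow compute": 0.529,
    "arrow rust compute": 0.511,
    "arrow rust compute as class": 0.502,
    "rust custom iterator": 0.478,
}

# How many times faster PyArrow compute is than the pure Python UDF.
speedup_vs_python = timings["python udf"] / timings["pyarrow compute"]

# Relative time saved by the custom Rust iterator over PyArrow compute.
gain_over_pyarrow = (
    timings["pyarrow compute"] - timings["rust custom iterator"]
) / timings["pyarrow compute"]

print(f"{speedup_vs_python:.1f}x")  # 9.4x
print(f"{gain_over_pyarrow:.1%}")   # 9.6%
```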

Final Thoughts and Recommendations¶

For anyone who is curious about DataFusion I highly recommend giving it a try. This post was designed to make it easier for new users to the Python implementation to work with User Defined Functions by giving a few examples of how one might implement these.

When it comes to designing UDFs, I strongly recommend seeing if you can write your UDF using PyArrow functions rather than pure Python objects. As shown in the scalar example above, you can achieve a 10x speedup by using PyArrow functions. If you must do something that isn't well represented by the PyArrow compute functions, then I would consider using a Rust based UDF in the manner shown above.

I would like to thank @alamb, @andygrove, @comphead, @emgeee, @kylebarron, and @Omega359 for their helpful reviews and feedback.

Lastly, the Apache Arrow and DataFusion community is an active group of very helpful people working to make a great tool. If you want to get involved, please take a look at the online documentation and jump in to help with one of the open issues.

Apache DataFusion is now the fastest single node engine for querying Apache Parquet files

2024-11-18T00:00:00+00:00

I am extremely excited to announce that Apache DataFusion is the fastest engine for querying Apache Parquet files in ClickBench. It is faster than DuckDB, chDB and ClickHouse using the same hardware. It also marks the first time a Rust-based engine holds the top spot, which has previously been held by traditional C/C++-based engines.

Figure 1: 2024-11-16 ClickBench Results for the ‘hot’[^1] run against the partitioned 14 GB Parquet dataset (100 files, each ~140MB) on a c6a.4xlarge (16 CPU / 32 GB RAM) VM. Measurements are relative (1.x) to results using different hardware.

Best in class performance on Parquet is now available to anyone. DataFusion’s open design lets you start quickly with a full featured Query Engine, including SQL, data formats, catalogs, and more, and then customize any behavior you need. I predict the continued emergence of new classes of data systems now that creators can focus the bulk of their innovation on areas such as query languages, system integrations, and data formats rather than trying to play catchup with core engine performance.

ClickBench also includes results for proprietary storage formats, which require costly load / export steps, making them useful in fewer use cases and thus much less important than open formats (though the idea of use case specific formats is interesting[^2]).

This blog post highlights some of the techniques we used to achieve this performance, and celebrates the teamwork involved.

A Strong History of Performance Improvements¶

Performance has long been a core focus for DataFusion's community, and speed attracts users and contributors. Recently, we seem to have been even more focused on performance, including in July, 2024 when Mehmet Ozan Kabak, CEO of Synnada, again suggested focusing on performance. This got many of us excited (who doesn’t love a challenge!), and we have subsequently rallied to steadily improve the performance release on release as shown in Figure 2.

Figure 2: ClickBench performance improved over 30% between DataFusion 34 (released Dec. 2023) and DataFusion 43 (released Nov. 2024).

Like all good optimization efforts, ours took sustained effort as DataFusion ran out of single 2x performance improvements several years ago. Working together our community of engineers from around the world[^3] and all experience levels[^4] pulled it off (check out this discussion to get a sense). It may be a "hobo sandwich" [^5], but it is a tasty one!

Of course, most of these techniques have been implemented and described before, but until now they were only available in proprietary systems such as Vertica, Databricks Photon, or Snowflake, or in tightly integrated open source systems such as DuckDB or ClickHouse, which were not designed to be extended.

StringView¶

Performance improved for all queries when DataFusion switched to using Arrow StringView. Using StringView “just” saves some copies and avoids one memory access for certain comparisons. However, these copies and comparisons happen to occur in many of the hottest loops during query processing, so optimizing them resulted in measurable performance improvements.

Figure 3: Figure from Using StringView / German Style Strings to Make Queries Faster: Part 1 showing how StringView saves copying data in many cases.

Using StringView to make DataFusion faster for ClickBench required substantial careful, low level optimization work described in Using StringView / German Style Strings to Make Queries Faster: Part 1 and Part 2. However, it also required extending the rest of DataFusion’s operations to support the new type. You can get a sense of the magnitude of the work required by looking at the 100+ pull requests linked to the epic in arrow-rs (here) and three major epics (here, here and here) in DataFusion.

Here is a partial list of people involved in the project (I am sorry to those whom I forgot)

Arrow: Xiangpeng Hao (InfluxData’s amazing 2024 summer intern and UW Madison PhD), Yijun Zhao from DataBend Labs, and Raphael Taylor-Davies laid the foundation. RinChanNOW from Tencent and Andrew Duffy from SpiralDB helped push it along in the early days, and Liang-Chi Hsieh, Daniël Heres reviewed and provided guidance.
DataFusion: Xiangpeng Hao again charted the initial path, and Weijun Huang, Dharan Aditya, Lordworms, Jax Liu, wiedld, Tai Le Manh, yi wang, doupache, Jay Zhan, Xin Li, and Kaifeng Zheng made it real.
DataFusion String Function Migration: Trent Hauck organized the effort and set the patterns, Jax Liu made a clever testing framework, and Austin Liu, Dmitrii Bu, Tai Le Manh, Chojan Shang, WeblWabl, Lordworms, iamthinh, Bruce Ritchie, Kaifeng Zheng, and Xin Li bashed out the conversions.

Parquet¶

Part of the reason for DataFusion's speed in ClickBench is reading Parquet files (really) quickly, which reflects invested effort in the Parquet reading system (see Querying Parquet with Millisecond Latency).

The DataFusion ParquetExec (built on the Rust Parquet Implementation) is now the most sophisticated open source Parquet reader I know of. It has every optimization we can think of for reading Parquet, including projection pushdown, predicate pushdown (row group metadata, page index, and bloom filters), limit pushdown, parallel reading, interleaved I/O, and late materialized filtering (coming soon ™️ by default). Work done in June unblocked a remaining hurdle for enabling late materialized filtering, and conveniently Xiangpeng Hao is working on the final piece (no pressure😅).

Skipping Partial Aggregation When It Doesn't Help¶

Many ClickBench queries are aggregations that summarize millions of rows, a common task for reporting and dashboarding. DataFusion uses state of the art two phase aggregation plans. Normally, two phase aggregation works well as the first phase consolidates many rows immediately after reading, while the data is still in cache. However, for certain “high cardinality” aggregate queries (that have large numbers of groups), the two phase aggregation strategy used in DataFusion was inefficient, manifesting in relatively slower performance compared to other engines for ClickBench queries such as

SELECT "WatchID", "ClientIP", COUNT(*) AS c, ... 
FROM hits 
GROUP BY "WatchID", "ClientIP" /* <----- 13M Distinct Groups!!! */
ORDER BY c DESC 
LIMIT 10;

For such queries, the first aggregation phase does not significantly reduce the number of rows, which wastes significant effort. Eduard Karacharov contributed a dynamic strategy to bypass the first phase when it is not working efficiently, shown in Figure 4.

Figure 4: Diagram from DataFusion API docs showing when the multi-phase grouping is not effective
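The idea behind the bypass can be sketched in a few lines. This is a toy model under stated assumptions, not DataFusion's implementation: the function names, the probe_rows and max_ratio parameters, and their default values are all hypothetical. The first phase counts rows per group, and once it has probed enough rows it checks the ratio of groups to rows; if grouping is barely reducing the data, it stops aggregating and passes rows straight through to the final phase.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, List, Tuple


def partial_count(
    batches: Iterable[List[Hashable]],
    probe_rows: int = 8,
    max_ratio: float = 0.8,
) -> Tuple[List[Tuple[Hashable, int]], bool]:
    """First-phase COUNT(*) that gives up when grouping barely reduces rows.

    Returns the partial (key, count) pairs for the final phase and a flag
    saying whether the first phase was skipped part way through.
    """
    groups: Counter = Counter()
    passthrough: List[Tuple[Hashable, int]] = []
    rows_seen = 0
    skipped = False

    for batch in batches:
        if skipped:
            # Bypass mode: emit each row as its own partial state.
            passthrough.extend((key, 1) for key in batch)
            continue
        groups.update(batch)
        rows_seen += len(batch)
        # After probing enough rows, check how much the grouping helped.
        if rows_seen >= probe_rows and len(groups) / rows_seen > max_ratio:
            skipped = True

    return list(groups.items()) + passthrough, skipped


def final_count(partials: Iterable[Tuple[Hashable, int]]) -> Dict[Hashable, int]:
    """Second phase: combine partial counts, whichever path produced them."""
    totals: Counter = Counter()
    for key, count in partials:
        totals[key] += count
    return dict(totals)


low_cardinality = [["a", "b", "a", "b"], ["a", "a", "b", "b"], ["a", "b"]]
high_cardinality = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 1, 2]]

low_partials, low_skipped = partial_count(low_cardinality)
high_partials, high_skipped = partial_count(high_cardinality)
print(low_skipped, high_skipped)  # False True
```

Either way the final phase produces the same totals; only the amount of work done in the first phase changes.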

Optimized Multi-Column Grouping¶

Another method for improving analytic database performance is specialized (aka highly optimized) versions of operations for different data types, which the system picks at runtime based on the query. Like other systems, DataFusion has specialized code for handling different types of group columns. For example, there is special code that handles GROUP BY int_id and different special code that handles GROUP BY string_id.

When a query groups by multiple columns, it is trickier to apply this technique. For example GROUP BY string_id, int_id and GROUP BY int_id, string_id have different optimal structures, but it is not possible to include specialized versions for all possible combinations of group column types.

DataFusion includes a general Row based mechanism that works for any combination of column types, but this general mechanism copies each value twice as shown in Figure 5. The cost of this copy is especially high for variable length strings and binary data.

Figure 5: Prior to DataFusion 43.0.0, queries with multiple group columns used Row based group storage and copied each group value twice. This copy consumes a substantial amount of the query time for queries with many distinct groups, such as several of the queries in ClickBench.

Many optimizations in databases boil down to simply avoiding copies, and this was no exception. The trick was to figure out how to avoid copies without causing per-column comparison overhead to dominate or complexity to get out of hand. In a great example of diligent and disciplined engineering, Jay Zhan tried several different approaches until arriving at the one shipped in DataFusion 43.0.0, shown in Figure 6.

Figure 6: DataFusion 43.0.0’s new columnar group storage copies each group value exactly once, which is significantly faster when grouping by multiple columns.

Huge thanks as well to Emil Ejbyfeldt and Daniël Heres for their help reviewing and to Rachelint (kamille) for reviewing and contributing a faster vectorized append and compare for multiple groups which will be released in DataFusion 44. The discussion on the ticket is another great example of the power of the DataFusion community working together to build great software.

What’s Next 🚀¶

Just as I expect the performance of other engines to improve, DataFusion has several more performance improvements lined up itself:

Intermediate results blocked management (thanks again Rachelint (kamille))
Enable parquet filter pushdown by default

We are also talking about what to focus on over the next three months and are always looking for people to help! If you want to geek out (obsess??) about performance and other features with engineers from around the world, we would love you to join us.

Additional Thanks¶

In addition to the people called out above, thanks:

Patrick McGleenon for running ClickBench and gathering this data (source).
Everyone I missed in the shoutouts – there are so many of you. We appreciate everyone.

Conclusion¶

I have dreamed about DataFusion being on top of the ClickBench leaderboard for several years. I often watched with envy improvements in systems backed by large VC investments, internet companies, or world class research institutions, and doubted that we could pull off something similar in an open source project with always limited time.

The fact that we have now surpassed those other systems in query performance I think speaks to the power and possibility of focusing on community and aligning our collective enthusiasm and skills towards a common goal. Of course, being on the top in any particular benchmark is likely fleeting as other engines will improve, but so will DataFusion!

I love working on DataFusion – the people, the quality of the code, my interactions and the results we have achieved together far surpass my expectations as well as most of my other software development experiences. I can’t wait to see what people will build next, and hope to see you online.

Notes¶

[^1]: Note that DuckDB is slightly faster on the ‘cold’ run.

[^2]: Want to try your hand at a custom format for ClickBench fame / glory?: Make DataFusion the fastest engine in ClickBench with custom file format

[^3]: We have contributors from North America, South America, Europe, Asia, Africa and Australia

[^4]: Undergraduates, PhD, Junior engineers, and getting-kind-of-crotchety experienced engineers

[^5]: Thanks to Andy Pavlo, I love that nomenclature

Apache DataFusion Comet 0.3.0 Release

2024-09-27T00:00:00+00:00

The Apache DataFusion PMC is pleased to announce version 0.3.0 of the Comet subproject.

Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for improved performance and efficiency without requiring any code changes.

This release covers approximately four weeks of development work and is the result of merging 57 PRs from 12 contributors. See the change log for more information.

Release Highlights¶

Binary Releases¶

Comet jar files are now published to Maven central for amd64 and arm64 architectures (Linux only).

Files can be found at https://central.sonatype.com/search?q=org.apache.datafusion

Spark versions 3.3, 3.4, and 3.5 are supported.
Scala versions 2.12 and 2.13 are supported.

New Features¶

The following expressions are now supported natively:

DateAdd
DateSub
ElementAt
GetArrayElement
ToJson

Performance & Stability¶

Upgraded to DataFusion 42.0.0
Reduced memory overhead due to some memory leaks being fixed
Comet will now fall back to Spark for queries that use Dynamic Partition Pruning (DPP), to avoid performance regressions, because Comet does not yet have native support for DPP
Improved performance when converting Spark columnar data to Arrow format
Faster decimal sum and avg functions

Documentation Updates¶

Improved documentation for deploying Comet with Kubernetes and Helm in the Comet Kubernetes Guide
More detailed architectural overview of Comet scan and execution in the Comet Plugin Overview in the contributor guide

Getting Involved¶

The Comet project welcomes new contributors. We use the same Slack and Discord channels as the main DataFusion project.

There are also many good first issues waiting for contributions.

Using StringView / German Style Strings to Make Queries Faster: Part 1- Reading Parquet

2024-09-13T00:00:00+00:00

Editor's Note: This is the first of a two part blog series that was first published on the InfluxData blog. Thanks to InfluxData for sponsoring this work as Xiangpeng Hao's summer intern project.

This blog describes our experience implementing StringView in the Rust implementation of Apache Arrow, and integrating it into Apache DataFusion, significantly accelerating string-intensive queries in the ClickBench benchmark by 20% - 200% (Figure 1[^1]).

Getting significant end-to-end performance improvements was non-trivial. Implementing StringView itself was only a fraction of the effort required. Among other things, we had to optimize UTF-8 validation, implement unintuitive compiler optimizations, tune block sizes, and time GC to realize the FDAP ecosystem’s benefit. With other members of the open source community, we were able to overcome performance bottlenecks that could have killed the project. We would like to contribute by explaining the challenges and solutions in more detail so that more of the community can learn from our experience.

StringView is based on a simple idea: avoid some string copies and accelerate comparisons with inlined prefixes. Like most great ideas, it is “obvious” only after someone describes it clearly. Although the idea is simple, a straightforward implementation actually slows down almost every query. We must, therefore, apply astute observations and diligent engineering to realize the actual benefits from StringView.

Although this journey was successful, not all research ideas are as lucky. To accelerate the adoption of research into industry, it is valuable to integrate research prototypes with practical systems. Understanding the nuances of real-world systems makes it more likely that research designs[^2] will lead to practical system improvements.

StringView support was released as part of arrow-rs v52.2.0 and DataFusion v41.0.0. You can try it by setting the schema_force_view_types DataFusion configuration option, and we are hard at work with the community to make it the default. We invite everyone to try it out, take advantage of the effort invested so far, and contribute to making it better.

Figure 1: StringView improves string-intensive ClickBench query performance by 20% - 200%

What is StringView?¶

Figure 2: Use StringArray and StringViewArray to represent the same string content.

The concept of inlined strings with prefixes (called “German Strings” by Andy Pavlo, in homage to TUM, where the Umbra paper that describes them originated) has been used in many recent database systems (Velox, Polars, DuckDB, CedarDB, etc.) and was introduced to Arrow as a new StringViewArray[^3] type. Arrow’s original StringArray is very memory efficient but less effective for certain operations. StringViewArray accelerates string-intensive operations via prefix inlining and a more flexible and compact string representation.

A StringViewArray consists of three components:

The view array
The buffers
The buffer pointers (IDs) that map buffer offsets to their physical locations

Each view is 16 bytes long, and its contents differ based on the string’s length:

string length <= 12 bytes: the first four bytes store the string length, and the remaining 12 bytes store the inlined string.
string length > 12 bytes: the string is stored in a separate buffer. The length is again stored in the first 4 bytes, followed by the prefix (first 4 bytes) of the string, the buffer id (4 bytes), and the buffer offset (4 bytes).
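The two layouts can be made concrete by packing them by hand. This is a minimal sketch using Python's struct module, not the arrow-rs implementation; the helper names make_view and describe_view are hypothetical. The field order for long strings follows the Arrow columnar format (length, prefix, buffer index, offset), all little-endian.

```python
import struct
from typing import Tuple


def make_view(value: bytes, buffer_id: int, offset: int) -> bytes:
    """Encode one 16-byte view for a string stored at `offset` in a buffer."""
    length = len(value)
    if length <= 12:
        # Short string: 4-byte length, then the bytes, zero-padded to 12.
        return struct.pack("<I12s", length, value)
    # Long string: length, 4-byte prefix, buffer id, offset into that buffer.
    return struct.pack("<I4sII", length, value[:4], buffer_id, offset)


def describe_view(view: bytes) -> Tuple:
    """Decode a view back into its logical fields."""
    (length,) = struct.unpack_from("<I", view)
    if length <= 12:
        return ("inline", view[4 : 4 + length])
    _, prefix, buffer_id, offset = struct.unpack("<I4sII", view)
    return ("buffer", length, prefix, buffer_id, offset)


short_view = make_view(b"InfluxDB", 0, 0)
long_view = make_view(b"Apache DataFusion", 0, 0)
print(describe_view(short_view))  # ('inline', b'InfluxDB')
print(describe_view(long_view))   # ('buffer', 17, b'Apac', 0, 0)
```

Both views occupy exactly 16 bytes, which is what lets the view array be a fixed-width array regardless of string length.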

Figure 2 shows an example of the same logical content (left) using StringArray (middle) and StringViewArray (right):

The first string – "Apache DataFusion" – is 17 bytes long, and both StringArray and StringViewArray store the string’s bytes at the beginning of the buffer. The StringViewArray also inlines the first 4 bytes – "Apac" – in the view.
The second string, "InfluxDB" is only 8 bytes long, so StringViewArray completely inlines the string content in the view struct while StringArray stores the string in the buffer as well.
The third string "Arrow Rust Impl" is 15 bytes long and cannot be fully inlined. StringViewArray stores this in the same form as the first string.
The last string "Apache DataFusion" has the same content as the first string. It’s possible to use StringViewArray to avoid this duplication and reuse the bytes by pointing the view to the previous location.

StringViewArray provides three opportunities for outperforming StringArray:

Less copying via the offset + buffer format
Faster comparisons using the inlined string prefix
Reusing repeated string values with the flexible view layout
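The second and third opportunities can be illustrated with a simplified model. In this sketch the View dataclass, make_views, and views_equal are hypothetical names, the view holds its fields as Python attributes rather than 16 packed bytes, and short-string inlining is left out for brevity. It shows that an equality check can often be decided from the length and prefix alone, and that repeated values can point at the same buffer location.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class View:
    """Simplified view; the real one packs these fields into 16 bytes."""

    length: int
    prefix: bytes  # first 4 bytes of the string
    offset: int    # where the full string starts in the shared buffer


def make_views(buffer: bytearray, values: List[bytes]) -> List[View]:
    """Append each distinct value to `buffer` once and return one view per value."""
    seen: Dict[bytes, int] = {}
    views = []
    for value in values:
        if value not in seen:
            seen[value] = len(buffer)
            buffer.extend(value)
        # Repeated values reuse the offset of their first occurrence.
        views.append(View(len(value), value[:4], seen[value]))
    return views


def views_equal(buffer: bytearray, a: View, b: View, stats: Dict[str, int]) -> bool:
    # Cheap check first: length and prefix live inside the view itself.
    if a.length != b.length or a.prefix != b.prefix:
        return False
    # Only now touch the separate data buffer.
    stats["buffer_reads"] += 1
    return buffer[a.offset : a.offset + a.length] == buffer[b.offset : b.offset + b.length]


buffer = bytearray()
first, second, third = make_views(
    buffer, [b"Apache DataFusion", b"Arrow Rust Impl", b"Apache DataFusion"]
)
stats = {"buffer_reads": 0}
different = views_equal(buffer, first, second, stats)  # decided without a buffer read
same = views_equal(buffer, first, third, stats)        # needs one buffer read
print(different, same, stats["buffer_reads"], len(buffer))  # False True 1 32
```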

The rest of this blog post discusses how to apply these opportunities in real query scenarios to improve performance, what challenges we encountered along the way, and how we solved them.

Faster Parquet Loading¶

Apache Parquet is the de facto format for storing large-scale analytical data, commonly in LakeHouse-style systems such as Apache Iceberg and Delta Lake. Efficiently loading data from Parquet is thus critical to query performance in many important real-world workloads.

Parquet encodes strings (i.e., byte arrays) in a slightly different format than the original Arrow StringArray requires. The string length is encoded inline with the actual string data (as shown on the left of Figure 4). As mentioned previously, StringArray requires the data buffer to be contiguous and compact: the strings have to follow one after another. This requirement means that reading Parquet string data into an Arrow StringArray requires copying and consolidating the string bytes to a new buffer and tracking offsets in a separate array. Copying these strings is often wasteful. Typical queries filter out most data immediately after loading, so most of the copied data is quickly discarded.

On the other hand, reading Parquet data as a StringViewArray can reuse the buffer that holds the decoded Parquet page, because StringViewArray does not require strings to be contiguous. For example, in Figure 4, the StringViewArray directly references the buffer with the decoded Parquet page. The string "Arrow Rust Impl" is represented by a view with offset 37 and length 15 into that buffer.

Figure 4: StringViewArray avoids copying by reusing decoded Parquet pages.
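The offsets in this example can be reproduced with a small model. The sketch below assumes the page holds the three strings "Apache DataFusion", "InfluxDB", and "Arrow Rust Impl" in that order, each preceded by a 4-byte little-endian length as in Parquet's plain encoding for byte arrays. The helper names encode_page and views_over_page are hypothetical, and the views are simplified to (offset, length) pairs.

```python
import struct
from typing import List, Tuple


def encode_page(values: List[bytes]) -> bytes:
    """Build a buffer of length-prefixed strings (4-byte little-endian length)."""
    return b"".join(struct.pack("<I", len(v)) + v for v in values)


def views_over_page(page: bytes) -> List[Tuple[int, int]]:
    """Return (offset, length) pairs pointing into `page`; no string bytes are copied."""
    views = []
    position = 0
    while position < len(page):
        (length,) = struct.unpack_from("<I", page, position)
        position += 4                     # skip the length prefix
        views.append((position, length))  # the string starts right here
        position += length
    return views


page = encode_page([b"Apache DataFusion", b"InfluxDB", b"Arrow Rust Impl"])
views = views_over_page(page)
print(views)  # [(4, 17), (25, 8), (37, 15)]

# Reading through a view is just a slice of the original page buffer.
offset, length = views[2]
third_string = memoryview(page)[offset : offset + length].tobytes()
print(third_string)  # b'Arrow Rust Impl'
```

The length prefixes stay in the buffer between the strings, which is harmless for views but becomes relevant when validating UTF-8.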

Mini benchmark

Reusing Parquet buffers is great in theory, but how much does saving a copy actually matter? We can run the following benchmark in arrow-rs to find out:

Our benchmarking machine shows that loading BinaryViewArray is almost 2x faster than loading BinaryArray (see next section about why this isn’t StringViewArray).

You can read more on this arrow-rs issue: https://github.com/apache/arrow-rs/issues/5904

From Binary to Strings¶

You may wonder why we reported performance for BinaryViewArray when this post is about StringViewArray. Surprisingly, initially, our implementation to read StringViewArray from Parquet was much slower than StringArray. Why? TLDR: Although reading StringViewArray copied less data, the initial implementation also spent much more time validating UTF-8 (as shown in Figure 5).

Strings are stored as byte sequences. When reading data from (potentially untrusted) Parquet files, a Parquet decoder must ensure those byte sequences are valid UTF-8 strings, and most programming languages, including Rust, include highly optimized routines for doing so.

Figure 5: Time to load strings from Parquet. StringArray's UTF-8 validation advantage initially eliminates StringViewArray's advantage of reduced copying.

A StringArray can be validated in a single call to the UTF-8 validation function as it has a contiguous string buffer. As long as the underlying buffer is UTF-8[^4], all strings in the array must be UTF-8. The Rust parquet reader makes a single function call to validate the entire buffer.

However, validating an arbitrary StringViewArray requires validating each string with a separate call to the validation function, as the underlying buffer may also contain non-string data (for example, the lengths in Parquet pages).

UTF-8 validation in Rust is highly optimized and favors longer strings (as shown in Figure 6), likely because it leverages SIMD instructions to perform parallel validation. The benefit of a single function call to validate UTF-8 over a function call for each string more than eliminates the advantage of avoiding the copy for StringViewArray.

Figure 6: UTF-8 validation throughput vs string length—StringArray’s contiguous buffer can be validated much faster than StringViewArray’s buffer.

Does this mean we should only use StringArray? No! Thankfully, there’s a clever way out. The key observation is that in many real-world datasets, 99% of strings are shorter than 128 bytes, meaning the encoded length values are smaller than 128, in which case the length itself is also valid UTF-8 (in fact, it is ASCII).

This observation means we can optimize validating UTF-8 strings in Parquet pages by treating the length bytes as part of a single large string as long as the length value is less than 128. Put another way, prior to this optimization, the length bytes act as string boundaries, which require a UTF-8 validation on each string. After this optimization, only strings of 128 bytes or longer (less than 1% of the strings in the ClickBench dataset) act as string boundaries, significantly increasing the UTF-8 validation chunk size and thus improving performance.
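The chunking idea can be sketched as follows. This is an illustrative model, not the arrow-rs code: the function names are hypothetical, the page is assumed to use 4-byte little-endian length prefixes, and Python's strict UTF-8 decoder stands in for the Rust validation routine. A length below 128 encodes as one ASCII byte followed by three zero bytes, so it can safely sit inside a validated run.

```python
import struct
from typing import List, Optional


def encode_page(values: List[bytes]) -> bytes:
    """Build a buffer of length-prefixed strings (4-byte little-endian length)."""
    return b"".join(struct.pack("<I", len(v)) + v for v in values)


def validate_page(page: bytes) -> int:
    """Validate every string in `page` and return how many validation calls ran.

    Raises UnicodeDecodeError (a ValueError) if any string is not valid UTF-8.
    """
    calls = 0
    chunk_start: Optional[int] = None
    position = 0
    while position < len(page):
        (length,) = struct.unpack_from("<I", page, position)
        if chunk_start is None or length >= 128:
            # Start a new chunk after this prefix: it is either the first
            # prefix or one whose bytes may not be ASCII.
            if chunk_start is not None:
                page[chunk_start:position].decode("utf-8")
                calls += 1
            chunk_start = position + 4
        # Otherwise the prefix is plain ASCII and stays inside the chunk.
        position += 4 + length
    if chunk_start is not None:
        page[chunk_start:position].decode("utf-8")
        calls += 1
    return calls


page = encode_page([b"a", b"bb", b"x" * 200, "héllo".encode("utf-8")])
calls = validate_page(page)
print(calls)  # 2 validation calls for 4 strings
```

Merging is safe because an ASCII length byte can never complete a multi-byte sequence, so a string that is invalid on its own remains invalid inside the merged chunk.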

The actual implementation is only nine lines of Rust (with 30 lines of comments). You can find more details in the related arrow-rs issue: https://github.com/apache/arrow-rs/issues/5995. As expected, with this optimization, loading StringViewArray is almost 2x faster than loading StringArray.

Be Careful About Implicit Copies¶

After all the work to avoid copying strings when loading from Parquet, performance was still not as good as expected. We tracked the problem to a few implicit data copies that we weren't aware of, as described in this issue.

The copies we eventually identified come from the following innocent-looking line of Rust code, where self.buf is a reference counted pointer that should transform without copying into a buffer for use in StringViewArray.

However, Rust-type coercion rules favored a blanket implementation that did copy data. This implementation is shown in the following code block where the impl<T: AsRef<[u8]>> will accept any type that implements AsRef<[u8]> and copies the data to create a new buffer. To avoid copying, users need to explicitly call from_vec, which consumes the Vec and transforms it into a buffer.
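The pitfall can be reproduced in a few lines of standard-library Rust. This is a minimal sketch under stated assumptions: the `Buffer` type below is a hypothetical stand-in, not the arrow-rs type, and the two helper functions exist only to show which path reuses the allocation.

```rust
/// Hypothetical stand-in for a byte buffer type; NOT the arrow-rs `Buffer`.
struct Buffer {
    data: Vec<u8>,
}

/// Blanket conversion: it accepts anything that can be viewed as bytes, so
/// all it can do is borrow the input and copy it into a new allocation.
impl<T: AsRef<[u8]>> From<T> for Buffer {
    fn from(value: T) -> Self {
        Buffer { data: value.as_ref().to_vec() } // copies the bytes
    }
}

impl Buffer {
    /// Explicit constructor: takes ownership of the Vec, no copy.
    fn from_vec(data: Vec<u8>) -> Self {
        Buffer { data }
    }

    fn as_ptr(&self) -> *const u8 {
        self.data.as_ptr()
    }
}

/// Returns true if `.into()` kept the original allocation.
fn into_reuses_allocation(v: Vec<u8>) -> bool {
    let original = v.as_ptr();
    let buffer: Buffer = v.into(); // resolves to the copying blanket impl
    buffer.as_ptr() == original
}

/// Returns true if `from_vec` kept the original allocation.
fn from_vec_reuses_allocation(v: Vec<u8>) -> bool {
    let original = v.as_ptr();
    Buffer::from_vec(v).as_ptr() == original
}
```

Nothing at the call site of `.into()` hints at a copy, which is why this kind of issue is easy to miss in review.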

Diagnosing this implicit copy was time-consuming as it relied on subtle Rust language semantics. We needed to track every step of the data flow to ensure every copy was necessary. To help other users and prevent future mistakes, we also removed the implicit API from arrow-rs in favor of an explicit API. Using this approach, we found and fixed several other unintentional copies in the code base—hopefully, the change will help other downstream users avoid unnecessary copies.

Help the Compiler by Giving it More Information¶

The Rust compiler’s automatic optimizations mostly work very well for a wide variety of use cases, but sometimes, it needs additional hints to generate the most efficient code. When profiling the performance of view construction, we found, counterintuitively, that constructing long strings was 10x faster than constructing short strings, which made short strings slower on StringViewArray than on StringArray!

As described in the first section, StringViewArray treats long and short strings differently. Short strings (12 bytes or fewer) are inlined directly into the view struct, while long strings inline only the first 4 bytes. The code to construct a view looks something like this:
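A minimal sketch of such a function, assuming the Arrow view layout (a 4-byte length, then either up to 12 inlined bytes or a 4-byte prefix, a buffer index, and an offset). The name `make_view_naive` and its parameters are illustrative rather than the arrow-rs API.

```rust
/// Build a 16-byte view for `data`.
fn make_view_naive(data: &[u8], buffer_index: u32, offset: u32) -> u128 {
    let len = data.len();
    let mut view = [0u8; 16];
    view[0..4].copy_from_slice(&(len as u32).to_le_bytes());
    if len <= 12 {
        // Short string: the number of bytes to copy is only known at run time.
        view[4..4 + len].copy_from_slice(data);
    } else {
        // Long string: the prefix length (4) is a compile-time constant.
        view[4..8].copy_from_slice(&data[0..4]);
        view[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        view[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    u128::from_le_bytes(view)
}
```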

It appears that both branches of the code should be fast: they both involve copying at most 16 bytes of data and some memory shift/store operations. How could the branch for short strings be 10x slower?

Looking at the assembly code using Compiler Explorer, we (with help from Ao Li) found the compiler used CPU load instructions to copy the fixed-sized 4 bytes to the view for long strings, but it calls a function, ptr::copy_nonoverlapping, to copy the inlined bytes to the view for short strings. The difference is that long strings have a prefix size (4 bytes) known at compile time, so the compiler directly uses efficient CPU instructions. But, since the size of the short string is unknown to the compiler, it has to call the general-purpose function ptr::copy_nonoverlapping. Making a function call is significant unnecessary overhead compared to a CPU copy instruction.

However, we know something the compiler doesn’t know: the short string size is not arbitrary. It must be between 0 and 12 bytes, and we can leverage this information to avoid the function call. Our solution generates 13 copies of the function using generics, one for each of the possible prefix lengths. The code looks as follows, and checking the assembly code, we confirmed there are no calls to ptr::copy_nonoverlapping, and only native CPU instructions are used. For more details, see the ticket.
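The approach can be sketched with const generics as follows. This is an illustration under the same layout assumptions as before, and the function names are hypothetical; the point is that each `make_inline_view::<N>` instantiation copies a number of bytes that is fixed at compile time.

```rust
/// Inline exactly `LEN` bytes. Because `LEN` is a const generic parameter,
/// every instantiation copies a compile-time-known number of bytes.
fn make_inline_view<const LEN: usize>(data: &[u8]) -> u128 {
    let mut view = [0u8; 16];
    view[0..4].copy_from_slice(&(LEN as u32).to_le_bytes());
    view[4..4 + LEN].copy_from_slice(&data[0..LEN]);
    u128::from_le_bytes(view)
}

/// Dispatch on the run-time length to one of 13 specialized functions.
fn make_view(data: &[u8], buffer_index: u32, offset: u32) -> u128 {
    match data.len() {
        0 => make_inline_view::<0>(data),
        1 => make_inline_view::<1>(data),
        2 => make_inline_view::<2>(data),
        3 => make_inline_view::<3>(data),
        4 => make_inline_view::<4>(data),
        5 => make_inline_view::<5>(data),
        6 => make_inline_view::<6>(data),
        7 => make_inline_view::<7>(data),
        8 => make_inline_view::<8>(data),
        9 => make_inline_view::<9>(data),
        10 => make_inline_view::<10>(data),
        11 => make_inline_view::<11>(data),
        12 => make_inline_view::<12>(data),
        len => {
            // Long string: length, 4-byte prefix, buffer index, offset.
            let mut view = [0u8; 16];
            view[0..4].copy_from_slice(&(len as u32).to_le_bytes());
            view[4..8].copy_from_slice(&data[0..4]);
            view[8..12].copy_from_slice(&buffer_index.to_le_bytes());
            view[12..16].copy_from_slice(&offset.to_le_bytes());
            u128::from_le_bytes(view)
        }
    }
}
```

The `match` adds a branch on the length, but each arm is a fixed-size copy that the compiler can lower to plain load and store instructions.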

End-to-End Query Performance¶

In the previous sections, we went out of our way to make sure loading StringViewArray is faster than StringArray. Before going further, we wanted to verify if obsessing about reducing copies and function calls has actually improved end-to-end performance in real-life queries. To do this, we evaluated a ClickBench query (Q20) in DataFusion that counts how many URLs contain the word "google":

This is a relatively simple query; most of the time is spent on loading the “URL” column to find matching rows. The query plan looks like this:

We ran the benchmark in the DataFusion repo like this:

With StringViewArray we saw a 24% end-to-end performance improvement, as shown in Figure 7. With the --string-view argument, the end-to-end query time is 944.3 ms, 869.6 ms, 861.9 ms (three iterations). Without --string-view, the end-to-end query time is 1186.1 ms, 1126.1 ms, 1138.3 ms.

Figure 7: StringView reduces end-to-end query time by 24% on ClickBench Q20.

We also double-checked with detailed profiling and verified that the time reduction is indeed due to faster Parquet loading.

Conclusion¶

In this first blog post, we have described what it took to improve the performance of simply reading strings from Parquet files using StringView. While this resulted in real end-to-end query performance improvements, in our next post, we explore additional optimizations enabled by StringView in DataFusion, along with some of the pitfalls we encountered while implementing them.

Footnotes¶

[^1]: Benchmarked with AMD Ryzen 7600x (12 core, 24 threads, 32 MiB L3), WD Black SN770 NVMe SSD (5150MB/4950MB seq RW bandwidth)

[^2]: Xiangpeng is a PhD student at the University of Wisconsin-Madison

[^3]: There is also a corresponding BinaryViewArray which is similar except that the data is not constrained to be UTF-8 encoded strings.

[^4]: We also make sure that offsets do not break a UTF-8 code point, which is cheaply validated.

Using StringView / German Style Strings to make Queries Faster: Part 2 - String Operations

2024-09-13T00:00:00+00:00

Editor's Note: This blog series was first published on the InfluxData blog. Thanks to InfluxData for sponsoring this work as Xiangpeng Hao's summer intern project.

In the first post, we discussed the nuances required to accelerate Parquet loading using StringViewArray by reusing buffers and reducing copies. In this second part of the post, we describe the rest of the journey: implementing additional efficient operations for real query processing.

Faster String Operations¶

Faster comparison¶

String comparison is ubiquitous; it is the core of cmp, min/max, and like/ilike kernels. StringViewArray is designed to accelerate such comparisons using the inlined prefix—the key observation is that, in many cases, only the first few bytes of the string determine the string comparison results.

For example, to compare the strings InfluxDB with Apache DataFusion, we only need to look at the first byte to determine the string ordering or equality. In this case, since A is earlier in the alphabet than I, Apache DataFusion sorts first, and we know the strings are not equal. Despite only needing the first byte, comparing these strings when stored as a StringArray requires two memory accesses: 1) load the string offset and 2) use the offset to locate the string bytes. For low-level operations such as cmp that are invoked millions of times in the very hot paths of queries, avoiding this extra memory access can make a measurable difference in query performance.

For StringViewArray, typically, only one memory access is needed to load the view struct. Only if the result cannot be determined from the prefix is the second memory access required. For the example above, there is no need for the second access. This technique is very effective in practice: the second access is never necessary for the more than 60% of real-world strings which are shorter than 12 bytes, as they are stored completely in the view.
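A simplified sketch of prefix-first comparison follows. It models only the length and a zero-padded 4-byte prefix (not the full 12-byte inlining), and the `View` struct and function names are hypothetical. Each function also reports whether it had to read the full string data, which stands in for the second memory access.

```rust
use std::cmp::Ordering;

/// Simplified view: the length and a zero-padded 4-byte prefix live in the
/// view itself; `data` stands in for bytes stored in a separate buffer.
struct View<'a> {
    len: u32,
    prefix: [u8; 4],
    data: &'a [u8],
}

impl<'a> View<'a> {
    fn new(s: &'a str) -> Self {
        let bytes = s.as_bytes();
        let mut prefix = [0u8; 4];
        let n = bytes.len().min(4);
        prefix[..n].copy_from_slice(&bytes[..n]);
        View { len: bytes.len() as u32, prefix, data: bytes }
    }
}

/// Returns (are the strings equal, was the full data read).
fn eq_views(a: &View<'_>, b: &View<'_>) -> (bool, bool) {
    if a.len != b.len || a.prefix != b.prefix {
        return (false, false); // decided from the view alone
    }
    if a.len <= 4 {
        return (true, false); // the whole string fits in the prefix
    }
    (a.data == b.data, true)
}

/// Returns (ordering, was the full data read).
fn cmp_views(a: &View<'_>, b: &View<'_>) -> (Ordering, bool) {
    match a.prefix.cmp(&b.prefix) {
        // Prefixes tie: fall back to comparing the full strings.
        Ordering::Equal => (a.data.cmp(b.data), true),
        decided => (decided, false),
    }
}
```

For the InfluxDB versus Apache DataFusion example, `cmp_views` decides the ordering from the prefixes and never touches `data`.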

However, functions that operate on strings must be specialized to take advantage of the inlined prefix. In addition to low-level comparison kernels, we implemented a wide range of other StringViewArray operations that cover the functions and operations seen in ClickBench queries. Supporting StringViewArray in all string operations takes quite a bit of effort, and thankfully the Arrow and DataFusion communities are already hard at work doing so (see https://github.com/apache/datafusion/issues/11752 if you want to help out).

Faster `take` and `filter`¶

After a filter operation such as WHERE url <> '' to avoid processing empty urls, DataFusion will often coalesce results to form a new array with only the passing elements. This coalescing ensures the batches are sufficiently sized to benefit from vectorized processing in subsequent steps.

The coalescing operation is implemented using the take and filter kernels in arrow-rs. For StringArray, these kernels require copying the string contents to a new buffer without “holes” in between. This copy can be expensive especially when the new array is large.

However, take and filter for StringViewArray can avoid the copy by reusing buffers from the old array. The kernels only need to create a new list of views that point at the same strings within the old buffers. Figure 1 illustrates the difference between the output of both string representations. StringArray creates two new strings at offsets 0-17 and 17-32, while StringViewArray simply points to the original buffer at offsets 0 and 25.
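The difference can be sketched with a toy array type. This is an assumption-laden illustration, not the arrow-rs kernels: `ViewArray` is hypothetical, it uses a single shared buffer, and each view is just an (offset, length) pair. Filtering copies views only and shares the buffer by bumping a reference count.

```rust
use std::sync::Arc;

/// Toy string-view array: every view is an (offset, length) pair into one
/// shared, reference-counted buffer.
struct ViewArray {
    buffer: Arc<Vec<u8>>,
    views: Vec<(usize, usize)>,
}

impl ViewArray {
    fn from_strs(values: &[&str]) -> Self {
        let mut buffer = Vec::new();
        let mut views = Vec::new();
        for v in values {
            views.push((buffer.len(), v.len()));
            buffer.extend_from_slice(v.as_bytes());
        }
        ViewArray { buffer: Arc::new(buffer), views }
    }

    fn value(&self, i: usize) -> &str {
        let (offset, len) = self.views[i];
        std::str::from_utf8(&self.buffer[offset..offset + len]).unwrap()
    }

    /// Keep the rows whose mask entry is true. Only the views are copied;
    /// the string bytes stay in the original buffer.
    fn filter(&self, mask: &[bool]) -> ViewArray {
        let views = self
            .views
            .iter()
            .zip(mask)
            .filter(|(_, keep)| **keep)
            .map(|(view, _)| *view)
            .collect();
        ViewArray { buffer: Arc::clone(&self.buffer), views }
    }
}
```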

Figure 1: Zero-copy take/filter for StringViewArray

When to GC?¶

Zero-copy take/filter is great for generating large arrays quickly, but it is suboptimal for highly selective filters, where most of the strings are filtered out. When the cardinality drops, StringViewArray buffers become sparse: only a small subset of the bytes in the buffer’s memory are referred to by any view. This leads to excessive memory usage, especially in a filter-then-coalesce scenario. For example, a StringViewArray with 10M strings may only refer to 1M strings after some filter operations; however, due to zero-copy take/filter, the (reused) buffers holding all 10M strings cannot be released.

To release unused memory, we implemented a garbage collection (GC) routine to consolidate the data into a new buffer to release the old sparse buffer(s). As the GC operation copies strings, similarly to StringArray, we must be careful about when to call it. If we call GC too early, we cause unnecessary copying, losing much of the benefit of StringViewArray. If we call GC too late, we hold large buffers for too long, increasing memory use and decreasing cache efficiency. The Polars blog on StringView also refers to the challenge presented by garbage collection timing.

arrow-rs implements the GC process, but it is up to users to decide when to call it. We leveraged the semantics of the query engine and observed that the CoalesceBatchesExec operator, which merges smaller batches into a larger batch, is often used after the record cardinality is expected to shrink, which aligns perfectly with the scenario of GC in StringViewArray. We, therefore, implemented the GC procedure inside CoalesceBatchesExec[^5] with a heuristic that estimates when the buffers are too sparse.
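To make the trade-off concrete, here is a toy sketch of a sparsity check followed by compaction. Everything in it is an assumption for illustration: the `ViewArray` type is hypothetical, views are assumed not to overlap, and the "less than half the buffer is referenced" threshold is made up rather than taken from DataFusion.

```rust
use std::sync::Arc;

/// Toy string-view array: (offset, length) views into one shared buffer.
struct ViewArray {
    buffer: Arc<Vec<u8>>,
    views: Vec<(usize, usize)>,
}

impl ViewArray {
    /// Bytes referenced by the views (assumes views do not overlap).
    fn used_bytes(&self) -> usize {
        self.views.iter().map(|(_, len)| *len).sum()
    }

    /// Hypothetical heuristic: compact when under half the buffer is in use.
    fn should_gc(&self) -> bool {
        self.used_bytes() * 2 < self.buffer.len()
    }

    /// Copy only the referenced bytes into a new, compact buffer.
    fn gc(&self) -> ViewArray {
        let mut buffer = Vec::with_capacity(self.used_bytes());
        let mut views = Vec::with_capacity(self.views.len());
        for &(offset, len) in &self.views {
            views.push((buffer.len(), len));
            buffer.extend_from_slice(&self.buffer[offset..offset + len]);
        }
        ViewArray { buffer: Arc::new(buffer), views }
    }
}

/// Stand-in for the coalescing step: compact sparse arrays, pass dense
/// arrays through untouched so they keep their zero-copy benefit.
fn coalesce(array: ViewArray) -> ViewArray {
    if array.should_gc() {
        array.gc()
    } else {
        array
    }
}
```

Once the compacted array replaces the sparse one, the old buffer can be freed as soon as no other array holds a reference to it.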

The art of function inlining: not too much, not too little¶

Like string inlining, function inlining is the process of embedding a short function into the caller to avoid the overhead of function calls (caller/callee save). Usually, the Rust compiler does a good job of deciding when to inline. However, it is possible to override its default using the #[inline(always)] directive. In performance-critical code, inlined code allows us to organize large functions into smaller ones without paying the runtime cost of function invocation.

However, function inlining is not always better, as it leads to larger function bodies that are harder for LLVM to optimize (for example, suboptimal register spilling) and risk overflowing the CPU’s instruction cache. We observed several performance regressions where function inlining caused slower performance when implementing the StringViewArray comparison kernels. Careful inspection and tuning of the code was required to aid the compiler in generating efficient code. More details can be found in this PR: https://github.com/apache/arrow-rs/pull/5900.

Buffer size tuning¶

StringViewArray permits multiple buffers, which enables a flexible buffer layout and potentially reduces the need to copy data. However, a large number of buffers slows down the performance of other operations. For example, get_array_memory_size needs to sum the memory size of each buffer, which takes a long time with thousands of small buffers. In certain cases, we found that multiple calls to concat_batches lead to arrays with millions of buffers, which was prohibitively expensive.

For example, consider a StringViewArray with the previous default buffer size of 8 KB. With this configuration, holding 4GB of string data requires almost half a million buffers! Larger buffer sizes are needed for larger arrays, but we cannot arbitrarily increase the default buffer size, as small arrays would consume too much memory (most arrays require at least one buffer). Buffer sizing is especially problematic in query processing, as we often need to construct small batches of string arrays, and the sizes are unknown at planning time.

To balance the buffer size trade-off, we again leverage the query processing (DataFusion) semantics to decide when to use larger buffers. While coalescing batches, we combine multiple small string arrays and set a smaller buffer size to keep the total memory consumption low. In string aggregation, we aggregate over an entire DataFusion partition, which can generate a large number of strings, so we set a larger buffer size (2MB).

To assist situations where the semantics are unknown, we also implemented a classic dynamic exponential buffer size growth strategy, which starts with a small buffer size (8KB) and doubles the size of each new buffer up to 2MB. We implemented this strategy in arrow-rs and enabled it by default so that other users of StringViewArray can also benefit from this optimization. See this issue for more details: https://github.com/apache/arrow-rs/issues/6094.
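The growth policy itself is small enough to sketch directly. The names below (`BlockSizer`, `buffers_needed`) are hypothetical and the code is not the arrow-rs implementation; only the 8KB starting size, the doubling, and the 2MB cap come from the description above.

```rust
const INITIAL_BLOCK_SIZE: u64 = 8 * 1024; // 8KB
const MAX_BLOCK_SIZE: u64 = 2 * 1024 * 1024; // 2MB

/// Hands out buffer sizes that double on every call until they hit the cap.
struct BlockSizer {
    next: u64,
}

impl BlockSizer {
    fn new() -> Self {
        BlockSizer { next: INITIAL_BLOCK_SIZE }
    }

    fn next_size(&mut self) -> u64 {
        let size = self.next;
        self.next = (self.next * 2).min(MAX_BLOCK_SIZE);
        size
    }
}

/// Number of buffers allocated before `total` bytes of capacity exist.
fn buffers_needed(total: u64) -> u64 {
    let mut sizer = BlockSizer::new();
    let mut allocated = 0;
    let mut count = 0;
    while allocated < total {
        allocated += sizer.next_size();
        count += 1;
    }
    count
}
```

Under this policy a small array still pays for only one 8KB buffer, while 4GB of string data needs on the order of two thousand buffers instead of roughly half a million fixed 8KB buffers.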

End-to-end query performance¶

We have made significant progress in optimizing StringViewArray filtering operations. Now, let’s test it in the real world to see how it works!

Let’s consider ClickBench query 22, which selects multiple string fields (URL, Title, and SearchPhrase) and applies several filters.

We ran the benchmark using the following command in the DataFusion repo. Again, the --string-view option means we use StringViewArray instead of StringArray.

To eliminate the impact of the faster Parquet reading using StringViewArray (see the first part of this blog), Figure 2 plots only the time spent in FilterExec. Without StringViewArray, the filter takes 7.17s; with StringViewArray, the filter only takes 4.86s, a 32% reduction in time. Moreover, we see a 17% improvement in end-to-end query performance.

Figure 2: StringViewArray reduces the filter time by 32% on ClickBench query 22.

Faster String Aggregation¶

So far, we have discussed how to exploit two StringViewArray features: reduced copy and faster filtering. This section focuses on reusing string bytes to repeat string values.

As described in part one of this blog, if two strings have identical values, StringViewArray can use two different views pointing at the same buffer range, thus avoiding repeating the string bytes in the buffer. This makes StringViewArray similar to an Arrow DictionaryArray that stores Strings—both array types work well for strings with only a few distinct values.

Deduplicating string values can significantly reduce memory consumption in StringViewArray. However, this process is expensive and involves hashing every string and maintaining a hash table, and so it cannot be done by default when creating a StringViewArray. We introduced an opt-in string deduplication mode in arrow-rs for advanced users who know their data has a small number of distinct values, and where the benefits of reduced memory consumption outweigh the additional overhead of array construction.
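A toy builder shows both the saving and the cost. This is a sketch under stated assumptions, not the arrow-rs deduplication mode: `DedupBuilder` is hypothetical, it uses a single buffer, and it ignores inlining of short strings. Every append hashes the value, which is the overhead the paragraph above refers to.

```rust
use std::collections::HashMap;

/// Toy builder: views are (offset, length) pairs into one buffer, and a
/// hash table remembers the view of every distinct value seen so far.
struct DedupBuilder {
    buffer: Vec<u8>,
    views: Vec<(usize, usize)>,
    seen: HashMap<String, (usize, usize)>,
}

impl DedupBuilder {
    fn new() -> Self {
        DedupBuilder { buffer: Vec::new(), views: Vec::new(), seen: HashMap::new() }
    }

    fn append(&mut self, value: &str) {
        // Hash lookup on every append: this is the cost of deduplication.
        let existing = self.seen.get(value).copied();
        let view = match existing {
            Some(view) => view, // repeated value: reuse the stored bytes
            None => {
                let view = (self.buffer.len(), value.len());
                self.buffer.extend_from_slice(value.as_bytes());
                self.seen.insert(value.to_string(), view);
                view
            }
        };
        self.views.push(view);
    }
}
```

With only a handful of distinct values the buffer stays tiny no matter how many rows are appended; with mostly unique values the hash table is pure overhead.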

Once again, we leverage DataFusion query semantics to identify StringViewArray with duplicate values, such as aggregation queries with multiple group keys. For example, some ClickBench queries group by two columns:

UserID (an integer with close to 1 M distinct values)
MobilePhoneModel (a string with fewer than a hundred distinct values)

In this case, the output row count is count(distinct UserID) * count(distinct MobilePhoneModel), which is 100M. Each string value of MobilePhoneModel is repeated 1M times. With StringViewArray, we can save space by pointing the repeating values to the same underlying buffer.

Faster string aggregation with StringView is part of a larger project to improve DataFusion aggregation performance. We have a proof of concept implementation with StringView that can improve the multi-column string aggregation by 20%. We would love your help to get it production ready!

StringView Pitfalls¶

Most existing blog posts (including this one) focus on the benefits of using StringViewArray over other string representations such as StringArray. As we have discussed, even though it requires a significant engineering investment to realize, StringViewArray is a major improvement over StringArray in many cases.

However, there are several cases where StringViewArray is slower than StringArray. For completeness, we have listed those instances here:

Tiny strings (when strings are shorter than 8 bytes): every element of the StringViewArray consumes at least 16 bytes of memory—the size of the view struct. For an array of tiny strings, StringViewArray consumes more memory than StringArray and thus can cause slower performance due to additional memory pressure on the CPU cache.
Many repeated short strings: Similar to the first point, StringViewArray can be slower and require more memory than a DictionaryArray because 1) it can only reuse the bytes in the buffer when the strings are longer than 12 bytes and 2) 32-bit offsets are always used, even when a smaller size (8 bit or 16 bit) could represent all the distinct values.
Filtering: As we mentioned above, StringViewArrays often consume more memory than the corresponding StringArray, and without GC the memory bloat quickly dominates performance. However, invoking GC also reduces the benefit of copying less, so it must be tuned carefully.

Conclusion and Takeaways¶

In these two blog posts, we discussed what it takes to implement StringViewArray in arrow-rs and then integrate it into DataFusion. Our evaluations on ClickBench queries show that StringView can improve the performance of string-intensive workloads by up to 2x.

Given that DataFusion already performs very well on ClickBench, the level of end-to-end performance improvement using StringViewArray shows the power of this technique and, of course, is a win for DataFusion and the systems that build upon it.

StringView is a big project that has received tremendous community support. Specifically, we would like to thank @tustvold, @ariesdevil, @RinChanNOWWW, @ClSlaid, @2010YOUY01, @chloro-pn, @a10y, @Kev1n8, @Weijun-H, @PsiACE, @tshauck, and @xinlifoobar for their valuable contributions!

As the introduction states, “German Style Strings” is a relatively straightforward research idea that avoids some string copies and accelerates comparisons. However, applying this (great) idea in practice requires a significant investment in careful software engineering. Again, we encourage the research community to continue to help apply research ideas to industrial systems, such as DataFusion, as doing so provides valuable perspectives when evaluating future research questions for the greatest potential impact.

Footnotes¶

[^5]: There are additional optimizations possible in this operation that the community is working on, such as https://github.com/apache/datafusion/issues/7957.

Apache DataFusion Comet 0.2.0 Release

2024-08-28T00:00:00+00:00

The Apache DataFusion PMC is pleased to announce version 0.2.0 of the Comet subproject.

Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for improved performance and efficiency without requiring any code changes.

This release covers approximately four weeks of development work and is the result of merging 87 PRs from 14 contributors. See the change log for more information.

Release Highlights¶

Docker Images¶

Docker images are now available from the GitHub Container Registry.

Performance improvements¶

Native shuffle is now enabled by default
Improved handling of decimal types
Reduced some redundant copying of batches in Filter/Scan operations
Optimized performance of count aggregates
Optimized performance of CASE expressions for specific uses:
CASE WHEN expr THEN column ELSE null END
CASE WHEN expr THEN literal ELSE literal END
Optimized performance of IS NOT NULL

New Features¶

Window operations now support count and sum aggregates
CreateArray
GetStructField
Support nested types in hash join
Basic implementation of RLIKE expression

Current Performance¶

We use benchmarks derived from the industry standard TPC-H and TPC-DS benchmarks for tracking progress with performance. The following charts show the time it takes to run the queries against 100 GB of data in Parquet format using a single executor with eight cores. See the Comet Benchmarking Guide for details of the environment used for these benchmarks.

Benchmark derived from TPC-H¶

Comet 0.2.0 provides a 62% speedup compared to Spark. This is slightly better than the Comet 0.1.0 release.

Benchmark derived from TPC-DS¶

Comet 0.2.0 provides a 21% speedup compared to Spark, which is a significant improvement compared to Comet 0.1.0, which did not provide any speedup for this benchmark.

Getting Involved¶

The Comet project welcomes new contributors. We use the same Slack and Discord channels as the main DataFusion project.

There are also many good first issues waiting for contributions.

Apache DataFusion Python 40.1.0 Released, Significant usability updates

2024-08-20T00:00:00+00:00

Introduction¶

We are happy to announce that DataFusion in Python 40.1.0 has been released. In addition to bringing in all of the new features of the core DataFusion 40.0.0 package, this release contains significant updates to the user interface and documentation. We listened to the Python user community to create a more Pythonic experience. If you have not used the Python interface to DataFusion before, this is an excellent time to give it a try!

Background¶

Until now, the Python bindings for DataFusion have primarily been a thin layer to expose the underlying Rust functionality. This has worked well for early adopters using DataFusion within their Python projects, but some users have found it difficult to work with. Compared to other DataFrame libraries, these issues were raised:

Most of the functions had little or no documentation. Users often had to refer to the Rust documentation or code to learn how to use DataFusion. This alienated some Python users.
Users could not take advantage of modern IDE features such as type hinting. These are valuable tools for rapid testing and development.
Some of the interfaces felt “clunky” to users since some Python concepts do not always map well to their Rust counterparts.

This release aims to bring a better user experience to the DataFusion Python community.

What's Changed¶

The most significant difference is that we have added wrapper functions and classes for most of the user facing interface. These wrappers, written in Python, contain both documentation and type annotations.

This documentation is now available on the DataFusion in Python API website. There you can browse the available functions and classes to see the breadth of available functionality.

Modern IDEs use language servers such as Pylance or Jedi to perform analysis of Python code, provide useful hints, and identify usage errors. These are major tools in the Python user community. With this release, users can fully use these tools in their workflow.

Figure 1: With the enhanced Python wrappers, users can see helpful tool tips with type annotations directly in modern IDEs.

By having the type annotations, these IDEs can also identify quickly when a user has incorrectly used a function's arguments as shown in Figure 2.

Figure 2: Modern Python language servers can perform static analysis and quickly find errors in the arguments to functions.

In addition to these wrapper libraries, we have enhanced some of the functions to make them easier to use.

Improved DataFrame filter arguments¶

You can now apply multiple filter statements in a single step. When using DataFrame.filter you can pass in multiple arguments, separated by a comma. These will act as a logical AND of all of the filter arguments. The following two statements are equivalent:

df.filter(col("size") < col("max_size")).filter(col("color") == lit("green"))
df.filter(col("size") < col("max_size"), col("color") == lit("green"))

Comparison against literal values¶

It is very common to write DataFrame operations that compare an expression to some fixed value. For example, filtering a DataFrame might have an operation such as df.filter(col("size") < lit(16)). To make these common operations more ergonomic, you can now simply use df.filter(col("size") < 16).

For the right hand side of the comparison operator, you can now use any Python value that can be coerced into a Literal. This gives an easy to read expression. For example, consider these few lines from one of the TPC-H examples provided in the DataFusion Python repository.

df = (
    df_lineitem.filter(col("l_shipdate") >= lit(date))
    .filter(col("l_discount") >= lit(DISCOUNT) - lit(DELTA))
    .filter(col("l_discount") <= lit(DISCOUNT) + lit(DELTA))
    .filter(col("l_quantity") < lit(QUANTITY))
)

The above code closely mirrors how these filters would need to be applied in Rust. With this new release, the user can simplify these lines. The example below also shows that filter() now accepts a variable number of arguments and filters on all such arguments (boolean AND).

df = df_lineitem.filter(
    col("l_shipdate") >= date,
    col("l_discount") >= DISCOUNT - DELTA,
    col("l_discount") <= DISCOUNT + DELTA,
    col("l_quantity") < QUANTITY,
)

Select columns by name¶

It is very common for users to perform DataFrame selection where they simply want a column. For this we have had the function select_columns("a", "b") or the user could perform select(col("a"), col("b")). In the new release, we accept either full expressions in select() or strings of the column names. You can mix these as well.

Where before you might have had to write an operation like

df_subset = df.select(col("a"), col("b"), f.abs(col("c")))

You can now simplify this to

df_subset = df.select("a", "b", f.abs(col("c")))

Creating named structs¶

Creating a struct with named fields was previously difficult to use and allowed for potential user errors when specifying the name of each field. Now we have a cleaner interface where the user passes a list of tuples containing the name of the field and the expression to create.

df.select(f.named_struct([
  ("a", col("a")),
  ("b", col("b"))
]))

Next Steps¶

While most of the user facing classes and functions have been exposed, a few still need to be exposed, namely the classes in datafusion.object_store and the logical plans used by datafusion.substrait. The team is working on these issues.

Additionally, the next release of DataFusion includes improvements to the user-defined aggregate and window functions that make them easier to use. We plan on bringing these enhancements to this project.

Thank You¶

We would like to thank the following members for their very helpful discussions regarding these updates: @andygrove, @max-muoto, @slyons, @Throne3d, @Michael-J-Ward, @datapythonista, @austin362667, @kylebarron, @simicd. The primary PR (#750) that includes these updates had an extensive conversation, leading to a significantly improved end product. Again, thank you to all who provided input!

We would like to give a special thank you to @3ok who created the initial version of the wrapper definitions. The work they did was time consuming and required exceptional attention to detail. It provided enormous value to starting this project. Thank you!

Get Involved¶

The DataFusion Python team is an active and engaging community and we would love to have you join us and help the project.

Here are some ways to get involved:

Learn more by visiting the DataFusion Python project page.
Try out the project and provide feedback, file issues, and contribute code.

Apache DataFusion 40.0.0 Released

2024-07-24T00:00:00+00:00

Introduction¶

We are proud to announce DataFusion 40.0.0. This blog highlights some of the many major improvements since we released DataFusion 34.0.0 and a preview of what the community is thinking about in the next 6 months. We are hoping to make more regular blog posts -- if you are interested in helping write them, please reach out!

Community Growth 📈¶

In the last 6 months, between 34.0.0 and 40.0.0, our community has continued to grow in new and exciting ways.

DataFusion became a top level Apache Software Foundation project (read the press release and blog post).
We added several PMC members and new committers: @comphead, @mustafasrepo, @ozankabak, and @waynexia joined the PMC, @jonahgao and @lewiszlw joined as committers. See the mailing list for more details.
DataFusion Comet was donated and is nearing its first release.
In the core DataFusion repo alone we reviewed and accepted almost 1500 PRs from 182 different committers, created over 1000 issues and closed 781 of them 🚀. This is up almost 50% from our last post (1000 PRs from 124 committers with 650 issues created in our last post) 🤯. All changes are listed in the detailed CHANGELOG.
DataFusion focused meetups happened or are happening in multiple cities around the world: Austin, San Francisco, Hangzhou, New York, and Belgrade.
Many new projects started in the datafusion-contrib organization, including Table Providers, SQLancer, Open Variant, JSON, and ORC.

In addition, DataFusion has been appearing publicly more and more, both online and offline. Here are some highlights:

Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine, was presented in SIGMOD '24, one of the major database conferences
As part of the trend to define "the POSIX of databases" in "What Goes Around Comes Around... And Around..." from Andy Pavlo and Mike Stonebraker
"Why you should keep an eye on Apache DataFusion and its community"
Apache DataFusion offline meetup in the Bay Area

Improved Performance 🚀¶

Performance is a key feature of DataFusion, and the community continues to work to keep DataFusion state of the art in this area. One major area DataFusion improved is the time it takes to convert a SQL query into a plan that can be executed. Planning is now almost 2x faster for TPC-DS and TPC-H queries, and over 10x faster for some queries with many columns.

Here is a chart showing the improvement due to the concerted effort of many contributors including @jackwener, @alamb, @Lordworms, @dmitrybugakov, @appletreeisyellow, @ClSlaid, @rohitrastogi, @emgeee, @kevinmingtarja, and @peter-toth over several months (see ticket for more details)

DataFusion is now up to 40% faster for queries that GROUP BY a single string or binary column due to a specialization for single Utf8/LargeUtf8/Binary/LargeBinary columns. We are working on improving performance when there are multiple variable length columns in the GROUP BY clause.

We are also in the final phases of integrating the new Arrow StringView which significantly improves performance for workloads that scan, filter and group by variable length string and binary data. We expect the improvement to be especially pronounced for Parquet files due to upstream work in the parquet reader. Kudos to @XiangpengHao, @AriesDevil, @PsiACE, @Weijun-H, @a10y, and @RinChanNOWWW for driving this project.

Improved Quality 📋¶

DataFusion continues to improve overall in quality. In addition to ongoing bug fixes, one of the most exciting improvements is the addition of a new SQLancer based DataFusion Fuzzing suite, thanks to @2010YOUY01, which has already found several bugs. Thanks also to @jonahgao, @tshauck, @xinlifoobar, and @LorrensP-2158466 for fixing them so fast.

Improved Documentation 📚¶

We continue to improve the documentation to make it easier to get started using DataFusion with the Library Users Guide, API documentation, and Examples.

Some notable new examples include:

sql_analysis.rs to analyse SQL queries with DataFusion structures (thanks @LorrensP-2158466)
function_factory.rs to create custom functions via SQL (thanks @milenkovicm)
plan_to_sql.rs to generate SQL from DataFusion Expr and LogicalPlan (thanks @edmondop)
parquet_index.rs and advanced_parquet_index.rs for parquet indexing, described more below (thanks @alamb)

New Features ✨¶

There are too many new features in the last 6 months to list them all, but here are some highlights:

SQL¶

Support for UNNEST (thanks @duongcongtoai, @JasonLi-cn and @jayzhan211)
Support for Recursive CTEs (thanks @jonahgao and @matthewgapp)
Support for CREATE FUNCTION (see below)
Many new SQL functions

DataFusion now has much improved support for structured types such as STRUCT, LIST/ARRAY and MAP. For example, you can now create STRUCT literals in SQL like this:

> select {'foo': {'bar': 2}};
+--------------------------------------------------------------+
| named_struct(Utf8("foo"),named_struct(Utf8("bar"),Int64(2))) |
+--------------------------------------------------------------+
| {foo: {bar: 2}}                                              |
+--------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.002 seconds.

SQL Unparser (SQL Formatter)¶

DataFusion now supports converting Exprs and LogicalPlans BACK to SQL text. This can be useful in query federation to push predicates down into other systems that only accept SQL, and for building systems that generate SQL.

For example, you can now convert a logical expression back to SQL text:

// Form a logical expression that represents the SQL "a < 5 OR a = 8"
let expr = col("a").lt(lit(5)).or(col("a").eq(lit(8)));
// convert the expression back to SQL text
let sql = expr_to_sql(&expr)?.to_string();
assert_eq!(sql, "a < 5 OR a = 8");

You can also do complex things like parsing SQL, modifying the plan, and converting it back to SQL:

let df = ctx
  // Use SQL to read some data from the parquet file
  .sql("SELECT int_col, double_col, CAST(date_string_col as VARCHAR) FROM alltypes_plain")
  .await?;
// Programmatically add new filters `id > 1 and tinyint_col < double_col`
let df = df.filter(col("id").gt(lit(1)).and(col("tinyint_col").lt(col("double_col"))))?;
// Convert the new logical plan back to SQL
let sql = plan_to_sql(df.logical_plan())?.to_string();
assert_eq!(sql,
           "SELECT alltypes_plain.int_col, alltypes_plain.double_col, CAST(alltypes_plain.date_string_col AS VARCHAR) \
           FROM alltypes_plain WHERE ((alltypes_plain.id > 1) AND (alltypes_plain.tinyint_col < alltypes_plain.double_col))"
);

See the Plan to SQL example or the APIs expr_to_sql and plan_to_sql for more details.
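
To make the idea of unparsing concrete, here is a toy sketch in Python. It is not DataFusion's Rust API: the `Col`, `Lit` and `BinOp` classes and the `to_sql` function are hypothetical names invented for this illustration. The sketch walks a small expression tree and emits SQL text, adding parentheses only where operator precedence requires them.

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical, minimal expression tree; not DataFusion's Expr type.
@dataclass
class Col:
    name: str

@dataclass
class Lit:
    value: int

@dataclass
class BinOp:
    left: "Node"
    op: str
    right: "Node"

Node = Union[Col, Lit, BinOp]

# Higher number binds tighter.
PRECEDENCE = {"OR": 1, "AND": 2, "=": 3, "<": 3, ">": 3}

def to_sql(node: Node, parent_prec: int = 0) -> str:
    """Render an expression tree as SQL text."""
    if isinstance(node, Col):
        return node.name
    if isinstance(node, Lit):
        return str(node.value)
    prec = PRECEDENCE[node.op]
    left = to_sql(node.left, prec)
    # The right operand is rendered one level tighter so that
    # same-precedence operators on the right keep their grouping.
    right = to_sql(node.right, prec + 1)
    text = f"{left} {node.op} {right}"
    # Parenthesize only when this operator binds less tightly than its parent.
    return f"({text})" if prec < parent_prec else text

# The same expression as the Rust example: a < 5 OR a = 8
expr = BinOp(BinOp(Col("a"), "<", Lit(5)), "OR", BinOp(Col("a"), "=", Lit(8)))
sql = to_sql(expr)
```

A production unparser also has to handle quoting, dialect differences and many more node types, which is what the expr_to_sql and plan_to_sql APIs take care of.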

Low Level APIs for Fast Parquet Access (indexing)¶

As Parquet files stored remotely on object storage become more prevalent, supporting efficient access to them is increasingly important. Part of doing this efficiently is minimizing the number of object store requests by caching metadata and skipping over parts of the file that are not needed (e.g. via an index).

DataFusion's Parquet reader has long internally supported advanced predicate pushdown by reading the Parquet metadata from the file footer and pruning based on row group and data page statistics. DataFusion now also supports users supplying their own low level pruning information via the ParquetAccessPlan API.

This API can be used along with index information to selectively skip decoding parts of the file. For example, Spice AI used this feature to add efficient support for reading from DeltaLake tables and handling deletion vectors.

        ┌───────────────────────┐   If the RowSelection does not include any
        │          ...          │   rows from a particular Data Page, that
        │                       │   Data Page is not fetched or decoded.
        │ ┌───────────────────┐ │   Note this requires a PageIndex
        │ │     ┌──────────┐  │ │
Row     │ │     │DataPage 0│  │ │                 ┌────────────────────┐
Groups  │ │     └──────────┘  │ │                 │                    │
        │ │     ┌──────────┐  │ │                 │    ParquetExec     │
        │ │ ... │DataPage 1│ ◀┼ ┼ ─ ─ ─           │  (Parquet Reader)  │
        │ │     └──────────┘  │ │      └ ─ ─ ─ ─ ─│                    │
        │ │     ┌──────────┐  │ │                 │ ╔═══════════════╗  │
        │ │     │DataPage 2│  │ │ If only rows    │ ║ParquetMetadata║  │
        │ │     └──────────┘  │ │ from DataPage 1 │ ╚═══════════════╝  │
        │ └───────────────────┘ │ are selected,   └────────────────────┘
        │                       │ only DataPage 1
        │          ...          │ is fetched and
        │                       │ decoded
        │ ╔═══════════════════╗ │
        │ ║  Thrift metadata  ║ │
        │ ╚═══════════════════╝ │
        └───────────────────────┘
         Parquet File

See the parquet_index.rs and advanced_parquet_index.rs examples for more details.
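
The page-skipping idea in the diagram can be sketched in a few lines of Python. This is an illustration only, not the ParquetAccessPlan API: the function `pages_to_fetch` and its inputs are hypothetical. Given the number of rows in each data page and a selection of row ranges, it returns the indexes of the pages that contain at least one selected row.

```python
from typing import List, Tuple

def pages_to_fetch(page_row_counts: List[int],
                   selected_ranges: List[Tuple[int, int]]) -> List[int]:
    """Return indexes of data pages that overlap any selected row range.

    page_row_counts: number of rows in each data page, in file order.
    selected_ranges: half-open (start, end) row ranges within the row group.
    """
    needed = []
    page_start = 0
    for page_index, row_count in enumerate(page_row_counts):
        page_end = page_start + row_count
        # Two half-open ranges overlap when each starts before the other ends.
        if any(start < page_end and page_start < end
               for start, end in selected_ranges):
            needed.append(page_index)
        page_start = page_end
    return needed

# Three pages of 100 rows each; only rows 120..180 are selected,
# so only the middle page (DataPage 1) must be fetched and decoded.
fetched = pages_to_fetch([100, 100, 100], [(120, 180)])
```

In a real reader the per-page row counts come from the Parquet PageIndex, which is why the diagram notes that page-level skipping requires one.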

Thanks to @alamb and @Ted-Jiang for this feature.

Building Systems is Easier with DataFusion 🛠️¶

In addition to many incremental API improvements, there are several new APIs that make it easier to build systems on top of DataFusion:

Faster and easier to use TreeNode API for traversing and manipulating plans and expressions.
All functions now use the same Scalar User Defined Function API, making it easier to customize DataFusion's behavior without sacrificing performance. See ticket for more details.
DataFusion can now be compiled to WASM.

User Defined SQL Parsing Extensions¶

As of DataFusion 40.0.0, you can use the ExprPlanner trait to extend DataFusion's SQL planner to support custom operators or syntax.

For example, the datafusion-functions-json project uses this API to support JSON operators in SQL queries. It provides a custom implementation for planning JSON operators such as -> and ->> with code like:

struct MyCustomPlanner;

impl ExprPlanner for MyCustomPlanner {
    // Provide a custom implementation for planning binary operators
    // such as `->` and `->>`
    fn plan_binary_op(
        &self,
        expr: RawBinaryExpr,
        _schema: &DFSchema,
    ) -> Result<PlannerResult<RawBinaryExpr>> {
        match &expr.op {
           BinaryOperator::Arrow => { /* plan -> operator */ }
           BinaryOperator::LongArrow => { /* plan ->> operator */ }
           ...
        }
    }
}

Thanks to @samuelcolvin, @jayzhan211 and @dharanad for helping make this feature happen.

Pluggable Support for `CREATE FUNCTION`¶

DataFusion's new FunctionFactory API lets users provide a handler for CREATE FUNCTION SQL statements. This feature lets you build systems that support defining functions in SQL such as:

-- SQL based functions
CREATE FUNCTION my_func(DOUBLE, DOUBLE) RETURNS DOUBLE
    RETURN $1 + $2
;

-- ML Models
CREATE FUNCTION iris(FLOAT[]) RETURNS FLOAT[] 
LANGUAGE TORCH AS 'models:/iris@champion';

-- WebAssembly
CREATE FUNCTION func(FLOAT[]) RETURNS FLOAT[] 
LANGUAGE WASM AS 'func.wasm';

Huge thanks to @milenkovicm for this feature. There is an example of how to make macro like functions in function_factory.rs. It would be great if someone made a demo showing how to create WASMs 🎣.

Looking Ahead: The Next Six Months 🔭¶

The community has been discussing what we will work on in the next six months. Some major initiatives from that discussion are:

Performance: Improve the speed of aggregating "high cardinality" data when there are many (e.g. millions) of distinct groups as well as additional ideas to improve parquet performance.
Modularity: Make DataFusion even more modular, by completely unifying built in and user aggregate functions and window functions.
LogicalTypes: Introduce Logical Types to make it easier to use different encodings like StringView, RunEnd and Dictionary arrays as well as user defined types. Thanks @notfilippo for driving this.
Improved Documentation: Write blog posts and videos explaining how to use DataFusion for real-world use cases.
Testing: Improve CI infrastructure and test coverage, more fuzz testing, and better functional and performance regression testing.

How to Get Involved¶

Apache DataFusion Comet 0.1.0 Release

2024-07-20T00:00:00+00:00

The Apache DataFusion PMC is pleased to announce the first official source release of the Comet subproject.

Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for improved performance and efficiency without requiring any code changes.

This release covers five months of development work since the project was donated to the Apache DataFusion project and is the result of merging 343 PRs from 41 contributors. See the change log for more information.

This first release supports 15 data types, 13 operators, and 106 expressions. Comet is compatible with Apache Spark versions 3.3, 3.4, and 3.5. There is also experimental support for preview versions of Spark 4.0.

Project Status¶

The project's recent focus has been on fixing correctness and stability issues and implementing additional native operators and expressions so that a broader range of queries can be executed natively.

Here are some of the highlights since the project was donated:

Implemented native support for:
SortMergeJoin
HashJoin
BroadcastHashJoin
Columnar Shuffle
More aggregate expressions
Window aggregates
Many Spark-compatible CAST expressions
Implemented a simple Spark Fuzz Testing utility to find correctness issues
Published a User Guide and Contributors Guide
Created a DataFusion Benchmarks repository with scripts and documentation for running benchmarks derived from TPC-H and TPC-DS with DataFusion and Comet

Current Performance¶

Comet already delivers a modest performance speedup for many queries, enabling faster data processing and shorter time-to-insights.

We use benchmarks derived from the industry standard TPC-H and TPC-DS benchmarks for tracking progress with performance. The following chart shows the time it takes to run the 22 TPC-H queries against 100 GB of data in Parquet format using a single executor with eight cores. See the Comet Benchmarking Guide for details of the environment used for these benchmarks.

Comet reduces the overall execution time from 626 seconds to 407 seconds, a 54% speedup (1.54x faster).
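
The relationship between these figures is simple arithmetic, sketched here in Python (the variable names are illustrative). Note that a 1.54x speedup is not the same as a 54% reduction in runtime.

```python
spark_seconds = 626   # total TPC-H runtime with Spark alone, from the text above
comet_seconds = 407   # total TPC-H runtime with Comet enabled

speedup = spark_seconds / comet_seconds          # how many times faster
percent_faster = (speedup - 1) * 100             # the same ratio as a percentage
time_saved = 1 - comet_seconds / spark_seconds   # fraction of runtime removed

print(f"{speedup:.2f}x faster, {percent_faster:.0f}% speedup, "
      f"{time_saved:.0%} less wall-clock time")
# prints "1.54x faster, 54% speedup, 35% less wall-clock time"
```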

Running the same queries with DataFusion standalone using the same number of cores results in a 3.9x speedup compared to Spark. Although this isn’t a fair comparison (DataFusion does not have shuffle or match Spark semantics in some cases, for example), it does give some idea about the potential future performance of Comet. Comet aims to provide a 2x-4x speedup for a wide range of queries once more operators and expressions can run natively.

The following chart shows how much Comet currently accelerates each query from the benchmark.

These benchmarks can be reproduced in any environment using the documentation in the Comet Benchmarking Guide. We encourage you to run these benchmarks in your environment or, even better, try Comet out with your existing Spark jobs.

Roadmap¶

Comet is an open-source project, and contributors are welcome to work on any features they are interested in, but here are some current focus areas.

Improve Performance & Reliability:
Implement the remaining features needed so that all TPC-H queries can run entirely natively
Implement spill support in SortMergeJoin
Enable columnar shuffle by default
Fully support Spark version 4.0.0
Support more Spark operators and expressions
We would like to support many more expressions natively in Comet, and this is a great place to start contributing. The contributors' guide has a section covering adding support for new expressions.
Move more Spark expressions into the datafusion-comet-spark-expr crate. Although the main focus of the Comet project is to provide an accelerator for Apache Spark, we also publish a standalone crate containing Spark-compatible expressions that can be used by any project using DataFusion, without adding any dependencies on JVM or Apache Spark.
Release Process & Documentation
Implement a binary release process so that we can publish JAR files to Maven for all supported platforms
Add documentation for running Spark and Comet in Kubernetes, and add example Dockerfiles.

Getting Involved¶

The Comet project welcomes new contributors. We use the same Slack and Discord channels as the main DataFusion project, and there is a Comet community video call held every four weeks on Wednesdays at 11:30 a.m. Eastern Time, which is 16:30 UTC during Eastern Standard Time and 15:30 UTC during Eastern Daylight Time. See the Comet Community Meeting Google Document for the next scheduled meeting date, the video call link, and recordings of previous calls.

There are also many good first issues waiting for contributions.

Announcing Apache Arrow DataFusion is now Apache DataFusion

2024-05-07T00:00:00+00:00

Introduction¶

TLDR; Apache Arrow DataFusion --> Apache DataFusion

The Arrow PMC and newly created DataFusion PMC are happy to announce that as of April 16, 2024 the Apache Arrow DataFusion subproject is now a top level Apache Software Foundation project.

Background¶

Apache DataFusion is a fast, extensible query engine for building high-quality data-centric systems in Rust, using the Apache Arrow in-memory format.

When DataFusion was donated to the Apache Software Foundation in 2019, the DataFusion community was not large enough to stand on its own and the Arrow project agreed to help support it. The community has grown significantly since 2019, benefiting immensely from being part of Arrow and following The Apache Way.

Why now?¶

The community discussed graduating to a top level project publicly for almost a year, as the project seemed ready to stand on its own and would benefit from more focused governance. For example, earlier in DataFusion's life many contributed to both arrow-rs and DataFusion, but as DataFusion has matured many contributors, committers and PMC members focused more and more exclusively on DataFusion.

Looking forward¶

The future looks bright. There are now tens of known projects built with DataFusion, and that number continues to grow. We recently held our first in person meetup, passed 5000 stars on GitHub, wrote a paper that was accepted at SIGMOD 2024, and began work on Comet, an Apache Spark accelerator initially donated by Apple.

Thank you to everyone in the Arrow community who helped DataFusion grow and mature over the years, and we look forward to continuing our collaboration as projects. All future blogs and announcements will be posted on the Apache DataFusion website.

Get Involved¶

If you are interested in joining the community, we would love to have you join us. Get in touch using the Communication Doc and learn how to get involved in the Contributor Guide. We welcome everyone to try DataFusion on their own data and projects and let us know how it goes, contribute suggestions, documentation, bug reports, or a PR with documentation, tests or code.

Announcing Apache Arrow DataFusion Comet

2024-03-06T00:00:00+00:00

Introduction¶

The Apache Arrow PMC is pleased to announce the donation of the Comet project, a native Spark SQL Accelerator built on Apache Arrow DataFusion.

Comet is an Apache Spark plugin that uses Apache Arrow DataFusion to accelerate Spark workloads. It is designed as a drop-in replacement for Spark's JVM based SQL execution engine and offers significant performance improvements for some workloads as shown below.

Figure 1: With Comet, users interact with the same Spark ecosystem, tools and APIs such as Spark SQL. Queries still run through Spark's query optimizer and planner. However, the execution is delegated to Comet, which is significantly faster and more resource efficient than a JVM based implementation.

Comet is one of a growing class of projects that aim to accelerate Spark using native columnar engines such as the proprietary Databricks Photon Engine and open source projects Gluten, Spark RAPIDS, and Blaze (also built using DataFusion).

Comet was originally implemented at Apple and the engineers who worked on the project are also significant contributors to Arrow and DataFusion. Bringing Comet into the Apache Software Foundation will accelerate its development and grow its community of contributors and users.

Get Involved¶

Comet is still in the early stages of development and we would love to have you join us and help shape the project. We are working on an initial release, and expect to post another update with more details at that time.

Before then, here are some ways to get involved:

Learn more by visiting the Comet project page and reading the mailing list discussion about the initial donation.
Help us plan out the roadmap
Try out the project and provide feedback, file issues, and contribute code.

Apache Arrow DataFusion 34.0.0 Released, Looking Forward to 2024

2024-01-19T00:00:00+00:00

Introduction¶

We recently released DataFusion 34.0.0. This blog highlights some of the major improvements since we released DataFusion 26.0.0 (spoiler alert: there are many) and gives a preview of where the community plans to focus in the next 6 months.

Apache Arrow DataFusion is an extensible query engine, written in Rust, that uses Apache Arrow as its in-memory format. DataFusion is used by developers to create new, fast data centric systems such as databases, dataframe libraries, machine learning and streaming applications. While DataFusion’s primary design goal is to accelerate creating other data centric systems, it has a reasonable experience directly out of the box as a dataframe library and command line SQL tool.

This may also be our last update on the Apache Arrow Site. Future updates will likely be on the DataFusion website as we are working to graduate to a top level project (Apache Arrow DataFusion → Apache DataFusion!) which will help focus governance and project growth. Also exciting, our first DataFusion in person meetup is planned for March 2024.

DataFusion is very much a community endeavor. Our core thesis is that as a community we can build much more advanced technology than any of us as individuals or companies could alone. In the last 6 months between 26.0.0 and 34.0.0, community growth has been strong. We accepted and reviewed over a thousand PRs from 124 different committers, created over 650 issues and closed 517 of them. You can find a list of all changes in the detailed CHANGELOG.

Improved Performance 🚀¶

Performance is a key feature of DataFusion. DataFusion 34.0.0 is more than 2x faster on ClickBench compared to version 25.0.0, as shown below:

Figure 1: Performance improvement between 25.0.0 and 34.0.0 on ClickBench. Note that DataFusion 25.0.0 could not run several queries due to unsupported SQL (Q9, Q11, Q12, Q14) or memory requirements (Q33).

Figure 2: Total query runtime for DataFusion 34.0.0 and DataFusion 25.0.0.

Here are some specific enhancements we have made to improve performance:

2-3x better aggregation performance with many distinct groups
Partially ordered grouping / streaming grouping
Specialized operator for "TopK" ORDER BY LIMIT XXX
Specialized operator for min(col) GROUP BY .. ORDER by min(col) LIMIT XXX
Improved join performance
Eliminate redundant sorting with sort order aware optimizers

New Features ✨¶

DML / Insert / Creating Files¶

DataFusion now supports writing data in parallel, to individual or multiple files, using Parquet, CSV, JSON, ARROW and user defined formats. Benchmark results show improvements up to 5x in some cases.

Similarly to reading, data can now be written to any ObjectStore implementation, including AWS S3, Azure Blob Storage, GCP Cloud Storage, local files, and user defined implementations. While reading from hive style partitioned tables has long been supported, it is now possible to write to such tables as well.

For example, to write to a local file:

❯ CREATE EXTERNAL TABLE awesome_table(x INT) STORED AS PARQUET LOCATION '/tmp/my_awesome_table';
0 rows in set. Query took 0.003 seconds.

❯ INSERT INTO awesome_table SELECT x * 10 FROM my_source_table;
+-------+
| count |
+-------+
| 3     |
+-------+
1 row in set. Query took 0.024 seconds.

You can also write to files with the COPY command, similar to DuckDB’s COPY:

❯ COPY (SELECT x + 1 FROM my_source_table) TO '/tmp/output.json';
+-------+
| count |
+-------+
| 3     |
+-------+
1 row in set. Query took 0.014 seconds.

$ cat /tmp/output.json
{"x":1}
{"x":2}
{"x":3}

Improved `STRUCT` and `ARRAY` support¶

DataFusion 34.0.0 has much improved STRUCT and ARRAY support, including a full range of struct functions and array functions.

For example, you can now use [] syntax and array_length to access and inspect arrays:

❯ SELECT column1, 
         column1[1] AS first_element, 
         array_length(column1) AS len 
  FROM my_table;
+-----------+---------------+-----+
| column1   | first_element | len |
+-----------+---------------+-----+
| [1, 2, 3] | 1             | 3   |
| [2]       | 2             | 1   |
| [4, 5]    | 4             | 2   |
+-----------+---------------+-----+

❯ SELECT column1, column1['c0'] FROM  my_table;
+------------------+----------------------+
| column1          | my_table.column1[c0] |
+------------------+----------------------+
| {c0: foo, c1: 1} | foo                  |
| {c0: bar, c1: 2} | bar                  |
+------------------+----------------------+
2 rows in set. Query took 0.002 seconds.

Other Features¶

Other notable features include:

Support aggregating datasets that exceed memory size, with group by spill to disk
All operators now track and limit their memory consumption, including Joins

Building Systems is Easier with DataFusion 🛠️¶

Documentation¶

It is easier than ever to get started using DataFusion with the new Library Users Guide as well as the significantly improved API documentation.

User Defined Window and Table Functions¶

In addition to DataFusion's User Defined Scalar Functions, and User Defined Aggregate Functions, DataFusion now supports User Defined Window Functions and User Defined Table Functions.

For example, the datafusion-cli implements a DuckDB style parquet_metadata function as a user defined table function (source code here):

❯ SELECT 
      path_in_schema, row_group_id, row_group_num_rows, stats_min, stats_max, total_compressed_size 
FROM 
      parquet_metadata('hits.parquet')
WHERE path_in_schema = '"WatchID"' 
LIMIT 3;

+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
| path_in_schema | row_group_id | row_group_num_rows | stats_min           | stats_max           | total_compressed_size |
+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
| "WatchID"      | 0            | 450560             | 4611687214012840539 | 9223369186199968220 | 3883759               |
| "WatchID"      | 1            | 612174             | 4611689135232456464 | 9223371478009085789 | 5176803               |
| "WatchID"      | 2            | 344064             | 4611692774829951781 | 9223363791697310021 | 3031680               |
+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
3 rows in set. Query took 0.053 seconds.

Growth of DataFusion 📈¶

DataFusion has been appearing more publicly in the wild. For example:

New projects built using DataFusion such as lancedb, GlareDB, Arroyo, and optd.
Public talks such as Apache Arrow Datafusion: Vectorized Execution Framework For Maximum Performance in CommunityOverCode Asia 2023
Blog posts such as Apache Arrow, Arrow/DataFusion, AI-native Data Infra, Flight, DataFusion, Arrow, and Parquet: Using the FDAP Architecture to build InfluxDB 3.0, and A Guide to User-Defined Functions in Apache Arrow DataFusion

We have also submitted a paper to SIGMOD 2024, one of the premier database conferences, describing DataFusion in a technically formal style and making the case that it is possible to create a modular and extensible query engine without sacrificing performance. We hope this paper helps people evaluating DataFusion for their needs understand it better.

DataFusion in 2024 🥳¶

Some major initiatives from contributors we know of this year are:

Modularity: Make DataFusion even more modular, such as unifying built in and user functions, making it easier to customize DataFusion's behavior.
Community Growth: Graduate to our own top level Apache project, and subsequently add more committers and PMC members to keep pace with project growth.
Use case white papers: Write blog posts and videos explaining how to use DataFusion for real-world use cases.
Testing: Improve CI infrastructure and test coverage, more fuzz testing, and better functional and performance regression testing.
Planning Time: Reduce the time taken to plan queries, both wide tables of 1000s of columns, and in general.
Aggregate Performance: Improve the speed of aggregating "high cardinality" data when there are many (e.g. millions) of distinct groups.
Statistics: Improved statistics handling with an eye towards more sophisticated expression analysis and cost models.

How to Get Involved¶

If you are interested in contributing to DataFusion we would love to have you join us. You can try out DataFusion on some of your own data and projects and let us know how it goes, contribute suggestions, documentation, bug reports, or a PR with documentation, tests or code. A list of open issues suitable for beginners is here.

As the community grows, we are also looking to restart biweekly calls / meetings. Timezones are always a challenge for such meetings, but we hope to have two calls that can work for most attendees. If you are interested in helping, or just want to say hi, please drop us a note via one of the methods listed in our Communication Doc.

Aggregating Millions of Groups Fast in Apache Arrow DataFusion 28.0.0

2023-08-05T00:00:00+00:00

Aggregating Millions of Groups Fast in Apache Arrow DataFusion¶

Andrew Lamb, Daniël Heres, Raphael Taylor-Davies

Note: this article was originally published on the InfluxData Blog

TLDR¶

Grouped aggregations are a core part of any analytic tool, creating understandable summaries of huge data volumes. Apache Arrow DataFusion’s parallel aggregation capability is 2-3x faster in the newly released version 28.0.0 for queries with a large number (10,000 or more) of groups.

Improving aggregation performance matters to all users of DataFusion. For example, both InfluxDB, a time series data platform, and Coralogix, a full-stack observability platform, aggregate vast amounts of raw data to monitor and create insights for our customers. Improving DataFusion’s performance lets us provide better user experiences by generating insights faster with fewer resources. Because DataFusion is open source and released under the permissive Apache 2.0 license, the whole DataFusion community benefits as well.

With the new optimizations, DataFusion’s grouping speed is now close to DuckDB, a system that regularly reports great grouping benchmark performance numbers. Figure 1 contains a representative sample of ClickBench on a single Parquet file, and the full results are at the end of this article.

Figure 1: Query performance for ClickBench queries 16, 17, 18 and 19 on a single Parquet file for DataFusion 27.0.0, DataFusion 28.0.0 and DuckDB 0.8.1.

Introduction to high cardinality grouping¶

Aggregation is a fancy word for computing summary statistics across many rows that have the same value in one or more columns. We call the rows with the same values groups and “high cardinality” means there are a large number of distinct groups in the dataset. At the time of writing, a “large” number of groups in analytic engines is around 10,000.

For example the ClickBench hits dataset contains 100 million anonymized user clicks across a set of websites. ClickBench Query 17 is:

SELECT "UserID", "SearchPhrase", COUNT(*)
FROM hits
GROUP BY "UserID", "SearchPhrase"
ORDER BY COUNT(*) DESC
LIMIT 10;

In English, this query finds “the top ten (user, search phrase) combinations, across all clicks” and produces the following results (there are no search phrases for the top ten users):

+---------------------+--------------+-----------------+
| UserID              | SearchPhrase | COUNT(UInt8(1)) |
+---------------------+--------------+-----------------+
| 1313338681122956954 |              | 29097           |
| 1907779576417363396 |              | 25333           |
| 2305303682471783379 |              | 10597           |
| 7982623143712728547 |              | 6669            |
| 7280399273658728997 |              | 6408            |
| 1090981537032625727 |              | 6196            |
| 5730251990344211405 |              | 6019            |
| 6018350421959114808 |              | 5990            |
| 835157184735512989  |              | 5209            |
| 770542365400669095  |              | 4906            |
+---------------------+--------------+-----------------+

The ClickBench dataset contains:

99,997,497 total rows[^1]
17,630,976 different users (distinct UserIDs)[^2]
6,019,103 different search phrases[^3]
24,070,560 distinct combinations[^4] of (UserID, SearchPhrase)

Thus, to answer the query, DataFusion must map each of the 100M input rows into one of the 24 million different groups, and keep count of how many such rows there are in each group.

The solution¶

Like most concepts in databases and other analytic systems, the basic ideas of this algorithm are straightforward and taught in introductory computer science courses. You could compute the query with a program such as this[^5]:

import pandas as pd
from collections import defaultdict
from operator import itemgetter

# read file
hits = pd.read_parquet('hits.parquet', engine='pyarrow')

# build groups
counts = defaultdict(int)
for _, row in hits.iterrows():
    group = (row['UserID'], row['SearchPhrase'])
    # update the dict entry for the corresponding key
    counts[group] += 1

# Print the top 10 values
print(dict(sorted(counts.items(), key=itemgetter(1), reverse=True)[:10]))

This approach, while simple, is both slow and very memory inefficient. It requires over 40 seconds to compute the results for less than 1% of the dataset[^6]. Both DataFusion 28.0.0 and DuckDB 0.8.1 compute results in under 10 seconds for the entire dataset.

To answer this query quickly and efficiently, you have to write your code such that it:

Keeps all cores busy aggregating via parallelized computation
Updates aggregate values quickly, using vectorizable loops that are easy for compilers to translate into the high performance SIMD instructions available in modern CPUs.

The rest of this article explains how grouping works in DataFusion and the improvements we made in 28.0.0.

Two phase parallel partitioned grouping¶

Both DataFusion 27.0.0 and 28.0.0 use state-of-the-art, two phase parallel hash partitioned grouping, similar to other high-performance vectorized engines like DuckDB’s Parallel Grouped Aggregates. In pictures this looks like:

            ▲                        ▲
            │                        │
            │                        │
            │                        │
┌───────────────────────┐  ┌───────────────────┐
│        GroupBy        │  │      GroupBy      │      Step 4
│        (Final)        │  │      (Final)      │
└───────────────────────┘  └───────────────────┘
            ▲                        ▲
            │                        │
            └────────────┬───────────┘
                         │
                         │
            ┌─────────────────────────┐
            │       Repartition       │               Step 3
            │         HASH(x)         │
            └─────────────────────────┘
                         ▲
                         │
            ┌────────────┴──────────┐
            │                       │
            │                       │
 ┌────────────────────┐  ┌─────────────────────┐
 │      GroupBy       │  │       GroupBy       │      Step 2
 │     (Partial)      │  │      (Partial)      │
 └────────────────────┘  └─────────────────────┘
            ▲                       ▲
         ┌──┘                       └─┐
         │                            │
    .─────────.                  .─────────.
 ,─'           '─.            ,─'           '─.
;      Input      :          ;      Input      :      Step 1
:    Stream 1     ;          :    Stream 2     ;
 ╲               ╱            ╲               ╱
  '─.         ,─'              '─.         ,─'
     `───────'                    `───────'

Figure 2: Two phase repartitioned grouping: data flows from bottom (source) to top (results) in two phases. First (Steps 1 and 2), each core reads the data into a core-specific hash table, computing intermediate aggregates without any cross-core coordination. Then (Steps 3 and 4) DataFusion divides the data (“repartitions”) into distinct subsets by group value, and each subset is sent to a specific core which computes the final aggregate.

The two phases are critical for keeping cores busy in a multi-core system. Both phases use the same hash table approach (explained in the next section), but differ in how the groups are distributed and the partial results emitted from the accumulators. The first phase aggregates data as soon as possible after it is produced. However, as shown in Figure 2, the groups can be anywhere in any input, so the same group is often found on many different cores. The second phase uses a hash function to redistribute data evenly across the cores, so each group value is processed by exactly one core which emits the final results for that group.

    ┌─────┐    ┌─────┐
    │  1  │    │  3  │
    │  2  │    │  4  │   2. After Repartitioning: each
    └─────┘    └─────┘   group key appears in exactly
    ┌─────┐    ┌─────┐   one partition
    │  1  │    │  3  │
    │  2  │    │  4  │
    └─────┘    └─────┘

─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─

    ┌─────┐    ┌─────┐
    │  2  │    │  2  │
    │  1  │    │  2  │
    │  3  │    │  3  │
    │  4  │    │  1  │
    └─────┘    └─────┘    1. Input Stream: group
      ...        ...      values are spread
    ┌─────┐    ┌─────┐    arbitrarily over each input
    │  1  │    │  4  │
    │  4  │    │  3  │
    │  1  │    │  1  │
    │  4  │    │  3  │
    │  3  │    │  2  │
    │  2  │    │  2  │
    │  2  │    └─────┘
    └─────┘

    Core A      Core B

Figure 3: Group value distribution across 2 cores during aggregation phases. In the first phase, every group value 1, 2, 3, 4, is present in the input stream processed by each core. In the second phase, after repartitioning, the group values 1 and 2 are processed by core A, and values 3 and 4 are processed only by core B.
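
The two phases can be sketched in a few lines of Python. This is a minimal illustration of the data flow in Figures 2 and 3, not DataFusion's implementation: the function names (`partial_aggregate`, `repartition`, `final_aggregate`) are hypothetical, the aggregate is a simple COUNT, and the "cores" are just list entries processed one after another.

```python
from collections import Counter

def partial_aggregate(stream):
    """Phase 1 (Step 2): per-core COUNT using a core-local hash table."""
    return Counter(stream)

def repartition(partials, num_partitions):
    """Step 3: route each (group, partial count) pair by the group's hash,
    so every group lands in exactly one output partition."""
    partitions = [[] for _ in range(num_partitions)]
    for partial in partials:
        for group, count in partial.items():
            partitions[hash(group) % num_partitions].append((group, count))
    return partitions

def final_aggregate(partition):
    """Phase 2 (Step 4): combine the partial counts for each group."""
    totals = Counter()
    for group, count in partition:
        totals[group] += count
    return dict(totals)

# Input streams taken from Figure 3 (Core A and Core B)
core_a = [2, 1, 3, 4, 1, 4, 1, 4, 3, 2, 2]
core_b = [2, 2, 3, 1, 4, 3, 1, 3, 2, 2]

partials = [partial_aggregate(s) for s in (core_a, core_b)]
finals = [final_aggregate(p) for p in repartition(partials, 2)]
```

Every group appears in both entries of `partials`, but in only one entry of `finals`. Which groups end up in which partition depends on the hash function, so the split in this sketch need not match the one drawn in Figure 3.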

There are some additional subtleties in the DataFusion implementation not mentioned above due to space constraints, such as:

The policy of when to emit data from the first phase’s hash table (e.g. because the data is partially sorted)
Handling specific filters per aggregate (due to the FILTER SQL clause)
Data types of intermediate values (which may not be the same as the final output for some aggregates such as AVG).
Action taken when memory use exceeds its budget.

Hash grouping¶

DataFusion queries can compute many different aggregate functions for each group, both built-in and user-defined AggregateUDFs. The state for each aggregate function, called an accumulator, is tracked with a hash table (DataFusion uses the excellent HashBrown RawTable API), which logically stores the “index” identifying the specific group value.

Hash grouping in `27.0.0`¶

As shown in Figure 4, DataFusion 27.0.0 stores the data in a GroupState structure which, unsurprisingly, tracks the state for each group. The state for each group consists of:

The actual value of the group columns, in Arrow Row format.
In-progress accumulations (e.g. the running counts for the COUNT aggregate) for each group, in one of two possible formats (Accumulator or RowAccumulator).
Scratch space for tracking which rows match each aggregate in each batch.

                           ┌──────────────────────────────────────┐
                           │                                      │
                           │                  ...                 │
                           │ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ │
                           │ ┃                                  ┃ │
    ┌─────────┐            │ ┃ ┌──────────────────────────────┐ ┃ │
    │         │            │ ┃ │group values: OwnedRow        │ ┃ │
    │ ┌─────┐ │            │ ┃ └──────────────────────────────┘ ┃ │
    │ │  5  │ │            │ ┃ ┌──────────────────────────────┐ ┃ │
    │ ├─────┤ │            │ ┃ │Row accumulator:              │ ┃ │
    │ │  9  │─┼────┐       │ ┃ │Vec<u8>                       │ ┃ │
    │ ├─────┤ │    │       │ ┃ └──────────────────────────────┘ ┃ │
    │ │ ... │ │    │       │ ┃ ┌──────────────────────┐         ┃ │
    │ ├─────┤ │    │       │ ┃ │┌──────────────┐      │         ┃ │
    │ │  1  │ │    │       │ ┃ ││Accumulator 1 │      │         ┃ │
    │ ├─────┤ │    │       │ ┃ │└──────────────┘      │         ┃ │
    │ │ ... │ │    │       │ ┃ │┌──────────────┐      │         ┃ │
    │ └─────┘ │    │       │ ┃ ││Accumulator 2 │      │         ┃ │
    │         │    │       │ ┃ │└──────────────┘      │         ┃ │
    └─────────┘    │       │ ┃ │ Box<dyn Accumulator> │         ┃ │
    Hash Table     │       │ ┃ └──────────────────────┘         ┃ │
                   │       │ ┃ ┌─────────────────────────┐      ┃ │
                   │       │ ┃ │scratch indices: Vec<u32>│      ┃ │
                   │       │ ┃ └─────────────────────────┘      ┃ │
                   │       │ ┃ GroupState                       ┃ │
                   └─────▶ │ ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛ │
                           │                                      │
  Hash table tracks an     │                 ...                  │
  index into group_states  │                                      │
                           └──────────────────────────────────────┘
                           group_states: Vec<GroupState>

                           There is one GroupState PER GROUP

Figure 4: Hash group operator structure in DataFusion 27.0.0. A hash table maps each group to a GroupState which contains all the per-group states.

To compute the aggregate, DataFusion performs the following steps for each input batch:

Calculate hash using efficient vectorized code, specialized for each data type.
Determine group indexes for each input row using the hash table (creating new entries for newly seen groups).
Update Accumulators for each group that had input rows, assembling the rows into a contiguous range for vectorized accumulator if there are a sufficient number of them.

DataFusion also stores the hash values in the table to avoid potentially costly hash recomputation when resizing the hash table.

This scheme works very well for a relatively small number of distinct groups: all accumulators are efficiently updated with large contiguous batches of rows.

However, this scheme is not ideal for high cardinality grouping due to:

Multiple allocations per group for the group value row format, as well as for the RowAccumulators and each Accumulator. The Accumulator may have additional allocations within it as well.
Non-vectorized updates: Accumulator updates often fall back to a slower non-vectorized form because the number of distinct groups is large (and thus number of values per group is small) in each input batch.
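
To make the per-group layout concrete, here is a minimal Python sketch of a 27.0.0-style structure for a single SUM aggregate. The class and field names are hypothetical stand-ins for the real Rust types, and a Python dict plays the role of the hash table. The point is only that every newly seen group creates its own state object, and that each group's rows are gathered before its accumulator is updated.

```python
class GroupState:
    """One instance PER GROUP, loosely mirroring Figure 4 (hypothetical names)."""
    def __init__(self, group_value):
        self.group_value = group_value   # stands in for the OwnedRow
        self.accumulator = 0             # running SUM for this group
        self.scratch_indices = []        # rows of the current batch in this group

def update_batch(table, group_states, keys, values):
    """Process one input batch given as parallel key and value columns."""
    touched = []
    # Find (or create) the group index for every input row
    for row, key in enumerate(keys):
        idx = table.get(key)
        if idx is None:
            idx = len(group_states)
            table[key] = idx
            group_states.append(GroupState(key))  # new allocation per group
        state = group_states[idx]
        if not state.scratch_indices:
            touched.append(idx)
        state.scratch_indices.append(row)
    # Update each touched group's accumulator from its gathered rows
    for idx in touched:
        state = group_states[idx]
        state.accumulator += sum(values[r] for r in state.scratch_indices)
        state.scratch_indices.clear()

table, group_states = {}, []
update_batch(table, group_states, ["a", "b", "a"], [1, 2, 3])
update_batch(table, group_states, ["b", "c"], [10, 20])
sums = {s.group_value: s.accumulator for s in group_states}
```

With many distinct groups, most batches contribute only one or two rows per group, so the second loop degenerates into many tiny updates spread across many separately allocated objects.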

Hash grouping in `28.0.0`¶

For 28.0.0, we rewrote the core group by implementation following traditional system optimization principles: fewer allocations, type specialization, and aggressive vectorization.

DataFusion 28.0.0 uses the same RawTable and still stores group indexes. The major differences, as shown in Figure 5, are:

Group values are stored either:
1. Inline in the RawTable (for single columns of primitive types), where the conversion to Row format costs more than its benefit, or
2. In a separate Rows structure with a single contiguous allocation for all group values, rather than an allocation per group.

Accumulators manage the state for all the groups internally, so the code to update intermediate values is a tight, type-specialized loop. The new GroupsAccumulator interface results in highly efficient, type-specialized accumulator update loops.

┌───────────────────────────────────┐     ┌───────────────────────┐
│ ┌ ─ ─ ─ ─ ─ ┐  ┌─────────────────┐│     │ ┏━━━━━━━━━━━━━━━━━━━┓ │
│                │                 ││     │ ┃  ┌──────────────┐ ┃ │
│ │           │  │ ┌ ─ ─ ┐┌─────┐  ││     │ ┃  │┌───────────┐ │ ┃ │
│                │    X   │  5  │  ││     │ ┃  ││  value1   │ │ ┃ │
│ │           │  │ ├ ─ ─ ┤├─────┤  ││     │ ┃  │└───────────┘ │ ┃ │
│                │    Q   │  9  │──┼┼──┐  │ ┃  │     ...      │ ┃ │
│ │           │  │ ├ ─ ─ ┤├─────┤  ││  └──┼─╋─▶│              │ ┃ │
│                │   ...  │ ... │  ││     │ ┃  │┌───────────┐ │ ┃ │
│ │           │  │ ├ ─ ─ ┤├─────┤  ││     │ ┃  ││  valueN   │ │ ┃ │
│                │    H   │  1  │  ││     │ ┃  │└───────────┘ │ ┃ │
│ │           │  │ ├ ─ ─ ┤├─────┤  ││     │ ┃  │values: Vec<T>│ ┃ │
│     Rows       │   ...  │ ... │  ││     │ ┃  └──────────────┘ ┃ │
│ │           │  │ └ ─ ─ ┘└─────┘  ││     │ ┃                   ┃ │
│  ─ ─ ─ ─ ─ ─   │                 ││     │ ┃ GroupsAccumulator ┃ │
│                └─────────────────┘│     │ ┗━━━━━━━━━━━━━━━━━━━┛ │
│                  Hash Table       │     │                       │
│                                   │     │          ...          │
└───────────────────────────────────┘     └───────────────────────┘
  GroupState                               Accumulators


Hash table value stores group_indexes     One  GroupsAccumulator
and group values.                         per aggregate. Each
                                          stores the state for
Group values are stored either inline     *ALL* groups, typically
in the hash table or in a single          using a native Vec<T>
allocation using the arrow Row format

Figure 5: Hash group operator structure in DataFusion 28.0.0. Group values are stored either directly in the hash table, or in a single allocation using the arrow Row format. The hash table contains group indexes. A single GroupsAccumulator stores the per-aggregate state for all groups.

This new structure improves performance significantly for high cardinality groups due to:

Reduced allocations: There are no longer any individual allocations per group.
Contiguous native accumulator states: Type-specialized accumulators store the values for all groups in a single contiguous allocation using a Rust Vec<T> of some native type.
Vectorized state update: The inner aggregate update loops, which are type-specialized and in terms of native Vecs, are well-vectorized by the Rust compiler (thanks LLVM!).
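
The same SUM aggregate can be sketched in the 28.0.0 style. This is an illustration under stated assumptions rather than the real interface: `SumGroupsAccumulator` and `group_indices_for` are hypothetical names, the method signature is a simplified stand-in loosely modeled on the description above, and a Python list stands in for a Rust `Vec<T>`.

```python
class SumGroupsAccumulator:
    """Holds the SUM state for ALL groups in one flat list
    (a hypothetical Python stand-in for a Rust Vec<T>)."""
    def __init__(self):
        self.sums = []

    def update_batch(self, values, group_indices, total_num_groups):
        # Grow the state once per batch, not once per group
        self.sums.extend([0] * (total_num_groups - len(self.sums)))
        # Tight loop over plain values and plain indexes
        for value, group_index in zip(values, group_indices):
            self.sums[group_index] += value

def group_indices_for(table, keys):
    """Map each key to a dense group index, assigning new ones as needed."""
    return [table.setdefault(key, len(table)) for key in keys]

table, acc = {}, SumGroupsAccumulator()
batches = [(["a", "b", "a"], [1, 2, 3]), (["b", "c"], [10, 20])]
for keys, values in batches:
    indices = group_indices_for(table, keys)
    acc.update_batch(values, indices, len(table))
```

There is no per-group object at all: the hash table only hands out indexes, and the accumulator's inner loop touches nothing but two input columns and one output list. In Rust, a loop of this shape over native types is what the compiler can vectorize.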

Notes¶

Some vectorized grouping implementations store the accumulator state row-wise directly in the hash table, which often uses modern CPU caches efficiently. Managing accumulator state in columnar fashion may sacrifice some cache locality; however, it ensures the size of the hash table remains small, even when there are large numbers of groups and aggregates, and it makes it easier for the compiler to vectorize the accumulator update.

Depending on the cost of recomputing hash values, DataFusion 28.0.0 may or may not store the hash values in the table. This optimizes the tradeoff between the cost of computing the hash value (which is expensive for strings, for example) vs. the cost of storing it in the hash table.

One subtlety that arises from pushing state updates into GroupsAccumulators is that each accumulator must handle similar variations with/without filtering and with/without nulls in the input. DataFusion 28.0.0 uses a templated NullState which encapsulates these common patterns across accumulators.
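
A rough sketch of that idea follows. The shape of the helper is an assumption made for illustration (the class body, the `accumulate` method, and its parameters are not taken from the DataFusion source), and Python's `None` stands in for an Arrow null. The helper owns the "skip nulls, skip filtered rows, remember which groups saw a value" logic, so each accumulator only supplies the per-value update.

```python
class NullState:
    """Tracks which groups have seen at least one non-null, unfiltered value.
    Hypothetical sketch; names and signature are illustrative only."""
    def __init__(self):
        self.seen_values = []

    def accumulate(self, group_indices, values, opt_filter,
                   total_num_groups, value_fn):
        grow_by = total_num_groups - len(self.seen_values)
        self.seen_values.extend([False] * grow_by)
        for row, (group_index, value) in enumerate(zip(group_indices, values)):
            if value is None:
                continue  # null input: ignored by the aggregate
            if opt_filter is not None and not opt_filter[row]:
                continue  # row rejected by the FILTER clause
            self.seen_values[group_index] = True
            value_fn(group_index, value)

sums = [0, 0, 0]

def add(group_index, value):
    sums[group_index] += value

null_state = NullState()
null_state.accumulate(
    group_indices=[0, 1, 0, 2],
    values=[1, None, 3, 4],
    opt_filter=[True, True, True, False],
    total_num_groups=3,
    value_fn=add,
)
# Groups that never saw a value produce NULL rather than 0
results = [s if seen else None for s, seen in zip(sums, null_state.seen_values)]
```

Here group 1 received only a null and group 2 received only a filtered-out row, so both report NULL for SUM while group 0 reports 4.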

The code structure is heavily influenced by the fact DataFusion is implemented using Rust, a new(ish) systems programming language focused on speed and safety. Rust heavily discourages many of the traditional pointer casting “tricks” used in C/C++ hash grouping implementations. The DataFusion aggregation code is almost entirely safe, deviating into unsafe only when necessary. (Rust is a great choice because it makes DataFusion fast, easy to embed, and prevents many crashes and security issues often associated with multi-threaded C/C++ code).

ClickBench results¶

The full results of running the ClickBench queries against the single Parquet file with DataFusion 27.0.0, DataFusion 28.0.0, and DuckDB 0.8.1 are below. These numbers were run on a GCP e2-standard-8 machine with 8 cores and 32 GB of RAM, using the scripts here.

As the industry moves towards data systems assembled from components, it is increasingly important that they exchange data using open standards such as Apache Arrow and Parquet rather than custom storage and in-memory formats. Thus, this benchmark uses a single input Parquet file representative of many DataFusion users and aligned with the current trend in analytics of avoiding a costly load/transformation into a custom storage format prior to query.

DataFusion now reaches near-DuckDB-speeds querying Parquet data. While we don’t plan to engage in a benchmarking shootout with a team that literally wrote Fair Benchmarking Considered Difficult, hopefully everyone can agree that DataFusion 28.0.0 is a significant improvement.

Figure 6: Performance of DataFusion 27.0.0, DataFusion 28.0.0, and DuckDB 0.8.1 on all 43 ClickBench queries against a single hits.parquet file. Lower is better.

Notes¶

DataFusion 27.0.0 was not able to run several queries due to either planner bugs (Q9, Q11, Q12, Q14) or running out of memory (Q33). DataFusion 28.0.0 solves those issues.

DataFusion is faster than DuckDB for queries 21 and 22, likely due to optimized implementations of string pattern matching.

Conclusion: performance matters¶

Improving aggregation performance by more than a factor of two allows developers building products and projects with DataFusion to spend more time on value-added domain specific features. We believe building systems with DataFusion is much faster than trying to build something similar from scratch. DataFusion increases productivity because it eliminates the need to rebuild well-understood, but costly to implement, analytic database technology. While we’re pleased with the improvements in DataFusion 28.0.0, we are by no means done and are pursuing (Even More) Aggregation Performance. The future for performance is bright.

Acknowledgments¶

DataFusion is a community effort and this work was not possible without contributions from many in the community. A special shout out to sunchao, yjshen, yahoNanJing, mingmwang, ozankabak, mustafasrepo, and everyone else who contributed ideas, reviews, and encouragement during this work.

About DataFusion¶

Apache Arrow DataFusion is an extensible query engine and database toolkit, written in Rust, that uses Apache Arrow as its in-memory format. DataFusion, along with Apache Calcite, Facebook’s Velox, and similar technology are part of the next generation “Deconstructed Database” architectures, where new systems are built on a foundation of fast, modular components, rather than as a single tightly integrated system.

Notes¶

[^1]: SELECT COUNT(*) FROM 'hits.parquet';

[^2]: SELECT COUNT(DISTINCT "UserID") as num_users FROM 'hits.parquet';

[^3]: SELECT COUNT(DISTINCT "SearchPhrase") as num_phrases FROM 'hits.parquet';

[^4]: SELECT COUNT(*) FROM (SELECT DISTINCT "UserID", "SearchPhrase" FROM 'hits.parquet')

[^5]: Full script at hash.py

[^6]: hits_0.parquet, one of the files from the partitioned ClickBench dataset, which has 100,000 rows and is 117 MB in size. The entire dataset has 100,000,000 rows in a single 14 GB Parquet file. The script did not complete on the entire dataset after 40 minutes, and used 212 GB RAM at peak.

Apache Arrow DataFusion 26.0.0

2023-06-24T00:00:00+00:00

It has been a whirlwind 6 months of DataFusion development since our last update: the community has grown, many features have been added, performance improved and we are discussing branching out to our own top level Apache Project.

Background¶

Apache Arrow DataFusion is an extensible query engine and database toolkit, written in Rust, that uses Apache Arrow as its in-memory format.

DataFusion, along with Apache Calcite, Facebook's Velox and similar technology are part of the next generation "Deconstructed Database" architectures, where new systems are built on a foundation of fast, modular components, rather than as a single tightly integrated system.

While single tightly integrated systems such as Spark, DuckDB and Pola.rs are great pieces of technology, our community believes that anyone developing new data heavy applications, such as those common in machine learning, will require a high performance, vectorized query engine in the next 5 years to remain relevant. The only practical way to gain access to such technology without investing many millions of dollars to build a new tightly integrated engine is through open source projects like DataFusion and similar enabling technologies such as Apache Arrow and Rust.

DataFusion is targeted primarily at developers creating other data intensive analytics, and offers:

High performance, native, parallel streaming execution engine
Mature SQL support, featuring subqueries, window functions, grouping sets, and more
Built in support for Parquet, Avro, CSV, JSON and Arrow formats and easy extension for others
Native DataFrame API and python bindings
Well documented source code and architecture, designed to be customized to suit downstream project needs
High quality, easy to use code released every 2 weeks to crates.io
Welcoming, open community, governed by the highly regarded and well understood Apache Software Foundation

The rest of this post highlights some of the improvements we have made to DataFusion over the last 6 months and a preview of where we are heading. You can see a list of all changes in the detailed CHANGELOG.

(Even) Better Performance¶

Various benchmarks show DataFusion to be quite close to, or even faster than, the state of the art in analytic performance (at the moment this seems to be DuckDB). We continually work on improving performance (see #5546 for a list) and would love additional help in this area.

DataFusion now reads single large Parquet files significantly faster by parallelizing across multiple cores. Native speeds for reading JSON and CSV files are also up to 2.5x faster thanks to improvements upstream in arrow-rs JSON reader and CSV reader.

Also, we have integrated the arrow-rs Row Format into DataFusion resulting in up to 2-3x faster sorting and merging.

Improved Documentation and Website¶

Part of growing the DataFusion community is ensuring that DataFusion's features are understood and that it is easy to contribute and participate. To that end the website has been cleaned up, the architecture guide expanded, the roadmap updated, and several overview talks created:

April 2023 Query Engine: recording and slides
April 2023 Logical Plan and Expressions: recording and slides
April 2023 Physical Plan and Execution: recording and slides

New Features¶

More Streaming, Less Memory¶

We have made significant progress on the streaming execution roadmap such as unbounded datasources, streaming group by, sophisticated sort and repartitioning improvements in the optimizer, and support for symmetric hash join (read more about that in the great Synnada Blog Post on the topic). Together, these features both 1) make it easier to build streaming systems using DataFusion that can incrementally generate output before (or ever) seeing the end of the input and 2) allow general queries to use less memory and generate their results faster.

We have also improved the runtime memory management system so that DataFusion now either stays within its declared memory budget or generates a runtime error.

DML Support (`INSERT`, `DELETE`, `UPDATE`, etc)¶

Part of building high performance data systems includes writing data, and DataFusion supports several features for creating new files:

INSERT INTO and SELECT ... INTO support for memory backed and CSV tables
New API for writing data into TableProviders

We are working on easier to use COPY INTO syntax, better support for writing parquet, JSON, and AVRO, and more -- see our tracking epic for more details.

Timestamp and Intervals¶

One mark of the maturity of a SQL engine is how it handles the tricky world of timestamp, date, times and interval arithmetic. DataFusion is feature complete in this area and behaves as you would expect, supporting queries such as

SELECT now() + '1 month' FROM my_table;

We still have a long tail of date and time improvements, which we are working on as well.

Querying Structured Types (`List` and `Struct`s)¶

Arrow and Parquet support nested data well, and DataFusion lets you easily query such Struct and List columns. For example, you can use DataFusion to read and query the JSON Datasets for Exploratory OLAP - Mendeley Data like this:

----------
-- Explore structured data using SQL
----------
SELECT delete FROM 'twitter-sample-head-100000.parquet' WHERE delete IS NOT NULL limit 10;
+---------------------------------------------------------------------------------------------------------------------------+
| delete                                                                                                                    |
+---------------------------------------------------------------------------------------------------------------------------+
| {status: {id: {$numberLong: 135037425050320896}, id_str: 135037425050320896, user_id: 334902461, user_id_str: 334902461}} |
| {status: {id: {$numberLong: 134703982051463168}, id_str: 134703982051463168, user_id: 405383453, user_id_str: 405383453}} |
| {status: {id: {$numberLong: 134773741740765184}, id_str: 134773741740765184, user_id: 64823441, user_id_str: 64823441}}   |
| {status: {id: {$numberLong: 132543659655704576}, id_str: 132543659655704576, user_id: 45917834, user_id_str: 45917834}}   |
| {status: {id: {$numberLong: 133786431926697984}, id_str: 133786431926697984, user_id: 67229952, user_id_str: 67229952}}   |
| {status: {id: {$numberLong: 134619093570560002}, id_str: 134619093570560002, user_id: 182430773, user_id_str: 182430773}} |
| {status: {id: {$numberLong: 134019857527214080}, id_str: 134019857527214080, user_id: 257396311, user_id_str: 257396311}} |
| {status: {id: {$numberLong: 133931546469076993}, id_str: 133931546469076993, user_id: 124539548, user_id_str: 124539548}} |
| {status: {id: {$numberLong: 134397743350296576}, id_str: 134397743350296576, user_id: 139836391, user_id_str: 139836391}} |
| {status: {id: {$numberLong: 127833661767823360}, id_str: 127833661767823360, user_id: 244442687, user_id_str: 244442687}} |
+---------------------------------------------------------------------------------------------------------------------------+

----------
-- Select some deeply nested fields
----------
SELECT
  delete['status']['id']['$numberLong'] as delete_id,
  delete['status']['user_id'] as delete_user_id
FROM 'twitter-sample-head-100000.parquet' WHERE delete IS NOT NULL LIMIT 10;

+--------------------+----------------+
| delete_id          | delete_user_id |
+--------------------+----------------+
| 135037425050320896 | 334902461      |
| 134703982051463168 | 405383453      |
| 134773741740765184 | 64823441       |
| 132543659655704576 | 45917834       |
| 133786431926697984 | 67229952       |
| 134619093570560002 | 182430773      |
| 134019857527214080 | 257396311      |
| 133931546469076993 | 124539548      |
| 134397743350296576 | 139836391      |
| 127833661767823360 | 244442687      |
+--------------------+----------------+

Subqueries All the Way Down¶

DataFusion can run many different subqueries by rewriting them to joins. It has been able to run the full suite of TPC-H queries for at least the last year, but recently we have implemented significant improvements to this logic, sufficient to run almost all queries in the TPC-DS benchmark as well.
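
The equivalence that such a rewrite relies on can be checked with a small example. The sketch below uses Python's built-in `sqlite3` module purely as a convenient SQL engine; it does not show DataFusion's optimizer, and the `orders` and `customers` tables are made up for illustration. It runs an `IN` subquery and a hand-written join form of the same query and compares the results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, customer_id INTEGER);
    CREATE TABLE customers (id INTEGER, country TEXT);
    INSERT INTO orders VALUES (1, 10), (2, 20), (3, 10), (4, 30);
    INSERT INTO customers VALUES (10, 'NL'), (20, 'US'), (30, 'NL');
""")

# Original form: a subquery in the WHERE clause
subquery_form = """
    SELECT o.id FROM orders o
    WHERE o.customer_id IN (SELECT c.id FROM customers c WHERE c.country = 'NL')
    ORDER BY o.id
"""

# Rewritten form: a join against the de-duplicated subquery result
join_form = """
    SELECT o.id FROM orders o
    JOIN (SELECT DISTINCT c.id FROM customers c WHERE c.country = 'NL') nl
      ON o.customer_id = nl.id
    ORDER BY o.id
"""

subquery_rows = conn.execute(subquery_form).fetchall()
join_rows = conn.execute(join_form).fetchall()
```

The `DISTINCT` matters: without it, a customer id that appeared twice in the subquery result would duplicate the matching orders. A semi join gives the same guarantee without the explicit de-duplication.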

Community and Project Growth¶

The six months since our last update saw significant growth in the DataFusion community. Between versions 17.0.0 and 26.0.0, DataFusion merged 711 PRs from 107 distinct contributors, not including all the work that goes into our core dependencies such as arrow, parquet, and object_store, that much of the same community helps support.

In addition, we have added 7 new committers and 1 new PMC member to the Apache Arrow project, largely focused on DataFusion, and we learned about some of the cool new systems which are using DataFusion. Given the growth of the community and interest in the project, we also clarified the mission statement and are discussing graduating DataFusion to a new top level Apache Software Foundation project.

How to Get Involved¶

Kudos to everyone in the community who has contributed ideas, discussions, bug reports, documentation and code. It is exciting to be innovating on the next generation of database architectures together!

If you are interested in contributing to DataFusion, we would love to have you join us. You can try out DataFusion on some of your own data and projects and let us know how it goes or contribute a PR with documentation, tests or code. A list of open issues suitable for beginners is here.

Check out our Communication Doc for more ways to engage with the community.

Apache Arrow DataFusion 16.0.0 Project Update

2023-01-19T00:00:00+00:00

Introduction¶

DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format. It is targeted primarily at developers creating data intensive analytics, and offers mature SQL support, a DataFrame API, and many extension points.

Systems based on DataFusion perform very well in benchmarks, especially considering they operate directly on parquet files rather than first loading into a specialized format. Some recent highlights include clickbench and the Cloudfuse.io standalone query engines page.

DataFusion is also part of a longer term trend, articulated clearly by Andy Pavlo in his 2022 Databases Retrospective. Database frameworks are proliferating and it is likely that all OLAP DBMSs and other data heavy applications, such as machine learning, will require a vectorized, highly performant query engine in the next 5 years to remain relevant. The only practical way to make such technology so widely available without many millions of dollars of investment is through open source engines such as DataFusion or Velox.

The rest of this post describes the improvements made to DataFusion over the last three months and some hints of where we are heading.

Community Growth¶

We again saw significant growth in the DataFusion community since our last update. There are some interesting metrics on OSSRank.

The DataFusion 16.0.0 release consists of 543 PRs from 73 distinct contributors, not including all the work that goes into dependencies such as arrow, parquet, and object_store, that much of the same community helps support. Thank you all for your help!

Several new systems based on DataFusion were recently added:

Performance 🚀¶

Performance and efficiency are core values for DataFusion. While there is still a gap between DataFusion and the best of breed, tightly integrated systems such as DuckDB and Polars, DataFusion is closing the gap quickly. Performance highlights from the last three months:

Up to 30% Faster Sorting and Merging using the new Row Format
Advanced predicate pushdown, directly on parquet, directly from object storage, enabling sub millisecond filtering.
70% faster IN expressions evaluation (#4057)
Sort and partition aware optimizations (#3969 and #4691)
Filter selectivity analysis (#3868)

Runtime Resource Limits¶

Previously, DataFusion could potentially use unbounded amounts of memory for certain queries that included Sorts, Grouping or Joins.

In version 16.0.0, it is possible to limit DataFusion's memory usage for Sorting and Grouping. We are looking for help adding similar limiting for Joins as well as expanding our algorithms to optionally spill to secondary storage. See #3941 for more detail.
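
The general pattern behind such a limit is to reserve memory from a shared budget before buffering data, and to fail the query when the reservation cannot be granted. The following Python sketch shows that pattern only; the `MemoryPool` class, its methods, the `ResourcesExhausted` error, and the fixed `bytes_per_row` estimate are all hypothetical and are not DataFusion's API.

```python
class ResourcesExhausted(Exception):
    """Raised when a reservation would exceed the budget."""

class MemoryPool:
    """A fixed memory budget shared by operators (hypothetical sketch)."""
    def __init__(self, limit_bytes):
        self.limit_bytes = limit_bytes
        self.used_bytes = 0

    def try_grow(self, nbytes):
        if self.used_bytes + nbytes > self.limit_bytes:
            raise ResourcesExhausted(
                f"cannot reserve {nbytes} bytes: "
                f"{self.used_bytes} of {self.limit_bytes} already used"
            )
        self.used_bytes += nbytes

    def shrink(self, nbytes):
        self.used_bytes = max(0, self.used_bytes - nbytes)

def buffer_for_sort(pool, batches, bytes_per_row=8):
    """Buffer all batches for a sort, reserving memory before keeping each one."""
    buffered = []
    for batch in batches:
        pool.try_grow(len(batch) * bytes_per_row)
        buffered.append(batch)
    return sorted(row for batch in buffered for row in batch)

pool = MemoryPool(limit_bytes=64)
result = buffer_for_sort(pool, [[3, 1], [2]])   # reserves 24 bytes
try:
    buffer_for_sort(pool, [list(range(100))])   # would need 800 bytes
    failed = False
except ResourcesExhausted:
    failed = True
```

An operator that can spill to secondary storage would catch the failed reservation, write its buffered data out, call `shrink`, and continue instead of surfacing the error.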

SQL Window Functions¶

SQL Window Functions are useful for a variety of analyses, and DataFusion's support for them expanded significantly:

Custom window frames such as ... OVER (ORDER BY ... RANGE BETWEEN 0.2 PRECEDING AND 0.2 FOLLOWING)
Unbounded window frames such as ... OVER (ORDER BY ... ROWS UNBOUNDED PRECEDING)
Support for the NTILE window function (#4676)
Support for GROUPS mode (#4155)
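
To illustrate what the two kinds of frames compute, here is a small reference sketch in plain Python. It is a naive implementation written for clarity, not how DataFusion evaluates window functions, and the function names and sample data are invented for the example. Rows are `(k, v)` pairs, ordered by `k`, and the aggregate is `SUM(v)`.

```python
def range_frame_sum(rows, preceding, following):
    """SUM(v) OVER (ORDER BY k RANGE BETWEEN <preceding> PRECEDING
    AND <following> FOLLOWING), computed naively in O(n^2)."""
    ordered = sorted(rows)
    return [
        (k, sum(v2 for k2, v2 in ordered if k - preceding <= k2 <= k + following))
        for k, _ in ordered
    ]

def running_sum(rows):
    """SUM(v) OVER (ORDER BY k ROWS UNBOUNDED PRECEDING)."""
    total, out = 0, []
    for k, v in sorted(rows):
        total += v
        out.append((k, total))
    return out

rows = [(1.0, 10), (1.1, 20), (1.5, 30), (2.0, 40)]
custom = range_frame_sum(rows, 0.2, 0.2)
unbounded = running_sum(rows)
```

A RANGE frame is defined by the values of the ORDER BY column, so the rows with `k` of 1.0 and 1.1 fall in each other's frames while 1.5 and 2.0 stand alone. The unbounded frame simply grows by one row at a time.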

Improved Joins¶

Joins are often the most complicated operations to handle well in analytics systems and DataFusion 16.0.0 offers significant improvements such as

Cost based optimizer (CBO) automatically reorders join evaluations, selects algorithms (Merge / Hash), and picks the build side based on available statistics and join type (INNER, LEFT, etc) (#4219)
Fast non column=column equijoins such as JOIN ON a.x + 5 = b.y
Better performance on non-equijoins (#4562)
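
The reason an expression equijoin such as `a.x + 5 = b.y` can be fast is that it still has a hashable key on each side: evaluate the expression per row, then hash join on the result. The sketch below shows that idea with a hypothetical `hash_join` helper over lists of dicts; it is not DataFusion's join operator.

```python
from collections import defaultdict

def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Inner hash join where each side's key is an arbitrary expression."""
    table = defaultdict(list)
    # Build phase: hash the build side on its evaluated key expression
    for row in build_rows:
        table[build_key(row)].append(row)
    # Probe phase: evaluate the probe expression and look up matches
    return [
        (build_row, probe_row)
        for probe_row in probe_rows
        for build_row in table.get(probe_key(probe_row), [])
    ]

a = [{"x": 1}, {"x": 2}, {"x": 5}]
b = [{"y": 6}, {"y": 7}, {"y": 8}]

# JOIN ON a.x + 5 = b.y
joined = hash_join(a, b, lambda r: r["x"] + 5, lambda r: r["y"])
```

Without this treatment, the condition would have to be evaluated as a filter over every pair of rows, which is the much slower non-equijoin path.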

Streaming Execution¶

One emerging use case for DataFusion is as a foundation for streaming-first data platforms. An important prerequisite is support for incremental execution for queries that can be computed incrementally.

With this release, DataFusion now supports the following streaming features:

Data ingestion from infinite files such as FIFOs (#4694),
Detection of pipeline-breaking queries in streaming use cases (#4694),
Automatic input swapping for joins so probe side is a data stream (#4694),
Intelligent elision of pipeline-breaking sort operations whenever possible (#4691),
Incremental execution for more types of queries; e.g. queries involving finite window frames (#4777).

These are major steps forward, and we plan even more improvements over the next few releases.

Better Support for Distributed Catalogs¶

16.0.0 has enhanced support for asynchronous catalogs (#4607) to better support distributed metadata stores such as Delta.io and Apache Iceberg, which require asynchronous I/O during planning to access remote catalogs. Previously, DataFusion required synchronous access to all relevant catalog information.

Additional SQL Support¶

SQL support continues to improve, including some of these highlights:

Add TPC-DS query planning regression tests #4719
Support for PREPARE statement #4490
Automatic coercions between Date and Timestamp #4726
Support type coercion for timestamp and utf8 #4312
Full support for time32 and time64 literal values (ScalarValue) #4156
New functions, including uuid() #4041, current_time #4054, current_date #4022
Compressed CSV/JSON support #3642

The community has also invested in new sqllogic based tests to keep improving DataFusion's quality with less effort.

Plan Serialization and Substrait¶

DataFusion now supports serialization of physical plans, with a custom protocol buffers format. In addition, we are adding initial support for Substrait, a Cross-Language Serialization for Relational Algebra.

How to Get Involved¶

Kudos to everyone in the community who contributed ideas, discussions, bug reports, documentation and code. It is exciting to be building something so cool together!

Check out our Communication Doc for more ways to engage with the community.

Appendix: Contributor Shoutout¶

Here is a list of people who have contributed PRs to this project over the last three releases, derived from git shortlog -sn 13.0.0..16.0.0 . Thank you all!

   113  Andrew Lamb
    58  jakevin
    46  Raphael Taylor-Davies
    30  Andy Grove
    19  Batuhan Taskaya
    19  Remzi Yang
    17  ygf11
    16  Burak
    16  Jeffrey
    16  Marco Neumann
    14  Kun Liu
    12  Yang Jiang
    10  mingmwang
     9  Daniël Heres
     9  Mustafa akur
     9  comphead
     9  mvanschellebeeck
     9  xudong.w
     7  dependabot[bot]
     7  yahoNanJing
     6  Brent Gardner
     5  AssHero
     4  Jiayu Liu
     4  Wei-Ting Kuo
     4  askoa
     3  André Calado Coroado
     3  Jie Han
     3  Jon Mease
     3  Metehan Yıldırım
     3  Nga Tran
     3  Ruihang Xia
     3  baishen
     2  Berkay Şahin
     2  Dan Harris
     2  Dongyan Zhou
     2  Eduard Karacharov
     2  Kikkon
     2  Liang-Chi Hsieh
     2  Marko Milenković
     2  Martin Grigorov
     2  Roman Nozdrin
     2  Tim Van Wassenhove
     2  r.4ntix
     2  unconsolable
     2  unvalley
     1  Ajaya Agrawal
     1  Alexander Spies
     1  ArkashaJavelin
     1  Artjoms Iskovs
     1  BoredPerson
     1  Christian Salvati
     1  Creampanda
     1  Data Psycho
     1  Francis Du
     1  Francis Le Roy
     1  LFC
     1  Marko Grujic
     1  Matt Willian
     1  Matthijs Brobbel
     1  Max Burke
     1  Mehmet Ozan Kabak
     1  Rito Takeuchi
     1  Roman Zeyde
     1  Vrishabh
     1  Zhang Li
     1  ZuoTiJia
     1  byteink
     1  cfraz89
     1  nbr
     1  xxchan
     1  yujie.zhang
     1  zembunia
     1  哇呜哇呜呀咦耶

Apache Arrow Ballista 0.9.0 Release

2022-10-28T00:00:00+00:00

Introduction¶

Ballista is an Arrow-native distributed SQL query engine implemented in Rust.

Ballista 0.9.0 is now available and is the most significant release since the project was donated to Apache Arrow in 2021.

This release represents 4 weeks of work, with 66 commits from 14 contributors:

    22  Andy Grove
    12  yahoNanJing
     6  Daniël Heres
     4  Brent Gardner
     4  dependabot[bot]
     4  r.4ntix
     3  Stefan Stanciulescu
     3  mingmwang
     2  Ken Suenobu
     2  Yang Jiang
     1  Metehan Yıldırım
     1  Trent Feda
     1  askoa
     1  yangzhong

Release Highlights¶

The release notes below are not exhaustive and only expose selected highlights of the release. Many other bug fixes and improvements have been made: we refer you to the complete changelog.

Support for Cloud Object Stores and Distributed File Systems¶

This is the first release of Ballista to have documented support for querying data from distributed file systems and object stores. Currently, S3 and HDFS are supported. Support for Google Cloud Storage and Azure Blob Storage is planned for the next release.

Flight SQL & JDBC support¶

The Ballista scheduler now implements the Flight SQL protocol, enabling any compliant Flight SQL client to connect to and run queries against a Ballista cluster.

The Apache Arrow Flight SQL JDBC driver can be used to connect Business Intelligence tools to a Ballista cluster.

Python Bindings¶

It is now possible to connect to a Ballista cluster from Python and execute queries using both the DataFrame and SQL interfaces.

Scheduler Web User Interface and REST API¶

The scheduler now has a web user interface for monitoring queries. It is also possible to view graphical query plans that show how the query was executed, along with metrics.

The REST API that powers the user interface can also be accessed directly.

Simplified Kubernetes Deployment¶

Ballista now provides a Helm chart for simplified Kubernetes deployment.

User Guide¶

The user guide is published at https://arrow.apache.org/ballista/ and provides deployment instructions for Docker, Docker Compose, and Kubernetes, as well as references for configuring and tuning Ballista.

Roadmap¶

The Ballista community is currently focused on the following tasks for the next release:

Support for Azure Blob Storage and Google Cloud Storage
Improve benchmark performance by implementing more query optimizations
Improve scheduler web user interface
Publish Docker images to GitHub Container Registry

The detailed list of issues planned for the 0.10.0 release can be found in the tracking issue.

Getting Involved¶

Ballista has a friendly community and we welcome contributions. A good place to start is to follow the instructions in the user guide, try using Ballista with your own SQL queries and ETL pipelines, and file issues for any bugs or feature suggestions.

Apache Arrow DataFusion 13.0.0 Project Update

2022-10-25T00:00:00+00:00

Introduction¶

Apache Arrow DataFusion 13.0.0 is released, and this blog contains an update on the project for the 5 months since our last update in May 2022.

DataFusion is an extensible and embeddable query engine, written in Rust, used to create modern, fast and efficient data pipelines, ETL processes, and database systems. You may want to check out DataFusion to extend your Rust project to:

Support SQL
Support DataFrame API
Support a Domain Specific Query Language
Easily and quickly read and process Parquet, JSON, Avro or CSV data.
Read from remote object stores such as AWS S3, Azure Blob Storage, GCP.

Even though DataFusion is 4 years "young," it has seen significant community growth in the last few months and the momentum continues to accelerate.

Background¶

DataFusion is used as the engine in many open source and commercial projects and was one of the early open source projects to provide this capability. 2022 has validated our belief in the need for such a "LLVM for database and AI systems" (alternate link) with announcements such as the release of Facebook's Velox engine, the major investments in Acero, as well as the continued popularity of Apache Calcite and other similar technologies.

While Velox and Acero focus on execution engines, DataFusion provides the entire suite of components needed to build most analytic systems, including a SQL frontend, a dataframe API, and extension points for just about everything. Some DataFusion users use a subset of the features, such as the frontend (e.g. dask-sql) or the execution engine (e.g. Blaze), and some use many different components to build both SQL based and customized DSL based systems such as InfluxDB IOx and VegaFusion.

One of DataFusion’s advantages is its implementation in Rust and thus its easy integration with the broader Rust ecosystem. Rust continues to be a major source of benefit, from the ease of parallelization with the high-quality and standardized async ecosystem to its modern dependency management system and wonderful performance.

Summary¶

We have increased the frequency of DataFusion releases to monthly instead of quarterly. This makes it easier for the increasing number of projects that now depend on DataFusion.

We have also completed the "graduation" of Ballista to its own top-level arrow-ballista repository which decouples the two projects and allows each project to move even faster.

Along with numerous other bug fixes and smaller improvements, here are some of the major advances:

Improved Support for Cloud Object Stores¶

DataFusion now supports many major cloud object stores (Amazon S3, Azure Blob Storage, and Google Cloud Storage) "out of the box" via the object_store crate. Using this integration, DataFusion optimizes reading parquet files by reading only the parts of the files that are needed.

Advanced SQL¶

DataFusion now supports correlated subqueries, by rewriting them as joins. See the Subquery page in the User Guide for more information.
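To make the rewrite concrete, here is a minimal sketch that runs a correlated scalar subquery and a hand-written join-based equivalent and checks that they agree. It uses Python's built-in sqlite3 module purely as a convenient SQL engine, not DataFusion, and the customers/orders tables are invented for the example. DataFusion's own rewrite rules are more general than this single case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
    INSERT INTO orders VALUES (1, 10), (1, 30), (2, 5);
""")

# Correlated form: the subquery references c.id from the outer query.
correlated = """
    SELECT c.name
    FROM customers c
    WHERE (SELECT SUM(o.amount) FROM orders o WHERE o.customer_id = c.id) > 20
    ORDER BY c.name
"""

# Decorrelated form: aggregate once per customer, then join on the key.
as_join = """
    SELECT c.name
    FROM customers c
    JOIN (SELECT customer_id, SUM(amount) AS total
          FROM orders GROUP BY customer_id) t
      ON t.customer_id = c.id
    WHERE t.total > 20
    ORDER BY c.name
"""

rows_correlated = conn.execute(correlated).fetchall()
rows_join = conn.execute(as_join).fetchall()
print(rows_correlated == rows_join)
```

The join form computes the aggregate once per group instead of once per outer row, which is the main reason an engine prefers it.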

In addition to numerous other small improvements, the following SQL features are now supported:

ROWS, RANGE, PRECEDING and FOLLOWING in OVER clauses #3570
ROLLUP and CUBE grouping set expressions #2446
SUM DISTINCT aggregate support #2405
IN and NOT IN Subqueries by rewriting them to SEMI / ANTI #2421
Non equality predicates in ON clause of LEFT, RIGHT, and FULL joins #2591
Exact MEDIAN #3009
GROUPING SETS/CUBE/ROLLUP #2716
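For readers unfamiliar with ROLLUP from the list above, the sketch below shows what it means semantically: ROLLUP(a, b) is shorthand for the grouping sets (a, b), (a), and (). It is written in plain Python over an invented list of sales rows and says nothing about how DataFusion plans or executes these queries.

```python
from collections import defaultdict

def rollup_sets(columns):
    """ROLLUP(a, b) expands to the grouping sets (a, b), (a), ()."""
    return [tuple(columns[:n]) for n in range(len(columns), -1, -1)]

def aggregate(rows, columns, measure):
    """SUM(measure) for every grouping set produced by ROLLUP(columns)."""
    result = {}
    for grouping_set in rollup_sets(columns):
        totals = defaultdict(int)
        for row in rows:
            # Columns outside the grouping set are reported as None (SQL NULL).
            key = tuple(row[c] if c in grouping_set else None for c in columns)
            totals[key] += row[measure]
        result.update(totals)
    return result

sales = [
    {"region": "eu", "city": "paris", "amount": 10},
    {"region": "eu", "city": "rome", "amount": 5},
    {"region": "us", "city": "nyc", "amount": 7},
]
totals = aggregate(sales, ["region", "city"], "amount")
for key, value in totals.items():
    print(key, value)
```

The result holds per-city rows, per-region subtotals (city is None), and one grand total (both columns None).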

More DDL Support¶

Just as it is important to query, it is also important to give users the ability to define their data sources. We have added:

CREATE VIEW #2279
DESCRIBE <table> #2642
Custom / Dynamic table provider factories #3311
SHOW CREATE TABLE support for views #2830

Faster Execution¶

Performance is always an important goal for DataFusion, and there are a number of significant new optimizations, such as:

Optimizations of TopK (queries with a LIMIT or OFFSET clause): #3527, #2521
Reduce left/right/full joins to inner join #2750
Convert cross joins to inner joins when possible #3482
Sort preserving SortMergeJoin #2699
Improvements in group by and sort performance #2375
Adaptive regex_replace implementation #3518
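As a rough illustration of why TopK queries (ORDER BY with a LIMIT) can be much cheaper than a full sort, the sketch below keeps only k values in memory using a bounded heap. This is the generic textbook technique in plain Python, not DataFusion's implementation, and the function name is made up for the example.

```python
import heapq
import random

def top_k_smallest(values, k):
    """Return the k smallest values in ascending order using a bounded heap."""
    if k <= 0:
        return []
    heap = []  # max-heap of the k smallest values seen so far, stored negated
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, -v)
        elif v < -heap[0]:
            # v beats the current largest of the kept values: swap it in.
            heapq.heapreplace(heap, -v)
    return sorted(-x for x in heap)

random.seed(0)
data = [random.randint(0, 1_000_000) for _ in range(10_000)]
print(top_k_smallest(data, 5) == sorted(data)[:5])
```

Memory use is proportional to k rather than to the input size, and each input value costs at most one O(log k) heap operation.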

Optimizer Enhancements¶

Internally the optimizer has been significantly enhanced as well.

Casting / coercion now happens during logical planning #3185 #3636
More sophisticated expression analysis and simplification is available

Parquet¶

The parquet reader can now read directly from parquet files on remote object storage #2489 #3051
Experimental support for “predicate pushdown” with late materialization after filtering during the scan (another blog post on this topic is coming soon).
Support reading directly from AWS S3 and other object stores via datafusion-cli #3631

DataType Support¶

Support for TimestampTz #3660
Expanded support for the Decimal type, including IN list and better built in coercion.
Expanded support for date/time manipulation such as date_bin built-in function, timestamp +/- interval, TIME literal values #3010, #3110, #3034
Binary operations (AND, XOR, etc): #3037 #3420
IS TRUE/FALSE and IS [NOT] UNKNOWN #3235, #3246
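To give a feel for what a date_bin style function computes, here is a small sketch in plain Python: it rounds a timestamp down to the start of a fixed-width bin measured from an origin. The argument order and the default origin of 1970-01-01 are assumptions made for this illustration and should be checked against the DataFusion documentation before relying on them.

```python
from datetime import datetime, timedelta

def date_bin(stride, source, origin=datetime(1970, 1, 1)):
    """Round `source` down to the start of its `stride`-wide bin, counted from `origin`."""
    # Dividing two timedeltas with // yields an int and rounds toward negative
    # infinity, so timestamps before the origin also land in the correct bin.
    bins = (source - origin) // stride
    return origin + bins * stride

ts = datetime(2022, 10, 25, 13, 47, 12)
print(date_bin(timedelta(minutes=15), ts))  # 2022-10-25 13:45:00
```

Binning like this is the usual building block for time-series rollups, for example grouping events into 15-minute windows.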

Upcoming Work¶

With the community growing and code accelerating, there is so much great stuff on the horizon. Some features we expect to land in the next few months:

Complete Parquet Pushdown
Additional date/time support
Cost models, Nested Join Optimizations, analysis framework #128, #3843, #3845

Community Growth¶

The DataFusion releases between 9.0.0 and 13.0.0 consist of 433 PRs from 64 distinct contributors. This does not count all the work that goes into our dependencies such as arrow, parquet, and object_store, that much of the same community helps nurture.

How to Get Involved¶

Kudos to everyone in the community who contributed ideas, discussions, bug reports, documentation and code. It is exciting to be building something so cool together!

If you are interested in contributing to DataFusion, we would love to have you join us on our journey to create the most advanced open source query engine. You can try out DataFusion on some of your own data and projects and let us know how it goes or contribute a PR with documentation, tests or code. A list of open issues suitable for beginners is here.

Check out our Communication Doc on more ways to engage with the community.

Appendix: Contributor Shoutout¶

To give a sense of the number of people who contribute to this project regularly, we present for your consideration the following list derived from git shortlog -sn 9.0.0..13.0.0 . Thank you all again!

    87  Andy Grove
    71  Andrew Lamb
    29  Kun Liu
    29  Kirk Mitchener
    17  Wei-Ting Kuo
    14  Yang Jiang
    12  Raphael Taylor-Davies
    11  Batuhan Taskaya
    10  Brent Gardner
    10  Remzi Yang
    10  comphead
    10  xudong.w
     8  AssHero
     7  Ruihang Xia
     6  Dan Harris
     6  Daniël Heres
     6  Ian Alexander Joiner
     6  Mike Roberts
     6  askoa
     4  BaymaxHWY
     4  gorkem
     4  jakevin
     3  George Andronchik
     3  Sarah Yurick
     3  Stuart Carnie
     2  Dalton Modlin
     2  Dmitry Patsura
     2  JasonLi
     2  Jon Mease
     2  Marco Neumann
     2  yahoNanJing
     1  Adilet Sarsembayev
     1  Ayush Dattagupta
     1  Dezhi Wu
     1  Dhamotharan Sritharan
     1  Eduard Karacharov
     1  Francis Du
     1  Harbour Zheng
     1  Ismaël Mejía
     1  Jack Klamer
     1  Jeremy Dyer
     1  Jiayu Liu
     1  Kamil Konior
     1  Liang-Chi Hsieh
     1  Martin Grigorov
     1  Matthijs Brobbel
     1  Mehmet Ozan Kabak
     1  Metehan Yıldırım
     1  Morgan Cassels
     1  Nitish Tiwari
     1  Renjie Liu
     1  Rito Takeuchi
     1  Robert Pack
     1  Thomas Cameron
     1  Vrishabh
     1  Xin Hao
     1  Yijie Shen
     1  byteink
     1  kamille
     1  mateuszkj
     1  nvartolomei
     1  yourenawo
     1  Özgür Akkurt

Apache Arrow DataFusion 8.0.0 Release

2022-05-16T00:00:00+00:00

Introduction¶

DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format.

When you want to extend your Rust project with SQL support, a DataFrame API, or the ability to read and process Parquet, JSON, Avro or CSV data, DataFusion is definitely worth checking out.

DataFusion's SQL, DataFrame, and manual PlanBuilder API let users access a sophisticated query optimizer and execution engine capable of fast, resource efficient, and parallel execution that takes optimal advantage of today's multicore hardware. Being written in Rust means DataFusion can offer both the safety of a dynamic language and the resource efficiency of a compiled language.

The Apache Arrow team is pleased to announce the DataFusion 8.0.0 release (and also the release of version 0.7.0 of the Ballista subproject). This covers 3 months of development work and includes 279 commits from the following 49 distinct contributors.

    39  Andy Grove
    33  Andrew Lamb
    21  DuRipeng
    20  Yijie Shen
    19  Yang Jiang
    17  Raphael Taylor-Davies
    11  Dan Harris
    11  Matthew Turner
    11  yahoNanJing
     9  dependabot[bot]
     8  jakevin
     6  Kun Liu
     5  Jiayu Liu
     4  Daniël Heres
     4  mingmwang
     4  xudong.w
     3  Carol (Nichols || Goulding)
     3  Dmitry Patsura
     3  Eduard Karacharov
     3  Jeremy Dyer
     3  Kaushik
     3  Rich
     3  comphead
     3  gaojun2048
     3  Feynman Han
     2  Jie Han
     2  Jon Mease
     2  Tim Van Wassenhove
     2  Yt
     2  Zhang Li
     2  silence-coding
     1  Alexander Spies
     1  George Andronchik
     1  Guillaume Balaine
     1  Hao Xin
     1  Jiacai Liu
     1  Jörn Horstmann
     1  Liang-Chi Hsieh
     1  Max Burke
     1  NaincyKumariKnoldus
     1  Nga Tran
     1  Patrick More
     1  Pierre Zemb
     1  Remzi Yang
     1  Sergey Melnychuk
     1  Stephen Carman
     1  doki

The following sections highlight some of the changes in this release. Of course, many other bug fixes and improvements have been made and we encourage you to check out the changelog for full details.

Summary¶

DDL Support¶

DDL support has been expanded to include the following commands for creating databases, schemas, and views. This allows DataFusion to be used more effectively from the CLI.

CREATE DATABASE
CREATE VIEW
CREATE SCHEMA
CREATE EXTERNAL TABLE now supports JSON files, IF NOT EXISTS, and partition columns

SQL Support¶

The SQL query planner now supports a number of new SQL features, including:

Subqueries: when used via IN, EXISTS, and as scalars
Grouping Sets: CUBE and ROLLUP grouping sets.
Aggregate functions: approx_percentile, approx_percentile_cont, approx_percentile_cont_with_weight, approx_distinct, approx_median and array
null literals
bitwise operations: for example '|'

There are also many bug fixes and improvements around normalizing identifiers consistently.

We continue our tradition of incrementally releasing support for new features as they are developed. Thus, while the physical plan may not yet support all new features, it gets more complete each release. These changes also make DataFusion an increasingly compelling choice for projects looking for a SQL parser and query planner that can produce optimized logical plans that can be translated to their own execution engine.

Query Execution & Internals¶

There are several notable improvements and new features in the query execution engine:

The ExecutionContext has been renamed to SessionContext and now supports multi-tenancy
The ExecutionPlan trait is no longer async
A new serialization API for serializing plans to bytes (based on protobuf)

In addition, we have added several foundational features to drive even more advanced query processing into DataFusion, focusing on running arbitrary queries larger than available memory, and pushing the envelope for performance of sorting, grouping, and joining even further:

Morsel-Driven Scheduler based on "Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age"
Consolidated object store implementation and integration with parquet decoding
Memory Limited Spilling sort operator
Memory Limited Sort-Merge join operator
High performance JIT code generation for tuple comparisons
Memory efficient Row Format
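The idea behind a memory-limited, spilling sort from the list above can be sketched in a few lines: sort bounded chunks in memory, write each sorted run to disk, then stream a k-way merge over the runs. The code below is a generic external merge sort in plain Python over integers, with invented function names; it is not how DataFusion's operator is implemented.

```python
import heapq
import tempfile

def spill(sorted_run):
    """Write one sorted run to a temporary file and return the open file."""
    f = tempfile.TemporaryFile(mode="w+")
    f.writelines(f"{value}\n" for value in sorted_run)
    f.seek(0)
    return f

def read_run(f):
    """Stream integers back from a spilled run, one line at a time."""
    for line in f:
        yield int(line)

def external_sort(values, max_in_memory):
    """Sort integers while buffering at most `max_in_memory` of them at once."""
    runs, buffer = [], []
    for v in values:
        buffer.append(v)
        if len(buffer) >= max_in_memory:
            runs.append(spill(sorted(buffer)))  # memory limit hit: spill a sorted run
            buffer = []
    streams = [read_run(f) for f in runs] + [iter(sorted(buffer))]
    try:
        # The k-way merge reads each run sequentially, one value at a time.
        yield from heapq.merge(*streams)
    finally:
        for f in runs:
            f.close()

result = list(external_sort([5, 3, 9, 1, 8, 2, 7], max_in_memory=3))
print(result)  # [1, 2, 3, 5, 7, 8, 9]
```

The buffer size bounds memory during the run-building phase, and the merge phase needs only one value per run in memory.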

Improved file support¶

DataFusion now supports JSON, both for reading and writing. There are also new DataFrame methods for writing query results to files in CSV, Parquet, and JSON format.

Ballista¶

Ballista continues to mature and now supports a wider range of operators and expressions. There are also improvements to the scheduler to support UDFs, and there are some robustness improvements, such as cleaning up work directories and persisting session configs to allow schedulers to restart and continue processing in-flight jobs.

Upcoming Work¶

Here are some of the initiatives that the community plans on working on prior to the next release.

There is a proposal to move Ballista to its own top-level arrow-ballista repository to decouple DataFusion and Ballista releases and to allow each project to have documentation better targeted at its particular audience.
We plan on increasing the frequency of DataFusion releases, with monthly releases now instead of quarterly. This is driven by requests from the increasing number of projects that now depend on DataFusion.
There is ongoing work to implement new optimizer rules to rewrite queries containing subquery expressions as joins, to support a wider range of queries.
The new scheduler based on morsel-driven execution will continue to evolve in this next release, with work to refine IO abstractions to improve performance and integration with the new scheduler.
Improved performance for Sort, Grouping and Joins

How to Get Involved¶

If you are interested in contributing to DataFusion, and learning about state-of-the-art query processing, we would love to have you join us on the journey! You can help by trying out DataFusion on some of your own data and projects and letting us know how it goes, or by contributing a PR with documentation, tests or code. A list of open issues suitable for beginners is here.

Check out our new Communication Doc on more ways to engage with the community.

Introducing Apache Arrow DataFusion Contrib

2022-03-21T00:00:00+00:00

Introduction¶

Apache Arrow DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format.

When you want to extend your Rust project with SQL support, a DataFrame API, or the ability to read and process Parquet, JSON, Avro or CSV data, DataFusion is definitely worth checking out. DataFusion's pluggable design makes it particularly easy to build extensions at various points.

DataFusion's SQL, DataFrame, and manual PlanBuilder API let users access a sophisticated query optimizer and execution engine capable of fast, resource efficient, and parallel execution that takes optimal advantage of today's multicore hardware. Being written in Rust means DataFusion can offer both the safety of dynamic languages and the resource efficiency of a compiled language.

The DataFusion team is pleased to announce the creation of the DataFusion-Contrib GitHub organization to support and accelerate other projects. While the core DataFusion library remains under Apache governance, the contrib organization provides a more flexible testing ground for new DataFusion features and a home for DataFusion extensions. With this announcement, we are pleased to introduce the following inaugural DataFusion-Contrib repositories.

DataFusion-Python¶

This project provides Python bindings to the core Rust implementation of DataFusion, which allows users to:

Work with familiar SQL or DataFrame APIs to run queries in a safe, multi-threaded environment, returning results in Python
Create User Defined Functions and User Defined Aggregate Functions for complex operations
Pay no overhead to copy between Python and the underlying Rust execution engine (by way of Apache Arrow arrays)

Upcoming enhancements¶

The team is focusing on exposing more features from the underlying Rust implementation of DataFusion and improving documentation.

How to install¶

From pip

pip install datafusion

python -m pip install datafusion

DataFusion-ObjectStore-S3¶

This crate provides an ObjectStore implementation for querying data stored in S3 or S3 compatible storage. This makes it almost as easy to query data that lives on S3 as data that lives in local files.

Ability to create S3FileSystem to register as part of DataFusion ExecutionContext
Register files or directories stored on S3 with ctx.register_listing_table

Upcoming enhancements¶

The current priority is adding Python bindings for S3FileSystem. After that, there will be async improvements as DataFusion adopts more of that functionality, and we are looking into S3 Select functionality.

How to Install¶

Add the below to your Cargo.toml in your Rust Project with DataFusion.

datafusion-objectstore-s3 = "0.1.0"

DataFusion-Substrait¶

Substrait is an emerging standard that provides a cross-language serialization format for relational algebra (e.g. expressions and query plans).

This crate provides a Substrait producer and consumer for DataFusion. A producer converts a DataFusion logical plan into a Substrait protobuf and a consumer does the reverse.

Examples of how to use this crate can be found here.

Potential Use Cases¶

Replace custom DataFusion protobuf serialization.
Make it easier to pass query plans over FFI boundaries, such as from Python to Rust
Allow Apache Calcite query plans to be executed in DataFusion

DataFusion-BigTable¶

This crate implements Bigtable as a data source and physical executor for DataFusion queries. It currently supports both UTF-8 strings and 64-bit big-endian signed integers in Bigtable. From a SQL perspective it supports both simple and composite row keys with =, IN, and BETWEEN operators as well as projection pushdown. The physical execution for queries is handled by this crate while any subsequent aggregation, group bys, or joins are handled in DataFusion.
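For readers who have not worked with raw cell values, the following sketch shows what a 64-bit big-endian signed integer looks like as bytes, using Python's standard struct module. The helper names are invented for the example, and how the crate itself decodes Bigtable cells is not shown here.

```python
import struct

def encode_i64_be(value):
    """Encode a signed 64-bit integer as 8 big-endian bytes."""
    return struct.pack(">q", value)

def decode_i64_be(raw):
    """Decode 8 big-endian bytes back into a signed 64-bit integer."""
    (value,) = struct.unpack(">q", raw)
    return value

print(encode_i64_be(1).hex())   # 0000000000000001
print(encode_i64_be(-2).hex())  # fffffffffffffffe
print(decode_i64_be(encode_i64_be(-2)))  # -2
```

Big-endian means the most significant byte comes first, and negative values use two's complement.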

Upcoming Enhancements¶

Predicate pushdown
Value range
Value Regex
Timestamp range
Multithreaded
Partition aware execution
Production ready

How to Install¶

Add the below to your Cargo.toml in your Rust Project with DataFusion.

datafusion-bigtable = "0.1.0"

DataFusion-HDFS¶

This crate introduces HadoopFileSystem as a remote ObjectStore which provides the ability to query HDFS files. For HDFS access the fs-hdfs library is used.

DataFusion-Tokomak¶

This crate provides an e-graph based DataFusion optimization framework based on the Rust egg library. An e-graph is a data structure that powers the equality saturation optimization technique.

As context, the optimizer framework within DataFusion is currently under review with the objective of implementing a more strategic long term solution that is more efficient and simpler to develop.

Some of the benefits of using egg within DataFusion are:

Implements optimized algorithms that are hard to match with manually written optimization passes
Makes it easy and less verbose to add optimization rules
Plugin framework to add more complex optimizations
Egg does not depend on rule order and can lead to a higher level of optimization by being able to apply multiple rules at the same time until it converges
Allows for cost-based optimizations

This is an exciting new area for DataFusion with lots of opportunity for community involvement!

DataFusion-Tui¶

DataFusion-tui aka dft provides a feature rich terminal application for using DataFusion. It has drawn inspiration and several features from datafusion-cli. In contrast to datafusion-cli the objective of this tool is to provide a light SQL IDE experience for querying data with DataFusion. This includes features such as the following which are currently implemented:

Tab Management to provide clean and structured organization of DataFusion queries, results, ExecutionContext information, and logs
SQL Editor
- Text editor for writing SQL queries
Query History
- History of executed queries, their execution time, and the number of returned rows
ExecutionContext information
- Expose information on which physical optimizers are used and which ExecutionConfig settings are set
Logs
- Logs from dft, DataFusion, and any dependent libraries
Support for custom ObjectStores
- S3
Preload DDL from ~/.datafusionrc to enable having local "database" available at startup

Upcoming Enhancements¶

SQL Editor
- Command to write query results to file
- Multiple SQL editor tabs
Expose more information from ExecutionContext
A help tab that provides information on functions
Query custom TableProviders such as DeltaTable or BigTable

DataFusion-Streams¶

DataFusion-Streams is a new testing ground for creating a StreamProvider in DataFusion that will enable querying streaming data sources such as Apache Kafka. The implementation for this feature is currently being designed and is under active review. Once the design is finalized, the trait and attendant data structures will be added back to the core DataFusion crate.

DataFusion-Java¶

This project created an initial set of Java bindings to DataFusion. The project is currently in maintenance mode and is looking for maintainers to drive future development.

How to Get Involved¶

If you are interested in contributing to DataFusion, and learning about state-of-the-art query processing, we would love to have you join us on the journey! You can help by trying out DataFusion on some of your own data and projects and letting us know how it goes, or by contributing a PR with documentation, tests or code. A list of open issues suitable for beginners is here.

The best way to find out about creating new extensions within DataFusion-Contrib is reaching out on the #arrow-rust channel of the Apache Software Foundation Slack workspace.

You can also check out our new Communication Doc on more ways to engage with the community.

Links for each DataFusion-Contrib repository are provided above if you would like to contribute to those.

Apache Arrow DataFusion 7.0.0 Release

2022-02-28T00:00:00+00:00

Introduction¶

DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format.

When you want to extend your Rust project with SQL support, a DataFrame API, or the ability to read and process Parquet, JSON, Avro or CSV data, DataFusion is definitely worth checking out.

DataFusion's SQL, DataFrame, and manual PlanBuilder API let users access a sophisticated query optimizer and execution engine capable of fast, resource efficient, and parallel execution that takes optimal advantage of today's multicore hardware. Being written in Rust means DataFusion can offer both the safety of dynamic languages and the resource efficiency of a compiled language.

The Apache Arrow team is pleased to announce the DataFusion 7.0.0 release. This covers 4 months of development work and includes 195 commits from the following 37 distinct contributors.

    44  Andrew Lamb
    24  Kun Liu
    23  Jiayu Liu
    17  xudong.w
    11  Yijie Shen
     9  Matthew Turner
     7  Liang-Chi Hsieh
     5  Lin Ma
     4  Stephen Carman
     4  James Katz
     4  Dmitry Patsura
     4  QP Hou
     3  dependabot[bot]
     3  Remzi Yang
     3  Yang
     3  ic4y
     3  Daniël Heres
     2  Andy Grove
     2  Raphael Taylor-Davies
     2  Jason Tianyi Wang
     2  Dan Harris
     2  Sergey Melnychuk
     1  Nitish Tiwari
     1  Dom
     1  Eduard Karacharov
     1  Javier Goday
     1  Boaz
     1  Marko Mikulicic
     1  Max Burke
     1  Carol (Nichols || Goulding)
     1  Phillip Cloud
     1  Rich
     1  Toby Hede
     1  Will Jones
     1  r.4ntix
     1  rdettai

The following section highlights some of the improvements in this release. Of course, many other bug fixes and improvements have also been made and we refer you to the complete changelog for the full detail.

Summary¶

DataFusion Crate
The DataFusion crate is being split into multiple crates to decrease compilation times and improve the development experience. Initially, datafusion-common (the core DataFusion components) and datafusion-expr (DataFusion expressions, functions, and operators) have been split out. There will be additional splits after the 7.0 release.
Performance Improvements and Optimizations
Arrow’s dyn scalar kernels are now used to enable efficient operations on DictionaryArrays #1685
Switch from std::sync::Mutex to parking_lot::Mutex #1720
New Features
Support for memory tracking and spilling to disk
- MemoryManager and DiskManager #1526
- Out of core sort #1526
- New metrics
- Gauge and CurrentMemoryUsage #1682
- spill_count and spilled_bytes #1641
New math functions
- approx_quantile #1529
- stddev and variance (sample and population) #1525
- corr #1561
Support decimal type #1394 #1407 #1408 #1431 #1483 #1554 #1640
Support for reading Parquet files with evolved schemas #1622 #1709
Support for registering DataFrame as table #1699
Support for the substring function #1621
Support array_agg(distinct ...) #1579
Support sort on unprojected columns #1415
Additional Integration Points
A new public Expression simplification API #1717
DataFusion-Contrib
A new GitHub organization created both as a home for DataFusion extensions and as a testing ground for new features.
- Extensions
- DataFusion-Python
- DataFusion-Java
- DataFusion-hdfs-native
- DataFusion-ObjectStore-s3
- New Features
- DataFusion-Streams
Arrow2
An Arrow2 branch has been created. There are ongoing discussions in DataFusion and arrow-rs about migrating DataFusion to Arrow2.

Documentation and Roadmap¶

We are working to consolidate the documentation into the official site. You can find more details there on topics such as the SQL status and a user guide. This is also an area where we would love to get help from the broader community #1821.

To provide transparency on DataFusion’s priorities to users and developers, a three-month roadmap will be published at the beginning of each quarter. It can be found here.

Upcoming Attractions¶

Ballista is gaining momentum, and several groups are now evaluating and contributing to the project.
Some of the proposed improvements
Continued improvements for working with limited resources and large datasets
Memory limited joins #1599
Sort-merge join #141 #1776
Introduce row based bytes representation #1708

How to Get Involved¶

Check out our new Communication Doc on more ways to engage with the community.

Apache Arrow DataFusion 6.0.0 Release

2021-11-19T00:00:00+00:00

Introduction¶

DataFusion is an embedded query engine which leverages the unique features of Rust and Apache Arrow to provide a system that is high performance, easy to connect, easy to embed, and high quality.

The Apache Arrow team is pleased to announce the DataFusion 6.0.0 release. This covers 4 months of development work and includes 134 commits from the following 28 distinct contributors.

    28  Andrew Lamb
    26  Jiayu Liu
    13  xudong963
     9  rdettai
     9  QP Hou
     6  Matthew Turner
     5  Daniël Heres
     4  Guillaume Balaine
     3  Francis Du
     3  Marco Neumann
     3  Jon Mease
     3  Nga Tran
     2  Yijie Shen
     2  Ruihang Xia
     2  Liang-Chi Hsieh
     2  baishen
     2  Andy Grove
     2  Jason Tianyi Wang
     1  Nan Zhu
     1  Antoine Wendlinger
     1  Krisztián Szűcs
     1  Mike Seddon
     1  Conner Murphy
     1  Patrick More
     1  Taehoon Moon
     1  Tiphaine Ruy
     1  adsharma
     1  lichuan6

The release notes below are not exhaustive and only expose selected highlights of the release. Many other bug fixes and improvements have been made: we refer you to the complete changelog.

New Website¶

Befitting a growing project, DataFusion now has its own website, hosted as part of the main Apache Arrow website.

Roadmap¶

For the first time, the community has gathered its thoughts about where we are taking DataFusion into a public Roadmap.

New Features¶

Runtime operator metrics collection framework
Object store abstraction for unified access to local or remote storage
Hive style table partitioning support for Parquet, CSV, Avro and JSON files
DataFrame API support for: except, intersect, show, limit and window functions
SQL
EXPLAIN ANALYZE with runtime metrics
trim ( [ LEADING | TRAILING | BOTH ] [ FROM ] string text [, characters text ] ) syntax
Postgres style regular expression matching operators ~, ~*, !~, and !~*
SQL set operators UNION, INTERSECT, and EXCEPT
cume_dist, percent_rank window functions
digest, blake2s, blake2b, blake3 crypto functions
HyperLogLog based approx_distinct
is distinct from and is not distinct from
CREATE TABLE AS SELECT
Accessing elements of nested Struct and List columns (e.g. SELECT struct_column['field_name'], array_column[0] FROM ...)
Boolean expressions in CASE statement
DROP TABLE
VALUES List
Postgres regex match operators
Support for Avro format
Support for ScalarValue::Struct
Automatic schema inference for CSV files
Better interactive editing support in datafusion-cli as well as psql style commands such as \d, \?, and \q
Generic constant evaluation and simplification framework
Added common subexpression elimination query plan optimization rule
Python binding 0.4.0 with all DataFusion 6.0.0 features
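To illustrate the Hive style partitioning item in the list above: partition column values are encoded in directory names such as `year=2021/month=11`, so they can be recovered from the file path without opening the file. The following is a minimal std-only sketch of that idea. The function names `parse_hive_partitions` and `file_matches` are hypothetical and are not DataFusion APIs.

```rust
/// Extract Hive-style `key=value` pairs from the directory part of a path.
/// The final segment (the file name) is ignored, even if it contains `=`.
fn parse_hive_partitions(path: &str) -> Vec<(String, String)> {
    let mut segments: Vec<&str> = path.split('/').collect();
    segments.pop(); // drop the file name
    segments
        .into_iter()
        .filter_map(|segment| segment.split_once('='))
        .filter(|(key, _)| !key.is_empty())
        .map(|(key, value)| (key.to_string(), value.to_string()))
        .collect()
}

/// Return true if the path carries `column=wanted` as a partition value.
/// A path that does not contain the column at all is treated as a non-match.
fn file_matches(path: &str, column: &str, wanted: &str) -> bool {
    parse_hive_partitions(path)
        .iter()
        .any(|(k, v)| k == column && v == wanted)
}
```

With a helper like this, a filter such as `year = '2021'` could skip whole directories before any file is read, which is the main benefit of this table layout.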

With these new features, we are also now passing TPC-H queries 8, 13 and 21.

For the full list of new features with their relevant PRs, see the enhancements section in the changelog.
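As a small illustration of one item from the SQL list above, `is distinct from` and `is not distinct from` differ from `=` in how they treat NULL: they always return true or false, and two NULLs count as not distinct. The sketch below models a nullable SQL value as a Rust `Option`. The function names are illustrative only and do not correspond to DataFusion internals.

```rust
/// SQL `a = b`: the result is NULL (modelled as None) if either side is NULL.
fn sql_eq<T: PartialEq>(a: Option<T>, b: Option<T>) -> Option<bool> {
    match (a, b) {
        (Some(x), Some(y)) => Some(x == y),
        _ => None,
    }
}

/// SQL `a IS DISTINCT FROM b`: never NULL; NULL is distinct from any value
/// but not from another NULL.
fn is_distinct_from<T: PartialEq>(a: Option<T>, b: Option<T>) -> bool {
    a != b
}

/// SQL `a IS NOT DISTINCT FROM b`: the negation of the operator above.
fn is_not_distinct_from<T: PartialEq>(a: Option<T>, b: Option<T>) -> bool {
    a == b
}
```

This works because Rust's `Option` comparison already treats `None == None` as true, which matches the null-safe semantics of these operators.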

`async` planning and decoupling file format from table layout¶

Driven by the need to support Hive style table partitioning, @rdettai introduced the following design changes to the DataFusion core.

The code for reading specific file formats (Parquet, Avro, CSV, and JSON) was separated from the logic that handles grouping sets of files into execution partitions.
The query planning process was made async.

As a result, we are able to replace the old Parquet, CSV and JSON table providers with a single ListingTable table provider.

This also sets up DataFusion and its plug-in ecosystem to support a wide range of catalogs and various object store implementations. You can read more about this change in the design document and on the arrow-datafusion#1010 PR.
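The separation described in this section can be pictured roughly as follows. This is a simplified std-only sketch, not the actual DataFusion code: the `FileFormat` trait here has a single hypothetical method, the struct fields are assumptions, files are assigned to partitions round-robin purely for illustration, and the async aspect of planning is left out.

```rust
/// Hypothetical stand-in for format-specific logic (how to read one file).
trait FileFormat {
    /// File extension handled by this format, e.g. "parquet".
    fn extension(&self) -> &str;
}

struct ParquetFormat;
impl FileFormat for ParquetFormat {
    fn extension(&self) -> &str {
        "parquet"
    }
}

struct CsvFormat;
impl FileFormat for CsvFormat {
    fn extension(&self) -> &str {
        "csv"
    }
}

/// Format-agnostic table layout: it only decides which files belong to
/// which execution partition and delegates everything else to the format.
struct ListingTable {
    format: Box<dyn FileFormat>,
    files: Vec<String>,
    target_partitions: usize,
}

impl ListingTable {
    /// Group the files that match the format into execution partitions.
    fn partitions(&self) -> Vec<Vec<String>> {
        let n = self.target_partitions.max(1);
        let mut groups: Vec<Vec<String>> = vec![Vec::new(); n];
        let suffix = format!(".{}", self.format.extension());
        let matching = self.files.iter().filter(|f| f.ends_with(suffix.as_str()));
        for (i, file) in matching.enumerate() {
            groups[i % n].push(file.clone());
        }
        groups
    }
}
```

Because the grouping logic never looks inside a file, swapping `ParquetFormat` for `CsvFormat` changes nothing in `ListingTable`, which is why one table provider can replace several format-specific ones.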

How to Get Involved¶

If you are interested in contributing to DataFusion, we would love to have you! You can help by trying out DataFusion on your own data and projects and filing bug reports, or by contributing to the documentation, tests, or code. A list of open issues suitable for beginners is here and the full list is here.

Check out our new Communication Doc on more ways to engage with the community.

Apache Arrow Ballista 0.5.0 Release

2021-08-18T00:00:00+00:00

Ballista extends DataFusion to provide support for distributed queries. This is the first release of Ballista since the project was donated to the Apache Arrow project and includes 80 commits from 11 contributors.

git shortlog -sn 4.0.0..5.0.0 ballista/rust/client ballista/rust/core ballista/rust/executor ballista/rust/scheduler
  27  Andy Grove
  15  Jiayu Liu
  12  Andrew Lamb
   8  Ximo Guanter
   6  Daniël Heres
   5  QP Hou
   2  Jorge Leitao
   1  Javier Goday
   1  K.I. (Dennis) Jung
   1  Mike Seddon
   1  sathis

The release notes below are not exhaustive and only expose selected highlights of the release. Many other bug fixes and improvements have been made: we refer you to the complete changelog.

Performance and Scalability¶

Ballista is now capable of running complex SQL queries at scale and supports scalable distributed joins. We have been benchmarking using individual queries from the TPC-H benchmark at scale factors up to 1000 (1 TB). When running against CSV files, performance is generally very close to DataFusion, and significantly faster in some cases because the scheduler limits the number of concurrent tasks that run at any given time. Performance against large Parquet datasets is currently not ideal due to some issues (#867, #868) that we hope to resolve for the next release.

New Features¶

The main new features in this release are:

Ballista queries can now be executed by calling DataFrame.collect()
The shuffle mechanism has been re-implemented
Distributed hash-partitioned joins are now supported
Keda autoscaling is supported
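The idea behind the distributed hash-partitioned joins listed above can be sketched in a few lines: both inputs are repartitioned by a hash of the join key, so rows with equal keys always land in the same partition and each partition can be joined independently (in Ballista's case, on different executors). The code below is a single-process, std-only illustration with hypothetical function names. It is not Ballista's implementation.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Map a join key to a partition index in `0..n` (n is clamped to at least 1).
fn partition_for(key: u64, n: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    (hasher.finish() % n.max(1) as u64) as usize
}

/// Split rows into `n` partitions by hashing the join key.
fn hash_partition<V: Clone>(rows: &[(u64, V)], n: usize) -> Vec<Vec<(u64, V)>> {
    let n = n.max(1);
    let mut parts: Vec<Vec<(u64, V)>> = (0..n).map(|_| Vec::new()).collect();
    for (key, value) in rows {
        parts[partition_for(*key, n)].push((*key, value.clone()));
    }
    parts
}

/// Inner hash join of one pair of partitions: build on the left, probe with the right.
fn join_partition<L: Clone, R: Clone>(left: &[(u64, L)], right: &[(u64, R)]) -> Vec<(u64, L, R)> {
    let mut build: HashMap<u64, Vec<L>> = HashMap::new();
    for (key, value) in left {
        build.entry(*key).or_default().push(value.clone());
    }
    let mut out = Vec::new();
    for (key, r) in right {
        if let Some(matches) = build.get(key) {
            for l in matches {
                out.push((*key, l.clone(), r.clone()));
            }
        }
    }
    out
}

/// Repartition both sides, then join matching partition pairs independently.
fn partitioned_join<L: Clone, R: Clone>(
    left: &[(u64, L)],
    right: &[(u64, R)],
    n: usize,
) -> Vec<(u64, L, R)> {
    let left_parts = hash_partition(left, n);
    let right_parts = hash_partition(right, n);
    let mut out = Vec::new();
    for (l, r) in left_parts.iter().zip(right_parts.iter()) {
        out.extend(join_partition(l, r));
    }
    out
}
```

The output order depends on how keys hash into partitions, so callers that need a stable order should sort the result.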

To get started with Ballista, refer to the crate documentation.

Now that the basic functionality is in place, the focus for the next release will be to improve performance, scalability, and documentation.

How to Get Involved¶

If you are interested in contributing to Ballista, we would love to have you! You can help by trying out Ballista on your own data and projects and filing bug reports, or by contributing to the documentation, tests, or code. A list of open issues suitable for beginners is here and the full list is here.

Apache Arrow DataFusion 5.0.0 Release

2021-08-18T00:00:00+00:00

The Apache Arrow team is pleased to announce the DataFusion 5.0.0 release. This covers 4 months of development work and includes 211 commits from the following 31 distinct contributors.

$ git shortlog -sn 4.0.0..5.0.0 datafusion datafusion-cli datafusion-examples
    61  Jiayu Liu
    47  Andrew Lamb
    27  Daniël Heres
    13  QP Hou
    13  Andy Grove
     4  Javier Goday
     4  sathis
     3  Ruan Pearce-Authers
     3  Raphael Taylor-Davies
     3  Jorge Leitao
     3  Cui Wenzheng
     3  Mike Seddon
     3  Edd Robinson
     2  思维
     2  Liang-Chi Hsieh
     2  Michael Lu
     2  Parth Sarthy
     2  Patrick More
     2  Rich
     1  Charlie Evans
     1  Gang Liao
     1  Agata Naomichi
     1  Ritchie Vink
     1  Evan Chan
     1  Ruihang Xia
     1  Todd Treece
     1  Yichen Wang
     1  baishen
     1  Nga Tran
     1  rdettai
     1  Marco Neumann

The release notes below are not exhaustive and only expose selected highlights of the release. Many other bug fixes and improvements have been made: we refer you to the complete changelog.

Performance¶

There have been numerous performance improvements in this release. The following chart shows the relative performance of individual TPC-H queries compared to the previous release.

TPC-H @ scale factor 100, in parquet format. Concurrency 24.

We also extended support for more TPC-H queries: q7, q8, q9 and q13 are running successfully in DataFusion 5.0.

New Features¶

Initial support for SQL-99 Analytics (WINDOW functions)
Improved JOIN support: cross join, semi-join, anti join, and fixes to null handling
Improved EXPLAIN support
Initial implementation of metrics in the physical plan
Support for SELECT DISTINCT
Support for JSON and NDJSON formatted inputs
Query column with relations
Added more datetime related functions: now, date_trunc, to_timestamp_millis, to_timestamp_micros, to_timestamp_seconds
Streaming DataFrame.collect
Support table column aliases
Answer count(*), min() and max() queries using only statistics
Non-equi-join filters in JOIN conditions
Modulus operation
Support group by column positions
Added constant folding query optimizer
Hash partitioned aggregation
Added random SQL function
Implemented count distinct for floats and dictionary types
Re-exported arrow and parquet crates in DataFusion
General row group pruning logic that’s agnostic to storage format
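Two items in the list above, answering `count(*)`, `min()` and `max()` from statistics and row group pruning, rest on the same idea: per-row-group metadata can often answer or narrow a query without reading the data. The sketch below shows that idea on a single integer column. The `RowGroupStats` struct and both functions are hypothetical, and the sketch assumes the statistics are exact and that no other filter applies.

```rust
/// Hypothetical per-row-group statistics for one integer column.
#[derive(Debug, Clone, PartialEq)]
struct RowGroupStats {
    min: i64,
    max: i64,
    row_count: u64,
}

/// Indices of row groups that may contain a row with `value > threshold`.
/// A group whose maximum is not above the threshold cannot match and is skipped.
fn prune_greater_than(groups: &[RowGroupStats], threshold: i64) -> Vec<usize> {
    groups
        .iter()
        .enumerate()
        .filter(|(_, g)| g.max > threshold)
        .map(|(i, _)| i)
        .collect()
}

/// Answer count(*), min and max from statistics alone.
/// min and max are None when there are no row groups.
fn aggregate_from_stats(groups: &[RowGroupStats]) -> (u64, Option<i64>, Option<i64>) {
    let count: u64 = groups.iter().map(|g| g.row_count).sum();
    let min = groups.iter().map(|g| g.min).min();
    let max = groups.iter().map(|g| g.max).max();
    (count, min, max)
}
```

Pruning is conservative: a row group that survives may still contain no matching rows, so the filter must still be applied to the data that is read.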

How to Get Involved¶

Ballista: A Distributed Scheduler for Apache Arrow

2021-04-12T00:00:00+00:00

We are excited to announce that Ballista has been donated to the Apache Arrow project.

Ballista is a distributed compute platform primarily implemented in Rust, and powered by Apache Arrow. It is built on an architecture that allows other programming languages (such as Python, C++, and Java) to be supported as first-class citizens without paying a penalty for serialization costs.

The foundational technologies in Ballista are:

Apache Arrow memory model and compute kernels for efficient processing of data.
Apache Arrow DataFusion query planning and execution framework, extended by Ballista to provide distributed planning and execution.
Apache Arrow Flight Protocol for efficient data transfer between processes.
Google Protocol Buffers for serializing query plans.
Docker for packaging up executors along with user-defined code.

Ballista can be deployed as a standalone cluster and also supports Kubernetes. In either case, the scheduler can be configured to use etcd as a backing store to (eventually) provide redundancy in the case of a scheduler failing.

Status¶

The Ballista project is at an early stage of development. However, it is capable of running complex analytics queries in a distributed cluster with reasonable performance (comparable to more established distributed query frameworks).

One of the benefits of Ballista being part of the Arrow codebase is that there is now an opportunity to push parts of the scheduler down to DataFusion so that it is possible to seamlessly scale across cores in DataFusion, and across nodes in Ballista, using the same unified query scheduler.

Contributors Welcome!¶

If you are excited about being able to use Rust for distributed compute and ETL and would like to contribute to this work then there are many ways to get involved. The simplest way to get started is to try out Ballista against your own datasets and file bug reports for any issues that you find. You could also check out the current list of issues and have a go at fixing one.

The Arrow Rust Community section of the Rust README provides information on other ways to interact with the Ballista contributors and maintainers.

DataFusion: A Rust-native Query Engine for Apache Arrow

2019-02-04T00:00:00+00:00

We are excited to announce that DataFusion has been donated to the Apache Arrow project. DataFusion is an in-memory query engine for the Rust implementation of Apache Arrow.

Although DataFusion was started two years ago, it was recently re-implemented to be Arrow-native and currently has limited capabilities. It does support SQL queries against iterators of RecordBatch and can read CSV files. There are plans to add support for Parquet files.

SQL support is limited to projection (SELECT), selection (WHERE), and simple aggregates (MIN, MAX, SUM) with an optional GROUP BY clause.

Supported expressions are identifiers, literals, simple math operations (+, -, *, /), binary expressions (AND, OR), equality and comparison operators (=, !=, <, <=, >=, >), and CAST(expr AS type).

Example¶

The following example demonstrates running a simple aggregate SQL query against a CSV file.

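// imports needed by this example (module paths as in the DataFusion
// examples from the time of this post; the API has since changed)
use std::cell::RefCell;
use std::rc::Rc;
use std::sync::Arc;

use arrow::array::{BinaryArray, Float64Array};
use arrow::datatypes::{DataType, Field, Schema};
use datafusion::execution::context::ExecutionContext;
use datafusion::execution::datasource::CsvDataSource;

// create execution context
let mut ctx = ExecutionContext::new();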

// define schema for data source (csv file)
let schema = Arc::new(Schema::new(vec![
    Field::new("c1", DataType::Utf8, false),
    Field::new("c2", DataType::UInt32, false),
    Field::new("c3", DataType::Int8, false),
    Field::new("c4", DataType::Int16, false),
    Field::new("c5", DataType::Int32, false),
    Field::new("c6", DataType::Int64, false),
    Field::new("c7", DataType::UInt8, false),
    Field::new("c8", DataType::UInt16, false),
    Field::new("c9", DataType::UInt32, false),
    Field::new("c10", DataType::UInt64, false),
    Field::new("c11", DataType::Float32, false),
    Field::new("c12", DataType::Float64, false),
    Field::new("c13", DataType::Utf8, false),
]));

// register csv file with the execution context
let csv_datasource =
    CsvDataSource::new("test/data/aggregate_test_100.csv", schema.clone(), 1024);
ctx.register_datasource("aggregate_test_100", Rc::new(RefCell::new(csv_datasource)));

let sql = "SELECT c1, MIN(c12), MAX(c12) FROM aggregate_test_100 WHERE c11 > 0.1 AND c11 < 0.9 GROUP BY c1";

// execute the query
let relation = ctx.sql(&sql).unwrap();
let mut results = relation.borrow_mut();

// iterate over the results
while let Some(batch) = results.next().unwrap() {
    println!(
        "RecordBatch has {} rows and {} columns",
        batch.num_rows(),
        batch.num_columns()
    );

    let c1 = batch
        .column(0)
        .as_any()
        .downcast_ref::<BinaryArray>()
        .unwrap();

    let min = batch
        .column(1)
        .as_any()
        .downcast_ref::<Float64Array>()
        .unwrap();

    let max = batch
        .column(2)
        .as_any()
        .downcast_ref::<Float64Array>()
        .unwrap();

    for i in 0..batch.num_rows() {
        let c1_value: String = String::from_utf8(c1.value(i).to_vec()).unwrap();
        println!("{}, Min: {}, Max: {}", c1_value, min.value(i), max.value(i),);
    }
}

Roadmap¶

The roadmap for DataFusion will depend on interest from the Rust community, but here are some of the short term items that are planned:

Extending test coverage of the existing functionality
Adding support for Parquet data sources
Implementing more SQL features such as JOIN, ORDER BY and LIMIT
Implement a DataFrame API as an alternative to SQL
Adding support for partitioning and parallel query execution using Rust's async and await functionality
Creating a Docker image to make it easy to use DataFusion as a standalone query tool for interactive and batch queries

Contributors Welcome!¶

If you are excited about being able to use Rust for data science and would like to contribute to this work then there are many ways to get involved. The simplest way to get started is to try out DataFusion against your own data sources and file bug reports for any issues that you find. You could also check out the current list of issues and have a go at fixing one. You can also join the user mailing list to ask questions.
