Apache DataFusion 52.0.0 Released

Posted on: Mon 12 January 2026 by pmc

We are proud to announce the release of DataFusion 52.0.0. This post highlights some of the major improvements since DataFusion 51.0.0. The complete list of changes is available in the changelog. Thanks to the 121 contributors for making this release possible.

Performance Improvements 🚀

We continue to make significant performance improvements in DataFusion as explained below.

Faster CASE Expressions

DataFusion 52 adds lookup-table-based evaluation for certain CASE expressions, avoiding repeated per-branch evaluation and accelerating common ETL patterns such as:

CASE company
    WHEN 1 THEN 'Apple'
    WHEN 5 THEN 'Samsung'
    WHEN 2 THEN 'Motorola'
    WHEN 3 THEN 'LG'
    ELSE 'Other'
END

This is the final work in our CASE performance epic (#18075), which has significantly improved CASE evaluation. Related PRs: #18183. Thanks to rluvaton and pepijnve for the implementation.
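
The idea can be sketched in a few lines of Python (a minimal illustration, not DataFusion's Rust implementation): when every WHEN arm compares the same column against a literal, the arms can be collapsed into a lookup table built once per expression, so each row costs a single hash lookup instead of one comparison per arm.

```python
# Sketch: lookup-table evaluation of a literal-only CASE expression.
# The table is built once; each row is then a single dict lookup.
def eval_case_lookup(values, arms, default):
    """arms: list of (WHEN literal, THEN result) pairs."""
    table = dict(arms)  # built once per expression, not per row
    return [table.get(v, default) for v in values]

companies = [1, 5, 2, 3, 9, 1]
arms = [(1, "Apple"), (5, "Samsung"), (2, "Motorola"), (3, "LG")]
print(eval_case_lookup(companies, arms, "Other"))
# ['Apple', 'Samsung', 'Motorola', 'LG', 'Other', 'Apple']
```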

MIN/MAX Aggregate Dynamic Filters

DataFusion now creates dynamic filters for queries with MIN/MAX aggregates that have filters but no GROUP BY. These dynamic filters are used during the scan to prune files and rows as tighter bounds are discovered during execution, as explained in the Dynamic Filtering Blog. For example, the following query:

SELECT min(l_shipdate)
FROM lineitem
WHERE l_returnflag = 'R';

is now executed like this:

SELECT min(l_shipdate)
FROM lineitem
--  '__current_min' is updated dynamically during execution
WHERE l_returnflag = 'R' AND l_shipdate < __current_min;

Thanks to 2010YOUY01 for implementing this feature, with reviews from martin-g, adriangb, and LiaCastaneda. Related PRs: #18644
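
The pruning effect can be sketched as follows (a hypothetical Python illustration, not DataFusion internals; the per-file min statistic stands in for Parquet file metadata). As the scan discovers smaller values, the bound tightens, and any file whose min-statistic cannot beat the current bound is skipped entirely:

```python
# Sketch: computing min(x) with a dynamically tightening filter bound.
def min_with_dynamic_filter(files):
    """files: list of (min_stat, values) pairs, e.g. one per Parquet file."""
    current_min = None
    skipped = 0
    for min_stat, values in files:
        if current_min is not None and min_stat >= current_min:
            skipped += 1          # statistics prove no smaller value exists here
            continue
        for v in values:
            if current_min is None or v < current_min:
                current_min = v   # tighten the dynamic filter bound
    return current_min, skipped

files = [(10, [12, 10, 30]), (15, [15, 40]), (3, [7, 3, 9])]
print(min_with_dynamic_filter(files))  # (3, 1): the second file is pruned
```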

New Merge Join

DataFusion 52 includes a rewrite of the sort-merge join (SMJ) operator, with speedups of three orders of magnitude in some pathological cases, such as the one in #18487, which also affected Apache Comet workloads. Benchmarks in #18875 show dramatic gains for TPC-H Q21 (minutes to milliseconds) while leaving other queries unchanged or modestly faster. Thanks to mbutrovich for the implementation and to Dandandan for reviews.

Caching Improvements

This release also includes several additional caching improvements.

A new statistics cache for File Metadata avoids repeatedly (re)calculating statistics for files. This significantly improves planning time for certain queries. You can see the contents of the new cache using the statistics_cache function in the CLI:

select * from statistics_cache();
+------------------+---------------------+-----------------+------------------------+---------+-----------------+-------------+--------------------+-----------------------+
| path             | file_modified       | file_size_bytes | e_tag                  | version | num_rows        | num_columns | table_size_bytes   | statistics_size_bytes |
+------------------+---------------------+-----------------+------------------------+---------+-----------------+-------------+--------------------+-----------------------+
| .../hits.parquet | 2022-06-25T22:22:22 | 14779976446     | 0-5e24d1ee16380-370f48 | NULL    | Exact(99997497) | 105         | Exact(36445943240) | 0                     |
+------------------+---------------------+-----------------+------------------------+---------+-----------------+-------------+--------------------+-----------------------+

Thanks to bharath-techie and nuno-faria for implementing the statistics cache, with reviews from martin-g, alamb, and alchemist51. Related PRs: #18971, #19054

A prefix-aware list-files cache accelerates evaluating partition predicates for Hive partitioned tables.

-- Read the hive partitioned dataset from Overture Maps (100s of Parquet files)
CREATE EXTERNAL TABLE overturemaps
STORED AS PARQUET LOCATION 's3://overturemaps-us-west-2/release/2025-12-17.0/';
-- Find all files where the path contains `theme=base` without requiring another LIST call
select count(*) from overturemaps where theme='base';

You can see the contents of the new cache using the list_files_cache function in the CLI:

create external table overturemaps
stored as parquet
location 's3://overturemaps-us-west-2/release/2025-12-17.0/theme=base/type=infrastructure';
0 row(s) fetched.
> select table, path, metadata_size_bytes, expires_in, unnest(metadata_list)['file_size_bytes'] as file_size_bytes, unnest(metadata_list)['e_tag'] as e_tag from list_files_cache() limit 10;
+--------------+-----------------------------------------------------+---------------------+-----------------------------------+-----------------+---------------------------------------+
| table        | path                                                | metadata_size_bytes | expires_in                        | file_size_bytes | e_tag                                 |
+--------------+-----------------------------------------------------+---------------------+-----------------------------------+-----------------+---------------------------------------+
| overturemaps | release/2025-12-17.0/theme=base/type=infrastructure | 2750                | 0 days 0 hours 0 mins 25.264 secs | 999055952       | "35fc8fbe8400960b54c66fbb408c48e8-60" |
| overturemaps | release/2025-12-17.0/theme=base/type=infrastructure | 2750                | 0 days 0 hours 0 mins 25.264 secs | 975592768       | "8a16e10b722681cdc00242564b502965-59" |
...
| overturemaps | release/2025-12-17.0/theme=base/type=infrastructure | 2750                | 0 days 0 hours 0 mins 25.264 secs | 1016732378      | "6d70857a0473ed9ed3fc6e149814168b-61" |
| overturemaps | release/2025-12-17.0/theme=base/type=infrastructure | 2750                | 0 days 0 hours 0 mins 25.264 secs | 991363784       | "c9cafb42fcbb413f851691c895dd7c2b-60" |
| overturemaps | release/2025-12-17.0/theme=base/type=infrastructure | 2750                | 0 days 0 hours 0 mins 25.264 secs | 1032469715      | "7540252d0d67158297a67038a3365e0f-62" |
+--------------+-----------------------------------------------------+---------------------+-----------------------------------+-----------------+---------------------------------------+

Thanks to BlakeOrth and Yuvraj-cyborg for implementing the list-files cache, with reviews from gabotechs, alamb, alchemist51, martin-g, and BlakeOrth. Related PRs: #18146, #18855, #19366, #19298
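
The prefix-aware behavior can be sketched like this (a hypothetical Python illustration; class and method names are invented for the example, not DataFusion's API). A listing cached for a path can answer narrower requests for any of its sub-prefixes without another LIST call to object storage:

```python
# Sketch: a prefix-aware file-listing cache.
class PrefixListCache:
    def __init__(self):
        self.cache = {}  # listed prefix -> list of file paths

    def put(self, prefix, paths):
        self.cache[prefix] = paths

    def get(self, prefix):
        # Exact hit, or filter a previously-listed broader prefix.
        for cached_prefix, paths in self.cache.items():
            if prefix.startswith(cached_prefix):
                return [p for p in paths if p.startswith(prefix)]
        return None  # cache miss: caller must issue a real LIST

cache = PrefixListCache()
cache.put("release/", ["release/theme=base/a.parquet",
                       "release/theme=base/b.parquet",
                       "release/theme=places/c.parquet"])
print(cache.get("release/theme=base/"))
# ['release/theme=base/a.parquet', 'release/theme=base/b.parquet']
```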

Improved Hash Join Filter Pushdown

Starting in DataFusion 51, filtering information from HashJoinExec is passed dynamically to scans, as explained in the Dynamic Filtering Blog, using a technique referred to as Sideways Information Passing in the database research literature. The initial implementation passed min/max values for the join keys. DataFusion 52 extends the optimization (#17171 / #18393) to pass the contents of the build side hash map. These filters are evaluated on the probe side scan to prune files, row groups, and individual rows. When the build side contains 20 or fewer rows (configurable), the contents of the hash map are transformed into an IN expression and used for statistics-based pruning, which can avoid reading entire files or row groups that contain no matching join keys. Thanks to adriangb for implementing this feature, with reviews from LiaCastaneda, asolimando, comphead, and mbutrovich.
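
The file-level pruning step can be sketched as follows (a hypothetical Python illustration, not DataFusion's implementation; the threshold constant and function name are invented for the example). A small set of build-side keys becomes an IN-list that is checked against each probe-side file's min/max statistics:

```python
# Sketch: pruning probe-side files with an IN-list built from a small
# hash-join build side.
IN_LIST_THRESHOLD = 20  # assumed configurable limit, as described above

def probe_side_files_to_scan(build_keys, files):
    """files: list of (min_stat, max_stat) per probe-side file."""
    if len(build_keys) > IN_LIST_THRESHOLD:
        return list(range(len(files)))  # build side too large for an IN-list
    keep = []
    for i, (lo, hi) in enumerate(files):
        # Keep the file only if some join key falls inside its value range.
        if any(lo <= k <= hi for k in build_keys):
            keep.append(i)
    return keep

build_keys = {7, 42}
files = [(0, 5), (6, 10), (40, 50), (100, 200)]
print(probe_side_files_to_scan(build_keys, files))  # [1, 2]
```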

Major Features ✨

Arrow IPC Stream File Support

DataFusion can now read Arrow IPC stream files (#18457). This expands interoperability with systems that emit Arrow streams directly, making it simpler to ingest Arrow-native data without conversion. Thanks to corasaurus-hex for implementing this feature, with reviews from martin-g, Jefffrey, jdcasale, 2010YOUY01, and timsaucer.

CREATE EXTERNAL TABLE ipc_events
STORED AS ARROW
LOCATION 's3://bucket/events.arrow';

Related PRs: #18457

More Extensible SQL Planning with RelationPlanner

DataFusion now has an API for extending the SQL planner for relations, as explained in the Extending SQL in DataFusion Blog. In addition to the existing expression and type extension points, this new API allows extending FROM clauses. Using these APIs, it is straightforward to provide SQL support for almost any dialect, including vendor-specific syntax. Example use cases include:

-- Postgres-style JSON operators
SELECT payload->'user'->>'id' FROM logs;
-- MySQL-specific types
SELECT DATETIME '2001-01-01 18:00:00';
-- Statistical sampling
SELECT * FROM sensor_data TABLESAMPLE BERNOULLI(10 PERCENT);

Thanks to geoffreyclaude for implementing relation planner extensions, and to theirix, alamb, NGA-TRAN, and gabotechs for reviews and feedback on the design. Related PRs: #17843

Expression Evaluation Pushdown to Scans

DataFusion now pushes down expression evaluation into TableProviders using PhysicalExprAdapter, replacing the older SchemaAdapter approach (#14993, #16800). Predicates and expressions can now be customized for each individual file schema, opening up additional optimizations such as support for Variant shredding. Thanks to adriangb for implementing PhysicalExprAdapter and reworking pushdown to use it. Related PRs: #18998, #19345

Sort Pushdown to Scans

DataFusion can now push sorts into data sources (#10433, #19064). This allows table provider implementations to optimize based on sort knowledge for certain query patterns. For example, the provided Parquet data source now reverses the scan order of row groups and files when queried for the opposite of the file's natural sort (e.g. DESC when the files are sorted ASC). This reversal, combined with dynamic filtering, allows top-K queries with LIMIT on pre-sorted data to find the requested rows very quickly, pruning more files and row groups without even scanning them. We have seen a ~30x performance improvement on benchmark queries with pre-sorted data. Thanks to zhuqi-lucas and xudong963 for this feature, with reviews from martin-g, adriangb, and alamb.
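
The reversal trick for top-K queries can be sketched like this (a hypothetical Python illustration, not the Parquet data source code; lists stand in for row groups). Scanning ascending-sorted row groups in reverse order serves an ORDER BY ... DESC LIMIT k without a sort, and the limit lets the scan stop before touching most of the data:

```python
# Sketch: serving ORDER BY x DESC LIMIT k over ascending-sorted files by
# reversing the scan order of row groups and rows.
def top_k_desc_from_asc_sorted(row_groups, k):
    """row_groups: list of lists, globally sorted ascending on the sort key."""
    out = []
    for group in reversed(row_groups):   # reverse file/row-group order
        for v in reversed(group):        # reverse within each group
            out.append(v)
            if len(out) == k:
                return out               # remaining groups are never scanned
    return out

row_groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(top_k_desc_from_asc_sorted(row_groups, 4))  # [9, 8, 7, 6]
```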

TableProvider Supports DELETE and UPDATE Statements

The TableProvider trait now includes hooks for DELETE and UPDATE statements and the basic MemTable implements them (#19142). This lets downstream implementations and storage engines plug in their own mutation logic. See TableProvider::delete_from and TableProvider::update for more details.

Example:

DELETE FROM mem_table WHERE status = 'obsolete';

Thanks to ethan-tyler for the implementation and alamb and adriangb for reviews.

CoalesceBatchesExec Removed

The standalone CoalesceBatchesExec operator existed to ensure batches were large enough for subsequent vectorized execution, and was inserted after filter-like operators such as FilterExec, HashJoinExec, and RepartitionExec. However, a separate operator also blocked other optimizations, such as pushing LIMIT through joins, and made optimizer rules more complex. In this release, we integrated the coalescing into the operators themselves (#18779) using Arrow's coalesce kernel. This reduces plan complexity while keeping batch sizes efficient, and allows additional focused optimization work in the Arrow kernel, such as Dandandan's recent work on filtering in arrow-rs/#8951.

Related PRs: #18540, #18604, #18630, #18972, #19002, #19342, #19239. Thanks to Tim-53, Dandandan, jizezhang, and feniljain for implementing this feature, with reviews from Jefffrey, alamb, martin-g, geoffreyclaude, milenkovicm, and jizezhang.
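
The coalescing behavior that now lives inside the operators can be sketched as follows (a minimal Python illustration, not DataFusion's Rust implementation; plain lists stand in for Arrow record batches). Small post-filter batches are buffered until roughly the target batch size is reached:

```python
# Sketch: coalescing many tiny batches into batches of ~target_size rows.
def coalesce_batches(batches, target_size):
    """Yield batches of target_size rows built from possibly tiny inputs."""
    buffer = []
    for batch in batches:
        buffer.extend(batch)
        while len(buffer) >= target_size:
            yield buffer[:target_size]
            buffer = buffer[target_size:]
    if buffer:
        yield buffer  # final partial batch

tiny = [[1], [2, 3], [4], [5, 6, 7], [8]]
print(list(coalesce_batches(tiny, 4)))  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```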

Upgrade Guide and Changelog

As always, upgrading to 52.0.0 should be straightforward for most users. Please review the Upgrade Guide for details on breaking changes and code snippets to help with the transition. For a comprehensive list of all changes, please refer to the changelog.

About DataFusion

Apache DataFusion is an extensible query engine, written in Rust, that uses Apache Arrow as its in-memory format. DataFusion is used by developers to create new, fast, data-centric systems such as databases, dataframe libraries, and machine learning and streaming applications. While DataFusion's primary design goal is to accelerate the creation of other data-centric systems, it provides a reasonable experience directly out of the box as a dataframe library, Python library, and command-line SQL tool.

How to Get Involved

DataFusion is not a project built or driven by a single person, company, or foundation. Rather, our community of users and contributors works together to build a shared technology that none of us could have built alone.

If you are interested in joining us, we would love to have you. You can try out DataFusion on some of your own data and projects and let us know how it goes, contribute suggestions, documentation, bug reports, or a PR with documentation, tests, or code. A list of open issues suitable for beginners is here, and you can find out how to reach us on the communication doc.

