Apache DataFusion 48.0.0 Released

Posted on: Wed 16 July 2025 by PMC

We’re excited to announce the release of Apache DataFusion 48.0.0! As always, this version packs in a wide range of improvements and fixes. You can find the complete details in the full changelog. We’ll highlight the most important changes below and guide you through upgrading.

Breaking Changes

DataFusion 48.0.0 brings a few breaking changes that may require adjustments to your code as described in the Upgrade Guide. Here are the most notable ones:

  • datafusion.execution.collect_statistics defaults to true: In DataFusion 48.0.0, the default value of this configuration setting is now true, and DataFusion will collect and store statistics when a table is first created via CREATE EXTERNAL TABLE or one of the SessionContext::register_* APIs (a sketch of restoring the previous default appears after this list).

  • Expr::Literal has optional metadata: The Expr::Literal variant now includes optional metadata, which allows Arrow field metadata to be carried through to support extension types and other uses. This means code such as

match expr {
...
  Expr::Literal(scalar) => ...
...
}

should be updated to:

match expr {
...
  Expr::Literal(scalar, _metadata) => ...
...
}
  • Expr::WindowFunction is now Boxed: Expr::WindowFunction is now a Box<WindowFunction> instead of a WindowFunction directly. This change was made to reduce the size of Expr and improve performance when planning queries (see details on #16207).

  • UDFs changed to use FieldRef instead of DataType: To support metadata handling and prepare for extension types, UDF traits now use FieldRef rather than a separate DataType and nullability. A FieldRef carries the type and nullability, and additionally allows access to field metadata, which can be used for extension types (see the illustration after this list).

  • Physical expressions return Field: Similarly to UDFs, to prepare for extension type support, the PhysicalExpr trait has been changed to return Field rather than DataType. To upgrade, structs that implement PhysicalExpr need to implement the return_field function.

  • FileFormat::supports_filters_pushdown was replaced with FileSource::try_pushdown_filters to support upcoming work on dynamic filter pushdown and physical filter pushdown.

  • ParquetExec, AvroExec, CsvExec, JsonExec removed: ParquetExec, AvroExec, CsvExec, and JsonExec were deprecated in DataFusion 46 and are removed in DataFusion 48.
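
If you want to retain the previous statistics behavior, collection can be switched back off when building the session. This is a minimal sketch (the helper function name is ours for illustration):

    use datafusion::prelude::*;

    fn context_without_stats_collection() -> SessionContext {
        // Opt back out of automatic statistics collection
        // (restores the pre-48.0.0 default of `false`)
        let config = SessionConfig::new()
            .set_bool("datafusion.execution.collect_statistics", false);
        SessionContext::new_with_config(config)
    }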
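
To illustrate what a FieldRef carries beyond a bare DataType (the field name and metadata values below are hypothetical, and this only shows the Arrow types involved, not the UDF traits themselves): a field bundles the type, the nullability, and a metadata map that extension types can use.

    use std::{collections::HashMap, sync::Arc};
    use arrow::datatypes::{DataType, Field, FieldRef};

    fn example_output_field() -> FieldRef {
        // A FieldRef carries the data type and nullability that UDFs previously
        // reported separately, plus metadata usable for extension types
        Arc::new(
            Field::new("output", DataType::Utf8, /* nullable */ true).with_metadata(
                HashMap::from([(
                    "ARROW:extension:name".to_string(),
                    "my.extension_type".to_string(),
                )]),
            ),
        )
    }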

Performance Improvements

DataFusion 48.0.0 comes with some noteworthy performance enhancements:

  • Fewer unnecessary projections: DataFusion now removes additional unnecessary Projections in queries. (PRs #15787, #15761, and #15746 by xudong963).

  • Accelerated string functions: The ascii function was optimized to significantly improve its performance (PR #16087 by tlm365), and the character_length function was optimized, resulting in up to a 3x performance improvement (PR #15931 by Dandandan).

  • Constant aggregate window expressions: For unbounded aggregate window functions, the result is the same for all rows within a partition. DataFusion 48.0.0 avoids recomputing the value for every row, improving performance by 5.6x (PR #16234 by suibianwanwank); an example query appears after this list.
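
As an illustration of the kind of query that benefits (the table and column names below are hypothetical), an aggregate window function with an unbounded frame produces a single value per partition, which DataFusion 48.0.0 now computes only once:

    use datafusion::error::Result;
    use datafusion::prelude::*;

    async fn constant_window_example(ctx: &SessionContext) -> Result<()> {
        // With no ORDER BY, the window frame spans the whole partition, so
        // every row in a region receives the same total
        let df = ctx
            .sql("SELECT region, sum(amount) OVER (PARTITION BY region) AS total FROM sales")
            .await?;
        df.show().await?;
        Ok(())
    }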

Highlighted New Features

New datafusion-spark crate

The DataFusion community has requested Apache Spark-compatible functions for many years, but the current built-in function library is most similar to PostgreSQL's, which leads to friction. Unfortunately, there are even functions with the same name but different signatures and/or return types in the two systems.

One of the many uses of DataFusion is to enhance (e.g. Apache DataFusion Comet) or replace (e.g. Sail) Apache Spark. To support these community requests and the use cases mentioned above, we have introduced a new datafusion-spark crate with Spark-compatible functions so the community can collaborate on building this shared resource. There are several hundred functions to implement, and we are looking for help to complete datafusion-spark Spark Compatible Functions.

To register all functions in datafusion-spark you can use:

    use datafusion::error::Result;
    use datafusion::prelude::*;

    // Example wrapper function (the name is illustrative)
    async fn register_spark_functions_example() -> Result<()> {
        // Create a new session context
        let mut ctx = SessionContext::new();
        // Register all Spark functions with the context
        datafusion_spark::register_all(&mut ctx)?;
        // Run a query. Note the `sha2` function is now available and
        // has Spark semantics
        let df = ctx.sql("SELECT sha2('The input String', 256)").await?;
        df.show().await?;
        Ok(())
    }

Or, to use an individual function, you can do:

use datafusion_expr::{col, lit};
use datafusion_spark::expr_fn::sha2;
// Create the expression `sha2(my_data, 256)`
let expr = sha2(col("my_data"), lit(256));
...

Thanks to shehabgamin for the initial PR #15168 and to many others for their help adding additional functions. Please consider helping complete datafusion-spark Spark Compatible Functions.

ORDER BY ALL SQL support

Inspired by DuckDB, DataFusion 48.0.0 adds support for ORDER BY ALL, which sorts the result by every column in the SELECT list, left to right. This makes it easy to order by all columns without listing them individually:

> set datafusion.sql_parser.dialect = 'DuckDB';
0 row(s) fetched.
> CREATE OR REPLACE TABLE addresses AS
    SELECT '123 Quack Blvd' AS address, 'DuckTown' AS city, '11111' AS zip
    UNION ALL
    SELECT '111 Duck Duck Goose Ln', 'DuckTown', '11111'
    UNION ALL
    SELECT '111 Duck Duck Goose Ln', 'Duck Town', '11111'
    UNION ALL
    SELECT '111 Duck Duck Goose Ln', 'Duck Town', '11111-0001';
0 row(s) fetched.
> SELECT * FROM addresses ORDER BY ALL;
+------------------------+-----------+------------+
| address                | city      | zip        |
+------------------------+-----------+------------+
| 111 Duck Duck Goose Ln | Duck Town | 11111      |
| 111 Duck Duck Goose Ln | Duck Town | 11111-0001 |
| 111 Duck Duck Goose Ln | DuckTown  | 11111      |
| 123 Quack Blvd         | DuckTown  | 11111      |
+------------------------+-----------+------------+
4 row(s) fetched.

Thanks to PokIsemaine for PR #15772.

FFI Support for AggregateUDF and WindowUDF

This improvement allows user-defined aggregate and window functions to be used across FFI boundaries, enabling shared libraries to pass functions back and forth. This feature unlocks:

  • Modules providing DataFusion-based FFI aggregates that can be reused in projects such as datafusion-python

  • Using the same aggregate and window functions across different DataFusion versions without recompiling.

This completes the work to add support for all UDF types to DataFusion's FFI bindings. Thanks to timsaucer for PRs #16261 and #14775.

Reduced size of Expr struct

The Expr struct is widely used across the DataFusion and downstream codebases. By boxing the WindowFunction variant, we reduced the size of Expr by almost 50%, from 272 bytes to 144 bytes. This reduction improved planning times by between 10% and 20% and reduced memory usage. Thanks to hendrikmakait for PR #16207.
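
For downstream code that pattern-matches on this variant, the adjustment is small. Here is a minimal sketch (the surrounding function is ours for illustration):

    use datafusion_expr::{expr::WindowFunction, Expr};

    fn inspect(expr: &Expr) {
        match expr {
            // As of 48.0.0 the variant holds a Box<WindowFunction>, so
            // dereference it to reach the inner struct as before
            Expr::WindowFunction(window_fun) => {
                let _inner: &WindowFunction = window_fun.as_ref();
                // ... use the inner WindowFunction exactly as in earlier releases ...
            }
            _ => {}
        }
    }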

Upgrade Guide and Changelog

Upgrading to 48.0.0 should be straightforward for most users, but do review the Upgrade Guide for DataFusion 48.0.0 for detailed steps and code changes. The upgrade guide covers the breaking changes mentioned above and provides code snippets to help with the transition. For a comprehensive list of all changes, please refer to the changelog for the 48.0.0 release. The changelog enumerates every merged PR in this release, including many smaller fixes and improvements that we couldn’t cover in this post.

Get Involved

Apache DataFusion is an open-source project, and we welcome involvement from anyone interested. Now is a great time to take 48.0.0 for a spin: try it out on your workloads, and let us know if you encounter any issues or have suggestions. You can report bugs or request features on our GitHub issue tracker, or better yet, submit a pull request. Join our community discussions – whether you have questions, want to share how you’re using DataFusion, or are looking to contribute, we’d love to hear from you. A list of open issues suitable for beginners is here and you can find how to reach us on the communication doc.

Happy querying!

Copyright 2025, The Apache Software Foundation, Licensed under the Apache License, Version 2.0.
Apache® and the Apache feather logo are trademarks of The Apache Software Foundation.