Apache DataFusion 48.0.0 Released
Posted on: Wed 16 July 2025 by PMC
We’re excited to announce the release of Apache DataFusion 48.0.0! As always, this version packs in a wide range of improvements and fixes. You can find the complete details in the full changelog. We’ll highlight the most important changes below and guide you through upgrading.
Breaking Changes
DataFusion 48.0.0 brings a few breaking changes that may require adjustments to your code as described in the Upgrade Guide. Here are the most notable ones:
-
datafusion.execution.collect_statistics
defaults totrue
: In DataFusion 48.0.0, the default value of this configuration setting is now true, and DataFusion will collect and store statistics when a table is first created viaCREATE EXTERNAL TABLE
or one of theDataFrame::register_*
APIs. -
Expr::Literal
has optional metadata: TheExpr::Literal
variant now includes optional metadata, which allows for carrying through Arrow field metadata to support extension types and other uses. This means code such as
match expr {
...
Expr::Literal(scalar) => ...
...
}
Should be updated to:
match expr {
...
Expr::Literal(scalar, _metadata) => ...
...
}
-
Expr::WindowFunction
is now Boxed:Expr::WindowFunction
is now aBox<WindowFunction>
instead of aWindowFunction
directly. This change was made to reduce the size ofExpr
and improve performance when planning queries (see details on #16207). -
UDFs changed to use
FieldRef
instead ofDataType
: To support metadata handling and prepare for extension types, UDF traits now use FieldRef rather than aDataType
and nullability.FieldRef
contains the type and nullability, and additionally allows access to metadata fields, which can be used for extension types. -
Physical Expression return
Field
: Similarly to UDFs, in order to prepare for extension type support the PhysicalExpr trait has been changed to return Field rather thanDataType
. To upgrade structs which implementPhysicalExpr
you need to implement thereturn_field
function. -
FileFormat::supports_filters_pushdown
was replaced withFileSource::try_pushdown_filters
to support upcoming work to push down dynamic filters and physical filter pushdown. -
ParquetExec
,AvroExec
,CsvExec
,JsonExec
removed:ParquetExec
,AvroExec
,CsvExec
, andJsonExec
were deprecated in DataFusion 46 and are removed in DataFusion 48.
Performance Improvements
DataFusion 48.0.0 comes with some noteworthy performance enhancements:
-
Fewer unnecessary projections: DataFusion now removes additional unnecessary
Projection
s in queries. (PRs #15787, #15761, and #15746 by xudong963). -
Accelerated string functions: The
ascii
function was optimized to significantly improve its performance (PR #16087 by tlm365). Thecharacter_length
function was optimized resulting in up to 3x performance improvement (PR #15931 by Dandandan) -
Constant aggregate window expressions: For unbounded aggregate window functions the result is the same for all rows within a partition. DataFusion 48.0.0 avoids unnecessary computation for such queries, resulting in improved performance by 5.6x (PR #16234 by suibianwanwank)
Highlighted New Features
New datafusion-spark
crate
The DataFusion community has requested Apache Spark-compatible functions for many years, but the current builtin function library is most similar to Postgresql, which leads to friction. Unfortunately, there are even functions with the same name but different signatures and/or return types in the two systems.
One of the many uses of DataFusion is to enhance (e.g. Apache DataFusion Comet) or replace (e.g. Sail) Apache Spark. To support the community requests and the use cases mentioned above, we have introduced a new datafusion-spark crate for DataFusion with spark-compatible functions so the community can collaborate to build this shared resource. There are several hundred functions to implement, and we are looking for help to complete datafusion-spark Spark Compatible Functions.
To register all functions in datafusion-spark
you can use:
// Create a new session context
let mut ctx = SessionContext::new();
// register all spark functions with the context
datafusion_spark::register_all(&mut ctx)?;
// run a query. Note the `sha2` function is now available which
// has Spark semantics
let df = ctx.sql("SELECT sha2('The input String', 256)").await?;
...
}
Or, to use an individual function, you can do:
use datafusion_expr::{col, lit};
use datafusion_spark::expr_fn::sha2;
// Create the expression `sha2(my_data, 256)`
let expr = sha2(col("my_data"), lit(256));
...
Thanks to shehabgamin for the initial PR #15168 and many others for their help adding additional functions. Please consider helping complete datafusion-spark Spark Compatible Functions.
ORDER BY ALL sql
support
Inspired by DuckDB, DataFusion 48.0.0 adds support for ORDER BY ALL
. This allows for easy ordering of all columns in a query:
> set datafusion.sql_parser.dialect = 'DuckDB';
0 row(s) fetched.
> CREATE OR REPLACE TABLE addresses AS
SELECT '123 Quack Blvd' AS address, 'DuckTown' AS city, '11111' AS zip
UNION ALL
SELECT '111 Duck Duck Goose Ln', 'DuckTown', '11111'
UNION ALL
SELECT '111 Duck Duck Goose Ln', 'Duck Town', '11111'
UNION ALL
SELECT '111 Duck Duck Goose Ln', 'Duck Town', '11111-0001';
0 row(s) fetched.
> SELECT * FROM addresses ORDER BY ALL;
+------------------------+-----------+------------+
| address | city | zip |
+------------------------+-----------+------------+
| 111 Duck Duck Goose Ln | Duck Town | 11111 |
| 111 Duck Duck Goose Ln | Duck Town | 11111-0001 |
| 111 Duck Duck Goose Ln | DuckTown | 11111 |
| 123 Quack Blvd | DuckTown | 11111 |
+------------------------+-----------+------------+
4 row(s) fetched.
Thanks to PokIsemaine for PR #15772
FFI Support for AggregateUDF
and WindowUDF
This improvement allows for using user defined aggregate and user defined window functions across FFI boundaries, which enables shared libraries to pass functions back and forth. This feature unlocks:
-
Modules to provide DataFusion based FFI aggregates that can be reused in projects such as datafusion-python
-
Using the same aggregate and window functions without recompiling with different DataFusion versions.
This completes the work to add support for all UDF types to DataFusion's FFI bindings. Thanks to timsaucer for PRs #16261 and #14775.
Reduced size of Expr
struct
The Expr struct is widely used across the DataFusion and downstream codebases. By Box
ing WindowFunction
s, we reduced the size of Expr
by almost 50%, from 272
to 144
bytes. This reduction improved planning times between 10% and 20% and reduced memory usage. Thanks to hendrikmakait for
PR #16207
Upgrade Guide and Changelog
Upgrading to 48.0.0 should be straightforward for most users, but do review the Upgrade Guide for DataFusion 48.0.0 for detailed steps and code changes. The upgrade guide covers the breaking changes mentioned above and provides code snippets to help with the transition. For a comprehensive list of all changes, please refer to the changelog for the 48.0.0 release. The changelog enumerates every merged PR in this release, including many smaller fixes and improvements that we couldn’t cover in this post.
Get Involved
Apache DataFusion is an open-source project, and we welcome involvement from anyone interested. Now is a great time to take 48.0.0 for a spin: try it out on your workloads, and let us know if you encounter any issues or have suggestions. You can report bugs or request features on our GitHub issue tracker, or better yet, submit a pull request. Join our community discussions – whether you have questions, want to share how you’re using DataFusion, or are looking to contribute, we’d love to hear from you. A list of open issues suitable for beginners is here and you can find how to reach us on the communication doc.
Happy querying!
Copyright 2025, The Apache Software Foundation, Licensed under the Apache License, Version 2.0.
Apache® and the Apache feather logo are trademarks of The Apache Software Foundation.