Apache DataFusion 47.0.0 Released
Posted on: Fri 11 July 2025 by PMC
We’re excited to announce the release of Apache DataFusion 47.0.0! This new version represents a significant milestone for the project, packing in a wide range of improvements and fixes. You can find the complete details in the full changelog. We’ll highlight the most important changes below and guide you through upgrading.
Note that DataFusion 47.0.0 was released in April 2025, but we are only now publishing the blog post due to limited bandwidth in the DataFusion community. We apologize for the delay and encourage you to come help us accelerate the next release and announcements by joining the community 🎣.
Breaking Changes
DataFusion 47.0.0 brings a few breaking changes that may require adjustments to your code as described in the Upgrade Guide. Here are some notable ones:
- Upgrades to arrow-rs and arrow-parquet 55.0.0 and object_store 0.12.0:
Several APIs changed in the underlying arrow, parquet and object_store libraries to use u64 instead of usize to better support WASM. This requires converting from usize to u64 occasionally as well as changes to ObjectStore implementations such as:
impl ObjectStore {
    ...
    // The range is now a u64 instead of usize
    async fn get_range(&self, location: &Path, range: Range<u64>) -> ObjectStoreResult<Bytes> {
        self.inner.get_range(location, range).await
    }
    ...
    // the lifetime is now 'static instead of '_ (meaning the captured closure can't contain references)
    // (this also applies to list_with_offset)
    fn list(&self, prefix: Option<&Path>) -> BoxStream<'static, ObjectStoreResult<ObjectMeta>> {
        self.inner.list(prefix)
    }
}
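The occasional usize to u64 conversions described above can be sketched with the standard library alone. The helper names `to_u64_range` and `to_usize_range` are hypothetical and are not part of object_store; this is only a minimal sketch, assuming the calling code holds `Range<usize>` values that now have to be passed to an API expecting `Range<u64>`.

```rust
use std::ops::Range;

/// Hypothetical helper (not part of object_store): widen a `Range<usize>`
/// to the `Range<u64>` that the updated APIs expect.
/// `usize` is at most 64 bits wide on the targets Rust supports today, so
/// these casts do not truncate.
fn to_u64_range(range: Range<usize>) -> Range<u64> {
    range.start as u64..range.end as u64
}

/// Hypothetical helper: narrow a `Range<u64>` back to `Range<usize>`, for
/// example to slice an in-memory buffer. Returns `None` when a bound does
/// not fit in `usize`, which can happen on 32-bit targets such as WASM.
fn to_usize_range(range: Range<u64>) -> Option<Range<usize>> {
    let start = usize::try_from(range.start).ok()?;
    let end = usize::try_from(range.end).ok()?;
    Some(start..end)
}
```

Going from u64 back to usize is the direction that can fail, so the sketch returns an `Option` there instead of casting silently.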
- DisplayFormatType::TreeRender:
Implementations of ExecutionPlan must also provide a description in the DisplayFormatType::TreeRender format to provide support for the new tree style explains. This can be the same as the existing DisplayFormatType::Default.
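A rough sketch of what handling the new variant can look like is shown below. To stay self-contained it declares a local stand-in enum with the same variant names as DataFusion's `DisplayFormatType`, and `MyExec` and `Rendered` are hypothetical names used only for illustration; in real code the match would live in the operator's `DisplayAs::fmt_as` implementation.

```rust
use std::fmt;

/// Local stand-in for DataFusion's `DisplayFormatType`, declared here only so
/// the sketch compiles without the DataFusion crate.
#[allow(dead_code)]
#[derive(Clone, Copy)]
enum DisplayFormatType {
    Default,
    Verbose,
    TreeRender,
}

/// Hypothetical operator used for illustration.
struct MyExec {
    limit: usize,
}

impl MyExec {
    /// Same shape as `DisplayAs::fmt_as` in DataFusion.
    fn fmt_as(&self, t: DisplayFormatType, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match t {
            DisplayFormatType::Default | DisplayFormatType::Verbose => {
                write!(f, "MyExec: limit={}", self.limit)
            }
            // The variant that must now be handled; this sketch simply
            // reuses the text produced for `Default`.
            DisplayFormatType::TreeRender => write!(f, "MyExec: limit={}", self.limit),
        }
    }
}

/// Small adapter so the sketch can be rendered with `to_string()`.
struct Rendered<'a>(&'a MyExec, DisplayFormatType);

impl fmt::Display for Rendered<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        self.0.fmt_as(self.1, f)
    }
}
```

Reusing the `Default` text is the smallest possible change; an operator can also emit a shorter, tree-specific description if that reads better in the rendered boxes.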
Performance Improvements
DataFusion 47.0.0 comes with numerous performance enhancements across the board. Here are some of the noteworthy optimizations in this release:
- FIRST_VALUE and LAST_VALUE: FIRST_VALUE and LAST_VALUE functions execute much faster for data with high cardinality, such as data with many groups or partitions. DataFusion 47.0.0 executes the following query in 7 seconds compared to 36 seconds in DataFusion 46.0.0: select id2, id4, first_value(v1 order by id2, id4) as r2 from '~/h2o_100m.parquet' group by id2, id4 (h2o.ai dataset). (PRs #15266 and #15542 by UBarney).
- MIN, MAX and AVG for Durations: DataFusion executes aggregate queries up to 2.5x faster when they include MIN, MAX and AVG on Duration columns. (PRs #15322 and #15748 by shruti2522).
- Short circuit evaluation for AND and OR: DataFusion now eagerly skips the evaluation of the right operand if the left is known to be false (AND) or true (OR) in certain cases. For complex predicates, such as those with many LIKE or CASE expressions, this optimization results in significant performance improvements (up to 100x in extreme cases). (PRs #15462 and #15694 by acking-you).
- TopK optimization for partially sorted input: Previous versions of DataFusion implemented early termination optimization (TopK) for fully sorted data. DataFusion 47.0.0 extends the optimization to partially sorted data, which is common in many real-world datasets, such as time-series data sorted by day but not within each day. (PR #15563 by geoffreyclaude).
- Disable re-validation of spilled files: DataFusion no longer does unnecessary re-validation of temporary spill files. The validation is unnecessary and expensive as the data is known to be valid when it was written out. (PR #15454 by zebsme).
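The idea behind short circuit evaluation for AND and OR can be illustrated with a small, self-contained sketch. This is not DataFusion's implementation, which works on Arrow arrays and also has to handle NULLs; the function names `and_short_circuit` and `or_short_circuit` are hypothetical, and plain `Vec<bool>` batches stand in for boolean arrays.

```rust
/// Evaluate `left AND right` over a batch of rows. The right operand is
/// passed as a closure so that it is only computed when it can still
/// affect the result.
fn and_short_circuit<F>(left: &[bool], right: F) -> Vec<bool>
where
    F: FnOnce() -> Vec<bool>,
{
    // Every row is already false: the result is all false, skip `right`.
    if left.iter().all(|&l| !l) {
        return vec![false; left.len()];
    }
    let right = right();
    assert_eq!(left.len(), right.len(), "operands must have the same length");
    left.iter().zip(&right).map(|(&l, &r)| l && r).collect()
}

/// Evaluate `left OR right` with the mirrored rule: if every row of the
/// left operand is already true, the right operand is never computed.
fn or_short_circuit<F>(left: &[bool], right: F) -> Vec<bool>
where
    F: FnOnce() -> Vec<bool>,
{
    if left.iter().all(|&l| l) {
        return vec![true; left.len()];
    }
    let right = right();
    assert_eq!(left.len(), right.len(), "operands must have the same length");
    left.iter().zip(&right).map(|(&l, &r)| l || r).collect()
}
```

The saving comes entirely from not calling the closure, which is why the effect is largest when the right operand is expensive, such as a chain of LIKE or CASE expressions.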
Highlighted New Features
Tree style explains
In previous releases, the EXPLAIN statement produced a formatted table that is succinct and contains important details for implementers, but was often hard to read, especially for queries with joins or unions that have multiple children.
DataFusion 47.0.0 includes the new EXPLAIN FORMAT TREE (default in datafusion-cli) rendered in a visual tree style that is much easier to quickly understand.
Example of the new explain output:
> explain select * from t1 inner join t2 on t1.ti=t2.ti;
+---------------+------------------------------------------------------------+
| plan_type     | plan                                                       |
+---------------+------------------------------------------------------------+
| physical_plan | ┌───────────────────────────┐                              |
|               | │    CoalesceBatchesExec    │                              |
|               | │    --------------------   │                              |
|               | │     target_batch_size:    │                              |
|               | │            8192           │                              |
|               | └─────────────┬─────────────┘                              |
|               | ┌─────────────┴─────────────┐                              |
|               | │        HashJoinExec       │                              |
|               | │    --------------------   ├──────────────┐               |
|               | │       on: (ti = ti)       │              │               |
|               | └─────────────┬─────────────┘              │               |
|               | ┌─────────────┴─────────────┐┌─────────────┴─────────────┐ |
|               | │       DataSourceExec      ││       DataSourceExec      │ |
|               | │    --------------------   ││    --------------------   │ |
|               | │         bytes: 112        ││         bytes: 112        │ |
|               | │       format: memory      ││       format: memory      │ |
|               | │          rows: 1          ││          rows: 1          │ |
|               | └───────────────────────────┘└───────────────────────────┘ |
|               |                                                            |
+---------------+------------------------------------------------------------+
Example of the EXPLAIN FORMAT INDENT output for the same query:
> explain format indent select * from t1 inner join t2 on t1.ti=t2.ti;
+---------------+----------------------------------------------------------------------+
| plan_type     | plan                                                                 |
+---------------+----------------------------------------------------------------------+
| logical_plan  | Inner Join: t1.ti = t2.ti                                            |
|               |   TableScan: t1 projection=[ti]                                      |
|               |   TableScan: t2 projection=[ti]                                      |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192                          |
|               |   HashJoinExec: mode=CollectLeft, join_type=Inner, on=[(ti@0, ti@0)] |
|               |     DataSourceExec: partitions=1, partition_sizes=[1]                |
|               |     DataSourceExec: partitions=1, partition_sizes=[1]                |
|               |                                                                      |
+---------------+----------------------------------------------------------------------+
2 row(s) fetched.
Thanks to irenjj for the initial work in PR #14677 and many others for completing the followup epic
SQL VARCHAR defaults to Utf8View
In previous releases, when a varchar column was created in SQL, it was mapped to the Utf8 Arrow data type. In this release, SQL varchar columns are mapped to the Utf8View Arrow data type by default, which is a more efficient representation of UTF-8 strings in Arrow.
> create table foo(x varchar);
0 row(s) fetched.
> describe foo;
+-------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-------------+-----------+-------------+
| x           | Utf8View  | YES         |
+-------------+-----------+-------------+
Previous versions of DataFusion used Utf8View when reading Parquet files, and it is faster in most cases.
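Much of the efficiency comes from the view layout defined in the Arrow columnar format: each string is described by a fixed 16-byte view, strings of up to 12 bytes are stored inline in that view, and longer strings keep a 4-byte prefix next to a buffer index and offset. The sketch below encodes a single view by hand to show the layout; `encode_view` is a hypothetical helper written for this post, not an arrow-rs API.

```rust
/// Hypothetical helper: encode one string as a 16-byte view following the
/// Arrow variable-size binary view layout. `buffer_index` and `offset` say
/// where a long string lives in the data buffers and are ignored for
/// strings that fit inline.
fn encode_view(s: &str, buffer_index: u32, offset: u32) -> [u8; 16] {
    let bytes = s.as_bytes();
    let mut view = [0u8; 16];
    // Bytes 0..4: the length as a little-endian 32-bit integer. The format
    // uses a signed 32-bit length; this sketch assumes it fits.
    view[0..4].copy_from_slice(&(bytes.len() as u32).to_le_bytes());
    if bytes.len() <= 12 {
        // Short string: the data is stored inline, padded with zeros.
        view[4..4 + bytes.len()].copy_from_slice(bytes);
    } else {
        // Long string: 4-byte prefix, then buffer index and offset.
        view[4..8].copy_from_slice(&bytes[..4]);
        view[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        view[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    view
}
```

Because the length and a prefix sit in the view itself, many comparisons can be decided without following the pointer into the data buffer.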
Thanks to zhuqi-lucas for PR #15104
Context propagation in spawned tasks (for tracing, logging, etc.)
This release introduces an API for propagating user-defined context (such as tracing spans, logging, or metrics) across thread boundaries without depending on any specific instrumentation library. You can use the JoinSetTracer API to instrument DataFusion plans with your own tracing or logging libraries, or use pre-integrated community crates such as the datafusion-tracing crate.
Previously, tasks spawned on new threads — such as those performing repartitioning or Parquet file reads — could lose thread-local context, which is often used in instrumentation libraries. A full example of how to use this new API is available in the DataFusion examples, and a simple example is shown below.
// Imports assumed to match the tracing example in the DataFusion repository
use std::any::Any;

use datafusion::common::runtime::{set_join_set_tracer, JoinSetTracer};
use futures::future::BoxFuture;
use futures::FutureExt;
use tracing::{Instrument, Span};

/// Models a simple tracer. Calling `in_current_span()` and `in_scope()` saves thread-specific state
/// for the current span and must be called at the start of each new task or thread.
struct SpanTracer;

/// Implements the `JoinSetTracer` trait so we can inject instrumentation
/// for both async futures and blocking closures.
impl JoinSetTracer for SpanTracer {
    /// Instruments a boxed future to run in the current span. The future's
    /// return type is erased to `Box<dyn Any + Send>`, which we simply
    /// run inside the `Span::current()` context.
    fn trace_future(
        &self,
        fut: BoxFuture<'static, Box<dyn Any + Send>>,
    ) -> BoxFuture<'static, Box<dyn Any + Send>> {
        // Ensures any thread-local context is set in this future
        fut.in_current_span().boxed()
    }

    /// Instruments a boxed blocking closure by running it inside the
    /// `Span::current()` context.
    fn trace_block(
        &self,
        f: Box<dyn FnOnce() -> Box<dyn Any + Send> + Send>,
    ) -> Box<dyn FnOnce() -> Box<dyn Any + Send> + Send> {
        let span = Span::current();
        // Ensures any thread-local context is set for this closure
        Box::new(move || span.in_scope(f))
    }
}

...
set_join_set_tracer(&SpanTracer).expect("Failed to set tracer");
...
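The underlying problem, thread-local context that does not follow work onto a newly spawned thread, can be reproduced with the standard library alone. The sketch below is independent of DataFusion and of any tracing crate: `CURRENT_SPAN`, `with_captured_span` and `span_seen_by_child` are hypothetical names, and the wrapper only mimics the capture-then-reinstall pattern that `trace_block` uses above.

```rust
use std::cell::RefCell;
use std::thread;

thread_local! {
    /// Hypothetical thread-local "current span" name, standing in for the
    /// state an instrumentation library keeps per thread.
    static CURRENT_SPAN: RefCell<Option<String>> = RefCell::new(None);
}

fn current_span() -> Option<String> {
    CURRENT_SPAN.with(|s| s.borrow().clone())
}

fn set_span(name: Option<String>) {
    CURRENT_SPAN.with(|s| *s.borrow_mut() = name);
}

/// Capture the caller's span now and re-install it inside the closure,
/// so the value is available on whichever thread ends up running it.
fn with_captured_span<T, F>(f: F) -> impl FnOnce() -> T + Send + 'static
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    let span = current_span();
    move || {
        set_span(span);
        f()
    }
}

/// Reads the span from a freshly spawned thread, with or without
/// propagation from the spawning thread.
fn span_seen_by_child(propagate: bool) -> Option<String> {
    set_span(Some("query-1".to_string()));
    let handle = if propagate {
        thread::spawn(with_captured_span(current_span))
    } else {
        thread::spawn(current_span)
    };
    handle.join().expect("child thread panicked")
}
```

Calling `span_seen_by_child(false)` yields `None` because the new thread starts with empty thread-local state, while `span_seen_by_child(true)` yields the parent's span name.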
Thanks to geoffreyclaude for PR #14914
Upgrade Guide and Changelog
Upgrading to 47.0.0 should be straightforward for most users, but do review the Upgrade Guide for DataFusion 47.0.0 for detailed steps and code changes. The upgrade guide covers the breaking changes mentioned above and provides code snippets to help with the transition. For a comprehensive list of all changes, please refer to the changelog for 47.0.0. The changelog enumerates every merged PR in this release, including many smaller fixes and improvements that we couldn’t cover in this post.
Get Involved
Apache DataFusion is an open-source project, and we welcome involvement from anyone interested. Now is a great time to take 47.0.0 for a spin: try it out on your workloads, and let us know if you encounter any issues or have suggestions. You can report bugs or request features on our GitHub issue tracker, or better yet, submit a pull request. Join our community discussions – whether you have questions, want to share how you’re using DataFusion, or are looking to contribute, we’d love to hear from you. A list of open issues suitable for beginners is here and you can find how to reach us on the communication doc.
Happy querying!
Copyright 2025, The Apache Software Foundation, Licensed under the Apache License, Version 2.0.
Apache® and the Apache feather logo are trademarks of The Apache Software Foundation.