Apache DataFusion 47.0.0 Released
Posted on: Fri 11 July 2025 by PMC
Weβre excited to announce the release of Apache DataFusion 47.0.0! This new version represents a significant milestone for the project, packing in a wide range of improvements and fixes. You can find the complete details in the full changelog. Weβll highlight the most important changes below and guide you through upgrading.
Note that DataFusion 47.0.0 was released in April 2025, but we are only now publishing the blog post due to limited bandwidth in the DataFusion community. We apologize for the delay and encourage you to come help us accelerate the next release and announcements by joining the community π£.
Breaking ChangesΒΆ
DataFusion 47.0.0 brings a few breaking changes that may require adjustments to your code as described in the Upgrade Guide. Here are some notable ones:
- Upgrades to arrow-rs and arrow-parquet 55.0.0 and object_store 0.12.0:
Several APIs changed in the underlying
arrow,parquetandobject_storelibraries to use au64instead of usize to better support WASM. This requires converting fromusizetou64occasionally as well as changes to ObjectStore implementations such as
impl ObjectStore {
...
// The range is now a u64 instead of usize
async fn get_range(&self, location: &Path, range: Range<u64>) -> ObjectStoreResult<Bytes> {
self.inner.get_range(location, range).await
}
...
// the lifetime is now 'static instead of '_ (meaning the captured closure can't contain references)
// (this also applies to list_with_offset)
fn list(&self, prefix: Option<&Path>) -> BoxStream<'static, ObjectStoreResult<ObjectMeta>> {
self.inner.list(prefix)
}
}
- DisplayFormatType::TreeRender:
Implementations of
ExecutionPlanmust also provide a description in theDisplayFormatType::TreeRenderformat to provide support for the new tree style explains. This can be the same as the existingDisplayFormatType::Default.
Performance ImprovementsΒΆ
DataFusion 47.0.0 comes with numerous performance enhancements across the board. Here are some of the noteworthy optimizations in this release:
-
FIRST_VALUEandLAST_VALUE:FIRST_VALUEandLAST_VALUEfunctions execute much faster for data with high cardinality such as those with many groups or partitions. DataFusion 47.0.0 executes the following in 7 seconds compared to 36 seconds in DataFusion 46.0.0:select id2, id4, first_value(v1 order by id2, id4) as r2 from '~/h2o_100m.parquet' group by id2, id4(h2o.ai dataset). (PR's #15266 and #15542 by UBarney). -
MIN,MAXandAVGfor Durations: DataFusion executes aggregate queries up to 2.5x faster when they includeMIN,MAXandAVGonDurationcolumns. (PRs #15322 and #15748 by shruti2522). -
Short circuit evaluation for
ANDandOR: DataFusion now eagerly skips the evaluation of the right operand if the left is known to be false (AND) or true (OR) in certain cases. For complex predicates, such as those with manyLIKEorCASEexpressions, this optimization results in significant performance improvements (up to 100x in extreme cases). (PRs #15462 and #15694 by acking-you). -
TopK optimization for partially sorted input: Previous versions of DataFusion implemented early termination optimization (TopK) for fully sorted data. DataFusion 47.0.0 extends the optimization for partially sorted data, which is common in many real-world datasets, such as time-series data sorted by day but not within each day. (PR #15563 by geoffreyclaude).
-
Disable re-validation of spilled files: DataFusion no longer does unnecessary re-validation of temporary spill files. The validation is unnecessary and expensive as the data is known to be valid when it was written out (PR #15454 by zebsme).
Highlighted New FeaturesΒΆ
Tree style explainsΒΆ
In previous releases the EXPLAIN statement results in a formatted table which is succinct and contains important details for implementers, but was often hard to read especially with queries that included joins or unions having multiple children.
DataFusion 47.0.0 includes the new EXPLAIN FORMAT TREE (default in
datafusion-cli) rendered in a visual tree style that is much easier to quickly
understand.
Example of the new explain output:
> explain select * from t1 inner join t2 on t1.ti=t2.ti;
+---------------+------------------------------------------------------------+
| plan_type | plan |
+---------------+------------------------------------------------------------+
| physical_plan | βββββββββββββββββββββββββββββ |
| | β CoalesceBatchesExec β |
| | β -------------------- β |
| | β target_batch_size: β |
| | β 8192 β |
| | βββββββββββββββ¬ββββββββββββββ |
| | βββββββββββββββ΄ββββββββββββββ |
| | β HashJoinExec β |
| | β -------------------- ββββββββββββββββ |
| | β on: (ti = ti) β β |
| | βββββββββββββββ¬ββββββββββββββ β |
| | βββββββββββββββ΄βββββββββββββββββββββββββββββ΄ββββββββββββββ |
| | β DataSourceExec ββ DataSourceExec β |
| | β -------------------- ββ -------------------- β |
| | β bytes: 112 ββ bytes: 112 β |
| | β format: memory ββ format: memory β |
| | β rows: 1 ββ rows: 1 β |
| | ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ |
| | |
+---------------+------------------------------------------------------------+
Example of the EXPLAIN FORMAT INDENT output for the same query
> explain format indent select * from t1 inner join t2 on t1.ti=t2.ti;
+---------------+----------------------------------------------------------------------+
| plan_type | plan |
+---------------+----------------------------------------------------------------------+
| logical_plan | Inner Join: t1.ti = t2.ti |
| | TableScan: t1 projection=[ti] |
| | TableScan: t2 projection=[ti] |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192 |
| | HashJoinExec: mode=CollectLeft, join_type=Inner, on=[(ti@0, ti@0)] |
| | DataSourceExec: partitions=1, partition_sizes=[1] |
| | DataSourceExec: partitions=1, partition_sizes=[1] |
| | |
+---------------+----------------------------------------------------------------------+
2 row(s) fetched.
Thanks to irenjj for the initial work in PR #14677 and many others for completing the followup epic
SQL VARCHAR defaults to Utf8ViewΒΆ
In previous releases when a column was created in SQL the column would be mapped to the Utf8 Arrow data type. In this release
the SQL varchar columns will be mapped to the Utf8View arrow data type by default, which is a more efficient representation of UTF-8 strings in Arrow.
create table foo(x varchar);
0 row(s) fetched.
> describe foo;
+-------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-------------+-----------+-------------+
| x | Utf8View | YES |
+-------------+-----------+-------------+
Previous versions of DataFusion used Utf8View when reading parquet files and it is faster in most cases.
Thanks to zhuqi-lucas for PR #15104
Context propagation in spawned tasks (for tracing, logging, etc.)ΒΆ
This release introduces an API for propagating user-defined context (such as tracing spans, logging, or metrics) across thread boundaries without depending on any specific instrumentation library. You can use the JoinSetTracer API to instrument DataFusion plans with your own tracing or logging libraries, or use pre-integrated community crates such as the datafusion-tracing crate.
Previously, tasks spawned on new threads β such as those performing repartitioning or Parquet file reads β could lose thread-local context, which is often used in instrumentation libraries. A full example of how to use this new API is available in the DataFusion examples, and a simple example is shown below.
/// Models a simple tracer. Calling `in_current_span()` and `in_scope()` saves thread-specific state
/// for the current span and must be called at the start of each new task or thread.
struct SpanTracer;
/// Implements the `JoinSetTracer` trait so we can inject instrumentation
/// for both async futures and blocking closures.
impl JoinSetTracer for SpanTracer {
/// Instruments a boxed future to run in the current span. The future's
/// return type is erased to `Box<dyn Any + Send>`, which we simply
/// run inside the `Span::current()` context.
fn trace_future(
&self,
fut: BoxFuture<'static, Box<dyn Any + Send>>,
) -> BoxFuture<'static, Box<dyn Any + Send>> {
// Ensures any thread-local context is set in this future
fut.in_current_span().boxed()
}
/// Instruments a boxed blocking closure by running it inside the
/// `Span::current()` context.
fn trace_block(
&self,
f: Box<dyn FnOnce() -> Box<dyn Any + Send> + Send>,
) -> Box<dyn FnOnce() -> Box<dyn Any + Send> + Send> {
let span = Span::current();
// Ensures any thread-local context is set for this closure
Box::new(move || span.in_scope(f))
}
}
...
set_join_set_tracer(&SpanTracer).expect("Failed to set tracer");
...
Thanks to geoffreyclaude for PR #14914
Upgrade Guide and ChangelogΒΆ
Upgrading to 47.0.0 should be straightforward for most users, but do review the Upgrade Guide for DataFusion 47.0.0 for detailed steps and code changes. The upgrade guide covers the breaking changes mentioned above and provides code snippets to help with the transition. For a comprehensive list of all changes, please refer to the changelog for 47.0.0. The changelog enumerates every merged PR in this release, including many smaller fixes and improvements that we couldnβt cover in this post.
Get InvolvedΒΆ
Apache DataFusion is an open-source project, and we welcome involvement from anyone interested. Now is a great time to take 47.0.0 for a spin: try it out on your workloads, and let us know if you encounter any issues or have suggestions. You can report bugs or request features on our GitHub issue tracker, or better yet, submit a pull request. Join our community discussions β whether you have questions, want to share how youβre using DataFusion, or are looking to contribute, weβd love to hear from you. A list of open issues suitable for beginners is here and you can find how to reach us on the communication doc.
Happy querying!
Comments
We use Giscus for comments, powered by GitHub Discussions. To respect your privacy, Giscus and comments will load only if you click "Show Comments"