Apache DataFusion 47.0.0 Released

Posted on: Fri 11 July 2025 by PMC

We’re excited to announce the release of Apache DataFusion 47.0.0! This new version represents a significant milestone for the project, packing in a wide range of improvements and fixes. You can find the complete details in the full changelog. We’ll highlight the most important changes below and guide you through upgrading.

Note that DataFusion 47.0.0 was released in April 2025, but we are only now publishing the blog post due to limited bandwidth in the DataFusion community. We apologize for the delay and encourage you to come help us accelerate the next release and announcements by joining the community 🎣.

Breaking Changes

DataFusion 47.0.0 brings a few breaking changes that may require adjustments to your code as described in the Upgrade Guide. Here are some notable ones:

impl ObjectStore {
    ...

    // The range is now a u64 instead of usize
    async fn get_range(&self, location: &Path, range: Range<u64>) -> ObjectStoreResult<Bytes> {
        self.inner.get_range(location, range).await
    }

    ...

    // the lifetime is now 'static instead of '_ (meaning the captured closure can't contain references)
    // (this also applies to list_with_offset)
    fn list(&self, prefix: Option<&Path>) -> BoxStream<'static, ObjectStoreResult<ObjectMeta>> {
        self.inner.list(prefix)
    }
}
  • DisplayFormatType::TreeRender: Implementations of ExecutionPlan must also provide a description in the DisplayFormatType::TreeRender format to provide support for the new tree style explains. This can be the same as the existing DisplayFormatType::Default.

Performance Improvements

DataFusion 47.0.0 comes with numerous performance enhancements across the board. Here are some of the noteworthy optimizations in this release:

  • FIRST_VALUE and LAST_VALUE: FIRST_VALUE and LAST_VALUE functions execute much faster for data with high cardinality such as those with many groups or partitions. DataFusion 47.0.0 executes the following in 7 seconds compared to 36 seconds in DataFusion 46.0.0: select id2, id4, first_value(v1 order by id2, id4) as r2 from '~/h2o_100m.parquet' group by id2, id4 (h2o.ai dataset). (PR's #15266 and #15542 by UBarney).

  • MIN, MAX and AVG for Durations: DataFusion executes aggregate queries up to 2.5x faster when they include MIN, MAX and AVG on Duration columns. (PRs #15322 and #15748 by shruti2522).

  • Short circuit evaluation for AND and OR: DataFusion now eagerly skips the evaluation of the right operand if the left is known to be false (AND) or true (OR) in certain cases. For complex predicates, such as those with many LIKE or CASE expressions, this optimization results in significant performance improvements (up to 100x in extreme cases). (PRs #15462 and #15694 by acking-you).

  • TopK optimization for partially sorted input: Previous versions of DataFusion implemented early termination optimization (TopK) for fully sorted data. DataFusion 47.0.0 extends the optimization for partially sorted data, which is common in many real-world datasets, such as time-series data sorted by day but not within each day. (PR #15563 by geoffreyclaude).

  • Disable re-validation of spilled files: DataFusion no longer does unnecessary re-validation of temporary spill files. The validation is unnecessary and expensive as the data is known to be valid when it was written out (PR #15454 by zebsme).

Highlighted New Features

Tree style explains

In previous releases the EXPLAIN statement results in a formatted table which is succinct and contains important details for implementers, but was often hard to read especially with queries that included joins or unions having multiple children.

DataFusion 47.0.0 includes the new EXPLAIN FORMAT TREE (default in datafusion-cli) rendered in a visual tree style that is much easier to quickly understand.

Example of the new explain output:

> explain select * from t1 inner join t2 on t1.ti=t2.ti;
+---------------+------------------------------------------------------------+
| plan_type     | plan                                                       |
+---------------+------------------------------------------------------------+
| physical_plan | ┌───────────────────────────┐                              |
|               | │    CoalesceBatchesExec    │                              |
|               | │    --------------------   │                              |
|               | │     target_batch_size:    │                              |
|               | │            8192           │                              |
|               | └─────────────┬─────────────┘                              |
|               | ┌─────────────┴─────────────┐                              |
|               | │        HashJoinExec       │                              |
|               | │    --------------------   ├──────────────┐               |
|               | │       on: (ti = ti)       │              │               |
|               | └─────────────┬─────────────┘              │               |
|               | ┌─────────────┴─────────────┐┌─────────────┴─────────────┐ |
|               | │       DataSourceExec      ││       DataSourceExec      │ |
|               | │    --------------------   ││    --------------------   │ |
|               | │         bytes: 112        ││         bytes: 112        │ |
|               | │       format: memory      ││       format: memory      │ |
|               | │          rows: 1          ││          rows: 1          │ |
|               | └───────────────────────────┘└───────────────────────────┘ |
|               |                                                            |
+---------------+------------------------------------------------------------+

Example of the EXPLAIN FORMAT INDENT output for the same query

> explain format indent select * from t1 inner join t2 on t1.ti=t2.ti;
+---------------+----------------------------------------------------------------------+
| plan_type     | plan                                                                 |
+---------------+----------------------------------------------------------------------+
| logical_plan  | Inner Join: t1.ti = t2.ti                                            |
|               |   TableScan: t1 projection=[ti]                                      |
|               |   TableScan: t2 projection=[ti]                                      |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192                          |
|               |   HashJoinExec: mode=CollectLeft, join_type=Inner, on=[(ti@0, ti@0)] |
|               |     DataSourceExec: partitions=1, partition_sizes=[1]                |
|               |     DataSourceExec: partitions=1, partition_sizes=[1]                |
|               |                                                                      |
+---------------+----------------------------------------------------------------------+
2 row(s) fetched.

Thanks to irenjj for the initial work in PR #14677 and many others for completing the followup epic

SQL VARCHAR defaults to Utf8View

In previous releases when a column was created in SQL the column would be mapped to the Utf8 Arrow data type. In this release the SQL varchar columns will be mapped to the Utf8View arrow data type by default, which is a more efficient representation of UTF-8 strings in Arrow.

create table foo(x varchar);
0 row(s) fetched.

> describe foo;
+-------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-------------+-----------+-------------+
| x           | Utf8View  | YES         |
+-------------+-----------+-------------+

Previous versions of DataFusion used Utf8View when reading parquet files and it is faster in most cases.

Thanks to zhuqi-lucas for PR #15104

Context propagation in spawned tasks (for tracing, logging, etc.)

This release introduces an API for propagating user-defined context (such as tracing spans, logging, or metrics) across thread boundaries without depending on any specific instrumentation library. You can use the JoinSetTracer API to instrument DataFusion plans with your own tracing or logging libraries, or use pre-integrated community crates such as the datafusion-tracing crate.

DataFusion telemetry project logo

Previously, tasks spawned on new threads — such as those performing repartitioning or Parquet file reads — could lose thread-local context, which is often used in instrumentation libraries. A full example of how to use this new API is available in the DataFusion examples, and a simple example is shown below.

/// Models a simple tracer. Calling `in_current_span()` and `in_scope()` saves thread-specific state
/// for the current span and must be called at the start of each new task or thread.
struct SpanTracer;

/// Implements the `JoinSetTracer` trait so we can inject instrumentation
/// for both async futures and blocking closures.
impl JoinSetTracer for SpanTracer {
    /// Instruments a boxed future to run in the current span. The future's
    /// return type is erased to `Box<dyn Any + Send>`, which we simply
    /// run inside the `Span::current()` context.
    fn trace_future(
        &self,
        fut: BoxFuture<'static, Box<dyn Any + Send>>,
    ) -> BoxFuture<'static, Box<dyn Any + Send>> {
        // Ensures any thread-local context is set in this future 
        fut.in_current_span().boxed()
    }

    /// Instruments a boxed blocking closure by running it inside the
    /// `Span::current()` context.
    fn trace_block(
        &self,
        f: Box<dyn FnOnce() -> Box<dyn Any + Send> + Send>,
    ) -> Box<dyn FnOnce() -> Box<dyn Any + Send> + Send> {
        let span = Span::current();
        // Ensures any thread-local context is set for this closure
        Box::new(move || span.in_scope(f))
    }
}

...
set_join_set_tracer(&SpanTracer).expect("Failed to set tracer");
...

Thanks to geoffreyclaude for PR #14914

Upgrade Guide and Changelog

Upgrading to 47.0.0 should be straightforward for most users, but do review the Upgrade Guide for DataFusion 47.0.0 for detailed steps and code changes. The upgrade guide covers the breaking changes mentioned above and provides code snippets to help with the transition. For a comprehensive list of all changes, please refer to the changelog for 47.0.0. The changelog enumerates every merged PR in this release, including many smaller fixes and improvements that we couldn’t cover in this post.

Get Involved

Apache DataFusion is an open-source project, and we welcome involvement from anyone interested. Now is a great time to take 47.0.0 for a spin: try it out on your workloads, and let us know if you encounter any issues or have suggestions. You can report bugs or request features on our GitHub issue tracker, or better yet, submit a pull request. Join our community discussions – whether you have questions, want to share how you’re using DataFusion, or are looking to contribute, we’d love to hear from you. A list of open issues suitable for beginners is here and you can find how to reach us on the communication doc.

Happy querying!

Copyright 2025, The Apache Software Foundation, Licensed under the Apache License, Version 2.0.
Apache® and the Apache feather logo are trademarks of The Apache Software Foundation.