Upgrade Guides#

DataFusion 50.0.0#

ListingTable automatically detects Hive Partitioned tables#

DataFusion 50.0.0 automatically infers Hive partitions when using the ListingTableFactory and CREATE EXTERNAL TABLE. Previously, when creating a ListingTable, datasets that use Hive partitioning (e.g. /table_root/column1=value1/column2=value2/data.parquet) would not have the Hive columns reflected in the table’s schema or data. The previous behavior can be restored by setting the datafusion.execution.listing_table_factory_infer_partitions configuration option to false. See issue #17049 for more details.

`MSRV` updated to 1.86.0#

The Minimum Supported Rust Version (MSRV) has been updated to 1.86.0. See #17230 for details.

`ScalarUDFImpl`, `AggregateUDFImpl` and `WindowUDFImpl` traits now require `PartialEq`, `Eq`, and `Hash` traits#

To address error-proneness of ScalarUDFImpl::equals, AggregateUDFImpl::equalsand WindowUDFImpl::equals methods and to make it easy to implement function equality correctly, the equals and hash_value methods have been removed from ScalarUDFImpl, AggregateUDFImpl and WindowUDFImpl traits. They are replaced the requirement to implement the PartialEq, Eq, and Hash traits on any type implementing ScalarUDFImpl, AggregateUDFImpl or WindowUDFImpl. Please see issue #16677 for more details.

Most of the scalar functions are stateless and have a signature field. These can be migrated using regular expressions

search for \#\[derive$Debug$\](\n *(pub )?struct \w+ \{\n *signature\: Signature\,\n *\}),
replace with #[derive(Debug, PartialEq, Eq, Hash)]$1,
review all the changes and make sure only function structs were changed.

`AsyncScalarUDFImpl::invoke_async_with_args` returns `ColumnarValue`#

In order to enable single value optimizations and be consistent with other user defined function APIs, the AsyncScalarUDFImpl::invoke_async_with_args method now returns a ColumnarValue instead of a ArrayRef.

To upgrade, change the return type of your implementation

impl AsyncScalarUDFImpl for AskLLM {
    async fn invoke_async_with_args(
        &self,
        args: ScalarFunctionArgs,
        _option: &ConfigOptions,
    ) -> Result<ColumnarValue> {
        ..
      return array_ref; // old code
    }
}

To return a ColumnarValue

impl AsyncScalarUDFImpl for AskLLM {
    async fn invoke_async_with_args(
        &self,
        args: ScalarFunctionArgs,
        _option: &ConfigOptions,
    ) -> Result<ColumnarValue> {
        ..
      return ColumnarValue::from(array_ref); // new code
    }
}

See #16896 for more details.

`ProjectionExpr` changed from type alias to struct#

ProjectionExpr has been changed from a type alias to a struct with named fields to improve code clarity and maintainability.

Before:

pub type ProjectionExpr = (Arc<dyn PhysicalExpr>, String);

After:

#[derive(Debug, Clone)]
pub struct ProjectionExpr {
    pub expr: Arc<dyn PhysicalExpr>,
    pub alias: String,
}

To upgrade your code:

Replace tuple construction (expr, alias) with ProjectionExpr::new(expr, alias) or ProjectionExpr { expr, alias }
Replace tuple field access .0 and .1 with .expr and .alias
Update pattern matching from (expr, alias) to ProjectionExpr { expr, alias }

This mainly impacts use of ProjectionExec.

This change was done in #17398

`SessionState`, `SessionConfig`, and `OptimizerConfig` returns `&Arc<ConfigOptions>` instead of `&ConfigOptions`#

To provide broader access to ConfigOptions and reduce required clones, some APIs have been changed to return a &Arc<ConfigOptions> instead of a &ConfigOptions. This allows sharing the same ConfigOptions across multiple threads without needing to clone the entire ConfigOptions structure unless it is modified.

Most users will not be impacted by this change since the Rust compiler typically automatically dereference the Arc when needed. However, in some cases you may have to change your code to explicitly call as_ref() for example, from

let optimizer_config: &ConfigOptions = state.options();

let optimizer_config: &ConfigOptions = state.options().as_ref();

See PR #16970

API Change to `AsyncScalarUDFImpl::invoke_async_with_args`#

The invoke_async_with_args method of the AsyncScalarUDFImpl trait has been updated to remove the _option: &ConfigOptions parameter to simplify the API now that the ConfigOptions can be accessed through the ScalarFunctionArgs parameter.

You can change your code like this

impl AsyncScalarUDFImpl for AskLLM {
    async fn invoke_async_with_args(
        &self,
        args: ScalarFunctionArgs,
        _option: &ConfigOptions,
    ) -> Result<ArrayRef> {
        ..
    }
    ...
}

To this:

impl AsyncScalarUDFImpl for AskLLM {
    async fn invoke_async_with_args(
        &self,
        args: ScalarFunctionArgs,
    ) -> Result<ArrayRef> {
        let options = &args.config_options;
        ..
    }
    ...
}

Schema Rewriter Module Moved to New Crate#

The schema_rewriter module and its associated symbols have been moved from datafusion_physical_expr to a new crate datafusion_physical_expr_adapter. This affects the following symbols:

DefaultPhysicalExprAdapter
DefaultPhysicalExprAdapterFactory
PhysicalExprAdapter
PhysicalExprAdapterFactory

To upgrade, change your imports to:

use datafusion_physical_expr_adapter::{
    DefaultPhysicalExprAdapter, DefaultPhysicalExprAdapterFactory,
    PhysicalExprAdapter, PhysicalExprAdapterFactory
};

Upgrade to arrow `56.0.0` and parquet `56.0.0`#

This version of DataFusion upgrades the underlying Apache Arrow implementation to version 56.0.0. See the release notes for more details.

Added `ExecutionPlan::reset_state`#

In order to fix a bug in DataFusion 49.0.0 where dynamic filters (currently only generated in the presence of a query such as ORDER BY ... LIMIT ...) produced incorrect results in recursive queries, a new method reset_state has been added to the ExecutionPlan trait.

Any ExecutionPlan that needs to maintain internal state or references to other nodes in the execution plan tree should implement this method to reset that state. See #17028 for more details and an example implementation for SortExec.

Nested Loop Join input sort order cannot be preserved#

The Nested Loop Join operator has been rewritten from scratch to improve performance and memory efficiency. From the micro-benchmarks: this change introduces up to 5X speed-up and uses only 1% memory in extreme cases compared to the previous implementation.

However, the new implementation cannot preserve input sort order like the old version could. This is a fundamental design trade-off that prioritizes performance and memory efficiency over sort order preservation.

See #16996 for details.

Add `as_any()` method to `LazyBatchGenerator`#

To help with protobuf serialization, the as_any() method has been added to the LazyBatchGenerator trait. This means you will need to add as_any() to your implementation of LazyBatchGenerator:

impl LazyBatchGenerator for MyBatchGenerator {
    fn as_any(&self) -> &dyn Any {
        self
    }

    ...
}

See #17200 for details.

Refactored `DataSource::try_swapping_with_projection`#

We refactored DataSource::try_swapping_with_projection to simplify the method and minimize leakage across the ExecutionPlan <-> DataSource abstraction layer. Reimplementation for any custom DataSource should be relatively straightforward, see #17395 for more details.

`FileOpenFuture` now uses `DataFusionError` instead of `ArrowError`#

The FileOpenFuture type alias has been updated to use DataFusionError instead of ArrowError for its error type. This change affects the FileOpener trait and any implementations that work with file streaming operations.

Before:

pub type FileOpenFuture = BoxFuture<'static, Result<BoxStream<'static, Result<RecordBatch, ArrowError>>>>;

After:

pub type FileOpenFuture = BoxFuture<'static, Result<BoxStream<'static, Result<RecordBatch>>>>;

If you have custom implementations of FileOpener or work directly with FileOpenFuture, you’ll need to update your error handling to use DataFusionError instead of ArrowError. The FileStreamState enum’s Open variant has also been updated accordingly. See #17397 for more details.

FFI user defined aggregate function signature change#

The Foreign Function Interface (FFI) signature for user defined aggregate functions has been updated to call return_field instead of return_type on the underlying aggregate function. This is to support metadata handling with these aggregate functions. This change should be transparent to most users. If you have written unit tests to call return_type directly, you may need to change them to calling return_field instead.

This update is a breaking change to the FFI API. The current best practice when using the FFI crate is to ensure that all libraries that are interacting are using the same underlying Rust version. Issue #17374 has been opened to discuss stabilization of this interface so that these libraries can be used across different DataFusion versions.

See #17407 for details.

Added `PhysicalExpr::is_volatile_node`#

We added a method to PhysicalExpr to mark a PhysicalExpr as volatile:

impl PhysicalExpr for MyRandomExpr {
  fn is_volatile_node(&self) -> bool {
    true
  }
}

We’ve shipped this with a default value of false to minimize breakage but we highly recommend that implementers of PhysicalExpr opt into a behavior, even if it is returning false.

You can see more discussion and example implementations in #17351.

Upgrade Guides#

DataFusion 50.0.0#

ListingTable automatically detects Hive Partitioned tables#

MSRV updated to 1.86.0#

ScalarUDFImpl, AggregateUDFImpl and WindowUDFImpl traits now require PartialEq, Eq, and Hash traits#

AsyncScalarUDFImpl::invoke_async_with_args returns ColumnarValue#

ProjectionExpr changed from type alias to struct#

SessionState, SessionConfig, and OptimizerConfig returns &Arc<ConfigOptions> instead of &ConfigOptions#

API Change to AsyncScalarUDFImpl::invoke_async_with_args#

Schema Rewriter Module Moved to New Crate#

Upgrade to arrow 56.0.0 and parquet 56.0.0#

Added ExecutionPlan::reset_state#