# Upgrade Guides

## DataFusion 50.0.0

### ListingTable automatically detects Hive Partitioned tables

DataFusion 50.0.0 automatically infers Hive partitions when using the `ListingTableFactory` and `CREATE EXTERNAL TABLE`. Previously, when creating a `ListingTable`, datasets that use Hive partitioning (e.g. `/table_root/column1=value1/column2=value2/data.parquet`) would not have the Hive columns reflected in the table's schema or data. The previous behavior can be restored by setting the `datafusion.execution.listing_table_factory_infer_partitions` configuration option to `false`. See [issue #17049] for more details.

[issue #17049]: https://github.com/apache/datafusion/issues/17049

### `MSRV` updated to 1.86.0

The Minimum Supported Rust Version (MSRV) has been updated to [`1.86.0`].
See [#17230] for details.

[`1.86.0`]: https://releases.rs/docs/1.86.0/
[#17230]: https://github.com/apache/datafusion/pull/17230

### `ScalarUDFImpl`, `AggregateUDFImpl` and `WindowUDFImpl` traits now require `PartialEq`, `Eq`, and `Hash` traits

The `ScalarUDFImpl::equals`, `AggregateUDFImpl::equals` and `WindowUDFImpl::equals` methods were error-prone. To make function equality easy to implement correctly, the `equals` and `hash_value` methods have been removed from the `ScalarUDFImpl`, `AggregateUDFImpl` and `WindowUDFImpl` traits. They are replaced by the requirement to implement the `PartialEq`, `Eq`, and `Hash` traits on any type implementing `ScalarUDFImpl`, `AggregateUDFImpl` or `WindowUDFImpl`. Please see [issue #16677] for more details.

Most of the scalar functions are stateless and have a `signature` field. These can be migrated using regular expressions:

- search for `\#\[derive\(Debug\)\](\n *(pub )?struct \w+ \{\n *signature\: Signature\,\n *\})`,
- replace with `#[derive(Debug, PartialEq, Eq, Hash)]$1`,
- review all the changes and make sure only function structs were changed.
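
The effect of the regular-expression migration above can be sketched without the DataFusion crates. In this minimal example `Signature` is a hypothetical one-field stand-in for the real type, and `MyUpper`, `Truncate`, `my_upper`, `truncate` and `hash_of` are illustrative names that do not exist in DataFusion. It assumes every field of the function struct already implements `PartialEq`, `Eq` and `Hash`; under that assumption the derived implementations compare and hash all fields, so a function carrying configuration (here `max_len`) is only equal to an identically configured instance.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical stand-in for DataFusion's `Signature`, reduced to one field
// so the example compiles without the `datafusion` crates.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct Signature {
    pub arg_count: usize,
}

// A stateless function struct of the shape the regular expression targets.
// The only change from the old code is the extended `derive` list.
#[derive(Debug, PartialEq, Eq, Hash)]
pub struct MyUpper {
    signature: Signature,
}

// A function with configuration. The derived impls include `max_len`,
// so differently configured instances compare as different functions.
#[derive(Debug, PartialEq, Eq, Hash)]
pub struct Truncate {
    signature: Signature,
    max_len: usize,
}

// Illustrative constructors (not part of any DataFusion API).
pub fn my_upper() -> MyUpper {
    MyUpper { signature: Signature { arg_count: 1 } }
}

pub fn truncate(max_len: usize) -> Truncate {
    Truncate { signature: Signature { arg_count: 1 }, max_len }
}

// Hashes any `Hash` value with the standard library's default hasher.
pub fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}
```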

[issue #16677]: https://github.com/apache/datafusion/issues/16677

### `AsyncScalarUDFImpl::invoke_async_with_args` returns `ColumnarValue`

In order to enable single value optimizations and be consistent with other user defined function APIs, the `AsyncScalarUDFImpl::invoke_async_with_args` method now returns a `ColumnarValue` instead of an `ArrayRef`.

To upgrade, change the return type of your implementation from

```rust
# /* comment to avoid running
impl AsyncScalarUDFImpl for AskLLM {
    async fn invoke_async_with_args(
        &self,
        args: ScalarFunctionArgs,
        _option: &ConfigOptions,
    ) -> Result<ArrayRef> {
        ..
        return array_ref; // old code
    }
}
# */
```

to return a `ColumnarValue`:

```rust
# /* comment to avoid running
impl AsyncScalarUDFImpl for AskLLM {
    async fn invoke_async_with_args(
        &self,
        args: ScalarFunctionArgs,
        _option: &ConfigOptions,
    ) -> Result<ColumnarValue> {
        ..
        return ColumnarValue::from(array_ref); // new code
    }
}
# */
```

See [#16896](https://github.com/apache/datafusion/issues/16896) for more details.

### `ProjectionExpr` changed from type alias to struct

`ProjectionExpr` has been changed from a type alias to a struct with named fields to improve code clarity and maintainability.

**Before:**

```rust,ignore
pub type ProjectionExpr = (Arc<dyn PhysicalExpr>, String);
```

**After:**

```rust,ignore
#[derive(Debug, Clone)]
pub struct ProjectionExpr {
    pub expr: Arc<dyn PhysicalExpr>,
    pub alias: String,
}
```

To upgrade your code:

- Replace tuple construction `(expr, alias)` with `ProjectionExpr::new(expr, alias)` or `ProjectionExpr { expr, alias }`
- Replace tuple field access `.0` and `.1` with `.expr` and `.alias`
- Update pattern matching from `(expr, alias)` to `ProjectionExpr { expr, alias }`

This mainly impacts use of `ProjectionExec`.
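
The three upgrade steps for `ProjectionExpr` can be sketched in a self-contained form. The `PhysicalExpr` trait and `Column` type below are hypothetical stand-ins so the example compiles without the DataFusion crates, and `ProjectionExpr` is re-declared locally to mirror the struct shown above, with `new` assumed to take the expression and the alias in that order. The helpers `column`, `from_tuples`, `output_names` and `describe` are illustrative names only.

```rust
use std::fmt::Debug;
use std::sync::Arc;

// Hypothetical stand-ins for the real DataFusion trait and a column expression.
pub trait PhysicalExpr: Debug {}

#[derive(Debug)]
pub struct Column {
    pub name: String,
}

impl PhysicalExpr for Column {}

// Local copy that mirrors the struct definition from the upgrade note.
#[derive(Debug, Clone)]
pub struct ProjectionExpr {
    pub expr: Arc<dyn PhysicalExpr>,
    pub alias: String,
}

impl ProjectionExpr {
    pub fn new(expr: Arc<dyn PhysicalExpr>, alias: String) -> Self {
        Self { expr, alias }
    }
}

// Illustrative helper that builds a column expression.
pub fn column(name: &str) -> Arc<dyn PhysicalExpr> {
    Arc::new(Column { name: name.to_string() })
}

// Step 1: old code built `(expr, alias)` tuples; new code builds the struct.
pub fn from_tuples(pairs: Vec<(Arc<dyn PhysicalExpr>, String)>) -> Vec<ProjectionExpr> {
    pairs
        .into_iter()
        .map(|(expr, alias)| ProjectionExpr::new(expr, alias))
        .collect()
}

// Step 2: old code read `.1`; new code reads `.alias`.
pub fn output_names(exprs: &[ProjectionExpr]) -> Vec<&str> {
    exprs.iter().map(|p| p.alias.as_str()).collect()
}

// Step 3: old code matched `(expr, alias)`; new code destructures named fields.
pub fn describe(projection: &ProjectionExpr) -> String {
    let ProjectionExpr { expr, alias } = projection;
    format!("{expr:?} AS {alias}")
}
```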
This change was done in [#17398].

[#17398]: https://github.com/apache/datafusion/pull/17398

### `SessionState`, `SessionConfig`, and `OptimizerConfig` return `&Arc<ConfigOptions>` instead of `&ConfigOptions`

To provide broader access to `ConfigOptions` and reduce required clones, some APIs have been changed to return a `&Arc<ConfigOptions>` instead of a `&ConfigOptions`. This allows sharing the same `ConfigOptions` across multiple threads without needing to clone the entire `ConfigOptions` structure unless it is modified.

Most users will not be impacted by this change since the Rust compiler typically dereferences the `Arc` automatically when needed. However, in some cases you may have to change your code to explicitly call `as_ref()`, for example, from

```rust
# /* comment to avoid running
let optimizer_config: &ConfigOptions = state.options();
# */
```

to

```rust
# /* comment to avoid running
let optimizer_config: &ConfigOptions = state.options().as_ref();
# */
```

See PR [#16970](https://github.com/apache/datafusion/pull/16970).

### API Change to `AsyncScalarUDFImpl::invoke_async_with_args`

The `invoke_async_with_args` method of the `AsyncScalarUDFImpl` trait no longer takes the `_option: &ConfigOptions` parameter. This simplifies the API now that the `ConfigOptions` can be accessed through the `ScalarFunctionArgs` parameter.

You can change your code from this:

```rust
# /* comment to avoid running
impl AsyncScalarUDFImpl for AskLLM {
    async fn invoke_async_with_args(
        &self,
        args: ScalarFunctionArgs,
        _option: &ConfigOptions,
    ) -> Result<ColumnarValue> {
        ..
    }
    ...
}
# */
```

to this:

```rust
# /* comment to avoid running
impl AsyncScalarUDFImpl for AskLLM {
    async fn invoke_async_with_args(
        &self,
        args: ScalarFunctionArgs,
    ) -> Result<ColumnarValue> {
        let options = &args.config_options;
        ..
    }
    ...
}
# */
```

### Schema Rewriter Module Moved to New Crate

The `schema_rewriter` module and its associated symbols have been moved from `datafusion_physical_expr` to a new crate `datafusion_physical_expr_adapter`.
This affects the following symbols:

- `DefaultPhysicalExprAdapter`
- `DefaultPhysicalExprAdapterFactory`
- `PhysicalExprAdapter`
- `PhysicalExprAdapterFactory`

To upgrade, change your imports to:

```rust
use datafusion_physical_expr_adapter::{
    DefaultPhysicalExprAdapter,
    DefaultPhysicalExprAdapterFactory,
    PhysicalExprAdapter,
    PhysicalExprAdapterFactory
};
```

### Upgrade to arrow `56.0.0` and parquet `56.0.0`

This version of DataFusion upgrades the underlying Apache Arrow implementation to version `56.0.0`. See the [release notes](https://github.com/apache/arrow-rs/releases/tag/56.0.0) for more details.

### Added `ExecutionPlan::reset_state`

In order to fix a bug in DataFusion `49.0.0` where dynamic filters (currently only generated in the presence of a query such as `ORDER BY ... LIMIT ...`) produced incorrect results in recursive queries, a new method `reset_state` has been added to the `ExecutionPlan` trait.

Any `ExecutionPlan` that needs to maintain internal state or references to other nodes in the execution plan tree should implement this method to reset that state. See [#17028] for more details and an example implementation for `SortExec`.

[#17028]: https://github.com/apache/datafusion/pull/17028

### Nested Loop Join input sort order cannot be preserved

The Nested Loop Join operator has been rewritten from scratch to improve performance and memory efficiency. In micro-benchmarks, this change yields up to a 5X speed-up and, in extreme cases, uses only 1% of the memory of the previous implementation.

However, the new implementation cannot preserve input sort order like the old version could. This is a fundamental design trade-off that prioritizes performance and memory efficiency over sort order preservation.

See [#16996] for details.

[#16996]: https://github.com/apache/datafusion/pull/16996

### Add `as_any()` method to `LazyBatchGenerator`

To help with protobuf serialization, the `as_any()` method has been added to the `LazyBatchGenerator` trait.
This means you will need to add `as_any()` to your implementation of `LazyBatchGenerator`:

```rust
# /* comment to avoid running
impl LazyBatchGenerator for MyBatchGenerator {
    fn as_any(&self) -> &dyn Any {
        self
    }

    ...
}
# */
```

See [#17200](https://github.com/apache/datafusion/pull/17200) for details.

### Refactored `DataSource::try_swapping_with_projection`

We refactored `DataSource::try_swapping_with_projection` to simplify the method and minimize leakage across the ExecutionPlan <-> DataSource abstraction layer. Reimplementation for any custom `DataSource` should be relatively straightforward; see [#17395] for more details.

[#17395]: https://github.com/apache/datafusion/pull/17395/

### `FileOpenFuture` now uses `DataFusionError` instead of `ArrowError`

The `FileOpenFuture` type alias has been updated to use `DataFusionError` instead of `ArrowError` for its error type. This change affects the `FileOpener` trait and any implementations that work with file streaming operations.

**Before:**

```rust,ignore
pub type FileOpenFuture = BoxFuture<'static, Result<BoxStream<'static, Result<RecordBatch, ArrowError>>>>;
```

**After:**

```rust,ignore
pub type FileOpenFuture = BoxFuture<'static, Result<BoxStream<'static, Result<RecordBatch>>>>;
```

If you have custom implementations of `FileOpener` or work directly with `FileOpenFuture`, you'll need to update your error handling to use `DataFusionError` instead of `ArrowError`. The `FileStreamState` enum's `Open` variant has also been updated accordingly. See [#17397] for more details.

[#17397]: https://github.com/apache/datafusion/pull/17397

### FFI user defined aggregate function signature change

The Foreign Function Interface (FFI) signature for user defined aggregate functions has been updated to call `return_field` instead of `return_type` on the underlying aggregate function. This is to support metadata handling with these aggregate functions. This change should be transparent to most users.
If you have written unit tests that call `return_type` directly, you may need to change them to call `return_field` instead.

This update is a breaking change to the FFI API. The current best practice when using the FFI crate is to ensure that all libraries that are interacting are using the same underlying Rust version. Issue [#17374] has been opened to discuss stabilization of this interface so that these libraries can be used across different DataFusion versions.

See [#17407] for details.

[#17407]: https://github.com/apache/datafusion/pull/17407
[#17374]: https://github.com/apache/datafusion/issues/17374

### Added `PhysicalExpr::is_volatile_node`

We added a method to `PhysicalExpr` to mark a `PhysicalExpr` as volatile:

```rust,ignore
impl PhysicalExpr for MyRandomExpr {
    fn is_volatile_node(&self) -> bool {
        true
    }
}
```

We've shipped this with a default value of `false` to minimize breakage, but we highly recommend that implementers of `PhysicalExpr` explicitly choose a behavior, even if that means returning `false`.

You can see more discussion and example implementations in [#17351].

[#17351]: https://github.com/apache/datafusion/pull/17351