Upgrade Guides#
DataFusion 50.0.0#
ListingTable automatically detects Hive Partitioned tables#
DataFusion 50.0.0 automatically infers Hive partitions when using the ListingTableFactory and CREATE EXTERNAL TABLE. Previously,
when creating a ListingTable, datasets that use Hive partitioning (e.g.
/table_root/column1=value1/column2=value2/data.parquet) would not have the Hive columns reflected in
the table’s schema or data. The previous behavior can be
restored by setting the datafusion.execution.listing_table_factory_infer_partitions configuration option to false.
See issue #17049 for more details.
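To make the directory layout concrete, the following is a minimal sketch of how partition columns can be read out of a Hive-style path. It is an illustration only, not DataFusion's implementation; the function name `hive_partitions` and its signature are hypothetical.

```rust
/// Extract `(column, value)` pairs from the directory segments of a
/// Hive-partitioned file path. Hypothetical helper for illustration;
/// DataFusion's own inference logic is not shown here.
fn hive_partitions(table_root: &str, file_path: &str) -> Vec<(String, String)> {
    // Work on the part of the path below the table root.
    let relative = file_path.strip_prefix(table_root).unwrap_or(file_path);
    let segments: Vec<&str> = relative.split('/').filter(|s| !s.is_empty()).collect();

    // The last segment is the file name, so only the directories are inspected.
    let dirs = match segments.split_last() {
        Some((_file_name, dirs)) => dirs,
        None => return Vec::new(),
    };

    // Keep only segments of the form `column=value`.
    dirs.iter()
        .filter_map(|segment| segment.split_once('='))
        .map(|(column, value)| (column.to_string(), value.to_string()))
        .collect()
}
```

For the example path above, this yields the pairs `column1`/`value1` and `column2`/`value2`, which are the columns that now appear in the table's schema.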
MSRV updated to 1.86.0#
The Minimum Supported Rust Version (MSRV) has been updated to 1.86.0.
See #17230 for details.
ScalarUDFImpl, AggregateUDFImpl and WindowUDFImpl traits now require PartialEq, Eq, and Hash traits#
To address the error-proneness of the ScalarUDFImpl::equals, AggregateUDFImpl::equals, and
WindowUDFImpl::equals methods and to make it easy to implement function equality correctly,
the equals and hash_value methods have been removed from the ScalarUDFImpl, AggregateUDFImpl,
and WindowUDFImpl traits. They are replaced by the requirement to implement the PartialEq, Eq,
and Hash traits on any type implementing ScalarUDFImpl, AggregateUDFImpl, or WindowUDFImpl.
Please see issue #16677 for more details.
Most scalar functions are stateless and have a signature field. These can be migrated
using regular expressions:
1. Search for
\#\[derive\(Debug\)\](\n *(pub )?struct \w+ \{\n *signature\: Signature\,\n *\})
2. Replace with
#[derive(Debug, PartialEq, Eq, Hash)]$1
3. Review all the changes and make sure only function structs were changed.
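The effect of the derive can be sketched without DataFusion. In the example below, `Signature` is a hypothetical stand-in for the real `Signature` type, and `MyStatelessFunc`, `MyPrefixFunc`, and `hash_of` are illustrative names that are not part of any DataFusion API.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical stand-in for DataFusion's `Signature` type.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct Signature {
    arg_count: usize,
}

// A stateless function struct: deriving the three traits is sufficient.
#[derive(Debug, PartialEq, Eq, Hash)]
struct MyStatelessFunc {
    signature: Signature,
}

// A function with extra configuration: the derive compares and hashes
// every field, so differently configured instances are not equal.
#[derive(Debug, PartialEq, Eq, Hash)]
struct MyPrefixFunc {
    signature: Signature,
    prefix: String,
}

/// Compute a hash with the standard library's default hasher.
fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}
```

Two instances with identical fields compare equal and hash identically, which is the contract the removed equals and hash_value methods previously had to uphold by hand.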
AsyncScalarUDFImpl::invoke_async_with_args returns ColumnarValue#
In order to enable single value optimizations and be consistent with other
user defined function APIs, the AsyncScalarUDFImpl::invoke_async_with_args method now
returns a ColumnarValue instead of an ArrayRef.
To upgrade, change your implementation from returning an ArrayRef:
impl AsyncScalarUDFImpl for AskLLM {
    async fn invoke_async_with_args(
        &self,
        args: ScalarFunctionArgs,
        _option: &ConfigOptions,
    ) -> Result<ArrayRef> {
        ..
        return Ok(array_ref); // old code
    }
}
to returning a ColumnarValue:
impl AsyncScalarUDFImpl for AskLLM {
    async fn invoke_async_with_args(
        &self,
        args: ScalarFunctionArgs,
        _option: &ConfigOptions,
    ) -> Result<ColumnarValue> {
        ..
        return Ok(ColumnarValue::from(array_ref)); // new code
    }
}
See #16896 for more details.
ProjectionExpr changed from type alias to struct#
ProjectionExpr has been changed from a type alias to a struct with named fields to improve code clarity and maintainability.
Before:
pub type ProjectionExpr = (Arc<dyn PhysicalExpr>, String);
After:
#[derive(Debug, Clone)]
pub struct ProjectionExpr {
    pub expr: Arc<dyn PhysicalExpr>,
    pub alias: String,
}
To upgrade your code:
- Replace tuple construction (expr, alias) with ProjectionExpr::new(expr, alias) or ProjectionExpr { expr, alias }
- Replace tuple field access .0 and .1 with .expr and .alias
- Update pattern matching from (expr, alias) to ProjectionExpr { expr, alias }
This mainly impacts use of ProjectionExec.
This change was made in #17398.
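The migration steps can be sketched with local stand-ins. Here `PhysicalExpr` and `Column` are hypothetical placeholder types, and `ProjectionExpr` is a local mirror of the struct shown earlier rather than the DataFusion type itself; the helpers `from_tuples` and `output_names` are illustrative only.

```rust
use std::sync::Arc;

// Hypothetical stand-in for DataFusion's `PhysicalExpr` trait.
trait PhysicalExpr: std::fmt::Debug {}

// Hypothetical expression type used only for this sketch.
#[derive(Debug)]
struct Column(&'static str);
impl PhysicalExpr for Column {}

// Local mirror of the new struct with named fields.
#[derive(Debug, Clone)]
struct ProjectionExpr {
    expr: Arc<dyn PhysicalExpr>,
    alias: String,
}

impl ProjectionExpr {
    fn new(expr: Arc<dyn PhysicalExpr>, alias: String) -> Self {
        Self { expr, alias }
    }
}

/// Convert legacy `(expr, alias)` tuples into the struct form.
fn from_tuples(pairs: Vec<(Arc<dyn PhysicalExpr>, String)>) -> Vec<ProjectionExpr> {
    pairs
        .into_iter()
        .map(|(expr, alias)| ProjectionExpr::new(expr, alias))
        .collect()
}

/// Collect output names using struct pattern matching.
/// With tuples this was `.map(|(_, alias)| alias.clone())`.
fn output_names(exprs: &[ProjectionExpr]) -> Vec<String> {
    exprs
        .iter()
        .map(|ProjectionExpr { alias, .. }| alias.clone())
        .collect()
}
```

Field access follows the same pattern: code that read `projection.1` now reads `projection.alias`.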
SessionState, SessionConfig, and OptimizerConfig return &Arc<ConfigOptions> instead of &ConfigOptions#
To provide broader access to ConfigOptions and reduce required clones, some
APIs have been changed to return a &Arc<ConfigOptions> instead of a
&ConfigOptions. This allows sharing the same ConfigOptions across multiple
threads without needing to clone the entire ConfigOptions structure unless it
is modified.
Most users will not be impacted by this change since the Rust compiler typically
dereferences the Arc automatically when needed. However, in some cases you may
have to change your code to explicitly call as_ref(). For example, change your code from
let optimizer_config: &ConfigOptions = state.options();
To
let optimizer_config: &ConfigOptions = state.options().as_ref();
See PR #16970
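The sketch below shows the access patterns with local stand-ins. `ConfigOptions` and `SessionState` here are hypothetical simplified types, not the DataFusion ones, and the `batch_size` field is chosen purely for illustration.

```rust
use std::sync::Arc;

// Hypothetical, simplified stand-in for DataFusion's `ConfigOptions`.
#[derive(Debug, Clone, Default)]
struct ConfigOptions {
    batch_size: usize,
}

// Hypothetical stand-in for a type that hands out shared options.
struct SessionState {
    options: Arc<ConfigOptions>,
}

impl SessionState {
    fn options(&self) -> &Arc<ConfigOptions> {
        &self.options
    }
}

/// A function that only needs a plain reference.
fn batch_size(options: &ConfigOptions) -> usize {
    options.batch_size
}

/// Deref coercion turns `&Arc<ConfigOptions>` into `&ConfigOptions`
/// at the call site, so no code change is needed here.
fn read_batch_size(state: &SessionState) -> usize {
    batch_size(state.options())
}

/// Explicit conversion with `as_ref()` for places where the
/// compiler does not coerce automatically.
fn explicit_reference(state: &SessionState) -> &ConfigOptions {
    state.options().as_ref()
}

/// Sharing the options only bumps a reference count; the
/// `ConfigOptions` value itself is not copied.
fn share(state: &SessionState) -> Arc<ConfigOptions> {
    Arc::clone(state.options())
}
```

The `share` function illustrates the motivation for the change: handing the options to another thread or component costs a reference-count increment rather than a deep clone.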
API Change to AsyncScalarUDFImpl::invoke_async_with_args#
The invoke_async_with_args method of the AsyncScalarUDFImpl trait has been
updated to remove the _option: &ConfigOptions parameter to simplify the API
now that the ConfigOptions can be accessed through the ScalarFunctionArgs
parameter.
To upgrade, change your code from this:
impl AsyncScalarUDFImpl for AskLLM {
    async fn invoke_async_with_args(
        &self,
        args: ScalarFunctionArgs,
        _option: &ConfigOptions,
    ) -> Result<ArrayRef> {
        ..
    }
    ...
}
To this:
impl AsyncScalarUDFImpl for AskLLM {
    async fn invoke_async_with_args(
        &self,
        args: ScalarFunctionArgs,
    ) -> Result<ArrayRef> {
        let options = &args.config_options;
        ..
    }
    ...
}
Schema Rewriter Module Moved to New Crate#
The schema_rewriter module and its associated symbols have been moved from datafusion_physical_expr to a new crate datafusion_physical_expr_adapter. This affects the following symbols:
- DefaultPhysicalExprAdapter
- DefaultPhysicalExprAdapterFactory
- PhysicalExprAdapter
- PhysicalExprAdapterFactory
To upgrade, change your imports to:
use datafusion_physical_expr_adapter::{
    DefaultPhysicalExprAdapter, DefaultPhysicalExprAdapterFactory,
    PhysicalExprAdapter, PhysicalExprAdapterFactory,
};
Upgrade to arrow 56.0.0 and parquet 56.0.0#
This version of DataFusion upgrades the underlying Apache Arrow implementation
to version 56.0.0. See the release notes
for more details.
Added ExecutionPlan::reset_state#
In order to fix a bug in DataFusion 49.0.0 where dynamic filters (currently only generated in the presence of a query such as ORDER BY ... LIMIT ...)
produced incorrect results in recursive queries, a new method reset_state has been added to the ExecutionPlan trait.
Any ExecutionPlan that needs to maintain internal state or references to other nodes in the execution plan tree should implement this method to reset that state.
See #17028 for more details and an example implementation for SortExec.
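The idea behind reset_state can be sketched with a simplified stand-in. The `PlanNode` trait, the `TopK` node, and the `observe` and `threshold` methods below are all hypothetical; in particular, the signature used here is an assumption for illustration and may differ from the real ExecutionPlan::reset_state, so check the API documentation before copying it.

```rust
use std::sync::{Arc, Mutex};

// Hypothetical, simplified stand-in for the `ExecutionPlan` trait.
trait PlanNode {
    /// Return a node equivalent to `self` but without any state
    /// accumulated during a previous execution.
    fn reset_state(self: Arc<Self>) -> Arc<dyn PlanNode>;

    /// Current value of the dynamic filter, if one has been set.
    fn threshold(&self) -> Option<i64>;
}

// Hypothetical node that tightens a filter threshold while it runs.
struct TopK {
    limit: usize,
    threshold: Mutex<Option<i64>>,
}

impl TopK {
    fn new(limit: usize) -> Self {
        Self { limit, threshold: Mutex::new(None) }
    }

    /// Record a value seen during execution and keep the largest one.
    fn observe(&self, value: i64) {
        let mut guard = self.threshold.lock().unwrap();
        let current = *guard;
        *guard = Some(match current {
            Some(existing) => existing.max(value),
            None => value,
        });
    }
}

impl PlanNode for TopK {
    fn reset_state(self: Arc<Self>) -> Arc<dyn PlanNode> {
        // Keep the configuration, drop the accumulated threshold.
        Arc::new(TopK::new(self.limit))
    }

    fn threshold(&self) -> Option<i64> {
        *self.threshold.lock().unwrap()
    }
}
```

Without such a reset, a plan that is executed repeatedly, as in a recursive query, would start each iteration with the filter left over from the previous one.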
Nested Loop Join input sort order cannot be preserved#
The Nested Loop Join operator has been rewritten from scratch to improve performance and memory efficiency. In micro-benchmarks, the new implementation is up to 5x faster and, in extreme cases, uses only 1% of the memory of the previous implementation.
However, the new implementation cannot preserve input sort order like the old version could. This is a fundamental design trade-off that prioritizes performance and memory efficiency over sort order preservation.
See #16996 for details.
Add as_any() method to LazyBatchGenerator#
To help with protobuf serialization, the as_any() method has been added to the LazyBatchGenerator trait. This means you will need to add as_any() to your implementation of LazyBatchGenerator:
impl LazyBatchGenerator for MyBatchGenerator {
    fn as_any(&self) -> &dyn Any {
        self
    }
    ...
}
See #17200 for details.
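The reason as_any() helps serialization is that it lets code holding a trait object recover the concrete type. A minimal sketch, using a hypothetical `BatchGenerator` trait as a stand-in for LazyBatchGenerator and illustrative types `MyBatchGenerator` and `OtherGenerator`:

```rust
use std::any::Any;

// Hypothetical stand-in for the `LazyBatchGenerator` trait.
trait BatchGenerator {
    fn as_any(&self) -> &dyn Any;
}

struct MyBatchGenerator {
    rows_remaining: usize,
}

impl BatchGenerator for MyBatchGenerator {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

struct OtherGenerator;

impl BatchGenerator for OtherGenerator {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

/// Recover the concrete type from a trait object, as a serializer
/// might do before encoding type-specific fields.
fn rows_remaining(generator: &dyn BatchGenerator) -> Option<usize> {
    generator
        .as_any()
        .downcast_ref::<MyBatchGenerator>()
        .map(|concrete| concrete.rows_remaining)
}
```

The downcast returns `None` for any other implementation, so a serializer can fall through to the next known type or report an unsupported generator.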
Refactored DataSource::try_swapping_with_projection#
We refactored DataSource::try_swapping_with_projection to simplify the method and minimize leakage across the ExecutionPlan <-> DataSource abstraction layer.
Updating a custom DataSource implementation should be relatively straightforward; see #17395 for more details.
FileOpenFuture now uses DataFusionError instead of ArrowError#
The FileOpenFuture type alias has been updated to use DataFusionError instead of ArrowError for its error type. This change affects the FileOpener trait and any implementations that work with file streaming operations.
Before:
pub type FileOpenFuture = BoxFuture<'static, Result<BoxStream<'static, Result<RecordBatch, ArrowError>>>>;
After:
pub type FileOpenFuture = BoxFuture<'static, Result<BoxStream<'static, Result<RecordBatch>>>>;
If you have custom implementations of FileOpener or work directly with FileOpenFuture, you’ll need to update your error handling to use DataFusionError instead of ArrowError. The FileStreamState enum’s Open variant has also been updated accordingly. See #17397 for more details.
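In many cases the update amounts to letting the `?` operator convert errors. The sketch below models this with local stand-in enums; `ArrowError` and `DataFusionError` here are hypothetical simplified types rather than the real ones, and `decode_batch` and `next_batch` are illustrative names. It assumes a `From<ArrowError>` conversion like the one defined in the sketch.

```rust
// Hypothetical, simplified stand-in for `arrow::error::ArrowError`.
#[derive(Debug, PartialEq)]
enum ArrowError {
    ParseError(String),
}

// Hypothetical, simplified stand-in for `DataFusionError`.
#[derive(Debug, PartialEq)]
enum DataFusionError {
    ArrowError(Box<ArrowError>),
    Execution(String),
}

// The conversion that lets `?` lift Arrow errors automatically.
impl From<ArrowError> for DataFusionError {
    fn from(err: ArrowError) -> Self {
        DataFusionError::ArrowError(Box::new(err))
    }
}

/// Lower-level decoding step that still reports `ArrowError`.
fn decode_batch(raw: &str) -> Result<usize, ArrowError> {
    raw.parse::<usize>()
        .map_err(|e| ArrowError::ParseError(e.to_string()))
}

/// Stream item producer that now reports `DataFusionError`.
fn next_batch(raw: &str) -> Result<usize, DataFusionError> {
    // `?` converts `ArrowError` into `DataFusionError` via `From`.
    let rows = decode_batch(raw)?;
    if rows == 0 {
        return Err(DataFusionError::Execution("empty batch".to_string()));
    }
    Ok(rows)
}
```

Code that previously wrapped DataFusion errors into an Arrow error in order to fit the old stream type can usually drop that wrapping.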
FFI user defined aggregate function signature change#
The Foreign Function Interface (FFI) signature for user defined aggregate functions
has been updated to call return_field instead of return_type on the underlying
aggregate function. This is to support metadata handling with these aggregate functions.
This change should be transparent to most users. If you have written unit tests that call
return_type directly, you may need to change them to call return_field instead.
This update is a breaking change to the FFI API. The current best practice when using the FFI crate is to ensure that all libraries that are interacting are using the same underlying Rust version. Issue #17374 has been opened to discuss stabilization of this interface so that these libraries can be used across different DataFusion versions.
See #17407 for details.
Added PhysicalExpr::is_volatile_node#
We added a method to PhysicalExpr to mark a PhysicalExpr as volatile:
impl PhysicalExpr for MyRandomExpr {
    fn is_volatile_node(&self) -> bool {
        true
    }
}
We’ve shipped this method with a default implementation that returns false to minimize breakage, but we highly recommend that implementers of PhysicalExpr implement it explicitly, even if the implementation returns false.
You can see more discussion and example implementations in #17351.
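Because the method describes a single node, deciding whether a whole expression is volatile requires walking its children. A minimal sketch of that idea, using a hypothetical `Expr` trait as a stand-in for PhysicalExpr; the types `Literal`, `Random`, and `Add` and the helper `is_volatile` are illustrative and not taken from DataFusion:

```rust
use std::sync::Arc;

// Hypothetical, simplified stand-in for the `PhysicalExpr` trait.
trait Expr {
    fn children(&self) -> Vec<Arc<dyn Expr>> {
        Vec::new()
    }

    /// Whether this node itself is volatile; defaults to `false`.
    fn is_volatile_node(&self) -> bool {
        false
    }
}

struct Literal;

impl Expr for Literal {
    // Explicit opt-in to the non-volatile behavior.
    fn is_volatile_node(&self) -> bool {
        false
    }
}

struct Random;

impl Expr for Random {
    fn is_volatile_node(&self) -> bool {
        true
    }
}

struct Add {
    left: Arc<dyn Expr>,
    right: Arc<dyn Expr>,
}

impl Expr for Add {
    fn children(&self) -> Vec<Arc<dyn Expr>> {
        vec![Arc::clone(&self.left), Arc::clone(&self.right)]
    }

    // Addition is deterministic; volatility comes only from children.
    fn is_volatile_node(&self) -> bool {
        false
    }
}

/// An expression tree is volatile if any node in it is volatile.
fn is_volatile(expr: &Arc<dyn Expr>) -> bool {
    expr.is_volatile_node() || expr.children().iter().any(|child| is_volatile(child))
}
```

In this sketch `Add` reports false for itself, yet an `Add` over a `Random` child is still treated as volatile by the tree walk.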