Upgrade Guides

DataFusion 49.0.0

Note: DataFusion 49.0.0 has not been released yet. The information in this section pertains to features and changes that have already been merged to the main branch and are awaiting release in this version. You can see the current status of the 49.0.0 release here.

datafusion.execution.collect_statistics now defaults to true

The default value of the datafusion.execution.collect_statistics configuration setting is now true. This change affects users who read this setting directly and relied on its default being false.

This change also restores the default behavior of ListingTable to what it was prior to 48.0.0. If you use ListingTable directly, you can keep the 48.0.0 behavior by overriding the default value in your code:

ListingOptions::new(Arc::new(ParquetFormat::default()))
    .with_collect_stat(false)
    // other options

Metadata is now represented by FieldMetadata

Metadata from the Arrow Field is now stored using the FieldMetadata structure. In prior versions it was stored as both a HashMap<String, String> and a BTreeMap<String, String>. FieldMetadata is easier to work with and is more efficient.

To create FieldMetadata from a Field:

let metadata = FieldMetadata::from(&field);

To add metadata to a Field, use the add_to_field method:

let updated_field = metadata.add_to_field(field);

See #16317 for details.

New datafusion.execution.spill_compression configuration option

DataFusion 49.0.0 adds support for compressing spill files when data is spilled to disk during query execution. A new configuration option, datafusion.execution.spill_compression, controls the compression codec used.

Configuration:

  • Key: datafusion.execution.spill_compression

  • Default: uncompressed

  • Valid values: uncompressed, lz4_frame, zstd

Usage:

use datafusion::prelude::*;
use datafusion_common::config::SpillCompression;

let config = SessionConfig::default()
    .with_spill_compression(SpillCompression::Zstd);
let ctx = SessionContext::new_with_config(config);

Or via SQL:

SET datafusion.execution.spill_compression = 'zstd';

For more details about this configuration option, including performance trade-offs between different compression codecs, see the Configuration Settings documentation.

DataFusion 48.0.0

Expr::Literal has optional metadata

The Expr::Literal variant now includes optional metadata, which allows for carrying through Arrow field metadata to support extension types and other uses.

This means code such as

match expr {
...
  Expr::Literal(scalar) => ...
...
}

Should be updated to:

match expr {
...
  Expr::Literal(scalar, _metadata) => ...
...
}

Constructing an Expr::Literal now requires a metadata argument as well. The lit function has not changed and returns an Expr::Literal with no metadata.
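
For example, a literal can be constructed directly as follows (a minimal sketch; the second element is the optional metadata):

use datafusion_common::ScalarValue;
use datafusion_expr::{lit, Expr};

// Construct a literal directly; `None` means no field metadata is attached
let expr = Expr::Literal(ScalarValue::Int32(Some(42)), None);

// `lit` is unchanged and produces a literal with no metadata
let same = lit(42);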

Expr::WindowFunction is now Boxed

Expr::WindowFunction is now a Box<WindowFunction> instead of a WindowFunction directly. This change was made to reduce the size of Expr and improve performance when planning queries (see details on #16207).

This is a breaking change, so you will need to update your code if you match on Expr::WindowFunction directly. For example, if you have code like this:

match expr {
  Expr::WindowFunction(WindowFunction {
    params:
      WindowFunctionParams {
        partition_by,
        order_by,
        ..
      },
    ..
  }) => {
    // Use partition_by and order_by as needed
  }
  _ => {
    // other expr
  }
}

You will need to change it to:

match expr {
  Expr::WindowFunction(window_fun) => {
    let WindowFunction {
      params:
        WindowFunctionParams {
          partition_by,
          order_by,
          ..
        },
      ..
    } = window_fun.as_ref();
    // Use partition_by and order_by as needed
  }
  _ => {
    // other expr
  }
}

The VARCHAR SQL type is now represented as Utf8View in Arrow

The mapping of the SQL VARCHAR type has been changed from Utf8 to Utf8View, which improves performance for many string operations. You can read more about Utf8View in the DataFusion blog post on German-style strings.

This means that when you create a table with a VARCHAR column, it will now use Utf8View as the underlying data type. For example:

> CREATE TABLE my_table (my_column VARCHAR);
0 row(s) fetched.
Elapsed 0.001 seconds.

> DESCRIBE my_table;
+-------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-------------+-----------+-------------+
| my_column   | Utf8View  | YES         |
+-------------+-----------+-------------+
1 row(s) fetched.
Elapsed 0.000 seconds.

You can restore the old behavior of using Utf8 by changing the datafusion.sql_parser.map_varchar_to_utf8view configuration setting. For example:

> set datafusion.sql_parser.map_varchar_to_utf8view = false;
0 row(s) fetched.
Elapsed 0.001 seconds.

> CREATE TABLE my_table (my_column VARCHAR);
0 row(s) fetched.
Elapsed 0.014 seconds.

> DESCRIBE my_table;
+-------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-------------+-----------+-------------+
| my_column   | Utf8      | YES         |
+-------------+-----------+-------------+
1 row(s) fetched.
Elapsed 0.004 seconds.

ListingOptions default for collect_stat changed from true to false

This makes it agree with the default for SessionConfig. Most users won’t be impacted by this change but if you were using ListingOptions directly and relied on the default value of collect_stat being true, you will need to explicitly set it to true in your code.

ListingOptions::new(Arc::new(ParquetFormat::default()))
    .with_collect_stat(true)
    // other options

Processing FieldRef instead of DataType for user defined functions

In order to support metadata handling and extension types, the user defined function traits now use FieldRef rather than a DataType plus nullability. This gives a single interface for both of those parameters and additionally allows access to metadata fields, which can be used for extension types.

To upgrade structs which implement ScalarUDFImpl, if you have implemented return_type_from_args you now need to implement return_field_from_args instead. If your function does not need to handle metadata, this should be a straightforward repackaging of the output data into a FieldRef. The name you specify on the field is not important; it will be overwritten during planning. ReturnInfo has been removed, so you will need to remove all references to it.

ScalarFunctionArgs now contains a field called arg_fields. You can use this to access the metadata associated with the columnar values during invocation.
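
A minimal sketch of both changes for a hypothetical UDF (MyUdf is a placeholder; it assumes a function that returns Utf8 and shows where metadata can be read):

use std::sync::Arc;
use arrow::datatypes::{DataType, Field, FieldRef};
use datafusion_common::Result;
use datafusion_expr::{ColumnarValue, ReturnFieldArgs, ScalarFunctionArgs, ScalarUDFImpl};

impl ScalarUDFImpl for MyUdf {
    // ... other trait methods unchanged ...

    // Replaces return_type_from_args; the field name is overwritten during planning
    fn return_field_from_args(&self, _args: ReturnFieldArgs) -> Result<FieldRef> {
        Ok(Arc::new(Field::new(self.name(), DataType::Utf8, true)))
    }

    fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
        // arg_fields holds one FieldRef per argument, including its metadata
        let _first_arg_metadata = args.arg_fields[0].metadata();
        // ... evaluate args.args as usual ...
        todo!()
    }
}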

To upgrade user defined aggregate functions, there is now a return_field function that allows you to specify both the metadata and nullability of your function's output. You are not required to implement it if you do not need to handle metadata.

The largest change to aggregate functions happens in the accumulator arguments. Both the AccumulatorArgs and StateFieldsArgs now contain FieldRef rather than DataType.
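
For example, code that previously read the return type directly from AccumulatorArgs now reads it from a Field (a hedged fragment; acc_args is an AccumulatorArgs and the return_field name follows the new FieldRef-based API):

// DataFusion 47: the output type was a plain DataType
// let return_type = acc_args.return_type.clone();

// DataFusion 48: the output type hangs off a FieldRef
let return_type = acc_args.return_field.data_type().clone();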

To upgrade window functions, ExpressionArgs now contains input fields instead of input data types. When setting these fields, the name of the field is not important since this gets overwritten during the planning stage. All you should need to do is wrap your existing data types in fields with nullability set depending on your use case.

Physical Expression return Field

To support the metadata handling changes to user defined functions, the PhysicalExpr trait now must specify a return Field based on the input schema. To upgrade structs which implement PhysicalExpr, you need to implement the return_field function. There are numerous examples in the physical-expr crate.
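
A minimal sketch that derives the output field from the existing data_type and nullable methods (MyExpr is a placeholder type):

use std::sync::Arc;
use arrow::datatypes::{Field, FieldRef, Schema};
use datafusion::physical_plan::PhysicalExpr;
use datafusion_common::Result;

impl PhysicalExpr for MyExpr {
    // ... existing methods (data_type, nullable, evaluate, ...) ...

    fn return_field(&self, input_schema: &Schema) -> Result<FieldRef> {
        // The field name is not significant here
        Ok(Arc::new(Field::new(
            "my_expr",
            self.data_type(input_schema)?,
            self.nullable(input_schema)?,
        )))
    }
}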

FileFormat::supports_filters_pushdown replaced with FileSource::try_pushdown_filters

To support more general filter pushdown, the FileFormat::supports_filters_pushdown was replaced with FileSource::try_pushdown_filters. If you implemented a custom FileFormat that uses a custom FileSource you will need to implement FileSource::try_pushdown_filters. See ParquetSource::try_pushdown_filters for an example of how to implement this.

FileFormat::supports_filters_pushdown has been removed.

ParquetExec, AvroExec, CsvExec, JsonExec Removed

ParquetExec, AvroExec, CsvExec, and JsonExec were deprecated in DataFusion 46 and are removed in DataFusion 48. This is sooner than the normal process described in the API Deprecation Guidelines because all of the tests cover the new DataSourceExec rather than the older structures. As the DataSource APIs evolved, the old structures began to show signs of “bit rot” (not working, with no one noticing due to the lack of test coverage).

PartitionedFile added as an argument to the FileOpener trait

This is necessary to properly fix filter pushdown for filters that combine partition columns and file columns (e.g. day = username['dob']).

If you implemented a custom FileOpener you will need to add the PartitionedFile argument but are not required to use it in any way.
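
A sketch of the updated trait implementation (MyOpener is a placeholder; the new argument can simply be ignored):

use datafusion::datasource::listing::PartitionedFile;
use datafusion::datasource::physical_plan::{FileMeta, FileOpenFuture, FileOpener};
use datafusion_common::Result;

impl FileOpener for MyOpener {
    // DataFusion 48 adds the PartitionedFile parameter
    fn open(&self, file_meta: FileMeta, _file: PartitionedFile) -> Result<FileOpenFuture> {
        // ... open file_meta.object_meta.location as before ...
        todo!()
    }
}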

DataFusion 47.0.0

This section calls out some of the major changes in the 47.0.0 release of DataFusion.

Several example upgrade PRs demonstrate the changes required when upgrading from DataFusion 46.0.0.

Upgrades to arrow-rs and arrow-parquet 55.0.0 and object_store 0.12.0

Several APIs in the underlying arrow and parquet libraries were changed to use u64 instead of usize to better support WASM (see #7371 and #6961).

Additionally, ObjectStore::list and ObjectStore::list_with_offset have been changed to return streams with 'static lifetimes (see #6619).

This requires occasionally converting from usize to u64, as well as changes to ObjectStore implementations such as:

// An illustrative wrapper around an inner ObjectStore (the type name is a placeholder)
impl ObjectStore for MyObjectStore {
    ...
    // The range is now a u64 instead of usize
    async fn get_range(&self, location: &Path, range: Range<u64>) -> ObjectStoreResult<Bytes> {
        self.inner.get_range(location, range).await
    }
    ...
    // The lifetime is now 'static instead of '_ (meaning the captured closure can't contain references)
    // (this also applies to list_with_offset)
    fn list(&self, prefix: Option<&Path>) -> BoxStream<'static, ObjectStoreResult<ObjectMeta>> {
        self.inner.list(prefix)
    }
}

The ParquetObjectReader has been updated to no longer require the object size (it can be fetched using a single suffix request). See #7334 for details.

Pattern in DataFusion 46.0.0:

let meta: ObjectMeta = ...;
let reader = ParquetObjectReader::new(store, meta);

Pattern in DataFusion 47.0.0:

let meta: ObjectMeta = ...;
let reader = ParquetObjectReader::new(store, meta.location)
  .with_file_size(meta.size);

DisplayFormatType::TreeRender

DataFusion now supports tree-style explain plans. Implementations of ExecutionPlan must also provide a description in the DisplayFormatType::TreeRender format. This can be the same as the existing DisplayFormatType::Default.
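
A sketch of a DisplayAs implementation for a hypothetical MyExec that reuses the default description for TreeRender:

use std::fmt;
use datafusion::physical_plan::{DisplayAs, DisplayFormatType};

impl DisplayAs for MyExec {
    fn fmt_as(&self, t: DisplayFormatType, f: &mut fmt::Formatter) -> fmt::Result {
        match t {
            // The tree rendering can reuse the same text as the default format
            DisplayFormatType::Default
            | DisplayFormatType::Verbose
            | DisplayFormatType::TreeRender => write!(f, "MyExec"),
        }
    }
}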

Removed Deprecated APIs

Several APIs have been removed in this release. These were either deprecated previously or were hard to use correctly such as the multiple different ScalarUDFImpl::invoke* APIs. See #15130, #15123, and #15027 for more details.

FileScanConfig --> FileScanConfigBuilder

Previously, FileScanConfig::build() directly created ExecutionPlans. In DataFusion 47.0.0 this has been changed to use FileScanConfigBuilder. See #15352 for details.

Pattern in DataFusion 46.0.0:

let plan = FileScanConfig::new(url, schema, Arc::new(file_source))
  .with_statistics(stats)
  ...
  .build()

Pattern in DataFusion 47.0.0:

let config = FileScanConfigBuilder::new(url, schema, Arc::new(file_source))
  .with_statistics(stats)
  ...
  .build();
let scan = DataSourceExec::from_data_source(config);

DataFusion 46.0.0

Use invoke_with_args instead of invoke() and invoke_batch()

DataFusion is moving to a consistent API for invoking ScalarUDFs, ScalarUDFImpl::invoke_with_args(), and is deprecating ScalarUDFImpl::invoke(), ScalarUDFImpl::invoke_batch(), and ScalarUDFImpl::invoke_no_args().

If you see errors such as the following it means the older APIs are being used:

This feature is not implemented: Function concat does not implement invoke but called

To fix this error, use ScalarUDFImpl::invoke_with_args() instead, as shown below. See PR 14876 for an example.

Given existing code like this:

impl ScalarUDFImpl for SparkConcat {
...
    fn invoke_batch(&self, args: &[ColumnarValue], number_rows: usize) -> Result<ColumnarValue> {
        if args
            .iter()
            .any(|arg| matches!(arg.data_type(), DataType::List(_)))
        {
            ArrayConcat::new().invoke_batch(args, number_rows)
        } else {
            ConcatFunc::new().invoke_batch(args, number_rows)
        }
    }
}

To this:

impl ScalarUDFImpl for SparkConcat {
    ...
    fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
        if args
            .args
            .iter()
            .any(|arg| matches!(arg.data_type(), DataType::List(_)))
        {
            ArrayConcat::new().invoke_with_args(args)
        } else {
            ConcatFunc::new().invoke_with_args(args)
        }
    }
}

ParquetExec, AvroExec, CsvExec, JsonExec deprecated

DataFusion 46 has a major change to how the built-in DataSources are organized. Instead of individual ExecutionPlans for the different file formats, they now all use DataSourceExec, and the format-specific information is embodied in the new DataSource and FileSource traits.

The following cookbook entries show how to update code that used the previous APIs.

Cookbook: Changes to ParquetExec

Code that looks for ParquetExec like this will no longer work:

    if let Some(parquet_exec) = plan.as_any().downcast_ref::<ParquetExec>() {
        // Do something with ParquetExec here
    }

Instead, with DataSourceExec, the same information is now on FileScanConfig and ParquetSource. The equivalent code is

if let Some(datasource_exec) = plan.as_any().downcast_ref::<DataSourceExec>() {
    if let Some(scan_config) = datasource_exec.data_source().as_any().downcast_ref::<FileScanConfig>() {
        // FileGroups and other scan information are on the FileScanConfig
        if let Some(parquet_source) = scan_config.file_source.as_any().downcast_ref::<ParquetSource>() {
            // Information on PruningPredicates and parquet options is here
        }
    }
}

Cookbook: Changes to ParquetExecBuilder

Likewise, code that builds ParquetExec using ParquetExecBuilder, such as the following, must be changed:

let mut exec_plan_builder = ParquetExecBuilder::new(
    FileScanConfig::new(self.log_store.object_store_url(), file_schema)
        .with_projection(self.projection.cloned())
        .with_limit(self.limit)
        .with_table_partition_cols(table_partition_cols),
)
.with_schema_adapter_factory(Arc::new(DeltaSchemaAdapterFactory {}))
.with_table_parquet_options(parquet_options);

// Add filter
if let Some(predicate) = logical_filter {
    if config.enable_parquet_pushdown {
        exec_plan_builder = exec_plan_builder.with_predicate(predicate);
    }
};

New code should use FileScanConfig to build the appropriate DataSourceExec:

let mut file_source = ParquetSource::new(parquet_options)
    .with_schema_adapter_factory(Arc::new(DeltaSchemaAdapterFactory {}));

// Add filter
if let Some(predicate) = logical_filter {
    if config.enable_parquet_pushdown {
        file_source = file_source.with_predicate(predicate);
    }
};

let file_scan_config = FileScanConfig::new(
    self.log_store.object_store_url(),
    file_schema,
    Arc::new(file_source),
)
.with_statistics(stats)
.with_projection(self.projection.cloned())
.with_limit(self.limit)
.with_table_partition_cols(table_partition_cols);

// Build the actual scan like this
parquet_scan: file_scan_config.build(),

datafusion-cli no longer automatically unescapes strings

datafusion-cli previously would incorrectly unescape string literals (see the ticket for more details).

To escape ' in SQL literals, use '':

> select 'it''s escaped';
+----------------------+
| Utf8("it's escaped") |
+----------------------+
| it's escaped         |
+----------------------+
1 row(s) fetched.

To include special characters (such as newlines) in a string, use an E literal string such as E'foo\nbar'. Without the E prefix, escape sequences are no longer interpreted and are kept as written:

> select 'foo\nbar';
+------------------+
| Utf8("foo\nbar") |
+------------------+
| foo\nbar         |
+------------------+
1 row(s) fetched.
Elapsed 0.005 seconds.

Changes to array scalar function signatures

DataFusion 46 has changed the way scalar array function signatures are declared. Previously, functions needed to select from a list of predefined signatures within the ArrayFunctionSignature enum. Now the signatures can be defined via a Vec of pseudo-types, each of which corresponds to a single argument. Those pseudo-types are the variants of the ArrayFunctionArgument enum and are as follows:

  • Array: An argument of type List/LargeList/FixedSizeList. All Array arguments must be coercible to the same type.

  • Element: An argument that is coercible to the inner type of the Array arguments.

  • Index: An Int64 argument.

Each of the old variants can be converted to the new format as follows:

TypeSignature::ArraySignature(ArrayFunctionSignature::ArrayAndElement):

TypeSignature::ArraySignature(ArrayFunctionSignature::Array {
    arguments: vec![ArrayFunctionArgument::Array, ArrayFunctionArgument::Element],
    array_coercion: Some(ListCoercion::FixedSizedListToList),
});

TypeSignature::ArraySignature(ArrayFunctionSignature::ElementAndArray):

TypeSignature::ArraySignature(ArrayFunctionSignature::Array {
    arguments: vec![ArrayFunctionArgument::Element, ArrayFunctionArgument::Array],
    array_coercion: Some(ListCoercion::FixedSizedListToList),
});

TypeSignature::ArraySignature(ArrayFunctionSignature::ArrayAndIndex):

TypeSignature::ArraySignature(ArrayFunctionSignature::Array {
    arguments: vec![ArrayFunctionArgument::Array, ArrayFunctionArgument::Index],
    array_coercion: None,
});

TypeSignature::ArraySignature(ArrayFunctionSignature::ArrayAndElementAndOptionalIndex):

TypeSignature::OneOf(vec![
    TypeSignature::ArraySignature(ArrayFunctionSignature::Array {
        arguments: vec![ArrayFunctionArgument::Array, ArrayFunctionArgument::Element],
        array_coercion: None,
    }),
    TypeSignature::ArraySignature(ArrayFunctionSignature::Array {
        arguments: vec![
            ArrayFunctionArgument::Array,
            ArrayFunctionArgument::Element,
            ArrayFunctionArgument::Index,
        ],
        array_coercion: None,
    }),
]);

TypeSignature::ArraySignature(ArrayFunctionSignature::Array):

TypeSignature::ArraySignature(ArrayFunctionSignature::Array {
    arguments: vec![ArrayFunctionArgument::Array],
    array_coercion: None,
});

Alternatively, you can switch to using one of the following functions, which take care of constructing the TypeSignature for you (see the sketch after this list):

  • Signature::array_and_element

  • Signature::array_and_element_and_optional_index

  • Signature::array_and_index

  • Signature::array
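
For example, the ArrayAndElement conversion shown above can be written with the convenience constructor (a minimal sketch):

use datafusion_expr::{Signature, Volatility};

// Equivalent to declaring Array + Element arguments with list coercion
let signature = Signature::array_and_element(Volatility::Immutable);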