# Upgrade Guides
## DataFusion 54.0.0
Note: DataFusion 54.0.0 has not been released yet. The information provided
in this section pertains to features and changes that have already been merged
to the main branch and are awaiting release in this version.
### String/numeric comparison coercion now prefers numeric types
Previously, comparing a numeric column with a string value (e.g.,
`WHERE int_col > '100'`) coerced both sides to strings and performed a
lexicographic comparison. This produced surprising results: for example,
`5 > '100'` yielded `true` because `'5' > '1'` lexicographically, even
though `5 > 100` is false numerically.

DataFusion now coerces the string side to the numeric type in comparison
contexts (`=`, `<`, `>`, `<=`, `>=`, `<>`, `IN`, `BETWEEN`, `CASE .. WHEN`,
`GREATEST`, `LEAST`). For example, `5 > '100'` will now yield `false`.
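The old and new behavior can be reproduced with plain Rust string and integer comparisons (a standalone sketch of the coercion semantics, not DataFusion API):

```rust
fn main() {
    // Old behavior: both sides coerced to strings, compared lexicographically.
    let lexicographic = "5" > "100"; // true, because '5' > '1'
    // New behavior: the string side is coerced to the numeric type.
    let rhs: i64 = "100".parse().expect("'100' parses as an integer");
    let numeric = 5_i64 > rhs; // false, because 5 > 100 is false
    assert!(lexicographic);
    assert!(!numeric);
    println!("lexicographic: {lexicographic}, numeric: {numeric}");
}
```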
Who is affected:

- Queries that compare numeric values with string values
- Queries that use `IN` lists with mixed string and numeric types
- Queries that use `CASE expr WHEN` with mixed string and numeric types
- Queries that use `GREATEST` or `LEAST` with mixed string and numeric types
Behavioral changes:

| Expression | Old behavior | New behavior |
|---|---|---|
|  | Lexicographic | Numeric |
|  | String | Numeric |
|  | String comparison, always false | Cast error |
|  | Coerce to `Utf8` | Cast error |
|  | String | Numeric |
|  | Coerce to `Utf8` | Coerce to `Float64` |
|  | `Utf8` | `Int64` |
|  | `Utf8` | `Int64` |
Migration guide:

Most queries will produce more correct results with no changes needed. However, queries that relied on the old string-comparison behavior may need adjustment:

- Queries comparing numeric columns with non-numeric strings (e.g., `int_col = 'hello'` or `int_col > text_col` where `text_col` contains non-numeric values) will now produce a cast error instead of silently returning no rows.
- Mixed-type `IN` lists (e.g., `str_col IN ('a', 1)`) are now rejected. Use consistent types for the `IN` list or add an explicit `CAST`.
- Queries comparing integer columns with non-integer numeric string literals (e.g., `int_col = '99.99'`) will now produce a cast error because `'99.99'` cannot be cast to an integer. Use a float column or adjust the literal.
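The new cast errors mirror ordinary numeric parsing rules, which can be sketched with std alone: a fractional or non-numeric literal cannot be converted to an integer type, while a float conversion of `'99.99'` succeeds (an analogy to the cast behavior, not the DataFusion cast kernel itself):

```rust
fn main() {
    // '99.99' cannot be converted to an integer, so int_col = '99.99' now errors...
    assert!("99.99".parse::<i64>().is_err());
    // ...while a float target type succeeds.
    assert!("99.99".parse::<f64>().is_ok());
    // Non-numeric strings such as 'hello' fail against any numeric type.
    assert!("hello".parse::<f64>().is_err());
    println!("parse checks passed");
}
```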
### `comparison_coercion_numeric` removed, replaced by `comparison_coercion`
The `comparison_coercion_numeric` function has been removed. Its behavior
(preferring numeric types for string/numeric comparisons) is now the default in
`comparison_coercion`. A new function, `type_union_coercion`, handles contexts
where string types are preferred (`UNION`, `CASE THEN/ELSE`, `NVL2`).
Who is affected:

- Crates that call `comparison_coercion_numeric` directly
- Crates that call `comparison_coercion` and relied on its old string-preferring behavior
- Crates that call `get_coerce_type_for_case_expression`
### `ExecutionPlan::apply_expressions` is now a required method
`apply_expressions` has been added as a required method on the `ExecutionPlan` trait (no default implementation). The same applies to the `FileSource` and `DataSource` traits. Any custom implementation of these traits must now implement `apply_expressions`.
Who is affected:

- Users who implement custom `ExecutionPlan` nodes
- Users who implement custom `FileSource` or `DataSource` sources
Migration guide:

Add `apply_expressions` to your implementation. Call `f` on each top-level `PhysicalExpr` your node owns, using `visit_sibling` to correctly propagate `TreeNodeRecursion`:
Node with no expressions:

```rust
fn apply_expressions(
    &self,
    _f: &mut dyn FnMut(&dyn PhysicalExpr) -> Result<TreeNodeRecursion>,
) -> Result<TreeNodeRecursion> {
    Ok(TreeNodeRecursion::Continue)
}
```
Node with a single expression:

```rust
fn apply_expressions(
    &self,
    f: &mut dyn FnMut(&dyn PhysicalExpr) -> Result<TreeNodeRecursion>,
) -> Result<TreeNodeRecursion> {
    f(self.predicate.as_ref())
}
```
Node with multiple expressions:

```rust
fn apply_expressions(
    &self,
    f: &mut dyn FnMut(&dyn PhysicalExpr) -> Result<TreeNodeRecursion>,
) -> Result<TreeNodeRecursion> {
    let mut tnr = TreeNodeRecursion::Continue;
    for expr in &self.expressions {
        tnr = tnr.visit_sibling(|| f(expr.as_ref()))?;
    }
    Ok(tnr)
}
```
Node whose only expressions are in `output_ordering()` (e.g. a synthetic test node with no owned expression fields):

```rust
fn apply_expressions(
    &self,
    f: &mut dyn FnMut(&dyn PhysicalExpr) -> Result<TreeNodeRecursion>,
) -> Result<TreeNodeRecursion> {
    let mut tnr = TreeNodeRecursion::Continue;
    if let Some(ordering) = self.cache.output_ordering() {
        for sort_expr in ordering {
            tnr = tnr.visit_sibling(|| f(sort_expr.expr.as_ref()))?;
        }
    }
    Ok(tnr)
}
```
### `ExecutionPlan::partition_statistics` now returns `Arc<Statistics>`
`ExecutionPlan::partition_statistics` now returns `Result<Arc<Statistics>>` instead of `Result<Statistics>`. This avoids cloning `Statistics` when it is shared across multiple consumers.
Before:

```rust
fn partition_statistics(&self, partition: Option<usize>) -> Result<Statistics> {
    Ok(Statistics::new_unknown(&self.schema()))
}
```

After:

```rust
fn partition_statistics(&self, partition: Option<usize>) -> Result<Arc<Statistics>> {
    Ok(Arc::new(Statistics::new_unknown(&self.schema())))
}
```
If you need an owned `Statistics` value (e.g. to mutate it), use `Arc::unwrap_or_clone`:
```rust
// If you previously consumed the Statistics directly:
let stats = plan.partition_statistics(None)?;
stats.column_statistics[0].min_value = ...;

// Now unwrap the Arc first:
let mut stats = Arc::unwrap_or_clone(plan.partition_statistics(None)?);
stats.column_statistics[0].min_value = ...;
```
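The `Arc` return type pays off when statistics are shared: `Arc::unwrap_or_clone` (from std) moves the value out when the caller holds the only reference and clones only when it is still shared. A standalone sketch with a `Vec` in place of `Statistics`:

```rust
use std::sync::Arc;

fn main() {
    let shared = Arc::new(vec![1, 2, 3]);
    let second = Arc::clone(&shared);

    // Two references exist, so this call must clone the inner Vec.
    let owned = Arc::unwrap_or_clone(shared);
    assert_eq!(owned, vec![1, 2, 3]);

    // `second` is now the only reference: the Vec is moved out, no clone.
    let owned2 = Arc::unwrap_or_clone(second);
    assert_eq!(owned2, vec![1, 2, 3]);
    println!("unwrap_or_clone checks passed");
}
```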
### Remove `as_any` from `ScalarUDFImpl`, `AggregateUDFImpl`, `WindowUDFImpl`, `ExecutionPlan`, `TableProvider`, `SchemaProvider`, `CatalogProvider`, and `CatalogProviderList`
Now that we have a more recent minimum version of Rust, we can take advantage of
trait upcasting. This reduces the amount of boilerplate code that
users need to implement. In your implementations of `ScalarUDFImpl`,
`AggregateUDFImpl`, `WindowUDFImpl`, `ExecutionPlan`, `TableProvider`,
`SchemaProvider`, `CatalogProvider`, and `CatalogProviderList`, you can simply
remove the `as_any` function. The diffs below are examples from the associated
PRs.
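Trait upcasting is the mechanism that makes this possible: a reference to any trait that has `Any` as a supertrait can be coerced directly to `&dyn Any` (stable since Rust 1.86), so a hand-written `as_any` is no longer needed. A minimal standalone sketch using a hypothetical `PlanLike` trait and `MyExec` struct (placeholders, not DataFusion types):

```rust
use std::any::Any;

// Hypothetical stand-ins for a DataFusion trait and its implementor.
trait PlanLike: Any {
    fn name(&self) -> &'static str;
}

struct MyExec;

impl PlanLike for MyExec {
    fn name(&self) -> &'static str {
        "MyExec"
    }
}

fn main() {
    let plan: &dyn PlanLike = &MyExec;
    // Trait upcasting coercion: &dyn PlanLike -> &dyn Any, no as_any needed.
    let any: &dyn Any = plan;
    assert!(any.downcast_ref::<MyExec>().is_some());
    println!("upcast and downcast of {} succeeded", plan.name());
}
```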
Scalar UDFs:

```diff
 impl ScalarUDFImpl for MyEq {
-    fn as_any(&self) -> &dyn Any {
-        self
-    }
-
     fn name(&self) -> &str {
         "my_eq"
     }
     ...
 }
```
Aggregate UDFs:

```diff
 impl AggregateUDFImpl for GeoMeanUdf {
-    fn as_any(&self) -> &dyn Any {
-        self
-    }
-
     fn name(&self) -> &str {
         "geo_mean"
     }
     ...
 }
```
Window UDFs:

```diff
 impl WindowUDFImpl for SmoothIt {
-    fn as_any(&self) -> &dyn Any {
-        self
-    }
-
     fn name(&self) -> &str {
         "smooth_it"
     }
     ...
 }
```
Execution Plans:

```diff
 impl ExecutionPlan for MyExec {
-    fn as_any(&self) -> &dyn Any {
-        self
-    }
-
     fn name(&self) -> &'static str {
         "MyExec"
     }
     ...
 }
```
Table Providers:

```diff
 impl TableProvider for MyTable {
-    fn as_any(&self) -> &dyn Any {
-        self
-    }
-
     fn schema(&self) -> SchemaRef {
         ...
     }
     ...
 }
```
Schema Providers:

```diff
 impl SchemaProvider for MySchema {
-    fn as_any(&self) -> &dyn Any {
-        self
-    }
-
     fn table_names(&self) -> Vec<String> {
         ...
     }
     ...
 }
```
Catalog Providers:

```diff
 impl CatalogProvider for MyCatalog {
-    fn as_any(&self) -> &dyn Any {
-        self
-    }
-
     fn schema_names(&self) -> Vec<String> {
         ...
     }
     ...
 }
```
Catalog Provider Lists:

```diff
 impl CatalogProviderList for MyCatalogList {
-    fn as_any(&self) -> &dyn Any {
-        self
-    }
-
     fn register_catalog(
         ...
     }
     ...
 }
```
If you have code that is downcasting, you can use the new `downcast_ref`
and `is` methods defined directly on each trait object:
Before:

```rust
let exec = plan.as_any().downcast_ref::<MyExec>().unwrap();
let udf = scalar_udf.as_any().downcast_ref::<MyUdf>().unwrap();
let table = table_provider.as_any().downcast_ref::<MyTable>().unwrap();
```

After:

```rust
let exec = plan.downcast_ref::<MyExec>().unwrap();
let udf = scalar_udf.downcast_ref::<MyUdf>().unwrap();
let table = table_provider.downcast_ref::<MyTable>().unwrap();
```
These methods are available on `dyn ExecutionPlan`, `dyn TableProvider`,
`dyn SchemaProvider`, `dyn CatalogProvider`, `dyn CatalogProviderList`,
`dyn ScalarUDFImpl`, `dyn AggregateUDFImpl`, and `dyn WindowUDFImpl`.
They work correctly whether the value is a bare reference or behind an `Arc`
(Rust auto-derefs through the `Arc`).
Warning: Do not cast an `Arc<dyn Trait>` directly to `&dyn Any`. Writing `(&plan as &dyn Any)` gives you an `Any` reference to the `Arc` itself, not the underlying trait object, so the downcast will always return `None`. Use the `downcast_ref` method above instead, or dereference through the `Arc` first with `plan.as_ref() as &dyn Any`.
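The pitfall can be demonstrated with std alone, using a concrete placeholder type (`MyExec` here is illustrative, not a DataFusion type):

```rust
use std::any::Any;
use std::sync::Arc;

struct MyExec;

fn main() {
    let plan: Arc<MyExec> = Arc::new(MyExec);

    // Wrong: this is an Any view of the Arc itself, not of MyExec.
    let arc_as_any: &dyn Any = &plan;
    assert!(arc_as_any.downcast_ref::<MyExec>().is_none());
    assert!(arc_as_any.downcast_ref::<Arc<MyExec>>().is_some());

    // Right: dereference through the Arc first to view the inner value.
    let inner_as_any: &dyn Any = plan.as_ref();
    assert!(inner_as_any.downcast_ref::<MyExec>().is_some());
    println!("downcast checks passed");
}
```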
### `PruningStatistics::row_counts` no longer takes a column parameter
The `row_counts` method on the `PruningStatistics` trait no longer takes a
`&Column` argument, since row counts are a container-level property (the same
for every column).
Before:

```rust
fn row_counts(&self, column: &Column) -> Option<ArrayRef> {
    // ...
}
```

After:

```rust
fn row_counts(&self) -> Option<ArrayRef> {
    // ...
}
```
Who is affected:

- Users who implement the `PruningStatistics` trait
Migration guide:

Remove the `column: &Column` parameter from your `row_counts` implementation
and any corresponding call sites. If your implementation was using the column
argument, note that row counts are identical for all columns in a container, so
the parameter was unnecessary.
See PR #21369 for details.
### Avro API and timestamp decoding changes
DataFusion now uses `arrow-avro` (see #17861) when reading Avro files,
which results in a few changes:

- `DataFusionError::AvroError` has been removed.
- `From<apache_avro::Error> for DataFusionError` has been removed.
- The Avro crate re-export changed: `datafusion::apache_avro` is now `datafusion::arrow_avro`.
Avro timestamp logical type interpretation changed. Notable effects:

- Avro `timestamp-*` logical types are read as UTC timezone-aware Arrow timestamps (`Timestamp(..., Some("+00:00"))`)
- Avro `local-timestamp-*` logical types remain timezone-naive (`Timestamp(..., None)`)
Who is affected:

- Users matching on `DataFusionError::AvroError`
- Users importing `datafusion::apache_avro`
- Users relying on previous Avro timestamp logical type behavior
Migration guide:

- Replace `datafusion::apache_avro` imports with `datafusion::arrow_avro`.
- Update error handling code that matches on `DataFusionError::AvroError` to use the current error surface.
- Validate timestamp handling where timezone semantics matter: `timestamp-*` is UTC timezone-aware, while `local-timestamp-*` is timezone-naive.
### `lpad`, `rpad`, and `translate` now operate on Unicode codepoints instead of grapheme clusters
Previously, `lpad`, `rpad`, and `translate` used Unicode grapheme cluster
segmentation to measure and manipulate strings. They now use Unicode codepoints,
which is consistent with the SQL standard and most other SQL implementations. It
also matches the behavior of other string-related functions in DataFusion.

The difference is only observable for strings containing combining characters (e.g., U+0301 COMBINING ACUTE ACCENT) or other multi-codepoint grapheme clusters (e.g., ZWJ emoji sequences). For ASCII and most common Unicode text, behavior is unchanged.
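The distinction is easy to see with std alone: "é" written as `e` plus U+0301 is one grapheme cluster on screen but two codepoints, so codepoint-based padding now counts it as 2. A standalone sketch of left-padding by codepoint count (illustrative, not DataFusion's `lpad` kernel):

```rust
fn main() {
    // "é" as 'e' + U+0301 COMBINING ACUTE ACCENT: one grapheme, two codepoints.
    let s = "e\u{301}";
    assert_eq!(s.chars().count(), 2);
    // Precomposed U+00E9 is a single codepoint.
    assert_eq!("\u{e9}".chars().count(), 1);

    // Sketch of lpad measuring by codepoints: pad to a width of 5 with spaces.
    let width = 5;
    let padded: String = std::iter::repeat(' ')
        .take(width - s.chars().count())
        .chain(s.chars())
        .collect();
    assert_eq!(padded.chars().count(), width);
    println!("padded: {padded:?}");
}
```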