Upgrade Guides#

DataFusion 54.0.0#

Note: DataFusion 54.0.0 has not been released yet. This section describes features and changes that have already been merged to the main branch and will ship in that release.

String/numeric comparison coercion now prefers numeric types#

Previously, comparing a numeric column with a string value (e.g., WHERE int_col > '100') coerced both sides to strings and performed a lexicographic comparison. This produced surprising results — for example, 5 > '100' yielded true because '5' > '1' lexicographically, even though 5 > 100 is false numerically.

DataFusion now coerces the string side to the numeric type in comparison contexts (=, <, >, <=, >=, <>, IN, BETWEEN, CASE .. WHEN, GREATEST, LEAST). For example, 5 > '100' will now yield false.
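The surprising old behavior is easy to reproduce in plain Rust, without DataFusion: comparing the string forms uses lexicographic order, while comparing parsed numbers uses numeric order. A minimal standalone sketch (the helper names are illustrative, not DataFusion APIs):

```rust
// Comparing strings is lexicographic; comparing numbers is numeric.
fn lexicographic_gt(a: &str, b: &str) -> bool {
    a > b
}

fn numeric_gt(a: i64, b: i64) -> bool {
    a > b
}

fn main() {
    // Old behavior: both sides coerced to strings, so "5" > "100"
    // because '5' sorts after '1' in the first position.
    assert!(lexicographic_gt("5", "100"));
    // New behavior: the string side is parsed as a number, and 5 > 100 is false.
    assert!(!numeric_gt(5, "100".parse::<i64>().unwrap()));
}
```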

Who is affected:

  • Queries that compare numeric values with string values

  • Queries that use IN lists with mixed string and numeric types

  • Queries that use CASE expr WHEN with mixed string and numeric types

  • Queries that use GREATEST or LEAST with mixed string and numeric types

Behavioral changes:

Expression                | Old behavior                     | New behavior
--------------------------|----------------------------------|------------------------------------------
int_col > '100'           | Lexicographic                    | Numeric
float_col = '5'           | String: '5' != '5.0'             | Numeric: 5.0 = 5.0
int_col = 'hello'         | String comparison, always false  | Cast error
str_col IN ('a', 1)       | Coerce to Utf8                   | Cast error ('a' cannot be cast to Int64)
float_col IN ('1.0')      | String: '1.0' != '1'             | Numeric: 1.0 = 1.0
CASE str_col WHEN 1.0     | Coerce to Utf8                   | Coerce to Float64
GREATEST(10, '9')         | Utf8 '9' (lexicographic)         | Int64 10 (numeric)
LEAST(10, '9')            | Utf8 '10' (lexicographic)        | Int64 9 (numeric)

Migration guide:

Most queries will produce more correct results with no changes needed. However, queries that relied on the old string-comparison behavior may need adjustment:

  • Queries comparing numeric columns with non-numeric strings (e.g., int_col = 'hello' or int_col > text_col where text_col contains non-numeric values) will now produce a cast error instead of silently returning no rows.

  • Mixed-type IN lists (e.g., str_col IN ('a', 1)) are now rejected. Use consistent types for the IN list or add an explicit CAST.

  • Queries comparing integer columns with non-integer numeric string literals (e.g., int_col = '99.99') will now produce a cast error because '99.99' cannot be cast to an integer. Use a float column or adjust the literal.
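The cast errors above follow the same rules as ordinary numeric parsing: a string that is not a valid literal of the target type cannot be converted. A plain-Rust sketch of the analogous parses (this is an illustration of the rule, not DataFusion's actual cast code):

```rust
fn main() {
    // '100' is a valid integer literal, so int_col > '100' compares numerically.
    assert_eq!("100".parse::<i64>().unwrap(), 100);
    // 'hello' is not numeric: int_col = 'hello' becomes a cast error.
    assert!("hello".parse::<i64>().is_err());
    // '99.99' is not a valid *integer* literal, so int_col = '99.99' errors too...
    assert!("99.99".parse::<i64>().is_err());
    // ...but it is a valid float, so comparing against a float column works.
    assert_eq!("99.99".parse::<f64>().unwrap(), 99.99);
}
```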

See #15161 and PR #20426 for details.

CastColumnExpr removed in favor of field-aware CastExpr#

datafusion_physical_expr::expressions::CastColumnExpr has been removed; use the field-aware datafusion_physical_expr::expressions::CastExpr instead.

If your code downcasted to CastColumnExpr, downcast to CastExpr instead and use CastExpr::target_field() for the output field metadata and CastExpr::expr() for the input expression. To construct casts with explicit field semantics, use CastExpr::new_with_target_field(...). The type-only CastExpr::new(...) and cast(...) helpers remain available for callers that only have a DataType.

comparison_coercion_numeric removed, replaced by comparison_coercion#

The comparison_coercion_numeric function has been removed. Its behavior (preferring numeric types for string/numeric comparisons) is now the default in comparison_coercion. A new function, type_union_coercion, handles contexts where string types are preferred (UNION, CASE THEN/ELSE, NVL2).

Who is affected:

  • Crates that call comparison_coercion_numeric directly

  • Crates that call comparison_coercion and relied on its old string-preferring behavior

  • Crates that call get_coerce_type_for_case_expression

ExecutionPlan::apply_expressions is now a required method#

apply_expressions has been added as a required method on the ExecutionPlan trait (no default implementation). The same applies to the FileSource and DataSource traits. Any custom implementation of these traits must now implement apply_expressions.

Who is affected:

  • Users who implement custom ExecutionPlan nodes

  • Users who implement custom FileSource or DataSource sources

Migration guide:

Add apply_expressions to your implementation. Call f on each top-level PhysicalExpr your node owns, using visit_sibling to correctly propagate TreeNodeRecursion:

Node with no expressions:

fn apply_expressions(
    &self,
    _f: &mut dyn FnMut(&dyn PhysicalExpr) -> Result<TreeNodeRecursion>,
) -> Result<TreeNodeRecursion> {
    Ok(TreeNodeRecursion::Continue)
}

Node with a single expression:

fn apply_expressions(
    &self,
    f: &mut dyn FnMut(&dyn PhysicalExpr) -> Result<TreeNodeRecursion>,
) -> Result<TreeNodeRecursion> {
    f(self.predicate.as_ref())
}

Node with multiple expressions:

fn apply_expressions(
    &self,
    f: &mut dyn FnMut(&dyn PhysicalExpr) -> Result<TreeNodeRecursion>,
) -> Result<TreeNodeRecursion> {
    let mut tnr = TreeNodeRecursion::Continue;
    for expr in &self.expressions {
        tnr = tnr.visit_sibling(|| f(expr.as_ref()))?;
    }
    Ok(tnr)
}

Node whose only expressions are in output_ordering() (e.g. a synthetic test node with no owned expression fields):

fn apply_expressions(
    &self,
    f: &mut dyn FnMut(&dyn PhysicalExpr) -> Result<TreeNodeRecursion>,
) -> Result<TreeNodeRecursion> {
    let mut tnr = TreeNodeRecursion::Continue;
    if let Some(ordering) = self.cache.output_ordering() {
        for sort_expr in ordering {
            tnr = tnr.visit_sibling(|| f(sort_expr.expr.as_ref()))?;
        }
    }
    Ok(tnr)
}
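To see why visit_sibling matters, here is a simplified stand-in model (not DataFusion's actual type, and the Jump variant is omitted): once a callback returns Stop, later sibling expressions must not be visited.

```rust
// Simplified stand-in for DataFusion's TreeNodeRecursion / visit_sibling.
#[derive(Clone, Copy, Debug, PartialEq)]
enum TreeNodeRecursion {
    Continue,
    Stop,
}

impl TreeNodeRecursion {
    // Run `f` for the next sibling only if traversal has not been stopped.
    fn visit_sibling(
        self,
        f: impl FnOnce() -> Result<TreeNodeRecursion, String>,
    ) -> Result<TreeNodeRecursion, String> {
        match self {
            TreeNodeRecursion::Stop => Ok(TreeNodeRecursion::Stop),
            TreeNodeRecursion::Continue => f(),
        }
    }
}

fn main() -> Result<(), String> {
    let mut visited = Vec::new();
    let exprs = ["a", "b (stops)", "c"];
    let mut tnr = TreeNodeRecursion::Continue;
    for e in exprs {
        tnr = tnr.visit_sibling(|| {
            visited.push(e);
            Ok(if e.contains("stops") {
                TreeNodeRecursion::Stop
            } else {
                TreeNodeRecursion::Continue
            })
        })?;
    }
    // "c" is never visited because "b" requested Stop.
    assert_eq!(visited, ["a", "b (stops)"]);
    Ok(())
}
```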

ExecutionPlan::partition_statistics now returns Arc<Statistics>#

ExecutionPlan::partition_statistics now returns Result<Arc<Statistics>> instead of Result<Statistics>. This avoids cloning Statistics when it is shared across multiple consumers.

Before:

fn partition_statistics(&self, partition: Option<usize>) -> Result<Statistics> {
    Ok(Statistics::new_unknown(&self.schema()))
}

After:

fn partition_statistics(&self, partition: Option<usize>) -> Result<Arc<Statistics>> {
    Ok(Arc::new(Statistics::new_unknown(&self.schema())))
}

If you need an owned Statistics value (e.g. to mutate it), use Arc::unwrap_or_clone:

// If you previously consumed the Statistics directly:
let stats = plan.partition_statistics(None)?;
stats.column_statistics[0].min_value = ...;

// Now unwrap the Arc first:
let mut stats = Arc::unwrap_or_clone(plan.partition_statistics(None)?);
stats.column_statistics[0].min_value = ...;
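Arc::unwrap_or_clone moves the value out when the Arc is the sole owner and clones only when it is shared, so the common single-consumer path stays clone-free. A standalone sketch of that semantics:

```rust
use std::sync::Arc;

fn main() {
    // Sole owner: the inner value is moved out; no clone occurs.
    let unique = Arc::new(vec![1, 2, 3]);
    let owned = Arc::unwrap_or_clone(unique);
    assert_eq!(owned, vec![1, 2, 3]);

    // Shared: the other handle keeps the original; we get a clone to mutate.
    let shared = Arc::new(vec![1, 2, 3]);
    let other = Arc::clone(&shared);
    let mut owned = Arc::unwrap_or_clone(shared);
    owned.push(4);
    assert_eq!(owned, vec![1, 2, 3, 4]);
    assert_eq!(*other, vec![1, 2, 3]); // the shared copy is untouched
}
```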

Remove as_any from PhysicalExpr, ScalarUDFImpl, AggregateUDFImpl, WindowUDFImpl, ExecutionPlan, TableProvider, SchemaProvider, CatalogProvider, CatalogProviderList, TableSource, FileSource, FileFormat, FileFormatFactory, DataSource, and DataSink#

Now that we have a more recent minimum version of Rust, we can take advantage of trait upcasting. This reduces the amount of boilerplate code that users need to implement. In your implementations of the traits listed above, you can simply remove the as_any function. For example:

 impl PhysicalExpr for MyExpr {
-    fn as_any(&self) -> &dyn Any {
-        self
-    }
-
     fn data_type(&self, input_schema: &Schema) -> Result<DataType> {
         ...
     }

     ...
 }

The same change applies to all of the above traits — simply delete the as_any method from each implementation.

If you have code that is downcasting, you can drop the .as_any() call and use downcast_ref / is directly on the trait object:

-let exec = plan.as_any().downcast_ref::<MyExec>().unwrap();
+let exec = plan.downcast_ref::<MyExec>().unwrap();

These methods work correctly whether the value is a bare reference or behind an Arc (Rust auto-derefs through the Arc).

Warning: Do not cast an Arc<dyn Trait> directly to &dyn Any. Writing (&plan as &dyn Any) gives you an Any reference to the Arc itself, not the underlying trait object, so the downcast will always return None. Use the downcast_ref method above instead, or dereference through the Arc first with plan.as_ref() as &dyn Any.
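The mechanism behind this change can be shown in a few lines of standalone Rust (the Plan trait and MyExec type here are illustrative stand-ins, not DataFusion types). With trait upcasting, any trait that has Any as a supertrait can be upcast to &dyn Any and downcast directly:

```rust
use std::any::Any;

// Stand-in for a trait like ExecutionPlan; Any is a supertrait,
// so no hand-written as_any method is needed.
trait Plan: Any {
    fn name(&self) -> &str;
}

struct MyExec;

impl Plan for MyExec {
    fn name(&self) -> &str {
        "MyExec"
    }
}

fn main() {
    let plan: Box<dyn Plan> = Box::new(MyExec);
    // Upcast &dyn Plan -> &dyn Any (a built-in coercion), then downcast.
    let any: &dyn Any = plan.as_ref();
    assert!(any.downcast_ref::<MyExec>().is_some());
    assert_eq!(plan.name(), "MyExec");
}
```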

PruningStatistics::row_counts no longer takes a column parameter#

The row_counts method on the PruningStatistics trait no longer takes a &Column argument, since row counts are a container-level property (the same for every column).

Before:

fn row_counts(&self, column: &Column) -> Option<ArrayRef> {
    // ...
}

After:

fn row_counts(&self) -> Option<ArrayRef> {
    // ...
}

Who is affected:

  • Users who implement the PruningStatistics trait

Migration guide:

Remove the column: &Column parameter from your row_counts implementation and any corresponding call sites. If your implementation was using the column argument, note that row counts are identical for all columns in a container, so the parameter was unnecessary.

See PR #21369 for details.

Avro API and timestamp decoding changes#

DataFusion now uses arrow-avro (see #17861) to read Avro files, which results in a few changes:

  • DataFusionError::AvroError has been removed.

  • From<apache_avro::Error> for DataFusionError has been removed.

  • Avro crate re-export changed:

    • Before: datafusion::apache_avro

    • After: datafusion::arrow_avro

  • Avro timestamp logical type interpretation changed. Notable effects:

    • Avro timestamp-* logical types are read as UTC timezone-aware Arrow timestamps (Timestamp(..., Some("+00:00")))

    • Avro local-timestamp-* logical types remain timezone-naive (Timestamp(..., None))

Who is affected:

  • Users matching on DataFusionError::AvroError

  • Users importing datafusion::apache_avro

  • Users relying on previous Avro timestamp logical type behavior

Migration guide:

  • Replace datafusion::apache_avro imports with datafusion::arrow_avro.

  • Update error handling code that matches on DataFusionError::AvroError to use the current error surface.

  • Validate timestamp handling where timezone semantics matter: timestamp-* is UTC timezone-aware, while local-timestamp-* is timezone-naive.

lpad, rpad, and translate now operate on Unicode codepoints instead of grapheme clusters#

Previously, lpad, rpad, and translate used Unicode grapheme cluster segmentation to measure and manipulate strings. They now use Unicode codepoints, which is consistent with the SQL standard and most other SQL implementations. It also matches the behavior of other string-related functions in DataFusion.

The difference is only observable for strings containing combining characters (e.g., U+0301 COMBINING ACUTE ACCENT) or other multi-codepoint grapheme clusters (e.g., ZWJ emoji sequences). For ASCII and most common Unicode text, behavior is unchanged.
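A standalone sketch of codepoint-based padding (the new behavior) makes the difference concrete. The lpad_codepoints helper below is illustrative, not DataFusion's implementation; it truncates when the input is already longer than the target length, as SQL lpad does. "e\u{301}" renders as one user-perceived character (é) but is two codepoints:

```rust
// Pad `s` on the left with `pad` until it is `len` codepoints long,
// truncating to `len` codepoints if it is already longer.
fn lpad_codepoints(s: &str, len: usize, pad: char) -> String {
    let n = s.chars().count();
    if n >= len {
        s.chars().take(len).collect()
    } else {
        std::iter::repeat(pad).take(len - n).chain(s.chars()).collect()
    }
}

fn main() {
    let combining = "e\u{301}"; // one grapheme cluster, two codepoints
    assert_eq!(combining.chars().count(), 2);
    // Measured as 2 codepoints, so only one pad char is added to reach 3
    // (grapheme-based measurement would have added two).
    assert_eq!(lpad_codepoints(combining, 3, '*'), "*e\u{301}");
    // ASCII behavior is identical under both measurements.
    assert_eq!(lpad_codepoints("ab", 4, '*'), "**ab");
}
```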

Scalar subquery execution changes#

Uncorrelated scalar subqueries (e.g. SELECT ... WHERE x > (SELECT max(v) FROM t)) are now executed by a dedicated physical operator rather than being rewritten to a join. Correlated scalar subqueries are unchanged.

This produces two user-visible changes:

  • Subqueries that return multiple rows now fail at runtime. An uncorrelated scalar subquery that returns more than one row fails with Execution error: Scalar subquery returned more than one row. This matches the SQL standard and the behavior of most other SQL implementations. The previous join-based rewrite could silently produce multi-row output. Add a LIMIT 1 or an aggregate to the subquery to fix such queries.

  • Plan shape changes. Uncorrelated Expr::ScalarSubquery nodes now survive into the final logical plan instead of being replaced by a join; the corresponding physical plan contains a new ScalarSubqueryExec node and a ScalarSubqueryExpr expression. Code that walks or transforms LogicalPlan / ExecutionPlan trees, as well as EXPLAIN output, may need updating.

datafusion-proto: expression deserialization now takes a TaskContext#

Serializeable::from_bytes_with_ctx replaces from_bytes_with_registry and takes a &TaskContext instead of a &dyn FunctionRegistry. parse_expr, parse_exprs, and parse_sorts change in the same way. Expr::from_bytes (without a registry argument) is unchanged.

-let expr = Expr::from_bytes_with_registry(&bytes, &registry)?;
+let expr = Expr::from_bytes_with_ctx(&bytes, ctx.task_ctx().as_ref())?;
-let expr = parse_expr(&proto, &registry, &codec)?;
+let expr = parse_expr(&proto, ctx.task_ctx().as_ref(), &codec)?;

datafusion-proto: PhysicalProtoConverterExtension reshaped#

PhysicalProtoConverterExtension and the parse_physical_*_with_converter helpers now take a single &PhysicalPlanDecodeContext<'_> that bundles the TaskContext and the PhysicalExtensionCodec. Implementations update like this:

 impl PhysicalProtoConverterExtension for MyConverter {
     fn proto_to_execution_plan(
         &self,
-        ctx: &TaskContext,
-        codec: &dyn PhysicalExtensionCodec,
         proto: &protobuf::PhysicalPlanNode,
+        ctx: &PhysicalPlanDecodeContext<'_>,
     ) -> Result<Arc<dyn ExecutionPlan>> {
-        proto.try_into_physical_plan_with_converter(ctx, codec, self)
+        self.default_proto_to_execution_plan(proto, ctx)
     }

     fn proto_to_physical_expr(
         &self,
         proto: &PhysicalExprNode,
-        ctx: &TaskContext,
         input_schema: &Schema,
-        codec: &dyn PhysicalExtensionCodec,
+        ctx: &PhysicalPlanDecodeContext<'_>,
     ) -> Result<Arc<dyn PhysicalExpr>> {
-        parse_physical_expr_with_converter(proto, ctx, input_schema, codec, self)
+        self.default_proto_to_physical_expr(proto, input_schema, ctx)
     }
 }

Pull out the TaskContext or codec inside these methods with ctx.task_ctx() and ctx.codec(). Construct a fresh context at an API boundary with PhysicalPlanDecodeContext::new(task_ctx, codec).

ExecutionProps has new fields#

ExecutionProps gained new public fields. Code that constructs it via a struct literal, or pattern-matches it without .., no longer compiles. Use ExecutionProps::new() and include .. in exhaustive patterns.
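The fix is the standard pattern for structs that grow fields: functional update syntax for construction and a `..` rest pattern for matching. A sketch with a hypothetical stand-in struct (the real type should be built with ExecutionProps::new()):

```rust
// Hypothetical stand-in for ExecutionProps.
#[derive(Default)]
#[allow(dead_code)]
struct Props {
    batch_size: usize,
    verbose: bool,
    // New fields can be added here without breaking the code below.
}

fn main() {
    // Construct via Default plus functional update, not an exhaustive literal.
    let p = Props {
        batch_size: 8192,
        ..Default::default()
    };
    // Match with `..` so newly added fields do not break the pattern.
    let Props { batch_size, .. } = p;
    assert_eq!(batch_size, 8192);
}
```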

Items in datafusion_functions::strings are no longer public#

StringArrayBuilder, LargeStringArrayBuilder, StringViewArrayBuilder, ColumnarValueRef, and append_view have been reduced to pub(crate). They were only ever used to implement concat and concat_ws inside the crate. If you were importing them externally, use Arrow’s corresponding builders with a caller-computed NullBuffer.

Conversion from FileDecryptionProperties to ConfigFileDecryptionProperties is now fallible#

Previously, datafusion_common::config::ConfigFileDecryptionProperties implemented From<&Arc<parquet::encryption::decrypt::FileDecryptionProperties>>. If an error was encountered when retrieving the footer key without providing key metadata, the error would be ignored and an empty footer key set in the result. This could lead to obscure errors later.

ConfigFileDecryptionProperties now instead implements TryFrom<&Arc<FileDecryptionProperties>>, and errors retrieving the footer key will be propagated up.

Migration guide:

Replace calls to ConfigFileDecryptionProperties::from with ConfigFileDecryptionProperties::try_from (and affected into calls with try_into), adding appropriate error handling.

Before:

let config_decryption_properties: ConfigFileDecryptionProperties = (&decryption_properties).into();
// or
let config_decryption_properties = ConfigFileDecryptionProperties::from(&decryption_properties);

(where decryption_properties is an Arc<FileDecryptionProperties>)

After:

let config_decryption_properties: ConfigFileDecryptionProperties = (&decryption_properties).try_into()?;
// or
let config_decryption_properties = ConfigFileDecryptionProperties::try_from(&decryption_properties)?;

See #21602 and PR #21603 for details.

approx_percentile_cont, approx_percentile_cont_with_weight, approx_median now coerce to floats#

The type signatures of approx_percentile_cont, approx_percentile_cont_with_weight, and approx_median now coerce integer input values to Float64 before computing the approximation. As a result, these functions always return a float, even when the input column is an integer type.

Who is affected:

  • Queries or downstream code that relied on approx_percentile_cont / approx_percentile_cont_with_weight / approx_median returning an integer type when given an integer column.

Migration guide:

If downstream code checks or relies on the return type being an integer, add an explicit CAST back to the desired integer type, or update the type assertion:

-- Before (returned Int64):
SELECT approx_percentile_cont(quantity, 0.5) FROM orders;

-- After (returns Float64); cast if an integer result is required:
SELECT CAST(approx_percentile_cont(quantity, 0.5) AS BIGINT) FROM orders;

Box<C> and Arc<C> TreeNodeContainer impls now require C: Default#

The generic TreeNodeContainer implementations for Box<C> and Arc<C> now require C: Default. This change was necessary as part of optimizing tree rewriting to reduce heap allocations.

Who is affected:

  • Users that implement TreeNodeContainer on a custom type and wrap it in Box or Arc when walking trees.

Migration guide:

Add a Default implementation to your type. The default value is used as a temporary placeholder during query optimization, so when possible, pick a cheap, allocation-free variant:

impl Default for MyTreeNode {
    fn default() -> Self {
        MyTreeNode::Leaf // or whichever variant is cheapest to construct
    }
}