Upgrade Guides#

DataFusion 54.0.0#

Note: DataFusion 54.0.0 has not been released yet. The information provided in this section pertains to features and changes that have already been merged to the main branch and are awaiting release in this version.

String/numeric comparison coercion now prefers numeric types#

Previously, comparing a numeric column with a string value (e.g., WHERE int_col > '100') coerced both sides to strings and performed a lexicographic comparison. This produced surprising results — for example, 5 > '100' yielded true because '5' > '1' lexicographically, even though 5 > 100 is false numerically.

DataFusion now coerces the string side to the numeric type in comparison contexts (=, <, >, <=, >=, <>, IN, BETWEEN, CASE .. WHEN, GREATEST, LEAST). For example, 5 > '100' will now yield false.
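The two orderings can be reproduced with plain Rust comparisons, independent of DataFusion:

```rust
fn main() {
    // Lexicographic (old behavior): "5" > "100" because '5' > '1'
    // at the first character position.
    assert!("5" > "100");
    // Numeric (new behavior): 5 > 100 is false.
    assert!(!(5 > 100));
}
```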

Who is affected:

  • Queries that compare numeric values with string values

  • Queries that use IN lists with mixed string and numeric types

  • Queries that use CASE expr WHEN with mixed string and numeric types

  • Queries that use GREATEST or LEAST with mixed string and numeric types

Behavioral changes:

| Expression | Old behavior | New behavior |
| --- | --- | --- |
| int_col > '100' | Lexicographic | Numeric |
| float_col = '5' | String '5' != '5.0' | Numeric 5.0 = 5.0 |
| int_col = 'hello' | String comparison, always false | Cast error |
| str_col IN ('a', 1) | Coerce to Utf8 | Cast error ('a' cannot be cast to Int64) |
| float_col IN ('1.0') | String '1.0' != '1' | Numeric 1.0 = 1.0 (correct) |
| CASE str_col WHEN 1.0 | Coerce to Utf8 | Coerce to Float64 |
| GREATEST(10, '9') | Utf8 '9' (lexicographic) | Int64 10 (numeric) |
| LEAST(10, '9') | Utf8 '10' (lexicographic) | Int64 9 (numeric) |

Migration guide:

Most queries will produce more correct results with no changes needed. However, queries that relied on the old string-comparison behavior may need adjustment:

  • Queries comparing numeric columns with non-numeric strings (e.g., int_col = 'hello' or int_col > text_col where text_col contains non-numeric values) will now produce a cast error instead of silently returning no rows.

  • Mixed-type IN lists (e.g., str_col IN ('a', 1)) are now rejected. Use consistent types for the IN list or add an explicit CAST.

  • Queries comparing integer columns with non-integer numeric string literals (e.g., int_col = '99.99') will now produce a cast error because '99.99' cannot be cast to an integer. Use a float column or adjust the literal.

See #15161 and PR #20426 for details.

comparison_coercion_numeric removed, replaced by comparison_coercion#

The comparison_coercion_numeric function has been removed. Its behavior (preferring numeric types for string/numeric comparisons) is now the default in comparison_coercion. A new function, type_union_coercion, handles contexts where string types are preferred (UNION, CASE THEN/ELSE, NVL2).

Who is affected:

  • Crates that call comparison_coercion_numeric directly

  • Crates that call comparison_coercion and relied on its old string-preferring behavior

  • Crates that call get_coerce_type_for_case_expression
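As a mental model of the new preference, the following sketch chooses a common comparison type that favors numeric types over strings. Note that Type and comparison_coerce here are hypothetical stand-ins, not DataFusion's real DataType or comparison_coercion API:

```rust
// Illustrative sketch only: `Type` and `comparison_coerce` are hypothetical
// stand-ins, not DataFusion's real `DataType` or `comparison_coercion` API.
#[derive(Debug, Clone, PartialEq)]
enum Type {
    Utf8,
    Int64,
    Float64,
}

/// Pick a common type for comparing `lhs` and `rhs`, preferring numeric
/// types over strings (the new default behavior).
fn comparison_coerce(lhs: &Type, rhs: &Type) -> Option<Type> {
    use Type::*;
    match (lhs, rhs) {
        (a, b) if a == b => Some(a.clone()),
        (Utf8, Int64) | (Int64, Utf8) => Some(Int64),
        (Utf8, Float64) | (Float64, Utf8) => Some(Float64),
        (Int64, Float64) | (Float64, Int64) => Some(Float64),
        // Unreachable for these three variants; the equal-type case is
        // handled by the guard above.
        _ => None,
    }
}

fn main() {
    // A string compared with an integer is now compared as Int64,
    // where previously both sides were coerced to Utf8.
    assert_eq!(comparison_coerce(&Type::Utf8, &Type::Int64), Some(Type::Int64));
    assert_eq!(comparison_coerce(&Type::Float64, &Type::Utf8), Some(Type::Float64));
}
```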

ExecutionPlan::apply_expressions is now a required method#

apply_expressions has been added as a required method on the ExecutionPlan trait (no default implementation). The same applies to the FileSource and DataSource traits. Any custom implementation of these traits must now implement apply_expressions.

Who is affected:

  • Users who implement custom ExecutionPlan nodes

  • Users who implement custom FileSource or DataSource sources

Migration guide:

Add apply_expressions to your implementation. Call f on each top-level PhysicalExpr your node owns, using visit_sibling to correctly propagate TreeNodeRecursion:

Node with no expressions:

fn apply_expressions(
    &self,
    _f: &mut dyn FnMut(&dyn PhysicalExpr) -> Result<TreeNodeRecursion>,
) -> Result<TreeNodeRecursion> {
    Ok(TreeNodeRecursion::Continue)
}

Node with a single expression:

fn apply_expressions(
    &self,
    f: &mut dyn FnMut(&dyn PhysicalExpr) -> Result<TreeNodeRecursion>,
) -> Result<TreeNodeRecursion> {
    f(self.predicate.as_ref())
}

Node with multiple expressions:

fn apply_expressions(
    &self,
    f: &mut dyn FnMut(&dyn PhysicalExpr) -> Result<TreeNodeRecursion>,
) -> Result<TreeNodeRecursion> {
    let mut tnr = TreeNodeRecursion::Continue;
    for expr in &self.expressions {
        tnr = tnr.visit_sibling(|| f(expr.as_ref()))?;
    }
    Ok(tnr)
}

Node whose only expressions are in output_ordering() (e.g. a synthetic test node with no owned expression fields):

fn apply_expressions(
    &self,
    f: &mut dyn FnMut(&dyn PhysicalExpr) -> Result<TreeNodeRecursion>,
) -> Result<TreeNodeRecursion> {
    let mut tnr = TreeNodeRecursion::Continue;
    if let Some(ordering) = self.cache.output_ordering() {
        for sort_expr in ordering {
            tnr = tnr.visit_sibling(|| f(sort_expr.expr.as_ref()))?;
        }
    }
    Ok(tnr)
}

ExecutionPlan::partition_statistics now returns Arc<Statistics>#

ExecutionPlan::partition_statistics now returns Result<Arc<Statistics>> instead of Result<Statistics>. This avoids cloning Statistics when it is shared across multiple consumers.

Before:

fn partition_statistics(&self, partition: Option<usize>) -> Result<Statistics> {
    Ok(Statistics::new_unknown(&self.schema()))
}

After:

fn partition_statistics(&self, partition: Option<usize>) -> Result<Arc<Statistics>> {
    Ok(Arc::new(Statistics::new_unknown(&self.schema())))
}

If you need an owned Statistics value (e.g. to mutate it), use Arc::unwrap_or_clone:

// If you previously consumed the Statistics directly:
let stats = plan.partition_statistics(None)?;
stats.column_statistics[0].min_value = ...;

// Now unwrap the Arc first:
let mut stats = Arc::unwrap_or_clone(plan.partition_statistics(None)?);
stats.column_statistics[0].min_value = ...;
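Arc::unwrap_or_clone is a standard-library method: it moves the value out when the Arc is uniquely owned and clones only when other references exist. A minimal standalone illustration using a Vec in place of Statistics:

```rust
use std::sync::Arc;

fn main() {
    // Uniquely owned: the value is moved out of the Arc with no clone.
    let stats = Arc::new(vec![1, 2, 3]);
    let mut owned = Arc::unwrap_or_clone(stats);
    owned.push(4);
    assert_eq!(owned, vec![1, 2, 3, 4]);

    // Shared elsewhere: the other handle survives and we receive a clone.
    let shared = Arc::new(vec![1, 2, 3]);
    let keep = Arc::clone(&shared);
    let cloned = Arc::unwrap_or_clone(shared);
    assert_eq!(cloned, vec![1, 2, 3]);
    assert_eq!(*keep, vec![1, 2, 3]);
}
```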

Remove as_any from ScalarUDFImpl, AggregateUDFImpl, WindowUDFImpl, ExecutionPlan, TableProvider, SchemaProvider, CatalogProvider, and CatalogProviderList#

Now that the minimum supported version of Rust includes trait upcasting, we can take advantage of it, which reduces the amount of boilerplate code that users need to implement. In your implementations of ScalarUDFImpl, AggregateUDFImpl, WindowUDFImpl, ExecutionPlan, TableProvider, SchemaProvider, CatalogProvider, and CatalogProviderList, you can simply remove the as_any function. The diffs below are examples from the associated PRs.

Scalar UDFs:

 impl ScalarUDFImpl for MyEq {
-    fn as_any(&self) -> &dyn Any {
-        self
-    }
-
     fn name(&self) -> &str {
         "my_eq"
     }

     ...
 }

Aggregate UDFs:

 impl AggregateUDFImpl for GeoMeanUdf {
-    fn as_any(&self) -> &dyn Any {
-        self
-    }
-
     fn name(&self) -> &str {
         "geo_mean"
     }

     ...
 }

Window UDFs:

 impl WindowUDFImpl for SmoothIt {
-    fn as_any(&self) -> &dyn Any {
-        self
-    }
-
     fn name(&self) -> &str {
         "smooth_it"
     }

     ...
 }

Execution Plans:

 impl ExecutionPlan for MyExec {
-    fn as_any(&self) -> &dyn Any {
-        self
-    }
-
     fn name(&self) -> &'static str {
         "MyExec"
     }

     ...
 }

Table Providers:

 impl TableProvider for MyTable {
-    fn as_any(&self) -> &dyn Any {
-        self
-    }
-
     fn schema(&self) -> SchemaRef {
         ...
     }

     ...
 }

Schema Providers:

 impl SchemaProvider for MySchema {
-    fn as_any(&self) -> &dyn Any {
-        self
-    }
-
     fn table_names(&self) -> Vec<String> {
         ...
     }

     ...
 }

Catalog Providers:

 impl CatalogProvider for MyCatalog {
-    fn as_any(&self) -> &dyn Any {
-        self
-    }
-
     fn schema_names(&self) -> Vec<String> {
         ...
     }

     ...
 }

Catalog Provider Lists:

 impl CatalogProviderList for MyCatalogList {
-    fn as_any(&self) -> &dyn Any {
-        self
-    }
-
     fn register_catalog(
         ...
     }

     ...
 }

If you have code that is downcasting, you can use the new downcast_ref and is methods defined directly on each trait object:

Before:

let exec = plan.as_any().downcast_ref::<MyExec>().unwrap();
let udf = scalar_udf.as_any().downcast_ref::<MyUdf>().unwrap();
let table = table_provider.as_any().downcast_ref::<MyTable>().unwrap();

After:

let exec = plan.downcast_ref::<MyExec>().unwrap();
let udf = scalar_udf.downcast_ref::<MyUdf>().unwrap();
let table = table_provider.downcast_ref::<MyTable>().unwrap();

These methods are available on dyn ExecutionPlan, dyn TableProvider, dyn SchemaProvider, dyn CatalogProvider, dyn CatalogProviderList, dyn ScalarUDFImpl, dyn AggregateUDFImpl, and dyn WindowUDFImpl. They work correctly whether the value is a bare reference or behind an Arc (Rust auto-derefs through the Arc).

Warning: Do not cast an Arc<dyn Trait> directly to &dyn Any. Writing (&plan as &dyn Any) gives you an Any reference to the Arc itself, not the underlying trait object, so the downcast will always return None. Use the downcast_ref method above instead, or dereference through the Arc first with plan.as_ref() as &dyn Any.
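Both points can be demonstrated with only the standard library. In this sketch, Plan and MyExec are hypothetical stand-ins for a DataFusion trait and node, and the upcast requires a Rust toolchain with dyn upcasting (stable since 1.86):

```rust
use std::any::Any;
use std::sync::Arc;

// A stand-in for a DataFusion trait; `Any` as a supertrait is what
// enables coercing the trait object to `&dyn Any` without `as_any`.
trait Plan: Any {}

struct MyExec;
impl Plan for MyExec {}

fn main() {
    let plan: Arc<dyn Plan> = Arc::new(MyExec);

    // Correct: upcast the underlying trait object, then downcast.
    let any: &dyn Any = plan.as_ref();
    assert!(any.downcast_ref::<MyExec>().is_some());

    // The pitfall from the warning above: this is an `Any` reference to
    // the Arc itself, so downcasting to the node type always fails.
    let wrong: &dyn Any = &plan;
    assert!(wrong.downcast_ref::<MyExec>().is_none());
}
```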

PruningStatistics::row_counts no longer takes a column parameter#

The row_counts method on the PruningStatistics trait no longer takes a &Column argument, since row counts are a container-level property (the same for every column).

Before:

fn row_counts(&self, column: &Column) -> Option<ArrayRef> {
    // ...
}

After:

fn row_counts(&self) -> Option<ArrayRef> {
    // ...
}

Who is affected:

  • Users who implement the PruningStatistics trait

Migration guide:

Remove the column: &Column parameter from your row_counts implementation and any corresponding call sites. If your implementation was using the column argument, note that row counts are identical for all columns in a container, so the parameter was unnecessary.

See PR #21369 for details.

Avro API and timestamp decoding changes#

DataFusion has switched to using arrow-avro (see #17861) when reading Avro files, which results in a few changes:

  • DataFusionError::AvroError has been removed.

  • From<apache_avro::Error> for DataFusionError has been removed.

  • Avro crate re-export changed:

    • Before: datafusion::apache_avro

    • After: datafusion::arrow_avro

  • Avro timestamp logical type interpretation changed. Notable effects:

    • Avro timestamp-* logical types are read as UTC timezone-aware Arrow timestamps (Timestamp(..., Some("+00:00")))

    • Avro local-timestamp-* logical types remain timezone-naive (Timestamp(..., None))

Who is affected:

  • Users matching on DataFusionError::AvroError

  • Users importing datafusion::apache_avro

  • Users relying on previous Avro timestamp logical type behavior

Migration guide:

  • Replace datafusion::apache_avro imports with datafusion::arrow_avro.

  • Update error handling code that matches on DataFusionError::AvroError to use the current error surface.

  • Validate timestamp handling where timezone semantics matter: timestamp-* is UTC timezone-aware, while local-timestamp-* is timezone-naive.

lpad, rpad, and translate now operate on Unicode codepoints instead of grapheme clusters#

Previously, lpad, rpad, and translate used Unicode grapheme cluster segmentation to measure and manipulate strings. They now use Unicode codepoints, which is consistent with the SQL standard and most other SQL implementations. It also matches the behavior of other string-related functions in DataFusion.

The difference is only observable for strings containing combining characters (e.g., U+0301 COMBINING ACUTE ACCENT) or other multi-codepoint grapheme clusters (e.g., ZWJ emoji sequences). For ASCII and most common Unicode text, behavior is unchanged.
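To make the difference concrete, here is a codepoint-based lpad in the spirit of the new semantics (a hypothetical sketch, not DataFusion's implementation):

```rust
/// Sketch of codepoint-based lpad semantics (not DataFusion's code):
/// pad `s` on the left with `pad` until it is `len` codepoints long,
/// truncating if `s` is already longer.
fn lpad_codepoints(s: &str, len: usize, pad: char) -> String {
    let n = s.chars().count();
    if n >= len {
        s.chars().take(len).collect()
    } else {
        std::iter::repeat(pad).take(len - n).chain(s.chars()).collect()
    }
}

fn main() {
    // "e" + U+0301 COMBINING ACUTE ACCENT renders as one grapheme ("é")
    // but is two codepoints, so only one pad character is needed to
    // reach length 3. Under the old grapheme-based semantics the string
    // measured as length 1 and would have received two pad characters.
    let s = "e\u{0301}";
    assert_eq!(s.chars().count(), 2);
    assert_eq!(lpad_codepoints(s, 3, '*'), format!("*{s}"));
}
```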