# Upgrade Guides
## DataFusion 54.0.0
Note: DataFusion 54.0.0 has not been released yet. The information provided
in this section pertains to features and changes that have already been merged
to the main branch and are awaiting release in this version.
### String/numeric comparison coercion now prefers numeric types
Previously, comparing a numeric column with a string value (e.g.,
`WHERE int_col > '100'`) coerced both sides to strings and performed a
lexicographic comparison. This produced surprising results: for example,
`5 > '100'` yielded `true` because `'5' > '1'` lexicographically, even
though `5 > 100` is false numerically.

DataFusion now coerces the string side to the numeric type in comparison
contexts (`=`, `<`, `>`, `<=`, `>=`, `<>`, `IN`, `BETWEEN`, `CASE .. WHEN`,
`GREATEST`, `LEAST`). For example, `5 > '100'` will now yield `false`.
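The old and new behavior can be reproduced with plain Rust string and integer comparisons (a standalone sketch of the coercion semantics, not DataFusion API):

```rust
fn main() {
    // Old behavior: both sides coerced to strings, compared lexicographically.
    let lexicographic = "5" > "100"; // true, because '5' > '1'
    // New behavior: the string side is coerced to the numeric type.
    let rhs: i64 = "100".parse().expect("'100' parses as an integer");
    let numeric = 5_i64 > rhs; // false, because 5 > 100 is false
    assert!(lexicographic);
    assert!(!numeric);
    println!("lexicographic: {lexicographic}, numeric: {numeric}");
}
```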
Who is affected:

- Queries that compare numeric values with string values
- Queries that use `IN` lists with mixed string and numeric types
- Queries that use `CASE expr WHEN` with mixed string and numeric types
- Queries that use `GREATEST` or `LEAST` with mixed string and numeric types
Behavioral changes:

| Expression | Old behavior | New behavior |
|---|---|---|
|  | Lexicographic | Numeric |
|  | String | Numeric |
|  | String comparison, always false | Cast error |
|  | Coerce to `Utf8` | Cast error |
|  | String | Numeric |
|  | Coerce to `Utf8` | Coerce to `Float64` |
|  | `Utf8` | `Int64` |
|  | `Utf8` | `Int64` |
Migration guide:

Most queries will produce more correct results with no changes needed. However, queries that relied on the old string-comparison behavior may need adjustment:

- Queries comparing numeric columns with non-numeric strings (e.g., `int_col = 'hello'` or `int_col > text_col` where `text_col` contains non-numeric values) will now produce a cast error instead of silently returning no rows.
- Mixed-type `IN` lists (e.g., `str_col IN ('a', 1)`) are now rejected. Use consistent types for the `IN` list or add an explicit `CAST`.
- Queries comparing integer columns with non-integer numeric string literals (e.g., `int_col = '99.99'`) will now produce a cast error because `'99.99'` cannot be cast to an integer. Use a float column or adjust the literal.
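The new cast errors mirror ordinary numeric parsing rules, which can be sketched with std alone: a fractional or non-numeric literal cannot be converted to an integer type, while a float conversion of `'99.99'` succeeds (an analogy to the cast behavior, not the DataFusion cast kernel itself):

```rust
fn main() {
    // '99.99' cannot be converted to an integer, so int_col = '99.99' now errors...
    assert!("99.99".parse::<i64>().is_err());
    // ...while a float target type succeeds.
    assert!("99.99".parse::<f64>().is_ok());
    // Non-numeric strings such as 'hello' fail against any numeric type.
    assert!("hello".parse::<f64>().is_err());
    println!("parse checks passed");
}
```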
### `comparison_coercion_numeric` removed, replaced by `comparison_coercion`
The `comparison_coercion_numeric` function has been removed. Its behavior
(preferring numeric types for string/numeric comparisons) is now the default in
`comparison_coercion`. A new function, `type_union_coercion`, handles contexts
where string types are preferred (`UNION`, `CASE THEN/ELSE`, `NVL2`).
Who is affected:

- Crates that call `comparison_coercion_numeric` directly
- Crates that call `comparison_coercion` and relied on its old string-preferring behavior
- Crates that call `get_coerce_type_for_case_expression`
### `ExecutionPlan::apply_expressions` is now a required method
`apply_expressions` has been added as a required method on the `ExecutionPlan` trait (no default implementation). The same applies to the `FileSource` and `DataSource` traits. Any custom implementation of these traits must now implement `apply_expressions`.
Who is affected:

- Users who implement custom `ExecutionPlan` nodes
- Users who implement custom `FileSource` or `DataSource` sources
Migration guide:

Add `apply_expressions` to your implementation. Call `f` on each top-level `PhysicalExpr` your node owns, using `visit_sibling` to correctly propagate `TreeNodeRecursion`:
Node with no expressions:

```rust
fn apply_expressions(
    &self,
    _f: &mut dyn FnMut(&dyn PhysicalExpr) -> Result<TreeNodeRecursion>,
) -> Result<TreeNodeRecursion> {
    Ok(TreeNodeRecursion::Continue)
}
```
Node with a single expression:

```rust
fn apply_expressions(
    &self,
    f: &mut dyn FnMut(&dyn PhysicalExpr) -> Result<TreeNodeRecursion>,
) -> Result<TreeNodeRecursion> {
    f(self.predicate.as_ref())
}
```
Node with multiple expressions:

```rust
fn apply_expressions(
    &self,
    f: &mut dyn FnMut(&dyn PhysicalExpr) -> Result<TreeNodeRecursion>,
) -> Result<TreeNodeRecursion> {
    let mut tnr = TreeNodeRecursion::Continue;
    for expr in &self.expressions {
        tnr = tnr.visit_sibling(|| f(expr.as_ref()))?;
    }
    Ok(tnr)
}
```
Node whose only expressions are in `output_ordering()` (e.g. a synthetic test node with no owned expression fields):

```rust
fn apply_expressions(
    &self,
    f: &mut dyn FnMut(&dyn PhysicalExpr) -> Result<TreeNodeRecursion>,
) -> Result<TreeNodeRecursion> {
    let mut tnr = TreeNodeRecursion::Continue;
    if let Some(ordering) = self.cache.output_ordering() {
        for sort_expr in ordering {
            tnr = tnr.visit_sibling(|| f(sort_expr.expr.as_ref()))?;
        }
    }
    Ok(tnr)
}
```
### `ExecutionPlan::partition_statistics` now returns `Arc<Statistics>`
`ExecutionPlan::partition_statistics` now returns `Result<Arc<Statistics>>` instead of `Result<Statistics>`. This avoids cloning `Statistics` when it is shared across multiple consumers.
Before:

```rust
fn partition_statistics(&self, partition: Option<usize>) -> Result<Statistics> {
    Ok(Statistics::new_unknown(&self.schema()))
}
```

After:

```rust
fn partition_statistics(&self, partition: Option<usize>) -> Result<Arc<Statistics>> {
    Ok(Arc::new(Statistics::new_unknown(&self.schema())))
}
```
If you need an owned `Statistics` value (e.g. to mutate it), use `Arc::unwrap_or_clone`:
```rust
// If you previously consumed the Statistics directly:
let stats = plan.partition_statistics(None)?;
stats.column_statistics[0].min_value = ...;

// Now unwrap the Arc first:
let mut stats = Arc::unwrap_or_clone(plan.partition_statistics(None)?);
stats.column_statistics[0].min_value = ...;
```
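The `Arc` return type pays off when statistics are shared: `Arc::unwrap_or_clone` (from std) moves the value out when the caller holds the only reference and clones only when it is still shared. A standalone sketch with a `Vec` in place of `Statistics`:

```rust
use std::sync::Arc;

fn main() {
    let shared = Arc::new(vec![1, 2, 3]);
    let second = Arc::clone(&shared);

    // Two references exist, so this call must clone the inner Vec.
    let owned = Arc::unwrap_or_clone(shared);
    assert_eq!(owned, vec![1, 2, 3]);

    // `second` is now the only reference: the Vec is moved out, no clone.
    let owned2 = Arc::unwrap_or_clone(second);
    assert_eq!(owned2, vec![1, 2, 3]);
    println!("unwrap_or_clone checks passed");
}
```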
### Remove `as_any` from `ScalarUDFImpl`, `AggregateUDFImpl`, `WindowUDFImpl`, `ExecutionPlan`, `TableProvider`, `SchemaProvider`, `CatalogProvider`, and `CatalogProviderList`
Now that we have a more recent minimum version of Rust, we can take advantage of
trait upcasting. This reduces the amount of boilerplate code that
users need to implement. In your implementations of `ScalarUDFImpl`,
`AggregateUDFImpl`, `WindowUDFImpl`, `ExecutionPlan`, `TableProvider`,
`SchemaProvider`, `CatalogProvider`, and `CatalogProviderList`, you can simply
remove the `as_any` function. The diffs below are examples from the associated
PRs.
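Trait upcasting is the mechanism that makes this possible: a reference to any trait that has `Any` as a supertrait can be coerced directly to `&dyn Any` (stable since Rust 1.86), so a hand-written `as_any` is no longer needed. A minimal standalone sketch using a hypothetical `PlanLike` trait and `MyExec` struct (placeholders, not DataFusion types):

```rust
use std::any::Any;

// Hypothetical stand-ins for a DataFusion trait and its implementor.
trait PlanLike: Any {
    fn name(&self) -> &'static str;
}

struct MyExec;

impl PlanLike for MyExec {
    fn name(&self) -> &'static str {
        "MyExec"
    }
}

fn main() {
    let plan: &dyn PlanLike = &MyExec;
    // Trait upcasting coercion: &dyn PlanLike -> &dyn Any, no as_any needed.
    let any: &dyn Any = plan;
    assert!(any.downcast_ref::<MyExec>().is_some());
    println!("upcast and downcast of {} succeeded", plan.name());
}
```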
Scalar UDFs:

```diff
 impl ScalarUDFImpl for MyEq {
-    fn as_any(&self) -> &dyn Any {
-        self
-    }
-
     fn name(&self) -> &str {
         "my_eq"
     }
     ...
 }
```
Aggregate UDFs:

```diff
 impl AggregateUDFImpl for GeoMeanUdf {
-    fn as_any(&self) -> &dyn Any {
-        self
-    }
-
     fn name(&self) -> &str {
         "geo_mean"
     }
     ...
 }
```
Window UDFs:

```diff
 impl WindowUDFImpl for SmoothIt {
-    fn as_any(&self) -> &dyn Any {
-        self
-    }
-
     fn name(&self) -> &str {
         "smooth_it"
     }
     ...
 }
```
Execution Plans:

```diff
 impl ExecutionPlan for MyExec {
-    fn as_any(&self) -> &dyn Any {
-        self
-    }
-
     fn name(&self) -> &'static str {
         "MyExec"
     }
     ...
 }
```
Table Providers:

```diff
 impl TableProvider for MyTable {
-    fn as_any(&self) -> &dyn Any {
-        self
-    }
-
     fn schema(&self) -> SchemaRef {
         ...
     }
     ...
 }
```
Schema Providers:

```diff
 impl SchemaProvider for MySchema {
-    fn as_any(&self) -> &dyn Any {
-        self
-    }
-
     fn table_names(&self) -> Vec<String> {
         ...
     }
     ...
 }
```
Catalog Providers:

```diff
 impl CatalogProvider for MyCatalog {
-    fn as_any(&self) -> &dyn Any {
-        self
-    }
-
     fn schema_names(&self) -> Vec<String> {
         ...
     }
     ...
 }
```
Catalog Provider Lists:

```diff
 impl CatalogProviderList for MyCatalogList {
-    fn as_any(&self) -> &dyn Any {
-        self
-    }
-
     fn register_catalog(
         ...
     }
     ...
 }
```
If you have code that is downcasting, you can use the new `downcast_ref`
and `is` methods defined directly on each trait object:
Before:

```rust
let exec = plan.as_any().downcast_ref::<MyExec>().unwrap();
let udf = scalar_udf.as_any().downcast_ref::<MyUdf>().unwrap();
let table = table_provider.as_any().downcast_ref::<MyTable>().unwrap();
```

After:

```rust
let exec = plan.downcast_ref::<MyExec>().unwrap();
let udf = scalar_udf.downcast_ref::<MyUdf>().unwrap();
let table = table_provider.downcast_ref::<MyTable>().unwrap();
```
These methods are available on `dyn ExecutionPlan`, `dyn TableProvider`,
`dyn SchemaProvider`, `dyn CatalogProvider`, `dyn CatalogProviderList`,
`dyn ScalarUDFImpl`, `dyn AggregateUDFImpl`, and `dyn WindowUDFImpl`.
They work correctly whether the value is a bare reference or behind an `Arc`
(Rust auto-derefs through the `Arc`).
Warning: Do not cast an `Arc<dyn Trait>` directly to `&dyn Any`. Writing `(&plan as &dyn Any)` gives you an `Any` reference to the `Arc` itself, not the underlying trait object, so the downcast will always return `None`. Use the `downcast_ref` method above instead, or dereference through the `Arc` first with `plan.as_ref() as &dyn Any`.
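The pitfall can be demonstrated with std alone, using a concrete placeholder type (`MyExec` here is illustrative, not a DataFusion type):

```rust
use std::any::Any;
use std::sync::Arc;

struct MyExec;

fn main() {
    let plan: Arc<MyExec> = Arc::new(MyExec);

    // Wrong: this is an Any view of the Arc itself, not of MyExec.
    let arc_as_any: &dyn Any = &plan;
    assert!(arc_as_any.downcast_ref::<MyExec>().is_none());
    assert!(arc_as_any.downcast_ref::<Arc<MyExec>>().is_some());

    // Right: dereference through the Arc first to view the inner value.
    let inner_as_any: &dyn Any = plan.as_ref();
    assert!(inner_as_any.downcast_ref::<MyExec>().is_some());
    println!("downcast checks passed");
}
```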
### `PruningStatistics::row_counts` no longer takes a column parameter
The `row_counts` method on the `PruningStatistics` trait no longer takes a
`&Column` argument, since row counts are a container-level property (the same
for every column).
Before:

```rust
fn row_counts(&self, column: &Column) -> Option<ArrayRef> {
    // ...
}
```

After:

```rust
fn row_counts(&self) -> Option<ArrayRef> {
    // ...
}
```
Who is affected:

- Users who implement the `PruningStatistics` trait
Migration guide:

Remove the `column: &Column` parameter from your `row_counts` implementation
and any corresponding call sites. If your implementation was using the column
argument, note that row counts are identical for all columns in a container, so
the parameter was unnecessary.
See PR #21369 for details.
### Avro API and timestamp decoding changes
DataFusion now uses `arrow-avro` (see #17861) when reading Avro files,
which results in a few changes:

- `DataFusionError::AvroError` has been removed.
- `From<apache_avro::Error> for DataFusionError` has been removed.
- The Avro crate re-export changed: `datafusion::apache_avro` is now `datafusion::arrow_avro`.
Avro timestamp logical type interpretation changed. Notable effects:

- Avro `timestamp-*` logical types are read as UTC timezone-aware Arrow timestamps (`Timestamp(..., Some("+00:00"))`)
- Avro `local-timestamp-*` logical types remain timezone-naive (`Timestamp(..., None)`)
Who is affected:

- Users matching on `DataFusionError::AvroError`
- Users importing `datafusion::apache_avro`
- Users relying on previous Avro timestamp logical type behavior
Migration guide:

- Replace `datafusion::apache_avro` imports with `datafusion::arrow_avro`.
- Update error handling code that matches on `DataFusionError::AvroError` to use the current error surface.
- Validate timestamp handling where timezone semantics matter: `timestamp-*` is UTC timezone-aware, while `local-timestamp-*` is timezone-naive.
### `lpad`, `rpad`, and `translate` now operate on Unicode codepoints instead of grapheme clusters
Previously, `lpad`, `rpad`, and `translate` used Unicode grapheme cluster
segmentation to measure and manipulate strings. They now use Unicode codepoints,
which is consistent with the SQL standard and most other SQL implementations. It
also matches the behavior of other string-related functions in DataFusion.

The difference is only observable for strings containing combining characters (e.g., U+0301 COMBINING ACUTE ACCENT) or other multi-codepoint grapheme clusters (e.g., ZWJ emoji sequences). For ASCII and most common Unicode text, behavior is unchanged.
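The distinction is easy to see with std alone: "é" written as `e` plus U+0301 is one grapheme cluster on screen but two codepoints, so codepoint-based padding now counts it as 2. A standalone sketch of left-padding by codepoint count (illustrative, not DataFusion's `lpad` kernel):

```rust
fn main() {
    // "é" as 'e' + U+0301 COMBINING ACUTE ACCENT: one grapheme, two codepoints.
    let s = "e\u{301}";
    assert_eq!(s.chars().count(), 2);
    // Precomposed U+00E9 is a single codepoint.
    assert_eq!("\u{e9}".chars().count(), 1);

    // Sketch of lpad measuring by codepoints: pad to a width of 5 with spaces.
    let width = 5;
    let padded: String = std::iter::repeat(' ')
        .take(width - s.chars().count())
        .chain(s.chars())
        .collect();
    assert_eq!(padded.chars().count(), width);
    println!("padded: {padded:?}");
}
```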