Upgrade Guides#
DataFusion 54.0.0#
Note: DataFusion 54.0.0 has not been released yet. The information provided
in this section pertains to features and changes that have already been merged
to the main branch and are awaiting release in this version.
String/numeric comparison coercion now prefers numeric types#
Previously, comparing a numeric column with a string value (e.g.,
`WHERE int_col > '100'`) coerced both sides to strings and performed a
lexicographic comparison. This produced surprising results; for example,
`5 > '100'` yielded `true` because `'5' > '1'` lexicographically, even
though `5 > 100` is false numerically.
DataFusion now coerces the string side to the numeric type in comparison
contexts (`=`, `<`, `>`, `<=`, `>=`, `<>`, `IN`, `BETWEEN`, `CASE .. WHEN`,
`GREATEST`, `LEAST`). For example, `5 > '100'` now yields `false`.
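A minimal sketch of the new result, using the standard `SessionContext` API (the `example` wrapper is illustrative):

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

async fn example(ctx: &SessionContext) -> Result<()> {
    // Previously evaluated as '5' > '100' (lexicographic): true.
    // Now coerced to numeric, 5 > 100: false.
    ctx.sql("SELECT 5 > '100' AS result").await?.show().await?;
    Ok(())
}
```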
Who is affected:
- Queries that compare numeric values with string values
- Queries that use `IN` lists with mixed string and numeric types
- Queries that use `CASE expr WHEN` with mixed string and numeric types
- Queries that use `GREATEST` or `LEAST` with mixed string and numeric types
Behavioral changes:
| Expression | Old behavior | New behavior |
|---|---|---|
| `5 > '100'` | Lexicographic | Numeric |
| `int_col > '100'` | String | Numeric |
| `int_col = 'hello'` | String comparison, always false | Cast error |
| `str_col IN ('a', 1)` | Coerce to Utf8 | Cast error |
| `CASE expr WHEN` with mixed string and numeric types | String | Numeric |
| `GREATEST`/`LEAST` with mixed string and numeric types | Coerce to Utf8 | Coerce to Float64 |
| Coerced type for `int_col = '100'` | Utf8 | Int64 |
| Coerced type for `int_col BETWEEN '1' AND '100'` | Utf8 | Int64 |
Migration guide:
Most queries will produce more correct results with no changes needed. However, queries that relied on the old string-comparison behavior may need adjustment:
- Queries comparing numeric columns with non-numeric strings (e.g., `int_col = 'hello'` or `int_col > text_col` where `text_col` contains non-numeric values) will now produce a cast error instead of silently returning no rows.
- Mixed-type `IN` lists (e.g., `str_col IN ('a', 1)`) are now rejected. Use consistent types for the `IN` list or add an explicit `CAST` (see the sketch after this list).
- Queries comparing integer columns with non-integer numeric string literals (e.g., `int_col = '99.99'`) will now produce a cast error because `'99.99'` cannot be cast to an integer. Use a float column or adjust the literal.
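A minimal sketch of the `CAST` fix for a mixed-type `IN` list (the table and column names are hypothetical):

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

async fn example(ctx: &SessionContext) -> Result<DataFrame> {
    // Rejected in 54.0.0: str_col IN ('a', 1) mixes string and numeric types.
    // Casting the numeric literal makes the IN list type-consistent again.
    ctx.sql("SELECT * FROM t WHERE str_col IN ('a', CAST(1 AS VARCHAR))")
        .await
}
```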
CastColumnExpr removed in favor of field-aware CastExpr#
`datafusion_physical_expr::expressions::CastColumnExpr` has been removed; use
the field-aware `datafusion_physical_expr::expressions::CastExpr` instead.
If your code downcasted to `CastColumnExpr`, downcast to `CastExpr` instead and
use `CastExpr::target_field()` for the output field metadata and
`CastExpr::expr()` for the input expression. To construct casts with explicit
field semantics, use `CastExpr::new_with_target_field(...)`. The type-only
`CastExpr::new(...)` and `cast(...)` helpers remain available for callers that
only have a `DataType`.
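A minimal sketch of the migrated downcast, assuming the direct `downcast_ref` described in the `as_any` section later in this guide (the `describe_cast` helper is illustrative, not a DataFusion API):

```rust
use datafusion_physical_expr::expressions::CastExpr;
use datafusion_physical_expr::PhysicalExpr;

// Illustrative helper: inspect a cast expression after the migration.
fn describe_cast(expr: &dyn PhysicalExpr) {
    // Previously: expr.as_any().downcast_ref::<CastColumnExpr>()
    if let Some(cast) = expr.downcast_ref::<CastExpr>() {
        // Input expression and output field metadata, per the new API.
        println!("{} casts to {:?}", cast.expr(), cast.target_field());
    }
}
```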
comparison_coercion_numeric removed, replaced by comparison_coercion#
The `comparison_coercion_numeric` function has been removed. Its behavior
(preferring numeric types for string/numeric comparisons) is now the default in
`comparison_coercion`. A new function, `type_union_coercion`, handles contexts
where string types are preferred (`UNION`, `CASE THEN/ELSE`, `NVL2`).
Who is affected:
- Crates that call `comparison_coercion_numeric` directly
- Crates that call `comparison_coercion` and relied on its old string-preferring behavior
- Crates that call `get_coerce_type_for_case_expression`
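A minimal sketch of the replacement call, assuming the `datafusion_expr::type_coercion::binary` path; per the table above, the string side now coerces to the numeric type:

```rust
use arrow::datatypes::DataType;
use datafusion_expr::type_coercion::binary::comparison_coercion;

fn main() {
    // comparison_coercion now prefers the numeric side for
    // string/numeric pairs (previously Utf8 was chosen).
    let coerced = comparison_coercion(&DataType::Int64, &DataType::Utf8);
    assert_eq!(coerced, Some(DataType::Int64));
}
```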
ExecutionPlan::apply_expressions is now a required method#
`apply_expressions` has been added as a required method on the `ExecutionPlan` trait (no default implementation). The same applies to the `FileSource` and `DataSource` traits. Any custom implementation of these traits must now implement `apply_expressions`.
Who is affected:
- Users who implement custom `ExecutionPlan` nodes
- Users who implement custom `FileSource` or `DataSource` sources
Migration guide:
Add `apply_expressions` to your implementation. Call `f` on each top-level `PhysicalExpr` your node owns, using `visit_sibling` to correctly propagate `TreeNodeRecursion`:
Node with no expressions:
```rust
fn apply_expressions(
    &self,
    _f: &mut dyn FnMut(&dyn PhysicalExpr) -> Result<TreeNodeRecursion>,
) -> Result<TreeNodeRecursion> {
    Ok(TreeNodeRecursion::Continue)
}
```
Node with a single expression:
```rust
fn apply_expressions(
    &self,
    f: &mut dyn FnMut(&dyn PhysicalExpr) -> Result<TreeNodeRecursion>,
) -> Result<TreeNodeRecursion> {
    f(self.predicate.as_ref())
}
```
Node with multiple expressions:
```rust
fn apply_expressions(
    &self,
    f: &mut dyn FnMut(&dyn PhysicalExpr) -> Result<TreeNodeRecursion>,
) -> Result<TreeNodeRecursion> {
    let mut tnr = TreeNodeRecursion::Continue;
    for expr in &self.expressions {
        tnr = tnr.visit_sibling(|| f(expr.as_ref()))?;
    }
    Ok(tnr)
}
```
Node whose only expressions are in `output_ordering()` (e.g. a synthetic test node with no owned expression fields):
```rust
fn apply_expressions(
    &self,
    f: &mut dyn FnMut(&dyn PhysicalExpr) -> Result<TreeNodeRecursion>,
) -> Result<TreeNodeRecursion> {
    let mut tnr = TreeNodeRecursion::Continue;
    if let Some(ordering) = self.cache.output_ordering() {
        for sort_expr in ordering {
            tnr = tnr.visit_sibling(|| f(sort_expr.expr.as_ref()))?;
        }
    }
    Ok(tnr)
}
```
ExecutionPlan::partition_statistics now returns Arc<Statistics>#
`ExecutionPlan::partition_statistics` now returns `Result<Arc<Statistics>>` instead of `Result<Statistics>`. This avoids cloning `Statistics` when it is shared across multiple consumers.
Before:
```rust
fn partition_statistics(&self, partition: Option<usize>) -> Result<Statistics> {
    Ok(Statistics::new_unknown(&self.schema()))
}
```
After:
```rust
fn partition_statistics(&self, partition: Option<usize>) -> Result<Arc<Statistics>> {
    Ok(Arc::new(Statistics::new_unknown(&self.schema())))
}
```
If you need an owned `Statistics` value (e.g. to mutate it), use `Arc::unwrap_or_clone`:
```rust
// If you previously consumed the Statistics directly:
let mut stats = plan.partition_statistics(None)?;
stats.column_statistics[0].min_value = ...;

// Now unwrap the Arc first:
let mut stats = Arc::unwrap_or_clone(plan.partition_statistics(None)?);
stats.column_statistics[0].min_value = ...;
```
Remove as_any from PhysicalExpr, ScalarUDFImpl, AggregateUDFImpl, WindowUDFImpl, ExecutionPlan, TableProvider, SchemaProvider, CatalogProvider, CatalogProviderList, TableSource, FileSource, FileFormat, FileFormatFactory, DataSource, and DataSink#
Now that we have a more recent minimum version of Rust, we can take advantage of
trait upcasting. This reduces the amount of boilerplate code that
users need to implement. In your implementations of the traits listed above,
you can simply remove the `as_any` function. For example:
```diff
 impl PhysicalExpr for MyExpr {
-    fn as_any(&self) -> &dyn Any {
-        self
-    }
-
     fn data_type(&self, input_schema: &Schema) -> Result<DataType> {
         ...
     }
     ...
 }
```
The same change applies to all of the above traits; simply delete the
`as_any` method from each implementation.
If you have code that is downcasting, you can drop the `.as_any()` call
and use `downcast_ref` / `is` directly on the trait object:
```diff
-let exec = plan.as_any().downcast_ref::<MyExec>().unwrap();
+let exec = plan.downcast_ref::<MyExec>().unwrap();
```
These methods work correctly whether the value is a bare reference or
behind an `Arc` (Rust auto-derefs through the `Arc`).
Warning: Do not cast an `Arc<dyn Trait>` directly to `&dyn Any`. Writing `(&plan as &dyn Any)` gives you an `Any` reference to the `Arc` itself, not the underlying trait object, so the downcast will always return `None`. Use the `downcast_ref` method above instead, or dereference through the `Arc` first with `plan.as_ref() as &dyn Any`.
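A self-contained sketch of this pitfall, using a stand-in `Node` trait (requires a Rust version with trait upcasting):

```rust
use std::any::Any;
use std::sync::Arc;

trait Node: Any {}
struct MyExec;
impl Node for MyExec {}

fn main() {
    let plan: Arc<dyn Node> = Arc::new(MyExec);

    // Wrong: an Any reference to the Arc itself; the downcast is always None.
    assert!((&plan as &dyn Any).downcast_ref::<MyExec>().is_none());

    // Right: dereference through the Arc to reach the trait object first.
    assert!((plan.as_ref() as &dyn Any).downcast_ref::<MyExec>().is_some());
}
```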
PruningStatistics::row_counts no longer takes a column parameter#
The `row_counts` method on the `PruningStatistics` trait no longer takes a
`&Column` argument, since row counts are a container-level property (the same
for every column).
Before:
```rust
fn row_counts(&self, column: &Column) -> Option<ArrayRef> {
    // ...
}
```
After:
```rust
fn row_counts(&self) -> Option<ArrayRef> {
    // ...
}
```
Who is affected:
- Users who implement the `PruningStatistics` trait
Migration guide:
Remove the `column: &Column` parameter from your `row_counts` implementation
and any corresponding call sites. If your implementation was using the column
argument, note that row counts are identical for all columns in a container, so
the parameter was unnecessary.
See PR #21369 for details.
Avro API and timestamp decoding changes#
DataFusion now uses arrow-avro (see #17861) to read Avro files, which results in a few changes:
- `DataFusionError::AvroError` has been removed.
- `From<apache_avro::Error> for DataFusionError` has been removed.
- The Avro crate re-export changed:
  - Before: `datafusion::apache_avro`
  - After: `datafusion::arrow_avro`
Avro timestamp logical type interpretation changed. Notable effects:
- Avro `timestamp-*` logical types are read as UTC timezone-aware Arrow timestamps (`Timestamp(..., Some("+00:00"))`)
- Avro `local-timestamp-*` logical types remain timezone-naive (`Timestamp(..., None)`)
Who is affected:
- Users matching on `DataFusionError::AvroError`
- Users importing `datafusion::apache_avro`
- Users relying on previous Avro timestamp logical type behavior
Migration guide:
- Replace `datafusion::apache_avro` imports with `datafusion::arrow_avro`.
- Update error handling code that matches on `DataFusionError::AvroError` to use the current error surface.
- Validate timestamp handling where timezone semantics matter: `timestamp-*` is UTC timezone-aware, while `local-timestamp-*` is timezone-naive. A sketch of such a check follows this list.
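A minimal sketch of a schema check for the new timestamp semantics (the `is_utc_aware` helper is illustrative, not a DataFusion API):

```rust
use arrow::datatypes::{DataType, TimeUnit};

// timestamp-* now maps to a UTC-aware Arrow type; local-timestamp-* stays naive.
fn is_utc_aware(dt: &DataType) -> bool {
    matches!(dt, DataType::Timestamp(_, Some(tz)) if tz.as_ref() == "+00:00")
}

fn main() {
    let utc = DataType::Timestamp(TimeUnit::Microsecond, Some("+00:00".into()));
    let naive = DataType::Timestamp(TimeUnit::Microsecond, None);
    assert!(is_utc_aware(&utc));
    assert!(!is_utc_aware(&naive));
}
```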
lpad, rpad, and translate now operate on Unicode codepoints instead of grapheme clusters#
Previously, `lpad`, `rpad`, and `translate` used Unicode grapheme cluster
segmentation to measure and manipulate strings. They now use Unicode codepoints,
which is consistent with the SQL standard and most other SQL implementations. It
also matches the behavior of other string-related functions in DataFusion.
The difference is only observable for strings containing combining characters (e.g., U+0301 COMBINING ACUTE ACCENT) or other multi-codepoint grapheme clusters (e.g., ZWJ emoji sequences). For ASCII and most common Unicode text, behavior is unchanged.
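A small sketch of the measurable difference: "e" followed by U+0301 renders as "é" but is two codepoints.

```rust
fn main() {
    let s = "e\u{0301}"; // one grapheme cluster, two codepoints
    // The new measure used by lpad/rpad/translate: codepoints.
    assert_eq!(s.chars().count(), 2);
    // The old measure (grapheme clusters, e.g. via the unicode-segmentation
    // crate) counts 1, so lpad(s, 4, '*') now adds two padding characters
    // where it previously added three.
}
```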
Scalar subquery execution changes#
Uncorrelated scalar subqueries (e.g. `SELECT ... WHERE x > (SELECT max(v) FROM t)`)
are now executed by a dedicated physical operator rather than being rewritten to
a join. Correlated scalar subqueries are unchanged.
This produces two user-visible changes:
1. Subqueries that return multiple rows now fail at runtime. An uncorrelated scalar subquery that returns more than one row fails with `Execution error: Scalar subquery returned more than one row`. This matches the SQL standard and the behavior of most other SQL implementations. The previous join-based rewrite could silently produce multi-row output. Add a `LIMIT 1` or an aggregate to the subquery to fix such queries (see the sketch after this list).
2. Plan shape changes. Uncorrelated `Expr::ScalarSubquery` nodes now survive into the final logical plan instead of being replaced by a join; the corresponding physical plan contains a new `ScalarSubqueryExec` node and a `ScalarSubqueryExpr` expression. Code that walks or transforms `LogicalPlan`/`ExecutionPlan` trees, as well as `EXPLAIN` output, may need updating.
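A hedged sketch of the runtime failure and the fix (the `orders` and `limits` tables are hypothetical):

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

async fn example(ctx: &SessionContext) -> Result<DataFrame> {
    // May now fail at runtime if `limits` has more than one row:
    //   SELECT * FROM orders WHERE qty > (SELECT qty FROM limits)
    // Fix: make the subquery provably single-row with an aggregate.
    ctx.sql("SELECT * FROM orders WHERE qty > (SELECT max(qty) FROM limits)")
        .await
}
```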
datafusion-proto: expression deserialization now takes a TaskContext#
`Serializeable::from_bytes_with_registry` is renamed to `from_bytes_with_ctx`
and takes a `&TaskContext` instead of a `&dyn FunctionRegistry`. `parse_expr`,
`parse_exprs`, and `parse_sorts` change in the same way. `Expr::from_bytes`
(without a registry argument) is unchanged.
```diff
-let expr = Expr::from_bytes_with_registry(&bytes, &registry)?;
+let expr = Expr::from_bytes_with_ctx(&bytes, ctx.task_ctx().as_ref())?;

-let expr = parse_expr(&proto, &registry, &codec)?;
+let expr = parse_expr(&proto, ctx.task_ctx().as_ref(), &codec)?;
```
datafusion-proto: PhysicalProtoConverterExtension reshaped#
`PhysicalProtoConverterExtension` and the `parse_physical_*_with_converter`
helpers now take a single `&PhysicalPlanDecodeContext<'_>` that bundles the
`TaskContext` and the `PhysicalExtensionCodec`. Implementations update like
this:
```diff
 impl PhysicalProtoConverterExtension for MyConverter {
     fn proto_to_execution_plan(
         &self,
-        ctx: &TaskContext,
-        codec: &dyn PhysicalExtensionCodec,
         proto: &protobuf::PhysicalPlanNode,
+        ctx: &PhysicalPlanDecodeContext<'_>,
     ) -> Result<Arc<dyn ExecutionPlan>> {
-        proto.try_into_physical_plan_with_converter(ctx, codec, self)
+        self.default_proto_to_execution_plan(proto, ctx)
     }

     fn proto_to_physical_expr(
         &self,
         proto: &PhysicalExprNode,
-        ctx: &TaskContext,
         input_schema: &Schema,
-        codec: &dyn PhysicalExtensionCodec,
+        ctx: &PhysicalPlanDecodeContext<'_>,
     ) -> Result<Arc<dyn PhysicalExpr>> {
-        parse_physical_expr_with_converter(proto, ctx, input_schema, codec, self)
+        self.default_proto_to_physical_expr(proto, input_schema, ctx)
     }
 }
```
Pull out the `TaskContext` or codec inside these methods with
`ctx.task_ctx()` and `ctx.codec()`. Construct a fresh context at an API
boundary with `PhysicalPlanDecodeContext::new(task_ctx, codec)`.
ExecutionProps has new fields#
`ExecutionProps` gained new public fields. Code that constructs it via a
struct literal, or pattern-matches it without `..`, no longer compiles. Use
`ExecutionProps::new()` and include `..` in exhaustive patterns.
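A minimal sketch of the compatible patterns, assuming the `datafusion_expr::execution_props` path (`query_execution_start_time` is one of the existing public fields):

```rust
use datafusion_expr::execution_props::ExecutionProps;

fn main() {
    // Construct via new() rather than a struct literal:
    let props = ExecutionProps::new();

    // Include `..` so newly added fields do not break the pattern:
    let ExecutionProps { query_execution_start_time, .. } = &props;
    println!("{query_execution_start_time}");
}
```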
Items in datafusion_functions::strings are no longer public#
`StringArrayBuilder`, `LargeStringArrayBuilder`, `StringViewArrayBuilder`,
`ColumnarValueRef`, and `append_view` have been reduced to `pub(crate)`. They
were only ever used to implement `concat` and `concat_ws` inside the crate. If
you were importing them externally, use Arrow's corresponding builders with a
caller-computed `NullBuffer`.
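A minimal sketch of the replacement using Arrow's own builder (shown with `append_null` rather than a precomputed `NullBuffer` for brevity):

```rust
use arrow::array::{Array, StringBuilder};

fn main() {
    // Replaces the now-private StringArrayBuilder for simple cases.
    let mut builder = StringBuilder::new();
    builder.append_value("concat");
    builder.append_null();
    let array = builder.finish();
    assert_eq!(array.len(), 2);
}
```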
Conversion from FileDecryptionProperties to ConfigFileDecryptionProperties is now fallible#
Previously, `datafusion_common::config::ConfigFileDecryptionProperties`
implemented `From<&Arc<parquet::encryption::decrypt::FileDecryptionProperties>>`.
If an error was encountered when retrieving the footer key without key metadata
provided, the error was ignored and an empty footer key was set in the result,
which could lead to obscure errors later.
`ConfigFileDecryptionProperties` now implements `TryFrom<&Arc<FileDecryptionProperties>>`
instead, and errors retrieving the footer key are propagated up.
Migration guide:
Replace calls to `ConfigFileDecryptionProperties::from` with `ConfigFileDecryptionProperties::try_from`,
and affected calls to `into` with `try_into`, adding appropriate error handling.
Before:
```rust
let config_decryption_properties: ConfigFileDecryptionProperties = (&decryption_properties).into();
// or
let config_decryption_properties = ConfigFileDecryptionProperties::from(&decryption_properties);
```
(where `decryption_properties` is an `Arc<FileDecryptionProperties>`)
After:
```rust
let config_decryption_properties: ConfigFileDecryptionProperties = (&decryption_properties).try_into()?;
// or
let config_decryption_properties = ConfigFileDecryptionProperties::try_from(&decryption_properties)?;
```
approx_percentile_cont, approx_percentile_cont_with_weight, approx_median now coerce to floats#
The type signatures of `approx_percentile_cont`, `approx_percentile_cont_with_weight`, and
`approx_median` now coerce integer input values to `Float64` before computing the approximation.
As a result, these functions always return a float, even when the input column is an integer type.
Who is affected:
- Queries or downstream code that relied on `approx_percentile_cont` / `approx_percentile_cont_with_weight` / `approx_median` returning an integer type when given an integer column.
Migration guide:
If downstream code checks or relies on the return type being an integer, add an explicit
`CAST` back to the desired integer type, or update the type assertion:
```sql
-- Before (returned Int64):
SELECT approx_percentile_cont(quantity, 0.5) FROM orders;

-- After (returns Float64); cast if an integer result is required:
SELECT CAST(approx_percentile_cont(quantity, 0.5) AS BIGINT) FROM orders;
```
Box<C> and Arc<C> TreeNodeContainer impls now require C: Default#
The generic `TreeNodeContainer` implementations for `Box<C>` and `Arc<C>` now
require `C: Default`. This change was necessary as part of optimizing tree
rewriting to reduce heap allocations.
Who is affected:
- Users that implement `TreeNodeContainer` on a custom type and wrap it in `Box` or `Arc` when walking trees.
Migration guide:
Add a `Default` implementation to your type. The default value is used as a
temporary placeholder during query optimization, so when possible, pick a cheap,
allocation-free variant:
```rust
impl Default for MyTreeNode {
    fn default() -> Self {
        MyTreeNode::Leaf // or whichever variant is cheapest to construct
    }
}
```