Upgrade Guides#

DataFusion 53.0.0#

Upgrade arrow/parquet to 58.0.0 and object_store to 0.13.0#

DataFusion 53.0.0 uses arrow and parquet 58.0.0, and object_store 0.13.0. This may require updates to your Cargo.toml if you have direct dependencies on these crates.
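
A hedged sketch of the corresponding Cargo.toml change (your version pins and feature flags may differ):

[dependencies]
arrow = "58.0.0"
parquet = "58.0.0"
object_store = "0.13.0"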

See the Arrow 58.0.0 release notes and the object_store 0.13.0 upgrade guide for details on breaking changes in those versions.

ExecutionPlan::statistics removed#

The deprecated ExecutionPlan::statistics() method has been removed. If you implement custom ExecutionPlans, remove that method from your impl and implement partition_statistics() instead.

Before:

impl ExecutionPlan for MyExec {
    // ...

    fn statistics(&self) -> Result<Statistics> {
        Ok(Statistics::new_unknown(&self.schema()))
    }
}

After:

impl ExecutionPlan for MyExec {
    // ...

    fn partition_statistics(&self, _partition: Option<usize>) -> Result<Statistics> {
        Ok(Statistics::new_unknown(&self.schema()))
    }
}

If you do not have partition-specific statistics, return the same value for None and for any partition index.
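
If you do keep per-partition statistics, a minimal sketch (the partition_stats and total_stats fields are hypothetical) might dispatch on the argument:

fn partition_statistics(&self, partition: Option<usize>) -> Result<Statistics> {
    match partition {
        // Statistics for a single output partition
        Some(i) => Ok(self.partition_stats[i].clone()),
        // `None` requests statistics for the plan as a whole
        None => Ok(self.total_stats.clone()),
    }
}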

ExecutionPlan::properties now returns &Arc<PlanProperties>#

ExecutionPlan::properties() now returns &Arc<PlanProperties> instead of &PlanProperties. This makes it possible to cheaply clone the properties and reuse them across multiple ExecutionPlans. It also makes it possible to optimize ExecutionPlan::with_new_children to reuse the existing properties when the children plans have not changed, which can significantly reduce planning time for complex queries.

To migrate, you will likely need to wrap the stored PlanProperties in an Arc in all ExecutionPlan implementations:

-    cache: PlanProperties,
+    cache: Arc<PlanProperties>,

...

-    fn properties(&self) -> &PlanProperties {
+    fn properties(&self) -> &Arc<PlanProperties> {
         &self.cache
     }

To improve the performance of with_new_children for custom ExecutionPlan implementations, you can use the new check_if_same_properties macro. For it to work, implement a with_new_children_and_same_properties function with the same semantics as with_new_children, but operating under the assumption that the properties of the children plans have not changed.

An example of supporting this optimization for ProjectionExec:

     impl ProjectionExec {
+       fn with_new_children_and_same_properties(
+           &self,
+           mut children: Vec<Arc<dyn ExecutionPlan>>,
+       ) -> Self {
+           Self {
+               input: children.swap_remove(0),
+               metrics: ExecutionPlanMetricsSet::new(),
+               ..Self::clone(self)
+           }
+       }
    }

    impl ExecutionPlan for ProjectionExec {
        fn with_new_children(
            self: Arc<Self>,
            mut children: Vec<Arc<dyn ExecutionPlan>>,
        ) -> Result<Arc<dyn ExecutionPlan>> {
+           check_if_same_properties!(self, children);
            ProjectionExec::try_new(
                self.projector.projection().into_iter().cloned(),
                children.swap_remove(0),
            )
            .map(|p| Arc::new(p) as _)
        }
    }

PlannerContext outer query schema API now uses a stack#

PlannerContext no longer stores a single outer_query_schema. It now tracks a stack of outer relation schemas so nested subqueries can access non-adjacent outer relations.

Before:

let old_outer_query_schema =
    planner_context.set_outer_query_schema(Some(input_schema.clone().into()));
let sub_plan = self.query_to_plan(subquery, planner_context)?;
planner_context.set_outer_query_schema(old_outer_query_schema);

After:

planner_context.append_outer_query_schema(input_schema.clone().into());
let sub_plan = self.query_to_plan(subquery, planner_context)?;
planner_context.pop_outer_query_schema();
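
Because the schemas now form a stack, nested subqueries can push one schema per nesting level, and the innermost subquery can resolve columns from any enclosing relation. A minimal sketch (the schema and subquery variable names are illustrative):

planner_context.append_outer_query_schema(outer_schema.clone().into());
planner_context.append_outer_query_schema(inner_schema.clone().into());
// The innermost subquery sees both outer relation schemas
let plan = self.query_to_plan(innermost_subquery, planner_context)?;
planner_context.pop_outer_query_schema();
planner_context.pop_outer_query_schema();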

HashJoinExec::try_new adds null_aware#

HashJoinExec::try_new now takes an extra null_aware: bool argument. This flag is used for null-aware anti joins, such as plans generated for NOT IN subqueries.

Most callers should pass false, which preserves the previous behavior; pass true only for null-aware JoinType::LeftAnti joins, as sketched below.
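
A hedged sketch of a typical call site (the surrounding arguments are shown schematically and may not match your code exactly); the only change is the new trailing argument:

 let join = HashJoinExec::try_new(
     left,
     right,
     on,
     filter,
     &join_type,
     projection,
     partition_mode,
     null_equality,
+    false, // null_aware: `false` preserves the pre-53.0.0 behavior
 )?;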

FileSinkConfig adds file_output_mode#

FileSinkConfig now includes a file_output_mode: FileOutputMode field to control single-file vs directory output behavior. Any code constructing FileSinkConfig via struct literals must initialize this field.

The FileOutputMode enum has three variants:

  • Automatic (default): Infer the output mode from the URL, using a file-extension / trailing-slash heuristic

  • SingleFile: Write to a single file at the exact output path

  • Directory: Write to a directory with generated filenames

Before:

FileSinkConfig {
    // ...
    file_extension: "parquet".into(),
}

After:

use datafusion_datasource::file_sink_config::FileOutputMode;

FileSinkConfig {
    // ...
    file_extension: "parquet".into(),
    file_output_mode: FileOutputMode::Automatic,
}

SimplifyInfo trait removed, SimplifyContext now uses builder-style API#

The SimplifyInfo trait has been removed and replaced with the concrete SimplifyContext struct. This simplifies the expression simplification API and removes the need for trait objects.

Who is affected:

  • Users who implemented custom SimplifyInfo implementations

  • Users who implemented ScalarUDFImpl::simplify() for custom scalar functions

  • Users who directly use SimplifyContext or ExprSimplifier

Breaking changes:

  1. The SimplifyInfo trait has been removed entirely

  2. SimplifyContext no longer takes &ExecutionProps - it now uses a builder-style API with direct fields

  3. ScalarUDFImpl::simplify() now takes &SimplifyContext instead of &dyn SimplifyInfo

  4. Time-dependent function simplification (e.g., now()) is now optional - if query_execution_start_time is None, these functions won’t be simplified

Migration guide:

If you implemented a custom SimplifyInfo:

Before:

impl SimplifyInfo for MySimplifyInfo {
    fn is_boolean_type(&self, expr: &Expr) -> Result<bool> { ... }
    fn nullable(&self, expr: &Expr) -> Result<bool> { ... }
    fn execution_props(&self) -> &ExecutionProps { ... }
    fn get_data_type(&self, expr: &Expr) -> Result<DataType> { ... }
}

After:

Use SimplifyContext directly with the builder-style API:

let context = SimplifyContext::default()
    .with_schema(schema)
    .with_config_options(config_options)
    .with_query_execution_start_time(Some(Utc::now())); // or use .with_current_time()

If you implemented ScalarUDFImpl::simplify():

Before:

fn simplify(
    &self,
    args: Vec<Expr>,
    info: &dyn SimplifyInfo,
) -> Result<ExprSimplifyResult> {
    let now_ts = info.execution_props().query_execution_start_time;
    // ...
}

After:

fn simplify(
    &self,
    args: Vec<Expr>,
    info: &SimplifyContext,
) -> Result<ExprSimplifyResult> {
    // query_execution_start_time is now Option<DateTime<Utc>>
    // Return Original if time is not set (simplification skipped)
    let Some(now_ts) = info.query_execution_start_time() else {
        return Ok(ExprSimplifyResult::Original(args));
    };
    // ...
}

If you created SimplifyContext from ExecutionProps:

Before:

let props = ExecutionProps::new();
let context = SimplifyContext::new(&props).with_schema(schema);

After:

let context = SimplifyContext::default()
    .with_schema(schema)
    .with_config_options(config_options)
    .with_current_time(); // Sets query_execution_start_time to Utc::now()

See the SimplifyContext documentation for more details.

Struct casting now requires field name overlap#

DataFusion’s struct casting mechanism previously allowed casting between structs with differing field names if the field counts matched. This “positional fallback” behavior could silently misalign fields and cause data corruption.

Breaking Change:

Starting with DataFusion 53.0.0, struct casts now require at least one overlapping field name between the source and target structs. Casts without field name overlap are rejected at plan time with a clear error message.

Who is affected:

  • Applications that cast between structs with no overlapping field names

  • Queries that rely on positional struct field mapping (e.g., casting struct(x, y) to struct(a, b) based solely on position)

  • Code that constructs or transforms struct columns programmatically

Migration guide:

If you encounter an error like:

Cannot cast struct with 2 fields to 2 fields because there is no field name overlap

You must explicitly rename or map fields to ensure at least one field name matches. Here are common patterns:

Example 1: Source and target field names already match (Name-based casting)

Success case (field names align):

-- source_col has schema: STRUCT<x INT, y INT>
-- Casting to the same field names succeeds (no-op or type validation only)
SELECT CAST(source_col AS STRUCT<x INT, y INT>) FROM table1;

Example 2: Source and target field names differ (Migration scenario)

What fails now (no field name overlap):

-- source_col has schema: STRUCT<a INT, b INT>
-- This FAILS because there is no field name overlap:
-- ❌ SELECT CAST(source_col AS STRUCT<x INT, y INT>) FROM table1;
-- Error: Cannot cast struct with 2 fields to 2 fields because there is no field name overlap

Migration options (must align names):

Option A: Use a struct constructor for explicit field mapping

-- source_col has schema: STRUCT<a INT, b INT>
-- Use named_struct with explicit field names
SELECT named_struct(
    'x', source_col.a,
    'y', source_col.b
) AS renamed_struct FROM table1;

Option B: Rename in the cast target to match source names

-- source_col has schema: STRUCT<a INT, b INT>
-- Cast to target with matching field names
SELECT CAST(source_col AS STRUCT<a INT, b INT>) FROM table1;

Example 3: Using struct constructors in the Rust API

If you need to map fields programmatically, build the target struct explicitly:

use arrow::datatypes::{DataType, Field, Fields};

// Build the target struct type with explicit field names
let target_struct_type = DataType::Struct(Fields::from(vec![
    Field::new("x", DataType::Int32, true),
    Field::new("y", DataType::Utf8, true),
]));

// Rather than relying on CAST for positional field mapping, construct the
// target struct explicitly (for example via named_struct in SQL, or by
// building a StructArray from the source columns) so the mapping is
// unambiguous.

Why this change:

  1. Safety: Field names are now the primary contract for struct compatibility

  2. Explicitness: Prevents silent data misalignment caused by positional assumptions

  3. Consistency: Matches DuckDB’s behavior and aligns with other SQL engines that enforce name-based matching

  4. Debuggability: Errors now appear at plan time rather than as silent data corruption

See Issue #19841 and PR #19955 for more details.

FilterExec builder methods deprecated#

The following methods on FilterExec have been deprecated in favor of using FilterExecBuilder:

  • with_projection()

  • with_batch_size()

Who is affected:

  • Users who create FilterExec instances and use these methods to configure them

Migration guide:

Use FilterExecBuilder instead of chaining method calls on FilterExec:

Before:

let filter = FilterExec::try_new(predicate, input)?
    .with_projection(Some(vec![0, 2]))?
    .with_batch_size(8192)?;

After:

let filter = FilterExecBuilder::new(predicate, input)
    .with_projection(Some(vec![0, 2]))
    .with_batch_size(8192)
    .build()?;

The builder pattern is more efficient as it computes properties once during build() rather than recomputing them for each method call.

Note: with_default_selectivity() is not deprecated as it simply updates a field value and does not require the overhead of the builder pattern.

Protobuf conversion trait added#

A new trait, PhysicalProtoConverterExtension, has been added to the datafusion-proto crate. It controls how physical plans and expressions are converted to and from their protobuf equivalents, and the conversion functions now take an additional converter parameter.

The primary APIs for interacting with this crate have not been modified, so most users should not need to make any changes. If you do require this trait, you can use the DefaultPhysicalProtoConverter implementation.

For example, to convert a sort expression protobuf node you can make the following updates:

Before:

let sort_expr = parse_physical_sort_expr(
    sort_proto,
    ctx,
    input_schema,
    codec,
);

After:

let converter = DefaultPhysicalProtoConverter {};
let sort_expr = parse_physical_sort_expr(
    sort_proto,
    ctx,
    input_schema,
    codec,
    &converter
);

Similarly, to convert from a physical sort expression into a protobuf node:

Before:

let sort_proto = serialize_physical_sort_expr(
    sort_expr,
    codec,
);

After:

let converter = DefaultPhysicalProtoConverter {};
let sort_proto = serialize_physical_sort_expr(
    sort_expr,
    codec,
    &converter,
);

generate_series and range table functions changed#

The generate_series and range table functions now return an empty set when the interval is invalid, instead of an error. This behavior is consistent with systems like PostgreSQL.

Before:

> select * from generate_series(0, -1);
Error during planning: Start is bigger than end, but increment is positive: Cannot generate infinite series

> select * from range(0, -1);
Error during planning: Start is bigger than end, but increment is positive: Cannot generate infinite series

Now:

> select * from generate_series(0, -1);
+-------+
| value |
+-------+
+-------+
0 row(s) fetched.

> select * from range(0, -1);
+-------+
| value |
+-------+
+-------+
0 row(s) fetched.

array_remove, array_remove_n, array_remove_all now return NULL when the element argument is NULL#

Previously, calling array_remove(array, NULL) attempted to match and remove NULL entries from the array (for array_remove, the first matching entry). Now, passing NULL as the element to remove causes the function to return NULL, which is consistent with standard SQL NULL propagation semantics.

The same change applies to array_remove_n (alias: list_remove_n) and array_remove_all (alias: list_remove_all).

Who is affected:

  • Queries that call array_remove, array_remove_n, or array_remove_all with a NULL element argument (literal or column-derived).

Behavioral changes:

Expression                                         | Old result      | New result
---------------------------------------------------|-----------------|-----------
array_remove(make_array(1, NULL, 2), NULL)         | [1, 2]          | NULL
array_remove(make_array(1, NULL, 2, NULL), NULL)   | [1, 2, NULL]    | NULL
array_remove_n(make_array(1, 2, 2, 1, 1), NULL, 2) | [1, 2, 2, 1, 1] | NULL
array_remove_all(make_array(1, 2, 2, 1, 1), NULL)  | [1, 2, 2, 1, 1] | NULL

Migration guide:

If your queries relied on the old behavior to strip NULLs from arrays, use array_remove_all with a non-NULL sentinel or use array_filter instead:

-- Before (removed NULLs from array):
SELECT array_remove(make_array(1, NULL, 2), NULL);
-- Old result: [1, 2]

-- After (returns NULL due to NULL propagation):
SELECT array_remove(make_array(1, NULL, 2), NULL);
-- New result: NULL

See #21011 and PR #21013 for details.