Upgrade Guides#
DataFusion 55.0.0#
Note: DataFusion 55.0.0 has not been released yet. The information provided
in this section pertains to features and changes that have already been merged
to the main branch and are awaiting release in this version.
User SpillFile traits instead of RefCountedTempFile#
Spill file APIs now use the datafusion_execution::SpillFile trait instead of
the concrete RefCountedTempFile type. DiskManager::create_tmp_file now
returns Arc<dyn SpillFile>.
This change was introduced in [PR #21882], which adds pluggable spill file
backends via SpillFile and TempFileFactory.
If your code matched on DiskManagerMode, add a DiskManagerMode::Custom(_)
arm.
If your code wrote directly to a RefCountedTempFile or called
RefCountedTempFile::update_disk_usage, open a spill writer instead:
- temp_file.inner().as_file().write_all(bytes)?;
- temp_file.update_disk_usage()?;
+ temp_file.open_writer()?.write_all(bytes)?;
Use temp_file.size() instead of RefCountedTempFile::current_disk_usage.
Dialect::AVAILABLE replaced by Dialect::available()#
datafusion_common::config::Dialect::AVAILABLE has been removed. Use
Dialect::available() instead.
spill_record_batch_by_size removed#
datafusion_physical_plan::spill::spill_record_batch_by_size has been removed.
This function was deprecated in DataFusion 46.0.0.
Use datafusion_physical_plan::spill::SpillManager::spill_record_batch_by_size
instead.
Decimal scalar formatting uses human-readable values#
Decimal scalar literals in EXPLAIN output, expression display strings, and
auto-generated column names now format the decimal value using its scale while
still showing the precision and scale. For example, a Decimal128 literal with
stored value 1, precision 1, and scale 1 is now rendered as
Decimal128(0.1,1,1) instead of Decimal128(Some(1),1,1). When formatting a
ScalarValue directly, it now appears as 0.1 instead of Some(1),1,1.
NULL decimal literals were previously shown as Decimal128(None,10,2); they
will now appear as Decimal128(NULL,10,2).
Query result values already used human-readable decimal formatting and are unchanged.
GroupsAccumulator::merge_batch no longer takes opt_filter#
The opt_filter argument has been removed from
datafusion_expr_common::groups_accumulator::GroupsAccumulator::merge_batch:
fn merge_batch(
&mut self,
values: &[ArrayRef],
group_indices: &[usize],
- opt_filter: Option<&BooleanArray>,
total_num_groups: usize,
) -> Result<()>;
Aggregate FILTER clauses only apply to raw input rows during the partial
(update) phase, so by the time intermediate states are merged there is nothing
left to filter per row. In practice opt_filter was always None here, so
removing it makes the API self-explanatory and impossible to misuse.
Who is affected:
Anyone with a custom
GroupsAccumulatorimplementation.Anyone calling
merge_batchdirectly.
Migration guide:
Drop the opt_filter argument from your merge_batch signature and from any
call sites:
fn merge_batch(
&mut self,
values: &[ArrayRef],
group_indices: &[usize],
- opt_filter: Option<&BooleanArray>,
total_num_groups: usize,
) -> Result<()> {
// ...
}
- acc.merge_batch(values, group_indices, None, total_num_groups)?;
+ acc.merge_batch(values, group_indices, total_num_groups)?;
If your implementation previously inspected opt_filter (for example asserting
it was None), that code can simply be deleted.
See issue #22775 for details.
is_dynamic_physical_expr is deprecated#
datafusion_physical_expr_common::physical_expr::is_dynamic_physical_expr is
deprecated. It was a thin wrapper over snapshot_generation(expr) != 0 used to
ask “does this predicate contain a dynamic filter?”.
Prefer asking the question directly against the concrete type. For a one-off
check, downcast to DynamicFilterPhysicalExpr:
use datafusion_physical_expr::expressions::DynamicFilterPhysicalExpr;
use datafusion_common::tree_node::{TreeNode, TreeNodeRecursion};
let mut is_dynamic = false;
predicate.apply(|e| {
if e.downcast_ref::<DynamicFilterPhysicalExpr>().is_some() {
is_dynamic = true;
Ok(TreeNodeRecursion::Stop)
} else {
Ok(TreeNodeRecursion::Continue)
}
})?;
If you also need to know whether the dynamic filters can still change (and to be
notified when they do), use the new DynamicFilterTracking /
DynamicFilterTracker API in datafusion_physical_expr:
use datafusion_physical_expr::DynamicFilterTracking;
let tracking = DynamicFilterTracking::classify(&predicate);
if tracking.contains_dynamic_filter() {
// worth re-evaluating the predicate at runtime
}
FilePruner::try_new no longer builds a pruner for static predicates without statistics#
datafusion_pruning::FilePruner::try_new now returns None when the predicate
is purely static and the file carries no usable column statistics, because
such a pruner can never prune anything beyond what planning already did.
Previously it returned Some whenever a statistics struct was present (the
“is this worth pruning?” decision lived in the Parquet opener). Files with column
statistics, and predicates that carry a dynamic filter, are unaffected.
QueryPlanner adds Any as a supertrait#
To enable downcasting of dyn QueryPlanner to concrete query planner types (via
is::<T>() / downcast_ref::<T>()), the QueryPlanner trait now has Any
as a supertrait:
- pub trait QueryPlanner: Debug
+ pub trait QueryPlanner: Any + Debug
ExecutionPlan::partition_statistics deprecated in favor of statistics_with_args#
ExecutionPlan::partition_statistics is deprecated. A new method
statistics_with_args accepts a StatisticsArgs parameter that carries
the partition index and a shared cache for memoized child statistics lookups.
Existing implementations of partition_statistics continue to work unchanged.
The default statistics_with_args delegates to the deprecated method, so no
migration is required until the deprecated method is removed.
Warning: The delegation is one-way: the default
statistics_with_argscallspartition_statistics, but the defaultpartition_statisticsdoes not callstatistics_with_args— it returnsStatistics::new_unknown. Nodes that override onlystatistics_with_argswill silently returnStatistics::new_unknownto any caller still using the deprecatedpartition_statistics.
Who is affected:
Users who implement custom
ExecutionPlannodes (recommended to migrate)Users who call
partition_statisticsdirectly (recommended to switch tostatistics_with_args)
Migration guide:
For implementations, override statistics_with_args instead of
partition_statistics. Leaf nodes that do not have children can ignore
the args.
Child statistics are looked up via args.compute_child_statistics(child, partition).
Use args.partition() for partition-preserving operators, or None for
partition-merging operators that always need overall stats:
// Before:
fn partition_statistics(&self, partition: Option<usize>) -> Result<Arc<Statistics>> {
let child_stats = self.input.partition_statistics(partition)?;
// ... transform child_stats ...
}
// After (partition-preserving):
fn statistics_with_args(
&self,
args: &StatisticsArgs,
) -> Result<Arc<Statistics>> {
let child_stats = args.compute_child_statistics(&self.input, args.partition())?;
// ... transform child_stats ...
}
// After (partition-merging):
fn statistics_with_args(
&self,
args: &StatisticsArgs,
) -> Result<Arc<Statistics>> {
let child_stats = args.compute_child_statistics(&self.input, None)?;
// ... transform child_stats ...
}
For callers, create a StatisticsArgs and call statistics_with_args
directly. The cache is created automatically:
use datafusion_physical_plan::StatisticsArgs;
// Before:
let stats = plan.partition_statistics(None)?;
// After:
let stats = plan.statistics_with_args(&StatisticsArgs::new())?;
DdlStatement::CreateExternalTable and CreateFunction are now boxed#
The two largest variants of datafusion_expr::DdlStatement are now
Boxed:
// Before
pub enum DdlStatement {
CreateExternalTable(CreateExternalTable),
// ...
CreateFunction(CreateFunction),
// ...
}
// After
pub enum DdlStatement {
CreateExternalTable(Box<CreateExternalTable>),
// ...
CreateFunction(Box<CreateFunction>),
// ...
}
CreateExternalTable is 312 bytes and CreateFunction is 288 bytes, so
without boxing they forced the entire LogicalPlan enum to 320 bytes
even on SELECT-only query paths that never instantiate them. Boxing
shrinks LogicalPlan from 320 → 176 bytes (−45%), making every
mem::take / mem::swap / Arc<LogicalPlan> store on the planning
hot path move a smaller payload.
Who is affected:
Users who construct
DdlStatement::CreateExternalTable(...)orDdlStatement::CreateFunction(...)from an owned struct.Users who pattern-match these variants and destructure the inner struct in the same pattern (e.g.
DdlStatement::CreateExternalTable(CreateExternalTable { name, .. })).Code that consumes the inner struct out of these variants (e.g. to pass
CreateExternalTableby value to another function).
Migration guide:
When constructing the variants, wrap the inner struct in Box::new:
// Before
let stmt = DdlStatement::CreateFunction(CreateFunction { name, args, .. });
// After
let stmt = DdlStatement::CreateFunction(Box::new(CreateFunction {
name,
args,
..
}));
When pattern-matching, bind the boxed value and either access fields
through it (Rust auto-derefs the Box) or destructure via .as_ref():
// Before
match ddl {
DdlStatement::CreateExternalTable(CreateExternalTable {
name, location, ..
}) => { /* use name, location */ }
}
// After — access fields through the box
match ddl {
DdlStatement::CreateExternalTable(ce) => {
let name = &ce.name;
let location = &ce.location;
/* ... */
}
}
// After — destructure the dereferenced struct
match ddl {
DdlStatement::CreateExternalTable(ce) => {
let CreateExternalTable { name, location, .. } = ce.as_ref();
/* ... */
}
}
When you need an owned CreateExternalTable / CreateFunction out of
the variant, dereference the box with *:
// Before
match plan {
LogicalPlan::Ddl(DdlStatement::CreateExternalTable(cmd)) => Ok(cmd),
_ => { /* ... */ }
}
// After
match plan {
LogicalPlan::Ddl(DdlStatement::CreateExternalTable(cmd)) => Ok(*cmd),
_ => { /* ... */ }
}
See PR #22733 for details, including the per-variant size breakdown and benchmark results.
ListingOptions::target_partitions and collect_stat removed#
The target_partitions and collect_stat fields on
datafusion_catalog_listing::ListingOptions, their builder methods
(with_target_partitions, with_collect_stat), and the
with_session_config_options helper have been removed.
ListingTable now reads both values directly from the active SessionConfig
at scan time instead of from a copy snapshotted onto the table at construction
time.
Who is affected:
Code that set
target_partitions/collect_statper table viaListingOptions, or read those public fields.Code that relied on a
ListingTablefreezing these values at construction time independently of the session config. The table now always reflects the currentSessionConfig.
Migration guide:
Configure these on the SessionConfig instead:
// Before
let options = ListingOptions::new(format)
.with_target_partitions(8)
.with_collect_stat(true);
// After
let config = SessionConfig::new()
.with_target_partitions(8)
.with_collect_statistics(true);
See PR #22969 for details.
Spark map functions now reject duplicate keys by default#
The Spark-compatibility map-construction functions (map_from_arrays,
map_from_entries, str_to_map) now raise [DUPLICATED_MAP_KEY] at runtime
when constructing a map that contains duplicate keys. This matches the default
of Spark’s spark.sql.mapKeyDedupPolicy.
A new config option, datafusion.spark.map_key_dedup_policy, controls the
behavior:
EXCEPTION(default): raise on any duplicate key.LAST_WIN: keep the last occurrence of each duplicate key. The key stays at its first-seen position with the value from its last occurrence (matching Spark’sArrayBasedMapBuilder).
Who is affected:
Queries calling
map_from_arraysorstr_to_mapon data that contains duplicate keys. Previously these functions either tolerated duplicates silently or raised a non-configurable error.
Migration guide:
To restore lenient duplicate-key handling, set the policy to LAST_WIN:
SET datafusion.spark.map_key_dedup_policy = 'LAST_WIN';
See PR #21720 for details.
Unify LRU memory-limiting caches into one generic cache#
The caches DefaultFileMetadataCache, DefaultListFilesCache and DefaultFileStatisticsCache
are merged into one generic implementation DefaultCache. The corresponding traits are now
type aliases:
- pub trait FileStatisticsCache: CacheAccessor<TableScopedPath, CachedFileMetadata>
- pub trait ListFilesCache: CacheAccessor<TableScopedPath, CachedFileList>
- pub trait FileMetadataCache: CacheAccessor<Path, CachedFileMetadataEntry>
+ pub type FileStatisticsCache = dyn Cache<TableScopedPath, CachedFileMetadata>;
+ pub type ListFilesCache = dyn Cache<TableScopedPath, CachedFileList>;
+ pub type FileMetadataCache = dyn Cache<Path, CachedFileMetadataEntry>;
Who is affected:
Users who introduced their own implementation of
FileMetadataCache,ListFilesCacheorFileStatisticsCache.
Migration guide:
Implement the newly introduced types for your custom cache implementation.
See PR #22613 for details.