# Upgrade Guides

## DataFusion 48.0.0

### `Expr::Literal` has optional metadata
The Expr::Literal variant now includes optional metadata, which allows for
carrying through Arrow field metadata to support extension types and other uses.
This means code such as:

```rust
match expr {
    ...
    Expr::Literal(scalar) => ...
    ...
}
```

should be updated to:

```rust
match expr {
    ...
    Expr::Literal(scalar, _metadata) => ...
    ...
}
```
Likewise, constructing an `Expr::Literal` now requires supplying metadata as
well. The `lit` function has not changed and returns an `Expr::Literal` with no
metadata.
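As a self-contained sketch of the new two-field variant, the stand-in types below mimic the shape of the change (the real `Expr::Literal` carries a `ScalarValue` and optional Arrow field metadata, not the simplified types used here):

```rust
use std::collections::BTreeMap;

// Simplified stand-in: the real Expr::Literal holds a ScalarValue and
// optional Arrow field metadata rather than an i64 and a string map.
#[derive(Debug)]
enum Expr {
    Literal(i64, Option<BTreeMap<String, String>>),
}

fn describe(expr: &Expr) -> String {
    match expr {
        // Patterns must now bind (or ignore) the second field.
        Expr::Literal(value, None) => format!("literal {value}"),
        Expr::Literal(value, Some(md)) => {
            format!("literal {value} with {} metadata entries", md.len())
        }
    }
}

fn main() {
    // Equivalent of what `lit` produces: a literal with no metadata.
    let plain = Expr::Literal(42, None);

    // A literal carrying metadata, e.g. to tag an extension type.
    let mut md = BTreeMap::new();
    md.insert("ARROW:extension:name".to_string(), "my.uuid".to_string());
    let tagged = Expr::Literal(42, Some(md));

    println!("{}", describe(&plain));
    println!("{}", describe(&tagged));
}
```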
### `Expr::WindowFunction` is now Boxed
`Expr::WindowFunction` is now a `Box<WindowFunction>` rather than a
`WindowFunction` directly. This change reduces the size of `Expr` and improves
performance when planning queries (see #16207 for details).
This is a breaking change, so you will need to update your code if you match
on Expr::WindowFunction directly. For example, if you have code like this:
```rust
match expr {
    Expr::WindowFunction(WindowFunction {
        params:
            WindowFunctionParams {
                partition_by,
                order_by,
                ..
            },
        ..
    }) => {
        // Use partition_by and order_by as needed
    }
    _ => {
        // other expr
    }
}
```
You will need to change it to:
```rust
match expr {
    Expr::WindowFunction(window_fun) => {
        let WindowFunction {
            fun,
            params:
                WindowFunctionParams {
                    args,
                    partition_by,
                    order_by,
                    ..
                },
        } = window_fun.as_ref();
        // Use fun, args, partition_by, and order_by as needed
    }
    _ => {
        // other expr
    }
}
```
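The new pattern amounts to matching the boxed variant and then destructuring through `as_ref()`. Here is a self-contained sketch using simplified stand-ins for the real DataFusion types (the variants and fields below are illustrative, not the actual definitions):

```rust
// Simplified stand-ins for the real DataFusion types, for illustration only.
struct WindowFunctionParams {
    partition_by: Vec<String>,
    order_by: Vec<String>,
}

struct WindowFunction {
    fun: String,
    params: WindowFunctionParams,
}

enum Expr {
    Column(String),
    // As of DataFusion 48 the variant holds a Box to keep Expr small.
    WindowFunction(Box<WindowFunction>),
}

fn partition_keys(expr: &Expr) -> Vec<String> {
    match expr {
        // Match the boxed variant, then destructure through as_ref().
        Expr::WindowFunction(window_fun) => {
            let WindowFunction { params, .. } = window_fun.as_ref();
            params.partition_by.clone()
        }
        _ => vec![],
    }
}

fn main() {
    let expr = Expr::WindowFunction(Box::new(WindowFunction {
        fun: "row_number".to_string(),
        params: WindowFunctionParams {
            partition_by: vec!["a".to_string()],
            order_by: vec!["b".to_string()],
        },
    }));
    println!("{:?}", partition_keys(&expr));
}
```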
### The `VARCHAR` SQL type is now represented as `Utf8View` in Arrow
The mapping of the SQL `VARCHAR` type has been changed from `Utf8` to
`Utf8View`, which improves performance for many string operations. You can read
more about `Utf8View` in the DataFusion blog post on German-style strings.
This means that when you create a table with a VARCHAR column, it will now use
Utf8View as the underlying data type. For example:
```sql
> CREATE TABLE my_table (my_column VARCHAR);
0 row(s) fetched.
Elapsed 0.001 seconds.

> DESCRIBE my_table;
+-------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-------------+-----------+-------------+
| my_column   | Utf8View  | YES         |
+-------------+-----------+-------------+
1 row(s) fetched.
Elapsed 0.000 seconds.
```
You can restore the old behavior of using `Utf8` by changing the
`datafusion.sql_parser.map_varchar_to_utf8view` configuration setting. For
example:

```sql
> set datafusion.sql_parser.map_varchar_to_utf8view = false;
0 row(s) fetched.
Elapsed 0.001 seconds.

> CREATE TABLE my_table (my_column VARCHAR);
0 row(s) fetched.
Elapsed 0.014 seconds.

> DESCRIBE my_table;
+-------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-------------+-----------+-------------+
| my_column   | Utf8      | YES         |
+-------------+-----------+-------------+
1 row(s) fetched.
Elapsed 0.004 seconds.
```
### `ListingOptions` default for `collect_stat` changed from `true` to `false`
This makes it consistent with the default for `SessionConfig`.

Most users won't be impacted by this change, but if you were using
`ListingOptions` directly and relied on the default value of `collect_stat`
being `true`, you will need to set it explicitly:

```rust
ListingOptions::new(Arc::new(ParquetFormat::default()))
    .with_collect_stat(true)
    // ... other options
```
### Processing `FieldRef` instead of `DataType` for user defined functions
In order to support metadata handling and extension types, the user defined
function traits now use `FieldRef` rather than a `DataType` and nullability
flag. This gives a single interface for both of those parameters and
additionally allows access to field metadata, which can be used for extension
types.
To upgrade structs that implement `ScalarUDFImpl`: if you have implemented
`return_type_from_args`, you now need to implement `return_field_from_args`
instead. If your function does not need to handle metadata, this should be a
straightforward repackaging of the output data into a `FieldRef`. The name you
specify on the field is not important; it will be overwritten during planning.
`ReturnInfo` has been removed, so you will need to remove all references to it.

`ScalarFunctionArgs` now contains a field called `arg_fields`. You can use this
to access the metadata associated with the columnar values during invocation.
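To see the shape of that repackaging, here is a self-contained sketch with simplified stand-ins (the real code uses arrow's `Field`/`FieldRef` and DataFusion's argument structs; the types and names below are illustrative only):

```rust
use std::sync::Arc;

// Simplified stand-in for arrow_schema::Field / FieldRef.
#[derive(Debug, PartialEq)]
struct Field {
    name: String,
    data_type: String,
    nullable: bool,
}
type FieldRef = Arc<Field>;

// Before: the function only reported a data type and nullability.
fn old_return_type() -> (String, bool) {
    ("Utf8".to_string(), true)
}

// After: wrap the same information in a Field. The name is overwritten
// during planning, so any placeholder will do.
fn return_field_from_args() -> FieldRef {
    let (data_type, nullable) = old_return_type();
    Arc::new(Field {
        name: "output".to_string(),
        data_type,
        nullable,
    })
}

fn main() {
    let field = return_field_from_args();
    println!("{} (nullable: {})", field.data_type, field.nullable);
}
```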
To upgrade user defined aggregate functions, there is now a `return_field`
function that lets you specify both the metadata and the nullability of your
function's output. You are not required to implement it if you do not need to
handle metadata.

The largest change to aggregate functions is in the accumulator arguments: both
`AccumulatorArgs` and `StateFieldsArgs` now contain `FieldRef` rather than
`DataType`.
To upgrade window functions, `ExpressionArgs` now contains input fields instead
of input data types. When setting these fields, the field names are not
important since they are overwritten during the planning stage. All you should
need to do is wrap your existing data types in fields, with nullability set
according to your use case.
### Physical Expressions return `Field`

To support the metadata handling changes in user defined functions, the
`PhysicalExpr` trait must now specify a return `Field` based on the input
schema. To upgrade structs that implement `PhysicalExpr`, you need to implement
the `return_field` function. There are numerous examples in the `physical-expr`
crate.
### `FileFormat::supports_filters_pushdown` replaced with `FileSource::try_pushdown_filters`
To support more general filter pushdown, `FileFormat::supports_filters_pushdown`
has been replaced with `FileSource::try_pushdown_filters`.

If you implemented a custom `FileFormat` that uses a custom `FileSource`, you
will need to implement `FileSource::try_pushdown_filters`. See
`ParquetSource::try_pushdown_filters` for an example of how to implement this.

`FileFormat::supports_filters_pushdown` has been removed.
### `ParquetExec`, `AvroExec`, `CsvExec`, `JsonExec` Removed
`ParquetExec`, `AvroExec`, `CsvExec`, and `JsonExec` were deprecated in
DataFusion 46 and are removed in DataFusion 48. This is sooner than the normal
process described in the API Deprecation Guidelines because all the tests cover
the new `DataSourceExec` rather than the older structures. As we evolve
`DataSource`, the old structures began to show signs of "bit rot" (breaking
without anyone noticing, due to the lack of test coverage).
### `PartitionedFile` added as an argument to the `FileOpener` trait
This is necessary to properly fix filter pushdown for filters that combine
partition columns and file columns (e.g. `day = username['dob']`).

If you implemented a custom `FileOpener`, you will need to add the
`PartitionedFile` argument, but you are not required to use it in any way.
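The shape of the change can be sketched with simplified stand-ins (the real `FileOpener` trait lives in DataFusion and its `open` signature differs; the types and return value below are illustrative only):

```rust
// Simplified stand-ins for the real DataFusion types, for illustration only.
struct PartitionedFile {
    path: String,
}

struct FileMeta {
    size: u64,
}

trait FileOpener {
    // DataFusion 48 adds a PartitionedFile argument; implementations may
    // accept it without using it.
    fn open(&self, meta: FileMeta, file: PartitionedFile) -> Result<String, String>;
}

struct MyOpener;

impl FileOpener for MyOpener {
    // The new argument is accepted but ignored, which is allowed.
    fn open(&self, meta: FileMeta, _file: PartitionedFile) -> Result<String, String> {
        Ok(format!("opened {} bytes", meta.size))
    }
}

fn main() {
    let out = MyOpener
        .open(
            FileMeta { size: 10 },
            PartitionedFile { path: "f.parquet".to_string() },
        )
        .unwrap();
    println!("{out}");
}
```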