Upgrade Guides#

DataFusion 48.0.0#

Expr::Literal has optional metadata#

The Expr::Literal variant now includes optional metadata, which allows Arrow field metadata to be carried through to support extension types and other uses.

This means code such as

match expr {
...
  Expr::Literal(scalar) => ...
...
}

should be updated to:

match expr {
...
  Expr::Literal(scalar, _metadata) => ...
...
}

Likewise, constructing an Expr::Literal now requires metadata. The lit function is unchanged and returns an Expr::Literal with no metadata.
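As a self-contained sketch of this migration (the enum, scalar type, and metadata type here are local stand-ins, not DataFusion's actual Expr, ScalarValue, or metadata types):

```rust
use std::collections::HashMap;

// Local stand-in for DataFusion's Expr: the Literal variant now carries a
// second, optional metadata field alongside the scalar value.
#[derive(Debug, PartialEq)]
enum Expr {
    Literal(i64, Option<HashMap<String, String>>),
}

// Stand-in for `lit`: the call site is unchanged and no metadata is attached.
fn lit(value: i64) -> Expr {
    Expr::Literal(value, None)
}

fn main() {
    // Old code matched Expr::Literal(scalar); new code must also bind
    // (or ignore) the metadata field.
    match lit(42) {
        Expr::Literal(scalar, _metadata) => println!("scalar = {scalar}"),
    }
}
```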

Expr::WindowFunction is now Boxed#

Expr::WindowFunction is now a Box<WindowFunction> instead of a WindowFunction directly. This change was made to reduce the size of Expr and improve performance when planning queries (see details on #16207).

This is a breaking change, so you will need to update your code if you match on Expr::WindowFunction directly. For example, if you have code like this:

match expr {
  Expr::WindowFunction(WindowFunction {
    params:
      WindowFunctionParams {
        partition_by,
        order_by,
        ..
      },
    ..
  }) => {
    // Use partition_by and order_by as needed
  }
  _ => {
    // other expr
  }
}

You will need to change it to:

match expr {
  Expr::WindowFunction(window_fun) => {
    let WindowFunction {
      params:
        WindowFunctionParams {
          partition_by,
          order_by,
          ..
        },
      ..
    } = window_fun.as_ref();
    // Use partition_by and order_by as needed
  }
  _ => {
    // other expr
  }
}
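The size motivation behind boxing can be seen in a self-contained sketch (local stand-in types, not DataFusion's): a Rust enum is as large as its largest variant, so boxing a large variant shrinks every value of the enum, not just the window-function ones.

```rust
// Stand-in for a large payload like WindowFunction with its params.
struct BigPayload {
    _data: [u64; 16], // 128 bytes of payload
}

// Without boxing, every value of the enum pays for the largest variant.
enum Unboxed {
    Small(i64),
    Window(BigPayload),
}

// Boxing the large variant keeps the enum itself pointer-sized plus a tag.
enum Boxed {
    Small(i64),
    Window(Box<BigPayload>),
}

fn main() {
    assert!(std::mem::size_of::<Boxed>() < std::mem::size_of::<Unboxed>());

    // Matching now binds the Box; destructure through as_ref(), as in the
    // upgraded match arm above.
    let e = Boxed::Window(Box::new(BigPayload { _data: [0; 16] }));
    if let Boxed::Window(w) = &e {
        let BigPayload { _data } = w.as_ref();
    }
}
```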

The VARCHAR SQL type is now represented as Utf8View in Arrow#

The mapping of the SQL VARCHAR type has been changed from Utf8 to Utf8View, which improves performance for many string operations. You can read more about Utf8View in the DataFusion blog post on German-style strings.

This means that when you create a table with a VARCHAR column, it will now use Utf8View as the underlying data type. For example:

> CREATE TABLE my_table (my_column VARCHAR);
0 row(s) fetched.
Elapsed 0.001 seconds.

> DESCRIBE my_table;
+-------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-------------+-----------+-------------+
| my_column   | Utf8View  | YES         |
+-------------+-----------+-------------+
1 row(s) fetched.
Elapsed 0.000 seconds.

You can restore the old behavior of using Utf8 by setting the datafusion.sql_parser.map_varchar_to_utf8view configuration option to false. For example:

> set datafusion.sql_parser.map_varchar_to_utf8view = false;
0 row(s) fetched.
Elapsed 0.001 seconds.

> CREATE TABLE my_table (my_column VARCHAR);
0 row(s) fetched.
Elapsed 0.014 seconds.

> DESCRIBE my_table;
+-------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-------------+-----------+-------------+
| my_column   | Utf8      | YES         |
+-------------+-----------+-------------+
1 row(s) fetched.
Elapsed 0.004 seconds.

ListingOptions default for collect_stat changed from true to false#

This makes it agree with the default for SessionConfig. Most users won’t be impacted by this change but if you were using ListingOptions directly and relied on the default value of collect_stat being true, you will need to explicitly set it to true in your code.

ListingOptions::new(Arc::new(ParquetFormat::default()))
    .with_collect_stat(true)
    // other options

Processing FieldRef instead of DataType for user defined functions#

In order to support metadata handling and extension types, the user defined function traits now use FieldRef rather than a DataType and nullability. This provides a single interface for both of those parameters and additionally allows access to metadata fields, which can be used for extension types.

To upgrade structs that implement ScalarUDFImpl: if you have implemented return_type_from_args, you instead need to implement return_field_from_args. If your function does not need to handle metadata, this should be a straightforward repackaging of the output data into a FieldRef. The name you specify on the field is not important; it is overwritten during planning. ReturnInfo has been removed, so you will need to remove all references to it.
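The repackaging described above — returning a single field instead of a bare data type plus nullability — can be sketched with local stand-in types (the DataType, Field, and function names here are illustrative, not DataFusion's or arrow's exact API):

```rust
use std::sync::Arc;

// Local stand-ins for arrow's DataType and Field.
#[derive(Clone, Debug, PartialEq)]
enum DataType {
    Int64,
}

#[derive(Debug)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
}

type FieldRef = Arc<Field>;

// Old style: the UDF reported a DataType and nullability separately.
fn old_return_type() -> (DataType, bool) {
    (DataType::Int64, true)
}

// New style: the same information repackaged into a single FieldRef.
fn new_return_field() -> FieldRef {
    let (data_type, nullable) = old_return_type();
    Arc::new(Field {
        name: "output".to_string(), // placeholder; replaced during planning
        data_type,
        nullable,
    })
}

fn main() {
    let field = new_return_field();
    println!("{:?} nullable={}", field.data_type, field.nullable);
}
```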

ScalarFunctionArgs now contains a field called arg_fields. You can use this to access the metadata associated with the columnar values during invocation.

To upgrade user defined aggregate functions, there is now a function return_field that will allow you to specify both metadata and nullability of your function. You are not required to implement this if you do not need to handle metadata.

The largest change to aggregate functions happens in the accumulator arguments. Both the AccumulatorArgs and StateFieldsArgs now contain FieldRef rather than DataType.

To upgrade window functions, ExpressionArgs now contains input fields instead of input data types. When setting these fields, the names are not important since they are overwritten during the planning stage. All you should need to do is wrap your existing data types in fields, with nullability set depending on your use case.

Physical Expression return Field#

To support the changes to user defined functions processing metadata, the PhysicalExpr trait now must specify a returned Field based on the input schema. To upgrade structs that implement PhysicalExpr, you need to implement the return_field function. There are numerous examples in the physical-expr crate.

FileFormat::supports_filters_pushdown replaced with FileSource::try_pushdown_filters#

To support more general filter pushdown, FileFormat::supports_filters_pushdown was replaced with FileSource::try_pushdown_filters. If you implemented a custom FileFormat that uses a custom FileSource, you will need to implement FileSource::try_pushdown_filters. See ParquetSource::try_pushdown_filters for an example of how to implement this.

FileFormat::supports_filters_pushdown has been removed.

ParquetExec, AvroExec, CsvExec, JsonExec Removed#

ParquetExec, AvroExec, CsvExec, and JsonExec were deprecated in DataFusion 46 and are removed in DataFusion 48. This is sooner than the normal process described in the API Deprecation Guidelines because all the tests cover the new DataSourceExec rather than the older structures. As DataSource evolved, the old structures began to show signs of “bit rotting” (breaking without anyone noticing, due to the lack of test coverage).

PartitionedFile added as an argument to the FileOpener trait#

This is necessary to properly fix filter pushdown for filters that combine partition columns and file columns (e.g. day = username['dob']).

If you implemented a custom FileOpener, you will need to accept the new PartitionedFile argument, but you are not required to use it in any way.