HOWTOs

How to update the version of Rust used in CI tests

Make a PR to update the rust-toolchain file in the root of the repository.

Adding new functions

Implementation

| Function type | Location to implement | Trait to implement                     | Macros to use                                | Example          |
| ------------- | --------------------- | -------------------------------------- | -------------------------------------------- | ---------------- |
| Scalar        | functions             | ScalarUDFImpl                          | make_udf_function!() and export_functions!() | advanced_udf.rs  |
| Nested        | functions-nested      | ScalarUDFImpl                          | make_udf_expr_and_func!()                    |                  |
| Aggregate     | functions-aggregate   | AggregateUDFImpl and an Accumulator    | make_udaf_expr_and_func!()                   | advanced_udaf.rs |
| Window        | functions-window      | WindowUDFImpl and a PartitionEvaluator | define_udwf_and_expr!()                      | advanced_udwf.rs |
| Table         | functions-table       | TableFunctionImpl and a TableProvider  | create_udtf_function!()                      | simple_udtf.rs   |

  • The macros reduce boilerplate, for example by ensuring that a DataFrame API compatible function is also created.

  • Ensure new functions are properly exported through the subproject mod.rs or lib.rs.

  • Functions should preferably provide documentation via the #[user_doc(...)] attribute so that it can be included in the SQL reference documentation (see the Documentation section below).

  • Scalar functions are further grouped into modules for families of functions (e.g. string, math, datetime). Add functions to the relevant module. If a new module is needed, also add a new Rust feature so that DataFusion users can conditionally compile the modules they need.

  • Aggregate functions can optionally implement a GroupsAccumulator for better performance.

Spark compatible functions are located in a separate crate but otherwise follow the same steps, though all function types (e.g. scalar, nested, aggregate) are grouped together in a single location.

Testing

Prefer adding sqllogictest integration tests, in which the function is called via SQL against well-known data and returns an expected result. Add test cases to an existing test file if an appropriate one exists; otherwise create a new file. See the sqllogictest documentation for details on how to construct these tests. Ensure that edge cases and null inputs are covered by these tests.

If a behaviour cannot be tested via sqllogictest (e.g. testing simplify(), behaviour that must be tested in isolation from the optimizer, or input that is difficult to construct exactly via sqllogictest), tests can be added as Rust unit tests in the implementation module, though these should be kept minimal where possible.
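As a rough illustration of the unit-test fallback, the sketch below shows the general shape of a minimal test module placed next to the code it tests. It uses only the standard library. The helper `double_checked` is hypothetical and stands in for whatever internal logic cannot be reached from SQL, and treating overflow as a null result is an assumption made for this example, not DataFusion behaviour.

```rust
/// Hypothetical helper standing in for function logic under test.
/// Doubles the input; a null (`None`) input yields a null result, and
/// (by assumption, for this sketch only) so does an overflowing result.
fn double_checked(input: Option<i64>) -> Option<i64> {
    input.and_then(|v| v.checked_mul(2))
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn regular_value_is_doubled() {
        assert_eq!(double_checked(Some(21)), Some(42));
    }

    #[test]
    fn null_input_returns_null() {
        assert_eq!(double_checked(None), None);
    }

    #[test]
    fn overflow_returns_null() {
        assert_eq!(double_checked(Some(i64::MAX)), None);
    }
}
```

Each test covers one case (a regular value, a null input, an edge case), which keeps the module small and makes failures easy to attribute.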

Documentation

Run the documentation update script ./dev/update_function_docs.sh, which updates the relevant markdown documents (see the documents for scalar, aggregate and window functions).

  • Do not manually edit the markdown documents after running the script, as those manual changes will be overwritten the next time the script runs.

  • Reference GitHub issue which introduced this behaviour

How to display plans graphically

The query plans represented by LogicalPlan nodes can be graphically rendered using Graphviz.

To do so, save the output of the display_graphviz function to a file:

use std::{fs::File, io::Write};

// Create plan somehow...
let mut output = File::create("/tmp/plan.dot")?;
write!(output, "{}", plan.display_graphviz())?;

Then, use the dot command line tool to render it into a file that can be displayed. For example, the following command creates a /tmp/plan.pdf file:

dot -Tpdf < /tmp/plan.dot > /tmp/plan.pdf
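The two steps above can also be wrapped in small helpers. The following is a minimal sketch using only the Rust standard library. The names `write_dot` and `render_pdf` are hypothetical and not part of DataFusion. `write_dot` accepts any value implementing `Display`, which is all that the `write!` call in the snippet above requires of the value returned by `plan.display_graphviz()`. `render_pdf` assumes the Graphviz `dot` binary is on the `PATH`.

```rust
use std::fmt::Display;
use std::fs::File;
use std::io::{self, Write};
use std::path::Path;
use std::process::Command;

/// Hypothetical helper: writes any `Display` value (for example the value
/// returned by `plan.display_graphviz()`) to `path`.
fn write_dot(path: &Path, dot: impl Display) -> io::Result<()> {
    let mut output = File::create(path)?;
    write!(output, "{}", dot)?;
    Ok(())
}

/// Hypothetical helper: runs `dot -Tpdf <dot_path> -o <pdf_path>`.
/// Returns `Ok(true)` if `dot` exited successfully. Fails with an
/// `io::Error` if the `dot` binary cannot be started.
fn render_pdf(dot_path: &Path, pdf_path: &Path) -> io::Result<bool> {
    let status = Command::new("dot")
        .arg("-Tpdf")
        .arg(dot_path)
        .arg("-o")
        .arg(pdf_path)
        .status()?;
    Ok(status.success())
}
```

A caller would typically invoke `write_dot(Path::new("/tmp/plan.dot"), plan.display_graphviz())` followed by `render_pdf(Path::new("/tmp/plan.dot"), Path::new("/tmp/plan.pdf"))`, propagating errors with `?`.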

How to format .md documents

We use prettier to format .md files.

You can either use npm i -g prettier to install it globally or use npx to run it as a standalone binary. Using npx requires a working node environment. Upgrading to the latest prettier is recommended (by adding --upgrade to the npm command).

$ prettier --version
2.3.0

After you’ve confirmed your prettier version, you can format all the .md files:

prettier -w {datafusion,datafusion-cli,datafusion-examples,dev,docs}/**/*.md

How to format .toml files

We use taplo to format .toml files.

To install via cargo:

cargo install taplo-cli --locked

Refer to the taplo installation documentation for other ways to install it.

$ taplo --version
taplo 0.9.0

After you’ve confirmed your taplo version, you can format all the .toml files:

taplo fmt

How to update protobuf/gen dependencies

For the proto and proto-common crates, the prost/tonic code is generated by running their respective ./regen.sh scripts, which in turn invoke the Rust binary located in ./gen.

This is necessary after modifying the protobuf definitions or altering the dependencies of ./gen, and requires a valid installation of protoc (see installation instructions for details).

# From repository root
# proto-common
./datafusion/proto-common/regen.sh
# proto
./datafusion/proto/regen.sh