HOWTOs

How to add a new scalar function

Below is a checklist of what you need to do to add a new scalar function to DataFusion:

  • Add the actual implementation of the function to a new module file within:

    • here for array functions

    • here for crypto functions

    • here for datetime functions

    • here for encoding functions

    • here for math functions

    • here for regex functions

    • here for string functions

    • here for unicode functions

    • create a new module here for other functions.

  • New function modules (for example, a vector module) should be gated behind a Rust feature (for example, vector_expressions) so that DataFusion users can enable or disable the new module as desired.

  • Implement the function by implementing the ScalarUDFImpl trait for the function struct.

    • See advanced_udf.rs for an example implementation

    • Add tests for the new function

  • To connect the implementation of the function, add the following to the mod.rs file:

    • a mod xyz; where xyz is the new module file

    • a call to make_udf_function!(..);

    • an item in export_functions!(..);

  • In sqllogictest/test_files, add new sqllogictest integration tests that call the function through SQL against well-known data and check that it returns the expected result.

    • Documentation for sqllogictest here

  • Add SQL reference documentation here

How to add a new aggregate function

Below is a checklist of what you need to do to add a new aggregate function to DataFusion:

  • Add the actual implementation of an Accumulator and AggregateExpr.

  • In datafusion/expr/src, add:

    • a new variant to AggregateFunction

    • a new entry to FromStr with the name of the function as called by SQL

    • a new line in return_type with the expected return type of the function, given an incoming type

    • a new line in signature with the signature of the function (number and types of its arguments)

    • a new line in create_aggregate_expr mapping the built-in to the implementation

    • tests for the function.

  • In sqllogictest/test_files, add new sqllogictest integration tests that call the function through SQL against well-known data and check that it returns the expected result.

    • Documentation for sqllogictest here

  • Add SQL reference documentation here
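
To make the accumulator step in the checklist above more concrete, here is a minimal, standard-library-only sketch of the idea behind an accumulator: consume batches of input, expose a partial state that can be merged across partitions, and produce a final value. The SimpleAccumulator trait and the AvgAccumulator struct are hypothetical names invented for this illustration. They are not DataFusion's API; the real Accumulator trait operates on Arrow arrays and scalar values, so treat this only as a picture of the update/merge/evaluate flow.

```rust
/// Hypothetical, simplified stand-in for an accumulator.
/// This is NOT DataFusion's `Accumulator` trait, which works on Arrow
/// arrays and scalar values rather than plain `f64` slices.
trait SimpleAccumulator {
    /// Partial state that could be exchanged between partitions.
    type State;

    /// Fold one batch of input values into the running state.
    fn update_batch(&mut self, values: &[f64]);

    /// Export the current partial state.
    fn state(&self) -> Self::State;

    /// Combine a partial state produced by another accumulator.
    fn merge(&mut self, other: Self::State);

    /// Produce the final result; `None` when no rows were seen.
    fn evaluate(&self) -> Option<f64>;
}

/// Average, tracked as a running sum and a row count.
#[derive(Default)]
struct AvgAccumulator {
    sum: f64,
    count: u64,
}

impl SimpleAccumulator for AvgAccumulator {
    type State = (f64, u64);

    fn update_batch(&mut self, values: &[f64]) {
        self.sum += values.iter().sum::<f64>();
        self.count += values.len() as u64;
    }

    fn state(&self) -> Self::State {
        (self.sum, self.count)
    }

    fn merge(&mut self, other: Self::State) {
        self.sum += other.0;
        self.count += other.1;
    }

    fn evaluate(&self) -> Option<f64> {
        if self.count == 0 {
            None
        } else {
            Some(self.sum / self.count as f64)
        }
    }
}
```

In this sketch, each partition would feed its rows to its own AvgAccumulator through update_batch, and a final accumulator would merge the exported states before calling evaluate. Keeping the sum and count as state, rather than a running average, is what makes the merge step exact.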

How to display plans graphically

The query plans represented by LogicalPlan nodes can be graphically rendered using Graphviz.

To do so, save the output of the display_graphviz function to a file:

use std::fs::File;
use std::io::Write;

// Create plan somehow...
let mut output = File::create("/tmp/plan.dot")?;
write!(output, "{}", plan.display_graphviz())?;

Then, use the dot command line tool to render it into a file that can be displayed. For example, the following command creates a /tmp/plan.pdf file:

dot -Tpdf < /tmp/plan.dot > /tmp/plan.pdf
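
The file-writing step can also be wrapped in a small helper that propagates I/O errors instead of discarding them. The sketch below uses only the standard library and relies on the fact that the value being written only needs to implement Display. FakePlanGraphviz and write_dot are hypothetical names made up for this example: FakePlanGraphviz stands in for whatever plan.display_graphviz() returns and emits a hand-written DOT graph, not real DataFusion output.

```rust
use std::fmt::{self, Display};
use std::fs::File;
use std::io::{self, BufWriter, Write};
use std::path::Path;

/// Hypothetical stand-in for the value returned by `plan.display_graphviz()`.
/// It is not a DataFusion type; it just prints a tiny hand-written DOT graph.
struct FakePlanGraphviz;

impl Display for FakePlanGraphviz {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        writeln!(f, "digraph {{")?;
        writeln!(f, "  1 [label=\"Projection\"]")?;
        writeln!(f, "  2 [label=\"TableScan\"]")?;
        writeln!(f, "  1 -> 2")?;
        write!(f, "}}")
    }
}

/// Write anything that implements `Display` to `path`.
/// Every I/O error (create, write, flush) is returned to the caller.
fn write_dot<D: Display, P: AsRef<Path>>(graph: D, path: P) -> io::Result<()> {
    let mut output = BufWriter::new(File::create(path)?);
    write!(output, "{}", graph)?;
    output.flush()
}
```

With a real plan, a call such as write_dot(plan.display_graphviz(), "/tmp/plan.dot")? would take the place of the stand-in type, after which the dot command above can render the file.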

How to format .md documents

We use prettier to format .md files.

You can either install it globally with npm i -g prettier or run it as a standalone binary with npx. Using npx requires a working node environment. Upgrading to the latest prettier is recommended (by adding --upgrade to the npm command).

$ prettier --version
2.3.0

After you’ve confirmed your prettier version, you can format all the .md files:

prettier -w {datafusion,datafusion-cli,datafusion-examples,dev,docs}/**/*.md

How to format .toml files

We use taplo to format .toml files.

For Rust developers, you can install it via:

cargo install taplo-cli --locked

Refer to the Installation section for other ways to install it.

$ taplo --version
taplo 0.9.0

After you’ve confirmed your taplo version, you can format all the .toml files:

taplo fmt