Working with Exprs

Expr is short for “expression”. It is a core abstraction in DataFusion for representing a computation, and follows the standard “expression tree” abstraction found in most compilers and databases.

For example, the SQL expression a + b would be represented as an Expr with a BinaryExpr variant. A BinaryExpr has a left and right Expr and an operator.

As another example, the SQL expression a + b * c would be represented as an Expr with a BinaryExpr variant. The left Expr would be a and the right Expr would be another BinaryExpr with a left Expr of b and a right Expr of c. As a classic expression tree, this would look like:

            ┌────────────────────┐
            │     BinaryExpr     │
            │       op: +        │
            └────────────────────┘
                   ▲     ▲
           ┌───────┘     └────────────────┐
           │                              │
┌────────────────────┐         ┌────────────────────┐
│     Expr::Col      │         │     BinaryExpr     │
│       col: a       │         │       op: *        │
└────────────────────┘         └────────────────────┘
                                        ▲    ▲
                               ┌────────┘    └─────────┐
                               │                       │
                    ┌────────────────────┐  ┌────────────────────┐
                    │     Expr::Col      │  │     Expr::Col      │
                    │       col: b       │  │       col: c       │
                    └────────────────────┘  └────────────────────┘

As the writer of a library, you can use Exprs to represent computations that you want to perform. This guide will walk you through how to make your own scalar UDF as an Expr and how to rewrite Exprs to inline the simple UDF.

Creating and Evaluating Exprs

Please see expr_api.rs for well commented code for creating, evaluating, simplifying, and analyzing Exprs.

A Scalar UDF Example

We’ll use a ScalarUDF expression as our example. This necessitates implementing an actual UDF, and for ease we’ll use the same example from the adding UDFs guide.

So assuming you’ve written that function, you can use it to create an Expr:

let add_one_udf = create_udf(
    "add_one",
    vec![DataType::Int64],
    Arc::new(DataType::Int64),
    Volatility::Immutable,
    make_scalar_function(add_one),  // <-- the function we wrote
);

// make the expr `add_one(5)`
let expr = add_one_udf.call(vec![lit(5)]);

// make the expr `add_one(my_column)`
let expr = add_one_udf.call(vec![col("my_column")]);

If you’d like to learn more about Exprs, before we get into the details of creating and rewriting them, you can read the expression user-guide.

Rewriting Exprs

There are several examples of rewriting and working with Exprs:

Rewriting Expressions is the process of taking an Expr and transforming it into another Expr. This is useful for a number of reasons, including:

  • Simplifying Exprs to make them easier to evaluate

  • Optimizing Exprs to make them faster to evaluate

  • Converting Exprs to other forms, e.g. converting a BinaryExpr to a CastExpr

In our example, we’ll use rewriting to update our add_one UDF, to be rewritten as a BinaryExpr with a Literal of 1. We’re effectively inlining the UDF.

Rewriting with transform

To implement the inlining, we’ll need to write a function that takes an Expr and returns a Result<Expr>. If the expression is not to be rewritten Transformed::no is used to wrap the original Expr. If the expression is to be rewritten, Transformed::yes is used to wrap the new Expr.

fn rewrite_add_one(expr: Expr) -> Result<Expr> {
    expr.transform(&|expr| {
        Ok(match expr {
            Expr::ScalarUDF(scalar_fun) if scalar_fun.fun.name == "add_one" => {
                let input_arg = scalar_fun.args[0].clone();
                let new_expression = input_arg + lit(1i64);

                Transformed::yes(new_expression)
            }
            _ => Transformed::no(expr),
        })
    })
}

Creating an OptimizerRule

In DataFusion, an OptimizerRule is a trait that supports rewritingExprs that appear in various parts of the LogicalPlan. It follows DataFusion’s general mantra of trait implementations to drive behavior.

We’ll call our rule AddOneInliner and implement the OptimizerRule trait. The OptimizerRule trait has two methods:

  • name - returns the name of the rule

  • try_optimize - takes a LogicalPlan and returns an Option<LogicalPlan>. If the rule is able to optimize the plan, it returns Some(LogicalPlan) with the optimized plan. If the rule is not able to optimize the plan, it returns None.

struct AddOneInliner {}

impl OptimizerRule for AddOneInliner {
    fn name(&self) -> &str {
        "add_one_inliner"
    }

    fn try_optimize(
        &self,
        plan: &LogicalPlan,
        config: &dyn OptimizerConfig,
    ) -> Result<Option<LogicalPlan>> {
        // Map over the expressions and rewrite them
        let new_expressions = plan
            .expressions()
            .into_iter()
            .map(|expr| rewrite_add_one(expr))
            .collect::<Result<Vec<_>>>()?;

        let inputs = plan.inputs().into_iter().cloned().collect::<Vec<_>>();

        let plan = plan.with_new_exprs(&new_expressions, &inputs);

        plan.map(Some)
    }
}

Note the use of rewrite_add_one which is mapped over plan.expressions() to rewrite the expressions, then plan.with_new_exprs is used to create a new LogicalPlan with the rewritten expressions.

We’re almost there. Let’s just test our rule works properly.

Testing the Rule

Testing the rule is fairly simple, we can create a SessionState with our rule and then create a DataFrame and run a query. The logical plan will be optimized by our rule.

use datafusion::prelude::*;

let rules = Arc::new(AddOneInliner {});
let state = ctx.state().with_optimizer_rules(vec![rules]);

let ctx = SessionContext::with_state(state);
ctx.register_udf(add_one);

let sql = "SELECT add_one(1) AS added_one";
let plan = ctx.sql(sql).await?.logical_plan();

println!("{:?}", plan);

This results in the following output:

Projection: Int64(1) + Int64(1) AS added_one
  EmptyRelation

I.e. the add_one UDF has been inlined into the projection.

Getting the data type of the expression

The arrow::datatypes::DataType of the expression can be obtained by calling the get_type given something that implements Expr::Schemable, for example a DFschema object:

use arrow_schema::DataType;
use datafusion::common::{DFField, DFSchema};
use datafusion::logical_expr::{col, ExprSchemable};
use std::collections::HashMap;

let expr = col("c1") + col("c2");
let schema = DFSchema::new_with_metadata(
    vec![
        DFField::new_unqualified("c1", DataType::Int32, true),
        DFField::new_unqualified("c2", DataType::Float32, true),
    ],
    HashMap::new(),
)
.unwrap();
print!("type = {}", expr.get_type(&schema).unwrap());

This results in the following output:

type = Float32

Conclusion

In this guide, we’ve seen how to create Exprs programmatically and how to rewrite them. This is useful for simplifying and optimizing Exprs. We’ve also seen how to test our rule to ensure it works properly.