<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Apache DataFusion Blog - Pepijn Van Eeckhoudt</title><link href="https://datafusion.apache.org/blog/" rel="alternate"/><link href="https://datafusion.apache.org/blog/feeds/pepijn-van-eeckhoudt.atom.xml" rel="self"/><id>https://datafusion.apache.org/blog/</id><updated>2026-02-02T00:00:00+00:00</updated><entry><title>Optimizing SQL CASE Expression Evaluation</title><link href="https://datafusion.apache.org/blog/2026/02/02/datafusion_case" rel="alternate"/><published>2026-02-02T00:00:00+00:00</published><updated>2026-02-02T00:00:00+00:00</updated><author><name>Pepijn Van Eeckhoudt</name></author><id>tag:datafusion.apache.org,2026-02-02:/blog/2026/02/02/datafusion_case</id><summary type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;style&gt;
figure {
  margin: 20px 0;
}

figure img {
  display: block;
  max-width: 80%;
  margin: auto;
}

figcaption {
  font-style: italic;
  color: #555;
  font-size: 0.9em;
  max-width: 80%;
  margin: auto;
  text-align: center;
}
&lt;/style&gt;
&lt;p&gt;SQL's &lt;code&gt;CASE&lt;/code&gt; expression is one of the few explicit conditional evaluation constructs the language provides.
It allows you to control which expression from a …&lt;/p&gt;</summary><content type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;style&gt;
figure {
  margin: 20px 0;
}

figure img {
  display: block;
  max-width: 80%;
  margin: auto;
}

figcaption {
  font-style: italic;
  color: #555;
  font-size: 0.9em;
  max-width: 80%;
  margin: auto;
  text-align: center;
}
&lt;/style&gt;
&lt;p&gt;SQL's &lt;code&gt;CASE&lt;/code&gt; expression is one of the few explicit conditional evaluation constructs the language provides.
It allows you to control which expression from a set of expressions is evaluated for each row based on arbitrary boolean expressions.
Its deceptively simple syntax hides significant implementation complexity.
Over the past few &lt;a href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; releases, a series of improvements to the &lt;code&gt;CASE&lt;/code&gt; expression evaluator have been merged that reduce both CPU time and memory allocations.
This post provides an overview of the original implementation, its performance bottlenecks, and the steps taken to address them.&lt;/p&gt;
&lt;h2 id="background-case-expression-evaluation"&gt;Background: CASE Expression Evaluation&lt;a class="headerlink" href="#background-case-expression-evaluation" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;SQL supports two forms of CASE expressions:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Simple&lt;/strong&gt;: &lt;code&gt;CASE expr WHEN value1 THEN result1 WHEN value2 THEN result2 ... END&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Searched&lt;/strong&gt;: &lt;code&gt;CASE WHEN condition1 THEN result1 WHEN condition2 THEN result2 ... END&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The simple form evaluates an expression once for each input row and then tests that value against the expressions (typically constants) in each &lt;code&gt;WHEN&lt;/code&gt; clause using equality comparisons.&lt;/p&gt;
&lt;p&gt;Here's an example of the simple form:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;CASE status
    WHEN 'pending' THEN 1
    WHEN 'active' THEN 2
    WHEN 'complete' THEN 3
    ELSE 0
END
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this &lt;code&gt;CASE&lt;/code&gt; expression, &lt;code&gt;status&lt;/code&gt; is evaluated once per row, and then its value is tested for equality with the values &lt;code&gt;'pending'&lt;/code&gt;, &lt;code&gt;'active'&lt;/code&gt;, and &lt;code&gt;'complete'&lt;/code&gt; in that order.
The &lt;code&gt;CASE&lt;/code&gt; expression evaluates to the value of the &lt;code&gt;THEN&lt;/code&gt; expression corresponding to the first matching &lt;code&gt;WHEN&lt;/code&gt; expression.&lt;/p&gt;
&lt;p&gt;The searched &lt;code&gt;CASE&lt;/code&gt; form is a more flexible variant.
It evaluates completely independent boolean expressions for each branch.
This allows you to test different columns with different operators per branch as shown in the following example:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;CASE
    WHEN age &amp;gt; 65 THEN 'senior'
    WHEN childCount != 0 THEN 'parent'
    WHEN age &amp;lt; 21 THEN 'minor'
    ELSE 'adult'
END
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In both forms, branches are evaluated sequentially with short-circuit semantics: for each row, once a &lt;code&gt;WHEN&lt;/code&gt; condition matches, the corresponding &lt;code&gt;THEN&lt;/code&gt; expression is evaluated.
Any further branches are not evaluated for that row.
This lazy evaluation model is critical for correctness.
It lets you safely write &lt;code&gt;CASE&lt;/code&gt; expressions like &lt;code&gt;CASE WHEN d != 0 THEN n / d ELSE NULL END&lt;/code&gt; that are guaranteed to not trigger divide-by-zero errors.&lt;/p&gt;
&lt;p&gt;Besides &lt;code&gt;CASE&lt;/code&gt;, there are a few &lt;a href="https://datafusion.apache.org/user-guide/sql/scalar_functions.html#conditional-functions"&gt;conditional scalar functions&lt;/a&gt; that provide similar, more restricted capabilities.
These include &lt;code&gt;COALESCE&lt;/code&gt;, &lt;code&gt;IFNULL&lt;/code&gt;, and &lt;code&gt;NVL2&lt;/code&gt;.
You can consider each of these functions as the equivalent of a macro for &lt;code&gt;CASE&lt;/code&gt;.
For example, &lt;code&gt;COALESCE(expr1, expr2, expr3)&lt;/code&gt; expands to:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;CASE
  WHEN expr1 IS NOT NULL THEN expr1
  WHEN expr2 IS NOT NULL THEN expr2
  ELSE expr3
END
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since &lt;a href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; rewrites these conditional functions to their equivalent &lt;code&gt;CASE&lt;/code&gt; expression, any optimizations related to &lt;code&gt;CASE&lt;/code&gt; described in this post also apply to conditional function evaluation.&lt;/p&gt;
&lt;h2 id="case-evaluation-in-datafusion-5000"&gt;&lt;code&gt;CASE&lt;/code&gt; Evaluation in DataFusion 50.0.0&lt;a class="headerlink" href="#case-evaluation-in-datafusion-5000" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;For the remainder of this post, we'll be looking at 'searched CASE' evaluation.
'Simple CASE' uses a distinct, but very similar implementation.
The same set of improvements has been applied to both.&lt;/p&gt;
&lt;p&gt;The baseline implementation in DataFusion 50.0.0 evaluated &lt;code&gt;CASE&lt;/code&gt; using a common, straightforward approach:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Start with an output array &lt;code&gt;out&lt;/code&gt; with the same length as the input batch, filled with nulls. Additionally, create a bit vector &lt;code&gt;remainder&lt;/code&gt; with the same length and each value set to &lt;code&gt;true&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;For each &lt;code&gt;WHEN&lt;/code&gt;/&lt;code&gt;THEN&lt;/code&gt; branch:&lt;ul&gt;
&lt;li&gt;Evaluate the &lt;code&gt;WHEN&lt;/code&gt; condition for the remaining unmatched rows using &lt;a href="https://docs.rs/datafusion/latest/datafusion/physical_expr/trait.PhysicalExpr.html#method.evaluate_selection"&gt;&lt;code&gt;PhysicalExpr::evaluate_selection&lt;/code&gt;&lt;/a&gt;, passing in the input batch and the &lt;code&gt;remainder&lt;/code&gt; mask.&lt;/li&gt;
&lt;li&gt;If any rows matched, evaluate the &lt;code&gt;THEN&lt;/code&gt; expression for those rows using &lt;code&gt;PhysicalExpr::evaluate_selection&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Merge the results into the &lt;code&gt;out&lt;/code&gt; array using the &lt;a href="https://docs.rs/arrow/latest/arrow/compute/kernels/zip/fn.zip.html"&gt;&lt;code&gt;zip&lt;/code&gt;&lt;/a&gt; kernel.&lt;/li&gt;
&lt;li&gt;Update the &lt;code&gt;remainder&lt;/code&gt; mask to exclude the matched rows.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;If there's an &lt;code&gt;ELSE&lt;/code&gt; clause, evaluate it for any remaining unmatched rows and merge using &lt;a href="https://docs.rs/arrow/latest/arrow/compute/kernels/zip/fn.zip.html"&gt;&lt;code&gt;zip&lt;/code&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Here's a simplified version of the Rust code for the original loop:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;let mut out = new_null_array(&amp;amp;return_type, batch.num_rows());
let mut remainder = BooleanArray::from(vec![true; batch.num_rows()]);

for (when_expr, then_expr) in &amp;amp;self.when_then_expr {
    // Determine for which remaining rows the WHEN condition matches
    let when = when_expr.evaluate_selection(batch, &amp;amp;remainder)?
        .into_array(batch.num_rows())?;
    // Ensure any `NULL` values are treated as false
    let when_and_rem = and(&amp;amp;when, &amp;amp;remainder)?;

    if when_and_rem.true_count() == 0 {
        continue;
    }

    // Evaluate the THEN expression for matching rows
    let then_value = then_expr.evaluate_selection(batch, &amp;amp;when_and_rem)?;
    // Merge results into output array
    out = zip(&amp;amp;when_and_rem, &amp;amp;then_value, &amp;amp;out)?;
    // Update remainder mask to exclude matched rows
    remainder = and_not(&amp;amp;remainder, &amp;amp;when_and_rem)?;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let's examine one iteration of this loop for the following &lt;code&gt;CASE&lt;/code&gt; expression:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;CASE
    WHEN col = 'b' THEN 100
    ELSE 200
END
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Schematically, it will look as follows:&lt;/p&gt;
&lt;figure&gt;
&lt;img alt="Schematic representation of data flow in the original CASE implementation" class="img-fluid" src="/blog/images/case/original_loop.svg" width="100%"/&gt;
&lt;figcaption&gt;One iteration of the `CASE` evaluation loop&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;This implementation works perfectly fine, but there's significant room for optimization, mostly related to the usage of &lt;code&gt;evaluate_selection&lt;/code&gt;.
To understand why, we need to dig a little deeper into the implementation of that function.
Here's a simplified version of it that captures the relevant parts:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;pub trait PhysicalExpr {
    fn evaluate_selection(
        &amp;amp;self,
        batch: &amp;amp;RecordBatch,
        selection: &amp;amp;BooleanArray,
    ) -&amp;gt; Result&amp;lt;ColumnarValue&amp;gt; {
        // Reduce record batch to only include rows that match selection
        let filtered_batch = filter_record_batch(batch, selection)?;
        // Perform regular evaluation on filtered batch
        let filtered_result = self.evaluate(&amp;amp;filtered_batch)?;
        // Expand result array to match original batch length
        scatter(selection, filtered_result)
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Going back to the same example as before, the data flow in &lt;code&gt;evaluate_selection&lt;/code&gt; looks like this:&lt;/p&gt;
&lt;figure&gt;
&lt;img alt="Schematic representation of `evaluate_selection` evaluation" class="img-fluid" src="/blog/images/case/evaluate_selection.svg" width="100%"/&gt;
&lt;figcaption&gt;evaluate_selection data flow&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;The &lt;code&gt;evaluate_selection&lt;/code&gt; method first filters the input batch to only include rows that match the &lt;code&gt;selection&lt;/code&gt; mask.
It then calls the regular &lt;code&gt;evaluate&lt;/code&gt; method using the filtered batch as input.
Finally, to return a result array with the same number of rows as &lt;code&gt;batch&lt;/code&gt;, the &lt;code&gt;scatter&lt;/code&gt; function is called.
This function produces a new array padded with &lt;code&gt;null&lt;/code&gt; values for any rows that didn't match the &lt;code&gt;selection&lt;/code&gt; mask.&lt;/p&gt;
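&lt;p&gt;To make the scatter step concrete, here is a minimal, illustrative sketch using plain &lt;code&gt;Vec&lt;/code&gt;s and &lt;code&gt;Option&lt;/code&gt;s in place of Arrow arrays and null buffers; the function name mirrors, but is not, the actual Arrow kernel:&lt;/p&gt;

```rust
// Illustrative model of `scatter`: rows not selected by the mask become
// None (null); selected rows consume values from the compact filtered
// result in order, restoring the original batch length.
fn scatter(selection: &[bool], compact: &[i64]) -> Vec<Option<i64>> {
    let mut compact_iter = compact.iter();
    selection
        .iter()
        .map(|&selected| {
            if selected {
                // Take the next value from the compact result
                compact_iter.next().copied()
            } else {
                // Row was filtered out, pad with null
                None
            }
        })
        .collect()
}

fn main() {
    // Mask selects rows 0 and 2; the compact result holds only their values.
    let out = scatter(&[true, false, true], &[10, 20]);
    assert_eq!(out, vec![Some(10), None, Some(20)]);
}
```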
&lt;p&gt;So how can we improve the performance of the simple evaluation strategy and use of &lt;code&gt;evaluate_selection&lt;/code&gt;?&lt;/p&gt;
&lt;h3 id="opportunity-1-early-exit"&gt;Opportunity 1: Early Exit&lt;a class="headerlink" href="#opportunity-1-early-exit" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;CASE&lt;/code&gt; evaluation loop always iterates through all branches, even when every row has already been matched.
In queries where early branches match all rows, this results in unnecessary work being done for the remaining branches.&lt;/p&gt;
&lt;h3 id="opportunity-2-optimize-repeated-filtering-scattering-and-merging"&gt;Opportunity 2: Optimize Repeated Filtering, Scattering, and Merging&lt;a class="headerlink" href="#opportunity-2-optimize-repeated-filtering-scattering-and-merging" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Each iteration performs a number of operations that are very well-optimized, but still take up a significant amount of CPU time:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Filtering&lt;/strong&gt;: &lt;code&gt;PhysicalExpr::evaluate_selection&lt;/code&gt; filters the entire &lt;code&gt;RecordBatch&lt;/code&gt; for each branch. For the &lt;code&gt;WHEN&lt;/code&gt; expression, this is done even if the selection mask was entirely empty.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scattering&lt;/strong&gt;: &lt;code&gt;PhysicalExpr::evaluate_selection&lt;/code&gt; scatters the filtered result back to the original &lt;code&gt;RecordBatch&lt;/code&gt; length.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Merging&lt;/strong&gt;: The &lt;code&gt;zip&lt;/code&gt; kernel is called once per branch to merge partial results into the output array.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each of these operations needs to allocate memory for new arrays and shuffle quite a bit of data around.&lt;/p&gt;
&lt;h3 id="opportunity-3-filter-only-necessary-columns"&gt;Opportunity 3: Filter only Necessary Columns&lt;a class="headerlink" href="#opportunity-3-filter-only-necessary-columns" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;PhysicalExpr::evaluate_selection&lt;/code&gt; method filters the entire record batch, including columns that the current branch's &lt;code&gt;WHEN&lt;/code&gt; and &lt;code&gt;THEN&lt;/code&gt; expressions don't reference.
For wide tables (many columns) with narrow expressions (few column references), this is wasteful.&lt;/p&gt;
&lt;p&gt;Suppose you have a table with 26 columns named &lt;code&gt;a&lt;/code&gt; through &lt;code&gt;z&lt;/code&gt;, and the following &lt;code&gt;CASE&lt;/code&gt; expression:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;CASE
  WHEN a &amp;gt; 1000 THEN 'large'
  WHEN a &amp;gt;= 0 THEN 'positive'
  ELSE 'negative'
END
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The implementation would filter all 26 columns even though only a single column is needed for the entire &lt;code&gt;CASE&lt;/code&gt; expression evaluation.
Again this involves a non-negligible amount of allocation and data copying.&lt;/p&gt;
&lt;h2 id="performance-optimizations"&gt;Performance Optimizations&lt;a class="headerlink" href="#performance-optimizations" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id="optimization-1-short-circuit-early-exit"&gt;Optimization 1: Short-Circuit Early Exit&lt;a class="headerlink" href="#optimization-1-short-circuit-early-exit" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The first optimization is straightforward.
As soon as we detect that all rows of the batch have been matched, we break out of the evaluation loop:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;let mut remainder_count = batch.num_rows();

for (when_expr, then_expr) in &amp;amp;self.when_then_expr {
    if remainder_count == 0 {
        break;  // All rows matched, exit early
    }

    // ... evaluate branch ...

    let when_match_count = when_value.true_count();
    remainder_count -= when_match_count;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Additionally, we avoid evaluating the &lt;code&gt;ELSE&lt;/code&gt; clause when no rows remain:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;if let Some(else_expr) = &amp;amp;self.else_expr {
    // Rows whose base value was NULL never match a WHEN clause,
    // so they also fall through to the ELSE branch
    remainder = or(&amp;amp;base_nulls, &amp;amp;remainder)?;
    if remainder.true_count() &amp;gt; 0 {
        // ... evaluate else ...
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For queries where early branches match all rows, this eliminates unnecessary branch evaluations and &lt;code&gt;ELSE&lt;/code&gt; clause processing.&lt;/p&gt;
&lt;p&gt;This optimization was implemented by Pepijn Van Eeckhoudt (&lt;a href="https://github.com/pepijnve"&gt;&lt;code&gt;@pepijnve&lt;/code&gt;&lt;/a&gt;) in &lt;a href="https://github.com/apache/datafusion/pull/17898"&gt;PR #17898&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="optimization-2-optimized-result-merging"&gt;Optimization 2: Optimized Result Merging&lt;a class="headerlink" href="#optimization-2-optimized-result-merging" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The second optimization fundamentally restructures how the results of each loop iteration are merged.
The diagram below illustrates the optimized data flow when evaluating the &lt;code&gt;CASE WHEN col = 'b' THEN 100 ELSE 200 END&lt;/code&gt; from before:&lt;/p&gt;
&lt;figure&gt;
&lt;img alt="Schematic representation of optimized evaluation loop" class="img-fluid" src="/blog/images/case/merging.svg" width="100%"/&gt;
&lt;figcaption&gt;optimized evaluation loop&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;In the reworked implementation, the &lt;code&gt;evaluate_selection&lt;/code&gt; function is no longer used.
The key insight is that we can defer all merging until the end of the evaluation loop by tracking result provenance.
This was implemented with the following changes:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Augment the input batch with a column containing row indices.&lt;/li&gt;
&lt;li&gt;Reduce the augmented batch after each loop iteration to only contain the remaining rows.&lt;/li&gt;
&lt;li&gt;Use the row index column to track which partial result array contains the value for each row.&lt;/li&gt;
&lt;li&gt;Perform a single merge operation at the end instead of a &lt;code&gt;zip&lt;/code&gt; operation after each loop iteration.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These changes make it unnecessary to &lt;code&gt;scatter&lt;/code&gt; and &lt;code&gt;zip&lt;/code&gt; results in each loop iteration.
Instead, when all rows have been matched, we then merge the partial results using &lt;a href="https://docs.rs/arrow-select/57.1.0/arrow_select/merge/fn.merge_n.html"&gt;&lt;code&gt;arrow_select::merge::merge_n&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The diagram below illustrates how &lt;code&gt;merge_n&lt;/code&gt; works for an example where three &lt;code&gt;WHEN/THEN&lt;/code&gt; branches produced results.
The first branch produced the result &lt;code&gt;A&lt;/code&gt; for row 2, the second produced &lt;code&gt;B&lt;/code&gt; for row 1, and the third produced &lt;code&gt;C&lt;/code&gt; and &lt;code&gt;D&lt;/code&gt; for rows 4 and 5.&lt;/p&gt;
&lt;figure&gt;
&lt;img alt="Schematic illustration of the merge_n algorithm" class="img-fluid" src="/blog/images/case/merge_n.svg" width="100%"/&gt;
&lt;figcaption&gt;merge_n example&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;The &lt;code&gt;merge_n&lt;/code&gt; algorithm scans through the indices array.
For each non-empty cell, it takes one value from the corresponding values array.
In the example above, we first encounter &lt;code&gt;1&lt;/code&gt;.
This takes the first element from the values array with index &lt;code&gt;1&lt;/code&gt;, resulting in &lt;code&gt;B&lt;/code&gt;.
The next cell contains &lt;code&gt;0&lt;/code&gt;, which takes &lt;code&gt;A&lt;/code&gt; from the first array.
Finally, we encounter &lt;code&gt;2&lt;/code&gt; twice.
This takes the first and second element from the last values array respectively.&lt;/p&gt;
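&lt;p&gt;The scan described above can be sketched with plain Rust collections; this is an illustrative model of the algorithm, not the actual &lt;code&gt;arrow-rs&lt;/code&gt; implementation, which operates on Arrow arrays and null buffers:&lt;/p&gt;

```rust
// Illustrative model of the merge_n idea: `indices[i]` records which
// partial-result array row i's value came from (None = no branch matched,
// so the output stays null). Each partial array is consumed front to back.
fn merge_n(indices: &[Option<usize>], partials: &[Vec<char>]) -> Vec<Option<char>> {
    // One read cursor per partial-result array
    let mut cursors = vec![0usize; partials.len()];
    indices
        .iter()
        .map(|idx| {
            idx.map(|i| {
                // Take the next unconsumed value from partial array i
                let value = partials[i][cursors[i]];
                cursors[i] += 1;
                value
            })
        })
        .collect()
}

fn main() {
    // Branch 0 produced A (row 2), branch 1 produced B (row 1),
    // branch 2 produced C and D (rows 4 and 5); row 3 matched no branch.
    let indices = [Some(1), Some(0), None, Some(2), Some(2)];
    let partials = vec![vec!['A'], vec!['B'], vec!['C', 'D']];
    let merged = merge_n(&indices, &partials);
    assert_eq!(merged, vec![Some('B'), Some('A'), None, Some('C'), Some('D')]);
}
```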
&lt;p&gt;This algorithm was initially implemented in DataFusion for the &lt;code&gt;CASE&lt;/code&gt; implementation, but in the meantime has been generalized and moved into the &lt;code&gt;arrow-rs&lt;/code&gt; crate as &lt;a href="https://docs.rs/arrow-select/57.1.0/arrow_select/merge/fn.merge_n.html"&gt;&lt;code&gt;arrow_select::merge::merge_n&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This optimization was implemented by Pepijn Van Eeckhoudt (&lt;a href="https://github.com/pepijnve"&gt;&lt;code&gt;@pepijnve&lt;/code&gt;&lt;/a&gt;) in &lt;a href="https://github.com/apache/datafusion/pull/18152"&gt;PR #18152&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="optimization-3-column-projection"&gt;Optimization 3: Column Projection&lt;a class="headerlink" href="#optimization-3-column-projection" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The third optimization addresses the "filtering unused columns" overhead through projection.&lt;/p&gt;
&lt;p&gt;Look at the following query example where the &lt;code&gt;mailing_address&lt;/code&gt; table has the columns &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;surname&lt;/code&gt;, &lt;code&gt;street&lt;/code&gt;, &lt;code&gt;number&lt;/code&gt;, &lt;code&gt;city&lt;/code&gt;, &lt;code&gt;state&lt;/code&gt;, &lt;code&gt;country&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;SELECT *, CASE WHEN country = 'USA' THEN state ELSE country END AS region
FROM mailing_address 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can see that the &lt;code&gt;CASE&lt;/code&gt; expression only references the columns &lt;code&gt;country&lt;/code&gt; and &lt;code&gt;state&lt;/code&gt;, but because all columns are being queried, projection pushdown cannot reduce the number of columns fed into the projection operator.&lt;/p&gt;
&lt;figure&gt;
&lt;img alt="Schematic illustration of CASE evaluation without projection" class="img-fluid" src="/blog/images/case/no_projection.svg" width="100%"/&gt;
&lt;figcaption&gt;CASE evaluation without projection&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;During &lt;code&gt;CASE&lt;/code&gt; evaluation, the batch must be filtered using the &lt;code&gt;WHEN&lt;/code&gt; expression to evaluate the &lt;code&gt;THEN&lt;/code&gt; expression values.
As the diagram above shows, this filtering creates a reduced copy of all columns.&lt;/p&gt;
&lt;p&gt;This unnecessary copying can be avoided by first narrowing the batch to only include the columns that are actually needed.&lt;/p&gt;
&lt;figure&gt;
&lt;img alt="Schematic illustration of CASE evaluation with projection" class="img-fluid" src="/blog/images/case/projection.svg" width="100%"/&gt;
&lt;figcaption&gt;CASE evaluation with projection&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;At first glance, this might not seem beneficial, since we're introducing an additional processing step.
Luckily, projecting a record batch only requires a shallow copy.
The column arrays themselves are not copied, and the only work that is actually done is incrementing the reference counts of the columns.&lt;/p&gt;
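&lt;p&gt;The reference-counting behaviour can be illustrated with a toy batch type built on &lt;code&gt;std::sync::Arc&lt;/code&gt;; the &lt;code&gt;Batch&lt;/code&gt; and &lt;code&gt;project&lt;/code&gt; names below are hypothetical stand-ins for the actual &lt;code&gt;arrow-rs&lt;/code&gt; types, not their real API:&lt;/p&gt;

```rust
use std::sync::Arc;

// Toy model of a record batch: columns are Arc-shared, so projecting
// just clones the Arc pointers (bumps refcounts) without copying row data.
struct Batch {
    columns: Vec<Arc<Vec<i64>>>,
}

fn project(batch: &Batch, indices: &[usize]) -> Batch {
    Batch {
        // Arc::clone copies a pointer, not the underlying column data
        columns: indices
            .iter()
            .map(|&i| Arc::clone(&batch.columns[i]))
            .collect(),
    }
}

fn main() {
    // A "wide" batch with 26 columns of 1,000 rows each
    let wide = Batch {
        columns: (0..26).map(|c| Arc::new(vec![c; 1_000])).collect(),
    };
    // Keep only the first column (`a` in the example above)
    let narrow = project(&wide, &[0]);
    // Both batches share the same underlying buffer for that column
    assert!(Arc::ptr_eq(&wide.columns[0], &narrow.columns[0]));
    assert_eq!(narrow.columns.len(), 1);
}
```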
&lt;p&gt;&lt;strong&gt;Impact&lt;/strong&gt;: For wide tables with narrow CASE expressions, this dramatically reduces filtering overhead by removing the copying of unused columns.&lt;/p&gt;
&lt;p&gt;This optimization was implemented by Pepijn Van Eeckhoudt (&lt;a href="https://github.com/pepijnve"&gt;&lt;code&gt;@pepijnve&lt;/code&gt;&lt;/a&gt;) in &lt;a href="https://github.com/apache/datafusion/pull/18329"&gt;PR #18329&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="optimization-4-eliminating-scatter-in-two-branch-case"&gt;Optimization 4: Eliminating Scatter in Two-Branch Case&lt;a class="headerlink" href="#optimization-4-eliminating-scatter-in-two-branch-case" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Some of the earlier examples in this post use expressions of the form &lt;code&gt;CASE WHEN condition THEN expr1 ELSE expr2 END&lt;/code&gt; to explain how the general evaluation loop works.
For this kind of two-branch &lt;code&gt;CASE&lt;/code&gt; expression, &lt;a href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; has a more optimized implementation that unrolls the loop.
This specialized &lt;code&gt;ExpressionOrExpression&lt;/code&gt; fast path still used &lt;code&gt;evaluate_selection()&lt;/code&gt; for both branches, which relies on &lt;code&gt;scatter&lt;/code&gt; and &lt;code&gt;zip&lt;/code&gt; to combine the results, incurring the same performance overhead as the general implementation.&lt;/p&gt;
&lt;p&gt;The revised implementation eliminates the use of &lt;code&gt;evaluate_selection&lt;/code&gt; as follows:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;// Compute the `WHEN` condition for the entire batch
let when_filter = create_filter(&amp;amp;when_value);

// Compute a compact array of `THEN` values for the matching rows
let then_batch = filter_record_batch(batch, &amp;amp;when_filter)?;
let then_value = then_expr.evaluate(&amp;amp;then_batch)?;

// Compute a compact array of `ELSE` values for the non-matching rows
let else_filter = create_filter(&amp;amp;not(&amp;amp;when_value)?);
let else_batch = filter_record_batch(batch, &amp;amp;else_filter)?;
let else_value = else_expr.evaluate(&amp;amp;else_batch)?;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This produces two compact arrays, one for the THEN values and one for the ELSE values, which are then merged with the &lt;code&gt;merge&lt;/code&gt; function.
In contrast to &lt;code&gt;zip&lt;/code&gt;, &lt;code&gt;merge&lt;/code&gt; does not require both of its value inputs to have the same length.
Instead it requires that the sum of the length of the value inputs matches the length of the mask array.&lt;/p&gt;
&lt;figure&gt;
&lt;img alt="Schematic illustration of the merge algorithm" class="img-fluid" src="/blog/images/case/merge.svg" width="100%"/&gt;
&lt;figcaption&gt;merge example&lt;/figcaption&gt;
&lt;/figure&gt;
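&lt;p&gt;A minimal sketch of this two-input merge, using plain slices instead of Arrow arrays, looks as follows (illustrative only; the real kernel is &lt;a href="https://docs.rs/arrow-select/57.1.0/arrow_select/merge/fn.merge.html"&gt;&lt;code&gt;arrow_select::merge::merge&lt;/code&gt;&lt;/a&gt;):&lt;/p&gt;

```rust
// Illustrative model of the two-input merge: the mask has one entry per
// output row; true rows consume the next THEN value, false rows the next
// ELSE value. Unlike zip, the value inputs are compact: their lengths sum
// to the mask length rather than each matching it.
fn merge(mask: &[bool], then_values: &[i64], else_values: &[i64]) -> Vec<i64> {
    assert_eq!(then_values.len() + else_values.len(), mask.len());
    let mut then_iter = then_values.iter();
    let mut else_iter = else_values.iter();
    mask.iter()
        .map(|&matched| {
            if matched {
                *then_iter.next().unwrap()
            } else {
                *else_iter.next().unwrap()
            }
        })
        .collect()
}

fn main() {
    // CASE WHEN col = 'b' THEN 100 ELSE 200 END over a 4-row batch
    // where rows 0 and 2 matched the WHEN condition.
    let out = merge(&[true, false, true, false], &[100, 100], &[200, 200]);
    assert_eq!(out, vec![100, 200, 100, 200]);
}
```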
&lt;p&gt;This eliminates unnecessary &lt;code&gt;scatter&lt;/code&gt; operations and memory allocations for one of the most common &lt;code&gt;CASE&lt;/code&gt; expression patterns.&lt;/p&gt;
&lt;p&gt;Just like &lt;code&gt;merge_n&lt;/code&gt;, this operation has been moved into &lt;code&gt;arrow-rs&lt;/code&gt; as &lt;a href="https://docs.rs/arrow-select/57.1.0/arrow_select/merge/fn.merge.html"&gt;&lt;code&gt;arrow_select::merge::merge&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This optimization was implemented by Pepijn Van Eeckhoudt (&lt;a href="https://github.com/pepijnve"&gt;&lt;code&gt;@pepijnve&lt;/code&gt;&lt;/a&gt;) in &lt;a href="https://github.com/apache/datafusion/pull/18444"&gt;PR #18444&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="optimization-5-table-lookup-of-constants"&gt;Optimization 5: Table Lookup of Constants&lt;a class="headerlink" href="#optimization-5-table-lookup-of-constants" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Up until now, we've discussed the implementations for generic &lt;code&gt;CASE&lt;/code&gt; expressions that use non-constant expressions for both &lt;code&gt;WHEN&lt;/code&gt; and &lt;code&gt;THEN&lt;/code&gt;.
Another common use of &lt;code&gt;CASE&lt;/code&gt; is to perform a mapping from one set of constants to another.
For instance, you can expand numeric constants to human-readable strings using the following &lt;code&gt;CASE&lt;/code&gt; expression:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;CASE status
  WHEN 0 THEN 'idle'
  WHEN 1 THEN 'running'
  WHEN 2 THEN 'paused'
  WHEN 3 THEN 'stopped'
  ELSE 'unknown'
END
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A final &lt;code&gt;CASE&lt;/code&gt; optimization recognizes this pattern and compiles the &lt;code&gt;CASE&lt;/code&gt; expression into a hash table.
Rather than evaluating the &lt;code&gt;WHEN&lt;/code&gt; and &lt;code&gt;THEN&lt;/code&gt; expressions, the input expression is evaluated once, and the result array is computed using a vectorized hash table lookup.
This approach entirely avoids filtering the input batch and combining partial results.
The result array is computed in a single pass over the input values, and the computation time does not grow significantly with the number of &lt;code&gt;WHEN&lt;/code&gt; branches in the &lt;code&gt;CASE&lt;/code&gt; expression.&lt;/p&gt;
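&lt;p&gt;Conceptually, the fast path behaves like the following sketch, where a plain &lt;code&gt;HashMap&lt;/code&gt; stands in for DataFusion's vectorized lookup table (the function name and signature are illustrative, not the actual implementation):&lt;/p&gt;

```rust
use std::collections::HashMap;

// Illustrative model of the constant-mapping fast path: the WHEN/THEN
// constants are compiled into a hash table once, and the input column is
// then mapped in a single pass. Per-row cost does not grow with the
// number of WHEN branches.
fn eval_case_lookup(
    input: &[i64],
    table: &HashMap<i64, &'static str>,
    else_value: &'static str,
) -> Vec<&'static str> {
    input
        .iter()
        .map(|v| *table.get(v).unwrap_or(&else_value))
        .collect()
}

fn main() {
    // CASE status WHEN 0 THEN 'idle' WHEN 1 THEN 'running'
    //             WHEN 2 THEN 'paused' WHEN 3 THEN 'stopped' ELSE 'unknown' END
    let table: HashMap<i64, &'static str> =
        [(0, "idle"), (1, "running"), (2, "paused"), (3, "stopped")].into();
    let out = eval_case_lookup(&[1, 3, 7, 0], &table, "unknown");
    assert_eq!(out, vec!["running", "stopped", "unknown", "idle"]);
}
```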
&lt;p&gt;This optimization was implemented by Raz Luvaton (&lt;a href="https://github.com/rluvaton"&gt;&lt;code&gt;@rluvaton&lt;/code&gt;&lt;/a&gt;) in &lt;a href="https://github.com/apache/datafusion/pull/18183"&gt;PR #18183&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="results"&gt;Results&lt;a class="headerlink" href="#results" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The degree to which the performance optimizations described in this post will benefit your queries is highly dependent on both your data and your queries.
To give some idea of the impact, we ran the following query on the TPC-H &lt;code&gt;orders&lt;/code&gt; table with a scale factor of 100:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;SELECT
    *,
    case o_orderstatus
        when 'O' then 'ordered'
        when 'F' then 'filled'
        when 'P' then 'pending'
        else 'other'
    end
from orders
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This query was first run with DataFusion 50.0.0 to get a baseline measurement.
The same query was then run with each optimization applied in turn.
The recorded times are presented as the blue series in the chart below.
The green series shows the time measurement for &lt;code&gt;SELECT * FROM orders&lt;/code&gt; to give an idea of the cost that adding a &lt;code&gt;CASE&lt;/code&gt; expression to a query incurs.
All measurements were made with a target partition count of &lt;code&gt;1&lt;/code&gt;.&lt;/p&gt;
&lt;figure&gt;
&lt;img alt="Performance measurements chart" class="img-fluid" src="/blog/images/case/results.png" width="100%"/&gt;
&lt;figcaption&gt;Performance measurements&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;The chart shows that the effects of the various optimizations compound up to the &lt;code&gt;project&lt;/code&gt; measurement.
Up to that point these results are applicable to any &lt;code&gt;CASE&lt;/code&gt; expression.
The final improvement in the &lt;code&gt;hash&lt;/code&gt; measurement is only applicable to simple &lt;code&gt;CASE&lt;/code&gt; expressions with constant &lt;code&gt;WHEN&lt;/code&gt; and &lt;code&gt;THEN&lt;/code&gt; expressions.&lt;/p&gt;
&lt;p&gt;The cumulative effect of these optimizations is a 63-71% reduction in CPU time spent evaluating &lt;code&gt;CASE&lt;/code&gt; expressions compared to the baseline.&lt;/p&gt;
&lt;h2 id="summary"&gt;Summary&lt;a class="headerlink" href="#summary" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Through a number of targeted optimizations, we've transformed &lt;code&gt;CASE&lt;/code&gt; expression evaluation from a simple but unoptimized implementation into a highly optimized one.
The optimizations described in this post compound: a &lt;code&gt;CASE&lt;/code&gt; expression on a wide table with multiple branches and early matches benefits from all four optimizations simultaneously.
The result is significantly reduced CPU time and memory allocation in SQL constructs that are essential for ETL-like queries.&lt;/p&gt;
&lt;h2 id="about-datafusion"&gt;About DataFusion&lt;a class="headerlink" href="#about-datafusion" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; is an extensible query engine, written in &lt;a href="https://www.rust-lang.org/"&gt;Rust&lt;/a&gt;, that uses &lt;a href="https://arrow.apache.org"&gt;Apache Arrow&lt;/a&gt; as its in-memory format. DataFusion is used by developers to create new, fast, data-centric systems such as databases, dataframe libraries,
and machine learning and streaming applications.
While &lt;a href="https://datafusion.apache.org/user-guide/introduction.html#project-goals"&gt;DataFusion’s primary design goal&lt;/a&gt; is to accelerate the creation of other data-centric systems, it provides a reasonable experience directly out of the box as a &lt;a href="https://datafusion.apache.org/user-guide/dataframe.html"&gt;dataframe library&lt;/a&gt;, &lt;a href="https://datafusion.apache.org/python/"&gt;Python library&lt;/a&gt;, and &lt;a href="https://datafusion.apache.org/user-guide/cli/"&gt;command-line SQL tool&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;DataFusion's core thesis is that, as a community, together we can build much more advanced technology than any of us as individuals or companies could build alone.
Without DataFusion, highly performant vectorized query engines would remain the domain of a few large companies and world-class research institutions.
With DataFusion, we can all build on top of a shared foundation and focus on what makes our projects unique.&lt;/p&gt;
&lt;h2 id="how-to-get-involved"&gt;How to Get Involved&lt;a class="headerlink" href="#how-to-get-involved" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;DataFusion is not a project built or driven by a single person, company, or foundation.
Rather, our community of users and contributors works together to build a shared technology that none of us could have built alone.&lt;/p&gt;
&lt;p&gt;If you are interested in joining us, we would love to have you. You can try out DataFusion on some of your own data and projects and let us know how it goes, contribute suggestions, documentation, bug reports, or a PR with documentation, tests, or code.
A list of open issues suitable for beginners is &lt;a href="https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22"&gt;here&lt;/a&gt;, and you can find out how to reach us on the &lt;a href="https://datafusion.apache.org/contributor-guide/communication.html"&gt;communication doc&lt;/a&gt;.&lt;/p&gt;</content><category term="blog"/></entry><entry><title>Using Rust async for Query Execution and Cancelling Long-Running Queries</title><link href="https://datafusion.apache.org/blog/2025/06/30/cancellation" rel="alternate"/><published>2025-06-30T00:00:00+00:00</published><updated>2025-06-30T00:00:00+00:00</updated><author><name>Pepijn Van Eeckhoudt</name></author><id>tag:datafusion.apache.org,2025-06-30:/blog/2025/06/30/cancellation</id><summary type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;style&gt;
figure {
  margin: 20px 0;
}

figure img {
  display: block;
  max-width: 80%;
  margin: auto;
}

figcaption {
  font-style: italic;
  color: #555;
  font-size: 0.9em;
  max-width: 80%;
  margin: auto;
  text-align: center;
}
&lt;/style&gt;
&lt;p&gt;Have you ever tried to cancel a query that just wouldn't stop?
In this post, we'll review how Rust's &lt;a href="https://doc.rust-lang.org/book/ch17-00-async-await.html"&gt;&lt;code&gt;async&lt;/code&gt; programming model&lt;/a&gt; works, how …&lt;/p&gt;</summary><content type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;style&gt;
figure {
  margin: 20px 0;
}

figure img {
  display: block;
  max-width: 80%;
  margin: auto;
}

figcaption {
  font-style: italic;
  color: #555;
  font-size: 0.9em;
  max-width: 80%;
  margin: auto;
  text-align: center;
}
&lt;/style&gt;
&lt;p&gt;Have you ever tried to cancel a query that just wouldn't stop?
In this post, we'll review how Rust's &lt;a href="https://doc.rust-lang.org/book/ch17-00-async-await.html"&gt;&lt;code&gt;async&lt;/code&gt; programming model&lt;/a&gt; works, how &lt;a href="https://datafusion.apache.org/"&gt;DataFusion&lt;/a&gt; uses that model for CPU intensive tasks, and how this is used to cancel queries.
Then we'll review some cases where queries could not be canceled in DataFusion and what the community did to resolve the problem.&lt;/p&gt;
&lt;h2 id="understanding-rusts-async-model"&gt;Understanding Rust's Async Model&lt;a class="headerlink" href="#understanding-rusts-async-model" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;DataFusion, somewhat unconventionally, &lt;a href="https://docs.rs/datafusion/latest/datafusion/#thread-scheduling-cpu--io-thread-pools-and-tokio-runtimes"&gt;uses the Rust async system and the Tokio task scheduler&lt;/a&gt; for CPU intensive processing.
To really understand the cancellation problem, you first need to be familiar with Rust's asynchronous programming model, which differs a bit from what you might be used to from other ecosystems.
Let's go over the basics again as a refresher.
If you're familiar with the ins and outs of &lt;code&gt;Future&lt;/code&gt; and &lt;code&gt;async&lt;/code&gt; you can skip this section.&lt;/p&gt;
&lt;h3 id="futures-are-inert"&gt;Futures Are Inert&lt;a class="headerlink" href="#futures-are-inert" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Rust's asynchronous programming model is built around the &lt;a href="https://doc.rust-lang.org/std/future/trait.Future.html"&gt;&lt;code&gt;Future&amp;lt;T&amp;gt;&lt;/code&gt;&lt;/a&gt; trait.
In contrast to, for instance, JavaScript's &lt;a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Promise"&gt;&lt;code&gt;Promise&lt;/code&gt;&lt;/a&gt; or Java's &lt;a href="https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/util/concurrent/Future.html"&gt;&lt;code&gt;Future&lt;/code&gt;&lt;/a&gt;, a Rust &lt;code&gt;Future&lt;/code&gt; does not necessarily represent an actively running asynchronous job.
Instead, a &lt;code&gt;Future&amp;lt;T&amp;gt;&lt;/code&gt; represents a lazy calculation that only makes progress when explicitly asked to do so.
This is done by calling the &lt;a href="https://doc.rust-lang.org/std/future/trait.Future.html#tymethod.poll"&gt;&lt;code&gt;poll&lt;/code&gt;&lt;/a&gt; method of a &lt;code&gt;Future&lt;/code&gt;.
If nobody polls a &lt;code&gt;Future&lt;/code&gt; explicitly, it is &lt;a href="https://doc.rust-lang.org/std/future/trait.Future.html#runtime-characteristics"&gt;an inert object&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Calling &lt;code&gt;Future::poll&lt;/code&gt; results in one of two options:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://doc.rust-lang.org/std/task/enum.Poll.html#variant.Pending"&gt;&lt;code&gt;Poll::Pending&lt;/code&gt;&lt;/a&gt; if the evaluation is not yet complete, most often because it needs to wait for something like I/O before it can continue&lt;/li&gt;
&lt;li&gt;&lt;a href="https://doc.rust-lang.org/std/task/enum.Poll.html#variant.Ready"&gt;&lt;code&gt;Poll::Ready&amp;lt;T&amp;gt;&lt;/code&gt;&lt;/a&gt; when it has completed and produced a value&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When a &lt;code&gt;Future&lt;/code&gt; returns &lt;code&gt;Pending&lt;/code&gt;, it saves its internal state so it can pick up where it left off the next time you poll it.
This internal state management makes Rust's &lt;code&gt;Future&lt;/code&gt;s memory-efficient and composable.
Rather than freezing the full call stack leading up to a certain point, only the state needed to resume the future has to be retained.&lt;/p&gt;
&lt;p&gt;Additionally, a &lt;code&gt;Future&lt;/code&gt; must set up the necessary signaling to notify the caller when it should call &lt;code&gt;poll&lt;/code&gt; again, to avoid a busy-waiting loop.
This is done using a &lt;a href="https://doc.rust-lang.org/std/task/struct.Waker.html"&gt;&lt;code&gt;Waker&lt;/code&gt;&lt;/a&gt; which the &lt;code&gt;Future&lt;/code&gt; receives via the &lt;code&gt;Context&lt;/code&gt; parameter of the &lt;code&gt;poll&lt;/code&gt; function. &lt;/p&gt;
&lt;p&gt;Manual implementations of &lt;code&gt;Future&lt;/code&gt; are most often little finite state machines.
Each state in the process of completing the calculation is modeled as a variant of an &lt;code&gt;enum&lt;/code&gt;.
Before a &lt;code&gt;Future&lt;/code&gt; returns &lt;code&gt;Pending&lt;/code&gt;, it bundles the data required to resume in an enum variant, stores that enum variant in itself, and then returns.
While compact and efficient, the resulting code is often quite verbose.&lt;/p&gt;
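&lt;p&gt;To make this concrete, here is a minimal hand-rolled &lt;code&gt;Future&lt;/code&gt; of this kind, using only the standard library. The &lt;code&gt;Countdown&lt;/code&gt; type and its choice of two intermediate polls are made up for illustration; &lt;code&gt;Waker::noop&lt;/code&gt; requires Rust 1.85 or later:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, Waker};

// A hand-written future that must be polled three times before it
// completes, mirroring the enum-per-state pattern described above.
enum Countdown {
    Running { remaining: u32 },
    Done,
}

impl Future for Countdown {
    type Output = u32;

    fn poll(self: Pin&amp;lt;&amp;amp;mut Self&amp;gt;, cx: &amp;amp;mut Context&amp;lt;'_&amp;gt;) -&amp;gt; Poll&amp;lt;u32&amp;gt; {
        let this = self.get_mut();
        let remaining = match this {
            Countdown::Done =&amp;gt; panic!("polled after completion"),
            Countdown::Running { remaining } =&amp;gt; *remaining,
        };
        if remaining == 0 {
            *this = Countdown::Done;
            Poll::Ready(42)
        } else {
            // Store the state needed to resume in an enum variant,
            // request another poll, and yield.
            *this = Countdown::Running { remaining: remaining - 1 };
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

fn main() {
    let mut fut = Countdown::Running { remaining: 2 };
    // A noop waker suffices here because we poll in a loop anyway.
    let mut cx = Context::from_waker(Waker::noop());
    let mut polls = 0;
    let value = loop {
        polls += 1;
        if let Poll::Ready(v) = Pin::new(&amp;amp;mut fut).poll(&amp;amp;mut cx) {
            break v;
        }
    };
    println!("ready after {polls} polls: {value}");
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each call to &lt;code&gt;poll&lt;/code&gt; either completes or stores just enough state to resume later; nothing happens between calls.&lt;/p&gt;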
&lt;p&gt;The &lt;code&gt;async&lt;/code&gt; keyword was introduced to make life easier on Rust programmers.
It provides elegant syntactic sugar for the manual state machine &lt;code&gt;Future&lt;/code&gt; approach.
When you write an &lt;code&gt;async&lt;/code&gt; function or block, the compiler transforms linear code into a state machine based &lt;code&gt;Future&lt;/code&gt; similar to the one described above for you.
Since all the state management is compiler-generated and hidden from sight, async code tends to be easier to write and more readable, while maintaining the same underlying mechanics.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;await&lt;/code&gt; keyword complements &lt;code&gt;async&lt;/code&gt; by pausing execution until a &lt;code&gt;Future&lt;/code&gt; completes.
When you &lt;code&gt;.await&lt;/code&gt; a &lt;code&gt;Future&lt;/code&gt;, you're essentially telling the compiler to generate code that:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Polls the &lt;code&gt;Future&lt;/code&gt; with the current (implicit) asynchronous context&lt;/li&gt;
&lt;li&gt;If &lt;code&gt;poll&lt;/code&gt; returns &lt;code&gt;Poll::Pending&lt;/code&gt;, saves the state of the &lt;code&gt;Future&lt;/code&gt; so that it can resume at this point, and returns &lt;code&gt;Poll::Pending&lt;/code&gt; itself&lt;/li&gt;
&lt;li&gt;If it returns &lt;code&gt;Poll::Ready(value)&lt;/code&gt;, continues execution with that value&lt;/li&gt;
&lt;/ol&gt;
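&lt;p&gt;As a sketch of the machinery that drives these compiler-generated state machines, here is a toy &lt;code&gt;block_on&lt;/code&gt; that polls a future to completion with a no-op waker. Busy-polling only works because nothing here waits on real I/O; a real executor parks the thread and relies on the &lt;code&gt;Waker&lt;/code&gt; to know when to poll again:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;use std::future::Future;
use std::pin::pin;
use std::task::{Context, Poll, Waker};

// A toy block_on: repeatedly poll the future until it is Ready.
fn block_on&amp;lt;F: Future&amp;gt;(future: F) -&amp;gt; F::Output {
    let mut future = pin!(future);
    let mut cx = Context::from_waker(Waker::noop());
    loop {
        if let Poll::Ready(value) = future.as_mut().poll(&amp;amp;mut cx) {
            return value;
        }
    }
}

async fn add(a: u32, b: u32) -&amp;gt; u32 {
    a + b
}

fn main() {
    // The async block is inert: nothing runs until block_on polls
    // the state machine the compiler generated for it.
    let fut = async { add(40, 2).await };
    println!("{}", block_on(fut));
}
&lt;/code&gt;&lt;/pre&gt;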
&lt;h3 id="from-futures-to-streams"&gt;From Futures to Streams&lt;a class="headerlink" href="#from-futures-to-streams" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The &lt;a href="https://docs.rs/futures/latest/futures/"&gt;&lt;code&gt;futures&lt;/code&gt;&lt;/a&gt; crate extends the &lt;code&gt;Future&lt;/code&gt; model with a trait named &lt;a href="https://docs.rs/futures/latest/futures/prelude/trait.Stream.html"&gt;&lt;code&gt;Stream&lt;/code&gt;&lt;/a&gt;.
&lt;code&gt;Stream&amp;lt;Item = T&amp;gt;&lt;/code&gt; represents a sequence of values that are each produced asynchronously rather than just a single value.
It's the asynchronous equivalent of &lt;code&gt;Iterator&amp;lt;Item = T&amp;gt;&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;Stream&lt;/code&gt; trait has one method named &lt;a href="https://docs.rs/futures/latest/futures/prelude/trait.Stream.html#tymethod.poll_next"&gt;&lt;code&gt;poll_next&lt;/code&gt;&lt;/a&gt; that returns:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Poll::Pending&lt;/code&gt; when the next value isn't ready yet, just like a &lt;code&gt;Future&lt;/code&gt; would&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Poll::Ready(Some(value))&lt;/code&gt; when a new value is available&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Poll::Ready(None)&lt;/code&gt; when the stream is exhausted&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Under the hood, an implementation of &lt;code&gt;Stream&lt;/code&gt; is very similar to a &lt;code&gt;Future&lt;/code&gt;.
Typically, they're also implemented as state machines, the main difference being that they produce multiple values rather than just one.
Just like &lt;code&gt;Future&lt;/code&gt;, a &lt;code&gt;Stream&lt;/code&gt; is inert unless explicitly polled.&lt;/p&gt;
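&lt;p&gt;For illustration, here is a minimal stream that behaves like an iterator over &lt;code&gt;0..limit&lt;/code&gt;. To keep the sketch dependency-free it uses a locally defined trait with the same shape as &lt;code&gt;futures::Stream&lt;/code&gt; (the real trait takes &lt;code&gt;Pin&amp;lt;&amp;amp;mut Self&amp;gt;&lt;/code&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;use std::task::{Context, Poll, Waker};

// A local stand-in with the same shape as futures::Stream, so the
// sketch compiles with the standard library alone.
trait Stream {
    type Item;
    fn poll_next(&amp;amp;mut self, cx: &amp;amp;mut Context&amp;lt;'_&amp;gt;) -&amp;gt; Poll&amp;lt;Option&amp;lt;Self::Item&amp;gt;&amp;gt;;
}

// The asynchronous counterpart of an iterator over 0..limit.
struct Counter {
    next: u32,
    limit: u32,
}

impl Stream for Counter {
    type Item = u32;

    fn poll_next(&amp;amp;mut self, _cx: &amp;amp;mut Context&amp;lt;'_&amp;gt;) -&amp;gt; Poll&amp;lt;Option&amp;lt;u32&amp;gt;&amp;gt; {
        if self.next &amp;lt; self.limit {
            self.next += 1;
            Poll::Ready(Some(self.next - 1))
        } else {
            Poll::Ready(None) // exhausted, like an Iterator returning None
        }
    }
}

fn main() {
    let mut stream = Counter { next: 0, limit: 3 };
    let mut cx = Context::from_waker(Waker::noop());
    let mut values = Vec::new();
    while let Poll::Ready(Some(v)) = stream.poll_next(&amp;amp;mut cx) {
        values.push(v);
    }
    println!("{values:?}");
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since every value here happens to be ready immediately, this particular stream never returns &lt;code&gt;Pending&lt;/code&gt;.&lt;/p&gt;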
&lt;p&gt;Now that we understand the basics of Rust's async model, let's see how DataFusion leverages these concepts to execute queries.&lt;/p&gt;
&lt;h2 id="how-datafusion-executes-queries"&gt;How DataFusion Executes Queries&lt;a class="headerlink" href="#how-datafusion-executes-queries" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;In DataFusion, the short version of how queries are executed is as follows (you can find more in-depth coverage of this in the &lt;a href="https://docs.rs/datafusion/latest/datafusion/#streaming-execution"&gt;DataFusion documentation&lt;/a&gt;):&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;First the query is compiled into a tree of &lt;a href="https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.ExecutionPlan.html"&gt;&lt;code&gt;ExecutionPlan&lt;/code&gt;&lt;/a&gt; nodes&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.ExecutionPlan.html#tymethod.execute"&gt;&lt;code&gt;ExecutionPlan::execute&lt;/code&gt;&lt;/a&gt; is called on the root of the tree. &lt;/li&gt;
&lt;li&gt;This method returns a &lt;a href="https://docs.rs/datafusion/latest/datafusion/execution/type.SendableRecordBatchStream.html"&gt;&lt;code&gt;SendableRecordBatchStream&lt;/code&gt;&lt;/a&gt; (a pinned &lt;code&gt;Box&amp;lt;dyn Stream&amp;lt;RecordBatch&amp;gt;&amp;gt;&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Stream::poll_next&lt;/code&gt; is called in a loop to get the results&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In other words, the execution of a DataFusion query boils down to polling an asynchronous stream.
Like all &lt;code&gt;Stream&lt;/code&gt; implementations, we need to explicitly poll the stream for the query to make progress. &lt;/p&gt;
&lt;p&gt;The &lt;code&gt;Stream&lt;/code&gt; we get in step 3 is actually the root of a tree of &lt;code&gt;Stream&lt;/code&gt;s that mostly mirrors the execution plan tree.
Each stream tree node processes the record batches it gets from its children.
The leaves of the tree produce record batches themselves.&lt;/p&gt;
&lt;p&gt;Query execution progresses each time you call &lt;code&gt;poll_next&lt;/code&gt; on the root stream.
This call typically cascades down the tree, with each node calling &lt;code&gt;poll_next&lt;/code&gt; on its children to get the data it needs to process.&lt;/p&gt;
&lt;p&gt;Here's where the first signs of problems start to show up: some operations (like aggregations, sorts, or certain join phases) need to process a lot of data before producing any output.
When &lt;code&gt;poll_next&lt;/code&gt; encounters one of these operations, it might require substantial work before it can return a record batch.&lt;/p&gt;
&lt;h3 id="tokio-and-cooperative-scheduling"&gt;Tokio and Cooperative Scheduling&lt;a class="headerlink" href="#tokio-and-cooperative-scheduling" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;We need to make a small detour now via Tokio's scheduler before we can get to the query cancellation problem.
DataFusion makes use of the &lt;a href="https://tokio.rs"&gt;Tokio asynchronous runtime&lt;/a&gt;, which uses a &lt;a href="https://docs.rs/tokio/latest/tokio/task/index.html#what-are-tasks"&gt;cooperative scheduling model&lt;/a&gt;.
This is fundamentally different from preemptive scheduling that you might be used to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;In &lt;strong&gt;preemptive scheduling&lt;/strong&gt;, the system can interrupt a task at any time to run something else&lt;/li&gt;
&lt;li&gt;In &lt;strong&gt;cooperative scheduling&lt;/strong&gt;, tasks must voluntarily yield control back to the scheduler&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This distinction is crucial for understanding our cancellation problem.&lt;/p&gt;
&lt;p&gt;A task in Tokio is modeled as a &lt;code&gt;Future&lt;/code&gt; which is passed to one of the task initiation functions like &lt;a href="https://docs.rs/tokio/latest/tokio/task/fn.spawn.html"&gt;&lt;code&gt;spawn&lt;/code&gt;&lt;/a&gt;.
Tokio runs the task by calling &lt;code&gt;Future::poll&lt;/code&gt; in a loop until it returns &lt;code&gt;Poll::Ready&lt;/code&gt;.
While that &lt;code&gt;Future::poll&lt;/code&gt; call is running, Tokio has no way to forcibly interrupt it.
It must cooperate by periodically yielding control, either by returning &lt;code&gt;Poll::Pending&lt;/code&gt; or &lt;code&gt;Poll::Ready&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Similarly, when you try to abort a task by calling &lt;a href="https://docs.rs/tokio/latest/tokio/task/struct.JoinHandle.html#method.abort"&gt;&lt;code&gt;JoinHandle::abort()&lt;/code&gt;&lt;/a&gt;, the Tokio runtime can't immediately force it to stop.
You're just telling Tokio: "When this task next yields control, don't call &lt;code&gt;Future::poll&lt;/code&gt; anymore."
If the task never yields, it can't be aborted.&lt;/p&gt;
&lt;h3 id="the-cancellation-problem"&gt;The Cancellation Problem&lt;a class="headerlink" href="#the-cancellation-problem" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;With all the necessary background in place, now let's look at how the DataFusion CLI tries to run and cancel a query.
The code below is a simplified version of &lt;a href="https://github.com/apache/datafusion/blob/db13dd93579945628cd81d534c032f5e6cc77967/datafusion-cli/src/exec.rs#L179-L186"&gt;what the CLI actually does&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;fn exec_query() {
    let runtime: tokio::runtime::Runtime = ...;
    let stream: SendableRecordBatchStream = ...;

    runtime.block_on(async {
        tokio::select! {
            next_batch = stream.next() =&amp;gt; ...
            _ = signal::ctrl_c() =&amp;gt; ...,
        }
    })
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;First the CLI sets up a Tokio runtime instance.
It then reads the query to execute from standard input or a file and turns it into a &lt;code&gt;Stream&lt;/code&gt;.
Then it calls &lt;code&gt;next&lt;/code&gt; on the stream, which is an &lt;code&gt;async&lt;/code&gt; wrapper for &lt;code&gt;poll_next&lt;/code&gt;.
It passes this to the &lt;a href="https://docs.rs/tokio/latest/tokio/macro.select.html"&gt;&lt;code&gt;select!&lt;/code&gt;&lt;/a&gt; macro along with a ctrl-C handler.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;select!&lt;/code&gt; macro races these two &lt;code&gt;Future&lt;/code&gt;s and completes when either one finishes.
The intent is that when you press Ctrl+C, the &lt;code&gt;signal::ctrl_c()&lt;/code&gt; &lt;code&gt;Future&lt;/code&gt; should complete.
The &lt;a href="https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.ExecutionPlan.html#cancellation--aborting-execution"&gt;stream is cancelled&lt;/a&gt; when it is dropped as it is inert by itself and nothing will be able to call &lt;code&gt;poll_next&lt;/code&gt; again.&lt;/p&gt;
&lt;p&gt;But there's a catch: &lt;code&gt;select!&lt;/code&gt; still follows cooperative scheduling rules.
It polls each &lt;code&gt;Future&lt;/code&gt; in sequence, and if the first one (our query) gets stuck in a long computation, it never gets around to polling the cancellation signal.&lt;/p&gt;
&lt;p&gt;Imagine a query that needs to calculate something intensive, like sorting billions of rows.
Unless the sorting Stream is written with care (which the one in DataFusion is), the &lt;code&gt;poll_next&lt;/code&gt; call may take several minutes or even longer without returning.
During this time, Tokio can't check if you've pressed Ctrl+C, and the query continues running despite your cancellation request.&lt;/p&gt;
&lt;h2 id="a-closer-look-at-blocking-operators"&gt;A Closer Look at Blocking Operators&lt;a class="headerlink" href="#a-closer-look-at-blocking-operators" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Let's peel back a layer of the onion and look at what's happening in a blocking &lt;code&gt;poll_next&lt;/code&gt; implementation.
Here's a drastically simplified version of a &lt;code&gt;COUNT(*)&lt;/code&gt; aggregation - something you might use in a query like &lt;code&gt;SELECT COUNT(*) FROM table&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;struct BlockingStream {
    // the input: an inner stream that is wrapped
    stream: SendableRecordBatchStream,
    count: usize,
    finished: bool,
}

impl Stream for BlockingStream {
    type Item = Result&amp;lt;RecordBatch&amp;gt;;
    fn poll_next(mut self: Pin&amp;lt;&amp;amp;mut Self&amp;gt;, cx: &amp;amp;mut Context&amp;lt;'_&amp;gt;) -&amp;gt; Poll&amp;lt;Option&amp;lt;Self::Item&amp;gt;&amp;gt; {
        if self.finished {
            // return None if we're finished
            return Poll::Ready(None);
        }

        loop {
            // poll the input stream to get the next batch if ready
            match ready!(self.stream.poll_next_unpin(cx)) {
                // increment the counter if we got a batch
                Some(Ok(batch)) =&amp;gt; self.count += batch.num_rows(),
                // on end-of-stream, create a record batch for the counter
                None =&amp;gt; {
                    self.finished = true;
                    return Poll::Ready(Some(Ok(create_record_batch(self.count))));
                }
                // pass on any errors verbatim
                Some(Err(e)) =&amp;gt; return Poll::Ready(Some(Err(e))),
            }
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;How does this code work? Let's break it down step by step:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;1. Initial check&lt;/strong&gt;: We first check if we've already finished processing. If so, we return &lt;code&gt;Ready(None)&lt;/code&gt; to signal the end of our stream:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;if self.finished {
    return Poll::Ready(None);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;2. Processing loop&lt;/strong&gt;: If we're not done yet, we enter a loop to process incoming batches from our input stream:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;loop {
    match ready!(self.stream.poll_next_unpin(cx)) {
        // Handle different cases...
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;a href="https://doc.rust-lang.org/beta/std/task/macro.ready.html"&gt;&lt;code&gt;ready!&lt;/code&gt;&lt;/a&gt; macro checks if the input stream returned &lt;code&gt;Pending&lt;/code&gt; and if so, immediately returns &lt;code&gt;Pending&lt;/code&gt; from our function as well.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;3. Processing data&lt;/strong&gt;: For each batch we receive, we simply add its row count to our running total:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;Some(Ok(batch)) =&amp;gt; self.count += batch.num_rows(),
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;4. End of input&lt;/strong&gt;: When the child stream is exhausted (returns &lt;code&gt;None&lt;/code&gt;), we calculate our final result and convert it into a record batch (omitted for brevity):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;None =&amp;gt; {
    self.finished = true;
    return Poll::Ready(Some(Ok(create_record_batch(self.count))));
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;5. Error handling&lt;/strong&gt;: If we encounter an error, we pass it along immediately:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;Some(Err(e)) =&amp;gt; return Poll::Ready(Some(Err(e))),
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This code looks perfectly reasonable at first glance.
But there's a subtle issue lurking here: what happens if the input stream &lt;em&gt;always&lt;/em&gt; returns &lt;code&gt;Ready&lt;/code&gt; and never returns &lt;code&gt;Pending&lt;/code&gt;?&lt;/p&gt;
&lt;p&gt;In that case, the processing loop will keep running without returning &lt;code&gt;Poll::Pending&lt;/code&gt; and thus never yield control back to Tokio's scheduler.
This means we could be stuck in a single &lt;code&gt;poll_next&lt;/code&gt; call for quite some time - exactly the scenario that prevents query cancellation from working!&lt;/p&gt;
&lt;p&gt;So how do we solve this problem? Let's explore some strategies to ensure our operators yield control periodically.&lt;/p&gt;
&lt;h2 id="unblocking-operators"&gt;Unblocking Operators&lt;a class="headerlink" href="#unblocking-operators" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Now let's look at how we can ensure we return &lt;code&gt;Pending&lt;/code&gt; every now and then.&lt;/p&gt;
&lt;h3 id="independent-cooperative-operators"&gt;Independent Cooperative Operators&lt;a class="headerlink" href="#independent-cooperative-operators" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;One simple way to return &lt;code&gt;Pending&lt;/code&gt; periodically is to use a poll counter.
We do the exact same thing as before, but on each call we increment the counter.
When the counter reaches a threshold, we reset it and return &lt;code&gt;Pending&lt;/code&gt;.
The following example ensures we produce at most 128 batches before yielding.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;struct CountingSourceStream {
   counter: usize
}

impl Stream for CountingSourceStream {
    type Item = Result&amp;lt;RecordBatch&amp;gt;;

    fn poll_next(mut self: Pin&amp;lt;&amp;amp;mut Self&amp;gt;, cx: &amp;amp;mut Context&amp;lt;'_&amp;gt;) -&amp;gt; Poll&amp;lt;Option&amp;lt;Self::Item&amp;gt;&amp;gt; {
        if self.counter &amp;gt;= 128 {
            self.counter = 0;
            cx.waker().wake_by_ref();
            return Poll::Pending;
        }

        self.counter += 1;
        let batch = ...;
        Poll::Ready(Some(Ok(batch)))
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If &lt;code&gt;CountingSourceStream&lt;/code&gt; were the input for the &lt;code&gt;BlockingStream&lt;/code&gt; example above,
&lt;code&gt;BlockingStream&lt;/code&gt; would receive a &lt;code&gt;Pending&lt;/code&gt; periodically, causing it to yield too.
Can we really solve the cancellation problem simply by yielding periodically in source streams?&lt;/p&gt;
&lt;p&gt;Unfortunately, no.
Let's look at what happens when we start combining operators in more complex configurations.
Suppose we create a plan like this.&lt;/p&gt;
&lt;figure&gt;
&lt;img alt="Diagram showing a plan that merges two branches that return Pending at different intervals." src="/blog/images/task-cancellation/merge_plan.png"/&gt;
&lt;figcaption&gt;A plan that merges two branches by alternating between them.&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Each &lt;code&gt;CountingSource&lt;/code&gt; produces a &lt;code&gt;Pending&lt;/code&gt; every 128 batches.
The &lt;code&gt;Filter&lt;/code&gt; is a stream that drops one batch out of every 50 record batches.
&lt;code&gt;Merge&lt;/code&gt; is a simple combining operator that uses &lt;code&gt;futures::stream::select&lt;/code&gt; to combine two streams.&lt;/p&gt;
&lt;p&gt;When we set this stream in motion, the merge operator will poll the left and right branch in a round-robin fashion.
The sources will each emit &lt;code&gt;Pending&lt;/code&gt; every 128 batches, but since the &lt;code&gt;Filter&lt;/code&gt; drops batches, they arrive out-of-phase at the merge operator.
As a consequence, the merge operator will always have the opportunity to poll the other stream when one returns &lt;code&gt;Pending&lt;/code&gt;.
The &lt;code&gt;Merge&lt;/code&gt; stream is thus always ready, even though the sources are yielding.
If we use &lt;code&gt;Merge&lt;/code&gt; as the input to our aggregating operator we're right back where we started.&lt;/p&gt;
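&lt;p&gt;This starvation effect can be reproduced with a small self-contained model (toy types, not the actual DataFusion operators): each source yields on every 128th poll, and the merge falls back to the other side whenever one side is &lt;code&gt;Pending&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;use std::task::Poll;

// A toy model of the plan above (not DataFusion code): each source
// stands in for a CountingSource + Filter chain and yields Pending
// on every 128th poll.
struct Source {
    polls: u64,
}

impl Source {
    fn poll_next(&amp;amp;mut self) -&amp;gt; Poll&amp;lt;u64&amp;gt; {
        self.polls += 1;
        if self.polls % 128 == 0 {
            Poll::Pending
        } else {
            Poll::Ready(self.polls)
        }
    }
}

// A merge in the spirit of futures::stream::select: when one side
// is Pending, try the other side before giving up.
fn poll_merge(left: &amp;amp;mut Source, right: &amp;amp;mut Source) -&amp;gt; Poll&amp;lt;u64&amp;gt; {
    match left.poll_next() {
        Poll::Pending =&amp;gt; right.poll_next(),
        ready =&amp;gt; ready,
    }
}

fn main() {
    let mut left = Source { polls: 0 };
    let mut right = Source { polls: 0 };
    let mut merge_pendings = 0;
    for _ in 0..10_000 {
        if poll_merge(&amp;amp;mut left, &amp;amp;mut right).is_pending() {
            merge_pendings += 1;
        }
    }
    // Each source yields on its own, but the merged stream does not.
    println!("merge returned Pending {merge_pendings} times");
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this model the left source yields 78 times over 10,000 polls, yet the merged stream never returns &lt;code&gt;Pending&lt;/code&gt; at all; an aggregation sitting on top of it would spin without ever giving Tokio a chance to intervene.&lt;/p&gt;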
&lt;h3 id="coordinated-cooperation"&gt;Coordinated Cooperation&lt;a class="headerlink" href="#coordinated-cooperation" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Wouldn't it be great if we could get all the operators to coordinate amongst each other?
When one of them determines that it's time to yield, all the other operators agree and start returning &lt;code&gt;Pending&lt;/code&gt; as well.
That way our task would be coaxed towards yielding even if it tried to poll many different operators.&lt;/p&gt;
&lt;p&gt;Luckily(?), the &lt;a href="https://tokio.rs/blog/2020-04-preemption"&gt;developers of Tokio ran into the exact same problem&lt;/a&gt; described above when network servers were under heavy load and came up with a solution.
Back in 2020, Tokio 0.2.14 introduced a per-task operation budget.
Rather than having individual counters littered throughout the code, the Tokio runtime itself manages a per-task counter which is decremented by Tokio resources.
When the counter hits zero, all resources start returning &lt;code&gt;Pending&lt;/code&gt;.
The task will then yield, after which the Tokio runtime resets the counter.&lt;/p&gt;
&lt;p&gt;To illustrate what this process looks like, let's have a look at the execution of the following query &lt;code&gt;Stream&lt;/code&gt; tree when polled in a Tokio task.&lt;/p&gt;
&lt;figure&gt;
&lt;img alt="Diagram showing a plan with a task, AggregateExec, MergeStream and Two sources." src="/blog/images/task-cancellation/tokio_budget_plan.png"/&gt;
&lt;figcaption&gt;Query plan for aggregating a sorted stream from two sources. Each source reads a stream of `RecordBatch`es, which are then merged into a single Stream by the `MergeStream` operator which is then aggregated by the `AggregateExec` operator. Arrows represent the data flow direction&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;If we assume a task budget of 1 unit, each time Tokio schedules the task the following sequence of function calls results.&lt;/p&gt;
&lt;figure&gt;
&lt;img alt="Sequence diagram showing how the tokio task budget is used and reset." class="img-fluid" src="/blog/images/task-cancellation/tokio_budget.png"/&gt;
&lt;figcaption&gt;Tokio task budget system, assuming the task budget is set to 1, for the plan above.&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;The aggregation stream would try to poll the merge stream in a loop.
The first iteration of the loop consumes the single unit of budget and returns &lt;code&gt;Ready&lt;/code&gt;.
The second iteration polls the merge stream again, which now tries to poll the second scan stream.
Since there is no budget remaining, &lt;code&gt;Pending&lt;/code&gt; is returned.
The merge stream may now try to poll the first source stream again, but since the budget is still depleted, &lt;code&gt;Pending&lt;/code&gt; is returned there as well.
The merge stream then has no option but to return &lt;code&gt;Pending&lt;/code&gt; itself, causing the aggregation to break out of its loop.
The &lt;code&gt;Pending&lt;/code&gt; result bubbles all the way up to the Tokio runtime, at which point the runtime regains control.
When the runtime reschedules the task, it resets the budget and calls &lt;code&gt;poll&lt;/code&gt; on the task &lt;code&gt;Future&lt;/code&gt; again for another round of progress.&lt;/p&gt;
&lt;p&gt;The key mechanism that makes this work well is the single task budget that's shared amongst all the scan streams.
Once the budget is depleted, no stream can make any further progress without first returning control to Tokio.
This causes every avenue the task has for making progress to return &lt;code&gt;Pending&lt;/code&gt;, nudging the task towards yielding control.&lt;/p&gt;
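&lt;p&gt;This shared-budget behaviour can be sketched in plain, non-async Rust. The &lt;code&gt;Budget&lt;/code&gt;, &lt;code&gt;Scan&lt;/code&gt;, and &lt;code&gt;Merge&lt;/code&gt; types below are illustrative stand-ins rather than Tokio's or DataFusion's actual API; they only model how a single budget, shared by every stream in the task, forces everything to report &lt;code&gt;Pending&lt;/code&gt; once it is depleted.&lt;/p&gt;

```rust
// Simplified Poll, mirroring std::task::Poll for this sketch.
#[derive(Debug, PartialEq)]
enum Poll {
    Ready(i64),
    Pending,
}

// The per-task operation budget, shared by every resource in the task.
struct Budget {
    remaining: u32,
}

impl Budget {
    // Decrement the counter; once it hits zero, consumption fails and
    // every resource starts returning Pending.
    fn try_consume(&mut self) -> bool {
        if self.remaining == 0 {
            false
        } else {
            self.remaining -= 1;
            true
        }
    }
}

// A scan stream that always has a value ready, budget permitting.
struct Scan {
    next: i64,
}

impl Scan {
    fn poll(&mut self, budget: &mut Budget) -> Poll {
        if !budget.try_consume() {
            return Poll::Pending; // budget depleted: force a yield
        }
        let v = self.next;
        self.next += 1;
        Poll::Ready(v)
    }
}

// Merge polls its inputs in turn; when all of them report Pending it has
// no option but to report Pending itself.
struct Merge {
    inputs: [Scan; 2],
}

impl Merge {
    fn poll(&mut self, budget: &mut Budget) -> Poll {
        for scan in self.inputs.iter_mut() {
            if let Poll::Ready(v) = scan.poll(budget) {
                return Poll::Ready(v);
            }
        }
        Poll::Pending
    }
}

fn main() {
    let mut budget = Budget { remaining: 1 };
    let mut merge = Merge {
        inputs: [Scan { next: 0 }, Scan { next: 100 }],
    };

    // First poll: the single budget unit is spent on the first scan.
    assert_eq!(merge.poll(&mut budget), Poll::Ready(0));

    // Second poll: both scans see a depleted budget, so the merge must
    // return Pending as well and the task would yield to the runtime here.
    assert_eq!(merge.poll(&mut budget), Poll::Pending);

    // The "runtime" resets the budget when it reschedules the task.
    budget.remaining = 1;
    assert_eq!(merge.poll(&mut budget), Poll::Ready(1));
}
```

&lt;p&gt;The essential point is that &lt;code&gt;Budget&lt;/code&gt; is shared: the second scan cannot sidestep a depletion caused by the first one.&lt;/p&gt;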
&lt;p&gt;As it turns out, DataFusion was already using this mechanism implicitly.
Every exchange-like operator (such as &lt;code&gt;RepartitionExec&lt;/code&gt;) internally makes use of a Tokio multiple-producer, single-consumer &lt;a href="https://tokio.rs/tokio/tutorial/channels"&gt;channel&lt;/a&gt;.
When calling &lt;code&gt;Receiver::recv&lt;/code&gt; for one of these channels, a unit of Tokio task budget is consumed.
As a consequence, query plans that made use of exchange-like operators were
already mostly cancelable.
The plan cancellation bug only showed up in plan fragments without such operators, for instance when running on a single core.&lt;/p&gt;
&lt;p&gt;Now let's see how we can explicitly implement this budget-based approach in our own operators.&lt;/p&gt;
&lt;h3 id="depleting-the-tokio-budget"&gt;Depleting The Tokio Budget&lt;a class="headerlink" href="#depleting-the-tokio-budget" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Let's revisit our original &lt;code&gt;BlockingStream&lt;/code&gt; and adapt it to use Tokio's budget system.&lt;/p&gt;
&lt;p&gt;The examples given here make use of functions from the Tokio &lt;code&gt;coop&lt;/code&gt; module that are still internal at the time of writing.
&lt;a href="https://github.com/tokio-rs/tokio/pull/7405"&gt;PR #7405&lt;/a&gt; on the Tokio project will make these accessible for external use.
The current DataFusion code emulates these functions as well as possible using &lt;a href="https://docs.rs/tokio/latest/tokio/task/coop/fn.has_budget_remaining.html"&gt;&lt;code&gt;has_budget_remaining&lt;/code&gt;&lt;/a&gt; and &lt;a href="https://docs.rs/tokio/latest/tokio/task/coop/fn.consume_budget.html"&gt;&lt;code&gt;consume_budget&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;struct BudgetSourceStream {
    // fields elided
}

impl Stream for BudgetSourceStream {
    type Item = Result&amp;lt;RecordBatch&amp;gt;;

    fn poll_next(mut self: Pin&amp;lt;&amp;amp;mut Self&amp;gt;, cx: &amp;amp;mut Context&amp;lt;'_&amp;gt;) -&amp;gt; Poll&amp;lt;Option&amp;lt;Self::Item&amp;gt;&amp;gt; {
        // 1. Try to consume a unit of budget; returns Pending when depleted
        let coop = ready!(tokio::task::coop::poll_proceed(cx));
        // 2. Try to produce a record batch
        let batch: Poll&amp;lt;Option&amp;lt;Self::Item&amp;gt;&amp;gt; = ...;
        // 3. Commit the budget consumption if we made progress
        if batch.is_ready() {
            coop.made_progress();
        }
        batch
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;Stream&lt;/code&gt; now goes through the following steps:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;1. Try to consume budget&lt;/strong&gt;: the first thing the operator does is use &lt;code&gt;poll_proceed&lt;/code&gt; to try to consume a unit of budget.
If the budget is depleted, this function will return &lt;code&gt;Pending&lt;/code&gt;.
Otherwise, we consumed one budget unit and we can continue.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;let coop = ready!(tokio::task::coop::poll_proceed(cx));
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;2. Try to do some work&lt;/strong&gt;: next we try to produce a record batch.
That might not be possible if we're reading from some asynchronous resource that's not ready.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;let batch: Poll&amp;lt;Option&amp;lt;Self::Item&amp;gt;&amp;gt; = ...;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;3. Commit the budget consumption&lt;/strong&gt;: finally, if we did produce a batch, we need to tell Tokio that we were able to make progress.
That's done by calling the &lt;code&gt;made_progress&lt;/code&gt; method on the value &lt;code&gt;poll_proceed&lt;/code&gt; returned.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;if batch.is_ready() {
   coop.made_progress();
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You might be wondering why the call to &lt;code&gt;made_progress&lt;/code&gt; is necessary.
This clever construct makes it easier to manage the budget.
The value returned by &lt;code&gt;poll_proceed&lt;/code&gt; will restore the budget to its original value when it is dropped, unless &lt;code&gt;made_progress&lt;/code&gt; is called.
This ensures that if we exit early from our &lt;code&gt;poll_next&lt;/code&gt; implementation by returning &lt;code&gt;Pending&lt;/code&gt;, the budget we had consumed becomes available again.
The task that invoked &lt;code&gt;poll_next&lt;/code&gt; can then spend that budget trying to make progress on some other &lt;code&gt;Stream&lt;/code&gt; (or any other resource).&lt;/p&gt;
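&lt;p&gt;The restore-on-drop behaviour can be mimicked with a small RAII guard in plain Rust. This &lt;code&gt;BudgetGuard&lt;/code&gt; is a hypothetical stand-in for the value &lt;code&gt;poll_proceed&lt;/code&gt; returns, not Tokio's actual type; it only shows the pattern of refunding the budget unit when the guard is dropped without &lt;code&gt;made_progress&lt;/code&gt; having been called.&lt;/p&gt;

```rust
use std::cell::Cell;

// Task-local budget; a Cell lets the guard restore it on drop.
struct Budget {
    remaining: Cell<u32>,
}

// Guard handed out when a unit of budget is consumed. Unless
// made_progress() is called, dropping it refunds the unit.
struct BudgetGuard<'a> {
    budget: &'a Budget,
    progressed: bool,
}

impl Budget {
    // Mirrors the shape of poll_proceed: None means "depleted, return Pending".
    fn poll_proceed(&self) -> Option<BudgetGuard<'_>> {
        let remaining = self.remaining.get();
        if remaining == 0 {
            None
        } else {
            self.remaining.set(remaining - 1);
            Some(BudgetGuard { budget: self, progressed: false })
        }
    }
}

impl BudgetGuard<'_> {
    // Commit the consumption: the unit will not be refunded on drop.
    fn made_progress(mut self) {
        self.progressed = true;
    }
}

impl Drop for BudgetGuard<'_> {
    fn drop(&mut self) {
        if !self.progressed {
            // Dropped without progress: refund the budget unit.
            self.budget.remaining.set(self.budget.remaining.get() + 1);
        }
    }
}

fn main() {
    let budget = Budget { remaining: Cell::new(1) };

    // Consume a unit but exit early without progress: refunded on drop.
    {
        let _guard = budget.poll_proceed().unwrap();
        // returning Pending here would drop the guard without made_progress()
    }
    assert_eq!(budget.remaining.get(), 1);

    // Consume a unit and commit it: the budget stays decremented.
    let guard = budget.poll_proceed().unwrap();
    guard.made_progress();
    assert_eq!(budget.remaining.get(), 0);
}
```

&lt;p&gt;Tying the refund to &lt;code&gt;Drop&lt;/code&gt; means every early-return path, including &lt;code&gt;?&lt;/code&gt; and &lt;code&gt;ready!&lt;/code&gt;, restores the budget without any extra bookkeeping.&lt;/p&gt;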
&lt;h2 id="automatic-cooperation-for-all-operators"&gt;Automatic Cooperation For All Operators&lt;a class="headerlink" href="#automatic-cooperation-for-all-operators" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;DataFusion 49.0.0 integrates the Tokio task-budget-based fix into all built-in source operators.
This ensures that, going forward, most queries will automatically be cancelable.
See &lt;a href="https://github.com/apache/datafusion/pull/16398"&gt;the PR&lt;/a&gt; for more details.&lt;/p&gt;
&lt;p&gt;The design includes:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;A new &lt;code&gt;ExecutionPlan&lt;/code&gt; property that indicates whether an operator participates in cooperative scheduling.&lt;/li&gt;
&lt;li&gt;A new &lt;code&gt;EnsureCooperative&lt;/code&gt; optimizer rule to inspect query plans and insert &lt;code&gt;CooperativeExec&lt;/code&gt; nodes as needed to ensure custom source operators also participate.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These two changes combined make it very unlikely that you'll encounter a query that refuses to stop, even with custom operators.
For situations where the automatic mechanisms are still not sufficient, there's a new &lt;code&gt;datafusion::physical_plan::coop&lt;/code&gt; module
with utility functions that make it easy to adopt cooperative scheduling in your custom operators as well.&lt;/p&gt;
&lt;h2 id="acknowledgments"&gt;Acknowledgments&lt;a class="headerlink" href="#acknowledgments" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Thank you to &lt;a href="https://datadobi.com/"&gt;Datadobi&lt;/a&gt; for sponsoring the development of this feature and to
the DataFusion community contributors including &lt;a href="https://github.com/zhuqi-lucas"&gt;Qi Zhu&lt;/a&gt; and &lt;a href="https://github.com/ozankabak"&gt;Mehmet Ozan
Kabak&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="about-datafusion"&gt;About DataFusion&lt;a class="headerlink" href="#about-datafusion" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; is an extensible query engine toolkit, written
in Rust, that uses &lt;a href="https://arrow.apache.org/"&gt;Apache Arrow&lt;/a&gt; as its in-memory format. DataFusion and
similar technology are part of the next generation “Deconstructed Database”
architectures, where new systems are built on a foundation of fast, modular
components, rather than as a single tightly integrated system.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://datafusion.apache.org/contributor-guide/communication.html"&gt;DataFusion community&lt;/a&gt; is always looking for new contributors to help
improve the project. If you are interested in learning more about how query
execution works, help document or improve the DataFusion codebase, or just try
it out, we would love for you to join us.&lt;/p&gt;</content><category term="blog"/></entry></feed>