<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Apache DataFusion Blog - Pepijn Van Eeckhoudt</title><link href="https://datafusion.apache.org/blog/" rel="alternate"/><link href="https://datafusion.apache.org/blog/feeds/pepijn-van-eeckhoudt.atom.xml" rel="self"/><id>https://datafusion.apache.org/blog/</id><updated>2026-02-02T00:00:00+00:00</updated><entry><title>Optimizing SQL CASE Expression Evaluation</title><link href="https://datafusion.apache.org/blog/2026/02/02/datafusion_case" rel="alternate"/><published>2026-02-02T00:00:00+00:00</published><updated>2026-02-02T00:00:00+00:00</updated><author><name>Pepijn Van Eeckhoudt</name></author><id>tag:datafusion.apache.org,2026-02-02:/blog/2026/02/02/datafusion_case</id><summary type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;style&gt;
figure {
  margin: 20px 0;
}

figure img {
  display: block;
  max-width: 80%;
  margin: auto;
}

figcaption {
  font-style: italic;
  color: #555;
  font-size: 0.9em;
  max-width: 80%;
  margin: auto;
  text-align: center;
}
&lt;/style&gt;
&lt;p&gt;SQL's &lt;code&gt;CASE&lt;/code&gt; expression is one of the few explicit conditional evaluation constructs the language provides.
It allows you to control which expression from a …&lt;/p&gt;</summary><content type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;style&gt;
figure {
  margin: 20px 0;
}

figure img {
  display: block;
  max-width: 80%;
  margin: auto;
}

figcaption {
  font-style: italic;
  color: #555;
  font-size: 0.9em;
  max-width: 80%;
  margin: auto;
  text-align: center;
}
&lt;/style&gt;
&lt;p&gt;SQL's &lt;code&gt;CASE&lt;/code&gt; expression is one of the few explicit conditional evaluation constructs the language provides.
It allows you to control which expression from a set of expressions is evaluated for each row based on arbitrary boolean expressions.
Its deceptively simple syntax hides significant implementation complexity.
Over the past few &lt;a href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; releases, a series of improvements to the &lt;code&gt;CASE&lt;/code&gt; expression evaluator have been merged that reduce both CPU time and memory allocations.
This post provides an overview of the original implementation, its performance bottlenecks, and the steps taken to address them.&lt;/p&gt;
&lt;h2 id="background-case-expression-evaluation"&gt;Background: CASE Expression Evaluation&lt;a class="headerlink" href="#background-case-expression-evaluation" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;SQL supports two forms of CASE expressions:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Simple&lt;/strong&gt;: &lt;code&gt;CASE expr WHEN value1 THEN result1 WHEN value2 THEN result2 ... END&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Searched&lt;/strong&gt;: &lt;code&gt;CASE WHEN condition1 THEN result1 WHEN condition2 THEN result2 ... END&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The simple form evaluates an expression once for each input row and then tests that value against the expressions (typically constants) in each &lt;code&gt;WHEN&lt;/code&gt; clause using equality comparisons.&lt;/p&gt;
&lt;p&gt;Here's an example of the simple form:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;CASE status
    WHEN 'pending' THEN 1
    WHEN 'active' THEN 2
    WHEN 'complete' THEN 3
    ELSE 0
END
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this &lt;code&gt;CASE&lt;/code&gt; expression, &lt;code&gt;status&lt;/code&gt; is evaluated once per row, and then its value is tested for equality with the values &lt;code&gt;'pending'&lt;/code&gt;, &lt;code&gt;'active'&lt;/code&gt;, and &lt;code&gt;'complete'&lt;/code&gt; in that order.
The &lt;code&gt;CASE&lt;/code&gt; expression evaluates to the value of the &lt;code&gt;THEN&lt;/code&gt; expression corresponding to the first matching &lt;code&gt;WHEN&lt;/code&gt; expression.&lt;/p&gt;
&lt;p&gt;The searched &lt;code&gt;CASE&lt;/code&gt; form is a more flexible variant.
It evaluates completely independent boolean expressions for each branch.
This allows you to test different columns with different operators per branch as shown in the following example:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;CASE
    WHEN age &amp;gt; 65 THEN 'senior'
    WHEN childCount != 0 THEN 'parent'
    WHEN age &amp;lt; 21 THEN 'minor'
    ELSE 'adult'
END
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In both forms, branches are evaluated sequentially with short-circuit semantics: for each row, once a &lt;code&gt;WHEN&lt;/code&gt; condition matches, the corresponding &lt;code&gt;THEN&lt;/code&gt; expression is evaluated.
Any further branches are not evaluated for that row.
This lazy evaluation model is critical for correctness.
It lets you safely write &lt;code&gt;CASE&lt;/code&gt; expressions like &lt;code&gt;CASE WHEN d != 0 THEN n / d ELSE NULL END&lt;/code&gt; that are guaranteed to not trigger divide-by-zero errors.&lt;/p&gt;
&lt;p&gt;Besides &lt;code&gt;CASE&lt;/code&gt;, there are a few &lt;a href="https://datafusion.apache.org/user-guide/sql/scalar_functions.html#conditional-functions"&gt;conditional scalar functions&lt;/a&gt; that provide similar, more restricted capabilities.
These include &lt;code&gt;COALESCE&lt;/code&gt;, &lt;code&gt;IFNULL&lt;/code&gt;, and &lt;code&gt;NVL2&lt;/code&gt;.
You can consider each of these functions as the equivalent of a macro for &lt;code&gt;CASE&lt;/code&gt;.
For example, &lt;code&gt;COALESCE(expr1, expr2, expr3)&lt;/code&gt; expands to:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;CASE
  WHEN expr1 IS NOT NULL THEN expr1
  WHEN expr2 IS NOT NULL THEN expr2
  ELSE expr3
END
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since &lt;a href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; rewrites these conditional functions to their equivalent &lt;code&gt;CASE&lt;/code&gt; expression, any optimizations related to &lt;code&gt;CASE&lt;/code&gt; described in this post also apply to conditional function evaluation.&lt;/p&gt;
&lt;h2 id="case-evaluation-in-datafusion-5000"&gt;&lt;code&gt;CASE&lt;/code&gt; Evaluation in DataFusion 50.0.0&lt;a class="headerlink" href="#case-evaluation-in-datafusion-5000" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;For the remainder of this post, we'll be looking at 'searched CASE' evaluation.
'Simple CASE' uses a distinct, but very similar implementation.
The same set of improvements has been applied to both.&lt;/p&gt;
&lt;p&gt;The baseline implementation in DataFusion 50.0.0 evaluated &lt;code&gt;CASE&lt;/code&gt; using a common, straightforward approach:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Start with an output array &lt;code&gt;out&lt;/code&gt; with the same length as the input batch, filled with nulls. Additionally, create a bit vector &lt;code&gt;remainder&lt;/code&gt; with the same length and each value set to &lt;code&gt;true&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;For each &lt;code&gt;WHEN&lt;/code&gt;/&lt;code&gt;THEN&lt;/code&gt; branch:&lt;ul&gt;
&lt;li&gt;Evaluate the &lt;code&gt;WHEN&lt;/code&gt; condition for the remaining unmatched rows using &lt;a href="https://docs.rs/datafusion/latest/datafusion/physical_expr/trait.PhysicalExpr.html#method.evaluate_selection"&gt;&lt;code&gt;PhysicalExpr::evaluate_selection&lt;/code&gt;&lt;/a&gt;, passing in the input batch and the &lt;code&gt;remainder&lt;/code&gt; mask.&lt;/li&gt;
&lt;li&gt;If any rows matched, evaluate the &lt;code&gt;THEN&lt;/code&gt; expression for those rows using &lt;code&gt;PhysicalExpr::evaluate_selection&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Merge the results into the &lt;code&gt;out&lt;/code&gt; array using the &lt;a href="https://docs.rs/arrow/latest/arrow/compute/kernels/zip/fn.zip.html"&gt;&lt;code&gt;zip&lt;/code&gt;&lt;/a&gt; kernel.&lt;/li&gt;
&lt;li&gt;Update the &lt;code&gt;remainder&lt;/code&gt; mask to exclude the matched rows.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;If there's an &lt;code&gt;ELSE&lt;/code&gt; clause, evaluate it for any remaining unmatched rows and merge using &lt;a href="https://docs.rs/arrow/latest/arrow/compute/kernels/zip/fn.zip.html"&gt;&lt;code&gt;zip&lt;/code&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Here's a simplified version of the Rust code for the original loop:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;let mut out = new_null_array(&amp;amp;return_type, batch.num_rows());
let mut remainder = BooleanArray::from(vec![true; batch.num_rows()]);

for (when_expr, then_expr) in &amp;amp;self.when_then_expr {
    // Determine for which remaining rows the WHEN condition matches
    let when = when_expr.evaluate_selection(batch, &amp;amp;remainder)?
        .into_array(batch.num_rows())?;
    // Ensure any `NULL` values are treated as false
    let when_and_rem = and(&amp;amp;when, &amp;amp;remainder)?;

    if when_and_rem.true_count() == 0 {
        continue;
    }

    // Evaluate the THEN expression for matching rows
    let then_value = then_expr.evaluate_selection(batch, &amp;amp;when_and_rem)?;
    // Merge results into output array
    out = zip(&amp;amp;when_and_rem, &amp;amp;then_value, &amp;amp;out)?;
    // Update remainder mask to exclude matched rows
    remainder = and_not(&amp;amp;remainder, &amp;amp;when_and_rem)?;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let's examine one iteration of this loop for the following &lt;code&gt;CASE&lt;/code&gt; expression:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;CASE
    WHEN col = 'b' THEN 100
    ELSE 200
END
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Schematically, it will look as follows:&lt;/p&gt;
&lt;figure&gt;
&lt;img alt="Schematic representation of data flow in the original CASE implementation" class="img-fluid" src="/blog/images/case/original_loop.svg" width="100%"/&gt;
&lt;figcaption&gt;One iteration of the `CASE` evaluation loop&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;This implementation works perfectly fine, but there's significant room for optimization, mostly related to the usage of &lt;code&gt;evaluate_selection&lt;/code&gt;.
To understand why, we need to dig a little deeper into the implementation of that function.
Here's a simplified version of it that captures the relevant parts:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;pub trait PhysicalExpr {
    fn evaluate_selection(
        &amp;amp;self,
        batch: &amp;amp;RecordBatch,
        selection: &amp;amp;BooleanArray,
    ) -&amp;gt; Result&amp;lt;ColumnarValue&amp;gt; {
        // Reduce record batch to only include rows that match selection
        let filtered_batch = filter_record_batch(batch, selection)?;
        // Perform regular evaluation on filtered batch
        let filtered_result = self.evaluate(&amp;amp;filtered_batch)?;
        // Expand result array to match original batch length
        scatter(selection, filtered_result)
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Going back to the same example as before, the data flow in &lt;code&gt;evaluate_selection&lt;/code&gt; looks like this:&lt;/p&gt;
&lt;figure&gt;
&lt;img alt="Schematic representation of `evaluate_selection` evaluation" class="img-fluid" src="/blog/images/case/evaluate_selection.svg" width="100%"/&gt;
&lt;figcaption&gt;evaluate_selection data flow&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;The &lt;code&gt;evaluate_selection&lt;/code&gt; method first filters the input batch to only include rows that match the &lt;code&gt;selection&lt;/code&gt; mask.
It then calls the regular &lt;code&gt;evaluate&lt;/code&gt; method using the filtered batch as input.
Finally, to return a result array with the same number of rows as &lt;code&gt;batch&lt;/code&gt;, the &lt;code&gt;scatter&lt;/code&gt; function is called.
This function produces a new array padded with &lt;code&gt;null&lt;/code&gt; values for any rows that didn't match the &lt;code&gt;selection&lt;/code&gt; mask.&lt;/p&gt;
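&lt;p&gt;To make the scatter step concrete, here is a minimal, illustrative sketch using plain &lt;code&gt;Vec&lt;/code&gt;s and &lt;code&gt;Option&lt;/code&gt;s in place of Arrow arrays and null buffers; the function name mirrors, but is not, the actual Arrow kernel:&lt;/p&gt;

```rust
// Illustrative model of `scatter`: rows not selected by the mask become
// None (null); selected rows consume values from the compact filtered
// result in order, restoring the original batch length.
fn scatter(selection: &[bool], compact: &[i64]) -> Vec<Option<i64>> {
    let mut compact_iter = compact.iter();
    selection
        .iter()
        .map(|&selected| {
            if selected {
                // Take the next value from the compact result
                compact_iter.next().copied()
            } else {
                // Row was filtered out, pad with null
                None
            }
        })
        .collect()
}

fn main() {
    // Mask selects rows 0 and 2; the compact result holds only their values.
    let out = scatter(&[true, false, true], &[10, 20]);
    assert_eq!(out, vec![Some(10), None, Some(20)]);
}
```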
&lt;p&gt;So how can we improve the performance of the simple evaluation strategy and use of &lt;code&gt;evaluate_selection&lt;/code&gt;?&lt;/p&gt;
&lt;h3 id="opportunity-1-early-exit"&gt;Opportunity 1: Early Exit&lt;a class="headerlink" href="#opportunity-1-early-exit" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;CASE&lt;/code&gt; evaluation loop always iterates through all branches, even when every row has already been matched.
In queries where early branches match all rows, this results in unnecessary work being done for the remaining branches.&lt;/p&gt;
&lt;h3 id="opportunity-2-optimize-repeated-filtering-scattering-and-merging"&gt;Opportunity 2: Optimize Repeated Filtering, Scattering, and Merging&lt;a class="headerlink" href="#opportunity-2-optimize-repeated-filtering-scattering-and-merging" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Each iteration performs a number of operations that are very well-optimized, but still take up a significant amount of CPU time:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Filtering&lt;/strong&gt;: &lt;code&gt;PhysicalExpr::evaluate_selection&lt;/code&gt; filters the entire &lt;code&gt;RecordBatch&lt;/code&gt; for each branch. For the &lt;code&gt;WHEN&lt;/code&gt; expression, this is done even if the selection mask was entirely empty.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scattering&lt;/strong&gt;: &lt;code&gt;PhysicalExpr::evaluate_selection&lt;/code&gt; scatters the filtered result back to the original &lt;code&gt;RecordBatch&lt;/code&gt; length.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Merging&lt;/strong&gt;: The &lt;code&gt;zip&lt;/code&gt; kernel is called once per branch to merge partial results into the output array.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each of these operations needs to allocate memory for new arrays and shuffle quite a bit of data around.&lt;/p&gt;
&lt;h3 id="opportunity-3-filter-only-necessary-columns"&gt;Opportunity 3: Filter only Necessary Columns&lt;a class="headerlink" href="#opportunity-3-filter-only-necessary-columns" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;PhysicalExpr::evaluate_selection&lt;/code&gt; method filters the entire record batch, including columns that the current branch's &lt;code&gt;WHEN&lt;/code&gt; and &lt;code&gt;THEN&lt;/code&gt; expressions don't reference.
For wide tables (many columns) with narrow expressions (few column references), this is wasteful.&lt;/p&gt;
&lt;p&gt;Suppose you have a table with 26 columns named &lt;code&gt;a&lt;/code&gt; through &lt;code&gt;z&lt;/code&gt;, and the following &lt;code&gt;CASE&lt;/code&gt; expression:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;CASE
  WHEN a &amp;gt; 1000 THEN 'large'
  WHEN a &amp;gt;= 0 THEN 'positive'
  ELSE 'negative'
END
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The implementation would filter all 26 columns even though only a single column is needed for the entire &lt;code&gt;CASE&lt;/code&gt; expression evaluation.
Again this involves a non-negligible amount of allocation and data copying.&lt;/p&gt;
&lt;h2 id="performance-optimizations"&gt;Performance Optimizations&lt;a class="headerlink" href="#performance-optimizations" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id="optimization-1-short-circuit-early-exit"&gt;Optimization 1: Short-Circuit Early Exit&lt;a class="headerlink" href="#optimization-1-short-circuit-early-exit" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The first optimization is straightforward.
As soon as we detect that all rows of the batch have been matched, we break out of the evaluation loop:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;let mut remainder_count = batch.num_rows();

for (when_expr, then_expr) in &amp;amp;self.when_then_expr {
    if remainder_count == 0 {
        break;  // All rows matched, exit early
    }

    // ... evaluate branch ...

    let when_match_count = when_value.true_count();
    remainder_count -= when_match_count;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Additionally, we avoid evaluating the &lt;code&gt;ELSE&lt;/code&gt; clause when no rows remain:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;if let Some(else_expr) = &amp;amp;self.else_expr {
    // Rows whose base value was NULL never match a WHEN clause,
    // so they also fall through to the ELSE branch
    remainder = or(&amp;amp;base_nulls, &amp;amp;remainder)?;
    if remainder.true_count() &amp;gt; 0 {
        // ... evaluate else ...
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For queries where early branches match all rows, this eliminates unnecessary branch evaluations and &lt;code&gt;ELSE&lt;/code&gt; clause processing.&lt;/p&gt;
&lt;p&gt;This optimization was implemented by Pepijn Van Eeckhoudt (&lt;a href="https://github.com/pepijnve"&gt;&lt;code&gt;@pepijnve&lt;/code&gt;&lt;/a&gt;) in &lt;a href="https://github.com/apache/datafusion/pull/17898"&gt;PR #17898&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="optimization-2-optimized-result-merging"&gt;Optimization 2: Optimized Result Merging&lt;a class="headerlink" href="#optimization-2-optimized-result-merging" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The second optimization fundamentally restructures how the results of each loop iteration are merged.
The diagram below illustrates the optimized data flow when evaluating the &lt;code&gt;CASE WHEN col = 'b' THEN 100 ELSE 200 END&lt;/code&gt; from before:&lt;/p&gt;
&lt;figure&gt;
&lt;img alt="Schematic representation of optimized evaluation loop" class="img-fluid" src="/blog/images/case/merging.svg" width="100%"/&gt;
&lt;figcaption&gt;optimized evaluation loop&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;In the reworked implementation, the &lt;code&gt;evaluate_selection&lt;/code&gt; function is no longer used.
The key insight is that we can defer all merging until the end of the evaluation loop by tracking result provenance.
This was implemented with the following changes:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Augment the input batch with a column containing row indices.&lt;/li&gt;
&lt;li&gt;Reduce the augmented batch after each loop iteration to only contain the remaining rows.&lt;/li&gt;
&lt;li&gt;Use the row index column to track which partial result array contains the value for each row.&lt;/li&gt;
&lt;li&gt;Perform a single merge operation at the end instead of a &lt;code&gt;zip&lt;/code&gt; operation after each loop iteration.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These changes make it unnecessary to &lt;code&gt;scatter&lt;/code&gt; and &lt;code&gt;zip&lt;/code&gt; results in each loop iteration.
Instead, when all rows have been matched, we then merge the partial results using &lt;a href="https://docs.rs/arrow-select/57.1.0/arrow_select/merge/fn.merge_n.html"&gt;&lt;code&gt;arrow_select::merge::merge_n&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The diagram below illustrates how &lt;code&gt;merge_n&lt;/code&gt; works for an example where three &lt;code&gt;WHEN/THEN&lt;/code&gt; branches produced results.
The first branch produced the result &lt;code&gt;A&lt;/code&gt; for row 2, the second produced &lt;code&gt;B&lt;/code&gt; for row 1, and the third produced &lt;code&gt;C&lt;/code&gt; and &lt;code&gt;D&lt;/code&gt; for rows 4 and 5.&lt;/p&gt;
&lt;figure&gt;
&lt;img alt="Schematic illustration of the merge_n algorithm" class="img-fluid" src="/blog/images/case/merge_n.svg" width="100%"/&gt;
&lt;figcaption&gt;merge_n example&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;The &lt;code&gt;merge_n&lt;/code&gt; algorithm scans through the indices array.
For each non-empty cell, it takes one value from the corresponding values array.
In the example above, we first encounter &lt;code&gt;1&lt;/code&gt;.
This takes the first element from the values array with index &lt;code&gt;1&lt;/code&gt;, resulting in &lt;code&gt;B&lt;/code&gt;.
The next cell contains &lt;code&gt;0&lt;/code&gt;, which takes &lt;code&gt;A&lt;/code&gt; from the first array.
Finally, we encounter &lt;code&gt;2&lt;/code&gt; twice.
This takes the first and second element from the last values array respectively.&lt;/p&gt;
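&lt;p&gt;The scan described above can be sketched with plain Rust collections; this is an illustrative model of the algorithm, not the actual &lt;code&gt;arrow-rs&lt;/code&gt; implementation, which operates on Arrow arrays and null buffers:&lt;/p&gt;

```rust
// Illustrative model of the merge_n idea: `indices[i]` records which
// partial-result array row i's value came from (None = no branch matched,
// so the output stays null). Each partial array is consumed front to back.
fn merge_n(indices: &[Option<usize>], partials: &[Vec<char>]) -> Vec<Option<char>> {
    // One read cursor per partial-result array
    let mut cursors = vec![0usize; partials.len()];
    indices
        .iter()
        .map(|idx| {
            idx.map(|i| {
                // Take the next unconsumed value from partial array i
                let value = partials[i][cursors[i]];
                cursors[i] += 1;
                value
            })
        })
        .collect()
}

fn main() {
    // Branch 0 produced A (row 2), branch 1 produced B (row 1),
    // branch 2 produced C and D (rows 4 and 5); row 3 matched no branch.
    let indices = [Some(1), Some(0), None, Some(2), Some(2)];
    let partials = vec![vec!['A'], vec!['B'], vec!['C', 'D']];
    let merged = merge_n(&indices, &partials);
    assert_eq!(merged, vec![Some('B'), Some('A'), None, Some('C'), Some('D')]);
}
```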
&lt;p&gt;This algorithm was initially implemented in DataFusion for the &lt;code&gt;CASE&lt;/code&gt; implementation, but in the meantime has been generalized and moved into the &lt;code&gt;arrow-rs&lt;/code&gt; crate as &lt;a href="https://docs.rs/arrow-select/57.1.0/arrow_select/merge/fn.merge_n.html"&gt;&lt;code&gt;arrow_select::merge::merge_n&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This optimization was implemented by Pepijn Van Eeckhoudt (&lt;a href="https://github.com/pepijnve"&gt;&lt;code&gt;@pepijnve&lt;/code&gt;&lt;/a&gt;) in &lt;a href="https://github.com/apache/datafusion/pull/18152"&gt;PR #18152&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="optimization-3-column-projection"&gt;Optimization 3: Column Projection&lt;a class="headerlink" href="#optimization-3-column-projection" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The third optimization addresses the "filtering unused columns" overhead through projection.&lt;/p&gt;
&lt;p&gt;Look at the following query example where the &lt;code&gt;mailing_address&lt;/code&gt; table has the columns &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;surname&lt;/code&gt;, &lt;code&gt;street&lt;/code&gt;, &lt;code&gt;number&lt;/code&gt;, &lt;code&gt;city&lt;/code&gt;, &lt;code&gt;state&lt;/code&gt;, &lt;code&gt;country&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;SELECT *, CASE WHEN country = 'USA' THEN state ELSE country END AS region
FROM mailing_address 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can see that the &lt;code&gt;CASE&lt;/code&gt; expression only references the columns &lt;code&gt;country&lt;/code&gt; and &lt;code&gt;state&lt;/code&gt;, but because all columns are being queried, projection pushdown cannot reduce the number of columns fed into the projection operator.&lt;/p&gt;
&lt;figure&gt;
&lt;img alt="Schematic illustration of CASE evaluation without projection" class="img-fluid" src="/blog/images/case/no_projection.svg" width="100%"/&gt;
&lt;figcaption&gt;CASE evaluation without projection&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;During &lt;code&gt;CASE&lt;/code&gt; evaluation, the batch must be filtered using the &lt;code&gt;WHEN&lt;/code&gt; expression to evaluate the &lt;code&gt;THEN&lt;/code&gt; expression values.
As the diagram above shows, this filtering creates a reduced copy of all columns.&lt;/p&gt;
&lt;p&gt;This unnecessary copying can be avoided by first narrowing the batch to only include the columns that are actually needed.&lt;/p&gt;
&lt;figure&gt;
&lt;img alt="Schematic illustration of CASE evaluation with projection" class="img-fluid" src="/blog/images/case/projection.svg" width="100%"/&gt;
&lt;figcaption&gt;CASE evaluation with projection&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;At first glance, this might not seem beneficial, since we're introducing an additional processing step.
Luckily, projecting a record batch only requires a shallow copy.
The column arrays themselves are not copied, and the only work that is actually done is incrementing the reference counts of the columns.&lt;/p&gt;
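&lt;p&gt;The reference-counting behaviour can be illustrated with a toy batch type built on &lt;code&gt;std::sync::Arc&lt;/code&gt;; the &lt;code&gt;Batch&lt;/code&gt; and &lt;code&gt;project&lt;/code&gt; names below are hypothetical stand-ins for the actual &lt;code&gt;arrow-rs&lt;/code&gt; types, not their real API:&lt;/p&gt;

```rust
use std::sync::Arc;

// Toy model of a record batch: columns are Arc-shared, so projecting
// just clones the Arc pointers (bumps refcounts) without copying row data.
struct Batch {
    columns: Vec<Arc<Vec<i64>>>,
}

fn project(batch: &Batch, indices: &[usize]) -> Batch {
    Batch {
        // Arc::clone copies a pointer, not the underlying column data
        columns: indices
            .iter()
            .map(|&i| Arc::clone(&batch.columns[i]))
            .collect(),
    }
}

fn main() {
    // A "wide" batch with 26 columns of 1,000 rows each
    let wide = Batch {
        columns: (0..26).map(|c| Arc::new(vec![c; 1_000])).collect(),
    };
    // Keep only the first column (`a` in the example above)
    let narrow = project(&wide, &[0]);
    // Both batches share the same underlying buffer for that column
    assert!(Arc::ptr_eq(&wide.columns[0], &narrow.columns[0]));
    assert_eq!(narrow.columns.len(), 1);
}
```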
&lt;p&gt;&lt;strong&gt;Impact&lt;/strong&gt;: For wide tables with narrow CASE expressions, this dramatically reduces filtering overhead by removing the copying of unused columns.&lt;/p&gt;
&lt;p&gt;This optimization was implemented by Pepijn Van Eeckhoudt (&lt;a href="https://github.com/pepijnve"&gt;&lt;code&gt;@pepijnve&lt;/code&gt;&lt;/a&gt;) in &lt;a href="https://github.com/apache/datafusion/pull/18329"&gt;PR #18329&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="optimization-4-eliminating-scatter-in-two-branch-case"&gt;Optimization 4: Eliminating Scatter in Two-Branch Case&lt;a class="headerlink" href="#optimization-4-eliminating-scatter-in-two-branch-case" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Some of the earlier examples in this post use expressions of the form &lt;code&gt;CASE WHEN condition THEN expr1 ELSE expr2 END&lt;/code&gt; to explain how the general evaluation loop works.
For this kind of two-branch &lt;code&gt;CASE&lt;/code&gt; expression, &lt;a href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; has a more optimized implementation that unrolls the loop.
This specialized &lt;code&gt;ExpressionOrExpression&lt;/code&gt; fast path still used &lt;code&gt;evaluate_selection()&lt;/code&gt; for both branches, which relies on &lt;code&gt;scatter&lt;/code&gt; and &lt;code&gt;zip&lt;/code&gt; to combine the results, incurring the same performance overhead as the general implementation.&lt;/p&gt;
&lt;p&gt;The revised implementation eliminates the use of &lt;code&gt;evaluate_selection&lt;/code&gt; as follows:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;// Compute the `WHEN` condition for the entire batch
let when_filter = create_filter(&amp;amp;when_value);

// Compute a compact array of `THEN` values for the matching rows
let then_batch = filter_record_batch(batch, &amp;amp;when_filter)?;
let then_value = then_expr.evaluate(&amp;amp;then_batch)?;

// Compute a compact array of `ELSE` values for the non-matching rows
let else_filter = create_filter(&amp;amp;not(&amp;amp;when_value)?);
let else_batch = filter_record_batch(batch, &amp;amp;else_filter)?;
let else_value = else_expr.evaluate(&amp;amp;else_batch)?;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This produces two compact arrays, one for the THEN values and one for the ELSE values, which are then merged with the &lt;code&gt;merge&lt;/code&gt; function.
In contrast to &lt;code&gt;zip&lt;/code&gt;, &lt;code&gt;merge&lt;/code&gt; does not require both of its value inputs to have the same length.
Instead it requires that the sum of the length of the value inputs matches the length of the mask array.&lt;/p&gt;
&lt;figure&gt;
&lt;img alt="Schematic illustration of the merge algorithm" class="img-fluid" src="/blog/images/case/merge.svg" width="100%"/&gt;
&lt;figcaption&gt;merge example&lt;/figcaption&gt;
&lt;/figure&gt;
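&lt;p&gt;A minimal sketch of this two-input merge, using plain slices instead of Arrow arrays, looks as follows (illustrative only; the real kernel is &lt;a href="https://docs.rs/arrow-select/57.1.0/arrow_select/merge/fn.merge.html"&gt;&lt;code&gt;arrow_select::merge::merge&lt;/code&gt;&lt;/a&gt;):&lt;/p&gt;

```rust
// Illustrative model of the two-input merge: the mask has one entry per
// output row; true rows consume the next THEN value, false rows the next
// ELSE value. Unlike zip, the value inputs are compact: their lengths sum
// to the mask length rather than each matching it.
fn merge(mask: &[bool], then_values: &[i64], else_values: &[i64]) -> Vec<i64> {
    assert_eq!(then_values.len() + else_values.len(), mask.len());
    let mut then_iter = then_values.iter();
    let mut else_iter = else_values.iter();
    mask.iter()
        .map(|&matched| {
            if matched {
                *then_iter.next().unwrap()
            } else {
                *else_iter.next().unwrap()
            }
        })
        .collect()
}

fn main() {
    // CASE WHEN col = 'b' THEN 100 ELSE 200 END over a 4-row batch
    // where rows 0 and 2 matched the WHEN condition.
    let out = merge(&[true, false, true, false], &[100, 100], &[200, 200]);
    assert_eq!(out, vec![100, 200, 100, 200]);
}
```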
&lt;p&gt;This eliminates unnecessary &lt;code&gt;scatter&lt;/code&gt; operations and memory allocations for one of the most common &lt;code&gt;CASE&lt;/code&gt; expression patterns.&lt;/p&gt;
&lt;p&gt;Just like &lt;code&gt;merge_n&lt;/code&gt;, this operation has been moved into &lt;code&gt;arrow-rs&lt;/code&gt; as &lt;a href="https://docs.rs/arrow-select/57.1.0/arrow_select/merge/fn.merge.html"&gt;&lt;code&gt;arrow_select::merge::merge&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This optimization was implemented by Pepijn Van Eeckhoudt (&lt;a href="https://github.com/pepijnve"&gt;&lt;code&gt;@pepijnve&lt;/code&gt;&lt;/a&gt;) in &lt;a href="https://github.com/apache/datafusion/pull/18444"&gt;PR #18444&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="optimization-5-table-lookup-of-constants"&gt;Optimization 5: Table Lookup of Constants&lt;a class="headerlink" href="#optimization-5-table-lookup-of-constants" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Up until now, we've discussed the implementations for generic &lt;code&gt;CASE&lt;/code&gt; expressions that use non-constant expressions for both &lt;code&gt;WHEN&lt;/code&gt; and &lt;code&gt;THEN&lt;/code&gt;.
Another common use of &lt;code&gt;CASE&lt;/code&gt; is to perform a mapping from one set of constants to another.
For instance, you can expand numeric constants to human-readable strings using the following &lt;code&gt;CASE&lt;/code&gt; expression:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;CASE status
  WHEN 0 THEN 'idle'
  WHEN 1 THEN 'running'
  WHEN 2 THEN 'paused'
  WHEN 3 THEN 'stopped'
  ELSE 'unknown'
END
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A final &lt;code&gt;CASE&lt;/code&gt; optimization recognizes this pattern and compiles the &lt;code&gt;CASE&lt;/code&gt; expression into a hash table.
Rather than evaluating the &lt;code&gt;WHEN&lt;/code&gt; and &lt;code&gt;THEN&lt;/code&gt; expressions, the input expression is evaluated once, and the result array is computed using a vectorized hash table lookup.
This approach entirely avoids filtering the input batch and combining partial results.
The result array is computed in a single pass over the input values, and the computation time does not grow significantly with the number of &lt;code&gt;WHEN&lt;/code&gt; branches in the &lt;code&gt;CASE&lt;/code&gt; expression.&lt;/p&gt;
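&lt;p&gt;Conceptually, the fast path behaves like the following sketch, where a plain &lt;code&gt;HashMap&lt;/code&gt; stands in for DataFusion's vectorized lookup table (the function name and signature are illustrative, not the actual implementation):&lt;/p&gt;

```rust
use std::collections::HashMap;

// Illustrative model of the constant-mapping fast path: the WHEN/THEN
// constants are compiled into a hash table once, and the input column is
// then mapped in a single pass. Per-row cost does not grow with the
// number of WHEN branches.
fn eval_case_lookup(
    input: &[i64],
    table: &HashMap<i64, &'static str>,
    else_value: &'static str,
) -> Vec<&'static str> {
    input
        .iter()
        .map(|v| *table.get(v).unwrap_or(&else_value))
        .collect()
}

fn main() {
    // CASE status WHEN 0 THEN 'idle' WHEN 1 THEN 'running'
    //             WHEN 2 THEN 'paused' WHEN 3 THEN 'stopped' ELSE 'unknown' END
    let table: HashMap<i64, &'static str> =
        [(0, "idle"), (1, "running"), (2, "paused"), (3, "stopped")].into();
    let out = eval_case_lookup(&[1, 3, 7, 0], &table, "unknown");
    assert_eq!(out, vec!["running", "stopped", "unknown", "idle"]);
}
```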
&lt;p&gt;This optimization was implemented by Raz Luvaton (&lt;a href="https://github.com/rluvaton"&gt;&lt;code&gt;@rluvaton&lt;/code&gt;&lt;/a&gt;) in &lt;a href="https://github.com/apache/datafusion/pull/18183"&gt;PR #18183&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="results"&gt;Results&lt;a class="headerlink" href="#results" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The degree to which the performance optimizations described in this post will benefit your queries is highly dependent on both your data and your queries.
To give some idea of the impact, we ran the following query on the TPC-H &lt;code&gt;orders&lt;/code&gt; table with a scale factor of 100:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;SELECT
    *,
    case o_orderstatus
        when 'O' then 'ordered'
        when 'F' then 'filled'
        when 'P' then 'pending'
        else 'other'
    end
from orders
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This query was first run with DataFusion 50.0.0 to get a baseline measurement.
The same query was then run with each optimization applied in turn.
The recorded times are presented as the blue series in the chart below.
The green series shows the time measurement for &lt;code&gt;SELECT * FROM orders&lt;/code&gt; to give an idea of the cost that adding a &lt;code&gt;CASE&lt;/code&gt; expression to a query incurs.
All measurements were made with a target partition count of &lt;code&gt;1&lt;/code&gt;.&lt;/p&gt;
&lt;figure&gt;
&lt;img alt="Performance measurements chart" class="img-fluid" src="/blog/images/case/results.png" width="100%"/&gt;
&lt;figcaption&gt;Performance measurements&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;The chart shows that the effects of the various optimizations compound up to the &lt;code&gt;project&lt;/code&gt; measurement.
Up to that point these results are applicable to any &lt;code&gt;CASE&lt;/code&gt; expression.
The final improvement in the &lt;code&gt;hash&lt;/code&gt; measurement is only applicable to simple &lt;code&gt;CASE&lt;/code&gt; expressions with constant &lt;code&gt;WHEN&lt;/code&gt; and &lt;code&gt;THEN&lt;/code&gt; expressions.&lt;/p&gt;
&lt;p&gt;The cumulative effect of these optimizations is a 63-71% reduction in CPU time spent evaluating &lt;code&gt;CASE&lt;/code&gt; expressions compared to the baseline.&lt;/p&gt;
&lt;h2 id="summary"&gt;Summary&lt;a class="headerlink" href="#summary" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Through a number of targeted optimizations, we've transformed &lt;code&gt;CASE&lt;/code&gt; expression evaluation from a simple but unoptimized implementation into a highly optimized one.
The optimizations described in this post compound: a &lt;code&gt;CASE&lt;/code&gt; expression on a wide table with multiple branches and early matches benefits from all four optimizations simultaneously.
The result is significantly reduced CPU time and memory allocation in SQL constructs that are essential for ETL-like queries.&lt;/p&gt;
&lt;h2 id="about-datafusion"&gt;About DataFusion&lt;a class="headerlink" href="#about-datafusion" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; is an extensible query engine, written in &lt;a href="https://www.rust-lang.org/"&gt;Rust&lt;/a&gt;, that uses &lt;a href="https://arrow.apache.org"&gt;Apache Arrow&lt;/a&gt; as its in-memory format. DataFusion is used by developers to create new, fast, data-centric systems such as databases, dataframe libraries,
and machine learning and streaming applications.
While &lt;a href="https://datafusion.apache.org/user-guide/introduction.html#project-goals"&gt;DataFusion’s primary design goal&lt;/a&gt; is to accelerate the creation of other data-centric systems, it provides a reasonable experience directly out of the box as a &lt;a href="https://datafusion.apache.org/user-guide/dataframe.html"&gt;dataframe library&lt;/a&gt;, &lt;a href="https://datafusion.apache.org/python/"&gt;Python library&lt;/a&gt;, and &lt;a href="https://datafusion.apache.org/user-guide/cli/"&gt;command-line SQL tool&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;DataFusion's core thesis is that, as a community, together we can build much more advanced technology than any of us as individuals or companies could build alone.
Without DataFusion, highly performant vectorized query engines would remain the domain of a few large companies and world-class research institutions.
With DataFusion, we can all build on top of a shared foundation and focus on what makes our projects unique.&lt;/p&gt;
&lt;h2 id="how-to-get-involved"&gt;How to Get Involved&lt;a class="headerlink" href="#how-to-get-involved" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;DataFusion is not a project built or driven by a single person, company, or foundation.
Rather, our community of users and contributors works together to build a shared technology that none of us could have built alone.&lt;/p&gt;
&lt;p&gt;If you are interested in joining us, we would love to have you. You can try out DataFusion on some of your own data and projects and let us know how it goes, contribute suggestions, documentation, bug reports, or a PR with documentation, tests, or code.
A list of open issues suitable for beginners is &lt;a href="https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22"&gt;here&lt;/a&gt;, and you can find out how to reach us on the &lt;a href="https://datafusion.apache.org/contributor-guide/communication.html"&gt;communication doc&lt;/a&gt;.&lt;/p&gt;</content><category term="blog"/></entry><entry><title>Using Rust async for Query Execution and Cancelling Long-Running Queries</title><link href="https://datafusion.apache.org/blog/2025/06/30/cancellation" rel="alternate"/><published>2025-06-30T00:00:00+00:00</published><updated>2025-06-30T00:00:00+00:00</updated><author><name>Pepijn Van Eeckhoudt</name></author><id>tag:datafusion.apache.org,2025-06-30:/blog/2025/06/30/cancellation</id><summary type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;style&gt;
figure {
  margin: 20px 0;
}

figure img {
  display: block;
  max-width: 80%;
  margin: auto;
}

figcaption {
  font-style: italic;
  color: #555;
  font-size: 0.9em;
  max-width: 80%;
  margin: auto;
  text-align: center;
}
&lt;/style&gt;
&lt;p&gt;Have you ever tried to cancel a query that just wouldn't stop?
In this post, we'll review how Rust's &lt;a href="https://doc.rust-lang.org/book/ch17-00-async-await.html"&gt;&lt;code&gt;async&lt;/code&gt; programming model&lt;/a&gt; works, how …&lt;/p&gt;</summary><content type="html">&lt;!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
--&gt;

&lt;style&gt;
figure {
  margin: 20px 0;
}

figure img {
  display: block;
  max-width: 80%;
  margin: auto;
}

figcaption {
  font-style: italic;
  color: #555;
  font-size: 0.9em;
  max-width: 80%;
  margin: auto;
  text-align: center;
}
&lt;/style&gt;
&lt;p&gt;Have you ever tried to cancel a query that just wouldn't stop?
In this post, we'll review how Rust's &lt;a href="https://doc.rust-lang.org/book/ch17-00-async-await.html"&gt;&lt;code&gt;async&lt;/code&gt; programming model&lt;/a&gt; works, how &lt;a href="https://datafusion.apache.org/"&gt;DataFusion&lt;/a&gt; uses that model for CPU intensive tasks, and how this is used to cancel queries.
Then we'll review some cases where queries could not be canceled in DataFusion and what the community did to resolve the problem.&lt;/p&gt;
&lt;h2 id="understanding-rusts-async-model"&gt;Understanding Rust's Async Model&lt;a class="headerlink" href="#understanding-rusts-async-model" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;DataFusion, somewhat unconventionally, &lt;a href="https://docs.rs/datafusion/latest/datafusion/#thread-scheduling-cpu--io-thread-pools-and-tokio-runtimes"&gt;uses the Rust async system and the Tokio task scheduler&lt;/a&gt; for CPU intensive processing.
To really understand the cancellation problem, you first need to be familiar with Rust's asynchronous programming model, which differs a bit from what you might be used to from other ecosystems.
Let's go over the basics again as a refresher.
If you're familiar with the ins and outs of &lt;code&gt;Future&lt;/code&gt; and &lt;code&gt;async&lt;/code&gt; you can skip this section.&lt;/p&gt;
&lt;h3 id="futures-are-inert"&gt;Futures Are Inert&lt;a class="headerlink" href="#futures-are-inert" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Rust's asynchronous programming model is built around the &lt;a href="https://doc.rust-lang.org/std/future/trait.Future.html"&gt;&lt;code&gt;Future&amp;lt;T&amp;gt;&lt;/code&gt;&lt;/a&gt; trait.
In contrast to, for instance, JavaScript's &lt;a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Promise"&gt;&lt;code&gt;Promise&lt;/code&gt;&lt;/a&gt; or Java's &lt;a href="https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/util/concurrent/Future.html"&gt;&lt;code&gt;Future&lt;/code&gt;&lt;/a&gt;, a Rust &lt;code&gt;Future&lt;/code&gt; does not necessarily represent an actively running asynchronous job.
Instead, a &lt;code&gt;Future&amp;lt;T&amp;gt;&lt;/code&gt; represents a lazy calculation that only makes progress when explicitly asked to do so.
This is done by calling the &lt;a href="https://doc.rust-lang.org/std/future/trait.Future.html#tymethod.poll"&gt;&lt;code&gt;poll&lt;/code&gt;&lt;/a&gt; method of a &lt;code&gt;Future&lt;/code&gt;.
If nobody polls a &lt;code&gt;Future&lt;/code&gt; explicitly, it is &lt;a href="https://doc.rust-lang.org/std/future/trait.Future.html#runtime-characteristics"&gt;an inert object&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Calling &lt;code&gt;Future::poll&lt;/code&gt; results in one of two options:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://doc.rust-lang.org/std/task/enum.Poll.html#variant.Pending"&gt;&lt;code&gt;Poll::Pending&lt;/code&gt;&lt;/a&gt; if the evaluation is not yet complete, most often because it needs to wait for something like I/O before it can continue&lt;/li&gt;
&lt;li&gt;&lt;a href="https://doc.rust-lang.org/std/task/enum.Poll.html#variant.Ready"&gt;&lt;code&gt;Poll::Ready&amp;lt;T&amp;gt;&lt;/code&gt;&lt;/a&gt; when it has completed and produced a value&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When a &lt;code&gt;Future&lt;/code&gt; returns &lt;code&gt;Pending&lt;/code&gt;, it saves its internal state so it can pick up where it left off the next time you poll it.
This internal state management makes Rust's &lt;code&gt;Future&lt;/code&gt;s memory-efficient and composable.
Rather than freezing the full call stack leading up to a certain point, only the state needed to resume the future has to be retained.&lt;/p&gt;
&lt;p&gt;Additionally, a &lt;code&gt;Future&lt;/code&gt; must set up the necessary signaling to notify the caller when it should call &lt;code&gt;poll&lt;/code&gt; again, to avoid a busy-waiting loop.
This is done using a &lt;a href="https://doc.rust-lang.org/std/task/struct.Waker.html"&gt;&lt;code&gt;Waker&lt;/code&gt;&lt;/a&gt; which the &lt;code&gt;Future&lt;/code&gt; receives via the &lt;code&gt;Context&lt;/code&gt; parameter of the &lt;code&gt;poll&lt;/code&gt; function. &lt;/p&gt;
&lt;p&gt;Manual implementations of &lt;code&gt;Future&lt;/code&gt; are most often little finite state machines.
Each state in the process of completing the calculation is modeled as a variant of an &lt;code&gt;enum&lt;/code&gt;.
Before a &lt;code&gt;Future&lt;/code&gt; returns &lt;code&gt;Pending&lt;/code&gt;, it bundles the data required to resume in an enum variant, stores that enum variant in itself, and then returns.
While compact and efficient, the resulting code is often quite verbose.&lt;/p&gt;
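&lt;p&gt;To make this concrete, here is a minimal hand-rolled &lt;code&gt;Future&lt;/code&gt; of this kind, using only the standard library. The &lt;code&gt;Countdown&lt;/code&gt; type and its choice of two intermediate polls are made up for illustration; &lt;code&gt;Waker::noop&lt;/code&gt; requires Rust 1.85 or later:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, Waker};

// A hand-written future that must be polled three times before it
// completes, mirroring the enum-per-state pattern described above.
enum Countdown {
    Running { remaining: u32 },
    Done,
}

impl Future for Countdown {
    type Output = u32;

    fn poll(self: Pin&amp;lt;&amp;amp;mut Self&amp;gt;, cx: &amp;amp;mut Context&amp;lt;'_&amp;gt;) -&amp;gt; Poll&amp;lt;u32&amp;gt; {
        let this = self.get_mut();
        let remaining = match this {
            Countdown::Done =&amp;gt; panic!("polled after completion"),
            Countdown::Running { remaining } =&amp;gt; *remaining,
        };
        if remaining == 0 {
            *this = Countdown::Done;
            Poll::Ready(42)
        } else {
            // Store the state needed to resume in an enum variant,
            // request another poll, and yield.
            *this = Countdown::Running { remaining: remaining - 1 };
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

fn main() {
    let mut fut = Countdown::Running { remaining: 2 };
    // A noop waker suffices here because we poll in a loop anyway.
    let mut cx = Context::from_waker(Waker::noop());
    let mut polls = 0;
    let value = loop {
        polls += 1;
        if let Poll::Ready(v) = Pin::new(&amp;amp;mut fut).poll(&amp;amp;mut cx) {
            break v;
        }
    };
    println!("ready after {polls} polls: {value}");
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each call to &lt;code&gt;poll&lt;/code&gt; either completes or stores just enough state to resume later; nothing happens between calls.&lt;/p&gt;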
&lt;p&gt;The &lt;code&gt;async&lt;/code&gt; keyword was introduced to make life easier on Rust programmers.
It provides elegant syntactic sugar for the manual state machine &lt;code&gt;Future&lt;/code&gt; approach.
When you write an &lt;code&gt;async&lt;/code&gt; function or block, the compiler transforms linear code into a state machine based &lt;code&gt;Future&lt;/code&gt; similar to the one described above for you.
Since all the state management is compiler-generated and hidden from sight, async code tends to be easier to write and more readable, while maintaining the same underlying mechanics.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;await&lt;/code&gt; keyword complements &lt;code&gt;async&lt;/code&gt; by pausing execution until a &lt;code&gt;Future&lt;/code&gt; completes.
When you &lt;code&gt;.await&lt;/code&gt; a &lt;code&gt;Future&lt;/code&gt;, you're essentially telling the compiler to generate code that:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Polls the &lt;code&gt;Future&lt;/code&gt; with the current (implicit) asynchronous context&lt;/li&gt;
&lt;li&gt;If &lt;code&gt;poll&lt;/code&gt; returns &lt;code&gt;Poll::Pending&lt;/code&gt;, saves the state of the &lt;code&gt;Future&lt;/code&gt; so that it can resume at this point, and returns &lt;code&gt;Poll::Pending&lt;/code&gt; itself&lt;/li&gt;
&lt;li&gt;If it returns &lt;code&gt;Poll::Ready(value)&lt;/code&gt;, continues execution with that value&lt;/li&gt;
&lt;/ol&gt;
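&lt;p&gt;As a sketch of the machinery that drives these compiler-generated state machines, here is a toy &lt;code&gt;block_on&lt;/code&gt; that polls a future to completion with a no-op waker. Busy-polling only works because nothing here waits on real I/O; a real executor parks the thread and relies on the &lt;code&gt;Waker&lt;/code&gt; to know when to poll again:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;use std::future::Future;
use std::pin::pin;
use std::task::{Context, Poll, Waker};

// A toy block_on: repeatedly poll the future until it is Ready.
fn block_on&amp;lt;F: Future&amp;gt;(future: F) -&amp;gt; F::Output {
    let mut future = pin!(future);
    let mut cx = Context::from_waker(Waker::noop());
    loop {
        if let Poll::Ready(value) = future.as_mut().poll(&amp;amp;mut cx) {
            return value;
        }
    }
}

async fn add(a: u32, b: u32) -&amp;gt; u32 {
    a + b
}

fn main() {
    // The async block is inert: nothing runs until block_on polls
    // the state machine the compiler generated for it.
    let fut = async { add(40, 2).await };
    println!("{}", block_on(fut));
}
&lt;/code&gt;&lt;/pre&gt;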
&lt;h3 id="from-futures-to-streams"&gt;From Futures to Streams&lt;a class="headerlink" href="#from-futures-to-streams" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The &lt;a href="https://docs.rs/futures/latest/futures/"&gt;&lt;code&gt;futures&lt;/code&gt;&lt;/a&gt; crate extends the &lt;code&gt;Future&lt;/code&gt; model with a trait named &lt;a href="https://docs.rs/futures/latest/futures/prelude/trait.Stream.html"&gt;&lt;code&gt;Stream&lt;/code&gt;&lt;/a&gt;.
&lt;code&gt;Stream&amp;lt;Item = T&amp;gt;&lt;/code&gt; represents a sequence of values that are each produced asynchronously rather than just a single value.
It's the asynchronous equivalent of &lt;code&gt;Iterator&amp;lt;Item = T&amp;gt;&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;Stream&lt;/code&gt; trait has one method named &lt;a href="https://docs.rs/futures/latest/futures/prelude/trait.Stream.html#tymethod.poll_next"&gt;&lt;code&gt;poll_next&lt;/code&gt;&lt;/a&gt; that returns:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Poll::Pending&lt;/code&gt; when the next value isn't ready yet, just like a &lt;code&gt;Future&lt;/code&gt; would&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Poll::Ready(Some(value))&lt;/code&gt; when a new value is available&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Poll::Ready(None)&lt;/code&gt; when the stream is exhausted&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Under the hood, an implementation of &lt;code&gt;Stream&lt;/code&gt; is very similar to a &lt;code&gt;Future&lt;/code&gt;.
Typically, they're also implemented as state machines, the main difference being that they produce multiple values rather than just one.
Just like &lt;code&gt;Future&lt;/code&gt;, a &lt;code&gt;Stream&lt;/code&gt; is inert unless explicitly polled.&lt;/p&gt;
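&lt;p&gt;For illustration, here is a minimal stream that behaves like an iterator over &lt;code&gt;0..limit&lt;/code&gt;. To keep the sketch dependency-free it uses a locally defined trait with the same shape as &lt;code&gt;futures::Stream&lt;/code&gt; (the real trait takes &lt;code&gt;Pin&amp;lt;&amp;amp;mut Self&amp;gt;&lt;/code&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;use std::task::{Context, Poll, Waker};

// A local stand-in with the same shape as futures::Stream, so the
// sketch compiles with the standard library alone.
trait Stream {
    type Item;
    fn poll_next(&amp;amp;mut self, cx: &amp;amp;mut Context&amp;lt;'_&amp;gt;) -&amp;gt; Poll&amp;lt;Option&amp;lt;Self::Item&amp;gt;&amp;gt;;
}

// The asynchronous counterpart of an iterator over 0..limit.
struct Counter {
    next: u32,
    limit: u32,
}

impl Stream for Counter {
    type Item = u32;

    fn poll_next(&amp;amp;mut self, _cx: &amp;amp;mut Context&amp;lt;'_&amp;gt;) -&amp;gt; Poll&amp;lt;Option&amp;lt;u32&amp;gt;&amp;gt; {
        if self.next &amp;lt; self.limit {
            self.next += 1;
            Poll::Ready(Some(self.next - 1))
        } else {
            Poll::Ready(None) // exhausted, like an Iterator returning None
        }
    }
}

fn main() {
    let mut stream = Counter { next: 0, limit: 3 };
    let mut cx = Context::from_waker(Waker::noop());
    let mut values = Vec::new();
    while let Poll::Ready(Some(v)) = stream.poll_next(&amp;amp;mut cx) {
        values.push(v);
    }
    println!("{values:?}");
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since every value here happens to be ready immediately, this particular stream never returns &lt;code&gt;Pending&lt;/code&gt;.&lt;/p&gt;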
&lt;p&gt;Now that we understand the basics of Rust's async model, let's see how DataFusion leverages these concepts to execute queries.&lt;/p&gt;
&lt;h2 id="how-datafusion-executes-queries"&gt;How DataFusion Executes Queries&lt;a class="headerlink" href="#how-datafusion-executes-queries" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;In DataFusion, the short version of how queries are executed is as follows (you can find more in-depth coverage of this in the &lt;a href="https://docs.rs/datafusion/latest/datafusion/#streaming-execution"&gt;DataFusion documentation&lt;/a&gt;):&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;First the query is compiled into a tree of &lt;a href="https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.ExecutionPlan.html"&gt;&lt;code&gt;ExecutionPlan&lt;/code&gt;&lt;/a&gt; nodes&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.ExecutionPlan.html#tymethod.execute"&gt;&lt;code&gt;ExecutionPlan::execute&lt;/code&gt;&lt;/a&gt; is called on the root of the tree. &lt;/li&gt;
&lt;li&gt;This method returns a &lt;a href="https://docs.rs/datafusion/latest/datafusion/execution/type.SendableRecordBatchStream.html"&gt;&lt;code&gt;SendableRecordBatchStream&lt;/code&gt;&lt;/a&gt; (a pinned &lt;code&gt;Box&amp;lt;dyn Stream&amp;lt;RecordBatch&amp;gt;&amp;gt;&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Stream::poll_next&lt;/code&gt; is called in a loop to get the results&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In other words, the execution of a DataFusion query boils down to polling an asynchronous stream.
Like all &lt;code&gt;Stream&lt;/code&gt; implementations, we need to explicitly poll the stream for the query to make progress. &lt;/p&gt;
&lt;p&gt;The &lt;code&gt;Stream&lt;/code&gt; we get in step 3 is actually the root of a tree of &lt;code&gt;Stream&lt;/code&gt;s that mostly mirrors the execution plan tree.
Each stream tree node processes the record batches it gets from its children.
The leaves of the tree produce record batches themselves.&lt;/p&gt;
&lt;p&gt;Query execution progresses each time you call &lt;code&gt;poll_next&lt;/code&gt; on the root stream.
This call typically cascades down the tree, with each node calling &lt;code&gt;poll_next&lt;/code&gt; on its children to get the data it needs to process.&lt;/p&gt;
&lt;p&gt;Here's where the first signs of problems start to show up: some operations (like aggregations, sorts, or certain join phases) need to process a lot of data before producing any output.
When &lt;code&gt;poll_next&lt;/code&gt; encounters one of these operations, it might require substantial work before it can return a record batch.&lt;/p&gt;
&lt;h3 id="tokio-and-cooperative-scheduling"&gt;Tokio and Cooperative Scheduling&lt;a class="headerlink" href="#tokio-and-cooperative-scheduling" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;We need to make a small detour now via Tokio's scheduler before we can get to the query cancellation problem.
DataFusion makes use of the &lt;a href="https://tokio.rs"&gt;Tokio asynchronous runtime&lt;/a&gt;, which uses a &lt;a href="https://docs.rs/tokio/latest/tokio/task/index.html#what-are-tasks"&gt;cooperative scheduling model&lt;/a&gt;.
This is fundamentally different from preemptive scheduling that you might be used to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;In &lt;strong&gt;preemptive scheduling&lt;/strong&gt;, the system can interrupt a task at any time to run something else&lt;/li&gt;
&lt;li&gt;In &lt;strong&gt;cooperative scheduling&lt;/strong&gt;, tasks must voluntarily yield control back to the scheduler&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This distinction is crucial for understanding our cancellation problem.&lt;/p&gt;
&lt;p&gt;A task in Tokio is modeled as a &lt;code&gt;Future&lt;/code&gt; which is passed to one of the task initiation functions like &lt;a href="https://docs.rs/tokio/latest/tokio/task/fn.spawn.html"&gt;&lt;code&gt;spawn&lt;/code&gt;&lt;/a&gt;.
Tokio runs the task by calling &lt;code&gt;Future::poll&lt;/code&gt; in a loop until it returns &lt;code&gt;Poll::Ready&lt;/code&gt;.
While that &lt;code&gt;Future::poll&lt;/code&gt; call is running, Tokio has no way to forcibly interrupt it.
It must cooperate by periodically yielding control, either by returning &lt;code&gt;Poll::Pending&lt;/code&gt; or &lt;code&gt;Poll::Ready&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Similarly, when you try to abort a task by calling &lt;a href="https://docs.rs/tokio/latest/tokio/task/struct.JoinHandle.html#method.abort"&gt;&lt;code&gt;JoinHandle::abort()&lt;/code&gt;&lt;/a&gt;, the Tokio runtime can't immediately force it to stop.
You're just telling Tokio: "When this task next yields control, don't call &lt;code&gt;Future::poll&lt;/code&gt; anymore."
If the task never yields, it can't be aborted.&lt;/p&gt;
&lt;h3 id="the-cancellation-problem"&gt;The Cancellation Problem&lt;a class="headerlink" href="#the-cancellation-problem" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;With all the necessary background in place, now let's look at how the DataFusion CLI tries to run and cancel a query.
The code below is a simplified version of &lt;a href="https://github.com/apache/datafusion/blob/db13dd93579945628cd81d534c032f5e6cc77967/datafusion-cli/src/exec.rs#L179-L186"&gt;what the CLI actually does&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;fn exec_query() {
    let runtime: tokio::runtime::Runtime = ...;
    let stream: SendableRecordBatchStream = ...;

    runtime.block_on(async {
        tokio::select! {
            next_batch = stream.next() =&amp;gt; ...
            _ = signal::ctrl_c() =&amp;gt; ...,
        }
    })
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;First the CLI sets up a Tokio runtime instance.
It then reads the query to execute from standard input or a file and turns it into a &lt;code&gt;Stream&lt;/code&gt;.
Then it calls &lt;code&gt;next&lt;/code&gt; on the stream, which is an &lt;code&gt;async&lt;/code&gt; wrapper for &lt;code&gt;poll_next&lt;/code&gt;.
It passes this to the &lt;a href="https://docs.rs/tokio/latest/tokio/macro.select.html"&gt;&lt;code&gt;select!&lt;/code&gt;&lt;/a&gt; macro along with a ctrl-C handler.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;select!&lt;/code&gt; macro races these two &lt;code&gt;Future&lt;/code&gt;s and completes when either one finishes.
The intent is that when you press Ctrl+C, the &lt;code&gt;signal::ctrl_c()&lt;/code&gt; &lt;code&gt;Future&lt;/code&gt; should complete.
The &lt;a href="https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.ExecutionPlan.html#cancellation--aborting-execution"&gt;stream is cancelled&lt;/a&gt; when it is dropped as it is inert by itself and nothing will be able to call &lt;code&gt;poll_next&lt;/code&gt; again.&lt;/p&gt;
&lt;p&gt;But there's a catch: &lt;code&gt;select!&lt;/code&gt; still follows cooperative scheduling rules.
It polls each &lt;code&gt;Future&lt;/code&gt; in sequence, and if the first one (our query) gets stuck in a long computation, it never gets around to polling the cancellation signal.&lt;/p&gt;
&lt;p&gt;Imagine a query that needs to calculate something intensive, like sorting billions of rows.
Unless the sorting Stream is written with care (which the one in DataFusion is), the &lt;code&gt;poll_next&lt;/code&gt; call may take several minutes or even longer without returning.
During this time, Tokio can't check if you've pressed Ctrl+C, and the query continues running despite your cancellation request.&lt;/p&gt;
&lt;h2 id="a-closer-look-at-blocking-operators"&gt;A Closer Look at Blocking Operators&lt;a class="headerlink" href="#a-closer-look-at-blocking-operators" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Let's peel back a layer of the onion and look at what's happening in a blocking &lt;code&gt;poll_next&lt;/code&gt; implementation.
Here's a drastically simplified version of a &lt;code&gt;COUNT(*)&lt;/code&gt; aggregation - something you might use in a query like &lt;code&gt;SELECT COUNT(*) FROM table&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;struct BlockingStream {
    // the input: an inner stream that is wrapped
    stream: SendableRecordBatchStream,
    count: usize,
    finished: bool,
}

impl Stream for BlockingStream {
    type Item = Result&amp;lt;RecordBatch&amp;gt;;
    fn poll_next(mut self: Pin&amp;lt;&amp;amp;mut Self&amp;gt;, cx: &amp;amp;mut Context&amp;lt;'_&amp;gt;) -&amp;gt; Poll&amp;lt;Option&amp;lt;Self::Item&amp;gt;&amp;gt; {
        if self.finished {
            // return None if we're finished
            return Poll::Ready(None);
        }

        loop {
            // poll the input stream to get the next batch if ready
            match ready!(self.stream.poll_next_unpin(cx)) {
                // increment the counter if we got a batch
                Some(Ok(batch)) =&amp;gt; self.count += batch.num_rows(),
                // on end-of-stream, create a record batch for the counter
                None =&amp;gt; {
                    self.finished = true;
                    return Poll::Ready(Some(Ok(create_record_batch(self.count))));
                }
                // pass on any errors verbatim
                Some(Err(e)) =&amp;gt; return Poll::Ready(Some(Err(e))),
            }
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;How does this code work? Let's break it down step by step:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;1. Initial check&lt;/strong&gt;: We first check if we've already finished processing. If so, we return &lt;code&gt;Ready(None)&lt;/code&gt; to signal the end of our stream:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;if self.finished {
    return Poll::Ready(None);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;2. Processing loop&lt;/strong&gt;: If we're not done yet, we enter a loop to process incoming batches from our input stream:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;loop {
    match ready!(self.stream.poll_next_unpin(cx)) {
        // Handle different cases...
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;a href="https://doc.rust-lang.org/beta/std/task/macro.ready.html"&gt;&lt;code&gt;ready!&lt;/code&gt;&lt;/a&gt; macro checks if the input stream returned &lt;code&gt;Pending&lt;/code&gt; and if so, immediately returns &lt;code&gt;Pending&lt;/code&gt; from our function as well.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;3. Processing data&lt;/strong&gt;: For each batch we receive, we simply add its row count to our running total:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;Some(Ok(batch)) =&amp;gt; self.count += batch.num_rows(),
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;4. End of input&lt;/strong&gt;: When the child stream is exhausted (returns &lt;code&gt;None&lt;/code&gt;), we calculate our final result and convert it into a record batch (omitted for brevity):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;None =&amp;gt; {
    self.finished = true;
    return Poll::Ready(Some(Ok(create_record_batch(self.count))));
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;5. Error handling&lt;/strong&gt;: If we encounter an error, we pass it along immediately:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;Some(Err(e)) =&amp;gt; return Poll::Ready(Some(Err(e))),
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This code looks perfectly reasonable at first glance.
But there's a subtle issue lurking here: what happens if the input stream &lt;em&gt;always&lt;/em&gt; returns &lt;code&gt;Ready&lt;/code&gt; and never returns &lt;code&gt;Pending&lt;/code&gt;?&lt;/p&gt;
&lt;p&gt;In that case, the processing loop will keep running without returning &lt;code&gt;Poll::Pending&lt;/code&gt; and thus never yield control back to Tokio's scheduler.
This means we could be stuck in a single &lt;code&gt;poll_next&lt;/code&gt; call for quite some time - exactly the scenario that prevents query cancellation from working!&lt;/p&gt;
&lt;p&gt;So how do we solve this problem? Let's explore some strategies to ensure our operators yield control periodically.&lt;/p&gt;
&lt;h2 id="unblocking-operators"&gt;Unblocking Operators&lt;a class="headerlink" href="#unblocking-operators" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Now let's look at how we can ensure we return &lt;code&gt;Pending&lt;/code&gt; every now and then.&lt;/p&gt;
&lt;h3 id="independent-cooperative-operators"&gt;Independent Cooperative Operators&lt;a class="headerlink" href="#independent-cooperative-operators" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;One simple way to return &lt;code&gt;Pending&lt;/code&gt; periodically is to use a poll counter.
We do the exact same thing as before, but on each call we increment the counter.
When the counter reaches a threshold, we reset it and return &lt;code&gt;Pending&lt;/code&gt;.
The following example ensures we produce at most 128 batches before yielding.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;struct CountingSourceStream {
   counter: usize
}

impl Stream for CountingSourceStream {
    type Item = Result&amp;lt;RecordBatch&amp;gt;;

    fn poll_next(mut self: Pin&amp;lt;&amp;amp;mut Self&amp;gt;, cx: &amp;amp;mut Context&amp;lt;'_&amp;gt;) -&amp;gt; Poll&amp;lt;Option&amp;lt;Self::Item&amp;gt;&amp;gt; {
        if self.counter &amp;gt;= 128 {
            self.counter = 0;
            cx.waker().wake_by_ref();
            return Poll::Pending;
        }

        self.counter += 1;
        let batch = ...;
        Poll::Ready(Some(Ok(batch)))
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If &lt;code&gt;CountingSourceStream&lt;/code&gt; were the input for the &lt;code&gt;BlockingStream&lt;/code&gt; example above,
&lt;code&gt;BlockingStream&lt;/code&gt; would receive a &lt;code&gt;Pending&lt;/code&gt; periodically, causing it to yield too.
Can we really solve the cancellation problem simply by yielding periodically in source streams?&lt;/p&gt;
&lt;p&gt;Unfortunately, no.
Let's look at what happens when we start combining operators in more complex configurations.
Suppose we create a plan like this.&lt;/p&gt;
&lt;figure&gt;
&lt;img alt="Diagram showing a plan that merges two branches that return Pending at different intervals." src="/blog/images/task-cancellation/merge_plan.png"/&gt;
&lt;figcaption&gt;A plan that merges two branches by alternating between them.&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Each &lt;code&gt;CountingSource&lt;/code&gt; produces a &lt;code&gt;Pending&lt;/code&gt; every 128 batches.
The &lt;code&gt;Filter&lt;/code&gt; is a stream that drops one batch out of every 50 record batches.
&lt;code&gt;Merge&lt;/code&gt; is a simple combining operator that uses &lt;code&gt;futures::stream::select&lt;/code&gt; to combine two streams.&lt;/p&gt;
&lt;p&gt;When we set this stream in motion, the merge operator will poll the left and right branch in a round-robin fashion.
The sources will each emit &lt;code&gt;Pending&lt;/code&gt; every 128 batches, but since the &lt;code&gt;Filter&lt;/code&gt; drops batches, they arrive out-of-phase at the merge operator.
As a consequence, the merge operator will always have the opportunity to poll the other stream when one returns &lt;code&gt;Pending&lt;/code&gt;.
The &lt;code&gt;Merge&lt;/code&gt; stream is thus always ready, even though the sources are yielding.
If we use &lt;code&gt;Merge&lt;/code&gt; as the input to our aggregating operator we're right back where we started.&lt;/p&gt;
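&lt;p&gt;This starvation effect can be reproduced with a small self-contained model (toy types, not the actual DataFusion operators): each source yields on every 128th poll, and the merge falls back to the other side whenever one side is &lt;code&gt;Pending&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;use std::task::Poll;

// A toy model of the plan above (not DataFusion code): each source
// stands in for a CountingSource + Filter chain and yields Pending
// on every 128th poll.
struct Source {
    polls: u64,
}

impl Source {
    fn poll_next(&amp;amp;mut self) -&amp;gt; Poll&amp;lt;u64&amp;gt; {
        self.polls += 1;
        if self.polls % 128 == 0 {
            Poll::Pending
        } else {
            Poll::Ready(self.polls)
        }
    }
}

// A merge in the spirit of futures::stream::select: when one side
// is Pending, try the other side before giving up.
fn poll_merge(left: &amp;amp;mut Source, right: &amp;amp;mut Source) -&amp;gt; Poll&amp;lt;u64&amp;gt; {
    match left.poll_next() {
        Poll::Pending =&amp;gt; right.poll_next(),
        ready =&amp;gt; ready,
    }
}

fn main() {
    let mut left = Source { polls: 0 };
    let mut right = Source { polls: 0 };
    let mut merge_pendings = 0;
    for _ in 0..10_000 {
        if poll_merge(&amp;amp;mut left, &amp;amp;mut right).is_pending() {
            merge_pendings += 1;
        }
    }
    // Each source yields on its own, but the merged stream does not.
    println!("merge returned Pending {merge_pendings} times");
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this model the left source yields 78 times over 10,000 polls, yet the merged stream never returns &lt;code&gt;Pending&lt;/code&gt; at all; an aggregation sitting on top of it would spin without ever giving Tokio a chance to intervene.&lt;/p&gt;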
&lt;h3 id="coordinated-cooperation"&gt;Coordinated Cooperation&lt;a class="headerlink" href="#coordinated-cooperation" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Wouldn't it be great if we could get all the operators to coordinate amongst each other?
When one of them determines that it's time to yield, all the other operators agree and start returning &lt;code&gt;Pending&lt;/code&gt; as well.
That way our task would be coaxed towards yielding even if it tried to poll many different operators.&lt;/p&gt;
&lt;p&gt;Luckily(?), the &lt;a href="https://tokio.rs/blog/2020-04-preemption"&gt;developers of Tokio ran into the exact same problem&lt;/a&gt; described above when network servers were under heavy load and came up with a solution.
Back in 2020, Tokio 0.2.14 introduced a per-task operation budget.
Rather than having individual counters littered throughout the code, the Tokio runtime itself manages a per-task counter which is decremented by Tokio resources.
When the counter hits zero, all resources start returning &lt;code&gt;Pending&lt;/code&gt;.
The task will then yield, after which the Tokio runtime resets the counter.&lt;/p&gt;
&lt;p&gt;To illustrate what this process looks like, let's have a look at the execution of the following query &lt;code&gt;Stream&lt;/code&gt; tree when polled in a Tokio task.&lt;/p&gt;
&lt;figure&gt;
&lt;img alt="Diagram showing a plan with a task, AggregateExec, MergeStream and Two sources." src="/blog/images/task-cancellation/tokio_budget_plan.png"/&gt;
&lt;figcaption&gt;Query plan for aggregating a sorted stream from two sources. Each source reads a stream of `RecordBatch`es, which are then merged into a single Stream by the `MergeStream` operator which is then aggregated by the `AggregateExec` operator. Arrows represent the data flow direction&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;If we assume a task budget of 1 unit, each time Tokio schedules the task the following sequence of function calls results.&lt;/p&gt;
&lt;figure&gt;
&lt;img alt="Sequence diagram showing how the tokio task budget is used and reset." class="img-fluid" src="/blog/images/task-cancellation/tokio_budget.png"/&gt;
&lt;figcaption&gt;Tokio task budget system, assuming the task budget is set to 1, for the plan above.&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;The aggregation stream would try to poll the merge stream in a loop.
The first iteration of the loop consumes the single unit of budget and returns &lt;code&gt;Ready&lt;/code&gt;.
The second iteration polls the merge stream again, which now tries to poll the second scan stream.
Since there is no budget remaining, &lt;code&gt;Pending&lt;/code&gt; is returned.
The merge stream may now try to poll the first source stream again, but since the budget is still depleted, &lt;code&gt;Pending&lt;/code&gt; is returned there as well.
The merge stream then has no option but to return &lt;code&gt;Pending&lt;/code&gt; itself, causing the aggregation to break out of its loop.
The &lt;code&gt;Pending&lt;/code&gt; result bubbles all the way up to the Tokio runtime, at which point the runtime regains control.
When the runtime reschedules the task, it resets the budget and calls &lt;code&gt;poll&lt;/code&gt; on the task &lt;code&gt;Future&lt;/code&gt; again for another round of progress.&lt;/p&gt;
&lt;p&gt;The key mechanism that makes this work well is the single task budget that's shared amongst all the scan streams.
Once the budget is depleted, no stream can make any further progress without first returning control to Tokio.
This causes every avenue the task has for making progress to return &lt;code&gt;Pending&lt;/code&gt;, nudging the task towards yielding control.&lt;/p&gt;
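&lt;p&gt;This shared-budget behaviour can be sketched in plain, non-async Rust. The &lt;code&gt;Budget&lt;/code&gt;, &lt;code&gt;Scan&lt;/code&gt;, and &lt;code&gt;Merge&lt;/code&gt; types below are illustrative stand-ins rather than Tokio's or DataFusion's actual API; they only model how a single budget, shared by every stream in the task, forces everything to report &lt;code&gt;Pending&lt;/code&gt; once it is depleted.&lt;/p&gt;

```rust
// Simplified Poll, mirroring std::task::Poll for this sketch.
#[derive(Debug, PartialEq)]
enum Poll {
    Ready(i64),
    Pending,
}

// The per-task operation budget, shared by every resource in the task.
struct Budget {
    remaining: u32,
}

impl Budget {
    // Decrement the counter; once it hits zero, consumption fails and
    // every resource starts returning Pending.
    fn try_consume(&mut self) -> bool {
        if self.remaining == 0 {
            false
        } else {
            self.remaining -= 1;
            true
        }
    }
}

// A scan stream that always has a value ready, budget permitting.
struct Scan {
    next: i64,
}

impl Scan {
    fn poll(&mut self, budget: &mut Budget) -> Poll {
        if !budget.try_consume() {
            return Poll::Pending; // budget depleted: force a yield
        }
        let v = self.next;
        self.next += 1;
        Poll::Ready(v)
    }
}

// Merge polls its inputs in turn; when all of them report Pending it has
// no option but to report Pending itself.
struct Merge {
    inputs: [Scan; 2],
}

impl Merge {
    fn poll(&mut self, budget: &mut Budget) -> Poll {
        for scan in self.inputs.iter_mut() {
            if let Poll::Ready(v) = scan.poll(budget) {
                return Poll::Ready(v);
            }
        }
        Poll::Pending
    }
}

fn main() {
    let mut budget = Budget { remaining: 1 };
    let mut merge = Merge {
        inputs: [Scan { next: 0 }, Scan { next: 100 }],
    };

    // First poll: the single budget unit is spent on the first scan.
    assert_eq!(merge.poll(&mut budget), Poll::Ready(0));

    // Second poll: both scans see a depleted budget, so the merge must
    // return Pending as well and the task would yield to the runtime here.
    assert_eq!(merge.poll(&mut budget), Poll::Pending);

    // The "runtime" resets the budget when it reschedules the task.
    budget.remaining = 1;
    assert_eq!(merge.poll(&mut budget), Poll::Ready(1));
}
```

&lt;p&gt;The essential point is that &lt;code&gt;Budget&lt;/code&gt; is shared: the second scan cannot sidestep a depletion caused by the first one.&lt;/p&gt;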
&lt;p&gt;As it turns out, DataFusion was already using this mechanism implicitly.
Every exchange-like operator (such as &lt;code&gt;RepartitionExec&lt;/code&gt;) internally makes use of a Tokio multiple-producer, single-consumer &lt;a href="https://tokio.rs/tokio/tutorial/channels"&gt;channel&lt;/a&gt;.
When calling &lt;code&gt;Receiver::recv&lt;/code&gt; for one of these channels, a unit of Tokio task budget is consumed.
As a consequence, query plans that made use of exchange-like operators were
already mostly cancelable.
The plan cancellation bug only showed up in plan fragments without such operators, for instance when running on a single core.&lt;/p&gt;
&lt;p&gt;Now let's see how we can explicitly implement this budget-based approach in our own operators.&lt;/p&gt;
&lt;h3 id="depleting-the-tokio-budget"&gt;Depleting The Tokio Budget&lt;a class="headerlink" href="#depleting-the-tokio-budget" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Let's revisit our original &lt;code&gt;BlockingStream&lt;/code&gt; and adapt it to use Tokio's budget system.&lt;/p&gt;
&lt;p&gt;The examples given here make use of functions from the Tokio &lt;code&gt;coop&lt;/code&gt; module that are still internal at the time of writing.
&lt;a href="https://github.com/tokio-rs/tokio/pull/7405"&gt;PR #7405&lt;/a&gt; on the Tokio project will make these accessible for external use.
The current DataFusion code emulates these functions as well as possible using &lt;a href="https://docs.rs/tokio/latest/tokio/task/coop/fn.has_budget_remaining.html"&gt;&lt;code&gt;has_budget_remaining&lt;/code&gt;&lt;/a&gt; and &lt;a href="https://docs.rs/tokio/latest/tokio/task/coop/fn.consume_budget.html"&gt;&lt;code&gt;consume_budget&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;struct BudgetSourceStream {
    // fields elided
}

impl Stream for BudgetSourceStream {
    type Item = Result&amp;lt;RecordBatch&amp;gt;;

    fn poll_next(mut self: Pin&amp;lt;&amp;amp;mut Self&amp;gt;, cx: &amp;amp;mut Context&amp;lt;'_&amp;gt;) -&amp;gt; Poll&amp;lt;Option&amp;lt;Self::Item&amp;gt;&amp;gt; {
        // 1. Try to consume a unit of budget; returns Pending when depleted
        let coop = ready!(tokio::task::coop::poll_proceed(cx));
        // 2. Try to produce a record batch
        let batch: Poll&amp;lt;Option&amp;lt;Self::Item&amp;gt;&amp;gt; = ...;
        // 3. Commit the budget consumption if we made progress
        if batch.is_ready() {
            coop.made_progress();
        }
        batch
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;Stream&lt;/code&gt; now goes through the following steps:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;1. Try to consume budget&lt;/strong&gt;: the first thing the operator does is use &lt;code&gt;poll_proceed&lt;/code&gt; to try to consume a unit of budget.
If the budget is depleted, this function will return &lt;code&gt;Pending&lt;/code&gt;.
Otherwise, we consumed one budget unit and we can continue.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;let coop = ready!(tokio::task::coop::poll_proceed(cx));
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;2. Try to do some work&lt;/strong&gt;: next we try to produce a record batch.
That might not be possible if we're reading from some asynchronous resource that's not ready.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;let batch: Poll&amp;lt;Option&amp;lt;Self::Item&amp;gt;&amp;gt; = ...;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;3. Commit the budget consumption&lt;/strong&gt;: finally, if we did produce a batch, we need to tell Tokio that we were able to make progress.
That's done by calling the &lt;code&gt;made_progress&lt;/code&gt; method on the value &lt;code&gt;poll_proceed&lt;/code&gt; returned.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-rust"&gt;if batch.is_ready() {
   coop.made_progress();
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You might be wondering why the call to &lt;code&gt;made_progress&lt;/code&gt; is necessary.
This clever construct makes it easier to manage the budget.
The value returned by &lt;code&gt;poll_proceed&lt;/code&gt; will restore the budget to its original value when it is dropped, unless &lt;code&gt;made_progress&lt;/code&gt; is called.
This ensures that if we exit early from our &lt;code&gt;poll_next&lt;/code&gt; implementation by returning &lt;code&gt;Pending&lt;/code&gt;, the budget we had consumed becomes available again.
The task that invoked &lt;code&gt;poll_next&lt;/code&gt; can then spend that budget trying to make progress on some other &lt;code&gt;Stream&lt;/code&gt; (or any other resource).&lt;/p&gt;
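&lt;p&gt;The restore-on-drop behaviour can be mimicked with a small RAII guard in plain Rust. This &lt;code&gt;BudgetGuard&lt;/code&gt; is a hypothetical stand-in for the value &lt;code&gt;poll_proceed&lt;/code&gt; returns, not Tokio's actual type; it only shows the pattern of refunding the budget unit when the guard is dropped without &lt;code&gt;made_progress&lt;/code&gt; having been called.&lt;/p&gt;

```rust
use std::cell::Cell;

// Task-local budget; a Cell lets the guard restore it on drop.
struct Budget {
    remaining: Cell<u32>,
}

// Guard handed out when a unit of budget is consumed. Unless
// made_progress() is called, dropping it refunds the unit.
struct BudgetGuard<'a> {
    budget: &'a Budget,
    progressed: bool,
}

impl Budget {
    // Mirrors the shape of poll_proceed: None means "depleted, return Pending".
    fn poll_proceed(&self) -> Option<BudgetGuard<'_>> {
        let remaining = self.remaining.get();
        if remaining == 0 {
            None
        } else {
            self.remaining.set(remaining - 1);
            Some(BudgetGuard { budget: self, progressed: false })
        }
    }
}

impl BudgetGuard<'_> {
    // Commit the consumption: the unit will not be refunded on drop.
    fn made_progress(mut self) {
        self.progressed = true;
    }
}

impl Drop for BudgetGuard<'_> {
    fn drop(&mut self) {
        if !self.progressed {
            // Dropped without progress: refund the budget unit.
            self.budget.remaining.set(self.budget.remaining.get() + 1);
        }
    }
}

fn main() {
    let budget = Budget { remaining: Cell::new(1) };

    // Consume a unit but exit early without progress: refunded on drop.
    {
        let _guard = budget.poll_proceed().unwrap();
        // returning Pending here would drop the guard without made_progress()
    }
    assert_eq!(budget.remaining.get(), 1);

    // Consume a unit and commit it: the budget stays decremented.
    let guard = budget.poll_proceed().unwrap();
    guard.made_progress();
    assert_eq!(budget.remaining.get(), 0);
}
```

&lt;p&gt;Tying the refund to &lt;code&gt;Drop&lt;/code&gt; means every early-return path, including &lt;code&gt;?&lt;/code&gt; and &lt;code&gt;ready!&lt;/code&gt;, restores the budget without any extra bookkeeping.&lt;/p&gt;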
&lt;h2 id="automatic-cooperation-for-all-operators"&gt;Automatic Cooperation For All Operators&lt;a class="headerlink" href="#automatic-cooperation-for-all-operators" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;DataFusion 49.0.0 integrates the Tokio task-budget-based fix into all built-in source operators.
This ensures that, going forward, most queries will automatically be cancelable.
See &lt;a href="https://github.com/apache/datafusion/pull/16398"&gt;the PR&lt;/a&gt; for more details.&lt;/p&gt;
&lt;p&gt;The design includes:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;A new &lt;code&gt;ExecutionPlan&lt;/code&gt; property that indicates whether an operator participates in cooperative scheduling.&lt;/li&gt;
&lt;li&gt;A new &lt;code&gt;EnsureCooperative&lt;/code&gt; optimizer rule to inspect query plans and insert &lt;code&gt;CooperativeExec&lt;/code&gt; nodes as needed to ensure custom source operators also participate.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These two changes combined make it very unlikely that you'll encounter a query that refuses to stop, even with custom operators.
For situations where the automatic mechanisms are still not sufficient, there's a new &lt;code&gt;datafusion::physical_plan::coop&lt;/code&gt; module
with utility functions that make it easy to adopt cooperative scheduling in your custom operators as well.&lt;/p&gt;
&lt;h2 id="acknowledgments"&gt;Acknowledgments&lt;a class="headerlink" href="#acknowledgments" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Thank you to &lt;a href="https://datadobi.com/"&gt;Datadobi&lt;/a&gt; for sponsoring the development of this feature and to
the DataFusion community contributors including &lt;a href="https://github.com/zhuqi-lucas"&gt;Qi Zhu&lt;/a&gt; and &lt;a href="https://github.com/ozankabak"&gt;Mehmet Ozan
Kabak&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="about-datafusion"&gt;About DataFusion&lt;a class="headerlink" href="#about-datafusion" title="Permanent link"&gt;¶&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; is an extensible query engine toolkit, written
in Rust, that uses &lt;a href="https://arrow.apache.org/"&gt;Apache Arrow&lt;/a&gt; as its in-memory format. DataFusion and
similar technology are part of the next generation “Deconstructed Database”
architectures, where new systems are built on a foundation of fast, modular
components, rather than as a single tightly integrated system.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://datafusion.apache.org/contributor-guide/communication.html"&gt;DataFusion community&lt;/a&gt; is always looking for new contributors to help
improve the project. If you are interested in learning more about how query
execution works, help document or improve the DataFusion codebase, or just try
it out, we would love for you to join us.&lt;/p&gt;</content><category term="blog"/></entry></feed>