Apache DataFusion Blog - blog category

Articles in the blog category

Optimizing SQL CASE Expression Evaluation

Mon 02 February 2026
By Pepijn Van Eeckhoudt

SQL's CASE expression is one of the few explicit conditional evaluation constructs the language provides. It allows you to control which expression from a …
Apache DataFusion Comet 0.13.0 Release

Fri 30 January 2026
By pmc

The Apache DataFusion PMC is pleased to announce version 0.13.0 of the Comet subproject.

Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for improved performance and efficiency without requiring any code changes.

This release covers approximately eight weeks of development …
Apache DataFusion 52.0.0 Released

Mon 12 January 2026
By pmc

We are proud to announce the release of DataFusion 52.0.0. This post highlights some of the major improvements since DataFusion 51.0.0. The complete list of changes is available in the changelog. Thanks to the 121 contributors for making this release possible.

Performance Improvements 🚀

We continue to …
Extending SQL in DataFusion: from ->> to TABLESAMPLE

Mon 12 January 2026
By Geoffrey Claude (Datadog)

If you embed DataFusion in your product, your users will eventually run SQL that DataFusion does not recognize. Not because the query is unreasonable, but because SQL in practice includes many dialects and system-specific statements.

Suppose you store data as Parquet files on S3 and want users to attach an …
Optimizing Repartitions in DataFusion: How I Went From Database Noob to Core Contribution

Mon 15 December 2025
By Gene Bordegaray

Databases are some of the most complex yet interesting pieces of software. They are amazing pieces of abstraction: query engines optimize and execute complex plans, storage engines provide sophisticated infrastructure as the backbone of the system, while intricate file formats lay the groundwork for particular workloads. All of this is …
Apache DataFusion Comet 0.12.0 Release

Thu 04 December 2025
By pmc

The Apache DataFusion PMC is pleased to announce version 0.12.0 of the Comet subproject.

Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for improved performance and efficiency without requiring any code changes.

This release covers approximately four weeks of development …
Apache DataFusion 51.0.0 Released

Tue 25 November 2025
By pmc

Introduction

We are proud to announce the release of DataFusion 51.0.0. This post highlights some of the major improvements since DataFusion 50.0.0. The complete list of changes is available in the changelog. Thanks to the 128 contributors for making this release possible.

Performance Improvements 🚀

We continue …
Apache DataFusion Comet 0.11.0 Release

Tue 21 October 2025
By pmc

The Apache DataFusion PMC is pleased to announce version 0.11.0 of the Comet subproject.

Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for improved performance and efficiency without requiring any code changes.

This release covers approximately five weeks of development …
Apache DataFusion 50.0.0 Released

Mon 29 September 2025
By pmc

Introduction

We are proud to announce the release of DataFusion 50.0.0. This blog post highlights some of the major improvements since the release of DataFusion 49.0.0. The complete list of changes is available in the changelog. Thanks to numerous contributors for making this release possible!

Performance …
Implementing User Defined Types and Custom Metadata in DataFusion

Sun 21 September 2025
By Tim Saucer (rerun.io), Dewey Dunnington (Wherobots), Andrew Lamb (InfluxData)

Apache DataFusion significantly improves support for user defined types and metadata. The user defined function APIs let users access metadata on the input columns to functions and produce metadata in the output.

User defined types == extension types

DataFusion directly uses Apache Arrow's DataTypes as its type system. This has …
Apache DataFusion Comet 0.10.0 Release

Tue 16 September 2025
By pmc

The Apache DataFusion PMC is pleased to announce version 0.10.0 of the Comet subproject.

Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for improved performance and efficiency without requiring any code changes.

This release covers approximately ten weeks of development …
Dynamic Filters: Passing Information Between Operators During Execution for 25x Faster Queries

Wed 10 September 2025
By Adrian Garcia Badaracco (Pydantic), Andrew Lamb (InfluxData)

This blog post introduces the query engine optimization techniques called TopK and dynamic filters. We describe the motivating use case, how these optimizations work, and how we implemented them with the Apache DataFusion community to improve performance by an order of magnitude for some query patterns.

Motivation and Results

The …
Using External Indexes, Metadata Stores, Catalogs and Caches to Accelerate Queries on Apache Parquet

Fri 15 August 2025
By Andrew Lamb (InfluxData)

It is a common misconception that Apache Parquet requires (slow) reparsing of metadata and is limited to indexing structures provided by the format. In fact, caching parsed metadata and using custom external indexes along with Parquet's hierarchical data organization can significantly speed up query processing.

In this blog, I describe …
Apache DataFusion 49.0.0 Released

Mon 28 July 2025
By pmc

Introduction

We are proud to announce the release of DataFusion 49.0.0. This blog post highlights some of the major improvements since the release of DataFusion 48.0.0. The complete list of changes is available in the changelog.

Performance Improvements 🚀

DataFusion continues to focus on enhancing performance, as …
Apache DataFusion 48.0.0 Released

Wed 16 July 2025
By PMC

We’re excited to announce the release of Apache DataFusion 48.0.0! As always, this version packs in a wide range of improvements and fixes. You can find the complete details in the full changelog. We’ll highlight the most important changes below and guide you through upgrading.

Breaking …
Embedding User-Defined Indexes in Apache Parquet Files

Mon 14 July 2025
By Qi Zhu (Cloudera), Jigao Luo (Systems Group at TU Darmstadt), and Andrew Lamb (InfluxData)

It’s a common misconception that Apache Parquet files are limited to basic Min/Max/Null Count statistics and Bloom filters, and that adding more advanced indexes requires changing the specification or creating a new file format. In fact, footer metadata and offset-based addressing already provide everything needed to embed …
Apache DataFusion 47.0.0 Released

Fri 11 July 2025
By PMC

We’re excited to announce the release of Apache DataFusion 47.0.0! This new version represents a significant milestone for the project, packing in a wide range of improvements and fixes. You can find the complete details in the full changelog. We’ll highlight the most important changes below …
Apache DataFusion Comet 0.9.0 Release

Tue 01 July 2025
By pmc

The Apache DataFusion PMC is pleased to announce version 0.9.0 of the Comet subproject.

Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for improved performance and efficiency without requiring any code changes.

This release covers approximately ten weeks of development …
Using Rust async for Query Execution and Cancelling Long-Running Queries

Mon 30 June 2025
By Pepijn Van Eeckhoudt

Have you ever tried to cancel a query that just wouldn't stop? In this post, we'll review how Rust's async programming model works, how …
Optimizing SQL (and DataFrames) in DataFusion, Part 1: Query Optimization Overview

Sun 15 June 2025
By alamb, akurmustafa
Note: this blog was originally published on the InfluxData blog

Introduction

Sometimes Query Optimizers are seen as a sort of black magic, “the most challenging problem in computer science,” according to Father Pavlo, or some behind-the-scenes player. We believe this perception is because:
1. One must implement the rest of a …
Optimizing SQL (and DataFrames) in DataFusion, Part 2: Optimizers in Apache DataFusion

Sun 15 June 2025
By alamb, akurmustafa

Note, this blog was originally published on the InfluxData blog.

In the first part of this post, we discussed what a Query Optimizer is, what role it plays, and described how industrial optimizers are organized. In this second post, we describe various optimizations that are found in Apache DataFusion and …
Apache DataFusion Comet 0.8.0 Release

Tue 06 May 2025
By pmc

The Apache DataFusion PMC is pleased to announce version 0.8.0 of the Comet subproject.

Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for improved performance and efficiency without requiring any code changes.

This release covers approximately six weeks of development …
User defined Window Functions in DataFusion

Sat 19 April 2025
By Aditya Singh Rathore, Andrew Lamb

Window functions are a powerful feature in SQL, allowing for complex analytical computations over a subset of data. However, efficiently implementing them, especially sliding windows, can be quite challenging. With Apache DataFusion's user-defined window functions, developers can easily take advantage of all the effort put into DataFusion's implementation.

In …
tpchgen-rs World’s fastest open source TPC-H data generator, written in Rust

Thu 10 April 2025
By Andrew Lamb, Achraf B, and Sean Smith

TLDR: TPC-H SF=100 in 1min using tpchgen-rs vs 30min+ with dbgen.

3 members of the Apache DataFusion community used Rust and open source development to build tpchgen-rs, a fully open TPC-H data generator over …
Apache DataFusion Python 46.0.0 Released

Sun 30 March 2025
By timsaucer

We are happy to announce that datafusion-python 46.0.0 has been released. This release brings in all of the new features of the core DataFusion 46.0.0 library. Since the last blog post for datafusion-python 43.1.0, a large number of improvements have been made that can …
Apache DataFusion 46.0.0 Released

Mon 24 March 2025
By Oznur Hanci and Berkay Sahin on behalf of the PMC

We’re excited to announce the release of Apache DataFusion 46.0.0! This new version represents a significant milestone for the project, packing in a wide range of improvements and fixes. You can find the complete details in the full changelog. We’ll highlight the most important changes below …
Efficient Filter Pushdown in Parquet

Fri 21 March 2025
By Xiangpeng Hao

Editor's Note: This blog was first published on Xiangpeng Hao's blog. Thanks to InfluxData for sponsoring this work as part of his PhD funding.

In the previous post …
Apache DataFusion Comet 0.7.0 Release

Thu 20 March 2025
By pmc

The Apache DataFusion PMC is pleased to announce version 0.7.0 of the Comet subproject.

Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for improved performance and efficiency without requiring any code changes.

Comet runs on commodity hardware and aims to …
Parquet Pruning in DataFusion: Read Only What Matters

Thu 20 March 2025
By Xiangpeng Hao

Editor's Note: This blog was first published on Xiangpeng Hao's blog. Thanks to InfluxData for sponsoring this work as part of his PhD funding.

Apache Parquet has become the industry standard for storing columnar data, and reading Parquet efficiently -- especially from remote storage -- is crucial for query performance.

Apache DataFusion …
Using Ordering for Better Plans in Apache DataFusion

Tue 11 March 2025
By Mustafa Akur, Andrew Lamb

Introduction

In this blog post, we explain when an ordering requirement of an operator is satisfied by its input data. This analysis is essential for order-based optimizations and is often more complex than one might initially think.

Ordering Requirement for an operator describes how the input data to that operator …
Apache DataFusion 45.0.0 Released

Thu 20 February 2025
By pmc

Introduction

We are very proud to announce DataFusion 45.0.0. This blog highlights some of the many major improvements since we released DataFusion 40.0.0 and a preview of what the community is thinking about in the next 6 months. It has been an exciting period of development …
Apache DataFusion Comet 0.6.0 Release

Mon 17 February 2025
By pmc

The Apache DataFusion PMC is pleased to announce version 0.6.0 of the Comet subproject.

Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for improved performance and efficiency without requiring any code changes.

Comet runs on commodity hardware and aims to …
Apache DataFusion Ballista 43.0.0 Released

Sun 02 February 2025
By milenkovicm

We are pleased to announce version 43.0.0 of the DataFusion Ballista. Ballista allows existing DataFusion applications to be scaled out on a cluster for use cases that are not practical to run on a single node.

Highlights of this release

Seamless Integration with DataFusion

The primary objective of …
Apache DataFusion Comet 0.5.0 Release

Fri 17 January 2025
By pmc

The Apache DataFusion PMC is pleased to announce version 0.5.0 of the Comet subproject.

Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for improved performance and efficiency without requiring any code changes.

Comet runs on commodity hardware and aims to …
Apache DataFusion Python 43.1.0 Released

Sat 14 December 2024
By timsaucer

We are happy to announce that datafusion-python 43.1.0 has been released. This release brings in all of the new features of the core DataFusion 43.0.0 library. Since the last blog post for datafusion-python 40.1.0, a large number of improvements have been made that can …
Apache DataFusion Comet 0.4.0 Release

Wed 20 November 2024
By pmc

The Apache DataFusion PMC is pleased to announce version 0.4.0 of the Comet subproject.

Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for improved performance and efficiency without requiring any code changes.

Comet runs on commodity hardware and aims to …
Comparing approaches to User Defined Functions in Apache DataFusion using Python

Tue 19 November 2024
By timsaucer

Personal Context

For a few months now I’ve been working with Apache DataFusion, a fast query engine written in Rust. From my experience the language that nearly all data scientists are working in is Python. In general, data scientists often use Pandas for in-memory tasks and PySpark for larger …
Apache DataFusion is now the fastest single node engine for querying Apache Parquet files

Mon 18 November 2024
By Andrew Lamb, Staff Engineer at InfluxData

I am extremely excited to announce that Apache DataFusion is the fastest engine for querying Apache Parquet files in ClickBench. It is faster than DuckDB, chDB and ClickHouse using the same hardware. It also marks the first time a Rust-based engine holds the top spot, which has previously been …
Apache DataFusion Comet 0.3.0 Release

Fri 27 September 2024
By pmc

The Apache DataFusion PMC is pleased to announce version 0.3.0 of the Comet subproject.

Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for improved performance and efficiency without requiring any code changes.

Comet runs on commodity hardware and aims to …
Using StringView / German Style Strings to Make Queries Faster: Part 1- Reading Parquet

Fri 13 September 2024
By Xiangpeng Hao, Andrew Lamb

Editor's Note: This is the first of a two-part blog series that was first published on the InfluxData blog. Thanks to InfluxData for sponsoring this work as Xiangpeng Hao's summer intern project.

This blog describes our experience implementing StringView in the Rust implementation of Apache Arrow, and integrating …
Using StringView / German Style Strings to make Queries Faster: Part 2 - String Operations

Fri 13 September 2024
By Xiangpeng Hao, Andrew Lamb

Editor's Note: This blog series was first published on the InfluxData blog. Thanks to InfluxData for sponsoring this work as Xiangpeng Hao's summer intern project.

In the first post, we discussed the nuances required to accelerate Parquet loading using StringViewArray by reusing buffers and reducing copies. In this second …
Apache DataFusion Comet 0.2.0 Release

Wed 28 August 2024
By pmc

The Apache DataFusion PMC is pleased to announce version 0.2.0 of the Comet subproject.

Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for improved performance and efficiency without requiring any code changes.

Comet runs on commodity hardware and aims to …
Apache DataFusion Python 40.1.0 Released, Significant usability updates

Tue 20 August 2024
By timsaucer

Introduction

We are happy to announce that DataFusion in Python 40.1.0 has been released. In addition to bringing in all of the new features of the core DataFusion 40.0.0 package, this release contains significant updates to the user interface and documentation. We listened to the python …
Apache DataFusion 40.0.0 Released

Wed 24 July 2024
By pmc

Introduction

We are proud to announce DataFusion 40.0.0. This blog highlights some of the many major improvements since we released DataFusion 34.0.0 and a preview of what the community is thinking about in the next 6 months. We are hoping to make more regular blog posts …
Apache DataFusion Comet 0.1.0 Release

Sat 20 July 2024
By pmc

The Apache DataFusion PMC is pleased to announce the first official source release of the Comet subproject.

Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for improved performance and efficiency without requiring any code changes.

Comet runs on commodity hardware and aims …
Announcing Apache Arrow DataFusion is now Apache DataFusion

Tue 07 May 2024
By pmc

Introduction

TLDR; Apache Arrow DataFusion --> Apache DataFusion

The Arrow PMC and newly created DataFusion PMC are happy to announce that as of April 16, 2024 the Apache Arrow DataFusion subproject is now a top level Apache Software Foundation project.

Background

Apache DataFusion is a fast, extensible query engine for building …
Announcing Apache Arrow DataFusion Comet

Wed 06 March 2024
By pmc

Introduction

The Apache Arrow PMC is pleased to announce the donation of the Comet project, a native Spark SQL Accelerator built on Apache Arrow DataFusion.

Comet is an Apache Spark plugin that uses Apache Arrow DataFusion to accelerate Spark workloads. It is designed as a drop-in replacement for Spark's JVM …
Apache Arrow DataFusion 34.0.0 Released, Looking Forward to 2024

Fri 19 January 2024
By pmc

Introduction

We recently released DataFusion 34.0.0. This blog highlights some of the major improvements since we released DataFusion 26.0.0 (spoiler alert: there are many) and a preview of where the community plans to focus in the next 6 months.

Apache Arrow DataFusion is an extensible query …
Aggregating Millions of Groups Fast in Apache Arrow DataFusion 28.0.0

Sat 05 August 2023
By alamb, Dandandan, tustvold

Aggregating Millions of Groups Fast in Apache Arrow DataFusion

Andrew Lamb, Daniël Heres, Raphael Taylor-Davies

Note: this article was originally published on the InfluxData Blog

TLDR

Grouped aggregations are a core part of any analytic tool, creating understandable summaries of huge data volumes. Apache Arrow DataFusion’s parallel aggregation capability …
Apache Arrow DataFusion 26.0.0

Sat 24 June 2023
By pmc

It has been a whirlwind 6 months of DataFusion development since our last update: the community has grown, many features have been added, performance improved and we are discussing branching out to our own top level Apache Project.

Background

Apache Arrow DataFusion is an extensible query engine and database toolkit …
Apache Arrow DataFusion 16.0.0 Project Update

Thu 19 January 2023
By pmc

Introduction

DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format. It is targeted primarily at developers creating data intensive analytics, and offers mature SQL support, a DataFrame API, and many extension points.

Systems based on DataFusion perform very well in benchmarks …
Apache Arrow Ballista 0.9.0 Release

Fri 28 October 2022
By pmc
Introduction

Ballista is an Arrow-native distributed SQL query engine implemented in Rust.

Ballista 0.9.0 is now available and is the most significant release since the project was donated to Apache Arrow in 2021.

This release represents 4 weeks of work, with 66 commits from 14 contributors:
```
    22  Andy …
```
Apache Arrow DataFusion 13.0.0 Project Update

Tue 25 October 2022
By pmc

Introduction

Apache Arrow DataFusion 13.0.0 is released, and this blog contains an update on the project for the 5 months since our last update in May 2022.

DataFusion is an extensible and embeddable query engine, written in Rust, used to create modern, fast and efficient data pipelines, ETL …
Apache Arrow DataFusion 8.0.0 Release

Mon 16 May 2022
By pmc

Introduction

DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format.

When you want to extend your Rust project with SQL support, a DataFrame API, or the ability to read and process Parquet, JSON, Avro or CSV data, DataFusion is definitely worth …
Introducing Apache Arrow DataFusion Contrib

Mon 21 March 2022
By pmc

Introduction

Apache Arrow DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format.

When you want to extend your Rust project with SQL support, a DataFrame API, or the ability to read and process Parquet, JSON, Avro or CSV data, DataFusion is …
Apache Arrow DataFusion 7.0.0 Release

Mon 28 February 2022
By pmc

Introduction

DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format.

When you want to extend your Rust project with SQL support, a DataFrame API, or the ability to read and process Parquet, JSON, Avro or CSV data, DataFusion is definitely worth …
Apache Arrow DataFusion 6.0.0 Release

Fri 19 November 2021
By pmc

Introduction

DataFusion is an embedded query engine which leverages the unique features of Rust and Apache Arrow to provide a system that is high performance, easy to connect, easy to embed, and high quality.

The Apache Arrow team is pleased to announce the DataFusion 6.0.0 release. This covers …
Apache Arrow Ballista 0.5.0 Release

Wed 18 August 2021
By pmc
Ballista extends DataFusion to provide support for distributed queries. This is the first release of Ballista since the project was donated to the Apache Arrow project and includes 80 commits from 11 contributors.
```
git shortlog -sn 4.0.0..5.0.0 ballista/rust/client ballista/rust/core ballista/rust …
```
Apache Arrow DataFusion 5.0.0 Release

Wed 18 August 2021
By pmc
The Apache Arrow team is pleased to announce the DataFusion 5.0.0 release. This covers 4 months of development work and includes 211 commits from the following 31 distinct contributors.
```
$ git shortlog -sn 4.0.0..5.0.0 datafusion datafusion-cli datafusion-examples
    61  Jiayu Liu
    47  Andrew Lamb
    27 …
```
Ballista: A Distributed Scheduler for Apache Arrow

Mon 12 April 2021
By agrove

We are excited to announce that Ballista has been donated to the Apache Arrow project.

Ballista is a distributed compute platform primarily implemented in Rust, and powered by Apache Arrow. It is built on an architecture that allows other programming languages (such as Python, C++, and Java) to be supported …
DataFusion: A Rust-native Query Engine for Apache Arrow

Mon 04 February 2019
By agrove

We are excited to announce that DataFusion has been donated to the Apache Arrow project. DataFusion is an in-memory query engine for the Rust implementation of Apache Arrow.

Although DataFusion was started two years ago, it was recently re-implemented to be Arrow-native and currently has limited capabilities but does support …
