Apache DataFusion 45.0.0 Released
Posted on: Thu 20 February 2025 by pmc
Introduction
We are very proud to announce DataFusion 45.0.0. This blog post highlights some of the many major improvements since we released DataFusion 40.0.0 and previews what the community is thinking about for the next 6 months. It has been an exciting period of development for DataFusion!
Apache DataFusion is an extensible query engine, written in Rust, that uses Apache Arrow as its in-memory format. Developers use DataFusion to create new, fast, data-centric systems such as databases, dataframe libraries, machine learning applications and streaming applications. While DataFusion’s primary design goal is to accelerate the creation of other data-centric systems, it also offers a reasonable out-of-the-box experience as a dataframe library, Python library and command line SQL tool.
DataFusion's core thesis is that as a community, together we can build much more advanced technology than any of us as individuals or companies could do alone. Without DataFusion, highly performant vectorized query engines would remain the domain of a few large companies and world-class research institutions. With DataFusion, we can all build on top of a shared foundation, and focus on what makes our projects unique.
Community Growth 📈
In the last 6 months, between 40.0.0 and 45.0.0, our community has continued to grow in new and exciting ways.
- We added several PMC members and new committers: @jayzhan211 and @jonahgao joined the PMC, @2010YOUY01, @rachelint, @findpi, @iffyio, @goldmedal, @Weijun-H, @Michael-J-Ward and @korowa joined as committers. See the mailing list for more details.
- In the core DataFusion repo alone we reviewed and accepted almost 1600 PRs from 206 different committers, created over 1100 issues and closed 751 of them 🚀. All changes are listed in the detailed changelogs.
- DataFusion focused meetups happened in multiple cities around the world: Hangzhou, Belgrade, New York, Seattle, Chicago, Boston and Amsterdam as well as a Rust NYC meetup in NYC focused on DataFusion.
DataFusion has put in an application to be part of Google Summer of Code with a number of ideas for projects with mentors already selected. Additionally, some ideas on how to make DataFusion an ideal selection for university database projects such as the CMU database classes have been put forward.
In addition, DataFusion has been appearing publicly more and more, both online and offline. Here are some highlights:
- A demonstration of how uwheel is integrated into DataFusion
- Integrating StringView into DataFusion - part 1 and part 2
- Building streams with DataFusion
- Caching in DataFusion: Don't read twice
- Parquet pruning in DataFusion: Read no more than you need
- DataFusion is one of The 10 coolest open source software tools
- Building databases over a weekend
Improved Performance 🚀
DataFusion hit a milestone in its development with the 43.0.0 release by becoming the fastest single node engine for querying Apache Parquet files in the ClickBench benchmark. A lot of work went into making this happen! While other engines have subsequently gotten faster, displacing DataFusion from the top spot, DataFusion remains near the top and we are planning more improvements.
Figure 1: ClickBench performance improved over 33% between DataFusion 33 (released Nov. 2023) and DataFusion 45 (released Feb. 2025).
In the past 6 months, the integration of the new Arrow StringView was completed and enabled by default. StringView significantly improves performance for workloads that scan, filter and group by variable length string and binary data. The improvement is especially pronounced for Parquet files due to upstream work in the parquet reader. Kudos to @XiangpengHong, @AriesDevil, @PsiACE, @Weijun-H, @a10y, and @RinChanNOWWW for driving this project.
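To give a rough intuition for why StringView helps with filtering and grouping, here is a minimal sketch of the view layout described in the Arrow format: each string is represented by a fixed-size view that either holds the whole string inline (12 bytes or fewer) or holds the length, a 4-byte prefix, and a buffer index plus offset. The `View` enum and the `make_view`, `resolve` and `quick_eq` helpers are hypothetical names invented for this illustration; they are not the arrow-rs or DataFusion API, and the real format packs each view into exactly 16 bytes rather than using a Rust enum.

```rust
// Simplified model of an Arrow string view. NOT the arrow-rs API:
// all names here are hypothetical and exist only for illustration.
const INLINE_LEN: usize = 12;

#[derive(Debug, Clone, Copy, PartialEq)]
enum View {
    // Strings of at most 12 bytes are stored entirely inside the view,
    // zero-padded on the right.
    Inline { len: u32, data: [u8; INLINE_LEN] },
    // Longer strings keep their first 4 bytes in the view and point
    // into a separate data buffer for the full contents.
    Ref { len: u32, prefix: [u8; 4], buffer_index: u32, offset: u32 },
}

// Builds a view for `s`, appending long strings to the last data buffer.
fn make_view(s: &str, buffers: &mut Vec<Vec<u8>>) -> View {
    let bytes = s.as_bytes();
    let len = bytes.len() as u32;
    if bytes.len() <= INLINE_LEN {
        let mut data = [0u8; INLINE_LEN];
        data[..bytes.len()].copy_from_slice(bytes);
        View::Inline { len, data }
    } else {
        if buffers.is_empty() {
            buffers.push(Vec::new());
        }
        let buffer_index = (buffers.len() - 1) as u32;
        let buf = buffers.last_mut().unwrap();
        let offset = buf.len() as u32;
        buf.extend_from_slice(bytes);
        let mut prefix = [0u8; 4];
        prefix.copy_from_slice(&bytes[..4]);
        View::Ref { len, prefix, buffer_index, offset }
    }
}

// Returns the full bytes of the string a view refers to.
fn resolve<'a>(v: &'a View, buffers: &'a [Vec<u8>]) -> &'a [u8] {
    match v {
        View::Inline { len, data } => &data[..*len as usize],
        View::Ref { len, buffer_index, offset, .. } => {
            let start = *offset as usize;
            &buffers[*buffer_index as usize][start..start + *len as usize]
        }
    }
}

// Compares two views without touching the data buffers when possible:
// Some(true)/Some(false) if the views alone decide equality,
// None if the full strings must be read and compared.
fn quick_eq(a: &View, b: &View) -> Option<bool> {
    match (a, b) {
        (View::Inline { .. }, View::Inline { .. }) => Some(a == b),
        (
            View::Ref { len: l1, prefix: p1, .. },
            View::Ref { len: l2, prefix: p2, .. },
        ) => {
            if l1 != l2 || p1 != p2 {
                Some(false)
            } else {
                None
            }
        }
        // One string is at most 12 bytes and the other is longer,
        // so the lengths differ and the strings cannot be equal.
        _ => Some(false),
    }
}
```

In this model, comparing two short strings, or two long strings that differ in length or in their first four bytes, never reads the data buffers. Only long strings with matching length and prefix fall back to `resolve` for a full comparison, which suggests why filters and group-by keys over string columns can benefit.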
Improved Quality 📋
DataFusion continues to improve in overall quality. In addition to ongoing bug fixes, one of the most exciting improvements in the last 6 months was the addition of the SQLite sqllogictest suite thanks to @Omega359. These tests run over 5 million SQL statements on every push to the main branch.
@wiedld added support for explicitly checking logical plan invariants, which can help catch implicit changes that might cause problems during upgrades.
We have also started other quality initiatives to make it easier to use DataFusion based on GlareDB's experience along with more extensive prerelease testing.
Improved Documentation 📚
We continue to improve the documentation to make it easier to get started using DataFusion. During the last 6 months, two projects were initiated to migrate the function documentation away from strictly static markdown files. First, @Omega359 worked to allow function documentation to be generated from code, and @jonathanc-n and others helped with the migration. Then @comphead led a project to create a doc macro that makes writing function documentation even easier. A special thanks to @Chen-Yuan-Lai for migrating many functions to the new syntax.
Additionally, the examples were refactored and cleaned up to improve their usefulness.
New Features ✨
There are too many new features in the last 6 months to list them all, but here are some highlights:
Functions
- Uniform Window Functions: BuiltInWindowFunctions was removed and all now use UDFs (@jcsherin)
- Uniform Aggregate Functions: BuiltInAggregateFunctions was removed and all now use UDFs
- As mentioned above function documentation was extracted from the markdown files
- Some new functions and SQL support were added, including 'show functions', 'to_local_time', 'regexp_count', 'map_extract', 'array_distance', 'array_any_value', 'greatest', 'least', 'arrays_overlap'
FFI
- Foreign Function Interface work has started. This should allow for using table providers across languages and versions of DataFusion. This is especially pertinent for integration with delta-rs and other table formats.
Materialized Views
- @suremarc has added a materialized view implementation in datafusion-contrib 🚀
Substrait
- A lot of work was put into improving and enhancing substrait support (@Blizzara, @westonpace, @tokoko, @vbarua, @LatrecheYasser, @notfilippo and others)
Looking Ahead: The Next Six Months
One of the long term goals of @alamb, DataFusion's PMC chair, has been to have 1000 DataFusion based projects. This may be the year that happens!
The community has been discussing what we will work on in the next six months. Some major initiatives are likely to be:
- Performance: A number of items have been identified as areas that could use additional work
- Memory usage: Tracking and improving memory usage, statistics and spilling to disk
- Google Summer of Code (GSOC): If DataFusion is selected as a project, we will start accepting and supporting student projects
- FFI: Extending the FFI implementation to support all types of UDFs and SessionContext
- Spark Functions: A proposal has been made to add a crate covering Spark-compatible built-in functions
How to Get Involved
DataFusion is not a project built or driven by a single person, company, or foundation. Rather, our community of users and contributors works together to build a shared technology that none of us could have built alone.
If you are interested in joining us we would love to have you. You can try out DataFusion on some of your own data and projects and let us know how it goes, contribute suggestions, documentation, bug reports, or a PR with documentation, tests or code. A list of open issues suitable for beginners is here and you can find how to reach us on the communication doc.
Copyright 2025, The Apache Software Foundation, Licensed under the Apache License, Version 2.0.
Apache® and the Apache feather logo are trademarks of The Apache Software Foundation.