Apache Arrow DataFusion 16.0.0 Project Update
Posted on: Thu 19 January 2023 by pmc
Introduction
DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format. It is targeted primarily at developers creating data intensive analytics, and offers mature SQL support, a DataFrame API, and many extension points.
Systems based on DataFusion perform very well in benchmarks, especially considering they operate directly on Parquet files rather than first loading the data into a specialized format. Some recent highlights include ClickBench and the Cloudfuse.io standalone query engines page.
DataFusion is also part of a longer term trend, articulated clearly by Andy Pavlo in his 2022 Databases Retrospective. Database frameworks are proliferating, and it is likely that all OLAP DBMSs and other data heavy applications, such as machine learning, will require a vectorized, highly performant query engine in the next 5 years to remain relevant. The only practical way to make such technology so widely available without many millions of dollars of investment is through open source engines such as DataFusion or Velox.
The rest of this post describes the improvements made to DataFusion over the last three months and some hints of where we are heading.
Community Growth
We again saw significant growth in the DataFusion community since our last update. There are some interesting metrics on OSSRank.
The DataFusion 16.0.0 release consists of 543 PRs from 73 distinct contributors, not including all the work that goes into dependencies such as arrow, parquet, and object_store, which much of the same community helps support. Thank you all for your help!
Several new systems based on DataFusion were recently added:
Performance 🚀
Performance and efficiency are core values for DataFusion. While there is still a gap between DataFusion and the best of breed, tightly integrated systems such as DuckDB and Polars, DataFusion is closing the gap quickly. Performance highlights from the last three months:
- Up to 30% Faster Sorting and Merging using the new Row Format
- Advanced predicate pushdown, directly on parquet, directly from object storage, enabling sub millisecond filtering.
- 70% faster IN expressions evaluation (#4057)
- Sort and partition aware optimizations (#3969 and #4691)
- Filter selectivity analysis (#3868)
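One building block behind filtering directly on Parquet is skipping data whose statistics show it cannot match the predicate. The following is a minimal sketch in plain Python, not DataFusion's Rust API: it assumes each row group carries min/max statistics for the filtered column, and the names RowGroupStats and prune_row_groups are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Hypothetical per-row-group statistics for a single column."""
    index: int
    min_value: int
    max_value: int


def prune_row_groups(stats, lower, upper):
    """Return the indexes of row groups that may hold values in [lower, upper].

    A row group is kept when its [min, max] range overlaps the filter range;
    every other row group can be skipped without reading its data.
    """
    return [
        s.index
        for s in stats
        if s.max_value >= lower and s.min_value <= upper
    ]


stats = [
    RowGroupStats(index=0, min_value=1, max_value=100),
    RowGroupStats(index=1, min_value=101, max_value=200),
    RowGroupStats(index=2, min_value=201, max_value=300),
]

# Filter: WHERE x BETWEEN 150 AND 160
survivors = prune_row_groups(stats, 150, 160)
print(survivors)
```

Only the surviving row groups need to be fetched from object storage, which is where most of the savings come from in this simplified picture.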
Runtime Resource Limits
Previously, DataFusion could potentially use unbounded amounts of memory for certain queries that included Sorts, Grouping or Joins.
In version 16.0.0, it is possible to limit DataFusion's memory usage for Sorting and Grouping. We are looking for help adding similar limiting for Joins as well as expanding our algorithms to optionally spill to secondary storage. See #3941 for more detail.
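To illustrate the idea of a runtime memory limit, here is a minimal sketch in plain Python. It is not DataFusion's implementation or API: the MemoryBudget class, the group_count function, and the fixed bytes_per_group estimate are all assumptions made for the example. The operator asks the budget for memory before it grows its state, and the query fails cleanly once the limit would be exceeded.

```python
class MemoryLimitExceeded(Exception):
    """Raised when an operator asks for more memory than the budget allows."""


class MemoryBudget:
    """Hypothetical shared budget that operators reserve memory from."""

    def __init__(self, limit_bytes):
        self.limit_bytes = limit_bytes
        self.used_bytes = 0

    def try_grow(self, nbytes):
        # Refuse the reservation instead of letting usage grow without bound.
        if self.used_bytes + nbytes > self.limit_bytes:
            raise MemoryLimitExceeded(
                f"requested {nbytes} bytes, "
                f"{self.limit_bytes - self.used_bytes} available"
            )
        self.used_bytes += nbytes

    def shrink(self, nbytes):
        self.used_bytes = max(0, self.used_bytes - nbytes)


def group_count(values, budget, bytes_per_group=64):
    """COUNT(*) ... GROUP BY value, reserving memory for each new group."""
    counts = {}
    for value in values:
        if value not in counts:
            budget.try_grow(bytes_per_group)  # assumed fixed cost per group
            counts[value] = 0
        counts[value] += 1
    return counts


budget = MemoryBudget(limit_bytes=128)
print(group_count(["a", "b", "a"], budget))
```

With a 128 byte budget and an assumed 64 bytes per group, two distinct groups fit and a third raises MemoryLimitExceeded. Spilling to secondary storage would be the natural alternative to failing at that point.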
SQL Window Functions
SQL window functions are useful for a variety of analyses, and DataFusion's support for them expanded significantly:
- Custom window frames such as ... OVER (ORDER BY ... RANGE BETWEEN 0.2 PRECEDING AND 0.2 FOLLOWING)
- Unbounded window frames such as ... OVER (ORDER BY ... RANGE UNBOUNDED PRECEDING)
- Support for the NTILE window function (#4676)
- Support for GROUPS mode (#4155)
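The semantics of two of the features above can be sketched in a few lines of plain Python. This is an illustration of what the SQL means, not how DataFusion evaluates it. It assumes rows already sorted by the ORDER BY key, no NULLs, and a single partition, and the function names are hypothetical.

```python
def range_frame_sum(rows, preceding, following):
    """SUM(value) OVER (ORDER BY key RANGE BETWEEN p PRECEDING AND f FOLLOWING).

    rows is a list of (key, value) pairs sorted by key. For each row the frame
    holds every row whose key lies in [key - preceding, key + following].
    This is a quadratic scan, kept simple on purpose.
    """
    result = []
    for key, _ in rows:
        low, high = key - preceding, key + following
        result.append(sum(v for k, v in rows if low <= k <= high))
    return result


def ntile(num_rows, buckets):
    """NTILE(buckets) for num_rows ordered rows.

    Rows are spread as evenly as possible; when the split is uneven, the
    earlier buckets receive one extra row each.
    """
    base, extra = divmod(num_rows, buckets)
    numbers = []
    for bucket in range(1, buckets + 1):
        size = base + (1 if bucket <= extra else 0)
        numbers.extend([bucket] * size)
    return numbers


rows = [(1.0, 10), (1.1, 20), (1.25, 30), (2.0, 40)]
print(range_frame_sum(rows, preceding=0.2, following=0.2))
print(ntile(7, 3))
```

Note that a RANGE frame is defined by distance in the ordering key, so the number of rows in each frame varies, whereas a ROWS frame always counts physical rows.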
Improved Joins
Joins are often the most complicated operations to handle well in analytics systems, and DataFusion 16.0.0 offers significant improvements such as:
- Cost based optimizer (CBO) automatically reorders join evaluations, selects algorithms (Merge / Hash), and picks the build side based on available statistics and join type (INNER, LEFT, etc) (#4219)
- Fast non column=column equijoins such as JOIN ON a.x + 5 = b.y
- Better performance on non-equijoins (#4562)
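The first two items can be pictured with a small hash join sketch in plain Python. It is an illustration under stated assumptions, not DataFusion's optimizer: the only statistic used is the row count, rows are dictionaries, and hash_join and join_on_expression are hypothetical names. The key point is that a condition such as a.x + 5 = b.y is still an equality between two expressions, so each side can be hashed on its own expression.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Inner hash join: hash the build side once, then stream the probe side."""
    table = defaultdict(list)
    for row in build_rows:
        table[build_key(row)].append(row)
    for row in probe_rows:
        for match in table.get(probe_key(row), []):
            yield match, row


def join_on_expression(a_rows, b_rows):
    """INNER JOIN ... ON a.x + 5 = b.y, building on the smaller input."""

    def a_key(row):
        return row["x"] + 5

    def b_key(row):
        return row["y"]

    # Assumed statistic: exact row counts. The smaller side becomes the build side.
    if len(a_rows) <= len(b_rows):
        return [(a, b) for a, b in hash_join(a_rows, b_rows, a_key, b_key)]
    return [(a, b) for b, a in hash_join(b_rows, a_rows, b_key, a_key)]


a_rows = [{"x": 1}, {"x": 2}]
b_rows = [{"y": 6}, {"y": 7}, {"y": 7}, {"y": 9}]
for a, b in join_on_expression(a_rows, b_rows):
    print(a, b)
```

Building on the smaller input keeps the hash table small, which is the usual motivation for choosing the build side from statistics.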
Streaming Execution
One emerging use case for DataFusion is as a foundation for streaming-first data platforms. An important prerequisite is support for incremental execution for queries that can be computed incrementally.
With this release, DataFusion now supports the following streaming features:
- Data ingestion from infinite files such as FIFOs (#4694),
- Detection of pipeline-breaking queries in streaming use cases (#4694),
- Automatic input swapping for joins so that the probe side is a data stream (#4694),
- Intelligent elision of pipeline-breaking sort operations whenever possible (#4691),
- Incremental execution for more types of queries; e.g. queries involving finite window frames (#4777).
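To show why a finite window frame can be executed incrementally, here is a minimal sketch in plain Python using a generator over an unbounded input. It is not DataFusion code, and sliding_sum is a hypothetical name. It computes the equivalent of SUM(v) OVER (ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) while holding only the rows in the current frame.

```python
from collections import deque
from itertools import count, islice


def sliding_sum(stream, preceding):
    """Emit SUM over the current row and the `preceding` rows before it.

    Each output is produced as soon as its input row arrives, and memory use
    is bounded by the frame size, so the input may be infinite.
    """
    window = deque()
    total = 0
    for value in stream:
        window.append(value)
        total += value
        if len(window) > preceding + 1:
            total -= window.popleft()  # drop the row that left the frame
        yield total


# count(1) is an infinite stream 1, 2, 3, ...; take the first five results.
first_five = list(islice(sliding_sum(count(1), preceding=2), 5))
print(first_five)
```

A full sort, by contrast, cannot emit its first row until it has seen its last input, which is what makes it pipeline-breaking on an infinite source.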
These are major steps forward, and we plan even more improvements over the next few releases.
Better Support for Distributed Catalogs
DataFusion 16.0.0 has enhanced support for asynchronous catalogs (#4607). This better supports distributed metadata stores such as Delta.io and Apache Iceberg, which require asynchronous I/O during planning to access remote catalogs. Previously, DataFusion required synchronous access to all relevant catalog information.
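The difference can be sketched with Python's asyncio. This is only an analogy for the planning flow, not DataFusion's catalog API: RemoteCatalog, table_schema, and resolve_tables are hypothetical names, and asyncio.sleep(0) stands in for a remote metadata request.

```python
import asyncio


class RemoteCatalog:
    """Hypothetical catalog whose lookups require network I/O."""

    def __init__(self, tables):
        self._tables = tables

    async def table_schema(self, name):
        await asyncio.sleep(0)  # placeholder for a remote metadata request
        return self._tables[name]


async def resolve_tables(catalog, names):
    """Fetch the schemas a query refers to before planning continues."""
    schemas = await asyncio.gather(
        *(catalog.table_schema(name) for name in names)
    )
    return dict(zip(names, schemas))


catalog = RemoteCatalog({
    "orders": ["id", "amount"],
    "users": ["id", "name"],
})
resolved = asyncio.run(resolve_tables(catalog, ["orders", "users"]))
print(resolved)
```

Because the lookups are awaited rather than blocking, several remote requests can be in flight at once while the planner waits for table metadata.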
Additional SQL Support
SQL support continues to improve, including some of these highlights:
- Add TPC-DS query planning regression tests #4719
- Support for PREPARE statement #4490
- Automatic coercions between Date and Timestamp #4726
- Support type coercion for timestamp and utf8 #4312
- Full support for time32 and time64 literal values (ScalarValue) #4156
- New functions, including uuid() #4041, current_time #4054, current_date #4022
- Compressed CSV/JSON support #3642
The community has also invested in new sqllogic based tests to keep improving DataFusion's quality with less effort.
Plan Serialization and Substrait
DataFusion now supports serialization of physical plans, using a custom protocol buffers format. In addition, we are adding initial support for Substrait, a Cross-Language Serialization for Relational Algebra.
How to Get Involved
Kudos to everyone in the community who contributed ideas, discussions, bug reports, documentation and code. It is exciting to be building something so cool together!
If you are interested in contributing to DataFusion, we would love to have you join us. You can try out DataFusion on some of your own data and projects and let us know how it goes or contribute a PR with documentation, tests or code. A list of open issues suitable for beginners is here.
Check out our Communication Doc on more ways to engage with the community.
Appendix: Contributor Shoutout
Here is a list of people who have contributed PRs to this project over the last three releases, derived from git shortlog -sn 13.0.0..16.0.0 .
Thank you all!
113 Andrew Lamb
58 jakevin
46 Raphael Taylor-Davies
30 Andy Grove
19 Batuhan Taskaya
19 Remzi Yang
17 ygf11
16 Burak
16 Jeffrey
16 Marco Neumann
14 Kun Liu
12 Yang Jiang
10 mingmwang
9 Daniël Heres
9 Mustafa akur
9 comphead
9 mvanschellebeeck
9 xudong.w
7 dependabot[bot]
7 yahoNanJing
6 Brent Gardner
5 AssHero
4 Jiayu Liu
4 Wei-Ting Kuo
4 askoa
3 André Calado Coroado
3 Jie Han
3 Jon Mease
3 Metehan Yıldırım
3 Nga Tran
3 Ruihang Xia
3 baishen
2 Berkay Şahin
2 Dan Harris
2 Dongyan Zhou
2 Eduard Karacharov
2 Kikkon
2 Liang-Chi Hsieh
2 Marko Milenković
2 Martin Grigorov
2 Roman Nozdrin
2 Tim Van Wassenhove
2 r.4ntix
2 unconsolable
2 unvalley
1 Ajaya Agrawal
1 Alexander Spies
1 ArkashaJavelin
1 Artjoms Iskovs
1 BoredPerson
1 Christian Salvati
1 Creampanda
1 Data Psycho
1 Francis Du
1 Francis Le Roy
1 LFC
1 Marko Grujic
1 Matt Willian
1 Matthijs Brobbel
1 Max Burke
1 Mehmet Ozan Kabak
1 Rito Takeuchi
1 Roman Zeyde
1 Vrishabh
1 Zhang Li
1 ZuoTiJia
1 byteink
1 cfraz89
1 nbr
1 xxchan
1 yujie.zhang
1 zembunia
1 哇呜哇呜呀咦耶