Apache Arrow DataFusion 13.0.0 Project Update
Posted on: Tue 25 October 2022 by pmc
Introduction
Apache Arrow DataFusion 13.0.0 is released, and this blog post contains an update on the project for the 5 months since our last update in May 2022.
DataFusion is an extensible and embeddable query engine written in Rust, used to create modern, fast and efficient data pipelines, ETL processes, and database systems. You may want to check out DataFusion to extend your Rust project to:
- Support SQL
- Support a DataFrame API
- Support a Domain Specific Query Language
- Easily and quickly read and process Parquet, JSON, Avro or CSV data
- Read from remote object stores such as AWS S3, Azure Blob Storage, and Google Cloud Storage
Even though DataFusion is 4 years "young," it has seen significant community growth in the last few months and the momentum continues to accelerate.
Background
DataFusion is used as the engine in many open source and commercial projects and was one of the early open source projects to provide this capability. 2022 has validated our belief in the need for such an "LLVM for database and AI systems" with announcements such as the release of Facebook's Velox engine, the major investments in Acero, as well as the continued popularity of Apache Calcite and other similar technologies.
While Velox and Acero focus on execution engines, DataFusion provides the entire suite of components needed to build most analytic systems, including a SQL frontend, a dataframe API, and extension points for just about everything. Some DataFusion users use a subset of the features, such as the frontend (e.g. dask-sql) or the execution engine (e.g. Blaze), and some use many different components to build both SQL-based and customized DSL-based systems such as InfluxDB IOx and VegaFusion.
One of DataFusion’s advantages is its implementation in Rust and thus its easy integration with the broader Rust ecosystem. Rust continues to be a major source of benefit, from the ease of parallelization with the high-quality, standardized async ecosystem to its modern dependency management system and wonderful performance.
Summary
We have increased the frequency of DataFusion releases to monthly instead of quarterly. This makes it easier for the increasing number of projects that now depend on DataFusion to stay current with new features and fixes.
We have also completed the "graduation" of Ballista to its own top-level arrow-ballista repository, which decouples the two projects and allows each to move even faster.
Along with numerous other bug fixes and smaller improvements, here are some of the major advances:
Improved Support for Cloud Object Stores
DataFusion now supports many major cloud object stores (Amazon S3, Azure Blob Storage, and Google Cloud Storage) "out of the box" via the object_store crate. Using this integration, DataFusion optimizes reading Parquet files by reading only the parts of the files that are needed.
Advanced SQL
DataFusion now supports correlated subqueries, by rewriting them as joins. See the Subquery page in the User Guide for more information.
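To make the idea of rewriting a correlated subquery as a join concrete, here is a minimal sketch in plain Rust (standard library only). It is not DataFusion's optimizer code: the `Order` type and both function names are hypothetical, and the query being modeled is an assumed example, `SELECT id FROM orders o WHERE amount > (SELECT avg(amount) FROM orders i WHERE i.customer = o.customer)`.

```rust
use std::collections::HashMap;

/// Hypothetical row type used only for this illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct Order {
    pub id: u32,
    pub customer: &'static str,
    pub amount: f64,
}

/// Naive evaluation: re-run the inner query once per outer row.
pub fn correlated(orders: &[Order]) -> Vec<u32> {
    orders
        .iter()
        .filter(|o| {
            // Inner query: all amounts for the same customer as the outer row.
            let same: Vec<f64> = orders
                .iter()
                .filter(|i| i.customer == o.customer)
                .map(|i| i.amount)
                .collect();
            let avg = same.iter().sum::<f64>() / same.len() as f64;
            o.amount > avg
        })
        .map(|o| o.id)
        .collect()
}

/// Decorrelated evaluation: aggregate once per customer, then join.
pub fn decorrelated(orders: &[Order]) -> Vec<u32> {
    // Step 1: GROUP BY customer, keeping (sum, count) per group.
    let mut agg: HashMap<&str, (f64, usize)> = HashMap::new();
    for o in orders {
        let e = agg.entry(o.customer).or_insert((0.0, 0));
        e.0 += o.amount;
        e.1 += 1;
    }
    // Step 2: join each order to its group's average and apply the filter.
    orders
        .iter()
        .filter(|o| {
            let (sum, n) = agg[o.customer];
            o.amount > sum / n as f64
        })
        .map(|o| o.id)
        .collect()
}
```

Both functions return the same ids for the same input; the second touches each row a constant number of times instead of rescanning the table for every outer row, which is the motivation for this kind of rewrite.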
In addition to numerous other small improvements, the following SQL features are now supported:
- ROWS, RANGE, PRECEDING and FOLLOWING in OVER clauses #3570
- ROLLUP and CUBE grouping set expressions #2446
- SUM DISTINCT aggregate support #2405
- IN and NOT IN subqueries by rewriting them to SEMI / ANTI #2421
- Non-equality predicates in ON clause of LEFT, RIGHT, and FULL joins #2591
- Exact MEDIAN #3009
- GROUPING SETS / CUBE / ROLLUP #2716
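For readers new to grouping sets, the sketch below shows how ROLLUP and CUBE expand into explicit GROUPING SETS under standard SQL semantics. It is a standalone illustration in plain Rust, not DataFusion's planner code; the function names `rollup` and `cube` are hypothetical, and the order in which the sets are listed is a choice made for this example.

```rust
/// Expand ROLLUP(c1, ..., cn) into its grouping sets:
/// (c1..cn), (c1..cn-1), ..., (c1), ().
pub fn rollup<'a>(cols: &[&'a str]) -> Vec<Vec<&'a str>> {
    (0..=cols.len())
        .rev()
        .map(|n| cols[..n].to_vec())
        .collect()
}

/// Expand CUBE(c1, ..., cn) into all 2^n subsets of the columns.
pub fn cube<'a>(cols: &[&'a str]) -> Vec<Vec<&'a str>> {
    let n = cols.len();
    // Each bitmask selects one subset; bit i set means cols[i] is included.
    (0..(1usize << n))
        .rev()
        .map(|mask| {
            (0..n)
                .filter(|&i| mask & (1usize << i) != 0)
                .map(|i| cols[i])
                .collect::<Vec<&'a str>>()
        })
        .collect()
}
```

For example, `rollup(&["a", "b"])` yields the sets `(a, b)`, `(a)` and `()`, so `GROUP BY ROLLUP(a, b)` aggregates at three levels, while `cube(&["a", "b"])` additionally yields `(b)`.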
More DDL Support
Just as it is important to query, it is also important to give users the ability to define their data sources. We have added:
- CREATE VIEW #2279
- DESCRIBE <table> #2642
- Custom / Dynamic table provider factories #3311
- SHOW CREATE TABLE support for views #2830
Faster Execution
Performance is always an important goal for DataFusion, and there are a number of significant new optimizations such as
- Optimizations of TopK (queries with a LIMIT or OFFSET clause): #3527, #2521
- Reduce left / right / full joins to inner join #2750
- Convert cross joins to inner joins when possible #3482
- Sort preserving SortMergeJoin #2699
- Improvements in group by and sort performance #2375
- Adaptive regex_replace implementation #3518
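As background on why TopK queries are worth optimizing, the following sketch shows one common technique: keeping only the best `k` rows in a bounded heap rather than sorting the whole input. This is a generic illustration in plain Rust and is not taken from the linked PRs; the function name `top_k` and the use of `i64` values are assumptions made for the example.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Return the `k` largest values in descending order, as in
/// `SELECT v FROM t ORDER BY v DESC LIMIT k`, without sorting all rows.
pub fn top_k(values: impl IntoIterator<Item = i64>, k: usize) -> Vec<i64> {
    if k == 0 {
        return Vec::new();
    }
    // Min-heap of at most k elements: the smallest kept value sits on top.
    let mut heap: BinaryHeap<Reverse<i64>> = BinaryHeap::new();
    for v in values {
        heap.push(Reverse(v));
        if heap.len() > k {
            heap.pop(); // evict the current smallest kept value
        }
    }
    // Ascending order of Reverse<i64> is descending order of the values.
    heap.into_sorted_vec()
        .into_iter()
        .map(|Reverse(v)| v)
        .collect()
}
```

With `n` input rows this does O(n log k) work and holds at most `k + 1` values in memory at a time, compared with O(n log n) work and O(n) memory for a full sort followed by a limit.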
Optimizer Enhancements
Internally the optimizer has been significantly enhanced as well.
- Casting / coercion now happens during logical planning #3185 #3636
- More sophisticated expression analysis and simplification is available
Parquet
- The Parquet reader can now read directly from Parquet files on remote object storage #2489 #3051
- Experimental support for “predicate pushdown” with late materialization after filtering during the scan (another blog post on this topic is coming soon).
- Support reading directly from AWS S3 and other object stores via datafusion-cli #3631
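The general idea behind predicate pushdown with late materialization can be sketched as a two-phase scan: evaluate the filter on the filter column first, then decode the remaining columns only for rows that passed. The code below is a simplified in-memory illustration in plain Rust, not the Parquet reader's implementation; the `Columns` type, its fields, and `scan_late` are hypothetical.

```rust
/// Hypothetical two-column table stored column by column.
pub struct Columns {
    pub price: Vec<i64>,
    pub name: Vec<String>,
}

/// Evaluate `price > min_price` first, then materialize `name`
/// only for the rows that passed the filter.
pub fn scan_late(cols: &Columns, min_price: i64) -> Vec<String> {
    // Phase 1: read only the filter column and record matching row indices.
    let selection: Vec<usize> = cols
        .price
        .iter()
        .enumerate()
        .filter(|(_, p)| **p > min_price)
        .map(|(i, _)| i)
        .collect();
    // Phase 2: materialize the projected column for the selected rows only.
    selection.iter().map(|&i| cols.name[i].clone()).collect()
}
```

When the predicate is selective, phase 2 touches only a small fraction of the projected column, which is where the savings come from.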
DataType Support
- Support for TimestampTz #3660
- Expanded support for the Decimal type, including IN list and better built-in coercion.
- Expanded support for date/time manipulation such as date_bin built-in function, timestamp +/- interval, TIME literal values #3010, #3110, #3034
- Binary operations (AND, XOR, etc): #3037 #3420
- IS TRUE/FALSE and IS [NOT] UNKNOWN #3235, #3246
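The arithmetic behind binning timestamps, as date_bin does, can be sketched in a few lines. The function below is a plain Rust illustration that works on raw integers in a single unit (for example nanoseconds since the Unix epoch); it only mirrors the SQL function's name and argument order and is not DataFusion's implementation. Its handling of timestamps earlier than the origin is an assumption of this sketch.

```rust
/// Start of the bin containing `ts`, for bins of width `stride`
/// aligned to `origin`. All arguments share one unit.
pub fn date_bin(stride: i64, ts: i64, origin: i64) -> i64 {
    assert!(stride > 0, "stride must be positive");
    // rem_euclid is never negative, so timestamps before `origin`
    // round down to the earlier bin boundary (an assumption of this sketch).
    ts - (ts - origin).rem_euclid(stride)
}
```

With a stride of 15 and an origin of 0, a timestamp of 37 falls into the bin starting at 30; shifting the origin to 5 moves the boundaries so the same timestamp falls into the bin starting at 35.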
Upcoming Work
With the community growing and code accelerating, there is so much great stuff on the horizon. Some features we expect to land in the next few months:
- Complete Parquet Pushdown
- Additional date/time support
- Cost models, Nested Join Optimizations, analysis framework #128, #3843, #3845
Community Growth
The releases between DataFusion 9.0.0 and 13.0.0 consist of 433 PRs from 64 distinct contributors. This does not count all the work that goes into our dependencies such as arrow, parquet, and object_store, which much of the same community helps nurture.
How to Get Involved
Kudos to everyone in the community who contributed ideas, discussions, bug reports, documentation and code. It is exciting to be building something so cool together!
If you are interested in contributing to DataFusion, we would love to have you join us on our journey to create the most advanced open source query engine. You can try out DataFusion on some of your own data and projects and let us know how it goes or contribute a PR with documentation, tests or code. A list of open issues suitable for beginners is here.
Check out our Communication Doc for more ways to engage with the community.
Appendix: Contributor Shoutout
To give a sense of the number of people who contribute to this project regularly, we present for your consideration the following list derived from git shortlog -sn 9.0.0..13.0.0.
Thank you all again!
87 Andy Grove
71 Andrew Lamb
29 Kun Liu
29 Kirk Mitchener
17 Wei-Ting Kuo
14 Yang Jiang
12 Raphael Taylor-Davies
11 Batuhan Taskaya
10 Brent Gardner
10 Remzi Yang
10 comphead
10 xudong.w
8 AssHero
7 Ruihang Xia
6 Dan Harris
6 Daniël Heres
6 Ian Alexander Joiner
6 Mike Roberts
6 askoa
4 BaymaxHWY
4 gorkem
4 jakevin
3 George Andronchik
3 Sarah Yurick
3 Stuart Carnie
2 Dalton Modlin
2 Dmitry Patsura
2 JasonLi
2 Jon Mease
2 Marco Neumann
2 yahoNanJing
1 Adilet Sarsembayev
1 Ayush Dattagupta
1 Dezhi Wu
1 Dhamotharan Sritharan
1 Eduard Karacharov
1 Francis Du
1 Harbour Zheng
1 Ismaël Mejía
1 Jack Klamer
1 Jeremy Dyer
1 Jiayu Liu
1 Kamil Konior
1 Liang-Chi Hsieh
1 Martin Grigorov
1 Matthijs Brobbel
1 Mehmet Ozan Kabak
1 Metehan Yıldırım
1 Morgan Cassels
1 Nitish Tiwari
1 Renjie Liu
1 Rito Takeuchi
1 Robert Pack
1 Thomas Cameron
1 Vrishabh
1 Xin Hao
1 Yijie Shen
1 byteink
1 kamille
1 mateuszkj
1 nvartolomei
1 yourenawo
1 Özgür Akkurt