Apache Arrow DataFusion 7.0.0 Release
Posted on: Mon 28 February 2022 by pmc
Introduction
DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format.
When you want to extend your Rust project with SQL support, a DataFrame API, or the ability to read and process Parquet, JSON, Avro or CSV data, DataFusion is definitely worth checking out.
DataFusion's SQL, DataFrame
, and manual PlanBuilder
API let users access a sophisticated query optimizer and execution engine capable of fast, resource efficient, and parallel execution that takes optimal advantage of todays multicore hardware. Being written in Rust means DataFusion can offer both the safety of dynamic languages as well as the resource efficiency of a compiled language.
The Apache Arrow team is pleased to announce the DataFusion 7.0.0 release. This covers 4 months of development work and includes 195 commits from the following 37 distinct contributors.
44 Andrew Lamb
24 Kun Liu
23 Jiayu Liu
17 xudong.w
11 Yijie Shen
9 Matthew Turner
7 Liang-Chi Hsieh
5 Lin Ma
4 Stephen Carman
4 James Katz
4 Dmitry Patsura
4 QP Hou
3 dependabot[bot]
3 Remzi Yang
3 Yang
3 ic4y
3 Daniël Heres
2 Andy Grove
2 Raphael Taylor-Davies
2 Jason Tianyi Wang
2 Dan Harris
2 Sergey Melnychuk
1 Nitish Tiwari
1 Dom
1 Eduard Karacharov
1 Javier Goday
1 Boaz
1 Marko Mikulicic
1 Max Burke
1 Carol (Nichols || Goulding)
1 Phillip Cloud
1 Rich
1 Toby Hede
1 Will Jones
1 r.4ntix
1 rdettai
The following section highlights some of the improvements in this release. Of course, many other bug fixes and improvements have also been made and we refer you to the complete changelog for the full detail.
Summary
- DataFusion Crate
- The DataFusion crate is being split into multiple crates to decrease compilation times and improve the development experience. Initially,
datafusion-common
(the core DataFusion components) anddatafusion-expr
(DataFusion expressions, functions, and operators) have been split out. There will be additional splits after the 7.0 release. - Performance Improvements and Optimizations
- Arrow’s dyn scalar kernels are now used to enable efficient operations on
DictionaryArray
s #1685 - Switch from
std::sync::Mutex
toparking_lot::Mutex
#1720 - New Features
- Support for memory tracking and spilling to disk
- New math functions
- Support decimal type #1394#1407#1408#1431#1483#1554#1640
- Support for reading Parquet files with evolved schemas #1622#1709
- Support for registering
DataFrame
as table #1699 - Support for the
substring
function #1621 - Support
array_agg(distinct ...)
#1579 - Support
sort
on unprojected columns #1415 - Additional Integration Points
- A new public Expression simplification API #1717
- DataFusion-Contrib
- A new GitHub organization created as a home for both
DataFusion
extensions and as a testing ground for new features.- Extensions
- DataFusion-Python
- DataFusion-Java
- DataFusion-hdsfs-native
- DataFusion-ObjectStore-s3
- New Features
- DataFusion-Streams
- Arrow2
- An Arrow2 Branch has been created. There are ongoing discussions in DataFusion and arrow-rs about migrating
DataFusion
toArrow2
Documentation and Roadmap
We are working to consolidate the documentation into the official site. You can find more details there on topics such as the SQL status and a user guide. This is also an area we would love to get help from the broader community #1821.
To provide transparency on DataFusion’s priorities to users and developers a three month roadmap will be published at the beginning of each quarter. This can be found here here.
Upcoming Attractions
- Ballista is gaining momentum, and several groups are now evaluating and contributing to the project.
- Some of the proposed improvements
- Continued improvements for working with limited resources and large datasets
- Memory limited joins#1599
- Sort-merge join#141#1776
- Introduce row based bytes representation #1708
How to Get Involved
If you are interested in contributing to DataFusion, and learning about state of the art query processing, we would love to have you join us on the journey! You can help by trying out DataFusion on some of your own data and projects and let us know how it goes or contribute a PR with documentation, tests or code. A list of open issues suitable for beginners is here
Check out our new Communication Doc on more ways to engage with the community.
Copyright 2024, The Apache Software Foundation, Licensed under the Apache License, Version 2.0.
Apache® and the Apache feather logo are trademarks of The Apache Software Foundation.