Apache Arrow DataFusion 6.0.0 Release
Posted on: Fri 19 November 2021 by pmc
Introduction
DataFusion is an embedded query engine which leverages the unique features of Rust and Apache Arrow to provide a system that is high performance, easy to connect, easy to embed, and high quality.
The Apache Arrow team is pleased to announce the DataFusion 6.0.0 release. This covers 4 months of development work and includes 134 commits from the following 28 distinct contributors.
28 Andrew Lamb
26 Jiayu Liu
13 xudong963
9 rdettai
9 QP Hou
6 Matthew Turner
5 Daniël Heres
4 Guillaume Balaine
3 Francis Du
3 Marco Neumann
3 Jon Mease
3 Nga Tran
2 Yijie Shen
2 Ruihang Xia
2 Liang-Chi Hsieh
2 baishen
2 Andy Grove
2 Jason Tianyi Wang
1 Nan Zhu
1 Antoine Wendlinger
1 Krisztián Szűcs
1 Mike Seddon
1 Conner Murphy
1 Patrick More
1 Taehoon Moon
1 Tiphaine Ruy
1 adsharma
1 lichuan6
The release notes below are not exhaustive and only expose selected highlights of the release. Many other bug fixes and improvements have been made: we refer you to the complete changelog.
New Website
Befitting a growing project, DataFusion now has its own website, hosted as part of the main Apache Arrow website.
Roadmap
For the first time, the community has gathered its thoughts about where DataFusion is headed into a public Roadmap.
New Features
- Runtime operator metrics collection framework
- Object store abstraction for unified access to local or remote storage
- Hive style table partitioning support for Parquet, CSV, Avro and JSON files
- DataFrame API support for: except, intersect, show, limit and window functions
- SQL
  - EXPLAIN ANALYZE with runtime metrics
  - trim ( [ LEADING | TRAILING | BOTH ] [ FROM ] string text [, characters text ] ) syntax
  - Postgres style regular expression matching operators ~, ~*, !~, and !~*
  - SQL set operators UNION, INTERSECT, and EXCEPT
  - cume_dist, percent_rank window functions
  - digest, blake2s, blake2b, blake3 crypto functions
  - HyperLogLog based approx_distinct
  - is distinct from and is not distinct from
  - CREATE TABLE AS SELECT
  - Accessing elements of nested Struct and List columns (e.g. SELECT struct_column['field_name'], array_column[0] FROM ...)
  - Boolean expressions in CASE statement
  - DROP TABLE
  - VALUES List
  - Postgres regex match operators
- Support for Avro format
- Support for ScalarValue::Struct
- Automatic schema inference for CSV files
- Better interactive editing support in datafusion-cli as well as psql style commands such as \d, \?, and \q
- Generic constant evaluation and simplification framework
- Added common subexpression elimination query plan optimization rule
- Python binding 0.4.0 with all DataFusion 6.0.0 features
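To make the Hive style table partitioning item above more concrete: in that layout, partition column values are encoded in directory names such as year=2021/month=11/. The following is a minimal standard-library sketch of recovering those values from a file path. The function name parse_partitions and the example path are hypothetical and chosen for illustration only; this is not DataFusion's implementation or API.

```rust
/// Extract `key=value` partition columns from a Hive-style file path.
/// Hypothetical helper for illustration; not part of DataFusion.
fn parse_partitions(path: &str) -> Vec<(String, String)> {
    // Drop the file name: only directory segments carry partition values.
    let dir = path.rsplit_once('/').map_or("", |(dir, _file)| dir);
    dir.split('/')
        // Segments without '=' (e.g. the table root) are skipped.
        .filter_map(|segment| segment.split_once('='))
        .map(|(key, value)| (key.to_string(), value.to_string()))
        .collect()
}

fn main() {
    // Assumed example layout: <table root>/<col>=<value>/.../<file>
    let parts = parse_partitions("sales/year=2021/month=11/part-0.parquet");
    for (column, value) in &parts {
        println!("{column} = {value}");
    }
}
```

A query engine can use values recovered this way to skip whole directories when a filter only touches partition columns.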
With these new features, we are also now passing TPC-H queries 8, 13 and 21.
For the full list of new features with their relevant PRs, see the enhancements section in the changelog.
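Among the SQL additions listed above, is distinct from and is not distinct from differ from plain = and <> in how they treat NULL. The sketch below models SQL NULL with Rust's Option to show the standard SQL semantics. The function names sql_eq and is_distinct_from are hypothetical; this illustrates the behavior only and says nothing about how DataFusion evaluates these operators internally.

```rust
/// SQL `a = b`: the result is NULL (modelled as `None`) if either side is NULL.
fn sql_eq<T: PartialEq>(a: Option<T>, b: Option<T>) -> Option<bool> {
    match (a, b) {
        (Some(x), Some(y)) => Some(x == y),
        _ => None,
    }
}

/// SQL `a IS DISTINCT FROM b`: always true or false, never NULL.
/// Two NULLs are not distinct; a NULL and a value are distinct.
fn is_distinct_from<T: PartialEq>(a: Option<T>, b: Option<T>) -> bool {
    // `Option`'s equality already treats `None == None` as true.
    a != b
}

fn main() {
    let null: Option<i32> = None;
    println!("NULL = NULL                -> {:?}", sql_eq(null, null));
    println!("NULL IS DISTINCT FROM NULL -> {}", is_distinct_from(null, null));
    println!("1 IS DISTINCT FROM NULL    -> {}", is_distinct_from(Some(1), null));
}
```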
async planning and decoupling file format from table layout
Driven by the need to support Hive style table partitioning, @rdettai introduced the following design changes to the DataFusion core.
- The code for reading specific file formats (Parquet, Avro, CSV, and JSON) was separated from the logic that handles grouping sets of files into execution partitions.
- The query planning process was made async.
As a result, we are able to replace the old Parquet, CSV and JSON table providers with a single ListingTable table provider.
This also sets up DataFusion and its plug-in ecosystem to support a wide range of catalogs and various object store implementations. You can read more about this change in the design document and on the arrow-datafusion#1010 PR.
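The separation described above can be pictured with a small standard-library sketch: one piece of code knows the file format, and another, format-agnostic piece groups files into execution partitions. Everything here is hypothetical and simplified for illustration (the Format trait, the group_files and describe_scan functions, and the round-robin grouping strategy are assumptions of this sketch); it is not DataFusion's ListingTable or its actual grouping logic, and it leaves out the async part of the change.

```rust
/// Knows how to handle one kind of file; knows nothing about table layout.
/// Hypothetical trait for illustration only.
trait Format {
    fn name(&self) -> &'static str;
}

struct Csv;
struct Parquet;

impl Format for Csv {
    fn name(&self) -> &'static str { "csv" }
}

impl Format for Parquet {
    fn name(&self) -> &'static str { "parquet" }
}

/// Groups files into at most `target_partitions` execution partitions.
/// Works for any format. Round-robin assignment is an assumption of this sketch.
fn group_files(files: &[&str], target_partitions: usize) -> Vec<Vec<String>> {
    let n = target_partitions.max(1);
    let mut groups: Vec<Vec<String>> = vec![Vec::new(); n];
    for (i, file) in files.iter().enumerate() {
        groups[i % n].push(file.to_string());
    }
    // Fewer files than partitions leaves some groups empty; drop them.
    groups.retain(|group| !group.is_empty());
    groups
}

/// A single "listing" code path that combines any format with the grouping.
fn describe_scan(file_format: &dyn Format, files: &[&str], target_partitions: usize) -> Vec<String> {
    group_files(files, target_partitions)
        .iter()
        .enumerate()
        .map(|(i, group)| format!("partition {i}: {} {:?}", file_format.name(), group))
        .collect()
}

fn main() {
    for line in describe_scan(&Csv, &["a.csv", "b.csv", "c.csv"], 2) {
        println!("{line}");
    }
    for line in describe_scan(&Parquet, &["x.parquet"], 4) {
        println!("{line}");
    }
}
```

Because the grouping function never inspects the format, supporting a new file type in this sketch means adding one more Format implementation rather than another table provider.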
How to Get Involved
If you are interested in contributing to DataFusion, we would love to have you! You can help by trying out DataFusion on your own data and projects, filing bug reports, or contributing to the documentation, tests or code. A list of open issues suitable for beginners is here and the full list is here.
Check out our new Communication Doc for more ways to engage with the community.