Apache Arrow DataFusion 34.0.0 Released, Looking Forward to 2024
Posted on: Fri 19 January 2024 by pmc
Introduction
We recently released DataFusion 34.0.0. This blog highlights some of the major improvements since we released DataFusion 26.0.0 (spoiler alert there are many) and a preview of where the community plans to focus in the next 6 months.
Apache Arrow DataFusion is an extensible query engine, written in Rust, that uses Apache Arrow as its in-memory format. DataFusion is used by developers to create new, fast data centric systems such as databases, dataframe libraries, machine learning and streaming applications. While DataFusion’s primary design goal is to accelerate creating other data centric systems, it has a reasonable experience directly out of the box as a dataframe library and command line SQL tool.
This may also be our last update on the Apache Arrow Site. Future updates will likely be on the DataFusion website as we are working to graduate to a top level project (Apache Arrow DataFusion → Apache DataFusion!) which will help focus governance and project growth. Also exciting, our first DataFusion in person meetup is planned for March 2024.
DataFusion is very much a community endeavor. Our core thesis is that as a
community we can build much more advanced technology than any of us as
individuals or companies could alone. In the last 6 months between 26.0.0
and
34.0.0
, community growth has been strong. We accepted and reviewed over a
thousand PRs from 124 different committers, created over 650 issues and closed 517
of them.
You can find a list of all changes in the detailed CHANGELOG.
Improved Performance 🚀
Performance is a key feature of DataFusion, DataFusion is
more than 2x faster on ClickBench compared to version 25.0.0
, as shown below:
Here are some specific enhancements we have made to improve performance:
* 2-3x better aggregation performance with many distinct groups
* Partially ordered grouping / streaming grouping
* [Specialized operator for "TopK" ORDER BY LIMIT XXX
]
* [Specialized operator for min(col) GROUP BY .. ORDER by min(col) LIMIT XXX
]
* Improved join performance
* Eliminate redundant sorting with sort order aware optimizers
New Features ✨
DML / Insert / Creating Files
DataFusion now supports writing data in parallel, to individual or multiple
files, using Parquet
, CSV
, JSON
, ARROW
and user defined formats.
Benchmark results show improvements up to 5x in some cases.
Similarly to reading, data can now be written to any [ObjectStore
]
implementation, including AWS S3, Azure Blob Storage, GCP Cloud Storage, local
files, and user defined implementations. While reading from hive style
partitioned tables has long been supported, it is now possible to write to such
tables as well.
For example, to write to a local file:
❯ CREATE EXTERNAL TABLE awesome_table(x INT) STORED AS PARQUET LOCATION '/tmp/my_awesome_table';
0 rows in set. Query took 0.003 seconds.
❯ INSERT INTO awesome_table SELECT x * 10 FROM my_source_table;
+-------+
| count |
+-------+
| 3 |
+-------+
1 row in set. Query took 0.024 seconds.
You can also write to files with the [COPY
], similarly to [DuckDB’s COPY
]:
❯ COPY (SELECT x + 1 FROM my_source_table) TO '/tmp/output.json';
+-------+
| count |
+-------+
| 3 |
+-------+
1 row in set. Query took 0.014 seconds.
$ cat /tmp/output.json
{"x":1}
{"x":2}
{"x":3}
Improved STRUCT
and ARRAY
support
DataFusion 34.0.0
has much improved STRUCT
and ARRAY
support, including a full range of struct functions and array functions.
For example, you can now use []
syntax and array_length
to access and inspect arrays:
❯ SELECT column1,
column1[1] AS first_element,
array_length(column1) AS len
FROM my_table;
+-----------+---------------+-----+
| column1 | first_element | len |
+-----------+---------------+-----+
| [1, 2, 3] | 1 | 3 |
| [2] | 2 | 1 |
| [4, 5] | 4 | 2 |
+-----------+---------------+-----+
❯ SELECT column1, column1['c0'] FROM my_table;
+------------------+----------------------+
| column1 | my_table.column1[c0] |
+------------------+----------------------+
| {c0: foo, c1: 1} | foo |
| {c0: bar, c1: 2} | bar |
+------------------+----------------------+
2 rows in set. Query took 0.002 seconds.
Other Features
Other notable features include: * Support aggregating datasets that exceed memory size, with group by spill to disk * All operators now track and limit their memory consumption, including Joins
Building Systems is Easier with DataFusion 🛠️
Documentation
It is easier than ever to get started using DataFusion with the new Library Users Guide as well as significantly improved the API documentation.
User Defined Window and Table Functions
In addition to DataFusion's User Defined Scalar Functions, and User Defined Aggregate Functions, DataFusion now supports User Defined Window Functions and User Defined Table Functions.
For example, [the datafusion-cli
] implements a DuckDB style [parquet_metadata
]
function as a user defined table function (source code here):
❯ SELECT
path_in_schema, row_group_id, row_group_num_rows, stats_min, stats_max, total_compressed_size
FROM
parquet_metadata('hits.parquet')
WHERE path_in_schema = '"WatchID"'
LIMIT 3;
+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
| path_in_schema | row_group_id | row_group_num_rows | stats_min | stats_max | total_compressed_size |
+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
| "WatchID" | 0 | 450560 | 4611687214012840539 | 9223369186199968220 | 3883759 |
| "WatchID" | 1 | 612174 | 4611689135232456464 | 9223371478009085789 | 5176803 |
| "WatchID" | 2 | 344064 | 4611692774829951781 | 9223363791697310021 | 3031680 |
+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
3 rows in set. Query took 0.053 seconds.
Growth of DataFusion 📈
DataFusion has been appearing more publically in the wild. For example * New projects built using DataFusion such as lancedb, GlareDB, Arroyo, and optd. * Public talks such as Apache Arrow Datafusion: Vectorized Execution Framework For Maximum Performance in CommunityOverCode Asia 2023 * Blogs posts such as Apache Arrow, Arrow/DataFusion, AI-native Data Infra, Flight, DataFusion, Arrow, and Parquet: Using the FDAP Architecture to build InfluxDB 3.0, and A Guide to User-Defined Functions in Apache Arrow DataFusion
We have also submitted a paper to SIGMOD 2024, one of the premiere database conferences, describing DataFusion in a technically formal style and making the case that it is possible to create a modular and extensive query engine without sacrificing performance. We hope this paper helps people evaluating DataFusion for their needs understand it better.
DataFusion in 2024 🥳
Some major initiatives from contributors we know of this year are:
-
Modularity: Make DataFusion even more modular, such as unifying built in and user functions, making it easier to customize DataFusion's behavior.
-
Community Growth: Graduate to our own top level Apache project, and subsequently add more committers and PMC members to keep pace with project growth.
-
Use case white papers: Write blog posts and videos explaining how to use DataFusion for real-world use cases.
-
Testing: Improve CI infrastructure and test coverage, more fuzz testing, and better functional and performance regression testing.
-
Planning Time: Reduce the time taken to plan queries, both wide tables of 1000s of columns, and in general.
-
Aggregate Performance: Improve the speed of aggregating "high cardinality" data when there are many (e.g. millions) of distinct groups.
-
Statistics: Improved statistics handling with an eye towards more sophisticated expression analysis and cost models.
How to Get Involved
If you are interested in contributing to DataFusion we would love to have you join us. You can try out DataFusion on some of your own data and projects and let us know how it goes, contribute suggestions, documentation, bug reports, or a PR with documentation, tests or code. A list of open issues suitable for beginners is here.
As the community grows, we are also looking to restart biweekly calls / meetings. Timezones are always a challenge for such meetings, but we hope to have two calls that can work for most attendees. If you are interested in helping, or just want to say hi, please drop us a note via one of the methods listed in our Communication Doc.
Copyright 2024, The Apache Software Foundation, Licensed under the Apache License, Version 2.0.
Apache® and the Apache feather logo are trademarks of The Apache Software Foundation.