Apache Arrow Ballista 0.5.0 Release
Posted on: Wed 18 August 2021 by pmc
Ballista extends DataFusion to provide support for distributed queries. This is the first release of Ballista since the project was donated to the Apache Arrow project and includes 80 commits from 11 contributors.
git shortlog -sn 4.0.0..5.0.0 ballista/rust/client ballista/rust/core ballista/rust/executor ballista/rust/scheduler
27 Andy Grove
15 Jiayu Liu
12 Andrew Lamb
8 Ximo Guanter
6 Daniël Heres
5 QP Hou
2 Jorge Leitao
1 Javier Goday
1 K.I. (Dennis) Jung
1 Mike Seddon
1 sathis
The release notes below are not exhaustive and only expose selected highlights of the release. Many other bug fixes and improvements have been made: we refer you to the complete changelog.
Performance and Scalability
Ballista is now capable of running complex SQL queries at scale and supports scalable distributed joins. We have been benchmarking using individual queries from the TPC-H benchmark at scale factors up to 1000 (1 TB). When running against CSV files, performance is generally very close to DataFusion, and significantly faster in some cases due to the fact that the scheduler limits the number of concurrent tasks that run at any given time. Performance against large Parquet datasets is currently non ideal due to some issues (#867, #868) that we hope to resolve for the next release.
New Features
The main new features in this release are:
- Ballista queries can now be executed by calling DataFrame.collect()
- The shuffle mechanism has been re-implemented
- Distributed hash-partitioned joins are now supported
- Keda autoscaling is supported
To get started with Ballista, refer to the crate documentation.
Now that the basic functionality is in place, the focus for the next release will be to improve the performance and scalability as well as improving the documentation.
How to Get Involved
If you are interested in contributing to Ballista, we would love to have you! You can help by trying out Ballista on some of your own data and projects and filing bug reports and helping to improve the documentation, or contribute to the documentation, tests or code. A list of open issues suitable for beginners is here and the full list is here.
Copyright 2024, The Apache Software Foundation, Licensed under the Apache License, Version 2.0.
Apache® and the Apache feather logo are trademarks of The Apache Software Foundation.