Apache Arrow Ballista 0.5.0 Release

Posted on: Wed 18 August 2021 by pmc

Ballista extends DataFusion to provide support for distributed queries. This is the first release of Ballista since the project was donated to the Apache Arrow project and includes 80 commits from 11 contributors.

git shortlog -sn 4.0.0..5.0.0 ballista/rust/client ballista/rust/core ballista/rust/executor ballista/rust/scheduler
  27  Andy Grove
  15  Jiayu Liu
  12  Andrew Lamb
   8  Ximo Guanter
   6  Daniël Heres
   5  QP Hou
   2  Jorge Leitao
   1  Javier Goday
   1  K.I. (Dennis) Jung
   1  Mike Seddon
   1  sathis

The release notes below are not exhaustive and only expose selected highlights of the release. Many other bug fixes and improvements have been made: we refer you to the complete changelog.

Performance and Scalability

Ballista is now capable of running complex SQL queries at scale and supports scalable distributed joins. We have been benchmarking using individual queries from the TPC-H benchmark at scale factors up to 1000 (1 TB). When running against CSV files, performance is generally very close to DataFusion, and significantly faster in some cases due to the fact that the scheduler limits the number of concurrent tasks that run at any given time. Performance against large Parquet datasets is currently non ideal due to some issues (#867, #868) that we hope to resolve for the next release.

New Features

The main new features in this release are:

  • Ballista queries can now be executed by calling DataFrame.collect()
  • The shuffle mechanism has been re-implemented
  • Distributed hash-partitioned joins are now supported
  • Keda autoscaling is supported

To get started with Ballista, refer to the crate documentation.

Now that the basic functionality is in place, the focus for the next release will be to improve the performance and scalability as well as improving the documentation.

How to Get Involved

If you are interested in contributing to Ballista, we would love to have you! You can help by trying out Ballista on some of your own data and projects and filing bug reports and helping to improve the documentation, or contribute to the documentation, tests or code. A list of open issues suitable for beginners is here and the full list is here.

Copyright 2024, The Apache Software Foundation, Licensed under the Apache License, Version 2.0.
Apache® and the Apache feather logo are trademarks of The Apache Software Foundation.