Apache DataFusion Ballista 52.0.0 Changelog

Performance related:

perf: optimize shuffle writer with buffered I/O and fix file size bug #1386 (andygrove)

Implemented enhancements:

feat: add config option for skipping arrow ipc read validation #1374 (killzoner)
feat: improve tpch benchmark CLI #1391 (andygrove)
feat: Add sort-based shuffle implementation #1389 (andygrove)
feat: New ballista python interface #1338 (milenkovicm)
feat: Add batch coalescing ability to shuffle reader exec #1380 (danielhumanmod)
feat: Add arrow flight proxy to scheduler #1351 (sebbegg)
feat: Creating SubstraitSchedulerClient and standalone Substrait examples #1376 (mattcuento)
feat: Cluster RPC customisations to support TLS and custom headers #1400 (phillipleblanc)
feat: add -c config override flag to tpch benchmark #1435 (andygrove)
feat: Extract execution_graph to a trait #1361 (milenkovicm)
feat: Add spark-compat mode to integrate datafusion-spark features au… #1416 (mattcuento)
feat: add Dataframe.cache() factory (no planner handling) #1420 (killzoner)
feat: Adaptive query execution (AQE) planner fundamentals #1372 (milenkovicm)
feat: Make push scheduling policy default as it has lower latency #1461 (milenkovicm)
feat: job scheduling with push based job status updates #1478 (milenkovicm)

Fixed bugs:

fix: compile issue after unsuccessful merge #1402 (milenkovicm)
fix: prost build keda and TLS RPC example #1429 (killzoner)
fix: remove scheduler_config_spec.toml as it is unused #1462 (milenkovicm)
fix: Don’t use maxrows as a “fetched rows” but calculate it from the batches #1480 (martin-g)

Documentation updates:

docs: fix outdated content in documentation #1385 (andygrove)
docs: use tpchgen-rs for TPC-H data generation #1390 (andygrove)
docs: add Jupyter notebook support documentation #1399 (andygrove)
chore: Document ballista features in README.md #1418 (mattcuento)

Merged pull requests:

feat: add config option for skipping arrow ipc read validation #1374 (killzoner)
docs: fix outdated content in documentation #1385 (andygrove)
restrict python CI to python directory #1383 (Huy1Ng)
perf: optimize shuffle writer with buffered I/O and fix file size bug #1386 (andygrove)
docs: use tpchgen-rs for TPC-H data generation #1390 (andygrove)
feat: improve tpch benchmark CLI #1391 (andygrove)
doc: Add Ballista extensions example to the docs. #1382 (LouisBurke)
feat: Add sort-based shuffle implementation #1389 (andygrove)
feat: New ballista python interface #1338 (milenkovicm)
doc: add more details for protobuf extension #1393 (LouisBurke)
feat: Add batch coalescing ability to shuffle reader exec #1380 (danielhumanmod)
docs: add Jupyter notebook support documentation #1399 (andygrove)
feat: Add arrow flight proxy to scheduler #1351 (sebbegg)
chore: update datafusion to 52 #1394 (killzoner)
feat: Creating SubstraitSchedulerClient and standalone Substrait examples #1376 (mattcuento)
fix: compile issue after unsuccessful merge #1402 (milenkovicm)
feat: Cluster RPC customisations to support TLS and custom headers #1400 (phillipleblanc)
chore: Document ballista features in README.md #1418 (mattcuento)
fix: prost build keda and TLS RPC example #1429 (killzoner)
Improve sort-based shuffle: single spill file per partition and batch coalescing #1431 (andygrove)
feat: add -c config override flag to tpch benchmark #1435 (andygrove)
feat: Extract execution_graph to a trait #1361 (milenkovicm)
chore: add confirmation before tarball is released #1445 (milenkovicm)
minor: add test to cover IPC arrow file read #1450 (milenkovicm)
feat: Add spark-compat mode to integrate datafusion-spark features au… #1416 (mattcuento)
feat: add Dataframe.cache() factory (no planner handling) #1420 (killzoner)
fix: remove scheduler_config_spec.toml as it is unused #1462 (milenkovicm)
feat: Adaptive query execution (AQE) planner fundamentals #1372 (milenkovicm)
feat: Make push scheduling policy default as it has lower latency #1461 (milenkovicm)
minor: improve log statements #1482 (milenkovicm)
chore: update datafusion to 52.2 and other deps to latest #1483 (milenkovicm)
fix: Don’t use maxrows as a “fetched rows” but calculate it from the batches #1480 (martin-g)
feat: job scheduling with push based job status updates #1478 (milenkovicm)