Apache DataFusion Ballista 52.0.0 Changelog

Performance related:

  • perf: optimize shuffle writer with buffered I/O and fix file size bug #1386 (andygrove)

Implemented enhancements:

  • feat: add config option for skipping arrow ipc read validation #1374 (killzoner)

  • feat: improve tpch benchmark CLI #1391 (andygrove)

  • feat: Add sort-based shuffle implementation #1389 (andygrove)

  • feat: New ballista python interface #1338 (milenkovicm)

  • feat: Add batch coalescing ability to shuffle reader exec #1380 (danielhumanmod)

  • feat: Add arrow flight proxy to scheduler #1351 (sebbegg)

  • feat: Creating SubstraitSchedulerClient and standalone Substrait examples #1376 (mattcuento)

  • feat: Cluster RPC customisations to support TLS and custom headers #1400 (phillipleblanc)

  • feat: add -c config override flag to tpch benchmark #1435 (andygrove)

  • feat: Extract execution_graph to a trait #1361 (milenkovicm)

  • feat: Add spark-compat mode to integrate datafusion-spark features au… #1416 (mattcuento)

  • feat: add Dataframe.cache() factory (no planner handling) #1420 (killzoner)

  • feat: Adaptive query execution (AQE) planner fundamentals #1372 (milenkovicm)

  • feat: Make push scheduling policy default as it has lower latency #1461 (milenkovicm)

  • feat: job scheduling with push based job status updates #1478 (milenkovicm)

Fixed bugs:

  • fix: compile issue after unsuccessful merge #1402 (milenkovicm)

  • fix: prost build keda and TLS RPC example #1429 (killzoner)

  • fix: remove scheduler_config_spec.toml as it is unused #1462 (milenkovicm)

  • fix: Don’t use maxrows as a “fetched rows” but calculate it from the batches #1480 (martin-g)

Documentation updates:

  • docs: fix outdated content in documentation #1385 (andygrove)

  • docs: use tpchgen-rs for TPC-H data generation #1390 (andygrove)

  • docs: add Jupyter notebook support documentation #1399 (andygrove)

  • chore: Document ballista features in README.md #1418 (mattcuento)

Merged pull requests:

  • feat: add config option for skipping arrow ipc read validation #1374 (killzoner)

  • docs: fix outdated content in documentation #1385 (andygrove)

  • restrict python CI to python directory #1383 (Huy1Ng)

  • perf: optimize shuffle writer with buffered I/O and fix file size bug #1386 (andygrove)

  • docs: use tpchgen-rs for TPC-H data generation #1390 (andygrove)

  • feat: improve tpch benchmark CLI #1391 (andygrove)

  • doc: Add Ballista extensions example to the docs. #1382 (LouisBurke)

  • feat: Add sort-based shuffle implementation #1389 (andygrove)

  • feat: New ballista python interface #1338 (milenkovicm)

  • doc: add more details for protobuf extension #1393 (LouisBurke)

  • feat: Add batch coalescing ability to shuffle reader exec #1380 (danielhumanmod)

  • docs: add Jupyter notebook support documentation #1399 (andygrove)

  • feat: Add arrow flight proxy to scheduler #1351 (sebbegg)

  • chore: update datafusion to 52 #1394 (killzoner)

  • feat: Creating SubstraitSchedulerClient and standalone Substrait examples #1376 (mattcuento)

  • fix: compile issue after unsuccessful merge #1402 (milenkovicm)

  • feat: Cluster RPC customisations to support TLS and custom headers #1400 (phillipleblanc)

  • chore: Document ballista features in README.md #1418 (mattcuento)

  • fix: prost build keda and TLS RPC example #1429 (killzoner)

  • Improve sort-based shuffle: single spill file per partition and batch coalescing #1431 (andygrove)

  • feat: add -c config override flag to tpch benchmark #1435 (andygrove)

  • feat: Extract execution_graph to a trait #1361 (milenkovicm)

  • chore: add confirmation before tarball is released #1445 (milenkovicm)

  • minor: add test to cover IPC arrow file read #1450 (milenkovicm)

  • feat: Add spark-compat mode to integrate datafusion-spark features au… #1416 (mattcuento)

  • feat: add Dataframe.cache() factory (no planner handling) #1420 (killzoner)

  • fix: remove scheduler_config_spec.toml as it is unused #1462 (milenkovicm)

  • feat: Adaptive query execution (AQE) planner fundamentals #1372 (milenkovicm)

  • feat: Make push scheduling policy default as it has lower latency #1461 (milenkovicm)

  • minor: improve log statements #1482 (milenkovicm)

  • chore: update datafusion to 52.2 and other deps to latest #1483 (milenkovicm)

  • fix: Don’t use maxrows as a “fetched rows” but calculate it from the batches #1480 (martin-g)

  • feat: job scheduling with push based job status updates #1478 (milenkovicm)