Apache DataFusion Ballista 0.5.0 Changelog

Full Changelog

Breaking changes:

  • [ballista] support date_part and date_trunc ser/de, pass tpch 7 #840 (houqp)

  • Box ScalarValue::List, reduce size by half #788 (alamb)

  • Support DataFrame.collect for Ballista DataFrames #785 (andygrove)

  • JOIN conditions are order dependent #778 (seddonm1)

  • UnresolvedShuffleExec should represent a single shuffle #727 (andygrove)

  • Ballista: Make shuffle partitions configurable in benchmarks #702 (andygrove)

  • Rename MergeExec to CoalescePartitionsExec #635 (andygrove)

  • Ballista: Rename QueryStageExec to ShuffleWriterExec #633 (andygrove)

  • fix 593, reduce cloning by taking ownership in logical planner’s from fn #610 (Jimexist)

  • fix join column handling logic for On and Using constraints #605 (houqp)

  • Move ballista standalone mode to client #589 (edrevo)

  • Ballista: Implement map-side shuffle #543 (andygrove)

  • ShuffleReaderExec now supports multiple locations per partition #541 (andygrove)

  • Make external hostname in executor optional #232 (edrevo)

  • Remove namespace from executors #75 (edrevo)

  • Support qualified columns in queries #55 (houqp)

  • Read CSV format text from stdin or memory #54 (heymind)

  • Remove Ballista DataFrame #48 (andygrove)

  • Use atomics for SQLMetric implementation, remove unused name field #25 (returnString)

Implemented enhancements:

  • Add crate documentation for Ballista crates #830

  • Support DataFrame.collect for Ballista DataFrames #787

  • Ballista: Prep for supporting shuffle correctly, part one #736

  • Ballista: Implement physical plan serde for ShuffleWriterExec #710

  • Ballista: Finish implementing shuffle mechanism #707

  • Rename QueryStageExec to ShuffleWriterExec #542

  • Ballista ShuffleReaderExec should be able to read from multiple locations per partition #540

  • [Ballista] Use deployments in k8s user guide #473

  • Ballista refactor QueryStageExec in preparation for map-side shuffle #458

  • Ballista: Implement map-side of shuffle #456

  • Refactor Ballista to separate Flight logic from execution logic #449

  • Use published versions of arrow rather than github shas #393

  • BallistaContext::collect() logging is too noisy #352

  • Update Ballista to use new physical plan formatter utility #343

  • Add Ballista Getting Started documentation #329

  • Remove references to ballistacompute Docker Hub repo #325

  • Implement scalable distributed joins #63

  • Remove hard-coded Ballista version from scripts #32

  • Implement streaming versions of DataFrame.collect methods #789 (andygrove)

  • Ballista shuffle is finally working as intended, providing scalable distributed joins #750 (andygrove)

  • Update to use arrow 5.0 #721 (alamb)

  • Implement serde for ShuffleWriterExec #712 (andygrove)

  • dedup using join column in wildcard expansion #678 (houqp)

  • Implement metrics for shuffle read and write #676 (andygrove)

  • Remove hard-coded PartitionMode from Ballista serde #637 (andygrove)

  • Ballista: Implement scalable distributed joins #634 (andygrove)

  • Add Keda autoscaling for ballista in k8s #586 (edrevo)

  • Add some resiliency to lost executors #568 (edrevo)

  • Add partition by constructs in window functions and modify logical planning #501 (Jimexist)

  • Support anti join #482 (Dandandan)

  • add order by construct in window function and logical plans #463 (Jimexist)

  • Refactor Ballista executor so that FlightService delegates to an Executor struct #450 (andygrove)

  • implement lead and lag built-in window function #429 (Jimexist)

  • Implement fmt_as for ShuffleReaderExec #400 (andygrove)

  • Add window expression part 1 - logical and physical planning, structure, to/from proto, and explain, for empty over clause only #334 (Jimexist)

  • [breaking change] fix 265, log should be log10, and add ln #271 (Jimexist)

  • Allow table providers to indicate their type for catalog metadata #205 (returnString)

  • Add query 19 to TPC-H regression tests #59 (Dandandan)

  • Use arrow eq kernels in CaseWhen expression evaluation #52 (Dandandan)

  • Add option param for standalone mode #42 (djKooks)

  • [DataFusion] Optimize hash join inner workings, null handling fix #24 (Dandandan)

  • [Ballista] Docker files for ui #22 (msathis)

Fixed bugs:

  • Ballista: TPC-H q3 @ SF=1000 never completes #835

  • Ballista does not support MIN/MAX aggregate functions #832

  • Ballista docker images fail to build #828

  • Ballista: UnresolvedShuffleExec should only have a single stage_id #726

  • Ballista integration tests are failing #623

  • Integration test build failure due to arrow-rs using unstable feature #596

  • cargo build cannot build the project #531

  • ShuffleReaderExec does not get formatted correctly in displayable physical plan #399

  • Implement serde for MIN and MAX #833 (andygrove)

  • Ballista: Prep for fixing shuffle mechanism, part 1 #738 (andygrove)

  • Ballista: Shuffle write bug fix #714 (andygrove)

  • honor table name for csv/parquet scan in ballista plan serde #629 (houqp)

  • MINOR: Fix integration tests by adding datafusion-cli module to docker image #322 (andygrove)

Documentation updates:

Performance improvements:

  • Ballista: Avoid sleeping between polling for tasks #698 (Dandandan)

  • Make BallistaContext::collect streaming #535 (edrevo)

Closed issues:

  • Confirm git tagging strategy for releases #770

  • arrow::util::pretty::pretty_format_batches missing #769

  • move the assert_batches_eq! macros to a non part of datafusion #745

  • fix an issue where aliases are not respected in generating downstream schemas in window expr #592

  • make the planner to print more succinct and useful information in window function explain clause #526

  • move window frame module to be in logical_plan #517

  • use a more rust idiomatic way of handling nth_value #448

  • Make Ballista not depend on arrow directly #446

  • create a test with more than one partition for window functions #435

  • Implement hash-partitioned hash aggregate #27

  • Consider using GitHub pages for DataFusion/Ballista documentation #18

  • Add Ballista to default cargo workspace #17

  • Update “repository” in Cargo.toml #16

  • Consolidate TPC-H benchmarks #6

  • [Ballista] Fix integration test script #4

  • Ballista should not have separate DataFrame implementation #2

Merged pull requests:

  • Change datatype of tpch keys from Int32 to UInt64 to support sf=1000 #836 (andygrove)

  • Add ballista-examples to docker build #829 (andygrove)

  • Update dependencies: prost to 0.8 and tonic to 0.5 #818 (alamb)

  • Move hash_array into hash_utils.rs #807 (alamb)

  • Fix: Update clippy lints for Rust 1.54 #794 (alamb)

  • MINOR: Remove unused Ballista query execution code path #732 (andygrove)

  • [fix] benchmark run with compose #666 (rdettai)

  • bring back dev scripts for ballista #648 (Jimexist)

  • Remove unnecessary mutex #639 (edrevo)

  • round trip TPCH queries in tests #630 (houqp)

  • Fix build #627 (andygrove)

  • in ballista also check for UI prettier changes #578 (Jimexist)

  • turn on clippy rule for needless borrow #545 (Jimexist)

  • reuse datafusion physical planner in ballista building from protobuf #532 (Jimexist)

  • update cargo.toml in python crate and fix unit test due to hash joins #483 (Jimexist)

  • make VOLUME declaration in tpch datagen docker absolute #466 (crepererum)

  • Refactor QueryStageExec in preparation for implementing map-side shuffle #459 (andygrove)

  • Simplified usage of use arrow in ballista. #447 (jorgecarleitao)

  • Benchmark subcommand to distinguish between DataFusion and Ballista #402 (jgoday)

  • #352: BallistaContext::collect() logging is too noisy #394 (jgoday)

  • cleanup function return type fn #350 (Jimexist)

  • Update Ballista to use new physical plan formatter utility #344 (andygrove)

  • Update arrow dependencies again #341 (alamb)

  • Remove references to Ballista Docker images published to ballistacompute Docker Hub repo #326 (andygrove)

  • Update arrow-rs deps #317 (alamb)

  • Update arrow deps #269 (alamb)

  • Enable redundant_field_names clippy lint #261 (Dandandan)

  • Update arrow-rs deps (to fix build due to flatbuffers update) #224 (alamb)

  • update arrow-rs deps to latest master #216 (alamb)

* This Changelog was automatically generated by github_changelog_generator