Apache DataFusion Ballista 0.5.0 Changelog
Breaking changes:
[ballista] support date_part and date_trunc ser/de, pass tpch 7 #840 (houqp)
Box ScalarValue:Lists, reduce size by half size #788 (alamb)
Support DataFrame.collect for Ballista DataFrames #785 (andygrove)
UnresolvedShuffleExec should represent a single shuffle #727 (andygrove)
Ballista: Make shuffle partitions configurable in benchmarks #702 (andygrove)
Ballista: Rename QueryStageExec to ShuffleWriterExec #633 (andygrove)
fix 593, reduce cloning by taking ownership in logical planner’s from fn #610 (Jimexist)
fix join column handling logic for On and Using constraints #605 (houqp)
ShuffleReaderExec now supports multiple locations per partition #541 (andygrove)
Use atomics for SQLMetric implementation, remove unused name field #25 (returnString)
Implemented enhancements:
Add crate documentation for Ballista crates #830
Support DataFrame.collect for Ballista DataFrames #787
Ballista: Prep for supporting shuffle correctly, part one #736
Ballista: Implement physical plan serde for ShuffleWriterExec #710
Ballista: Finish implementing shuffle mechanism #707
Rename QueryStageExec to ShuffleWriterExec #542
Ballista ShuffleReaderExec should be able to read from multiple locations per partition #540
[Ballista] Use deployments in k8s user guide #473
Ballista refactor QueryStageExec in preparation for map-side shuffle #458
Ballista: Implement map-side of shuffle #456
Refactor Ballista to separate Flight logic from execution logic #449
Use published versions of arrow rather than github shas #393
BallistaContext::collect() logging is too noisy #352
Update Ballista to use new physical plan formatter utility #343
Add Ballista Getting Started documentation #329
Remove references to ballistacompute Docker Hub repo #325
Implement scalable distributed joins #63
Remove hard-coded Ballista version from scripts #32
Implement streaming versions of Dataframe.collect methods #789 (andygrove)
Ballista shuffle is finally working as intended, providing scalable distributed joins #750 (andygrove)
Implement metrics for shuffle read and write #676 (andygrove)
Remove hard-coded PartitionMode from Ballista serde #637 (andygrove)
Ballista: Implement scalable distributed joins #634 (andygrove)
Add partition by constructs in window functions and modify logical planning #501 (Jimexist)
add order by construct in window function and logical plans #463 (Jimexist)
Refactor Ballista executor so that FlightService delegates to an Executor struct #450 (andygrove)
implement lead and lag built-in window function #429 (Jimexist)
Add window expression part 1 - logical and physical planning, structure, to/from proto, and explain, for empty over clause only #334 (Jimexist)
[breaking change] fix 265, log should be log10, and add ln #271 (Jimexist)
Allow table providers to indicate their type for catalog metadata #205 (returnString)
Use arrow eq kernels in CaseWhen expression evaluation #52 (Dandandan)
[DataFusion] Optimize hash join inner workings, null handling fix #24 (Dandandan)
Fixed bugs:
Ballista: TPC-H q3 @ SF=1000 never completes #835
Ballista does not support MIN/MAX aggregate functions #832
Ballista docker images fail to build #828
Ballista: UnresolvedShuffleExec should only have a single stage_id #726
Ballista integration tests are failing #623
Integration test build failure due to arrow-rs using unstable feature #596
cargo build cannot build the project #531
ShuffleReaderExec does not get formatted correctly in displayable physical plan #399
Ballista: Prep for fixing shuffle mechanism, part 1 #738 (andygrove)
honor table name for csv/parquet scan in ballista plan serde #629 (houqp)
MINOR: Fix integration tests by adding datafusion-cli module to docker image #322 (andygrove)
Documentation updates:
Add minimal crate documentation for Ballista crates #831 (andygrove)
Update ballista.proto link in architecture doc #502 (terrycorley)
Make it easier for developers to find Ballista documentation #330 (andygrove)
Instructions for cross-compiling Ballista to the Raspberry Pi #263 (andygrove)
Performance improvements:
Closed issues:
Confirm git tagging strategy for releases #770
arrow::util::pretty::pretty_format_batches missing #769
move the assert_batches_eq! macros to a non part of datafusion #745
fix an issue where aliases are not respected in generating downstream schemas in window expr #592
make the planner to print more succinct and useful information in window function explain clause #526
move window frame module to be in logical_plan #517
use a more rust idiomatic way of handling nth_value #448
Make Ballista not depend on arrow directly #446
create a test with more than one partition for window functions #435
Implement hash-partitioned hash aggregate #27
Consider using GitHub pages for DataFusion/Ballista documentation #18
Add Ballista to default cargo workspace #17
Update “repository” in Cargo.toml #16
Consolidate TPC-H benchmarks #6
[Ballista] Fix integration test script #4
Ballista should not have separate DataFrame implementation #2
Merged pull requests:
Change datatype of tpch keys from Int32 to UInt64 to support sf=1000 #836 (andygrove)
Update dependencies: prost to 0.8 and tonic to 0.5 #818 (alamb)
MINOR: Remove unused Ballista query execution code path #732 (andygrove)
in ballista also check for UI prettier changes #578 (Jimexist)
reuse datafusion physical planner in ballista building from protobuf #532 (Jimexist)
update cargo.toml in python crate and fix unit test due to hash joins #483 (Jimexist)
make VOLUME declaration in tpch datagen docker absolute #466 (crepererum)
Refactor QueryStageExec in preparation for implementing map-side shuffle #459 (andygrove)
Simplified usage of use arrow in ballista. #447 (jorgecarleitao)
Benchmark subcommand to distinguish between DataFusion and Ballista #402 (jgoday)
#352: BallistaContext::collect() logging is too noisy #394 (jgoday)
Update Ballista to use new physical plan formatter utility #344 (andygrove)
Remove references to Ballista Docker images published to ballistacompute Docker Hub repo #326 (andygrove)
Update arrow-rs deps (to fix build due to flatbuffers update) #224 (alamb)
This Changelog was automatically generated by github_changelog_generator