DataFusion Comet 0.17.0 Changelog
This release consists of 192 commits from 19 contributors. See credits at the end of this changelog for more information.
Fixed bugs:
fix: prevent native sort crash for Struct(Map(…)) keys #4157 (0lai0)
fix: add TimestampLTZ-as-NTZ correctness tests and compatibility docs #4220 (andygrove)
fix: complete native_datafusion Parquet schema-mismatch rejections #4229 (andygrove)
fix: configurable fallback when parquet vectorized reader is disabled (#4352) #4355 (andygrove)
fix: make BloomFilter intermediate buffer Spark-compatible #4390 (andygrove)
fix: make ToJson PartialEq consistent with PartialEq #4446 (0lai0)
fix(codegen): Use setSafe for fixed-width writes into nested collection children whose element count is data-dependent #4549 (mbutrovich)
fix: correct GetStructField null handling for null parent structs (value + nullability) #4523 (schenksj)
fix: route StringReplace through codegen dispatcher to fix empty-search divergence #4537 (andygrove)
fix: mark non-UTF8_BINARY collations as Incompatible for concat and reverse #4567 (andygrove)
fix: rebalance deep AND/OR chains to avoid protobuf recursion limit #4531 (schenksj)
fix: array_size returns -1 instead of null for null input #4578 (andygrove)
fix: codegen dispatcher returns NULL for invalid try_make_timestamp inputs (#4554) #4579 (andygrove)
fix: honor ANSI mode for make_date/next_day and stop next_day trimming #4566 (andygrove)
fix: clean up CometCast support-level reporting (#4501) #4595 (andygrove)
fix: Use thread context classloader for Iceberg class loading in CometScanRule #4609 (chern)
fix: allow EvalMode.TRY in CometRemainder to support try_mod #4615 (marvelshan)
fix: bump iceberg-rust dependency, pick up fix for duplicate rows when FileScanTask smaller than Parquet row group #4621 (mbutrovich)
fix: gate bit_length/octet_length on BinaryType and downgrade translate #4594 (andygrove)
fix: fall back to Spark for str_to_map when legacy regex split truncation is enabled #4627 (andygrove)
fix: accept CometBroadcastNestedLoopJoinExec in Spark 3.4 SPARK-34593 plan assertions #4644 (andygrove)
fix: route decode through codegen dispatcher to honor Spark 4.0 legacy flags #4639 (andygrove)
Performance related:
fix: allow safe mixed Spark/Comet partial/final aggregate execution #4015 (andygrove)
perf: use bulk-NULL semantics in split and substring, skip Vec allocation in split #4403 (mbutrovich)
perf: cache offsetBufferAddress in CometPlainVector for variable-width vectors #4364 (0lai0)
perf: cache CometDecodedVector validityBufferAddress #4435 (mbutrovich)
perf: avoid FFI import/export between native subtree and ShuffleWriter #4507 (mbutrovich)
perf: replace CometBatchIterator FFI input path with the Arrow C Stream Interface #4572 (mbutrovich)
Implemented enhancements:
feat: add JVM UDF framework for native execution #4232 (andygrove)
feat: accelerate Iceberg RewriteDataFiles reads via Comet native scan #4251 (jordepic)
feat: add native support for substring_index expression #4286 (andygrove)
feat: add native support for greatest and least expressions #4274 (andygrove)
feat: Add num_rows and TaskContext to CometUDFBridge.evaluate #4306 (mbutrovich)
feat: Support Spark Expression Decode #4284 (YutaLin)
feat: support stateful CometUDFs #4345 (mbutrovich)
feat: Wire DataFusion function Claude skill and csc #4337 (comphead)
feat: Support Spark expression: current_time_zone #4348 (YutaLin)
feat: Support Spark expression: local_timestamp #4331 (YutaLin)
feat: add from_utc_timestamp and to_utc_timestamp expressions #4308 (andygrove)
feat: add support for posexplode and posexplode_outer #4270 (andygrove)
feat: disable Comet by default when CometShuffleManager is not registered #4328 (andygrove)
feat: add GroupsAccumulator for variance, stddev, covariance, correlation #4254 (andygrove)
feat: wire factorial and update wire skill #4349 (comphead)
feat: Support Spark expression: convert_timezone #4369 (YutaLin)
feat: implement make_time and to_time #4256 (parthchandra)
feat: adding math sec expression #4371 (athlcode)
feat: implement parse_url #4350 (parthchandra)
feat(experimental): ScalaUDF and Java UDF support via Janino codegen #4267 (mbutrovich)
feat: expand date/time expression support using codegen dispatcher #4417 (andygrove)
feat: vendor-pluggable S3 credentials for native scans #4309 (mbutrovich)
feat: make parse_url compatible #4413 (parthchandra)
feat: support Spark expression slice #4149 (andygrove)
feat: support NullType in row-to-Arrow conversion and shuffle #4460 (mbutrovich)
feat: route Upper/Lower/InitCap through codegen dispatcher #4499 (andygrove)
feat: add GetTimestamp support via codegen dispatcher #4454 (andygrove)
feat: enable JVM Scala UDF codegen dispatch by default #4514 (andygrove)
feat: support dayname and monthname natively #4544 (andygrove)
feat: route aes_encrypt / aes_decrypt / try_aes_decrypt through codegen dispatcher #4557 (andygrove)
feat: support Spark expression json_array_length #4365 (kazantsev-maksim)
feat: 100% Spark-compatible JSON support via codegen dispatcher #4305 (andygrove)
feat: Add 100% Spark-compatible regex support via codegen dispatcher #4239 (andygrove)
feat: route Map → Map casts to native cast_map_to_map #4606 (slavlotski)
feat: route structured-text functions through codegen dispatcher #4620 (andygrove)
feat: route additional scalar expressions through codegen dispatcher #4538 (andygrove)
feat: route higher-order functions through codegen dispatcher #4618 (andygrove)
feat: Native Broadcast nested loop join support #4429 (coderfender)
feat: opt timezone expressions into codegen dispatch #4638 (andygrove)
feat: opt array_intersect, array_except, array_join into codegen dispatch #4636 (andygrove)
feat: add native implementations of regexp_extract and regexp_extract_all #4146 (andygrove)
feat: wire try_to_number and general filter lambda through codegen dispatch #4634 (andygrove)
feat: wire mask and map (create_map) through codegen dispatch #4635 (andygrove)
Documentation updates:
docs: clarify Maven test invocation for ScalaTest suites #4238 (andygrove)
docs: Add benchmark results for 0.16.0 #4272 (andygrove)
docs: Update benchmark results #4300 (andygrove)
docs: add AGENTS.md with build, test, and Spark diff guidance #4083 (andygrove)
docs: update user guide supported expressions list #4304 (andygrove)
docs: add contributor guide for bringing up a new Spark version #4133 (andygrove)
docs: show child links on Expression Compatibility page #4319 (andygrove)
docs: move changelogs from dev/ to docs/source/changelog/ #4330 (andygrove)
docs: add versioning policy #4324 (andygrove)
docs: remove references to native_datafusion and native_iceberg_compat scans #4362 (andygrove)
docs: collapse archived user guide versions behind a single Older Versions page #4426 (andygrove)
docs: group user and contributor guide nav into captioned sections #4424 (andygrove)
docs: list date/time expressions added in #4417 #4443 (andygrove)
docs: clarify support-level and reason consistency in audit-comet-expression skill #4447 (andygrove)
docs: mark explode/posexplode, cast aliases, and rewrite-backed datetime functions as supported #4543 (andygrove)
docs: Rewrite supported expressions page to show complete overview of what is and is not supported by Comet #4550 (andygrove)
docs: rework expression docs (source-of-truth status, per-category audits, refined status semantics) #4568 (andygrove)
docs: fix conflicting status legend and inconsistent issue links in expressions.md #4574 (andygrove)
docs: stop prettier table re-alignment churn in expressions.md #4583 (andygrove)
docs: lead README with the Arrow-native framing #4428 (andygrove)
docs: require audit skill to file issues and add Spark 4.1.1 to version list #4468 (andygrove)
docs: remove unused status legend entries from expression reference #4622 (andygrove)
docs: mark dayname, monthname, and regexp_extract family as supported in expressions.md #4628 (andygrove)
docs: explain native vs codegen-dispatch implementation model #4629 (andygrove)
docs: Rewrite operators page to show complete overview of what is and is not supported by Comet #4563 (andygrove)
docs: reflect codegen dispatch fallback in expression compatibility guide #4649 (andygrove)
docs: Update to reflect removal of CometBatchIterator #4659 (mbutrovich)
docs: adopt issue #4419 terminology in Understanding Comet Plans guide #4650 (andygrove)
docs: adopt issue #4419 terminology in Scala/Java UDF guide #4651 (andygrove)
docs: adopt issue #4419 terminology in data sources guide #4652 (andygrove)
docs: [branch-0.17] generate release docs, update script #4663 (mbutrovich)
Other:
chore(deps): bump arrow from 58.1.0 to 58.2.0 in /native #4264 (dependabot[bot])
chore(deps): bump tokio from 1.52.1 to 1.52.2 in /native in the all-other-cargo-deps group #4263 (dependabot[bot])
test: remove “Comet (Scan)” cases from microbenchmarks #4258 (andygrove)
fix(spark-expr): preserve scalar tag in WideDecimalBinaryExpr when both inputs are scalars #4187 (tomz)
chore: remove fuzz-testing maven module #4085 (andygrove)
chore: Document Comet runtime data debug #4235 (comphead)
build: remove docker-publish workflow #4241 (andygrove)
build: fix OOM on standard GitHub runners for Spark SQL tests #4285 (andygrove)
build: Start 0.17.0 development #4273 (andygrove)
ci: convert spark_sql_test paths-ignore to explicit paths allow-list #4290 (andygrove)
ci: skip workflows for PRs tagged [skip ci] or labeled skip-ci #4291 (andygrove)
Revert “ci: skip workflows for PRs tagged with skip ci” #4301 (andygrove)
chore: update documentation links for 0.16.0 release #4314 (andygrove)
chore: native_datafusion use try_pushdown_filters #4299 (mbutrovich)
test: add SQL test coverage for spark.sql.legacy.timeParserPolicy #4183 (andygrove)
ci: use ubuntu-slim for lightweight jobs #4326 (mbutrovich)
deps: bump arrow and parquet to 58.3.0 #4346 (mbutrovich)
ci: fix Spark 4.0.2/JDK 21 flake by enabling per-suite dedicated JVMs #4360 (andygrove)
refactor: Move most of comet-common module into comet-spark #4325 (andygrove)
chore: Remove config option for native_iceberg_compat #4019 (andygrove)
chore: remove dead native_iceberg_compat code path #4363 (andygrove)
test: let Spark 4 test profiles use the Spark-default ANSI mode #4370 (andygrove)
test: enable nested array cast coverage #4278 (manuzhang)
chore(deps): bump the all-other-cargo-deps group across 1 directory with 3 updates #4340 (dependabot[bot])
chore: wire rint built in function #4372 (comphead)
test: fix merge conflicts in CodegenFuzzSuite and CodegenSuite #4388 (mbutrovich)
chore: drop leftover JVM Parquet helpers and the native_datafusion scan name #4385 (andygrove)
chore: remove dead useDecimal128 plumbing #4382 (andygrove)
chore(deps): bump actions/stale from 10.2.0 to 10.3.0 #4400 (dependabot[bot])
chore(deps): bump github/codeql-action from 4.35.2 to 4.35.5 #4401 (dependabot[bot])
chore(deps): bump the all-other-cargo-deps group in /native with 3 updates #4402 (dependabot[bot])
ci: use path allow list for iceberg workflow triggers #4407 (andygrove)
ci: split spark_sql_test workflow per Spark version #4408 (andygrove)
ci: Run macOS PR build on single Spark version #4409 (andygrove)
ci: run miri nightly instead of on every push and PR #4411 (andygrove)
ci: consolidate pr_build test matrix and switch triggers to allow-list #4410 (andygrove)
ci: split iceberg_spark_test workflow per Iceberg version #4414 (andygrove)
Feat: to_json Infinity/-Infinity Nan values support #3875 (kazantsev-maksim)
ci: scope Spark SQL trigger paths to per-version shims and diff #4415 (andygrove)
chore: remove dead vector classes left over from native_iceberg_compat removal #4416 (andygrove)
chore: wire shiftrightunsigned #4375 (comphead)
chore(audit): audit BitAndAgg and expand tests #4437 (andygrove)
chore(audit): audit any and expand tests #4436 (andygrove)
chore(audit): audit Average and expand tests #4439 (andygrove)
bug: no column projection should still persist row count #4444 (coderfender)
chore(audit): audit struct expressions across Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 #4469 (andygrove)
chore(audit): audit math expressions across Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 #4486 (andygrove)
chore(audit): audit cast across Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 #4493 (andygrove)
chore(audit): audit remaining array expressions across Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 #4483 (andygrove)
chore(audit): audit conditional expressions across Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 #4475 (andygrove)
chore(audit): audit bitwise expressions across Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 #4479 (andygrove)
chore(audit): audit predicate expressions across Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 #4480 (andygrove)
chore(audit): audit map expressions across Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 #4478 (andygrove)
chore(audit): audit misc expressions across Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 #4474 (andygrove)
chore(audit): audit collection expressions across Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 #4473 (andygrove)
chore(audit): audit json expressions across Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 #4470 (andygrove)
chore(deps): bump github/codeql-action from 4.35.5 to 4.36.0 #4510 (dependabot[bot])
chore(deps): bump the all-other-cargo-deps group in /native with 3 updates #4511 (dependabot[bot])
ci: gate long-running jobs behind ubuntu-slim jobs #4494 (mbutrovich)
refactor: rename withInfo to withFallbackReason for clarity #4508 (andygrove)
chore(audit): audit hash expressions across Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 #4476 (andygrove)
test: add array_contains null semantics coverage #4422 (michaelmitchell-bit)
test: add SQL file tests for regr_* linear-regression aggregates #4551 (andygrove)
test: add SQL file tests for RuntimeReplaceable functions accelerated by Comet #4562 (andygrove)
test: add SQL file tests for try_to_date, try_to_timestamp, try_make_timestamp #4555 (andygrove)
build: upgrade Spark 4.1 to 4.1.2 #4399 (manuzhang)
chore: drop misleading ANSI-mode incompatibility note from CometSum #4111 (coderfender)
ci: add fast syntactic-only scalafix gate #4581 (andygrove)
test: enable float/double/binary array casts to string #4386 (manuzhang)
test: add timestamp ntz array cast coverage #4589 (manuzhang)
test: cover array date cast fallback #4593 (manuzhang)
chore: document programmatic access to Comet fallback reasons #4597 (comphead)
chore(deps): bump coursier/setup-action from 1 to 3 #4600 (dependabot[bot])
chore(deps): bump github/codeql-action from 4.36.0 to 4.36.2 #4601 (dependabot[bot])
ci: stop labeled events from cancelling the commit CI run #4610 (mbutrovich)
chore(deps): bump assertables from 9.9.0 to 10.1.0 in /native #4604 (dependabot[bot])
chore(deps): bump the all-other-cargo-deps group across 1 directory with 5 updates #4602 (dependabot[bot])
chore(deps): align object_store_opendal with opendal #4612 (manuzhang)
chore(shuffle): add interleave_time metric and specify buffer size for output_data buffer writer #4599 (wForget)
refactor: route shim-registered expressions through CometExpressionSerde #4139 (andygrove)
chore(audit): audit string expressions across Spark 3.4.3, 3.5.8, 4.0.1 #4461 (andygrove)
chore(audit): audit any_value and expand tests #4438 (andygrove)
chore: specify heap, metadata mem sizes for sql_core* tests #4623 (comphead)
chore: audit date/time expressions #4448 (andygrove)
bug: fix corrupt 4.1.2 patch file #4642 (coderfender)
chore: fallback for spark.sql.legacy.castComplexTypesToString.enabled = true #4630 (comphead)
chore: refactor CI to have centralized SBT action #4643 (comphead)
chore: Add join benchmarks #4598 (coderfender)
chore: [branch-0.17] change version from 0.17.0-SNAPSHOT to 0.17.0 #4661 (mbutrovich)
chore: [branch-0.17] update release_process.md #4666 (mbutrovich)
Credits
Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor.
117 Andy Grove
21 Matt Butrovich
12 dependabot[bot]
9 Oleks V
6 Manu Zhang
5 Bhargava Vadlamani
4 Bolin Lin
3 ChenChen Lai
3 Parth Chandra
2 Kazantsev Maksim
2 Scott Schenkein
1 Jordan Epstein
1 Krishna Sudarshan J
1 Tom Zeng
1 Vladislav Zabolotsky
1 William Chern
1 Zaki
1 Zhen Wang
1 michaelmitchell-bit
Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release.