DataFusion Comet 0.17.0 Changelog

This release consists of 192 commits from 19 contributors. See credits at the end of this changelog for more information.

Fixed bugs:

  • fix: prevent native sort crash for Struct(Map(…)) keys #4157 (0lai0)

  • fix: add TimestampLTZ-as-NTZ correctness tests and compatibility docs #4220 (andygrove)

  • fix: complete native_datafusion Parquet schema-mismatch rejections #4229 (andygrove)

  • fix: configurable fallback when parquet vectorized reader is disabled (#4352) #4355 (andygrove)

  • fix: make BloomFilter intermediate buffer Spark-compatible #4390 (andygrove)

  • fix: make ToJson PartialEq consistent with PartialEq #4446 (0lai0)

  • fix(codegen): Use setSafe for fixed-width writes into nested collection children whose element count is data-dependent #4549 (mbutrovich)

  • fix: correct GetStructField null handling for null parent structs (value + nullability) #4523 (schenksj)

  • fix: route StringReplace through codegen dispatcher to fix empty-search divergence #4537 (andygrove)

  • fix: mark non-UTF8_BINARY collations as Incompatible for concat and reverse #4567 (andygrove)

  • fix: rebalance deep AND/OR chains to avoid protobuf recursion limit #4531 (schenksj)

  • fix: array_size returns -1 instead of null for null input #4578 (andygrove)

  • fix: codegen dispatcher returns NULL for invalid try_make_timestamp inputs (#4554) #4579 (andygrove)

  • fix: honor ANSI mode for make_date/next_day and stop next_day trimming #4566 (andygrove)

  • fix: clean up CometCast support-level reporting (#4501) #4595 (andygrove)

  • fix: Use thread context classloader for Iceberg class loading in CometScanRule #4609 (chern)

  • fix: allow EvalMode.TRY in CometRemainder to support try_mod #4615 (marvelshan)

  • fix: bump iceberg-rust dependency, pick up fix for duplicate rows when FileScanTask smaller than Parquet row group #4621 (mbutrovich)

  • fix: gate bit_length/octet_length on BinaryType and downgrade translate #4594 (andygrove)

  • fix: fall back to Spark for str_to_map when legacy regex split truncation is enabled #4627 (andygrove)

  • fix: accept CometBroadcastNestedLoopJoinExec in Spark 3.4 SPARK-34593 plan assertions #4644 (andygrove)

  • fix: route decode through codegen dispatcher to honor Spark 4.0 legacy flags #4639 (andygrove)

Performance related:

  • fix: allow safe mixed Spark/Comet partial/final aggregate execution #4015 (andygrove)

  • perf: use bulk-NULL semantics in split and substring, skip Vec allocation in split #4403 (mbutrovich)

  • perf: cache offsetBufferAddress in CometPlainVector for variable-width vectors #4364 (0lai0)

  • perf: cache CometDecodedVector validityBufferAddress #4435 (mbutrovich)

  • perf: avoid FFI import/export between native subtree and ShuffleWriter #4507 (mbutrovich)

  • perf: replace CometBatchIterator FFI input path with the Arrow C Stream Interface #4572 (mbutrovich)

Implemented enhancements:

  • feat: add JVM UDF framework for native execution #4232 (andygrove)

  • feat: accelerate Iceberg RewriteDataFiles reads via Comet native scan #4251 (jordepic)

  • feat: add native support for substring_index expression #4286 (andygrove)

  • feat: add native support for greatest and least expressions #4274 (andygrove)

  • feat: Add num_rows and TaskContext to CometUDFBridge.evaluate #4306 (mbutrovich)

  • feat: Support Spark Expression Decode #4284 (YutaLin)

  • feat: support stateful CometUDFs #4345 (mbutrovich)

  • feat: Wire DataFusion function Claude skill and csc #4337 (comphead)

  • feat: Support Spark expression: current_time_zone #4348 (YutaLin)

  • feat: Support Spark expression: local_timestamp #4331 (YutaLin)

  • feat: add from_utc_timestamp and to_utc_timestamp expressions #4308 (andygrove)

  • feat: add support for posexplode and posexplode_outer #4270 (andygrove)

  • feat: disable Comet by default when CometShuffleManager is not registered #4328 (andygrove)

  • feat: add GroupsAccumulator for variance, stddev, covariance, correlation #4254 (andygrove)

  • feat: wire factorial and update wire skill #4349 (comphead)

  • feat: Support Spark expression: convert_timezone #4369 (YutaLin)

  • feat: implement make_time and to_time #4256 (parthchandra)

  • feat: adding math sec expression #4371 (athlcode)

  • feat: implement parse_url #4350 (parthchandra)

  • feat(experimental): ScalaUDF and Java UDF support via Janino codegen #4267 (mbutrovich)

  • feat: expand date/time expression support using codegen dispatcher #4417 (andygrove)

  • feat: vendor-pluggable S3 credentials for native scans #4309 (mbutrovich)

  • feat: make parse_url compatible #4413 (parthchandra)

  • feat: support Spark expression slice #4149 (andygrove)

  • feat: support NullType in row-to-Arrow conversion and shuffle #4460 (mbutrovich)

  • feat: route Upper/Lower/InitCap through codegen dispatcher #4499 (andygrove)

  • feat: add GetTimestamp support via codegen dispatcher #4454 (andygrove)

  • feat: enable JVM Scala UDF codegen dispatch by default #4514 (andygrove)

  • feat: support dayname and monthname natively #4544 (andygrove)

  • feat: route aes_encrypt / aes_decrypt / try_aes_decrypt through codegen dispatcher #4557 (andygrove)

  • feat: support Spark expression json_array_length #4365 (kazantsev-maksim)

  • feat: 100% Spark-compatible JSON support via codegen dispatcher #4305 (andygrove)

  • feat: Add 100% Spark-compatible regex support via codegen dispatcher #4239 (andygrove)

  • feat: route Map → Map casts to native cast_map_to_map #4606 (slavlotski)

  • feat: route structured-text functions through codegen dispatcher #4620 (andygrove)

  • feat: route additional scalar expressions through codegen dispatcher #4538 (andygrove)

  • feat: route higher-order functions through codegen dispatcher #4618 (andygrove)

  • feat: Native Broadcast nested loop join support #4429 (coderfender)

  • feat: opt timezone expressions into codegen dispatch #4638 (andygrove)

  • feat: opt array_intersect, array_except, array_join into codegen dispatch #4636 (andygrove)

  • feat: add native implementations of regexp_extract and regexp_extract_all #4146 (andygrove)

  • feat: wire try_to_number and general filter lambda through codegen dispatch #4634 (andygrove)

  • feat: wire mask and map (create_map) through codegen dispatch #4635 (andygrove)

Documentation updates:

  • docs: clarify Maven test invocation for ScalaTest suites #4238 (andygrove)

  • docs: Add benchmark results for 0.16.0 #4272 (andygrove)

  • docs: Update benchmark results #4300 (andygrove)

  • docs: add AGENTS.md with build, test, and Spark diff guidance #4083 (andygrove)

  • docs: update user guide supported expressions list #4304 (andygrove)

  • docs: add contributor guide for bringing up a new Spark version #4133 (andygrove)

  • docs: show child links on Expression Compatibility page #4319 (andygrove)

  • docs: move changelogs from dev/ to docs/source/changelog/ #4330 (andygrove)

  • docs: add versioning policy #4324 (andygrove)

  • docs: remove references to native_datafusion and native_iceberg_compat scans #4362 (andygrove)

  • docs: collapse archived user guide versions behind a single Older Versions page #4426 (andygrove)

  • docs: group user and contributor guide nav into captioned sections #4424 (andygrove)

  • docs: list date/time expressions added in #4417 #4443 (andygrove)

  • docs: clarify support-level and reason consistency in audit-comet-expression skill #4447 (andygrove)

  • docs: mark explode/posexplode, cast aliases, and rewrite-backed datetime functions as supported #4543 (andygrove)

  • docs: Rewrite supported expressions page to show complete overview of what is and is not supported by Comet #4550 (andygrove)

  • docs: rework expression docs (source-of-truth status, per-category audits, refined status semantics) #4568 (andygrove)

  • docs: fix conflicting status legend and inconsistent issue links in expressions.md #4574 (andygrove)

  • docs: stop prettier table re-alignment churn in expressions.md #4583 (andygrove)

  • docs: lead README with the Arrow-native framing #4428 (andygrove)

  • docs: require audit skill to file issues and add Spark 4.1.1 to version list #4468 (andygrove)

  • docs: remove unused status legend entries from expression reference #4622 (andygrove)

  • docs: mark dayname, monthname, and regexp_extract family as supported in expressions.md #4628 (andygrove)

  • docs: explain native vs codegen-dispatch implementation model #4629 (andygrove)

  • docs: Rewrite operators page to show complete overview of what is and is not supported by Comet #4563 (andygrove)

  • docs: reflect codegen dispatch fallback in expression compatibility guide #4649 (andygrove)

  • docs: Update to reflect removal of CometBatchIterator #4659 (mbutrovich)

  • docs: adopt issue #4419 terminology in Understanding Comet Plans guide #4650 (andygrove)

  • docs: adopt issue #4419 terminology in Scala/Java UDF guide #4651 (andygrove)

  • docs: adopt issue #4419 terminology in data sources guide #4652 (andygrove)

  • docs: [branch-0.17] generate release docs, update script #4663 (mbutrovich)

Other:

  • chore(deps): bump arrow from 58.1.0 to 58.2.0 in /native #4264 (dependabot[bot])

  • chore(deps): bump tokio from 1.52.1 to 1.52.2 in /native in the all-other-cargo-deps group #4263 (dependabot[bot])

  • test: remove “Comet (Scan)” cases from microbenchmarks #4258 (andygrove)

  • fix(spark-expr): preserve scalar tag in WideDecimalBinaryExpr when both inputs are scalars #4187 (tomz)

  • chore: remove fuzz-testing maven module #4085 (andygrove)

  • chore: Document Comet runtime data debug #4235 (comphead)

  • build: remove docker-publish workflow #4241 (andygrove)

  • build: fix OOM on standard GitHub runners for Spark SQL tests #4285 (andygrove)

  • build: Start 0.17.0 development #4273 (andygrove)

  • ci: convert spark_sql_test paths-ignore to explicit paths allow-list #4290 (andygrove)

  • ci: skip workflows for PRs tagged [skip ci] or labeled skip-ci #4291 (andygrove)

  • Revert “ci: skip workflows for PRs tagged with skip ci” #4301 (andygrove)

  • chore: update documentation links for 0.16.0 release #4314 (andygrove)

  • chore: native_datafusion use try_pushdown_filters #4299 (mbutrovich)

  • test: add SQL test coverage for spark.sql.legacy.timeParserPolicy #4183 (andygrove)

  • ci: use ubuntu-slim for lightweight jobs #4326 (mbutrovich)

  • deps: bump arrow and parquet to 58.3.0 #4346 (mbutrovich)

  • ci: fix Spark 4.0.2/JDK 21 flake by enabling per-suite dedicated JVMs #4360 (andygrove)

  • refactor: Move most of comet-common module into comet-spark #4325 (andygrove)

  • chore: Remove config option for native_iceberg_compat #4019 (andygrove)

  • chore: remove dead native_iceberg_compat code path #4363 (andygrove)

  • test: let Spark 4 test profiles use the Spark-default ANSI mode #4370 (andygrove)

  • test: enable nested array cast coverage #4278 (manuzhang)

  • chore(deps): bump the all-other-cargo-deps group across 1 directory with 3 updates #4340 (dependabot[bot])

  • chore: wire rint built in function #4372 (comphead)

  • test: fix merge conflicts in CodegenFuzzSuite and CodegenSuite #4388 (mbutrovich)

  • chore: drop leftover JVM Parquet helpers and the native_datafusion scan name #4385 (andygrove)

  • chore: remove dead useDecimal128 plumbing #4382 (andygrove)

  • chore(deps): bump actions/stale from 10.2.0 to 10.3.0 #4400 (dependabot[bot])

  • chore(deps): bump github/codeql-action from 4.35.2 to 4.35.5 #4401 (dependabot[bot])

  • chore(deps): bump the all-other-cargo-deps group in /native with 3 updates #4402 (dependabot[bot])

  • ci: use path allow list for iceberg workflow triggers #4407 (andygrove)

  • ci: split spark_sql_test workflow per Spark version #4408 (andygrove)

  • ci: Run macOS PR build on single Spark version #4409 (andygrove)

  • ci: run miri nightly instead of on every push and PR #4411 (andygrove)

  • ci: consolidate pr_build test matrix and switch triggers to allow-list #4410 (andygrove)

  • ci: split iceberg_spark_test workflow per Iceberg version #4414 (andygrove)

  • Feat: to_json Infinity/-Infinity Nan values support #3875 (kazantsev-maksim)

  • ci: scope Spark SQL trigger paths to per-version shims and diff #4415 (andygrove)

  • chore: remove dead vector classes left over from native_iceberg_compat removal #4416 (andygrove)

  • chore: wire shiftrightunsigned #4375 (comphead)

  • chore(audit): audit BitAndAgg and expand tests #4437 (andygrove)

  • chore(audit): audit any and expand tests #4436 (andygrove)

  • chore(audit): audit Average and expand tests #4439 (andygrove)

  • bug: no column projection should still persist row count #4444 (coderfender)

  • chore(audit): audit struct expressions across Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 #4469 (andygrove)

  • chore(audit): audit math expressions across Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 #4486 (andygrove)

  • chore(audit): audit cast across Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 #4493 (andygrove)

  • chore(audit): audit remaining array expressions across Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 #4483 (andygrove)

  • chore(audit): audit conditional expressions across Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 #4475 (andygrove)

  • chore(audit): audit bitwise expressions across Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 #4479 (andygrove)

  • chore(audit): audit predicate expressions across Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 #4480 (andygrove)

  • chore(audit): audit map expressions across Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 #4478 (andygrove)

  • chore(audit): audit misc expressions across Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 #4474 (andygrove)

  • chore(audit): audit collection expressions across Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 #4473 (andygrove)

  • chore(audit): audit json expressions across Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 #4470 (andygrove)

  • chore(deps): bump github/codeql-action from 4.35.5 to 4.36.0 #4510 (dependabot[bot])

  • chore(deps): bump the all-other-cargo-deps group in /native with 3 updates #4511 (dependabot[bot])

  • ci: gate long-running jobs behind ubuntu-slim jobs #4494 (mbutrovich)

  • refactor: rename withInfo to withFallbackReason for clarity #4508 (andygrove)

  • chore(audit): audit hash expressions across Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 #4476 (andygrove)

  • test: add array_contains null semantics coverage #4422 (michaelmitchell-bit)

  • test: add SQL file tests for regr_* linear-regression aggregates #4551 (andygrove)

  • test: add SQL file tests for RuntimeReplaceable functions accelerated by Comet #4562 (andygrove)

  • test: add SQL file tests for try_to_date, try_to_timestamp, try_make_timestamp #4555 (andygrove)

  • build: upgrade Spark 4.1 to 4.1.2 #4399 (manuzhang)

  • chore: drop misleading ANSI-mode incompatibility note from CometSum #4111 (coderfender)

  • ci: add fast syntactic-only scalafix gate #4581 (andygrove)

  • test: enable float/double/binary array casts to string #4386 (manuzhang)

  • test: add timestamp ntz array cast coverage #4589 (manuzhang)

  • test: cover array date cast fallback #4593 (manuzhang)

  • chore: document programmatic access to Comet fallback reasons #4597 (comphead)

  • chore(deps): bump coursier/setup-action from 1 to 3 #4600 (dependabot[bot])

  • chore(deps): bump github/codeql-action from 4.36.0 to 4.36.2 #4601 (dependabot[bot])

  • ci: stop labeled events from cancelling the commit CI run #4610 (mbutrovich)

  • chore(deps): bump assertables from 9.9.0 to 10.1.0 in /native #4604 (dependabot[bot])

  • chore(deps): bump the all-other-cargo-deps group across 1 directory with 5 updates #4602 (dependabot[bot])

  • chore(deps): align object_store_opendal with opendal #4612 (manuzhang)

  • chore(shuffle): add interleave_time metric and specify buffer size for output_data buffer writer #4599 (wForget)

  • refactor: route shim-registered expressions through CometExpressionSerde #4139 (andygrove)

  • chore(audit): audit string expressions across Spark 3.4.3, 3.5.8, 4.0.1 #4461 (andygrove)

  • chore(audit): audit any_value and expand tests #4438 (andygrove)

  • chore: specify heap, metadata mem sizes for sql_core* tests #4623 (comphead)

  • chore: audit date/time expressions #4448 (andygrove)

  • bug: fix corrupt 4.1.2 patch file #4642 (coderfender)

  • chore: fallback for spark.sql.legacy.castComplexTypesToString.enabled = true #4630 (comphead)

  • chore: refactor CI to have centralized SBT action #4643 (comphead)

  • chore: Add join benchmarks #4598 (coderfender)

  • chore: [branch-0.17] change version from 0.17.0-SNAPSHOT to 0.17.0 #4661 (mbutrovich)

  • chore: [branch-0.17] update release_process.md #4666 (mbutrovich)

Credits

Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor.

   117	Andy Grove
    21	Matt Butrovich
    12	dependabot[bot]
     9	Oleks V
     6	Manu Zhang
     5	Bhargava Vadlamani
     4	Bolin Lin
     3	ChenChen Lai
     3	Parth Chandra
     2	Kazantsev Maksim
     2	Scott Schenkein
     1	Jordan Epstein
     1	Krishna Sudarshan J
     1	Tom Zeng
     1	Vladislav Zabolotsky
     1	William Chern
     1	Zaki
     1	Zhen Wang
     1	michaelmitchell-bit

Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release.