Running Iceberg Spark Tests
Running Apache Iceberg’s Spark tests with Comet enabled is a good way to ensure that Comet produces the same results as Spark when reading Iceberg tables. To enable this, we apply diff files to the Apache Iceberg source code so that Comet is loaded when we run the tests.
Here is an overview of the changes that the diffs make to Iceberg:
- Configure Comet as a dependency and set the correct version in libs.versions.toml and build.gradle.
- Delete upstream Comet reader classes that reference legacy Comet APIs removed in #3739. These classes were added upstream in apache/iceberg#15674 and depend on Comet’s old Iceberg Java integration. Since Comet now uses a native Iceberg scan, these classes fail to compile and must be removed.
- Configure test base classes (TestBase, ExtensionsTestBase, ScanTestBase, etc.) to load the Comet Spark plugin and shuffle manager.
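Before applying a diff, it can help to see which files it touches and which it deletes. The following is a minimal sketch that reads a git-format diff with standard text tools; the function names are hypothetical, and it assumes paths without spaces.

```shell
# List every path a git-format diff touches, one per line.
# Assumes headers of the form: diff --git a/<path> b/<path>
diff_touched_files() {
  sed -n 's|^diff --git a/\(.*\) b/.*$|\1|p' "$1"
}

# List only the paths that the diff deletes outright
# (the file header is followed by a "deleted file mode" line).
diff_deleted_files() {
  awk '
    /^diff --git a\// { path = $3; sub(/^a\//, "", path) }
    /^deleted file mode/ { print path }
  ' "$1"
}
```

For example, `diff_deleted_files ../datafusion-comet/dev/diffs/iceberg/1.8.1.diff` should print the classes that the diff removes.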
1. Install Comet
From the Comet repository, run make release to install the Comet JAR into the local Maven repository. The PROFILES variable selects the Spark version.
PROFILES="-Pspark-4.1" make release
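To confirm that the install step produced a JAR, one option is to search the local Maven repository. This sketch assumes the default repository location (~/.m2/repository) and that the Comet artifacts are named comet-spark-*.jar; both are assumptions, and the function name is hypothetical.

```shell
# Print Comet JARs found under a Maven repository directory.
# Assumptions: default repo at ~/.m2/repository, artifacts named comet-spark-*.jar.
find_comet_jars() {
  repo_dir="${1:-$HOME/.m2/repository}"
  find "$repo_dir" -type f -name 'comet-spark-*.jar' 2>/dev/null | sort
}
```

Calling `find_comet_jars` with no argument searches the default location; an empty result suggests that the install did not complete or that the artifact uses a different name.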
2. Clone Iceberg and Apply Diff
Clone Apache Iceberg locally and apply the diff file from Comet against the matching tag.
git clone git@github.com:apache/iceberg.git apache-iceberg
cd apache-iceberg
git checkout apache-iceberg-1.8.1
git apply ../datafusion-comet/dev/diffs/iceberg/1.8.1.diff
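The checkout and apply steps above can be wrapped so that the diff is dry-run first. This is a sketch under the same layout as the commands above (a Comet checkout at ../datafusion-comet, run from inside the Iceberg clone); the function names are hypothetical. It relies on git apply --check, which reports whether a patch applies without modifying the working tree.

```shell
# Map an Iceberg version to its release tag and to Comet's diff file.
iceberg_tag() { printf 'apache-iceberg-%s\n' "$1"; }
comet_diff_path() { printf '%s/dev/diffs/iceberg/%s.diff\n' "$1" "$2"; }

# Check out the tag, dry-run the diff, then apply it.
# $1 = Iceberg version, $2 = Comet checkout (defaults to ../datafusion-comet).
apply_comet_diff() {
  version="$1"
  comet_dir="${2:-../datafusion-comet}"
  diff_file="$(comet_diff_path "$comet_dir" "$version")"
  [ -f "$diff_file" ] || { echo "missing diff: $diff_file" >&2; return 1; }
  git checkout "$(iceberg_tag "$version")" || return 1
  # --check only verifies that the patch applies; nothing is changed yet.
  git apply --check "$diff_file" || return 1
  git apply "$diff_file"
}
```

Running `apply_comet_diff 1.8.1` from the apache-iceberg directory mirrors the manual commands, but stops early if the diff does not match the tag.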
3. Run Iceberg Spark Tests
ENABLE_COMET=true ./gradlew -DsparkVersions=3.5 -DscalaVersion=2.13 -DflinkVersions= -DkafkaVersions= \
:iceberg-spark:iceberg-spark-3.5_2.13:test \
-Pquick=true -x javadoc
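The target name in the command above embeds the Spark and Scala versions that are also passed as system properties. A small wrapper can keep the two in sync; this is a sketch with hypothetical function names, and it assumes the module naming pattern shown in the command holds for the versions you pass in.

```shell
# Build the Gradle test target for a Spark/Scala pair,
# following the pattern :iceberg-spark:iceberg-spark-<spark>_<scala>:test.
iceberg_spark_target() {
  printf ':iceberg-spark:iceberg-spark-%s_%s:test\n' "$1" "$2"
}

# Run the tests with Comet enabled, using the same flags as the command above.
# $1 = Spark version (e.g. 3.5), $2 = Scala version (e.g. 2.13).
run_iceberg_spark_tests() {
  spark="$1"
  scala="$2"
  ENABLE_COMET=true ./gradlew \
    -DsparkVersions="$spark" -DscalaVersion="$scala" \
    -DflinkVersions= -DkafkaVersions= \
    "$(iceberg_spark_target "$spark" "$scala")" \
    -Pquick=true -x javadoc
}
```

Running `run_iceberg_spark_tests 3.5 2.13` from the Iceberg checkout is equivalent to the command above.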
The three Gradle targets tested in CI are:
| Gradle Target | What It Covers |
|---|---|
| | Core read/write paths (Parquet, Avro, ORC, vectorized), scan operations, filtering, bloom filters, runtime filtering, deletion handling, structured streaming, DDL/DML (create/alter/drop, writes, deletes), filter and aggregate pushdown, actions (snapshot expiration, file rewriting, orphan cleanup, table migration), serialization, and data format conversions. |
| | SQL extensions: stored procedures (migrate, snapshot, cherrypick, rollback, rewrite-data-files, rewrite-manifests, expire-snapshots, remove-orphan-files, etc.), row-level operations (copy-on-write and merge-on-read update/delete/merge), DDL extensions (branches, tags, alter schema, partition fields), changelog tables/views, metadata tables, and views. |
| | A single smoke test ( |
Updating Diffs
To update a diff (e.g. after modifying test configuration), apply the existing diff, make changes, then regenerate:
cd apache-iceberg
git reset --hard apache-iceberg-1.8.1 && git clean -fd
git apply ../datafusion-comet/dev/diffs/iceberg/1.8.1.diff
# Make changes, then run spotless to fix formatting
./gradlew spotlessApply
# Stage any new or deleted files, then generate the diff
git add -A
git diff apache-iceberg-1.8.1 > ../datafusion-comet/dev/diffs/iceberg/1.8.1.diff
Repeat for each Iceberg version (1.8.1, 1.9.1, 1.10.0). The file contents differ between versions, so each diff must be generated against its own tag.
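After regenerating the diffs, a quick sanity check is to confirm that each one still applies cleanly to its own pristine tag. The sketch below loops over the three versions listed above; the function names are hypothetical, and it is meant to be run from inside the Iceberg clone. Note that it resets the working tree, so any uncommitted changes are discarded.

```shell
ICEBERG_VERSIONS="1.8.1 1.9.1 1.10.0"

# Reset to the pristine tag and dry-run the matching diff.
# WARNING: git reset --hard and git clean -fd discard local changes.
check_diff() {
  version="$1"
  comet_dir="$2"
  git reset --hard "apache-iceberg-$version" &&
    git clean -fd &&
    git apply --check "$comet_dir/dev/diffs/iceberg/$version.diff"
}

# Report one line per version: "<version> ok" or "<version> FAILED".
check_all_diffs() {
  comet_dir="$1"
  for v in $ICEBERG_VERSIONS; do
    if check_diff "$v" "$comet_dir" >/dev/null 2>&1; then
      echo "$v ok"
    else
      echo "$v FAILED"
    fi
  done
}
```

For example, `check_all_diffs ../datafusion-comet` prints one status line per Iceberg version.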
Running Tests in CI
The iceberg_spark_test_<version>.yml workflows apply these diffs and run the three Gradle targets above
against each Iceberg version. Iceberg 1.8.1 runs against Spark 3.4.3 with Java 11; Iceberg 1.9.1 and 1.10.0
run against Spark 3.5.8 with Java 17. The latest Iceberg version (1.10) runs on every pull request and on
pushes to main; the older versions (1.8, 1.9) run only on pushes to main. All caller workflows delegate to
iceberg_spark_test_reusable.yml, which holds the build and test job logic.
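When reproducing a CI failure locally, it helps to know which Spark and Java versions CI pairs with each Iceberg version. The lookup below simply restates the matrix described above as a shell function; the function name and the output format are hypothetical and are not taken from the workflow files.

```shell
# Print the Spark version, Java version, and triggers CI uses for an Iceberg version,
# as described in the paragraph above.
ci_matrix() {
  case "$1" in
    1.8.1)  echo "spark=3.4.3 java=11 trigger=push-to-main" ;;
    1.9.1)  echo "spark=3.5.8 java=17 trigger=push-to-main" ;;
    1.10.0) echo "spark=3.5.8 java=17 trigger=pull-request,push-to-main" ;;
    *)      echo "unknown Iceberg version: $1" >&2; return 1 ;;
  esac
}
```

For example, `ci_matrix 1.10.0` reports that this version also runs on every pull request.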