Running Spark SQL Tests

Running Apache Spark’s SQL tests with Comet enabled is a good way to ensure that Comet produces the same results as that version of Spark. To enable this, we apply some changes to the Apache Spark source code so that Comet is enabled when we run the tests.

Here is an overview of the changes that we need to make to Spark:

  • Update the pom.xml to add a dependency on Comet

  • Modify SparkSession to load the Comet extension

  • Modify TestHive to load Comet

  • Modify SQLTestUtilsBase to load Comet when the ENABLE_COMET environment variable is set

Here are the steps involved in running the Spark SQL tests with Comet, using Spark 3.4.3 for this example.

1. Install Comet

From the Comet repository, run make release to install the Comet JAR into the local Maven repository, specifying the Spark version through the PROFILES variable.

PROFILES="-Pspark-3.4" make release

2. Clone Spark and Apply Diff

Clone Apache Spark locally and apply the diff file from Comet.

git clone git@github.com:apache/spark.git apache-spark
cd apache-spark
git checkout v3.4.3
git apply ../datafusion-comet/dev/diffs/3.4.3.diff

3. Run Spark SQL Tests

Use the following commands to run the SQL tests locally. The ScalaTest -l flag excludes tests carrying the given tag and -n runs only tests with that tag, so the regular, extended, and slow suites run in separate invocations.

ENABLE_COMET=true build/sbt catalyst/test
ENABLE_COMET=true build/sbt "sql/testOnly * -- -l org.apache.spark.tags.ExtendedSQLTest -l org.apache.spark.tags.SlowSQLTest"
ENABLE_COMET=true build/sbt "sql/testOnly * -- -n org.apache.spark.tags.ExtendedSQLTest"
ENABLE_COMET=true build/sbt "sql/testOnly * -- -n org.apache.spark.tags.SlowSQLTest"
ENABLE_COMET=true build/sbt "hive/testOnly * -- -l org.apache.spark.tags.ExtendedHiveTest -l org.apache.spark.tags.SlowHiveTest"
ENABLE_COMET=true build/sbt "hive/testOnly * -- -n org.apache.spark.tags.ExtendedHiveTest"
ENABLE_COMET=true build/sbt "hive/testOnly * -- -n org.apache.spark.tags.SlowHiveTest"

Creating a diff file for a new Spark version

Once Comet has support for a new Spark version, we need to create a diff file that can be applied to that version of Apache Spark to enable Comet when running tests. This is a largely manual process that varies depending on how much Spark changed between versions, but here is a general guide.

We typically start by applying the diff from a previous Spark version. For example, when enabling the tests for Spark 3.5.1 we might start from the existing 3.4.3 diff.

cd git/apache/spark
git checkout v3.5.1
git apply --reject --whitespace=fix ../datafusion-comet/dev/diffs/3.4.3.diff

Any hunks that cannot be applied cleanly are instead written out to reject files. For example, the above command generated the following files.

find . -name "*.rej"
./pom.xml.rej
./sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala.rej
./sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala.rej
./sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingJoinSuite.scala.rej
./sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala.rej
./sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala.rej
./sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaPruningSuite.scala.rej
./sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/binaryfile/BinaryFileFormatSuite.scala.rej
./sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRebaseDatetimeSuite.scala.rej
./sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala.rej
./sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala.rej
./sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala.rej
./sql/core/src/test/scala/org/apache/spark/sql/sources/CreateTableAsSelectSuite.scala.rej
./sql/core/src/test/scala/org/apache/spark/sql/sources/BucketedReadSuite.scala.rej
./sql/core/src/test/scala/org/apache/spark/sql/sources/DisableUnnecessaryBucketedScanSuite.scala.rej
./sql/core/src/test/scala/org/apache/spark/sql/errors/QueryExecutionErrorsSuite.scala.rej
./sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala.rej
./sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala.rej

The changes in these reject files need to be applied manually.

One way to apply them is with the wiggle command (brew install wiggle on macOS). For example:

wiggle --replace ./sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala ./sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala.rej
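With many reject files this can be scripted. A minimal sketch, assuming the standard layout where each foo.scala.rej sits next to the foo.scala it patches:

```shell
# Apply every reject file to its corresponding source file with wiggle.
# Stripping the .rej suffix yields the target file's path.
find . -name "*.rej" | while read -r rej; do
    wiggle --replace "${rej%.rej}" "$rej"
done
```

After wiggling, review the results by hand; wiggle merges aggressively and can place hunks incorrectly when the surrounding code has changed.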

Generating the Diff File

Once all of the rejected hunks have been resolved, generate the new diff file from the patched tree:

git diff v3.5.1 > ../datafusion-comet/dev/diffs/3.5.1.diff

Running Tests in CI

The easiest way to run the tests is to create a PR against Comet and let CI run them. When adding support for a new Spark version, the spark_sql_test.yaml and spark_sql_test_ansi.yaml workflow files need to be updated to include the new version.