SQL File Tests#
CometSqlFileTestSuite is a test suite that automatically discovers .sql test files and
runs each query through both Spark and Comet, comparing results. This provides a lightweight
way to add expression and operator test coverage without writing Scala test code.
Running the tests#
mvn test -pl spark -Dsuites="org.apache.comet.CometSqlFileTestSuite" -Dtest=none
Test file location#
SQL test files live under:
spark/src/test/resources/sql-tests/expressions/
Files are organized into category subdirectories:
expressions/
aggregate/ -- avg, sum, count, min_max, ...
array/ -- array_contains, array_append, get_array_item, ...
bitwise/
cast/
conditional/ -- case_when, coalesce, if_expr, ...
datetime/ -- date_add, date_diff, unix_timestamp, ...
decimal/
hash/
map/ -- get_map_value, map_keys, map_values, ...
math/ -- abs, ceil, floor, round, sqrt, ...
misc/ -- width_bucket, scalar_subquery, ...
string/ -- concat, like, substring, lower, upper, ...
struct/ -- create_named_struct, get_struct_field, ...
The test suite recursively discovers all .sql files in these directories. Each file becomes
one or more ScalaTest test cases.
File format#
A test file consists of SQL comments, directives, statements, and queries separated by blank lines. Here is a minimal example:
-- ConfigMatrix: parquet.enable.dictionary=false,true
statement
CREATE TABLE test_abs(v double) USING parquet
statement
INSERT INTO test_abs VALUES (1.5), (-2.5), (0.0), (NULL)
query
SELECT abs(v) FROM test_abs
Directives#
Directives are SQL comments at the top of the file that configure how the test runs.
Config#
Sets a Spark SQL config for all queries in the file.
-- Config: spark.sql.ansi.enabled=true
ConfigMatrix#
Runs the entire file once for each listed value. Multiple ConfigMatrix lines produce a
run for every combination of values across the lines (a cross product).
-- ConfigMatrix: parquet.enable.dictionary=false,true
This generates two test cases:
sql-file: expressions/cast/cast.sql [parquet.enable.dictionary=false]
sql-file: expressions/cast/cast.sql [parquet.enable.dictionary=true]
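As a sketch, a file with two ConfigMatrix lines (the second key below is a hypothetical placeholder, not a real config) would run four times, once per combination:
-- ConfigMatrix: parquet.enable.dictionary=false,true
-- ConfigMatrix: example.second.flag=a,b
This produces runs for (false, a), (false, b), (true, a), and (true, b).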
MinSparkVersion#
Skips the file when running on a Spark version older than the specified version.
-- MinSparkVersion: 3.5
Statements#
A statement block executes DDL or DML and does not check results. Use this for CREATE TABLE
and INSERT commands. Table names are automatically extracted for cleanup after the test.
statement
CREATE TABLE my_table(x int, y double) USING parquet
statement
INSERT INTO my_table VALUES (1, 2.0), (3, 4.0), (NULL, NULL)
Queries#
A query block executes a SELECT and compares results between Spark and Comet. The query
mode controls how results are validated.
query (default mode)#
Checks that the query runs natively on Comet (not falling back to Spark) and that results match Spark exactly.
query
SELECT abs(v) FROM test_abs
query spark_answer_only#
Only checks that Comet results match Spark. Does not assert that the query runs natively. Use this for expressions that Comet may not fully support yet but should still produce correct results.
query spark_answer_only
SELECT some_expression(v) FROM test_table
query tolerance=<value>#
Checks results with a numeric tolerance. Useful for floating-point functions where small differences are acceptable.
query tolerance=0.0001
SELECT cos(v) FROM test_trig
query expect_fallback(<reason>)#
Asserts that the query falls back to Spark and verifies the fallback reason contains the given string.
query expect_fallback(unsupported expression)
SELECT unsupported_func(v) FROM test_table
query ignore(<reason>)#
Skips the query entirely. Use this for queries that hit known bugs. The reason should be a link to the tracking GitHub issue.
-- Comet bug: space(-1) causes native crash
query ignore(https://github.com/apache/datafusion-comet/issues/3326)
SELECT space(n) FROM test_space WHERE n < 0
Adding a new test#
1. Create a .sql file under the appropriate subdirectory in spark/src/test/resources/sql-tests/expressions/. Create a new subdirectory if no existing category fits. Add the Apache license header as a SQL comment.
2. Add a ConfigMatrix directive if the test should run with multiple Parquet configurations. Most expression tests use:
   -- ConfigMatrix: parquet.enable.dictionary=false,true
3. Create tables and insert test data using statement blocks. Include edge cases such as NULL, boundary values, and negative numbers.
4. Add query blocks for each expression or behavior to test. Use the default query mode when you expect Comet to run the expression natively. Use query spark_answer_only when native execution is not yet expected.
5. Run the tests to verify:
mvn test -pl spark -Dsuites="org.apache.comet.CometSqlFileTestSuite" -Dtest=none
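Putting the steps together, a complete test file might look like the following sketch (the reverse function, table name, and data are illustrative, and the license header is abbreviated):
-- (Apache license header as a SQL comment)
-- ConfigMatrix: parquet.enable.dictionary=false,true
statement
CREATE TABLE test_reverse(s string) USING parquet
statement
INSERT INTO test_reverse VALUES ('hello'), (''), ('中文'), (NULL)
-- column argument
query
SELECT reverse(s) FROM test_reverse
-- literal arguments
query
SELECT reverse('abc'), reverse(''), reverse(NULL)
Whether the default query mode or spark_answer_only is appropriate depends on whether Comet supports the expression natively.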
Tips for writing thorough tests#
Cover all combinations of literal and column arguments#
Comet often uses different code paths for literal values versus column references. Tests should exercise both. For a function with multiple arguments, test every useful combination.
For a single-argument function, test both a column reference and a literal:
-- column argument (reads from Parquet, goes through columnar evaluation)
query
SELECT ascii(s) FROM test_ascii
-- literal arguments
query
SELECT ascii('A'), ascii(''), ascii(NULL)
For a multi-argument function like concat_ws(sep, str1, str2, ...), test with the separator
as a column versus a literal, and similarly for the other arguments:
-- all columns
query
SELECT concat_ws(sep, a, b) FROM test_table
-- literal separator, column values
query
SELECT concat_ws(',', a, b) FROM test_table
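The remaining combination, a column separator with literal values, can be covered the same way (a sketch reusing the hypothetical test_table):
-- column separator, literal values
query
SELECT concat_ws(sep, 'hello', 'world') FROM test_table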
-- all literals
query
SELECT concat_ws(',', 'hello', 'world')
Note on constant folding: Normally Spark constant-folds all-literal expressions during
planning, so Comet would never see them. However, CometSqlFileTestSuite automatically
disables constant folding (by excluding ConstantFolding from the optimizer rules), so
all-literal queries are evaluated by Comet’s native engine. This means you can use the
default query mode for all-literal cases and they will be tested natively just like
column-based queries.
Cover edge cases#
Include edge-case values in your test data. The exact cases depend on the function, but common ones include:
NULL values – every test should include NULLs
Empty strings – for string functions
Zero, negative, and very large numbers – for numeric functions
Boundary values – INT_MIN, INT_MAX, NaN, Infinity, -Infinity for numeric types
Special characters and multibyte UTF-8 – for string functions (e.g. 'é', '中文', '\t')
Empty arrays/maps – for collection functions
Single-element and multi-element collections – for aggregate and collection functions
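For example, a string test might seed edge-case data with statement blocks like this sketch (the table name and values are illustrative):
statement
CREATE TABLE test_edge_cases(s string, n int) USING parquet
statement
INSERT INTO test_edge_cases VALUES ('hello', 1), ('', 0), ('é', -1), ('中文', 2147483647), (NULL, NULL)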
One file per expression#
Keep each .sql file focused on a single expression or a small group of closely related
expressions. This makes failures easy to locate and keeps files readable.
Use comments to label sections#
Add SQL comments before query blocks to describe what aspect of the expression is being tested. This helps reviewers and future maintainers understand the intent:
-- literal separator with NULL values
query
SELECT concat_ws(',', NULL, 'b', 'c')
-- empty separator
query
SELECT concat_ws('', a, b, c) FROM test_table
Using agentic coding tools#
Writing thorough SQL test files is a task well suited to agentic coding tools such as
Claude Code. You can point the tool at an existing test file as an example, describe the
expression you want to test, and ask it to generate a complete .sql file covering all
argument combinations and edge cases. This is significantly faster than writing the
combinatorial test cases by hand and helps ensure nothing is missed.
For example:
Read the test file spark/src/test/resources/sql-tests/expressions/string/ascii.sql
and the documentation in docs/source/contributor-guide/sql-file-tests.md.
Then write a similar test file for the `reverse` function, covering column arguments,
literal arguments, NULLs, empty strings, and multibyte characters.
Handling test failures#
When a query fails due to a known Comet bug:
File a GitHub issue describing the problem.
Change the query mode to ignore(...) with a link to the issue.
Optionally add a SQL comment above the query explaining the problem.
-- GetArrayItem returns incorrect results with dynamic index
query ignore(https://github.com/apache/datafusion-comet/issues/3332)
SELECT arr[idx] FROM test_get_array_item
When the bug is fixed, remove the ignore(...) and restore the original query mode.
Architecture#
The test infrastructure consists of two Scala files:
SqlFileTestParser (spark/src/test/scala/org/apache/comet/SqlFileTestParser.scala) – Parses .sql files into a SqlTestFile data structure containing directives, statements, and queries.
CometSqlFileTestSuite (spark/src/test/scala/org/apache/comet/CometSqlFileTestSuite.scala) – Discovers test files at suite initialization time, generates ScalaTest test cases for each file and config combination, and executes them using CometTestBase assertion methods.
Tables created in test files are automatically cleaned up after each test.