Compatibility Guide#

Comet aims to provide consistent results with the version of Apache Spark that is being used.

This guide offers information about areas of functionality where there are known differences.

Parquet#

Comet has the following limitations when reading Parquet files:

  • Comet does not support reading decimals encoded in binary format.

  • No support for default values that are nested types (e.g., maps, arrays, structs). Literal default values are supported.

ANSI Mode#

Comet will fall back to Spark for the following expressions when ANSI mode is enabled. These expressions can be enabled by setting spark.comet.expression.EXPRNAME.allowIncompatible=true, where EXPRNAME is the Spark expression class name. See the Comet Supported Expressions Guide for more information on this configuration setting.

  • Average (supports all numeric inputs except decimal types)

  • Cast (in some cases)

There is an epic where we are tracking the work to fully implement ANSI support.

Floating-point Number Comparison#

Spark normalizes NaN and zero for floating point numbers for several cases. See NormalizeFloatingNumbers optimization rule in Spark. However, one exception is comparison. Spark does not normalize NaN and zero when comparing values because they are handled well in Spark (e.g., SQLOrderingUtil.compareFloats). But the comparison functions of arrow-rs used by DataFusion do not normalize NaN and zero (e.g., arrow::compute::kernels::cmp::eq). So Comet adds additional normalization expression of NaN and zero for comparisons, and may still have differences to Spark in some cases, especially when the data contains both positive and negative zero. This is likely an edge case that is not of concern for many users. If it is a concern, setting spark.comet.exec.strictFloatingPoint=true will make relevant operations fall back to Spark.

Incompatible Expressions#

Expressions that are not 100% Spark-compatible will fall back to Spark by default and can be enabled by setting spark.comet.expression.EXPRNAME.allowIncompatible=true, where EXPRNAME is the Spark expression class name. See the Comet Supported Expressions Guide for more information on this configuration setting.

Array Expressions#

  • ArrayContains: Returns null instead of false for empty arrays with literal values. #3346

  • ArrayRemove: Returns null when the element to remove is null, instead of removing null elements from the array. #3173

  • GetArrayItem: Known correctness issues with index handling, including off-by-one errors and incorrect results with dynamic (non-literal) index values. #3330, #3332

  • ArraysOverlap: Inconsistent behavior when arrays contain NULL values. #3645, #2036

  • ArrayUnion: Sorts input arrays before performing the union, while Spark preserves the order of the first array and appends unique elements from the second. #3644

Date/Time Expressions#

  • Hour, Minute, Second: Incorrectly apply timezone conversion to TimestampNTZ inputs. TimestampNTZ stores local time without timezone, so no conversion should be applied. These expressions work correctly with Timestamp inputs. #3180

  • TruncTimestamp (date_trunc): Produces incorrect results when used with non-UTC timezones. Compatible when timezone is UTC. #2649

Math Expressions#

  • Ceil, Floor: Incorrect results for Decimal type inputs. #1729

  • Tan: tan(-0.0) produces 0.0 instead of -0.0. #1897

Aggregate Expressions#

  • Corr: Returns null instead of NaN in some edge cases. #2646

  • First, Last: These functions are not deterministic. When ignoreNulls is set, results may not match Spark. #1630

Struct Expressions#

  • StructsToJson (to_json): Does not support +Infinity and -Infinity for numeric types (float, double). #3016

Regular Expressions#

Comet uses the Rust regexp crate for evaluating regular expressions, and this has different behavior from Java’s regular expression engine. Comet will fall back to Spark for patterns that are known to produce different results, but this can be overridden by setting spark.comet.expression.regexp.allowIncompatible=true.

Window Functions#

Comet’s support for window functions is incomplete and known to be incorrect. It is disabled by default and should not be used in production. The feature will be enabled in a future release. Tracking issue: #2721.

Round-Robin Partitioning#

Comet’s native shuffle implementation of round-robin partitioning (df.repartition(n)) is not compatible with Spark’s implementation and is disabled by default. It can be enabled by setting spark.comet.native.shuffle.partitioning.roundrobin.enabled=true.

Why the incompatibility exists:

Spark’s round-robin partitioning sorts rows by their binary UnsafeRow representation before assigning them to partitions. This ensures deterministic output for fault tolerance (task retries produce identical results). Comet uses Arrow format internally, which has a completely different binary layout than UnsafeRow, making it impossible to match Spark’s exact partition assignments.

Comet’s approach:

Instead of true round-robin assignment, Comet implements round-robin as hash partitioning on ALL columns. This achieves the same semantic goals:

  • Even distribution: Rows are distributed evenly across partitions (as long as the hash varies sufficiently - in some cases there could be skew)

  • Deterministic: Same input always produces the same partition assignments (important for fault tolerance)

  • No semantic grouping: Unlike hash partitioning on specific columns, this doesn’t group related rows together

The only difference is that Comet’s partition assignments will differ from Spark’s. When results are sorted, they will be identical to Spark. Unsorted results may have different row ordering.

Cast#

Cast operations in Comet fall into three levels of support:

  • C (Compatible): The results match Apache Spark

  • I (Incompatible): The results may match Apache Spark for some inputs, but there are known issues where some inputs will result in incorrect results or exceptions. The query stage will fall back to Spark by default. Setting spark.comet.expression.Cast.allowIncompatible=true will allow all incompatible casts to run natively in Comet, but this is not recommended for production use.

  • U (Unsupported): Comet does not provide a native version of this cast expression and the query stage will fall back to Spark.

  • N/A: Spark does not support this cast.

Legacy Mode#

binary

boolean

byte

date

decimal

double

float

integer

long

short

string

timestamp

binary

-

N/A

N/A

N/A

N/A

N/A

N/A

N/A

N/A

N/A

C

N/A

boolean

N/A

-

C

N/A

C

C

C

C

C

C

C

C

byte

C

C

-

N/A

C

C

C

C

C

C

C

C

date

N/A

C

C

-

C

C

C

C

C

C

C

C

decimal

N/A

C

C

N/A

-

C

C

C

C

C

C

C

double

N/A

C

C

N/A

I

-

C

C

C

C

C

C

float

N/A

C

C

N/A

I

C

-

C

C

C

C

C

integer

C

C

C

N/A

C

C

C

-

C

C

C

C

long

C

C

C

N/A

C

C

C

C

-

C

C

C

short

C

C

C

N/A

C

C

C

C

C

-

C

C

string

C

C

C

C

I

C

C

C

C

C

-

I

timestamp

N/A

U

U

C

U

U

U

U

C

U

C

-

Notes:

  • decimal -> string: There can be formatting differences in some case due to Spark using scientific notation where Comet does not

  • double -> decimal: There can be rounding differences

  • double -> string: There can be differences in precision. For example, the input “1.4E-45” will produce 1.0E-45 instead of 1.4E-45

  • float -> decimal: There can be rounding differences

  • float -> string: There can be differences in precision. For example, the input “1.4E-45” will produce 1.0E-45 instead of 1.4E-45

  • string -> date: Only supports years between 262143 BC and 262142 AD

  • string -> decimal: Does not support fullwidth unicode digits (e.g \uFF10) or strings containing null bytes (e.g \u0000)

  • string -> timestamp: Not all valid formats are supported

Try Mode#

binary

boolean

byte

date

decimal

double

float

integer

long

short

string

timestamp

binary

-

N/A

N/A

N/A

N/A

N/A

N/A

N/A

N/A

N/A

C

N/A

boolean

N/A

-

C

N/A

C

C

C

C

C

C

C

U

byte

U

C

-

N/A

C

C

C

C

C

C

C

C

date

N/A

U

U

-

U

U

U

U

U

U

C

C

decimal

N/A

C

C

N/A

-

C

C

C

C

C

C

C

double

N/A

C

C

N/A

I

-

C

C

C

C

C

C

float

N/A

C

C

N/A

I

C

-

C

C

C

C

C

integer

U

C

C

N/A

C

C

C

-

C

C

C

C

long

U

C

C

N/A

C

C

C

C

-

C

C

C

short

U

C

C

N/A

C

C

C

C

C

-

C

C

string

C

C

C

C

I

C

C

C

C

C

-

I

timestamp

N/A

U

U

C

U

U

U

U

C

U

C

-

Notes:

  • decimal -> string: There can be formatting differences in some case due to Spark using scientific notation where Comet does not

  • double -> decimal: There can be rounding differences

  • double -> string: There can be differences in precision. For example, the input “1.4E-45” will produce 1.0E-45 instead of 1.4E-45

  • float -> decimal: There can be rounding differences

  • float -> string: There can be differences in precision. For example, the input “1.4E-45” will produce 1.0E-45 instead of 1.4E-45

  • string -> date: Only supports years between 262143 BC and 262142 AD

  • string -> decimal: Does not support fullwidth unicode digits (e.g \uFF10) or strings containing null bytes (e.g \u0000)

  • string -> timestamp: Not all valid formats are supported

ANSI Mode#

binary

boolean

byte

date

decimal

double

float

integer

long

short

string

timestamp

binary

-

N/A

N/A

N/A

N/A

N/A

N/A

N/A

N/A

N/A

C

N/A

boolean

N/A

-

C

N/A

C

C

C

C

C

C

C

U

byte

U

C

-

N/A

C

C

C

C

C

C

C

C

date

N/A

U

U

-

U

U

U

U

U

U

C

C

decimal

N/A

C

C

N/A

-

C

C

C

C

C

C

C

double

N/A

C

C

N/A

I

-

C

C

C

C

C

C

float

N/A

C

C

N/A

I

C

-

C

C

C

C

C

integer

U

C

C

N/A

C

C

C

-

C

C

C

C

long

U

C

C

N/A

C

C

C

C

-

C

C

C

short

U

C

C

N/A

C

C

C

C

C

-

C

C

string

C

C

C

C

I

C

C

C

C

C

-

I

timestamp

N/A

U

U

C

U

U

U

U

C

U

C

-

Notes:

  • decimal -> string: There can be formatting differences in some case due to Spark using scientific notation where Comet does not

  • double -> decimal: There can be rounding differences

  • double -> string: There can be differences in precision. For example, the input “1.4E-45” will produce 1.0E-45 instead of 1.4E-45

  • float -> decimal: There can be rounding differences

  • float -> string: There can be differences in precision. For example, the input “1.4E-45” will produce 1.0E-45 instead of 1.4E-45

  • string -> date: Only supports years between 262143 BC and 262142 AD

  • string -> decimal: Does not support fullwidth unicode digits (e.g \uFF10) or strings containing null bytes (e.g \u0000)

  • string -> timestamp: ANSI mode not supported

See the tracking issue for more details.