Parquet Compatibility#
Comet’s Parquet scan offloads decoding to native code and produces Arrow batches for the rest of the plan. Comet falls back to Spark when the scan cannot be converted (for example, due to one of the unsupported features listed below).
Parquet Scan Limitations#
The following features are not supported and cause Comet to fall back to Spark:
Decimals encoded in binary format.
ShortTypecolumns, by default. When reading Parquet files written by systems other than Spark that contain columns with the logical typeUINT_8(unsigned 8-bit integers), Comet may produce different results than Spark. Spark mapsUINT_8toShortType, but Comet’s Arrow-based readers respect the unsigned type and read the data as unsigned rather than signed. Since Comet cannot distinguishShortTypecolumns that came fromUINT_8versus signedINT16, by default Comet falls back to Spark when scanning Parquet files containingShortTypecolumns. This behavior can be disabled by settingspark.comet.scan.unsignedSmallIntSafetyCheck=false. Note thatByteTypecolumns are always safe because they can only come from signedINT8, where truncation preserves the signed value.Default values that are nested types (e.g., maps, arrays, structs). Literal default values are supported.
Spark’s Datasource V2 API. When
spark.sql.sources.useV1SourceListdoes not includeparquet, Spark uses the V2 API for Parquet scans. Comet’s Parquet scan only supports the V1 API.Spark metadata columns (e.g.,
_metadata.file_path)No support for row indexes
No support for
input_file_name(),input_file_block_start(), orinput_file_block_length()SQL functions. Comet’s Parquet scan does not use Spark’sFileScanRDD, so these functions cannot populate their values.No support for
ignoreMissingFilesorignoreCorruptFilesbeing set totrueDuplicate field names in case-insensitive mode (e.g., a Parquet file with both
Bandbcolumns) are detected at read time and raise aSparkRuntimeExceptionwith error class_LEGACY_ERROR_TEMP_2093, matching Spark’s behavior.spark.sql.parquet.enableVectorizedReader=false. Disabling the vectorized reader opts into Spark’s parquet-mr semantics (silent overflow, null-on-narrowing), which Comet’s native reader does not replicate. By default Comet falls back to Spark in this case. Setspark.comet.scan.allowDisabledParquetVectorizedReader=trueto opt in to running the Comet Parquet scan regardless. See #4352.
The following limitation may produce incorrect results without falling back to Spark:
No support for datetime rebasing. When reading Parquet files containing dates or timestamps written before Spark 3.0 (which used a hybrid Julian/Gregorian calendar), dates/timestamps will be read as if they were written using the Proleptic Gregorian calendar. This may produce incorrect results for dates before October 15, 1582.
The following limitation raises an error at scan time rather than falling back to Spark:
Invalid UTF-8 bytes in
STRINGcolumns. Spark permits arbitrary byte sequences in aSTRINGcolumn (for example fromCAST(X'C1' AS STRING)), but Comet’s native execution path is built on Arrow, whose string type is strictly UTF-8. Reading a Parquet file whoseSTRINGcolumn contains non-UTF-8 bytes fails withParquet error: encountered non UTF-8 data. Disable Comet for the query, or cast the column toBINARYbefore persisting, if you need to preserve non-UTF-8 bytes. See #4121.
The following limitation may produce incorrect results on Spark versions prior to 4.0 without falling back to Spark:
Reading
TimestampLTZasTimestampNTZ. On Spark 3.x, Spark raises an error per SPARK-36182 because LTZ encodes UTC-adjusted instants that cannot be safely reinterpreted as timezone-free values. Comet does not raise this error and instead returns the raw UTC instant as aTimestampNTZvalue. This applies to all LTZ physical encodings (INT96, TIMESTAMP_MICROS, TIMESTAMP_MILLIS). On Spark 4.0+, this read is permitted (SPARK-47447) and Comet matches Spark’s behavior. See #4219.
Schema Mismatch Handling#
The issues in this subsection apply only when the requested read schema differs from the schema written
to the Parquet file. They do not affect a plain spark.read.parquet(path) that infers the schema
from file metadata, because in that case the requested schema and file schema match by construction.
Schema mismatch happens in two real-world scenarios:
The user provides an explicit read schema:
spark.read.schema(<schema>).parquet(path)(or the equivalent DataFrame API).Schema evolution / partitioned reads where files in a single dataset were written at different times with different types, or a table-format catalog (Iceberg, Delta) records a logical schema that has evolved past one or more underlying Parquet files. Spark coerces the file types to the table types at read time.
Spark’s vectorized Parquet reader fully validates these conversions in ParquetVectorUpdaterFactory.getUpdater
and throws SchemaColumnConvertNotSupportedException for unsupported pairs. Comet’s Parquet scan
mirrors that validation in its schema adapter; the entries below are the remaining gaps.
Note that the exact set of accepted conversions has changed between Spark versions
(for example, Spark 3.x’s schemaEvolution.enabled flag gates INT32 → INT64, FLOAT → DOUBLE,
and INT32 → DOUBLE widening that Spark 4.0+ accepts unconditionally; TimestampLTZ → TimestampNTZ
is rejected by Spark 3.x but accepted by Spark 4.0+). Comet aims to follow the per-version Spark
behavior.
ParquetSchemaConverterrors do not include the file path. The mismatch itself is detected and rejected correctly, but the resulting Spark error message readsEncountered error while reading file . Data type mismatches…(note the empty path). Behavior is consistent across Spark versions. See #4316.Spark 3.x: extra
SparkExceptionlayer in the cause chain. The native error is translated to aSparkExceptionwhose cause isSchemaColumnConvertNotSupportedException(matching what Spark would throw); on Spark 3.x the executor / task error handling re-wraps this once more on the way back to the driver, producing a two-level chain (SparkException → SparkException → SchemaColumnConvertNotSupportedException) instead of the one-level chain Spark’s own vectorized reader produces. Code that catchesSparkExceptionand inspects only the immediate cause viae.getCause.isInstanceOf[SchemaColumnConvertNotSupportedException]will see the innerSparkExceptioninstead. Walk the cause chain to recover theSchemaColumnConvertNotSupportedException. Spark 4.0+ produces a single-level chain, matching vanilla Spark. See #4354.