# Parquet Compatibility
Comet currently has two distinct implementations of the Parquet scan operator.
| Scan Implementation | Notes |
|---|---|
| `native_datafusion` | Fully native scan |
| `native_iceberg_compat` | Hybrid JVM/native scan |
The configuration property `spark.comet.scan.impl` selects which implementation is used. The default setting is
`spark.comet.scan.impl=auto`, which attempts to use `native_datafusion` first and falls back to Spark if the scan
cannot be converted (e.g., due to unsupported features). Most users should not need to change this setting. However,
you can force Comet to use a particular implementation for all scan operations by setting this configuration
property to one of the implementations above, for example `--conf spark.comet.scan.impl=native_datafusion`.
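As an illustration, a job could force the fully native scan on the command line as follows. This is a hedged sketch: the application jar name and any other options are placeholders, not part of the Comet documentation.

```shell
# Hypothetical spark-submit invocation forcing the fully native scan.
# "app.jar" is a placeholder; adjust paths and options for your deployment.
spark-submit \
  --conf spark.comet.scan.impl=native_datafusion \
  --conf spark.comet.exec.enabled=true \
  app.jar
```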
## `native_datafusion` Limitations
The `native_datafusion` scan has some additional limitations, mostly related to Parquet metadata. All of these
cause Comet to fall back to Spark (including when using `auto` mode). Note that the `native_datafusion` scan
requires `spark.comet.exec.enabled=true` because the scan node must be wrapped by `CometExecRule`.
- No support for row indexes
- No support for reading Parquet field IDs
- No support for the `input_file_name()`, `input_file_block_start()`, or `input_file_block_length()` SQL functions. The `native_datafusion` scan does not use Spark's `FileScanRDD`, so these functions cannot populate their values.
- No support for `ignoreMissingFiles` or `ignoreCorruptFiles` being set to `true`
- Duplicate field names in case-insensitive mode (e.g., a Parquet file with both `B` and `b` columns) are detected at read time and raise a `SparkRuntimeException` with error class `_LEGACY_ERROR_TEMP_2093`, matching Spark's behavior.
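The duplicate-field check in the last item can be illustrated with a short sketch. This is not Comet's actual code, only a minimal model of the idea: fold every field name to lowercase and flag any two distinct names that collide.

```python
def find_case_insensitive_duplicates(field_names):
    """Return pairs of distinct field names that collide when compared
    case-insensitively (a sketch of the check, not Comet's implementation)."""
    seen = {}        # lowercase name -> first original spelling seen
    duplicates = []
    for name in field_names:
        key = name.lower()
        if key in seen and seen[key] != name:
            duplicates.append((seen[key], name))
        else:
            seen.setdefault(key, name)
    return duplicates

# A Parquet schema with both "B" and "b" would be rejected at read time.
print(find_case_insensitive_duplicates(["B", "b", "id"]))
```

In Comet itself, detecting such a collision raises `SparkRuntimeException` rather than returning the pairs.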
## `native_iceberg_compat` Limitations
The `native_iceberg_compat` scan has the following additional limitation that may produce incorrect results
without falling back to Spark:
- Some Spark configuration values are hard-coded to their defaults rather than respecting user-specified values. This may produce incorrect results when non-default values are set. The affected configurations are `spark.sql.parquet.binaryAsString`, `spark.sql.parquet.int96AsTimestamp`, `spark.sql.caseSensitive`, `spark.sql.parquet.inferTimestampNTZ.enabled`, and `spark.sql.legacy.parquet.nanosAsLong`. See issue #1816 for more details.
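For reference, the values the scan effectively assumes are the Spark defaults shown below. This is an assumption based on standard Spark defaults; verify them against the documentation for your Spark version before relying on this list.

```properties
# Spark defaults that native_iceberg_compat effectively hard-codes
# (assumed values; confirm against your Spark version's docs).
spark.sql.parquet.binaryAsString=false
spark.sql.parquet.int96AsTimestamp=true
spark.sql.caseSensitive=false
spark.sql.parquet.inferTimestampNTZ.enabled=true
spark.sql.legacy.parquet.nanosAsLong=false
```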