JSON Compatibility
Comet can evaluate JSON expressions (get_json_object, from_json, to_json,
json_array_length) two ways:
Codegen dispatcher (default): Spark’s own doGenCode for the expression runs inside the Comet pipeline (via Comet’s Arrow-direct codegen dispatcher), giving byte-exact compatibility with Spark at the cost of a JNI round trip per batch. This path depends on the codegen dispatcher (spark.comet.exec.scalaUDF.codegen.enabled, enabled by default); if the dispatcher is disabled, the operator falls back to Spark.

Native (rust) path: the native DataFusion implementation. It is faster but has known compatibility gaps with Spark on certain inputs, so it is opt-in per expression via the expression’s allowIncompatible config. Any expression or input case with no native implementation falls back to the codegen dispatcher.
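The selection rules above can be modeled in a few lines. This is only a sketch of one reading of the description, not Comet's actual planner code: the `JsonExprCase` class, its field names, and the returned labels are hypothetical, and the behavior when the native path is opted in, lacks coverage, and the dispatcher is disabled is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JsonExprCase:
    """Hypothetical description of one JSON expression evaluation."""
    allow_incompatible: bool         # the expression's allowIncompatible config
    native_supports_case: bool       # a native implementation covers this input case
    dispatcher_enabled: bool = True  # spark.comet.exec.scalaUDF.codegen.enabled


def choose_path(case: JsonExprCase) -> str:
    """Return which engine evaluates the expression under the rules described above."""
    # Native path: only when opted in AND a native implementation covers the case.
    if case.allow_incompatible and case.native_supports_case:
        return "native"
    # Otherwise Spark's generated code runs inside Comet via the codegen dispatcher.
    if case.dispatcher_enabled:
        return "codegen-dispatcher"
    # With the dispatcher disabled, the operator falls back to Spark (assumed to
    # also apply when the native path was requested but has no coverage).
    return "spark"


print(choose_path(JsonExprCase(allow_incompatible=True, native_supports_case=False)))
```

The point of the model is that opting in never removes the compatible path: an uncovered case still resolves to the codegen dispatcher.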
Expression coverage
| SQL | Native (rust) path | Opt-in config |
|---|---|---|
| get_json_object | Supported, with gaps on single-quoted JSON and unescaped control characters | spark.comet.expression.GetJsonObject.allowIncompatible |
| from_json | Supported with restrictions (PERMISSIVE mode only, simple schema types only) | |
| to_json | Supported for struct inputs only, no options | |
| json_array_length | Supported, with gaps on single-quoted JSON, unescaped control characters, and trailing content | |
When the native path is enabled but an expression or input case has no native
implementation (for example to_json with map or array inputs, or from_json
with an unsupported schema), Comet falls back to the codegen dispatcher for that
case.
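To judge whether a dataset touches the gap categories named for get_json_object and json_array_length (single-quoted JSON, unescaped control characters, trailing content), one option is to screen a sample of the input with a strict parser. The sketch below uses Python's standard json module as a stand-in; the helper name `parses_as_standard_json` is hypothetical, and passing this check is not a guarantee about how either Comet path behaves.

```python
import json


def _reject_constant(name: str) -> None:
    # Python's json accepts NaN/Infinity literals by default; treat them as non-standard.
    raise ValueError(f"non-standard JSON constant: {name}")


def parses_as_standard_json(text: str) -> bool:
    """True if a strict parser accepts the whole string as one JSON document."""
    try:
        json.loads(text, parse_constant=_reject_constant)
        return True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False


samples = {
    "double-quoted": '{"a": 1}',
    "single-quoted": "{'a': 1}",
    "raw control char": '{"a": "line1\nline2"}',  # a literal newline inside the string
    "trailing content": "[1, 2] extra",
}
for label, text in samples.items():
    print(label, parses_as_standard_json(text))
```

Inputs that fail this screen fall into the categories where the table reports gaps, so they are candidates for staying on the codegen dispatcher.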
When to use the native path
Use the native path when you want its speed and your inputs avoid the known compatibility gaps listed above.
Enable it per expression, for example spark.comet.expression.GetJsonObject.allowIncompatible=true. Cases the native path does not cover still fall back to the codegen dispatcher.
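As a small illustration of passing the opt-in at launch time, the sketch below renders a settings dictionary into spark-submit `--conf key=value` arguments. The helper `spark_submit_conf_args` is hypothetical; only the GetJsonObject key comes from this page, and config names for the other expressions are deliberately not guessed here.

```python
from typing import Dict, List


def spark_submit_conf_args(conf: Dict[str, str]) -> List[str]:
    """Render Spark settings as a flat list of `--conf key=value` arguments."""
    args: List[str] = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Opt in to the native path for get_json_object only.
comet_json_conf = {
    "spark.comet.expression.GetJsonObject.allowIncompatible": "true",
}

print(" ".join(spark_submit_conf_args(comet_json_conf)))
# prints: --conf spark.comet.expression.GetJsonObject.allowIncompatible=true
```

Keeping the opt-ins in one dictionary makes it easy to review which expressions have been moved off the byte-exact path.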