Spark Expression Support#

This page is the complete reference for how Apache Comet handles each Spark built-in expression. Comet accelerates expressions either with a native (Rust) implementation or by dispatching to a Spark-compatible codegen path. When an expression is not supported, Comet transparently falls back to Spark for that part of the plan; results are unaffected.

Expressions marked ✅ Supported are enabled by default and produce Spark-compatible results.

Some ✅ Supported expressions have specific incompatible cases that fall back to Spark by default. Those cases must be opted into per expression with spark.comet.expression.EXPRNAME.allowIncompatible=true (where EXPRNAME is the Spark expression class name, for example Cast). There is no global opt-in.

Most expressions can also be disabled with spark.comet.expression.EXPRNAME.enabled=false, where EXPRNAME is the Spark expression class name (for example Length or StartsWith). See the Comet Configuration Guide for the full list.

Status legend#

Status

Meaning

✅ Supported

Comet produces Spark-compatible results by default. Some inputs or forms may fall back to Spark, and any incompatible behavior is opt-in (off by default).

🔜 Planned

Intended; tracked by an open issue or pull request.

Not currently planned#

Comet focuses acceleration on mainstream relational, string, datetime, math, and collection expressions. The following function families are not currently planned for native acceleration (they are not on the 1.0 roadmap): specialized functionality with narrow real-world analytics use and high implementation cost. They fall back to Spark and may be reconsidered based on demand:

  • Probabilistic sketches and approximate top-k (kll_sketch_*, hll_*, theta_*, count_min_sketch, bitmap_*, approx_top_k*): specialized data structures with exact-correctness traps.

  • Geospatial (st_*): brand-new Spark 4.1 functionality, specialized.

  • Avro / Protobuf codecs (from_avro, to_avro, from_protobuf, to_protobuf, schema_of_avro): format conversion belongs at the IO layer, not expression evaluation.

  • JVM reflection (java_method, reflect): niche, and they invoke arbitrary JVM methods (a security concern).

  • UTF-8 validation (is_valid_utf8, make_valid_utf8, validate_utf8, try_validate_utf8): niche Spark 4.x string-validation helpers.

  • Miscellaneous niche (histogram_numeric, version, sentences, quote): low-value or specialized functions with little benefit from native acceleration.

The file-metadata functions input_file_name, input_file_block_start, and input_file_block_length depend on scan-internal per-row file information rather than the expression layer; their support status is covered in the scan compatibility guide.

Note that approx_count_distinct, median, and mode are planned: they are mainstream (median and mode are exact aggregates). approx_percentile / percentile_approx are not currently planned because their approximate results cannot be made bit-identical to Spark.

The tables below list every Spark built-in expression with its current status.

agg_funcs#

Function

Status

Notes

any

any_value

approx_count_distinct

🔜

tracking #4098

array_agg

🔜

Array aggregate (related to collect_list, #2524)

avg

Interval types fall back

bit_and

bit_or

bit_xor

bool_and

bool_or

collect_list

🔜

#2524

collect_set

corr

count

count_if

covar_pop

covar_samp

every

first

first_value

grouping

🔜

Grouping indicator for ROLLUP/CUBE/GROUPING SETS

grouping_id

🔜

Grouping indicator for ROLLUP/CUBE/GROUPING SETS

kurtosis

🔜

tracking #4098

last

last_value

listagg

🔜

String aggregation

max

max_by

🔜

#3841

mean

median

🔜

tracking #4098

min

min_by

🔜

#3841

mode

🔜

#3970

percentile

🔜

#4542

percentile_cont

🔜

Percentile aggregate

percentile_disc

🔜

Percentile aggregate

regr_avgx

Native: Spark rewrites to Average (tests in #4551)

regr_avgy

Native: Spark rewrites to Average (tests in #4551)

regr_count

Native: Spark rewrites to Count (tests in #4551)

regr_intercept

🔜

Falls back; can reuse covar_pop/var_pop accumulators (#4552)

regr_r2

🔜

Falls back; can reuse the corr accumulator (#4552)

regr_slope

🔜

Falls back; can reuse covar_pop/var_pop accumulators (#4552)

regr_sxx

🔜

Falls back; can reuse var_pop accumulator (#4552)

regr_sxy

🔜

Falls back; can reuse covar_pop accumulator (#4552)

regr_syy

🔜

Falls back; can reuse var_pop accumulator (#4552)

skewness

🔜

tracking #4098

some

std

stddev

stddev_pop

stddev_samp

string_agg

🔜

String aggregation (alias of listagg)

sum

try_avg

🔜

tracking #4098

try_sum

🔜

tracking #4098

var_pop

var_samp

variance


array_funcs#

Function

Status

Notes

array

array_append

array_compact

array_contains

NaN/signed-zero handling may differ (details)

array_distinct

NaN/signed-zero handling may differ (details)

array_except

Incompatible; falls back by default (details)

array_insert

array_intersect

Incompatible; falls back by default (details)

array_join

Incompatible; falls back by default (details)

array_max

NaN ordering may differ (details)

array_min

NaN ordering may differ (details)

array_position

Binary/struct/map/null elements fall back

array_prepend

🔜

Sibling of array_append

array_remove

array_repeat

array_union

NaN/signed-zero handling may differ (details)

arrays_overlap

arrays_zip

element_at

MapType input falls back

flatten

Binary/struct/map elements fall back

get

sequence

shuffle

🔜

Random array shuffle

slice

Native (#4149)

sort_array

Nested struct/null arrays fall back


bitwise_funcs#

Function

Status

Notes

&

<<

>>

>>>

Operator alias for shiftrightunsigned (Spark 4.0+)

^

bit_count

bit_get

getbit

shiftright

shiftrightunsigned

|

~


collection_funcs#

Function

Status

Notes

array_size

cardinality

MapType input falls back

concat

Binary/array children fall back

reverse

Binary-element arrays fall back (Incompatible) (details)

size

MapType input falls back


conditional_funcs#

Function

Status

Notes

coalesce

if

ifnull

nanvl

nullif

nullifzero

Lowers to if/= (Spark 4.0+)

nvl

nvl2

when

zeroifnull

Lowers to coalesce (Spark 4.0+)


conversion_funcs#

The type-name conversion functions (bigint, binary, boolean, date, decimal, double, float, int, smallint, string, timestamp, tinyint) are SQL aliases for CAST(... AS <type>) and share the support and caveats of cast.

Function

Status

Notes

cast

Some casts fall back; float-to-decimal is opt-in (details)


csv_funcs#

Function

Status

Notes

from_csv

schema_of_csv

to_csv


datetime_funcs#

Function

Status

Notes

add_months

convert_timezone

curdate

Constant-folded to a literal (alias of current_date)

current_date

Constant-folded to a literal before Comet sees the plan

current_time

🔜

Blocked on Spark 4.1 TIME type support (#4288)

current_timestamp

Constant-folded to a literal before Comet sees the plan

current_timezone

date_add

date_diff

date_format

date_from_unix_date

date_part

date_sub

date_trunc

dateadd

datediff

datepart

day

dayname

🔜

#4544

dayofmonth

dayofweek

dayofyear

extract

from_unixtime

from_utc_timestamp

Legacy zone forms fall back (Incompatible) (details)

hour

last_day

localtimestamp

make_date

make_dt_interval

🔜

#4541

make_interval

🔜

Produces legacy CalendarInterval; tracked by #4540

make_time

🔜

Spark 4.1 TIME type; tracked by #4288

make_timestamp

make_timestamp_ltz

2-arg TIME form falls back

make_timestamp_ntz

2-arg TIME form falls back

make_ym_interval

🔜

#4541

minute

month

monthname

🔜

#4544

months_between

next_day

now

Constant-folded to a literal (alias of current_timestamp)

quarter

second

session_window

🔜

Time-window grouping; tracked by #4553

time_diff

🔜

Spark 4.1 TIME type; tracked by #4288

time_trunc

🔜

Spark 4.1 TIME type; tracked by #4288

timestamp_micros

timestamp_millis

timestamp_seconds

to_date

Rewrites to Cast (or Cast(GetTimestamp) with a format) before Comet sees the plan

to_time

🔜

Spark 4.1 TIME type; tracked by #4288

to_timestamp

Rewrites to Cast (or GetTimestamp with a format) before Comet sees the plan

to_timestamp_ltz

Rewrites to to_timestamp (TimestampType)

to_timestamp_ntz

Rewrites to to_timestamp (TimestampNTZType)

to_unix_timestamp

to_utc_timestamp

Legacy zone forms fall back (Incompatible) (details)

trunc

try_make_interval

🔜

Produces legacy CalendarInterval; tracked by #4540

try_make_timestamp

try_to_date

🔜

Rewrites to Cast/GetTimestamp but currently falls back; tracked by #4556

try_to_time

🔜

Spark 4.1 TIME type; tracked by #4288

try_to_timestamp

🔜

Rewrites to Cast/GetTimestamp but currently falls back; tracked by #4556

unix_date

unix_micros

unix_millis

unix_seconds

unix_timestamp

weekday

weekofyear

window

🔜

Time-window grouping; tracked by #4553

window_time

🔜

Time-window grouping; tracked by #4553

year


generator_funcs#

explode and posexplode are supported via CometExplodeExec (operator-level, not expression-level). The outer variants are wired but marked Incompatible; they require spark.comet.exec.explode.enabled=true and allowIncompatible.

Function

Status

Notes

explode

via CometExplodeExec

explode_outer

outer=true falls back (Incompatible) (audit)

inline

🔜

Operator-level generator (like explode)

inline_outer

🔜

Operator-level generator (like explode)

posexplode

via CometExplodeExec

posexplode_outer

outer=true falls back (Incompatible) (audit)

stack

🔜

Operator-level generator


hash_funcs#

Function

Status

Notes

crc32

hash

md5

sha

sha1

sha2

xxhash64


json_funcs#

Function

Status

Notes

from_json

Falls back by default; opt-in via allowIncompatible (audit)

get_json_object

Some inputs need allowIncompatible (audit)

json_array_length

Single-quoted/trailing JSON needs allowIncompatible (audit)

json_object_keys

json_tuple

🔜

#3160

schema_of_json

to_json

Options and map/array inputs fall back (audit)


lambda_funcs#

Function

Status

Notes

aggregate

array_sort

exists

filter

🔜

General lambda not yet wired; the array_compact form is supported (#4224)

forall

map_filter

map_zip_with

reduce

transform

transform_keys

transform_values

zip_with


map_funcs#

Function

Status

Notes

element_at

MapType input falls back

map

🔜

Constructs a map

map_concat

map_contains_key

map_entries

map_from_arrays

map_from_entries

BinaryType key/value falls back (Incompatible) (details)

map_keys

map_values

str_to_map

try_element_at

Lowers to element_at; array input (MapType falls back)


math_funcs#

Function

Status

Notes

%

*

Interval multiplication falls back

+

-

/

abs

Interval types fall back

acos

acosh

asin

asinh

atan

atan2

atanh

bin

bround

cbrt

ceil

Two-arg form falls back

ceiling

conv

cos

cosh

cot

csc

degrees

div

e

Folds to a literal (like pi)

exp

expm1

factorial

floor

Two-arg form falls back

greatest

hex

hypot

least

ln

log

log10

log1p

log2

mod

negative

pi

pmod

positive

pow

power

radians

rand

randn

random

Alias for rand (Spark 4.0+); seed must be a literal

randstr

🔜

Random string (Spark 4.0+)

rint

round

Float/double inputs fall back

sec

shiftleft

sign

signum

sin

sinh

sqrt

tan

tanh

try_add

Datetime/interval form falls back

try_divide

try_mod

try_multiply

try_subtract

unhex

uniform

Constant-folded; literal arguments only (Spark 4.0+)

width_bucket


misc_funcs#

Function

Status

Notes

aes_decrypt

Routed through the JVM codegen dispatcher

aes_encrypt

Routed through the JVM codegen dispatcher; nondeterministic IV by default

assert_true

🔜

Lowers to RaiseError, which falls back

current_catalog

Resolved to a literal by the analyzer (ReplaceCurrentLike)

current_database

Resolved to a literal by the analyzer (ReplaceCurrentLike)

current_schema

Alias of current_database; resolved to a literal by the analyzer

current_user

Resolved to a literal by the analyzer; same as user

equal_null

Lowers to <=> (EqualNullSafe)

is_variant_null

🔜

tracking #4098

monotonically_increasing_id

parse_json

🔜

tracking #4098

raise_error

🔜

Raises a runtime error

rand

Seed must be a literal

randn

Seed must be a literal

schema_of_variant

🔜

tracking #4098

schema_of_variant_agg

🔜

tracking #4098

session_user

Alias of current_user; resolved to a literal by the analyzer

spark_partition_id

to_variant_object

🔜

tracking #4098

try_aes_decrypt

Routed through the JVM codegen dispatcher

try_parse_json

🔜

tracking #4098

try_variant_get

🔜

tracking #4098

typeof

Foldable; resolved to a literal before Comet sees the plan

user

Resolved to a literal by the Spark analyzer before reaching Comet

uuid

🔜

Nondeterministic random UUID

variant_get

🔜

tracking #4098


predicate_funcs#

Function

Status

Notes

!

<

<=

<=>

=

==

>

>=

and

between

ilike

in

isnan

isnotnull

isnull

like

not

or

regexp

Falls back by default; opt-in via allowIncompatible (details)

regexp_like

Falls back by default; opt-in via allowIncompatible (details)

rlike

Falls back by default; opt-in via allowIncompatible (details)


string_funcs#

Function

Status

Notes

ascii

base64

🔜

Lowers to StaticInvoke(encode) (not allowlisted); falls back

bit_length

btrim

char

char_length

character_length

chr

collate

🔜

Spark collation (umbrella #2190)

collation

Constant-folded to a literal (Spark 4.0+)

concat_ws

contains

decode

elt

encode

🔜

Lowers to StaticInvoke(encode) (not allowlisted); falls back

endswith

find_in_set

format_number

format_string

initcap

instr

lcase

left

len

length

levenshtein

locate

lower

lpad

ltrim

luhn_check

Native via StaticInvoke (tests: luhn_check.sql)

mask

🔜

Data masking

octet_length

overlay

position

printf

regexp_count

🔜

tracking #4098

regexp_extract

🔜

tracking #4098

regexp_extract_all

🔜

tracking #4098

regexp_instr

🔜

tracking #4098

regexp_replace

regexp_substr

🔜

tracking #4098

repeat

replace

right

rpad

rtrim

soundex

space

split

split_part

🔜

Lowers to element_at(StringSplitSQL(...)); StringSplitSQL falls back (#4561)

startswith

substr

substring

substring_index

to_binary

Hex form accelerated; other formats fall back

to_char

to_number

to_varchar

translate

trim

try_to_binary

🔜

Lowers to TryEval(...), which falls back

try_to_number

🔜

TRY variant of to_number

ucase

unbase64

upper


struct_funcs#

Function

Status

Notes

named_struct

Duplicate field names fall back

struct


url_funcs#

Function

Status

Notes

parse_url

try_url_decode

url_decode

url_encode


window_funcs#

Window functions run via CometWindowExec. Window support is disabled by default due to known correctness issues (tracking #2721). When enabled, lag and lead are explicitly wired; aggregate window functions (count, min, max, sum) are also supported. Ranking functions (rank, dense_rank, row_number, ntile, percent_rank, cume_dist, nth_value) are not yet wired in the window serde and fall back to Spark.

Function

Status

Notes

cume_dist

🔜

Window function; tracked by #2721

dense_rank

🔜

Window function; tracked by #2721

lag

via CometWindowExec

lead

via CometWindowExec

nth_value

🔜

Window function; tracked by #2721

ntile

🔜

Window function; tracked by #2721

percent_rank

🔜

Window function; tracked by #2721

rank

🔜

Window function; tracked by #2721

row_number

🔜

Window function; tracked by #2721


xml_funcs#

Function

Status

Notes

from_xml

Spark 4.0+

schema_of_xml

Spark 4.0+

to_xml

Spark 4.0+

xpath

xpath_boolean

xpath_double

xpath_float

xpath_int

xpath_long

xpath_number

Alias of xpath_double

xpath_short

xpath_string


Beyond SQL functions#

Comet also accelerates a number of Catalyst expressions that have no Spark SQL function name and therefore do not appear in the tables above. These arise from the DataFrame API, from SQL syntax other than function calls, or from the query optimizer. They include:

  • Operator and optimizer-injected expressions: runtime bloom-filter join probes (BloomFilterMightContain, BloomFilterAggregate), optimized IN sets (InSet), scalar subqueries (ScalarSubquery), and floating-point normalization (KnownFloatingPointNormalized).

  • Accessor expressions (subscript and field access, not functions): struct field access (col.field), array element access (arr[i]), and map value access (map[key]).

  • Internal decimal arithmetic: CheckOverflow, MakeDecimal, and UnscaledValue, which the analyzer inserts around decimal operations.

  • User-defined functions: Scala UDFs registered through the DataFrame or SQL API.

  • Structural expressions: aliases, attribute references, literals, sort orders, and CASE WHEN.

This list is illustrative, not exhaustive: the per-function tables are not the complete set of expressions Comet can accelerate.

See also#