string_funcs Expression Audits#
Audit notes for expressions in this category that have been audited. Absence of an entry means the expression has not been audited yet, not that it is unsupported. See the user guide Spark Expression Support for current support status.
ascii#
Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
Spark 3.5.8 (audited 2026-05-27): baseline. StringType -> IntegerType; nullSafeEval returns codePointAt(0) of the first char, or 0 for the empty string. Wired via CometScalarFunction("ascii") and resolved to DataFusion ascii (chars().next() as i32); first-code-point semantics match for ASCII, BMP, and supplementary code points.
Spark 4.0.1 (audited 2026-05-27): inputTypes widened to StringTypeWithCollation(supportsTrimCollation = true); behaviour unchanged for UTF8_BINARY. Comet does not propagate collation, so non-default collations may diverge silently (https://github.com/apache/datafusion-comet/issues/4496).
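The first-code-point rule described above can be illustrated with a minimal Python sketch. The helper name spark_ascii is hypothetical and is not part of Spark, Comet, or DataFusion; Python strings are indexed by code point, which is why s[0] stands in for codePointAt(0) here.

```python
def spark_ascii(s: str) -> int:
    """Hypothetical model of the audited rule: code point of the first
    character, or 0 for the empty string."""
    # Python str indexing is per code point, so supplementary characters
    # (outside the BMP) are returned whole rather than as a surrogate half.
    return ord(s[0]) if s else 0


if __name__ == "__main__":
    for sample in ["A", "", "\u00e9", "\U0001F600"]:
        print(repr(sample), spark_ascii(sample))
```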
bit_length#
Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
Spark 3.5.8 (audited 2026-05-27): baseline. (StringType|BinaryType) -> IntegerType; eval returns numBytes * 8 for strings and .length * 8 for binary.
Spark 4.0.1 (audited 2026-05-27): inputTypes widened to StringTypeWithCollation(supportsTrimCollation = true); semantics unchanged for UTF8_BINARY. Non-default collations not honoured by Comet (https://github.com/apache/datafusion-comet/issues/4496).
Known limitation: wired as a raw CometScalarFunction("bit_length") with no BinaryType guard. DataFusion’s BitLengthFunc signature only accepts string types, so bit_length(<binary>) execute-fails on the native side instead of falling back cleanly (https://github.com/apache/datafusion-comet/issues/4464).
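As a rough illustration of the Spark-side rule (byte count times eight for both strings and binary), the sketch below uses a hypothetical helper named bit_length_model. It models the documented semantics only and says nothing about how Comet or DataFusion implement the function.

```python
from typing import Union


def bit_length_model(value: Union[str, bytes]) -> int:
    """Hypothetical model: number of bytes * 8.

    Strings are measured in their UTF-8 encoding (numBytes); binary values
    are measured by their raw length.
    """
    data = value.encode("utf-8") if isinstance(value, str) else bytes(value)
    return len(data) * 8


if __name__ == "__main__":
    print(bit_length_model("abc"))       # three 1-byte characters
    print(bit_length_model("\u00e9"))    # one 2-byte character
    print(bit_length_model(b"\x00\x01")) # two raw bytes
```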
btrim#
Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
Spark 3.5.8 (audited 2026-05-27): baseline. StringTrimBoth is RuntimeReplaceable and rewritten to StringTrim(srcStr, trimStr) before serde runs. Support is provided by the trim entry; no dedicated serde registration.
Spark 4.0.1 (audited 2026-05-27): StringTrim (the rewrite target) routes through CollationSupport.StringTrim.exec and uses StringTypeNonCSAICollation(supportsTrimCollation = true); semantics unchanged for UTF8_BINARY. Non-default collations may diverge in Comet (https://github.com/apache/datafusion-comet/issues/4496).
char#
Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
Spark 3.5.8 (audited 2026-05-27): baseline. Chr(LongType) -> StringType; lon < 0 returns "", else ((lon & 0xFF) as char).toString (so chr(256) and chr(0) both return \u0000).
Spark 4.0.1 (audited 2026-05-27): semantics unchanged; NullIntolerant trait replaced by override def nullIntolerant: Boolean = true. Resolves natively to datafusion_spark::function::string::char::CharFunc, which mirrors Spark’s negative-input and & 0xFF semantics.
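The negative-input and & 0xFF behaviour can be sketched in a few lines of Python. The function name spark_chr is hypothetical; the sketch only mirrors the rule stated in the baseline note.

```python
def spark_chr(n: int) -> str:
    """Hypothetical model of Chr: empty string for negative input,
    otherwise the character for the low 8 bits of n."""
    if n < 0:
        return ""
    # Masking with 0xFF means 256 wraps to 0 and 321 wraps to 65 ("A").
    return chr(n & 0xFF)


if __name__ == "__main__":
    for n in (65, -1, 0, 256, 321):
        print(n, repr(spark_chr(n)))
```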
char_length#
Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
Spark 3.5.8 (audited 2026-05-27): registry alias of Length. Same support as length.
Spark 4.0.1 (audited 2026-05-27): unchanged alias of Length.
character_length#
Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
Spark 3.5.8 (audited 2026-05-27): registry alias of Length. Same support as length.
Spark 4.0.1 (audited 2026-05-27): unchanged alias of Length.
chr#
Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
Spark 3.5.8 (audited 2026-05-27): registry alias of Chr. Same support as char.
Spark 4.0.1 (audited 2026-05-27): unchanged alias of Chr.
concat_ws#
Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
Spark 3.5.8 (audited 2026-05-27): baseline. Seq[Expression] -> StringType; NULL separator yields NULL, NULL element values are skipped, children can be StringType or ArrayType(StringType). Comet serde rewrites a NULL-literal separator to a NULL of the result type and bails out on all-foldable inputs so Spark’s ConstantFolding handles them; otherwise delegates to DataFusion concat_ws.
Spark 4.0.1 (audited 2026-05-27): inputTypes widened to StringTypeWithCollation / AbstractArrayType; dataType becomes children.head.dataType (collation-derived). Semantics unchanged for UTF8_BINARY. Non-default collations not honoured by Comet (https://github.com/apache/datafusion-comet/issues/4496).
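The NULL handling described in the baseline note might be modelled as follows. This is a sketch under stated assumptions: concat_ws_model is a hypothetical name, Python None stands in for SQL NULL, a Python list stands in for ArrayType(StringType), and skipping NULLs inside arrays is an assumption rather than something the note spells out.

```python
from typing import List, Optional, Union

Arg = Union[None, str, List[Optional[str]]]


def concat_ws_model(sep: Optional[str], *args: Arg) -> Optional[str]:
    """Hypothetical model: NULL separator yields NULL, NULL values are skipped."""
    if sep is None:
        return None
    parts: List[str] = []
    for arg in args:
        if arg is None:
            continue  # NULL scalar arguments are skipped
        if isinstance(arg, list):
            # Assumption: NULL elements inside an array are skipped as well.
            parts.extend(x for x in arg if x is not None)
        else:
            parts.append(arg)
    return sep.join(parts)


if __name__ == "__main__":
    print(concat_ws_model(",", "a", None, ["b", None, "c"]))
    print(concat_ws_model(None, "a", "b"))
```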
contains#
Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
Spark 3.5.8 (audited 2026-05-27): baseline. UTF8String.contains on StringType; the parser routes (BinaryType, BinaryType) to BinaryPredicate, so Comet only ever sees the String form.
Spark 4.0.1 (audited 2026-05-27): routes through CollationSupport.Contains.exec(..., collationId); behaviour identical for UTF8_BINARY. Non-default collations not honoured by Comet (https://github.com/apache/datafusion-comet/issues/4496).
decode#
Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
Spark 3.5.8 (audited 2026-05-27): baseline. StringDecode(bin, charset) evaluated directly; invalid sequences silently substitute replacement characters via new String(bytes, charset).
Spark 4.0.1 (audited 2026-05-27): refactored to RuntimeReplaceable whose replacement is a StaticInvoke(StringDecode.decode, bin, charset, legacyCharsets, legacyErrorAction); the 4-arg form raises on malformed input unless legacy flags are set.
Known limitations: Comet handles decode via CommonStringExprs.stringDecode from the version shims (no CometExpressionSerde[StringDecode] registration, so the function does not surface in the auto-generated compatibility docs: https://github.com/apache/datafusion-comet/issues/4466). Only literal charset = 'utf-8' (case-insensitive) is supported; everything else falls back. The Spark 4.0 legacyCharsets / legacyErrorAction flags are ignored: Comet always lowers to Cast(bin, StringType, TRY), so invalid UTF-8 yields NULL where Spark 3.x substitutes replacement characters and Spark 4.0 (non-legacy) raises (https://github.com/apache/datafusion-comet/issues/4465).
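The three outcomes for malformed UTF-8 (replacement characters, an error, or NULL) can be contrasted with Python's own decoder. The function names below are hypothetical labels for the three behaviours; Python's codec is only a stand-in, and the exact number of replacement characters a JVM emits for a given byte sequence may differ.

```python
from typing import Optional

BAD = b"ab\xffcd"  # 0xFF can never appear in well-formed UTF-8


def decode_replace(data: bytes) -> str:
    """Spark 3.x-like: substitute U+FFFD for invalid sequences."""
    return data.decode("utf-8", errors="replace")


def decode_strict(data: bytes) -> str:
    """Spark 4.0 non-legacy-like: raise on malformed input."""
    return data.decode("utf-8", errors="strict")


def decode_try(data: bytes) -> Optional[str]:
    """TRY-cast-like: malformed input yields None (standing in for NULL)."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    print(repr(decode_replace(BAD)))
    print(repr(decode_try(BAD)))
    try:
        decode_strict(BAD)
    except UnicodeDecodeError as exc:
        print("raised:", type(exc).__name__)
```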
endswith#
Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
Spark 3.5.8 (audited 2026-05-27): baseline. UTF8String.endsWith on StringType; binary form routed to BinaryPredicate before Comet.
Spark 4.0.1 (audited 2026-05-27): routes through CollationSupport.EndsWith.exec; semantics unchanged for UTF8_BINARY. Non-default collations not honoured by Comet (https://github.com/apache/datafusion-comet/issues/4496).
initcap#
Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
Spark 3.5.8 (audited 2026-05-27): baseline. string.toLowerCase.toTitleCase on UTF8String; word boundary is Java Character.isWhitespace. Comet routes to DataFusion initcap, which splits on !is_alphanumeric() (hyphens, apostrophes, and punctuation all split words), so Comet is unconditionally Incompatible (https://github.com/apache/datafusion-comet/issues/1052).
Spark 4.0.1 (audited 2026-05-27): routes through CollationSupport.InitCap.exec (collation- and ICU-aware) and propagates child.dataType. Comet ignores collation; 3.x divergences persist plus collation/ICU mismatches (https://github.com/apache/datafusion-comet/issues/4496).
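The word-boundary difference is easiest to see on an input containing an apostrophe and a hyphen. The two functions below are hypothetical models, not the Spark or DataFusion code: str.isspace approximates Character.isWhitespace and str.isalnum approximates is_alphanumeric(), and neither approximation is exact for every Unicode character.

```python
def initcap_whitespace(s: str) -> str:
    """Model of the Spark 3.x rule: a new word starts only after whitespace."""
    out, start_of_word = [], True
    for ch in s.lower():
        out.append(ch.upper() if start_of_word else ch)
        start_of_word = ch.isspace()
    return "".join(out)


def initcap_alnum(s: str) -> str:
    """Model of the DataFusion rule: a new word starts after any
    non-alphanumeric character."""
    out, start_of_word = [], True
    for ch in s.lower():
        out.append(ch.upper() if start_of_word else ch)
        start_of_word = not ch.isalnum()
    return "".join(out)


if __name__ == "__main__":
    text = "o'neil-smith JR"
    print(initcap_whitespace(text))
    print(initcap_alnum(text))
```

On inputs made only of letters and spaces the two models agree, which is why the divergence tends to show up on names and hyphenated words.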
instr#
Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
Spark 3.5.8 (audited 2026-05-27): baseline. StringInstr(str, substr) -> IntegerType; returns string.indexOf(sub, 0) + 1 (1-based, 0 when not found, 1 on empty substring). Resolves to DataFusion strpos (alias instr) with matching semantics.
Spark 4.0.1 (audited 2026-05-27): routes through CollationSupport.StringInstr.exec; semantics unchanged for UTF8_BINARY. Non-default collations not honoured by Comet (https://github.com/apache/datafusion-comet/issues/4496).
lcase#
Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
Spark 3.5.8 (audited 2026-05-27): registry alias of Lower. Same support as lower.
Spark 4.0.1 (audited 2026-05-27): unchanged alias of Lower.
left#
Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
Spark 3.5.8 (audited 2026-05-27): baseline. RuntimeReplaceable with replacement = Substring(str, Literal(1), len); accepts StringType or BinaryType plus IntegerType. Comet serde rewrites to a Substring proto with start=1, len=lenValue. getSupportLevel declares Unsupported for non-literal len so the dispatcher falls back uniformly.
Spark 4.0.1 (audited 2026-05-27): inputTypes widened with StringTypeWithCollation; behaviour unchanged for UTF8_BINARY. Non-default collations not honoured by Comet (https://github.com/apache/datafusion-comet/issues/4496).
len#
Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
Spark 3.5.8 (audited 2026-05-27): registry alias of Length. Same support as length.
Spark 4.0.1 (audited 2026-05-27): unchanged alias of Length.
length#
Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
Spark 3.5.8 (audited 2026-05-27): baseline. (StringType|BinaryType) -> IntegerType; eval returns numChars for strings and .length for binary. BinaryType input falls back via Unsupported (DataFusion’s character_length accepts string types only).
Spark 4.0.1 (audited 2026-05-27): inputTypes widened to StringTypeWithCollation(supportsTrimCollation = true); semantics unchanged. Non-default collations not honoured by Comet (https://github.com/apache/datafusion-comet/issues/4496).
lower#
Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
Spark 3.5.8 (audited 2026-05-27): baseline. JVM default-locale toLowerCase on UTF8String. Comet routes to DataFusion lower (Rust Unicode default case mapping, no locale awareness) and is unconditionally Incompatible; users opt in via the standard spark.comet.expression.Lower.allowIncompatible=true.
Spark 4.0.1 (audited 2026-05-27): routes through CollationSupport.Lower.exec(v, collationId, useICU) with SQLConf.ICU_CASE_MAPPINGS_ENABLED; inputTypes widened to StringTypeWithCollation. Comet ignores collation and ICU mode, so non-default collations or ICU_CASE_MAPPINGS_ENABLED=true diverge even after opting in (https://github.com/apache/datafusion-comet/issues/2190).
lpad#
Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
Spark 3.5.8 (audited 2026-05-27): baseline. StringLPad(str, len, pad) -> StringType; len <= 0 returns the empty string, empty pad returns str unchanged, NULL inputs propagate. Comet serde requires str to be a column and pad to be a literal; otherwise falls back.
Spark 4.0.1 (audited 2026-05-27): NullIntolerant trait replaced by override def nullIntolerant: Boolean = true; inputTypes widened to StringTypeWithCollation(supportsTrimCollation = true). Semantics unchanged for UTF8_BINARY. Non-default collations not honoured by Comet (https://github.com/apache/datafusion-comet/issues/4496).
Known limitation: lpad(<binary>, ...) is rewritten by Spark to BinaryPad / StaticInvoke(ByteArray.lpad) before serde runs and always falls back to Spark.
ltrim#
Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
Spark 3.5.8 (audited 2026-05-27): baseline. StringTrimLeft extends String2TrimExpression; no-arg form strips ASCII space 0x20 only. The two-arg parser form ltrim(trimStr, srcStr) is swapped to (srcStr, Option(trimStr)) by Spark’s secondary constructor, so children match DataFusion ltrim(str, chars).
Spark 4.0.1 (audited 2026-05-27): routes through CollationSupport.StringTrimLeft.exec; semantics unchanged for UTF8_BINARY. Non-default collations not honoured by Comet (https://github.com/apache/datafusion-comet/issues/4496).
octet_length#
Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
Spark 3.5.8 (audited 2026-05-27): baseline. (StringType|BinaryType) -> IntegerType; eval returns numBytes for strings and .length for binary.
Spark 4.0.1 (audited 2026-05-27): inputTypes widened to StringTypeWithCollation; semantics unchanged for UTF8_BINARY. Non-default collations not honoured by Comet (https://github.com/apache/datafusion-comet/issues/4496).
Known limitation: wired as a raw CometScalarFunction("octet_length") with no BinaryType guard. DataFusion’s OctetLengthFunc signature only accepts string types, so octet_length(<binary>) execute-fails on the native side instead of falling back cleanly (https://github.com/apache/datafusion-comet/issues/4464).
regexp_replace#
Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
Spark 3.5.8 (audited 2026-05-27): baseline. RegExpReplace(subject, regexp, rep, pos) with foldable pos > 0; uses Java Pattern. Comet supports only pos = 1 (other offsets fall back) and injects a 'g' flag because DataFusion’s regexp_replace stops at the first match by default.
Spark 4.0.1 (audited 2026-05-27): adds raw-string literal support at the parser level and nullIntolerant: Boolean = true; runtime semantics unchanged.
Known limitation: regex semantics differ (Rust regex crate vs Java Pattern); RegExp.isSupportedPattern currently returns false for every pattern, so the path always requires spark.comet.expression.regexp.allowIncompatible=true.
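The reason for the injected 'g' flag can be seen by comparing first-match replacement with replace-all. The sketch below uses Python's re module purely as a stand-in; it is neither Java Pattern nor the Rust regex crate, and the function names are hypothetical.

```python
import re


def replace_first(subject: str, pattern: str, rep: str) -> str:
    """Models a regexp_replace that stops after the first match."""
    return re.sub(pattern, rep, subject, count=1)


def replace_all(subject: str, pattern: str, rep: str) -> str:
    """Models the replace-every-match behaviour that Spark expects."""
    return re.sub(pattern, rep, subject)


if __name__ == "__main__":
    print(replace_first("a1b22c", r"\d+", "#"))
    print(replace_all("a1b22c", r"\d+", "#"))
```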
repeat#
Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
Spark 3.5.8 (audited 2026-05-27): baseline. StringRepeat(str, times) with nullSafeEval(s, n) = s.repeat(n); UTF8String.repeat returns the empty string for n <= 0. Comet casts times to LongType and delegates to DataFusion repeat, which mirrors Spark for negative counts.
Spark 4.0.1 (audited 2026-05-27): adds nullIntolerant: Boolean field; dataType becomes str.dataType (collation-tracking). Semantics unchanged for UTF8_BINARY. Non-default collations not honoured by Comet (https://github.com/apache/datafusion-comet/issues/4496).
replace#
Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
Spark 3.5.8 (audited 2026-05-27): baseline. StringReplace(src, search, replace); when search is empty, Spark returns src unchanged (short-circuit on search.numBytes == 0). DataFusion replace instead inserts replace between every character, so CometStringReplace.getSupportLevel marks Incompatible(Some(reason)) when search is a literal empty string and falls back to Spark by default (https://github.com/apache/datafusion-comet/issues/4497).
Spark 4.0.1 (audited 2026-05-27): routes through CollationSupport.StringReplace.exec; semantics unchanged for UTF8_BINARY. Non-default collations not honoured by Comet (https://github.com/apache/datafusion-comet/issues/4496).
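The empty-search divergence can be reproduced with Python's str.replace, which also has no short-circuit for an empty search string. Both function names are hypothetical, and Python's result is only an analogy for the native behaviour: the exact characters DataFusion emits at the start and end of the string are not asserted here.

```python
def replace_with_short_circuit(src: str, search: str, rep: str) -> str:
    """Model of the Spark rule: an empty search string returns src unchanged."""
    if search == "":
        return src
    return src.replace(search, rep)


def replace_without_short_circuit(src: str, search: str, rep: str) -> str:
    """Model of a replace with no empty-search guard (Python's own behaviour)."""
    return src.replace(search, rep)


if __name__ == "__main__":
    print(replace_with_short_circuit("abc", "", "-"))
    print(replace_without_short_circuit("abc", "", "-"))
```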
right#
Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
Spark 3.5.8 (audited 2026-05-27): baseline. RuntimeReplaceable with replacement = If(IsNull(str), null, If(len <= 0, "", Substring(str, -len, len))); accepts StringType plus IntegerType. Comet serde rewrites positive len to a Substring proto with start=-len, len=len; for len <= 0 it builds an If(IsNull(str), null, "") proto chain to preserve NULL propagation. getSupportLevel declares Unsupported for non-literal len so the dispatcher falls back uniformly.
Spark 4.0.1 (audited 2026-05-27): inputTypes widened with collation; uses UnaryMinus(len, failOnError = false) to avoid integer-overflow exceptions on len = Int.MinValue. Semantics unchanged for UTF8_BINARY. Non-default collations not honoured by Comet (https://github.com/apache/datafusion-comet/issues/4496).
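The nested If replacement can be read as three cases: NULL input, non-positive length, and the tail substring. The sketch below models those cases with a hypothetical function named right_model, using None for NULL and a Python slice for Substring(str, -len, len).

```python
from typing import Optional


def right_model(s: Optional[str], n: int) -> Optional[str]:
    """Hypothetical model of If(IsNull(str), null, If(len <= 0, "", tail))."""
    if s is None:
        return None  # NULL propagates even when n <= 0
    if n <= 0:
        return ""
    # s[-n:] yields the whole string when n exceeds its length, which is
    # what a clamped negative start position also produces.
    return s[-n:]


if __name__ == "__main__":
    print(repr(right_model("hello", 2)))
    print(repr(right_model("hello", 0)))
    print(repr(right_model(None, 0)))
    print(repr(right_model("hi", 5)))
```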
rpad#
Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
Spark 3.5.8 (audited 2026-05-27): baseline. StringRPad(str, len, pad) -> StringType; same edge-case behaviour as lpad (negative len, empty pad, NULL propagation). Comet serde requires column str and literal pad.
Spark 4.0.1 (audited 2026-05-27): same evolution as lpad; default-pad literal type tightened; semantics unchanged for UTF8_BINARY. Non-default collations not honoured by Comet (https://github.com/apache/datafusion-comet/issues/4496).
Known limitation: same BinaryPad / StaticInvoke rewrite as lpad causes rpad(<binary>, ...) to fall back.
rtrim#
Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
Spark 3.5.8 (audited 2026-05-27): baseline. StringTrimRight extends String2TrimExpression; semantically symmetric to ltrim.
Spark 4.0.1 (audited 2026-05-27): routes through CollationSupport.StringTrimRight.exec; semantics unchanged for UTF8_BINARY. Non-default collations not honoured by Comet (https://github.com/apache/datafusion-comet/issues/4496).
space#
Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
Spark 3.5.8 (audited 2026-05-27): baseline. StringSpace(IntegerType) -> StringType; negative input yields the empty string. Resolves natively to datafusion_spark::function::string::space::SparkSpace.
Spark 4.0.1 (audited 2026-05-27): semantics unchanged; NullIntolerant trait replaced by nullIntolerant: Boolean override.
split#
Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
Spark 3.5.8 (audited 2026-05-27): baseline. StringSplit(str, regex, limit); limit > 0 permits at most limit-1 splits, limit <= 0 is unlimited. Comet registers split as a custom UDF (native/spark-expr/src/string_funcs/split.rs) using the Rust regex crate, and is unconditionally Incompatible due to regex-engine differences.
Spark 4.0.1 (audited 2026-05-27): wraps the regex via CollationSupport.collationAwareRegex and changes dataType to ArrayType(str.dataType, ...). Comet does not honour collation flags (https://github.com/apache/datafusion-comet/issues/4496).
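The limit rule maps onto a maximum split count, which the following sketch illustrates with Python's re.split. The name split_model is hypothetical, Python's regex engine is only a stand-in for Java Pattern, and patterns with capturing groups or empty matches behave differently in Python, so the sketch sticks to a plain delimiter.

```python
import re
from typing import List


def split_model(s: str, pattern: str, limit: int = -1) -> List[str]:
    """Hypothetical model: limit > 0 allows at most limit - 1 splits,
    limit <= 0 means no cap on the number of splits."""
    if limit == 1:
        # Zero splits allowed. Handled explicitly because maxsplit=0 means
        # "unlimited" in Python's re.split.
        return [s]
    maxsplit = limit - 1 if limit > 0 else 0
    return re.split(pattern, s, maxsplit=maxsplit)


if __name__ == "__main__":
    print(split_model("a,b,c,d", ",", 2))
    print(split_model("a,b,c,d", ",", -1))
    print(split_model("a,b,c,d", ",", 1))
```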
startswith#
Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
Spark 3.5.8 (audited 2026-05-27): baseline. UTF8String.startsWith on StringType; binary form routed to BinaryPredicate.
Spark 4.0.1 (audited 2026-05-27): routes through CollationSupport.StartsWith.exec; semantics unchanged for UTF8_BINARY. Non-default collations not honoured by Comet (https://github.com/apache/datafusion-comet/issues/4496).
substr#
Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
Spark 3.5.8 (audited 2026-05-27): registry alias of Substring. Same support as substring.
Spark 4.0.1 (audited 2026-05-27): unchanged alias of Substring.
substring#
Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
Spark 3.5.8 (audited 2026-05-27): baseline. TernaryExpression; two-arg form defaults len = Integer.MAX_VALUE; supports StringType and BinaryType. Comet serializes to a dedicated Substring proto. getSupportLevel declares Unsupported when either pos or len is not a Literal so the dispatcher falls back uniformly.
Spark 4.0.1 (audited 2026-05-27): inputTypes widened with StringTypeWithCollation; semantics unchanged for UTF8_BINARY. Native SubstringExpr implements Spark’s negative-start clamping and is exercised against ASCII, multibyte UTF-8, emoji, decomposed and Telugu inputs. Non-default collations not honoured by Comet (https://github.com/apache/datafusion-comet/issues/4496).
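Negative-start clamping is the least obvious part of the substring rules, so a small model may help. The sketch below is an assumption-laden reading of Spark's string substring semantics (1-based positions, 0 treated as 1, negative positions counted from the end, and a start before the beginning clamped to the beginning while the end offset is left where it was); substring_model is a hypothetical name and the audit note itself does not spell these rules out.

```python
INT_MAX = 2**31 - 1  # stands in for Integer.MAX_VALUE in the two-arg form


def substring_model(s: str, pos: int, length: int = INT_MAX) -> str:
    """Hypothetical model of substring over code points."""
    n = len(s)
    if pos > 0:
        start = pos - 1   # 1-based to 0-based
    elif pos < 0:
        start = n + pos   # count from the end; may go below zero
    else:
        start = 0         # position 0 behaves like position 1
    end = start + length  # computed before clamping the start
    # Clamp both offsets at zero; an end at or before the start gives "".
    return s[max(start, 0):max(end, 0)]


if __name__ == "__main__":
    print(repr(substring_model("hello", 2, 3)))
    print(repr(substring_model("hello", -3, 2)))
    print(repr(substring_model("abc", -5, 3)))
```

Because the end offset is computed from the unclamped start, substring_model("abc", -5, 3) keeps only one character rather than three.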
substring_index#
Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
Spark 3.5.8 (audited 2026-05-27): baseline. TernaryExpression(StringType, StringType, IntegerType) -> StringType. Comet casts count to LongType and delegates to DataFusion’s substr_index UDF (alias substring_index).
Spark 4.0.1 (audited 2026-05-27): routes through CollationSupport.SubstringIndex.exec and propagates strExpr.dataType; semantics unchanged for UTF8_BINARY. Non-default collations not honoured by Comet (https://github.com/apache/datafusion-comet/issues/4496).
translate#
Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
Spark 3.5.8 (audited 2026-05-27): baseline. StringTranslate(src, from, to); UTF8String.translate(dict) is code-point based, and any character mapped explicitly to U+0000 in to is also deleted.
Spark 4.0.1 (audited 2026-05-27): routes through CollationSupport.StringTranslate.exec; semantics unchanged for UTF8_BINARY. Non-default collations not honoured by Comet (https://github.com/apache/datafusion-comet/issues/4496).
Known divergence: DataFusion’s translate is grapheme-based (Spark uses code points), and does not delete characters mapped to U+0000 in to. Currently the support level is Compatible (https://github.com/apache/datafusion-comet/issues/4463).
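The deletion rule for U+0000 can be modelled with a per-code-point dictionary. The sketch below is hypothetical (translate_model is not a Spark or DataFusion name) and rests on two assumptions that go beyond the note: characters in from with no counterpart in to are treated as mapped to U+0000, and the first occurrence of a repeated character in from wins.

```python
def translate_model(src: str, from_chars: str, to_chars: str) -> str:
    """Hypothetical code-point-based model of translate."""
    mapping = {}
    for i, ch in enumerate(from_chars):
        if ch not in mapping:  # assumption: first occurrence wins
            # Assumption: a missing counterpart behaves like U+0000.
            mapping[ch] = to_chars[i] if i < len(to_chars) else "\0"
    out = []
    for ch in src:  # Python iterates str by code point
        if ch in mapping:
            if mapping[ch] != "\0":
                out.append(mapping[ch])
            # A character mapped to U+0000 is deleted.
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(translate_model("AaBbCc", "abc", "123"))
    print(translate_model("abc", "b", "\0"))
```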
trim#
Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
Spark 3.5.8 (audited 2026-05-27): baseline. StringTrim no-arg form strips ASCII space 0x20 only (matches DataFusion btrim’s default); two-arg form’s children are (srcStr, trimStr) after Spark’s secondary-constructor swap.
Spark 4.0.1 (audited 2026-05-27): routes through CollationSupport.StringTrim.exec and uses StringTypeNonCSAICollation; semantics unchanged for UTF8_BINARY. Non-default collations not honoured by Comet (https://github.com/apache/datafusion-comet/issues/4496).
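The "ASCII space only" default differs from what many languages strip by default, which the sketch below illustrates. The name trim_model is hypothetical; Python's str.strip with an explicit character set is used as a stand-in for trimming any character contained in trimStr.

```python
from typing import Optional


def trim_model(s: str, trim_chars: Optional[str] = None) -> str:
    """Hypothetical model: the no-arg form strips only U+0020 from both ends;
    the two-arg form strips any character contained in trim_chars."""
    return s.strip(" " if trim_chars is None else trim_chars)


if __name__ == "__main__":
    padded = "  \tx\t  "
    print(repr(trim_model(padded)))  # tabs survive, spaces do not
    print(repr(padded.strip()))      # Python's default also strips tabs
    print(repr(trim_model("xxhixx", "x")))
```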
ucase#
Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
Spark 3.5.8 (audited 2026-05-27): registry alias of Upper. Same support as upper.
Spark 4.0.1 (audited 2026-05-27): unchanged alias of Upper.
upper#
Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
Spark 3.5.8 (audited 2026-05-27): baseline. JVM default-locale toUpperCase on UTF8String. Comet routes to DataFusion upper (Rust Unicode default case mapping, no locale awareness) and is unconditionally Incompatible; users opt in via the standard spark.comet.expression.Upper.allowIncompatible=true.
Spark 4.0.1 (audited 2026-05-27): routes through CollationSupport.Upper.exec(v, collationId, useICU) with SQLConf.ICU_CASE_MAPPINGS_ENABLED. Comet does not propagate collation or ICU mode; non-default collations or ICU_CASE_MAPPINGS_ENABLED=true diverge even after opting in (https://github.com/apache/datafusion-comet/issues/2190).