Regular Expressions
Comet evaluates Spark regular-expression expressions (`rlike`, `regexp_replace`, `split`,
`regexp_extract`, `regexp_extract_all`, `regexp_instr`) in one of two ways:

- Codegen dispatcher (default): Spark’s own `doGenCode` for the expression runs inside Comet’s Arrow-direct codegen dispatcher (the same dispatcher used by Comet’s `ScalaUDF` codegen path). This is 100% compatible with Spark, at the cost of one JNI round-trip per batch. It is enabled by default (`spark.comet.exec.scalaUDF.codegen.enabled=true`); if the dispatcher is disabled, regex expressions fall back to Spark.
- Native (Rust) engine: the Rust `regex` crate, run natively with no JNI overhead. It is faster but has different semantics from Java regex (see below), so it is opt-in per expression via that expression’s `allowIncompatible` flag. `rlike`, `regexp_replace`, `split`, `regexp_extract`, and `regexp_extract_all` have a native implementation; `regexp_instr` does not and always runs through the codegen dispatcher.
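| SQL | Native (Rust) opt-in config |
|---|---|
| `rlike` | `spark.comet.expression.RLike.allowIncompatible=true` |
| `regexp_replace` | |
| `split` | |
| `regexp_extract` | |
| `regexp_extract_all` | |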
When the native path is opted in but a case has no native implementation (for example a non-scalar
rlike pattern, regexp_replace with a non-1 offset, or regexp_extract with a non-literal
pattern or idx), Comet routes that case through the codegen dispatcher.
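
As a rough illustration of the opt-in described above, the snippet below assembles the configuration entries for running `rlike` on the native engine. Only the `RLike` key and the dispatcher flag are taken from this page; the helper name `allow_incompatible_key` is hypothetical, and applying the key shape to other class names is an assumption. Handing the dictionary to Spark is not shown.

```python
# Hypothetical helper, not part of Comet: builds the per-expression opt-in key.
# The key shape is generalized from the RLike example given on this page.
def allow_incompatible_key(class_name: str) -> str:
    return f"spark.comet.expression.{class_name}.allowIncompatible"


native_rlike_conf = {
    # Keep the codegen dispatcher on (its default) so that cases without a
    # native implementation still run inside Comet.
    "spark.comet.exec.scalaUDF.codegen.enabled": "true",
    # Accept the Rust engine's semantic differences for rlike only.
    allow_incompatible_key("RLike"): "true",
}

for key, value in sorted(native_rlike_conf.items()):
    print(f"{key}={value}")
```

Opting in one expression at a time keeps every other regex expression on the Spark-compatible default path.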
Disabling Comet for individual regex expressions
Each regex expression has a per-class spark.comet.expression.<ClassName>.enabled flag (default
true) that disables Comet’s serde for that expression and forces a Spark fallback. This is
useful for narrowing a regression or comparing performance on a single operator without changing
the engine selector:
| Expression | Config |
|---|---|
| | |
| | |
| | |
| | |
| | |
| | |
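
A minimal sketch of how the per-class flag might be used when narrowing a regression: it builds the `spark.comet.expression.<ClassName>.enabled` key and renders it as `spark-submit` style `--conf` arguments. The helper names `enabled_key` and `as_submit_args` are hypothetical; `RLike` is used because it is the class name this page confirms.

```python
# Hypothetical helpers, not part of Comet.
def enabled_key(class_name: str) -> str:
    # Follows the spark.comet.expression.<ClassName>.enabled pattern.
    return f"spark.comet.expression.{class_name}.enabled"


def as_submit_args(conf: dict) -> list:
    # Render each entry as a "--conf key=value" pair, in a stable order.
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


# Force a Spark fallback for rlike only; every other expression is untouched.
disable_rlike = {enabled_key("RLike"): "false"}
print(" ".join(as_submit_args(disable_rlike)))
```

Because the flag is scoped to one expression class, the same job can be rerun with and without it to compare a single operator.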
Choosing an engine
| | Rust engine | Codegen dispatcher (default) |
|---|---|---|
| Compatibility | Differs from Java regex (see below) | 100% compatible with Spark |
| Feature coverage | `rlike`, `regexp_replace`, `split`, `regexp_extract`, `regexp_extract_all` | All regexp expressions (including `regexp_instr`) |
| Performance | Fully native, no JNI overhead | One JNI round-trip per batch (Arrow vectors stay columnar) |
| Pattern support | Linear-time subset only | All Java regex features (backreferences, lookaround, etc.) |
The Rust engine is faster but cannot match Java regex semantics for every pattern. Opting in per
expression (for example spark.comet.expression.RLike.allowIncompatible=true) declares acceptance
of those differences.
The codegen dispatcher is the default and is controlled by `spark.comet.exec.scalaUDF.codegen.enabled`,
so setting that flag to `false` disables it globally and falls back to Spark for the regex family.
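
The routing rules on this page can be summarized as a small decision function. This is a sketch for reasoning about the flags, not Comet's implementation: the function name `regex_path`, its parameters, and the order of the checks are assumptions. In particular, this page does not say whether a native opt-in still applies when the dispatcher is disabled; the sketch assumes it does.

```python
def regex_path(*, expr_enabled=True, allow_incompatible=False,
               has_native_impl=False, codegen_enabled=True):
    """Return which path is assumed to evaluate a regex expression."""
    if not expr_enabled:
        # spark.comet.expression.<ClassName>.enabled=false forces a fallback.
        return "spark"
    if allow_incompatible and has_native_impl:
        # Opted in, and this particular case has a native implementation.
        return "native"
    if codegen_enabled:
        # Default path, and the route for opted-in cases without native support.
        return "codegen"
    # Dispatcher disabled and no native route available.
    return "spark"


print(regex_path())                                               # default settings
print(regex_path(allow_incompatible=True, has_native_impl=True))  # native opt-in
print(regex_path(allow_incompatible=True, has_native_impl=False)) # e.g. regexp_instr
```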
Why the engines differ
Java’s java.util.regex is a backtracking engine in the Perl/PCRE family. It supports the full range of
features that style of engine provides, including some whose worst-case running time grows exponentially with
the input.
Rust’s regex crate is a finite-automaton engine in the RE2 family. It deliberately omits features that
cannot be implemented with a guarantee of linear-time matching. In exchange, every pattern it does accept runs
in time linear in the size of the input. This is the same trade-off RE2, Go’s regexp, and several other
engines make.
The practical consequence is that Java accepts a strictly larger set of patterns than the Rust engine, and several constructs that look the same in source have different semantics on the two sides.
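
To make the running-time trade-off concrete, here is a toy comparison written for this page. It is neither engine's real code. It matches the classic stress pattern of `n` optional `a` tokens followed by `n` required `a` tokens against a string of `n` `a` characters, once with naive backtracking and once by tracking the set of reachable pattern positions, which is the idea behind automaton-based engines. The function names and the step counters are illustrative assumptions.

```python
def backtrack_match(tokens, text):
    """Naive backtracking: try each choice in turn, undoing it on failure."""
    steps = 0

    def go(ti, si):
        nonlocal steps
        steps += 1
        if ti == len(tokens):
            return si == len(text)
        kind, ch = tokens[ti]
        consumes = si < len(text) and text[si] == ch
        if kind == "opt":
            # Greedy: first try consuming the character, then try skipping.
            return (consumes and go(ti + 1, si + 1)) or go(ti + 1, si)
        return consumes and go(ti + 1, si + 1)

    return go(0, 0), steps


def stateset_match(tokens, text):
    """Automaton style: advance every reachable pattern position together."""
    steps = 0

    def closure(states):
        # Add positions reachable by skipping optional tokens.
        seen, stack = set(), list(states)
        while stack:
            ti = stack.pop()
            if ti in seen:
                continue
            seen.add(ti)
            if ti < len(tokens) and tokens[ti][0] == "opt":
                stack.append(ti + 1)
        return seen

    current = closure({0})
    for ch in text:
        advanced = set()
        for ti in current:
            steps += 1
            if ti < len(tokens) and tokens[ti][1] == ch:
                advanced.add(ti + 1)
        current = closure(advanced)
    return len(tokens) in current, steps


n = 12
tokens = [("opt", "a")] * n + [("lit", "a")] * n
bt_ok, bt_steps = backtrack_match(tokens, "a" * n)
ss_ok, ss_steps = stateset_match(tokens, "a" * n)
print(bt_ok, bt_steps, ss_ok, ss_steps)
```

Both strategies agree on the answer, but the backtracking step count roughly doubles with each increment of `n`, while the state-set count grows only polynomially. Supporting backreferences or lookaround would break the fixed-size state set that makes the second approach work.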
Features supported by Java but not by the Rust engine
Patterns that use any of the following will not compile in Comet’s Rust engine and must run through the codegen dispatcher (which uses Java regex) or on Spark:

- Backreferences such as `\1`, `\2`, or `\k<name>`. The Rust engine has no backtracking and cannot match a previously captured group.
- Lookaround, including lookahead (`(?=...)`, `(?!...)`) and lookbehind (`(?<=...)`, `(?<!...)`).
- Atomic groups (`(?>...)`).
- Possessive quantifiers (`*+`, `++`, `?+`, `{n,m}+`). Rust supports greedy and lazy quantifiers but not possessive.
- Embedded code, conditionals, and recursion such as `(?(cond)yes|no)` or `(?R)`. Rust accepts none of these. These are Perl/PCRE constructs that `java.util.regex` does not accept either, so Spark itself also rejects such patterns.
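
Before opting an expression in to the native engine, a quick textual scan can flag patterns that use the constructs listed above. The helper below is a hypothetical pre-check, not part of Comet. It works on the raw pattern text without parsing it, so it can over-report: for example, it would flag a `+` quantifier that follows an escaped `\+`, or `*+` inside a character class.

```python
import re

# Hypothetical pre-check: one detector per construct the Rust engine rejects.
_JAVA_ONLY = {
    "atomic group": re.compile(r"\(\?>"),
    "backreference": re.compile(r"\\[1-9]|\\k<"),
    "lookaround": re.compile(r"\(\?<?[=!]"),
    "possessive quantifier": re.compile(r"[*+?}]\+"),
}


def java_only_constructs(pattern: str) -> list:
    """Return the names of flagged constructs, sorted alphabetically."""
    return sorted(name for name, rx in _JAVA_ONLY.items() if rx.search(pattern))


for candidate in [r"(\w+)\s+\1", r"foo(?=bar)", r"a++b", r"[a-z]+\d{2,3}"]:
    print(candidate, "->", java_only_constructs(candidate))
```

An empty result only means none of these constructs were spotted; it says nothing about the behavioral differences between the engines for constructs both accept.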
Features that exist on both sides but behave differently
Even where both engines accept a construct, the matching behavior is not always the same.
- Unicode-aware character classes. In the Rust engine, `\d`, `\w`, `\s`, and `.` are Unicode-aware by default, so `\d` matches every digit codepoint defined by Unicode rather than only `0-9`. Java’s `\d`, `\w`, and `\s` match ASCII only by default and require the `UNICODE_CHARACTER_CLASS` flag (or `(?U)` inline) to switch to Unicode semantics. The same pattern can therefore match a different set of characters on each side.
- Line terminators. In multiline mode, Java treats `\r`, `\n`, `\r\n`, and a few additional Unicode line separators as line boundaries by default. The Rust engine treats only `\n` as a line boundary unless CRLF mode is enabled. `^`, `$`, and `.` (with `(?s)` off) all depend on this definition.
- Case-insensitive matching. Both engines support `(?i)`, but Java’s default is ASCII case folding while the Rust engine uses Unicode simple case folding when Unicode mode is on. Patterns that match characters outside ASCII can produce different results.
- POSIX character classes. The Rust engine supports `[[:alpha:]]` style POSIX classes inside bracket expressions but not Java’s `\p{Alpha}` shorthand. Java supports the `\p{Alpha}` form; it also compiles `[[:alpha:]]`, but reads it as an ordinary nested character class of the literal characters `:`, `a`, `l`, `p`, and `h` rather than as a POSIX class. Unicode property escapes (`\p{L}`, `\p{Greek}`, etc.) are supported by both engines but cover slightly different sets of properties.
- Octal and Unicode escapes. Java accepts `\0nnn` for octal and `\uXXXX` for a BMP codepoint. Rust uses `\x{...}` for arbitrary codepoints and does not accept Java’s bare `\uXXXX` form.
- Empty matches in `split`. Spark’s `StringSplit`, which is built on Java’s regex, includes leading empty strings produced by zero-width matches at the start of the input. The Rust engine’s `split` follows different rules, so split results can differ in edge cases involving empty matches even when the pattern itself is identical on both sides.
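
The ASCII-versus-Unicode distinction in the first and third items can be seen in miniature with Python's standard `re` module. This is an analogy only: Python's engine is neither Java's nor Rust's. Its default mode for `str` patterns is Unicode-aware, which resembles the Rust behavior described above, and its `re.ASCII` flag restricts `\d` and case folding to ASCII, which resembles Java's defaults.

```python
import re

ARABIC_INDIC = "\u0663\u0664"  # two Arabic-Indic digits, not in 0-9

# Unicode-aware \d (Rust-like default) versus ASCII-only \d (Java-like default).
unicode_digits = re.fullmatch(r"\d+", ARABIC_INDIC) is not None
ascii_digits = re.fullmatch(r"\d+", ARABIC_INDIC, re.ASCII) is not None

# Case-insensitive match of a lowercase e-acute against an uppercase E-acute.
unicode_fold = re.fullmatch("\u00e9", "\u00c9", re.IGNORECASE) is not None
ascii_fold = re.fullmatch("\u00e9", "\u00c9", re.IGNORECASE | re.ASCII) is not None

print("\\d on non-ASCII digits:", unicode_digits, "vs", ascii_digits)
print("(?i) on non-ASCII letters:", unicode_fold, "vs", ascii_fold)
```

The same pattern and input give opposite answers depending on the mode, which is the kind of divergence to expect between the two Comet paths on non-ASCII data.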
When the Rust engine is safe
For most ASCII-only, non-anchored patterns that use only literal characters, simple character classes, and
ordinary quantifiers, the two engines produce the same results. If you are confident your patterns fit this
shape and want to avoid the JNI overhead of the codegen dispatcher, switching to the Rust engine with
`allowIncompatible=true` is generally safe.
For anything that uses backreferences, lookaround, or relies on Java’s specific Unicode or line-handling defaults, stay on the codegen dispatcher, which uses Java regex.