Regular Expressions

Comet evaluates Spark regular-expression expressions (rlike, regexp_replace, split, regexp_extract, regexp_extract_all, regexp_instr) two ways:

  • Codegen dispatcher (default): Spark’s own doGenCode for the expression runs inside Comet’s Arrow-direct codegen dispatcher (the same dispatcher used by Comet’s ScalaUDF codegen path). This is 100% compatible with Spark, at the cost of one JNI round-trip per batch. It is controlled by spark.comet.exec.scalaUDF.codegen.enabled (default true); if the dispatcher is disabled, regex expressions fall back to Spark.

  • Native (Rust) engine: the Rust regex crate, run natively with no JNI overhead. It is faster but has different semantics from Java regex (see below), so it is opt-in per expression via that expression’s allowIncompatible flag. rlike, regexp_replace, split, regexp_extract, and regexp_extract_all have a native implementation; regexp_instr does not and always runs through the codegen dispatcher.

| SQL                | Native (Rust) opt-in config                                |
|--------------------|------------------------------------------------------------|
| rlike              | spark.comet.expression.RLike.allowIncompatible             |
| regexp_replace     | spark.comet.expression.RegExpReplace.allowIncompatible     |
| regexp_extract     | spark.comet.expression.RegExpExtract.allowIncompatible     |
| regexp_extract_all | spark.comet.expression.RegExpExtractAll.allowIncompatible  |
| split              | spark.comet.expression.StringSplit.allowIncompatible       |

When the native path is opted in but a case has no native implementation (for example a non-scalar rlike pattern, regexp_replace with a non-1 offset, or regexp_extract with a non-literal pattern or idx), Comet routes that case through the codegen dispatcher.
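As a rough illustration of how the opt-in keys in the table are formed, the following Java sketch builds the allowIncompatible settings for a chosen set of expressions and prints them as --conf arguments. The class CometRegexOptIn and its two helper methods are hypothetical names used only for this example; the key prefix, the expression class names, and the allowIncompatible suffix come from the table above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; it is not part of Comet.
class CometRegexOptIn {

    // Builds the per-expression opt-in key, e.g. "RLike" becomes
    // "spark.comet.expression.RLike.allowIncompatible".
    static String allowIncompatibleKey(String expressionClass) {
        return "spark.comet.expression." + expressionClass + ".allowIncompatible";
    }

    // Returns opt-in settings for the given expression classes, in call order.
    static Map<String, String> nativeOptIn(String... expressionClasses) {
        Map<String, String> conf = new LinkedHashMap<>();
        for (String cls : expressionClasses) {
            conf.put(allowIncompatibleKey(cls), "true");
        }
        return conf;
    }

    public static void main(String[] args) {
        // Opt rlike and split into the native Rust engine; every other regex
        // expression keeps using the default codegen dispatcher.
        nativeOptIn("RLike", "StringSplit")
            .forEach((key, value) -> System.out.println("--conf " + key + "=" + value));
    }
}
```

The printed arguments can be appended to a spark-submit command line, or the same key/value pairs can be set on the session configuration before the query runs.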

Disabling Comet for individual regex expressions

Each regex expression has a per-class spark.comet.expression.<ClassName>.enabled flag (default true). Setting it to false disables Comet’s serde for that expression and forces a Spark fallback. This is useful for narrowing a regression or comparing performance on a single operator without touching the allowIncompatible opt-in flags or the codegen dispatcher setting:

| Expression         | Config                                                |
|--------------------|-------------------------------------------------------|
| rlike              | spark.comet.expression.RLike.enabled=false            |
| regexp_extract     | spark.comet.expression.RegExpExtract.enabled=false    |
| regexp_extract_all | spark.comet.expression.RegExpExtractAll.enabled=false |
| regexp_instr       | spark.comet.expression.RegExpInStr.enabled=false      |
| regexp_replace     | spark.comet.expression.RegExpReplace.enabled=false    |
| split              | spark.comet.expression.StringSplit.enabled=false      |

Choosing an engine

|                  | Rust engine | Codegen dispatcher (default) |
|------------------|-------------|------------------------------|
| Compatibility    | Differs from Java regex (see below) | 100% compatible with Spark |
| Feature coverage | rlike, regexp_replace, split, regexp_extract, regexp_extract_all natively; regexp_instr via fallthrough | All regexp expressions (rlike, regexp_extract, regexp_extract_all, regexp_instr, regexp_replace, split) |
| Performance      | Fully native, no JNI overhead | One JNI round-trip per batch (Arrow vectors stay columnar) |
| Pattern support  | Linear-time subset only | All Java regex features (backreferences, lookaround, etc.) |

The Rust engine is faster but cannot match Java regex semantics for every pattern. Opting in per expression (for example spark.comet.expression.RLike.allowIncompatible=true) declares acceptance of those differences.

The codegen dispatcher is the default and is enabled by spark.comet.exec.scalaUDF.codegen.enabled, so it can be disabled globally to fall back to Spark for the regex family.

Why the engines differ

Java’s java.util.regex is a backtracking engine in the Perl/PCRE family. It supports the full range of features that style of engine provides, including some whose worst-case running time grows exponentially with the input.

Rust’s regex crate is a finite-automaton engine in the RE2 family. It deliberately omits features that cannot be implemented with a guarantee of linear-time matching. In exchange, every pattern it does accept runs in time linear in the size of the input. This is the same trade-off RE2, Go’s regexp, and several other engines make.

The practical consequence is that Java accepts many patterns that the Rust engine rejects, and several constructs that look the same in source have different semantics on the two sides.

Features supported by Java but not by the Rust engine

Patterns that use any of the following will not compile in Comet’s Rust engine and must run through the codegen dispatcher (or fall back to Spark):

  • Backreferences such as \1, \2, or \k<name>. The Rust engine has no backtracking and cannot match a previously captured group.

  • Lookaround, including lookahead ((?=...), (?!...)) and lookbehind ((?<=...), (?<!...)).

  • Atomic groups ((?>...)).

  • Possessive quantifiers (*+, ++, ?+, {n,m}+). Rust supports greedy and lazy quantifiers but not possessive.

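  • PCRE-only constructs such as conditionals ((?(cond)yes|no)), recursion, and embedded code. The Rust engine rejects these, but so does java.util.regex, so such patterns fail on either path rather than being a Java-only feature.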
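To make the list above concrete, here is a small Java sketch showing java.util.regex accepting backreferences, lookaround, possessive quantifiers, and atomic groups. It only demonstrates the Java side; per the list above, the same patterns are the ones the Rust engine refuses to compile. The class name JavaOnlyRegexFeatures and the find helper are made up for this example.

```java
import java.util.regex.Pattern;

// Illustration only: exercises java.util.regex constructs that the list
// above says the Rust engine rejects.
class JavaOnlyRegexFeatures {

    // True if the pattern is found anywhere in the input.
    static boolean find(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Backreference: \1 must repeat whatever group 1 captured ("ll").
        System.out.println(find("(\\w)\\1", "hello"));        // true
        // Lookahead: "foo" only when followed by "bar".
        System.out.println(find("foo(?=bar)", "foobar"));     // true
        // Lookbehind: digits only when preceded by '$'.
        System.out.println(find("(?<=\\$)\\d+", "$42"));      // true
        // Possessive quantifier: a*+ keeps every 'a' it consumed, so the
        // trailing 'a' can never match.
        System.out.println(find("a*+a", "aaa"));              // false
        // Atomic group: same no-give-back behavior as the possessive form.
        System.out.println(find("(?>a+)a", "aaa"));           // false
    }
}
```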

Features that exist on both sides but behave differently

Even where both engines accept a construct, the matching behavior is not always the same.

  • Unicode-aware character classes. In the Rust engine, \d, \w, and \s are Unicode-aware by default, so \d matches every digit codepoint defined by Unicode rather than only 0-9. Java’s defaults for these classes match ASCII only and require the UNICODE_CHARACTER_CLASS flag (or (?U) inline) to switch to Unicode semantics. The same pattern can therefore match a different set of characters on each side.

  • Line terminators. In multiline mode, Java treats \r, \n, \r\n, and a few additional Unicode line separators as line boundaries by default. The Rust engine treats only \n as a line boundary unless CRLF mode is enabled. ^, $, and . (with (?s) off) all depend on this definition.

  • Case-insensitive matching. Both engines support (?i), but Java’s default is ASCII-only case folding (Unicode folding requires the UNICODE_CASE flag, or (?u) inline), while the Rust engine uses Unicode simple case folding whenever Unicode mode is on. Patterns that match characters outside ASCII can produce different results.

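  • POSIX character classes. The Rust engine supports [[:alpha:]]-style POSIX classes inside bracket expressions. Java does not: it parses [[:alpha:]] as an ordinary nested character class and instead spells POSIX classes as \p{Alpha}, \p{Digit}, and so on (ASCII-only by default). The Rust engine does not give those names Java’s ASCII-only meaning. Unicode property escapes such as \p{L} are supported by both engines, but the two differ in which property names and spellings they accept (for example, Java writes a script as \p{IsGreek}, the Rust engine as \p{Greek}).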

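  • Octal and Unicode escapes. Java accepts \0n, \0nn, and \0mnn for octal and \uXXXX for a UTF-16 code unit. The Rust engine does not enable octal escapes by default, writes arbitrary codepoints as \x{...} or \u{...}, and rejects surrogate values, so a Java pattern that spells a supplementary character as a pair of \uXXXX surrogates does not carry over.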

  • Empty matches in split. Spark’s StringSplit, which is built on Java’s regex, includes leading empty strings produced by zero-width matches at the start of the input. The Rust engine’s split follows different rules, so split results can differ in edge cases involving empty matches even when the pattern itself is identical on both sides.
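The Java defaults described in the list above can be checked directly. The sketch below runs only on java.util.regex: each pair of lines shows Java’s default behavior and then the inline flag that changes it. The class name JavaRegexDefaults and the find helper are hypothetical. The remark that UNIX_LINES is closer to the Rust default is an inference from the list above, not a statement about Comet’s implementation.

```java
import java.util.regex.Pattern;

// Illustration only: Java's regex defaults versus its opt-in inline flags.
class JavaRegexDefaults {

    // True if the pattern is found anywhere in the input.
    static boolean find(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // ARABIC-INDIC DIGIT THREE: a Unicode decimal digit outside ASCII.
        String arabicIndicThree = "\u0663";
        System.out.println(find("\\d", arabicIndicThree));      // false: ASCII-only by default
        System.out.println(find("(?U)\\d", arabicIndicThree));  // true: UNICODE_CHARACTER_CLASS

        // Lowercase e-acute in the pattern, uppercase E-acute in the input.
        System.out.println(find("(?i)\u00e9", "\u00c9"));       // false: ASCII-only case folding
        System.out.println(find("(?iu)\u00e9", "\u00c9"));      // true: UNICODE_CASE

        // A bare carriage return counts as a line terminator in Java's
        // multiline mode, unless UNIX_LINES (d) limits terminators to a
        // line feed, which is closer to the Rust default described above.
        System.out.println(find("(?m)^b$", "a\rb\rc"));         // true
        System.out.println(find("(?md)^b$", "a\rb\rc"));        // false
    }
}
```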

When the Rust engine is safe

For most ASCII-only, non-anchored patterns that use only literal characters, simple character classes, and ordinary quantifiers, the two engines produce the same results. If you are confident your patterns fit this shape and want to avoid the JNI overhead of the codegen dispatcher, opting in to the Rust engine with the expression’s allowIncompatible=true flag is generally safe.

For anything that uses backreferences, lookaround, or relies on Java’s specific Unicode or line-handling defaults, stay on the default codegen dispatcher.
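One way to apply this guidance is a pre-flight scan of the pattern text before opting an expression in. The sketch below is a hypothetical helper, not something Comet provides: RustEnginePreflight and its concerns method are invented for this example. It inspects only the pattern’s source text, so it is a conservative heuristic. Escaped characters can cause false positives, and it does not cover every difference described above (anchors and split behavior, for instance).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical pre-flight check, not part of Comet. It scans the pattern's
// source text only, so treat a non-empty result as "look closer", not proof.
class RustEnginePreflight {
    // Backslash followed by 1-9, or a named backreference.
    private static final Pattern BACKREFERENCE = Pattern.compile("\\\\[1-9]|\\\\k<");
    // Lookahead and lookbehind openers, positive or negative.
    private static final Pattern LOOKAROUND = Pattern.compile("\\(\\?<?[=!]");
    // Atomic group opener.
    private static final Pattern ATOMIC_GROUP = Pattern.compile("\\(\\?>");
    // A quantifier immediately followed by '+'.
    private static final Pattern POSSESSIVE = Pattern.compile("[*+?}]\\+");
    // Shorthand classes and word boundaries whose Unicode handling differs.
    private static final Pattern SHORTHAND = Pattern.compile("\\\\[dDwWsSbB]");

    // Returns the reasons a pattern may behave differently, or an empty list.
    static List<String> concerns(String regex) {
        List<String> found = new ArrayList<>();
        if (BACKREFERENCE.matcher(regex).find()) found.add("backreference");
        if (LOOKAROUND.matcher(regex).find()) found.add("lookaround");
        if (ATOMIC_GROUP.matcher(regex).find()) found.add("atomic group");
        if (POSSESSIVE.matcher(regex).find()) found.add("possessive quantifier");
        if (SHORTHAND.matcher(regex).find()) found.add("shorthand class or word boundary");
        if (!regex.chars().allMatch(c -> c < 128)) found.add("non-ASCII character");
        return found;
    }

    public static void main(String[] args) {
        System.out.println(concerns("[a-z]+@[a-z]+\\.com"));
        System.out.println(concerns("(\\w)\\1"));
        System.out.println(concerns("foo(?=bar)"));
    }
}
```

An empty result suggests the pattern fits the safe shape described above; anything else is a reason to leave that expression on the default codegen dispatcher.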