Spark-Compatible Functions
DataFusion ships Spark-compatible versions of a wide set of functions
(string, math, datetime, hash, array, aggregate) through the upstream
datafusion-spark crate. datafusion-python exposes these under
datafusion.functions.spark for use from the DataFrame API, and via
enable_spark_functions() for use from
SQL.
Why a Separate Namespace?
Several Spark functions share names with DataFusion built-ins but differ in semantics. The most common divergences:
- concat propagates NULL. concat('a', NULL, 'b') returns NULL under Spark semantics, whereas the DataFusion default returns 'ab'.
- substring is 1-indexed and supports negative positions counting from the end of the string.
- round uses HALF_UP rounding mode (round(2.5, 0) == 3).
- Numeric functions (floor, ceil, mod) follow Spark's edge-case handling for negative values and decimals.
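The concat and round differences listed above can be modeled in plain Python without DataFusion. The sketch below is only an illustration of the semantics, not how either engine implements them; the helper names (spark_like_concat, default_like_concat, half_up_round) are hypothetical and exist only for this example.

```python
from decimal import Decimal, ROUND_HALF_UP


def spark_like_concat(*parts):
    """Hypothetical model: any None argument makes the whole result None."""
    if any(p is None for p in parts):
        return None
    return "".join(parts)


def default_like_concat(*parts):
    """Hypothetical model: None arguments are skipped, as in the DataFusion default."""
    return "".join(p for p in parts if p is not None)


def half_up_round(value: str) -> int:
    """Round a decimal string to an integer with HALF_UP (ties go away from zero)."""
    return int(Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_UP))


concat_spark = spark_like_concat("a", None, "b")      # None
concat_default = default_like_concat("a", None, "b")  # 'ab'
rounded_half_up = half_up_round("2.5")                # 3
rounded_builtin = round(2.5)                          # 2, Python's round() sends ties to even
```

The contrast with Python's own round() shows why the rounding mode matters: the same input produces 3 under HALF_UP and 2 under round-half-to-even.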
Calling enable_spark_functions() affects SQL only. In the DataFrame API, the module you import a function from determines which implementation runs.
DataFrame API
Import spark and use it like any other functions module. The Spark
functions can go anywhere you’d put a DataFusion expression — inside
select, filter, with_column, aggregate, and so on.
from datafusion import SessionContext, col, lit
from datafusion.functions import spark
ctx = SessionContext()
df = ctx.from_pydict({"s": ["hello", "world"]})
# SHA-256 hash with Spark semantics
df.select(spark.sha2(col("s"), lit(256)).alias("h")).show()
# 1-indexed substring
df.select(spark.substring(col("s"), lit(1), lit(3)).alias("p")).show()
SQL
To use Spark functions in SQL queries, call
enable_spark_functions() on the context.
This registers every Spark UDF/UDAF/UDWF, overriding any DataFusion built-in
of the same name.
from datafusion import SessionContext
ctx = SessionContext()
ctx.enable_spark_functions()
ctx.sql("SELECT sha2('hello', 256)").show()
ctx.sql("SELECT concat('a', NULL, 'b')").show() # -> NULL, not 'ab'
The override applies for the lifetime of the session. To call DataFusion’s
built-in versions afterwards, create a fresh SessionContext.
Function Reference
The full, up-to-date list of available Spark functions — with signatures
and per-function docstrings — lives in the
datafusion.functions.spark API reference.