Spark-Compatible Functions

DataFusion ships Spark-compatible versions of a wide set of functions (string, math, datetime, hash, array, aggregate) through the upstream datafusion-spark crate. datafusion-python exposes these under datafusion.functions.spark for use from the DataFrame API, and via enable_spark_functions() for use from SQL.

Why a Separate Namespace?

Several Spark functions share names with DataFusion built-ins but differ in semantics. The most common divergences:

  • concat propagates NULL. concat('a', NULL, 'b') returns NULL under Spark semantics, whereas the DataFusion default returns 'ab'.

  • substring is 1-indexed and supports negative positions counting from the end of the string.

  • round uses HALF_UP rounding mode (round(2.5, 0) == 3).

  • Numeric functions (floor, ceil, mod) follow Spark’s edge-case handling for negative values and decimals.
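The first three divergences above can be modeled in plain Python. The sketch below is only an illustration of the described semantics, not the datafusion-spark implementation: the helper names (spark_concat, null_skipping_concat, spark_substring, half_up_round) are hypothetical, None stands in for SQL NULL, and the handling of position 0 and out-of-range positions is an assumption.

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import Optional


def spark_concat(*parts: Optional[str]) -> Optional[str]:
    """NULL-propagating concat: any None argument makes the result None."""
    if any(p is None for p in parts):
        return None
    return "".join(parts)


def null_skipping_concat(*parts: Optional[str]) -> str:
    """Contrast case: None arguments are skipped instead of propagated."""
    return "".join(p for p in parts if p is not None)


def spark_substring(s: str, pos: int, length: int) -> str:
    """1-indexed substring; a negative pos counts back from the end of s."""
    if pos > 0:
        start = pos - 1
    elif pos < 0:
        start = len(s) + pos
    else:
        start = 0  # assumption: position 0 behaves like position 1
    end = start + length
    if end <= start:
        return ""
    # Assumption: a window that starts before the string is clamped to index 0.
    return s[max(start, 0):max(end, 0)]


def half_up_round(value: str, scale: int = 0) -> Decimal:
    """Round a decimal string to `scale` places; ties round away from zero."""
    quantum = Decimal(1).scaleb(-scale)  # scale=2 -> Decimal('0.01')
    return Decimal(value).quantize(quantum, rounding=ROUND_HALF_UP)


print(spark_concat("a", None, "b"))          # None
print(null_skipping_concat("a", None, "b"))  # ab
print(spark_substring("hello", 1, 3))        # hel
print(spark_substring("Spark SQL", -3, 3))   # SQL
print(half_up_round("2.5"))                  # 3
print(round(2.5))                            # 2 (Python's built-in rounds ties to even)
```

half_up_round takes its input as a string so that the tie is represented exactly; a binary float such as 2.675 is stored slightly below the tie and would round differently.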

Calling enable_spark_functions() affects only SQL. In the DataFrame API, the module you import a function from determines which implementation runs.

DataFrame API

Import spark and use it like any other functions module. Spark functions can go anywhere you’d put a DataFusion expression: inside select, filter, with_column, aggregate, and so on.

from datafusion import SessionContext, col, lit
from datafusion.functions import spark

ctx = SessionContext()
df = ctx.from_pydict({"s": ["hello", "world"]})

# SHA-256 hash with Spark semantics
df.select(spark.sha2(col("s"), lit(256)).alias("h")).show()

# 1-indexed substring
df.select(spark.substring(col("s"), lit(1), lit(3)).alias("p")).show()

SQL

To use Spark functions in SQL queries, call enable_spark_functions() on the context. This registers every Spark scalar, aggregate, and window function (UDF, UDAF, UDWF), overriding any DataFusion built-in of the same name.

from datafusion import SessionContext

ctx = SessionContext()
ctx.enable_spark_functions()

ctx.sql("SELECT sha2('hello', 256)").show()
ctx.sql("SELECT concat('a', NULL, 'b')").show()   # -> NULL, not 'ab'

The override applies for the lifetime of the session. To call DataFusion’s built-in versions afterwards, create a fresh SessionContext.

Function Reference

The full, up-to-date list of available Spark functions, with signatures and per-function docstrings, lives in the datafusion.functions.spark API reference.