Spark-Compatible Functions

Ballista provides an optional spark-compat Cargo feature that enables Spark-compatible scalar, aggregate, and window functions from the datafusion-spark crate.

Enabling the Feature

The spark-compat feature is disabled by default and must be enabled explicitly at build time.

Building from Source

To build Ballista components with Spark-compatible functions:

# Build all components with spark-compat feature
cargo build --features spark-compat --release

# Build scheduler only
cargo build -p ballista-scheduler --features spark-compat --release

# Build executor only
cargo build -p ballista-executor --features spark-compat --release

# Build CLI with spark-compat
cargo build -p ballista-cli --features spark-compat --release

For more installation options, see Installing with Cargo.

What’s Included

When the spark-compat feature is enabled, Ballista’s function registry automatically includes the additional functions provided by the datafusion-spark crate.

Note: For a comprehensive list of available functions, refer to the datafusion-spark crate documentation. These functions are provided in addition to DataFusion’s default functions.
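As a rough illustration of how a compile-time Cargo feature can widen a function registry, the sketch below may help. It is not Ballista's actual code: the registered_functions helper and the name lists are hypothetical, chosen from functions mentioned on this page, and the real registry stores function implementations rather than names.

```rust
// Hypothetical sketch of feature-gated registration; not Ballista's implementation.
// `registered_functions` and both name lists are illustrative assumptions.
fn registered_functions(spark_compat: bool) -> Vec<&'static str> {
    // Functions that are always present (DataFusion defaults in this sketch).
    let mut names = vec!["upper", "length", "exp"];
    if spark_compat {
        // Extra functions that only appear when the feature is compiled in.
        names.extend(["sha1", "expm1", "conv"]);
    }
    names
}

fn main() {
    // `cfg!` is resolved at compile time: it is true only when the crate
    // is built with `--features spark-compat` (declared under [features]).
    let names = registered_functions(cfg!(feature = "spark-compat"));
    println!("{names:?}");
}
```

The point of the sketch is that the decision is made at build time, so a binary built without the feature has no way to expose the extra functions at runtime.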

Scalar Functions

Spark-compatible scalar functions provide additional string, mathematical, and cryptographic operations.

Aggregate Functions

Spark-compatible aggregate functions extend DataFusion’s built-in aggregations with additional statistical and analytical functions.

Window Functions

Spark-compatible window functions provide additional analytical capabilities for windowed operations.

Usage Examples

Once the spark-compat feature is enabled at build time, the functions are automatically available in SQL queries:

Example 1: Using SHA-1 Hash Function

SELECT sha1('Ballista') AS hash_value;

Output:

+------------------------------------------+
| hash_value                               |
+------------------------------------------+
| 8b8e1f0e55f8f0e3c7a8... (hex string)     |
+------------------------------------------+

Example 2: Using expm1 for Precision

SELECT
    expm1(0.001) AS precise_value,
    exp(0.001) - 1 AS standard_value;

The expm1 function is more accurate than computing exp(x) - 1 directly when x is close to zero. In that range exp(x) is very close to 1, so the subtraction cancels most of the significant digits.
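The cancellation effect can be reproduced outside of SQL. The following minimal sketch uses Rust's standard-library f64::exp_m1 as a stand-in for the SQL expm1 function (an assumption; it does not call Ballista). The helpers naive_expm1 and taylor_reference are hypothetical names introduced for the comparison, and the input 1e-10 is chosen to make the difference more visible than with 0.001.

```rust
// Direct formula: exp(x) - 1, which loses digits when x is near zero.
fn naive_expm1(x: f64) -> f64 {
    x.exp() - 1.0
}

// Reference value from the first terms of the Taylor series
// x + x^2/2 + x^3/6; the omitted terms are negligible for tiny x.
fn taylor_reference(x: f64) -> f64 {
    x + x * x / 2.0 + x * x * x / 6.0
}

fn main() {
    let x: f64 = 1e-10;
    let reference = taylor_reference(x);

    // Absolute error of each approach relative to the series value.
    let naive_err = (naive_expm1(x) - reference).abs();
    let stable_err = (x.exp_m1() - reference).abs();

    println!("exp(x) - 1 : {:e} (error {:e})", naive_expm1(x), naive_err);
    println!("exp_m1(x)  : {:e} (error {:e})", x.exp_m1(), stable_err);
}
```

For inputs this small, the direct formula typically keeps only about half of the available significant digits, while the dedicated function stays close to the series value.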

Example 3: Combining with DataFusion Functions

Spark-compatible functions work alongside DataFusion’s built-in functions:

SELECT
    name,
    upper(name) AS name_upper,           -- DataFusion function
    sha1(name) AS name_hash,             -- Spark-compat function
    length(name) AS name_length          -- DataFusion function
FROM users;

Use Cases

The spark-compat feature is useful when:

  • Migrating from Spark: You want familiar function names and behaviors to ease the transition

  • Cross-Platform Queries: You write queries that need to use similar functions in both Spark and Ballista environments

  • Specific Function Needs: You require particular Spark-style functions (such as sha1 or conv) that aren’t in DataFusion’s default set

  • Team Familiarity: Your team is more familiar with Spark’s function library

See Also