datafusion

DataFusion Python package.

This is a Python library that binds to DataFusion, the Apache Arrow in-memory query engine. See https://datafusion.apache.org/python for more information.

Submodules

Attributes

`DFSchema`
`col`
`column`
`udaf`
`udf`
`udtf`
`udwf`

Classes

`Accumulator`	Defines how an `AggregateUDF` accumulates values.
`AggregateUDF`	Class for performing aggregate user-defined functions (UDAF).
`Catalog`	DataFusion data catalog.
`Database`	See Schema.
`ExecutionPlan`	Represent nodes in the DataFusion Physical Plan.
`Expr`	Expression object.
`LogicalPlan`	Logical Plan.
`ParquetColumnOptions`	Parquet options for individual columns.
`ParquetWriterOptions`	Advanced parquet writer options.
`RecordBatch`	This class is essentially a wrapper for `pa.RecordBatch`.
`RecordBatchStream`	This class represents a stream of record batches.
`RuntimeEnvBuilder`	Runtime configuration options.
`SQLOptions`	Options to be used when performing SQL queries.
`ScalarUDF`	Class for performing scalar user-defined functions (UDF).
`SessionConfig`	Session configuration options.
`Table`	DataFusion table.
`TableFunction`	Class for performing user-defined table functions (UDTF).
`WindowFrame`	Defines a window frame for performing window operations.
`WindowUDF`	Class for performing window user-defined functions (UDF).

Functions

`configure_formatter`(→ None)	Configure the global DataFrame HTML formatter.
`lit`(→ expr.Expr)	Create a literal expression.
`literal`(→ expr.Expr)	Create a literal expression.
`read_avro`(→ datafusion.dataframe.DataFrame)	Create a `DataFrame` for reading an Avro data source.
`read_csv`(→ datafusion.dataframe.DataFrame)	Read a CSV data source.
`read_json`(→ datafusion.dataframe.DataFrame)	Read a line-delimited JSON data source.
`read_parquet`(→ datafusion.dataframe.DataFrame)	Read a Parquet source into a `DataFrame`.

Package Contents

class datafusion.Accumulator

Defines how an AggregateUDF accumulates values.

abstract evaluate() → pyarrow.Scalar: Return the resultant value.

abstract merge(states: list[pyarrow.Array]) → None: Merge a set of states.

abstract state() → list[pyarrow.Scalar]: Return the current state.

abstract update(*values: pyarrow.Array) → None: Evaluate an array of values and update state.

class datafusion.AggregateUDF(name: str, accumulator: Callable[[], Accumulator], input_types: list[pyarrow.DataType], return_type: pyarrow.DataType, state_type: list[pyarrow.DataType], volatility: Volatility | str)

Class for performing aggregate user-defined functions (UDAF).

Aggregate UDFs operate on a group of rows and return a single value. See also ScalarUDF for operating on a row by row basis.

Instantiate a user-defined aggregate function (UDAF).

See udaf() for a convenience function and argument descriptions.

__call__(*args: datafusion.expr.Expr) → datafusion.expr.Expr

Execute the UDAF.

This function is not typically called by an end user. These calls will occur during the evaluation of the dataframe.

__repr__() → str: Print a string representation of the Aggregate UDF.

static from_pycapsule(func: AggregateUDFExportable) → AggregateUDF

Create an Aggregate UDF from an AggregateUDF PyCapsule object.

This function will instantiate an Aggregate UDF that uses a DataFusion AggregateUDF that is exported via the FFI bindings.

static udaf(input_types: pyarrow.DataType | list[pyarrow.DataType], return_type: pyarrow.DataType, state_type: list[pyarrow.DataType], volatility: Volatility | str, name: str | None = None) → Callable[..., AggregateUDF]

static udaf(accum: Callable[[], Accumulator], input_types: pyarrow.DataType | list[pyarrow.DataType], return_type: pyarrow.DataType, state_type: list[pyarrow.DataType], volatility: Volatility | str, name: str | None = None) → AggregateUDF

Create a new User-Defined Aggregate Function (UDAF).

This function allows you to define an aggregate function that can be used in data aggregation or window function calls.

Usage:

As a function: udaf(accum, input_types, return_type, state_type, volatility, name).
As a decorator: @udaf(input_types, return_type, state_type, volatility, name). When using udaf as a decorator, do not pass accum explicitly.

Function example:

If your Accumulator can be instantiated with no arguments, you can simply pass its type as accum. If you need to pass additional arguments to its constructor, you can define a lambda or a factory method. At runtime, the Accumulator will be constructed for every instance in which this UDAF is used. The following examples are all valid:

import pyarrow as pa
import pyarrow.compute as pc
from datafusion import Accumulator, udaf

class Summarize(Accumulator):
    def __init__(self, bias: float = 0.0):
        self._sum = pa.scalar(bias)

    def state(self) -> list[pa.Scalar]:
        return [self._sum]

    def update(self, values: pa.Array) -> None:
        self._sum = pa.scalar(self._sum.as_py() + pc.sum(values).as_py())

    def merge(self, states: list[pa.Array]) -> None:
        self._sum = pa.scalar(self._sum.as_py() + pc.sum(states[0]).as_py())

    def evaluate(self) -> pa.Scalar:
        return self._sum

def sum_bias_10() -> Summarize:
    return Summarize(10.0)

udaf1 = udaf(Summarize, pa.float64(), pa.float64(), [pa.float64()],
    "immutable")
udaf2 = udaf(sum_bias_10, pa.float64(), pa.float64(), [pa.float64()],
    "immutable")
udaf3 = udaf(lambda: Summarize(20.0), pa.float64(), pa.float64(),
    [pa.float64()], "immutable")

Decorator example:

@udaf(pa.float64(), pa.float64(), [pa.float64()], "immutable")
def udaf4() -> Summarize:
    return Summarize(10.0)

Parameters:

accum – The accumulator Python function. Only needed when calling as a function; omit it when using udaf as a decorator. If you have a Rust-backed AggregateUDF within a PyCapsule, you can pass it as this parameter and ignore the rest, which will be determined directly from the underlying function. See the online documentation for more information.
input_types – The data types of the arguments to accum.
return_type – The data type of the return value.
state_type – The data types of the intermediate accumulation.
volatility – See Volatility for allowed values.
name – A descriptive name for the function.

Returns:

A user-defined aggregate function, which can be used in either data aggregation or window function calls.

_udaf

class datafusion.Catalog(catalog: datafusion._internal.catalog.RawCatalog)

DataFusion data catalog.

This constructor is not typically called by the end user.

__repr__() → str: Print a string representation of the catalog.

database(name: str = 'public') → Schema: Returns the database with the given name from this catalog.

deregister_schema(name: str, cascade: bool = True) → Schema | None: Deregister a schema from this catalog.

static memory_catalog() → Catalog: Create an in-memory catalog provider.

names() → set[str]: This is an alias for schema_names.

register_schema(name, schema) → Schema | None: Register a schema with this catalog.

schema(name: str = 'public') → Schema: Returns the schema with the given name from this catalog.

schema_names() → set[str]: Returns the names of the schemas in this catalog.

catalog

class datafusion.Database(schema: datafusion._internal.catalog.RawSchema)

Bases: Schema

See Schema.

This constructor is not typically called by the end user.

class datafusion.ExecutionPlan(plan: datafusion._internal.ExecutionPlan)

Represent nodes in the DataFusion Physical Plan.

This constructor should not be called by the end user.

__repr__() → str: Print a string representation of the physical plan.

children() → list[ExecutionPlan]

Get the list of child ExecutionPlan nodes that act as inputs to this plan.

The returned list will be empty for leaf nodes such as scans, will contain a single value for unary nodes, or two values for binary nodes (such as joins).

display() → str: Print the physical plan.

display_indent() → str: Print an indented form of the physical plan.

static from_proto(ctx: datafusion.context.SessionContext, data: bytes) → ExecutionPlan

Create an ExecutionPlan from protobuf bytes.

Tables created in memory from record batches are currently not supported.

to_proto() → bytes

Convert an ExecutionPlan into protobuf bytes.

Tables created in memory from record batches are currently not supported.

_raw_plan

property partition_count: int: Returns the number of partitions in the physical plan.

class datafusion.Expr(expr: datafusion._internal.expr.RawExpr)

Expression object.

Expressions are one of the core concepts in DataFusion. See Expressions in the online documentation for more information.

This constructor should not be called by the end user.

__add__(rhs: Any) → Expr

Addition operator.