datafusion¶
DataFusion python package.
This is a Python library that binds to Apache Arrow in-memory query engine DataFusion. See https://datafusion.apache.org/python for more information.
Attributes¶
DFSchema
Classes¶
Accumulator – Defines how an AggregateUDF accumulates values.
AggregateUDF – Class for performing aggregate user defined functions (UDAF).
Catalog – DataFusion data catalog.
Database – DataFusion Database.
Expr – Expression object.
RecordBatch – This class is essentially a wrapper for pyarrow.RecordBatch.
RecordBatchStream – This class represents a stream of record batches.
RuntimeConfig – Runtime configuration options.
SQLOptions – Options to be used when performing SQL queries.
ScalarUDF – Class for performing scalar user defined functions (UDF).
SessionConfig – Session configuration options.
Table – DataFusion table.
WindowFrame – Defines a window frame for performing window operations.
Functions¶
col – Create a column expression.
column – Create a column expression.
lit – Create a literal expression.
literal – Create a literal expression.
Package Contents¶
- class datafusion.Accumulator¶
Defines how an AggregateUDF accumulates values.
- abstract evaluate() pyarrow.Scalar ¶
Return the resultant value.
- abstract merge(states: List[pyarrow.Array]) None ¶
Merge a set of states.
- abstract state() List[pyarrow.Scalar] ¶
Return the current state.
- abstract update(values: pyarrow.Array) None ¶
Evaluate an array of values and update state.
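A minimal sketch of a custom accumulator that sums float64 values, following the abstract methods above. The class name and types are illustrative; pyarrow is assumed to be installed.
import pyarrow as pa
import pyarrow.compute as pc
from datafusion import Accumulator

class Summation(Accumulator):
    """Illustrative accumulator that keeps a running float64 sum."""

    def __init__(self) -> None:
        self._sum = pa.scalar(0.0)

    def update(self, values: pa.Array) -> None:
        # Fold a batch of input values into the running total.
        self._sum = pa.scalar(self._sum.as_py() + pc.sum(values).as_py())

    def merge(self, states: list[pa.Array]) -> None:
        # Combine partial sums produced by other partitions.
        self._sum = pa.scalar(self._sum.as_py() + pc.sum(states[0]).as_py())

    def state(self) -> list[pa.Scalar]:
        return [self._sum]

    def evaluate(self) -> pa.Scalar:
        return self._sum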
- class datafusion.AggregateUDF(name: str | None, accumulator: _A, input_types: list[pyarrow.DataType], return_type: _R, state_type: list[pyarrow.DataType], volatility: Volatility | str)¶
Class for performing aggregate user defined functions (UDAF).
Aggregate UDFs operate on a group of rows and return a single value. See also ScalarUDF for operating on a row by row basis.
Instantiate a user defined aggregate function (UDAF). See udaf() for a convenience function and argument descriptions.
- __call__(*args: datafusion.expr.Expr) datafusion.expr.Expr ¶
Execute the UDAF.
This function is not typically called by an end user. These calls will occur during the evaluation of the dataframe.
- static udaf(accum: _A, input_types: list[pyarrow.DataType], return_type: _R, state_type: list[pyarrow.DataType], volatility: Volatility | str, name: str | None = None) AggregateUDF ¶
Create a new User Defined Aggregate Function.
The accumulator function must be callable and implement Accumulator.
- Parameters:
accum – The accumulator python function.
input_types – The data types of the arguments to accum.
return_type – The data type of the return value.
state_type – The data types of the intermediate accumulation.
volatility – See Volatility for allowed values.
name – A descriptive name for the function.
- Returns:
A user defined aggregate function, which can be used in either data aggregation or window function calls.
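Example usage, a hedged sketch that registers the Summation accumulator sketched under Accumulator above and applies it to an existing DataFrame df with a float64 column "a" (df and the column name are assumptions):
import pyarrow as pa
from datafusion import AggregateUDF, col

my_sum = AggregateUDF.udaf(
    Summation,                      # accumulator class from the sketch above
    input_types=[pa.float64()],
    return_type=pa.float64(),
    state_type=[pa.float64()],
    volatility="immutable",
    name="my_sum",
)
df.aggregate([], [my_sum(col("a"))])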
- _udf¶
- class datafusion.Catalog(catalog: datafusion._internal.Catalog)¶
DataFusion data catalog.
This constructor is not typically called by the end user.
- database(name: str = 'public') Database ¶
Returns the database with the given name from this catalog.
- names() list[str] ¶
Returns the list of databases in this catalog.
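A short sketch of browsing a catalog, assuming an existing SessionContext named ctx:
catalog = ctx.catalog()           # default catalog of the session
print(catalog.names())            # databases in this catalog
db = catalog.database("public")
print(db.names())                 # tables in the "public" database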
- catalog¶
- class datafusion.Database(db: datafusion._internal.Database)¶
DataFusion Database.
This constructor is not typically called by the end user.
- names() set[str] ¶
Returns the set of all table names in this database.
- db¶
- class datafusion.Expr(expr: datafusion._internal.expr.Expr)¶
Expression object.
Expressions are one of the core concepts in DataFusion. See Expressions in the online documentation for more information.
This constructor should not be called by the end user.
- __add__(rhs: Any) Expr ¶
Addition operator.
Accepts either an expression or any valid PyArrow scalar literal value.
- __eq__(rhs: Any) Expr ¶
Equal to.
Accepts either an expression or any valid PyArrow scalar literal value.
- __ge__(rhs: Any) Expr ¶
Greater than or equal to.
Accepts either an expression or any valid PyArrow scalar literal value.
- __getitem__(key: str | int) Expr ¶
Retrieve sub-object.
If key is a string, returns the subfield of the struct. If key is an integer, retrieves the element in the array. Note that the element index begins at 0, unlike array_element which begins at 1.
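An illustrative sketch, assuming df has a struct column "s" with a field "x" and a list column "items":
from datafusion import col

df.select(
    col("s")["x"],      # struct field access
    col("items")[0],    # zero-based element access
)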
- __gt__(rhs: Any) Expr ¶
Greater than.
Accepts either an expression or any valid PyArrow scalar literal value.
- __le__(rhs: Any) Expr ¶
Less than or equal to.
Accepts either an expression or any valid PyArrow scalar literal value.
- __lt__(rhs: Any) Expr ¶
Less than.
Accepts either an expression or any valid PyArrow scalar literal value.
- __mod__(rhs: Any) Expr ¶
Modulo operator (%).
Accepts either an expression or any valid PyArrow scalar literal value.
- __mul__(rhs: Any) Expr ¶
Multiplication operator.
Accepts either an expression or any valid PyArrow scalar literal value.
- __ne__(rhs: Any) Expr ¶
Not equal to.
Accepts either an expression or any valid PyArrow scalar literal value.
- __repr__() str ¶
Generate a string representation of this expression.
- __sub__(rhs: Any) Expr ¶
Subtraction operator.
Accepts either an expression or any valid PyArrow scalar literal value.
- __truediv__(rhs: Any) Expr ¶
Division operator.
Accepts either an expression or any valid PyArrow scalar literal value.
- canonical_name() str ¶
Returns a complete string representation of this expression.
- cast(to: pyarrow.DataType[Any] | Type[float] | Type[int] | Type[str] | Type[bool]) Expr ¶
Cast to a new data type.
- column_name(plan: datafusion._internal.LogicalPlan) str ¶
Compute the output column name based on the provided logical plan.
- display_name() str ¶
Returns the name of this expression as it should appear in a schema.
This name will not include any CAST expressions.
- distinct() ExprFuncBuilder ¶
Only evaluate distinct values for an aggregate function.
This function will create an ExprFuncBuilder that can be used to set parameters for either window or aggregate functions. If used on any other type of expression, an error will be generated when build() is called.
- filter(filter: Expr) ExprFuncBuilder ¶
Filter an aggregate function.
This function will create an ExprFuncBuilder that can be used to set parameters for either window or aggregate functions. If used on any other type of expression, an error will be generated when build() is called.
- static literal(value: Any) Expr ¶
Creates a new expression representing a scalar value.
value must be a valid PyArrow scalar value or easily castable to one.
- null_treatment(null_treatment: datafusion.common.NullTreatment) ExprFuncBuilder ¶
Set the treatment for null values for a window or aggregate function.
This function will create an ExprFuncBuilder that can be used to set parameters for either window or aggregate functions. If used on any other type of expression, an error will be generated when build() is called.
- order_by(*exprs: Expr) ExprFuncBuilder ¶
Set the ordering for a window or aggregate function.
This function will create an ExprFuncBuilder that can be used to set parameters for either window or aggregate functions. If used on any other type of expression, an error will be generated when build() is called.
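A hedged sketch of the builder pattern, assuming df has columns "a" and "b"; first_value comes from datafusion.functions:
from datafusion import col
from datafusion import functions as f

# Take the first value of "a" when rows are ordered by "b".
expr = f.first_value(col("a")).order_by(col("b")).build()
df.aggregate([], [expr])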
- partition_by(*partition_by: Expr) ExprFuncBuilder ¶
Set the partitioning for a window function.
This function will create an ExprFuncBuilder that can be used to set parameters for either window or aggregate functions. If used on any other type of expression, an error will be generated when build() is called.
- python_value() Any ¶
Extracts the Expr value into a PyObject.
This is only valid for literal expressions.
- Returns:
Python object representing literal value of the expression.
- rex_call_operands() list[Expr] ¶
Return the operands of the expression based on its variant type.
Row expressions, Rex(s), operate on the concept of operands. Different variants of expressions, Expr(s), store those operands in different data structures. This function examines the Expr variant and returns the operands to the calling logic.
- rex_call_operator() str ¶
Extracts the operator associated with a row expression type call.
- rex_type() datafusion.common.RexType ¶
Return the Rex Type of this expression.
A Rex (Row Expression) specifies a single row of data. That specification could include user defined functions or types. RexType identifies the row as one of the possible valid RexType values.
- sort(ascending: bool = True, nulls_first: bool = True) Expr ¶
Creates a sort Expr from an existing Expr.
- Parameters:
ascending – If true, sort in ascending order.
nulls_first – If true, return null values first.
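Example usage, assuming df has a column "a":
df.sort(col("a").sort(ascending=False, nulls_first=False))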
- to_variant() Any ¶
Convert this expression into a python object if possible.
- types() datafusion.common.DataTypeMap ¶
Return the DataTypeMap.
- Returns:
DataTypeMap which represents the PythonType, Arrow DataType, and SqlType Enum which this expression represents.
- variant_name() str ¶
Returns the name of the Expr variant.
Ex: IsNotNull, Literal, BinaryExpr, etc.
- window_frame(window_frame: WindowFrame) ExprFuncBuilder ¶
Set the frame for a window function.
This function will create an ExprFuncBuilder that can be used to set parameters for either window or aggregate functions. If used on any other type of expression, an error will be generated when build() is called.
- __radd__¶
- __rand__¶
- __rmod__¶
- __rmul__¶
- __ror__¶
- __rsub__¶
- __rtruediv__¶
- _to_pyarrow_types¶
- expr¶
- class datafusion.RecordBatch(record_batch: datafusion._internal.RecordBatch)¶
This class is essentially a wrapper for pyarrow.RecordBatch.
This constructor is generally not called by the end user. See the RecordBatchStream iterator for generating this class.
- to_pyarrow() pyarrow.RecordBatch ¶
Convert to pyarrow.RecordBatch.
- record_batch¶
- class datafusion.RecordBatchStream(record_batch_stream: datafusion._internal.RecordBatchStream)¶
This class represents a stream of record batches.
These are typically the result of an execute_stream() operation.
This constructor is typically not called by the end user.
- __iter__() typing_extensions.Self ¶
Iterator function.
- __next__() RecordBatch ¶
Iterator function.
- next() RecordBatch | None ¶
See __next__() for the iterator function.
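A minimal iteration sketch, assuming an existing DataFrame df:
for batch in df.execute_stream():
    # Each item is a RecordBatch wrapper; convert to pyarrow as needed.
    print(batch.to_pyarrow().num_rows)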
- rbs¶
- class datafusion.RuntimeConfig¶
Runtime configuration options.
Create a new RuntimeConfig with default values.
- with_disk_manager_disabled() RuntimeConfig ¶
Disable the disk manager; attempts to create temporary files will error.
- Returns:
A new RuntimeConfig object with the updated setting.
- with_disk_manager_os() RuntimeConfig ¶
Use the operating system’s temporary directory for disk manager.
- Returns:
A new RuntimeConfig object with the updated setting.
- with_disk_manager_specified(*paths: str | pathlib.Path) RuntimeConfig ¶
Use the specified paths for the disk manager’s temporary files.
- Parameters:
paths – Paths to use for the disk manager’s temporary files.
- Returns:
A new RuntimeConfig object with the updated setting.
- with_fair_spill_pool(size: int) RuntimeConfig ¶
Use a fair spill pool with the specified size.
This pool works best when you know beforehand the query has multiple spillable operators that will likely all need to spill. Sometimes it will cause spills even when there was sufficient memory (reserved for other operators) to avoid doing so:
┌───────────────────────z──────────────────────z───────────────┐
│                       z                      z               │
│                       z                      z               │
│       Spillable       z      Unspillable     z     Free      │
│        Memory         z        Memory        z    Memory     │
│                       z                      z               │
│                       z                      z               │
└───────────────────────z──────────────────────z───────────────┘
- Parameters:
size – Size of the memory pool in bytes.
- Returns:
A new RuntimeConfig object with the updated setting.
Example usage:
config = RuntimeConfig().with_fair_spill_pool(1024)
- with_greedy_memory_pool(size: int) RuntimeConfig ¶
Use a greedy memory pool with the specified size.
This pool works well for queries that do not need to spill or have a single spillable operator. See with_fair_spill_pool() if there are multiple spillable operators that all will spill.
- Parameters:
size – Size of the memory pool in bytes.
- Returns:
A new RuntimeConfig object with the updated setting.
Example usage:
config = RuntimeConfig().with_greedy_memory_pool(1024)
- with_temp_file_path(path: str | pathlib.Path) RuntimeConfig ¶
Use the specified path to create any needed temporary files.
- Parameters:
path – Path to use for temporary files.
- Returns:
A new RuntimeConfig object with the updated setting.
Example usage:
config = RuntimeConfig().with_temp_file_path("/tmp")
- with_unbounded_memory_pool() RuntimeConfig ¶
Use an unbounded memory pool.
- Returns:
A new RuntimeConfig object with the updated setting.
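The builder methods can be chained and the result passed to a session; a hedged sketch with an illustrative pool size:
from datafusion import RuntimeConfig, SessionConfig, SessionContext

runtime = RuntimeConfig().with_disk_manager_os().with_fair_spill_pool(10 * 1024 * 1024)
ctx = SessionContext(SessionConfig(), runtime)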
- config_internal¶
- class datafusion.SQLOptions¶
Options to be used when performing SQL queries.
Create a new SQLOptions with default values.
The default values are:
- DDL commands are allowed
- DML commands are allowed
- Statements are allowed
- with_allow_ddl(allow: bool = True) SQLOptions ¶
Should DDL (Data Definition Language) commands be run?
Examples of DDL commands include CREATE TABLE and DROP TABLE.
- Parameters:
allow – Allow DDL commands to be run.
- Returns:
A new SQLOptions object with the updated setting.
Example usage:
options = SQLOptions().with_allow_ddl(True)
- with_allow_dml(allow: bool = True) SQLOptions ¶
Should DML (Data Manipulation Language) commands be run?
Examples of DML commands include INSERT INTO and DELETE.
- Parameters:
allow – Allow DML commands to be run.
- Returns:
A new SQLOptions object with the updated setting.
Example usage:
options = SQLOptions().with_allow_dml(True)
- with_allow_statements(allow: bool = True) SQLOptions ¶
Should statements such as SET VARIABLE and BEGIN TRANSACTION be run?
- Parameters:
allow – Allow statements to be run.
- Returns:
A new SQLOptions object with the updated setting.
Example usage:
options = SQLOptions().with_allow_statements(True)
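A hedged sketch of applying the options to a query, assuming an existing SessionContext named ctx:
options = SQLOptions().with_allow_ddl(False)
# Expected to raise an error because DDL has been disallowed.
ctx.sql_with_options("CREATE TABLE t (a INT)", options)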
- options_internal¶
- class datafusion.ScalarUDF(name: str | None, func: Callable[Ellipsis, _R], input_types: list[pyarrow.DataType], return_type: _R, volatility: Volatility | str)¶
Class for performing scalar user defined functions (UDF).
Scalar UDFs operate on a row by row basis. See also AggregateUDF for operating on a group of rows.
Instantiate a scalar user defined function (UDF). See helper method udf() for argument details.
- __call__(*args: datafusion.expr.Expr) datafusion.expr.Expr ¶
Execute the UDF.
This function is not typically called by an end user. These calls will occur during the evaluation of the dataframe.
- static udf(func: Callable[Ellipsis, _R], input_types: list[pyarrow.DataType], return_type: _R, volatility: Volatility | str, name: str | None = None) ScalarUDF ¶
Create a new User Defined Function.
- Parameters:
func – A callable python function.
input_types – The data types of the arguments to func. This list must be of the same length as the number of arguments.
return_type – The data type of the return value from the python function.
volatility – See Volatility for allowed values.
name – A descriptive name for the function.
- Returns:
A user defined function (UDF) that can be used in expressions.
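Example usage, a minimal sketch assuming an existing DataFrame df with an int64 column "a":
import pyarrow as pa
from datafusion import ScalarUDF, col

def is_null(arr: pa.Array) -> pa.Array:
    # Operates on a whole Arrow array, one batch at a time.
    return arr.is_null()

is_null_udf = ScalarUDF.udf(
    is_null,
    input_types=[pa.int64()],
    return_type=pa.bool_(),
    volatility="immutable",
)
df.select(is_null_udf(col("a")))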
- _udf¶
- class datafusion.SessionConfig(config_options: dict[str, str] | None = None)¶
Session configuration options.
Create a new SessionConfig with the given configuration options.
- Parameters:
config_options – Configuration options.
- set(key: str, value: str) SessionConfig ¶
Set a configuration option.
- Parameters:
key – Option key.
value – Option value.
- Returns:
A new SessionConfig object with the updated setting.
- with_batch_size(batch_size: int) SessionConfig ¶
Customize batch size.
- Parameters:
batch_size – Batch size.
- Returns:
A new SessionConfig object with the updated setting.
- with_create_default_catalog_and_schema(enabled: bool = True) SessionConfig ¶
Control if the default catalog and schema will be automatically created.
- Parameters:
enabled – Whether the default catalog and schema will be automatically created.
- Returns:
A new SessionConfig object with the updated setting.
- with_default_catalog_and_schema(catalog: str, schema: str) SessionConfig ¶
Select a name for the default catalog and schema.
- Parameters:
catalog – Catalog name.
schema – Schema name.
- Returns:
A new SessionConfig object with the updated setting.
- with_information_schema(enabled: bool = True) SessionConfig ¶
Enable or disable the inclusion of information_schema virtual tables.
- Parameters:
enabled – Whether to include information_schema virtual tables.
- Returns:
A new SessionConfig object with the updated setting.
- with_parquet_pruning(enabled: bool = True) SessionConfig ¶
Enable or disable the use of pruning predicate for parquet readers.
Pruning predicates will enable the reader to skip row groups.
- Parameters:
enabled – Whether to use pruning predicate for parquet readers.
- Returns:
A new SessionConfig object with the updated setting.
- with_repartition_aggregations(enabled: bool = True) SessionConfig ¶
Enable or disable the use of repartitioning for aggregations.
Enabling this improves parallelism.
- Parameters:
enabled – Whether to use repartitioning for aggregations.
- Returns:
A new SessionConfig object with the updated setting.
- with_repartition_file_min_size(size: int) SessionConfig ¶
Set minimum file range size for repartitioning scans.
- Parameters:
size – Minimum file range size.
- Returns:
A new SessionConfig object with the updated setting.
- with_repartition_file_scans(enabled: bool = True) SessionConfig ¶
Enable or disable the use of repartitioning for file scans.
- Parameters:
enabled – Whether to use repartitioning for file scans.
- Returns:
A new SessionConfig object with the updated setting.
- with_repartition_joins(enabled: bool = True) SessionConfig ¶
Enable or disable the use of repartitioning for joins to improve parallelism.
- Parameters:
enabled – Whether to use repartitioning for joins.
- Returns:
A new SessionConfig object with the updated setting.
- with_repartition_sorts(enabled: bool = True) SessionConfig ¶
Enable or disable the use of repartitioning for sorts.
This may improve parallelism.
- Parameters:
enabled – Whether to use repartitioning for sorts.
- Returns:
A new SessionConfig object with the updated setting.
- with_repartition_windows(enabled: bool = True) SessionConfig ¶
Enable or disable the use of repartitioning for window functions.
This may improve parallelism.
- Parameters:
enabled – Whether to use repartitioning for window functions.
- Returns:
A new SessionConfig object with the updated setting.
- with_target_partitions(target_partitions: int) SessionConfig ¶
Customize the number of target partitions for query execution.
Increasing partitions can increase concurrency.
- Parameters:
target_partitions – Number of target partitions.
- Returns:
A new SessionConfig object with the updated setting.
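The setters can be chained; a hedged sketch with illustrative values:
from datafusion import SessionConfig, SessionContext

config = (
    SessionConfig()
    .with_default_catalog_and_schema("foo", "bar")
    .with_target_partitions(8)
    .with_information_schema(True)
)
ctx = SessionContext(config)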
- config_internal¶
- class datafusion.Table(table: datafusion._internal.Table)¶
DataFusion table.
This constructor is not typically called by the end user.
- schema() pyarrow.Schema ¶
Returns the schema associated with this table.
- property kind: str¶
Returns the kind of table.
- table¶
- class datafusion.WindowFrame(units: str, start_bound: Any | None, end_bound: Any | None)¶
Defines a window frame for performing window operations.
Construct a window frame using the given parameters.
- Parameters:
units – Should be one of rows, range, or groups.
start_bound – Sets the preceding bound. Must be >= 0. If None, this will be set to unbounded. If unit type is groups, this parameter must be set.
end_bound – Sets the following bound. Must be >= 0. If None, this will be set to unbounded. If unit type is groups, this parameter must be set.
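A hedged sketch combining a frame with the Expr window builder, assuming df has columns "a" and "b":
from datafusion import WindowFrame, col
from datafusion import functions as f

frame = WindowFrame("rows", None, None)   # unbounded preceding to unbounded following
expr = f.avg(col("a")).window_frame(frame).partition_by(col("b")).build()
df.select(expr)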
- get_frame_units() str ¶
Returns the window frame units for the bounds.
- get_lower_bound() WindowFrameBound ¶
Returns the starting bound.
- get_upper_bound()¶
Returns the end bound.
- window_frame¶
- datafusion.col(value: str)¶
Create a column expression.
- datafusion.column(value: str)¶
Create a column expression.
- datafusion.lit(value)¶
Create a literal expression.
- datafusion.literal(value)¶
Create a literal expression.
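These four are the usual building blocks for expressions; for example, assuming df has a numeric column "a":
from datafusion import col, lit

df.filter(col("a") > lit(10)).select((col("a") * lit(2)).alias("doubled"))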
- datafusion.DFSchema¶