datafusion.dataframe¶
DataFrame is one of the core concepts in DataFusion.
See Concepts in the online documentation for more information.
Classes¶
Compression – Enum representing the available compression types for Parquet files.
DataFrame – Two dimensional table representation of data.
Module Contents¶
- class datafusion.dataframe.Compression(*args, **kwds)¶
Bases:
enum.Enum
Enum representing the available compression types for Parquet files.
- classmethod from_str(value: str) Compression ¶
Convert a string to a Compression enum value.
- Parameters:
value – The string representation of the compression type.
- Returns:
The matching Compression enum value.
- Raises:
ValueError – If the string does not match any Compression enum value.
- get_default_level() int | None ¶
Get the default compression level for the compression type.
- Returns:
The default compression level for the compression type.
- BROTLI = 'brotli'¶
- GZIP = 'gzip'¶
- LZ4 = 'lz4'¶
- LZ4_RAW = 'lz4_raw'¶
- SNAPPY = 'snappy'¶
- UNCOMPRESSED = 'uncompressed'¶
- ZSTD = 'zstd'¶
- class datafusion.dataframe.DataFrame(df: datafusion._internal.DataFrame)¶
Two dimensional table representation of data.
See Concepts in the online documentation for more information.
This constructor is not to be used by the end user. See SessionContext for methods to create a DataFrame.
- __arrow_c_stream__(requested_schema: pyarrow.Schema) Any ¶
Export an Arrow PyCapsule Stream.
This will execute and collect the DataFrame. We will attempt to respect the requested schema, but only trivial transformations will be applied such as only returning the fields listed in the requested schema if their data types match those in the DataFrame.
- Parameters:
requested_schema – Attempt to provide the DataFrame using this schema.
- Returns:
Arrow PyCapsule object.
- __getitem__(key: str | List[str]) DataFrame ¶
Return a new DataFrame with the specified column or columns.
- Parameters:
key – Column name or list of column names to select.
- Returns:
DataFrame with the specified column or columns.
- __repr__() str ¶
Return a string representation of the DataFrame.
- Returns:
String representation of the DataFrame.
- _repr_html_() str ¶
- aggregate(group_by: list[datafusion.expr.Expr] | datafusion.expr.Expr, aggs: list[datafusion.expr.Expr] | datafusion.expr.Expr) DataFrame ¶
Aggregate the rows of the current DataFrame.
- Parameters:
group_by – List of expressions to group by.
aggs – List of expressions to aggregate.
- Returns:
DataFrame after aggregation.
- cast(mapping: dict[str, pyarrow.DataType[Any]]) DataFrame ¶
Cast one or more columns to a different data type.
- Parameters:
mapping – Mapping with column name as key and target data type as value.
- Returns:
DataFrame after casting columns.
- collect() list[pyarrow.RecordBatch] ¶
Execute this DataFrame and collect results into memory.
Prior to calling collect, modifying a DataFrame simply updates a plan (no actual computation is performed). Calling collect triggers the computation.
- Returns:
List of pyarrow.RecordBatch collected from the DataFrame.
- collect_partitioned() list[list[pyarrow.RecordBatch]] ¶
Execute this DataFrame and collect all partitioned results.
This operation returns pyarrow.RecordBatch maintaining the input partitioning.
- Returns:
List of list of RecordBatch collected from the DataFrame.
- count() int ¶
Return the total number of rows in this DataFrame.
Note that this method will actually run a plan to calculate the count, which may be slow for large or complicated DataFrames.
- Returns:
Number of rows in the DataFrame.
- describe() DataFrame ¶
Return the statistics for this DataFrame.
Currently only numeric datatypes are summarized; nulls are returned for non-numeric datatypes.
The output format is modeled after pandas.
- Returns:
A summary DataFrame containing statistics.
- distinct() DataFrame ¶
Return a new DataFrame with all duplicated rows removed.
- Returns:
DataFrame after removing duplicates.
- drop(*columns: str) DataFrame ¶
Drop an arbitrary number of columns.
- Parameters:
columns – Column names to drop from the dataframe.
- Returns:
DataFrame with those columns removed in the projection.
- except_all(other: DataFrame) DataFrame ¶
Calculate the exception of two DataFrames.
The two DataFrames must have exactly the same schema.
- Parameters:
other – DataFrame to calculate exception with.
- Returns:
DataFrame after exception.
- execute_stream() datafusion.record_batch.RecordBatchStream ¶
Executes this DataFrame and returns a stream over a single partition.
- Returns:
Record Batch Stream over a single partition.
- execute_stream_partitioned() list[datafusion.record_batch.RecordBatchStream] ¶
Executes this DataFrame and returns a stream for each partition.
- Returns:
One record batch stream per partition.
- execution_plan() datafusion.plan.ExecutionPlan ¶
Return the execution/physical plan.
- Returns:
Execution plan.
- explain(verbose: bool = False, analyze: bool = False) DataFrame ¶
Return a DataFrame with the explanation of its plan so far.
If analyze is specified, runs the plan and reports metrics.
- Parameters:
verbose – If True, more details will be included.
analyze – If True, the plan will run and metrics will be reported.
- Returns:
DataFrame with the explanation of its plan.
- filter(*predicates: datafusion.expr.Expr) DataFrame ¶
Return a DataFrame for which predicate evaluates to True.
Rows for which predicate evaluates to False or None are filtered out. If more than one predicate is provided, these predicates will be combined as a logical AND. If more complex logic is required, see the logical operations in functions.
- Parameters:
predicates – Predicate expression(s) to filter the DataFrame.
- Returns:
DataFrame after filtering.
- head(n: int = 5) DataFrame ¶
Return a new DataFrame with a limited number of rows.
- Parameters:
n – Number of rows to take from the head of the DataFrame.
- Returns:
DataFrame after limiting.
- intersect(other: DataFrame) DataFrame ¶
Calculate the intersection of two DataFrames.
The two DataFrames must have exactly the same schema.
- Parameters:
other – DataFrame to intersect with.
- Returns:
DataFrame after intersection.
- join(right: DataFrame, on: str | Sequence[str], how: Literal['inner', 'left', 'right', 'full', 'semi', 'anti'] = 'inner', *, left_on: None = None, right_on: None = None, join_keys: None = None) DataFrame ¶
- join(right: DataFrame, on: None = None, how: Literal['inner', 'left', 'right', 'full', 'semi', 'anti'] = 'inner', *, left_on: str | Sequence[str], right_on: str | Sequence[str], join_keys: tuple[list[str], list[str]] | None = None) DataFrame
- join(right: DataFrame, on: None = None, how: Literal['inner', 'left', 'right', 'full', 'semi', 'anti'] = 'inner', *, join_keys: tuple[list[str], list[str]], left_on: None = None, right_on: None = None) DataFrame
Join this DataFrame with another DataFrame.
Either on must be provided, or both left_on and right_on together.
- Parameters:
right – Other DataFrame to join with.
on – Column names to join on in both dataframes.
how – Type of join to perform. Supported types are “inner”, “left”, “right”, “full”, “semi”, “anti”.
left_on – Join column of the left dataframe.
right_on – Join column of the right dataframe.
join_keys – Tuple of two lists of column names to join on. [Deprecated]
- Returns:
DataFrame after join.
- join_on(right: DataFrame, *on_exprs: datafusion.expr.Expr, how: Literal['inner', 'left', 'right', 'full', 'semi', 'anti'] = 'inner') DataFrame ¶
Join two DataFrames using the specified expressions.
On expressions are used to support inequality predicates. Equality predicates are correctly optimized.
- Parameters:
right – Other DataFrame to join with.
on_exprs – single or multiple (in)-equality predicates.
how – Type of join to perform. Supported types are “inner”, “left”, “right”, “full”, “semi”, “anti”.
- Returns:
DataFrame after join.
- limit(count: int, offset: int = 0) DataFrame ¶
Return a new DataFrame with a limited number of rows.
- Parameters:
count – Number of rows to limit the DataFrame to.
offset – Number of rows to skip.
- Returns:
DataFrame after limiting.
- logical_plan() datafusion.plan.LogicalPlan ¶
Return the unoptimized LogicalPlan.
- Returns:
Unoptimized logical plan.
- optimized_logical_plan() datafusion.plan.LogicalPlan ¶
Return the optimized LogicalPlan.
- Returns:
Optimized logical plan.
- repartition(num: int) DataFrame ¶
Repartition a DataFrame into num partitions.
Batch allocation uses a round-robin algorithm.
- Parameters:
num – Number of partitions to repartition the DataFrame into.
- Returns:
Repartitioned DataFrame.
- repartition_by_hash(*exprs: datafusion.expr.Expr, num: int) DataFrame ¶
Repartition a DataFrame using a hash partitioning scheme.
- Parameters:
exprs – Expressions to evaluate and perform hashing on.
num – Number of partitions to repartition the DataFrame into.
- Returns:
Repartitioned DataFrame.
- schema() pyarrow.Schema ¶
Return the pyarrow.Schema of this DataFrame.
The output schema contains information on the name, data type, and nullability for each column.
- Returns:
Schema describing the DataFrame.
- select(*exprs: datafusion.expr.Expr | str) DataFrame ¶
Project arbitrary expressions into a new DataFrame.
- Parameters:
exprs – Either column names or Expr to select.
- Returns:
DataFrame after projection. It has one column for each expression.
Example usage:
The following example will return 3 columns from the original dataframe. The first two columns will be the original columns a and b, since the string "a" is assumed to refer to column selection. Also, a duplicate of column a will be returned with the column name alternate_a:
df = df.select("a", col("b"), col("a").alias("alternate_a"))
- select_columns(*args: str) DataFrame ¶
Filter the DataFrame by columns.
- Returns:
DataFrame only containing the specified columns.
- show(num: int = 20) None ¶
Execute the DataFrame and print the result to the console.
- Parameters:
num – Number of lines to show.
- sort(*exprs: datafusion.expr.Expr | datafusion.expr.SortExpr) DataFrame ¶
Sort the DataFrame by the specified sorting expressions.
Note that any expression can be turned into a sort expression by calling its sort method.
- Parameters:
exprs – Sort expressions, applied in order.
- Returns:
DataFrame after sorting.
- tail(n: int = 5) DataFrame ¶
Return a new DataFrame with a limited number of rows.
Be aware this could be expensive, since the number of rows in the DataFrame needs to be determined. This is done by collecting it.
- Parameters:
n – Number of rows to take from the tail of the DataFrame.
- Returns:
DataFrame after limiting.
- to_arrow_table() pyarrow.Table ¶
Execute the DataFrame and convert it into an Arrow Table.
- Returns:
Arrow Table.
- to_pandas() pandas.DataFrame ¶
Execute the DataFrame and convert it into a Pandas DataFrame.
- Returns:
Pandas DataFrame.
- to_polars() polars.DataFrame ¶
Execute the DataFrame and convert it into a Polars DataFrame.
- Returns:
Polars DataFrame.
- to_pydict() dict[str, list[Any]] ¶
Execute the DataFrame and convert it into a dictionary of lists.
- Returns:
Dictionary of lists.
- to_pylist() list[dict[str, Any]] ¶
Execute the DataFrame and convert it into a list of dictionaries.
- Returns:
List of dictionaries.
- transform(func: Callable[Ellipsis, DataFrame], *args: Any) DataFrame ¶
Apply a function to the current DataFrame which returns another DataFrame.
This is useful for chaining together multiple functions. For example:
def add_3(df: DataFrame) -> DataFrame:
    return df.with_column("modified", lit(3))

def within_limit(df: DataFrame, limit: int) -> DataFrame:
    return df.filter(col("a") < lit(limit)).distinct()

df = df.transform(add_3).transform(within_limit, 4)
- Parameters:
func – A callable function that takes a DataFrame as its first argument.
args – Zero or more arguments to pass to func
- Returns:
DataFrame after applying func to the original dataframe.
- Return type:
DataFrame
- union(other: DataFrame, distinct: bool = False) DataFrame ¶
Calculate the union of two DataFrames.
The two DataFrames must have exactly the same schema.
- Parameters:
other – DataFrame to union with.
distinct – If True, duplicate rows will be removed.
- Returns:
DataFrame after union.
- union_distinct(other: DataFrame) DataFrame ¶
Calculate the distinct union of two DataFrames.
The two DataFrames must have exactly the same schema. Any duplicate rows are discarded.
- Parameters:
other – DataFrame to union with.
- Returns:
DataFrame after union.
- unnest_columns(*columns: str, preserve_nulls: bool = True) DataFrame ¶
Expand columns of arrays into a single row per array element.
- Parameters:
columns – Column names to perform unnest operation on.
preserve_nulls – If False, rows with null entries will not be returned.
- Returns:
A DataFrame with the columns expanded.
- with_column(name: str, expr: datafusion.expr.Expr) DataFrame ¶
Add an additional column to the DataFrame.
- Parameters:
name – Name of the column to add.
expr – Expression to compute the column.
- Returns:
DataFrame with the new column.
- with_column_renamed(old_name: str, new_name: str) DataFrame ¶
Rename one column by applying a new projection.
This is a no-op if the column to be renamed does not exist.
The method supports case-sensitive renaming by wrapping the column name in one of the following symbols: " or ' or `.
- Parameters:
old_name – Old column name.
new_name – New column name.
- Returns:
DataFrame with the column renamed.
- with_columns(*exprs: datafusion.expr.Expr | Iterable[datafusion.expr.Expr], **named_exprs: datafusion.expr.Expr) DataFrame ¶
Add columns to the DataFrame.
Columns can be passed as expressions, iterables of expressions, or named expressions. To pass named expressions, use the form name=Expr.
Example usage: The following will add 4 columns labeled a, b, c, and d:
df = df.with_columns(
    lit(0).alias("a"),
    [lit(1).alias("b"), lit(2).alias("c")],
    d=lit(3),
)
- Parameters:
exprs – Either a single expression or an iterable of expressions to add.
named_exprs – Named expressions in the form name=expr.
- Returns:
DataFrame with the new columns added.
- write_csv(path: str | pathlib.Path, with_header: bool = False) None ¶
Execute the DataFrame and write the results to a CSV file.
- Parameters:
path – Path of the CSV file to write.
with_header – If true, output the CSV header row.
- write_json(path: str | pathlib.Path) None ¶
Execute the DataFrame and write the results to a JSON file.
- Parameters:
path – Path of the JSON file to write.
- write_parquet(path: str | pathlib.Path, compression: str | Compression = Compression.ZSTD, compression_level: int | None = None) None ¶
Execute the DataFrame and write the results to a Parquet file.
- Parameters:
path – Path of the Parquet file to write.
compression – Compression type to use. Default is "ZSTD". Available compression types are:
- "uncompressed": No compression.
- "snappy": Snappy compression.
- "gzip": Gzip compression.
- "brotli": Brotli compression.
- "lz4": LZ4 compression.
- "lz4_raw": LZ4_RAW compression.
- "zstd": Zstandard compression.
Note – LZO is not yet implemented in arrow-rs and is therefore excluded.
compression_level – Compression level to use. For ZSTD, the recommended range is 1 to 22, with the default being 4. Higher levels provide better compression but slower speed.