datafusion.io

IO read functions using the global context.

Functions

read_avro(→ datafusion.dataframe.DataFrame)

Create a DataFrame for reading an Avro data source.

read_csv(→ datafusion.dataframe.DataFrame)

Read a CSV data source.

read_json(→ datafusion.dataframe.DataFrame)

Read a line-delimited JSON data source.

read_parquet(→ datafusion.dataframe.DataFrame)

Read a Parquet source into a DataFrame.

Module Contents

datafusion.io.read_avro(path: str | pathlib.Path, schema: pyarrow.Schema | None = None, file_partition_cols: list[tuple[str, str]] | None = None, file_extension: str = '.avro') → datafusion.dataframe.DataFrame

Create a DataFrame for reading an Avro data source.

This function will use the global context. Any functions or tables registered with another context may not be accessible from a DataFrame created by this function.

Parameters:
  • path – Path to the Avro file.

  • schema – The data source schema.

  • file_partition_cols – Partition columns.

  • file_extension – File extension to select.

Returns:

DataFrame representation of the read Avro file.
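
For example, a minimal sketch of reading an Avro file with an explicit schema; the file name events.avro and its columns are hypothetical, not part of this API:

    import pyarrow as pa

    from datafusion.io import read_avro

    # Hypothetical schema for an events.avro file; replace with your own.
    schema = pa.schema([("id", pa.int64()), ("name", pa.string())])

    df = read_avro("events.avro", schema=schema)
    df.show()  # preview the resulting DataFrame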

datafusion.io.read_csv(path: str | pathlib.Path | list[str] | list[pathlib.Path], schema: pyarrow.Schema | None = None, has_header: bool = True, delimiter: str = ',', schema_infer_max_records: int = 1000, file_extension: str = '.csv', table_partition_cols: list[tuple[str, str]] | None = None, file_compression_type: str | None = None) → datafusion.dataframe.DataFrame

Read a CSV data source.

This function will use the global context. Any functions or tables registered with another context may not be accessible from a DataFrame created by this function.

Parameters:
  • path – Path to the CSV file, or a list of file paths.

  • schema – An optional schema representing the CSV files. If None, the CSV reader will try to infer it based on the data in the file.

  • has_header – Whether the CSV file has a header. If schema inference is run on a file with no headers, default column names are created.

  • delimiter – An optional column delimiter.

  • schema_infer_max_records – Maximum number of rows to read from CSV files for schema inference if needed.

  • file_extension – File extension; only files with this extension are selected for data input.

  • table_partition_cols – Partition columns.

  • file_compression_type – File compression type.

Returns:

DataFrame representation of the read CSV files.
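
For example, a sketch of reading a headered CSV file with schema inference; the file name data.csv is hypothetical:

    from datafusion.io import read_csv

    # No explicit schema is passed, so the reader infers one from up to
    # schema_infer_max_records rows.
    df = read_csv("data.csv", has_header=True, delimiter=",")
    df.show()  # preview the resulting DataFrame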

datafusion.io.read_json(path: str | pathlib.Path, schema: pyarrow.Schema | None = None, schema_infer_max_records: int = 1000, file_extension: str = '.json', table_partition_cols: list[tuple[str, str]] | None = None, file_compression_type: str | None = None) → datafusion.dataframe.DataFrame

Read a line-delimited JSON data source.

This function will use the global context. Any functions or tables registered with another context may not be accessible from a DataFrame created by this function.

Parameters:
  • path – Path to the JSON file.

  • schema – The data source schema.

  • schema_infer_max_records – Maximum number of rows to read from JSON files for schema inference if needed.

  • file_extension – File extension; only files with this extension are selected for data input.

  • table_partition_cols – Partition columns.

  • file_compression_type – File compression type.

Returns:

DataFrame representation of the read JSON files.
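
For example, a sketch of reading a line-delimited JSON file; the file name records.json is hypothetical:

    from datafusion.io import read_json

    # Each line of records.json is expected to be a single JSON object;
    # with no schema given, one is inferred from the first rows.
    df = read_json("records.json")
    df.show()  # preview the resulting DataFrame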

datafusion.io.read_parquet(path: str | pathlib.Path, table_partition_cols: list[tuple[str, str]] | None = None, parquet_pruning: bool = True, file_extension: str = '.parquet', skip_metadata: bool = True, schema: pyarrow.Schema | None = None, file_sort_order: list[list[datafusion.expr.Expr]] | None = None) → datafusion.dataframe.DataFrame

Read a Parquet source into a DataFrame.

This function will use the global context. Any functions or tables registered with another context may not be accessible from a DataFrame created by this function.

Parameters:
  • path – Path to the Parquet file.

  • table_partition_cols – Partition columns.

  • parquet_pruning – Whether the Parquet reader should use the predicate to prune row groups.

  • file_extension – File extension; only files with this extension are selected for data input.

  • skip_metadata – Whether the Parquet reader should skip any metadata that may be in the file schema. This can help avoid schema conflicts due to metadata.

  • schema – An optional schema representing the Parquet files. If None, the Parquet reader will try to infer it based on data in the file.

  • file_sort_order – Sort order the files are already known to satisfy, given as a list of sort expression lists.

Returns:

DataFrame representation of the read Parquet files.
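
For example, a sketch of reading a directory of partitioned Parquet files; the path data/ and the partition column year are hypothetical:

    from datafusion.io import read_parquet

    # The second element of each partition column tuple is the column's
    # type name, matching the list[tuple[str, str]] signature above.
    df = read_parquet(
        "data/",
        table_partition_cols=[("year", "string")],
        parquet_pruning=True,  # prune row groups using query predicates
    )
    df.show()  # preview the resulting DataFrame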