datafusion.io
IO read functions using the global context.
Functions

| Function | Description |
|---|---|
| read_avro | Create a DataFrame for reading an Avro data source. |
| read_csv | Read a CSV data source. |
| read_json | Read a line-delimited JSON data source. |
| read_parquet | Read a Parquet source into a DataFrame. |
Module Contents
- datafusion.io.read_avro(path: str | pathlib.Path, schema: pyarrow.Schema | None = None, file_partition_cols: list[tuple[str, str]] | None = None, file_extension: str = '.avro') → datafusion.dataframe.DataFrame
Create a DataFrame for reading an Avro data source.
This function will use the global context. Any functions or tables registered with another context may not be accessible when used with a DataFrame created using this function.
- Parameters:
path – Path to the Avro file.
schema – The data source schema.
file_partition_cols – Partition columns.
file_extension – File extension to select.
- Returns:
DataFrame representation of the read Avro file.
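A minimal usage sketch; the file path is hypothetical:

```python
from datafusion.io import read_avro

# Hypothetical path; with schema=None the reader uses the schema
# embedded in the Avro file itself.
df = read_avro("events.avro")
df.show()
```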
- datafusion.io.read_csv(path: str | pathlib.Path | list[str] | list[pathlib.Path], schema: pyarrow.Schema | None = None, has_header: bool = True, delimiter: str = ',', schema_infer_max_records: int = 1000, file_extension: str = '.csv', table_partition_cols: list[tuple[str, str]] | None = None, file_compression_type: str | None = None) → datafusion.dataframe.DataFrame
Read a CSV data source.
This function will use the global context. Any functions or tables registered with another context may not be accessible when used with a DataFrame created using this function.
- Parameters:
path – Path to the CSV file.
schema – An optional schema representing the CSV files. If None, the CSV reader will try to infer it based on the data in the file.
has_header – Whether the CSV file has a header. If schema inference is run on a file with no headers, default column names are created.
delimiter – An optional column delimiter.
schema_infer_max_records – Maximum number of rows to read from CSV files for schema inference if needed.
file_extension – File extension; only files with this extension are selected for data input.
table_partition_cols – Partition columns.
file_compression_type – File compression type.
- Returns:
DataFrame representation of the read CSV files.
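A minimal sketch that supplies an explicit schema, so inference is skipped; the path and column names are hypothetical:

```python
import pyarrow as pa
from datafusion.io import read_csv

# Hypothetical file and columns; passing an explicit schema bypasses
# schema inference entirely.
schema = pa.schema([("id", pa.int64()), ("name", pa.string())])
df = read_csv("users.csv", schema=schema, has_header=True, delimiter=",")
df.show()
```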
- datafusion.io.read_json(path: str | pathlib.Path, schema: pyarrow.Schema | None = None, schema_infer_max_records: int = 1000, file_extension: str = '.json', table_partition_cols: list[tuple[str, str]] | None = None, file_compression_type: str | None = None) → datafusion.dataframe.DataFrame
Read a line-delimited JSON data source.
This function will use the global context. Any functions or tables registered with another context may not be accessible when used with a DataFrame created using this function.
- Parameters:
path – Path to the JSON file.
schema – The data source schema.
schema_infer_max_records – Maximum number of rows to read from JSON files for schema inference if needed.
file_extension – File extension; only files with this extension are selected for data input.
table_partition_cols – Partition columns.
file_compression_type – File compression type.
- Returns:
DataFrame representation of the read JSON files.
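A minimal sketch for line-delimited (NDJSON) input; the path is hypothetical and the inference limit is raised purely for illustration:

```python
from datafusion.io import read_json

# Hypothetical NDJSON file: one JSON object per line.
# Raising schema_infer_max_records samples more rows before the
# schema is fixed, at the cost of a slower initial scan.
df = read_json("logs.json", schema_infer_max_records=10000)
df.show()
```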
- datafusion.io.read_parquet(path: str | pathlib.Path, table_partition_cols: list[tuple[str, str]] | None = None, parquet_pruning: bool = True, file_extension: str = '.parquet', skip_metadata: bool = True, schema: pyarrow.Schema | None = None, file_sort_order: list[list[datafusion.expr.Expr]] | None = None) → datafusion.dataframe.DataFrame
Read a Parquet source into a DataFrame.
This function will use the global context. Any functions or tables registered with another context may not be accessible when used with a DataFrame created using this function.
- Parameters:
path – Path to the Parquet file.
table_partition_cols – Partition columns.
parquet_pruning – Whether the parquet reader should use the predicate to prune row groups.
file_extension – File extension; only files with this extension are selected for data input.
skip_metadata – Whether the parquet reader should skip any metadata that may be in the file schema. This can help avoid schema conflicts due to metadata.
schema – An optional schema representing the parquet files. If None, the parquet reader will try to infer it based on data in the file.
file_sort_order – Sort order for the file.
- Returns:
DataFrame representation of the read Parquet files.
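A minimal sketch reading a hypothetical Hive-partitioned directory; the path and the ("year", "string") partition declaration are assumptions for illustration, and parquet_pruning is left at its default:

```python
from datafusion.io import read_parquet

# Hypothetical layout: sales/year=2023/..., sales/year=2024/...
# Each (name, type) tuple declares a partition column whose values
# come from the directory path rather than the file contents.
df = read_parquet("sales/", table_partition_cols=[("year", "string")])
df.show()
```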