datafusion.io

IO read functions using the global context.

Functions

read_avro(→ datafusion.dataframe.DataFrame)

Create a DataFrame for reading an Avro data source.

read_csv(→ datafusion.dataframe.DataFrame)

Read a CSV data source.

read_json(→ datafusion.dataframe.DataFrame)

Read a line-delimited JSON data source.

read_parquet(→ datafusion.dataframe.DataFrame)

Read a Parquet source into a DataFrame.

Module Contents

datafusion.io.read_avro(path: str | pathlib.Path, schema: pyarrow.Schema | None = None, file_partition_cols: list[tuple[str, str]] | None = None, file_extension: str = '.avro') → datafusion.dataframe.DataFrame

Create a DataFrame for reading an Avro data source.

This function will use the global context. Any functions or tables registered with another context may not be accessible from a DataFrame created by this function.

Parameters:
  • path – Path to the Avro file.

  • schema – The data source schema.

  • file_partition_cols – Partition columns.

  • file_extension – File extension to select.

Returns:

DataFrame representation of the read Avro file.
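
For example, a minimal sketch of reading an Avro file with an explicit schema; the file name events.avro and its columns are hypothetical, not part of this API:

    import pyarrow as pa

    from datafusion.io import read_avro

    # Hypothetical schema for an events.avro file; replace with your own.
    schema = pa.schema([("id", pa.int64()), ("name", pa.string())])

    df = read_avro("events.avro", schema=schema)
    df.show()  # preview the resulting DataFrame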

datafusion.io.read_csv(path: str | pathlib.Path | list[str] | list[pathlib.Path], schema: pyarrow.Schema | None = None, has_header: bool = True, delimiter: str = ',', schema_infer_max_records: int = 1000, file_extension: str = '.csv', table_partition_cols: list[tuple[str, str]] | None = None, file_compression_type: str | None = None) → datafusion.dataframe.DataFrame

Read a CSV data source.

This function will use the global context. Any functions or tables registered with another context may not be accessible from a DataFrame created by this function.

Parameters:
  • path – Path to the CSV file, or a list of file paths.

  • schema – An optional schema representing the CSV files. If None, the CSV reader will try to infer it based on the data in the file.

  • has_header – Whether the CSV file has a header. If schema inference is run on a file with no headers, default column names are created.

  • delimiter – An optional column delimiter.

  • schema_infer_max_records – Maximum number of rows to read from CSV files for schema inference if needed.

  • file_extension – File extension; only files with this extension are selected for data input.

  • table_partition_cols – Partition columns.

  • file_compression_type – File compression type.

Returns:

DataFrame representation of the read CSV files.
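
For example, a sketch of reading a headered CSV file with schema inference; the file name data.csv is hypothetical:

    from datafusion.io import read_csv

    # No explicit schema is passed, so the reader infers one from up to
    # schema_infer_max_records rows.
    df = read_csv("data.csv", has_header=True, delimiter=",")
    df.show()  # preview the resulting DataFrame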

datafusion.io.read_json(path: str | pathlib.Path, schema: pyarrow.Schema | None = None, schema_infer_max_records: int = 1000, file_extension: str = '.json', table_partition_cols: list[tuple[str, str]] | None = None, file_compression_type: str | None = None) → datafusion.dataframe.DataFrame

Read a line-delimited JSON data source.

This function will use the global context. Any functions or tables registered with another context may not be accessible from a DataFrame created by this function.

Parameters:
  • path – Path to the JSON file.

  • schema – The data source schema.

  • schema_infer_max_records – Maximum number of rows to read from JSON files for schema inference if needed.

  • file_extension – File extension; only files with this extension are selected for data input.

  • table_partition_cols – Partition columns.

  • file_compression_type – File compression type.

Returns:

DataFrame representation of the read JSON files.
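
For example, a sketch of reading a line-delimited JSON file; the file name records.json is hypothetical:

    from datafusion.io import read_json

    # Each line of records.json is expected to be a single JSON object;
    # with no schema given, one is inferred from the first rows.
    df = read_json("records.json")
    df.show()  # preview the resulting DataFrame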

datafusion.io.read_parquet(path: str | pathlib.Path, table_partition_cols: list[tuple[str, str]] | None = None, parquet_pruning: bool = True, file_extension: str = '.parquet', skip_metadata: bool = True, schema: pyarrow.Schema | None = None, file_sort_order: list[list[datafusion.expr.Expr]] | None = None) → datafusion.dataframe.DataFrame

Read a Parquet source into a DataFrame.

This function will use the global context. Any functions or tables registered with another context may not be accessible from a DataFrame created by this function.

Parameters:
  • path – Path to the Parquet file.

  • table_partition_cols – Partition columns.

  • parquet_pruning – Whether the Parquet reader should use the predicate to prune row groups.

  • file_extension – File extension; only files with this extension are selected for data input.

  • skip_metadata – Whether the Parquet reader should skip any metadata that may be in the file schema. This can help avoid schema conflicts due to metadata.

  • schema – An optional schema representing the Parquet files. If None, the Parquet reader will try to infer it based on data in the file.

  • file_sort_order – Sort order the files are already known to satisfy, given as a list of sort expression lists.

Returns:

DataFrame representation of the read Parquet files.
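
For example, a sketch of reading a directory of partitioned Parquet files; the path data/ and the partition column year are hypothetical:

    from datafusion.io import read_parquet

    # The second element of each partition column tuple is the column's
    # type name, matching the list[tuple[str, str]] signature above.
    df = read_parquet(
        "data/",
        table_partition_cols=[("year", "string")],
        parquet_pruning=True,  # prune row groups using query predicates
    )
    df.show()  # preview the resulting DataFrame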