Format Options¶
DataFusion supports customizing how data is read from or written to disk as a result of a COPY, INSERT INTO, or CREATE EXTERNAL TABLE statement. There are a few special options, file-format-specific options (e.g., for CSV or Parquet), and Parquet column-specific options. In some cases, options can be specified in multiple ways with a set order of precedence.
Specifying Options and Order of Precedence¶
Format-related options can be specified in three ways, in decreasing order of precedence:

1. CREATE EXTERNAL TABLE syntax
2. COPY option tuples
3. Session-level config defaults

For a list of supported session-level config defaults, see Configuration Settings. These defaults apply to all operations but have the lowest level of precedence.
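For example, in datafusion-cli a session-level default can be changed with a SET statement before any data is written. A minimal sketch (the zstd level shown is an arbitrary choice):

SET datafusion.execution.parquet.compression = 'zstd(5)';
-- Verify the current value of the setting
SHOW datafusion.execution.parquet.compression;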
If creating an external table, table-specific format options can be specified when the table is created using the OPTIONS clause:
CREATE EXTERNAL TABLE
my_table(a bigint, b bigint)
STORED AS csv
LOCATION '/tmp/my_csv_table/'
OPTIONS(
    NULL_VALUE 'NAN',       -- bare option key
    'has_header' 'true',    -- quoted option key
    'format.delimiter' ';'  -- quoted key with an explicit format. prefix
);
When running INSERT INTO my_table ..., the options from the CREATE TABLE will be respected (e.g., the null value, delimiter, and header settings defined above). Note that compression, header, and delimiter settings can also be specified within the OPTIONS tuple list. Dedicated syntax within the SQL statement always takes precedence over arbitrary option tuples, so if both are specified, the OPTIONS setting will be ignored.
For example, with the table defined above, running the following command:
INSERT INTO my_table VALUES(1,2);
Results in a new CSV file with the specified options:
$ cat /tmp/my_csv_table/bmC8zWFvLMtWX68R_0.csv
a;b
1;2
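Because the header and delimiter options are stored with the table definition, they are also respected when reading the table back:

-- Parses the file using the stored delimiter and header settings,
-- returning one row with a = 1 and b = 2.
SELECT * FROM my_table;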
Finally, options can be passed when running a COPY command:
COPY source_table
TO 'test/table_with_options'
PARTITIONED BY (column3, column4)
OPTIONS (
format parquet,
compression snappy,
'compression::column1' 'zstd(5)'
);
In this example, we write the entire source_table out to a folder of Parquet files. One Parquet file will be written in parallel to the folder for each partition in the query. The option compression set to snappy indicates that, unless otherwise specified, all columns should use the snappy compression codec. The option compression::column1 sets an override, so that the column column1 in the Parquet file will use the ZSTD compression codec with compression level 5. In general, Parquet options that support column-specific settings can be specified with the syntax OPTION::COLUMN.NESTED.PATH.
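For example, the following sketch sets a per-field compression override inside a struct column; the table events and the nested field payload.payload_id are hypothetical, and only the OPTION::COLUMN.NESTED.PATH syntax itself comes from the rule above:

COPY events
TO 'test/events_with_options'
OPTIONS (
    format parquet,
    -- override the codec for a single nested field only
    'compression::payload.payload_id' 'zstd(5)'
);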
Available Options¶
JSON Format Options¶
The following options are available when reading or writing JSON files. Note: If any unsupported option is specified, an error will be raised and the query will fail.
| Option      | Description                                                                                                                        | Default Value |
| ----------- | ---------------------------------------------------------------------------------------------------------------------------------- | ------------- |
| COMPRESSION | Sets the compression that should be applied to the entire JSON file. Supported values are GZIP, BZIP2, XZ, ZSTD, and UNCOMPRESSED. | UNCOMPRESSED  |
Example:
CREATE EXTERNAL TABLE t(a int)
STORED AS JSON
LOCATION '/tmp/foo/'
OPTIONS('COMPRESSION' 'gzip');
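The same option applies on write. A minimal sketch that copies the table t defined above to gzip-compressed JSON (the output path is assumed):

COPY t
TO '/tmp/foo_gzip/'
OPTIONS (
    format json,
    compression gzip
);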
CSV Format Options¶
The following options are available when reading or writing CSV files. Note: If any unsupported option is specified, an error will be raised and the query will fail.
| Option               | Description                                                                                                                        | Default Value |
| -------------------- | ---------------------------------------------------------------------------------------------------------------------------------- | ------------- |
| COMPRESSION          | Sets the compression that should be applied to the entire CSV file. Supported values are GZIP, BZIP2, XZ, ZSTD, and UNCOMPRESSED. | UNCOMPRESSED  |
| HAS_HEADER           | Sets if the CSV file should include column headers. If not set, uses the session or system default.                                | None          |
| DELIMITER            | Sets the character which should be used as the column delimiter within the CSV file.                                               | `,`           |
| QUOTE                | Sets the character which should be used for quoting values within the CSV file.                                                    | `"`           |
| TERMINATOR           | Sets the character which should be used as the line terminator within the CSV file.                                                | None          |
| ESCAPE               | Sets the character which should be used for escaping special characters within the CSV file.                                       | None          |
| DOUBLE_QUOTE         | Sets if quotes within quoted fields should be escaped by doubling them (e.g., `""` for a literal `"`).                             | None          |
| NEWLINES_IN_VALUES   | Sets if newlines in quoted values are supported. If not set, uses the session or system default.                                   | None          |
| DATE_FORMAT          | Sets the format that dates should be encoded in within the CSV file.                                                               | None          |
| DATETIME_FORMAT      | Sets the format that datetimes should be encoded in within the CSV file.                                                           | None          |
| TIMESTAMP_FORMAT     | Sets the format that timestamps should be encoded in within the CSV file.                                                          | None          |
| TIMESTAMP_TZ_FORMAT  | Sets the format that timestamps with timezone should be encoded in within the CSV file.                                            | None          |
| TIME_FORMAT          | Sets the format that times should be encoded in within the CSV file.                                                               | None          |
| NULL_VALUE           | Sets the string which should be used to indicate null values within the CSV file.                                                  | None          |
| NULL_REGEX           | Sets the regex pattern to match null values when loading CSVs.                                                                     | None          |
| SCHEMA_INFER_MAX_REC | Sets the maximum number of records to scan to infer the schema.                                                                    | None          |
| COMMENT              | Sets the character which should be used to indicate comment lines in the CSV file.                                                 | None          |
Example:
CREATE EXTERNAL TABLE t (col1 varchar, col2 int, col3 boolean)
STORED AS CSV
LOCATION '/tmp/foo/'
OPTIONS('DELIMITER' '|', 'HAS_HEADER' 'true', 'NEWLINES_IN_VALUES' 'true');
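These options can also be passed to a COPY when writing CSV. A minimal sketch (the source table sales and the output path are hypothetical):

COPY (SELECT * FROM sales)
TO '/tmp/sales_csv/'
OPTIONS (
    format csv,
    -- write pipe-delimited files and encode nulls as the string 'NULL'
    'delimiter' '|',
    'null_value' 'NULL'
);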
Parquet Format Options¶
The following options are available when reading or writing Parquet files. If any unsupported option is specified, an error will be raised and the query will fail. If a column-specific option is specified for a column that does not exist, the option will be ignored without error.
| Option | Can be Column Specific? | Description | OPTIONS Key | Default Value |
| --- | --- | --- | --- | --- |
| COMPRESSION | Yes | Sets the internal Parquet compression codec for data pages, optionally including the compression level. Applies globally if set without a `::column` path. | `compression` / `compression::column` | zstd(3) |
| ENCODING | Yes | Sets the encoding scheme for data pages. Valid values: `plain`, `plain_dictionary`, `rle`, `bit_packed`, `delta_binary_packed`, `delta_length_byte_array`, `delta_byte_array`, `rle_dictionary`, `byte_stream_split`. | `encoding` / `encoding::column` | None |
| DICTIONARY_ENABLED | Yes | Sets whether dictionary encoding should be enabled globally or for a specific column. | `dictionary_enabled` / `dictionary_enabled::column` | true |
| STATISTICS_ENABLED | Yes | Sets the level of statistics to write (`none`, `chunk`, or `page`). | `statistics_enabled` / `statistics_enabled::column` | page |
| BLOOM_FILTER_ENABLED | Yes | Sets whether a bloom filter should be written for a specific column. | `bloom_filter_enabled::column` | None |
| BLOOM_FILTER_FPP | Yes | Sets bloom filter false positive probability (global or per column). | `bloom_filter_fpp` / `bloom_filter_fpp::column` | None |
| BLOOM_FILTER_NDV | Yes | Sets bloom filter number of distinct values (global or per column). | `bloom_filter_ndv` / `bloom_filter_ndv::column` | None |
| MAX_ROW_GROUP_SIZE | No | Sets the maximum number of rows per row group. Larger groups require more memory but can improve compression and scan efficiency. | `max_row_group_size` | 1048576 |
| ENABLE_PAGE_INDEX | No | If true, reads the Parquet data page level metadata (the Page Index), if present, to reduce I/O and decoding. | `enable_page_index` | true |
| PRUNING | No | If true, enables row group pruning based on min/max statistics. | `pruning` | true |
| SKIP_METADATA | No | If true, skips optional embedded metadata in the file schema. | `skip_metadata` | true |
| METADATA_SIZE_HINT | No | Sets the size hint (in bytes) for fetching Parquet file metadata. | `metadata_size_hint` | None |
| PUSHDOWN_FILTERS | No | If true, enables filter pushdown during Parquet decoding. | `pushdown_filters` | false |
| REORDER_FILTERS | No | If true, enables heuristic reordering of filters during Parquet decoding. | `reorder_filters` | false |
| SCHEMA_FORCE_VIEW_TYPES | No | If true, reads Utf8/Binary columns as view types. | `schema_force_view_types` | true |
| BINARY_AS_STRING | No | If true, reads Binary columns as strings. | `binary_as_string` | false |
| DATA_PAGESIZE_LIMIT | No | Sets best effort maximum size of a data page, in bytes. | `data_pagesize_limit` | 1048576 |
| DATA_PAGE_ROW_COUNT_LIMIT | No | Sets best effort maximum number of rows in a data page. | `data_page_row_count_limit` | 20000 |
| DICTIONARY_PAGE_SIZE_LIMIT | No | Sets best effort maximum dictionary page size, in bytes. | `dictionary_page_size_limit` | 1048576 |
| WRITE_BATCH_SIZE | No | Sets write_batch_size in bytes. | `write_batch_size` | 1024 |
| WRITER_VERSION | No | Sets the Parquet writer version (`1.0` or `2.0`). | `writer_version` | 1.0 |
| SKIP_ARROW_METADATA | No | If true, skips writing Arrow schema information into the Parquet file metadata. | `skip_arrow_metadata` | false |
| CREATED_BY | No | Sets the “created by” string in the Parquet file metadata. | `created_by` | datafusion version X.Y.Z |
| COLUMN_INDEX_TRUNCATE_LENGTH | No | Sets the length (in bytes) to truncate min/max values in column indexes. | `column_index_truncate_length` | 64 |
| STATISTICS_TRUNCATE_LENGTH | No | Sets statistics truncate length. | `statistics_truncate_length` | None |
| BLOOM_FILTER_ON_WRITE | No | Sets whether bloom filters should be written for all columns by default (can be overridden per column). | `bloom_filter_on_write` | false |
| ALLOW_SINGLE_FILE_PARALLELISM | No | Enables parallel serialization of columns in a single file. | `allow_single_file_parallelism` | true |
| MAXIMUM_PARALLEL_ROW_GROUP_WRITERS | No | Maximum number of parallel row group writers. | `maximum_parallel_row_group_writers` | 1 |
| MAXIMUM_BUFFERED_RECORD_BATCHES_PER_STREAM | No | Maximum number of buffered record batches per stream. | `maximum_buffered_record_batches_per_stream` | 2 |
| KEY_VALUE_METADATA | No (key is specific) | Adds custom key-value pairs to the file metadata. Use the format `'metadata::key' 'value'`. | `metadata::key` | None |
Example:
CREATE EXTERNAL TABLE t (id bigint, value double, category varchar)
STORED AS PARQUET
LOCATION '/tmp/parquet_data/'
OPTIONS(
'COMPRESSION::category' 'snappy',
'ENCODING::id' 'delta_binary_packed',
'MAX_ROW_GROUP_SIZE' '1000000',
'BLOOM_FILTER_ENABLED::id' 'true'
);
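The same keys work in a COPY, including custom key-value metadata. A minimal sketch, reusing the table t from above with an assumed output path and an illustrative metadata key:

COPY t
TO '/tmp/parquet_out/'
OPTIONS (
    format parquet,
    -- global codec; the value is quoted because of the parentheses
    'compression' 'zstd(3)',
    -- per-column override for the 'value' column
    'compression::value' 'snappy',
    -- custom key-value pair written to the file footer metadata
    'metadata::written_by' 'datafusion-docs-example'
);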