Write Options

DataFusion supports customizing how data is written out to disk as a result of a COPY or INSERT INTO query. There are a few special options, file format (e.g. CSV or parquet) specific options, and parquet column specific options. Options can also in some cases be specified in multiple ways with a set order of precedence.

Specifying Options and Order of Precedence

Write related options can be specified in the following ways:

  • Session level config defaults

  • CREATE EXTERNAL TABLE options

  • COPY option tuples

For a list of supported session level config defaults see Configuration Settings. These defaults apply to all write operations but have the lowest level of precedence.

If inserting to an external table, table specific write options can be specified when the table is created using the OPTIONS clause:

CREATE EXTERNAL TABLE
  my_table(a bigint, b bigint)
  STORED AS csv
  COMPRESSION TYPE gzip
  DELIMITER ';'
  LOCATION '/test/location/my_csv_table/'
  OPTIONS(
    NULL_VALUE 'NAN',
    'has_header' 'true'
  )

When running INSERT INTO my_table ..., the options from the CREATE TABLE will be respected (gzip compression, special delimiter, and header row included). There will be a single output file if the output path doesn’t have folder format, i.e. ending with a \. Note that compression, header, and delimiter settings can also be specified within the OPTIONS tuple list. Dedicated syntax within the SQL statement always takes precedence over arbitrary option tuples, so if both are specified the OPTIONS setting will be ignored. NULL_VALUE is a CSV format specific option that determines how null values should be encoded within the CSV file.

Finally, options can be passed when running a COPY command.

COPY source_table
  TO 'test/table_with_options'
  PARTITIONED BY (column3, column4)
  OPTIONS (
    format parquet,
    compression snappy,
    'compression::column1' 'zstd(5)',
  )

In this example, we write the entirety of source_table out to a folder of parquet files. One parquet file will be written in parallel to the folder for each partition in the query. The next option compression set to snappy indicates that unless otherwise specified all columns should use the snappy compression codec. The option compression::col1 sets an override, so that the column col1 in the parquet file will use ZSTD compression codec with compression level 5. In general, parquet options which support column specific settings can be specified with the syntax OPTION::COLUMN.NESTED.PATH.

Available Options

Execution Specific Options

The following options are available when executing a COPY query.

Option

Description

Default Value

execution.keep_partition_by_columns

Flag to retain the columns in the output data when using PARTITIONED BY queries.

false

Note: execution.keep_partition_by_columns flag can also be enabled through ExecutionOptions within SessionConfig.

JSON Format Specific Options

The following options are available when writing JSON files. Note: If any unsupported option is specified, an error will be raised and the query will fail.

Option

Description

Default Value

COMPRESSION

Sets the compression that should be applied to the entire JSON file. Supported values are GZIP, BZIP2, XZ, ZSTD, and UNCOMPRESSED.

UNCOMPRESSED

CSV Format Specific Options

The following options are available when writing CSV files. Note: if any unsupported options is specified an error will be raised and the query will fail.

Option

Description

Default Value

COMPRESSION

Sets the compression that should be applied to the entire CSV file. Supported values are GZIP, BZIP2, XZ, ZSTD, and UNCOMPRESSED.

UNCOMPRESSED

HEADER

Sets if the CSV file should include column headers

false

DATE_FORMAT

Sets the format that dates should be encoded in within the CSV file

arrow-rs default

DATETIME_FORMAT

Sets the format that datetimes should be encoded in within the CSV file

arrow-rs default

TIME_FORMAT

Sets the format that times should be encoded in within the CSV file

arrow-rs default

RFC3339

If true, uses RFC339 format for date and time encodings

arrow-rs default

NULL_VALUE

Sets the string which should be used to indicate null values within the CSV file.

arrow-rs default

DELIMITER

Sets the character which should be used as the column delimiter within the CSV file.

arrow-rs default

Parquet Format Specific Options

The following options are available when writing parquet files. If any unsupported option is specified an error will be raised and the query will fail. If a column specific option is specified for a column which does not exist, the option will be ignored without error. For default values, see: Configuration Settings.

Option

Can be Column Specific?

Description

COMPRESSION

Yes

Sets the compression codec and if applicable compression level to use

MAX_ROW_GROUP_SIZE

No

Sets the maximum number of rows that can be encoded in a single row group. Larger row groups require more memory to write and read.

DATA_PAGESIZE_LIMIT

No

Sets the best effort maximum page size in bytes

WRITE_BATCH_SIZE

No

Maximum number of rows written for each column in a single batch

WRITER_VERSION

No

Parquet writer version (1.0 or 2.0)

DICTIONARY_PAGE_SIZE_LIMIT

No

Sets best effort maximum dictionary page size in bytes

CREATED_BY

No

Sets the “created by” property in the parquet file

COLUMN_INDEX_TRUNCATE_LENGTH

No

Sets the max length of min/max value fields in the column index.

DATA_PAGE_ROW_COUNT_LIMIT

No

Sets best effort maximum number of rows in a data page.

BLOOM_FILTER_ENABLED

Yes

Sets whether a bloom filter should be written into the file.

ENCODING

Yes

Sets the encoding that should be used (e.g. PLAIN or RLE)

DICTIONARY_ENABLED

Yes

Sets if dictionary encoding is enabled. Use this instead of ENCODING to set dictionary encoding.

STATISTICS_ENABLED

Yes

Sets if statistics are enabled at PAGE or ROW_GROUP level.

MAX_STATISTICS_SIZE

Yes

Sets the maximum size in bytes that statistics can take up.

BLOOM_FILTER_FPP

Yes

Sets the false positive probability (fpp) for the bloom filter. Implicitly sets BLOOM_FILTER_ENABLED to true.

BLOOM_FILTER_NDV

Yes

Sets the number of distinct values (ndv) for the bloom filter. Implicitly sets bloom_filter_enabled to true.