Local Files / Directories¶
Files can be queried directly by enclosing the file name, directory name, or a remote location in single quotes, as shown in the examples below.
Create a CSV file to query.
$ echo "a,b" > data.csv
$ echo "1,2" >> data.csv
Query that single file (the CLI also supports parquet, compressed CSV, Avro, JSON, and more):
$ datafusion-cli
DataFusion CLI v17.0.0
> select * from 'data.csv';
+---+---+
| a | b |
+---+---+
| 1 | 2 |
+---+---+
1 row in set. Query took 0.007 seconds.
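The same pattern works for the other supported formats. As a minimal sketch, here is a newline-delimited JSON file queried the same way (the data.json file is created here purely for illustration):
$ echo '{"a":1,"b":2}' > data.json
$ datafusion-cli
> select * from 'data.json';
+---+---+
| a | b |
+---+---+
| 1 | 2 |
+---+---+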
You can also query directories of files with compatible schemas:
$ ls data_dir/
data.csv data2.csv
$ datafusion-cli
DataFusion CLI v17.0.0
> select * from 'data_dir';
+---+---+
| a | b |
+---+---+
| 3 | 4 |
| 1 | 2 |
+---+---+
2 rows in set. Query took 0.007 seconds.
Remote Files / Directories¶
You can also directly query any remote location supported by DataFusion without registering it as a table. For example, to read from a remote parquet file via HTTP(S) you can use the following:
> select count(*) from 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';
+----------+
| COUNT(*) |
+----------+
| 1000000 |
+----------+
1 row in set. Query took 0.595 seconds.
To read from AWS S3 or GCS, use s3:// or gs:// as the protocol prefix. For example, to read a file in an S3 bucket named my-data-bucket, use the URL s3://my-data-bucket and set the relevant access credentials as environment variables (e.g. for AWS S3 you can use AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY).
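As a minimal sketch, with placeholder credentials, bucket, and path (assuming the credentials grant read access to the file):
$ export AWS_ACCESS_KEY_ID=******
$ export AWS_SECRET_ACCESS_KEY=******
$ export AWS_DEFAULT_REGION=us-east-2
$ datafusion-cli
> select count(*) from 's3://my-data-bucket/path/data.parquet';
Public buckets that allow anonymous reads can also be queried directly: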
> select count(*) from 's3://altinity-clickhouse-data/nyc_taxi_rides/data/tripdata_parquet/';
+------------+
| count(*) |
+------------+
| 1310903963 |
+------------+
See the CREATE EXTERNAL TABLE section below for additional configuration options.
CREATE EXTERNAL TABLE¶
It is also possible to create a table backed by files or remote locations via CREATE EXTERNAL TABLE, as shown below. Note that DataFusion does not support wildcards (e.g. *) in file paths; instead, specify the directory path directly to read all compatible files in that directory.
For example, to create a table hits backed by a local parquet file named hits.parquet:
CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 'hits.parquet';
To create a table hits backed by a remote parquet file via HTTP(S):
CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';
In both cases, hits can now be queried as a regular table:
> select count(*) from hits;
+----------+
| COUNT(*) |
+----------+
| 1000000 |
+----------+
1 row in set. Query took 0.344 seconds.
Why Wildcards Are Not Supported
Although wildcards (e.g. *.parquet or **/*.parquet) may work for local filesystems in some cases, they are not supported by DataFusion CLI. This is because wildcards are not universally applicable across all storage backends (e.g. S3, GCS). Instead, DataFusion expects the user to specify the directory path, and it will automatically read all compatible files within that directory.
For example, the following usage is not supported:
CREATE EXTERNAL TABLE test (
message TEXT,
day DATE
)
STORED AS PARQUET
LOCATION 'gs://bucket/*.parquet';
Instead, you should use:
CREATE EXTERNAL TABLE test (
message TEXT,
day DATE
)
STORED AS PARQUET
LOCATION 'gs://bucket/my_table/';
Formats¶
Parquet¶
The schema information for parquet will be derived automatically.
Register a single file parquet datasource:
CREATE EXTERNAL TABLE taxi
STORED AS PARQUET
LOCATION '/mnt/nyctaxi/tripdata.parquet';
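Since the schema is derived automatically, you can inspect what was inferred. A minimal sketch (the taxi table comes from the example above; the output, which lists the inferred column names and types, is omitted here):
> describe taxi;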
Register a single folder parquet datasource. Note: all files inside must be valid parquet files and have compatible schemas:
Note - paths must end in a slash /
The path must end in /, otherwise DataFusion will treat the path as a file and not a directory.
CREATE EXTERNAL TABLE taxi
STORED AS PARQUET
LOCATION '/mnt/nyctaxi/';
Parquet Specific Options¶
You can specify additional options for parquet files using the OPTIONS clause.
For example, to read and write a parquet directory with encryption settings you could use:
CREATE EXTERNAL TABLE encrypted_parquet_table
(
double_field double,
float_field float
)
STORED AS PARQUET LOCATION 'pq/' OPTIONS (
-- encryption
'format.crypto.file_encryption.encrypt_footer' 'true',
'format.crypto.file_encryption.footer_key_as_hex' '30313233343536373839303132333435', -- b"0123456789012345"
'format.crypto.file_encryption.column_key_as_hex::double_field' '31323334353637383930313233343530', -- b"1234567890123450"
'format.crypto.file_encryption.column_key_as_hex::float_field' '31323334353637383930313233343531', -- b"1234567890123451"
-- decryption
'format.crypto.file_decryption.footer_key_as_hex' '30313233343536373839303132333435', -- b"0123456789012345"
'format.crypto.file_decryption.column_key_as_hex::double_field' '31323334353637383930313233343530', -- b"1234567890123450"
'format.crypto.file_decryption.column_key_as_hex::float_field' '31323334353637383930313233343531', -- b"1234567890123451"
);
Here the keys are specified in hexadecimal format because they are binary data. These can be encoded in SQL using:
select encode('0123456789012345', 'hex');
/*
+----------------------------------------------+
| encode(Utf8("0123456789012345"),Utf8("hex")) |
+----------------------------------------------+
| 30313233343536373839303132333435 |
+----------------------------------------------+
*/
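Other parquet settings can be passed the same way. As a minimal sketch, assuming the format.compression option key (one of the TableParquetOptions settings), you could set the write compression codec like this:
CREATE EXTERNAL TABLE taxi
STORED AS PARQUET
LOCATION '/mnt/nyctaxi/'
OPTIONS ('format.compression' 'zstd(3)');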
For more details on the available options, refer to the Rust TableParquetOptions documentation in DataFusion.
CSV¶
DataFusion will infer the CSV schema automatically or you can provide it explicitly.
Register a single file csv datasource with a header row:
CREATE EXTERNAL TABLE test
STORED AS CSV
LOCATION '/path/to/aggregate_test_100.csv'
OPTIONS ('has_header' 'true');
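Other CSV reader settings follow the same pattern. As a minimal sketch, assuming the format.delimiter option key and a hypothetical pipe-delimited file:
CREATE EXTERNAL TABLE test_piped
STORED AS CSV
LOCATION '/path/to/pipe_delimited.csv'
OPTIONS ('has_header' 'true', 'format.delimiter' '|');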
Register a single file csv datasource with explicitly defined schema:
CREATE EXTERNAL TABLE test (
c1 VARCHAR NOT NULL,
c2 INT NOT NULL,
c3 SMALLINT NOT NULL,
c4 SMALLINT NOT NULL,
c5 INT NOT NULL,
c6 BIGINT NOT NULL,
c7 SMALLINT NOT NULL,
c8 INT NOT NULL,
c9 BIGINT NOT NULL,
c10 VARCHAR NOT NULL,
c11 FLOAT NOT NULL,
c12 DOUBLE NOT NULL,
c13 VARCHAR NOT NULL
)
STORED AS CSV
LOCATION '/path/to/aggregate_test_100.csv';
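Compressed CSV files can also be registered. As a minimal sketch, assuming the format.compression option key and a gzip-compressed copy of the same file:
CREATE EXTERNAL TABLE test_gz
STORED AS CSV
LOCATION '/path/to/aggregate_test_100.csv.gz'
OPTIONS ('format.compression' 'gzip');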
Locations¶
HTTP(S)¶
To read from a remote parquet file via HTTP(S):
CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';
S3¶
DataFusion CLI supports configuring AWS S3 via the CREATE EXTERNAL TABLE statement and standard AWS configuration methods (via the aws-config AWS SDK crate).
To create an external table from a file in an S3 bucket with explicit credentials:
CREATE EXTERNAL TABLE test
STORED AS PARQUET
OPTIONS(
'aws.access_key_id' '******',
'aws.secret_access_key' '******',
'aws.region' 'us-east-2'
)
LOCATION 's3://bucket/path/file.parquet';
To create an external table using environment variables:
$ export AWS_DEFAULT_REGION=us-east-2
$ export AWS_SECRET_ACCESS_KEY=******
$ export AWS_ACCESS_KEY_ID=******
$ datafusion-cli
DataFusion CLI v21.0.0
> CREATE EXTERNAL TABLE test STORED AS PARQUET LOCATION 's3://bucket/path/file.parquet';
0 rows in set. Query took 0.374 seconds.
> select * from test;
+----------+----------+
| column_1 | column_2 |
+----------+----------+
| 1 | 2 |
+----------+----------+
1 row in set. Query took 0.171 seconds.
To read from a public S3 bucket without signing requests, use the aws.SKIP_SIGNATURE option:
CREATE EXTERNAL TABLE nyc_taxi_rides
STORED AS PARQUET LOCATION 's3://altinity-clickhouse-data/nyc_taxi_rides/data/tripdata_parquet/'
OPTIONS(aws.SKIP_SIGNATURE true);
Credentials are taken in this order of precedence:
1. Explicitly specified in the OPTIONS clause of the CREATE EXTERNAL TABLE statement.
2. Determined by the aws-config crate (standard environment variables such as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, as well as other AWS-specific features).
If no credentials are specified, DataFusion CLI will use unsigned requests to S3, which allows reading from public buckets.
Supported configuration options are:
| Environment Variable | Configuration Option | Description |
|---|---|---|
| AWS_ACCESS_KEY_ID | aws.access_key_id | |
| AWS_SECRET_ACCESS_KEY | aws.secret_access_key | |
| AWS_DEFAULT_REGION | aws.region | |
| AWS_ENDPOINT | aws.endpoint | |
| AWS_SESSION_TOKEN | aws.token | |
| AWS_CONTAINER_CREDENTIALS_RELATIVE_URI | | See IAM Roles |
| AWS_ALLOW_HTTP | aws.allow_http | If "true", permit HTTP connections without TLS |
| AWS_SKIP_SIGNATURE | aws.skip_signature | If "true", does not sign requests |
| | aws.nosign | Alias for aws.skip_signature |
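These options also let you point DataFusion CLI at S3-compatible services. As a minimal sketch, assuming a locally running MinIO server on port 9000 (the endpoint and credentials are placeholders), you can combine the aws.endpoint and aws.allow_http options from the table above:
CREATE EXTERNAL TABLE test
STORED AS PARQUET
OPTIONS(
    'aws.access_key_id' '******',
    'aws.secret_access_key' '******',
    'aws.endpoint' 'http://localhost:9000',
    'aws.allow_http' 'true'
)
LOCATION 's3://bucket/path/file.parquet';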
OSS¶
Alibaba Cloud OSS data sources must have connection credentials configured:
CREATE EXTERNAL TABLE test
STORED AS PARQUET
OPTIONS(
'aws.access_key_id' '******',
'aws.secret_access_key' '******',
'aws.oss.endpoint' 'https://bucket.oss-cn-hangzhou.aliyuncs.com'
)
LOCATION 'oss://bucket/path/file.parquet';
The supported OPTIONS are:
- access_key_id
- secret_access_key
- endpoint
Note that the endpoint format of OSS needs to be: https://{bucket}.{oss-region-endpoint}
COS¶
Tencent Cloud COS data sources must have connection credentials configured:
CREATE EXTERNAL TABLE test
STORED AS PARQUET
OPTIONS(
'aws.access_key_id' '******',
'aws.secret_access_key' '******',
'aws.cos.endpoint' 'https://cos.ap-singapore.myqcloud.com'
)
LOCATION 'cos://bucket/path/file.parquet';
The supported OPTIONS are:
- access_key_id
- secret_access_key
- endpoint
Note that the endpoint format of COS URLs must be: https://cos.{cos-region-endpoint}
GCS¶
Google Cloud Storage data sources must have connection credentials configured.
For example, to create an external table from a file in a GCS bucket:
CREATE EXTERNAL TABLE test
STORED AS PARQUET
OPTIONS(
    'gcp.service_account_path' '/tmp/gcs.json'
)
LOCATION 'gs://bucket/path/file.parquet';
It is also possible to specify the access information using environment variables:
$ export GOOGLE_SERVICE_ACCOUNT=/tmp/gcs.json
$ datafusion-cli
DataFusion CLI v21.0.0
> create external table test stored as parquet location 'gs://bucket/path/file.parquet';
0 rows in set. Query took 0.374 seconds.
> select * from test;
+----------+----------+
| column_1 | column_2 |
+----------+----------+
| 1 | 2 |
+----------+----------+
1 row in set. Query took 0.171 seconds.
Supported configuration options are:
| Environment Variable | Configuration Option | Description |
|---|---|---|
| GOOGLE_SERVICE_ACCOUNT | gcp.service_account_path | location of service account file |
| GOOGLE_SERVICE_ACCOUNT_PATH | gcp.service_account_path | (alias) location of service account file |
| SERVICE_ACCOUNT | gcp.service_account_path | (alias) location of service account file |
| GOOGLE_SERVICE_ACCOUNT_KEY | gcp.service_account_key | JSON serialized service account key |
| GOOGLE_APPLICATION_CREDENTIALS | gcp.application_credentials_path | location of application credentials file |
| GOOGLE_BUCKET | | bucket name |
| GOOGLE_BUCKET_NAME | | (alias) bucket name |