DML
DML stands for “Data Manipulation Language” and relates to inserting and modifying data in tables.
COPY
Copies the contents of a table or query to file(s). Supported file formats are parquet, csv, json, and arrow.
COPY { table_name | query } TO 'file_name' [ STORED AS format ] [ PARTITIONED BY column_name [, ...] ] [ OPTIONS( option [, ... ] ) ]
STORED AS specifies the file format the COPY command will write. If this clause is not specified, the format is inferred from the file extension if possible.

PARTITIONED BY specifies the columns to use for partitioning the output files into separate hive-style directories. By default, columns used in PARTITIONED BY are removed from the output files. To keep those columns, provide the option execution.keep_partition_by_columns true. The execution.keep_partition_by_columns flag can also be enabled through ExecutionOptions within SessionConfig.
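A minimal sketch of retaining the partition columns in the written files, assuming the flag can be passed through the OPTIONS clause of COPY (dir_name and column1 are placeholders):

> COPY source_table TO 'dir_name' STORED AS PARQUET PARTITIONED BY (column1) OPTIONS (execution.keep_partition_by_columns true);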
The output format is determined by the first match of the following rules:
1. Value of STORED AS
2. Filename extension (e.g. foo.parquet implies PARQUET format)
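Since the filename extension is only consulted when STORED AS is absent, a target without a recognized extension needs the format spelled out. A minimal sketch (output_data is a hypothetical path; as with dir_name below, it may be treated as a directory of files):

> COPY source_table TO 'output_data' STORED AS CSV;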
For a detailed list of valid OPTIONS, see Write Options.
Examples
Copy the contents of source_table to file_name.json in JSON format:
> COPY source_table TO 'file_name.json';
+-------+
| count |
+-------+
| 2 |
+-------+
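The other supported formats can be written the same way, relying on extension inference; a minimal sketch with hypothetical output paths, assuming the .csv and .arrow extensions are recognized:

> COPY source_table TO 'file_name.csv';
> COPY source_table TO 'file_name.arrow';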
Copy the contents of source_table to one or more Parquet-formatted files in the dir_name directory:
> COPY source_table TO 'dir_name' STORED AS PARQUET;
+-------+
| count |
+-------+
| 2 |
+-------+
Copy the contents of source_table to multiple directories of hive-style partitioned parquet files:
> COPY source_table TO 'dir_name' STORED AS PARQUET PARTITIONED BY (column1, column2);
+-------+
| count |
+-------+
| 2 |
+-------+
If the data contains values of x and y in column1 and only a in column2, output files will appear in the following directory structure:
dir_name/
  column1=x/
    column2=a/
      <file>.parquet
      <file>.parquet
      ...
  column1=y/
    column2=a/
      <file>.parquet
      <file>.parquet
      ...
Run the query SELECT * from source ORDER BY time and write the results (maintaining the order) to a parquet file named output.parquet with a maximum row group size of 10,000,000 rows:
> COPY (SELECT * from source ORDER BY time) TO 'output.parquet' OPTIONS (MAX_ROW_GROUP_SIZE 10000000);
+-------+
| count |
+-------+
| 2 |
+-------+
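The OPTIONS clause accepts a comma-separated list, so several writer settings can be combined in one statement. A sketch, assuming COMPRESSION is among the parquet write options listed under Write Options:

> COPY (SELECT * from source ORDER BY time) TO 'output.parquet' OPTIONS (MAX_ROW_GROUP_SIZE 10000000, COMPRESSION 'zstd(3)');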
INSERT
Insert values into a table.
INSERT INTO table_name { VALUES ( expression [, ...] ) [, ...] | query }
Examples
> INSERT INTO target_table VALUES (1, 'Foo'), (2, 'Bar');
+-------+
| count |
+-------+
| 2 |
+-------+
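Per the grammar above, a query can take the place of the VALUES list. A minimal sketch, assuming source_table has a schema compatible with target_table:

> INSERT INTO target_table SELECT * FROM source_table;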