Class DataFrame
- All Implemented Interfaces:
AutoCloseable
DataFrame. Created
by SessionContext.sql(String) or other planning entry points and executed by either
collect(org.apache.arrow.memory.BufferAllocator) (materializes every batch on the native heap before returning) or executeStream(org.apache.arrow.memory.BufferAllocator) (yields one batch at a time as Java drains the reader).
Instances are not thread-safe and must be closed. Both collect(org.apache.arrow.memory.BufferAllocator) and
executeStream(org.apache.arrow.memory.BufferAllocator) consume the DataFrame: a successfully consumed DataFrame cannot be
consumed again by either method (or by other executors such as count()), and close() on an already-consumed instance is a no-op.
-
Method Summary
Modifier and TypeMethodDescriptioncache()Materialise this DataFrame into an in-memory table and return a new DataFrame that scans it.voidclose()org.apache.arrow.vector.ipc.ArrowReadercollect(org.apache.arrow.memory.BufferAllocator allocator) Execute the plan and return its record batches as anArrowReader.longcount()Execute the plan and return the number of rows.describe()Compute summary statistics (count, null_count, mean, std, min, max, median) over this DataFrame's columns and return them as a new DataFrame.distinct()Deduplicate rows across all columns.dropColumns(String... columnNames) Drop the named columns.org.apache.arrow.vector.ipc.ArrowReaderexecuteStream(org.apache.arrow.memory.BufferAllocator allocator) Execute the plan and return its record batches as a streamingArrowReader.explain(boolean verbose, boolean analyze) Return a new DataFrame whose rows describe the plan that would execute this DataFrame.Apply a SQL predicate to produce a filtered DataFrame.Equi-join this DataFrame withrighton the named columns, using the givenJoinType.Equi-join this DataFrame withright, restricting the result with a residual SQL filter parsed against the combined schema (left columns followed by right columns; columns may be qualified with the relation alias when ambiguous).Join this DataFrame withrightusing arbitrary SQL predicates parsed against the combined schema.limit(int fetch) Take the firstfetchrows.limit(int skip, int fetch) Skipskiprows, then take the nextfetchrows.org.apache.arrow.vector.types.pojo.Schemaschema()Return the ArrowSchemaof this DataFrame's output.Project the listed columns into a new DataFrame.voidshow()Execute the plan and print formatted batches to native stdout.voidshow(int limit) Execute the plan and print the firstlimitrows to native stdout.unnestColumns(String... columns) Expand list or struct columns into rows or fields, with defaultUnnestOptions(i.e.unnestColumns(UnnestOptions options, String... columns) Expand list or struct columns into rows or fields with the suppliedUnnestOptions.withColumn(String name, String expr) Add a column to this DataFrame computed from a SQL expression.withColumnRenamed(String oldName, String newName) Rename a column.voidMaterialize this DataFrame as CSV atpath.voidwriteCsv(String path, CsvWriteOptions options) Materialize this DataFrame as CSV atpathwith the suppliedCsvWriteOptions.voidMaterialize this DataFrame as newline-delimited JSON atpath.voidwriteJson(String path, JsonWriteOptions options) Materialize this DataFrame as newline-delimited JSON atpathwith the suppliedJsonWriteOptions.voidwriteParquet(String path) Materialize this DataFrame as Parquet atpath.voidwriteParquet(String path, ParquetWriteOptions options) Materialize this DataFrame as Parquet atpathwith the suppliedParquetWriteOptions.
-
Method Details
-
collect
public org.apache.arrow.vector.ipc.ArrowReader collect(org.apache.arrow.memory.BufferAllocator allocator) Execute the plan and return its record batches as anArrowReader.Consumes this DataFrame: the native plan is released as soon as the stream is established. The caller is responsible for closing the returned reader, and the supplied allocator must outlive it.
This method materializes every batch on the native heap before the first batch crosses the FFI boundary, which can OOM the Rust side for unbounded or very large result sets. Prefer
executeStream(BufferAllocator)for analytics-scale queries. -
executeStream
public org.apache.arrow.vector.ipc.ArrowReader executeStream(org.apache.arrow.memory.BufferAllocator allocator) Execute the plan and return its record batches as a streamingArrowReader. Each call toArrowReader.loadNextBatch()drives one asyncstream.next()on the native side, so memory pressure stays bounded by the executor pipeline plus one in-flight batch instead of the full result set.Consumes this DataFrame with the same lifecycle rules as
collect(BufferAllocator): the native plan is released as soon as the stream is established, the caller closes the returned reader, and the supplied allocator must outlive it.For result sets that fit comfortably in native memory and are read in their entirety,
collect(BufferAllocator)remains a reasonable choice. For TB-scale or unbounded result sets, use this method. -
schema
public org.apache.arrow.vector.types.pojo.Schema schema()Return the ArrowSchemaof this DataFrame's output. Non-consuming: the receiver remains usable and must still be closed independently. Schema inspection does not execute the plan.The schema is transferred via Arrow IPC; no
BufferAllocatoris required because a schema carries no buffer data. -
explain
Return a new DataFrame whose rows describe the plan that would execute this DataFrame. Non-consuming: the receiver remains usable and must still be closed independently.With
verbose=falseandanalyze=false(the cheap, lazy variant), the result contains the logical plan only.verbose=trueadds optimised-plan and physical-plan rows;analyze=trueruns the plan and attaches per-operator metrics. Render viashow()orcollect(BufferAllocator). -
cache
Materialise this DataFrame into an in-memory table and return a new DataFrame that scans it. Non-consuming: the receiver remains usable and must still be closed independently.Executes the plan eagerly: the entire result set is held in native memory until the returned DataFrame is closed. Suitable for intermediate results that will be reused across multiple downstream queries.
- Throws:
RuntimeException- if execution fails.
-
describe
Compute summary statistics (count, null_count, mean, std, min, max, median) over this DataFrame's columns and return them as a new DataFrame. Non-consuming: the receiver remains usable and must still be closed independently.Executes the plan: DataFusion runs seven aggregate sub-plans against this DataFrame to build the summary table. Numeric columns receive every statistic; non-numeric columns receive
count/null_count/min/maxwhere applicable.- Throws:
RuntimeException- if execution fails.
-
count
public long count()Execute the plan and return the number of rows. -
show
public void show()Execute the plan and print formatted batches to native stdout. -
show
public void show(int limit) Execute the plan and print the firstlimitrows to native stdout. -
select
Project the listed columns into a new DataFrame. The receiver remains usable and must still be closed independently. -
filter
Apply a SQL predicate to produce a filtered DataFrame. The predicate is parsed against this DataFrame's own schema. The receiver remains usable and must still be closed independently. -
limit
Take the firstfetchrows. Equivalent tolimit(int, int)withskip = 0. The receiver remains usable and must still be closed independently. -
limit
Skipskiprows, then take the nextfetchrows. Both arguments must be non-negative. The receiver remains usable and must still be closed independently. -
distinct
Deduplicate rows across all columns. The receiver remains usable and must still be closed independently. -
dropColumns
Drop the named columns. The inverse ofselect(String...). The receiver remains usable and must still be closed independently. -
withColumnRenamed
Rename a column. The receiver remains usable and must still be closed independently. -
withColumn
Add a column to this DataFrame computed from a SQL expression. If a column with the given name already exists, it is replaced in place; otherwise the new column is appended. The expression is parsed against this DataFrame's own schema, matching the convention used byfilter(String). The receiver remains usable and must still be closed independently.- Throws:
IllegalArgumentException- ifnameorexprisnull.
-
unnestColumns
Expand list or struct columns into rows or fields, with defaultUnnestOptions(i.e.preserveNulls = true). The receiver remains usable and must still be closed independently. -
unnestColumns
Expand list or struct columns into rows or fields with the suppliedUnnestOptions. The receiver remains usable and must still be closed independently.- Throws:
IllegalArgumentException- ifoptionsorcolumnsisnull.
-
join
Equi-join this DataFrame withrighton the named columns, using the givenJoinType. The receiver andrightboth remain usable and must still be closed independently.Equivalent to SQL
left <type> JOIN right ON l.leftCols[0] = r.rightCols[0] AND ....leftColsandrightColsmust have the same length.- Throws:
IllegalArgumentException- if any argument isnullorleftCols.length != rightCols.length.IllegalStateException- if either DataFrame is closed or already collected.RuntimeException- if join planning fails (column collision in the combined schema, unknown column names, etc.).
-
join
public DataFrame join(DataFrame right, JoinType type, String[] leftCols, String[] rightCols, String filter) Equi-join this DataFrame withright, restricting the result with a residual SQL filter parsed against the combined schema (left columns followed by right columns; columns may be qualified with the relation alias when ambiguous). The receiver andrightboth remain usable and must still be closed independently.For outer joins, the filter is applied only to matched rows; unmatched rows are passed through with nulls on the unmatched side, matching DataFusion's semantics.
- Throws:
IllegalArgumentException- if any argument isnullorleftCols.length != rightCols.length.IllegalStateException- if either DataFrame is closed or already collected.RuntimeException- if join planning or filter parsing fails.
-
joinOn
Join this DataFrame withrightusing arbitrary SQL predicates parsed against the combined schema. Each predicate is parsed independently and the join evaluates their conjunction. Predicates may reference columns from either side and may be qualified with the relation alias when ambiguous (e.g."left.x = right.x"). The receiver andrightboth remain usable and must still be closed independently.DataFusion's optimiser identifies and rewrites equality predicates into hash-join keys automatically, so
joinOn(right, INNER, "l.id = r.id")plans equivalently tojoin(DataFrame, JoinType, String[], String[])with a single key. UsejoinOnwhen the predicate is not a simple equality, e.g. inequality joins or range conditions.- Throws:
IllegalArgumentException- ifrightortypeisnull, orpredicatesisnullor empty, or any predicate isnull.IllegalStateException- if either DataFrame is closed or already collected.RuntimeException- if predicate parsing or join planning fails.
-
writeParquet
Materialize this DataFrame as Parquet atpath. The path is treated as a directory unless overridden viaParquetWriteOptions.singleFileOutput(boolean). The receiver remains usable and must still be closed independently.- Throws:
RuntimeException- if the write fails.
-
writeParquet
Materialize this DataFrame as Parquet atpathwith the suppliedParquetWriteOptions. The receiver remains usable and must still be closed independently.- Throws:
RuntimeException- if the write fails (path inaccessible, invalid compression spec, etc.).
-
writeCsv
Materialize this DataFrame as CSV atpath. The path is treated as a directory unless overridden viaCsvWriteOptions.singleFileOutput(boolean). The receiver remains usable and must still be closed independently.- Throws:
RuntimeException- if the write fails.
-
writeCsv
Materialize this DataFrame as CSV atpathwith the suppliedCsvWriteOptions. The receiver remains usable and must still be closed independently.- Throws:
IllegalArgumentException- ifpathoroptionsisnull.RuntimeException- if the write fails (path inaccessible, invalid compression spec, etc.).
-
writeJson
Materialize this DataFrame as newline-delimited JSON atpath. The path is treated as a directory unless overridden viaJsonWriteOptions.singleFileOutput(boolean). The receiver remains usable and must still be closed independently.- Throws:
RuntimeException- if the write fails.
-
writeJson
Materialize this DataFrame as newline-delimited JSON atpathwith the suppliedJsonWriteOptions. The receiver remains usable and must still be closed independently.- Throws:
IllegalArgumentException- ifpathoroptionsisnull.RuntimeException- if the write fails (path inaccessible, invalid compression spec, etc.).
-
close
public void close()- Specified by:
closein interfaceAutoCloseable
-