org.apache.datafusion.DataFrame

All Implemented Interfaces: AutoCloseable

public final class DataFrame extends Object implements AutoCloseable

A lazy representation of a query plan, mirroring the Rust DataFusion DataFrame. Created by SessionContext.sql(String) or other planning entry points and executed by either collect(org.apache.arrow.memory.BufferAllocator) (materializes every batch on the native heap before returning) or executeStream(org.apache.arrow.memory.BufferAllocator) (yields one batch at a time as Java drains the reader).

Instances are not thread-safe and must be closed. Both collect(org.apache.arrow.memory.BufferAllocator) and executeStream(org.apache.arrow.memory.BufferAllocator) consume the DataFrame: a successfully consumed DataFrame cannot be consumed again by either method (or by other executors such as count()), and close() on an already-consumed instance is a no-op.

Method Summary

Modifier and Type

Method

Description

DataFrame

cache()

Materialise this DataFrame into an in-memory table and return a new DataFrame that scans it.

void

close()

org.apache.arrow.vector.ipc.ArrowReader

collect(org.apache.arrow.memory.BufferAllocator allocator)

Execute the plan and return its record batches as an ArrowReader.

long

count()

Execute the plan and return the number of rows.

DataFrame

describe()

Compute summary statistics (count, null_count, mean, std, min, max, median) over this DataFrame's columns and return them as a new DataFrame.

DataFrame

distinct()

Deduplicate rows across all columns.

DataFrame

dropColumns(String... columnNames)

Drop the named columns.

DataFrame

except(DataFrame other)

Rows present in this DataFrame but not in other, keeping duplicates from the receiver (SQL EXCEPT ALL).

DataFrame

exceptDistinct(DataFrame other)

Rows present in this DataFrame but not in other, deduplicated (SQL EXCEPT).

org.apache.arrow.vector.ipc.ArrowReader

executeStream(org.apache.arrow.memory.BufferAllocator allocator)

Execute the plan and return its record batches as a streaming ArrowReader.

DataFrame

explain(boolean verbose, boolean analyze)

Return a new DataFrame whose rows describe the plan that would execute this DataFrame.

DataFrame

filter(String predicate)

Apply a SQL predicate to produce a filtered DataFrame.

DataFrame

intersect(DataFrame other)

Rows present in both this DataFrame and other, keeping duplicates from the receiver (SQL INTERSECT ALL).

DataFrame

intersectDistinct(DataFrame other)

Rows present in both this DataFrame and other, deduplicated (SQL INTERSECT).

DataFrame

join(DataFrame right, JoinType type, String[] leftCols, String[] rightCols)

Equi-join this DataFrame with right on the named columns, using the given JoinType.

DataFrame

join(DataFrame right, JoinType type, String[] leftCols, String[] rightCols, String filter)

Equi-join this DataFrame with right, restricting the result with a residual SQL filter parsed against the combined schema (left columns followed by right columns; columns may be qualified with the relation alias when ambiguous).

DataFrame

joinOn(DataFrame right, JoinType type, String... predicates)

Join this DataFrame with right using arbitrary SQL predicates parsed against the combined schema.

DataFrame

limit(int fetch)

Take the first fetch rows.

DataFrame

limit(int skip, int fetch)

Skip skip rows, then take the next fetch rows.

DataFrame

repartitionHash(int numPartitions, String... columns)

Repartition this DataFrame by hashing the named columns into numPartitions output partitions. v1 supports column-name keys only; expression keys are deferred until the Java binding gains an Expr builder.

DataFrame

repartitionRoundRobin(int numPartitions)

Repartition this DataFrame using a round-robin scheme across numPartitions output partitions.

org.apache.arrow.vector.types.pojo.Schema

schema()

Return the Arrow Schema of this DataFrame's output.

DataFrame

select(String... columnNames)

Project the listed columns into a new DataFrame.

void

show()

Execute the plan and print formatted batches to native stdout.

void

show(int limit)

Execute the plan and print the first limit rows to native stdout.

DataFrame

sort(SortExpr... exprs)

Order the rows by the supplied sort keys.

DataFrame

union(DataFrame other)

Concatenate this DataFrame with other by column position, keeping all duplicates (SQL UNION ALL).

DataFrame

unionByName(DataFrame other)

Concatenate this DataFrame with other by column name, keeping all duplicates.

DataFrame

unionByNameDistinct(DataFrame other)

Concatenate this DataFrame with other by column name, removing duplicates.

DataFrame

unionDistinct(DataFrame other)

Concatenate this DataFrame with other by column position, removing duplicates (SQL UNION DISTINCT -- equivalent to plain UNION in standard SQL).

DataFrame

unnestColumns(String... columns)

Expand list or struct columns into rows or fields, with default UnnestOptions (i.e. preserveNulls = true).

DataFrame

unnestColumns(UnnestOptions options, String... columns)

Expand list or struct columns into rows or fields with the supplied UnnestOptions.

DataFrame

withColumn(String name, String expr)

Add a column to this DataFrame computed from a SQL expression.

DataFrame

withColumnRenamed(String oldName, String newName)

Rename a column.

void

writeCsv(String path)

Materialize this DataFrame as CSV at path.

void

writeCsv(String path, CsvWriteOptions options)

Materialize this DataFrame as CSV at path with the supplied CsvWriteOptions.

void

writeJson(String path)

Materialize this DataFrame as newline-delimited JSON at path.

void

writeJson(String path, JsonWriteOptions options)

Materialize this DataFrame as newline-delimited JSON at path with the supplied JsonWriteOptions.

void

writeParquet(String path)

Materialize this DataFrame as Parquet at path.

void

writeParquet(String path, ParquetWriteOptions options)

Materialize this DataFrame as Parquet at path with the supplied ParquetWriteOptions.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Method Details
- collect
  
  public org.apache.arrow.vector.ipc.ArrowReader collect(org.apache.arrow.memory.BufferAllocator allocator)
  
  Execute the plan and return its record batches as an ArrowReader.
  Consumes this DataFrame: the native plan is released as soon as the stream is established. The caller is responsible for closing the returned reader, and the supplied allocator must outlive it.
  This method materializes every batch on the native heap before the first batch crosses the FFI boundary, which can OOM the Rust side for unbounded or very large result sets. Prefer executeStream(BufferAllocator) for analytics-scale queries.
- executeStream
  
  public org.apache.arrow.vector.ipc.ArrowReader executeStream(org.apache.arrow.memory.BufferAllocator allocator)
  
  Execute the plan and return its record batches as a streaming ArrowReader. Each call to ArrowReader.loadNextBatch() drives one async stream.next() on the native side, so memory pressure stays bounded by the executor pipeline plus one in-flight batch instead of the full result set.
  Consumes this DataFrame with the same lifecycle rules as collect(BufferAllocator): the native plan is released as soon as the stream is established, the caller closes the returned reader, and the supplied allocator must outlive it.
  For result sets that fit comfortably in native memory and are read in their entirety, collect(BufferAllocator) remains a reasonable choice. For TB-scale or unbounded result sets, use this method.
- schema
  
  public org.apache.arrow.vector.types.pojo.Schema schema()
  
  Return the Arrow Schema of this DataFrame's output. Non-consuming: the receiver remains usable and must still be closed independently. Schema inspection does not execute the plan.
  The schema is transferred via Arrow IPC; no BufferAllocator is required because a schema carries no buffer data.
- explain
  
  public DataFrame explain(boolean verbose, boolean analyze)
  
  Return a new DataFrame whose rows describe the plan that would execute this DataFrame. Non-consuming: the receiver remains usable and must still be closed independently.
  With verbose=false and analyze=false (the cheap, lazy variant), the result contains the logical plan only. verbose=true adds optimised-plan and physical-plan rows; analyze=true runs the plan and attaches per-operator metrics. Render via show() or collect(BufferAllocator).
- cache
  
  public DataFrame cache()
  
  Materialise this DataFrame into an in-memory table and return a new DataFrame that scans it. Non-consuming: the receiver remains usable and must still be closed independently.
  Executes the plan eagerly: the entire result set is held in native memory until the returned DataFrame is closed. Suitable for intermediate results that will be reused across multiple downstream queries.
  
  Throws:
  
  RuntimeException - if execution fails.
- describe
  
  public DataFrame describe()
  
  Compute summary statistics (count, null_count, mean, std, min, max, median) over this DataFrame's columns and return them as a new DataFrame. Non-consuming: the receiver remains usable and must still be closed independently.
  Executes the plan: DataFusion runs seven aggregate sub-plans against this DataFrame to build the summary table. Numeric columns receive every statistic; non-numeric columns receive count / null_count / min / max where applicable.
  
  Throws:
  
  RuntimeException - if execution fails.
- count
  
  public long count()
  
  Execute the plan and return the number of rows.
- show
  
  public void show()
  
  Execute the plan and print formatted batches to native stdout.
- show
  
  public void show(int limit)
  
  Execute the plan and print the first limit rows to native stdout.
- select
  
  public DataFrame select(String... columnNames)
  
  Project the listed columns into a new DataFrame. The receiver remains usable and must still be closed independently.
- filter
  
  public DataFrame filter(String predicate)
  
  Apply a SQL predicate to produce a filtered DataFrame. The predicate is parsed against this DataFrame's own schema. The receiver remains usable and must still be closed independently.
- limit
  
  public DataFrame limit(int fetch)
  
  Take the first fetch rows. Equivalent to limit(int, int) with skip = 0. The receiver remains usable and must still be closed independently.
- limit
  
  public DataFrame limit(int skip, int fetch)
  
  Skip skip rows, then take the next fetch rows. Both arguments must be non-negative. The receiver remains usable and must still be closed independently.
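
  The skip/fetch behaviour described above can be modelled on an in-memory list. This is a minimal sketch of the semantics only, not the binding's implementation; the class LimitSketch and the choice of IllegalArgumentException for negative arguments are assumptions made for illustration.

  ```java
  import java.util.List;
  import java.util.stream.Collectors;

  class LimitSketch {
      // Models limit(skip, fetch): drop the first `skip` rows, then keep at most `fetch` rows.
      static <T> List<T> limit(List<T> rows, int skip, int fetch) {
          if (skip < 0 || fetch < 0) {
              // Exception type is an assumption; the source only says both must be non-negative.
              throw new IllegalArgumentException("skip and fetch must be non-negative");
          }
          return rows.stream().skip(skip).limit(fetch).collect(Collectors.toList());
      }

      public static void main(String[] args) {
          List<Integer> rows = List.of(10, 20, 30, 40, 50);
          System.out.println(limit(rows, 1, 2));  // [20, 30]
          System.out.println(limit(rows, 0, 3));  // [10, 20, 30], the limit(fetch) case
          System.out.println(limit(rows, 4, 10)); // [50], fewer rows than fetch remain
      }
  }
  ```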
- distinct
  
  public DataFrame distinct()
  
  Deduplicate rows across all columns. The receiver remains usable and must still be closed independently.
- dropColumns
  
  public DataFrame dropColumns(String... columnNames)
  
  Drop the named columns. The inverse of select(String...). The receiver remains usable and must still be closed independently.
- withColumnRenamed
  
  public DataFrame withColumnRenamed(String oldName, String newName)
  
  Rename a column. The receiver remains usable and must still be closed independently.
- withColumn
  
  public DataFrame withColumn(String name, String expr)
  
  Add a column to this DataFrame computed from a SQL expression. If a column with the given name already exists, it is replaced in place; otherwise the new column is appended. The expression is parsed against this DataFrame's own schema, matching the convention used by filter(String). The receiver remains usable and must still be closed independently.
  
  Throws:
  
  IllegalArgumentException - if name or expr is null.
- unnestColumns
  
  public DataFrame unnestColumns(String... columns)
  
  Expand list or struct columns into rows or fields, with default UnnestOptions (i.e. preserveNulls = true). The receiver remains usable and must still be closed independently.
- unnestColumns
  
  public DataFrame unnestColumns(UnnestOptions options, String... columns)
  
  Expand list or struct columns into rows or fields with the supplied UnnestOptions. The receiver remains usable and must still be closed independently.
  
  Throws:
  
  IllegalArgumentException - if options or columns is null.
- union
  
  public DataFrame union(DataFrame other)
  
  Concatenate this DataFrame with other by column position, keeping all duplicates (SQL UNION ALL). The two schemas must match positionally. Both this DataFrame and other remain usable after the call and must still be closed independently.
  
  Throws:
  
  IllegalArgumentException - if other is null.
  
  RuntimeException - if the schemas are incompatible.
- unionDistinct
  
  public DataFrame unionDistinct(DataFrame other)
  
  Concatenate this DataFrame with other by column position, removing duplicates (SQL UNION DISTINCT -- equivalent to plain UNION in standard SQL). Both DataFrames remain usable.
  
  Throws:
  
  IllegalArgumentException - if other is null.
  
  RuntimeException - if the schemas are incompatible.
- unionByName
  
  public DataFrame unionByName(DataFrame other)
  
  Concatenate this DataFrame with other by column name, keeping all duplicates. Columns present in only one side are filled with NULL on the other. Both DataFrames remain usable.
  
  Throws:
  
  IllegalArgumentException - if other is null.
  
  RuntimeException - if column types disagree on a shared name.
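
  The NULL-filling rule for columns present on only one side can be illustrated with plain maps. This is a hedged model of the described behaviour, not the binding's code: rows are represented as column-name to value maps, and the output column order (left columns first, then right-only columns) is an assumption of the sketch, as are the names UnionByNameSketch and row.

  ```java
  import java.util.ArrayList;
  import java.util.LinkedHashMap;
  import java.util.LinkedHashSet;
  import java.util.List;
  import java.util.Map;
  import java.util.Set;

  class UnionByNameSketch {
      // Helper (hypothetical): build an ordered row from name/value pairs.
      static Map<String, Object> row(Object... kv) {
          Map<String, Object> m = new LinkedHashMap<>();
          for (int i = 0; i < kv.length; i += 2) m.put((String) kv[i], kv[i + 1]);
          return m;
      }

      // Align both sides on the union of column names; absent columns become null.
      static List<Map<String, Object>> unionByName(List<Map<String, Object>> left,
                                                   List<Map<String, Object>> right) {
          Set<String> columns = new LinkedHashSet<>();
          for (Map<String, Object> r : left) columns.addAll(r.keySet());
          for (Map<String, Object> r : right) columns.addAll(r.keySet());

          List<Map<String, Object>> out = new ArrayList<>();
          for (List<Map<String, Object>> side : List.of(left, right)) {
              for (Map<String, Object> r : side) {
                  Map<String, Object> aligned = new LinkedHashMap<>();
                  for (String c : columns) aligned.put(c, r.get(c)); // null when the column is absent
                  out.add(aligned);
              }
          }
          return out;
      }

      public static void main(String[] args) {
          List<Map<String, Object>> left = List.of(row("id", 1, "name", "a"));
          List<Map<String, Object>> right = List.of(row("id", 2, "score", 9.5));
          // [{id=1, name=a, score=null}, {id=2, name=null, score=9.5}]
          System.out.println(unionByName(left, right));
      }
  }
  ```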
- unionByNameDistinct
  
  public DataFrame unionByNameDistinct(DataFrame other)
  
  Concatenate this DataFrame with other by column name, removing duplicates. Columns present in only one side are filled with NULL on the other. Both DataFrames remain usable.
  
  Throws:
  
  IllegalArgumentException - if other is null.
  
  RuntimeException - if column types disagree on a shared name.
- intersect
  
  public DataFrame intersect(DataFrame other)
  
  Rows present in both this DataFrame and other, keeping duplicates from the receiver (SQL INTERSECT ALL). Both schemas must match positionally. Both DataFrames remain usable.
  Implementation note: DataFusion implements INTERSECT ALL as a left-semi join on equality, not as standard SQL bag intersection. A left row is kept iff any matching row exists in other; the number of matching rows is irrelevant. With left = (1, 2, 2, 3) and right = (2, 3), the result is (2, 2, 3): both copies of 2 survive because each finds a match in right. PostgreSQL / Spark INTERSECT ALL would yield (2, 3) here, because bag intersection keeps a row min(m, n) times and right holds only one copy of 2. The results diverge whenever other has fewer copies than this of a row that appears in both.
  
  Throws:
  
  IllegalArgumentException - if other is null.
  
  RuntimeException - if the schemas are incompatible.
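
  The difference between a left-semi join and standard bag intersection can be checked on small lists. The following is a sketch in plain Java that models both rules on single-column integer rows; IntersectAllSketch and its method names are hypothetical and do not call DataFusion.

  ```java
  import java.util.ArrayList;
  import java.util.List;

  class IntersectAllSketch {
      // Left-semi join: keep each left row if at least one equal row exists on the right.
      static List<Integer> semiJoin(List<Integer> left, List<Integer> right) {
          List<Integer> out = new ArrayList<>();
          for (Integer row : left) {
              if (right.contains(row)) out.add(row);
          }
          return out;
      }

      // Bag intersection: each right row can pair with at most one left row.
      static List<Integer> bagIntersect(List<Integer> left, List<Integer> right) {
          List<Integer> remaining = new ArrayList<>(right);
          List<Integer> out = new ArrayList<>();
          for (Integer row : left) {
              if (remaining.remove(row)) out.add(row); // remove(Object): consumes one right copy
          }
          return out;
      }

      public static void main(String[] args) {
          List<Integer> left = List.of(1, 2, 2, 3);
          List<Integer> right = List.of(2, 3);
          System.out.println(semiJoin(left, right));     // [2, 2, 3]
          System.out.println(bagIntersect(left, right)); // [2, 3]
      }
  }
  ```

  If bag semantics are required, one option is to deduplicate with intersectDistinct(DataFrame) or to express the multiplicity logic explicitly in SQL.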
- intersectDistinct
  
  public DataFrame intersectDistinct(DataFrame other)
  
  Rows present in both this DataFrame and other, deduplicated (SQL INTERSECT). Both schemas must match positionally. Both DataFrames remain usable.
  
  Throws:
  
  IllegalArgumentException - if other is null.
  
  RuntimeException - if the schemas are incompatible.
- except
  
  public DataFrame except(DataFrame other)
  
  Rows present in this DataFrame but not in other, keeping duplicates from the receiver (SQL EXCEPT ALL). Both schemas must match positionally. Both DataFrames remain usable.
  Implementation note: DataFusion implements EXCEPT ALL as a left-anti join on equality, not as standard SQL bag difference. A left row is kept iff no matching row exists in other; the multiplicity of matches is irrelevant. With left = (1, 1, 2, 2, 3) and right = (1, 3), the result is (2, 2): both copies of 2 survive (no match in right); both copies of 1 and the 3 drop. PostgreSQL / Spark EXCEPT ALL would yield (1, 2, 2) here, because bag difference removes only one copy of 1 for the single matching row in right. The results diverge whenever right contains fewer copies than left of a row that appears in both.
  
  Throws:
  
  IllegalArgumentException - if other is null.
  
  RuntimeException - if the schemas are incompatible.
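
  The same comparison for EXCEPT ALL, again as a plain-Java model of the two rules rather than a call into DataFusion. ExceptAllSketch and its method names are hypothetical.

  ```java
  import java.util.ArrayList;
  import java.util.List;

  class ExceptAllSketch {
      // Left-anti join: keep each left row only if no equal row exists on the right.
      static List<Integer> antiJoin(List<Integer> left, List<Integer> right) {
          List<Integer> out = new ArrayList<>();
          for (Integer row : left) {
              if (!right.contains(row)) out.add(row);
          }
          return out;
      }

      // Bag difference: each right row cancels at most one equal left row.
      static List<Integer> bagExcept(List<Integer> left, List<Integer> right) {
          List<Integer> remaining = new ArrayList<>(right);
          List<Integer> out = new ArrayList<>();
          for (Integer row : left) {
              if (!remaining.remove(row)) out.add(row); // remove(Object): cancels one copy if present
          }
          return out;
      }

      public static void main(String[] args) {
          List<Integer> left = List.of(1, 1, 2, 2, 3);
          List<Integer> right = List.of(1, 3);
          System.out.println(antiJoin(left, right));  // [2, 2]
          System.out.println(bagExcept(left, right)); // [1, 2, 2]
      }
  }
  ```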
- exceptDistinct
  
  public DataFrame exceptDistinct(DataFrame other)
  
  Rows present in this DataFrame but not in other, deduplicated (SQL EXCEPT). Both schemas must match positionally. Both DataFrames remain usable.
  
  Throws:
  
  IllegalArgumentException - if other is null.
  
  RuntimeException - if the schemas are incompatible.
- sort
  
  public DataFrame sort(SortExpr... exprs)
  
  Order the rows by the supplied sort keys. Each SortExpr names a column and a direction (SortExpr.asc(String) / SortExpr.desc(String)); call SortExpr.nullsFirst(boolean) to override null placement.
  An empty exprs array is a no-op (matches DataFusion's sort(vec![])). The receiver remains usable and must still be closed independently.
  
  Throws:
  
  IllegalArgumentException - if exprs or any element is null.
  
  RuntimeException - if a sort column does not exist in this DataFrame's schema.
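
  Direction and null placement are independent choices, which the JDK's own comparators can illustrate. This sketch models the combination on a list of nullable integers; it says nothing about the binding's default null placement, and NullPlacementSketch is a hypothetical name.

  ```java
  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.Comparator;
  import java.util.List;

  class NullPlacementSketch {
      // Sort with an explicit direction and an explicit position for nulls.
      static List<Integer> sort(List<Integer> values, boolean ascending, boolean nullsFirst) {
          Comparator<Integer> order = ascending
                  ? Comparator.<Integer>naturalOrder()
                  : Comparator.<Integer>reverseOrder();
          Comparator<Integer> withNulls = nullsFirst
                  ? Comparator.nullsFirst(order)
                  : Comparator.nullsLast(order);
          List<Integer> out = new ArrayList<>(values);
          out.sort(withNulls);
          return out;
      }

      public static void main(String[] args) {
          List<Integer> values = Arrays.asList(3, null, 1, 2);
          System.out.println(sort(values, true, true));   // [null, 1, 2, 3]
          System.out.println(sort(values, false, false)); // [3, 2, 1, null]
      }
  }
  ```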
- repartitionRoundRobin
  
  public DataFrame repartitionRoundRobin(int numPartitions)
  
  Repartition this DataFrame using a round-robin scheme across numPartitions output partitions. The receiver remains usable and must still be closed independently.
  
  Throws:
  
  IllegalArgumentException - if numPartitions <= 0.
  
  RuntimeException - if the underlying repartition plan rejects the request.
- repartitionHash
  
  public DataFrame repartitionHash(int numPartitions, String... columns)
  
  Repartition this DataFrame by hashing the named columns into numPartitions output partitions. v1 supports column-name keys only; expression keys are deferred until the Java binding gains an Expr builder. The receiver remains usable and must still be closed independently.
  
  Throws:
  
  IllegalArgumentException - if numPartitions <= 0, columns is null or empty, or any element of columns is null.
  
  RuntimeException - if a partition column does not exist in this DataFrame's schema.
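
  The key property of hash repartitioning is that rows with equal keys always land in the same output partition. A minimal sketch of that idea follows; Objects.hashCode stands in for the engine's hash function, which this example does not reproduce, and HashPartitionSketch is a hypothetical name.

  ```java
  import java.util.ArrayList;
  import java.util.List;
  import java.util.Objects;

  class HashPartitionSketch {
      // Assign each key to one of numPartitions buckets by hashing it.
      static List<List<String>> partition(List<String> keys, int numPartitions) {
          if (numPartitions <= 0) {
              throw new IllegalArgumentException("numPartitions must be > 0");
          }
          List<List<String>> parts = new ArrayList<>();
          for (int i = 0; i < numPartitions; i++) parts.add(new ArrayList<>());
          for (String key : keys) {
              // floorMod keeps the index non-negative even for negative hash codes.
              parts.get(Math.floorMod(Objects.hashCode(key), numPartitions)).add(key);
          }
          return parts;
      }

      public static void main(String[] args) {
          // Equal keys ("a", "b") end up together; no row is lost or duplicated.
          System.out.println(partition(List.of("a", "b", "a", "c", "b"), 3));
      }
  }
  ```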
- join
  
  public DataFrame join(DataFrame right, JoinType type, String[] leftCols, String[] rightCols)
  
  Equi-join this DataFrame with right on the named columns, using the given JoinType. The receiver and right both remain usable and must still be closed independently.
  Equivalent to SQL left <type> JOIN right ON l.leftCols[0] = r.rightCols[0] AND ..., with one equality per column pair. leftCols and rightCols must have the same length.
  
  Throws:
  
  IllegalArgumentException - if any argument is null or leftCols.length != rightCols.length.
  
  IllegalStateException - if either DataFrame is closed or already collected.
  
  RuntimeException - if join planning fails (column collision in the combined schema, unknown column names, etc.).
- join
  
  public DataFrame join(DataFrame right, JoinType type, String[] leftCols, String[] rightCols, String filter)
  
  Equi-join this DataFrame with right, restricting the result with a residual SQL filter parsed against the combined schema (left columns followed by right columns; columns may be qualified with the relation alias when ambiguous). The receiver and right both remain usable and must still be closed independently.
  For outer joins, the filter is applied only to matched rows; unmatched rows are passed through with nulls on the unmatched side, matching DataFusion's semantics.
  
  Throws:
  
  IllegalArgumentException - if any argument is null or leftCols.length != rightCols.length.
  
  IllegalStateException - if either DataFrame is closed or already collected.
  
  RuntimeException - if join planning or filter parsing fails.
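
  The outer-join rule can be made concrete with a small model of a LEFT join. The sketch below assumes the residual filter behaves like part of the ON condition: a left row whose key matches but whose candidate pairs all fail the filter is still emitted, with nulls on the right. It is an illustration in plain Java, not the binding's implementation, and OuterJoinFilterSketch is a hypothetical name.

  ```java
  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.List;
  import java.util.function.BiPredicate;

  class OuterJoinFilterSketch {
      // Rows are int[]{key, value}. Output rows are
      // [leftKey, leftValue, rightKey, rightValue], with nulls when unmatched.
      static List<List<Integer>> leftJoin(List<int[]> left, List<int[]> right,
                                          BiPredicate<int[], int[]> filter) {
          List<List<Integer>> out = new ArrayList<>();
          for (int[] l : left) {
              boolean matched = false;
              for (int[] r : right) {
                  // A pair matches only if the keys are equal AND the residual filter passes.
                  if (l[0] == r[0] && filter.test(l, r)) {
                      out.add(Arrays.asList(l[0], l[1], r[0], r[1]));
                      matched = true;
                  }
              }
              if (!matched) out.add(Arrays.asList(l[0], l[1], null, null));
          }
          return out;
      }

      public static void main(String[] args) {
          List<int[]> left = List.of(new int[]{1, 10}, new int[]{2, 20}, new int[]{3, 30});
          List<int[]> right = List.of(new int[]{1, 5}, new int[]{2, 99});
          // Residual filter: right value must be smaller than left value.
          // [[1, 10, 1, 5], [2, 20, null, null], [3, 30, null, null]]
          System.out.println(leftJoin(left, right, (l, r) -> r[1] < l[1]));
      }
  }
  ```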
- joinOn
  
  public DataFrame joinOn(DataFrame right, JoinType type, String... predicates)
  
  Join this DataFrame with right using arbitrary SQL predicates parsed against the combined schema. Each predicate is parsed independently and the join evaluates their conjunction. Predicates may reference columns from either side and may be qualified with the relation alias when ambiguous (e.g. "left.x = right.x"). The receiver and right both remain usable and must still be closed independently.
  DataFusion's optimiser identifies and rewrites equality predicates into hash-join keys automatically, so joinOn(right, INNER, "l.id = r.id") plans equivalently to join(DataFrame, JoinType, String[], String[]) with a single key. Use joinOn when the predicate is not a simple equality, e.g. inequality joins or range conditions.
  
  Throws:
  
  IllegalArgumentException - if right or type is null, or predicates is null or empty, or any predicate is null.
  
  IllegalStateException - if either DataFrame is closed or already collected.
  
  RuntimeException - if predicate parsing or join planning fails.
- writeParquet
  
  public void writeParquet(String path)
  
  Materialize this DataFrame as Parquet at path. The path is treated as a directory unless overridden via ParquetWriteOptions.singleFileOutput(boolean). The receiver remains usable and must still be closed independently.
  
  Throws:
  
  RuntimeException - if the write fails.
- writeParquet
  
  public void writeParquet(String path, ParquetWriteOptions options)
  
  Materialize this DataFrame as Parquet at path with the supplied ParquetWriteOptions. The receiver remains usable and must still be closed independently.
  
  Throws:
  
  RuntimeException - if the write fails (path inaccessible, invalid compression spec, etc.).
- writeCsv
  
  public void writeCsv(String path)
  
  Materialize this DataFrame as CSV at path. The path is treated as a directory unless overridden via CsvWriteOptions.singleFileOutput(boolean). The receiver remains usable and must still be closed independently.
  
  Throws:
  
  RuntimeException - if the write fails.
- writeCsv
  
  public void writeCsv(String path, CsvWriteOptions options)
  
  Materialize this DataFrame as CSV at path with the supplied CsvWriteOptions. The receiver remains usable and must still be closed independently.
  
  Throws:
  
  IllegalArgumentException - if path or options is null.
  
  RuntimeException - if the write fails (path inaccessible, invalid compression spec, etc.).
- writeJson
  
  public void writeJson(String path)
  
  Materialize this DataFrame as newline-delimited JSON at path. The path is treated as a directory unless overridden via JsonWriteOptions.singleFileOutput(boolean). The receiver remains usable and must still be closed independently.
  
  Throws:
  
  RuntimeException - if the write fails.
- writeJson
  
  public void writeJson(String path, JsonWriteOptions options)
  
  Materialize this DataFrame as newline-delimited JSON at path with the supplied JsonWriteOptions. The receiver remains usable and must still be closed independently.
  
  Throws:
  
  IllegalArgumentException - if path or options is null.
  
  RuntimeException - if the write fails (path inaccessible, invalid compression spec, etc.).
- close
  
  public void close()
  
  Specified by:
  
  close in interface AutoCloseable
