Class DataFrame

java.lang.Object
org.apache.datafusion.DataFrame
All Implemented Interfaces:
AutoCloseable

public final class DataFrame extends Object implements AutoCloseable
A lazy representation of a query plan, mirroring the Rust DataFusion DataFrame. Created by SessionContext.sql(String) or other planning entry points and executed by either collect(org.apache.arrow.memory.BufferAllocator) (materializes every batch on the native heap before returning) or executeStream(org.apache.arrow.memory.BufferAllocator) (yields one batch at a time as Java drains the reader).

Instances are not thread-safe and must be closed. Both collect(org.apache.arrow.memory.BufferAllocator) and executeStream(org.apache.arrow.memory.BufferAllocator) consume the DataFrame: a successfully consumed DataFrame cannot be consumed again by either method (or by other executors such as count()), and close() on an already-consumed instance is a no-op.

  • Method Details

    • collect

      public org.apache.arrow.vector.ipc.ArrowReader collect(org.apache.arrow.memory.BufferAllocator allocator)
      Execute the plan and return its record batches as an ArrowReader.

      Consumes this DataFrame: the native plan is released as soon as the stream is established. The caller is responsible for closing the returned reader, and the supplied allocator must outlive it.

      This method materializes every batch on the native heap before the first batch crosses the FFI boundary, which can OOM the Rust side for unbounded or very large result sets. Prefer executeStream(BufferAllocator) for analytics-scale queries.

    • executeStream

      public org.apache.arrow.vector.ipc.ArrowReader executeStream(org.apache.arrow.memory.BufferAllocator allocator)
      Execute the plan and return its record batches as a streaming ArrowReader. Each call to ArrowReader.loadNextBatch() drives one async stream.next() on the native side, so memory pressure stays bounded by the executor pipeline plus one in-flight batch instead of the full result set.

      Consumes this DataFrame with the same lifecycle rules as collect(BufferAllocator): the native plan is released as soon as the stream is established, the caller closes the returned reader, and the supplied allocator must outlive it.

      For result sets that fit comfortably in native memory and are read in their entirety, collect(BufferAllocator) remains a reasonable choice. For TB-scale or unbounded result sets, use this method.

    • schema

      public org.apache.arrow.vector.types.pojo.Schema schema()
      Return the Arrow Schema of this DataFrame's output. Non-consuming: the receiver remains usable and must still be closed independently. Schema inspection does not execute the plan.

      The schema is transferred via Arrow IPC; no BufferAllocator is required because a schema carries no buffer data.

    • explain

      public DataFrame explain(boolean verbose, boolean analyze)
      Return a new DataFrame whose rows describe the plan that would execute this DataFrame. Non-consuming: the receiver remains usable and must still be closed independently.

      With verbose=false and analyze=false (the cheap, lazy variant), the result contains the logical plan only. verbose=true adds optimised-plan and physical-plan rows; analyze=true runs the plan and attaches per-operator metrics. Render via show() or collect(BufferAllocator).

    • cache

      public DataFrame cache()
      Materialise this DataFrame into an in-memory table and return a new DataFrame that scans it. Non-consuming: the receiver remains usable and must still be closed independently.

      Executes the plan eagerly: the entire result set is held in native memory until the returned DataFrame is closed. Suitable for intermediate results that will be reused across multiple downstream queries.

      Throws:
      RuntimeException - if execution fails.
    • describe

      public DataFrame describe()
      Compute summary statistics (count, null_count, mean, std, min, max, median) over this DataFrame's columns and return them as a new DataFrame. Non-consuming: the receiver remains usable and must still be closed independently.

      Executes the plan: DataFusion runs seven aggregate sub-plans against this DataFrame to build the summary table. Numeric columns receive every statistic; non-numeric columns receive count / null_count / min / max where applicable.

      Throws:
      RuntimeException - if execution fails.
    • count

      public long count()
      Execute the plan and return the number of rows.
    • show

      public void show()
      Execute the plan and print formatted batches to native stdout.
    • show

      public void show(int limit)
      Execute the plan and print the first limit rows to native stdout.
    • select

      public DataFrame select(String... columnNames)
      Project the listed columns into a new DataFrame. The receiver remains usable and must still be closed independently.
    • filter

      public DataFrame filter(String predicate)
      Apply a SQL predicate to produce a filtered DataFrame. The predicate is parsed against this DataFrame's own schema. The receiver remains usable and must still be closed independently.
    • limit

      public DataFrame limit(int fetch)
      Take the first fetch rows. Equivalent to limit(int, int) with skip = 0. The receiver remains usable and must still be closed independently.
    • limit

      public DataFrame limit(int skip, int fetch)
      Skip skip rows, then take the next fetch rows. Both arguments must be non-negative. The receiver remains usable and must still be closed independently.
    • distinct

      public DataFrame distinct()
      Deduplicate rows across all columns. The receiver remains usable and must still be closed independently.
    • dropColumns

      public DataFrame dropColumns(String... columnNames)
      Drop the named columns. The inverse of select(String...). The receiver remains usable and must still be closed independently.
    • withColumnRenamed

      public DataFrame withColumnRenamed(String oldName, String newName)
      Rename a column. The receiver remains usable and must still be closed independently.
    • withColumn

      public DataFrame withColumn(String name, String expr)
      Add a column to this DataFrame computed from a SQL expression. If a column with the given name already exists, it is replaced in place; otherwise the new column is appended. The expression is parsed against this DataFrame's own schema, matching the convention used by filter(String). The receiver remains usable and must still be closed independently.
      Throws:
      IllegalArgumentException - if name or expr is null.
    • unnestColumns

      public DataFrame unnestColumns(String... columns)
      Expand list or struct columns into rows or fields, with default UnnestOptions (i.e. preserveNulls = true). The receiver remains usable and must still be closed independently.
    • unnestColumns

      public DataFrame unnestColumns(UnnestOptions options, String... columns)
      Expand list or struct columns into rows or fields with the supplied UnnestOptions. The receiver remains usable and must still be closed independently.
      Throws:
      IllegalArgumentException - if options or columns is null.
    • join

      public DataFrame join(DataFrame right, JoinType type, String[] leftCols, String[] rightCols)
      Equi-join this DataFrame with right on the named columns, using the given JoinType. The receiver and right both remain usable and must still be closed independently.

      Equivalent to SQL left <type> JOIN right ON l.leftCols[0] = r.rightCols[0] AND .... leftCols and rightCols must have the same length.

      Throws:
      IllegalArgumentException - if any argument is null or leftCols.length != rightCols.length.
      IllegalStateException - if either DataFrame is closed or already collected.
      RuntimeException - if join planning fails (column collision in the combined schema, unknown column names, etc.).
    • join

      public DataFrame join(DataFrame right, JoinType type, String[] leftCols, String[] rightCols, String filter)
      Equi-join this DataFrame with right, restricting the result with a residual SQL filter parsed against the combined schema (left columns followed by right columns; columns may be qualified with the relation alias when ambiguous). The receiver and right both remain usable and must still be closed independently.

      For outer joins, the filter is applied only to matched rows; unmatched rows are passed through with nulls on the unmatched side, matching DataFusion's semantics.

      Throws:
      IllegalArgumentException - if any argument is null or leftCols.length != rightCols.length.
      IllegalStateException - if either DataFrame is closed or already collected.
      RuntimeException - if join planning or filter parsing fails.
    • joinOn

      public DataFrame joinOn(DataFrame right, JoinType type, String... predicates)
      Join this DataFrame with right using arbitrary SQL predicates parsed against the combined schema. Each predicate is parsed independently and the join evaluates their conjunction. Predicates may reference columns from either side and may be qualified with the relation alias when ambiguous (e.g. "left.x = right.x"). The receiver and right both remain usable and must still be closed independently.

      DataFusion's optimiser identifies and rewrites equality predicates into hash-join keys automatically, so joinOn(right, INNER, "l.id = r.id") plans equivalently to join(DataFrame, JoinType, String[], String[]) with a single key. Use joinOn when the predicate is not a simple equality, e.g. inequality joins or range conditions.

      Throws:
      IllegalArgumentException - if right or type is null, or predicates is null or empty, or any predicate is null.
      IllegalStateException - if either DataFrame is closed or already collected.
      RuntimeException - if predicate parsing or join planning fails.
    • writeParquet

      public void writeParquet(String path)
      Materialize this DataFrame as Parquet at path. The path is treated as a directory unless overridden via ParquetWriteOptions.singleFileOutput(boolean). The receiver remains usable and must still be closed independently.
      Throws:
      RuntimeException - if the write fails.
    • writeParquet

      public void writeParquet(String path, ParquetWriteOptions options)
      Materialize this DataFrame as Parquet at path with the supplied ParquetWriteOptions. The receiver remains usable and must still be closed independently.
      Throws:
      RuntimeException - if the write fails (path inaccessible, invalid compression spec, etc.).
    • writeCsv

      public void writeCsv(String path)
      Materialize this DataFrame as CSV at path. The path is treated as a directory unless overridden via CsvWriteOptions.singleFileOutput(boolean). The receiver remains usable and must still be closed independently.
      Throws:
      RuntimeException - if the write fails.
    • writeCsv

      public void writeCsv(String path, CsvWriteOptions options)
      Materialize this DataFrame as CSV at path with the supplied CsvWriteOptions. The receiver remains usable and must still be closed independently.
      Throws:
      IllegalArgumentException - if path or options is null.
      RuntimeException - if the write fails (path inaccessible, invalid compression spec, etc.).
    • writeJson

      public void writeJson(String path)
      Materialize this DataFrame as newline-delimited JSON at path. The path is treated as a directory unless overridden via JsonWriteOptions.singleFileOutput(boolean). The receiver remains usable and must still be closed independently.
      Throws:
      RuntimeException - if the write fails.
    • writeJson

      public void writeJson(String path, JsonWriteOptions options)
      Materialize this DataFrame as newline-delimited JSON at path with the supplied JsonWriteOptions. The receiver remains usable and must still be closed independently.
      Throws:
      IllegalArgumentException - if path or options is null.
      RuntimeException - if the write fails (path inaccessible, invalid compression spec, etc.).
    • close

      public void close()
      Specified by:
      close in interface AutoCloseable