DataFrame and SQL#

DataFusion Java supports two query interfaces: SQL strings via SessionContext.sql(String), and a programmatic DataFrame API.

SQL#

try (DataFrame df = ctx.sql("SELECT a, b FROM t WHERE a > 10")) {
    df.show();
}

sql(String) plans the query and returns a DataFrame. Execution does not start until you pull results.

DataFrame transformations#

The DataFrame API exposes select, filter, limit, distinct, dropColumns, and withColumnRenamed.

try (DataFrame df = ctx.readParquet("/path/to/orders.parquet")) {
    try (DataFrame filtered = df.filter("o_orderpriority = '1-URGENT'")) {
        filtered.show();
    }
}

Each transformation returns a new DataFrame that must be closed.

Pulling results#

Three patterns are available:

Stream as Arrow batches. Use collect(allocator) to pull the result set as Arrow record batches via the Arrow C Data Interface:

try (DataFrame df = ctx.sql("SELECT ...");
     ArrowReader reader = df.collect(allocator)) {
    while (reader.loadNextBatch()) {
        var batch = reader.getVectorSchemaRoot();
        // process batch...
    }
}

Count rows. df.count() returns the row count without materializing the rows in the JVM.

Print for inspection. df.show() and df.show(int n) print results to standard output. Useful for exploration; not appropriate for production code paths.

Schema introspection#

To get the schema of a registered table without running a query:

org.apache.arrow.vector.types.pojo.Schema schema = ctx.tableSchema("orders");

Plan input#

A DataFusion logical plan can be deserialized from datafusion-proto bytes via SessionContext.fromProto(byte[]). The datafusion-proto Java classes are generated by the Maven build. This is useful for accepting plans produced by other DataFusion-aware tooling.