DataFrame and SQL#
DataFusion Java supports two query interfaces: SQL strings via
SessionContext.sql(String), and a programmatic DataFrame API.
SQL#
try (DataFrame df = ctx.sql("SELECT a, b FROM t WHERE a > 10")) {
df.show();
}
sql(String) plans the query and returns a DataFrame. Execution does
not start until you pull results.
DataFrame transformations#
The DataFrame API exposes select, filter, limit, distinct,
dropColumns, and withColumnRenamed.
try (DataFrame df = ctx.readParquet("/path/to/orders.parquet")) {
try (DataFrame filtered = df.filter("o_orderpriority = '1-URGENT'")) {
filtered.show();
}
}
Each transformation returns a new DataFrame that must be closed.
Pulling results#
Three patterns are available:
Stream as Arrow batches. Use collect(allocator) to pull the result
set as Arrow record batches via the Arrow C Data Interface:
try (DataFrame df = ctx.sql("SELECT ...");
ArrowReader reader = df.collect(allocator)) {
while (reader.loadNextBatch()) {
var batch = reader.getVectorSchemaRoot();
// process batch...
}
}
Count rows. df.count() returns the row count without materializing
the rows in the JVM.
Print for inspection. df.show() and df.show(int n) print results
to standard output. Useful for exploration; not appropriate for
production code paths.
Schema introspection#
To get the schema of a registered table without running a query:
org.apache.arrow.vector.types.pojo.Schema schema = ctx.tableSchema("orders");
Plan input#
A DataFusion logical plan can be deserialized from datafusion-proto
bytes via SessionContext.fromProto(byte[]). The datafusion-proto Java
classes are generated by the Maven build. This is useful for accepting
plans produced by other DataFusion-aware tooling.