Quickstart#
This page walks through a complete query end-to-end.
The full example#
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.ipc.ArrowReader;
import org.apache.datafusion.DataFrame;
import org.apache.datafusion.SessionContext;
try (var allocator = new RootAllocator();
var ctx = new SessionContext()) {
ctx.registerParquet("orders", "/path/to/orders.parquet");
try (DataFrame df = ctx.sql(
"SELECT o_orderpriority, COUNT(*) AS n " +
"FROM orders GROUP BY o_orderpriority");
ArrowReader reader = df.collect(allocator)) {
while (reader.loadNextBatch()) {
var batch = reader.getVectorSchemaRoot();
// ...
}
}
}
Walkthrough#
Allocator. RootAllocator is the Arrow off-heap memory allocator. Every
JVM-side Arrow buffer is tracked under an allocator; when the allocator is
closed, leaked buffers are reported. Use one allocator per query (or one
per application) and close it in a try-with-resources.
Session context. SessionContext is the entry point into DataFusion. It
holds the catalog of registered tables and the query planner. It is
AutoCloseable and not thread-safe — use one per thread, or guard
access externally.
Registering data. registerParquet(name, path) reads the file’s footer
on call and exposes it under the given table name. See
Parquet for the options form.
SQL. ctx.sql("...") plans the query and returns a DataFrame. The
query is not executed until results are pulled.
Collecting results. df.collect(allocator) starts native execution and
returns an ArrowReader. Each loadNextBatch() call pulls the next
VectorSchemaRoot; iterate until it returns false.
Cleanup. Both SessionContext and DataFrame are AutoCloseable. Use
try-with-resources so native resources and Arrow buffers are released
even on exception.