Quickstart

This page walks through a complete query end-to-end.

The full example

import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.ipc.ArrowReader;
import org.apache.datafusion.DataFrame;
import org.apache.datafusion.SessionContext;

try (var allocator = new RootAllocator();
     var ctx = new SessionContext()) {

    ctx.registerParquet("orders", "/path/to/orders.parquet");

    try (DataFrame df = ctx.sql(
            "SELECT o_orderpriority, COUNT(*) AS n " +
            "FROM orders GROUP BY o_orderpriority");
         ArrowReader reader = df.collect(allocator)) {
        while (reader.loadNextBatch()) {
            var batch = reader.getVectorSchemaRoot();
            // ...
        }
    }
}

Walkthrough

Allocator. RootAllocator is the Arrow off-heap memory allocator. Every JVM-side Arrow buffer is tracked under an allocator, and closing an allocator that still has outstanding buffers reports them as leaks (Arrow's close() throws an IllegalStateException in that case). Use one allocator per query (or one per application) and close it with try-with-resources.

Session context. SessionContext is the entry point into DataFusion. It holds the catalog of registered tables and the query planner. It is AutoCloseable and not thread-safe. Use one instance per thread, or guard access externally.
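
The "one per thread" rule can be sketched with plain Java. The example below does not use the DataFusion API: FakeContext is a hypothetical stand-in for a non-thread-safe, closeable session object, and runWorkers is an illustrative helper. It only shows the shape of the pattern, in which each worker opens, uses, and closes its own instance.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class PerThreadContextSketch {

    /** Hypothetical stand-in for a non-thread-safe, closeable session object. */
    static final class FakeContext implements AutoCloseable {
        private final Thread owner = Thread.currentThread();
        private final AtomicInteger open;

        FakeContext(AtomicInteger created, AtomicInteger open) {
            this.open = open;
            created.incrementAndGet();
            open.incrementAndGet();
        }

        /** Fails fast if called from a thread other than the one that created it. */
        void query() {
            if (Thread.currentThread() != owner) {
                throw new IllegalStateException("context used from a foreign thread");
            }
        }

        @Override
        public void close() {
            open.decrementAndGet();
        }
    }

    /** Starts the given number of workers; returns {contexts created, contexts still open}. */
    static int[] runWorkers(int threads) {
        AtomicInteger created = new AtomicInteger();
        AtomicInteger open = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                // Each worker opens, uses, and closes its own context.
                try (FakeContext ctx = new FakeContext(created, open)) {
                    ctx.query();
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return new int[] {created.get(), open.get()};
    }

    public static void main(String[] args) {
        int[] result = runWorkers(4);
        System.out.println("created=" + result[0] + ", stillOpen=" + result[1]);
    }
}
```

Creating the context inside the worker keeps its lifetime tied to a single thread, so no locking is needed. Sharing one instance across threads would instead require an external lock around every call.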

Registering data. registerParquet(name, path) reads the file’s footer at registration time and exposes the file as a table under the given name. See Parquet for the options form.

SQL. ctx.sql("...") plans the query and returns a DataFrame. The query is not executed until results are pulled.

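Collecting results. df.collect(allocator) starts native execution and returns an ArrowReader. Each loadNextBatch() call loads the next record batch into the reader's VectorSchemaRoot, which Arrow reuses across batches, so process or copy a batch before loading the next one. Iterate until loadNextBatch() returns false.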
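
Because Arrow's ArrowReader refills the same VectorSchemaRoot on every loadNextBatch() call, values that must outlive one iteration have to be copied out inside the loop. The sketch below illustrates that loop shape using only the Java standard library: FakeReader and getRoot are hypothetical stand-ins, not Arrow or DataFusion classes, and a List<Integer> plays the role of the shared root.

```java
import java.util.ArrayList;
import java.util.List;

final class ReusedBatchSketch {

    /** Hypothetical stand-in for a reader that refills one shared container per batch. */
    static final class FakeReader {
        private final int[][] batches;
        private final List<Integer> root = new ArrayList<>(); // reused across batches
        private int next = 0;

        FakeReader(int[][] batches) {
            this.batches = batches;
        }

        /** Overwrites the shared container with the next batch; false when exhausted. */
        boolean loadNextBatch() {
            if (next >= batches.length) {
                return false;
            }
            root.clear();
            for (int value : batches[next++]) {
                root.add(value);
            }
            return true;
        }

        List<Integer> getRoot() {
            return root;
        }
    }

    /** Copies values out of the shared container before the next load overwrites it. */
    static List<Integer> readAll(FakeReader reader) {
        List<Integer> out = new ArrayList<>();
        while (reader.loadNextBatch()) {
            // Copy now: the container is refilled on the next iteration.
            out.addAll(reader.getRoot());
        }
        return out;
    }

    public static void main(String[] args) {
        FakeReader reader = new FakeReader(new int[][] {{1, 2}, {3}});
        System.out.println(readAll(reader));
    }
}
```

Keeping a reference to the container itself, rather than copying its contents, would leave every saved reference pointing at whatever batch was loaded last.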

Cleanup. SessionContext and DataFrame are both AutoCloseable, as are the ArrowReader and the allocator. Use try-with-resources so native resources and Arrow buffers are released even if an exception is thrown.
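
To see why try-with-resources suits this nesting, the following sketch records the order in which Java closes resources when the body throws. Tracked is a hypothetical stand-in that only logs its own close() call. The names "ctx", "df", and "reader" mirror the full example but are not the real DataFusion or Arrow objects.

```java
import java.util.ArrayList;
import java.util.List;

final class CloseOrderSketch {

    /** Hypothetical resource that appends its name to a shared log when closed. */
    static final class Tracked implements AutoCloseable {
        private final String name;
        private final List<String> log;

        Tracked(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }

        @Override
        public void close() {
            log.add(name);
        }
    }

    /** Opens three resources, fails inside the body, and returns the close order. */
    static List<String> closeOrderOnFailure() {
        List<String> log = new ArrayList<>();
        try (Tracked ctx = new Tracked("ctx", log);
             Tracked df = new Tracked("df", log);
             Tracked reader = new Tracked("reader", log)) {
            throw new IllegalStateException("simulated failure while reading batches");
        } catch (IllegalStateException e) {
            // All three resources are already closed when the catch block runs.
            log.add("caught");
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(closeOrderOnFailure()); // prints [reader, df, ctx, caught]
    }
}
```

Java closes resources in the reverse of their declaration order, so a resource declared later (here the reader) is released before the resources it was created from.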