Java table providers#

SessionContext.registerTable(name, provider) registers a Java-implemented table. SQL queries that reference name call back into your TableProvider to fetch batches. Data flows from Java to native code via the Arrow C Data Interface, so there are no extra copies in the hot path. This is the Java counterpart to DataFusion’s Rust SessionContext::register_table.

Implement#

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.ipc.ArrowReader;
import org.apache.arrow.vector.types.pojo.Schema;
import org.apache.datafusion.TableProvider;

public final class MyTable implements TableProvider {
    private final Schema schema;

    public MyTable(Schema schema) {
        this.schema = schema;
    }

    @Override public Schema schema() { return schema; }

    @Override
    public ArrowReader scan(BufferAllocator allocator) {
        // Return a fresh ArrowReader. The reader must allocate its buffers
        // from `allocator` (or a child of it) — the framework needs the
        // allocator hierarchy to share a root.
        return openMyReader(allocator);
    }
}

For the common case of “I have a schema and a function that returns an ArrowReader,” SimpleTableProvider packages those two into a ready-made TableProvider without having to subclass:

TableProvider t = new SimpleTableProvider(mySchema(), allocator -> openMyReader(allocator));
ctx.registerTable("t", t);

Register and query#

try (SessionContext ctx = new SessionContext();
     BufferAllocator allocator = new RootAllocator()) {
    ctx.registerTable("t", new MyTable(mySchema()));

    try (DataFrame df = ctx.sql("SELECT * FROM t WHERE x > 10");
         ArrowReader r = df.collect(allocator)) {
        while (r.loadNextBatch()) {
            // ...
        }
    }
}

Contract#

schema() is called exactly once, on the caller’s thread, at registration time. Throwing from it aborts registration with the original exception.
scan(allocator) is called once per SQL query that touches the table, on a worker thread. It must return a fresh, independent ArrowReader on every call — this is what makes self-joins and UNION ALL over the same table work.
The reader returned by scan must allocate its buffers from the supplied allocator (or a child of it). Arrow Java’s Data.exportArrayStream requires the reader’s allocator and the export allocator to share a root.
The returned reader’s schema must equal the schema returned by schema(). A mismatch fails the query.
You do not need to close the returned reader yourself. The framework installs a release callback that closes it when the underlying FFI stream is dropped.

Errors#

Exceptions thrown from scan() or from the returned reader surface in the RuntimeException raised by collect(). The error message includes the Java exception class and getMessage(), in the same format used for scalar UDF errors.

Threading#

SessionContext is single-threaded, but scan(allocator) may be invoked from any DataFusion worker thread. If your implementation maintains mutable state across scans, synchronise it.

Limitations (v1)#

Single-partition scans only. DataFusion sees the table as one partition; multi-partition parallelism is a follow-up.
No projection or filter pushdown. DataFusion applies projection and filters on top of the batches you return; the Java side always sees the full schema. The interface is intentionally minimal so it can grow these capabilities (as default methods) without breaking existing implementations.
No deregisterTable. Tables live until the SessionContext is closed.