Java table providers#
SessionContext.registerTable(name, provider) registers a Java-implemented
table under name. SQL queries that reference that name call back into your TableProvider
to fetch batches. Data flows from Java to native code via the Arrow C Data
Interface, so there are no extra copies in the hot path. This is the Java
counterpart to DataFusion’s Rust SessionContext::register_table.
Implement#
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.ipc.ArrowReader;
import org.apache.arrow.vector.types.pojo.Schema;
import org.apache.datafusion.TableProvider;

public final class MyTable implements TableProvider {
    private final Schema schema;

    public MyTable(Schema schema) {
        this.schema = schema;
    }

    @Override
    public Schema schema() {
        return schema;
    }

    @Override
    public ArrowReader scan(BufferAllocator allocator) {
        // Return a fresh ArrowReader. The reader must allocate its buffers
        // from `allocator` (or a child of it): the framework needs the
        // allocator hierarchy to share a root.
        return openMyReader(allocator);
    }
}
For the common case of “I have a schema and a function that returns an
ArrowReader,” SimpleTableProvider packages those two into a ready-made
TableProvider, so you do not have to write your own class:
TableProvider t = new SimpleTableProvider(mySchema(), allocator -> openMyReader(allocator));
ctx.registerTable("t", t);
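The snippets above leave openMyReader undefined. As a rough illustration of what such a reader could look like, here is a minimal sketch of an ArrowReader subclass that serves a single batch from an in-memory int array. IntArrayReader, its SCHEMA constant, and the column name "x" are hypothetical names chosen for this example; only the Arrow Java classes it builds on are real. The point to notice is that the allocator handed to scan is passed straight to the ArrowReader constructor, so every buffer the reader creates comes from it.

```java
import java.io.IOException;
import java.util.List;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowReader;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.Schema;

/** Hypothetical single-batch reader over an int array (illustration only). */
final class IntArrayReader extends ArrowReader {
    /** One nullable, signed 32-bit integer column named "x". */
    static final Schema SCHEMA =
        new Schema(List.of(Field.nullable("x", new ArrowType.Int(32, true))));

    private final int[] values;
    private boolean exhausted;

    IntArrayReader(BufferAllocator allocator, int[] values) {
        super(allocator); // the reader's vectors are allocated from this allocator
        this.values = values.clone();
    }

    @Override
    protected Schema readSchema() {
        return SCHEMA;
    }

    @Override
    public boolean loadNextBatch() throws IOException {
        if (exhausted) {
            return false; // the single batch has already been served
        }
        // The root is created lazily from readSchema() on first access.
        VectorSchemaRoot root = getVectorSchemaRoot();
        IntVector x = (IntVector) root.getVector("x");
        x.allocateNew(values.length);
        for (int i = 0; i < values.length; i++) {
            x.set(i, values[i]);
        }
        root.setRowCount(values.length);
        exhausted = true;
        return true;
    }

    @Override
    public long bytesRead() {
        return 0; // nothing is read from an external source
    }

    @Override
    protected void closeReadSource() {
        // No underlying source to release; close() frees the vectors.
    }
}
```

Under these assumptions, a provider could be built with new SimpleTableProvider(IntArrayReader.SCHEMA, allocator -> new IntArrayReader(allocator, new int[] {5, 15, 25})). Because the lambda constructs a new reader each time it runs, every scan gets its own independent reader.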
Register and query#
try (SessionContext ctx = new SessionContext();
     BufferAllocator allocator = new RootAllocator()) {
    ctx.registerTable("t", new MyTable(mySchema()));
    try (DataFrame df = ctx.sql("SELECT * FROM t WHERE x > 10");
         ArrowReader r = df.collect(allocator)) {
        while (r.loadNextBatch()) {
            // ...
        }
    }
}
Contract#
schema() is called exactly once, on the caller’s thread, at registration time. Throwing from it aborts registration with the original exception.
scan(allocator) is called once per SQL query that touches the table, on a worker thread. It must return a fresh, independent ArrowReader on every call; this is what makes self-joins and UNION ALL over the same table work.
The reader returned by scan must allocate its buffers from the supplied allocator (or a child of it). Arrow Java’s Data.exportArrayStream requires the reader’s allocator and the export allocator to share a root.
The returned reader’s schema must equal the schema returned by schema(). A mismatch fails the query.
You do not need to close the returned reader yourself. The framework installs a release callback that closes it when the underlying FFI stream is dropped.
Errors#
Exceptions thrown from scan() or from the returned reader surface in the
RuntimeException raised by collect(). The error message includes the Java
exception class and getMessage(), in the same format used for scalar UDF
errors.
Threading#
SessionContext is single-threaded, but scan(allocator) may be invoked from
any DataFusion worker thread. If your implementation maintains mutable state
across scans, synchronise it.
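One simple way to keep state safe when scan can run on any worker thread is to hold it in java.util.concurrent types rather than plain fields. The sketch below uses only the JDK: ScanStats is a hypothetical helper that a provider might own, and the thread pool in simulate merely stands in for DataFusion worker threads calling scan concurrently.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper a TableProvider could own to count scans safely. */
final class ScanStats {
    private final AtomicLong scans = new AtomicLong();

    /** Intended to be called at the top of scan(allocator); safe from any thread. */
    long recordScan() {
        return scans.incrementAndGet();
    }

    long totalScans() {
        return scans.get();
    }

    /** Simulates `threads` worker threads, each recording `perThread` scans. */
    static long simulate(int threads, int perThread) {
        ScanStats stats = new ScanStats();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < perThread; i++) {
                    stats.recordScan();
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return stats.totalScans();
    }
}
```

An atomic counter is enough for a single independent value. State that spans several fields would need a lock or an immutable snapshot swapped in atomically instead.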
Limitations (v1)#
Single-partition scans only. DataFusion sees the table as one partition; multi-partition parallelism is a follow-up.
No projection or filter pushdown. DataFusion applies projection and filters on top of the batches you return; the Java side always sees the full schema. The interface is intentionally minimal so it can grow these capabilities (as default methods) without breaking existing implementations.
No deregisterTable. Tables live until the SessionContext is closed.