Introducing Apache DataFusion Java 0.1.0
Posted on: Tue 26 May 2026 by pmc
We are pleased to announce the 0.1.0 release of Apache DataFusion Java, the first release of the project. DataFusion Java is a thin Java binding over the Apache DataFusion query engine: SQL and DataFrame queries are planned and executed in native Rust, and results return to the JVM as Apache Arrow record batches through the Arrow C Data Interface.
This post focuses on what the project is and why it exists. The changelog has the full list of features that landed in 0.1.0.
Why DataFusion Java¶
DataFusion is already accessible from Rust, Python, and (via Comet) Spark. JVM applications that
want a fast, embeddable columnar query engine have so far had to either pull in Spark — which is a
heavy dependency for an embedded use case — or write their own JNI layer around DataFusion. The
goal of DataFusion Java is to make DataFusion a first-class JVM library: add it to your pom.xml,
construct a SessionContext, and run queries.
Typical use cases:
- Embedded analytics inside JVM services that need to run SQL or DataFrame queries over local files or object storage without standing up a separate query system.
- Tools and frontends that compile a higher-level surface language to DataFusion logical plans and need a JVM-side executor.
- Spark and Hadoop ecosystem integrations that already speak Arrow and want native columnar execution from Java or Scala code without going through Spark.
Because results cross the JNI boundary as Arrow C Data Interface streams, there is no row-by-row marshalling or extra copy on the hot path — the JVM reads the same Arrow buffers DataFusion wrote.
A first look¶
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.ipc.ArrowReader;
import org.apache.datafusion.DataFrame;
import org.apache.datafusion.SessionContext;
try (var allocator = new RootAllocator();
var ctx = new SessionContext()) {
ctx.registerParquet("orders", "/path/to/orders.parquet");
try (DataFrame df = ctx.sql(
"SELECT o_orderpriority, COUNT(*) AS n " +
"FROM orders GROUP BY o_orderpriority");
ArrowReader reader = df.collect(allocator)) {
while (reader.loadNextBatch()) {
var batch = reader.getVectorSchemaRoot();
// process batch...
}
}
}
SessionContext is the entry point — it owns the catalog of registered tables and the query
planner. SessionContext and DataFrame are AutoCloseable, so try-with-resources releases
native resources and Arrow buffers automatically.
What 0.1.0 includes¶
The 0.1.0 surface is intentionally small but covers what most embedded analytics use cases need.
Reading data¶
- Parquet, CSV, JSON, Avro, and Arrow IPC via
registerParquet/readParquet(and the equivalent entry points for the other formats), with format-specific read options. - Object store backends registered on
SessionContextBuilderso queries can read directly from S3, GCS, or other object stores. - Java table providers.
SessionContext.registerTable(name, provider)exposes a Java-implementedTableProvider. The framework calls back into the Java side per query to fetch Arrow batches, with data flowing to native code over the Arrow C Data Interface.
Querying¶
- SQL via
ctx.sql(String)and a DataFrame API withselect,filter,limit,distinct,sort,repartition,join/joinOn,dropColumns, andwithColumnRenamed. - Introspection methods on
DataFrame:schema,explain,cache,describe. - Streaming execution via
df.executeStream(allocator)for pulling batches incrementally without materializing the full result set in memory. - Logical plans from
datafusion-protoviaSessionContext.fromProto(byte[])— useful for cross-tool interop and for tools that compile their own surface language to DataFusion plans. The protobuf Java classes are generated from pinned.protofiles at the matching upstream DataFusion tag.
Writing data¶
DataFrame.writeCsvandDataFrame.writeJsonwith format-specific options.
Extending DataFusion from Java¶
- Scalar UDFs. Implement the
ScalarFunctioninterface to add a vectorised, Arrow-native SQL function written in Java. The implementation declares its SQL name, argument fields, return field, and volatility, and supplies a per-batchevaluatebody that reads input Arrow vectors and returns a result vector or a broadcast scalar.
public final class AddOne implements ScalarFunction {
private static final ArrowType INT32 = new ArrowType.Int(32, true);
@Override public String name() { return "add_one"; }
@Override public List<Field> argFields() { return List.of(Field.nullable("x", INT32)); }
@Override public Field returnField() { return Field.nullable("y", INT32); }
@Override public Volatility volatility() { return Volatility.IMMUTABLE; }
@Override
public ColumnarValue evaluate(BufferAllocator allocator, ScalarFunctionArgs args) {
IntVector in = (IntVector) args.args().get(0).vector();
IntVector out = new IntVector("add_one", allocator);
out.allocateNew(in.getValueCount());
for (int i = 0; i < in.getValueCount(); i++) {
if (in.isNull(i)) out.setNull(i);
else out.set(i, in.get(i) + 1);
}
out.setValueCount(in.getValueCount());
return ColumnarValue.array(out);
}
}
ctx.registerUdf(new ScalarUdf(new AddOne()));
Operational details¶
- JDK 17 or newer. Set
JAVA_HOMEto point at it. - Bundled native library. The published JAR contains pre-built native libraries for the
supported platforms;
NativeLibraryLoaderselects and loads the right one at startup. You do not need a Rust toolchain to use the library — only to build it from source. - Typed exception hierarchy. DataFusion errors crossing the JNI boundary surface as a typed
DataFusionExceptionhierarchy on the Java side rather than opaque runtime exceptions. - Configurable session.
SessionContextBuilderexposes batch size, target partitions, statistics collection, information schema, memory pool size, spill directory, and the built-in DataFusion cache manager. - Runtime visibility.
SessionContext.memoryUsageandruntimeStatsexpose DataFusion's internal allocator and runtime statistics for monitoring and capacity planning.
What this release is not¶
DataFusion Java is at 0.1.0 and the API will change. A few specific limits are worth calling out because they shape what you can build today:
- Scalar UDFs only. Aggregate, window, and table function UDFs are not yet exposed.
- Single-partition Java
TableProviders. DataFusion sees a Java-implemented table as one partition, and there is no projection or filter pushdown — DataFusion applies both on top of the batches the Java side returns. The interface is intentionally minimal so it can grow these capabilities as default methods without breaking implementations. SessionContextis not thread-safe. Use one per thread, or guard access externally.
Looking ahead¶
The rough direction for upcoming releases:
- Round out the extension points. Aggregate and window UDFs, multi-partition
TableProviders, and projection / filter pushdown into Java table providers. - Close the gap with DataFusion's Rust API. Anything that is reasonable to expose to a Java
caller should be reachable from
SessionContextandDataFramewith idiomatic Java signatures. - Track DataFusion releases. The aim is to follow DataFusion's release cadence so the Java binding stays current with new features and fixes in the engine.
How to get involved¶
- Source: https://github.com/apache/datafusion-java
- Issues: https://github.com/apache/datafusion-java/issues
- Documentation: see
docs/source/in the repository for the user and contributor guides.
Bug reports, design discussions, and pull requests are all welcome. Issues labelled good first issue are a good place to start.
Thank you¶
Thanks to everyone who contributed to making the first release possible, and to the broader DataFusion and Arrow communities whose work this project builds on directly.