Java bindings for Apache DataFusion, the Rust-based query engine. SQL queries run in native code and results are returned to the JVM as Apache Arrow record batches via the Arrow C Data Interface — no per-row JNI calls, no row-by-row copies.
Project status: early development. The API is small, will change without notice, and there is no published release. Do not depend on it from production code yet. Bug reports, design feedback, and contributions are very welcome.
```java
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.ipc.ArrowReader;
import org.apache.datafusion.DataFrame;
import org.apache.datafusion.SessionContext;

try (var allocator = new RootAllocator();
     var ctx = new SessionContext()) {
    ctx.registerParquet("orders", "/path/to/orders.parquet");
    try (DataFrame df = ctx.sql(
             "SELECT o_orderpriority, COUNT(*) AS n " +
             "FROM orders GROUP BY o_orderpriority");
         ArrowReader reader = df.collect(allocator)) {
        while (reader.loadNextBatch()) {
            var batch = reader.getVectorSchemaRoot();
            // ... consume batch ...
        }
    }
}
```

The current public surface mirrors a small slice of the Rust DataFusion API:
- `SessionContext.sql(String query)` — parse and plan a SQL query, returning a lazy `DataFrame`. No execution happens yet.
- `DataFrame.collect(BufferAllocator allocator)` — execute the plan and return the result batches as an `ArrowReader`. Consumes the `DataFrame`; the caller closes the reader, and the allocator must outlive it.
- `SessionContext.registerParquet(String name, String path)` — register a local Parquet file as a SQL table.
Both `SessionContext` and `DataFrame` are `AutoCloseable`, and neither is thread-safe.
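Concretely, the lifetime rules fall out of try-with-resources ordering: the reader closes before the allocator that backs its buffers, and each `DataFrame` is collected at most once. Because neither class is thread-safe, concurrent callers should each create their own `SessionContext`. A minimal sketch under those assumptions; the class, helper name, and query are illustrative, not part of the API:

```java
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.ipc.ArrowReader;
import org.apache.datafusion.DataFrame;
import org.apache.datafusion.SessionContext;

public final class QueryRunner {
    // Hypothetical helper: one SessionContext per call, so concurrent
    // callers never share a context or a DataFrame.
    public static long countRows(BufferAllocator allocator, String parquetPath)
            throws Exception {
        long rows = 0;
        // try-with-resources closes in reverse declaration order:
        // reader first, then the DataFrame, then the context. The
        // allocator is owned by the caller and outlives the reader.
        try (var ctx = new SessionContext()) {
            ctx.registerParquet("t", parquetPath);
            try (DataFrame df = ctx.sql("SELECT * FROM t");
                 ArrowReader reader = df.collect(allocator)) {
                while (reader.loadNextBatch()) {
                    rows += reader.getVectorSchemaRoot().getRowCount();
                }
            }
        }
        return rows;
    }
}
```

As in the quickstart, the reader's lifetime is nested strictly inside the context's, and the allocator is owned by the caller.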
- JDK 17 or newer
- Rust toolchain (stable, installed via rustup)
- `tpchgen-cli` — only needed to generate test data for the Parquet integration test (`cargo install tpchgen-cli`)
Maven is bundled via the `./mvnw` wrapper; no separate Maven install is required.
```sh
make test
```
This builds the native Rust crate and runs the JUnit tests. The steps can be run individually:
```sh
cd native && cargo build
./mvnw test
```
The native library must be built before running JVM tests.
The Parquet integration test reads TPC-H SF1 data (~345 MB across 8 tables in Snappy-compressed Parquet). Generate it once with:
```sh
make tpch-data
```
Tests that need this data skip cleanly if it is missing. `make clean` does not remove `tpch-data/`; delete it manually to reclaim the disk space.
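A sketch of how such a guard can look with JUnit 5 assumptions; the class name, directory constant, and test body are illustrative, not the repository's actual test code:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import org.junit.jupiter.api.Assumptions;
import org.junit.jupiter.api.Test;

class TpchIntegrationTest {
    // Illustrative path; the real layout may differ.
    private static final Path TPCH_DIR = Path.of("tpch-data");

    @Test
    void aggregatesOrders() throws Exception {
        // Marks the test as skipped (not failed) when the data is absent.
        Assumptions.assumeTrue(Files.isDirectory(TPCH_DIR),
            "TPC-H data not generated; run `make tpch-data`");
        // ... query the Parquet files as in the quickstart ...
    }
}
```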
- `src/` — Java sources and tests
- `native/` — Rust crate that exposes DataFusion over JNI and the Arrow C Data Interface
Near-term priorities, roughly in order:
- Session configuration. Expose `SessionConfig` and `RuntimeEnv` settings (target partitions, batch size, memory pool, default catalog, …) so callers can tune execution from the JVM.
- Full `SessionContext` and `DataFrame` APIs. Expand beyond `sql` and `registerParquet` to mirror the Rust API: table registration variants, `read_*`/`write_*` entry points, and DataFrame transformations such as `select`, `filter`, `join`, `aggregate`, `sort`, plus result-materialization variants (`show`, `count`, streaming collection).
- JVM-side plan construction via Protobuf. Build DataFusion logical and physical plans on the JVM using the existing DataFusion Protobuf representation, then ship them to the native side for execution. This lets plans assembled by other JVM tools (Spark, Flink, Beam, custom planners) run on DataFusion without going through SQL.
- Java-defined expressions. Allow user expressions and UDFs to be implemented in Java and invoked from the native plan, operating on Apache Arrow vectors on the Java side so evaluation stays vectorized end to end and avoids row-by-row JNI crossings (see the sketch after this list).
These are intentionally large items — design discussion via GitHub issues before implementation is welcome.
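To make the last roadmap item concrete, here is one possible shape for the Java-side hook. This is purely a design sketch for discussion; the interface name and signature are invented, and nothing like it exists in the codebase yet:

```java
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.FieldVector;
import org.apache.arrow.vector.VectorSchemaRoot;

// Hypothetical sketch only: a scalar UDF that the native plan would call
// once per record batch, keeping evaluation vectorized instead of
// crossing JNI once per row. Not part of the current API.
public interface VectorizedScalarUdf {
    /**
     * Evaluate the expression over a whole input batch and return one
     * output vector with {@code batch.getRowCount()} values, allocated
     * from {@code allocator}. The caller owns and closes the result.
     */
    FieldVector evaluate(BufferAllocator allocator, VectorSchemaRoot batch);
}
```

Evaluating a whole `VectorSchemaRoot` per call is what would keep the number of JNI crossings proportional to the number of batches rather than the number of rows.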
This project follows the Apache DataFusion contribution model. Issues and pull requests are welcome — please open a GitHub issue to discuss any significant change before sending a PR.
Licensed under the Apache License, Version 2.0. See NOTICE.txt for required attributions.