Apache DataFusion Java

Java bindings for Apache DataFusion, the Rust-based query engine. SQL queries run in native code and results are returned to the JVM as Apache Arrow record batches via the Arrow C Data Interface — no per-row JNI calls, no row-by-row copies.

Project status: early development. This is a brand-new project. The API is small, will change without notice, and there is no published release. Do not depend on it from production code yet. Bug reports, design feedback, and contributions are very welcome.

Quickstart

import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.ipc.ArrowReader;
import org.apache.datafusion.DataFrame;
import org.apache.datafusion.SessionContext;

try (var allocator = new RootAllocator();
     var ctx = new SessionContext()) {

    ctx.registerParquet("orders", "/path/to/orders.parquet");

    try (DataFrame df = ctx.sql(
            "SELECT o_orderpriority, COUNT(*) AS n " +
            "FROM orders GROUP BY o_orderpriority");
         ArrowReader reader = df.collect(allocator)) {
        while (reader.loadNextBatch()) {
            var batch = reader.getVectorSchemaRoot();
            // ... consume batch ...
        }
    }
}

The current public surface mirrors a small slice of the Rust DataFusion API:

SessionContext.sql(String query) — parse and plan a SQL query, returning a lazy DataFrame. No execution happens yet.
DataFrame.collect(BufferAllocator allocator) — execute the plan and return the result batches as an ArrowReader. Consumes the DataFrame; the caller closes the reader, and the allocator must outlive it.
SessionContext.registerParquet(String name, String path) — register a local Parquet file as a SQL table.

Both SessionContext and DataFrame are AutoCloseable and not thread-safe.

Prerequisites

JDK 17 or newer
Rust toolchain (stable, installed via rustup)
tpchgen-cli — only needed to generate test data for the Parquet integration test (cargo install tpchgen-cli)

Maven is bundled via the ./mvnw wrapper; no separate Maven install required.

Build & test

make test

This builds the native Rust crate and runs the JUnit tests. The steps can be run individually:

cd native && cargo build
./mvnw test

The native library must be built before running JVM tests.

Test data

The Parquet integration test reads TPC-H SF1 data (~345 MB across 8 tables in Snappy-compressed Parquet). Generate it once with:

make tpch-data

Tests that need this data skip cleanly if it is missing. make clean does not remove tpch-data/ — delete it manually to reclaim the disk space.

Repository layout

src/ — Java sources and tests
native/ — Rust crate that exposes DataFusion over JNI and the Arrow C Data Interface

Roadmap

Near-term priorities, roughly in order:

Session configuration. Expose SessionConfig and RuntimeEnv settings (target partitions, batch size, memory pool, default catalog, …) so callers can tune execution from the JVM.
Full SessionContext and DataFrame APIs. Expand beyond sql and registerParquet to mirror the Rust API: table registration variants, read_* / write_* entry points, and DataFrame transformations such as select, filter, join, aggregate, sort, plus result-materialization variants (show, count, streaming collection).
JVM-side plan construction via Protobuf. Build DataFusion logical and physical plans on the JVM using the existing DataFusion Protobuf representation, then ship them to the native side for execution. This lets plans assembled by other JVM tools (Spark, Flink, Beam, custom planners) run on DataFusion without having to go through SQL.
Java-defined expressions. Allow user expressions / UDFs to be implemented in Java and invoked from the native plan, operating on Apache Arrow vectors (Java side) so evaluation stays vectorized end-to-end and avoids row-by-row JNI crossings.

These are intentionally large items — design discussion via GitHub issues before implementation is welcome.

Contributing

This project follows the Apache DataFusion contribution model. Issues and pull requests are welcome — please open a GitHub issue to discuss any significant change before sending a PR.

License

Licensed under the Apache License, Version 2.0. See NOTICE.txt for required attributions.

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
.github/workflows		.github/workflows
.mvn/wrapper		.mvn/wrapper
native		native
src		src
.asf.yaml		.asf.yaml
.gitignore		.gitignore
LICENSE.txt		LICENSE.txt
Makefile		Makefile
NOTICE.txt		NOTICE.txt
README.md		README.md
mvnw		mvnw
mvnw.cmd		mvnw.cmd
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Apache DataFusion Java

Quickstart

Prerequisites

Build & test

Test data

Repository layout

Roadmap

Contributing

License

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Apache DataFusion Java

Quickstart

Prerequisites

Build & test

Test data

Repository layout

Roadmap

Contributing

License

About

Topics

Resources

License

Code of conduct

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages