refactor: pass csv and parquet read options via protobuf by andygrove · Pull Request #29 · apache/datafusion-java

andygrove · 2026-05-13T17:30:47Z

Which issue does this PR close?

No tracking issue — follow-up to #28.

Rationale for this change

PR #28 established protobuf-over-JNI as the transport for SessionContext configuration. This PR applies the same pattern to the CSV and Parquet read paths.

Before this change, registering or reading a CSV file passed 14–16 raw JNI arguments: booleans, byte values, nullable fields encoded as xxx_set / xxx_value pairs, -1L sentinels for "unset" longs, and FileCompressionType shipped as its name() string. Parquet had the same shape with 7–9 arguments.

After this change, each call takes a single serialized CsvReadOptionsProto / ParquetReadOptionsProto byte array plus an optional Arrow-IPC schema byte array. Nullability, enums, and field evolution are now handled natively by the wire format. The contributor guide documents the proto-over-JNI convention so future structured JNI calls follow the same pattern.
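To illustrate why presence on the wire removes the need for -1L sentinels and xxx_set / xxx_value pairs, here is a minimal sketch that hand-encodes a single optional int64 field in the protobuf varint wire format. It is not the generated code from this PR: the class name, the field number, and the "limit" field are hypothetical, and the real implementation goes through the generated builders instead.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical sketch; not the generated CsvReadOptionsProto code. */
final class WirePresenceSketch {
    private static final int WIRE_TYPE_VARINT = 0;
    private static final int FIELD_LIMIT = 1; // hypothetical field number

    /** Appends an unsigned base-128 varint, the protobuf encoding for int64. */
    static void writeVarint(ByteArrayOutputStream out, long value) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
    }

    /** Reads one varint starting at pos[0] and advances pos[0] past it. */
    static long readVarint(byte[] bytes, int[] pos) {
        long result = 0;
        for (int shift = 0; shift < 64; shift += 7) {
            byte b = bytes[pos[0]++];
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
        }
        throw new IllegalArgumentException("malformed varint");
    }

    /** Encodes the optional field: an unset value writes no bytes at all. */
    static byte[] encode(Long limit) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (limit != null) {
            writeVarint(out, (FIELD_LIMIT << 3) | WIRE_TYPE_VARINT); // field tag
            writeVarint(out, limit);                                 // field value
        }
        return out.toByteArray();
    }

    /** Decodes the field; returns null when it is absent from the bytes. */
    static Long decode(byte[] bytes) {
        Long limit = null;
        int[] pos = {0};
        while (pos[0] < bytes.length) {
            long tag = readVarint(bytes, pos);
            if (tag != ((FIELD_LIMIT << 3) | WIRE_TYPE_VARINT)) {
                throw new IllegalArgumentException("unexpected tag " + tag);
            }
            limit = readVarint(bytes, pos);
        }
        return limit;
    }

    public static void main(String[] args) {
        // An unset field and an explicit -1 are distinguishable without a sentinel.
        System.out.println(decode(encode(null)));
        System.out.println(decode(encode(-1L)));
    }
}
```

Because absence is simply "no bytes for that field", -1 stays available as an ordinary value and no separate boolean flag has to cross the JNI boundary.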

What changes are included in this PR?

New proto/csv_read_options.proto and proto/parquet_read_options.proto, mirroring the structure of session_options.proto. FileCompressionType is a proto3 enum with prefixed values and a _UNSPECIFIED = 0 sentinel.
CsvReadOptions.toBytes() and ParquetReadOptions.toBytes() serialize the Java options through the generated builders.
with_csv_options and with_parquet_options on the Rust side decode the proto via prost and fold the fields into DataFusion's option structs. The Unspecified compression arm returns an error rather than silently defaulting.
Four JNI methods collapse to 4 or 5 arguments each: (handle, [name,] path, byte[] optionsBytes, byte[] schemaIpcBytesOrNull).
New native/src/schema.rs::decode_optional_schema replaces two copies of identical Arrow-IPC schema-decode logic.
Renamed Rust module session_options → proto_gen since the single generated file now contains the types for all three protos (they share package datafusion_java;).
New contributor-guide section Passing structured options across the JNI boundary documents the convention, including proto3 enum-prefix and _UNSPECIFIED = 0 requirements.
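The enum convention above can be sketched in plain Java. In proto3 an enum field that is absent from the message decodes to the value numbered 0, so reserving 0 for _UNSPECIFIED lets the receiver tell "never set" apart from a real choice and reject it. Prefixing each value with the enum name also avoids collisions, since proto enum values are scoped as siblings of their enum type rather than children of it. The value names and numbers below are assumptions for illustration only and are not copied from the .proto files in this PR; the actual check lives on the Rust side.

```java
/** Hypothetical sketch of the enum convention; not the generated code. */
final class CompressionEnumSketch {

    /** Illustrative mirror of a proto3 enum; names and numbers are assumed. */
    enum FileCompressionType {
        FILE_COMPRESSION_TYPE_UNSPECIFIED(0), // proto3 default when the field is absent
        FILE_COMPRESSION_TYPE_UNCOMPRESSED(1),
        FILE_COMPRESSION_TYPE_GZIP(2);

        final int number;

        FileCompressionType(int number) {
            this.number = number;
        }

        /** Maps a wire number back to an enum value. */
        static FileCompressionType forNumber(int number) {
            for (FileCompressionType t : values()) {
                if (t.number == number) {
                    return t;
                }
            }
            throw new IllegalArgumentException("unknown enum number " + number);
        }
    }

    /** Rejects the sentinel instead of silently picking a default. */
    static FileCompressionType requireSpecified(FileCompressionType t) {
        if (t == FileCompressionType.FILE_COMPRESSION_TYPE_UNSPECIFIED) {
            throw new IllegalArgumentException("file compression type was not set");
        }
        return t;
    }

    public static void main(String[] args) {
        // A set value passes through; number 0 would throw instead.
        System.out.println(requireSpecified(FileCompressionType.forNumber(2)));
    }
}
```

Failing loudly on the sentinel means a sender that forgets to populate the field surfaces as an error at the boundary rather than as an unexpected compression setting later on.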

The public Java API is unchanged: every public setter on CsvReadOptions / ParquetReadOptions and every register* / read* method on SessionContext keeps the same signature.

Are these changes tested?

Yes:

CsvReadOptionsTest (4 tests) and ParquetReadOptionsTest (3 tests) round-trip through toBytes() / Proto.parseFrom(...), verifying every field, default presence/absence, and all five FileCompressionType values.
The existing SessionContextCsvTest and SessionContextParquetOptionsTest continue to exercise the public API end-to-end through JNI without modification — strong evidence that the new wire format reaches the Rust side correctly.
Full ./mvnw test passes (49 run, 0 failed, 12 skipped — skips are pre-existing tpch-data integration tests).
cd native && cargo build && cargo clippy --all-targets -- -D warnings && cargo fmt --check all run clean.

Are there any user-facing changes?

No. Public Java API is unchanged.

andygrove added 10 commits May 13, 2026 11:01

feat: add csv_read_options.proto

26d22d6

feat: add parquet_read_options.proto

7fedfc6

build: include csv and parquet read-options protos

5db077c

refactor(proto): prefix FileCompressionType enum values

b068284

refactor(build): single source of truth for proto file list

4e182d7

refactor: pass csv and parquet read options via protobuf

111e70e

test: round-trip read options through protobuf

f13b51b

docs(contributor-guide): document proto-over-JNI convention

ab37f87

refactor(native): rename session_options module to proto_gen

9136495

docs(contributor-guide): document proto3 enum naming convention

5164e00

andygrove merged commit 9dacdc8 into apache:main May 13, 2026
2 checks passed

andygrove deleted the feat/proto-read-options branch May 13, 2026 19:00
