
refactor: pass csv and parquet read options via protobuf #29

Merged
andygrove merged 10 commits into apache:main from andygrove:feat/proto-read-options
May 13, 2026

Conversation

@andygrove
Member

Which issue does this PR close?

No tracking issue — follow-up to #28.

Rationale for this change

PR #28 established protobuf-over-JNI as the transport for SessionContext configuration. This PR applies the same pattern to the CSV and Parquet read paths.

Before this change, registering or reading a CSV file passed 14–16 raw JNI arguments: booleans, byte values, nullable fields encoded as xxx_set / xxx_value pairs, -1L sentinels for "unset" longs, and FileCompressionType shipped as its name() string. Parquet had the same shape with 7–9 arguments.
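
To make the old shape concrete, here is a minimal sketch of the two nullability encodings described above. It is an illustration only: the class, method, and constant names are hypothetical and are not taken from the project's code.

```java
import java.util.OptionalLong;

// Hypothetical illustration only; none of these names come from the project.
final class FlattenedArgsSketch {
    // Sentinel meaning "unset" for long-valued options.
    static final long UNSET = -1L;

    // Sending side: collapse an optional long into one primitive argument.
    static long toSentinel(OptionalLong value) {
        return value.isPresent() ? value.getAsLong() : UNSET;
    }

    // Receiving side: recover presence from the sentinel.
    static OptionalLong fromSentinel(long raw) {
        return raw == UNSET ? OptionalLong.empty() : OptionalLong.of(raw);
    }

    // A nullable byte (for example a quote character) travels as two arguments:
    // a flag saying whether it was set, and the value itself.
    static Byte fromPair(boolean isSet, byte value) {
        return isSet ? Byte.valueOf(value) : null;
    }
}
```

Under this scheme a caller that genuinely sets a long option to -1 is indistinguishable from one that leaves it unset, and every nullable field costs two positional arguments. Those are the kinds of problems a wire format with native field presence avoids.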

After: each call takes a single serialized CsvReadOptionsProto / ParquetReadOptionsProto byte array plus an optional Arrow-IPC schema byte array. Nullability, enums, and field evolution are now native to the wire format. The contributor guide documents the proto-over-JNI convention so future structured JNI calls follow the same pattern.

What changes are included in this PR?

  • New proto/csv_read_options.proto and proto/parquet_read_options.proto, mirroring the structure of session_options.proto. FileCompressionType is a proto3 enum with prefixed values and a _UNSPECIFIED = 0 sentinel.
  • CsvReadOptions.toBytes() and ParquetReadOptions.toBytes() serialize the Java options through the generated builders.
  • with_csv_options and with_parquet_options on the Rust side decode the proto via prost and fold the fields into DataFusion's option structs. The Unspecified compression arm returns an error rather than silently defaulting.
  • Four JNI methods collapse to 4 or 5 arguments each: (handle, [name,] path, byte[] optionsBytes, byte[] schemaIpcBytesOrNull).
  • New native/src/schema.rs::decode_optional_schema replaces two copies of identical Arrow-IPC schema-decode logic.
  • Renamed the Rust module from session_options to proto_gen, since the single generated file now contains the types for all three protos (they share package datafusion_java;).
  • New contributor-guide section Passing structured options across the JNI boundary documents the convention, including proto3 enum-prefix and _UNSPECIFIED = 0 requirements.
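
The enum convention in the list above (prefixed value names, a reserved _UNSPECIFIED = 0, and an error instead of a silent default) can be sketched in plain Java. This is a hedged illustration, not the PR's code: WireCompression is a hand-written stand-in for the generated proto enum, the five FileCompressionType names and the numbering 1 to 5 are assumptions, and in the PR the rejection of the unspecified value happens on the Rust side.

```java
// Hypothetical stand-ins; the real generated classes and field numbers may differ.
final class CompressionEnumSketch {
    // Assumed Java-side enum (value names are an assumption).
    enum FileCompressionType { GZIP, BZIP2, XZ, ZSTD, UNCOMPRESSED }

    // Stand-in for a proto3 enum: every value carries the type-name prefix,
    // and number 0 is reserved for "unspecified".
    enum WireCompression {
        FILE_COMPRESSION_TYPE_UNSPECIFIED(0),
        FILE_COMPRESSION_TYPE_GZIP(1),
        FILE_COMPRESSION_TYPE_BZIP2(2),
        FILE_COMPRESSION_TYPE_XZ(3),
        FILE_COMPRESSION_TYPE_ZSTD(4),
        FILE_COMPRESSION_TYPE_UNCOMPRESSED(5);

        final int number;

        WireCompression(int number) {
            this.number = number;
        }
    }

    static final String PREFIX = "FILE_COMPRESSION_TYPE_";

    // Sending side: map the API enum onto the prefixed wire enum by name.
    static WireCompression toWire(FileCompressionType type) {
        return WireCompression.valueOf(PREFIX + type.name());
    }

    // Receiving side: reject 0 (unspecified) and unknown numbers
    // rather than silently picking a default.
    static FileCompressionType fromWire(int number) {
        for (WireCompression w : WireCompression.values()) {
            if (w.number == number && w != WireCompression.FILE_COMPRESSION_TYPE_UNSPECIFIED) {
                return FileCompressionType.valueOf(w.name().substring(PREFIX.length()));
            }
        }
        throw new IllegalArgumentException("unspecified or unknown compression type: " + number);
    }
}
```

Reserving 0 matters because proto3 uses the zero value as the default for an enum field, so a field the sender never populated reads as 0 on the receiver. Treating that value as an error surfaces a missing field instead of quietly mapping it to a real compression type.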

The public Java API is unchanged: every public setter on CsvReadOptions / ParquetReadOptions and every register* / read* method on SessionContext keeps the same signature.

Are these changes tested?

Yes:

  • CsvReadOptionsTest (4 tests) and ParquetReadOptionsTest (3 tests) round-trip through toBytes() / Proto.parseFrom(...), verifying every field, default presence/absence, and all five FileCompressionType values.
  • The existing SessionContextCsvTest and SessionContextParquetOptionsTest continue to exercise the public API end-to-end through JNI without modification — strong evidence that the new wire format reaches the Rust side correctly.
  • Full ./mvnw test passes (49 run, 0 failed, 12 skipped — skips are pre-existing tpch-data integration tests).
  • cd native && cargo build && cargo clippy --all-targets -- -D warnings && cargo fmt --check runs clean.

Are there any user-facing changes?

No. Public Java API is unchanged.

@andygrove andygrove merged commit 9dacdc8 into apache:main May 13, 2026
2 checks passed
@andygrove andygrove deleted the feat/proto-read-options branch May 13, 2026 19:00