refactor: pass csv and parquet read options via protobuf #29
Merged
Which issue does this PR close?
No tracking issue — follow-up to #28.
Rationale for this change
PR #28 established protobuf-over-JNI as the transport for `SessionContext` configuration. This PR applies the same pattern to the CSV and Parquet read paths.

Before this change, registering or reading a CSV file passed 14–16 raw JNI arguments: booleans, byte values, nullables encoded as `xxx_set`/`xxx_value` pairs, `-1L` sentinels for "unset" longs, and `FileCompressionType` shipped as its `name()` string. Parquet had the same shape with 7–9 args.

After: each call takes a single serialized `CsvReadOptionsProto`/`ParquetReadOptionsProto` byte array plus an optional Arrow-IPC schema byte array. Nullability, enums, and field evolution are now native to the wire format. The contributor guide documents the proto-over-JNI convention so future structured JNI calls follow the same pattern.

What changes are included in this PR?
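To make the wire shape concrete, here is a minimal sketch of what such a proto can look like. The field names and numbers below are illustrative assumptions, not the actual contents of `proto/csv_read_options.proto`:

```proto
syntax = "proto3";

package datafusion_java;

// Illustrative sketch only; field names and numbers are assumptions.
enum FileCompressionType {
  // Zero sentinel: rejected on the Rust side, never a silent default.
  FILE_COMPRESSION_TYPE_UNSPECIFIED = 0;
  FILE_COMPRESSION_TYPE_UNCOMPRESSED = 1;
  FILE_COMPRESSION_TYPE_GZIP = 2;
}

message CsvReadOptionsProto {
  // proto3 explicit presence replaces the old xxx_set/xxx_value JNI pairs.
  optional bool has_header = 1;
  optional uint32 delimiter = 2;
  // Presence tracking replaces the old -1L "unset" sentinel.
  optional int64 schema_infer_max_records = 3;
  FileCompressionType file_compression_type = 4;
}
```

With this shape, "unset" is carried by field presence and enum values evolve by adding numbered entries, rather than by widening a positional JNI signature.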
- New `proto/csv_read_options.proto` and `proto/parquet_read_options.proto`, mirroring the structure of `session_options.proto`. `FileCompressionType` is a proto3 enum with prefixed values and a `_UNSPECIFIED = 0` sentinel.
- `CsvReadOptions.toBytes()` and `ParquetReadOptions.toBytes()` serialize the Java options through the generated builders.
- `with_csv_options` and `with_parquet_options` on the Rust side decode the proto via prost and fold the fields into DataFusion's option structs. The `Unspecified` compression arm returns an error rather than silently defaulting.
- The native read/register entry points now take `(handle, [name,] path, byte[] optionsBytes, byte[] schemaIpcBytesOrNull)`.
- `native/src/schema.rs::decode_optional_schema` replaces two copies of identical Arrow-IPC schema-decode logic.
- The generated-code module is renamed `session_options` → `proto_gen`, since the single generated file now contains the types for all three protos (they share `package datafusion_java;`).
- A contributor-guide section, "Passing structured options across the JNI boundary", documents the convention, including the proto3 enum-prefix and `_UNSPECIFIED = 0` requirements.
- The public Java API is unchanged: every public setter on `CsvReadOptions`/`ParquetReadOptions` and every `register*`/`read*` method on `SessionContext` keeps the same signature.

Are these changes tested?
Yes:

- `CsvReadOptionsTest` (4 tests) and `ParquetReadOptionsTest` (3 tests) round-trip through `toBytes()`/`Proto.parseFrom(...)`, verifying every field, default presence/absence, and all five `FileCompressionType` values.
- `SessionContextCsvTest` and `SessionContextParquetOptionsTest` continue to exercise the public API end-to-end through JNI without modification, which is strong evidence that the new wire format reaches the Rust side correctly.
- `./mvnw test` passes (49 run, 0 failed, 12 skipped; the skips are pre-existing tpch-data integration tests).
- `cd native && cargo build && cargo clippy --all-targets -- -D warnings && cargo fmt --check` all come back clean.

Are there any user-facing changes?
No. Public Java API is unchanged.
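As an aside, the Unspecified-rejection pattern mentioned above can be sketched in plain Rust. The types below are hand-written stand-ins for the prost-generated enum and DataFusion's compression type, not the actual code:

```rust
// Stand-in for the prost-generated proto3 enum; in the real code prost
// produces this from the i32 wire value of file_compression_type.
#[derive(Debug, Clone, Copy, PartialEq)]
enum FileCompressionTypeProto {
    Unspecified,
    Uncompressed,
    Gzip,
}

// Stand-in for DataFusion's own compression type.
#[derive(Debug, PartialEq)]
enum FileCompressionType {
    Uncompressed,
    Gzip,
}

/// Fold the decoded proto value into the engine-side type. The zero
/// `Unspecified` sentinel is an error, never a silent default.
fn map_compression(
    p: FileCompressionTypeProto,
) -> Result<FileCompressionType, String> {
    match p {
        FileCompressionTypeProto::Unspecified => {
            Err("file_compression_type must be set".to_string())
        }
        FileCompressionTypeProto::Uncompressed => Ok(FileCompressionType::Uncompressed),
        FileCompressionTypeProto::Gzip => Ok(FileCompressionType::Gzip),
    }
}

fn main() {
    assert_eq!(
        map_compression(FileCompressionTypeProto::Gzip),
        Ok(FileCompressionType::Gzip)
    );
    assert!(map_compression(FileCompressionTypeProto::Unspecified).is_err());
}
```

Matching exhaustively over the proto enum means a future compression variant is a compile error at the mapping site rather than a silent fall-through.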