Match CPython error type for non-ASCII struct format arguments#7681
Conversation
Struct() raised the wrong exception type when the format argument
contained non-ASCII characters:
- str input with non-ASCII char: RustPython raised UnicodeDecodeError
with an empty message; CPython raises UnicodeEncodeError as if
format.encode('ascii') had been called directly.
- bytes input with non-ASCII byte: same wrong UnicodeDecodeError;
CPython passes the bytes through to the format parser, which then
errors with struct.error("bad char in struct format").
Restructure IntoStructFormatBytes::try_from_object to:
- raise UnicodeEncodeError("ascii", s, start, start+1, "ordinal not
in range(128)") for non-ASCII str, with start computed as the
first non-ASCII code point position (matching CPython's natural
encoding-error format);
- raise struct.error("bad char in struct format") for non-ASCII bytes,
produced via the existing new_struct_error helper.
Probed byte-identical with CPython 3.14.4 for both cases. Full
test.test_struct (43 tests) passes with no regressions. Sanity-tested
all standard format/pack/unpack/calcsize call shapes remain unchanged.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughThe Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
crates/stdlib/src/pystruct.rs (1)
38-50: Optional: tighten the non‑ASCII detection.
unwrap_or(0)is logically unreachable here since we’ve already verified!s.isascii(), so a position must exist. Using.expect(...)(or restructuring to find the position first and only branch onSome) makes the invariant explicit and avoids silently returning a misleadingstart = 0if the two ASCII checks ever drift. Also, the double-negation closure can be expressed a touch more directly.♻️ Suggested simplification
- if !s.isascii() { - let start = s - .as_wtf8() - .code_points() - .position(|cp| !cp.to_char().is_some_and(|c| c.is_ascii())) - .unwrap_or(0); + if let Some(start) = s + .as_wtf8() + .code_points() + .position(|cp| !cp.to_char().is_some_and(char::is_ascii)) + { return Err(vm.new_unicode_encode_error_real( vm.ctx.new_str("ascii"), s, start, start + 1, vm.ctx.new_str("ordinal not in range(128)"), )); } s🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/stdlib/src/pystruct.rs` around lines 38 - 50, The non-ASCII detection uses unwrap_or(0) which is unreachable after the prior !s.isascii() check; compute the failing byte position with s.as_wtf8().code_points().position(...) and replace unwrap_or(0) with expect("non-ASCII position must exist when s.isascii() is false") to make the invariant explicit, and simplify the closure from the double-negation form to a direct test such as using to_char().map_or(true, |c| !c.is_ascii()) so the position search is clearer before calling vm.new_unicode_encode_error_real with start and start+1.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@crates/stdlib/src/pystruct.rs`:
- Around line 38-50: The non-ASCII detection uses unwrap_or(0) which is
unreachable after the prior !s.isascii() check; compute the failing byte
position with s.as_wtf8().code_points().position(...) and replace unwrap_or(0)
with expect("non-ASCII position must exist when s.isascii() is false") to make
the invariant explicit, and simplify the closure from the double-negation form
to a direct test such as using to_char().map_or(true, |c| !c.is_ascii()) so the
position search is clearer before calling vm.new_unicode_encode_error_real with
start and start+1.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 7f90d8c3-9d00-436e-bef7-fb6f47f44ac8
📒 Files selected for processing (1)
crates/stdlib/src/pystruct.rs
ShaharNaveh
left a comment
There was a problem hiding this comment.
Thanks!
I'm surprised this change isn't make some tests to pass, maybe we need to update some of our tests files for this?
youknowone
left a comment
There was a problem hiding this comment.
The changes looks good. Was there any test catch this bug in Lib/test/test_struct.py? Otherwise adding a test case under extra_tests/snippets/stdlib_struct.py will be great.
Thank you so much for many bug fixes!
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@extra_tests/snippets/stdlib_struct.py`:
- Around line 83-88: The else branch currently uses `assert False`, which is
removed under -O and triggers ruff B011; replace the `assert False, "expected
UnicodeEncodeError"` with an explicit raise, e.g. raise AssertionError("expected
UnicodeEncodeError"), so the failure is always reported; keep the try calling
struct.Struct("\udc00") and the except block (except UnicodeEncodeError as e)
and the subsequent e.encoding check unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 9aeccf40-88a6-4b1d-872d-87a63236050d
📒 Files selected for processing (1)
extra_tests/snippets/stdlib_struct.py
|
thanks for catching ! i just added the snippet at extra_tests/snippets/stdlib_struct.py covering three cases (str surrogate, str non-ASCII, bytes non-ASCII) with explicit error-type assertions, verified against CPython 3.14.4 for context on Lib/test/test_struct.py: "test_Struct_reinitialization" does exercise this (with self.assertRaises(UnicodeEncodeError): s.init('\udc00') at line 619), but it's currently |
Summary
Struct()(and the module-levelstruct.pack/unpack/calcsize) raised the wrong exception type when the format argument contained non-ASCII characters:struct.pack('é', 42)UnicodeDecodeError:(empty message)UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 0: ordinal not in range(128)struct.pack(b'\xe9', 42)UnicodeDecodeError:(empty message)struct.error: bad char in struct formatBoth error type and message diverge from CPython.
Fix
Restructure
IntoStructFormatBytes::try_from_objectincrates/stdlib/src/pystruct.rs:UnicodeEncodeError("ascii", s, start, start+1, "ordinal not in range(128)"). The position is the first non-ASCII code point, matching whatformat.encode('ascii')would naturally produce in CPython.struct.error("bad char in struct format")via the existingnew_struct_errorhelper. CPython gets here because it passes the bytes through to the format parser, which then errors on the bad char; we emit the equivalent error directly without the indirection.Other paths (str ASCII, bytes ASCII, non-str/bytes argument) are unchanged.
Verification
Targeted probes (CPython 3.14.4 byte-identical)
Identical to CPython 3.14.4 output for both.
Test suite
No regressions; expected-failures unchanged (none of the 8 expected failures cover this code path).
Sanity check on valid usage
Verified that standard
struct.pack/unpack/calcsizewith valid str or bytes format strings, and the's'/'p'byte-string formats, all behave as before. No behaviour change on success paths.Pre-push
cargo fmt --all --checkcleancargo clippy -p rustpython-stdlib --all-targets -- -D warningscleanprek run --all-files— all 8 hooks passScope
1 file, +30/−8 lines. Single function. The only behaviour change is the exception type and message for non-ASCII format input, which previously produced an incorrect
UnicodeDecodeErrorwith an empty message.Out of scope (separate PRs)
While probing this fix, two further
structdivergences were observed:struct.pack('5s', 'hello')raisesTypeErrorin RustPython butstruct.errorin CPython. This is a separate code path in the format-spec packer.struct.pack('Q@', 42)is silently accepted in RustPython but raisesstruct.error: bad char in struct formatin CPython. The sharedFormatSpec::parse(also used bymemoryview) silently skips endianness markers mid-format; addressing this requires deciding a parser-mode strategy.Both deferred to follow-up PRs.
Summary by CodeRabbit
Bug Fixes
Tests