Skip to content

Match CPython error type for non-ASCII struct format arguments#7681

Merged
youknowone merged 3 commits into
RustPython:mainfrom
changjoon-park:fix-struct-format-error-types
Apr 26, 2026
Merged

Match CPython error type for non-ASCII struct format arguments#7681
youknowone merged 3 commits into
RustPython:mainfrom
changjoon-park:fix-struct-format-error-types

Conversation

@changjoon-park

@changjoon-park changjoon-park commented Apr 25, 2026

Copy link
Copy Markdown
Contributor

Summary

Struct() (and the module-level struct.pack / unpack / calcsize) raised the wrong exception type when the format argument contained non-ASCII characters:

Input Before (RustPython) CPython 3.14.4
struct.pack('é', 42) UnicodeDecodeError: (empty message) UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 0: ordinal not in range(128)
struct.pack(b'\xe9', 42) UnicodeDecodeError: (empty message) struct.error: bad char in struct format

Both error type and message diverge from CPython.

Fix

Restructure IntoStructFormatBytes::try_from_object in crates/stdlib/src/pystruct.rs:

  • str input with non-ASCII char → raise UnicodeEncodeError("ascii", s, start, start+1, "ordinal not in range(128)"). The position is the first non-ASCII code point, matching what format.encode('ascii') would naturally produce in CPython.
  • bytes input with non-ASCII byte → raise struct.error("bad char in struct format") via the existing new_struct_error helper. CPython gets here because it passes the bytes through to the format parser, which then errors on the bad char; we emit the equivalent error directly without the indirection.

Other paths (str ASCII, bytes ASCII, non-str/bytes argument) are unchanged.

Verification

Targeted probes (CPython 3.14.4 byte-identical)

$ rustpython -c 'import struct; struct.pack("é", 42)'
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 0: ordinal not in range(128)

$ rustpython -c 'import struct; struct.pack(b"\xe9", 42)'
struct.error: bad char in struct format

Identical to CPython 3.14.4 output for both.

Test suite

$ rustpython -m unittest test.test_struct
Ran 43 tests in 0.6s
OK (skipped=6, expected failures=8)

No regressions; expected-failures unchanged (none of the 8 expected failures cover this code path).

Sanity check on valid usage

Verified that standard struct.pack / unpack / calcsize with valid str or bytes format strings, and the 's' / 'p' byte-string formats, all behave as before. No behaviour change on success paths.

Pre-push

  • cargo fmt --all --check clean
  • cargo clippy -p rustpython-stdlib --all-targets -- -D warnings clean
  • prek run --all-files — all 8 hooks pass

Scope

1 file, +30/−8 lines. Single function. The only behaviour change is the exception type and message for non-ASCII format input, which previously produced an incorrect UnicodeDecodeError with an empty message.

Out of scope (separate PRs)

While probing this fix, two further struct divergences were observed:

  • struct.pack('5s', 'hello') raises TypeError in RustPython but struct.error in CPython. This is a separate code path in the format-spec packer.
  • struct.pack('Q@', 42) is silently accepted in RustPython but raises struct.error: bad char in struct format in CPython. The shared FormatSpec::parse (also used by memoryview) silently skips endianness markers mid-format; addressing this requires deciding a parser-mode strategy.

Both deferred to follow-up PRs.

Summary by CodeRabbit

  • Bug Fixes

    • struct format parsing now yields more accurate errors: string format inputs raise a UnicodeEncodeError with CPython-like message/position; bytes format inputs raise a clearer struct.error for invalid characters.
  • Tests

    • Added tests verifying non-ASCII behavior for both string and bytes struct format inputs.

Struct() raised the wrong exception type when the format argument
contained non-ASCII characters:
- str input with non-ASCII char: RustPython raised UnicodeDecodeError
  with an empty message; CPython raises UnicodeEncodeError as if
  format.encode('ascii') had been called directly.
- bytes input with non-ASCII byte: same wrong UnicodeDecodeError;
  CPython passes the bytes through to the format parser, which then
  errors with struct.error("bad char in struct format").

Restructure IntoStructFormatBytes::try_from_object to:
- raise UnicodeEncodeError("ascii", s, start, start+1, "ordinal not
  in range(128)") for non-ASCII str, with start computed as the
  first non-ASCII code point position (matching CPython's natural
  encoding-error format);
- raise struct.error("bad char in struct format") for non-ASCII bytes,
  produced via the existing new_struct_error helper.

Probed byte-identical with CPython 3.14.4 for both cases. Full
test.test_struct (43 tests) passes with no regressions. Sanity-tested
all standard format/pack/unpack/calcsize call shapes remain unchanged.
@coderabbitai

coderabbitai Bot commented Apr 25, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 0bb1a36d-fe6a-47c4-819f-e6de35d1cd90

📥 Commits

Reviewing files that changed from the base of the PR and between d0ed04a and b0faf80.

📒 Files selected for processing (1)
  • extra_tests/snippets/stdlib_struct.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • extra_tests/snippets/stdlib_struct.py

📝 Walkthrough

Walkthrough

The struct module's format argument parsing now distinguishes ASCII validation behavior by input type: str inputs explicitly check for non-ASCII and raise UnicodeEncodeError matching CPython semantics, while bytes inputs convert via AsciiStr::from_ascii and map failures to struct.error. A previous generic Option-based error path was removed.

Changes

Cohort / File(s) Summary
Struct Format Validation
crates/stdlib/src/pystruct.rs
Refined IntoStructFormatBytes::try_from_object to perform per-type ASCII validation: emit UnicodeEncodeError for non-ASCII str formats (with CPython-like position/message) and map invalid bytes formats to struct.error("bad char in struct format"). Removed prior generic error branch.
Tests / Snippets
extra_tests/snippets/stdlib_struct.py
Added tests verifying struct.Struct error behavior: str non-ASCII inputs raise UnicodeEncodeError (checks encoding value in one case) and non-ASCII bytes inputs raise struct.error. Also retains earlier assertions about struct.error's __module__ and __name__.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • youknowone

Poem

🐰
I nibbled bytes and sniffed a string,
Found ASCII rules for everything,
Errors hop to their right place,
Positions noted, messages trace,
Hooray — formats sing! 🎵

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title clearly and accurately summarizes the main change: matching CPython's error behavior for non-ASCII struct format arguments, which is the core objective of the changeset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
crates/stdlib/src/pystruct.rs (1)

38-50: Optional: tighten the non‑ASCII detection.

unwrap_or(0) is logically unreachable here since we’ve already verified !s.isascii(), so a position must exist. Using .expect(...) (or restructuring to find the position first and only branch on Some) makes the invariant explicit and avoids silently returning a misleading start = 0 if the two ASCII checks ever drift. Also, the double-negation closure can be expressed a touch more directly.

♻️ Suggested simplification
-                    if !s.isascii() {
-                        let start = s
-                            .as_wtf8()
-                            .code_points()
-                            .position(|cp| !cp.to_char().is_some_and(|c| c.is_ascii()))
-                            .unwrap_or(0);
+                    if let Some(start) = s
+                        .as_wtf8()
+                        .code_points()
+                        .position(|cp| !cp.to_char().is_some_and(char::is_ascii))
+                    {
                         return Err(vm.new_unicode_encode_error_real(
                             vm.ctx.new_str("ascii"),
                             s,
                             start,
                             start + 1,
                             vm.ctx.new_str("ordinal not in range(128)"),
                         ));
                     }
                     s
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/stdlib/src/pystruct.rs` around lines 38 - 50, The non-ASCII detection
uses unwrap_or(0) which is unreachable after the prior !s.isascii() check;
compute the failing byte position with s.as_wtf8().code_points().position(...)
and replace unwrap_or(0) with expect("non-ASCII position must exist when
s.isascii() is false") to make the invariant explicit, and simplify the closure
from the double-negation form to a direct test such as using
to_char().map_or(true, |c| !c.is_ascii()) so the position search is clearer
before calling vm.new_unicode_encode_error_real with start and start+1.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@crates/stdlib/src/pystruct.rs`:
- Around line 38-50: The non-ASCII detection uses unwrap_or(0) which is
unreachable after the prior !s.isascii() check; compute the failing byte
position with s.as_wtf8().code_points().position(...) and replace unwrap_or(0)
with expect("non-ASCII position must exist when s.isascii() is false") to make
the invariant explicit, and simplify the closure from the double-negation form
to a direct test such as using to_char().map_or(true, |c| !c.is_ascii()) so the
position search is clearer before calling vm.new_unicode_encode_error_real with
start and start+1.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 7f90d8c3-9d00-436e-bef7-fb6f47f44ac8

📥 Commits

Reviewing files that changed from the base of the PR and between adafaf2 and 3209a59.

📒 Files selected for processing (1)
  • crates/stdlib/src/pystruct.rs

@ShaharNaveh ShaharNaveh left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

I'm surprised this change isn't make some tests to pass, maybe we need to update some of our tests files for this?

@youknowone youknowone left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes looks good. Was there any test catch this bug in Lib/test/test_struct.py? Otherwise adding a test case under extra_tests/snippets/stdlib_struct.py will be great.

Thank you so much for many bug fixes!

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@extra_tests/snippets/stdlib_struct.py`:
- Around line 83-88: The else branch currently uses `assert False`, which is
removed under -O and triggers ruff B011; replace the `assert False, "expected
UnicodeEncodeError"` with an explicit raise, e.g. raise AssertionError("expected
UnicodeEncodeError"), so the failure is always reported; keep the try calling
struct.Struct("\udc00") and the except block (except UnicodeEncodeError as e)
and the subsequent e.encoding check unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 9aeccf40-88a6-4b1d-872d-87a63236050d

📥 Commits

Reviewing files that changed from the base of the PR and between 3209a59 and d0ed04a.

📒 Files selected for processing (1)
  • extra_tests/snippets/stdlib_struct.py

Comment thread extra_tests/snippets/stdlib_struct.py Outdated
@changjoon-park

Copy link
Copy Markdown
Contributor Author

thanks for catching ! i just added the snippet at extra_tests/snippets/stdlib_struct.py covering three cases (str surrogate, str non-ASCII, bytes non-ASCII) with explicit error-type assertions, verified against CPython 3.14.4

for context on Lib/test/test_struct.py: "test_Struct_reinitialization" does exercise this (with self.assertRaises(UnicodeEncodeError): s.init('\udc00') at line 619), but it's currently @unittest.expectedFailure for an unrelated reason ('>h' != '>hh' on reinit), so unmasking it would require also fixing that bug. Hence the standalone snippet.

@youknowone youknowone left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@youknowone youknowone merged commit be43bb6 into RustPython:main Apr 26, 2026
20 checks passed
@changjoon-park changjoon-park deleted the fix-struct-format-error-types branch April 27, 2026 13:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants