Match CPython error type for non-ASCII struct format arguments by changjoon-park · Pull Request #7681 · RustPython/RustPython

changjoon-park · 2026-04-25T07:19:05Z

Summary

Struct() (and the module-level struct.pack / unpack / calcsize) raised the wrong exception type when the format argument contained non-ASCII characters:

Input	Before (RustPython)	CPython 3.14.4
`struct.pack('é', 42)`	`UnicodeDecodeError:` (empty message)	`UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 0: ordinal not in range(128)`
`struct.pack(b'\xe9', 42)`	`UnicodeDecodeError:` (empty message)	`struct.error: bad char in struct format`

Both error type and message diverge from CPython.

Fix

Restructure IntoStructFormatBytes::try_from_object in crates/stdlib/src/pystruct.rs:

str input with non-ASCII char → raise UnicodeEncodeError("ascii", s, start, start+1, "ordinal not in range(128)"). The position is the first non-ASCII code point, matching what format.encode('ascii') would naturally produce in CPython.
bytes input with non-ASCII byte → raise struct.error("bad char in struct format") via the existing new_struct_error helper. CPython gets here because it passes the bytes through to the format parser, which then errors on the bad char; we emit the equivalent error directly without the indirection.

Other paths (str ASCII, bytes ASCII, non-str/bytes argument) are unchanged.

Verification

Targeted probes (CPython 3.14.4 byte-identical)

$ rustpython -c 'import struct; struct.pack("é", 42)'
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 0: ordinal not in range(128)

$ rustpython -c 'import struct; struct.pack(b"\xe9", 42)'
struct.error: bad char in struct format

Identical to CPython 3.14.4 output for both.

Test suite

$ rustpython -m unittest test.test_struct
Ran 43 tests in 0.6s
OK (skipped=6, expected failures=8)

No regressions; expected-failures unchanged (none of the 8 expected failures cover this code path).

Sanity check on valid usage

Verified that standard struct.pack / unpack / calcsize with valid str or bytes format strings, and the 's' / 'p' byte-string formats, all behave as before. No behaviour change on success paths.

Pre-push

cargo fmt --all --check clean
cargo clippy -p rustpython-stdlib --all-targets -- -D warnings clean
prek run --all-files — all 8 hooks pass

Scope

1 file, +30/−8 lines. Single function. The only behaviour change is the exception type and message for non-ASCII format input, which previously produced an incorrect UnicodeDecodeError with an empty message.

Out of scope (separate PRs)

While probing this fix, two further struct divergences were observed:

struct.pack('5s', 'hello') raises TypeError in RustPython but struct.error in CPython. This is a separate code path in the format-spec packer.
struct.pack('Q@', 42) is silently accepted in RustPython but raises struct.error: bad char in struct format in CPython. The shared FormatSpec::parse (also used by memoryview) silently skips endianness markers mid-format; addressing this requires deciding a parser-mode strategy.

Both deferred to follow-up PRs.

Summary by CodeRabbit

Bug Fixes
- struct format parsing now yields more accurate errors: string format inputs raise a UnicodeEncodeError with CPython-like message/position; bytes format inputs raise a clearer struct.error for invalid characters.
Tests
- Added tests verifying non-ASCII behavior for both string and bytes struct format inputs.

Struct() raised the wrong exception type when the format argument contained non-ASCII characters: - str input with non-ASCII char: RustPython raised UnicodeDecodeError with an empty message; CPython raises UnicodeEncodeError as if format.encode('ascii') had been called directly. - bytes input with non-ASCII byte: same wrong UnicodeDecodeError; CPython passes the bytes through to the format parser, which then errors with struct.error("bad char in struct format"). Restructure IntoStructFormatBytes::try_from_object to: - raise UnicodeEncodeError("ascii", s, start, start+1, "ordinal not in range(128)") for non-ASCII str, with start computed as the first non-ASCII code point position (matching CPython's natural encoding-error format); - raise struct.error("bad char in struct format") for non-ASCII bytes, produced via the existing new_struct_error helper. Probed byte-identical with CPython 3.14.4 for both cases. Full test.test_struct (43 tests) passes with no regressions. Sanity-tested all standard format/pack/unpack/calcsize call shapes remain unchanged.

coderabbitai · 2026-04-25T07:19:19Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 0bb1a36d-fe6a-47c4-819f-e6de35d1cd90

📥 Commits

Reviewing files that changed from the base of the PR and between d0ed04a and b0faf80.

📒 Files selected for processing (1)

extra_tests/snippets/stdlib_struct.py

🚧 Files skipped from review as they are similar to previous changes (1)

extra_tests/snippets/stdlib_struct.py

📝 Walkthrough

Walkthrough

The struct module's format argument parsing now distinguishes ASCII validation behavior by input type: str inputs explicitly check for non-ASCII and raise UnicodeEncodeError matching CPython semantics, while bytes inputs convert via AsciiStr::from_ascii and map failures to struct.error. A previous generic Option-based error path was removed.

Changes

Cohort / File(s)	Summary
Struct Format Validation `crates/stdlib/src/pystruct.rs`	Refined `IntoStructFormatBytes::try_from_object` to perform per-type ASCII validation: emit `UnicodeEncodeError` for non-ASCII `str` formats (with CPython-like position/message) and map invalid `bytes` formats to `struct.error("bad char in struct format")`. Removed prior generic error branch.
Tests / Snippets `extra_tests/snippets/stdlib_struct.py`	Added tests verifying `struct.Struct` error behavior: `str` non-ASCII inputs raise `UnicodeEncodeError` (checks `encoding` value in one case) and non-ASCII `bytes` inputs raise `struct.error`. Also retains earlier assertions about `struct.error`'s `__module__` and `__name__`.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

youknowone

Poem

🐰
I nibbled bytes and sniffed a string,
Found ASCII rules for everything,
Errors hop to their right place,
Positions noted, messages trace,
Hooray — formats sing! 🎵

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The pull request title clearly and accurately summarizes the main change: matching CPython's error behavior for non-ASCII struct format arguments, which is the core objective of the changeset.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (1)

crates/stdlib/src/pystruct.rs (1)

38-50: Optional: tighten the non‑ASCII detection.

unwrap_or(0) is logically unreachable here since we’ve already verified !s.isascii(), so a position must exist. Using .expect(...) (or restructuring to find the position first and only branch on Some) makes the invariant explicit and avoids silently returning a misleading start = 0 if the two ASCII checks ever drift. Also, the double-negation closure can be expressed a touch more directly.

♻️ Suggested simplification

-                    if !s.isascii() {
-                        let start = s
-                            .as_wtf8()
-                            .code_points()
-                            .position(|cp| !cp.to_char().is_some_and(|c| c.is_ascii()))
-                            .unwrap_or(0);
+                    if let Some(start) = s
+                        .as_wtf8()
+                        .code_points()
+                        .position(|cp| !cp.to_char().is_some_and(char::is_ascii))
+                    {
                         return Err(vm.new_unicode_encode_error_real(
                             vm.ctx.new_str("ascii"),
                             s,
                             start,
                             start + 1,
                             vm.ctx.new_str("ordinal not in range(128)"),
                         ));
                     }
                     s

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@crates/stdlib/src/pystruct.rs` around lines 38 - 50, The non-ASCII detection
uses unwrap_or(0) which is unreachable after the prior !s.isascii() check;
compute the failing byte position with s.as_wtf8().code_points().position(...)
and replace unwrap_or(0) with expect("non-ASCII position must exist when
s.isascii() is false") to make the invariant explicit, and simplify the closure
from the double-negation form to a direct test such as using
to_char().map_or(true, |c| !c.is_ascii()) so the position search is clearer
before calling vm.new_unicode_encode_error_real with start and start+1.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@crates/stdlib/src/pystruct.rs`:
- Around line 38-50: The non-ASCII detection uses unwrap_or(0) which is
unreachable after the prior !s.isascii() check; compute the failing byte
position with s.as_wtf8().code_points().position(...) and replace unwrap_or(0)
with expect("non-ASCII position must exist when s.isascii() is false") to make
the invariant explicit, and simplify the closure from the double-negation form
to a direct test such as using to_char().map_or(true, |c| !c.is_ascii()) so the
position search is clearer before calling vm.new_unicode_encode_error_real with
start and start+1.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 7f90d8c3-9d00-436e-bef7-fb6f47f44ac8

📥 Commits

Reviewing files that changed from the base of the PR and between adafaf2 and 3209a59.

📒 Files selected for processing (1)

crates/stdlib/src/pystruct.rs

ShaharNaveh

Thanks!

I'm surprised this change isn't make some tests to pass, maybe we need to update some of our tests files for this?

youknowone

The changes looks good. Was there any test catch this bug in Lib/test/test_struct.py? Otherwise adding a test case under extra_tests/snippets/stdlib_struct.py will be great.

Thank you so much for many bug fixes!

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@extra_tests/snippets/stdlib_struct.py`:
- Around line 83-88: The else branch currently uses `assert False`, which is
removed under -O and triggers ruff B011; replace the `assert False, "expected
UnicodeEncodeError"` with an explicit raise, e.g. raise AssertionError("expected
UnicodeEncodeError"), so the failure is always reported; keep the try calling
struct.Struct("\udc00") and the except block (except UnicodeEncodeError as e)
and the subsequent e.encoding check unchanged.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 9aeccf40-88a6-4b1d-872d-87a63236050d

📥 Commits

Reviewing files that changed from the base of the PR and between 3209a59 and d0ed04a.

📒 Files selected for processing (1)

extra_tests/snippets/stdlib_struct.py

changjoon-park · 2026-04-25T15:23:40Z

thanks for catching ! i just added the snippet at extra_tests/snippets/stdlib_struct.py covering three cases (str surrogate, str non-ASCII, bytes non-ASCII) with explicit error-type assertions, verified against CPython 3.14.4

for context on Lib/test/test_struct.py: "test_Struct_reinitialization" does exercise this (with self.assertRaises(UnicodeEncodeError): s.init('\udc00') at line 619), but it's currently @unittest.expectedFailure for an unrelated reason ('>h' != '>hh' on reinit), so unmasking it would require also fixing that bug. Hence the standalone snippet.

youknowone

Thank you!

coderabbitai Bot reviewed Apr 25, 2026

View reviewed changes

ShaharNaveh approved these changes Apr 25, 2026

View reviewed changes

youknowone requested changes Apr 25, 2026

View reviewed changes

Add regression test for non-ASCII format string error types

d0ed04a

coderabbitai Bot reviewed Apr 25, 2026

View reviewed changes

Comment thread extra_tests/snippets/stdlib_struct.py Outdated

Use raise AssertionError instead of assert False (B011)

b0faf80

youknowone approved these changes Apr 26, 2026

View reviewed changes

youknowone merged commit be43bb6 into RustPython:main Apr 26, 2026
20 checks passed

changjoon-park deleted the fix-struct-format-error-types branch April 27, 2026 13:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Match CPython error type for non-ASCII struct format arguments#7681

Match CPython error type for non-ASCII struct format arguments#7681
youknowone merged 3 commits into
RustPython:mainfrom
changjoon-park:fix-struct-format-error-types

changjoon-park commented Apr 25, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Apr 25, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

ShaharNaveh left a comment

Uh oh!

youknowone left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

changjoon-park commented Apr 25, 2026

Uh oh!

youknowone left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

changjoon-park commented Apr 25, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Fix

Verification

Targeted probes (CPython 3.14.4 byte-identical)

Test suite

Sanity check on valid usage

Pre-push

Scope

Out of scope (separate PRs)

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

ShaharNaveh left a comment

Choose a reason for hiding this comment

Uh oh!

youknowone left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

changjoon-park commented Apr 25, 2026

Uh oh!

youknowone left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

changjoon-park commented Apr 25, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 25, 2026 •

edited

Loading