Update test_utf8source from v3.14.3 and implement it by youknowone · Pull Request #7318 · RustPython/RustPython

youknowone · 2026-03-02T12:13:41Z

Summary by CodeRabbit

Bug Fixes
- Improved source-file encoding detection (BOM + PEP 263) and decoding to better match CPython.
- Stricter UTF-8 validation with clearer, location-aware syntax errors for invalid byte sequences.
- Prevents ambiguous BOM/mixed-encoding handling and surfaces clearer messages when decoding fails.
- Compile/preprocessing now consistently uses the improved decoding path for source text.
- No changes to public APIs or exported declarations.

coderabbitai · 2026-03-02T12:13:50Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bb805d3 and 53ddc7e.

⛔ Files ignored due to path filters (3)

Lib/test/test_importlib/source/test_source_encoding.py is excluded by !Lib/**
Lib/test/test_runpy.py is excluded by !Lib/**
Lib/test/test_utf8source.py is excluded by !Lib/**

📒 Files selected for processing (1)

crates/vm/src/stdlib/builtins.rs

🚧 Files skipped from review as they are similar to previous changes (1)

crates/vm/src/stdlib/builtins.rs

📝 Walkthrough

Walkthrough

Adds parser-gated helpers for PEP 263 and BOM-aware source encoding detection and decoding, and replaces direct UTF‑8 decoding in compile paths with decode_source_bytes, preserving CPython-like error semantics for encoding/decoding failures.

Changes

Cohort / File(s)	Summary
Source encoding & decoding `crates/vm/src/stdlib/builtins.rs`	Under `cfg(feature = "parser")` add `detect_source_encoding()`, `is_utf8_encoding()`, and `decode_source_bytes()` handling BOM, PEP 263 cookie detection, UTF‑8 normalization/validation, and codec-registry decoding. Compile paths updated to call `decode_source_bytes()` instead of direct UTF‑8 decode.

Sequence Diagram(s)

sequenceDiagram
    participant File as "Source bytes"
    participant VM as "VirtualMachine"
    participant Builtins as "builtins::decode_source_bytes"
    participant Codec as "Codec registry / decoder"
    participant Parser as "Parser/Compiler"

    File->>Builtins: provide `source: &[u8]` + `filename`
    Builtins->>Builtins: check BOM and first-two-line cookie
    alt encoding is UTF‑8 (or BOM)
        Builtins->>Builtins: validate / decode UTF‑8
        Builtins-->>Parser: return decoded String
    else encoding is non‑UTF‑8
        Builtins->>Codec: lookup codec by name
        Codec-->>Builtins: decoder (or error)
        Builtins->>Codec: decode bytes
        Codec-->>Builtins: decoded String (or error)
        Builtins-->>Parser: return decoded String
    end
    Builtins-->>VM: raise SyntaxError on invalid encoding/decoding with location

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Better codecs and fix lots of test_io and other expectedFailures #6997 — Related changes to codec lookup/decoding behavior used by the new decode_source_bytes() helpers.

Poem

🐰 I nibble bytes at break of day,

BOM and cookie show the way,
I hop through codecs, old and new,
I stitch your bytes to text for you,
A happy rabbit, quick and gay.

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately reflects the main change: implementing source encoding detection and UTF-8 handling helpers, updating behavior from v3.14.3 CPython.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions · 2026-03-02T12:22:49Z

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/vm/src/stdlib/builtins.rs`:
- Around line 228-237: The map_err closure on the call to
vm.state.codec_registry.decode_text currently swallows all errors and replaces
them with a generic "unknown encoding" SyntaxError; change the closure to
inspect the caught exception (e.g., bind it as exc) and only convert it to the
vm.new_exception_msg "unknown encoding for '{filename}': {effective_encoding}"
when exc.isinstance(vm.ctx.exceptions.lookup_error) is true, otherwise return
the original exception (propagate exc) so decode/type errors are preserved;
update the closure referenced around decode_text and vm.new_exception_msg
accordingly.
- Around line 119-126: find_encoding_in_line currently accepts any '#' anywhere
on the line which can match '#' inside strings; update find_encoding_in_line to
validate that all bytes before the found hash_pos are only ASCII space, tab or
formfeed (b' ', b'\t', b'\x0c') and return None otherwise, then proceed to find
"coding[:=]" after that hash; also ensure the second-line check that calls
find_encoding_in_line (the caller at the second-line encoding check) relies on
this stricter behavior so it won't accept non-comment '#' occurrences.
- Around line 185-193: The BOM/encoding check rejects valid aliases like
"utf_8"; normalize the encoding string by lowercasing and stripping
hyphens/underscores before comparing to "utf8". In the block that uses has_bom
and encoding (the encoding variable and the BOM check in builtins.rs, related to
detect_source_encoding()), transform the encoding (e.g., map |e|
e.to_ascii_lowercase().replace(['-','_'], "")) or otherwise remove '-' and '_'
then compare against "utf8" so "utf-8", "utf_8", and "utf8" are treated equally.

ℹ️ Review info

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0a6a6f8 and 55ca583.

⛔ Files ignored due to path filters (3)

Lib/test/test_importlib/source/test_source_encoding.py is excluded by !Lib/**
Lib/test/test_runpy.py is excluded by !Lib/**
Lib/test/test_utf8source.py is excluded by !Lib/**

📒 Files selected for processing (1)

crates/vm/src/stdlib/builtins.rs

crates/vm/src/stdlib/builtins.rs

coderabbitai

♻️ Duplicate comments (1)

crates/vm/src/stdlib/builtins.rs (1)

238-248: ⚠️ Potential issue | 🟠 Major

Check exception type before overriding decode errors.

The map_err(|_| ...) closure masks all decode failures as "unknown encoding", including legitimate decode errors on known encodings (e.g., invalid bytes for the declared encoding). Only LookupError should map to "unknown encoding"; other errors (decode failures) should be preserved or converted to an appropriate SyntaxError with decode details.

🐛 Proposed fix to distinguish error types

         let decoded = vm
             .state
             .codec_registry
             .decode_text(bytes_obj.into(), enc, None, vm)
-            .map_err(|_| {
-                vm.new_exception_msg(
-                    vm.ctx.exceptions.syntax_error.to_owned(),
-                    format!("unknown encoding for '{filename}': {enc}").into(),
-                )
+            .map_err(|exc| {
+                if exc.fast_isinstance(&vm.ctx.exceptions.lookup_error) {
+                    vm.new_exception_msg(
+                        vm.ctx.exceptions.syntax_error.to_owned(),
+                        format!("unknown encoding for '{filename}': {enc}").into(),
+                    )
+                } else {
+                    vm.new_exception_msg(
+                        vm.ctx.exceptions.syntax_error.to_owned(),
+                        format!("'{filename}': {}", exc.to_string()).into(),
+                    )
+                }
             })?;

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@crates/vm/src/stdlib/builtins.rs` around lines 238 - 248, The current map_err
on vm.state.codec_registry.decode_text hides all decode failures as "unknown
encoding"; change the error handling in the call to decode_text (the map_err
closure around decode_text) to inspect the underlying error: if it is a
LookupError (i.e., unknown encoding) convert to
vm.new_exception_msg(vm.ctx.exceptions.syntax_error, format!("unknown encoding
for '{filename}': {enc}")), otherwise preserve or convert the real decode error
into a SyntaxError (include the original error message/details) or re-raise it
so real decoding failures (invalid byte sequences) are not masked; update the
closure that wraps decode_text accordingly instead of unconditionally mapping
all errors.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@crates/vm/src/stdlib/builtins.rs`:
- Around line 238-248: The current map_err on
vm.state.codec_registry.decode_text hides all decode failures as "unknown
encoding"; change the error handling in the call to decode_text (the map_err
closure around decode_text) to inspect the underlying error: if it is a
LookupError (i.e., unknown encoding) convert to
vm.new_exception_msg(vm.ctx.exceptions.syntax_error, format!("unknown encoding
for '{filename}': {enc}")), otherwise preserve or convert the real decode error
into a SyntaxError (include the original error message/details) or re-raise it
so real decoding failures (invalid byte sequences) are not masked; update the
closure that wraps decode_text accordingly instead of unconditionally mapping
all errors.

ℹ️ Review info

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 55ca583 and 87b5930.

📒 Files selected for processing (1)

crates/vm/src/stdlib/builtins.rs

github-actions · 2026-03-03T01:34:24Z

Code has been automatically formatted

The code in this PR has been formatted using:

cargo fmt --all
Please pull the latest changes before pushing again:

git pull origin utf8source

coderabbitai

♻️ Duplicate comments (1)

crates/vm/src/stdlib/builtins.rs (1)

239-244: ⚠️ Potential issue | 🟠 Major

Preserve non-lookup decode failures instead of rewriting all errors.

This closure currently rewrites all decode failures as “unknown encoding”, which masks real decode/type errors for known codecs. Only LookupError should be translated; other exceptions should pass through unchanged.

Proposed fix

             let decoded = vm
                 .state
                 .codec_registry
                 .decode_text(bytes_obj.into(), enc, None, vm)
-                .map_err(|_| {
-                    vm.new_exception_msg(
-                        vm.ctx.exceptions.syntax_error.to_owned(),
-                        format!("unknown encoding for '{filename}': {enc}").into(),
-                    )
-                })?;
+                .map_err(|exc| {
+                    if exc.isinstance(vm.ctx.exceptions.lookup_error) {
+                        vm.new_exception_msg(
+                            vm.ctx.exceptions.syntax_error.to_owned(),
+                            format!("unknown encoding for '{filename}': {enc}").into(),
+                        )
+                    } else {
+                        exc
+                    }
+                })?;

#!/bin/bash
# Verify current error remapping site and exception-type check patterns.
rg -nP --type=rust -C4 '\.decode_text\(' crates/vm/src/stdlib/builtins.rs
rg -nP --type=rust -C2 '\.isinstance\(' crates/vm/src | head -n 40

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@crates/vm/src/stdlib/builtins.rs` around lines 239 - 244, The current map_err
closure unconditionally rewrites any decode failure into a SyntaxError "unknown
encoding", which masks real decode/type errors; change the closure used around
the decode call so it only converts the error when it is a LookupError from the
encoder lookup and otherwise returns the original error unchanged — e.g. inspect
the error value inside the map_err (or use match/if let) and only call
vm.new_exception_msg(..., format!("unknown encoding for '{filename}':
{enc}").into()) when the err is a LookupError; for any other error, propagate
the original error (do not rewrap) so real decode/type errors are preserved
(referencing the existing vm.new_exception_msg, vm.ctx.exceptions.syntax_error
and the decode call surrounding .map_err).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@crates/vm/src/stdlib/builtins.rs`:
- Around line 239-244: The current map_err closure unconditionally rewrites any
decode failure into a SyntaxError "unknown encoding", which masks real
decode/type errors; change the closure used around the decode call so it only
converts the error when it is a LookupError from the encoder lookup and
otherwise returns the original error unchanged — e.g. inspect the error value
inside the map_err (or use match/if let) and only call vm.new_exception_msg(...,
format!("unknown encoding for '{filename}': {enc}").into()) when the err is a
LookupError; for any other error, propagate the original error (do not rewrap)
so real decode/type errors are preserved (referencing the existing
vm.new_exception_msg, vm.ctx.exceptions.syntax_error and the decode call
surrounding .map_err).

ℹ️ Review info

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 87b5930 and 7a0e27f.

📒 Files selected for processing (1)

crates/vm/src/stdlib/builtins.rs

coderabbitai

♻️ Duplicate comments (1)

crates/vm/src/stdlib/builtins.rs (1)

235-244: ⚠️ Potential issue | 🟠 Major

Preserve non-lookup codec errors instead of rewriting all failures.

At Line 239, map_err(|_| ...) converts every decode failure into unknown encoding, which hides real decode/type errors from codec execution. Only LookupError should be rewritten; everything else should propagate unchanged.

Proposed fix

             let decoded = vm
                 .state
                 .codec_registry
                 .decode_text(bytes_obj.into(), enc, None, vm)
-                .map_err(|_| {
-                    vm.new_exception_msg(
-                        vm.ctx.exceptions.syntax_error.to_owned(),
-                        format!("unknown encoding for '{filename}': {enc}").into(),
-                    )
-                })?;
+                .map_err(|exc| {
+                    if exc.isinstance(vm.ctx.exceptions.lookup_error) {
+                        vm.new_exception_msg(
+                            vm.ctx.exceptions.syntax_error.to_owned(),
+                            format!("unknown encoding for '{filename}': {enc}").into(),
+                        )
+                    } else {
+                        exc
+                    }
+                })?;

#!/bin/bash
# Verify how decode_text errors are produced and ensure only LookupError is remapped.
# Expected: decode_text can return non-LookupError exceptions, which should not be masked.
rg -n --type=rust -C4 '\bdecode_text\s*\(' crates/vm/src
rg -n --type=rust -C6 'pub fn decode_text|fn decode_text|lookup_error' crates/vm/src/codecs.rs crates/vm/src/stdlib/builtins.rs

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@crates/vm/src/stdlib/builtins.rs` around lines 235 - 244, The current map_err
on vm.state.codec_registry.decode_text(...) swallows all decoder errors and
rewraps them as a syntax_error with "unknown encoding", but only LookupError
should be remapped; preserve and propagate any other errors from decode_text.
Change the map_err closure to match on the error variant returned by decode_text
(check for LookupError) and convert only that case into
vm.new_exception_msg(vm.ctx.exceptions.syntax_error, format!("unknown encoding
for '{filename}': {enc}")), otherwise return the original error unchanged so
non-lookup decode/type errors propagate; locate this change around the call to
decode_text in builtins.rs and update the map_err handling accordingly.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@crates/vm/src/stdlib/builtins.rs`:
- Around line 235-244: The current map_err on
vm.state.codec_registry.decode_text(...) swallows all decoder errors and rewraps
them as a syntax_error with "unknown encoding", but only LookupError should be
remapped; preserve and propagate any other errors from decode_text. Change the
map_err closure to match on the error variant returned by decode_text (check for
LookupError) and convert only that case into
vm.new_exception_msg(vm.ctx.exceptions.syntax_error, format!("unknown encoding
for '{filename}': {enc}")), otherwise return the original error unchanged so
non-lookup decode/type errors propagate; locate this change around the call to
decode_text in builtins.rs and update the map_err handling accordingly.

ℹ️ Review info

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7a0e27f and bb805d3.

⛔ Files ignored due to path filters (3)

Lib/test/test_importlib/source/test_source_encoding.py is excluded by !Lib/**
Lib/test/test_runpy.py is excluded by !Lib/**
Lib/test/test_utf8source.py is excluded by !Lib/**

📒 Files selected for processing (1)

crates/vm/src/stdlib/builtins.rs

- Validate '#' is preceded only by whitespace/formfeed (PEP 263) - Normalize UTF-8 encoding aliases (utf-8, utf_8, utf8, etc.)

Remove @unittest.expectedFailure decorators from: - test_source_encoding: test_encoding_on_first_line, test_encoding_on_second_line, test_bom_conflict - test_runpy: test_encoding

youknowone marked this pull request as ready for review March 3, 2026 00:38

coderabbitai bot reviewed Mar 3, 2026

View reviewed changes

crates/vm/src/stdlib/builtins.rs Show resolved Hide resolved

crates/vm/src/stdlib/builtins.rs Outdated Show resolved Hide resolved

crates/vm/src/stdlib/builtins.rs Show resolved Hide resolved

coderabbitai bot reviewed Mar 3, 2026

View reviewed changes

youknowone force-pushed the utf8source branch from 7a0e27f to bb805d3 Compare March 3, 2026 04:12

coderabbitai bot reviewed Mar 3, 2026

View reviewed changes

CPython Developers and others added 3 commits March 3, 2026 21:27

Update test_utf8source from v3.14.3

33fca0d

Fix PEP 263 encoding detection per review

77eade3

- Validate '#' is preceded only by whitespace/formfeed (PEP 263) - Normalize UTF-8 encoding aliases (utf-8, utf_8, utf8, etc.)

Remove expectedFailure for encoding tests now passing

53ddc7e

Remove @unittest.expectedFailure decorators from: - test_source_encoding: test_encoding_on_first_line, test_encoding_on_second_line, test_bom_conflict - test_runpy: test_encoding

youknowone force-pushed the utf8source branch from bb805d3 to 53ddc7e Compare March 3, 2026 12:27

youknowone merged commit 62ca60a into RustPython:main Mar 3, 2026
14 checks passed

youknowone deleted the utf8source branch March 3, 2026 13:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update test_utf8source from v3.14.3 and implement it#7318

Update test_utf8source from v3.14.3 and implement it#7318
youknowone merged 3 commits intoRustPython:mainfrom
youknowone:utf8source

youknowone commented Mar 2, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Mar 2, 2026 •

edited

Loading

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

github-actions bot commented Mar 2, 2026 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

github-actions bot commented Mar 3, 2026

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

youknowone commented Mar 2, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Mar 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

github-actions bot commented Mar 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

📦 Library Dependencies

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Mar 3, 2026

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

youknowone commented Mar 2, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Mar 2, 2026 •

edited

Loading

github-actions bot commented Mar 2, 2026 •

edited

Loading