Skip to content

Update test_utf8source from v3.14.3 and implement it#7318

Merged
youknowone merged 3 commits intoRustPython:mainfrom
youknowone:utf8source
Mar 3, 2026
Merged

Update test_utf8source from v3.14.3 and implement it#7318
youknowone merged 3 commits intoRustPython:mainfrom
youknowone:utf8source

Conversation

@youknowone
Copy link
Member

@youknowone youknowone commented Mar 2, 2026

Summary by CodeRabbit

  • Bug Fixes
    • Improved source-file encoding detection (BOM + PEP 263) and decoding to better match CPython.
    • Stricter UTF-8 validation with clearer, location-aware syntax errors for invalid byte sequences.
    • Prevents ambiguous BOM/mixed-encoding handling and surfaces clearer messages when decoding fails.
    • Compile/preprocessing now consistently uses the improved decoding path for source text.
    • No changes to public APIs or exported declarations.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 2, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bb805d3 and 53ddc7e.

⛔ Files ignored due to path filters (3)
  • Lib/test/test_importlib/source/test_source_encoding.py is excluded by !Lib/**
  • Lib/test/test_runpy.py is excluded by !Lib/**
  • Lib/test/test_utf8source.py is excluded by !Lib/**
📒 Files selected for processing (1)
  • crates/vm/src/stdlib/builtins.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • crates/vm/src/stdlib/builtins.rs

📝 Walkthrough

Walkthrough

Adds parser-gated helpers for PEP 263 and BOM-aware source encoding detection and decoding, and replaces direct UTF‑8 decoding in compile paths with decode_source_bytes, preserving CPython-like error semantics for encoding/decoding failures.

Changes

Cohort / File(s) Summary
Source encoding & decoding
crates/vm/src/stdlib/builtins.rs
Under cfg(feature = "parser") add detect_source_encoding(), is_utf8_encoding(), and decode_source_bytes() handling BOM, PEP 263 cookie detection, UTF‑8 normalization/validation, and codec-registry decoding. Compile paths updated to call decode_source_bytes() instead of direct UTF‑8 decode.

Sequence Diagram(s)

sequenceDiagram
    participant File as "Source bytes"
    participant VM as "VirtualMachine"
    participant Builtins as "builtins::decode_source_bytes"
    participant Codec as "Codec registry / decoder"
    participant Parser as "Parser/Compiler"

    File->>Builtins: provide `source: &[u8]` + `filename`
    Builtins->>Builtins: check BOM and first-two-line cookie
    alt encoding is UTF‑8 (or BOM)
        Builtins->>Builtins: validate / decode UTF‑8
        Builtins-->>Parser: return decoded String
    else encoding is non‑UTF‑8
        Builtins->>Codec: lookup codec by name
        Codec-->>Builtins: decoder (or error)
        Builtins->>Codec: decode bytes
        Codec-->>Builtins: decoded String (or error)
        Builtins-->>Parser: return decoded String
    end
    Builtins-->>VM: raise SyntaxError on invalid encoding/decoding with location
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Poem

🐰 I nibble bytes at break of day,

BOM and cookie show the way,
I hop through codecs, old and new,
I stitch your bytes to text for you,
A happy rabbit, quick and gay.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately reflects the main change: implementing source encoding detection and UTF-8 handling helpers, updating behavior from v3.14.3 CPython.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Contributor

github-actions bot commented Mar 2, 2026

📦 Library Dependencies

The following Lib/ modules were modified. Here are their dependencies:

[x] lib: cpython/Lib/importlib
[ ] test: cpython/Lib/test/test_importlib (TODO: 12)

dependencies:

  • importlib

dependent tests: (111 tests)

  • importlib: test_bdb test_cmd_line_script test_codecs test_compileall test_ctypes test_doctest test_frozen test_hashlib test_importlib test_inspect test_linecache test_multiprocessing_main_handling test_pkgutil test_py_compile test_pyclbr test_pydoc test_reprlib test_runpy test_sundry test_support test_tomllib test_unittest test_zipfile test_zipimport test_zoneinfo
    • ctypes.util: test_ctypes
    • ensurepip: test_ensurepip test_venv
    • inspect: test_abc test_argparse test_asyncgen test_buffer test_builtin test_code test_collections test_coroutines test_decimal test_enum test_functools test_generators test_grammar test_monitoring test_ntpath test_operator test_patma test_posixpath test_signal test_sqlite3 test_traceback test_type_annotations test_types test_typing test_unittest test_yield_from
      • ast: test_ast test_compile test_compiler_codegen test_dis test_fstring test_future_stmt test_site test_ssl test_type_comments test_ucn test_unparse
      • asyncio: test_asyncio test_contextlib_async test_logging test_os test_sys_settrace test_unittest
      • cmd: test_cmd
      • dataclasses: test__colorize test_copy test_ctypes test_genericalias test_pprint test_regrtest
      • rlcompleter: test_rlcompleter
      • trace: test_trace
      • xmlrpc.server: test_docxmlrpc test_xmlrpc
    • py_compile: test_importlib
      • zipfile: test_shutil test_zipapp test_zipfile test_zipfile64
    • sysconfig: test_c_locale_coercion test_dtrace test_gc test_launcher test_osx_env test_posix test_pyexpat test_subprocess test_sys test_sysconfig test_time test_tools test_urllib2net
    • zipfile:
      • shutil: test_bz2 test_filecmp test_glob test_httpservers test_largefile test_sax test_string_literals test_tarfile test_tempfile test_unicode_file
    • zipimport: test_importlib

[x] lib: cpython/Lib/runpy.py
[x] test: cpython/Lib/test/test_runpy.py (TODO: 1)

dependencies:

  • runpy

dependent tests: (2 tests)

  • runpy: test_runpy
    • trace: test_trace

[x] test: cpython/Lib/test/test_utf8source.py

dependencies:

dependent tests: (no tests depend on utf8source)

Legend:

  • [+] path exists in CPython
  • [x] up-to-date, [ ] outdated

@youknowone youknowone marked this pull request as ready for review March 3, 2026 00:38
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/vm/src/stdlib/builtins.rs`:
- Around line 228-237: The map_err closure on the call to
vm.state.codec_registry.decode_text currently swallows all errors and replaces
them with a generic "unknown encoding" SyntaxError; change the closure to
inspect the caught exception (e.g., bind it as exc) and only convert it to the
vm.new_exception_msg "unknown encoding for '{filename}': {effective_encoding}"
when exc.isinstance(vm.ctx.exceptions.lookup_error) is true, otherwise return
the original exception (propagate exc) so decode/type errors are preserved;
update the closure referenced around decode_text and vm.new_exception_msg
accordingly.
- Around line 119-126: find_encoding_in_line currently accepts any '#' anywhere
on the line which can match '#' inside strings; update find_encoding_in_line to
validate that all bytes before the found hash_pos are only ASCII space, tab or
formfeed (b' ', b'\t', b'\x0c') and return None otherwise, then proceed to find
"coding[:=]" after that hash; also ensure the second-line check that calls
find_encoding_in_line (the caller at the second-line encoding check) relies on
this stricter behavior so it won't accept non-comment '#' occurrences.
- Around line 185-193: The BOM/encoding check rejects valid aliases like
"utf_8"; normalize the encoding string by lowercasing and stripping
hyphens/underscores before comparing to "utf8". In the block that uses has_bom
and encoding (the encoding variable and the BOM check in builtins.rs, related to
detect_source_encoding()), transform the encoding (e.g., map |e|
e.to_ascii_lowercase().replace(['-','_'], "")) or otherwise remove '-' and '_'
then compare against "utf8" so "utf-8", "utf_8", and "utf8" are treated equally.

ℹ️ Review info

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0a6a6f8 and 55ca583.

⛔ Files ignored due to path filters (3)
  • Lib/test/test_importlib/source/test_source_encoding.py is excluded by !Lib/**
  • Lib/test/test_runpy.py is excluded by !Lib/**
  • Lib/test/test_utf8source.py is excluded by !Lib/**
📒 Files selected for processing (1)
  • crates/vm/src/stdlib/builtins.rs

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
crates/vm/src/stdlib/builtins.rs (1)

238-248: ⚠️ Potential issue | 🟠 Major

Check exception type before overriding decode errors.

The map_err(|_| ...) closure masks all decode failures as "unknown encoding", including legitimate decode errors on known encodings (e.g., invalid bytes for the declared encoding). Only LookupError should map to "unknown encoding"; other errors (decode failures) should be preserved or converted to an appropriate SyntaxError with decode details.

🐛 Proposed fix to distinguish error types
         let decoded = vm
             .state
             .codec_registry
             .decode_text(bytes_obj.into(), enc, None, vm)
-            .map_err(|_| {
-                vm.new_exception_msg(
-                    vm.ctx.exceptions.syntax_error.to_owned(),
-                    format!("unknown encoding for '{filename}': {enc}").into(),
-                )
+            .map_err(|exc| {
+                if exc.fast_isinstance(&vm.ctx.exceptions.lookup_error) {
+                    vm.new_exception_msg(
+                        vm.ctx.exceptions.syntax_error.to_owned(),
+                        format!("unknown encoding for '{filename}': {enc}").into(),
+                    )
+                } else {
+                    vm.new_exception_msg(
+                        vm.ctx.exceptions.syntax_error.to_owned(),
+                        format!("'{filename}': {}", exc.to_string()).into(),
+                    )
+                }
             })?;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/vm/src/stdlib/builtins.rs` around lines 238 - 248, The current map_err
on vm.state.codec_registry.decode_text hides all decode failures as "unknown
encoding"; change the error handling in the call to decode_text (the map_err
closure around decode_text) to inspect the underlying error: if it is a
LookupError (i.e., unknown encoding) convert to
vm.new_exception_msg(vm.ctx.exceptions.syntax_error, format!("unknown encoding
for '{filename}': {enc}")), otherwise preserve or convert the real decode error
into a SyntaxError (include the original error message/details) or re-raise it
so real decoding failures (invalid byte sequences) are not masked; update the
closure that wraps decode_text accordingly instead of unconditionally mapping
all errors.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@crates/vm/src/stdlib/builtins.rs`:
- Around line 238-248: The current map_err on
vm.state.codec_registry.decode_text hides all decode failures as "unknown
encoding"; change the error handling in the call to decode_text (the map_err
closure around decode_text) to inspect the underlying error: if it is a
LookupError (i.e., unknown encoding) convert to
vm.new_exception_msg(vm.ctx.exceptions.syntax_error, format!("unknown encoding
for '{filename}': {enc}")), otherwise preserve or convert the real decode error
into a SyntaxError (include the original error message/details) or re-raise it
so real decoding failures (invalid byte sequences) are not masked; update the
closure that wraps decode_text accordingly instead of unconditionally mapping
all errors.

ℹ️ Review info

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 55ca583 and 87b5930.

📒 Files selected for processing (1)
  • crates/vm/src/stdlib/builtins.rs

@github-actions
Copy link
Contributor

github-actions bot commented Mar 3, 2026

Code has been automatically formatted

The code in this PR has been formatted using:

  • cargo fmt --all
    Please pull the latest changes before pushing again:
git pull origin utf8source

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
crates/vm/src/stdlib/builtins.rs (1)

239-244: ⚠️ Potential issue | 🟠 Major

Preserve non-lookup decode failures instead of rewriting all errors.

This closure currently rewrites all decode failures as “unknown encoding”, which masks real decode/type errors for known codecs. Only LookupError should be translated; other exceptions should pass through unchanged.

Proposed fix
             let decoded = vm
                 .state
                 .codec_registry
                 .decode_text(bytes_obj.into(), enc, None, vm)
-                .map_err(|_| {
-                    vm.new_exception_msg(
-                        vm.ctx.exceptions.syntax_error.to_owned(),
-                        format!("unknown encoding for '{filename}': {enc}").into(),
-                    )
-                })?;
+                .map_err(|exc| {
+                    if exc.isinstance(vm.ctx.exceptions.lookup_error) {
+                        vm.new_exception_msg(
+                            vm.ctx.exceptions.syntax_error.to_owned(),
+                            format!("unknown encoding for '{filename}': {enc}").into(),
+                        )
+                    } else {
+                        exc
+                    }
+                })?;
#!/bin/bash
# Verify current error remapping site and exception-type check patterns.
rg -nP --type=rust -C4 '\.decode_text\(' crates/vm/src/stdlib/builtins.rs
rg -nP --type=rust -C2 '\.isinstance\(' crates/vm/src | head -n 40
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/vm/src/stdlib/builtins.rs` around lines 239 - 244, The current map_err
closure unconditionally rewrites any decode failure into a SyntaxError "unknown
encoding", which masks real decode/type errors; change the closure used around
the decode call so it only converts the error when it is a LookupError from the
encoder lookup and otherwise returns the original error unchanged — e.g. inspect
the error value inside the map_err (or use match/if let) and only call
vm.new_exception_msg(..., format!("unknown encoding for '{filename}':
{enc}").into()) when the err is a LookupError; for any other error, propagate
the original error (do not rewrap) so real decode/type errors are preserved
(referencing the existing vm.new_exception_msg, vm.ctx.exceptions.syntax_error
and the decode call surrounding .map_err).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@crates/vm/src/stdlib/builtins.rs`:
- Around line 239-244: The current map_err closure unconditionally rewrites any
decode failure into a SyntaxError "unknown encoding", which masks real
decode/type errors; change the closure used around the decode call so it only
converts the error when it is a LookupError from the encoder lookup and
otherwise returns the original error unchanged — e.g. inspect the error value
inside the map_err (or use match/if let) and only call vm.new_exception_msg(...,
format!("unknown encoding for '{filename}': {enc}").into()) when the err is a
LookupError; for any other error, propagate the original error (do not rewrap)
so real decode/type errors are preserved (referencing the existing
vm.new_exception_msg, vm.ctx.exceptions.syntax_error and the decode call
surrounding .map_err).

ℹ️ Review info

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 87b5930 and 7a0e27f.

📒 Files selected for processing (1)
  • crates/vm/src/stdlib/builtins.rs

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
crates/vm/src/stdlib/builtins.rs (1)

235-244: ⚠️ Potential issue | 🟠 Major

Preserve non-lookup codec errors instead of rewriting all failures.

At Line 239, map_err(|_| ...) converts every decode failure into unknown encoding, which hides real decode/type errors from codec execution. Only LookupError should be rewritten; everything else should propagate unchanged.

Proposed fix
             let decoded = vm
                 .state
                 .codec_registry
                 .decode_text(bytes_obj.into(), enc, None, vm)
-                .map_err(|_| {
-                    vm.new_exception_msg(
-                        vm.ctx.exceptions.syntax_error.to_owned(),
-                        format!("unknown encoding for '{filename}': {enc}").into(),
-                    )
-                })?;
+                .map_err(|exc| {
+                    if exc.isinstance(vm.ctx.exceptions.lookup_error) {
+                        vm.new_exception_msg(
+                            vm.ctx.exceptions.syntax_error.to_owned(),
+                            format!("unknown encoding for '{filename}': {enc}").into(),
+                        )
+                    } else {
+                        exc
+                    }
+                })?;
#!/bin/bash
# Verify how decode_text errors are produced and ensure only LookupError is remapped.
# Expected: decode_text can return non-LookupError exceptions, which should not be masked.
rg -n --type=rust -C4 '\bdecode_text\s*\(' crates/vm/src
rg -n --type=rust -C6 'pub fn decode_text|fn decode_text|lookup_error' crates/vm/src/codecs.rs crates/vm/src/stdlib/builtins.rs
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/vm/src/stdlib/builtins.rs` around lines 235 - 244, The current map_err
on vm.state.codec_registry.decode_text(...) swallows all decoder errors and
rewraps them as a syntax_error with "unknown encoding", but only LookupError
should be remapped; preserve and propagate any other errors from decode_text.
Change the map_err closure to match on the error variant returned by decode_text
(check for LookupError) and convert only that case into
vm.new_exception_msg(vm.ctx.exceptions.syntax_error, format!("unknown encoding
for '{filename}': {enc}")), otherwise return the original error unchanged so
non-lookup decode/type errors propagate; locate this change around the call to
decode_text in builtins.rs and update the map_err handling accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@crates/vm/src/stdlib/builtins.rs`:
- Around line 235-244: The current map_err on
vm.state.codec_registry.decode_text(...) swallows all decoder errors and rewraps
them as a syntax_error with "unknown encoding", but only LookupError should be
remapped; preserve and propagate any other errors from decode_text. Change the
map_err closure to match on the error variant returned by decode_text (check for
LookupError) and convert only that case into
vm.new_exception_msg(vm.ctx.exceptions.syntax_error, format!("unknown encoding
for '{filename}': {enc}")), otherwise return the original error unchanged so
non-lookup decode/type errors propagate; locate this change around the call to
decode_text in builtins.rs and update the map_err handling accordingly.

ℹ️ Review info

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7a0e27f and bb805d3.

⛔ Files ignored due to path filters (3)
  • Lib/test/test_importlib/source/test_source_encoding.py is excluded by !Lib/**
  • Lib/test/test_runpy.py is excluded by !Lib/**
  • Lib/test/test_utf8source.py is excluded by !Lib/**
📒 Files selected for processing (1)
  • crates/vm/src/stdlib/builtins.rs

CPython Developers and others added 3 commits March 3, 2026 21:27
- Validate '#' is preceded only by whitespace/formfeed (PEP 263)
- Normalize UTF-8 encoding aliases (utf-8, utf_8, utf8, etc.)
Remove @unittest.expectedFailure decorators from:
- test_source_encoding: test_encoding_on_first_line, test_encoding_on_second_line, test_bom_conflict
- test_runpy: test_encoding
@youknowone youknowone merged commit 62ca60a into RustPython:main Mar 3, 2026
14 checks passed
@youknowone youknowone deleted the utf8source branch March 3, 2026 13:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant