Update test_utf8source from v3.14.3 and implement it#7318
Update test_utf8source from v3.14.3 and implement it#7318youknowone merged 3 commits intoRustPython:mainfrom
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review infoConfiguration used: Path: .coderabbit.yml Review profile: CHILL Plan: Pro ⛔ Files ignored due to path filters (3)
📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds parser-gated helpers for PEP 263 and BOM-aware source encoding detection and decoding, and replaces direct UTF‑8 decoding in compile paths with Changes
Sequence Diagram(s)sequenceDiagram
participant File as "Source bytes"
participant VM as "VirtualMachine"
participant Builtins as "builtins::decode_source_bytes"
participant Codec as "Codec registry / decoder"
participant Parser as "Parser/Compiler"
File->>Builtins: provide `source: &[u8]` + `filename`
Builtins->>Builtins: check BOM and first-two-line cookie
alt encoding is UTF‑8 (or BOM)
Builtins->>Builtins: validate / decode UTF‑8
Builtins-->>Parser: return decoded String
else encoding is non‑UTF‑8
Builtins->>Codec: lookup codec by name
Codec-->>Builtins: decoder (or error)
Builtins->>Codec: decode bytes
Codec-->>Builtins: decoded String (or error)
Builtins-->>Parser: return decoded String
end
Builtins-->>VM: raise SyntaxError on invalid encoding/decoding with location
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
📦 Library DependenciesThe following Lib/ modules were modified. Here are their dependencies: [x] lib: cpython/Lib/importlib dependencies:
dependent tests: (111 tests)
[x] lib: cpython/Lib/runpy.py dependencies:
dependent tests: (2 tests)
[x] test: cpython/Lib/test/test_utf8source.py dependencies: dependent tests: (no tests depend on utf8source) Legend:
|
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/vm/src/stdlib/builtins.rs`:
- Around line 228-237: The map_err closure on the call to
vm.state.codec_registry.decode_text currently swallows all errors and replaces
them with a generic "unknown encoding" SyntaxError; change the closure to
inspect the caught exception (e.g., bind it as exc) and only convert it to the
vm.new_exception_msg "unknown encoding for '{filename}': {effective_encoding}"
when exc.isinstance(vm.ctx.exceptions.lookup_error) is true, otherwise return
the original exception (propagate exc) so decode/type errors are preserved;
update the closure referenced around decode_text and vm.new_exception_msg
accordingly.
- Around line 119-126: find_encoding_in_line currently accepts any '#' anywhere
on the line which can match '#' inside strings; update find_encoding_in_line to
validate that all bytes before the found hash_pos are only ASCII space, tab or
formfeed (b' ', b'\t', b'\x0c') and return None otherwise, then proceed to find
"coding[:=]" after that hash; also ensure the second-line check that calls
find_encoding_in_line (the caller at the second-line encoding check) relies on
this stricter behavior so it won't accept non-comment '#' occurrences.
- Around line 185-193: The BOM/encoding check rejects valid aliases like
"utf_8"; normalize the encoding string by lowercasing and stripping
hyphens/underscores before comparing to "utf8". In the block that uses has_bom
and encoding (the encoding variable and the BOM check in builtins.rs, related to
detect_source_encoding()), transform the encoding (e.g., map |e|
e.to_ascii_lowercase().replace(['-','_'], "")) or otherwise remove '-' and '_'
then compare against "utf8" so "utf-8", "utf_8", and "utf8" are treated equally.
ℹ️ Review info
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (3)
Lib/test/test_importlib/source/test_source_encoding.pyis excluded by!Lib/**Lib/test/test_runpy.pyis excluded by!Lib/**Lib/test/test_utf8source.pyis excluded by!Lib/**
📒 Files selected for processing (1)
crates/vm/src/stdlib/builtins.rs
There was a problem hiding this comment.
♻️ Duplicate comments (1)
crates/vm/src/stdlib/builtins.rs (1)
238-248:⚠️ Potential issue | 🟠 MajorCheck exception type before overriding decode errors.
The
map_err(|_| ...)closure masks all decode failures as "unknown encoding", including legitimate decode errors on known encodings (e.g., invalid bytes for the declared encoding). OnlyLookupErrorshould map to "unknown encoding"; other errors (decode failures) should be preserved or converted to an appropriateSyntaxErrorwith decode details.🐛 Proposed fix to distinguish error types
let decoded = vm .state .codec_registry .decode_text(bytes_obj.into(), enc, None, vm) - .map_err(|_| { - vm.new_exception_msg( - vm.ctx.exceptions.syntax_error.to_owned(), - format!("unknown encoding for '{filename}': {enc}").into(), - ) + .map_err(|exc| { + if exc.fast_isinstance(&vm.ctx.exceptions.lookup_error) { + vm.new_exception_msg( + vm.ctx.exceptions.syntax_error.to_owned(), + format!("unknown encoding for '{filename}': {enc}").into(), + ) + } else { + vm.new_exception_msg( + vm.ctx.exceptions.syntax_error.to_owned(), + format!("'{filename}': {}", exc.to_string()).into(), + ) + } })?;🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/vm/src/stdlib/builtins.rs` around lines 238 - 248, The current map_err on vm.state.codec_registry.decode_text hides all decode failures as "unknown encoding"; change the error handling in the call to decode_text (the map_err closure around decode_text) to inspect the underlying error: if it is a LookupError (i.e., unknown encoding) convert to vm.new_exception_msg(vm.ctx.exceptions.syntax_error, format!("unknown encoding for '{filename}': {enc}")), otherwise preserve or convert the real decode error into a SyntaxError (include the original error message/details) or re-raise it so real decoding failures (invalid byte sequences) are not masked; update the closure that wraps decode_text accordingly instead of unconditionally mapping all errors.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@crates/vm/src/stdlib/builtins.rs`:
- Around line 238-248: The current map_err on
vm.state.codec_registry.decode_text hides all decode failures as "unknown
encoding"; change the error handling in the call to decode_text (the map_err
closure around decode_text) to inspect the underlying error: if it is a
LookupError (i.e., unknown encoding) convert to
vm.new_exception_msg(vm.ctx.exceptions.syntax_error, format!("unknown encoding
for '{filename}': {enc}")), otherwise preserve or convert the real decode error
into a SyntaxError (include the original error message/details) or re-raise it
so real decoding failures (invalid byte sequences) are not masked; update the
closure that wraps decode_text accordingly instead of unconditionally mapping
all errors.
|
Code has been automatically formatted The code in this PR has been formatted using:
git pull origin utf8source |
There was a problem hiding this comment.
♻️ Duplicate comments (1)
crates/vm/src/stdlib/builtins.rs (1)
239-244:⚠️ Potential issue | 🟠 MajorPreserve non-lookup decode failures instead of rewriting all errors.
This closure currently rewrites all decode failures as “unknown encoding”, which masks real decode/type errors for known codecs. Only
LookupErrorshould be translated; other exceptions should pass through unchanged.Proposed fix
let decoded = vm .state .codec_registry .decode_text(bytes_obj.into(), enc, None, vm) - .map_err(|_| { - vm.new_exception_msg( - vm.ctx.exceptions.syntax_error.to_owned(), - format!("unknown encoding for '{filename}': {enc}").into(), - ) - })?; + .map_err(|exc| { + if exc.isinstance(vm.ctx.exceptions.lookup_error) { + vm.new_exception_msg( + vm.ctx.exceptions.syntax_error.to_owned(), + format!("unknown encoding for '{filename}': {enc}").into(), + ) + } else { + exc + } + })?;#!/bin/bash # Verify current error remapping site and exception-type check patterns. rg -nP --type=rust -C4 '\.decode_text\(' crates/vm/src/stdlib/builtins.rs rg -nP --type=rust -C2 '\.isinstance\(' crates/vm/src | head -n 40🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/vm/src/stdlib/builtins.rs` around lines 239 - 244, The current map_err closure unconditionally rewrites any decode failure into a SyntaxError "unknown encoding", which masks real decode/type errors; change the closure used around the decode call so it only converts the error when it is a LookupError from the encoder lookup and otherwise returns the original error unchanged — e.g. inspect the error value inside the map_err (or use match/if let) and only call vm.new_exception_msg(..., format!("unknown encoding for '{filename}': {enc}").into()) when the err is a LookupError; for any other error, propagate the original error (do not rewrap) so real decode/type errors are preserved (referencing the existing vm.new_exception_msg, vm.ctx.exceptions.syntax_error and the decode call surrounding .map_err).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@crates/vm/src/stdlib/builtins.rs`:
- Around line 239-244: The current map_err closure unconditionally rewrites any
decode failure into a SyntaxError "unknown encoding", which masks real
decode/type errors; change the closure used around the decode call so it only
converts the error when it is a LookupError from the encoder lookup and
otherwise returns the original error unchanged — e.g. inspect the error value
inside the map_err (or use match/if let) and only call vm.new_exception_msg(...,
format!("unknown encoding for '{filename}': {enc}").into()) when the err is a
LookupError; for any other error, propagate the original error (do not rewrap)
so real decode/type errors are preserved (referencing the existing
vm.new_exception_msg, vm.ctx.exceptions.syntax_error and the decode call
surrounding .map_err).
There was a problem hiding this comment.
♻️ Duplicate comments (1)
crates/vm/src/stdlib/builtins.rs (1)
235-244:⚠️ Potential issue | 🟠 MajorPreserve non-lookup codec errors instead of rewriting all failures.
At Line 239,
map_err(|_| ...)converts every decode failure intounknown encoding, which hides real decode/type errors from codec execution. OnlyLookupErrorshould be rewritten; everything else should propagate unchanged.Proposed fix
let decoded = vm .state .codec_registry .decode_text(bytes_obj.into(), enc, None, vm) - .map_err(|_| { - vm.new_exception_msg( - vm.ctx.exceptions.syntax_error.to_owned(), - format!("unknown encoding for '{filename}': {enc}").into(), - ) - })?; + .map_err(|exc| { + if exc.isinstance(vm.ctx.exceptions.lookup_error) { + vm.new_exception_msg( + vm.ctx.exceptions.syntax_error.to_owned(), + format!("unknown encoding for '{filename}': {enc}").into(), + ) + } else { + exc + } + })?;#!/bin/bash # Verify how decode_text errors are produced and ensure only LookupError is remapped. # Expected: decode_text can return non-LookupError exceptions, which should not be masked. rg -n --type=rust -C4 '\bdecode_text\s*\(' crates/vm/src rg -n --type=rust -C6 'pub fn decode_text|fn decode_text|lookup_error' crates/vm/src/codecs.rs crates/vm/src/stdlib/builtins.rs🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/vm/src/stdlib/builtins.rs` around lines 235 - 244, The current map_err on vm.state.codec_registry.decode_text(...) swallows all decoder errors and rewraps them as a syntax_error with "unknown encoding", but only LookupError should be remapped; preserve and propagate any other errors from decode_text. Change the map_err closure to match on the error variant returned by decode_text (check for LookupError) and convert only that case into vm.new_exception_msg(vm.ctx.exceptions.syntax_error, format!("unknown encoding for '{filename}': {enc}")), otherwise return the original error unchanged so non-lookup decode/type errors propagate; locate this change around the call to decode_text in builtins.rs and update the map_err handling accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@crates/vm/src/stdlib/builtins.rs`:
- Around line 235-244: The current map_err on
vm.state.codec_registry.decode_text(...) swallows all decoder errors and rewraps
them as a syntax_error with "unknown encoding", but only LookupError should be
remapped; preserve and propagate any other errors from decode_text. Change the
map_err closure to match on the error variant returned by decode_text (check for
LookupError) and convert only that case into
vm.new_exception_msg(vm.ctx.exceptions.syntax_error, format!("unknown encoding
for '{filename}': {enc}")), otherwise return the original error unchanged so
non-lookup decode/type errors propagate; locate this change around the call to
decode_text in builtins.rs and update the map_err handling accordingly.
ℹ️ Review info
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (3)
Lib/test/test_importlib/source/test_source_encoding.pyis excluded by!Lib/**Lib/test/test_runpy.pyis excluded by!Lib/**Lib/test/test_utf8source.pyis excluded by!Lib/**
📒 Files selected for processing (1)
crates/vm/src/stdlib/builtins.rs
- Validate '#' is preceded only by whitespace/formfeed (PEP 263) - Normalize UTF-8 encoding aliases (utf-8, utf_8, utf8, etc.)
Remove @unittest.expectedFailure decorators from: - test_source_encoding: test_encoding_on_first_line, test_encoding_on_second_line, test_bom_conflict - test_runpy: test_encoding
Summary by CodeRabbit