Skip to content

Match CPython parity for compile() str source, mode wording, and func_type#7767

Open
changjoon-park wants to merge 2 commits into
RustPython:mainfrom
changjoon-park:fix-compile-str-source-encoding
Open

Match CPython parity for compile() str source, mode wording, and func_type#7767
changjoon-park wants to merge 2 commits into
RustPython:mainfrom
changjoon-park:fix-compile-str-source-encoding

Conversation

@changjoon-park
Copy link
Copy Markdown
Contributor

@changjoon-park changjoon-park commented May 2, 2026

Summary

Three CPython parity gaps in compile() text-source handling, all in crates/vm/src/stdlib/builtins.rs.

1. str source path applies bytes-only encoding declaration validation

>>> compile('# -*- coding: badencoding -*-\n"x"\n', 'tmp', 'exec')
# CPython 3.14.4: <code object>
# RustPython main: SyntaxError: unknown encoding for 'tmp': badencoding  ❌

Previously, compile() collapsed both str and bytes-like inputs to bytes via ArgStrOrBytesLike::borrow_bytes() and ran the same decode_source_bytes() over them, including the "unknown encoding" error path. CPython only applies # -*- coding: ... -*- validation when source is bytes — for str the source is already decoded and the declaration must be ignored.

Branch on args.source type early:

  • pure str (no surrogates) → skip encoding-declaration validation
  • surrogate-bearing str → fall back to bytes path (preserves existing SyntaxError)
  • bytes-like → unchanged

2. compile() mode error wording

>>> compile('x=1', '<x>', 'bogus')
# CPython 3.14.4: ValueError: compile() mode must be 'exec', 'eval' or 'single'
# RustPython main: ValueError: mode must be "exec", "eval", or "single"  ❌

The parser's ModeParseError Display produces a generic message used outside compile() too. Replace the err.to_string() wrapping at the four mode_str.parse::<Mode>() call sites in builtins::compile with the exact CPython wording.

3. Missing func_type validation on text-source path

>>> compile('def f(x): pass', 'f', 'func_type')
# CPython 3.14.4: ValueError: compile() mode 'func_type' requires flag PyCF_ONLY_AST
# RustPython main: ValueError: mode must be "exec", "eval", or "single"  ❌

The AST-input path already had this check (lines 287-291); mirror it for text-source input so the user sees the actual cause (missing flag) rather than "invalid mode."

Verification (CPython 3.14.4)

Probe Cases Match
Source-input combinations: {str, bytes, bytearray, memoryview} × {no decl, valid utf-8, bad decl} 9 9/9
Mode strings (exec, eval, single, func_type, bogus, '', BAD) 7 7/7

cargo run --release -- -m test test_compile test_builtin test_compileall test_codeop test_ast — 703 tests pass, 0 regressions. All 188 extra_tests/snippets/*.py pass under the CI feature set.

Out of scope

test_compile.test_encoding's expectedFailure marker is left in place: the test exercises non-UTF-8 encoding decoding (# -*- coding: latin1 -*-, iso8859-15) which is a separate pre-existing limitation in decode_source_bytes unaffected by this PR.

Summary by CodeRabbit

  • Bug Fixes
    • Improved handling of source encodings for text vs. binary inputs when compiling code.
    • Enhanced compatibility with CPython compile behavior and error responses.
    • Stricter validation of compile "mode" values, with clearer error handling for invalid modes.
  • Tests
    • Added tests to verify encoding rules, mode error messages, and special-mode flag requirements.

…on mode wording

Three CPython parity gaps in `compile()` text-source handling, all in
`crates/vm/src/stdlib/builtins.rs`:

1. **str source path applies bytes-only encoding declaration validation.**
   Previously, `compile()` collapsed both `str` and bytes-like inputs to
   bytes via `ArgStrOrBytesLike::borrow_bytes()` and ran the same
   `decode_source_bytes()` over them, including the "unknown encoding"
   error path. CPython only applies `# -*- coding: ... -*-` validation
   when source is bytes — for `str` the source is already decoded and
   the declaration must be ignored.

   ```python
   >>> compile('# -*- coding: badencoding -*-\n"x"\n', 'tmp', 'exec')
   # CPython 3.14.4: <code object>
   # RustPython main: SyntaxError: unknown encoding for 'tmp': badencoding  ❌
   ```

   Branch on `args.source` type early: pure `str` (no surrogates) skips
   the encoding-declaration check; surrogate-bearing `str` falls back to
   the bytes path; bytes-like input is unchanged.

2. **`compile()` mode error wording.** The parser's `ModeParseError`
   Display produces `mode must be "exec", "eval", or "single"` (no
   `compile()` prefix, double quotes, Oxford comma). CPython:
   `compile() mode must be 'exec', 'eval' or 'single'`. Replace
   `err.to_string()` wrapping at the four `mode_str.parse::<Mode>()`
   call sites with the exact CPython wording.

3. **Missing `func_type` validation on text-source path.** Previously,
   `compile(text, ..., 'func_type')` without `PyCF_ONLY_AST` fell
   through the `mode_str.parse::<Mode>()` failure path, which now
   reports `'exec', 'eval' or 'single'` and hides that the real issue
   is the missing flag. The AST-input path already had the correct
   check; mirror it for text input. CPython:
   `compile() mode 'func_type' requires flag PyCF_ONLY_AST`.

Verified byte-identical with CPython 3.14.4 across:
- 9 source-input combinations ({str, bytes, bytearray, memoryview} × {no decl, valid utf-8, bad decl})
- 7 mode strings (exec, eval, single, func_type, bogus, '', BAD)

`cargo run --release -- -m test test_compile test_builtin test_compileall
test_codeop test_ast` — 703 tests pass, 0 regressions. All 188
`extra_tests/snippets/*.py` pass under the CI feature set.

`test_compile.test_encoding`'s expectedFailure marker is left in place:
the test exercises non-UTF-8 encoding decoding (`# -*- coding: latin1 -*-`,
`iso8859-15`) which is a separate pre-existing limitation in
`decode_source_bytes` unaffected by this PR.
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 2, 2026

📝 Walkthrough

Walkthrough

The compile() function now distinguishes str vs bytes-like inputs for source decoding: str inputs bypass PEP 263 encoding-cookie handling (falling back only for surrogate-bearing str), while bytes inputs keep encoding/BOM detection. Mode parsing is unified to use Mode::parse() with fixed error messages, and mode == "func_type" is rejected unless PyCF_ONLY_AST is set.

Changes

Source Decoding and Mode Parsing Improvements

Layer / File(s) Summary
Source Decoding
crates/vm/src/stdlib/builtins.rs (lines 344–368)
compile() first treats args.source as PyStr; if to_str() succeeds, the UTF‑8 string is used directly (ignoring encoding cookies). If to_str() fails (surrogates) or args.source is bytes-like, decode via decode_source_bytes to preserve BOM/encoding declarations.
Mode Parsing (AST+codegen)
crates/vm/src/stdlib/builtins.rs (lines 318–322)
AST+codegen path now imports crate::compiler::Mode and parses mode_str.parse::<Mode>(), mapping parse failures to a fixed ValueError message.
CPython Parity & Compiler-Feature Path
crates/vm/src/stdlib/builtins.rs (lines 388–403, 415–417)
Adds explicit check rejecting mode_str == "func_type" unless PyCF_ONLY_AST is set; other mode parse errors produce the fixed ValueError text.
Non-compiler Parser Path
crates/vm/src/stdlib/builtins.rs (lines 440–442)
Parses mode_str with mode_str.parse::<parser::Mode>() and uses the same fixed ValueError message on failure.
Tests
extra_tests/snippets/builtin_compile.py
New tests assert that encoding declarations are ignored for str input (including unknown encodings) but enforced for bytes input (raising SyntaxError), and check exact ValueError messages for invalid mode values and for mode="func_type" without PyCF_ONLY_AST.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • youknowone
  • ShaharNaveh

Poem

🐰 I nibbled bits and parsed the rest,

Strings sing UTF‑8, bytes wear a vest.
Modes marched tidy, errors concise,
A rabbit's nod to CPython's spice. ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title directly matches the three main behavioral fixes: str source encoding handling, mode error wording, and func_type validation for CPython parity.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: Turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get your free trial and get 200 agent minutes per Slack user (a $50 value).


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
Review rate limit: 7/8 reviews remaining, refill in 7 minutes and 30 seconds.

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
crates/vm/src/stdlib/builtins.rs (1)

318-442: 💤 Low value

Consider extracting repeated error message strings to constants.

The error messages are duplicated across multiple locations:

  • "compile() mode must be 'exec', 'eval' or 'single'" appears 4 times (lines 321, 401, 416, 441)
  • "compile() mode 'func_type' requires flag PyCF_ONLY_AST" appears 2 times (lines 289, 394)

Extracting these to constants would reduce duplication and ensure consistency.

♻️ Proposed refactor
+    const COMPILE_MODE_ERROR: &str = "compile() mode must be 'exec', 'eval' or 'single'";
+    const FUNC_TYPE_REQUIRES_AST_FLAG: &str = "compile() mode 'func_type' requires flag PyCF_ONLY_AST";
+
     #[cfg(any(feature = "parser", feature = "compiler"))]
     #[pyfunction]
     fn compile(args: CompileArgs, vm: &VirtualMachine) -> PyResult {

Then replace the inline string literals with these constants.

As per coding guidelines: "When branches differ only in a value but share common logic, extract the differing value first."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/vm/src/stdlib/builtins.rs` around lines 318 - 442, Create shared
constant error strings (e.g. const COMPILE_MODE_ERR: &str = "compile() mode must
be 'exec', 'eval' or 'single'"; const FUNC_TYPE_FLAG_ERR: &str = "compile() mode
'func_type' requires flag PyCF_ONLY_AST";) and replace all inline uses in the
compile path: the mode_str.parse::<Mode> error maps (currently returning
vm.new_value_error(...)) in the branches that call mode_str.parse (both
parser::Mode and crate::compiler::Mode), and the checks that return the
func_type flag error (the if mode_str == "func_type" branches used before
codegen and in the PyCF_ONLY_AST handling), so all four occurrences of the
compile-mode message and both occurrences of the func_type message use the new
constants to avoid duplication.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@crates/vm/src/stdlib/builtins.rs`:
- Around line 318-442: Create shared constant error strings (e.g. const
COMPILE_MODE_ERR: &str = "compile() mode must be 'exec', 'eval' or 'single'";
const FUNC_TYPE_FLAG_ERR: &str = "compile() mode 'func_type' requires flag
PyCF_ONLY_AST";) and replace all inline uses in the compile path: the
mode_str.parse::<Mode> error maps (currently returning vm.new_value_error(...))
in the branches that call mode_str.parse (both parser::Mode and
crate::compiler::Mode), and the checks that return the func_type flag error (the
if mode_str == "func_type" branches used before codegen and in the PyCF_ONLY_AST
handling), so all four occurrences of the compile-mode message and both
occurrences of the func_type message use the new constants to avoid duplication.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 837dc311-f5ba-4724-8dc6-43d035393069

📥 Commits

Reviewing files that changed from the base of the PR and between c2910c0 and 736be4e.

📒 Files selected for processing (1)
  • crates/vm/src/stdlib/builtins.rs

Copy link
Copy Markdown
Contributor

@ShaharNaveh ShaharNaveh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the patch!

Does this fix something in the CPython test suite? if not, can you add a custom test to prevent regressions?

Per ShaharNaveh's review on PR RustPython#7767: the fix doesn't unmask any
existing CPython test, so add a custom regression snippet covering all
three changes:

- str source ignores encoding declarations (the "unknown encoding"
  error path remains for bytes input)
- compile() mode error wording matches CPython's exact text
- func_type without PyCF_ONLY_AST raises the specific "requires flag"
  error, not the generic "invalid mode"
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
extra_tests/snippets/builtin_compile.py (1)

7-15: ⚡ Quick win

Surrogate-bearing str fallback path is not tested.

The Rust implementation explicitly documents that a surrogate-bearing str (where pystr.to_str() returns None) falls back to the bytes path and raises SyntaxError on the WTF-8 encoding. This is new branching logic introduced by this PR, but no assertion covers it. A minimal addition would be:

🧪 Suggested coverage addition
 compile("# -*- coding: utf-8 -*-\nx = 1\n", "tmp", "exec")
+
+# Surrogate-bearing str falls back to the bytes path → SyntaxError.
+assert_raises(SyntaxError, compile, chr(0xD800) + "\nx = 1\n", "tmp", "exec")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@extra_tests/snippets/builtin_compile.py` around lines 7 - 15, Add a test that
exercises the surrogate-bearing str fallback by calling compile with a str that
contains a lone surrogate (so pystr.to_str() returns None) and asserting it
behaves like the bytes path (i.e., raises SyntaxError for the bogus WTF-8
declaration); use the existing compile call pattern and assert_raises to verify
the SyntaxError from the surrogate-bearing string input, referencing the compile
and assert_raises usages already in the snippet.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@extra_tests/snippets/builtin_compile.py`:
- Around line 7-15: Add a test that exercises the surrogate-bearing str fallback
by calling compile with a str that contains a lone surrogate (so pystr.to_str()
returns None) and asserting it behaves like the bytes path (i.e., raises
SyntaxError for the bogus WTF-8 declaration); use the existing compile call
pattern and assert_raises to verify the SyntaxError from the surrogate-bearing
string input, referencing the compile and assert_raises usages already in the
snippet.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 7c6a6d5c-add4-415e-b244-99a5e2699d21

📥 Commits

Reviewing files that changed from the base of the PR and between 736be4e and 9d96664.

📒 Files selected for processing (1)
  • extra_tests/snippets/builtin_compile.py

@changjoon-park
Copy link
Copy Markdown
Contributor Author

Thanks for the patch!

Does this fix something in the CPython test suite? if not, can you add a custom test to prevent regressions?

thanks for catching ! it doesnt unmask an existing test, i added a regression snippet. please check it and thanks for the review :)

Copy link
Copy Markdown
Contributor

@ShaharNaveh ShaharNaveh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great:)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants