Match CPython parity for compile() str source, mode wording, and func_type by changjoon-park · Pull Request #7767 · RustPython/RustPython

changjoon-park · 2026-05-02T22:09:50Z

Summary

Three CPython parity gaps in compile() text-source handling, all in crates/vm/src/stdlib/builtins.rs.

1. str source path applies bytes-only encoding declaration validation

>>> compile('# -*- coding: badencoding -*-\n"x"\n', 'tmp', 'exec')
# CPython 3.14.4: <code object>
# RustPython main: SyntaxError: unknown encoding for 'tmp': badencoding  ❌

Previously, compile() collapsed both str and bytes-like inputs to bytes via ArgStrOrBytesLike::borrow_bytes() and ran the same decode_source_bytes() over them, including the "unknown encoding" error path. CPython only applies # -*- coding: ... -*- validation when source is bytes — for str the source is already decoded and the declaration must be ignored.

Branch on args.source type early:

pure str (no surrogates) → skip encoding-declaration validation
surrogate-bearing str → fall back to bytes path (preserves existing SyntaxError)
bytes-like → unchanged

2. `compile()` mode error wording

>>> compile('x=1', '<x>', 'bogus')
# CPython 3.14.4: ValueError: compile() mode must be 'exec', 'eval' or 'single'
# RustPython main: ValueError: mode must be "exec", "eval", or "single"  ❌

The parser's ModeParseError Display produces a generic message used outside compile() too. Replace the err.to_string() wrapping at the four mode_str.parse::<Mode>() call sites in builtins::compile with the exact CPython wording.

3. Missing `func_type` validation on text-source path

>>> compile('def f(x): pass', 'f', 'func_type')
# CPython 3.14.4: ValueError: compile() mode 'func_type' requires flag PyCF_ONLY_AST
# RustPython main: ValueError: mode must be "exec", "eval", or "single"  ❌

The AST-input path already had this check (lines 287-291); mirror it for text-source input so the user sees the actual cause (missing flag) rather than "invalid mode."

Verification (CPython 3.14.4)

Probe	Cases	Match
Source-input combinations: {str, bytes, bytearray, memoryview} × {no decl, valid utf-8, bad decl}	9	9/9
Mode strings (exec, eval, single, func_type, bogus, '', BAD)	7	7/7

cargo run --release -- -m test test_compile test_builtin test_compileall test_codeop test_ast — 703 tests pass, 0 regressions. All 188 extra_tests/snippets/*.py pass under the CI feature set.

Out of scope

test_compile.test_encoding's expectedFailure marker is left in place: the test exercises non-UTF-8 encoding decoding (# -*- coding: latin1 -*-, iso8859-15) which is a separate pre-existing limitation in decode_source_bytes unaffected by this PR.

Summary by CodeRabbit

Bug Fixes
- Improved handling of source encodings for text vs. binary inputs when compiling code.
- Enhanced compatibility with CPython compile behavior and error responses.
- Stricter validation of compile "mode" values, with clearer error handling for invalid modes.
Tests
- Added tests to verify encoding rules, mode error messages, and special-mode flag requirements.

…on mode wording Three CPython parity gaps in `compile()` text-source handling, all in `crates/vm/src/stdlib/builtins.rs`: 1. **str source path applies bytes-only encoding declaration validation.** Previously, `compile()` collapsed both `str` and bytes-like inputs to bytes via `ArgStrOrBytesLike::borrow_bytes()` and ran the same `decode_source_bytes()` over them, including the "unknown encoding" error path. CPython only applies `# -*- coding: ... -*-` validation when source is bytes — for `str` the source is already decoded and the declaration must be ignored. ```python >>> compile('# -*- coding: badencoding -*-\n"x"\n', 'tmp', 'exec') # CPython 3.14.4: <code object> # RustPython main: SyntaxError: unknown encoding for 'tmp': badencoding ❌ ``` Branch on `args.source` type early: pure `str` (no surrogates) skips the encoding-declaration check; surrogate-bearing `str` falls back to the bytes path; bytes-like input is unchanged. 2. **`compile()` mode error wording.** The parser's `ModeParseError` Display produces `mode must be "exec", "eval", or "single"` (no `compile()` prefix, double quotes, Oxford comma). CPython: `compile() mode must be 'exec', 'eval' or 'single'`. Replace `err.to_string()` wrapping at the four `mode_str.parse::<Mode>()` call sites with the exact CPython wording. 3. **Missing `func_type` validation on text-source path.** Previously, `compile(text, ..., 'func_type')` without `PyCF_ONLY_AST` fell through the `mode_str.parse::<Mode>()` failure path, which now reports `'exec', 'eval' or 'single'` and hides that the real issue is the missing flag. The AST-input path already had the correct check; mirror it for text input. CPython: `compile() mode 'func_type' requires flag PyCF_ONLY_AST`. Verified byte-identical with CPython 3.14.4 across: - 9 source-input combinations ({str, bytes, bytearray, memoryview} × {no decl, valid utf-8, bad decl}) - 7 mode strings (exec, eval, single, func_type, bogus, '', BAD) `cargo run --release -- -m test test_compile test_builtin test_compileall test_codeop test_ast` — 703 tests pass, 0 regressions. All 188 `extra_tests/snippets/*.py` pass under the CI feature set. `test_compile.test_encoding`'s expectedFailure marker is left in place: the test exercises non-UTF-8 encoding decoding (`# -*- coding: latin1 -*-`, `iso8859-15`) which is a separate pre-existing limitation in `decode_source_bytes` unaffected by this PR.

coderabbitai · 2026-05-02T22:10:01Z

📝 Walkthrough

Walkthrough

The compile() function now distinguishes str vs bytes-like inputs for source decoding: str inputs bypass PEP 263 encoding-cookie handling (falling back only for surrogate-bearing str), while bytes inputs keep encoding/BOM detection. Mode parsing is unified to use Mode::parse() with fixed error messages, and mode == "func_type" is rejected unless PyCF_ONLY_AST is set.

Changes

Source Decoding and Mode Parsing Improvements

Layer / File(s)	Summary
Source Decoding `crates/vm/src/stdlib/builtins.rs` (lines 344–368)	`compile()` first treats `args.source` as `PyStr`; if `to_str()` succeeds, the UTF‑8 string is used directly (ignoring encoding cookies). If `to_str()` fails (surrogates) or `args.source` is bytes-like, decode via `decode_source_bytes` to preserve BOM/encoding declarations.
Mode Parsing (AST+codegen) `crates/vm/src/stdlib/builtins.rs` (lines 318–322)	AST+codegen path now imports `crate::compiler::Mode` and parses `mode_str.parse::<Mode>()`, mapping parse failures to a fixed `ValueError` message.
CPython Parity & Compiler-Feature Path `crates/vm/src/stdlib/builtins.rs` (lines 388–403, 415–417)	Adds explicit check rejecting `mode_str == "func_type"` unless `PyCF_ONLY_AST` is set; other mode parse errors produce the fixed `ValueError` text.
Non-compiler Parser Path `crates/vm/src/stdlib/builtins.rs` (lines 440–442)	Parses `mode_str` with `mode_str.parse::<parser::Mode>()` and uses the same fixed `ValueError` message on failure.
Tests `extra_tests/snippets/builtin_compile.py`	New tests assert that encoding declarations are ignored for `str` input (including unknown encodings) but enforced for `bytes` input (raising `SyntaxError`), and check exact `ValueError` messages for invalid `mode` values and for `mode="func_type"` without `PyCF_ONLY_AST`.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Accept __index__-conforming objects for compile() flags / optimize #7728 — touches the same compile() area and related flag/optimize parsing.

Suggested reviewers

youknowone
ShaharNaveh

Poem

🐰 I nibbled bits and parsed the rest,

Strings sing UTF‑8, bytes wear a vest.
Modes marched tidy, errors concise,
A rabbit's nod to CPython's spice. ✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title directly matches the three main behavioral fixes: str source encoding handling, mode error wording, and func_type validation for CPython parity.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Tip

💬 Introducing Slack Agent: Turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

Generate code and open pull requests
Plan features and break down work
Investigate incidents and troubleshoot customer tickets together
Automate recurring tasks and respond to alerts with triggers
Summarize progress and report instantly

Built for teams:

Shared memory across your entire org—no repeating context
Per-thread sandboxes to safely plan and execute work
Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get your free trial and get 200 agent minutes per Slack user (a $50 value).

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Review rate limit: 7/8 reviews remaining, refill in 7 minutes and 30 seconds.}

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (1)

crates/vm/src/stdlib/builtins.rs (1)
318-442: 💤 Low value

Consider extracting repeated error message strings to constants.

The error messages are duplicated across multiple locations:

"compile() mode must be 'exec', 'eval' or 'single'" appears 4 times (lines 321, 401, 416, 441)

"compile() mode 'func_type' requires flag PyCF_ONLY_AST" appears 2 times (lines 289, 394)

Extracting these to constants would reduce duplication and ensure consistency.
♻️ Proposed refactor
+    const COMPILE_MODE_ERROR: &str = "compile() mode must be 'exec', 'eval' or 'single'";
+    const FUNC_TYPE_REQUIRES_AST_FLAG: &str = "compile() mode 'func_type' requires flag PyCF_ONLY_AST";
+
     #[cfg(any(feature = "parser", feature = "compiler"))]
     #[pyfunction]
     fn compile(args: CompileArgs, vm: &VirtualMachine) -> PyResult {
Then replace the inline string literals with these constants.
As per coding guidelines: "When branches differ only in a value but share common logic, extract the differing value first."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/vm/src/stdlib/builtins.rs` around lines 318 - 442, Create shared
constant error strings (e.g. const COMPILE_MODE_ERR: &str = "compile() mode must
be 'exec', 'eval' or 'single'"; const FUNC_TYPE_FLAG_ERR: &str = "compile() mode
'func_type' requires flag PyCF_ONLY_AST";) and replace all inline uses in the
compile path: the mode_str.parse::<Mode> error maps (currently returning
vm.new_value_error(...)) in the branches that call mode_str.parse (both
parser::Mode and crate::compiler::Mode), and the checks that return the
func_type flag error (the if mode_str == "func_type" branches used before
codegen and in the PyCF_ONLY_AST handling), so all four occurrences of the
compile-mode message and both occurrences of the func_type message use the new
constants to avoid duplication.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@crates/vm/src/stdlib/builtins.rs`:
- Around line 318-442: Create shared constant error strings (e.g. const
COMPILE_MODE_ERR: &str = "compile() mode must be 'exec', 'eval' or 'single'";
const FUNC_TYPE_FLAG_ERR: &str = "compile() mode 'func_type' requires flag
PyCF_ONLY_AST";) and replace all inline uses in the compile path: the
mode_str.parse::<Mode> error maps (currently returning vm.new_value_error(...))
in the branches that call mode_str.parse (both parser::Mode and
crate::compiler::Mode), and the checks that return the func_type flag error (the
if mode_str == "func_type" branches used before codegen and in the PyCF_ONLY_AST
handling), so all four occurrences of the compile-mode message and both
occurrences of the func_type message use the new constants to avoid duplication.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 837dc311-f5ba-4724-8dc6-43d035393069

📥 Commits

Reviewing files that changed from the base of the PR and between c2910c0 and 736be4e.

📒 Files selected for processing (1)

crates/vm/src/stdlib/builtins.rs

ShaharNaveh

Thanks for the patch!

Does this fix something in the CPython test suite? if not, can you add a custom test to prevent regressions?

Per ShaharNaveh's review on PR RustPython#7767: the fix doesn't unmask any existing CPython test, so add a custom regression snippet covering all three changes: - str source ignores encoding declarations (the "unknown encoding" error path remains for bytes input) - compile() mode error wording matches CPython's exact text - func_type without PyCF_ONLY_AST raises the specific "requires flag" error, not the generic "invalid mode"

coderabbitai

🧹 Nitpick comments (1)

extra_tests/snippets/builtin_compile.py (1)
7-15: ⚡ Quick win

Surrogate-bearing str fallback path is not tested.

The Rust implementation explicitly documents that a surrogate-bearing str (where pystr.to_str() returns None) falls back to the bytes path and raises SyntaxError on the WTF-8 encoding. This is new branching logic introduced by this PR, but no assertion covers it. A minimal addition would be:
🧪 Suggested coverage addition
 compile("# -*- coding: utf-8 -*-\nx = 1\n", "tmp", "exec")
+
+# Surrogate-bearing str falls back to the bytes path → SyntaxError.
+assert_raises(SyntaxError, compile, chr(0xD800) + "\nx = 1\n", "tmp", "exec")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@extra_tests/snippets/builtin_compile.py` around lines 7 - 15, Add a test that
exercises the surrogate-bearing str fallback by calling compile with a str that
contains a lone surrogate (so pystr.to_str() returns None) and asserting it
behaves like the bytes path (i.e., raises SyntaxError for the bogus WTF-8
declaration); use the existing compile call pattern and assert_raises to verify
the SyntaxError from the surrogate-bearing string input, referencing the compile
and assert_raises usages already in the snippet.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@extra_tests/snippets/builtin_compile.py`:
- Around line 7-15: Add a test that exercises the surrogate-bearing str fallback
by calling compile with a str that contains a lone surrogate (so pystr.to_str()
returns None) and asserting it behaves like the bytes path (i.e., raises
SyntaxError for the bogus WTF-8 declaration); use the existing compile call
pattern and assert_raises to verify the SyntaxError from the surrogate-bearing
string input, referencing the compile and assert_raises usages already in the
snippet.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 7c6a6d5c-add4-415e-b244-99a5e2699d21

📥 Commits

Reviewing files that changed from the base of the PR and between 736be4e and 9d96664.

📒 Files selected for processing (1)

extra_tests/snippets/builtin_compile.py

changjoon-park · 2026-05-02T23:41:56Z

Thanks for the patch!

Does this fix something in the CPython test suite? if not, can you add a custom test to prevent regressions?

thanks for catching ! it doesnt unmask an existing test, i added a regression snippet. please check it and thanks for the review :)

ShaharNaveh

looks great:)

coderabbitai Bot reviewed May 2, 2026

View reviewed changes

ShaharNaveh approved these changes May 2, 2026

View reviewed changes

coderabbitai Bot reviewed May 2, 2026

View reviewed changes

ShaharNaveh approved these changes May 12, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Match CPython parity for compile() str source, mode wording, and func_type#7767

Match CPython parity for compile() str source, mode wording, and func_type#7767
changjoon-park wants to merge 2 commits into
RustPython:mainfrom
changjoon-park:fix-compile-str-source-encoding

changjoon-park commented May 2, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 2, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

Uh oh!

coderabbitai Bot left a comment

Uh oh!

ShaharNaveh left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

changjoon-park commented May 2, 2026

Uh oh!

ShaharNaveh left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

changjoon-park commented May 2, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

1. str source path applies bytes-only encoding declaration validation

2. compile() mode error wording

3. Missing func_type validation on text-source path

Verification (CPython 3.14.4)

Out of scope

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

ShaharNaveh left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

changjoon-park commented May 2, 2026

Uh oh!

ShaharNaveh left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

changjoon-park commented May 2, 2026 •

edited by coderabbitai Bot

Loading

2. `compile()` mode error wording

3. Missing `func_type` validation on text-source path

coderabbitai Bot commented May 2, 2026 •

edited

Loading