Parse surrogates in string literals properly #5629

coolreader18 · 2025-03-27T01:40:52Z

Because ruff represents string literals as str, we have to reparse every string that might have a surrogate \uDXXX escape as wtf8. Then there were a bunch of crashes to fix, because surrogate strings in test/ actually get parsed properly now instead of being subbed with REPLACEMENT_CHARACTER.
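
For readers unfamiliar with the underlying constraint, here is a minimal sketch of why a `str`-based representation cannot carry a lone surrogate. It is illustrative only and is not code from ruff or from this PR; the helper names `escape_to_char` and `is_surrogate` are hypothetical.

```rust
/// Hypothetical helper: turn the numeric value of a `\uXXXX` escape into a
/// `char`, substituting U+FFFD when the value is not a Unicode scalar value.
fn escape_to_char(code_point: u32) -> char {
    // `char::from_u32` returns `None` for surrogates (U+D800..=U+DFFF),
    // because a Rust `char` (and therefore a `str`) cannot hold them.
    char::from_u32(code_point).unwrap_or(char::REPLACEMENT_CHARACTER)
}

/// True for code points in the surrogate range U+D800..=U+DFFF.
fn is_surrogate(code_point: u32) -> bool {
    (0xD800..=0xDFFF).contains(&code_point)
}
```

Once the substitution has happened, the original escape value is gone from the parsed string, which is why a second parse of the source text is needed to recover it.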

youknowone · 2025-03-27T04:30:25Z

Interesting, a Copilot reviewer has been added. I tried a test run on this PR.

Copilot

Pull Request Overview

This PR aims to improve the handling of surrogate escapes in string literals by reparsing them into WTF‑8 format so that previously crashing surrogate strings now produce valid output. Key changes include:

Adding surrogate-specific parsing logic in the string parser and updating usage of string constants to use WTF‑8.
Adjusting code in the WTF‑8 module to introduce new types for lead and trail surrogates and improve decoding.
Updating various tests and Cargo.toml dependencies to align with these changes.

Reviewed Changes

Copilot reviewed 37 out of 37 changed files in this pull request and generated no comments.

File	Description
compiler/codegen/src/string_parser.rs	Implements surrogate-aware parsing for string literals.
common/src/wtf8/mod.rs	Introduces new surrogate types and adds decoding enhancements.
Lib/test/*.py and other test files	Updates expected failure markers and related test expectations.
compiler/core/Cargo.toml & compiler/codegen/Cargo.toml	Adds and adjusts dependencies required for the changes.

Comments suppressed due to low confidence (3)

compiler/codegen/src/string_parser.rs:280

Using unwrap_or_else with an empty match in parse_string_literal may mask unexpected errors during surrogate parsing. Consider handling the error explicitly with a descriptive panic message or returning a proper Result.

        .unwrap_or_else(|x| match x {})

common/src/wtf8/mod.rs:767

In the from_bytes function, always advancing by 3 bytes after decoding a surrogate sequence may be fragile if the assumptions about surrogate lengths change. Consider adding a comment to document this behavior or validating the boundary before slicing.

            rest = &rest[3..];

compiler/codegen/src/compile.rs:2682

[nitpick] The code branch that reparses string literals containing the replacement character introduces additional processing. Verify that this extra reparsing is required for proper surrogate handling and that the performance impact is acceptable.

if value.contains(char::REPLACEMENT_CHARACTER) {

coolreader18 · 2025-03-27T04:55:52Z

I guess I'm reviewing the review, lol

> This PR aims to improve the handling of surrogate escapes in string literals by reparsing them into WTF‑8 format so that previously crashing surrogate strings now produce valid output.

Close, but reparsing the strings to WTF-8 actually began causing crashes. The second commit then fixes those crashes.

The following are "low confidence" but I guess that's for good reason

> Using unwrap_or_else with an empty match in parse_string_literal may mask unexpected errors during surrogate parsing. Consider handling the error explicitly with a descriptive panic message or returning a proper Result.
>
>         .unwrap_or_else(|x| match x {})

Not how uninhabited types work lol
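
To spell out the point for anyone following along: an empty `match` only compiles when the matched type has no values, so it cannot hide an error. The sketch below assumes the error type is uninhabited, as the reply implies, and uses `std::convert::Infallible` as a stand-in. `parse_infallibly` and `parsed_len` are hypothetical names, not functions from this PR.

```rust
use std::convert::Infallible;

/// Hypothetical stand-in for a parse step whose error type is uninhabited.
fn parse_infallibly(input: &str) -> Result<usize, Infallible> {
    Ok(input.len())
}

fn parsed_len(input: &str) -> usize {
    // `Infallible` has no variants, so `match x {}` is exhaustive with zero
    // arms. The closure type-checks but can never actually be called.
    parse_infallibly(input).unwrap_or_else(|x| match x {})
}
```

If the error type ever gained a variant, the empty `match` would stop compiling, so the check is enforced at compile time and no panic message is needed.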

> In the from_bytes function, always advancing by 3 bytes after decoding a surrogate sequence may be fragile if the assumptions about surrogate lengths change. Consider adding a comment to document this behavior or validating the boundary before slicing.

If UTF-8 changes, we have more issues than just this.
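
As a quick check of the three-byte claim: every code point from U+0800 to U+FFFF, surrogates included, falls into the three-byte pattern of UTF-8, and WTF-8 reuses that pattern for surrogates. The function below is a standalone sketch with a hypothetical name, not the `from_bytes` code under review.

```rust
/// Encode a code point in U+0800..=U+FFFF with the three-byte pattern
/// 1110xxxx 10xxxxxx 10xxxxxx. WTF-8 applies the same pattern to surrogates.
fn encode_three_bytes(code_point: u16) -> [u8; 3] {
    assert!(code_point >= 0x0800, "below U+0800 the encoding is shorter");
    [
        0xE0 | (code_point >> 12) as u8,          // top 4 bits
        0x80 | ((code_point >> 6) & 0x3F) as u8,  // middle 6 bits
        0x80 | (code_point & 0x3F) as u8,         // low 6 bits
    ]
}
```

For the whole surrogate range the first byte is `0xED` and the second is between `0xA0` and `0xBF`, so the length is fixed by the encoding itself. Strict UTF-8 decoders such as `std::str::from_utf8` reject these sequences, which is the reason WTF-8 exists.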

> [nitpick] The code branch that reparses string literals containing the replacement character introduces additional processing. Verify that this extra reparsing is required for proper surrogate handling and that the performance impact is acceptable.

Well, yes, that's why it's behind a .contains(REPLACEMENT_CHAR) branch in the first place.
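
A rough sketch of that gate, under the assumption that the already-parsed value is a plain `&str`. The function name `might_contain_surrogate` is hypothetical and the real branch in compile.rs may look different.

```rust
/// Hypothetical gate: a surrogate escape shows up in the parsed value as
/// U+FFFD, so only literals containing U+FFFD need the slower second parse.
fn might_contain_surrogate(parsed_value: &str) -> bool {
    parsed_value.contains(char::REPLACEMENT_CHARACTER)
}
```

A literal that genuinely contains U+FFFD also passes the check. That is a harmless false positive: the reparse simply yields the same string.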

common/src/wtf8/mod.rs

youknowone · 2025-03-27T08:49:40Z

vm/src/stdlib/codecs.rs

@@ -26,7 +26,7 @@ mod _codecs {
    fn lookup(encoding: PyStrRef, vm: &VirtualMachine) -> PyResult {


We probably need a PyUtf8StrRef type to avoid repeating try_to_str(vm) so often.
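
One way to read this suggestion is "validate once at the boundary, then pass around a type that is known to be UTF-8". The sketch below shows that pattern with plain standard-library types. `MaybeSurrogateStr` and `Utf8Str` are hypothetical stand-ins, not RustPython types, and this is not a proposal for the actual `PyUtf8StrRef` API.

```rust
use std::convert::TryFrom;

/// Hypothetical stand-in for a string object that may hold surrogates
/// (stored here as raw WTF-8-style bytes).
struct MaybeSurrogateStr(Vec<u8>);

/// Hypothetical wrapper that can only be built from valid UTF-8.
struct Utf8Str(String);

impl TryFrom<MaybeSurrogateStr> for Utf8Str {
    type Error = std::string::FromUtf8Error;

    fn try_from(s: MaybeSurrogateStr) -> Result<Self, Self::Error> {
        // Validate once, here, instead of at every use site.
        String::from_utf8(s.0).map(Utf8Str)
    }
}

impl Utf8Str {
    fn as_str(&self) -> &str {
        &self.0
    }
}

/// A consumer that takes the validated type needs no per-call check.
fn lookup(encoding: &Utf8Str) -> String {
    encoding.as_str().to_lowercase()
}
```

With a type like this in a function signature, the conversion error is raised at argument extraction and the function body can use `&str` directly.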

youknowone · 2025-03-27T09:13:20Z

compiler/codegen/src/string_parser.rs

@@ -0,0 +1,287 @@
+//! A stripped-down version of ruff's string literal parser, modified to


Would ruff parser also need to be patched?

coolreader18 · 2025-03-27

Yeah, though I'm not sure whether they'd necessarily feel a need to. I think what might strike a good balance is if there's a flag that can be set like contains_surrogates, so we can know whether or not we need to reparse.

fanninpm · 2025-03-27

That's a question for @MichaReiser.

MichaReiser · 2025-03-27

Yeah, probably. There's an open issue but it isn't something that has come up often. astral-sh/ruff#13666

Parse surrogates in string literals properly

c6cab4c

coolreader18 force-pushed the surrogate-literals branch from 960e86c to b4e3558 Compare March 27, 2025 03:48

Fix more surrogate crashes

0a07cd9

coolreader18 force-pushed the surrogate-literals branch from b4e3558 to 0a07cd9 Compare March 27, 2025 04:12

youknowone requested a review from Copilot March 27, 2025 04:29

Copilot AI reviewed Mar 27, 2025

youknowone approved these changes Mar 27, 2025

Update common/src/wtf8/mod.rs

dd467f6

Co-authored-by: Jeong, YunWon <69878+youknowone@users.noreply.github.com>

coolreader18 merged commit b6aacbf into RustPython:main Mar 27, 2025
11 checks passed

coolreader18 deleted the surrogate-literals branch March 27, 2025 15:15

MichaReiser mentioned this pull request Mar 27, 2025

Surrogate code points are internally represented as U+FFFD REPLACEMENT CHARACTER astral-sh/ruff#13666

Open
