Skip to content

Parse surrogates in string literals properly #5629

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Mar 27, 2025

Conversation

coolreader18
Copy link
Member

Because ruff represents string literals as str, we have to reparse every string that might have a surrogate \uDXXX escape as wtf8. Then there were a bunch of crashes to fix, because surrogate strings in test/ actually get parsed properly now instead of being subbed with REPLACEMENT_CHARACTER.

@youknowone youknowone requested a review from Copilot March 27, 2025 04:29
@youknowone
Copy link
Member

interesting, copilot reviewer is added. I tried a test run on this pr

Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR aims to improve the handling of surrogate escapes in string literals by reparsing them into WTF‑8 format so that previously crashing surrogate strings now produce valid output. Key changes include:

  • Adding surrogate-specific parsing logic in the string parser and updating usage of string constants to use WTF‑8.
  • Adjusting code in the WTF‑8 module to introduce new types for lead and trail surrogates and improve decoding.
  • Updating various tests and Cargo.toml dependencies to align with these changes.

Reviewed Changes

Copilot reviewed 37 out of 37 changed files in this pull request and generated no comments.

File Description
compiler/codegen/src/string_parser.rs Implements surrogate-aware parsing for string literals.
common/src/wtf8/mod.rs Introduces new surrogate types and adds decoding enhancements.
Lib/test/*.py and other test files Updates expected failure markers and related test expectations.
compiler/core/Cargo.toml & compiler/codegen/Cargo.toml Adds and adjusts dependencies required for the changes.
Comments suppressed due to low confidence (3)

compiler/codegen/src/string_parser.rs:280

  • Using unwrap_or_else with an empty match in parse_string_literal may mask unexpected errors during surrogate parsing. Consider handling the error explicitly with a descriptive panic message or returning a proper Result.
        .unwrap_or_else(|x| match x {})

common/src/wtf8/mod.rs:767

  • In the from_bytes function, always advancing by 3 bytes after decoding a surrogate sequence may be fragile if the assumptions about surrogate lengths change. Consider adding a comment to document this behavior or validating the boundary before slicing.
            rest = &rest[3..];

compiler/codegen/src/compile.rs:2682

  • [nitpick] The code branch that reparses string literals containing the replacement character introduces additional processing. Verify that this extra reparsing is required for proper surrogate handling and that the performance impact is acceptable.
if value.contains(char::REPLACEMENT_CHARACTER) {

@coolreader18
Copy link
Member Author

I guess I'm reviewing the review, lol

This PR aims to improve the handling of surrogate escapes in string literals by reparsing them into WTF‑8 format so that previously crashing surrogate strings now produce valid output

Close, but reparsing the strings to WTF-8 actually began causing crashes. The second commit then fixes those crashes.

The following are "low confidence" but I guess that's for good reason

Using unwrap_or_else with an empty match in parse_string_literal may mask unexpected errors during surrogate parsing. Consider handling the error explicitly with a descriptive panic message or returning a proper Result.

        .unwrap_or_else(|x| match x {})

Not how uninhabited types work lol

In the from_bytes function, always advancing by 3 bytes after decoding a surrogate sequence may be fragile if the assumptions about surrogate lengths change. Consider adding a comment to document this behavior or validating the boundary before slicing.

If UTF-8 changes, we have more issues than just this.

[nitpick] The code branch that reparses string literals containing the replacement character introduces additional processing. Verify that this extra reparsing is required for proper surrogate handling and that the performance impact is acceptable.

Well, yes, that's why it's behind a .contains(REPLACEMENT_CHAR) branch in the first place.

@@ -26,7 +26,7 @@ mod _codecs {
fn lookup(encoding: PyStrRef, vm: &VirtualMachine) -> PyResult {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably we need PyUtf8StrRef to avoid the repeating try_to_str(vm) too much

@@ -0,0 +1,287 @@
//! A stripped-down version of ruff's string literal parser, modified to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would ruff parser also need to be patched?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, though I'm not sure whether they'd necessarily feel a need to. I think what might strike a good balance is if there's a flag that can be set like contains_surrogates, so we can know whether or not we need to reparse.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a question for @MichaReiser.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, probably. There's an open issue but it isn't something that has come up often. astral-sh/ruff#13666

Co-authored-by: Jeong, YunWon <69878+youknowone@users.noreply.github.com>
@coolreader18 coolreader18 merged commit b6aacbf into RustPython:main Mar 27, 2025
11 checks passed
@coolreader18 coolreader18 deleted the surrogate-literals branch March 27, 2025 15:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants