-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Parse surrogates in string literals properly #5629
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parse surrogates in string literals properly #5629
Conversation
960e86c
to
b4e3558
Compare
b4e3558
to
0a07cd9
Compare
interesting, copilot reviewer is added. I tried a test run on this pr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR aims to improve the handling of surrogate escapes in string literals by reparsing them into WTF‑8 format so that previously crashing surrogate strings now produce valid output. Key changes include:
- Adding surrogate-specific parsing logic in the string parser and updating usage of string constants to use WTF‑8.
- Adjusting code in the WTF‑8 module to introduce new types for lead and trail surrogates and improve decoding.
- Updating various tests and Cargo.toml dependencies to align with these changes.
Reviewed Changes
Copilot reviewed 37 out of 37 changed files in this pull request and generated no comments.
File | Description |
---|---|
compiler/codegen/src/string_parser.rs | Implements surrogate-aware parsing for string literals. |
common/src/wtf8/mod.rs | Introduces new surrogate types and adds decoding enhancements. |
Lib/test/*.py and other test files | Updates expected failure markers and related test expectations. |
compiler/core/Cargo.toml & compiler/codegen/Cargo.toml | Adds and adjusts dependencies required for the changes. |
Comments suppressed due to low confidence (3)
compiler/codegen/src/string_parser.rs:280
- Using unwrap_or_else with an empty match in parse_string_literal may mask unexpected errors during surrogate parsing. Consider handling the error explicitly with a descriptive panic message or returning a proper Result.
.unwrap_or_else(|x| match x {})
common/src/wtf8/mod.rs:767
- In the from_bytes function, always advancing by 3 bytes after decoding a surrogate sequence may be fragile if the assumptions about surrogate lengths change. Consider adding a comment to document this behavior or validating the boundary before slicing.
rest = &rest[3..];
compiler/codegen/src/compile.rs:2682
- [nitpick] The code branch that reparses string literals containing the replacement character introduces additional processing. Verify that this extra reparsing is required for proper surrogate handling and that the performance impact is acceptable.
if value.contains(char::REPLACEMENT_CHARACTER) {
I guess I'm reviewing the review, lol
Close, but reparsing the strings to WTF-8 actually began causing crashes. The second commit then fixes those crashes. The following are "low confidence" but I guess that's for good reason
Not how uninhabited types work lol
If UTF-8 changes, we have more issues than just this.
Well, yes, that's why it's behind a |
@@ -26,7 +26,7 @@ mod _codecs { | |||
fn lookup(encoding: PyStrRef, vm: &VirtualMachine) -> PyResult { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably we need PyUtf8StrRef
to avoid the repeating try_to_str(vm)
too much
@@ -0,0 +1,287 @@ | |||
//! A stripped-down version of ruff's string literal parser, modified to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would ruff parser also need to be patched?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, though I'm not sure whether they'd necessarily feel a need to. I think what might strike a good balance is if there's a flag that can be set like contains_surrogates
, so we can know whether or not we need to reparse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a question for @MichaReiser.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, probably. There's an open issue but it isn't something that has come up often. astral-sh/ruff#13666
Co-authored-by: Jeong, YunWon <69878+youknowone@users.noreply.github.com>
Because ruff represents string literals as
str
, we have to reparse every string that might have a surrogate\uDXXX
escape as wtf8. Then there were a bunch of crashes to fix, because surrogate strings intest/
actually get parsed properly now instead of being subbed withREPLACEMENT_CHARACTER
.