gh-69456: Add method to detect if a string contains surrogates #135265

Draft
wants to merge 2 commits into
base: main
from

StanFromIreland
Contributor

@StanFromIreland StanFromIreland commented Jun 8, 2025

Requesting review from @bitdancer (and maybe @ezio-melotti and @malemburg, going by the devguide).

The name contains_surrogates is misleading, since the method does not require more than one surrogate. To me the singular form suffers less, though it can be misleading too.

And, should this also be implemented for bytes?


📚 Documentation preview 📚: https://cpython-previews--135265.org.readthedocs.build/

@bitdancer
Member

Thanks for working on this!

_has_surrogates is an internal email library function name, and as such is not a suitable model for the method name for more than one reason. Importantly, it is a bit misleading to say we are detecting 'surrogates', since what we actually want the function to do is detect that a string contains surrogateescape-encoded bytes. I'm a bit rusty on the Python C code, but if I understand correctly, surrogates only appear in the Python internal Unicode representation if there are in fact escaped bytes, so the function itself is probably correct. I'd like someone with fresher knowledge of the relevant C code to review, though.

But I can give my opinion on the name :) We probably want something like issurrogateescaped to harmonize with the analogous existing string function names. Specifically, this is analogous to istitle, which returns True if and only if the string has at least one capital letter and conforms to the title casing rules. In this case, we return True if and only if the string contains at least one escaped byte.

And no, there is no corresponding method for bytes, since by definition this is something that only has meaning in a unicode string, since surrogateescape is a decode error handler.

@ZeroIntensity
Member

For what it's worth, changes to builtins (especially ones as prominent as str) need PEPs these days, because they'll have to be reflected across implementations. Perhaps @bitdancer would be willing to author/sponsor one?

@malemburg
Member

I'm not sure what this method is supposed to signal. We first need a clear definition to be able to tell whether it's worth adding a str method for it.

  • Surrogates are valid code points, so they can appear in Python (Unicode) strings. They do trigger exceptions with UTF codecs, though, since those encodings cannot encode surrogates.
  • Escaped surrogates are a Python-only feature (see https://peps.python.org/pep-0383/). The surrogateescape error handler maps undecodable bytes in the range 80-FF to \uDC80-\uDCFF when decoding, so that they round-trip.
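
To make the two points above concrete, here is a minimal sketch using only the built-in codecs. The sample bytes and variable names are made up for illustration. It decodes bytes that are not valid UTF-8 with the surrogateescape handler, round-trips them, and shows that the strict UTF-8 codec rejects the resulting string.

```python
# Illustrative input: Latin-1 style bytes that are not valid UTF-8.
raw = b"caf\xe9 \xff"

# surrogateescape turns each undecodable byte into a lone surrogate
# (U+DC00 + byte value) instead of raising UnicodeDecodeError.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes.
round_tripped = text.encode("utf-8", errors="surrogateescape")

# The default (strict) UTF-8 codec cannot encode the lone surrogates.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(ascii(text), round_tripped == raw, strict_ok)
```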

Detecting such escape sequences can easily be done using the re module, so I'm not sure whether we need a separate method for this special case.
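
A rough sketch of the re-based check mentioned above. The pattern names are hypothetical, and the second pattern assumes the U+DC80-U+DCFF range that PEP 383 specifies for escaped bytes. The first pattern matches any surrogate code point; the second matches only the code points that surrogateescape can produce.

```python
import re

# Any surrogate code point (U+D800-U+DFFF).
ANY_SURROGATE = re.compile(r"[\ud800-\udfff]")

# Only the code points surrogateescape produces (U+DC80-U+DCFF, per PEP 383).
ESCAPED_BYTE = re.compile(r"[\udc80-\udcff]")

plain = "hello \U0001F600"  # non-BMP characters are single code points in str
escaped = b"\xff".decode("utf-8", errors="surrogateescape")
lone_high = "\ud800"        # a surrogate that did not come from an escaped byte

for label, s in [("plain", plain), ("escaped", escaped), ("lone_high", lone_high)]:
    print(label, bool(ANY_SURROGATE.search(s)), bool(ESCAPED_BYTE.search(s)))
```

The two patterns disagree on lone_high, which is the distinction between "contains a surrogate" and "contains an escaped byte" that the naming discussion turns on.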

Note that all this is different from e.g. .istitle(), since those methods rely on code point properties, which tend to be extended and sometimes updated with newer Unicode versions.

@malemburg
Member

Looking at the implementation, the method does indeed only detect surrogate code points, so this method should be called .issurrogate() to be in line with the other similar methods.

This may actually be useful, since it allows checking for possible problem cases in Python strings before encoding them using one of the UTF codecs.
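
As a pure-Python approximation of that pre-encoding check (has_surrogate is a hypothetical helper name, not the method proposed in this PR and not an existing str method), one could write:

```python
def has_surrogate(s: str) -> bool:
    """Return True if s contains any code point in U+D800-U+DFFF."""
    # Comparing one-character strings compares their code points.
    return any("\ud800" <= ch <= "\udfff" for ch in s)


def strict_utf8_encodable(s: str) -> bool:
    """Reference check: does the strict UTF-8 codec accept s?"""
    try:
        s.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


samples = ["plain ascii", "na\u00efve", "\U0001F600", "bad \udcff", "\ud800"]
for s in samples:
    # For str input, lone surrogates are what makes strict UTF-8 encoding fail,
    # so the two checks should agree on every sample.
    print(ascii(s), has_surrogate(s), strict_utf8_encodable(s))
```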

@bitdancer
Member

That would exactly match the email package use case (checking that the string can be encoded as is or needs special handling). I don't (yet) know what is involved in sponsoring a PEP, but I'm willing ;)

Currently the email package tries an encode and catches the exception. It would be nice to be able to do "look before you leap" instead, and probably more efficient than invoking the codec machinery.
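
A sketch of the try-and-catch approach described above, for comparison. The function name is hypothetical and this is not claimed to be the email package's exact code.

```python
def needs_special_handling(s: str) -> bool:
    """EAFP check: attempt a strict UTF-8 encode and report whether it fails."""
    try:
        s.encode("utf-8")
    except UnicodeEncodeError:
        # Raised when s contains a surrogate code point, for example one
        # produced by decoding with errors="surrogateescape".
        return True
    return False


header = b"Subject: caf\xe9".decode("ascii", errors="surrogateescape")
print(needs_special_handling("Subject: cafe"))  # clean string
print(needs_special_handling(header))           # contains an escaped byte
```

A "look before you leap" method on str would let the caller make the same decision without building and discarding the encoded bytes.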

@StanFromIreland
Contributor Author

I think it's best if you write it, rather than sponsor it;-) You know a lot more about this than anyone else whom you could sponsor.

I have some familiarity with the PEP process, and I'd be happy to help and co-author. I can then polish this example implementation into whatever we decide on.

4 participants