gh-69456: Add method to detect if a string contains surrogates #135265
Thanks for working on this!
But I can give my opinion on the name :) We probably want something like And no, there is no corresponding method for bytes: by definition this only has meaning in a unicode string, because surrogateescape is a decode error handler.
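To illustrate why this only makes sense for str: a minimal sketch of how the surrogateescape error handler produces lone surrogates when decoding. The sample bytes are made up for illustration; only standard codec behaviour is used.

```python
# Bytes that are not valid UTF-8 (0xFF can never appear in UTF-8).
raw = b"abc\xff"

# Decoding with surrogateescape maps each undecodable byte 0xNN
# to the lone surrogate code point U+DCNN instead of raising.
text = raw.decode("utf-8", errors="surrogateescape")

# The resulting str cannot be encoded with the default strict handler...
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# ...but encoding with surrogateescape restores the original bytes.
roundtrip = text.encode("utf-8", errors="surrogateescape")
```

The bytes object itself never contains surrogates; they exist only in the decoded string, which is the point made above.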
For what it's worth, changes to builtins (especially ones as prominent as
I'm not sure what this method is supposed to signal. We first need a clear definition to be able to tell whether it's worth adding a str method for it.
Detecting such escape sequences can easily be done using the Note that all this is different from e.g.
Looking at the implementation, the method does indeed only detect surrogate code points, so this method should be called This may actually be useful, since it allows checking for possible problem cases in Python strings before encoding them using one of the UTF codecs.
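For reference, a pure-Python sketch of what "detecting surrogate code points" means, using the re module. The helper name has_surrogate is hypothetical and is not the name proposed in this PR; the range U+D800 to U+DFFF is the Unicode surrogate block.

```python
import re

# Matches any single code point in the surrogate range U+D800..U+DFFF.
_SURROGATE_RE = re.compile(r"[\ud800-\udfff]")


def has_surrogate(s: str) -> bool:
    """Return True if s contains at least one surrogate code point.

    Hypothetical helper for illustration only.
    """
    return _SURROGATE_RE.search(s) is not None
```

Note that a non-BMP character such as "\U0001F600" is a single code point in a Python str, so it does not count as a surrogate here.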
That would exactly match the email package use case (checking that the string can be encoded as is or needs special handling). I don't (yet) know what is involved in sponsoring a PEP, but I'm willing ;) Currently the email package tries an encode and catches the exception. It would be nice to be able to do "look before you leap" instead, and probably more efficient than invoking the codec machinery.
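A rough sketch of the two styles being contrasted, assuming UTF-8 as the target encoding. Both function names are hypothetical; this is not the email package's actual code, and the second function is only a slow pure-Python stand-in for a built-in check.

```python
def needs_handling_eafp(s: str) -> bool:
    """Try the encode and catch the failure (the current approach)."""
    try:
        s.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False


def needs_handling_lbyl(s: str) -> bool:
    """Look before you leap: scan for surrogate code points directly."""
    return any("\ud800" <= ch <= "\udfff" for ch in s)


samples = ["plain ascii", "caf\u00e9", "bad\udce9byte", "\U0001F600"]
results = [(needs_handling_eafp(s), needs_handling_lbyl(s)) for s in samples]
```

For UTF-8 the two agree, because surrogates are the only code points a strict UTF-8 encode rejects.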
I think it's best if you write it, rather than sponsor it ;-) You know a lot more about this than anyone else whom you could sponsor. I have some familiarity with the PEPs system, I'd be happy to help and co-author. I can polish this example implementation to whatever we want then.
Request @bitdancer (and maybe @ezio-melotti and @malemburg (looking at devguide)).
contains_surrogates is misleading since it does not require more than one, and to me, singular suffers less, though it can be misleading too. And, should this also be implemented for bytes?
📚 Documentation preview 📚: https://cpython-previews--135265.org.readthedocs.build/