add \z as a synonym for \Z in Python REs for standardization #133306

Open
mstevenbrown opened this issue May 2, 2025 · 6 comments
Assignees
Labels
topic-regex type-feature A feature request or enhancement

Comments

@mstevenbrown
mstevenbrown commented May 2, 2025

Feature or enhancement

Proposal:

Hello - I’m with the Austin Common Standards Revision Group - the joint technical working group established to develop and maintain the core open systems interfaces that are the POSIX™ 1003.1 (and former 1003.2) standards, ISO/IEC 9945, and the core of the Single UNIX Specification.

We have had a request to unify/rationalize the regex behaviors for “anchor at string beginning” (^ is the closest in POSIX) and “anchor at string end” ($ is the closest in POSIX). A description of this problem in depth can be found here and a table that scopes the varied solutions across varying languages can be found here.

Our working group has come to the conclusion that \A and \z are widely implemented across many ecosystems and are the most “standard” solution to the issue. We are asking if the Python community would consider adding “\z” as a synonym for “\Z” in their regex lexicon.
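
For context, here is a minimal sketch of how Python's existing anchors behave, using only the standard `re` module. The sample string is an illustrative assumption, not something taken from the proposal.

```python
import re

text = "abc\n"  # illustrative input with a trailing newline

# `$` also matches just before a trailing newline, so this search succeeds.
dollar = re.search(r"^abc$", text)

# Python's `\Z` matches only at the very end of the string, so this search fails.
strict = re.search(r"\Aabc\Z", text)

print(dollar is not None)  # True
print(strict is not None)  # False
```

In Python, `\Z` already has the strict "end of string" meaning that other engines spell `\z`, which is why the request is for a synonym rather than a behavior change.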

Has this already been discussed elsewhere?

I have already discussed this feature proposal on Discourse

Links to previous discussion of this feature:

https://discuss.python.org/t/proposal-add-z-as-a-synonym-for-z-in-python-res-for-standardization/90378/1

Linked PRs

mstevenbrown added the type-feature label May 2, 2025
@david-a-wheeler

+1, I'm very supportive!

Currently it's easy to make a mistake using regexes when checking inputs for security; see Correctly Using Regular Expressions. Many platforms interpret \Z as "optional newline followed by end-of-string" including Java, .NET/C#, Perl, PCRE, PHP (using PCRE), and Ruby. Many developers aren't aware that regex syntax varies between programming languages, and LLMs "borrow" regexes from other languages without necessarily fixing them. If everyone supported \A for beginning and \z for the end, we could at least have unified regular expression syntax for this common case (beginning of string .. end of string).

Thanks!
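
The input-validation pitfall described above can be sketched roughly as follows. The username rule and both function names are hypothetical, chosen only for illustration; the calls themselves (`re.search`, `re.fullmatch`) are from the standard library.

```python
import re

# Hypothetical rule for illustration: 3 to 16 lowercase letters, digits, or underscores.
USERNAME_BODY = r"[a-z0-9_]{3,16}"


def is_valid_username_loose(value: str) -> bool:
    """Pitfall: `$` tolerates one trailing newline after the username."""
    return re.search(r"^" + USERNAME_BODY + r"$", value) is not None


def is_valid_username(value: str) -> bool:
    """Stricter check: fullmatch requires the pattern to cover the entire string."""
    return re.fullmatch(USERNAME_BODY, value) is not None


print(is_valid_username_loose("alice\n"))  # True: the newline slips through
print(is_valid_username("alice\n"))        # False
```

Using `re.fullmatch` sidesteps the anchor question entirely in Python, but it does not help when a pattern is copied to or from another language, which is the portability concern raised in this thread.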

serhiy-storchaka self-assigned this May 2, 2025
@collinfunk
Contributor

collinfunk commented May 2, 2025

Hi @serhiy-storchaka, before you spend too much time implementing this I think you should be aware that there is some discussion on the glibc mailing list [1]. Because it is a proposed standardization, it is still subject to change. Paul Eggert, a glibc maintainer, wants to standardize \` and \' which is already in glibc and other implementations.

[1] https://inbox.sourceware.org/libc-alpha/d357ebe7-9aef-41f7-98da-5fa891f8064a@cs.ucla.edu/T/#m6dcdfc0466b896647fce7be5569d8d39af797125

@sethmlarson
Contributor

As the author of a blog post about this footgun, I am in complete support of this proposal! Thank you for opening this issue.

@serhiy-storchaka
Member

\` and \' is a terrible option.

  • Since \' usually appears at the end of the regular expression, see what it looks like: r'\`"[^"]+"\''. It is difficult to see where the string ends. Even if you use double quotes, the adjacent ' and " are hard to read: r"\`'[^']+'\'".
  • On GitHub and other programmer communication sites that use backquotes to mark code, it is difficult to write code containing a backquote.
  • There is a historical ban on using backquotes in Python syntax.

Even if \` and \' are standardized, \A is already supported in Python, and \z is supported in many other engines. \A and \z will remain the preferable syntax for these anchors.

@mstevenbrown
Author

> Hi @serhiy-storchaka, before you spend too much time implementing this I think you should be aware that there is some discussion on the glibc mailing list [1]. Because it is a proposed standardization, it is still subject to change. Paul Eggert, a glibc maintainer, wants to standardize \` and \' which is already in glibc and other implementations.
>
> [1] https://inbox.sourceware.org/libc-alpha/d357ebe7-9aef-41f7-98da-5fa891f8064a@cs.ucla.edu/T/#m6dcdfc0466b896647fce7be5569d8d39af797125

The glibc discussion was also started by our committee, as a part of the rationalization effort.

@collinfunk
Contributor

> * On GitHub and other programmer communication sites which use backquotes to mark a code, it is difficult to use for code containing a backquote.

Yes, I experienced that in typing my original message. 😄

Anyways, I don't have super strong opinions. I just wanted to make sure the disagreement was known and considered. Thanks!

serhiy-storchaka added a commit that referenced this issue May 3, 2025
…133314)

\Z was an error inherited from PCRE 0.95. It was fixed in PCRE 2.0.
In other engines, \Z means not “anchor at string end”, but
“anchor before optional newline at string end”.

\z means “anchor at string end” in most RE engines.
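
A minimal sketch of how calling code might cope with interpreters that do or do not accept `\z`. The helper name `end_anchor` is hypothetical, and the approach assumes that an interpreter without the change rejects `\z` as a bad escape (raising `re.error`), so it probes at runtime rather than checking a version number.

```python
import re


def end_anchor() -> str:
    """Return r"\\z" if this interpreter's `re` accepts it, else fall back to r"\\Z"."""
    try:
        re.compile(r"\z")
    except re.error:
        # Assumption: interpreters without the synonym reject `\z` as a bad escape.
        return r"\Z"
    return r"\z"


# Either spelling means "only at the very end of the string" in Python.
pattern = re.compile(r"\Aabc" + end_anchor())

print(pattern.search("abc") is not None)    # True
print(pattern.search("abc\n") is not None)  # False
```

The fallback is safe only in Python, where `\Z` and `\z` are meant to be synonyms; in the other engines mentioned in this thread, `\Z` would also accept a trailing newline.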