gh-86519: Add prefixmatch APIs to the re module #31137

Draft pull request: wants to merge 7 commits into base: main
150 changes: 105 additions & 45 deletions Doc/library/re.rst
@@ -831,7 +831,7 @@ Flags
value::

def myfunc(text, flag=re.NOFLAG):
return re.match(text, flag)
return re.search(text, flag)

.. versionadded:: 3.11

@@ -887,8 +887,8 @@ Functions

Compile a regular expression pattern into a :ref:`regular expression object
<re-objects>`, which can be used for matching using its
:func:`~Pattern.match`, :func:`~Pattern.search` and other methods, described
below.
:func:`~Pattern.prefixmatch` (:func:`~Pattern.match`),
:func:`~Pattern.search`, and other methods, described below.

The expression's behaviour can be modified by specifying a *flags* value.
Values can be any of the `flags`_ variables, combined using bitwise OR
@@ -897,11 +897,11 @@ Functions
The sequence ::

prog = re.compile(pattern)
result = prog.match(string)
result = prog.search(string)

is equivalent to ::

result = re.match(pattern, string)
result = re.search(pattern, string)

but using :func:`re.compile` and saving the resulting regular expression
object for reuse is more efficient when the expression will be used several
@@ -928,14 +928,15 @@ Functions


.. function:: match(pattern, string, flags=0)
.. function:: prefixmatch(pattern, string, flags=0)

If zero or more characters at the beginning of *string* match the regular
expression *pattern*, return a corresponding :class:`~re.Match`. Return
``None`` if the string does not match the pattern; note that this is
different from a zero-length match.

Note that even in :const:`MULTILINE` mode, :func:`re.match` will only match
at the beginning of the string and not at the beginning of each line.
Note that even in :const:`MULTILINE` mode, this will only match at the
beginning of the string and not at the beginning of each line.

If you want to locate a match anywhere in *string*, use :func:`search`
instead (see also :ref:`search-vs-match`).
@@ -944,6 +945,18 @@ Functions
Values can be any of the `flags`_ variables, combined using bitwise OR
(the ``|`` operator).

This function has two names. It has long been known as :func:`~re.match`,
and that name remains available. Use :func:`~re.match` when you need to
retain compatibility with older Python versions.

.. versionchanged:: next
The alternate name :func:`~re.prefixmatch` was added as a more
descriptive, explicit name for the behavior of :func:`~re.match`. Use it
to express intent more clearly. Most other languages and regular
expression implementations use the term *match* for the behavior that
Python has always called :func:`~re.search`. See
:ref:`prefixmatch-vs-match`.
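
As an illustration of the compatibility advice above, here is a minimal sketch of code that prefers the new name but still runs where it is missing. The module-level alias ``prefixmatch`` and the helper ``starts_with_digits`` are hypothetical names chosen for this example, and the ``getattr`` fallback is an assumption about how a caller might bridge versions, not something this change provides:

```python
import re

# Hypothetical compatibility alias: use re.prefixmatch when this Python
# provides it, otherwise fall back to the long-standing re.match name.
prefixmatch = getattr(re, "prefixmatch", re.match)


def starts_with_digits(text):
    """Return the leading run of digits in text, or None if there is none."""
    m = prefixmatch(r"\d+", text)
    return m.group(0) if m else None


leading = starts_with_digits("2024-01-01")  # digits form a prefix
missing = starts_with_digits("year 2024")   # digits exist, but not as a prefix

# A pattern that can match the empty string still returns a Match object,
# which is different from the None returned when nothing matches.
empty = prefixmatch(r"\d*", "year 2024")
```

Here ``missing`` is ``None`` because the digits are not at the start of the string, while ``empty`` is a zero-length match rather than ``None``.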


.. function:: fullmatch(pattern, string, flags=0)

@@ -1264,23 +1277,42 @@ Regular Expression Objects


.. method:: Pattern.match(string[, pos[, endpos]])
.. method:: Pattern.prefixmatch(string[, pos[, endpos]])

If zero or more characters at the *beginning* of *string* match this regular
expression, return a corresponding :class:`~re.Match`. Return ``None`` if the
string does not match the pattern; note that this is different from a
zero-length match.

Note that even in :const:`MULTILINE` mode, this will only match at the
beginning of the string and not at the beginning of each line.

The optional *pos* and *endpos* parameters have the same meaning as for the
:meth:`~Pattern.search` method. ::

>>> pattern = re.compile("o")
>>> pattern.match("dog") # No match as "o" is not at the start of "dog".
>>> pattern.match("dog", 1) # Match as "o" is the 2nd character of "dog".
>>> pattern.prefixmatch("dog") # No match as "o" is not at the start of "dog".
>>> pattern.prefixmatch("dog", 1) # Match as "o" is the 2nd character of "dog".
<re.Match object; span=(1, 2), match='o'>
>>> pattern.match("dog") # Same as above.
>>> pattern.match("dog", 1) # Same as above.
<re.Match object; span=(1, 2), match='o'>

If you want to locate a match anywhere in *string*, use
:meth:`~Pattern.search` instead (see also :ref:`search-vs-match`).

This method has two names. It has long been known as
:meth:`~Pattern.match`, and that name remains available. Use
:meth:`~Pattern.match` when you need to retain compatibility with older
Python versions.

.. versionchanged:: next
The alternate name :meth:`~Pattern.prefixmatch` was added as a more
descriptive, explicit name for the behavior of :meth:`~Pattern.match`.
Use it to express intent more clearly. Most other languages and regular
expression implementations use the term *match* for the behavior that
Python has always called :meth:`~Pattern.search`. See
:ref:`prefixmatch-vs-match`.
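
The interaction between *pos* and anchoring can be sketched as follows. This is an illustrative example, not part of the change itself: the variable names are hypothetical, and the ``getattr`` fallback is an assumption used so that the snippet also runs on Python versions without :meth:`~Pattern.prefixmatch`:

```python
import re

pattern = re.compile("o")
anchored = re.compile("^o")

# Hypothetical fallback: pick the new method name when it exists.
prefix = getattr(pattern, "prefixmatch", pattern.match)

at_start = prefix("dog")    # "o" is not at index 0, so no match
at_pos = prefix("dog", 1)   # matching is attempted at index 1 only

# pos is not the same as slicing: "^" still refers to the real beginning
# of the string, so an explicit "^" fails when pos is 1.
caret_at_pos = anchored.search("dog", 1)
caret_on_slice = anchored.search("dog"[1:])
```

The prefix-style call anchors at *pos*, whereas a ``'^'`` in the pattern anchors at the real start of the string, which is why ``caret_at_pos`` is ``None`` but ``caret_on_slice`` is a match.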


.. method:: Pattern.fullmatch(string[, pos[, endpos]])

@@ -1368,8 +1400,7 @@ Since :meth:`~Pattern.match` and :meth:`~Pattern.search` return ``None``
when there is no match, you can test whether there was a match with a simple
``if`` statement::

match = re.search(pattern, string)
if match:
if match := re.search(pattern, string):
process(match)

.. class:: Match
@@ -1407,7 +1438,7 @@ when there is no match, you can test whether there was a match with a simple
If a group is contained in a part of the pattern that matched multiple times,
the last match is returned. ::

>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
>>> m = re.search(r"(\w+) (\w+)", "Isaac Newton, physicist")
>>> m.group(0) # The entire match
'Isaac Newton'
>>> m.group(1) # The first parenthesized subgroup.
@@ -1424,7 +1455,7 @@ when there is no match, you can test whether there was a match with a simple

A moderately complicated example::

>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
>>> m = re.search(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
>>> m.group('first_name')
'Malcolm'
>>> m.group('last_name')
@@ -1439,8 +1470,8 @@ when there is no match, you can test whether there was a match with a simple

If a group matches multiple times, only the last match is accessible::

>>> m = re.match(r"(..)+", "a1b2c3") # Matches 3 times.
>>> m.group(1) # Returns only the last match.
>>> m = re.search(r"(..)+", "a1b2c3") # Matches 3 times.
>>> m.group(1) # Returns only the last match.
'c3'


@@ -1449,7 +1480,7 @@ when there is no match, you can test whether there was a match with a simple
This is identical to ``m.group(g)``. This allows easier access to
an individual group from a match::

>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
>>> m = re.search(r"(\w+) (\w+)", "Isaac Newton, physicist")
>>> m[0] # The entire match
'Isaac Newton'
>>> m[1] # The first parenthesized subgroup.
@@ -1459,7 +1490,7 @@ when there is no match, you can test whether there was a match with a simple

Named groups are supported as well::

>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Isaac Newton")
>>> m = re.search(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Isaac Newton")
>>> m['first_name']
'Isaac'
>>> m['last_name']
@@ -1476,15 +1507,15 @@ when there is no match, you can test whether there was a match with a simple

For example::

>>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
>>> m = re.search(r"(\d+)\.(\d+)", "24.1632")
>>> m.groups()
('24', '1632')

If we make the decimal place and everything after it optional, not all groups
might participate in the match. These groups will default to ``None`` unless
the *default* argument is given::

>>> m = re.match(r"(\d+)\.?(\d+)?", "24")
>>> m = re.search(r"(\d+)\.?(\d+)?", "24")
>>> m.groups() # Second group defaults to None.
('24', None)
>>> m.groups('0') # Now, the second group defaults to '0'.
@@ -1497,7 +1528,7 @@ when there is no match, you can test whether there was a match with a simple
the subgroup name. The *default* argument is used for groups that did not
participate in the match; it defaults to ``None``. For example::

>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
>>> m = re.search(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
>>> m.groupdict()
{'first_name': 'Malcolm', 'last_name': 'Reynolds'}

@@ -1603,38 +1634,38 @@ representing the card with that value.
To see if a given string is a valid hand, one could do the following::

>>> valid = re.compile(r"^[a2-9tjqk]{5}$")
>>> displaymatch(valid.match("akt5q")) # Valid.
>>> displaymatch(valid.search("akt5q")) # Valid.
"<Match: 'akt5q', groups=()>"
>>> displaymatch(valid.match("akt5e")) # Invalid.
>>> displaymatch(valid.match("akt")) # Invalid.
>>> displaymatch(valid.match("727ak")) # Valid.
>>> displaymatch(valid.search("akt5e")) # Invalid.
>>> displaymatch(valid.search("akt")) # Invalid.
>>> displaymatch(valid.search("727ak")) # Valid.
"<Match: '727ak', groups=()>"

That last hand, ``"727ak"``, contained a pair, or two of the same valued cards.
To match this with a regular expression, one could use backreferences as such::

>>> pair = re.compile(r".*(.).*\1")
>>> displaymatch(pair.match("717ak")) # Pair of 7s.
>>> pair = re.compile(r"^.*(.).*\1")
>>> displaymatch(pair.search("717ak")) # Pair of 7s.
"<Match: '717', groups=('7',)>"
>>> displaymatch(pair.match("718ak")) # No pairs.
>>> displaymatch(pair.match("354aa")) # Pair of aces.
>>> displaymatch(pair.search("718ak")) # No pairs.
>>> displaymatch(pair.search("354aa")) # Pair of aces.
"<Match: '354aa', groups=('a',)>"

To find out what card the pair consists of, one could use the
:meth:`~Match.group` method of the match object in the following manner::

>>> pair = re.compile(r".*(.).*\1")
>>> pair.match("717ak").group(1)
>>> pair = re.compile(r"^.*(.).*\1")
>>> pair.search("717ak").group(1)
'7'

# Error because re.match() returns None, which doesn't have a group() method:
>>> pair.match("718ak").group(1)
# Error because re.search() returns None, which doesn't have a group() method:
>>> pair.search("718ak").group(1)
Traceback (most recent call last):
File "<pyshell#23>", line 1, in <module>
re.match(r".*(.).*\1", "718ak").group(1)
pair.search("718ak").group(1)
AttributeError: 'NoneType' object has no attribute 'group'

>>> pair.match("354aa").group(1)
>>> pair.search("354aa").group(1)
'a'
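
The ``AttributeError`` shown above can be avoided by checking for ``None`` before calling :meth:`~Match.group`. The following is a minimal sketch under the same anchored pattern; the helper name ``pair_card`` is hypothetical and introduced only for illustration:

```python
import re

# Same anchored pattern as above: "^" makes search() behave like a
# prefix match for this expression.
PAIR = re.compile(r"^.*(.).*\1")


def pair_card(hand):
    """Return the card that forms a pair in hand, or None if there is no pair."""
    m = PAIR.search(hand)
    return m.group(1) if m else None


sevens = pair_card("717ak")
no_pair = pair_card("718ak")
aces = pair_card("354aa")
```

Returning ``None`` keeps the "no pair" case explicit for the caller instead of surfacing it as an exception.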


@@ -1693,16 +1724,17 @@ search() vs. match()

Python offers different primitive operations based on regular expressions:

+ :func:`re.match` checks for a match only at the beginning of the string
+ :func:`re.prefixmatch`, also known under the less explicit name
:func:`re.match`, checks for a match only at the beginning of the string
+ :func:`re.search` checks for a match anywhere in the string
(this is what Perl does by default)
+ :func:`re.fullmatch` checks for entire string to be a match


For example::

>>> re.match("c", "abcdef") # No match
>>> re.search("c", "abcdef") # Match
>>> re.match("c", "abcdef") # No match
>>> re.prefixmatch("c", "abcdef") # No match
>>> re.search("c", "abcdef") # Match
<re.Match object; span=(2, 3), match='c'>
>>> re.fullmatch("p.*n", "python") # Match
<re.Match object; span=(0, 6), match='python'>
@@ -1711,19 +1743,47 @@ For example::
Regular expressions beginning with ``'^'`` can be used with :func:`search` to
restrict the match at the beginning of the string::

>>> re.match("c", "abcdef") # No match
>>> re.search("^c", "abcdef") # No match
>>> re.search("^a", "abcdef") # Match
>>> re.match("c", "abcdef") # No match
>>> re.prefixmatch("c", "abcdef") # No match
>>> re.search("^c", "abcdef") # No match
>>> re.search("^a", "abcdef") # Match
<re.Match object; span=(0, 1), match='a'>

Note however that in :const:`MULTILINE` mode :func:`match` only matches at the
beginning of the string, whereas using :func:`search` with a regular expression
beginning with ``'^'`` will match at the beginning of each line. ::

>>> re.prefixmatch("X", "A\nB\nX", re.MULTILINE) # No match
>>> re.match("X", "A\nB\nX", re.MULTILINE) # No match
>>> re.search("^X", "A\nB\nX", re.MULTILINE) # Match
<re.Match object; span=(4, 5), match='X'>
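
To make the :const:`MULTILINE` distinction concrete, the sketch below compares the anchors side by side. It is an illustrative example only: the variable names are hypothetical, and the ``getattr`` fallback to :func:`re.match` is an assumption so that the snippet runs on Python versions without :func:`re.prefixmatch`:

```python
import re

text = "A\nB\nX"

# Hypothetical fallback for Python versions without re.prefixmatch.
prefixmatch = getattr(re, "prefixmatch", re.match)

# Prefix matching only ever tries index 0, regardless of MULTILINE.
at_string_start = prefixmatch("X", text, re.MULTILINE)

# "^" in MULTILINE mode matches at the start of every line.
at_line_start = re.search("^X", text, re.MULTILINE)

# "\A" matches only at the start of the string, even in MULTILINE mode.
at_absolute_start = re.search(r"\AX", text, re.MULTILINE)

# Every position where "^" can match in MULTILINE mode.
line_starts = [m.start() for m in re.finditer("^", text, re.MULTILINE)]
```

In other words, ``\A`` with :func:`search` behaves like the prefix match here, while ``'^'`` picks up the third line.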

.. _prefixmatch-vs-match:

prefixmatch() vs. match()
^^^^^^^^^^^^^^^^^^^^^^^^^

Why is the :func:`~re.match` function and method name being discouraged in
favor of the longer :func:`~re.prefixmatch` spelling in recent Python versions?

Many other languages have gained regular expression libraries since regular
expressions were added to Python. In the most popular of those, however, the
term *match* in the API means the unanchored behavior that Python provides as
:func:`~re.search`. The plain term *match* can therefore be unclear, when
reading or writing code, to those who are used to other languages and are
unfamiliar with how the Python API diverges from what has otherwise become
the industry norm.

Quoting from the Zen of Python (``python3 -m this``): *"Explicit is better than
implicit"*. Anyone reading the name :func:`~re.prefixmatch` is likely to
understand the intended semantics. The name :func:`~re.match` leaves a seed of
doubt about the intended behavior for anyone not already familiar with this
old Python gotcha.

We **do not** plan to deprecate and remove the older *match* name in this
decade, if ever, as it has been used in code for over 25 years.

.. versionadded:: next

Making a Phonebook
^^^^^^^^^^^^^^^^^^
@@ -1843,19 +1903,19 @@ every backslash (``'\'``) in a regular expression would have to be prefixed with
another one to escape it. For example, the two following lines of code are
functionally identical::

>>> re.match(r"\W(.)\1\W", " ff ")
>>> re.search(r"\W(.)\1\W", " ff ")
<re.Match object; span=(0, 4), match=' ff '>
>>> re.match("\\W(.)\\1\\W", " ff ")
>>> re.search("\\W(.)\\1\\W", " ff ")
<re.Match object; span=(0, 4), match=' ff '>

When one wants to match a literal backslash, it must be escaped in the regular
expression. With raw string notation, this means ``r"\\"``. Without raw string
notation, one must use ``"\\\\"``, making the following lines of code
functionally identical::

>>> re.match(r"\\", r"\\")
>>> re.search(r"\\", r"\\")
<re.Match object; span=(0, 1), match='\\'>
>>> re.match("\\\\", r"\\")
>>> re.search("\\\\", r"\\")
<re.Match object; span=(0, 1), match='\\'>


13 changes: 13 additions & 0 deletions Doc/whatsnew/3.14.rst
@@ -1056,6 +1056,19 @@ pydoc
(Contributed by Jelle Zijlstra in :gh:`101552`.)


re
--

* :func:`re.prefixmatch` and a corresponding :meth:`~re.Pattern.prefixmatch`
have been added as alternate, more explicit names for the existing
:func:`re.match` and :meth:`~re.Pattern.match` APIs. They are intended to
reduce confusion about what *match* means, following the Zen of Python's
*"Explicit is better than implicit"* mantra. Most regular expression
libraries in other languages use an API named *match* to mean what Python
has always called *search*.
(Contributed by Gregory P. Smith in :gh:`86519`.)


ssl
---
