From 211f83d4a090db288cb3ef20bd98d8db3d670738 Mon Sep 17 00:00:00 2001
From: Petr Viktorin <encukou@gmail.com>
Date: Wed, 21 May 2025 17:34:02 +0200
Subject: [PATCH 1/8] gh-127833: Reword and expand the Notation section

Prepare the docs for adopting the notation of the `python.gram`
file. If we want to sync the two, the meta-syntax should be the same.

Also, remove the distinction between lexical and syntactic rules.
With f- and t-strings, the line between the two is blurry.
---
 Doc/reference/introduction.rst | 130 +++++++++++++++++++++++----------
 1 file changed, 93 insertions(+), 37 deletions(-)
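Reviewer illustration (ignored when the patch is applied): the example
``name`` rule added to the Notation section below corresponds roughly to a
regular expression. This is a minimal sketch only; `NAME_RE` and `is_name`
are hypothetical names that do not appear in the docs or in CPython.

```python
import re

# Hypothetical mirror of the documented example rules:
#   name:   letter (letter | digit | "_")*
#   letter: "a"..."z" | "A"..."Z"
#   digit:  "0"..."9"
NAME_RE = re.compile(r"[a-zA-Z][a-zA-Z0-9_]*")


def is_name(text: str) -> bool:
    """Return True if the whole string matches the example ``name`` rule."""
    return NAME_RE.fullmatch(text) is not None
```

Under this sketch, `is_name("abc_1")` is true, while `is_name("_abc")` and
`is_name("1abc")` are false, because the rule requires a leading letter.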

diff --git a/Doc/reference/introduction.rst b/Doc/reference/introduction.rst
index b7b70e6be5a5b7..61dc2937007efb 100644
--- a/Doc/reference/introduction.rst
+++ b/Doc/reference/introduction.rst
@@ -94,40 +94,96 @@ The descriptions of lexical analysis and syntax use a modified
 `Backus–Naur form (BNF) <https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form>`_ grammar
 notation.  This uses the following style of definition:
 
-.. productionlist:: notation
-   name: `lc_letter` (`lc_letter` | "_")*
-   lc_letter: "a"..."z"
-
-The first line says that a ``name`` is an ``lc_letter`` followed by a sequence
-of zero or more ``lc_letter``\ s and underscores.  An ``lc_letter`` in turn is
-any of the single characters ``'a'`` through ``'z'``.  (This rule is actually
-adhered to for the names defined in lexical and grammar rules in this document.)
-
-Each rule begins with a name (which is the name defined by the rule) and
-``::=``.  A vertical bar (``|``) is used to separate alternatives; it is the
-least binding operator in this notation.  A star (``*``) means zero or more
-repetitions of the preceding item; likewise, a plus (``+``) means one or more
-repetitions, and a phrase enclosed in square brackets (``[ ]``) means zero or
-one occurrences (in other words, the enclosed phrase is optional).  The ``*``
-and ``+`` operators bind as tightly as possible; parentheses are used for
-grouping.  Literal strings are enclosed in quotes.  White space is only
-meaningful to separate tokens. Rules are normally contained on a single line;
-rules with many alternatives may be formatted alternatively with each line after
-the first beginning with a vertical bar.
-
-.. index:: lexical definitions, ASCII
-
-In lexical definitions (as the example above), two more conventions are used:
-Two literal characters separated by three dots mean a choice of any single
-character in the given (inclusive) range of ASCII characters.  A phrase between
-angular brackets (``<...>``) gives an informal description of the symbol
-defined; e.g., this could be used to describe the notion of 'control character'
-if needed.
-
-Even though the notation used is almost the same, there is a big difference
-between the meaning of lexical and syntactic definitions: a lexical definition
-operates on the individual characters of the input source, while a syntax
-definition operates on the stream of tokens generated by the lexical analysis.
-All uses of BNF in the next chapter ("Lexical Analysis") are lexical
-definitions; uses in subsequent chapters are syntactic definitions.
-
+.. grammar-snippet::
+   :group: notation
+
+   name:   `letter` (`letter` | `digit` | "_")*
+   letter: "a"..."z" | "A"..."Z"
+   digit:  "0"..."9"
+
+In this example, the first line says that a ``name`` is a ``letter`` followed
+by a sequence of zero or more ``letter``\ s, ``digit``\ s, and underscores.
+A ``letter`` in turn is any of the single characters ``'a'`` through
+``'z'`` or ``'A'`` through ``'Z'``; a ``digit`` is a single character from
+``'0'`` to ``'9'``.
+
+Each rule begins with a name (which identifies the rule that's being defined)
+followed by a colon, ``:``.
+The definition to the right of the colon uses the following syntax elements:
+
+* ``name``: A name refers to another rule.
+  Where possible, it is a link to the rule's definition.
+
+  * ``TOKEN``: An uppercase name refers to a :term:`token`.
+    For the purposes of grammar definitions, tokens are the same as rules.
+
+* ``"text"``, ``'text'``: Text in single or double quotes must match literally
+  (without the quotes). The type of quote is chosen according to the meaning
+  of ``text``:
+
+  * ``'if'``: A name in single quotes denotes a :ref:`keyword <keywords>`.
+  * ``"case"``: A name in double quotes denotes a
+    :ref:`soft-keyword <soft-keywords>`.
+  * ``'@'``: A non-letter symbol in single quotes denotes an
+    :py:data:`~token.OP` token, that is, a :ref:`delimiter <delimiters>` or
+    :ref:`operator <operators>`.
+
+* ``"a"..."z"``: Two literal characters separated by three dots mean a choice
+  of any single character in the given (inclusive) range of ASCII characters.
+* ``<...>``: A phrase between angular brackets gives an informal description
+  of the matched symbol (for example, ``<any ASCII character except "\">``),
+  or an abbreviation that is defined in nearby text (for example, ``<Lu>``).
+* ``e1 e2``: Items separated only by whitespace denote a sequence.
+  Here, ``e1`` must be followed by ``e2``.
+* ``e1 | e2``: A vertical bar is used to separate alternatives.
+  It is the least tightly binding operator in this notation.
+* ``e*``: A star means zero or more repetitions of the preceding item.
+* ``e+``: Likewise, a plus means one or more repetitions.
+* ``[e]``: A phrase enclosed in square brackets means zero or
+  one occurrences. In other words, the enclosed phrase is optional.
+* ``e?``: A question mark has exactly the same meaning as square brackets:
+  the preceding item is optional.
+* ``(e)``: Parentheses are used for grouping.
+
+The unary operators (``*``, ``+``, ``?``) bind as tightly as possible.
+
+White space is only meaningful to separate tokens.
+
+Rules are normally contained on a single line, but rules that are too long
+may be wrapped:
+
+.. grammar-snippet::
+   :group: notation
+
+   literal: `stringliteral` | `bytesliteral`
+            | `integer` | `floatnumber` | `imagnumber`
+
+Alternatively, rules may be formatted with the first line ending at the colon,
+and each alternative beginning with a vertical bar on a new line.
+For example:
+
+
+.. grammar-snippet::
+   :group: notation-alt
+
+   literal:
+      | `stringliteral`
+      | `bytesliteral`
+      | `integer`
+      | `floatnumber`
+      | `imagnumber`
+
+This does *not* mean that there is an empty first alternative.
+
+.. index:: lexical definitions
+
+.. note::
+
+   There is some difference between *lexical* and *syntactic* analysis:
+   the :term:`lexical analyzer` operates on the individual characters of the
+   input source, while the *parser* (syntactic analyzer) operates on the stream
+   of :term:`tokens <token>` generated by the lexical analysis.
+   However, in some cases the exact boundary between the two phases is a
+   CPython implementation detail.
+
+   This documentation uses the same BNF grammar for both.

From ec90d4066987534c7dbeed7e91aadbac1ff8670b Mon Sep 17 00:00:00 2001
From: Petr Viktorin <encukou@gmail.com>
Date: Wed, 21 May 2025 18:04:23 +0200
Subject: [PATCH 2/8] Consolidate with the Full Grammar intro

Co-authored-by: Blaise Pabon <blaise@gmail.com>
---
 Doc/reference/grammar.rst      | 16 +++++++---------
 Doc/reference/introduction.rst | 16 +++++++++++-----
 2 files changed, 18 insertions(+), 14 deletions(-)
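Reviewer illustration (ignored when the patch is applied): the "ordered
choice" wording this patch adds for ``e1 | e2`` can be sketched in a few
lines of Python. This is a toy matcher over literal strings, not how the
CPython parser is implemented; `ordered_choice`, `short_first` and
`long_first` are hypothetical names.

```python
def ordered_choice(*alternatives):
    """Build a matcher that tries literal alternatives in order (PEG-style)."""

    def match(text, pos=0):
        for alt in alternatives:
            if text.startswith(alt, pos):
                # The first alternative that matches wins;
                # later alternatives are never considered.
                return pos + len(alt)
        return None  # no alternative matched

    return match


# The same two alternatives, listed in different orders.
short_first = ordered_choice("=", "==")
long_first = ordered_choice("==", "=")
```

With the input `"=="`, `short_first` stops after one character because
`"="` already matched, while `long_first` consumes both characters. That
order sensitivity is what distinguishes ordered choice from the unordered
alternation of classic BNF.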

diff --git a/Doc/reference/grammar.rst b/Doc/reference/grammar.rst
index b9cca4444c9141..028a847ded556b 100644
--- a/Doc/reference/grammar.rst
+++ b/Doc/reference/grammar.rst
@@ -8,15 +8,13 @@ used to generate the CPython parser (see :source:`Grammar/python.gram`).
 The version here omits details related to code generation and
 error recovery.
 
-The notation is a mixture of `EBNF
-<https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form>`_
-and `PEG <https://en.wikipedia.org/wiki/Parsing_expression_grammar>`_.
-In particular, ``&`` followed by a symbol, token or parenthesized
-group indicates a positive lookahead (i.e., is required to match but
-not consumed), while ``!`` indicates a negative lookahead (i.e., is
-required *not* to match).  We use the ``|`` separator to mean PEG's
-"ordered choice" (written as ``/`` in traditional PEG grammars). See
-:pep:`617` for more details on the grammar's syntax.
+The notation used here is the same as in the preceding docs,
+and is described in the :ref:`notation <notation>` section,
+except for a few extra complications:
+
+* ``&e``: a positive lookahead (that is, ``e`` is required to match but
+  not consumed)
+* ``!e``: a negative lookahead (that is, ``e`` is required *not* to match)
 
 .. literalinclude:: ../../Grammar/python.gram
   :language: peg
diff --git a/Doc/reference/introduction.rst b/Doc/reference/introduction.rst
index 61dc2937007efb..bde86c0b6654d8 100644
--- a/Doc/reference/introduction.rst
+++ b/Doc/reference/introduction.rst
@@ -90,9 +90,10 @@ Notation
 
 .. index:: BNF, grammar, syntax, notation
 
-The descriptions of lexical analysis and syntax use a modified
-`Backus–Naur form (BNF) <https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form>`_ grammar
-notation.  This uses the following style of definition:
+The descriptions of lexical analysis use a grammar notation that is a mixture
+of `EBNF <https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form>`_
+and `PEG <https://en.wikipedia.org/wiki/Parsing_expression_grammar>`_.
+For example:
 
 .. grammar-snippet::
    :group: notation
@@ -136,7 +137,11 @@ The definition to the right of the colon uses the following syntax elements:
 * ``e1 e2``: Items separated only by whitespace denote a sequence.
   Here, ``e1`` must be followed by ``e2``.
 * ``e1 | e2``: A vertical bar is used to separate alternatives.
-  It is the least tightly binding operator in this notation.
+  It denotes PEG's "ordered choice": if ``e1`` matches, ``e2`` is
+  not considered.
+  In traditional PEG grammars, this is written as a slash, ``/``, rather than
+  a vertical bar.
+  See :pep:`617` for more background and details.
 * ``e*``: A star means zero or more repetitions of the preceding item.
 * ``e+``: Likewise, a plus means one or more repetitions.
 * ``[e]``: A phrase enclosed in square brackets means zero or
@@ -145,7 +150,8 @@ The definition to the right of the colon uses the following syntax elements:
   the preceding item is optional.
 * ``(e)``: Parentheses are used for grouping.
 
-The unary operators (``*``, ``+``, ``?``) bind as tightly as possible.
+The unary operators (``*``, ``+``, ``?``) bind as tightly as possible;
+the vertical bar (``|``) binds most loosely.
 
 White space is only meaningful to separate tokens.
 

From 3f3a0dbd33ca1c14c0cc754f4b75c880a0be290b Mon Sep 17 00:00:00 2001
From: Petr Viktorin <encukou@gmail.com>
Date: Wed, 21 May 2025 18:10:49 +0200
Subject: [PATCH 3/8] Don't link the examples

---
 Doc/reference/introduction.rst | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/Doc/reference/introduction.rst b/Doc/reference/introduction.rst
index bde86c0b6654d8..16d70aa9d2649c 100644
--- a/Doc/reference/introduction.rst
+++ b/Doc/reference/introduction.rst
@@ -161,8 +161,8 @@ may be wrapped:
 .. grammar-snippet::
    :group: notation
 
-   literal: `stringliteral` | `bytesliteral`
-            | `integer` | `floatnumber` | `imagnumber`
+   literal: stringliteral | bytesliteral
+            | integer | floatnumber | imagnumber
 
 Alternatively, rules may be formatted with the first line ending at the colon,
 and each alternative beginning with a vertical bar on a new line.
@@ -173,11 +173,11 @@ For example:
    :group: notation-alt
 
    literal:
-      | `stringliteral`
-      | `bytesliteral`
-      | `integer`
-      | `floatnumber`
-      | `imagnumber`
+      | stringliteral
+      | bytesliteral
+      | integer
+      | floatnumber
+      | imagnumber
 
 This does *not* mean that there is an empty first alternative.
 

From 160ff4208066b8a2f619e0c848539c90f4057b7b Mon Sep 17 00:00:00 2001
From: Petr Viktorin <encukou@gmail.com>
Date: Wed, 21 May 2025 18:17:42 +0200
Subject: [PATCH 4/8] Add Cut

---
 Doc/reference/grammar.rst | 1 +
 1 file changed, 1 insertion(+)

diff --git a/Doc/reference/grammar.rst b/Doc/reference/grammar.rst
index 028a847ded556b..534bc742cf57c2 100644
--- a/Doc/reference/grammar.rst
+++ b/Doc/reference/grammar.rst
@@ -15,6 +15,7 @@ except for a few extra complications:
 * ``&e``: a positive lookahead (that is, ``e`` is required to match but
   not consumed)
 * ``!e``: a negative lookahead (that is, ``e`` is required *not* to match)
+* ``~`` ("cut"): commit to the current alternative, even if it fails to parse
 
 .. literalinclude:: ../../Grammar/python.gram
   :language: peg

From 231e2bae7e1de58f96029c015e865326b66c8f81 Mon Sep 17 00:00:00 2001
From: Petr Viktorin <encukou@gmail.com>
Date: Wed, 28 May 2025 18:07:33 +0200
Subject: [PATCH 5/8] Note practical difference between syntactic & lexical
 definitions

Co-authored-by: Blaise Pabon <blaise@gmail.com>
Co-authored-by: Adam Turner <9087854+AA-Turner@users.noreply.github.com>
---
 Doc/reference/introduction.rst | 26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/Doc/reference/introduction.rst b/Doc/reference/introduction.rst
index 16d70aa9d2649c..0699fff0b1a016 100644
--- a/Doc/reference/introduction.rst
+++ b/Doc/reference/introduction.rst
@@ -183,13 +183,19 @@ This does *not* mean that there is an empty first alternative.
 
 .. index:: lexical definitions
 
-.. note::
-
-   There is some difference between *lexical* and *syntactic* analysis:
-   the :term:`lexical analyzer` operates on the individual characters of the
-   input source, while the *parser* (syntactic analyzer) operates on the stream
-   of :term:`tokens <token>` generated by the lexical analysis.
-   However, in some cases the exact boundary between the two phases is a
-   CPython implementation detail.
-
-   This documentation uses the same BNF grammar for both.
+There is some difference between *lexical* and *syntactic* analysis:
+the :term:`lexical analyzer` operates on the individual characters of the
+input source, while the *parser* (syntactic analyzer) operates on the stream
+of :term:`tokens <token>` generated by the lexical analysis.
+However, in some cases the exact boundary between the two phases is a
+CPython implementation detail.
+
+The practical difference between the two is that in *lexical* definitions,
+all whitespace is significant.
+The lexical analyzer :ref:`discards <whitespace>` all whitespace that is not
+converted to tokens like :data:`token.INDENT` or :data:`~token.NEWLINE`.
+*Syntactic* definitions then use these tokens, rather than source characters.
+
+This documentation uses the same BNF grammar for both styles of definitions.
+All uses of BNF in the next chapter (“Lexical Analysis”) are lexical definitions;
+uses in subsequent chapters are syntactic definitions.

From fa17c00cd75cfa9e1f5b61082eafcb4eac67d8e1 Mon Sep 17 00:00:00 2001
From: Petr Viktorin <encukou@gmail.com>
Date: Tue, 3 Jun 2025 08:43:04 +0200
Subject: [PATCH 6/8] Apply suggestions from code review

Co-authored-by: Lysandros Nikolaou <lisandrosnik@gmail.com>
Co-authored-by: Colin Marquardt <cmarqu42@gmail.com>
---
 Doc/reference/grammar.rst      | 3 ++-
 Doc/reference/introduction.rst | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/Doc/reference/grammar.rst b/Doc/reference/grammar.rst
index 534bc742cf57c2..55c148801d8559 100644
--- a/Doc/reference/grammar.rst
+++ b/Doc/reference/grammar.rst
@@ -15,7 +15,8 @@ except for a few extra complications:
 * ``&e``: a positive lookahead (that is, ``e`` is required to match but
   not consumed)
 * ``!e``: a negative lookahead (that is, ``e`` is required *not* to match)
-* ``~`` ("cut"): commit to the current alternative, even if it fails to parse
+* ``~`` ("cut"): commit to the current alternative; if it fails to parse,
+  the rule fails without trying the remaining alternatives
 
 .. literalinclude:: ../../Grammar/python.gram
   :language: peg
diff --git a/Doc/reference/introduction.rst b/Doc/reference/introduction.rst
index 0699fff0b1a016..0d54237b9c0072 100644
--- a/Doc/reference/introduction.rst
+++ b/Doc/reference/introduction.rst
@@ -197,5 +197,5 @@ converted to tokens like :data:`token.INDENT` or :data:`~token.NEWLINE`.
 *Syntactic* definitions then use these tokens, rather than source characters.
 
 This documentation uses the same BNF grammar for both styles of definitions.
-All uses of BNF in the next chapter (“Lexical Analysis”) are lexical definitions;
+All uses of BNF in the next chapter (:ref:`lexical`) are lexical definitions;
 uses in subsequent chapters are syntactic definitions.

From 327b90de33b32b0e0a2939875c2136c1eeeeedca Mon Sep 17 00:00:00 2001
From: Petr Viktorin <encukou@gmail.com>
Date: Wed, 4 Jun 2025 15:49:20 +0200
Subject: [PATCH 7/8] Constrain `"a"..."z"` and `<...>` to lexical definitions

---
 Doc/reference/introduction.rst | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/Doc/reference/introduction.rst b/Doc/reference/introduction.rst
index 0d54237b9c0072..6bf785d31c9c1e 100644
--- a/Doc/reference/introduction.rst
+++ b/Doc/reference/introduction.rst
@@ -129,11 +129,6 @@ The definition to the right of the colon uses the following syntax elements:
     :py:data:`~token.OP` token, that is, a :ref:`delimiter <delimiters>` or
     :ref:`operator <operators>`.
 
-* ``"a"..."z"``: Two literal characters separated by three dots mean a choice
-  of any single character in the given (inclusive) range of ASCII characters.
-* ``<...>``: A phrase between angular brackets gives an informal description
-  of the matched symbol (for example, ``<any ASCII character except "\">``),
-  or an abbreviation that is defined in nearby text (for example, ``<Lu>``).
 * ``e1 e2``: Items separated only by whitespace denote a sequence.
   Here, ``e1`` must be followed by ``e2``.
 * ``e1 | e2``: A vertical bar is used to separate alternatives.
@@ -149,6 +144,15 @@ The definition to the right of the colon uses the following syntax elements:
 * ``e?``: A question mark has exactly the same meaning as square brackets:
   the preceding item is optional.
 * ``(e)``: Parentheses are used for grouping.
+* ``"a"..."z"``: Two literal characters separated by three dots mean a choice
+  of any single character in the given (inclusive) range of ASCII characters.
+  This notation is only used in
+  :ref:`lexical definitions <notation-lexical-vs-syntactic>`.
+* ``<...>``: A phrase between angle brackets gives an informal description
+  of the matched symbol (for example, ``<any ASCII character except "\">``),
+  or an abbreviation that is defined in nearby text (for example, ``<Lu>``).
+  This notation is only used in
+  :ref:`lexical definitions <notation-lexical-vs-syntactic>`.
 
 The unary operators (``*``, ``+``, ``?``) bind as tightly as possible;
 the vertical bar (``|``) binds most loosely.
@@ -183,6 +187,11 @@ This does *not* mean that there is an empty first alternative.
 
 .. index:: lexical definitions
 
+.. _notation-lexical-vs-syntactic:
+
+Lexical and Syntactic definitions
+---------------------------------
+
 There is some difference between *lexical* and *syntactic* analysis:
 the :term:`lexical analyzer` operates on the individual characters of the
 input source, while the *parser* (syntactic analyzer) operates on the stream

From 3c5caa6b13a45afc0d5c6240ac846bb2cadb62b5 Mon Sep 17 00:00:00 2001
From: Petr Viktorin <encukou@gmail.com>
Date: Sat, 7 Jun 2025 12:30:58 +0200
Subject: [PATCH 8/8] Update Doc/reference/introduction.rst

---
 Doc/reference/introduction.rst | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/Doc/reference/introduction.rst b/Doc/reference/introduction.rst
index 6bf785d31c9c1e..444acac374a690 100644
--- a/Doc/reference/introduction.rst
+++ b/Doc/reference/introduction.rst
@@ -90,8 +90,9 @@ Notation
 
 .. index:: BNF, grammar, syntax, notation
 
-The descriptions of lexical analysis use a grammar notation that is a mixture
-of `EBNF <https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form>`_
+The descriptions of lexical analysis and syntax use a grammar notation that
+is a mixture of
+`EBNF <https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form>`_
 and `PEG <https://en.wikipedia.org/wiki/Parsing_expression_grammar>`_.
 For example: