From 211f83d4a090db288cb3ef20bd98d8db3d670738 Mon Sep 17 00:00:00 2001 From: Petr Viktorin Date: Wed, 21 May 2025 17:34:02 +0200 Subject: [PATCH 1/8] gh-127833: Reword and expand the Notation section Prepare the docs for using the notation used in the `python.gram` file. If we want to sync the two, the meta-syntax should be the same. Also, remove the distinction between lexical and syntactic rules. With f- and t-strings, the line between the two is blurry. --- Doc/reference/introduction.rst | 130 +++++++++++++++++++++++---------- 1 file changed, 93 insertions(+), 37 deletions(-) diff --git a/Doc/reference/introduction.rst b/Doc/reference/introduction.rst index b7b70e6be5a5b7..61dc2937007efb 100644 --- a/Doc/reference/introduction.rst +++ b/Doc/reference/introduction.rst @@ -94,40 +94,96 @@ The descriptions of lexical analysis and syntax use a modified `Backus–Naur form (BNF) `_ grammar notation. This uses the following style of definition: -.. productionlist:: notation - name: `lc_letter` (`lc_letter` | "_")* - lc_letter: "a"..."z" - -The first line says that a ``name`` is an ``lc_letter`` followed by a sequence -of zero or more ``lc_letter``\ s and underscores. An ``lc_letter`` in turn is -any of the single characters ``'a'`` through ``'z'``. (This rule is actually -adhered to for the names defined in lexical and grammar rules in this document.) - -Each rule begins with a name (which is the name defined by the rule) and -``::=``. A vertical bar (``|``) is used to separate alternatives; it is the -least binding operator in this notation. A star (``*``) means zero or more -repetitions of the preceding item; likewise, a plus (``+``) means one or more -repetitions, and a phrase enclosed in square brackets (``[ ]``) means zero or -one occurrences (in other words, the enclosed phrase is optional). The ``*`` -and ``+`` operators bind as tightly as possible; parentheses are used for -grouping. Literal strings are enclosed in quotes. 
White space is only -meaningful to separate tokens. Rules are normally contained on a single line; -rules with many alternatives may be formatted alternatively with each line after -the first beginning with a vertical bar. - -.. index:: lexical definitions, ASCII - -In lexical definitions (as the example above), two more conventions are used: -Two literal characters separated by three dots mean a choice of any single -character in the given (inclusive) range of ASCII characters. A phrase between -angular brackets (``<...>``) gives an informal description of the symbol -defined; e.g., this could be used to describe the notion of 'control character' -if needed. - -Even though the notation used is almost the same, there is a big difference -between the meaning of lexical and syntactic definitions: a lexical definition -operates on the individual characters of the input source, while a syntax -definition operates on the stream of tokens generated by the lexical analysis. -All uses of BNF in the next chapter ("Lexical Analysis") are lexical -definitions; uses in subsequent chapters are syntactic definitions. - +.. grammar-snippet:: + :group: notation + + name: `letter` (`letter` | `digit` | "_")* + letter: "a"..."z" | "A"..."Z" + digit: "0"..."9" + +In this example, the first line says that a ``name`` is a ``letter`` followed +by a sequence of zero or more ``letter``\ s, ``digit``\ s, and underscores. +A ``letter`` in turn is any of the single characters ``'a'`` through +``'z'`` and ``A`` through ``Z``; a ``digit`` is a single character from ``0`` +to ``9``. + +Each rule begins with a name (which identifies the rule that's being defined) +followed by a colon, ``:``. +The definition to the right of the colon uses the following syntax elements: + +* ``name``: A name refers to another rule. + Where possible, it is a link to the rule's definition. + + * ``TOKEN``: An uppercase name refers to a :term:`token`. 
+ For the purposes of grammar definitions, tokens are the same as rules. + +* ``"text"``, ``'text'``: Text in single or double quotes must match literally + (without the quotes). The type of quote is chosen according to the meaning + of ``text``: + + * ``'if'``: A name in single quotes denotes a :ref:`keyword `. + * ``"case"``: A name in double quotes denotes a + :ref:`soft-keyword `. + * ``'@'``: A non-letter symbol in single quotes denotes an + :py:data:`~token.OP` token, that is, a :ref:`delimiter ` or + :ref:`operator `. + +* ``"a"..."z"``: Two literal characters separated by three dots mean a choice + of any single character in the given (inclusive) range of ASCII characters. +* ``<...>``: A phrase between angular brackets gives an informal description + of the matched symbol (for example, ````), + or an abbreviation that is defined in nearby text (for example, ````). +* ``e1 e2``: Items separated only by whitespace denote a sequence. + Here, ``e1`` must be followed by ``e2``. +* ``e1 | e2``: A vertical bar is used to separate alternatives. + It is the least tightly binding operator in this notation. +* ``e*``: A star means zero or more repetitions of the preceding item. +* ``e+``: Likewise, a plus means one or more repetitions. +* ``[e]``: A phrase enclosed in square brackets means zero or + one occurrences. In other words, the enclosed phrase is optional. +* ``e?``: A question mark has exactly the same meaning as square brackets: + the preceding item is optional. +* ``(e)``: Parentheses are used for grouping. + +The unary operators (``*``, ``+``, ``?``) bind as tightly as possible. + +White space is only meaningful to separate tokens. + +Rules are normally contained on a single line, but rules that are too long +may be wrapped: + +.. 
grammar-snippet:: + :group: notation + + literal: `stringliteral` | `bytesliteral` + | `integer` | `floatnumber` | `imagnumber` + +Alternatively, rules may be formatted with the first line ending at the colon, +and each alternative beginning with a vertical bar on a new line. +For example: + + +.. grammar-snippet:: + :group: notation-alt + + literal: + | `stringliteral` + | `bytesliteral` + | `integer` + | `floatnumber` + | `imagnumber` + +This does *not* mean that there is an empty first alternative. + +.. index:: lexical definitions + +.. note:: + + There is some difference between *lexical* and *syntactic* analysis: + the :term:`lexical analyzer` operates on the individual characters of the + input source, while the *parser* (syntactic analyzer) operates on the stream + of :term:`tokens ` generated by the lexical analysis. + However, in some cases the exact boundary between the two phases is a + CPython implementation detail. + + This documentation uses the same BNF grammar for both. From ec90d4066987534c7dbeed7e91aadbac1ff8670b Mon Sep 17 00:00:00 2001 From: Petr Viktorin Date: Wed, 21 May 2025 18:04:23 +0200 Subject: [PATCH 2/8] Consolidate with the Full Grammar intro Co-authored-by: Blaise Pabon --- Doc/reference/grammar.rst | 16 +++++++--------- Doc/reference/introduction.rst | 16 +++++++++++----- 2 files changed, 18 insertions(+), 14 deletions(-) diff --git a/Doc/reference/grammar.rst b/Doc/reference/grammar.rst index b9cca4444c9141..028a847ded556b 100644 --- a/Doc/reference/grammar.rst +++ b/Doc/reference/grammar.rst @@ -8,15 +8,13 @@ used to generate the CPython parser (see :source:`Grammar/python.gram`). The version here omits details related to code generation and error recovery. -The notation is a mixture of `EBNF -`_ -and `PEG `_. 
-In particular, ``&`` followed by a symbol, token or parenthesized -group indicates a positive lookahead (i.e., is required to match but -not consumed), while ``!`` indicates a negative lookahead (i.e., is -required *not* to match). We use the ``|`` separator to mean PEG's -"ordered choice" (written as ``/`` in traditional PEG grammars). See -:pep:`617` for more details on the grammar's syntax. +The notation used here is the same as in the preceding docs, +and is described in the :ref:`notation ` section, +except for a few extra complications: + +* ``&e``: a positive lookahead (that is, ``e`` is required to match but + not consumed) +* ``!e``: a negative lookahead (that is, ``e`` is required *not* to match) .. literalinclude:: ../../Grammar/python.gram :language: peg diff --git a/Doc/reference/introduction.rst b/Doc/reference/introduction.rst index 61dc2937007efb..bde86c0b6654d8 100644 --- a/Doc/reference/introduction.rst +++ b/Doc/reference/introduction.rst @@ -90,9 +90,10 @@ Notation .. index:: BNF, grammar, syntax, notation -The descriptions of lexical analysis and syntax use a modified -`Backus–Naur form (BNF) `_ grammar -notation. This uses the following style of definition: +The descriptions of lexical analysis use a grammar notation that is a mixture +of `EBNF `_ +and `PEG `_. +For example: .. grammar-snippet:: :group: notation @@ -136,7 +137,11 @@ The definition to the right of the colon uses the following syntax elements: * ``e1 e2``: Items separated only by whitespace denote a sequence. Here, ``e1`` must be followed by ``e2``. * ``e1 | e2``: A vertical bar is used to separate alternatives. - It is the least tightly binding operator in this notation. + It denotes PEG's "ordered choice": if ``e1`` matches, ``e2`` is + not considered. + In traditional PEG grammars, this is written as a slash, ``/``, rather than + a vertical bar. + See :pep:`617` for more background and details. * ``e*``: A star means zero or more repetitions of the preceding item. 
* ``e+``: Likewise, a plus means one or more repetitions. * ``[e]``: A phrase enclosed in square brackets means zero or @@ -145,7 +150,8 @@ The definition to the right of the colon uses the following syntax elements: the preceding item is optional. * ``(e)``: Parentheses are used for grouping. -The unary operators (``*``, ``+``, ``?``) bind as tightly as possible. +The unary operators (``*``, ``+``, ``?``) bind as tightly as possible; +the vertical bar (``|``) binds most loosely. White space is only meaningful to separate tokens. From 3f3a0dbd33ca1c14c0cc754f4b75c880a0be290b Mon Sep 17 00:00:00 2001 From: Petr Viktorin Date: Wed, 21 May 2025 18:10:49 +0200 Subject: [PATCH 3/8] Don't link the examples --- Doc/reference/introduction.rst | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/Doc/reference/introduction.rst b/Doc/reference/introduction.rst index bde86c0b6654d8..16d70aa9d2649c 100644 --- a/Doc/reference/introduction.rst +++ b/Doc/reference/introduction.rst @@ -161,8 +161,8 @@ may be wrapped: .. grammar-snippet:: :group: notation - literal: `stringliteral` | `bytesliteral` - | `integer` | `floatnumber` | `imagnumber` + literal: stringliteral | bytesliteral + | integer | floatnumber | imagnumber Alternatively, rules may be formatted with the first line ending at the colon, and each alternative beginning with a vertical bar on a new line. @@ -173,11 +173,11 @@ For example: :group: notation-alt literal: - | `stringliteral` - | `bytesliteral` - | `integer` - | `floatnumber` - | `imagnumber` + | stringliteral + | bytesliteral + | integer + | floatnumber + | imagnumber This does *not* mean that there is an empty first alternative. 
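As an illustration only, outside the patch itself: the ordered-choice reading of ``|`` that this series documents (if ``e1`` matches, ``e2`` is not considered) can be sketched in Python. The helper name ``ordered_choice`` and the use of regular expressions as stand-ins for grammar items are assumptions made for this sketch. It is not how CPython's parser is implemented.

```python
import re


def ordered_choice(alternatives, text, pos=0):
    """Try each alternative in order at ``pos``.

    Returns the end index of the first alternative that matches,
    or None if no alternative matches. Hypothetical helper for
    illustration; not part of CPython or of the patch.
    """
    for pattern in alternatives:
        match = pattern.match(text, pos)
        if match is not None:
            # The first match wins: later alternatives are not considered.
            return match.end()
    return None


# The same two alternatives, listed in different orders.
short_first = [re.compile("a"), re.compile("ab")]
long_first = [re.compile("ab"), re.compile("a")]

print(ordered_choice(short_first, "ab"))  # 1: "a" matched, "ab" never tried
print(ordered_choice(long_first, "ab"))   # 2: "ab" matched first
```

Swapping the order of the alternatives changes the result for the same input, which is why the order in which alternatives are written matters in this notation.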
From 160ff4208066b8a2f619e0c848539c90f4057b7b Mon Sep 17 00:00:00 2001 From: Petr Viktorin Date: Wed, 21 May 2025 18:17:42 +0200 Subject: [PATCH 4/8] Add Cut --- Doc/reference/grammar.rst | 1 + 1 file changed, 1 insertion(+) diff --git a/Doc/reference/grammar.rst b/Doc/reference/grammar.rst index 028a847ded556b..534bc742cf57c2 100644 --- a/Doc/reference/grammar.rst +++ b/Doc/reference/grammar.rst @@ -15,6 +15,7 @@ except for a few extra complications: * ``&e``: a positive lookahead (that is, ``e`` is required to match but not consumed) * ``!e``: a negative lookahead (that is, ``e`` is required *not* to match) +* ``~`` ("cut"): commit to the current alternative, even if it fails to parse .. literalinclude:: ../../Grammar/python.gram :language: peg From 231e2bae7e1de58f96029c015e865326b66c8f81 Mon Sep 17 00:00:00 2001 From: Petr Viktorin Date: Wed, 28 May 2025 18:07:33 +0200 Subject: [PATCH 5/8] Note practical difference between syntactic & lexical definitions Co-authored-by: Blaise Pabon Co-authored-by: Adam Turner <9087854+AA-Turner@users.noreply.github.com> --- Doc/reference/introduction.rst | 26 ++++++++++++++++---------- 1 file changed, 16 insertions(+), 10 deletions(-) diff --git a/Doc/reference/introduction.rst b/Doc/reference/introduction.rst index 16d70aa9d2649c..0699fff0b1a016 100644 --- a/Doc/reference/introduction.rst +++ b/Doc/reference/introduction.rst @@ -183,13 +183,19 @@ This does *not* mean that there is an empty first alternative. .. index:: lexical definitions -.. note:: - - There is some difference between *lexical* and *syntactic* analysis: - the :term:`lexical analyzer` operates on the individual characters of the - input source, while the *parser* (syntactic analyzer) operates on the stream - of :term:`tokens ` generated by the lexical analysis. - However, in some cases the exact boundary between the two phases is a - CPython implementation detail. - - This documentation uses the same BNF grammar for both. 
+There is some difference between *lexical* and *syntactic* analysis: +the :term:`lexical analyzer` operates on the individual characters of the +input source, while the *parser* (syntactic analyzer) operates on the stream +of :term:`tokens ` generated by the lexical analysis. +However, in some cases the exact boundary between the two phases is a +CPython implementation detail. + +The practical difference between the two is that in *lexical* definitions, +all whitespace is significant. +The lexical analyzer :ref:`discards ` all whitespace that is not +converted to tokens like :data:`token.INDENT` or :data:`~token.NEWLINE`. +*Syntactic* definitions then use these tokens, rather than source characters. + +This documentation uses the same BNF grammar for both styles of definitions. +All uses of BNF in the next chapter (“Lexical Analysis”) are lexical definitions; +uses in subsequent chapters are syntactic definitions. From fa17c00cd75cfa9e1f5b61082eafcb4eac67d8e1 Mon Sep 17 00:00:00 2001 From: Petr Viktorin Date: Tue, 3 Jun 2025 08:43:04 +0200 Subject: [PATCH 6/8] Apply suggestions from code review Co-authored-by: Lysandros Nikolaou Co-authored-by: Colin Marquardt --- Doc/reference/grammar.rst | 3 ++- Doc/reference/introduction.rst | 2 +- 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/Doc/reference/grammar.rst b/Doc/reference/grammar.rst index 534bc742cf57c2..55c148801d8559 100644 --- a/Doc/reference/grammar.rst +++ b/Doc/reference/grammar.rst @@ -15,7 +15,8 @@ except for a few extra complications: * ``&e``: a positive lookahead (that is, ``e`` is required to match but not consumed) * ``!e``: a negative lookahead (that is, ``e`` is required *not* to match) -* ``~`` ("cut"): commit to the current alternative, even if it fails to parse +* ``~`` ("cut"): commit to the current alternative and fail the rule + even if this fails to parse .. 
literalinclude:: ../../Grammar/python.gram :language: peg diff --git a/Doc/reference/introduction.rst b/Doc/reference/introduction.rst index 0699fff0b1a016..0d54237b9c0072 100644 --- a/Doc/reference/introduction.rst +++ b/Doc/reference/introduction.rst @@ -197,5 +197,5 @@ converted to tokens like :data:`token.INDENT` or :data:`~token.NEWLINE`. *Syntactic* definitions then use these tokens, rather than source characters. This documentation uses the same BNF grammar for both styles of definitions. -All uses of BNF in the next chapter (“Lexical Analysis”) are lexical definitions; +All uses of BNF in the next chapter (:ref:`lexical`) are lexical definitions; uses in subsequent chapters are syntactic definitions. From 327b90de33b32b0e0a2939875c2136c1eeeeedca Mon Sep 17 00:00:00 2001 From: Petr Viktorin Date: Wed, 4 Jun 2025 15:49:20 +0200 Subject: [PATCH 7/8] Constrain `"a"..."z"` and `<...>` to lexical definitions --- Doc/reference/introduction.rst | 19 ++++++++++++++----- 1 file changed, 14 insertions(+), 5 deletions(-) diff --git a/Doc/reference/introduction.rst b/Doc/reference/introduction.rst index 0d54237b9c0072..6bf785d31c9c1e 100644 --- a/Doc/reference/introduction.rst +++ b/Doc/reference/introduction.rst @@ -129,11 +129,6 @@ The definition to the right of the colon uses the following syntax elements: :py:data:`~token.OP` token, that is, a :ref:`delimiter ` or :ref:`operator `. -* ``"a"..."z"``: Two literal characters separated by three dots mean a choice - of any single character in the given (inclusive) range of ASCII characters. -* ``<...>``: A phrase between angular brackets gives an informal description - of the matched symbol (for example, ````), - or an abbreviation that is defined in nearby text (for example, ````). * ``e1 e2``: Items separated only by whitespace denote a sequence. Here, ``e1`` must be followed by ``e2``. * ``e1 | e2``: A vertical bar is used to separate alternatives. 
@@ -149,6 +144,15 @@ The definition to the right of the colon uses the following syntax elements: * ``e?``: A question mark has exactly the same meaning as square brackets: the preceding item is optional. * ``(e)``: Parentheses are used for grouping. +* ``"a"..."z"``: Two literal characters separated by three dots mean a choice + of any single character in the given (inclusive) range of ASCII characters. + This notation is only used in + :ref:`lexical definitions `. +* ``<...>``: A phrase between angular brackets gives an informal description + of the matched symbol (for example, ````), + or an abbreviation that is defined in nearby text (for example, ````). + This notation is only used in + :ref:`lexical definitions `. The unary operators (``*``, ``+``, ``?``) bind as tightly as possible; the vertical bar (``|``) binds most loosely. @@ -183,6 +187,11 @@ This does *not* mean that there is an empty first alternative. .. index:: lexical definitions +.. _notation-lexical-vs-syntactic: + +Lexical and Syntactic definitions +--------------------------------- + There is some difference between *lexical* and *syntactic* analysis: the :term:`lexical analyzer` operates on the individual characters of the input source, while the *parser* (syntactic analyzer) operates on the stream From 3c5caa6b13a45afc0d5c6240ac846bb2cadb62b5 Mon Sep 17 00:00:00 2001 From: Petr Viktorin Date: Sat, 7 Jun 2025 12:30:58 +0200 Subject: [PATCH 8/8] Update Doc/reference/introduction.rst --- Doc/reference/introduction.rst | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/Doc/reference/introduction.rst b/Doc/reference/introduction.rst index 6bf785d31c9c1e..444acac374a690 100644 --- a/Doc/reference/introduction.rst +++ b/Doc/reference/introduction.rst @@ -90,8 +90,9 @@ Notation .. 
index:: BNF, grammar, syntax, notation -The descriptions of lexical analysis use a grammar notation that is a mixture -of `EBNF `_ +The descriptions of lexical analysis and syntax use a grammar notation that +is a mixture of +`EBNF `_ and `PEG `_. For example: