
gh-127833: Reword and expand the Notation section #134443

Merged (8 commits) on Jun 9, 2025
18 changes: 9 additions & 9 deletions Doc/reference/grammar.rst
@@ -8,15 +8,15 @@ used to generate the CPython parser (see :source:`Grammar/python.gram`).
The version here omits details related to code generation and
error recovery.

The notation is a mixture of `EBNF
<https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form>`_
and `PEG <https://en.wikipedia.org/wiki/Parsing_expression_grammar>`_.
In particular, ``&`` followed by a symbol, token or parenthesized
group indicates a positive lookahead (i.e., is required to match but
not consumed), while ``!`` indicates a negative lookahead (i.e., is
required *not* to match). We use the ``|`` separator to mean PEG's
"ordered choice" (written as ``/`` in traditional PEG grammars). See
:pep:`617` for more details on the grammar's syntax.
The notation used here is the same as in the preceding docs,
and is described in the :ref:`notation <notation>` section,
except for a few extra complications:

* ``&e``: a positive lookahead (that is, ``e`` is required to match but
not consumed)
* ``!e``: a negative lookahead (that is, ``e`` is required *not* to match)
* ``~`` ("cut"): commit to the current alternative and fail the rule
even if this fails to parse
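As an illustrative sketch (not part of the docs or the parser), the two lookahead operators can be modeled with matcher functions that test the input at a position without advancing it. The helper names here are hypothetical:

```python
# Hypothetical sketch of PEG lookahead semantics on a character stream.
# A matcher returns the new position on success, or None on failure.

def match_digit(text, pos):
    """Match a single ASCII digit at pos."""
    if pos < len(text) and text[pos].isdigit():
        return pos + 1
    return None

def lookahead(matcher, text, pos):
    """&e: succeed iff the matcher matches at pos, consuming nothing."""
    return matcher(text, pos) is not None

def not_lookahead(matcher, text, pos):
    """!e: succeed iff the matcher does NOT match at pos, consuming nothing."""
    return matcher(text, pos) is None

# '&digit' at position 0 of "7x" succeeds; the position is not advanced.
print(lookahead(match_digit, "7x", 0))      # True
# '!digit' at position 1 of "7x" succeeds, because 'x' is not a digit.
print(not_lookahead(match_digit, "7x", 1))  # True
```

In both cases the input position stays where it was: lookaheads only test what comes next, they never consume it.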

.. literalinclude:: ../../Grammar/python.gram
:language: peg
160 changes: 119 additions & 41 deletions Doc/reference/introduction.rst
@@ -90,44 +90,122 @@ Notation

.. index:: BNF, grammar, syntax, notation

The descriptions of lexical analysis and syntax use a modified
`Backus–Naur form (BNF) <https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form>`_ grammar
notation. This uses the following style of definition:

.. productionlist:: notation
name: `lc_letter` (`lc_letter` | "_")*
lc_letter: "a"..."z"

The first line says that a ``name`` is an ``lc_letter`` followed by a sequence
of zero or more ``lc_letter``\ s and underscores. An ``lc_letter`` in turn is
any of the single characters ``'a'`` through ``'z'``. (This rule is actually
adhered to for the names defined in lexical and grammar rules in this document.)

Each rule begins with a name (which is the name defined by the rule) and
``::=``. A vertical bar (``|``) is used to separate alternatives; it is the
least binding operator in this notation. A star (``*``) means zero or more
repetitions of the preceding item; likewise, a plus (``+``) means one or more
repetitions, and a phrase enclosed in square brackets (``[ ]``) means zero or
one occurrences (in other words, the enclosed phrase is optional). The ``*``
and ``+`` operators bind as tightly as possible; parentheses are used for
grouping. Literal strings are enclosed in quotes. White space is only
meaningful to separate tokens. Rules are normally contained on a single line;
rules with many alternatives may be formatted alternatively with each line after
the first beginning with a vertical bar.

.. index:: lexical definitions, ASCII

In lexical definitions (as the example above), two more conventions are used:
Two literal characters separated by three dots mean a choice of any single
character in the given (inclusive) range of ASCII characters. A phrase between
angular brackets (``<...>``) gives an informal description of the symbol
defined; e.g., this could be used to describe the notion of 'control character'
if needed.

Even though the notation used is almost the same, there is a big difference
between the meaning of lexical and syntactic definitions: a lexical definition
operates on the individual characters of the input source, while a syntax
definition operates on the stream of tokens generated by the lexical analysis.
All uses of BNF in the next chapter ("Lexical Analysis") are lexical
definitions; uses in subsequent chapters are syntactic definitions.

The descriptions of lexical analysis and syntax use a grammar notation that
is a mixture of
`EBNF <https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form>`_
and `PEG <https://en.wikipedia.org/wiki/Parsing_expression_grammar>`_.
For example:

.. grammar-snippet::
:group: notation

name: `letter` (`letter` | `digit` | "_")*
letter: "a"..."z" | "A"..."Z"
digit: "0"..."9"

In this example, the first line says that a ``name`` is a ``letter`` followed
by a sequence of zero or more ``letter``\ s, ``digit``\ s, and underscores.
A ``letter`` in turn is any of the single characters ``'a'`` through
``'z'`` and ``'A'`` through ``'Z'``; a ``digit`` is a single character from
``'0'`` to ``'9'``.
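As an aside (not part of the proposed docs), the ``name`` rule above happens to correspond to a simple regular expression, with ``letter`` and ``digit`` expanded inline:

```python
import re

# The grammar rule:  name: letter (letter | digit | "_")*
# expressed as a regular expression.
NAME_RE = re.compile(r"[a-zA-Z][a-zA-Z0-9_]*")

def is_name(s: str) -> bool:
    """Return True if the whole string matches the 'name' rule."""
    return NAME_RE.fullmatch(s) is not None

print(is_name("foo_1"))  # True
print(is_name("1foo"))   # False: must start with a letter
```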

Each rule begins with a name (which identifies the rule that's being defined)
followed by a colon, ``:``.
The definition to the right of the colon uses the following syntax elements:

* ``name``: A name refers to another rule.
Where possible, it is a link to the rule's definition.
[Inline review comment]

Member: What possibilities exist, other than referencing another rule?

Member Author: Rules that aren't defined by formal grammar in the docs, for example strings in https://docs.python.org/3/reference/compound_stmts.html#literal-patterns
I consider these to be issues that #127833 should eventually fix, but until then I'd keep the weasel word in.

Also: tokens are currently unlinked, and many don't have a (lexical) grammar rule, so they should eventually link to prose.

* ``TOKEN``: An uppercase name refers to a :term:`token`.
For the purposes of grammar definitions, tokens are the same as rules.

* ``"text"``, ``'text'``: Text in single or double quotes must match literally
(without the quotes). The type of quote is chosen according to the meaning
of ``text``:

* ``'if'``: A name in single quotes denotes a :ref:`keyword <keywords>`.
* ``"case"``: A name in double quotes denotes a
:ref:`soft-keyword <soft-keywords>`.
* ``'@'``: A non-letter symbol in single quotes denotes an
:py:data:`~token.OP` token, that is, a :ref:`delimiter <delimiters>` or
:ref:`operator <operators>`.

* ``e1 e2``: Items separated only by whitespace denote a sequence.
Here, ``e1`` must be followed by ``e2``.
* ``e1 | e2``: A vertical bar is used to separate alternatives.
It denotes PEG's "ordered choice": if ``e1`` matches, ``e2`` is
not considered.
In traditional PEG grammars, this is written as a slash, ``/``, rather than
a vertical bar.
See :pep:`617` for more background and details.
* ``e*``: A star means zero or more repetitions of the preceding item.
* ``e+``: Likewise, a plus means one or more repetitions.
* ``[e]``: A phrase enclosed in square brackets means zero or
one occurrences. In other words, the enclosed phrase is optional.
* ``e?``: A question mark has exactly the same meaning as square brackets:
the preceding item is optional.
* ``(e)``: Parentheses are used for grouping.
* ``"a"..."z"``: Two literal characters separated by three dots mean a choice
of any single character in the given (inclusive) range of ASCII characters.
This notation is only used in
:ref:`lexical definitions <notation-lexical-vs-syntactic>`.
* ``<...>``: A phrase between angular brackets gives an informal description
of the matched symbol (for example, ``<any ASCII character except "\">``),
or an abbreviation that is defined in nearby text (for example, ``<Lu>``).
This notation is only used in
:ref:`lexical definitions <notation-lexical-vs-syntactic>`.

The unary operators (``*``, ``+``, ``?``) bind as tightly as possible;
the vertical bar (``|``) binds most loosely.
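The "ordered choice" behavior described above can be sketched with tiny parser combinators. This is a hypothetical illustration, not CPython's parser; the helper names are invented:

```python
# Hypothetical sketch of PEG ordered choice.
# A parser returns the new position on success, or None on failure.

def literal(s):
    """Build a parser that matches the literal string s."""
    def parse(text, pos):
        if text.startswith(s, pos):
            return pos + len(s)
        return None
    return parse

def choice(*parsers):
    """e1 | e2 | ...: try alternatives in order, commit to the first match."""
    def parse(text, pos):
        for p in parsers:
            result = p(text, pos)
            if result is not None:
                return result
        return None
    return parse

# 'in' | 'int': ordered choice tries 'in' first and never reconsiders,
# so against "int" only the first two characters are matched.
p = choice(literal("in"), literal("int"))
print(p("int", 0))  # 2 -- 'in' matched; the 'int' alternative was never tried
```

This is the practical difference from the unordered ``|`` of classic BNF: in PEG, once an earlier alternative succeeds, later ones are not considered, so the order of alternatives matters.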

White space is only meaningful to separate tokens.

Rules are normally contained on a single line, but rules that are too long
may be wrapped:

.. grammar-snippet::
:group: notation

literal: stringliteral | bytesliteral
| integer | floatnumber | imagnumber

Alternatively, rules may be formatted with the first line ending at the colon,
and each alternative beginning with a vertical bar on a new line.
For example:


.. grammar-snippet::
:group: notation-alt

literal:
| stringliteral
| bytesliteral
| integer
| floatnumber
| imagnumber

This does *not* mean that there is an empty first alternative.

.. index:: lexical definitions

.. _notation-lexical-vs-syntactic:

Lexical and Syntactic definitions
---------------------------------

There is some difference between *lexical* and *syntactic* analysis:
the :term:`lexical analyzer` operates on the individual characters of the
input source, while the *parser* (syntactic analyzer) operates on the stream
of :term:`tokens <token>` generated by the lexical analysis.
However, in some cases the exact boundary between the two phases is a
CPython implementation detail.

The practical difference between the two is that in *lexical* definitions,
all whitespace is significant.
The lexical analyzer :ref:`discards <whitespace>` all whitespace that is not
converted to tokens like :data:`token.INDENT` or :data:`~token.NEWLINE`.
*Syntactic* definitions then use these tokens, rather than source characters.
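The lexical/syntactic split can be observed directly with the standard-library :mod:`tokenize` module (shown here only as an aside, not as part of the proposed docs): the lexical analyzer turns characters into a token stream, and the whitespace between tokens disappears.

```python
import io
import tokenize

# Tokenize one line of Python source. The spaces around '=' separate
# the tokens but do not appear in the token stream themselves.
src = "name = value_1\n"
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))
```

The output contains NAME, OP, and NEWLINE tokens; it is this stream, not the raw characters, that the syntactic definitions operate on.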

This documentation uses the same BNF grammar for both styles of definitions.
All uses of BNF in the next chapter (:ref:`lexical`) are lexical definitions;
uses in subsequent chapters are syntactic definitions.