Update grammar.rst and compiler.rst to describe the PEG parser

gvanrossum · gvanrossum · commit 01b60642b5c4 · 2020-07-27T12:25:12.000-07:00
diff --git a/compiler.rst b/compiler.rst
@@ -10,8 +10,8 @@ Abstract
 
 In CPython, the compilation from source code to bytecode involves several steps:
 
-1. Parse source code into a parse tree (:file:`Parser/pgen.c`)
-2. Transform parse tree into an Abstract Syntax Tree (:file:`Python/ast.c`)
+1. Tokenize the source code (:file:`Parser/tokenizer.c`)
+2. Parse the stream of tokens into an Abstract Syntax Tree (:file:`Parser/parser.c`)
 3. Transform AST into a Control Flow Graph (:file:`Python/compile.c`)
 4. Emit bytecode based on the Control Flow Graph (:file:`Python/compile.c`)
 
@@ -23,49 +23,18 @@ in terms of the how the entire system works.  You will most likely need
 to read some source to have an exact understanding of all details.
 
 
-Parse Trees
------------
+Parsing
+-------
 
-Python's parser is an LL(1) parser mostly based off of the
-implementation laid out in the Dragon Book [Aho86]_.
+As of Python 3.9, Python's parser is a PEG parser of a somewhat
+unusual design (since its input is a stream of tokens rather than a
+stream of characters as is more common with PEG parsers).
 
-The grammar file for Python can be found in :file:`Grammar/Grammar` with the
-numeric value of grammar rules stored in :file:`Include/graminit.h`.  The
-list of types of tokens (literal tokens, such as ``:``, numbers, etc.) can
-be found in :file:`Grammar/Tokens` with the numeric value stored in
-:file:`Include/token.h`.  The parse tree is made up
-of ``node *`` structs (as defined in :file:`Include/node.h`).
-
-Querying data from the node structs can be done with the following
-macros (which are all defined in :file:`Include/node.h`):
-
-``CHILD(node *, int)``
-        Returns the nth child of the node using zero-offset indexing
-``RCHILD(node *, int)``
-        Returns the nth child of the node from the right side; use
-        negative numbers!
-``NCH(node *)``
-        Number of children the node has
-``STR(node *)``
-        String representation of the node; e.g., will return ``:`` for a
-        ``COLON`` token
-``TYPE(node *)``
-        The type of node as specified in :file:`Include/graminit.h`
-``REQ(node *, TYPE)``
-        Assert that the node is the type that is expected
-``LINENO(node *)``
-        Retrieve the line number of the source code that led to the
-        creation of the parse rule; defined in :file:`Python/ast.c`
-
-For example, consider the rule for 'while':
-
-.. productionlist::
-   while_stmt: "while" `expression` ":" `suite` : ["else" ":" `suite`]
-
-The node representing this will have ``TYPE(node) == while_stmt`` and
-the number of children can be 4 or 7 depending on whether there is an
-'else' statement.  ``REQ(CHILD(node, 2), COLON)`` can be used to access
-what should be the first ``:`` and require it be an actual ``:`` token.
+The grammar file for Python can be found in
+:file:`Grammar/python.gram`.  The numeric values for literal tokens
+(such as ``:``, numbers, etc.) can be found in :file:`Grammar/Tokens`.
+Various C files, including :file:`Parser/parser.c`, are generated from
+these (see :doc:`grammar`).
 
 
 Abstract Syntax Trees (AST)
@@ -569,10 +538,6 @@ thanks to having to support both classic and new-style classes.
 References
 ----------
 
-.. [Aho86] Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman.
-   `Compilers: Principles, Techniques, and Tools`,
-   https://www.amazon.com/exec/obidos/tg/detail/-/0201100886/104-0162389-6419108
-
 .. [Wang97]  Daniel C. Wang, Andrew W. Appel, Jeff L. Korn, and Chris
    S. Serra.  `The Zephyr Abstract Syntax Description Language.`_
    In Proceedings of the Conference on Domain-Specific Languages, pp.
diff --git a/grammar.rst b/grammar.rst
@@ -7,52 +7,44 @@ Abstract
 --------
 
 There's more to changing Python's grammar than editing
-:file:`Grammar/Grammar`.  This document aims to be a
-checklist of places that must also be fixed.
+:file:`Grammar/python.gram`.  Here's a checklist.
 
-It is probably incomplete.  If you see omissions,  submit a bug or patch.
-
-This document is not intended to be an instruction manual on Python
-grammar hacking, for several reasons.
-
-
-Rationale
----------
-
-People are getting this wrong all the time; it took well over a
-year before someone `noticed <https://bugs.python.org/issue676521>`_
-that adding the floor division
-operator (``//``) broke the :mod:`parser` module.
+NOTE: These instructions are for Python 3.9 and beyond.  Earlier
+versions use a different parser technology.  You probably shouldn't
+try to change the grammar of earlier Python versions, but if you
+really want to, use GitHub to track down the earlier version of this
+file in the devguide.  (Python 3.9 itself actually supports both
+parsers; the old parser can be invoked by passing ``-X oldparser``.)
 
 
 Checklist
 ---------
 
 Note: sometimes things mysteriously don't work.  Before giving up, try ``make clean``.
 
-* :file:`Grammar/Grammar`: OK, you'd probably worked this one out. :-)  After changing
-  it, run ``make regen-grammar``, to regenerate :file:`Include/graminit.h` and
-  :file:`Python/graminit.c`.  (This runs Python's parser generator, ``Python/pgen``).
+* :file:`Grammar/python.gram`: The grammar, with actions that build AST nodes.  After changing
+  it, run ``make regen-pegen`` to regenerate :file:`Parser/parser.c`.
+  (This runs Python's parser generator, ``Tools/peg_generator``.)
 
 * :file:`Grammar/Tokens` is a place for adding new token types.  After
   changing it, run ``make regen-token`` to regenerate :file:`Include/token.h`,
   :file:`Parser/token.c`, :file:`Lib/token.py` and
-  :file:`Doc/library/token-list.inc`.  If you change both ``Grammar`` and ``Tokens``,
-  run ``make regen-tokens`` before ``make regen-grammar``.
+  :file:`Doc/library/token-list.inc`.  If you change both ``python.gram`` and ``Tokens``,
+  run ``make regen-token`` before ``make regen-pegen``.
 
-* :file:`Parser/Python.asdl` may need changes to match the Grammar.  Then run ``make
+* :file:`Parser/Python.asdl` may need changes to match the grammar.  Then run ``make
   regen-ast`` to regenerate :file:`Include/Python-ast.h` and :file:`Python/Python-ast.c`.
 
 * :file:`Parser/tokenizer.c` contains the tokenization code.  This is where you would
   add a new type of comment or string literal, for example.
 
-* :file:`Python/ast.c` will need changes to create the AST objects involved with the
-  Grammar change.
+* :file:`Python/ast.c` will need changes to validate AST objects involved with the
+  grammar change.
 
-* The :doc:`compiler` has its own page.
+* :file:`Python/ast_unparse.c` will need changes to unparse AST objects involved with the
+  grammar change ("unparsing" is used to turn annotations into strings per :pep:`563`).
 
-* The :mod:`parser` module.  Add some of your new syntax to ``test_parser``,
-  bang on :file:`Modules/parsermodule.c` until it passes.
+* The :doc:`compiler` has its own page.
 
 * Add some usage of your new syntax to ``test_grammar.py``.
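
For readers who want to see the pipeline described in the compiler.rst hunk (tokenize, parse to AST, compile to bytecode), here is a minimal sketch from the Python side. It is an illustration, not part of the commit, and it rests on one assumption: the standard-library ``tokenize`` and ``ast`` modules and the built-in ``compile()`` stand in for the C-level stages (:file:`Parser/tokenizer.c`, :file:`Parser/parser.c`, :file:`Python/compile.c`). The ``tokenize`` module is not guaranteed to be the same code as the C tokenizer. The sample source string and the variable names are made up for the example.

```python
import ast
import io
import tokenize

# Hypothetical sample input; any small valid statement would do.
source = "x = 1 + 2\n"

# Step 1 (approximation): turn the source text into a stream of tokens.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
token_names = [tokenize.tok_name[tok.type] for tok in tokens]

# Step 2: parse the source into an Abstract Syntax Tree.
tree = ast.parse(source)

# Steps 3-4: compile the AST to a code object (CFG construction and
# bytecode emission happen inside compile() and are not visible here).
code = compile(tree, filename="<example>", mode="exec")

# Run the resulting bytecode to confirm the round trip worked.
namespace = {}
exec(code, namespace)

print(token_names)
print(ast.dump(tree.body[0]))
print(namespace["x"])
```

After a grammar change, a quick check along these lines (feeding the new syntax to ``ast.parse`` and ``compile``) is one way to confirm that the regenerated parser accepts it, alongside the ``test_grammar.py`` additions in the checklist.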