gh-127833: Reword and expand the Notation section (GH-134443)

Prepare the docs for using the notation used in the `python.gram`
file. If we want to sync the two, the meta-syntax should be the same.

Link the Full Grammar docs here; keep only a few extras.

Also, remove the distinction between lexical and syntactic rules,
except for whitespace handling.
With f- and t-strings, the line between the two is blurry.

Co-authored-by: Blaise Pabon <blaise@gmail.com>
Co-authored-by: Adam Turner <9087854+AA-Turner@users.noreply.github.com>
Co-authored-by: Lysandros Nikolaou <lisandrosnik@gmail.com>
Co-authored-by: Colin Marquardt <cmarqu42@gmail.com>
Petr Viktorin 2025-06-09 15:50:11 +02:00 committed by GitHub
parent f90483e13a
commit 28d91d06f1
2 changed files with 121 additions and 43 deletions


@ -8,15 +8,15 @@ used to generate the CPython parser (see :source:`Grammar/python.gram`).
The version here omits details related to code generation and
error recovery.
The notation is a mixture of `EBNF
<https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form>`_
and `PEG <https://en.wikipedia.org/wiki/Parsing_expression_grammar>`_.
In particular, ``&`` followed by a symbol, token or parenthesized
group indicates a positive lookahead (i.e., is required to match but
not consumed), while ``!`` indicates a negative lookahead (i.e., is
required *not* to match). We use the ``|`` separator to mean PEG's
"ordered choice" (written as ``/`` in traditional PEG grammars). See
:pep:`617` for more details on the grammar's syntax.
The notation used here is the same as in the preceding docs,
and is described in the :ref:`notation <notation>` section,
except for a few extra complications:
* ``&e``: a positive lookahead (that is, ``e`` is required to match but
not consumed)
* ``!e``: a negative lookahead (that is, ``e`` is required *not* to match)
* ``~`` ("cut"): commit to the current alternative; if the rest of that
alternative fails to parse, the whole rule fails without trying the
remaining alternatives
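The two lookahead operators have close analogues in regular expressions, which can help build intuition. The following is only an illustrative sketch using Python's ``re`` module, not how the CPython parser implements ``&e`` and ``!e``; the sample strings are made up for the example. The cut operator (``~``) is left out because it is not modelled here.

```python
import re

# Analogue of ``&e``: "foo" matches only if "bar" follows,
# and "bar" itself is not consumed.
positive = re.match(r"foo(?=bar)", "foobar")

# Analogue of ``!e``: "foo" matches only if "bar" does *not* follow.
negative_blocked = re.match(r"foo(?!bar)", "foobar")
negative_allowed = re.match(r"foo(?!bar)", "foobaz")

print(positive.end())            # 3: the match stops after "foo"
print(negative_blocked)          # None: "bar" follows, so the match is rejected
print(negative_allowed.group())  # foo
```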
.. literalinclude:: ../../Grammar/python.gram
:language: peg


@ -90,44 +90,122 @@ Notation
.. index:: BNF, grammar, syntax, notation
The descriptions of lexical analysis and syntax use a modified
`Backus–Naur form (BNF) <https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form>`_ grammar
notation. This uses the following style of definition:
The descriptions of lexical analysis and syntax use a grammar notation that
is a mixture of
`EBNF <https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form>`_
and `PEG <https://en.wikipedia.org/wiki/Parsing_expression_grammar>`_.
For example:
.. productionlist:: notation
name: `lc_letter` (`lc_letter` | "_")*
lc_letter: "a"..."z"
.. grammar-snippet::
:group: notation
The first line says that a ``name`` is an ``lc_letter`` followed by a sequence
of zero or more ``lc_letter``\ s and underscores. An ``lc_letter`` in turn is
any of the single characters ``'a'`` through ``'z'``. (This rule is actually
adhered to for the names defined in lexical and grammar rules in this document.)
name: `letter` (`letter` | `digit` | "_")*
letter: "a"..."z" | "A"..."Z"
digit: "0"..."9"
Each rule begins with a name (which is the name defined by the rule) and
``::=``. A vertical bar (``|``) is used to separate alternatives; it is the
least binding operator in this notation. A star (``*``) means zero or more
repetitions of the preceding item; likewise, a plus (``+``) means one or more
repetitions, and a phrase enclosed in square brackets (``[ ]``) means zero or
one occurrences (in other words, the enclosed phrase is optional). The ``*``
and ``+`` operators bind as tightly as possible; parentheses are used for
grouping. Literal strings are enclosed in quotes. White space is only
meaningful to separate tokens. Rules are normally contained on a single line;
rules with many alternatives may be formatted alternatively with each line after
the first beginning with a vertical bar.
In this example, the first line says that a ``name`` is a ``letter`` followed
by a sequence of zero or more ``letter``\ s, ``digit``\ s, and underscores.
A ``letter`` in turn is any of the single characters ``'a'`` through
``'z'`` and ``'A'`` through ``'Z'``; a ``digit`` is a single character from
``'0'`` to ``'9'``.
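As a rough cross-check of the example rules above, the same language can be written as a regular expression. This is a sketch only: the names ``NAME`` and ``is_name`` are hypothetical helpers for this illustration and do not appear in the documentation or in CPython.

```python
import re

# name:   letter (letter | digit | "_")*
# letter: "a"..."z" | "A"..."Z"
# digit:  "0"..."9"
NAME = re.compile(r"[a-zA-Z][a-zA-Z0-9_]*")


def is_name(text: str) -> bool:
    """Return True if the whole string matches the example ``name`` rule."""
    return NAME.fullmatch(text) is not None


print(is_name("lc_letter"))  # True
print(is_name("x9"))         # True
print(is_name("_x"))         # False: the first character must be a letter
print(is_name("9x"))         # False: the first character must be a letter
```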
.. index:: lexical definitions, ASCII
Each rule begins with a name (which identifies the rule that's being defined)
followed by a colon, ``:``.
The definition to the right of the colon uses the following syntax elements:
In lexical definitions (as the example above), two more conventions are used:
Two literal characters separated by three dots mean a choice of any single
character in the given (inclusive) range of ASCII characters. A phrase between
angular brackets (``<...>``) gives an informal description of the symbol
defined; e.g., this could be used to describe the notion of 'control character'
if needed.
* ``name``: A name refers to another rule.
Where possible, it is a link to the rule's definition.
Even though the notation used is almost the same, there is a big difference
between the meaning of lexical and syntactic definitions: a lexical definition
operates on the individual characters of the input source, while a syntax
definition operates on the stream of tokens generated by the lexical analysis.
All uses of BNF in the next chapter ("Lexical Analysis") are lexical
definitions; uses in subsequent chapters are syntactic definitions.
* ``TOKEN``: An uppercase name refers to a :term:`token`.
For the purposes of grammar definitions, tokens are the same as rules.
* ``"text"``, ``'text'``: Text in single or double quotes must match literally
(without the quotes). The type of quote is chosen according to the meaning
of ``text``:
* ``'if'``: A name in single quotes denotes a :ref:`keyword <keywords>`.
* ``"case"``: A name in double quotes denotes a
:ref:`soft-keyword <soft-keywords>`.
* ``'@'``: A non-letter symbol in single quotes denotes an
:py:data:`~token.OP` token, that is, a :ref:`delimiter <delimiters>` or
:ref:`operator <operators>`.
* ``e1 e2``: Items separated only by whitespace denote a sequence.
Here, ``e1`` must be followed by ``e2``.
* ``e1 | e2``: A vertical bar is used to separate alternatives.
It denotes PEG's "ordered choice": if ``e1`` matches, ``e2`` is
not considered.
In traditional PEG grammars, this is written as a slash, ``/``, rather than
a vertical bar.
See :pep:`617` for more background and details.
* ``e*``: A star means zero or more repetitions of the preceding item.
* ``e+``: Likewise, a plus means one or more repetitions.
* ``[e]``: A phrase enclosed in square brackets means zero or
one occurrences. In other words, the enclosed phrase is optional.
* ``e?``: A question mark has exactly the same meaning as square brackets:
the preceding item is optional.
* ``(e)``: Parentheses are used for grouping.
* ``"a"..."z"``: Two literal characters separated by three dots mean a choice
of any single character in the given (inclusive) range of ASCII characters.
This notation is only used in
:ref:`lexical definitions <notation-lexical-vs-syntactic>`.
* ``<...>``: A phrase between angular brackets gives an informal description
of the matched symbol (for example, ``<any ASCII character except "\">``),
or an abbreviation that is defined in nearby text (for example, ``<Lu>``).
This notation is only used in
:ref:`lexical definitions <notation-lexical-vs-syntactic>`.
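The sequence, ordered-choice, repetition, and optional operators listed above can be sketched with a few small parser combinators. This is a minimal illustration, not the CPython parser: all function names (``lit``, ``seq``, ``choice``, ``star``, ``plus``, ``opt``) are hypothetical, and each parser simply returns the position after a match, or ``None`` on failure.

```python
from typing import Callable, Optional

# A parser takes (text, position) and returns the position after the match,
# or None if it does not match at that position.
Parser = Callable[[str, int], Optional[int]]


def lit(s: str) -> Parser:
    """``"text"``: match the literal string s."""
    def parse(text: str, pos: int) -> Optional[int]:
        return pos + len(s) if text.startswith(s, pos) else None
    return parse


def seq(*items: Parser) -> Parser:
    """``e1 e2``: every item must match, one after another."""
    def parse(text: str, pos: int) -> Optional[int]:
        current: Optional[int] = pos
        for item in items:
            current = item(text, current)
            if current is None:
                return None
        return current
    return parse


def choice(*alts: Parser) -> Parser:
    """``e1 | e2``: ordered choice; the first alternative that matches wins."""
    def parse(text: str, pos: int) -> Optional[int]:
        for alt in alts:
            end = alt(text, pos)
            if end is not None:
                return end  # later alternatives are not considered
        return None
    return parse


def star(item: Parser) -> Parser:
    """``e*``: zero or more repetitions (never fails)."""
    def parse(text: str, pos: int) -> Optional[int]:
        while True:
            end = item(text, pos)
            if end is None or end == pos:  # stop on failure or empty match
                return pos
            pos = end
    return parse


def plus(item: Parser) -> Parser:
    """``e+``: one or more repetitions."""
    return seq(item, star(item))


def opt(item: Parser) -> Parser:
    """``[e]`` or ``e?``: zero or one occurrence (never fails)."""
    def parse(text: str, pos: int) -> Optional[int]:
        end = item(text, pos)
        return pos if end is None else end
    return parse


# Ordered choice: "a" matches first, so "ab" is never tried.
print(choice(lit("a"), lit("ab"))("ab", 0))  # 1
print(choice(lit("ab"), lit("a"))("ab", 0))  # 2
```

The last two lines show why the order of alternatives matters in a PEG: swapping them changes how much input is consumed.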
The unary operators (``*``, ``+``, ``?``) bind as tightly as possible;
the vertical bar (``|``) binds most loosely.
White space is only meaningful to separate tokens.
Rules are normally contained on a single line, but rules that are too long
may be wrapped:
.. grammar-snippet::
:group: notation
literal: stringliteral | bytesliteral
| integer | floatnumber | imagnumber
Alternatively, rules may be formatted with the first line ending at the colon,
and each alternative beginning with a vertical bar on a new line.
For example:
.. grammar-snippet::
:group: notation-alt
literal:
| stringliteral
| bytesliteral
| integer
| floatnumber
| imagnumber
This does *not* mean that there is an empty first alternative.
.. index:: lexical definitions
.. _notation-lexical-vs-syntactic:
Lexical and Syntactic definitions
---------------------------------
There is some difference between *lexical* and *syntactic* analysis:
the :term:`lexical analyzer` operates on the individual characters of the
input source, while the *parser* (syntactic analyzer) operates on the stream
of :term:`tokens <token>` generated by the lexical analysis.
However, in some cases the exact boundary between the two phases is a
CPython implementation detail.
The practical difference between the two is that in *lexical* definitions,
all whitespace is significant.
The lexical analyzer :ref:`discards <whitespace>` all whitespace that is not
converted to tokens like :data:`token.INDENT` or :data:`~token.NEWLINE`.
*Syntactic* definitions then use these tokens, rather than source characters.
This documentation uses the same BNF grammar for both styles of definitions.
All uses of BNF in the next chapter (:ref:`lexical`) are lexical definitions;
uses in subsequent chapters are syntactic definitions.
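The whitespace behaviour described in this section can be observed with the standard-library ``tokenize`` module. This is a sketch for illustration only: ``token_names`` is a hypothetical helper, and ``tokenize`` also reports some tokens (such as comments) that are of no interest to the grammar.

```python
import io
import token
import tokenize


def token_names(source: str) -> list[str]:
    """Return the token type names that ``tokenize`` produces for source."""
    readline = io.StringIO(source).readline
    return [token.tok_name[tok.type]
            for tok in tokenize.generate_tokens(readline)]


# Spaces between tokens are discarded: both spellings give the same tokens.
print(token_names("y   =   1\n") == token_names("y = 1\n"))  # True

# Leading whitespace and line ends are converted to INDENT/NEWLINE tokens.
names = token_names("if x:\n    y = 1\n")
print("INDENT" in names, "NEWLINE" in names)  # True True
```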