mirror of https://github.com/python/cpython.git
synced 2025-07-07 19:35:27 +00:00
gh-127833: Reword and expand the Notation section (GH-134443)
Prepare the docs for using the notation used in the `python.gram` file.
If we want to sync the two, the meta-syntax should be the same.

Link the Full Grammar docs here; keep only a few extras.

Also, remove the distinction between lexical and syntactic rules, except
for whitespace handling. With f- and t-strings, the line between the two
is blurry.

Co-authored-by: Blaise Pabon <blaise@gmail.com>
Co-authored-by: Adam Turner <9087854+AA-Turner@users.noreply.github.com>
Co-authored-by: Lysandros Nikolaou <lisandrosnik@gmail.com>
Co-authored-by: Colin Marquardt <cmarqu42@gmail.com>
parent f90483e13a
commit 28d91d06f1
2 changed files with 121 additions and 43 deletions
@@ -8,15 +8,15 @@ used to generate the CPython parser (see :source:`Grammar/python.gram`).
 The version here omits details related to code generation and
 error recovery.

-The notation is a mixture of `EBNF
-<https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form>`_
-and `PEG <https://en.wikipedia.org/wiki/Parsing_expression_grammar>`_.
-In particular, ``&`` followed by a symbol, token or parenthesized
-group indicates a positive lookahead (i.e., is required to match but
-not consumed), while ``!`` indicates a negative lookahead (i.e., is
-required *not* to match). We use the ``|`` separator to mean PEG's
-"ordered choice" (written as ``/`` in traditional PEG grammars). See
-:pep:`617` for more details on the grammar's syntax.
+The notation used here is the same as in the preceding docs,
+and is described in the :ref:`notation <notation>` section,
+except for a few extra complications:
+
+* ``&e``: a positive lookahead (that is, ``e`` is required to match but
+  not consumed)
+* ``!e``: a negative lookahead (that is, ``e`` is required *not* to match)
+* ``~`` ("cut"): commit to the current alternative and fail the rule
+  even if this fails to parse

 .. literalinclude:: ../../Grammar/python.gram
    :language: peg

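The two lookahead operators described in this hunk can be illustrated with a few hand-written matchers. This is only a sketch under stated assumptions: the helper names (``lit``, ``seq``, ``ahead``, ``not_ahead``) are hypothetical and are not part of CPython's parser generator, and the cut operator (``~``) is not modelled here.

```python
# Each matcher takes (text, pos) and returns the new position on success,
# or None on failure. All names are illustrative, not CPython APIs.

def lit(s):
    """Match the literal string s."""
    def parse(text, pos):
        return pos + len(s) if text.startswith(s, pos) else None
    return parse

def seq(*parsers):
    """``e1 e2``: match each parser in turn."""
    def parse(text, pos):
        for p in parsers:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parse

def ahead(p):
    """``&e``: succeed if p matches here, but consume nothing."""
    def parse(text, pos):
        return pos if p(text, pos) is not None else None
    return parse

def not_ahead(p):
    """``!e``: succeed only if p does not match here; consume nothing."""
    def parse(text, pos):
        return pos if p(text, pos) is None else None
    return parse

a_before_b = seq(lit("a"), ahead(lit("b")))          # "a" &"b"
a_not_before_b = seq(lit("a"), not_ahead(lit("b")))  # "a" !"b"

print(a_before_b("ab", 0))      # 1: "a" consumed, "b" only inspected
print(a_before_b("ac", 0))      # None
print(a_not_before_b("ab", 0))  # None
print(a_not_before_b("ac", 0))  # 1
```

In both successful cases the returned position is 1, not 2: the lookahead checks the next input without advancing past it.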
@@ -90,44 +90,122 @@ Notation

 .. index:: BNF, grammar, syntax, notation

-The descriptions of lexical analysis and syntax use a modified
-`Backus–Naur form (BNF) <https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form>`_ grammar
-notation. This uses the following style of definition:
+The descriptions of lexical analysis and syntax use a grammar notation that
+is a mixture of
+`EBNF <https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form>`_
+and `PEG <https://en.wikipedia.org/wiki/Parsing_expression_grammar>`_.
+For example:

-.. productionlist:: notation
-   name: `lc_letter` (`lc_letter` | "_")*
-   lc_letter: "a"..."z"
+.. grammar-snippet::
+   :group: notation

-The first line says that a ``name`` is an ``lc_letter`` followed by a sequence
-of zero or more ``lc_letter``\ s and underscores. An ``lc_letter`` in turn is
-any of the single characters ``'a'`` through ``'z'``. (This rule is actually
-adhered to for the names defined in lexical and grammar rules in this document.)
+   name: `letter` (`letter` | `digit` | "_")*
+   letter: "a"..."z" | "A"..."Z"
+   digit: "0"..."9"

-Each rule begins with a name (which is the name defined by the rule) and
-``::=``. A vertical bar (``|``) is used to separate alternatives; it is the
-least binding operator in this notation. A star (``*``) means zero or more
-repetitions of the preceding item; likewise, a plus (``+``) means one or more
-repetitions, and a phrase enclosed in square brackets (``[ ]``) means zero or
-one occurrences (in other words, the enclosed phrase is optional). The ``*``
-and ``+`` operators bind as tightly as possible; parentheses are used for
-grouping. Literal strings are enclosed in quotes. White space is only
-meaningful to separate tokens. Rules are normally contained on a single line;
-rules with many alternatives may be formatted alternatively with each line after
-the first beginning with a vertical bar.
+In this example, the first line says that a ``name`` is a ``letter`` followed
+by a sequence of zero or more ``letter``\ s, ``digit``\ s, and underscores.
+A ``letter`` in turn is any of the single characters ``'a'`` through
+``'z'`` and ``A`` through ``Z``; a ``digit`` is a single character from ``0``
+to ``9``.

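As a rough cross-check of the example rule above, ``name: letter (letter | digit | "_")*`` can be transcribed into a regular expression. This is a sketch of the example rule only; it is an assumption-laden translation and not the rule Python actually uses for identifiers. ``NAME`` and ``is_name`` are hypothetical names chosen for illustration.

```python
import re

# Assumed translation of the example rule (not CPython's identifier rule):
#   name:   letter (letter | digit | "_")*
#   letter: "a"..."z" | "A"..."Z"
#   digit:  "0"..."9"
NAME = re.compile(r"[a-zA-Z][a-zA-Z0-9_]*")

def is_name(text):
    """Return True if the whole string matches the example ``name`` rule."""
    return NAME.fullmatch(text) is not None

for candidate in ["letter", "lc_letter", "x9_", "_x", "9x", ""]:
    print(f"{candidate!r}: {is_name(candidate)}")
```

The first character must be a letter, so ``_x`` and ``9x`` are rejected even though underscores and digits are allowed in later positions.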
-.. index:: lexical definitions, ASCII
+Each rule begins with a name (which identifies the rule that's being defined)
+followed by a colon, ``:``.
+The definition to the right of the colon uses the following syntax elements:

-In lexical definitions (as the example above), two more conventions are used:
-Two literal characters separated by three dots mean a choice of any single
-character in the given (inclusive) range of ASCII characters. A phrase between
-angular brackets (``<...>``) gives an informal description of the symbol
-defined; e.g., this could be used to describe the notion of 'control character'
-if needed.
+* ``name``: A name refers to another rule.
+  Where possible, it is a link to the rule's definition.

-Even though the notation used is almost the same, there is a big difference
-between the meaning of lexical and syntactic definitions: a lexical definition
-operates on the individual characters of the input source, while a syntax
-definition operates on the stream of tokens generated by the lexical analysis.
-All uses of BNF in the next chapter ("Lexical Analysis") are lexical
-definitions; uses in subsequent chapters are syntactic definitions.
+* ``TOKEN``: An uppercase name refers to a :term:`token`.
+  For the purposes of grammar definitions, tokens are the same as rules.

+* ``"text"``, ``'text'``: Text in single or double quotes must match literally
+  (without the quotes). The type of quote is chosen according to the meaning
+  of ``text``:
+
+  * ``'if'``: A name in single quotes denotes a :ref:`keyword <keywords>`.
+  * ``"case"``: A name in double quotes denotes a
+    :ref:`soft-keyword <soft-keywords>`.
+  * ``'@'``: A non-letter symbol in single quotes denotes an
+    :py:data:`~token.OP` token, that is, a :ref:`delimiter <delimiters>` or
+    :ref:`operator <operators>`.
+
+* ``e1 e2``: Items separated only by whitespace denote a sequence.
+  Here, ``e1`` must be followed by ``e2``.
+* ``e1 | e2``: A vertical bar is used to separate alternatives.
+  It denotes PEG's "ordered choice": if ``e1`` matches, ``e2`` is
+  not considered.
+  In traditional PEG grammars, this is written as a slash, ``/``, rather than
+  a vertical bar.
+  See :pep:`617` for more background and details.
+* ``e*``: A star means zero or more repetitions of the preceding item.
+* ``e+``: Likewise, a plus means one or more repetitions.
+* ``[e]``: A phrase enclosed in square brackets means zero or
+  one occurrences. In other words, the enclosed phrase is optional.
+* ``e?``: A question mark has exactly the same meaning as square brackets:
+  the preceding item is optional.
+* ``(e)``: Parentheses are used for grouping.
+* ``"a"..."z"``: Two literal characters separated by three dots mean a choice
+  of any single character in the given (inclusive) range of ASCII characters.
+  This notation is only used in
+  :ref:`lexical definitions <notation-lexical-vs-syntactic>`.
+* ``<...>``: A phrase between angular brackets gives an informal description
+  of the matched symbol (for example, ``<any ASCII character except "\">``),
+  or an abbreviation that is defined in nearby text (for example, ``<Lu>``).
+  This notation is only used in
+  :ref:`lexical definitions <notation-lexical-vs-syntactic>`.
+
+The unary operators (``*``, ``+``, ``?``) bind as tightly as possible;
+the vertical bar (``|``) binds most loosely.
+
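The "ordered choice" reading of ``|`` is the operator most likely to surprise readers used to regular expressions, so a small sketch may help. The helpers below (``lit``, ``seq``, ``choice``, ``star``, ``plus``, ``opt``) are hypothetical names written for this illustration; they follow textbook PEG semantics and are not CPython's parser. The comparison with ``re`` relies on the regex engine backtracking into an alternation, which a PEG choice does not do.

```python
import re

# Matchers take (text, pos) and return the new position, or None on failure.

def lit(s):
    def parse(text, pos):
        return pos + len(s) if text.startswith(s, pos) else None
    return parse

def seq(*parsers):
    """``e1 e2``: every item must match, one after another."""
    def parse(text, pos):
        for p in parsers:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parse

def choice(*parsers):
    """``e1 | e2``: take the first alternative that matches; never revisit."""
    def parse(text, pos):
        for p in parsers:
            end = p(text, pos)
            if end is not None:
                return end
        return None
    return parse

def star(p):
    """``e*``: zero or more repetitions, as many as possible."""
    def parse(text, pos):
        while True:
            end = p(text, pos)
            if end is None or end == pos:  # stop on failure or empty match
                return pos
            pos = end
    return parse

def plus(p):
    """``e+``: one or more repetitions."""
    return seq(p, star(p))

def opt(p):
    """``[e]`` or ``e?``: zero or one occurrence."""
    return choice(p, lit(""))

short_first = seq(choice(lit("a"), lit("ab")), lit("c"))  # ("a" | "ab") "c"
long_first = seq(choice(lit("ab"), lit("a")), lit("c"))   # ("ab" | "a") "c"

print(short_first("abc", 0))  # None: "a" matched, so "ab" is never tried
print(long_first("abc", 0))   # 3
print(re.fullmatch(r"(?:a|ab)c", "abc") is not None)  # True: regex backtracks
print(star(lit("a"))("aaab", 0), plus(lit("a"))("b", 0), opt(lit("a"))("b", 0))
```

Because the first matching alternative wins, the order of alternatives in a rule is significant: listing the longer alternative first is what lets ``long_first`` accept both ``abc`` and ``ac``.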
+White space is only meaningful to separate tokens.
+
+Rules are normally contained on a single line, but rules that are too long
+may be wrapped:
+
+.. grammar-snippet::
+   :group: notation
+
+   literal: stringliteral | bytesliteral
+          | integer | floatnumber | imagnumber
+
+Alternatively, rules may be formatted with the first line ending at the colon,
+and each alternative beginning with a vertical bar on a new line.
+For example:
+
+
+.. grammar-snippet::
+   :group: notation-alt
+
+   literal:
+      | stringliteral
+      | bytesliteral
+      | integer
+      | floatnumber
+      | imagnumber
+
+This does *not* mean that there is an empty first alternative.
+
+.. index:: lexical definitions
+
+.. _notation-lexical-vs-syntactic:
+
+Lexical and Syntactic definitions
+---------------------------------
+
+There is some difference between *lexical* and *syntactic* analysis:
+the :term:`lexical analyzer` operates on the individual characters of the
+input source, while the *parser* (syntactic analyzer) operates on the stream
+of :term:`tokens <token>` generated by the lexical analysis.
+However, in some cases the exact boundary between the two phases is a
+CPython implementation detail.
+
+The practical difference between the two is that in *lexical* definitions,
+all whitespace is significant.
+The lexical analyzer :ref:`discards <whitespace>` all whitespace that is not
+converted to tokens like :data:`token.INDENT` or :data:`~token.NEWLINE`.
+*Syntactic* definitions then use these tokens, rather than source characters.
+
+This documentation uses the same BNF grammar for both styles of definitions.
+All uses of BNF in the next chapter (:ref:`lexical`) are lexical definitions;
+uses in subsequent chapters are syntactic definitions.
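The point about whitespace being discarded unless it becomes a token can be observed with the standard library's ``tokenize`` module. This is a sketch for illustration: ``tokenize`` is a pure-Python tokenizer that approximates what CPython's own lexical analyzer does, and ``token_stream`` is a hypothetical helper name.

```python
import io
import tokenize

def token_stream(source):
    """Return (token name, text) pairs produced by the stdlib tokenizer."""
    readline = io.StringIO(source).readline
    return [
        (tokenize.tok_name[tok.type], tok.string)
        for tok in tokenize.generate_tokens(readline)
    ]

spaced = "if x:\n    y = 1\n"
compact = "if x:\n    y=1\n"

# Leading whitespace and line ends show up as INDENT, NEWLINE and DEDENT.
for name, text in token_stream(spaced):
    print(f"{name:10} {text!r}")

# Spaces around "=" are not tokens, so both spellings give the same stream.
print(token_stream(spaced) == token_stream(compact))
```

The indentation and the line breaks survive as ``INDENT``, ``DEDENT`` and ``NEWLINE`` tokens, while the spaces around ``=`` leave no trace in the token stream that a syntactic rule would see.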