mirror of https://github.com/python/cpython.git
synced 2025-08-31 14:07:50 +00:00

gh-119786: [doc] more consistent syntax in InternalDocs (#125815)

parent 4848b0b92c
commit d0bfff47fb

6 changed files with 371 additions and 420 deletions
@@ -9,12 +9,12 @@ Python's Parser is currently a
[`PEG` (Parser Expression Grammar)](https://en.wikipedia.org/wiki/Parsing_expression_grammar)
parser. It was introduced in
[PEP 617: New PEG parser for CPython](https://peps.python.org/pep-0617/) to replace
-the original [``LL(1)``](https://en.wikipedia.org/wiki/LL_parser) parser.
+the original [`LL(1)`](https://en.wikipedia.org/wiki/LL_parser) parser.

The code implementing the parser is generated from a grammar definition by a
[parser generator](https://en.wikipedia.org/wiki/Compiler-compiler).
Therefore, changes to the Python language are made by modifying the
-[grammar file](https://github.com/python/cpython/blob/main/Grammar/python.gram).
+[grammar file](../Grammar/python.gram).
Developers rarely need to modify the generator itself.

See the devguide's [Changing CPython's grammar](https://devguide.python.org/developer-workflow/grammar/#grammar)
@@ -33,9 +33,9 @@ is ordered. This means that when writing:
rule: A | B | C
```

-a parser that implements a context-free-grammar (such as an ``LL(1)`` parser) will
+a parser that implements a context-free-grammar (such as an `LL(1)` parser) will
generate constructions that, given an input string, *deduce* which alternative
-(``A``, ``B`` or ``C``) must be expanded. On the other hand, a PEG parser will
+(`A`, `B` or `C`) must be expanded. On the other hand, a PEG parser will
check each alternative, in the order in which they are specified, and select
the first one that succeeds.
@@ -67,21 +67,21 @@ time complexity with a technique called
which not only loads the entire program in memory before parsing it but also
allows the parser to backtrack arbitrarily. This is made efficient by memoizing
the rules already matched for each position. The cost of the memoization cache
-is that the parser will naturally use more memory than a simple ``LL(1)`` parser,
+is that the parser will naturally use more memory than a simple `LL(1)` parser,
which normally is table-based.


Key ideas
---------

-- Alternatives are ordered ( ``A | B`` is not the same as ``B | A`` ).
+- Alternatives are ordered ( `A | B` is not the same as `B | A` ).
- If a rule returns a failure, it doesn't mean that the parsing has failed,
  it just means "try something else".
- By default PEG parsers run in exponential time, which can be optimized to linear by
  using memoization.
- If parsing fails completely (no rule succeeds in parsing all the input text), the
  PEG parser doesn't have a concept of "where the
-  [``SyntaxError``](https://docs.python.org/3/library/exceptions.html#SyntaxError) is".
+  [`SyntaxError`](https://docs.python.org/3/library/exceptions.html#SyntaxError) is".

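The packrat memoization idea mentioned above can be illustrated with a toy cache keyed on input position (hypothetical code, not pegen's actual implementation):

```python
# Toy packrat parsing: memoize each rule's result per input position so that
# repeated attempts at the same position are O(1) instead of re-parsed.
cache = {}

def parse_as(text, pos):
    # rule: as <- 'a' as / 'a'   (matches one or more 'a's)
    if pos in cache:
        return cache[pos]          # cache hit: reuse the earlier result
    result = None
    if pos < len(text) and text[pos] == "a":
        rest = parse_as(text, pos + 1)                   # first alternative: 'a' as
        result = rest if rest is not None else pos + 1   # fallback: just 'a'
    cache[pos] = result
    return result

print(parse_as("aaaa", 0))  # 4: all four 'a's consumed
```

A real packrat cache is keyed on (rule, position); a single-rule toy only needs the position.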
> [!IMPORTANT]
@@ -111,16 +111,16 @@ the following two rules (in these examples, a token is an individual character):
second_rule: ('aa' | 'a' ) 'a'
```

-In a regular EBNF grammar, both rules specify the language ``{aa, aaa}`` but
-in PEG, one of these two rules accepts the string ``aaa`` but not the string
-``aa``. The other does the opposite -- it accepts the string ``aa``
-but not the string ``aaa``. The rule ``('a'|'aa')'a'`` does
-not accept ``aaa`` because ``'a'|'aa'`` consumes the first ``a``, letting the
-final ``a`` in the rule consume the second, and leaving out the third ``a``.
+In a regular EBNF grammar, both rules specify the language `{aa, aaa}` but
+in PEG, one of these two rules accepts the string `aaa` but not the string
+`aa`. The other does the opposite -- it accepts the string `aa`
+but not the string `aaa`. The rule `('a'|'aa')'a'` does
+not accept `aaa` because `'a'|'aa'` consumes the first `a`, letting the
+final `a` in the rule consume the second, and leaving out the third `a`.
As the rule has succeeded, no attempt is ever made to go back and let
-``'a'|'aa'`` try the second alternative. The expression ``('aa'|'a')'a'`` does
-not accept ``aa`` because ``'aa'|'a'`` accepts all of ``aa``, leaving nothing
-for the final ``a``. Again, the second alternative of ``'aa'|'a'`` is not
+`'a'|'aa'` try the second alternative. The expression `('aa'|'a')'a'` does
+not accept `aa` because `'aa'|'a'` accepts all of `aa`, leaving nothing
+for the final `a`. Again, the second alternative of `'aa'|'a'` is not
tried.
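The behavior of these two rules can be reproduced with a toy matcher (illustrative Python, not pegen):

```python
# A PEG choice commits to its first successful alternative and never
# backtracks into it, so ('a'|'aa')'a' and ('aa'|'a')'a' accept
# different languages.
def rule(options, text):
    pos = None
    for opt in options:                 # ordered choice for the first part
        if text.startswith(opt):
            pos = len(opt)
            break                       # committed: no later backtracking
    if pos is not None and text.startswith("a", pos):
        return pos + 1 == len(text)     # trailing 'a' must end the input
    return False

print(rule(["a", "aa"], "aa"))   # True
print(rule(["a", "aa"], "aaa"))  # False: the third 'a' is left unconsumed
print(rule(["aa", "a"], "aaa"))  # True
print(rule(["aa", "a"], "aa"))   # False: nothing is left for the final 'a'
```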
> [!CAUTION]
@@ -137,7 +137,7 @@ one is in almost all cases a mistake, for example:
```

In this example, the second alternative will never be tried because the first one will
-succeed first (even if the input string has an ``'else' block`` that follows). To correctly
+succeed first (even if the input string has an `'else' block` that follows). To correctly
write this rule you can simply alter the order:

```
@@ -146,7 +146,7 @@ write this rule you can simply alter the order:
| 'if' expression 'then' block
```

-In this case, if the input string doesn't have an ``'else' block``, the first alternative
+In this case, if the input string doesn't have an `'else' block`, the first alternative
will fail and the second will be attempted.

Grammar Syntax
@@ -166,8 +166,8 @@ the rule:
rule_name[return_type]: expression
```

-If the return type is omitted, then a ``void *`` is returned in C and an
-``Any`` in Python.
+If the return type is omitted, then a `void *` is returned in C and an
+`Any` in Python.

Grammar expressions
-------------------
@@ -214,7 +214,7 @@ Variables in the grammar
------------------------

A sub-expression can be named by preceding it with an identifier and an
-``=`` sign. The name can then be used in the action (see below), like this:
+`=` sign. The name can then be used in the action (see below), like this:

```
rule_name[return_type]: '(' a=some_other_rule ')' { a }
@@ -387,9 +387,9 @@ returns a valid C-based Python AST:
| NUMBER
```

-Here ``EXTRA`` is a macro that expands to ``start_lineno, start_col_offset,
-end_lineno, end_col_offset, p->arena``, those being variables automatically
-injected by the parser; ``p`` points to an object that holds on to all state
+Here `EXTRA` is a macro that expands to `start_lineno, start_col_offset,
+end_lineno, end_col_offset, p->arena`, those being variables automatically
+injected by the parser; `p` points to an object that holds on to all state
for the parser.

A similar grammar written to target Python AST objects:
@@ -422,50 +422,47 @@ Pegen
Pegen is the parser generator used in CPython to produce the final PEG parser
used by the interpreter. It is the program that can be used to read the Python
-grammar located in
-[`Grammar/python.gram`](https://github.com/python/cpython/blob/main/Grammar/python.gram)
-and produce the final C parser. It contains the following pieces:
+grammar located in [`Grammar/python.gram`](../Grammar/python.gram) and produce
+the final C parser. It contains the following pieces:

- A parser generator that can read a grammar file and produce a PEG parser
  written in Python or C that can parse said grammar. The generator is located at
-  [`Tools/peg_generator/pegen`](https://github.com/python/cpython/blob/main/Tools/peg_generator/pegen).
+  [`Tools/peg_generator/pegen`](../Tools/peg_generator/pegen).
- A PEG meta-grammar that automatically generates a Python parser which is used
  for the parser generator itself (this means that there are no manually-written
  parsers). The meta-grammar is located at
-  [`Tools/peg_generator/pegen/metagrammar.gram`](https://github.com/python/cpython/blob/main/Tools/peg_generator/pegen/metagrammar.gram).
+  [`Tools/peg_generator/pegen/metagrammar.gram`](../Tools/peg_generator/pegen/metagrammar.gram).
- A generated parser (using the parser generator) that can directly produce C and Python AST objects.

-The source code for Pegen lives at
-[`Tools/peg_generator/pegen`](https://github.com/python/cpython/blob/main/Tools/peg_generator/pegen)
+The source code for Pegen lives at [`Tools/peg_generator/pegen`](../Tools/peg_generator/pegen)
but normally all typical commands to interact with the parser generator are executed from
the main makefile.

How to regenerate the parser
----------------------------

-Once you have made the changes to the grammar files, to regenerate the ``C``
+Once you have made the changes to the grammar files, to regenerate the `C`
parser (the one used by the interpreter) just execute:

```
make regen-pegen
```

-using the ``Makefile`` in the main directory. If you are on Windows you can
+using the `Makefile` in the main directory. If you are on Windows you can
use the Visual Studio project files to regenerate the parser or to execute:

```
./PCbuild/build.bat --regen
```

-The generated parser file is located at
-[`Parser/parser.c`](https://github.com/python/cpython/blob/main/Parser/parser.c).
+The generated parser file is located at [`Parser/parser.c`](../Parser/parser.c).

How to regenerate the meta-parser
---------------------------------

The meta-grammar (the grammar that describes the grammar for the grammar files
themselves) is located at
-[`Tools/peg_generator/pegen/metagrammar.gram`](https://github.com/python/cpython/blob/main/Tools/peg_generator/pegen/metagrammar.gram).
+[`Tools/peg_generator/pegen/metagrammar.gram`](../Tools/peg_generator/pegen/metagrammar.gram).
Although it is very unlikely that you will ever need to modify it, if you make
any modifications to this file (in order to implement new Pegen features) you will
need to regenerate the meta-parser (the parser that parses the grammar files).
@@ -488,11 +485,11 @@ Grammatical elements and rules

Pegen has some special grammatical elements and rules:

-- Strings with single quotes (') (for example, ``'class'``) denote KEYWORDS.
-- Strings with double quotes (") (for example, ``"match"``) denote SOFT KEYWORDS.
-- Uppercase names (for example, ``NAME``) denote tokens in the
-  [`Grammar/Tokens`](https://github.com/python/cpython/blob/main/Grammar/Tokens) file.
-- Rule names starting with ``invalid_`` are used for specialized syntax errors.
+- Strings with single quotes (') (for example, `'class'`) denote KEYWORDS.
+- Strings with double quotes (") (for example, `"match"`) denote SOFT KEYWORDS.
+- Uppercase names (for example, `NAME`) denote tokens in the
+  [`Grammar/Tokens`](../Grammar/Tokens) file.
+- Rule names starting with `invalid_` are used for specialized syntax errors.

  - These rules are NOT used in the first pass of the parser.
  - Only if the first pass fails to parse, a second pass including the invalid
@@ -509,14 +506,13 @@ Tokenization
It is common among PEG parser frameworks that the parser does both the parsing
and the tokenization, but this does not happen in Pegen. The reason is that the
Python language needs a custom tokenizer to handle things like indentation
-boundaries, some special keywords like ``ASYNC`` and ``AWAIT`` (for
+boundaries, some special keywords like `ASYNC` and `AWAIT` (for
compatibility purposes), backtracking errors (such as unclosed parentheses),
dealing with encoding, interactive mode and much more. Some of these reasons
are also there for historical purposes, and some others are useful even today.

The list of tokens (all uppercase names in the grammar) that you can use can
-be found in the
-[`Grammar/Tokens`](https://github.com/python/cpython/blob/main/Grammar/Tokens)
+be found in the [`Grammar/Tokens`](../Grammar/Tokens)
file. If you change this file to add new tokens, make sure to regenerate the
files by executing:
@@ -532,9 +528,7 @@ the tokens or to execute:
```

How tokens are generated and the rules governing this are completely up to the tokenizer
-([`Parser/lexer`](https://github.com/python/cpython/blob/main/Parser/lexer)
-and
-[`Parser/tokenizer`](https://github.com/python/cpython/blob/main/Parser/tokenizer));
+([`Parser/lexer`](../Parser/lexer) and [`Parser/tokenizer`](../Parser/tokenizer));
the parser just receives tokens from it.

Memoization
@@ -548,7 +542,7 @@ both in memory and time. Although the memory cost is obvious (the parser needs
memory for storing previous results in the cache) the execution time cost comes
from continuously checking if the given rule has a cache hit or not. In many
situations, just parsing it again can be faster. Pegen **disables memoization
-by default** except for rules with the special marker ``memo`` after the rule
+by default** except for rules with the special marker `memo` after the rule
name (and type, if present):

```
@@ -567,8 +561,7 @@ To determine whether a new rule needs memoization or not, benchmarking is required
(comparing execution times and memory usage of some considerably large files with
and without memoization). There is a very simple instrumentation API available
in the generated C parser code that allows measuring how much each rule uses
-memoization (check the
-[`Parser/pegen.c`](https://github.com/python/cpython/blob/main/Parser/pegen.c)
+memoization (check the [`Parser/pegen.c`](../Parser/pegen.c)
file for more information) but it needs to be manually activated.

Automatic variables
@@ -578,9 +571,9 @@ To make writing actions easier, Pegen injects some automatic variables in the
namespace available when writing actions. In the C parser, some of these
automatic variable names are:

-- ``p``: The parser structure.
-- ``EXTRA``: This is a macro that expands to
-  ``(_start_lineno, _start_col_offset, _end_lineno, _end_col_offset, p->arena)``,
+- `p`: The parser structure.
+- `EXTRA`: This is a macro that expands to
+  `(_start_lineno, _start_col_offset, _end_lineno, _end_col_offset, p->arena)`,
  which is normally used to create AST nodes as almost all constructors need these
  attributes to be provided. All of the location variables are taken from the
  location information of the current token.
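To make this concrete, a unary-operator style action in the grammar uses these variables in roughly this shape (a simplified sketch in the grammar-file notation; the exact rules in `Grammar/python.gram` differ):

```
factor[expr_ty] (memo):
    | '+' a=factor { _PyAST_UnaryOp(UAdd, a, EXTRA) }
    | '-' a=factor { _PyAST_UnaryOp(USub, a, EXTRA) }
    | power
```

Here `EXTRA` supplies the location arguments and `p->arena` that the AST constructor expects, so the action can stay a single expression.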
@@ -590,13 +583,13 @@ Hard and soft keywords

> [!NOTE]
> In the grammar files, keywords are defined using **single quotes** (for example,
-> ``'class'``) while soft keywords are defined using **double quotes** (for example,
-> ``"match"``).
+> `'class'`) while soft keywords are defined using **double quotes** (for example,
+> `"match"`).

There are two kinds of keywords allowed in pegen grammars: *hard* and *soft*
keywords. The difference between hard and soft keywords is that hard keywords
are always reserved words, even in positions where they make no sense
-(for example, ``x = class + 1``), while soft keywords only get a special
+(for example, `x = class + 1`), while soft keywords only get a special
meaning in context. Trying to use a hard keyword as a variable will always
fail:
@@ -621,7 +614,7 @@ one where they are defined as keywords:
>>> foo(match="Yeah!")
```

-The ``match`` and ``case`` keywords are soft keywords, so that they are
+The `match` and `case` keywords are soft keywords, so that they are
recognized as keywords at the beginning of a match statement or case block
respectively, but are allowed to be used in other places as variable or
argument names.
@@ -662,7 +655,7 @@ is, and it will unwind the stack and report the exception. This means that if a
[rule action](#grammar-actions) raises an exception, all parsing will
stop at that exact point. This is done to correctly propagate any
exception set by calling Python's C API functions. This also includes
-[``SyntaxError``](https://docs.python.org/3/library/exceptions.html#SyntaxError)
+[`SyntaxError`](https://docs.python.org/3/library/exceptions.html#SyntaxError)
exceptions and it is the main mechanism the parser uses to report custom syntax
error messages.
@@ -684,10 +677,10 @@ grammar.
To report generic syntax errors, pegen uses a common heuristic in PEG parsers:
the location of *generic* syntax errors is reported to be the furthest token that
was attempted to be matched but failed. This is only done if parsing has failed
-(the parser returns ``NULL`` in C or ``None`` in Python) but no exception has
+(the parser returns `NULL` in C or `None` in Python) but no exception has
been raised.

-As the Python grammar was primordially written as an ``LL(1)`` grammar, this heuristic
+As the Python grammar was primordially written as an `LL(1)` grammar, this heuristic
has an extremely high success rate, but some PEG features, such as lookaheads,
can impact this.
@@ -699,19 +692,19 @@ can impact this.
To generate more precise syntax errors, custom rules are used. This is a common
practice also in context-free grammars: the parser will try to accept some
construct that is known to be incorrect just to report a specific syntax error
-for that construct. In pegen grammars, these rules start with the ``invalid_``
+for that construct. In pegen grammars, these rules start with the `invalid_`
prefix. This is because trying to match these rules normally has a performance
impact on parsing (and can also affect the 'correct' grammar itself in some
tricky cases, depending on the ordering of the rules) so the generated parser
acts in two phases:

1. The first phase will try to parse the input stream without taking into
-   account rules that start with the ``invalid_`` prefix. If the parsing
+   account rules that start with the `invalid_` prefix. If the parsing
   succeeds it will return the generated AST and the second phase will be
   skipped.

2. If the first phase failed, a second parsing attempt is done including the
-   rules that start with an ``invalid_`` prefix. By design this attempt
+   rules that start with an `invalid_` prefix. By design this attempt
   **cannot succeed** and is only executed to give to the invalid rules a
   chance to detect specific situations where custom, more precise, syntax
   errors can be raised. This also allows trading a bit of performance for
@@ -723,15 +716,15 @@ acts in two phases:
> When defining invalid rules:
>
> - Make sure all custom invalid rules raise
->   [``SyntaxError``](https://docs.python.org/3/library/exceptions.html#SyntaxError)
+>   [`SyntaxError`](https://docs.python.org/3/library/exceptions.html#SyntaxError)
>   exceptions (or a subclass of it).
-> - Make sure **all** invalid rules start with the ``invalid_`` prefix to not
+> - Make sure **all** invalid rules start with the `invalid_` prefix to not
>   impact performance of parsing correct Python code.
> - Make sure the parser doesn't behave differently for regular rules when you introduce invalid rules
>   (see the [how PEG parsers work](#how-peg-parsers-work) section for more information).

You can find a collection of macros to raise specialized syntax errors in the
-[`Parser/pegen.h`](https://github.com/python/cpython/blob/main/Parser/pegen.h)
+[`Parser/pegen.h`](../Parser/pegen.h)
header file. These macros also allow reporting ranges for
the custom errors, which will be highlighted in the tracebacks that will be
displayed when the error is reported.
@@ -746,35 +739,33 @@ displayed when the error is reported.
<valid python code> $ 42
```

-should trigger the syntax error in the ``$`` character. If your rule is not correctly defined this
+should trigger the syntax error in the `$` character. If your rule is not correctly defined this
won't happen. As another example, suppose that you try to define a rule to match Python 2 style
-``print`` statements in order to create a better error message and you define it as:
+`print` statements in order to create a better error message and you define it as:

```
invalid_print: "print" expression
```

-This will **seem** to work because the parser will correctly parse ``print(something)`` because it is valid
-code and the second phase will never execute but if you try to parse ``print(something) $ 3`` the first pass
-of the parser will fail (because of the ``$``) and in the second phase, the rule will match the
-``print(something)`` as ``print`` followed by the variable ``something`` between parentheses and the error
-will be reported there instead of the ``$`` character.
+This will **seem** to work because the parser will correctly parse `print(something)` because it is valid
+code and the second phase will never execute but if you try to parse `print(something) $ 3` the first pass
+of the parser will fail (because of the `$`) and in the second phase, the rule will match the
+`print(something)` as `print` followed by the variable `something` between parentheses and the error
+will be reported there instead of the `$` character.

Generating AST objects
----------------------

The output of the C parser used by CPython, which is generated from the
-[grammar file](https://github.com/python/cpython/blob/main/Grammar/python.gram),
-is a Python AST object (using C structures). This means that the actions in the
-grammar file generate AST objects when they succeed. Constructing these objects
-can be quite cumbersome (see the [AST compiler section](compiler.md#abstract-syntax-trees-ast)
+[grammar file](../Grammar/python.gram), is a Python AST object (using C
+structures). This means that the actions in the grammar file generate AST
+objects when they succeed. Constructing these objects can be quite cumbersome
+(see the [AST compiler section](compiler.md#abstract-syntax-trees-ast)
for more information on how these objects are constructed and how they are used
by the compiler), so special helper functions are used. These functions are
-declared in the
-[`Parser/pegen.h`](https://github.com/python/cpython/blob/main/Parser/pegen.h)
-header file and defined in the
-[`Parser/action_helpers.c`](https://github.com/python/cpython/blob/main/Parser/action_helpers.c)
-file. The helpers include functions that join AST sequences, get specific elements
+declared in the [`Parser/pegen.h`](../Parser/pegen.h) header file and defined
+in the [`Parser/action_helpers.c`](../Parser/action_helpers.c) file. The
+helpers include functions that join AST sequences, get specific elements
from them or perform extra processing on the generated tree.
@@ -788,11 +779,9 @@ from them or perform extra processing on the generated tree.

As a general rule, if an action spans multiple lines or requires something more
complicated than a single expression of C code, it is normally better to create a
-custom helper in
-[`Parser/action_helpers.c`](https://github.com/python/cpython/blob/main/Parser/action_helpers.c)
-and expose it in the
-[`Parser/pegen.h`](https://github.com/python/cpython/blob/main/Parser/pegen.h)
-header file so that it can be used from the grammar.
+custom helper in [`Parser/action_helpers.c`](../Parser/action_helpers.c)
+and expose it in the [`Parser/pegen.h`](../Parser/pegen.h) header file so that
+it can be used from the grammar.

When parsing succeeds, the parser **must** return a **valid** AST object.
@@ -801,16 +790,15 @@ Testing

There are three files that contain tests for the grammar and the parser:

-- [test_grammar.py](https://github.com/python/cpython/blob/main/Lib/test/test_grammar.py)
-- [test_syntax.py](https://github.com/python/cpython/blob/main/Lib/test/test_syntax.py)
-- [test_exceptions.py](https://github.com/python/cpython/blob/main/Lib/test/test_exceptions.py)
+- [test_grammar.py](../Lib/test/test_grammar.py)
+- [test_syntax.py](../Lib/test/test_syntax.py)
+- [test_exceptions.py](../Lib/test/test_exceptions.py)

-Check the contents of these files to know which is the best place for new tests, depending
-on the nature of the new feature you are adding.
+Check the contents of these files to know which is the best place for new
+tests, depending on the nature of the new feature you are adding.

Tests for the parser generator itself can be found in the
-[test_peg_generator](https://github.com/python/cpython/blob/main/Lib/test_peg_generator)
-directory.
+[test_peg_generator](../Lib/test_peg_generator) directory.


Debugging generated parsers
@@ -825,33 +813,32 @@ correctly compile and execute Python anymore. This makes it a bit challenging
to debug when something goes wrong, especially when experimenting.

For this reason it is a good idea to experiment first by generating a Python
-parser. To do this, you can go to the
-[Tools/peg_generator](https://github.com/python/cpython/blob/main/Tools/peg_generator)
+parser. To do this, you can go to the [Tools/peg_generator](../Tools/peg_generator)
directory in the CPython repository and manually call the parser generator by executing:

```
$ python -m pegen python <PATH TO YOUR GRAMMAR FILE>
```

-This will generate a file called ``parse.py`` in the same directory that you
+This will generate a file called `parse.py` in the same directory that you
can use to parse some input:

```
$ python parse.py file_with_source_code_to_test.py
```

-As the generated ``parse.py`` file is just Python code, you can modify it
+As the generated `parse.py` file is just Python code, you can modify it
and add breakpoints to debug or better understand some complex situations.


Verbose mode
------------

-When Python is compiled in debug mode (by adding ``--with-pydebug`` when
-running the configure step in Linux or by adding ``-d`` when calling the
-[PCbuild/build.bat](https://github.com/python/cpython/blob/main/PCbuild/build.bat)),
-it is possible to activate a **very** verbose mode in the generated parser. This
-is very useful to debug the generated parser and to understand how it works, but it
+When Python is compiled in debug mode (by adding `--with-pydebug` when
+running the configure step in Linux or by adding `-d` when calling the
+[PCbuild/build.bat](../PCbuild/build.bat)), it is possible to activate a
+**very** verbose mode in the generated parser. This is very useful to
+debug the generated parser and to understand how it works, but it
can be a bit hard to understand at first.

> [!NOTE]
@@ -859,13 +846,13 @@ can be a bit hard to understand at first.
> interactive mode as it can be much harder to understand, because interactive
> mode involves some special steps compared to regular parsing.

-To activate verbose mode you can add the ``-d`` flag when executing Python:
+To activate verbose mode you can add the `-d` flag when executing Python:

```
$ python -d file_to_test.py
```

-This will print **a lot** of output to ``stderr`` so it is probably better to dump
+This will print **a lot** of output to `stderr` so it is probably better to dump
it to a file for further analysis. The output consists of trace lines with the
following structure:
@@ -873,17 +860,17 @@ following structure:
<indentation> ('>'|'-'|'+'|'!') <rule_name>[<token_location>]: <alternative> ...
```

-Every line is indented by a different amount (``<indentation>``) depending on how
+Every line is indented by a different amount (`<indentation>`) depending on how
deep the call stack is. The next character marks the type of the trace:

-- ``>`` indicates that a rule is going to be attempted to be parsed.
-- ``-`` indicates that a rule has failed to be parsed.
-- ``+`` indicates that a rule has been parsed correctly.
-- ``!`` indicates that an exception or an error has been detected and the parser is unwinding.
+- `>` indicates that a rule is going to be attempted to be parsed.
+- `-` indicates that a rule has failed to be parsed.
+- `+` indicates that a rule has been parsed correctly.
+- `!` indicates that an exception or an error has been detected and the parser is unwinding.

-The ``<token_location>`` part indicates the current index in the token array,
-the ``<rule_name>`` part indicates what rule is being parsed and
-the ``<alternative>`` part indicates what alternative within that rule
+The `<token_location>` part indicates the current index in the token array,
+the `<rule_name>` part indicates what rule is being parsed and
+the `<alternative>` part indicates what alternative within that rule
is being attempted.
@@ -891,4 +878,5 @@ is being attempted.
> **Document history**
>
> Pablo Galindo Salgado - Original author
>
> Irit Katriel and Jacob Coffee - Convert to Markdown