gh-119786: cleanup internal docs and fix internal links (#127485)

Bénédikt Tran 2024-12-01 18:12:22 +01:00 committed by GitHub
parent 1bc4f076d1
commit 04673d2f14
11 changed files with 152 additions and 148 deletions

View file

@ -1,4 +1,3 @@
# CPython Internals Documentation
The documentation in this folder is intended for CPython maintainers.

View file

@ -96,6 +96,7 @@ quality of specialization and keeping the overhead of specialization low.
Specialized instructions must be fast. To be fast,
specialized instructions should be tailored to a particular
set of values that allows them to:
1. Verify that the incoming value is part of that set with low overhead.
2. Perform the operation quickly.
@ -107,9 +108,11 @@ For example, `LOAD_GLOBAL_MODULE` is specialized for `globals()`
dictionaries whose keys object has the expected version.
This can be tested quickly:
* `globals->keys->dk_version == expected_version`
and the operation can be performed quickly:
* `value = entries[cache->index].me_value;`.
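The shape this describes, a cheap guard followed by a fast path, can be sketched as a toy Python model (hypothetical names throughout; CPython's real specialized instructions are C code, and this is only an illustration of the pattern):

```python
# Toy model of a specialized global load: the inline cache remembers a
# dict-keys "version" and an index; the guard is one comparison and the
# fast path is one indexed load.

class ToyDict:
    def __init__(self, items):
        self.keys_version = 1               # bumped whenever the key set changes
        self.entries = list(items.items())  # (key, value) pairs

class InlineCache:
    def __init__(self, version, index):
        self.version = version
        self.index = index

def load_global_generic(d, name):
    # The unspecialized base instruction: a full lookup.
    for key, value in d.entries:
        if key == name:
            return value
    raise NameError(name)

def load_global_module(d, name, cache):
    # Guard, analogous to DEOPT_IF(version-mismatch, LOAD_GLOBAL): one cheap check.
    if d.keys_version != cache.version:
        return load_global_generic(d, name)  # "deoptimize" to the base case
    # Operation: a single indexed load, no hashing or probing.
    return d.entries[cache.index][1]

g = ToyDict({"x": 42})
cache = InlineCache(version=g.keys_version, index=0)
assert load_global_module(g, "x", cache) == 42
```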
Because it is impossible to measure the performance of an instruction without
@ -122,6 +125,7 @@ base instruction.
### Implementation of specialized instructions
In general, specialized instructions should be implemented in two parts:
1. A sequence of guards, each of the form
`DEOPT_IF(guard-condition-is-false, BASE_NAME)`.
2. The operation, which should ideally have no branches and

View file

@ -32,7 +32,7 @@ Below is a checklist of things that may need to change.
[`Include/internal/pycore_ast.h`](../Include/internal/pycore_ast.h) and
[`Python/Python-ast.c`](../Python/Python-ast.c).
* [`Parser/lexer/`](../Parser/lexer/) contains the tokenization code.
* [`Parser/lexer/`](../Parser/lexer) contains the tokenization code.
This is where you would add a new type of comment or string literal, for example.
* [`Python/ast.c`](../Python/ast.c) will need changes to validate AST objects
@ -60,4 +60,4 @@ Below is a checklist of things that may need to change.
to the tokenizer.
* Documentation must be written! Specifically, one or more of the pages in
[`Doc/reference/`](../Doc/reference/) will need to be updated.
[`Doc/reference/`](../Doc/reference) will need to be updated.

View file

@ -1,4 +1,3 @@
Compiler design
===============
@ -7,8 +6,8 @@ Abstract
In CPython, the compilation from source code to bytecode involves several steps:
1. Tokenize the source code [Parser/lexer/](../Parser/lexer/)
and [Parser/tokenizer/](../Parser/tokenizer/).
1. Tokenize the source code [Parser/lexer/](../Parser/lexer)
and [Parser/tokenizer/](../Parser/tokenizer).
2. Parse the stream of tokens into an Abstract Syntax Tree
[Parser/parser.c](../Parser/parser.c).
3. Transform AST into an instruction sequence
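These stages can be observed from Python itself; here is a small sketch using the standard `tokenize`, `ast`, and `dis` modules:

```python
import ast
import dis
import io
import tokenize

src = "x = 1 + 2\n"

# 1. The token stream produced by the tokenizer.
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))

# 2. The Abstract Syntax Tree built by the parser.
tree = ast.parse(src)
print(ast.dump(tree))

# 3. The bytecode the compiler ultimately produces from the AST.
dis.dis(compile(tree, "<example>", "exec"))
```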
@ -134,9 +133,8 @@ this case) a `stmt_ty` struct with the appropriate initialization. The
`FunctionDef()` constructor function sets 'kind' to `FunctionDef_kind` and
initializes the *name*, *args*, *body*, and *attributes* fields.
See also
[Green Tree Snakes - The missing Python AST docs](https://greentreesnakes.readthedocs.io/en/latest)
by Thomas Kluyver.
See also [Green Tree Snakes - The missing Python AST docs](
https://greentreesnakes.readthedocs.io/en/latest) by Thomas Kluyver.
Memory management
=================
@ -260,11 +258,11 @@ manually -- `generic`, `identifier` and `int`. These types are found in
[Include/internal/pycore_asdl.h](../Include/internal/pycore_asdl.h).
Functions and macros for creating `asdl_xx_seq *` types are as follows:
`_Py_asdl_generic_seq_new(Py_ssize_t, PyArena *)`
* `_Py_asdl_generic_seq_new(Py_ssize_t, PyArena *)`:
Allocate memory for an `asdl_generic_seq` of the specified length
`_Py_asdl_identifier_seq_new(Py_ssize_t, PyArena *)`
* `_Py_asdl_identifier_seq_new(Py_ssize_t, PyArena *)`:
Allocate memory for an `asdl_identifier_seq` of the specified length
`_Py_asdl_int_seq_new(Py_ssize_t, PyArena *)`
* `_Py_asdl_int_seq_new(Py_ssize_t, PyArena *)`:
Allocate memory for an `asdl_int_seq` of the specified length
In addition to the three types mentioned above, some ASDL sequence types are
@ -273,19 +271,19 @@ automatically generated by [Parser/asdl_c.py](../Parser/asdl_c.py) and found in
Macros for using both manually defined and automatically generated ASDL
sequence types are as follows:
`asdl_seq_GET(asdl_xx_seq *, int)`
* `asdl_seq_GET(asdl_xx_seq *, int)`:
Get item held at a specific position in an `asdl_xx_seq`
`asdl_seq_SET(asdl_xx_seq *, int, stmt_ty)`
* `asdl_seq_SET(asdl_xx_seq *, int, stmt_ty)`:
Set a specific index in an `asdl_xx_seq` to the specified value
Untyped counterparts exist for some of the typed macros. These are useful
when a function needs to manipulate a generic ASDL sequence:
`asdl_seq_GET_UNTYPED(asdl_seq *, int)`
* `asdl_seq_GET_UNTYPED(asdl_seq *, int)`:
Get item held at a specific position in an `asdl_seq`
`asdl_seq_SET_UNTYPED(asdl_seq *, int, stmt_ty)`
* `asdl_seq_SET_UNTYPED(asdl_seq *, int, stmt_ty)`:
Set a specific index in an `asdl_seq` to the specified value
`asdl_seq_LEN(asdl_seq *)`
* `asdl_seq_LEN(asdl_seq *)`:
Return the length of an `asdl_seq` or `asdl_xx_seq`
Note that typed macros and functions are recommended over their untyped
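For orientation, the sequences these C-level macros operate on surface as plain Python lists when viewed through the `ast` module:

```python
import ast

tree = ast.parse("def f():\n    pass\n")
# Module.body is an asdl_stmt_seq at the C level; in Python it is a list.
print(type(tree.body))     # <class 'list'>
print(len(tree.body))      # 1 (the function definition)
print(type(tree.body[0]))  # <class 'ast.FunctionDef'>
```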
@ -379,14 +377,14 @@ arguments to a node that used the '*' modifier).
Emission of bytecode is handled by the following macros:
* `ADDOP(struct compiler *, location, int)`
* `ADDOP(struct compiler *, location, int)`:
add a specified opcode
* `ADDOP_IN_SCOPE(struct compiler *, location, int)`
* `ADDOP_IN_SCOPE(struct compiler *, location, int)`:
like `ADDOP`, but also exits current scope; used for adding return value
opcodes in lambdas and closures
* `ADDOP_I(struct compiler *, location, int, Py_ssize_t)`
* `ADDOP_I(struct compiler *, location, int, Py_ssize_t)`:
add an opcode that takes an integer argument
* `ADDOP_O(struct compiler *, location, int, PyObject *, TYPE)`
* `ADDOP_O(struct compiler *, location, int, PyObject *, TYPE)`:
add an opcode with the proper argument based on the position of the
specified PyObject in the PyObject sequence object, but with no handling of
mangled names; used for when you
@ -394,17 +392,17 @@ Emission of bytecode is handled by the following macros:
parameters where name mangling is not possible and the scope of the
name is known; *TYPE* is the name of PyObject sequence
(`names` or `varnames`)
* `ADDOP_N(struct compiler *, location, int, PyObject *, TYPE)`
* `ADDOP_N(struct compiler *, location, int, PyObject *, TYPE)`:
just like `ADDOP_O`, but steals a reference to PyObject
* `ADDOP_NAME(struct compiler *, location, int, PyObject *, TYPE)`
* `ADDOP_NAME(struct compiler *, location, int, PyObject *, TYPE)`:
just like `ADDOP_O`, but name mangling is also handled; used for
attribute loading or importing based on name
* `ADDOP_LOAD_CONST(struct compiler *, location, PyObject *)`
* `ADDOP_LOAD_CONST(struct compiler *, location, PyObject *)`:
add the `LOAD_CONST` opcode with the proper argument based on the
position of the specified PyObject in the consts table.
* `ADDOP_LOAD_CONST_NEW(struct compiler *, location, PyObject *)`
* `ADDOP_LOAD_CONST_NEW(struct compiler *, location, PyObject *)`:
just like `ADDOP_LOAD_CONST`, but steals a reference to PyObject
* `ADDOP_JUMP(struct compiler *, location, int, basicblock *)`
* `ADDOP_JUMP(struct compiler *, location, int, basicblock *)`:
create a jump to a basic block
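The result of this emission is visible from Python with the `dis` module; for instance, the constant load in the disassembly below corresponds to an `ADDOP_LOAD_CONST` call in the compiler:

```python
import dis

# Each instruction in the output was emitted by some ADDOP* call;
# the LOAD_CONST line corresponds to an ADDOP_LOAD_CONST call.
dis.dis("x = 1")
```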
The `location` argument is a struct with the source location to be
@ -433,7 +431,7 @@ Finally, the sequence of pseudo-instructions is converted into actual
bytecode. This includes transforming pseudo-instructions into actual instructions,
converting jump targets from logical labels to relative offsets, and
constructing the [exception table](exception_handling.md) and
[locations table](locations.md).
[locations table](code_objects.md#source-code-locations).
The bytecode and tables are then wrapped into a `PyCodeObject` along with additional
metadata, including the `consts` and `names` arrays, information about function
arguments, and a reference to the source code (filename, etc.). All of this is implemented by
@ -453,7 +451,7 @@ in [Python/ceval.c](../Python/ceval.c).
Important files
===============
* [Parser/](../Parser/)
* [Parser/](../Parser)
* [Parser/Python.asdl](../Parser/Python.asdl):
ASDL syntax file.
@ -534,7 +532,7 @@ Important files
* [Python/instruction_sequence.c](../Python/instruction_sequence.c):
A data structure representing a sequence of bytecode-like pseudo-instructions.
* [Include/](../Include/)
* [Include/](../Include)
* [Include/cpython/code.h](../Include/cpython/code.h)
: Header file for [Objects/codeobject.c](../Objects/codeobject.c);
@ -570,7 +568,7 @@ Important files
by
[Tools/cases_generator/opcode_id_generator.py](../Tools/cases_generator/opcode_id_generator.py).
* [Objects/](../Objects/)
* [Objects/](../Objects)
* [Objects/codeobject.c](../Objects/codeobject.c)
: Contains PyCodeObject-related code.
@ -579,7 +577,7 @@ Important files
: Contains the `frame_setlineno()` function, which determines whether it is allowed
to jump between two points in the bytecode.
* [Lib/](../Lib/)
* [Lib/](../Lib)
* [Lib/opcode.py](../Lib/opcode.py)
: opcode utilities exposed to Python.
@ -591,7 +589,7 @@ Important files
Objects
=======
* [Locations](locations.md): Describes the location table
* [Locations](code_objects.md#source-code-locations): Describes the location table
* [Frames](frames.md): Describes frames and the frame stack
* [Objects/object_layout.md](../Objects/object_layout.md): Describes object layout for 3.11 and later
* [Exception Handling](exception_handling.md): Describes the exception table

View file

@ -107,13 +107,12 @@ Format of the exception table
-----------------------------
Conceptually, the exception table consists of a sequence of 5-tuples:
```
1. `start-offset` (inclusive)
2. `end-offset` (exclusive)
3. `target`
4. `stack-depth`
5. `push-lasti` (boolean)
```
All offsets and lengths are in code units, not bytes.
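On CPython versions that use this scheme, the table can be inspected from Python: for a function with a handler, `dis.dis` ends its output with an `ExceptionTable:` section listing these entries:

```python
import dis

def f():
    try:
        1 / 0
    except ZeroDivisionError:
        pass

# The disassembly ends with an "ExceptionTable:" section showing the
# start/end/target/depth/lasti entries described above.
dis.dis(f)
```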
@ -129,12 +128,13 @@ Also, sizes are limited to 2**30 as the code length cannot exceed 2**31 and each code unit takes 2 bytes.
It also happens that depth is generally quite small.
So, we need to encode:
```
`start` (up to 30 bits)
`size` (up to 30 bits)
`target` (up to 30 bits)
`depth` (up to ~8 bits)
`lasti` (1 bit)
start (up to 30 bits)
size (up to 30 bits)
target (up to 30 bits)
depth (up to ~8 bits)
lasti (1 bit)
```
We need a marker for the start of the entry, so the first byte of an entry will have the most significant bit set.
@ -145,23 +145,26 @@ The 8 bits of a byte are (msb left) SXdddddd where S is the start bit. X is the extension bit.
In addition, we combine `depth` and `lasti` into a single value, `((depth<<1)+lasti)`, before encoding.
For example, the exception entry:
```
`start`: 20
`end`: 28
`target`: 100
`depth`: 3
`lasti`: False
start: 20
end: 28
target: 100
depth: 3
lasti: False
```
is encoded by first converting to the more compact four-value form:
```
`start`: 20
`size`: 8
`target`: 100
`depth<<1+lasti`: 6
start: 20
size: 8
target: 100
depth<<1+lasti: 6
```
which is then encoded as:
```
148 (MSB + 20 for start)
8 (size)
65 (extension bit + 1, the high 6 bits of target; 100 == (1 << 6) + 36)
36 (the low 6 bits of target)
6 ((depth << 1) + lasti)
```
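A small Python sketch of the byte scheme described above (a hypothetical helper mirroring the described layout, not CPython's internal encoder) reproduces these bytes:

```python
def encode_varint(value, start_of_entry=False):
    """Encode value in 6-bit chunks; byte layout (msb left) is SXdddddd."""
    chunks = []
    while True:
        chunks.append(value & 0x3F)  # low 6 bits
        value >>= 6
        if not value:
            break
    chunks.reverse()
    out = bytearray(chunks)
    for i in range(len(out) - 1):
        out[i] |= 0x40               # X: another byte of this value follows
    if start_of_entry:
        out[0] |= 0x80               # S: first byte of an exception table entry
    return bytes(out)

entry = (encode_varint(20, start_of_entry=True)  # start
         + encode_varint(8)                      # size
         + encode_varint(100)                    # target
         + encode_varint((3 << 1) + 0))          # (depth << 1) + lasti
print(list(entry))  # [148, 8, 65, 36, 6]
```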

View file

@ -27,6 +27,7 @@ objects, so are not allocated in the per-thread stack. See `PyGenObject` in
## Layout
Each activation record is laid out as:
* Specials
* Locals
* Stack
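One part of the record, the locals, can be inspected from Python with the CPython-specific `sys._getframe`:

```pycon
>>> import sys
>>> def f(x):
...     y = x + 1
...     return sys._getframe()
...
>>> frame = f(1)
>>> frame.f_locals  # the "Locals" part of the activation record
{'x': 1, 'y': 2}
```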

View file

@ -1,4 +1,3 @@
Garbage collector design
========================

View file

@ -1,4 +1,3 @@
Generators
==========

View file

@ -1,4 +1,3 @@
The bytecode interpreter
========================

View file

@ -1,4 +1,3 @@
Guide to the parser
===================
@ -444,15 +443,15 @@ How to regenerate the parser
Once you have made the changes to the grammar files, regenerate the `C`
parser (the one used by the interpreter) by executing:
```
make regen-pegen
```shell
$ make regen-pegen
```
using the `Makefile` in the main directory. If you are on Windows you can
use the Visual Studio project files to regenerate the parser, or execute:
```
./PCbuild/build.bat --regen
```dos
PCbuild/build.bat --regen
```
The generated parser file is located at [`Parser/parser.c`](../Parser/parser.c).
@ -468,15 +467,15 @@ any modifications to this file (in order to implement new Pegen features) you will
need to regenerate the meta-parser (the parser that parses the grammar files).
To do so just execute:
```
make regen-pegen-metaparser
```shell
$ make regen-pegen-metaparser
```
If you are on Windows you can use the Visual Studio project files
to regenerate the parser, or execute:
```
./PCbuild/build.bat --regen
```dos
PCbuild/build.bat --regen
```
@ -516,15 +515,15 @@ be found in the [`Grammar/Tokens`](../Grammar/Tokens)
file. If you change this file to add new tokens, make sure to regenerate the
files by executing:
```
make regen-token
```shell
$ make regen-token
```
If you are on Windows you can use the Visual Studio project files to regenerate
the tokens, or execute:
```
./PCbuild/build.bat --regen
```dos
PCbuild/build.bat --regen
```
How tokens are generated and the rules governing this are completely up to the tokenizer
@ -593,7 +592,7 @@ are always reserved words, even in positions where they make no sense, while soft keywords only get a special
meaning in context. Trying to use a hard keyword as a variable will always
fail:
```
```pycon
>>> class = 3
File "<stdin>", line 1
class = 3
@ -609,7 +608,7 @@ fail:
Soft keywords don't have this limitation if used in a context other than the
one where they are defined as keywords:
```
```pycon
>>> match = 45
>>> foo(match="Yeah!")
```
@ -621,7 +620,7 @@ argument names.
You can get a list of all keywords defined in the grammar from Python:
```
```pycon
>>> import keyword
>>> keyword.kwlist
['False', 'None', 'True', 'and', 'as', 'assert', 'async', 'await', 'break',
@ -632,7 +631,7 @@ You can get a list of all keywords defined in the grammar from Python:
as well as soft keywords:
```
```pycon
>>> import keyword
>>> keyword.softkwlist
['_', 'case', 'match']
@ -798,7 +797,7 @@ Check the contents of these files to know which is the best place for new
tests, depending on the nature of the new feature you are adding.
Tests for the parser generator itself can be found in the
[test_peg_generator](../Lib/test_peg_generator) directory.
[test_peg_generator](../Lib/test/test_peg_generator) directory.
Debugging generated parsers
@ -816,14 +815,14 @@ For this reason it is a good idea to experiment first by generating a Python
parser. To do this, you can go to the [Tools/peg_generator](../Tools/peg_generator)
directory of the CPython repository and manually call the parser generator by executing:
```
```shell
$ python -m pegen python <PATH TO YOUR GRAMMAR FILE>
```
This will generate a file called `parse.py` in the same directory that you
can use to parse some input:
```
```shell
$ python parse.py file_with_source_code_to_test.py
```
@ -848,7 +847,7 @@ can be a bit hard to understand at first.
To activate verbose mode you can add the `-d` flag when executing Python:
```
```shell
$ python -d file_to_test.py
```

View file

@ -2,6 +2,7 @@
*Interned* strings are conceptually part of an interpreter-global
*set* of interned strings, meaning that:
- no two interned strings have the same content (across an interpreter);
- two interned strings can be safely compared using pointer equality
(Python `is`).
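For example, with `sys.intern` (the second string is constructed at runtime, so the two objects are identical only because of interning):

```pycon
>>> import sys
>>> a = sys.intern("hello world!")
>>> b = sys.intern("".join(["hello", " ", "world!"]))
>>> a is b  # same content, so the same interned object
True
```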
@ -61,6 +62,7 @@ if it's interned and mortal it needs extra processing in
The converse is not true: interned strings can be mortal.
For mortal interned strings:
- the 2 references from the interned dict (key & value) are excluded from
their refcount
- the deallocator (`unicode_dealloc`) removes the string from the interned dict
@ -90,6 +92,7 @@ modify in place.
The functions take ownership of (“steal”) the reference to their argument,
and update the argument with a *new* reference.
This means:
- They're “reference neutral”.
- They must not be called with a borrowed reference.