Add internal docs about the free-threaded GC. (gh-132562)
This commit is contained in:
parent
e414a2d81c
commit
31a500a92b
1 changed file with 181 additions and 9 deletions
@@ -477,17 +477,24 @@ specifically in a generation by calling `gc.collect(generation=NUM)`.

Optimization: excluding reachable objects
=========================================

An object cannot be garbage if it can be reached. To avoid having to identify
reference cycles across the whole heap, we can reduce the amount of work done
considerably by first identifying objects reachable from objects known to be
alive. These objects are excluded from the normal cyclic detection process.

The default and free-threaded builds both implement this optimization, but in
slightly different ways.

Finding reachable objects for the default build GC
--------------------------------------------------

This works by first moving most reachable objects to the `visited` space.
Empirically, most reachable objects can be reached from a small set of global
objects and local variables. This step does much less work per object, so it
reduces the time spent performing garbage collection by at least half.

> [!NOTE]
> Objects that are not determined to be reachable by this pass are not necessarily

@@ -515,6 +522,171 @@ increment. All objects directly referred to from those stack frames are

added to the working set.
Then the above algorithm is repeated, starting from step 2.

Finding reachable objects for the free-threaded GC
--------------------------------------------------

Within the `gc_free_threading.c` implementation, this is known as the "mark
alive" pass or phase. It is similar in concept to what is done for the default
build GC. Rather than moving objects between doubly linked lists, the
free-threaded GC uses a flag in `ob_gc_bits` to track whether an object has
been found to be definitely alive (not garbage).

To find objects reachable from known-alive objects, known as the "roots", the
`gc_mark_alive_from_roots()` function is used. Root objects include
`interp->sysdict` (the `sys` module dictionary), `interp->builtins`, and
`interp->types`. Also included are all objects referred to by active Python
frames. These objects and the transitive closure of objects reachable from
them have `_PyGC_BITS_ALIVE` set. Any object with that bit set is excluded
from the rest of the cyclic garbage detection process, since we know it cannot
be unreachable.

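In outline, the pass is a worklist-based transitive closure over object
references. The following is a minimal, self-contained sketch of that idea,
not the actual `gc_free_threading.c` code: the `Obj` struct, its `refs`
array, and its `alive` field are hypothetical stand-ins for `PyObject`, the
references reported by `tp_traverse`, and the `_PyGC_BITS_ALIVE` bit in
`ob_gc_bits`.

```c
/* Minimal model of the "mark alive" pass (a sketch, not the actual
 * gc_free_threading.c code). */
#include <stdio.h>
#include <stdlib.h>

typedef struct Obj {
    int alive;            /* models the _PyGC_BITS_ALIVE flag */
    struct Obj *refs[2];  /* models references reported by tp_traverse */
    size_t nrefs;
} Obj;

/* Grow-on-demand work stack; error handling omitted for brevity. */
static void
push(Obj ***stack, size_t *top, size_t *cap, Obj *op)
{
    if (*top == *cap) {
        *cap *= 2;
        *stack = realloc(*stack, *cap * sizeof(**stack));
    }
    (*stack)[(*top)++] = op;
}

/* Set the "alive" flag on every object reachable from the roots.  Objects
 * left unmarked are the only candidates for cyclic garbage detection. */
static void
mark_alive_from_roots(Obj **roots, size_t nroots)
{
    size_t cap = 64, top = 0;
    Obj **stack = malloc(cap * sizeof(*stack));

    for (size_t i = 0; i < nroots; i++) {
        if (roots[i] != NULL && !roots[i]->alive) {
            roots[i]->alive = 1;
            push(&stack, &top, &cap, roots[i]);
        }
    }
    while (top > 0) {
        Obj *op = stack[--top];
        for (size_t i = 0; i < op->nrefs; i++) {
            Obj *ref = op->refs[i];
            if (ref != NULL && !ref->alive) {
                ref->alive = 1;   /* reachable from a root => not garbage */
                push(&stack, &top, &cap, ref);
            }
        }
    }
    free(stack);
}

int
main(void)
{
    /* root -> a;  b and c form a cycle that is unreachable from the root. */
    Obj a = {0, {NULL, NULL}, 0};
    Obj b = {0, {NULL, NULL}, 0};
    Obj c = {0, {&b, NULL}, 1};
    Obj root = {0, {&a, NULL}, 1};
    b.refs[0] = &c;
    b.nrefs = 1;

    Obj *roots[] = {&root};
    mark_alive_from_roots(roots, 1);
    printf("a=%d b=%d c=%d\n", a.alive, b.alive, c.alive);  /* a=1 b=0 c=0 */
    return 0;
}
```
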
> [!NOTE]
> If the `gc.freeze()` function has been used, this phase of the collector is
> skipped. That is done for two reasons. First, it is unlikely to be a
> performance win if most of the objects have been marked as frozen. As such,
> they would be excluded from the cyclic garbage detection without this extra
> work anyhow. Second, one of the purposes of using `gc.freeze()` is to avoid
> dirtying the memory pages holding frozen objects. If this phase were executed,
> the toggling of the `ob_gc_bits` flags would dirty pages and defeat that.

Software prefetch hinting
-------------------------

To speed up the "mark alive" phase of the free-threaded GC, an additional
optimization, known as software prefetching, is used. The GC executes
explicit prefetch CPU instructions in order to reduce the latency of loading
data from main memory. This added complexity can pay off because main memory
is so much slower to access than data already in the CPU cache. Prefetching
is enabled only if the number of long-lived objects exceeds a threshold. If
the set of objects being garbage collected is small, the accessed memory is
likely to fit entirely in the CPU cache and software prefetching is not
helpful.

The details of this optimization are intricate, with the source code being the
best reference. However, the rest of this section gives a high-level
description of how it works and explains some design decisions.

Software prefetching is used only during the "mark alive" phase of the GC,
specifically while computing the transitive closure of the "alive" status of
objects (i.e. objects reachable from known-alive objects, the roots). For each
object we find, we need to traverse all objects directly reachable from that
object. If that set of referred-to objects lies in scattered memory locations,
the hardware prefetcher is unlikely to predict the next accessed memory
location.

Making software prefetch work well hinges on a key principle: allow enough time
between issuing the prefetch instruction for a memory location and actually
accessing that location's data. We can call that time difference the prefetch
window. If the window is too large, we fill up the CPU caches with data that is
not needed yet. Worse, the data might be discarded from the cache before we
actually use it. If the window is too small, the memory system does not have
enough time to finish loading the memory and the CPU will have to wait. The
window is tuned indirectly by the prefetch buffer parameters, which are
explained next.

The prefetch buffer is a FIFO queue of fixed size. When we enqueue an object
reference into the buffer, we also issue a software prefetch instruction for
that memory location. When we dequeue an object reference from the buffer, we
assume or hope that enough time has elapsed so that the memory has been loaded
into the cache. This is the mechanism that provides the window.

When computing the transitive closure of the "alive" status, the set of objects
yet to be visited is stored in one of two places. First, they can be stored in
the prefetch buffer. Second, there is a LIFO stack of unlimited size. When
object references are found using `tp_traverse`, they are enqueued in the
buffer if it is not full; otherwise they are pushed onto the stack.

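The enqueue side of this arrangement might look roughly like the sketch below.
It is a simplified, hypothetical illustration rather than the real
implementation: `BUFFER_LEN`, `PrefetchBuffer`, `WorkStack`, `prefetch_hint()`,
and `add_reference()` are invented names, and the prefetch itself is shown
using the GCC/Clang `__builtin_prefetch` intrinsic.

```c
/* Sketch of the fixed-size FIFO prefetch buffer plus the unlimited-size
 * LIFO overflow stack (hypothetical names, not the actual GC code). */
#include <stddef.h>
#include <stdlib.h>

#define BUFFER_LEN 256          /* maximum buffer length used in the sketch */

typedef struct {
    void *items[BUFFER_LEN];    /* ring buffer of object pointers */
    size_t head, tail, len;
} PrefetchBuffer;

typedef struct {
    void **items;               /* grow-on-demand LIFO stack */
    size_t len, cap;
} WorkStack;

static void
prefetch_hint(void *ptr)
{
    /* Ask the CPU to start loading this address; does not dereference ptr. */
#if defined(__GNUC__) || defined(__clang__)
    __builtin_prefetch(ptr);
#endif
}

/* Record a newly found object reference.  A prefetch is issued on enqueue;
 * if the buffer is full, the reference goes onto the stack instead. */
static void
add_reference(PrefetchBuffer *buf, WorkStack *stack, void *obj)
{
    if (buf->len < BUFFER_LEN) {
        prefetch_hint(obj);
        buf->items[buf->tail] = obj;
        buf->tail = (buf->tail + 1) % BUFFER_LEN;
        buf->len++;
    }
    else {
        if (stack->len == stack->cap) {   /* grow; error handling omitted */
            stack->cap = stack->cap ? stack->cap * 2 : 64;
            stack->items = realloc(stack->items, stack->cap * sizeof(void *));
        }
        stack->items[stack->len++] = obj;
    }
}
```
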
We must take special care not to access the memory referred to by an object
pointer until we take that reference out of the prefetch buffer. That means we
cannot check whether an object is a "container" (whether the `PyObject_IS_GC()`
test is true) or whether the object already has the "alive" flag set; both of
those tests require accessing the object's memory. There are some additional
elaborations that try to keep the buffer filled to the optimal size, but to
keep things simple we omit those details here. Also not discussed in detail
are "spans", which help reduce the amount of stack used and make it easier to
control the size of the buffer.

As mentioned, the prefetch window is the time delay between the issue of the
prefetch instruction, on buffer enqueue, and the memory access, after buffer
dequeue. It is tuned by adjusting some buffer parameters. If processing time
were equal for every object, the buffer length would be proportional to the
window. Since processing each object actually takes a variable amount of time,
the relationship between the buffer length and the prefetch window is only
approximate. However, this proportional relationship is assumed to hold, and
we avoid the overhead of actually measuring object processing times.

The relevant parameters are the maximum buffer size and the low and high
thresholds for filling. The buffer parameters are set as follows: the maximum
length is 256, the low threshold is 8, and the high threshold is 16. They are
used as follows. If the buffer has reached its maximum size, new object
pointers found while following references are pushed onto the stack rather
than put in the buffer. When dequeuing objects from the buffer, we "prime" the
buffer if its current length drops below the low threshold. Priming means
popping objects from the stack and enqueuing them into the buffer. While
priming, we fill the buffer only until the high threshold is reached.

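Continuing the hypothetical sketch above (reusing its `PrefetchBuffer`,
`WorkStack`, `BUFFER_LEN`, and `prefetch_hint()` definitions), the dequeue and
priming logic might be written as follows. `BUFFER_LOW`, `BUFFER_HIGH`,
`prime_buffer()`, `process_worklist()`, and `visit_object()` are likewise
invented names; only the threshold values (256/8/16) come from the text above.

```c
#define BUFFER_LOW   8          /* prime when the buffer gets this short */
#define BUFFER_HIGH  16         /* stop priming once this length is reached */

/* Defined elsewhere in the sketch: dereferences obj, performs the
 * "is it a GC container?" and "already alive?" checks, and feeds the
 * references it finds back through add_reference(). */
void visit_object(void *obj, PrefetchBuffer *buf, WorkStack *stack);

/* Move objects from the stack into the buffer until BUFFER_HIGH is reached
 * (or the stack is empty), issuing a prefetch for each one. */
static void
prime_buffer(PrefetchBuffer *buf, WorkStack *stack)
{
    while (buf->len < BUFFER_HIGH && stack->len > 0) {
        void *obj = stack->items[--stack->len];
        prefetch_hint(obj);
        buf->items[buf->tail] = obj;
        buf->tail = (buf->tail + 1) % BUFFER_LEN;
        buf->len++;
    }
}

/* Main processing loop.  An object's memory is touched only after it has
 * been dequeued; by then the prefetch issued on enqueue has (hopefully)
 * already pulled the object's cache line into the CPU cache. */
static void
process_worklist(PrefetchBuffer *buf, WorkStack *stack)
{
    while (buf->len > 0 || stack->len > 0) {
        if (buf->len < BUFFER_LOW) {
            prime_buffer(buf, stack);
        }
        void *obj = buf->items[buf->head];
        buf->head = (buf->head + 1) % BUFFER_LEN;
        buf->len--;
        visit_object(obj, buf, stack);
    }
}
```
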
To measure the effectiveness of the buffer, some benchmark programs were run
with a trace log of memory location prefetch and access instructions. The
prefetch window for each object processed was computed from the trace log.
Each enqueue and dequeue operation was treated as taking one unit of time, and
the time to actually process the object was assumed to be zero. A histogram of
the windows is shown below. These traces suggest the buffer length is mostly
being kept between the low and high thresholds, which is what we want.
Variations of the buffer parameters were benchmarked and the best-performing
parameters were chosen. Obviously, it is unlikely these parameters will be
optimal for all hardware and programs.

```
Prefetch window stats
mean     52.1
median   14.0
max     256
  25.60 | 65,304 | ******************************
  51.20 |  5,590 | **
  76.80 |  3,562 | *
 102.40 |  2,683 | *
 128.00 |  2,278 | *
 153.60 |  2,285 | *
 179.20 |  2,377 | *
 204.80 |  2,238 | *
 230.40 |  2,753 | *
 256.00 |  5,930 | **
-------- |------- | -------
N=       | 95,000 |
```

Using software prefetch instructions is only a win if the set of objects being
examined by the GC does not fit into CPU caches. Otherwise, using the buffer
and prefetch instructions is just overhead. The long-lived object count seems
to be a good estimate of whether things will fit in the cache. On 64-bit
platforms, the minimum object size is 32 bytes, so a 4 MB L2 cache would hold
about 130,000 objects.

The current threshold for enabling prefetching is that the number of long-lived
objects must exceed 200,000. Based on benchmarking, this seems to be in the
range where prefetching becomes a net gain. Obviously, it depends on hardware
details and also on the "shape" of the object graph. For example, your object
graph may be constructed by linearly allocating objects in memory. Then,
traversing the object graph might naturally result in mostly ordered memory
accesses. In that case, the hardware prefetcher is likely to do a nearly
perfect job without any software prefetch hints.

This optimization, as of March 2025, was tuned on the following hardware
platforms:

- Apple M3 Pro, 32 GB RAM, 192+128 KB L1, 16 MB L2, compiled with Clang 19
- AMD Ryzen 5 7600X, 64 GB RAM, 384 KB L1, 6 MB L2, 32 MB L3, compiled with GCC 12.2.0

Benchmarking the effectiveness of this optimization is particularly difficult.
It depends both on hardware details, like CPU cache sizes and memory latencies,
and on the specifics of the program's memory access patterns: where objects are
located in memory and in what order they are accessed during the mark alive
phase. When the program's memory access patterns are favourable (a working set
of data larger than the CPU cache, and objects allocated in such a way that the
access order is not linear), the speedup from using software prefetching is in
the range of 20% to 40% for the entire full GC collection.


Optimization: reusing fields to save memory
===========================================