gh-138122: Allow tachyon to write and read binary output (#142730)

2025-12-23 09:19:18 +00:00 · 2025-12-22 23:57:20 +00:00 · 2025-12-22 23:57:20 +00:00 · 9e51301234
commit 9e51301234
parent 714037ba84
30 changed files with 6554 additions and 134 deletions
--- a/InternalDocs/profiling_binary_format.md
+++ b/InternalDocs/profiling_binary_format.md
@ -0,0 +1,489 @@
+# Profiling Binary Format
+
+The profiling module includes a binary file format for storing sampling
+profiler data. This document describes the format's structure and the
+design decisions behind it.
+
+The implementation is in
+[`Modules/_remote_debugging/binary_io_writer.c`](../Modules/_remote_debugging/binary_io_writer.c)
+and [`Modules/_remote_debugging/binary_io_reader.c`](../Modules/_remote_debugging/binary_io_reader.c),
+with declarations in
+[`Modules/_remote_debugging/binary_io.h`](../Modules/_remote_debugging/binary_io.h).
+
+## Overview
+
+The sampling profiler can generate enormous amounts of data. A typical
+profiling session sampling at 1000 Hz for 60 seconds produces 60,000 samples.
+Each sample contains a full call stack, often 20-50 frames deep, and each
+frame includes a filename, function name, and line number. In a text-based
+format like collapsed stacks, this would mean repeating the same long file
+paths and function names thousands of times.
+
+The binary format addresses this through two key strategies:
+
+1. **Deduplication**: Strings and frames are stored once in lookup tables,
+   then referenced by small integer indices. A 100-character file path that
+   appears in 50,000 samples is stored once, not 50,000 times.
+
+2. **Compact encoding**: Variable-length integers (varints) encode small
+   values in fewer bytes. Since most indices are small (under 128), they
+   typically need only one byte instead of four.
+
+Together with optional zstd compression, these techniques reduce file sizes
+by 10-50x compared to text formats while also enabling faster I/O.
+
+## File Layout
+
+The file consists of five sections:
+
+```
+------------------+  Offset 0
+|     Header       |  64 bytes (fixed)
+------------------+  Offset 64
+|                  |
+|   Sample Data    |  Variable size (optionally compressed)
+|                  |
+------------------+  string_table_offset
+|   String Table   |  Variable size
+------------------+  frame_table_offset
+|   Frame Table    |  Variable size
+------------------+  file_size - 32
+|     Footer       |  32 bytes (fixed)
+------------------+  file_size
+```
+
+The layout is designed for streaming writes during profiling. The profiler
+cannot know in advance how many unique strings or frames will be encountered,
+so these tables must be built incrementally and written at the end.
+
+The header comes first so readers can quickly validate the file and locate
+the metadata tables. The sample data follows immediately, allowing the writer
+to stream samples directly to disk (or through a compression stream) without
+buffering the entire dataset in memory.
+
+The string and frame tables are placed after sample data because they grow
+as new unique entries are discovered during profiling. By deferring their
+output until finalization, the writer avoids the complexity of reserving
+space or rewriting portions of the file.
+
+The footer at the end contains counts needed to allocate arrays before
+parsing the tables. Placing it at a fixed offset from the end (rather than
+at a variable offset recorded in the header) means readers can locate it
+with a single seek to `file_size - 32`, without first reading the header.
+
+## Header
+
+```
+ Offset   Size   Type      Description
+--------+------+---------+----------------------------------------+
+|    0   |  4   | uint32  | Magic number (0x54414348 = "TACH")     |
+|    4   |  4   | uint32  | Format version                         |
+|    8   |  4   | bytes   | Python version (major, minor, micro,   |
+|        |      |         | reserved)                              |
+|   12   |  8   | uint64  | Start timestamp (microseconds)         |
+|   20   |  8   | uint64  | Sample interval (microseconds)         |
+|   28   |  4   | uint32  | Total sample count                     |
+|   32   |  4   | uint32  | Thread count                           |
+|   36   |  8   | uint64  | String table offset                    |
+|   44   |  8   | uint64  | Frame table offset                     |
+|   52   |  4   | uint32  | Compression type (0=none, 1=zstd)      |
+|   56   |  8   | bytes   | Reserved (zero-filled)                 |
+--------+------+---------+----------------------------------------+
+```
+
+The magic number `0x54414348` ("TACH" for Tachyon) identifies the file format
+and also serves as an **endianness marker**. When read on a system with
+different byte order than the writer, it appears as `0x48434154`. The reader
+uses this to detect cross-endian files and automatically byte-swap all
+multi-byte integer fields.
+
+The Python version field records the major, minor, and micro version numbers
+of the Python interpreter that generated the file. This allows analysis tools
+to detect version mismatches when replaying data collected on a different
+Python version, which may have different internal structures or behaviors.
+
+The header is written as zeros initially, then overwritten with actual values
+during finalization. This requires the output stream to be seekable, which
+is acceptable since the format targets regular files rather than pipes or
+network streams.
+
+## Sample Data
+
+Sample data begins at offset 64 and extends to `string_table_offset`. Samples
+use delta compression to minimize redundancy when consecutive samples from the
+same thread have identical or similar call stacks.
+
+### Stack Encoding Types
+
+Each sample record begins with thread identification, then an encoding byte:
+
+| Code | Name | Description |
+|------|------|-------------|
+| 0x00 | REPEAT | RLE: identical stack repeated N times |
+| 0x01 | FULL | Complete stack (first sample or no match) |
+| 0x02 | SUFFIX | Shares N frames from bottom of previous stack |
+| 0x03 | POP_PUSH | Remove M frames from top, add N new frames |
+
+### Record Formats
+
+**REPEAT (0x00) - Run-Length Encoded Identical Stacks:**
+```
+-----------------+-----------+----------------------------------------+
+| thread_id       | 8 bytes   | Thread identifier (uint64, fixed)      |
+| interpreter_id  | 4 bytes   | Interpreter ID (uint32, fixed)         |
+| encoding        | 1 byte    | 0x00 (REPEAT)                          |
+| count           | varint    | Number of samples in this RLE group    |
+| samples         | varies    | Interleaved: [delta: varint, status: 1]|
+|                 |           | repeated count times                   |
+-----------------+-----------+----------------------------------------+
+```
+The stack is inherited from this thread's previous sample. Each sample in the
+group gets its own timestamp delta and status byte, stored as interleaved pairs
+(delta1, status1, delta2, status2, ...) rather than separate arrays.
+
+**FULL (0x01) - Complete Stack:**
+```
+-----------------+-----------+----------------------------------------+
+| thread_id       | 8 bytes   | Thread identifier (uint64, fixed)      |
+| interpreter_id  | 4 bytes   | Interpreter ID (uint32, fixed)         |
+| encoding        | 1 byte    | 0x01 (FULL)                            |
+| timestamp_delta | varint    | Microseconds since thread's last sample|
+| status          | 1 byte    | Thread state flags                     |
+| stack_depth     | varint    | Number of frames in call stack         |
+| frame_indices   | varint[]  | Array of frame table indices           |
+-----------------+-----------+----------------------------------------+
+```
+Used for the first sample from a thread, or when delta encoding would not
+provide savings.
+
+**SUFFIX (0x02) - Shared Suffix Match:**
+```
+-----------------+-----------+----------------------------------------+
+| thread_id       | 8 bytes   | Thread identifier (uint64, fixed)      |
+| interpreter_id  | 4 bytes   | Interpreter ID (uint32, fixed)         |
+| encoding        | 1 byte    | 0x02 (SUFFIX)                          |
+| timestamp_delta | varint    | Microseconds since thread's last sample|
+| status          | 1 byte    | Thread state flags                     |
+| shared_count    | varint    | Frames shared from bottom of prev stack|
+| new_count       | varint    | New frames at top of stack             |
+| new_frames      | varint[]  | Array of new_count frame indices       |
+-----------------+-----------+----------------------------------------+
+```
+Used when a function call added frames to the top of the stack. The shared
+frames from the previous stack are kept, and new frames are prepended.
+
+**POP_PUSH (0x03) - Pop and Push:**
+```
+-----------------+-----------+----------------------------------------+
+| thread_id       | 8 bytes   | Thread identifier (uint64, fixed)      |
+| interpreter_id  | 4 bytes   | Interpreter ID (uint32, fixed)         |
+| encoding        | 1 byte    | 0x03 (POP_PUSH)                        |
+| timestamp_delta | varint    | Microseconds since thread's last sample|
+| status          | 1 byte    | Thread state flags                     |
+| pop_count       | varint    | Frames to remove from top of prev stack|
+| push_count      | varint    | New frames to add at top               |
+| new_frames      | varint[]  | Array of push_count frame indices      |
+-----------------+-----------+----------------------------------------+
+```
+Used when the code path changed: some frames were popped (function returns)
+and new frames were pushed (different function calls).
+
+### Thread and Interpreter Identification
+
+Thread IDs are 64-bit values that can be large (memory addresses on some
+platforms) and vary unpredictably. Using a fixed 8-byte encoding avoids
+the overhead of varint encoding for large values and simplifies parsing
+since the reader knows exactly where each field begins.
+
+The interpreter ID identifies which Python sub-interpreter the thread
+belongs to, allowing analysis tools to separate activity across interpreters
+in processes using multiple sub-interpreters.
+
+### Status Byte
+
+The status byte is a bitfield encoding thread state at sample time:
+
+| Bit | Flag                  | Meaning                                    |
+|-----|-----------------------|--------------------------------------------|
+|  0  | THREAD_STATUS_HAS_GIL | Thread holds the GIL (Global Interpreter Lock) |
+|  1  | THREAD_STATUS_ON_CPU  | Thread is actively running on a CPU core   |
+|  2  | THREAD_STATUS_UNKNOWN | Thread state could not be determined       |
+|  3  | THREAD_STATUS_GIL_REQUESTED | Thread is waiting to acquire the GIL  |
+|  4  | THREAD_STATUS_HAS_EXCEPTION | Thread has a pending exception         |
+
+Multiple flags can be set simultaneously (e.g., a thread can hold the GIL
+while also running on CPU). Analysis tools use these to filter samples or
+visualize thread states over time.
+
+### Timestamp Delta Encoding
+
+Timestamps use delta encoding rather than absolute values. Absolute
+timestamps in microseconds require 8 bytes each, but consecutive samples
+from the same thread are typically separated by the sampling interval
+(e.g., 1000 microseconds), so the delta between them is small and fits
+in 1-2 varint bytes. The writer tracks the previous timestamp for each
+thread separately. The first sample from a thread encodes its delta from
+the profiling start time; subsequent samples encode the delta from that
+thread's previous sample. This per-thread tracking is necessary because
+samples are interleaved across threads in arrival order, not grouped by
+thread.
+
+For REPEAT (RLE) records, timestamp deltas and status bytes are stored as
+interleaved pairs (delta, status, delta, status, ...) - one pair per
+repeated sample - allowing efficient batching while preserving the exact
+timing and state of each sample.
+
+### Frame Indexing
+
+Each frame in a call stack is represented by an index into the frame table
+rather than inline data. This provides massive space savings because call
+stacks are highly repetitive: the same function appears in many samples
+(hot functions), call stacks often share common prefixes (main -> app ->
+handler -> ...), and recursive functions create repeated frame sequences.
+A frame index is typically 1-2 varint bytes. Inline frame data would be
+20-200+ bytes (two strings plus a line number). For a profile with 100,000
+samples averaging 30 frames each, this reduces frame data from potentially
+gigabytes to tens of megabytes.
+
+Frame indices are written innermost-first (the currently executing frame
+has index 0 in the array). This ordering works well with delta compression:
+function calls typically add frames at the top (index 0), while shared
+frames remain at the bottom.
+
+## String Table
+
+The string table stores deduplicated UTF-8 strings (filenames and function
+names). It begins at `string_table_offset` and contains entries in order of
+their assignment during writing:
+
+```
+----------------+
+| length: varint |
+| data: bytes    |
+----------------+  (repeated for each string)
+```
+
+Strings are stored in the order they were first encountered during writing.
+The first unique filename gets index 0, the second gets index 1, and so on.
+Length-prefixing (rather than null-termination) allows strings containing
+null bytes and enables readers to allocate exact-sized buffers. The varint
+length encoding means short strings (under 128 bytes) need only one length
+byte.
+
+## Frame Table
+
+The frame table stores deduplicated frame entries:
+
+```
+----------------------+
+| filename_idx: varint |
+| funcname_idx: varint |
+| lineno: svarint      |
+----------------------+  (repeated for each frame)
+```
+
+Each unique (filename, funcname, lineno) combination gets one entry. Two
+calls to the same function at different line numbers produce different
+frame entries; two calls at the same line number share one entry.
+
+Strings and frames are deduplicated separately because they have different
+cardinalities and reference patterns. A codebase might have hundreds of
+unique source files but thousands of unique functions. Many functions share
+the same filename, so storing the filename index in each frame entry (rather
+than the full string) provides an additional layer of deduplication. A frame
+entry is just three varints (typically 3-6 bytes) rather than two full
+strings plus a line number.
+
+Line numbers use signed varint (zigzag encoding) rather than unsigned to
+handle edge cases. Synthetic frames—generated frames that don't correspond
+directly to Python source code, such as C extension boundaries or internal
+interpreter frames—use line number 0 or -1 to indicate the absence of a
+source location. Zigzag encoding ensures these small negative values encode
+efficiently (−1 becomes 1, which is one byte) rather than requiring the
+maximum varint length.
+
+## Footer
+
+```
+ Offset   Size   Type      Description
+--------+------+---------+----------------------------------------+
+|    0   |  4   | uint32  | String count                           |
+|    4   |  4   | uint32  | Frame count                            |
+|    8   |  8   | uint64  | Total file size                        |
+|   16   | 16   | bytes   | Checksum (reserved, currently zeros)   |
+--------+------+---------+----------------------------------------+
+```
+
+The string and frame counts allow readers to pre-allocate arrays of the
+correct size before parsing the tables. Without these counts, readers would
+need to either scan the tables twice (once to count, once to parse) or use
+dynamically-growing arrays.
+
+The file size field provides a consistency check: if the actual file size
+does not match, the file may be truncated or corrupted.
+
+The checksum field is reserved for future use. A checksum would allow
+detection of corruption but adds complexity and computation cost. The
+current implementation leaves this as zeros.
+
+## Variable-Length Integer Encoding
+
+The format uses LEB128 (Little Endian Base 128) for unsigned integers and
+zigzag + LEB128 for signed integers. These encodings are widely used
+(Protocol Buffers, DWARF debug info, WebAssembly) and well-understood.
+
+### Unsigned Varint (LEB128)
+
+Each byte stores 7 bits of data. The high bit indicates whether more bytes
+follow:
+
+```
+Value        Encoded bytes
+0-127        [0xxxxxxx]                    (1 byte)
+128-16383    [1xxxxxxx] [0xxxxxxx]         (2 bytes)
+16384+       [1xxxxxxx] [1xxxxxxx] ...     (3+ bytes)
+```
+
+Most indices in profiling data are small. A profile with 1000 unique frames
+needs at most 2 bytes per frame index. The common case (indices under 128)
+needs only 1 byte.
+
+### Signed Varint (Zigzag)
+
+Standard LEB128 encodes −1 as a very large unsigned value, requiring many
+bytes. Zigzag encoding interleaves positive and negative values:
+
+```
+ 0 -> 0    -1 -> 1     1 -> 2    -2 -> 3     2 -> 4
+```
+
+This ensures small-magnitude values (whether positive or negative) encode
+in few bytes.
+
+## Compression
+
+When compression is enabled, the sample data region contains a zstd stream.
+The string table, frame table, and footer remain uncompressed so readers can
+access metadata without decompressing the entire file. A tool that only needs
+to report "this file contains 50,000 samples of 3 threads" can read the header
+and footer without touching the compressed sample data. This also simplifies
+the format: the header's offset fields point directly to the tables rather
+than to positions within a decompressed stream.
+
+Zstd provides an excellent balance of compression ratio and speed. Profiling
+data compresses very well (often 5-10x) due to repetitive patterns: the same
+small set of frame indices appears repeatedly, and delta-encoded timestamps
+cluster around the sampling interval. Zstd's streaming API allows compression
+without buffering the entire dataset. The writer feeds sample data through
+the compressor incrementally, flushing compressed chunks to disk as they
+become available.
+
+Level 5 compression is used as a default. Lower levels (1-3) are faster but
+compress less; higher levels (6+) compress more but slow down writing. Level
+5 provides good compression with minimal impact on profiling overhead.
+
+## Reading and Writing
+
+### Writing
+
+1. Open the output file and write 64 zero bytes as a placeholder header
+2. Initialize empty string and frame dictionaries for deduplication
+3. For each sample:
+   - Intern any new strings, assigning sequential indices
+   - Intern any new frames, assigning sequential indices
+   - Encode the sample record and write to the buffer
+   - Flush the buffer through compression (if enabled) when full
+4. Flush remaining buffered data and finalize compression
+5. Write the string table (length-prefixed strings in index order)
+6. Write the frame table (varint-encoded entries in index order)
+7. Write the footer with final counts
+8. Seek to offset 0 and write the header with actual values
+
+The writer maintains two dictionaries: one mapping strings to indices, one
+mapping (filename_idx, funcname_idx, lineno) tuples to frame indices. These
+enable O(1) lookup during interning.
+
+### Reading
+
+1. Read the header magic number to detect endianness (set `needs_swap` flag
+   if the magic appears byte-swapped)
+2. Validate version and read remaining header fields (byte-swapping if needed)
+3. Seek to end − 32 and read the footer (byte-swapping counts if needed)
+4. Allocate string array of `string_count` elements
+5. Parse the string table, populating the array
+6. Allocate frame array of `frame_count * 3` uint32 elements
+7. Parse the frame table, populating the array
+8. If compressed, decompress the sample data region
+9. Iterate through samples, resolving indices to strings/frames
+   (byte-swapping thread_id and interpreter_id if needed)
+
+The reader builds lookup arrays rather than dictionaries since it only needs
+index-to-value mapping, not value-to-index.
+
+## Platform Considerations
+
+### Byte Ordering and Cross-Platform Portability
+
+The binary format uses **native byte order** for all multi-byte integer
+fields when writing. However, the reader supports **cross-endian reading**:
+files written on a little-endian system (x86, ARM) can be read on a
+big-endian system (s390x, PowerPC), and vice versa.
+
+The magic number doubles as an endianness marker. When read on a system with
+different byte order, it appears byte-swapped (`0x48434154` instead of
+`0x54414348`). The reader detects this and automatically byte-swaps all
+fixed-width integer fields during parsing.
+
+Writers must use `memcpy()` from properly-sized integer types when writing
+fixed-width integer fields. When the source variable's type differs from the
+field width (e.g., `size_t` written as 4 bytes), explicit casting to the
+correct type (e.g., `uint32_t`) is required before `memcpy()`. On big-endian
+systems, copying from an oversized type would copy the wrong bytes—high-order
+zeros instead of the actual value.
+
+The reader tracks whether byte-swapping is needed via a `needs_swap` flag set
+during header parsing. All fixed-width fields in the header, footer, and
+sample data are conditionally byte-swapped using Python's internal byte-swap
+functions (`_Py_bswap32`, `_Py_bswap64` from `pycore_bitutils.h`).
+
+Variable-length integers (varints) are byte-order independent since they
+encode values one byte at a time using the LEB128 scheme, so they require
+no special handling for cross-endian reading.
+
+### Memory-Mapped I/O
+
+On Unix systems (Linux, macOS), the reader uses `mmap()` to map the file
+into the process address space. The kernel handles paging data in and out
+as needed, no explicit read() calls or buffer management are required,
+multiple readers can share the same physical pages, and sequential access
+patterns benefit from kernel read-ahead.
+
+The implementation uses `madvise()` to hint the access pattern to the kernel:
+`MADV_SEQUENTIAL` indicates the file will be read linearly, enabling
+aggressive read-ahead. `MADV_WILLNEED` requests pre-faulting of pages.
+On Linux, `MAP_POPULATE` pre-faults all pages at mmap time rather than on
+first access, moving page fault overhead from the parsing loop to the
+initial mapping for more predictable performance. For large files (over
+32 MB), `MADV_HUGEPAGE` requests transparent huge pages (2 MB instead of
+4 KB) to reduce TLB pressure when accessing large amounts of data.
+
+On Windows, the implementation falls back to standard file I/O with full
+file buffering. Profiling data files are typically small enough (tens to
+hundreds of megabytes) that this is acceptable.
+
+The writer uses a 512 KB buffer to batch small writes. Each sample record
+is typically tens of bytes; writing these individually would incur excessive
+syscall overhead. The buffer accumulates data until full, then flushes in
+one write() call (or feeds through the compression stream).
+
+## Future Considerations
+
+The format reserves space for future extensions. The 12 reserved bytes in
+the header could hold additional metadata. The 16-byte checksum field in
+the footer is currently unused. The version field allows incompatible
+changes with graceful rejection. New compression types could be added
+(compression_type > 1).
+
+Any changes that alter the meaning of existing fields or the parsing logic
+should increment the version number to prevent older readers from
+misinterpreting new files.