## Summary

This PR updates the entire parser stack in multiple ways:

### Make the lexer lazy

* https://github.com/astral-sh/ruff/pull/11244
* https://github.com/astral-sh/ruff/pull/11473

Previously, Ruff's lexer would act as an iterator. The parser would collect all the tokens in a vector first and then process the tokens to create the syntax tree.

The first task in this project is to update the entire parsing flow to make the lexer lazy. This includes the `Lexer`, `TokenSource`, and `Parser`. For context, the `TokenSource` is a wrapper around the `Lexer` to filter out the trivia tokens[^1]. Now, the parser will ask the token source to get the next token and only then will the lexer continue and emit the token. This means that the lexer needs to be aware of the "current" token. When `next_token` is called, the current token will be updated with the newly lexed token.

The main motivation to make the lexer lazy is to allow re-lexing a token in a different context. This is going to be really useful for making the parser error resilient. For example, currently the emitted tokens remain the same even if the parser can recover from an unclosed parenthesis. This is important because the lexer emits a `NonLogicalNewline` in a parenthesized context and a normal `Newline` in a non-parenthesized context. These different kinds of newline are also used to emit the indentation tokens, which are important for the parser as they are used to determine the start and end of a block.

Additionally, this allows us to implement the following functionalities:

1. Checkpoint and rewind infrastructure: The idea here is to create a checkpoint and continue lexing. At a later point, this checkpoint can be used to rewind the lexer back to the provided checkpoint.
2. Remove the `SoftKeywordTransformer` and instead use lookahead or speculative parsing to determine whether a soft keyword is a keyword or an identifier.
3. Remove the `Tok` enum. The `Tok` enum represents the tokens emitted by the lexer, but it contains owned data, which makes it expensive to clone. The new `TokenKind` enum just represents the type of token, which is very cheap to copy. This raises the question of how the parser will get the owned value which was stored on `Tok`. This will be solved by introducing a new `TokenValue` enum which only covers the subset of token kinds that carry an owned value. This is stored on the lexer and is requested by the parser when it wants to process the data.
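The lazy flow, the kind/value split, and checkpoint-based rewinding can be sketched roughly as follows. This is a toy illustration only, not Ruff's implementation: the token set, the `Checkpoint` struct, and the method names `current_kind`, `take_value`, `checkpoint`, and `rewind` are assumptions made for the example.

```rust
/// Cheap, copyable description of a token (no owned data).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum TokenKind { Name, Int, Lpar, Rpar, EndOfFile }

/// Owned payload, present only for the kinds that carry one.
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum TokenValue { None, Name(String), Int(i64) }

/// Snapshot of the lexer state (hypothetical layout).
#[derive(Clone, Debug)]
pub struct Checkpoint { offset: usize, current: TokenKind, value: TokenValue }

pub struct Lexer<'src> {
    source: &'src str,
    offset: usize, // byte offset just past the current token
    current: TokenKind,
    value: TokenValue,
}

impl<'src> Lexer<'src> {
    pub fn new(source: &'src str) -> Self {
        let mut lexer = Lexer { source, offset: 0, current: TokenKind::EndOfFile, value: TokenValue::None };
        lexer.next_token(); // lex only the first token up front
        lexer
    }

    pub fn current_kind(&self) -> TokenKind { self.current }

    /// Hand the owned value of the current token to the caller.
    pub fn take_value(&mut self) -> TokenValue {
        std::mem::replace(&mut self.value, TokenValue::None)
    }

    /// Lex exactly one more token and make it the current one.
    pub fn next_token(&mut self) -> TokenKind {
        let source: &'src str = self.source;
        let rest = &source[self.offset..];
        let trimmed = rest.trim_start();
        self.offset += rest.len() - trimmed.len();
        let is_delimiter = |c: char| c.is_whitespace() || c == '(' || c == ')';
        let (kind, value, len) = match trimmed.chars().next() {
            None => (TokenKind::EndOfFile, TokenValue::None, 0),
            Some('(') => (TokenKind::Lpar, TokenValue::None, 1),
            Some(')') => (TokenKind::Rpar, TokenValue::None, 1),
            Some(c) if c.is_ascii_digit() => {
                let len = trimmed.find(|c: char| !c.is_ascii_digit()).unwrap_or(trimmed.len());
                // Toy simplification: an overflowing literal becomes 0.
                let number = trimmed[..len].parse::<i64>().unwrap_or(0);
                (TokenKind::Int, TokenValue::Int(number), len)
            }
            Some(_) => {
                // Everything else is treated as a name in this sketch.
                let len = trimmed.find(is_delimiter).unwrap_or(trimmed.len());
                (TokenKind::Name, TokenValue::Name(trimmed[..len].to_string()), len)
            }
        };
        self.offset += len;
        self.current = kind;
        self.value = value;
        kind
    }

    pub fn checkpoint(&self) -> Checkpoint {
        Checkpoint { offset: self.offset, current: self.current, value: self.value.clone() }
    }

    /// Restore a previously captured state so lexing can resume from it.
    pub fn rewind(&mut self, checkpoint: Checkpoint) {
        self.offset = checkpoint.offset;
        self.current = checkpoint.current;
        self.value = checkpoint.value;
    }
}
```

A parser built on such a lexer could take a checkpoint at a soft keyword, look ahead speculatively, and rewind if the keyword interpretation does not work out.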
ruff-fuzz
Fuzzers and associated utilities for automatic testing of Ruff.
Usage
To use the fuzzers provided in this directory, start by invoking:
./fuzz/init-fuzzer.sh
This will install `cargo-fuzz` and optionally download a dataset which improves the efficacy of the testing.
This step is necessary for initialising the corpus directory, as all fuzzers share a common
corpus.
The dataset may take several hours to download and clean, so if you're just looking to try out the
fuzzers, skip the dataset download. Be warned, though, that some features simply cannot be tested
without it, as the fuzzer is very unlikely to generate valid Python code from "thin air".
Once you have initialised the fuzzers, you can then execute any fuzzer with:
cargo fuzz run -s none name_of_fuzzer -- -timeout=1
Users of Apple M1 devices must use a nightly compiler and omit the `-s none` portion of this
command, as this architecture does not support fuzzing without a sanitizer.
You can view the names of the available fuzzers with `cargo fuzz list`.
For specific details about how each fuzzer works, please read this document in its entirety.
IMPORTANT: You should run `./reinit-fuzzer.sh` after adding more file-based testcases. This will
allow the testing of new features that you've added unit tests for.
Debugging a crash
Once you've found a crash, you'll need to debug it.
The easiest first step in this process is to minimise the input, so that the crash is still
triggered but by a smaller input.
`cargo-fuzz` supports this out of the box with:
cargo fuzz tmin -s none name_of_fuzzer artifacts/name_of_fuzzer/crash-...
From here, you will need to analyse the input and potentially the behaviour of the program. The debugging process from here is unfortunately less well-defined, so you will need to apply some expertise here. Happy hunting!
A brief introduction to fuzzers
Fuzzing, or fuzz testing, is the process of providing generated data to a program under test. The most common variety of fuzzer is the mutational fuzzer: given a set of existing inputs (a "corpus"), it will attempt to slightly change (or "mutate") these inputs into new inputs that cover parts of the code that haven't yet been observed. Using this strategy, we can quite efficiently generate testcases which cover significant portions of the program, both with expected and unexpected data. This is really quite effective for finding bugs.
The fuzzers here use `cargo-fuzz`, a utility which allows Rust to integrate with libFuzzer, the
fuzzer library built into LLVM.
Each source file present in `fuzz_targets` is a harness, which is, in effect, a unit test which can
handle different inputs.
When an input is provided to a harness, the harness processes this data and libFuzzer observes the
code coverage and any special values used in comparisons over the course of the run.
Special values are preserved for future mutations and inputs which cover new regions of code are
added to the corpus.
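The feedback loop described above can be illustrated with a deliberately tiny mutational fuzzer. This is a sketch under stated assumptions, not how libFuzzer is implemented: the `target` function with hand-numbered branch ids stands in for real coverage instrumentation, and `XorShift`, `mutate`, and `fuzz` are hypothetical names chosen for the example.

```rust
use std::collections::HashSet;

/// Tiny xorshift64 PRNG so the sketch needs no external crates.
struct XorShift(u64);

impl XorShift {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Hypothetical instrumented target: returns the ids of the branches it reached.
fn target(input: &[u8]) -> Vec<u32> {
    let mut hit = vec![0];
    if input.first() == Some(&b'd') {
        hit.push(1);
        if input.get(1) == Some(&b'e') {
            hit.push(2);
            if input.get(2) == Some(&b'f') {
                hit.push(3);
            }
        }
    }
    hit
}

/// Overwrite one randomly chosen byte with a random value.
fn mutate(input: &[u8], rng: &mut XorShift) -> Vec<u8> {
    let mut out = input.to_vec();
    if out.is_empty() {
        out.push(0);
    }
    let index = (rng.next_u64() % out.len() as u64) as usize;
    out[index] = (rng.next_u64() % 256) as u8;
    out
}

/// Run the mutate/observe/keep loop and return the grown corpus and seen coverage.
fn fuzz(seed_corpus: Vec<Vec<u8>>, iterations: usize) -> (Vec<Vec<u8>>, HashSet<u32>) {
    let mut corpus = seed_corpus;
    if corpus.is_empty() {
        corpus.push(Vec::new());
    }
    let mut seen = HashSet::new();
    for input in &corpus {
        seen.extend(target(input));
    }
    let mut rng = XorShift(0x9E37_79B9_7F4A_7C15);
    for _ in 0..iterations {
        let pick = (rng.next_u64() % corpus.len() as u64) as usize;
        let candidate = mutate(&corpus[pick], &mut rng);
        let coverage = target(&candidate);
        // Keep the candidate only if it reached something not seen before.
        if coverage.iter().any(|id| !seen.contains(id)) {
            seen.extend(coverage);
            corpus.push(candidate);
        }
    }
    (corpus, seen)
}
```

Because inputs that unlock a new branch are kept and mutated further, nested conditions can be reached one byte at a time instead of requiring a single lucky guess.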
Each fuzzer harness in detail
Each fuzzer harness in `fuzz_targets` targets a different aspect of Ruff and tests it in a
different way. While there is implementation-specific documentation in the source code
itself, each harness is briefly described below.
ruff_parse_simple
This fuzz harness does not perform any "smart" testing of Ruff; it merely checks that the parsing and unparsing of a particular input (what would normally be a source code file) does not crash. It also attempts to verify that the locations of tokens and errors identified do not fall in the middle of a UTF-8 code point, which may cause downstream panics. While this is unlikely to find any issues on its own, it executes very quickly and covers a large and diverse code region that may speed up the generation of inputs and therefore make a more valuable corpus quickly. It is particularly useful if you skip the dataset generation.
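The code point check can be pictured with the standard library's `str::is_char_boundary`. The helper below is a minimal sketch, and the name `invalid_offsets` is an assumption made for illustration; the actual harness may perform the check differently.

```rust
/// Return every offset that is out of bounds or falls inside a multi-byte
/// UTF-8 sequence of `source`.
pub fn invalid_offsets(source: &str, offsets: &[usize]) -> Vec<usize> {
    offsets
        .iter()
        .copied()
        // `is_char_boundary` is false for indices past the end of the string
        // and for indices in the middle of an encoded code point.
        .filter(|&offset| !source.is_char_boundary(offset))
        .collect()
}
```

For the source `x = 'é'`, the `é` occupies bytes 5 and 6, so offset 6 would be reported, as would any offset greater than the string's length of 8 bytes.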
ruff_parse_idempotency
This fuzz harness checks that Ruff's parser is idempotent in order to check that it is not
incorrectly parsing or unparsing an input.
It can be built in two modes: default (where it is only checked that the parser does not enter an
unstable state) or full idempotency (the parser is checked to ensure that it will always produce
the same output after the first unparsing).
Full idempotency mode can be used by enabling the `full-idempotency` feature when running the
fuzzer, but this may be too strict of a restriction for initial testing.
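One plausible reading of the two modes is sketched below. It is an interpretation, not the harness's actual code: `round_trip` stands for "parse, then unparse" and returns `None` when parsing fails, the `full` flag stands in for the `full-idempotency` feature, and the `Outcome` enum is hypothetical.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    /// The original input did not parse, so there is nothing to check.
    Skipped,
    /// The check passed in the selected mode.
    Ok,
    /// The parser's own output could not be round-tripped again.
    Unstable,
    /// Full mode only: the second round trip changed the output.
    Diverged,
}

pub fn check_idempotency<F>(input: &str, round_trip: F, full: bool) -> Outcome
where
    F: Fn(&str) -> Option<String>,
{
    let first = match round_trip(input) {
        Some(output) => output,
        None => return Outcome::Skipped,
    };
    let second = match round_trip(&first) {
        Some(output) => output,
        None => return Outcome::Unstable,
    };
    if full && second != first {
        return Outcome::Diverged;
    }
    Outcome::Ok
}
```

Under this reading, the default mode only fails when output that Ruff itself produced can no longer be processed, while full mode additionally requires a fixed point after the first unparsing.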
ruff_fix_validity
This fuzz harness uses the existing `ruff_linter::test::test_snippet` testing utility to check that
fixes applied by Ruff do not introduce new errors.
It is currently only configured to use default settings, but may be extended in future versions to
test non-default linter settings.
ruff_formatter_idempotency
This fuzz harness checks that the formatter is idempotent, which detects possible unstable states of Ruff's formatter.
ruff_formatter_validity
This fuzz harness checks that Ruff's formatter does not introduce new linter errors or warnings. It lints once and counts the occurrences of each error type, then formats, lints again, and ensures that the count of each error type has not increased. This has the beneficial side effect of discovering cases where the linter fails to report an error it should have reported because of a formatting inconsistency.
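The count comparison at the heart of this check might look like the following sketch. The helper names `count_by_rule` and `increased_rules` are assumptions for illustration, and diagnostics are reduced to plain rule-code strings rather than Ruff's real diagnostic types.

```rust
use std::collections::HashMap;

/// Count how many times each rule code was reported.
pub fn count_by_rule<'a>(codes: impl IntoIterator<Item = &'a str>) -> HashMap<&'a str, usize> {
    let mut counts = HashMap::new();
    for code in codes {
        *counts.entry(code).or_insert(0) += 1;
    }
    counts
}

/// Return, in sorted order, the rule codes reported more often after
/// formatting than before it.
pub fn increased_rules<'a>(
    before: &HashMap<&'a str, usize>,
    after: &HashMap<&'a str, usize>,
) -> Vec<&'a str> {
    let mut worse = Vec::new();
    for (rule, count) in after {
        // A rule that was absent before formatting counts as zero.
        let previous = before.get(rule).copied().unwrap_or(0);
        if *count > previous {
            worse.push(*rule);
        }
    }
    worse.sort_unstable();
    worse
}
```

A harness along these lines would treat a non-empty result from `increased_rules` as a failure, while a decrease in any count is allowed.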