## Summary
This PR updates the entire parser stack in multiple ways:
### Make the lexer lazy
* https://github.com/astral-sh/ruff/pull/11244
* https://github.com/astral-sh/ruff/pull/11473
Previously, Ruff's lexer would act as an iterator. The parser would
collect all the tokens in a vector first and then process the tokens to
create the syntax tree.
The first task in this project is to update the entire parsing flow to
make the lexer lazy. This includes the `Lexer`, `TokenSource`, and
`Parser`. For context, the `TokenSource` is a wrapper around the `Lexer`
to filter out the trivia tokens[^1]. Now, the parser will ask the token
source to get the next token and only then the lexer will continue and
emit the token. This means that the lexer needs to be aware of the
"current" token. When the `next_token` is called, the current token will
be updated with the newly lexed token.
The main motivation to make the lexer lazy is to allow re-lexing a token
in a different context. This is going to be really useful to make the
parser error resilience. For example, currently the emitted tokens
remains the same even if the parser can recover from an unclosed
parenthesis. This is important because the lexer emits a
`NonLogicalNewline` in parenthesized context while a normal `Newline` in
non-parenthesized context. This different kinds of newline is also used
to emit the indentation tokens which is important for the parser as it's
used to determine the start and end of a block.
Additionally, this allows us to implement the following functionalities:
1. Checkpoint - rewind infrastructure: The idea here is to create a
checkpoint and continue lexing. At a later point, this checkpoint can be
used to rewind the lexer back to the provided checkpoint.
2. Remove the `SoftKeywordTransformer` and instead use lookahead or
speculative parsing to determine whether a soft keyword is a keyword or
an identifier
3. Remove the `Tok` enum. The `Tok` enum represents the tokens emitted
by the lexer but it contains owned data which makes it expensive to
clone. The new `TokenKind` enum just represents the type of token which
is very cheap.
This brings up a question as to how will the parser get the owned value
which was stored on `Tok`. This will be solved by introducing a new
`TokenValue` enum which only contains a subset of token kinds which has
the owned value. This is stored on the lexer and is requested by the
parser when it wants to process the data. For example:
8196720f80/crates/ruff_python_parser/src/parser/expression.rs (L1260-L1262)
[^1]: Trivia tokens are `NonLogicalNewline` and `Comment`
### Remove `SoftKeywordTransformer`
* https://github.com/astral-sh/ruff/pull/11441
* https://github.com/astral-sh/ruff/pull/11459
* https://github.com/astral-sh/ruff/pull/11442
* https://github.com/astral-sh/ruff/pull/11443
* https://github.com/astral-sh/ruff/pull/11474
For context,
https://github.com/RustPython/RustPython/pull/4519/files#diff-5de40045e78e794aa5ab0b8aacf531aa477daf826d31ca129467703855408220
added support for soft keywords in the parser which uses infinite
lookahead to classify a soft keyword as a keyword or an identifier. This
is a brilliant idea as it basically wraps the existing Lexer and works
on top of it which means that the logic for lexing and re-lexing a soft
keyword remains separate. The change here is to remove
`SoftKeywordTransformer` and let the parser determine this based on
context, lookahead and speculative parsing.
* **Context:** The transformer needs to know the position of the lexer
between it being at a statement position or a simple statement position.
This is because a `match` token starts a compound statement while a
`type` token starts a simple statement. **The parser already knows
this.**
* **Lookahead:** Now that the parser knows the context it can perform
lookahead of up to two tokens to classify the soft keyword. The logic
for this is mentioned in the PR implementing it for `type` and `match
soft keyword.
* **Speculative parsing:** This is where the checkpoint - rewind
infrastructure helps. For `match` soft keyword, there are certain cases
for which we can't classify based on lookahead. The idea here is to
create a checkpoint and keep parsing. Based on whether the parsing was
successful and what tokens are ahead we can classify the remaining
cases. Refer to #11443 for more details.
If the soft keyword is being parsed in an identifier context, it'll be
converted to an identifier and the emitted token will be updated as
well. Refer
8196720f80/crates/ruff_python_parser/src/parser/expression.rs (L487-L491).
The `case` soft keyword doesn't require any special handling because
it'll be a keyword only in the context of a match statement.
### Update the parser API
* https://github.com/astral-sh/ruff/pull/11494
* https://github.com/astral-sh/ruff/pull/11505
Now that the lexer is in sync with the parser, and the parser helps to
determine whether a soft keyword is a keyword or an identifier, the
lexer cannot be used on its own. The reason being that it's not
sensitive to the context (which is correct). This means that the parser
API needs to be updated to not allow any access to the lexer.
Previously, there were multiple ways to parse the source code:
1. Passing the source code itself
2. Or, passing the tokens
Now that the lexer and parser are working together, the API
corresponding to (2) cannot exists. The final API is mentioned in this
PR description: https://github.com/astral-sh/ruff/pull/11494.
### Refactor the downstream tools (linter and formatter)
* https://github.com/astral-sh/ruff/pull/11511
* https://github.com/astral-sh/ruff/pull/11515
* https://github.com/astral-sh/ruff/pull/11529
* https://github.com/astral-sh/ruff/pull/11562
* https://github.com/astral-sh/ruff/pull/11592
And, the final set of changes involves updating all references of the
lexer and `Tok` enum. This was done in two-parts:
1. Update all the references in a way that doesn't require any changes
from this PR i.e., it can be done independently
* https://github.com/astral-sh/ruff/pull/11402
* https://github.com/astral-sh/ruff/pull/11406
* https://github.com/astral-sh/ruff/pull/11418
* https://github.com/astral-sh/ruff/pull/11419
* https://github.com/astral-sh/ruff/pull/11420
* https://github.com/astral-sh/ruff/pull/11424
2. Update all the remaining references to use the changes made in this
PR
For (2), there were various strategies used:
1. Introduce a new `Tokens` struct which wraps the token vector and add
methods to query a certain subset of tokens. These includes:
1. `up_to_first_unknown` which replaces the `tokenize` function
2. `in_range` and `after` which replaces the `lex_starts_at` function
where the former returns the tokens within the given range while the
latter returns all the tokens after the given offset
2. Introduce a new `TokenFlags` which is a set of flags to query certain
information from a token. Currently, this information is only limited to
any string type token but can be expanded to include other information
in the future as needed. https://github.com/astral-sh/ruff/pull/11578
3. Move the `CommentRanges` to the parsed output because this
information is common to both the linter and the formatter. This removes
the need for `tokens_and_ranges` function.
## Test Plan
- [x] Update and verify the test snapshots
- [x] Make sure the entire test suite is passing
- [x] Make sure there are no changes in the ecosystem checks
- [x] Run the fuzzer on the parser
- [x] Run this change on dozens of open-source projects
### Running this change on dozens of open-source projects
Refer to the PR description to get the list of open source projects used
for testing.
Now, the following tests were done between `main` and this branch:
1. Compare the output of `--select=E999` (syntax errors)
2. Compare the output of default rule selection
3. Compare the output of `--select=ALL`
**Conclusion: all output were same**
## What's next?
The next step is to introduce re-lexing logic and update the parser to
feed the recovery information to the lexer so that it can emit the
correct token. This moves us one step closer to having error resilience
in the parser and provides Ruff the possibility to lint even if the
source code contains syntax errors.
## Summary
This PR fixes the bug where the formatter would format an f-string and
could potentially change the AST.
For a triple-quoted f-string, the element can't be formatted into
multiline if it has a format specifier because otherwise the newline
would be treated as part of the format specifier.
Given the following f-string:
```python
f"""aaaaaaaaaaaaaaaa bbbbbbbbbbbbbbbbbb ccccccccccc {
variable:.3f} ddddddddddddddd eeeeeeee"""
```
The formatter sees that the f-string is already multiline so it assumes
that it can contain line breaks i.e., broken into multiple lines. But,
in this specific case we can't format it as:
```python
f"""aaaaaaaaaaaaaaaa bbbbbbbbbbbbbbbbbb ccccccccccc {
variable:.3f
} ddddddddddddddd eeeeeeee"""
```
Because the format specifier string would become ".3f\n", which is not
the original string (`.3f`).
If the original source code already contained a newline, they'll be
preserved. For example:
```python
f"""aaaaaaaaaaaaaaaa bbbbbbbbbbbbbbbbbb ccccccccccc {
variable:.3f
} ddddddddddddddd eeeeeeee"""
```
The above will be formatted as:
```py
f"""aaaaaaaaaaaaaaaa bbbbbbbbbbbbbbbbbb ccccccccccc {variable:.3f
} ddddddddddddddd eeeeeeee"""
```
Note that the newline after `.3f` is part of the format specifier which
needs to be preserved.
The Python version is irrelevant in this case.
fixes: #10040
## Test Plan
Add some test cases to verify this behavior.
## Summary
_This is preview only feature and is available using the `--preview`
command-line flag._
With the implementation of [PEP 701] in Python 3.12, f-strings can now
be broken into multiple lines, can contain comments, and can re-use the
same quote character. Currently, no other Python formatter formats the
f-strings so there's some discussion which needs to happen in defining
the style used for f-string formatting. Relevant discussion:
https://github.com/astral-sh/ruff/discussions/9785
The goal for this PR is to add minimal support for f-string formatting.
This would be to format expression within the replacement field without
introducing any major style changes.
### Newlines
The heuristics for adding newline is similar to that of
[Prettier](https://prettier.io/docs/en/next/rationale.html#template-literals)
where the formatter would only split an expression in the replacement
field across multiple lines if there was already a line break within the
replacement field.
In other words, the formatter would not add any newlines unless they
were already present i.e., they were added by the user. This makes
breaking any expression inside an f-string optional and in control of
the user. For example,
```python
# We wouldn't break this
aaaaaaaaaaa = f"asaaaaaaaaaaaaaaaa { aaaaaaaaaaaa + bbbbbbbbbbbb + ccccccccccccccc } cccccccccc"
# But, we would break the following as there's already a newline
aaaaaaaaaaa = f"asaaaaaaaaaaaaaaaa {
aaaaaaaaaaaa + bbbbbbbbbbbb + ccccccccccccccc } cccccccccc"
```
If there are comments in any of the replacement field of the f-string,
then it will always be a multi-line f-string in which case the formatter
would prefer to break expressions i.e., introduce newlines. For example,
```python
x = f"{ # comment
a }"
```
### Quotes
The logic for formatting quotes remains unchanged. The existing logic is
used to determine the necessary quote char and is used accordingly.
Now, if the expression inside an f-string is itself a string like, then
we need to make sure to preserve the existing quote and not change it to
the preferred quote unless it's 3.12. For example,
```python
f"outer {'inner'} outer"
# For pre 3.12, preserve the single quote
f"outer {'inner'} outer"
# While for 3.12 and later, the quotes can be changed
f"outer {"inner"} outer"
```
But, for triple-quoted strings, we can re-use the same quote char unless
the inner string is itself a triple-quoted string.
```python
f"""outer {"inner"} outer""" # valid
f"""outer {'''inner'''} outer""" # preserve the single quote char for the inner string
```
### Debug expressions
If debug expressions are present in the replacement field of a f-string,
then the whitespace needs to be preserved as they will be rendered as it
is (for example, `f"{ x = }"`. If there are any nested f-strings, then
the whitespace in them needs to be preserved as well which means that
we'll stop formatting the f-string as soon as we encounter a debug
expression.
```python
f"outer { x = !s :.3f}"
# ^^
# We can remove these whitespaces
```
Now, the whitespace doesn't need to be preserved around conversion spec
and format specifiers, so we'll format them as usual but we won't be
formatting any nested f-string within the format specifier.
### Miscellaneous
- The
[`hug_parens_with_braces_and_square_brackets`](https://github.com/astral-sh/ruff/issues/8279)
preview style isn't implemented w.r.t. the f-string curly braces.
- The
[indentation](https://github.com/astral-sh/ruff/discussions/9785#discussioncomment-8470590)
is always relative to the f-string containing statement
## Test Plan
* Add new test cases
* Review existing snapshot changes
* Review the ecosystem changes
[PEP 701]: https://peps.python.org/pep-0701/
This PR splits the string formatting code in the formatter to be handled
by the respective nodes.
Previously, the string formatting was done through a single
`FormatString` interface. Now, the nodes themselves are responsible for
formatting.
The following changes were made:
1. Remove `StringLayout::ImplicitStringConcatenationInBinaryLike` and
inline the call to `FormatStringContinuation`. After the refactor, the
binary like formatting would delegate to `FormatString` which would then
delegate to `FormatStringContinuation`. This removes the intermediary
steps.
2. Add formatter implementation for `FStringPart` which delegates it to
the respective string literal or f-string node.
3. Add `ExprStringLiteralKind` which is either `String` or `Docstring`.
If it's a docstring variant, then the string expression would not be
implicitly concatenated. This is guaranteed by the
`DocstringStmt::try_from_expression` constructor.
4. Add `StringLiteralKind` which is either a `String`, `Docstring` or
`InImplicitlyConcatenatedFString`. The last variant is for when the
string literal is implicitly concatenated with an f-string (`"foo" f"bar
{x}"`).
5. Remove `FormatString`.
6. Extract the f-string quote detection as a standalone function which
is public to the crate. This is used to detect the quote to be used for
an f-string at the expression level (`ExprFString` or
`FormatStringContinuation`).
### Formatter ecosystem result
**This PR**
| project | similarity index | total files | changed files |
|----------------|------------------:|------------------:|------------------:|
| cpython | 0.75804 | 1799 | 1648 |
| django | 0.99984 | 2772 | 34 |
| home-assistant | 0.99955 | 10596 | 214 |
| poetry | 0.99905 | 321 | 15 |
| transformers | 0.99967 | 2657 | 324 |
| twine | 1.00000 | 33 | 0 |
| typeshed | 0.99980 | 3669 | 18 |
| warehouse | 0.99976 | 654 | 14 |
| zulip | 0.99958 | 1459 | 36 |
**main**
| project | similarity index | total files | changed files |
|----------------|------------------:|------------------:|------------------:|
| cpython | 0.75804 | 1799 | 1648 |
| django | 0.99984 | 2772 | 34 |
| home-assistant | 0.99955 | 10596 | 214 |
| poetry | 0.99905 | 321 | 15 |
| transformers | 0.99967 | 2657 | 324 |
| twine | 1.00000 | 33 | 0 |
| typeshed | 0.99980 | 3669 | 18 |
| warehouse | 0.99976 | 654 | 14 |
| zulip | 0.99958 | 1459 | 36 |
## Summary
This PR changes the internal `docstring-code-line-width` setting to
additionally accept a string value `dynamic`. When `dynamic` is set, the
line width is dynamically adjusted when reformatting code snippets in
docstrings based on the indent level of the docstring. The result is
that the reformatted lines from the code snippet should not exceed the
"global" line width configuration for the surrounding source.
This PR does not change the default behavior, although I suspect the
default should probably be `dynamic`.
## Test Plan
I added a new configuration to the existing docstring code tests and
also added a new set of tests dedicated to the new `dynamic` mode.
## Summary
This PR adds opt-in support for formatting doctests in docstrings. This
reflects initial support and it is intended to add support for Markdown
and reStructuredText Python code blocks in the future. But I believe
this PR lays the groundwork, and future additions for Markdown and reST
should be less costly to add.
It's strongly recommended to review this PR commit-by-commit. The last
few commits in particular implement the bulk of the work here and
represent the denser portions.
Some things worth mentioning:
* The formatter is itself not perfect, and it is possible for it to
produce invalid Python code. Because of this, reformatted code snippets
are checked for Python validity. If they aren't valid, then we
(unfortunately silently) bail on formatting that code snippet.
* There are a couple places where it would be nice to at least warn the
user that doctest formatting failed, but it wasn't clear to me what the
best way to do that is.
* I haven't yet run this in anger on a real world code base. I think
that should happen before merging.
Closes#7146
## Test Plan
* [x] Pass the local test suite.
* [x] Scrutinize ecosystem changes.
* [x] Run this formatter on extant code and scrutinize the results.
(e.g., CPython, numpy.)
## Summary
This PR updates the formatter to preserve trailing semicolon for Jupyter
Notebooks.
The motivation behind the change is that semicolons in notebooks are
typically used to hide the output, for example when plotting. This is
highlighted in the linked issue.
The conditions required as to when the trailing semicolon should be
preserved are:
1. It should be a top-level statement which is last in the module.
2. For statement, it can be either assignment, annotated assignment, or
augmented assignment. Here, the target should only be a single
identifier i.e., multiple assignments or tuple unpacking isn't
considered.
3. For expression, it can be any.
## Test Plan
Add a new integration test in `ruff_cli`. The test notebook basically
acts as a document as to which trailing semicolons are to be preserved.
fixes: #8254
**Summary** Implement `DerefMut` for `WithNodeLevel` so it can be used
in the same way as `PyFormatter`. I want this for my WIP upstack branch
to enable `.fmt(f)` on `WithNodeLevel` context. We could extend this to
remove the other two method from `WithNodeLevel`.
<!--
Thank you for contributing to Ruff! To help us out with reviewing, please consider the following:
- Does this pull request include a summary of the change? (See below.)
- Does this pull request include a descriptive title?
- Does this pull request include references to any relevant issues?
-->
## Summary
I started working on this because I assumed that I would need access to options inside of `NeedsParantheses` but it then turned out that I won't.
Anyway, it kind of felt nice to pass fewer arguments. So I'm gonna put this out here to get your feedback if you prefer this over passing individual fiels.
Oh, I sneeked in another change. I renamed `context.contents` to `source`. `contents` is too generic and doesn't tell you anything.
<!-- What's the purpose of the change? What does it do, and why? -->
## Test Plan
It compiles
<!--
Thank you for contributing to Ruff! To help us out with reviewing, please consider the following:
- Does this pull request include a summary of the change? (See below.)
- Does this pull request include a descriptive title?
- Does this pull request include references to any relevant issues?
-->
## Summary
This PR implements Black's behavior where it first splits off parenthesized expressions before splitting before operands to avoid unnecessary parentheses:
```python
# We want
if a + [
b,
c
]:
pass
# Rather than
if (
a
+ [b, c]
):
pass
```
This is implemented by using the new IR elements introduced in #5596.
* We give the group wrapping the optional parentheses an ID (`parentheses_id`)
* We use `conditional_group` for the lower priority groups (all non-parenthesized expressions) with the condition that the `parentheses_id` group breaks (we want to split before operands only if the parentheses are necessary)
* We use `fits_expanded` to wrap all other parenthesized expressions (lists, dicts, sets), to prevent that expanding e.g. a list expands the `parentheses_id` group. We gate the `fits_expand` to only apply if the `parentheses_id` group fits (because we prefer `a\n+[b, c]` over expanding `[b, c]` if the whole expression gets parenthesized).
We limit using `fits_expanded` and `conditional_group` only to expressions that themselves are not in parentheses (checking the conditions isn't free)
## Test Plan
It increases the Jaccard index for Django from 0.915 to 0.917
## Incompatibilites
There are two incompatibilities left that I'm aware of (there may be more, I didn't go through all snapshot differences).
### Long string literals
I commented on the regression. The issue is that a very long string (or any content without a split point) may not fit when only breaking the right side. The formatter than inserts the optional parentheses. But this is kind of useless because the overlong string will still not fit, because there are no new split points.
I think we should ignore this incompatibility for now
### Expressions on statement level
I don't fully understand the logic behind this yet, but black doesn't break before the operators for the following example even though the expression exceeds the configured line width
```python
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa < bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb > ccccccccccccccccccccccccccccc == ddddddddddddddddddddd
```
But it would if the expression is used inside of a condition.
What I understand so far is that Black doesn't insert optional parentheses on the expression statement level (and a few other places) and, therefore, only breaks after opening parentheses. I propose to keep this deviation for now to avoid overlong-lines and use the compatibility report to make a decision if we should implement the same behavior.
<!--
Thank you for contributing to Ruff! To help us out with reviewing, please consider the following:
- Does this pull request include a summary of the change? (See below.)
- Does this pull request include a descriptive title?
- Does this pull request include references to any relevant issues?
-->
## Summary
This PR adds a new `PyFormatOptions` struct that stores the python formatter options.
The new options aren't used yet, with the exception of magical trailing commas and the options passed to the printer.
I'll follow up with more PRs that use the new options (e.g. `QuoteStyle`).
<!-- What's the purpose of the change? What does it do, and why? -->
## Test Plan
`cargo test` I'll follow up with a new PR that adds support for overriding the options in our fixture tests.
# Summary
This PR contains the code for the autoformatter proof-of-concept.
## Crate structure
The primary formatting hook is the `fmt` function in `crates/ruff_python_formatter/src/lib.rs`.
The current formatter approach is outlined in `crates/ruff_python_formatter/src/lib.rs`, and is structured as follows:
- Tokenize the code using the RustPython lexer.
- In `crates/ruff_python_formatter/src/trivia.rs`, extract a variety of trivia tokens from the token stream. These include comments, trailing commas, and empty lines.
- Generate the AST via the RustPython parser.
- In `crates/ruff_python_formatter/src/cst.rs`, convert the AST to a CST structure. As of now, the CST is nearly identical to the AST, except that every node gets a `trivia` vector. But we might want to modify it further.
- In `crates/ruff_python_formatter/src/attachment.rs`, attach each trivia token to the corresponding CST node. The logic for this is mostly in `decorate_trivia` and is ported almost directly from Prettier (given each token, find its preceding, following, and enclosing nodes, then attach the token to the appropriate node in a second pass).
- In `crates/ruff_python_formatter/src/newlines.rs`, normalize newlines to match Black’s preferences. This involves traversing the CST and inserting or removing `TriviaToken` values as we go.
- Call `format!` on the CST, which delegates to type-specific formatter implementations (e.g., `crates/ruff_python_formatter/src/format/stmt.rs` for `Stmt` nodes, and similar for `Expr` nodes; the others are trivial). Those type-specific implementations delegate to kind-specific functions (e.g., `format_func_def`).
## Testing and iteration
The formatter is being developed against the Black test suite, which was copied over in-full to `crates/ruff_python_formatter/resources/test/fixtures/black`.
The Black fixtures had to be modified to create `[insta](https://github.com/mitsuhiko/insta)`-compatible snapshots, which now exist in the repo.
My approach thus far has been to try and improve coverage by tackling fixtures one-by-one.
## What works, and what doesn’t
- *Most* nodes are supported at a basic level (though there are a few stragglers at time of writing, like `StmtKind::Try`).
- Newlines are properly preserved in most cases.
- Magic trailing commas are properly preserved in some (but not all) cases.
- Trivial leading and trailing standalone comments mostly work (although maybe not at the end of a file).
- Inline comments, and comments within expressions, often don’t work -- they work in a few cases, but it’s one-off right now. (We’re probably associating them with the “right” nodes more often than we are actually rendering them in the right place.)
- We don’t properly normalize string quotes. (At present, we just repeat any constants verbatim.)
- We’re mishandling a bunch of wrapping cases (if we treat Black as the reference implementation). Here are a few examples (demonstrating Black's stable behavior):
```py
# In some cases, if the end expression is "self-closing" (functions,
# lists, dictionaries, sets, subscript accesses, and any length-two
# boolean operations that end in these elments), Black
# will wrap like this...
if some_expression and f(
b,
c,
d,
):
pass
# ...whereas we do this:
if (
some_expression
and f(
b,
c,
d,
)
):
pass
# If function arguments can fit on a single line, then Black will
# format them like this, rather than exploding them vertically.
if f(
a, b, c, d, e, f, g, ...
):
pass
```
- We don’t properly preserve parentheses in all cases. Black preserves parentheses in some but not all cases.