## Summary
This is really bad PR hygiene, but it's a mix of: using `Locator`-based fixes
in a few places (in lieu of `Generator`-based fixes), using match syntax
to avoid `.len() == 1` checks, using common helpers in more places, etc.
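For illustration, a minimal sketch of the `.len() == 1` cleanup (the `single` helper and its use of `i32` are hypothetical, just to show the slice-pattern style):
```rust
/// Hypothetical helper: return the sole element of a slice, if any.
/// Matching on the slice replaces an explicit `.len() == 1` check
/// followed by indexing.
fn single<T>(elts: &[T]) -> Option<&T> {
    match elts {
        [elt] => Some(elt),
        _ => None,
    }
}

fn main() {
    assert_eq!(single(&[1]), Some(&1));
    assert_eq!(single::<i32>(&[]), None);
}
```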
## Test Plan
`cargo test`
## Summary
We have two `Cursor` implementations. This PR moves the implementation
from the formatter into `ruff_python_whitespace` (kind of a poorly-named
crate now) and uses it for both use-cases.
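For context, a minimal sketch of the kind of API such a shared `Cursor` exposes (illustrative names, not the exact implementation):
```rust
use std::str::Chars;

/// A lookahead-capable cursor over the source characters.
pub struct Cursor<'a> {
    chars: Chars<'a>,
}

impl<'a> Cursor<'a> {
    pub fn new(source: &'a str) -> Self {
        Self {
            chars: source.chars(),
        }
    }

    /// Peek at the next character without consuming it.
    pub fn first(&self) -> Option<char> {
        self.chars.clone().next()
    }

    /// Consume and return the next character.
    pub fn bump(&mut self) -> Option<char> {
        self.chars.next()
    }
}
```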
## Summary
This is the result of running `cargo +nightly clippy --workspace
--all-targets --all-features -- -D warnings` and fixing all violations.
Just wanted to see if there were any interesting new checks on nightly
👀
## Summary
Format `ExprIfExp`, also known as the ternary operator or inline `if`.
It can look like
```python
a1 = 1 if True else 2
```
but also
```python
b1 = (
# We return "a" ...
"a" # that's our True value
# ... if this condition matches ...
if True # that's our test
# ... otherwise we return "b§
else "b" # that's our False value
)
```
This also fixes a visitor order bug.
The Jaccard index on django goes from 0.911 to 0.915.
## Test Plan
I added fixtures with and without comments in strange places.
## Summary
The following code was previously leading to unstable formatting:
```python
try:
    try:
        pass
    finally:
        print(1)  # issue7208
except A:
    pass
```
The comment would be formatted as a trailing comment of `try`, which is
unstable, as an end-of-line comment gets two extra spaces on each pass.
This was originally found in
99b00efd5e/Lib/getpass.py (L68-L91)
## Test Plan
I added a regression test.
Support for `let…else` formatting was just merged to nightly
(rust-lang/rust#113225). Rerun `cargo fmt` with Rust nightly 2023-07-02
to pick this up. Followup to #939.
Signed-off-by: Anders Kaseorg <andersk@mit.edu>
## Summary
When visiting an `AugAssign` in evaluation order, the `target`
should be visited after its `value`. Based on my testing, the pseudo
code for `a += b` is effectively:
```python
tmp = a
a = tmp.__iadd__(b)
```
That is, an ideal traversal order would look something like this:
1. load a
2. b
3. op
4. store a
But, there is only a single AST node which captures `a` in the statement
`a += b`, so it cannot be traversed both before and after the traversal
of `b` and the `op`.
Nonetheless, I think traversing `a` after `b` and the `op` makes the
most sense for a number of reasons:
1. All the other assignment expressions traverse their `value`s before
their `target`s. Having `AugAssign` traverse in the same order would be
more consistent.
2. Within the AST, the `ctx` of the `target` for an `AugAssign` is
`Store` (though technically this is a `Load` and `Store` operation, the
AST only indicates it as a `Store`). Since the store portion of the
`AugAssign` occurs last, I think it makes sense to traverse the `target`
last as well.
The effect of this is marginal, but it may have an impact on the
behavior of #5271.
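Sketched in code, the intended order looks roughly like this (stand-in types below; the real visitor and AST nodes live in ruff's crates):
```rust
// Minimal stand-ins for the real AST and visitor types (illustrative only).
struct Expr;
struct Operator;
struct StmtAugAssign {
    target: Expr,
    op: Operator,
    value: Expr,
}

trait Visitor<'a> {
    fn visit_expr(&mut self, expr: &'a Expr);
    fn visit_operator(&mut self, op: &'a Operator);
}

// Sketch: traverse `value`, then the operator, then `target` — matching
// evaluation order as closely as a single `target` node allows.
fn walk_aug_assign<'a, V: Visitor<'a>>(visitor: &mut V, stmt: &'a StmtAugAssign) {
    visitor.visit_expr(&stmt.value);
    visitor.visit_operator(&stmt.op);
    visitor.visit_expr(&stmt.target);
}
```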
## Summary
This PR implements formatting for non-f-string strings that do not use implicit concatenation.
Docstring formatting is out of the scope of this PR.
## Test Plan
I added a few tests for simple string literals.
## Performance
Ouch. This is hitting performance somewhat hard, probably because we now iterate over each string several times:
1. To detect if it is an implicit string continuation
2. To detect if the string contains any new lines
3. To detect the preferred quote
4. To normalize the string
Edit: I integrated the detection of newlines into the preferred quote detection so that we only iterate the string three times.
We can probably do better by merging the implicit-string-continuation detection with the quote and newline detection: iterate to the end of the string part and return the offset, then use our simple tokenizer to skip over any comments or whitespace until we find the first non-trivia token, and continue doing this in a loop until we reach the end of the string. I'll leave this improvement for later; a sketch of the merged single pass follows.
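A sketch of what the merged scan could look like (the `StringInfo` shape is hypothetical; the real code operates on the string body between the quotes):
```rust
/// Everything we need to know about a string body, gathered in one scan.
struct StringInfo {
    single_quotes: usize,
    double_quotes: usize,
    contains_newline: bool,
}

fn analyze(body: &str) -> StringInfo {
    let mut info = StringInfo {
        single_quotes: 0,
        double_quotes: 0,
        contains_newline: false,
    };
    for c in body.chars() {
        match c {
            '\'' => info.single_quotes += 1,
            '"' => info.double_quotes += 1,
            '\n' | '\r' => info.contains_newline = true,
            _ => {}
        }
    }
    info
}
```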
## Summary
The `Visitor` and `preorder::Visitor` traits provide some convenience
functions, `visit_annotation` and `visit_format_spec`, for handling
annotation and format spec expressions respectively. Both of these
functions accept an `&Expr` and have a default implementation which
delegates to `walk_expr`. The problem with this approach is that any
custom handling done in `visit_expr` will be skipped for annotations and
format specs. Instead, to capture any custom logic implemented in
`visit_expr`, both of these functions' default implementations should
delegate to `visit_expr` instead of `walk_expr`.
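A sketch of the corrected defaults (the trait is abbreviated to the relevant methods, with a stand-in `Expr`):
```rust
// Stand-in for the real AST expression type.
pub struct Expr;

pub trait Visitor<'a> {
    fn visit_expr(&mut self, expr: &'a Expr);

    // Previously these defaults delegated to `walk_expr`, bypassing any
    // custom `visit_expr`. Delegating to `visit_expr` ensures custom
    // handlers also see annotations and format specs.
    fn visit_annotation(&mut self, expr: &'a Expr) {
        self.visit_expr(expr);
    }

    fn visit_format_spec(&mut self, expr: &'a Expr) {
        self.visit_expr(expr);
    }
}
```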
## Example
Consider the below `Visitor` implementation:
```rust
impl<'a> Visitor<'a> for Example<'a> {
    fn visit_expr(&mut self, expr: &'a Expr) {
        match expr {
            Expr::Name(ExprName { id, .. }) => println!("Visiting {:?}", id),
            _ => walk_expr(self, expr),
        }
    }
}
```
Run on the following Python snippet:
```python
a: b
```
I would expect such a visitor to print the following:
```
Visiting b
Visiting a
```
But it instead prints the following:
```
Visiting a
```
Our custom `visit_expr` handler is not invoked for the annotation.
## Test Plan
Tests added in #5271 caught this behavior.
## Summary
This is a follow up to #5221. Turns out it was easy to restructure the
visitor to get the right order, I'm just dumb 🤷‍♂️. I've
removed `visit_arg_with_default` entirely from the `Visitor`, although
it still exists as part of `preorder::Visitor`.
## Summary
According to the AST visitor documentation, the AST visitor "visits all
nodes in the AST recursively in evaluation-order". However, the current
traversal fails to meet this specification in a few places.
### Function traversal
```python
order = []
@(order.append("decorator") or (lambda x: x))
def f(
    posonly: order.append("posonly annotation") = order.append("posonly default"),
    /,
    arg: order.append("arg annotation") = order.append("arg default"),
    *args: order.append("vararg annotation"),
    kwarg: order.append("kwarg annotation") = order.append("kwarg default"),
    **kwargs: order.append("kwarg annotation")
) -> order.append("return annotation"):
    pass
print(order)
```
Executing the above snippet using CPython 3.10.6 prints the following
result (formatted for readability):
```python
[
    'decorator',
    'posonly default',
    'arg default',
    'kwarg default',
    'arg annotation',
    'posonly annotation',
    'vararg annotation',
    'kwarg annotation',
    'kwarg annotation',
    'return annotation',
]
```
Here we can see that decorators are evaluated first, followed by
argument defaults, and annotations are last. The current traversal of a
function's AST does not align with this order.
### Annotated assignment traversal
```python
order = []
x: order.append("annotation") = order.append("expression")
print(order)
```
Executing the above snippet using CPython 3.10.6 prints the following
result:
```python
['expression', 'annotation']
```
Here we can see that an annotated assignment's annotation gets evaluated
after the assignment's expression. The current traversal of an annotated
assignment's AST does not align with this order.
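Concretely, the fix reorders the traversal roughly as follows (a sketch with stand-in types; where the `target` store lands relative to the annotation isn't observable from the snippet above, so placing it last is an assumption):
```rust
// Minimal stand-ins for the real AST and visitor types.
struct Expr;
struct StmtAnnAssign {
    target: Expr,
    annotation: Expr,
    value: Option<Expr>,
}

trait Visitor<'a> {
    fn visit_expr(&mut self, expr: &'a Expr);
}

// Sketch: value first, then annotation, mirroring CPython's
// ['expression', 'annotation'] evaluation order.
fn walk_ann_assign<'a, V: Visitor<'a>>(visitor: &mut V, stmt: &'a StmtAnnAssign) {
    if let Some(value) = &stmt.value {
        visitor.visit_expr(value);
    }
    visitor.visit_expr(&stmt.annotation);
    visitor.visit_expr(&stmt.target);
}
```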
## Why?
I'm slowly working on #3946 and porting over some of the logic and tests
from ssort. ssort is very sensitive to AST traversal order, so ensuring
the utmost correctness here is important.
## Test Plan
There don't seem to be existing tests for the AST visitor, so I didn't
bother adding tests for these very subtle changes. However, this
behavior will be captured in the tests for the PR which addresses #3946.
## Summary
Now that all identifiers include ranges (#5194), we can remove a ton of
the "custom lexing" code we used to sketchily extract identifier ranges
from the source.
## Test Plan
`cargo test`
## Summary
In https://github.com/astral-sh/RustPython-Parser/pull/8, we modified
RustPython to include ranges for any identifiers that aren't
`Expr::Name` (which already carries a range).
For example, the `e` in `except ValueError as e` was previously
un-ranged. To extract its range, we had to do some lexing of our own.
This change should improve performance and let us remove a bunch of
code.
## Test Plan
`cargo test`
## Summary
This PR modifies our statement deletion logic to delete any preceding
continuation lines.
For example, given:
```py
x = 1; \
import os
```
We'll now rewrite to:
```py
x = 1;
```
In addition, the logic can now handle multiple preceding continuations
(which is unlikely, but valid).
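The backwards scan looks roughly like this (a self-contained sketch; the real implementation works against ruff's `Locator`, and the helper name is hypothetical):
```rust
/// Extend a deletion's start offset backwards over trailing `\`
/// continuations on the preceding line(s), so the continuation
/// characters are deleted along with the statement.
fn extend_over_continuations(source: &str, mut start: usize) -> usize {
    loop {
        let head = &source[..start];
        let trimmed = head.trim_end_matches(|c| c == '\n' || c == '\r');
        // Stop unless the deletion is preceded by a newline whose line
        // ends in a continuation backslash.
        if trimmed.len() == head.len() || !trimmed.ends_with('\\') {
            break;
        }
        // Extend the deletion to cover the backslash itself.
        start = trimmed.len() - 1;
    }
    start
}

fn main() {
    let source = "x = 1; \\\nimport os";
    let stmt_start = source.find("import").unwrap();
    assert_eq!(&source[..extend_over_continuations(source, stmt_start)], "x = 1; ");
}
```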
## Summary
This PR upgrades RustPython to pull in the changes to `Arguments` (zip
defaults with their identifiers) and all the renames to `CmpOp` and
friends.
## Summary
This changes the caching design from one cache file per source file, to
one cache file per package. This greatly reduces the amount of cache
files that are opened and written, while maintaining roughly the same
(combined) size as bincode is very compact.
Below are some very much not scientific performance tests. It uses
projects/sources to check:
* small.py: a single 31-byte Python file with 2 errors.
* test.py: a single 43 KB Python file with 8 errors.
* fastapi: FastAPI repo, 1134 files checked, 0 errors.
Source | Before # files | After # files | Before size | After size
-------|-------|-------|-------|-------
small.py | 1 | 1 | 20 K | 20 K
test.py | 1 | 1 | 60 K | 60 K
fastapi | 1134 | 518 | 4.5 M | 2.3 M
One question that might come up: why does fastapi still have 518 cache
files and not 1? Because this uses the existing package resolution, which
sees examples, docs, etc. as separate from the "main" source code (the
fastapi directory in the repo). In the future it might be worth
considering a switch to a one-cache-file-per-repo strategy.
This new design is not perfect and does have a number of known issues.
First, like the old design it doesn't remove the cache for a source file
that has been (re)moved until `ruff clean` is called.
Second, this currently uses a large mutex around mutation of the
package cache (e.g., inserting results). This could be (or become) a
bottleneck; it's future work to test and improve this (if needed).
Third, the packages are currently opened and stored in a sequential
loop; this could be done in parallel. This is also future work.
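For reference, a rough sketch of the per-package cache shape (names are illustrative, not the actual types):
```rust
use std::collections::HashMap;
use std::path::PathBuf;
use std::sync::Mutex;

/// One of these per package root, serialized with bincode into a single
/// cache file.
struct PackageCache {
    /// Relative file path -> cached result; the mutex guards insertions
    /// while files in the package are checked in parallel.
    files: Mutex<HashMap<PathBuf, FileCache>>,
}

/// Cached result for a single source file.
struct FileCache {
    /// Last-modified time of the source when the result was computed.
    last_modified: std::time::SystemTime,
    /// Serialized diagnostics for the file.
    messages: Vec<u8>,
}
```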
## Test Plan
Run `ruff check` (with caching enabled) twice on any Python source code
and it should produce the same results.
## Summary
Given:
```python
\
import os
```
Deleting `import os` leaves a syntax error: a file can't end in a
continuation. We have code to handle this case, but it failed to pick up
continuations at the _very start_ of a file.
Closes #5156.
## Summary
At present, when we store a binding, we include a `TextRange` alongside
it. The `TextRange` _sometimes_ matches the exact range of the
identifier to which the `Binding` is linked, but... not always.
For example, given:
```python
x = 1
```
The binding we create _will_ use the range of `x`, because the left-hand
side is an `Expr::Name`, which has a valid range on it.
However, given:
```python
try:
    pass
except ValueError as e:
    pass
```
When we create a binding for `e`, we don't have a `TextRange`... The AST
doesn't give us one. So we end up extracting it via lexing.
This PR extends that pattern to the rest of the binding kinds, to ensure
that whenever we create a binding, we always use the range of the bound
name. This leads to better diagnostics in cases like pattern matching,
whereby the diagnostic for "unused variable `x`" here used to include
`*x`, instead of just `x`:
```python
def f(provided: int) -> int:
    match provided:
        case [_, *x]:
            pass
```
This is _also_ required for symbol renames, since we track writes as
bindings -- so we need to know the ranges of the bound symbols.
By storing these bindings precisely, we can also remove the
`binding.trimmed_range` abstraction -- since bindings already use the
"trimmed range".
To implement this behavior, I took some of our existing utilities (like
the code we had for `except ValueError as e` above), migrated them from
a full lexer to a zero-allocation lexer that _only_ identifies
"identifiers", and moved the behavior into a trait, so we can now do
`stmt.identifier(locator)` to get the range for the identifier.
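A sketch of the trait's shape (with stand-in types; the real `Locator` and `TextRange` come from ruff's crates):
```rust
// Stand-ins for ruff's `Locator` and `TextRange`.
pub struct Locator<'a> {
    pub contents: &'a str,
}
pub struct TextRange {
    pub start: u32,
    pub end: u32,
}

pub trait Identifier {
    /// Return the range of the name bound by this node, found by running
    /// the zero-allocation identifier lexer forward from the node's start.
    fn identifier(&self, locator: &Locator) -> TextRange;
}
```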
Honestly, we might end up discarding much of this if we decide to put
ranges on all identifiers
(https://github.com/astral-sh/RustPython-Parser/pull/8). But even if we
do, this will _still_ be a good change, because the lexer introduced
here is useful beyond names (e.g., we use it to find the `except` keyword
in an exception handler, to find the `else` after a `for` loop, and so
on). So, I'm fine committing this even if we end up changing our minds
about the right approach.
Closes #5090.
## Benchmarks
No significant change, with one statistically significant improvement
(-2.1654% on `linter/all-rules/large/dataset.py`):
```
linter/default-rules/numpy/globals.py
    time:   [73.922 µs 73.955 µs 73.986 µs]
    thrpt:  [39.882 MiB/s 39.898 MiB/s 39.916 MiB/s]
    change:
        time:   [-0.5579% -0.4732% -0.3980%] (p = 0.00 < 0.05)
        thrpt:  [+0.3996% +0.4755% +0.5611%]
    Change within noise threshold.
    Found 6 outliers among 100 measurements (6.00%)
        4 (4.00%) low severe
        1 (1.00%) low mild
        1 (1.00%) high mild

linter/default-rules/pydantic/types.py
    time:   [1.4909 ms 1.4917 ms 1.4926 ms]
    thrpt:  [17.087 MiB/s 17.096 MiB/s 17.106 MiB/s]
    change:
        time:   [+0.2140% +0.2741% +0.3392%] (p = 0.00 < 0.05)
        thrpt:  [-0.3380% -0.2734% -0.2136%]
    Change within noise threshold.
    Found 4 outliers among 100 measurements (4.00%)
        3 (3.00%) high mild
        1 (1.00%) high severe

linter/default-rules/numpy/ctypeslib.py
    time:   [688.97 µs 691.34 µs 694.15 µs]
    thrpt:  [23.988 MiB/s 24.085 MiB/s 24.168 MiB/s]
    change:
        time:   [-1.3282% -0.7298% -0.1466%] (p = 0.02 < 0.05)
        thrpt:  [+0.1468% +0.7351% +1.3461%]
    Change within noise threshold.
    Found 15 outliers among 100 measurements (15.00%)
        1 (1.00%) low mild
        2 (2.00%) high mild
        12 (12.00%) high severe

linter/default-rules/large/dataset.py
    time:   [3.3872 ms 3.4032 ms 3.4191 ms]
    thrpt:  [11.899 MiB/s 11.954 MiB/s 12.011 MiB/s]
    change:
        time:   [-0.6427% -0.2635% +0.0906%] (p = 0.17 > 0.05)
        thrpt:  [-0.0905% +0.2642% +0.6469%]
    No change in performance detected.
    Found 20 outliers among 100 measurements (20.00%)
        1 (1.00%) low severe
        2 (2.00%) low mild
        4 (4.00%) high mild
        13 (13.00%) high severe

linter/all-rules/numpy/globals.py
    time:   [148.99 µs 149.21 µs 149.42 µs]
    thrpt:  [19.748 MiB/s 19.776 MiB/s 19.805 MiB/s]
    change:
        time:   [-0.7340% -0.5068% -0.2778%] (p = 0.00 < 0.05)
        thrpt:  [+0.2785% +0.5094% +0.7395%]
    Change within noise threshold.
    Found 2 outliers among 100 measurements (2.00%)
        1 (1.00%) low mild
        1 (1.00%) high severe

linter/all-rules/pydantic/types.py
    time:   [3.0362 ms 3.0396 ms 3.0441 ms]
    thrpt:  [8.3779 MiB/s 8.3903 MiB/s 8.3997 MiB/s]
    change:
        time:   [-0.0957% +0.0618% +0.2125%] (p = 0.45 > 0.05)
        thrpt:  [-0.2121% -0.0618% +0.0958%]
    No change in performance detected.
    Found 11 outliers among 100 measurements (11.00%)
        1 (1.00%) low severe
        3 (3.00%) low mild
        5 (5.00%) high mild
        2 (2.00%) high severe

linter/all-rules/numpy/ctypeslib.py
    time:   [1.6879 ms 1.6894 ms 1.6909 ms]
    thrpt:  [9.8478 MiB/s 9.8562 MiB/s 9.8652 MiB/s]
    change:
        time:   [-0.2279% -0.0888% +0.0436%] (p = 0.18 > 0.05)
        thrpt:  [-0.0435% +0.0889% +0.2284%]
    No change in performance detected.
    Found 5 outliers among 100 measurements (5.00%)
        4 (4.00%) low mild
        1 (1.00%) high severe

linter/all-rules/large/dataset.py
    time:   [7.1520 ms 7.1586 ms 7.1654 ms]
    thrpt:  [5.6777 MiB/s 5.6831 MiB/s 5.6883 MiB/s]
    change:
        time:   [-2.5626% -2.1654% -1.7780%] (p = 0.00 < 0.05)
        thrpt:  [+1.8102% +2.2133% +2.6300%]
    Performance has improved.
    Found 2 outliers among 100 measurements (2.00%)
        1 (1.00%) low mild
        1 (1.00%) high mild
```
## Summary
This fixes a number of problems in the formatter that showed up with
various files in the [cpython](https://github.com/python/cpython)
repository. These problems surfaced as unstable formatting and invalid
code. This is not the entirety of problems discovered through cpython,
but a big enough chunk to separate it. Individual fixes are generally
individual commits. They were discovered with #5055, which I update as I
work through the output.
## Test Plan
I added regression tests with links to cpython for each entry, except
for the two stubs that also got comment stubs since they'll be
implemented properly later.
## Summary
This PR runs `rustfmt` with a few nightly options as a one-time fix to
catch some malformatted comments. I ended up just running with:
```toml
condense_wildcard_suffixes = true
edition = "2021"
max_width = 100
normalize_comments = true
normalize_doc_attributes = true
reorder_impl_items = true
unstable_features = true
use_field_init_shorthand = true
```
These all seem like reasonable things to fix, so I may as well while
I'm here.
## Summary
Our current mechanism for handling deletions (e.g., `del x`) is to
remove the symbol from the scope's `bindings` table. This "does the
right thing", in that if we then reference a deleted symbol, we're able
to determine that it's unbound -- but it causes a variety of problems,
mostly in that it makes certain bindings and references unreachable
after-the-fact.
Consider:
```python
x = 1
print(x)
del x
```
If we analyze this code _after_ running the semantic model over the AST,
we'll have no way of knowing that `x` was ever introduced in the scope,
much less that it was bound to a value, read, and then deleted --
because we effectively erased `x` from the model entirely when we hit
the deletion.
In practice, this will make it impossible for us to support local symbol
renames. It also means that certain rules that we want to move out of
the model-building phase and into the "check dead scopes" phase wouldn't
work today, since we'll have lost important information about the source
code.
This PR introduces two new `BindingKind` variants to model deletions:
- `BindingKind::Deletion`, which represents `x = 1; del x`.
- `BindingKind::UnboundException`, which represents:
```python
try:
    1 / 0
except Exception as e:
    pass
```
In the latter case, `e` gets unbound after the exception handler
(assuming it's triggered), so we want to handle it similarly to a
deletion.
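Sketched against the existing enum (abbreviated; only the new variants are shown in full):
```rust
enum BindingKind {
    // ...existing variants, e.g.:
    Assignment,
    Importation,
    /// A name that was explicitly deleted, as in `x = 1; del x`.
    Deletion,
    /// An `except ... as e` name, which is unbound once the handler exits.
    UnboundException,
}
```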
The main challenge here is auditing all of our existing `Binding` and
`Scope` usages to understand whether they need to accommodate deletions
or otherwise behave differently. If you look one commit back on this
branch, you'll see that the code is littered with `NOTE(charlie)`
comments that describe the reasoning behind changing (or not) each of
those call sites. I've also augmented our test suite in preparation for
this change over a few prior PRs.
### Alternatives
As an alternative, I considered introducing a flag to `BindingFlags`,
like `BindingFlags::UNBOUND`, and setting that at the appropriate time.
This turned out to be a much more difficult change, because we tend to
match on `BindingKind` all over the place (e.g., we have a bunch of code
blocks that only run when a `BindingKind` is
`BindingKind::Importation`). As a result, introducing these new
`BindingKind` variants requires only a few changes at the client sites.
Adding a flag would've required a much wider-reaching change.
## Summary
This PR (1) avoids flagging `TypedDict` and `NamedTuple` conversions
when attributes are dunder methods, like `__dict__`, and (2) avoids
flagging the `A003` shadowed-attribute rule for `TypedDict` classes at
all, where it doesn't really apply (since those attributes are only
accessed via subscripting anyway).
Closes #5027.
Improves the `ruff_parse_simple` fuzz harness by adding checks for
parsed locations to ensure they all lie on UTF-8 character boundaries.
This will allow for faster identification of issues like #5004.
This also adds additional details for Apple M1 users and clarifies the
importance of using `init-fuzzer.sh` (thanks for the feedback,
@jasikpark 🙂).
## Summary
The `RET504` rule, which looks for unnecessary assignments before return
statements, is a frequent source of issues (#4173, #4236, #4242, #1606,
#2950). Over time, we've tried to refine the logic to handle more cases.
For example, we now avoid analyzing any functions that contain any
function calls or attribute assignments, since those operations can
contain side effects (and so we mark them as a "read" on all variables
in the function -- we could do a better job with code graph analysis to
handle this limitation, but that'd be a more involved change). We also
avoid flagging any variables that are the target of multiple
assignments. Ultimately, though, I'm not happy with the implementation
-- we just can't do sufficiently reliable analysis of arbitrary code
flow given the limited logic herein, and the existing logic is very hard
to reason about and maintain.
This PR refocuses the rule to only catch cases of the form:
```py
def f():
    x = 1
    return x
```
That is, we now only flag returns that are immediately preceded by an
assignment to the returned variable. While this is more limiting, in
some ways it lets us flag more cases than the previous
implementation, since we no longer "fully eject" when functions contain
function calls and other effect-ful operations.
Closes #4173.
Closes #4236.
Closes #4242.
## Summary
We use `.trim()` and friends in a bunch of places, to strip whitespace
from source code. However, not all Unicode whitespace characters are
considered "whitespace" in Python, which only supports the standard
space, tab, and form-feed characters.
This PR audits our usages of `.trim()`, `.trim_start()`, `.trim_end()`,
and `char::is_whitespace`, and replaces them as appropriate with new
`.trim_whitespace()` analogues, powered by a `PythonWhitespace` trait.
In general, the only place that should continue to use `.trim()` is
content within docstrings, which don't need to adhere to Python's
semantic definitions of whitespace.
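A plausible shape for the trait (a sketch, assuming the Python whitespace set is space, tab, and form feed as described above; the real API may differ):
```rust
/// Trim helpers that only strip characters Python treats as whitespace.
pub trait PythonWhitespace {
    fn trim_whitespace(&self) -> &Self;
    fn trim_whitespace_start(&self) -> &Self;
    fn trim_whitespace_end(&self) -> &Self;
}

fn is_python_whitespace(c: char) -> bool {
    // Space, tab, and form feed.
    matches!(c, ' ' | '\t' | '\x0C')
}

impl PythonWhitespace for str {
    fn trim_whitespace(&self) -> &str {
        self.trim_matches(is_python_whitespace)
    }
    fn trim_whitespace_start(&self) -> &str {
        self.trim_start_matches(is_python_whitespace)
    }
    fn trim_whitespace_end(&self) -> &str {
        self.trim_end_matches(is_python_whitespace)
    }
}
```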
Closes #4991.
## Summary
`ruff_newlines` becomes `ruff_python_whitespace`, and includes the
existing "universal newline" handlers alongside the Python
whitespace-specific utilities.