Fixes#17867
## Summary
The CPython parser does not allow generator expressions which are the
sole arguments in an argument list to have a trailing comma.
With this change, we start flagging such instances.
## Test Plan
Added new inline tests.
## Summary
Part of #17412
Starred expressions cannot be used as values in assignment expressions.
Add a new semantic syntax error to catch such instances.
Note that we already have
`ParseErrorType::InvalidStarredExpressionUsage` to catch some starred
expression errors during parsing, but that does not cover top level
assignment expressions.
## Test Plan
- Added new inline tests for the new rule
- Found some examples marked as "valid" in existing tests (`_ = *data`),
which are not really valid (per this new rule) and updated them
- There was an existing inline test - `assign_stmt_invalid_value_expr`
which had instances of `*` expression which would be deemed invalid by
this new rule. Converted these to tuples, so that they do not trigger
this new rule.
## Summary
Part of #17412
Add a new compile-time syntax error for detecting `nonlocal`
declarations at a module level.
## Test Plan
- Added new inline tests for the syntax error
- Updated existing tests for `nonlocal` statement parsing to be inside a
function scope
Co-authored-by: Brent Westbrook <36778786+ntBre@users.noreply.github.com>
## Summary
Based on the discussion in
https://github.com/astral-sh/ruff/pull/17298#discussion_r2033975460, we
decided to move the scope handling out of the `SemanticSyntaxChecker`
and into the `SemanticSyntaxContext` trait. This PR implements that
refactor by:
- Reverting all of the `Checkpoint` and `in_async_context` code in the
`SemanticSyntaxChecker`
- Adding four new methods to the `SemanticSyntaxContext` trait
- `in_async_context`: matches `SemanticModel::in_async_context` and only
detects the nearest enclosing function
- `in_sync_comprehension`: uses the new `is_async` tracking on
`Generator` scopes to detect any enclosing sync comprehension
- `in_module_scope`: reports whether we're at the top-level scope
- `in_notebook`: reports whether we're in a Jupyter notebook
- In-lining the `TestContext` directly into the
`SemanticSyntaxCheckerVisitor`
- This allows modifying the context as the visitor traverses the AST,
which wasn't possible before
One potential question here is "why not add a single method returning a
`Scope` or `Scopes` to the context?" The main reason is that the `Scope`
type is defined in the `ruff_python_semantic` crate, which is not
currently a dependency of the parser. It also doesn't appear to be used
in red-knot. So it seemed best to use these more granular methods
instead of trying to access `Scope` in `ruff_python_parser` (and
red-knot).
## Test Plan
Existing parser and linter tests.
Summary
--
Detect async comprehensions nested in sync comprehensions in async
functions before Python 3.11, when this was [changed].
The actual logic of this rule is very straightforward, but properly
tracking the async scopes took a bit of work. An alternative to the
current approach is to offload the `in_async_context` check into the
`SemanticSyntaxContext` trait, but that actually required much more
extensive changes to the `TestContext` and also to ruff's semantic
model, as you can see in the changes up to
31554b473507034735bd410760fde6341d54a050. This version has the benefit
of mostly centralizing the state tracking in `SemanticSyntaxChecker`,
although there was some subtlety around deferred function body traversal
that made the changes to `Checker` more intrusive too (hence the new
linter test).
The `Checkpoint` struct/system is obviously overkill for now since it's
only tracking a single `bool`, but I thought it might be more useful
later.
[changed]: https://github.com/python/cpython/issues/77527
Test Plan
--
New inline tests and a new linter integration test.
Summary
--
This PR extends the checks in #17101 and #17282 to annotated assignments
after Python 3.13.
Currently stacked on #17282 to include `await`.
Test Plan
--
New inline tests. These are simpler than the other cases because there's
no place to put generics.
Summary
--
This PR extends the changes in #17101 to include `await` in the same
positions.
I also renamed the `valid_annotation_function` test to include `_py313`
and explicitly passed a Python version to contrast it with the `_py314`
version.
Test Plan
--
New test cases added to existing files.
Summary
--
This PR fixes the issue pointed out by @JelleZijlstra in
https://github.com/astral-sh/ruff/pull/17101#issuecomment-2777480204.
Namely, I conflated two very different errors from CPython:
```pycon
>>> def m[T](x: (yield from 1)): ...
File "<python-input-310>", line 1
def m[T](x: (yield from 1)): ...
^^^^^^^^^^^^
SyntaxError: yield expression cannot be used within the definition of a generic
>>> def m(x: (yield from 1)): ...
File "<python-input-311>", line 1
def m(x: (yield from 1)): ...
^^^^^^^^^^^^
SyntaxError: 'yield from' outside function
>>> def outer():
... def m(x: (yield from 1)): ...
...
>>>
```
I thought the second error was the same as the first, but `yield` (and
`yield from`) is actually valid in this position when inside a function
scope. The same is true for base classes, as pointed out in the original
comment.
We don't currently raise an error for `yield` outside of a function, but
that should be handled separately.
On the upside, this had the benefit of removing the
`InvalidExpressionPosition::BaseClass` variant and the
`allow_named_expr` field from the visitor because they were both no
longer used.
Test Plan
--
Updated inline tests.
Summary
--
This PR detects the use of invalid syntax in annotation scopes,
including
`yield` and `yield from` expressions and named expressions. I combined a
few
different types of CPython errors here, but I think the resulting error
messages
still make sense and are even preferable to what CPython gives. For
example, we
report `yield expression cannot be used in a type annotation` for both
of these:
```pycon
>>> def f[T](x: (yield 1)): ...
File "<python-input-26>", line 1
def f[T](x: (yield 1)): ...
^^^^^^^
SyntaxError: yield expression cannot be used within the definition of a generic
>>> def foo() -> (yield x): ...
File "<python-input-28>", line 1
def foo() -> (yield x): ...
^^^^^^^
SyntaxError: 'yield' outside function
```
Fixes https://github.com/astral-sh/ruff/issues/11118.
Test Plan
--
New inline tests, along with some updates to existing tests.
Summary
--
Detects duplicate attributes in a `match` class pattern:
```python
match x:
case Class(x=1, x=2): ...
```
which are more analogous to the similar check for mapping patterns than
to the
multiple assignments rule.
I also realized that both this and the mapping check would only work on
top-level patterns, despite the possibility that they can be nested
inside other
patterns:
```python
match x:
case [{"x": 1, "x": 2}]: ... # false negative in the old version
```
and moved these checks into the recursive pattern visitor instead.
I also tidied up some of the names like the `multiple_case_assignment`
function
and the `MultipleCaseAssignmentVisitor`, which are now doing more than
checking
for multiple assignments.
Test Plan
--
New inline tests for both classes and mappings.
Summary
--
Fixes#17181. The cases being tested with multiple *keys* being equal
are actually a slightly different error, more like the error for
`MatchMapping` than like the other multiple assignment errors:
```pycon
>>> match x:
... case Class(x=x, x=x): ...
...
File "<python-input-249>", line 2
case Class(x=x, x=x): ...
^
SyntaxError: attribute name repeated in class pattern: x
>>> match x:
... case {"x": 1, "x": 2}: ...
...
File "<python-input-251>", line 2
case {"x": 1, "x": 2}: ...
^^^^^^^^^^^^^^^^
SyntaxError: mapping pattern checks duplicate key ('x')
>>> match x:
... case [x, x]: ...
...
File "<python-input-252>", line 2
case [x, x]: ...
^
SyntaxError: multiple assignments to name 'x' in pattern
```
This PR just stops the false positive reported in the issue, but I will
quickly follow it up with a new rule (or possibly combined with the
mapping rule) catching the repeated attributes separately.
Test Plan
--
New inline `test_ok` and updating the `test_err` cases to have duplicate
values instead of keys.
Summary
--
Detects duplicate literals in `match` mapping keys.
This PR also adds a `source` method to `SemanticSyntaxContext` to
display the duplicated key in the error message by slicing out its
range.
Test Plan
--
New inline tests.
Summary
--
Fixes https://github.com/astral-sh/ruff/issues/16520 by flagging single,
starred expressions in `return`, `yield`, and
`for` statements.
I thought `yield from` would also be included here, but that error is
emitted by
the CPython parser:
```pycon
>>> ast.parse("def f(): yield from *x")
Traceback (most recent call last):
File "<python-input-214>", line 1, in <module>
ast.parse("def f(): yield from *x")
~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.13/ast.py", line 54, in parse
return compile(source, filename, mode, flags,
_feature_version=feature_version, optimize=optimize)
File "<unknown>", line 1
def f(): yield from *x
^
SyntaxError: invalid syntax
```
And we also already catch it in our parser.
Test Plan
--
New inline tests and updates to existing tests.
Summary
--
Detects starred assignment targets outside of tuples and lists like `*a
= (1,)`.
This PR only considers assignment statements. I also checked annotated
assigment statements, but these give a separate error that we already
catch, so I think they're okay not to consider:
```pycon
>>> *a: list[int] = []
File "<python-input-72>", line 1
*a: list[int] = []
^
SyntaxError: invalid syntax
```
Fixes#13759
Test Plan
--
New inline tests, plus a new `SemanticSyntaxError` for an existing
parser test. I also removed a now-invalid case from an otherwise-valid
test fixture.
The new semantic error leads to two errors for the case below:
```python
*foo() = 42
```
but this matches [pyright] too.
[pyright]: https://pyright-play.net/?code=FQMw9mAUCUAEC8sAsAmAUEA
Summary
--
Detect setting or deleting `__debug__`. Assigning to `__debug__` was a
`SyntaxError` on the earliest version I tested (3.8). Deleting
`__debug__` was made a `SyntaxError` in [BPO 45000], which said it was
resolved in Python 3.10. However, `del __debug__` was also a runtime
error (`NameError`) when I tested in Python 3.9.6, so I thought it was
worth including 3.9 in this check.
I don't think it was ever a *good* idea to try `del __debug__`, so I
think there's also an argument for not making this version-dependent at
all. That would only simplify the implementation very slightly, though.
[BPO 45000]: https://github.com/python/cpython/issues/89163
Test Plan
--
New inline tests. This also required adding a `PythonVersion` field to
the `TestContext` that could be taken from the inline `ParseOptions` and
making the version field on the options accessible.
Summary
--
This PR detects multiple assignments to the same name in `case` patterns
by recursively visiting each pattern.
Test Plan
--
New inline tests.
Summary
--
Detects irrefutable `match` cases before the final case using a modified
version
of the existing `Pattern::is_irrefutable` method from the AST crate. The
modified method helps to retrieve a more precise diagnostic range to
match what
Python 3.13 shows in the REPL.
Test Plan
--
New inline tests, as well as some updates to existing tests that had
irrefutable
patterns before the last block.
Summary
--
Fixes#16943 by checking if the tuple is not parenthesized before
emitting an error.
Test Plan
--
New inline test based on the initial report
Summary
--
Detects duplicate type parameter names in function definitions, class
definitions, and type alias statements.
I also boxed the `type_params` field on `StmtTypeAlias` to make it
easier to
`match` with functions and classes. (That's the reason for the red-knot
code
owner review requests, sorry!)
Test Plan
--
New `ruff_python_syntax_errors` unit tests.
Fixes#11119.
## Summary
This PR implements the "greeter" approach for checking the AST for
syntax errors emitted by the CPython compiler. It introduces two main
infrastructural changes to support all of the compile-time errors:
1. Adds a new `semantic_errors` module to the parser crate with public
`SemanticSyntaxChecker` and `SemanticSyntaxError` types
2. Embeds a `SemanticSyntaxChecker` in the `ruff_linter::Checker` for
checking these errors in ruff
As a proof of concept, it also implements detection of two syntax
errors:
1. A reimplementation of
[`late-future-import`](https://docs.astral.sh/ruff/rules/late-future-import/)
(`F404`)
2. Detection of rebound comprehension iteration variables
(https://github.com/astral-sh/ruff/issues/14395)
## Test plan
Existing F404 tests, new inline tests in the `ruff_python_parser` crate,
and a linter CLI test showing an example of the `Message` output.
I also tested in VS Code, where `preview = false` and turning off syntax
errors both disable the new errors:

And on the playground, where `preview = false` also disables the errors:

Fixes#14395
---------
Co-authored-by: Micha Reiser <micha@reiser.io>
Summary
--
Fixes#16874. I previously emitted a syntax error when starred
annotations were _allowed_ rather than when they were actually used.
This caused false positives for any starred parameter name because these
are allowed to have starred annotations but not required to. The fix is
to check if the annotation is actually starred after parsing it.
Test Plan
--
New inline parser tests derived from the initial report and more
examples from the comments, although I think the first case should cover
them all.
## Summary
This PR detects the use of PEP 701 f-strings before 3.12. This one
sounded difficult and ended up being pretty easy, so I think there's a
good chance I've over-simplified things. However, from experimenting in
the Python REPL and checking with [pyright], I think this is correct.
pyright actually doesn't even flag the comment case, but Python does.
I also checked pyright's implementation for
[quotes](98dc4469cc/packages/pyright-internal/src/analyzer/checker.ts (L1379-L1398))
and
[escapes](98dc4469cc/packages/pyright-internal/src/analyzer/checker.ts (L1365-L1377))
and think I've approximated how they do it.
Python's error messages also point to the simple approach of these
characters simply not being allowed:
```pycon
Python 3.11.11 (main, Feb 12 2025, 14:51:05) [Clang 19.1.6 ] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> f'''multiline {
... expression # comment
... }'''
File "<stdin>", line 3
}'''
^
SyntaxError: f-string expression part cannot include '#'
>>> f'''{not a line \
... continuation}'''
File "<stdin>", line 2
continuation}'''
^
SyntaxError: f-string expression part cannot include a backslash
>>> f'hello {'world'}'
File "<stdin>", line 1
f'hello {'world'}'
^^^^^
SyntaxError: f-string: expecting '}'
```
And since escapes aren't allowed, I don't think there are any tricky
cases where nested quotes or comments can sneak in.
It's also slightly annoying that the error is repeated for every nested
quote character, but that also mirrors pyright, although they highlight
the whole nested string, which is a little nicer. However, their check
is in the analysis phase, so I don't think we have such easy access to
the quoted range, at least without adding another mini visitor.
## Test Plan
New inline tests
[pyright]:
https://pyright-play.net/?pythonVersion=3.11&strict=true&code=EYQw5gBAvBAmCWBjALgCgO4gHaygRgEoAoEaCAIgBpyiiBiCLAUwGdknYIBHAVwHt2LIgDMA5AFlwSCJhwAuCAG8IoMAG1Rs2KIC6EAL6iIxosbPmLlq5foRWiEAAcmERAAsQAJxAomnltY2wuSKogA6WKIAdABWfPBYqCAE%2BuSBVqbpWVm2iHwAtvlMWMgB2ekiolUAgq4FjgA2TAAeEMieSADWCsoV5qoaqrrGDJ5MiDz%2B8ABuLqosAIREhlXlaybrmyYMXsDw7V4AnoysyAmQ5SIhwYo3d9cheADUeKlv5O%2BpQA
## Summary
A small followup to https://github.com/astral-sh/ruff/pull/16386. We now
tell the user exactly what it was about their decorator that constituted
invalid syntax on Python <3.9, and the range now highlights the specific
sub-expression that is invalid rather than highlighting the whole
decorator
## Test Plan
Inline snapshots are updated, and new ones are added.
## Summary
This PR detects unparenthesized assignment expressions used in set
literals and comprehensions and in sequence indexes. The link to the
release notes in https://github.com/astral-sh/ruff/issues/6591 just has
this entry:
> * Assignment expressions can now be used unparenthesized within set
literals and set comprehensions, as well as in sequence indexes (but not
slices).
with no other information, so hopefully the test cases I came up with
cover all of the changes. I also tested these out in the Python REPL and
they actually worked in Python 3.9 too. I'm guessing this may be another
case that was "formally made part of the language spec in Python 3.10,
but usable -- and commonly used -- in Python >=3.9" as @AlexWaygood
added to the body of #6591 for context managers. So we may want to
change the version cutoff, but I've gone along with the release notes
for now.
## Test Plan
New inline parser tests and linter CLI tests.
Summary
--
This is closely related to (and stacked on)
https://github.com/astral-sh/ruff/pull/16544 and detects star
annotations in function definitions.
I initially called the variant `StarExpressionInAnnotation` to mirror
`StarExpressionInIndex`, but I realized it's not really a "star
expression" in this position and renamed it. `StarAnnotation` seems in
line with the PEP.
Test Plan
--
Two new inline tests. It looked like there was pretty good existing
coverage of this syntax, so I just added simple examples to test the
version cutoff.
Summary
--
This PR reuses a slightly modified version of the
`check_tuple_unpacking` method added for detecting unpacking in `return`
and `yield` statements to detect the same issue in the iterator clause
of `for` loops.
I ran into the same issue with a bare `for x in *rest: ...` example
(invalid even on Python 3.13) and added it as a comment on
https://github.com/astral-sh/ruff/issues/16520.
I considered just making this an additional `StarTupleKind` variant as
well, but this change was in a different version of Python, so I kept it
separate.
Test Plan
--
New inline tests.
Summary
--
Unlike the other syntax errors detected so far, parenthesized keyword
arguments are only allowed *before* 3.8. It sounds like they were only
accidentally allowed before that [^1].
As an aside, you get a pretty confusing error from Python for this, so
it's nice that we can catch it:
```pycon
>>> def f(**kwargs): ...
... f((a)=1)
...
File "<python-input-0>", line 2
f((a)=1)
^^^
SyntaxError: expression cannot contain assignment, perhaps you meant "=="?
>>>
```
Test Plan
--
Inline tests.
[^1]: https://github.com/python/cpython/issues/78822
Summary
--
Checks for tuple unpacking in `return` and `yield` statements before
Python 3.8, as described [here].
Test Plan
--
Inline tests.
[here]: https://github.com/python/cpython/issues/76298
Summary
--
Another simple one, just detect type parameter lists in functions
and classes. Like pyright, we don't emit a second diagnostic for
`type` alias statements, which were also introduced in 3.12.
Test Plan
--
Inline tests.
Summary
--
Detects the presence of a [PEP 696] type parameter default before Python
3.13.
Test Plan
--
New inline parser tests for type aliases, generic functions and generic
classes.
[PEP 696]: https://peps.python.org/pep-0696/#grammar-changes
Summary
--
This is a follow-up to #16446 to fix the diagnostic range to point to
the `*` like `pyright` does
(https://github.com/astral-sh/ruff/pull/16446#discussion_r1976900643).
Storing the range in the `ExceptClauseKind::Star` variant feels slightly
awkward, but we don't store the star itself anywhere on the
`ExceptHandler`. And we can't just take `ExceptHandler.start() +
"except".text_len()` because this code appears to be valid:
```python
try: ...
except * Error: ...
```
Test Plan
--
Existing tests.
## Summary
This PR is the first in a series derived from
https://github.com/astral-sh/ruff/pull/16308, each of which add support
for detecting one version-related syntax error from
https://github.com/astral-sh/ruff/issues/6591. This one should be
the largest because it also includes the addition of the
`Parser::add_unsupported_syntax_error` method
Otherwise I think the general structure will be the same for each syntax
error:
* Detecting the error in the parser
* Inline parser tests for the new error
* New ruff CLI tests for the new error
## Test Plan
As noted above, there are new inline parser tests, as well as new ruff
CLI
tests. Once https://github.com/astral-sh/ruff/pull/16379 is resolved,
there should also be new mdtests for red-knot,
but this PR does not currently include those.
## Summary
This PR adds support for a pragma-style header for inline parser tests
containing JSON-serialized `ParseOptions`. For example,
```python
# parse_options: { "target-version": "3.9" }
match 2:
case 1:
pass
```
The line must start with `# parse_options: ` and then the rest of the
(trimmed) line is deserialized into `ParseOptions` used for parsing the
the test.
## Test Plan
Existing inline tests, plus two new inline tests for
`match-before-py310`.
---------
Co-authored-by: Alex Waygood <alex.waygood@gmail.com>
When confronted with `raise from exc` the parser will now create a
`StmtRaise` that has `None` for the exception and `exc` for the cause.
Before, the parser created a `StmtRaise` with `from` for the exception,
no cause, and a spurious expression `exc` afterwards.
This PR adds a syntax error if the parser encounters a `TryStmt` that
has except clauses both with and without a star.
The displayed error points to each except clause that contradicts the
original except clause kind. So, for example,
```python
try:
....
except: #<-- we assume this is the desired except kind
....
except*: #<--- error will point here
....
except*: #<--- and here
....
```
Closes#14860
## Summary
This PR fixes a bug to raise a syntax error when an unparenthesized
generator expression is used as an argument to a call when there are
more than one argument.
For reference, the grammar is:
```
primary:
| ...
| primary genexp
| primary '(' [arguments] ')'
| ...
genexp:
| '(' ( assignment_expression | expression !':=') for_if_clauses ')'
```
The `genexp` requires the parenthesis as mentioned in the grammar. So,
the grammar for a call expression is either a name followed by a
generator expression or a name followed by a list of argument. In the
former case, the parenthesis are excluded because the generator
expression provides them while in the later case, the parenthesis are
explicitly provided for a list of arguments which means that the
generator expression requires it's own parenthesis.
This was discovered in https://github.com/astral-sh/ruff/issues/12420.
## Test Plan
Add test cases for valid and invalid syntax.
Make sure that the parser from CPython also raises this at the parsing
step:
```console
$ python3.13 -m ast parser/_.py
File "parser/_.py", line 1
total(1, 2, x for x in range(5), 6)
^^^^^^^^^^^^^^^^^^^
SyntaxError: Generator expression must be parenthesized
$ python3.13 -m ast parser/_.py
File "parser/_.py", line 1
sum(x for x in range(10), 10)
^^^^^^^^^^^^^^^^^^^^
SyntaxError: Generator expression must be parenthesized
```
## Summary
(I'm pretty sure I added this in the parser re-write but must've got
lost in the rebase?)
This PR raises a syntax error if the type parameter list is empty.
As per the grammar, there should be at least one type parameter:
```
type_params:
| invalid_type_params
| '[' type_param_seq ']'
type_param_seq: ','.type_param+ [',']
```
Verified via the builtin `ast` module as well:
```console
$ python3.13 -m ast parser/_.py
Traceback (most recent call last):
[..]
File "parser/_.py", line 1
def foo[]():
^
SyntaxError: Type parameter list cannot be empty
```
## Test Plan
Add inline test cases and update the snapshots.
## Summary
This PR implements the re-lexing logic in the parser.
This logic is only applied when recovering from an error during list
parsing. The logic is as follows:
1. During list parsing, if an unexpected token is encountered and it
detects that an outer context can understand it and thus recover from
it, it invokes the re-lexing logic in the lexer
2. This logic first checks if the lexer is in a parenthesized context
and returns if it's not. Thus, the logic is a no-op if the lexer isn't
in a parenthesized context
3. It then reduces the nesting level by 1. It shouldn't reset it to 0
because otherwise the recovery from nested list parsing will be
incorrect
4. Then, it tries to find last newline character going backwards from
the current position of the lexer. This avoids any whitespaces but if it
encounters any character other than newline or whitespace, it aborts.
5. Now, if there's a newline character, then it needs to be re-lexed in
a logical context which means that the lexer needs to emit it as a
`Newline` token instead of `NonLogicalNewline`.
6. If the re-lexing gives a different token than the current one, the
token source needs to update it's token collection to remove all the
tokens which comes after the new current position.
It turns out that the list parsing isn't that happy with the results so
it requires some re-arranging such that the following two errors are
raised correctly:
1. Expected comma
2. Recovery context error
For (1), the following scenarios needs to be considered:
* Missing comma between two elements
* Half parsed element because the grammar doesn't allow it (for example,
named expressions)
For (2), the following scenarios needs to be considered:
1. If the parser is at a comma which means that there's a missing
element otherwise the comma would've been consumed by the first `eat`
call above. And, the parser doesn't take the re-lexing route on a comma
token.
2. If it's the first element and the current token is not a comma which
means that it's an invalid element.
resolves: #11640
## Test Plan
- [x] Update existing test snapshots and validate them
- [x] Add additional test cases specific to the re-lexing logic and
validate the snapshots
- [x] Run the fuzzer on 3000+ valid inputs
- [x] Run the fuzzer on invalid inputs
- [x] Run the parser on various open source projects
- [x] Make sure the ecosystem changes are none
## Summary
This PR updates the entire parser stack in multiple ways:
### Make the lexer lazy
* https://github.com/astral-sh/ruff/pull/11244
* https://github.com/astral-sh/ruff/pull/11473
Previously, Ruff's lexer would act as an iterator. The parser would
collect all the tokens in a vector first and then process the tokens to
create the syntax tree.
The first task in this project is to update the entire parsing flow to
make the lexer lazy. This includes the `Lexer`, `TokenSource`, and
`Parser`. For context, the `TokenSource` is a wrapper around the `Lexer`
to filter out the trivia tokens[^1]. Now, the parser will ask the token
source to get the next token and only then the lexer will continue and
emit the token. This means that the lexer needs to be aware of the
"current" token. When the `next_token` is called, the current token will
be updated with the newly lexed token.
The main motivation to make the lexer lazy is to allow re-lexing a token
in a different context. This is going to be really useful to make the
parser error resilience. For example, currently the emitted tokens
remains the same even if the parser can recover from an unclosed
parenthesis. This is important because the lexer emits a
`NonLogicalNewline` in parenthesized context while a normal `Newline` in
non-parenthesized context. This different kinds of newline is also used
to emit the indentation tokens which is important for the parser as it's
used to determine the start and end of a block.
Additionally, this allows us to implement the following functionalities:
1. Checkpoint - rewind infrastructure: The idea here is to create a
checkpoint and continue lexing. At a later point, this checkpoint can be
used to rewind the lexer back to the provided checkpoint.
2. Remove the `SoftKeywordTransformer` and instead use lookahead or
speculative parsing to determine whether a soft keyword is a keyword or
an identifier
3. Remove the `Tok` enum. The `Tok` enum represents the tokens emitted
by the lexer but it contains owned data which makes it expensive to
clone. The new `TokenKind` enum just represents the type of token which
is very cheap.
This brings up a question as to how will the parser get the owned value
which was stored on `Tok`. This will be solved by introducing a new
`TokenValue` enum which only contains a subset of token kinds which has
the owned value. This is stored on the lexer and is requested by the
parser when it wants to process the data. For example:
8196720f80/crates/ruff_python_parser/src/parser/expression.rs (L1260-L1262)
[^1]: Trivia tokens are `NonLogicalNewline` and `Comment`
### Remove `SoftKeywordTransformer`
* https://github.com/astral-sh/ruff/pull/11441
* https://github.com/astral-sh/ruff/pull/11459
* https://github.com/astral-sh/ruff/pull/11442
* https://github.com/astral-sh/ruff/pull/11443
* https://github.com/astral-sh/ruff/pull/11474
For context,
https://github.com/RustPython/RustPython/pull/4519/files#diff-5de40045e78e794aa5ab0b8aacf531aa477daf826d31ca129467703855408220
added support for soft keywords in the parser which uses infinite
lookahead to classify a soft keyword as a keyword or an identifier. This
is a brilliant idea as it basically wraps the existing Lexer and works
on top of it which means that the logic for lexing and re-lexing a soft
keyword remains separate. The change here is to remove
`SoftKeywordTransformer` and let the parser determine this based on
context, lookahead and speculative parsing.
* **Context:** The transformer needs to know the position of the lexer
between it being at a statement position or a simple statement position.
This is because a `match` token starts a compound statement while a
`type` token starts a simple statement. **The parser already knows
this.**
* **Lookahead:** Now that the parser knows the context it can perform
lookahead of up to two tokens to classify the soft keyword. The logic
for this is mentioned in the PR implementing it for `type` and `match
soft keyword.
* **Speculative parsing:** This is where the checkpoint - rewind
infrastructure helps. For `match` soft keyword, there are certain cases
for which we can't classify based on lookahead. The idea here is to
create a checkpoint and keep parsing. Based on whether the parsing was
successful and what tokens are ahead we can classify the remaining
cases. Refer to #11443 for more details.
If the soft keyword is being parsed in an identifier context, it'll be
converted to an identifier and the emitted token will be updated as
well. Refer
8196720f80/crates/ruff_python_parser/src/parser/expression.rs (L487-L491).
The `case` soft keyword doesn't require any special handling because
it'll be a keyword only in the context of a match statement.
### Update the parser API
* https://github.com/astral-sh/ruff/pull/11494
* https://github.com/astral-sh/ruff/pull/11505
Now that the lexer is in sync with the parser, and the parser helps to
determine whether a soft keyword is a keyword or an identifier, the
lexer cannot be used on its own. The reason being that it's not
sensitive to the context (which is correct). This means that the parser
API needs to be updated to not allow any access to the lexer.
Previously, there were multiple ways to parse the source code:
1. Passing the source code itself
2. Or, passing the tokens
Now that the lexer and parser are working together, the API
corresponding to (2) cannot exists. The final API is mentioned in this
PR description: https://github.com/astral-sh/ruff/pull/11494.
### Refactor the downstream tools (linter and formatter)
* https://github.com/astral-sh/ruff/pull/11511
* https://github.com/astral-sh/ruff/pull/11515
* https://github.com/astral-sh/ruff/pull/11529
* https://github.com/astral-sh/ruff/pull/11562
* https://github.com/astral-sh/ruff/pull/11592
And, the final set of changes involves updating all references of the
lexer and `Tok` enum. This was done in two-parts:
1. Update all the references in a way that doesn't require any changes
from this PR i.e., it can be done independently
* https://github.com/astral-sh/ruff/pull/11402
* https://github.com/astral-sh/ruff/pull/11406
* https://github.com/astral-sh/ruff/pull/11418
* https://github.com/astral-sh/ruff/pull/11419
* https://github.com/astral-sh/ruff/pull/11420
* https://github.com/astral-sh/ruff/pull/11424
2. Update all the remaining references to use the changes made in this
PR
For (2), there were various strategies used:
1. Introduce a new `Tokens` struct which wraps the token vector and add
methods to query a certain subset of tokens. These includes:
1. `up_to_first_unknown` which replaces the `tokenize` function
2. `in_range` and `after` which replaces the `lex_starts_at` function
where the former returns the tokens within the given range while the
latter returns all the tokens after the given offset
2. Introduce a new `TokenFlags` which is a set of flags to query certain
information from a token. Currently, this information is only limited to
any string type token but can be expanded to include other information
in the future as needed. https://github.com/astral-sh/ruff/pull/11578
3. Move the `CommentRanges` to the parsed output because this
information is common to both the linter and the formatter. This removes
the need for `tokens_and_ranges` function.
## Test Plan
- [x] Update and verify the test snapshots
- [x] Make sure the entire test suite is passing
- [x] Make sure there are no changes in the ecosystem checks
- [x] Run the fuzzer on the parser
- [x] Run this change on dozens of open-source projects
### Running this change on dozens of open-source projects
Refer to the PR description to get the list of open source projects used
for testing.
Now, the following tests were done between `main` and this branch:
1. Compare the output of `--select=E999` (syntax errors)
2. Compare the output of default rule selection
3. Compare the output of `--select=ALL`
**Conclusion: all output were same**
## What's next?
The next step is to introduce re-lexing logic and update the parser to
feed the recovery information to the lexer so that it can emit the
correct token. This moves us one step closer to having error resilience
in the parser and provides Ruff the possibility to lint even if the
source code contains syntax errors.
## Summary
This PR adds a new `ExpressionContext` struct which is used in
expression parsing.
This solves the following problem:
1. Allowing starred expression with different precedence
2. Allowing yield expression in certain context
3. Remove ambiguity with `in` keyword when parsing a `for ... in`
statement
For context, (1) was solved by adding `parse_star_expression_list` and
`parse_star_expression_or_higher` in #10623, (2) was solved by by adding
`parse_yield_expression_or_else` in #10809, and (3) was fixed in #11009.
All of the mentioned functions have been removed in favor of the context
flags.
As mentioned in #11009, an ideal solution would be to implement an
expression context which is what this PR implements. This is passed
around as function parameter and the call stack is used to automatically
reset the context.
### Recovery
How should the parser recover if the target expression is invalid when
an expression can consume the `in` keyword?
1. Should the `in` keyword be part of the target expression?
2. Or, should the expression parsing stop as soon as `in` keyword is
encountered, no matter the expression?
For example:
```python
for yield x in y: ...
# Here, should this be parsed as
for (yield x) in (y): ...
# Or
for (yield x in y): ...
# where the `in iter` part is missing
```
Or, for binary expression parsing:
```python
for x or y in z: ...
# Should this be parsed as
for (x or y) in z: ...
# Or
for (x or y in z): ...
# where the `in iter` part is missing
```
This need not be solved now, but is very easy to change. For context
this PR does the following:
* For binary, comparison, and unary expressions, stop at `in`
* For lambda, yield expressions, consume the `in`
## Test Plan
1. Add test cases for the `for ... in` statement and verify the
snapshots
2. Make sure the existing test suite pass
3. Run the fuzzer for around 3000 generated source code
4. Run the updated logic on a dozen or so open source repositories
(codename "parser-checkouts")
## Summary
This fixes a bug where the parser would panic when there is a "gap" in
the token source.
What's a gap?
The reason it's `<=` instead of just `==` is because there could be
whitespaces between
the two tokens. For example:
```python
# last token end
# | current token (newline) start
# v v
def foo \n
# ^
# assume there's trailing whitespace here
```
Or, there could tokens that are considered "trivia" and thus aren't
emitted by the token
source. These are comments and non-logical newlines. For example:
```python
# last token end
# v
def foo # comment\n
# ^ current token (newline) start
```
In either of the above cases, there's a "gap" between the end of the
last token and start
of the current token.
## Test Plan
Add test cases and update the snapshots.