Commit graph

24 commits

Author SHA1 Message Date
Ibraheem Ahmed
c9dff5c7d5
[ty] AST garbage collection (#18482)
## Summary

Garbage collect ASTs once we are done checking a given file. Queries
with a cross-file dependency on the AST will reparse the file on demand.
This reduces ty's peak memory usage by ~20-30%.

The primary change of this PR is adding a `node_index` field to every
AST node, that is assigned by the parser. `ParsedModule` can use this to
create a flat index of AST nodes any time the file is parsed (or
reparsed). This allows `AstNodeRef` to simply index into the current
instance of the `ParsedModule`, instead of storing a pointer directly.

The indices are somewhat hackily (using an atomic integer) assigned by
the `parsed_module` query instead of by the parser directly. Assigning
the indices in source-order in the (recursive) parser turns out to be
difficult, and collecting the nodes during semantic indexing is
impossible as `SemanticIndex` does not hold onto a specific
`ParsedModuleRef`, which the pointers in the flat AST are tied to. This
means that we have to do an extra AST traversal to assign and collect
the nodes into a flat index, but the small performance impact (~3% on
cold runs) seems worth it for the memory savings.

Part of https://github.com/astral-sh/ty/issues/214.
2025-06-13 08:40:11 -04:00
Dylan
9bbf4987e8
Implement template strings (#17851)
This PR implements template strings (t-strings) in the parser and
formatter for Ruff.

Minimal changes necessary to compile were made in other parts of the code (e.g. ty, the linter, etc.). These will be covered properly in follow-up PRs.
2025-05-30 15:00:56 -05:00
Abhijeet Prasad Bodas
f5096f2050
[parser] Flag single unparenthesized generator expr with trailing comma in arguments. (#17893)
Fixes #17867

## Summary

The CPython parser does not allow generator expressions which are the
sole arguments in an argument list to have a trailing comma.
With this change, we start flagging such instances.

## Test Plan

Added new inline tests.
2025-05-07 14:11:35 -04:00
Brent Westbrook
5697d21fca
[syntax-errors] Irrefutable case pattern before final case (#16905)
Summary
--

Detects irrefutable `match` cases before the final case using a modified
version
of the existing `Pattern::is_irrefutable` method from the AST crate. The
modified method helps to retrieve a more precise diagnostic range to
match what
Python 3.13 shows in the REPL.

Test Plan
--

New inline tests, as well as some updates to existing tests that had
irrefutable
patterns before the last block.
2025-03-26 12:27:16 -04:00
Brent Westbrook
b3c884f4f3
[syntax-errors] Parenthesized keyword argument names after Python 3.8 (#16482)
Summary
--

Unlike the other syntax errors detected so far, parenthesized keyword
arguments are only allowed *before* 3.8. It sounds like they were only
accidentally allowed before that [^1].

As an aside, you get a pretty confusing error from Python for this, so
it's nice that we can catch it:

```pycon
>>> def f(**kwargs): ...
... f((a)=1)
...
  File "<python-input-0>", line 2
    f((a)=1)
       ^^^
SyntaxError: expression cannot contain assignment, perhaps you meant "=="?
>>>
```
Test Plan
--
Inline tests.

[^1]: https://github.com/python/cpython/issues/78822
2025-03-06 12:18:13 -05:00
Brent Westbrook
e924ecbdac
[syntax-errors] except* before Python 3.11 (#16446)
Summary
--

One of the simpler ones, just detect the use of `except*` before 3.11.

Test Plan
--

New inline tests.
2025-03-02 18:20:18 +00:00
Brent Westbrook
4431978262
[syntax-errors] Assignment expressions before Python 3.8 (#16383)
## Summary
This PR is the first in a series derived from
https://github.com/astral-sh/ruff/pull/16308, each of which add support
for detecting one version-related syntax error from
https://github.com/astral-sh/ruff/issues/6591. This one should be
the largest because it also includes the addition of the 
`Parser::add_unsupported_syntax_error` method

Otherwise I think the general structure will be the same for each syntax
error:
* Detecting the error in the parser
* Inline parser tests for the new error
* New ruff CLI tests for the new error

## Test Plan
As noted above, there are new inline parser tests, as well as new ruff
CLI
tests. Once https://github.com/astral-sh/ruff/pull/16379 is resolved,
there should also be new mdtests for red-knot,
but this PR does not currently include those.
2025-02-28 17:13:46 -05:00
Brent Westbrook
78806361fd
Start detecting version-related syntax errors in the parser (#16090)
## Summary

This PR builds on the changes in #16220 to pass a target Python version
to the parser. It also adds the `Parser::unsupported_syntax_errors` field, which
collects version-related syntax errors while parsing. These syntax
errors are then turned into `Message`s in ruff (in preview mode).

This PR only detects one syntax error (`match` statement before Python
3.10), but it has been pretty quick to extend to several other simple
errors (see #16308 for example).

## Test Plan

The current tests are CLI tests in the linter crate, but these could be
supplemented with inline parser tests after #16357.

I also tested the display of these syntax errors in VS Code:


![image](https://github.com/user-attachments/assets/062b4441-740e-46c3-887c-a954049ef26e)

![image](https://github.com/user-attachments/assets/101f55b8-146c-4d59-b6b0-922f19bcd0fa)

---------

Co-authored-by: Alex Waygood <alex.waygood@gmail.com>
2025-02-25 23:03:48 -05:00
Brent Westbrook
97d0659ce3
Pass ParserOptions to the parser (#16220)
## Summary

This is part of the preparation for detecting syntax errors in the
parser from https://github.com/astral-sh/ruff/pull/16090/. As suggested
in [this
comment](https://github.com/astral-sh/ruff/pull/16090/#discussion_r1953084509),
I started working on a `ParseOptions` struct that could be stored in the
parser. For this initial refactor, I only made it hold the existing
`Mode` option, but for syntax errors, we will also need it to have a
`PythonVersion`. For that use case, I'm picturing something like a
`ParseOptions::with_python_version` method, so you can extend the
current calls to something like

```rust
ParseOptions::from(mode).with_python_version(settings.target_version)
```

But I thought it was worth adding `ParseOptions` alone without changing
any other behavior first.

Most of the diff is just updating call sites taking `Mode` to take
`ParseOptions::from(Mode)` or those taking `PySourceType`s to take
`ParseOptions::from(PySourceType)`. The interesting changes are in the
new `parser/options.rs` file and smaller parts of `parser/mod.rs` and
`ruff_python_parser/src/lib.rs`.

## Test Plan

Existing tests, this should not change any behavior.
2025-02-19 10:50:50 -05:00
Shaygan Hooshyari
cf4ab7cba1
Parse triple quoted string annotations as if parenthesized (#15387)
## Summary

Resolves #9467 

Parse quoted annotations as if the string content is inside parenthesis.
With this logic `x` and `y` in this example are equal:

```python
y: """
   int |
   str
"""

z: """(
    int |
    str
)
"""
```

Also this rule only applies to triple
quotes([link](https://github.com/python/typing-council/issues/9#issuecomment-1890808610)).

This PR is based on the
[comments](https://github.com/astral-sh/ruff/issues/9467#issuecomment-2579180991)
on the issue.

I did one extra change, since we don't want any indentation tokens I am
setting the `State::Other` as the initial state of the Lexer.

Remaining work:

- [x] Add a test case for red-knot.
- [x] Add more tests.

## Test Plan

Added a test which previously failed because quoted annotation contained
indentation.
Added an mdtest for red-knot.
Updated previous test.

Co-authored-by: Dhruv Manilawala <dhruvmanila@gmail.com>
Co-authored-by: Micha Reiser <micha@reiser.io>
2025-01-16 11:38:15 +05:30
Dhruv Manilawala
96da136e6a
Move token and error structs into related modules (#11957)
## Summary

This PR does some housekeeping into moving certain structs into related
modules. Specifically,
1. Move `LexicalError` from `lexer.rs` to `error.rs` which also contains
the `ParseError`
2. Move `Token`, `TokenFlags` and `TokenValue` from `lexer.rs` to
`token.rs`
2024-06-21 10:07:19 +00:00
Dhruv Manilawala
1e0642fac8
Use re-lexing for normal list parsing (#11871)
## Summary

This PR is a follow-up on #11845 to add the re-lexing logic for normal
list parsing.

A normal list parsing is basically parsing elements without any
separator in between i.e., there can only be trivia tokens in between
the two elements. Currently, this is only being used for parsing
**assignment statement** and **f-string elements**. Assignment
statements cannot be in a parenthesized context, but f-string can have
curly braces so this PR is specifically for them.

I don't think this is an ideal recovery but the problem is that both
lexer and parser could add an error for f-strings. If the lexer adds an
error it'll emit an `Unknown` token instead while the parser adds the
error directly. I think we'd need to move all f-string errors to be
emitted by the parser instead. This way the parser can correctly inform
the lexer that it's out of an f-string and then the lexer can pop the
current f-string context out of the stack.

## Test Plan

Add test cases, update the snapshots, and run the fuzzer.
2024-06-18 12:14:41 +05:30
Dhruv Manilawala
8499abfa7f
Implement re-lexing logic for better error recovery (#11845)
## Summary

This PR implements the re-lexing logic in the parser.

This logic is only applied when recovering from an error during list
parsing. The logic is as follows:
1. During list parsing, if an unexpected token is encountered and it
detects that an outer context can understand it and thus recover from
it, it invokes the re-lexing logic in the lexer
2. This logic first checks if the lexer is in a parenthesized context
and returns if it's not. Thus, the logic is a no-op if the lexer isn't
in a parenthesized context
3. It then reduces the nesting level by 1. It shouldn't reset it to 0
because otherwise the recovery from nested list parsing will be
incorrect
4. Then, it tries to find last newline character going backwards from
the current position of the lexer. This avoids any whitespaces but if it
encounters any character other than newline or whitespace, it aborts.
5. Now, if there's a newline character, then it needs to be re-lexed in
a logical context which means that the lexer needs to emit it as a
`Newline` token instead of `NonLogicalNewline`.
6. If the re-lexing gives a different token than the current one, the
token source needs to update it's token collection to remove all the
tokens which comes after the new current position.

It turns out that the list parsing isn't that happy with the results so
it requires some re-arranging such that the following two errors are
raised correctly:
1. Expected comma
2. Recovery context error

For (1), the following scenarios needs to be considered:
* Missing comma between two elements
* Half parsed element because the grammar doesn't allow it (for example,
named expressions)

For (2), the following scenarios needs to be considered:
1. If the parser is at a comma which means that there's a missing
element otherwise the comma would've been consumed by the first `eat`
call above. And, the parser doesn't take the re-lexing route on a comma
token.
2. If it's the first element and the current token is not a comma which
means that it's an invalid element.

resolves: #11640 

## Test Plan

- [x] Update existing test snapshots and validate them
- [x] Add additional test cases specific to the re-lexing logic and
validate the snapshots
- [x] Run the fuzzer on 3000+ valid inputs
- [x] Run the fuzzer on invalid inputs
- [x] Run the parser on various open source projects
- [x] Make sure the ecosystem changes are none
2024-06-17 06:47:00 +00:00
Dhruv Manilawala
60ea72a6bc
Add list terminator kind for error recovery (#11843)
## Summary

This PR adds a new enum to determine the kind of terminator token i.e.,
is it actually terminates the list or is it used for error recovery.

This is important because the parser should take the error recovery
route in case the terminator token is used for better error recovery.
This will then try to re-lex the token if it's the case.

I haven't updated any reference to use this new enum as otherwise it'll
update the snapshots. I plan to do that in a follow-up PR so that it's
easier to reason about.

## Test plan

`cargo insta test`
2024-06-12 08:33:26 +00:00
Dhruv Manilawala
a525b4be3d
Separate terminator token for f-string elements kind (#11842)
## Summary

This PR separates the terminator token for f-string elements depending
on the context. A list of f-string element can occur either in a regular
f-string or a format spec of an f-string. The terminator token is
different depending on that context.

## Test Plan

`cargo insta test` and verify the updated snapshots.
2024-06-12 13:57:35 +05:30
Dhruv Manilawala
549cc1e437
Build CommentRanges outside the parser (#11792)
## Summary

This PR updates the parser to remove building the `CommentRanges` and
instead it'll be built by the linter and the formatter when it's
required.

For the linter, it'll be built and owned by the `Indexer` while for the
formatter it'll be built from the `Tokens` struct and passed as an
argument.

## Test Plan

`cargo insta test`
2024-06-09 09:55:17 +00:00
Dhruv Manilawala
1b7d08c2c9
Consider : to terminate parenthesized with items (#11775)
## Summary

This PR is a follow-up to this discussion
(https://github.com/astral-sh/ruff/pull/11770#discussion_r1628917209)
which adds the `:` token in the terminator set for parenthesized with
items.

The main motivation is to avoid parsing too much in speculative mode.
This is evident with the following _before_ and _after_ parsed with
items list for the following code:

```py
with (item1, item2:
    foo
```

<table>
  <tr>
    <th>Before (3 items)</th>
    <th>After (2 items)</th>
  </tr>
  <tr>
    <td>
<pre>
parsed_with_items: [
    ParsedWithItem {
        item: WithItem {
            range: 6..11,
            context_expr: Name(
                ExprName {
                    range: 6..11,
                    id: "item1",
                    ctx: Load,
                },
            ),
            optional_vars: None,
        },
        is_parenthesized: false,
    },
    ParsedWithItem {
        item: WithItem {
            range: 13..18,
            context_expr: Name(
                ExprName {
                    range: 13..18,
                    id: "item2",
                    ctx: Load,
                },
            ),
            optional_vars: None,
        },
        is_parenthesized: false,
    },
    ParsedWithItem {
        item: WithItem {
            range: 24..27,
            context_expr: Name(
                ExprName {
                    range: 24..27,
                    id: "foo",
                    ctx: Load,
                },
            ),
            optional_vars: None,
        },
        is_parenthesized: false,
    },
]
</pre>
	</td>
    <td>
<pre>
parsed_with_items: [
    ParsedWithItem {
        item: WithItem {
            range: 6..11,
            context_expr: Name(
                ExprName {
                    range: 6..11,
                    id: "item1",
                    ctx: Load,
                },
            ),
            optional_vars: None,
        },
        is_parenthesized: false,
    },
    ParsedWithItem {
        item: WithItem {
            range: 13..18,
            context_expr: Name(
                ExprName {
                    range: 13..18,
                    id: "item2",
                    ctx: Load,
                },
            ),
            optional_vars: None,
        },
        is_parenthesized: false,
    },
]
</pre>
	</td>
  </tr>
</table>

## Test Plan

`cargo insta test`
2024-06-06 18:40:44 +05:30
Dhruv Manilawala
6c1fa1d440
Use speculative parsing for with-items (#11770)
## Summary

This PR updates the with-items parsing logic to use speculative parsing
instead.

### Existing logic

First, let's understand the previous logic:
1. The parser sees `(`, it doesn't know whether it's part of a
parenthesized with items or a parenthesized expression
2. Consider it a parenthesized with items and perform a hand-rolled
speculative parsing
3. Then, verify the assumption and if it's incorrect convert the parsed
with items into an appropriate expression which becomes part of the
first with item

Here, in (3) there are lots of edge cases which we've to deal with:
1. Trailing comma with a single element should be [converted to the
expression as
is](9b2cf569b2/crates/ruff_python_parser/src/parser/statement.rs (L2140-L2153))
2. Trailing comma with multiple elements should be [converted to a tuple
expression](9b2cf569b2/crates/ruff_python_parser/src/parser/statement.rs (L2155-L2178))
3. Limit the allowed expression based on whether it's
[(1)](9b2cf569b2/crates/ruff_python_parser/src/parser/statement.rs (L2144-L2152))
or
[(2)](9b2cf569b2/crates/ruff_python_parser/src/parser/statement.rs (L2157-L2171))
4. [Consider postfix
expressions](9b2cf569b2/crates/ruff_python_parser/src/parser/statement.rs (L2181-L2200))
after (3)
5. [Consider `if`
expressions](9b2cf569b2/crates/ruff_python_parser/src/parser/statement.rs (L2203-L2208))
after (3)
6. [Consider binary
expressions](9b2cf569b2/crates/ruff_python_parser/src/parser/statement.rs (L2210-L2228))
after (3)

Consider other cases like
* [Single generator
expression](9b2cf569b2/crates/ruff_python_parser/src/parser/statement.rs (L2020-L2035))
* [Expecting a
comma](9b2cf569b2/crates/ruff_python_parser/src/parser/statement.rs (L2122-L2130))

And, this is all possible only if we allow parsing these expressions in
the [with item parsing
logic](9b2cf569b2/crates/ruff_python_parser/src/parser/statement.rs (L2287-L2334)).

### Speculative parsing

With #11457 merged, we can simplify this logic by changing the step (3)
from above to just rewind the parser back to the `(` if our assumption
(parenthesized with-items) was incorrect and then continue parsing it
considering parenthesized expression.

This also behaves a lot similar to what a PEG parser does which is to
consider the first grammar rule and if it fails consider the second
grammar rule and so on.

resolves: #11639 

## Test Plan

- [x] Verify the updated snapshots
- [x] Run the fuzzer on around 3000 valid source code (locally)
2024-06-06 08:59:56 +00:00
Dhruv Manilawala
3b19df04d7
Use cursor offset for lexer checkpoint (#11734)
## Summary

This PR updates the lexer checkpoint to store the cursor offset instead
of cloning the cursor itself. This reduces the size of `LexerCheckpoint`
from 136 to 112 bytes and also removes the need for lifetime.

## Test Plan

`cargo insta test`
2024-06-04 14:13:57 +05:30
Dhruv Manilawala
bf5b62edac
Maintain synchronicity between the lexer and the parser (#11457)
## Summary

This PR updates the entire parser stack in multiple ways:

### Make the lexer lazy

* https://github.com/astral-sh/ruff/pull/11244
* https://github.com/astral-sh/ruff/pull/11473

Previously, Ruff's lexer would act as an iterator. The parser would
collect all the tokens in a vector first and then process the tokens to
create the syntax tree.

The first task in this project is to update the entire parsing flow to
make the lexer lazy. This includes the `Lexer`, `TokenSource`, and
`Parser`. For context, the `TokenSource` is a wrapper around the `Lexer`
to filter out the trivia tokens[^1]. Now, the parser will ask the token
source to get the next token and only then the lexer will continue and
emit the token. This means that the lexer needs to be aware of the
"current" token. When the `next_token` is called, the current token will
be updated with the newly lexed token.

The main motivation to make the lexer lazy is to allow re-lexing a token
in a different context. This is going to be really useful to make the
parser error resilience. For example, currently the emitted tokens
remains the same even if the parser can recover from an unclosed
parenthesis. This is important because the lexer emits a
`NonLogicalNewline` in parenthesized context while a normal `Newline` in
non-parenthesized context. This different kinds of newline is also used
to emit the indentation tokens which is important for the parser as it's
used to determine the start and end of a block.

Additionally, this allows us to implement the following functionalities:
1. Checkpoint - rewind infrastructure: The idea here is to create a
checkpoint and continue lexing. At a later point, this checkpoint can be
used to rewind the lexer back to the provided checkpoint.
2. Remove the `SoftKeywordTransformer` and instead use lookahead or
speculative parsing to determine whether a soft keyword is a keyword or
an identifier
3. Remove the `Tok` enum. The `Tok` enum represents the tokens emitted
by the lexer but it contains owned data which makes it expensive to
clone. The new `TokenKind` enum just represents the type of token which
is very cheap.

This brings up a question as to how will the parser get the owned value
which was stored on `Tok`. This will be solved by introducing a new
`TokenValue` enum which only contains a subset of token kinds which has
the owned value. This is stored on the lexer and is requested by the
parser when it wants to process the data. For example:
8196720f80/crates/ruff_python_parser/src/parser/expression.rs (L1260-L1262)

[^1]: Trivia tokens are `NonLogicalNewline` and `Comment`

### Remove `SoftKeywordTransformer`

* https://github.com/astral-sh/ruff/pull/11441
* https://github.com/astral-sh/ruff/pull/11459
* https://github.com/astral-sh/ruff/pull/11442
* https://github.com/astral-sh/ruff/pull/11443
* https://github.com/astral-sh/ruff/pull/11474

For context,
https://github.com/RustPython/RustPython/pull/4519/files#diff-5de40045e78e794aa5ab0b8aacf531aa477daf826d31ca129467703855408220
added support for soft keywords in the parser which uses infinite
lookahead to classify a soft keyword as a keyword or an identifier. This
is a brilliant idea as it basically wraps the existing Lexer and works
on top of it which means that the logic for lexing and re-lexing a soft
keyword remains separate. The change here is to remove
`SoftKeywordTransformer` and let the parser determine this based on
context, lookahead and speculative parsing.

* **Context:** The transformer needs to know the position of the lexer
between it being at a statement position or a simple statement position.
This is because a `match` token starts a compound statement while a
`type` token starts a simple statement. **The parser already knows
this.**
* **Lookahead:** Now that the parser knows the context it can perform
lookahead of up to two tokens to classify the soft keyword. The logic
for this is mentioned in the PR implementing it for `type` and `match
soft keyword.
* **Speculative parsing:** This is where the checkpoint - rewind
infrastructure helps. For `match` soft keyword, there are certain cases
for which we can't classify based on lookahead. The idea here is to
create a checkpoint and keep parsing. Based on whether the parsing was
successful and what tokens are ahead we can classify the remaining
cases. Refer to #11443 for more details.

If the soft keyword is being parsed in an identifier context, it'll be
converted to an identifier and the emitted token will be updated as
well. Refer
8196720f80/crates/ruff_python_parser/src/parser/expression.rs (L487-L491).

The `case` soft keyword doesn't require any special handling because
it'll be a keyword only in the context of a match statement.

### Update the parser API

* https://github.com/astral-sh/ruff/pull/11494
* https://github.com/astral-sh/ruff/pull/11505

Now that the lexer is in sync with the parser, and the parser helps to
determine whether a soft keyword is a keyword or an identifier, the
lexer cannot be used on its own. The reason being that it's not
sensitive to the context (which is correct). This means that the parser
API needs to be updated to not allow any access to the lexer.

Previously, there were multiple ways to parse the source code:
1. Passing the source code itself
2. Or, passing the tokens

Now that the lexer and parser are working together, the API
corresponding to (2) cannot exists. The final API is mentioned in this
PR description: https://github.com/astral-sh/ruff/pull/11494.

### Refactor the downstream tools (linter and formatter)

* https://github.com/astral-sh/ruff/pull/11511
* https://github.com/astral-sh/ruff/pull/11515
* https://github.com/astral-sh/ruff/pull/11529
* https://github.com/astral-sh/ruff/pull/11562
* https://github.com/astral-sh/ruff/pull/11592

And, the final set of changes involves updating all references of the
lexer and `Tok` enum. This was done in two-parts:
1. Update all the references in a way that doesn't require any changes
from this PR i.e., it can be done independently
	* https://github.com/astral-sh/ruff/pull/11402
	* https://github.com/astral-sh/ruff/pull/11406
	* https://github.com/astral-sh/ruff/pull/11418
	* https://github.com/astral-sh/ruff/pull/11419
	* https://github.com/astral-sh/ruff/pull/11420
	* https://github.com/astral-sh/ruff/pull/11424
2. Update all the remaining references to use the changes made in this
PR

For (2), there were various strategies used:
1. Introduce a new `Tokens` struct which wraps the token vector and add
methods to query a certain subset of tokens. These includes:
	1. `up_to_first_unknown` which replaces the `tokenize` function
2. `in_range` and `after` which replaces the `lex_starts_at` function
where the former returns the tokens within the given range while the
latter returns all the tokens after the given offset
2. Introduce a new `TokenFlags` which is a set of flags to query certain
information from a token. Currently, this information is only limited to
any string type token but can be expanded to include other information
in the future as needed. https://github.com/astral-sh/ruff/pull/11578
3. Move the `CommentRanges` to the parsed output because this
information is common to both the linter and the formatter. This removes
the need for `tokens_and_ranges` function.

## Test Plan

- [x] Update and verify the test snapshots
- [x] Make sure the entire test suite is passing
- [x] Make sure there are no changes in the ecosystem checks
- [x] Run the fuzzer on the parser
- [x] Run this change on dozens of open-source projects

### Running this change on dozens of open-source projects

Refer to the PR description to get the list of open source projects used
for testing.

Now, the following tests were done between `main` and this branch:
1. Compare the output of `--select=E999` (syntax errors)
2. Compare the output of default rule selection
3. Compare the output of `--select=ALL`

**Conclusion: all output were same**

## What's next?

The next step is to introduce re-lexing logic and update the parser to
feed the recovery information to the lexer so that it can emit the
correct token. This moves us one step closer to having error resilience
in the parser and provides Ruff the possibility to lint even if the
source code contains syntax errors.
2024-06-03 18:23:50 +05:30
Dhruv Manilawala
04a922866a
Add basic docs for the parser crate (#11199)
## Summary

This PR adds a basic README for the `ruff_python_parser` crate and
updates the CONTRIBUTING docs with the fuzzer and benchmark section.

Additionally, it also updates some inline documentation within the
parser crate and splits the `parse_program` function into
`parse_single_expression` and `parse_module` which will be called by
matching against the `Mode`.

This PR doesn't go into too much internal detail around the parser logic
due to the following reasons:
1. Where should the docs go? Should it be as a module docs in `lib.rs`
or in README?
2. The parser is still evolving and could include a lot of refactors
with the future work (feedback loop and improved error recovery and
resilience)

---------

Co-authored-by: Alex Waygood <Alex.Waygood@Gmail.com>
2024-04-29 17:08:07 +00:00
Dhruv Manilawala
c30735d4a7
Add ExpressionContext for expression parsing (#11055)
## Summary

This PR adds a new `ExpressionContext` struct which is used in
expression parsing.

This solves the following problem:
1. Allowing starred expression with different precedence
2. Allowing yield expression in certain context
3. Remove ambiguity with `in` keyword when parsing a `for ... in`
statement

For context, (1) was solved by adding `parse_star_expression_list` and
`parse_star_expression_or_higher` in #10623, (2) was solved by by adding
`parse_yield_expression_or_else` in #10809, and (3) was fixed in #11009.
All of the mentioned functions have been removed in favor of the context
flags.

As mentioned in #11009, an ideal solution would be to implement an
expression context which is what this PR implements. This is passed
around as function parameter and the call stack is used to automatically
reset the context.

### Recovery

How should the parser recover if the target expression is invalid when
an expression can consume the `in` keyword?

1. Should the `in` keyword be part of the target expression?
2. Or, should the expression parsing stop as soon as `in` keyword is
encountered, no matter the expression?

For example:
```python
for yield x in y: ...

# Here, should this be parsed as
for (yield x) in (y): ...
# Or
for (yield x in y): ...
# where the `in iter` part is missing
```

Or, for binary expression parsing:
```python
for x or y in z: ...

# Should this be parsed as
for (x or y) in z: ...
# Or
for (x or y in z): ...
# where the `in iter` part is missing
```

This need not be solved now, but is very easy to change. For context
this PR does the following:
* For binary, comparison, and unary expressions, stop at `in`
* For lambda, yield expressions, consume the `in`

## Test Plan

1. Add test cases for the `for ... in` statement and verify the
snapshots
2. Make sure the existing test suite pass
3. Run the fuzzer for around 3000 generated source code
4. Run the updated logic on a dozen or so open source repositories
(codename "parser-checkouts")
2024-04-23 04:19:05 +00:00
Dhruv Manilawala
d3cd61f804
Use empty range when there's "gap" in token source (#11032)
## Summary

This fixes a bug where the parser would panic when there is a "gap" in
the token source.

What's a gap?

The reason it's `<=` instead of just `==` is because there could be
whitespaces between
the two tokens. For example:

```python
#     last token end
#     | current token (newline) start
#     v v
def foo \n
#      ^
#      assume there's trailing whitespace here
```

Or, there could tokens that are considered "trivia" and thus aren't
emitted by the token
source. These are comments and non-logical newlines. For example:

```python
#     last token end
#     v
def foo # comment\n
#                ^ current token (newline) start
```

In either of the above cases, there's a "gap" between the end of the
last token and start
of the current token.

## Test Plan

Add test cases and update the snapshots.
2024-04-19 11:36:26 +00:00
Dhruv Manilawala
13ffb5bc19
Replace LALRPOP parser with hand-written parser (#10036)
(Supersedes #9152, authored by @LaBatata101)

## Summary

This PR replaces the current parser generated from LALRPOP to a
hand-written recursive descent parser.

It also updates the grammar for [PEP
646](https://peps.python.org/pep-0646/) so that the parser outputs the
correct AST. For example, in `data[*x]`, the index expression is now a
tuple with a single starred expression instead of just a starred
expression.

Beyond the performance improvements, the parser is also error resilient
and can provide better error messages. The behavior as seen by any
downstream tools isn't changed. That is, the linter and formatter can
still assume that the parser will _stop_ at the first syntax error. This
will be updated in the following months.

For more details about the change here, refer to the PR corresponding to
the individual commits and the release blog post.

## Test Plan

Write _lots_ and _lots_ of tests for both valid and invalid syntax and
verify the output.

## Acknowledgements

- @MichaReiser for reviewing 100+ parser PRs and continuously providing
guidance throughout the project
- @LaBatata101 for initiating the transition to a hand-written parser in
#9152
- @addisoncrump for implementing the fuzzer which helped
[catch](https://github.com/astral-sh/ruff/pull/10903)
[a](https://github.com/astral-sh/ruff/pull/10910)
[lot](https://github.com/astral-sh/ruff/pull/10966)
[of](https://github.com/astral-sh/ruff/pull/10896)
[bugs](https://github.com/astral-sh/ruff/pull/10877)

---------

Co-authored-by: Victor Hugo Gomes <labatata101@linuxmail.org>
Co-authored-by: Micha Reiser <micha@reiser.io>
2024-04-18 17:57:39 +05:30