Commit graph

98 commits

Author SHA1 Message Date
Brent Westbrook
37fbe58b13
Document LinterResult::has_syntax_error and add Parsed::has_no_syntax_errors (#16443)
Summary
--

This is a follow up addressing the comments on #16425. As @dhruvmanila
pointed out, the naming is a bit tricky. I went with `has_no_errors` to
try to differentiate it from `is_valid`. It actually ends up negated in
most uses, so it would be more convenient to have `has_any_errors` or
`has_errors`, but I thought it would sound too much like the opposite of
`is_valid` in that case. I'm definitely open to suggestions here.
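
A minimal sketch of that trade-off, assuming the `parse_unchecked`/`ParseOptions` API described further down this log; the reporting line is just a stand-in:

```rust
use ruff_python_parser::{parse_unchecked, Mode, ParseOptions};

fn main() {
    // Error-resilient parse of an invalid snippet.
    let parsed = parse_unchecked("def f(:", ParseOptions::from(Mode::Module));

    // The predicate reads positively, so call sites usually end up negating it:
    if !parsed.has_no_syntax_errors() {
        eprintln!("found {} parse error(s)", parsed.errors().len());
    }
}
```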

Test Plan
--

Existing tests.
2025-03-04 08:35:38 -05:00
Brent Westbrook
e924ecbdac
[syntax-errors] except* before Python 3.11 (#16446)
Summary
--

One of the simpler ones, just detect the use of `except*` before 3.11.
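
From the outside, detection follows the pattern set up by the PRs below; a minimal sketch, assuming a target-version setter on `ParseOptions` and an accessor for the new version-related errors (both names are assumptions):

```rust
use ruff_python_ast::PythonVersion;
use ruff_python_parser::{parse_unchecked, Mode, ParseOptions};

fn main() {
    let source = "try:\n    ...\nexcept* ValueError:\n    ...\n";
    // Setter name assumed here; #16220 below only sketches `with_python_version`.
    let options = ParseOptions::from(Mode::Module)
        .with_target_version(PythonVersion::PY310);
    let parsed = parse_unchecked(source, options);
    // Accessor name for the version-related errors is also an assumption.
    assert!(!parsed.unsupported_syntax_errors().is_empty());
}
```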

Test Plan
--

New inline tests.
2025-03-02 18:20:18 +00:00
Brent Westbrook
4431978262
[syntax-errors] Assignment expressions before Python 3.8 (#16383)
## Summary
This PR is the first in a series derived from
https://github.com/astral-sh/ruff/pull/16308, each of which adds support
for detecting one version-related syntax error from
https://github.com/astral-sh/ruff/issues/6591. This one should be
the largest because it also includes the addition of the
`Parser::add_unsupported_syntax_error` method.

Otherwise I think the general structure will be the same for each syntax
error:
* Detecting the error in the parser
* Inline parser tests for the new error
* New ruff CLI tests for the new error

## Test Plan
As noted above, there are new inline parser tests, as well as new ruff
CLI
tests. Once https://github.com/astral-sh/ruff/pull/16379 is resolved,
there should also be new mdtests for red-knot,
but this PR does not currently include those.
2025-02-28 17:13:46 -05:00
Brent Westbrook
764aa0e6a1
Allow passing ParseOptions to inline tests (#16357)
## Summary

This PR adds support for a pragma-style header for inline parser tests
containing JSON-serialized `ParseOptions`. For example,

```python
# parse_options: { "target-version": "3.9" }
match 2:
    case 1:
        pass
```

The line must start with `# parse_options: ` and then the rest of the
(trimmed) line is deserialized into the `ParseOptions` used for parsing
the test.

## Test Plan

Existing inline tests, plus two new inline tests for
`match-before-py310`.

---------

Co-authored-by: Alex Waygood <alex.waygood@gmail.com>
2025-02-27 10:23:15 -05:00
Brent Westbrook
97d0659ce3
Pass ParseOptions to the parser (#16220)
## Summary

This is part of the preparation for detecting syntax errors in the
parser from https://github.com/astral-sh/ruff/pull/16090/. As suggested
in [this
comment](https://github.com/astral-sh/ruff/pull/16090/#discussion_r1953084509),
I started working on a `ParseOptions` struct that could be stored in the
parser. For this initial refactor, I only made it hold the existing
`Mode` option, but for syntax errors, we will also need it to have a
`PythonVersion`. For that use case, I'm picturing something like a
`ParseOptions::with_python_version` method, so you can extend the
current calls to something like

```rust
ParseOptions::from(mode).with_python_version(settings.target_version)
```

But I thought it was worth adding `ParseOptions` alone without changing
any other behavior first.

Most of the diff is just updating call sites taking `Mode` to take
`ParseOptions::from(Mode)` or those taking `PySourceType`s to take
`ParseOptions::from(PySourceType)`. The interesting changes are in the
new `parser/options.rs` file and smaller parts of `parser/mod.rs` and
`ruff_python_parser/src/lib.rs`.
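
Concretely, the mechanical change at a call site looks like this (sketch):

```rust
use ruff_python_parser::{parse_unchecked, Mode, ParseOptions};

fn main() {
    let source = "x = 1";
    // Before this PR (sketch): parse_unchecked(source, Mode::Module)
    // After, with identical behavior:
    let parsed = parse_unchecked(source, ParseOptions::from(Mode::Module));
    assert!(parsed.errors().is_empty());
}
```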

## Test Plan

Existing tests, this should not change any behavior.
2025-02-19 10:50:50 -05:00
Andrew Gallant
17f01a4355 test: add more missing carets
This update adds some missing `^` to the diagnostic annotations.

This update also includes some shifting of "syntax error" annotations to
the end of the preceding line. I believe this is technically a
regression, but fixing them has proven quite difficult. I *think* the
best way to do that might be to tweak the spans generated by the Python
parser errors, but I didn't want to dig into that. (Another approach
would be to change the `annotate-snippets` rendering, but when I tried
that and managed to fix these regressions, I ended up causing a bunch of
other regressions.)

Ref 77d454525e (r1915458616)
2025-01-15 13:37:52 -05:00
Andrew Gallant
84ba4ecaf5 ruff_annotate_snippets: support overriding the "cut indicator"
We do this because `...` is valid Python, which makes it pretty likely
that some line trimming will lead to ambiguous output. So we add support
for overriding the cut indicator. This also requires changing some of
the alignment math, which was previously tightly coupled to `...`.

For Ruff, we go with `…` (`U+2026 HORIZONTAL ELLIPSIS`) for our cut
indicator.
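
A sketch of the intended usage, assuming the builder-style setter from the upstream patch (the method name follows that patch and isn't verified against the vendored API):

```rust
use ruff_annotate_snippets::Renderer;

fn main() {
    // `…` cannot appear in valid Python the way `...` can, so trimmed
    // lines stay unambiguous.
    let _renderer = Renderer::styled().cut_indicator("…");
}
```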

For more details, see the patch sent to upstream:
https://github.com/rust-lang/annotate-snippets-rs/pull/172
2025-01-15 13:37:52 -05:00
Andrew Gallant
5caef89af3 test: update snapshots with improper end-of-line placement
This looks like a bug fix that occurs when the annotation is a
zero-width span immediately following a line terminator. Previously, the
caret was rendered on the next line, but it should be rendered
at the end of the line the span corresponds to.

I admit that this one is kinda weird. I would somewhat expect that our
spans here are actually incorrect, and that to obtain this sort of
rendering, we should identify a span just immediately _before_ the line
terminator and not after it. But I don't want to dive into that rabbit
hole for now (and given how `annotate-snippets` now renders these
spans, perhaps there is more to it than I see), and this does seem like
a clear improvement given the spans we feed to `annotate-snippets`.
2025-01-15 13:37:52 -05:00
Andrew Gallant
f49cfb6c28 test: update snapshots with missing ^
The previous rendering just seems wrong in that a `^` is omitted. The
new version of `annotate-snippets` seems to get this right. I checked a
pseudo random sample of these, and it seems to only happen when the
position pointed at a line terminator.
2025-01-15 13:37:52 -05:00
Andrew Gallant
3fa4479c85 test: update snapshots with missing annotations
These updates center around the addition of annotations in the
diagnostic rendering. Previously, the annotation was just not rendered
at all. With the `annotate-snippets` upgrade, it is now rendered. I
examined a pseudo random sample of these, and they all look correct.

As will be true in future batches, some of these snapshots also have
changes to whitespace in them as well.
2025-01-15 13:37:52 -05:00
Andrew Gallant
0de8216a25 test: update snapshots with just whitespace changes
These snapshot changes should *all* only be a result of changes to
trailing whitespace in the output. I checked a pseudo random sample of
these, and the whitespace found in the previous snapshots seems to be an
artifact of the rendering and _not_ of the source data. So this seems
like a strict bug fix to me.

There are other snapshots with whitespace changes, but they also have
other changes that we split out into separate commits. Basically, we're
going to do approximately one commit per category of change.

This represents, by far, the biggest chunk of changes to snapshots as a
result of the `annotate-snippets` upgrade.
2025-01-15 13:37:52 -05:00
Andrew Gallant
84179aaa96 ruff_linter,ruff_python_parser: migrate to updated annotate-snippets
This is pretty much just moving to the new API and taking care to use
byte offsets. This is *almost* enough. The next commit will fix a bug
involving the handling of unprintable characters as a result of
switching to byte offsets.
2025-01-15 13:37:52 -05:00
Dylan
c1eaf6ff72
Modify parsing of raise with cause when exception is absent (#15049)
When confronted with `raise from exc` the parser will now create a
`StmtRaise` that has `None` for the exception and `exc` for the cause.

Before, the parser created a `StmtRaise` with `from` for the exception,
no cause, and a spurious expression `exc` afterwards.
2024-12-19 13:36:32 +00:00
Dylan
a3bb0cd5ec
Raise syntax error for mixing except and except* (#14895)
This PR adds a syntax error if the parser encounters a `TryStmt` that
has except clauses both with and without a star.

The displayed error points to each except clause that contradicts the
original except clause kind. So, for example,

```python
try:
    ....
except:     #<-- we assume this is the desired except kind
    ....
except*:    #<---  error will point here
    ....
except*:    #<--- and here
    ....
```

Closes #14860
2024-12-10 17:50:55 -06:00
Micha Reiser
c847cad389
Update insta snapshots (#14366)
2024-11-15 19:31:15 +01:00
Dhruv Manilawala
978909fcf4
Raise syntax error for unparenthesized generator expr in multi-argument call (#12445)
## Summary

This PR fixes a bug to raise a syntax error when an unparenthesized
generator expression is used as an argument to a call when there is
more than one argument.

For reference, the grammar is:
```
primary:
    | ...
    | primary genexp 
    | primary '(' [arguments] ')' 
    | ...

genexp:
    | '(' ( assignment_expression | expression !':=') for_if_clauses ')' 
```

The `genexp` requires the parentheses as mentioned in the grammar. So,
the grammar for a call expression is either a name followed by a
generator expression or a name followed by a list of arguments. In the
former case, the parentheses are excluded because the generator
expression provides them, while in the latter case the parentheses are
explicitly provided for a list of arguments, which means that the
generator expression requires its own parentheses.

This was discovered in https://github.com/astral-sh/ruff/issues/12420.

## Test Plan

Add test cases for valid and invalid syntax.

Make sure that the parser from CPython also raises this at the parsing
step:
```console
$ python3.13 -m ast parser/_.py
  File "parser/_.py", line 1
    total(1, 2, x for x in range(5), 6)
                ^^^^^^^^^^^^^^^^^^^
SyntaxError: Generator expression must be parenthesized

$ python3.13 -m ast parser/_.py
  File "parser/_.py", line 1
    sum(x for x in range(10), 10)
        ^^^^^^^^^^^^^^^^^^^^
SyntaxError: Generator expression must be parenthesized
```
2024-07-22 14:44:20 +05:30
Micha Reiser
5109b50bb3
Use CompactString for Identifier (#12101) 2024-07-01 10:06:02 +02:00
Dhruv Manilawala
434ce307a7
Revert "Use correct range to highlight line continuation error" (#12089)
This PR reverts https://github.com/astral-sh/ruff/pull/12016 with a
small change where the error location points to the continuation
character only. Earlier, it would also highlight the whitespace that
came before it.

The motivation for this change is to avoid a panic in
https://github.com/astral-sh/ruff/pull/11950. For example:

```py
\)
```

Playground: https://play.ruff.rs/87711071-1b54-45a3-b45a-81a336a1ea61

The range of `Unknown` token and `Rpar` is the same. Once #11950 is
enabled, the indexer would panic. It won't panic in the stable version
because we stop at the first `Unknown` token.
2024-06-28 18:10:00 +05:30
Dhruv Manilawala
a4688aebe9
Use TokenSource to find new location for re-lexing (#12060)
## Summary

This PR splits the re-lexing logic into two parts:
1. `TokenSource`: The token source will be responsible for finding the
position the lexer needs to be moved to
2. `Lexer`: The lexer will be responsible for reducing the nesting level
and moving itself to the new position if recovered from a parenthesized
context

This split makes it easy to find the new lexer position without needing
to implement the backwards lexing logic again which would need to handle
cases involving:
* Different kinds of newlines
* Line continuation character(s)
* Comments
* Whitespaces

### F-strings

This change did reveal one thing about re-lexing f-strings. Consider the
following example:
```py
f'{'
#  ^
f'foo'
```

Here, the quote as highlighted by the caret (`^`) is the start of a
string inside an f-string expression. This is an unterminated string, which
means the token emitted is actually `Unknown`. The parser tries to
recover from it, but there's no newline token in the vector, so the new
logic doesn't recover from it. The previous logic does recover because
it's looking at the raw characters instead.

The parser would be at `FStringStart` (the one for the second line) when
it calls into the re-lexing logic to recover from an unterminated
f-string on the first line. So, moving backwards the first character
encountered is a newline character but the first token encountered is an
`Unknown` token.

This is improved with #12067 

fixes: #12046 
fixes: #12036

## Test Plan

Update the snapshot and validate the changes.
2024-06-27 17:12:39 +05:30
Dhruv Manilawala
e137c824c3
Avoid consuming newline for unterminated string (#12067)
## Summary

This PR fixes the lexer logic to **not** consume the newline character
for an unterminated string literal.

Currently, the lexer would consume it to be part of the string itself
but that would be bad for recovery because then the lexer wouldn't emit
the newline token ever. This PR fixes that to avoid consuming the
newline character in that case.

This was discovered during https://github.com/astral-sh/ruff/pull/12060.

## Test Plan

Update the snapshots and validate them.
2024-06-27 17:02:48 +05:30
Dhruv Manilawala
47c9ed07f2
Consider 2-character EOL before line continuation (#12035)
## Summary

This PR fixes a bug introduced in
https://github.com/astral-sh/ruff/pull/12008 which didn't consider the
two-character newline after the line continuation character.

For example, consider the following code highlighted with whitespaces:
```py
call(foo # comment \\r\n
\r\n
def bar():\r\n
....pass\r\n
```
The lexer is at `def` when it's running the re-lexing logic and trying
to move back to a newline character. It encounters `\n` and concludes
that it isn't being escaped (incorrect: the continuation character
escapes the `\r`, and with it the whole `\r\n` sequence), so it moves
the lexer to the `\n` character. This creates an overlap in token ranges
which causes the panic.

```
Name 0..4
Lpar 4..5
Name 5..8
Comment 9..20
NonLogicalNewline 20..22 <-- overlap between
Newline 21..22           <-- these two tokens
NonLogicalNewline 22..23
Def 23..26
...
```

fixes: #12028 

## Test Plan

Add a test case with a line continuation and a Windows-style newline
character.
2024-06-26 14:00:48 +05:30
Dhruv Manilawala
7cb2619ef5
Add syntax error for empty type parameter list (#12030)
## Summary

(I'm pretty sure I added this in the parser re-write but it must've got
lost in the rebase?)

This PR raises a syntax error if the type parameter list is empty.

As per the grammar, there should be at least one type parameter:
```
type_params: 
    | invalid_type_params
    | '[' type_param_seq ']' 

type_param_seq: ','.type_param+ [','] 
```

Verified via the builtin `ast` module as well:
```console    
$ python3.13 -m ast parser/_.py
Traceback (most recent call last):
  [..]
  File "parser/_.py", line 1
    def foo[]():
            ^
SyntaxError: Type parameter list cannot be empty
```

## Test Plan

Add inline test cases and update the snapshots.
2024-06-26 08:10:35 +05:30
Dhruv Manilawala
7109214b57
Update parser tests to validate token ranges (#12019)
## Summary

This PR updates the parser test infrastructure to validate the token
ranges.

From the code documentation:
```
/// Verifies that:
/// * the ranges are strictly increasing when looping over the tokens in insertion order
/// * all ranges are within the length of the source code
```

Follow-up from #12016 and #12017
resolves: #11938

## Test Plan

Make sure that there are no failures.
2024-06-25 08:14:28 +00:00
Dhruv Manilawala
d930e97212
Do not include newline for unterminated string range (#12017)
## Summary

This PR updates the unterminated string error range to not include the
final newline character.

This is a follow-up to #12016 and required for #12019

This is not done when the unterminated string goes to the end of the
file (rather than a newline character). The unterminated f-string range is
correct.

### Why is this required for #12019 ?

Because otherwise the token ranges will overlap. For example:
```py
f"{"
f"{foo!r"
```

Here, the re-lexing logic recovers from an unterminated f-string and
thus emitting a `Newline` token for the one at the end of the first
line. But, currently the `Unknown` and the `Newline` token would overlap
because the `Unknown` token (unterminated string literal) range would
include the newline character.

## Test Plan

Update and validate the snapshot.
2024-06-25 08:10:07 +00:00
Dhruv Manilawala
9c1b6ec411
Use correct range to highlight line continuation error (#12016)
## Summary

This PR fixes the range highlighted for the line continuation error.

Previously, it would highlight an incorrect range:
```
1 | call(a, b, \\\
  |           ^^ Syntax Error: unexpected character after line continuation character
2 | 
3 | def bar():
  |
```

And now:
```
  |
1 | call(a, b, \\\
  |             ^ Syntax Error: unexpected character after line continuation character
2 | 
3 | def bar():
  |
```

This is implemented by not updating the token range for the
`Unknown` token, which is emitted when there's a lexical error. Instead,
the `push_error` helper method will be responsible for updating the range
to the error location.

This actually becomes a requirement which can be seen in follow-up PRs.

## Test Plan

Update and validate the snapshot.
2024-06-25 13:35:24 +05:30
Dhruv Manilawala
68a8978454
Consider line continuation character for re-lexing (#12008)
## Summary

This PR fixes a bug where the re-lexing logic didn't consider the line
continuation character being present before the newline character. This
meant that the lexer was being moved back to the newline character which
is actually ignored via `\`.

Considering the following code:
```py
f'middle {'string':\
        'format spec'}

```

The old token stream is:
```
...
Colon 18..19
FStringMiddle 19..29 (flags = F_STRING)
Newline 20..21
Indent 21..29
String 29..42
Rbrace 42..43
...
```

Notice how the ranges are overlapping between the `FStringMiddle` token
and the tokens emitted after moving the lexer backwards.

After this fix, the new token stream (produced without moving the lexer
backwards in this scenario) is:
```
FStringStart 0..2 (flags = F_STRING)
FStringMiddle 2..9 (flags = F_STRING)
Lbrace 9..10
String 10..18
Colon 18..19
FStringMiddle 19..29 (flags = F_STRING)
FStringEnd 29..30 (flags = F_STRING)
Name 30..36
Name 37..41
Unknown 41..44
Newline 44..45
```

fixes: #12004 

## Test Plan

Add test cases and update the snapshots.
2024-06-25 02:13:54 +00:00
Dhruv Manilawala
ed948eaefb
Avoid moving back the lexer for triple-quoted fstring (#11939)
## Summary

This PR avoids moving back the lexer for a triple-quoted f-string during
the re-lexing phase.

The reason this is a problem is that for a triple-quoted f-string the
newlines are part of the f-string itself, specifically they'll be part
of the `FStringMiddle` token. So, if we moved the lexer back, there
would be a `Newline` token whose range would be in between an
`FStringMiddle` token. This creates a panic in downstream usage.

fixes: #11937 

## Test Plan

Add test cases and validate the snapshots.
2024-06-20 16:27:36 +05:30
Dhruv Manilawala
b617d90651
Update E999 to show all syntax errors (#11900)
## Summary

This PR updates the linter to show all the parse errors as diagnostics
instead of just the first one.

Note that this doesn't affect the parse error displayed as error log
message. This will be removed in a follow-up PR.

### Breaking?

I don't think this is a breaking change even though this might give more
diagnostics. The main reason is that this shouldn't affect any users
because it'll only give additional diagnostics in the case of multiple
syntax errors.

## Test Plan

Add an integration test case which would raise more than one parse
error.
2024-06-19 13:09:54 +05:30
Dhruv Manilawala
cdc7c71449
Avoid consuming trailing whitespace during re-lexing (#11933)
## Summary

This PR updates the re-lexing logic to avoid consuming the trailing
whitespace and move the lexer explicitly to the last newline character
encountered while moving backwards.

Consider the following code snippet as taken from the test case
highlighted with whitespace (`.`) and newline (`\n`) characters:
```py
# There is trailing whitespace before the newline character but that
# whitespace is part of the comment token
f"""hello {x # comment....\n
#                     ^
y = 1\n
```

The parser is at `y` when it's trying to recover from an unclosed `{`,
so it calls into the re-lexing logic which tries to move the lexer back
to the end of the previous line. But, as it consumed all the whitespace,
it moved the lexer to the location marked by `^` in the above code
snippet. Those whitespace characters, however, are part of the comment
token. This means that the ranges of the two tokens were overlapping,
which introduced the panic.

Note that this is only a bug when there's a comment with a trailing
whitespace otherwise it's fine to move the lexer to the whitespace
character. This is because the lexer would just skip the whitespace
otherwise. Nevertheless, this PR updates the logic to move it explicitly
to the newline character in all cases.

fixes: #11929 

## Test Plan

Add test cases and update the snapshot. Make sure that it doesn't panic
on the code snippet in the linked issue.
2024-06-19 12:14:18 +05:30
Dhruv Manilawala
1e0642fac8
Use re-lexing for normal list parsing (#11871)
## Summary

This PR is a follow-up on #11845 to add the re-lexing logic for normal
list parsing.

A normal list parsing is basically parsing elements without any
separator in between, i.e., there can only be trivia tokens in between
the two elements. Currently, this is only being used for parsing
**assignment statements** and **f-string elements**. Assignment
statements cannot be in a parenthesized context, but f-strings can have
curly braces, so this PR is specifically for them.

I don't think this is an ideal recovery, but the problem is that both
the lexer and the parser could add an error for f-strings. If the lexer
adds an error, it'll emit an `Unknown` token instead, while the parser
adds the error directly.
emitted by the parser instead. This way the parser can correctly inform
the lexer that it's out of an f-string and then the lexer can pop the
current f-string context out of the stack.

## Test Plan

Add test cases, update the snapshots, and run the fuzzer.
2024-06-18 12:14:41 +05:30
Dhruv Manilawala
8499abfa7f
Implement re-lexing logic for better error recovery (#11845)
## Summary

This PR implements the re-lexing logic in the parser.

This logic is only applied when recovering from an error during list
parsing. The logic is as follows:
1. During list parsing, if an unexpected token is encountered and it
detects that an outer context can understand it and thus recover from
it, it invokes the re-lexing logic in the lexer
2. This logic first checks if the lexer is in a parenthesized context
and returns if it's not. Thus, the logic is a no-op if the lexer isn't
in a parenthesized context
3. It then reduces the nesting level by 1. It shouldn't reset it to 0
because otherwise the recovery from nested list parsing will be
incorrect
4. Then, it tries to find the last newline character going backwards from
the current position of the lexer. It skips over any whitespace, but if it
encounters any character other than a newline or whitespace, it aborts.
5. Now, if there's a newline character, then it needs to be re-lexed in
a logical context which means that the lexer needs to emit it as a
`Newline` token instead of `NonLogicalNewline`.
6. If the re-lexing gives a different token than the current one, the
token source needs to update its token collection to remove all the
tokens which come after the new current position.

It turns out that the list parsing isn't that happy with the results so
it requires some re-arranging such that the following two errors are
raised correctly:
1. Expected comma
2. Recovery context error

For (1), the following scenarios need to be considered:
* Missing comma between two elements
* Half parsed element because the grammar doesn't allow it (for example,
named expressions)

For (2), the following scenarios need to be considered:
1. If the parser is at a comma which means that there's a missing
element otherwise the comma would've been consumed by the first `eat`
call above. And, the parser doesn't take the re-lexing route on a comma
token.
2. If it's the first element and the current token is not a comma which
means that it's an invalid element.

resolves: #11640 

## Test Plan

- [x] Update existing test snapshots and validate them
- [x] Add additional test cases specific to the re-lexing logic and
validate the snapshots
- [x] Run the fuzzer on 3000+ valid inputs
- [x] Run the fuzzer on invalid inputs
- [x] Run the parser on various open source projects
- [x] Make sure the ecosystem changes are none
2024-06-17 06:47:00 +00:00
Dhruv Manilawala
a525b4be3d
Separate terminator token for f-string elements kind (#11842)
## Summary

This PR separates the terminator token for f-string elements depending
on the context. A list of f-string elements can occur either in a regular
f-string or in the format spec of an f-string. The terminator token is
different depending on that context.

## Test Plan

`cargo insta test` and verify the updated snapshots.
2024-06-12 13:57:35 +05:30
Micha Reiser
32ca704956
Rename PreorderVisitor to SourceOrderVisitor (#11798)
Co-authored-by: Alex Waygood <Alex.Waygood@Gmail.com>
2024-06-07 17:01:58 +00:00
Dhruv Manilawala
6c1fa1d440
Use speculative parsing for with-items (#11770)
## Summary

This PR updates the with-items parsing logic to use speculative parsing
instead.

### Existing logic

First, let's understand the previous logic:
1. The parser sees `(`, it doesn't know whether it's part of a
parenthesized with items or a parenthesized expression
2. Consider it a parenthesized with items and perform a hand-rolled
speculative parsing
3. Then, verify the assumption and if it's incorrect convert the parsed
with items into an appropriate expression which becomes part of the
first with item

Here, in (3) there are lots of edge cases which we have to deal with:
1. Trailing comma with a single element should be [converted to the
expression as
is](9b2cf569b2/crates/ruff_python_parser/src/parser/statement.rs (L2140-L2153))
2. Trailing comma with multiple elements should be [converted to a tuple
expression](9b2cf569b2/crates/ruff_python_parser/src/parser/statement.rs (L2155-L2178))
3. Limit the allowed expression based on whether it's
[(1)](9b2cf569b2/crates/ruff_python_parser/src/parser/statement.rs (L2144-L2152))
or
[(2)](9b2cf569b2/crates/ruff_python_parser/src/parser/statement.rs (L2157-L2171))
4. [Consider postfix
expressions](9b2cf569b2/crates/ruff_python_parser/src/parser/statement.rs (L2181-L2200))
after (3)
5. [Consider `if`
expressions](9b2cf569b2/crates/ruff_python_parser/src/parser/statement.rs (L2203-L2208))
after (3)
6. [Consider binary
expressions](9b2cf569b2/crates/ruff_python_parser/src/parser/statement.rs (L2210-L2228))
after (3)

Consider other cases like
* [Single generator
expression](9b2cf569b2/crates/ruff_python_parser/src/parser/statement.rs (L2020-L2035))
* [Expecting a
comma](9b2cf569b2/crates/ruff_python_parser/src/parser/statement.rs (L2122-L2130))

And, this is all possible only if we allow parsing these expressions in
the [with item parsing
logic](9b2cf569b2/crates/ruff_python_parser/src/parser/statement.rs (L2287-L2334)).

### Speculative parsing

With #11457 merged, we can simplify this logic by changing step (3)
from above to just rewind the parser back to the `(` if our assumption
(parenthesized with-items) was incorrect, and then continue parsing it
as a parenthesized expression.

This also behaves a lot like what a PEG parser does: consider the first
grammar rule and, if it fails, consider the second grammar rule and so
on.
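
In sketch form (the `Parser` methods and helpers here are illustrative stand-ins for the internals from #11457 below, not the exact API):

```rust
fn parse_parenthesized_with_items_or_expr(parser: &mut Parser) {
    // Try the first "grammar rule": parenthesized with-items.
    let checkpoint = parser.checkpoint();
    if !try_parse_parenthesized_with_items(parser) {
        // Assumption was wrong: rewind to the `(` and reparse everything
        // as a parenthesized expression instead.
        parser.rewind(checkpoint);
        parse_parenthesized_expression(parser);
    }
}
```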

resolves: #11639 

## Test Plan

- [x] Verify the updated snapshots
- [x] Run the fuzzer on around 3000 valid source code (locally)
2024-06-06 08:59:56 +00:00
Dhruv Manilawala
bf5b62edac
Maintain synchronicity between the lexer and the parser (#11457)
## Summary

This PR updates the entire parser stack in multiple ways:

### Make the lexer lazy

* https://github.com/astral-sh/ruff/pull/11244
* https://github.com/astral-sh/ruff/pull/11473

Previously, Ruff's lexer would act as an iterator. The parser would
collect all the tokens in a vector first and then process the tokens to
create the syntax tree.

The first task in this project is to update the entire parsing flow to
make the lexer lazy. This includes the `Lexer`, `TokenSource`, and
`Parser`. For context, the `TokenSource` is a wrapper around the `Lexer`
to filter out the trivia tokens[^1]. Now, the parser will ask the token
source to get the next token and only then the lexer will continue and
emit the token. This means that the lexer needs to be aware of the
"current" token. When the `next_token` is called, the current token will
be updated with the newly lexed token.

The main motivation to make the lexer lazy is to allow re-lexing a token
in a different context. This is going to be really useful to make the
parser error resilient. For example, currently the emitted tokens
remain the same even if the parser can recover from an unclosed
parenthesis. This is important because the lexer emits a
`NonLogicalNewline` in parenthesized context while a normal `Newline` in
non-parenthesized context. These different kinds of newlines are also used
to emit the indentation tokens, which are important for the parser as
they're used to determine the start and end of a block.

Additionally, this allows us to implement the following functionalities:
1. Checkpoint - rewind infrastructure: The idea here is to create a
checkpoint and continue lexing. At a later point, this checkpoint can be
used to rewind the lexer back to the provided checkpoint.
2. Remove the `SoftKeywordTransformer` and instead use lookahead or
speculative parsing to determine whether a soft keyword is a keyword or
an identifier
3. Remove the `Tok` enum. The `Tok` enum represents the tokens emitted
by the lexer but it contains owned data which makes it expensive to
clone. The new `TokenKind` enum just represents the type of token which
is very cheap.

This brings up a question as to how the parser will get the owned value
which was stored on `Tok`. This will be solved by introducing a new
`TokenValue` enum which only contains a subset of token kinds which has
the owned value. This is stored on the lexer and is requested by the
parser when it wants to process the data. For example:
8196720f80/crates/ruff_python_parser/src/parser/expression.rs (L1260-L1262)

[^1]: Trivia tokens are `NonLogicalNewline` and `Comment`

### Remove `SoftKeywordTransformer`

* https://github.com/astral-sh/ruff/pull/11441
* https://github.com/astral-sh/ruff/pull/11459
* https://github.com/astral-sh/ruff/pull/11442
* https://github.com/astral-sh/ruff/pull/11443
* https://github.com/astral-sh/ruff/pull/11474

For context,
https://github.com/RustPython/RustPython/pull/4519/files#diff-5de40045e78e794aa5ab0b8aacf531aa477daf826d31ca129467703855408220
added support for soft keywords in the parser which uses infinite
lookahead to classify a soft keyword as a keyword or an identifier. This
is a brilliant idea as it basically wraps the existing Lexer and works
on top of it which means that the logic for lexing and re-lexing a soft
keyword remains separate. The change here is to remove
`SoftKeywordTransformer` and let the parser determine this based on
context, lookahead and speculative parsing.

* **Context:** The transformer needs to know whether the lexer
is at a statement position or a simple statement position.
This is because a `match` token starts a compound statement while a
`type` token starts a simple statement. **The parser already knows
this.**
* **Lookahead:** Now that the parser knows the context, it can perform
a lookahead of up to two tokens to classify the soft keyword. The logic
for this is mentioned in the PRs implementing it for the `type` and
`match` soft keywords (a simplified sketch follows this list).
* **Speculative parsing:** This is where the checkpoint - rewind
infrastructure helps. For `match` soft keyword, there are certain cases
for which we can't classify based on lookahead. The idea here is to
create a checkpoint and keep parsing. Based on whether the parsing was
successful and what tokens are ahead we can classify the remaining
cases. Refer to #11443 for more details.
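
A simplified sketch of the lookahead bullet above (the peek helpers and exact conditions are illustrative; the real logic lives in the linked PRs):

```rust
fn type_is_keyword_here(parser: &Parser) -> bool {
    // `type` starts a type alias statement only when followed by a name
    // and then `=` (or `[` for PEP 695 type parameters).
    parser.at(TokenKind::Type)
        && parser.peek_nth(1) == TokenKind::Name
        && matches!(parser.peek_nth(2), TokenKind::Equal | TokenKind::Lsqb)
}
```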

If the soft keyword is being parsed in an identifier context, it'll be
converted to an identifier and the emitted token will be updated as
well. Refer
8196720f80/crates/ruff_python_parser/src/parser/expression.rs (L487-L491).

The `case` soft keyword doesn't require any special handling because
it'll be a keyword only in the context of a match statement.

### Update the parser API

* https://github.com/astral-sh/ruff/pull/11494
* https://github.com/astral-sh/ruff/pull/11505

Now that the lexer is in sync with the parser, and the parser helps to
determine whether a soft keyword is a keyword or an identifier, the
lexer cannot be used on its own. The reason being that it's not
sensitive to the context (which is correct). This means that the parser
API needs to be updated to not allow any access to the lexer.

Previously, there were multiple ways to parse the source code:
1. Passing the source code itself
2. Or, passing the tokens

Now that the lexer and parser are working together, the API
corresponding to (2) cannot exist. The final API is mentioned in this
PR description: https://github.com/astral-sh/ruff/pull/11494.

### Refactor the downstream tools (linter and formatter)

* https://github.com/astral-sh/ruff/pull/11511
* https://github.com/astral-sh/ruff/pull/11515
* https://github.com/astral-sh/ruff/pull/11529
* https://github.com/astral-sh/ruff/pull/11562
* https://github.com/astral-sh/ruff/pull/11592

And, the final set of changes involves updating all references to the
lexer and the `Tok` enum. This was done in two parts:
1. Update all the references in a way that doesn't require any changes
from this PR, i.e., it can be done independently
	* https://github.com/astral-sh/ruff/pull/11402
	* https://github.com/astral-sh/ruff/pull/11406
	* https://github.com/astral-sh/ruff/pull/11418
	* https://github.com/astral-sh/ruff/pull/11419
	* https://github.com/astral-sh/ruff/pull/11420
	* https://github.com/astral-sh/ruff/pull/11424
2. Update all the remaining references to use the changes made in this
PR

For (2), there were various strategies used:
1. Introduce a new `Tokens` struct which wraps the token vector and adds
methods to query a certain subset of tokens (see the sketch after this
list). These include:
	1. `up_to_first_unknown`, which replaces the `tokenize` function
	2. `in_range` and `after`, which replace the `lex_starts_at` function:
the former returns the tokens within the given range while the latter
returns all the tokens after the given offset
2. Introduce a new `TokenFlags` which is a set of flags to query certain
information from a token. Currently, this information is limited to
string-type tokens but can be expanded to include other information
in the future as needed. https://github.com/astral-sh/ruff/pull/11578
3. Move the `CommentRanges` to the parsed output because this
information is common to both the linter and the formatter. This removes
the need for the `tokens_and_ranges` function.
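
A sketch of those queries (method names from the list above; the exact signatures are assumed):

```rust
use ruff_python_parser::{parse_unchecked, Mode};
use ruff_text_size::{TextRange, TextSize};

fn main() {
    let parsed = parse_unchecked("x = 1\ny = 2\n", Mode::Module);
    let tokens = parsed.tokens();
    let _valid = tokens.up_to_first_unknown(); // replaces `tokenize`
    // Together these replace `lex_starts_at`:
    let _slice = tokens.in_range(TextRange::new(TextSize::new(0), TextSize::new(5)));
    let _rest = tokens.after(TextSize::new(6));
}
```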

## Test Plan

- [x] Update and verify the test snapshots
- [x] Make sure the entire test suite is passing
- [x] Make sure there are no changes in the ecosystem checks
- [x] Run the fuzzer on the parser
- [x] Run this change on dozens of open-source projects

### Running this change on dozens of open-source projects

Refer to the PR description to get the list of open source projects used
for testing.

Now, the following tests were done between `main` and this branch:
1. Compare the output of `--select=E999` (syntax errors)
2. Compare the output of default rule selection
3. Compare the output of `--select=ALL`

**Conclusion: all outputs were the same**

## What's next?

The next step is to introduce re-lexing logic and update the parser to
feed the recovery information to the lexer so that it can emit the
correct token. This moves us one step closer to having error resilience
in the parser and provides Ruff the possibility to lint even if the
source code contains syntax errors.
2024-06-03 18:23:50 +05:30
Dhruv Manilawala
83152fff92
Include soft keywords for is_keyword check (#11445)
## Summary

This PR updates the `TokenKind::is_keyword` check to include soft
keywords. To account for this change, it adds a new
`is_non_soft_keyword` method.

The usage in logical line rules was updated to use the
`is_non_soft_keyword` method, but it'll be updated to use `is_keyword` in
a follow-up PR (#11446).

The parser usages, meanwhile, were kept as is. Because of that, the
snapshots for two test cases were updated in a better direction.
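
In sketch form, the new split (these assertions follow directly from the description; `Match` is a soft keyword, `Def` a regular one):

```rust
use ruff_python_parser::TokenKind;

fn main() {
    // `is_keyword` now includes soft keywords...
    assert!(TokenKind::Match.is_keyword());
    // ...while the new method excludes them.
    assert!(!TokenKind::Match.is_non_soft_keyword());
    assert!(TokenKind::Def.is_non_soft_keyword());
}
```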

## Test Plan

`cargo insta test`
2024-05-17 10:26:48 +05:30
Dimitri Papadopoulos Orfanos
3b0584449d
Fix a few typos found by codespell (#11404)
## Summary

Just fix typos.

## Test Plan

CI jobs.

---------

Co-authored-by: Dhruv Manilawala <dhruvmanila@gmail.com>
2024-05-13 13:22:35 +00:00
Alex Waygood
6774f27f4b
Refactor the ExprDict node (#11267)
Co-authored-by: Micha Reiser <micha@reiser.io>
2024-05-07 11:46:10 +00:00
Alex Waygood
87929ad5f1
Add convenience methods for iterating over all parameter nodes in a function (#11174) 2024-04-29 10:36:15 +00:00
Carl Meyer
845ba7cf5f
Make ImportFrom level just a u32 (#11170) 2024-04-26 20:38:35 -06:00
Jelle Zijlstra
cd3e319538
Add support for PEP 696 syntax (#11120) 2024-04-26 09:47:29 +02:00
Dhruv Manilawala
c30735d4a7
Add ExpressionContext for expression parsing (#11055)
## Summary

This PR adds a new `ExpressionContext` struct which is used in
expression parsing.

This solves the following problems:
1. Allowing starred expression with different precedence
2. Allowing yield expression in certain context
3. Remove ambiguity with `in` keyword when parsing a `for ... in`
statement

For context, (1) was solved by adding `parse_star_expression_list` and
`parse_star_expression_or_higher` in #10623, (2) was solved by adding
`parse_yield_expression_or_else` in #10809, and (3) was fixed in #11009.
All of the mentioned functions have been removed in favor of the context
flags.

As mentioned in #11009, an ideal solution would be to implement an
expression context, which is what this PR implements. This is passed
around as a function parameter, and the call stack is used to
automatically reset the context.
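
A self-contained sketch of the pattern (all names illustrative): because the context is passed by value, a callee can tighten it for its own subexpressions and the caller's copy is restored for free when the frame returns:

```rust
#[derive(Clone, Copy, Default)]
struct ExpressionContext {
    exclude_in: bool, // stop at `in` while parsing a `for` target (sketch)
}

impl ExpressionContext {
    fn with_in_excluded(self) -> Self {
        ExpressionContext { exclude_in: true }
    }
}

fn parse_for_target(context: ExpressionContext) {
    // The tightened copy only flows downward...
    parse_expression(context.with_in_excluded());
    // ...and `context` itself is untouched here: no explicit reset needed.
}

fn parse_expression(_context: ExpressionContext) { /* ... */ }
```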

### Recovery

How should the parser recover if the target expression is invalid when
an expression can consume the `in` keyword?

1. Should the `in` keyword be part of the target expression?
2. Or, should the expression parsing stop as soon as `in` keyword is
encountered, no matter the expression?

For example:
```python
for yield x in y: ...

# Here, should this be parsed as
for (yield x) in (y): ...
# Or
for (yield x in y): ...
# where the `in iter` part is missing
```

Or, for binary expression parsing:
```python
for x or y in z: ...

# Should this be parsed as
for (x or y) in z: ...
# Or
for (x or y in z): ...
# where the `in iter` part is missing
```

This need not be solved now, but it is very easy to change. For context,
this PR does the following:
* For binary, comparison, and unary expressions, stop at `in`
* For lambda, yield expressions, consume the `in`

## Test Plan

1. Add test cases for the `for ... in` statement and verify the
snapshots
2. Make sure the existing test suite pass
3. Run the fuzzer for around 3000 generated source code
4. Run the updated logic on a dozen or so open source repositories
(codename "parser-checkouts")
2024-04-23 04:19:05 +00:00
Dhruv Manilawala
d3cd61f804
Use empty range when there's "gap" in token source (#11032)
## Summary

This fixes a bug where the parser would panic when there is a "gap" in
the token source.

What's a gap?

The reason it's `<=` instead of just `==` is that there could be
whitespace between the two tokens. For example:

```python
#     last token end
#     | current token (newline) start
#     v v
def foo \n
#      ^
#      assume there's trailing whitespace here
```

Or, there could be tokens that are considered "trivia" and thus aren't
emitted by the token source. These are comments and non-logical
newlines. For example:

```python
#     last token end
#     v
def foo # comment\n
#                ^ current token (newline) start
```

In either of the above cases, there's a "gap" between the end of the
last token and the start of the current token.
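
The check itself reduces to a one-liner (sketch; names illustrative):

```rust
use ruff_text_size::TextSize;

/// With trivia or whitespace between two tokens, the previous token's end
/// can be strictly *less than* the next token's start, hence `<=` not `==`.
fn is_valid_gap(last_token_end: TextSize, current_token_start: TextSize) -> bool {
    last_token_end <= current_token_start
}
```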

## Test Plan

Add test cases and update the snapshots.
2024-04-19 11:36:26 +00:00
Dhruv Manilawala
9bb23b0a38
Expect indented case block instead of match stmt (#11033)
## Summary

This PR adds a new `Clause::Case` and uses it to parse the body of a
`case` block. Earlier, it was using `Match` which would give an
incorrect error message like:

```
  |
1 | match subject:
2 |     case 1:
3 |     case 2: ...
  |     ^^^^ Syntax Error: Expected an indented block after `match` statement
  |
```

## Test Plan

Add test case and update the snapshot.
2024-04-19 16:46:15 +05:30
Dhruv Manilawala
b7066e64e7
Consider binary expr for parenthesized with items parsing (#11012)
## Summary

This PR fixes the bug in with-items parsing where it would fail to
recognize that the parenthesized expression is part of a larger binary
expression.

## Test Plan

Add test cases and verified the snapshots.
2024-04-18 21:39:30 +05:30
Dhruv Manilawala
6c4d779140
Consider if expression for parenthesized with items parsing (#11010)
## Summary

This PR fixes the bug in parenthesized with-items parsing where the `if`
expression would result in a syntax error.

The reason being that once we identify that the ambiguous left
parenthesis belongs to the context expression, the parser converts the
parsed with item into an equivalent expression. Then, the parser
continues to parse any postfix expressions. Now, attribute, subscript,
and call are taken into account as they're grouped in
`parse_postfix_expression`, but the `if` expression has its own parsing
function.

Use `parse_if_expression` once all postfix expressions have been parsed.
Ideally, I think that `if` could be included in postfix expression
parsing as it can be chained as well (`x if True else y if True else
z`).

## Test Plan

Add test cases and verified the snapshots.
2024-04-18 14:30:15 +00:00
Dhruv Manilawala
8020d486f6
Reset FOR_TARGET context for all kinds of parentheses (#11009)
## Summary

This PR fixes a bug in the new parser which involves the parser context
w.r.t. the `for` statement. This is specifically around the `in` keyword
which can be present in the target expression and shouldn't be considered
to be part of the `for` statement header. Ideally it should use a context
which is passed between functions, thus using the call stack to set /
unset a specific variant; this will be done in a follow-up PR as it
requires some amount of refactoring.

## Test Plan

Add test cases and update the snapshots.
2024-04-18 19:37:50 +05:30
Dhruv Manilawala
13ffb5bc19
Replace LALRPOP parser with hand-written parser (#10036)
(Supersedes #9152, authored by @LaBatata101)

## Summary

This PR replaces the current parser generated from LALRPOP with a
hand-written recursive descent parser.

It also updates the grammar for [PEP
646](https://peps.python.org/pep-0646/) so that the parser outputs the
correct AST. For example, in `data[*x]`, the index expression is now a
tuple with a single starred expression instead of just a starred
expression.

Beyond the performance improvements, the parser is also error resilient
and can provide better error messages. The behavior as seen by any
downstream tools isn't changed. That is, the linter and formatter can
still assume that the parser will _stop_ at the first syntax error. This
will be updated in the following months.

For more details about the change here, refer to the PR corresponding to
the individual commits and the release blog post.

## Test Plan

Write _lots_ and _lots_ of tests for both valid and invalid syntax and
verify the output.

## Acknowledgements

- @MichaReiser for reviewing 100+ parser PRs and continuously providing
guidance throughout the project
- @LaBatata101 for initiating the transition to a hand-written parser in
#9152
- @addisoncrump for implementing the fuzzer which helped
[catch](https://github.com/astral-sh/ruff/pull/10903)
[a](https://github.com/astral-sh/ruff/pull/10910)
[lot](https://github.com/astral-sh/ruff/pull/10966)
[of](https://github.com/astral-sh/ruff/pull/10896)
[bugs](https://github.com/astral-sh/ruff/pull/10877)

---------

Co-authored-by: Victor Hugo Gomes <labatata101@linuxmail.org>
Co-authored-by: Micha Reiser <micha@reiser.io>
2024-04-18 17:57:39 +05:30