mirrors/ruff - Forgejo: Beyond coding. We Forge.

mirror of https://github.com/astral-sh/ruff.git synced 2025-07-12 15:45:07 +00:00

Author	SHA1	Message	Date
Dhruv Manilawala	6ecb4776de	Rename `AnyStringKind` -> `AnyStringFlags` (#11405 ) ## Summary This PR renames `AnyStringKind` to `AnyStringFlags` and `AnyStringFlags` to `AnyStringFlagsInner`. The main motivation is to have consistent usage of "kind" and "flags". For each string kind, it's "flags" like `StringLiteralFlags`, `BytesLiteralFlags`, and `FStringFlags` but it was `AnyStringKind` for the "any" variant.	2024-05-13 13:18:07 +00:00
Charlie Marsh	35ba3c91ce	Use `u64` instead of `i64` in Int type (#11356 ) ## Summary I believe the value here is always unsigned, since we represent `-42` as a unary operator on `42`.	2024-05-10 13:35:15 +00:00
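The claim that a negative literal is a unary operator applied to an unsigned value can be checked by analogy with CPython's own `ast` module. This is only an illustration of the same design in CPython, not Ruff's Rust AST:

```python
import ast

# Parse the expression `-42` with CPython's parser (an analogy; Ruff's AST is
# a separate Rust implementation).
node = ast.parse("-42", mode="eval").body

# The minus sign is a unary operator node...
is_unary_minus = isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub)

# ...and the integer literal underneath it is the non-negative value 42.
operand_value = node.operand.value

print(is_unary_minus, operand_value)
```

Because the sign lives on the operator node, the literal itself never needs to hold a negative number, which is the reasoning behind storing it in an unsigned type.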
Alex Waygood	6774f27f4b	Refactor the `ExprDict` node (#11267 ) Co-authored-by: Micha Reiser <micha@reiser.io>	2024-05-07 11:46:10 +00:00
Dhruv Manilawala	04a922866a	Add basic docs for the parser crate (#11199 ) ## Summary This PR adds a basic README for the `ruff_python_parser` crate and updates the CONTRIBUTING docs with the fuzzer and benchmark section. Additionally, it also updates some inline documentation within the parser crate and splits the `parse_program` function into `parse_single_expression` and `parse_module` which will be called by matching against the `Mode`. This PR doesn't go into too much internal detail around the parser logic due to the following reasons: 1. Where should the docs go? Should it be as a module docs in `lib.rs` or in README? 2. The parser is still evolving and could include a lot of refactors with the future work (feedback loop and improved error recovery and resilience) --------- Co-authored-by: Alex Waygood <Alex.Waygood@Gmail.com>	2024-04-29 17:08:07 +00:00
Alex Waygood	87929ad5f1	Add convenience methods for iterating over all parameter nodes in a function (#11174 )	2024-04-29 10:36:15 +00:00
Carl Meyer	845ba7cf5f	Make ImportFrom level just a u32 (#11170 )	2024-04-26 20:38:35 -06:00
Jelle Zijlstra	cd3e319538	Add support for PEP 696 syntax (#11120 )	2024-04-26 09:47:29 +02:00
Alex Waygood	269014a539	Delete unused methods from `Parameters` (#11150 )	2024-04-25 22:11:24 +01:00
Dhruv Manilawala	4738e19974	Remove unused lexical error types (#11145 )	2024-04-25 15:24:16 +00:00
Dhruv Manilawala	38d2562f41	Refactor unary expression parsing (#11088 ) ## Summary This PR refactors unary expression parsing with the following changes: * Ability to get `OperatorPrecedence` from a unary operator (`UnaryOp`) * Implement methods on `TokenKind` * Add `as_unary_operator` which returns an `Option<UnaryOp>` * Add `as_unary_arithmetic_operator` which returns an `Option<UnaryOp>` (used for pattern parsing) * Rename `is_unary` to `is_unary_arithmetic_operator` (used in the linter) resolves: #10752 ## Test Plan Verify that the existing test cases pass, no ecosystem changes, run the Python based fuzzer on 3000 random inputs and run it on dozens of open-source repositories.	2024-04-23 04:55:02 +00:00
Dhruv Manilawala	7eba967e16	Refactor binary expression parsing (#11073 ) ## Summary This PR refactors the binary expression parsing to make it more readable and easier to understand. It draws inspiration from the suggested edits in the linked messages in #10752. ### Changes * Ability to get the precedence of an operator * From a boolean operator (`BoolOp`) to `OperatorPrecedence` * From a binary operator (`Operator`) to `OperatorPrecedence` * No comparison operator because all of them have the same precedence * Implement methods on `TokenKind` to convert it to an appropriate operator enum * Add `as_boolean_operator` which returns an `Option<BoolOp>` * Add `as_binary_operator` which returns an `Option<Operator>` * No `as_comparison_operator` because it requires lookahead and I'm not sure if `token.as_comparison_operator(peek)` is a good way to implement it * Introduce `BinaryLikeOperator` * Constructed from two tokens using the methods from the second point * Add `precedence` method using the conversion methods mentioned in the first point * Make most of the functions in `TokenKind` private to the module * Use `self` instead of `&self` for `TokenKind` fixes: #11072 ## Test Plan Refer to #11088	2024-04-23 04:42:40 +00:00
Dhruv Manilawala	5b81b8368d	Make associativity a property of operator precedence (#11065 ) ## Summary This PR does a few things but the main change is that it makes associativity a property of operator precedence. 1. Rename `Precedence` -> `OperatorPrecedence` 2. Rename `parse_expression_with_precedence` -> `parse_binary_expression_or_higher` 3. Move `current_binding_power` to `OperatorPrecedence::try_from_tokens` [^1] 4. Add an `OperatorPrecedence::is_right_associative` method 5. Move from `increment_precedence` to using `<=` / `<` to check if the parsing loop needs to stop [^2] [^1]: Another alternative would be to have two separate methods to avoid lookahead as it's required only for one case (`not in`). So, `try_from_current_token(current).or_else(|| try_from_next_token(current, peek))` [^2]: This will allow us to easily make the refactors mentioned in #10752 ## Test Plan Make sure the precedence parsing algorithm is still correct by running the test suite, fuzz testing it and running it against a dozen or so open-source repositories.	2024-04-23 04:28:46 +00:00
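The idea of treating associativity as a property of the precedence level, and choosing between `<=` and `<` when deciding whether the parsing loop should stop, can be sketched in a few lines. This is a minimal Python sketch under stated assumptions, not Ruff's Rust code: the names `OperatorPrecedence`, `is_right_associative`, and `parse_binary_expression_or_higher` are borrowed from the commit message, while the operator table, the token format, and the tuple-based tree are hypothetical.

```python
from enum import IntEnum


class OperatorPrecedence(IntEnum):
    # Only a few levels are modelled here; the real grammar has many more.
    NONE = 0
    ADDITIVE = 1        # + and -
    MULTIPLICATIVE = 2  # * and /
    EXPONENT = 3        # **

    def is_right_associative(self) -> bool:
        # Associativity is decided by the precedence level alone.
        return self is OperatorPrecedence.EXPONENT


# Hypothetical operator table for this sketch.
PRECEDENCE = {
    "+": OperatorPrecedence.ADDITIVE,
    "-": OperatorPrecedence.ADDITIVE,
    "*": OperatorPrecedence.MULTIPLICATIVE,
    "/": OperatorPrecedence.MULTIPLICATIVE,
    "**": OperatorPrecedence.EXPONENT,
}


def parse(tokens):
    """Parse a flat list like ["1", "+", "2"] into nested (op, left, right) tuples."""
    pos = 0

    def parse_binary_expression_or_higher(left_precedence):
        nonlocal pos
        left = tokens[pos]  # an operand (atom)
        pos += 1
        while pos < len(tokens):
            op = tokens[pos]
            new_precedence = PRECEDENCE[op]
            # Right-associative operators stop only on strictly lower
            # precedence; left-associative ones also stop on equal precedence.
            if new_precedence.is_right_associative():
                stop = new_precedence < left_precedence
            else:
                stop = new_precedence <= left_precedence
            if stop:
                break
            pos += 1  # consume the operator
            right = parse_binary_expression_or_higher(new_precedence)
            left = (op, left, right)
        return left

    return parse_binary_expression_or_higher(OperatorPrecedence.NONE)


print(parse(["1", "-", "2", "-", "3"]))     # groups to the left
print(parse(["2", "**", "3", "**", "4"]))   # groups to the right
```

With this shape, no separate "increment the precedence" step is needed: the comparison operator alone encodes whether an operator of equal precedence continues the current expression or ends it.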
Dhruv Manilawala	c30735d4a7	Add `ExpressionContext` for expression parsing (#11055 ) ## Summary This PR adds a new `ExpressionContext` struct which is used in expression parsing. This solves the following problems: 1. Allowing starred expressions with different precedence 2. Allowing yield expressions in certain contexts 3. Removing ambiguity with the `in` keyword when parsing a `for ... in` statement For context, (1) was solved by adding `parse_star_expression_list` and `parse_star_expression_or_higher` in #10623, (2) was solved by adding `parse_yield_expression_or_else` in #10809, and (3) was fixed in #11009. All of the mentioned functions have been removed in favor of the context flags. As mentioned in #11009, an ideal solution would be to implement an expression context which is what this PR implements. This is passed around as a function parameter and the call stack is used to automatically reset the context. ### Recovery How should the parser recover if the target expression is invalid when an expression can consume the `in` keyword? 1. Should the `in` keyword be part of the target expression? 2. Or, should the expression parsing stop as soon as the `in` keyword is encountered, no matter the expression? For example: ```python for yield x in y: ... # Here, should this be parsed as for (yield x) in (y): ... # Or for (yield x in y): ... # where the `in iter` part is missing ``` Or, for binary expression parsing: ```python for x or y in z: ... # Should this be parsed as for (x or y) in z: ... # Or for (x or y in z): ... # where the `in iter` part is missing ``` This need not be solved now, but is very easy to change. For context this PR does the following: * For binary, comparison, and unary expressions, stop at `in` * For lambda, yield expressions, consume the `in` ## Test Plan 1. Add test cases for the `for ... in` statement and verify the snapshots 2. Make sure the existing test suite passes 3. Run the fuzzer for around 3000 generated source code 4. Run the updated logic on a dozen or so open source repositories (codename "parser-checkouts")	2024-04-23 04:19:05 +00:00
Carl Meyer	c80b9a4a90	Reduce size of Stmt from 144 to 120 bytes (#11051 ) ## Summary I happened to notice that we box `TypeParams` on `StmtClassDef` but not on `StmtFunctionDef` and wondered why, since `StmtFunctionDef` is bigger and sets the size of `Stmt`. @charliermarsh found that at the time we started boxing type params on classes, classes were the largest statement type (see #6275), but that's no longer true. So boxing type-params also on functions reduces the overall size of `Stmt`. ## Test Plan The `<=` size tests are a bit irritating (since their failure doesn't tell you the actual size), but I manually confirmed that the size is actually 120 now.	2024-04-19 17:02:17 -06:00
Dhruv Manilawala	d3cd61f804	Use empty range when there's "gap" in token source (#11032 ) ## Summary This fixes a bug where the parser would panic when there is a "gap" in the token source. What's a gap? The reason it's `<=` instead of just `==` is because there could be whitespaces between the two tokens. For example: ```python # last token end # | current token (newline) start # v v def foo \n # ^ # assume there's trailing whitespace here ``` Or, there could be tokens that are considered "trivia" and thus aren't emitted by the token source. These are comments and non-logical newlines. For example: ```python # last token end # v def foo # comment\n # ^ current token (newline) start ``` In either of the above cases, there's a "gap" between the end of the last token and start of the current token. ## Test Plan Add test cases and update the snapshots.	2024-04-19 11:36:26 +00:00
Dhruv Manilawala	9bb23b0a38	Expect indented case block instead of match stmt (#11033 ) ## Summary This PR adds a new `Clause::Case` and uses it to parse the body of a `case` block. Earlier, it was using `Match` which would give an incorrect error message like: ``` | 1 | match subject: 2 | case 1: 3 | case 2: ... | ^^^^ Syntax Error: Expected an indented block after `match` statement | ``` ## Test Plan Add test case and update the snapshot.	2024-04-19 16:46:15 +05:30
Dhruv Manilawala	b7066e64e7	Consider binary expr for parenthesized with items parsing (#11012 ) ## Summary This PR fixes the bug in with items parsing where it would fail to recognize that the parenthesized expression is part of a larger binary expression. ## Test Plan Add test cases and verify the snapshots.	2024-04-18 21:39:30 +05:30
Dhruv Manilawala	6c4d779140	Consider `if` expression for parenthesized with items parsing (#11010 ) ## Summary This PR fixes the bug in parenthesized with items parsing where the `if` expression would result in a syntax error. The reason being that once we identify that the ambiguous left parenthesis belongs to the context expression, the parser converts the parsed with item into an equivalent expression. Then, the parser continues to parse any postfix expressions. Now, attribute, subscript, and call are taken into account as they're grouped in `parse_postfix_expression` but the `if` expression has its own parsing function. Use `parse_if_expression` once all postfix expressions have been parsed. Ideally, I think that `if` could be included in postfix expression parsing as they can be chained as well (`x if True else y if True else z`). ## Test Plan Add test cases and verify the snapshots.	2024-04-18 14:30:15 +00:00
Dhruv Manilawala	8020d486f6	Reset `FOR_TARGET` context for all kinds of parentheses (#11009 ) ## Summary This PR fixes a bug in the new parser which involves the parser context w.r.t. for statement. This is specifically around the `in` keyword which can be present in the target expression and shouldn't be considered to be part of the `for` statement header. Ideally it should use a context which is passed between functions, thus using a call stack to set / unset a specific variant which will be done in a follow-up PR as it requires some amount of refactor. ## Test Plan Add test cases and update the snapshots.	2024-04-18 19:37:50 +05:30
Dhruv Manilawala	13ffb5bc19	Replace LALRPOP parser with hand-written parser (#10036 ) (Supersedes #9152, authored by @LaBatata101) ## Summary This PR replaces the current parser generated from LALRPOP with a hand-written recursive descent parser. It also updates the grammar for [PEP 646](https://peps.python.org/pep-0646/) so that the parser outputs the correct AST. For example, in `data[*x]`, the index expression is now a tuple with a single starred expression instead of just a starred expression. Beyond the performance improvements, the parser is also error resilient and can provide better error messages. The behavior as seen by any downstream tools isn't changed. That is, the linter and formatter can still assume that the parser will _stop_ at the first syntax error. This will be updated in the following months. For more details about the change here, refer to the PR corresponding to the individual commits and the release blog post. ## Test Plan Write _lots_ and _lots_ of tests for both valid and invalid syntax and verify the output. ## Acknowledgements - @MichaReiser for reviewing 100+ parser PRs and continuously providing guidance throughout the project - @LaBatata101 for initiating the transition to a hand-written parser in #9152 - @addisoncrump for implementing the fuzzer which helped [catch](https://github.com/astral-sh/ruff/pull/10903) [a](https://github.com/astral-sh/ruff/pull/10910) [lot](https://github.com/astral-sh/ruff/pull/10966) [of](https://github.com/astral-sh/ruff/pull/10896) [bugs](https://github.com/astral-sh/ruff/pull/10877) --------- Co-authored-by: Victor Hugo Gomes <labatata101@linuxmail.org> Co-authored-by: Micha Reiser <micha@reiser.io>	2024-04-18 17:57:39 +05:30
Boshen	d467aa78c2	Remove an unused dependency (#10747 ) ## Summary Continuation of #10475, I improved [`cargo shear`](https://github.com/Boshen/cargo-shear) even more. We can put this in CI once I test it a bit more, given that [ignoring false positives](https://github.com/Boshen/cargo-shear?tab=readme-ov-file#ignore-false-positives) has been implemented. ## Test Plan `cargo check --all-features --all-targets`	2024-04-03 09:57:19 +01:00
veryyet	c5ea4209bb	chore: remove repetitive words (#10502 )	2024-03-21 03:57:16 +00:00
Alex Waygood	7caf0d064a	Simplify formatting of strings by using flags from the AST nodes (#10489 )	2024-03-20 16:16:54 +00:00
Alex Waygood	ffd6e79677	Fix typo in `string_token_flags.rs` (#10476 )	2024-03-19 17:43:08 +00:00
Alex Waygood	162d2eb723	Track casing of r-string prefixes in the tokenizer and AST (#10314 ) Co-authored-by: Micha Reiser <micha@reiser.io>	2024-03-18 17:18:04 +00:00
Alex Waygood	92e6026446	Apply NFKC normalization to unicode identifiers in the lexer (#10412 )	2024-03-18 11:56:56 +00:00
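For readers unfamiliar with NFKC normalization of identifiers: Python treats identifiers that differ only by compatibility characters as the same name. A small sketch using the standard library's `unicodedata` module shows the effect; the helper name `normalize_identifier` is hypothetical and this is not Ruff's lexer code:

```python
import unicodedata


def normalize_identifier(name: str) -> str:
    """Return the NFKC-normalized form of an identifier (hypothetical helper)."""
    return unicodedata.normalize("NFKC", name)


# U+FB01 is the single-character ligature "fi"; NFKC expands it to "f" + "i".
ligature_name = "\ufb01le"
# U+FF58 is the full-width letter "x"; NFKC maps it to ASCII "x".
fullwidth_name = "\uff58"

print(normalize_identifier(ligature_name))
print(normalize_identifier(fullwidth_name))
```

A lexer that skips this step would treat the ligature spelling and the plain ASCII spelling as two different names, whereas Python itself resolves both to the same binding.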
Alex Waygood	c2e15f38ee	Unify enums used for internal representation of quoting style (#10383 )	2024-03-13 17:19:17 +00:00
Auguste Lalande	3ed707f245	Spellcheck & grammar (#10375 ) ## Summary I used `codespell` and `gramma` to identify misspellings and grammar errors throughout the codebase and fixed them. I tried not to make any controversial changes, but feel free to revert as you see fit.	2024-03-13 02:34:23 +00:00
Gautier Moin	a067d87ccc	Fix incorrect `Parameter` range for `*args` and `**kwargs` (#10283 ) ## Summary Fix #10282 This PR updates the Python grammar to include the `*` character in `*args` and `**kwargs` in the range of the `Parameter`. ``` def f(*args, **kwargs): pass # ~~~~ ~~~~~~ <-- range before the PR # ^^^^^ ^^^^^^^^ <-- range after ``` The invalid syntax `def f(*, **kwargs): ...` is also now correctly reported. ## Test Plan Test cases were added to `function.rs`.	2024-03-08 18:57:49 -05:00
Alex Waygood	1d97f27335	Start tracking quoting style in the AST (#10298 ) This PR modifies our AST so that nodes for string literals, bytes literals and f-strings all retain the following information: - The quoting style used (double or single quotes) - Whether the string is triple-quoted or not - Whether the string is raw or not This PR is a followup to #10256. Like with that PR, this PR does not, in itself, fix any bugs. However, it means that we will have the necessary information to preserve quoting style and rawness of strings in the `ExprGenerator` in a followup PR, which will allow us to provide a fix for https://github.com/astral-sh/ruff/issues/7799. The information is recorded on the AST nodes using a bitflag field on each node, similarly to how we recorded the information on `Tok::String`, `Tok::FStringStart` and `Tok::FStringMiddle` tokens in #10256. Rather than reusing the bitflag I used for the tokens, however, I decided to create a custom bitflag for each AST node. Using different bitflags for each node allows us to make invalid states unrepresentable: it is valid to set a `u` prefix on a string literal, but not on a bytes literal or an f-string. It also allows us to have better debug representations for each AST node modified in this PR.	2024-03-08 19:11:47 +00:00
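The point about separate bitflag types making invalid states unrepresentable can be illustrated with Python's `enum.Flag`. This is a rough sketch, not Ruff's Rust definitions: the type names echo `StringLiteralFlags` and `BytesLiteralFlags` from the commit history, but the member names and layout are assumptions made for the example.

```python
from enum import Flag, auto


class StringLiteralFlags(Flag):
    # Hypothetical members: absence of DOUBLE means single quotes.
    DOUBLE = auto()
    TRIPLE_QUOTED = auto()
    U_PREFIX = auto()   # `u"..."` is valid for string literals only
    RAW = auto()


class BytesLiteralFlags(Flag):
    # No U_PREFIX member: `ub"..."` is not valid Python, so the type
    # simply offers no way to express it.
    DOUBLE = auto()
    TRIPLE_QUOTED = auto()
    RAW = auto()


# Flags for a literal such as r"""...""" (raw, triple-quoted, double quotes).
flags = (
    StringLiteralFlags.RAW
    | StringLiteralFlags.TRIPLE_QUOTED
    | StringLiteralFlags.DOUBLE
)

print(StringLiteralFlags.RAW in flags)
print(StringLiteralFlags.U_PREFIX in flags)
print(hasattr(BytesLiteralFlags, "U_PREFIX"))
```

A single shared flag type would need a runtime check to reject a `u` prefix on a bytes literal; with one type per node, the mistake cannot be written down in the first place.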
Alex Waygood	c504d7ab11	Track quoting style in the tokenizer (#10256 )	2024-03-08 08:40:06 +00:00
Micha Reiser	184241f99a	Remove `Expr` postfix from `ExprNamed`, `ExprIf`, and `ExprGenerator` (#10229 ) The expression types in our AST are called `ExprYield`, `ExprAwait`, `ExprStringLiteral` etc, except `ExprNamedExpr`, `ExprIfExpr` and `ExprGeneratorExpr`. This seems to align with [Python AST's naming](https://docs.python.org/3/library/ast.html) but feels inconsistent and excessive. This PR removes the `Expr` postfix from `ExprNamedExpr`, `ExprIfExpr`, and `ExprGeneratorExpr`.	2024-03-04 12:55:01 +01:00
Micha Reiser	77c5561646	Add `parenthesized` flag to `ExprTuple` and `ExprGenerator` (#9614 )	2024-02-26 15:35:20 +00:00
Dhruv Manilawala	33ac2867b7	Use non-parenthesized range for `DebugText` (#9953 ) ## Summary This PR fixes the `DebugText` implementation to use the expression range instead of the parenthesized range. Taking the following code snippet as an example: ```python x = 1 print(f"{ ( x ) = }") ``` The output of running it would be: ``` ( x ) = 1 ``` Notice that the whitespace between the parentheses and the expression is preserved as is. Currently, we don't preserve this information in the AST which defeats the purpose of `DebugText` as the main purpose of the struct is to preserve whitespaces _around_ the expression. This is also problematic when generating the code from the AST node as then the generator has no information about the parentheses and the whitespace between them and the expression, which would lead to the removal of the parentheses in the generated code. I noticed this while working on the f-string formatting where the debug text would be used to preserve the text surrounding the expression in the presence of a debug expression. The parentheses were being dropped then which made me realize that the problem is instead in the parser. ## Test Plan 1. Add a test case for the parser 2. Add a test case for the generator	2024-02-12 23:00:02 +05:30
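The behaviour that `DebugText` has to preserve can be observed directly in Python (3.8 or later), since the `=` specifier echoes the expression text verbatim. A short sketch comparing two spellings of the same expression:

```python
x = 1

# With the `=` specifier, the text between the braces is echoed as written,
# including the parentheses and the whitespace around the expression.
spaced = f"{ ( x ) = }"
compact = f"{(x)=}"

print(repr(spaced))
print(repr(compact))
```

Both strings evaluate the same expression, yet they render differently, so a code generator that only stores the expression node (without the surrounding text) cannot reproduce the original source faithfully.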
Charlie Marsh	6f0e4ad332	Remove unnecessary string cloning from the parser (#9884 ) Closes https://github.com/astral-sh/ruff/issues/9869.	2024-02-09 16:03:27 -05:00
Charlie Marsh	49fe1b85f2	Reduce size of `Expr` from 80 to 64 bytes (#9900 ) ## Summary This PR reduces the size of `Expr` from 80 to 64 bytes, by reducing the sizes of... - `ExprCall` from 72 to 56 bytes, by using boxed slices for `Arguments`. - `ExprCompare` from 64 to 48 bytes, by using boxed slices for its various vectors. In testing, the parser gets a bit faster, and the linter benchmarks improve quite a bit.	2024-02-09 02:53:13 +00:00
Micha Reiser	fe7d965334	Reduce `Result<Tok, LexicalError>` size by using `Box<str>` instead of `String` (#9885 )	2024-02-08 20:36:22 +00:00
Micha Reiser	688177ff6a	Use Rust 1.76 (#9897 )	2024-02-08 18:20:08 +00:00
Charlie Marsh	6fffde72e7	Use `memchr` for string lexing (#9888 ) ## Summary On `main`, string lexing consists of walking through the string character-by-character to search for the closing quote (with some nuance: we also need to skip escaped characters, and error if we see newlines in non-triple-quoted strings). This PR rewrites `lex_string` to instead use `memchr` to search for the closing quote, which is significantly faster. On my machine, at least, the `globals.py` benchmark (which contains a lot of docstrings) gets 40% faster... ```text lexer/numpy/globals.py time: [3.6410 µs 3.6496 µs 3.6585 µs] thrpt: [806.53 MiB/s 808.49 MiB/s 810.41 MiB/s] change: time: [-40.413% -40.185% -39.984%] (p = 0.00 < 0.05) thrpt: [+66.623% +67.181% +67.822%] Performance has improved. Found 2 outliers among 100 measurements (2.00%) 2 (2.00%) high mild lexer/unicode/pypinyin.py time: [12.422 µs 12.445 µs 12.467 µs] thrpt: [337.03 MiB/s 337.65 MiB/s 338.27 MiB/s] change: time: [-9.4213% -9.1930% -8.9586%] (p = 0.00 < 0.05) thrpt: [+9.8401% +10.124% +10.401%] Performance has improved. Found 3 outliers among 100 measurements (3.00%) 1 (1.00%) high mild 2 (2.00%) high severe lexer/pydantic/types.py time: [107.45 µs 107.50 µs 107.56 µs] thrpt: [237.11 MiB/s 237.24 MiB/s 237.35 MiB/s] change: time: [-4.0108% -3.7005% -3.3787%] (p = 0.00 < 0.05) thrpt: [+3.4968% +3.8427% +4.1784%] Performance has improved. Found 7 outliers among 100 measurements (7.00%) 2 (2.00%) high mild 5 (5.00%) high severe lexer/numpy/ctypeslib.py time: [46.123 µs 46.165 µs 46.208 µs] thrpt: [360.36 MiB/s 360.69 MiB/s 361.01 MiB/s] change: time: [-19.313% -18.996% -18.710%] (p = 0.00 < 0.05) thrpt: [+23.016% +23.451% +23.935%] Performance has improved. Found 8 outliers among 100 measurements (8.00%) 3 (3.00%) low mild 1 (1.00%) high mild 4 (4.00%) high severe lexer/large/dataset.py time: [231.07 µs 231.19 µs 231.33 µs] thrpt: [175.87 MiB/s 175.97 MiB/s 176.06 MiB/s] change: time: [-2.0437% -1.7663% -1.4922%] (p = 0.00 < 0.05) thrpt: [+1.5148% +1.7981% +2.0864%] Performance has improved. Found 10 outliers among 100 measurements (10.00%) 5 (5.00%) high mild 5 (5.00%) high severe ```	2024-02-08 17:23:06 +00:00
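The general technique described in this commit, jumping straight to the next candidate quote with an optimized search instead of inspecting every character, can be sketched in Python with `str.find` standing in for `memchr`. This is an analogy under stated assumptions, not Ruff's `lex_string`: the function name `find_closing_quote` is hypothetical, and newline and triple-quote handling are left out.

```python
def find_closing_quote(source: str, start: int, quote: str) -> int:
    """Return the index of the first unescaped `quote` at or after `start`, or -1."""
    pos = start
    while True:
        # Jump directly to the next candidate quote instead of visiting
        # every character in between.
        idx = source.find(quote, pos)
        if idx == -1:
            return -1
        # Count the backslashes immediately before the candidate; an odd
        # count means the quote itself is escaped.
        backslashes = 0
        j = idx - 1
        while j >= start and source[j] == "\\":
            backslashes += 1
            j -= 1
        if backslashes % 2 == 0:
            return idx
        pos = idx + 1  # escaped quote: keep searching after it


# Body of the literal "a\"b" (without its opening quote), followed by more code.
body = 'a\\"b" + rest'
print(find_closing_quote(body, 0, '"'))
```

Only the positions around each candidate quote are examined in detail, which is why inputs with long string bodies (such as docstring-heavy files) benefit the most.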
Seo Sanghyeon	df7fb95cbc	Index multiline f-strings (#9837 ) Fix #9777.	2024-02-05 21:25:33 -05:00
Micha Reiser	47ad7b4500	Approximate tokens len (#9546 )	2024-01-19 17:39:37 +01:00
Micha Reiser	21f2d0c90b	Add an explicit fast path for whitespace to `is_identifier_continuation` (#9532 )	2024-01-16 08:23:43 +00:00
Micha Reiser	f192c72596	Remove type parameter from `parse_*` methods (#9466 )	2024-01-11 19:41:19 +01:00
Micha Reiser	94968fedd5	Use Rust 1.75 toolchain (#9437 )	2024-01-08 18:03:16 +01:00
Micha Reiser	b1a5df8694	Move `locate_cmp_ops` to `invalid_literal_comparisons` (#9438 )	2024-01-08 13:15:36 +01:00
Charlie Marsh	1666c7a5cb	Add size hints to string parser (#9413 )	2024-01-06 15:59:34 -05:00
Charlie Marsh	f0d43dafcf	Ignore trailing quotes for unclosed l-brace errors (#9388 ) ## Summary Given: ```python F"{"ڤ ``` We try to locate the "unclosed left brace" error by subtracting the quote size from the lexer offset -- so we subtract 1 from the end of the source, which puts us in the middle of a Unicode character. I don't think we should try to adjust the offset in this way, since there can be content _after_ the quote. For example, with the advent of PEP 701, this string could reasonably be fixed as: ```python F"{"ڤ"}" ``` Closes https://github.com/astral-sh/ruff/issues/9379.	2024-01-04 05:00:55 +00:00
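Why subtracting one byte from the end of this source lands inside a character can be shown with plain byte arithmetic. In this sketch, the helper `is_char_boundary` is a hypothetical Python stand-in for the kind of UTF-8 boundary check that byte-offset slicing relies on; it is not Ruff code.

```python
def is_char_boundary(data: bytes, index: int) -> bool:
    """Return True if `index` does not fall inside a multi-byte UTF-8 sequence."""
    if index < 0 or index > len(data):
        return False
    if index == 0 or index == len(data):
        return True
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    return (data[index] & 0xC0) != 0x80


source = 'F"{"ڤ'
encoded = source.encode("utf-8")

# The final character occupies two bytes, so "end minus one quote byte"
# points at its second byte rather than at a character boundary.
offset = len(encoded) - 1

print(len(encoded), offset, is_char_boundary(encoded, offset))
```

An offset that fails this check cannot be used to slice the source text, which is consistent with the commit's decision to stop adjusting the offset by the quote size.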
Charlie Marsh	9073220887	Make all dependencies workspace dependencies (#9333 ) ## Summary This PR modifies our `Cargo.toml` files to use workspace dependencies for _all_ dependencies, rather than the status quo of sporadically trying to use workspace dependencies for those dependencies that are used across multiple crates. I find the current situation more confusing and harder to manage, since we have a mix of workspace and crate-local dependencies, whereas this setup consistently uses the same approach for all dependencies.	2024-01-02 13:41:59 +00:00
Charlie Marsh	48e04cc2c8	Add row and column numbers to formatted parse errors (#9321 ) ## Summary We now render parse errors in the formatter identically to those in the linter, e.g.: ``` ❯ cargo run -p ruff_cli -- format foo.py error: Failed to parse foo.py:1:17: Unexpected token '=' ``` Closes https://github.com/astral-sh/ruff/issues/8338. Closes https://github.com/astral-sh/ruff/issues/9311.	2023-12-31 07:10:45 -05:00
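Turning a parse error's offset into the `path:row:column` form shown in this commit amounts to counting newlines before the offset. A minimal sketch follows; the helpers `locate` and `format_parse_error` are hypothetical, and the sketch assumes 1-based rows and columns counted in characters, which may differ from how Ruff computes them.

```python
from typing import Tuple


def locate(source: str, offset: int) -> Tuple[int, int]:
    """Convert a character offset into a 1-based (row, column) pair."""
    prefix = source[:offset]
    row = prefix.count("\n") + 1
    # rfind returns -1 when there is no earlier newline, so the line
    # start becomes 0 in that case.
    line_start = prefix.rfind("\n") + 1
    column = offset - line_start + 1
    return row, column


def format_parse_error(path: str, source: str, offset: int, message: str) -> str:
    row, column = locate(source, offset)
    return f"Failed to parse {path}:{row}:{column}: {message}"


# A made-up source line whose stray `=` sits at character offset 16.
source = "x = 1 + 2 + 3 + = 4\n"
print(format_parse_error("foo.py", source, source.rindex("="), "Unexpected token '='"))
```

Keeping the location logic in one helper is what lets two front ends, such as a linter and a formatter, render the same error identically.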
Charlie Marsh	e80260a3c5	Remove source path from parser errors (#9322 ) ## Summary I always found it odd that we had to pass this in, since it's really higher-level context for the error. The awkwardness is further evidenced by the fact that we pass in fake values everywhere (even outside of tests). The source path isn't actually used to display the error; it's only accessed elsewhere to _re-display_ the error in certain cases. This PR instead passes the path directly in those cases.	2023-12-30 20:33:05 +00:00

278 commits