language-servers/ruff - Forgejo: Beyond coding. We Forge.

mirror of https://github.com/astral-sh/ruff.git synced 2025-08-21 11:00:49 +00:00

Author	SHA1	Message	Date
Dhruv Manilawala	bf5b62edac	Maintain synchronicity between the lexer and the parser (#11457 ) ## Summary This PR updates the entire parser stack in multiple ways: ### Make the lexer lazy * https://github.com/astral-sh/ruff/pull/11244 * https://github.com/astral-sh/ruff/pull/11473 Previously, Ruff's lexer would act as an iterator. The parser would collect all the tokens in a vector first and then process the tokens to create the syntax tree. The first task in this project is to update the entire parsing flow to make the lexer lazy. This includes the `Lexer`, `TokenSource`, and `Parser`. For context, the `TokenSource` is a wrapper around the `Lexer` to filter out the trivia tokens[^1]. Now, the parser will ask the token source to get the next token and only then the lexer will continue and emit the token. This means that the lexer needs to be aware of the "current" token. When the `next_token` is called, the current token will be updated with the newly lexed token. The main motivation to make the lexer lazy is to allow re-lexing a token in a different context. This is going to be really useful to make the parser error resilience. For example, currently the emitted tokens remains the same even if the parser can recover from an unclosed parenthesis. This is important because the lexer emits a `NonLogicalNewline` in parenthesized context while a normal `Newline` in non-parenthesized context. This different kinds of newline is also used to emit the indentation tokens which is important for the parser as it's used to determine the start and end of a block. Additionally, this allows us to implement the following functionalities: 1. Checkpoint - rewind infrastructure: The idea here is to create a checkpoint and continue lexing. At a later point, this checkpoint can be used to rewind the lexer back to the provided checkpoint. 2. Remove the `SoftKeywordTransformer` and instead use lookahead or speculative parsing to determine whether a soft keyword is a keyword or an identifier 3. Remove the `Tok` enum. The `Tok` enum represents the tokens emitted by the lexer but it contains owned data which makes it expensive to clone. The new `TokenKind` enum just represents the type of token which is very cheap. This brings up a question as to how will the parser get the owned value which was stored on `Tok`. This will be solved by introducing a new `TokenValue` enum which only contains a subset of token kinds which has the owned value. This is stored on the lexer and is requested by the parser when it wants to process the data. For example: `8196720f80/crates/ruff_python_parser/src/parser/expression.rs (L1260-L1262)` [^1]: Trivia tokens are `NonLogicalNewline` and `Comment` ### Remove `SoftKeywordTransformer` * https://github.com/astral-sh/ruff/pull/11441 * https://github.com/astral-sh/ruff/pull/11459 * https://github.com/astral-sh/ruff/pull/11442 * https://github.com/astral-sh/ruff/pull/11443 * https://github.com/astral-sh/ruff/pull/11474 For context, https://github.com/RustPython/RustPython/pull/4519/files#diff-5de40045e78e794aa5ab0b8aacf531aa477daf826d31ca129467703855408220 added support for soft keywords in the parser which uses infinite lookahead to classify a soft keyword as a keyword or an identifier. This is a brilliant idea as it basically wraps the existing Lexer and works on top of it which means that the logic for lexing and re-lexing a soft keyword remains separate. The change here is to remove `SoftKeywordTransformer` and let the parser determine this based on context, lookahead and speculative parsing. * Context: The transformer needs to know the position of the lexer between it being at a statement position or a simple statement position. This is because a `match` token starts a compound statement while a `type` token starts a simple statement. The parser already knows this. * Lookahead: Now that the parser knows the context it can perform lookahead of up to two tokens to classify the soft keyword. The logic for this is mentioned in the PR implementing it for `type` and `match soft keyword. * Speculative parsing: This is where the checkpoint - rewind infrastructure helps. For `match` soft keyword, there are certain cases for which we can't classify based on lookahead. The idea here is to create a checkpoint and keep parsing. Based on whether the parsing was successful and what tokens are ahead we can classify the remaining cases. Refer to #11443 for more details. If the soft keyword is being parsed in an identifier context, it'll be converted to an identifier and the emitted token will be updated as well. Refer `8196720f80/crates/ruff_python_parser/src/parser/expression.rs (L487-L491)`. The `case` soft keyword doesn't require any special handling because it'll be a keyword only in the context of a match statement. ### Update the parser API * https://github.com/astral-sh/ruff/pull/11494 * https://github.com/astral-sh/ruff/pull/11505 Now that the lexer is in sync with the parser, and the parser helps to determine whether a soft keyword is a keyword or an identifier, the lexer cannot be used on its own. The reason being that it's not sensitive to the context (which is correct). This means that the parser API needs to be updated to not allow any access to the lexer. Previously, there were multiple ways to parse the source code: 1. Passing the source code itself 2. Or, passing the tokens Now that the lexer and parser are working together, the API corresponding to (2) cannot exists. The final API is mentioned in this PR description: https://github.com/astral-sh/ruff/pull/11494. ### Refactor the downstream tools (linter and formatter) * https://github.com/astral-sh/ruff/pull/11511 * https://github.com/astral-sh/ruff/pull/11515 * https://github.com/astral-sh/ruff/pull/11529 * https://github.com/astral-sh/ruff/pull/11562 * https://github.com/astral-sh/ruff/pull/11592 And, the final set of changes involves updating all references of the lexer and `Tok` enum. This was done in two-parts: 1. Update all the references in a way that doesn't require any changes from this PR i.e., it can be done independently * https://github.com/astral-sh/ruff/pull/11402 * https://github.com/astral-sh/ruff/pull/11406 * https://github.com/astral-sh/ruff/pull/11418 * https://github.com/astral-sh/ruff/pull/11419 * https://github.com/astral-sh/ruff/pull/11420 * https://github.com/astral-sh/ruff/pull/11424 2. Update all the remaining references to use the changes made in this PR For (2), there were various strategies used: 1. Introduce a new `Tokens` struct which wraps the token vector and add methods to query a certain subset of tokens. These includes: 1. `up_to_first_unknown` which replaces the `tokenize` function 2. `in_range` and `after` which replaces the `lex_starts_at` function where the former returns the tokens within the given range while the latter returns all the tokens after the given offset 2. Introduce a new `TokenFlags` which is a set of flags to query certain information from a token. Currently, this information is only limited to any string type token but can be expanded to include other information in the future as needed. https://github.com/astral-sh/ruff/pull/11578 3. Move the `CommentRanges` to the parsed output because this information is common to both the linter and the formatter. This removes the need for `tokens_and_ranges` function. ## Test Plan - [x] Update and verify the test snapshots - [x] Make sure the entire test suite is passing - [x] Make sure there are no changes in the ecosystem checks - [x] Run the fuzzer on the parser - [x] Run this change on dozens of open-source projects ### Running this change on dozens of open-source projects Refer to the PR description to get the list of open source projects used for testing. Now, the following tests were done between `main` and this branch: 1. Compare the output of `--select=E999` (syntax errors) 2. Compare the output of default rule selection 3. Compare the output of `--select=ALL` Conclusion: all output were same ## What's next? The next step is to introduce re-lexing logic and update the parser to feed the recovery information to the lexer so that it can emit the correct token. This moves us one step closer to having error resilience in the parser and provides Ruff the possibility to lint even if the source code contains syntax errors.	2024-06-03 18:23:50 +05:30
Dhruv Manilawala	77a72ecd38	Avoid multiline expression if format specifier is present (#11123 ) ## Summary This PR fixes the bug where the formatter would format an f-string and could potentially change the AST. For a triple-quoted f-string, the element can't be formatted into multiline if it has a format specifier because otherwise the newline would be treated as part of the format specifier. Given the following f-string: ```python f"""aaaaaaaaaaaaaaaa bbbbbbbbbbbbbbbbbb ccccccccccc { variable:.3f} ddddddddddddddd eeeeeeee""" ``` The formatter sees that the f-string is already multiline so it assumes that it can contain line breaks i.e., broken into multiple lines. But, in this specific case we can't format it as: ```python f"""aaaaaaaaaaaaaaaa bbbbbbbbbbbbbbbbbb ccccccccccc { variable:.3f } ddddddddddddddd eeeeeeee""" ``` Because the format specifier string would become ".3f\n", which is not the original string (`.3f`). If the original source code already contained a newline, they'll be preserved. For example: ```python f"""aaaaaaaaaaaaaaaa bbbbbbbbbbbbbbbbbb ccccccccccc { variable:.3f } ddddddddddddddd eeeeeeee""" ``` The above will be formatted as: ```py f"""aaaaaaaaaaaaaaaa bbbbbbbbbbbbbbbbbb ccccccccccc {variable:.3f } ddddddddddddddd eeeeeeee""" ``` Note that the newline after `.3f` is part of the format specifier which needs to be preserved. The Python version is irrelevant in this case. fixes: #10040 ## Test Plan Add some test cases to verify this behavior.	2024-04-26 13:34:38 +00:00
veryyet	c5ea4209bb	chore: remove repetitive words (#10502 )	2024-03-21 03:57:16 +00:00
Alex Waygood	c2e15f38ee	Unify enums used for internal representation of quoting style (#10383 )	2024-03-13 17:19:17 +00:00
Micha Reiser	a6f32ddc5e	Ruff 2024.2 style (#9639 )	2024-02-29 09:30:54 +01:00
Dhruv Manilawala	72bf1c2880	Preview minimal f-string formatting (#9642 ) ## Summary _This is preview only feature and is available using the `--preview` command-line flag._ With the implementation of [PEP 701] in Python 3.12, f-strings can now be broken into multiple lines, can contain comments, and can re-use the same quote character. Currently, no other Python formatter formats the f-strings so there's some discussion which needs to happen in defining the style used for f-string formatting. Relevant discussion: https://github.com/astral-sh/ruff/discussions/9785 The goal for this PR is to add minimal support for f-string formatting. This would be to format expression within the replacement field without introducing any major style changes. ### Newlines The heuristics for adding newline is similar to that of [Prettier](https://prettier.io/docs/en/next/rationale.html#template-literals) where the formatter would only split an expression in the replacement field across multiple lines if there was already a line break within the replacement field. In other words, the formatter would not add any newlines unless they were already present i.e., they were added by the user. This makes breaking any expression inside an f-string optional and in control of the user. For example, ```python # We wouldn't break this aaaaaaaaaaa = f"asaaaaaaaaaaaaaaaa { aaaaaaaaaaaa + bbbbbbbbbbbb + ccccccccccccccc } cccccccccc" # But, we would break the following as there's already a newline aaaaaaaaaaa = f"asaaaaaaaaaaaaaaaa { aaaaaaaaaaaa + bbbbbbbbbbbb + ccccccccccccccc } cccccccccc" ``` If there are comments in any of the replacement field of the f-string, then it will always be a multi-line f-string in which case the formatter would prefer to break expressions i.e., introduce newlines. For example, ```python x = f"{ # comment a }" ``` ### Quotes The logic for formatting quotes remains unchanged. The existing logic is used to determine the necessary quote char and is used accordingly. Now, if the expression inside an f-string is itself a string like, then we need to make sure to preserve the existing quote and not change it to the preferred quote unless it's 3.12. For example, ```python f"outer {'inner'} outer" # For pre 3.12, preserve the single quote f"outer {'inner'} outer" # While for 3.12 and later, the quotes can be changed f"outer {"inner"} outer" ``` But, for triple-quoted strings, we can re-use the same quote char unless the inner string is itself a triple-quoted string. ```python f"""outer {"inner"} outer""" # valid f"""outer {'''inner'''} outer""" # preserve the single quote char for the inner string ``` ### Debug expressions If debug expressions are present in the replacement field of a f-string, then the whitespace needs to be preserved as they will be rendered as it is (for example, `f"{ x = }"`. If there are any nested f-strings, then the whitespace in them needs to be preserved as well which means that we'll stop formatting the f-string as soon as we encounter a debug expression. ```python f"outer { x = !s :.3f}" # ^^ # We can remove these whitespaces ``` Now, the whitespace doesn't need to be preserved around conversion spec and format specifiers, so we'll format them as usual but we won't be formatting any nested f-string within the format specifier. ### Miscellaneous - The [`hug_parens_with_braces_and_square_brackets`](https://github.com/astral-sh/ruff/issues/8279) preview style isn't implemented w.r.t. the f-string curly braces. - The [indentation](https://github.com/astral-sh/ruff/discussions/9785#discussioncomment-8470590) is always relative to the f-string containing statement ## Test Plan * Add new test cases * Review existing snapshot changes * Review the ecosystem changes [PEP 701]: https://peps.python.org/pep-0701/	2024-02-16 20:28:11 +05:30
Dhruv Manilawala	189e947808	Split string formatting to individual nodes (#9058 ) This PR splits the string formatting code in the formatter to be handled by the respective nodes. Previously, the string formatting was done through a single `FormatString` interface. Now, the nodes themselves are responsible for formatting. The following changes were made: 1. Remove `StringLayout::ImplicitStringConcatenationInBinaryLike` and inline the call to `FormatStringContinuation`. After the refactor, the binary like formatting would delegate to `FormatString` which would then delegate to `FormatStringContinuation`. This removes the intermediary steps. 2. Add formatter implementation for `FStringPart` which delegates it to the respective string literal or f-string node. 3. Add `ExprStringLiteralKind` which is either `String` or `Docstring`. If it's a docstring variant, then the string expression would not be implicitly concatenated. This is guaranteed by the `DocstringStmt::try_from_expression` constructor. 4. Add `StringLiteralKind` which is either a `String`, `Docstring` or `InImplicitlyConcatenatedFString`. The last variant is for when the string literal is implicitly concatenated with an f-string (`"foo" f"bar {x}"`). 5. Remove `FormatString`. 6. Extract the f-string quote detection as a standalone function which is public to the crate. This is used to detect the quote to be used for an f-string at the expression level (`ExprFString` or `FormatStringContinuation`). ### Formatter ecosystem result This PR \| project \| similarity index \| total files \| changed files \| \|----------------\|------------------:\|------------------:\|------------------:\| \| cpython \| 0.75804 \| 1799 \| 1648 \| \| django \| 0.99984 \| 2772 \| 34 \| \| home-assistant \| 0.99955 \| 10596 \| 214 \| \| poetry \| 0.99905 \| 321 \| 15 \| \| transformers \| 0.99967 \| 2657 \| 324 \| \| twine \| 1.00000 \| 33 \| 0 \| \| typeshed \| 0.99980 \| 3669 \| 18 \| \| warehouse \| 0.99976 \| 654 \| 14 \| \| zulip \| 0.99958 \| 1459 \| 36 \| main \| project \| similarity index \| total files \| changed files \| \|----------------\|------------------:\|------------------:\|------------------:\| \| cpython \| 0.75804 \| 1799 \| 1648 \| \| django \| 0.99984 \| 2772 \| 34 \| \| home-assistant \| 0.99955 \| 10596 \| 214 \| \| poetry \| 0.99905 \| 321 \| 15 \| \| transformers \| 0.99967 \| 2657 \| 324 \| \| twine \| 1.00000 \| 33 \| 0 \| \| typeshed \| 0.99980 \| 3669 \| 18 \| \| warehouse \| 0.99976 \| 654 \| 14 \| \| zulip \| 0.99958 \| 1459 \| 36 \|	2023-12-14 12:55:10 -06:00
Andrew Gallant	b972455ac7	ruff_python_formatter: implement "dynamic" line width mode for docstring code formatting (#9098 ) ## Summary This PR changes the internal `docstring-code-line-width` setting to additionally accept a string value `dynamic`. When `dynamic` is set, the line width is dynamically adjusted when reformatting code snippets in docstrings based on the indent level of the docstring. The result is that the reformatted lines from the code snippet should not exceed the "global" line width configuration for the surrounding source. This PR does not change the default behavior, although I suspect the default should probably be `dynamic`. ## Test Plan I added a new configuration to the existing docstring code tests and also added a new set of tests dedicated to the new `dynamic` mode.	2023-12-12 09:58:07 -05:00
Samuel Cormier-Iijima	2414298289	Add "preserve" quote-style to mimic Black's skip-string-normalization (#8822 ) Co-authored-by: Micha Reiser <micha@reiser.io>	2023-12-07 23:59:22 +00:00
Micha Reiser	0bda1913d1	Create dedicated `is_*_enabled` functions for each preview style (#8988 )	2023-12-04 05:38:54 +00:00
Andrew Gallant	d9845a2628	format doctests in docstrings (#8811 ) ## Summary This PR adds opt-in support for formatting doctests in docstrings. This reflects initial support and it is intended to add support for Markdown and reStructuredText Python code blocks in the future. But I believe this PR lays the groundwork, and future additions for Markdown and reST should be less costly to add. It's strongly recommended to review this PR commit-by-commit. The last few commits in particular implement the bulk of the work here and represent the denser portions. Some things worth mentioning: * The formatter is itself not perfect, and it is possible for it to produce invalid Python code. Because of this, reformatted code snippets are checked for Python validity. If they aren't valid, then we (unfortunately silently) bail on formatting that code snippet. * There are a couple places where it would be nice to at least warn the user that doctest formatting failed, but it wasn't clear to me what the best way to do that is. * I haven't yet run this in anger on a real world code base. I think that should happen before merging. Closes #7146 ## Test Plan * [x] Pass the local test suite. * [x] Scrutinize ecosystem changes. * [x] Run this formatter on extant code and scrutinize the results. (e.g., CPython, numpy.)	2023-11-27 11:14:55 -05:00
Dhruv Manilawala	3e00ddce38	Preserve trailing semicolon for Notebooks (#8590 ) ## Summary This PR updates the formatter to preserve trailing semicolon for Jupyter Notebooks. The motivation behind the change is that semicolons in notebooks are typically used to hide the output, for example when plotting. This is highlighted in the linked issue. The conditions required as to when the trailing semicolon should be preserved are: 1. It should be a top-level statement which is last in the module. 2. For statement, it can be either assignment, annotated assignment, or augmented assignment. Here, the target should only be a single identifier i.e., multiple assignments or tuple unpacking isn't considered. 3. For expression, it can be any. ## Test Plan Add a new integration test in `ruff_cli`. The test notebook basically acts as a document as to which trailing semicolons are to be preserved. fixes: #8254	2023-11-10 21:53:35 +05:30
konsti	0ef6af807b	Implement DerefMut for WithNodeLevel (#6443 ) Summary Implement `DerefMut` for `WithNodeLevel` so it can be used in the same way as `PyFormatter`. I want this for my WIP upstack branch to enable `.fmt(f)` on `WithNodeLevel` context. We could extend this to remove the other two method from `WithNodeLevel`.	2023-08-11 10:41:48 +00:00
Micha Reiser	38b5726948	formatter: `WithNodeLevel` helper (#6212 )	2023-07-31 21:22:17 +00:00
Micha Reiser	2cf00fee96	Remove parser dependency from ruff-python-ast (#6096 )	2023-07-26 17:47:22 +02:00
Micha Reiser	8187bf9f7e	Cover Black's `is_aritmetic_like` formatting (#5738 )	2023-07-14 17:54:58 +02:00
Micha Reiser	8665a1a19d	Pass `FormatContext` to `NeedsParentheses` <!-- Thank you for contributing to Ruff! To help us out with reviewing, please consider the following: - Does this pull request include a summary of the change? (See below.) - Does this pull request include a descriptive title? - Does this pull request include references to any relevant issues? --> ## Summary I started working on this because I assumed that I would need access to options inside of `NeedsParantheses` but it then turned out that I won't. Anyway, it kind of felt nice to pass fewer arguments. So I'm gonna put this out here to get your feedback if you prefer this over passing individual fiels. Oh, I sneeked in another change. I renamed `context.contents` to `source`. `contents` is too generic and doesn't tell you anything. <!-- What's the purpose of the change? What does it do, and why? --> ## Test Plan It compiles	2023-07-11 14:28:50 +02:00
Micha Reiser	715250a179	Prefer expanding parenthesized expressions before operands <!-- Thank you for contributing to Ruff! To help us out with reviewing, please consider the following: - Does this pull request include a summary of the change? (See below.) - Does this pull request include a descriptive title? - Does this pull request include references to any relevant issues? --> ## Summary This PR implements Black's behavior where it first splits off parenthesized expressions before splitting before operands to avoid unnecessary parentheses: ```python # We want if a + [ b, c ]: pass # Rather than if ( a + [b, c] ): pass ``` This is implemented by using the new IR elements introduced in #5596. * We give the group wrapping the optional parentheses an ID (`parentheses_id`) * We use `conditional_group` for the lower priority groups (all non-parenthesized expressions) with the condition that the `parentheses_id` group breaks (we want to split before operands only if the parentheses are necessary) * We use `fits_expanded` to wrap all other parenthesized expressions (lists, dicts, sets), to prevent that expanding e.g. a list expands the `parentheses_id` group. We gate the `fits_expand` to only apply if the `parentheses_id` group fits (because we prefer `a\n+[b, c]` over expanding `[b, c]` if the whole expression gets parenthesized). We limit using `fits_expanded` and `conditional_group` only to expressions that themselves are not in parentheses (checking the conditions isn't free) ## Test Plan It increases the Jaccard index for Django from 0.915 to 0.917 ## Incompatibilites There are two incompatibilities left that I'm aware of (there may be more, I didn't go through all snapshot differences). ### Long string literals I commented on the regression. The issue is that a very long string (or any content without a split point) may not fit when only breaking the right side. The formatter than inserts the optional parentheses. But this is kind of useless because the overlong string will still not fit, because there are no new split points. I think we should ignore this incompatibility for now ### Expressions on statement level I don't fully understand the logic behind this yet, but black doesn't break before the operators for the following example even though the expression exceeds the configured line width ```python aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa < bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb > ccccccccccccccccccccccccccccc == ddddddddddddddddddddd ``` But it would if the expression is used inside of a condition. What I understand so far is that Black doesn't insert optional parentheses on the expression statement level (and a few other places) and, therefore, only breaks after opening parentheses. I propose to keep this deviation for now to avoid overlong-lines and use the compatibility report to make a decision if we should implement the same behavior.	2023-07-11 14:07:39 +02:00
Micha Reiser	dd0d1afb66	Create `PyFormatOptions` <!-- Thank you for contributing to Ruff! To help us out with reviewing, please consider the following: - Does this pull request include a summary of the change? (See below.) - Does this pull request include a descriptive title? - Does this pull request include references to any relevant issues? --> ## Summary This PR adds a new `PyFormatOptions` struct that stores the python formatter options. The new options aren't used yet, with the exception of magical trailing commas and the options passed to the printer. I'll follow up with more PRs that use the new options (e.g. `QuoteStyle`). <!-- What's the purpose of the change? What does it do, and why? --> ## Test Plan `cargo test` I'll follow up with a new PR that adds support for overriding the options in our fixture tests.	2023-06-26 14:02:17 +02:00
Micha Reiser	3f032cf09d	Format binary expressions (#4862 ) * Format Binary Expressions * Extract NeedsParentheses trait	2023-06-06 08:34:53 +00:00
Micha Reiser	c65f47d7c4	Format `while` Statement (#4810 )	2023-06-05 08:24:00 +00:00
Micha Reiser	5d939222db	Leading, Dangling, and Trailing comments formatting (#4785 )	2023-06-02 09:26:36 +02:00
konstin	0945803427	Generate FormatRule definitions (#4724 ) * Generate FormatRule definitions * Generate verbatim output * pub(crate) everything * clippy fix * Update crates/ruff_python_formatter/src/lib.rs Co-authored-by: Micha Reiser <micha@reiser.io> * Update crates/ruff_python_formatter/src/lib.rs Co-authored-by: Micha Reiser <micha@reiser.io> * stub out with Ok(()) again * Update crates/ruff_python_formatter/src/lib.rs Co-authored-by: Micha Reiser <micha@reiser.io> * PyFormatContext::{contents, locator} with `#[allow(unused)]` * Can't leak private type * remove commented code * Fix ruff errors * pub struct Format{node} due to rust rules --------- Co-authored-by: Julian LaNeve <lanevejulian@gmail.com> Co-authored-by: Micha Reiser <micha@reiser.io>	2023-06-01 08:38:53 +02:00
Micha Reiser	0cd453bdf0	Generic "comment to node" association logic (#4642 )	2023-05-30 09:28:01 +00:00
Micha Reiser	86ced3516b	Introduce `SourceCodeSlice` to reduce the size of `FormatElement` (#4622 ) Introduce `SourceCodeSlice` to reduce the size of `FormatElement`	2023-05-24 15:04:52 +00:00
Micha Reiser	1ccef5150d	Remove lifetime from FormatContext (#4376 )	2023-05-11 15:43:42 +00:00
Charlie Marsh	0a9d259f9c	Remove copied `core` modules from `ruff_python_formatter` (#3371 )	2023-03-08 19:03:40 +00:00
Charlie Marsh	ca49b00e55	Add initial formatter implementation (#2883 ) # Summary This PR contains the code for the autoformatter proof-of-concept. ## Crate structure The primary formatting hook is the `fmt` function in `crates/ruff_python_formatter/src/lib.rs`. The current formatter approach is outlined in `crates/ruff_python_formatter/src/lib.rs`, and is structured as follows: - Tokenize the code using the RustPython lexer. - In `crates/ruff_python_formatter/src/trivia.rs`, extract a variety of trivia tokens from the token stream. These include comments, trailing commas, and empty lines. - Generate the AST via the RustPython parser. - In `crates/ruff_python_formatter/src/cst.rs`, convert the AST to a CST structure. As of now, the CST is nearly identical to the AST, except that every node gets a `trivia` vector. But we might want to modify it further. - In `crates/ruff_python_formatter/src/attachment.rs`, attach each trivia token to the corresponding CST node. The logic for this is mostly in `decorate_trivia` and is ported almost directly from Prettier (given each token, find its preceding, following, and enclosing nodes, then attach the token to the appropriate node in a second pass). - In `crates/ruff_python_formatter/src/newlines.rs`, normalize newlines to match Black’s preferences. This involves traversing the CST and inserting or removing `TriviaToken` values as we go. - Call `format!` on the CST, which delegates to type-specific formatter implementations (e.g., `crates/ruff_python_formatter/src/format/stmt.rs` for `Stmt` nodes, and similar for `Expr` nodes; the others are trivial). Those type-specific implementations delegate to kind-specific functions (e.g., `format_func_def`). ## Testing and iteration The formatter is being developed against the Black test suite, which was copied over in-full to `crates/ruff_python_formatter/resources/test/fixtures/black`. The Black fixtures had to be modified to create `[insta](https://github.com/mitsuhiko/insta)`-compatible snapshots, which now exist in the repo. My approach thus far has been to try and improve coverage by tackling fixtures one-by-one. ## What works, and what doesn’t - Most nodes are supported at a basic level (though there are a few stragglers at time of writing, like `StmtKind::Try`). - Newlines are properly preserved in most cases. - Magic trailing commas are properly preserved in some (but not all) cases. - Trivial leading and trailing standalone comments mostly work (although maybe not at the end of a file). - Inline comments, and comments within expressions, often don’t work -- they work in a few cases, but it’s one-off right now. (We’re probably associating them with the “right” nodes more often than we are actually rendering them in the right place.) - We don’t properly normalize string quotes. (At present, we just repeat any constants verbatim.) - We’re mishandling a bunch of wrapping cases (if we treat Black as the reference implementation). Here are a few examples (demonstrating Black's stable behavior): ```py # In some cases, if the end expression is "self-closing" (functions, # lists, dictionaries, sets, subscript accesses, and any length-two # boolean operations that end in these elments), Black # will wrap like this... if some_expression and f( b, c, d, ): pass # ...whereas we do this: if ( some_expression and f( b, c, d, ) ): pass # If function arguments can fit on a single line, then Black will # format them like this, rather than exploding them vertically. if f( a, b, c, d, e, f, g, ... ): pass ``` - We don’t properly preserve parentheses in all cases. Black preserves parentheses in some but not all cases.	2023-02-15 04:06:35 +00:00

28 commits