Commit graph

63 commits

Author SHA1 Message Date
Micha Reiser
1188ffccc4
Disallow newlines in format specifiers of single quoted f- or t-strings (#18708) 2025-06-18 14:56:15 +02:00
Ibraheem Ahmed
c9dff5c7d5
[ty] AST garbage collection (#18482)
## Summary

Garbage collect ASTs once we are done checking a given file. Queries
with a cross-file dependency on the AST will reparse the file on demand.
This reduces ty's peak memory usage by ~20-30%.

The primary change of this PR is adding a `node_index` field to every
AST node, that is assigned by the parser. `ParsedModule` can use this to
create a flat index of AST nodes any time the file is parsed (or
reparsed). This allows `AstNodeRef` to simply index into the current
instance of the `ParsedModule`, instead of storing a pointer directly.

The indices are somewhat hackily (using an atomic integer) assigned by
the `parsed_module` query instead of by the parser directly. Assigning
the indices in source-order in the (recursive) parser turns out to be
difficult, and collecting the nodes during semantic indexing is
impossible as `SemanticIndex` does not hold onto a specific
`ParsedModuleRef`, which the pointers in the flat AST are tied to. This
means that we have to do an extra AST traversal to assign and collect
the nodes into a flat index, but the small performance impact (~3% on
cold runs) seems worth it for the memory savings.

Part of https://github.com/astral-sh/ty/issues/214.
2025-06-13 08:40:11 -04:00
Dylan
9bbf4987e8
Implement template strings (#17851)
This PR implements template strings (t-strings) in the parser and
formatter for Ruff.

Minimal changes necessary to compile were made in other parts of the code (e.g. ty, the linter, etc.). These will be covered properly in follow-up PRs.
2025-05-30 15:00:56 -05:00
Dhruv Manilawala
bfc17fecaa
Raise syntax error when \ is at end of file (#17409)
## Summary

This PR fixes a bug in the lexer specifically around line continuation
character at end of file.

The reason this was occurring is because the lexer wouldn't check for
EOL _after_ consuming the escaped newline but only if the EOL was right
after the line continuation character.

fixes: #17398 

## Test Plan

Add tests for the scenarios where this should occur mainly (a) when the
state is `AfterNewline` and (b) when the state is `Other`.
2025-04-15 21:26:12 +05:30
Micha Reiser
c847cad389
Update insta snapshots (#14366)
Some checks are pending
CI / Determine changes (push) Waiting to run
CI / cargo fmt (push) Waiting to run
CI / cargo shear (push) Blocked by required conditions
CI / cargo clippy (push) Blocked by required conditions
CI / cargo test (linux) (push) Blocked by required conditions
CI / cargo test (windows) (push) Blocked by required conditions
CI / cargo test (wasm) (push) Blocked by required conditions
CI / cargo build (release) (push) Blocked by required conditions
CI / cargo build (msrv) (push) Blocked by required conditions
CI / cargo fuzz (push) Blocked by required conditions
CI / Fuzz the parser (push) Blocked by required conditions
CI / test scripts (push) Blocked by required conditions
CI / ecosystem (push) Blocked by required conditions
CI / python package (push) Waiting to run
CI / pre-commit (push) Waiting to run
CI / mkdocs (push) Waiting to run
CI / formatter instabilities and black similarity (push) Blocked by required conditions
CI / test ruff-lsp (push) Blocked by required conditions
CI / benchmarks (push) Blocked by required conditions
2024-11-15 19:31:15 +01:00
Micha Reiser
5109b50bb3
Use CompactString for Identifier (#12101) 2024-07-01 10:06:02 +02:00
Dhruv Manilawala
2567e14b7a
Lexer should consider BOM for the start offset (#11732)
## Summary

This PR fixes a bug where the lexer didn't consider the BOM into the
start offset.

fixes: #11731

## Test Plan

Add multiple test cases which involves BOM character in the source for
the lexer and verify the snapshot.
2024-06-04 08:45:46 +00:00
Dhruv Manilawala
bf5b62edac
Maintain synchronicity between the lexer and the parser (#11457)
## Summary

This PR updates the entire parser stack in multiple ways:

### Make the lexer lazy

* https://github.com/astral-sh/ruff/pull/11244
* https://github.com/astral-sh/ruff/pull/11473

Previously, Ruff's lexer would act as an iterator. The parser would
collect all the tokens in a vector first and then process the tokens to
create the syntax tree.

The first task in this project is to update the entire parsing flow to
make the lexer lazy. This includes the `Lexer`, `TokenSource`, and
`Parser`. For context, the `TokenSource` is a wrapper around the `Lexer`
to filter out the trivia tokens[^1]. Now, the parser will ask the token
source to get the next token and only then the lexer will continue and
emit the token. This means that the lexer needs to be aware of the
"current" token. When the `next_token` is called, the current token will
be updated with the newly lexed token.

The main motivation to make the lexer lazy is to allow re-lexing a token
in a different context. This is going to be really useful to make the
parser error resilience. For example, currently the emitted tokens
remains the same even if the parser can recover from an unclosed
parenthesis. This is important because the lexer emits a
`NonLogicalNewline` in parenthesized context while a normal `Newline` in
non-parenthesized context. This different kinds of newline is also used
to emit the indentation tokens which is important for the parser as it's
used to determine the start and end of a block.

Additionally, this allows us to implement the following functionalities:
1. Checkpoint - rewind infrastructure: The idea here is to create a
checkpoint and continue lexing. At a later point, this checkpoint can be
used to rewind the lexer back to the provided checkpoint.
2. Remove the `SoftKeywordTransformer` and instead use lookahead or
speculative parsing to determine whether a soft keyword is a keyword or
an identifier
3. Remove the `Tok` enum. The `Tok` enum represents the tokens emitted
by the lexer but it contains owned data which makes it expensive to
clone. The new `TokenKind` enum just represents the type of token which
is very cheap.

This brings up a question as to how will the parser get the owned value
which was stored on `Tok`. This will be solved by introducing a new
`TokenValue` enum which only contains a subset of token kinds which has
the owned value. This is stored on the lexer and is requested by the
parser when it wants to process the data. For example:
8196720f80/crates/ruff_python_parser/src/parser/expression.rs (L1260-L1262)

[^1]: Trivia tokens are `NonLogicalNewline` and `Comment`

### Remove `SoftKeywordTransformer`

* https://github.com/astral-sh/ruff/pull/11441
* https://github.com/astral-sh/ruff/pull/11459
* https://github.com/astral-sh/ruff/pull/11442
* https://github.com/astral-sh/ruff/pull/11443
* https://github.com/astral-sh/ruff/pull/11474

For context,
https://github.com/RustPython/RustPython/pull/4519/files#diff-5de40045e78e794aa5ab0b8aacf531aa477daf826d31ca129467703855408220
added support for soft keywords in the parser which uses infinite
lookahead to classify a soft keyword as a keyword or an identifier. This
is a brilliant idea as it basically wraps the existing Lexer and works
on top of it which means that the logic for lexing and re-lexing a soft
keyword remains separate. The change here is to remove
`SoftKeywordTransformer` and let the parser determine this based on
context, lookahead and speculative parsing.

* **Context:** The transformer needs to know the position of the lexer
between it being at a statement position or a simple statement position.
This is because a `match` token starts a compound statement while a
`type` token starts a simple statement. **The parser already knows
this.**
* **Lookahead:** Now that the parser knows the context it can perform
lookahead of up to two tokens to classify the soft keyword. The logic
for this is mentioned in the PR implementing it for `type` and `match
soft keyword.
* **Speculative parsing:** This is where the checkpoint - rewind
infrastructure helps. For `match` soft keyword, there are certain cases
for which we can't classify based on lookahead. The idea here is to
create a checkpoint and keep parsing. Based on whether the parsing was
successful and what tokens are ahead we can classify the remaining
cases. Refer to #11443 for more details.

If the soft keyword is being parsed in an identifier context, it'll be
converted to an identifier and the emitted token will be updated as
well. Refer
8196720f80/crates/ruff_python_parser/src/parser/expression.rs (L487-L491).

The `case` soft keyword doesn't require any special handling because
it'll be a keyword only in the context of a match statement.

### Update the parser API

* https://github.com/astral-sh/ruff/pull/11494
* https://github.com/astral-sh/ruff/pull/11505

Now that the lexer is in sync with the parser, and the parser helps to
determine whether a soft keyword is a keyword or an identifier, the
lexer cannot be used on its own. The reason being that it's not
sensitive to the context (which is correct). This means that the parser
API needs to be updated to not allow any access to the lexer.

Previously, there were multiple ways to parse the source code:
1. Passing the source code itself
2. Or, passing the tokens

Now that the lexer and parser are working together, the API
corresponding to (2) cannot exists. The final API is mentioned in this
PR description: https://github.com/astral-sh/ruff/pull/11494.

### Refactor the downstream tools (linter and formatter)

* https://github.com/astral-sh/ruff/pull/11511
* https://github.com/astral-sh/ruff/pull/11515
* https://github.com/astral-sh/ruff/pull/11529
* https://github.com/astral-sh/ruff/pull/11562
* https://github.com/astral-sh/ruff/pull/11592

And, the final set of changes involves updating all references of the
lexer and `Tok` enum. This was done in two-parts:
1. Update all the references in a way that doesn't require any changes
from this PR i.e., it can be done independently
	* https://github.com/astral-sh/ruff/pull/11402
	* https://github.com/astral-sh/ruff/pull/11406
	* https://github.com/astral-sh/ruff/pull/11418
	* https://github.com/astral-sh/ruff/pull/11419
	* https://github.com/astral-sh/ruff/pull/11420
	* https://github.com/astral-sh/ruff/pull/11424
2. Update all the remaining references to use the changes made in this
PR

For (2), there were various strategies used:
1. Introduce a new `Tokens` struct which wraps the token vector and add
methods to query a certain subset of tokens. These includes:
	1. `up_to_first_unknown` which replaces the `tokenize` function
2. `in_range` and `after` which replaces the `lex_starts_at` function
where the former returns the tokens within the given range while the
latter returns all the tokens after the given offset
2. Introduce a new `TokenFlags` which is a set of flags to query certain
information from a token. Currently, this information is only limited to
any string type token but can be expanded to include other information
in the future as needed. https://github.com/astral-sh/ruff/pull/11578
3. Move the `CommentRanges` to the parsed output because this
information is common to both the linter and the formatter. This removes
the need for `tokens_and_ranges` function.

## Test Plan

- [x] Update and verify the test snapshots
- [x] Make sure the entire test suite is passing
- [x] Make sure there are no changes in the ecosystem checks
- [x] Run the fuzzer on the parser
- [x] Run this change on dozens of open-source projects

### Running this change on dozens of open-source projects

Refer to the PR description to get the list of open source projects used
for testing.

Now, the following tests were done between `main` and this branch:
1. Compare the output of `--select=E999` (syntax errors)
2. Compare the output of default rule selection
3. Compare the output of `--select=ALL`

**Conclusion: all output were same**

## What's next?

The next step is to introduce re-lexing logic and update the parser to
feed the recovery information to the lexer so that it can emit the
correct token. This moves us one step closer to having error resilience
in the parser and provides Ruff the possibility to lint even if the
source code contains syntax errors.
2024-06-03 18:23:50 +05:30
Dhruv Manilawala
6ecb4776de
Rename AnyStringKind -> AnyStringFlags (#11405)
## Summary

This PR renames `AnyStringKind` to `AnyStringFlags` and `AnyStringFlags`
to `AnyStringFlagsInner`.

The main motivation is to have consistent usage of "kind" and "flags".
For each string kind, it's "flags" like `StringLiteralFlags`,
`BytesLiteralFlags`, and `FStringFlags` but it was `AnyStringKind` for
the "any" variant.
2024-05-13 13:18:07 +00:00
Charlie Marsh
35ba3c91ce
Use u64 instead of i64 in Int type (#11356)
## Summary

I believe the value here is always unsigned, since we represent `-42` as
a unary operator on `42`.
2024-05-10 13:35:15 +00:00
Dhruv Manilawala
13ffb5bc19
Replace LALRPOP parser with hand-written parser (#10036)
(Supersedes #9152, authored by @LaBatata101)

## Summary

This PR replaces the current parser generated from LALRPOP to a
hand-written recursive descent parser.

It also updates the grammar for [PEP
646](https://peps.python.org/pep-0646/) so that the parser outputs the
correct AST. For example, in `data[*x]`, the index expression is now a
tuple with a single starred expression instead of just a starred
expression.

Beyond the performance improvements, the parser is also error resilient
and can provide better error messages. The behavior as seen by any
downstream tools isn't changed. That is, the linter and formatter can
still assume that the parser will _stop_ at the first syntax error. This
will be updated in the following months.

For more details about the change here, refer to the PR corresponding to
the individual commits and the release blog post.

## Test Plan

Write _lots_ and _lots_ of tests for both valid and invalid syntax and
verify the output.

## Acknowledgements

- @MichaReiser for reviewing 100+ parser PRs and continuously providing
guidance throughout the project
- @LaBatata101 for initiating the transition to a hand-written parser in
#9152
- @addisoncrump for implementing the fuzzer which helped
[catch](https://github.com/astral-sh/ruff/pull/10903)
[a](https://github.com/astral-sh/ruff/pull/10910)
[lot](https://github.com/astral-sh/ruff/pull/10966)
[of](https://github.com/astral-sh/ruff/pull/10896)
[bugs](https://github.com/astral-sh/ruff/pull/10877)

---------

Co-authored-by: Victor Hugo Gomes <labatata101@linuxmail.org>
Co-authored-by: Micha Reiser <micha@reiser.io>
2024-04-18 17:57:39 +05:30
Alex Waygood
ffd6e79677
Fix typo in string_token_flags.rs (#10476) 2024-03-19 17:43:08 +00:00
Alex Waygood
162d2eb723
Track casing of r-string prefixes in the tokenizer and AST (#10314)
Co-authored-by: Micha Reiser <micha@reiser.io>
2024-03-18 17:18:04 +00:00
Gautier Moin
a067d87ccc
Fix incorrect Parameter range for *args and **kwargs (#10283)
## Summary

Fix #10282 

This PR updates the Python grammar to include the `*` character in
`*args` `**kwargs` in the range of the `Parameter`
```
def f(*args, **kwargs): pass
#      ~~~~    ~~~~~~    <-- range before the PR
#     ^^^^^  ^^^^^^^^    <-- range after
```

The invalid syntax `def f(*, **kwargs): ...` is also now correctly
reported.

## Test Plan

Test cases were added to `function.rs`.
2024-03-08 18:57:49 -05:00
Alex Waygood
1d97f27335
Start tracking quoting style in the AST (#10298)
This PR modifies our AST so that nodes for string literals, bytes literals and f-strings all retain the following information:
- The quoting style used (double or single quotes)
- Whether the string is triple-quoted or not
- Whether the string is raw or not

This PR is a followup to #10256. Like with that PR, this PR does not, in itself, fix any bugs. However, it means that we will have the necessary information to preserve quoting style and rawness of strings in the `ExprGenerator` in a followup PR, which will allow us to provide a fix for https://github.com/astral-sh/ruff/issues/7799.

The information is recorded on the AST nodes using a bitflag field on each node, similarly to how we recorded the information on `Tok::String`, `Tok::FStringStart` and `Tok::FStringMiddle` tokens in #10298. Rather than reusing the bitflag I used for the tokens, however, I decided to create a custom bitflag for each AST node.

Using different bitflags for each node allows us to make invalid states unrepresentable: it is valid to set a `u` prefix on a string literal, but not on a bytes literal or an f-string. It also allows us to have better debug representations for each AST node modified in this PR.
2024-03-08 19:11:47 +00:00
Alex Waygood
c504d7ab11
Track quoting style in the tokenizer (#10256) 2024-03-08 08:40:06 +00:00
Micha Reiser
184241f99a
Remove Expr postfix from ExprNamed, ExprIf, and ExprGenerator (#10229)
The expression types in our AST are called `ExprYield`, `ExprAwait`,
`ExprStringLiteral` etc, except `ExprNamedExpr`, `ExprIfExpr` and
`ExprGenratorExpr`. This seems to align with [Python AST's
naming](https://docs.python.org/3/library/ast.html) but feels
inconsistent and excessive.

This PR removes the `Expr` postfix from `ExprNamedExpr`, `ExprIfExpr`,
and `ExprGeneratorExpr`.
2024-03-04 12:55:01 +01:00
Micha Reiser
77c5561646
Add parenthesized flag to ExprTuple and ExprGenerator (#9614) 2024-02-26 15:35:20 +00:00
Dhruv Manilawala
33ac2867b7
Use non-parenthesized range for DebugText (#9953)
## Summary

This PR fixes the `DebugText` implementation to use the expression range
instead of the parenthesized range.

Taking the following code snippet as an example:
```python
x = 1
print(f"{  ( x  ) = }")
```

The output of running it would be:
```
  ( x  ) = 1
```

Notice that the whitespace between the parentheses and the expression is
preserved as is.

Currently, we don't preserve this information in the AST which defeats
the purpose of `DebugText` as the main purpose of the struct is to
preserve whitespaces _around_ the expression.

This is also problematic when generating the code from the AST node as
then the generator has no information about the parentheses the
whitespaces between them and the expression which would lead to the
removal of the parentheses in the generated code.

I noticed this while working on the f-string formatting where the debug
text would be used to preserve the text surrounding the expression in
the presence of debug expression. The parentheses were being dropped
then which made me realize that the problem is instead in the parser.

## Test Plan

1. Add a test case for the parser
2. Add a test case for the generator
2024-02-12 23:00:02 +05:30
Seo Sanghyeon
df7fb95cbc
Index multiline f-strings (#9837)
Fix #9777.
2024-02-05 21:25:33 -05:00
Dhruv Manilawala
cdac90ef68
New AST nodes for f-string elements (#8835)
Rebase of #6365 authored by @davidszotten.

## Summary

This PR updates the AST structure for an f-string elements.

The main **motivation** behind this change is to have a dedicated node
for the string part of an f-string. Previously, the existing
`ExprStringLiteral` node was used for this purpose which isn't exactly
correct. The `ExprStringLiteral` node should include the quotes as well
in the range but the f-string literal element doesn't include the quote
as it's a specific part within an f-string. For example,

```python
f"foo {x}"
# ^^^^
# This is the literal part of an f-string
```

The introduction of `FStringElement` enum is helpful which represent
either the literal part or the expression part of an f-string.

### Rule Updates

This means that there'll be two nodes representing a string depending on
the context. One for a normal string literal while the other is a string
literal within an f-string. The AST checker is updated to accommodate
this change. The rules which work on string literal are updated to check
on the literal part of f-string as well.

#### Notes

1. The `Expr::is_literal_expr` method would check for
`ExprStringLiteral` and return true if so. But now that we don't
represent the literal part of an f-string using that node, this improves
the method's behavior and confines to the actual expression. We do have
the `FStringElement::is_literal` method.
2. We avoid checking if we're in a f-string context before adding to
`string_type_definitions` because the f-string literal is now a
dedicated node and not part of `Expr`.
3. Annotations cannot use f-string so we avoid changing any rules which
work on annotation and checks for `ExprStringLiteral`.

## Test Plan

- All references of `Expr::StringLiteral` were checked to see if any of
the rules require updating to account for the f-string literal element
node.
- New test cases are added for rules which check against the literal
part of an f-string.
- Check the ecosystem results and ensure it remains unchanged.

## Performance

There's a performance penalty in the parser. The reason for this remains
unknown as it seems that the generated assembly code is now different
for the `__reduce154` function. The reduce function body is just popping
the `ParenthesizedExpr` on top of the stack and pushing it with the new
location.

- The size of `FStringElement` enum is the same as `Expr` which is what
it replaces in `FString::format_spec`
- The size of `FStringExpressionElement` is the same as
`ExprFormattedValue` which is what it replaces

I tried reducing the `Expr` enum from 80 bytes to 72 bytes but it hardly
resulted in any performance gain. The difference can be seen here:
- Original profile: https://share.firefox.dev/3Taa7ES
- Profile after boxing some node fields:
https://share.firefox.dev/3GsNXpD

### Backtracking

I tried backtracking the changes to see if any of the isolated change
produced this regression. The problem here is that the overall change is
so small that there's only a single checkpoint where I can backtrack and
that checkpoint results in the same regression. This checkpoint is to
revert using `Expr` to the `FString::format_spec` field. After this
point, the change would revert back to the original implementation.

## Review process

The review process is similar to #7927. The first set of commits update
the node structure, parser, and related AST files. Then, further commits
update the linter and formatter part to account for the AST change.

---------

Co-authored-by: David Szotten <davidszotten@gmail.com>
2023-12-07 10:28:05 -06:00
Charlie Marsh
20782ab02c
Support type alias statements in simple statement positions (#8916)
<!--
Thank you for contributing to Ruff! To help us out with reviewing,
please consider the following:

- Does this pull request include a summary of the change? (See below.)
- Does this pull request include a descriptive title?
- Does this pull request include references to any relevant issues?
-->

## Summary

Our `SoftKeywordTokenizer` only respected soft keywords in compound
statement positions -- for example, at the start of a logical line:

```python
type X = int
```

However, type aliases can also appear in simple statement positions,
like:

```python
class Class: type X = int
```

(Note that `match` and `case` are _not_ valid keywords in such
positions.)

This PR upgrades the tokenizer to track both kinds of valid positions.

Closes https://github.com/astral-sh/ruff/issues/8900.
Closes https://github.com/astral-sh/ruff/issues/8899.

## Test Plan

`cargo test`
2023-11-30 19:15:19 +00:00
Charlie Marsh
774c77adae
Avoid off-by-one error in with-item named expressions (#8915)
## Summary

Given `with (a := b): pass`, we truncate the `WithItem` range by one on
both sides such that the parentheses are part of the statement, rather
than the item. However, for `with (a := b) as x: pass`, we want to avoid
this trick.

Closes https://github.com/astral-sh/ruff/issues/8913.
2023-11-30 00:11:04 +00:00
Dhruv Manilawala
47d80f29a7
Lexer start of line is false only for Mode::Expression (#8880)
## Summary

This PR fixes the bug in the lexer where the `Mode::Ipython` wasn't
being considered when initializing the soft keyword transformer which
wraps the lexer. This means that if the source code starts with either
`match` or `type` keyword, then the keywords were being considered as
name tokens instead. For example,

```python
match foo:
    case bar:
        pass
```

This would transform the `match` keyword into an identifier if the mode
is `Ipython`.

The fix is to reverse the condition in the soft keyword initializer so
that any new modes are by default considered as the lexer being at start
of line.

## Test Plan

Add a new test case for `Mode::Ipython` and verify the snapshot.

fixes: #8870
2023-11-28 20:38:25 +00:00
Dhruv Manilawala
017e829115
Update string nodes for implicit concatenation (#7927)
## Summary

This PR updates the string nodes (`ExprStringLiteral`,
`ExprBytesLiteral`, and `ExprFString`) to account for implicit string
concatenation.

### Motivation

In Python, implicit string concatenation are joined while parsing
because the interpreter doesn't require the information for each part.
While that's feasible for an interpreter, it falls short for a static
analysis tool where having such information is more useful. Currently,
various parts of the code uses the lexer to get the individual string
parts.

One of the main challenge this solves is that of string formatting.
Currently, the formatter relies on the lexer to get the individual
string parts, and formats them including the comments accordingly. But,
with PEP 701, f-string can also contain comments. Without this change,
it becomes very difficult to add support for f-string formatting.

### Implementation

The initial proposal was made in this discussion:
https://github.com/astral-sh/ruff/discussions/6183#discussioncomment-6591993.
There were various AST designs which were explored for this task which
are available in the linked internal document[^1].

The selected variant was the one where the nodes were kept as it is
except that the `implicit_concatenated` field was removed and instead a
new struct was added to the `Expr*` struct. This would be a private
struct would contain the actual implementation of how the AST is
designed for both single and implicitly concatenated strings.

This implementation is achieved through an enum with two variants:
`Single` and `Concatenated` to avoid allocating a vector even for single
strings. There are various public methods available on the value struct
to query certain information regarding the node.

The nodes are structured in the following way:

```
ExprStringLiteral - "foo" "bar"
|- StringLiteral - "foo"
|- StringLiteral - "bar"

ExprBytesLiteral - b"foo" b"bar"
|- BytesLiteral - b"foo"
|- BytesLiteral - b"bar"

ExprFString - "foo" f"bar {x}"
|- FStringPart::Literal - "foo"
|- FStringPart::FString - f"bar {x}"
  |- StringLiteral - "bar "
  |- FormattedValue - "x"
```

[^1]: Internal document:
https://www.notion.so/astral-sh/Implicit-String-Concatenation-e036345dc48943f89e416c087bf6f6d9?pvs=4

#### Visitor

The way the nodes are structured is that the entire string, including
all the parts that are implicitly concatenation, is a single node
containing individual nodes for the parts. The previous section has a
representation of that tree for all the string nodes. This means that
new visitor methods are added to visit the individual parts of string,
bytes, and f-strings for `Visitor`, `PreorderVisitor`, and
`Transformer`.

## Test Plan

- `cargo insta test --workspace --all-features --unreferenced reject`
- Verify that the ecosystem results are unchanged
2023-11-24 17:55:41 -06:00
Andrew Gallant
6a1fa4778f
Reject more syntactically invalid Python programs (#8524)
## Summary

This commit adds some additional error checking to the parser such that
assignments that are invalid syntax are rejected. This covers the
obvious cases like `5 = 3` and some not so obvious cases like `x + y =
42`.

This does add an additional recursive call to the parser for the cases
handling assignments. I had initially been concerned about doing this,
but `set_context` is already doing recursion during assignments, so I
didn't feel as though this was changing any fundamental performance
characteristics of the parser. (Also, in practice, I would expect any
such recursion here to be quite shallow since the recursion is done on
the target of an assignment. Such things are rarely nested much in
practice.)

Fixes #6895

## Test Plan

I've added unit tests covering every case that is detected as invalid on
an `Expr`.
2023-11-07 07:16:06 -05:00
konsti
daea870c3c
Fix panic with 8 in octal escape (#8356)
**Summary** The digits for an octal escape are 0 to 7, not 0 to 8,
fixing the panic in #8355

**Test plan** Regression test parser fixture
2023-10-30 14:42:15 +01:00
Dhruv Manilawala
230c9ce236
Split Constant to individual literal nodes (#8064)
## Summary

This PR splits the `Constant` enum as individual literal nodes. It
introduces the following new nodes for each variant:
* `ExprStringLiteral`
* `ExprBytesLiteral`
* `ExprNumberLiteral`
* `ExprBooleanLiteral`
* `ExprNoneLiteral`
* `ExprEllipsisLiteral`

The main motivation behind this refactor is to introduce the new AST
node for implicit string concatenation in the coming PR. The elements of
that node will be either a string literal, bytes literal or a f-string
which can be implemented using an enum. This means that a string or
bytes literal cannot be represented by `Constant::Str` /
`Constant::Bytes` which creates an inconsistency.

This PR avoids that inconsistency by splitting the constant nodes into
it's own literal nodes, literal being the more appropriate naming
convention from a static analysis tool perspective.

This also makes working with literals in the linter and formatter much
more ergonomic like, for example, if one would want to check if this is
a string literal, it can be done easily using
`Expr::is_string_literal_expr` or matching against `Expr::StringLiteral`
as oppose to matching against the `ExprConstant` and enum `Constant`. A
few AST helper methods can be simplified as well which will be done in a
follow-up PR.

This introduces a new `Expr::is_literal_expr` method which is the same
as `Expr::is_constant_expr`. There are also intermediary changes related
to implicit string concatenation which are quiet less. This is done so
as to avoid having a huge PR which this already is.

## Test Plan

1. Verify and update all of the existing snapshots (parser, visitor)
2. Verify that the ecosystem check output remains **unchanged** for both
the linter and formatter

### Formatter ecosystem check

#### `main`

| project | similarity index | total files | changed files |

|----------------|------------------:|------------------:|------------------:|
| cpython | 0.75803 | 1799 | 1647 |
| django | 0.99983 | 2772 | 34 |
| home-assistant | 0.99953 | 10596 | 186 |
| poetry | 0.99891 | 317 | 17 |
| transformers | 0.99966 | 2657 | 330 |
| twine | 1.00000 | 33 | 0 |
| typeshed | 0.99978 | 3669 | 20 |
| warehouse | 0.99977 | 654 | 13 |
| zulip | 0.99970 | 1459 | 22 |

#### `dhruv/constant-to-literal`

| project | similarity index | total files | changed files |

|----------------|------------------:|------------------:|------------------:|
| cpython | 0.75803 | 1799 | 1647 |
| django | 0.99983 | 2772 | 34 |
| home-assistant | 0.99953 | 10596 | 186 |
| poetry | 0.99891 | 317 | 17 |
| transformers | 0.99966 | 2657 | 330 |
| twine | 1.00000 | 33 | 0 |
| typeshed | 0.99978 | 3669 | 20 |
| warehouse | 0.99977 | 654 | 13 |
| zulip | 0.99970 | 1459 | 22 |
2023-10-30 12:13:23 +05:30
Dhruv Manilawala
78bbf6d403
New Singleton enum for PatternMatchSingleton node (#8063)
## Summary

This PR adds a new `Singleton` enum for the `PatternMatchSingleton`
node.

Earlier the node was using the `Constant` enum but the value for this
pattern can only be either `None`, `True` or `False`. With the coming PR
to remove the `Constant`, this node required a new type to fill in.

This also has the benefit of narrowing the type down to only the
possible values for the node as evident by the removal of `unreachable`.

## Test Plan

Update the AST snapshots and run `cargo test`.
2023-10-30 05:48:53 +00:00
Charlie Marsh
d6a4283003
Fix range of unparenthesized tuple subject in match statement (#8101)
## Summary

This was just a bug in the parser ranges, probably since it was
initially implemented. Given `match n % 3, n % 5: ...`, the "subject"
(i.e., the tuple of two binary operators) was using the entire range of
the `match` statement.

Closes https://github.com/astral-sh/ruff/issues/8091.

## Test Plan

`cargo test`
2023-10-22 19:58:33 -04:00
Dhruv Manilawala
43883b7a15
Disallow f-strings in match pattern literal (#7857)
## Summary

This PR fixes a bug to disallow f-strings in match pattern literal.

```
literal_pattern ::=  signed_number
                     | signed_number "+" NUMBER
                     | signed_number "-" NUMBER
                     | strings
                     | "None"
                     | "True"
                     | "False"
                     | signed_number: NUMBER | "-" NUMBER
```

Source:
https://docs.python.org/3/reference/compound_stmts.html#grammar-token-python-grammar-literal_pattern

Also,

```console
$ python /tmp/t.py
  File "/tmp/t.py", line 4
    case "hello " f"{name}":
         ^^^^^^^^^^^^^^^^^^
SyntaxError: patterns may only match literals and attribute lookups
```

## Test Plan

Update existing test case and accordingly the snapshots. Also, add a new
test case to verify that the parser does raise an error.
2023-10-09 10:11:08 +00:00
Dhruv Manilawala
709abd534a
Fix lexing single-quoted f-string with multi-line format spec (#7787)
## Summary

Reported at https://github.com/python/cpython/issues/110259

## Test Plan

Add test cases for the fix and update the snapshots
2023-10-05 23:12:09 +05:30
Dhruv Manilawala
69b8136463
Avoid curly brace escape in f-string format spec (#7780)
## Summary

This PR fixes a bug in the lexer for f-string format spec where it would
consider the `{{` (double curly braces) as an escape pattern.

This is not the case as evident by the
[PEP](https://peps.python.org/pep-0701/#how-to-produce-these-new-tokens)
as well but I missed the part:

> [..]
> * **If in “format specifier mode” (see step 3), an opening brace ({)
or a closing brace (}).**
> * If not in “format specifier mode” (see step 3), an opening brace ({)
or a closing brace (}) that is not immediately followed by another
opening/closing brace.

## Test Plan

Add a test case to verify the fix and update the snapshot.

fixes: #7778
2023-10-03 19:38:03 +05:30
konsti
3ccd1d580d
Use crates.io unicode_names2 0.6.0 (#6478)
Update `unicode_names2` to the crates.io release 0.6.0, removing a git
dependency.
2023-10-02 18:17:38 -04:00
Dhruv Manilawala
e91ffe3e93
Consume the escaped Windows newline (\r\n) for FStringMiddle (#7722)
## Summary

This PR fixes a bug where if a Windows newline (`\r\n`) character was
escaped, then only the `\r` was consumed and not `\n` leading to an
unterminated string error.

## Test Plan

Add new test cases to check the newline escapes.

fixes: #7632
2023-10-01 07:58:20 +05:30
Dhruv Manilawala
e72d617f4b
Remove escaped mac/windows eol from AST string value (#7724)
## Summary

This PR fixes the bug where the value of a string node type includes the
escaped mac/windows newline character.

Note that the token value still includes them, it's only removed when
parsing the string content.

## Test Plan

Add new test cases for the string node type to check that the escapes
aren't being included in the string value.

fixes: #7723
2023-10-01 07:37:59 +05:30
Dhruv Manilawala
e62e245c61
Add support for PEP 701 (#7376)
## Summary

This PR adds support for PEP 701 in Ruff. This is a rollup PR of all the
other individual PRs. The separate PRs were created for logic separation
and code reviews. Refer to each pull request for a detail description on
the change.

Refer to the PR description for the list of pull requests within this PR.

## Test Plan

### Formatter ecosystem checks

Explanation for the change in ecosystem check:
https://github.com/astral-sh/ruff/pull/7597#issue-1908878183

#### `main`

```
| project      | similarity index  | total files       | changed files     |
|--------------|------------------:|------------------:|------------------:|
| cpython      |           0.76083 |              1789 |              1631 |
| django       |           0.99983 |              2760 |                36 |
| transformers |           0.99963 |              2587 |               319 |
| twine        |           1.00000 |                33 |                 0 |
| typeshed     |           0.99983 |              3496 |                18 |
| warehouse    |           0.99967 |               648 |                15 |
| zulip        |           0.99972 |              1437 |                21 |
```

#### `dhruv/pep-701`

```
| project      | similarity index  | total files       | changed files     |
|--------------|------------------:|------------------:|------------------:|
| cpython      |           0.76051 |              1789 |              1632 |
| django       |           0.99983 |              2760 |                36 |
| transformers |           0.99963 |              2587 |               319 |
| twine        |           1.00000 |                33 |                 0 |
| typeshed     |           0.99983 |              3496 |                18 |
| warehouse    |           0.99967 |               648 |                15 |
| zulip        |           0.99972 |              1437 |                21 |
```
2023-09-29 02:55:39 +00:00
Charlie Marsh
f45281345d
Include radix base prefix in large number representation (#7700)
## Summary

When lexing a number like `0x995DC9BBDF1939FA` that exceeds our small
number representation, we were only storing the portion after the base
(in this case, `995DC9BBDF1939FA`). When using that representation in
code generation, this could lead to invalid syntax, since
`995DC9BBDF1939FA)` on its own is not a valid integer.

This PR modifies the code to store the full span, including the radix
prefix.

See:
https://github.com/astral-sh/ruff/issues/7455#issuecomment-1739802958.

## Test Plan

`cargo test`
2023-09-28 20:38:06 +00:00
Charlie Marsh
93b5d8a0fb
Implement our own small-integer optimization (#7584)
## Summary

This is a follow-up to #7469 that attempts to achieve similar gains, but
without introducing malachite. Instead, this PR removes the `BigInt`
type altogether, instead opting for a simple enum that allows us to
store small integers directly and only allocate for values greater than
`i64`:

```rust
/// A Python integer literal. Represents both small (fits in an `i64`) and large integers.
#[derive(Clone, PartialEq, Eq, Hash)]
pub struct Int(Number);

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum Number {
    /// A "small" number that can be represented as an `i64`.
    Small(i64),
    /// A "large" number that cannot be represented as an `i64`.
    Big(Box<str>),
}

impl std::fmt::Display for Number {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        match self {
            Number::Small(value) => write!(f, "{value}"),
            Number::Big(value) => write!(f, "{value}"),
        }
    }
}
```

We typically don't care about numbers greater than `isize` -- our only
uses are comparisons against small constants (like `1`, `2`, `3`, etc.),
so there's no real loss of information, except in one or two rules where
we're now a little more conservative (with the worst-case being that we
don't flag, e.g., an `itertools.pairwise` that uses an extremely large
value for the slice start constant). For simplicity, a few diagnostics
now show a dedicated message when they see integers that are out of the
supported range (e.g., `outdated-version-block`).

An additional benefit here is that we get to remove a few dependencies,
especially `num-bigint`.

## Test Plan

`cargo test`
2023-09-25 15:13:21 +00:00
Micha Reiser
8ce138760a
Emit LexError for dedent to incorrect level (#7638) 2023-09-25 11:45:44 +01:00
Dhruv Manilawala
a41bb2733f
Add range to lexer test snapshots (#7265)
## Summary

This PR updates the lexer test snapshots to include the range value as
well. This is mainly a mechanical refactor.

### Motivation

The main motivation is so that we can verify that the ranges are valid
and do not overlap.

## Test Plan

`cargo test`
2023-09-11 19:12:46 +00:00
Dhruv Manilawala
f5701fcc63
Use snapshots for remaining lexer tests (#7264)
## Summary

This PR updates the remaining lexer test cases to use the snapshots.
This is mainly a mechanical refactor.

## Motivation

The main motivation is so that when we add the token range values to the
test case output, it's easier to update the test cases.

The reason they were not using the snapshots before was because of the usage of
`test_case` macro. The macros is mainly used for different EOL test cases. If we
just generate the snapshots directly, then the snapshot name would be suffixed
with `-1`, `-2`, etc. as the test function is still the same. So, we'll create
the snapshot ourselves with the platform name for the respective EOL
test cases.

## Test Plan

`cargo test`
2023-09-12 00:16:38 +05:30
Dhruv Manilawala
04f2842e4f
Move ExprConstant::kind to StringConstant::unicode (#7180) 2023-09-06 07:39:25 +00:00
Charlie Marsh
68f605e80a
Fix WithItem ranges for parenthesized, non-as items (#6782)
## Summary

This PR attempts to address a problem in the parser related to the
range's of `WithItem` nodes in certain contexts -- specifically,
`WithItem` nodes in parentheses that do not have an `as` token after
them.

For example,
[here](https://play.ruff.rs/71be2d0b-2a04-4c7e-9082-e72bff152679):

```python
with (a, b):
    pass
```

The range of the `WithItem` `a` is set to the range of `(a, b)`, as is
the range of the `WithItem` `b`. In other words, when we have this kind
of sequence, we use the range of the entire parenthesized context,
rather than the ranges of the items themselves.

Note that this also applies to cases
[like](https://play.ruff.rs/c551e8e9-c3db-4b74-8cc6-7c4e3bf3713a):

```python
with (a, b, c as d):
    pass
```

You can see the issue in the parser here:

```rust
#[inline]
WithItemsNoAs: Vec<ast::WithItem> = {
    <location:@L> <all:OneOrMore<Test<"all">>> <end_location:@R> => {
        all.into_iter().map(|context_expr| ast::WithItem { context_expr, optional_vars: None, range: (location..end_location).into() }).collect()
    },
}
```

Fixing this issue is... very tricky. The naive approach is to use the
range of the `context_expr` as the range for the `WithItem`, but that
range will be incorrect when the `context_expr` is itself parenthesized.
For example, _that_ solution would fail here, since the range of the
first `WithItem` would be that of `a`, rather than `(a)`:

```python
with ((a), b):
    pass
```

The `with` parsing in general is highly precarious due to ambiguities in
the grammar. Changing it in _any_ way seems to lead to an ambiguous
grammar that LALRPOP fails to translate. Consensus seems to be that we
don't really understand _why_ the current grammar works (i.e., _how_ it
avoids these ambiguities as-is).

The solution implemented here is to avoid changing the grammar itself,
and instead change the shape of the nodes returned by various rules in
the grammar. Specifically, everywhere that we return `Expr`, we instead
return `ParenthesizedExpr`, which includes a parenthesized range and the
underlying `Expr` itself. (If an `Expr` isn't parenthesized, the ranges
will be equivalent.) In `WithItemsNoAs`, we can then use the
parenthesized range as the range for the `WithItem`.
2023-08-31 16:21:29 +01:00
Charlie Marsh
15b73bdb8a
Introduce AST nodes for PatternMatchClass arguments (#6881)
## Summary

This PR introduces two new AST nodes to improve the representation of
`PatternMatchClass`. As a reminder, `PatternMatchClass` looks like this:

```python
case Point2D(0, 0, x=1, y=2):
  ...
```

Historically, this was represented as a vector of patterns (for the `0,
0` portion) and parallel vectors of keyword names (for `x` and `y`) and
values (for `1` and `2`). This introduces a bunch of challenges for the
formatter, but importantly, it's also really different from how we
represent similar nodes, like arguments (`func(0, 0, x=1, y=2)`) or
parameters (`def func(x, y)`).

So, firstly, we now use a single node (`PatternArguments`) for the
entire parenthesized region, making it much more consistent with our
other nodes. So, above, `PatternArguments` would be `(0, 0, x=1, y=2)`.

Secondly, we now have a `PatternKeyword` node for `x=1` and `y=2`. This
is much more similar to the how `Keyword` is represented within
`Arguments` for call expressions.

Closes https://github.com/astral-sh/ruff/issues/6866.

Closes https://github.com/astral-sh/ruff/issues/6880.
2023-08-26 14:45:44 +00:00
Dhruv Manilawala
fb7caf43c8
Update lexer tests to use snapshots (#6658)
## Summary

This PR updates the lexer tests to use the snapshot testing framework.
It also
makes the following changes:
* Remove the use of macros in the lexer tests
* Use `test_case` for EOL tests

## Test Plan

```
cargo test --package ruff_python_parser --lib --all-features -- lexer::tests --no-capture
```
2023-08-22 18:23:19 +00:00
Charlie Marsh
6a5acde226
Make Parameters an optional field on ExprLambda (#6669)
## Summary

If a lambda doesn't contain any parameters, or any parameter _tokens_
(like `*`), we can use `None` for the parameters. This feels like a
better representation to me, since, e.g., what should the `TextRange` be
for a non-existent set of parameters? It also allows us to remove
several sites where we check if the `Parameters` is empty by seeing if
it contains any arguments, so semantically, we're already trying to
detect and model around this elsewhere.

Changing this also fixes a number of issues with dangling comments in
parameter-less lambdas, since those comments are now automatically
marked as dangling on the lambda. (As-is, we were also doing something
not-great whereby the lambda was responsible for formatting dangling
comments on the parameters, which has been removed.)

Closes https://github.com/astral-sh/ruff/issues/6646.

Closes https://github.com/astral-sh/ruff/issues/6647.

## Test Plan

`cargo test`
2023-08-18 15:34:54 +00:00
Charlie Marsh
a70807e1e1
Expand NamedExpr range to include full range of parenthesized value (#6632)
## Summary

Given:

```python
if (
    x
    :=
    (  # 4
        y # 5
    )  # 6
):
    pass
```

It turns out the parser ended the range of the `NamedExpr` at the end of
`y`, rather than the end of the parenthesis that encloses `y`. This just
seems like a bug -- the range should be from the start of the name on
the left, to the end of the parenthesized node on the right.

## Test Plan

`cargo test`
2023-08-17 14:34:05 +00:00
Charlie Marsh
96d310fbab
Remove Stmt::TryStar (#6566)
## Summary

Instead, we set an `is_star` flag on `Stmt::Try`. This is similar to the
pattern we've migrated towards for `Stmt::For` (removing
`Stmt::AsyncFor`) and friends. While these are significant differences
for an interpreter, we tend to handle these cases identically or nearly
identically.

## Test Plan

`cargo test`
2023-08-14 13:39:44 -04:00
Charlie Marsh
f16e780e0a
Add an implicit concatenation flag to string and bytes constants (#6512)
## Summary

Per the discussion in
https://github.com/astral-sh/ruff/discussions/6183, this PR adds an
`implicit_concatenated` flag to the string and bytes constant variants.
It's not actually _used_ anywhere as of this PR, but it is covered by
the tests.

Specifically, we now use a struct for the string and bytes cases, along
with the `Expr::FString` node. That struct holds the value, plus the
flag:

```rust
#[derive(Clone, Debug, PartialEq, is_macro::Is)]
pub enum Constant {
    Str(StringConstant),
    Bytes(BytesConstant),
    ...
}

#[derive(Clone, Debug, PartialEq, Eq)]
pub struct StringConstant {
    /// The string value as resolved by the parser (i.e., without quotes, or escape sequences, or
    /// implicit concatenations).
    pub value: String,
    /// Whether the string contains multiple string tokens that were implicitly concatenated.
    pub implicit_concatenated: bool,
}

impl Deref for StringConstant {
    type Target = str;
    fn deref(&self) -> &Self::Target {
        self.value.as_str()
    }
}

#[derive(Clone, Debug, PartialEq, Eq)]
pub struct BytesConstant {
    /// The bytes value as resolved by the parser (i.e., without quotes, or escape sequences, or
    /// implicit concatenations).
    pub value: Vec<u8>,
    /// Whether the string contains multiple string tokens that were implicitly concatenated.
    pub implicit_concatenated: bool,
}

impl Deref for BytesConstant {
    type Target = [u8];
    fn deref(&self) -> &Self::Target {
        self.value.as_slice()
    }
}
```

## Test Plan

`cargo test`
2023-08-14 13:46:54 +00:00