Maintain synchronicity between the lexer and the parser (#11457)

## Summary

This PR updates the entire parser stack in multiple ways:

### Make the lexer lazy

* https://github.com/astral-sh/ruff/pull/11244
* https://github.com/astral-sh/ruff/pull/11473

Previously, Ruff's lexer would act as an iterator. The parser would
collect all the tokens in a vector first and then process the tokens to
create the syntax tree.

The first task in this project is to update the entire parsing flow to
make the lexer lazy. This includes the `Lexer`, `TokenSource`, and
`Parser`. For context, the `TokenSource` is a wrapper around the `Lexer`
that filters out the trivia tokens[^1]. Now, the parser asks the token
source for the next token, and only then does the lexer continue and
emit it. This means that the lexer needs to be aware of the "current"
token: when `next_token` is called, the current token is updated with
the newly lexed one.
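To make the pull-based flow concrete, here's a minimal, self-contained sketch. The names (`ToyLexer`, `ToyTokenSource`, `ToyTokenKind`) are illustrative stand-ins, not Ruff's actual types; the real `Lexer` emits `Token`s with ranges and flags rather than whitespace-separated words:

```rust
/// Illustrative token kinds; the real lexer emits `TokenKind` values.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ToyTokenKind {
    Word,
    Comment,
    EndOfFile,
}

/// A lazy lexer: nothing is lexed until `next_token` is called, and the
/// most recently emitted token is remembered as the "current" token.
struct ToyLexer<'src> {
    rest: &'src str,
    current: ToyTokenKind,
}

impl<'src> ToyLexer<'src> {
    fn new(source: &'src str) -> Self {
        Self { rest: source, current: ToyTokenKind::EndOfFile }
    }

    fn next_token(&mut self) -> ToyTokenKind {
        self.rest = self.rest.trim_start();
        self.current = if self.rest.is_empty() {
            ToyTokenKind::EndOfFile
        } else if let Some(comment) = self.rest.strip_prefix('#') {
            // A comment runs to the end of the line.
            self.rest = comment.split_once('\n').map_or("", |(_, rest)| rest);
            ToyTokenKind::Comment
        } else {
            let end = self.rest.find(char::is_whitespace).unwrap_or(self.rest.len());
            self.rest = &self.rest[end..];
            ToyTokenKind::Word
        };
        self.current
    }
}

/// Mirrors `TokenSource`: wraps the lexer and skips trivia, so the parser
/// only ever sees significant tokens, pulled one at a time.
struct ToyTokenSource<'src> {
    lexer: ToyLexer<'src>,
}

impl<'src> ToyTokenSource<'src> {
    fn next_token(&mut self) -> ToyTokenKind {
        loop {
            let kind = self.lexer.next_token();
            if kind != ToyTokenKind::Comment {
                return kind;
            }
        }
    }
}

fn main() {
    let mut source = ToyTokenSource { lexer: ToyLexer::new("x  # comment\ny") };
    assert_eq!(source.next_token(), ToyTokenKind::Word);      // `x`
    assert_eq!(source.next_token(), ToyTokenKind::Word);      // `y` (comment skipped)
    assert_eq!(source.next_token(), ToyTokenKind::EndOfFile);
}
```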

The main motivation for making the lexer lazy is to allow re-lexing a
token in a different context. This is going to be really useful for
making the parser error resilient. For example, currently the emitted
tokens remain the same even if the parser can recover from an unclosed
parenthesis. This matters because the lexer emits a `NonLogicalNewline`
in a parenthesized context but a normal `Newline` in a non-parenthesized
context. These different kinds of newlines are also used to emit the
indentation tokens, which are important for the parser as they determine
the start and end of a block.
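To make the two newline kinds concrete, here's a hedged sketch using the crate-level API introduced later in this PR (the commented token sequence is indicative, not exhaustive):

```rust
use ruff_python_parser::{parse_unchecked, Mode, TokenKind};

fn main() {
    // Newlines inside the parenthesized call are trivia (`NonLogicalNewline`),
    // while the newline that ends the logical line is a real `Newline` token.
    let parsed = parse_unchecked("f(\n    1,\n)\n", Mode::Module);
    let kinds: Vec<TokenKind> = parsed.tokens().iter().map(|token| token.kind()).collect();
    // Roughly: Name, Lpar, NonLogicalNewline, Int, Comma, NonLogicalNewline, Rpar, Newline, ...
    assert!(kinds.contains(&TokenKind::NonLogicalNewline));
    assert!(kinds.contains(&TokenKind::Newline));
}
```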

Additionally, this allows us to implement the following functionalities:
1. Checkpoint - rewind infrastructure: The idea here is to create a
checkpoint and continue lexing. At a later point, this checkpoint can be
used to rewind the lexer back to the provided checkpoint (a minimal
sketch follows this list).
2. Remove the `SoftKeywordTransformer` and instead use lookahead or
speculative parsing to determine whether a soft keyword is a keyword or
an identifier.
3. Remove the `Tok` enum. The `Tok` enum represents the tokens emitted
by the lexer, but it contains owned data which makes it expensive to
clone. The new `TokenKind` enum just represents the type of token, which
is very cheap.
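Here's the minimal sketch of the checkpoint - rewind idea referenced in (1); the type and field names are illustrative, not Ruff's internal API. The point is that a lazy lexer only needs to snapshot and restore a small amount of state:

```rust
/// Snapshot of everything the lexer needs to resume from a given point.
#[derive(Clone, Copy)]
struct LexerCheckpoint {
    /// Byte offset of the cursor into the source.
    offset: usize,
    /// Parenthesization depth at the checkpoint (affects newline handling).
    nesting: u32,
}

struct SketchLexer {
    offset: usize,
    nesting: u32,
}

impl SketchLexer {
    fn checkpoint(&self) -> LexerCheckpoint {
        LexerCheckpoint { offset: self.offset, nesting: self.nesting }
    }

    fn rewind(&mut self, checkpoint: LexerCheckpoint) {
        // Because tokens are lexed lazily, rewinding is just restoring the
        // snapshot; there is no buffered token vector to truncate.
        self.offset = checkpoint.offset;
        self.nesting = checkpoint.nesting;
    }
}

fn main() {
    let mut lexer = SketchLexer { offset: 0, nesting: 0 };
    let checkpoint = lexer.checkpoint();
    // ... speculative lexing/parsing happens here, mutating `lexer` ...
    lexer.offset = 42;
    lexer.rewind(checkpoint);
    assert_eq!(lexer.offset, 0);
}
```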

This raises the question of how the parser will get the owned value that
was previously stored on `Tok`. This is solved by introducing a new
`TokenValue` enum which contains only the subset of token kinds that
carry an owned value. It is stored on the lexer and is requested by the
parser when it wants to process the data. For example:
8196720f80/crates/ruff_python_parser/src/parser/expression.rs (L1260-L1262)
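The linked snippet shows the real call site. As a rough, self-contained illustration of the idea (the variant names below are assumptions, not the crate's actual definitions):

```rust
/// Illustrative shape of the owned-value side channel: `TokenKind` stays a
/// cheap `Copy` enum, while the owned data for the current token lives on the
/// lexer and is handed to the parser only when requested.
#[allow(dead_code)]
#[derive(Debug, Default)]
enum SketchTokenValue {
    #[default]
    None,
    Name(Box<str>),
    String(Box<str>),
}

struct SketchLexer {
    current_value: SketchTokenValue,
}

impl SketchLexer {
    /// The parser calls this only when it needs the owned data, e.g. to build
    /// an identifier or string literal node; cheap tokens never allocate.
    fn take_value(&mut self) -> SketchTokenValue {
        std::mem::take(&mut self.current_value)
    }
}

fn main() {
    let mut lexer = SketchLexer {
        current_value: SketchTokenValue::Name("foo".into()),
    };
    // Taking the value leaves `None` behind, so nothing is cloned.
    assert!(matches!(lexer.take_value(), SketchTokenValue::Name(name) if &*name == "foo"));
    assert!(matches!(lexer.current_value, SketchTokenValue::None));
}
```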

[^1]: Trivia tokens are `NonLogicalNewline` and `Comment`.

### Remove `SoftKeywordTransformer`

* https://github.com/astral-sh/ruff/pull/11441
* https://github.com/astral-sh/ruff/pull/11459
* https://github.com/astral-sh/ruff/pull/11442
* https://github.com/astral-sh/ruff/pull/11443
* https://github.com/astral-sh/ruff/pull/11474

For context,
https://github.com/RustPython/RustPython/pull/4519/files#diff-5de40045e78e794aa5ab0b8aacf531aa477daf826d31ca129467703855408220
added support for soft keywords in the parser, using infinite lookahead
to classify a soft keyword as a keyword or an identifier. This is a
brilliant idea because it wraps the existing `Lexer` and works on top of
it, which means that the logic for lexing and re-lexing a soft keyword
remains separate. The change here is to remove the
`SoftKeywordTransformer` and let the parser determine this based on
context, lookahead, and speculative parsing.

* **Context:** The transformer needs to know whether the lexer is at a
statement position or a simple statement position. This is because a
`match` token starts a compound statement while a `type` token starts a
simple statement. **The parser already knows this.**
* **Lookahead:** Now that the parser knows the context, it can perform a
lookahead of up to two tokens to classify the soft keyword. The logic
for this is described in the PRs implementing it for the `type` and
`match` soft keywords (see the sketch after this list).
* **Speculative parsing:** This is where the checkpoint - rewind
infrastructure helps. For the `match` soft keyword, there are certain
cases that we can't classify based on lookahead alone. The idea here is
to create a checkpoint and keep parsing. Based on whether the parsing
was successful and what tokens are ahead, we can classify the remaining
cases. Refer to #11443 for more details.
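As a simplified sketch of the lookahead idea for `type` (the helper and the exact rule are approximations; the authoritative logic lives in the PRs linked above): `type` starts a type alias statement only when it is followed by a name and then `=` or `[`. For `match`, the cases that lookahead can't settle fall back to the checkpoint - rewind infrastructure: take a checkpoint, speculatively parse, and rewind on failure.

```rust
/// Illustrative token kinds; `Other` stands in for anything else (e.g. `Lpar`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind {
    Name,
    Equal,
    Lsqb,
    Other,
}

/// Two-token lookahead: `peek_nth(0)` is the token after `type`, and
/// `peek_nth(1)` is the one after that.
fn type_starts_type_alias(peek_nth: impl Fn(usize) -> Kind) -> bool {
    peek_nth(0) == Kind::Name && matches!(peek_nth(1), Kind::Equal | Kind::Lsqb)
}

fn main() {
    // `type X = int` -> type alias statement.
    let alias = [Kind::Name, Kind::Equal];
    assert!(type_starts_type_alias(|n| alias[n]));

    // `type X[T] = ...` -> also a type alias (generic parameters).
    let generic_alias = [Kind::Name, Kind::Lsqb];
    assert!(type_starts_type_alias(|n| generic_alias[n]));

    // `type(obj)` -> here, `type` is an ordinary identifier.
    let call = [Kind::Other, Kind::Name];
    assert!(!type_starts_type_alias(|n| call[n]));
}
```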

If the soft keyword is being parsed in an identifier context, it'll be
converted to an identifier and the emitted token will be updated as
well. Refer to
8196720f80/crates/ruff_python_parser/src/parser/expression.rs (L487-L491).

The `case` soft keyword doesn't require any special handling because
it'll be a keyword only in the context of a match statement.

### Update the parser API

* https://github.com/astral-sh/ruff/pull/11494
* https://github.com/astral-sh/ruff/pull/11505

Now that the lexer is in sync with the parser, and the parser helps
determine whether a soft keyword is a keyword or an identifier, the
lexer cannot be used on its own. The reason is that the lexer is not
sensitive to the context (which is by design). This means that the
parser API needs to be updated to no longer allow any access to the
lexer.

Previously, there were two ways to parse the source code:
1. Passing the source code itself
2. Passing the tokens

Now that the lexer and parser work together, the API corresponding to
(2) cannot exist. The final API is described in the PR description of
https://github.com/astral-sh/ruff/pull/11494.
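Roughly, the new entry points look like this (a hedged sketch mirroring the doc examples in the diff below; the linked PR description is authoritative):

```rust
use ruff_python_parser::{parse_module, parse_unchecked, Mode};

fn main() {
    // `parse_module` is the checked entry point: it returns `Err` with the
    // first syntax error encountered.
    let module = parse_module("print('hello world')");
    assert!(module.is_ok());

    // `parse_unchecked` always returns the `Parsed` output, keeping any
    // syntax errors alongside the AST and tokens, which is what
    // error-resilient tooling wants.
    let parsed = parse_unchecked("from x import (a, b", Mode::Module);
    assert!(!parsed.is_valid());
    assert!(!parsed.errors().is_empty());
}
```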

### Refactor the downstream tools (linter and formatter)

* https://github.com/astral-sh/ruff/pull/11511
* https://github.com/astral-sh/ruff/pull/11515
* https://github.com/astral-sh/ruff/pull/11529
* https://github.com/astral-sh/ruff/pull/11562
* https://github.com/astral-sh/ruff/pull/11592

And, the final set of changes involves updating all references to the
lexer and the `Tok` enum. This was done in two parts:
1. Update all the references in a way that doesn't require any changes
from this PR, i.e., it can be done independently
	* https://github.com/astral-sh/ruff/pull/11402
	* https://github.com/astral-sh/ruff/pull/11406
	* https://github.com/astral-sh/ruff/pull/11418
	* https://github.com/astral-sh/ruff/pull/11419
	* https://github.com/astral-sh/ruff/pull/11420
	* https://github.com/astral-sh/ruff/pull/11424
2. Update all the remaining references to use the changes made in this
PR

For (2), there were various strategies used:
1. Introduce a new `Tokens` struct which wraps the token vector and adds
methods to query certain subsets of tokens (see the usage sketch after
this list). These include:
	1. `up_to_first_unknown`, which replaces the `tokenize` function
	2. `in_range` and `after`, which replace the `lex_starts_at` function;
the former returns the tokens within the given range while the latter
returns all the tokens after the given offset
2. Introduce a new `TokenFlags`, a set of flags to query certain
information from a token. Currently, this information is limited to
string-type tokens but can be expanded to include other information in
the future as needed. https://github.com/astral-sh/ruff/pull/11578
3. Move the `CommentRanges` to the parsed output because this
information is common to both the linter and the formatter. This removes
the need for the `tokens_and_ranges` function.
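A hedged usage sketch of the new queries, based on the API in the diff below (the offsets assume the token layout of this particular snippet):

```rust
use ruff_python_parser::{parse_unchecked, Mode, TokenKind};
use ruff_text_size::{TextRange, TextSize};

fn main() {
    let parsed = parse_unchecked("def foo():\n    pass\n", Mode::Module);
    let tokens = parsed.tokens();

    // The well-formed prefix: everything up to (and excluding) the first
    // `Unknown` token, replacing the old `tokenize` function.
    let valid = tokens.up_to_first_unknown();
    assert!(valid.iter().all(|token| token.kind() != TokenKind::Unknown));

    // Tokens starting at or after an offset, and tokens within a range; both
    // offsets must fall on token boundaries or in the gaps between tokens.
    let after = tokens.after(TextSize::new(10));
    let in_range = tokens.in_range(TextRange::new(TextSize::new(4), TextSize::new(10)));
    assert!(!after.is_empty());
    assert!(!in_range.is_empty());

    // Comment ranges now come from the parsed output directly.
    let _comments = parsed.comment_ranges();
}
```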

## Test Plan

- [x] Update and verify the test snapshots
- [x] Make sure the entire test suite is passing
- [x] Make sure there are no changes in the ecosystem checks
- [x] Run the fuzzer on the parser
- [x] Run this change on dozens of open-source projects

### Running this change on dozens of open-source projects

Refer to the PR description to get the list of open source projects used
for testing.

The following comparisons were made between `main` and this branch:
1. Compare the output of `--select=E999` (syntax errors)
2. Compare the output of the default rule selection
3. Compare the output of `--select=ALL`

**Conclusion: all outputs were identical.**

## What's next?

The next step is to introduce re-lexing logic and update the parser to
feed the recovery information to the lexer so that it can emit the
correct tokens. This moves us one step closer to having error resilience
in the parser and gives Ruff the ability to lint even if the source code
contains syntax errors.
Commit bf5b62edac (parent c69a789aa5) by Dhruv Manilawala, committed 2024-06-03 18:23:50 +05:30 via GitHub.
262 changed files with 8174 additions and 6132 deletions.


@@ -57,81 +57,37 @@
//!
//! - token: This module contains the definition of the tokens that are generated by the lexer.
//! - [lexer]: This module contains the lexer and is responsible for generating the tokens.
//! - parser: This module contains an interface to the [Program] and is responsible for generating the AST.
//! - parser: This module contains an interface to the [Parsed] and is responsible for generating the AST.
//! - mode: This module contains the definition of the different modes that the `ruff_python_parser` can be in.
//!
//! # Examples
//!
//! For example, to get a stream of tokens from a given string, one could do this:
//!
//! ```
//! use ruff_python_parser::{lexer::lex, Mode};
//!
//! let python_source = r#"
//! def is_odd(i):
//! return bool(i & 1)
//! "#;
//! let mut tokens = lex(python_source, Mode::Module);
//! assert!(tokens.all(|t| t.is_ok()));
//! ```
//!
//! These tokens can be directly fed into the `ruff_python_parser` to generate an AST:
//!
//! ```
//! use ruff_python_parser::lexer::lex;
//! use ruff_python_parser::{Mode, parse_tokens};
//!
//! let python_source = r#"
//! def is_odd(i):
//! return bool(i & 1)
//! "#;
//! let tokens = lex(python_source, Mode::Module);
//! let ast = parse_tokens(tokens.collect(), python_source, Mode::Module);
//!
//! assert!(ast.is_ok());
//! ```
//!
//! Alternatively, you can use one of the other `parse_*` functions to parse a string directly without using a specific
//! mode or tokenizing the source beforehand:
//!
//! ```
//! use ruff_python_parser::parse_suite;
//!
//! let python_source = r#"
//! def is_odd(i):
//! return bool(i & 1)
//! "#;
//! let ast = parse_suite(python_source);
//!
//! assert!(ast.is_ok());
//! ```
//!
//! [lexical analysis]: https://en.wikipedia.org/wiki/Lexical_analysis
//! [parsing]: https://en.wikipedia.org/wiki/Parsing
//! [lexer]: crate::lexer
use std::iter::FusedIterator;
use std::cell::OnceCell;
use std::ops::Deref;
use ruff_python_ast::{Expr, Mod, ModModule, PySourceType, Suite};
use ruff_text_size::{TextRange, TextSize};
pub use crate::error::{FStringErrorType, ParseError, ParseErrorType};
use crate::lexer::{lex, lex_starts_at, LexResult};
pub use crate::parser::Program;
pub use crate::token::{Tok, TokenKind};
pub use crate::lexer::Token;
pub use crate::token::TokenKind;
use crate::parser::Parser;
use itertools::Itertools;
use ruff_python_ast::{Expr, Mod, ModExpression, ModModule, PySourceType, Suite};
use ruff_python_trivia::CommentRanges;
use ruff_text_size::{Ranged, TextRange, TextSize};
mod error;
pub mod lexer;
mod parser;
mod soft_keywords;
mod string;
mod token;
mod token_set;
mod token_source;
pub mod typing;
/// Parse a full Python program usually consisting of multiple lines.
/// Parse a full Python module usually consisting of multiple lines.
///
/// This is a convenience function that can be used to parse a full Python program without having to
/// specify the [`Mode`] or the location. It is probably what you want to use most of the time.
@@ -141,7 +97,7 @@ pub mod typing;
/// For example, parsing a simple function definition and a call to that function:
///
/// ```
/// use ruff_python_parser::parse_program;
/// use ruff_python_parser::parse_module;
///
/// let source = r#"
/// def foo():
@@ -150,41 +106,15 @@ pub mod typing;
/// print(foo())
/// "#;
///
/// let program = parse_program(source);
/// assert!(program.is_ok());
/// let module = parse_module(source);
/// assert!(module.is_ok());
/// ```
pub fn parse_program(source: &str) -> Result<ModModule, ParseError> {
let lexer = lex(source, Mode::Module);
match parse_tokens(lexer.collect(), source, Mode::Module)? {
Mod::Module(m) => Ok(m),
Mod::Expression(_) => unreachable!("Mode::Module doesn't return other variant"),
}
}
/// Parse a full Python program into a [`Suite`].
///
/// This function is similar to [`parse_program`] except that it returns the module body
/// instead of the module itself.
///
/// # Example
///
/// For example, parsing a simple function definition and a call to that function:
///
/// ```
/// use ruff_python_parser::parse_suite;
///
/// let source = r#"
/// def foo():
/// return 42
///
/// print(foo())
/// "#;
///
/// let body = parse_suite(source);
/// assert!(body.is_ok());
/// ```
pub fn parse_suite(source: &str) -> Result<Suite, ParseError> {
parse_program(source).map(|m| m.body)
pub fn parse_module(source: &str) -> Result<Parsed<ModModule>, ParseError> {
Parser::new(source, Mode::Module)
.parse()
.try_into_module()
.unwrap()
.into_result()
}
/// Parses a single Python expression.
@@ -202,37 +132,40 @@ pub fn parse_suite(source: &str) -> Result<Suite, ParseError> {
/// let expr = parse_expression("1 + 2");
/// assert!(expr.is_ok());
/// ```
pub fn parse_expression(source: &str) -> Result<Expr, ParseError> {
let lexer = lex(source, Mode::Expression).collect();
match parse_tokens(lexer, source, Mode::Expression)? {
Mod::Expression(expression) => Ok(*expression.body),
Mod::Module(_m) => unreachable!("Mode::Expression doesn't return other variant"),
}
pub fn parse_expression(source: &str) -> Result<Parsed<ModExpression>, ParseError> {
Parser::new(source, Mode::Expression)
.parse()
.try_into_expression()
.unwrap()
.into_result()
}
/// Parses a Python expression from a given location.
/// Parses a Python expression for the given range in the source.
///
/// This function allows to specify the location of the expression in the source code, other than
/// This function allows to specify the range of the expression in the source code, other than
/// that, it behaves exactly like [`parse_expression`].
///
/// # Example
///
/// Parsing a single expression denoting the addition of two numbers, but this time specifying a different,
/// somewhat silly, location:
/// Parsing one of the numeric literal which is part of an addition expression:
///
/// ```
/// use ruff_python_parser::parse_expression_starts_at;
/// # use ruff_text_size::TextSize;
/// use ruff_python_parser::parse_expression_range;
/// # use ruff_text_size::{TextRange, TextSize};
///
/// let expr = parse_expression_starts_at("1 + 2", TextSize::from(400));
/// assert!(expr.is_ok());
/// let parsed = parse_expression_range("11 + 22 + 33", TextRange::new(TextSize::new(5), TextSize::new(7)));
/// assert!(parsed.is_ok());
/// ```
pub fn parse_expression_starts_at(source: &str, offset: TextSize) -> Result<Expr, ParseError> {
let lexer = lex_starts_at(source, Mode::Module, offset).collect();
match parse_tokens(lexer, source, Mode::Expression)? {
Mod::Expression(expression) => Ok(*expression.body),
Mod::Module(_m) => unreachable!("Mode::Expression doesn't return other variant"),
}
pub fn parse_expression_range(
source: &str,
range: TextRange,
) -> Result<Parsed<ModExpression>, ParseError> {
let source = &source[..range.end().to_usize()];
Parser::new_starts_at(source, Mode::Expression, range.start())
.parse()
.try_into_expression()
.unwrap()
.into_result()
}
/// Parse the given Python source code using the specified [`Mode`].
@@ -249,8 +182,8 @@ pub fn parse_expression_starts_at(source: &str, offset: TextSize) -> Result<Expr
/// ```
/// use ruff_python_parser::{Mode, parse};
///
/// let expr = parse("1 + 2", Mode::Expression);
/// assert!(expr.is_ok());
/// let parsed = parse("1 + 2", Mode::Expression);
/// assert!(parsed.is_ok());
/// ```
///
/// Alternatively, we can parse a full Python program consisting of multiple lines:
@@ -264,8 +197,8 @@ pub fn parse_expression_starts_at(source: &str, offset: TextSize) -> Result<Expr
/// def greet(self):
/// print("Hello, world!")
/// "#;
/// let program = parse(source, Mode::Module);
/// assert!(program.is_ok());
/// let parsed = parse(source, Mode::Module);
/// assert!(parsed.is_ok());
/// ```
///
/// Additionally, we can parse a Python program containing IPython escapes:
@@ -278,189 +211,308 @@ pub fn parse_expression_starts_at(source: &str, offset: TextSize) -> Result<Expr
/// ?str.replace
/// !ls
/// "#;
/// let program = parse(source, Mode::Ipython);
/// assert!(program.is_ok());
/// let parsed = parse(source, Mode::Ipython);
/// assert!(parsed.is_ok());
/// ```
pub fn parse(source: &str, mode: Mode) -> Result<Mod, ParseError> {
let lxr = lexer::lex(source, mode);
parse_tokens(lxr.collect(), source, mode)
pub fn parse(source: &str, mode: Mode) -> Result<Parsed<Mod>, ParseError> {
parse_unchecked(source, mode).into_result()
}
/// Parse the given Python source code using the specified [`Mode`] and [`TextSize`].
/// Parse the given Python source code using the specified [`Mode`].
///
/// This function allows to specify the location of the source code, other than
/// that, it behaves exactly like [`parse`].
///
/// # Example
///
/// ```
/// # use ruff_text_size::TextSize;
/// use ruff_python_parser::{Mode, parse_starts_at};
///
/// let source = r#"
/// def fib(i):
/// a, b = 0, 1
/// for _ in range(i):
/// a, b = b, a + b
/// return a
///
/// print(fib(42))
/// "#;
/// let program = parse_starts_at(source, Mode::Module, TextSize::from(0));
/// assert!(program.is_ok());
/// ```
pub fn parse_starts_at(source: &str, mode: Mode, offset: TextSize) -> Result<Mod, ParseError> {
let lxr = lexer::lex_starts_at(source, mode, offset);
parse_tokens(lxr.collect(), source, mode)
/// This is same as the [`parse`] function except that it doesn't check for any [`ParseError`]
/// and returns the [`Parsed`] as is.
pub fn parse_unchecked(source: &str, mode: Mode) -> Parsed<Mod> {
Parser::new(source, mode).parse()
}
/// Parse an iterator of [`LexResult`]s using the specified [`Mode`].
///
/// This could allow you to perform some preprocessing on the tokens before parsing them.
///
/// # Example
///
/// As an example, instead of parsing a string, we can parse a list of tokens after we generate
/// them using the [`lexer::lex`] function:
///
/// ```
/// use ruff_python_parser::lexer::lex;
/// use ruff_python_parser::{Mode, parse_tokens};
///
/// let source = "1 + 2";
/// let tokens = lex(source, Mode::Expression);
/// let expr = parse_tokens(tokens.collect(), source, Mode::Expression);
/// assert!(expr.is_ok());
/// ```
pub fn parse_tokens(tokens: Vec<LexResult>, source: &str, mode: Mode) -> Result<Mod, ParseError> {
let program = Program::parse_tokens(source, tokens, mode);
if program.is_valid() {
Ok(program.into_ast())
} else {
Err(program.into_errors().into_iter().next().unwrap())
/// Parse the given Python source code using the specified [`PySourceType`].
pub fn parse_unchecked_source(source: &str, source_type: PySourceType) -> Parsed<ModModule> {
// SAFETY: Safe because `PySourceType` always parses to a `ModModule`
Parser::new(source, source_type.as_mode())
.parse()
.try_into_module()
.unwrap()
}
/// Represents the parsed source code.
#[derive(Debug, Clone)]
pub struct Parsed<T> {
syntax: T,
tokens: Tokens,
errors: Vec<ParseError>,
comment_ranges: CommentRanges,
}
impl<T> Parsed<T> {
/// Returns the syntax node represented by this parsed output.
pub fn syntax(&self) -> &T {
&self.syntax
}
/// Returns all the tokens for the parsed output.
pub fn tokens(&self) -> &Tokens {
&self.tokens
}
/// Returns a list of syntax errors found during parsing.
pub fn errors(&self) -> &[ParseError] {
&self.errors
}
/// Returns the comment ranges for the parsed output.
pub fn comment_ranges(&self) -> &CommentRanges {
&self.comment_ranges
}
/// Consumes the [`Parsed`] output and returns the contained syntax node.
pub fn into_syntax(self) -> T {
self.syntax
}
/// Consumes the [`Parsed`] output and returns a list of syntax errors found during parsing.
pub fn into_errors(self) -> Vec<ParseError> {
self.errors
}
/// Returns `true` if the parsed source code is valid i.e., it has no syntax errors.
pub fn is_valid(&self) -> bool {
self.errors.is_empty()
}
/// Returns the [`Parsed`] output as a [`Result`], returning [`Ok`] if it has no syntax errors,
/// or [`Err`] containing the first [`ParseError`] encountered.
pub fn as_result(&self) -> Result<&Parsed<T>, &ParseError> {
if let [error, ..] = self.errors() {
Err(error)
} else {
Ok(self)
}
}
/// Consumes the [`Parsed`] output and returns a [`Result`] which is [`Ok`] if it has no syntax
/// errors, or [`Err`] containing the first [`ParseError`] encountered.
pub(crate) fn into_result(self) -> Result<Parsed<T>, ParseError> {
if self.is_valid() {
Ok(self)
} else {
Err(self.into_errors().into_iter().next().unwrap())
}
}
}
/// Tokens represents a vector of [`LexResult`].
///
/// This should only include tokens up to and including the first error. This struct is created
/// by the [`tokenize`] function.
impl Parsed<Mod> {
/// Attempts to convert the [`Parsed<Mod>`] into a [`Parsed<ModModule>`].
///
/// This method checks if the `syntax` field of the output is a [`Mod::Module`]. If it is, the
/// method returns [`Some(Parsed<ModModule>)`] with the contained module. Otherwise, it
/// returns [`None`].
///
/// [`Some(Parsed<ModModule>)`]: Some
fn try_into_module(self) -> Option<Parsed<ModModule>> {
match self.syntax {
Mod::Module(module) => Some(Parsed {
syntax: module,
tokens: self.tokens,
errors: self.errors,
comment_ranges: self.comment_ranges,
}),
Mod::Expression(_) => None,
}
}
/// Attempts to convert the [`Parsed<Mod>`] into a [`Parsed<ModExpression>`].
///
/// This method checks if the `syntax` field of the output is a [`Mod::Expression`]. If it is,
/// the method returns [`Some(Parsed<ModExpression>)`] with the contained expression.
/// Otherwise, it returns [`None`].
///
/// [`Some(Parsed<ModExpression>)`]: Some
fn try_into_expression(self) -> Option<Parsed<ModExpression>> {
match self.syntax {
Mod::Module(_) => None,
Mod::Expression(expression) => Some(Parsed {
syntax: expression,
tokens: self.tokens,
errors: self.errors,
comment_ranges: self.comment_ranges,
}),
}
}
}
impl Parsed<ModModule> {
/// Returns the module body contained in this parsed output as a [`Suite`].
pub fn suite(&self) -> &Suite {
&self.syntax.body
}
/// Consumes the [`Parsed`] output and returns the module body as a [`Suite`].
pub fn into_suite(self) -> Suite {
self.syntax.body
}
}
impl Parsed<ModExpression> {
/// Returns the expression contained in this parsed output.
pub fn expr(&self) -> &Expr {
&self.syntax.body
}
/// Consumes the [`Parsed`] output and returns the contained [`Expr`].
pub fn into_expr(self) -> Expr {
*self.syntax.body
}
}
/// Tokens represents a vector of lexed [`Token`].
#[derive(Debug, Clone)]
pub struct Tokens(Vec<LexResult>);
pub struct Tokens {
raw: Vec<Token>,
/// Index of the first [`TokenKind::Unknown`] token or the length of the token vector.
first_unknown_or_len: OnceCell<usize>,
}
impl Tokens {
/// Returns an iterator over the [`TokenKind`] and the range corresponding to the tokens.
pub fn kinds(&self) -> TokenKindIter {
TokenKindIter::new(&self.0)
pub(crate) fn new(tokens: Vec<Token>) -> Tokens {
Tokens {
raw: tokens,
first_unknown_or_len: OnceCell::new(),
}
}
/// Consumes the [`Tokens`], returning the underlying vector of [`LexResult`].
pub fn into_inner(self) -> Vec<LexResult> {
self.0
/// Returns a slice of tokens up to (and excluding) the first [`TokenKind::Unknown`] token or
/// all the tokens if there is none.
pub fn up_to_first_unknown(&self) -> &[Token] {
let end = *self.first_unknown_or_len.get_or_init(|| {
self.raw
.iter()
.find_position(|token| token.kind() == TokenKind::Unknown)
.map(|(idx, _)| idx)
.unwrap_or_else(|| self.raw.len())
});
&self.raw[..end]
}
/// Returns a slice of [`Token`] that are within the given `range`.
///
/// The start and end offset of the given range should be either:
/// 1. Token boundary
/// 2. Gap between the tokens
///
/// For example, considering the following tokens and their corresponding range:
///
/// | Token | Range |
/// |---------------------|-----------|
/// | `Def` | `0..3` |
/// | `Name` | `4..7` |
/// | `Lpar` | `7..8` |
/// | `Rpar` | `8..9` |
/// | `Colon` | `9..10` |
/// | `Newline` | `10..11` |
/// | `Comment` | `15..24` |
/// | `NonLogicalNewline` | `24..25` |
/// | `Indent` | `25..29` |
/// | `Pass` | `29..33` |
///
/// Here, for (1) a token boundary is considered either the start or end offset of any of the
/// above tokens. For (2), the gap would be any offset between the `Newline` and `Comment`
/// token which are 12, 13, and 14.
///
/// Examples:
/// 1) `4..10` would give `Name`, `Lpar`, `Rpar`, `Colon`
/// 2) `11..25` would give `Comment`, `NonLogicalNewline`
/// 3) `12..25` would give same as (2) and offset 12 is in the "gap"
/// 4) `9..12` would give `Colon`, `Newline` and offset 12 is in the "gap"
/// 5) `18..27` would panic because both the start and end offset is within a token
///
/// ## Note
///
/// The returned slice can contain the [`TokenKind::Unknown`] token if there was a lexical
/// error encountered within the given range.
///
/// # Panics
///
/// If either the start or end offset of the given range is within a token range.
pub fn in_range(&self, range: TextRange) -> &[Token] {
let tokens_after_start = self.after(range.start());
match tokens_after_start.binary_search_by_key(&range.end(), Ranged::end) {
Ok(idx) => {
// If we found the token with the end offset, that token should be included in the
// return slice.
&tokens_after_start[..=idx]
}
Err(idx) => {
if let Some(token) = tokens_after_start.get(idx) {
// If it's equal to the start offset, then it's at a token boundary which is
// valid. If it's less than the start offset, then it's in the gap between the
// tokens which is valid as well.
assert!(
range.end() <= token.start(),
"End offset {:?} is inside a token range {:?}",
range.end(),
token.range()
);
}
// This index is where the token with the offset _could_ be, so that token should
// be excluded from the return slice.
&tokens_after_start[..idx]
}
}
}
/// Returns a slice of tokens after the given [`TextSize`] offset.
///
/// If the given offset is between two tokens, the returned slice will start from the following
/// token. In other words, if the offset is between the end of previous token and start of next
/// token, the returned slice will start from the next token.
///
/// # Panics
///
/// If the given offset is inside a token range.
pub fn after(&self, offset: TextSize) -> &[Token] {
match self.binary_search_by(|token| token.start().cmp(&offset)) {
Ok(idx) => &self[idx..],
Err(idx) => {
// We can't use `saturating_sub` here because a file could contain a BOM header, in
// which case the token starts at offset 3 for UTF-8 encoded file content.
if idx > 0 {
if let Some(prev) = self.get(idx - 1) {
// If it's equal to the end offset, then it's at a token boundary which is
// valid. If it's greater than the end offset, then it's in the gap between
// the tokens which is valid as well.
assert!(
offset >= prev.end(),
"Offset {:?} is inside a token range {:?}",
offset,
prev.range()
);
}
}
&self[idx..]
}
}
}
}
impl<'a> IntoIterator for &'a Tokens {
type Item = &'a Token;
type IntoIter = std::slice::Iter<'a, Token>;
fn into_iter(self) -> Self::IntoIter {
self.iter()
}
}
impl Deref for Tokens {
type Target = [LexResult];
type Target = [Token];
fn deref(&self) -> &Self::Target {
&self.0
}
}
/// An iterator over the [`TokenKind`] and the corresponding range.
///
/// This struct is created by the [`Tokens::kinds`] method.
#[derive(Clone, Default)]
pub struct TokenKindIter<'a> {
inner: std::iter::Flatten<std::slice::Iter<'a, LexResult>>,
}
impl<'a> TokenKindIter<'a> {
/// Create a new iterator from a slice of [`LexResult`].
pub fn new(tokens: &'a [LexResult]) -> Self {
Self {
inner: tokens.iter().flatten(),
}
}
/// Return the next value without advancing the iterator.
pub fn peek(&mut self) -> Option<(TokenKind, TextRange)> {
self.clone().next()
}
}
impl Iterator for TokenKindIter<'_> {
type Item = (TokenKind, TextRange);
fn next(&mut self) -> Option<Self::Item> {
let &(ref tok, range) = self.inner.next()?;
Some((TokenKind::from_token(tok), range))
}
}
impl FusedIterator for TokenKindIter<'_> {}
impl DoubleEndedIterator for TokenKindIter<'_> {
fn next_back(&mut self) -> Option<Self::Item> {
let &(ref tok, range) = self.inner.next_back()?;
Some((TokenKind::from_token(tok), range))
}
}
/// Collect tokens up to and including the first error.
pub fn tokenize(contents: &str, mode: Mode) -> Tokens {
let mut tokens: Vec<LexResult> = allocate_tokens_vec(contents);
for tok in lexer::lex(contents, mode) {
let is_err = tok.is_err();
tokens.push(tok);
if is_err {
break;
}
}
Tokens(tokens)
}
/// Tokenizes all tokens.
///
/// It differs from [`tokenize`] in that it tokenizes all tokens and doesn't stop
/// after the first `Err`.
pub fn tokenize_all(contents: &str, mode: Mode) -> Vec<LexResult> {
let mut tokens = allocate_tokens_vec(contents);
for token in lexer::lex(contents, mode) {
tokens.push(token);
}
tokens
}
/// Allocates a [`Vec`] with an approximated capacity to fit all tokens
/// of `contents`.
///
/// See [#9546](https://github.com/astral-sh/ruff/pull/9546) for a more detailed explanation.
pub fn allocate_tokens_vec(contents: &str) -> Vec<LexResult> {
Vec::with_capacity(approximate_tokens_lower_bound(contents))
}
/// Approximates the number of tokens when lexing `contents`.
fn approximate_tokens_lower_bound(contents: &str) -> usize {
contents.len().saturating_mul(15) / 100
}
/// Parse a full Python program from its tokens.
pub fn parse_program_tokens(
tokens: Tokens,
source: &str,
is_jupyter_notebook: bool,
) -> anyhow::Result<Suite, ParseError> {
let mode = if is_jupyter_notebook {
Mode::Ipython
} else {
Mode::Module
};
match parse_tokens(tokens.into_inner(), source, mode)? {
Mod::Module(m) => Ok(m.body),
Mod::Expression(_) => unreachable!("Mode::Module doesn't return other variant"),
&self.raw
}
}
@@ -529,3 +581,174 @@ impl std::fmt::Display for ModeParseError {
write!(f, r#"mode must be "exec", "eval", "ipython", or "single""#)
}
}
#[cfg(test)]
mod tests {
use std::ops::Range;
use crate::lexer::TokenFlags;
use super::*;
/// Test case containing a "gap" between two tokens.
///
/// Code: <https://play.ruff.rs/a3658340-6df8-42c5-be80-178744bf1193>
const TEST_CASE_WITH_GAP: [(TokenKind, Range<u32>); 10] = [
(TokenKind::Def, 0..3),
(TokenKind::Name, 4..7),
(TokenKind::Lpar, 7..8),
(TokenKind::Rpar, 8..9),
(TokenKind::Colon, 9..10),
(TokenKind::Newline, 10..11),
// Gap ||..||
(TokenKind::Comment, 15..24),
(TokenKind::NonLogicalNewline, 24..25),
(TokenKind::Indent, 25..29),
(TokenKind::Pass, 29..33),
// No newline at the end to keep the token set full of unique tokens
];
/// Test case containing [`TokenKind::Unknown`] token.
///
/// Code: <https://play.ruff.rs/ea722760-9bf5-4d00-be9f-dc441793f88e>
const TEST_CASE_WITH_UNKNOWN: [(TokenKind, Range<u32>); 5] = [
(TokenKind::Name, 0..1),
(TokenKind::Equal, 2..3),
(TokenKind::Unknown, 4..11),
(TokenKind::Plus, 11..12),
(TokenKind::Int, 13..14),
// No newline at the end to keep the token set full of unique tokens
];
/// Helper function to create [`Tokens`] from an iterator of (kind, range).
fn new_tokens(tokens: impl Iterator<Item = (TokenKind, Range<u32>)>) -> Tokens {
Tokens::new(
tokens
.map(|(kind, range)| {
Token::new(
kind,
TextRange::new(TextSize::new(range.start), TextSize::new(range.end)),
TokenFlags::empty(),
)
})
.collect(),
)
}
#[test]
fn tokens_up_to_first_unknown_empty() {
let tokens = Tokens::new(vec![]);
assert_eq!(tokens.up_to_first_unknown(), &[]);
}
#[test]
fn tokens_up_to_first_unknown_noop() {
let tokens = new_tokens(TEST_CASE_WITH_GAP.into_iter());
let up_to_first_unknown = tokens.up_to_first_unknown();
assert_eq!(up_to_first_unknown.len(), tokens.len());
}
#[test]
fn tokens_up_to_first_unknown() {
let tokens = new_tokens(TEST_CASE_WITH_UNKNOWN.into_iter());
let up_to_first_unknown = tokens.up_to_first_unknown();
assert_eq!(up_to_first_unknown.len(), 2);
}
#[test]
fn tokens_after_offset_at_token_start() {
let tokens = new_tokens(TEST_CASE_WITH_GAP.into_iter());
let after = tokens.after(TextSize::new(8));
assert_eq!(after.len(), 7);
assert_eq!(after.first().unwrap().kind(), TokenKind::Rpar);
}
#[test]
fn tokens_after_offset_at_token_end() {
let tokens = new_tokens(TEST_CASE_WITH_GAP.into_iter());
let after = tokens.after(TextSize::new(11));
assert_eq!(after.len(), 4);
assert_eq!(after.first().unwrap().kind(), TokenKind::Comment);
}
#[test]
fn tokens_after_offset_between_tokens() {
let tokens = new_tokens(TEST_CASE_WITH_GAP.into_iter());
let after = tokens.after(TextSize::new(13));
assert_eq!(after.len(), 4);
assert_eq!(after.first().unwrap().kind(), TokenKind::Comment);
}
#[test]
fn tokens_after_offset_at_last_token_end() {
let tokens = new_tokens(TEST_CASE_WITH_GAP.into_iter());
let after = tokens.after(TextSize::new(33));
assert_eq!(after.len(), 0);
}
#[test]
#[should_panic(expected = "Offset 5 is inside a token range 4..7")]
fn tokens_after_offset_inside_token() {
let tokens = new_tokens(TEST_CASE_WITH_GAP.into_iter());
tokens.after(TextSize::new(5));
}
#[test]
fn tokens_in_range_at_token_offset() {
let tokens = new_tokens(TEST_CASE_WITH_GAP.into_iter());
let in_range = tokens.in_range(TextRange::new(4.into(), 10.into()));
assert_eq!(in_range.len(), 4);
assert_eq!(in_range.first().unwrap().kind(), TokenKind::Name);
assert_eq!(in_range.last().unwrap().kind(), TokenKind::Colon);
}
#[test]
fn tokens_in_range_start_offset_at_token_end() {
let tokens = new_tokens(TEST_CASE_WITH_GAP.into_iter());
let in_range = tokens.in_range(TextRange::new(11.into(), 29.into()));
assert_eq!(in_range.len(), 3);
assert_eq!(in_range.first().unwrap().kind(), TokenKind::Comment);
assert_eq!(in_range.last().unwrap().kind(), TokenKind::Indent);
}
#[test]
fn tokens_in_range_end_offset_at_token_start() {
let tokens = new_tokens(TEST_CASE_WITH_GAP.into_iter());
let in_range = tokens.in_range(TextRange::new(8.into(), 15.into()));
assert_eq!(in_range.len(), 3);
assert_eq!(in_range.first().unwrap().kind(), TokenKind::Rpar);
assert_eq!(in_range.last().unwrap().kind(), TokenKind::Newline);
}
#[test]
fn tokens_in_range_start_offset_between_tokens() {
let tokens = new_tokens(TEST_CASE_WITH_GAP.into_iter());
let in_range = tokens.in_range(TextRange::new(13.into(), 29.into()));
assert_eq!(in_range.len(), 3);
assert_eq!(in_range.first().unwrap().kind(), TokenKind::Comment);
assert_eq!(in_range.last().unwrap().kind(), TokenKind::Indent);
}
#[test]
fn tokens_in_range_end_offset_between_tokens() {
let tokens = new_tokens(TEST_CASE_WITH_GAP.into_iter());
let in_range = tokens.in_range(TextRange::new(9.into(), 13.into()));
assert_eq!(in_range.len(), 2);
assert_eq!(in_range.first().unwrap().kind(), TokenKind::Colon);
assert_eq!(in_range.last().unwrap().kind(), TokenKind::Newline);
}
#[test]
#[should_panic(expected = "Offset 5 is inside a token range 4..7")]
fn tokens_in_range_start_offset_inside_token() {
let tokens = new_tokens(TEST_CASE_WITH_GAP.into_iter());
tokens.in_range(TextRange::new(5.into(), 10.into()));
}
#[test]
#[should_panic(expected = "End offset 6 is inside a token range 4..7")]
fn tokens_in_range_end_offset_inside_token() {
let tokens = new_tokens(TEST_CASE_WITH_GAP.into_iter());
tokens.in_range(TextRange::new(0.into(), 6.into()));
}
}