Always use identifier ranges to store bindings (#5110)

## Summary

At present, when we store a binding, we include a `TextRange` alongside
it. The `TextRange` _sometimes_ matches the exact range of the
identifier to which the `Binding` is linked, but... not always.

For example, given:

```python
x = 1
```

The binding we create _will_ use the range of `x`, because the left-hand
side is an `Expr::Name`, which has a valid range on it.
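
As a rough sketch (not ruff's actual binding code; `name_target_range` is a hypothetical helper), a name target carries its own range, so nothing extra is needed:

```rust
use ruff_text_size::TextRange;
use rustpython_parser::ast::{Expr, Ranged};

/// Hypothetical helper: return the binding range for an assignment target,
/// if the target is a plain name expression.
fn name_target_range(target: &Expr) -> Option<TextRange> {
    if matches!(target, Expr::Name(_)) {
        // `Ranged::range` yields the exact range of `x` in `x = 1`.
        Some(target.range())
    } else {
        None
    }
}
```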

However, given:

```python
try:
  pass
except ValueError as e:
  pass
```

When we create a binding for `e`, we don't have a `TextRange`... The AST
doesn't give us one. So we end up extracting it via lexing.
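
The pre-existing approach, condensed here from the `excepthandler_name_range` helper removed in the diff below (the `name_after_as` wrapper itself is hypothetical), lexes the source between the exception type and the handler body and takes the name token that follows `as`:

```rust
use itertools::Itertools;
use ruff_text_size::{TextRange, TextSize};
use rustpython_parser::{lexer, Mode, Tok};

/// `contents` is the source between the exception type and the handler body,
/// and `start` is that slice's offset within the file.
fn name_after_as(contents: &str, start: TextSize) -> Option<TextRange> {
    lexer::lex_starts_at(contents, Mode::Module, start)
        .flatten()
        .tuple_windows()
        // Find the `as` keyword, then return the range of the name after it.
        .find(|(tok, next_tok)| {
            matches!(tok.0, Tok::As) && matches!(next_tok.0, Tok::Name { .. })
        })
        .map(|(_, (_, range))| range)
}
```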

This PR extends that pattern to the rest of the binding kinds, to ensure
that whenever we create a binding, we always use the range of the bound
name. This leads to better diagnostics in cases like pattern matching,
where the diagnostic for "unused variable `x`" in the example below used
to include `*x` instead of just `x`:

```python
def f(provided: int) -> int:
    match provided:
        case [_, *x]:
            pass
```
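
A sketch of the new behavior, using the `TryIdentifier` trait added in the diff below (the AST navigation via `as_match_stmt`, `cases`, and `patterns` is assumed from the `rustpython_ast` of that era):

```rust
use anyhow::Result;
use rustpython_ast::{Pattern, Stmt};
use rustpython_parser::ast;
use rustpython_parser::Parse;

use crate::identifier::TryIdentifier;
use crate::source_code::Locator;

fn star_pattern_name_range() -> Result<()> {
    let contents = "match provided:\n    case [_, *x]:\n        pass";
    let stmt = Stmt::parse(contents, "<filename>")?;
    let locator = Locator::new(contents);
    // `[_, *x]` is a sequence pattern whose last element is the `*x` star pattern.
    let match_stmt = stmt.as_match_stmt().unwrap();
    let Pattern::MatchSequence(ast::PatternMatchSequence { patterns, .. }) =
        &match_stmt.cases[0].pattern
    else {
        unreachable!()
    };
    // The identifier range covers only `x`, not `*x`.
    let range = patterns.last().unwrap().try_identifier(&locator).unwrap();
    assert_eq!(&contents[range], "x");
    Ok(())
}
```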

This is _also_ required for symbol renames, since we track writes as
bindings -- so we need to know the ranges of the bound symbols.

By storing these ranges precisely, we can also remove the
`binding.trimmed_range` abstraction -- since bindings already use the
"trimmed range".

To implement this behavior, I took some of our existing utilities (like
the code we had for `except ValueError as e` above), migrated them from
a full lexer to a zero-allocation lexer that _only_ identifies
"identifiers", and moved the behavior into a trait, so we can now do
`stmt.identifier(locator)` to get the range for the identifier.
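
Usage mirrors the new tests added in this PR; the `crate::` paths assume you're inside the crate that defines the `identifier` module:

```rust
use anyhow::Result;
use ruff_text_size::{TextRange, TextSize};
use rustpython_ast::Stmt;
use rustpython_parser::Parse;

use crate::identifier::Identifier;
use crate::source_code::Locator;

fn function_name_range() -> Result<()> {
    let contents = "def f(): pass";
    let stmt = Stmt::parse(contents, "<filename>")?;
    let locator = Locator::new(contents);
    // The range covers just the name `f`, not the entire `def` statement.
    assert_eq!(
        stmt.identifier(&locator),
        TextRange::new(TextSize::from(4), TextSize::from(5))
    );
    Ok(())
}
```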

Honestly, we might end up discarding much of this if we decide to put
ranges on all identifiers
(https://github.com/astral-sh/RustPython-Parser/pull/8). But even if we
do, this will _still_ be a good change, because the lexer introduced
here is useful beyond names (e.g., we use it to find the `except` keyword
in an exception handler, to find the `else` after a `for` loop, and so
on). So, I'm fine committing this even if we end up changing our minds
about the right approach.
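
For example, the new `identifier::else_` helper locates the `else` keyword after a loop body with the same tokenizer (adapted from the `extract_else_range` test below):

```rust
use anyhow::Result;
use rustpython_ast::Stmt;
use rustpython_parser::Parse;

use crate::identifier;
use crate::source_code::Locator;

fn else_keyword_range() -> Result<()> {
    let contents = "for x in y:\n    pass\nelse:\n    pass";
    let stmt = Stmt::parse(contents, "<filename>")?;
    let locator = Locator::new(contents);
    // The tokenizer scans forward from the end of the loop body to `else`.
    let range = identifier::else_(&stmt, &locator).unwrap();
    assert_eq!(&contents[range], "else");
    Ok(())
}
```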

Closes #5090.

## Benchmarks

No significant change, with one statistically significant improvement
(-2.1654% on `linter/all-rules/large/dataset.py`):

```
linter/default-rules/numpy/globals.py
                        time:   [73.922 µs 73.955 µs 73.986 µs]
                        thrpt:  [39.882 MiB/s 39.898 MiB/s 39.916 MiB/s]
                 change:
                        time:   [-0.5579% -0.4732% -0.3980%] (p = 0.00 < 0.05)
                        thrpt:  [+0.3996% +0.4755% +0.5611%]
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) low severe
  1 (1.00%) low mild
  1 (1.00%) high mild
linter/default-rules/pydantic/types.py
                        time:   [1.4909 ms 1.4917 ms 1.4926 ms]
                        thrpt:  [17.087 MiB/s 17.096 MiB/s 17.106 MiB/s]
                 change:
                        time:   [+0.2140% +0.2741% +0.3392%] (p = 0.00 < 0.05)
                        thrpt:  [-0.3380% -0.2734% -0.2136%]
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
linter/default-rules/numpy/ctypeslib.py
                        time:   [688.97 µs 691.34 µs 694.15 µs]
                        thrpt:  [23.988 MiB/s 24.085 MiB/s 24.168 MiB/s]
                 change:
                        time:   [-1.3282% -0.7298% -0.1466%] (p = 0.02 < 0.05)
                        thrpt:  [+0.1468% +0.7351% +1.3461%]
                        Change within noise threshold.
Found 15 outliers among 100 measurements (15.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  12 (12.00%) high severe
linter/default-rules/large/dataset.py
                        time:   [3.3872 ms 3.4032 ms 3.4191 ms]
                        thrpt:  [11.899 MiB/s 11.954 MiB/s 12.011 MiB/s]
                 change:
                        time:   [-0.6427% -0.2635% +0.0906%] (p = 0.17 > 0.05)
                        thrpt:  [-0.0905% +0.2642% +0.6469%]
                        No change in performance detected.
Found 20 outliers among 100 measurements (20.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  4 (4.00%) high mild
  13 (13.00%) high severe

linter/all-rules/numpy/globals.py
                        time:   [148.99 µs 149.21 µs 149.42 µs]
                        thrpt:  [19.748 MiB/s 19.776 MiB/s 19.805 MiB/s]
                 change:
                        time:   [-0.7340% -0.5068% -0.2778%] (p = 0.00 < 0.05)
                        thrpt:  [+0.2785% +0.5094% +0.7395%]
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high severe
linter/all-rules/pydantic/types.py
                        time:   [3.0362 ms 3.0396 ms 3.0441 ms]
                        thrpt:  [8.3779 MiB/s 8.3903 MiB/s 8.3997 MiB/s]
                 change:
                        time:   [-0.0957% +0.0618% +0.2125%] (p = 0.45 > 0.05)
                        thrpt:  [-0.2121% -0.0618% +0.0958%]
                        No change in performance detected.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low severe
  3 (3.00%) low mild
  5 (5.00%) high mild
  2 (2.00%) high severe
linter/all-rules/numpy/ctypeslib.py
                        time:   [1.6879 ms 1.6894 ms 1.6909 ms]
                        thrpt:  [9.8478 MiB/s 9.8562 MiB/s 9.8652 MiB/s]
                 change:
                        time:   [-0.2279% -0.0888% +0.0436%] (p = 0.18 > 0.05)
                        thrpt:  [-0.0435% +0.0889% +0.2284%]
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) low mild
  1 (1.00%) high severe
linter/all-rules/large/dataset.py
                        time:   [7.1520 ms 7.1586 ms 7.1654 ms]
                        thrpt:  [5.6777 MiB/s 5.6831 MiB/s 5.6883 MiB/s]
                 change:
                        time:   [-2.5626% -2.1654% -1.7780%] (p = 0.00 < 0.05)
                        thrpt:  [+1.8102% +2.2133% +2.6300%]
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
```
Charlie Marsh, 2023-06-15 14:43:19 -04:00 (committed by GitHub)
commit 5ea3e42513, parent 66089e1a2e
58 changed files with 1001 additions and 576 deletions


@@ -1,15 +1,13 @@
use std::borrow::Cow;
use std::ops::{Add, Sub};
use std::ops::Sub;
use std::path::Path;
use itertools::Itertools;
use log::error;
use num_traits::Zero;
use ruff_text_size::{TextRange, TextSize};
use rustc_hash::{FxHashMap, FxHashSet};
use rustpython_ast::Cmpop;
use rustpython_parser::ast::{
self, Arguments, Cmpop, Constant, Excepthandler, Expr, Keyword, MatchCase, Pattern, Ranged,
Stmt,
self, Arguments, Constant, Excepthandler, Expr, Keyword, MatchCase, Pattern, Ranged, Stmt,
};
use rustpython_parser::{lexer, Mode, Tok};
use smallvec::SmallVec;
@@ -44,6 +42,7 @@ where
range: _range,
}) = expr
{
// Ex) `list()`
if args.is_empty() && keywords.is_empty() {
if let Expr::Name(ast::ExprName { id, .. }) = func.as_ref() {
if !is_iterable_initializer(id.as_str(), |id| is_builtin(id)) {
@@ -1071,185 +1070,6 @@ pub fn match_parens(start: TextSize, locator: &Locator) -> Option<TextRange> {
}
}
/// Return `true` if the given character is a valid identifier character.
fn is_identifier(c: char) -> bool {
c.is_alphanumeric() || c == '_'
}
#[derive(Debug)]
enum IdentifierState {
/// We're in a comment, awaiting the identifier at the given index.
InComment { index: usize },
/// We're looking for the identifier at the given index.
AwaitingIdentifier { index: usize },
/// We're in the identifier at the given index, starting at the given character.
InIdentifier { index: usize, start: TextSize },
}
/// Return the appropriate visual `Range` for any message that spans a `Stmt`.
/// Specifically, this method returns the range of a function or class name,
/// rather than that of the entire function or class body.
pub fn identifier_range(stmt: &Stmt, locator: &Locator) -> TextRange {
match stmt {
Stmt::ClassDef(ast::StmtClassDef {
decorator_list,
range,
..
})
| Stmt::FunctionDef(ast::StmtFunctionDef {
decorator_list,
range,
..
})
| Stmt::AsyncFunctionDef(ast::StmtAsyncFunctionDef {
decorator_list,
range,
..
}) => {
let header_range = decorator_list.last().map_or(*range, |last_decorator| {
TextRange::new(last_decorator.end(), range.end())
});
// If the statement is an async function, we're looking for the third
// keyword-or-identifier (`foo` in `async def foo()`). Otherwise, it's the
// second keyword-or-identifier (`foo` in `def foo()` or `Foo` in `class Foo`).
let name_index = if stmt.is_async_function_def_stmt() {
2
} else {
1
};
let mut state = IdentifierState::AwaitingIdentifier { index: 0 };
for (char_index, char) in locator.slice(header_range).char_indices() {
match state {
IdentifierState::InComment { index } => match char {
// Read until the end of the comment.
'\r' | '\n' => {
state = IdentifierState::AwaitingIdentifier { index };
}
_ => {}
},
IdentifierState::AwaitingIdentifier { index } => match char {
// Read until we hit an identifier.
'#' => {
state = IdentifierState::InComment { index };
}
c if is_identifier(c) => {
state = IdentifierState::InIdentifier {
index,
start: TextSize::try_from(char_index).unwrap(),
};
}
_ => {}
},
IdentifierState::InIdentifier { index, start } => {
// We've reached the end of the identifier.
if !is_identifier(char) {
if index == name_index {
// We've found the identifier we're looking for.
let end = TextSize::try_from(char_index).unwrap();
return TextRange::new(
header_range.start().add(start),
header_range.start().add(end),
);
}
// We're looking for a different identifier.
state = IdentifierState::AwaitingIdentifier { index: index + 1 };
}
}
}
}
error!("Failed to find identifier for {:?}", stmt);
header_range
}
_ => stmt.range(),
}
}
/// Return the ranges of [`Tok::Name`] tokens within a specified node.
pub fn find_names<'a, T>(
located: &'a T,
locator: &'a Locator,
) -> impl Iterator<Item = TextRange> + 'a
where
T: Ranged,
{
let contents = locator.slice(located.range());
lexer::lex_starts_at(contents, Mode::Module, located.start())
.flatten()
.filter(|(tok, _)| matches!(tok, Tok::Name { .. }))
.map(|(_, range)| range)
}
/// Return the `Range` of `name` in `Excepthandler`.
pub fn excepthandler_name_range(handler: &Excepthandler, locator: &Locator) -> Option<TextRange> {
let Excepthandler::ExceptHandler(ast::ExcepthandlerExceptHandler {
name,
type_,
body,
range: _range,
}) = handler;
match (name, type_) {
(Some(_), Some(type_)) => {
let contents = &locator.contents()[TextRange::new(type_.end(), body[0].start())];
lexer::lex_starts_at(contents, Mode::Module, type_.end())
.flatten()
.tuple_windows()
.find(|(tok, next_tok)| {
matches!(tok.0, Tok::As) && matches!(next_tok.0, Tok::Name { .. })
})
.map(|((..), (_, range))| range)
}
_ => None,
}
}
/// Return the `Range` of `except` in `Excepthandler`.
pub fn except_range(handler: &Excepthandler, locator: &Locator) -> TextRange {
let Excepthandler::ExceptHandler(ast::ExcepthandlerExceptHandler { body, type_, .. }) = handler;
let end = if let Some(type_) = type_ {
type_.end()
} else {
body.first().expect("Expected body to be non-empty").start()
};
let contents = &locator.contents()[TextRange::new(handler.start(), end)];
lexer::lex_starts_at(contents, Mode::Module, handler.start())
.flatten()
.find(|(kind, _)| matches!(kind, Tok::Except { .. }))
.map(|(_, range)| range)
.expect("Failed to find `except` range")
}
/// Return the `Range` of `else` in `For`, `AsyncFor`, and `While` statements.
pub fn else_range(stmt: &Stmt, locator: &Locator) -> Option<TextRange> {
match stmt {
Stmt::For(ast::StmtFor { body, orelse, .. })
| Stmt::AsyncFor(ast::StmtAsyncFor { body, orelse, .. })
| Stmt::While(ast::StmtWhile { body, orelse, .. })
if !orelse.is_empty() =>
{
let body_end = body.last().expect("Expected body to be non-empty").end();
let or_else_start = orelse
.first()
.expect("Expected orelse to be non-empty")
.start();
let contents = &locator.contents()[TextRange::new(body_end, or_else_start)];
lexer::lex_starts_at(contents, Mode::Module, body_end)
.flatten()
.find(|(kind, _)| matches!(kind, Tok::Else))
.map(|(_, range)| range)
}
_ => None,
}
}
/// Return the `Range` of the first `Tok::Colon` token in a `Range`.
pub fn first_colon_range(range: TextRange, locator: &Locator) -> Option<TextRange> {
let contents = &locator.contents()[range];
@@ -1482,7 +1302,101 @@ pub fn is_unpacking_assignment(parent: &Stmt, child: &Expr) -> bool {
}
}
#[derive(Clone, PartialEq, Debug)]
#[derive(Copy, Clone, Debug, PartialEq, is_macro::Is)]
pub enum Truthiness {
// An expression evaluates to `False`.
Falsey,
// An expression evaluates to `True`.
Truthy,
// An expression evaluates to an unknown value (e.g., a variable `x` of unknown type).
Unknown,
}
impl From<Option<bool>> for Truthiness {
fn from(value: Option<bool>) -> Self {
match value {
Some(true) => Truthiness::Truthy,
Some(false) => Truthiness::Falsey,
None => Truthiness::Unknown,
}
}
}
impl From<Truthiness> for Option<bool> {
fn from(truthiness: Truthiness) -> Self {
match truthiness {
Truthiness::Truthy => Some(true),
Truthiness::Falsey => Some(false),
Truthiness::Unknown => None,
}
}
}
impl Truthiness {
pub fn from_expr<F>(expr: &Expr, is_builtin: F) -> Self
where
F: Fn(&str) -> bool,
{
match expr {
Expr::Constant(ast::ExprConstant { value, .. }) => match value {
Constant::Bool(value) => Some(*value),
Constant::None => Some(false),
Constant::Str(string) => Some(!string.is_empty()),
Constant::Bytes(bytes) => Some(!bytes.is_empty()),
Constant::Int(int) => Some(!int.is_zero()),
Constant::Float(float) => Some(*float != 0.0),
Constant::Complex { real, imag } => Some(*real != 0.0 || *imag != 0.0),
Constant::Ellipsis => Some(true),
Constant::Tuple(elts) => Some(!elts.is_empty()),
},
Expr::JoinedStr(ast::ExprJoinedStr { values, range: _range }) => {
if values.is_empty() {
Some(false)
} else if values.iter().any(|value| {
let Expr::Constant(ast::ExprConstant { value: Constant::Str(string), .. } )= &value else {
return false;
};
!string.is_empty()
}) {
Some(true)
} else {
None
}
}
Expr::List(ast::ExprList { elts, range: _range, .. })
| Expr::Set(ast::ExprSet { elts, range: _range })
| Expr::Tuple(ast::ExprTuple { elts, range: _range,.. }) => Some(!elts.is_empty()),
Expr::Dict(ast::ExprDict { keys, range: _range, .. }) => Some(!keys.is_empty()),
Expr::Call(ast::ExprCall {
func,
args,
keywords, range: _range,
}) => {
if let Expr::Name(ast::ExprName { id, .. }) = func.as_ref() {
if is_iterable_initializer(id.as_str(), |id| is_builtin(id)) {
if args.is_empty() && keywords.is_empty() {
// Ex) `list()`
Some(false)
} else if args.len() == 1 && keywords.is_empty() {
// Ex) `list([1, 2, 3])`
Self::from_expr(&args[0], is_builtin).into()
} else {
None
}
} else {
None
}
} else {
None
}
}
_ => None,
}
.into()
}
}
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct LocatedCmpop {
pub range: TextRange,
pub op: Cmpop,
@@ -1581,111 +1495,19 @@ pub fn locate_cmpops(expr: &Expr, locator: &Locator) -> Vec<LocatedCmpop> {
ops
}
#[derive(Copy, Clone, Debug, PartialEq, is_macro::Is)]
pub enum Truthiness {
// An expression evaluates to `False`.
Falsey,
// An expression evaluates to `True`.
Truthy,
// An expression evaluates to an unknown value (e.g., a variable `x` of unknown type).
Unknown,
}
impl From<Option<bool>> for Truthiness {
fn from(value: Option<bool>) -> Self {
match value {
Some(true) => Truthiness::Truthy,
Some(false) => Truthiness::Falsey,
None => Truthiness::Unknown,
}
}
}
impl From<Truthiness> for Option<bool> {
fn from(truthiness: Truthiness) -> Self {
match truthiness {
Truthiness::Truthy => Some(true),
Truthiness::Falsey => Some(false),
Truthiness::Unknown => None,
}
}
}
impl Truthiness {
pub fn from_expr<F>(expr: &Expr, is_builtin: F) -> Self
where
F: Fn(&str) -> bool,
{
match expr {
Expr::Constant(ast::ExprConstant { value, .. }) => match value {
Constant::Bool(value) => Some(*value),
Constant::None => Some(false),
Constant::Str(string) => Some(!string.is_empty()),
Constant::Bytes(bytes) => Some(!bytes.is_empty()),
Constant::Int(int) => Some(!int.is_zero()),
Constant::Float(float) => Some(*float != 0.0),
Constant::Complex { real, imag } => Some(*real != 0.0 || *imag != 0.0),
Constant::Ellipsis => Some(true),
Constant::Tuple(elts) => Some(!elts.is_empty()),
},
Expr::JoinedStr(ast::ExprJoinedStr { values, range: _range }) => {
if values.is_empty() {
Some(false)
} else if values.iter().any(|value| {
let Expr::Constant(ast::ExprConstant { value: Constant::Str(string), .. } )= &value else {
return false;
};
!string.is_empty()
}) {
Some(true)
} else {
None
}
}
Expr::List(ast::ExprList { elts, range: _range, .. })
| Expr::Set(ast::ExprSet { elts, range: _range })
| Expr::Tuple(ast::ExprTuple { elts, range: _range,.. }) => Some(!elts.is_empty()),
Expr::Dict(ast::ExprDict { keys, range: _range, .. }) => Some(!keys.is_empty()),
Expr::Call(ast::ExprCall {
func,
args,
keywords, range: _range,
}) => {
if let Expr::Name(ast::ExprName { id, .. }) = func.as_ref() {
if is_iterable_initializer(id.as_str(), |id| is_builtin(id)) {
if args.is_empty() && keywords.is_empty() {
Some(false)
} else if args.len() == 1 && keywords.is_empty() {
Self::from_expr(&args[0], is_builtin).into()
} else {
None
}
} else {
None
}
} else {
None
}
}
_ => None,
}
.into()
}
}
#[cfg(test)]
mod tests {
use std::borrow::Cow;
use anyhow::Result;
use ruff_text_size::{TextLen, TextRange, TextSize};
use rustpython_ast::{Expr, Stmt, Suite};
use rustpython_parser::ast::Cmpop;
use rustpython_ast::{Cmpop, Expr, Stmt};
use rustpython_parser::ast::Suite;
use rustpython_parser::Parse;
use crate::helpers::{
elif_else_range, else_range, first_colon_range, has_trailing_content, identifier_range,
locate_cmpops, resolve_imported_module_path, LocatedCmpop,
elif_else_range, first_colon_range, has_trailing_content, locate_cmpops,
resolve_imported_module_path, LocatedCmpop,
};
use crate::source_code::Locator;
@@ -1728,90 +1550,6 @@ y = 2
Ok(())
}
#[test]
fn extract_identifier_range() -> Result<()> {
let contents = "def f(): pass".trim();
let stmt = Stmt::parse(contents, "<filename>")?;
let locator = Locator::new(contents);
assert_eq!(
identifier_range(&stmt, &locator),
TextRange::new(TextSize::from(4), TextSize::from(5))
);
let contents = "async def f(): pass".trim();
let stmt = Stmt::parse(contents, "<filename>")?;
let locator = Locator::new(contents);
assert_eq!(
identifier_range(&stmt, &locator),
TextRange::new(TextSize::from(10), TextSize::from(11))
);
let contents = r#"
def \
f():
pass
"#
.trim();
let stmt = Stmt::parse(contents, "<filename>")?;
let locator = Locator::new(contents);
assert_eq!(
identifier_range(&stmt, &locator),
TextRange::new(TextSize::from(8), TextSize::from(9))
);
let contents = "class Class(): pass".trim();
let stmt = Stmt::parse(contents, "<filename>")?;
let locator = Locator::new(contents);
assert_eq!(
identifier_range(&stmt, &locator),
TextRange::new(TextSize::from(6), TextSize::from(11))
);
let contents = "class Class: pass".trim();
let stmt = Stmt::parse(contents, "<filename>")?;
let locator = Locator::new(contents);
assert_eq!(
identifier_range(&stmt, &locator),
TextRange::new(TextSize::from(6), TextSize::from(11))
);
let contents = r#"
@decorator()
class Class():
pass
"#
.trim();
let stmt = Stmt::parse(contents, "<filename>")?;
let locator = Locator::new(contents);
assert_eq!(
identifier_range(&stmt, &locator),
TextRange::new(TextSize::from(19), TextSize::from(24))
);
let contents = r#"
@decorator() # Comment
class Class():
pass
"#
.trim();
let stmt = Stmt::parse(contents, "<filename>")?;
let locator = Locator::new(contents);
assert_eq!(
identifier_range(&stmt, &locator),
TextRange::new(TextSize::from(30), TextSize::from(35))
);
let contents = r#"x = y + 1"#.trim();
let stmt = Stmt::parse(contents, "<filename>")?;
let locator = Locator::new(contents);
assert_eq!(
identifier_range(&stmt, &locator),
TextRange::new(TextSize::from(0), TextSize::from(9))
);
Ok(())
}
#[test]
fn resolve_import() {
// Return the module directly.
@@ -1849,26 +1587,6 @@ class Class():
);
}
#[test]
fn extract_else_range() -> Result<()> {
let contents = r#"
for x in y:
pass
else:
pass
"#
.trim();
let stmt = Stmt::parse(contents, "<filename>")?;
let locator = Locator::new(contents);
let range = else_range(&stmt, &locator).unwrap();
assert_eq!(&contents[range], "else");
assert_eq!(
range,
TextRange::new(TextSize::from(21), TextSize::from(25))
);
Ok(())
}
#[test]
fn extract_first_colon_range() {
let contents = "with a: pass";


@@ -0,0 +1,621 @@
//! Extract [`TextRange`] information from AST nodes.
//!
//! In the `RustPython` AST, each node has a `range` field that contains the
//! start and end byte offsets of the node. However, attributes on those
//! nodes may not have their own ranges. In particular, identifiers are
//! not given their own ranges, unless they're part of a name expression.
//!
//! For example, given:
//! ```python
//! def f():
//! ...
//! ```
//!
//! The statement defining `f` has a range, but the identifier `f` does not.
//!
//! This module assists with extracting [`TextRange`] ranges from AST nodes
//! via manual lexical analysis.
use std::ops::{Add, Sub};
use std::str::Chars;
use ruff_text_size::{TextLen, TextRange, TextSize};
use rustpython_ast::{Alias, Arg, Pattern};
use rustpython_parser::ast::{self, Excepthandler, Ranged, Stmt};
use ruff_python_whitespace::is_python_whitespace;
use crate::source_code::Locator;
pub trait Identifier {
/// Return the [`TextRange`] of the identifier in the given AST node.
fn identifier(&self, locator: &Locator) -> TextRange;
}
pub trait TryIdentifier {
/// Return the [`TextRange`] of the identifier in the given AST node, or `None` if
/// the node does not have an identifier.
fn try_identifier(&self, locator: &Locator) -> Option<TextRange>;
}
impl Identifier for Stmt {
/// Return the [`TextRange`] of the identifier in the given statement.
///
/// For example, return the range of `f` in:
/// ```python
/// def f():
/// ...
/// ```
fn identifier(&self, locator: &Locator) -> TextRange {
match self {
Stmt::ClassDef(ast::StmtClassDef {
decorator_list,
range,
..
})
| Stmt::FunctionDef(ast::StmtFunctionDef {
decorator_list,
range,
..
}) => {
let range = decorator_list.last().map_or(*range, |last_decorator| {
TextRange::new(last_decorator.end(), range.end())
});
// The first "identifier" is the `def` or `class` keyword.
// The second "identifier" is the function or class name.
IdentifierTokenizer::starts_at(range.start(), locator.contents())
.nth(1)
.expect("Unable to identify identifier in function or class definition")
}
Stmt::AsyncFunctionDef(ast::StmtAsyncFunctionDef {
decorator_list,
range,
..
}) => {
let range = decorator_list.last().map_or(*range, |last_decorator| {
TextRange::new(last_decorator.end(), range.end())
});
// The first "identifier" is the `async` keyword.
// The second "identifier" is the `def` or `class` keyword.
// The third "identifier" is the function or class name.
IdentifierTokenizer::starts_at(range.start(), locator.contents())
.nth(2)
.expect("Unable to identify identifier in function or class definition")
}
_ => self.range(),
}
}
}
impl Identifier for Arg {
/// Return the [`TextRange`] for the identifier defining an [`Arg`].
///
/// For example, return the range of `x` in:
/// ```python
/// def f(x: int = 0):
/// ...
/// ```
fn identifier(&self, locator: &Locator) -> TextRange {
IdentifierTokenizer::new(locator.contents(), self.range())
.next()
.expect("Failed to find argument identifier")
}
}
impl Identifier for Alias {
/// Return the [`TextRange`] for the identifier defining an [`Alias`].
///
/// For example, return the range of `x` in:
/// ```python
/// from foo import bar as x
/// ```
fn identifier(&self, locator: &Locator) -> TextRange {
if matches!(self.name.as_str(), "*") {
self.range()
} else if self.asname.is_none() {
// The first identifier is the module name.
IdentifierTokenizer::new(locator.contents(), self.range())
.next()
.expect("Failed to find alias identifier")
} else {
// The first identifier is the module name.
// The second identifier is the "as" keyword.
// The third identifier is the alias name.
IdentifierTokenizer::new(locator.contents(), self.range())
.last()
.expect("Failed to find alias identifier")
}
}
}
impl TryIdentifier for Pattern {
/// Return the [`TextRange`] of the identifier in the given pattern.
///
/// For example, return the range of `z` in:
/// ```python
/// match x:
/// # Pattern::MatchAs
/// case z:
/// ...
/// ```
///
/// Or:
/// ```python
/// match x:
/// # Pattern::MatchAs
/// case y as z:
/// ...
/// ```
///
/// Or :
/// ```python
/// match x:
/// # Pattern::MatchMapping
/// case {"a": 1, **z}
/// ...
/// ```
///
/// Or :
/// ```python
/// match x:
/// # Pattern::MatchStar
/// case *z:
/// ...
/// ```
fn try_identifier(&self, locator: &Locator) -> Option<TextRange> {
match self {
Pattern::MatchAs(ast::PatternMatchAs {
name: Some(_),
pattern,
range,
}) => {
Some(if let Some(pattern) = pattern {
// Identify `z` in:
// ```python
// match x:
// case Foo(bar) as z:
// ...
// ```
IdentifierTokenizer::starts_at(pattern.end(), locator.contents())
.nth(1)
.expect("Unable to identify identifier in pattern")
} else {
// Identify `z` in:
// ```python
// match x:
// case z:
// ...
// ```
*range
})
}
Pattern::MatchMapping(ast::PatternMatchMapping {
patterns,
rest: Some(_),
..
}) => {
Some(if let Some(pattern) = patterns.last() {
// Identify `z` in:
// ```python
// match x:
// case {"a": 1, **z}
// ...
// ```
//
// A mapping pattern can contain at most one double-star pattern,
// and it must be the last pattern in the mapping.
IdentifierTokenizer::starts_at(pattern.end(), locator.contents())
.next()
.expect("Unable to identify identifier in pattern")
} else {
// Identify `z` in:
// ```python
// match x:
// case {**z}
// ...
// ```
IdentifierTokenizer::starts_at(self.start(), locator.contents())
.next()
.expect("Unable to identify identifier in pattern")
})
}
Pattern::MatchStar(ast::PatternMatchStar { name: Some(_), .. }) => {
// Identify `z` in:
// ```python
// match x:
// case *z:
// ...
// ```
Some(
IdentifierTokenizer::starts_at(self.start(), locator.contents())
.next()
.expect("Unable to identify identifier in pattern"),
)
}
_ => None,
}
}
}
impl TryIdentifier for Excepthandler {
/// Return the [`TextRange`] of a named exception in an [`Excepthandler`].
///
/// For example, return the range of `e` in:
/// ```python
/// try:
/// ...
/// except ValueError as e:
/// ...
/// ```
fn try_identifier(&self, locator: &Locator) -> Option<TextRange> {
let Excepthandler::ExceptHandler(ast::ExcepthandlerExceptHandler { type_, name, .. }) =
self;
if name.is_none() {
return None;
}
let Some(type_) = type_ else {
return None;
};
// The exception name is the first identifier token after the `as` keyword.
Some(
IdentifierTokenizer::starts_at(type_.end(), locator.contents())
.nth(1)
.expect("Failed to find exception identifier in exception handler"),
)
}
}
/// Return the [`TextRange`] for every name in a [`Stmt`].
///
/// Intended to be used for `global` and `nonlocal` statements.
///
/// For example, return the ranges of `x` and `y` in:
/// ```python
/// global x, y
/// ```
pub fn names<'a>(stmt: &Stmt, locator: &'a Locator<'a>) -> impl Iterator<Item = TextRange> + 'a {
// Given `global x, y`, the first identifier is `global`, and the remaining identifiers are
// the names.
IdentifierTokenizer::new(locator.contents(), stmt.range()).skip(1)
}
/// Return the [`TextRange`] of the `except` token in an [`Excepthandler`].
pub fn except(handler: &Excepthandler, locator: &Locator) -> TextRange {
IdentifierTokenizer::new(locator.contents(), handler.range())
.next()
.expect("Failed to find `except` token in `Excepthandler`")
}
/// Return the [`TextRange`] of the `else` token in a `For`, `AsyncFor`, or `While` statement.
pub fn else_(stmt: &Stmt, locator: &Locator) -> Option<TextRange> {
let (Stmt::For(ast::StmtFor { body, orelse, .. })
| Stmt::AsyncFor(ast::StmtAsyncFor { body, orelse, .. })
| Stmt::While(ast::StmtWhile { body, orelse, .. })) = stmt else {
return None;
};
if orelse.is_empty() {
return None;
}
IdentifierTokenizer::starts_at(
body.last().expect("Expected body to be non-empty").end(),
locator.contents(),
)
.next()
}
/// Return `true` if the given character starts a valid Python identifier.
///
/// Python identifiers must start with an alphabetic character or an underscore.
fn is_python_identifier_start(c: char) -> bool {
c.is_alphabetic() || c == '_'
}
/// Return `true` if the given character is a valid Python identifier continuation character.
///
/// Python identifiers can contain alphanumeric characters and underscores, but cannot start with a
/// number.
fn is_python_identifier_continue(c: char) -> bool {
c.is_alphanumeric() || c == '_'
}
/// Simple zero allocation tokenizer for Python identifiers.
///
/// The tokenizer must operate over a range that can only contain identifiers, keywords, and
/// comments (along with whitespace and continuation characters). It does not support other tokens,
/// like operators, literals, or delimiters. It also does not differentiate between keywords and
/// identifiers, treating every valid token as an "identifier".
///
/// This is useful for cases like, e.g., identifying the alias name in an aliased import (`bar` in
/// `import foo as bar`), where we're guaranteed to only have identifiers and keywords in the
/// relevant range.
pub(crate) struct IdentifierTokenizer<'a> {
cursor: Cursor<'a>,
offset: TextSize,
}
impl<'a> IdentifierTokenizer<'a> {
pub(crate) fn new(source: &'a str, range: TextRange) -> Self {
Self {
cursor: Cursor::new(&source[range]),
offset: range.start(),
}
}
pub(crate) fn starts_at(offset: TextSize, source: &'a str) -> Self {
let range = TextRange::new(offset, source.text_len());
Self::new(source, range)
}
fn next_token(&mut self) -> Option<TextRange> {
while let Some(c) = self.cursor.bump() {
match c {
c if is_python_identifier_start(c) => {
let start = self.offset.add(self.cursor.offset()).sub(c.text_len());
self.cursor.eat_while(is_python_identifier_continue);
let end = self.offset.add(self.cursor.offset());
return Some(TextRange::new(start, end));
}
c if is_python_whitespace(c) => {
self.cursor.eat_while(is_python_whitespace);
}
'#' => {
self.cursor.eat_while(|c| !matches!(c, '\n' | '\r'));
}
'\r' => {
self.cursor.eat_char('\n');
}
'\n' => {
// Nothing to do.
}
'\\' => {
// Nothing to do.
}
_ => {
// Nothing to do.
}
};
}
None
}
}
impl Iterator for IdentifierTokenizer<'_> {
type Item = TextRange;
fn next(&mut self) -> Option<Self::Item> {
self.next_token()
}
}
const EOF_CHAR: char = '\0';
#[derive(Debug, Clone)]
struct Cursor<'a> {
chars: Chars<'a>,
offset: TextSize,
}
impl<'a> Cursor<'a> {
fn new(source: &'a str) -> Self {
Self {
chars: source.chars(),
offset: TextSize::from(0),
}
}
const fn offset(&self) -> TextSize {
self.offset
}
/// Peeks the next character from the input stream without consuming it.
/// Returns [`EOF_CHAR`] if the file is at the end of the file.
fn first(&self) -> char {
self.chars.clone().next().unwrap_or(EOF_CHAR)
}
/// Returns `true` if the file is at the end of the file.
fn is_eof(&self) -> bool {
self.chars.as_str().is_empty()
}
/// Consumes the next character.
fn bump(&mut self) -> Option<char> {
if let Some(char) = self.chars.next() {
self.offset += char.text_len();
Some(char)
} else {
None
}
}
/// Eats the next character if it matches the given character.
fn eat_char(&mut self, c: char) -> bool {
if self.first() == c {
self.bump();
true
} else {
false
}
}
/// Eats symbols while predicate returns true or until the end of file is reached.
fn eat_while(&mut self, mut predicate: impl FnMut(char) -> bool) {
while predicate(self.first()) && !self.is_eof() {
self.bump();
}
}
}
#[cfg(test)]
mod tests {
use anyhow::Result;
use ruff_text_size::{TextRange, TextSize};
use rustpython_ast::Stmt;
use rustpython_parser::Parse;
use crate::identifier;
use crate::identifier::Identifier;
use crate::source_code::Locator;
#[test]
fn extract_arg_range() -> Result<()> {
let contents = "def f(x): pass".trim();
let stmt = Stmt::parse(contents, "<filename>")?;
let function_def = stmt.as_function_def_stmt().unwrap();
let args = &function_def.args.args;
let arg = &args[0];
let locator = Locator::new(contents);
assert_eq!(
arg.identifier(&locator),
TextRange::new(TextSize::from(6), TextSize::from(7))
);
let contents = "def f(x: int): pass".trim();
let stmt = Stmt::parse(contents, "<filename>")?;
let function_def = stmt.as_function_def_stmt().unwrap();
let args = &function_def.args.args;
let arg = &args[0];
let locator = Locator::new(contents);
assert_eq!(
arg.identifier(&locator),
TextRange::new(TextSize::from(6), TextSize::from(7))
);
let contents = r#"
def f(
x: int, # Comment
):
pass
"#
.trim();
let stmt = Stmt::parse(contents, "<filename>")?;
let function_def = stmt.as_function_def_stmt().unwrap();
let args = &function_def.args.args;
let arg = &args[0];
let locator = Locator::new(contents);
assert_eq!(
arg.identifier(&locator),
TextRange::new(TextSize::from(11), TextSize::from(12))
);
Ok(())
}
#[test]
fn extract_identifier_range() -> Result<()> {
let contents = "def f(): pass".trim();
let stmt = Stmt::parse(contents, "<filename>")?;
let locator = Locator::new(contents);
assert_eq!(
stmt.identifier(&locator),
TextRange::new(TextSize::from(4), TextSize::from(5))
);
let contents = "async def f(): pass".trim();
let stmt = Stmt::parse(contents, "<filename>")?;
let locator = Locator::new(contents);
assert_eq!(
stmt.identifier(&locator),
TextRange::new(TextSize::from(10), TextSize::from(11))
);
let contents = r#"
def \
f():
pass
"#
.trim();
let stmt = Stmt::parse(contents, "<filename>")?;
let locator = Locator::new(contents);
assert_eq!(
stmt.identifier(&locator),
TextRange::new(TextSize::from(8), TextSize::from(9))
);
let contents = "class Class(): pass".trim();
let stmt = Stmt::parse(contents, "<filename>")?;
let locator = Locator::new(contents);
assert_eq!(
stmt.identifier(&locator),
TextRange::new(TextSize::from(6), TextSize::from(11))
);
let contents = "class Class: pass".trim();
let stmt = Stmt::parse(contents, "<filename>")?;
let locator = Locator::new(contents);
assert_eq!(
stmt.identifier(&locator),
TextRange::new(TextSize::from(6), TextSize::from(11))
);
let contents = r#"
@decorator()
class Class():
pass
"#
.trim();
let stmt = Stmt::parse(contents, "<filename>")?;
let locator = Locator::new(contents);
assert_eq!(
stmt.identifier(&locator),
TextRange::new(TextSize::from(19), TextSize::from(24))
);
let contents = r#"
@decorator() # Comment
class Class():
pass
"#
.trim();
let stmt = Stmt::parse(contents, "<filename>")?;
let locator = Locator::new(contents);
assert_eq!(
stmt.identifier(&locator),
TextRange::new(TextSize::from(30), TextSize::from(35))
);
let contents = r#"x = y + 1"#.trim();
let stmt = Stmt::parse(contents, "<filename>")?;
let locator = Locator::new(contents);
assert_eq!(
stmt.identifier(&locator),
TextRange::new(TextSize::from(0), TextSize::from(9))
);
Ok(())
}
#[test]
fn extract_else_range() -> Result<()> {
let contents = r#"
for x in y:
pass
else:
pass
"#
.trim();
let stmt = Stmt::parse(contents, "<filename>")?;
let locator = Locator::new(contents);
let range = identifier::else_(&stmt, &locator).unwrap();
assert_eq!(&contents[range], "else");
assert_eq!(
range,
TextRange::new(TextSize::from(21), TextSize::from(25))
);
Ok(())
}
}


@@ -6,6 +6,7 @@ pub mod docstrings;
pub mod function;
pub mod hashable;
pub mod helpers;
pub mod identifier;
pub mod imports;
pub mod node;
pub mod prelude;