Update string nodes for implicit concatenation (#7927)

## Summary

This PR updates the string nodes (`ExprStringLiteral`,
`ExprBytesLiteral`, and `ExprFString`) to account for implicit string
concatenation.

### Motivation

In Python, implicitly concatenated strings are joined while parsing
because the interpreter doesn't require the information for each part.
While that's feasible for an interpreter, it falls short for a static
analysis tool, where information about each part is useful. Currently,
various parts of the code use the lexer to get the individual string
parts.

One of the main challenges this solves is string formatting.
Currently, the formatter relies on the lexer to get the individual
string parts and formats them, including the comments, accordingly. But
with PEP 701, f-strings can also contain comments. Without this change,
it becomes very difficult to add support for f-string formatting.

### Implementation

The initial proposal was made in this discussion:
https://github.com/astral-sh/ruff/discussions/6183#discussioncomment-6591993.
Various AST designs were explored for this task; they are described in
the linked internal document[^1].

The selected variant keeps the nodes as they are, except that the
`implicit_concatenated` field is removed and a new value struct is added
to each `Expr*` struct. This is a private struct that contains the
actual implementation of how the AST is designed for both single and
implicitly concatenated strings.

This is implemented as an enum with two variants, `Single` and
`Concatenated`, which avoids allocating a vector for single strings.
The value struct exposes various public methods to query information
about the node.
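
The design described above can be sketched roughly as follows. This is a minimal, simplified illustration rather than the code added in this PR: `StringLiteral` here is a stand-in without a range, and the names `StringLiteralValueInner`, `single`, `parts`, `is_implicit_concatenated`, and `joined_value` are assumptions chosen for the example (only `StringLiteralValue::concatenated` appears in the diff).

```rust
/// Simplified stand-in for a single string part (the real node also has a range).
#[derive(Debug, Clone, PartialEq)]
pub struct StringLiteral {
    pub value: String,
    pub unicode: bool,
}

/// Private representation: a lone part needs no `Vec` allocation.
#[derive(Debug, Clone, PartialEq)]
enum StringLiteralValueInner {
    Single(StringLiteral),
    Concatenated(Vec<StringLiteral>),
}

/// Public wrapper that hides how the parts are stored.
#[derive(Debug, Clone, PartialEq)]
pub struct StringLiteralValue {
    inner: StringLiteralValueInner,
}

impl StringLiteralValue {
    /// Hypothetical constructor for a string with exactly one part.
    pub fn single(string: StringLiteral) -> Self {
        Self { inner: StringLiteralValueInner::Single(string) }
    }

    /// Constructor for two or more implicitly concatenated parts.
    pub fn concatenated(strings: Vec<StringLiteral>) -> Self {
        assert!(strings.len() > 1, "use `single` for one part");
        Self { inner: StringLiteralValueInner::Concatenated(strings) }
    }

    /// Hypothetical query: is this an implicitly concatenated string?
    pub fn is_implicit_concatenated(&self) -> bool {
        matches!(self.inner, StringLiteralValueInner::Concatenated(_))
    }

    /// Hypothetical query: view all parts as a slice, whichever variant is stored.
    pub fn parts(&self) -> &[StringLiteral] {
        match &self.inner {
            StringLiteralValueInner::Single(part) => std::slice::from_ref(part),
            StringLiteralValueInner::Concatenated(parts) => parts.as_slice(),
        }
    }

    /// Hypothetical query: the joined text, as the interpreter would see it.
    pub fn joined_value(&self) -> String {
        self.parts().iter().map(|part| part.value.as_str()).collect()
    }
}
```

With this shape, callers can ask for the individual parts or for the joined value without knowing which variant is stored, and the common single-string case stays allocation-free.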

The nodes are structured in the following way:

```
ExprStringLiteral - "foo" "bar"
|- StringLiteral - "foo"
|- StringLiteral - "bar"

ExprBytesLiteral - b"foo" b"bar"
|- BytesLiteral - b"foo"
|- BytesLiteral - b"bar"

ExprFString - "foo" f"bar {x}"
|- FStringPart::Literal - "foo"
|- FStringPart::FString - f"bar {x}"
  |- StringLiteral - "bar "
  |- FormattedValue - "x"
```

[^1]: Internal document:
https://www.notion.so/astral-sh/Implicit-String-Concatenation-e036345dc48943f89e416c087bf6f6d9?pvs=4

#### Visitor

The nodes are structured so that the entire string, including all the
implicitly concatenated parts, is a single node containing individual
nodes for the parts. The previous section has a representation of that
tree for all the string nodes. This means that new visitor methods are
added to visit the individual parts of string, bytes, and f-strings for
`Visitor`, `PreorderVisitor`, and `Transformer`.
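
As a rough illustration of what visiting the individual parts can look like, here is a minimal sketch. It does not reproduce the `Visitor` trait from this PR: the node types are simplified stand-ins, and the method names (`visit_expr_f_string`, `visit_f_string_part`, `visit_string_literal`, `visit_f_string`), the `walk_*` helpers, and `PartCounter` are assumptions made for the example.

```rust
/// Simplified stand-ins for the AST nodes (ranges and expressions omitted).
pub struct StringLiteral {
    pub value: String,
}

pub struct FString {
    /// Simplified: the real node holds literal and formatted-value elements.
    pub literal_text: String,
}

pub enum FStringPart {
    Literal(StringLiteral),
    FString(FString),
}

pub struct ExprFString {
    pub parts: Vec<FStringPart>,
}

/// Hypothetical visitor with one hook per kind of part.
pub trait Visitor {
    fn visit_expr_f_string(&mut self, expr: &ExprFString) {
        walk_expr_f_string(self, expr);
    }
    fn visit_f_string_part(&mut self, part: &FStringPart) {
        walk_f_string_part(self, part);
    }
    fn visit_string_literal(&mut self, _literal: &StringLiteral) {}
    fn visit_f_string(&mut self, _f_string: &FString) {}
}

/// Visit every part of the (possibly implicitly concatenated) f-string.
pub fn walk_expr_f_string<V: Visitor + ?Sized>(visitor: &mut V, expr: &ExprFString) {
    for part in &expr.parts {
        visitor.visit_f_string_part(part);
    }
}

/// Dispatch a part to the hook that matches its kind.
pub fn walk_f_string_part<V: Visitor + ?Sized>(visitor: &mut V, part: &FStringPart) {
    match part {
        FStringPart::Literal(literal) => visitor.visit_string_literal(literal),
        FStringPart::FString(f_string) => visitor.visit_f_string(f_string),
    }
}

/// Example visitor: counts how many parts of each kind were seen.
#[derive(Default)]
pub struct PartCounter {
    pub literals: usize,
    pub f_strings: usize,
}

impl Visitor for PartCounter {
    fn visit_string_literal(&mut self, _literal: &StringLiteral) {
        self.literals += 1;
    }
    fn visit_f_string(&mut self, _f_string: &FString) {
        self.f_strings += 1;
    }
}
```

For `"foo" f"bar {x}"`, a visitor of this shape would see one literal part and one f-string part, matching the tree shown in the previous section.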

## Test Plan

- `cargo insta test --workspace --all-features --unreferenced reject`
- Verify that the ecosystem results are unchanged
Dhruv Manilawala 2023-11-24 17:55:41 -06:00 committed by GitHub
parent 2590aa30ae
commit 017e829115
121 changed files with 27666 additions and 25501 deletions

@@ -7,9 +7,9 @@ use crate::lexer::{LexicalError, LexicalErrorType};
 use crate::token::{StringKind, Tok};
 pub(crate) enum StringType {
-    Str(ast::ExprStringLiteral),
-    Bytes(ast::ExprBytesLiteral),
-    FString(ast::ExprFString),
+    Str(ast::StringLiteral),
+    Bytes(ast::BytesLiteral),
+    FString(ast::FString),
 }
 impl Ranged for StringType {
@@ -22,11 +22,12 @@ impl Ranged for StringType {
     }
 }
-impl StringType {
-    fn is_unicode(&self) -> bool {
-        match self {
-            Self::Str(ast::ExprStringLiteral { unicode, .. }) => *unicode,
-            _ => false,
+impl From<StringType> for Expr {
+    fn from(string: StringType) -> Self {
+        match string {
+            StringType::Str(node) => Expr::from(node),
+            StringType::Bytes(node) => Expr::from(node),
+            StringType::FString(node) => Expr::from(node),
         }
     }
 }
@@ -35,14 +36,16 @@ struct StringParser<'a> {
     rest: &'a str,
     kind: StringKind,
     location: TextSize,
+    range: TextRange,
 }
 impl<'a> StringParser<'a> {
-    fn new(source: &'a str, kind: StringKind, start: TextSize) -> Self {
+    fn new(source: &'a str, kind: StringKind, start: TextSize, range: TextRange) -> Self {
         Self {
             rest: source,
             kind,
             location: start,
+            range,
         }
     }
@@ -59,11 +62,6 @@ impl<'a> StringParser<'a> {
         self.location
     }
-    #[inline]
-    fn range(&self, start_location: TextSize) -> TextRange {
-        TextRange::new(start_location, self.location)
-    }
     /// Returns the next byte in the string, if there is one.
     ///
     /// # Panics
@@ -208,7 +206,6 @@ impl<'a> StringParser<'a> {
     fn parse_fstring_middle(&mut self) -> Result<Expr, LexicalError> {
         let mut value = String::new();
-        let start_location = self.get_pos();
         while let Some(ch) = self.next_char() {
             match ch {
                 // We can encounter a `\` as the last character in a `FStringMiddle`
@@ -244,17 +241,15 @@ impl<'a> StringParser<'a> {
                 ch => value.push(ch),
             }
         }
-        Ok(Expr::from(ast::ExprStringLiteral {
+        Ok(Expr::from(ast::StringLiteral {
             value,
             unicode: false,
-            implicit_concatenated: false,
-            range: self.range(start_location),
+            range: self.range,
         }))
     }
     fn parse_bytes(&mut self) -> Result<StringType, LexicalError> {
         let mut content = String::new();
-        let start_location = self.get_pos();
         while let Some(ch) = self.next_char() {
             match ch {
                 '\\' if !self.kind.is_raw() => {
@@ -274,15 +269,13 @@ impl<'a> StringParser<'a> {
             }
         }
-        Ok(StringType::Bytes(ast::ExprBytesLiteral {
+        Ok(StringType::Bytes(ast::BytesLiteral {
             value: content.chars().map(|c| c as u8).collect::<Vec<u8>>(),
-            implicit_concatenated: false,
-            range: self.range(start_location),
+            range: self.range,
         }))
     }
     fn parse_string(&mut self) -> Result<StringType, LexicalError> {
-        let start_location = self.get_pos();
         let mut value = String::new();
         if self.kind.is_raw() {
@@ -301,11 +294,10 @@ impl<'a> StringParser<'a> {
                 self.parse_escaped_char(&mut value)?;
             }
         }
-        Ok(StringType::Str(ast::ExprStringLiteral {
+        Ok(StringType::Str(ast::StringLiteral {
             value,
             unicode: self.kind.is_unicode(),
-            implicit_concatenated: false,
-            range: self.range(start_location),
+            range: self.range,
         }))
     }
@@ -322,38 +314,37 @@ pub(crate) fn parse_string_literal(
     source: &str,
     kind: StringKind,
     triple_quoted: bool,
-    start_location: TextSize,
+    range: TextRange,
 ) -> Result<StringType, LexicalError> {
-    let start_location = start_location
+    let start_location = range.start()
         + kind.prefix_len()
         + if triple_quoted {
             TextSize::from(3)
         } else {
             TextSize::from(1)
         };
-    StringParser::new(source, kind, start_location).parse()
+    StringParser::new(source, kind, start_location, range).parse()
 }
 pub(crate) fn parse_fstring_middle(
     source: &str,
     is_raw: bool,
-    start_location: TextSize,
+    range: TextRange,
 ) -> Result<Expr, LexicalError> {
     let kind = if is_raw {
         StringKind::RawString
     } else {
         StringKind::String
     };
-    StringParser::new(source, kind, start_location).parse_fstring_middle()
+    StringParser::new(source, kind, range.start(), range).parse_fstring_middle()
 }
-/// Concatenate a list of string literals into a single string expression.
-pub(crate) fn concatenate_strings(
+pub(crate) fn concatenated_strings(
     strings: Vec<StringType>,
     range: TextRange,
 ) -> Result<Expr, LexicalError> {
     #[cfg(debug_assertions)]
-    debug_assert!(!strings.is_empty());
+    debug_assert!(strings.len() > 1);
     let mut has_fstring = false;
     let mut byte_literal_count = 0;
@@ -365,7 +356,6 @@ pub(crate) fn concatenate_strings(
         }
     }
     let has_bytes = byte_literal_count > 0;
-    let implicit_concatenated = strings.len() > 1;
     if has_bytes && byte_literal_count < strings.len() {
         return Err(LexicalError {
@@ -377,111 +367,44 @@ pub(crate) fn concatenate_strings(
     }
     if has_bytes {
-        let mut content: Vec<u8> = vec![];
+        let mut values = Vec::with_capacity(strings.len());
         for string in strings {
             match string {
-                StringType::Bytes(ast::ExprBytesLiteral { value, .. }) => content.extend(value),
+                StringType::Bytes(value) => values.push(value),
                 _ => unreachable!("Unexpected non-bytes literal."),
             }
         }
-        return Ok(ast::ExprBytesLiteral {
-            value: content,
-            implicit_concatenated,
+        return Ok(Expr::from(ast::ExprBytesLiteral {
+            value: ast::BytesLiteralValue::concatenated(values),
             range,
-        }
-        .into());
+        }));
     }
     if !has_fstring {
-        let mut content = String::new();
-        let is_unicode = strings.first().map_or(false, StringType::is_unicode);
+        let mut values = Vec::with_capacity(strings.len());
         for string in strings {
             match string {
-                StringType::Str(ast::ExprStringLiteral { value, .. }) => content.push_str(&value),
+                StringType::Str(value) => values.push(value),
                 _ => unreachable!("Unexpected non-string literal."),
             }
         }
-        return Ok(ast::ExprStringLiteral {
-            value: content,
-            unicode: is_unicode,
-            implicit_concatenated,
+        return Ok(Expr::from(ast::ExprStringLiteral {
+            value: ast::StringLiteralValue::concatenated(values),
             range,
-        }
-        .into());
+        }));
     }
-    // De-duplicate adjacent constants.
-    let mut deduped: Vec<Expr> = vec![];
-    let mut current = String::new();
-    let mut current_start = range.start();
-    let mut current_end = range.end();
-    let mut is_unicode = false;
-    let take_current = |current: &mut String, start, end, unicode| -> Expr {
-        Expr::StringLiteral(ast::ExprStringLiteral {
-            value: std::mem::take(current),
-            unicode,
-            implicit_concatenated,
-            range: TextRange::new(start, end),
-        })
-    };
+    let mut parts = Vec::with_capacity(strings.len());
     for string in strings {
-        let string_range = string.range();
         match string {
-            StringType::FString(ast::ExprFString { values, .. }) => {
-                for value in values {
-                    let value_range = value.range();
-                    match value {
-                        Expr::FormattedValue { .. } => {
-                            if !current.is_empty() {
-                                deduped.push(take_current(
-                                    &mut current,
-                                    current_start,
-                                    current_end,
-                                    is_unicode,
-                                ));
-                            }
-                            deduped.push(value);
-                            is_unicode = false;
-                        }
-                        Expr::StringLiteral(ast::ExprStringLiteral { value, unicode, .. }) => {
-                            if current.is_empty() {
-                                is_unicode |= unicode;
-                                current_start = value_range.start();
-                            }
-                            current_end = value_range.end();
-                            current.push_str(&value);
-                        }
-                        _ => {
-                            unreachable!("Expected `Expr::FormattedValue` or `Expr::StringLiteral`")
-                        }
-                    }
-                }
-            }
-            StringType::Str(ast::ExprStringLiteral { value, unicode, .. }) => {
-                if current.is_empty() {
-                    is_unicode |= unicode;
-                    current_start = string_range.start();
-                }
-                current_end = string_range.end();
-                current.push_str(&value);
-            }
+            StringType::FString(fstring) => parts.push(ast::FStringPart::FString(fstring)),
+            StringType::Str(string) => parts.push(ast::FStringPart::Literal(string)),
             StringType::Bytes(_) => unreachable!("Unexpected bytes literal."),
         }
     }
-    if !current.is_empty() {
-        deduped.push(take_current(
-            &mut current,
-            current_start,
-            current_end,
-            is_unicode,
-        ));
-    }
     Ok(ast::ExprFString {
-        values: deduped,
-        implicit_concatenated,
+        value: ast::FStringValue::concatenated(parts),
         range,
     }
     .into())