ruff/crates/ruff_python_formatter/CONTRIBUTING.md
Ibraheem Ahmed c9dff5c7d5
2025-06-13 08:40:11 -04:00

Contributing to the Ruff Formatter

The goal of our formatter is to be compatible with Black except for rare edge cases (mostly involving comment placement). This document outlines the expected development workflow for the formatter and walks through some of its internals.

Testing your changes

You can use the ruff_python_formatter binary to format individual files and show debug info. It's fast to compile because it doesn't depend on ruff. The easiest way is to create a scratch.py (or scratch.pyi) in the project root and run:

cargo run --bin ruff_python_formatter -- --emit stdout scratch.py

...which supports the --print-ir and --print-comments flags. (We recommend running with --print-comments.)

Usage example

Command

cargo run --bin ruff_python_formatter -- --emit stdout --print-comments --print-ir scratch.py

Input

def f():  # a
    pass

Output

[
  "def f",
  group([group(["()"]), source_position(7)]),
  ":",
  line_suffix(["  # a"]),
  expand_parent,
  indent([hard_line_break, "pass", source_position(21)]),
  hard_line_break,
  source_position(21),
  hard_line_break,
  source_position(22)
]
{
    Node {
        kind: StmtFunctionDef,
        range: 0..21,
        source: `def f(): # a⏎`,
    }: {
        "leading": [],
        "dangling": [
            SourceComment {
                text: "# a",
                position: EndOfLine,
                formatted: true,
            },
        ],
        "trailing": [],
    },
}
def f():  # a
    pass

The other option is to use the playground (also check the playground README):

cd playground && npm install && npm run dev:wasm && npm run dev

Run npm run dev:wasm and reload the page in the browser to refresh.

Tests

Running the entire ruff test suite is slow; cargo test -p ruff_python_formatter is a lot faster. We use insta to create snapshots of all tests in crates/ruff_python_formatter/resources/test/fixtures/ruff. We have copied the majority of tests over from Black to check the difference between Ruff and Black output. Whenever we have no more differences on a Black input file, the snapshot is deleted.

Ecosystem checks

scripts/formatter_ecosystem_checks.sh runs Black compatibility and stability checks on a number of selected projects. It prints the similarity index, the percentage of lines that remain unchanged between Black's formatting and our formatting. You can compute it as the number of neutral lines in a diff divided by the sum of the neutral and the removed lines. We run this script in CI; you can view the results on a PR page under "Checks" > "CI" > "Summary" at the bottom of the page. The stability checks catch three common problems: the second formatting pass looks different from the first (formatter instability or lack of idempotency), the formatter prints invalid syntax (e.g. missing parentheses around multiline expressions), and panics (mostly in debug assertions). You should ensure that your changes don't decrease the similarity index.
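As a rough illustration of the similarity index formula, here is a minimal sketch. The function name similarity_index is hypothetical and is not taken from the script, and treating an empty diff as fully similar is an assumption.

```rust
/// Similarity index: unchanged ("neutral") diff lines divided by the sum of
/// neutral and removed lines, as a fraction in `0.0..=1.0`.
/// Hypothetical helper for illustration, not part of the ecosystem script.
fn similarity_index(neutral_lines: usize, removed_lines: usize) -> f64 {
    let total = neutral_lines + removed_lines;
    if total == 0 {
        // Assumption: an empty diff counts as fully similar.
        return 1.0;
    }
    neutral_lines as f64 / total as f64
}
```

For example, a diff with 3 neutral lines and 1 removed line yields 0.75, i.e. 75% of Black's lines are reproduced unchanged.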

Terminology

For (), [] and {} we use the following terminology:

  • Parentheses: (, ) or all kinds of parentheses ((), [] and {}, e.g. has_own_parentheses)
  • Brackets: [, ]
  • Braces: {, }

format_dev

It's possible to format an entire project:

cargo run --bin ruff_dev -- format-dev --write /path/to/my_project

Available options:

  • --write: Format the files and write them back to disk.
  • --stability-check: Format twice (but don't write to disk without --write) and check for differences and crashes.
  • --multi-project: Treat every subdirectory as a separate project. Useful for ecosystem checks.
  • --error-file: Write all errors to the given file.
  • --log-file: Write all messages to the given file.
  • --stats-file: Use together with --multi-project, this writes the similarity index as unicode table to the given file.
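Conceptually, a stability check formats the source, formats the result again, and compares the two outputs. The following is a minimal sketch of that idea, not the format-dev implementation. The names Stability and check_stability are hypothetical, and the formatter is passed in as a plain closure.

```rust
/// Outcome of a two-pass stability check (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
enum Stability {
    /// The second pass reproduced the first pass exactly.
    Stable,
    /// The second pass differed from the first (lack of idempotency).
    Unstable { first: String, second: String },
    /// The formatter could not process its own output, e.g. because the
    /// first pass printed invalid syntax.
    SecondPassFailed(String),
}

/// Formats `source` twice with `format_source` and compares both outputs.
/// An error in the first pass is returned as-is; it is not a stability issue.
fn check_stability<F>(source: &str, format_source: F) -> Result<Stability, String>
where
    F: Fn(&str) -> Result<String, String>,
{
    let first = format_source(source)?;
    match format_source(&first) {
        Ok(second) if second == first => Ok(Stability::Stable),
        Ok(second) => Ok(Stability::Unstable { first, second }),
        Err(error) => Ok(Stability::SecondPassFailed(error)),
    }
}
```

A toy formatter that only trims trailing whitespace is stable under this check, while one that appends a character on every pass is reported as Unstable.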

Large ecosystem checks

It is also possible to check a large number of repositories. This dataset is large (~60GB), so we only do this occasionally:

# Get the list of projects
curl https://raw.githubusercontent.com/akx/ruff-usage-aggregate/master/data/known-github-tomls-clean.jsonl > github_search.jsonl
# Repurpose this script to download the repositories for us
python scripts/check_ecosystem.py --checkouts target/checkouts --projects github_search.jsonl -v $(which true) $(which true)
# Check each project for formatter stability
cargo run --bin ruff_dev -- format-dev --stability-check --error-file target/formatter-ecosystem-errors.txt --multi-project target/checkouts

Helper structs

To abstract formatting something into a helper, create a new struct with the data you want to format and implement Format<PyFormatContext<'_>> for MyStruct. Below is a small dummy example.

/// Helper to hide the fields for the struct
pub(crate) fn empty_parenthesized<'content>(
    comments: &'content [SourceComment],
    has_plus_prefix: bool,
) -> FormatEmptyParenthesized<'content> {
    FormatEmptyParenthesized {
        comments,
        has_plus_prefix,
    }
}

/// The wrapper struct
pub(crate) struct FormatEmptyParenthesized<'content> {
    comments: &'content [SourceComment],
    has_plus_prefix: bool,
}

impl Format<PyFormatContext<'_>> for FormatEmptyParenthesized<'_> {
    /// Here we implement the actual formatting
    fn fmt(&self, f: &mut Formatter<PyFormatContext>) -> FormatResult<()> {
        if self.has_plus_prefix {
            text("+").fmt(f)?; // This is equivalent to `write!(f, [text("+")])?;`
        }
        write!(
            f,
            [
                text("("),
                soft_block_indent(&dangling_comments(&self.comments)),
                text(")")
            ]
        )
    }
}

If the struct is used across modules, also add a constructor function that hides the fields of the struct. Since it implements Format, you can directly use it in write calls:

write!(f, [empty_parenthesized(dangling_end_of_line_comments, false)])?;

Check the builders module for existing primitives.

Adding new syntax

Occasionally, Python will add new syntax. After adding it to ruff_python_ast, run generate.py to generate stubs for node formatting. This will add a Format{{Node}} struct that implements Default (and AsFormat/IntoFormat impls in generated.rs, see orphan rules below).

#[derive(Default)]
pub struct FormatStmtReturn;

We implement FormatNodeRule<{{Node}}> for Format{{Node}}. Inside, we destructure the item to make sure we're not missing any field. If we want to write multiple items, we use an efficient write! call; for single items, .format().fmt(f) or .fmt(f) is sufficient.

impl FormatNodeRule<StmtReturn> for FormatStmtReturn {
    fn fmt_fields(&self, item: &StmtReturn, f: &mut PyFormatter) -> FormatResult<()> {
        // Here we destructure item and make sure each field is listed.
        // Fields we don't need, such as `range`, are explicitly ignored with `_`
        let StmtReturn { range: _, node_index: _, value } = item;
        // Implement some formatting logic, in this case no space (and no value) after a return with
        // no value
        if let Some(value) = value {
            write!(
                f,
                [
                    text("return"),
                    // There are multiple different space and newline types (e.g.
                    // `soft_line_break_or_space()`, check the builders module), this one will
                    // always be translated to a normal ASCII whitespace character
                    space(),
                    // `return a, b` is valid, but if it wraps we'd need parentheses.
                    // This is different from `(a, b).count(1)` where the parentheses around the
                    // tuple are mandatory
                    value.format().with_options(Parenthesize::IfBreaks)
                ]
            )
        } else {
            text("return").fmt(f)
        }
    }
}

If something such as a list or a tuple can break into multiple lines when it is too long for a single line, wrap it in a group. Ignoring comments, we could format a tuple with two items like this:

write!(
    f,
    [group(&format_args![
        text("("),
        soft_block_indent(&format_args![
            item1.format(),
            text(","),
            soft_line_break_or_space(),
            item2.format(),
            if_group_breaks(&text(","))
        ]),
        text(")")
    ])]
)

If everything fits on a single line, the group doesn't break and we get something like ("a", "b"). If it doesn't, we get something like

(
    "a",
    "b",
)
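The fits-or-breaks behavior of a group can be modeled in a few lines. The sketch below is a toy model, not the printer from the formatter crate: format_tuple is a hypothetical function that works on plain strings, takes the line width as a parameter, and assumes a four-space indent.

```rust
/// Toy model of `group` + `soft_block_indent` + `if_group_breaks` for a tuple.
/// Hypothetical helper for illustration only.
fn format_tuple(items: &[&str], line_width: usize) -> String {
    // Flat layout: every soft line break is printed as a space (or nothing)
    // and the `if_group_breaks` content (the trailing comma) is omitted.
    let flat = format!("({})", items.join(", "));
    if flat.chars().count() <= line_width {
        return flat;
    }

    // Expanded layout: all soft line breaks of the group become newlines,
    // the content is indented, and the trailing comma is printed.
    let mut expanded = String::from("(\n");
    for item in items {
        expanded.push_str("    ");
        expanded.push_str(item);
        expanded.push_str(",\n");
    }
    expanded.push(')');
    expanded
}
```

The key property this models is that a group breaks as a unit: either every soft line break in it stays flat, or all of them expand together.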

For a list of expressions, you don't need to format it manually but can use the JoinBuilder util, accessible through .join_comma_separated. Calling finish writes to the formatter internally.

f.join_comma_separated(item.end())
    .nodes(elts.iter())
    .finish()
// Here we have a builder that separates each element by a `,` and a [`soft_line_break_or_space`].
// It emits a trailing `,` that is only shown if the enclosing group expands. It forces the enclosing
// group to expand if the last item has a trailing comma and the magic trailing comma option is enabled.

If you need to avoid a second mutable borrow with a builder, you can use format_with(|f| { ... }) as a formattable element similar to text() or group().

Comments

Comments can either be own line or end-of-line and can be marked as Leading, Trailing and Dangling.

# Leading comment (always own line)
print("hello world")  # Trailing comment (end-of-line)
# Trailing comment (own line)

Comments are automatically attached as Leading or Trailing to a node close to them, or as Dangling if there are only tokens and no nodes surrounding them. Categorization is automatic but sometimes needs to be overridden in place_comment in placement.rs, which this section is about.

[
    # This needs to be handled as a dangling comment
]

Here, the comment is dangling because it is preceded by [, which is a non-trivia token but not a node, and followed by ], which is also a non-trivia token but not a node. In the FormatExprList implementation, we have to call dangling_comments manually and stub out the fmt_dangling_comments default from FormatNodeRule.

impl FormatNodeRule<ExprList> for FormatExprList {
    fn fmt_fields(&self, item: &ExprList, f: &mut PyFormatter) -> FormatResult<()> {
        // ...

        write!(
            f,
            [group(&format_args![
                text("["),
                dangling_comments(dangling), // Gets all the comments marked as dangling for the node
                soft_block_indent(&items),
                text("]")
            ])]
        )
    }

    fn fmt_dangling_comments(&self, _node: &ExprList, _f: &mut PyFormatter) -> FormatResult<()> {
        // Handled as part of `fmt_fields`
        Ok(())
    }
}
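The default categorization described at the start of this section can be approximated with a small decision function. This is a simplified model and an assumption about the defaults, not the logic in placement.rs: it only looks at whether a node precedes or follows the comment inside the enclosing node, and all names (LinePosition, Placement, default_placement) are hypothetical.

```rust
/// Whether the comment sits on its own line or at the end of a code line.
#[derive(Debug, Clone, Copy, PartialEq)]
enum LinePosition {
    OwnLine,
    EndOfLine,
}

/// How a comment is attached to a node.
#[derive(Debug, PartialEq)]
enum Placement {
    Leading,
    Trailing,
    Dangling,
}

/// Simplified default: end-of-line comments prefer the preceding node,
/// own-line comments prefer the following node, and comments surrounded
/// only by tokens dangle on the enclosing node.
fn default_placement(
    position: LinePosition,
    has_preceding_node: bool,
    has_following_node: bool,
) -> Placement {
    match (position, has_preceding_node, has_following_node) {
        (_, false, false) => Placement::Dangling,
        (LinePosition::EndOfLine, true, _) => Placement::Trailing,
        (LinePosition::EndOfLine, false, true) => Placement::Leading,
        (LinePosition::OwnLine, _, true) => Placement::Leading,
        (LinePosition::OwnLine, true, false) => Placement::Trailing,
    }
}
```

Under this model, the comment inside the empty list has neither a preceding nor a following node, so it ends up as Dangling, which is why it has to be formatted manually.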

A related common challenge is that we want to attach comments to tokens (think keywords and syntactically meaningful characters such as :) that have no node on their own. A slightly simplified version of the while node in our AST looks like the following:

pub struct StmtWhile {
    pub range: TextRange,
    pub test: Box<Expr<TextRange>>,
    pub body: Vec<Stmt<TextRange>>,
    pub orelse: Vec<Stmt<TextRange>>,
}

That means in

while True:  # Trailing condition comment
    if f():
        break
    # trailing while comment
# leading else comment
else:
    print("while-else")

the else has no node, we're just getting the statements in its body.

The preceding token of the leading else comment is the break, which has a node; the following token is the else, which lacks a node. By default, the comment would therefore be marked as trailing the break and wrongly formatted as such. We can identify these cases by looking for comments between two bodies that have the same indentation level as the keyword, e.g. in our case the leading else comment is inside the while node (which spans the entire snippet) and on the same level as the else. We identify those cases in handle_own_line_comment_around_body and mark them as dangling for manual formatting later. Similarly, we find and mark comments after the colon(s) in handle_end_of_line_comment_around_body.

The comments don't carry any extra information such as why we marked the comment as trailing, instead they are sorted into one list of leading, one list of trailing and one list of dangling comments per node. In FormatStmtWhile, we can have multiple types of dangling comments, so we have to split the dangling list into after-colon-comments, before-else-comments, etc. by some element separating them (e.g. all comments trailing the colon come before the first statement in the body) and manually insert them in the right position.

A simplified implementation with only those two kinds of comments:

fn fmt_fields(&self, item: &StmtWhile, f: &mut PyFormatter) -> FormatResult<()> {

    // ...

    // See FormatStmtWhile for the real, more complex implementation
    let first_while_body_stmt = item.body.first().unwrap().end();
    let trailing_condition_comments_end =
        dangling_comments.partition_point(|comment| comment.slice().end() < first_while_body_stmt);
    let (trailing_condition_comments, or_else_comments) =
        dangling_comments.split_at(trailing_condition_comments_end);

    write!(
        f,
        [
            text("while"),
            space(),
            test.format(),
            text(":"),
            trailing_comments(trailing_condition_comments),
            block_indent(&body.format()),
            leading_comments(or_else_comments),
            text("else:"),
            block_indent(&orelse.format())
        ]
    )
}
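The splitting step relies on the dangling comments being stored in source order, so that partition_point can find the boundary with a binary search. A self-contained sketch with plain offsets instead of real comment and node types (Comment and split_dangling are hypothetical stand-ins, and the offsets in the usage note are made up):

```rust
/// Stand-in for a source comment: only its end offset and text.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Comment {
    end: u32,
    text: &'static str,
}

/// Splits dangling comments (sorted by position) into those that end before
/// `first_body_stmt_end` and the rest.
fn split_dangling(dangling: &[Comment], first_body_stmt_end: u32) -> (&[Comment], &[Comment]) {
    // `partition_point` returns the index of the first comment for which the
    // predicate is false; this is only correct because the slice is sorted.
    let split = dangling.partition_point(|comment| comment.end < first_body_stmt_end);
    dangling.split_at(split)
}
```

With one comment ending at offset 41, another at offset 110, and a boundary of 60, the first slice holds the trailing condition comment and the second holds the leading else comment.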

The orphan rules and trait structure

For the formatter, we would like to implement Format from the ruff_formatter crate for all AST nodes, which are defined in the ruff_python_ast crate. This violates Rust's orphan rules. We therefore generate in generate.py a newtype for each AST node with implementations of FormatNodeRule, FormatRule, AsFormat and IntoFormat on it.

[Figure: excalidraw diagram showing the relationships between the different types]
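To make the workaround more concrete, here is a heavily simplified, single-file sketch of the pattern. The two modules only stand in for the foreign crates (within one file the orphan rule does not actually fire), output goes to a String instead of a formatter, and the trait definitions are assumptions that are much smaller than the real ones.

```rust
// Stand-in for the formatter crate (foreign from the formatter's view).
mod formatter_crate {
    pub trait Format {
        fn fmt(&self, out: &mut String);
    }

    /// A rule knows how to format items of type `T`.
    pub trait FormatRule<T> {
        fn fmt(&self, item: &T, out: &mut String);
    }

    /// Pairs a borrowed item with a rule and implements `Format` for the pair.
    pub struct FormatRefWithRule<'a, T, R> {
        pub item: &'a T,
        pub rule: R,
    }

    impl<T, R: FormatRule<T>> Format for FormatRefWithRule<'_, T, R> {
        fn fmt(&self, out: &mut String) {
            FormatRule::fmt(&self.rule, self.item, out);
        }
    }
}

// Stand-in for the AST crate (also foreign).
mod ast_crate {
    pub struct StmtReturn {
        pub value: Option<String>,
    }
}

use ast_crate::StmtReturn;
use formatter_crate::{Format, FormatRefWithRule, FormatRule};

// Across real crates, `impl Format for StmtReturn` would be rejected because
// both the trait and the type are foreign. The rule struct below is local,
// so implementing the foreign `FormatRule` for it is allowed.
#[derive(Default)]
pub struct FormatStmtReturn;

impl FormatRule<StmtReturn> for FormatStmtReturn {
    fn fmt(&self, item: &StmtReturn, out: &mut String) {
        out.push_str("return");
        if let Some(value) = &item.value {
            out.push(' ');
            out.push_str(value);
        }
    }
}

/// Local trait, so it may be implemented for the foreign node type.
/// Simplified: it maps a node to its rule via an associated type.
pub trait AsFormat: Sized {
    type Rule: FormatRule<Self> + Default;

    fn format(&self) -> FormatRefWithRule<'_, Self, Self::Rule> {
        FormatRefWithRule { item: self, rule: Self::Rule::default() }
    }
}

impl AsFormat for StmtReturn {
    type Rule = FormatStmtReturn;
}

/// Small helper that renders a node to a string via `.format().fmt(..)`.
fn render(stmt: &StmtReturn) -> String {
    let mut out = String::new();
    stmt.format().fmt(&mut out);
    out
}
```

The point of the indirection is that every impl involves at least one local item: the rule struct for FormatRule, and the AsFormat trait for the node type.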