ruff/scripts/check_ecosystem.py
Charlie Marsh cc822082a7
Refactor noqa directive parsing away from regex-based implementation (#5554)
## Summary

I'll write up a more detailed description tomorrow, but in short, this
PR removes our regex-based implementation in favor of "manual" parsing.

I tried a couple different implementations. In the benchmarks below:

- `Directive/Regex` is our implementation on `main`.
- `Directive/Find` just uses `text.find("noqa")`, which is insufficient,
since it doesn't cover case-insensitive variants like `NOQA`, and
doesn't handle multiple `noqa` matches in a single line, like ` # Here's
a noqa comment # noqa: F401`. But it's kind of a baseline.
- `Directive/Memchr` uses three `memchr` iterative finders (one each for
`noqa`, `NOQA`, and `NoQA`).
- `Directive/AhoCorasick` is roughly the variant checked-in here.
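
To make the two gaps in the `Find` baseline concrete, here is a small Python sketch (not the Rust code from this PR; the function names `first_match`, `all_matches`, and `directive_offsets` are made up for illustration). It contrasts a single lowercase `find` with a scan that visits every case-insensitive occurrence and keeps only those that directly follow a `#`.

```python
def first_match(line: str) -> int:
    """Baseline: offset of the first lowercase "noqa", or -1 if absent."""
    return line.find("noqa")


def all_matches(line: str) -> list[int]:
    """Offsets of every occurrence of "noqa", ignoring case."""
    return [i for i in range(len(line) - 3) if line[i : i + 4].lower() == "noqa"]


def directive_offsets(line: str) -> list[int]:
    """Keep only occurrences preceded by '#' and optional whitespace."""
    return [
        offset
        for offset in all_matches(line)
        if line[:offset].rstrip().endswith("#")
    ]


line = "x = 1  # Here's a noqa comment # noqa: F401"
# The baseline stops at the prose "noqa", which is not a directive.
print(first_match(line) == line.index("noqa"))
# Scanning every occurrence and filtering finds the real directive.
print(directive_offsets(line) == [line.rindex("noqa")])
# The lowercase-only baseline misses uppercase variants entirely.
print(first_match("import os  # NOQA"))
```

The filter on the preceding `#` is an assumption about what counts as a directive for the purpose of this sketch; the checked-in implementation may apply different rules.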

The raw results:

```
Directive/Regex/# noqa: F401
                        time:   [273.69 ns 274.71 ns 276.03 ns]
                        change: [+1.4467% +1.8979% +2.4243%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 15 outliers among 100 measurements (15.00%)
  3 (3.00%) low mild
  8 (8.00%) high mild
  4 (4.00%) high severe
Directive/Find/# noqa: F401
                        time:   [66.972 ns 67.048 ns 67.132 ns]
                        change: [+2.8292% +2.9377% +3.0540%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 15 outliers among 100 measurements (15.00%)
  1 (1.00%) low severe
  3 (3.00%) low mild
  8 (8.00%) high mild
  3 (3.00%) high severe
Directive/AhoCorasick/# noqa: F401
                        time:   [76.922 ns 77.189 ns 77.536 ns]
                        change: [+0.4265% +0.6862% +0.9871%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  4 (4.00%) high severe
Directive/Memchr/# noqa: F401
                        time:   [62.627 ns 62.654 ns 62.679 ns]
                        change: [-0.1780% -0.0887% -0.0120%] (p = 0.03 < 0.05)
                        Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low severe
  5 (5.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe
Directive/Regex/# noqa: F401, F841
                        time:   [321.83 ns 322.39 ns 322.93 ns]
                        change: [+8602.4% +8623.5% +8644.5%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe
Directive/Find/# noqa: F401, F841
                        time:   [78.618 ns 78.758 ns 78.896 ns]
                        change: [+1.6909% +1.8771% +2.0628%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
Directive/AhoCorasick/# noqa: F401, F841
                        time:   [87.739 ns 88.057 ns 88.468 ns]
                        change: [+0.1843% +0.4685% +0.7854%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) low mild
  3 (3.00%) high mild
  3 (3.00%) high severe
Directive/Memchr/# noqa: F401, F841
                        time:   [80.674 ns 80.774 ns 80.860 ns]
                        change: [-0.7343% -0.5633% -0.4031%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
  4 (4.00%) low severe
  9 (9.00%) low mild
  1 (1.00%) high mild
Directive/Regex/# noqa  time:   [194.86 ns 195.93 ns 196.97 ns]
                        change: [+11973% +12039% +12103%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) low mild
  1 (1.00%) high mild
Directive/Find/# noqa   time:   [25.327 ns 25.354 ns 25.383 ns]
                        change: [+3.8524% +4.0267% +4.1845%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  6 (6.00%) high mild
  3 (3.00%) high severe
Directive/AhoCorasick/# noqa
                        time:   [34.267 ns 34.368 ns 34.481 ns]
                        change: [+0.5646% +0.8505% +1.1281%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild
Directive/Memchr/# noqa time:   [21.770 ns 21.818 ns 21.874 ns]
                        change: [-0.0990% +0.1464% +0.4046%] (p = 0.26 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe
Directive/Regex/# type: ignore # noqa: E501
                        time:   [278.76 ns 279.69 ns 280.72 ns]
                        change: [+7449.4% +7469.8% +7490.5%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe
Directive/Find/# type: ignore # noqa: E501
                        time:   [67.791 ns 67.976 ns 68.184 ns]
                        change: [+2.8321% +3.1735% +3.5418%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe
Directive/AhoCorasick/# type: ignore # noqa: E501
                        time:   [75.908 ns 76.055 ns 76.210 ns]
                        change: [+0.9269% +1.1427% +1.3955%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe
Directive/Memchr/# type: ignore # noqa: E501
                        time:   [72.549 ns 72.723 ns 72.957 ns]
                        change: [+1.5881% +1.9660% +2.3974%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 15 outliers among 100 measurements (15.00%)
  10 (10.00%) high mild
  5 (5.00%) high severe
Directive/Regex/# type: ignore # nosec
                        time:   [66.967 ns 67.075 ns 67.207 ns]
                        change: [+1713.0% +1715.8% +1718.9%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low severe
  3 (3.00%) low mild
  2 (2.00%) high mild
  4 (4.00%) high severe
Directive/Find/# type: ignore # nosec
                        time:   [18.505 ns 18.548 ns 18.597 ns]
                        change: [+1.3520% +1.6976% +2.0333%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild
Directive/AhoCorasick/# type: ignore # nosec
                        time:   [16.162 ns 16.206 ns 16.252 ns]
                        change: [+1.2919% +1.5587% +1.8430%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
Directive/Memchr/# type: ignore # nosec
                        time:   [39.192 ns 39.233 ns 39.276 ns]
                        change: [+0.5164% +0.7456% +0.9790%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
  2 (2.00%) low severe
  4 (4.00%) low mild
  3 (3.00%) high mild
  4 (4.00%) high severe
Directive/Regex/# some very long comment that # is interspersed with characters but # no directive
                        time:   [81.460 ns 81.578 ns 81.703 ns]
                        change: [+2093.3% +2098.8% +2104.2%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
Directive/Find/# some very long comment that # is interspersed with characters but # no directive
                        time:   [26.284 ns 26.331 ns 26.387 ns]
                        change: [+0.7554% +1.1027% +1.3832%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe
Directive/AhoCorasick/# some very long comment that # is interspersed with characters but # no direc...
                        time:   [28.643 ns 28.714 ns 28.787 ns]
                        change: [+1.3774% +1.6780% +2.0028%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
Directive/Memchr/# some very long comment that # is interspersed with characters but # no directive
                        time:   [55.766 ns 55.831 ns 55.897 ns]
                        change: [+1.5802% +1.7476% +1.9021%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) low mild
```

While memchr is faster than aho-corasick in some of the common cases
(like `# noqa: F401`), the latter is way, way faster when there _isn't_
a match (like 2x faster -- see the last two cases). Since most comments
_aren't_ `noqa` comments, this felt like the right tradeoff. Note that
all implementations are significantly faster than the regex version.

(I know I originally reported a 10x speedup, but I ended up improving
the regex version a bit in some prior PRs, so it got unintentionally
faster via some refactors.)

There's also one behavior change in here, which is that we now allow
variable spaces between the `#` and `noqa`, e.g., `#noqa` (no space) or
`#  noqa` (multiple spaces). Previously, we required exactly one space.
This thus closes #5177.
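
As a rough illustration of the relaxed whitespace rule, the following Python sketch parses a comment without regular expressions. It is not the Rust implementation from this PR: `parse_noqa` is a hypothetical helper, and the rule used to recognize a code (starts with an uppercase letter, ends with a digit) is a simplifying assumption.

```python
from __future__ import annotations


def parse_noqa(comment: str) -> list[str] | None:
    """Return None when no directive is found, [] for a blanket `noqa`,
    or the listed codes for a directive like `noqa: F401, F841`."""
    for i in range(len(comment) - 3):
        if comment[i : i + 4].lower() != "noqa":
            continue
        # Accept any amount of whitespace (including none) after the '#'.
        if not comment[:i].rstrip().endswith("#"):
            continue
        rest = comment[i + 4 :].lstrip()
        if not rest.startswith(":"):
            return []
        codes: list[str] = []
        # Codes may be separated by commas and/or whitespace.
        for token in rest[1:].replace(",", " ").split():
            # Assumed code shape: leading uppercase letter, trailing digit.
            if not (token[0].isupper() and token[-1].isdigit()):
                break
            codes.append(token)
        return codes
    return None


for comment in ["#noqa", "#   noqa", "# NOQA: F401, F841", "# type: ignore # nosec"]:
    print(repr(comment), "->", parse_noqa(comment))
```

The loop visits every occurrence of `noqa`, so a comment such as `# Here's a noqa comment # noqa: F401` still resolves to the trailing directive.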
2023-07-06 16:03:10 +00:00

469 lines
15 KiB
Python
Executable file

#!/usr/bin/env python3
"""Check two versions of ruff against a corpus of open-source code.

Example usage:

    scripts/check_ecosystem.py <path/to/ruff1> <path/to/ruff2>
"""

from __future__ import annotations

import argparse
import asyncio
import difflib
import heapq
import json
import logging
import re
import tempfile
import time
from asyncio.subprocess import PIPE, create_subprocess_exec
from contextlib import asynccontextmanager, nullcontext
from pathlib import Path
from signal import SIGINT, SIGTERM
from typing import TYPE_CHECKING, NamedTuple, Self, TypeVar

if TYPE_CHECKING:
    from collections.abc import AsyncIterator, Iterator, Sequence

logger = logging.getLogger(__name__)


class Repository(NamedTuple):
    """A GitHub repository at a specific ref."""

    org: str
    repo: str
    ref: str | None
    select: str = ""
    ignore: str = ""
    exclude: str = ""
    # Generating fixes is slow and verbose
    show_fixes: bool = False

    @asynccontextmanager
    async def clone(self: Self, checkout_dir: Path) -> AsyncIterator[Path]:
        """Shallow clone this repository to a temporary directory."""
        if checkout_dir.exists():
            logger.debug(f"Reusing {self.org}:{self.repo}")
            yield Path(checkout_dir)
            return

        logger.debug(f"Cloning {self.org}:{self.repo}")
        git_command = [
            "git",
            "clone",
            "--config",
            "advice.detachedHead=false",
            "--quiet",
            "--depth",
            "1",
            "--no-tags",
        ]
        if self.ref:
            git_command.extend(["--branch", self.ref])

        git_command.extend(
            [
                f"https://github.com/{self.org}/{self.repo}",
                checkout_dir,
            ],
        )

        process = await create_subprocess_exec(
            *git_command,
            env={"GIT_TERMINAL_PROMPT": "0"},
        )

        status_code = await process.wait()

        logger.debug(
            f"Finished cloning {self.org}/{self.repo} with status {status_code}",
        )
        yield Path(checkout_dir)


REPOSITORIES: list[Repository] = [
    Repository("apache", "airflow", "main", select="ALL"),
    Repository("bokeh", "bokeh", "branch-3.2", select="ALL"),
    Repository("pypa", "build", "main"),
    Repository("pypa", "cibuildwheel", "main"),
    Repository("pypa", "setuptools", "main"),
    Repository("pypa", "pip", "main"),
    Repository("python", "mypy", "master"),
    Repository("DisnakeDev", "disnake", "master"),
    Repository("scikit-build", "scikit-build", "main"),
    Repository("scikit-build", "scikit-build-core", "main"),
    Repository("python", "typeshed", "main", select="PYI"),
    Repository("zulip", "zulip", "main", select="ALL"),
]

SUMMARY_LINE_RE = re.compile(r"^(Found \d+ error.*)|(.*potentially fixable with.*)$")


class RuffError(Exception):
    """An error reported by ruff."""


async def check(
    *,
    ruff: Path,
    path: Path,
    name: str,
    select: str = "",
    ignore: str = "",
    exclude: str = "",
    show_fixes: bool = False,
) -> Sequence[str]:
    """Run the given ruff binary against the specified path."""
    logger.debug(f"Checking {name} with {ruff}")
    ruff_args = ["check", "--no-cache", "--exit-zero"]
    if select:
        ruff_args.extend(["--select", select])
    if ignore:
        ruff_args.extend(["--ignore", ignore])
    if exclude:
        ruff_args.extend(["--exclude", exclude])
    if show_fixes:
        ruff_args.extend(["--show-fixes", "--ecosystem-ci"])

    start = time.time()
    proc = await create_subprocess_exec(
        ruff.absolute(),
        *ruff_args,
        ".",
        stdout=PIPE,
        stderr=PIPE,
        cwd=path,
    )
    result, err = await proc.communicate()
    end = time.time()

    logger.debug(f"Finished checking {name} with {ruff} in {end - start:.2f}")

    if proc.returncode != 0:
        raise RuffError(err.decode("utf8"))

    # Drop summary lines so that only individual diagnostics are compared.
    lines = [
        line
        for line in result.decode("utf8").splitlines()
        if not SUMMARY_LINE_RE.match(line)
    ]

    return sorted(lines)


class Diff(NamedTuple):
    """A diff between two runs of ruff."""

    removed: set[str]
    added: set[str]

    def __bool__(self: Self) -> bool:
        """Return true if this diff is non-empty."""
        return bool(self.removed or self.added)

    def __iter__(self: Self) -> Iterator[str]:
        """Iterate through the changed lines in diff format."""
        for line in heapq.merge(sorted(self.removed), sorted(self.added)):
            if line in self.removed:
                yield f"- {line}"
            else:
                yield f"+ {line}"


async def compare(
    ruff1: Path,
    ruff2: Path,
    repo: Repository,
    checkouts: Path | None = None,
) -> Diff | None:
    """Check a specific repository against two versions of ruff."""
    removed, added = set(), set()

    # By default, the git clones are transient, but if the user provides a
    # directory for permanent storage, we keep them there
    if checkouts:
        location_context = nullcontext(checkouts)
    else:
        location_context = tempfile.TemporaryDirectory()

    with location_context as checkout_parent:
        assert ":" not in repo.org
        assert ":" not in repo.repo
        checkout_dir = Path(checkout_parent).joinpath(f"{repo.org}:{repo.repo}")
        async with repo.clone(checkout_dir) as path:
            try:
                # Run both ruff binaries concurrently on the same checkout.
                async with asyncio.TaskGroup() as tg:
                    check1 = tg.create_task(
                        check(
                            ruff=ruff1,
                            path=path,
                            name=f"{repo.org}/{repo.repo}",
                            select=repo.select,
                            ignore=repo.ignore,
                            exclude=repo.exclude,
                            show_fixes=repo.show_fixes,
                        ),
                    )
                    check2 = tg.create_task(
                        check(
                            ruff=ruff2,
                            path=path,
                            name=f"{repo.org}/{repo.repo}",
                            select=repo.select,
                            ignore=repo.ignore,
                            exclude=repo.exclude,
                            show_fixes=repo.show_fixes,
                        ),
                    )
            except ExceptionGroup as e:
                raise e.exceptions[0] from e

            for line in difflib.ndiff(check1.result(), check2.result()):
                if line.startswith("- "):
                    removed.add(line[2:])
                elif line.startswith("+ "):
                    added.add(line[2:])

    return Diff(removed, added)


def read_projects_jsonl(projects_jsonl: Path) -> dict[tuple[str, str], Repository]:
    """Read either of the two formats of https://github.com/akx/ruff-usage-aggregate."""
    repositories = {}
    for line in projects_jsonl.read_text().splitlines():
        data = json.loads(line)
        # Check the input format.
        if "items" in data:
            for item in data["items"]:
                # Pick only the easier case for now.
                if item["path"] != "pyproject.toml":
                    continue
                repository = item["repository"]
                assert re.fullmatch(r"[a-zA-Z0-9_.-]+", repository["name"]), repository[
                    "name"
                ]
                # GitHub doesn't give us any branch or pure rev info. This would give
                # us the revision, but there's no way with git to just do
                # `git clone --depth 1` with a specific ref.
                # `ref = item["url"].split("?ref=")[1]` would be exact
                # Key on the owner login and repository name: `owner` is a
                # nested object here, so it cannot be used as a dict key itself.
                owner = repository["owner"]["login"]
                repositories[(owner, repository["name"])] = Repository(
                    owner,
                    repository["name"],
                    None,
                    select=repository.get("select"),
                    ignore=repository.get("ignore"),
                    exclude=repository.get("exclude"),
                )
        else:
            assert "owner" in data, "Unknown ruff-usage-aggregate format"
            # Pick only the easier case for now.
            if data["path"] != "pyproject.toml":
                continue
            repositories[(data["owner"], data["repo"])] = Repository(
                data["owner"],
                data["repo"],
                data.get("ref"),
                select=data.get("select"),
                ignore=data.get("ignore"),
                exclude=data.get("exclude"),
            )
    return repositories


T = TypeVar("T")


async def main(
    *,
    ruff1: Path,
    ruff2: Path,
    projects_jsonl: Path | None,
    checkouts: Path | None = None,
) -> None:
    """Check two versions of ruff against a corpus of open-source code."""
    if projects_jsonl:
        repositories = read_projects_jsonl(projects_jsonl)
    else:
        repositories = {(repo.org, repo.repo): repo for repo in REPOSITORIES}

    logger.debug(f"Checking {len(repositories)} projects")

    # https://stackoverflow.com/a/61478547/3549270
    # Otherwise doing 3k repositories can take >8GB RAM
    semaphore = asyncio.Semaphore(50)

    async def limited_parallelism(coroutine: T) -> T:
        async with semaphore:
            return await coroutine

    results = await asyncio.gather(
        *[
            limited_parallelism(compare(ruff1, ruff2, repo, checkouts))
            for repo in repositories.values()
        ],
        return_exceptions=True,
    )

    diffs = dict(zip(repositories, results, strict=True))

    total_removed = total_added = 0
    errors = 0

    for diff in diffs.values():
        if isinstance(diff, Exception):
            errors += 1
        else:
            total_removed += len(diff.removed)
            total_added += len(diff.added)

    if total_removed == 0 and total_added == 0 and errors == 0:
        print("\u2705 ecosystem check detected no changes.")
    else:
        rule_changes: dict[str, tuple[int, int]] = {}
        changes = f"(+{total_added}, -{total_removed}, {errors} error(s))"

        print(f"\u2139\ufe0f ecosystem check **detected changes**. {changes}")
        print()

        for (org, repo), diff in diffs.items():
            if isinstance(diff, Exception):
                changes = "error"
                print(f"<details><summary>{repo} ({changes})</summary>")
                repo = repositories[(org, repo)]
                print(
                    f"https://github.com/{repo.org}/{repo.repo} ref {repo.ref} "
                    f"select {repo.select} ignore {repo.ignore} exclude {repo.exclude}",
                )
                print("<p>")
                print()

                print("```")
                print(str(diff))
                print("```")

                print()
                print("</p>")
                print("</details>")
            elif diff:
                changes = f"+{len(diff.added)}, -{len(diff.removed)}"
                print(f"<details><summary>{repo} ({changes})</summary>")
                print("<p>")
                print()

                diff_str = "\n".join(diff)

                print("```diff")
                print(diff_str)
                print("```")

                print()
                print("</p>")
                print("</details>")

                # Count rule changes
                for line in diff_str.splitlines():
                    # Find rule change for current line or construction
                    # + <rule>/<path>:<line>:<column>: <rule_code> <message>
                    matches = re.search(r": ([A-Z]{1,3}[0-9]{3,4})", line)

                    if matches is None:
                        # Handle case where there are no regex matches e.g.
                        # + "?application=AIRFLOW&authenticator=TEST_AUTH&role=TEST_ROLE&warehouse=TEST_WAREHOUSE" # noqa: E501, ERA001
                        # Which was found in local testing
                        continue

                    rule_code = matches.group(1)

                    # Get current additions and removals for this rule
                    current_changes = rule_changes.get(rule_code, (0, 0))

                    # Check if addition or removal depending on the first character
                    if line[0] == "+":
                        current_changes = (current_changes[0] + 1, current_changes[1])
                    elif line[0] == "-":
                        current_changes = (current_changes[0], current_changes[1] + 1)

                    rule_changes[rule_code] = current_changes

            else:
                continue

        if len(rule_changes.keys()) > 0:
            print(f"Rules changed: {len(rule_changes.keys())}")
            print()
            print("| Rule | Changes | Additions | Removals |")
            print("| ---- | ------- | --------- | -------- |")
            for rule, (additions, removals) in sorted(
                rule_changes.items(),
                key=lambda x: (x[1][0] + x[1][1]),
                reverse=True,
            ):
                print(f"| {rule} | {additions + removals} | {additions} | {removals} |")

    logger.debug(f"Finished {len(repositories)} repositories")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Check two versions of ruff against a corpus of open-source code.",
        epilog="scripts/check_ecosystem.py <path/to/ruff1> <path/to/ruff2>",
    )

    parser.add_argument(
        "--projects",
        type=Path,
        help=(
            "Optional JSON files to use over the default repositories. "
            "Supports both github_search_*.jsonl and known-github-tomls.jsonl."
        ),
    )
    parser.add_argument(
        "--checkouts",
        type=Path,
        help=(
            "Location for the git checkouts, in case you want to save them"
            " (defaults to temporary directory)"
        ),
    )
    parser.add_argument(
        "-v",
        "--verbose",
        action="store_true",
        help="Activate debug logging",
    )
    parser.add_argument(
        "ruff1",
        type=Path,
    )
    parser.add_argument(
        "ruff2",
        type=Path,
    )

    args = parser.parse_args()

    if args.verbose:
        logging.basicConfig(level=logging.DEBUG)
    else:
        logging.basicConfig(level=logging.INFO)

    loop = asyncio.get_event_loop()
    if args.checkouts:
        args.checkouts.mkdir(exist_ok=True, parents=True)
    main_task = asyncio.ensure_future(
        main(
            ruff1=args.ruff1,
            ruff2=args.ruff2,
            projects_jsonl=args.projects,
            checkouts=args.checkouts,
        ),
    )
    # https://stackoverflow.com/a/58840987/3549270
    for signal in [SIGINT, SIGTERM]:
        loop.add_signal_handler(signal, main_task.cancel)
    try:
        loop.run_until_complete(main_task)
    finally:
        loop.close()