I've tried to investigate puffin's performance wrt to builds and
parallelism in general, but found the previous instrumentation to
granular. I've tried to add spans to every function that either needs
noticeable io or cpu resources without creating duplication. This also
fixes some wrong tracing usage on async functions
(https://docs.rs/tracing/latest/tracing/struct.Span.html#in-asynchronous-code)
and some spans that weren't actually entered.
## Summary
This PR adds support for relative URLs in the simple JSON responses. We
already support relative URLs for HTML responses, but the handling has
been consolidated between the two. Similar to index URLs, we now store
the base alongside the metadata, and use the base when resolving the
URL.
Closes#455.
## Test Plan
`cargo test` (to test HTML indexes). Separately, I also ran `cargo run
-p puffin-cli -- pip-compile requirements.in -n
--index-url=http://localhost:3141/packages/pypi/+simple` on the
`zb/relative` branch with `packse` running, and forced both HTML and
JSON by limiting the `accept` header.
## Summary
This PR makes the `pypi_types::File` a response-only type (i.e., a type
that's only used when deserializing over the wire), and adds a separate
internal `File` type. Right now, the representations are similar, but
already, we can avoid the "lenient" deserialization on our internal
`File` type, and avoid the special-casing of the property names that's
required in the JSON. Over time, we can evolve this representation
entirely separately from the representation we receive from PyPI and
other indexes.
This crate started off as generic caching utilities, but we started
adding a lot of Puffin-specific stuff (like the cache buckets
abstraction that knows about Git vs. direct URL vs. indexes and so on).
This PR moves the generic stuff into a new `cache-key` crate.
From manual inspection, this dataset generated through the [libraries.io
API](https://libraries.io/api#project-search) seems more mainstream than
the current 8k one, which is also preserved. I've added the dataset to
the repo because the API requires an API key.
We lock git checkout directories and the virtualenv to avoid two puffin
instances running in parallel changing files at the same time and
leading to a broken state. When one instance is blocking another, we
need to inform the user (why is the program hanging?) and also add some
information for them to debug the situation.
The new messages will print
```
Waiting to acquire lock for /home/konsti/projects/puffin/.venv (lockfile: /home/konsti/projects/puffin/.venv/.lock)
```
or
```
Waiting to acquire lock for git+https://github.com/pydantic/pydantic-extra-types@0ce9f207a1e09a862287ab77512f0060c1625223 (lockfile: /home/konsti/projects/puffin/cache-all-kinds/git-v0/locks/f157fd329a506a34)
```
The messages aren't perfect but clear enough to see what the contention
is and in the worst case to delete the lockfile.
Fixes#714
This is a pure refactor to follow-up #690, to separate the metadata that
we know upfront about distributions (like the version, for
registry-based distributions) vs. the metadata that requires building
(like the version, for URL-based distributions).
This gives a 1.23 speedup on transformers-extras. We could change to
msgpack for the entire cache if we want. I only tried this format and
postcard so far, where postcard was much slower (like 1.6s).
I don't actually want to merge it like this, i wanted to figure out the
ballpark of improvement for switching away from json.
```
hyperfine --warmup 3 --runs 10 "target/profiling/puffin pip-compile --cache-dir cache-msgpack scripts/requirements/transformers-extras.in" "target/profiling/branch pip-compile scripts/requirements/transformers-extras.in"
Benchmark 1: target/profiling/puffin pip-compile --cache-dir cache-msgpack scripts/requirements/transformers-extras.in
Time (mean ± σ): 179.1 ms ± 4.8 ms [User: 157.5 ms, System: 48.1 ms]
Range (min … max): 174.9 ms … 188.1 ms 10 runs
Benchmark 2: target/profiling/branch pip-compile scripts/requirements/transformers-extras.in
Time (mean ± σ): 221.1 ms ± 6.7 ms [User: 208.1 ms, System: 46.5 ms]
Range (min … max): 213.5 ms … 235.5 ms 10 runs
Summary
target/profiling/puffin pip-compile --cache-dir cache-msgpack scripts/requirements/transformers-extras.in ran
1.23 ± 0.05 times faster than target/profiling/branch pip-compile scripts/requirements/transformers-extras.in
```
Disadvantage: We can't manually look into the cache anymore to debug
things
- [ ] Check more formats, i currently only tested json, msgpack and
postcard, there should be other formats, too
- [x] Switch over `CachedByTimestamp` serialization (for the interpreter
caching)
- [x] Switch over error handling and make sure puffin is still resilient
to cache failure
## Summary
This is more of a hypothetical problem, but the cache manifest could in
theory get out-of-sync with the contents on disk. This PR modifies the
`BuiltWheelMetadata` lookup to warn (but not fail) if the manifest
includes a wheel that no longer exists on disk. You can mimic this by
removing a wheel from the `built-wheels-v0` cache without modifying the
manifest correspondingly.
## Summary
This PR modifies `source_dist.rs` to store source distributions (from
remote URLs) in the cache. The cache structure for registries now looks
like:
<img width="1053" alt="Screen Shot 2023-12-14 at 10 43 43 PM"
src="3c2dbf6b-5926-41f2-b69b-74031741aba8">
(I will update the docs prior to merging, if approved.)
The benefit here is that we can reuse the source distribution (avoid
download + unzipping it) if we need to build multiple wheels. In the
future, it will be even more relevant, since we'll need to reuse the
source distribution to support
https://github.com/astral-sh/puffin/issues/599.
I also included some misc. refactors to DRY up repeated operations and
add some more abstraction to `source_dist.rs`.
## Summary
This PR adds a `VerbatimUrl` struct to preserve verbatim URLs throughout
the resolution and installation pipeline. In short, alongside the parsed
`Url`, we also keep the URL as written by the user. This enables us to
display the URL exactly as written by the user, rather than the
serialized path that we use internally.
This will be especially useful once we start expanding environment
variables since, at that point, we'll be able to write the version of
the URL that includes the _unexpected_ environment variable to the
output file.
We have some shared utilities beyond `puffin-build` and
`puffin-distribution`, and further, I want to be able to access the
sdist archive extraction logic from `puffin-distribution`. This is
really generic, so moving into its own crate.
This allows us to enforce type safety within the resolver. For example,
in the index, we can remove `String` as a key type and enforce that
callers _must_ present us with a `PackageId`. (This actually caught one
bug, where we were using the SHA rather than the package ID. That bug
shouldn't have had any effect given where it was, since those are 1:1,
but it's still problematic.)
Make `prepare_metadata_for_build_wheel` accessible across the puffin
codebase by splitting the built call into a setup, a metadata and a
wheel call. This does not actually use the hook yet, but it's the
required refactoring for it.
Part of #599.
## Summary
At present, we have two separate phases within the installation pipeline
related to populating wheels into the cache. The first phase downloads
the distribution, and then builds any source distributions into wheels;
the second phase unzips all the built wheels into the cache.
This PR merges those two phases into one, such that we seamlessly
download, build, and unzip wheels in one pass. This is more efficient,
since we can start unzipping while we build. It also ensures that if the
install _fails_ partway through, we don't end up with a bunch of
downloaded wheels that we never had a chance to unzip. The code is also
much simpler.
The main downside is that the user-facing feedback isn't as granular,
since we only have one phase and one progress bar for what was
originally three distinct phases.
Closes https://github.com/astral-sh/puffin/issues/571.
## Test Plan
I ran the benchmark script on two separate requirements files, and saw a
7% and 31% speedup respectively:
```text
+ TARGET=./scripts/benchmarks/requirements.txt
+ hyperfine --runs 100 --warmup 10 --prepare 'virtualenv --clear .venv' './target/release/main pip-sync ./scripts/benchmarks/requirements.txt --no-cache' --prepare 'virtualenv --clear .venv' './target/release/puffin pip-sync ./scripts/benchmarks/requirements.txt --no-cache'
Benchmark 1: ./target/release/main pip-sync ./scripts/benchmarks/requirements.txt --no-cache
Time (mean ± σ): 269.4 ms ± 33.0 ms [User: 42.4 ms, System: 117.5 ms]
Range (min … max): 221.7 ms … 446.7 ms 100 runs
Benchmark 2: ./target/release/puffin pip-sync ./scripts/benchmarks/requirements.txt --no-cache
Time (mean ± σ): 250.6 ms ± 28.3 ms [User: 41.5 ms, System: 127.4 ms]
Range (min … max): 207.6 ms … 336.4 ms 100 runs
Summary
'./target/release/puffin pip-sync ./scripts/benchmarks/requirements.txt --no-cache' ran
1.07 ± 0.18 times faster than './target/release/main pip-sync ./scripts/benchmarks/requirements.txt --no-cache'
```
```text
+ TARGET=./scripts/benchmarks/requirements-large.txt
+ hyperfine --runs 100 --warmup 10 --prepare 'virtualenv --clear .venv' './target/release/main pip-sync ./scripts/benchmarks/requirements-large.txt --no-cache' --prepare 'virtualenv --clear .venv' './target/release/puffin pip-sync ./scripts/benchmarks/requirements-large.txt --no-cache'
Benchmark 1: ./target/release/main pip-sync ./scripts/benchmarks/requirements-large.txt --no-cache
Time (mean ± σ): 5.053 s ± 0.354 s [User: 1.413 s, System: 6.710 s]
Range (min … max): 4.584 s … 6.333 s 100 runs
Benchmark 2: ./target/release/puffin pip-sync ./scripts/benchmarks/requirements-large.txt --no-cache
Time (mean ± σ): 3.845 s ± 0.225 s [User: 1.364 s, System: 6.970 s]
Range (min … max): 3.482 s … 4.715 s 100 runs
Summary
'./target/release/puffin pip-sync ./scripts/benchmarks/requirements-large.txt --no-cache' ran
```
I saw warnings when we were e.g. unzipping wheel and setuptools in two
tasks at the same time. We now keep track of in flight unzips.
This introduces a `OnceMap` abstraction which we also use in the
resolver.
## Summary
This PR enables `puffin clean` to accept package names as command line
arguments, and selectively purge entries from the cache tied to the
given package.
Relate to #572.
## Test Plan
Modified all the caching tests to run an additional step to (1) purge
the cache, and (2) re-install the package.
## Summary
This PR modifies the Git wheel cache to: (1) use a shorter version of
the SHA, to save space; and (2) include the package name, for
consistency with all other buckets.
I considered removing the URL hash entirely, and _just_ using the SHA,
which would be even _more_ consistent with other buckets. But if we
remove the URL, then we won't have separate directories for
subdirectories (which are part of the URL).
Before:
<img width="1035" alt="Screen Shot 2023-12-07 at 7 23 42 PM"
src="86afce67-682f-464f-9ba1-0b60d5b7f19f">
After:
<img width="1232" alt="Screen Shot 2023-12-07 at 8 09 23 PM"
src="eda42a19-974f-47fe-8c83-54a602ddfd2d">
## Summary
This PR modifies the cache structure in a few ways. Most notably, we now
shard the set of registry wheels by package, and index them lazily when
computing the install plan.
This applies both to built wheels:
<img width="989" alt="Screen Shot 2023-12-06 at 4 42 19 PM"
src="0e8a306f-befd-4be9-a63e-2303389837bb">
And remote wheels:
<img width="836" alt="Screen Shot 2023-12-06 at 4 42 30 PM"
src="7fd908cd-dd86-475e-9779-07ed067b4a1a">
For other distributions, we now consistently cache using the package
name, which is really just for clarity and debuggability (we could
consider omitting these):
<img width="955" alt="Screen Shot 2023-12-06 at 4 58 30 PM"
src="3e8d0f99-df45-429a-9175-d57b54a72e56">
Obliquely closes https://github.com/astral-sh/puffin/issues/575.
This PR adds caching support for built wheels in the installer.
Specifically, the `RegistryWheelIndex` now indexes both downloaded and
built wheels (from registries), and we have a new `BuiltWheelIndex` that
takes a subdirectory and returns the "best-matching" compatible wheel.
Closes#570.
Path distribution cache reading errors are no longer fatal.
We now invalidate the path file source dists if its modification
timestamp changed, and invalidate path dir source dists if
`pyproject.toml` or alternatively `setup.py` changed, which seems good
choices since changing pyproject.toml should trigger a rebuild and the
user can `touch` the file as part of their workflow.
`CachedByTimestamp` is now a shared util. It doesn't have methods as i
don't think it's worth it yet for two users.
Closes#478
TODO(konstin): Write a test. This is probably twice as much work as that
fix itself, so i made that PR without one for now.
## Summary
This PR uses the wheel stem (e.g., `foo-1.2.3-py3-none-any`) instead of
the wheel name (e.g., `foo-1.2.3-py3-none-any.whl`) when storing
unzipped wheels in the cache, which removes a class of confusing issues
around overwrites and directory-vs.-file collisions.
For now, we retain _both_ the zipped and unzipped wheels in the cache,
though we can easily change this by storing the zipped wheels in a
temporary directory.
Closes https://github.com/astral-sh/puffin/issues/573.
## Test Plan
Some examples from my local cache:
<img width="835" alt="Screen Shot 2023-12-05 at 4 09 55 PM"
src="784146aa-b080-416e-9767-40c843fe5d6a">
<img width="847" alt="Screen Shot 2023-12-05 at 4 12 14 PM"
src="4bc7f30f-bef3-47f1-b4e8-da9cabf87f28">
<img width="637" alt="Screen Shot 2023-12-05 at 4 09 50 PM"
src="25ca4944-4a06-4a08-ac85-c6f7d8b5c8ea">
## Summary
When installing a local wheel, we need to avoid removing the zipped
wheel (since it lives outside of the cache), _and_ need to ensure that
we unzip the wheel into the cache (rather than replacing the zipped
wheel, which may even live outside of the project).
Closes https://github.com/astral-sh/puffin/issues/553.
This PR modifies the source distribution building to replace any
existing targets after building the new wheel. In some cases, the
existence of an existing target may be indicative of a bug, so we warn.
It's partially a workaround for some (but not all) of the errors in
https://github.com/astral-sh/puffin/issues/554.
## Summary
This is hard to reproduce, but if you run a long installation process
that errors part-way through, you can end up with zipped wheels in the
`Wheels` cache, which is intended to contain only unzipped wheels. This
PR avoids returning those entries from the registry, which will then
lead to errors downstream when we treat them as directories.
It turns out that it's not uncommon to use timestamps as patch versions
(e.g., `20230628214621`). I believe this is the ISO 8601 "basic format".
These can't be represented by a `u32`, so I think it makes sense to just
bump to `u64` to remove this limitation.
Ensure we're using atomic writes everywhere in our cache to avoid broken
cache records and error with parallel puffin actions
(https://github.com/astral-sh/puffin/pull/544#issuecomment-1838841581).
All json files that are written to the cache are written atomically and
the build wheels are written to temp dir and then moved atomically. I
didn't touch venv creation though, i don't think that's worth it since
python does not support atomic package installation through its design.
After this change, two wheel caches remain: `built-wheels-v0` and
`wheels-v0`, docs screenshots below. Each contains both the wheel
metadata, cache policy and zip or unzipped wheels under the same name.
The zipped/unzipped strategy is as follows: In `pip-compile`, when we
build a wheel, we store it zipped. When `pip-sync` or a source dist
build in `pip-compile` need to install the wheel, we unzip it, remove
the file and replace it with the unzipped wheel.
This removes `WheelCache` and `UrlIndex` in favor of `Cache` plus
`WheelCache`. The non-built wheel cache now considers index urls and the
url for url wheels.
I'm unsure if we need the `Unzipper` type, this could just be a
function.
I move `no_index` into `IndexUrls` and started using `IndexUrl` up to
the clap level.
I left a number of TODOs in the code, namely performing the actual
invalidation of unzipped wheels and making the `InstallPlan` understand
cache invalidation (i.e. uninstall wheels when their remote changed).

Remove built wheels alongside their metadata when their index source
dist or url source dist changed. For git source dists, we currently
don't clear the previous build but use a new directory (not sure what's
right here - are there any generic cache GC approaches out there? I've
seen that e.g. spotify keeps its cache at 10GB max, but i also haven't
seen any reusable, well tested approaches for this). Path distributions
are unchanged (#478).
I like the structure of metadata alongside the wheel for cache
invalidation, i'll try to do that for `wheels-v0`/`wheel-metadata-v0`
too. (The unzipped wheels afaik currently lack cache invalidation when
the remote changed.) This should give is roughly the same structure for
wheel and built wheels and a very similar pattern of invalidation.
This is mostly a mechanical refactor that moves 80% of our code to the
same cache abstraction.
It introduces cache `Cache`, which abstracts away the path of the cache
and the temp dir drop and is passed throughout the codebase. To get a
specific cache bucket, you need to requests your `CacheBucket` from
`Cache`. `CacheBucket` is the centralizes the names of all cache
buckets, moving them away from the string constants spread throughout
the crates.
Specifically for working with the `CachedClient`, there is a
`CacheEntry`. I'm not sure yet if that is a strict improvement over
`cache_dir: PathBuf, cache_file: String`, i may have to rotate that
later.
The interpreter cache moved into `interpreter-v0`.
We can use the `CacheBucket` page to document the cache structure in
each bucket:

## Summary
We need to pass in the distribution with the "precise" URL to avoid
refetching.
## Test Plan
Ran `cargo run -p puffin-cli -- pip-compile requirements.in --verbose`
with `flask @ git+https://github.com/pallets/flask.git` and verified
that we only checked out Flask once.