## Summary
It turns out that storing an absolute URL for every file caused a
significant performance regression. This PR attempts to address the
regression with two changes.
The first is that we now store the raw string if the URL is an absolute
URL. If the URL is relative, we store the base URL alongside the raw
relative string. As such, we avoid serializing and deserializing URLs
until we need them (later on), except for the base URL.
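Concretely, the stored type looks roughly like the sketch below (hypothetical names, not necessarily the exact representation in this PR):
```rust
use url::{ParseError, Url};

/// Sketch only: the raw string is kept verbatim, and only the base URL is
/// eagerly parsed.
enum FileLocation {
    /// An absolute URL, stored as its raw string.
    Absolute(String),
    /// A relative URL, stored alongside its already-parsed base.
    Relative { base: Url, path: String },
}

impl FileLocation {
    /// Parse lazily, only once the URL is actually needed.
    fn to_url(&self) -> Result<Url, ParseError> {
        match self {
            FileLocation::Absolute(raw) => Url::parse(raw),
            FileLocation::Relative { base, path } => base.join(path),
        }
    }
}
```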
The second is that we now use the internal `Url` crate methods for
serializing and deserializing. If you look inside `Url`, its standard
serializer and deserializer actually convert it to a string, then
parse the string. But the crate exposes some other methods for faster
serialization and deserialization (with fewer guarantees). I think this
is totally fine since the cache is entirely internal.
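A sketch of wiring that up, assuming the `url` crate's doc(hidden) `serialize_internal`/`deserialize_internal` methods (available with its `serde` feature); the wrapper type is illustrative:
```rust
use serde::{Deserialize, Deserializer, Serialize, Serializer};
use url::Url;

/// Hypothetical newtype routing serde through the faster internal path,
/// which skips re-parsing on deserialization.
struct CacheUrl(Url);

impl Serialize for CacheUrl {
    fn serialize<S: Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> {
        Url::serialize_internal(&self.0, serializer)
    }
}

impl<'de> Deserialize<'de> for CacheUrl {
    fn deserialize<D: Deserializer<'de>>(deserializer: D) -> Result<Self, D::Error> {
        Url::deserialize_internal(deserializer).map(CacheUrl)
    }
}
```
This trades validation for speed, which seems acceptable given that the cache is only ever written and read by the same binary.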
If we _just_ change the `Url` serialization (and no other code -- so
continue to store URLs for every file), then the regression goes down to
about 5%:
```shell
❯ python -m scripts.bench \
--puffin-path ./target/release/main \
--puffin-path ./target/release/relative --puffin-path ./target/release/puffin \
scripts/requirements/home-assistant.in --benchmark resolve-warm
Benchmark 1: ./target/release/main (resolve-warm)
Time (mean ± σ): 496.3 ms ± 4.3 ms [User: 452.4 ms, System: 175.5 ms]
Range (min … max): 487.3 ms … 502.4 ms 10 runs
Benchmark 2: ./target/release/relative (resolve-warm)
Time (mean ± σ): 284.8 ms ± 2.1 ms [User: 245.8 ms, System: 165.6 ms]
Range (min … max): 280.3 ms … 288.0 ms 10 runs
Benchmark 3: ./target/release/puffin (resolve-warm)
Time (mean ± σ): 300.4 ms ± 3.2 ms [User: 255.5 ms, System: 178.1 ms]
Range (min … max): 295.4 ms … 305.1 ms 10 runs
Summary
'./target/release/relative (resolve-warm)' ran
1.05 ± 0.01 times faster than './target/release/puffin (resolve-warm)'
1.74 ± 0.02 times faster than './target/release/main (resolve-warm)'
```
So I considered _just_ making that change. But 5% is kind of
borderline...
With both of these changes, the regression is down to 1-2%:
```shell
Benchmark 1: ./target/release/relative (resolve-warm)
Time (mean ± σ): 282.6 ms ± 7.4 ms [User: 244.6 ms, System: 181.3 ms]
Range (min … max): 275.1 ms … 318.5 ms 30 runs
Benchmark 2: ./target/release/puffin (resolve-warm)
Time (mean ± σ): 286.8 ms ± 2.2 ms [User: 247.0 ms, System: 169.1 ms]
Range (min … max): 282.3 ms … 290.7 ms 30 runs
Summary
'./target/release/relative (resolve-warm)' ran
1.01 ± 0.03 times faster than './target/release/puffin (resolve-warm)'
```
It's consistently ~2%, but at this point it's unclear whether that's due
to the URL change or to some other change that landed in the interim.
Closes #943.
There is no guarantee that indexes provide hashes at all, or
specifically the sha256 we support. [PEP
503](https://peps.python.org/pep-0503/#specification):
> The URL SHOULD include a hash in the form of a URL fragment with the
> following syntax: `#<hashname>=<hashvalue>`, where `<hashname>` is the
> lowercase name of the hash function (such as sha256) and `<hashvalue>`
> is the hex encoded digest.

We instead use the URL itself as the input to generate a hash when caching.
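A minimal sketch of that idea, assuming a SHA-256 digest (the concrete hash function and the function name are illustrative):
```rust
use sha2::{Digest, Sha256};

/// Derive a stable cache key from the file URL itself, rather than relying
/// on the index to provide a hash fragment.
fn cache_key_for(url: &url::Url) -> String {
    let digest = Sha256::digest(url.as_str().as_bytes());
    // Hex-encode the digest so it can be used as a file name.
    digest.iter().map(|byte| format!("{byte:02x}")).collect()
}
```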
Changes `File::size` from a `usize` to a `u64`.
The motivation: with tensorflow wheels already at 475 MB
(https://pypi.org/project/tensorflow/2.15.0.post1/#files), we're only
one order of magnitude away from the ~4 GiB limit of a 32-bit `usize`,
and a fixed-width `u64` avoids such target-dependent failures.
Instead of trying to fix up _all_ the invalid version specifiers on PyPI
and elsewhere, this filters out distributions whose
`requires-python` version specifiers even
`LenientVersionSpecifiers` couldn't parse, as opposed to failing
entirely, which we currently do.
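A sketch of the filtering; `File` and `parse_lenient` here are stand-ins for the real types, not their actual signatures:
```rust
/// Illustrative placeholder for the distribution's metadata.
struct File {
    requires_python: Option<String>,
}

/// Placeholder for `LenientVersionSpecifiers` parsing.
fn parse_lenient(spec: &str) -> Result<(), ()> {
    if spec.contains(",,") { Err(()) } else { Ok(()) }
}

/// Keep only distributions whose `requires-python` is absent or parseable,
/// instead of failing the whole resolution on the first bad specifier.
fn usable_files(files: Vec<File>) -> Vec<File> {
    files
        .into_iter()
        .filter(|file| {
            file.requires_python
                .as_deref()
                .map_or(true, |spec| parse_lenient(spec).is_ok())
        })
        .collect()
}
```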
It would be nicer to model this through an invalid-distribution pubgrub
type, together with, e.g., source dists with an unknown extension, so
that the version itself still shows up in the error trace.
At the same time, we reduce the log level for fixups from warning to
trace, as they are not actionable for the user.
This PR builds on #780 by making both version parsing faster, and
perhaps more importantly, making version comparisons much faster.
Overall, these changes result in a considerable improvement for the
`boto3.in` workload. Here's the status quo:
```
$ time puffin pip-compile --no-build --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/requirements/boto3.in
Resolved 31 packages in 34.56s
real 34.579
user 34.004
sys 0.413
maxmem 2867 MB
faults 0
```
And now with this PR:
```
$ time puffin pip-compile --no-build --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/requirements/boto3.in
Resolved 31 packages in 9.20s
real 9.218
user 8.919
sys 0.165
maxmem 463 MB
faults 0
```
This particular workload gets stuck in pubgrub doing resolution, and
thus benefits mightily from a faster `Version::cmp` routine. With that
said, this change does also help a fair bit with "normal" runs:
```
$ hyperfine -w10 \
"puffin-base pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/benchmarks/requirements.in" \
"puffin-cmparc pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/benchmarks/requirements.in"
Benchmark 1: puffin-base pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/benchmarks/requirements.in
Time (mean ± σ): 337.5 ms ± 3.9 ms [User: 310.5 ms, System: 73.2 ms]
Range (min … max): 333.6 ms … 343.4 ms 10 runs
Benchmark 2: puffin-cmparc pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/benchmarks/requirements.in
Time (mean ± σ): 189.8 ms ± 3.0 ms [User: 168.1 ms, System: 78.4 ms]
Range (min … max): 185.0 ms … 196.2 ms 15 runs
Summary
puffin-cmparc pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/benchmarks/requirements.in ran
1.78 ± 0.03 times faster than puffin-base pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/benchmarks/requirements.in
```
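For intuition on why `Version::cmp` can get so much faster, here's a sketch of the general trick (the layout is illustrative, not this PR's actual encoding): pack the common case of a small release like `1.2.3` into one integer, so comparison becomes a single `u64` compare.
```rust
/// Sketch only: handles plain releases of up to four segments, each < 2^16.
/// Pre/post/dev releases and larger versions would fall back to a general
/// representation.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord)]
struct PackedVersion(u64);

impl PackedVersion {
    fn new(release: &[u64]) -> Option<Self> {
        if release.len() > 4 || release.iter().any(|&n| n >= 1 << 16) {
            return None;
        }
        let mut packed = 0u64;
        for (i, &n) in release.iter().enumerate() {
            // The most significant 16 bits hold the first segment, so the
            // derived integer ordering matches segment-wise comparison, and
            // missing trailing segments are zero, matching PEP 440
            // (`1.2` == `1.2.0`).
            packed |= n << (16 * (3 - i));
        }
        Some(PackedVersion(packed))
    }
}
```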
There is perhaps some future work here (detailed in the commit
messages), but I suspect it would be more fruitful to explore ways of
making resolution itself and/or deserialization faster.
Fixes #373, Closes #396.
This PR does a bit of refactoring to the pep440 crate, and in
particular around the error types. This PR is meant to be a precursor
to another PR that does some surgery (both in parsing and in `Version`
representation) that benefits somewhat from this refactoring.
As usual, please review commit-by-commit.
## Summary
This PR adds support for relative URLs in the simple JSON responses. We
already support relative URLs for HTML responses, but the handling has
been consolidated between the two. Similar to index URLs, we now store
the base alongside the metadata, and use the base when resolving the
URL.
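Resolution itself is then just a join against the stored base; roughly (function name illustrative):
```rust
use url::Url;

/// `Url::join` resolves a relative reference against the base, and passes
/// absolute URLs through unchanged, so both cases take the same path.
fn resolve(base: &Url, file_url: &str) -> Result<Url, url::ParseError> {
    base.join(file_url)
}
```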
Closes #455.
## Test Plan
`cargo test` (to test HTML indexes). Separately, I also ran `cargo run
-p puffin-cli -- pip-compile requirements.in -n
--index-url=http://localhost:3141/packages/pypi/+simple` on the
`zb/relative` branch with `packse` running, and forced both HTML and
JSON by limiting the `accept` header.
## Summary
This PR makes the `pypi_types::File` a response-only type (i.e., a type
that's only used when deserializing over the wire), and adds a separate
internal `File` type. Right now, the representations are similar, but
already, we can avoid the "lenient" deserialization on our internal
`File` type, and avoid the special-casing of the property names that's
required in the JSON. Over time, we can evolve this representation
entirely separately from the representation we receive from PyPI and
other indexes.
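As an illustrative sketch (the field sets are made up, not the actual structs):
```rust
use serde::Deserialize;

/// The wire type carries the lenient serde attributes that the index JSON
/// needs, e.g. the renamed property...
#[derive(Deserialize)]
struct WireFile {
    filename: String,
    url: String,
    #[serde(rename = "requires-python")]
    requires_python: Option<String>,
}

/// ...while the internal type is plain Rust and free to evolve on its own.
struct File {
    filename: String,
    url: String,
    requires_python: Option<String>,
}

impl From<WireFile> for File {
    fn from(wire: WireFile) -> Self {
        Self {
            filename: wire.filename,
            url: wire.url,
            requires_python: wire.requires_python,
        }
    }
}
```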
## Summary
Now, `puffin_warnings::warn_once` and `puffin_warnings::warn` will go to
`stderr`, as long as the user isn't running under `--quiet`. Previously,
these went through `tracing`, and so were only visible when running
under `--verbose`.
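For illustration, the deduplication half of `warn_once` could look something like this minimal sketch (the real macro also checks `--quiet` before printing):
```rust
use std::collections::HashSet;
use std::sync::{Mutex, OnceLock};

static SEEN: OnceLock<Mutex<HashSet<String>>> = OnceLock::new();

fn warn_once(message: &str) {
    let seen = SEEN.get_or_init(|| Mutex::new(HashSet::new()));
    // Only print the first occurrence of any given message, on stderr.
    if seen.lock().unwrap().insert(message.to_string()) {
        eprintln!("warning: {message}");
    }
}
```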
After this change, two wheel caches remain: `built-wheels-v0` and
`wheels-v0`; docs screenshots below. Each contains the wheel
metadata, the cache policy, and the zipped or unzipped wheel under the
same name.
The zipped/unzipped strategy is as follows: in `pip-compile`, when we
build a wheel, we store it zipped. When `pip-sync` or a source dist
build in `pip-compile` needs to install the wheel, we unzip it, remove
the zip file, and replace it with the unzipped wheel.
This removes `WheelCache` and `UrlIndex` in favor of `Cache` plus
`WheelCache`. The non-built wheel cache now considers index URLs and,
for URL wheels, the URL itself.
I'm unsure whether we need the `Unzipper` type; it could just be a
function.
I moved `no_index` into `IndexUrls` and started using `IndexUrl` up
through the clap level.
I left a number of TODOs in the code, namely performing the actual
invalidation of unzipped wheels and making the `InstallPlan` understand
cache invalidation (i.e., uninstalling wheels when their remote has
changed).

## Summary
This PR adds support for local path dependencies. The approach mostly
just falls out of our existing approach and infrastructure for Git and
URL dependencies.
Closes https://github.com/astral-sh/puffin/issues/436. (We'll open a
separate issue for editable installs.)
## Test Plan
Added `pip-compile` tests that pre-download a wheel or source
distribution, then install it via local path.
A consistent cache structure for remote wheel metadata:
* `<wheel metadata cache>/pypi/foo-1.0.0-py3-none-any.json`
* `<wheel metadata cache>/<digest(index-url)>/foo-1.0.0-py3-none-any.json`
* `<wheel metadata cache>/url/<digest(url)>/foo-1.0.0-py3-none-any.json`
The source dist caching will use a similar structure (#468).
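A sketch of computing those paths; the `digest` values stand in for whatever stable hash the cache uses, and all names are illustrative:
```rust
use std::path::{Path, PathBuf};

/// The three buckets from the layout above.
enum MetadataSource<'a> {
    Pypi,
    Index { index_url_digest: &'a str },
    Url { url_digest: &'a str },
}

fn metadata_path(root: &Path, source: MetadataSource, filename: &str) -> PathBuf {
    let bucket = match source {
        MetadataSource::Pypi => root.join("pypi"),
        MetadataSource::Index { index_url_digest } => root.join(index_url_digest),
        MetadataSource::Url { url_digest } => root.join("url").join(url_digest),
    };
    bucket.join(format!("{filename}.json"))
}
```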
Deduplicate the lenient parsing code between version specifiers and
`Requirement`. Use `warn_once!`, since the warnings showed up multiple
times for me. Fix the macro hygiene in `warn_once!`.
This works by filtering out files with a more recent upload time, so if
the index you use does not provide upload times, the results might be
inaccurate. PyPI provides upload times for all files; that is, the
field is non-nullable in the warehouse schema, even though the simple
API PEP does not know about this field.
If you have only PyPI dependencies, this means deterministic,
reproducible(!) resolution. We could try doing the same for Git repos,
but it doesn't seem worth the effort; I'd recommend pinning commits,
since Git histories are arbitrarily malleable, and if you care about
reproducibility you should not use Git dependencies but a custom
index.
Timestamps are given either as RFC 3339 timestamps such as
`2006-12-02T02:07:43Z` or as UTC dates in the same format such as
`2006-12-02`. Dates are interpreted as including that entire day, i.e.,
until midnight UTC at its end. Supporting a date-only form is required
to make this ergonomic, and midnight seems like a sensible cutoff.
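A sketch of that parsing rule using `chrono` (function name illustrative):
```rust
use chrono::{DateTime, NaiveDate, NaiveTime, Utc};

/// Accept a full RFC 3339 timestamp, or a bare date meaning "include that
/// whole day", i.e. a cutoff at the following midnight UTC.
fn parse_exclude_newer(input: &str) -> Option<DateTime<Utc>> {
    if let Ok(timestamp) = DateTime::parse_from_rfc3339(input) {
        return Some(timestamp.with_timezone(&Utc));
    }
    let date = NaiveDate::parse_from_str(input, "%Y-%m-%d").ok()?;
    Some(date.succ_opt()?.and_time(NaiveTime::MIN).and_utc())
}
```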
In action for `pandas`:
```console
$ target/debug/puffin pip-compile --exclude-newer 2023-11-16 target/pandas.in
Resolved 6 packages in 679ms
# This file was autogenerated by Puffin v0.0.1 via the following command:
# target/debug/puffin pip-compile --exclude-newer 2023-11-16 target/pandas.in
numpy==1.26.2
# via pandas
pandas==2.1.3
python-dateutil==2.8.2
# via pandas
pytz==2023.3.post1
# via pandas
six==1.16.0
# via python-dateutil
tzdata==2023.3
# via pandas
$ target/debug/puffin pip-compile --exclude-newer 2022-11-16 target/pandas.in
Resolved 5 packages in 655ms
# This file was autogenerated by Puffin v0.0.1 via the following command:
# target/debug/puffin pip-compile --exclude-newer 2022-11-16 target/pandas.in
numpy==1.23.4
# via pandas
pandas==1.5.1
python-dateutil==2.8.2
# via pandas
pytz==2022.6
# via pandas
six==1.16.0
# via python-dateutil
$ target/debug/puffin pip-compile --exclude-newer 2021-11-16 target/pandas.in
Resolved 5 packages in 594ms
# This file was autogenerated by Puffin v0.0.1 via the following command:
# target/debug/puffin pip-compile --exclude-newer 2021-11-16 target/pandas.in
numpy==1.21.4
# via pandas
pandas==1.3.4
python-dateutil==2.8.2
# via pandas
pytz==2021.3
# via pandas
six==1.16.0
# via python-dateutil
```
[PEP 691](https://peps.python.org/pep-0691/#project-detail) has slightly
different, more relaxed rules around file metadata. These changes are
now reflected in the `File` struct. This will make it easier to support
alternative indices.
I had expected that I'd need to introduce a separate type for this, so
I'm happy it's just two more `Option`s and an alias.
Part of #412
Implement two behaviors for yanked versions:
* During `pip-compile`, yanked versions are filtered out entirely; we
currently treat them as if they don't exist. This leads to confusing
error messages, because a version that does exist seems to have suddenly
disappeared.
* During `pip-sync`, we warn when we fetch a remote distribution that
has been yanked. We currently don't warn on cached or installed
distributions that have been yanked.
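For reference, PEP 592 defines `yanked` as either a boolean or a string giving the reason, where any non-`false` value means the file is yanked; a sketch of modeling that (not this PR's actual type):
```rust
use serde::Deserialize;

#[derive(Deserialize)]
#[serde(untagged)]
enum Yanked {
    Bool(bool),
    Reason(String),
}

impl Yanked {
    /// A bare `true` or any reason string both count as yanked.
    fn is_yanked(&self) -> bool {
        match self {
            Yanked::Bool(yanked) => *yanked,
            Yanked::Reason(_) => true,
        }
    }
}
```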
Filter out source dists and wheels whose `requires-python` from the
simple API is incompatible with the current Python version.
This change surfaced an important problem: when we use a fake Python
version for resolving, building source distributions breaks down,
because we can only build with versions we actually have.
This change became surprisingly big. The tests now require Python 3.7
to be installed, but changing that would mean an even bigger change.
Fixes #388.
We now write `direct_url.json` when installing, and _skip_
installing if we find a package installed via the direct URL that the
user is requesting.
There are a lot of TODOs, especially around cleaning up the `Source`
abstraction and its relationship to `DirectUrl`. I'm gonna keep working
on these today, but this works and makes the requirements clear.
Closes #332.