mirrors/uv

mirror of https://github.com/astral-sh/uv.git synced 2025-07-13 00:05:01 +00:00

Author	SHA1	Message	Date
konsti	b2a810fe37	Add windows specific filters for tests (#1231 ) Add more windows specific filters in various places. 435 tests run: 333 passed (21 slow), 102 failed, 1 skipped	2024-02-06 15:58:16 +01:00
Andrew Gallant	d4b4c21133	initial implementation of zero-copy deserialization for SimpleMetadata (#1249 ) (Please review this PR commit by commit.) This PR closes an initial loop on zero-copy deserialization. That is, provides a way to get a `Archived<SimpleMetadata>` (spelled `OwnedArchive<SimpleMetadata>` in the code) from a `CachedClient`. The main benefit of zero-copy deserialization is that we can read bytes from a file, cast those bytes to a structured representation without cost, and then start using that type as any other Rust type. The "catch" is that the structured representation is not the actual type you started with, but the "archived" version of it. In order to make all this work, we ended up needing to shave a rather large yak: we had to re-implement HTTP cache semantics. Previously, we were using the `http-cache-semantics` crate. While it does support Serde, it doesn't support `rkyv`. Moreover, even simple support for `rkyv` wouldn't be enough. What we actually want is for the HTTP cache semantics to be implemented on the archived type so that we can decide whether our cached response is stale or not without needing to do a full deserialization into the unarchived type. This is why, in this PR, you'll see `impl ArchivedCachePolicy { ... }` instead of `impl CachePolicy { ... }`. (The `derive(rkyv::Archive)` macro automatically introduces the `ArchivedCachePolicy` type into the current namespace.) Unfortunately, this PR does not fully realize the dream that is zero-copy deserialization. Namely, while a `CachedClient` can now provide an `OwnedArchive<SimpleMetadata>`, the rest of our code doesn't really make use of it. Indeed, as soon as we go to build a `VersionMap`, we eagerly convert our archived metadata into an owned `SimpleMetadata` via deserialization (that isn't zero-copy). After this change, a lot of the work now shifts to `rkyv` deserialization and `VersionMap` construction. 
More precisely, the main thing we drop here is `CachePolicy` deserialization (which is now truly zero-copy) and the parsing of the MessagePack format for `SimpleMetadata`. But we are still paying for deserialization. We're just paying for it in a different place. This PR does seem to bring a speed-up, but it is somewhat underwhelming. My measurements have been pretty noisy, but I get a 1.1x speedup fairly often: ``` $ hyperfine -w5 "puffin-main pip compile --cache-dir ~/astral/tmp/cache-main ~/astral/tmp/reqs/home-assistant-reduced.in -o /dev/null" "puffin-test pip compile --cache-dir ~/astral/tmp/cache-test ~/astral/tmp/reqs/home-assistant-reduced.in -o /dev/null" ; A kang Benchmark 1: puffin-main pip compile --cache-dir ~/astral/tmp/cache-main ~/astral/tmp/reqs/home-assistant-reduced.in -o /dev/null Time (mean ± σ): 164.4 ms ± 18.8 ms [User: 427.1 ms, System: 348.6 ms] Range (min … max): 131.1 ms … 190.5 ms 18 runs Benchmark 2: puffin-test pip compile --cache-dir ~/astral/tmp/cache-test ~/astral/tmp/reqs/home-assistant-reduced.in -o /dev/null Time (mean ± σ): 148.3 ms ± 10.2 ms [User: 357.1 ms, System: 319.4 ms] Range (min … max): 136.8 ms … 184.4 ms 19 runs Summary puffin-test pip compile --cache-dir ~/astral/tmp/cache-test ~/astral/tmp/reqs/home-assistant-reduced.in -o /dev/null ran 1.11 ± 0.15 times faster than puffin-main pip compile --cache-dir ~/astral/tmp/cache-main ~/astral/tmp/reqs/home-assistant-reduced.in -o /dev/null ``` One downside is that this does increase cache size (`rkyv`'s serialization format is not as compact as MessagePack). On disk size increases by about 1.8x for our `simple-v0` cache. 
``` $ sort-filesize cache-main 4.0K cache-main/CACHEDIR.TAG 4.0K cache-main/.gitignore 8.0K cache-main/interpreter-v0 8.7M cache-main/wheels-v0 18M cache-main/archive-v0 59M cache-main/simple-v0 109M cache-main/built-wheels-v0 193M cache-main 193M total $ sort-filesize cache-test 4.0K cache-test/CACHEDIR.TAG 4.0K cache-test/.gitignore 8.0K cache-test/interpreter-v0 8.7M cache-test/wheels-v0 18M cache-test/archive-v0 107M cache-test/simple-v0 109M cache-test/built-wheels-v0 242M cache-test 242M total ``` Also, while I initially intended to do a simplistic implementation of HTTP cache semantics, I found that everything was somewhat inter-connected. I could have written code that _specifically_ only worked with the present behavior of PyPI, but then it would need to be special cased and everything else would need to continue to use `http-cache-semantics`. By implementing what we need based on what Puffin actually is (which is still less than what `http-cache-semantics` does), we can avoid special casing and use zero-copy deserialization for our cache policy in _all_ cases.	2024-02-05 16:47:53 -05:00
Charlie Marsh	01258c1bb3	Report number of bytes deleted when clearing cache (#1203 ) ## Summary This is based on Cargo's `clean` implementation, with modifications based on some of my own preferences, and to better adhere to patterns we use in our codebase: ![Screenshot 2024-01-31 at 1 31 10 AM](`38704798`-b17f-4972-ab67-00484ce63d62)	2024-01-31 10:48:28 -05:00
Charlie Marsh	fa3f0d7a55	Rename cache `purge` methods to `clean` (#1159 ) This is more consistent with the public interface.	2024-01-28 21:15:11 -05:00
Charlie Marsh	f593b65447	Remove refresh checks from the install plan (#1119 ) ## Summary Rather than checking cache freshness in the install plan, it's a lot simpler to have the install plan _never_ return cached data when the refresh policy is in place, and then rely on the distribution database to check for freshness. The original implementation didn't support this, since the distribution database was rebuilding things too often. Now, it rarely rebuilds (it's much better about this), so it seems conceptually much simpler to split up the responsibilities like this.	2024-01-25 22:48:16 -05:00
Charlie Marsh	904db967af	Use junctions instead of symlinks on Windows (#1087 ) ## Summary When we unzip wheels in the cache, we write the directories out to an `archive-v0` bucket, and then symlink into that bucket from the `wheels-v0` and `built-wheels-v0` buckets. On Windows, symlinks are not well supported. Specifically, they need to be explicitly enabled by the user. So, instead of symlinks, we now use junctions, which are well-supported on Windows, and allow you to (effectively) symlink a directory to another directory. This PR implements said junction support, which gets the core installer working on Windows. In the past, we also used symlinks to implement another primitive: we wanted to be able to replace a directory "atomically" (I put "atomically" in quotes because I don't know if it's actually a guaranteed atomic operation), in case someone was trying to use the directory while we were replacing it (as opposed to deleting the directory, then moving it into place). On Windows, it doesn't appear to be possible to atomically replace a junction. So instead, I'm using a new design, whereby the cache always returns canonicalized paths. We know these canonicalized paths are unique and won't be replaced, so they're safe for writers to rely on. In general, when we write new data to the cache, we now return the canonicalized path. When we read from the cache, and try to identify (e.g.) the set of wheels available to us, we canonicalize the links immediately and consider them non-existent if that operation fails. Closes #1085. --------- Co-authored-by: konstin <konstin@mailbox.org>	2024-01-25 10:06:38 +01:00
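The read-side rule described in #1087 (canonicalize a cache link immediately and treat it as non-existent if that fails) can be sketched roughly as follows. This is a minimal illustration using only the standard library; the function name `resolve_cache_link` is hypothetical and is not taken from the codebase.

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Resolve a cache link (a symlink on Unix, a junction on Windows) to the
/// canonical path it points at.
///
/// Hypothetical helper: any failure to canonicalize (missing entry, dangling
/// link, permission error) is treated as "this entry does not exist".
fn resolve_cache_link(link: &Path) -> Option<PathBuf> {
    fs::canonicalize(link).ok()
}
```

Callers then only ever hold canonical paths, which do not change when a link is later re-pointed at a new archive entry.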
Charlie Marsh	738e8341e2	Use a consistent `Timestamp` struct (#1081 ) ## Summary This PR uses `ctime` consistently on Unix as a more conservative approach to change detection. It also ensures that our timestamp abstraction is entirely internal, so we can change the representation and logic easily across the codebase in the future.	2024-01-24 14:21:31 -05:00
Charlie Marsh	63f3434b21	Use nanoid instead of uuid (#1074 ) ## Summary Gives us equivalent randomness with ~half as many characters.	2024-01-24 05:05:14 +00:00
Charlie Marsh	1b3a3f4e80	Add `--refresh` behavior to the cache (#1057 ) ## Summary This PR is an alternative approach to #949 which should be much safer. As in #949, we add a `Refresh` policy to the cache. However, instead of deleting entries from the cache the first time we read them, we now check if the entry is sufficiently new (created after the start of the command) if the refresh policy applies. If the entry is stale, then we avoid reading it and continue onward, relying on the cache to appropriately overwrite based on "new" data. (This relies on the preceding PRs, which ensure the cache is append-only, and ensure that we can atomically overwrite.) Unfortunately, there are just a lot of paths through the cache, and different data is handled with different policies, so I really had to go through and consider the "right" behavior for each case. For example, the HTTP requests can use `max-age=0, must-revalidate`. But for the routes that are based on filesystem modification, we need to do something slightly different. Closes #945.	2024-01-23 18:30:26 -05:00
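The freshness check described in #1057 can be sketched as below. Only the name `Refresh` comes from the PR text; the variants and the `is_readable` method are assumptions made for illustration and do not reflect the actual type.

```rust
use std::time::SystemTime;

/// Assumed shape of a refresh policy: either no refresh was requested, or
/// everything must be refreshed relative to the start of the command.
#[derive(Debug, Clone, Copy)]
enum Refresh {
    None,
    All(SystemTime),
}

impl Refresh {
    /// Whether a cache entry with the given creation time may be read.
    /// Under a refresh policy, only entries written after the command
    /// started count as fresh; older entries are skipped, not deleted.
    fn is_readable(self, entry_time: SystemTime) -> bool {
        match self {
            Refresh::None => true,
            Refresh::All(command_start) => entry_time >= command_start,
        }
    }
}
```

Because stale entries are skipped rather than removed, this check stays compatible with an append-only cache.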
Charlie Marsh	5621c414cf	Use symlinks for directory entries in cache (#1037 ) ## Summary One problem we have in the cache today is that we can't overwrite entries atomically, because we store unzipped _directories_ in the cache (which makes installation _much_ faster than storing zipped directories). So, if you ignore the existing contents of the cache when writing, you might run into an error, because you might attempt to write a directory where a directory already exists. This is especially annoying for cache refresh, because in order to refresh the cache, we have to purge it (i.e., delete a bunch of stuff), which is also highly unsafe if Puffin is running across multiple threads or multiple processes. The solution I'm proposing here is that whenever we persist a _directory_ to the cache, we persist it to a special "archive" bucket. Then, within the other buckets, directory entries are actually symlinks into that "archive" bucket. With symlinks, we can atomically replace, which means we can easily overwrite cache entries without having to delete from the cache. The main downside is that we'll now accumulate dangling entries in the "archive" bucket, and so we'll need to implement some form of garbage collection to ensure that we remove entries with no symlinks. Another downside is that cache reads and writes will be a bit slower, since we need to deal with creating and resolving these symlinks. As an example... after this change, the cache entry for this unzipped wheel is actually a symlink: ![Screenshot 2024-01-22 at 11 56 18 AM](`99ff6940`-5096-4246-8d16-2a7bdcdd8d4b) Then, within the archive directory, we actually have two unique entries (since I intentionally ran the command twice to ensure overwrites were safe): ![Screenshot 2024-01-22 at 11 56 22 AM](`717d04e2`-25d9-4225-b190-bad1441868c6)	2024-01-23 19:52:37 +00:00
Charlie Marsh	556080225d	Use ctime for interpreter timestamps (#1067 ) Per https://apenwarr.ca/log/20181113, `ctime` should be a lot more conservative, and should detect things like the issue we see with the python-build-standalone builds, where the `mtime` is identical across builds. On Windows, I'm just using `last_write_time`. But we should probably add `volume_serial_number` and other attributes via [`winapi_util`](https://docs.rs/winapi-util/latest/winapi_util/index.html).	2024-01-23 19:52:20 +00:00
Charlie Marsh	6561617c56	Store source distribution builds under a unique manifest ID (#1051 ) ## Summary This is a refactor of the source distribution cache that again aims to make the cache purely additive. Instead of deleting all built wheels when the cache gets invalidated (e.g., because the source distribution changed on PyPI or something), we now treat each invalidation as its own cache directory. The manifest inside of the source distribution directory now becomes a pointer to the "latest" version of the source distribution cache. Here's a visual example: ![Screenshot 2024-01-22 at 5 35 41 PM](`ca103c83`-e116-4956-b91c-8434fe62cffe) With this change, we avoid deleting built distributions that might be relied on elsewhere and maintain our invariant that the cache is purely additive. The cost is that we now preserve stale wheels, but we should add a garbage collection mechanism to deal with that.	2024-01-23 19:49:11 +00:00
Charlie Marsh	e32027e384	Avoid persisting manifest data in standalone file (#1044 ) ## Summary This PR gets rid of the manifest that we store for source distributions. Historically, that manifest included the source distribution metadata, plus a list of built wheels. The problem with the manifest is that it duplicates state, since we now have to look at both the manifest and the filesystem to understand the cache state. Instead, I think we should treat the cache as the source of truth, and get rid of the duplicated state in the manifest. Now, we store the manifest (which is merely used to check for cache freshness -- in future PRs, I will repurpose it though, so I left it around), then the distribution metadata as its own file, then any distributions in the same directory. When we want to see if there are any valid distributions, we `readdir` on the directory. This is also much more consistent with how the install plan works.	2024-01-23 19:46:48 +00:00
Charlie Marsh	f17bad0a75	Mark path-based cache entries as stale during install plan (#957 ) ## Summary This is a small correctness improvement that ensures that we avoid using stale cache entries for local dependencies in the install plan. We already have some logic like this in the source distribution builder, but it didn't apply in the install plan, and so we'd end up using stale wheels. Specifically, now, if you create a new local wheel, and run `pip sync`, we'll mark the cache entries as stale and make sure we unzip it and install it. (If the wheel is _already_ installed, we won't reinstall it though, which will be a separate change. This is just about reading from the cache, not the environment.)	2024-01-18 19:13:29 +00:00
Charlie Marsh	249ca10765	Move Puffin subcommands to a pip namespace (#921 ) ## Summary This makes the separation clearer between the legacy `pip` API and the API we'll add in the future for the package manager itself. It also enables seamless `puffin pip` aliasing for those that want it. Closes #918.	2024-01-15 16:36:45 +00:00
konsti	f63776b894	Support HTML indexes in `--find-links` (#913 ) The simple html format parser luckily seems to work for find links too, at least it can parse https://storage.googleapis.com/jax-releases/jax_cuda_releases.html.	2024-01-15 02:54:34 +00:00
konsti	8c2b7d55af	Cleanup deps and docs (#882 ) Fix warnings from `cargo +nightly udeps` and `cargo doc`. Removes all mentions of regex from pep440_rs.	2024-01-11 10:43:40 +00:00
Charlie Marsh	bbe0246205	Change internal representation of `CacheEntry` to avoid allocations (#730 ) Removes a TODO.	2023-12-26 02:10:30 +00:00
Charlie Marsh	188ab75769	Split `File` into internal and external type (#729 ) ## Summary This PR makes the `pypi_types::File` a response-only type (i.e., a type that's only used when deserializing over the wire), and adds a separate internal `File` type. Right now, the representations are similar, but already, we can avoid the "lenient" deserialization on our internal `File` type, and avoid the special-casing of the property names that's required in the JSON. Over time, we can evolve this representation entirely separately from the representation we receive from PyPI and other indexes.	2023-12-25 15:42:28 -05:00
Charlie Marsh	6ff21374dc	Split `puffin-cache` into Puffin-specific and generic utilities (#728 ) This crate started off as generic caching utilities, but we started adding a lot of Puffin-specific stuff (like the cache buckets abstraction that knows about Git vs. direct URL vs. indexes and so on). This PR moves the generic stuff into a new `cache-key` crate.	2023-12-25 14:38:56 +00:00
konsti	b7ad97a823	Show resource and lockfile when waiting (#715 ) We lock git checkout directories and the virtualenv to avoid two puffin instances running in parallel changing files at the same time and leading to a broken state. When one instance is blocking another, we need to inform the user (why is the program hanging?) and also add some information for them to debug the situation. The new messages will print ``` Waiting to acquire lock for /home/konsti/projects/puffin/.venv (lockfile: /home/konsti/projects/puffin/.venv/.lock) ``` or ``` Waiting to acquire lock for git+https://github.com/pydantic/pydantic-extra-types@0ce9f207a1e09a862287ab77512f0060c1625223 (lockfile: /home/konsti/projects/puffin/cache-all-kinds/git-v0/locks/f157fd329a506a34) ``` The messages aren't perfect but clear enough to see what the contention is and in the worst case to delete the lockfile. Fixes #714	2023-12-21 00:05:49 +01:00
konsti	71964ec7a8	Switch to msgpack in the cached client (#662 ) This gives a 1.23 speedup on transformers-extras. We could change to msgpack for the entire cache if we want. I only tried this format and postcard so far, where postcard was much slower (like 1.6s). I don't actually want to merge it like this, i wanted to figure out the ballpark of improvement for switching away from json. ``` hyperfine --warmup 3 --runs 10 "target/profiling/puffin pip-compile --cache-dir cache-msgpack scripts/requirements/transformers-extras.in" "target/profiling/branch pip-compile scripts/requirements/transformers-extras.in" Benchmark 1: target/profiling/puffin pip-compile --cache-dir cache-msgpack scripts/requirements/transformers-extras.in Time (mean ± σ): 179.1 ms ± 4.8 ms [User: 157.5 ms, System: 48.1 ms] Range (min … max): 174.9 ms … 188.1 ms 10 runs Benchmark 2: target/profiling/branch pip-compile scripts/requirements/transformers-extras.in Time (mean ± σ): 221.1 ms ± 6.7 ms [User: 208.1 ms, System: 46.5 ms] Range (min … max): 213.5 ms … 235.5 ms 10 runs Summary target/profiling/puffin pip-compile --cache-dir cache-msgpack scripts/requirements/transformers-extras.in ran 1.23 ± 0.05 times faster than target/profiling/branch pip-compile scripts/requirements/transformers-extras.in ``` Disadvantage: We can't manually look into the cache anymore to debug things - [ ] Check more formats, i currently only tested json, msgpack and postcard, there should be other formats, too - [x] Switch over `CachedByTimestamp` serialization (for the interpreter caching) - [x] Switch over error handling and make sure puffin is still resilient to cache failure	2023-12-16 21:01:35 +00:00
Charlie Marsh	84093773ef	Store source distribution sources in the cache (#653 ) ## Summary This PR modifies `source_dist.rs` to store source distributions (from remote URLs) in the cache. The cache structure for registries now looks like: <img width="1053" alt="Screen Shot 2023-12-14 at 10 43 43 PM" src="`3c2dbf6b`-5926-41f2-b69b-74031741aba8"> (I will update the docs prior to merging, if approved.) The benefit here is that we can reuse the source distribution (avoid download + unzipping it) if we need to build multiple wheels. In the future, it will be even more relevant, since we'll need to reuse the source distribution to support https://github.com/astral-sh/puffin/issues/599. I also included some misc. refactors to DRY up repeated operations and add some more abstraction to `source_dist.rs`.	2023-12-15 17:19:33 +00:00
Charlie Marsh	1181288078	Download, build, and install in a single pipeline phase (#605 ) ## Summary At present, we have two separate phases within the installation pipeline related to populating wheels into the cache. The first phase downloads the distribution, and then builds any source distributions into wheels; the second phase unzips all the built wheels into the cache. This PR merges those two phases into one, such that we seamlessly download, build, and unzip wheels in one pass. This is more efficient, since we can start unzipping while we build. It also ensures that if the install _fails_ partway through, we don't end up with a bunch of downloaded wheels that we never had a chance to unzip. The code is also much simpler. The main downside is that the user-facing feedback isn't as granular, since we only have one phase and one progress bar for what was originally three distinct phases. Closes https://github.com/astral-sh/puffin/issues/571. ## Test Plan I ran the benchmark script on two separate requirements files, and saw a 7% and 31% speedup respectively: ```text + TARGET=./scripts/benchmarks/requirements.txt + hyperfine --runs 100 --warmup 10 --prepare 'virtualenv --clear .venv' './target/release/main pip-sync ./scripts/benchmarks/requirements.txt --no-cache' --prepare 'virtualenv --clear .venv' './target/release/puffin pip-sync ./scripts/benchmarks/requirements.txt --no-cache' Benchmark 1: ./target/release/main pip-sync ./scripts/benchmarks/requirements.txt --no-cache Time (mean ± σ): 269.4 ms ± 33.0 ms [User: 42.4 ms, System: 117.5 ms] Range (min … max): 221.7 ms … 446.7 ms 100 runs Benchmark 2: ./target/release/puffin pip-sync ./scripts/benchmarks/requirements.txt --no-cache Time (mean ± σ): 250.6 ms ± 28.3 ms [User: 41.5 ms, System: 127.4 ms] Range (min … max): 207.6 ms … 336.4 ms 100 runs Summary './target/release/puffin pip-sync ./scripts/benchmarks/requirements.txt --no-cache' ran 1.07 ± 0.18 times faster than './target/release/main pip-sync 
./scripts/benchmarks/requirements.txt --no-cache' ``` ```text + TARGET=./scripts/benchmarks/requirements-large.txt + hyperfine --runs 100 --warmup 10 --prepare 'virtualenv --clear .venv' './target/release/main pip-sync ./scripts/benchmarks/requirements-large.txt --no-cache' --prepare 'virtualenv --clear .venv' './target/release/puffin pip-sync ./scripts/benchmarks/requirements-large.txt --no-cache' Benchmark 1: ./target/release/main pip-sync ./scripts/benchmarks/requirements-large.txt --no-cache Time (mean ± σ): 5.053 s ± 0.354 s [User: 1.413 s, System: 6.710 s] Range (min … max): 4.584 s … 6.333 s 100 runs Benchmark 2: ./target/release/puffin pip-sync ./scripts/benchmarks/requirements-large.txt --no-cache Time (mean ± σ): 3.845 s ± 0.225 s [User: 1.364 s, System: 6.970 s] Range (min … max): 3.482 s … 4.715 s 100 runs Summary './target/release/puffin pip-sync ./scripts/benchmarks/requirements-large.txt --no-cache' ran ```	2023-12-11 15:42:29 +00:00
Charlie Marsh	4b8642c6f7	Enable selective cache purging in `puffin clean` (#589 ) ## Summary This PR enables `puffin clean` to accept package names as command line arguments, and selectively purge entries from the cache tied to the given package. Related to #572. ## Test Plan Modified all the caching tests to run an additional step to (1) purge the cache, and (2) re-install the package.	2023-12-08 19:51:32 +00:00
Charlie Marsh	5ae3a8b1cb	Restructure Git cache to include package name (#588 ) ## Summary This PR modifies the Git wheel cache to: (1) use a shorter version of the SHA, to save space; and (2) include the package name, for consistency with all other buckets. I considered removing the URL hash entirely, and _just_ using the SHA, which would be even _more_ consistent with other buckets. But if we remove the URL, then we won't have separate directories for subdirectories (which are part of the URL). Before: <img width="1035" alt="Screen Shot 2023-12-07 at 7 23 42 PM" src="`86afce67`-682f-464f-9ba1-0b60d5b7f19f"> After: <img width="1232" alt="Screen Shot 2023-12-07 at 8 09 23 PM" src="`eda42a19`-974f-47fe-8c83-54a602ddfd2d">	2023-12-07 20:17:41 -05:00
Zanie Blue	ef7be9103c	Parse `SimpleJson` into categorized data in the client (#522 ) Extends #517 with a suggestion from @konstin to parse the `SimpleJson` into an intermediate type `SimpleMetadata(BTreeMap<Version, VersionFiles>)` before converting to a `VersionMap`. This reduces the number of times we need to parse the response. Additionally, we cache the parsed response now instead of `SimpleJson`. `VersionFiles` stores two vectors with `WheelFilename`/`SourceDistFilename` and `File` tuples. These can be iterated over together or separately. A new enum `DistFilename` was added to capture the `SourceDistFilename` and `WheelFilename` variants allowing iteration over both vectors.	2023-12-07 11:04:47 -06:00
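The intermediate representation described in #522 can be sketched like this. The type names `SimpleMetadata`, `VersionFiles`, `DistFilename`, `WheelFilename`, `SourceDistFilename` and `File` come from the PR text; every field, the `all` method, and the use of `String` as a stand-in for a real `Version` type are assumptions for illustration (a `String` key sorts lexicographically, not by version order).

```rust
use std::collections::BTreeMap;

// Stand-ins (assumptions): the real types carry parsed, structured data.
type Version = String;
#[derive(Debug)]
struct File { url: String }
#[derive(Debug)]
struct WheelFilename(String);
#[derive(Debug)]
struct SourceDistFilename(String);

/// Borrowed view over either kind of filename, so both vectors can be
/// iterated through one item type.
#[derive(Debug)]
enum DistFilename<'a> {
    Wheel(&'a WheelFilename),
    SourceDist(&'a SourceDistFilename),
}

/// Files for one version, split by kind so each kind can be walked alone.
#[derive(Debug, Default)]
struct VersionFiles {
    wheels: Vec<(WheelFilename, File)>,
    source_dists: Vec<(SourceDistFilename, File)>,
}

impl VersionFiles {
    /// Iterate over wheels first, then source distributions.
    fn all(&self) -> impl Iterator<Item = (DistFilename<'_>, &File)> + '_ {
        self.wheels
            .iter()
            .map(|(name, file)| (DistFilename::Wheel(name), file))
            .chain(
                self.source_dists
                    .iter()
                    .map(|(name, file)| (DistFilename::SourceDist(name), file)),
            )
    }
}

/// Parsed index response, keyed by version.
#[derive(Debug, Default)]
struct SimpleMetadata(BTreeMap<Version, VersionFiles>);
```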
Charlie Marsh	a825b2db06	Shard the registry cache by package (#583 ) ## Summary This PR modifies the cache structure in a few ways. Most notably, we now shard the set of registry wheels by package, and index them lazily when computing the install plan. This applies both to built wheels: <img width="989" alt="Screen Shot 2023-12-06 at 4 42 19 PM" src="`0e8a306f`-befd-4be9-a63e-2303389837bb"> And remote wheels: <img width="836" alt="Screen Shot 2023-12-06 at 4 42 30 PM" src="`7fd908cd`-dd86-475e-9779-07ed067b4a1a"> For other distributions, we now consistently cache using the package name, which is really just for clarity and debuggability (we could consider omitting these): <img width="955" alt="Screen Shot 2023-12-06 at 4 58 30 PM" src="`3e8d0f99`-df45-429a-9175-d57b54a72e56"> Obliquely closes https://github.com/astral-sh/puffin/issues/575.	2023-12-07 05:02:46 +00:00
konsti	3f4d7b7826	Improve path source dist caching (#578 ) Path distribution cache reading errors are no longer fatal. We now invalidate the path file source dists if its modification timestamp changed, and invalidate path dir source dists if `pyproject.toml` or alternatively `setup.py` changed, which seem like good choices since changing pyproject.toml should trigger a rebuild and the user can `touch` the file as part of their workflow. `CachedByTimestamp` is now a shared util. It doesn't have methods as i don't think it's worth it yet for two users. Closes #478 TODO(konstin): Write a test. This is probably twice as much work as that fix itself, so i made that PR without one for now.	2023-12-06 11:47:01 -05:00
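A timestamp-keyed cache record of the kind #578 describes might look like the sketch below. Only the name `CachedByTimestamp` appears in the PR text; the field names and the free function `fresh_data` are hypothetical, and the check is written as a free function because the PR notes that the type itself has no methods.

```rust
use std::time::SystemTime;

/// Cached data tagged with the modification time of the file it was
/// derived from (field names are assumptions).
#[derive(Debug, Clone, PartialEq)]
struct CachedByTimestamp<T> {
    timestamp: SystemTime,
    data: T,
}

/// Return the cached data only if the source file's current modification
/// time still matches the one recorded when the entry was written.
fn fresh_data<T>(cached: &CachedByTimestamp<T>, current: SystemTime) -> Option<&T> {
    (cached.timestamp == current).then_some(&cached.data)
}
```

Comparing for equality rather than "newer than" means that a `touch` of `pyproject.toml` invalidates the entry even when the clock moves backwards.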
Charlie Marsh	5fec63bff5	Add caching for path source distributions (#576 ) Follows the strategy that we use for other source distributions. Closes https://github.com/astral-sh/puffin/issues/557.	2023-12-06 01:33:52 +00:00
Charlie Marsh	5370484307	Remove `.whl` extension for cached, unzipped wheels (#574 ) ## Summary This PR uses the wheel stem (e.g., `foo-1.2.3-py3-none-any`) instead of the wheel name (e.g., `foo-1.2.3-py3-none-any.whl`) when storing unzipped wheels in the cache, which removes a class of confusing issues around overwrites and directory-vs.-file collisions. For now, we retain _both_ the zipped and unzipped wheels in the cache, though we can easily change this by storing the zipped wheels in a temporary directory. Closes https://github.com/astral-sh/puffin/issues/573. ## Test Plan Some examples from my local cache: <img width="835" alt="Screen Shot 2023-12-05 at 4 09 55 PM" src="`784146aa`-b080-416e-9767-40c843fe5d6a"> <img width="847" alt="Screen Shot 2023-12-05 at 4 12 14 PM" src="`4bc7f30f`-bef3-47f1-b4e8-da9cabf87f28"> <img width="637" alt="Screen Shot 2023-12-05 at 4 09 50 PM" src="`25ca4944`-4a06-4a08-ac85-c6f7d8b5c8ea">	2023-12-05 22:41:22 +00:00
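The stem-versus-name distinction in #574 amounts to dropping the `.whl` suffix before using the name as a directory. A minimal sketch, with a hypothetical function name:

```rust
/// Return the wheel stem (the filename without its `.whl` extension), or
/// `None` if the name is not a wheel filename. Hypothetical helper.
fn wheel_stem(filename: &str) -> Option<&str> {
    filename.strip_suffix(".whl")
}
```

With the stem as the directory name, an unzipped wheel can no longer collide with the zipped `.whl` file stored next to it.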
Charlie Marsh	6f055ecf3b	Remove existing built wheels when building source distributions (#559 ) This PR modifies the source distribution building to replace any existing targets after building the new wheel. In some cases, the existence of an existing target may be indicative of a bug, so we warn. It's partially a workaround for some (but not all) of the errors in https://github.com/astral-sh/puffin/issues/554.	2023-12-05 12:45:24 -05:00
konsti	d5abd33813	Use atomic writes for the cache consistently (#546 ) Ensure we're using atomic writes everywhere in our cache to avoid broken cache records and errors with parallel puffin actions (https://github.com/astral-sh/puffin/pull/544#issuecomment-1838841581). All json files that are written to the cache are written atomically and the built wheels are written to a temp dir and then moved atomically. I didn't touch venv creation though, i don't think that's worth it since python does not support atomic package installation through its design.	2023-12-04 12:02:01 -05:00
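The write-then-rename pattern behind #546 can be sketched with the standard library alone. This is not the code from the PR: `write_atomic` and the temporary-file naming are assumptions, and a production version would want a collision-proof temporary name.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Write `contents` to `path` so that readers see either the old file or
/// the complete new file, never a partial write. Hypothetical helper.
fn write_atomic(path: &Path, contents: &[u8]) -> io::Result<()> {
    // Keep the temporary file next to the target so the rename stays on
    // the same filesystem (assumed naming scheme: "<path>.<pid>.tmp").
    let mut tmp = path.as_os_str().to_owned();
    tmp.push(format!(".{}.tmp", std::process::id()));
    let tmp = PathBuf::from(tmp);

    {
        let mut file = fs::File::create(&tmp)?;
        file.write_all(contents)?;
        // Flush to disk before the rename makes the file visible.
        file.sync_all()?;
    }

    // `rename` replaces an existing destination file.
    fs::rename(&tmp, path)
}
```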
konsti	811c088603	Improve wheel cache docs: Unzipping is lazy (#539 ) Also sneaking `fs_err::rename(staging.into_path(), &normalized_path)?` in here, for a better resolution of https://github.com/astral-sh/puffin/pull/524#discussion_r1412459016	2023-12-04 10:01:35 +00:00
Charlie Marsh	ee2fca3a48	Add CACHEDIR and .gitignore tags to cache directories (#526 ) ## Summary Even if this will typically be in the user's application folder (rather than a local directory), it's still a good practice. Closes https://github.com/astral-sh/puffin/issues/280.	2023-12-02 00:37:51 +00:00
konsti	9806901a16	Consolidate wheel caches (#524 ) After this change, two wheel caches remain: `built-wheels-v0` and `wheels-v0`, docs screenshots below. Each contains both the wheel metadata, cache policy and zip or unzipped wheels under the same name. The zipped/unzipped strategy is as follows: In `pip-compile`, when we build a wheel, we store it zipped. When `pip-sync` or a source dist build in `pip-compile` need to install the wheel, we unzip it, remove the file and replace it with the unzipped wheel. This removes `WheelCache` and `UrlIndex` in favor of `Cache` plus `WheelCache`. The non-built wheel cache now considers index urls and the url for url wheels. I'm unsure if we need the `Unzipper` type, this could just be a function. I move `no_index` into `IndexUrls` and started using `IndexUrl` up to the clap level. I left a number of TODOs in the code, namely performing the actual invalidation of unzipped wheels and making the `InstallPlan` understand cache invalidation (i.e. uninstall wheels when their remote changed). ![image](`c4d45979`-485b-4954-848d-fd3347ee2510)	2023-12-01 20:16:33 +00:00
konsti	4551994b7d	Clear built wheels when remote changed (#519 ) Remove built wheels alongside their metadata when their index source dist or url source dist changed. For git source dists, we currently don't clear the previous build but use a new directory (not sure what's right here - are there any generic cache GC approaches out there? I've seen that e.g. spotify keeps its cache at 10GB max, but i also haven't seen any reusable, well tested approaches for this). Path distributions are unchanged (#478). I like the structure of metadata alongside the wheel for cache invalidation, i'll try to do that for `wheels-v0`/`wheel-metadata-v0` too. (The unzipped wheels afaik currently lack cache invalidation when the remote changed.) This should give us roughly the same structure for wheel and built wheels and a very similar pattern of invalidation.	2023-12-01 14:56:47 -05:00
konsti	d89fbeb642	Migrate interpreter query to custom caching (#508 ) This removes the last usage of cacache by replacing it with a custom, flat json caching keyed by the digest of the executable path. ![image](`8f777c4c`-1f1b-4656-ba7b-002175270556) A step towards #478. I've made `CachedByTimestamp<T>` generic over `T` but intentionally not moved it to `puffin-cache` yet.	2023-11-28 17:14:59 +00:00
konsti	5435d44756	Introduce `Cache`, `CacheBucket` and `CacheEntry` (#507 ) This is mostly a mechanical refactor that moves 80% of our code to the same cache abstraction. It introduces `Cache`, which abstracts away the path of the cache and the temp dir drop and is passed throughout the codebase. To get a specific cache bucket, you need to request your `CacheBucket` from `Cache`. `CacheBucket` centralizes the names of all cache buckets, moving them away from the string constants spread throughout the crates. Specifically for working with the `CachedClient`, there is a `CacheEntry`. I'm not sure yet if that is a strict improvement over `cache_dir: PathBuf, cache_file: String`, i may have to rotate that later. The interpreter cache moved into `interpreter-v0`. We can use the `CacheBucket` page to document the cache structure in each bucket: ![image](`b023fdfb`-e34d-4c2d-8663-b5f73937a539)	2023-11-28 17:11:14 +00:00
Charlie Marsh	afda835544	Avoid clone for `WheelMetadataCache` (#500 ) This doesn't need to own the underlying data which allows us to remove a number of clones.	2023-11-25 23:33:59 +00:00
konsti	d54e780843	Source dist metadata refactor (#468 ) ## Summary and motivation For a given source dist, we store the metadata of each wheel built through it in `built-wheel-metadata-v0/pypi/<source dist filename>/metadata.json`. During resolution, we check the cache status of the source dist. If it is fresh, we check `metadata.json` for a matching wheel. If there is one we use that metadata, if there isn't, we build one. If the source is stale, we build a wheel and overwrite `metadata.json` with that single wheel. This PR thereby ties the local built wheel metadata cache to the freshness of the remote source dist. This functionality is available through `SourceDistCachedBuilder`. `puffin_installer::Builder`, `puffin_installer::Downloader` and `Fetcher` are removed; instead there is now `FetchAndBuild`, which calls into the also new `SourceDistCachedBuilder`. `FetchAndBuild` is the new main high-level abstraction: it spawns parallel fetching/building; for wheel metadata it calls into the registry client, for wheel files it fetches them, for source dists it calls `SourceDistCachedBuilder`. It handles locks around builds and, newly added, inter-process file locking for git operations. Fetching and building source distributions now happens in parallel in `pip-sync`, i.e. we don't have to wait for the largest wheel to be downloaded to start building source distributions. In a follow-up PR, I'll also clear built wheels when they've become stale. Another effect is that in a fully cached resolution, we need neither zip reading nor email parsing. 
Closes #473 ## Source dist cache structure Entries by supported sources: * `<build wheel metadata cache>/pypi/foo-1.0.0.zip/metadata.json` * `<build wheel metadata cache>/<sha256(index-url)>/foo-1.0.0.zip/metadata.json` * `<build wheel metadata cache>/url/<sha256(url)>/foo-1.0.0.zip/metadata.json` But the url filename does not need to be a valid source dist filename (<https://github.com/search?q=path%3A*%2Frequirements.txt+master.zip&type=code>), so it could also be the following and we have to take any string as filename: `<build wheel metadata cache>/url/<sha256(url)>/master.zip/metadata.json` Example: ```text # git source dist pydantic-extra-types @ git+https://github.com/pydantic/pydantic-extra-types.git # pypi source dist django_allauth==0.51.0 # url source dist werkzeug @ `ff1904eb5e/werkzeug-3.0.1.tar.gz` ``` will be stored as ```text built-wheel-metadata-v0 ├── git │ └── 5c56bc1c58c34c11 │ └── 843b753e9e8cb74e83cac55598719b39a4d5ef1f │ └── metadata.json ├── pypi │ └── django-allauth-0.51.0.tar.gz │ └── metadata.json └── url └── 6781bd6440ae72c2 └── werkzeug-3.0.1.tar.gz └── metadata.json ``` The inside of a `metadata.json`: ```json { "data": { "django_allauth-0.51.0-py3-none-any.whl": { "metadata-version": "2.1", "name": "django-allauth", "version": "0.51.0", ... } } } ```	2023-11-24 17:47:58 +00:00
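The fresh/stale rule described in #468 can be sketched as follows. This is an in-memory illustration under stated assumptions, not the code behind `SourceDistCachedBuilder`: `WheelMetadata`, `BuiltWheels`, `metadata_for`, and the `build` callback are hypothetical names, and a `HashMap` stands in for the contents of one `metadata.json`.

```rust
use std::collections::HashMap;

/// Metadata of one built wheel, reduced to two fields for illustration.
#[derive(Clone, Debug, PartialEq)]
struct WheelMetadata {
    name: String,
    version: String,
}

/// In-memory stand-in for one `metadata.json`: wheel filename -> metadata.
type BuiltWheels = HashMap<String, WheelMetadata>;

/// Return metadata for `wheel`, building only when the cache cannot serve it.
/// `build` is a hypothetical callback standing in for the actual wheel build.
fn metadata_for(
    cached: &mut BuiltWheels,
    source_is_fresh: bool,
    wheel: &str,
    build: impl FnOnce() -> WheelMetadata,
) -> WheelMetadata {
    if source_is_fresh {
        if let Some(hit) = cached.get(wheel) {
            // Fresh source dist and a matching wheel: reuse the metadata.
            return hit.clone();
        }
        // Fresh but no matching wheel: fall through and add a new entry.
    } else {
        // Stale source dist: discard every previously recorded wheel.
        cached.clear();
    }
    let built = build();
    cached.insert(wheel.to_string(), built.clone());
    built
}
```

In the stale case the map ends up holding exactly the newly built wheel, which mirrors replacing `metadata.json` with that single wheel.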
Charlie Marsh	9d35128840	Use Clippy lint table over Cargo config (#490 ) Closes https://github.com/astral-sh/puffin/issues/482.	2023-11-22 15:10:27 +00:00
Charlie Marsh	17228ba04e	Add support for path dependencies (#471 ) ## Summary This PR adds support for local path dependencies. The approach mostly just falls out of our existing approach and infrastructure for Git and URL dependencies. Closes https://github.com/astral-sh/puffin/issues/436. (We'll open a separate issue for editable installs.) ## Test Plan Added `pip-compile` tests that pre-download a wheel or source distribution, then install it via local path.	2023-11-21 11:49:42 +00:00
konsti	f0841cdb6e	Wheel metadata refactor (#462 ) A consistent cache structure for remote wheel metadata: * `<wheel metadata cache>/pypi/foo-1.0.0-py3-none-any.json` * `<wheel metadata cache>/<digest(index-url)>/foo-1.0.0-py3-none-any.json` * `<wheel metadata cache>/url/<digest(url)>/foo-1.0.0-py3-none-any.json` The source dist caching will use a similar structure (#468).	2023-11-20 17:26:36 +01:00
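The three path shapes listed in #462 could be derived from the wheel's source roughly like this. The sketch is an assumption throughout: `WheelSource`, `digest`, and `wheel_metadata_path` are hypothetical names, and the digest is a standard-library hash used as a stand-in because the commit message does not say which digest function is used.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::path::PathBuf;

/// Where a wheel came from (hypothetical type, for illustration only).
enum WheelSource<'a> {
    /// The default PyPI index.
    Pypi,
    /// Another registry, identified by its index URL.
    Index(&'a str),
    /// A direct URL dependency.
    Url(&'a str),
}

/// Stand-in digest: a 16-hex-character rendering of a 64-bit std hash.
/// The real digest function is not specified in the commit message.
fn digest(input: &str) -> String {
    let mut hasher = DefaultHasher::new();
    input.hash(&mut hasher);
    format!("{:016x}", hasher.finish())
}

/// Path of the metadata file, relative to the wheel metadata cache root.
fn wheel_metadata_path(source: &WheelSource<'_>, wheel_stem: &str) -> PathBuf {
    let file = format!("{wheel_stem}.json");
    match source {
        // <wheel metadata cache>/pypi/<wheel>.json
        WheelSource::Pypi => PathBuf::from("pypi").join(file),
        // <wheel metadata cache>/<digest(index-url)>/<wheel>.json
        WheelSource::Index(index_url) => PathBuf::from(digest(index_url)).join(file),
        // <wheel metadata cache>/url/<digest(url)>/<wheel>.json
        WheelSource::Url(url) => PathBuf::from("url").join(digest(url)).join(file),
    }
}
```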
konsti	24f00f5a33	Create cache dir before canonicalize (#454 ) `fs::canonicalize` fails when the directory does not exist, which i missed in #453	2023-11-19 13:49:13 +00:00
konsti	ab60233131	Use absolute cache paths (#453 ) Previously, git requirements would fail when setting `--cache-dir`: ```console $ cargo run --bin puffin -- pip-compile --cache-dir cache-all-kinds scripts/benchmarks/requirements/all-kinds.in error: Failed to build distribution from URL: git+https://github.com/pydantic/pydantic-extra-types.git Caused by: Invalid path URL: cache-all-kinds/git-v0/db/b49ffcfeb6c2e9d8 ``` The cause is that the cache dir was a relative path, while `Url` needs an absolute one. The solution is to turn the cache dir into an absolute path. This never showed up in the tests since the tests use absolute temp dirs for everything.	2023-11-19 13:32:32 +00:00
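Taken together, #453 and #454 amount to "create the directory, then canonicalize it". A minimal sketch using only the standard library; `absolute_cache_dir` is a hypothetical helper name, not a function from the codebase.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Create the cache directory if needed, then resolve it to an absolute path.
/// `absolute_cache_dir` is a hypothetical helper name.
fn absolute_cache_dir(cache_dir: &Path) -> io::Result<PathBuf> {
    // `fs::canonicalize` returns an error for a path that does not exist,
    // so the directory has to be created first.
    fs::create_dir_all(cache_dir)?;
    fs::canonicalize(cache_dir)
}
```

Swapping the two calls reproduces the failure fixed in #454 whenever the cache directory is missing.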
Charlie Marsh	b1c29447df	Use `temp_dir` casing everywhere (#440 )	2023-11-16 21:04:10 +00:00
konsti	1883dbdc21	Always¹ clear temporary directories (#437 ) Always¹ clear the temporary directories we create. * Clear source dist downloads: Previously, the temporary directories would remain in the cache dir, now they are cleared properly * Clear wheel file downloads: Delete the `.whl` file, we only need to cache the unpacked wheel * Consistent handling of cache arguments: Abstract the handling for CLI cache args away, again making sure we remove the `--no-cache` temp dir. There are no more `into_path()` calls that persist `TempDir`s that i could find. ¹Assuming drop is run, and deleting the directory doesn't silently error.	2023-11-16 20:49:48 +00:00
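The footnote in #437 ("assuming drop is run, and deleting the directory doesn't silently error") follows from how drop-based cleanup works. The guard below is a standard-library sketch of that pattern; `ScratchDir` is a hypothetical type, not the `TempDir` the commit message refers to.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical guard that removes its directory when dropped.
struct ScratchDir {
    path: PathBuf,
}

impl ScratchDir {
    /// Create the directory and take ownership of its cleanup.
    fn create(path: PathBuf) -> io::Result<Self> {
        fs::create_dir_all(&path)?;
        Ok(Self { path })
    }

    fn path(&self) -> &Path {
        &self.path
    }
}

impl Drop for ScratchDir {
    fn drop(&mut self) {
        // `drop` cannot return an error, so a failed removal is ignored here.
        // Cleanup is also skipped entirely if drop never runs, for example
        // after `std::mem::forget` or `std::process::exit`.
        let _ = fs::remove_dir_all(&self.path);
    }
}
```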
konsti	5cef40d87a	Add proper caching for pypi metadata fetching kinds (#368 ) I intend this to become the main form of caching for puffin: You can make http requests, you transform the data to what you really need, you have control over the cache key, and the cache is always json (or anything else much faster we want to replace it with as long as it's serde!)	2023-11-10 11:03:40 +00:00
Charlie Marsh	4b83d8e949	Require URL dependencies to be declared upfront (#319 ) In the resolver, our current model for solving URL dependencies requires that we visit the URL dependency _before_ the registry-based dependency. This PR encodes a strict requirement that all URL dependencies be declared upfront, either as requirements or constraints. I wrote more about how it works and why it's necessary in documentation [here](https://github.com/astral-sh/puffin/pull/319/files#diff-2b1c4f36af0c62a2b7bebeae9473ae083588f2a6b18a3ec52393a24266adecbbR20). I think we could relax this constraint over time, but it requires a more sophisticated model -- and for now, I just want something that's (1) correct, (2) easy for us to reason about, and (3) easy for users to reason about. As additional motivation... allowing arbitrary URL dependencies anywhere in the tree creates some really confusing situations in which I'm not even sure what the right answers are. For example, assume you declare a direct dependency on `Werkzeug==2.0.0`. You then depend on a version of Flask that depends on a version of `Werkzeug` from some arbitrary URL. You build the source distribution at that arbitrary URL, and it turns out it _does_ build to a declared version of 2.0.0. What should happen? (And if it resolves to a version that _isn't_ 2.0.0, what should happen _then_?) I suspect different tools handle this differently, but it must lead to a lot of "silent" failures. In my testing of Poetry, it seems like Poetry just ignores the URL dependency, which seems wrong, but is also a behavior we could implement in the future. Closes https://github.com/astral-sh/puffin/issues/303. Closes https://github.com/astral-sh/puffin/issues/284.	2023-11-05 17:09:58 +00:00


54 commits