## Summary
There's some inconsistent behaviour in handling symlinks when
`cache-key` is a glob or a file path. This PR attempts to address that.
- When cache-key is a path,
[`Path::metadata()`](https://doc.rust-lang.org/std/path/struct.Path.html#method.metadata)
is used to check if it's a file or not. According to the docs:
> This function will traverse symbolic links to query information about
the destination file.
So, if the target file is a symlink, it will be resolved and the
metadata will be queried for the underlying file.
- When cache-key is a glob, `globwalk` is used, specifically allowing
for symlinks:
```rust
.file_type(globwalk::FileType::FILE | globwalk::FileType::SYMLINK)
```
- However, without enabling link following, `DirEntry::metadata()` will
return an equivalent of `Path::symlink_metadata()` (and not
`Path::metadata()`), which will have a file type that looks like
```rust
FileType {
is_file: false,
is_dir: false,
is_symlink: true,
..
}
```
- Then, there's a check for `metadata.is_file()` which fails and
complains that the target entry "is a directory when file was expected".
- TLDR: glob cache-keys don't work with symlinks.
## Solutions
Option 1 (current PR): follow symlinks.
Option 2 (also doable): don't follow symlinks, but resolve the resulting
target entry manually in case its file type is a symlink. However, this
would be a little weird and unobvious in that we resolve files but not
directories for some reason. Also, symlinking directories is pretty
useful if you want to symlink directories of local dependencies that are
not under the project's path.
## Test Plan
This has been tested manually:
```rust
fn main() {
for follow_links in [false, true] {
let walker = globwalk::GlobWalkerBuilder::from_patterns(".", &["a/*"])
.file_type(globwalk::FileType::FILE | globwalk::FileType::SYMLINK)
.follow_links(follow_links)
.build()
.unwrap();
let entry = walker.into_iter().next().unwrap().unwrap();
dbg!(&entry);
dbg!(entry.file_type());
dbg!(entry.path_is_symlink());
dbg!(entry.path());
let meta = entry.metadata().unwrap();
dbg!(meta.is_file());
}
let path = std::path::PathBuf::from("./a/b");
dbg!(path.metadata().unwrap().file_type());
dbg!(path.symlink_metadata().unwrap().file_type());
}
```
Current behaviour (glob cache-key, don't follow links):
```
[src/main.rs:9:9] &entry = DirEntry("./a/b")
[src/main.rs:10:9] entry.file_type() = FileType {
is_file: false,
is_dir: false,
is_symlink: true,
..
}
[src/main.rs:11:9] entry.path_is_symlink() = true
[src/main.rs:12:9] entry.path() = "./a/b"
[src/main.rs:14:9] meta.is_file() = false
```
Glob cache-key, follow links:
```
[src/main.rs:9:9] &entry = DirEntry("./a/b")
[src/main.rs:10:9] entry.file_type() = FileType {
is_file: true,
is_dir: false,
is_symlink: false,
..
}
[src/main.rs:11:9] entry.path_is_symlink() = true
[src/main.rs:12:9] entry.path() = "./a/b"
[src/main.rs:14:9] meta.is_file() = true
```
Using `path.metadata()` for a non-glob cache key:
```
[src/main.rs:18:5] path.metadata().unwrap().file_type() = FileType {
is_file: true,
is_dir: false,
is_symlink: false,
..
}
[src/main.rs:19:5] path.symlink_metadata().unwrap().file_type() = FileType {
is_file: false,
is_dir: false,
is_symlink: true,
..
}
```
## Summary
(Related PR: #13438 - would be nice to have it merged as well since it
touches on the same globwalker code)
There's a few issues with `cache-key` globs, which this PR attempts to
address:
- As of the current state, parent or absolute paths are not allowed,
which is not obvious and is not documented. E.g., cache-key paths of the
form `{file = "../dep/**"}` will be essentially ignored.
- Absolute glob patterns also don't work (funnily enough, there's logic
in `globwalk` itself that attempts to address it in
[`globwalk::glob_builder()`](8973fa2bc5/src/lib.rs (L415)),
which serves as inspiration to some parts of this PR).
- The reason for parent paths being ignored is the way globwalker is
currently being triggered in `uv-cache-info`: the base directory is
being walked over completely and each entry is then being matched to one
of the provided match patterns.
- This may also end up being very inefficient if you have a huge root
folder with thousands of files: if your match patterns are `a/b/*.rs`
and `a/c/*.py` then instead of walking over the root directory, you can
just walk over `a/b` and `a/c` and match the relevant patterns there.
- Why supporting parent paths may be important to the point of being a
blocker: in large codebases with python projects depending on other
local non-python projects (e.g. rust crates), cache-keys can be very
useful to track dependency on the source code of the latter (e.g.
`cache-keys = [{ file = "../../crates/some-dep/**" }]`.
- TLDR: parent/absolute cache-key globs don't work, glob walk can be
slow.
## Solution
- In this PR, user-provided glob patterns are first clustered
(LCP-style) into pattern groups with longest common path prefix; each of
these groups can then be walked over separately.
- Pattern groups do not overlap, so we would never walk over the same
directory twice (unless there's symlinks pointing to same folders).
- Paths are not canonicalized nor virtually normalized (which is
impossible on Unix without FS access), so the method is symlink-safe
(i.e. we don't treat `a/b/..` as `a`) and should work fine with #13438.
- Because of LCP logic, the minimal amount of directory space will be
traversed to cover all patterns.
- Absolute glob patterns will now work.
- Parent-relative glob patterns will now work.
- Glob walking will be more efficient in some cases.
## Possible improvements
- Efficiency can be further greatly improved if we limit max depth for
globwalk. Currently, a simple ".toml" will deep-traverse the whole
folder. Essentially, max depth can be always set to either N or
infinity. If a pattern at a pivot node contains `**`, we collect all
children nodes from the subtree into the same group and don't limit max
depth; otherwise, we set max depth to the length of the glob pattern.
This wouldn't change correctness though and can we done separately if
needed.
- If this is considered important enough, docs can be updated to
indicate that parent and absolute globs are supported (and symlinks are
resolved, if the relevant PR is algo merged in).
## Test Plan
- Glob splitting and clustering tests are included in the PR.
- Relative and absolute glob cache-keys were tested in an actual
codebase.
## Summary
This has come up a few times, so it seems worth addressing. If you
migrate from a flat layout to a `src` layout or vice versa, we now
invalidate the package metadata.
Closes https://github.com/astral-sh/uv/issues/12047
## Summary
The assumption that all tags are listed under a flat `.git/ref/tags`
structure was wrong. Git creates a hierarchy of directories for tags
containing slashes. To fix the cache key calculation, we need to
recursively traverse all files under that folder instead.
## Test Plan
1. Create an `uv` project with git-tag cache-keys;
2. Add any tag with slash;
3. Run `uv sync` and see uv_cache_info error in verbose log;
4. `uv sync` doesn't trigger reinstall on next tag addition or removal;
5. With fix applied, reinstall triggers on every tag update and there
are no errors in the log.
Fixes#10467
---------
Co-authored-by: Sergei Nizovtsev <sergei.nizovtsev@eqvilent.com>
Co-authored-by: Charlie Marsh <charlie.r.marsh@gmail.com>
## Summary
We no longer need this struct; we bumped the cache bucket version
anyway, so the `Timestamp` variant is never encountered. This means we
get real Serde error messages.
Closes https://github.com/astral-sh/uv/issues/8488.
As per
https://matklad.github.io/2021/02/27/delete-cargo-integration-tests.html
Before that, there were 91 separate integration tests binary.
(As discussed on Discord — I've done the `uv` crate, there's still a few
more commits coming before this is mergeable, and I want to see how it
performs in CI and locally).
## Summary
This has been asked for a few times. There are risks that these checks
could be slow, but they're buyer-beware.
Closes https://github.com/astral-sh/uv/issues/7246.
## Summary
This PR adds a more flexible cache invalidation abstraction for uv, and
uses that new abstraction to improve support for dynamic metadata.
Specifically, instead of relying solely on a timestamp, we now pass
around a `CacheInfo` struct which (as of now) contains
`Option<Timestamp>` and `Option<Commit>`. The `CacheInfo` is saved in
`dist-info` as `uv_cache.json`, so we can test already-installed
distributions for cache validity (along with testing _cached_
distributions for cache validity).
Beyond the defaults (`pyproject.toml`, `setup.py`, and `setup.cfg`
changes), users can also specify additional cache keys, and it's easy
for us to extend support in the future. Right now, cache keys can either
be instructions to include the current commit (for `setuptools_scm` and
similar) or file paths (for `hatch-requirements-txt` and similar):
```toml
[tool.uv]
cache-keys = [{ file = "requirements.txt" }, { git = true }]
```
This change should be fully backwards compatible.
Closes https://github.com/astral-sh/uv/issues/6964.
Closes https://github.com/astral-sh/uv/issues/6255.
Closes https://github.com/astral-sh/uv/issues/6860.