## Summary
I think `GlobDirFilter::match_directory` should return `true` if it
failed to construct a DFA
to force a full directory traversal. Returning `false` means that the
build backend excludes all files which
is unlikely what we want in that situation.
We may run on case-sensitive file systems (Linux, generally) or on
case-insensitive file systems (Windows, generally), while modules in
Python may be lower or upper case. For robustness over filesystem
casing, we require an explicit module name for modules with upper cases.
Fixes#13419
PEP 639 does not allow any characters that aren't in either their
limited glob syntax or the alphanumeric Unicode characters. This means
there's no way to express a glob such as `**/@test` for the excludes.
We extend the glob syntax from PEP 639 by introducing backslash escapes,
which can escape all characters but path separators (forward and
backwards slashes) to be parsed verbatim.
This means we have two glob parsers: The strict PEP 639 parser for
`project.license-files`, and our extended parser for `tool.uv`, with a
slight difference if you need to use special characters, to both adhere
to PEP 639 and to support cases such as #13280.
Fixes#13280
The goal of this PR is to support reproducible builds and best-effort
platform-independent builds. Previously, while the build backend would
build the same source dist and wheel on the same machine, they would
look different across different operating systems. This PR fixes the
platform-dependent walk dir order by sorting and removes
platform-specific permissions from the source dist that had caused those
differences.
The reproducibility goal does not extend to platform-dependent
filesystem features, such as permissions and links, especially in
interaction with Git. Since most users share code across platforms
through Git, we're focusing on cross-platform behavior under Git. One of
those caveats is intentional: If a file, such as a bash script, has an
executable bit, we preserve it. This means that E.g. builds of Git
checkout of a repository with an executable shell script in the sources
will have different archives on Unix and Windows. Another relevant case
are symlinks: By default, Git on Windows replaces symlinks with a file
that contains the path to the target file
(https://stackoverflow.com/q/5917249/3549270). (This example comes from
Cargo, where it means that the package archive is different on Windows
when symlinking license from the repository root to a workspace package)
Best reviewed commit-by-commit
When building the source distribution, we always need to include
`pyproject.toml` and the module, when building the wheel, we always
include the module but nothing else at top level. Since we only allow a
single module per wheel, that means that there are no specific wheel
includes. This means we have source includes, source excludes, wheel
excludes, but no wheel includes: This is defined by the module root,
plus the metadata files and data directories separately.
Extra source dist includes are currently unused (they can't end up in
the wheel currently), but it makes sense to model them here, they will
be needed for any sort of procedural build step.
This results in the following fields being relevant for inclusions and
exclusion:
* `pyproject.toml` (always included in the source dist)
* project.readme: PEP 621
* project.license-files: PEP 639
* module_root: `Path`
* source_include: `Vec<Glob>`
* source_exclude: `Vec<Glob>`
* wheel_exclude: `Vec<Glob>`
* data: `Map<KnownDataName, Path>`
An opinionated choice is that that wheel excludes always contain the
source excludes: Otherwise you could have a path A in the source tree
that gets included when building the wheel directly from the source
tree, but not when going through the source dist as intermediary,
because A is in source excludes, but not in the wheel excludes. This has
been a source of errors previously.
In the process, I fixed a bug where we would skip directories and only
include the files and were missing license due to absolute globs.
<!--
Thank you for contributing to uv! To help us out with reviewing, please
consider the following:
- Does this pull request include a summary of the change? (See below.)
- Does this pull request include a descriptive title?
- Does this pull request include references to any relevant issues?
-->
## Summary
While working on potential bug fixes with temporary files on Windows (I
think I am currently ecountering the same issue as #2810)
I noticed that sub-workspaces were not all having the same `tempfile`
version. And they were not relying on the cargo root project dependency.
I don't know at all if it was done on purpose or not.
(I also wanted to override the root dependency with a local source but
it was not possible due to sub-workspaces not relying on the same).
The root lockfile already pinned to the `3.14.0`. Some sub-workspaces
were depending on the `3.12.0`, some others on the `3.14.0`. So I
updated the root `Cargo.toml` to the `3.14.0`.
Feel free to decline if it was done on purpose! No worries at all
🙂
Thanks!
<!-- What's the purpose of the change? What does it do, and why? -->
## Test Plan
All units tests are still passing on my side. Let's see with the
pull-request CI 😄
This PR contains three smaller improvements:
* Improve the include/exclude logging. We're still showing the current
directory as empty backticks, not sure what to do about that
* Add early stopping to license file globbing, so we don't traverse the
whole directory recursively when license files can only be in few places
* Support explicit wheel excludes. These are still not entirely right,
but at least we're correctly excluding compiled python files now. The
next step is to make sure that the wheel excludes contain all pattern
from source dist excludes, to make sure source tree -> wheel can't have
more files than source tree -> source dist -> wheel.
<!--
Thank you for contributing to uv! To help us out with reviewing, please
consider the following:
- Does this pull request include a summary of the change? (See below.)
- Does this pull request include a descriptive title?
- Does this pull request include references to any relevant issues?
-->
## Summary
In uv-globfilter, use the workspace `fs-err` in `dev-dependencies`.
This fixes an unnecessary dev-dependency on `fs-err` 2.x even after the
workspace fs-err was updated to 3.x in
https://github.com/astral-sh/uv/pull/8625.
The `Cargo.lock` file still has `fs-err v2.11.0` after this PR, but it
is via `tracing-durations-export v0.3.0` rather than directly required
by any `uv` crate.
## Test Plan
<!-- How was it tested? -->
```
$ cd crates/uv-globfilter/
$ cargo test
```
A first milestone: source tree -> source dist -> wheel -> install works.
This PR adds a test for this.
There's obviously a lot still missing, including basics such as the
Readme inclusion.
When doing a directory traversal for source dist inclusion, we want to
offer the user include and exclude options, and we want to avoid
traversing irrelevant directories. The latter is important for
performance, especially on network file systems, but also with large
data directories, or (not-included) directories with other permissions.
To support this, we introduce `GlobDirFilter`, which uses a DFA from
regex_automata to determine whether any children of a directory can be
included and skips the directory if not.
The globs are based on PEP 639. The syntax is more restricted than glob
or globset, but it's standardized. I chose it over glob or globset
because we're already using this syntax for `project.license-files` a
required by PEP 639, so it makes sense to use the same globs for all
includes (see e.g.
4f52a3bb62/pyproject.toml (L36-L48)
for example with same semantics for include and exclude)
### Semantics
Glob semantics are complex due to mixing directories and files,
expectations around simplicity and our need to exclude most of the tree
in the project from traversal. The current draft uses a syntax that
optimizes for simple default use cases for the start.
#### includes
Glob expressions which files and directories to include in the source
distribution.
Includes are anchored, which means that `pyproject.toml` includes only
`<project root>/pyproject.toml`. Use for example `assets/**/sample.csv`
to include for all
`sample.csv` files in `<project root>/assets` or any child directory. To
recursively include
all files under a directory, use a `/**` suffix, e.g. `src/**`. For
performance and
reproducibility, avoid unanchored matches such as `**/sample.csv`.
The glob syntax is the reduced portable glob from
[PEP 639](https://peps.python.org/pep-0639/#add-license-FILES-key).
#### excludes
Glob expressions which files and directories to exclude from the
previous source
distribution includes.
Excludes are not, which means that `__pycache__` excludes all
directories named
`__pycache__` and it's children anywhere. To anchor a directory, use a
`/` prefix, e.g.,
`/dist` will exclude only `<project root>/dist`.
The glob syntax is the reduced portable glob from
[PEP 639](https://peps.python.org/pep-0639/#add-license-FILES-key).