When doing a directory traversal for source dist inclusion, we want to
offer the user include and exclude options, and we want to avoid
traversing irrelevant directories. The latter is important for
performance, especially on network file systems, but also with large
data directories, or (not-included) directories with other permissions.
To support this, we introduce `GlobDirFilter`, which uses a DFA from
regex_automata to determine whether any children of a directory can be
included and skips the directory if not.
The globs are based on PEP 639. The syntax is more restricted than glob
or globset, but it's standardized. I chose it over glob or globset
because we're already using this syntax for `project.license-files` a
required by PEP 639, so it makes sense to use the same globs for all
includes (see e.g.
4f52a3bb62/pyproject.toml (L36-L48)
for example with same semantics for include and exclude)
### Semantics
Glob semantics are complex due to mixing directories and files,
expectations around simplicity and our need to exclude most of the tree
in the project from traversal. The current draft uses a syntax that
optimizes for simple default use cases for the start.
#### includes
Glob expressions which files and directories to include in the source
distribution.
Includes are anchored, which means that `pyproject.toml` includes only
`<project root>/pyproject.toml`. Use for example `assets/**/sample.csv`
to include for all
`sample.csv` files in `<project root>/assets` or any child directory. To
recursively include
all files under a directory, use a `/**` suffix, e.g. `src/**`. For
performance and
reproducibility, avoid unanchored matches such as `**/sample.csv`.
The glob syntax is the reduced portable glob from
[PEP 639](https://peps.python.org/pep-0639/#add-license-FILES-key).
#### excludes
Glob expressions which files and directories to exclude from the
previous source
distribution includes.
Excludes are not, which means that `__pycache__` excludes all
directories named
`__pycache__` and it's children anywhere. To anchor a directory, use a
`/` prefix, e.g.,
`/dist` will exclude only `<project root>/dist`.
The glob syntax is the reduced portable glob from
[PEP 639](https://peps.python.org/pep-0639/#add-license-FILES-key).
Very basic source distribution support. What's included:
- Include and exclude patterns (hard-coded): Currently, we have
globset+walkdir in one part and glob in the other. I'll migrate
everything to globset+walkset and some custom perf optimizations to
avoid traversing irrelevant directories on top. I'll also pick a glob
syntax (or subset), PEP 639 seems like a good candidate since it's
consistent with what we already have to support.
- Add the `PKG-INFO` file with metadata: Thanks to Code Metadata 2.2,
this metadata is reliable and can be read statically by external tools.
Example output:
```
$ tar -ztvf dist/dummy-0.1.0.tar.gz
-rw-r--r-- 0/0 154 1970-01-01 01:00 dummy-0.1.0/PKG-INFO
-rw-rw-r-- 0/0 509 1970-01-01 01:00 dummy-0.1.0/pyproject.toml
drwxrwxr-x 0/0 0 1970-01-01 01:00 dummy-0.1.0/src/dummy
drwxrwxr-x 0/0 0 1970-01-01 01:00 dummy-0.1.0/src/dummy/submodule
-rw-rw-r-- 0/0 30 1970-01-01 01:00 dummy-0.1.0/src/dummy/submodule/impl.py
-rw-rw-r-- 0/0 14 1970-01-01 01:00 dummy-0.1.0/src/dummy/submodule/__init__.py
-rw-rw-r-- 0/0 12 1970-01-01 01:00 dummy-0.1.0/src/dummy/__init__.py
```
No tests since the source distributions don't build valid wheels yet.
As per
https://matklad.github.io/2021/02/27/delete-cargo-integration-tests.html
Before that, there were 91 separate integration tests binary.
(As discussed on Discord — I've done the `uv` crate, there's still a few
more commits coming before this is mergeable, and I want to see how it
performs in CI and locally).