This adds a version of the `bisect()` revset that simply takes the
midpoint of the input set when iterated over. That's correct in linear
history and probably usually good enough in non-linear history too. We
can improve it later. I think it's valuable to have this building
block even in an imperfect state.
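A minimal sketch of the midpoint selection, assuming the input set is
available as a slice in iteration order. `bisect_midpoint` is a
hypothetical name for illustration, not the actual revset code.

```rust
/// Returns the element in the middle of `candidates`, or `None` if the
/// input is empty. In linear history this halves the search range.
fn bisect_midpoint<T: Clone>(candidates: &[T]) -> Option<T> {
    // For an even number of candidates this picks the later of the two
    // middle elements (index len / 2).
    candidates.get(candidates.len() / 2).cloned()
}
```

For example, `bisect_midpoint(&["e", "d", "c", "b", "a"])` picks `"c"`.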
Sometimes we have already found that a part of the graph is reachable,
so it's not necessary to continue evaluating the predicate on any
connected commits.
Benchmark results on Git repo compared to previous commit:
```
reachable(@, all()) 1.8% slower
reachable(@, v2.49.0..) 1.2% slower
reachable(author(peff), all()) 93.4% faster
reachable(author(peff), v2.49.0..) 52.0% faster
```
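The idea can be sketched as follows. This is only an illustration, not
the code in this patch: it groups commits with a union-find (my choice
for the sketch), and `reachable` and `is_source` are hypothetical
names. The point is that the predicate is skipped for any commit whose
connected component is already known to be reachable.

```rust
/// Finds the representative of `x`'s component (with path halving).
fn find(parent: &mut [usize], mut x: usize) -> usize {
    while parent[x] != x {
        parent[x] = parent[parent[x]];
        x = parent[x];
    }
    x
}

/// Returns per-node reachability and the number of predicate calls.
fn reachable(
    node_count: usize,
    edges: &[(usize, usize)],
    mut is_source: impl FnMut(usize) -> bool,
) -> (Vec<bool>, usize) {
    // Build connected components from the edges within the domain.
    let mut parent: Vec<usize> = (0..node_count).collect();
    for &(a, b) in edges {
        let (ra, rb) = (find(&mut parent, a), find(&mut parent, b));
        parent[ra] = rb;
    }
    let mut component_reachable = vec![false; node_count];
    let mut evaluations = 0;
    for node in 0..node_count {
        let root = find(&mut parent, node);
        if component_reachable[root] {
            continue; // Already known reachable: skip the predicate.
        }
        evaluations += 1;
        if is_source(node) {
            component_reachable[root] = true;
        }
    }
    let flags: Vec<bool> = (0..node_count)
        .map(|node| {
            let root = find(&mut parent, node);
            component_reachable[root]
        })
        .collect();
    (flags, evaluations)
}
```

The more commits match the predicate, the earlier each component is
marked and the more evaluations are skipped, which is consistent with
the `author(peff)` cases benefiting most.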
This seems good for both lucky and unlucky cases. If the ancestor is reachable,
DFS tends to finish within fewer steps. If the ancestor is unreachable, we'll
have to visit all candidates, so the visited set can be large.
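A minimal sketch of a DFS-based ancestor check, assuming commits are
identified by index positions where parents always have smaller
positions than their children. The function and parameter names are
hypothetical.

```rust
use std::collections::HashSet;

/// `parents[pos]` lists the parent positions of the commit at `pos`.
fn is_ancestor(parents: &[Vec<usize>], ancestor: usize, descendant: usize) -> bool {
    let mut visited = HashSet::new();
    let mut stack = vec![descendant];
    while let Some(pos) = stack.pop() {
        if pos == ancestor {
            return true; // Lucky case: found without visiting everything.
        }
        // Positions below `ancestor` can't lead back to it, and there is
        // no need to walk the same commit twice.
        if pos < ancestor || !visited.insert(pos) {
            continue;
        }
        stack.extend(parents[pos].iter().copied());
    }
    // Unlucky case: all candidates were visited.
    false
}
```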
jj bench is-ancestor --ignore-working-copy -R ~/mirrors/linux
```
group old new
----- --- ---
is-ancestor-v6.0-v6.10 2.45 93.0±0.46µs 1.00 37.9±0.19µs
is-ancestor-v6.0.1-v6.10 2.90 12.7±0.09ms 1.00 4.4±0.13ms
```
This updates `rebase_with_empty_behavior()` to read old and new parent
commits concurrently, and to read their trees (possibly including
merging parents) concurrently. It seems like it might make a
difference if these objects are not already cached somewhere.
We don't have an intuitive way to search for non-UTF-8 strings with
diff_contains(), but this allows the user to search for UTF-8 patterns inside
files of arbitrary encoding (such as text logs).
We could instead make .is_match() accept AsRef<[u8]>, but I think an explicit
bytes method is better. haystack is usually a string.
We don't have a strong reason to disable unicode-incompatible patterns like
"(?-u)." This change will help fix bytes handling in diff_contains() revset.
According to the doc, the performance is on par with the unicode Regex.
https://docs.rs/regex/latest/regex/bytes/index.html#performance
Since we already have globset in transitive dependencies, this change helps
reduce the number of dependencies. Another reason is that globset provides a
function to convert glob to regex. This is nice because we use globs to match
against strings or internal repository paths instead of platform-native paths.
The GlobPattern wrapper is boxed because globset::Glob type is relatively big.
With experimental changed-path index, I noticed "jj log PATH" spends a fair
amount of time testing known/unwanted ancestor edges. Allocating BitSet
without a known lower bound can be wasteful, but it's still faster than using
HashSet or BTreeSet. I think we can split BitSet data into e.g. 4kB chunks to
mitigate the initial allocation cost.
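For illustration, a bit set keyed by commit position could look like
the sketch below. `PositionsBitSet` and its methods are hypothetical
names, and the chunking idea mentioned above is not implemented here.

```rust
/// A bit set indexed by commit position, backed by 64-bit words.
struct PositionsBitSet {
    words: Vec<u64>,
}

impl PositionsBitSet {
    /// Allocates enough words to hold positions `0..=max_pos` up front.
    fn with_max_pos(max_pos: usize) -> Self {
        PositionsBitSet {
            words: vec![0; max_pos / 64 + 1],
        }
    }

    /// Returns whether `pos` is set. Out-of-range positions are unset.
    fn get(&self, pos: usize) -> bool {
        let mask = 1u64 << (pos % 64);
        self.words
            .get(pos / 64)
            .map_or(false, |word| *word & mask != 0)
    }

    /// Sets `pos` and returns the previous value, growing if needed.
    fn get_set(&mut self, pos: usize) -> bool {
        let (index, mask) = (pos / 64, 1u64 << (pos % 64));
        if index >= self.words.len() {
            self.words.resize(index + 1, 0);
        }
        let was_set = self.words[index] & mask != 0;
        self.words[index] |= mask;
        was_set
    }
}
```

Testing and marking an edge is then a shift and a mask, with no
hashing or tree traversal involved.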
```
% hyperfine --sort command --warmup 3 --runs 5 -L bin jj-0,jj-1 \
'target/release-with-debug/{bin} -R ~/mirrors/git --ignore-working-copy log -r"tags()"'
Benchmark 1: target/release-with-debug/jj-0 -R ~/mirrors/git --ignore-working-copy log -r"tags()"
Time (mean ± σ): 2.709 s ± 0.035 s [User: 2.494 s, System: 0.215 s]
Range (min … max): 2.670 s … 2.747 s 5 runs
Benchmark 2: target/release-with-debug/jj-1 -R ~/mirrors/git --ignore-working-copy log -r"tags()"
Time (mean ± σ): 1.322 s ± 0.023 s [User: 1.121 s, System: 0.199 s]
Range (min … max): 1.308 s … 1.363 s 5 runs
Relative speed comparison
2.05 ± 0.05 target/release-with-debug/jj-0 -R ~/mirrors/git --ignore-working-copy log -r"tags()"
1.00 target/release-with-debug/jj-1 -R ~/mirrors/git --ignore-working-copy log -r"tags()"
```
"original" is a term often associated with the commit that introduced a
certain line (hence its origin). To avoid any confusion, any variable
that relates to the starting point of the annotation process is changed
to consistently use "starting" instead.
As a consequence, in case of error, rather than pointing at the last
commit where this line was seen, it points at the first commit where
the annotation process should continue if the domain were expanded.
This is better aligned with what `git blame` returns.
I'll add changed-paths index segments, and it would be tedious to pass (commits,
changed_paths) segment pairs around. Since this function receives
DefaultMutableIndex, it makes sense that the return value is an Index, not an
IndexSegment.
I'm going to change the operation link/association file to store structured data
so that we can add a separate changed-paths index file. I think it makes sense
to use raw bytes there. It's also nice that the segment file id is typed.
- IndexPosition is renamed to GlobalCommitPosition because we have
LocalCommitPosition, and GlobalCommitPosition is no longer a public type.
- Private types such as ParentIndexPosition aren't renamed. I'll probably split
commit index modules instead.
- ReadonlyIndexLoadError isn't renamed because I'm not sure if we'll add a new
error type dedicated to the new changed-paths index.
- IndexLevelStats is renamed, but IndexStats isn't because I'll probably add
stats of changed-paths index.
Since revset engine will use changed-paths index to evaluate files() predicate,
we need to pass &CompositeIndex wrapper around instead of &CompositeCommitIndex.
Index is now implemented for non-reference CompositeIndex type. This works
because the CompositeIndex type is now Sized, so the reference type can be
converted to a trait object.
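A small sketch of why Sized matters here. The trait and struct below
are simplified stand-ins with hypothetical fields, not the real
definitions.

```rust
/// Simplified stand-in for the index trait.
trait Index {
    fn num_commits(&self) -> usize;
}

/// A Sized wrapper type (the field is illustrative only).
struct CompositeIndex {
    segment_sizes: Vec<usize>,
}

impl Index for CompositeIndex {
    fn num_commits(&self) -> usize {
        self.segment_sizes.iter().sum()
    }
}

impl CompositeIndex {
    /// The coercion from `&CompositeIndex` to `&dyn Index` is only
    /// available because `CompositeIndex` is Sized.
    fn as_index(&self) -> &dyn Index {
        self
    }
}

/// A caller that only knows about the trait object.
fn count_commits(index: &dyn Index) -> usize {
    index.num_commits()
}
```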
This is ugly, but avoids CompositeIndex<'a> lifetime. If CompositeIndex had
a reference, trait bounds in the revset engine would become quite messy.
FWIW, I think a composite index type could be a pair of (Vec<Readonly>,
Option<Mutable>). I'm not going to reorganize the commit index at the moment,
but I assume the changed-paths index will be structured in that way.
Changed-paths index won't have data dependency between segments, and it will
have a commit offset field which only applies to the first segment. So it will
probably make sense to manage segments as an array, not as a linked list.
    commit offset
    segment #0:
      for each commit
        changed paths (interned)
      sstable of paths
    segment #1:
      for each commit
        changed paths (interned)
      sstable of paths
    ...
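In Rust terms, the layout above might be modeled roughly as below.
This is a sketch under my own assumptions: the type and field names
are hypothetical, and a sorted `Vec<String>` stands in for the sstable.

```rust
/// One segment: a path table plus, for each commit, indices into it.
struct ChangedPathSegment {
    /// Stand-in for the sstable of paths.
    paths: Vec<String>,
    /// For each commit in this segment, the interned changed paths.
    changed_paths: Vec<Vec<u32>>,
}

/// Segments managed as an array, with a single commit offset.
struct ChangedPathIndex {
    /// Position of the first commit covered by the first segment.
    commit_offset: usize,
    segments: Vec<ChangedPathSegment>,
}

impl ChangedPathIndex {
    /// Looks up the changed paths of the commit at global position `pos`.
    fn changed_paths(&self, pos: usize) -> Option<Vec<&str>> {
        // Commits before the offset aren't indexed.
        let mut local = pos.checked_sub(self.commit_offset)?;
        for segment in &self.segments {
            if let Some(ids) = segment.changed_paths.get(local) {
                let paths = ids
                    .iter()
                    .map(|&id| segment.paths[id as usize].as_str())
                    .collect();
                return Some(paths);
            }
            // Not in this segment: skip past its commits.
            local -= segment.changed_paths.len();
        }
        None
    }
}
```

Since each segment resolves its own interned paths, a lookup never has
to consult another segment.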
This will help wrap Arc<ReadonlyIndexSegment> and MutableIndexSegment in an
enum. We won't clone MutableIndexSegment, but the Clone impl should be harmless.
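Roughly, the enum I have in mind looks like the sketch below. The
struct bodies and the `Segment` name are placeholders, not the real
types. Deriving Clone on the enum requires both variants' payloads to
be Clone, which is why the mutable segment needs the impl.

```rust
use std::sync::Arc;

/// Placeholder for the read-only segment type.
#[derive(Debug)]
struct ReadonlySegment {
    commits: Vec<u32>,
}

/// Placeholder for the mutable segment type. Clone is needed only so
/// that the enum below can derive Clone.
#[derive(Clone, Debug)]
struct MutableSegment {
    commits: Vec<u32>,
}

#[derive(Clone, Debug)]
enum Segment {
    /// Cloning this variant only bumps the reference count.
    Readonly(Arc<ReadonlySegment>),
    /// Cloning this variant would copy the data.
    Mutable(MutableSegment),
}

impl Segment {
    fn num_commits(&self) -> usize {
        match self {
            Segment::Readonly(segment) => segment.commits.len(),
            Segment::Mutable(segment) => segment.commits.len(),
        }
    }
}
```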
I'll insert a wrapper type that holds the "commit" index and a new changed-paths
index. The Index trait will be implemented on that wrapper.
It's unclear whether the non-trait methods should be pub or pub(super), but that
shouldn't matter since the CompositeIndex type isn't public. I just copied
visibility of similar methods.
This replaces `|| async` by `async ||` since the latter is presumably
more idiomatic.
There may be other places where we can now use async closures but I
don't remember where they are and I don't know of a good way to
identify them.
`Store::tree_builder()` returns a `TreeBuilder`. Almost all callers
should be using the `MergedTreeBuilder` these days. This patch
therefore removes `tree_builder()` to reduce the risk of accidentally
using it.
In 1b1edc7a90, I missed the importance of this comment:
```
// Whenever we add an entry to `self.pending_trees`, we also add an Ok() entry
// to `self.items`.
```
The `self.items` entry was there to make sure that we wait for the
pending tree to be polled to completion, thus resulting in its entries
getting added to `self.items`. After my commit, we no longer always
add an entry to `items`, which means that we can end up emitting
entries from a parent tree before entries in a child tree, such as
`foo/baz` before `foo/bar/qux` even though `baz` comes after `bar`.
This patch fixes the bug by instead checking in `self.pending_trees`
that there are no directories that we need to emit first. Thanks to
@yuja for the suggestion to do it this way instead.
The next patch will update the tests to catch regressions.
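The check can be illustrated with the sketch below. It is a simplified
model, not the actual `TreeDiffStream` code: paths are vectors of
components, both collections are plain `BTreeSet`s, and
`next_emittable` is a hypothetical name.

```rust
use std::collections::BTreeSet;

/// A path as a list of components, so that ordering is per component.
type PathKey = Vec<&'static str>;

/// Returns the next entry that is safe to emit, if any. An entry is
/// held back while a pending directory sorts before it, because that
/// directory may still produce entries that must be emitted first.
fn next_emittable(
    items: &mut BTreeSet<PathKey>,
    pending_trees: &BTreeSet<PathKey>,
) -> Option<PathKey> {
    let first_item = items.iter().next()?.clone();
    if let Some(first_pending) = pending_trees.iter().next() {
        if *first_pending < first_item {
            return None;
        }
    }
    items.remove(&first_item);
    Some(first_item)
}
```

With `foo/baz` ready and `foo/bar` still pending, nothing is emitted
until the entries of `foo/bar` (such as `foo/bar/qux`) have arrived.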
The `TestBackend` methods currently return their data immediately (on
the first poll), which means that if multiple futures are created and
then they're polled "concurrently", they will always return their data
in the order they're being polled. That leads to poor testing of
algorithms that poll futures concurrently, such as `TreeDiffStream`.
This patch makes `TestBackend` spawn async work to run in a tokio
runtime instead. That's enough to show a bug I introduced with my
recent refactoring of `TreeDiffStream`, except that it's also covered
up by the caching we do in `Store`. I'll fix the bug and update tests
to work around the caching next.
This slows down the jj-lib tests from 2.8 s to 3.1 s. I don't think
that matters much, given that the jj-cli tests take > 30 s.
I tried to add a small `tokio::time::sleep()` (random up to 5 ms) but
that slowed down the property-based tests of the diff editor very
significantly (took over a minute). Maybe we could have two different
kinds of test backend or maybe make the sleep configurable in some
way. We can improve that later. The async-ness added in this patch is
sufficient for catching the diff-stream bug.
It should generally be better to prioritize polling trees in the
order we're going to emit their entries. For example, if we have
pending trees `zzz/` and `dir/aaa/`, it's better to poll the latter
even though we inserted the former first.
This also prepares for fixing a bug related to the order we emit. We
will then want to look up in `pending_trees` by key found in `items`.
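As a sketch of the ordering, assuming `pending_trees` becomes a map
keyed by path (the helper name and the key type here are hypothetical),
the tree to poll first is simply the smallest key, regardless of
insertion order:

```rust
use std::collections::BTreeMap;

/// A path as a list of components, so that ordering is per component.
type PathKey = Vec<&'static str>;

/// Returns the key of the pending tree whose entries will be emitted
/// first. The map is sorted by key, so insertion order doesn't matter.
fn next_tree_to_poll<V>(pending_trees: &BTreeMap<PathKey, V>) -> Option<&PathKey> {
    pending_trees.keys().next()
}
```

Keying the map by path also makes it cheap to look up a pending tree
by a key found in `items`.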