It means the fuzzer gets to explore less, actually, but we still have
the source-based fuzzer that will find the case where the colon is
missing, and which could hunt for non-idempotencies in the formatter and
such.
As expected, the golden tests fail to run under Nix because the test
directory is not writable. In my opinion it's better not to write there
anyway, so instead of hacking around that, let's add a dry-run output
mode.
For now the output format is not structured; this is good enough for the
tests. It would be nice to do structured output in RCL format, but we
can add that later if needed.
The fuzzer was unable to find the overflow case in the new List.sum
method even after several minutes. To help it find interesting cases,
let's add large integers that are close to overflowing to the input
through a shortcut. Previously they could get there, but only when
reading an 8-byte integer from the input: first you need a mutation to
bump the width to 8 bytes, and then a mutation that puts an integer
close to overflowing in those 8 bytes, and those together are very
unlikely.
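The shortcut could look something like the sketch below: map a single
input byte to a value just below the i64 boundary, so one mutation
suffices to produce a near-overflowing literal. The function name and
encoding are made up for illustration; the real shortcut may differ.

```rust
// Hypothetical sketch of the shortcut: one input byte maps to a value
// close to i64::MAX, so a single mutation can yield a near-overflowing
// integer literal without producing 8 well-chosen bytes first.
fn near_overflow_literal(byte: u8) -> i64 {
    i64::MAX - byte as i64
}

fn main() {
    // Byte 0 hits the boundary exactly; small bytes land just below it.
    assert_eq!(near_overflow_literal(0), i64::MAX);
    assert_eq!(near_overflow_literal(5), i64::MAX - 5);
}
```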
When using RCL as a jq replacement, I often have some pipeline and want
to edit the last part of the query on the command line. I don't want to
have to move my cursor all the way back to wrap the entire expression
in braces. So even though comprehensions can already express map,
filter, and flatmap, I still want to add them as methods.
This is step one, adding map.
So I can highlight code on my blog as long as there is no highlighting
for it in Pandoc/Skylighting. With the change to MarkupString, this was
really easy to do!
This fixes a longstanding issue where errors that should be blamed on
the document's result as a whole got blamed on the document's full
span, which often starts at a comment rather than at the offending
value. Now we blame them on the inner body expression, which is more
natural.
The autoformatter in "rcl format" pretty-prints the CST. But we can
also evaluate the document, and then it gets pretty-printed by the
value pretty printer. These two should match, but currently they do
not. Add a fuzzer to discover all the cases where they don't.
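The property being fuzzed can be illustrated with a toy model. All the
names below are stand-ins (real RCL compares the CST formatter against
the value pretty printer); the point is only the shape of the check:
two independent printers that should agree on every input.

```rust
// Toy model of the differential check: formatting the parsed document
// and pretty-printing the evaluated value should produce the same
// text. Both sides here are simplified stand-ins over integer lists.
fn format_cst(items: &[i64]) -> String {
    let parts: Vec<String> = items.iter().map(|i| i.to_string()).collect();
    format!("[{}]", parts.join(", "))
}

fn print_value(items: &[i64]) -> String {
    // An independently written printer; the fuzz target asserts that
    // it agrees with the formatter on every input it generates.
    let mut out = String::from("[");
    for (i, item) in items.iter().enumerate() {
        if i > 0 {
            out.push_str(", ");
        }
        out.push_str(&item.to_string());
    }
    out.push(']');
    out
}

fn main() {
    let doc = [1, 2, 3];
    // The fuzzer's job is to find inputs where this assertion fails.
    assert_eq!(format_cst(&doc), print_value(&doc));
}
```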
I am making the formatter more suitable for real use, and as a result
I want to add a flake/CI check that ensures that all the examples are
formatted correctly. But then I need this --check mode.
I'm not sure if this is an improvement, because it also creates more
room for syntax errors when we expect something other than an
identifier. But maybe it simplifies Smith more than it harms, and it
speeds up the fuzzer? I have one running for a few minutes now, and it
looks like it's catching up very fast to one that uses the classic
approach, but of course that is always the case at the start, so I'm
not sure. Maybe I should let it run overnight and see which one gets
the furthest, coverage-wise?
Update: I let both of them run overnight. The fuzzer without the
identifier stack started later (I don't remember by how much, but maybe
half an hour, maybe an hour). Yet it managed to achieve more coverage
(7805 vs. 7727; yesterday the one with the identifier stack was around
7500 and the one without at ~7200). So it looks like getting rid of the
stack helps more by speeding up the fuzzer than it harms by generating
nonsense. Let's keep it.
On the one hand, this makes fewer inputs valid Smith programs, which
will make it harder for the fuzzer to find them. But on the other hand,
if the cases where the default helped were unlikely to lead to
interesting programs anyway, then maybe it's good to only unlock them
when the stack is properly filled, so that interesting interactions are
easier to discover through mutations?
It's all very hard to say, and very hard to measure without wasting a
huge amount of compute. But at least it makes the code simpler, so let's
go for that.
After this change, the fuzzer took a few minutes (~6 or so?) to return
to the previous level of coverage.
Previously I handled the stack underflows by hand in most places, but we
can use the ? operator for that. It does change the behavior slightly,
so now my corpus has to converge again, but I think it's worth the
simplification.
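A minimal sketch of the simplification, with the expression stack
reduced to plain integers for illustration: pops return Option, and
call sites propagate underflow with `?` instead of checking by hand.

```rust
// Sketch: each pop returns Option, so one `?` per operand replaces an
// explicit underflow check at every call site.
struct Stack(Vec<i64>);

impl Stack {
    fn pop(&mut self) -> Option<i64> {
        self.0.pop()
    }

    // Builds a binary operation from the top two stack entries;
    // underflow propagates out via `?`.
    fn add(&mut self) -> Option<()> {
        let a = self.pop()?;
        let b = self.pop()?;
        self.0.push(a.wrapping_add(b));
        Some(())
    }
}

fn main() {
    let mut ok = Stack(vec![1, 2]);
    assert_eq!(ok.add(), Some(()));
    assert_eq!(ok.0, vec![3]);

    // Underflow now early-outs instead of being handled in place.
    let mut empty = Stack(vec![]);
    assert_eq!(empty.add(), None);
}
```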
One thing I'm less sure about: when there is an underflow or another
invalid condition, we now early out and don't use the program at all.
That makes it harder for the fuzzer to discover valid Smith programs,
and mutations will be less effective. But it will also waste less time
on different Smith programs that evaluate to the same RCL program.
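The shape of the early out, sketched with stand-in names (the real
conditions are stack underflow and similar invalid states while
decoding the Smith program):

```rust
// Sketch: reject an input before running RCL at all when decoding the
// Smith program already failed. The decoder here is a stand-in.
fn decode(input: &[u8]) -> Option<String> {
    // Pretend an empty input is invalid; a real decoder would return
    // None on underflow or other invalid conditions.
    if input.is_empty() {
        None
    } else {
        Some(format!("program of {} bytes", input.len()))
    }
}

fn run_one(input: &[u8]) -> bool {
    match decode(input) {
        // Early out: no valid program, so don't evaluate anything.
        None => false,
        Some(_program) => {
            // Here the real fuzz target would evaluate the RCL program.
            true
        }
    }
}

fn main() {
    assert!(!run_one(&[]));
    assert!(run_one(&[1, 2, 3]));
}
```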
The exec/s counter is now definitely a lot higher than before, almost a
factor of 2 from quick eyeballing. Which does make sense if most
iterations get rejected before we even run RCL, so by itself that
doesn't say much.
I'm not sure if this is a good idea. On the one hand it keeps the corpus
small and full of interesting examples. But on the other hand, sometimes
we need multiple mutations to discover a new program. So I'm a bit
worried that this will get stuck in a local minimum. But maybe it will
also waste less CPU time on exploring pointless variations of a program?
I implemented coverage, and one thing that stood out is that indexing
operations were not covered, so add them! It's discovering new code
paths quickly now.
I saw many corpus entries that are Smith programs whose output is
empty, yet the programs themselves differ. Exploring that space is
wasteful. I wonder if an early exit, before we even run RCL, helps to
focus compute power on exploring the interesting cases.
I don't have any hard measurements, but from eyeballing the logs, it
seems to discover NEW and REDUCE cases at a slightly higher rate than
without this, even though the exec/s is not that much higher. Or is it?
It seems to go up over the runtime, both in the past and in the current
fuzzer. Hmmmm.
Instead of swap, probably it makes more sense to rotate the instructions
by 1. Or by a random amount. Maybe this is too hectic, and I should
limit it to just +/- 1? But let's see. My corpus is already pretty much
converged so it's hard to tell if this is helpful right now, but I like
the idea so I'm adding it.
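A sketch of the rotation mutator (the function name is illustrative;
the real mutator picks the amount randomly):

```rust
// Sketch: instead of swapping two instructions, rotate the whole
// instruction sequence by some amount (1, or a random value).
fn rotate_instructions(instrs: &mut [u8], amount: usize) {
    if !instrs.is_empty() {
        // The modulo makes any amount safe, including >= the length.
        instrs.rotate_left(amount % instrs.len());
    }
}

fn main() {
    let mut prog = [10, 20, 30, 40];
    rotate_instructions(&mut prog, 1);
    assert_eq!(prog, [20, 30, 40, 10]);
    // Rotating by the full length is a no-op.
    rotate_instructions(&mut prog, 4);
    assert_eq!(prog, [20, 30, 40, 10]);
}
```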
This is where it would be useful to be able to see the mutation name,
not just "Custom" in libFuzzer's output. It _has_ to be possible. But
how?
The stack language is already built to consume the things it can
consume, but identifiers are something we don't want to consume: we can
index into the stack when we use one, so the identifier we need doesn't
always have to be at the top.
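The lookup can be sketched as follows (names and types are illustrative,
not the actual Smith implementation):

```rust
// Sketch: identifiers are not popped when used. An instruction carries
// an index counted from the top of the identifier stack, so the needed
// name doesn't have to be on top.
fn ident_at<'a>(stack: &[&'a str], index_from_top: usize) -> Option<&'a str> {
    let n = stack.len();
    if index_from_top < n {
        Some(stack[n - 1 - index_from_top])
    } else {
        // Out of range: the instruction is invalid, but the stack is
        // left untouched either way.
        None
    }
}

fn main() {
    let idents = ["x", "y", "z"]; // "z" was pushed last
    assert_eq!(ident_at(&idents, 0), Some("z"));
    assert_eq!(ident_at(&idents, 2), Some("x"));
    assert_eq!(ident_at(&idents, 3), None);
}
```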
So these pops only pollute the space of interesting programs with ones
that pointlessly add things to the stack and remove them again.
Remove the pop instructions entirely then. It doesn't even impact fuzzer
coverage much, which suggests they were not used so much in the corpus.
Which also makes sense, libFuzzer's REDUCE would have found smaller
programs that achieve the same thing without the pop.
I noticed the fuzzer was always using "std" and other builtins as
variable names, never using its own even though there is an instruction
for it. I suspect it's because of the REDUCE step in libFuzzer: using a
builtin saves one byte of fuzz input.
So add an equally short encoding for short identifiers that is also
lexicographically smaller than using a builtin, so it should now be
preferred.
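The idea in sketch form, with made-up opcodes: both encodings cost the
same number of bytes, but the short-identifier one sorts first, which
is what libFuzzer's REDUCE (preferring smaller inputs) keys on.

```rust
// Hypothetical opcodes: the short-ident opcode is lexicographically
// smaller than the builtin opcode, at equal encoded length.
const OP_SHORT_IDENT: u8 = 0x01;
const OP_BUILTIN: u8 = 0x02;

fn encode_short_ident(slot: u8) -> [u8; 2] {
    [OP_SHORT_IDENT, slot]
}

fn encode_builtin(slot: u8) -> [u8; 2] {
    [OP_BUILTIN, slot]
}

fn main() {
    // Equal length, but the short-ident encoding sorts first, so
    // REDUCE should now prefer the fuzzer's own variables over "std".
    assert_eq!(encode_short_ident(0).len(), encode_builtin(0).len());
    assert!(encode_short_ident(0) < encode_builtin(0));
}
```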
This was one part that was still missing, which I suspect is why it was
failing to discover one uncovered line in the TypeDiff reporting that I
was hunting for. After a few minutes of running, it still hasn't found
it, but it is making steady progress and discovering lots of new
branches.
With keyword unops, we need a space after them. I saw some inputs
generating "nottrue". That's not very helpful, of course. Not all unops
need the space, but it's never harmful.
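Since the space is never harmful, the printer can emit it
unconditionally; a minimal sketch (function name illustrative):

```rust
// Sketch: always emit a space after a unary operator when printing the
// generated program. Required for keyword operators like `not`, so we
// never print "nottrue"; harmless for the symbolic ones.
fn print_unop(op: &str, operand: &str) -> String {
    format!("{} {}", op, operand)
}

fn main() {
    assert_eq!(print_unop("not", "true"), "not true");
}
```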
I replaced it with the `random` module, see also the previous commit.
This has very little impact on build times, if any.
Benchmark 1: cargo build --all (current commit)
  Time (mean ± σ):     22.219 s ± 0.249 s  [User: 212.705 s, System: 19.562 s]
  Range (min … max):   21.915 s … 22.512 s  5 runs

Benchmark 2: cargo build --all (2 commits ago)
  Time (mean ± σ):     22.365 s ± 0.527 s  [User: 215.451 s, System: 19.730 s]
  Range (min … max):   21.448 s … 22.777 s  5 runs
Benchmark was run in powersave mode to avoid thermal throttling.
It's not that there is something wrong with tinyrand, but it turns out
WyRand is such a tiny function that I can implement it myself and avoid
a dependency. Keep tinyrand for one more commit so I can write a test
that uses it as a reference implementation. It passes.
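A WyRand implementation along these lines really is tiny; the sketch
below uses the constants from the public wyhash reference, and is not
necessarily byte-for-byte the committed version:

```rust
// Minimal WyRand sketch: one u64 of state, one 128-bit multiply per
// output. Constants are from the wyhash reference implementation.
struct WyRand {
    state: u64,
}

impl WyRand {
    fn new(seed: u64) -> WyRand {
        WyRand { state: seed }
    }

    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0xa076_1d64_78bd_642f);
        let t = (self.state as u128) * ((self.state ^ 0xe703_7ed1_a0b4_28db) as u128);
        ((t >> 64) ^ t) as u64
    }
}

fn main() {
    // The generator is deterministic: equal seeds give equal streams,
    // which is what the reference test against tinyrand relies on.
    let mut a = WyRand::new(42);
    let mut b = WyRand::new(42);
    for _ in 0..8 {
        assert_eq!(a.next_u64(), b.next_u64());
    }
}
```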
So now I have the two primary ones, "fuzz_source" and "fuzz_smith". Now
I just need to make something that feeds the outputs of the Smith
fuzzer into the fuzz_source corpus.
I want to reuse the same checks from the Smith fuzzer. Maybe I should
rename them "source" and "smith" to clarify.
The code remains mostly unchanged, only moved, but I did also remove a
few of the modes. It's no longer clear to me that a separate
typechecking mode is an advantage when we already have the evaluation
mode. So we can delete some of the code.
It seems to help slightly, but it's hard to tell objectively, because
by now my corpus is getting saturated anyway. It does now print the
name of the mutator in the mutation string; I wonder how it does that,
since there is no API for it. Does it internally append something to a
global? I want to mutate that global too; it would be useful to see
which strategies are effective so I can bias towards them.
Nanorand contains an off-by-one in its WyRand implementation, which
made the mutator crash on hitting a code path that should be
unreachable. [1] There is an unmerged patch for it on GitHub, but this
is the second time I encounter an issue with this library: it also
suffered from a biased shuffle when I was looking for a way to shuffle
playlists in Musium [2]. To be on the safe side, instead of trying to
make Cargo use the patched fork, just swap out the library for one that
works.
[1]: https://github.com slash Absolucy slash nanorand-rs slash issues slash 45
[2]: https://github.com slash Absolucy slash nanorand-rs slash issues slash 37
(URLs defused to prevent GitHub from auto-linking them, it's not my goal
to post passive-aggressive messages in somebody else's issue tracker.)
From a quick glance at the inputs, this seems to reduce the number of
silly programs a bit. In particular, it avoids reading so many
arguments for a for loop that is invalid anyway.