With the dup instruction, the smith fuzzer can produce programs that are
exponential in the length of the fuzz input. Armed with that, the fuzzer
has discovered a weak spot of RCL: the formatter is inefficient for
operator chains, because it tries to break at every operator, and it
backtracks a lot.
I was aware of this; it's still somewhere on my list of things to fix.
(Like the recent changes for the CST, an operator chain should parse as
a single node with a list of terms, an n-ary operator if you like. Then
we can format the entire chain wide or tall, which would make more sense
to me anyway.) But if we add this feature to the fuzzer now, then for an
input length of 64 bytes, the fuzzer will regularly find timeout cases
that take longer than 2s to run, so it's a blocker.
Remove the dup again then, for now, until I fix the formatter.
I also (temporarily) seeded the corpus with some examples of patch mode
inputs. It started exploring, but it still has not discovered a matching
path at a depth greater than 1 after about 20 minutes of fuzzing (on a
single core). I suspect this is going to take a while, but I suppose it
would get there eventually. It is at least steadily discovering new fuzz
inputs.
If the same expression is needed in two parts of the program, before
this change, they had to be pushed separately, with the same sequence of
instructions. Now we can use a 'Dup' instruction instead to duplicate an
expression from within the stack. I hope that this will enable the
fuzzer to explore a bit faster.
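As a minimal sketch of the idea (not the actual smith code; the names
Instr, Push, Dup, Add, and build are hypothetical), a stack-based program
builder with a Dup instruction could look like this:

```rust
/// Hypothetical instruction set for a tiny stack-based program builder.
/// Only the idea of `Dup` is shown; a real fuzzer has many more opcodes.
#[derive(Debug, Clone, PartialEq)]
pub enum Instr {
    /// Push a literal expression onto the stack.
    Push(String),
    /// Copy the expression `depth` entries below the top and push the copy.
    Dup(usize),
    /// Pop two expressions and push `(lhs + rhs)`.
    Add,
}

/// Run the instructions and return the final expression stack.
/// Instructions that cannot execute (too few operands, depth out of range)
/// are skipped, so every instruction sequence yields some program.
pub fn build(instrs: &[Instr]) -> Vec<String> {
    let mut stack: Vec<String> = Vec::new();
    for instr in instrs {
        match instr {
            Instr::Push(expr) => stack.push(expr.clone()),
            Instr::Dup(depth) => {
                if *depth < stack.len() {
                    let copy = stack[stack.len() - 1 - *depth].clone();
                    stack.push(copy);
                }
            }
            Instr::Add => {
                if stack.len() >= 2 {
                    let rhs = stack.pop().unwrap();
                    let lhs = stack.pop().unwrap();
                    stack.push(format!("({} + {})", lhs, rhs));
                }
            }
        }
    }
    stack
}
```

In this sketch, every repetition of Dup(0) followed by Add doubles the
size of the top expression, so the output size can grow exponentially in
the number of instructions.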
Some context: I modified the patch implementation to be not idempotent
(or so I thought), and the fuzzer has not found a counterexample yet
after several minutes of fuzzing. But as I'm writing this, I realize
that I made a mistake there, and even my sabotaged patch command is
still idempotent. Still, this is a small modification of the smith that
can hopefully make it more efficient at exploring the search space, so
let's keep it.
It does find most cases very quickly, but apparently it is hard for it
to create sensible patch paths. It needs to be quite lucky to generate
them with ExprField, and then duplicate the right field names that are
also used in the document. Perhaps there is a better way to explore
faster, but at least we have one entry point.
It hasn't discovered any idempotency violations yet.
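For reference, the property being checked can be sketched as a generic
helper. This is an illustration under my own assumptions (the name
check_idempotent is hypothetical), not the fuzz target itself:

```rust
/// Check that applying `f` twice gives the same result as applying it once.
/// On a violation, return both outputs as a counterexample.
pub fn check_idempotent<T, F>(input: T, f: F) -> Result<(), (T, T)>
where
    T: PartialEq + Clone,
    F: Fn(T) -> T,
{
    let once = f(input);
    let twice = f(once.clone());
    if once == twice {
        Ok(())
    } else {
        Err((once, twice))
    }
}
```

A fuzz target would call such a helper with the generated input and the
operation under test, and panic on Err so the fuzzer records the input.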
When we do surgery on the CST for the patch feature, and make it combine
inputs from multiple sources, the formatter needs to resolve spans that
may not come from the same document.
I expected this to require horrible changes, and at first I was planning
to build a hash map of DocId to &str, but I already had this Inputs type
that does exactly what I need, and the changes are pretty minimal!
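A rough sketch of what resolving a span against multiple documents can
look like. The field names and the resolve method are assumptions for
illustration; the real Inputs type may be shaped differently:

```rust
/// Hypothetical document id: an index into the list of loaded inputs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct DocId(pub u32);

/// Hypothetical span: a byte range inside one particular document.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Span {
    pub doc: DocId,
    pub start: usize,
    pub end: usize,
}

/// All loaded documents; `DocId(i)` refers to the i-th entry.
pub struct Inputs<'a> {
    pub docs: &'a [&'a str],
}

impl<'a> Inputs<'a> {
    /// Return the source text a span points at, or `None` if the span
    /// does not refer to a valid range of a loaded document.
    pub fn resolve(&self, span: Span) -> Option<&'a str> {
        let doc: &'a str = self.docs.get(span.doc.0 as usize).copied()?;
        doc.get(span.start..span.end)
    }
}
```

Because the span carries its own document id, the formatter does not need
to know which document a CST node originally came from.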
This is unrelated to the recent change in decimal formatting; the fuzzer
crash happened even before that. Apparently I didn't run this fuzzer for
a longer time; I need to figure out some kind of continuous fuzzing.
There are many things we can load from a string, so we can give them
better names than just "input". This is in preparation for the 'rcl patch'
command, where I'll be loading several different documents at the same
time, and it will be more important to tell them apart.
This change can be made independently, so let's do it now to keep the
diff for the 'rcl patch' feature cleaner.
I added a std.format_json function, and that required making the width
in the pretty-printer optional. With that in place, it is now easier to
support the JSON Lines format, something I have wanted to do for some
time.
This is a first step to accept the new format in the CLI. It does not
implement the formatting itself yet.
For displaying things to humans, it makes sense to have a maximum width,
but if we serialize for machines, we can just put everything on a single
line and not bother with backtracking.
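To illustrate the idea with a toy example (this is not the RCL
pretty-printer; format_list and its layout rules are made up for the
sketch), an optional width could be handled like this:

```rust
/// Render a list of already-formatted items either on one line ("wide")
/// or with one item per line ("tall").
///
/// `width` is hypothetical: `Some(n)` is a limit for human-readable output,
/// `None` means "no limit", so the wide layout is always chosen and no
/// alternative layout needs to be tried.
pub fn format_list(items: &[&str], width: Option<usize>) -> String {
    let wide = format!("[{}]", items.join(", "));
    match width {
        None => wide,
        Some(limit) if wide.chars().count() <= limit => wide,
        Some(_) => {
            let mut tall = String::from("[\n");
            for item in items {
                tall.push_str("  ");
                tall.push_str(item);
                tall.push_str(",\n");
            }
            tall.push(']');
            tall
        }
    }
}
```

With None the output is always a single line, which is the property that
a line-oriented format such as JSON Lines needs.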
This moves one more place of duplication of builtins into
generate_keywords.py as a single source of truth, resolving
a TODO in the smith fuzzer.
This does once more shuffle all of these around in the fuzzer, which
makes the existing fuzz corpus mostly meaningless. Fortunately, this
should be the last time that this happens: with the new approach we
can modify the builtins with minimal changes to the meaning of the
fuzz corpus, which is something that I wanted for a long time.
I regularly add new methods, and it's becoming tedious to have to
remember to update all the places that reference these, so let's
generate them and automate the process. For now, I'm choosing the
Pygments grammar as the source of truth, and the first target to
generate is the fuzz dictionary.
I'm leaving the Zed extension pointing to the older commit of the
Tree-sitter grammar; I'll update that after this version bump. It's
a bit awkward to do it this way around, but there are circular
dependencies that can't be avoided. Maybe with an attack on SHA1 it
can be done in theory, but let's not go there.
At first I also wanted to support rounding to a negative number of
decimals (so rounding to a positive power of 10), but scope creep,
complications ... I don't need it, and we can always add that later.
It found one issue right away, related to using an i16::MIN exponent,
which overflows the way we parsed. But then I realized there are a few
other bugs in the number parser ... I added a marker for one and fixed
handling of the implicit exponent offsets.
This enables us to treat numbers with an exponent losslessly. We don't
conflate the decimal point with the exponent in case they get in the
way of each other.
It also greatly simplifies the formatting. We can mechanically format
the representation now, without having to use heuristics for when to
switch to scientific notation. The catch is of course that the
heuristics will need to move elsewhere. We'll have to normalize the
numbers after arithmetic operations.
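As a sketch of what "mechanical" formatting means here, assume a
representation with a mantissa, a number of decimals, and a separate
exponent. The field names and the exact output rules below are my
assumptions for illustration, not necessarily what RCL does:

```rust
/// Hypothetical decimal: value = mantissa * 10^(exponent - decimals).
/// Keeping `decimals` and `exponent` apart preserves how the number was
/// written, e.g. `1.50` versus `15e-1`.
pub struct Decimal {
    pub mantissa: i64,
    pub decimals: u8,
    pub exponent: i16,
}

/// Format the representation as-is, without heuristics for when to
/// switch to scientific notation.
pub fn format_decimal(d: &Decimal) -> String {
    let sign = if d.mantissa < 0 { "-" } else { "" };
    let mut digits = d.mantissa.unsigned_abs().to_string();
    let decimals = d.decimals as usize;
    // Left-pad with zeros so at least one digit precedes the decimal point.
    if digits.len() <= decimals {
        let padding = "0".repeat(decimals + 1 - digits.len());
        digits = format!("{}{}", padding, digits);
    }
    let (int_part, frac_part) = digits.split_at(digits.len() - decimals);
    let mut out = format!("{}{}", sign, int_part);
    if decimals > 0 {
        out.push('.');
        out.push_str(frac_part);
    }
    if d.exponent != 0 {
        out.push_str(&format!("e{}", d.exponent));
    }
    out
}
```

The formatter only echoes the fields, so deciding which representation to
produce after arithmetic has to happen in a separate normalization step.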
RCL aims to be obvious to understand. "Num" might be cryptic for new users.
Although we also have "Int" rather than "Integer", that one is very
established, while "Num" may be a bit too obscure. (We also have "String"
rather than "Str", consistency ...). It's a type that I expect has
little use for end-users, but it shows up in the negation error message,
so let's make it unambiguous and call it "Number".
This is only the start, but let's verify Decimal::cmp against f64::cmp.
It instantly finds an input where they disagree:
Compare {
    a: NormalF64(
        -0.16406250000007813,
    ),
    b: NormalF64(
        0.0,
    ),
}
This adds back the exception that was removed by allowing float parsing
imprecision, though in a more limited form initially because it only
affected exponents.
After running the fuzzer for a bit longer, it turns out this also affects
large integers, so we are back to the start: overflow is just an
intentional incompatibility.
This removes one case of incompatibility with Serde. If you write a
float literal that is too precise to be represented exactly, then we now
silently round it rather than treating it as an overflow error. I think
this is acceptable: if you care about numbers to 19 significant digits,
then RCL is probably not the best tool for what you are doing. The case
where we encounter some arbitrary JSON that we want to query with
"rcl jq", and it happens to have some humongous float in it, is probably
more likely. Python handles float literals in this way too, so I think
it's okay.
The choice I went with is to have a 16-bit exponent, which gives RCL's
float/decimal type more range than a regular f64. Now the fuzzer can
generate an input with a large exponent, and RCL will happily echo it,
and it's technically syntactically valid JSON, but Serde rejects it with
"number out of range" (in the same way that RCL rejects some numbers as
overflow). So add an exception for this mismatch.
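One way such an exception could be detected is sketched below. It
assumes only the standard library: Rust's f64 parser returns infinity
for literals beyond the f64 range instead of an error. The helper name
fits_in_f64 is hypothetical, and this is not how the fuzz target is
actually written:

```rust
/// Return true when the literal parses to a finite f64.
/// Literals with a huge exponent such as `1e400` parse to infinity, so
/// they are reported as not fitting, even though a decimal type with a
/// 16-bit exponent could still represent them.
pub fn fits_in_f64(literal: &str) -> bool {
    match literal.parse::<f64>() {
        Ok(x) => x.is_finite(),
        Err(_) => false,
    }
}
```

A differential fuzzer could use a check like this to skip the comparison
for outputs that are valid for one implementation but out of range for
the other.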
The one thing that prevents that right now is floats, and the fuzzer
discovered it within a few seconds:
╭──────╴ Opcode (hex)
│  ╭───╴ Argument (hex)
│  │  ╭╴ Operation, argument (decimal)
26 03    ExprPushInput, 3
         take_str, 3 → "4e2"
e6 01    ModeJsonSuperset, 1
EvalJsonSuperset -->
4e2
I realized today that I want this. In particular, the API of my music
player Musium returns albums with a numeric playcount and discovery
score, and I want to sort on that. Finally that is possible now that I
am adding support for floats. But I need a way to sort on one field of
a dict! Arguably this is more important than the bare sort itself.
While I do this for lists, we can do the same for sets.
It started to get annoying to have to define it myself every time, so
let's just add it properly now. This also resolves the longstanding
issue in the RCL pretty-printer that we have no good way to print the
empty set -- now we do!
It means the fuzzer gets to explore less, actually, but we still have
the source-based fuzzer that will find the case where the colon is
missing, and which could hunt for non-idempotencies in the formatter and
such.