One commenter on Lobste.rs pointed out that "make X Y again" can
have political connotations. That was not the intent, and I don't want
to scare people away with it.
I brainstormed several options. In the end this one captures the same
spirit ("boring": it should not create problems or cause surprises, and
it should not feel like solving puzzles; building things is exciting,
configuring them should be boring), and it even makes the goal a bit
clearer by hinting at the remove-duplication part.
The map function in particular is a footgun that hit me twice in two
weeks, in a case similar to this one:
{"apple", "orange", "banana"}.map(x => x.len()).sum()
This looks like it sums the lengths of the strings. Except it doesn't,
it sums the *unique* lengths of the strings, so we get 11 instead of 17.
That's surprising, and I don't want that kind of surprise in RCL.
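A quick Python analogue (Python sets, not RCL) shows the same surprise:

```python
# Python analogue of the footgun: mapping over a set deduplicates
# the results before they are summed.
fruits = {"apple", "orange", "banana"}

# What the snippet looks like it computes: the total length, 5 + 6 + 6.
total = sum(len(x) for x in fruits)
assert total == 17

# What a set-typed map actually computes: the *unique* lengths {5, 6}.
deduped = sum({len(x) for x in fruits})
assert deduped == 11
```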
At first I thought to remove Set.map and Set.flat_map entirely. You
would just have to convert to a list first and map there, and if you
wanted the set, convert back with List.to_set_dedup, which makes it
explicit. That's why I added these conversion functions. Then I
implemented deletion ... and as I was writing the PR description and
changelog entry for how to handle the break, it dawned on me: if we
have the _dedup anyway in to_set_dedup, we might as well use it here.
We don't have to delete these methods, we can just call them _dedup.
{"apple", "orange", "banana"}.map_dedup(x => x.len()).sum()
Now the snippet is no longer so sneaky! It's clear that there is
deduplication going on! So I'm fine keeping them in this form. At least
in the Nix transitive closure test, the map_dedup behavior is what I
want, which validates demand for at least map_dedup. But even then,
a changelog entry for a breaking change that says "renamed to X" is a
much better experience than "these have been deleted; here are some
much longer alternatives".
Also, this change is relatively safe to make: it will break existing
callers, but they will get an error message and there is a clear path
to fixing this. It doesn't silently change the output of something: it's
being called on sets, so there can't even be fields named 'map' or
'flat_map' because sets don't have fields. So I feel comfortable making
a small breaking change here.
I may have forgotten about Set.sort when I added Set.to_list ... It
feels a bit weird to have two identical methods. On the other hand, they
convey very different intent. In places where you only want to convert,
e.g. to be able to use a non-deduplicating map, but where you don't care
about order, .to_list() matches that intent better than .sort(). So I'll
keep both for now, but we can at least acknowledge it, in case people
wonder.
I want to delete Set.map and Set.flat_map, because they are footguns.
To provide at least some alternative useful in method chaining (which is
useful in ad-hoc queries on the command line, where it is inconvenient
to move the cursor back to add an opening paren for a comprehension),
and because they are useful in general, add some conversion methods.
Naming things for the conversion from List to Set went through many
iterations (see 0fca8c5056 for details), and I'm still not sure
this is the right way to go about it. Maybe we should have `List.dedup:
(self: List[T]) -> List[T]` instead, and call the one that errors on
duplicates simply `List.to_set: (self: List[T]) -> Set[T]`, so you
can do `.dedup().to_set()` to explicitly drop duplicates, if that's
the intent? But at the call site, `to_set` does not look like it would
validate anything, so it's no good for readability in *that* sense …
Anyway, nothing is set in stone, we can just rename things later! Just
like I can delete Set.map if it turns out to be a mistake!
It is sorted now, it will probably remain like that for a while, but I'm
still playing with the idea of making dicts preserve insertion order,
and if I do that, then I would probably also make sets preserve order,
and the sorting property will go away.
If a curious property like this is not documented, people start
wondering: is it *always* sorted? And the answer is yes -- at least for
now. So it's probably better to document it. I have no illusions that a
note like this will prevent people from depending on it, but that's
fine: at least then you depend on it consciously and deliberately, I
hope.
Naming these was a long bikeshed. Some considerations:
* Symmetry with Set.to_list and String.to_uppercase, keep the to_.
* Naming after an action is clearer than naming after a property,
because for a property it's not clear whether it is a precondition
or a postcondition. xs.dedup() makes it clear that it removes
duplicates; xs.unique(), does it assert uniqueness? Does it test it?
* In light of the above observation, List.sort is a good name, better
than List.sorted.
* The obvious one people will try is probably to_set. If it silently
drops duplicates, that's an unsafe default. So let's make this the
one that validates, and we can have to_set_dedup that drops
duplicates.
* Erroring on duplicates is always safe. We can even point the users to
the version that drops duplicates.
* If you read code and see xs.to_set() without any contrasting
.to_set_dedup(), it's not so obvious that .to_set() does any
validation, and if that is your intent, then you have to clarify
that in a comment.
* xs.to_set_assert_unique in that sense is clearer, but it's too
verbose in practice. xs.to_set_nodup is a bit obscure, especially if
you don't see the contrasting .to_set_dedup().
* A method List.dedup that returns a list, and having to_set being the
one that errors, also works, then you can do .dedup().to_set(), which
is very explicit. It still does not clarify at the call site that
to_set is validating though. It is also inefficient, but that's
probably not a problem for RCL, and you can always use a
comprehension if it matters.
So the conclusion for now: to_set is the right prefix for symmetry and
discoverability. There is not going to be a "default" to_set; you have
to explicitly choose whether you want to discard duplicates or not. The
to_set_dedup is clear for that. When reading code, to_set_unique is
maybe less obvious, but it gives more hint than to_set, and if you think
for a moment, set values are already unique, so why the _unique, and
then you find the contrast with _dedup. So far this seems like one of
the better directions.
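As a sketch of the two conversions this lands on, here is a hypothetical Python analogue (the names mirror the discussion; this is not RCL's implementation):

```python
# Sketch of the two List -> Set conversions discussed above:
# to_set_unique errors on duplicates, to_set_dedup silently drops them.

def to_set_unique(xs):
    # Erroring on duplicates is always safe; the error can point
    # the user at the dedup variant.
    result = set(xs)
    if len(result) != len(xs):
        raise ValueError(
            "List contains duplicates; use to_set_dedup to drop them."
        )
    return result

def to_set_dedup(xs):
    # Explicitly named: deduplication is the caller's stated intent.
    return set(xs)

assert to_set_unique([1, 2, 3]) == {1, 2, 3}
assert to_set_dedup([1, 1, 2]) == {1, 2}
```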
I plan to remove Set.map and Set.flat_map, and to make that slightly
more palatable, let's add Set.to_list, so you can at least replace all
the former calls with .to_list().map(...).
This is another papercut I ran into while working on Advent of Code
exercises. If I recall correctly, previously comments were not allowed
there because there was no good place in the AST to put them, but that
has since been fixed, with parse_expr handling prefixes, so simply
removing the part that skips over noncode does the trick now!
The formatter used to add a space at the end. That was an oversight when
I added the unpack feature. Fortunately it's an easy fix.
I added the golden test first to confirm it would have caught the bug.
Also, it turns out there was a different fmt golden test that already
encoded the bad space. I missed it in review.
I did a few of the Advent of Code exercises in RCL, and one papercut was
having to use parentheses in `if x >= (-1)`. On top of that, the error
message was not intuitive ("Expected expression"), so if you omitted
them, it wouldn't be immediately clear what was wrong. The grammar was
restrictive like this with the intent to avoid confusion about operator
precedence.
For example:
// Is this 'not (a and b)' or '(not a) and b'?
not a and b
There is only confusion, though, when a binary operator follows a unary
one. We already had a great error message for this, but only for unary
operators at the top level of an expression, not when they occurred
after a binary operator, e.g. in this case:
a and not b or c
It turns out we can allow trailing unops inside binops (the `x >=
-1` case) while also improving the error message for a unop in the
middle, which is what this change implements.
It turns out, we can:
* Allow more cases, including the common case of e.g. 'x > -1', without
creating ambiguities.
* Still make ambiguities such as 'a and not b or c' an error.
* Improve the error message for that case, over my previous one!
And this is a tiny change to the grammar and parser, yay! It feels like
one of those times where I discover the *right* way to do it. Discover,
in the platonic sense.
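To make the rule concrete, here is a toy sketch in Python (a hypothetical simplification, not RCL's actual grammar or parser): an operand may carry leading unary operators, and a trailing unary operand is fine, but a binary operator may not follow a unary-prefixed operand.

```python
# Toy sketch of the rule: an operand may start with unary operators,
# and a trailing unary operand is allowed, but a binary operator
# following a unary-prefixed operand is an error (demand parens).
UNOPS = {"not", "-"}
BINOPS = {"and", "or", "<", ">", "<=", ">="}

def parse(tokens):
    i = 0
    while True:
        # Parse one operand: optional unary prefixes, then an atom.
        had_unop = False
        while i < len(tokens) and tokens[i] in UNOPS:
            had_unop = True
            i += 1
        assert i < len(tokens), "expected expression"
        i += 1  # consume the atom (identifier or number)
        if i == len(tokens):
            return "ok"  # trailing unop operand, e.g. 'x >= -1'
        if tokens[i] in BINOPS:
            if had_unop:
                # e.g. 'a and not b or c': ambiguous, ask for parens
                return "error: parenthesize the unary expression"
            i += 1  # consume the binop, continue with the next operand
        else:
            return "error: expected operator"

assert parse(["x", ">=", "-", "1"]) == "ok"
assert parse(["a", "and", "not", "b", "or", "c"]).startswith("error")
assert parse(["not", "a", "and", "b"]).startswith("error")
```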
This situation can be confusing without parens, that's why it's not
allowed, even though it makes cases like
if x < -10: ...
invalid and we have to write (-10). But at least we can tell users to
use parens.
The fuzzer has discovered the concept of "exponential growth" and now
calls Set.transitive_closure with a function that at each iteration
creates a dict with *four* closures that each refer to the previous
iteration's value. This feels like whack-a-mole; I can't make the depth
too small either. But let's see if maybe there is a sweet spot that is
acceptable.
This is a hang discovered by the fuzzer. In the initial form, it
discovered it from a closure (where the environment capture made it
grow), but the same applies to lists. Putting those deeply nested values
in BTreeSets triggers some pathological worst-case comparison where
every comparison has to descend all the way.
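The blow-up can be sketched in Python (an analogue of the comparison behavior, not RCL's BTreeSet):

```python
# Sketch of the pathological comparison: two structurally identical
# but separately built nested values share no memory with each other,
# so equality must descend the whole tree. With two children per
# level, the number of element comparisons doubles at each level.

def build(depth):
    x = ()
    for _ in range(depth):
        x = (x, x)  # each level references the previous one twice
    return x

a = build(18)
b = build(18)
# This comparison touches on the order of 2**18 nodes, even though
# each value was built in only 18 steps.
assert a == b
```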
This applies to all functions that can recursively build values. The
fuzzer discovered it for transitive_closure but in fact we can trigger
similar behavior with fold (which has never been found by the fuzzer!).
So guard that one as well.
You can still manually do recursion, but there is little we can do
about that; RCL is not really meant for it, so it serves you right if
you trigger it. Though of course, at some point the fuzzer is going to
learn about this, and then we will have to mitigate it ...
This adds Set.transitive_closure, which is useful for e.g. finding all
store paths in 'nix flake archive --json' output, but it also comes in
very handy in Advent of Code problems!
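A minimal Python sketch of the idea, assuming the usual fixpoint formulation (the exact signature, and whether the seed is included in the result, may differ in RCL):

```python
# Transitive-closure sketch: start from a seed set, repeatedly apply
# a step function yielding neighbours, until no new elements appear.

def transitive_closure(seed, step):
    result = set(seed)
    frontier = set(seed)
    while frontier:
        new = set()
        for x in frontier:
            new |= set(step(x))
        frontier = new - result  # only elements we haven't seen yet
        result |= frontier
    return result

# Hypothetical dependency graph, e.g. flakes containing other flakes.
deps = {"app": ["libfoo"], "libfoo": ["libc"], "libc": []}
closure = transitive_closure({"app"}, lambda p: deps[p])
assert closure == {"app", "libfoo", "libc"}
```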
This is useful in at least one case I ran into in practice when
inspecting Nix flake json output, where flakes contain other flakes and
I want their transitive closure.
It is also useful for code golf for Advent of Code :)
I already had a line in place to discourage LLM-generated *code*, but
recently I see things popping up where accounts are posting what looks
like LLM-generated messages in pull requests and issue trackers.
I saw a comment pop up just now, and I seriously wonder: was this a
human who copy-pasted ChatGPT output, or do people give Claude Code
access to 'gh' with a valid token, and let agents autonomously post
comments upstream on their behalf?
Whatever it was, it's a waste of everybody's time, and I want nothing to
do with this in my repositories.
I previously only ran 'cargo clippy', without '--workspace'. Now that
I include that flag, some more advice popped up. Most of it consists of
false positives, but I accepted one suggestion.
It *is* pre-1.0 and it *does* occasionally have backwards-incompatible
changes, but they are well documented with clear upgrade paths, and not
happening at a breakneck pace. For example, with the introduction of
unpack, the |-operator is deprecated, but it still works. Or take the
assertion syntax: it changed, but the old syntax is still accepted, and
the autoformatter automatically upgrades documents. So I think that the
previous phrase was a bit too conservative. Other "mature" projects
cause me breakage headaches far more often than RCL. The readme should
reflect that.
I want to make the readme more aimed at people interested in the tool
and project in general, not necessarily at developers hacking on RCL
itself. Let's move that to a separate manual section, since there is
already a development section in there.
Problems
========
* Combining collections through comprehensions is too verbose.
I don’t want to write {for x in xs: x, for y in ys: y}.
* Pipe (|) for set and dict union is very ad-hoc in the typechecker,
and binary operators are awkward to format.
Solution
========
We can solve both problems by adding unpack: the ability to splice
a collection into another one. Like * and ** in Python, and ... in
Javascript.
* It’s a short way to splice in collections (of which concat is a
special case).
* Dict and set union are now completely regular and easily formattable.
* Arguably, the meaning of .. and ... is more self-evident than the
meaning of |, so it makes code more readable to newcomers.
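For reference, the Python analogues mentioned above behave like this:

```python
# Python's splicing: * unpacks elements, ** unpacks key-value pairs,
# much like the '..' and '...' being added to RCL.
xs = [1, 2]
ys = [3]
assert [*xs, *ys] == [1, 2, 3]         # list concat as a special case
assert {*xs, *ys} == {1, 2, 3}         # set union

a = {"k": 1}
b = {"v": 2}
assert {**a, **b} == {"k": 1, "v": 2}  # dict union
```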
Design space and alternatives not taken
=======================================
I would prefer to have just .. for unpack, and it would unpack both
single elements (scalars) from lists and sets, as well as key-value
pairs from dicts. However, that means that you can no longer
syntactically tell whether {..xs} is a dict or a set, which previously
was possible (except for the empty {}, which we defined to be a dict,
like Python). There are several ways to deal with this:
Distinct unpack syntax
Distinguish single-element and key-value unpack syntactically, e.g.
.. for scalar and ... for key-value. This is what Python does too,
with * vs. **.
Distinct set literal syntax
Use a distinct syntax for sets. { ... } would be reserved for dicts,
sets would use e.g. {| ... |}, set{ ... }, {{ }}, etc. It would
simplify many things, right now it’s kind of messy that we don't
know what a {}-literal is until we peek inside.
Delete sets entirely
As with using distinct syntax, this removes many complications,
including much of the near duplication between lists and sets. Who
needs sets anyway? We can just use lists instead, and add a few
convenience methods to make things unique. We can even distinguish
between “drop duplicates” and “assert that it’s already unique”,
which a set in RCL currently can’t do, constructing one can only
drop duplicates. Though nothing prevents adding a method on list
that asserts elements are unique and returns a set. Sets are a bit
like unsigned integers: they add some value by making invalid states
unrepresentable, and they allow slightly more precise types, but we can
survive without them, and in practice not much is lost.
I did implement this option. It did simplify things a lot. It felt
like a shame to delete so much code, including some rather neat
parts, but that’s just sunk cost fallacy. But in the end, I think
that there is value in having sets. Many things are naturally sets
rather than lists. No other configuration language offers sets,
which gives RCL a unique advantage in this regard. Or maybe there is
a good reason none of the others have sets. Though Python does, and
it also reuses the {}-syntax for dicts and sets.
It also simplifies the type system tremendously, because then we
no longer have two collection types that have an inner element type
(list and set), and it sidesteps the mess around kind of needing a
Collection[T] supertype: how to deal with Collection[T] vs.
Union[List[T], Set[T]], how to document it, how to explain it when it
occurs in type errors, etc.
So I tried it, but it felt too harsh; I want to try to make unpack
work without deleting sets. This dead end is not part of the history
included with this merge commit.
Give up on static typechecking
Maybe it’s fine if we don’t know whether {..xs} is a dict or a
set? That minimal example is pathological, but union is a common
operation, and losing the type for {..xs, ..ys} would be a shame.
In the end, the option implemented is to have the distinct unpack
syntax.
This was a big feature to write, primarily due to complications for the
typechecker, and the many cases to handle. Most of the dead ends are not
part of the history merged in this merge commit. There is an entire
other branch about adding a Collection[T] type that is now obsolete as
well.
It is still kind of true, here is a reproducer:
let xs = [1, 2]; { a = 1, ..xs }
stdin:1:27
  ╷
1 │ let xs = [1, 2]; { a = 1, ..xs }
  ╵                           ^~~~
Error: Invalid unpack in dict.
Help: '..' unpacks lists and sets, use '...' to unpack dicts.
I don't think the help is that bad, though: the error does say "invalid
unpack", and the help explains that there are ".." and "...". As for
spelling out that "xs" is a list and ".." is the right way to unpack a
list: well, what do you expect unpacking a list into a dict to do? So I
think it's fine; let's not complicate the already complex error
reporting logic any further.
Previously, writing [...xs] would give a type error when xs was not a
dict, but really we should complain about dict unpack in the list. The
new match order fixes that.
It can still be refined in the future, but I don't expect it will really
need to be done. Let's not pollute the "git grep TODO" output with
nice-to-haves.
In cases where the collection in '...collection' can't be statically
verified to be a dict, we need to do a runtime check. We can change the
expected type for the collection though.
Unfortunately this makes the error messages in case of a mismatch on
the type a bit more obscure. I think that at this point, that's an
acceptable cost to keep the implementation simpler.
This commit only handles dicts, but also lays the groundwork for
checking scalar unpack. For those, since it can be a list or set,
we need to move the typecheck to runtime, because there is no
single expected type to feed into the check. Well, we could feed
in Union[List[T], Set[T]], but the union is then conjured up by the
typechecker, which could make errors more difficult to understand, so
let's just do a runtime check.