Merge pull request #5130 from roc-lang/big-compiler-doc

Compiler design overview doc
Ayaz 2023-05-30 10:40:15 -05:00 committed by GitHub
commit be077ed046
2 changed files with 197 additions and 142 deletions

crates/compiler/DESIGN.md Normal file

@ -0,0 +1,194 @@
# Compiler Design
The current Roc compiler is designed as a pipelining compiler parallelizable
across Roc modules.
Roc's compilation pipeline consists of a few major components, which form the
table of contents for this document.
<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
- [Parsing](#parsing)
- [Canonicalization](#canonicalization)
  - [Symbol Resolution](#symbol-resolution)
  - [Type-alias normalization](#type-alias-normalization)
  - [Closure naming](#closure-naming)
- [Constraint Generation](#constraint-generation)
  - [(Mutually-)recursive definitions](#mutually-recursive-definitions)
- [Type Solving](#type-solving)
  - [Unification](#unification)
  - [Type Inference](#type-inference)
  - [Recursive Types](#recursive-types)
  - [Lambda Sets](#lambda-sets)
  - [Ability Collection](#ability-collection)
  - [Ability Specialization](#ability-specialization)
  - [Ability Derivation](#ability-derivation)
  - [Exhaustiveness Checking](#exhaustiveness-checking)
  - [Debugging](#debugging)
- [IR Generation](#ir-generation)
  - [Memory Layouts](#memory-layouts)
  - [Compiling Calls](#compiling-calls)
  - [Decision Trees](#decision-trees)
  - [Tail-call Optimization](#tail-call-optimization)
  - [Reference-count insertion](#reference-count-insertion)
  - [Reusing Memory Allocations](#reusing-memory-allocations)
  - [Debugging](#debugging-1)
- [LLVM Code Generator](#llvm-code-generator)
  - [Morphic Analysis](#morphic-analysis)
  - [C ABI](#c-abi)
  - [Test Harness](#test-harness)
  - [Debugging](#debugging-2)
- [WASM Code Generator](#wasm-code-generator)
  - [WASM Interpreter](#wasm-interpreter)
  - [Debugging](#debugging-3)
- [Dev Code Generator](#dev-code-generator)
  - [Debugging](#debugging-4)
- [Builtins](#builtins)
- [Compiler Driver](#compiler-driver)
  - [Caching types](#caching-types)
- [Repl](#repl)
- [`test` and `dbg`](#test-and-dbg)
- [Formatter](#formatter)
- [Glue](#glue)
- [Active areas of research / help wanted](#active-areas-of-research--help-wanted)
<!-- END doctoc generated TOC please keep comment here to allow auto update -->
## Parsing
Roc's parsers are designed as [combinators](https://en.wikipedia.org/wiki/Parser_combinator).
Roc's parse AST and the available combinators are listed in [the root parse
file](./parse/src/parser.rs).
Combinators enable parsing to compose as functions would - for example, the
`one_of` combinator supports attempting multiple parsing strategies, and
succeeding on the first one; the `and_then` combinator chains two parsers
together, failing if either parser in the sequence fails.
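To make the composition idea concrete, here is a minimal sketch of two such combinators. It is not Roc's actual parser API: the `ParseResult` type, the `ch` helper, and the signatures of `one_of` and `and_then` are simplified assumptions chosen for illustration.

```rust
// Hypothetical, simplified parser shape: on success, return the parsed value
// and the remaining input; on failure, return an error message.
// Roc's real parsers carry more state and are not reproduced here.
type ParseResult<'a, T> = Result<(T, &'a str), String>;

/// Parses exactly one expected character (illustrative helper).
fn ch<'a>(expected: char) -> impl Fn(&'a str) -> ParseResult<'a, char> {
    move |input: &'a str| match input.chars().next() {
        Some(c) if c == expected => Ok((c, &input[c.len_utf8()..])),
        _ => Err(format!("expected '{expected}'")),
    }
}

/// Tries `first`; if it fails, tries `second` on the same input.
fn one_of<'a, T>(
    first: impl Fn(&'a str) -> ParseResult<'a, T>,
    second: impl Fn(&'a str) -> ParseResult<'a, T>,
) -> impl Fn(&'a str) -> ParseResult<'a, T> {
    move |input: &'a str| first(input).or_else(|_| second(input))
}

/// Runs `first`, then runs the parser built from its output on the rest of
/// the input. Fails if either step fails.
fn and_then<'a, A, B, P2>(
    first: impl Fn(&'a str) -> ParseResult<'a, A>,
    next: impl Fn(A) -> P2,
) -> impl Fn(&'a str) -> ParseResult<'a, B>
where
    P2: Fn(&'a str) -> ParseResult<'a, B>,
{
    move |input: &'a str| first(input).and_then(|(a, rest)| next(a)(rest))
}
```

With these definitions, `one_of(ch('a'), ch('b'))` accepts input starting with either character, while `and_then(ch('a'), |_| ch('b'))` accepts only input starting with `ab`.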
Since Roc is an indentation-sensitive language, parsing must be diligent about
handling indentation and de-indentation levels. Most parsing
functions take a `min_indent` parameter that specifies the minimum indentation
of the scope an expression should be parsed in. Generally, failing to reach
`min_indent` indicates that an expression has ended (but perhaps too early).
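The role of `min_indent` can be illustrated with a toy line-based sketch. The helpers `indent_of` and `take_block` are hypothetical and far cruder than Roc's real parsing functions, which work on expressions rather than whole lines; the sketch only shows the idea that dropping below the minimum indentation ends the current scope.

```rust
/// Counts the leading spaces on a line.
fn indent_of(line: &str) -> usize {
    line.len() - line.trim_start_matches(' ').len()
}

/// Splits `lines` into the lines that belong to a block requiring at least
/// `min_indent` columns of indentation, and the remaining lines, which belong
/// to an enclosing scope. Blank lines never end a block in this sketch.
fn take_block<'a>(
    lines: &'a [&'a str],
    min_indent: usize,
) -> (&'a [&'a str], &'a [&'a str]) {
    let end = lines
        .iter()
        .position(|line| !line.trim().is_empty() && indent_of(line) < min_indent)
        .unwrap_or(lines.len());
    lines.split_at(end)
}
```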
## Canonicalization
After parsing a Roc program into an AST, the AST is transformed into a [canonical
form](./can/src/expr.rs) AST. This may seem a bit redundant - why build another
tree, when we already have the AST? Canonicalization performs a few analyses
to catch user errors, and sets up the state necessary to solve the types in a
program. Among other things, canonicalization:
- Uniquely identifies names (think variable and function names). Along the way,
canonicalization builds a graph of all variables' references, and catches
unused definitions, undefined definitions, and shadowed definitions.
- Resolves type signatures, including aliases, into a form suitable for type
solving.
- Determines the order definitions are used in, if they are defined
out-of-order.
- Eliminates syntax sugar (for example, renaming `+` to the function call `add`
and converting backpassing to function calls).
- Collects declared abilities, and ability implementations defined for opaque
types. Derived abilities for opaque types are elaborated during
canonicalization.
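As an illustration of the desugaring step only, the sketch below rewrites a `+` node into a call to `add`. The `ParsedExpr` and `CanExpr` enums are hypothetical miniatures invented for this example; Roc's real parse and canonical ASTs are much richer and resolve names to symbols rather than strings.

```rust
// Hypothetical miniature of a parse AST.
#[derive(Debug, PartialEq)]
enum ParsedExpr {
    Num(i64),
    Var(String),
    Plus(Box<ParsedExpr>, Box<ParsedExpr>),
}

// Hypothetical miniature of a canonical AST: no operator sugar, only calls.
#[derive(Debug, PartialEq)]
enum CanExpr {
    Num(i64),
    Var(String),
    Call(String, Vec<CanExpr>),
}

/// Rewrites every `a + b` into the function call `add(a, b)`, recursively.
fn desugar(expr: &ParsedExpr) -> CanExpr {
    match expr {
        ParsedExpr::Num(n) => CanExpr::Num(*n),
        ParsedExpr::Var(name) => CanExpr::Var(name.clone()),
        ParsedExpr::Plus(lhs, rhs) => {
            CanExpr::Call("add".to_string(), vec![desugar(lhs), desugar(rhs)])
        }
    }
}
```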
### Symbol Resolution
Identifiers, like variable names, are resolved to [Symbol](./module/src/symbol.rs)s.
Currently, a symbol is a 64-bit value with
- the bottom 32 bits defining the [ModuleId](./module/src/ident.rs) the symbol
is defined in
- the top 32 bits defining the [IdentId](./module/src/ident.rs) of the symbol
in the module
A symbol is unique per identifier name and the scope
that the identifier has been declared in. Symbols are how the rest of the
compiler refers to value definitions. Since the scope and identifier name are
disambiguated when symbols are created, referencing symbols requires no
further name resolution.
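The bit layout described above can be sketched as follows. The `ModuleId`, `IdentId`, and `Symbol` types here are simplified stand-ins written for this example, not the definitions in the linked files, and the accessor names are assumptions.

```rust
// Hypothetical stand-ins for the real id types.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct ModuleId(u32);

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct IdentId(u32);

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct Symbol(u64);

impl Symbol {
    /// Packs the ident id into the top 32 bits and the module id into the
    /// bottom 32 bits of a single 64-bit value.
    fn new(module_id: ModuleId, ident_id: IdentId) -> Symbol {
        Symbol(((ident_id.0 as u64) << 32) | (module_id.0 as u64))
    }

    /// Truncating to 32 bits keeps only the bottom half.
    fn module_id(self) -> ModuleId {
        ModuleId(self.0 as u32)
    }

    /// Shifting right by 32 bits exposes the top half.
    fn ident_id(self) -> IdentId {
        IdentId((self.0 >> 32) as u32)
    }
}
```

One consequence of this layout is that a symbol is a plain `Copy` integer, so comparing or hashing symbols costs no more than comparing or hashing a `u64`.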
As symbols are constructed, canonicalization also keeps track of all references
to a given symbol. This reduces catching unused definitions, undefined
definitions, and shadowing to indexing into an array.
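A minimal sketch of the array-indexing idea, assuming identifiers are numbered densely within a module. The `References` type and its methods are hypothetical and only cover the unused-definition case; the real bookkeeping in canonicalization is more involved.

```rust
/// Hypothetical per-module table: slot `i` holds the number of references
/// seen so far to the identifier with index `i`.
struct References {
    counts: Vec<u32>,
}

impl References {
    fn new() -> Self {
        References { counts: Vec::new() }
    }

    /// Registers a new definition and returns its index
    /// (a stand-in for an ident id).
    fn declare(&mut self) -> usize {
        self.counts.push(0);
        self.counts.len() - 1
    }

    /// Records one use of an already declared identifier.
    fn record_use(&mut self, ident: usize) {
        self.counts[ident] += 1;
    }

    /// Unused definitions are exactly the slots still at zero.
    fn unused(&self) -> Vec<usize> {
        self.counts
            .iter()
            .enumerate()
            .filter_map(|(i, &n)| if n == 0 { Some(i) } else { None })
            .collect()
    }
}
```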
### Type-alias normalization
### Closure naming
## Constraint Generation
### (Mutually-)recursive definitions
## Type Solving
### Unification
### Type Inference
### Recursive Types
### Lambda Sets
### Ability Collection
### Ability Specialization
### Ability Derivation
### Exhaustiveness Checking
### Debugging
## IR Generation
### Memory Layouts
### Compiling Calls
### Decision Trees
### Tail-call Optimization
### Reference-count insertion
### Reusing Memory Allocations
### Debugging
## LLVM Code Generator
### Morphic Analysis
### C ABI
### Test Harness
### Debugging
## WASM Code Generator
### WASM Interpreter
### Debugging
## Dev Code Generator
### Debugging
## Builtins
## Compiler Driver
### Caching types
## Repl
## `test` and `dbg`
## Formatter
## Glue
## Active areas of research / help wanted


@ -1,147 +1,8 @@
# The Roc Compiler
Here's how the compiler is laid out.
## Parsing
The main goal of parsing is to take a plain old String (such as the contents of a .roc source file read from the filesystem) and translate that String into an `Expr` value.
`Expr` is an `enum` defined in the `expr` module. An `Expr` represents a Roc expression.
For example, parsing would translate this string...

    "1 + 2"

...into this `Expr` value:

    BinOp(Int(1), Plus, Int(2))

> Technically it would be `Box::new(Int(1))` and `Box::new(Int(2))`, but that's beside the point for now.
This `Expr` representation of the expression is useful for things like:
- Checking that all variables are declared before they're used
- Type checking
> As of this writing, the compiler doesn't do any of those things yet. They'll be added later!
Since the parser is only concerned with translating String values into Expr values, it will happily translate syntactically valid strings into expressions that won't work at runtime.
For example, parsing will translate this string:

    not "foo", "bar"

...into this `Expr`:

    CallByName("not", vec!["foo", "bar"])

Now we may know that `not` takes a `Bool` and returns another `Bool`, but the parser doesn't know that.
The parser only knows how to translate a `String` into an `Expr`; it's the job of other parts of the compiler to figure out if `Expr` values have problems like type mismatches and non-exhaustive patterns.
That said, the parser can still run into syntax errors. This won't parse:

    if then 5 then else then

This is gibberish to the parser, so it will produce an error rather than an `Expr`.
Roc's parser is implemented using the [`marwes/combine`](http://github.com/marwes/combine-language/) crate.
## Evaluating
One of the useful things we can do with an `Expr` is to evaluate it.
The process of evaluation is basically to transform an `Expr` into the simplest `Expr` we can that's still equivalent to the original.
For example, let's say we had this code:

    "1 + 8 - 3"

The parser will translate this into the following `Expr`:

    BinOp(
        Int(1),
        Plus,
        BinOp(Int(8), Minus, Int(3))
    )

The `eval` function will take this `Expr` and translate it into this much simpler `Expr`:

    Int(6)

At this point it's become so simple that we can display it to the end user as the number `6`. So running `parse` and then `eval` on the original Roc string of `1 + 8 - 3` will result in displaying `6` as the final output.
> The `expr` module includes an `impl fmt::Display for Expr` that takes care of translating `Int(6)` into `6`, `Char('x')` into `'x'`, and so on.
`eval` accomplishes this by doing a `match` on an `Expr` and resolving every operation it encounters. For example, when it first sees this:

    BinOp(
        Int(1),
        Plus,
        BinOp(Int(8), Minus, Int(3))
    )

The first thing it does is to call `eval` on the `Expr` values on either side of the `Plus`. That results in:
1. Calling `eval` on `Int(1)`, which returns `Int(1)` since it can't be reduced any further.
2. Calling `eval` on `BinOp(Int(8), Minus, Int(3))`, which in fact can be reduced further.
Since the second call to `eval` will match on another `BinOp`, it's once again going to recursively call `eval` on both of its `Expr` values. Since those are both `Int` values, though, their `eval` calls will return them right away without doing anything else.
Now that it's evaluated the expressions on either side of the `Minus`, `eval` will look at the particular operator being applied to those expressions (in this case, a minus operator) and check to see if the expressions it was given work with that operation.
> Remember, this `Expr` value potentially came directly from the parser. `eval` can't be sure any type checking has been done on it!
If `eval` detects a non-numeric `Expr` value (that is, the `Expr` is not `Int` or `Frac`) on either side of the `Minus`, then it will immediately give an error and halt the evaluation. This sort of runtime type error is common to dynamic languages, and you can think of `eval` as being a dynamic evaluation of Roc code that hasn't necessarily been type-checked.
Assuming there's no type problem, `eval` can go ahead and run the Rust code of `8 - 3` and store the result in an `Int` expr.
That concludes our original recursive call to `eval`, after which point we'll be evaluating this expression:

    BinOp(
        Int(1),
        Plus,
        Int(5)
    )

This will work the same way as `Minus` did, and will reduce down to `Int(6)`.
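The recursive evaluation walked through above can be sketched in a few lines. This is a reduced illustration, not the `eval` function or `Expr` enum from the codebase: the `Operator` enum, the `Str` variant, and the error type are assumptions, and only `Int` arithmetic is handled.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Operator {
    Plus,
    Minus,
}

// Hypothetical, reduced version of `Expr` with just enough variants
// to show a successful reduction and a runtime type error.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Int(i64),
    Str(String),
    BinOp(Box<Expr>, Operator, Box<Expr>),
}

/// Reduces an expression to the simplest equivalent expression,
/// or reports a runtime type error.
fn eval(expr: Expr) -> Result<Expr, String> {
    match expr {
        Expr::BinOp(left, op, right) => {
            // Evaluate both sides first; each may itself be a BinOp.
            let left = eval(*left)?;
            let right = eval(*right)?;
            // Only then check that the operands fit the operator.
            match (left, op, right) {
                (Expr::Int(a), Operator::Plus, Expr::Int(b)) => Ok(Expr::Int(a + b)),
                (Expr::Int(a), Operator::Minus, Expr::Int(b)) => Ok(Expr::Int(a - b)),
                (left, op, right) => Err(format!(
                    "type error: cannot apply {op:?} to {left:?} and {right:?}"
                )),
            }
        }
        // Anything else cannot be reduced further.
        other => Ok(other),
    }
}
```

Returning a `Result` models the "give an error and halt" behavior described above without aborting the whole process.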
## Optimization philosophy
Focus on optimizations which are only safe in the absence of side effects, and leave the rest to LLVM.
This focus may lead to some optimizations becoming transitively in scope. For example, some deforestation
examples in the MSR paper benefit from multiple rounds of interleaved deforestation, beta-reduction, and inlining.
To get those benefits, we'd have to do some inlining and beta-reduction that we could otherwise leave to LLVM's
inlining and constant propagation/folding.
Even if we're doing those things, it may still make sense to have LLVM do a pass for them as well, since
early LLVM optimization passes may unlock later opportunities for inlining and constant propagation/folding.
## Inlining
If a function is called exactly once (it's a helper function), presumably we always want to inline it.
If a function is "small enough" it's probably worth inlining too.
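The two rules above amount to a simple predicate. In the sketch below, `FunctionInfo`, the size measure, and the `SMALL_ENOUGH` threshold are all hypothetical; the text does not say how size would be measured or where the cutoff lies.

```rust
/// Hypothetical summary of a function gathered before inlining.
struct FunctionInfo {
    /// Number of call sites in the program.
    call_count: usize,
    /// Some measure of body size, for example a count of expression nodes.
    body_size: usize,
}

/// Assumed cutoff for "small enough"; the right value would need measuring.
const SMALL_ENOUGH: usize = 8;

/// Inline single-use helpers unconditionally, and anything small regardless
/// of how often it is called.
fn should_inline(info: &FunctionInfo) -> bool {
    info.call_count == 1 || info.body_size <= SMALL_ENOUGH
}
```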
## Fusion
<https://www.microsoft.com/en-us/research/wp-content/uploads/2016/07/deforestation-short-cut.pdf>
Basic approach:
Do list stuff using `build` passing Cons Nil (like a cons list) and then do foldr/build substitution/reduction.
Afterwards, we can do a separate pass to flatten nested Cons structures into properly initialized RRBTs.
This way we get both deforestation and efficient RRBT construction. Should work for the other collection types too.
It looks like we need to do some amount of inlining and beta reductions on the Roc side, rather than
leaving all of those to LLVM.
Advanced approach:
Express operations like map and filter in terms of toStream and fromStream, to unlock more deforestation.
More info here:
<https://wiki.haskell.org/GHC_optimisations#Fusion>
For an overview of the design and architecture of the compiler, see
[DESIGN.md](./DESIGN.md). If you want to dive into the
implementation or get some tips on debugging the compiler, see below.
## Getting started with the code