Merge pull request #5130 from roc-lang/big-compiler-doc

Compiler design overview doc
2025-07-24 06:55:15 +00:00 · 2023-05-30 10:40:15 -05:00 · 2023-05-30 10:40:15 -05:00 · be077ed046
commit be077ed046
parent 8c23053c1a 2c385bf601
2 changed files with 197 additions and 142 deletions
--- a/crates/compiler/DESIGN.md
+++ b/crates/compiler/DESIGN.md
@@ -0,0 +1,194 @@
+# Compiler Design
+
+The current Roc compiler is designed as a pipelined compiler that can be
+parallelized across Roc modules.
+
+Roc's compilation pipeline consists of a few major components, which form the
+table of contents for this document.
+
+<!-- START doctoc generated TOC please keep comment here to allow auto update -->
+<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
+
+- [Parsing](#parsing)
+- [Canonicalization](#canonicalization)
+  - [Symbol Resolution](#symbol-resolution)
+  - [Type-alias normalization](#type-alias-normalization)
+  - [Closure naming](#closure-naming)
+- [Constraint Generation](#constraint-generation)
+  - [(Mutually-)recursive definitions](#mutually-recursive-definitions)
+- [Type Solving](#type-solving)
+  - [Unification](#unification)
+  - [Type Inference](#type-inference)
+  - [Recursive Types](#recursive-types)
+  - [Lambda Sets](#lambda-sets)
+  - [Ability Collection](#ability-collection)
+  - [Ability Specialization](#ability-specialization)
+  - [Ability Derivation](#ability-derivation)
+  - [Exhaustiveness Checking](#exhaustiveness-checking)
+  - [Debugging](#debugging)
+- [IR Generation](#ir-generation)
+  - [Memory Layouts](#memory-layouts)
+  - [Compiling Calls](#compiling-calls)
+  - [Decision Trees](#decision-trees)
+  - [Tail-call Optimization](#tail-call-optimization)
+  - [Reference-count insertion](#reference-count-insertion)
+  - [Reusing Memory Allocations](#reusing-memory-allocations)
+  - [Debugging](#debugging-1)
+- [LLVM Code Generator](#llvm-code-generator)
+  - [Morphic Analysis](#morphic-analysis)
+  - [C ABI](#c-abi)
+  - [Test Harness](#test-harness)
+  - [Debugging](#debugging-2)
+- [WASM Code Generator](#wasm-code-generator)
+  - [WASM Interpreter](#wasm-interpreter)
+  - [Debugging](#debugging-3)
+- [Dev Code Generator](#dev-code-generator)
+  - [Debugging](#debugging-4)
+- [Builtins](#builtins)
+- [Compiler Driver](#compiler-driver)
+  - [Caching types](#caching-types)
+- [Repl](#repl)
+- [`test` and `dbg`](#test-and-dbg)
+- [Formatter](#formatter)
+- [Glue](#glue)
+- [Active areas of research / help wanted](#active-areas-of-research--help-wanted)
+
+<!-- END doctoc generated TOC please keep comment here to allow auto update -->
+
+## Parsing
+
+Roc's parsers are designed as [combinators](https://en.wikipedia.org/wiki/Parser_combinator).
+Roc's parse AST and combinators can be found in [the root parse
+file](./parse/src/parser.rs).
+
+Combinators let parsers compose the way functions do. For example, the
+`one_of` combinator attempts multiple parsing strategies in order and
+succeeds with the first one that succeeds; the `and_then` combinator chains
+two parsers together, failing if either parser in the sequence fails.
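
A minimal sketch of how such combinators compose, written in Rust. The `Parser` alias and the simplified `literal`, `one_of`, and `and_then` helpers below are hypothetical stand-ins for illustration only; the real combinators in the parse crate have different signatures and carry extra state.

```rust
/// A parser consumes a prefix of the input and returns the parsed value plus
/// the remaining input, or `None` on failure. (Hypothetical simplification.)
type Parser<'a, T> = Box<dyn Fn(&'a str) -> Option<(T, &'a str)> + 'a>;

/// Matches one exact string at the start of the input.
fn literal<'a>(expected: &'static str) -> Parser<'a, &'static str> {
    Box::new(move |input: &'a str| {
        input.strip_prefix(expected).map(|rest| (expected, rest))
    })
}

/// Tries each parser in order and returns the first success.
fn one_of<'a, T: 'a>(parsers: Vec<Parser<'a, T>>) -> Parser<'a, T> {
    Box::new(move |input: &'a str| parsers.iter().find_map(|p| p(input)))
}

/// Runs `first`, then `second` on the remaining input; fails if either fails.
fn and_then<'a, A: 'a, B: 'a>(
    first: Parser<'a, A>,
    second: Parser<'a, B>,
) -> Parser<'a, (A, B)> {
    Box::new(move |input: &'a str| {
        let (a, rest) = first(input)?;
        let (b, rest) = second(rest)?;
        Some(((a, b), rest))
    })
}
```

With these helpers, `and_then(one_of(vec![literal("if"), literal("when")]), literal(" "))` accepts either keyword followed by a space and rejects any other input.
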
+
+Since Roc is an indentation-sensitive language, parsing must be diligent
+about tracking indentation and de-indentation levels. Most parsing
+functions take a `min_indent` parameter that specifies the minimum indentation
+of the scope an expression should be parsed in. Generally, failing to reach
+`min_indent` indicates that an expression has ended (but perhaps too early).
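
The role of `min_indent` can be illustrated with a small Rust sketch. The `split_at_dedent` function is hypothetical and works on whole lines, counting only leading spaces; the real parser threads `min_indent` through its parsing functions instead.

```rust
/// Splits `lines` into the leading block indented by at least `min_indent`
/// spaces and everything after it. Blank lines never end a block.
/// (Hypothetical helper, not part of the Roc parser.)
pub fn split_at_dedent<'a, 'b>(
    lines: &'b [&'a str],
    min_indent: usize,
) -> (&'b [&'a str], &'b [&'a str]) {
    let end = lines
        .iter()
        .position(|line| {
            // Number of leading spaces on this line.
            let indent = line.len() - line.trim_start_matches(' ').len();
            !line.trim().is_empty() && indent < min_indent
        })
        .unwrap_or(lines.len());
    lines.split_at(end)
}
```

The first non-blank line indented less than `min_indent` marks the end of the block, mirroring how failing to reach `min_indent` signals that an expression has ended.
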
+
+## Canonicalization
+
+After parsing a Roc program into an AST, the AST is transformed into a [canonical
+form](./can/src/expr.rs) AST. This may seem a bit redundant: why build another
+tree when we already have the AST? Canonicalization performs a few analyses
+to catch user errors, and sets up the state necessary to solve the types in a
+program. Among other things, canonicalization:
+
+- Uniquely identifies names (think variable and function names). Along the way,
+    canonicalization builds a graph of all variables' references, and catches
+    unused definitions, undefined definitions, and shadowed definitions.
+- Resolves type signatures, including aliases, into a form suitable for type
+    solving.
+- Determines the order definitions are used in, if they are defined
+    out-of-order.
+- Eliminates syntax sugar (for example, renaming `+` to the function call `add`
+    and converting backpassing to function calls).
+- Collects declared abilities, and ability implementations defined for opaque
+    types. Derived abilities for opaque types are elaborated during
+    canonicalization.
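
As a rough illustration of the sugar-elimination step, the Rust sketch below rewrites a `+` operator into a call to `add`. The `Expr` type and `desugar` function are hypothetical and far smaller than the compiler's real parse and canonical ASTs.

```rust
/// A tiny, hypothetical expression tree (not the compiler's real AST).
#[derive(Debug, PartialEq)]
pub enum Expr {
    Int(i64),
    BinOp(Box<Expr>, char, Box<Expr>),
    Call(&'static str, Vec<Expr>),
}

/// Rewrites every `+` operator into a call to `add`, recursing into children.
pub fn desugar(expr: Expr) -> Expr {
    match expr {
        Expr::BinOp(lhs, '+', rhs) => {
            Expr::Call("add", vec![desugar(*lhs), desugar(*rhs)])
        }
        // Other operators are left alone in this sketch.
        Expr::BinOp(lhs, op, rhs) => {
            Expr::BinOp(Box::new(desugar(*lhs)), op, Box::new(desugar(*rhs)))
        }
        Expr::Call(name, args) => {
            Expr::Call(name, args.into_iter().map(desugar).collect())
        }
        other => other,
    }
}
```

After this pass, later stages only need to understand function calls, not operator syntax.
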
+
+### Symbol Resolution
+
+Identifiers, like variable names, are resolved to [Symbol](./module/src/symbol.rs)s.
+
+Currently, a symbol is a 64-bit value with
+- the bottom 32 bits defining the [ModuleId](./module/src/ident.rs) the symbol
+    is defined in
+- the top 32 bits defining the [IdentId](./module/src/ident.rs) of the symbol
+    in the module
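
The bit layout described above can be sketched in Rust as follows. The `ModuleId`, `IdentId`, and `Symbol` types here are simplified, hypothetical versions written for illustration; the definitions in the linked files may differ.

```rust
/// Hypothetical stand-ins for the compiler's ModuleId and IdentId.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ModuleId(pub u32);

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct IdentId(pub u32);

/// A symbol packed into 64 bits: IdentId in the top 32 bits,
/// ModuleId in the bottom 32 bits.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct Symbol(u64);

impl Symbol {
    pub const fn new(module_id: ModuleId, ident_id: IdentId) -> Self {
        Symbol(((ident_id.0 as u64) << 32) | (module_id.0 as u64))
    }

    /// Truncating to `u32` keeps only the bottom 32 bits.
    pub const fn module_id(self) -> ModuleId {
        ModuleId(self.0 as u32)
    }

    /// Shifting right by 32 moves the top half into the bottom half.
    pub const fn ident_id(self) -> IdentId {
        IdentId((self.0 >> 32) as u32)
    }
}
```

Packing both ids into one integer keeps symbols cheap to copy, compare, and hash.
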
+
+A symbol is unique per identifier name and the scope
+that the identifier has been declared in. Symbols are how the rest of the
+compiler refers to value definitions. Since the scope and identifier
+name are disambiguated when symbols are created, referencing symbols requires
+no further name resolution.
+
+As symbols are constructed, canonicalization also keeps track of all references
+to a given symbol. This reduces catching unused definitions, undefined
+definitions, and shadowing to an index lookup in an array.
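
One way to picture this bookkeeping is a per-module table indexed by identifier id. The Rust sketch below is a hypothetical simplification: the `References` type and its methods are made up for illustration, and a plain `usize` stands in for an `IdentId`.

```rust
/// Hypothetical per-module table with one slot per defined identifier.
#[derive(Default)]
pub struct References {
    counts: Vec<u32>,
}

impl References {
    /// Registers a new definition and returns its index.
    pub fn define(&mut self) -> usize {
        self.counts.push(0);
        self.counts.len() - 1
    }

    /// Records one use of `id`; returns false if `id` was never defined.
    pub fn reference(&mut self, id: usize) -> bool {
        match self.counts.get_mut(id) {
            Some(count) => {
                *count += 1;
                true
            }
            None => false,
        }
    }

    /// Definitions that were never referenced.
    pub fn unused(&self) -> Vec<usize> {
        self.counts
            .iter()
            .enumerate()
            .filter(|&(_, &count)| count == 0)
            .map(|(id, _)| id)
            .collect()
    }
}
```

In this sketch, detecting an undefined name is a failed index lookup and detecting an unused definition is a scan for zero counts.
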
+
+### Type-alias normalization
+
+### Closure naming
+
+## Constraint Generation
+
+### (Mutually-)recursive definitions
+
+## Type Solving
+
+### Unification
+
+### Type Inference
+
+### Recursive Types
+
+### Lambda Sets
+
+### Ability Collection
+
+### Ability Specialization
+
+### Ability Derivation
+
+### Exhaustiveness Checking
+
+### Debugging
+
+## IR Generation
+
+### Memory Layouts
+
+### Compiling Calls
+
+### Decision Trees
+
+### Tail-call Optimization
+
+### Reference-count insertion
+
+### Reusing Memory Allocations
+
+### Debugging
+
+## LLVM Code Generator
+
+### Morphic Analysis
+
+### C ABI
+
+### Test Harness
+
+### Debugging
+
+## WASM Code Generator
+
+### WASM Interpreter
+
+### Debugging
+
+## Dev Code Generator
+
+### Debugging
+
+## Builtins
+
+## Compiler Driver
+
+### Caching types
+
+## Repl
+
+## `test` and `dbg`
+
+## Formatter
+
+## Glue
+
+## Active areas of research / help wanted
--- a/crates/compiler/README.md
+++ b/crates/compiler/README.md
@@ -1,147 +1,8 @@
 # The Roc Compiler

-Here's how the compiler is laid out.
-
-## Parsing
-
-The main goal of parsing is to take a plain old String (such as the contents a .roc source file read from the filesystem) and translate that String into an `Expr` value.
-
-`Expr` is an `enum` defined in the `expr` module. An `Expr` represents a Roc expression.
-
-For example, parsing would translate this string...
-
-    "1 + 2"
-
-...into this `Expr` value:
-
-    BinOp(Int(1), Plus, Int(2))
-
-> Technically it would be `Box::new(Int(1))` and `Box::new(Int(2))`, but that's beside the point for now.
-
-This `Expr` representation of the expression is useful for things like:
-
-- Checking that all variables are declared before they're used
-- Type checking
-
-> As of this writing, the compiler doesn't do any of those things yet. They'll be added later!
-
-Since the parser is only concerned with translating String values into Expr values, it will happily translate syntactically valid strings into expressions that won't work at runtime.
-
-For example, parsing will translate this string:
-
-not "foo", "bar"
-
-...into this `Expr`:
-
-    CallByName("not", vec!["foo", "bar"])
-
-Now we may know that `not` takes a `Bool` and returns another `Bool`, but the parser doesn't know that.
-
-The parser only knows how to translate a `String` into an `Expr`; it's the job of other parts of the compiler to figure out if `Expr` values have problems like type mismatches and non-exhaustive patterns.
-
-That said, the parser can still run into syntax errors. This won't parse:
-
-    if then 5 then else then
-
-This is gibberish to the parser, so it will produce an error rather than an `Expr`.
-
-Roc's parser is implemented using the [`marwes/combine`](http://github.com/marwes/combine-language/) crate.
-
-## Evaluating
-
-One of the useful things we can do with an `Expr` is to evaluate it.
-
-The process of evaluation is basically to transform an `Expr` into the simplest `Expr` we can that's still equivalent to the original.
-
-For example, let's say we had this code:
-
-    "1 + 8 - 3"
-
-The parser will translate this into the following `Expr`:
-
-    BinOp(
-        Int(1),
-        Plus,
-        BinOp(Int(8), Minus, Int(3))
-    )
-
-The `eval` function will take this `Expr` and translate it into this much simpler `Expr`:
-
-    Int(6)
-
-At this point it's become so simple that we can display it to the end user as the number `6`. So running `parse` and then `eval` on the original Roc string of `1 + 8 - 3` will result in displaying `6` as the final output.
-
-> The `expr` module includes an `impl fmt::Display for Expr` that takes care of translating `Int(6)` into `6`, `Char('x')` as `'x'`, and so on.
-
-`eval` accomplishes this by doing a `match` on an `Expr` and resolving every operation it encounters. For example, when it first sees this:
-
-    BinOp(
-        Int(1),
-        Plus,
-        BinOp(Int(8), Minus, Int(3))
-    )
-
-The first thing it does is to call `eval` on the right `Expr` values on either side of the `Plus`. That results in:
-
-1. Calling `eval` on `Int(1)`, which returns `Int(1)` since it can't be reduced any further.
-2. Calling `eval` on `BinOp(Int(8), Minus, Int(3))`, which in fact can be reduced further.
-
-Since the second call to `eval` will match on another `BinOp`, it's once again going to recursively call `eval` on both of its `Expr` values. Since those are both `Int` values, though, their `eval` calls will return them right away without doing anything else.
-
-Now that it's evaluated the expressions on either side of the `Minus`, `eval` will look at the particular operator being applied to those expressions (in this case, a minus operator) and check to see if the expressions it was given work with that operation.
-
-> Remember, this `Expr` value potentially came directly from the parser. `eval` can't be sure any type checking has been done on it!
-
-If `eval` detects a non-numeric `Expr` value (that is, the `Expr` is not `Int` or `Frac`) on either side of the `Minus`, then it will immediately give an error and halt the evaluation. This sort of runtime type error is common to dynamic languages, and you can think of `eval` as being a dynamic evaluation of Roc code that hasn't necessarily been type-checked.
-
-Assuming there's no type problem, `eval` can go ahead and run the Rust code of `8 - 3` and store the result in an `Int` expr.
-
-That concludes our original recursive call to `eval`, after which point we'll be evaluating this expression:
-
-    BinOp(
-        Int(1),
-        Plus,
-        Int(5)
-    )
-
-This will work the same way as `Minus` did, and will reduce down to `Int(6)`.
-
-## Optimization philosophy
-
-Focus on optimizations which are only safe in the absence of side effects, and leave the rest to LLVM.
-
-This focus may lead to some optimizations becoming transitively in scope. For example, some deforestation
-examples in the MSR paper benefit from multiple rounds of interleaved deforestation, beta-reduction, and inlining.
-To get those benefits, we'd have to do some inlining and beta-reduction that we could otherwise leave to LLVM's
-inlining and constant propagation/folding.
-
-Even if we're doing those things, it may still make sense to have LLVM do a pass for them as well, since
-early LLVM optimization passes may unlock later opportunities for inlining and constant propagation/folding.
-
-## Inlining
-
-If a function is called exactly once (it's a helper function), presumably we always want to inline those.
-If a function is "small enough" it's probably worth inlining too.
-
-## Fusion
-
-<https://www.microsoft.com/en-us/research/wp-content/uploads/2016/07/deforestation-short-cut.pdf>
-
-Basic approach:
-
-Do list stuff using `build` passing Cons Nil (like a cons list) and then do foldr/build substitution/reduction.
-Afterwards, we can do a separate pass to flatten nested Cons structures into properly initialized RRBTs.
-This way we get both deforestation and efficient RRBT construction. Should work for the other collection types too.
-
-It looks like we need to do some amount of inlining and beta reductions on the Roc side, rather than
-leaving all of those to LLVM.
-
-Advanced approach:
-
-Express operations like map and filter in terms of toStream and fromStream, to unlock more deforestation.
-More info on here:
-
-<https://wiki.haskell.org/GHC_optimisations#Fusion>
+For an overview of the design and architecture of the compiler, see
+[DESIGN.md](./DESIGN.md). If you want to dive into the
+implementation or get some tips on debugging the compiler, see below.

 ## Getting started with the code