From 33771ca2cf3d4dbaafb9b56da2ceec7ab2903f74 Mon Sep 17 00:00:00 2001
From: Ayaz Hafiz
Date: Sat, 11 Mar 2023 11:40:53 -0500
Subject: [PATCH 1/2] Start working on new compiler design documentation

---
 crates/compiler/DESIGN.md | 159 ++++++++++++++++++++++++++++++++++++++
 crates/compiler/README.md | 145 +---------------------------------
 2 files changed, 162 insertions(+), 142 deletions(-)
 create mode 100644 crates/compiler/DESIGN.md

diff --git a/crates/compiler/DESIGN.md b/crates/compiler/DESIGN.md
new file mode 100644
index 0000000000..815ffa2133
--- /dev/null
+++ b/crates/compiler/DESIGN.md
@@ -0,0 +1,159 @@
+# Compiler Design
+
+The current Roc compiler is designed as a pipelined compiler, parallelizable
+across Roc modules.
+
+Roc's compilation pipeline consists of a few major components, which form the
+table of contents for this document.
+
+- [Parsing](#parsing)
+- [Canonicalization](#canonicalization)
+  - [Symbol Resolution](#symbol-resolution)
+  - [Type-alias normalization](#type-alias-normalization)
+  - [Closure naming](#closure-naming)
+- [Constraint Generation](#constraint-generation)
+  - [(Mutually-)recursive definitions](#mutually-recursive-definitions)
+- [Type Solving](#type-solving)
+  - [Unification](#unification)
+  - [Type Inference](#type-inference)
+  - [Recursive Types](#recursive-types)
+  - [Lambda Sets](#lambda-sets)
+  - [Ability Collection](#ability-collection)
+  - [Ability Specialization](#ability-specialization)
+  - [Ability Derivation](#ability-derivation)
+  - [Exhaustiveness Checking](#exhaustiveness-checking)
+  - [Debugging](#debugging)
+- [IR Generation](#ir-generation)
+  - [Memory Layouts](#memory-layouts)
+  - [Compiling Calls](#compiling-calls)
+  - [Decision Trees](#decision-trees)
+  - [Tail-call Optimization](#tail-call-optimization)
+  - [Reference-count insertion](#reference-count-insertion)
+  - [Reusing Memory Allocations](#reusing-memory-allocations)
+  - [Debugging](#debugging-1)
+- [LLVM Code Generator](#llvm-code-generator)
+  - [Morphic Analysis](#morphic-analysis)
+  - [C ABI](#c-abi)
+  - [Test Harness](#test-harness)
+  - [Debugging](#debugging-2)
+- [WASM Code Generator](#wasm-code-generator)
+  - [WASM Interpreter](#wasm-interpreter)
+  - [Debugging](#debugging-3)
+- [Dev Code Generator](#dev-code-generator)
+  - [Debugging](#debugging-4)
+- [Builtins](#builtins)
+- [Compiler Driver](#compiler-driver)
+  - [Caching types](#caching-types)
+- [Repl](#repl)
+- [`test` and `dbg`](#test-and-dbg)
+- [Formatter](#formatter)
+- [Glue](#glue)
+- [Active areas of research / help wanted](#active-areas-of-research--help-wanted)
+
+## Parsing
+
+Roc's parsers are designed as [combinators](https://en.wikipedia.org/wiki/Parser_combinator).
+Roc's parse AST and the available combinators can be found in [the root parse
+file](./parse/src/parser.rs).
+
+Combinators enable parsers to compose the way ordinary functions do - for
+example, the `one_of` combinator tries multiple parsing strategies and
+succeeds with the first one that matches, while the `and_then` combinator
+chains two parsers together, failing if either parser in the sequence fails.
+
+Since Roc is an indentation-sensitive language, parsing must be cognizant and
+diligent about handling indentation and de-indentation levels. Most parsing
+functions take a `min_indent` parameter that specifies the minimum indentation
+of the scope an expression should be parsed in. Generally, failing to reach
+`min_indent` indicates that an expression has ended (but perhaps too early).
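+
+To make the combinator idea concrete, here is a minimal, self-contained sketch
+in the same spirit. It is illustrative only - the names, types, and signatures
+below are invented for this sketch and are not the actual definitions in
+`parse/src/parser.rs`:
+
+```rust
+// A parser maps the remaining input to a parsed value plus whatever input is
+// left over, or to an error message.
+type ParseResult<'a, T> = Result<(T, &'a str), String>;
+
+// Parse a single ASCII digit.
+fn digit(input: &str) -> ParseResult<'_, char> {
+    match input.chars().next() {
+        Some(c) if c.is_ascii_digit() => Ok((c, &input[1..])),
+        _ => Err("expected a digit".to_string()),
+    }
+}
+
+// Parse a literal '-' sign.
+fn minus(input: &str) -> ParseResult<'_, char> {
+    match input.chars().next() {
+        Some('-') => Ok(('-', &input[1..])),
+        _ => Err("expected '-'".to_string()),
+    }
+}
+
+// `one_of`: succeed with the first parser that succeeds.
+fn one_of<'a, T>(
+    input: &'a str,
+    parsers: &[fn(&'a str) -> ParseResult<'a, T>],
+) -> ParseResult<'a, T> {
+    for parse in parsers {
+        if let Ok(ok) = parse(input) {
+            return Ok(ok);
+        }
+    }
+    Err("no alternative matched".to_string())
+}
+
+// `and_then`: run `first`, then run `second` on the leftover input; fail if
+// either parser fails.
+fn and_then<'a, A, B>(
+    input: &'a str,
+    first: fn(&'a str) -> ParseResult<'a, A>,
+    second: fn(&'a str) -> ParseResult<'a, B>,
+) -> ParseResult<'a, (A, B)> {
+    let (a, rest) = first(input)?;
+    let (b, rest) = second(rest)?;
+    Ok(((a, b), rest))
+}
+
+fn main() {
+    // "-3" parses as a minus sign followed by a digit.
+    let ((sign, d), rest) = and_then("-3", minus, digit).unwrap();
+    assert_eq!((sign, d, rest), ('-', '3', ""));
+
+    // Either alternative can match: a digit parses "7".
+    let (c, _) = one_of("7", &[digit, minus]).unwrap();
+    assert_eq!(c, '7');
+}
+```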
+
+## Canonicalization
+
+### Symbol Resolution
+
+### Type-alias normalization
+
+### Closure naming
+
+## Constraint Generation
+
+### (Mutually-)recursive definitions
+
+## Type Solving
+
+### Unification
+
+### Type Inference
+
+### Recursive Types
+
+### Lambda Sets
+
+### Ability Collection
+
+### Ability Specialization
+
+### Ability Derivation
+
+### Exhaustiveness Checking
+
+### Debugging
+
+## IR Generation
+
+### Memory Layouts
+
+### Compiling Calls
+
+### Decision Trees
+
+### Tail-call Optimization
+
+### Reference-count insertion
+
+### Reusing Memory Allocations
+
+### Debugging
+
+## LLVM Code Generator
+
+### Morphic Analysis
+
+### C ABI
+
+### Test Harness
+
+### Debugging
+
+## WASM Code Generator
+
+### WASM Interpreter
+
+### Debugging
+
+## Dev Code Generator
+
+### Debugging
+
+## Builtins
+
+## Compiler Driver
+
+### Caching types
+
+## Repl
+
+## `test` and `dbg`
+
+## Formatter
+
+## Glue
+
+## Active areas of research / help wanted
diff --git a/crates/compiler/README.md b/crates/compiler/README.md
index 7b59ffae58..28f983c3f1 100644
--- a/crates/compiler/README.md
+++ b/crates/compiler/README.md
@@ -1,147 +1,8 @@
 # The Roc Compiler
 
-Here's how the compiler is laid out.
-
-## Parsing
-
-The main goal of parsing is to take a plain old String (such as the contents a .roc source file read from the filesystem) and translate that String into an `Expr` value.
-
-`Expr` is an `enum` defined in the `expr` module. An `Expr` represents a Roc expression.
-
-For example, parsing would translate this string...
-
-    "1 + 2"
-
-...into this `Expr` value:
-
-    BinOp(Int(1), Plus, Int(2))
-
-> Technically it would be `Box::new(Int(1))` and `Box::new(Int(2))`, but that's beside the point for now.
-
-This `Expr` representation of the expression is useful for things like:
-
-- Checking that all variables are declared before they're used
-- Type checking
-
-> As of this writing, the compiler doesn't do any of those things yet. They'll be added later!
-
-Since the parser is only concerned with translating String values into Expr values, it will happily translate syntactically valid strings into expressions that won't work at runtime.
-
-For example, parsing will translate this string:
-
-not "foo", "bar"
-
-...into this `Expr`:
-
-    CallByName("not", vec!["foo", "bar"])
-
-Now we may know that `not` takes a `Bool` and returns another `Bool`, but the parser doesn't know that.
-
-The parser only knows how to translate a `String` into an `Expr`; it's the job of other parts of the compiler to figure out if `Expr` values have problems like type mismatches and non-exhaustive patterns.
-
-That said, the parser can still run into syntax errors. This won't parse:
-
-    if then 5 then else then
-
-This is gibberish to the parser, so it will produce an error rather than an `Expr`.
-
-Roc's parser is implemented using the [`marwes/combine`](http://github.com/marwes/combine-language/) crate.
-
-## Evaluating
-
-One of the useful things we can do with an `Expr` is to evaluate it.
-
-The process of evaluation is basically to transform an `Expr` into the simplest `Expr` we can that's still equivalent to the original.
-
-For example, let's say we had this code:
-
-    "1 + 8 - 3"
-
-The parser will translate this into the following `Expr`:
-
-    BinOp(
-        Int(1),
-        Plus,
-        BinOp(Int(8), Minus, Int(3))
-    )
-
-The `eval` function will take this `Expr` and translate it into this much simpler `Expr`:
-
-    Int(6)
-
-At this point it's become so simple that we can display it to the end user as the number `6`.
-
-So running `parse` and then `eval` on the original Roc string of `1 + 8 - 3` will result in displaying `6` as the final output.
-
-> The `expr` module includes an `impl fmt::Display for Expr` that takes care of translating `Int(6)` into `6`, `Char('x')` as `'x'`, and so on.
-
-`eval` accomplishes this by doing a `match` on an `Expr` and resolving every operation it encounters. For example, when it first sees this:
-
-    BinOp(
-        Int(1),
-        Plus,
-        BinOp(Int(8), Minus, Int(3))
-    )
-
-The first thing it does is to call `eval` on the right `Expr` values on either side of the `Plus`. That results in:
-
-1. Calling `eval` on `Int(1)`, which returns `Int(1)` since it can't be reduced any further.
-2. Calling `eval` on `BinOp(Int(8), Minus, Int(3))`, which in fact can be reduced further.
-
-Since the second call to `eval` will match on another `BinOp`, it's once again going to recursively call `eval` on both of its `Expr` values. Since those are both `Int` values, though, their `eval` calls will return them right away without doing anything else.
-
-Now that it's evaluated the expressions on either side of the `Minus`, `eval` will look at the particular operator being applied to those expressions (in this case, a minus operator) and check to see if the expressions it was given work with that operation.
-
-> Remember, this `Expr` value potentially came directly from the parser. `eval` can't be sure any type checking has been done on it!
-
-If `eval` detects a non-numeric `Expr` value (that is, the `Expr` is not `Int` or `Frac`) on either side of the `Minus`, then it will immediately give an error and halt the evaluation. This sort of runtime type error is common to dynamic languages, and you can think of `eval` as being a dynamic evaluation of Roc code that hasn't necessarily been type-checked.
-
-Assuming there's no type problem, `eval` can go ahead and run the Rust code of `8 - 3` and store the result in an `Int` expr.
-
-That concludes our original recursive call to `eval`, after which point we'll be evaluating this expression:
-
-    BinOp(
-        Int(1),
-        Plus,
-        Int(5)
-    )
-
-This will work the same way as `Minus` did, and will reduce down to `Int(6)`.
-
-## Optimization philosophy
-
-Focus on optimizations which are only safe in the absence of side effects, and leave the rest to LLVM.
-
-This focus may lead to some optimizations becoming transitively in scope. For example, some deforestation
-examples in the MSR paper benefit from multiple rounds of interleaved deforestation, beta-reduction, and inlining.
-To get those benefits, we'd have to do some inlining and beta-reduction that we could otherwise leave to LLVM's
-inlining and constant propagation/folding.
-
-Even if we're doing those things, it may still make sense to have LLVM do a pass for them as well, since
-early LLVM optimization passes may unlock later opportunities for inlining and constant propagation/folding.
-
-## Inlining
-
-If a function is called exactly once (it's a helper function), presumably we always want to inline those.
-If a function is "small enough" it's probably worth inlining too.
-
-## Fusion
-
-Basic approach:
-
-Do list stuff using `build` passing Cons Nil (like a cons list) and then do foldr/build substitution/reduction.
-Afterwards, we can do a separate pass to flatten nested Cons structures into properly initialized RRBTs.
-This way we get both deforestation and efficient RRBT construction. Should work for the other collection types too.
-
-It looks like we need to do some amount of inlining and beta reductions on the Roc side, rather than
-leaving all of those to LLVM.
-
-Advanced approach:
-
-Express operations like map and filter in terms of toStream and fromStream, to unlock more deforestation.
-More info on here:
-
+For an overview of the design and architecture of the compiler, see
+[DESIGN.md](./DESIGN.md). If you want to dive into the implementation or get
+some tips on debugging the compiler, see below.
 
 ## Getting started with the code

From 2c385bf60160d0be12dfc4db9b388b93f4b69b47 Mon Sep 17 00:00:00 2001
From: Ayaz Hafiz
Date: Mon, 13 Mar 2023 14:07:42 +0000
Subject: [PATCH 2/2] Start a section on canonicalization

---
 crates/compiler/DESIGN.md | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/crates/compiler/DESIGN.md b/crates/compiler/DESIGN.md
index 815ffa2133..be77a57225 100644
--- a/crates/compiler/DESIGN.md
+++ b/crates/compiler/DESIGN.md
@@ -74,10 +74,45 @@ of the scope an expression should be parsed in. Generally, failing to reach
 
 ## Canonicalization
 
+After parsing a Roc program into an AST, the AST is transformed into a
+[canonical form](./can/src/expr.rs) AST. This may seem a bit redundant - why
+build another tree, when we already have the AST? Canonicalization performs a
+few analyses to catch user errors, and sets up the state necessary to solve the
+types in a program. Among other things, canonicalization
+- Uniquely identifies names (think variable and function names). Along the way,
+  canonicalization builds a graph of all variables' references, and catches
+  unused definitions, undefined references, and shadowed definitions.
+- Resolves type signatures, including aliases, into a form suitable for type
+  solving.
+- Determines the order in which definitions are used, even if they are defined
+  out of order.
+- Eliminates syntax sugar (for example, rewriting `+` to the function call
+  `add` and converting backpassing to function calls) - see the sketch after
+  this list.
+- Collects declared abilities, and the ability implementations defined for
+  opaque types. Derived abilities for opaque types are elaborated during
+  canonicalization.
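+
+As a rough illustration of the desugaring step, here is a toy example - the
+`Expr` type and `desugar` function below are invented for this sketch and are
+not the actual canonical IR or passes in the `can` crate:
+
+```rust
+// A toy surface AST: a number, an infix operator application as written in
+// the source, or a plain function call.
+#[derive(Debug, PartialEq)]
+enum Expr {
+    Num(i64),
+    BinOp(Box<Expr>, &'static str, Box<Expr>),
+    Call(&'static str, Vec<Expr>),
+}
+
+// Rewrite every operator application into a call to the function it stands
+// for, so later stages only ever see plain function calls.
+fn desugar(expr: Expr) -> Expr {
+    match expr {
+        Expr::BinOp(left, "+", right) => {
+            Expr::Call("add", vec![desugar(*left), desugar(*right)])
+        }
+        Expr::BinOp(left, "-", right) => {
+            Expr::Call("sub", vec![desugar(*left), desugar(*right)])
+        }
+        Expr::Call(name, args) => {
+            Expr::Call(name, args.into_iter().map(desugar).collect())
+        }
+        other => other,
+    }
+}
+
+fn main() {
+    // `1 + 2` desugars to the function call `add 1 2`.
+    let sugared = Expr::BinOp(Box::new(Expr::Num(1)), "+", Box::new(Expr::Num(2)));
+    assert_eq!(
+        desugar(sugared),
+        Expr::Call("add", vec![Expr::Num(1), Expr::Num(2)])
+    );
+}
+```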
+
 ### Symbol Resolution
 
+Identifiers, like variable names, are resolved to [Symbol](./module/src/symbol.rs)s.
+
+Currently, a symbol is a 64-bit value with
+- the bottom 32 bits defining the [ModuleId](./module/src/ident.rs) the symbol
+  is defined in
+- the top 32 bits defining the [IdentId](./module/src/ident.rs) of the symbol
+  in the module
+
+A symbol is unique to an identifier name and the scope in which that identifier
+is declared. Symbols are how the rest of the compiler refers to value
+definitions - since the scope and identifier name are disambiguated when
+symbols are created, referencing a symbol requires no further name resolution.
+
+As symbols are constructed, canonicalization also keeps track of all references
+to a given symbol. This reduces catching unused definitions, undefined
+references, and shadowing to an index lookup into an array.
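+
+For illustration, that packing could look roughly like the following - a
+hypothetical sketch, not the actual `Symbol` implementation in
+`module/src/symbol.rs`:
+
+```rust
+#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
+struct ModuleId(u32);
+
+#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
+struct IdentId(u32);
+
+// A symbol packs a module id and an ident id into a single 64-bit value.
+#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
+struct Symbol(u64);
+
+impl Symbol {
+    // Bottom 32 bits: the module the symbol is defined in.
+    // Top 32 bits: the identifier within that module.
+    fn new(module_id: ModuleId, ident_id: IdentId) -> Self {
+        Symbol(((ident_id.0 as u64) << 32) | module_id.0 as u64)
+    }
+
+    fn module_id(self) -> ModuleId {
+        ModuleId(self.0 as u32)
+    }
+
+    fn ident_id(self) -> IdentId {
+        IdentId((self.0 >> 32) as u32)
+    }
+}
+
+fn main() {
+    let symbol = Symbol::new(ModuleId(3), IdentId(17));
+    assert_eq!(symbol.module_id(), ModuleId(3));
+    assert_eq!(symbol.ident_id(), IdentId(17));
+}
+```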
+
 ### Type-alias normalization
 
 ### Closure naming