	Compiler Design
The current Roc compiler is designed as a pipelining compiler parallelizable across Roc modules.
Roc's compilation pipeline consists of a few major components, which form the table of contents for this document.
- Parsing
- Canonicalization
- Constraint Generation
- Type Solving
- IR Generation
- LLVM Code Generator
- WASM Code Generator
- Dev Code Generator
- Builtins
- Compiler Driver
- Repl
- test and dbg
- Formatter
- Glue
- Active areas of research / help wanted
Parsing
Roc's parsers are designed as combinators. A list of Roc's parse AST and combinators can be found in the root parse file.
Combinators enable parsing to compose as functions would - for example, the
one_of combinator supports attempting multiple parsing strategies, and
succeeding on the first one; the and_then combinator chains two parsers
together, failing if either parser in the sequence fails.
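As a rough illustration of the two combinators described above, here is a minimal sketch. The `ParseResult` type, the `literal` helper, and the function signatures are hypothetical simplifications for this example, not the definitions in the compiler's parse crate.

```rust
// Hypothetical, simplified parser shape: a parser consumes a prefix of the
// input and returns the parsed value plus the remaining input, or None.
type ParseResult<'a, T> = Option<(T, &'a str)>;

/// Hypothetical helper: a parser that accepts exactly `expected`.
fn literal<'a>(expected: &'static str) -> impl Fn(&'a str) -> ParseResult<'a, &'static str> {
    move |input: &'a str| input.strip_prefix(expected).map(|rest| (expected, rest))
}

/// Tries `first`; if it fails, tries `second` on the same input.
fn one_of<'a, T>(
    first: impl Fn(&'a str) -> ParseResult<'a, T>,
    second: impl Fn(&'a str) -> ParseResult<'a, T>,
) -> impl Fn(&'a str) -> ParseResult<'a, T> {
    move |input: &'a str| first(input).or_else(|| second(input))
}

/// Runs `first`, then `second` on what is left; fails if either fails.
fn and_then<'a, A, B>(
    first: impl Fn(&'a str) -> ParseResult<'a, A>,
    second: impl Fn(&'a str) -> ParseResult<'a, B>,
) -> impl Fn(&'a str) -> ParseResult<'a, (A, B)> {
    move |input: &'a str| {
        let (a, rest) = first(input)?;
        let (b, rest) = second(rest)?;
        Some(((a, b), rest))
    }
}
```

For instance, `one_of(literal("when"), literal("if"))` accepts either keyword, while `and_then(literal("a"), literal("b"))` only succeeds on input that starts with `ab`.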
Since Roc is an indentation-sensitive language, parsing must be cognizant and
diligent about handling indentation and de-indentation levels. Most parsing
functions take a min_indent parameter that specifies the minimum indentation
of the scope an expression should be parsed in. Generally, failing to reach
min_indent indicates that an expression has ended (but perhaps too early).
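The role of min_indent can be sketched on a line-based toy model. The `indent_of` and `take_scope` functions below are hypothetical and far simpler than the real parser, which tracks indentation while parsing tokens rather than whole lines; the sketch only shows how falling below min_indent ends a scope.

```rust
/// Number of leading spaces on a line (tabs are ignored in this sketch).
fn indent_of(line: &str) -> usize {
    line.len() - line.trim_start_matches(' ').len()
}

/// Splits `lines` into the lines that belong to a scope requiring at least
/// `min_indent` columns of indentation, and the lines that follow it.
/// The scope ends at the first non-blank line indented less than `min_indent`.
fn take_scope<'a>(lines: &'a [&'a str], min_indent: usize) -> (&'a [&'a str], &'a [&'a str]) {
    let end = lines
        .iter()
        .position(|line| !line.trim().is_empty() && indent_of(line) < min_indent)
        .unwrap_or(lines.len());
    lines.split_at(end)
}
```

If the returned scope is empty where a body was required, the caller would report that the expression ended too early.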
Canonicalization
After parsing a Roc program into an AST, the AST is transformed into a canonical form AST. This may seem a bit redundant - why build another tree, when we already have the AST? Canonicalization performs a few analyses to catch user errors, and sets up the state necessary to solve the types in a program. Among other things, canonicalization
- Uniquely identifies names (think variable and function names). Along the way, canonicalization builds a graph of all variables' references, and catches unused definitions, undefined definitions, and shadowed definitions.
- Resolves type signatures, including aliases, into a form suitable for type solving.
- Determines the order definitions are used in, if they are defined out-of-order.
- Eliminates syntax sugar (for example, renaming + to the function call add).
- Collects declared abilities, and ability implementations defined for opaque types. Derived abilities for opaque types are elaborated during canonicalization.
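To make the syntax-sugar step concrete, here is a small sketch of rewriting a + operator into a call to add. The `Expr` type and the `desugar` function are hypothetical and much smaller than the compiler's real AST; only the + to add mapping comes from the list above.

```rust
// Hypothetical toy AST, not the compiler's parse or canonical AST.
#[derive(Debug, PartialEq)]
enum Expr {
    Num(i64),
    Var(String),
    BinOp(Box<Expr>, char, Box<Expr>),
    Call(String, Vec<Expr>),
}

/// Recursively replaces `lhs + rhs` with the call `add(lhs, rhs)`.
/// Other operators are kept, but their operands are still desugared.
fn desugar(expr: Expr) -> Expr {
    match expr {
        Expr::BinOp(lhs, '+', rhs) => {
            Expr::Call("add".to_string(), vec![desugar(*lhs), desugar(*rhs)])
        }
        Expr::BinOp(lhs, op, rhs) => {
            Expr::BinOp(Box::new(desugar(*lhs)), op, Box::new(desugar(*rhs)))
        }
        Expr::Call(name, args) => Expr::Call(name, args.into_iter().map(desugar).collect()),
        other => other,
    }
}
```

After this pass, later stages in the sketch only need to handle function calls where + appeared, not a separate operator node.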
Symbol Resolution
Identifiers, like variable names, are resolved to Symbols.
Currently, a symbol is a 64-bit value with
- the bottom 32 bits defining the ModuleId the symbol is defined in
- the top 32 bits defining the IdentId of the symbol in the module
A symbol is unique per identifier name and the scope that the identifier has been declared in. Symbols are how the rest of the compiler refers to value definitions - since the unique scope and identifier name is disambiguated when symbols are created, referencing symbols requires no further name resolution.
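The bit layout described above can be sketched as follows. The type names mirror the ones mentioned in the text, but the constructor and accessor methods are assumptions made for this illustration and may not match the compiler's actual API.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct ModuleId(u32);

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct IdentId(u32);

/// A 64-bit symbol: ModuleId in the bottom 32 bits, IdentId in the top 32.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct Symbol(u64);

impl Symbol {
    /// Packs the two 32-bit ids into one 64-bit value.
    fn new(module_id: ModuleId, ident_id: IdentId) -> Self {
        Symbol(((ident_id.0 as u64) << 32) | module_id.0 as u64)
    }

    /// Bottom 32 bits: the module the symbol is defined in.
    fn module_id(self) -> ModuleId {
        ModuleId(self.0 as u32)
    }

    /// Top 32 bits: the identifier within that module.
    fn ident_id(self) -> IdentId {
        IdentId((self.0 >> 32) as u32)
    }
}
```

Because the whole symbol fits in a single machine word, comparing or hashing two symbols in this sketch is an integer operation with no string handling.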
As symbols are constructed, canonicalization also keeps track of all references to a given symbol. This reduces catching unused definitions, undefined definitions, and shadowing to an index into an array.
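One way to picture reference tracking as array indexing is a per-module table of reference counts indexed by identifier id. The `References` struct below is a hypothetical sketch of that idea, not the data structure the compiler uses.

```rust
/// Hypothetical per-module table: slot `id` counts references to identifier `id`.
#[derive(Default)]
struct References {
    counts: Vec<u32>,
}

impl References {
    /// Registers a new definition and returns its index (its identifier id).
    fn declare(&mut self) -> usize {
        self.counts.push(0);
        self.counts.len() - 1
    }

    /// Records one use of the definition stored at `id`.
    fn reference(&mut self, id: usize) {
        self.counts[id] += 1;
    }

    /// Definitions that were declared but never referenced.
    fn unused(&self) -> Vec<usize> {
        self.counts
            .iter()
            .enumerate()
            .filter(|(_, count)| **count == 0)
            .map(|(id, _)| id)
            .collect()
    }
}
```

In this sketch, checking whether a definition is unused is a single array lookup, and an id with no slot in the table would correspond to an undefined name.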
