This guide describes the current state of syntax trees and parsing in rust-analyzer as of 2020-01-09 ([link to commit](https://github.com/rust-analyzer/rust-analyzer/tree/cf5bdf464cad7ceb9a67e07985a3f4d3799ec0b6)).
## Source Code
The things described are implemented in three places:
* [rowan](https://github.com/rust-analyzer/rowan/tree/v0.9.0) -- a generic library for rowan syntax trees.
* [ra_syntax](https://github.com/rust-analyzer/rust-analyzer/tree/cf5bdf464cad7ceb9a67e07985a3f4d3799ec0b6/crates/ra_syntax) crate inside rust-analyzer, which wraps `rowan` into a rust-analyzer-specific API.
Nothing in rust-analyzer except this crate knows about `rowan`.
* [ra_parser](https://github.com/rust-analyzer/rust-analyzer/tree/cf5bdf464cad7ceb9a67e07985a3f4d3799ec0b6/crates/ra_parser) crate, which parses input tokens into an `ra_syntax` tree.
## Design Goals
* Syntax trees are lossless, or full fidelity. All comments and whitespace are preserved.
* Syntax trees are semantic-less. They describe *strictly* the structure of a sequence of characters; they don't have hygiene, name resolution or type information attached.
Red-green terminology comes from Roslyn ([link](https://docs.microsoft.com/en-ie/archive/blogs/ericlippert/persistence-facades-and-roslyns-red-green-trees)) and gives the name to the `rowan` library. Green and syntax nodes are defined in rowan, ast is defined in rust-analyzer.
Syntax trees are a semi-transient data structure.
In general, the frontend does not keep syntax trees for all files in memory.
Instead, it *lowers* syntax trees to a more compact and rigid representation, which is not full-fidelity, but which can be mapped back to a syntax tree if so desired.
* The tree is untyped. Each node has a "type tag", `SyntaxKind`.
* Interior and leaf nodes are distinguished on the type level.
* Trivia and non-trivia tokens are not distinguished on the type level.
* Each token carries its full text.
* The original text can be recovered by concatenating the texts of all tokens in order.
* Accessing a child of a particular type (for example, the parameter list of a function) generally involves linearly traversing the children, looking for a specific `kind`.
We don't make special efforts to guarantee that the depth is not linear, but, in practice, syntax trees are branchy and shallow.
* If a mandatory (grammar-wise) node is missing from the input, it's just missing from the tree.
* If an extra erroneous input is present, it is wrapped into a node with `ERROR` kind, and treated just like any other node.
* Parser errors are not a part of the syntax tree.
An input like `fn f() { 90 + 2 }` might be parsed as:
```
FN_DEF@[0; 17)
  FN_KW@[0; 2) "fn"
  WHITESPACE@[2; 3) " "
  NAME@[3; 4)
    IDENT@[3; 4) "f"
  PARAM_LIST@[4; 6)
    L_PAREN@[4; 5) "("
    R_PAREN@[5; 6) ")"
  WHITESPACE@[6; 7) " "
  BLOCK_EXPR@[7; 17)
    BLOCK@[7; 17)
      L_CURLY@[7; 8) "{"
      WHITESPACE@[8; 9) " "
      BIN_EXPR@[9; 15)
        LITERAL@[9; 11)
          INT_NUMBER@[9; 11) "90"
        WHITESPACE@[11; 12) " "
        PLUS@[12; 13) "+"
        WHITESPACE@[13; 14) " "
        LITERAL@[14; 15)
          INT_NUMBER@[14; 15) "2"
      WHITESPACE@[15; 16) " "
      R_CURLY@[16; 17) "}"
```
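The properties listed above (untyped nodes tagged with a kind, tokens carrying their text, text recovery by concatenation, child lookup by linear scan) can be illustrated with a small sketch. The types below are hypothetical, simplified stand-ins and not `rowan`'s actual API; they only model the `90 + 2` sub-tree from the dump above.

```rust
// Hypothetical, simplified stand-ins for rowan's types; not the real API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SyntaxKind {
    BinExpr,
    Literal,
    IntNumber,
    Whitespace,
    Plus,
}

// Leaf: a token carries its kind and its full text.
#[derive(Debug, Clone)]
struct Token {
    kind: SyntaxKind,
    text: String,
}

// Interior node: a kind tag plus an ordered list of children.
#[derive(Debug, Clone)]
struct Node {
    kind: SyntaxKind,
    children: Vec<Element>,
}

// Interior and leaf nodes are distinguished on the type level.
#[derive(Debug, Clone)]
enum Element {
    Node(Node),
    Token(Token),
}

impl Node {
    // Recover the source text by concatenating all token texts in order.
    fn text(&self) -> String {
        let mut out = String::new();
        self.push_text(&mut out);
        out
    }

    fn push_text(&self, out: &mut String) {
        for child in &self.children {
            match child {
                Element::Node(n) => n.push_text(out),
                Element::Token(t) => out.push_str(&t.text),
            }
        }
    }

    // Find the first child node of the given kind by a linear scan.
    fn child_node(&self, kind: SyntaxKind) -> Option<&Node> {
        self.children.iter().find_map(|c| match c {
            Element::Node(n) if n.kind == kind => Some(n),
            _ => None,
        })
    }

    // Find the first child token of the given kind by a linear scan.
    fn child_token(&self, kind: SyntaxKind) -> Option<&Token> {
        self.children.iter().find_map(|c| match c {
            Element::Token(t) if t.kind == kind => Some(t),
            _ => None,
        })
    }
}

fn tok(kind: SyntaxKind, text: &str) -> Element {
    Element::Token(Token { kind, text: text.to_string() })
}

fn literal(text: &str) -> Element {
    let children = vec![tok(SyntaxKind::IntNumber, text)];
    Element::Node(Node { kind: SyntaxKind::Literal, children })
}

// Builds the `90 + 2` sub-tree (the BIN_EXPR node) by hand.
fn bin_expr_90_plus_2() -> Node {
    Node {
        kind: SyntaxKind::BinExpr,
        children: vec![
            literal("90"),
            tok(SyntaxKind::Whitespace, " "),
            tok(SyntaxKind::Plus, "+"),
            tok(SyntaxKind::Whitespace, " "),
            literal("2"),
        ],
    }
}
```

Calling `bin_expr_90_plus_2().text()` yields the original `90 + 2`, whitespace included, and `child_node(SyntaxKind::Literal)` returns the first literal after scanning the children from the left.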
#### Optimizations
(A significant amount of the implementation work here was done by [CAD97](https://github.com/cad97).)
To reduce the number of allocations, `GreenNode` is a DST, which uses a single allocation for the header and the children. Thus, it is only usable behind a pointer.
We currently use `SmolStr`, a small-object-optimized string, to store text.
This was mostly relevant *before* we implemented tree interning, to avoid allocating common keywords and identifiers. We should switch to storing text data alongside the interned tokens.
In the above model, whitespace is not treated specially.
Another alternative (used by Swift and Roslyn) is to explicitly divide the set of tokens into trivia and non-trivia tokens, and to represent each non-trivia token together with its attached leading and trailing trivia.
That way, the tree again contains only non-trivia tokens.
Explicit trivia nodes, like in `rowan`, are used by IntelliJ.
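The attached-trivia alternative can be sketched roughly as follows. All type and field names here are illustrative assumptions, not taken from Swift's or Roslyn's actual APIs; the point is only that trivia lives on the tokens, so the list of tokens itself contains no whitespace or comments.

```rust
// Hypothetical types illustrating the "attached trivia" design.
#[derive(Debug, Clone, PartialEq)]
enum TriviaKind {
    Whitespace,
    Comment,
}

#[derive(Debug, Clone, PartialEq)]
struct Trivia {
    kind: TriviaKind,
    text: String,
}

// A non-trivia token owns the trivia around it.
#[derive(Debug, Clone, PartialEq)]
struct Token {
    text: String,
    leading_trivia: Vec<Trivia>,
    trailing_trivia: Vec<Trivia>,
}

impl Token {
    // Text of the token including its attached trivia.
    fn full_text(&self) -> String {
        let mut out = String::new();
        for t in &self.leading_trivia {
            out.push_str(&t.text);
        }
        out.push_str(&self.text);
        for t in &self.trailing_trivia {
            out.push_str(&t.text);
        }
        out
    }
}

fn ws(text: &str) -> Trivia {
    Trivia { kind: TriviaKind::Whitespace, text: text.to_string() }
}

fn comment(text: &str) -> Trivia {
    Trivia { kind: TriviaKind::Comment, text: text.to_string() }
}

fn token(text: &str, leading: Vec<Trivia>, trailing: Vec<Trivia>) -> Token {
    Token { text: text.to_string(), leading_trivia: leading, trailing_trivia: trailing }
}

// `90 + 2 // sum` as three non-trivia tokens; spaces and the comment are
// attached as trailing trivia in this sketch.
fn tokens_90_plus_2() -> Vec<Token> {
    vec![
        token("90", vec![], vec![ws(" ")]),
        token("+", vec![], vec![ws(" ")]),
        token("2", vec![], vec![ws(" "), comment("// sum")]),
    ]
}
```

The source text is still fully recoverable by concatenating `full_text()` of every token, while a consumer that only cares about structure sees just three tokens.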
##### Accessing Children
As noted before, accessing a specific child in the node requires a linear traversal of the children (though we can skip tokens, because the tag is encoded in the pointer itself).
It is possible to recover O(1) access with another representation.
We explicitly store optional and missing (required by the grammar, but not present) nodes.
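A minimal sketch of such a fixed-slot layout is shown below. It is not how `rowan` stores nodes: the slot indices, the string kinds and the `slot` helper are all hypothetical, and the sketch assumes that trivia is kept out of the slots so that the position of a child depends only on the grammar.

```rust
// Hypothetical fixed-slot layout; not how rowan stores nodes.
#[derive(Debug, Clone, PartialEq)]
enum Element {
    Token(String),
    Node(Node),
}

#[derive(Debug, Clone, PartialEq)]
struct Node {
    kind: &'static str,
    // One slot per grammar position; `None` marks an absent child.
    slots: Vec<Option<Element>>,
}

// Assumed slot positions for a function definition:
// `fn` keyword, name, parameter list, return type, body.
const FN_NAME: usize = 1;
const FN_RET_TYPE: usize = 3;
const FN_BODY: usize = 4;

impl Node {
    // O(1) access: the index is known from the grammar, no scan is needed.
    fn slot(&self, index: usize) -> Option<&Element> {
        self.slots.get(index).and_then(|s| s.as_ref())
    }
}

// `fn f() {}` with the return type slot left empty.
fn fn_def_without_ret_type() -> Node {
    let param_list = Node {
        kind: "PARAM_LIST",
        slots: vec![
            Some(Element::Token("(".to_string())),
            Some(Element::Token(")".to_string())),
        ],
    };
    let body = Node { kind: "BLOCK_EXPR", slots: vec![] };
    Node {
        kind: "FN_DEF",
        slots: vec![
            Some(Element::Token("fn".to_string())),
            Some(Element::Token("f".to_string())),
            Some(Element::Node(param_list)),
            None, // no return type in `fn f() {}`
            Some(Element::Node(body)),
        ],
    }
}
```

The trade-off is that every node pays for the slots of children it does not have, in exchange for index-based access.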
Both copies of the `x + 2` expression are represented by equal (and, with interning in mind, actually the same) green nodes.
Green trees just can't differentiate between the two.
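To make the "equal, and with interning actually the same" point concrete, here is a small interning sketch. The `Green` enum and the `Interner` are hypothetical and much simpler than whatever caching `rowan` really does; the sketch only shows that building the same expression twice through an interner yields one shared allocation.

```rust
use std::collections::HashSet;
use std::sync::Arc;

// Hypothetical green tree: no parent pointers, no offsets, just structure.
#[derive(Debug, PartialEq, Eq, Hash)]
enum Green {
    Token { kind: &'static str, text: String },
    Node { kind: &'static str, children: Vec<Arc<Green>> },
}

// Deduplicates structurally equal green elements.
#[derive(Default)]
struct Interner {
    set: HashSet<Arc<Green>>,
}

impl Interner {
    fn intern(&mut self, green: Green) -> Arc<Green> {
        // `Arc<Green>` borrows as `Green`, so we can look up by value.
        if let Some(existing) = self.set.get(&green) {
            return Arc::clone(existing);
        }
        let fresh = Arc::new(green);
        self.set.insert(Arc::clone(&fresh));
        fresh
    }

    fn token(&mut self, kind: &'static str, text: &str) -> Arc<Green> {
        self.intern(Green::Token { kind, text: text.to_string() })
    }
}

// Builds a green node for `x + 2` through the interner.
fn x_plus_2(interner: &mut Interner) -> Arc<Green> {
    let x = interner.token("IDENT", "x");
    let ws = interner.token("WHITESPACE", " ");
    let plus = interner.token("PLUS", "+");
    let two = interner.token("INT_NUMBER", "2");
    let children = vec![x, Arc::clone(&ws), plus, ws, two];
    interner.intern(Green::Node { kind: "BIN_EXPR", children })
}
```

Two calls to `x_plus_2` with the same interner return pointers to the same allocation, which is exactly why a green node alone cannot tell which occurrence in the file it stands for.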
`SyntaxNode` adds parent pointers and identity semantics to green nodes.
They can be called cursors or [zippers](https://en.wikipedia.org/wiki/Zipper_(data_structure)) (fun fact: zipper is a derivative (as in ′) of a data structure).
This is OK because tree traversals mostly (always, in the case of rust-analyzer) run on a single thread. If you need to send a `SyntaxNode` to another thread, you can send a pair of the **root** `GreenNode` (which is thread safe) and a `Range<usize>`.
The other thread can restore the `SyntaxNode` by traversing from the root green node and looking for a node with specified range.
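The restore step can be sketched as below. This is a simplified model with hypothetical types, not `rowan`'s implementation: a "red" node is a green node plus a parent pointer and an absolute offset, and `restore` walks down from the root until it finds a node covering exactly the requested range. When a node and its only child share a range, this sketch returns the outer one; a fuller version would probably compare kinds as well.

```rust
use std::ops::Range;
use std::rc::Rc;
use std::sync::Arc;

// Hypothetical, simplified green tree.
#[derive(Debug)]
enum Green {
    Token { text: String },
    Node { kind: &'static str, children: Vec<Arc<Green>> },
}

impl Green {
    fn text_len(&self) -> usize {
        match self {
            Green::Token { text } => text.len(),
            Green::Node { children, .. } => {
                children.iter().map(|c| c.text_len()).sum::<usize>()
            }
        }
    }
}

// A "red" node: green node + parent pointer + absolute offset.
// `Rc` keeps it confined to one thread.
#[derive(Debug, Clone)]
struct SyntaxNode(Rc<NodeData>);

#[derive(Debug)]
struct NodeData {
    green: Arc<Green>,
    parent: Option<SyntaxNode>,
    offset: usize,
}

impl SyntaxNode {
    fn new_root(green: Arc<Green>) -> SyntaxNode {
        SyntaxNode(Rc::new(NodeData { green, parent: None, offset: 0 }))
    }

    fn kind(&self) -> &'static str {
        match &*self.0.green {
            Green::Node { kind, .. } => *kind,
            Green::Token { .. } => "TOKEN",
        }
    }

    fn range(&self) -> Range<usize> {
        self.0.offset..self.0.offset + self.0.green.text_len()
    }

    fn parent(&self) -> Option<SyntaxNode> {
        self.0.parent.clone()
    }

    // Child nodes (tokens are skipped), created on demand.
    fn children(&self) -> Vec<SyntaxNode> {
        let mut offset = self.0.offset;
        let mut out = Vec::new();
        if let Green::Node { children, .. } = &*self.0.green {
            for child in children {
                if matches!(**child, Green::Node { .. }) {
                    let parent = Some(self.clone());
                    let data = NodeData { green: Arc::clone(child), parent, offset };
                    out.push(SyntaxNode(Rc::new(data)));
                }
                offset += child.text_len();
            }
        }
        out
    }
}

// Restore a node from the thread-safe pair (root green node, range).
fn restore(root: Arc<Green>, range: Range<usize>) -> Option<SyntaxNode> {
    let mut node = SyntaxNode::new_root(root);
    while node.range() != range {
        node = node
            .children()
            .into_iter()
            .find(|c| c.range().start <= range.start && range.end <= c.range().end)?;
    }
    Some(node)
}

fn tok(text: &str) -> Arc<Green> {
    Arc::new(Green::Token { text: text.to_string() })
}

fn lit(text: &str) -> Arc<Green> {
    Arc::new(Green::Node { kind: "LITERAL", children: vec![tok(text)] })
}

// Green tree for `90 + 2`.
fn sample_root() -> Arc<Green> {
    let children = vec![lit("90"), tok(" "), tok("+"), tok(" "), lit("2")];
    Arc::new(Green::Node { kind: "BIN_EXPR", children })
}
```

Given `sample_root()`, `restore(root, 5..6)` finds the `LITERAL` node for `2`, and its `parent()` is the `BIN_EXPR` root, so the restored node has the same identity information as one reached by an ordinary traversal.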
rust-analyzer treats trees as semi-transient and, instead of storing a `GreenNode`, it generally stores just the id of the file from which the tree originated: `(FileId, Range<usize>)`.
The `SyntaxNode` is then restored by reparsing the file and traversing it from the root.
With this trick, rust-analyzer holds only a small number of trees in memory at the same time, which reduces memory usage.
Additionally, only the root `SyntaxNode` owns an `Arc` to the (root) `GreenNode`.
To get rid of allocations, `rowan` takes advantage of `SyntaxNode: !Sync` and uses a thread-local free list of `SyntaxNode`s.
In a typical traversal, you only directly hold a few `SyntaxNode`s at a time (and their ancestors indirectly), so a free list proportional to the depth of the tree removes all allocations in a typical case.
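The free-list idea can be illustrated with a small thread-local pool. This is only a sketch under simplifying assumptions: `NodeData`, `alloc` and `free` are hypothetical names, and `rowan`'s real free list is organized differently. What it shows is that a box handed back to the pool is reused by the next allocation on the same thread, with no locking.

```rust
use std::cell::RefCell;

// Hypothetical payload standing in for the data behind a `SyntaxNode`.
#[derive(Debug)]
struct NodeData {
    offset: usize,
}

thread_local! {
    // Per-thread pool of unused boxes. No locking is needed because the
    // nodes never leave the thread that created them.
    static FREE_LIST: RefCell<Vec<Box<NodeData>>> = RefCell::new(Vec::new());
}

// Take a box from the free list, falling back to a fresh allocation.
fn alloc(offset: usize) -> Box<NodeData> {
    match FREE_LIST.with(|list| list.borrow_mut().pop()) {
        Some(mut data) => {
            data.offset = offset;
            data
        }
        None => Box::new(NodeData { offset }),
    }
}

// Return a box to the free list instead of deallocating it.
fn free(data: Box<NodeData>) {
    FREE_LIST.with(|list| list.borrow_mut().push(data));
}

// Number of boxes currently waiting for reuse on this thread.
fn free_list_len() -> usize {
    FREE_LIST.with(|list| list.borrow().len())
}
```

After a traversal has warmed up the pool, stepping to a sibling frees one box and takes one back, so the steady state performs no heap allocation.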
`GreenTree`s are untyped and homogeneous, because it makes accommodating error nodes, arbitrary whitespace and comments natural, and because it makes it possible to write generic tree traversals.
In IntelliJ the AST layer (dubbed **P**rogram **S**tructure **I**nterface) can have semantics attached, and is usually backed by either syntax tree, indices, or metadata from compiled libraries.
* The parser and the syntax tree are independent: they live in different crates, neither of which depends on the other.
* The parser doesn't know anything about textual contents of the tokens, with an isolated hack for checking contextual keywords.
* For gluing tokens, the `TreeSink::token` might advance further than one atomic token ahead.
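The separation described in this list can be sketched with a sink trait: the parser only sees token kinds and emits events, while the sink owns the token texts and decides what to build. The trait below is loosely modelled on the idea of `TreeSink`, but its exact method signatures, the `LogSink` type and the `parse_shift` function are assumptions made for illustration. It also shows gluing: one `token` call consumes two raw `>` tokens to form a single `>>`.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind {
    BinExpr,
    Ident,
    Shr,
}

// Output interface of the parser: it deals in kinds, never in token text.
trait TreeSink {
    fn start_node(&mut self, kind: Kind);
    // Adds one token made of the next `n_raw_tokens` raw tokens glued together.
    fn token(&mut self, kind: Kind, n_raw_tokens: usize);
    fn finish_node(&mut self);
}

// A sink that owns the raw token texts and records a flat event log.
struct LogSink<'a> {
    raw: &'a [&'a str],
    pos: usize,
    log: Vec<String>,
}

impl<'a> TreeSink for LogSink<'a> {
    fn start_node(&mut self, kind: Kind) {
        self.log.push(format!("start {:?}", kind));
    }

    fn token(&mut self, kind: Kind, n_raw_tokens: usize) {
        // Advance over as many raw tokens as the parser asked for.
        let text: String = self.raw[self.pos..self.pos + n_raw_tokens].concat();
        self.pos += n_raw_tokens;
        self.log.push(format!("token {:?} {:?}", kind, text));
    }

    fn finish_node(&mut self) {
        self.log.push("finish".to_string());
    }
}

// Stand-in for the parser: emits events for `a >> b`, where the lexer
// produced the raw tokens `a`, `>`, `>`, `b` (whitespace left out for brevity).
fn parse_shift(sink: &mut dyn TreeSink) {
    sink.start_node(Kind::BinExpr);
    sink.token(Kind::Ident, 1);
    sink.token(Kind::Shr, 2); // glue `>` `>` into a single `>>`
    sink.token(Kind::Ident, 1);
    sink.finish_node();
}

fn run(raw: &[&str]) -> Vec<String> {
    let mut sink = LogSink { raw, pos: 0, log: Vec::new() };
    parse_shift(&mut sink);
    sink.log
}
```

Because `parse_shift` depends only on the trait, the same parser code could feed a sink that builds a syntax tree, or one that just counts nodes.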
### Reporting Syntax Errors
Syntax errors are not stored directly in the tree.
The primary motivation for this is that a syntax tree is not necessarily produced by the parser; it may also be assembled manually from pieces (which happens all the time in refactorings).
Instead, the parser reports errors to an error sink, which stores them in a `Vec`.
If possible, errors are not reported during parsing and are postponed for a separate validation step.
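A rough sketch of this split between parse-time errors and a later validation pass is shown below. Everything in it is hypothetical: the flat `Tree` type, the "expected `;`" check and the empty char literal check are placeholders chosen to keep the example short, not rust-analyzer's actual rules.

```rust
#[derive(Debug, Clone, PartialEq)]
struct SyntaxError {
    message: String,
    offset: usize,
}

// Flat stand-in for a syntax tree: (kind, text) pairs in source order.
type Tree = Vec<(&'static str, String)>;

// Structural pass: builds the tree and reports errors to the sink (a `Vec`).
fn parse(tokens: &[(&'static str, &str)], errors: &mut Vec<SyntaxError>) -> Tree {
    let tree: Tree = tokens.iter().map(|(k, t)| (*k, t.to_string())).collect();
    if tokens.last().map(|(k, _)| *k) != Some("SEMI") {
        let end: usize = tokens.iter().map(|(_, t)| t.len()).sum();
        errors.push(SyntaxError { message: "expected `;`".to_string(), offset: end });
    }
    tree
}

// Separate validation pass: works on any tree, including one assembled
// by hand, because it does not rely on parser state.
fn validate(tree: &Tree, errors: &mut Vec<SyntaxError>) {
    let mut offset = 0;
    for (kind, text) in tree {
        if *kind == "CHAR" && text == "''" {
            errors.push(SyntaxError { message: "empty char literal".to_string(), offset });
        }
        offset += text.len();
    }
}

// Parse, then validate, collecting all errors next to (not inside) the tree.
fn check(tokens: &[(&'static str, &str)]) -> Vec<SyntaxError> {
    let mut errors = Vec::new();
    let tree = parse(tokens, &mut errors);
    validate(&tree, &mut errors);
    errors
}
```

Since `validate` takes only a tree, it can also be run on a tree built by a refactoring, which never went through `parse` at all.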