Module cstree::getting_started

Expand description

Getting Started

If you’re looking at cstree, you’re probably looking at or already writing a parser and are considering using concrete syntax trees as its output. We’ll talk more about parsing below – first, let’s have a look at what needs to happen to go from input text to a cstree syntax tree:

Define an enumeration of the types of tokens (like keywords) and nodes (like “an expression”) that you want to have in your syntax and implement Syntax
Create a GreenNodeBuilder and call start_node, token and finish_node from your parser
Call SyntaxNode::new_root or SyntaxNode::new_root_with_resolver with the resulting GreenNode to obtain a syntax tree that you can traverse

Let’s walk through the motions of parsing a (very) simple language into cstree syntax trees. We’ll just support addition and subtraction on integers, from which the user is allowed to construct a single, compound expression. They will, however, be allowed to write nested expressions in parentheses, like 1 - (2 + 5).

Defining the language

First, we need to list the different part of our language’s grammar. We can do that using an enum with a unit variant for any terminal and non-terminal. The enum needs to be convertible to a u32, so we use the repr attribute to ensure it uses the correct representation.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u32)]
enum SyntaxKind {
    /* Tokens */
    Int,    // 42
    Plus,   // +
    Minus,  // -
    LParen, // (
    RParen, // )
    /* Nodes */
    Expr,
    Root,
}

For convenience when we’re working with generic cstree types like SyntaxNode, we’ll also give a name to our syntax as a whole and add a type alias for it. That way, we can match against SyntaxKinds using the original name, but use the more informative Node<Calculator> to instantiate cstree’s types.

type Calculator = SyntaxKind;

Most of these are tokens to lex the input string into, like numbers (Int) and operators (Plus, Minus). We only really need one type of node; expressions. Our syntax tree’s root node will have the special kind Root, all other nodes will be expressions containing a sequence of arithmetic operations potentially involving further, nested expression nodes.

To use our SyntaxKinds with cstree, we need to tell it how to convert it back to just a number (the #[repr(u32)] that we added) by implementing the Syntax trait. We can also tell cstree about tokens that always have the same text through the static_text method on the trait. This is useful for the operators and parentheses, but not possible for numbers, since an integer token may be produced from the input 3, but also from other numbers like 7 or 12.


impl Syntax for Calculator {
    fn from_raw(raw: RawSyntaxKind) -> Self {
        // This just needs to be the inverse of `into_raw`, but could also
        // be an `impl TryFrom<u32> for SyntaxKind` or any other conversion.
        match raw.0 {
            0 => SyntaxKind::Int,
            1 => SyntaxKind::Plus,
            2 => SyntaxKind::Minus,
            3 => SyntaxKind::LParen,
            4 => SyntaxKind::RParen,
            5 => SyntaxKind::Expr,
            6 => SyntaxKind::Root,
            n => panic!("Unknown raw syntax kind: {n}"),
        }
    }

    fn into_raw(self) -> RawSyntaxKind {
        RawSyntaxKind(self as u32)
    }

    fn static_text(self) -> Option<&'static str> {
        match self {
            SyntaxKind::Plus => Some("+"),
            SyntaxKind::Minus => Some("-"),
            SyntaxKind::LParen => Some("("),
            SyntaxKind::RParen => Some(")"),
            _ => None,
        }
    }
}

Deriving `Syntax`

To save yourself the hassle of defining this conversion (and, perhaps more importantly, continually updating it while your language’s syntax is in flux), cstree includes a derive macro for Syntax when built with the derive feature. With the macro, the Syntax trait implementation above can be replaced by simply adding #[derive(Syntax)] to SyntaxKind.

Parsing into a green tree

With that out of the way, we can start writing the parser for our expressions. For the purposes of this introduction to cstree, I’ll assume that there is a lexer that yields the following tokens:

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Token<'input> {
    // Note that number strings are not yet parsed into actual numbers,
    // we just remember the slice of the input that contains their digits
    Int(&'input str),
    Plus,
    Minus,
    LParen,
    RParen,
    // A special token that indicates that we have reached the end of the file
    EoF,
}

A simple lexer that yields such tokens is part of the full readme example, but we’ll be busy enough with the combination of cstree and the actual parser, which we define like this:


pub struct Parser<'input> {
    // `Peekable` is a standard library iterator adapter that allows
    // looking ahead at the next item without removing it from the iterator yet
    lexer:   Peekable<Lexer<'input>>,
    builder: GreenNodeBuilder<'static, 'static, Calculator>,
}

impl<'input> Parser<'input> {
    pub fn new(input: &'input str) -> Self {
        Self {
            // we get `peekable` from implementing `Iterator` on `Lexer`
            lexer:   Lexer::new(input).peekable(),
            builder: GreenNodeBuilder::new(),
        }
    }

    pub fn bump(&mut self) -> Option<Token<'input>> {
        self.lexer.next()
    }
}

In contrast to parsers that return abstract syntax trees, with cstree the syntax tree nodes for all element in the language grammar will have the same type: GreenNode for the inner (“green”) tree and SyntaxNode for the outer (“red”) tree. Different kinds of nodes (and tokens) are differentiated by their SyntaxKind tag, which we defined above.

You can implement many types of parsers with cstree. To get a feel for how it works, consider a typical recursive descent parser. With a more traditional AST, one would define different AST structs for struct or function definitions, statements, expressions and so on. Inside the parser, the components of any element, such as all fields of a struct or all statements inside a function, are parsed first and then the parser wraps them in the matching AST type, which is returned from the corresponding parser function.

Because cstree’s syntax trees are untyped, there is no explicit AST representation that the parser would build. Instead, parsing into a CST using the GreenNodeBuilder follows the source code more closely in that you tell cstree about each new element you enter and all tokens that the parser consumes. So, for example, to parse a struct definition the parser first “enters” the struct definition node, then parses the struct keyword and type name, then parses each field, and finally “finishes” parsing the struct node.

The most trivial example is the root node for our parser, which just creates a root node containing the whole expression (we could do without a specific root node if any expression was a node, in particular if we wrapped integer literal tokens inside Expr nodes).

pub fn parse(&mut self) -> Result<(), String> {
    self.builder.start_node(SyntaxKind::Root);
    self.parse_expr()?;
    self.builder.finish_node();
    Ok(())
}

As there isn’t a static AST type to return, the parser is very flexible as to what is part of a node. In the previous example, if the user is adding a new field to the struct and has not yet typed the field’s type, the CST node for the struct doesn’t care if there is no child node for it. Similarly, if the user is deleting fields and the source code currently contains a leftover field name, this additional identifier can be a part of the struct node without any modifications to the syntax tree definition. This property is the key to why CSTs are such a good fit as a lossless input representation, which necessitates the syntax tree to mirror the user-specific layout of whitespace and comments around the AST items.

In the parser for our simple expression language, we’ll also have to deal with the fact that, when we see a number the parser doesn’t yet know whether there will be additional operations following that number. That is, in the expression 1 + 2, it can only know that it is parsing a binary operation once it sees the +. The event-like model of building trees in cstree, however, implies that when reaching the +, the parser would have to have already entered an expression node in order for the whole input to be part of the expression.

To get around this, GreenNodeBuilder provides the checkpoint method, which we can call to “remember” the current position in the input. For example, we can create a checkpoint before the parser parses the first 1. Later, when it sees the following +, it can create an Expr node for the whole expression using start_node_at:

fn parse_lhs(&mut self) -> Result<(), String> {
    // An expression may start either with a number, or with an opening parenthesis that is
    // the start of a parenthesized expression
    let next_token = *self.lexer.peek().unwrap();
    match next_token {
        Token::Int(n) => {
            self.bump();
            self.builder.token(SyntaxKind::Int, n);
        }
        Token::LParen => {
            // Wrap the grouped expression inside a node containing it and its parentheses
            self.builder.start_node(SyntaxKind::Expr);
            self.bump();
            self.builder.static_token(SyntaxKind::LParen);
            self.parse_expr()?; // Inner expression
            if self.bump() != Some(Token::RParen) {
                return Err("Missing ')'".to_string());
            }
            self.builder.static_token(SyntaxKind::RParen);
            self.builder.finish_node();
        }
        Token::EoF => return Err("Unexpected end of file: expected expression".to_string()),
        t => return Err(format!("Unexpected start of expression: '{t:?}'")),
    }
    Ok(())
}

fn parse_expr(&mut self) -> Result<(), String> {
    // Remember our current position
    let before_expr = self.builder.checkpoint();

    // Parse the start of the expression
    self.parse_lhs()?;

    // Check if the expression continues with `+ <more>` or `- <more>`
    let Some(next_token) = self.lexer.peek() else {
        return Ok(());
    };
    let op = match *next_token {
        Token::Plus => SyntaxKind::Plus,
        Token::Minus => SyntaxKind::Minus,
        Token::RParen | Token::EoF => return Ok(()),
        t => return Err(format!("Expected operator, found '{t:?}'")),
    };

    // If so, retroactively wrap the (already parsed) LHS and the following RHS
    // inside an `Expr` node
    self.builder.start_node_at(before_expr, SyntaxKind::Expr);
    self.bump();
    self.builder.static_token(op);
    self.parse_expr()?; // RHS
    self.builder.finish_node();
    Ok(())
}

Obtaining the parser result

Our parser is now capable of parsing our little arithmetic language, but it’s methods don’t return anything. So how do we get our syntax tree out? The answer lies in GreenNodeBuilder::finish, which finally returns the tree that we have painstakingly constructed.

impl Parser<'_> {
    pub fn finish(mut self) -> (GreenNode, impl Interner) {
        assert!(self.lexer.next().map(|t| t == Token::EoF).unwrap_or(true));
        let (tree, cache) = self.builder.finish();
        (tree, cache.unwrap().into_interner().unwrap())
    }
}

finish also returns the cache it used to deduplicate tree nodes and tokens, so you can re-use it for parsing related inputs (e.g., different source files from the same crate may share a lot of common function and type names that can be deduplicated). See GreenNodeBuilder’s documentation for more information on this, in particular the with_cache and from_cache methods. Most importantly for us, we can extract the Interner that contains the source text of the tree’s tokens from the cache, which we need if we want to look up things like variable names or the value of numbers for our calculator.

To work with the syntax tree, you’ll want to upgrade it to a SyntaxNode using SyntaxNode::new_root. You can also use SyntaxNode::new_root_with_resolver to combine tree and interner, which lets you directly retrieve source text and makes the nodes implement Display and Debug. The same output can be produced from SyntaxNodes by calling the debug or display method with a Resolver. To visualize the whole syntax tree, pass true for the recursive parameter on debug, or simply debug-print a ResolvedNode:

let input = "11 + 2-(5 + 4)";
let mut parser = Parser::new(input);
parser.parse().unwrap();
let (tree, interner) = parser.finish();
let root = SyntaxNode::<Calculator>::new_root_with_resolver(tree, interner);
dbg!(root);

Further examples

The parser we just built is available in full in the runnable readme example, which includes some additional code to read expressions from the terminal and evaluate the parsed expressions - have it do a few calculations if you like. There are several more examples in the examples/ folder in the repository. A good starting point is the s_expressions example, which implements a parser for a small S-Expression language with guiding comments.

AST Layer

While cstree is built for concrete syntax trees, applications are quite easily able to work with either a CST or an AST representation, or freely switch between them. To do so, use cstree to build syntax and underlying green tree and provide AST wrappers for your different kinds of nodes. An example of how this is done can be seen here and here (note that the latter file is automatically generated by a task using ungrammar).