Crate cstree

source ·
Expand description

cstree is a generic library for creating and working with concrete syntax trees (CSTs).

“Traditional” abstract syntax trees (ASTs) usually contain different types of nodes which represent different syntactical elements of the source text of a document and reduce its information to the minimal amount necessary to correctly interpret it. In contrast, CSTs are lossless representations of the entire input where all tree nodes are represented homogeneously (i.e., the nodes are untyped), but are tagged with a RawSyntaxKind to determine the kind of grammatical element they represent.

One big advantage of this representation is that it cannot only recreate the original source exactly, but also lends itself very well to the representation of incomplete or erroneous trees and is thus highly suited for usage in contexts such as IDEs or any other application where a user is editing the source text.

The concept of and the data structures for cstree’s syntax trees are inspired in part by Swift’s libsyntax. Trees consist of two layers: the inner tree (called green tree) contains the actual source text as position independent green nodes. Tokens and nodes that appear identically at multiple places in the source are deduplicated in this representation in order to store the tree efficiently. This means that a green tree may not actually structurally be a tree. To remedy this, the real syntax tree is constructed on top of the green tree as a secondary tree (called the red tree), which models the exact source structure. As a possible third layer, a strongly typed AST can be built on top of the red tree.

The cstree implementation is a fork of the excellent rowan, developed by the authors of rust-analyzer who wrote up a conceptual overview of their implementation in their repository. Notable differences of cstree compared to rowan:

  • Syntax trees (red trees) are created lazily, but are persistent. Once a red node has been created by cstree, it will remain allocated. In contrast, rowan re-creates the red layer on the fly on each traversal of the tree. Apart from the trade-off discussed here, this helps to achieve good tree traversal speed while helping to provide the following:
  • Syntax (red) nodes are Send and Sync, allowing to share realized trees across threads. This is achieved by atomically reference counting syntax trees as a whole, which also gets rid of the need to reference count individual nodes.
  • SyntaxNodes can hold custom data.
  • cstree trees are trees over interned strings. This means cstree will deduplicate the text of tokens with the same source string, such as identifiers with the same name. In this position, rowan stores each token’s text together with its metadata as a custom DST (dynamically-sized type).
  • cstree includes some performance optimizations for tree creation: it only allocates space for new nodes on the heap if they are not in cache and avoids recursively hashing subtrees by pre-hashing them.
  • cstree includes some performance optimizations for tree traversal: persisting red nodes allows tree traversal methods to return references instead of cloning nodes, which involves updating the tree’s reference count. You can still clone the reference to obtain an owned node, but you only pay that cost when you need to.
  • The downside of offering thread safe syntax trees is that cstree cannot offer any mutability API for its CSTs. Trees can still be updated into new trees through replacing nodes, but cannot be mutated in place.

Getting Started

If you’re looking at cstree, you’re probably looking at or already writing a parser and are considering using concrete syntax trees as its output. We’ll talk more about parsing below – first, let’s have a look at what needs to happen to go from input text to a cstree syntax tree:

  1. Define an enumeration of the types of tokens (like keywords) and nodes (like “an expression”) that you want to have in your syntax and implement Syntax

  2. Create a GreenNodeBuilder and call start_node, token and finish_node from your parser

  3. Call SyntaxNode::new_root or SyntaxNode::new_root_with_resolver with the resulting GreenNode to obtain a syntax tree that you can traverse

There’s a full getting started guide that walks through each of the above steps in detail in the documentation for the getting_started module. The walkthrough goes through the necessary steps bit by bit and skips the lexer, but the full code plus a runnable interpreter for the small walkthrough language are available in the readme example. Additional examples can be found in the examples/ folder in the repository. A good starting point is the s_expressions example, which implements a parser for a small S-Expression language with guiding comments.

License

cstree is primarily distributed under the terms of both the MIT license and the Apache License (Version 2.0).

See LICENSE-APACHE and LICENSE-MIT for details.

Modules

  • A tree builder for the construction of syntax trees.
  • Getting Started
  • Implementation of the inner, “green” tree. The GreenNodeBuilder from the build module is the main entry point to constructing GreenNodes and GreenTokens.
  • Types and Traits for efficient String storage and deduplication.
  • A convenient collection of the most used parts of cstree.
  • Synchronization primitives.
  • Implementation of the outer, “red” tree.
  • Typesafe representations of text ranges and sizes.
  • Types for syntax tree traversal / moving through trees.
  • Utility types. It shouldn’t be needed to reference these directly, but they are returned in several places in cstree and may come in handy.

Structs

Traits

  • A type that represents what items in your language can be. Typically, this is an enum with variants such as Identifier, Literal, …

Derive Macros

  • Derive macro available if cstree is built with features = ["derive"].