§Auto LSP Codegen
To generate an AST, simply provide a Tree-sitter node-types.json and LanguageFn of any language to the generate function of the auto_lsp_codegen crate.
cargo add auto_lsp_codegen
Although auto_lsp_codegen is a standalone crate, the generated code depends on the main auto_lsp crate.
§Usage
The auto_lsp_codegen crate exposes a single generate function, which takes:
- A node-types.json,
- A LanguageFn,
- An optional HashMap<&str, &str> to rename tokens (see Custom Tokens).
It returns a TokenStream.
How you choose to use the TokenStream is up to you.
The most common setup is to call it from a build.rs script and write the generated code to a Rust file.
Note, however, that the output can be quite large: Python’s AST, for example, results in ~11,000 lines of code.
use auto_lsp_codegen::generate;
use std::{fs, path::PathBuf};

fn main() {
    // Skip generation unless AST_GEN is set to a value other than "0".
    if std::env::var("AST_GEN").unwrap_or("0".to_string()) == "0" {
        return;
    }

    let output_path = PathBuf::from("./src/generated.rs");

    // Render the TokenStream to a string and write it to the output file.
    fs::write(
        output_path,
        generate(
            tree_sitter_python::NODE_TYPES,
            &tree_sitter_python::LANGUAGE.into(),
            None,
        )
        .to_string(),
    )
    .unwrap();
}
You can also invoke it from your own CLI or tool if needed.
§How Codegen Works
The generated code structure depends on the Tree-sitter grammar.
§Structs for Rules
Each rule in node-types.json becomes a dedicated Rust struct. For example, given the rule:
function_definition: $ => seq(
  optional('async'),
  'def',
  field('name', $.identifier),
  field('type_parameters', optional($.type_parameter)),
  field('parameters', $.parameters),
  optional(
    seq(
      '->',
      field('return_type', $.type),
    ),
  ),
  ':',
  field('body', $._suite),
),
The generated struct would look like this:
#[derive(Debug, Clone, PartialEq)]
pub struct FunctionDefinition {
    pub name: std::sync::Arc<Identifier>,
    pub body: std::sync::Arc<Block>,
    pub type_parameters: Option<std::sync::Arc<TypeParameter>>,
    pub parameters: std::sync::Arc<Parameters>,
    pub return_type: Option<std::sync::Arc<Type>>,
    /* ... */
}
§Field Matching
To match fields, codegen uses the field_id() method from the Tree-sitter cursor.
From the above example, the generated builder might look like this:
builder.builder(db, &node, Some(id), |b| {
    b.on_field_id::<Identifier, 19u16>(&mut name)?
        .on_field_id::<Block, 6u16>(&mut body)?
        .on_field_id::<TypeParameter, 31u16>(&mut type_parameters)?
        .on_field_id::<Parameters, 23u16>(&mut parameters)?
        .on_field_id::<Type, 24u16>(&mut return_type)
});
Each u16 represents the unique field ID assigned by the Tree-sitter language parser.
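The dispatch-by-field-ID idea can be sketched without Tree-sitter. In the sketch below, `RawChild`, `FunctionFields`, and `collect_fields` are hypothetical names invented for illustration, and the IDs are simply copied from the builder example above; the real auto_lsp builder works on typed AST nodes and its API is not reproduced here.

```rust
/// Hypothetical stand-in for a child visited by a Tree-sitter cursor:
/// the numeric field ID it was found under (if any), plus its source text.
#[derive(Debug, Clone, PartialEq)]
pub struct RawChild {
    pub field_id: Option<u16>,
    pub text: String,
}

/// Hypothetical, simplified result. The real generated struct stores
/// typed AST nodes rather than strings.
#[derive(Debug, Default, PartialEq)]
pub struct FunctionFields {
    pub name: Option<String>,
    pub parameters: Option<String>,
    pub body: Option<String>,
}

// Field IDs copied from the builder example above. They belong to one
// particular parser build and are used here for illustration only.
const FIELD_NAME: u16 = 19;
const FIELD_PARAMETERS: u16 = 23;
const FIELD_BODY: u16 = 6;

/// Routes each child to the matching slot by comparing numeric field IDs.
pub fn collect_fields(children: &[RawChild]) -> FunctionFields {
    let mut out = FunctionFields::default();
    for child in children {
        match child.field_id {
            Some(FIELD_NAME) => out.name = Some(child.text.clone()),
            Some(FIELD_PARAMETERS) => out.parameters = Some(child.text.clone()),
            Some(FIELD_BODY) => out.body = Some(child.text.clone()),
            // Children without a field, or with an unknown field, are skipped.
            _ => {}
        }
    }
    out
}
```

Matching on integer IDs keeps the per-child check to a single integer comparison, which is presumably why the generated builder embeds the IDs as constants rather than field names.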
§Handling Children
If a node has no named fields, a children enum is generated to represent all possible variants.
- If the children are unnamed, a generic “Operator_” enum is generated.
- If the children are named, the enum will be a concatenation of all possible child node types with underscores, using sanitized Rust-friendly names.
For example, given the rule:
_statement: $ => choice(
  $._simple_statement,
  $._compound_statement,
),
The generated enum would look like this:
pub enum SimpleStatement_CompoundStatement {
    SimpleStatement(SimpleStatement),
    CompoundStatement(CompoundStatement),
}
If the generated enum name becomes too long, consider using a Tree-sitter supertype to group nodes together.
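The naming scheme for children enums can be approximated as follows. This is a minimal sketch, not the crate's implementation: `sanitize` and `children_enum_name` are hypothetical helpers, and the real sanitization may handle more cases (keywords, digits, symbols).

```rust
/// Converts a Tree-sitter node type such as `_simple_statement` into a
/// PascalCase Rust identifier such as `SimpleStatement`.
/// Hypothetical helper; the crate's own sanitization may differ.
pub fn sanitize(node_type: &str) -> String {
    node_type
        .split('_')
        // Leading or doubled underscores produce empty parts; drop them.
        .filter(|part| !part.is_empty())
        .map(|part| {
            let mut chars = part.chars();
            match chars.next() {
                Some(first) => first.to_uppercase().collect::<String>() + chars.as_str(),
                None => String::new(),
            }
        })
        .collect()
}

/// Joins the sanitized names of all possible children with underscores.
pub fn children_enum_name(child_types: &[&str]) -> String {
    child_types
        .iter()
        .map(|t| sanitize(t))
        .collect::<Vec<_>>()
        .join("_")
}
```

With two children the name stays readable; with ten it does not, which is where grouping nodes under a supertype helps.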
The kind_id() method is used to determine child kinds during traversal.
The AstNode::contains method relies on this to check whether a node kind belongs to a specific struct or enum variant.
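A kind-based membership check might look roughly like the sketch below. The `ContainsKind` trait, the kind ID constants, and the empty structs are all hypothetical; the actual `AstNode::contains` signature may differ, and real kind IDs come from the Tree-sitter parser.

```rust
// Hypothetical kind IDs; real values are assigned by the Tree-sitter parser.
const KIND_SIMPLE_STATEMENT: u16 = 100;
const KIND_COMPOUND_STATEMENT: u16 = 101;

pub struct SimpleStatement;
pub struct CompoundStatement;

#[allow(non_camel_case_types)]
pub enum SimpleStatement_CompoundStatement {
    SimpleStatement(SimpleStatement),
    CompoundStatement(CompoundStatement),
}

/// Simplified, hypothetical membership check: does a node kind belong
/// to this type?
pub trait ContainsKind {
    fn contains(kind_id: u16) -> bool;
}

impl ContainsKind for SimpleStatement {
    fn contains(kind_id: u16) -> bool {
        kind_id == KIND_SIMPLE_STATEMENT
    }
}

impl ContainsKind for CompoundStatement {
    fn contains(kind_id: u16) -> bool {
        kind_id == KIND_COMPOUND_STATEMENT
    }
}

impl ContainsKind for SimpleStatement_CompoundStatement {
    // An enum contains a kind if any of its variants does.
    fn contains(kind_id: u16) -> bool {
        SimpleStatement::contains(kind_id) || CompoundStatement::contains(kind_id)
    }
}
```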
§Vec and Option Fields
repeat and repeat1 in the grammar will generate a Vec field.
optional(...) will generate an Option<T> field.
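As an illustration, a hypothetical rule such as `seq(repeat($.decorator), field('name', $.identifier), optional(field('alias', $.identifier)))` could map to a struct shaped like the one below. The rule, the type names, and the field names are invented for this sketch, and the `Arc` wrappers seen in the generated code earlier are left out for brevity.

```rust
/// Hypothetical leaf nodes, reduced to their source text.
#[derive(Debug, Clone, PartialEq)]
pub struct Identifier(pub String);

#[derive(Debug, Clone, PartialEq)]
pub struct Decorator(pub String);

/// Sketch of a struct for the hypothetical rule described above.
#[derive(Debug, Clone, PartialEq)]
pub struct Example {
    /// `repeat(...)` may match zero or more nodes, hence a `Vec`.
    pub decorators: Vec<Decorator>,
    /// A plain field matches exactly one node.
    pub name: Identifier,
    /// `optional(...)` may match nothing, hence an `Option`.
    pub alias: Option<Identifier>,
}
```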
§Token Naming
Unnamed tokens are mapped to Rust enums using a built-in token map. For instance:
{ "type": "+", "named": false },
{ "type": "+=", "named": false },
{ "type": ",", "named": false },
{ "type": "-", "named": false },
{ "type": "-=", "named": false },
Generates:
pub enum Token_Plus {}
pub enum Token_PlusEqual {}
pub enum Token_Comma {}
pub enum Token_Minus {}
pub enum Token_MinusEqual {}
Tokens with regular identifiers are converted to PascalCase.
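The lookup described above can be sketched as a map lookup with a PascalCase fallback. Everything here is an assumption for illustration: `sample_token_map` only mirrors the five entries shown above (the crate's built-in map is larger), and whether identifier-like tokens receive the same `Token_` prefix is a guess.

```rust
use std::collections::HashMap;

/// A few entries mirroring the examples above. The crate's real built-in
/// map is larger and is not reproduced here.
pub fn sample_token_map() -> HashMap<&'static str, &'static str> {
    HashMap::from([
        ("+", "Plus"),
        ("+=", "PlusEqual"),
        (",", "Comma"),
        ("-", "Minus"),
        ("-=", "MinusEqual"),
    ])
}

/// Converts an identifier-like token such as `async` to `Async`.
fn pascal_case(s: &str) -> String {
    s.split('_')
        .filter(|part| !part.is_empty())
        .map(|part| {
            let mut chars = part.chars();
            match chars.next() {
                Some(first) => first.to_uppercase().collect::<String>() + chars.as_str(),
                None => String::new(),
            }
        })
        .collect()
}

/// Returns an enum name for an unnamed token, or `None` when the token is
/// neither in the map nor identifier-like.
/// Assumption: identifier-like tokens also get the `Token_` prefix.
pub fn token_enum_name(token: &str, map: &HashMap<&str, &str>) -> Option<String> {
    if let Some(name) = map.get(token) {
        return Some(format!("Token_{}", name));
    }
    let is_identifier = !token.is_empty()
        && token.chars().all(|c| c.is_ascii_alphanumeric() || c == '_');
    if is_identifier {
        Some(format!("Token_{}", pascal_case(token)))
    } else {
        None
    }
}
```

A token that resolves to `None` in this sketch corresponds to the case where a custom mapping is needed.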
§Custom Tokens
If your grammar defines additional unnamed tokens not covered by the default map, you can provide a custom token mapping to generate appropriate Rust enum names.
use auto_lsp_codegen::generate;
use std::collections::HashMap;

let _result = generate(
    tree_sitter_python::NODE_TYPES,
    &tree_sitter_python::LANGUAGE.into(),
    // Custom token names, keyed by the token's literal text.
    Some(HashMap::from([
        ("+", "Plus"),
        ("+=", "PlusEqual"),
        (",", "Comma"),
        ("-", "Minus"),
        ("-=", "MinusEqual"),
    ])),
);
Tokens that are not in the default map will be added, and tokens that already exist in it will be overwritten.
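The add-or-overwrite behavior matches what `HashMap::extend` does, so the merge can be pictured as in the sketch below. `merge_token_maps` is a hypothetical helper, not a function exposed by the crate, and the crate may implement the merge differently.

```rust
use std::collections::HashMap;

/// Merges an optional custom token map into the defaults.
/// Hypothetical helper for illustration only.
pub fn merge_token_maps<'a>(
    defaults: HashMap<&'a str, &'a str>,
    custom: Option<HashMap<&'a str, &'a str>>,
) -> HashMap<&'a str, &'a str> {
    let mut merged = defaults;
    if let Some(custom) = custom {
        // `extend` inserts new keys and replaces the values of existing ones.
        merged.extend(custom);
    }
    merged
}
```

Passing `None` leaves the defaults untouched, which mirrors the `None` argument in the build script example.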
§Super Types
Tree-sitter supports supertypes, which allow grouping related nodes under a common type.
For example, in the Python grammar:
{
  "type": "_compound_statement",
  "named": true,
  "subtypes": [
    {
      "type": "class_definition",
      "named": true
    },
    {
      "type": "decorated_definition",
      "named": true
    },
    /* ... */
    {
      "type": "with_statement",
      "named": true
    }
  ]
},
This becomes a Rust enum:
pub enum CompoundStatement {
    ClassDefinition(ClassDefinition),
    DecoratedDefinition(DecoratedDefinition),
    /* ... */
    WithStatement(WithStatement),
}
Some supertypes might contain other supertypes, in which case the generated enum flattens the hierarchy.
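Flattening a nested supertype hierarchy can be sketched as a recursive expansion. This is an illustrative sketch under stated assumptions: `flatten_subtypes` is a hypothetical helper, the input map is a simplified stand-in for the `subtypes` arrays in node-types.json, and the hierarchy is assumed to be acyclic.

```rust
use std::collections::{BTreeSet, HashMap};

/// Recursively expands supertypes so the result contains only concrete
/// node types. Assumes the supertype hierarchy has no cycles.
pub fn flatten_subtypes(
    name: &str,
    supertypes: &HashMap<&str, Vec<&str>>,
) -> BTreeSet<String> {
    let mut out = BTreeSet::new();
    match supertypes.get(name) {
        // `name` is itself a supertype: expand each of its subtypes.
        Some(subtypes) => {
            for sub in subtypes {
                out.extend(flatten_subtypes(sub, supertypes));
            }
        }
        // `name` is a concrete node type: keep it as a variant.
        None => {
            out.insert(name.to_string());
        }
    }
    out
}
```

Each entry in the resulting set would correspond to one variant of the flattened enum.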
§Functions
- generate: Generates the Rust code for a given Tree-sitter grammar