Expand description
The famous Lemon Parser Generator, exposed as a Rust library
that builds an LALR(1) parser transparently during cargo build. Instead of writing a separate
grammar file and running a code generator, you describe the grammar directly in Rust by adding
annotation attributes to ordinary functions, structs and enums. The resulting parser produces
your own Rust types, so parsing yields a ready-to-use syntax tree (or computed value) with no
glue code in between.
This crate uses lemon-mint as the parser-generation backend. You can find more usage examples here.
§How it works
The annotation attributes are procedural macros. As cargo compiles your source file, the macros
accumulate the grammar rules they see, and the final #[derive(LemonTree)] (the start symbol)
triggers generation of a complete LALR(1) parser. The generated parser is emitted into a private
submodule and wired to your types through the LemonTree and LemonTreeNode traits. Because
everything happens at compile time, grammar conflicts and errors are reported as ordinary build
errors.
§Mapping to Lemon
If you are familiar with Lemon’s .y grammar files, this is the correspondence. Let’s say we want
to create a Lemon parser like this:
%token_type {f64}
%left PLUS
Unit ::= Expr(expr).
Expr ::= VALUE(value).
Expr ::= Expr(a) PLUS Expr(b).We want that Unit and Expr will be represented by the following types in rust:
struct Unit
{ expr: Expr,
}
struct Expr
{ value: f64,
}Every symbol except the start symbol we need to annotate with #[derive(LemonTreeNode)], and the start symbol with #[derive(LemonTree)].
Parser rules that return this symbol we put into #[lem()] annotation attributes.
All #[derive(LemonTreeNode)], #[derive(LemonTree)] and #[lem_fn()] attributes that describe single Lemon parser must be contained in single rust file,
and the #[derive(LemonTree)] must come the last.
#[derive(LemonTreeNode)]
#[lem("VALUE(value)")]
pub struct Expr
{ value: f64,
}
#[derive(LemonTree)]
#[lem("Expr(expr)")]
pub struct Unit
{ expr: Expr,
}The #[lem()] attribute can appear multiple times, and each attribute can contain multiple rules, like #[lem("A(value)", "B(value)")].
Each rule will produce code that creates new struct instance. Aliases given in parentheses will be assigned to struct fields.
If a struct has more fields than appear in expression, the remaining fields will be set to Default::default(), so they need to implement std::default::Default trait.
Existing fields will be assigned like this: Type {field: value.into()}. So field type in struct can be the type of value, or compatible with it.
In example above, there’s one Lemon rule, that doesn’t return the final result, but needs to perform some calculation.
We expect that rule Expr ::= Expr(a) PLUS Expr(b) will produce type Expr {value: a.value + b.value}.
We can implement this rule as rust function:
#[lem_fn("Expr(a) PLUS Expr(b)")]
pub fn expr_1(a: Expr, b: Expr) -> Expr
{ Expr {value: a.value + b.value}
}So #[lem_fn()] attribute creates parser rule, whose action is module-global function call.
The return type of such function will be the left-hand side symbol in Lemon rule, like Expr ::= Expr(a) PLUS Expr(b).
To specify Lemon parser directives, like %token_type {f64}, need to use #[lem_opt()] attributes near start symbol, like #[lem_opt(token_type="f64")].
Here is complete example:
use lemon_tree::{lem_fn, LemonTree, LemonTreeNode};
#[derive(LemonTreeNode, Debug)]
#[lem("VALUE(value)")]
pub struct Expr
{ value: f64,
}
#[lem_fn("Expr(a) PLUS Expr(b)")]
pub fn expr_1(a: Expr, b: Expr) -> Expr
{ Expr {value: a.value + b.value}
}
#[derive(LemonTree, Debug)]
#[lem("Expr(expr)")]
#[lem_opt(token_type="f64", left="PLUS")]
pub struct Unit
{ expr: Expr,
}
fn main()
{ let mut parser = Unit::get_parser(());
parser.add_token(<Unit as LemonTree>::Token::VALUE, 10.0).unwrap();
parser.add_token(<Unit as LemonTree>::Token::PLUS, 0.0).unwrap();
parser.add_token(<Unit as LemonTree>::Token::VALUE, 20.0).unwrap();
let result = parser.end().unwrap();
assert_eq!(result.expr.value, 30.0);
println!("Result: {:?}", result);
}Enums can be used as symbol types as well. With enums need to put #[lem()] parser rules near enum variants.
Example:
use lemon_tree::{lem_fn, LemonTree, LemonTreeNode};
#[derive(LemonTreeNode, Debug, PartialEq)]
pub enum Expr
{ #[lem("VALUE(0)")]
Value(f64),
#[lem("Expr(0) PLUS Expr(1)")]
Plus(Box<Expr>, Box<Expr>), // the generated action will look like: Expr::Plus(arg_0.into(), arg_1.into())
}
#[derive(LemonTree, Debug, PartialEq)]
#[lem("Expr(expr)")]
#[lem_opt(token_type="f64", left="PLUS")]
pub struct Unit
{ expr: Expr,
}
fn main()
{ let mut parser = Unit::get_parser(());
parser.add_token(<Unit as LemonTree>::Token::VALUE, 10.0).unwrap();
parser.add_token(<Unit as LemonTree>::Token::PLUS, 0.0).unwrap();
parser.add_token(<Unit as LemonTree>::Token::VALUE, 20.0).unwrap();
let result = parser.end().unwrap();
assert_eq!
( result,
Unit
{ expr: Expr::Plus
( Box::new(Expr::Value(10.0)),
Box::new(Expr::Value(20.0)),
)
}
);
println!("Result: {:?}", result);
}Notice, that in Expr::Plus action, Expr object magically converted to Box<Expr>, because Box<T> implements From<T>, so into() can be used to convert.
What if we want to do more complex conversion? Actually we can convert anything to anything, if we manually implement an Into<T> trait.
Example:
use lemon_tree::{lem_fn, LemonTree, LemonTreeNode};
#[derive(LemonTreeNode, Debug, PartialEq)]
pub enum Expr
{ #[lem("VALUE(0)")]
Value(f64),
#[lem("Expr(0) PLUS Expr(1)")]
Plus(String, String),
}
impl Into<String> for Expr
{ fn into(self) -> String
{ match self
{ Expr::Value(v) => format!("{}", v),
Expr::Plus(a, b) => format!("{} + {}", a, b),
}
}
}
#[derive(LemonTree, Debug, PartialEq)]
#[lem("Expr(expr)")]
#[lem_opt(token_type="f64", left="PLUS")]
pub struct Unit
{ expr: Expr,
}
fn main()
{ let mut parser = Unit::get_parser(());
parser.add_token(<Unit as LemonTree>::Token::VALUE, 10.0).unwrap();
parser.add_token(<Unit as LemonTree>::Token::PLUS, 0.0).unwrap();
parser.add_token(<Unit as LemonTree>::Token::VALUE, 20.0).unwrap();
let result = parser.end().unwrap();
assert_eq!
( result,
Unit
{ expr: Expr::Plus("10".to_string(), "20".to_string())
}
);
println!("Result: {:?}", result);
}§Defining grammar symbols
A grammar is made of terminal symbols (tokens, fed in by your tokenizer) and nonterminal
symbols (produced by reducing rules). By convention, names that contain a lowercase letter are
treated as nonterminals, and all-uppercase names like PLUS or VALUE are treated as tokens.
Each nonterminal is backed by a Rust type, and there are three ways to attach rules to it:
- A struct annotated with
#[derive(LemonTreeNode)](or#[derive(LemonTree)]for the start symbol). Rules are listed in#[lem("...")]attributes on the struct. Each rule constructs the struct; aliases in the rule name the fields to fill. - An enum annotated with
#[derive(LemonTreeNode)](or#[derive(LemonTree)]). Here the#[lem("...")]attributes are placed on the individual variants, and each rule constructs that variant. - A function annotated with
#[lem_fn("...")]. The function’s return type is the left-hand side nonterminal, and the function body is the action executed when the rule reduces. This is the most flexible form, useful when a reduction needs to compute something rather than just build a value.
All rules for one parser must live in a single Rust file, and the #[derive(LemonTree)] start
symbol must come last in that file (see Constraints).
§Rule syntax
The string inside #[lem(...)] / #[lem_fn(...)] is the right-hand side of a Lemon rule — a
sequence of terminal and nonterminal symbols separated by whitespace. The left-hand side is implied
by where the attribute is placed (the struct/enum/variant type, or the function return type).
A single attribute may contain several alternative rules, and the attribute may be repeated:
#[lem("A(value)", "B(value)")] // two rules, both producing this symbol
#[lem("C(value)")] // another rule§Aliases
A symbol in the rule can be followed by an alias in parentheses, which binds the symbol’s value so the action can use it:
- For structs, the alias is a field name:
"VALUE(value)"assigns the matched value to thevaluefield. - For enum variants and
#[lem_fn]functions, the alias may be a function argument / tuple-field name, or a zero-based positional index:"Expr(0) PLUS Expr(1)"binds the first and secondExprto tuple fields0and1.
§Value conversion
Bound values are moved into the target field via .into(), so the field type only needs to
implement From for the matched value’s type (for example Box<T> implements From<T>, which is
why Expr can flow into a Box<Expr> field automatically). For arbitrary conversions you can
implement Into<TargetType> yourself. Fields and arguments that are not bound by an alias are
filled with Default::default(), so those types must implement Default.
§Optional symbols [...]
Square brackets mark an optional part of a rule and expand into multiple alternatives. For example
"Exprs(exprs) [SEMICOLON]" is shorthand for the two rules "Exprs(exprs)" and
"Exprs(exprs) SEMICOLON". Brackets may be nested.
§Parser options — #[lem_opt(...)]
Lemon directives are set with #[lem_opt(...)] attributes placed next to the start symbol
(#[derive(LemonTree)]). Each option takes a string value:
| Option | Lemon directive | Meaning |
|---|---|---|
token_type | %token_type | The Rust type carried by every token’s value (the minor value). |
extra_argument | %extra_argument | Type of a user value made available to all actions (see below). |
left | %left | Declare tokens left-associative (precedence increases with each line). |
right | %right | Declare tokens right-associative. |
nonassoc | %nonassoc | Declare tokens non-associative. |
fallback | %fallback | "FALLBACK_TOK TOK_A TOK_B ..." — fall back the listed tokens. |
trace | %trace | Print a parser trace to stderr, prefixed with the given prompt string. |
Associativity / precedence options may be repeated; rules declared earlier have lower precedence than those declared later, exactly as in Lemon. Example:
#[lem_opt(token_type="f64", left="PLUS MINUS", left="DIVIDE TIMES", trace=">>")]§The generated parser API
Deriving LemonTree on the start symbol S generates:
S::get_parser(extra)— create a parser.extrais the%extra_argumentvalue (use()when noextra_argumentis set). Its type is<S as LemonTree>::Parser.<S as LemonTree>::Token— an enum with one variant per terminal symbol used anywhere in the grammar.parser.add_token(token, value)— feed one token.valuehas thetoken_typetype. ReturnsResult<(), ()>; anErrmeans a syntax error at that token.parser.try_add_token(token, value)— likeadd_token, but returnsResult<bool, ()>whereOk(false)indicates the token was not accepted in the current state instead of erroring.parser.end()— signal end of input. ReturnsResult<S, ()>: the constructed start symbol on success, orErr(())on a syntax error.parser.extra— public field holding theextra_argumentvalue, readable and writable between tokens.
You drive the parser from your own tokenizer: repeatedly call add_token, then call end to get
the result. See the README for a complete
calculator with a hand-written tokenizer.
§The extra_argument
Setting #[lem_opt(extra_argument="MyType")] gives every action access to a shared value. In
#[lem_fn] functions, add a final argument literally named extra to receive it:
#[lem_fn("Expr(a) PLUS Expr(b)")]
pub fn expr_plus(a: Expr, b: Expr, extra: &mut Context) -> Expr
{ extra.count += 1;
Expr {value: a.value + b.value}
}The value is supplied when constructing the parser via get_parser(extra) and is also accessible
as parser.extra.
§Constraints
- All attributes describing a single parser must live in one Rust file.
#[derive(LemonTree)](the start symbol) must be the last parser attribute in that file; it is what triggers parser generation.- A file may define only one parser (one
#[derive(LemonTree)]). - Unions are not supported as symbol types.
Different parsers can coexist in the same crate as long as each lives in its own file.
§Cargo features
These features (forwarded to lemon-tree-derive) help with debugging the grammar at build time:
dump-grammar— print the generated grammar, with Rust actions, to stderr during the build.dump-lemon-grammar— print the grammar in classic Lemon.ysyntax to stderr.debug-parser-to-file— write the generated parser source to a file next to your source instead of inlining it, which makes the generated code easy to inspect.
Traits§
- Lemon
Tree - Implemented for the parser’s start symbol — the type that a successful parse produces.
- Lemon
Tree Node - Marker trait implemented by every nonterminal symbol other than the start symbol.
Attribute Macros§
- lem_fn
- Makes module-global public function an action for specified Lemon parser expression.
Derive Macros§
- Lemon
Tree - Makes enum/struct a start symbol of current parser. Must appear the last parser attribute in file.
- Lemon
Tree Node - Makes enum/struct a regular nonterminal symbol. Must appear before
#[derive(LemonTree)]and#[lem_fn()]in the same rust file.