lemon-tree 0.2.2

Famous Lemon Parser Generator, designed as library that builds your parser transparently during cargo build. Use attributes to describe parser rules.
Documentation

lemon-tree

crates.io docs.rs

The famous Lemon Parser Generator, exposed as a Rust library that builds an LALR(1) parser transparently during cargo build.

Instead of maintaining a separate grammar file and running a code-generation step, you describe the grammar directly in Rust by adding annotation attributes to ordinary functions, structs and enums. The generated parser produces your own Rust types, so a successful parse yields a ready-to-use syntax tree (or a computed value) with no glue code in between.

This crate uses lemon-mint as the parser-generation backend.

How it works

The annotation attributes are procedural macros. As cargo compiles your source file, the macros accumulate the grammar rules they encounter, and the final #[derive(LemonTree)] (the start symbol) triggers generation of a complete LALR(1) parser. The parser is emitted into a private submodule and connected to your types through the LemonTree and LemonTreeNode traits. Because everything happens at compile time, grammar conflicts and errors surface as ordinary build errors.

Installation

Add this to your project's Cargo.toml:

[dependencies]
lemon-tree = "0.2"

The crate exports three items: lem_fn, LemonTree and LemonTreeNode.

Core concepts

A grammar is made of terminal symbols (tokens, supplied by your tokenizer) and nonterminal symbols (produced by reducing rules). By convention, names containing a lowercase letter are nonterminals, and all-uppercase names like PLUS or NUM are tokens.

Each nonterminal is backed by a Rust type. There are three ways to attach rules to it:

  • A struct annotated with #[derive(LemonTreeNode)] (or #[derive(LemonTree)] for the start symbol). Rules are listed in #[lem("...")] attributes on the struct, and each rule constructs the struct.
  • An enum annotated with #[derive(LemonTreeNode)] (or #[derive(LemonTree)]). The #[lem("...")] attributes are placed on the individual variants.
  • A function annotated with #[lem_fn("...")]. The function's return type is the left-hand-side nonterminal, and the body is the action run when the rule reduces.

The start symbol is marked with #[derive(LemonTree)] and represents the final result of a parse. Parser-wide options are set with #[lem_opt(...)] attributes placed next to it.

Rule syntax

The string inside #[lem(...)] / #[lem_fn(...)] is the right-hand side of a Lemon rule — a whitespace-separated sequence of symbols. The left-hand side is implied by where the attribute sits (the struct/enum/variant type, or the function return type).

  • Multiple rules. An attribute may list several alternatives, and may be repeated: #[lem("A(value)", "B(value)")].
  • Aliases. A symbol can be followed by an alias in parentheses to bind its value. For structs, the alias is a field name ("VALUE(value)"). For enum variants and #[lem_fn] functions it is an argument/tuple-field name or a zero-based index ("Expr(0) PLUS Expr(1)").
  • Value conversion. Bound values are moved into their target via .into(), so the target type only needs From for the matched type (this is why an Expr flows into a Box<Expr> field automatically). Implement Into<TargetType> yourself for arbitrary conversions. Unbound fields/arguments are filled with Default::default().
  • Optional symbols. Square brackets mark an optional part and expand into alternatives: "Exprs(exprs) [SEMICOLON]" means both "Exprs(exprs)" and "Exprs(exprs) SEMICOLON". Brackets may be nested.

Parser options — #[lem_opt(...)]

Option Lemon directive Meaning
token_type %token_type The Rust type carried by every token's value.
extra_argument %extra_argument Type of a user value made available to every action.
left %left Declare tokens left-associative (precedence grows with each line).
right %right Declare tokens right-associative.
nonassoc %nonassoc Declare tokens non-associative.
fallback %fallback "FALLBACK_TOK TOK_A TOK_B ..." — fall back the listed tokens.
trace %trace Print a parser trace to stderr, prefixed with the given prompt.

Associativity options may be repeated; rules declared earlier have lower precedence than later ones, exactly as in Lemon.

The generated parser API

Deriving LemonTree on a start symbol S generates:

  • S::get_parser(extra) — create a parser (extra is the %extra_argument; use () when none is set).
  • <S as LemonTree>::Token — an enum with one variant per terminal symbol.
  • parser.add_token(token, value) — feed one token; returns Result<(), ()> (Err on syntax error).
  • parser.try_add_token(token, value) — like add_token, but returns Ok(false) when the token is not accepted instead of erroring.
  • parser.end() — signal end of input; returns Result<S, ()>.
  • parser.extra — public field holding the extra_argument value.

Example: a calculator

This first example builds a calculator that evaluates expressions to f64. Create an empty cargo project and put the following in its main.rs:

use lemon_tree::{lem_fn, LemonTree};

type Expr = f64; // in lemon this corresponds to: %type Expr {f64}
type Exprs = Vec<f64>; // %type Exprs {Vec<f64>}

#[lem_fn("NUM(value)")] pub fn expr_1(value: f64) -> Expr {value} // Expr ::= NUM(value). value
#[lem_fn("MINUS Expr(a)")] pub fn expr_2(a: Expr) -> Expr {-a} // Expr ::= MINUS Expr(a). -a
#[lem_fn("PLUS Expr(a)")] pub fn expr_3(a: Expr) -> Expr {a} // Expr ::= PLUS Expr(a). a
#[lem_fn("Expr(a) PLUS Expr(b)")] pub fn expr_4(a: Expr, b: Expr) -> Expr {a + b} // Expr ::= Expr(a) PLUS Expr(b). a + b
#[lem_fn("Expr(a) MINUS Expr(b)")] pub fn expr_5(a: Expr, b: Expr) -> Expr {a - b} // Expr ::= Expr(a) MINUS Expr(b). a - b
#[lem_fn("Expr(a) TIMES Expr(b)")] pub fn expr_6(a: Expr, b: Expr) -> Expr {a * b} // Expr ::= Expr(a) TIMES Expr(b). a * b
#[lem_fn("Expr(a) DIVIDE Expr(b)")] pub fn expr_7(a: Expr, b: Expr) -> Expr {a / b} // Expr ::= Expr(a) DIVIDE Expr(b). a / b
#[lem_fn("PAR_OPEN Expr(a) PAR_CLOSE")] pub fn expr_8(a: Expr) -> Expr {a} // Expr ::= PAR_OPEN Expr(a) PAR_CLOSE. a
#[lem_fn("Expr(item)")] pub fn exprs_1(item: Expr) -> Exprs {vec![item]} // Exprs ::= Expr(item). vec![item]
#[lem_fn("Exprs(items) SEMICOLON Expr(item)")] pub fn exprs_2(mut items: Exprs, item: Expr) -> Exprs {items.push(item); items} // Exprs ::= Exprs(items) SEMICOLON Expr(item). items.push(item); items

// The start symbol is marked with #[derive(LemonTree)]
#[derive(LemonTree)]
#[lem_opt(token_type="f64", left="PLUS MINUS", left="DIVIDE TIMES", trace=">>")]
#[lem("Exprs(exprs) [SEMICOLON]")]
pub struct Program
{	exprs: Exprs,
}
// The above code generates the following lemon directives:
// %start_symbol {Program}
// %token_type {f64}
// %left PLUS MINUS.
// %left DIVIDE TIMES.
// Program ::= Exprs(exprs). Program {exprs}
// Program ::= Exprs(exprs) SEMICOLON. Program {exprs}

// This is the tokenizer function. It takes "input", and feeds tokens to the "parser".
fn parse(parser: &mut <Program as LemonTree>::Parser, mut input: &str) -> Exprs
{	loop
	{	input = input.trim_start();
		match input.bytes().next()
		{	Some(c) => match c
			{	b'+' => parser.add_token(<Program as LemonTree>::Token::PLUS, 0.0).unwrap(),
				b'-' => parser.add_token(<Program as LemonTree>::Token::MINUS, 0.0).unwrap(),
				b'*' => parser.add_token(<Program as LemonTree>::Token::TIMES, 0.0).unwrap(),
				b'/' => parser.add_token(<Program as LemonTree>::Token::DIVIDE, 0.0).unwrap(),
				b'(' => parser.add_token(<Program as LemonTree>::Token::PAR_OPEN, 0.0).unwrap(),
				b')' => parser.add_token(<Program as LemonTree>::Token::PAR_CLOSE, 0.0).unwrap(),
				b';' => parser.add_token(<Program as LemonTree>::Token::SEMICOLON, 0.0).unwrap(),
				b'0' ..= b'9' | b'.' =>
				{	let pos = input.bytes().position(|c| !c.is_ascii_digit() && c!=b'.').unwrap_or(input.len());
					let value = input[.. pos].parse().unwrap();
					parser.add_token(<Program as LemonTree>::Token::NUM, value).unwrap();
					input = &input[pos-1 ..];
				}
				_ => panic!("Invalid token")
			}
			None =>
			{	// End of input
				// parser.end() returns Result<Program>
				return parser.end().unwrap().exprs;
			}
		}
		input = &input[1 ..];
	}
}

fn main()
{	let mut parser = Program::get_parser(());

	assert_eq!(parse(&mut parser, "2 + 2 * 2; (2+2) * 2"), vec![6.0, 8.0]);
	assert_eq!(parse(&mut parser, "2 * 2 + 2; (2*2) + 2"), vec![6.0, 6.0]);
	assert_eq!(parse(&mut parser, "-1*30"), vec![-30.0]);
	assert_eq!(parse(&mut parser, "0--1;"), vec![1.0]);
	assert_eq!(parse(&mut parser, "(((0)))"), vec![0.0]);
	assert_eq!(parse(&mut parser, "0.123 + 10"), vec![10.123]);
	assert_eq!(parse(&mut parser, "0.123 / (1.0-1.0)"), vec![f64::INFINITY]);
}

Example: building a syntax tree

Rather than computing a value, you can produce an AST. Here Expr is an enum whose variants carry the rules, so the parser hands back a tree of Expr nodes. Note how each matched Expr is automatically wrapped in a Box<Expr> via .into().

use lemon_tree::{lem_fn, LemonTree, LemonTreeNode};

#[derive(LemonTreeNode, Debug, PartialEq)]
pub enum Expr
{	#[lem("NUM(0)")] Num(f64),
	#[lem("MINUS Expr(0)")] UnaryMinus(Box<Expr>),
	#[lem("PLUS Expr(0)")] UnaryPlus(Box<Expr>),
	#[lem("Expr(0) PLUS Expr(1)")] Plus(Box<Expr>, Box<Expr>),
	#[lem("Expr(0) MINUS Expr(1)")] Minus(Box<Expr>, Box<Expr>),
	#[lem("Expr(0) TIMES Expr(1)")] Times(Box<Expr>, Box<Expr>),
	#[lem("Expr(0) DIVIDE Expr(1)")] Divide(Box<Expr>, Box<Expr>),
}
#[lem_fn("PAR_OPEN Expr(0) PAR_CLOSE")] pub fn expr_from_par(a: Expr) -> Expr {a}

pub type Exprs = Vec<Expr>;
#[lem_fn("Expr(item)")] fn exprs_item(item: Expr) -> Exprs {vec![item]}
#[lem_fn("Exprs(items) SEMICOLON Expr(item)")] fn exprs_items(mut items: Exprs, item: Expr) -> Exprs {items.push(item); items}

#[derive(LemonTree, Debug)]
#[lem_opt(token_type="f64", left="PLUS MINUS", left="DIVIDE TIMES")]
#[lem("Exprs(exprs) [SEMICOLON]")]
pub struct Program
{	exprs: Vec<Expr>,
	flag: bool, // not bound by any rule, so filled with Default::default()
}

fn parse(parser: &mut <Program as LemonTree>::Parser, mut input: &str) -> Vec<Expr>
{	loop
	{	input = input.trim_start();
		match input.bytes().next()
		{	Some(c) => match c
			{	b'+' => parser.add_token(<Program as LemonTree>::Token::PLUS, 0.0).unwrap(),
				b'-' => parser.add_token(<Program as LemonTree>::Token::MINUS, 0.0).unwrap(),
				b'*' => parser.add_token(<Program as LemonTree>::Token::TIMES, 0.0).unwrap(),
				b'/' => parser.add_token(<Program as LemonTree>::Token::DIVIDE, 0.0).unwrap(),
				b'(' => parser.add_token(<Program as LemonTree>::Token::PAR_OPEN, 0.0).unwrap(),
				b')' => parser.add_token(<Program as LemonTree>::Token::PAR_CLOSE, 0.0).unwrap(),
				b';' => parser.add_token(<Program as LemonTree>::Token::SEMICOLON, 0.0).unwrap(),
				b'0' ..= b'9' | b'.' =>
				{	let pos = input.bytes().position(|c| !c.is_ascii_digit() && c!=b'.').unwrap_or(input.len());
					let value = input[.. pos].parse().unwrap();
					parser.add_token(<Program as LemonTree>::Token::NUM, value).unwrap();
					input = &input[pos-1 ..];
				}
				_ => panic!("Invalid token")
			}
			None =>
			{	return parser.end().unwrap().exprs;
			}
		}
		input = &input[1 ..];
	}
}

fn main()
{	use Expr::*;
	let mut parser = Program::get_parser(());

	assert_eq!
	(	parse(&mut parser, "2 + 2 * 2; (3+3) * 3"),
		vec!
		[	Plus
			(	Box::new(Num(2.0)),
				Box::new(Times(Box::new(Num(2.0)), Box::new(Num(2.0))))
			),
			Times
			(	Box::new(Plus(Box::new(Num(3.0)), Box::new(Num(3.0)))),
				Box::new(Num(3.0))
			)
		]
	);
}

Constraints

  • All attributes describing a single parser must live in one Rust file.
  • #[derive(LemonTree)] (the start symbol) must be the last parser attribute in that file — it is what triggers parser generation.
  • A file may define only one parser (one #[derive(LemonTree)]).
  • Unions are not supported as symbol types.

You can have several parsers in one crate, as long as each lives in its own file.

Cargo features

These features (forwarded to lemon-tree-derive) help debug the grammar at build time:

  • dump-grammar — print the generated grammar, with Rust actions, to stderr during the build.
  • dump-lemon-grammar — print the grammar in classic Lemon .y syntax to stderr.
  • debug-parser-to-file — write the generated parser source to a file next to your source instead of inlining it, making the generated code easy to inspect.

License

Licensed under the MIT license.