RustyLR
A yacc-like LR(1) and LALR(1) deterministic finite automaton (DFA) generator for context-free grammars (CFGs).
RustyLR provides procedural macros and build-script tools to generate LR(1) and LALR(1) parsers. The generated parser is pure Rust code, and the DFA is constructed at compile time. Reduce actions can be written in Rust, and the error messages are readable and detailed. For huge and complex grammars, using the build script is recommended.
features in Cargo.toml
- build: Enable buildscript tools.
- fxhash: In the parser table, replace std::collections::HashMap with FxHashMap from rustc-hash.
Example
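// this defines the `EParser` struct,
// where `E` is the start symbol
lr1! {
    // ... directives and production rules (see the Syntax section) ...
}
let parser = EParser::new(); // generate `EParser`
let mut context = parser.begin(); // create context
let mut userdata: i32 = 0; // define userdata
let input_sequence = "1 + 2 * ( 3 + 4 )";
// start feeding tokens
for token in input_sequence.chars() {
    parser.feed(&mut context, token, &mut userdata).unwrap();
}
// feed the `eof` token; '\0' assumes the grammar declares `%eof '\0';`
parser.feed(&mut context, '\0', &mut userdata).unwrap();
let res = context.accept(); // get the value of start symbol
println!("{}", res);
println!("userdata: {}", userdata);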
Readable error messages (with codespan)

- This error message is generated by the buildscript tool, not the procedural macros.
Features
- pure Rust implementation
- readable error messages, both for grammar building and parsing
- compile-time DFA construction from CFGs
- customizable reduce action
- resolving conflicts of ambiguous grammar
- regex patterns partially supported
- tools for integrating with build.rs
proc-macro
The following procedural macros are provided:
- lr1!: generate LR(1) parser
- lalr1!: generate LALR(1) parser
These macros will generate the following items:
- Parser: contains DFA tables and production rules
- ParseError: type alias for Error returned from feed()
- Context: contains current state and data stack
- enum NonTerminals: a list of non-terminal symbols
- Rule: type alias for production rules
- State: type alias for DFA states
All of the names above are prefixed with <StartSymbol>; for example, the start symbol E yields EParser.
In most cases, you only need the Parser and ParseError structs; the others are used internally.
Integrating with build.rs
This build-script tool provides much more detailed, pretty-printed error messages than the procedural macros.
If you are writing a huge, complex grammar, it is recommended to use the build script rather than the procedural macros.
The generated code contains the same structs and functions as the procedural macros. In your actual source code, you can include! the generated file.
Unlike the procedural macros, the program searches for %% in the input file, not for the lr1! or lalr1! macro.
The contents before %% are copied into the output file as is.
The context-free grammar must come after the %%.
// parser.rs
use SomeStruct;
%% // <-- input file is split here
%tokentype u8;
%start E;
%eof b'\0';
%token a b'a';
%token lparen b'(';
%token rparen b')';
E: lparen E rparen
| P
;
P: a;
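The split at %% can be pictured with a small sketch. This only illustrates the file layout described above and is not RustyLR's implementation; `split_grammar_file` is a hypothetical helper name, and splitting at the first occurrence of %% is an assumption.

```rust
/// Splits a grammar file at the first `%%` marker.
/// Returns `(rust_prelude, grammar)`, or `None` when no `%%` is present.
/// Hypothetical helper for illustration only; not part of RustyLR.
fn split_grammar_file(source: &str) -> Option<(&str, &str)> {
    // Everything before the marker is plain Rust that is copied as is;
    // everything after it is the grammar definition.
    source.split_once("%%")
}
```

For the parser.rs input shown above, the first half would hold `use SomeStruct;` and the second half would start with the `%tokentype` directive.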
You must enable the build feature to use it in the build script.
[build-dependencies]
rusty_lr = { version = "...", features = ["build"] }
// build.rs
use rusty_lr::build;
In your source code, include the generated file.
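// assuming build.rs writes the generated file to $OUT_DIR/parser.rs
include!(concat!(env!("OUT_DIR"), "/parser.rs"));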
Start Parsing
The Parser struct has the following functions:
- new(): create new parser
- begin(&self): create new context
- feed(&self, &mut Context, TerminalType, &mut UserData) -> Result<(), ParseError>: feed token to the parser
Note that the parameter &mut UserData is omitted if %userdata is not defined.
All you need to do is to call new() to generate the parser, and begin() to create a context.
Then, you can feed the input sequence one token at a time with the feed() function.
Once the whole input sequence (including the eof token) has been fed without errors,
you can get the value of the start symbol by calling context.accept().
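let parser = Parser::new();
let mut context = parser.begin();
for token in input_sequence {
    // `input_sequence` must end with the eof token
    parser.feed(&mut context, token).unwrap();
}
let start_symbol_value = context.accept();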
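To make the call pattern concrete without a grammar, here is a hand-written stand-in that follows the same new / begin / feed / accept sequence. It is not code generated by RustyLR: the struct bodies, the digit-summing behaviour, and the helper `parse_digits` are all assumptions made for illustration, and `'\0'` is assumed to be the eof token.

```rust
/// Error returned by `feed` in this sketch (a stand-in, not RustyLR's type).
#[derive(Debug, PartialEq)]
enum ParseError {
    InvalidTerminal(char),
}

/// Per-parse state; here just a running sum of the digits seen so far.
struct Context {
    sum: u32,
}

impl Context {
    /// Returns the final value after all tokens have been fed.
    fn accept(self) -> u32 {
        self.sum
    }
}

/// Stands in for the immutable parser tables; this sketch stores nothing.
struct Parser;

impl Parser {
    fn new() -> Self {
        Parser
    }

    fn begin(&self) -> Context {
        Context { sum: 0 }
    }

    /// Accepts ASCII digits and the eof marker '\0'; rejects anything else.
    fn feed(&self, context: &mut Context, token: char) -> Result<(), ParseError> {
        match token {
            '\0' => Ok(()),
            d if d.is_ascii_digit() => {
                context.sum += d.to_digit(10).unwrap();
                Ok(())
            }
            other => Err(ParseError::InvalidTerminal(other)),
        }
    }
}

/// Drives the stand-in parser over a whole input string.
fn parse_digits(input: &str) -> Result<u32, ParseError> {
    let parser = Parser::new();
    let mut context = parser.begin();
    for token in input.chars() {
        parser.feed(&mut context, token)?;
    }
    parser.feed(&mut context, '\0')?; // the eof token is fed last
    Ok(context.accept())
}
```

The point of the shape is that the parser itself stays immutable and all mutable state lives in the context, so one parser can serve several contexts.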
Error Handling
There are two error variants returned from feed() function:
- InvalidTerminal(InvalidTerminalError): when invalid terminal symbol is fed
- ReduceAction(ReduceActionError): when the reduce action returns Err(Error)
For ReduceActionError, the error type can be defined by %err directive. If not defined, DefaultReduceActionError will be used.
There are two ways to get the error message:
- e.long_message(&parser, &context): get the error message as String, in a detailed format
- e as Display: briefly print the short message through the Display trait.
The long_message function requires the reference to the parser and the context.
It will make a detailed error message of what current state was trying to parse, and what the expected terminal symbols were.
Example of long_message
Invalid Terminal: *. Expected one of: , (, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
>>> In:
M -> M * • M
>>> Backtrace:
M -> M • * M
>>> Backtrace:
A -> A + • A
>>> Backtrace:
A -> A • + A
Syntax
To start writing down a context-free grammar, you need to define necessary directives first. This is the syntax of the procedural macros.
lr1!
The lr1! macro generates a parser struct with LR(1) DFA tables.
If you want to generate an LALR(1) parser, use the lalr1! macro.
Every line in the macro must follow the syntax below.
The Bootstrap and Expanded Bootstrap examples are a good way to understand the syntax and the generated code. They are the RustyLR syntax parser written in RustyLR itself.
Quick Reference
- Production rules
- Regex pattern
- RuleType
- ReduceAction
- Accessing token data in ReduceAction
- Exclamation mark !
- %tokentype
- %token
- %start
- %eof
- %userdata
- %left, %right
- %err, %error
- %derive
Production rules
Every production rule has the base form:
NonTerminalName
: Pattern1 Pattern2 ... PatternN { ReduceAction }
| Pattern1 Pattern2 ... PatternN { ReduceAction }
...
;
Each Pattern follows the syntax:
- name: Non-terminal or terminal symbol name defined in the grammar.
- [term1 term_start-term_last], [^term1 term_start-term_last]: Set of terminal symbols. eof will be automatically removed from the terminal set.
- P*: Zero or more repetition of P.
- P+: One or more repetition of P.
- P?: Zero or one repetition of P.
- P / term, P / [term1 term_start-term_last], P / [^term1 term_start-term_last]: Lookaheads; P followed by one of given terminal set. Lookaheads are not consumed.
Notes
When using range pattern [first-last],
the range is constructed by the order of the %token directives,
not by the actual value of the token.
If you define tokens in the following order:
%token one '1';
%token two '2';
...
%token zero '0';
%token nine '9';
The range [zero-nine] will be ['0', '9'], not ['0'-'9'].
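The note above can be sketched as a lookup over the declaration order. This is a minimal illustration, not RustyLR's code: `expand_range` is a hypothetical helper, and returning `None` when the first name is declared after the last is an assumption.

```rust
/// Expands a `[first-last]` range using %token declaration order.
/// `tokens` lists (name, value) pairs in the order they were declared.
/// Returns `None` if either name is unknown or `first` comes after `last`
/// (the latter is an assumption of this sketch).
fn expand_range(tokens: &[(&str, char)], first: &str, last: &str) -> Option<Vec<char>> {
    let start = tokens.iter().position(|(name, _)| *name == first)?;
    let end = tokens.iter().position(|(name, _)| *name == last)?;
    if start > end {
        return None;
    }
    // The range is a slice of the declaration list, not of the value space.
    Some(tokens[start..=end].iter().map(|(_, value)| *value).collect())
}
```

With tokens declared as one, two, zero, nine, the range zero-nine covers only '0' and '9', because those two declarations are adjacent.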
RuleType (optional)
You can assign a value for each non-terminal symbol.
In the reduce action,
you can access the value that each pattern holds,
and assign a new value to the current non-terminal symbol.
Please refer to the ReduceAction and Accessing token data in ReduceAction section below.
At the end of parsing, the value of the start symbol will be the result of the parsing.
By default, terminal symbols hold the value of %tokentype passed by feed() function.
E(MyType<i32>) : ... Patterns ... { <This will be new value of E> } ;
ReduceAction (optional)
Reduce action can be written in Rust code. It is executed when the rule is matched and reduced.
- If RuleType is defined for the current non-terminal symbol, ReduceAction itself must be the value of RuleType (i.e. no semicolon at the end of the statement).
- ReduceAction can be omitted if:
  - RuleType is not defined.
  - Only one token is holding value in the production rule.
- Result<(),Error> can be returned from ReduceAction.
  - Returned Error will be delivered to the caller of feed() function.
  - ErrorType can be defined by the %err or %error directive. See Error type section.
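NoRuleType: ... ;
RuleTypeI32(i32): ... ;
// the reduce action is omitted; the value of RuleTypeI32 will be chosen,
// since it is the only token holding a value
E(i32): NoRuleType NoRuleType RuleTypeI32 NoRuleType;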
// set Err variant type to String
%err String;
%token div '/';
E(i32): A div a2=A {
    if a2 == 0 {
        return Err("Division by zero".to_string());
    }
    A / a2
};
A(i32): ... ;
Accessing token data in ReduceAction
The following predefined variable can be used in ReduceAction:
- data: userdata passed to the feed() function.
To access the data of each token, you can directly use the name of the token as a variable.
- For non-terminal symbols, the type of variable is RuleType.
- For terminal symbols, the type of variable is %tokentype.
- If multiple variables are defined with the same name, the front-most variable will be used.
- You can remap the variable name by using the = operator.
// `A` is the value of the first A, `a2` is the value of the second A
E(i32) : A plus a2=A { A + a2 } ;
For some regex pattern, the type of variable will be modified as follows:
- P*: Vec<P>
- P+: Vec<P>
- P?: Option<P>
You can still access the Vec or Option by using the base name of the pattern.
E : A* ;
For terminal set [term1 term_start-term_end], [^term1 term_start-term_end], there is no predefined variable name. You must explicitly define the variable name.
E: digit=[zero-nine] ;
Exclamation mark !
An exclamation mark ! can be used right after the token to ignore the value of the token.
The token will be treated as if it is not holding any value.
A : ... ;
// A in the middle will be chosen, since other A's are ignored
E : A! A A!;
Token type (must be defined)
%tokentype <RustType> ;
Define the type of terminal symbols.
<RustType> must be accessible at the point where the macro is called.
%tokentype u8;
Token definition (must be defined)
%token name <RustExpr> ;
Map terminal symbol name to the actual value <RustExpr>.
<RustExpr> must be accessible at the point where the macro is called.
%tokentype u8;
%token zero b'0';
%token one b'1';
...
// 'zero' and 'one' will be replaced by b'0' and b'1' respectively
E: zero one;
Start symbol (must be defined)
%start NonTerminalName ;
Set the start symbol of the grammar as NonTerminalName.
%start E;
// this internally generates the augmented rule <Augmented> -> E eof
E: ... ;
Eof symbol (must be defined)
%eof <RustExpr> ;
Define the eof terminal symbol.
<RustExpr> must be accessible at the point where the macro is called.
'eof' terminal symbol will be automatically added to the grammar.
%eof b'\0';
// you can access eof terminal symbol by 'eof' in the grammar
// without %token eof ...;
Userdata type (optional)
%userdata <RustType> ;
Define the type of userdata passed to feed() function.
...
%userdata MyUserData;
...
Reduce type (optional)
// reduce first
%left term1 ;
%left [term1 term_start-term_last] ;
// shift first
%right term1 ;
%right [term1 term_start-term_last] ;
Set the shift/reduce precedence for terminal symbols.
%left can also be written as %reduce or %l, and %right as %shift or %r.
// define tokens
%token plus '+';
%token hat '^';
// reduce first for token 'plus'
%left plus;
// shift first for token 'hat'
%right hat;
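The practical effect of the two directives is how a chain of the same operator is grouped. The sketch below is plain Rust, not RustyLR output; the helper names `fold_left`, `fold_right`, and `pow` are hypothetical. It evaluates a chain both ways so the difference is visible.

```rust
/// Integer power, used here as the meaning of the `hat` token.
fn pow(base: i64, exp: i64) -> i64 {
    base.pow(exp as u32)
}

/// Groups to the left, ((a op b) op c): what "reduce first" (%left) yields.
fn fold_left(operands: &[i64], op: fn(i64, i64) -> i64) -> Option<i64> {
    let (first, rest) = operands.split_first()?;
    Some(rest.iter().fold(*first, |acc, &x| op(acc, x)))
}

/// Groups to the right, (a op (b op c)): what "shift first" (%right) yields.
fn fold_right(operands: &[i64], op: fn(i64, i64) -> i64) -> Option<i64> {
    let (last, rest) = operands.split_last()?;
    Some(rest.iter().rev().fold(*last, |acc, &x| op(x, acc)))
}
```

For the input 2 ^ 3 ^ 2, left grouping gives (2 ^ 3) ^ 2 = 64, while right grouping gives 2 ^ (3 ^ 2) = 512, which is why `hat` is declared with %right in the example above.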
Error type (optional)
%err <RustType> ;
%error <RustType> ;
Define the type of Err variant in Result<(), Err> returned from ReduceAction. If not defined, DefaultReduceActionError will be used.
...
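%err String;
...
match parser.feed(&mut context, token) {
    Ok(()) => {}
    Err(ParseError::ReduceAction(err)) => { /* handle the error returned by the reduce action */ }
    Err(_other) => { /* handle other errors, e.g. an invalid terminal */ }
}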
Derive (optional)
Specify the derive attributes for the generated Context struct.
By default, the generated Context does not implement any traits.
In some cases, however, you may want to derive traits such as Clone, Debug, or serde's Serialize and Deserialize.
In that case, you must ensure that every member of the Context implements the trait.
Currently, Context holds the stack data:
a Vec<usize> for the state stack and a Vec<T> for every RuleType in the grammar.
%derive Clone, Debug, serde::Serialize ;
// here, #[derive(Clone,Debug)] will be added to the generated `Context` struct
%derive Clone, Debug;
...
let mut context = parser.begin();
// do something with context...
println!("{:?}", context); // debug-print context
let cloned_context = context.clone(); // clone context, you can re-feed the input sequence using cloned context
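Why every member must implement the derived traits can be seen from a hypothetical layout that mirrors the description above: one state stack plus one value stack per RuleType. The struct and field names below are assumptions for illustration, not the generated code.

```rust
/// Hypothetical context layout: a state stack and one value stack per RuleType.
/// `#[derive(Clone, Debug)]` compiles only because Vec<usize>, Vec<i32>,
/// and Vec<String> all implement both traits.
#[derive(Clone, Debug)]
struct SketchContext {
    state_stack: Vec<usize>,
    i32_stack: Vec<i32>,
    string_stack: Vec<String>,
}

/// Takes a snapshot, then mutates the original; returns (snapshot, current).
fn snapshot_then_push(mut context: SketchContext) -> (SketchContext, SketchContext) {
    let snapshot = context.clone(); // independent deep copy of all stacks
    context.state_stack.push(7);
    context.i32_stack.push(42);
    (snapshot, context)
}
```

Because the clone owns its own stacks, feeding more tokens into the original leaves the snapshot untouched, which is what makes re-feeding from a cloned context possible.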