RustyLR
A yacc-like generator that builds LR(1) and LALR(1) deterministic finite automata (DFA) from context-free grammars (CFGs).
[dependencies]
rusty_lr = "1.1.0"
Features
- pure Rust implementation
- readable error messages, both for grammar building and parsing
- compile-time DFA construction from CFGs (with proc-macro)
- customizable reduce action
- resolving conflicts of ambiguous grammar
- tracing parser action with callback
- executable for generating parser tables from CFGs
Usage
- Calculator: calculator with enum Token
- Calculator u8: calculator with u8
- Bootstrap, Expanded Bootstrap: bootstrapped line parser of the lr1! and lalr1! macros, written in RustyLR itself.
Sample calculator example
In example/calculator_u8/parser.rs:
use rusty_lr::lr1;
use rusty_lr::lalr1;
// this defines the struct `EParser`
// where 'E' is the start symbol
lr1!
In example/calculator_u8/src/main.rs:
$ cargo run
3.0 '+' 4.0
1.0 '+' 140.0
result: 141
userdata: 2
proc-macro syntax
Four procedural macros are provided: lr1!, lalr1!, lr1_runtime!, and lalr1_runtime!.
These macros define two structs, Parser and Context, and one enum, NonTerminals, each prefixed by <StartSymbol> (for example, EParser when the start symbol is E).
In most cases, what you want is the Parser struct, which contains the DFA states and the feed() functions.
Please refer to the Start Parsing section below for actual usage of the Parser struct.
The first two macros (those without the '_runtime' suffix) generate the Parser struct at compile time.
The DFA is computed at compile time, so the generated code consists of a very large number of statements that insert the table entries one by one.
The latter two (those with the '_runtime' suffix) generate the Parser struct at runtime.
The DFA is computed at runtime, so the generated code is much smaller and more readable.
Bootstrap and Expanded Bootstrap are good examples for understanding the syntax and the generated code. They implement the RustyLR syntax parser, written in RustyLR itself.
Every line in the macro must follow the syntax below.
Token type (must be defined)
'%tokentype' <RustType> ';'
Define the type of terminal symbols. <RustType> must be accessible at the point where the macro is called.
lr1!
Token definition (must be defined)
'%token' <Ident> <RustExpr> ';'
Map the terminal symbol's name <Ident> to the actual value <RustExpr>. <RustExpr> must be accessible at the point where the macro is called.
lr1!
Start symbol (must be defined)
'%start' <Ident> ';'
Define the start symbol of the grammar.
lr1!
Eof symbol (must be defined)
'%eof' <RustExpr> ';'
Define the eof terminal symbol. <RustExpr> must be accessible at the point where the macro is called.
The 'eof' terminal symbol will be automatically added to the grammar.
lr1!
Userdata type (optional)
'%userdata' <RustType> ';'
Define the type of userdata passed to the feed() function.
lr1!
...
Reduce type (optional)
// reduce first
'%left' <Ident> ';'
'%l' <Ident> ';'
'%reduce' <Ident> ';'
// shift first
'%right' <Ident> ';'
'%r' <Ident> ';'
'%shift' <Ident> ';'
Set the shift/reduce precedence for terminal symbols. <Ident> must be defined in %token.
lr1!
Production rules
<Ident><RuleType>
':' <TokenMapped>* <ReduceAction>
'|' <TokenMapped>* <ReduceAction>
...
';'
Define the production rules.
<TokenMapped> : <Ident as var_name> '=' <TokenPattern>
| <TokenPattern>
;
<TokenPattern> : <Ident as terminal or non-terminal>
| <Ident as terminal or non-terminal> '*' (zero or more)
| <Ident as terminal or non-terminal> '+' (one or more)
| <Ident as terminal or non-terminal> '?' (zero or one)
;
For example, the rule E : A plus* d=D ; defines non-terminal E to be A, then zero or more plus, then D mapped to variable d.
For more information, please refer to the Accessing token data in ReduceAction section below.
lr1!
RuleType (optional)
<RuleType> : '(' <RustType> ')'
|
;
Define the type of value that this production rule holds.
lr1!
ReduceAction (optional)
<ReduceAction> : '{' <RustExpr> '}'
|
;
Define the action to be executed when the rule is matched and reduced.
- If <RuleType> is defined, <ReduceAction> itself must be the value of <RuleType> (i.e. no semicolon at the end of the statement).
- <ReduceAction> can be omitted if:
  - <RuleType> is not defined
  - Only one token is holding value in the production rule
- Result<(),Error> can be returned from <ReduceAction>.
  - Returned Error will be delivered to the caller of the feed() function.
  - ErrorType can be defined by the %err or %error directive. See Error type section.
Omitting ReduceAction:
lr1!
Returning Result<(),String> from ReduceAction:
lr1!
Accessing token data in ReduceAction
Predefined variables can be used in <ReduceAction>:
- data: userdata passed to the feed() function.
To access the data of each token, you can directly use the name of the token as a variable.
For non-terminal symbols, the type of the variable is <RuleType>.
For terminal symbols, the type of the variable is %tokentype.
If multiple variables are defined with the same name, the front-most one will be used.
For regex-like patterns, the type of the variable is modified as follows:
| Pattern | Non-Terminal <RuleType>=T | Non-Terminal <RuleType>=(not defined) | Terminal |
|---|---|---|---|
| '*' | Vec<T> | (not defined) | Vec<TermType> |
| '+' | Vec<T> | (not defined) | Vec<TermType> |
| '?' | Option<T> | (not defined) | Option<TermType> |
lr1!
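As a rough illustration of the table above, the following plain-Rust sketch (not generated RustyLR code) shows the shapes a reduce action would see when the tokens of a rule use '*', '+', and '?'. The rule in the comment, the function name reduce_e, and the u8 terminal type are assumptions made only for this example.

```rust
/// Hypothetical stand-in for the reduce action of a rule such as
/// `E(usize) : zeros=Zero* ones=One+ sign=Minus? ;` with `%tokentype u8`.
/// Each parameter has the type the table assigns to its pattern.
fn reduce_e(zeros: Vec<u8>, ones: Vec<u8>, sign: Option<u8>) -> usize {
    // '*' may match nothing, so `zeros` can be empty;
    // '+' matches at least once, so `ones` is assumed non-empty here.
    debug_assert!(!ones.is_empty());
    let count = zeros.len() + ones.len();
    // '?' yields Some(token) when the token is present and None otherwise.
    match sign {
        Some(_) => count + 1,
        None => count,
    }
}
```

The function simply counts how many terminals were matched, which is enough to show that repeated tokens arrive as a Vec and optional tokens as an Option.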
Error type (optional)
'%err' <RustType> ';'
'%error' <RustType> ';'
Define the type of the Err variant in Result<(), Err> returned from <ReduceAction>. If not defined, String will be used.
lr1!
...
match parser.feed
Start Parsing
<StartSymbol>Parser will be generated by the procedural macros.
The parser struct has the following functions:
- new(): create new parser
- begin(&self): create new context
- feed(&self, &mut Context, TermType, &mut UserData) -> Result<(), ParseError>: feed token to the parser
- feed_callback(&self, &mut Context, &mut C: Callback, TermType, &mut UserData) -> Result<(), ParseError>: feed token with callback
Note that the parameter &mut UserData is omitted if %userdata is not defined.
Once the input sequence (including the eof token) has been fed without errors, you can get the value of the start symbol by calling context.accept().
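// EParser is the parser generated for start symbol 'E'
let parser = EParser::new();
// create context
let mut context = parser.begin();
// define userdata
let mut userdata: i32 = 0;
// start feeding tokens (input_sequence must end with the eof token)
for token in input_sequence {
    // stop at the first parse error
    if parser.feed(&mut context, token, &mut userdata).is_err() {
        return;
    }
}
// res = value of start symbol
let res = context.accept();
println!("result: {}", res);
println!("userdata: {}", userdata);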
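The calling pattern above (create a context, feed tokens one at a time, then accept) can be mimicked without RustyLR. The sketch below is a hand-written stand-in, not generated code: SumParser, SumContext, parse_all, and their behavior (summing single digits, counting digits in the userdata, treating b'\0' as eof) are hypothetical and only mirror the shape of new(), begin(), feed(), and accept().

```rust
/// Hypothetical stand-in for a generated `<StartSymbol>Parser`.
struct SumParser;

/// Hypothetical stand-in for the generated context; holds the running state.
struct SumContext {
    total: u32,
    finished: bool,
}

impl SumParser {
    fn new() -> Self {
        SumParser
    }

    /// Create a fresh context, as `begin(&self)` does in the text.
    fn begin(&self) -> SumContext {
        SumContext { total: 0, finished: false }
    }

    /// Feed one token. Digits are added up, b'\0' plays the role of eof,
    /// and anything else is reported to the caller as an error.
    fn feed(&self, ctx: &mut SumContext, token: u8, userdata: &mut i32) -> Result<(), String> {
        if ctx.finished {
            return Err("token fed after eof".to_string());
        }
        match token {
            b'0'..=b'9' => {
                ctx.total += u32::from(token - b'0');
                *userdata += 1; // userdata counts the digits seen
                Ok(())
            }
            b'\0' => {
                ctx.finished = true;
                Ok(())
            }
            other => Err(format!("unexpected token {:?}", other as char)),
        }
    }
}

impl SumContext {
    /// Value of the "start symbol", meaningful once eof has been fed.
    fn accept(&self) -> u32 {
        self.total
    }
}

/// Drive the stand-in parser with the same begin/feed/accept sequence.
fn parse_all(input: &[u8]) -> Result<(u32, i32), String> {
    let parser = SumParser::new();
    let mut context = parser.begin();
    let mut userdata: i32 = 0;
    for &token in input {
        // Propagate the first error to the caller.
        parser.feed(&mut context, token, &mut userdata)?;
    }
    Ok((context.accept(), userdata))
}
```

Keeping the parser immutable and putting all mutable state in the context mirrors the signatures listed above, where feed() takes &self and &mut Context.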
Parse with callback
To trace parser actions, you can implement the Callback trait and pass it to parser.feed_callback().
Note that the generic types Term and NonTerm must be replaced with actual types. Term must be the same as %tokentype, and NonTerm must be the <StartSymbol>NonTerminals enum generated by the procedural macro.
// Num + Num * ( Num + Num )
let terms = vec!;
// start parsing
let mut context = parser.begin;
let mut callback = ParserCallback ;
// feed input sequence
for term in terms
The result will be:
Reduce by P -> Num
Reduce by M -> P
Reduce by A -> M
Reduce by P -> Num
Reduce by M -> P
Reduce by P -> Num
Reduce by M -> P
Reduce by A -> M
Reduce by P -> Num
Reduce by M -> P
Reduce by A -> M
Reduce by A -> A + A
Reduce by E -> A
Reduce by P -> ( E )
Reduce by M -> P
Reduce by M -> M * M
Reduce by A -> M
Reduce by A -> A + A
Reduce by E -> A
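To make the tracing idea concrete without depending on the exact signature of the Callback trait (which is not reproduced in this text), the sketch below defines its own hypothetical TraceCallback trait with a single reduce hook, and a Recorder that collects lines in the same "Reduce by ..." format as the output above. None of these names are RustyLR APIs.

```rust
/// Hypothetical tracing hook; the real `Callback` trait may look different.
trait TraceCallback<NonTerm> {
    /// Called when a rule `lhs -> rhs` is reduced.
    fn reduce(&mut self, lhs: &NonTerm, rhs: &[&str]);
}

/// Collects one formatted line per reduction.
struct Recorder {
    lines: Vec<String>,
}

// The generic parameter is replaced with a concrete type, here &'static str.
impl TraceCallback<&'static str> for Recorder {
    fn reduce(&mut self, lhs: &&'static str, rhs: &[&str]) {
        self.lines.push(format!("Reduce by {} -> {}", lhs, rhs.join(" ")));
    }
}

/// Replays the first three reductions of the trace above through the hook.
fn replay() -> Vec<String> {
    let mut recorder = Recorder { lines: Vec::new() };
    recorder.reduce(&"P", &["Num"]);
    recorder.reduce(&"M", &["P"]);
    recorder.reduce(&"A", &["M"]);
    recorder.lines
}
```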
Macro expand executable rustylr
An executable version of the lr1! and lalr1! macros.
Here for more information.
Build Deterministic Finite Automata (DFA) from Context Free Grammar (CFG)
This section describes how to build a DFA from a CFG at runtime.
1. Define terminal and non-terminal symbols
// must implement these traits
// must implement these traits
/// impl Display for TermType, NonTermType will make related ProductionRule, error message Display-able
Or simply, you can use char or u8 as terminal, and &'static str or String as non-terminal.
Any type that implements the traits above can be used as terminal and non-terminal symbols.
2. Define production rules
Consider the following context free grammar.
A -> A + A (reduce left)
A -> M
This grammar can be written as:
/// type alias
type Token = Token;
/// create grammar
let mut grammar = new;
grammar.add_rule;
grammar.add_rule;
/// set reduce type
grammar.set_reduce_type;
Note that the production rule A -> A + A has a shift/reduce conflict. To resolve this conflict, a shift/reduce precedence is assigned to the terminal symbol Plus. Left means that when the parser sees a Plus token, it reduces the rule instead of shifting the token.
A reduce/reduce conflict (e.g. duplicated rules) is always an error.
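The effect of the reduce type on a shift/reduce conflict can be sketched as a small decision function. This illustrates the rule described above and is not RustyLR's internal code; ReduceType, Action, and resolve are names assumed for the example.

```rust
/// Precedence attached to a terminal (mirrors %left / %right).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ReduceType {
    Left,  // reduce first
    Right, // shift first
}

/// The action chosen for the conflicting lookahead terminal.
#[derive(Debug, PartialEq, Eq)]
enum Action {
    Shift,
    Reduce,
}

/// Decide what to do when both a shift and a reduce are possible on the
/// same lookahead terminal. Without a reduce type the conflict is an error.
fn resolve(reduce_type: Option<ReduceType>) -> Result<Action, &'static str> {
    match reduce_type {
        Some(ReduceType::Left) => Ok(Action::Reduce),
        Some(ReduceType::Right) => Ok(Action::Shift),
        None => Err("shift/reduce conflict"),
    }
}
```

With Left set on Plus, an input such as Num + Num + Num reduces the first A + A before the second Plus is shifted, which gives left-associative grouping.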
3. Build DFA
Calling grammar.build() for LR(1) or grammar.build_lalr() for LALR(1) will build the DFA from the CFG.
let parser: Parser = match grammar.build ;
You must explicitly specify the Augmented non-terminal symbol, and the Augmented production rule must be defined in the grammar.
Augmented -> StartSymbol $
The returned Parser struct contains the DFA states and the production rules (cloned).
It is completely independent of the Grammar, so you can drop the Grammar struct or export the Parser struct to another module.
4. Error messages
The Error type returned from Grammar::build() contains the error information.
You can manually match on the error type to produce custom error messages, but in most cases println!("{}", err) is enough to see the detailed errors.
Error is Display if both Term and NonTerm are Display, and it is Debug if both Term and NonTerm are Debug.
Sample error messages
For Shift/Reduce conflicts,
Build failed: Shift/Reduce Conflict
NextTerm: '0'
Reduce Rule:
"Num" -> "Digit"
Shift Rules:
"Digit" -> '0' • /Lookaheads: '\0', '0'
Try rearanging the rules or set ReduceType to Terminal '0' to resolve the conflict.
For Reduce/Reduce conflicts,
Build failed: Reduce/Reduce Conflict with lookahead: '\0'
Production Rule1:
"Num" -> "Digit"
Production Rule2:
"Num" -> "Digit"
Parse input sequence with generated DFA
For a given input sequence, you can start parsing with the Parser::begin() method.
Once you get the Context from begin(), you can feed the input sequence one token at a time with the Parser::feed() method.
let terms = vec!;
// start parsing
let mut context = parser.begin;
// feed input sequence
for term in terms
The EOF token is fed at the end of the sequence, and the augmented rule Augmented -> StartSymbol $ will not be reduced since there are no lookahead symbols.