RustyLR
A yacc-like LR(1) and LALR(1) deterministic finite automaton (DFA) generator from context-free grammars (CFGs).
RustyLR provides both an executable and procedural macros to generate LR(1) and LALR(1) parsers. The generated parser is pure Rust code, and the DFA is built at compile time. Reduce actions can be written in Rust, and the executable produces readable, detailed error messages. For huge and complex grammars, the executable version is recommended.
By default, RustyLR uses std::collections::HashMap for the parser tables.
If you want to use FxHashMap from rustc-hash, add features=["fxhash"] to your Cargo.toml.
[dependencies]
rusty_lr = { version = "...", features = ["fxhash"] }
Example
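// this defines the `EParser` struct
// where `E` is the start symbol
lr1! {
    %userdata i32;
    %tokentype char;
    %start E;
    %eof '\0';
    ...
}
// generate `EParser`
let parser = EParser::new();
// create context
let mut context = parser.begin();
// define userdata
let mut userdata: i32 = 0;
let input_sequence = "1 + 2 * ( 3 + 4 )";
// start feeding tokens
for token in input_sequence.chars() {
    parser.feed(&mut context, token, &mut userdata).unwrap();
}
// feed `eof` token (the value given to %eof)
parser.feed(&mut context, '\0', &mut userdata).unwrap();
// res = value of start symbol
let res = context.accept();
println!("{}", res);
println!("{}", userdata);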
Readable error messages (with codespan)
Features
- pure Rust implementation
- readable error messages, both for grammar building and parsing
- compile-time DFA construction from CFGs
- customizable reduce action
- resolving conflicts of ambiguous grammar
- regex patterns partially supported
- executable for generating parser tables
proc-macro
The following procedural macros are provided:
- lr1!: LR(1) parser
- lalr1!: LALR(1) parser
These macros will generate the following structs:
- Parser: contains DFA tables and production rules
- ParseError: type alias for Error returned from feed()
- Context: contains current state and data stack
- enum NonTerminals: a list of non-terminal symbols
- Rule: type alias for production rules
- State: type alias for DFA states
All structs above are prefixed by <StartSymbol> (for example, EParser when the start symbol is E).
In most cases, what you want is the Parser and ParseError structs; the others are used internally.
Start Parsing
The Parser struct has the following functions:
- new(): create new parser
- begin(&self): create new context
- feed(&self, &mut Context, TerminalType, &mut UserData) -> Result<(), ParseError>: feed token to the parser
Note that the parameter &mut UserData is omitted if %userdata is not defined.
All you need to do is call new() to generate the parser, and begin() to create a context.
Then, you can feed the input sequence one token at a time with the feed() function.
Once the whole input sequence (including the eof token) has been fed without errors, you can get the value of the start symbol by calling context.accept().
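let parser = Parser::new();
let mut context = parser.begin();
for token in input_sequence {
    // pass `&mut userdata` as well when %userdata is defined
    parser.feed(&mut context, token).unwrap();
}
let start_symbol_value = context.accept();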
Error Handling
There are two error variants returned from the feed() function:
- InvalidTerminal(InvalidTerminalError): when an invalid terminal symbol is fed
- ReduceAction(ReduceActionError): when the reduce action returns Err(Error)
For ReduceActionError, the error type can be defined by the %err directive. If not defined, String will be used.
There are two ways to get the error message:
- e.long_message( &parser, &context ): get the error message as String, in a detailed format
- e as Display: briefly print the short message through the Display trait.
The long_message function requires references to the parser and the context.
It builds a detailed error message describing what the current state was trying to parse and which terminal symbols were expected.
Example of long_message
Invalid Terminal: *
Expected one of: , (, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
-------------------------------Backtracing state--------------------------------
WS0 -> • _RustyLRGenerated0
_RustyLRGenerated1 -> •
_RustyLRGenerated1 -> • _RustyLRGenerated1
_RustyLRGenerated0 -> • _RustyLRGenerated1
_RustyLRGenerated0 -> •
Number -> • WS0 _RustyLRGenerated3 WS0
M -> • M * M
M -> M * • M
M -> • P
P -> • Number
P -> • WS0 ( E ) WS0
-----------------------------------Prev state-----------------------------------
M -> M • * M
-----------------------------------Prev state-----------------------------------
A -> • A + A
A -> A + • A
A -> • M
M -> • M * M
-----------------------------------Prev state-----------------------------------
A -> A • + A
-----------------------------------Prev state-----------------------------------
A -> • A + A
E -> • A
Augmented -> • E
Syntax
To start writing down a context-free grammar, you need to define necessary directives first. This is the syntax of the procedural macros.
The lr1! macro generates a parser struct with LR(1) DFA tables.
To generate an LALR(1) parser, use the lalr1! macro instead.
Every line in the macro must follow the syntax below.
Bootstrap and Expanded Bootstrap are good examples for understanding the syntax and the generated code. They are the RustyLR syntax parser written in RustyLR itself.
Token type (must be defined)
'%tokentype' <RustType> ';'
Define the type of terminal symbols.
<RustType> must be accessible at the point where the macro is called.
lr1! {
    %tokentype u8;
}
Token definition (must be defined)
'%token' <Ident> <RustExpr> ';'
Map the terminal symbol's name <Ident> to the actual value <RustExpr>.
<RustExpr> must be accessible at the point where the macro is called.
lr1! {
    %tokentype u8;
    %token a b'a';
    %token lparen b'(';
    %token rparen b')';
}
Start symbol (must be defined)
'%start' <Ident> ';'
Define the start symbol of the grammar.
lr1! {
    %start E;
}
Eof symbol (must be defined)
'%eof' <RustExpr> ';'
Define the eof terminal symbol.
<RustExpr> must be accessible at the point where the macro is called.
The eof terminal symbol is automatically added to the grammar.
lr1! {
    %eof b'\0';
}
Userdata type (optional)
'%userdata' <RustType> ';'
Define the type of userdata passed to the feed() function.
lr1! {
    %userdata i32;
}
...
Reduce type (optional)
// reduce first
'%left' <Ident> ';'
'%left' <TerminalSet> ';'
// shift first
'%right' <Ident> ';'
'%right' <TerminalSet> ';'
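Set the shift/reduce precedence for terminal symbols. <Ident> must be defined in %token.
With <TerminalSet>, you can set the reduce type for multiple terminals at once. Please refer to the Regex pattern section below.
%left can be abbreviated as %reduce or %l, and %right can be abbreviated as %shift or %r.
lr1! {
    // define tokens
    %token plus '+';
    %token hat '^';
    // resolve shift/reduce conflicts on plus by reducing first
    %left plus;
    // resolve shift/reduce conflicts on hat by shifting first
    %right hat;
}
lr1! {
    // define tokens
    %token zero '0';
    ...
    %token nine '9';
    // shift first for every token in the range zero to nine
    %shift [zero-nine];
}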
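To see why the reduce type matters, here is a small sketch in plain Rust that does not use RustyLR at all. The function names eval_left and eval_right are made up for this illustration. Reducing first groups operands to the left, which is what %left asks for; shifting first groups them to the right, which is what %right asks for.

```rust
/// Reduce first: fold from the left, so `8 - 3 - 2` groups as `(8 - 3) - 2`.
fn eval_left(operands: &[i64]) -> i64 {
    let mut iter = operands.iter().copied();
    let first = iter.next().expect("at least one operand");
    iter.fold(first, |acc, x| acc - x)
}

/// Shift first: fold from the right, so `8 - 3 - 2` groups as `8 - (3 - 2)`.
fn eval_right(operands: &[i64]) -> i64 {
    let mut iter = operands.iter().rev().copied();
    let last = iter.next().expect("at least one operand");
    iter.fold(last, |acc, x| x - acc)
}

fn main() {
    let operands = [8, 3, 2];
    println!("reduce first: {}", eval_left(&operands));
    println!("shift first:  {}", eval_right(&operands));
}
```

For an operator such as subtraction the two groupings give different results, so the choice between %left and %right changes the meaning of the parsed input, not only the shape of the tables.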
Production rules
<Ident><RuleType>
':' <TokenMapped>* <ReduceAction>
'|' <TokenMapped>* <ReduceAction>
...
';'
Define the production rules.
<TokenMapped> : <Ident as var_name> '=' <TokenPattern>
| <TokenPattern>
;
<TokenPattern> : <Ident as terminal or non-terminal>
| <TerminalSet>
| <TokenPattern> '*' (zero or more)
| <TokenPattern> '+' (one or more)
| <TokenPattern> '?' (zero or one)
;
For example, the production rule
lr1! {
    E: A plus* d=D;
}
defines non-terminal E to be A, then zero or more plus, then D mapped to variable d.
For more information, please refer to the Accessing token data in ReduceAction section below.
Regex pattern
Regex patterns are partially supported. You can use *, +, ? to define the number of repetitions, and [] to define a set of terminal symbols.
%token lparen '(';
%token rparen ')';
%token zero '0';
...
%token nine '9';
A: [zero-nine]+; // zero to nine
B: [^lparen rparen]; // any token except lparen and rparen
C: [lparen rparen one-nine]*; // lparen and rparen, and one to nine
Note that when using the range pattern [first-last], the range is constructed by the order of the %token directives, not by the actual value of the token.
If you define tokens in the following order:
%token one '1';
%token two '2';
...
%token zero '0';
%token nine '9';
The range [zero-nine] will be ['0', '9'], not ['0'-'9'].
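The ordering rule can be mimicked in a few lines of plain Rust. This is only an illustration of the rule as described above, not RustyLR's implementation: expand_range is a hypothetical helper, the full token list stands in for the elided declarations, and returning None for a reversed range is a choice made for this sketch.

```rust
/// Expand `[first-last]` by position in the declaration list,
/// not by comparing the token values themselves.
fn expand_range(declared: &[(&str, char)], first: &str, last: &str) -> Option<Vec<char>> {
    let start = declared.iter().position(|(name, _)| *name == first)?;
    let end = declared.iter().position(|(name, _)| *name == last)?;
    if start > end {
        return None;
    }
    Some(declared[start..=end].iter().map(|(_, value)| *value).collect())
}

fn main() {
    // Hypothetical declaration order: `zero` is declared right before `nine`.
    let declared = [
        ("one", '1'), ("two", '2'), ("three", '3'), ("four", '4'),
        ("five", '5'), ("six", '6'), ("seven", '7'), ("eight", '8'),
        ("zero", '0'), ("nine", '9'),
    ];
    println!("{:?}", expand_range(&declared, "zero", "nine"));
    println!("{:?}", expand_range(&declared, "one", "nine"));
}
```

Declaring the tokens in ascending order of their values keeps ranges such as [zero-nine] aligned with what a reader would expect.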
RuleType (optional)
<RuleType> : '(' <RustType> ')'
|
;
Define the type of value that this production rule holds.
lr1! {
    // E holds a value of type i32
    E(i32): ... ;
}
ReduceAction (optional)
<ReduceAction> : '{' <RustExpr> '}'
|
;
Define the action to be executed when the rule is matched and reduced.
- If <RuleType> is defined, <ReduceAction> itself must be the value of <RuleType> (i.e. no semicolon at the end of the statement).
- <ReduceAction> can be omitted if:
  - <RuleType> is not defined
  - Only one token is holding value in the production rule (non-terminal symbols with <RuleType> defined, and terminal symbols, are considered as holding value)
- Result<(),Error> can be returned from <ReduceAction>.
  - Returned Error will be delivered to the caller of the feed() function.
  - ErrorType can be defined by the %err or %error directive. See the Error type section.
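Omitting ReduceAction:
lr1! {
    %token plus ...;
    // <RuleType> is not defined, so <ReduceAction> can be omitted
    Sum: A plus A;
    // A is the only token holding a value (plus is ignored with !),
    // so <ReduceAction> can be omitted
    E(i32): plus! A;
    A(i32): ... ;
}
Returning Result<(),String> from ReduceAction:
lr1! {
    %token div ...;
    E(i32): A div a2=A {
        if a2 == 0 {
            return Err("Division by zero".to_string());
        }
        A / a2
    };
    A(i32): ... ;
}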
Accessing token data in ReduceAction
Predefined variables can be used in <ReduceAction>:
- data: userdata passed to the feed() function.
To access the data of each token, you can directly use the name of the token as a variable.
For non-terminal symbols, the type of the variable is <RuleType>.
For terminal symbols, the type of the variable is %tokentype.
If multiple variables are defined with the same name, the front-most one will be used.
For regex patterns, the type of the variable is modified as follows:
| Pattern | Non-Terminal <RuleType>=T | Non-Terminal <RuleType>=(not defined) | Terminal / TerminalSet |
|---|---|---|---|
| '*' | Vec<T> | (not defined) | Vec<TermType> |
| '+' | Vec<T> | (not defined) | Vec<TermType> |
| '?' | Option<T> | (not defined) | Option<TermType> |
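lr1! {
    %userdata i32;
    %token plus ...;
    E(i32): A+ plus? a2=A* {
        // A: Vec<i32>, plus: Option<TermType>, a2: Vec<i32>
        *data += 1; // userdata passed to feed()
        A.iter().sum::<i32>() + a2.iter().sum::<i32>()
    };
    A(i32): ... ;
}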
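The effect of the table on a reduce action can be sketched in plain Rust, outside of any macro. Everything here is an assumption for illustration: build_number is a hypothetical function standing in for a reduce action body, sign plays the role of a terminal matched with ? (so it arrives as Option<char>), and digits plays the role of a terminal set matched with + (so it arrives as Vec<char>), with char as the token type.

```rust
/// Stand-in for a reduce action body.
/// `sign` mimics a terminal matched with `?`, `digits` one matched with `+`.
fn build_number(sign: Option<char>, digits: Vec<char>) -> Result<i64, String> {
    let mut value: i64 = 0;
    for ch in digits {
        // Reject anything that is not a decimal digit, like a reduce action returning Err.
        let digit = ch
            .to_digit(10)
            .ok_or_else(|| format!("not a digit: {}", ch))?;
        value = value * 10 + i64::from(digit);
    }
    Ok(if sign == Some('-') { -value } else { value })
}

fn main() {
    println!("{:?}", build_number(Some('-'), vec!['4', '2']));
    println!("{:?}", build_number(None, vec!['7']));
    println!("{:?}", build_number(None, vec!['x']));
}
```

The String error type mirrors the default used when %err is not defined.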
Error type (optional)
'%err' <RustType> ';'
'%error' <RustType> ';'
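Define the type of the Err variant in Result<(), Err> returned from <ReduceAction>. If not defined, String will be used.
lr1! {
    %err String;
}
...
match parser.feed( ... ) {
    Ok(_) => {}
    Err(err) => match err {
        // ParseError is prefixed by <StartSymbol>, like the other generated structs
        ParseError::ReduceAction(err) => {
            // err carries the error returned from <ReduceAction>
        }
        _ => {}
    },
}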
Exclamation mark !
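An exclamation mark ! can be used right after a token to ignore the value of the token.
The token will be treated as if it is not holding any value.
Tip
When combining it with a repetition pattern *, +, ?, use ! first.
This prevents a Vec<T> from being built internally from the value of the token.
lr1! {
    A(i32): ... ;
    // the A in the middle is chosen, since the other A's are ignored
    E(i32): A! A A!;
    B: A*!; // a Vec<i32> is built from the values of A, and then ignored
    C: A!*; // A is ignored first, and then the repetition pattern is applied
}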
executable rustylr
An executable version of the lr1! and lalr1! macros.
It converts a context-free grammar into deterministic finite automaton (DFA) tables, and generates Rust code that can be used as a parser for that grammar.
cargo install rustylr
This executable provides much more detailed, pretty-printed error messages than the procedural macros.
If you are writing a huge, complex grammar, it is recommended to use this executable rather than the procedural macros.
The --verbose option is useful for debugging the grammar. It prints where the auto-generated rules originate from and how shift/reduce conflicts are resolved.
That said, the proc-macros are convenient for small grammars, since modern IDE features (rust-analyzer's auto completion, inline error messages) can be enabled.
This program searches for %% in the input file (not for the lr1! or lalr1! macro).
The contents before %% are copied into the output file as they are.
The context-free grammar must come after %%.
Each line must follow the syntax of rusty_lr#syntax.
// my_grammar.rs
use SomeStruct;
%% // <-- input file is split here
%tokentype u8;
%start E;
%eof b'\0';
%token a b'a';
%token lparen b'(';
%token rparen b')';
E: lparen E rparen
| P
;
P: a;
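The split at %% described above can be sketched with the standard library alone. This is not the executable's actual implementation; split_grammar_file is a hypothetical helper, and splitting at the first occurrence of %% is an assumption of the sketch.

```rust
/// Split an input file into the Rust prelude and the grammar at the first `%%`.
/// Returns `None` when the separator is missing.
fn split_grammar_file(source: &str) -> Option<(&str, &str)> {
    source.split_once("%%")
}

fn main() {
    let source = "use SomeStruct;\n%%\n%tokentype u8;\n%start E;\n";
    match split_grammar_file(source) {
        Some((prelude, grammar)) => {
            // Everything before the separator would be copied to the output unchanged.
            println!("copied as-is:\n{}", prelude);
            // Everything after it would be parsed as grammar lines.
            println!("parsed as grammar:\n{}", grammar.trim_start());
        }
        None => eprintln!("no `%%` separator found"),
    }
}
```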
Calling the command below will generate the Rust code my_parser.rs.
$ rustylr my_grammar.rs my_parser.rs --verbose
Possible options can be found with --help.
$ rustylr --help
Usage: rustylr [OPTIONS] <INPUT_FILE> [OUTPUT_FILE]
Arguments:
<INPUT_FILE>
input_file to read
[OUTPUT_FILE]
output_file to write
[default: out.tab.rs]
Options:
--no-format
do not rustfmt the output
-l, --lalr
build LALR(1) parser
-v, --verbose
print debug information.
print the auto-generated rules, and where they are originated from.
print the shift/reduce conflicts, and the resolving process.