§RustyLR
GLR, LR(1) and LALR(1) parser generator for Rust.
RustyLR provides procedural macros and build-script tools to generate GLR, LR(1) and LALR(1) parsers. The generated parser is pure Rust code, and the DFA construction is done at compile time. Reduce actions can be written in Rust, and the error messages are readable and detailed. For large and complex grammars, it is recommended to use the build script.
§features in Cargo.toml
- `build`: Enable build-script tools.
- `fxhash`: In the parser table, replace `std::collections::HashMap` with `FxHashMap` from `rustc-hash`.
- `tree`: Enable automatic tree construction. This feature should be used for debugging only, since it consumes significantly more memory and time.
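For example, a dependency section enabling some of these features might look like the sketch below (the version is elided, as elsewhere on this page):

[dependencies]
rusty_lr = { version = "...", features = ["fxhash", "tree"] }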
§Features
- pure Rust implementation
- readable error messages, both for grammar building and parsing
- compile-time DFA construction from CFGs
- customizable reduce action
- resolving conflicts of ambiguous grammar
- regex patterns partially supported
- tools for integrating with `build.rs`
§proc-macro
The following procedural macros are provided: `lr1!` and `lalr1!`.
These macros will generate structs:
- `Parser`: contains DFA tables and production rules
- `ParseError`: type alias for the `Error` returned from `feed()`
- `Context`: contains the current state and data stack
- `enum NonTerminals`: a list of non-terminal symbols
- `Rule`: type alias for production rules
- `State`: type alias for DFA states
All structs above are prefixed by <StartSymbol>.
In most cases, the Parser and ParseError structs are all you need; the others are used internally.
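For example, here is a minimal sketch using the same small grammar that appears in the build-script section below; per the prefixing rule above, `%start E;` makes the generated items `EParser`, `EParseError`, and so on:

lr1! {
    %tokentype u8;
    %start E;
    %eof b'\0';

    %token a b'a';
    %token lparen b'(';
    %token rparen b')';

    E: lparen E rparen
     | P
     ;

    P: a;
}

// the parser type generated from `%start E;`
let parser = EParser::new();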
§Integrating with build.rs
The build-script tool provides much more detailed, pretty-printed error messages than the procedural macros.
If you are writing a large, complex grammar, it is recommended to use the build script instead of the procedural macros.
The generated code will contain the same structs and functions as the procedural macros. In your actual source code, you can `include!` the generated file.
The build tool searches for `%%` in the input file; it does not look for the `lr1!` or `lalr1!` macros.
The contents before `%%` are copied into the output file as-is,
and the context-free grammar must come after the `%%` marker.
// parser.rs
use some_crate::some_module::SomeStruct;
enum SomeTypeDef {
A,
B,
C,
}
%% // <-- input file is split here
%tokentype u8;
%start E;
%eof b'\0';
%token a b'a';
%token lparen b'(';
%token rparen b')';
E: lparen E rparen
| P
;
P: a;

You must enable the `build` feature to use these tools in the build script.
[build-dependencies]
rusty_lr = { version = "...", features = ["build"] }
// build.rs
use rusty_lr::build;
fn main() {
println!("cargo::rerun-if-changed=src/parser.rs");
let output = format!("{}/parser.rs", std::env::var("OUT_DIR").unwrap());
build::Builder::new()
.file("src/parser.rs") // path to the input file
// .lalr() // to generate LALR(1) parser
.build(&output); // path to the output file
}

In your source code, include the generated file.
include!(concat!(env!("OUT_DIR"), "/parser.rs"));

§Start Parsing
The Parser struct has the following functions:
- `new()`: create a new parser
- `begin(&self)`: create a new context
- `feed(&self, &mut Context, TerminalType, &mut UserData) -> Result<(), ParseError>`: feed a token to the parser
Note that the parameter &mut UserData is omitted if %userdata is not defined.
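As a hedged sketch (assuming `%userdata` takes the userdata type, and using `i32` and the token `b'a'` purely for illustration), the userdata is passed as the last argument of `feed()`:

// In the grammar: `%userdata i32;` (the i32 type is an arbitrary choice here)
let parser = Parser::new();
let mut context = parser.begin();
let mut user_data: i32 = 0;

// feed() now takes the userdata as its last argument
parser.feed(&mut context, b'a', &mut user_data).unwrap();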
All you need to do is call `new()` to create the parser and `begin()` to create a context.
Then you can feed the input sequence one token at a time with the `feed()` function.
Once the whole input sequence (including the eof token) has been fed without errors,
you can get the value of the start symbol by calling `context.accept()`.
let parser = Parser::new();
let mut context = parser.begin();
for token in input_sequence {
    match parser.feed(&mut context, token) {
        Ok(_) => {}
        Err(e) => { // e: ParseError
            println!("{}", e);
            return;
        }
    }
}
// after feeding the whole sequence (including the eof token) without errors:
let start_symbol_value = context.accept();

§Syntax Tree
With the tree feature enabled, the feed() function will automatically construct the parse tree.
By calling context.to_tree_list(),
you can get the current syntax tree. Simply printing the tree list with Display or Debug will give you a pretty-printed tree.
let parser = Parser::new();
let mut context = parser.begin();
// feed tokens...
println!( "{:?}", context.to_tree_list() ); // print tree list with `Debug` trait
println!( "{}", context.to_tree_list() ); // print tree list with `Display` traitTreeList
├─A
│ └─M
│ └─P
│ └─Number
│ ├─WS0
│ │ └─_space_Star1
│ │ └─_space_Plus0
│ │ ├─_space_Plus0
│ │ │ └─' '
│ │ └─' '
│ ├─_Digit_Plus3
│ │ └─Digit
│ │ └─_TerminalSet2
│ │ └─'1'
│ └─WS0
│ └─_space_Star1
│ └─_space_Plus0
│ └─' '
├─'+'
├─M
│ └─P
│ └─Number
│ ├─WS0
│ │ └─_space_Star1
│ │ └─_space_Plus0
│ │ ├─_space_Plus0
... continue
Note that the default Display and Debug implementations print the whole tree recursively.
If you want to limit the depth of the printed tree, you can use the [Tree::pretty_print()] or [TreeList::pretty_print()] function with a max_level parameter.
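A rough sketch of how that might look (the exact `pretty_print()` signature is not shown on this page, so the argument form below is an assumption):

// Assumption: pretty_print() limits the printed depth to `max_level` levels.
let tree_list = context.to_tree_list();
tree_list.pretty_print(3); // hypothetical call; print at most 3 levels deep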
§GLR Parser
The GLR (Generalized LR) parser can be generated with the `%glr;` directive in the grammar.
// generate GLR parser;
// from now on, shift/reduce, reduce/reduce conflicts will not be treated as errors
%glr;
...

The GLR parser can handle ambiguous grammars that an LR(1) or LALR(1) parser cannot.
When it encounters any kind of conflict during parsing,
the parser diverges into multiple states and tries every path until it fails.
Of course, there must be a single unique path left at the end of parsing (the point where you feed the eof token).
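For example, a minimal sketch of an ambiguous grammar, following the directive style used earlier on this page (the token names and values are illustrative assumptions):

%glr;
%tokentype u8;
%start E;
%eof b'\0';

%token plus b'+';
%token num b'1';

// `num plus num plus num` can be grouped two ways, so an LR(1)/LALR(1)
// table would report a shift/reduce conflict; the GLR parser explores
// both paths instead.
E: E plus E
 | num
 ;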
§Resolving Ambiguities
You can resolve the ambiguities through the reduce actions.
Simply returning Result::Err(Error) from a reduce action will revoke the current path.
The Error variant type can be defined with the %err directive.
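As a rough sketch (the brace-delimited reduce action, the `%err String;` choice, and the helper `both_operands_are_invalid()` are illustrative assumptions; the exact syntax is documented in the repository):

%err String;

E: E plus E {
    // Hypothetical semantic check: returning Err revokes this GLR path,
    // leaving the other diverged paths to continue.
    if both_operands_are_invalid() {
        return Err("reject this parse".to_string());
    }
};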
§Note on GLR Parser
- Still in development and not tested enough (patches are welcome!).
- Since there are multiple paths, a reduce action can be called multiple times, even if its result is eventually thrown away.
- Every `RuleType` and `Term` must implement the `Clone` trait. `clone()` will be called carefully, only when there are multiple paths.
- Users must be aware of where shift/reduce or reduce/reduce conflicts occur; every time the parser diverges, the computation cost increases.
§Syntax
To start writing a context-free grammar, you need to define the necessary directives first. This is the syntax of the procedural macros:
lr1! {
// %directives
// %directives
// ...
// %directives
// NonTerminalSymbol(RuleType): ProductionRules
// NonTerminalSymbol(RuleType): ProductionRules
// ...
}

The lr1! macro will generate a parser struct with LR(1) DFA tables.
If you want to generate an LALR(1) parser, use the lalr1! macro instead.
Every line in the macro must follow the grammar syntax; the full syntax reference can be found in the repository.
Modules§
- module for building DFA tables from a CFG
- module for GLR parser
- module for LR(1), LALR(1) parser
Macros§
- Build a lalr1 Deterministic Finite Automaton (DFA) parser.
- Build a lr1 Deterministic Finite Automaton (DFA) parser.
Structs§
- Default error type for reduce action
- shifted rule with lookahead tokens
- set of lookahead rules
- Production rule.
- A struct for single shifted named production rule.
Enums§
- for resolving shift/reduce conflict
- Token represents a terminal or non-terminal symbol in the grammar.
Type Aliases§
- Type alias for a hash map that uses the Fx hashing algorithm.
- Type alias for a hash set that uses the Fx hashing algorithm.