rusty_lr 1.6.0

yacc-like, LR(1) and LALR(1) parser generator and code generation

RustyLR

A yacc-like LR(1) and LALR(1) Deterministic Finite Automaton (DFA) generator for Context-Free Grammars (CFGs).

RustyLR provides both an executable and procedural macros to generate LR(1) and LALR(1) parsers.

[dependencies]
rusty_lr = "1.6.0"

Enable features=["fxhash"] to replace std::collections::HashMap with FxHashMap.

Simple definition of CFG

lr1! {
    // userdata type
    %userdata i32;
    // token type
    %tokentype char;
    // start symbol
    %start E;
    // eof symbol
    %eof '\0';

    // token definition
    %token zero '0';
    %token one '1';
    %token two '2';
    %token three '3';
    %token four '4';
    %token five '5';
    %token six '6';
    %token seven '7';
    %token eight '8';
    %token nine '9';
    %token plus '+';
    %token star '*';
    %token lparen '(';
    %token rparen ')';
    %token space ' ';

    // conflict resolving
    %left plus;
    %left star;

    // context-free grammars
    WS0: space*;

    Digit(char): [zero-nine];

    Number(i32): WS0 Digit+ WS0 { Digit.into_iter().collect::<String>().parse().unwrap() };

    A(f32): A plus a2=A {
        *data += 1; // access userdata by `data`
        println!( "{:?} {:?} {:?}", A, plus, a2 );
        A + a2 // this will be the new value of A
    }
        | M
        ;

    M(f32): M star m2=M { M * m2 }
        | P
        ;

    P(f32): Number { Number as f32 }
        | WS0 lparen E rparen WS0 { E }
        ;

    E(f32) : A ;
}
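
The Number action above relies on Digit+ handing the action a Vec<char>. As a standalone sketch in plain Rust (the helper name digits_to_i32 is made up for illustration and is not part of RustyLR), the same expression behaves roughly like this:

```rust
/// Hypothetical helper mirroring the body of the `Number` reduce action:
/// `Digit+` yields a `Vec<char>`, which is joined into a String and parsed.
fn digits_to_i32(digits: Vec<char>) -> i32 {
    digits
        .into_iter()
        .collect::<String>() // ['4', '2'] -> "42"
        .parse()             // "42" -> 42; fails on non-digit input or overflow
        .unwrap()            // panics on failure, as the action in the grammar would
}
```

In a real grammar you may prefer to report an error from the action instead of calling unwrap(), since a long digit sequence can overflow i32.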

Readable error messages (with codespan)

[Screenshots of sample error messages: images/error1.png, images/error3.png]

Features

  • pure Rust implementation
  • readable error messages, both for grammar building and parsing
  • compile-time DFA construction from CFGs
  • customizable reduce action
  • resolving conflicts of ambiguous grammar
  • tracing parser action with callback
  • regex patterns partially supported
  • executable for generating parser tables from CFGs

Usage

proc-macro syntax

The following procedural macros are provided:

  • lr1!, lalr1!

These macros define two structs, Parser and Context, and one enum, NonTerminals, each prefixed by <StartSymbol>. In most cases, what you want is the Parser struct, which contains the DFA states and the feed() functions. Please refer to the Start Parsing section below for actual usage of the Parser struct.

These macros (the ones without the '_runtime' suffix) generate the Parser struct at compile time. The DFA is also built at compile time, so the generated code consists of a very large number of statements that insert tokens one by one.

Bootstrap and Expanded Bootstrap are good examples for understanding the syntax and the generated code. They are the RustyLR syntax parser written in RustyLR itself.

Every line in the macro must follow the syntax below.

Token type (must be defined)

'%tokentype' <RustType> ';'

Define the type of terminal symbols. <RustType> must be accessible at the point where the macro is called.

enum MyTokenType<Generic> {
    Digit,
    Ident,
    ...
    VariantWithGeneric(Generic)
}

lr1! {
...
%tokentype MyTokenType<i32>;
}

Token definition (must be defined)

'%token' <Ident> <RustExpr> ';'

Map terminal symbol's name <Ident> to the actual value <RustExpr>. <RustExpr> must be accessible at the point where the macro is called.

lr1! {
%tokentype u8;

%token zero b'0';
%token one b'1';

...

// 'zero' and 'one' will be replaced by b'0' and b'1' respectively
E: zero one;
}

Start symbol (must be defined)

'%start' <Ident> ';'

Define the start symbol of the grammar.

lr1! {
%start E;
// this internally generates the augmented rule <Augmented> -> E eof

E: ... ;
}

Eof symbol (must be defined)

'%eof' <RustExpr> ';'

Define the eof terminal symbol. <RustExpr> must be accessible at the point where the macro is called. The 'eof' terminal symbol is added to the grammar automatically.

lr1! {
%eof b'\0';
// you can access eof terminal symbol by 'eof' in the grammar
// without %token eof ...;
}

Userdata type (optional)

'%userdata' <RustType> ';'

Define the type of the userdata passed to the feed() function.

struct MyUserData { ... }

lr1! {
...
%userdata MyUserData;
}

...

fn main() {
    ...
    let mut userdata = MyUserData { ... };
    parser.feed( ..., token, &mut userdata); // <-- userdata is passed here
}

Reduce type (optional)

// reduce first
'%left' <Ident> ';'
'%left' <TerminalSet> ';'

// shift first
'%right' <Ident> ';'
'%right' <TerminalSet> ';'

Set the shift/reduce precedence for terminal symbols. <Ident> must be defined in %token. With <TerminalSet>, you can set the reduce type for multiple terminals at once. Please refer to the Regex pattern section below. %left can also be written as %reduce or %l, and %right as %shift or %r.

lr1! {
// define tokens
%token plus '+';
%token hat '^';


// reduce first for token 'plus'
%left plus;

// shift first for token 'hat'
%right hat;
}

lr1! {
// define tokens
%token zero b'0';
%token one b'1';
...
%token nine b'9';

// shift first for tokens in range 'zero' to 'nine'
%shift [zero-nine];
}
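
The choice between reduce first (%left) and shift first (%right) decides how a chain such as a ^ b ^ c is grouped. The sketch below is plain Rust and does not use RustyLR; the two function names are invented for illustration, and exponentiation stands in for the 'hat' token:

```rust
/// Reduce first (`%left`): `a ^ b ^ c` groups as `(a ^ b) ^ c`.
fn fold_reduce_first(operands: &[u32]) -> u32 {
    let mut iter = operands.iter().copied();
    let first = iter.next().expect("at least one operand");
    // Each new operand is applied to everything reduced so far.
    iter.fold(first, |acc, x| acc.pow(x))
}

/// Shift first (`%right`): `a ^ b ^ c` groups as `a ^ (b ^ c)`.
fn fold_shift_first(operands: &[u32]) -> u32 {
    let mut iter = operands.iter().rev().copied();
    let last = iter.next().expect("at least one operand");
    // Operands are combined starting from the right-most pair.
    iter.fold(last, |acc, x| x.pow(acc))
}
```

For the operands 2, 3, 2 the first function computes (2^3)^2 = 64 and the second computes 2^(3^2) = 512, which is why the reduce type matters for operators that are not associative.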

Production rules

<Ident><RuleType>
  ':' <TokenMapped>* <ReduceAction>
  '|' <TokenMapped>* <ReduceAction>
  ...
  ';'

Define the production rules.

<TokenMapped> : <Ident as var_name> '=' <TokenPattern>
              | <TokenPattern>
              ;
<TokenPattern> : <Ident as terminal or non-terminal>
               | <TerminalSet>
               | <TokenPattern> '*'    (zero or more)
               | <TokenPattern> '+'    (one or more)
               | <TokenPattern> '?'    (zero or one)
               ;

The production rule below defines the non-terminal E as A, followed by zero or more plus, followed by D bound to the variable d. For more information, please refer to the Accessing token data in ReduceAction section below.

lr1! {
E: A plus* d=D;
}

Regex pattern

Regex patterns are partially supported. You can use *, +, ? to define the number of repetitions, and [] to define the set of terminal symbols.

%token lparen '(';
%token rparen ')';
%token zero '0';
...
%token nine '9';

A: [zero-nine]+; // zero to nine

B: [^lparen rparen]; // any token except lparen and rparen

C: [lparen rparen one-nine]*; // lparen and rparen, and one to nine

Note that when using the range pattern [first-last], the range is constructed from the order of the %token directives, not from the actual values of the tokens. If you define tokens in the following order:

%token one '1';
%token two '2';
...
%token zero '0';
%token nine '9';

The range [zero-nine] will be ['0', '9'], not ['0'-'9'].
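
To make the ordering rule concrete, here is a minimal sketch of a range expansion driven by declaration order. It is not RustyLR's implementation: the function expand_range is hypothetical, and returning None when first is declared after last is an assumption made only for this example.

```rust
/// Expand `[first-last]` over tokens listed in `%token` declaration order.
/// `decls` holds (name, value) pairs in the order they were declared.
/// Returns `None` if a name is unknown or `first` is declared after `last`
/// (an assumption of this sketch, not documented RustyLR behavior).
fn expand_range(decls: &[(&str, char)], first: &str, last: &str) -> Option<Vec<char>> {
    let start = decls.iter().position(|(name, _)| *name == first)?;
    let end = decls.iter().position(|(name, _)| *name == last)?;
    if start > end {
        return None;
    }
    // The token values play no role in the lookup; only the positions do.
    Some(decls[start..=end].iter().map(|(_, value)| *value).collect())
}
```

With the declarations one, two, zero, nine (in that order), expand_range(&decls, "zero", "nine") yields only '0' and '9', matching the behavior described above.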

RuleType (optional)

<RuleType> : '(' <RustType> ')'
           |
           ;

Define the type of value that this production rule holds.

lr1! {
E(MyType<...>): ... Tokens ... ;
}

ReduceAction (optional)

<ReduceAction> : '{' <RustExpr> '}'
               |
               ;

Define the action to be executed when the rule is matched and reduced.

  • If <RuleType> is defined, <ReduceAction> itself must be the value of <RuleType> (i.e. no semicolon at the end of the statement).

  • <ReduceAction> can be omitted if:

    • <RuleType> is not defined
    • Only one token in the production rule holds a value (a non-terminal symbol with <RuleType> defined, or a terminal symbol, counts as holding a value)

  • Result<(),Error> can be returned from <ReduceAction>.

    • The returned Error is delivered to the caller of the feed() function.
    • The error type can be defined with the %err or %error directive. See the Error type section.

Omitting ReduceAction:

lr1! {
NoRuleType: ... ;

RuleTypeI32(i32): ... { 0 } ;

// RuleTypeI32 will be chosen
E(i32): NoRuleType NoRuleType RuleTypeI32 NoRuleType;
}

Returning Result<(),String> from ReduceAction:

lr1! {
// set Err variant type to String
%err String;

%token div '/';

E(i32): A div a2=A {
    if a2 == 0 {
        return Err("Division by zero".to_string());
    }

    A / a2
};

A(i32): ... ;
}
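
The control flow of that action can be tried outside the macro. The function below is only a plain-Rust counterpart written for illustration (reduce_div, lhs and rhs are hypothetical names); the explicit Ok(...) is needed here because it is an ordinary function, whereas the action above ends with the bare value.

```rust
/// Plain-function counterpart of the `E: A div a2=A` reduce action.
/// `lhs` and `rhs` stand in for the values bound to `A` and `a2`.
fn reduce_div(lhs: i32, rhs: i32) -> Result<i32, String> {
    if rhs == 0 {
        // Corresponds to the early `return Err(...)` in the action.
        return Err("Division by zero".to_string());
    }
    // Corresponds to the final expression `A / a2`.
    Ok(lhs / rhs)
}
```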

Accessing token data in ReduceAction

Predefined variables can be used in <ReduceAction>:

  • data : userdata passed to the feed() function.

To access the data of each token, you can directly use the name of the token as a variable. For non-terminal symbols, the type of the variable is <RuleType>. For terminal symbols, the type of the variable is %tokentype.

If multiple variables are defined with the same name, the front-most one is used.

For regex patterns, the type of the variable changes as follows:

Pattern  Non-Terminal, <RuleType>=T  Non-Terminal, <RuleType> not defined  Terminal or TerminalSet
'*'      Vec<T>                      (not defined)                         Vec<TermType>
'+'      Vec<T>                      (not defined)                         Vec<TermType>
'?'      Option<T>                   (not defined)                         Option<TermType>

lr1! {
%token plus ...;

// one or more 'A', then optional 'plus', then zero or more 'B'
E(f32) : A+ plus? b=B* minus_or_star=[minus star]
  {
    println!("Value of A: {:?}", A);         // Vec<i32>
    println!("Value of plus: {:?}", plus); // Option<TermType>
    println!("Value of b: {:?}", b);       // Vec<f32>
    println!("Value of minus_or_star: {:?}", minus_or_star); // must explicitly define the variable name

    let first_A = A[0];
    let first_B = b.first(); // Option<&f32>


    // this will be the new value of E
    if let Some(first_B) = first_B {
        let value = first_A as f32 + *first_B;
        value
    } else {
        first_A as f32
    }
  }
  ;

A(i32): ... ;
B(f32): ... ;
}
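
The types in that action can be checked in isolation. The following sketch restates the body of the E action as an ordinary function; reduce_e and its parameter names are invented for this example, and char is assumed as the token type.

```rust
/// Stand-in for the `E` action: `a` plays the role of `A+` (Vec<i32>),
/// `plus` of `plus?` (Option of the token type, `char` assumed here),
/// and `b` of `b=B*` (Vec<f32>).
fn reduce_e(a: Vec<i32>, plus: Option<char>, b: Vec<f32>) -> f32 {
    // `A+` matches at least once, so indexing the first element cannot fail.
    let first_a = a[0];
    // `plus?` may or may not have matched; its value is not needed here.
    let _ = plus;
    // `B*` may match zero times, so use `first()` instead of indexing.
    match b.first() {
        Some(first_b) => first_a as f32 + *first_b,
        None => first_a as f32,
    }
}
```

Indexing with [0] is reasonable only for '+' patterns; for '*' patterns the Vec may be empty, which is why the sketch goes through first().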

Error type (optional)

'%err' <RustType> ';'
'%error' <RustType> ';'

Define the type of Err variant in Result<(), Err> returned from <ReduceAction>. If not defined, String will be used.

enum MyErrorType<T> {
    ErrVar1,
    ErrVar2,
    ErrVar3(T),
    ...
}

lr1! {

%err MyErrorType<GenericType> ;

}

...

match parser.feed( ... ) {
    Ok(_) => {}
    Err(err) => {
        match err {
            ParseError::ReduceAction( err ) => {
                // do something with err
            }
            _ => {}
        }
    }
}

Start Parsing

<StartSymbol>Parser will be generated by the procedural macros.

The parser struct has the following functions:

  • new() : create new parser
  • begin(&self) : create new context
  • feed(&self, &mut Context, TermType, &mut UserData) -> Result<(), ParseError> : feed token to the parser
  • feed_callback(&self, &mut Context, &mut C: Callback, TermType, &mut UserData) -> Result<(), ParseError> : feed token with callback

Note that the parameter &mut UserData is omitted if %userdata is not defined. Once the whole input sequence (including the eof token) has been fed without errors, you can get the value of the start symbol by calling context.accept().

let parser = parser::EParser::new();
// create context
let mut context = parser.begin();
// define userdata
let mut userdata: i32 = 0;

// start feeding tokens
for token in input_sequence {
    match parser.feed(&mut context, token, &mut userdata) {
        //                          ^^^^^   ^^^^^^^^^^^^ userdata passed here as `&mut i32`
        //                           |- feed token
        Ok(_) => {}
        Err(e) => {
            println!("{}", e);
            // println!( "{}", e.long_message() ); // for more detailed error message
            return;
        }
    }
}
// res = value of start symbol
let res = context.accept();
println!("{}", res);
println!("userdata: {}", userdata);
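
The loop above iterates over input_sequence, which the snippet does not define. For a grammar with %tokentype char and %eof '\0', as in the first example, one way to build it is sketched below; tokens_with_eof is a hypothetical helper, not a RustyLR function.

```rust
/// Turn source text into the token sequence for a `%tokentype char` grammar,
/// appending the `%eof` value ('\0' in the first example) as the last token.
fn tokens_with_eof(source: &str) -> Vec<char> {
    source
        .chars()
        .chain(std::iter::once('\0')) // the eof token must be fed last
        .collect()
}
```

For example, tokens_with_eof("1+2") produces the four tokens '1', '+', '2' and '\0', which can then be passed to feed() one by one.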

Parse with callback

To trace parser actions, implement the Callback trait and pass it to parser.feed_callback().

struct ParserCallback {}

impl rusty_lr::Callback<Term, NonTerm> for ParserCallback {
    /// Error type for callback
    type Error = String;

    fn reduce(
        &mut self,
        rules: &[rusty_lr::ProductionRule<Term, NonTerm>],
        states: &[rusty_lr::State<Term, NonTerm>],
        state_stack: &[usize],
        rule: usize,
    ) -> Result<(), Self::Error> {
        // `ProductionRule` is Display if both Term and NonTerm are Display
        println!("Reduce by {}", rules[rule]);
        Ok(())
    }
    fn shift_and_goto(
        &mut self,
        rules: &[rusty_lr::ProductionRule<Term, NonTerm>],
        states: &[rusty_lr::State<Term, NonTerm>],
        state_stack: &[usize],
        term: &Term,
    ) -> Result<(), Self::Error> {
        Ok(())
    }
    fn shift_and_goto_nonterm(
        &mut self,
        rules: &[rusty_lr::ProductionRule<Term, NonTerm>],
        states: &[rusty_lr::State<Term, NonTerm>],
        state_stack: &[usize],
        nonterm: &NonTerm,
    ) -> Result<(), Self::Error> {
        Ok(())
    }
}

Note that the generic types Term and NonTerm must be replaced with actual types. Term must be the same as %tokentype, and NonTerm must be the <StartSymbol>NonTerminals enum generated by the procedural macro.

// Num + Num * ( Num + Num )
let terms = vec![ Term::Num, Term::Plus, Term::Num, Term::Mul, Term::LeftParen, Term::Num, Term::Plus, Term::Num, Term::RightParen, Term::Eof];

// start parsing
let mut context = parser.begin();
let mut callback = ParserCallback {};

// feed input sequence
for term in terms {
    match parser.feed_callback(&mut context, &mut callback, term) {
        Ok(_) => (),
        Err(err) => {
            match err {
                rusty_lr::ParseError::Callback(err) => {
                    eprintln!("{}", err);
                }
                _ => {}
            }
            return;
        }
    }
}

The result will be:

Reduce by P -> Num
Reduce by M -> P
Reduce by A -> M
Reduce by P -> Num
Reduce by M -> P
Reduce by P -> Num
Reduce by M -> P
Reduce by A -> M
Reduce by P -> Num
Reduce by M -> P
Reduce by A -> M
Reduce by A -> A + A
Reduce by E -> A
Reduce by P -> ( E )
Reduce by M -> P
Reduce by M -> M * M
Reduce by A -> M
Reduce by A -> A + A
Reduce by E -> A

executable rustylr

An executable version of the lr1! and lalr1! macros. It converts a context-free grammar into deterministic finite automaton (DFA) tables and generates Rust code that can be used as a parser for that grammar.

This executable provides much more detailed, pretty-printed error messages than the procedural macros. If you are writing a huge, complex grammar, it is recommended to use this executable rather than the procedural macros. For small grammars the proc-macros are more convenient, since modern IDE features (rust-analyzer's auto completion, inline error messages) remain enabled.

This program searches for %% in the input file.

The contents before %% are copied into the output file as is. The context-free grammar must follow %%. Each line of the grammar must follow the syntax described in the proc-macro syntax section above.

Here for more information.

// my_grammar.rs
use some_crate::some_module::SomeStruct;

enum SomeTypeDef {
    A,
    B,
    C,
}

%% // <-- input file split here

%tokentype u8;
%start E;
%eof b'\0';

%token a b'a';
%token lparen b'(';
%token rparen b')';

E: lparen E rparen
 | P
 ;

P: a;

Calling the command below generates the Rust source file my_parser.rs.

$ rustylr my_grammar.rs my_parser.rs
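
The role of the %% marker can be pictured with a few lines of plain Rust. This is only a sketch of the idea, not the tool's actual code: split_at_marker is a hypothetical name, and how the real executable treats text on the marker line itself (such as the comment in the example above) is not covered here.

```rust
/// Split a grammar file at the first `%%` marker.
/// Returns (Rust code copied through unchanged, grammar text),
/// or `None` if the input contains no marker.
fn split_at_marker(input: &str) -> Option<(&str, &str)> {
    input.split_once("%%")
}
```

Directives such as %tokentype contain a single %, so they do not collide with the marker in this sketch.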

Build Deterministic Finite Automata (DFA) from Context Free Grammar (CFG)

This section describes the core library and how to build a DFA from CFGs at runtime.

1. Define terminal and non-terminal symbols

#[derive(Clone, Hash, PartialEq, Eq, PartialOrd, Ord)] // must implement these traits
pub enum Term {
    Num,
    Plus,
    Mul,
    LeftParen,
    RightParen,
    Eof,
}

#[derive(Clone, Hash, PartialEq, Eq, PartialOrd, Ord)] // must implement these traits
pub enum NonTerm {
    E,
    A,
    M,
    P,
    Augmented,
}

/// Implementing Display for Term and NonTerm makes the related ProductionRule and error messages Display-able
impl Display for Term { ... }
impl Display for NonTerm { ... }

Or simply, you can use char or u8 as terminal, and &'static str or String as non-terminal. Any type that implements traits above can be used as terminal and non-terminal symbols.
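
The Display implementations are elided above. One possible version for Term is sketched here; the chosen spellings ("+", "(", "$" and so on) are illustrative assumptions, picked so that rules print in the style A -> A + A.

```rust
use std::fmt;

#[derive(Clone, Hash, PartialEq, Eq, PartialOrd, Ord)]
pub enum Term {
    Num,
    Plus,
    Mul,
    LeftParen,
    RightParen,
    Eof,
}

impl fmt::Display for Term {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        // Spellings are illustrative; pick whatever reads well in messages.
        let text = match self {
            Term::Num => "Num",
            Term::Plus => "+",
            Term::Mul => "*",
            Term::LeftParen => "(",
            Term::RightParen => ")",
            Term::Eof => "$",
        };
        f.write_str(text)
    }
}
```

NonTerm can be handled the same way, typically by printing the variant name.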

2. Define production rules

Consider the following context free grammar.

A -> A + A  (reduce left)
A -> M

This grammar can be written as:

/// type alias
type Token = rusty_lr::Token<Term, NonTerm>;

/// create grammar
let mut grammar = rusty_lr::Grammar::<Term,NonTerm>::new();

grammar.add_rule(
    NonTerm::A,
    vec![Token::NonTerm(NonTerm::A), Token::Term(Term::Plus), Token::NonTerm(NonTerm::A)],
);
grammar.add_rule(
    NonTerm::A,
    vec![Token::NonTerm(NonTerm::M)],
);

/// set reduce type
grammar.set_reduce_type( Term::Plus, ReduceType::Left );

Note that the production rule A -> A + A has a shift/reduce conflict. To resolve this conflict, a shift/reduce precedence is assigned to the terminal symbol Plus. Left means that for the Plus token, the parser will reduce the rule instead of shifting the token.

A reduce/reduce conflict (e.g. duplicated rules) is always an error.

3. Build DFA

Calling grammar.build() for LR(1) or grammar.build_lalr() for LALR(1) will build the DFA from the CFGs.

let parser:rusty_lr::Parser<Term,NonTerm> = match grammar.build(NonTerm::Augmented) {
    Ok(parser) => parser,
    Err(err) => {
        // error is Display if Term, NonTerm is Display
        eprintln!("{}", err);
        return;
    }
};

You must explicitly specify the Augmented non-terminal symbol, and the Augmented production rule must be defined in the grammar.

Augmented -> StartSymbol $

The returned Parser struct contains the DFA states and the production rules (cloned). It is completely independent of the Grammar, so you can drop the Grammar struct, or export the Parser struct to another module.

4. Error messages

The Error type returned from Grammar::build() contains the error information. You can manually match on the error type to produce a custom error message, but in most cases println!("{}", err) is enough to see the detailed errors. Error is Display if both Term and NonTerm are Display, and it is Debug if both Term and NonTerm are Debug.

Sample error messages

For Shift/Reduce conflicts,

Build failed: Shift/Reduce Conflict
NextTerm: '0'
Reduce Rule:
"Num" -> "Digit"
Shift Rules:
"Digit" -> '0' • /Lookaheads: '\0', '0'
Try rearanging the rules or set ReduceType to Terminal '0' to resolve the conflict.

For Reduce/Reduce conflicts,

Build failed: Reduce/Reduce Conflict with lookahead: '\0'
Production Rule1:
"Num" -> "Digit"
Production Rule2:
"Num" -> "Digit"

Parse input sequence with generated DFA

For a given input sequence, you can start parsing with the Parser::begin() method. Once you get the Context from begin(), you can feed the input sequence one token at a time with the Parser::feed() method.

let terms = vec![ Term::Num, Term::Plus, Term::Num, Term::Mul, Term::LeftParen, Term::Num, Term::Plus, Term::Num, Term::RightParen, Term::Eof];

// start parsing
let mut context = parser.begin();

// feed input sequence
for term in terms {
    match parser.feed(&mut context, term) {
        Ok(_) => (),
        Err(err) => {
            eprintln!("{:?}", err);
            return;
        }
    }
}

The EOF token is fed at the end of the sequence, and the augmented rule Augmented -> StartSymbol $ will not be reduced since there are no lookahead symbols.