rusty_lr_buildscript 0.2.0

# RustyLR
[![crates.io](https://img.shields.io/crates/v/rusty_lr.svg)](https://crates.io/crates/rusty_lr)
[![docs.rs](https://docs.rs/rusty_lr/badge.svg)](https://docs.rs/rusty_lr)

yacc-like LR(1) and LALR(1) Deterministic Finite Automata (DFA) generator from Context Free Grammar (CFGs).

RustyLR provides [procedural macros](#proc-macro) and [buildscript tools](#integrating-with-buildrs) to generate LR(1) and LALR(1) parser.
The generated parser will be a pure Rust code, and the calculation of building DFA will be done at compile time.
Reduce action can be written in Rust code,
and the error messages are [readable and detailed](#readable-error-messages-with-codespan).
For huge and complex grammars, it is recommended to use the [buildscipt](#integrating-with-buildrs).

#### `features` in `Cargo.toml`
 - `build` : Enable buildscript tools.
 - `fxhash` : In parser table, replace `std::collections::HashMap` with `FxHashMap` from [`rustc-hash`](https://github.com/rust-lang/rustc-hash).

### Example
```rust
// this define `EParser` struct
// where `E` is the start symbol
lr1! {
    %userdata i32;           // userdata type
    %tokentype char;         // token type
    %start E;                // start symbol
    %eof '\0';               // eof token

    // token definition
    %token zero '0';
    %token one '1';
    %token two '2';
    %token three '3';
    %token four '4';
    %token five '5';
    %token six '6';
    %token seven '7';
    %token eight '8';
    %token nine '9';
    %token plus '+';
    %token star '*';
    %token lparen '(';
    %token rparen ')';
    %token space ' ';

    // conflict resolving
    %left [plus star];                  // reduce first for token 'plus', 'star'

    // context-free grammars
    Digit(char): [zero-nine];           // character set '0' to '9'

    Number(i32)                         // type assigned to production rule `Number`
        : space* Digit+ space*          // regex pattern
    { Digit.into_iter().collect::<String>().parse().unwrap() }; 
    //    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ this will be the value of `Number`
                                        // reduce action written in Rust code

    A(f32): A plus a2=A {
        *data += 1;                     // access userdata by `data`
        println!( "{:?} {:?} {:?}", A, plus, a2 );
        A + a2
    }
        | M
        ;

    M(f32): M star m2=M { M * m2 }
        | P
        ;

    P(f32): Number { Number as f32 }
        | WS0 lparen E rparen WS0 { E }
        ;

    E(f32) : A ;
}
```
```rust
let parser = EParser::new();         // generate `EParser`
let mut context = parser.begin();    // create context
let mut userdata: i32 = 0;           // define userdata

let input_sequence = "1 + 2 * ( 3 + 4 )";

// start feeding tokens
for token in input_sequence.chars() {
    match parser.feed(&mut context, token, &mut userdata) {
        //                          ^^^^^   ^^^^^^^^^^^^ userdata passed here as `&mut i32`
        //                          feed token
        Ok(_) => {}
        Err(e) => {
            match e {
                EParseError::InvalidTerminal(invalid_terminal) => {
                    ...
                }
                EParseError::ReduceAction(error_from_reduce_action) => {
                    ...
                }
            }
            println!("{}", e);
            // println!( "{}", e.long_message( &parser, &context ) );
            return;
        }
    }
}
parser.feed(&mut context, '\0', &mut userdata).unwrap();    // feed `eof` token

let res = context.accept();   // get the value of start symbol
println!("{}", res);
println!("userdata: {}", userdata);
```

### Readable error messages (with [codespan](https://github.com/brendanzab/codespan))
![images/error1.png](images/error1.png)
![images/error2.png](images/error2.png)
 - This error message is generated by the buildscript tool, not the procedural macros.

## Features
 - pure Rust implementation
 - readable error messages, both for grammar building and parsing
 - compile-time DFA construction from CFGs
 - customizable reduce action
 - resolving conflicts of ambiguous grammar
 - regex patterns partially supported
 - tools for integrating with `build.rs`

## Contents
 - [Proc-macro](#proc-macro)
 - [Integrating with `build.rs`](#integrating-with-buildrs)
 - [Start Parsing](#start-parsing)
 - [Error Handling](#error-handling)
 - [Syntax](#syntax)


## proc-macro
Below procedural macros are provided:
 - `lr1!` : generate LR(1) parser
 - `lalr1!` : generate LALR(1) parser

These macros will generate structs:
 - `Parser` : contains DFA tables and production rules
 - `ParseError` : type alias for `Error` returned from `feed()`
 - `Context` : contains current state and data stack
 - `enum NonTerminals` : a list of non-terminal symbols
 - `Rule` : type alias for production rules
 - `State` : type alias for DFA states

All structs above are prefixed by `<StartSymbol>`.
In most cases, what you want is the `Parser` and `ParseError` structs, and the others are used internally.

## Integrating with `build.rs`
This buildscripting tool will provide much more detailed, pretty-printed error messages than the procedural macros.
If you are writing a huge, complex grammar, it is recommended to use buildscript than the procedural macros.
Generated code will contain the same structs and functions as the procedural macros. In your actual source code, you can `include!` the generated file.

The program searches for `%%` in the input file, not the `lr1!`, `lalr1!` macro.
The contents before `%%` will be copied into the output file as it is.
And the context-free grammar must be followed by `%%`.

```rust
// parser.rs
use some_crate::some_module::SomeStruct;

enum SomeTypeDef {
    A,
    B,
    C,
}

%% // <-- input file splitted here

%tokentype u8;
%start E;
%eof b'\0';

%token a b'a';
%token lparen b'(';
%token rparen b')';

E: lparen E rparen
 | P
 ;

P: a;
```

You must enable the feature `build` to use in the build script.
```toml
[build-dependencies]
rusty_lr = { version = "...", features = ["build"] }
```

```rust
// build.rs
use rusty_lr::build;

fn main() {
    println!("cargo::rerun-if-changed=src/parser.rs");

    let output = format!("{}/parser.rs", std::env::var("OUT_DIR").unwrap());
    build::Builder::new()
        .file("src/parser.rs") // path to the input file
    //  .lalr()                // to generate LALR(1) parser
        .build(&output);       // path to the output file
}
```

In your source code, include the generated file.
```rust
include!(concat!(env!("OUT_DIR"), "/parser.rs"));
```

## Start Parsing
The `Parser` struct has the following functions:
 - `new()` : create new parser
 - `begin(&self)` : create new context
 - `feed(&self, &mut Context, TerminalType, &mut UserData) -> Result<(), ParseError>` : feed token to the parser

Note that the parameter `&mut UserData` is omitted if [`%userdata`](#userdata-type-optional) is not defined.
All you need to do is to call `new()` to generate the parser, and `begin()` to create a context.
Then, you can feed the input sequence one by one with `feed()` function.
Once the input sequence is feeded (including `eof` token), without errors,
you can get the value of start symbol by calling `context.accept()`.

```rust
let parser = Parser::new();
let context = parser.begin();
for token in input_sequence {
    match parser.feed(&context, token) {
        Ok(_) => {}
        Err(e) => { // e: ParseError
            println!("{}", e);
            return;
        }
    }
}
let start_symbol_value = context.accept();
```

## Error Handling
There are two error variants returned from `feed()` function:
 - `InvalidTerminal(InvalidTerminalError)` : when invalid terminal symbol is fed
 - `ReduceAction(ReduceActionError)` : when the reduce action returns `Err(Error)`

For `ReduceActionError`, the error type can be defined by [`%err`](#error-type-optional) directive. If not defined, `String` will be used.

When printing the error message, there are two ways to get the error message:
 - `e.long_message( &parser, &context )` : get the error message as `String`, in a detailed format
 - `e as Display` : briefly print the short message through `Display` trait.

The `long_message` function requires the reference to the parser and the context.
It will make a detailed error message of what current state was trying to parse, and what the expected terminal symbols were.
### Example of long_message
```
Invalid Terminal: *. Expected one of:  , (, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
>>> In:
	M -> M * • M
>>> Backtrace:
	M -> M • * M
>>> Backtrace:
	A -> A + • A
>>> Backtrace:
	A -> A • + A
```


## Syntax
To start writing down a context-free grammar, you need to define necessary directives first.
This is the syntax of the procedural macros.

```rust
lr1! {
// %directives
// %directives
// ...
// %directives

// NonTerminalSymbol(RuleType): ProductionRules
// NonTerminalSymbol(RuleType): ProductionRules
// ...
}
```

`lr1!` macro will generate a parser struct with LR(1) DFA tables.
If you want to generate LALR(1) parser, use `lalr1!` macro.
Every line in the macro must follow the syntax below.

[Bootstrap](rusty_lr_parser/src/parser/parser.rs), [Expanded Bootstrap](rusty_lr_parser/src/parser/parser_expanded.rs) would be a good example to understand the syntax and generated code. It is RustyLR syntax parser written in RustyLR itself.

<details>
<summary>
<big> Click to expand the syntax </big>
</summary>

### Token type <sub><sup>(must defined)</sup></sub>
```
'%tokentype' <RustType> ';'
```
Define the type of terminal symbols.
`<RustType>` must be accessible at the point where the macro is called.

<details>
<summary>
Example
</summary>

```rust
enum MyTokenType<Generic> {
    Digit,
    Ident,
    ...
    VariantWithGeneric<Generic>
}

lr! {
...
%tokentype MyTokenType<i32>;
}
```

</details>


### Token definition <sub><sup>(must defined)</sup></sub>
```
'%token' <Ident> <RustExpr> ';'
```
Map terminal symbol's name `<Ident>` to the actual value `<RustExpr>`.
`<RustExpr>` must be accessible at the point where the macro is called.

<details>

<summary>
Example
</summary>


```rust
%tokentype u8;

%token zero b'0';
%token one b'1';

...

// 'zero' and 'one' will be replaced by b'0' and b'1' respectively
E: zero one;
```

</details>

### Start symbol <sub><sup>(must defined)</sup></sub>
```
'%start' <Ident> ';'
```
Define the start symbol of the grammar.

<details>
<summary>
Example
</summary>

```rust
%start E;
// this internally generate augmented rule <Augmented> -> E eof

E: ... ;
```

</details>

### Eof symbol <sub><sup>(must defined)</sup></sub>
```
'%eof' <RustExpr> ';'
```
Define the `eof` terminal symbol.
`<RustExpr>` must be accessible at the point where the macro is called.
'eof' terminal symbol will be automatically added to the grammar.

<details>
<summary>
Example
</summary>

```rust
%eof b'\0';
// you can access eof terminal symbol by 'eof' in the grammar
// without %token eof ...;
```

</details>

### Userdata type <sub><sup>(optional)</sup></sub>
```
'%userdata' <RustType> ';'
```
Define the type of userdata passed to `feed()` function.

<details>
<summary>
Example
</summary>

```rust
struct MyUserData { ... }

...

%userdata MyUserData;

...

fn main() {
    ...
    let mut userdata = MyUserData { ... };
    parser.feed( ..., token, &mut userdata); // <-- userdata feed here
}
```

</details>


### Reduce type <sub><sup>(optional)</sup></sub>
```
// reduce first
'%left' <Ident> ';'
'%left' <TerminalSet> ';'

// shift first
'%right' <Ident> ';'
'%right' <TerminalSet> ';'
```
Set the shift/reduce precedence for terminal symbols. `<Ident>` must be defined in `%token`.
With `<TerminalSet>`, you can define reduce type to multiple terminals at once. Please refer to the [Regex Pattern](#regex-pattern) section below.
`%left` can be abbreviated as `%reduce` or `%l`, and `%right` can be abbreviated as `%shift` or `%r`.

<details>
<summary>
Example
</summary>

```rust
// define tokens
%token plus '+';
%token hat '^';


// reduce first for token 'plus'
%left plus;

// shift first for token 'hat'
%right hat;
```

```rust
// define tokens
%token zero b'0';
%token one b'1';
...
%token nine b'9';

// shift first for tokens in range 'zero' to 'nine'
%shift [zero-nine];
```

</details>

### Production rules
```
<Ident><RuleType>
  ':' <TokenMapped>* <ReduceAction>
  '|' <TokenMapped>* <ReduceAction>
  ...
  ';'
```
Define the production rules.

```
<TokenMapped> : <Ident as var_name> '=' <TokenPattern>
              | <TokenPattern>
              ;
```
```
<TokenPattern> : <Ident as terminal or non-terminal>
               | <TerminalSet>
               | <TokenPattern> '*'    (zero or more)
               | <TokenPattern> '+'    (one or more)
               | <TokenPattern> '?'    (zero or one)
               ;
```

<details>
<summary>
Example
</summary>

This production rule defines non-terminal `E` to be `A`, then zero or more `plus`, then `D` mapped to variable `d`.
For more information, please refer to the [Accessing token data in ReduceAction](#accessing-token-data-in-reduceaction) section below.
```rust
E: A plus* d=D;
```

</details>

### Regex pattern
Regex patterns are partially supported. You can use `*`, `+`, `?` to define the number of repetitions, and `[]` to define the set of terminal symbols.

```
%token lparen '(';
%token rparen ')';
%token zero '0';
...
%token nine '9';

A: [zero-nine]+; // zero to nine

B: [^lparen rparen]; // any token except lparen and rparen

C: [lparen rparen one-nine]*; // lparen and rparen, and one to nine
```

Note that when using range pattern `[first-last]`,
the range is constructed by the order of the `%token` directives,
not by the actual value of the token.
If you define tokens in the following order:
```
%token one '1';
%token two '2';
...
%token zero '0';
%token nine '9';
```
The range `[zero-nine]` will be `['0', '9']`, not `['0'-'9']`.


### RuleType <sub><sup>(optional)</sup></sub>
```
<RuleType> : '(' <RustType> ')'
           |
           ;
```
Define the type of value that this production rule holds.

<details>
<summary>
Example
</summary>

```rust
E(MyType<...>): ... Tokens ... ;
```

</details>

### ReduceAction <sub><sup>(optional)</sup></sub>
```
<ReduceAction> : '{' <RustExpr> '}'
               |
               ;
```

Define the action to be executed when the rule is matched and reduced.

- If `<RuleType>` is defined, `<ReduceAction>` itself must be the value of `<RuleType>` (i.e. no semicolon at the end of the statement).

- `<ReduceAction>` can be omitted if:
  - `<RuleType>` is not defined
  - Only one token is holding value in the production rule ( Non-terminal symbol with `<RuleType>` defined, or terminal symbols are considered as holding value )

- `Result<(),Error>` can be returned from `<ReduceAction>`.
  - Returned `Error` will be delivered to the caller of `feed()` function.
  - `ErrorType` can be defined by `%err` or `%error` directive. See [Error type](#error-type-optional) section.

<details>
<summary>
Example
</summary>

Omitting `ReduceAction`:
```rust
NoRuleType: ... ;

RuleTypeI32(i32): ... { 0 } ;

// RuleTypeI32 will be chosen
E(i32): NoRuleType NoRuleType RuleTypeI32 NoRuleType;
```

Returning `Result<(),String>` from ReduceAction:
```rust
// set Err variant type to String
%err String;

%token div '/';

E(i32): A div a2=A {
    if a2 == 0 {
        return Err("Division by zero".to_string());
    }

    A / a2
};

A(i32): ... ;
```

</details>

### Accessing token data in ReduceAction

**predefined variables** can be used in `<ReduceAction>`:
 - `data` : userdata passed to `feed()` function.

To access the data of each token, you can directly use the name of the token as a variable.
For non-terminal symbols, the type of variable is `<RuleType>`.
For terminal symbols, the type of variable is `%tokentype`.

If multiple variables are defined with the same name, the variable on the front-most will be used.

For regex pattern, type of variable will be modified by following:
 | Pattern | Non-Terminal<br/>`<RuleType>=T` | Non-Terminal<br/>`<RuleType>=(not defined)` | Terminal<br/>TerminalSet |
 |:-------:|:--------------:|:--------------------------:|:--------:|
 | '*'     | `Vec<T>`       | (not defined)              | `Vec<TermType>` |
 | '+'     | `Vec<T>`       | (not defined)              | `Vec<TermType>` |
 | '?'     | `Option<T>`    | (not defined)              | `Option<TermType>` |

<details>
<summary>
Example
</summary>

```rust
%token plus ...;

// one or more 'A', then optional 'plus', then zero or more 'B'
E(f32) : A+ plus? b=B* minus_or_star=[minus star]
  {
    println!("Value of A: {:?}", A);         // Vec<i32>
    println!("Value of plus: {:?}", plus); // Option<TermType>
    println!("Value of b: {:?}", b);       // Vec<f32>
    println!("Value of minus_or_star: {:?}", minus_or_star); // must explicitly define the variable name

    let first_A = A[0];
    let first_B = b.first(); // Option<&f32>


    // this will be the new value of E
    if let Some(first_B) = first_B {
        let value = first_A as f32 + *first_B;
        value
    } else {
        first_a as f32
    }
  }
  ;

A(i32): ... ;
B(f32): ... ;
```

</details>



### Error type <sub><sup>(optional)</sup></sub>
```
'%err' <RustType> ';'
'%error' <RustType> ';'
```
Define the type of `Err` variant in `Result<(), Err>` returned from `<ReduceAction>`. If not defined, `String` will be used.

<details>
<summary>
Example
</summary>

```rust
enum MyErrorType<T> {
    ErrVar1,
    ErrVar2,
    ErrVar3(T),
}

...


%err MyErrorType<GenericType> ;

...

match parser.feed( ... ) {
    Ok(_) => {}
    Err(err) => {
        match err {
            ParseError::ReduceAction( err ) => {
                // do something with err
            }
            _ => {}
        }
    }
}
```

</details>

### Exclamation mark `!`
An exclamation mark `!` can be used right after the token to ignore the value of the token.
The token will be treated as if it is not holding any value.

**Tip**
When combining with repeatance pattern `*`, `+`, `?`, use `!` first.
It can prevent `Vec<T>` built from the value of the token internally.

<details>
<summary>
Example
</summary>

```rust
%token plus ...;

A(i32) : ... ;

// A in the middle will be chosen, since other A's are ignored
E(i32) : A! A A!;

B: A*!; // Vec<i32> will be built from the value of A, and then ignored

C: A!*; // A will be ignored first, and then repeatance pattern will be applied
```

</details>


### Derive <sub><sup>(optional)</sup></sub>
Specify the derive attributes for the generated `Context` struct.
By default, the generated `Context` does not implement any traits.
But in some cases, you may want to derive traits like `Clone`, `Debug`, or `Serialize`, `Deserialize` of `serde`.

You can specify the derive attributes by `%derive` directive.
In this case, user must ensure that every member of the `Context` must implement the trait.
Currently, `Context` is holding the stack data,
which is `Vec<usize>` for state stack and `Vec<T>` for every `<RuleType>` in the grammar.

```
'%derive' <DeriveAttributes> ';'
```

<details>
<summary>
Example
</summary>

```rust
// here, #[derive(Clone,Debug)] will be added to the generated `Context` struct
%derive Clone, Debug;

...

let mut context = parser.begin();
// do something with context...

println!( "{:?}", context );          // debug-print context
let cloned_context = context.clone(); // clone context, you can re-feed the input sequence using cloned context
```

</details>



</details>