gazelle-parser 0.3.0

LR parser generator with runtime operator precedence and natural lexer feedback
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
# Gazelle Reference

Complete reference for the Gazelle parser generator.

## Project Status

**Working and tested:**
- Grammar definition (`.gzl` files and `gazelle!` macro)
- Minimal LR table generation
- Type-safe parser generation with `Types`/`Reducer` traits
- Precedence terminals (`prec`) for runtime operator precedence
- Modifiers: `?` (optional), `*` (zero+), `+` (one+), `%` (separated list)
- Expected conflict declarations (`expect N rr/sr`)
- Token range tracking for source spans
- Push-based parsing (you control the loop)
- Detailed error messages with parser state context
- Automatic error recovery (Dijkstra-based minimum-cost repair)

**Not yet implemented:**
- Debug dump via `GAZELLE_DEBUG` env var
- Grammar visualization (FIRST/FOLLOW sets)

**Tested on:**
- Calculator with user-defined operators
- C11 parser (complete grammar, 18 tests)

## Table of Contents

- [How It Works]#how-it-works
- [Grammar Syntax]#grammar-syntax
- [The gazelle! Macro]#the-grammar-macro
- [Generated Types]#generated-types
- [Using the Parser]#using-the-parser
- [Advanced Features]#advanced-features

---

## How It Works

Most parser generators either embed semantic actions in the grammar (yacc-style `$$` / `$1`) or produce a generic, type-erased concrete syntax tree you have to walk later. Gazelle does neither. It produces an abstract syntax tree — abstract because it's not presented as a data structure, but as a **pattern of calls**.

The AST nodes are enum variants that arise naturally from the grammar — one enum per non-terminal, one variant per alternative. During parsing, every time a rule is reduced, Gazelle calls your `Reducer` with a node containing the children's already-reduced values. This sequence of calls *is* a post-order traversal of the parse tree — the tree is abstractly present without ever being materialized.

You provide the (possibly stateful) mapping from each node to its output via the `Reducer` trait. A `Types` trait declares the output type per symbol. What you return is up to you: a computed value, a tree node, or nothing at all.

### Direct evaluation

Since reductions see already-reduced children, you can fold the tree into values as it's parsed — no intermediate tree:

```rust
impl calc::Types for Evaluator {
    type Error = ParseError;
    type Num = f64;
    type Op = char;
    type Expr = f64;  // expressions reduce to numbers
}

impl Reducer<calc::Expr<Self>> for Evaluator {
    fn reduce(&mut self, node: calc::Expr<Self>) -> Result<f64, ParseError> {
        // node.0 is f64, node.1 is char, node.2 is f64
        Ok(match node {
            calc::Expr::Binop(l, op, r) => match op {
                '+' => l + r, '*' => l * r, _ => 0.0,
            },
            calc::Expr::Literal(n) => n,
        })
    }
}
```

### Materializing a tree

If you *do* want a tree, set associated types to the node enums themselves. Each reduction stores its node, and the tree materializes through the normal reduction flow — no custom `Reducer` needed:

```rust
impl calc::Types for CstBuilder {
    type Error = ParseError;
    type Num = f64;
    type Op = char;
    type Expr = Box<calc::Expr<Self>>;  // recursive, needs Box
}
// No Reducer impl — blanket handles it
```

Gazelle supports a few blanket reductions that allow you to generate a CST or just validate syntax without writing `Reducer` impls:
- **Identity**: `type Expr = calc::Expr<Self>` — node passes through unchanged (CST)
- **Box**: `type Expr = Box<calc::Expr<Self>>` — node is auto-boxed (CST with recursive types)
- **Ignore**: `type Expr = Ignore` — node is discarded (validation only)

You only write a custom `Reducer` when you need custom logic.

### Why Box is needed for recursive types

The generated enum for a recursive rule like `expr = expr OP expr => binop | NUM => literal` contains itself:

```rust
pub enum Expr<A: Types> {
    Binop(A::Expr, A::Op, A::Expr),  // contains A::Expr
    Literal(A::Num),
}
```

If `type Expr = calc::Expr<Self>`, the type is infinitely sized — `Expr` contains `Expr` contains `Expr`... Rust won't allow this. Wrapping in `Box` breaks the cycle: `type Expr = Box<calc::Expr<Self>>` gives the compiler a known size (one pointer). The auto-box blanket handles the wrapping automatically.

Non-recursive types (like a `Statement` that contains an `Expr` but not another `Statement`) don't need boxing and can use identity directly.

### Mix and match

You can mix strategies in one implementation — evaluate some nodes, build trees for others, ignore the rest:

```rust
impl my_grammar::Types for MyActions {
    type Expr = i64;                           // evaluate
    type Statement = Box<my_grammar::Statement<Self>>; // build tree (boxed)
    type Comment = Ignore;                     // discard
    // ...
}
// Only need Reducer for Expr (custom logic)
// Statement and Comment are handled by blankets
```

---

## Grammar Syntax

Gazelle grammars can be written in `.gzl` files or inline with the `gazelle!` macro.

### Basic Structure

`.gzl` files contain the grammar directly:

```
start rule_name;

terminals {
    // terminal declarations
}

// rule definitions
```

In the `gazelle!` macro, the grammar is wrapped with `grammar Name { ... }`:

```rust
gazelle! {
    grammar Name {
        start rule_name;
        terminals { ... }
        // rule definitions
    }
}
```

### Terminal Declarations

Terminals are tokens from your lexer. Declare them in the `terminals` block:

```
terminals {
    // Simple terminal (no payload, becomes () in generated code)
    LPAREN,
    RPAREN,

    // Terminal with payload — `: _` generates an associated type
    // named after the terminal (NUM → type Num, IDENT → type Ident)
    NUM: _,
    IDENT: _,

    // Precedence terminal (for operator precedence parsing)
    prec OP: _,         // Carries precedence at runtime
    prec BINOP,         // Prec terminal without payload
}
```

**Terminal naming convention:** Terminals should be UPPERCASE to distinguish from non-terminals.

### Rule Definitions

Rules define the grammar's structure. Every alternative requires `=> name`:

```
// Single alternative
expr = expr PLUS term => add;

// Multiple alternatives (separated by |)
expr = expr PLUS term => add
     | expr MINUS term => sub
     | term => term;
```

### Actions and Enum Generation

Each `=> name` generates an enum variant. Untyped symbols (no `: Type`) are omitted from variant fields:

```
expr = expr PLUS term => add;
// Generates enum variant: Expr::Add(A::Expr, A::Term)
// PLUS is untyped, so it's omitted — only typed symbols become fields
```

**Untyped rules** - if the non-terminal has no type annotation, output is `()`. The `Reducer` is still called for side effects:

```
// statement has no type annotation — output is ()
statement = expr SEMI => on_statement;
// Generates: Statement::OnStatement(A::Expr)
// Reducer<Statement<Self>> returns Result<(), Error>
```

### Modifiers

**Optional** (`?`) - zero or one:
```
trailing_comma = COMMA?;
// Generates Option<T> where T is the symbol's type
```

**Repetition** (`*`) - zero or more:
```
args = arg*;
// Generates Vec<T>
```

**One or more** (`+`) - at least one:
```
statements = statement+;
// Generates Vec<T>
```

**Separated list** (`%`) - one or more separated by a delimiter:
```
args = expr % COMMA;
// Generates Vec<T> where T is expr's type
// Parses: expr, expr COMMA expr, expr COMMA expr COMMA expr, ...
```

### Expect Declarations

Declare expected conflicts to suppress errors:

```
start translation_unit;
expect 3 rr;  // Expect 3 reduce/reduce conflicts
expect 1 sr;  // Expect 1 shift/reduce conflict

// ... rest of grammar
```

Use this for grammars with known ambiguities (like C's typedef or dangling else).

---

## The gazelle! Macro

The `gazelle!` macro generates a type-safe parser at compile time.

### Basic Usage

```rust
use gazelle_macros::gazelle;
use gazelle::Precedence;

gazelle! {
    grammar Calc {
        start expr;

        terminals {
            NUM: _,
            LPAREN, RPAREN,
            prec OP: _,
        }

        expr = expr OP expr => binop
             | NUM => literal
             | LPAREN expr RPAREN => paren;
    }
}
```

### Visibility

Add `pub` before `grammar` for public visibility:

```rust
gazelle! {
    pub grammar MyParser {
        // ...
    }
}
```

Or with restricted visibility:

```rust
gazelle! {
    pub(crate) grammar MyParser {
        // ...
    }
}
```

---

## Generated Types

The macro generates a module (snake_case of grammar name, e.g., `calc`) containing:

### Types Trait

`Types` declares associated types and the error type:

```rust
pub trait Types: Sized {
    type Error: From<ParseError>;
    type Num: Debug;
    type Op: Debug;
    type Expr: Debug;
    fn set_token_range(&mut self, start: usize, end: usize) {}
}
```

### Per-Node Enums

Each non-terminal with named alternatives generates an enum generic over `A: Types`:

```rust
pub enum Expr<A: Types> {
    Binop(A::Expr, A::Op, A::Expr),
    Literal(A::Num),
}

impl<A: Types> AstNode for Expr<A> {
    type Output = A::Expr;
    type Error = A::Error;
}
```

A blanket `Reducer` impl handles identity (CST), `Box<N>` (auto-boxing), and `Ignore` (discard) automatically. You only write a custom `Reducer` impl when you need custom logic.

### Terminal Enum

Represents input tokens:

```rust
pub enum Terminal<A: Types> {
    Num(A::Num),
    Lparen,
    Rparen,
    Op(A::Op, Precedence),  // prec terminals include Precedence
}
```

### Parser Struct

```rust
pub struct Parser<A: Types> { /* ... */ }

impl<A: Types + Reducer<Expr<A>>> Parser<A> {
    pub fn new() -> Self;
    pub fn push(&mut self, terminal: Terminal<A>, actions: &mut A) -> Result<(), A::Error>;
    pub fn finish(self, actions: &mut A) -> Result<A::Expr, (Self, A::Error)>;
    pub fn state(&self) -> usize;
    pub fn format_error(&self, err: &ParseError) -> String;
}
```

Note: `finish` returns `(Self, A::Error)` on error, giving back the parser so you can still call `format_error`.

---

## Using the Parser

### Step 1: Implement Types and Reducers

```rust
use gazelle::{ParseError, Reducer};

struct Evaluator;

impl calc::Types for Evaluator {
    type Error = ParseError;
    type Num = f64;
    type Op = char;
    type Expr = f64;
}

impl Reducer<calc::Expr<Self>> for Evaluator {
    fn reduce(&mut self, node: calc::Expr<Self>) -> Result<f64, ParseError> {
        Ok(match node {
            calc::Expr::Binop(left, op, right) => match op {
                '+' => left + right,
                '-' => left - right,
                '*' => left * right,
                '/' => left / right,
                _ => panic!("unknown operator"),
            },
            calc::Expr::Literal(n) => n,
        })
    }
}
```

### Step 2: Create Parser and Push Tokens

```rust
use gazelle::Precedence;

fn parse(input: &str) -> Result<f64, String> {
    let mut parser = calc::Parser::<Evaluator>::new();
    let mut actions = Evaluator;

    // Your lexer loop
    for token in lex(input) {
        let terminal = match token {
            Token::Num(n) => calc::Terminal::Num(n),
            Token::Plus => calc::Terminal::Op('+', Precedence::Left(1)),
            Token::Star => calc::Terminal::Op('*', Precedence::Left(2)),
            Token::LParen => calc::Terminal::Lparen,
            Token::RParen => calc::Terminal::Rparen,
        };

        parser.push(terminal, &mut actions)
            .map_err(|e| parser.format_error(&e))?;
    }

    parser.finish(&mut actions)
        .map_err(|(p, e)| p.format_error(&e))
}
```

### Step 3: Handle Errors

```rust
// Push errors - parser is still available for format_error
match parser.push(terminal, &mut actions) {
    Ok(()) => { /* continue */ }
    Err(e) => {
        let msg = parser.format_error(&e);
        // msg contains: "unexpected 'X', expected: A, B, C"
        //               "  after: tokens parsed so far"
        //               "  in rule: context"
        return Err(msg);
    }
}

// Finish errors - parser returned in the error tuple
match parser.finish(&mut actions) {
    Ok(result) => { /* use result */ }
    Err((parser, e)) => {
        let msg = parser.format_error(&e);
        return Err(msg);
    }
}
```

---

## Advanced Features

### Precedence Terminals

For expression grammars, `prec` terminals carry precedence at runtime instead of encoding it in grammar structure:

```rust
terminals {
    prec OP: _,
}

expr = expr OP expr => binop | atom => atom;
```

When lexing, attach precedence to each operator:

```rust
'+' => calc::Terminal::Op('+', Precedence::Left(1)),   // Lower precedence
'*' => calc::Terminal::Op('*', Precedence::Left(2)),   // Higher precedence
'^' => calc::Terminal::Op('^', Precedence::Right(3)),  // Right-associative
```

**Precedence values:**
- `Precedence::Left(n)` - left-associative with level n
- `Precedence::Right(n)` - right-associative with level n
- Higher n = tighter binding

This enables user-defined operators at runtime!

### Token Range Tracking (Spans)

Implement `set_token_range` to track source positions:

```rust
impl calc::Types for SpanTracker {
    type Error = ParseError;
    type Op = char;
    type Num = f64;
    type Expr = Expr;

    fn set_token_range(&mut self, start: usize, end: usize) {
        // Called before each reduction with [start, end) token indices
        self.current_span = self.token_spans[start].start..self.token_spans[end-1].end;
    }
}

impl Reducer<calc::Expr<Self>> for SpanTracker {
    fn reduce(&mut self, node: calc::Expr<Self>) -> Result<Expr, ParseError> {
        Ok(match node {
            calc::Expr::Binop(left, op, right) => Expr {
                kind: ExprKind::BinOp(Box::new(left), op, Box::new(right)),
                span: self.current_span.clone(),
            },
            // ...
        })
    }
}
```

### Lexer Feedback

For languages like C where lexing depends on parse state (typedef disambiguation):

```rust
struct CActions {
    typedefs: HashSet<String>,
}

impl Reducer<c11::DeclTypedef<Self>> for CActions {
    fn reduce(&mut self, node: c11::DeclTypedef<Self>) -> Result<...> {
        // Extract name, register it
        self.typedefs.insert(name.clone());
        // ...
    }
}

// In your parse loop:
loop {
    // Lexer sees current typedef set
    let token = lexer.next(&actions.typedefs)?;
    parser.push(token, &mut actions)?;
}
```

The push-based architecture makes this natural - you control the loop.

### Multiple Implementations

The same grammar can have multiple action implementations:

```rust
// Evaluator
impl calc::Types for Evaluator { type Expr = f64; /* ... */ }
impl Reducer<calc::Expr<Self>> for Evaluator {
    fn reduce(&mut self, node: calc::Expr<Self>) -> Result<f64, ParseError> { /* evaluate */ }
}

// CST Builder — Box needed because Expr is recursive
// The blanket Reducer handles it automatically — no impl needed!
impl calc::Types for CstBuilder { type Expr = Box<calc::Expr<Self>>; /* ... */ }

// Pretty Printer
impl calc::Types for Printer { type Expr = String; /* ... */ }
impl Reducer<calc::Expr<Self>> for Printer {
    fn reduce(&mut self, node: calc::Expr<Self>) -> Result<String, ParseError> { /* format */ }
}
```

### Runtime Grammar API

For dynamic grammars, use the library API directly:

```rust
use gazelle::{parse_grammar, GrammarBuilder};
use gazelle::table::CompiledTable;
use gazelle::runtime::{CstParser, Token};

// Parse from string
let grammar = parse_grammar(r#"
    start expr;
    terminals { NUM, PLUS }
    expr = expr PLUS expr => add | NUM => num;
"#)?;

// Or build programmatically
let mut gb = GrammarBuilder::new();
let num = gb.t("NUM");
let plus = gb.t("PLUS");
let expr = gb.nt("expr");
gb.rule(expr, vec![expr, plus, expr]);
gb.rule(expr, vec![num]);
let grammar = gb.build();

// Compile and parse
let compiled = CompiledTable::build(&grammar);
let mut parser = CstParser::new(compiled.table());

let num_id = compiled.symbol_id("NUM").unwrap();
let plus_id = compiled.symbol_id("PLUS").unwrap();

parser.push(Token::new(num_id))?;     // NUM
parser.push(Token::new(plus_id))?;    // PLUS
parser.push(Token::new(num_id))?;     // NUM

let tree = parser.finish().map_err(|(p, e)| p.format_error(&e, &compiled))?;
// tree is a Cst: Leaf nodes carry SymbolId + token index,
// Node carry rule index + children
```

`CstParser` mirrors the generated parser's `push`/`finish` pattern. `Cst::Leaf` includes a token index so you can map back to your own token data (values, source positions, etc.).

For building a custom AST instead of a CST, use `Parser` directly with `maybe_reduce`/`shift`:

```rust
use gazelle::runtime::{Parser, Token};

let compiled = CompiledTable::build(&grammar);
let mut parser = Parser::new(compiled.table());
let mut stack: Vec<MyAst> = Vec::new();
let mut iter = tokens.into_iter();

loop {
    let token = iter.next();
    // Reduce while possible (rule 0 = accept)
    while let Some((rule, len, _)) = parser.maybe_reduce(token)? {
        if rule == 0 { return stack.pop().unwrap(); }
        let children: Vec<MyAst> = stack.drain(stack.len() - len..).collect();
        stack.push(build_ast_node(rule, children));
    }
    let Some(token) = token else { unreachable!() };
    stack.push(MyAst::Leaf(token));
    parser.shift(token);
}
```

---

## Errors and Recovery

### Parse errors

When `push` or `finish` returns an error, `format_error` produces a message showing what went wrong, what was expected, and where in the grammar the parser was:

```
unexpected 'STAR', expected: NUM, LPAREN
  after: expr PLUS
  in expr: expr OP • expr
```

The `•` marks the parser's position in the rule — it had seen `expr OP` and expected the right-hand operand.

Errors from `push` leave the parser intact so you can still call `format_error`:

```rust
parser.push(terminal, &mut actions).map_err(|e| parser.format_error(&e))?;
```

Errors from `finish` return the parser in the error tuple for the same reason:

```rust
parser.finish(&mut actions).map_err(|(p, e)| p.format_error(&e))?;
```

For nicer output, `format_error_with` lets you provide display names (mapping internal symbol names to user-facing ones) and the actual token texts for the "after:" context line:

```rust
let display_names = HashMap::from([("PLUS", "+"), ("STAR", "*"), ("LPAREN", "(")]);
let token_texts = vec!["1", "+", "*"];  // the tokens parsed so far
let msg = parser.format_error_with(&e, &display_names, &token_texts);
// unexpected '*', expected: NUM, (
//   after: 1 +
//   in expr: expr OP • expr
```

### Error recovery

When a parse error occurs, you can call `recover` on the low-level `Parser` to find a minimum-cost repair and continue parsing. Recovery uses Dijkstra search over possible insert/delete edits to find the cheapest way to get the parser back on track.

```rust
use gazelle::runtime::{Parser, Token, Repair};

// Parse tokens, recover on error
let mut pos = 0;
while pos < tokens.len() {
    loop {
        match parser.maybe_reduce(Some(tokens[pos])) {
            Ok(None) => break,              // ready to shift
            Ok(Some((0, _, _))) => return,  // accept
            Ok(Some(_)) => continue,        // reduce, keep going
            Err(_) => {
                // Error — recover with remaining tokens
                let errors = parser.recover(&tokens[pos..]);
                for e in &errors {
                    let repairs: Vec<_> = e.repairs.iter().map(|r| match r {
                        Repair::Insert(id) => format!("insert '{}'", ctx.symbol_name(*id)),
                        Repair::Delete(id) => format!("delete '{}'", ctx.symbol_name(*id)),
                        Repair::Shift => "shift".to_string(),
                    }).collect();
                    eprintln!("error at token {}: {}", e.position, repairs.join(", "));
                }
                return;
            }
        }
    }
    parser.shift(tokens[pos]);
    pos += 1;
}
```

Each `RecoveryInfo` contains a position (token index where the error was detected) and a list of `Repair` actions:
- `Repair::Insert(id)` — a missing token was inserted (e.g., a forgotten `;`)
- `Repair::Delete(id)` — an extra token was deleted (e.g., a stray `+`)
- `Repair::Shift` — a token was shifted normally (free cost, not a real edit)

Recovery can find multiple errors in one pass — it repairs and continues until the end of input.

### Conflict errors

During table generation, Gazelle reports shift/reduce and reduce/reduce conflicts with full context:

```
Shift/reduce conflict on 'IDENT':
  - Shift (continue parsing)
  - Reduce by: __ident_dot_star -> (empty)

Parser state when seeing 'IDENT':
  arg -> • IDENT EQ path  [shift]
  arg -> • path
  __ident_dot_star -> •  [reduce on IDENT]
```

Use `expect N rr;` / `expect N sr;` in the grammar to acknowledge known conflicts (like C's dangling else or typedef ambiguity).