gazelle-parser 0.9.1

LR parser generator with runtime operator precedence and natural lexer feedback
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
# Gazelle Design Document

A lightweight parser generator library for Rust.

## Philosophy

**Parsing is just a tool** to transform strings into structured forms. It's never the end goal—you always need validation, type checking, semantic analysis afterward. Gazelle should do syntax well, then get out of the way.

**Parser generators have a bad reputation**, but for the wrong reasons:

| Complaint | Real cause | Gazelle's answer |
|-----------|-----------|------------------|
| "Hand-written is simpler" | Only true for stable grammars. Evolving languages benefit from grammar-driven generation | Focus on fast iteration |
| "Terrible error messages" | Tool problem, not fundamental | Counterexample generation for conflicts |
| "Ambiguities are hidden" | PEG's ordered choice silently picks first match | Use CFG + surface real conflicts |
| "Spurious conflicts" | LALR's aggressive state merging | Full LR(1) with DFA minimization |
| "Atrocious API" | Bison's globals, lex/yacc split, pull architecture | Library-first, push architecture, clean Rust API |

## Core Design Decisions

### Full LR(1) with DFA Minimization

Gazelle builds the full canonical LR(1) automaton, then minimizes it:

1. **NFA construction** from LR(1) items
2. **Subset construction** (NFA → DFA) — canonical LR(1) states
3. **Conflict resolution** — SR/RR resolved on the full LR(1) DFA, before any merging
4. **Spurious reduce filling** — error entries filled with default reductions to enable merging
5. **Hopcroft minimization** — standard DFA minimization produces LALR-sized tables

Resolving conflicts before minimization guarantees the minimized parser accepts exactly the same language as the full LR(1) parser. This avoids the IELR problem where LALR-style merging can change the resolved language. Each step is a standard, independently correct algorithm — no custom merging heuristics.

### Conflict Explanation via Counterexamples

When a real conflict exists, show a concrete ambiguous input:

```
Conflict: shift/reduce on '+'

  Parse 1: expr + expr • + expr  →  expr + (expr + expr)
  Parse 2: expr + expr • + expr  →  (expr + expr) + expr

  Ambiguous input: "1 + 2 + 3"
```

Not "state 47: shift/reduce conflict on '+'" with no explanation.

### Push Architecture

The parser is a state machine driven from the outside, not a function that pulls input.

```rust
let mut parser = Parser::new(&tables);
let mut lexer = Lexer::new(input);

while let Some(token) = lexer.next(&context) {
    for event in parser.push(token) {
        match event {
            Event::Reduce { rule, children } => { /* ... */ }
            Event::Accept => break,
            Event::Error { expected } => { /* ... */ }
        }
    }
}
```

**Benefits:**
- Streaming—no need to buffer entire input
- Pause/resume—parser is just state
- Incremental—feed updated tokens for editor integration
- User controls I/O—works with async, files, network
- Composable—parser doesn't own token source or AST construction
- Debuggable—inspect every step

**The lexer hack becomes elegant:** When parsing C, the `typedef` problem (is `foo` a type or identifier?) requires parser context during lexing. With push architecture:

```rust
for event in parser.push(token) {
    if let Event::Reduce { rule: Rule::Typedef, children } => {
        lexer.register_type(children.name());
    }
}
```

The outer loop sees typedef reductions and updates lexer context before the next token. No globals, no callbacks.

**Even cleaner:** Use a bogus token pattern:
```
type = id typetoken
```
Lexer always emits `id`, inserts `typetoken` after known types. The grammar explicitly shows the disambiguation.

### Library-First Architecture

Expose the interesting parts as reusable libraries, not a monolithic tool:

```
┌─────────────────────────────────┐
│  Proc macro / nice frontend     │  ← optional sugar
├─────────────────────────────────┤
│  Parser runtime                 │  ← table-driven, push-based
├─────────────────────────────────┤
│  Table construction library     │  ← THE INTERESTING PART
│  - Grammar representation       │
│  - LR(1) automaton construction  │
│  - DFA minimization (Hopcroft)  │
│  - Conflict detection           │
│  - Counterexample generation    │
├─────────────────────────────────┤
│  Grammar AST / IR               │
└─────────────────────────────────┘
```

Users can:
- Build different frontends
- Visualize grammars/automata
- Analyze grammars without generating parsers
- Integrate into editor tooling
- Experiment with parsing research

## Novel Contribution: Unified LR + Operator Precedence

### The Problem

Traditional grammars explode for expressions:

```
expr     = add_expr
add_expr = add_expr '+' mul_expr | add_expr '-' mul_expr | mul_expr
mul_expr = mul_expr '*' unary    | mul_expr '/' unary    | unary
unary    = '-' unary | primary
primary  = number | '(' expr ')'
```

One rule per precedence level. All that machinery just encodes precedence.

### The Solution

Write:
```
expr = expr OP expr | atom
```

Where `OP` is a token that **carries precedence from the lexer**:

```rust
enum Token {
    Op { symbol: String, prec: u8, assoc: Assoc },
    Num(i64),
    // ...
}
```

At parse time, `expr OP1 expr • OP2` decisions use token data:
- `prec(OP2) > prec(OP1)` → shift
- `prec(OP2) < prec(OP1)` → reduce
- Equal → use associativity

The precedence table compresses into token metadata. The LR table for expressions collapses to nearly nothing.

### Implementation: Precedence-Carrying States

Standard LR parser state is just a table index:

```
action[state, token] -> shift(new_state) | reduce(rule) | error
```

Gazelle extends state to carry precedence:

```
state = (table_state, precedence)

action[table_state, token]:
  | shift only       -> shift(new_state, token.precedence)
  | reduce only      -> reduce(rule)
  | shift AND reduce ->
      if token.precedence > state.precedence: shift
      if token.precedence < state.precedence: reduce
      if equal: use token.associativity (left=reduce, right=shift)
```

When you shift an OP token, its precedence becomes part of the new state on the stack. When a shift/reduce conflict arises, compare the incoming token's precedence against the state's precedence.

**Example: `1 + 2 * 3`**

```
stack: [expr(1), +, expr(2)]    state.precedence = prec(+) = 1
token: * with precedence = 2

2 > 1 → shift (parse * first)

result: 1 + (2 * 3)
```

**Example: `1 * 2 + 3`**

```
stack: [expr(1), *, expr(2)]    state.precedence = prec(*) = 2
token: + with precedence = 1

1 < 2 → reduce (finish * first)

result: (1 * 2) + 3
```

**Why this matters:**

- The LR table stays small—one `expr = expr OP expr` rule, not N rules
- Conflicts are resolved at parse time with token data, not at table-generation time
- Same table works for any operators, including dynamically defined ones
- No grammar explosion, no static precedence declarations

The table generator marks `expr = expr OP expr` conflicts as "precedence-resolved" rather than erroring. The runtime knows to consult token precedence for these.

### Dynamic Operators (Haskell-style)

With push architecture, new operators can be defined mid-parse:

```rust
for event in parser.push(token) {
    if let Event::Reduce { rule: Rule::InfixDecl, children } => {
        // infixl 6 +++
        lexer.register_op("+++", prec: 6, assoc: Left);
    }
}
```

Grammar doesn't change. Lexer starts recognizing `+++` as `OP` with precedence 6.

### AST Implications

Don't encode each operator as a different node type:

```rust
// Bad: structure explosion
enum Expr {
    Add(Box<Expr>, Box<Expr>),
    Sub(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
    // ... forever
}

// Good: operator is data
struct BinOp {
    left: Box<Expr>,
    op: Token,  // symbol, precedence, associativity
    right: Box<Expr>,
}

enum Expr {
    BinOp(BinOp),
    Atom(Token),
}
```

One grammar rule → one AST node type → downstream passes match on operator data, not node variants.

## Usage Modes

### Sharp-Knife Mode (Runtime Grammar)

The core library works with grammars provided at runtime. Everything is type-erased:

```rust
let grammar = Grammar::parse(grammar_str)?;
let tables = TableBuilder::new(&grammar).build()?;
let mut parser = Parser::new(&tables);

for event in parser.push(token) {
    match event {
        Event::Reduce { rule, children } => {
            // rule is a RuleId (number)
            // tokens are generic Token structs
            match grammar.rule_name(rule) {
                "expr" => { /* ... */ }
                "atom" => { /* ... */ }
                _ => { /* ... */ }
            }
        }
    }
}
```

**Use cases:**
- Tools that process arbitrary grammars
- User-defined DSLs (grammar comes from config/input)
- Grammar experimentation and visualization
- Building other parser generators on top

### Codegen Mode (Compile-Time Grammar)

A proc macro generates type-safe wrappers from a grammar definition. Rules name their alternatives with `=> action`, but **no code in the grammar** — reducers are separate:

```rust
gazelle! {
    grammar Calc {
        terminals { NUM: _, prec OP: _, LPAREN, RPAREN, SEMI }

        expr = expr OP expr => binop
             | atom => atom;

        atom = NUM => num
             | LPAREN expr RPAREN => paren;

        stmt = expr SEMI => print;
    }
}
```

**The macro generates a `Types` trait, per-node enums, and an `Actions` trait bound:**

```rust
// Generated by macro
mod calc {
    pub trait Types: Sized {
        type Error: From<ParseError>;
        type Num: Debug;
        type Op: Debug;
        type Expr: Debug;
        type Atom: Debug;
        type Stmt: Debug;
        fn set_token_range(&mut self, start: usize, end: usize) {}
    }

    pub enum Expr<A: Types> {
        Binop(A::Expr, A::Op, A::Expr),
        Atom(A::Atom),
    }

    pub enum Atom<A: Types> {
        Num(A::Num),
        // paren just returns the inner expr
    }

    // Actions = Types + all required Action impls
    pub trait Actions: Types + Action<Expr<Self>> + Action<Atom<Self>> { }
    impl<T: Types + Action<Expr<T>> + Action<Atom<T>>> Actions for T { }

    // Also generates Terminal enum, Parser struct, etc.
}
```

**You implement `Types` and `Action` as normal Rust:**

```rust
struct Evaluator;

impl calc::Types for Evaluator {
    type Error = ParseError;
    type Num = f64;
    type Op = char;
    type Expr = f64;
    type Atom = f64;
    type Stmt = ();
}

impl gazelle::Action<calc::Expr<Self>> for Evaluator {
    fn build(&mut self, node: calc::Expr<Self>) -> Result<f64, ParseError> {
        Ok(match node {
            calc::Expr::Binop(left, op, right) => match op {
                '+' => left + right,
                '*' => left * right,
                _ => 0.0,
            },
            calc::Expr::Atom(a) => a,
        })
    }
}

impl gazelle::Action<calc::Atom<Self>> for Evaluator {
    fn build(&mut self, node: calc::Atom<Self>) -> Result<f64, ParseError> {
        Ok(match node {
            calc::Atom::Num(n) => n,
        })
    }
}
```

**Benefits:**
- Grammar is clean — just structure and action names, no code
- Action code is real Rust with full IDE support
- Multiple implementations possible (AST, pretty-printer, interpreter, validator)
- Refactoring works across grammar and reducers
- Compile-time type checking between grammar and implementation
- No string interpolation or `$1` magic

**The proc macro:**
1. Parses grammar at compile time
2. Validates types match between rules (e.g., `atom` returns `Atom`, used in `expr`)
3. Generates terminal enum, per-node enums, and `Types`/`Actions` traits
4. Calls the core library to build tables
5. Embeds tables as static data
6. Generates parser wrapper that calls `Action::build` on each reduction

Same library underneath both modes.

## Summary

Gazelle is:
1. **Full LR(1) with DFA minimization** for correctness without spurious conflicts
2. **Push-based** for composability and control
3. **Library-first** exposing table construction algorithms
4. **Unified LR + precedence parsing** as a novel contribution
5. **Practical** with clean Rust API and good error messages

The goal: make grammar iteration so painless it's obviously better than hand-writing parsers.