rustixml 0.3.1

Native iXML (Invisible XML) parser with left-recursion support - 76.9% spec conformance, works in Rust and WebAssembly
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
# Strategic Options for Improving rustixml Conformance

This document analyzes different approaches to achieve higher iXML conformance, comparing markup-blitz strategies with alternatives suitable for Rust.

## Current Architecture: Native Recursive Descent Parser

**What we have**:
- Direct interpretation of iXML grammar AST
- Recursive descent with backtracking
- No intermediate compilation or table generation
- O(n) for simple grammars, O(n²-n³) for complex/ambiguous ones

**Pros**:
- Simple to implement and debug
- Direct mapping to iXML semantics
- No preprocessing overhead
- Easy to understand errors

**Cons**:
- Performance degrades with ambiguity
- No natural ambiguity detection
- Repeated work on backtracking
- Hard to optimize automatically

## Option 1: Adopt markup-blitz's LALR+GLR Approach

### What It Means

**LALR (Look-Ahead LR) Parsing**:
- **Precomputed tables**: Before parsing any input, the grammar is analyzed and converted into state transition tables
- **State machine**: Each state represents a parsing position, tables tell you what to do next
- **O(1) decision time**: Looking up the next action is constant time (just index into a table)
- **Deterministic where possible**: No backtracking needed for unambiguous grammars

**GLR (Generalized LR) Parsing**:
- Extension of LR parsing that handles ambiguity
- When multiple actions are possible, creates "forks" - parallel parse stacks
- Explores all possibilities simultaneously
- Merges results that reach the same state
- Naturally detects ambiguity (multiple successful parses)

**Precomputed Tables Example**:
```rust
// Instead of this (what we do now):
fn parse_rule(&self, rule: &Rule) -> Result<Node> {
    // Walk AST, make decisions at runtime
    for alt in &rule.alternatives {
        if let Ok(result) = self.parse_alternative(alt) {
            return Ok(result);
        }
    }
    Err("No match")
}

// LALR does this once before parsing:
struct ParseTables {
    // State 0: On 'a', shift to state 5
    // State 0: On Rule<X>, goto state 12
    action_table: Vec<Vec<Action>>,  // [state][token] -> Action
    goto_table: Vec<Vec<usize>>,     // [state][nonterminal] -> next_state
}

// Then parsing becomes:
fn parse_lalr(&self, tables: &ParseTables, input: &str) -> Result<Node> {
    let mut state_stack = vec![0];
    for token in tokenize(input) {
        let action = tables.action_table[state_stack.last()][token];
        match action {
            Action::Shift(next_state) => state_stack.push(next_state),
            Action::Reduce(rule) => { /* apply rule */ },
            Action::Accept => return Ok(result),
            Action::Error => return Err("Parse error"),
        }
    }
}
```

### Pros
- **Best performance**: O(n) for unambiguous grammars, O(n³) worst case for GLR
-**Natural ambiguity detection**: Multiple parse stacks = ambiguous
-**Proven approach**: markup-blitz passes all 5168 tests
-**Extensive theory**: Decades of research on LALR/GLR parsing

### Cons
- **Major rewrite**: Complete replacement of parsing engine
-**Complex implementation**: LALR table generation is non-trivial
-**Abstraction mismatch**: LR parsers expect tokens, iXML works on characters
-**iXML semantic gaps**: Insertions, hidden elements, attribute marks need special handling
-**Rust LR libraries limited**: No mature LALR+GLR library like Java has

### Effort Estimate
- **6-12 months** full-time development
- Need to implement or adapt:
  1. LALR table generator (2-3 months)
  2. GLR parsing engine (2-3 months)
  3. iXML semantic translation (2-3 months)
  4. Integration and testing (2-3 months)

### Existing Rust Libraries
- **`lalrpop`**: LALR parser generator, but macro-based (compile-time grammars)
- **`santiago`**: GLR parser, but incomplete documentation
- **`peg`**: PEG parser generator, different paradigm
- **None directly usable**: Would need significant custom work

### Risk Assessment
- **High risk**: Complex implementation, unclear if LR paradigm maps cleanly to iXML
- **High reward**: Could achieve 95%+ conformance

---

## Option 2: Enhance Native Parser with Targeted Optimizations

### Strategy: Pre-processing + Heuristics + Post-processing

Keep recursive descent architecture but add optimizations at different stages.

### 2.1 Pre-processing Optimizations

**Character Class Partitioning** (Already partially implemented!)

```rust
// BEFORE parsing, transform grammar AST:
// Grammar: [a-z], [a-m], [m-z]
// Partitions: [a-l], [m], [n-z]
// 
// Original rule: letter: [a-z].
// Transformed:   letter: ([a-l] | [m] | [n-z]).

fn preprocess_grammar(ast: &mut GrammarAst) {
    let all_charsets = collect_character_classes(ast);
    let partitions = compute_disjoint_partitions(all_charsets);
    replace_charsets_with_partitions(ast, partitions);
}
```

**Status**: Implemented in `src/charclass.rs` but disabled due to bugs
**Effort**: 1-2 weeks to fix and enable
**Impact**: High for grammars with overlapping character classes (15-20% speedup)

**Left-Recursion Detection and Transformation**

```rust
// Detect patterns like: expr: expr, "+", term | term.
// Transform to right-recursion with careful result reconstruction:
//   expr: term, expr_tail.
//   expr_tail: "+", term, expr_tail | .

fn detect_left_recursion(rule: &Rule) -> bool {
    for alt in &rule.alternatives {
        if let Some(first) = alt.sequence.first() {
            if first.is_nonterminal(&rule.name) {
                return true;
            }
        }
    }
    false
}

fn transform_left_to_right_recursion(rule: &mut Rule) {
    // Create helper rule with right recursion
    // Adjust semantic actions to preserve tree structure
}
```

**Status**: Not implemented
**Effort**: 2-3 weeks
**Impact**: Medium - fixes 2-3 failing tests (`expr`, `xpath`)

**Nonterminal Inlining**

```rust
// Inline simple wrapper rules to reduce call depth
// 
// BEFORE: digit: ["0"-"9"]. number: digit+.
// AFTER:  number: ["0"-"9"]+.  (digit inlined)

fn inline_simple_nonterminals(ast: &mut GrammarAst) {
    let simple_rules = find_inlinable_rules(ast);
    for rule in simple_rules {
        replace_references_with_definition(ast, rule);
    }
}

fn is_inlinable(rule: &Rule) -> bool {
    rule.alternatives.len() == 1 &&
    rule.alternatives[0].sequence.len() == 1 &&
    matches!(rule.alternatives[0].sequence[0], Factor::CharClass(_))
}
```

**Status**: Not implemented
**Effort**: 1-2 weeks
**Impact**: Low-medium - reduces call stack, 5-10% speedup

### 2.2 Runtime Heuristics

**Memoization (Packrat Parsing)**

```rust
// Cache parsing results to avoid re-parsing same position
struct ParseCache {
    // (rule_id, input_position) -> ParseResult
    cache: HashMap<(usize, usize), ParseResult>,
}

impl NativeParser {
    fn parse_rule_memoized(&mut self, rule: &Rule, pos: usize) -> ParseResult {
        let key = (rule.id, pos);
        if let Some(result) = self.cache.get(&key) {
            return result.clone();
        }
        let result = self.parse_rule(rule, pos);
        self.cache.insert(key, result.clone());
        result
    }
}
```

**Status**: Not implemented
**Effort**: 1 week
**Impact**: High for grammars with backtracking (30-50% speedup)
**Trade-off**: Memory usage increases

**Ambiguity Detection via Multiple Parses**

```rust
// Track alternative parse paths during parsing
struct ParseState {
    current_parse: Vec<Node>,
    alternative_parses: Vec<Vec<Node>>,
    ambiguity_detected: bool,
}

fn parse_with_ambiguity_detection(&mut self, input: &str) -> ParseResult {
    // Try each alternative at decision points
    // If multiple succeed with same input consumed, mark ambiguous
    for alt in alternatives {
        if let Ok(result) = self.parse_alternative(alt) {
            successful_parses.push(result);
        }
    }
    
    if successful_parses.len() > 1 {
        return ParseResult::Ambiguous(successful_parses);
    }
}
```

**Status**: Not implemented
**Effort**: 2-3 weeks
**Impact**: High - fixes all 11 ambiguity tests
**Trade-off**: Slower parsing (explores more paths)

**Better Error Recovery**

```rust
// When parse fails, try to identify the furthest point reached
struct ParseError {
    position: usize,
    expected: Vec<String>,
    context: String,
    alternatives_tried: usize,
}

fn enhanced_error_reporting(&self, error: &ParseError) -> String {
    format!(
        "Parse error at line {}, column {}: Expected {} but found {}\n\
         Context: ...{}...\n\
         Tried {} alternatives",
        error.line, error.column, error.expected.join(" or "),
        error.found, error.context, error.alternatives_tried
    )
}
```

**Status**: Partially implemented
**Effort**: 1 week
**Impact**: Low - better UX, doesn't fix tests

### 2.3 Post-processing Optimizations

**Result Tree Normalization**

```rust
// After parsing, normalize the XML tree
fn normalize_parse_tree(tree: &mut XmlNode) {
    // 1. Merge adjacent text nodes
    // 2. Remove empty hidden elements
    // 3. Flatten unnecessary nesting
    // 4. Sort attributes
    
    merge_adjacent_text(tree);
    remove_empty_hidden(tree);
    flatten_sequences(tree);
    sort_attributes(tree);
}
```

**Status**: Partially implemented (flatten_sequences exists)
**Effort**: 1 week
**Impact**: Low - improves output quality

### Pros of Option 2
- **Incremental improvement**: Can implement piece by piece
-**Lower risk**: Each optimization is independent
-**Preserves architecture**: Recursive descent stays intact
-**Rust-friendly**: No need for complex libraries

### Cons
- **May not reach 100%**: Fundamental limitations remain
-**Complexity accumulates**: Many small optimizations vs. one clean architecture
-**Performance ceiling**: Will never match LALR speed

### Effort Estimate
- **2-4 months** for significant improvements
- Realistic target: **90-95% conformance**

---

## Option 3: Hybrid Model

### Strategy: Use Different Parsers for Different Grammar Types

```rust
enum ParserStrategy {
    FastPath,      // Simple grammars: optimized recursive descent
    MemoizedPath,  // Complex grammars: packrat parsing
    AmbiguityPath, // Ambiguous grammars: explore all paths
}

impl NativeParser {
    fn choose_strategy(&self, grammar: &Grammar) -> ParserStrategy {
        let complexity = analyze_grammar(grammar);
        
        match complexity {
            Complexity::Simple => ParserStrategy::FastPath,
            Complexity::Recursive => ParserStrategy::MemoizedPath,
            Complexity::Ambiguous => ParserStrategy::AmbiguityPath,
        }
    }
}

fn analyze_grammar(grammar: &Grammar) -> Complexity {
    let has_left_recursion = detect_left_recursion(grammar);
    let has_ambiguity_markers = detect_ambiguous_patterns(grammar);
    let rule_depth = compute_max_depth(grammar);
    
    // Use heuristics to classify
    if rule_depth < 5 && !has_left_recursion {
        Complexity::Simple
    } else if has_ambiguity_markers {
        Complexity::Ambiguous
    } else {
        Complexity::Recursive
    }
}
```

### Pattern Detection Examples

**Simple Grammars** (use fast path):
```ixml
date: year, "-", month, "-", day.
year: digit, digit, digit, digit.
```
- No recursion
- No ambiguity
- Direct translation

**Recursive Grammars** (use memoization):
```ixml
expr: term | expr, "+", term.
term: factor | term, "*", factor.
```
- Left recursion detected
- Enable memoization cache

**Ambiguous Grammars** (use full exploration):
```ixml
a: "x" | "x", "y".
```
- Multiple parses possible
- Track all alternatives

### Pros
- **Best of both worlds**: Fast for simple cases, correct for complex ones
-**Adaptive**: Automatically chooses right approach
-**User-friendly**: Works well without configuration

### Cons
- **Complex implementation**: Need multiple parsing modes
-**Analysis overhead**: Grammar analysis takes time
-**Testing burden**: Must test all paths

### Effort Estimate
- **3-5 months**
- Realistic target: **92-97% conformance**

---

## Option 4: Profiling-Driven Optimization

### Strategy: Measure, Don't Guess

```bash
# Profile current implementation on failing tests
cargo build --release
perf record target/release/conformance_test
perf report

# Analyze hotspots
cargo flamegraph --bin conformance_test
```

**Common Hotspots in Parsers**:
1. Character classification (UTF-8 decoding)
2. Backtracking overhead
3. Memory allocation (Vec growth)
4. String copying
5. Hash map lookups

**Targeted Optimizations**:

```rust
// If profiling shows UTF-8 decoding is slow:
use bstr::ByteSlice; // Work with bytes directly

// If backtracking is expensive:
struct SavePoint {
    position: usize,
    stack_depth: usize,
}
// Lightweight save/restore instead of cloning

// If allocation is costly:
struct ParseArena {
    nodes: Vec<Node>,
    current: usize,
}
// Arena allocation for parse nodes
```

### Pros
- **Evidence-based**: Optimize what actually matters
-**High ROI**: Focus effort where it helps most
-**Measurable**: Clear before/after metrics

### Cons
- **May not fix correctness**: Only improves speed
-**Platform-specific**: Optimizations may vary by CPU
-**Diminishing returns**: Easy wins first, hard ones remain

### Effort Estimate
- **2-4 weeks** for initial profiling and fixes
- **Ongoing**: 1 day/month for continuous improvement

---

## Recommended Strategy: **Option 2 + 4 (Enhanced Native Parser + Profiling)**

### Rationale

1. **Proven Architecture**: Recursive descent is well-understood and debuggable
2. **Incremental Path**: Can ship improvements in minor versions (v0.3, v0.4)
3. **Rust Strengths**: Plays to Rust's zero-cost abstractions and memory safety
4. **Realistic Goals**: 90-95% conformance is excellent for most users
5. **Lower Risk**: Don't bet the project on a complete rewrite

### Implementation Roadmap

#### Phase 1: Quick Wins (v0.3.0 - 1 month)
1. ✅ Enable character class partitioning (fix bugs)
2. ✅ Implement basic memoization
3. ✅ Add ambiguity detection for simple cases
4. ✅ Profile and optimize hotspots

**Expected**: 87-90% conformance (43-44 tests passing)

#### Phase 2: Structural Improvements (v0.4.0 - 2 months)
1. ✅ Left-recursion transformation
2. ✅ Nonterminal inlining
3. ✅ Full ambiguity detection with multiple parse paths
4. ✅ Better error messages

**Expected**: 92-95% conformance (45-47 tests passing)

#### Phase 3: Polish (v0.5.0 - 1 month)
1. ✅ Edge case handling
2. ✅ Performance tuning
3. ✅ Memory optimization
4. ✅ Documentation

**Expected**: 95%+ conformance (47+ tests passing)

### Why NOT Option 1 (LALR+GLR)?

1. **Overkill for 95%**: Can reach high conformance without it
2. **Time vs. Value**: 6-12 months for 5% more tests
3. **Maintenance burden**: Complex code is harder to maintain
4. **Community contribution**: Easier for others to contribute to simpler code
5. **iXML evolution**: Spec may change, simpler code adapts easier

### When to Reconsider Option 1

- If aiming for **100% conformance** becomes critical
- If **performance** becomes the bottleneck (currently not an issue)
- If a mature **Rust LALR+GLR library** emerges
- If building a **production parser framework** (not just an iXML parser)

---

## Comparison Table

| Approach | Effort | Conformance Target | Risk | Maintainability | Performance |
|----------|--------|-------------------|------|-----------------|-------------|
| **Current** | 0 | 83.7% | - | ✅✅✅ | ✅✅ |
| **LALR+GLR** | 12 months | 98-100% | ⚠️⚠️⚠️ | ⚠️ | ✅✅✅ |
| **Enhanced Native** | 4 months | 92-95% | ✅✅ | ✅✅ | ✅✅ |
| **Hybrid** | 5 months | 95-97% | ⚠️⚠️ | ⚠️ | ✅✅✅ |
| **Profile-Only** | 1 month | 83.7% | ✅✅✅ | ✅✅✅ | ✅✅✅ |

---

## Conclusion

**Ship v0.2.0 now** with current 83.7% conformance and excellent documentation.

**Plan for v0.3-v0.5** using Option 2 (Enhanced Native Parser) with profiling.

**Keep Option 1** (LALR+GLR) as a potential v2.0 if needed.

This balances:
- ✅ Time to market
- ✅ User value (83.7% is great for most use cases)
- ✅ Continuous improvement path
- ✅ Code maintainability
- ✅ Community contribution potential

The perfect is the enemy of the good. Ship now, improve incrementally.