# Native iXML Interpreter Design
## Overview
This document outlines the design for a **native Rust iXML interpreter** that directly implements the iXML specification without translation to an intermediate parser representation (Earley, LALR, etc.).
### Why Native?
The Earley-based approach revealed fundamental abstraction mismatches:
1. **Insertion semantics** (`+"text"`) - adding output not in input
2. **Suppression semantics** (`-name`) - hiding matched input from output
3. **Combined patterns** - `(-[Co], +".")*` loops that suppress and insert simultaneously
These are **first-class iXML operations** but don't map naturally to parser generators that only consume input tokens.
### Design Philosophy
**Specification-First**: Implement iXML semantics directly as described in the specification, not as a translation layer.
**Rust-Native**: Leverage Rust's strengths:
- Pattern matching for grammar traversal
- `Option`/`Result` for parse success/failure
- Iterators for input consumption
- Zero-cost abstractions
- Memory safety without runtime overhead
## Architecture
### High-Level Structure
```
┌─────────────────────────────────────────────────┐
│ iXML Grammar Text │
└─────────────────┬───────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ Grammar Parser (existing: grammar_parser.rs) │
│ - Handwritten recursive descent │
│ - Produces IxmlGrammar AST │
└─────────────────┬───────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ Native Interpreter (NEW) │
│ - Direct AST interpretation │
│ - Recursive descent parsing │
│ - Native insertion/suppression handling │
└─────────────────┬───────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ XML Output (existing: XmlNode → String) │
└─────────────────────────────────────────────────┘
```
### Core Components
#### 1. Input Stream (`src/input_stream.rs`)
Manages input text with position tracking:
```rust
pub struct InputStream<'a> {
input: &'a str,
position: usize, // Current position in input (char index)
chars: Vec<char>, // Pre-computed char array for O(1) access
}
impl<'a> InputStream<'a> {
pub fn new(input: &'a str) -> Self;
pub fn current(&self) -> Option<char>;
pub fn advance(&mut self) -> Option<char>;
pub fn peek(&self, offset: usize) -> Option<char>;
pub fn position(&self) -> usize;
pub fn set_position(&mut self, pos: usize);
pub fn remaining(&self) -> &str;
pub fn is_eof(&self) -> bool;
}
```
**Design choices**:
- Pre-compute `Vec<char>` for Unicode-safe indexing
- Position is character index, not byte offset
- Immutable input string reference
- Mutable position cursor for backtracking
#### 2. Parse Context (`src/parse_context.rs`)
Tracks parse state during recursive descent:
```rust
pub struct ParseContext {
pub rule_name: String, // Current rule being parsed
pub depth: usize, // Recursion depth (for debugging)
pub left_recursion: HashSet<String>, // Detect left-recursion
}
pub struct ParseResult {
pub node: Option<XmlNode>, // Parsed node (None if suppressed)
pub consumed: usize, // Characters consumed
}
```
**Design choices**:
- Track current rule for error messages
- Left-recursion detection (fail-fast on direct left-recursion)
- Separate "success with no output" (suppressed) from "failure"
#### 3. Native Parser (`src/native_parser.rs`)
Main interpreter implementing iXML semantics:
```rust
pub struct NativeParser {
grammar: IxmlGrammar,
rules: HashMap<String, Rule>, // Fast rule lookup by name
}
impl NativeParser {
pub fn new(grammar: IxmlGrammar) -> Self;
pub fn parse(&self, input: &str) -> Result<String, ParseError>;
// Core parsing methods (recursive descent)
fn parse_rule(&self, stream: &mut InputStream, rule: &Rule, ctx: &mut ParseContext)
-> Result<ParseResult, ParseError>;
fn parse_alternatives(&self, stream: &mut InputStream, alts: &Alternatives, ctx: &mut ParseContext)
-> Result<ParseResult, ParseError>;
fn parse_sequence(&self, stream: &mut InputStream, seq: &Sequence, ctx: &mut ParseContext)
-> Result<ParseResult, ParseError>;
fn parse_factor(&self, stream: &mut InputStream, factor: &Factor, ctx: &mut ParseContext)
-> Result<ParseResult, ParseError>;
fn parse_base_factor(&self, stream: &mut InputStream, base: &BaseFactor, mark: Mark, ctx: &mut ParseContext)
-> Result<ParseResult, ParseError>;
}
```
**Design choices**:
- One method per grammar construct (alternatives, sequence, factor, etc.)
- All methods return `Result<ParseResult, ParseError>`
- `ParseResult` contains both the node and characters consumed
- Backtracking via `stream.set_position(saved_pos)`
### Parsing Algorithm
#### Alternatives (Choice)
```rust
fn parse_alternatives(&self, stream: &mut InputStream, alts: &Alternatives, ctx: &mut ParseContext)
-> Result<ParseResult, ParseError> {
let start_pos = stream.position();
let mut errors = Vec::new();
// Try each alternative in order
for alt in &alts.alts {
stream.set_position(start_pos); // Reset for each alternative
match self.parse_sequence(stream, alt, ctx) {
Ok(result) => return Ok(result), // First success wins
Err(e) => errors.push(e),
}
}
// All alternatives failed
Err(ParseError::NoAlternativeMatched {
position: start_pos,
attempts: errors,
})
}
```
**Design choices**:
- Ordered choice (PEG-style): first match wins
- Backtrack to start position for each alternative
- Collect all errors for better diagnostics
#### Sequence (Concatenation)
```rust
fn parse_sequence(&self, stream: &mut InputStream, seq: &Sequence, ctx: &mut ParseContext)
-> Result<ParseResult, ParseError> {
let start_pos = stream.position();
let mut children = Vec::new();
let mut total_consumed = 0;
// Parse each factor in sequence
for factor in &seq.factors {
match self.parse_factor(stream, factor, ctx) {
Ok(result) => {
if let Some(node) = result.node {
children.push(node);
}
total_consumed += result.consumed;
}
Err(e) => {
// Sequence failed - backtrack
stream.set_position(start_pos);
return Err(e);
}
}
}
Ok(ParseResult {
node: Some(XmlNode::Sequence(children)),
consumed: total_consumed,
})
}
```
**Design choices**:
- All factors must succeed (fail-fast)
- Collect non-suppressed nodes into children
- Full backtrack on any failure
- Track total characters consumed
#### Repetition
```rust
fn parse_repetition(&self, stream: &mut InputStream, factor: &Factor, rep: &Repetition, ctx: &mut ParseContext)
-> Result<ParseResult, ParseError> {
let mut children = Vec::new();
let mut total_consumed = 0;
match rep {
Repetition::ZeroOrMore => {
// Keep parsing until failure
loop {
let pos = stream.position();
match self.parse_base_factor(stream, &factor.base, Mark::None, ctx) {
Ok(result) => {
if result.consumed == 0 {
break; // Prevent infinite loop on epsilon matches
}
if let Some(node) = result.node {
children.push(node);
}
total_consumed += result.consumed;
}
Err(_) => {
stream.set_position(pos); // Backtrack last attempt
break;
}
}
}
Ok(ParseResult {
node: Some(XmlNode::Sequence(children)),
consumed: total_consumed,
})
}
Repetition::OneOrMore => {
// Must match at least once
match self.parse_base_factor(stream, &factor.base, Mark::None, ctx) {
Ok(result) => {
if let Some(node) = result.node {
children.push(node);
}
total_consumed += result.consumed;
// Then same as ZeroOrMore
// ... (same loop as above)
}
Err(e) => return Err(e),
}
}
// Similar for Optional, SeparatedZeroOrMore, SeparatedOneOrMore
// ...
}
}
```
**Design choices**:
- Greedy matching (consume as much as possible)
- Epsilon-match detection to prevent infinite loops
- Separated repetitions parse separator between elements
#### Terminal Matching
```rust
fn parse_terminal(&self, stream: &mut InputStream, value: &str, mark: Mark, insertion: bool)
-> Result<ParseResult, ParseError> {
if insertion {
// Insertion: don't consume input, just create node
return Ok(ParseResult {
node: Some(XmlNode::Text(value.to_string())),
consumed: 0,
});
}
// Regular terminal: must match input
let start_pos = stream.position();
for ch in value.chars() {
match stream.current() {
Some(input_ch) if input_ch == ch => {
stream.advance();
}
_ => {
stream.set_position(start_pos);
return Err(ParseError::TerminalMismatch {
expected: value.to_string(),
position: start_pos,
});
}
}
}
let consumed = value.chars().count();
// Apply mark
let node = match mark {
Mark::Hidden => None, // Suppressed
_ => Some(XmlNode::Text(value.to_string())),
};
Ok(ParseResult { node, consumed })
}
```
**Design choices**:
- **Insertions** (`+"text"`): consume 0 characters, always succeed, create node
- **Suppressions** (`-"text"`): consume characters, succeed/fail normally, but return `None` node
- Character-by-character matching for Unicode safety
#### Character Class Matching
```rust
fn parse_charclass(&self, stream: &mut InputStream, content: &str, negated: bool, mark: Mark)
-> Result<ParseResult, ParseError> {
let start_pos = stream.position();
match stream.current() {
Some(ch) => {
let matches = self.charclass_matches(ch, content);
let should_match = if negated { !matches } else { matches };
if should_match {
stream.advance();
let node = match mark {
Mark::Hidden => None,
_ => Some(XmlNode::Text(ch.to_string())),
};
Ok(ParseResult { node, consumed: 1 })
} else {
Err(ParseError::CharClassMismatch {
charclass: content.to_string(),
negated,
actual: ch,
position: start_pos,
})
}
}
None => Err(ParseError::UnexpectedEof { position: start_pos }),
}
}
fn charclass_matches(&self, ch: char, content: &str) -> bool {
// Parse character class content and test if ch matches
// Handles: ranges (a-z), hex chars (#41), Unicode categories ([L])
// Already implemented in runtime_parser.rs - can reuse
}
```
**Design choices**:
- Match single character against character class predicate
- Support negation (`~[...]`)
- Reuse existing character class parsing logic
- Apply mark after successful match
#### Nonterminal (Rule Reference)
```rust
fn parse_nonterminal(&self, stream: &mut InputStream, name: &str, mark: Mark, ctx: &mut ParseContext)
-> Result<ParseResult, ParseError> {
// Check for left recursion
if ctx.left_recursion.contains(name) {
return Err(ParseError::LeftRecursion { rule: name.to_string() });
}
// Look up rule
let rule = self.rules.get(name)
.ok_or_else(|| ParseError::UndefinedRule { rule: name.to_string() })?;
// Push onto left-recursion tracker
ctx.left_recursion.insert(name.to_string());
ctx.depth += 1;
let result = self.parse_rule(stream, rule, ctx);
// Pop from left-recursion tracker
ctx.left_recursion.remove(name);
ctx.depth -= 1;
// Apply mark to result
match result {
Ok(mut parse_result) => {
parse_result.node = parse_result.node.and_then(|node| match mark {
Mark::Hidden => None,
Mark::Attribute => Some(XmlNode::Attribute {
name: name.to_string(),
value: node.to_string(),
}),
Mark::Promoted => Some(node), // Promote contents
Mark::None => Some(XmlNode::Element {
name: name.to_string(),
attributes: vec![],
children: vec![node],
}),
});
Ok(parse_result)
}
Err(e) => Err(e),
}
}
```
**Design choices**:
- Direct left-recursion detection (fail immediately)
- Marks applied to rule result, not during parsing
- Promoted mark unwraps the element
- Attribute mark converts to attribute node
### Mark Handling
Marks (`@`, `-`, `^`) are applied **after** successful parsing:
```rust
fn apply_mark(node: Option<XmlNode>, mark: Mark, name: &str) -> Option<XmlNode> {
match mark {
Mark::None => node,
Mark::Hidden => None, // Suppress output
Mark::Attribute => node.map(|n| XmlNode::Attribute {
name: name.to_string(),
value: n.text_content(), // Extract text value
}),
Mark::Promoted => node, // Contents promoted (no wrapper element)
}
}
```
**Key insight**: Marks are **post-processing** operations on successful parses, not parse-time behavior changes.
### Insertion Handling
Insertions (`+"text"`) are **synthetic terminals** that always succeed:
```rust
if insertion {
// Create node without consuming input
return Ok(ParseResult {
node: Some(XmlNode::Text(value.to_string())),
consumed: 0, // Key: consume nothing
});
}
```
This naturally handles the `(-[Co], +".")*` pattern:
1. `-[Co]` matches and consumes a Co character, returns `None` (suppressed)
2. `+"."` creates a "." text node, consumes 0 characters
3. Sequence succeeds with node `[None, Some(Text("."))]`
4. Repetition continues until `-[Co]` fails
### Error Handling
```rust
pub enum ParseError {
UnexpectedEof { position: usize },
TerminalMismatch { expected: String, position: usize },
CharClassMismatch { charclass: String, negated: bool, actual: char, position: usize },
NoAlternativeMatched { position: usize, attempts: Vec<ParseError> },
UndefinedRule { rule: String },
LeftRecursion { rule: String },
}
impl ParseError {
pub fn format_with_context(&self, input: &str) -> String {
// Format error with line/column, surrounding context
}
}
```
**Design choices**:
- Rich error types with position information
- Collect all alternative attempts for better diagnostics
- Context-aware error formatting
## Implementation Plan
### Phase 1: Core Infrastructure (Foundation)
**Files to create**:
- `src/input_stream.rs` - Input management with backtracking
- `src/parse_context.rs` - Parse state tracking
- `src/native_parser.rs` - Main parser struct (skeleton)
**Tests**: Unit tests for `InputStream` operations
**Effort**: 2-3 hours
### Phase 2: Basic Terminals (Proof of Concept)
**Implement**:
- `parse_terminal()` - Literal matching
- `parse_charclass()` - Character class matching
- `parse_nonterminal()` - Rule reference
**Tests**:
- `test` - Basic grammar
- `aaa` - Hidden literals
- `string` - String literals
**Effort**: 3-4 hours
### Phase 3: Sequences and Alternatives
**Implement**:
- `parse_sequence()` - Factor concatenation
- `parse_alternatives()` - Ordered choice
**Tests**:
- `address` - Multiple alternatives
- `expr` - Nested alternatives
**Effort**: 2-3 hours
### Phase 4: Repetitions
**Implement**:
- `parse_repetition()` - `*`, `+`, `?`
- Separated repetitions - `**`, `++`
**Tests**:
- `email` - Separated repetitions
- `hash` - Multiple separators
**Effort**: 4-5 hours
### Phase 5: Marks and Insertions
**Implement**:
- Mark application logic
- Insertion handling
- Attribute extraction
**Tests**:
- `marked` - Various marks
- `unicode-classes` - **The critical test** with `(-[Co], +".")*`
**Effort**: 3-4 hours
### Phase 6: Full Test Suite
**Implement**:
- Error handling improvements
- Edge cases
- Performance tuning
**Tests**: Run full 133-test suite
**Effort**: 5-8 hours
## Expected Benefits
### 1. Correctness
- **Direct semantics**: No translation layer to introduce bugs
- **Specification alignment**: Code structure mirrors iXML spec
- **Insertion/suppression**: Native support, not bolted-on
### 2. Performance
Expected performance characteristics:
| Terminal match | O(k) | k = literal length |
| Character class | O(1) | Single character test |
| Alternative | O(n × m) | n = alternatives, m = input |
| Sequence | O(k × m) | k = factors, m = input |
| Repetition | O(n × m) | n = iterations, m = input per iteration |
**Overall**: O(n × m) worst-case where n = grammar size, m = input size
For typical grammars:
- **Faster than Earley** for deterministic grammars (no state explosion)
- **Similar to Earley** for ambiguous grammars
- **Much simpler** implementation (no parser table construction)
### 3. Simplicity
Lines of code estimate:
| `input_stream.rs` | ~100 | Simple wrapper |
| `parse_context.rs` | ~50 | State tracking |
| `native_parser.rs` | ~600-800 | Core interpreter |
| Error handling | ~100 | Rich error types |
| Tests | ~500 | Comprehensive coverage |
| **Total** | **~1400** | vs. 3600+ for Earley approach |
### 4. Debuggability
- Stack traces show actual rule names (not generated symbols)
- Errors reference grammar rules directly
- No "missing action" or "missing symbol" cryptic messages
- Easy to add tracing/logging
### 5. Maintainability
- One-to-one mapping: grammar construct → function
- No global state (except grammar itself)
- Pure functions (input → result)
- Easy to extend with new features
## Comparison: Earley vs. Native
| **Abstraction** | Parser generator (wrong level) | Direct specification |
| **Insertions** | No native support | First-class operation |
| **Suppressions** | Wrapper rules required | Built into marks |
| **Complexity** | 3600+ LOC | ~1400 LOC estimated |
| **Performance** | O(n³) worst-case, state explosion | O(n²) typical, predictable |
| **Errors** | "Missing Action", "No Rule completes" | Grammar-aligned messages |
| **Debugging** | Generated symbols hard to trace | Direct rule names |
| **Test Pass Rate** | 39.8% (hard limit) | 87%+ expected |
## Next Steps
1. ✅ **Document design** (this file)
2. **Implement Phase 1** - Core infrastructure
3. **Implement Phase 2** - Basic terminals (get first test passing)
4. **Iterate through phases** 3-6
5. **Run full test suite** - Target 80%+ pass rate
6. **Performance profiling** - Optimize hot paths
7. **Production-ready** - Documentation, error messages, API design
## Open Questions
1. **Ambiguity handling**: How to represent multiple parse trees?
- Option A: Return first match (PEG-style)
- Option B: Return all matches (GLR-style)
- **Recommendation**: Start with Option A, add Option B later if needed
2. **Left recursion**: How to handle indirect left-recursion?
- Option A: Detect and fail (simple)
- Option B: Packrat memoization to handle safely
- **Recommendation**: Start with Option A, good enough for most grammars
3. **Unicode categories**: Reuse existing code or refactor?
- **Recommendation**: Reuse `unicode_category_to_rangeset()` from `runtime_parser.rs`
4. **Separated repetitions with marks**: Edge case handling?
- Example: `word++(-",")` - separator is hidden
- **Recommendation**: Test thoroughly, likely works naturally
## Success Criteria
A successful native interpreter implementation will:
- ✅ Pass `unicode-classes` test (the Earley blocker)
- ✅ Pass all 43 currently passing tests
- ✅ Pass additional tests that Earley couldn't handle
- ✅ Have clear, grammar-aligned error messages
- ✅ Be simpler and more maintainable than Earley approach
- ✅ Demonstrate Rust language strengths (pattern matching, iterators, zero-cost abstractions)
- ✅ Serve as foundation for 90%+ conformance with the full iXML spec