rustixml 0.3.1 - Docs.rs

# rustixml Conformance Roadmap

This document outlines the known issues in rustixml and provides a clear roadmap for improving iXML conformance.

## Current Conformance (Unreleased): 76.9%

- **Total Tests**: 50 / 65 passing
- **Correctness**: 47 / 49 (95.9%)
- **Ambiguous**: 0 / 13 (0.0%)
- **Error**: 3 / 3 (100.0%)

This is an improvement from:
- v0.2.0 post-fixes: 75.4% (49/65)
- Initial v0.2.0: 69.2% (45/65)

---

## Completed Improvements (Since v0.2.0)

### 1. Grammar Parser Enhancements ✅
- **Status**: `Done`
- **Goal**: Implement missing iXML 1.0 grammar features.
- **Fixed (4 tests)**:
    - `correct/version-decl` (Version declaration)
    - `correct/version-decl.2` (Version declaration with any string)
    - `correct/ws-and-delim` (`=` as rule separator)
    - `error/invalid-char` (`+#hex` insertion syntax)

### 2. Full Left-Recursion Support ✅
- **Status**: `Done`
- **Goal**: Implement seed-growing algorithm for complete left-recursion handling.
- **Fixed (6 tests)**:
    - `correct/expr` - Expression grammar with left-recursive operators
    - `correct/json` - JSON grammar with nested structures
    - `correct/arithmetic` - Arithmetic expression parsing
    - `ambiguous/expr0`, `ambiguous/expr1`, `ambiguous/expr2` - Expression ambiguity tests
- **Implementation**: Full seed-growing algorithm documented in `docs/SEED_GROWING_IMPLEMENTATION.md`
- **Performance**: < 1ms analysis overhead, ~10-20% runtime overhead only for recursive grammars

### 3. Grammar Normalization Framework ✅
- **Status**: `Done`
- **Goal**: Transform grammars for better analysis (Pemberton's approach)
- **Features**:
    - Hidden/Promoted rule inlining (`-` and `^` marks)
    - Foundation for static analysis
    - Enables ambiguity detection and recursion handling
- **Implementation**: `src/normalize.rs`

### 4. Static Ambiguity Detection ✅
- **Status**: `Done`
- **Goal**: Detect potentially ambiguous grammars and mark output appropriately
- **Features**:
    - Fixpoint nullable detection
    - Three ambiguity patterns detected
    - Automatic `ixml:state="ambiguous"` marking
- **Fixed (2 tests)**:
    - `ambiguous/ambig2` - Nullable alternatives pattern
    - `ambiguous/ambig3` - Consecutive nullable nonterminals

### 5. Unicode Category Handling ✅
- **Status**: `Done`
- **Goal**: Properly exclude newlines from Unicode control categories per iXML spec
- **Fixed**: Newline handling in `Cc` and `C` categories (U+000A, U+000D exclusion)

---

## v0.3.0: Remaining Issues

**Target Conformance**: ~81-85%
**Focus**: Fix remaining correctness issues and improve Unicode handling.

### 1. Unicode Line Separator Handling
- **Status**: `To Do`
- **Goal**: Extend Unicode category handling to properly exclude U+2028 (LINE SEPARATOR) and U+2029 (PARAGRAPH SEPARATOR) per iXML spec
- **Fixes**: `correct/unicode-classes`
- **Complexity**: Medium - requires extending current Unicode category logic

### 2. Grammar Execution Issues
- **Status**: `To Do`
- **Goal**: Fix meta-level bugs in grammar interpretation and execution
- **Fixes (2 tests)**:
    - `correct/vcard` - Grammar parser bug (eoln rule alternatives counted wrong)
    - `correct/xpath` - Grammar execution bug (predicate parsing fails)
- **Note**: These are different from parsing bugs - they're issues in how rustixml interprets the grammar specification itself
- **Complexity**: Hard - requires deep investigation of grammar interpretation

---

## v0.4.0: Exhaustive Parsing & Advanced Features

**Target Conformance**: ~92-95%
**Focus**: Implement exhaustive parsing for complete ambiguity handling

### 1. Exhaustive Ambiguity Detection (Deferred)
- **Status**: `Deferred`
- **Goal**: Return ALL valid parse trees for ambiguous inputs and generate diagnostic comments
- **Potential Fixes**: 11 remaining `ambiguous/*` tests
- **Challenges**:
    - Combinatorial explosion with nested ambiguities (2^N possible parses)
    - Performance impact (10-100x slower on complex grammars)
    - Memory exhaustion on pathological inputs
    - Test suite convention vs. strict spec requirement
- **Analysis**: See `docs/AMBIGUITY_TRACKING_ANALYSIS.md` for detailed trade-off analysis
- **Decision**: Static detection (currently implemented) provides actionable warnings without performance cost. Exhaustive parsing deferred until real-world use cases demonstrate need.

### 2. Advanced Lexing for Identifiers
- **Status**: `To Do`
- **Goal**: Update the lexer to correctly tokenize identifiers that contain dots (e.g., `unicode-6.1`)
- **Fixes**: `correct/unicode-version-diagnostic`
- **Complexity**: Expert - ambiguity between `.` in identifiers and `.` as rule terminator

---

## Architectural Considerations

### Current Approach: Native Recursive Descent + Seed-Growing
rustixml uses a native recursive descent parser that directly interprets the iXML grammar, enhanced with the seed-growing algorithm for left-recursion. This approach:
- **Pros**: Simple, maps cleanly to iXML semantics, handles left-recursion fully
- **Cons**: Requires explicit implementation of advanced features (exhaustive parsing)
- **Performance**: Excellent for non-ambiguous grammars, minimal overhead for left-recursion

### Seed-Growing Algorithm
The seed-growing algorithm (Frost et al.) provides complete left-recursion support:
- Handles direct and indirect left-recursion
- Works with hidden/promoted rules
- Minimal performance impact (~10-20% only for recursive grammars)
- Well-documented in `docs/SEED_GROWING_IMPLEMENTATION.md`

### Alternative: LALR+GLR
The `markup-blitz` reference implementation uses an LALR+GLR parser, which handles ambiguity and left-recursion automatically. While powerful, this represents a complete architectural rewrite.

**Decision**: The native parser with seed-growing provides excellent conformance (76.9%) with minimal complexity. The remaining issues (3 correctness bugs, 11 ambiguous tests) can be addressed incrementally. LALR+GLR remains a consideration for v2.0 if 100% conformance becomes critical.

## How to Contribute
The best way to contribute is to pick a task from the roadmap!
1. **Medium**: Extend Unicode category handling for U+2028/U+2029 line separators
2. **Hard**: Investigate the `correct/vcard` and `correct/xpath` grammar execution failures
3. **Expert**: Design and implement exhaustive parsing for complete ambiguity detection
4. **Expert**: Implement advanced identifier lexing with dot support

Please see [CONTRIBUTING.md](CONTRIBUTING.md) for more details on setting up the development environment and submitting changes.

## Recent Achievements (Since v0.2.0)
- ✅ Full left-recursion support via seed-growing algorithm
- ✅ Grammar normalization framework (Pemberton's approach)
- ✅ Static ambiguity detection with automatic marking
- ✅ Fixpoint nullable detection
- ✅ Unicode newline handling in control categories
- ✅ 6 new tests passing (expr, json, arithmetic, 3 ambiguous tests)
- ✅ Comprehensive documentation (SEED_GROWING_IMPLEMENTATION.md, AMBIGUITY_TRACKING_ANALYSIS.md)