mimium-lang 4.0.0-alpha

# Parser for mimium

This directory contains the parser implementation for mimium, based on the Red-Green Syntax Tree pattern.

For the formal syntax specification, see [ebnf.md](./ebnf.md).

## Overview

The parser preserves all information from the source code, including comments, whitespace, and exact positions. This makes it suitable for IDE features like:

- Code formatting
- Refactoring
- Semantic highlighting
- Code completion with context
- Comment preservation during transformations

## Architecture

The parser consists of four main stages:

### 1. Tokenizer (`tokenizer.rs`)

Converts source text into position-aware tokens using chumsky parser combinators.

- **Input**: Source text (`&str`)
- **Output**: Sequence of `Token`s
- Each token stores:
  - `kind`: Type of token (identifier, keyword, literal, etc.)
  - `start`: Byte offset in source
  - `length`: Length in bytes
- Literals store only position information, not the actual value (values can be reconstructed from source + position)
- **Error Recovery**: Uses chumsky's error recovery to continue parsing after encountering invalid characters
  - Invalid characters are marked with `TokenKind::Error`
  - Parser continues collecting valid tokens after errors

### 2. Pre-parser (`preparser.rs`)

Separates trivia (comments, whitespace) from syntax tokens.

- **Input**: Token sequence
- **Output**: `PreParsedTokens` containing:
  - `token_indices`: Indices of non-trivia tokens
  - `leading_trivia_map`: Map from token index to leading trivia
  - `trailing_trivia_map`: Map from token index to trailing trivia
- Comments and whitespace are preserved in trivia maps
- Linebreaks act as separators between leading and trailing trivia

### 3. CST Parser (`cst_parser.rs`, `green.rs`)

Parses token indices into a Green Tree (Concrete Syntax Tree).

- **Input**: Token indices from pre-parser
- **Output**: Green Tree (position-independent, immutable CST)
- Green nodes use **SlotMap-based interning** for efficient memory management
  - Nodes stored in `GreenNodeArena` with O(1) lookups
  - Referenced via lightweight `GreenNodeId` handles
  - Better cache locality compared to Arc-based sharing
  - No reference counting overhead
- Green nodes represent the complete syntactic structure
- Can be shared and cached across multiple uses
- Implements recursive descent parsing
- **Lookahead Disambiguation**:
  - Tuples `(a, b)` vs parenthesized expressions `(a)`
  - Records `{a = 1}` vs blocks `{stmt}`
  - Tuples require a comma (single element tuples need trailing comma)
  - Records require field assignment pattern `ident = expr`

### 4. Red Tree / AST (`red.rs`, `lower.rs`)

Converts Green Tree to Red Tree and AST.

- **Input**: Green Tree
- **Output**: 
  - Red Tree: Position-aware nodes with absolute offsets
  - AST: Traditional abstract syntax tree without trivia
- Red nodes are created on-demand from Green nodes
- **Let-body-then Chain Transformation**: When converting to AST, flat statement lists are transformed into nested Let structures
  - Green/Red Tree: Statements stored as a flat list for full-fidelity representation
  - AST: Let bindings nested as `Let(x, value, Let(y, value2, body))` for semantic analysis
  - Matches the structure expected by the main mimium compiler

## Usage

```rust
use mimium_lang::compiler::parser;

let source = "fn dsp() { 42 }";

// Complete pipeline
let (ast, tokens, preparsed, arena, errors) = parser::parse(source);

// Or step by step:
let tokens = parser::tokenize(source);
let preparsed = parser::preparse(&tokens);
let (green_id, arena, tokens, errors) = parser::parse_cst(tokens, &preparsed);
let red = parser::green_to_red(green_id, 0);
let ast = parser::red_to_ast(&red, source, &tokens, &arena);
```

### Error Recovery

The tokenizer can recover from invalid characters and continue parsing:

```rust
let source = "fn dsp() { let x = 42 § let y = 3.14 © }";
let tokens = parser::tokenize(source);

// The parser continues and marks invalid characters as Error tokens
// Valid tokens: fn, dsp, (, ), {, let, x, =, 42, let, y, =, 3.14, }
// Error tokens: §, ©
```

## Red-Green Syntax Tree Pattern

This implementation follows the Red-Green Syntax Tree pattern used in rust-analyzer and Roslyn:

- **Green Tree**: Immutable, position-independent nodes that represent the syntactic structure
  - Can be shared and cached
  - Efficient for incremental parsing
  - Contains all tokens including trivia

- **Red Tree**: Position-aware wrapper around Green nodes
  - Created on-demand with absolute positions
  - Used for queries and navigation
  - Efficient as Green nodes are reused

## Testing

```bash
cargo test -p mimium-lang parser
```

## References

- [ebnf.md](./ebnf.md) - Formal syntax specification in EBNF
- [rust-analyzer's rowan library](https://github.com/rust-lang/rust-analyzer/tree/master/crates/rowan)
- [Roslyn's Red-Green Trees](https://ericlippert.com/2012/06/08/red-green-trees/)
- [Simple but Powerful Pratt Parsing](https://matklad.github.io/2020/04/13/simple-but-powerful-pratt-parsing.html)