# laburnum::chumsky Module
Tools for integrating [chumsky](https://docs.rs/chumsky) parser combinators with laburnum's span management and CST infrastructure.
## Architecture Overview
```
┌──────────────────────────────────────────────────────────────────┐
│  Source Code                                                     │
└───────────────────────────────┬──────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│  Lexer (define_tokens!)                                          │
│  • Produces tokens with leading/trailing trivia                  │
│  • Uses wrap! macro for trivia handling                          │
│  • Two-pass: EmptyErr (fast) → Rich (if errors)                  │
└───────────────────────────────┬──────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│  TokenStream (stream.rs)                                         │
│  • Arc<[Spanned<Token>]> for efficient sharing                   │
│  • Bridges lexer output to CST parser input                      │
└───────────────────────────────┬──────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│  CST Parser (define_node_db!)                                    │
│  • Generates Node enum, NodeDb, State, etc.                      │
│  • Supports backtracking via Inspector trait                     │
│  • Two modes: Parser (Vec) and Query (IndexMap)                  │
└───────────────────────────────┬──────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│  CST/AST                                                         │
│  • Nodes with spans, leading/trailing trivia                     │
│  • Ready for semantic analysis via symbolique                    │
└──────────────────────────────────────────────────────────────────┘
```
## Module Structure
| File | Purpose |
|------|---------|
| `mod.rs` | Core types: `State`, `LexExtra`, `SpanCreator`, `LaburnumSpanExt` |
| `stream.rs` | `TokenStream` wrapper for lexer output |
| `node_db.rs` | `define_node_db!` macro for CST infrastructure |
| `lexer/mod.rs` | Re-exports for lexer components |
| `lexer/trivia.rs` | `Trivia` type and parsers, `wrap!` macro |
| `lexer/define_tokens.rs` | `define_tokens!` macro for token generation |
## Quick Start
A typical integration has four parts, the last of which is optional:
### 1. Define Token Types
```rust
// In lexer/keyword.rs
laburnum::chumsky::define_tokens! {
    #[chumsky::text::unicode::keyword] // Use keyword matching
    Token::Keyword(Keyword -> [
        "fn" => Fn,
        "let" => Let,
        "async" => Async #[warn "reserved for future use: {}"],
    ])
}

// In lexer/control.rs
laburnum::chumsky::define_tokens! {
    #[just] // Use exact matching
    Token::Ctrl(Control -> [
        "==" => EqEq, // Multi-char operators first
        "=" => Eq,
        "+" => Plus,
    ])
}
```
### 2. Create Lexer State
```rust
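pub enum LexerState<'src> {
    /// Fast pass: only remembers whether any token error occurred.
    Detect {
        inner: laburnum::chumsky::State<'src>,
        did_see_token_error: bool,
    },
    /// Slow pass: records every token error with its span.
    Collect {
        inner: laburnum::chumsky::State<'src>,
        token_errors: Vec<(TokenError, laburnum::Span)>,
    },
}

// `inner_mut()` (not shown) returns the `State` held by the active variant.
impl laburnum::chumsky::SpanCreator for LexerState<'_> {
    fn span_from(&mut self, span: SimpleSpan) -> laburnum::Span {
        // `SimpleSpan` stores (start, end); `create_span` takes (start, length).
        self.inner_mut().span_cache.create_span(span.start, span.end - span.start)
    }
}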
```
### 3. Implement Two-Pass Lexing
```rust
pub fn lex<'src>(
    source_key: SourceKey,
    content: &'src str,
    span_cache: &'src mut SpanCache,
) -> (TokenList, Vec<Rich<'src, char, SimpleSpan>>) {
    // Fast pass: detect errors only
    let checkpoint = span_cache.checkpoint();
    {
        let mut state = LexerState::new_detect(source_key, span_cache);
        let result = lexer::<EmptyErr>()
            .parse_with_state(content, &mut state)
            .into_result();
        if let Ok(tokens) = result {
            return (tokens, Vec::new()); // No errors - fast path
        }
    }

    // Rollback and collect detailed errors
    span_cache.rollback(checkpoint);
    let mut state = LexerState::new_collect(source_key, span_cache);
    let (tokens, errors) = lexer::<Rich<'src, char>>()
        .parse_with_state(content, &mut state)
        .into_output_errors();
    (tokens.unwrap_or_default(), errors)
}
```
### 4. Define CST Node Database (Optional)
```rust
laburnum::chumsky::define_node_db! { Cst =>
    crate::errata::Error,
    crate::errata::Todo,
    crate::symbol::Ident,
    crate::expr::Expr,
    // ... more node types
}
```
This generates:
- `Node` enum with all node variants
- `CstNodeId` for strongly-typed references
- `CstNodeDb` with Parser/Query variants
- `CstState` with Detect/Collect modes
- `CstParserMapExtraExt` trait for node insertion
## Key Patterns
### Two-Pass Lexing/Parsing
The two-pass pattern optimizes for the common case of valid input:
1. **Detect Pass**: Use `EmptyErr` for fast parsing without error allocation
2. **Check Result**: If no errors, return immediately
3. **Collect Pass**: Only if errors were detected, roll back the `SpanCache` and re-parse with `Rich` errors
This avoids expensive error formatting in the success path.
```
Input → Detect (EmptyErr) → Success? → Return tokens
                               ↓ (errors)
                       Rollback SpanCache
                               ↓
                         Collect (Rich) → Return tokens + errors
```
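The flow above can be sketched with the standard library alone. Everything in this sketch is a hypothetical stand-in and not laburnum's or chumsky's API: `SpanLog` plays the role of `SpanCache`, `scan` is a toy digit lexer, and `lex_two_pass` mirrors the shape of the `lex` function from the Quick Start.

```rust
/// Hypothetical stand-in for `SpanCache`: records (start, length) pairs.
#[derive(Default)]
struct SpanLog {
    spans: Vec<(usize, usize)>,
}

impl SpanLog {
    /// A checkpoint is the number of spans recorded so far.
    fn checkpoint(&self) -> usize {
        self.spans.len()
    }

    /// Discard every span recorded after `checkpoint`.
    fn rollback(&mut self, checkpoint: usize) {
        self.spans.truncate(checkpoint);
    }
}

/// Toy lexer: each ASCII digit is a token, whitespace is skipped, and
/// anything else is an error. `on_error` returns `false` to abort the pass.
fn scan(
    src: &str,
    log: &mut SpanLog,
    mut on_error: impl FnMut(usize, char) -> bool,
) -> Option<Vec<u32>> {
    let mut tokens = Vec::new();
    for (offset, ch) in src.char_indices() {
        if let Some(digit) = ch.to_digit(10) {
            log.spans.push((offset, ch.len_utf8()));
            tokens.push(digit);
        } else if !ch.is_whitespace() && !on_error(offset, ch) {
            return None;
        }
    }
    Some(tokens)
}

fn lex_two_pass(src: &str, log: &mut SpanLog) -> (Vec<u32>, Vec<String>) {
    let checkpoint = log.checkpoint();
    // Detect pass: abort on the first error without building a message.
    if let Some(tokens) = scan(src, log, |_, _| false) {
        return (tokens, Vec::new());
    }
    // Undo the spans from the failed pass, then re-scan with full messages.
    log.rollback(checkpoint);
    let mut errors = Vec::new();
    let tokens = scan(src, log, |offset, ch| {
        errors.push(format!("unexpected {ch:?} at byte {offset}"));
        true
    });
    (tokens.unwrap_or_default(), errors)
}
```

Valid input such as `"1 2 3"` returns from the first `scan` call and never allocates an error string. For `"1 x 2 y"` the detect pass aborts at `x`, the rollback removes the span it had already recorded, and the collect pass yields two tokens, two messages, and exactly two spans.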
### LexerState Enum Pattern
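The `Detect`/`Collect` enum keeps each mode's bookkeeping in its own variant, so code cannot touch a field that belongs to the other mode:
- **Detect mode**: Only sets the `did_see_token_error: bool` flag
- **Collect mode**: Accumulates errors in a `Vec<(TokenError, Span)>`
The compiler enforces this separation, and each variant documents what its pass records.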
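A minimal, std-only sketch of this pattern follows. `Inner` and this `TokenError` are hypothetical stand-ins for `laburnum::chumsky::State` and the crate's own error type, a byte offset stands in for `laburnum::Span`, and `report` and `has_errors` are illustrative helpers. The `inner_mut` accessor shows one plausible shape for the helper that the `SpanCreator` impl in the Quick Start calls.

```rust
/// Hypothetical stand-in for `laburnum::chumsky::State`.
#[derive(Default)]
struct Inner {
    spans_created: usize,
}

/// Hypothetical stand-in for the crate's token error type.
#[derive(Debug, PartialEq)]
enum TokenError {
    UnterminatedString,
}

enum LexerState {
    Detect { inner: Inner, did_see_token_error: bool },
    Collect { inner: Inner, token_errors: Vec<(TokenError, usize)> },
}

impl LexerState {
    /// Both variants carry an `inner`, so one accessor serves both modes.
    fn inner_mut(&mut self) -> &mut Inner {
        match self {
            LexerState::Detect { inner, .. } | LexerState::Collect { inner, .. } => inner,
        }
    }

    /// Record a token error in whichever form the active mode keeps.
    fn report(&mut self, error: TokenError, offset: usize) {
        match self {
            // Detect mode drops the details and only raises the flag.
            LexerState::Detect { did_see_token_error, .. } => *did_see_token_error = true,
            LexerState::Collect { token_errors, .. } => token_errors.push((error, offset)),
        }
    }

    fn has_errors(&self) -> bool {
        match self {
            LexerState::Detect { did_see_token_error, .. } => *did_see_token_error,
            LexerState::Collect { token_errors, .. } => !token_errors.is_empty(),
        }
    }
}
```

Token parsers call `report` without knowing which pass is running; only the `match` inside the state decides whether the error is stored or reduced to a flag.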
### Trivia Handling
Every token carries optional leading and trailing trivia:
```rust
Token::Keyword(
    Option<Trivia>,             // Leading whitespace
    laburnum::Spanned<Keyword>, // The token itself
    Option<Trivia>,             // Trailing whitespace
)
```
The `wrap!` macro handles this automatically:
```rust
wrap!(
    { your_token_parser } -> |((leading, inner), trailing), e| {
        YourToken::Variant(leading, inner, trailing)
    }
)
```
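To make the `((leading, inner), trailing)` shape concrete, here is a hand-rolled, std-only sketch of the same wrapping idea. It is not what `wrap!` expands to: this `Trivia`, `Wrapped`, `trivia`, and `wrap_word` are hypothetical names, and the sketch assumes that whitespace following a token is attached to that token as trailing trivia, which may differ from laburnum's actual rule.

```rust
/// Hypothetical stand-in for `Trivia`: the raw whitespace text.
#[derive(Debug, PartialEq)]
struct Trivia(String);

#[derive(Debug, PartialEq)]
struct Wrapped {
    leading: Option<Trivia>,
    text: String,
    span: (usize, usize), // (start, length) of the token, trivia excluded
    trailing: Option<Trivia>,
}

/// Consume a run of whitespace starting at byte offset `pos`.
fn trivia(src: &str, pos: usize) -> (Option<Trivia>, usize) {
    let rest = &src[pos..];
    let len = rest.len() - rest.trim_start().len();
    if len == 0 {
        (None, pos)
    } else {
        (Some(Trivia(rest[..len].to_string())), pos + len)
    }
}

/// Parse `leading trivia, word, trailing trivia` and return the wrapped
/// token together with the offset where the next token starts.
fn wrap_word(src: &str, pos: usize) -> Option<(Wrapped, usize)> {
    let (leading, start) = trivia(src, pos);
    let rest = &src[start..];
    let len = rest
        .find(|c: char| !c.is_alphanumeric())
        .unwrap_or(rest.len());
    if len == 0 {
        return None; // no word here
    }
    let (trailing, end) = trivia(src, start + len);
    let token = Wrapped {
        leading,
        text: rest[..len].to_string(),
        span: (start, len),
        trailing,
    };
    Some((token, end))
}
```

The token's span covers only the word itself, so diagnostics point at `let` and not at the surrounding whitespace, while the trivia is kept so that the original text can be reproduced.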
### Span Management
- **Lexer spans**: Use `laburnum::Span` via `SpanCreator` trait
- **Parser spans**: Use chumsky's `SimpleSpan` internally
- **Conversion**: `LaburnumSpanExt::create_span()` in `.map_with()` closures
The `SpanCache` enables:
- Efficient span creation during parsing
- Checkpoint/rollback for two-pass parsing
- Text recovery from spans
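The three capabilities listed above can be illustrated with a toy cache. This is a sketch under stated assumptions, not laburnum's implementation: `ToySpanCache`, `SpanId`, `span_of_range`, and `text` are hypothetical names, and storage is a plain `Vec`. It also shows the (start, end) to (start, length) conversion that `span_from` performs in the Quick Start.

```rust
use std::ops::Range;

/// Hypothetical handle standing in for `laburnum::Span`.
#[derive(Clone, Copy, Debug, PartialEq)]
struct SpanId(usize);

/// Hypothetical stand-in for `SpanCache`, backed by a plain `Vec`.
struct ToySpanCache<'src> {
    source: &'src str,
    spans: Vec<(usize, usize)>, // (start, length) in bytes
}

impl<'src> ToySpanCache<'src> {
    fn new(source: &'src str) -> Self {
        Self { source, spans: Vec::new() }
    }

    /// Store a span as (start, length) and hand back a cheap handle.
    fn create_span(&mut self, start: usize, len: usize) -> SpanId {
        self.spans.push((start, len));
        SpanId(self.spans.len() - 1)
    }

    /// Convert a (start, end) range, as a parser would report it.
    fn span_of_range(&mut self, range: Range<usize>) -> SpanId {
        self.create_span(range.start, range.end - range.start)
    }

    /// Recover the source text for a span, if the span is still live.
    fn text(&self, id: SpanId) -> Option<&'src str> {
        let &(start, len) = self.spans.get(id.0)?;
        self.source.get(start..start + len)
    }

    fn checkpoint(&self) -> usize {
        self.spans.len()
    }

    /// Drop every span created after `checkpoint`.
    fn rollback(&mut self, checkpoint: usize) {
        self.spans.truncate(checkpoint);
    }
}
```

In this toy version a handle created after a checkpoint stops resolving once the cache is rolled back, which is why a second pass must recreate its spans instead of reusing handles from the first.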
## Macro Relationships
```
define_tokens! ────uses────► wrap!
      │                        │
      │                        ▼
      │                 trivia parsers
      │               (leading, trailing)
      │
      ▼
  Token enum
  Lexer function
  Match macros (just!, spanned!, etc.)
```
```
define_node_db! ─────────► generates all CST infrastructure
       │
       ├── Node enum (with leading/trailing trivia)
       ├── NodeId (span + key)
       ├── NodeDb (Parser/Query variants)
       ├── State (Detect/Collect variants)
       ├── Checkpoint (for backtracking)
       ├── ParserMapExtraExt (insert_* methods)
       └── Printer (bluegum visualization)
```
## Match Macro Variants
Generated by `define_tokens!`:
| Macro | Yields | Trivia handling |
|-------|--------|-----------------|
| `just!(Variant)` | Enum variant | Ignores trivia |
| `spanned!(Variant)` | `VariantSpan` struct | Ignores trivia |
| `spanned_unboxed!(Variant)` | `VariantSpan` (unboxed) | Ignores trivia |
| `spanned_with_trivia!(Variant)` | `(leading, variant, trailing, span)` | Exposes trivia |
| `spanned_no_trivia!(Variant)` | `VariantSpan` | Fails if trivia present |
| `spanned_no_trailing_trivia!(Variant)` | `(leading, VariantSpan)` | Fails if trailing trivia present |
## Required Dependencies
A crate that uses these macros needs the following entries in its `Cargo.toml`:
```toml
[dependencies]
laburnum = { path = "..." }
chumsky = "0.10"
bluegum = { path = "..." }
indexmap = "2"
owo-colors = "4"
paste = "1"
```
## Common Pitfalls
### 1. Future-Compat Warning for Generated Macros
The `define_tokens!` macro generates match macros using `#[macro_export]`, which triggers a
future-compat warning (`macro_expanded_macro_exports_accessed_by_absolute_paths`).
This is a known Rust limitation; see [rust-lang/rust#52234](https://github.com/rust-lang/rust/issues/52234).
The warning will remain until Rust provides a better pattern for macro re-exports from
macro-generated code. The macros work correctly; this is just a warning about potential
future Rust changes.
### 2. Multi-Character Operators First
In `define_tokens!`, list longer operators before any shorter operator that is a prefix of them, so the shorter one cannot match first:
```rust
Token::Ctrl(Control -> [
    "==" => EqEq, // Must come before "="
    "!=" => NotEq,
    "=" => Eq,
    // ...
])
```
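Why the order matters can be shown with a std-only first-match-wins lexer. This is a sketch, assuming the alternatives are tried in declaration order as the advice above implies; `lex_ops` and the tables in the usage note are hypothetical and unrelated to the code `define_tokens!` generates.

```rust
/// Lex `src` by repeatedly taking the first table entry whose literal
/// is a prefix of the remaining input. Stops at the first unknown input.
fn lex_ops<'a>(mut src: &str, table: &[(&str, &'a str)]) -> Vec<&'a str> {
    let mut out = Vec::new();
    while !src.is_empty() {
        match table.iter().find(|(lit, _)| src.starts_with(*lit)) {
            Some((lit, name)) => {
                out.push(*name);
                src = &src[lit.len()..];
            }
            None => break,
        }
    }
    out
}
```

With the table `[("==", "EqEq"), ("!=", "NotEq"), ("=", "Eq")]`, the input `==` lexes as a single `EqEq`. Moving `("=", "Eq")` to the front makes the same input lex as two `Eq` tokens, because `=` matches before `==` is ever tried.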
### 3. Keyword vs Just
- Use `#[chumsky::text::unicode::keyword]` for language keywords to prevent matching `letx` as `let`
- Use `#[just]` for operators and delimiters
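The difference between the two matching styles can be sketched with the standard library alone. `match_exact` and `match_keyword` are hypothetical helpers that imitate the behaviour described above; they are not chumsky's implementation, and the identifier rule (alphanumerics plus `_`) is a simplifying assumption.

```rust
fn is_ident_continue(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

/// `just`-style: succeeds whenever the input starts with the literal.
/// Returns the remaining input on success.
fn match_exact<'s>(src: &'s str, lit: &str) -> Option<&'s str> {
    src.strip_prefix(lit)
}

/// Keyword-style: reads a whole identifier first, then compares it with
/// the keyword, so `letx` is never mistaken for `let`.
fn match_keyword<'s>(src: &'s str, kw: &str) -> Option<&'s str> {
    let end = src
        .find(|c: char| !is_ident_continue(c))
        .unwrap_or(src.len());
    if &src[..end] == kw {
        Some(&src[end..])
    } else {
        None
    }
}
```

Here `match_exact("letx = 1", "let")` succeeds and leaves `x = 1`, which is the bug the keyword attribute avoids, while `match_keyword("letx = 1", "let")` returns `None`. Exact matching remains the right choice for operators, where `+` followed by `=` should still be recognised as `+`.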
### 4. SpanCache Lifetime
The `SpanCache` must outlive the state that borrows it, so create it first and pass it to the state by mutable reference:
```rust
let mut span_cache = SpanCache::default();
let mut state = LexerState::new_detect(source_key, &mut span_cache);
```
### 5. Inspector Trait
For backtracking support with `define_node_db!`, the parser state must implement `chumsky::inspector::Inspector`. The macro generates this implementation automatically for its `*State` type (for example `CstState`).
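The reason backtracking needs a hook into the state can be shown with a toy node store. This std-only sketch borrows the names `on_save` and `on_rewind` from chumsky's `Inspector` trait but omits their cursor arguments; `ToyNodeDb`, `ToyCheckpoint`, and `or_else` are hypothetical, and the truncate-on-rewind strategy is an assumption about what a `Vec`-backed store would do, not a description of the generated code.

```rust
/// Hypothetical Vec-backed node store.
#[derive(Default)]
struct ToyNodeDb {
    nodes: Vec<String>,
}

/// What must be remembered to undo speculative inserts.
#[derive(Clone, Copy)]
struct ToyCheckpoint {
    node_count: usize,
}

impl ToyNodeDb {
    fn insert(&mut self, node: &str) -> usize {
        self.nodes.push(node.to_string());
        self.nodes.len() - 1
    }

    /// Called before a parser branch that might fail.
    fn on_save(&self) -> ToyCheckpoint {
        ToyCheckpoint { node_count: self.nodes.len() }
    }

    /// Called when the branch failed: forget nodes it inserted.
    fn on_rewind(&mut self, checkpoint: ToyCheckpoint) {
        self.nodes.truncate(checkpoint.node_count);
    }
}

/// Ordered choice: try `first`; on failure rewind the store, then try `second`.
fn or_else(
    db: &mut ToyNodeDb,
    first: impl FnOnce(&mut ToyNodeDb) -> bool,
    second: impl FnOnce(&mut ToyNodeDb) -> bool,
) -> bool {
    let checkpoint = db.on_save();
    if first(db) {
        return true;
    }
    db.on_rewind(checkpoint);
    second(db)
}
```

Without the rewind, a branch that inserts a partial node and then fails would leave that node behind, and the store would no longer match the tree the parser finally accepts.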
## Related ADRs
- **ADR0002**: Rope-based span storage - explains the `Span` design
- **ADR0001**: Content-addressed storage - broader architectural context
- **ADR0003**: Symbolique - how CST nodes feed into symbol analysis
## Feature Adoption Guide
| Feature | Required | Use case |
|---------|----------|----------|
| `define_tokens!` | Yes | All lexers - generates token enums and match macros |
| `State`/`SpanCreator` | Yes | All parsers - manages spans |
| Two-pass lexing | Recommended | Performance optimization for LSP use cases |
| `LexerState` enum | Recommended | Type-safe state for two-pass lexing |
| `wrap!` macro | Optional | Simplifies trivia handling in token parsers |
| `define_node_db!` | Optional | Full CST with backtracking, trivia preservation |