> WIP: This project is still under active development and not ready for use.

<div align="center">
<h1>Tokit</h1>
</div>
<div align="center">

Blazing fast parser combinators with parse-while-lexing architecture (zero-copy), deterministic LALR-style parsing, and no hidden backtracking.

[<img alt="github" src="https://img.shields.io/badge/github-al8n/tokit-8da0cb?style=for-the-badge&logo=Github" height="22">][Github-url]
<img alt="LoC" src="https://img.shields.io/endpoint?url=https%3A%2F%2Fgist.githubusercontent.com%2Fal8n%2F327b2a8aef9003246e45c6e47fe63937%2Fraw%2Ftokit" height="22">
[<img alt="Build" src="https://img.shields.io/github/actions/workflow/status/al8n/tokit/ci.yml?logo=Github-Actions&style=for-the-badge" height="22">][CI-url]
[<img alt="codecov" src="https://img.shields.io/codecov/c/gh/al8n/tokit?style=for-the-badge&token=6R3QFWRWHL&logo=codecov" height="22">][codecov-url]

[<img alt="docs.rs" src="https://img.shields.io/badge/docs.rs-tokit-66c2a5?style=for-the-badge&labelColor=555555&logo=data:image/svg+xml;base64,PHN2ZyByb2xlPSJpbWciIHhtbG5zPSJodHRwOi8vd3d3LnczLm9yZy8yMDAwL3N2ZyIgdmlld0JveD0iMCAwIDUxMiA1MTIiPjxwYXRoIGZpbGw9IiNmNWY1ZjUiIGQ9Ik00ODguNiAyNTAuMkwzOTIgMjE0VjEwNS41YzAtMTUtOS4zLTI4LjQtMjMuNC0zMy43bC0xMDAtMzcuNWMtOC4xLTMuMS0xNy4xLTMuMS0yNS4zIDBsLTEwMCAzNy41Yy0xNC4xIDUuMy0yMy40IDE4LjctMjMuNCAzMy43VjIxNGwtOTYuNiAzNi4yQzkuMyAyNTUuNSAwIDI2OC45IDAgMjgzLjlWMzk0YzAgMTMuNiA3LjcgMjYuMSAxOS45IDMyLjJsMTAwIDUwYzEwLjEgNS4xIDIyLjEgNS4xIDMyLjIgMGwxMDMuOS01MiAxMDMuOSA1MmMxMC4xIDUuMSAyMi4xIDUuMSAzMi4yIDBsMTAwLTUwYzEyLjItNi4xIDE5LjktMTguNiAxOS45LTMyLjJWMjgzLjljMC0xNS05LjMtMjguNC0yMy40LTMzLjd6TTM1OCAyMTQuOGwtODUgMzEuOXYtNjguMmw4NS0zN3Y3My4zek0xNTQgMTA0LjFsMTAyLTM4LjIgMTAyIDM4LjJ2LjZsLTEwMiA0MS40LTEwMi00MS40di0uNnptODQgMjkxLjFsLTg1IDQyLjV2LTc5LjFsODUtMzguOHY3NS40em0wLTExMmwtMTAyIDQxLjQtMTAyLTQxLjR2LS42bDEwMi0zOC4yIDEwMiAzOC4ydi42em0yNDAgMTEybC04NSA0Mi41di03OS4xbDg1LTM4Ljh2NzUuNHptMC0xMTJsLTEwMiA0MS40LTEwMi00MS40di0uNmwxMDItMzguMiAxMDIgMzguMnYuNnoiPjwvcGF0aD48L3N2Zz4K" height="20">][doc-url]
[<img alt="crates.io" src="https://img.shields.io/crates/v/tokit?style=for-the-badge&logo=data:image/svg+xml;base64,PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iaXNvLTg4NTktMSI/Pg0KPCEtLSBHZW5lcmF0b3I6IEFkb2JlIElsbHVzdHJhdG9yIDE5LjAuMCwgU1ZHIEV4cG9ydCBQbHVnLUluIC4gU1ZHIFZlcnNpb246IDYuMDAgQnVpbGQgMCkgIC0tPg0KPHN2ZyB2ZXJzaW9uPSIxLjEiIGlkPSJMYXllcl8xIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHhtbG5zOnhsaW5rPSJodHRwOi8vd3d3LnczLm9yZy8xOTk5L3hsaW5rIiB4PSIwcHgiIHk9IjBweCINCgkgdmlld0JveD0iMCAwIDUxMiA1MTIiIHhtbDpzcGFjZT0icHJlc2VydmUiPg0KPGc+DQoJPGc+DQoJCTxwYXRoIGQ9Ik0yNTYsMEwzMS41MjgsMTEyLjIzNnYyODcuNTI4TDI1Niw1MTJsMjI0LjQ3Mi0xMTIuMjM2VjExMi4yMzZMMjU2LDB6IE0yMzQuMjc3LDQ1Mi41NjRMNzQuOTc0LDM3Mi45MTNWMTYwLjgxDQoJCQlsMTU5LjMwMyw3OS42NTFWNDUyLjU2NHogTTEwMS44MjYsMTI1LjY2MkwyNTYsNDguNTc2bDE1NC4xNzQsNzcuMDg3TDI1NiwyMDIuNzQ5TDEwMS44MjYsMTI1LjY2MnogTTQzNy4wMjYsMzcyLjkxMw0KCQkJbC0xNTkuMzAzLDc5LjY1MVYyNDAuNDYxbDE1OS4zMDMtNzkuNjUxVjM3Mi45MTN6IiBmaWxsPSIjRkZGIi8+DQoJPC9nPg0KPC9nPg0KPGc+DQo8L2c+DQo8Zz4NCjwvZz4NCjxnPg0KPC9nPg0KPGc+DQo8L2c+DQo8Zz4NCjwvZz4NCjxnPg0KPC9nPg0KPGc+DQo8L2c+DQo8Zz4NCjwvZz4NCjxnPg0KPC9nPg0KPGc+DQo8L2c+DQo8Zz4NCjwvZz4NCjxnPg0KPC9nPg0KPGc+DQo8L2c+DQo8Zz4NCjwvZz4NCjxnPg0KPC9nPg0KPC9zdmc+DQo=" height="22">][crates-url]
[<img alt="crates.io" src="https://img.shields.io/crates/d/tokit?color=critical&logo=data:image/svg+xml;base64,PD94bWwgdmVyc2lvbj0iMS4wIiBzdGFuZGFsb25lPSJubyI/PjwhRE9DVFlQRSBzdmcgUFVCTElDICItLy9XM0MvL0RURCBTVkcgMS4xLy9FTiIgImh0dHA6Ly93d3cudzMub3JnL0dyYXBoaWNzL1NWRy8xLjEvRFREL3N2ZzExLmR0ZCI+PHN2ZyB0PSIxNjQ1MTE3MzMyOTU5IiBjbGFzcz0iaWNvbiIgdmlld0JveD0iMCAwIDEwMjQgMTAyNCIgdmVyc2lvbj0iMS4xIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHAtaWQ9IjM0MjEiIGRhdGEtc3BtLWFuY2hvci1pZD0iYTMxM3guNzc4MTA2OS4wLmkzIiB3aWR0aD0iNDgiIGhlaWdodD0iNDgiIHhtbG5zOnhsaW5rPSJodHRwOi8vd3d3LnczLm9yZy8xOTk5L3hsaW5rIj48ZGVmcz48c3R5bGUgdHlwZT0idGV4dC9jc3MiPjwvc3R5bGU+PC9kZWZzPjxwYXRoIGQ9Ik00NjkuMzEyIDU3MC4yNHYtMjU2aDg1LjM3NnYyNTZoMTI4TDUxMiA3NTYuMjg4IDM0MS4zMTIgNTcwLjI0aDEyOHpNMTAyNCA2NDAuMTI4QzEwMjQgNzgyLjkxMiA5MTkuODcyIDg5NiA3ODcuNjQ4IDg5NmgtNTEyQzEyMy45MDQgODk2IDAgNzYxLjYgMCA1OTcuNTA0IDAgNDUxLjk2OCA5NC42NTYgMzMxLjUyIDIyNi40MzIgMzAyLjk3NiAyODQuMTYgMTk1LjQ1NiAzOTEuODA4IDEyOCA1MTIgMTI4YzE1Mi4zMiAwIDI4Mi4xMTIgMTA4LjQxNiAzMjMuMzkyIDI2MS4xMkM5NDEuODg4IDQxMy40NCAxMDI0IDUxOS4wNCAxMDI0IDY0MC4xOTJ6IG0tMjU5LjItMjA1LjMxMmMtMjQuNDQ4LTEyOS4wMjQtMTI4Ljg5Ni0yMjIuNzItMjUyLjgtMjIyLjcyLTk3LjI4IDAtMTgzLjA0IDU3LjM0NC0yMjQuNjQgMTQ3LjQ1NmwtOS4yOCAyMC4yMjQtMjAuOTI4IDIuOTQ0Yy0xMDMuMzYgMTQuNC0xNzguMzY4IDEwNC4zMi0xNzguMzY4IDIxNC43MiAwIDExNy45NTIgODguODMyIDIxNC40IDE5Ni45MjggMjE0LjRoNTEyYzg4LjMyIDAgMTU3LjUwNC03NS4xMzYgMTU3LjUwNC0xNzEuNzEyIDAtODguMDY0LTY1LjkyLTE2NC45MjgtMTQ0Ljk2LTE3MS43NzZsLTI5LjUwNC0yLjU2LTUuODg4LTMwLjk3NnoiIGZpbGw9IiNmZmZmZmYiIHAtaWQ9IjM0MjIiIGRhdGEtc3BtLWFuY2hvci1pZD0iYTMxM3guNzc4MTA2OS4wLmkwIiBjbGFzcz0iIj48L3BhdGg+PC9zdmc+&style=for-the-badge" height="22">][crates-url]
<img alt="license" src="https://img.shields.io/badge/License-Apache%202.0/MIT-blue.svg?style=for-the-badge&fontColor=white&logoColor=f5c076&logo=data:image/svg+xml;base64,PCFET0NUWVBFIHN2ZyBQVUJMSUMgIi0vL1czQy8vRFREIFNWRyAxLjEvL0VOIiAiaHR0cDovL3d3dy53My5vcmcvR3JhcGhpY3MvU1ZHLzEuMS9EVEQvc3ZnMTEuZHRkIj4KDTwhLS0gVXBsb2FkZWQgdG86IFNWRyBSZXBvLCB3d3cuc3ZncmVwby5jb20sIFRyYW5zZm9ybWVkIGJ5OiBTVkcgUmVwbyBNaXhlciBUb29scyAtLT4KPHN2ZyBmaWxsPSIjZmZmZmZmIiBoZWlnaHQ9IjgwMHB4IiB3aWR0aD0iODAwcHgiIHZlcnNpb249IjEuMSIgaWQ9IkNhcGFfMSIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIiB4bWxuczp4bGluaz0iaHR0cDovL3d3dy53My5vcmcvMTk5OS94bGluayIgdmlld0JveD0iMCAwIDI3Ni43MTUgMjc2LjcxNSIgeG1sOnNwYWNlPSJwcmVzZXJ2ZSIgc3Ryb2tlPSIjZmZmZmZmIj4KDTxnIGlkPSJTVkdSZXBvX2JnQ2FycmllciIgc3Ryb2tlLXdpZHRoPSIwIi8+Cg08ZyBpZD0iU1ZHUmVwb190cmFjZXJDYXJyaWVyIiBzdHJva2UtbGluZWNhcD0icm91bmQiIHN0cm9rZS1saW5lam9pbj0icm91bmQiLz4KDTxnIGlkPSJTVkdSZXBvX2ljb25DYXJyaWVyIj4gPGc+IDxwYXRoIGQ9Ik0xMzguMzU3LDBDNjIuMDY2LDAsMCw2Mi4wNjYsMCwxMzguMzU3czYyLjA2NiwxMzguMzU3LDEzOC4zNTcsMTM4LjM1N3MxMzguMzU3LTYyLjA2NiwxMzguMzU3LTEzOC4zNTcgUzIxNC42NDgsMCwxMzguMzU3LDB6IE0xMzguMzU3LDI1OC43MTVDNzEuOTkyLDI1OC43MTUsMTgsMjA0LjcyMywxOCwxMzguMzU3UzcxLjk5MiwxOCwxMzguMzU3LDE4IHMxMjAuMzU3LDUzLjk5MiwxMjAuMzU3LDEyMC4zNTdTMjA0LjcyMywyNTguNzE1LDEzOC4zNTcsMjU4LjcxNXoiLz4gPHBhdGggZD0iTTE5NC43OTgsMTYwLjkwM2MtNC4xODgtMi42NzctOS43NTMtMS40NTQtMTIuNDMyLDIuNzMyYy04LjY5NCwxMy41OTMtMjMuNTAzLDIxLjcwOC0zOS42MTQsMjEuNzA4IGMtMjUuOTA4LDAtNDYuOTg1LTIxLjA3OC00Ni45ODUtNDYuOTg2czIxLjA3Ny00Ni45ODYsNDYuOTg1LTQ2Ljk4NmMxNS42MzMsMCwzMC4yLDcuNzQ3LDM4Ljk2OCwyMC43MjMgYzIuNzgyLDQuMTE3LDguMzc1LDUuMjAxLDEyLjQ5NiwyLjQxOGM0LjExOC0yLjc4Miw1LjIwMS04LjM3NywyLjQxOC0xMi40OTZjLTEyLjExOC0xNy45MzctMzIuMjYyLTI4LjY0NS01My44ODItMjguNjQ1IGMtMzUuODMzLDAtNjQuOTg1LDI5LjE1Mi02NC45ODUsNjQuOTg2czI5LjE1Miw2NC45ODYsNjQuOTg1LDY0Ljk4NmMyMi4yODEsMCw0Mi43NTktMTEuMjE4LDU0Ljc3OC0zMC4wMDkgQzIwMC4yMDgsMTY5LjE0NywxOTguOTg1LDE2My41ODIsMTk0Ljc5OCwxNjAuOTAzeiIvPiA8L2c+IDwvZz4KDTwvc3ZnPg==" height="22">

English | [简体中文][zh-cn-url]

</div>

## Overview

**Tokit** is a blazing fast parser combinator library for Rust that uniquely combines:

- **Parse-While-Lexing Architecture**: Zero-copy streaming - parsers consume tokens directly from the lexer without buffering, eliminating allocation overhead
- **Deterministic LALR-Style Parsing**: Explicit lookahead with compile-time buffer capacity, no hidden backtracking
- **Flexible Error Handling**: Same parser code adapts for fail-fast runtime or greedy compiler diagnostics via the `Emitter` trait

Unlike traditional parser combinators that buffer tokens and rely on implicit backtracking, Tokit streams tokens on-demand with predictable, deterministic decisions. This makes it ideal for building high-performance language tooling, DSL parsers, compilers, and REPLs that need both speed and comprehensive error reporting.

### Key Features

- **Parse-While-Lexing**: Zero-copy streaming architecture - no token buffering, no extra allocations
- **No Hidden Backtracking**: Explicit, predictable parsing with lookahead-based decisions instead of implicit backtracking
- **Deterministic + Composable**: Combines the flexibility of parser combinators with LALR-style deterministic table parsing
- **Flexible Error Handling Architecture**: Designed to support both fail-fast parsing (runtime) and greedy parsing (compiler diagnostics) by swapping the `Emitter` type - same parser, different behavior
- **Token-Based Parsing**: Works directly on token streams from any lexer implementing the `Lexer<'inp>` trait
- **Composable Combinators**: Build complex parsers from simple, reusable building blocks
- **Configurable Emission Strategies**: Choose how errors are reported (`Fatal`, `Silent`, `Ignored`)
- **Rich Error Recovery**: Built-in support for error recovery and validation
- **Zero-Cost Abstractions**: All configuration resolved at compile time
- **No-std Support**: Core functionality works without allocator
- **Multiple Source Types**: Support for `str`, `[u8]`, `Bytes`, `BStr`, `HipStr`
- **Logos Integration**: Optional `LogosLexer` adapter for seamless [Logos](https://github.com/maciejhirsz/logos) integration
- **CST Support**: Optional Concrete Syntax Tree support via [rowan](https://github.com/rust-analyzer/rowan)

## Installation

Add this to your `Cargo.toml`:

```toml
[dependencies]
tokit = "0.0.0"
```

### Feature Flags

- `std` (default) - Enable standard library support
- `alloc` - Enable allocator support for no-std environments
- `logos` - Enable `LogosLexer` adapter for Logos integration
- `rowan` - Enable CST (Concrete Syntax Tree) support with rowan integration
- `bytes` - Support for `bytes::Bytes` as token source
- `bstr` - Support for `bstr::BStr` as token source
- `hipstr` - Support for `hipstr::HipStr` as token source
- `among` - Enable `Among<L, M, R>` parseable support
- `smallvec` - Enable small vector optimization utilities
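
For example, to opt into the Logos adapter and CST support on top of the defaults, the `Cargo.toml` entry might look like this (feature names as listed above):

```toml
[dependencies]
tokit = { version = "0.0.0", features = ["logos", "rowan"] }
```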

## Core Components

### Lexer Layer

- **`Lexer<'inp>` Trait**

  Core trait for lexers that produce token streams. Implement this to use any lexer with Tokit.

- **`Token<'a>` Trait**

  Defines token types with:
  - `Kind`: Token kind discriminator
  - `Error`: Associated error type

- **`LogosLexer<'inp, T, L>`** (feature: `logos`)

  Ready-to-use adapter for integrating [Logos](https://github.com/maciejhirsz/logos) lexers.

### Error Handling

Tokit's flexible `Emitter` system allows the same parser to adapt to different use cases by simply changing the error handling strategy:

- **Emitter Strategies**
  - `Fatal` - **Fail-fast parsing**: Stop on first error (default) - perfect for runtime parsing and REPLs
  - **Greedy emitter** (planned) - Collect all errors and continue parsing - perfect for compiler diagnostics and IDEs
  - `Silent` - Suppress error reporting while continuing to parse
  - `Ignored` - Discard errors entirely - useful for robustness testing and fuzzing

**Key Design**: Change the `Emitter` type to switch between fail-fast runtime parsing and greedy compiler diagnostics - **same parser code, different behavior**. This makes Tokit suitable for both:

- **Runtime/REPL**: Fast feedback with `Fatal` emitter
- **Compiler/IDE**: Comprehensive diagnostics with greedy emitter (coming soon)

- **Rich Error Types** (in `error/` module)
  - Token-level: `UnexpectedToken`, `MissingToken`, `UnexpectedEot`
  - Syntax-level: `Unclosed`, `Unterminated`, `Malformed`, `Invalid`
  - Escape sequences: `HexEscape`, `UnicodeEscape`
  - All errors include span tracking

### Utilities

- **Span Tracking**
  - `Span` - Lightweight span representation
  - `Spanned<T>` - Wrap value with span
  - `Located<T>` - Wrap value with span and source slice
  - `Sliced<T>` - Wrap value with source slice

- **Parser Configuration**
  - `Parser<F, L, O, Error, Context>` - Configurable parser
  - `ParseContext` - Context for emitter and cache
  - `Window` - Type-level peek buffer capacity for deterministic lookahead
  - **Note**: Lookahead windows support 1-32 token capacity via `typenum::{U1..U32}`
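
As a rough mental model, the span wrappers above pair a value with positional metadata. The sketch below uses simplified stand-ins (not tokit's actual definitions) to show the shape:

```rust
// Simplified illustrations of the span-tracking wrappers above.
// These are NOT tokit's real definitions, just the general shape.

/// A half-open byte range into the source.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Span {
    start: usize,
    end: usize,
}

/// A value paired with the span it came from.
#[derive(Debug, Clone, PartialEq)]
struct Spanned<T> {
    value: T,
    span: Span,
}

fn main() {
    let src = "let x = 42;";
    // Suppose a lexer matched `42` at bytes 8..10.
    let number = Spanned { value: 42u64, span: Span { start: 8, end: 10 } };
    // The span lets diagnostics point back at the original source text.
    assert_eq!(&src[number.span.start..number.span.end], "42");
}
```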

## Quick Start

Here's a simple example parsing JSON tokens:

```rust
use logos::Logos;
use tokit::{Any, Parse, Token as TokenT};

#[derive(Debug, Logos, Clone)]
#[logos(skip r"[ \t\r\n\f]+")]
enum Token {
    #[token("true", |_| true)]
    #[token("false", |_| false)]
    Bool(bool),

    #[token("null")]
    Null,

    #[regex(r"-?(?:0|[1-9]\d*)(?:\.\d+)?", |lex| lex.slice().parse::<f64>().unwrap())]
    Number(f64),
}

#[derive(Debug, Clone, Copy)]
enum TokenKind {
    Bool,
    Null,
    Number,
}

// `Display` cannot be derived by the standard library; implement it
// manually (token kinds are typically displayed in diagnostics).
impl core::fmt::Display for TokenKind {
    fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result {
        let s = match self {
            Self::Bool => "bool",
            Self::Null => "null",
            Self::Number => "number",
        };
        f.write_str(s)
    }
}

impl TokenT<'_> for Token {
    type Kind = TokenKind;
    type Error = ();

    fn kind(&self) -> Self::Kind {
        match self {
            Token::Bool(_) => TokenKind::Bool,
            Token::Null => TokenKind::Null,
            Token::Number(_) => TokenKind::Number,
        }
    }
}

type MyLexer<'a> = tokit::LogosLexer<'a, Token, Token>;

fn main() {
    // Parse any token and extract its value
    let parser = Any::parser::<'_, MyLexer<'_>, ()>()
      .map(|tok: Token| match tok {
        Token::Number(n) => Some(n),
        _ => None,
      });

    let result = parser.parse("42.5");
    println!("{:?}", result); // Ok(Some(42.5))
}
```

### More Examples

Check out the examples directory:

```bash
# JSON token parsing with map combinators
cargo run --example json

# Note: The calculator examples are being updated for v0.3.0 API
```

## Architecture

Tokit's architecture follows a layered design:

1. **Lexer Layer** - Token production and source abstraction
2. **Parser Layer** - Composable parser combinators
3. **Error Layer** - Rich error types and emission strategies
4. **Utility Layer** - Spans, containers, and helpers

This separation enables:

- Using any lexer by implementing `Lexer<'inp>`
- Mixing and matching parser combinators
- Customizing error handling per-parser or globally
- Zero-cost abstractions through compile-time configuration

## Design Philosophy

### Parse-While-Lexing: Zero-Copy Streaming

Tokit uses a **parse-while-lexing** architecture where parsers consume tokens directly from the lexer as needed, without intermediate buffering:

**Traditional Approach (Two-Phase):**

```text
Source → Lexer → [Token Buffer] → Parser
         ↓
    Allocate Vec<Token>  ← Extra allocation!
```

**Tokit Approach (Streaming):**

```text
Source → Lexer ←→ Parser
         ↑________↓
    Zero-copy streaming, no buffer
```

**Benefits:**

- ✅ **Zero Extra Allocations**: No token buffer, tokens consumed on-demand
- ✅ **Lower Memory Footprint**: Only lookahead window buffered on stack, not entire token stream
- ✅ **Better Cache Locality**: Tokens processed immediately after lexing
- ✅ **Predictable Performance**: No large allocations, deterministic memory usage
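
The difference can be seen in a toy, self-contained comparison (an illustration of the idea, not tokit code): a buffered pipeline collects every token into a `Vec` before parsing starts, while a streaming pipeline pulls tokens from the lexer iterator only as the parser needs them.

```rust
// Toy illustration of buffered vs. streaming token consumption.
// Not tokit's API - just the two shapes being contrasted.

fn lex(src: &str) -> impl Iterator<Item = &str> + '_ {
    src.split_whitespace() // stand-in "lexer": one token per word
}

// Two-phase: materialize every token before parsing starts.
fn sum_buffered(src: &str) -> i64 {
    let tokens: Vec<&str> = lex(src).collect(); // extra allocation
    tokens.iter().filter_map(|t| t.parse::<i64>().ok()).sum()
}

// Streaming: the "parser" pulls tokens on demand; no token buffer.
fn sum_streaming(src: &str) -> i64 {
    lex(src).filter_map(|t| t.parse::<i64>().ok()).sum()
}

fn main() {
    let src = "1 2 skip 3";
    assert_eq!(sum_buffered(src), sum_streaming(src));
    println!("{}", sum_streaming(src)); // 6
}
```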

### No Hidden Backtracking

Unlike traditional parser combinators that rely on implicit backtracking (trying alternatives until one succeeds), Tokit uses **explicit lookahead-based decisions**. This design choice provides:

- **Predictable Performance**: No hidden exponential backtracking scenarios
- **Explicit Control**: Developers decide when and where to peek ahead via `peek_then()` and `peek_then_choice()`
- **Deterministic Parsing**: LALR-style table-driven decisions using fixed-capacity lookahead windows (`Window` trait)
- **Better Error Messages**: Failed alternatives don't hide earlier, more relevant errors

```rust
// Traditional parser combinator (hidden backtracking):
// try_parser1.or(try_parser2).or(try_parser3)  // May backtrack!

// Tokit approach (explicit lookahead, no backtracking):
let parser = any()
    .peek_then::<_, typenum::U2>(|peeked, _| {
        match peeked.get(0) {
            Some(Token::If) => Ok(Action::Continue),  // Deterministic decision
            _ => Ok(Action::Stop),
        }
    });
```

### Parser Combinators + Deterministic Table Parsing

Tokit uniquely combines:

- **Parser Combinator Flexibility**: Compose small parsers into complex grammars
- **LALR-Style Determinism**: Fixed lookahead windows with deterministic decisions
- **Type-Level Capacity**: Lookahead buffer size known at compile time (`Window::CAPACITY`)

This hybrid approach gives you composable abstractions without sacrificing performance or predictability.

### Fail-Fast Runtime ↔ Greedy Compiler Diagnostics

Tokit's architecture decouples parsing logic from error handling strategy through the `Emitter` trait. This means:

**Same Parser, Different Contexts:**

- **Runtime/REPL Mode**: Use `Fatal` emitter → stop on first error for immediate feedback
- **Compiler/IDE Mode**: Use greedy emitter (planned) → collect all errors for comprehensive diagnostics
- **Testing/Fuzzing**: Use `Ignored` emitter → parse through all errors for robustness testing

**Benefits:**

- ✅ Write parsers once, deploy everywhere
- ✅ No separate "error recovery mode" - it's just a different emitter
- ✅ Custom emitters can implement domain-specific error handling
- ✅ Zero-cost abstraction - emitter behavior resolved at compile time
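
The pattern - one parse routine, interchangeable error strategies - can be sketched with a toy emitter trait. The names here (`Emitter`, `FatalToy`, `CollectToy`, `parse_digits`) are illustrative stand-ins; tokit's actual `Emitter` trait will differ.

```rust
// Toy sketch of the "same parser, different emitter" pattern.
// All names here are illustrative, not tokit's API.

trait Emitter {
    /// Report an error; return false to abort parsing.
    fn emit(&mut self, msg: String) -> bool;
}

/// Fail-fast: stop at the first error (runtime/REPL style).
struct FatalToy {
    first: Option<String>,
}
impl Emitter for FatalToy {
    fn emit(&mut self, msg: String) -> bool {
        self.first.get_or_insert(msg);
        false
    }
}

/// Greedy: collect everything and keep going (compiler-diagnostics style).
struct CollectToy {
    errors: Vec<String>,
}
impl Emitter for CollectToy {
    fn emit(&mut self, msg: String) -> bool {
        self.errors.push(msg);
        true
    }
}

/// The same "parser" works unchanged with either emitter.
fn parse_digits<E: Emitter>(src: &str, emitter: &mut E) -> Vec<u32> {
    let mut out = Vec::new();
    for ch in src.chars() {
        match ch.to_digit(10) {
            Some(d) => out.push(d),
            None => {
                if !emitter.emit(format!("unexpected {ch:?}")) {
                    break; // fail-fast emitters abort here
                }
            }
        }
    }
    out
}

fn main() {
    let mut fatal = FatalToy { first: None };
    let fast = parse_digits("12x34y5", &mut fatal);
    assert_eq!(fast, vec![1, 2]); // stopped at the first error

    let mut greedy = CollectToy { errors: Vec::new() };
    let all = parse_digits("12x34y5", &mut greedy);
    assert_eq!(all, vec![1, 2, 3, 4, 5]); // parsed through both errors
    assert_eq!(greedy.errors.len(), 2);
}
```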

### Inspirations

Tokit takes inspiration from:

- [**winnow**](https://github.com/winnow-rs/winnow) - For ergonomic parser API design
- [**chumsky**](https://github.com/zesterer/chumsky) - For composable parser combinator patterns
- [**logos**](https://github.com/maciejhirsz/logos) - For high-performance lexing
- [**rowan**](https://github.com/rust-analyzer/rowan) - For lossless syntax tree representation

### Core Priorities

1. **Performance** - Parse-while-lexing (zero-copy streaming), zero-cost abstractions, no hidden allocations
2. **Predictability** - No hidden backtracking, explicit control flow, deterministic decisions
3. **Composability** - Small parsers combine into complex ones
4. **Versatility** - Same parser works for runtime (fail-fast) and compiler diagnostics (greedy) via `Emitter`
5. **Flexibility** - Work with any lexer, customize error handling, support both AST and CST
6. **Correctness** - Rich error types, span tracking, validation

## Who Uses Tokit?

- [`smear`](https://github.com/al8n/smear): Blazing fast, fully spec-compliant, reusable parser combinators for standard GraphQL and GraphQL-like DSLs

## License

`tokit` is dual-licensed under:

- MIT License ([LICENSE-MIT](LICENSE-MIT))
- Apache License, Version 2.0 ([LICENSE-APACHE](LICENSE-APACHE))

You may choose either license for your purposes.

Copyright (c) 2025 Al Liu.

[Github-url]: https://github.com/al8n/tokit/
[CI-url]: https://github.com/al8n/tokit/actions/workflows/ci.yml
[doc-url]: https://docs.rs/tokit
[crates-url]: https://crates.io/crates/tokit
[codecov-url]: https://app.codecov.io/gh/al8n/tokit/
[zh-cn-url]: https://github.com/al8n/tokit/tree/main/README-zh_CN.md