Crate tokit

WIP: This project is still under active development and not ready for use.

Tokit

Blazing fast parser combinators with parse-while-lexing architecture (zero-copy), deterministic LALR-style parsing, and no hidden backtracking.

§Overview

Tokit is a blazing fast parser combinator library for Rust that uniquely combines:

  • Parse-While-Lexing Architecture: Zero-copy streaming - parsers consume tokens directly from the lexer without buffering, eliminating allocation overhead
  • Deterministic LALR-Style Parsing: Explicit lookahead with compile-time buffer capacity, no hidden backtracking
  • Flexible Error Handling: Same parser code adapts for fail-fast runtime or greedy compiler diagnostics via the Emitter trait

Unlike traditional parser combinators that buffer tokens and rely on implicit backtracking, Tokit streams tokens on demand and makes predictable, deterministic decisions. This makes it ideal for building high-performance language tooling, DSL parsers, compilers, and REPLs that need both speed and comprehensive error reporting.

§Key Features

  • Parse-While-Lexing: Zero-copy streaming architecture - no token buffering, no extra allocations
  • No Hidden Backtracking: Explicit, predictable parsing with lookahead-based decisions instead of implicit backtracking
  • Deterministic + Composable: Combines the flexibility of parser combinators with LALR-style deterministic table parsing
  • Flexible Error Handling Architecture: Designed to support both fail-fast parsing (runtime) and greedy parsing (compiler diagnostics) by swapping the Emitter type - same parser, different behavior
  • Token-Based Parsing: Works directly on token streams from any lexer implementing the Lexer<'inp> trait
  • Composable Combinators: Build complex parsers from simple, reusable building blocks
  • Configurable Error Emission: Pluggable error emission strategies (Fatal, Silent, Ignored)
  • Rich Error Recovery: Built-in support for error recovery and validation
  • Zero-Cost Abstractions: All configuration resolved at compile time
  • No-std Support: Core functionality works without an allocator
  • Multiple Source Types: Support for str, [u8], Bytes, BStr, HipStr
  • Logos Integration: Optional LogosLexer adapter for seamless Logos integration
  • CST Support: Optional Concrete Syntax Tree support via rowan

§Installation

Add this to your Cargo.toml:

[dependencies]
tokit = "0.0.0"

§Feature Flags

  • std (default) - Enable standard library support
  • alloc - Enable allocator support for no-std environments
  • logos - Enable LogosLexer adapter for Logos integration
  • rowan - Enable CST (Concrete Syntax Tree) support with rowan integration
  • bytes - Support for bytes::Bytes as token source
  • bstr - Support for bstr::BStr as token source
  • hipstr - Support for hipstr::HipStr as token source
  • among - Enable Among<L, M, R> parseable support
  • smallvec - Enable small vector optimization utilities
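
For example, a no-std build that keeps the Logos adapter could disable default features and opt back in selectively (a sketch; use the version from the Installation section above or the latest release):

[dependencies]
tokit = { version = "0.0.0", default-features = false, features = ["alloc", "logos"] }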

§Core Components

§Lexer Layer

  • Lexer<'inp> Trait

    Core trait for lexers that produce token streams. Implement this to use any lexer with Tokit.

  • Token<'a> Trait

    Defines token types with:

    • Kind: Token kind discriminator
    • Error: Associated error type
  • LogosLexer<'inp, T, L> (feature: logos)

    Ready-to-use adapter for integrating Logos lexers.
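
    Concretely, with the logos feature enabled, hooking an existing Logos token type into Tokit is a single type alias (this mirrors the Quick Start below):

    // `Token` fills both remaining parameters here, as in the Quick Start.
    type MyLexer<'a> = tokit::LogosLexer<'a, Token, Token>;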

§Error Handling

Tokit’s flexible Emitter system allows the same parser to adapt to different use cases by simply changing the error handling strategy:

  • Emitter Strategies
    • Fatal - Fail-fast parsing: Stop on first error (default) - perfect for runtime parsing and REPLs
    • Greedy emitter (planned) - Collect all errors and continue parsing - perfect for compiler diagnostics and IDEs
    • Silent - Silently ignore errors
    • Ignored - Ignore errors completely

Key Design: Change the Emitter type to switch between fail-fast runtime parsing and greedy compiler diagnostics - same parser code, different behavior. This makes Tokit suitable for both:

  • Runtime/REPL: Fast feedback with Fatal emitter

  • Compiler/IDE: Comprehensive diagnostics with greedy emitter (coming soon)

  • Rich Error Types (in error/ module)

    • Token-level: UnexpectedToken, MissingToken, UnexpectedEot
    • Syntax-level: Unclosed, Unterminated, Malformed, Invalid
    • Escape sequences: HexEscape, UnicodeEscape
    • All errors include span tracking
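
The mechanism behind this is ordinary Rust generics: the parse loop reports every error through the emitter, and the emitter decides whether parsing continues. The sketch below is self-contained and illustrative only; Emit, FailFast, and CollectAll are stand-ins for the pattern, not Tokit's actual types (Tokit threads its Emitter through ParseContext):

// Illustrative sketch of the emitter pattern, not Tokit's API.
trait Emit {
    // Return Err to abort parsing, Ok to keep going.
    fn emit(&mut self, err: String) -> Result<(), String>;
}

// Fail-fast: the first error aborts the parse (runtime/REPL style).
struct FailFast;
impl Emit for FailFast {
    fn emit(&mut self, err: String) -> Result<(), String> {
        Err(err)
    }
}

// Greedy: record every error and keep parsing (compiler/IDE style).
#[derive(Default)]
struct CollectAll(Vec<String>);
impl Emit for CollectAll {
    fn emit(&mut self, err: String) -> Result<(), String> {
        self.0.push(err);
        Ok(())
    }
}

// One parser body, compiled once per strategy; only `emit` differs.
fn count_digits<E: Emit>(input: &str, emitter: &mut E) -> Result<usize, String> {
    let mut digits = 0;
    for ch in input.chars() {
        if ch.is_ascii_digit() {
            digits += 1;
        } else {
            emitter.emit(format!("unexpected character {ch:?}"))?;
        }
    }
    Ok(digits)
}

With FailFast, count_digits("1a2", ...) returns Err at the 'a'; with CollectAll, it returns Ok(2) and the message stays in the collector. That is the same behavioral split the Fatal and planned greedy emitters provide.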

§Utilities

  • Span Tracking

    • Span - Lightweight span representation
    • Spanned<T> - Wrap value with span
    • Located<T> - Wrap value with span and source slice
    • Sliced<T> - Wrap value with source slice
  • Parser Configuration

    • Parser<F, L, O, Error, Context> - Configurable parser
    • ParseContext - Context for emitter and cache
    • Window - Type-level peek buffer capacity for deterministic lookahead
    • Note: Lookahead windows support 1-32 token capacity via typenum::{U1..U32}
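
For a taste of the Window mechanism, the capacity is supplied as a typenum parameter at the call site. The variant below follows the peek_then sketch shown under Design Philosophy (the Action type and closure shape are taken from that sketch, and Token::If / Token::LParen are hypothetical token kinds):

// U3 reserves a three-token lookahead window on the stack; the
// capacity is fixed at compile time, so the buffer can never grow.
let parser = any()
    .peek_then::<_, typenum::U3>(|peeked, _| {
        // Up to three tokens are visible here without consuming them.
        match (peeked.get(0), peeked.get(1)) {
            (Some(Token::If), Some(Token::LParen)) => Ok(Action::Continue),
            _ => Ok(Action::Stop),
        }
    });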

§Quick Start

Here’s a simple example parsing JSON tokens:

use derive_more::Display; // the Display derive used below comes from derive_more
use logos::Logos;
use tokit::{Any, Parse, Token as TokenT};

#[derive(Debug, Logos, Clone)]
#[logos(skip r"[ \t\r\n\f]+")]
enum Token {
    #[token("true", |_| true)]
    #[token("false", |_| false)]
    Bool(bool),

    #[token("null")]
    Null,

    #[regex(r"-?(?:0|[1-9]\d*)(?:\.\d+)?", |lex| lex.slice().parse::<f64>().unwrap())]
    Number(f64),
}

#[derive(Debug, Display, Clone, Copy)]
enum TokenKind {
    Bool,
    Null,
    Number,
}

impl TokenT<'_> for Token {
    type Kind = TokenKind;
    type Error = ();

    fn kind(&self) -> Self::Kind {
        match self {
            Token::Bool(_) => TokenKind::Bool,
            Token::Null => TokenKind::Null,
            Token::Number(_) => TokenKind::Number,
        }
    }
}

type MyLexer<'a> = tokit::LogosLexer<'a, Token, Token>;

fn main() {
    // Parse any token and extract its value
    let parser = Any::parser::<'_, MyLexer<'_>, ()>()
      .map(|tok: Token| match tok {
        Token::Number(n) => Some(n),
        _ => None,
      });

    let result = parser.parse("42.5");
    println!("{:?}", result); // Ok(Some(42.5))
}

§More Examples

Check out the examples directory:

# JSON token parsing with map combinators
cargo run --example json

# Note: The calculator examples are being updated for v0.3.0 API

§Architecture

Tokit’s architecture follows a layered design:

  1. Lexer Layer - Token production and source abstraction
  2. Parser Layer - Composable parser combinators
  3. Error Layer - Rich error types and emission strategies
  4. Utility Layer - Spans, containers, and helpers

This separation lets you:

  • Use any lexer by implementing Lexer<'inp>
  • Mix and match parser combinators
  • Customize error handling per parser or globally
  • Rely on zero-cost abstractions through compile-time configuration

§Design Philosophy

§Parse-While-Lexing: Zero-Copy Streaming

Tokit uses a parse-while-lexing architecture where parsers consume tokens directly from the lexer as needed, without intermediate buffering:

Traditional Approach (Two-Phase):

Source → Lexer → [Token Buffer] → Parser
         ↓
    Allocate Vec<Token>  ← Extra allocation!

Tokit Approach (Streaming):

Source → Lexer ←→ Parser
         ↑________↓
    Zero-copy streaming, no buffer

Benefits:

  • Zero Extra Allocations: No token buffer; tokens are consumed on demand
  • Lower Memory Footprint: Only lookahead window buffered on stack, not entire token stream
  • Better Cache Locality: Tokens processed immediately after lexing
  • Predictable Performance: No large allocations, deterministic memory usage
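
In code, the difference is whether a token buffer ever exists. A sketch using the Quick Start's Token and parser from above (the buffered line assumes logos 0.13+, where the derived lexer is an iterator of Result items):

// Two-phase: every token is materialized before parsing starts.
let buffered: Vec<_> = Token::lexer("42.5").collect(); // allocates a Vec

// Tokit: the parser pulls tokens from the lexer as it runs; only the
// fixed-capacity lookahead window ever lives on the stack.
let streamed = parser.parse("42.5"); // no intermediate token buffer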

§No Hidden Backtracking

Unlike traditional parser combinators that rely on implicit backtracking (trying alternatives until one succeeds), Tokit uses explicit lookahead-based decisions. This design choice provides:

  • Predictable Performance: No hidden exponential backtracking scenarios
  • Explicit Control: Developers decide when and where to peek ahead via peek_then() and peek_then_choice()
  • Deterministic Parsing: LALR-style table-driven decisions using fixed-capacity lookahead windows (Window trait)
  • Better Error Messages: Failed alternatives don’t hide earlier, more relevant errors

// Traditional parser combinator (hidden backtracking):
// try_parser1.or(try_parser2).or(try_parser3)  // May backtrack!

// Tokit approach (explicit lookahead, no backtracking):
let parser = any()
    .peek_then::<_, typenum::U2>(|peeked, _| {
        match peeked.get(0) {
            Some(Token::If) => Ok(Action::Continue),  // Deterministic decision
            _ => Ok(Action::Stop),
        }
    });

§Parser Combinators + Deterministic Table Parsing

Tokit uniquely combines:

  • Parser Combinator Flexibility: Compose small parsers into complex grammars
  • LALR-Style Determinism: Fixed lookahead windows with deterministic decisions
  • Type-Level Capacity: Lookahead buffer size known at compile time (Window::CAPACITY)

This hybrid approach gives you composable abstractions without sacrificing performance or predictability.

§Fail-Fast Runtime ↔ Greedy Compiler Diagnostics

Tokit’s architecture decouples parsing logic from error handling strategy through the Emitter trait. This means:

Same Parser, Different Contexts:

  • Runtime/REPL Mode: Use Fatal emitter → stop on first error for immediate feedback
  • Compiler/IDE Mode: Use greedy emitter (planned) → collect all errors for comprehensive diagnostics
  • Testing/Fuzzing: Use Ignored emitter → parse through all errors for robustness testing

Benefits:

  • ✅ Write parsers once, deploy everywhere
  • ✅ No separate “error recovery mode” - it’s just a different emitter
  • ✅ Custom emitters can implement domain-specific error handling
  • ✅ Zero-cost abstraction - emitter behavior resolved at compile time

§Inspirations

Tokit takes inspiration from:

  • winnow - For ergonomic parser API design
  • chumsky - For composable parser combinator patterns
  • logos - For high-performance lexing
  • rowan - For lossless syntax tree representation

§Core Priorities

  1. Performance - Parse-while-lexing (zero-copy streaming), zero-cost abstractions, no hidden allocations
  2. Predictability - No hidden backtracking, explicit control flow, deterministic decisions
  3. Composability - Small parsers combine into complex ones
  4. Versatility - Same parser works for runtime (fail-fast) and compiler diagnostics (greedy) via Emitter
  5. Flexibility - Work with any lexer, customize error handling, support both AST and CST
  6. Correctness - Rich error types, span tracking, validation

§Who Uses Tokit?

  • smear: Blazing fast, fully spec-compliant, reusable parser combinators for standard GraphQL and GraphQL-like DSLs

§License

tokit is dual-licensed; you may choose either license for your purposes.

Copyright (c) 2025 Al Liu.

§Re-exports

pub use emitter::Emitter;
pub use lexer::Cache;
pub use lexer::Lexed;
pub use lexer::Lexer;
pub use lexer::Source;
pub use lexer::State;
pub use lexer::Token;
pub use parser::Parse;
pub use parser::ParseChoice;
pub use parser::ParseContext;
pub use parser::ParseInput;
pub use parser::Parser;
pub use parser::Window;
pub use logos; (feature: logos)

§Modules

container
Trait for container types.
cst (feature: rowan)
Concrete Syntax Tree (CST) utilities built on top of rowan.
emitter
Emitter-related structures and traits.
error
Common error types for lexers and parsers.
lexer
Lexers and token definitions.
parser
Parsers and combinators with deterministic parsing and zero-copy streaming.
punct
Common punctuation tokens.
syntax
Syntax definitions and incomplete syntax error types.
types
Common types for building language-specific ASTs.
utils
Common utilities for working with tokens and lexers.

§Macros

keyword
Defines a keyword.
punctuator
Defines punctuators.

§Traits

Check
A trait for checking.
Require
A helper trait for ergonomically requiring specific token shapes by comparing tokens for equivalence against a reference.