
regexr

A specialized, pure-Rust regex engine designed for LLM tokenization and complex pattern matching.

⚠️ Beta Warning

This library is currently in active development and testing. While it passes strict compliance tests for industry-standard tokenizer patterns (like OpenAI's cl100k_base and Meta's Llama 3), it has not yet been battle-tested in large-scale production environments.

Use at your own risk. We encourage using this for research, tokenizer training, and data preprocessing pipelines. Please report any edge cases or bugs you encounter on the Issue Tracker.


🎯 When to use regexr

This is a specialized tool, not a general-purpose replacement.

The Rust ecosystem already has the excellent, battle-tested regex crate. For 99% of use cases, you should use that.

Only use regexr if you specifically need:

  1. Lookarounds on Raw Bytes: You need features like (?=...) or (?!\S) running directly on &[u8] input that may contain invalid UTF-8 (see the sketch after this list).
    • Why not regex? It intentionally omits lookarounds to guarantee linear time.
    • Why not fancy-regex? It operates primarily on UTF-8 strings, not raw bytes.
  2. Pure Rust Dependency: You need advanced features (Atomic Groups, Lookarounds) but cannot use pcre2 due to unsafe C bindings or build complexity.
  3. Bounded Execution: You want ReDoS protection that memoizes states (guaranteeing completion) rather than just aborting after a timeout (like pcre2).
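
As a minimal sketch of point 1, reusing the builder and iterator APIs shown in the Quick Start below, a lookahead can run directly over bytes that are not valid UTF-8:

use regexr::bytes::Regex;

fn main() {
    // Lookahead evaluated directly on raw bytes, including an invalid UTF-8 byte (0xFF).
    let re = Regex::builder(r"foo(?=bar)").build().unwrap();
    let input = b"\xFFfoobar";
    let m = re.find_iter(input).next().unwrap();
    assert_eq!(m.as_bytes(), b"foo");
}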

Documentation

  • Features Guide - Comprehensive syntax reference (lookarounds, properties, flags).
  • Architecture - Deep dive into the VM, JIT, and compiler design.
  • WASM Support - Guide for running regexr in the browser or Edge Runtime.
  • Contributing - How to build, test, and contribute to the project.

The Problem It Solves

Developers building LLM tokenizers (for models like GPT-4 or Llama 3) currently face a dilemma in Rust:

  • regex crate: Fast, safe, and supports bytes, but lacks lookarounds needed for complex token boundary logic.
  • fancy-regex: Supports lookarounds, but is string-focused, making it unsuitable for processing raw binary data that may contain invalid UTF-8.
  • pcre2: Supports everything, but introduces unsafe C bindings and external dependencies.

regexr bridges this gap. To our knowledge, it is the only pure-Rust engine that provides lookarounds and related advanced features on raw bytes while remaining 100% Safe Rust.

Key Features

  • Dual Mode Architecture:
    • Unicode Mode (&str): Full UTF-8 validation and property support.
    • Bytes Mode (&[u8]): Raw byte matching without UTF-8 validation.
  • Advanced Syntax: Native support for Lookarounds ((?=...), (?<=...)), Atomic Groups ((?>...)), and Possessive Quantifiers.
  • LLM Compatibility: Specifically optimized to compile OpenAI's cl100k_base and Meta's Llama 3 tokenizer patterns.
  • Safety First: Built-in protection against ReDoS (Regular Expression Denial of Service) via bounded execution and memoization.
  • High Performance: JIT compilation (via Cranelift) and SIMD acceleration (AVX2/SSSE3) for high-throughput scenarios.
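
As a rough illustration of the atomic group and possessive quantifier syntax listed above (a sketch that assumes the usual PCRE-style semantics for these constructs):

use regexr::Regex;

fn main() {
    // Atomic group: once `(?>a+)` has matched, the engine will not backtrack
    // into it, so after the group swallows every 'a' the literal `ab` can
    // never be satisfied.
    let atomic = Regex::new(r"(?>a+)ab").unwrap();
    assert!(!atomic.is_match("aaab"));

    // Possessive quantifier: `a*+` behaves like an atomic `a*`, leaving no
    // 'a' for the final literal.
    let possessive = Regex::new(r"a*+a").unwrap();
    assert!(!possessive.is_match("aaaa"));
}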

Installation

Add this to your Cargo.toml:

[dependencies]
regexr = "x.x"

Feature Flags

Flag | Description | Recommended For
simd | Enable AVX2/SSSE3 acceleration | High-throughput servers
jit | Enable JIT compilation (native only) | Heavy usage of complex patterns
full | Enables all optimizations | Production builds
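
For example, to turn on the simd and jit flags in your Cargo.toml:

[dependencies]
regexr = { version = "x.x", features = ["simd", "jit"] }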

Quick Start

1. Bytes Mode (LLM Tokenization)

Use this for processing raw binary data or training BPE tokenizers. It handles invalid UTF-8 gracefully and is compatible with OpenAI-style tokenizer patterns.

use regexr::bytes::Regex;

fn main() {
    // Compile a pattern compatible with GPT-4's tokenizer
    // 'unicode_ascii_compat' maps \p{L} to [a-zA-Z] for raw byte performance
    let re = Regex::builder(r"\p{L}+|\p{N}+")
        .unicode_ascii_compat(true)
        .build()
        .unwrap();

    // Works on raw bytes, even invalid UTF-8 sequences
    let input = b"Hello \xFF World 123";
    let matches: Vec<_> = re.find_iter(input).map(|m| m.as_bytes()).collect();

    assert_eq!(matches[0], b"Hello");
    assert_eq!(matches[1], b"World");
    assert_eq!(matches[2], b"123");
}

2. Standard Text Matching (with Lookarounds)

Use this when you need regex features missing from the standard crate.

use regexr::Regex;

fn main() {
    // Lookarounds (not supported by standard regex crate)
    let re = Regex::new(r"foo(?=bar)").unwrap();
    assert!(re.is_match("foobar"));

    // Named captures
    let re = Regex::new(r"(?P<y>\d{4})-(?P<m>\d{2})").unwrap();
    let caps = re.captures("Date: 2024-01").unwrap();
    assert_eq!(&caps["y"], "2024");
}

Feature Comparison

Feature | regex (Standard) | pcre2 (C Bindings) | fancy-regex | regexr (This Crate)
Pure Rust | ✅ | ❌ | ✅ | ✅
Safe (No unsafe) | ✅ | ❌ | ✅ | ✅
Bytes Mode (&[u8]) | ✅ | ✅ | ❌ | ✅
Lookarounds | ❌ | ✅ | ✅ | ✅
ReDoS Protection | ✅ (Linear Time) | ⚠️ (Abort Limits) | ⚠️ (Backtrack Limit) | ✅ (Memoization)
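
As a rough sketch of what the bounded, memoized execution is intended to guarantee (reusing the Regex::new API from the Quick Start), a classically ReDoS-prone pattern still terminates promptly on a non-matching input:

use regexr::Regex;

fn main() {
    // `(a+)+$` triggers catastrophic backtracking in naive backtracking engines;
    // with memoized, bounded execution this failing scan should complete quickly.
    let re = Regex::new(r"(a+)+$").unwrap();
    let input = format!("{}b", "a".repeat(64));
    assert!(!re.is_match(&input));
}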

License

This project is licensed under the MIT License - see the LICENSE file for details.