regexr
A specialized, pure-Rust regex engine designed for LLM tokenization and complex pattern matching.
⚠️ Beta Warning
This library is currently in active development and testing. While it passes strict compliance tests for industry-standard tokenizer patterns (like OpenAI's `cl100k_base` and Meta's Llama 3), it has not yet been battle-tested in large-scale production environments.
Use at your own risk. We encourage using this for research, tokenizer training, and data preprocessing pipelines. Please report any edge cases or bugs you encounter on the Issue Tracker.
🎯 When to use regexr
This is a specialized tool, not a general-purpose replacement.
The Rust ecosystem already has the excellent, battle-tested regex crate. For 99% of use cases, you should use that.
Only use regexr if you specifically need:
- Lookarounds on Raw Bytes: You need features like `(?=...)` or `(?!\S)` running directly on `&[u8]`, including input that is not valid UTF-8.
  - Why not `regex`? It intentionally omits lookarounds to guarantee linear time.
  - Why not `fancy-regex`? It operates primarily on UTF-8 strings, not raw bytes.
- Pure Rust Dependency: You need advanced features (Atomic Groups, Lookarounds) but cannot use `pcre2` due to unsafe C bindings or build complexity.
- Bounded Execution: You want ReDoS protection that memoizes states (guaranteeing completion) rather than just aborting once a match limit is reached (like `pcre2`).
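To make the raw-bytes point concrete, the sketch below uses only the Rust standard library (no `regexr` calls) to show why a string-only engine cannot be handed arbitrary tokenizer input. The helper name `is_valid_utf8` is hypothetical and exists only for this illustration.

```rust
use std::str;

/// Returns true when `data` can be viewed as `&str` without copying or loss.
/// (Hypothetical helper for this illustration.)
fn is_valid_utf8(data: &[u8]) -> bool {
    str::from_utf8(data).is_ok()
}

fn main() {
    // 0xFF can never appear in well-formed UTF-8.
    let raw: &[u8] = b"hello \xFF world";

    // A string-only engine needs `&str`, which this input cannot provide.
    assert!(!is_valid_utf8(raw));

    // Lossy conversion replaces the one bad byte with U+FFFD (three bytes),
    // so byte offsets no longer line up with the original input.
    let lossy = String::from_utf8_lossy(raw);
    assert_eq!(raw.len(), 13);
    assert_eq!(lossy.len(), 15);
    println!("raw = {} bytes, lossy = {} bytes", raw.len(), lossy.len());
}
```

Because match offsets shift after a lossy conversion, a byte-level tokenizer generally has to match on the original `&[u8]` rather than on a repaired string.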
Documentation
- Features Guide - Comprehensive syntax reference (lookarounds, properties, flags).
- Architecture - Deep dive into the VM, JIT, and compiler design.
- WASM Support - Guide for running `regexr` in the browser or Edge Runtime.
- Contributing - How to build, test, and contribute to the project.
The Problem Solved
Developers building LLM tokenizers (like GPT-4 or Llama 3) currently face a dilemma in Rust:
- `regex` crate: Fast, safe, and supports bytes, but lacks lookarounds needed for complex token boundary logic.
- `fancy-regex`: Supports lookarounds, but is string-focused, making it unsuitable for processing raw binary data (invalid UTF-8).
- `pcre2`: Supports everything, but introduces unsafe C bindings and external dependencies.
regexr bridges this gap. It is the only engine that provides advanced features (Lookarounds) on raw bytes while remaining 100% Safe Rust.
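As an illustration of the token boundary logic involved, the sketch below hand-implements the behaviour of the sub-pattern `\s+(?!\S)` (which appears in `cl100k_base`) directly on bytes. It does not use `regexr`; the function name `ws_not_before_word` is hypothetical, and `\s` is approximated by ASCII whitespace only.

```rust
/// Hand-written equivalent of matching `\s+(?!\S)` at `start` in `input`,
/// restricted to ASCII whitespace. Returns the end offset of the match.
/// (Hypothetical helper; not part of any library.)
fn ws_not_before_word(input: &[u8], start: usize) -> Option<usize> {
    // Greedy `\s+`: consume the whole whitespace run.
    let run = input[start..]
        .iter()
        .take_while(|b| b.is_ascii_whitespace())
        .count();
    if run == 0 {
        return None;
    }
    let end = start + run;
    if end == input.len() {
        // At end of input there is no `\S` ahead, so `(?!\S)` holds.
        return Some(end);
    }
    // The next byte is non-whitespace, so the lookahead fails at `end`.
    // Backtracking by one byte leaves a whitespace byte ahead instead.
    if run >= 2 {
        Some(end - 1)
    } else {
        None
    }
}

fn main() {
    // Contains 0xFF, so this input is not valid UTF-8.
    let input: &[u8] = b"a   b\xFF  ";

    // Three spaces before `b`: only the first two are matched.
    assert_eq!(ws_not_before_word(input, 1), Some(3));
    // A single space directly before a word does not match at all.
    assert_eq!(ws_not_before_word(input, 3), None);
    // Trailing whitespace matches through to the end of input.
    assert_eq!(ws_not_before_word(input, 6), Some(8));
}
```

The lookahead is what leaves the final space of a run unmatched, so a later alternative in the tokenizer pattern can attach it to the following word. Without lookarounds this rule has to be hand-coded per pattern, as above.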
Key Features
- Dual Mode Architecture:
  - Unicode Mode (`&str`): Full UTF-8 validation and property support.
  - Bytes Mode (`&[u8]`): Raw byte matching without UTF-8 validation.
- Advanced Syntax: Native support for Lookarounds (`(?=...)`, `(?<=...)`), Atomic Groups (`(?>...)`), and Possessive Quantifiers.
- LLM Compatibility: Specifically optimized to compile OpenAI (`cl100k_base`) and Llama 3 patterns.
- Safety First: Built-in protection against ReDoS (Regular Expression Denial of Service) via bounded execution and memoization.
- High Performance: JIT compilation (via Cranelift) and SIMD acceleration (AVX2/SSSE3) for high-throughput scenarios.
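To show how memoization can bound a backtracking matcher, here is a minimal, self-contained sketch of the general technique. It is not `regexr`'s actual implementation (see the Architecture document for that); the `Inst` instruction set, `Matcher`, `demo_program`, and `is_match` are all hypothetical names invented for this example. Each `(instruction, position)` state is explored at most once, so the work is bounded by program length times input length.

```rust
/// Instructions for a tiny backtracking VM (hypothetical, for illustration).
#[derive(Clone, Copy)]
enum Inst {
    Byte(u8),            // match one literal byte
    Split(usize, usize), // try the first target, then the second
    Jmp(usize),          // unconditional jump
    Match,               // succeed if the whole input was consumed
}

/// Program for the anchored pattern `(a|a)*b`, which takes exponential
/// time in a naive backtracker on inputs like "aaaa...a".
fn demo_program() -> Vec<Inst> {
    vec![
        Inst::Split(1, 6),
        Inst::Split(2, 4),
        Inst::Byte(b'a'),
        Inst::Jmp(5),
        Inst::Byte(b'a'),
        Inst::Jmp(0),
        Inst::Byte(b'b'),
        Inst::Match,
    ]
}

struct Matcher<'a> {
    prog: &'a [Inst],
    input: &'a [u8],
    visited: Vec<bool>, // one flag per (pc, pos) state
    steps: usize,       // number of distinct states explored
}

impl<'a> Matcher<'a> {
    fn run(&mut self, pc: usize, pos: usize) -> bool {
        let key = pc * (self.input.len() + 1) + pos;
        if self.visited[key] {
            // Already explored (or in progress): it cannot succeed again.
            return false;
        }
        self.visited[key] = true;
        self.steps += 1;
        let inst = self.prog[pc];
        match inst {
            Inst::Match => pos == self.input.len(),
            Inst::Byte(b) => {
                pos < self.input.len() && self.input[pos] == b && self.run(pc + 1, pos + 1)
            }
            Inst::Jmp(target) => self.run(target, pos),
            Inst::Split(first, second) => self.run(first, pos) || self.run(second, pos),
        }
    }
}

/// Runs `prog` against all of `input`; returns (matched, states explored).
fn is_match(prog: &[Inst], input: &[u8]) -> (bool, usize) {
    let mut m = Matcher {
        prog,
        input,
        visited: vec![false; prog.len() * (input.len() + 1)],
        steps: 0,
    };
    let matched = m.run(0, 0);
    (matched, m.steps)
}

fn main() {
    let prog = demo_program();
    assert!(is_match(&prog, b"aab").0);

    // 30 a's and no b: a naive backtracker would try about 2^30 paths.
    let hostile = vec![b'a'; 30];
    let (matched, steps) = is_match(&prog, &hostile);
    assert!(!matched);
    assert!(steps <= prog.len() * (hostile.len() + 1));
    println!("explored {} states", steps);
}
```

The trade-off is memory: the visited table grows with program size times input size, which is why this style of protection guarantees completion instead of relying on an abort limit.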
Installation
Add this to your Cargo.toml:
[dependencies]
regexr = "x.x"
Feature Flags
| Flag | Description | Recommended For |
|---|---|---|
| `simd` | Enable AVX2/SSSE3 acceleration | High-throughput servers |
| `jit` | Enable JIT compilation (native only) | Heavy usage of complex patterns |
| `full` | Enables all optimizations | Production builds |
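Assuming these flags are exposed as ordinary Cargo features on a crate named `regexr`, enabling one would look roughly like the fragment below. The `"x.x"` version is the same placeholder used in the installation snippet and must be replaced with a published version.

```toml
[dependencies]
# Assumption: `full` is a regular Cargo feature; swap in "simd" or "jit" to enable one alone.
regexr = { version = "x.x", features = ["full"] }
```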
Quick Start
1. Bytes Mode (LLM Tokenization)
Use this for processing raw binary data or training BPE tokenizers. It handles invalid UTF-8 gracefully and supports OpenAI-style compatibility.
use regexr::Regex;
2. Standard Text Matching (with Lookarounds)
Use this when you need regex features missing from the standard crate.
use regexr::Regex;
Feature Comparison
| Feature | `regex` (Standard) | `pcre2` (C Bindings) | `fancy-regex` | `regexr` (This Crate) |
|---|---|---|---|---|
| Pure Rust | ✅ | ❌ | ✅ | ✅ |
| Safe (No `unsafe`) | ✅ | ❌ | ✅ | ✅ |
| Bytes Mode (`&[u8]`) | ✅ | ✅ | ❌ | ✅ |
| Lookarounds | ❌ | ✅ | ✅ | ✅ |
| ReDoS Protection | ✅ (Linear Time) | ⚠️ (Abort Limits) | ⚠️ (Backtrack Limit) | ✅ (Memoization) |
License
This project is licensed under the MIT License - see the LICENSE file for details.