
regexr
A specialized, pure-Rust regex engine designed for LLM tokenization and complex pattern matching.
⚠️ Experimental - API May Change
This library was created as the regex backend for splintr, an LLM tokenizer. It is highly experimental and the API may change drastically between versions.
While it passes compliance tests for industry-standard tokenizer patterns (OpenAI's cl100k_base, Meta's Llama 3), it has not been proven in production environments.
Recommended for: Research, experimentation, tokenizer development, data preprocessing.
Not recommended for: Production systems requiring stability guarantees.
Please report issues on the Issue Tracker.
🎯 When to use regexr
This is a specialized tool, not a general-purpose replacement.
The Rust ecosystem already has the excellent, battle-tested regex crate. For 99% of use cases, you should use that.
Only use regexr if you specifically need:
- Lookarounds: You need features like (?=...), (?<=...), or (?!\S) without C dependencies.
  - Why not regex? It intentionally omits lookarounds to guarantee linear time.
  - Why not pcre2? Requires C library and FFI.
- JIT Compilation in Pure Rust: You want native code generation for hot patterns without C dependencies.
  - Why not regex/fancy-regex? Neither offers JIT compilation.
  - Why not pcre2? Requires C library and FFI.
- Pure Rust Dependency: You need advanced features (Lookarounds, Backreferences) but cannot use pcre2 due to unsafe C bindings or build complexity.
- Bounded Execution: You want ReDoS protection that memoizes states (guaranteeing completion) rather than just aborting after a timeout (like pcre2).
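The bounded-execution idea can be illustrated with a small, self-contained sketch. It is not regexr's implementation: the instruction set (`Inst`), the hand-assembled program for `(a*)*b`, and the function names are all hypothetical. It only shows why recording each (instruction, position) pair once caps the work at roughly program length times input length, even for a nested quantifier.

```rust
use std::collections::HashSet;

// Hypothetical mini instruction set; not regexr's internal representation.
#[derive(Clone, Copy)]
enum Inst {
    Byte(u8),            // consume one specific byte
    Split(usize, usize), // try the first target, then the second
    Jmp(usize),          // unconditional jump
    Match,               // succeed if the whole input was consumed
}

// Backtracking search that visits each (pc, pos) state at most once.
fn run(
    prog: &[Inst],
    input: &[u8],
    pc: usize,
    pos: usize,
    seen: &mut HashSet<(usize, usize)>,
) -> bool {
    // A state seen before either failed already or is still being explored
    // higher up the call stack (an empty loop); retrying cannot help.
    if !seen.insert((pc, pos)) {
        return false;
    }
    match prog[pc] {
        Inst::Byte(b) => {
            pos < input.len() && input[pos] == b && run(prog, input, pc + 1, pos + 1, seen)
        }
        Inst::Split(x, y) => run(prog, input, x, pos, seen) || run(prog, input, y, pos, seen),
        Inst::Jmp(target) => run(prog, input, target, pos, seen),
        Inst::Match => pos == input.len(),
    }
}

// Returns (matched, number of distinct states visited).
fn full_match(prog: &[Inst], input: &[u8]) -> (bool, usize) {
    let mut seen = HashSet::new();
    let matched = run(prog, input, 0, 0, &mut seen);
    (matched, seen.len())
}

// Hand-assembled program for the anchored pattern (a*)*b.
fn nested_star_program() -> Vec<Inst> {
    vec![
        Inst::Split(1, 5), // 0: enter the outer loop or leave it
        Inst::Split(2, 4), // 1: take another 'a' or end the inner loop
        Inst::Byte(b'a'),  // 2
        Inst::Jmp(1),      // 3: repeat the inner loop
        Inst::Jmp(0),      // 4: repeat the outer loop
        Inst::Byte(b'b'),  // 5
        Inst::Match,       // 6
    ]
}

fn main() {
    let prog = nested_star_program();
    let (matched, _) = full_match(&prog, b"aab");
    println!("aab matched: {matched}");

    // 40 'a's with no trailing 'b': a hostile input for naive backtracking.
    let hostile = vec![b'a'; 40];
    let (matched, states) = full_match(&prog, &hostile);
    println!("hostile matched: {matched}, states visited: {states}");
}
```

Because the `seen` set can never hold more than `prog.len() * (input.len() + 1)` entries, the search always finishes, which is the contrast with a timeout-based guard that simply gives up.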
The Problem Solved
Developers building LLM tokenizers (like GPT-4 or Llama 3) currently face a dilemma in Rust:
- regex crate: Fast, safe, but lacks lookarounds and JIT compilation.
- fancy-regex: Supports lookarounds, but lacks JIT compilation.
- pcre2: Supports everything including JIT, but introduces unsafe C bindings and external dependencies.
regexr bridges this gap. It provides Lookarounds + JIT compilation + Backreferences while remaining 100% Pure Rust.
Installation
Add this to your Cargo.toml:
    [dependencies]
    regexr = "0.x"
For JIT compilation support:
    [dependencies]
    regexr = { version = "0.x", features = ["full"] }
Usage
Basic matching
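    use regexr::Regex;
    // Illustrative pattern and input; assumes a regex-crate-style API.
    let re = Regex::new(r"\w+").unwrap();
    assert!(re.is_match("hello world"));
    // Find first match
    if let Some(m) = re.find("hello world") {
        println!("Found: {}", m.as_str());
    }
    // Find all matches
    for m in re.find_iter("hello world") {
        println!("{}", m.as_str());
    }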
Capture groups
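    use regexr::Regex;
    // Illustrative pattern, chosen to produce the outputs noted below;
    // index-style access to groups is assumed.
    let re = Regex::new(r"(\w+)@(\w+)\.(\w+)").unwrap();
    let caps = re.captures("user@example.com").unwrap();
    println!("{}", &caps[0]); // "user@example.com"
    println!("{}", &caps[1]); // "user"
    println!("{}", &caps[2]); // "example"
    println!("{}", &caps[3]); // "com"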
Named captures
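    use regexr::Regex;
    // Illustrative pattern; the group names and name-based indexing are assumed.
    let re = Regex::new(r"(?P<user>\w+)@(?P<domain>\w+\.\w+)").unwrap();
    let caps = re.captures("user@example.com").unwrap();
    println!("{}", &caps["user"]); // "user"
    println!("{}", &caps["domain"]); // "example.com"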
JIT compilation
Enable JIT for patterns that will be matched many times:
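    use regexr::RegexBuilder;
    // Illustrative pattern and input; the boolean argument to jit() is assumed.
    let re = RegexBuilder::new(r"\w+")
        .jit(true)
        .build()
        .unwrap();
    assert!(re.is_match("hello"));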
Prefix optimization for tokenizers
For patterns with many literal alternatives (e.g., keyword matching in tokenizers):
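    use regexr::RegexBuilder;
    // Illustrative keyword list; the boolean argument to optimize_prefixes() is assumed.
    let re = RegexBuilder::new(r"(function|for|while|if|else|return)")
        .optimize_prefixes(true)
        .build()
        .unwrap();
    assert!(re.is_match("return"));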
Text replacement
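    use regexr::Regex;
    // Illustrative pattern, input, and replacement text.
    let re = Regex::new(r"\d+").unwrap();
    // Replace first match
    let result = re.replace("abc 123 def 456", "NUM");
    assert_eq!(result, "abc NUM def 456");
    // Replace all matches
    let result = re.replace_all("abc 123 def 456", "NUM");
    assert_eq!(result, "abc NUM def NUM");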
Feature Flags
- simd (default): Enables SIMD-accelerated literal search
- jit: Enables JIT compilation for x86-64
- full: Enables both JIT and SIMD
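Build without default features for a minimal installation:
    [dependencies]
    regexr = { version = "0.x", default-features = false }
Build with all optimizations:
    [dependencies]
    regexr = { version = "0.x", features = ["full"] }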
Engine Selection
The library automatically selects the best execution engine based on pattern characteristics:
Non-JIT mode (default):
- ShiftOr: Small patterns (≤64 states) without anchors/word boundaries
- EagerDfa: Patterns with word boundaries or anchors
- LazyDfa: General patterns with on-demand state construction
- BacktrackingVm: Patterns with backreferences
- PikeVm: Patterns with lookaround or non-greedy quantifiers
JIT mode (with jit feature):
- BacktrackingJit: Patterns with backreferences
- TaggedNfa: Patterns with lookaround or non-greedy quantifiers
- JitShiftOr: Small patterns with alternations
- DFA JIT: General patterns, benefits from SIMD prefiltering
See docs/architecture.md for details on the engine selection logic.
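As a rough illustration of the non-JIT list above, the selection can be pictured as a priority check over pattern traits. This is a sketch under stated assumptions, not the library's code: the `PatternTraits` struct, the `select_engine` function, and in particular the order in which the checks are applied are hypothetical; only the engine names and the 64-state threshold come from the list.

```rust
// Engine names taken from the non-JIT list; everything else is hypothetical.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Engine {
    ShiftOr,
    EagerDfa,
    LazyDfa,
    BacktrackingVm,
    PikeVm,
}

// Hypothetical summary of what a pattern analysis pass might report.
#[derive(Default, Clone, Copy)]
struct PatternTraits {
    has_backrefs: bool,
    has_lookaround: bool,
    has_non_greedy: bool,
    has_anchors_or_word_boundaries: bool,
    state_count: usize,
}

// Assumed priority: features that need a more general engine are checked first.
fn select_engine(t: PatternTraits) -> Engine {
    if t.has_backrefs {
        Engine::BacktrackingVm
    } else if t.has_lookaround || t.has_non_greedy {
        Engine::PikeVm
    } else if t.has_anchors_or_word_boundaries {
        Engine::EagerDfa
    } else if t.state_count <= 64 {
        Engine::ShiftOr
    } else {
        Engine::LazyDfa
    }
}

fn main() {
    // A small literal-style pattern with no special features.
    let small = PatternTraits { state_count: 12, ..Default::default() };
    println!("{:?}", select_engine(small));

    // A pattern using a lookahead such as (?!\S).
    let lookahead = PatternTraits {
        has_lookaround: true,
        state_count: 12,
        ..Default::default()
    };
    println!("{:?}", select_engine(lookahead));
}
```

The actual rules, including how conflicts between traits are resolved, are the ones described in docs/architecture.md.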
Performance
Speedup relative to regex crate (higher is better):

Highlights (speedup vs regex crate):
| Benchmark | regexr | regexr-jit | pcre2-jit |
|---|---|---|---|
| log_parsing | 0.80-0.84x | 3.91-4.09x | 3.57-3.71x |
| url_extraction | 0.81-0.83x | 1.95-1.99x | 2.10-2.13x |
| unicode_letters | 1.24x | 1.43-1.44x | 1.65-1.72x |
| html_tags | 0.82-0.87x | 1.33-1.43x | 0.80-0.85x |
| word_boundary | 1.19-1.24x | 1.15-1.19x | 0.72-0.74x |
| email_validation | 0.99-1.00x | 1.00-1.11x | 0.94-1.00x |
| alternation | 0.88-1.01x | 0.88-1.01x | 0.12-0.15x |
- regexr-jit excels at log parsing (4x faster than regex)
- regexr (non-JIT) matches regex performance on most patterns
- Both outperform fancy-regex and pcre2 (non-JIT) consistently
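To read the table, it helps to spell out what a speedup figure means. The sketch below assumes the usual definition, baseline time divided by candidate time, so values above 1.0 mean faster than the regex crate. The timings used are made-up placeholders, not measurements from these benchmarks.

```rust
// Speedup of a candidate engine relative to a baseline.
// Assumed definition: baseline time / candidate time (higher is better).
fn speedup(baseline_ns: f64, candidate_ns: f64) -> f64 {
    baseline_ns / candidate_ns
}

fn main() {
    // Placeholder timings, chosen only to show the arithmetic.
    let regex_ns = 400.0;
    let fast_engine_ns = 100.0; // 4.00x: four times faster than the baseline
    let slow_engine_ns = 500.0; // 0.80x: slower than the baseline

    println!("{:.2}x", speedup(regex_ns, fast_engine_ns));
    println!("{:.2}x", speedup(regex_ns, slow_engine_ns));
}
```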
Documentation
- Architecture Overview - Engine architecture and selection logic
- Features - Detailed feature documentation
Citation
If you use regexr in your research, please cite: