
regexr
A specialized, pure-Rust regex engine designed for LLM tokenization and complex pattern matching.
Originally created as the regex backend for splintr, an LLM tokenizer. Passes compliance tests for industry-standard tokenizer patterns (OpenAI's cl100k_base, Meta's Llama 3).
Please report issues on the Issue Tracker.
🎯 When to use regexr
This is a specialized tool, not a general-purpose replacement.
The Rust ecosystem already has the excellent, battle-tested regex crate. For 99% of use cases, you should use that.
Only use regexr if you specifically need:
- Lookarounds: You need features like (?=...), (?<=...), or (?!\S) without C dependencies.
  - Why not regex? It intentionally omits lookarounds to guarantee linear time.
  - Why not pcre2? It requires a C library and FFI.
- JIT Compilation in Pure Rust: You want native code generation for hot patterns without C dependencies.
  - Why not regex/fancy-regex? Neither offers JIT compilation.
  - Why not pcre2? It requires a C library and FFI.
- Pure Rust Dependency: You need advanced features (Lookarounds, Backreferences) but cannot use pcre2 due to unsafe C bindings or build complexity.
- Bounded Execution: You want ReDoS protection that memoizes states (guaranteeing completion) rather than just aborting after a timeout (like pcre2).
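To make the bounded-execution point concrete, the following is a toy sketch of memoized backtracking, not regexr's actual implementation. It assumes a tiny pattern language (literal bytes, . for any byte, and a postfix * for repetition), and the names full_match and go are hypothetical. Every (pattern index, text index) state that fails is recorded, so it is never explored twice and total work is bounded by the number of such states.

```rust
use std::collections::HashSet;

/// Toy full-string matcher: literal bytes, `.` (any byte), postfix `*`.
/// Hypothetical helper for illustration; not part of regexr.
pub fn full_match(pattern: &[u8], text: &[u8]) -> bool {
    let mut failed = HashSet::new();
    go(pattern, text, 0, 0, &mut failed)
}

fn go(p: &[u8], t: &[u8], pi: usize, ti: usize, failed: &mut HashSet<(usize, usize)>) -> bool {
    if failed.contains(&(pi, ti)) {
        return false; // this state was already explored and cannot succeed
    }
    let ok = if pi == p.len() {
        ti == t.len()
    } else {
        let first = ti < t.len() && (p[pi] == b'.' || p[pi] == t[ti]);
        if pi + 1 < p.len() && p[pi + 1] == b'*' {
            // Either skip `x*` entirely, or consume one byte and stay on `x*`.
            go(p, t, pi + 2, ti, failed) || (first && go(p, t, pi, ti + 1, failed))
        } else {
            first && go(p, t, pi + 1, ti + 1, failed)
        }
    };
    if !ok {
        failed.insert((pi, ti)); // remember the dead end
    }
    ok
}
```

With the memo table, a pattern such as a*a*a*a*b run against a long string of a characters visits each state once instead of retrying the same dead ends.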
The Problem Solved
Developers building LLM tokenizers (like GPT-4 or Llama 3) currently face a dilemma in Rust:
- regex crate: Fast, safe, but lacks lookarounds and JIT compilation.
- fancy-regex: Supports lookarounds, but lacks JIT compilation.
- pcre2: Supports everything including JIT, but introduces unsafe C bindings and external dependencies.
regexr bridges this gap. It provides Lookarounds + JIT compilation + Backreferences while remaining 100% Pure Rust.
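As an illustration of what a negative lookahead such as (?!\S) asks of an engine, here is a hand-written emulation of the pattern \s+(?!\S) anchored at the start of the input. It is a sketch only: no regex engine is involved, the function name is hypothetical, and \s is approximated by char::is_whitespace.

```rust
/// Length in bytes of the match of `\s+(?!\S)` at the start of `text`, if any.
/// Hand-written emulation for illustration; hypothetical helper.
pub fn ws_not_before_nonspace(text: &str) -> Option<usize> {
    // Byte offsets just past each leading whitespace character.
    let mut ends: Vec<usize> = Vec::new();
    for (i, c) in text.char_indices() {
        if !c.is_whitespace() {
            break;
        }
        ends.push(i + c.len_utf8());
    }
    // Greedy `\s+` tries the longest run first, then gives back one character
    // at a time until the lookahead holds: the next character must be
    // whitespace or the end of the input.
    ends.iter().rev().copied().find(|&end| {
        text[end..].chars().next().map_or(true, |c| c.is_whitespace())
    })
}
```

The lookahead never consumes input. It only vetoes candidate match ends, which is why two spaces followed by a letter match just the first space.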
Installation
Add this to your Cargo.toml:
    [dependencies]
    regexr = "0.x"
For JIT compilation support:
    [dependencies]
    regexr = { version = "0.x", features = ["full"] }
Usage
Basic matching
    use regexr::Regex;

    let re = Regex::new(...).unwrap();
    assert!(...);

    // Find first match
    if let Some(m) = re.find(...) { ... }

    // Find all matches
    for m in re.find_iter(...) { ... }
Capture groups
    use regexr::Regex;

    let re = Regex::new(...).unwrap();
    let caps = re.captures(...).unwrap();

    println!(...); // "user@example.com"
    println!(...); // "user"
    println!(...); // "example"
    println!(...); // "com"
Named captures
    use regexr::Regex;

    let re = Regex::new(...).unwrap();
    let caps = re.captures(...).unwrap();

    println!(...); // "user"
    println!(...); // "example.com"
JIT compilation
Enable JIT for patterns that will be matched many times:
    use regexr::RegexBuilder;

    let re = RegexBuilder::new(...)
        .jit(...)
        .build()
        .unwrap();
    assert!(...);
Prefix optimization for tokenizers
For patterns with many literal alternatives (e.g., keyword matching in tokenizers):
    use regexr::RegexBuilder;

    let re = RegexBuilder::new(...)
        .optimize_prefixes(...)
        .build()
        .unwrap();
    assert!(...);
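How optimize_prefixes works internally is not described here. One common way to handle many literal alternatives is to share their prefixes in a trie, so the input is scanned once rather than once per alternative. The sketch below shows that general technique under this assumption. LiteralTrie and its methods are hypothetical names, and it reports the longest matching alternative, which may differ from a regex engine's leftmost-first alternation semantics.

```rust
use std::collections::HashMap;

#[derive(Default)]
struct Node {
    children: HashMap<u8, Node>,
    terminal: bool,
}

/// Hypothetical literal-alternation matcher; not regexr's implementation.
pub struct LiteralTrie {
    root: Node,
}

impl LiteralTrie {
    /// Build a trie in which alternatives with a common prefix share nodes.
    pub fn new(words: &[&str]) -> Self {
        let mut root = Node::default();
        for w in words {
            let mut node = &mut root;
            for &b in w.as_bytes() {
                node = node.children.entry(b).or_default();
            }
            node.terminal = true;
        }
        LiteralTrie { root }
    }

    /// Byte length of the longest alternative matching at the start of `text`.
    pub fn longest_match_at_start(&self, text: &str) -> Option<usize> {
        let mut node = &self.root;
        let mut best = if node.terminal { Some(0) } else { None };
        for (i, b) in text.bytes().enumerate() {
            match node.children.get(&b) {
                Some(next) => node = next,
                None => break,
            }
            if node.terminal {
                best = Some(i + 1);
            }
        }
        best
    }
}
```

For the alternatives if, in, int, and while, the input "integer" walks a single path through the shared i and n nodes and reports a match of length 3.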
Text replacement
    use regexr::Regex;

    let re = Regex::new(...).unwrap();

    // Replace first match
    let result = re.replace(...);
    assert_eq!(...);

    // Replace all matches
    let result = re.replace_all(...);
    assert_eq!(...);
Feature Flags
- simd (default): Enables SIMD-accelerated literal search
- jit: Enables JIT compilation (x86-64 and ARM64)
- full: Enables both JIT and SIMD
Platform Support
| Platform | JIT Support | SIMD Support |
|---|---|---|
| Linux x86-64 | ✓ | ✓ (AVX2) |
| Linux ARM64 | ✓ | ✗ |
| macOS x86-64 | ✓ | ✓ (AVX2) |
| macOS ARM64 (Apple Silicon) | ✓ | ✗ |
| Windows x86-64 | ✓ | ✓ (AVX2) |
| WASM (wasm32) | ✗ | ✗ |
| Other | ✗ | ✗ |
Build without default features for a minimal installation (also works for WASM):

    cargo build --no-default-features

Build with all optimizations:

    cargo build --features full
Engine Selection
The library automatically selects the best execution engine based on pattern characteristics:
Non-JIT mode (default):
- ShiftOr: Small patterns (≤64 states) without anchors/word boundaries
- EagerDfa: Patterns with word boundaries or anchors
- LazyDfa: General patterns with on-demand state construction
- BacktrackingVm: Patterns with backreferences
- PikeVm: Patterns with lookaround or non-greedy quantifiers
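The non-JIT rules above can be read as a decision function over a few pattern features. The sketch below is one possible reading, not the library's code: the PatternInfo fields, the Engine enum, and especially the order in which the rules are checked are assumptions.

```rust
/// Hypothetical summary of the pattern features named in the list above.
#[derive(Debug, Default, Clone, Copy)]
pub struct PatternInfo {
    pub has_backreferences: bool,
    pub has_lookaround: bool,
    pub has_non_greedy: bool,
    pub has_anchors_or_word_boundaries: bool,
    pub state_count: usize,
}

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Engine {
    ShiftOr,
    EagerDfa,
    LazyDfa,
    BacktrackingVm,
    PikeVm,
}

/// Assumed order: features that need a more general engine are checked first.
pub fn select_engine(info: &PatternInfo) -> Engine {
    if info.has_backreferences {
        Engine::BacktrackingVm
    } else if info.has_lookaround || info.has_non_greedy {
        Engine::PikeVm
    } else if info.has_anchors_or_word_boundaries {
        Engine::EagerDfa
    } else if info.state_count <= 64 {
        Engine::ShiftOr
    } else {
        Engine::LazyDfa
    }
}
```

Under this reading, a small pattern with a word boundary goes to EagerDfa rather than ShiftOr, because the ShiftOr rule excludes anchors and word boundaries.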
JIT mode (with jit feature):
- BacktrackingJit: Patterns with backreferences
- TaggedNfa: Patterns with lookaround or non-greedy quantifiers
- JitShiftOr: Small patterns with alternations
- DFA JIT: General patterns, benefits from SIMD prefiltering
See docs/architecture.md for details on the engine selection logic.
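For readers unfamiliar with the technique behind the ShiftOr engines listed above, here is a minimal sketch of the classic Shift-Or (bitap) search for a plain literal. It keeps one bit of state per pattern position in a single u64, which is presumably why such engines are limited to 64 states. The function name is hypothetical, and this is not regexr's implementation, which handles regex patterns rather than only literals.

```rust
/// Classic Shift-Or search for a literal pattern of 1..=64 bytes.
/// Returns the start offset of the first occurrence, if any.
pub fn shift_or_find(pattern: &[u8], text: &[u8]) -> Option<usize> {
    let m = pattern.len();
    if m == 0 || m > 64 {
        return None; // outside what one u64 of state bits can represent
    }
    // masks[b] has bit i cleared exactly when pattern[i] == b.
    let mut masks = [!0u64; 256];
    for (i, &b) in pattern.iter().enumerate() {
        masks[b as usize] &= !(1u64 << i);
    }
    let accept = 1u64 << (m - 1);
    let mut state = !0u64; // all bits set: no partial match in progress
    for (pos, &b) in text.iter().enumerate() {
        // The shift brings in a 0 at bit 0 (a match may start here);
        // OR-ing the mask kills every state whose pattern byte differs.
        state = (state << 1) | masks[b as usize];
        if state & accept == 0 {
            return Some(pos + 1 - m);
        }
    }
    None
}
```

Each input byte costs one shift, one OR, and one test regardless of pattern length, which is what makes the approach attractive for small patterns.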
Performance
[Figure: speedup relative to the regex crate (higher is better)]

Highlights (speedup vs regex crate):
| Benchmark | regexr | regexr-jit | pcre2-jit |
|---|---|---|---|
| log_parsing | 0.80-0.84x | 3.91-4.09x | 3.57-3.71x |
| url_extraction | 0.81-0.83x | 1.95-1.99x | 2.10-2.13x |
| unicode_letters | 1.24x | 1.43-1.44x | 1.65-1.72x |
| html_tags | 0.82-0.87x | 1.33-1.43x | 0.80-0.85x |
| word_boundary | 1.19-1.24x | 1.15-1.19x | 0.72-0.74x |
| email_validation | 0.99-1.00x | 1.00-1.11x | 0.94-1.00x |
| alternation | 0.88-1.01x | 0.88-1.01x | 0.12-0.15x |
- regexr-jit excels at log parsing (about 4x faster than regex)
- regexr (non-JIT) is broadly comparable to regex, ranging from 0.80x to 1.24x on the benchmarks above
- Both outperform fancy-regex and pcre2 (non-JIT) consistently
Documentation
- Architecture Overview - Engine architecture and selection logic
- Features - Detailed feature documentation
Citation
If you use regexr in your research, please cite: