# regexr
**A specialized, pure-Rust regex engine designed for LLM tokenization and complex pattern matching.**
[crates.io](https://crates.io/crates/regexr) · [docs.rs](https://docs.rs/regexr) · [License](LICENSE)
---
> ⚠️ **Beta Warning**
>
> This library is currently in **active development and testing**. While it passes strict compliance tests for industry-standard tokenizer patterns (like OpenAI's `cl100k_base` and Meta's Llama 3), it has not yet been battle-tested in large-scale production environments.
>
> **Use at your own risk.** We encourage using this for research, tokenizer training, and data preprocessing pipelines. Please report any edge cases or bugs you encounter on the [Issue Tracker](https://github.com/farhan/regexr/issues).
---
## 🎯 When to use `regexr`
**This is a specialized tool, not a general-purpose replacement.**
The Rust ecosystem already has the excellent, battle-tested [**`regex`**](https://crates.io/crates/regex) crate. For 99% of use cases, you should use that.
**Only use `regexr` if you specifically need:**
1. **Lookarounds on Raw Bytes:** You need features like `(?=...)` or `(?!\S)` running directly on `&[u8]` (invalid UTF-8).
   - _Why not `regex`?_ It intentionally omits lookarounds to guarantee linear time.
   - _Why not `fancy-regex`?_ It operates primarily on UTF-8 strings, not raw bytes.
2. **Pure Rust Dependency:** You need advanced features (Atomic Groups, Lookarounds) but cannot use `pcre2` due to unsafe C bindings or build complexity.
3. **Bounded Execution:** You want ReDoS protection that **memoizes** states (guaranteeing completion) rather than just **aborting** after a timeout (like `pcre2`).
## Documentation
- [**Features Guide**](docs/FEATURES.md) - Comprehensive syntax reference (lookarounds, properties, flags).
- [**Architecture**](docs/ARCHITECTURE.md) - Deep dive into the VM, JIT, and compiler design.
- [**WASM Support**](docs/WASM.md) - Guide for running `regexr` in the browser or Edge Runtime.
- [**Contributing**](CONTRIBUTING.md) - How to build, test, and contribute to the project.
## The Problem Solved
Developers building LLM tokenizers (for models like GPT-4 or Llama 3) currently face a dilemma in Rust:
- **`regex` crate:** Fast, safe, and supports bytes, but **lacks lookarounds** needed for complex token boundary logic.
- **`fancy-regex`:** Supports lookarounds, but is **string-focused**, making it unsuitable for processing raw binary data (invalid UTF-8).
- **`pcre2`:** Supports everything, but introduces **unsafe C bindings** and external dependencies.
**`regexr` bridges this gap.** It is the only engine that provides **advanced features (Lookarounds)** on **raw bytes** while remaining **100% Safe Rust**.
## Key Features
- **Dual Mode Architecture**:
  - **Unicode Mode (`&str`)**: Full UTF-8 validation and property support.
  - **Bytes Mode (`&[u8]`)**: Raw byte matching without UTF-8 validation.
- **Advanced Syntax**: Native support for Lookarounds (`(?=...)`, `(?<=...)`), Atomic Groups (`(?>...)`), and Possessive Quantifiers.
- **LLM Compatibility**: Specifically optimized to compile OpenAI (`cl100k_base`) and Llama 3 patterns.
- **Safety First**: Built-in protection against ReDoS (Regular Expression Denial of Service) via bounded execution and memoization.
- **High Performance**: JIT compilation (via Cranelift) and SIMD acceleration (AVX2/SSSE3) for high-throughput scenarios.
## Installation
Add this to your `Cargo.toml`:
```toml
[dependencies]
regexr = "x.x"
```
### Feature Flags
| Flag | Description | Use Case |
| --- | --- | --- |
| `simd` | Enable AVX2/SSSE3 acceleration | High-throughput servers |
| `jit` | Enable JIT compilation (native only) | Heavy usage of complex patterns |
| `full` | Enables all optimizations | Production builds |
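To opt into specific features, list them on the dependency in `Cargo.toml` (feature names as in the table above):

```toml
[dependencies]
regexr = { version = "x.x", features = ["simd", "jit"] }
```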
## Quick Start
### 1. Bytes Mode (LLM Tokenization)
Use this for processing raw binary data or training BPE tokenizers. It handles invalid UTF-8 gracefully and supports OpenAI-style tokenizer patterns.
```rust
use regexr::bytes::Regex;
fn main() {
    // Compile a pattern compatible with GPT-4's tokenizer.
    // 'unicode_ascii_compat' maps \p{L} to [a-zA-Z] for raw byte performance.
    let re = Regex::builder(r"\p{L}+|\p{N}+")
        .unicode_ascii_compat(true)
        .build()
        .unwrap();

    // Works on raw bytes, even invalid UTF-8 sequences.
    let input = b"Hello \xFF World 123";
    let matches: Vec<_> = re.find_iter(input).map(|m| m.as_bytes()).collect();
    assert_eq!(matches[0], b"Hello");
    assert_eq!(matches[1], b"World");
    assert_eq!(matches[2], b"123");
}
```
### 2. Standard Text Matching (with Lookarounds)
Use this when you need regex features missing from the standard crate.
```rust
use regexr::Regex;
fn main() {
    // Lookarounds (not supported by the standard `regex` crate).
    let re = Regex::new(r"foo(?=bar)").unwrap();
    assert!(re.is_match("foobar"));

    // Named captures.
    let re = Regex::new(r"(?P<y>\d{4})-(?P<m>\d{2})").unwrap();
    let caps = re.captures("Date: 2024-01").unwrap();
    assert_eq!(&caps["y"], "2024");
}
```
## Feature Comparison
| Feature | `regex` | `pcre2` | `fancy-regex` | `regexr` |
| --- | --- | --- | --- | --- |
| **Pure Rust** | ✅ | ❌ | ✅ | ✅ |
| **Safe (No `unsafe`)** | ✅ | ❌ | ✅ | ✅ |
| **Bytes Mode (`&[u8]`)** | ✅ | ✅ | ❌ | ✅ |
| **Lookarounds** | ❌ | ✅ | ✅ | ✅ |
| **ReDoS Protection** | ✅ (Linear Time) | ⚠️ (Abort Limits) | ⚠️ (Backtrack Limit) | ✅ (Memoization) |
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.