# regexr

**A specialized, pure-Rust regex engine designed for LLM tokenization and complex pattern matching.**

[![Crates.io](https://img.shields.io/crates/v/regexr.svg)](https://crates.io/crates/regexr)
[![Documentation](https://docs.rs/regexr/badge.svg)](https://docs.rs/regexr)
[![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)

---

> ⚠️ **Beta Warning**
>
> This library is currently in **active development and testing**. While it passes strict compliance tests for industry-standard tokenizer patterns (like OpenAI's `cl100k_base` and Meta's Llama 3), it has not yet been battle-tested in large-scale production environments.
>
> **Use at your own risk.** We encourage using this for research, tokenizer training, and data preprocessing pipelines. Please report any edge cases or bugs you encounter on the [Issue Tracker](https://github.com/farhan/regexr/issues).

---

## 🎯 When to use `regexr`

**This is a specialized tool, not a general-purpose replacement.**

The Rust ecosystem already has the excellent, battle-tested [**`regex`**](https://crates.io/crates/regex) crate. For 99% of use cases, you should use that.

**Only use `regexr` if you specifically need:**

1.  **Lookarounds on Raw Bytes:** You need features like `(?=...)` or `(?!\S)` running directly on `&[u8]` (invalid UTF-8).
    - _Why not `regex`?_ It intentionally omits lookarounds to guarantee linear time.
    - _Why not `fancy-regex`?_ It operates primarily on UTF-8 strings, not raw bytes.
2.  **Pure Rust Dependency:** You need advanced features (Atomic Groups, Lookarounds) but cannot use `pcre2` due to unsafe C bindings or build complexity.
3.  **Bounded Execution:** You want ReDoS protection that **memoizes** states (guaranteeing completion) rather than just **aborting** after a timeout (like `pcre2`).

## Documentation

- [**Features Guide**](docs/FEATURES.md) - Comprehensive syntax reference (lookarounds, properties, flags).
- [**Architecture**](docs/ARCHITECTURE.md) - Deep dive into the VM, JIT, and compiler design.
- [**WASM Support**](docs/WASM.md) - Guide for running `regexr` in the browser or Edge Runtime.
- [**Contributing**](CONTRIBUTING.md) - How to build, test, and contribute to the project.

## The Problem Solved

Developers building LLM tokenizers (like those for GPT-4 or Llama 3) currently face a dilemma in Rust:

- **`regex` crate:** Fast, safe, and supports bytes, but **lacks lookarounds** needed for complex token boundary logic.
- **`fancy-regex`:** Supports lookarounds, but is **string-focused**, making it unsuitable for processing raw binary data (invalid UTF-8).
- **`pcre2`:** Supports everything, but introduces **unsafe C bindings** and external dependencies.

**`regexr` bridges this gap.** It is the only engine that provides **advanced features (Lookarounds)** on **raw bytes** while remaining **100% Safe Rust**.

## Key Features

- **Dual Mode Architecture**:
  - **Unicode Mode (`&str`)**: Full UTF-8 validation and property support.
  - **Bytes Mode (`&[u8]`)**: Raw byte matching without UTF-8 validation.
- **Advanced Syntax**: Native support for Lookarounds (`(?=...)`, `(?<=...)`), Atomic Groups (`(?>...)`), and Possessive Quantifiers.
- **LLM Compatibility**: Specifically optimized to compile OpenAI (`cl100k_base`) and Llama 3 tokenizer patterns.
- **Safety First**: Built-in protection against ReDoS (Regular Expression Denial of Service) via bounded execution and memoization.
- **High Performance**: JIT compilation (via Cranelift) and SIMD acceleration (AVX2/SSSE3) for high-throughput scenarios.

## Installation

Add this to your `Cargo.toml`:

```toml
[dependencies]
regexr = "x.x"
```

### Feature Flags

| Flag   | Description                          | Recommended For                 |
| ------ | ------------------------------------ | ------------------------------- |
| `simd` | Enable AVX2/SSSE3 acceleration       | High-throughput servers         |
| `jit`  | Enable JIT compilation (native only) | Heavy usage of complex patterns |
| `full` | Enables all optimizations            | Production builds               |
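
For example, to opt into all optimizations for a production build (version placeholder as above):

```toml
[dependencies]
regexr = { version = "x.x", features = ["full"] }
```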

## Quick Start

### 1. Bytes Mode (LLM Tokenization)

Use this for processing raw binary data or training BPE tokenizers. It handles invalid UTF-8 gracefully and supports OpenAI-style compatibility.

```rust
use regexr::bytes::Regex;

fn main() {
    // Compile a pattern compatible with GPT-4's tokenizer
    // 'unicode_ascii_compat' maps \p{L} to [a-zA-Z] for raw byte performance
    let re = Regex::builder(r"\p{L}+|\p{N}+")
        .unicode_ascii_compat(true)
        .build()
        .unwrap();

    // Works on raw bytes, even invalid UTF-8 sequences
    let input = b"Hello \xFF World 123";
    let matches: Vec<_> = re.find_iter(input).map(|m| m.as_bytes()).collect();

    assert_eq!(matches[0], b"Hello");
    assert_eq!(matches[1], b"World");
    assert_eq!(matches[2], b"123");
}
```

### 2. Standard Text Matching (with Lookarounds)

Use this when you need regex features missing from the standard crate.

```rust
use regexr::Regex;

fn main() {
    // Lookarounds (not supported by standard regex crate)
    let re = Regex::new(r"foo(?=bar)").unwrap();
    assert!(re.is_match("foobar"));

    // Named captures
    let re = Regex::new(r"(?P<y>\d{4})-(?P<m>\d{2})").unwrap();
    let caps = re.captures("Date: 2024-01").unwrap();
    assert_eq!(&caps["y"], "2024");
}
```

## Feature Comparison

| Feature                  | `regex` (Standard) | `pcre2` (C Bindings) |    `fancy-regex`     | `regexr` (This Crate) |
| :----------------------- | :----------------: | :------------------: | :------------------: | :-------------------: |
| **Pure Rust**            |         ✅         |          ❌          |          ✅          |          ✅           |
| **Safe (No `unsafe`)**   |         ✅         |          ❌          |          ✅          |          ✅           |
| **Bytes Mode (`&[u8]`)** |         ✅         |          ✅          |          ❌          |          ✅           |
| **Lookarounds**          |         ❌         |          ✅          |          ✅          |          ✅           |
| **ReDoS Protection**     |  ✅ (Linear Time)  |  ⚠️ (Abort Limits)   | ⚠️ (Backtrack Limit) |   ✅ (Memoization)    |

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.