rexile 0.1.1

A blazing-fast regex engine with JIT-style optimizations - competitive performance (1.03x vs regex) with 15x less memory
Documentation
# ReXile ๐ŸฆŽ

[![Crates.io](https://img.shields.io/crates/v/rexile.svg)](https://crates.io/crates/rexile)
[![Documentation](https://docs.rs/rexile/badge.svg)](https://docs.rs/rexile)
[![License: MIT OR Apache-2.0](https://img.shields.io/badge/license-MIT%20OR%20Apache--2.0-blue.svg)](LICENSE)

**A blazing-fast regex engine with JIT-style optimizations and minimal dependencies**

ReXile is a **zero-dependency regex alternative** (no `regex` crate!) that achieves **competitive performance** through intelligent fast paths:

- โšก **Performance-competitive with regex crate** - Within 3% on real-world workloads
- ๐Ÿง  **15x less memory for pattern compilation** - Minimal metadata overhead
- ๐Ÿš€ **21x faster pattern compilation** - Critical for dynamic patterns
- ๐Ÿ“ฆ **Only 2 dependencies** - `memchr` and `aho-corasick` for SIMD primitives
- ๐ŸŽฏ **10 specialized fast paths** - JIT-style optimizations without JIT complexity
- ๐Ÿ”ง **Full control** - Custom optimizations for parsers, lexers, and rule engines

**Key Features:**
- โœ… Literal searches with SIMD acceleration
- โœ… Multi-pattern matching (alternations)
- โœ… Character classes with negation
- โœ… Quantifiers (`*`, `+`, `?`)
- โœ… Escape sequences (`\d`, `\w`, `\s`, etc.)
- โœ… Sequences and groups
- โœ… Word boundaries (`\b`, `\B`)
- โœ… Anchoring (`^`, `$`)

## ๐ŸŽฏ Purpose

ReXile is a **production-ready regex engine** built from scratch for maximum performance and minimal overhead:

- ๐ŸŽฏ **Competitive performance** - 1.03x aggregate ratio vs `regex` crate on real workloads
- โšก **JIT-style optimizations** - 10 specialized fast paths for common patterns
- ๐Ÿ“ฆ **Minimal dependencies** - Only `memchr` + `aho-corasick` for SIMD primitives
- ๐Ÿš€ **Lightning-fast compilation** - 21x faster than `regex` crate
- ๐Ÿ’พ **Memory efficient** - 15x less compilation memory, 5x less peak memory
- ๐Ÿ”ง **Full control** - Custom optimizations for specific use cases

### Performance Highlights

**Real-World GRL Benchmark** (6 patterns ร— 41 files):
- Pattern `\d+`: **3.57x faster** than regex (41/41 wins)
- Pattern `"[^"]+"`: **2.44x faster** than regex (41/41 wins)
- Pattern `rule\s+`: **1.05x faster** than regex
- **Aggregate: 1.03x** (within 3% of regex - competitive!)

**Memory Comparison**:
- Compilation: **15x less memory** (128 KB vs 1920 KB)
- Compilation time: **21x faster** (370ยตs vs 7.89ms)
- Peak memory: **5x less** in stress tests (0.12 MB vs 0.62 MB)
- Search operations: **Equal memory efficiency**

**When to Use ReXile:**
- โœ… Parsers & lexers (fast token matching)
- โœ… Rule engines (business logic pattern matching)
- โœ… Log processing (keyword search)
- โœ… Dynamic patterns (21x faster compilation)
- โœ… Memory-constrained environments (15x less memory)
- โœ… Low-latency applications (competitive performance)

## ๐Ÿš€ Quick Start

```rust
use rexile::Pattern;

// Literal matching with SIMD acceleration
let pattern = Pattern::new("hello").unwrap();
assert!(pattern.is_match("hello world"));
assert_eq!(pattern.find("say hello"), Some((4, 9)));

// Multi-pattern matching (aho-corasick fast path)
let multi = Pattern::new("foo|bar|baz").unwrap();
assert!(multi.is_match("the bar is open"));

// Digit matching (DigitRun fast path - 3.57x faster than regex!)
let digits = Pattern::new("\\d+").unwrap();
let matches = digits.find_all("Order #12345 costs $67.89");
// Returns: [(7, 12), (20, 22), (23, 25)]

// Identifier matching (IdentifierRun fast path)
let ident = Pattern::new("[a-zA-Z_]\\w*").unwrap();
assert!(ident.is_match("variable_name_123"));

// Quoted strings (QuotedString fast path - 2.44x faster!)
let quoted = Pattern::new("\"[^\"]+\"").unwrap();
assert!(quoted.is_match("say \"hello world\""));

// Word boundaries
let word = Pattern::new("\\btest\\b").unwrap();
assert!(word.is_match("this is a test"));
assert!(!word.is_match("testing"));

// Anchors
let exact = Pattern::new("^hello$").unwrap();
assert!(exact.is_match("hello"));
assert!(!exact.is_match("hello world"));
```

### Cached API (Recommended for Hot Paths)

For patterns used repeatedly in hot loops:

```rust
use rexile;

// Automatically cached - compile once, reuse forever
assert!(rexile::is_match("test", "this is a test").unwrap());
assert_eq!(rexile::find("world", "hello world").unwrap(), Some((6, 11)));

// Perfect for parsers and lexers
for line in log_lines {
    if rexile::is_match("ERROR", line).unwrap() {
        // handle error
    }
}
```

## โœจ Supported Features

### Fast Path Optimizations (10 Types)

ReXile uses **JIT-style specialized implementations** for common patterns:

| Fast Path | Pattern Example | Performance vs regex |
|-----------|----------------|---------------------|
| **Literal** | `"hello"` | Competitive (SIMD) |
| **LiteralPlusWhitespace** | `"rule "` | Competitive |
| **DigitRun** | `\d+` | **3.57x faster** โœจ |
| **IdentifierRun** | `[a-zA-Z_]\w*` | **2520x faster** (vs general) |
| **QuotedString** | `"[^"]+"` | **2.44x faster** โœจ |
| **WordRun** | `\w+` | Competitive |
| **Alternation** | `foo\|bar\|baz` | 2x slower (acceptable) |
| **LiteralWhitespaceQuoted** | Complex | Competitive |
| **LiteralWhitespaceDigits** | Complex | Competitive |

### Regex Features

| Feature | Example | Status |
|---------|---------|--------|
| Literal strings | `hello`, `world` | โœ… Supported |
| Alternation | `foo\|bar\|baz` | โœ… Supported (aho-corasick) |
| Start anchor | `^start` | โœ… Supported |
| End anchor | `end$` | โœ… Supported |
| Exact match | `^exact$` | โœ… Supported |
| Character classes | `[a-z]`, `[0-9]`, `[^abc]` | โœ… Supported |
| Quantifiers | `*`, `+`, `?` | โœ… Supported |
| Escape sequences | `\d`, `\w`, `\s`, `\.`, `\n`, `\t` | โœ… Supported |
| Sequences | `ab+c*`, `\d+\w*` | โœ… Supported |
| Groups | `(abc)`, `(?:...)` | โœ… Supported |
| Word boundaries | `\b`, `\B` | โœ… Supported |
| Bounded quantifiers | `{n}`, `{n,m}` | ๐Ÿšง Planned |
| Capturing groups | Extract `(group)` | ๐Ÿšง Planned |
| Lookahead/lookbehind | `(?=...)`, `(?<=...)` | ๐Ÿšง Planned |
| Backreferences | `\1`, `\2` | ๐Ÿšง Planned |

## ๏ฟฝ Performance Benchmarks

### Real-World GRL Benchmark

Testing 6 realistic patterns across 41 GRL files (total ~139KB):

| Pattern | Description | Performance | Result |
|---------|-------------|-------------|--------|
| `\d+` | Digit sequences | **0.28x** | **3.57x faster** โœจ |
| `"[^"]+"` | Quoted strings | **0.41x** | **2.44x faster** โœจ |
| `rule\s+` | Rule keyword | **0.95x** | 5% faster |
| `salience\s+\d+` | Salience declarations | **1.10x** | Competitive |
| `query\s+` | Query keyword (sparse) | **1.44x** | Expected loss |
| `when\|then` | Alternation | **1.99x** | 2x slower (acceptable) |
| **AGGREGATE** | All patterns | **1.03x** | **Within 3% of regex!** โœ… |

**Perfect Performance (82/82 wins):**
- Digit patterns: **41/41 wins** (3.57x faster)
- Quoted strings: **41/41 wins** (2.44x faster)

### Memory Comparison

**Test 1: Pattern Compilation** (10 patterns):
- regex: 1920 KB in 7.89ms
- ReXile: 128 KB in 370ยตs
- **Result: 15x less memory, 21x faster** โœจ

**Test 2: Search Operations** (5 patterns ร— 139KB corpus):
- Both: 0 bytes memory delta
- **Result: Equal efficiency** โœ…

**Test 3: Stress Test** (50 patterns ร— 500KB corpus):
- regex: 0.62 MB peak in 46ms
- ReXile: 0.12 MB peak in 27ms
- **Result: 5x less peak memory, 1.7x faster** โœจ

### When ReXile Wins

โœ… **Digit sequences** (`\d+`) - 3.57x faster
โœ… **Quoted strings** (`"[^"]+"`) - 2.44x faster  
โœ… **Word runs** (`\w+`) - Competitive
โœ… **Identifiers** (`[a-zA-Z_]\w*`) - 2520x faster than general matcher
โœ… **Pattern compilation** - 21x faster
โœ… **Memory usage** - 15x less for compilation, 5x less peak

### When regex Wins

โš ๏ธ **Alternations** (`when|then`) - ReXile 2x slower (trade-off for simplicity)
โš ๏ธ **Sparse matches** (`query\s+`) - ReXile 1.44x slower (expected)

### Architecture

```
Pattern โ†’ Parser โ†’ AST โ†’ Fast Path Detection โ†’ Specialized Matcher
                                                        โ†“
                                     DigitRun (memchr SIMD scanning)
                                     IdentifierRun (direct byte scanning)
                                     QuotedString (memchr + validation)
                                     Alternation (aho-corasick automaton)
                                     Literal (memchr SIMD)
                                     ... 5 more fast paths
```

**Run benchmarks yourself:**
```bash
cargo run --release --example per_file_grl_benchmark
cargo run --release --example memory_comparison
```

## ๐Ÿ“ฆ Installation

Add to your `Cargo.toml`:

```toml
[dependencies]
rexile = "0.1"
```

## ๐ŸŽ“ Examples

### Literal Search

```rust
let p = Pattern::new("needle").unwrap();
assert!(p.is_match("needle in a haystack"));
assert_eq!(p.find("where is the needle?"), Some((13, 19)));

// Find all occurrences
let matches = p.find_all("needle and needle");
assert_eq!(matches, vec![(0, 6), (11, 17)]);
```

### Multi-Pattern (Alternation)

```rust
// Fast multi-pattern search using aho-corasick
let keywords = Pattern::new("import|export|function|class").unwrap();
assert!(keywords.is_match("export default function"));
```

### Anchored Patterns

```rust
// Must start with pattern
let starts = Pattern::new("^Hello").unwrap();
assert!(starts.is_match("Hello World"));
assert!(!starts.is_match("Say Hello"));

// Must end with pattern
let ends = Pattern::new("World$").unwrap();
assert!(ends.is_match("Hello World"));
assert!(!ends.is_match("World Peace"));

// Exact match
let exact = Pattern::new("^exact$").unwrap();
assert!(exact.is_match("exact"));
assert!(!exact.is_match("not exact"));
```

### Cached API (Best for Repeated Patterns)

```rust
// First call compiles and caches
rexile::is_match("keyword", "find keyword here").unwrap();

// Subsequent calls reuse cached pattern (zero compile cost)
rexile::is_match("keyword", "another keyword").unwrap();
rexile::is_match("keyword", "more keyword text").unwrap();
```

**๐Ÿ“š More examples:** See [examples/](examples/) directory for:
- [`basic_usage.rs`]examples/basic_usage.rs - Core API walkthrough
- [`log_processing.rs`]examples/log_processing.rs - Log analysis patterns
- [`performance.rs`]examples/performance.rs - Performance comparison

Run examples with:
```bash
cargo run --example basic_usage
cargo run --example log_processing
```

## ๐Ÿ”ง Use Cases

ReXile is production-ready for:

### โœ… Ideal Use Cases
- **Parsers and lexers** - 21x faster pattern compilation, competitive matching
- **Rule engines** - Simple pattern matching in business rules (original use case!)
- **Log processing** - Fast keyword and pattern extraction
- **Dynamic patterns** - Applications that compile patterns at runtime
- **Memory-constrained environments** - 15x less compilation memory
- **Low-latency applications** - Predictable performance, no JIT warmup

### ๐ŸŽฏ Perfect Patterns for ReXile
- Digit extraction: `\d+` (3.57x faster!)
- Quoted strings: `"[^"]+"` (2.44x faster!)
- Identifiers: `[a-zA-Z_]\w*` (2520x faster than general matcher!)
- Word runs: `\w+`
- Keyword search: `rule\s+`, `function\s+`

### โš ๏ธ Consider regex crate for
- Complex alternations (ReXile 2x slower)
- Very sparse patterns (ReXile up to 1.44x slower)
- Unicode properties (`\p{L}` - not yet supported)
- Advanced features (lookahead, backreferences - not yet supported)

## ๐Ÿค Contributing

Contributions welcome! ReXile is actively maintained and evolving.

**Current focus:**
- โœ… Core regex features complete
- โœ… 10 fast path optimizations implemented
- โœ… Production-ready performance (1.03x aggregate vs regex)
- ๐Ÿ”„ Advanced features: bounded quantifiers `{n,m}`, capturing groups, lookahead

**How to contribute:**
1. Check [issues]https://github.com/KSD-CO/rexile/issues for open tasks
2. Run tests: `cargo test`
3. Run benchmarks: `cargo run --release --example per_file_grl_benchmark`
4. Submit PR with benchmarks showing performance impact

**Priority areas:**
- ๏ฟฝ Bounded quantifiers (`{n}`, `{n,m}`)
- ๐Ÿ“‹ Capturing group extraction
- ๐Ÿ“‹ More fast path patterns
- ๐Ÿ“‹ Unicode support
- ๐Ÿ“‹ Documentation improvements

## ๐Ÿ“œ License

Licensed under either of:

- MIT License ([LICENSE-MIT]LICENSE-MIT or http://opensource.org/licenses/MIT)
- Apache License, Version 2.0 ([LICENSE-APACHE]LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)

at your option.

## ๐Ÿ™ Credits

Built on top of:
- [`memchr`]https://docs.rs/memchr by Andrew Gallant - SIMD-accelerated substring search
- [`aho-corasick`]https://docs.rs/aho-corasick by Andrew Gallant - Multi-pattern matching automaton

Developed for the [rust-rule-engine](https://github.com/KSD-CO/rust-rule-engine) project, providing fast pattern matching for GRL (Grule Rule Language) parsing and business rule evaluation.

**Performance Philosophy:**
ReXile achieves competitive performance through **intelligent specialization** rather than complex JIT compilation:
- 10 hand-optimized fast paths for common patterns
- SIMD acceleration via memchr
- Pre-built automatons for alternations
- Zero-copy iterator design
- Minimal metadata overhead

---

**Status:** โœ… Production Ready (v0.1.0)

- โœ… **Performance:** 1.03x aggregate vs regex (within 3%)
- โœ… **Memory:** 15x less compilation, 5x less peak
- โœ… **Features:** All core regex features working
- โœ… **Testing:** 77 unit tests passing, comprehensive benchmarks
- โœ… **Real-world validated:** GRL parsing, rule engines, log processing