multimatch 0.2.0

Multi-pattern matching engine — Aho-Corasick + regex
Documentation
# Deep Audit: multimatch v0.1.1

**Auditor:** Kimi Code CLI  
**Date:** 2026-03-26  
**Scope:** Single-crate analysis of a multi-pattern matching engine wrapping Aho-Corasick + regex

---

## Executive Summary

`multimatch` is a thin abstraction over `aho-corasick` and `regex` crates. It provides a unified builder API for matching multiple literal and regex patterns simultaneously. The crate is **functional but over-promises** in documentation—two major advertised features (`Scanner` trait, Hyperscan SIMD backend) **do not exist in the code**.

**Verdict:** Acceptable for internal use with caveats. Not production-ready for untrusted inputs without additional hardening.

---

## 1. What Does This ACTUALLY Offer Over Raw aho-corasick?

### Real Value Adds

| Feature | Value | Assessment |
|---------|-------|------------|
| **Unified builder API** | Reduces boilerplate when mixing literals + regex | ✓ Legitimate convenience |
| **Separate AC automata for CI/exact** | Avoids mixed-case slowdown in Aho-Corasick | ✓ Smart optimization |
| **Pre-allocated result buffers** | Capacity hint based on pattern count | ✓ Minor perf win |
| **Convenience constructors** | `from_literals()`, `from_regexes()` | ✓ DX improvement |
| **Case-insensitive regex via `(?i-u)`** | Prefixes patterns automatically | ✓ Works as advertised |

### The Gap: Advertised vs. Actual

```rust
// README claims this trait exists:
pub trait Scanner {
    fn scan(&self, input: &[u8]) -> Vec<MatchResult>;
    fn is_match(&self, input: &[u8]) -> bool;
    fn pattern_count(&self) -> usize;
}
```

**REALITY:** This trait is documented in README but **does not exist in the codebase**. The `PatternSet` struct has these methods, but there's no trait abstraction allowing backend swapping. The README's claim that you can "swap backends without changing calling code" is **false**—there is only one backend implemented.

### The Missing SIMD Backend

The README and crate description advertise:
> "Optional Hyperscan SIMD backend for 3-5x throughput"

**REALITY:** The `simd` feature flag exists in `Cargo.toml` and pulls in `hyperscan`, but **zero code uses it**. There is no `#[cfg(feature = "simd")]` conditional compilation. The Hyperscan dependency is completely dead code.

```toml
# Cargo.toml - feature defined but NEVER used in src/
[features]
simd = ["dep:hyperscan"]  # Dead feature
```

### Code Size Analysis

| Component | Lines | Purpose |
|-----------|-------|---------|
| `lib.rs` | 164 | Re-exports, error types, convenience fns |
| `engine.rs` | 394 | Core matching logic |
| `pattern.rs` | 346 | Builder API + PatternSet |
| `adversarial_tests.rs` | 855 | Test suite (largest file!) |

**Net new code:** ~900 lines of actual logic (excl. tests). This is a thin wrapper.

---

## 2. Are Regex Patterns Checked for Catastrophic Backtracking?

### Short Answer: No explicit checks

The crate relies entirely on the `regex` crate's built-in protections:

```rust
// engine.rs lines 78-84
let compiled = regex::bytes::RegexBuilder::new(&pattern_str)
    .size_limit(REGEX_SIZE_LIMIT)  // 1MB total
    .build()
    .map_err(|e| MatchError::InvalidRegex { ... })?;
```

### What Protections Exist

1. **Size limits:** `REGEX_SIZE_LIMIT = 1MB` (total NFA/DFA size)
2. **Per-pattern limits:** `MAX_REGEX_PATTERN_BYTES = 4KB`
3. **Pattern count limits:** `MAX_REGEX_PATTERNS = 2_048`
4. **Input size limits:** `MAX_REGEX_SCAN_BYTES = 8MB` (regex scanning skipped for larger inputs!)

### The 8MB Input Cutoff - CRITICAL

```rust
// engine.rs lines 179-197
if input.len() <= MAX_REGEX_SCAN_BYTES {  // 8MB
    if let Some(set) = &self.regex_set {
        // ... regex scanning happens
    }
}
// If input > 8MB, regex patterns SILENTLY don't run!
```

**BUG:** For inputs >8MB, regex patterns are **silently ignored**. This is a correctness issue, not just a performance optimization. The crate returns literal matches only, with no warning that regex patterns were skipped.

### Catastrophic Backtracking Test Results

The adversarial test suite includes:

```rust
#[test]
fn adversarial_regex_catastrophic_backtracking() {
    let ps = PatternSet::builder()
        .add_regex(r"a+b", 0)
        .build()
        .unwrap();
    
    let input = b"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac";  // 30+ 'a's, no match
    let start = std::time::Instant::now();
    let matches = ps.scan(input);
    let elapsed = start.elapsed();
    
    assert!(elapsed.as_secs() < 60, "Should not hang");
    assert!(matches.is_empty());
}
```

This passes because the `regex` crate uses **bounded backtracking** and finite automata. However, the crate doesn't proactively detect or warn about patterns like `(a+)+` that could cause issues in other engines.

### Recommendation

Add regex pattern static analysis or timeout mechanisms for untrusted pattern sources. The current protections are passive, not active.

---

## 3. Does the Scanner Trait Abstraction Help or Hinder?

### It Doesn't Exist

The `Scanner` trait documented in the README is **not implemented**. This is a documentation bug, not an API issue.

### What Actually Exists

```rust
// pattern.rs lines 35-57
pub struct PatternSet {
    engine: MatchEngine,
    pattern_count: usize,
    max_matches: usize,
}

impl PatternSet {
    pub fn scan(&self, input: &[u8]) -> Vec<crate::MatchResult> { ... }
    pub fn is_match(&self, input: &[u8]) -> bool { ... }
    pub fn pattern_count(&self) -> usize { ... }
}
```

This is a **concrete type**, not a trait. You cannot:
- Swap to a Hyperscan backend (doesn't exist)
- Implement your own scanner backend
- Write generic code over scanners

### Would a Trait Help?

If implemented properly, a `Scanner` trait would enable:
- Mock scanners for testing
- Different backend implementations ( Hyperscan, PCRE2, etc.)
- Generic algorithms over scanners

**Current status:** Hinders. The false advertising creates API expectations that aren't met.

---

## 4. Edge Cases Analysis

### 4.1 Empty Patterns

**Empty literal:**
```rust
// pattern.rs test (line 236-240)
let ps = PatternSet::builder().add_literal("", 0).build().unwrap();
let res = ps.scan_str("test");
assert_eq!(res.len(), 5);  // Matches at every position!
```

✓ **Correctly handled:** Empty literal matches at every position (including before/after each character).

**Empty regex:**
```rust
// adversarial_tests.rs lines 16-30
let ps = PatternSet::builder().add_regex("", 0).build().unwrap();
let matches = ps.scan(b"abc");
assert_eq!(matches.len(), 4);  // Positions 0,1,2,3
```

✓ **Correctly handled:** Empty regex matches at every position (standard regex behavior).

### 4.2 Overlapping Matches

```rust
// engine.rs lines 146-159
for mat in ac.find_overlapping_iter(input) {  // ← OVERLAPPING enabled
    if results.len() >= limit { break; }
    // ...
}
```

✓ **Correctly handled:** Uses `find_overlapping_iter` for Aho-Corasick. Regex patterns use `find_iter` which finds non-overlapping matches (correct regex semantics).

**Test coverage:**
```rust
// adversarial_tests.rs lines 482-500
// 100 patterns all matching "a", input "aaa" = 300 matches
assert_eq!(matches.len(), 300);
```

### 4.3 Huge Inputs

| Scenario | Behavior | Assessment |
|----------|----------|------------|
| Aho-Corasick on 10MB+ | Works, finds all matches | ✓ Linear time, bounded memory |
| Regex on 8MB+ | **SILENTLY SKIPPED** | ✗ CRITICAL BUG |
| 10MB input, 1-char literal | Returns 10M matches | ⚠️ Memory bomb risk |

**The 10MB match explosion test:**
```rust
// adversarial_tests.rs lines 102-119
let input = vec![b'a'; 10 * 1024 * 1024];
let ps = PatternSet::builder().add_literal("a", 0).build().unwrap();
let matches = ps.scan(&input);
assert_eq!(matches.len(), input.len());  // 10,485,760 matches!
```

**Memory impact:** Each `MatchResult` is 24 bytes (3 × usize). 10M matches = **240MB allocated**.

### 4.4 Null Byte Handling

```rust
// adversarial_tests.rs lines 274-286
let ps = PatternSet::builder()
    .add_literal("pass\0word", 0)
    .build()
    .unwrap();

let matches = ps.scan(b"my pass\0word is secret");
assert_eq!(matches.len(), 1);  // ✓ Correct
```

✓ **Correctly handled:** Null bytes work in both patterns and input (Rust `&[u8]` handles this naturally).

### 4.5 Unicode/UTF-8

```rust
// adversarial_tests.rs lines 533-548
let ps = PatternSet::builder()
    .add_literal("日本", 0)  // 6 bytes
    .build()
    .unwrap();

let input = "私は日本語を話します".as_bytes();
assert_eq!(matches[0].start, 6);  // Byte offset, not char offset
assert_eq!(matches[0].end, 12);
```

✓ **Correctly handled:** Byte offsets (not character offsets) returned. Consistent with `regex` crate behavior.

**Turkish I problem documented:**
```rust
// adversarial_tests.rs lines 333-358
// 'i' does NOT match 'İ' with simple case folding
// This is acknowledged behavior, not a bug
```

### 4.6 Invalid UTF-8

```rust
// adversarial_tests.rs lines 551-572
let input = b"\x80abc\xff\xfe";  // Invalid UTF-8
let matches = ps.scan(input);
// Should still find "abc"
```

✓ **Correctly handled:** The `regex` crate's bytes API handles invalid UTF-8 gracefully.

---

## 5. Security Assessment

### Denial of Service Vectors

| Vector | Severity | Notes |
|--------|----------|-------|
| Regex backtracking | Low | `regex` crate uses bounded algorithms |
| Match result explosion | Medium | 10M+ matches can exhaust memory |
| Silent regex skip on large inputs | **High** | Inputs >8MB bypass regex patterns |
| Pattern compilation limits | Low | Size/count limits prevent worst cases |

### Memory Safety

- `#![forbid(unsafe_code)]` declared in `lib.rs`- No `unsafe` blocks found in codebase ✓
- Relies on safe `aho-corasick` and `regex` crates ✓

---

## 6. Recommendations

### Critical (Fix Immediately)

1. **Fix the 8MB silent skip bug:** Either:
   - Remove the limit and document performance characteristics
   - Return an error for inputs exceeding the limit
   - Add a `scan_partial()` method that documents the limit

2. **Remove or implement dead features:**
   - Remove Hyperscan from `Cargo.toml` OR implement it
   - Remove `Scanner` trait from README OR implement it

### High Priority

3. **Add match result streaming:** Instead of `Vec<MatchResult>`, provide an iterator API to handle high-match scenarios without unbounded memory growth.

4. **Document the regex size limit:** The 8MB cutoff is undocumented behavior.

### Medium Priority

5. **Add regex pattern analysis:** Warn users about patterns known to cause performance issues.

6. **Implement the `Scanner` trait:** If backend swapping is a goal, actually implement the abstraction.

---

## 7. Conclusion

`multimatch` is a **minimal viable wrapper** over proven crates. The core logic is sound, but:

- **Documentation is misleading** (non-existent features advertised)
- **The 8MB silent skip is a correctness bug**
- **No protection against match explosions**

For internal tools with trusted patterns and known input sizes: **acceptable**.

For security tools processing untrusted input: **needs hardening** before production use.

---

*Audit generated by automated analysis of source code and test suite.*