# Deep Audit: multimatch v0.1.1
**Auditor:** Kimi Code CLI
**Date:** 2026-03-26
**Scope:** Single-crate analysis of a multi-pattern matching engine wrapping Aho-Corasick + regex
---
## Executive Summary
`multimatch` is a thin abstraction over `aho-corasick` and `regex` crates. It provides a unified builder API for matching multiple literal and regex patterns simultaneously. The crate is **functional but over-promises** in documentation—two major advertised features (`Scanner` trait, Hyperscan SIMD backend) **do not exist in the code**.
**Verdict:** Acceptable for internal use with caveats. Not production-ready for untrusted inputs without additional hardening.
---
## 1. What Does This ACTUALLY Offer Over Raw aho-corasick?
### Real Value Adds
| **Unified builder API** | Reduces boilerplate when mixing literals + regex | ✓ Legitimate convenience |
| **Separate AC automata for CI/exact** | Avoids mixed-case slowdown in Aho-Corasick | ✓ Smart optimization |
| **Pre-allocated result buffers** | Capacity hint based on pattern count | ✓ Minor perf win |
| **Convenience constructors** | `from_literals()`, `from_regexes()` | ✓ DX improvement |
| **Case-insensitive regex via `(?i-u)`** | Prefixes patterns automatically | ✓ Works as advertised |
### The Gap: Advertised vs. Actual
```rust
// README claims this trait exists:
pub trait Scanner {
fn scan(&self, input: &[u8]) -> Vec<MatchResult>;
fn is_match(&self, input: &[u8]) -> bool;
fn pattern_count(&self) -> usize;
}
```
**REALITY:** This trait is documented in README but **does not exist in the codebase**. The `PatternSet` struct has these methods, but there's no trait abstraction allowing backend swapping. The README's claim that you can "swap backends without changing calling code" is **false**—there is only one backend implemented.
### The Missing SIMD Backend
The README and crate description advertise:
> "Optional Hyperscan SIMD backend for 3-5x throughput"
**REALITY:** The `simd` feature flag exists in `Cargo.toml` and pulls in `hyperscan`, but **zero code uses it**. There is no `#[cfg(feature = "simd")]` conditional compilation. The Hyperscan dependency is completely dead code.
```toml
# Cargo.toml - feature defined but NEVER used in src/
[features]
simd = ["dep:hyperscan"] # Dead feature
```
### Code Size Analysis
| `lib.rs` | 164 | Re-exports, error types, convenience fns |
| `engine.rs` | 394 | Core matching logic |
| `pattern.rs` | 346 | Builder API + PatternSet |
| `adversarial_tests.rs` | 855 | Test suite (largest file!) |
**Net new code:** ~900 lines of actual logic (excl. tests). This is a thin wrapper.
---
## 2. Are Regex Patterns Checked for Catastrophic Backtracking?
### Short Answer: No explicit checks
The crate relies entirely on the `regex` crate's built-in protections:
```rust
// engine.rs lines 78-84
let compiled = regex::bytes::RegexBuilder::new(&pattern_str)
.size_limit(REGEX_SIZE_LIMIT) // 1MB total
.build()
.map_err(|e| MatchError::InvalidRegex { ... })?;
```
### What Protections Exist
1. **Size limits:** `REGEX_SIZE_LIMIT = 1MB` (total NFA/DFA size)
2. **Per-pattern limits:** `MAX_REGEX_PATTERN_BYTES = 4KB`
3. **Pattern count limits:** `MAX_REGEX_PATTERNS = 2_048`
4. **Input size limits:** `MAX_REGEX_SCAN_BYTES = 8MB` (regex scanning skipped for larger inputs!)
### The 8MB Input Cutoff - CRITICAL
```rust
// engine.rs lines 179-197
if input.len() <= MAX_REGEX_SCAN_BYTES { // 8MB
if let Some(set) = &self.regex_set {
// ... regex scanning happens
}
}
// If input > 8MB, regex patterns SILENTLY don't run!
```
**BUG:** For inputs >8MB, regex patterns are **silently ignored**. This is a correctness issue, not just a performance optimization. The crate returns literal matches only, with no warning that regex patterns were skipped.
### Catastrophic Backtracking Test Results
The adversarial test suite includes:
```rust
#[test]
fn adversarial_regex_catastrophic_backtracking() {
let ps = PatternSet::builder()
.add_regex(r"a+b", 0)
.build()
.unwrap();
let input = b"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac"; // 30+ 'a's, no match
let start = std::time::Instant::now();
let matches = ps.scan(input);
let elapsed = start.elapsed();
assert!(elapsed.as_secs() < 60, "Should not hang");
assert!(matches.is_empty());
}
```
This passes because the `regex` crate uses **bounded backtracking** and finite automata. However, the crate doesn't proactively detect or warn about patterns like `(a+)+` that could cause issues in other engines.
### Recommendation
Add regex pattern static analysis or timeout mechanisms for untrusted pattern sources. The current protections are passive, not active.
---
## 3. Does the Scanner Trait Abstraction Help or Hinder?
### It Doesn't Exist
The `Scanner` trait documented in the README is **not implemented**. This is a documentation bug, not an API issue.
### What Actually Exists
```rust
// pattern.rs lines 35-57
pub struct PatternSet {
engine: MatchEngine,
pattern_count: usize,
max_matches: usize,
}
impl PatternSet {
pub fn scan(&self, input: &[u8]) -> Vec<crate::MatchResult> { ... }
pub fn is_match(&self, input: &[u8]) -> bool { ... }
pub fn pattern_count(&self) -> usize { ... }
}
```
This is a **concrete type**, not a trait. You cannot:
- Swap to a Hyperscan backend (doesn't exist)
- Implement your own scanner backend
- Write generic code over scanners
### Would a Trait Help?
If implemented properly, a `Scanner` trait would enable:
- Mock scanners for testing
- Different backend implementations ( Hyperscan, PCRE2, etc.)
- Generic algorithms over scanners
**Current status:** Hinders. The false advertising creates API expectations that aren't met.
---
## 4. Edge Cases Analysis
### 4.1 Empty Patterns
**Empty literal:**
```rust
// pattern.rs test (line 236-240)
let ps = PatternSet::builder().add_literal("", 0).build().unwrap();
let res = ps.scan_str("test");
assert_eq!(res.len(), 5); // Matches at every position!
```
✓ **Correctly handled:** Empty literal matches at every position (including before/after each character).
**Empty regex:**
```rust
// adversarial_tests.rs lines 16-30
let ps = PatternSet::builder().add_regex("", 0).build().unwrap();
let matches = ps.scan(b"abc");
assert_eq!(matches.len(), 4); // Positions 0,1,2,3
```
✓ **Correctly handled:** Empty regex matches at every position (standard regex behavior).
### 4.2 Overlapping Matches
```rust
// engine.rs lines 146-159
for mat in ac.find_overlapping_iter(input) { // ← OVERLAPPING enabled
if results.len() >= limit { break; }
// ...
}
```
✓ **Correctly handled:** Uses `find_overlapping_iter` for Aho-Corasick. Regex patterns use `find_iter` which finds non-overlapping matches (correct regex semantics).
**Test coverage:**
```rust
// adversarial_tests.rs lines 482-500
// 100 patterns all matching "a", input "aaa" = 300 matches
assert_eq!(matches.len(), 300);
```
### 4.3 Huge Inputs
| Aho-Corasick on 10MB+ | Works, finds all matches | ✓ Linear time, bounded memory |
| Regex on 8MB+ | **SILENTLY SKIPPED** | ✗ CRITICAL BUG |
| 10MB input, 1-char literal | Returns 10M matches | ⚠️ Memory bomb risk |
**The 10MB match explosion test:**
```rust
// adversarial_tests.rs lines 102-119
let input = vec![b'a'; 10 * 1024 * 1024];
let ps = PatternSet::builder().add_literal("a", 0).build().unwrap();
let matches = ps.scan(&input);
assert_eq!(matches.len(), input.len()); // 10,485,760 matches!
```
**Memory impact:** Each `MatchResult` is 24 bytes (3 × usize). 10M matches = **240MB allocated**.
### 4.4 Null Byte Handling
```rust
// adversarial_tests.rs lines 274-286
let ps = PatternSet::builder()
.add_literal("pass\0word", 0)
.build()
.unwrap();
let matches = ps.scan(b"my pass\0word is secret");
assert_eq!(matches.len(), 1); // ✓ Correct
```
✓ **Correctly handled:** Null bytes work in both patterns and input (Rust `&[u8]` handles this naturally).
### 4.5 Unicode/UTF-8
```rust
// adversarial_tests.rs lines 533-548
let ps = PatternSet::builder()
.add_literal("日本", 0) // 6 bytes
.build()
.unwrap();
let input = "私は日本語を話します".as_bytes();
assert_eq!(matches[0].start, 6); // Byte offset, not char offset
assert_eq!(matches[0].end, 12);
```
✓ **Correctly handled:** Byte offsets (not character offsets) returned. Consistent with `regex` crate behavior.
**Turkish I problem documented:**
```rust
// adversarial_tests.rs lines 333-358
// 'i' does NOT match 'İ' with simple case folding
// This is acknowledged behavior, not a bug
```
### 4.6 Invalid UTF-8
```rust
// adversarial_tests.rs lines 551-572
let input = b"\x80abc\xff\xfe"; // Invalid UTF-8
let matches = ps.scan(input);
// Should still find "abc"
```
✓ **Correctly handled:** The `regex` crate's bytes API handles invalid UTF-8 gracefully.
---
## 5. Security Assessment
### Denial of Service Vectors
| Regex backtracking | Low | `regex` crate uses bounded algorithms |
| Match result explosion | Medium | 10M+ matches can exhaust memory |
| Silent regex skip on large inputs | **High** | Inputs >8MB bypass regex patterns |
| Pattern compilation limits | Low | Size/count limits prevent worst cases |
### Memory Safety
- `#![forbid(unsafe_code)]` declared in `lib.rs` ✓
- No `unsafe` blocks found in codebase ✓
- Relies on safe `aho-corasick` and `regex` crates ✓
---
## 6. Recommendations
### Critical (Fix Immediately)
1. **Fix the 8MB silent skip bug:** Either:
- Remove the limit and document performance characteristics
- Return an error for inputs exceeding the limit
- Add a `scan_partial()` method that documents the limit
2. **Remove or implement dead features:**
- Remove Hyperscan from `Cargo.toml` OR implement it
- Remove `Scanner` trait from README OR implement it
### High Priority
3. **Add match result streaming:** Instead of `Vec<MatchResult>`, provide an iterator API to handle high-match scenarios without unbounded memory growth.
4. **Document the regex size limit:** The 8MB cutoff is undocumented behavior.
### Medium Priority
5. **Add regex pattern analysis:** Warn users about patterns known to cause performance issues.
6. **Implement the `Scanner` trait:** If backend swapping is a goal, actually implement the abstraction.
---
## 7. Conclusion
`multimatch` is a **minimal viable wrapper** over proven crates. The core logic is sound, but:
- **Documentation is misleading** (non-existent features advertised)
- **The 8MB silent skip is a correctness bug**
- **No protection against match explosions**
For internal tools with trusted patterns and known input sizes: **acceptable**.
For security tools processing untrusted input: **needs hardening** before production use.
---
*Audit generated by automated analysis of source code and test suite.*