simdsieve
A SIMD-accelerated multi-pattern pre-filtering engine with zero-allocation streaming iteration. Scans byte haystacks for up to 16 patterns simultaneously using the widest SIMD instructions available.
What It Does
simdsieve is a first-pass candidate filter for substring search:
- Prefix Extraction: Takes the first 1-4 bytes of each pattern
- SIMD Filtering: Scans haystack in 32-128 byte blocks, finding positions where prefixes match using vectorized comparison
- Verification: Confirms full pattern matches before yielding
- Zero False Positives: Every yielded offset is a verified match start
Use this when you need to quickly filter large data streams for multiple patterns before applying more expensive analysis.
Architecture
| Backend | Instruction Set | Block Size | Throughput* | Registers |
|---|---|---|---|---|
| AVX-512 | AVX-512F + AVX-512BW | 128 bytes | >50 GB/s | 512-bit |
| AVX2 | AVX2 | 64 bytes | >25 GB/s | 256-bit |
| NEON | AArch64 NEON | 64 bytes | >15 GB/s | 128-bit |
| Scalar | Portable Rust | 64 bytes | >2 GB/s | 64-bit |
* Single-byte pattern on modern x86_64. Multi-byte patterns scale linearly with prefix length.
Backend Selection
The optimal backend is auto-selected at construction time:
- AVX-512: If
avx512fandavx512bwCPU features are detected - AVX2: If
avx2is detected - NEON: On
aarch64targets (guaranteed by the architecture) - Scalar: Portable fallback for all other platforms
Dual-Pump Processing
AVX-512, AVX2, and NEON use "dual-pump" processing where each logical block is split into two halves processed together. This hides load latency by interleaving independent operations across both load ports.
Design Limits
- 16 patterns maximum: Unrolled register layout saturates execution ports
- 4-byte prefix: First 4 bytes used for SIMD filtering; full pattern for verification
- ASCII case-insensitive: Only
a-zfolded; non-ASCII compared verbatim
Usage
Add to Cargo.toml:
[]
= "0.1"
Basic exact matching:
use SimdSieve;
let haystack = b"The quick brown fox jumps over the lazy dog";
let patterns = &;
let sieve = new.unwrap;
let matches: = sieve.collect;
assert_eq!; // Positions of "fox" and "dog"
Case-insensitive matching:
use SimdSieve;
let haystack = b"Hello World HELLO";
let patterns = &;
let sieve = new_case_insensitive.unwrap;
let matches: = sieve.collect;
assert_eq!; // Matches "Hello" and "HELLO"
Density estimation (without full verification):
use SimdSieve;
let haystack = b"aaaaaa";
let patterns = &;
let count = estimate_match_count;
assert_eq!; // One prefix hit at each position
Iterator Behavior
SimdSieve implements Iterator<Item = usize> and FusedIterator:
- Yields match offsets in ascending order
- After returning
None, all subsequent calls returnNone - Zero heap allocation during iteration
- Thread-safe:
SendandSync(immutable iterator)
Error Handling
Construction can fail with SimdSieveError:
EmptyPatternSet: No patterns providedPatternLimitExceeded(usize): More than 16 patterns provided
use ;
let result = new;
assert!;
Platform Support
| Platform | Backend | Notes |
|---|---|---|
| x86_64 | AVX-512, AVX2, Scalar | Runtime feature detection |
| aarch64 | NEON, Scalar | NEON guaranteed on AArch64 |
| wasm32 | Scalar | No SIMD backend yet |
| other | Scalar | Portable fallback |
Safety
simdsieve contains a small amount of unsafe code confined to
platform-specific backend modules. Every unsafe block has been audited for
soundness:
- SIMD loads use unaligned load intrinsics (
loadu), safe for any valid pointer alignment. - Block bounds are verified before each SIMD call; multi-byte prefixes
require
block_len >= block_size + max_prefix_len - 1trailing bytes. - Target features are detected at construction;
#[target_feature]gates prevent executing unsupported instructions. - No uninitialized memory is read; SIMD struct arrays are zero-initialized.
unsafe_op_in_unsafe_fnis forbidden at the crate level.
Performance Tuning
When to Use This Crate
✅ Good for:
- Filtering large streams (>1MB) for multiple patterns
- Finding candidate positions for regex engines
- Log analysis with multiple fixed-string patterns
- Network packet inspection with signature matching
❌ Not ideal for:
- Single small haystacks (<1KB)
- Patterns longer than 4 bytes that differ only after position 4
- When you need capture groups or regex features
Prefetching
The engine automatically issues prefetch hints 512 bytes ahead. This value (8 cache lines) balances:
- Enough lookahead to hide memory latency on modern CPUs
- Not so much that we evict useful data from cache
Pattern Selection
For best performance:
- Use patterns with distinct first bytes when possible
- Place more frequent patterns earlier in the pattern list
- For case-insensitive mode, use lowercase in patterns
Benchmarks
Run benchmarks with:
Example throughput on AMD EPYC 9R14 (AVX-512):
| Haystack | Patterns | Throughput |
|---|---|---|
| 1 MB | 1 | ~55 GB/s |
| 1 MB | 4 | ~45 GB/s |
| 1 MB | 16 | ~30 GB/s |
License
Licensed under the MIT License. See LICENSE for details.
Contributing
Contributions welcome! Areas of interest:
- WebAssembly SIMD backend
- Additional benchmarks and fuzz targets
- New verification strategies
Please ensure:
cargo testpassescargo clippy -- -D warningsis cleancargo fmt --checkpasses- New code has module-level and item-level documentation