# hyperscan-tokio: Modern Async VectorScan Bindings
## 🏗️ Architecture Overview
### Core Design Principles
1. **Async-First**: Built on Tokio with zero-cost abstractions
2. **Zero-Copy**: Minimize allocations with `Bytes`, `&[u8]`, and Arrow integration
3. **Hot-Reloadable**: Pattern databases can be swapped at runtime without downtime
4. **NUMA-Aware**: Per-core worker pools with affinity pinning
5. **Type-Safe**: Leverage Rust's type system for compile-time guarantees
### Component Architecture
```
┌─────────────────────────────────────────────────────────┐
│ User API Layer │
│ (Builder Pattern, Async Interfaces, Error Types) │
├─────────────────────────────────────────────────────────┤
│ Worker Pool Layer │
│ (Core-Pinned Workers, Work Stealing, Backpressure) │
├─────────────────────────────────────────────────────────┤
│ Pattern DB Layer │
│ (Hot Reload, Serialization, Version Management) │
├─────────────────────────────────────────────────────────┤
│ FFI Safety Layer │
│ (Safe Wrappers, Lifetime Management, Panic Guards) │
├─────────────────────────────────────────────────────────┤
│ VectorScan C++ Library │
└─────────────────────────────────────────────────────────┘
```
## 📁 Project Structure
```
hyperscan-tokio/
├── hyperscan-tokio/  # Main crate
│   ├── src/
│   │   ├── lib.rs  # Public API exports
│   │   ├── builder.rs  # Pattern & DB builders
│   │   ├── scanner.rs  # Core scanning interfaces
│   │   ├── stream.rs  # Streaming scan support
│   │   ├── worker_pool.rs  # Core-affinity worker pools
│   │   ├── database.rs  # Pattern DB management
│   │   ├── error.rs  # Comprehensive error types
│   │   ├── zero_copy.rs  # Zero-copy interfaces
│   │   └── async_scan.rs  # Tokio async wrappers
│   ├── benches/
│   │   ├── throughput.rs  # EPS benchmarks
│   │   ├── latency.rs  # p99 latency tests
│   │   └── comparison.rs  # vs RE2, PCRE2, regex
│   └── examples/
│       ├── basic_scan.rs
│       ├── hot_reload.rs
│       ├── worker_pool.rs
│       └── arrow_scan.rs
├── hyperscan-tokio-sys/  # Low-level FFI bindings
│   ├── src/
│   │   ├── lib.rs
│   │   └── bindings.rs  # Generated bindings
│   ├── build.rs  # VectorScan build script
│   └── vectorscan/  # Vendored VectorScan source
├── hyperscan-tokio-macros/  # Procedural macros (optional)
│   └── src/
│       └── lib.rs  # Pattern compile-time validation
└── integration-tests/  # End-to-end tests
    ├── tests/
    └── fixtures/
```
## 🔧 Key Components Design
### 1. FFI Layer (`hyperscan-tokio-sys`)
```rust
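use std::marker::PhantomData;
use std::ptr::NonNull;

// Safe abstraction over the VectorScan C API
pub struct CompiledDatabase {
    ptr: NonNull<hs_database_t>,
    _phantom: PhantomData<hs_database_t>,
}

// RAII pattern for scratch space
pub struct ScratchSpace {
    ptr: NonNull<hs_scratch_t>,
}

impl Drop for ScratchSpace {
    fn drop(&mut self) {
        // SAFETY: `ptr` must have been allocated by `hs_alloc_scratch` and is
        // freed exactly once here. `hs_free_scratch` returns an `hs_error_t`;
        // `drop` cannot propagate it, so the status is discarded.
        let _ = unsafe { hs_free_scratch(self.ptr.as_ptr()) };
    }
}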
```
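The RAII idea above can be tried without linking VectorScan. The following is a minimal sketch, assuming hypothetical stand-ins (`RawScratch`, `mock_alloc_scratch`, `mock_free_scratch`, `LIVE_SCRATCH`, `ScratchGuard`) in place of the real bindings; it only demonstrates that a `Drop` impl releases the resource exactly once when the guard leaves scope.

```rust
use std::ptr::NonNull;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical stand-in for the opaque C scratch type.
pub struct RawScratch {
    _bytes: [u8; 64],
}

/// Number of scratch regions currently allocated (demonstration only).
pub static LIVE_SCRATCH: AtomicUsize = AtomicUsize::new(0);

/// Mock allocator standing in for the C allocation call.
unsafe fn mock_alloc_scratch() -> *mut RawScratch {
    LIVE_SCRATCH.fetch_add(1, Ordering::SeqCst);
    Box::into_raw(Box::new(RawScratch { _bytes: [0; 64] }))
}

/// Mock free function; returns 0 to mirror a C-style status code.
unsafe fn mock_free_scratch(ptr: *mut RawScratch) -> i32 {
    // Rebuild the Box so the memory is released.
    drop(unsafe { Box::from_raw(ptr) });
    LIVE_SCRATCH.fetch_sub(1, Ordering::SeqCst);
    0
}

/// Owns one scratch region and frees it on drop.
pub struct ScratchGuard {
    ptr: NonNull<RawScratch>,
}

impl ScratchGuard {
    pub fn new() -> Option<Self> {
        // SAFETY: the mock allocator returns a valid, uniquely owned pointer.
        let raw = unsafe { mock_alloc_scratch() };
        NonNull::new(raw).map(|ptr| ScratchGuard { ptr })
    }
}

impl Drop for ScratchGuard {
    fn drop(&mut self) {
        // SAFETY: `ptr` came from `mock_alloc_scratch` and is freed only here.
        let _ = unsafe { mock_free_scratch(self.ptr.as_ptr()) };
    }
}
```

Because the guard is not `Clone` and frees only in `drop`, a double free would require `unsafe` code outside the wrapper.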
### 2. Builder Pattern API
```rust
let db = DatabaseBuilder::new()
    .add_pattern(
        Pattern::new(r"\d{3}-\d{2}-\d{4}")
            .id(1)
            .flags(Flags::CASELESS | Flags::MULTILINE),
    )
    .add_pattern(Pattern::new(r"[A-Z]{2,4}").id(2))
    .mode(Mode::BLOCK)
    .platform(Platform::native())
    .build()?;
```
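To make the chained calls concrete, here is a minimal, std-only sketch of how such a consuming builder could be structured. The type names mirror the example, but the bodies, the `BuildError` enum, the flag bit values, and the choice to reject duplicate IDs are all assumptions made for illustration, not the crate's actual behavior. `build` returns the validated patterns where a real implementation would compile them.

```rust
use std::collections::HashSet;

/// Bit flags for a pattern; the bit values here are illustrative.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Flags(pub u32);

impl Flags {
    pub const NONE: Flags = Flags(0);
    pub const CASELESS: Flags = Flags(1);
    pub const MULTILINE: Flags = Flags(4);
}

impl std::ops::BitOr for Flags {
    type Output = Flags;
    fn bitor(self, rhs: Flags) -> Flags {
        Flags(self.0 | rhs.0)
    }
}

#[derive(Clone, Debug)]
pub struct Pattern {
    pub expr: String,
    pub id: u32,
    pub flags: Flags,
}

impl Pattern {
    pub fn new(expr: &str) -> Self {
        Pattern { expr: expr.to_string(), id: 0, flags: Flags::NONE }
    }
    /// Each setter takes `self` by value so calls can be chained.
    pub fn id(mut self, id: u32) -> Self {
        self.id = id;
        self
    }
    pub fn flags(mut self, flags: Flags) -> Self {
        self.flags = flags;
        self
    }
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum BuildError {
    NoPatterns,
    DuplicateId(u32),
}

#[derive(Default)]
pub struct DatabaseBuilder {
    patterns: Vec<Pattern>,
}

impl DatabaseBuilder {
    pub fn new() -> Self {
        Self::default()
    }
    pub fn add_pattern(mut self, pattern: Pattern) -> Self {
        self.patterns.push(pattern);
        self
    }
    /// Validates the collected patterns; a real build would compile them here.
    pub fn build(self) -> Result<Vec<Pattern>, BuildError> {
        if self.patterns.is_empty() {
            return Err(BuildError::NoPatterns);
        }
        let mut seen = HashSet::new();
        for p in &self.patterns {
            if !seen.insert(p.id) {
                return Err(BuildError::DuplicateId(p.id));
            }
        }
        Ok(self.patterns)
    }
}
```

Taking `self` by value in every setter means a half-configured builder cannot be reused by accident after `build` consumes it.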
### 3. Async Scanning Interface
```rust
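// Zero-copy async scanning
impl Scanner {
    pub async fn scan_bytes(&self, data: Bytes) -> Result<Vec<Match>> {
        let scanner = self.clone();
        // The scan is CPU-bound, so it runs on Tokio's blocking thread pool.
        // `Bytes` is reference-counted, so moving it into the closure copies no data.
        // `?` on the join result assumes the error type implements `From<JoinError>`.
        tokio::task::spawn_blocking(move || scanner.scan_sync(data.as_ref())).await?
    }

    pub async fn scan_stream<S>(&self, stream: S) -> Result<MatchStream>
    where
        S: Stream<Item = Result<Bytes>> + Send + 'static,
    {
        // Streaming implementation
        todo!()
    }
}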
```
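The offloading pattern in `scan_bytes` can be illustrated without Tokio. The sketch below is a std-only analog, not the crate's implementation: a plain thread and an `mpsc` channel stand in for `spawn_blocking` and its join handle, and `LiteralScanner` is a hypothetical substring matcher standing in for a compiled database. The input is shared through an `Arc<[u8]>`, so handing it to the worker thread copies no payload bytes.

```rust
use std::sync::mpsc;
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for a compiled scanner: finds every offset where `needle` starts.
#[derive(Clone)]
pub struct LiteralScanner {
    needle: Arc<[u8]>,
}

impl LiteralScanner {
    pub fn new(needle: &[u8]) -> Self {
        LiteralScanner { needle: Arc::from(needle) }
    }

    /// Synchronous, CPU-bound scan: the part that should stay off executor threads.
    pub fn scan_sync(&self, data: &[u8]) -> Vec<usize> {
        if self.needle.is_empty() || data.len() < self.needle.len() {
            return Vec::new();
        }
        data.windows(self.needle.len())
            .enumerate()
            .filter(|(_, w)| *w == &self.needle[..])
            .map(|(i, _)| i)
            .collect()
    }

    /// Runs the scan on a separate thread and returns a receiver for the result.
    pub fn scan_offloaded(&self, data: Arc<[u8]>) -> mpsc::Receiver<Vec<usize>> {
        let (tx, rx) = mpsc::channel();
        let scanner = self.clone();
        thread::spawn(move || {
            // The receiver may have been dropped; ignore the send error in that case.
            let _ = tx.send(scanner.scan_sync(&data));
        });
        rx
    }
}
```

Cloning the scanner before moving it into the closure mirrors `let scanner = self.clone()` above: the closure must own everything it touches because it may outlive the caller's borrow.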
### 4. Hot-Reloadable Pattern Database
```rust
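pub struct ReloadableDatabase {
    // An async lock (e.g. `tokio::sync::RwLock`): guards are acquired with `.await`
    current: Arc<RwLock<Arc<Database>>>,
    reload_notify: Arc<Notify>,
}

impl ReloadableDatabase {
    pub async fn reload(&self, new_db: Database) -> Result<()> {
        let new_arc = Arc::new(new_db);
        {
            // Hold the write lock only long enough to swap the pointer
            let mut write_guard = self.current.write().await;
            *write_guard = new_arc;
        }
        self.reload_notify.notify_waiters();
        Ok(())
    }

    pub async fn scanner(&self) -> Scanner {
        // Cloning the inner `Arc` pins the current database for this scanner;
        // scanners created earlier keep the database they cloned
        let db = self.current.read().await.clone();
        Scanner::new(db)
    }
}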
```
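The key property of this design is that a reader's snapshot survives a reload. A minimal std-only sketch, assuming a hypothetical `Db` type and using `std::sync::RwLock` in place of an async lock, shows the swap and the snapshot semantics:

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for a compiled pattern database.
#[derive(Debug)]
pub struct Db {
    pub version: u64,
    pub needle: Vec<u8>,
}

pub struct Reloadable {
    current: RwLock<Arc<Db>>,
}

impl Reloadable {
    pub fn new(db: Db) -> Self {
        Reloadable { current: RwLock::new(Arc::new(db)) }
    }

    /// Swap in a new database; the write lock covers only the pointer store.
    pub fn reload(&self, db: Db) {
        let new_arc = Arc::new(db);
        *self.current.write().unwrap() = new_arc;
    }

    /// Take a snapshot; it stays valid even if `reload` runs afterwards.
    pub fn snapshot(&self) -> Arc<Db> {
        self.current.read().unwrap().clone()
    }
}
```

A scan that started against version 1 finishes against version 1; the old database is freed when the last snapshot holding it is dropped.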
### 5. Worker Pool with Core Affinity
```rust
pub struct WorkerPool {
    workers: Vec<Worker>,
    // Lock-free queues (`crossbeam::queue::SegQueue`) shared by all workers
    work_queue: Arc<SegQueue<ScanJob>>,
    results: Arc<SegQueue<ScanResult>>,
}

struct Worker {
    id: usize,
    // Core this worker is pinned to (`core_affinity::CoreId`)
    core_id: CoreId,
    handle: JoinHandle<()>,
}

impl WorkerPool {
    pub fn builder() -> WorkerPoolBuilder {
        WorkerPoolBuilder::default()
    }

    pub async fn scan_batch(&self, jobs: Vec<ScanJob>) -> Vec<ScanResult> {
        // Distribute work across cores
        todo!()
    }
}
```
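As a rough illustration of distributing a batch across workers, the sketch below uses only the standard library: an `mpsc` channel behind a `Mutex` replaces the `SegQueue`, a substring count replaces the real scan, and core pinning is omitted. The free function `scan_batch` and the fields of `ScanJob` and `ScanResult` are assumptions for this example.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

pub struct ScanJob {
    pub id: usize,
    pub data: Vec<u8>,
}

#[derive(Debug, PartialEq, Eq)]
pub struct ScanResult {
    pub job_id: usize,
    pub matches: usize,
}

/// Counts occurrences of `needle` in each job using `workers` threads.
pub fn scan_batch(needle: &[u8], jobs: Vec<ScanJob>, workers: usize) -> Vec<ScanResult> {
    let needle: Arc<[u8]> = Arc::from(needle);
    let total = jobs.len();
    let (job_tx, job_rx) = mpsc::channel::<ScanJob>();
    let job_rx = Arc::new(Mutex::new(job_rx));
    let (res_tx, res_rx) = mpsc::channel::<ScanResult>();

    let handles: Vec<_> = (0..workers.max(1))
        .map(|_| {
            let job_rx = Arc::clone(&job_rx);
            let res_tx = res_tx.clone();
            let needle = Arc::clone(&needle);
            thread::spawn(move || loop {
                // The lock is held only while pulling the next job, not while scanning.
                let next = job_rx.lock().unwrap().recv();
                let job = match next {
                    Ok(job) => job,
                    Err(_) => break, // queue closed and drained
                };
                let matches = if needle.is_empty() {
                    0
                } else {
                    job.data.windows(needle.len()).filter(|w| *w == &needle[..]).count()
                };
                let _ = res_tx.send(ScanResult { job_id: job.id, matches });
            })
        })
        .collect();

    for job in jobs {
        job_tx.send(job).unwrap();
    }
    drop(job_tx); // closing the queue lets idle workers exit
    drop(res_tx);

    let mut results: Vec<ScanResult> = res_rx.iter().take(total).collect();
    for handle in handles {
        handle.join().unwrap();
    }
    // Workers finish in arbitrary order; restore submission order.
    results.sort_by_key(|r| r.job_id);
    results
}
```

A shared queue gives simple load balancing because idle workers pull the next job; pinning each thread to a core would be an additional step on top of this.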
### 6. Zero-Copy Interfaces
```rust
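// Support multiple input types without allocation
pub trait ScanInput {
    fn as_bytes(&self) -> &[u8];
}

impl ScanInput for &[u8] {
    fn as_bytes(&self) -> &[u8] { self }
}
impl ScanInput for Bytes {
    fn as_bytes(&self) -> &[u8] { self.as_ref() }
}
impl ScanInput for BytesMut {
    fn as_bytes(&self) -> &[u8] { self.as_ref() }
}
impl ScanInput for &str {
    // Call the inherent `str` method explicitly; `self.as_bytes()` would recurse
    fn as_bytes(&self) -> &[u8] { str::as_bytes(self) }
}

// Arrow integration
#[cfg(feature = "arrow")]
impl Scanner {
    pub async fn scan_record_batch(
        &self,
        batch: &RecordBatch,
        column: &str,
    ) -> Result<Vec<(usize, Vec<Match>)>> {
        // Scan Arrow string/binary arrays
        todo!()
    }
}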
```
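A short sketch of how a generic consumer might use such a trait follows. It is std-only, so `Vec<u8>` stands in for `Bytes`, and `count_byte` is a hypothetical function added for illustration. The point is that the borrowed slice aliases the caller's buffer instead of copying it.

```rust
pub trait ScanInput {
    fn as_bytes(&self) -> &[u8];
}

impl ScanInput for &[u8] {
    fn as_bytes(&self) -> &[u8] {
        self
    }
}

impl ScanInput for &str {
    fn as_bytes(&self) -> &[u8] {
        // Explicit path to the inherent method avoids calling this trait method again.
        str::as_bytes(self)
    }
}

impl ScanInput for Vec<u8> {
    fn as_bytes(&self) -> &[u8] {
        self.as_slice()
    }
}

/// Hypothetical consumer: counts bytes equal to `byte` without copying the input.
pub fn count_byte<I: ScanInput>(input: &I, byte: u8) -> usize {
    input.as_bytes().iter().filter(|b| **b == byte).count()
}
```

Comparing `ScanInput::as_bytes(&slice).as_ptr()` with the original buffer's pointer is a quick way to confirm that no copy took place.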
## 🚀 Implementation Status
### ✅ Completed
- [x] Set up VectorScan build integration (hyperscan-tokio-sys crate)
- [x] Generate safe FFI bindings (all core functions implemented)
- [x] Basic block-mode scanning implementation
- [x] Core error handling with comprehensive error types
- [x] Pattern compilation (single, multi, extended, literal)
- [x] Scratch space management with RAII
- [x] Database serialization/deserialization
- [x] Builder pattern API for patterns and databases
- [x] Scanner types (Scanner, StreamScanner, VectoredScanner)
- [x] Worker pool with core affinity
- [x] Zero-copy interfaces
- [x] Chimera support with capture groups
- [x] Hot-reloadable database structure
- [x] Metrics collection
- [x] All unsafe code documented
### ✅ Critical Blockers Fixed
- [x] **Streaming implementation** - Fixed lifetime errors in stream.rs
- [x] **Tokio features** - "fs" feature already present in Cargo.toml
- [x] **Compilation errors** - Library now compiles successfully
- [x] **Async boundary issues** - Resolved with proper spawn_blocking usage
- [x] **StreamState Drop conflicts** - Fixed with proper pinning and futures
- [x] **Worker pool errors** - Fixed mutex and type annotation issues
### 📝 TODO - Next Steps
- [ ] Fix compilation warnings (140 warnings to address)
- [ ] Ensure all examples compile and run
- [ ] Add comprehensive tests for streaming mode
- [ ] Set up CI/CD pipeline with GitHub Actions
- [ ] Benchmark implementation against targets (50M EPS)
- [ ] Complete API documentation with rustdoc
- [ ] Create working examples for all major features
- [ ] Test Arrow integration thoroughly
- [ ] Add integration tests for worker pool
- [ ] Implement proper error handling tests
## 📊 Performance Targets
- **Throughput**: 50M+ events/sec on 16-core machine
- **Latency**: <1ms p99 for 1KB payloads
- **Memory**: <100MB overhead for 10K patterns
- **Scaling**: Linear up to core count
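It helps to translate these targets into per-core budgets. The sketch below is plain arithmetic with two hypothetical helper functions. It assumes the 1 KB payload from the latency target (taken as 1024 bytes) also applies to the throughput target, which the targets do not state, and it assumes perfectly linear scaling.

```rust
/// Time budget for one event on one core, in nanoseconds.
pub fn ns_per_event_per_core(events_per_sec: u64, cores: u64) -> f64 {
    let per_core_rate = events_per_sec as f64 / cores as f64;
    1e9 / per_core_rate
}

/// Aggregate input bandwidth in GB/s (decimal gigabytes) for a given payload size.
pub fn aggregate_gb_per_sec(events_per_sec: u64, payload_bytes: u64) -> f64 {
    (events_per_sec * payload_bytes) as f64 / 1e9
}
```

With 50M events/sec on 16 cores, each core must handle 3.125M events/sec, or one event every 320 ns. At 1024-byte payloads that is 51.2 GB/s in aggregate and 3.2 GB/s per core, so the throughput target is easier to meet with smaller events.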
## 🔑 Key Dependencies
```toml
[dependencies]
tokio = { version = "1.35", features = ["full"] }
bytes = "1.5"
thiserror = "1.0"
anyhow = "1.0"
arc-swap = "1.6" # For hot-reloading
crossbeam = "0.8"
core_affinity = "0.8"
parking_lot = "0.12"
tracing = "0.1"
[dev-dependencies]
criterion = "0.5"
proptest = "1.4"
tokio-test = "0.4"
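[features]
# `tokio` is a required dependency above, so it cannot be listed as a feature.
# Crates referenced with `dep:` must also be declared under [dependencies]
# with `optional = true`.
default = ["jemalloc"]
arrow = ["dep:arrow-array", "dep:arrow-buffer"]
jemalloc = ["dep:tikv-jemallocator"]
mimalloc = ["dep:mimalloc-rust"]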
```
## 🎯 Success Metrics
1. **Performance**: Meet 50M EPS target with <1ms p99
2. **Safety**: Zero unsoundness in safe API
3. **Ergonomics**: Intuitive builder pattern
4. **Compatibility**: Works on x86_64, ARM64, RISC-V
5. **Production Ready**: Used in real 100M EPS pipelines
## 🔄 Migration Path
For users coming from `rust-hyperscan`:
```rust
// Old (rust-hyperscan)
let db = hyperscan::compile(patterns)?;
let scratch = db.alloc_scratch()?;

// New (hyperscan-tokio)
let db = DatabaseBuilder::from_patterns(patterns).build()?;
let scanner = Scanner::new(db);
let matches = scanner.scan_bytes(data).await?;
```
## 🏁 Next Steps
1. **Validate VectorScan fork choice** - Ensure it meets all requirements
2. **Set up build infrastructure** - CMake, bindgen, CI/CD
3. **Create minimal FFI prototype** - Prove the approach
4. **Design comprehensive error types** - Better than "scan failed"
5. **Start with block mode** - Then add streaming