Streaming scanner for detecting and replacing sensitive data.
§Architecture
The streaming scanner processes input data in configurable chunks,
detecting secret patterns (regex or literal) and applying one-way
replacements via the MappingStore.
This design supports files of 20–100 GB+ without requiring the entire
content to fit in memory.
┌──────────────┐     ┌─────────────────┐     ┌──────────────────┐
│ Input (Read) │ ──▶ │  StreamScanner  │ ──▶ │  Output (Write)  │
│  (chunked)   │     │ (pattern match  │     │   (sanitized)    │
└──────────────┘     │  + replace)     │     └──────────────────┘
                     └────────┬────────┘
                              │
                     ┌────────▼────────┐
                     │  MappingStore   │
                     │  (dedup cache)  │
                     └─────────────────┘
§Chunk Overlap Strategy
To avoid missing matches that span chunk boundaries, the scanner maintains an overlap window between consecutive chunks:
1. Read chunk_size bytes of new data.
2. Prepend the carry buffer (tail of previous window).
3. Scan the combined window for all pattern matches.
4. Compute commit_point = window.len() - overlap_size (adjusted upward if a match straddles the boundary).
5. Emit output for window[..commit_point] with replacements applied.
6. Set carry = window[commit_point..] for the next iteration.
The overlap_size should be ≥ the maximum expected match length to
guarantee no matches are missed at boundaries.
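The six steps above can be sketched with the standard library alone. This is a minimal illustration, not the crate's implementation: `scan_stream` and `find_all` are hypothetical names, it handles a single literal pattern instead of compiled regexes, and it takes the replacement as a parameter instead of consulting the MappingStore.

```rust
use std::io::{self, Read, Write};

/// Hypothetical helper: start offsets of all non-overlapping occurrences
/// of `needle` in `haystack`, found left to right.
fn find_all(haystack: &[u8], needle: &[u8]) -> Vec<usize> {
    let mut starts = Vec::new();
    let mut i = 0;
    while i + needle.len() <= haystack.len() {
        if &haystack[i..i + needle.len()] == needle {
            starts.push(i);
            i += needle.len();
        } else {
            i += 1;
        }
    }
    starts
}

/// Hypothetical stand-in for the scanner loop. Returns the number of
/// replacements written. Assumes `overlap_size >= pattern.len()`.
pub fn scan_stream<R: Read, W: Write>(
    mut input: R,
    mut output: W,
    pattern: &[u8],
    replacement: &[u8],
    chunk_size: usize,
    overlap_size: usize,
) -> io::Result<usize> {
    assert!(!pattern.is_empty() && chunk_size > 0);
    let mut carry: Vec<u8> = Vec::new();
    let mut buf = vec![0u8; chunk_size];
    let mut replaced = 0;
    loop {
        // 1. Read up to chunk_size bytes of new data (0 bytes means EOF).
        let n = input.read(&mut buf)?;
        let eof = n == 0;
        // 2. Prepend the carry buffer to form the window.
        let mut window = std::mem::take(&mut carry);
        window.extend_from_slice(&buf[..n]);
        // 3. Scan the combined window for all matches.
        let matches = find_all(&window, pattern);
        // 4. Commit everything except the overlap; at EOF commit it all.
        let mut commit = if eof {
            window.len()
        } else {
            window.len().saturating_sub(overlap_size)
        };
        for &start in &matches {
            let end = start + pattern.len();
            if start < commit && end > commit {
                commit = end; // a match straddles the boundary: move it up
            }
        }
        // 5. Emit window[..commit] with replacements applied.
        let mut pos = 0;
        for &start in &matches {
            if start + pattern.len() > commit {
                break; // this match stays in the carry for the next pass
            }
            output.write_all(&window[pos..start])?;
            output.write_all(replacement)?;
            pos = start + pattern.len();
            replaced += 1;
        }
        output.write_all(&window[pos..commit])?;
        // 6. Keep the uncommitted tail for the next iteration.
        carry = window[commit..].to_vec();
        if eof {
            break;
        }
    }
    Ok(replaced)
}
```

Any `Read` and `Write` pair works, for example a byte slice as input and a `Vec<u8>` as output. Choosing a `chunk_size` smaller than the pattern forces every match to span a chunk boundary, which makes the carry logic easy to exercise in a test.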
§Thread Safety
StreamScanner is Send + Sync. Multiple files can be scanned
concurrently using a shared Arc<StreamScanner>, all backed by the
same MappingStore for per-run dedup
consistency.
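The sharing pattern described here might look like the sketch below. `DemoStore` and `DemoScanner` are hypothetical stand-ins for MappingStore and StreamScanner, not the crate's real types or signatures, and a `Mutex<HashMap>` takes the place of the real store's internals to keep the example dependency-free. It illustrates only that threads sharing one scanner see the same replacement for the same secret.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for MappingStore: one stable token per secret.
#[derive(Default)]
pub struct DemoStore {
    map: Mutex<HashMap<String, String>>,
}

impl DemoStore {
    /// Returns the existing token for `secret`, or assigns the next one.
    pub fn token_for(&self, secret: &str) -> String {
        let mut map = self.map.lock().unwrap();
        let next = map.len() + 1;
        map.entry(secret.to_string())
            .or_insert_with(|| format!("<SECRET_{}>", next))
            .clone()
    }
}

/// Hypothetical stand-in for StreamScanner: immutable configuration plus
/// a handle to the shared store, so `scan` only needs `&self`.
pub struct DemoScanner {
    needle: String,
    store: Arc<DemoStore>,
}

impl DemoScanner {
    pub fn new(needle: &str, store: Arc<DemoStore>) -> Self {
        DemoScanner { needle: needle.to_string(), store }
    }

    /// Replaces every occurrence of the configured literal in `text`.
    pub fn scan(&self, text: &str) -> String {
        if !text.contains(self.needle.as_str()) {
            return text.to_string();
        }
        let token = self.store.token_for(&self.needle);
        text.replace(self.needle.as_str(), token.as_str())
    }
}

/// Scans each input on its own thread using one shared scanner.
/// Results are returned in input order.
pub fn scan_concurrently(scanner: Arc<DemoScanner>, inputs: Vec<String>) -> Vec<String> {
    let handles: Vec<_> = inputs
        .into_iter()
        .map(|input| {
            let scanner = Arc::clone(&scanner);
            thread::spawn(move || scanner.scan(&input))
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Because `scan` takes `&self` and all mutation happens inside the store, the stand-in scanner is `Send + Sync` without any extra locking at the call site.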
§Performance
- Chunk-based I/O: only chunk_size + overlap_size bytes in memory per active scan.
- Compiled regex: patterns are compiled once at construction and reused across all chunks and files.
- Lock-free reads: the DashMap inside MappingStore provides lock-free reads for already-seen values.
- File-level parallelism: share Arc<StreamScanner> across threads to scan multiple files concurrently.
§Structs
- ScanConfig: Configuration for the streaming scanner.
- ScanPattern: A pattern rule defining what to scan for and how to categorize matches.
- ScanStats: Statistics collected during a scan operation.
- StreamScanner: Streaming scanner that detects and replaces sensitive patterns.