Module scanner

Streaming scanner for detecting and replacing sensitive data.

§Architecture

The streaming scanner processes input in configurable chunks, detects secret patterns (regex or literal), and applies one-way replacements via the MappingStore. Because only one window is held at a time, inputs of 20–100 GB and larger can be processed without loading the entire content into memory.

┌───────────────┐     ┌─────────────────┐     ┌──────────────────┐
│  Input (Read) │ ──▶ │  StreamScanner  │ ──▶ │  Output (Write)  │
│  (chunked)    │     │  (pattern match │     │  (sanitized)     │
└───────────────┘     │   + replace)    │     └──────────────────┘
                      └────────┬────────┘
                               │
                      ┌────────▼────────┐
                      │  MappingStore   │
                      │  (dedup cache)  │
                      └─────────────────┘

§Chunk Overlap Strategy

To avoid missing matches that span chunk boundaries, the scanner maintains an overlap window between consecutive chunks:

  1. Read chunk_size bytes of new data.
  2. Prepend the carry buffer (tail of previous window).
  3. Scan the combined window for all pattern matches.
  4. Compute commit_point = window.len() - overlap_size (adjusted upward if a match straddles the boundary).
  5. Emit output for window[..commit_point] with replacements applied.
  6. Set carry = window[commit_point..] for the next iteration.

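The overlap_size should be at least the maximum expected match length. A match longer than the overlap can begin before the commit point while still being incomplete in the current window, in which case it is missed.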
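The six steps above can be sketched with the standard library alone. This is a minimal illustration, not the crate's implementation: the function names `scan_stream` and `find_all` and their signatures are hypothetical, a single literal pattern stands in for the real pattern set, and a fixed replacement stands in for the MappingStore lookup.

```rust
use std::io::{self, Read, Write};

/// Hypothetical helper: start offsets of all non-overlapping occurrences
/// of `needle` in `haystack`, scanning left to right.
fn find_all(haystack: &[u8], needle: &[u8]) -> Vec<usize> {
    let mut starts = Vec::new();
    let mut i = 0;
    while i + needle.len() <= haystack.len() {
        if &haystack[i..i + needle.len()] == needle {
            starts.push(i);
            i += needle.len();
        } else {
            i += 1;
        }
    }
    starts
}

/// Hypothetical sketch of the overlap strategy for one literal pattern.
/// Returns the number of replacements written to `output`.
pub fn scan_stream<R: Read, W: Write>(
    mut input: R,
    mut output: W,
    needle: &[u8],
    replacement: &[u8],
    chunk_size: usize,
    overlap_size: usize,
) -> io::Result<usize> {
    assert!(!needle.is_empty() && chunk_size > 0);
    let mut carry: Vec<u8> = Vec::new();
    let mut buf = vec![0u8; chunk_size];
    let mut replaced = 0;
    loop {
        // 1. Read up to chunk_size bytes of new data (0 bytes means EOF).
        let n = input.read(&mut buf)?;
        let eof = n == 0;
        // 2. Prepend the carry buffer to form the window.
        let mut window = std::mem::take(&mut carry);
        window.extend_from_slice(&buf[..n]);
        // 3. Scan the combined window for all matches.
        let matches = find_all(&window, needle);
        // 4. Hold back overlap_size bytes, except at EOF where everything
        //    is committed. Move the commit point up past a straddling match.
        let mut commit = if eof {
            window.len()
        } else {
            window.len().saturating_sub(overlap_size)
        };
        for &start in &matches {
            let end = start + needle.len();
            if start < commit && end > commit {
                commit = end;
            }
        }
        // 5. Emit window[..commit] with replacements applied.
        let mut pos = 0;
        for &start in &matches {
            if start >= commit {
                break;
            }
            output.write_all(&window[pos..start])?;
            output.write_all(replacement)?;
            pos = start + needle.len();
            replaced += 1;
        }
        output.write_all(&window[pos..commit])?;
        // 6. Keep the uncommitted tail for the next iteration.
        carry = window[commit..].to_vec();
        if eof {
            return Ok(replaced);
        }
    }
}
```

With a 4-byte chunk and a 6-byte overlap, a 6-byte literal that is split across several reads is still found, because its first byte always stays in the carry until the full match is visible.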

§Thread Safety

StreamScanner is Send + Sync. Multiple files can be scanned concurrently using a shared Arc<StreamScanner>, all backed by the same MappingStore for per-run dedup consistency.
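The sharing pattern can be sketched as follows. The `StreamScanner` and `MappingStore` types here are simplified, hypothetical stand-ins written for this example (literal patterns only, a `Mutex<HashMap>` as the store), and `scan_concurrently`, `scan_str`, and `replacement_for` are assumed names, not the crate's API. The point is only the shape: one scanner behind an `Arc`, one thread per input, one shared store.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for the store: maps each secret to a stable token.
#[derive(Default)]
pub struct MappingStore {
    map: Mutex<HashMap<String, String>>,
}

impl MappingStore {
    /// Returns the existing token for `secret`, or assigns a new one.
    pub fn replacement_for(&self, secret: &str) -> String {
        let mut map = self.map.lock().unwrap();
        let next_id = map.len() + 1;
        map.entry(secret.to_string())
            .or_insert_with(|| format!("<SECRET_{}>", next_id))
            .clone()
    }
}

/// Hypothetical stand-in for the scanner: literal patterns only.
pub struct StreamScanner {
    literals: Vec<String>,
    store: Arc<MappingStore>,
}

impl StreamScanner {
    pub fn new(literals: Vec<String>, store: Arc<MappingStore>) -> Self {
        Self { literals, store }
    }

    /// Takes `&self` only, so one instance can serve many threads.
    pub fn scan_str(&self, input: &str) -> String {
        let mut out = input.to_string();
        for lit in &self.literals {
            if out.contains(lit.as_str()) {
                let token = self.store.replacement_for(lit);
                out = out.replace(lit.as_str(), &token);
            }
        }
        out
    }
}

/// Scans each input on its own thread using one shared scanner.
/// Results are returned in the same order as `inputs`.
pub fn scan_concurrently(scanner: Arc<StreamScanner>, inputs: Vec<String>) -> Vec<String> {
    let handles: Vec<_> = inputs
        .into_iter()
        .map(|text| {
            let scanner = Arc::clone(&scanner);
            thread::spawn(move || scanner.scan_str(&text))
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("scan thread panicked"))
        .collect()
}
```

Because every thread resolves replacements through the same store, a secret that appears in several inputs receives the same token in all of them.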

§Performance

  • Chunk-based I/O: only chunk_size + overlap_size bytes in memory per active scan.
  • Compiled regex: patterns are compiled once at construction and reused across all chunks and files.
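  • Low-contention reads: the DashMap inside MappingStore is sharded, so a lookup of an already-seen value takes only a per-shard read lock and rarely contends with other threads. (DashMap is not strictly lock-free.)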
  • File-level parallelism: share Arc<StreamScanner> across threads to scan multiple files concurrently.
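The cheap path for already-seen values can be illustrated without DashMap. The sketch below deliberately swaps in `std::sync::RwLock<HashMap>` so it compiles with the standard library alone; `DedupCache` and `get_or_insert` are hypothetical names, and the real MappingStore may differ in both structure and token format.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Hypothetical read-mostly dedup cache (std RwLock instead of DashMap).
#[derive(Default)]
pub struct DedupCache {
    seen: RwLock<HashMap<String, String>>,
}

impl DedupCache {
    /// Returns the token for `secret`, creating one on first sight.
    pub fn get_or_insert(&self, secret: &str) -> String {
        // Fast path: a shared read lock, so concurrent lookups of
        // already-seen values do not block each other.
        {
            let seen = self.seen.read().unwrap();
            if let Some(token) = seen.get(secret) {
                return token.clone();
            }
        } // read guard dropped here, before the write lock is taken

        // Slow path: exclusive lock. `entry` re-checks the key, so a
        // value inserted by another thread in the meantime is reused.
        let mut seen = self.seen.write().unwrap();
        let next_id = seen.len() + 1;
        seen.entry(secret.to_string())
            .or_insert_with(|| format!("<SECRET_{}>", next_id))
            .clone()
    }
}
```

A sharded map such as DashMap applies the same idea per shard, so writers to one shard do not block readers of another.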

Structs§

ScanConfig
Configuration for the streaming scanner.
ScanPattern
A pattern rule defining what to scan for and how to categorize matches.
ScanProgress
Progress snapshot emitted during streaming scans.
ScanStats
Statistics collected during a scan operation.
StreamScanner
Streaming scanner that detects and replaces sensitive patterns.