# Architecture

> **sanitize-engine** v0.2.0 — Deterministic, one-way data sanitization.

This document describes the internal architecture of the sanitization
engine. It is aimed at contributors and operators who need to
understand data flow, concurrency, and security boundaries.

---

## 1. High-Level Data Flow

```
┌─────────────┐                 ┌───────────────────────┐
│  CLI args    │  ──────────▶   │    sanitize  (bin)     │
│  [INPUT]     │                │  ┌─────────────────┐   │
│  -o/--output │                │  │ Signal handler   │   │
│  -s/--secrets-│                │  │ Tracing init     │   │
│    file      │                │  │ Thread cap resolve│   │
└─────────────┘                 │  └────────┬────────┘   │
                                │           │            │
                      ┌─────────▼──────────────────────┐ │
                      │  Is it an archive?             │ │
                      │  (.tar / .tar.gz / .zip)       │ │
                      └──┬─────────────────┬───────────┘ │
                    YES  │                 │  NO         │
          ┌──────────────▼──────┐   ┌──────▼──────────┐  │
          │ ArchiveProcessor    │   │ StreamScanner   │  │
          │ (per-entry routing) │   │ (chunk+overlap) │  │
          └──────────┬──────────┘   └──────┬──────────┘  │
                     │                     │             │
         ┌───────────▼─────────────────────▼──────────┐  │
         │          MappingStore (DashMap)             │  │
         │  ┌──────────────┐  ┌─────────────────────┐ │  │
         │  │ ForwardMap   │  │ ReplacementGenerator │ │  │
         │  │ val → repl   │  │ (HMAC / Random)      │ │  │
         │  └──────────────┘  └─────────────────────┘ │  │
         └────────────────────────────────────────────┘  │
                                │                        │
                      ┌─────────▼──────────┐             │
                      │ AtomicFileWriter   │             │
                      │ (tmp → fsync →     │             │
                      │  rename)           │             │
                      └────────────────────┘             │
                      ┌─────────────────────┐            │
                      │ ReportBuilder       │            │
                      │ (JSON summary)      │            │
                      └─────────────────────┘            │
└────────────────────────────────────────────────────────┘
```

### Plain file path

1. Load encrypted secrets → decrypt with password (PBKDF2 + AES-256-GCM).
2. Build `ScanPattern` list from decrypted plain-text entries.
3. Create `StreamScanner` with chunk+overlap configuration.
4. Open input as a `Read`, open output via `AtomicFileWriter`.
5. Call `scan_reader(&mut reader, &mut writer)`.
6. The scanner reads chunks, matches literal patterns with Aho-Corasick and regex patterns with a `RegexSet` pre-filter followed by per-pattern scans, calls `MappingStore::get_or_insert` for each match, and writes sanitized chunks to the writer.
7. On success, `AtomicFileWriter::finish()` fsyncs + renames. On error/signal, temp file is cleaned up by `Drop`.

### Archive path

1. Detect format from extension (`.tar`, `.tar.gz`, `.zip`).
2. `ArchiveProcessor` iterates entries; for each regular file:
   - Try structured processor match (JSON / YAML / XML / CSV / KeyValue) via `ProcessorRegistry::find_processor`.
   - If matched and within `MAX_STRUCTURED_ENTRY_SIZE`, parse + walk + replace field values.
   - Otherwise fall back to `StreamScanner::scan_reader` for byte-level replacement.
3. Rebuild the archive with sanitized content and preserved metadata.
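The extension routing in step 1 can be sketched in a few lines; `ArchiveFormat` and `detect_format` below are illustrative names, not the crate's actual API:

```rust
/// Illustrative sketch of extension-based archive detection
/// (hypothetical names; the crate's real routing may differ).
#[derive(Debug, PartialEq)]
enum ArchiveFormat {
    Tar,
    TarGz,
    Zip,
}

fn detect_format(path: &str) -> Option<ArchiveFormat> {
    let lower = path.to_ascii_lowercase();
    // Check the compound extension first so ".tar.gz" is not
    // misclassified by the plain ".tar" suffix check below.
    if lower.ends_with(".tar.gz") {
        Some(ArchiveFormat::TarGz)
    } else if lower.ends_with(".tar") {
        Some(ArchiveFormat::Tar)
    } else if lower.ends_with(".zip") {
        Some(ArchiveFormat::Zip)
    } else {
        None // not an archive: route to StreamScanner instead
    }
}
```

Note the compound `.tar.gz` check runs before the plain `.tar` check; testing suffixes in the wrong order would misroute gzipped tarballs.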

### Encrypt / Decrypt subcommands

The `sanitize encrypt` and `sanitize decrypt` subcommands handle
secrets file management. These are simple linear flows that do not
involve the scanning/replacement pipeline:

- **`sanitize encrypt <IN> <OUT>`** — reads a plaintext secrets file,
  optionally validates it, encrypts with AES-256-GCM (PBKDF2 key
  derivation), and writes the ciphertext atomically.
- **`sanitize decrypt <IN> <OUT>`** — reads an encrypted secrets file,
  decrypts, optionally validates the resulting plaintext, and writes
  atomically.

Both subcommands resolve the password through a unified chain:
`--password` flag (triggers interactive masked prompt; requires TTY) →
`--password-file` (with Unix permission enforcement) →
`SANITIZE_PASSWORD` env var → automatic interactive terminal prompt
(masked input via `rpassword`).
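A minimal sketch of that resolution order (names and signatures are hypothetical; the real CLI also enforces the TTY requirement and Unix file-permission checks omitted here):

```rust
use std::env;

// Hypothetical sketch of the password resolution chain. The real CLI
// prompts via rpassword and validates permissions on --password-file.
fn resolve_password(password_flag: bool, password_file: Option<&str>) -> String {
    if password_flag {
        return prompt_masked(); // --password: interactive masked prompt
    }
    if let Some(contents) = password_file {
        return contents.trim_end().to_string(); // --password-file contents
    }
    if let Ok(pw) = env::var("SANITIZE_PASSWORD") {
        return pw; // environment variable
    }
    prompt_masked() // final fallback: automatic terminal prompt
}

fn prompt_masked() -> String {
    // Stand-in for an rpassword-style masked prompt.
    String::from("<prompted>")
}
```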

---

## 2. Replacement Strategies: Two Parallel Paths

The crate provides **two distinct APIs** for generating sanitized replacements:

### Path 1: Category-Aware Generators (Used by CLI)

```
HmacGenerator / RandomGenerator
    → implements ReplacementGenerator trait
    → calls format_replacement(category, hash, original)
    → category-specific formatting (email, IP, JWT, etc.)
```

- **Used by:** CLI binary, streaming scanner
- **Design:** Opinionated, category-aware formatters with length preservation
- **Determinism:** HMAC-SHA256 for deterministic mode, OS CSPRNG for random mode
- **Code:** `src/generator.rs` (lines 136-600+)

### Path 2: Pluggable Strategy Trait (Public Library API)

```
StrategyGenerator (adapter)
    → implements ReplacementGenerator trait
    → delegates to dyn Strategy::replace(original, entropy)
    → user-defined or built-in strategies (RandomString, FakeIp, etc.)
```

- **Used by:** Library consumers, third-party crates via public API
- **Design:** Extensible, strategy pattern for custom replacement logic
- **Determinism:** Entropy source (HMAC or CSPRNG) decoupled from strategy
- **Code:** `src/strategy.rs`
- **Example:** `examples/custom_strategy.rs`
- **Docs:** `docs/strategies.md`
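A custom strategy along the lines of Path 2 might look like the sketch below; the trait follows the `replace(original, entropy)` shape shown above, though the exact definition in `src/strategy.rs` may differ:

```rust
// Minimal sketch of the Strategy extension point (illustrative; see
// src/strategy.rs and examples/custom_strategy.rs for the real API).
trait Strategy {
    /// Produce a replacement for `original`, drawing bytes from
    /// `entropy` (HMAC-derived in deterministic mode, CSPRNG otherwise).
    fn replace(&self, original: &str, entropy: &[u8]) -> String;
}

/// A user-defined strategy: hex-encode the entropy, truncated to the
/// original value's length so layout-sensitive files stay aligned.
struct HexMask;

impl Strategy for HexMask {
    fn replace(&self, original: &str, entropy: &[u8]) -> String {
        let hex: String = entropy.iter().map(|b| format!("{b:02x}")).collect();
        hex.chars().take(original.len()).collect()
    }
}
```

Because the entropy source is passed in rather than generated inside the strategy, the same strategy works unchanged in both deterministic and random modes.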

### Why Two Paths?

1. **CLI simplicity:** Direct category formatters avoid indirection overhead for
   the built-in use case (email → email-shaped replacement, IP → IP-shaped, etc.).
2. **Library extensibility:** The `Strategy` trait allows users to plug in
   custom replacement logic without forking the crate.
3. **Historical:** The Strategy trait was part of the initial design. The
   category-aware formatters (`format_replacement`) were later optimized for
   the CLI and became the primary path.

Both paths are maintained and tested. They share the same `MappingStore` and
`ReplacementGenerator` interface, but diverge in how replacements are computed.

---

## 3. Module Map

| Module | Responsibility |
|--------|---------------|
| `scanner` | Streaming hybrid scanner with configurable chunk/overlap: Aho-Corasick for literals + regex engine for regex patterns. Memory-bounded reads. |
| `store` | `DashMap`-backed dedup cache. `get_or_insert` is the single entry-point. Capacity-limited. |
| `generator` | `ReplacementGenerator` trait. Two impls: `HmacGenerator` (deterministic), `RandomGenerator` (CSPRNG). Contains category-aware formatters used by the CLI. |
| `strategy` | **Extensibility layer:** `Strategy` trait + `StrategyGenerator` adapter + 5 built-in strategies (`RandomString`, `FakeIp`, etc.). Public API for library users to implement custom replacement logic. |
| `category` | `Category` enum. Drives domain separation in HMAC and replacement format selection. |
| `secrets` | AES-256-GCM encrypted secrets file format. PBKDF2 key derivation. Zeroizes plaintext on drop. |
| `processor::*` | Format-aware processors: JSON, YAML, XML, CSV, KeyValue. Each implements `Processor` trait. |
| `processor::archive` | Tar / tar.gz / zip processing. Per-entry structured-or-scanner routing. |
| `processor::registry` | `ProcessorRegistry` — maps processor names to `Arc<dyn Processor>`. |
| `processor::profile` | `FileTypeProfile` + `FieldRule` — user-supplied rules for structured processing. |
| `report` | Thread-safe `ReportBuilder` producing a JSON summary of the sanitization run. |
| `error` | `SanitizeError` enum + `Result<T>` alias. |
| `atomic` | `AtomicFileWriter` — crash-safe output via temp + fsync + rename. |

---

## 4. Streaming Model

The scanner never holds the entire file in memory. It reads fixed-size
**chunks** (default 1 MiB) with a configurable **overlap** (default
4 KiB). The overlap ensures that a sensitive value straddling a chunk
boundary is still detected.

```
Chunk N:     [===========================|overlap|]
Chunk N+1:                         [overlap|===========================|overlap|]
```

After scanning a chunk, only the overlap window is retained; the rest
is flushed to the writer. Peak memory per file ≈ `chunk_size + overlap`.

For structured processors (JSON, YAML, …) the entry content must fit
in memory (gated by `MAX_STRUCTURED_ENTRY_SIZE`). Oversized entries
fall through to the streaming scanner.
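The overlap discipline can be demonstrated with the standard library alone. The sketch below counts occurrences of a literal needle while reading in small chunks; a match straddling a chunk boundary is still found because the last `needle.len() - 1` bytes of each chunk are carried into the next read. (The real scanner matches via Aho-Corasick and `RegexSet`, not a naive byte scan.)

```rust
use std::io::Read;

// Demonstration of overlap-carrying chunked scanning. Matches that
// straddle a chunk boundary are still found because the tail of each
// chunk (the overlap window) is prepended to the next chunk.
fn count_matches<R: Read>(mut input: R, needle: &[u8], chunk: usize) -> usize {
    let overlap = needle.len().saturating_sub(1);
    let mut carry: Vec<u8> = Vec::new();
    let mut buf = vec![0u8; chunk];
    let mut count = 0;
    loop {
        let n = input.read(&mut buf).expect("read failed");
        if n == 0 {
            break;
        }
        // Scan the carried overlap plus the fresh chunk together. The
        // carry is shorter than the needle, so no match can sit wholly
        // inside it and be double-counted.
        let mut window = carry;
        window.extend_from_slice(&buf[..n]);
        count += window.windows(needle.len()).filter(|w| *w == needle).count();
        // Keep only the overlap tail; everything before it would be
        // flushed to the writer in the real scanner.
        let keep = window.len().saturating_sub(overlap);
        carry = window[keep..].to_vec();
    }
    count
}
```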

---

## 5. Concurrency Model

- **`MappingStore`** uses `DashMap` (striped read-write locks). Multiple
  threads can call `get_or_insert` concurrently; per-shard locking keeps
  contention low.
- **All public types are `Send + Sync`.**
- The CLI resolves `--threads` to `min(--threads, available_parallelism)`
  and initializes the global rayon thread pool to that size at startup.
- **File-level parallelism:** when multiple `[INPUT]` paths are given, all
  `InputTarget::File` targets are dispatched via `rayon::par_iter`. Stdin
  targets always run serially.
- **Archive-entry parallelism:** for a single-archive input (or when file-level
  parallelism is not active), archive entries are sanitized in parallel via
  rayon when the file-entry count meets `parallel_threshold` (default: 4).
  The rebuilt archive is always written in original entry order — output is
  deterministic regardless of thread count.
- **Thread-budget policy:** file-level and entry-level parallelism are mutually
  exclusive. When multiple files are processed in parallel, archive-entry
  parallelism inside each one is suppressed (`parallel_threshold = usize::MAX`)
  to prevent rayon oversubscription.
- **Progress:** the shared `Arc<Mutex<ProgressReporter>>` serializes all
  `start_task` / `finish_task` calls, so milestone lines are never interleaved.
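The `--threads` resolution rule can be expressed directly with `std::thread::available_parallelism`; `resolve_thread_cap` is an illustrative name, and the clamp to at least one worker is an assumption of this sketch:

```rust
use std::thread;

// Sketch of the --threads resolution rule: never exceed the host's
// available parallelism; fall back to a single worker if the query
// fails or the request is zero (an assumption of this sketch).
fn resolve_thread_cap(requested: usize) -> usize {
    let available = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    requested.min(available).max(1)
}
```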

---

## 6. Replacement Pipeline

```
Input value  ──▶  MappingStore::get_or_insert
                 ┌─────▼──────────┐
                 │ Already cached? │
                 └──┬──────────┬──┘
                  YES          NO
                   │            │
                   │    ┌───────▼──────────────┐
                   │    │ Strategy::generate()  │
                   │    │ (HMAC or Random seed  │
                   │    │  + category format)   │
                   │    └───────┬──────────────┘
                   │            │
                   │    ┌───────▼──────┐
                   │    │ Insert into  │
                   │    │ forward map  │
                   │    └───────┬──────┘
                   │            │
                   ▼            ▼
              Return cached replacement
```

- **HMAC mode**: `HMAC-SHA256(seed, category_tag || "\x00" || value)`, truncated
  to a category-specific format. The same seed + value pair always produces the
  same replacement.
- **Random mode**: `OsRng` / `thread_rng()` per invocation. The dedup
  cache still ensures per-run consistency.
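The cache semantics above can be sketched single-threaded with a plain `HashMap` (the real `MappingStore` uses `DashMap` for concurrent sharded access; names here are illustrative):

```rust
use std::collections::HashMap;

// Single-threaded sketch of the dedup cache (illustrative names).
struct MappingStoreSketch {
    forward: HashMap<String, String>,
}

impl MappingStoreSketch {
    fn new() -> Self {
        Self { forward: HashMap::new() }
    }

    /// Return the cached replacement for `value`, generating and
    /// caching one on first sight. This is why random mode still has
    /// per-run consistency: the generator runs at most once per value.
    fn get_or_insert(&mut self, value: &str, generate: impl Fn(&str) -> String) -> String {
        self.forward
            .entry(value.to_string())
            .or_insert_with(|| generate(value))
            .clone()
    }
}
```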

---

## 7. Atomic Output Safety

All file outputs go through `AtomicFileWriter`:

1. Write to `<destination>.tmp`.
2. `flush()` + `fsync()` the file descriptor.
3. `rename()` atomically over the destination.
4. If the process exits before `finish()`, `Drop` removes the temp file.

This guarantees that readers never see a partial output file after a
crash or signal interrupt.
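A condensed std-only sketch of the same discipline (illustrative names; the real `AtomicFileWriter` differs in detail):

```rust
use std::fs::{self, File};
use std::io::Write;
use std::path::{Path, PathBuf};

// Sketch of the tmp → fsync → rename discipline with Drop cleanup.
struct AtomicWrite {
    tmp: PathBuf,
    dest: PathBuf,
    file: Option<File>, // Some until finish() consumes it
}

impl AtomicWrite {
    fn create(dest: &Path) -> std::io::Result<Self> {
        let tmp = PathBuf::from(format!("{}.tmp", dest.display()));
        let file = File::create(&tmp)?;
        Ok(Self { tmp, dest: dest.to_path_buf(), file: Some(file) })
    }

    fn write_all(&mut self, data: &[u8]) -> std::io::Result<()> {
        self.file.as_mut().unwrap().write_all(data)
    }

    fn finish(mut self) -> std::io::Result<()> {
        let file = self.file.take().unwrap();
        file.sync_all()?; // flush + fsync the descriptor
        fs::rename(&self.tmp, &self.dest) // atomic replace on POSIX
    }
}

impl Drop for AtomicWrite {
    fn drop(&mut self) {
        // Not finished: remove the partial temp file.
        if self.file.is_some() {
            let _ = fs::remove_file(&self.tmp);
        }
    }
}
```

Because `finish()` takes `self` by value and empties the `Option`, the `Drop` cleanup only fires on the error/early-exit path.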

---

## 8. Signal Handling

The CLI installs a `SIGINT` / `SIGTERM` handler via the `ctrlc` crate.
A global `AtomicBool` (`INTERRUPTED`) is set on signal. The pipeline
checks `is_interrupted()` before committing output:

- If interrupted **before** `AtomicFileWriter::finish()`, the temp file
  is cleaned up and the process exits with code 130.
- If interrupted **after** commit, the already-written output is valid.
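The flag-check pattern can be sketched with a `static AtomicBool`; here a direct store stands in for the real `ctrlc` handler, and the names are illustrative:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Sketch of the global interrupt flag. The real CLI sets this from a
// SIGINT/SIGTERM handler; a test can set it directly to simulate one.
static INTERRUPTED: AtomicBool = AtomicBool::new(false);

fn is_interrupted() -> bool {
    INTERRUPTED.load(Ordering::SeqCst)
}

/// Check the flag before committing output; returns the exit code.
fn commit_if_clean() -> i32 {
    if is_interrupted() {
        // Real code drops the AtomicFileWriter here, removing the
        // temp file, then exits with the conventional code 130.
        130
    } else {
        0 // safe to finish(): fsync + rename the output
    }
}
```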

---

## 9. Observability

Logging uses the `tracing` / `tracing-subscriber` stack:

- `--log-format human` (default): human-readable terminal output.
- `--log-format json`: structured JSON lines (machine-parseable).
- Level controlled via `SANITIZE_LOG` env var (e.g.
  `SANITIZE_LOG=debug`).
- **No secret values are ever logged.** Only file names, counts, and
  timing data appear in log output.

---

## 10. Feature Flags

| Feature | Effect |
|---------|--------|
| `bench` | Enables additional `tracing::info!` output for internal metrics (unique mapping count, etc.). Not intended for production. |

---

## 11. Build & Test

```bash
# Run all tests
cargo test

# Run with structured logging
SANITIZE_LOG=debug cargo run -- foo.txt -s secrets.enc --password -o foo.sanitized.txt

# Pipe from stdin (no TTY — use env var instead of --password)
echo "sensitive data" | SANITIZE_PASSWORD=... SANITIZE_LOG=debug cargo run -- -s secrets.enc

# Run benchmarks
cargo bench

# Run fuzz targets (requires cargo-fuzz / nightly)
cargo +nightly fuzz run fuzz_regex
cargo +nightly fuzz run fuzz_json
cargo +nightly fuzz run fuzz_yaml
cargo +nightly fuzz run fuzz_archive
```