sanitize-engine
Deterministic, one-way data sanitization engine and CLI tool.
sanitize-engine scans files and archives for sensitive data — emails, IP addresses, API keys, credentials, and other secrets — and replaces every match with a category-aware, structurally plausible substitute. Replacements are one-way within the system design: no reverse mapping is stored or recoverable from sanitized output alone. There is no restore mode.
Intended Audience
- Security and compliance teams sanitizing production data for safe sharing.
- CI/CD pipelines that must fail when secrets leak into configuration or logs.
- Developers preparing realistic but non-sensitive test datasets.
Core Differentiators
- One-way only. No mapping file, no restore mode. Forward map lives in process memory and is zeroized on drop.
- Deterministic or random. HMAC-SHA256 seeded mode produces identical replacements across runs; CSPRNG mode produces fresh replacements each run (still consistent within a single run via dedup cache).
- Streaming architecture. Processes 20–100 GB+ files in bounded memory via configurable chunk + overlap scanning.
- Format-aware processing. Structured processors for JSON, YAML, XML, CSV, and key-value files replace only matched field values while preserving document structure.
- Archive support. Tar, tar.gz, and zip archives are processed entry-by-entry with automatic format detection and metadata preservation.
- Zero unsafe code. The entire crate contains no unsafe blocks.
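The deterministic mode and the per-run dedup cache described above can be illustrated with a small, dependency-free sketch. It is not the crate's implementation: `SketchStore`, `keyed_stream`, and `replace_preserving_shape` are hypothetical names, and the standard library's `DefaultHasher` stands in for HMAC-SHA256 only so the example builds without external crates. `DefaultHasher` is not cryptographically secure and should not be used this way in practice.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for a keyed generator.
/// ASSUMPTION: DefaultHasher replaces HMAC-SHA256 purely to keep the sketch
/// dependency-free; it is NOT cryptographically secure.
fn keyed_stream(seed: u64, input: &str, counter: u64) -> u64 {
    let mut h = DefaultHasher::new();
    seed.hash(&mut h);
    input.hash(&mut h);
    counter.hash(&mut h);
    h.finish()
}

/// Replace each ASCII letter or digit with a pseudo-random character of the
/// same class. Punctuation such as '@' and '.' is kept, so the value keeps
/// its shape and its length.
fn replace_preserving_shape(seed: u64, input: &str) -> String {
    input
        .chars()
        .enumerate()
        .map(|(i, c)| {
            let r = keyed_stream(seed, input, i as u64);
            if c.is_ascii_lowercase() {
                (b'a' + (r % 26) as u8) as char
            } else if c.is_ascii_uppercase() {
                (b'A' + (r % 26) as u8) as char
            } else if c.is_ascii_digit() {
                (b'0' + (r % 10) as u8) as char
            } else {
                c
            }
        })
        .collect()
}

/// Hypothetical forward-only cache: original -> replacement.
/// There is deliberately no lookup in the other direction.
struct SketchStore {
    seed: u64,
    cache: HashMap<String, String>,
}

impl SketchStore {
    fn new(seed: u64) -> Self {
        SketchStore { seed, cache: HashMap::new() }
    }

    fn get_or_insert(&mut self, original: &str) -> String {
        if let Some(hit) = self.cache.get(original) {
            return hit.clone();
        }
        let replacement = replace_preserving_shape(self.seed, original);
        self.cache.insert(original.to_string(), replacement.clone());
        replacement
    }
}

fn main() {
    let mut store = SketchStore::new(42);
    let first = store.get_or_insert("alice@corp.com");
    let second = store.get_or_insert("alice@corp.com");
    assert_eq!(first, second); // consistent within a run
    assert_eq!(first.len(), "alice@corp.com".len()); // length preserved
    assert!(first.contains('@')); // still looks like an email
    println!("{first}");
}
```

Two stores built with the same seed return the same replacement for the same input, which is the property deterministic mode relies on. A random mode would differ only in how the seed is chosen for each run.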
Design Principles
- One-way only. No reverse mappings, no restore mode. Security by elimination.
- Deterministic reproducibility. Same seed + same input = same output, across machines and runs.
- Format-aware. Replace values, not structure. JSON stays valid JSON; YAML stays valid YAML.
- Streaming-first. Constant memory regardless of file size. Process 100 GB files on a 512 MB machine.
- Zero unsafe. Thread safety through DashMap and Arc, not pointer arithmetic.
- Defence in depth. Input size caps, regex automaton limits, depth limits, node-count caps — every parser has a budget.
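To make the streaming-first principle concrete, here is a minimal sketch of chunk + overlap scanning. It assumes a single literal pattern instead of the engine's regex patterns, and `sanitize_stream` and `find` are hypothetical helpers, not part of the crate's API. The idea is to hold back the last `needle.len() - 1` bytes of each window so that a match split across two reads is still seen in full.

```rust
use std::io::{self, Read, Write};

/// Find the first occurrence of `needle` in `haystack` at or after `from`.
fn find(haystack: &[u8], needle: &[u8], from: usize) -> Option<usize> {
    if needle.is_empty() || haystack.len() < needle.len() {
        return None;
    }
    (from..=haystack.len() - needle.len()).find(|&i| &haystack[i..i + needle.len()] == needle)
}

/// Mask every occurrence of `needle` with '*' while reading `chunk_size`
/// bytes at a time. Memory use is bounded by `chunk_size + needle.len() - 1`.
/// Returns the number of matches replaced.
fn sanitize_stream<R: Read, W: Write>(
    mut input: R,
    mut output: W,
    needle: &[u8],
    chunk_size: usize,
) -> io::Result<usize> {
    assert!(!needle.is_empty() && chunk_size > 0);
    let overlap = needle.len() - 1;
    let mut window: Vec<u8> = Vec::with_capacity(chunk_size + overlap);
    let mut chunk = vec![0u8; chunk_size];
    let mut matches = 0;

    loop {
        let n = input.read(&mut chunk)?;
        if n == 0 {
            // The leftover tail is shorter than the needle, so it cannot match.
            output.write_all(&window)?;
            return Ok(matches);
        }
        window.extend_from_slice(&chunk[..n]);

        // Replace every complete match currently visible in the window.
        let mut pos = 0;
        let mut last_end = 0;
        while let Some(i) = find(&window, needle, pos) {
            for b in &mut window[i..i + needle.len()] {
                *b = b'*';
            }
            matches += 1;
            pos = i + needle.len();
            last_end = pos;
        }

        // Emit everything except the overlap tail, but never hold back bytes
        // that were already replaced (they must not be rescanned).
        let keep_from = window.len().saturating_sub(overlap).max(last_end);
        output.write_all(&window[..keep_from])?;
        window.drain(..keep_from);
    }
}

fn main() -> io::Result<()> {
    let input = b"id=SECRET;again SECRET!";
    let mut out = Vec::new();
    // A chunk size of 4 forces both matches to straddle chunk boundaries.
    let hits = sanitize_stream(&input[..], &mut out, b"SECRET", 4)?;
    assert_eq!(hits, 2);
    assert_eq!(String::from_utf8(out).unwrap(), "id=******;again ******!");
    Ok(())
}
```

With a regex pattern the overlap cannot be derived from a fixed needle length, which is presumably why the engine exposes the overlap as a configurable value.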
Quick Start
# 1. Create a plaintext secrets file:
# 2. Encrypt it (recommended for production):
# 3. Remove the plaintext:
# 4. Sanitize a file (prefer env var over --password for passwords):
# 5. Or write to stdout (default) and redirect:
# 6. CI gate — fail the build if secrets are detected:
Quick Start — Stdin Pipes
You can pipe data directly into sanitize:
# Pipe from grep:
|
# Read from stdin, write sanitized output to a file:
|
# Chain with other tools:
| |
Quick Start — Plaintext Secrets (no encryption)
Encryption is recommended but not required. You can use a plaintext secrets file directly:
# Use a plaintext JSON/YAML/TOML secrets file (auto-detected):
# Or explicitly skip encryption with --unencrypted-secrets:
# Deterministic mode works the same way:
No password or SANITIZE_PASSWORD env var is needed when using plaintext secrets. Memory hygiene (zeroization of parsed entries) is preserved.
Installation
From crates.io
From source
Binaries are placed at target/release/sanitize.
As a library
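// Module paths follow the module list in the Library API Reference.
// Argument values below are illustrative; check the API reference for exact signatures.
use sanitize_engine::category::Category;
use sanitize_engine::generator::HmacGenerator;
use sanitize_engine::store::MappingStore;
use std::sync::Arc;

// Create a deterministic generator with a fixed seed.
let generator = Arc::new(HmacGenerator::new([42u8; 32]));

// Create the replacement store (optional capacity limit).
let store = MappingStore::new(generator, None);

// Sanitize a value (one-way).
let sanitized = store.get_or_insert(&Category::Email, "alice@corp.com").unwrap();
assert!(sanitized.contains('@'));
assert_eq!(sanitized.len(), "alice@corp.com".len());

// Same input → same output (per-run consistency).
let again = store.get_or_insert(&Category::Email, "alice@corp.com").unwrap();
assert_eq!(sanitized, again);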
Requirements
- Rust 1.74 or later (stable toolchain)
Documentation
| Document | Description |
|---|---|
| CLI Reference | Full sanitize command reference (including encrypt and decrypt subcommands), secrets file format, and usage examples. |
| Structured Processing | File-type profiles, field rules, processor-specific options, and structured vs literal comparison. |
| Supported Categories | All 18 built-in replacement categories with strategies and examples, plus custom categories. |
| Pluggable Strategies | The Strategy trait, 5 built-in strategies, and guide to writing custom strategies. |
| Library API Reference | Module-by-module public API tables (scanner, store, generator, strategy, processor, archive, report, atomic, secrets, error, category). |
| Defensive Limits & Streaming | Streaming chunking model, archive processing flow, and all defensive size/depth/count limits. |
| Architecture | Internal architecture, data flow diagrams, module map, concurrency model, and streaming design. |
| Security | Security properties, threat mitigations, encryption details, zeroization strategy, and threat model. |
| Contributing | Build instructions, test suite, fuzz targets, linting, and PR guidelines. |
| Changelog | Release history and version notes. |
Supported Formats
| Format | Processor | Detection |
|---|---|---|
| Plain text | StreamScanner (chunk + overlap) | Default fallback for all files |
| JSON | JsonProcessor | Profile match or {/[ heuristic |
| YAML | YamlProcessor | Profile match or ---/- /: heuristic |
| XML | XmlProcessor | Profile match or <?xml/< heuristic |
| CSV / TSV | CsvProcessor | Profile match only |
| Key-value | KeyValueProcessor | Profile match only |
| Tar | ArchiveProcessor | .tar extension |
| Tar.gz / .tgz | ArchiveProcessor | .tar.gz / .tgz extension |
| Zip | ArchiveProcessor | .zip extension |
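The detection column can be read as a small decision function. The sketch below is one possible reading, not the engine's actual logic: the `Format` enum and `detect` function are hypothetical, and the order of checks, the case-insensitive extension match, and the interpretation of the `:` heuristic as "first line contains a colon" are all assumptions. Profile matching, which takes part in detection for the structured formats, is left out.

```rust
#[derive(Debug, PartialEq)]
enum Format {
    Tar,
    TarGz,
    Zip,
    Json,
    Xml,
    Yaml,
    PlainText,
}

/// Guess a format from the file name and the first bytes of content.
/// ASSUMPTION: archives are checked first by extension, then content
/// heuristics run in the order JSON, XML, YAML, with plain text as fallback.
fn detect(file_name: &str, head: &str) -> Format {
    let name = file_name.to_ascii_lowercase();
    if name.ends_with(".tar.gz") || name.ends_with(".tgz") {
        return Format::TarGz;
    }
    if name.ends_with(".tar") {
        return Format::Tar;
    }
    if name.ends_with(".zip") {
        return Format::Zip;
    }

    let trimmed = head.trim_start();
    if trimmed.starts_with('{') || trimmed.starts_with('[') {
        return Format::Json;
    }
    if trimmed.starts_with('<') {
        // Also covers the "<?xml" declaration.
        return Format::Xml;
    }
    let first_line = trimmed.lines().next().unwrap_or("");
    if trimmed.starts_with("---") || first_line.starts_with("- ") || first_line.contains(':') {
        return Format::Yaml;
    }
    Format::PlainText
}

fn main() {
    assert_eq!(detect("backup.tgz", ""), Format::TarGz);
    assert_eq!(detect("data", "  {\"user\": \"alice\"}"), Format::Json);
    assert_eq!(detect("data", "<?xml version=\"1.0\"?>"), Format::Xml);
    assert_eq!(detect("data", "user: alice\n"), Format::Yaml);
    assert_eq!(detect("notes.txt", "hello world"), Format::PlainText);
}
```

CSV and key-value files have no content heuristic in the table, so in this sketch they would fall through to plain text unless a profile selects their processor.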
Security Model
Replacements are one-way within the system design. No reverse mapping is stored or recoverable from sanitized output alone. The MappingStore forward map lives only in process memory, is never persisted to disk, and is zeroized on drop. There is no restore or decrypt-output mode.
Key security properties:
- Encryption at rest — Secrets files are encrypted with AES-256-GCM (PBKDF2-HMAC-SHA256, 600 000 iterations). Plaintext secrets are also supported.
- Zeroization — HMAC keys, secret entries, mapping store keys, and decrypted blobs are zeroized on drop.
- Regex hardening — Per-pattern automaton and DFA size limits (1 MiB each) prevent ReDoS and unbounded memory.
- Defensive limits — Input size caps, recursion depth limits, node-count caps, and pattern count limits bound every parser.
- Zero unsafe — Thread safety through DashMap and Arc. Send + Sync bounds verified at compile time.
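The defensive-limits property can be pictured as a budget that every recursive walk must consult. The following sketch assumes a toy tree type and hypothetical names (`Budget`, `LimitError`, `walk`, `nested`); the real limits and their default values are documented separately and are not reproduced here.

```rust
#[derive(Debug, PartialEq)]
enum LimitError {
    TooDeep,
    TooManyNodes,
}

/// Hypothetical per-parse budget: a depth cap and a node-count cap.
struct Budget {
    max_depth: usize,
    max_nodes: usize,
    nodes_seen: usize,
}

/// Toy document tree standing in for parsed JSON/YAML/XML.
enum Node {
    Leaf,
    List(Vec<Node>),
}

/// Visit every node, failing fast once either cap is exceeded.
fn walk(node: &Node, depth: usize, budget: &mut Budget) -> Result<(), LimitError> {
    if depth > budget.max_depth {
        return Err(LimitError::TooDeep);
    }
    budget.nodes_seen += 1;
    if budget.nodes_seen > budget.max_nodes {
        return Err(LimitError::TooManyNodes);
    }
    if let Node::List(children) = node {
        for child in children {
            walk(child, depth + 1, budget)?;
        }
    }
    Ok(())
}

/// Build a list nested `levels` deep around a single leaf.
fn nested(levels: usize) -> Node {
    (0..levels).fold(Node::Leaf, |inner, _| Node::List(vec![inner]))
}

fn main() {
    let doc = nested(3); // 3 lists + 1 leaf = 4 nodes, leaf at depth 3

    let mut ok = Budget { max_depth: 3, max_nodes: 4, nodes_seen: 0 };
    assert_eq!(walk(&doc, 0, &mut ok), Ok(()));

    let mut shallow = Budget { max_depth: 2, max_nodes: 100, nodes_seen: 0 };
    assert_eq!(walk(&doc, 0, &mut shallow), Err(LimitError::TooDeep));

    let mut small = Budget { max_depth: 10, max_nodes: 3, nodes_seen: 0 };
    assert_eq!(walk(&doc, 0, &mut small), Err(LimitError::TooManyNodes));
}
```

Returning an error rather than truncating keeps the failure visible to the caller, which can then decide whether to reject the input or use a different processing path.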
For the full security model, threat mitigations, and out-of-scope threats, see SECURITY.md.
Examples
Sanitize a single file:
Write output to a file:
Pipe from another command:
|
Deterministic mode (same seed → same replacements every run):
Fail CI if secrets are detected:
See docs/cli-reference.md for the complete set of examples including archive processing, stdin pipes, dry-run, plaintext secrets, and custom chunk sizes.
Security note: Prefer -P/--password-file or the SANITIZE_PASSWORD environment variable over -p/--password to avoid exposing the password in process listings and shell history.
Limitations
- No restore. Replacements are one-way by design. There is no undo, decrypt-output, or reverse-mapping capability.
- Deterministic mode caveats. Deterministic replacements require the same secrets key and the same secret values to produce identical output. Changing the secrets file or key produces entirely different replacements.
- Structured fallback. Files exceeding structured processor size limits silently fall back to the streaming scanner. The streaming scanner performs byte-level regex replacement and does not understand document structure — it may match inside JSON keys, XML tags, or other structural elements.
- YAML formatting. serde_yaml normalizes some whitespace during serialization. Minor formatting differences from the original are possible.
- Zeroization scope. Zeroization covers secrets, HMAC keys, and mapping store keys. It does not cover incidental copies the Rust compiler may create (e.g. during optimization passes). This is an inherent limitation of safe Rust zeroization.
- Sequential archive processing. Archive entries are processed sequentially (not in parallel) to preserve deterministic ordering.
- Binary detection. Entries detected as binary are skipped by default. Use --include-binary to override.
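The structured-fallback limitation is easiest to see side by side. The sketch below contrasts a byte-level replacement with a structure-aware one on a single key=value line. Both functions are hypothetical illustrations using literal matching and a fixed '*' mask, not the engine's processors or replacement strategies.

```rust
/// Byte-level style: replace every occurrence, wherever it appears.
fn literal_replace(line: &str, secret: &str) -> String {
    line.replace(secret, &"*".repeat(secret.len()))
}

/// Structure-aware style: split `key=value` and touch only the value.
fn key_value_replace(line: &str, secret: &str) -> String {
    match line.split_once('=') {
        Some((key, value)) => {
            let masked = value.replace(secret, &"*".repeat(secret.len()));
            format!("{key}={masked}")
        }
        // Lines without '=' carry no value to replace in this sketch.
        None => line.to_string(),
    }
}

fn main() {
    let line = "token_name=token";

    // The byte-level pass also rewrites the key, changing the structure.
    assert_eq!(literal_replace(line, "token"), "*****_name=*****");

    // The structure-aware pass leaves the key intact.
    assert_eq!(key_value_replace(line, "token"), "token_name=*****");
}
```

When a file falls back to the streaming scanner, results resemble the first function: correct masking of the secret, but with no guarantee that keys or tags stay untouched.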
Security Disclosure
If you discover a security vulnerability in this project, please report it responsibly. Do not open a public issue for security-sensitive findings.
Contact the maintainers via the security contact configured in the repository. If no security contact is listed, open a private security advisory through the repository hosting platform or contact the maintainers directly via the email address in Cargo.toml or commit history.
Include:
- Description of the vulnerability.
- Steps to reproduce.
- Potential impact assessment.
Maintainers will acknowledge receipt within 5 business days and aim to provide a fix or mitigation timeline within 30 days.
Stability
This project follows Semantic Versioning. While below 1.0, breaking changes will bump the minor version.
- Stable guarantees: One-way replacement, deterministic mode (same seed → same output), length preservation, encrypted secrets format.
- May evolve: CLI flag names, report JSON schema, processor heuristics, default limit values.
See CHANGELOG.md for release history.
License
Licensed under the Apache License, Version 2.0. See LICENSE for the full text.