rust-sanitize 0.11.0

Deterministic one-way data sanitization engine
Documentation

rust-sanitize

CI crates.io docs.rs License MSRV

Scrub sensitive data from logs and configs before sharing them — with support teams, vendors, or AI tools.

rust-sanitize replaces API keys, emails, IPs, passwords, tokens, and other secrets with structurally plausible substitutes. Replacements are one-way: no mapping file is stored, there's no restore mode, and nothing sensitive persists after the run.

Works as a CLI, a Rust library, and an MCP server — so AI assistants like Claude and Cursor can sanitize files on your behalf before raw content ever reaches the model.


MCP: Clean Before the LLM Sees It

Install the MCP server and your AI assistant can sanitize files directly — secrets stay inside the audited Rust process and never enter the context window.

Step 1 — Install the binaries

Download sanitize and sanitize-mcp for your platform from the Releases page and place them on your $PATH (e.g. /usr/local/bin/). No Deno or Node required — the runtime is embedded.

Step 2 — Register with your AI tool

Claude Code:

claude mcp add rust-sanitize /usr/local/bin/sanitize-mcp \
  -e SANITIZE_BIN=/usr/local/bin/sanitize

Claude Desktop (claude_desktop_config.json):

{
  "mcpServers": {
    "rust-sanitize": {
      "command": "/usr/local/bin/sanitize-mcp",
      "env": { "SANITIZE_BIN": "/usr/local/bin/sanitize" }
    }
  }
}

Once connected, the assistant can sanitize inline text or files, scan for leaks without modifying anything, or produce a pre-structured prompt ready for incident triage or config review. See docs/mcp.md for the full tool reference, all parameter examples, and setup instructions for Cursor, Neovim, and OpenCode.


Install

CLI from crates.io:

cargo install rust-sanitize

From source:

git clone https://github.com/kayelohbyte/rust-sanitize.git
cd rust-sanitize
cargo build --release
# Binary: target/release/sanitize

Windows: Requires the MSVC linker. Install Visual Studio Build Tools and select the Desktop development with C++ workload.

As a Rust library:

cargo add rust-sanitize

MCP binary: Download sanitize-mcp for your platform from the Releases page — no Deno or Node required, the runtime is embedded.


Quick Start

No setup — scan immediately

When you run sanitize with no secrets file or app bundle, the built-in patterns activate automatically. They cover the most common secrets: API keys (AWS, GCP, GitHub, Stripe, Slack, OpenAI, Anthropic, HuggingFace, and more), JWTs, emails, IPv4/IPv6, UUIDs, MAC addresses, PEM headers, credential URLs, and password/secret key=value pairs.

sanitize server.log

Output goes to server-sanitized.log next to the source. Use -o /path/to/output to override, or -o - for stdout.

# Dry-run — see what would be replaced without writing anything:
sanitize server.log -n

# CI gate — fail if secrets are detected:
sanitize config.yaml --fail-on-match

Guided setup — answer a few questions, get a tailored config

sanitize guided

The wizard asks for your workspace type (Generic, Web app, Kubernetes, Database, AWS), replacement strictness, company domains, and which file formats to cover. It produces two files:

  • secrets.guided.yaml — your pattern set, optionally encrypted
  • secrets.guided.profile.yaml — structured field rules for the formats you chose
sanitize server.log config.yaml \
  -s secrets.guided.yaml \
  --profile secrets.guided.profile.yaml

Aggressive strictness also matches broad hostnames, short container IDs, and high-entropy token patterns — recommended when sharing logs with an LLM.

App bundles — zero config for common applications

Built-in bundles for 22 applications pair a secrets pattern set with a structured field profile so field-level sanitization works out of the box, no authoring required.

sanitize /etc/gitlab/gitlab.rb --app gitlab
sanitize nginx.conf docker-compose.yml --app nginx,docker-compose
sanitize values.yaml --app kubernetes

# See all available bundles:
sanitize apps

Built-in bundles: ansible, aws-cli, circleci, django, docker-compose, elasticsearch, fstab, github-actions, gitlab, grafana, heroku, kubernetes, laravel, mongodb, mysql, nginx, postgresql, rails, redis, splunk, spring-boot, terraform.


Common Workflows

Multiple files and archives in one pass

sanitize server.log config.yaml backup.zip -s patterns.yaml
# Produces: server-sanitized.log  config-sanitized.yaml  backup.sanitized.zip

# Send all outputs to a directory:
sanitize server.log config.yaml backup.zip -s patterns.yaml -o /tmp/clean/

Pipe from another command

Stdin is sanitized and sent to stdout automatically.

grep "error" server.log | sanitize -s patterns.yaml
mysqldump mydb | sanitize | gzip > dump.sql.gz

# Mix stdin with file inputs (stdin → stdout; files → per-file siblings):
cat extra.log | sanitize - config.yaml -s patterns.yaml

When reading from stdin and the format can't be inferred from a filename, use -f to specify it: -f yaml, -f json, -f csv, -f log, etc.

CI secrets gate

# Fail the build if secrets are detected:
sanitize config.yaml -s patterns.yaml --fail-on-match

# Same with encrypted patterns file:
SANITIZE_PASSWORD="..." sanitize config.yaml -s patterns.enc --encrypted-secrets --fail-on-match

# Stream per-match findings as NDJSON for jq or SIEM ingest:
sanitize server.log -s patterns.yaml --dry-run --findings | \
  jq 'select(.type=="file") | .pattern_counts'

Archive entry filtering

Filter which entries inside an archive are processed. * matches within a single directory segment, ** crosses directory boundaries, trailing / matches a subtree.

# Keep only the config directory:
sanitize backup.zip --only 'config/' -s patterns.yaml

# Keep all JSON, drop the secrets file:
sanitize backup.zip --only '**/*.json' --exclude config/secrets.json -s patterns.yaml

# Independent filters per archive in one command:
sanitize a.zip --only 'config/' b.tar.gz --only '**/*.log' -s patterns.yaml

Structured field rules (--profile)

Replace specific named fields in YAML, JSON, TOML, CSV, .env, and INI files. Comments, indentation, key ordering, and unmatched values are preserved exactly.

# fields.yaml
- processor: yaml
  extensions: [".yaml", ".yml"]
  fields:
    - pattern: "*.password"
      category: "custom:password"
    - pattern: "*.username"
      category: email

- processor: jsonl
  extensions: [".jsonl", ".ndjson", ".log"]
  options:
    skip_invalid: "true"   # pass non-JSON lines through unchanged
  fields:
    - pattern: "*.email"
      category: email
    - pattern: "*.ip"
      category: ipv4
sanitize config.yaml server.log --profile fields.yaml -s patterns.yaml

When --profile is active, values discovered in structured fields are automatically written back to the patterns file as literals so the streaming scanner can match them in other files too. See Structured Processing for the full field pattern syntax and two-phase pipeline.

Encrypted patterns file

# Encrypt once:
sanitize encrypt patterns.yaml patterns.yaml.enc --password

# Use interactively:
sanitize data.log -s patterns.enc --encrypted-secrets -p

# Non-interactive (CI / pipes):
SANITIZE_PASSWORD="my-password" sanitize data.log -s patterns.enc --encrypted-secrets
# Or read from a file:
sanitize data.log -s patterns.enc --encrypted-secrets -P /run/secrets/pw

Allowlist — pass specific values through unchanged

sanitize data.log -s patterns.yaml --allow localhost --allow "*.internal" --allow "192.168.1.*"

For project-stable allowlists, add kind: allow entries directly to your patterns file:

- pattern: "*.internal"
  kind: allow
- pattern: "192.168.1.*"
  kind: allow

Deterministic mode

Same seed + same input produces identical replacements across runs and machines — useful when correlating sanitized data across multiple files or sharing a reproducible dataset with a team.

sanitize data.csv -s patterns.yaml -d

Shannon entropy detection

Catch high-entropy tokens not covered by any pattern — useful for novel API keys, obfuscated secrets, or anything that wasn't anticipated when the patterns file was written.

sanitize server.log -s patterns.yaml --entropy-threshold 4.5

# Dry-run prints a calibration histogram to help tune the threshold:
sanitize server.log -s patterns.yaml -n --entropy-threshold 4.5

LLM-ready output

Sanitize and produce a structured prompt in one step:

sanitize server.log -s patterns.yaml --llm                  # incident triage
sanitize nginx.conf --app nginx --llm review-config         # configuration review
sanitize nginx.conf --app nginx --llm review-security       # security posture review

The prompt includes a ## Files Analyzed manifest and embeds sanitized content inline (<content> blocks). For large file sets or agentic LLMs that can read files with their own tools, add --output to switch to reference mode — files are written to disk and the prompt lists their absolute paths instead:

sanitize logs/ -s patterns.yaml --llm review-security --output /tmp/sanitized/

Supported Formats

Format Detection
Plain text / log Default fallback for all files
JSON Profile match or {/[ heuristic
NDJSON / JSON Lines Profile match or multi-line { heuristic; streaming — bounded memory for GB-scale log files
YAML Profile match or ---/- /: heuristic
TOML Profile match or [section] heuristic
XML Profile match or <?xml/< heuristic
CSV / TSV Profile match only
.env Profile match only
INI / .conf Profile match only
Key-value Profile match only
Log lines (mixed) Profile match only
Tar .tar extension
Tar.gz / .tgz .tar.gz / .tgz extension
Zip .zip extension

Library

cargo add rust-sanitize
use rust_sanitize::category::Category;
use rust_sanitize::generator::HmacGenerator;
use rust_sanitize::store::MappingStore;
use std::sync::Arc;

// Deterministic generator seeded with a fixed 32-byte key.
let generator = Arc::new(HmacGenerator::new([42u8; 32]));
let store = MappingStore::new(generator, None);

// One-way replacement.
let sanitized = store.get_or_insert(&Category::Email, "alice@corp.com").unwrap();
assert!(sanitized.contains("@corp.com"));
assert_eq!(sanitized.len(), "alice@corp.com".len());

// Same input → same output within a run.
let again = store.get_or_insert(&Category::Email, "alice@corp.com").unwrap();
assert_eq!(sanitized, again);

See Library API Reference for the full module-by-module API.


Security Model

Replacements are one-way by design. No reverse mapping is stored or recoverable from sanitized output alone. The MappingStore forward map lives only in process memory and is zeroized on drop.

Key properties:

  • Encryption at rest — secrets files use AES-256-GCM (PBKDF2-HMAC-SHA256, 600 000 iterations). Plaintext files are also supported.
  • Zeroization — HMAC keys, secret entries, mapping-store keys, and decrypted blobs are zeroized on drop.
  • Regex hardening — per-pattern automaton and DFA size limits (1 MiB each) prevent ReDoS and unbounded memory growth.
  • Defensive limits — input size caps, recursion depth limits, node-count caps, and pattern-count limits bound every parser.
  • Zero unsafe — thread safety through DashMap and Arc; Send + Sync bounds verified at compile time.

See SECURITY.md for the full threat model and mitigations.


Documentation

Document Description
MCP Reference MCP server setup, all tool parameters, JSON examples, IDE configs (Cursor, Neovim, OpenCode), and namespace-based multi-tenant setup.
CLI Reference Full sanitize command reference including all flags, subcommands, secrets file format, and examples.
Structured Processing --profile usage, field patterns, two-phase pipeline, format preservation, and processor options.
Supported Categories All 18 built-in replacement categories with strategies and examples, plus custom categories.
Pluggable Strategies The Strategy trait, 5 built-in strategies, and guide to writing custom strategies.
Library API Reference Module-by-module public API tables.
Defensive Limits & Streaming Streaming chunking model, archive processing flow, and all defensive size/depth/count limits.
Architecture Internal architecture, data flow, module map, concurrency model, and streaming design.
Security Security properties, threat mitigations, encryption details, and zeroization strategy.
Contributing Build instructions, test suite, fuzz targets, linting, and PR guidelines.
Changelog Release history and version notes.

Limitations

  • No restore. Replacements are one-way by design. No undo, decrypt-output, or reverse-mapping capability.
  • Structured-to-scanner handoff. When --profile is active, discovered values are appended to your secrets file as kind: literal entries so the scanner can find them in other files. Use --no-structured-handoff to suppress the write if needed.
  • Structured processor size limit. Files over 256 MiB (or --max-structured-size) fall back to the streaming scanner, which replaces raw bytes without document awareness. In practice this only affects large serialized data dumps, not real config files.
  • Deterministic mode caveats. Identical output requires the same secrets file and the same seed. Changing either produces completely different replacements.
  • Zeroization scope. Covers secrets, HMAC keys, and mapping-store keys. Incidental copies the Rust compiler creates during optimization passes are not covered — an inherent limitation of safe Rust zeroization.
  • Large archive sequential fallback. Zip and tar archives whose total uncompressed content exceeds 256 MiB are processed sequentially rather than in parallel to avoid unbounded memory use.
  • Binary detection. Entries detected as binary are skipped by default. Use --include-binary to override.

Security Disclosure

Do not open a public issue for security-sensitive findings. Report privately via a security advisory on the repository or via the maintainer contact in Cargo.toml. Include a description, reproduction steps, and potential impact. Maintainers will acknowledge within 5 business days and provide a fix or mitigation timeline within 30 days.


Stability

This project follows Semantic Versioning. As of 0.8.0, the public library API and CLI interface are considered stable. Breaking changes will be avoided but may occur in minor releases until 1.0.0. The MSRV is 1.74 (stable toolchain), declared under rust-version in Cargo.toml and enforced in CI.

See CHANGELOG.md for release history.


License

Licensed under the Apache License, Version 2.0. See LICENSE for the full text.