§rust-sanitize
Deterministic, one-way data sanitization engine.
This crate provides the core infrastructure for replacing sensitive values with category-aware, deterministic substitutes. Replacements are one-way only: there is no key file, mapping table, or restore mode. The crate is the foundation layer consumed by higher-level streaming and CLI components.
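To make "category-aware, deterministic, one-way" concrete, here is a minimal sketch for the email category. It is not the crate's implementation: `sanitize_email` is a hypothetical helper, and the standard library's `DefaultHasher` stands in for the keyed HMAC that the real generator uses, so the sketch has none of HMAC's security properties. It only shows the shape of the guarantee: the domain is kept, the byte length is preserved, and the same seed and input always give the same output.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical helper: replace the local part of an email with
/// pseudo-random lowercase letters derived from `seed` and the input.
/// ASSUMPTION: DefaultHasher is a non-cryptographic stand-in for HMAC.
fn sanitize_email(seed: u64, email: &str) -> String {
    // Split into local part and optional domain.
    let (local, domain) = match email.split_once('@') {
        Some((l, d)) => (l, Some(d)),
        None => (email, None),
    };

    let mut out = String::with_capacity(email.len());
    // One ASCII letter per input byte keeps the byte length unchanged.
    for i in 0..local.len() {
        let mut h = DefaultHasher::new();
        seed.hash(&mut h);
        email.hash(&mut h);
        i.hash(&mut h);
        let letter = b'a' + (h.finish() % 26) as u8;
        out.push(letter as char);
    }
    // Keep the domain so the result still looks like an email address.
    if let Some(d) = domain {
        out.push('@');
        out.push_str(d);
    }
    out
}
```

Calling `sanitize_email(42, "alice@corp.com")` twice returns the same string both times, and nothing is stored that could map the output back to the input.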
§Key Components
- category::Category: Classification of sensitive values (email, IP, name, etc.) that determines replacement format.
- generator::ReplacementGenerator: Trait abstracting replacement strategy (HMAC-deterministic or CSPRNG-random).
- strategy::Strategy: Pluggable replacement strategies that can be called directly without any mapping table.
- store::MappingStore: Optional thread-safe per-run dedup cache ensuring the same input always maps to the same output within a run.
- scanner::StreamScanner: Streaming regex scanner with chunk + overlap for bounded-memory processing.
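The "chunk + overlap" idea behind the streaming scanner can be sketched with the standard library alone. This is an illustration under stated assumptions, not the crate's code: `scan_chunked` and `replace_up_to` are hypothetical names, and a literal byte pattern stands in for a regex. The input is consumed in fixed-size chunks, and the last `overlap` bytes of each window are held back so that a match straddling a chunk boundary is still seen whole.

```rust
/// Copy `window` into `out`, replacing `needle` with `replacement`.
/// New matches may only start at positions below `limit`.
/// Returns (bytes consumed from `window`, number of matches).
fn replace_up_to(
    window: &[u8],
    limit: usize,
    needle: &[u8],
    replacement: &[u8],
    out: &mut Vec<u8>,
) -> (usize, usize) {
    let (mut pos, mut hits) = (0, 0);
    while pos < limit {
        if window[pos..].starts_with(needle) {
            out.extend_from_slice(replacement);
            pos += needle.len();
            hits += 1;
        } else {
            out.push(window[pos]);
            pos += 1;
        }
    }
    (pos, hits)
}

/// Hypothetical chunked scanner: memory use is bounded by roughly
/// `chunk_size + overlap + needle.len()` regardless of input size.
fn scan_chunked(
    input: &[u8],
    needle: &[u8],
    replacement: &[u8],
    chunk_size: usize,
    overlap: usize,
) -> (Vec<u8>, usize) {
    assert!(chunk_size > 0 && !needle.is_empty());
    // The held-back tail must be able to contain all but one byte of a match.
    assert!(overlap + 1 >= needle.len());

    let mut out = Vec::with_capacity(input.len());
    let mut window: Vec<u8> = Vec::new();
    let mut total = 0;

    for chunk in input.chunks(chunk_size) {
        window.extend_from_slice(chunk);
        // Bytes at or beyond `limit` might begin a match that finishes
        // in the next chunk, so they are not processed yet.
        let limit = window.len().saturating_sub(overlap);
        let (consumed, hits) = replace_up_to(&window, limit, needle, replacement, &mut out);
        window.drain(..consumed);
        total += hits;
    }

    // End of input: nothing more can arrive, so flush the held-back tail.
    let (_, hits) = replace_up_to(&window, window.len(), needle, replacement, &mut out);
    (out, total + hits)
}
```

With an 8-byte chunk size and a 13-byte overlap, scanning `b"Contact alice@corp.com for details."` for `b"alice@corp.com"` gives the same result as scanning the whole buffer at once, even though the match spans several chunks.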
§Concurrency Model
The MappingStore uses DashMap (shard-level locking) for the forward dedup cache. All types are Send + Sync.
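For readers unfamiliar with shard-level locking, the following is a rough standard-library sketch of the idea. It is not how DashMap or MappingStore is actually implemented: `ShardedCache` and its methods are hypothetical names. Keys are hashed to one of several independently locked maps, so threads touching different shards do not block each other, while each key still gets exactly one stored value.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

/// Hypothetical sharded dedup cache: one mutex per shard rather than
/// one mutex around the whole map.
struct ShardedCache {
    shards: Vec<Mutex<HashMap<String, String>>>,
}

impl ShardedCache {
    fn new(shard_count: usize) -> Self {
        assert!(shard_count > 0);
        let shards = (0..shard_count)
            .map(|_| Mutex::new(HashMap::new()))
            .collect();
        Self { shards }
    }

    /// Pick the shard responsible for `key`.
    fn shard_index(&self, key: &str) -> usize {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        (h.finish() as usize) % self.shards.len()
    }

    /// Return the cached value for `key`, creating it with `make` on
    /// first use. Only the owning shard is locked during the call.
    fn get_or_insert_with(&self, key: &str, make: impl FnOnce() -> String) -> String {
        let idx = self.shard_index(key);
        let mut shard = self.shards[idx].lock().unwrap();
        shard.entry(key.to_string()).or_insert_with(make).clone()
    }
}
```

Because the shard lock is held across the lookup and the insert, two threads racing on the same key cannot both run `make`; the second caller sees the value the first one stored. Since `Mutex<HashMap<String, String>>` is `Send + Sync`, a `ShardedCache` can be shared across threads behind an `Arc`.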
§Stability
As of 0.8.0 the public API is considered stable and follows Semantic Versioning. Breaking changes require a major version bump. The core guarantees — one-way replacement, deterministic mode, and length preservation — are stable across all 1.x releases. Processor heuristics, default limit values, and report schema may change in minor releases (additive only).
§Example: Store-Level Replacement
use sanitize_engine::category::Category;
use sanitize_engine::generator::HmacGenerator;
use sanitize_engine::store::MappingStore;
use std::sync::Arc;
// Create a deterministic generator with a fixed seed.
let generator = Arc::new(HmacGenerator::new([42u8; 32]));
// Create the replacement store (optional capacity limit).
let store = MappingStore::new(generator, None);
// Sanitize a value (one-way).
let sanitized = store.get_or_insert(&Category::Email, "alice@corp.com").unwrap();
assert!(sanitized.contains("@corp.com"));
assert_eq!(sanitized.len(), "alice@corp.com".len());
// Same input → same output (per-run consistency).
let again = store.get_or_insert(&Category::Email, "alice@corp.com").unwrap();
assert_eq!(sanitized, again);
§Example: Streaming Scanner
use sanitize_engine::category::Category;
use sanitize_engine::generator::HmacGenerator;
use sanitize_engine::scanner::{ScanConfig, ScanPattern, StreamScanner};
use sanitize_engine::store::MappingStore;
use std::sync::Arc;
// Build patterns.
let patterns = vec![
    ScanPattern::from_regex(r"alice@corp\.com", Category::Email, "alice_email").unwrap(),
];
// Store with deterministic generator.
let generator = Arc::new(HmacGenerator::new([42u8; 32]));
let store = Arc::new(MappingStore::new(generator, Some(1_000_000)));
// Scanner with default chunk config.
let config = ScanConfig::new(1_048_576, 4096);
let scanner = StreamScanner::new(patterns, store, config).unwrap();
// Scan bytes in-memory.
let input = b"Contact alice@corp.com for details.";
let (output, stats) = scanner.scan_bytes(input).unwrap();
assert_eq!(stats.replacements_applied, 1);
assert_eq!(output.len(), input.len());
§Example: Log Context Extraction
After sanitizing, scan the output for error/warning keywords and capture surrounding lines for LLM-friendly triage:
use sanitize_engine::log_context::{extract_context, LogContextConfig};
let sanitized = "INFO request received\n\
ERROR disk full on /dev/sda1\n\
INFO retrying mount\n\
WARN filesystem degraded\n\
INFO recovery complete";
let config = LogContextConfig::new().with_context_lines(1);
let result = extract_context(sanitized, &config);
// Two keyword hits: "error" and "warn".
assert_eq!(result.match_count, 2);
// First match: ERROR line with one line of context on each side.
assert_eq!(result.matches[0].keyword, "error");
assert_eq!(result.matches[0].before, vec!["INFO request received"]);
assert_eq!(result.matches[0].after, vec!["INFO retrying mount"]);
Re-exports§
pub use atomic::atomic_write;
pub use atomic::atomic_write_private;
pub use atomic::AtomicFileWriter;
pub use category::Category;
pub use error::Result;
pub use error::SanitizeError;
pub use generator::HmacGenerator;
pub use generator::RandomGenerator;
pub use generator::ReplacementGenerator;
pub use llm::format_llm_prompt;
pub use llm::format_llm_prompt_reference;
pub use llm::resolve_llm_template;
pub use llm::LlmEntry;
pub use llm::LlmPathEntry;
pub use llm::PROMPT_PREAMBLE;
pub use llm::TEMPLATE_REVIEW_CONFIG;
pub use llm::TEMPLATE_REVIEW_SECURITY;
pub use llm::TEMPLATE_TROUBLESHOOT;
pub use log_context::extract_context;
pub use log_context::extract_context_reader;
pub use log_context::LogContextConfig;
pub use log_context::LogContextMatch;
pub use log_context::LogContextResult;
pub use log_context::DEFAULT_CONTEXT_LINES;
pub use log_context::DEFAULT_KEYWORDS;
pub use log_context::DEFAULT_MAX_MATCHES;
pub use processor::archive::ArchiveFilter;
pub use processor::archive::ArchiveFormat;
pub use processor::archive::ArchiveProcessor;
pub use processor::archive::ArchiveProgress;
pub use processor::archive::ArchiveStats;
pub use processor::archive::EntryCallback;
pub use processor::FieldNameSignal;
pub use processor::FieldRule;
pub use processor::FileTypeProfile;
pub use processor::Processor;
pub use processor::ProcessorRegistry;
pub use processor::DEFAULT_FIELD_SIGNAL_THRESHOLD;
pub use report::FileReport;
pub use report::ReportBuilder;
pub use report::ReportMetadata;
pub use report::SanitizeReport;
pub use scanner::ScanConfig;
pub use scanner::ScanPattern;
pub use scanner::ScanProgress;
pub use scanner::ScanStats;
pub use scanner::StreamScanner;
pub use secrets::decrypt_secrets;
pub use secrets::encrypt_secrets;
pub use secrets::load_secrets_auto;
pub use secrets::looks_encrypted;
pub use secrets::SecretEntry;
pub use secrets::SecretsFormat;
pub use store::MappingStore;
pub use strategy::EntropyMode;
pub use strategy::FakeIp;
pub use strategy::HmacHash;
pub use strategy::PreserveLength;
pub use strategy::RandomString;
pub use strategy::RandomUuid;
pub use strategy::Strategy;
pub use strategy::StrategyGenerator;
pub use strip_values::strip_values_from_text;
Modules§
- allowlist
- Allowlist for suppressing specific values from sanitization.
- atomic
- Atomic file writes for crash-safe output.
- category
- Data category types for classifying sensitive values.
- error
- Unified error types for the sanitization engine.
- generator
- Replacement generation strategies.
- llm
- LLM prompt formatting — template resolution and prompt assembly.
- log_context
- Log context extraction: finds keyword-matching lines and captures surrounding context windows for LLM-friendly log triage.
- processor
- Structured processors for format-aware sanitization.
- report
- Structured reporting for sanitization runs.
- scanner
- Streaming scanner for detecting and replacing sensitive data.
- secrets
- Encrypted secrets management.
- store
- Thread-safe, concurrent one-way replacement store.
- strategy
- Pluggable replacement strategies.
- strip_values
- Key-only structure extraction for configuration files.
Constants§
- DEFAULT_ARCHIVE_DEPTH
- Default maximum nesting depth for recursive archive processing.