Crate sanitize_engine

§rust-sanitize

Deterministic, one-way data sanitization engine.

This crate provides the core infrastructure for replacing sensitive values with category-aware, deterministic substitutes. Replacement is one-way: there is no key file, mapping table, or restore mode. The crate is the foundation layer consumed by higher-level streaming and CLI components.
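
As a rough illustration of what "deterministic, one-way, length-preserving" means in practice, the sketch below derives a replacement for an email address from a seed and the input. It is not this crate's implementation: `fnv1a` and `replace_email_sketch` are hypothetical names, and FNV-1a is a non-cryptographic stand-in for the keyed HMAC a real engine would use.

```rust
/// FNV-1a, a simple non-cryptographic hash, mixed with a seed.
/// Assumption: this only stands in for a keyed HMAC; do not use it
/// to protect real data.
fn fnv1a(seed: u64, data: &[u8]) -> u64 {
    let mut hash = 0xcbf29ce484222325u64 ^ seed;
    for &byte in data {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Hypothetical helper: replace the local part of an email with lowercase
/// letters derived from the hash, keeping the "@domain" suffix and the
/// total byte length. Input without '@' is replaced in full.
fn replace_email_sketch(seed: u64, email: &str) -> String {
    // `rest` is either empty or starts with '@'.
    let (local, rest) = match email.find('@') {
        Some(i) => email.split_at(i),
        None => (email, ""),
    };

    let mut state = fnv1a(seed, email.as_bytes());
    let mut out = String::with_capacity(email.len());
    // Emit one ASCII letter per byte of the local part.
    for _ in 0..local.len() {
        state = fnv1a(state, b"next");
        out.push((b'a' + (state % 26) as u8) as char);
    }
    out.push_str(rest);
    out
}
```

Calling `replace_email_sketch(42, "alice@corp.com")` twice yields the same string, with the `@corp.com` suffix and the 14-byte length intact. Nothing is stored, so there is no table from which the original could be restored.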

§Key Components

category::Category — Classification of sensitive values (email, IP, name, etc.) that determines replacement format.
generator::ReplacementGenerator — Trait abstracting replacement strategy (HMAC-deterministic or CSPRNG-random).
strategy::Strategy — Pluggable replacement strategies that can be called directly without any mapping table.
store::MappingStore — Optional thread-safe per-run dedup cache ensuring the same input always maps to the same output within a run.
scanner::StreamScanner — Streaming regex scanner with chunk + overlap for bounded-memory processing.
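
The chunk + overlap idea behind a streaming scanner can be sketched with the standard library alone. The example below searches for a literal byte string rather than a regex, and `find_in_chunks` is a hypothetical helper, not part of `sanitize_engine`. It carries the last `needle.len() - 1` bytes of each window forward so that a match straddling a chunk boundary is still found, while memory stays bounded by the chunk size plus the overlap.

```rust
use std::io::{self, Read};

/// Hypothetical helper: return the absolute byte offset of every occurrence
/// of `needle` in `reader`. The search window never holds more than
/// `chunk_size + needle.len() - 1` bytes.
fn find_in_chunks<R: Read>(
    mut reader: R,
    needle: &[u8],
    chunk_size: usize,
) -> io::Result<Vec<u64>> {
    assert!(!needle.is_empty() && chunk_size > 0);

    // A boundary-straddling match has at most needle.len() - 1 bytes
    // in the data that came before the newest chunk.
    let overlap = needle.len() - 1;
    let mut window: Vec<u8> = Vec::with_capacity(chunk_size + overlap);
    let mut chunk = vec![0u8; chunk_size];
    let mut window_start: u64 = 0; // absolute offset of window[0]
    let mut hits = Vec::new();

    loop {
        let n = reader.read(&mut chunk)?;
        if n == 0 {
            break; // end of input
        }
        window.extend_from_slice(&chunk[..n]);

        if window.len() >= needle.len() {
            for i in 0..=window.len() - needle.len() {
                if &window[i..i + needle.len()] == needle {
                    hits.push(window_start + i as u64);
                }
            }
        }

        // Keep only the tail. It is shorter than the needle, so a match
        // found above can never be reported a second time.
        let keep = overlap.min(window.len());
        let discard = window.len() - keep;
        window.drain(..discard);
        window_start += discard as u64;
    }
    Ok(hits)
}
```

With a literal needle, an overlap of `needle.len() - 1` is enough. A regex scanner has no fixed match length, which is presumably why the real `ScanConfig` takes an explicit overlap size.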

§Concurrency Model

The MappingStore uses DashMap (shard-level locking) for the forward dedup cache. All types are Send + Sync.
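
To illustrate the per-run dedup guarantee under concurrency, here is a minimal sketch that uses a single `Mutex<HashMap>` from the standard library in place of DashMap. `DedupCacheSketch` and `run_threads_sketch` are hypothetical names, and the counter-based generator is a deliberately non-deterministic stand-in for a random generator.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical dedup cache: the first caller for a given input generates
/// the replacement; every later caller gets the cached value.
struct DedupCacheSketch<F> {
    generate: F,
    seen: Mutex<HashMap<String, String>>,
}

impl<F: Fn(&str) -> String> DedupCacheSketch<F> {
    fn new(generate: F) -> Self {
        DedupCacheSketch {
            generate,
            seen: Mutex::new(HashMap::new()),
        }
    }

    fn get_or_insert(&self, original: &str) -> String {
        // The lock is held while generating, so each input is generated once.
        let mut seen = self.seen.lock().unwrap();
        seen.entry(original.to_string())
            .or_insert_with(|| (self.generate)(original))
            .clone()
    }
}

/// Ask for the same input from four threads. Returns what each thread saw
/// and how many times the generator actually ran.
fn run_threads_sketch() -> (Vec<String>, usize) {
    let calls = Arc::new(AtomicUsize::new(0));
    let counter = Arc::clone(&calls);
    // The generator returns a different value on every call, so any
    // agreement between threads comes from the cache alone.
    let cache = Arc::new(DedupCacheSketch::new(move |_original: &str| {
        let n = counter.fetch_add(1, Ordering::SeqCst);
        format!("replacement-{}", n)
    }));

    let handles: Vec<_> = (0..4)
        .map(|_| {
            let cache = Arc::clone(&cache);
            thread::spawn(move || cache.get_or_insert("alice@corp.com"))
        })
        .collect();

    let results: Vec<String> = handles.into_iter().map(|h| h.join().unwrap()).collect();
    (results, calls.load(Ordering::SeqCst))
}
```

A single mutex serializes every lookup; sharded locking, as in DashMap, lets threads that touch different shards proceed in parallel.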

§Stability

As of 0.8.0 the public API is considered stable and follows Semantic Versioning. Breaking changes require a major version bump. The core guarantees — one-way replacement, deterministic mode, and length preservation — are stable across all 1.x releases. Processor heuristics, default limit values, and report schema may change in minor releases (additive only).

§Example: Store-Level Replacement

use sanitize_engine::category::Category;
use sanitize_engine::generator::HmacGenerator;
use sanitize_engine::store::MappingStore;
use std::sync::Arc;

// Create a deterministic generator with a fixed seed.
let generator = Arc::new(HmacGenerator::new([42u8; 32]));

// Create the replacement store. The second argument is an optional capacity limit; `None` leaves it unset.
let store = MappingStore::new(generator, None);

// Sanitize a value (one-way).
let sanitized = store.get_or_insert(&Category::Email, "alice@corp.com").unwrap();
assert!(sanitized.contains("@corp.com"));
assert_eq!(sanitized.len(), "alice@corp.com".len());

// Same input → same output (per-run consistency).
let again = store.get_or_insert(&Category::Email, "alice@corp.com").unwrap();
assert_eq!(sanitized, again);

§Example: Streaming Scanner

use sanitize_engine::category::Category;
use sanitize_engine::generator::HmacGenerator;
use sanitize_engine::scanner::{ScanConfig, ScanPattern, StreamScanner};
use sanitize_engine::store::MappingStore;
use std::sync::Arc;

// Build patterns.
let patterns = vec![
    ScanPattern::from_regex(r"alice@corp\.com", Category::Email, "alice_email").unwrap(),
];

// Store with deterministic generator.
let generator = Arc::new(HmacGenerator::new([42u8; 32]));
let store = Arc::new(MappingStore::new(generator, Some(1_000_000)));

// Scanner with the default chunk config, written out explicitly: 1 MiB chunks with a 4 KiB overlap.
let config = ScanConfig::new(1_048_576, 4096);
let scanner = StreamScanner::new(patterns, store, config).unwrap();

// Scan bytes in-memory.
let input = b"Contact alice@corp.com for details.";
let (output, stats) = scanner.scan_bytes(input).unwrap();

assert_eq!(stats.replacements_applied, 1);
assert_eq!(output.len(), input.len());

§Example: Log Context Extraction

After sanitizing, scan the output for error/warning keywords and capture surrounding lines for LLM-friendly triage:

use sanitize_engine::log_context::{extract_context, LogContextConfig};

let sanitized = "INFO  request received\n\
                 ERROR disk full on /dev/sda1\n\
                 INFO  retrying mount\n\
                 WARN  filesystem degraded\n\
                 INFO  recovery complete";

let config = LogContextConfig::new().with_context_lines(1);
let result = extract_context(sanitized, &config);

// Two keyword hits: "error" and "warn".
assert_eq!(result.match_count, 2);

// First match: ERROR line with one line of context on each side.
assert_eq!(result.matches[0].keyword, "error");
assert_eq!(result.matches[0].before, vec!["INFO  request received"]);
assert_eq!(result.matches[0].after,  vec!["INFO  retrying mount"]);
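
The keyword-plus-context idea can also be sketched without the crate. The following is an assumption-laden approximation, not the behavior of `extract_context`: `ContextMatchSketch` and `extract_context_sketch` are hypothetical, matching is a case-insensitive substring test, and overlapping context windows are not merged.

```rust
/// Hypothetical result type: one keyword hit plus its surrounding lines.
#[derive(Debug, PartialEq)]
struct ContextMatchSketch {
    keyword: String,
    line: String,
    before: Vec<String>,
    after: Vec<String>,
}

/// Hypothetical helper: find lines containing any keyword (case-insensitive)
/// and capture up to `context_lines` lines on each side.
fn extract_context_sketch(
    text: &str,
    keywords: &[&str],
    context_lines: usize,
) -> Vec<ContextMatchSketch> {
    let lines: Vec<&str> = text.lines().collect();
    let mut matches = Vec::new();

    for (i, line) in lines.iter().enumerate() {
        let lowered = line.to_lowercase();
        // Report the first keyword that appears in this line, if any.
        let hit = keywords
            .iter()
            .find(|k| lowered.contains(k.to_lowercase().as_str()));

        if let Some(keyword) = hit {
            // Clamp the window to the start and end of the input.
            let start = i.saturating_sub(context_lines);
            let end = (i + 1).saturating_add(context_lines).min(lines.len());
            matches.push(ContextMatchSketch {
                keyword: keyword.to_lowercase(),
                line: line.to_string(),
                before: lines[start..i].iter().map(|s| s.to_string()).collect(),
                after: lines[i + 1..end].iter().map(|s| s.to_string()).collect(),
            });
        }
    }
    matches
}
```

Run against the five-line log above with keywords `["error", "warn"]` and one line of context, this sketch reports two matches, and the first carries the same `before` and `after` lines that the crate example asserts.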

Re-exports§

pub use atomic::atomic_write;
pub use atomic::atomic_write_private;
pub use atomic::AtomicFileWriter;
pub use category::Category;
pub use error::Result;
pub use error::SanitizeError;
pub use generator::HmacGenerator;
pub use generator::RandomGenerator;
pub use generator::ReplacementGenerator;
pub use llm::format_llm_prompt;
pub use llm::format_llm_prompt_reference;
pub use llm::resolve_llm_template;
pub use llm::LlmEntry;
pub use llm::LlmPathEntry;
pub use llm::PROMPT_PREAMBLE;
pub use llm::TEMPLATE_REVIEW_CONFIG;
pub use llm::TEMPLATE_REVIEW_SECURITY;
pub use llm::TEMPLATE_TROUBLESHOOT;
pub use log_context::extract_context;
pub use log_context::extract_context_reader;
pub use log_context::LogContextConfig;
pub use log_context::LogContextMatch;
pub use log_context::LogContextResult;
pub use log_context::DEFAULT_CONTEXT_LINES;
pub use log_context::DEFAULT_KEYWORDS;
pub use log_context::DEFAULT_MAX_MATCHES;
pub use processor::archive::ArchiveFilter;
pub use processor::archive::ArchiveFormat;
pub use processor::archive::ArchiveProcessor;
pub use processor::archive::ArchiveProgress;
pub use processor::archive::ArchiveStats;
pub use processor::archive::EntryCallback;
pub use processor::FieldNameSignal;
pub use processor::FieldRule;
pub use processor::FileTypeProfile;
pub use processor::Processor;
pub use processor::ProcessorRegistry;
pub use processor::DEFAULT_FIELD_SIGNAL_THRESHOLD;
pub use report::FileReport;
pub use report::ReportBuilder;
pub use report::ReportMetadata;
pub use report::SanitizeReport;
pub use scanner::ScanConfig;
pub use scanner::ScanPattern;
pub use scanner::ScanProgress;
pub use scanner::ScanStats;
pub use scanner::StreamScanner;
pub use secrets::decrypt_secrets;
pub use secrets::encrypt_secrets;
pub use secrets::load_secrets_auto;
pub use secrets::looks_encrypted;
pub use secrets::SecretEntry;
pub use secrets::SecretsFormat;
pub use store::MappingStore;
pub use strategy::EntropyMode;
pub use strategy::FakeIp;
pub use strategy::HmacHash;
pub use strategy::PreserveLength;
pub use strategy::RandomString;
pub use strategy::RandomUuid;
pub use strategy::Strategy;
pub use strategy::StrategyGenerator;
pub use strip_values::strip_values_from_text;

Modules§

allowlist: Allowlist for suppressing specific values from sanitization.
atomic: Atomic file writes for crash-safe output.
category: Data category types for classifying sensitive values.
error: Unified error types for the sanitization engine.
generator: Replacement generation strategies.
llm: LLM prompt formatting — template resolution and prompt assembly.
log_context: Log context extraction — finds keyword-matching lines and captures surrounding context windows for LLM-friendly log triage.
processor: Structured processors for format-aware sanitization.
report: Structured reporting for sanitization runs.
scanner: Streaming scanner for detecting and replacing sensitive data.
secrets: Encrypted secrets management.
store: Thread-safe, concurrent one-way replacement store.
strategy: Pluggable replacement strategies.
strip_values: Key-only structure extraction for configuration files.

Constants§

DEFAULT_ARCHIVE_DEPTH: Default maximum nesting depth for recursive archive processing.