pub struct StreamScanner { /* private fields */ }
Streaming scanner that detects and replaces sensitive patterns.
Thread-safe: can be shared via Arc<StreamScanner> for concurrent
scanning of multiple files. Each call to scan_reader
is independent and maintains its own chunking state.
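The sharing pattern described above can be sketched with the standard library alone. DemoScanner and scan_concurrently are hypothetical stand-ins, not part of sanitize_engine; the sketch only assumes that a scan takes &self and keeps its working state local to the call.

```rust
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for a scanner with no interior mutability.
struct DemoScanner {
    needle: Vec<u8>,
}

impl DemoScanner {
    /// Counts non-overlapping occurrences of `needle`.
    /// All working state lives on this call's stack.
    fn count_matches(&self, input: &[u8]) -> usize {
        if self.needle.is_empty() {
            return 0;
        }
        let mut count = 0;
        let mut i = 0;
        while i + self.needle.len() <= input.len() {
            if input[i..].starts_with(&self.needle) {
                count += 1;
                i += self.needle.len();
            } else {
                i += 1;
            }
        }
        count
    }
}

/// Scans each input on its own thread, sharing one scanner through `Arc`.
/// Results come back in the same order as `inputs`.
fn scan_concurrently(scanner: Arc<DemoScanner>, inputs: Vec<Vec<u8>>) -> Vec<usize> {
    let handles: Vec<_> = inputs
        .into_iter()
        .map(|input| {
            let scanner = Arc::clone(&scanner);
            thread::spawn(move || scanner.count_matches(&input))
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("scan thread panicked"))
        .collect()
}
```

With the real type, the same shape would apply: clone the Arc once per worker and call a scan method on each clone.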
§Usage
use sanitize_engine::scanner::{StreamScanner, ScanPattern, ScanConfig};
use sanitize_engine::category::Category;
use sanitize_engine::generator::HmacGenerator;
use sanitize_engine::store::MappingStore;
use std::sync::Arc;

// 1. Build the replacement store.
// (`generator` rather than `gen`, which is a reserved keyword in the 2024 edition.)
let generator = Arc::new(HmacGenerator::new([42u8; 32]));
let store = Arc::new(MappingStore::new(generator, None));

// 2. Define patterns.
let patterns = vec![
    ScanPattern::from_regex(
        r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
        Category::Email,
        "email",
    ).unwrap(),
];

// 3. Create the scanner.
let scanner = StreamScanner::new(patterns, store, ScanConfig::default()).unwrap();

// 4. Scan, then check that the original address no longer appears in the output.
let input = b"Contact alice@corp.com for details.";
let (output, stats) = scanner.scan_bytes(input).unwrap();
assert_eq!(stats.matches_found, 1);
assert!(!output.windows(b"alice@corp.com".len())
    .any(|w| w == b"alice@corp.com"));
Implementations§
impl StreamScanner
pub fn new(
    patterns: Vec<ScanPattern>,
    store: Arc<MappingStore>,
    config: ScanConfig,
) -> Result<Self>
Create a new streaming scanner.
§Arguments
- patterns: the set of patterns to scan for.
- store: the mapping store for dedup-consistent replacements.
- config: chunking / overlap configuration.
§Errors
Returns SanitizeError::InvalidConfig if the configuration is
invalid (e.g. chunk_size == 0 or overlap_size >= chunk_size).
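The two example conditions above can be written out as a small check. This is a sketch only: DemoConfig and validate are hypothetical names, the error is a plain String rather than SanitizeError::InvalidConfig, and the real constructor may enforce further rules not listed here.

```rust
/// Hypothetical mirror of the two chunking fields named above.
#[derive(Debug, Clone, Copy)]
struct DemoConfig {
    chunk_size: usize,
    overlap_size: usize,
}

/// Returns a description of the first violated rule, or `Ok(())` if both hold.
fn validate(config: &DemoConfig) -> Result<(), String> {
    if config.chunk_size == 0 {
        return Err("chunk_size must be greater than zero".to_string());
    }
    if config.overlap_size >= config.chunk_size {
        return Err(format!(
            "overlap_size ({}) must be smaller than chunk_size ({})",
            config.overlap_size, config.chunk_size
        ));
    }
    Ok(())
}
```

One plausible reason for the second rule is that an overlap as large as the chunk would leave no room for new bytes to be committed on each step.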
pub fn new_with_max_patterns(
    patterns: Vec<ScanPattern>,
    store: Arc<MappingStore>,
    config: ScanConfig,
    max_patterns: usize,
) -> Result<Self>
Create a new streaming scanner with a custom pattern limit.
This is identical to new but allows overriding the
default pattern cap (10 000). Use this
when you have a legitimate need for more patterns and have
verified that your system has enough memory for the resulting
RegexSet.
§Errors
Returns SanitizeError::InvalidConfig if the configuration is
invalid or the pattern count exceeds max_patterns.
pub fn with_extra_literals(&self, extra: Vec<ScanPattern>) -> Result<Self>
Create a copy of this scanner extended with additional literal patterns.
Clones the existing pattern set and appends extra, then rebuilds
the internal Aho-Corasick and RegexSet automata. Used by the
format-preserving structured pass to scan original bytes with
discovered field-value literals added to the base pattern set.
§Errors
Returns SanitizeError if automaton construction fails or the
combined pattern count exceeds the default limit.
pub fn for_structured_pass(&self, extra: Vec<ScanPattern>) -> Result<Self>
Build a scanner suitable for format-preserving structured-file passes.
Patterns whose labels end with "_kv" are excluded from the base set.
Those patterns match both a key name and its value (e.g. password: s3cr3t)
as a single unit; in a structured pass the key must survive untouched so
only the discovered field-value literals are safe to replace.
extra (the profile-discovered literals) are always included.
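The selection rule above amounts to a label filter followed by an append. The sketch below uses a hypothetical DemoPattern type holding only a label; the real ScanPattern and the automaton rebuild are not modelled.

```rust
/// Hypothetical stand-in for a labelled pattern.
#[derive(Debug, Clone, PartialEq)]
struct DemoPattern {
    label: String,
}

/// Drops base patterns whose label ends with "_kv", then appends `extra`
/// unconditionally, mirroring the rule described above.
fn structured_pass_patterns(base: &[DemoPattern], extra: Vec<DemoPattern>) -> Vec<DemoPattern> {
    let mut selected: Vec<DemoPattern> = base
        .iter()
        .filter(|p| !p.label.ends_with("_kv"))
        .cloned()
        .collect();
    selected.extend(extra);
    selected
}
```

Given base labels email and password_kv plus one discovered literal, the result keeps email and the literal and drops password_kv.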
§Errors
Returns SanitizeError if Aho-Corasick or RegexSet construction fails
or the combined pattern count exceeds the default limit.
pub fn scan_reader<R: Read, W: Write>(
    &self,
    reader: R,
    writer: W,
) -> Result<ScanStats>
Scan a reader and write sanitized output to a writer.
Processes the input in chunks of config.chunk_size bytes,
maintaining an overlap window of config.overlap_size bytes to
catch matches spanning chunk boundaries. All detected matches
are replaced one-way via the MappingStore.
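To make the chunk and overlap mechanics concrete, here is a minimal standard-library sketch. It is not the crate's implementation: a single literal needle stands in for the pattern set, a fixed replacement stands in for the MappingStore, and scan_chunked is a hypothetical name. It assumes overlap_size is at least needle.len() - 1, so a match that starts in the committed region is always fully visible in the buffer.

```rust
use std::io::{self, Read, Write};

/// Replaces every occurrence of `needle` with `replacement` while reading
/// `reader` in chunks. Returns the number of matches replaced.
fn scan_chunked<R: Read, W: Write>(
    mut reader: R,
    mut writer: W,
    needle: &[u8],
    replacement: &[u8],
    chunk_size: usize,
    overlap_size: usize,
) -> io::Result<usize> {
    assert!(chunk_size > 0 && overlap_size < chunk_size);
    assert!(!needle.is_empty() && overlap_size + 1 >= needle.len());

    let mut buf: Vec<u8> = Vec::new(); // held-back tail plus newly read bytes
    let mut chunk = vec![0u8; chunk_size];
    let mut matches = 0;

    loop {
        let n = reader.read(&mut chunk)?;
        let eof = n == 0;
        buf.extend_from_slice(&chunk[..n]);

        // Bytes at or past `limit` are held back: a match starting there
        // might continue into the next chunk. At end of input, commit all.
        let limit = if eof {
            buf.len()
        } else {
            buf.len().saturating_sub(overlap_size)
        };

        let mut i = 0;
        while i < limit {
            if buf[i..].starts_with(needle) {
                writer.write_all(replacement)?;
                matches += 1;
                i += needle.len();
            } else {
                // Byte-at-a-time output keeps the sketch short, not fast.
                writer.write_all(&buf[i..i + 1])?;
                i += 1;
            }
        }

        // Keep the uncommitted tail for the next round.
        buf.drain(..i);

        if eof {
            return Ok(matches);
        }
    }
}
```

A real scanner would also need the overlap to cover the longest possible match of any pattern, which is harder to bound for regexes than for a literal.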
§Arguments
- reader: input source (file, network stream, &[u8], …).
- writer: output sink (file, Vec<u8>, …).
§Returns
ScanStats with counters for bytes processed, matches found, etc.
§Errors
Returns SanitizeError on I/O failures or if a replacement
cannot be generated (e.g. store capacity exceeded).
pub fn scan_reader_with_progress<R: Read, W: Write, F>(
    &self,
    reader: R,
    writer: W,
    total_bytes: Option<u64>,
    on_progress: F,
) -> Result<ScanStats>
where
    F: FnMut(&ScanProgress),
Scan a reader and emit progress snapshots after each committed chunk.
total_bytes should be provided when the caller knows the full input
size. When omitted, progress consumers should avoid percentages/ETA.
This is a convenience wrapper around scan_reader_with_callbacks
that discards per-match location information. Use that method directly
when you need line numbers or byte offsets for individual matches.
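A progress consumer might handle the optional total along these lines. The fields of ScanProgress are not shown on this page, so the sketch takes plain values instead; percent_complete and progress_label are hypothetical helpers.

```rust
/// Returns a percentage in 0.0..=100.0 when the total is known and non-zero,
/// or `None` so the caller can fall back to a plain byte counter.
fn percent_complete(bytes_processed: u64, total_bytes: Option<u64>) -> Option<f64> {
    match total_bytes {
        Some(total) if total > 0 => {
            let ratio = bytes_processed as f64 / total as f64;
            Some((ratio * 100.0).min(100.0))
        }
        _ => None,
    }
}

/// Formats a status line, omitting the percentage when the total is unknown.
fn progress_label(bytes_processed: u64, total_bytes: Option<u64>) -> String {
    match percent_complete(bytes_processed, total_bytes) {
        Some(pct) => format!("{pct:.1}% ({bytes_processed} bytes)"),
        None => format!("{bytes_processed} bytes"),
    }
}
```

Clamping at 100 guards against a caller-supplied total that turns out to be smaller than the actual input.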
§Errors
Returns SanitizeError on I/O failures or if a replacement
cannot be generated (e.g. store capacity exceeded).
pub fn scan_reader_with_callbacks<R: Read, W: Write, F, M>(
    &self,
    reader: R,
    writer: W,
    total_bytes: Option<u64>,
    on_progress: F,
    on_match: M,
) -> Result<ScanStats>
Scan a reader, emit progress snapshots, and call on_match for every
committed match with its 1-based line number and byte offset.
on_match is called synchronously in the scanning thread, once per
committed match, in document order. The callback receives a
MatchLocation describing the pattern label, 1-based line number,
and 0-based byte offset within the input file. Callers that only need
aggregate counts (no per-match positions) should prefer
scan_reader_with_progress, which
skips the per-byte newline counting entirely.
§Performance note
Enabling on_match adds an O(committed_bytes_between_matches)
newline-counting pass inside each chunk. For files with sparse matches
this overhead is proportional to file size; for dense matches (e.g. one
secret per line) it is negligible. On 10–15 GiB log files with typical
match densities the overhead is roughly 10–20 % of total scan time.
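The cost described above comes from counting newlines in the bytes between consecutive matches. A minimal sketch of that bookkeeping follows, assuming matches arrive in document order; LineTracker is a hypothetical type, and the real scanner's handling of chunk boundaries is not modelled.

```rust
/// Tracks a 1-based line number by counting newlines only in the bytes
/// between the previous match and the current one.
struct LineTracker {
    line: u64,     // line number at `offset`
    offset: usize, // bytes already accounted for
}

impl LineTracker {
    fn new() -> Self {
        Self { line: 1, offset: 0 }
    }

    /// Advances to `match_offset` (0-based, must not move backwards) and
    /// returns the 1-based line number on which that offset falls.
    fn line_at(&mut self, data: &[u8], match_offset: usize) -> u64 {
        assert!(match_offset >= self.offset, "matches must be in document order");
        let newlines = data[self.offset..match_offset]
            .iter()
            .filter(|&&b| b == b'\n')
            .count();
        self.line += newlines as u64;
        self.offset = match_offset;
        self.line
    }
}
```

Each byte before the last match is inspected once, so the extra work grows with the amount of input scanned, consistent with the note above.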
§Errors
Returns SanitizeError on I/O failures or if a replacement
cannot be generated (e.g. store capacity exceeded).
pub fn scan_bytes(&self, input: &[u8]) -> Result<(Vec<u8>, ScanStats)>
Convenience: scan a byte slice in memory and return the sanitized output.
Equivalent to scan_reader(input, Vec::new()) but returns the
output buffer directly.
§Errors
Returns SanitizeError if a replacement cannot be generated
(e.g. store capacity exceeded).
pub fn scan_bytes_with_progress<F>(
    &self,
    input: &[u8],
    on_progress: F,
) -> Result<(Vec<u8>, ScanStats)>
where
    F: FnMut(&ScanProgress),
Scan a byte slice in memory and emit progress snapshots.
§Errors
Returns SanitizeError if a replacement cannot be generated
(e.g. store capacity exceeded).
pub fn config(&self) -> &ScanConfig
Access the scanner’s configuration.
pub fn store(&self) -> &Arc<MappingStore>
Access the underlying mapping store.
pub fn pattern_count(&self) -> usize
Number of patterns registered in this scanner.
pub fn from_encrypted_secrets(
    encrypted_bytes: &[u8],
    password: &str,
    format: Option<SecretsFormat>,
    store: Arc<MappingStore>,
    config: ScanConfig,
    extra_patterns: Vec<ScanPattern>,
) -> Result<(StreamScanner, Vec<(usize, SanitizeError)>, Vec<String>)>
Create a scanner from an encrypted secrets file.
Decrypts the file in memory, parses the entries, compiles patterns, and returns the scanner ready to scan. Decrypted plaintext is scrubbed from memory after parsing.
§Arguments
- encrypted_bytes: raw bytes of the .enc file.
- password: user password.
- format: optional format override for the plaintext.
- store: mapping store for dedup-consistent replacements.
- config: chunking / overlap configuration.
- extra_patterns: additional patterns to merge in.
§Returns
(scanner, warnings, allow_patterns) where warnings lists entries
that failed to compile (index + error) and allow_patterns are the
raw strings from kind: allow entries — pass these to
AllowlistMatcher::new to
suppress replacements for known-safe values.
§Errors
Returns a secrets-related SanitizeError on decryption failure
or SanitizeError::InvalidConfig on invalid scanner config.
pub fn from_plaintext_secrets(
    plaintext: &[u8],
    format: Option<SecretsFormat>,
    store: Arc<MappingStore>,
    config: ScanConfig,
    extra_patterns: Vec<ScanPattern>,
) -> Result<(StreamScanner, Vec<(usize, SanitizeError)>, Vec<String>)>
Create a scanner from a plaintext secrets file.
Convenience for development / testing without encryption.
§Returns
(scanner, warnings, allow_patterns) where allow_patterns are the
raw strings from kind: allow entries — pass these to
AllowlistMatcher::new to
suppress replacements for known-safe values.
§Errors
Returns a secrets-related SanitizeError on parse failure
or SanitizeError::InvalidConfig on invalid scanner config.
Auto Trait Implementations§
impl Freeze for StreamScanner
impl !RefUnwindSafe for StreamScanner
impl Send for StreamScanner
impl Sync for StreamScanner
impl Unpin for StreamScanner
impl UnsafeUnpin for StreamScanner
impl !UnwindSafe for StreamScanner
Blanket Implementations§
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> Instrument for T
fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self>
if into_left is true.
Converts self into a Right variant of Either<Self, Self>
otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self>
if into_left(&self) returns true.
Converts self into a Right variant of Either<Self, Self>
otherwise.