pub struct ScrubConfig {
    pub normalize_newlines: bool,
    pub remove_zero_width: bool,
    pub remove_bidi_controls: bool,
    pub collapse_whitespace: bool,
    pub normalization: ScrubNormalization,
    pub case: ScrubCase,
    pub strip_diacritics: bool,
}
Policy/config for constructing normalized keys / comparison forms.
The intent is to make the pipeline explicit: most real bugs here come from implicit normalization that accidentally destroys semantics (ZWJ/ZWNJ, bidi marks, punctuation, newlines).
Fields
normalize_newlines: bool
Normalize newlines (\r\n/\r → \n) before any other whitespace policy.
remove_zero_width: bool
Remove common zero-width characters (ZWSP/ZWNJ/ZWJ/WORD JOINER/BOM).
remove_bidi_controls: bool
Remove Unicode bidirectional control characters (Trojan Source-style).
collapse_whitespace: bool
Collapse all Unicode whitespace to single ASCII spaces (and trim).
normalization: ScrubNormalization
Which normalization form to apply before case/diacritics.
case: ScrubCase
Case handling strategy.
strip_diacritics: bool
Strip combining marks (diacritics) after normalization + case mapping.
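The ordering implied by the four boolean fields can be sketched with the standard library alone. This is a minimal illustration, not the crate's implementation: `SketchConfig`, `is_zero_width`, `is_bidi_control`, and `scrub_sketch` are hypothetical names, and the exact code point sets (in particular whether LRM/RLM/ALM count as bidi controls) are assumptions. The `normalization`, `case`, and `strip_diacritics` stages are left out because they need Unicode data tables that `std` does not provide.

```rust
/// Hypothetical stand-in for the boolean fields of `ScrubConfig`.
#[derive(Clone, Debug)]
struct SketchConfig {
    normalize_newlines: bool,
    remove_zero_width: bool,
    remove_bidi_controls: bool,
    collapse_whitespace: bool,
}

/// ZWSP, ZWNJ, ZWJ, WORD JOINER, BOM (assumed set, taken from the field docs).
fn is_zero_width(c: char) -> bool {
    matches!(c, '\u{200B}' | '\u{200C}' | '\u{200D}' | '\u{2060}' | '\u{FEFF}')
}

/// Bidi marks, embeddings/overrides, and isolates (assumed set).
fn is_bidi_control(c: char) -> bool {
    matches!(
        c,
        '\u{061C}' | '\u{200E}' | '\u{200F}' | '\u{202A}'..='\u{202E}' | '\u{2066}'..='\u{2069}'
    )
}

/// Applies the enabled stages in the order the field docs describe.
fn scrub_sketch(input: &str, cfg: &SketchConfig) -> String {
    let mut s = input.to_string();
    if cfg.normalize_newlines {
        // Handle \r\n first so it does not turn into two newlines.
        s = s.replace("\r\n", "\n").replace('\r', "\n");
    }
    if cfg.remove_zero_width {
        s.retain(|c| !is_zero_width(c));
    }
    if cfg.remove_bidi_controls {
        s.retain(|c| !is_bidi_control(c));
    }
    if cfg.collapse_whitespace {
        // split_whitespace splits on Unicode White_Space and drops empty
        // pieces, so joining with ' ' both collapses runs and trims the ends.
        s = s.split_whitespace().collect::<Vec<_>>().join(" ");
    }
    s
}
```

Because each stage is gated by its own flag, a caller can see exactly which transformations a key has gone through, which is the point of making the pipeline explicit.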
Implementations
impl ScrubConfig
pub fn search_key() -> Self
See search_key() (fallback when casefold is disabled).
pub fn search_key_strict_invisibles() -> Self
Like search_key(), but also removes common zero-width characters (ZWSP/ZWNJ/ZWJ/WJ/BOM).
This is a deliberate trade-off:
- Pro: avoids “ghost mismatches” in mostly-Latin corpora where ZWJ/ZWNJ are usually artifacts (copy/paste, rich text) rather than orthographic intent.
- Con: ZWJ/ZWNJ are semantically meaningful in multiple scripts (and for emoji ZWJ sequences). Stripping can create false positives/negatives depending on the task.
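Both sides of this trade-off can be seen in a few lines. The sketch below uses a hypothetical helper, `strip_zero_width`, that removes the five code points listed above; it is an illustration under that assumption and not the crate's code.

```rust
/// Hypothetical helper: drops ZWSP, ZWNJ, ZWJ, WORD JOINER, and BOM.
fn strip_zero_width(input: &str) -> String {
    input
        .chars()
        .filter(|c| !matches!(c, '\u{200B}' | '\u{200C}' | '\u{200D}' | '\u{2060}' | '\u{FEFF}'))
        .collect()
}

/// Pro: a stray ZWSP pasted from rich text no longer causes a mismatch.
fn ghost_mismatch_fixed() -> bool {
    strip_zero_width("foo\u{200B}bar") == "foobar"
}

/// Con: an emoji ZWJ sequence (WOMAN + ZWJ + PERSONAL COMPUTER) degrades
/// into two independent emoji, so it now compares equal to text that
/// never contained the joined sequence.
fn emoji_sequence_broken() -> bool {
    strip_zero_width("\u{1F469}\u{200D}\u{1F4BB}") == "\u{1F469}\u{1F4BB}"
}
```

Both functions return `true`: the same stripping step that fixes the Latin-text case also erases the distinction in the emoji case, so the choice belongs to the caller and depends on the corpus.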
Trait Implementations
impl Clone for ScrubConfig
fn clone(&self) -> ScrubConfig
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.