pub struct FuzzyVocabMatcher<'v> { /* private fields */ }
Vocabulary-aware fuzzy corrector for a closed token set.
Construct once per engine session from the vocabulary slice exposed by
marque_ism::TokenSet::correction_vocab. The vocab must be sorted and
deduplicated: the “is already valid” fast path uses slice::binary_search,
and the ambiguity check assumes each candidate appears at most once. For
marque_ism::CapcoTokenSet the invariant is enforced at the source —
ALL_CVE_TOKENS is emitted sorted and deduplicated by
marque-ism/build.rs and verified by token_set::tests.
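The sorted-and-deduplicated invariant can be checked mechanically. The following is a minimal sketch, not the crate's own test: the helper names `is_sorted_and_deduped` and `is_known` are hypothetical, and only `slice::binary_search` is taken from the description above.

```rust
/// Hypothetical helper: true when `vocab` is strictly ascending,
/// i.e. sorted with no duplicate entries.
fn is_sorted_and_deduped(vocab: &[&str]) -> bool {
    vocab.windows(2).all(|pair| pair[0] < pair[1])
}

/// Hypothetical helper mirroring the "is already valid" fast path:
/// an exact lookup via binary search, which is only reliable on a sorted slice.
fn is_known(vocab: &[&str], token: &str) -> bool {
    vocab.binary_search(&token).is_ok()
}
```

A strict `<` comparison between neighbours covers both requirements at once: an out-of-order pair and a repeated entry each fail the check.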
§Example
use marque_core::fuzzy::FuzzyVocabMatcher;
use marque_ism::CapcoTokenSet;
use marque_ism::token_set::TokenSet as _;
let vocab = CapcoTokenSet.correction_vocab();
let matcher = FuzzyVocabMatcher::new(vocab);
// "SERCET" is one transposition away from "SECRET"
let result = matcher.correct("SERCET");
assert_eq!(result.map(|c| c.token), Some("SECRET"));
// Known tokens → no correction
assert!(matcher.correct("SECRET").is_none());
// Too ambiguous → no correction
// (tokens equidistant from the input → ambiguous, return None)
Implementations§
impl<'v> FuzzyVocabMatcher<'v>
pub fn new(vocab: &'v [&'static str]) -> Self
Create a new matcher over vocab.
vocab must be the sorted, deduplicated CVE token slice returned by
TokenSet::correction_vocab. Construction is O(1): the slice is
not copied or indexed. The slice itself may live on the caller’s
TokenSet implementation (e.g., a Vec<&'static str> field), but each
entry must be &'static str so that FuzzyCorrection::token — which
borrows directly from the vocabulary — outlives the matcher.
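The lifetime relationship can be illustrated with a stand-in type. This is a sketch under assumptions: `SketchMatcher`, `lookup`, and `token_outlives_matcher` are hypothetical names, and the lookup is an exact match rather than the real fuzzy search. It only shows why `&'static str` entries let a returned token outlive both the matcher and the borrowed slice.

```rust
/// Hypothetical stand-in for the matcher: the slice is borrowed for `'v`,
/// while every entry in it is `&'static str`.
struct SketchMatcher<'v> {
    vocab: &'v [&'static str],
}

impl<'v> SketchMatcher<'v> {
    /// O(1): stores the borrowed slice without copying or indexing it.
    fn new(vocab: &'v [&'static str]) -> Self {
        Self { vocab }
    }

    /// Exact lookup that hands back the vocabulary's own `&'static str`.
    /// The result is tied neither to `'v` nor to `&self`.
    fn lookup(&self, token: &str) -> Option<&'static str> {
        self.vocab
            .binary_search(&token)
            .ok()
            .map(|idx| self.vocab[idx])
    }
}

/// The returned token is still valid after the matcher and the owning
/// `Vec` have both been dropped at the end of this function.
fn token_outlives_matcher() -> Option<&'static str> {
    let owned: Vec<&'static str> = vec!["CONFIDENTIAL", "SECRET"];
    let matcher = SketchMatcher::new(&owned);
    matcher.lookup("SECRET")
}
```

If the entries were `&'v str` instead, `token_outlives_matcher` would not compile, because the result would borrow from the local `Vec`.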
pub fn correct(&self, token: &str) -> Option<FuzzyCorrection>
Attempt to find a fuzzy correction for an unknown token.
Returns None when:
- token is already a known vocabulary entry (no correction needed).
- token is too short (< MIN_FUZZY_LEN bytes).
- No vocabulary entry is within MAX_EDIT_DISTANCE edits.
- Multiple vocabulary entries tie at the closest distance (ambiguous).
Returns Some(FuzzyCorrection) only when the correction is unambiguous.
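The four `None` cases listed above can be sketched as a single pass over the vocabulary. This is an illustration under stated assumptions, not the crate's implementation: the constant values are hypothetical, the metric is plain byte-wise Levenshtein (the real matcher may count a transposition as one edit), and `correct_sketch` returns a `(token, distance)` tuple in place of `FuzzyCorrection`.

```rust
// Hypothetical values chosen for illustration; the crate's constants may differ.
const MIN_FUZZY_LEN: usize = 3;
const MAX_EDIT_DISTANCE: usize = 2;

/// Plain byte-wise Levenshtein distance (an assumed metric for this sketch).
fn levenshtein(a: &str, b: &str) -> usize {
    let (a, b) = (a.as_bytes(), b.as_bytes());
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut curr = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != cb);
            let deletion = prev[j + 1] + 1;
            let insertion = curr[j] + 1;
            curr.push(substitution.min(deletion).min(insertion));
        }
        prev = curr;
    }
    prev[b.len()]
}

/// Returns the unique closest vocabulary entry with its distance, or `None`.
fn correct_sketch(vocab: &[&'static str], token: &str) -> Option<(&'static str, usize)> {
    // Case 1: the token is already valid, so there is nothing to correct.
    if vocab.binary_search(&token).is_ok() {
        return None;
    }
    // Case 2: the token is too short to correct safely.
    if token.len() < MIN_FUZZY_LEN {
        return None;
    }
    let mut best: Option<(&'static str, usize)> = None;
    let mut tied = false;
    for &candidate in vocab {
        let d = levenshtein(token, candidate);
        if d > MAX_EDIT_DISTANCE {
            continue;
        }
        match best {
            Some((_, best_d)) if d > best_d => {}
            Some((_, best_d)) if d == best_d => tied = true,
            _ => {
                best = Some((candidate, d));
                tied = false;
            }
        }
    }
    // Case 3: `best` is still `None` when nothing was in range.
    // Case 4: a tie at the closest distance is ambiguous.
    if tied { None } else { best }
}
```

Under plain Levenshtein, "SERCET" sits at distance 2 from "SECRET" (two substitutions), so this sketch still recovers it with the assumed cap of 2.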
§ASCII invariant
Length checks and the underlying edit-distance computation both operate on byte counts. The CAPCO vocabulary is pure ASCII (classification levels, SCI/dissem/SAR tokens, etc.), so byte count and character count coincide for every expected input. Non-ASCII input is compared byte-wise and will not produce meaningful corrections — which is the intended behavior, since no non-ASCII candidate exists in the closed vocab.
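A short sketch of why the byte-count assumption holds only for ASCII input. The helper name `byte_len_matches_char_count` is hypothetical and is not part of the crate.

```rust
/// Hypothetical helper: `str::len` counts bytes, `chars().count()` counts
/// Unicode scalar values. The two agree exactly when every character is ASCII.
fn byte_len_matches_char_count(token: &str) -> bool {
    token.len() == token.chars().count()
}
```

For "SECRET" both counts are 6. For "SÉCRET" the byte length is 7 while the character count is 6, so a byte-wise length check or edit distance would see an extra unit.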
pub fn correct_all(&self, token: &str) -> Vec<FuzzyCorrection>
Return every vocabulary entry within
MAX_EDIT_DISTANCE of token, paired with its distance.
Behaves like Self::correct but does NOT collapse ambiguous
matches to None. The decoder uses this when the caller needs
to score multiple candidates against a downstream prior — for
REL TO trigraph fuzzy recovery, the corpus-weighted log-prior
breaks ties that the matcher itself cannot (issue #233).
Fast-paths the same as Self::correct:
- token is already in vocab → returns an empty vec.
- token.len() < MIN_FUZZY_LEN → returns an empty vec.
Output is ordered by ascending distance, then by the
vocabulary’s lexicographic order (because the iteration walks
the sorted vocab slice). Capped by MAX_EDIT_DISTANCE so a
single call cannot run away on a tiny vocab; the priors-bake
vocabulary stays well bounded in practice.
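The ordering described above falls out of a stable sort over candidates collected in vocabulary order. The sketch below assumes hypothetical constant values and a plain byte-wise Levenshtein metric, and returns `(token, distance)` tuples in place of `FuzzyCorrection`; it is not the crate's implementation.

```rust
// Hypothetical values chosen for illustration; the crate's constants may differ.
const MIN_FUZZY_LEN: usize = 3;
const MAX_EDIT_DISTANCE: usize = 2;

/// Plain byte-wise Levenshtein distance (an assumed metric for this sketch).
fn levenshtein(a: &str, b: &str) -> usize {
    let (a, b) = (a.as_bytes(), b.as_bytes());
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut curr = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != cb);
            let deletion = prev[j + 1] + 1;
            let insertion = curr[j] + 1;
            curr.push(substitution.min(deletion).min(insertion));
        }
        prev = curr;
    }
    prev[b.len()]
}

/// Every vocabulary entry within range, ordered by distance, then vocab order.
fn correct_all_sketch(vocab: &[&'static str], token: &str) -> Vec<(&'static str, usize)> {
    // Same fast paths as the single-result variant, but with an empty vec.
    if vocab.binary_search(&token).is_ok() || token.len() < MIN_FUZZY_LEN {
        return Vec::new();
    }
    // Candidates are collected by walking the sorted vocab slice in order.
    let mut matches: Vec<(&'static str, usize)> = vocab
        .iter()
        .map(|&candidate| (candidate, levenshtein(token, candidate)))
        .filter(|&(_, d)| d <= MAX_EDIT_DISTANCE)
        .collect();
    // `sort_by_key` is stable, so equal distances keep their vocab order.
    matches.sort_by_key(|&(_, d)| d);
    matches
}
```

With the vocabulary `["AUS", "AUT", "USA"]` and the input "AUSA", the distances in vocab order are 1, 2, 1, so the sketch yields AUS, USA, then AUT. Unlike the single-result variant, the tie between AUS and USA is reported rather than collapsed.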
pub fn correct_all_with_floor(
    &self,
    token: &str,
    confidence_floor: f32,
) -> Vec<FuzzyCorrection>
Like Self::correct_all but with a caller-controlled
confidence floor.
confidence_floor MUST lie in [0.0, 1.0] — correction_confidence
returns values in that range, so a negative floor would silently
disable filtering (the comparison confidence >= negative_floor
is always true) and a floor > 1.0 would silently drop every
match. Debug builds panic on an out-of-range floor, so misuse is caught
during development rather than surfacing as counterintuitive empty or
unfiltered results in a release binary.
The default floor (MIN_USEFUL_CONFIDENCE = 0.45) excludes
distance-2 corrections of 3-char inputs, which is the right
safety policy for the standard fuzzy path because those
corrections are too speculative without surrounding context.
The decoder’s REL TO trigraph expansion (issue #233) supplies
surrounding context — the candidate goes through the strict
REL TO parser, the resulting marking has a corpus-weighted
trigraph prior, and the decoder’s UNAMBIGUOUS_LOG_MARGIN
breaks ties at score time. Lowering the floor for that
specific call site is what lets a typo like ASU → AUS
(distance 2 in plain Levenshtein) reach the scorer.
Callers passing a floor of 0.0 get every match within
MAX_EDIT_DISTANCE.
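The floor check and its debug-only validation can be sketched as follows. The confidence model here is a loud assumption: `sketch_confidence` is a hypothetical `1 - distance / length` formula, not the crate's `correction_confidence`, chosen only because it reproduces the one data point stated above (a distance-2 correction of a 3-char input falls below 0.45). `filter_with_floor` is likewise a hypothetical helper.

```rust
/// Default floor value as stated in the text above.
const MIN_USEFUL_CONFIDENCE: f32 = 0.45;

/// Hypothetical confidence model, NOT the crate's `correction_confidence`:
/// the fraction of the input left untouched, clamped to [0.0, 1.0].
fn sketch_confidence(distance: usize, token_len: usize) -> f32 {
    if token_len == 0 {
        return 0.0;
    }
    (1.0 - distance as f32 / token_len as f32).clamp(0.0, 1.0)
}

/// Hypothetical helper: keeps the candidates whose confidence meets the floor.
fn filter_with_floor(
    candidates: &[(&'static str, usize)],
    token_len: usize,
    confidence_floor: f32,
) -> Vec<(&'static str, usize)> {
    // Checked in debug builds only; release builds skip the assertion.
    debug_assert!(
        (0.0..=1.0).contains(&confidence_floor),
        "confidence_floor must lie in [0.0, 1.0]"
    );
    candidates
        .iter()
        .copied()
        .filter(|&(_, d)| sketch_confidence(d, token_len) >= confidence_floor)
        .collect()
}
```

Under this assumed model, two distance-2 candidates for a 3-char input are both dropped at the default floor and both kept at a floor of 0.0, which matches the behaviour described for the REL TO call site.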