marque_core::fuzzy

Struct FuzzyVocabMatcher

pub struct FuzzyVocabMatcher<'v> { /* private fields */ }

Vocabulary-aware fuzzy corrector for a closed token set.

Construct once per engine session from the vocabulary slice exposed by marque_ism::TokenSet::correction_vocab. The vocab must be sorted and deduplicated: the “is already valid” fast path uses slice::binary_search, and the ambiguity check assumes each candidate appears at most once. For marque_ism::CapcoTokenSet the invariant is enforced at the source — ALL_CVE_TOKENS is emitted sorted and deduplicated by marque-ism/build.rs and verified by token_set::tests.
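The invariant above can be checked with plain standard-library calls. The following is a minimal sketch, not the crate's own verification code; the helper names `is_sorted_and_deduped` and `is_known` are hypothetical and exist only for illustration.

```rust
/// Returns true when `vocab` is strictly ascending, which means it is
/// both sorted and free of duplicates.
fn is_sorted_and_deduped(vocab: &[&str]) -> bool {
    vocab.windows(2).all(|pair| pair[0] < pair[1])
}

/// Membership test in the style of the "is already valid" fast path.
/// `binary_search` only gives correct answers on a sorted slice.
fn is_known(vocab: &[&str], token: &str) -> bool {
    vocab.binary_search(&token).is_ok()
}
```

A strict `<` comparison between neighbours covers both halves of the invariant at once: an out-of-order pair and a repeated entry both fail it.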

§Example

use marque_core::fuzzy::FuzzyVocabMatcher;
use marque_ism::CapcoTokenSet;
use marque_ism::token_set::TokenSet as _;

let vocab = CapcoTokenSet.correction_vocab();
let matcher = FuzzyVocabMatcher::new(vocab);

// "SERCET" is one adjacent transposition away from "SECRET"
let result = matcher.correct("SERCET");
assert_eq!(result.map(|c| c.token), Some("SECRET"));

// Known tokens → no correction
assert!(matcher.correct("SECRET").is_none());

// Ambiguous input → no correction
// (when several tokens tie at the closest distance, `correct` returns None)

Implementations§

impl<'v> FuzzyVocabMatcher<'v>

pub fn new(vocab: &'v [&'static str]) -> Self

Create a new matcher over vocab.

vocab must be the sorted, deduplicated CVE token slice returned by TokenSet::correction_vocab. Construction is O(1): the slice is not copied or indexed. The slice itself may live on the caller’s TokenSet implementation (e.g., a Vec<&'static str> field), but each entry must be &'static str so that FuzzyCorrection::token, which borrows directly from the vocabulary, can outlive the matcher.
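The lifetime arrangement can be illustrated with a small stand-in type. This is a sketch under stated assumptions: `Matcher`, `lookup`, and `token_outliving_matcher` are hypothetical names, and the lookup is an exact match rather than the crate's fuzzy search. Only the borrow structure (`&'v [&'static str]`) mirrors the signature of `new`.

```rust
/// Hypothetical stand-in for the matcher: borrows the slice for `'v`,
/// while every entry inside the slice is `&'static str`.
struct Matcher<'v> {
    vocab: &'v [&'static str],
}

impl<'v> Matcher<'v> {
    /// O(1): stores the slice reference without copying or indexing it.
    fn new(vocab: &'v [&'static str]) -> Self {
        Matcher { vocab }
    }

    /// Exact-match lookup. The returned `&'static str` is tied neither
    /// to `'v` nor to the borrow of `self`.
    fn lookup(&self, token: &str) -> Option<&'static str> {
        self.vocab.iter().copied().find(|entry| *entry == token)
    }
}

/// Builds a short-lived vocabulary and matcher, then returns a token
/// that remains valid after both have been dropped.
fn token_outliving_matcher() -> Option<&'static str> {
    let owned: Vec<&'static str> = vec!["SECRET", "UNCLASSIFIED"];
    let matcher = Matcher::new(&owned);
    matcher.lookup("SECRET")
}
```

If the entries were borrowed for `'v` instead of `'static`, `token_outliving_matcher` would not compile, because the returned token would refer to data owned by the local `Vec`.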

pub fn correct(&self, token: &str) -> Option<FuzzyCorrection>

Attempt to find a fuzzy correction for an unknown token.

Returns None when:

token is already a known vocabulary entry (no correction needed).
token is too short (< MIN_FUZZY_LEN bytes).
No vocabulary entry is within MAX_EDIT_DISTANCE edits.
Multiple vocabulary entries tie at the closest distance (ambiguous).

Returns Some(FuzzyCorrection) only when the correction is unambiguous.
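The four None conditions can be sketched as a single pass over the vocabulary. This is an illustrative reconstruction, not the crate's implementation: it assumes plain byte-wise Levenshtein distance, assumes the values `MIN_FUZZY_LEN = 3` and `MAX_EDIT_DISTANCE = 2`, and returns a `(token, distance)` tuple in place of `FuzzyCorrection`. The function name `correct_sketch` is hypothetical.

```rust
// Assumed values for illustration; the crate's real constants may differ.
const MIN_FUZZY_LEN: usize = 3;
const MAX_EDIT_DISTANCE: usize = 2;

/// Plain byte-wise Levenshtein distance using two rolling rows.
fn levenshtein(a: &str, b: &str) -> usize {
    let (a, b) = (a.as_bytes(), b.as_bytes());
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr: Vec<usize> = vec![0; b.len() + 1];
    for (i, &ca) in a.iter().enumerate() {
        curr[0] = i + 1;
        for (j, &cb) in b.iter().enumerate() {
            let substitute = prev[j] + usize::from(ca != cb);
            curr[j + 1] = substitute.min(prev[j + 1] + 1).min(curr[j] + 1);
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

/// Returns the unique closest entry and its distance, or None when the
/// token is known, too short, out of range, or tied between candidates.
fn correct_sketch(vocab: &[&'static str], token: &str) -> Option<(&'static str, usize)> {
    if vocab.binary_search(&token).is_ok() || token.len() < MIN_FUZZY_LEN {
        return None;
    }
    let mut best: Option<(&'static str, usize)> = None;
    let mut tied = false;
    for &entry in vocab {
        let d = levenshtein(token, entry);
        if d > MAX_EDIT_DISTANCE {
            continue;
        }
        match best {
            // Worse than the current best: ignore.
            Some((_, best_d)) if d > best_d => {}
            // Equal to the current best: remember the ambiguity.
            Some((_, best_d)) if d == best_d => tied = true,
            // First hit, or strictly better: replace and clear the tie.
            _ => {
                best = Some((entry, d));
                tied = false;
            }
        }
    }
    if tied { None } else { best }
}
```

Under these assumptions, "CUT" against the vocabulary ["CAT", "COT"] yields None because both entries sit at distance 1, while "CATS" yields ("CAT", 1) because "COT" is one edit further away.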

§ASCII invariant

Length checks and the underlying edit-distance computation both operate on byte counts. The CAPCO vocabulary is pure ASCII (classification levels, SCI/dissem/SAR tokens, etc.), so byte count and character count coincide for every expected input. Non-ASCII input is compared byte-wise and will not produce meaningful corrections — which is the intended behavior, since no non-ASCII candidate exists in the closed vocab.
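The byte-count versus character-count point is easy to confirm with the standard library. The helper name below is hypothetical; the sketch only shows why the two counts agree for ASCII and diverge otherwise.

```rust
/// True when the byte length and the character count agree.
/// This holds for every pure-ASCII string, since each ASCII
/// character occupies exactly one byte in UTF-8.
fn byte_len_matches_char_count(s: &str) -> bool {
    s.len() == s.chars().count()
}
```

For "SECRET" both counts are 6. Replacing the E with U+00C9 keeps the character count at 6 but raises the byte length to 7, so any length-based check sees a different token length than a human reader would.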

pub fn correct_all(&self, token: &str) -> Vec<FuzzyCorrection>

Return every vocabulary entry within MAX_EDIT_DISTANCE of token, paired with its distance.

Behaves like Self::correct but does NOT collapse ambiguous matches to None. The decoder uses this when the caller needs to score multiple candidates against a downstream prior: for REL TO trigraph fuzzy recovery, the corpus-weighted log-prior breaks ties that the matcher itself cannot (issue #233). It takes the same fast paths as Self::correct:

token is already in vocab → returns an empty vec.
token.len() < MIN_FUZZY_LEN → returns an empty vec.

Output is ordered by ascending distance, then by the vocabulary’s lexicographic order (because the iteration walks the sorted vocab slice). The output is capped by MAX_EDIT_DISTANCE so a single call cannot run away on a tiny vocab; the priors-bake vocabulary stays well bounded in practice.
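One way to obtain the documented ordering is to collect hits while walking the sorted vocabulary and then apply a stable sort by distance. The sketch below is an assumption about how such an ordering can be produced, not the crate's code: it uses plain byte-wise Levenshtein, assumed constant values, a `(token, distance)` tuple in place of `FuzzyCorrection`, and the hypothetical name `correct_all_sketch`.

```rust
// Assumed values for illustration; the crate's real constants may differ.
const MIN_FUZZY_LEN: usize = 3;
const MAX_EDIT_DISTANCE: usize = 2;

/// Plain byte-wise Levenshtein distance using two rolling rows.
fn levenshtein(a: &str, b: &str) -> usize {
    let (a, b) = (a.as_bytes(), b.as_bytes());
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr: Vec<usize> = vec![0; b.len() + 1];
    for (i, &ca) in a.iter().enumerate() {
        curr[0] = i + 1;
        for (j, &cb) in b.iter().enumerate() {
            let substitute = prev[j] + usize::from(ca != cb);
            curr[j + 1] = substitute.min(prev[j + 1] + 1).min(curr[j] + 1);
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

/// Every entry within MAX_EDIT_DISTANCE, ordered by distance and then
/// by position in the (sorted) vocabulary.
fn correct_all_sketch(vocab: &[&'static str], token: &str) -> Vec<(&'static str, usize)> {
    // Same fast paths as the single-result variant: both return nothing.
    if vocab.binary_search(&token).is_ok() || token.len() < MIN_FUZZY_LEN {
        return Vec::new();
    }
    let mut hits: Vec<(&'static str, usize)> = vocab
        .iter()
        .map(|&entry| (entry, levenshtein(token, entry)))
        .filter(|&(_, d)| d <= MAX_EDIT_DISTANCE)
        .collect();
    // `sort_by_key` is stable, so entries at equal distance keep the
    // lexicographic order they had in the vocabulary.
    hits.sort_by_key(|&(_, d)| d);
    hits
}
```

With the vocabulary ["AUS", "AUT", "GBR", "USA"] and the input "UST", this sketch returns "USA" at distance 1 first, followed by "AUS" and "AUT" at distance 2 in vocabulary order; "GBR" is dropped because it is three edits away.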

pub fn correct_all_with_floor( &self, token: &str, confidence_floor: f32, ) -> Vec<FuzzyCorrection>

Like Self::correct_all but with a caller-controlled confidence floor.

confidence_floor MUST lie in [0.0, 1.0]. correction_confidence returns values in that range, so a negative floor would silently disable filtering (the comparison confidence >= negative_floor is always true) and a floor > 1.0 would silently drop every match. Debug builds panic on an out-of-range floor, so the misuse is caught during development instead of shipping a release binary that returns counterintuitive unfiltered or empty results.
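A debug-only range check of this kind is usually written with `debug_assert!`. The following is a minimal sketch of the pattern, assuming a hypothetical helper `filter_by_floor` that operates on bare confidence values; the crate's actual check and its panic message may differ.

```rust
/// Keeps only the confidences at or above `floor`.
/// In debug builds, panics if `floor` is outside [0.0, 1.0] (NaN included);
/// in release builds the assertion is compiled out.
fn filter_by_floor(confidences: &[f32], floor: f32) -> Vec<f32> {
    debug_assert!(
        (0.0..=1.0).contains(&floor),
        "confidence floor must lie in [0.0, 1.0], got {}",
        floor
    );
    confidences.iter().copied().filter(|&c| c >= floor).collect()
}
```

Because `debug_assert!` disappears in release builds, the check costs nothing in production while still catching an out-of-range floor in tests and local runs.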

The default floor (MIN_USEFUL_CONFIDENCE = 0.45) excludes distance-2 corrections of 3-char inputs, which is the right safety policy for the standard fuzzy path because those corrections are too speculative without surrounding context. The decoder’s REL TO trigraph expansion (issue #233) supplies surrounding context — the candidate goes through the strict REL TO parser, the resulting marking has a corpus-weighted trigraph prior, and the decoder’s UNAMBIGUOUS_LOG_MARGIN breaks ties at score time. Lowering the floor for that specific call site is what lets a typo like ASU → AUS (distance 2 in plain Levenshtein) reach the scorer.

Callers passing a floor of 0.0 get every match within MAX_EDIT_DISTANCE.
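The effect of the floor on short tokens can be made concrete with a stand-in confidence formula. The formula below (1 minus distance over token length) is purely an assumption chosen because it reproduces the behavior described above; the real `correction_confidence` may be computed differently. `assumed_confidence` and `passes_floor` are hypothetical names.

```rust
/// Default floor as stated in the documentation above.
const MIN_USEFUL_CONFIDENCE: f32 = 0.45;

/// ASSUMPTION: stand-in formula, 1 - distance / length, clamped to [0, 1].
/// This is not taken from the crate.
fn assumed_confidence(distance: usize, token_len: usize) -> f32 {
    if token_len == 0 {
        return 0.0;
    }
    (1.0 - distance as f32 / token_len as f32).clamp(0.0, 1.0)
}

/// True when a candidate at `distance` from a token of `token_len`
/// bytes survives the given floor.
fn passes_floor(distance: usize, token_len: usize, floor: f32) -> bool {
    assumed_confidence(distance, token_len) >= floor
}
```

Under this assumed formula, a distance-2 correction of a 3-character input scores about 0.33, so the default floor rejects it and a floor of 0.0 admits it, which matches the ASU → AUS case described above.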

Auto Trait Implementations§

impl<'v> Freeze for FuzzyVocabMatcher<'v>

impl<'v> RefUnwindSafe for FuzzyVocabMatcher<'v>

impl<'v> Send for FuzzyVocabMatcher<'v>

impl<'v> Sync for FuzzyVocabMatcher<'v>

impl<'v> Unpin for FuzzyVocabMatcher<'v>

impl<'v> UnsafeUnpin for FuzzyVocabMatcher<'v>

impl<'v> UnwindSafe for FuzzyVocabMatcher<'v>

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.