Module fuzzy


Vocabulary-aware fuzzy correction for CAPCO tokens.

§Design

CAPCO markings are built from a closed vocabulary of ~52 CVE tokens (classification levels, SCI controls, dissemination controls, and a handful of structural keywords). OCR and manual transcription errors produce near-miss variants — SERCET, NOFRON, CONFIDETIAL — that no rule ever fires on because the scanner never detects them as marking candidates.

The approach here mirrors what makes the typos spell checker so effective at eliminating false positives, adapted for the closed-world property of the CAPCO vocabulary:

  1. Closed-world validation first. If the input token is already in the vocabulary, return None immediately — no correction needed, no false positive possible.

  2. Exhaustive near-miss search. Because the vocabulary is tiny (~52 tokens), computing Levenshtein edit distance to every known token is fast (microseconds).

  3. Ambiguity rejection. If two or more vocabulary entries are equally close, the correction is ambiguous. Return None and let the engine surface it as a human-review item — exactly what typos does for words that could correct to multiple targets.

  4. Minimum-length guard. Very short tokens (1-2 characters) are excluded from fuzzy matching because edit distance is semantically unreliable at that length. C, S, U are valid in context but look similar enough to dozens of other possibilities that any fuzzy suggestion would be noise. See MIN_FUZZY_LEN for the 2-char rationale (PR 7 SAR sub-compartment false-positives).

  5. Confidence scores. Each FuzzyCorrection carries a base confidence derived from edit distance and token length. The calling engine adjusts this by a context factor (+0.10 to +0.15 when the token is inside a detected marking region) before comparing the result against the configured threshold.
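
The five steps above can be sketched roughly as follows. This is a minimal illustration, not the module's actual implementation: `Suggestion`, `suggest`, and `levenshtein` are hypothetical names, the constant values (2 and 3) are assumed rather than taken from the real MAX_EDIT_DISTANCE and MIN_FUZZY_LEN definitions, and the confidence formula is only one plausible way to combine edit distance and token length.

```rust
// Assumed values for illustration; see MAX_EDIT_DISTANCE and MIN_FUZZY_LEN
// for the real definitions.
const MAX_EDIT_DISTANCE: usize = 2;
const MIN_FUZZY_LEN: usize = 3;

/// Hypothetical stand-in for FuzzyCorrection.
#[derive(Debug, PartialEq)]
struct Suggestion {
    target: &'static str,
    distance: usize,
    confidence: f32,
}

/// Two-row Levenshtein distance, compared byte-wise (assumes ASCII tokens).
fn levenshtein(a: &str, b: &str) -> usize {
    let (a, b) = (a.as_bytes(), b.as_bytes());
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut cur = vec![0usize; b.len() + 1];
    for (i, &ca) in a.iter().enumerate() {
        cur[0] = i + 1;
        for (j, &cb) in b.iter().enumerate() {
            let substitute = prev[j] + usize::from(ca != cb);
            cur[j + 1] = substitute.min(prev[j + 1] + 1).min(cur[j] + 1);
        }
        std::mem::swap(&mut prev, &mut cur);
    }
    prev[b.len()]
}

fn suggest(token: &str, vocab: &[&'static str]) -> Option<Suggestion> {
    // Step 1: closed-world validation. Known tokens need no correction.
    if vocab.iter().any(|v| *v == token) {
        return None;
    }
    // Step 4: minimum-length guard for very short tokens.
    if token.len() < MIN_FUZZY_LEN {
        return None;
    }
    // Step 2: exhaustive search over the (tiny) vocabulary.
    let mut best: Option<(&'static str, usize)> = None;
    let mut tied = false;
    for &candidate in vocab {
        let d = levenshtein(token, candidate);
        if d > MAX_EDIT_DISTANCE {
            continue;
        }
        match best {
            Some((_, best_d)) if d > best_d => {}
            Some((_, best_d)) if d == best_d => tied = true,
            _ => {
                best = Some((candidate, d));
                tied = false;
            }
        }
    }
    // Step 3: ambiguity rejection. Equally close candidates yield no suggestion.
    if tied {
        return None;
    }
    let (target, distance) = best?;
    // Step 5: base confidence (illustrative formula, an assumption).
    let longest = token.len().max(target.len()) as f32;
    let confidence = 1.0 - distance as f32 / longest;
    Some(Suggestion { target, distance, confidence })
}
```

With a toy vocabulary such as `["SECRET", "NOFORN", "CONFIDENTIAL"]`, calling `suggest("CONFIDETIAL", &vocab)` would propose CONFIDENTIAL at distance 1, while `suggest("SECRET", &vocab)` and `suggest("S", &vocab)` both return `None`.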

§Integration Points

The FuzzyVocabMatcher is injected into the engine's pre-scanner step. In the default configuration it runs after the AhoCorasick corrections-map pass, so user-configured exact corrections take priority and the fuzzy matcher handles only the residual OCR noise that the exact map does not cover.
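
The ordering described above might look like the following sketch. Everything here is hypothetical: `Resolution` and `resolve` are illustrative names, a plain `HashMap` stands in for the AhoCorasick corrections map, a closure stands in for the fuzzy matcher, and the context factor is modeled as a fixed additive boost of 0.10, which is an assumption about how the engine applies it.

```rust
use std::collections::HashMap;

/// Hypothetical outcome of the pre-scanner step for a single token.
#[derive(Debug, PartialEq)]
enum Resolution {
    /// Fixed by a user-configured exact correction.
    Exact(String),
    /// Fixed by the fuzzy matcher with the given adjusted confidence.
    Fuzzy { target: String, confidence: f32 },
    /// Left as-is (no candidate, or confidence below the threshold).
    Unchanged,
}

/// Assumed context boost for tokens inside a detected marking region.
const CONTEXT_BOOST: f32 = 0.10;

fn resolve(
    token: &str,
    exact: &HashMap<&str, &str>,
    fuzzy: impl Fn(&str) -> Option<(String, f32)>,
    in_marking_region: bool,
    threshold: f32,
) -> Resolution {
    // Exact corrections take priority over any fuzzy suggestion.
    if let Some(&fixed) = exact.get(token) {
        return Resolution::Exact(fixed.to_string());
    }
    // The fuzzy matcher only sees tokens the exact map did not cover.
    match fuzzy(token) {
        Some((target, base)) => {
            let boost = if in_marking_region { CONTEXT_BOOST } else { 0.0 };
            let confidence = (base + boost).min(1.0);
            if confidence >= threshold {
                Resolution::Fuzzy { target, confidence }
            } else {
                Resolution::Unchanged
            }
        }
        None => Resolution::Unchanged,
    }
}
```

In this arrangement a token that appears in the exact map never reaches the fuzzy matcher, and a borderline fuzzy candidate is accepted only when the surrounding context lifts it over the threshold.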

WASM-safe: no I/O, no platform-specific code, and only small transient heap allocations during edit-distance computation.

Structs§

FuzzyCorrection
A correction candidate produced by FuzzyVocabMatcher::correct.
FuzzyVocabMatcher
Vocabulary-aware fuzzy corrector for a closed token set.

Constants§

MAX_EDIT_DISTANCE
Maximum Levenshtein edit distance considered for a correction.
MIN_FUZZY_LEN
Minimum input token length for fuzzy matching.