Vocabulary-aware fuzzy correction for CAPCO tokens.
§Design
CAPCO markings are built from a closed vocabulary of ~52 CVE tokens
(classification levels, SCI controls, dissemination controls, and a handful
of structural keywords). OCR and manual transcription errors produce
near-miss variants — SERCET, NOFRON, CONFIDETIAL — that no rule ever
fires on because the scanner never detects them as marking candidates.
The approach here mirrors what makes typos
so effective at eliminating false positives, adapted for the closed-world
property of CAPCO vocabulary:
- Closed-world validation first. If the input token is already in the vocabulary, return None immediately: no correction is needed and no false positive is possible.
- Exhaustive near-miss search. Because the vocabulary is tiny (~52 tokens), computing Levenshtein edit distance to every known token is fast (microseconds).
- Ambiguity rejection. If two or more vocabulary entries are equally close, the correction is ambiguous. Return None and let the engine surface it as a human-review item, exactly what typos does for words that could correct to multiple targets.
- Minimum-length guard. Very short tokens (1-2 characters) are excluded from fuzzy matching because edit distance is semantically unreliable at that length. C, S, U are valid in context but look similar enough to dozens of other possibilities that any fuzzy suggestion would be noise. See MIN_FUZZY_LEN for the 2-char rationale (PR 7 SAR sub-compartment false-positives).
- Confidence scores. Each FuzzyCorrection carries a base confidence derived from edit distance and token length. The calling engine multiplies this by a context factor (+0.10–0.15 when the token is inside a detected marking region) before comparing against the configured threshold.
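The design points above can be sketched in a few dozen lines. This is a minimal illustration, not the crate's implementation: `sketch_correct`, `SketchCorrection`, `ASSUMED_MIN_LEN`, and `ASSUMED_MAX_DISTANCE` are hypothetical names, the constant values are guesses (the real values of MIN_FUZZY_LEN and MAX_EDIT_DISTANCE are not stated here), and the confidence formula `1 - distance / length` is an assumption standing in for whatever the real matcher computes. The context factor applied by the engine is left out.

```rust
// Hypothetical stand-ins for MIN_FUZZY_LEN and MAX_EDIT_DISTANCE; values are assumed.
const ASSUMED_MIN_LEN: usize = 3;
const ASSUMED_MAX_DISTANCE: usize = 2;

/// Hypothetical result type, loosely modelled on the idea of FuzzyCorrection.
#[derive(Debug, Clone, PartialEq)]
struct SketchCorrection {
    suggestion: String,
    distance: usize,
    confidence: f32,
}

/// Two-row Levenshtein distance over bytes (the vocabulary is assumed ASCII).
fn levenshtein(a: &str, b: &str) -> usize {
    let (a, b) = (a.as_bytes(), b.as_bytes());
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut cur = vec![0usize; b.len() + 1];
    for (i, &ca) in a.iter().enumerate() {
        cur[0] = i + 1;
        for (j, &cb) in b.iter().enumerate() {
            let substitute = prev[j] + usize::from(ca != cb);
            cur[j + 1] = substitute.min(prev[j + 1] + 1).min(cur[j] + 1);
        }
        std::mem::swap(&mut prev, &mut cur);
    }
    prev[b.len()]
}

fn sketch_correct(token: &str, vocab: &[&str]) -> Option<SketchCorrection> {
    // Closed-world validation: a known token never needs correcting.
    if vocab.contains(&token) {
        return None;
    }
    // Minimum-length guard: very short tokens are too noisy to match.
    if token.len() < ASSUMED_MIN_LEN {
        return None;
    }
    // Exhaustive search: compare against every vocabulary entry.
    let mut best: Option<(&str, usize)> = None;
    let mut tied = false;
    for &candidate in vocab {
        let d = levenshtein(token, candidate);
        if d > ASSUMED_MAX_DISTANCE {
            continue;
        }
        match best {
            Some((_, best_d)) if d > best_d => {}
            Some((_, best_d)) if d == best_d => tied = true,
            _ => {
                best = Some((candidate, d));
                tied = false;
            }
        }
    }
    // Ambiguity rejection: two equally close entries mean no suggestion.
    if tied {
        return None;
    }
    let (suggestion, distance) = best?;
    // Assumed base confidence: fewer edits on a longer token scores higher.
    let confidence = 1.0 - distance as f32 / token.len() as f32;
    Some(SketchCorrection { suggestion: suggestion.to_string(), distance, confidence })
}
```

Under these assumptions, SERCET resolves to SECRET at distance 2, because plain Levenshtein counts a transposition as two substitutions. A matcher that treats adjacent swaps as a single edit (Damerau-Levenshtein) would score such OCR errors higher; which variant the real matcher uses is not specified here.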
§Integration Points
The FuzzyVocabMatcher is injected into the engine’s pre-scanner step.
In the default configuration it operates after the AhoCorasick corrections
map pass — user-configured exact corrections take priority; the fuzzy matcher
handles residual OCR noise the exact map doesn’t cover.
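The precedence described above (exact corrections first, fuzzy matching as a fallback) might look like the following sketch. It is deliberately simplified: a per-token `HashMap` lookup stands in for the AhoCorasick corrections pass, the fuzzy step is passed in as a closure, and `correct_token` is a hypothetical name rather than part of the engine's API.

```rust
use std::collections::HashMap;

/// Hypothetical helper: try the user-configured exact map, then fall back to fuzzy.
fn correct_token<F>(token: &str, exact: &HashMap<&str, &str>, fuzzy: F) -> Option<String>
where
    F: Fn(&str) -> Option<String>,
{
    // User-configured exact corrections take priority.
    if let Some(fixed) = exact.get(token) {
        return Some((*fixed).to_string());
    }
    // Only residual tokens the exact map does not cover reach the fuzzy step.
    fuzzy(token)
}
```

Keeping the fuzzy step behind a fallback means an explicit user correction is never overridden by a heuristic one.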
WASM-safe: no I/O, no platform-specific code, and only small transient heap allocations during edit-distance computation.
Structs§
- FuzzyCorrection: A correction candidate produced by FuzzyVocabMatcher::correct.
- FuzzyVocabMatcher: Vocabulary-aware fuzzy corrector for a closed token set.
Constants§
- MAX_EDIT_DISTANCE: Maximum Levenshtein edit distance considered for a correction.
- MIN_FUZZY_LEN: Minimum input token length for fuzzy matching.