Module matching


Fused normalization pipeline for case-insensitive, confusable-aware matching.

Pipeline: NFKC → CaseFold → Confusable Skeleton (NFD → confusable_map → NFD).

Two strings that produce the same normalize_for_matching output are equivalent for matching purposes: they share the same compatibility decomposition, the same case folding, and the same confusable prototype.
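As a rough illustration of this equivalence, the sketch below chains three toy stages and compares the results. It is not this module's implementation: `toy_compat`, `toy_casefold`, `toy_skeleton`, `toy_normalize`, and `toy_matches` are hypothetical names, the compatibility stage only folds fullwidth ASCII forms, the case stage uses plain lowercasing instead of full case folding, and the confusable table holds just five Cyrillic letters.

```rust
/// Stand-in for NFKC (assumption): folds only the fullwidth ASCII forms
/// U+FF01..=U+FF5E down to their ASCII counterparts.
fn toy_compat(s: &str) -> String {
    s.chars()
        .map(|c| match c {
            '\u{FF01}'..='\u{FF5E}' => char::from_u32(c as u32 - 0xFEE0).unwrap_or(c),
            _ => c,
        })
        .collect()
}

/// Stand-in for case folding (assumption): simple lowercasing.
/// Real case folding differs for some characters, for example 'ß'.
fn toy_casefold(s: &str) -> String {
    s.to_lowercase()
}

/// Stand-in for the confusable skeleton (assumption): a five-entry
/// Cyrillic-to-Latin table instead of the full confusable mapping.
fn toy_skeleton(s: &str) -> String {
    s.chars()
        .map(|c| match c {
            '\u{0430}' => 'a',
            '\u{0435}' => 'e',
            '\u{043E}' => 'o',
            '\u{0440}' => 'p',
            '\u{0441}' => 'c',
            _ => c,
        })
        .collect()
}

/// Runs the toy stages in pipeline order.
fn toy_normalize(s: &str) -> String {
    toy_skeleton(&toy_casefold(&toy_compat(s)))
}

/// Two inputs match when their normalized forms are identical.
fn toy_matches(a: &str, b: &str) -> bool {
    toy_normalize(a) == toy_normalize(b)
}
```

Under these assumptions, `toy_matches("paypal", "p\u{0430}yp\u{0430}l")` holds because both Cyrillic letters map to the Latin prototype `a`, while `toy_matches("paypal", "paypa1")` does not, since the toy table has no entry for the digit.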

§Optimization summary (Component E)

The matching pipeline composes four conceptually distinct stages (NFKC → casefold → skeleton → casefold). A naive implementation walks the input four times and allocates three intermediate strings between stages. We preserve that staged structure for correctness (full-fusion attempts produced subtle parity divergences against the legacy chain on cross-codepoint canonical reordering cases), but every individual stage is optimized:

  • NFKC is the existing fused decomposer/composer (Component D), running at peak SIMD throughput on the hot ASCII / Latin-1 path.
  • Casefold has a SIMD-driven ASCII fast path that scans 64-byte chunks for non-ASCII / uppercase bytes and lowercases via b | 0x20, avoiding per-byte trie lookups on pure-ASCII regions (see crate::casefold).
  • Skeleton uses a 256-byte Bloom filter to skip the binary search into the confusable mapping table for the vast majority of codepoints that have no mapping (see tables::confusable_bloom_might_contain, wired into crate::confusable::skeleton).
  • The outer fixed-point loop runs at most 4 iterations; in practice it converges after 1.
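The per-stage optimizations listed above can be sketched in scalar Rust. Everything here is an assumption made for illustration rather than the crate's code: the names `casefold_ascii_fast`, `Bloom`, `TOY_TABLE`, `toy_bloom`, `lookup`, and `fixed_point` are hypothetical, the chunk scan is a plain loop where the real code uses SIMD, the two hash functions are arbitrary multiplicative hashes, and the filter is built at runtime from a five-entry table.

```rust
/// Scalar stand-in for the SIMD scan: 64-byte chunks that are pure ASCII
/// are lowercased with `b | 0x20`; anything else takes the per-char path.
fn casefold_ascii_fast(input: &str) -> String {
    let bytes = input.as_bytes();
    let mut out = String::with_capacity(input.len());
    let mut i = 0; // always on a char boundary
    while i < bytes.len() {
        let end = (i + 64).min(bytes.len());
        if bytes[i..end].is_ascii() {
            for &b in &bytes[i..end] {
                // Only uppercase letters get the 0x20 bit; other bytes pass through.
                out.push((if b.is_ascii_uppercase() { b | 0x20 } else { b }) as char);
            }
            i = end;
        } else if let Some(c) = input[i..].chars().next() {
            // Simplified slow path: fold one char (lowercasing as a stand-in
            // for real case folding), then rescan from the next boundary.
            out.extend(c.to_lowercase());
            i += c.len_utf8();
        }
    }
    out
}

/// 256-byte (2048-bit) Bloom filter over codepoints, two probes per key.
struct Bloom {
    bits: [u8; 256],
}

impl Bloom {
    /// Two bit positions in 0..2048 (top 11 bits of a multiplicative hash).
    fn slots(cp: u32) -> [usize; 2] {
        let h1 = cp.wrapping_mul(0x9E37_79B1) >> 21;
        let h2 = cp.wrapping_mul(0x85EB_CA6B) >> 21;
        [h1 as usize, h2 as usize]
    }

    fn from_keys(keys: &[u32]) -> Self {
        let mut bits = [0u8; 256];
        for &k in keys {
            for &s in Self::slots(k).iter() {
                bits[s / 8] |= 1 << (s % 8);
            }
        }
        Bloom { bits }
    }

    /// False means "definitely absent"; true means "possibly present".
    fn might_contain(&self, cp: u32) -> bool {
        Self::slots(cp).iter().all(|&s| (self.bits[s / 8] & (1 << (s % 8))) != 0)
    }
}

/// Toy confusable table, sorted by codepoint so it can be binary searched.
const TOY_TABLE: &[(u32, &str)] =
    &[(0x0430, "a"), (0x0435, "e"), (0x043E, "o"), (0x0440, "p"), (0x0441, "c")];

fn toy_bloom() -> Bloom {
    let keys: Vec<u32> = TOY_TABLE.iter().map(|&(k, _)| k).collect();
    Bloom::from_keys(&keys)
}

/// Consults the filter first; the binary search runs only on a possible hit.
fn lookup(bloom: &Bloom, cp: u32) -> Option<&'static str> {
    if !bloom.might_contain(cp) {
        return None;
    }
    TOY_TABLE.binary_search_by_key(&cp, |&(k, _)| k).ok().map(|i| TOY_TABLE[i].1)
}

/// Re-applies `pass` until the output stops changing or `max_iters` is hit.
fn fixed_point(input: &str, pass: impl Fn(&str) -> String, max_iters: usize) -> String {
    let mut cur = input.to_owned();
    for _ in 0..max_iters {
        let next = pass(&cur);
        if next == cur {
            break;
        }
        cur = next;
    }
    cur
}
```

A Bloom filter never reports a false negative, so skipping the binary search on a "definitely absent" answer cannot lose a mapping; a false positive only costs one search that finds nothing. In this sketch the loop cap would be passed as `max_iters = 4` to mirror the bound stated above.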

Structs§

MatchingOptions
Options for the matching normalization pipeline.

Functions§

matches_normalized
Check whether two strings match after full normalization.
normalize_for_matching
Normalize input for matching: NFKC → CaseFold → Confusable Skeleton.
normalize_for_matching_utf16
Normalize input for matching and encode the result as UTF-16.