Fused normalization pipeline for case-insensitive, confusable-aware matching.
Pipeline: NFKC → CaseFold → Confusable Skeleton (NFD → confusable_map → NFD).
Two strings that produce the same normalize_for_matching output are
equivalent for matching purposes: they share the same compatibility
decomposition, the same case folding, and the same confusable prototype.
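As a rough illustration of that equivalence rule, the sketch below chains three toy stages in the same order and compares the outputs. Everything in it is an assumption made for illustration: the `toy_*` functions are hypothetical stand-ins, not this crate's API, and their tiny hard-coded mappings cover only a handful of codepoints rather than the real Unicode data.

```rust
/// Toy stand-in for NFKC: folds fullwidth Latin capitals
/// (U+FF21..=U+FF3A) to their ASCII counterparts and nothing else.
fn toy_nfkc(s: &str) -> String {
    s.chars()
        .map(|c| match c {
            '\u{FF21}'..='\u{FF3A}' => {
                char::from_u32(c as u32 - 0xFF21 + 'A' as u32).unwrap_or(c)
            }
            _ => c,
        })
        .collect()
}

/// Toy stand-in for case folding: std lowercasing, which is close to
/// but not the same as full Unicode case folding.
fn toy_casefold(s: &str) -> String {
    s.to_lowercase()
}

/// Toy stand-in for the confusable skeleton: maps two Cyrillic letters
/// to their Latin look-alikes.
fn toy_skeleton(s: &str) -> String {
    s.chars()
        .map(|c| match c {
            '\u{0430}' => 'a', // CYRILLIC SMALL LETTER A
            '\u{0435}' => 'e', // CYRILLIC SMALL LETTER IE
            _ => c,
        })
        .collect()
}

/// Hypothetical analogue of the pipeline: NFKC, then casefold, then skeleton.
fn toy_normalize_for_matching(s: &str) -> String {
    toy_skeleton(&toy_casefold(&toy_nfkc(s)))
}

/// Two strings match when their normalized forms are identical.
fn toy_matches(a: &str, b: &str) -> bool {
    toy_normalize_for_matching(a) == toy_normalize_for_matching(b)
}
```

With these toy tables, `toy_matches("PAYPAL", "p\u{0430}yp\u{0430}l")` holds because both sides reduce to the same string, while `toy_matches("paypal", "paypa1")` does not, since the toy skeleton has no mapping for the digit.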
§Optimization summary (Component E)
The matching pipeline composes four conceptually distinct stages
(NFKC → casefold → skeleton → casefold). A naive implementation walks
the input four times with three string allocations between stages. We
preserve that staged structure for correctness (full-fusion attempts
produced subtle parity divergences against the legacy chain on
cross-codepoint canonical reorder cases), but every individual stage is
optimized:
- NFKC is the existing fused decomposer/composer (Component D), running at peak SIMD throughput on the hot ASCII / Latin-1 path.
- Casefold has a SIMD-driven ASCII fast path that scans 64-byte chunks for non-ASCII / uppercase bytes and lowercases via b | 0x20, avoiding per-byte trie lookups on pure-ASCII regions (see crate::casefold).
- Skeleton uses a 256-byte bloom filter to skip the binary search into the confusable mapping table for the vast majority of codepoints that have no mapping (see tables::confusable_bloom_might_contain, wired into crate::confusable::skeleton).
- Outer fixed-point loop runs at most 4 iterations; in practice it converges after 1.
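The per-stage ideas in the list above can be sketched in plain Rust. This is a minimal illustration under stated assumptions, not the crate's implementation: the chunk scan is scalar code standing in for SIMD, the hash multipliers and the two-probe bloom layout are arbitrary choices, and every name here (`fold_ascii_prefix`, `Bloom`, `lookup`, `fixed_point`) is hypothetical.

```rust
/// Chunk width for the scan; the text above mentions 64-byte chunks.
const CHUNK: usize = 64;

/// Lowercases the leading run of pure-ASCII chunks into `out` and returns
/// how many input bytes were consumed. The caller would hand the rest of
/// the input to a slower, table-driven path.
fn fold_ascii_prefix(input: &[u8], out: &mut Vec<u8>) -> usize {
    let mut consumed = 0;
    for chunk in input.chunks(CHUNK) {
        // Scalar stand-in for a SIMD test: OR all bytes, check the high bit.
        let any_non_ascii = chunk.iter().fold(0u8, |acc, &b| acc | b) & 0x80 != 0;
        if any_non_ascii {
            break;
        }
        // Setting bit 0x20 turns 'A'..='Z' into 'a'..='z'.
        out.extend(
            chunk
                .iter()
                .map(|&b| if b.is_ascii_uppercase() { b | 0x20 } else { b }),
        );
        consumed += chunk.len();
    }
    consumed
}

/// A 256-byte (2048-bit) bloom filter over codepoints, two probes per key.
struct Bloom {
    bits: [u8; 256],
}

impl Bloom {
    fn new() -> Self {
        Bloom { bits: [0; 256] }
    }

    /// Two bit positions in 0..2048, taken from the top 11 bits of two
    /// multiplicative hashes (the constants are arbitrary odd numbers).
    fn slots(cp: u32) -> [usize; 2] {
        let h1 = cp.wrapping_mul(0x9E37_79B1) >> 21;
        let h2 = cp.wrapping_mul(0x85EB_CA6B) >> 21;
        [h1 as usize, h2 as usize]
    }

    fn insert(&mut self, cp: u32) {
        for s in Self::slots(cp) {
            self.bits[s / 8] |= 1 << (s % 8);
        }
    }

    /// `false` means the codepoint is definitely absent; `true` means it
    /// may be present and the table must be consulted.
    fn might_contain(&self, cp: u32) -> bool {
        Self::slots(cp)
            .iter()
            .all(|&s| self.bits[s / 8] & (1 << (s % 8)) != 0)
    }
}

/// Looks up a mapping in a table sorted by codepoint, consulting the bloom
/// filter first so most unmapped codepoints skip the binary search.
fn lookup(bloom: &Bloom, table: &[(u32, &'static str)], cp: u32) -> Option<&'static str> {
    if !bloom.might_contain(cp) {
        return None;
    }
    table
        .binary_search_by_key(&cp, |&(k, _)| k)
        .ok()
        .map(|i| table[i].1)
}

/// Applies `pass` until the output stops changing, capped at 4 iterations.
fn fixed_point(input: &str, pass: impl Fn(&str) -> String) -> String {
    let mut cur = input.to_owned();
    for _ in 0..4 {
        let next = pass(&cur);
        if next == cur {
            break;
        }
        cur = next;
    }
    cur
}
```

A bloom filter can report false positives but never false negatives, so the early `None` in `lookup` is safe: any codepoint that was inserted always reaches the binary search. The iteration cap in `fixed_point` bounds the worst case even if a pass never stabilizes.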
Structs§
- MatchingOptions - Options for the matching normalization pipeline.
Functions§
- matches_normalized - Check whether two strings match after full normalization.
- normalize_for_matching - Normalize input for matching: NFKC → CaseFold → Confusable Skeleton.
- normalize_for_matching_utf16 - Normalize input for matching and encode the result as UTF-16.