Invisible-character & tag-character encoders (Plane 14 tag chars, variation selectors, stylistic ligatures, enclosed alphanumerics, soft hyphens, word joiners). Looks identical, normalizes identical, byte stream is unrecognizable.
A class of encodings the rest of unicode.rs doesn’t cover. They share
one trait: the rendered or normalized string LOOKS exactly like the
original to a human or to a downstream tokenizer, but the byte stream a
WAF inspects bears no resemblance to the keywords it has rules for.
- Tag characters (U+E0000–U+E007F, the Plane 14 “Tags” block). Each ASCII codepoint c has a tag equivalent at U+E0000 + c. Subtract the offset and you recover the original ASCII. Prompt-injection research has shown that modern LLM tokenizers preserve and decode these, meaning an LLM-backed WAF will see a benign-looking blob while still receiving the attack tokens.
- Variation selectors (U+FE00–U+FE0F, U+E0100–U+E01EF). Select glyph variants, emoji presentation among them. Some normalizers strip them; some preserve them. A WAF that strips them has to cover every codepoint in two non-contiguous ranges, which most don’t.
- Stylistic ligatures (U+FB00–U+FB06). ff/fi/fl/ffi/ffl/ſt/st. NFKC decomposes them; non-NFKC tokenizers see them as single codepoints not in any keyword. Defeats filters that inspect the stream before it is normalized.
- Enclosed alphanumerics (U+24B6–U+24E9 circled, U+1F110–U+1F129 parenthesized uppercase). Circled forms compatibility-decompose to plain Latin under NFKC; parenthesized forms decompose to the letter wrapped in ASCII parentheses. Backends that apply NFKC see the keyword; WAFs that don’t, don’t.
- Soft hyphen / format chars (U+00AD, U+200B–U+200D, U+2060, U+FEFF). Some of these already live in unicode::zero_width_inject for selective injection. This module exposes them as a Strategy-compatible whole-string encoder too, for cases where the engine wants to swap encoders rather than compose them.
All encoders here preserve UTF-8 validity and are byte-deterministic given the same input. None of them require entropy.
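The tag-character mapping described above can be sketched with the standard library alone. This is a minimal illustration, not this module's implementation: the names tag_encode_sketch and tag_decode_sketch are hypothetical, and passing non-ASCII input through unchanged is an assumption of the sketch.

```rust
/// Map each ASCII char `c` to the tag character at U+E0000 + c.
/// Non-ASCII chars pass through unchanged (an assumption of this sketch).
fn tag_encode_sketch(input: &str) -> String {
    input
        .chars()
        .map(|c| {
            if c.is_ascii() {
                // U+E0000..=U+E007F are valid scalar values, so the
                // fallback to `c` is never expected to be taken.
                char::from_u32(0xE0000 + c as u32).unwrap_or(c)
            } else {
                c
            }
        })
        .collect()
}

/// Inverse mapping: subtract the U+E0000 offset from every tag character
/// and leave all other chars untouched.
fn tag_decode_sketch(input: &str) -> String {
    input
        .chars()
        .map(|c| {
            let cp = c as u32;
            if (0xE0000..=0xE007F).contains(&cp) {
                char::from_u32(cp - 0xE0000).unwrap_or(c)
            } else {
                c
            }
        })
        .collect()
}
```

Both functions are pure, so the same input always yields the same bytes, and the output is a valid String by construction. Encoding "SELECT" this way produces six codepoints of four UTF-8 bytes each with no ASCII byte among them, and decoding returns the original string.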
§Why a new module
unicode.rs is already 17K LOC of encoders. The encoders here belong
together as a class — “looks identical, parses identical, byte stream
is unrecognizable” — and putting them next to the case-folding /
homoglyph / math-alphabet encoders would dilute that boundary.
Constants§
- INVISIBLE_ENCODER_NAMES - The list of every invisible-class encoder name shipped by this module, used by the integration test to assert that the dispatcher in strategy.rs has wired every one of them.
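The wiring check that this constant supports might look roughly like the sketch below. Everything in it is hypothetical: NAMES_SKETCH stands in for the real constant (whose type and contents are not shown here, and whose entries are assumed to match the function names), dispatch_sketch stands in for the dispatcher in strategy.rs, and placeholder_encoder stands in for the real encoders.

```rust
/// Hypothetical stand-in for the module's name list. Assumes encoder
/// names match the function names documented on this page.
const NAMES_SKETCH: &[&str] = &[
    "circled_letter_encode",
    "ligature_encode",
    "parenthesized_letter_encode",
    "soft_hyphen_inject",
    "tag_char_encode",
    "variation_selector_pad",
    "variation_selector_supplementary_pad",
    "word_joiner_wrap",
];

/// Shape assumed for a whole-string encoder.
type Encoder = fn(&str) -> String;

/// Placeholder body; a real dispatcher would return the actual encoders.
fn placeholder_encoder(input: &str) -> String {
    input.to_owned()
}

/// Hypothetical dispatcher: resolves an encoder name to a function.
fn dispatch_sketch(name: &str) -> Option<Encoder> {
    let encoder: Encoder = match name {
        "circled_letter_encode"
        | "ligature_encode"
        | "parenthesized_letter_encode"
        | "soft_hyphen_inject"
        | "tag_char_encode"
        | "variation_selector_pad"
        | "variation_selector_supplementary_pad"
        | "word_joiner_wrap" => placeholder_encoder,
        _ => return None,
    };
    Some(encoder)
}

/// Names from the list that the dispatcher does not resolve.
/// An integration test would assert that this is empty.
fn unwired_names() -> Vec<&'static str> {
    NAMES_SKETCH
        .iter()
        .copied()
        .filter(|name| dispatch_sketch(name).is_none())
        .collect()
}
```

Driving the test from the name list rather than from the dispatcher means a newly added encoder fails the check until it is wired in.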
Functions§
- circled_letter_encode - Replace every ASCII letter with its circled compatibility-equivalent (U+24B6..=U+24CF for uppercase, U+24D0..=U+24E9 for lowercase).
- ligature_encode - Replace canonical ligature digraphs with their precomposed stylistic ligature codepoints (U+FB00..=U+FB06).
- parenthesized_letter_encode - Replace every ASCII letter with its parenthesized compatibility-equivalent (U+1F110..=U+1F129 for uppercase, U+249C..=U+24B5 for lowercase).
- soft_hyphen_inject - Inject U+00AD SOFT HYPHEN between every pair of codepoints.
- tag_char_encode - Encode every ASCII byte as its Plane 14 tag-character equivalent.
- variation_selector_pad - Append a variation selector (U+FE0F by default) after every codepoint.
- variation_selector_supplementary_pad - Pad every codepoint with a deterministic-but-different variation selector drawn from the supplementary range U+E0100..=U+E01EF.
- word_joiner_wrap - Wrap each codepoint in U+2060 WORD JOINER.
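To make the one-line descriptions above concrete, here is a rough sketch of four of these transforms using only the standard library. The _sketch names are hypothetical and the real signatures may differ. In particular, cycling through the supplementary selectors by codepoint index is an assumption, since the description only says the choice is deterministic.

```rust
/// Insert U+00AD SOFT HYPHEN between every pair of adjacent codepoints.
fn soft_hyphen_inject_sketch(input: &str) -> String {
    let mut out = String::new();
    for (i, c) in input.chars().enumerate() {
        if i > 0 {
            out.push('\u{00AD}');
        }
        out.push(c);
    }
    out
}

/// Append U+FE0F after every codepoint.
fn variation_selector_pad_sketch(input: &str) -> String {
    let mut out = String::new();
    for c in input.chars() {
        out.push(c);
        out.push('\u{FE0F}');
    }
    out
}

/// Append a selector from U+E0100..=U+E01EF after every codepoint.
/// Assumption: the selector is chosen by codepoint index modulo 240,
/// the number of selectors in that range.
fn variation_selector_supplementary_pad_sketch(input: &str) -> String {
    let mut out = String::new();
    for (i, c) in input.chars().enumerate() {
        out.push(c);
        let selector = 0xE0100 + (i as u32 % 240);
        out.push(char::from_u32(selector).unwrap_or('\u{E0100}'));
    }
    out
}

/// Replace ASCII letters with circled forms; other chars pass through.
fn circled_letter_encode_sketch(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            'A'..='Z' => char::from_u32(0x24B6 + (c as u32 - 'A' as u32)).unwrap_or(c),
            'a'..='z' => char::from_u32(0x24D0 + (c as u32 - 'a' as u32)).unwrap_or(c),
            _ => c,
        })
        .collect()
}
```

Each function walks the input by char rather than by byte, so multi-byte input stays valid UTF-8 and no codepoint is split.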