Module invisible

Invisible-character & tag-character encoders (Plane 14 tag characters, variation selectors, stylistic ligatures, enclosed alphanumerics, soft hyphens, word joiners). Looks identical, normalizes identical, byte stream is unrecognizable.

A class of encodings the rest of unicode.rs doesn’t cover. They share one trait: the rendered or normalized string LOOKS exactly like the original to a human or to a downstream tokenizer, but the byte stream a WAF inspects bears no resemblance to the keywords it has rules for.

Tag characters (U+E0000–U+E007F, the Unicode Plane 14 “Tags” block). Each ASCII codepoint c has a tag equivalent at U+E0000 + c. Subtract 0xE0000 from each tag codepoint and you recover the original ASCII. Prompt-injection research has shown that some modern LLM tokenizers preserve and decode these, meaning an LLM-backed WAF may see a benign-looking blob while still receiving the attack tokens.
Variation selectors (U+FE00–U+FE0F, U+E0100–U+E01EF). Used to request a specific glyph variant, such as emoji versus text presentation. Some normalizers strip them; some preserve them. A WAF that strips them has to strip every codepoint in two non-contiguous ranges, which most don’t.
Stylistic ligatures (U+FB00–U+FB06). ff/fi/fl/ffi/ffl/ſt/st. NFKC decomposes them into their component letters; tokenizers that skip NFKC see single codepoints that appear in no keyword. This defeats filters that inspect the unnormalized stream when the backend normalizes afterwards.
Enclosed alphanumerics (U+24B6–U+24E9 circled; U+1F110–U+1F129 and U+249C–U+24B5 parenthesized). Compatibility-decompose to plain Latin under NFKC. Backends that apply NFKC see the keyword; WAFs that don’t apply it, don’t.
Soft hyphen / format chars (U+00AD, U+200B–U+200D, U+2060, U+FEFF). Some of these already live in unicode::zero_width_inject for selective injection. This module exposes them as a Strategy-compatible whole-string encoder too, for cases where the engine wants to swap encoders rather than compose them.
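The tag-character mapping described in the list above can be sketched in a few lines. This is a minimal illustration using only the standard library, not this module's implementation; the names `tag_encode_sketch` and `tag_decode_sketch` are hypothetical, and passing non-ASCII input through unchanged is an assumption.

```rust
/// Hypothetical sketch: map each ASCII char c to the tag character U+E0000 + c.
/// Assumption: non-ASCII characters are passed through unchanged.
fn tag_encode_sketch(input: &str) -> String {
    input
        .chars()
        .map(|c| {
            if c.is_ascii() {
                // U+E0000 + c is a valid scalar value for every c <= 0x7F.
                char::from_u32(0xE0000 + c as u32).unwrap_or(c)
            } else {
                c
            }
        })
        .collect()
}

/// Inverse mapping: shift tag characters back down into the ASCII range.
fn tag_decode_sketch(input: &str) -> String {
    input
        .chars()
        .map(|c| {
            let cp = c as u32;
            if (0xE0000..=0xE007F).contains(&cp) {
                char::from_u32(cp - 0xE0000).unwrap_or(c)
            } else {
                c
            }
        })
        .collect()
}
```

Each tag character occupies four bytes in UTF-8, so the encoded form of a six-letter keyword is 24 bytes that share no byte sequence with the original keyword, while `tag_decode_sketch` recovers it exactly.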

All encoders here preserve UTF-8 validity and are byte-deterministic given the same input. None of them require entropy.

§Why a new module

unicode.rs is already 17K LOC of encoders. The encoders here belong together as a class — “looks identical, parses identical, byte stream is unrecognizable” — and putting them next to the case-folding / homoglyph / math-alphabet encoders would dilute that boundary.

Constants§

INVISIBLE_ENCODER_NAMES: The list of every invisible-class encoder name shipped by this module, used by the integration test to assert that the dispatcher in strategy.rs has wired every one of them.
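A wiring check of this kind might look like the sketch below. Everything in it is assumed for illustration: the two-entry name list, the `Encoder` type alias, the stand-in encoders, and `dispatch_sketch` are hypothetical and do not reflect the actual contents of this constant or the dispatcher in strategy.rs.

```rust
/// Assumed shape of the name list; the real constant's contents may differ.
const ENCODER_NAMES_SKETCH: &[&str] = &["soft_hyphen_inject", "word_joiner_wrap"];

/// Hypothetical encoder signature: whole string in, whole string out.
type Encoder = fn(&str) -> String;

/// Stand-in encoder: U+00AD between adjacent code points.
fn soft_hyphen_inject_sketch(s: &str) -> String {
    let chars: Vec<String> = s.chars().map(String::from).collect();
    chars.join("\u{00AD}")
}

/// Stand-in encoder: U+2060 before and after each code point
/// (one possible reading of "wrap"; an assumption).
fn word_joiner_wrap_sketch(s: &str) -> String {
    s.chars()
        .map(|c| format!("\u{2060}{}\u{2060}", c))
        .collect()
}

/// Hypothetical dispatcher standing in for the one in strategy.rs.
fn dispatch_sketch(name: &str) -> Option<Encoder> {
    match name {
        "soft_hyphen_inject" => Some(soft_hyphen_inject_sketch as Encoder),
        "word_joiner_wrap" => Some(word_joiner_wrap_sketch as Encoder),
        _ => None,
    }
}

/// Names that appear in the list but that the dispatcher does not resolve.
fn unwired_names() -> Vec<&'static str> {
    ENCODER_NAMES_SKETCH
        .iter()
        .copied()
        .filter(|name| dispatch_sketch(name).is_none())
        .collect()
}
```

An integration test would then assert that `unwired_names()` is empty, so adding a name to the list without a matching dispatcher arm fails the build's test run.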

Functions§

circled_letter_encode: Replace every ASCII letter with its circled compatibility-equivalent (U+24B6..=U+24CF for uppercase, U+24D0..=U+24E9 for lowercase).
ligature_encode: Replace ligature-forming letter sequences (ff, fi, fl, ffi, ffl, st) with their precomposed stylistic ligature codepoints (U+FB00..=U+FB06).
parenthesized_letter_encode: Replace every ASCII letter with its parenthesized compatibility-equivalent (U+1F110..=U+1F129 for uppercase, U+249C..=U+24B5 for lowercase).
soft_hyphen_inject: Inject U+00AD SOFT HYPHEN between every pair of codepoints.
tag_char_encode: Encode every ASCII byte as its Plane 14 tag-character equivalent (U+E0000 + byte).
variation_selector_pad: Append a variation selector (U+FE0F by default) after every codepoint.
variation_selector_supplementary_pad: Pad every codepoint with a deterministic-but-different variation selector drawn from the supplementary range U+E0100..=U+E01EF.
word_joiner_wrap: Wrap each codepoint in U+2060 WORD JOINER.
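The sketches below illustrate three of the transformations listed above using only the standard library. They are not the module's code: the function names are hypothetical, the real signatures are not shown on this page, and the rule for choosing a supplementary selector (position modulo 240) is an assumption made only so that the output is deterministic.

```rust
/// Hypothetical sketch of circled-letter substitution; non-letters pass through.
fn circled_sketch(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            // U+24B6 is CIRCLED LATIN CAPITAL LETTER A.
            'A'..='Z' => char::from_u32(0x24B6 + (c as u32 - 'A' as u32)).unwrap_or(c),
            // U+24D0 is CIRCLED LATIN SMALL LETTER A.
            'a'..='z' => char::from_u32(0x24D0 + (c as u32 - 'a' as u32)).unwrap_or(c),
            _ => c,
        })
        .collect()
}

/// Insert U+00AD between adjacent code points, never before the first
/// or after the last.
fn soft_hyphen_between_sketch(input: &str) -> String {
    let mut out = String::with_capacity(input.len() * 3);
    for (i, c) in input.chars().enumerate() {
        if i > 0 {
            out.push('\u{00AD}');
        }
        out.push(c);
    }
    out
}

/// Follow each code point with a selector from U+E0100..=U+E01EF.
/// Assumption: the selector is picked by position modulo 240 (the range size).
fn supplementary_vs_pad_sketch(input: &str) -> String {
    let mut out = String::new();
    for (i, c) in input.chars().enumerate() {
        out.push(c);
        let selector = 0xE0100 + ((i as u32) % 240);
        if let Some(vs) = char::from_u32(selector) {
            out.push(vs);
        }
    }
    out
}
```

Because each function depends only on its input and the position of each code point, calling it twice on the same string yields the same bytes, and building the result as a `String` keeps the output valid UTF-8 by construction.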