Unicode normalisation layer for security analysis.
This module provides text normalisation as a preprocessing step before all security analysis. It applies a multi-stage pipeline to defeat Unicode-based evasion techniques:
- NFKC normalisation — compatibility decomposition + canonical composition
- Diacritics stripping — removes combining marks to defeat accent evasion (IS-031)
- Invisible character stripping — removes zero-width, tag, and control characters (IS-022)
- Homoglyph mapping — maps Cyrillic, Greek, upside-down, and Braille characters to ASCII equivalents (IS-021, IS-015)
- Emoji stripping — removes emoji to defeat emoji-smuggling attacks (IS-020)
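The two stages that need no Unicode data tables can be sketched with the standard library alone. This is an illustration, not the module's actual implementation: `is_invisible`, `map_homoglyph`, and `sketch_normalise` are hypothetical names, and the character sets are small samples rather than the full tables a real pipeline would use. NFKC normalisation, diacritics stripping, and emoji stripping are left out because they depend on Unicode decomposition and property data that `std` does not provide.

```rust
/// Hypothetical helper: true for a small sample of invisible code points.
fn is_invisible(c: char) -> bool {
    matches!(
        c,
        '\u{200B}'..='\u{200D}'         // zero-width space, non-joiner, joiner
            | '\u{2060}'                // word joiner
            | '\u{FEFF}'                // zero-width no-break space (BOM)
            | '\u{E0000}'..='\u{E007F}' // Unicode tag characters
    ) || (c.is_control() && !c.is_whitespace()) // keep '\n', '\t', etc.
}

/// Hypothetical helper: maps a few Cyrillic and Greek look-alikes to ASCII.
/// Any character not in this sample table is returned unchanged.
fn map_homoglyph(c: char) -> char {
    match c {
        '\u{0430}' => 'a', // Cyrillic а
        '\u{0435}' => 'e', // Cyrillic е
        '\u{043E}' => 'o', // Cyrillic о
        '\u{0440}' => 'p', // Cyrillic р
        '\u{0441}' => 'c', // Cyrillic с
        '\u{03BF}' => 'o', // Greek omicron
        other => other,
    }
}

/// Sketch of two pipeline stages: drop invisible characters, then map
/// homoglyphs to their ASCII equivalents.
fn sketch_normalise(input: &str) -> String {
    input
        .chars()
        .filter(|c| !is_invisible(*c))
        .map(map_homoglyph)
        .collect()
}
```

With these sample tables, `sketch_normalise("p\u{0430}ss\u{200B}w\u{043E}rd")` yields the plain ASCII string `password`, which an ordinary pattern can then match.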
Why?
Attackers can bypass regex-based detection by using visually identical but
distinct Unicode code points — for example, Cyrillic а (U+0430) instead
of Latin a (U+0061), embedding zero-width characters inside keywords,
using upside-down letters, encoding text in Braille, adding diacritics, or
interspersing emoji characters. Normalising text before analysis neutralises
these evasion techniques.
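To make the bypass concrete, here is a small sketch in which a plain substring match stands in for a regex-based rule. The detector `naive_detect`, the keyword `password`, and the helper `evasive_samples` are all hypothetical and chosen only for illustration.

```rust
/// Naive detector: a plain substring match standing in for a regex rule.
fn naive_detect(text: &str) -> bool {
    text.contains("password")
}

/// Two spellings that render like "password" but use different code points.
fn evasive_samples() -> Vec<String> {
    vec![
        // Cyrillic а (U+0430) in place of Latin a (U+0061)
        "p\u{0430}ssword".to_string(),
        // Zero-width space (U+200B) embedded inside the keyword
        "pass\u{200B}word".to_string(),
    ]
}
```

`naive_detect("password")` returns `true`, while it returns `false` for both strings from `evasive_samples()`, even though they look the same as the keyword on screen. The detector compares code points, not rendered glyphs, which is why the text has to be normalised first.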
Functions
- normalise_text - Normalise text for security analysis.
- strip_diacritics - Strip diacritics (combining marks) from text.
- strip_emoji - Strip emoji characters from text.