Skip to main content

Module normalise

Module normalise 

Source
Expand description

Unicode normalisation layer for security analysis.

This module provides text normalisation as a preprocessing step before all security analysis. It applies a multi-stage pipeline to defeat Unicode-based evasion techniques:

  1. NFKC normalisation — compatibility decomposition + canonical composition
  2. Diacritics stripping — removes combining marks to defeat accent evasion (IS-031)
  3. Invisible character stripping — removes zero-width, tag, and control characters (IS-022)
  4. Homoglyph mapping — maps Cyrillic, Greek, upside-down, and Braille characters to ASCII equivalents (IS-021, IS-015)
  5. Emoji stripping — removes emoji to defeat emoji-smuggling attacks (IS-020)

§Why?

Attackers can bypass regex-based detection by using visually identical but distinct Unicode code points — for example, Cyrillic а (U+0430) instead of Latin a (U+0061), embedding zero-width characters inside keywords, using upside-down letters, encoding text in Braille, adding diacritics, or interspersing emoji characters. Normalising text before analysis neutralises these evasion techniques.

Functions§

normalise_text
Normalise text for security analysis.
strip_diacritics
Strip diacritics (combining marks) from text.
strip_emoji
Strip emoji characters from text.