Function normalize 

pub fn normalize(input: &str) -> Normalized

Normalize untrusted text to defeat common obfuscation before pattern matching.

Pipeline:

  1. strip zero-width chars and C0/C1 control chars (keeping \t, \n, \r);
  2. fold homoglyphs using a curated confusables map;
  3. apply NFKC;
  4. lowercase;
  5. detect base64 runs and, when they decode to valid UTF-8, append the decoded text (offsets pointing back at the run start) so patterns can match smuggled content.
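
Step 5 can be illustrated with a minimal, std-only sketch. It is not the crate's implementation: the names `MIN_RUN`, `decode_base64`, and `decoded_runs` are hypothetical, the minimum run length of 16 is an assumption, and the sketch scans the raw input (base64 is case-sensitive, so it cannot run on lowercased text). It returns each run's start offset with the decoded text, which a caller would append with every byte attributed to that offset.

```rust
/// Assumed threshold: runs shorter than this are ignored (hypothetical value).
const MIN_RUN: usize = 16;

/// Decode standard-alphabet base64; trailing `=` padding is optional.
/// Returns `None` for malformed input.
fn decode_base64(run: &str) -> Option<Vec<u8>> {
    let data = run.trim_end_matches('=');
    if data.len() % 4 == 1 {
        return None; // a single leftover sextet cannot encode a byte
    }
    let mut out = Vec::with_capacity(data.len() * 3 / 4);
    let mut buf: u32 = 0;
    let mut bits: u32 = 0;
    for b in data.bytes() {
        let v = match b {
            b'A'..=b'Z' => b - b'A',
            b'a'..=b'z' => b - b'a' + 26,
            b'0'..=b'9' => b - b'0' + 52,
            b'+' => 62,
            b'/' => 63,
            _ => return None, // includes '=' in the middle of a run
        };
        buf = (buf << 6) | u32::from(v);
        bits += 6;
        if bits >= 8 {
            bits -= 8;
            out.push((buf >> bits) as u8);
            buf &= (1 << bits) - 1; // keep only the bits not yet emitted
        }
    }
    Some(out)
}

fn is_base64_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'+' || b == b'/' || b == b'='
}

/// Find base64 runs and return `(run_start_offset, decoded_text)` for each
/// run that decodes to valid UTF-8.
fn decoded_runs(input: &str) -> Vec<(usize, String)> {
    let bytes = input.as_bytes();
    let mut found = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        if !is_base64_byte(bytes[i]) {
            i += 1;
            continue;
        }
        let start = i;
        while i < bytes.len() && is_base64_byte(bytes[i]) {
            i += 1;
        }
        // Both ends fall on ASCII bytes, so this slice is on char boundaries.
        let run = &input[start..i];
        if run.len() < MIN_RUN {
            continue;
        }
        if let Some(text) = decode_base64(run).and_then(|raw| String::from_utf8(raw).ok()) {
            found.push((start, text));
        }
    }
    found
}
```

For the input `"run: aGVsbG8gd29ybGQh now"`, this sketch reports a single run at byte offset 5 that decodes to `hello world!`; the short words around it fall below the assumed length threshold.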

The returned Normalized’s offset map records, for each normalized byte, the offset of the original byte it came from (read it via Normalized::original_span). Steps 1–4 are computed char-by-char over the original input, so offsets stay accurate even through NFKC’s 1→N expansions; step 5 appends the decoded bytes, all attributed to the run’s start offset.
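
The char-by-char offset bookkeeping can be sketched as follows. This is an illustration under assumptions, not the crate's code: `strip_and_lowercase` and `is_stripped` are hypothetical names, the zero-width set is a small assumed sample, and only steps 1 and 4 are applied (homoglyph folding and NFKC are left out to keep the sketch std-only). Lowercasing stands in for NFKC as a transformation that can expand one char into several.

```rust
/// Hypothetical helper: strips zero-width and control chars, lowercases,
/// and returns the normalized text plus, for every normalized byte, the
/// byte offset of the original char that produced it.
fn strip_and_lowercase(input: &str) -> (String, Vec<usize>) {
    let mut out = String::with_capacity(input.len());
    let mut offsets = Vec::with_capacity(input.len());
    for (orig_offset, ch) in input.char_indices() {
        if is_stripped(ch) {
            continue;
        }
        // `to_lowercase` may yield several chars (1→N expansion); every
        // byte they produce is attributed to the same original offset.
        for lower in ch.to_lowercase() {
            let before = out.len();
            out.push(lower);
            offsets.extend(std::iter::repeat(orig_offset).take(out.len() - before));
        }
    }
    (out, offsets)
}

/// Assumed strip set: a few common zero-width chars plus all control chars
/// except tab, line feed, and carriage return.
fn is_stripped(ch: char) -> bool {
    let zero_width = matches!(
        ch,
        '\u{200B}' | '\u{200C}' | '\u{200D}' | '\u{2060}' | '\u{FEFF}'
    );
    let kept = matches!(ch, '\t' | '\n' | '\r');
    // `char::is_control` covers the C0 and C1 ranges (and DEL, U+007F).
    zero_width || (ch.is_control() && !kept)
}
```

Keeping one offset per output byte, rather than per char, means any byte range of a match in the normalized text can be mapped back without re-decoding UTF-8. For example, `'\u{130}'` (İ) lowercases to `i` followed by U+0307, three bytes in total, and all three entries in the map carry the offset of the original İ.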