Module unicode


Unicode and HTML entity encoding strategies.

Constants

ZERO_WIDTH_DEFAULTS
Recommended cycle of invisible characters for zero-width injection. [U+200B ZWSP, U+200C ZWNJ, U+200D ZWJ, U+FEFF BOM, U+034F CGJ].
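
As a rough illustration of how such a cycle might be consumed, the sketch below rotates through the five characters listed above while interleaving them into a string. The names `ZERO_WIDTH_CYCLE`, `interleave_cycle`, and `strip_cycle` are hypothetical and are not this module's API. The sketch also inserts between every pair of characters, whereas zero_width_inject is documented as working between letters.

```rust
/// Illustrative stand-in for ZERO_WIDTH_DEFAULTS; the crate's real type may differ.
const ZERO_WIDTH_CYCLE: [char; 5] = [
    '\u{200B}', // ZERO WIDTH SPACE
    '\u{200C}', // ZERO WIDTH NON-JOINER
    '\u{200D}', // ZERO WIDTH JOINER
    '\u{FEFF}', // ZERO WIDTH NO-BREAK SPACE (BOM)
    '\u{034F}', // COMBINING GRAPHEME JOINER
];

/// Hypothetical helper: insert the next character of the cycle between
/// each adjacent pair of characters in `payload`.
fn interleave_cycle(payload: &str) -> String {
    let mut out = String::new();
    let mut cycle = ZERO_WIDTH_CYCLE.iter().cycle();
    let mut chars = payload.chars().peekable();
    while let Some(c) = chars.next() {
        out.push(c);
        if chars.peek().is_some() {
            // `cycle()` over a non-empty array never returns None.
            out.push(*cycle.next().unwrap());
        }
    }
    out
}

/// Hypothetical inverse: drop every character that belongs to the cycle,
/// as a normalising filter on the receiving side might.
fn strip_cycle(text: &str) -> String {
    text.chars().filter(|c| !ZERO_WIDTH_CYCLE.contains(c)).collect()
}
```

Rotating through several code points, rather than repeating one, means a filter that strips only U+200B still leaves the other four in place.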

Functions

bidi_inject
Bidi override wrapper — wraps reversed_keyword between U+202E (RIGHT-TO-LEFT OVERRIDE) and U+202C (POP DIRECTIONAL FORMATTING).
combining_mark_inject
Inject a combining diacritical mark after each letter of payload.
fullwidth_encode
Fullwidth Unicode encoding — replaces ASCII with fullwidth equivalents.
homoglyph_encode
Homoglyph substitution — replaces select ASCII characters with visually identical Unicode characters from other scripts.
html_entity_decimal_encode
HTML decimal entity encoding — each character becomes &#DD;.
html_entity_encode
HTML entity encoding — each character becomes &#xXX;.
html_entity_variants
HTML entity encoding with per-character variant rotation.
html_entity_zero_pad
HTML entity encoding with zero-padded numeric reference — every character becomes either &#x{:0>width$X}; (hex form) or &#{:0>width$}; (decimal form). Leading zeros pad the number to pad characters.
iis_unicode_encode
IIS/ASP percent Unicode encoding — each character becomes %uXXXX.
json_key_unicode_escape
AWS WAF JSON-pointer escape — encode every char of key as \uXXXX so the WAF’s JSON-pointer rule (e.g. /id literal-match) misses, while the backend JSON parser decodes the escape and routes the value to the original field.
json_string_encode
JSON string-content escape — produces the escaped INTERIOR of a JSON string literal (no surrounding "..." quotes).
json_unicode_alnum
Partial JSON Unicode escape — encodes ASCII alphanumeric chars as \uXXXX while leaving structural punctuation (quotes, operators, whitespace) bare.
json_unicode_full
Full JSON \uXXXX escape: escapes EVERY character of the input (including punctuation, whitespace, and control chars). Stronger than json_unicode_alnum, which only touches alnum chars. Use when the WAF tokenises on punctuation boundaries that json_unicode_alnum leaves intact, or when the WAF rule is a regex over the raw bytes of the keyword plus adjacent punctuation.
json_unicode_mixed_case
Mixed-case JSON \uXXXX escape: alternates \u and \U plus upper/lowercase hex digits. Some WAF regexes are case-sensitive against \u[0-9A-F]{4}. RFC 8259 defines only the lowercase \u prefix (the hex digits themselves may be in either case), so the \U form is only decoded by lenient backend parsers. Pick the form the backend tolerates and the WAF’s regex misses.
letterlike_encode
Letterlike-symbols + circled-Latin selective substitution: replaces individual ASCII letters in the payload with codepoints from U+2100-214F and U+24B6-24E9 that NFKC-normalize back to the original ASCII letter. Unlike the math_*_encode functions, which substitute every letter from a single block, this picks the most visually distinct codepoint per letter to maximise WAF-rule mismatch while keeping the encoded string visibly identifiable.
math_bold_encode
Mathematical Alphanumeric Symbols encoding — replaces ASCII letters and digits with their Math-Bold counterparts in the Unicode U+1D400 block.
math_double_struck_encode
Mathematical Double-Struck (blackboard bold) alphabet — uppercase U+1D538, lowercase U+1D552. Holes at C/H/N/P/Q/R/Z filled from the letterlike-symbols block.
math_fraktur_encode
Mathematical Fraktur (blackletter) alphabet — uppercase U+1D504, lowercase U+1D51E. Fraktur has holes at C/H/I/R/Z which are filled by U+212D ℭ, U+210C ℌ, U+2111 ℑ, U+211C ℜ, U+2128 ℨ.
math_italic_encode
Mathematical Italic alphabet — same NFKC trick as math_bold_encode but in a different Unicode block (U+1D434 uppercase, U+1D44E lowercase). WAFs that have added detection for the bold range (U+1D400-) do not always cover italic.
math_script_encode
Mathematical Script alphabet: uppercase U+1D49C, lowercase U+1D4B6. Script has eleven holes (U+1D49D B, U+1D4A0 E, U+1D4A1 F, U+1D4A3 H, U+1D4A4 I, U+1D4A7 L, U+1D4A8 M, U+1D4AD R, U+1D4BA e, U+1D4BC g, U+1D4C4 o), each filled from the letterlike-symbols block (U+212C SCRIPT CAPITAL B, U+2130 SCRIPT CAPITAL E, etc.) so the encoded string stays NFKC-equivalent to ASCII.
overlong_utf8_path
Overlong UTF-8 encoding of . and / for path traversal.
pg_chr_decompose
Postgres / Oracle CHR()-function decomposition — CHR(N) || CHR(N) || ... per char of every single-quoted string literal.
script_homoglyph_encode
Cross-script Cyrillic / Greek letter substitution.
sharp_s_encode
Sharp-s (ß U+00DF) substitution for s/S.
sql_adjacent_string_concat
SQL adjacent-string-literal concatenation: every 'string' literal of length ≥ 2 is rewritten as a sequence of single-character adjacent literals: 'admin' → 'a' 'd' 'm' 'i' 'n'.
sql_char_decompose
SQL CHAR()-function decomposition — converts every single-quoted string literal in the payload to a CHAR(N1,N2,...) function call with one codepoint per argument.
sql_concat_split
SQL string-literal CONCAT splitter — converts every single-quoted string in the payload to a CONCAT('a','b',...) expression with one char per argument.
turkish_i_encode
Turkish dotless-i substitution: replace i/I with U+0131/U+0130.
unicode_encode
Unicode encoding — each character becomes \uXXXX.
zero_width_inject
Inject zero-width / format characters between letters of payload.
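
Several of the functions listed above (html_entity_encode, html_entity_decimal_encode, html_entity_zero_pad, unicode_encode) share one pattern: map each character to a numeric escape of its code point. The sketch below shows that pattern using only the standard library. The function names, signatures, uppercase hex digits, and the choice to emit surrogate pairs for characters outside the BMP are all assumptions made for illustration; they are not taken from this module's source.

```rust
/// Hex numeric character reference per character, e.g. 'a' -> "&#x61;".
/// (Hypothetical name; hex-digit case is an assumption.)
fn html_hex(input: &str) -> String {
    input.chars().map(|c| format!("&#x{:X};", c as u32)).collect()
}

/// Decimal numeric character reference per character, e.g. 'a' -> "&#97;".
fn html_decimal(input: &str) -> String {
    input.chars().map(|c| format!("&#{};", c as u32)).collect()
}

/// Hex reference left-padded with zeros to `width` digits,
/// e.g. ('a', 6) -> "&#x000061;".
fn html_hex_padded(input: &str, width: usize) -> String {
    input
        .chars()
        .map(|c| format!("&#x{:0>width$X};", c as u32, width = width))
        .collect()
}

/// \uXXXX escape per UTF-16 code unit, so characters above U+FFFF
/// come out as a surrogate pair (the form JSON string escapes use).
fn json_u_escape(input: &str) -> String {
    let mut out = String::new();
    let mut units = [0u16; 2];
    for c in input.chars() {
        for unit in c.encode_utf16(&mut units).iter() {
            out.push_str(&format!("\\u{:04X}", unit));
        }
    }
    out
}
```

Under these assumptions, `html_hex("<a")` yields `&#x3C;&#x61;` and `json_u_escape("a'")` yields `\u0061\u0027`. A decoder that resolves the escapes recovers the original text in each case.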