Module unicode

Unicode and HTML entity encoding strategies.

Constants§

ZERO_WIDTH_DEFAULTS: Recommended cycle of invisible characters for zero-width injection. [U+200B ZWSP, U+200C ZWNJ, U+200D ZWJ, U+FEFF BOM, U+034F CGJ].
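A minimal sketch of how such a cycle could be consumed, assuming the constant is an array of char and that one invisible character is placed in each gap between adjacent characters. The names ZERO_WIDTH_CYCLE and zero_width_sketch are hypothetical stand-ins, not this module's API; the real zero_width_inject may choose gaps differently (for example, only between letters).

```rust
/// Hypothetical stand-in for the documented cycle; the real constant's type may differ.
const ZERO_WIDTH_CYCLE: [char; 5] = [
    '\u{200B}', // ZERO WIDTH SPACE
    '\u{200C}', // ZERO WIDTH NON-JOINER
    '\u{200D}', // ZERO WIDTH JOINER
    '\u{FEFF}', // ZERO WIDTH NO-BREAK SPACE (BOM)
    '\u{034F}', // COMBINING GRAPHEME JOINER
];

/// Insert one invisible character into every gap between adjacent chars,
/// walking the cycle in order and wrapping around at the end.
/// Assumption: every gap is filled, not just gaps between letters.
fn zero_width_sketch(payload: &str) -> String {
    let mut out = String::with_capacity(payload.len() * 4);
    let mut chars = payload.chars().peekable();
    let mut gap = 0usize;
    while let Some(c) = chars.next() {
        out.push(c);
        // Only emit a filler when another character follows.
        if chars.peek().is_some() {
            out.push(ZERO_WIDTH_CYCLE[gap % ZERO_WIDTH_CYCLE.len()]);
            gap += 1;
        }
    }
    out
}
```

Under these assumptions, zero_width_sketch("abc") yields a, U+200B, b, U+200C, c: the visible text is unchanged while the underlying character sequence no longer matches a literal "abc".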

Functions§

bidi_inject: Bidi override wrapper — wraps reversed_keyword between U+202E (RIGHT-TO-LEFT OVERRIDE) and U+202C (POP DIRECTIONAL FORMATTING).
combining_mark_inject: Inject a combining diacritical mark after each letter of payload.
fullwidth_encode: Fullwidth Unicode encoding — replaces ASCII with fullwidth equivalents.
homoglyph_encode: Homoglyph substitution — replaces select ASCII characters with visually identical Unicode characters from other scripts.
html_entity_decimal_encode: HTML decimal entity encoding — each character becomes &#DD;.
html_entity_encode: HTML entity encoding — each character becomes &#xXX;.
html_entity_variants: HTML entity encoding with per-character variant rotation.
html_entity_zero_pad: HTML entity encoding with zero-padded numeric references. Every character becomes either &#x{:0>width$X}; (hex form) or &#{:0>width$}; (decimal form), with leading zeros padding the number out to a width of pad characters.
iis_unicode_encode: IIS/ASP percent Unicode encoding — each character becomes %uXXXX.
json_key_unicode_escape: AWS WAF JSON-pointer escape — encode every char of key as \uXXXX so the WAF’s JSON-pointer rule (e.g. /id literal-match) misses, while the backend JSON parser decodes the escape and routes the value to the original field.
json_string_encode: JSON string-content escape — produces the escaped INTERIOR of a JSON string literal (no surrounding "..." quotes).
json_unicode_alnum: Partial JSON Unicode escape — encodes ASCII alphanumeric chars as \uXXXX while leaving structural punctuation (quotes, operators, whitespace) bare.
json_unicode_full: Full JSON \uXXXX escape — escapes EVERY character of the input (including punctuation, whitespace, and control chars). Stronger than json_unicode_alnum which only touches alnum chars. Use when the WAF tokenises on punctuation boundaries that json_unicode_alnum leaves intact, OR when the WAF rule is a regex over the raw bytes of the keyword + adjacent punctuation.
json_unicode_mixed_case: Mixed-case JSON \uXXXX escape. Alternates \u and \U plus upper/lowercase hex digits. Some WAF regexes are case-sensitive against \u[0-9A-F]{4}. RFC 8259 allows hex digits in either case but only the lowercase \u prefix, and strict parsers reject \U, so the \U form only helps when the backend decoder is lenient. Pick the form the backend tolerates and the WAF’s regex misses.
letterlike_encode: Letterlike-symbols + circled-Latin selective substitution. Replaces individual ASCII letters in the payload with codepoints from U+2100-214F and U+24B6-24E9 that NFKC-normalize back to the original ASCII letter. Unlike the math_*_encode functions, which substitute every letter from a single block, this picks the most visually distinct codepoint per letter to maximise WAF-rule mismatch while keeping the encoded string visibly identifiable.
math_bold_encode: Mathematical Alphanumeric Symbols encoding — replaces ASCII letters and digits with their Math-Bold counterparts in the Unicode U+1D400 block.
math_double_struck_encode: Mathematical Double-Struck (blackboard bold) alphabet — uppercase U+1D538, lowercase U+1D552. Holes at C/H/N/P/Q/R/Z filled from the letterlike-symbols block.
math_fraktur_encode: Mathematical Fraktur (blackletter) alphabet — uppercase U+1D504, lowercase U+1D51E. Fraktur has holes at C/H/I/R/Z which are filled by U+212D ℭ, U+210C ℌ, U+2111 ℑ, U+211C ℜ, U+2128 ℨ.
math_italic_encode: Mathematical Italic alphabet — same NFKC trick as math_bold_encode but in a different Unicode block (U+1D434 uppercase, U+1D44E lowercase). WAFs that have added detection for the bold range (U+1D400-) do not always cover italic.
math_script_encode: Mathematical Script alphabet (uppercase U+1D49C, lowercase U+1D4B6). Script has eleven holes (U+1D49D B, U+1D4A0 E, U+1D4A1 F, U+1D4A3 H, U+1D4A4 I, U+1D4A7 L, U+1D4A8 M, U+1D4AD R, U+1D4BA e, U+1D4BC g, U+1D4C4 o), each filled from the letterlike-symbols block (U+212C SCRIPT CAPITAL B, U+2130 SCRIPT CAPITAL E, etc.) so the encoded string stays NFKC-equivalent to ASCII.
overlong_utf8_path: Overlong UTF-8 encoding of . and / for path traversal.
pg_chr_decompose: Postgres / Oracle CHR()-function decomposition — CHR(N) || CHR(N) || ... per char of every single-quoted string literal.
script_homoglyph_encode: Cross-script Cyrillic / Greek letter substitution.
sharp_s_encode: Sharp-s (ß U+00DF) substitution for s/S.
sql_adjacent_string_concat: SQL adjacent-string-literal concatenation — every 'string' literal of length ≥ 2 is rewritten as a sequence of single-character adjacent literals: 'admin' → 'a' 'd' 'm' 'i' 'n'.
sql_char_decompose: SQL CHAR()-function decomposition — converts every single-quoted string literal in the payload to a CHAR(N1,N2,...) function call with one codepoint per argument.
sql_concat_split: SQL string-literal CONCAT splitter — converts every single-quoted string in the payload to a CONCAT('a','b',...) expression with one char per argument.
turkish_i_encode: Turkish dotless-i substitution: replace i/I with U+0131/U+0130.
unicode_encode: Unicode encoding — each character becomes \uXXXX.
zero_width_inject: Inject zero-width / format characters between letters of payload.
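
Several of the functions above are per-character numeric rewrites. The following is a rough sketch of that shared shape, not this module's implementation: every function name ending in _sketch is hypothetical, and the choices of uppercase hex for HTML references, lowercase hex for JSON escapes, and UTF-16 code units for \uXXXX are assumptions.

```rust
/// Hex numeric character reference per char, e.g. '<' becomes "&#x3C;".
/// Assumption: uppercase hex digits, no padding.
fn html_hex_sketch(input: &str) -> String {
    input.chars().map(|c| format!("&#x{:X};", c as u32)).collect()
}

/// Decimal numeric character reference per char, e.g. '<' becomes "&#60;".
fn html_dec_sketch(input: &str) -> String {
    input.chars().map(|c| format!("&#{};", c as u32)).collect()
}

/// Zero-padded hex reference using the same format spec quoted in the
/// html_entity_zero_pad summary; `width` is the total digit count.
fn html_hex_padded_sketch(input: &str, width: usize) -> String {
    input
        .chars()
        .map(|c| format!("&#x{:0>width$X};", c as u32, width = width))
        .collect()
}

/// \uXXXX escape of every UTF-16 code unit, so characters outside the
/// Basic Multilingual Plane come out as a surrogate pair.
/// Assumption: lowercase hex digits.
fn json_escape_sketch(input: &str) -> String {
    input
        .encode_utf16()
        .map(|unit| format!("\\u{:04x}", unit))
        .collect()
}

/// Fullwidth mapping: printable ASCII U+0021..=U+007E shifts by 0xFEE0
/// into U+FF01..=U+FF5E; everything else passes through unchanged.
fn fullwidth_sketch(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            '!'..='~' => char::from_u32(c as u32 + 0xFEE0).unwrap_or(c),
            _ => c,
        })
        .collect()
}
```

For instance, json_escape_sketch("id") gives \u0069\u0064, and html_hex_padded_sketch("A", 6) gives &#x000041;. Each output decodes back to the original text under the corresponding decoder (HTML parser, JSON parser, NFKC normalization), which is the property the summaries above rely on.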
