Skip to main content

Module unicode_normalizer

Module unicode_normalizer 

Source
Expand description

Unicode normalization and language-agnostic tokenization utilities.

Provides:

  • Script: Unicode script detection for individual characters.
  • UnicodeNormalizer: NFC/NFD normalization, accent stripping, case folding.
  • Language-agnostic tokenization that handles CJK character segmentation.

Structs§

UnicodeNormalizer
Unicode-aware text normalizer.
UnicodeNormalizerConfig
Configuration for UnicodeNormalizer.

Enums§

NormForm
Unicode normalization form.
Script
Unicode script classification for a single character.

Functions§

detect_script
Detect the Script for a single Unicode character.