§Yekdast: A Configurable Persian Text Normalizer
yekdast is a comprehensive Rust library for normalizing and standardizing Persian (Farsi) text.
It transforms inconsistent text into a clean, uniform format, making it ideal for preprocessing
text for analysis, display, or storage. It includes advanced features like smart ZWNJ insertion,
slang normalization, and user-defined custom replacement rules.
§Features
- Unification of Arabic characters (e.g., ي, ك) to their Persian counterparts (ی, ک).
- Digit normalization (convert between Persian, Arabic, and Latin numerals).
- Punctuation normalization (e.g., converting the Latin comma , to the Persian comma ،).
- Smart Zero-Width Non-Joiner (ZWNJ) insertion for prefixes, suffixes, and user-defined compound words.
- User-configurable dictionaries for slang-to-formal conversion.
- User-configurable custom replacement rules.
- Whitespace cleanup, including squeezing multiple spaces and trimming.
- Automatic protection for URLs, email addresses, and code blocks.
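To make the character, digit, and whitespace features above concrete, here is a minimal standalone sketch of the kind of mapping they describe. It does not use the yekdast API, and it is not the crate's implementation: `unify_char` and `sketch_normalize` are hypothetical names introduced only for illustration, and converting every digit to Persian is just one of the possible digit policies.

```rust
/// Hypothetical helper (not part of yekdast): maps one character to its
/// Persian counterpart, leaving everything else untouched.
fn unify_char(c: char) -> char {
    match c {
        // Arabic Yeh (U+064A) -> Farsi Yeh (U+06CC)
        '\u{064A}' => '\u{06CC}',
        // Arabic Kaf (U+0643) -> Keheh (U+06A9)
        '\u{0643}' => '\u{06A9}',
        // Latin digits 0-9 -> Persian digits (U+06F0..U+06F9)
        '0'..='9' => char::from_u32(0x06F0 + (c as u32 - '0' as u32)).unwrap_or(c),
        // Arabic-Indic digits (U+0660..U+0669) -> Persian digits
        '\u{0660}'..='\u{0669}' => char::from_u32(0x06F0 + (c as u32 - 0x0660)).unwrap_or(c),
        _ => c,
    }
}

/// Hypothetical helper (not part of yekdast): unifies characters, then
/// squeezes runs of whitespace into a single space and trims both ends.
fn sketch_normalize(input: &str) -> String {
    let mapped: String = input.chars().map(unify_char).collect();
    // split_whitespace drops leading/trailing whitespace and collapses runs.
    // ZWNJ (U+200C) is not whitespace, so existing ZWNJs are preserved.
    mapped.split_whitespace().collect::<Vec<_>>().join(" ")
}
```

Calling `sketch_normalize` on a string containing Arabic Kaf, extra spaces, and Latin digits yields the same word with Keheh, single spaces, and Persian digits. Note that this sketch also collapses newlines and offers no protection for URLs or code blocks, which the crate lists as a separate feature.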
Structs§
- NormalizeOptions 
- A comprehensive set of options to control the text normalization process.
Enums§
- DigitPolicy 
- Defines the policy for handling digits during normalization.
- PunctPolicy 
- Defines the policy for handling punctuation marks.
- UnicodeForm 
- Defines the Unicode normalization form to be applied.
- ZwnjPolicy 
- Defines the policy for applying the Zero-Width Non-Joiner (ZWNJ / nim-fāseleh).
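As an illustration of what ZWNJ insertion for prefixes means in practice, the sketch below joins the verbal prefixes می and نمی to the following word with a ZWNJ (U+200C) instead of a space. This is a simplified assumption about the behavior, not the crate's algorithm and not tied to any `ZwnjPolicy` variant: `join_prefixes` and its prefix list are hypothetical, and a real implementation would need more linguistic care (for example, می can also be a standalone noun).

```rust
/// Zero-Width Non-Joiner.
const ZWNJ: char = '\u{200C}';

/// Hypothetical helper (not part of yekdast): when a space-separated token
/// is exactly one of the listed prefixes and is followed by another
/// non-empty token, the separating space is replaced with a ZWNJ.
fn join_prefixes(input: &str) -> String {
    // The prefixes "می" and "نمی", written as escapes to avoid
    // bidirectional-text confusion in source code.
    const PREFIXES: [&str; 2] = ["\u{0645}\u{06CC}", "\u{0646}\u{0645}\u{06CC}"];

    let mut out = String::with_capacity(input.len());
    let mut tokens = input.split(' ').peekable();
    while let Some(tok) = tokens.next() {
        out.push_str(tok);
        if let Some(next) = tokens.peek() {
            if PREFIXES.contains(&tok) && !next.is_empty() {
                out.push(ZWNJ); // attach the prefix to the next word
            } else {
                out.push(' '); // keep the original separator
            }
        }
    }
    out
}
```

With this sketch, the two-token input "می روم" becomes a single visual word whose parts are separated by a ZWNJ, while text without a listed prefix passes through unchanged.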
Functions§
- normalize_text 
- Normalizes a given string based on the provided options.