§Yekdast: A Configurable Persian Text Normalizer
yekdast is a comprehensive Rust library for normalizing and standardizing Persian (Farsi) text.
It transforms inconsistent text into a clean, uniform format, making it ideal for preprocessing
text for analysis, display, or storage. It includes advanced features like smart ZWNJ insertion,
slang normalization, and user-defined custom replacement rules.
§Features
- Unification of Arabic characters (e.g., ي, ك) to their Persian counterparts (ی, ک).
- Digit normalization (convert between Persian, Arabic, and Latin numerals).
- Punctuation normalization (e.g., converting , to ،).
- Smart Zero-Width Non-Joiner (ZWNJ) insertion for prefixes, suffixes, and user-defined compound words.
- User-configurable dictionaries for slang-to-formal conversion.
- User-configurable custom replacement rules.
- Whitespace cleanup, including squeezing multiple spaces and trimming.
- Automatic protection for URLs, email addresses, and code blocks.
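To make the first few features concrete, here is a minimal standalone sketch of character unification, digit normalization to Persian numerals, and whitespace squeezing. It does not call yekdast and is not the crate's implementation; `unify_char` and `simple_normalize` are hypothetical names chosen for illustration, and only the two letter mappings named in the list above are covered.

```rust
/// Hypothetical helper: maps Arabic letter variants to their Persian
/// counterparts and Latin / Arabic-Indic digits to Persian digits.
fn unify_char(c: char) -> char {
    match c {
        '\u{064A}' => '\u{06CC}', // ARABIC LETTER YEH (ي) -> FARSI YEH (ی)
        '\u{0643}' => '\u{06A9}', // ARABIC LETTER KAF (ك) -> KEHEH (ک)
        // Latin digits 0-9 -> Persian digits U+06F0..=U+06F9
        '0'..='9' => char::from_u32(0x06F0 + (c as u32 - '0' as u32)).unwrap_or(c),
        // Arabic-Indic digits U+0660..=U+0669 -> Persian digits
        '\u{0660}'..='\u{0669}' => char::from_u32(0x06F0 + (c as u32 - 0x0660)).unwrap_or(c),
        _ => c,
    }
}

/// Hypothetical helper: unifies characters, then trims the text and
/// squeezes every run of whitespace down to a single space.
fn simple_normalize(input: &str) -> String {
    let unified: String = input.chars().map(unify_char).collect();
    // split_whitespace skips leading/trailing whitespace and collapses runs.
    unified.split_whitespace().collect::<Vec<_>>().join(" ")
}
```

Under these assumptions, `simple_normalize("  كتاب   123 ")` returns `کتاب ۱۲۳`. A real normalizer would also need to skip protected spans such as URLs and email addresses, which this sketch ignores.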
Structs§
- NormalizeOptions - A comprehensive set of options to control the text normalization process.
Enums§
- DigitPolicy - Defines the policy for handling digits during normalization.
- PunctPolicy - Defines the policy for handling punctuation marks.
- UnicodeForm - Defines the Unicode normalization form to be applied.
- ZwnjPolicy - Defines the policy for applying the Zero-Width Non-Joiner (ZWNJ / nim-fāseleh).
Functions§
- normalize_text - Normalizes a given string based on the provided options.
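
This page does not show the signature of normalize_text or the fields of NormalizeOptions, so the sketch below does not call the crate. It only illustrates the general idea behind prefix-based ZWNJ insertion in plain Rust: when a verbal prefix such as می or نمی stands alone as a token, the space after it is replaced with U+200C. The function name `attach_prefixes` and the two-entry prefix list are assumptions made for this example, not part of yekdast.

```rust
/// Zero-Width Non-Joiner (nim-fāseleh).
const ZWNJ: char = '\u{200C}';

/// Hypothetical helper: glues a standalone "می" or "نمی" prefix to the
/// following word with a ZWNJ instead of a space.
fn attach_prefixes(input: &str) -> String {
    // "می" and "نمی", written as escapes to avoid bidi rendering issues.
    let prefixes = ["\u{0645}\u{06CC}", "\u{0646}\u{0645}\u{06CC}"];
    let mut out = String::new();
    let mut tokens = input.split(' ').peekable();
    while let Some(tok) = tokens.next() {
        out.push_str(tok);
        match tokens.peek() {
            // A prefix followed by a non-empty token: join with ZWNJ.
            Some(next) if prefixes.contains(&tok) && !next.is_empty() => out.push(ZWNJ),
            // Any other token boundary keeps its space.
            Some(_) => out.push(' '),
            // Last token: nothing to append.
            None => {}
        }
    }
    out
}
```

With this sketch, the input `می روم` becomes `می‌روم` (the two parts joined by U+200C), while text without a standalone prefix passes through unchanged. Suffix handling and user-defined compound words, which the feature list mentions, would need additional rules.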