Crate yekdast


§Yekdast: A Configurable Persian Text Normalizer

yekdast is a Rust library for normalizing and standardizing Persian (Farsi) text. It transforms inconsistently written text into a clean, uniform format, which makes it suitable for preprocessing text before analysis, display, or storage. Beyond basic character cleanup, it supports smart Zero-Width Non-Joiner (ZWNJ) insertion, slang normalization, and user-defined custom replacement rules.

§Features

  • Unification of Arabic characters (e.g., ي, ك) to their Persian counterparts (ی, ک).
  • Digit normalization (conversion between Persian, Arabic, and Latin numerals).
  • Punctuation normalization (e.g., converting the Latin comma , to the Persian comma ،).
  • Smart Zero-Width Non-Joiner (ZWNJ) insertion for prefixes, suffixes, and user-defined compound words.
  • User-configurable dictionaries for slang-to-formal conversion.
  • User-configurable custom replacement rules.
  • Whitespace cleanup, including squeezing multiple spaces and trimming.
  • Automatic protection for URLs, email addresses, and code blocks.
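
To make a few of the features above concrete, here is a minimal standalone sketch that uses only the Rust standard library. It does not call yekdast and is not the crate's implementation. The function names (`unify_arabic_chars`, `digits_to_persian`, `squeeze_whitespace`, `join_verbal_prefix`) are hypothetical and exist only for illustration. The ZWNJ step handles just the verbal prefixes می and نمی, and it assumes whitespace has already been squeezed.

```rust
/// Hypothetical helper, not part of yekdast: maps Arabic Yeh and Kaf
/// to their Persian counterparts.
fn unify_arabic_chars(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            '\u{064A}' => '\u{06CC}', // Arabic Yeh (ي) -> Persian Yeh (ی)
            '\u{0643}' => '\u{06A9}', // Arabic Kaf (ك) -> Persian Keheh (ک)
            other => other,
        })
        .collect()
}

/// Hypothetical helper: converts Latin (0-9) and Arabic-Indic
/// (U+0660..U+0669) digits to Persian digits (U+06F0..U+06F9).
fn digits_to_persian(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            '0'..='9' => char::from_u32(0x06F0 + (c as u32 - '0' as u32)).unwrap_or(c),
            '\u{0660}'..='\u{0669}' => char::from_u32(0x06F0 + (c as u32 - 0x0660)).unwrap_or(c),
            other => other,
        })
        .collect()
}

/// Hypothetical helper: trims the text and collapses every run of
/// whitespace into a single space. ZWNJ (U+200C) is not whitespace,
/// so existing joiners are left untouched.
fn squeeze_whitespace(input: &str) -> String {
    input.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Hypothetical helper: replaces the space after the verbal prefixes
/// "mi" and "nemi" with a ZWNJ. A real normalizer would cover many
/// more prefixes and suffixes.
fn join_verbal_prefix(input: &str) -> String {
    const ZWNJ: char = '\u{200C}';
    const MI: &str = "\u{0645}\u{06CC}"; // می
    const NEMI: &str = "\u{0646}\u{0645}\u{06CC}"; // نمی

    let mut out = String::with_capacity(input.len());
    let mut words = input.split(' ').peekable();
    while let Some(word) = words.next() {
        out.push_str(word);
        if let Some(next) = words.peek() {
            // Join only when a non-empty word follows the prefix.
            let is_prefix = (word == MI || word == NEMI) && !next.is_empty();
            out.push(if is_prefix { ZWNJ } else { ' ' });
        }
    }
    out
}
```

Each step is a pure `&str -> String` function, so the steps can be chained in any order. A configurable normalizer would presumably enable or disable such steps according to its options, but how yekdast does this is not shown in this overview.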

Structs§

NormalizeOptions
A comprehensive set of options to control the text normalization process.

Enums§

DigitPolicy
Defines the policy for handling digits during normalization.
PunctPolicy
Defines the policy for handling punctuation marks.
UnicodeForm
Defines the Unicode normalization form to be applied.
ZwnjPolicy
Defines the policy for applying the Zero-Width Non-Joiner (ZWNJ / nim-fāseleh).

Functions§

normalize_text
Normalizes a given string based on the provided options.