# Yekdast (یکدست) for Rust

A fast, configurable, and modern Rust library for normalizing and preprocessing Persian (Farsi) text.

Yekdast is a powerful tool for cleaning up messy Persian text data, preparing it for subsequent steps like search, analysis, or display in your applications. The name "Yekdast" (یکدست) means "uniform" or "consistent" in Persian.


## Why Yekdast?

- 🚀 **Fast and Efficient:** Written in Rust for the best possible performance.
- 🔧 **Fully Configurable:** Provide your own dictionaries for handling compound words, normalizing slang, and defining custom replacement rules.
- 🧠 **Smart:** Intelligent Zero-Width Non-Joiner (ZWNJ) handling for prefixes, suffixes, and custom compound words.
- 🛡️ **Safe:** Automatically protects sensitive parts of your text, such as URLs, emails, code blocks, and HTML tags, from being normalized.
- ✨ **Comprehensive:** A complete suite of features including character, digit, punctuation, and whitespace normalization.

## Installation

To use Yekdast in your project, add the following to your `Cargo.toml`:

```toml
[dependencies]
yekdast = "0.1.0" # Please replace with the latest version
```

## Quick Start

Using the library with its default settings is straightforward.

```rust
use yekdast::{normalize_text, NormalizeOptions};

fn main() {
    let messy_text = "سلام, من يك برنامه نويس هستم و در كتاب خانه كار مي كنم.";

    // Use the default normalization options
    let options = NormalizeOptions::default();

    let clean_text = normalize_text(messy_text, &options);

    println!("Original: {}", messy_text);
    println!("Normalized: {}", clean_text);
    // Output: Normalized: سلام، من یک برنامه نویس هستم و در کتاب‌خانه کار می‌کنم.
}
```
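Comparing the input and output above: the Arabic «ي» and «ك» are unified to the Persian «ی» and «ک», the Latin comma becomes the Persian «،», and a ZWNJ joins «کتاب‌خانه» and «می‌کنم».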

## Advanced Usage & Options

The real power of Yekdast lies in its configurability. You can control every aspect of the normalization process.

```rust
use yekdast::{normalize_text, NormalizeOptions, DigitPolicy};
use std::collections::HashMap;

fn main() {
    let text = "من توی خونه شماره 123 کار میکنم و علاقه مند به برنامه نویسی هستم. میباشد.";

    // 1. Define a custom slang-to-formal dictionary
    let mut slang_map = HashMap::new();
    slang_map.insert("توی".to_string(), "در".to_string());
    slang_map.insert("خونه".to_string(), "خانه".to_string());

    // 2. Define a list of compound words for ZWNJ insertion
    let zwnj_words = vec![
        "علاقه مند".to_string(),
        "کار میکنم".to_string(),
    ];

    // 3. Define custom, high-priority replacement rules
    let custom_rules = vec![
        ("میباشد.".to_string(), "است.".to_string()),
    ];

    // 4. Construct the final options
    let options = NormalizeOptions {
        digits: DigitPolicy::Fa, // Convert all digits to Persian
        slang_map,
        zwnj_compound_words: zwnj_words,
        custom_rules,
        ..Default::default()
    };

    let clean_text = normalize_text(text, &options);

    println!("{}", clean_text);
    // Output: من در خانه شماره ۱۲۳ کار‌می‌کنم و علاقه‌مند به برنامه نویسی هستم. است.
}
```
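If you only need one of these behaviors, you can keep every default and override a single field. Here is a minimal sketch reusing only `DigitPolicy::Fa` from the example above; other policy variants are not shown in this README, so treat anything beyond that as an assumption:

```rust
use yekdast::{normalize_text, NormalizeOptions, DigitPolicy};

fn main() {
    // Keep all defaults, but convert Latin (and Arabic) digits to Persian digits.
    let options = NormalizeOptions {
        digits: DigitPolicy::Fa,
        ..Default::default()
    };

    println!("{}", normalize_text("اتاق شماره 42", &options));
    // Expected output (assuming the defaults leave the rest of this string unchanged): اتاق شماره ۴۲
}
```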

## Feature Checklist

- Unify Arabic characters (ي, ك) to Persian (ی, ک)
- Normalize and convert digits (Persian, Arabic, Latin)
- Normalize punctuation (comma, semicolon, question mark)
- Smart ZWNJ insertion for prefixes (می/نمی), suffixes (ها/تر/ترین), and custom compound words
- Support for custom slang-to-formal dictionaries
- Support for user-defined custom replacement rules
- Squeeze and clean up extra whitespace
- Normalize newline characters
- Remove diacritics and Tatweel (kashida)
- Automatically protect URLs, emails, code, and HTML tags (see the sketch below)
- Support for Unicode Normalization Forms (NFC, NFKC)
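Since protection is automatic, no extra configuration should be required. The sketch below only reuses `normalize_text` and `NormalizeOptions::default()` from the examples above; the behavior described in the comments is an assumption based on the feature list, not verified output:

```rust
use yekdast::{normalize_text, NormalizeOptions};

fn main() {
    // Persian prose written with Arabic «ي»/«ك», followed by a URL that should be protected.
    let text = "اين سايت را ببينيد: https://example.com/كتاب";

    let clean = normalize_text(text, &NormalizeOptions::default());

    // Assumed result: the prose is unified to Persian «ی»/«ک»,
    // while the URL keeps its original characters untouched.
    println!("{}", clean);
}
```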

## Contributing

Contributions are welcome! Please feel free to open an issue to report a bug or suggest a feature, or submit a pull request.


## License

This project is licensed under the MIT License.