# Yekdast (یکدست) for Rust
[crates.io](https://crates.io/crates/yekdast) | [docs.rs](https://docs.rs/yekdast) | [License: MIT](https://opensource.org/licenses/MIT)

A fast, configurable, and modern Rust library for **normalizing and preprocessing Persian (Farsi) text**.
`Yekdast` cleans up messy Persian text and prepares it for downstream steps such as search, analysis, or display in your applications. The name "Yekdast" (یکدست) means "uniform" or "consistent" in Persian.

---
## Why Yekdast?
* **🚀 Fast and Efficient:** Written in Rust, so normalization runs as native code with no garbage-collection pauses.
* **🔧 Fully Configurable:** Provide your own dictionaries for handling compound words, normalizing slang, and defining custom replacement rules.
* **🧠 Smart:** Features intelligent Zero-Width Non-Joiner (ZWNJ) handling for prefixes, suffixes, and custom compound words.
* **🛡️ Safe:** Automatically protects sensitive parts of your text like URLs, emails, code blocks, and HTML tags from being normalized.
* **✨ Comprehensive:** A complete suite of features including character, digit, punctuation, and whitespace normalization.
---
## Installation
To use `Yekdast` in your project, add the following line to your `Cargo.toml` file:
```toml
[dependencies]
yekdast = "0.1.0" # Please replace with the latest version
```
---
## Quick Start
With the default settings, normalization is a single call to `normalize_text`.
```rust
use yekdast::{normalize_text, NormalizeOptions};

fn main() {
    let messy_text = "سلام, من يك برنامه نويس هستم و در كتاب خانه كار مي كنم.";

    // Use the default normalization options
    let options = NormalizeOptions::default();
    let clean_text = normalize_text(messy_text, &options);

    println!("Original: {}", messy_text);
    println!("Normalized: {}", clean_text);
    // Output: Normalized: سلام، من یک برنامه نویس هستم و در کتابخانه کار میکنم.
}
```
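Part of the difference between the input and output above is plain character substitution: Arabic `ي` and `ك` become Persian `ی` and `ک`, and the Latin comma becomes `،`. As a rough illustration of that idea only, here is a standalone sketch that uses nothing but the standard library. The function name `unify_chars` is hypothetical, and this is not a description of how Yekdast is implemented; a real normalizer would also need to leave commas inside numbers, URLs, and code untouched.

```rust
/// Hypothetical helper (not part of Yekdast's API): maps Arabic letter
/// forms and Latin punctuation to their Persian counterparts, one
/// character at a time.
fn unify_chars(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            '\u{064A}' => '\u{06CC}', // Arabic yeh (ي) -> Persian yeh (ی)
            '\u{0643}' => '\u{06A9}', // Arabic kaf (ك) -> Persian kaf (ک)
            ',' => '\u{060C}',        // Latin comma -> Persian comma (،)
            ';' => '\u{061B}',        // Latin semicolon -> Persian semicolon (؛)
            '?' => '\u{061F}',        // Latin question mark -> Persian question mark (؟)
            other => other,           // everything else passes through unchanged
        })
        .collect()
}

fn main() {
    let unified = unify_chars("يك كتاب, دو كتاب?");
    println!("{}", unified);
}
```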
---
## Advanced Usage & Options
`NormalizeOptions` controls how each normalization step behaves. The example below sets a digit policy, a slang dictionary, a compound-word list for ZWNJ insertion, and custom replacement rules.
```rust
use yekdast::{normalize_text, NormalizeOptions, DigitPolicy};
use std::collections::HashMap;

fn main() {
    let text = "من توی خونه شماره 123 کار میکنم و علاقه مند به برنامه نویسی هستم. میباشد.";

    // 1. Define a custom slang-to-formal dictionary
    let mut slang_map = HashMap::new();
    slang_map.insert("توی".to_string(), "در".to_string());
    slang_map.insert("خونه".to_string(), "خانه".to_string());

    // 2. Define a list of compound words for ZWNJ insertion
    let zwnj_words = vec![
        "علاقه مند".to_string(),
        "کار میکنم".to_string(),
    ];

    // 3. Define custom, high-priority replacement rules
    let custom_rules = vec![
        ("میباشد.".to_string(), "است.".to_string()),
    ];

    // 4. Construct the final options
    let options = NormalizeOptions {
        digits: DigitPolicy::Fa, // Convert all digits to Persian
        slang_map,
        zwnj_compound_words: zwnj_words,
        custom_rules,
        ..Default::default()
    };

    let clean_text = normalize_text(text, &options);
    println!("{}", clean_text);
    // Output: من در خانه شماره ۱۲۳ کارمیکنم و علاقهمند به برنامه نویسی هستم. است.
}
```
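To make two of these options more concrete, the sketch below shows in plain standard-library Rust what joining a compound word with a ZWNJ (U+200C) and converting digits to Persian can look like. The helper names `join_compounds` and `to_persian_digits` are hypothetical and are not part of Yekdast's API; the library's own logic is assumed to be more careful, for example about word boundaries.

```rust
/// Zero-Width Non-Joiner: keeps two word parts visually attached
/// without letting their letters connect.
const ZWNJ: char = '\u{200C}';

/// Hypothetical helper: for each listed compound, replaces the spaces
/// between its parts with a ZWNJ wherever the compound occurs.
fn join_compounds(input: &str, compounds: &[&str]) -> String {
    let mut out = input.to_string();
    for compound in compounds {
        let joined = compound.replace(' ', &ZWNJ.to_string());
        out = out.replace(*compound, &joined);
    }
    out
}

/// Hypothetical helper: converts Latin (0-9) and Arabic-Indic
/// (U+0660..U+0669) digits to Persian digits (U+06F0..U+06F9).
fn to_persian_digits(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            '0'..='9' => char::from_u32(0x06F0 + (c as u32 - '0' as u32)).unwrap_or(c),
            '\u{0660}'..='\u{0669}' => char::from_u32(0x06F0 + (c as u32 - 0x0660)).unwrap_or(c),
            _ => c,
        })
        .collect()
}

fn main() {
    let text = "علاقه مند به شماره 123";
    let joined = join_compounds(text, &["علاقه مند"]);
    println!("{}", to_persian_digits(&joined));
}
```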
## Feature Checklist
- [x] Unify Arabic characters (ي, ك) to Persian (ی, ک)
- [x] Normalize and convert digits (Persian, Arabic, Latin)
- [x] Normalize punctuation (comma, semicolon, question mark)
- [x] Smart ZWNJ insertion for prefixes (`می/نمی`), suffixes (`ها/تر/ترین`), and custom compound words
- [x] Support for custom slang-to-formal dictionaries
- [x] Support for user-defined custom replacement rules
- [x] Squeeze and clean up extra whitespace
- [x] Normalize newline characters
- [x] Remove diacritics and Tatweel (kashida)
- [x] Automatically protect URLs, emails, code, and HTML tags
- [x] Support for Unicode Normalization Forms (NFC, NFKC)
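
The protection item in the list above can be pictured as "normalize everything except spans that must stay verbatim". The following is a deliberately simplified, standalone sketch of that idea at the token level. The names `is_protected` and `normalize_unprotected` are hypothetical, the URL and email checks are crude assumptions, and code blocks and HTML tags are not handled here at all; Yekdast's own protection logic is not shown in this README and may work quite differently.

```rust
/// Hypothetical check: treats a token as protected if it looks like
/// a URL or an email address.
fn is_protected(token: &str) -> bool {
    token.starts_with("http://") || token.starts_with("https://") || token.contains('@')
}

/// Applies `normalize` to every whitespace-separated token except the
/// protected ones. Note that this sketch also collapses runs of
/// whitespace into single spaces.
fn normalize_unprotected<F>(input: &str, normalize: F) -> String
where
    F: Fn(&str) -> String,
{
    input
        .split_whitespace()
        .map(|token| {
            if is_protected(token) {
                token.to_string()
            } else {
                normalize(token)
            }
        })
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    // Stand-in normalizer: replaces Latin commas with Persian commas.
    let out = normalize_unprotected("سلام, https://example.com/a,b", |t: &str| {
        t.replace(',', "\u{060C}")
    });
    // The comma after the greeting changes; the comma inside the URL does not.
    println!("{}", out);
}
```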
---
## Contributing
Contributions are welcome! Please feel free to open an issue to report a bug or suggest a feature, or submit a pull request.

---
## License
This project is licensed under the MIT License.