# Yekdast (یکدست) for Rust
A fast, configurable, and modern Rust library for normalizing and preprocessing Persian (Farsi) text.
Yekdast is a powerful tool for cleaning up messy Persian text data, preparing it for subsequent steps like search, analysis, or display in your applications. The name "Yekdast" (یکدست) means "uniform" or "consistent" in Persian.
## Why Yekdast?
- 🚀 Fast and Efficient: Written in Rust with performance in mind.
- 🔧 Fully Configurable: Provide your own dictionaries for handling compound words, normalizing slang, and defining custom replacement rules.
- 🧠 Smart: Features intelligent Zero-Width Non-Joiner (ZWNJ) handling for prefixes, suffixes, and custom compound words.
- 🛡️ Safe: Automatically protects sensitive parts of your text like URLs, emails, code blocks, and HTML tags from being normalized.
- ✨ Comprehensive: A complete suite of features including character, digit, punctuation, and whitespace normalization.
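To make the ZWNJ point concrete, here is a minimal, std-only sketch of one rule: joining the standalone verbal prefixes می and نمی to the following word with a Zero-Width Non-Joiner. The function name `attach_verb_prefixes` is hypothetical and not part of Yekdast's API, and the rule is deliberately naive (it ignores cases where می is a separate word and collapses all whitespace to single spaces).

```rust
/// Zero-Width Non-Joiner (U+200C).
const ZWNJ: char = '\u{200C}';

/// The verbal prefixes "می" and "نمی", written with escapes so the
/// exact code points are unambiguous.
const MI: &str = "\u{0645}\u{06CC}"; // می
const NEMI: &str = "\u{0646}\u{0645}\u{06CC}"; // نمی

/// Hypothetical helper (an assumption, not Yekdast's API): joins a
/// standalone prefix to the next word with a ZWNJ instead of a space.
/// Side effect of this sketch: runs of whitespace become single spaces.
fn attach_verb_prefixes(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    let mut words = text.split_whitespace().peekable();
    while let Some(word) = words.next() {
        out.push_str(word);
        // Only add a separator if another word follows.
        if words.peek().is_some() {
            if word == MI || word == NEMI {
                out.push(ZWNJ);
            } else {
                out.push(' ');
            }
        }
    }
    out
}
```

Calling `attach_verb_prefixes` on "می روم" would yield the two words joined by U+200C rather than a space. A production implementation would need a richer rule set (suffixes, compound-word dictionaries) than this single check.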
## Installation
To use Yekdast in your project, add the following line to your Cargo.toml file:
[]
= "0.1.0" # Please replace with the latest version
## Quick Start
Using the library with its default settings is straightforward.
use ;
## Advanced Usage & Options
The real power of Yekdast lies in its configurability. You can control every aspect of the normalization process.
use ;
use HashMap;
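As an illustration of what option-driven normalization can look like, the sketch below uses a `HashMap` of word-level replacements plus a digit-conversion flag. Everything here is an assumption for demonstration: `SketchOptions`, `normalize_with`, and `to_persian_digit` are hypothetical names and do not reflect Yekdast's actual types or functions.

```rust
use std::collections::HashMap;

/// Hypothetical options struct, for illustration only (not Yekdast's API).
struct SketchOptions {
    /// Whole-word replacements, e.g. slang to formal spellings.
    replacements: HashMap<String, String>,
    /// Whether to convert Latin digits (0-9) to Persian digits.
    persian_digits: bool,
}

/// Maps a Latin digit to the Persian digit at U+06F0..=U+06F9;
/// every other character is returned unchanged.
fn to_persian_digit(c: char) -> char {
    match c {
        '0'..='9' => char::from_u32(0x06F0 + (c as u32 - '0' as u32)).unwrap_or(c),
        _ => c,
    }
}

/// Applies the configured rules word by word. Splitting on whitespace
/// means runs of spaces are squeezed to one in the output.
fn normalize_with(text: &str, opts: &SketchOptions) -> String {
    text.split_whitespace()
        .map(|word| {
            // Look up a custom replacement; fall back to the word itself.
            let word = opts
                .replacements
                .get(word)
                .map(String::as_str)
                .unwrap_or(word);
            if opts.persian_digits {
                word.chars().map(to_persian_digit).collect::<String>()
            } else {
                word.to_string()
            }
        })
        .collect::<Vec<_>>()
        .join(" ")
}
```

Keeping the rules in a plain struct lets callers build one configuration and reuse it across many inputs.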
## Feature Checklist
- Unify Arabic characters (ي, ك) to Persian (ی, ک)
- Normalize and convert digits (Persian, Arabic, Latin)
- Normalize punctuation (comma, semicolon, question mark)
- Smart ZWNJ insertion for prefixes (می/نمی), suffixes (ها/تر/ترین), and custom compound words
- Support for custom slang-to-formal dictionaries
- Support for user-defined custom replacement rules
- Squeeze and clean up extra whitespace
- Normalize newline characters
- Remove diacritics and Tatweel (kashida)
- Automatically protect URLs, emails, code, and HTML tags
- Support for Unicode Normalization Forms (NFC, NFKC)
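A few of the items above can be sketched with the standard library alone: unifying Arabic Yeh and Kaf, removing Tatweel and common diacritics, squeezing whitespace, and leaving URL-like or email-like tokens untouched. This is a simplified illustration under stated assumptions, not Yekdast's implementation; the function names are hypothetical, and the protection check is a crude token-level heuristic.

```rust
/// Crude heuristic (an assumption for this sketch): treat tokens that
/// look like URLs or emails as protected and leave them unchanged.
fn is_protected(token: &str) -> bool {
    token.starts_with("http://") || token.starts_with("https://") || token.contains('@')
}

/// Unifies Arabic letters to their Persian forms and drops Tatweel
/// and the common diacritics in U+064B..=U+0652.
fn clean_token(token: &str) -> String {
    token
        .chars()
        .filter_map(|c| match c {
            '\u{064A}' => Some('\u{06CC}'), // Arabic Yeh ي -> Persian Yeh ی
            '\u{0643}' => Some('\u{06A9}'), // Arabic Kaf ك -> Persian Kaf ک
            '\u{0640}' => None,             // Tatweel (kashida)
            '\u{064B}'..='\u{0652}' => None, // Fathatan through Sukun
            _ => Some(c),
        })
        .collect()
}

/// Hypothetical entry point: cleans each whitespace-separated token
/// unless it is protected, and joins tokens with single spaces.
fn normalize_sketch(text: &str) -> String {
    text.split_whitespace()
        .map(|t| if is_protected(t) { t.to_string() } else { clean_token(t) })
        .collect::<Vec<_>>()
        .join(" ")
}
```

Working token by token keeps the protection logic simple, at the cost of not handling protected spans that contain spaces, such as code blocks or HTML tags with attributes.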
## Contributing
Contributions are welcome! Please feel free to open an issue to report a bug or suggest a feature, or submit a pull request.
## License
This project is licensed under the MIT License.