= parsitext
image:https://img.shields.io/crates/v/parsitext.svg[crates.io,link=https://crates.io/crates/parsitext] image:https://docs.rs/parsitext/badge.svg[docs.rs,link=https://docs.rs/parsitext] image:https://github.com/obsernetics/rust-lib/actions/workflows/ci.yml/badge.svg[CI,link=https://github.com/obsernetics/rust-lib/actions/workflows/ci.yml] image:https://img.shields.io/badge/license-Apache--2.0-blue.svg[License] image:https://img.shields.io/badge/MSRV-1.85-orange.svg[MSRV] image:https://deps.rs/repo/github/obsernetics/rust-lib/status.svg[Dependencies,link=https://deps.rs/repo/github/obsernetics/rust-lib]
High-performance Persian (Farsi) text processing engine for Rust.
== Features
=== Normalisation pipeline
- ZWNJ normalisation — detect and clean up misplaced Zero-Width Non-Joiners in compound words.
- Digit unification — convert between Persian (۰–۹), Arabic-Indic (٠–٩), and Latin (0–9) scripts.
- Orthography fixes — normalise Arabic character variants (ك ي ى ة) to Persian canonical forms.
- Diacritics removal — strip Arabic harakat (fathah, kasrah, shadda, etc.).
- Spacing cleanup — collapse multiple whitespace, strip BOM and invisible chars.
- Repetition reduction — "خیییلی" → "خییلی" (preserves digit runs in IDs).
- Slang normalisation (opt-in) — map common informal/goftari forms to written Persian.
- Profanity filter (opt-in) — `Light` and `Medium` levels.
- Custom rules — user-supplied word-boundary-aware replacements.
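
To make a few of the pipeline steps concrete, here is a minimal standalone sketch of digit unification (towards Latin digits), orthography fixes, and repetition reduction. It does not use the crate: `normalise_char`, `reduce_repetition`, and `normalise` are hypothetical names, the mapping of ة to ه is an assumption, and the real pipeline may order or implement these steps differently.

```rust
/// Map Persian (U+06F0..U+06F9) and Arabic-Indic (U+0660..U+0669) digits to
/// Latin digits, and a few Arabic letter variants to Persian forms.
/// The exact mapping table is an assumption made for this sketch.
fn normalise_char(c: char) -> char {
    match c {
        '\u{06F0}'..='\u{06F9}' => char::from(b'0' + (c as u32 - 0x06F0) as u8),
        '\u{0660}'..='\u{0669}' => char::from(b'0' + (c as u32 - 0x0660) as u8),
        '\u{0643}' => '\u{06A9}',              // Arabic kaf -> Persian keheh
        '\u{064A}' | '\u{0649}' => '\u{06CC}', // Arabic yeh / alef maksura -> Persian yeh
        '\u{0629}' => '\u{0647}',              // teh marbuta -> heh (assumed)
        other => other,
    }
}

/// Collapse runs of three or more identical characters down to two,
/// while leaving runs of ASCII digits (IDs, amounts) untouched.
fn reduce_repetition(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    let mut prev: Option<char> = None;
    let mut run = 0usize;
    for c in text.chars() {
        if Some(c) == prev {
            run += 1;
        } else {
            prev = Some(c);
            run = 1;
        }
        if run <= 2 || c.is_ascii_digit() {
            out.push(c);
        }
    }
    out
}

/// Character mapping first, so that non-Latin digit runs are also preserved.
fn normalise(text: &str) -> String {
    let mapped: String = text.chars().map(normalise_char).collect();
    reduce_repetition(&mapped)
}
```

Mapping characters before reducing repetition means a run such as ۱۰۰۰ is already Latin by the time the repetition pass sees it, so it is not shortened.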
=== Iranian validators (matches persian-tools.js)
- `national_id` — کد ملی, 10-digit checksum.
- `sheba` — شبا / IBAN, mod-97 checksum + bank-issuer lookup (English + Persian names).
- `bank_card` — 16-digit Luhn + bank-issuer lookup from BIN.
- `phone` — 11-digit Iranian mobile + canonical form + operator detection (MCI / Irancell / RighTel / Shatel / Aptel).
- `landline` — 11-digit Iranian fixed-line + province detection (all 31 provinces).
- `postal_code` — 10-digit Iranian postal code.
- `car_plate` — Iranian vehicle licence plate (`12 ب 345 - 67`).
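
The two checksum rules named above can be sketched without the crate. The functions below are hypothetical standalone helpers, not the library's API: `is_valid_national_id` applies the usual weighted mod-11 rule, and `is_valid_sheba` applies the ISO 13616 mod-97 rule to a 26-character `IR` IBAN. Issuer lookup, and any extra rules the crate may apply (for example rejecting IDs made of one repeated digit), are left out.

```rust
/// Weighted mod-11 checksum for a 10-digit Iranian national ID.
fn is_valid_national_id(id: &str) -> bool {
    let digits: Vec<u32> = id.chars().filter_map(|c| c.to_digit(10)).collect();
    if id.chars().count() != 10 || digits.len() != 10 {
        return false;
    }
    // The first nine digits are weighted 10, 9, ..., 2.
    let sum: u32 = digits[..9]
        .iter()
        .enumerate()
        .map(|(i, d)| d * (10 - i as u32))
        .sum();
    let rem = sum % 11;
    let check = digits[9];
    if rem < 2 { check == rem } else { check == 11 - rem }
}

/// ISO 13616 mod-97 check for an Iranian IBAN ("IR" followed by 24 digits).
fn is_valid_sheba(iban: &str) -> bool {
    let s: String = iban
        .chars()
        .filter(|c| !c.is_whitespace())
        .collect::<String>()
        .to_ascii_uppercase();
    if s.len() != 26 || !s.starts_with("IR") || !s[2..].chars().all(|c| c.is_ascii_digit()) {
        return false;
    }
    // Move the country code and check digits to the end, then reduce mod 97.
    let rearranged = s[4..].chars().chain(s[..4].chars());
    let mut rem: u32 = 0;
    for c in rearranged {
        // Digits map to themselves, letters to 10..=35 (A = 10).
        let v = match c.to_digit(36) {
            Some(v) => v,
            None => return false,
        };
        rem = if v < 10 { (rem * 10 + v) % 97 } else { (rem * 100 + v) % 97 };
    }
    rem == 1
}
```

Reducing modulo 97 after every character keeps the running value small, so no big-integer type is needed for the 28-digit rearranged number.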
=== Entity recognition (14 kinds)

Phone number, Jalali date (numeric + textual), money amount, national ID, bank card (Luhn-validated), IBAN (checksum-validated), postal code, car plate, time expression, mention, hashtag, URL, e-mail. Matches are post-validated against the corresponding checksums to suppress false positives.
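
As an illustration of checksum post-validation, the sketch below scans text for 16-digit runs and keeps only those that pass the Luhn check. It is a standalone approximation, not the crate's recogniser: `luhn_ok` and `card_candidates` are hypothetical names, and real card numbers written with spaces or dashes would need extra handling.

```rust
/// Luhn checksum over a string of ASCII digits.
fn luhn_ok(digits: &str) -> bool {
    if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return false;
    }
    let sum: u32 = digits
        .bytes()
        .rev()
        .enumerate()
        .map(|(i, b)| {
            let d = u32::from(b - b'0');
            if i % 2 == 1 {
                // Double every second digit from the right; fold two-digit results.
                let doubled = d * 2;
                if doubled > 9 { doubled - 9 } else { doubled }
            } else {
                d
            }
        })
        .sum();
    sum % 10 == 0
}

/// Return the 16-digit runs in `text` that survive Luhn validation.
fn card_candidates(text: &str) -> Vec<&str> {
    text.split(|c: char| !c.is_ascii_digit())
        .filter(|run| run.len() == 16 && luhn_ok(run))
        .collect()
}
```

Here `4111111111111111` is a generic Luhn-valid test number, while an arbitrary 16-digit reference such as `1234567890123456` is filtered out.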
=== Persian numbers (matches Hazm + persian-tools)
- `to_words(1234)` → `"یک هزار و دویست و سی و چهار"`.
- `from_words("دو میلیون و پانصد هزار")` → `2_500_000`.
- `format(1_234_567)` → `"۱،۲۳۴،۵۶۷"` (Persian thousand separators).
- `ordinal(3)` → `"سوم"`.
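
The grouping behind the formatting helper can be sketched in a few lines. `format_persian` is a hypothetical standalone function, and the separator code point (U+060C, the Arabic comma) is an assumption about what the crate emits.

```rust
/// Separator inserted between groups of three digits (assumed to be U+060C).
const SEPARATOR: char = '\u{060C}';

/// Render `n` with Persian digits (U+06F0..U+06F9) and thousand separators.
fn format_persian(n: u64) -> String {
    let ascii = n.to_string();
    let len = ascii.len();
    let mut out = String::new();
    for (i, b) in ascii.bytes().enumerate() {
        // A separator goes before every digit that starts a full group of three.
        if i > 0 && (len - i) % 3 == 0 {
            out.push(SEPARATOR);
        }
        let code = 0x06F0 + u32::from(b - b'0');
        out.push(char::from_u32(code).expect("Persian digit code points are valid chars"));
    }
    out
}
```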
=== Money parser
Structured parsing of mixed numeric / spelled-out monetary values:
`"دو میلیون و پانصد هزار تومان"` → `MoneyAmount { value: 2_500_000, unit: Toman }`, with Toman ↔ Rial conversion.
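
The unit conversion rests on the fixed relation 1 Toman = 10 Rial. The sketch below mirrors the shape of the value shown above, but the `Unit` enum name, the `u64` field type, and the `to_rial` / `to_toman` methods are assumptions for illustration, not the crate's definitions.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Unit {
    Toman,
    Rial,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct MoneyAmount {
    value: u64,
    unit: Unit,
}

impl MoneyAmount {
    /// 1 Toman = 10 Rial.
    fn to_rial(self) -> MoneyAmount {
        match self.unit {
            Unit::Toman => MoneyAmount { value: self.value * 10, unit: Unit::Rial },
            Unit::Rial => self,
        }
    }

    /// Integer division: amounts that are not a multiple of 10 Rial are truncated.
    fn to_toman(self) -> MoneyAmount {
        match self.unit {
            Unit::Rial => MoneyAmount { value: self.value / 10, unit: Unit::Toman },
            Unit::Toman => self,
        }
    }
}
```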
=== Light Persian stemmer
Lucene-style suffix-stripping for plurals, possessives, comparatives, and verb endings. "کتابهایم" → "کتاب".
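
Suffix stripping of this kind can be approximated with an ordered suffix list. The sketch below is not the crate's stemmer: `stem`, the short suffix table, and the minimum stem length of two characters are all assumptions, and verb endings are not covered.

```rust
/// Suffixes tried in order, longest first (a small illustrative subset).
const SUFFIXES: [&str; 7] = [
    "\u{0647}\u{0627}\u{06CC}\u{0645}", // هایم
    "\u{0647}\u{0627}\u{06CC}\u{062A}", // هایت
    "\u{0647}\u{0627}\u{06CC}\u{0634}", // هایش
    "\u{062A}\u{0631}\u{06CC}\u{0646}", // ترین
    "\u{0647}\u{0627}\u{06CC}",         // های
    "\u{0647}\u{0627}",                 // ها
    "\u{062A}\u{0631}",                 // تر
];

/// Strip at most one suffix, then drop a trailing ZWNJ left behind by it.
fn stem(word: &str) -> &str {
    for suffix in SUFFIXES.iter() {
        if let Some(rest) = word.strip_suffix(*suffix) {
            let rest = rest.trim_end_matches('\u{200C}');
            // Refuse to produce stems shorter than two characters.
            if rest.chars().count() >= 2 {
                return rest;
            }
        }
    }
    word
}
```

Trying the longest suffixes first prevents `ها` from matching before `هایم` has had a chance to.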
=== ZWNJ insertion
Heuristic morphological glue (opt-in via config or one-shot helper):
`"میروم"` → `"می‌روم"`, `"کتابها"` → `"کتاب‌ها"` (a ZWNJ, U+200C, is inserted after the prefix and before the plural suffix).
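
A very rough version of the heuristic can be written with two rules: a ZWNJ after the verbal prefix می and a ZWNJ before the plural suffix ها. `insert_zwnj` and its length thresholds are assumptions for this sketch; a naive rule like this also fires on words that merely begin with the same letters, which is presumably why the feature is opt-in.

```rust
const ZWNJ: char = '\u{200C}';
const PREFIX_MI: &str = "\u{0645}\u{06CC}"; // می
const SUFFIX_HA: &str = "\u{0647}\u{0627}"; // ها

/// Insert a ZWNJ after a leading "می" and before a trailing "ها".
fn insert_zwnj(word: &str) -> String {
    // Rule 1: prefix, only when at least three characters follow it.
    let with_prefix = match word.strip_prefix(PREFIX_MI) {
        Some(rest) if !rest.starts_with(ZWNJ) && rest.chars().count() >= 3 => {
            format!("{}{}{}", PREFIX_MI, ZWNJ, rest)
        }
        _ => word.to_string(),
    };
    // Rule 2: plural suffix, only when the stem keeps at least two characters.
    match with_prefix.strip_suffix(SUFFIX_HA) {
        Some(stem) if !stem.ends_with(ZWNJ) && stem.chars().count() >= 2 => {
            format!("{}{}{}", stem, ZWNJ, SUFFIX_HA)
        }
        _ => with_prefix.clone(),
    }
}
```

Both rules skip words that already contain the ZWNJ at the relevant position, so applying the function twice gives the same result as applying it once.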
=== Persian → Latin transliteration
Character-level romanisation: "سلام" → "slam", "ایران" → "ayran". Useful for search-key generation.
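
Because the romanisation is character-level, short vowels that Persian leaves unwritten do not appear in the output, which is why "سلام" becomes "slam". A minimal sketch with a partial table follows; `latin_for` and `transliterate` are hypothetical names, and every mapping beyond the letters used in the two examples above is an assumption.

```rust
/// Partial Persian-to-Latin table; returns None for unmapped characters.
fn latin_for(c: char) -> Option<&'static str> {
    Some(match c {
        '\u{0627}' => "a",              // ا
        '\u{0628}' => "b",              // ب
        '\u{067E}' => "p",              // پ
        '\u{062A}' => "t",              // ت
        '\u{062F}' => "d",              // د
        '\u{0631}' => "r",              // ر
        '\u{0632}' => "z",              // ز
        '\u{0633}' => "s",              // س
        '\u{0634}' => "sh",             // ش
        '\u{06A9}' => "k",              // ک
        '\u{06AF}' => "g",              // گ
        '\u{0644}' => "l",              // ل
        '\u{0645}' => "m",              // م
        '\u{0646}' => "n",              // ن
        '\u{0648}' => "v",              // و
        '\u{0647}' => "h",              // ه
        '\u{06CC}' | '\u{064A}' => "y", // ی and Arabic ي
        _ => return None,
    })
}

/// Romanise character by character; unmapped characters pass through unchanged.
fn transliterate(text: &str) -> String {
    let mut out = String::new();
    for c in text.chars() {
        match latin_for(c) {
            Some(s) => out.push_str(s),
            None => out.push(c),
        }
    }
    out
}
```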
=== Spell-check primitives

Levenshtein edit distance + dictionary-based suggestion engine. Bring your own word list; no built-in dictionary (licence-free Persian dictionaries are not bundled).
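
For reference, the two primitives can be sketched as follows: a two-row dynamic-programming Levenshtein distance over Unicode scalar values, and a suggestion function that ranks a caller-supplied word list. The names `levenshtein` and `suggest` and the `max_distance` parameter are assumptions for this sketch rather than the crate's signatures.

```rust
/// Levenshtein distance computed over chars, so Persian letters count as one unit each.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] holds the distance between the processed prefix of `a` and b[..j].
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            let best = (prev[j] + cost) // substitution or match
                .min(prev[j + 1] + 1)   // deletion
                .min(cur[j] + 1);       // insertion
            cur.push(best);
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Return dictionary words within `max_distance`, closest first (ties sorted by word).
fn suggest<'a>(word: &str, dictionary: &[&'a str], max_distance: usize) -> Vec<&'a str> {
    let mut scored: Vec<(usize, &'a str)> = dictionary
        .iter()
        .map(|w| (levenshtein(word, w), *w))
        .filter(|(d, _)| *d <= max_distance)
        .collect();
    scored.sort();
    scored.into_iter().map(|(_, w)| w).collect()
}
```

Scanning the whole word list is linear in its size; for large dictionaries an index such as a BK-tree would be a natural refinement.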
=== Optional jalali integration
With the `jalali` Cargo feature, detected date entities are validated against the real Jalali calendar (rejecting, for example, Esfand 30 in non-leap years), and `Parsitext::parse_jalali_date` returns a structured `JalaliDate`.
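
To show what calendar validation involves, the sketch below checks month lengths (31 days for months 1 to 6, 30 for months 7 to 11, 29 or 30 for Esfand) using the common 33-year leap approximation. Both function names are hypothetical, and the approximation is only reliable for roughly the current few centuries; the crate delegates to a calendar library, whose leap rule may differ.

```rust
/// 33-year cycle approximation of Jalali leap years (an assumption of this sketch).
fn is_jalali_leap(year: i32) -> bool {
    matches!(year.rem_euclid(33), 1 | 5 | 9 | 13 | 17 | 22 | 26 | 30)
}

/// Check that (year, month, day) names a real day in the Jalali calendar.
fn is_valid_jalali_date(year: i32, month: u32, day: u32) -> bool {
    let days_in_month = match month {
        1..=6 => 31,
        7..=11 => 30,
        12 => {
            if is_jalali_leap(year) { 30 } else { 29 }
        }
        _ => return false,
    };
    (1..=days_in_month).contains(&day)
}
```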
=== Text utilities
- Sentence splitting on `.؟!؛` with ZWNJ-awareness.
- Tokenisation that keeps ZWNJ-joined compound words together.
- Text statistics — word count, Persian ratio, digit count, sentence count, unique-token count.
- Batch processing — parallel via Rayon (`parallel` feature, default on).
- Language detection — `is_persian` / `contains_persian`.
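
Two of these utilities are simple enough to sketch directly. `split_sentences` cuts after each of the four terminators listed above, and `contains_persian` tests for any character in the Arabic Unicode block (U+0600..U+06FF), which is an assumption about how detection might work and also matches Arabic text. Neither function is the crate's implementation, and the ZWNJ handling and ellipsis cases are omitted.

```rust
/// Sentence terminators: full stop, Arabic question mark, exclamation mark, Arabic semicolon.
const TERMINATORS: [char; 4] = ['.', '\u{061F}', '!', '\u{061B}'];

/// Split text after each terminator, trimming surrounding whitespace.
fn split_sentences(text: &str) -> Vec<String> {
    let mut sentences = Vec::new();
    let mut current = String::new();
    for c in text.chars() {
        current.push(c);
        if TERMINATORS.contains(&c) {
            let sentence = current.trim();
            if !sentence.is_empty() {
                sentences.push(sentence.to_string());
            }
            current.clear();
        }
    }
    // Keep any trailing text that has no terminator.
    let tail = current.trim();
    if !tail.is_empty() {
        sentences.push(tail.to_string());
    }
    sentences
}

/// True if any character falls in the Arabic block used by Persian script.
fn contains_persian(text: &str) -> bool {
    text.chars().any(|c| ('\u{0600}'..='\u{06FF}').contains(&c))
}
```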
=== Optional features
[cols="1,1,3", options="header"]
|===
| Feature | Default | What it enables

| `parallel` | ✓ | Rayon-powered `process_batch`
| `serde` | | `Serialize`/`Deserialize` on all public output types
| `jalali` | | Validate Jalali dates and parse them into `JalaliDate` via the `jalali-calendar` crate
|===
== Quick start
[source,toml]
----
[dependencies]
parsitext = "0.1"
----

Or with optional features:

[source,toml]
----
parsitext = { version = "0.1", features = ["serde"] }
----
[source,rust]
----
use parsitext::Parsitext;

fn main() {
    // Build a processor with the default configuration.
    let pt = Parsitext::default();
    let result = pt.process("سلام داداش، قيمتش حدود ١.٥ میلیون تومنه؟");

    // Print the normalised text, then every detected entity.
    println!("{}", result.normalized);
    for entity in &result.entities {
        println!("{entity}");
    }
}
----
== Examples
Each runnable example lives in examples/:
[cols="1,2", options="header"]
|===
| Example | What it demonstrates

| `basic_normalize` | Orthography, digit, and ZWNJ normalisation.
| `entity_detection` | Detect phone, date, money, URL, and mention entities.
| `batch_processing` | Process texts in parallel with Rayon.
| `custom_rules` | Apply user-defined whole-word replacement rules.
| `validators` | National ID, IBAN, bank card, phone-operator validation, number↔words, money parsing, stemmer.
| `showcase` | Car plates, time expressions, ZWNJ insertion, transliteration, spell suggestions, money formatting, landline province detection, Jalali date parsing.
|===
Run an example by name, for instance:

[source,bash]
----
cargo run --example basic_normalize -p parsitext
----
== Benchmarks
[source,bash]
----
cargo bench -p parsitext
----

The `benches/` directory contains `normalizer.rs` and `pipeline.rs`, which measure normalisation throughput and full-pipeline performance across representative Persian text inputs.
== Documentation
[source,bash]
----
cargo doc -p parsitext --all-features
----
== License
Apache-2.0.