= parsitext
image:https://img.shields.io/crates/v/parsitext.svg[crates.io,link=https://crates.io/crates/parsitext]
image:https://docs.rs/parsitext/badge.svg[docs.rs,link=https://docs.rs/parsitext]
image:https://github.com/obsernetics/rust-lib/actions/workflows/ci.yml/badge.svg[CI,link=https://github.com/obsernetics/rust-lib/actions/workflows/ci.yml]
image:https://img.shields.io/badge/license-Apache--2.0-blue.svg[License]
image:https://img.shields.io/badge/MSRV-1.85-orange.svg[MSRV]
image:https://deps.rs/repo/github/obsernetics/rust-lib/status.svg[Dependencies,link=https://deps.rs/repo/github/obsernetics/rust-lib]
High-performance Persian (Farsi) text processing engine for Rust.
== Features
=== Normalisation pipeline
* ZWNJ normalisation — detect and clean up misplaced Zero-Width Non-Joiners in compound words.
* Digit unification — convert between Persian (۰–۹), Arabic-Indic (٠–٩), and Latin (0–9) scripts.
* Orthography fixes — normalise Arabic character variants (ك ي ى ة) to Persian canonical forms.
* Diacritics removal — strip Arabic harakat (fathah, kasrah, shadda, etc.).
* Spacing cleanup — collapse multiple whitespace, strip BOM and invisible chars.
* Repetition reduction — "خیییلی" → "خییلی" (preserves digit runs in IDs).
* Slang normalisation (opt-in) — map common informal/goftari forms to written Persian.
* Profanity filter (opt-in) — `Light` and `Medium` levels.
* Custom rules — user-supplied word-boundary-aware replacements.
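Two of the steps above are simple enough to sketch in standalone Rust. The functions below are illustrative only and are not parsitext's implementation: the names `digits_to_latin` and `reduce_repetition` are hypothetical, and the cap of two repeated characters is inferred from the repetition example in the list.

```rust
/// Map Persian (U+06F0..=U+06F9) and Arabic-Indic (U+0660..=U+0669) digits
/// to ASCII digits; every other character passes through unchanged.
fn digits_to_latin(input: &str) -> String {
    input
        .chars()
        .map(|c| {
            let zero = match c {
                '\u{06F0}'..='\u{06F9}' => 0x06F0,
                '\u{0660}'..='\u{0669}' => 0x0660,
                _ => return c,
            };
            // The offset from the script's zero is the digit value (0..=9).
            char::from(b'0' + (c as u32 - zero) as u8)
        })
        .collect()
}

/// Collapse runs of three or more identical characters down to two,
/// leaving runs of digits untouched (assumption: cap of two).
fn reduce_repetition(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut prev: Option<char> = None;
    let mut run = 0usize;
    for c in input.chars() {
        if prev == Some(c) {
            run += 1;
        } else {
            prev = Some(c);
            run = 1;
        }
        if run <= 2 || c.is_numeric() {
            out.push(c);
        }
    }
    out
}
```

With these definitions, `digits_to_latin("۱۲۳")` yields `"123"`, and `reduce_repetition("1000000")` returns its input unchanged because digit runs are exempt.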
=== Iranian validators (matches `persian-tools.js`)
* `national_id` — کد ملی, 10-digit checksum.
* `sheba` — شبا / IBAN, mod-97 checksum + bank-issuer lookup (English + Persian names).
* `bank_card` — 16-digit Luhn + bank-issuer lookup from BIN.
* `phone` — 11-digit Iranian mobile + canonical form + operator detection (MCI / Irancell / RighTel / Shatel / Aptel).
* `landline` — 11-digit Iranian fixed-line + province detection (all 31 provinces).
* `postal_code` — 10-digit Iranian postal code.
* `car_plate` — Iranian vehicle licence plate (`12 ب 345 - 67`).
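The checksum rules named above can be sketched without the crate. This is a minimal standalone illustration of the publicly documented algorithms (national ID weighted sum mod 11, Luhn, IBAN mod-97), not parsitext's code; all function names are hypothetical, and rejecting IDs made of one repeated digit is an assumption based on common practice.

```rust
/// Parse a string of exactly `len` ASCII digits into digit values.
fn ascii_digits(s: &str, len: usize) -> Option<Vec<u32>> {
    if s.len() != len {
        return None;
    }
    s.chars().map(|c| c.to_digit(10)).collect()
}

/// National ID: weight the first nine digits 10 down to 2, take the sum mod 11,
/// and compare against the tenth (check) digit.
fn national_id_is_valid(id: &str) -> bool {
    let Some(d) = ascii_digits(id, 10) else { return false };
    // Assumption: IDs consisting of a single repeated digit are rejected.
    if d.iter().all(|&x| x == d[0]) {
        return false;
    }
    let sum: u32 = d[..9].iter().zip((2..=10).rev()).map(|(&x, w)| x * w).sum();
    let r = sum % 11;
    let expected = if r < 2 { r } else { 11 - r };
    d[9] == expected
}

/// Luhn check over a 16-digit card number.
fn luhn_is_valid(card: &str) -> bool {
    let Some(d) = ascii_digits(card, 16) else { return false };
    let sum: u32 = d
        .iter()
        .rev()
        .enumerate()
        .map(|(i, &x)| {
            // Double every second digit counted from the right.
            if i % 2 == 1 {
                let doubled = x * 2;
                if doubled > 9 { doubled - 9 } else { doubled }
            } else {
                x
            }
        })
        .sum();
    sum % 10 == 0
}

/// Iranian IBAN: "IR" plus 24 digits, valid when the rearranged number is 1 mod 97.
fn sheba_is_valid(iban: &str) -> bool {
    if iban.len() != 26
        || !iban.starts_with("IR")
        || !iban[2..].bytes().all(|b| b.is_ascii_digit())
    {
        return false;
    }
    // Move "IR" and the two check digits to the end; I = 18 and R = 27.
    let rearranged = format!("{}1827{}", &iban[4..], &iban[2..4]);
    let rem = rearranged
        .bytes()
        .fold(0u32, |acc, b| (acc * 10 + u32::from(b - b'0')) % 97);
    rem == 1
}
```

The sample values `0499370899` (national ID), `4111111111111111` (a generic Luhn test number, not an Iranian card) and `IR820540102680020817909002` satisfy their respective checksums. Issuer, operator, and province lookups are out of scope for this sketch.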
=== Entity recognition (14 kinds)
Phone number, Jalali date (numeric and textual), money amount, national ID, bank card (Luhn-validated), IBAN (checksum-validated), postal code, **car plate**, **time expression**, mention, hashtag, URL, and e-mail. Matches are post-validated against the relevant checksums to suppress false positives.
=== Persian numbers (matches Hazm + persian-tools)
* `to_words(1234)` → `"یک هزار و دویست و سی و چهار"`.
* `from_words("دو میلیون و پانصد هزار")` → `2_500_000`.
* `format(1_234_567)` → `"۱،۲۳۴،۵۶۷"` (Persian thousand separators).
* `ordinal(3)` → `"سوم"`.
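As an illustration of the grouping behind `format`, the following standalone sketch groups digits in threes and emits Persian digits. It is not the crate's implementation; `format_persian` is a hypothetical name, and the separator U+060C is assumed from the example output above.

```rust
/// Format an integer with Persian digits and a separator every three digits.
/// Assumption: the separator is U+060C, as in the README example.
fn format_persian(n: u64) -> String {
    let ascii = n.to_string();
    let len = ascii.len();
    let mut out = String::new();
    for (i, b) in ascii.bytes().enumerate() {
        // Insert a separator whenever a multiple of three digits remains.
        if i > 0 && (len - i) % 3 == 0 {
            out.push('\u{060C}');
        }
        // Persian digits start at U+06F0.
        let digit = char::from_u32(0x06F0 + u32::from(b - b'0')).expect("valid Persian digit");
        out.push(digit);
    }
    out
}
```

Calling `format_persian(1_234_567)` produces the same string as the `format` example in the list.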
=== Money parser
Structured parsing of mixed numeric / spelled-out monetary values:
`"دو میلیون و پانصد هزار تومان"` → `MoneyAmount { value: 2_500_000, unit: Toman }`, with Toman ↔ Rial conversion.
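The Toman ↔ Rial conversion rests on the fixed ratio of 1 Toman = 10 Rial. The sketch below models it with stand-in types; `Amount`, `Unit`, `to_rial`, and `to_toman` are hypothetical names for illustration and may not match the crate's `MoneyAmount` API.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Unit {
    Toman,
    Rial,
}

/// Stand-in for a parsed monetary value (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Amount {
    value: u64,
    unit: Unit,
}

impl Amount {
    /// Convert to Rial: 1 Toman = 10 Rial.
    fn to_rial(self) -> Amount {
        match self.unit {
            Unit::Toman => Amount { value: self.value * 10, unit: Unit::Rial },
            Unit::Rial => self,
        }
    }

    /// Convert to Toman; integer division drops any remainder below 10 Rial.
    fn to_toman(self) -> Amount {
        match self.unit {
            Unit::Rial => Amount { value: self.value / 10, unit: Unit::Toman },
            Unit::Toman => self,
        }
    }
}
```

For the amount in the example above, 2,500,000 Toman converts to 25,000,000 Rial and back without loss.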
=== Light Persian stemmer
Lucene-style suffix-stripping for plurals, possessives, comparatives, and verb endings. `"کتابهایم"` → `"کتاب"`.
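To make the suffix-stripping idea concrete, here is a minimal longest-match sketch. The suffix list is a tiny hypothetical subset chosen for illustration, `light_stem` is not the crate's function, and the minimum stem length of two characters is an assumption.

```rust
/// Illustrative suffixes, ordered longest first so the longest match wins.
const SUFFIXES: &[&str] = &["هایم", "های", "ها", "ترین", "تر"];

/// Strip the first matching suffix, provided at least two characters remain.
fn light_stem(word: &str) -> &str {
    for suffix in SUFFIXES {
        if let Some(stem) = word.strip_suffix(*suffix) {
            // Drop a ZWNJ (U+200C) that may have joined stem and suffix.
            let stem = stem.trim_end_matches('\u{200C}');
            if stem.chars().count() >= 2 {
                return stem;
            }
        }
    }
    word
}
```

A production stemmer would need a much larger suffix table and exception handling; this sketch only shows the control flow.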
=== ZWNJ insertion
Heuristic morphological glue (opt-in via config or one-shot helper):
`"میروم"` → `"می‌روم"`, `"کتابها"` → `"کتاب‌ها"` (each output contains a zero-width non-joiner, U+200C, which some renderers do not display).
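A deliberately naive sketch of such a heuristic follows, covering only the two patterns in the example: the verbal prefix "می" and the plural suffix "ها". The name `insert_zwnj` and the length thresholds are assumptions; without a lexicon this rule will also split ordinary words that merely start with the same letters.

```rust
const ZWNJ: char = '\u{200C}';
const MI: &str = "\u{0645}\u{06CC}"; // prefix "می"
const HA: &str = "\u{0647}\u{0627}"; // suffix "ها"

/// Insert a ZWNJ after a leading "می" and before a trailing "ها".
/// Length thresholds are guesses to avoid splitting very short words.
fn insert_zwnj(word: &str) -> String {
    // Detach the prefix when enough characters follow it.
    let (prefix, rest) = match word.strip_prefix(MI) {
        Some(rest) if rest.chars().count() >= 3 && !rest.starts_with(ZWNJ) => {
            (format!("{MI}{ZWNJ}"), rest)
        }
        _ => (String::new(), word),
    };
    // Detach the plural suffix from what remains.
    let body = match rest.strip_suffix(HA) {
        Some(stem) if stem.chars().count() >= 2 && !stem.ends_with(ZWNJ) => {
            format!("{stem}{ZWNJ}{HA}")
        }
        _ => rest.to_string(),
    };
    format!("{prefix}{body}")
}
```

Words that already contain the joiner at the relevant position are returned unchanged, so applying the function twice gives the same result as applying it once.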
=== Persian → Latin transliteration
Character-level romanisation: `"سلام"` → `"slam"`, `"ایران"` → `"ayran"`. Useful for search-key generation.
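Character-level romanisation amounts to a lookup per character. The sketch below includes only the seven letter mappings that can be read off the two examples above; the function name `transliterate` and the pass-through of unmapped characters are assumptions, and the crate's full table is not reproduced here.

```rust
/// Romanise character by character using a small, partial table.
/// Unmapped characters are passed through unchanged (assumption).
fn transliterate(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            '\u{0627}' => 'a', // ا
            '\u{0631}' => 'r', // ر
            '\u{0633}' => 's', // س
            '\u{0644}' => 'l', // ل
            '\u{0645}' => 'm', // م
            '\u{0646}' => 'n', // ن
            '\u{06CC}' => 'y', // ی
            other => other,
        })
        .collect()
}
```

Because short vowels are not written in Persian, the output is a consonant-heavy key (`"slam"` rather than `"salam"`), which is adequate for search keys but not for display.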
=== Spell-check primitives
Levenshtein edit distance + dictionary-based suggestion engine. Bring your own word list; no built-in dictionary (licence-free Persian dictionaries are not bundled).
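For readers unfamiliar with the technique, this is a standalone sketch of Levenshtein distance over Unicode scalar values plus a simple dictionary filter. The names `levenshtein` and `suggest` and their signatures are hypothetical and may differ from the crate's API.

```rust
/// Levenshtein edit distance, computed row by row over `char`s.
fn levenshtein(a: &str, b: &str) -> usize {
    let b: Vec<char> = b.chars().collect();
    // prev[j] = distance between the processed prefix of `a` and b[..j].
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.chars().enumerate() {
        let mut curr = Vec::with_capacity(b.len() + 1);
        curr.push(i + 1);
        for (j, &cb) in b.iter().enumerate() {
            let cost = usize::from(ca != cb);
            let best = (prev[j] + cost) // substitution or match
                .min(prev[j + 1] + 1)   // deletion
                .min(curr[j] + 1);      // insertion
            curr.push(best);
        }
        prev = curr;
    }
    prev[b.len()]
}

/// Return dictionary words within `max_distance`, closest first.
fn suggest<'a>(word: &str, dictionary: &[&'a str], max_distance: usize) -> Vec<&'a str> {
    let mut hits: Vec<(usize, &'a str)> = dictionary
        .iter()
        .map(|&w| (levenshtein(word, w), w))
        .filter(|&(d, _)| d <= max_distance)
        .collect();
    hits.sort(); // by distance, then by string order
    hits.into_iter().map(|(_, w)| w).collect()
}
```

Operating on `char`s rather than bytes matters for Persian, where each letter occupies two bytes in UTF-8. The linear scan over the dictionary is fine for small word lists; larger ones would call for an index such as a BK-tree.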
=== Optional `jalali` integration
With the `jalali` Cargo feature, detected date entities are validated against the real Jalali calendar (rejects e.g. Esfand 30 in non-leap years) and `Parsitext::parse_jalali_date` returns a structured `JalaliDate`.
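The day-of-month part of that validation follows from the fixed month lengths of the Jalali calendar: months 1 to 6 have 31 days, months 7 to 11 have 30, and Esfand has 29 or 30. The sketch below takes the leap-year flag as a parameter because leap-year determination is left to the calendar library; both function names are hypothetical.

```rust
/// Number of days in a Jalali month (1 = Farvardin, 12 = Esfand).
fn jalali_month_length(month: u8, leap_year: bool) -> Option<u8> {
    match month {
        1..=6 => Some(31),
        7..=11 => Some(30),
        12 => Some(if leap_year { 30 } else { 29 }),
        _ => None,
    }
}

/// Check that `day` exists in `month`; the caller supplies the leap-year flag.
fn is_valid_jalali_day(month: u8, day: u8, leap_year: bool) -> bool {
    jalali_month_length(month, leap_year).is_some_and(|len| (1..=len).contains(&day))
}
```

With these rules, `is_valid_jalali_day(12, 30, false)` is `false`, which is the Esfand 30 case mentioned above.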
=== Text utilities
* Sentence splitting on `.` `؟` `!` `؛` with ZWNJ-awareness.
* Tokenisation that keeps ZWNJ-joined compound words together.
* Text statistics — word count, Persian ratio, digit count, sentence count, unique-token count.
* Batch processing — parallel via Rayon (`parallel` feature, default on).
* Language detection — `is_persian` / `contains_persian`.
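Sentence splitting on the four terminators listed above can be sketched as follows. This naive version is not the crate's splitter: `split_sentences` is a hypothetical name, it does not model ZWNJ-awareness, and it will also split on a `.` inside a decimal number.

```rust
/// Split text after each `.`, `!`, `؟` (U+061F) or `؛` (U+061B),
/// trimming whitespace and skipping empty pieces.
fn split_sentences(text: &str) -> Vec<&str> {
    let mut sentences = Vec::new();
    let mut start = 0;
    for (i, c) in text.char_indices() {
        if matches!(c, '.' | '!' | '\u{061F}' | '\u{061B}') {
            // Include the terminator in the sentence it closes.
            let end = i + c.len_utf8();
            let sentence = text[start..end].trim();
            if !sentence.is_empty() {
                sentences.push(sentence);
            }
            start = end;
        }
    }
    // Keep any trailing text that lacks a terminator.
    let tail = text[start..].trim();
    if !tail.is_empty() {
        sentences.push(tail);
    }
    sentences
}
```

Slicing by byte offsets from `char_indices` keeps the function allocation-free while remaining safe for multi-byte Persian characters.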
=== Optional features
[cols="1,1,3", options="header"]
|===
| Feature | Default | What it enables
| `parallel` | ✓ | Rayon-powered `process_batch`
| `serde` | | `Serialize`/`Deserialize` on all public output types
| `jalali` | | Validate Jalali dates and parse them into `JalaliDate` via the `jalali-calendar` crate
|===
== Quick start
[source,toml]
----
[dependencies]
parsitext = "0.1"
# Or, to enable optional features, use this line instead of the one above:
# parsitext = { version = "0.1", features = ["serde"] }
----
[source,rust]
----
use parsitext::Parsitext;

fn main() {
    // Build a processor with the default configuration.
    let pt = Parsitext::default();
    let result = pt.process("سلام داداش، قيمتش حدود ١.٥ میلیون تومنه؟");
    // Print the normalised text, then each detected entity.
    println!("{}", result.normalized);
    for entity in &result.entities {
        println!("{entity}");
    }
}
----
== Examples
Each runnable example lives in `examples/`:
[cols="1,2", options="header"]
|===
| Example | What it demonstrates
| `basic_normalize` | Orthography, digit, and ZWNJ normalisation.
| `entity_detection` | Detect phone, date, money, URL, and mention entities.
| `batch_processing` | Process texts in parallel with Rayon.
| `custom_rules` | Apply user-defined whole-word replacement rules.
| `validators` | National ID, IBAN, bank card, phone-operator validation, number↔words, money parsing, stemmer.
| `showcase` | Car plates, time expressions, ZWNJ insertion, transliteration, spell suggestions, money formatting, landline province detection, Jalali date parsing.
|===
Run with:
[source,bash]
----
cargo run --example <name> -p parsitext
----
== Benchmarks
[source,bash]
----
cargo bench -p parsitext
----
The `benches/` directory contains `normalizer.rs` and `pipeline.rs`, measuring normalisation throughput and full pipeline performance across representative Persian text inputs.
== Documentation
[source,bash]
----
cargo doc -p parsitext --all-features
----
== License
Apache-2.0.