= parsitext
image:https://img.shields.io/crates/v/parsitext.svg[crates.io,link=https://crates.io/crates/parsitext] image:https://docs.rs/parsitext/badge.svg[docs.rs,link=https://docs.rs/parsitext] image:https://github.com/obsernetics/rust-lib/actions/workflows/ci.yml/badge.svg[CI,link=https://github.com/obsernetics/rust-lib/actions/workflows/ci.yml] image:https://img.shields.io/badge/license-Apache--2.0-blue.svg[License] image:https://img.shields.io/badge/MSRV-1.88-orange.svg[MSRV] image:https://deps.rs/repo/github/obsernetics/rust-lib/status.svg[Dependencies,link=https://deps.rs/repo/github/obsernetics/rust-lib]
High-performance Persian (Farsi) text processing engine for Rust.
== Features
=== Normalisation pipeline
- ZWNJ normalisation — detect and clean up misplaced Zero-Width Non-Joiners in compound words.
- Digit unification — convert between Persian (۰–۹), Arabic-Indic (٠–٩), and Latin (0–9) scripts.
- Orthography fixes — normalise Arabic character variants (ك ي ى ة) to Persian canonical forms.
- Diacritics removal — strip Arabic harakat (fathah, kasrah, shadda, etc.).
- Spacing cleanup — collapse multiple whitespace, strip BOM and invisible chars.
- Repetition reduction — "خیییلی" → "خییلی" (preserves digit runs in IDs).
- Slang normalisation (opt-in) — map common informal/goftari forms to written Persian.
- Profanity filter (opt-in) — `Light` and `Medium` levels.
- Custom rules — user-supplied word-boundary-aware replacements.
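To illustrate the digit-unification step, here is a minimal standalone sketch that maps between the three digit scripts by code-point offset. It uses only the standard library; `to_latin_digits` and `to_persian_digits` are hypothetical names chosen for this example and are not claimed to be parsitext's API.

```rust
/// Map Persian (U+06F0..=U+06F9) and Arabic-Indic (U+0660..=U+0669)
/// digits to ASCII digits; every other character passes through.
fn to_latin_digits(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            '\u{06F0}'..='\u{06F9}' => {
                char::from_u32(c as u32 - 0x06F0 + '0' as u32).unwrap_or(c)
            }
            '\u{0660}'..='\u{0669}' => {
                char::from_u32(c as u32 - 0x0660 + '0' as u32).unwrap_or(c)
            }
            _ => c,
        })
        .collect()
}

/// Map ASCII and Arabic-Indic digits to Persian digits.
fn to_persian_digits(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            '0'..='9' => char::from_u32(c as u32 - '0' as u32 + 0x06F0).unwrap_or(c),
            '\u{0660}'..='\u{0669}' => {
                char::from_u32(c as u32 - 0x0660 + 0x06F0).unwrap_or(c)
            }
            _ => c,
        })
        .collect()
}
```

Because each script occupies ten consecutive code points, the conversion is a constant offset and needs no lookup table.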
=== Iranian validators (matches persian-tools.js + iranianbank)
- `national_id` — کد ملی, 10-digit personal-ID checksum + issuance-prefix extraction.
- `legal_id` — شناسه ملی, 11-digit company / legal-entity checksum.
- `sheba` — شبا / IBAN, mod-97 checksum + bank-issuer lookup + IBAN generation from `(bank_code, account_type, account_number)` (matches the `iranianbank` crate's `Iban::new` feature).
- `bank_card` — 16-digit Luhn + bank-issuer lookup from BIN.
- `phone` — 11-digit Iranian mobile + canonical form + operator detection (MCI / Irancell / RighTel / Shatel / Aptel).
- `landline` — 11-digit Iranian fixed-line + province detection (all 31 provinces).
- `postal_code` — 10-digit Iranian postal code.
- `car_plate` — Iranian vehicle licence plate (12 ب 345 - 67).
- `bill` — قبض / pay-slip: bill-id and pay-id checksums + bill-type detection (water / electricity / gas / phone / mobile / municipality / tax / fines).
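For readers unfamiliar with the national-ID checksum, the following standalone sketch shows the commonly documented rule: weight the first nine digits by 10 down to 2, take the sum modulo 11, and compare the derived check digit with the tenth digit. The function name `is_valid_national_id` is hypothetical, and rejecting codes made of one repeated digit is an assumption borrowed from common validators, not a confirmed detail of this crate.

```rust
/// Checksum-validate a 10-digit Iranian national ID given as ASCII digits.
fn is_valid_national_id(code: &str) -> bool {
    // `char::to_digit(10)` only accepts ASCII digits, so Persian digits
    // must be converted before calling this function.
    let digits: Vec<u32> = match code.chars().map(|c| c.to_digit(10)).collect() {
        Some(d) => d,
        None => return false,
    };
    if digits.len() != 10 {
        return false;
    }
    // Assumption: codes such as "1111111111" pass the arithmetic check
    // but are conventionally rejected.
    if digits.iter().all(|&d| d == digits[0]) {
        return false;
    }
    // Weighted sum of the first nine digits with weights 10, 9, ..., 2.
    let sum: u32 = digits[..9]
        .iter()
        .enumerate()
        .map(|(i, &d)| d * (10 - i as u32))
        .sum();
    let remainder = sum % 11;
    let expected = if remainder < 2 { remainder } else { 11 - remainder };
    digits[9] == expected
}
```

For `1234567891` the weighted sum is 210, `210 % 11` is 1, and the check digit is therefore 1, so the code passes.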
=== Entity recognition (14 kinds)
Phone number, Jalali date (numeric + textual), money amount, national ID, bank card (Luhn-validated), IBAN (checksum-validated), postal code, car plate, time expression, mention, hashtag, URL, e-mail. Match results are post-validated against the issuing checksums to suppress false positives.
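The post-validation idea can be sketched with the Luhn check used for bank cards: a 16-digit candidate found in text is kept only if its checksum holds. This is a standalone illustration; `luhn_valid` is a hypothetical helper, and the test number in the note below is the generic Luhn test value rather than an Iranian card.

```rust
/// Return true if `card` is exactly 16 ASCII digits that satisfy the Luhn checksum.
fn luhn_valid(card: &str) -> bool {
    let digits: Option<Vec<u32>> = card.chars().map(|c| c.to_digit(10)).collect();
    let Some(digits) = digits else {
        return false; // a non-digit character was present
    };
    if digits.len() != 16 {
        return false;
    }
    // Walk from the rightmost digit; double every second digit and
    // subtract 9 whenever the doubled value exceeds 9.
    let sum: u32 = digits
        .iter()
        .rev()
        .enumerate()
        .map(|(i, &d)| {
            if i % 2 == 1 {
                let doubled = d * 2;
                if doubled > 9 { doubled - 9 } else { doubled }
            } else {
                d
            }
        })
        .sum();
    sum % 10 == 0
}
```

With this check, `4111111111111111` is accepted while `4111111111111112` is rejected, which is how a random 16-digit run in text is filtered out.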
=== Persian numbers (matches Hazm + persian-tools)
- `to_words(1234)` → "یک هزار و دویست و سی و چهار".
- `from_words("دو میلیون و پانصد هزار")` → `2_500_000`.
- `format(1_234_567)` → "۱،۲۳۴،۵۶۷" (Persian thousand separators).
- `ordinal(3)` → "سوم".
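As a rough sketch of the grouping behaviour, the function below renders an integer with Persian digits and a separator every three digits. It is a standalone example, not the crate's implementation; `format_persian` is a hypothetical name, and the separator is assumed to be U+060C, the character that appears in the sample output above.

```rust
/// Format `n` with Persian digits, inserting U+060C between groups of three.
fn format_persian(n: u64) -> String {
    let ascii = n.to_string();
    let len = ascii.len();
    let mut out = String::new();
    for (i, b) in ascii.bytes().enumerate() {
        // Insert a separator whenever the count of remaining digits
        // is a non-zero multiple of three.
        if i > 0 && (len - i) % 3 == 0 {
            out.push('\u{060C}');
        }
        // Shift the ASCII digit into the Persian digit block (U+06F0..).
        let persian = char::from_u32(0x06F0 + u32::from(b - b'0'))
            .expect("offset stays inside the Persian digit block");
        out.push(persian);
    }
    out
}
```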
=== Money parser
Structured parsing of mixed numeric / spelled-out monetary values:
"دو میلیون و پانصد هزار تومان" → MoneyAmount { value: 2_500_000, unit: Toman }, with Toman ↔ Rial conversion.
=== Light Persian stemmer
Lucene-style suffix-stripping for plurals, possessives, comparatives, and verb endings. "کتابهایم" → "کتاب".
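A toy version of suffix stripping might look like the sketch below. The suffix list, its ordering, and the three-character minimum stem length are all assumptions made for this illustration; the crate's actual rule set is not shown in this README, and `light_stem` is a hypothetical name.

```rust
/// Hypothetical suffix list, longest first so that "هایم" wins over "ها".
const SUFFIXES: [&str; 5] = ["هایم", "ترین", "های", "ها", "تر"];

/// Strip the first matching suffix, provided at least three characters remain.
fn light_stem(word: &str) -> &str {
    for suffix in SUFFIXES {
        if let Some(stem) = word.strip_suffix(suffix) {
            // The length guard avoids reducing short words such as
            // "دختر" to a meaningless two-letter stem.
            if stem.chars().count() >= 3 {
                return stem;
            }
        }
    }
    word
}
```

A light stemmer of this kind trades accuracy for speed: it never consults a dictionary, so some over-stemming is expected.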
=== ZWNJ insertion
Heuristic morphological glue (opt-in via config or one-shot helper):
"میروم" → "می‌روم", "کتابها" → "کتاب‌ها" (a ZWNJ, U+200C, is inserted after the prefix "می" and before the suffix "ها").
=== Persian → Latin transliteration
Character-level romanisation: "سلام" → "slam", "ایران" → "ayran". Useful for search-key generation.
=== Finglish ↔ Persian conversion
Persian-in-Latin-script (Finglish) → Persian script via a ~250-word dictionary
plus character-level transliteration with ZWNJ-aware digraphs (kh sh ch
zh gh oo ou aa ee).
"salam khoobi?" → "سلام خوبی؟".
=== Chat / GenZ register conversion (style::*)
- `to_formal(text)` — informal goftari → written neveshtar.
- `to_chat(text)` — formal Persian / Finglish → contracted goftari.
- `to_genz(text)` — chat + English-loanword swap (مهمانی→پارتی, جذاب→کول, واقعاً→ریلی).

All three accept Persian or Finglish input and auto-detect which one they were given.
=== Persian phonetic matching (phonetic::soundex)
Soundex-style codec that collapses Persian homophone groups
(ص = س = ث, ز = ذ = ض = ظ, ت = ط, …) so that
fuzzy-name lookup and search work without indexing every spelling variant.
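The homophone-collapsing idea can be sketched as a character mapping onto one representative per group. This standalone example covers only the three groups listed above and returns a normalised string rather than a Soundex-style code, so it is a simplification; `phonetic_key` is a hypothetical name.

```rust
/// Collapse the homophone groups named above onto one representative letter.
fn phonetic_key(word: &str) -> String {
    word.chars()
        .map(|c| match c {
            'ص' | 'ث' => 'س',       // s-sounding letters
            'ذ' | 'ض' | 'ظ' => 'ز', // z-sounding letters
            'ط' => 'ت',             // t-sounding letters
            other => other,
        })
        .collect()
}
```

Two spellings match when their keys are equal, so an index only needs to store the key of each name.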
=== Tantivy analyzer (opt-in)
With the tantivy Cargo feature, parsitext::tantivy_analyzer::PersianTokenizer
plugs into link:https://docs.rs/tantivy[tantivy]'s search index — ZWNJ-aware
tokenisation, optional stemmer pass, optional Arabic→Persian character
normalisation, and correct UTF-8 byte offsets for highlighting.
=== Spell-check primitives
Levenshtein edit distance + dictionary-based suggestion engine. Ships with
a bundled ~250-word common-word list (spell_dict::COMMON_WORDS,
spell::suggest_builtin); bring your own list with spell::suggest for
domain-specific vocab.
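For reference, a minimal standalone sketch of the two primitives follows: a two-row Levenshtein distance over Unicode scalar values, and a suggestion function that keeps dictionary words within a distance budget. The names and signatures here are illustrative assumptions and may differ from `spell::suggest`.

```rust
/// Levenshtein edit distance counted in `char`s, not bytes.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // `prev` holds the distances for the previous row of the DP table.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut cur = vec![0; b.len() + 1];
    for (i, &ca) in a.iter().enumerate() {
        cur[0] = i + 1;
        for (j, &cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            cur[j + 1] = (prev[j] + cost) // substitution or match
                .min(prev[j + 1] + 1)     // deletion
                .min(cur[j] + 1);         // insertion
        }
        std::mem::swap(&mut prev, &mut cur);
    }
    prev[b.len()]
}

/// Return dictionary words within `max_distance`, closest first.
fn suggest<'a>(word: &str, dict: &[&'a str], max_distance: usize) -> Vec<&'a str> {
    let mut hits: Vec<(usize, &'a str)> = dict
        .iter()
        .map(|&w| (levenshtein(word, w), w))
        .filter(|&(d, _)| d <= max_distance)
        .collect();
    // The sort is stable, so ties keep their dictionary order.
    hits.sort_by_key(|&(d, _)| d);
    hits.into_iter().map(|(_, w)| w).collect()
}
```

Comparing `char`s rather than bytes matters for Persian, where each letter takes two bytes in UTF-8 and a byte-level distance would double-count edits.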
=== Optional jalali integration
With the jalali Cargo feature, detected date entities are validated against the real Jalali calendar (rejects e.g. Esfand 30 in non-leap years) and Parsitext::parse_jalali_date returns a structured JalaliDate.
=== Text utilities
- Sentence splitting on `.` `؟` `!` `؛` with ZWNJ-awareness.
- Tokenisation that keeps ZWNJ-joined compound words together.
- Text statistics — word count, Persian ratio, digit count, sentence count, unique-token count.
- Batch processing — parallel via Rayon (`parallel` feature, default on).
- Language detection — `is_persian`, `contains_persian`, `script::has_arabic`, `script::has_persian`, `script::is_pure_persian`, `script::to_arabic`.
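Sentence splitting on the four terminators can be approximated with the standalone sketch below. It is deliberately naive: it also splits on a period inside a number such as `1.5`, which a production splitter would need to guard against. `split_sentences` is a hypothetical name.

```rust
/// Split `text` after each of `.`, `!`, `؟`, `؛`, keeping the terminator
/// attached to its sentence.
fn split_sentences(text: &str) -> Vec<&str> {
    let mut sentences = Vec::new();
    let mut start = 0;
    for (i, c) in text.char_indices() {
        if matches!(c, '.' | '!' | '؟' | '؛') {
            let end = i + c.len_utf8();
            // `trim` removes surrounding whitespace only; ZWNJ (U+200C)
            // is not whitespace, so joined compounds stay intact.
            let sentence = text[start..end].trim();
            if !sentence.is_empty() {
                sentences.push(sentence);
            }
            start = end;
        }
    }
    let tail = text[start..].trim();
    if !tail.is_empty() {
        sentences.push(tail);
    }
    sentences
}
```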
=== Geographic data (geo)
All 31 Iranian provinces — id, English/Persian names, capital, telephone area
code, slug — plus a curated set of major cities for province lookup:
geo::find_province_by_city("اصفهان") → Isfahan,
geo::get_cities_of_province(19) → Fars cities.
=== URL helpers (url_fix)
Persian-aware percent-encoding (encode/decode) and fix(url) to render
percent-encoded Persian URLs back into readable form while leaving the
scheme + authority untouched.
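The decoding half can be sketched with the standard library alone: collect the raw bytes, turn each valid `%XX` triplet into one byte, and interpret the result as UTF-8. This example decodes the whole input and does not treat the scheme or authority specially, so it is simpler than the `fix` behaviour described above; `percent_decode` is a hypothetical name.

```rust
/// Value of one ASCII hex digit, or `None` for any other byte.
fn hex_val(b: u8) -> Option<u8> {
    (b as char).to_digit(16).map(|d| d as u8)
}

/// Decode `%XX` escapes and interpret the resulting bytes as UTF-8.
fn percent_decode(input: &str) -> String {
    let bytes = input.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'%' && i + 2 < bytes.len() {
            if let (Some(hi), Some(lo)) = (hex_val(bytes[i + 1]), hex_val(bytes[i + 2])) {
                out.push(hi * 16 + lo);
                i += 3;
                continue;
            }
        }
        // Not a valid escape: copy the byte unchanged.
        out.push(bytes[i]);
        i += 1;
    }
    // Invalid UTF-8 sequences become U+FFFD instead of causing a panic.
    String::from_utf8_lossy(&out).into_owned()
}
```

Each Persian letter occupies two bytes in UTF-8, which is why a four-letter word such as "سلام" expands to eight `%XX` escapes.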
=== Relative time (time_diff)
describe(seconds) and describe_between(from, to) produce Persian phrases
like "۲ روز پیش" / "۳ ساعت دیگر" for any signed offset.
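A possible shape for such a function is sketched below. The sign convention (negative means past), the unit thresholds, and the truncating division are assumptions made for this illustration and may not match the crate's `describe`; the helper `persian_number` is likewise hypothetical.

```rust
/// Render a non-negative integer with Persian digits.
fn persian_number(n: u64) -> String {
    n.to_string()
        .bytes()
        .map(|b| {
            char::from_u32(0x06F0 + u32::from(b - b'0'))
                .expect("offset stays inside the Persian digit block")
        })
        .collect()
}

/// Describe a signed offset in seconds.
/// Assumption: negative offsets are in the past, positive ones in the future.
fn describe(offset_secs: i64) -> String {
    let abs = offset_secs.unsigned_abs();
    // Pick the largest unit that fits; the count is truncated, not rounded.
    let (count, unit) = if abs >= 86_400 {
        (abs / 86_400, "روز") // days
    } else if abs >= 3_600 {
        (abs / 3_600, "ساعت") // hours
    } else if abs >= 60 {
        (abs / 60, "دقیقه") // minutes
    } else {
        (abs, "ثانیه") // seconds
    };
    let direction = if offset_secs < 0 { "پیش" } else { "دیگر" };
    format!("{} {} {}", persian_number(count), unit, direction)
}
```

A fuller version would probably special-case a zero offset and add weeks, months, and years.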
=== Optional features
[cols="1,1,3", options="header"]
|===
| Feature | Default | What it enables
| parallel | ✓ | Rayon-powered process_batch
| serde | | Serialize/Deserialize on all public output types
| jalali | | Validate Jalali dates and parse them into JalaliDate via the link:https://crates.io/crates/jalali-calendar[jalali-calendar] crate
| tantivy | | PersianTokenizer for the link:https://crates.io/crates/tantivy[tantivy] search engine
|===
== Quick start
[source,toml]
----
[dependencies]
parsitext = "0.1"
----
Or with optional features:
[source,toml]
----
parsitext = { version = "0.1", features = ["serde"] }
----
[source,rust]
----
use parsitext::{Parsitext, ParsitextConfig};

let pt = Parsitext::default();
let result = pt.process("سلام داداش، قيمتش حدود ١.٥ میلیون تومنه؟");

println!("{}", result.normalized);
for entity in &result.entities {
    println!("{entity}");
}
----
== Examples
Each runnable example lives in `examples/`:
[cols="1,2", options="header"]
|===
| Example | What it demonstrates
| basic_normalize | Orthography, digit, and ZWNJ normalisation.
| entity_detection | Detect phone, date, money, URL, and mention entities.
| batch_processing | Process texts in parallel with Rayon.
| custom_rules | Apply user-defined whole-word replacement rules.
| validators | National ID, IBAN, bank card, phone-operator validation, number↔words, money parsing, stemmer.
| showcase | Car plates, time expressions, ZWNJ insertion, transliteration, spell suggestions, money formatting, landline province detection, Jalali date parsing.
| finglish_chat | Finglish→Persian conversion, formal/chat/GenZ register transformation, Soundex matching, spell suggestions.
|===
Run an example by passing its name, for instance `basic_normalize`:
[source,bash]
----
cargo run --example basic_normalize -p parsitext
----
== Benchmarks
[source,bash]
----
cargo bench -p parsitext
----
The benches/ directory contains normalizer.rs and pipeline.rs, measuring normalisation throughput and full pipeline performance across representative Persian text inputs.
== Documentation
[source,bash]
----
cargo doc -p parsitext --all-features
----
== License
Apache-2.0.