parsitext 0.1.3

High-performance Persian (Farsi) text processing engine for Rust — normalization, tokenization, entity recognition.

= parsitext

image:https://img.shields.io/crates/v/parsitext.svg[crates.io,link=https://crates.io/crates/parsitext] image:https://docs.rs/parsitext/badge.svg[docs.rs,link=https://docs.rs/parsitext] image:https://github.com/obsernetics/rust-lib/actions/workflows/ci.yml/badge.svg[CI,link=https://github.com/obsernetics/rust-lib/actions/workflows/ci.yml] image:https://img.shields.io/badge/license-Apache--2.0-blue.svg[License] image:https://img.shields.io/badge/MSRV-1.88-orange.svg[MSRV] image:https://deps.rs/repo/github/obsernetics/rust-lib/status.svg[Dependencies,link=https://deps.rs/repo/github/obsernetics/rust-lib]

High-performance Persian (Farsi) text processing engine for Rust.

== Features

=== Normalisation pipeline

  • ZWNJ normalisation — detect and clean up misplaced Zero-Width Non-Joiners in compound words.
  • Digit unification — convert between Persian (۰–۹), Arabic-Indic (٠–٩), and Latin (0–9) scripts.
  • Orthography fixes — normalise Arabic character variants (ك ي ى ة) to Persian canonical forms.
  • Diacritics removal — strip Arabic harakat (fathah, kasrah, shadda, etc.).
  • Spacing cleanup — collapse multiple whitespace, strip BOM and invisible chars.
  • Repetition reduction — "خیییلی" → "خییلی" (preserves digit runs in IDs).
  • Slang normalisation (opt-in) — map common informal/goftari forms to written Persian.
  • Profanity filter (opt-in) — Light and Medium levels.
  • Custom rules — user-supplied word-boundary-aware replacements.
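
The character-level steps above can be illustrated with a minimal, std-only sketch. It is not parsitext's implementation: `digit_value`, `normalize_chars`, and `reduce_repetition` are hypothetical names, Persian is assumed as the digit target, and mapping ة to ه is an assumed choice.

```rust
/// Numeric value of an ASCII, Arabic-Indic, or Persian digit.
fn digit_value(c: char) -> Option<u32> {
    match c {
        '0'..='9' => Some(c as u32 - '0' as u32),
        '\u{0660}'..='\u{0669}' => Some(c as u32 - 0x0660), // Arabic-Indic digits
        '\u{06F0}'..='\u{06F9}' => Some(c as u32 - 0x06F0), // Persian digits
        _ => None,
    }
}

/// Drop harakat, unify digits to Persian, and map Arabic letter variants.
fn normalize_chars(input: &str) -> String {
    input
        .chars()
        // U+064B (fathatan) through U+0652 (sukun) are the common harakat.
        .filter(|c| !('\u{064B}'..='\u{0652}').contains(c))
        .map(|c| match (digit_value(c), c) {
            (Some(d), _) => char::from_u32(0x06F0 + d).unwrap_or(c),
            (None, '\u{0643}') => '\u{06A9}', // Arabic kaf -> Persian kaf
            (None, '\u{064A}') | (None, '\u{0649}') => '\u{06CC}', // Arabic yeh forms -> Persian yeh
            (None, '\u{0629}') => '\u{0647}', // teh marbuta -> heh (assumed target)
            (None, other) => other,
        })
        .collect()
}

/// Collapse runs of three or more identical characters down to two,
/// leaving digit runs (IDs, amounts) untouched.
fn reduce_repetition(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut prev: Option<char> = None;
    let mut run = 0usize;
    for c in input.chars() {
        if Some(c) == prev {
            run += 1;
        } else {
            prev = Some(c);
            run = 1;
        }
        if run <= 2 || digit_value(c).is_some() {
            out.push(c);
        }
    }
    out
}
```

Each pass here is a single linear scan over `chars()`, so the passes can be chained in any order without re-tokenising the input.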

=== Iranian validators (matches persian-tools.js + iranianbank)

  • national_id — کد ملی, 10-digit personal-ID checksum + issuance-prefix extraction.
  • legal_id — شناسه ملی, 11-digit company / legal-entity checksum.
  • sheba — شبا / IBAN, mod-97 checksum + bank-issuer lookup + IBAN generation from (bank_code, account_type, account_number) — matches the iranianbank crate's Iban::new feature.
  • bank_card — 16-digit Luhn + bank-issuer lookup from BIN.
  • phone — 11-digit Iranian mobile + canonical form + operator detection (MCI / Irancell / RighTel / Shatel / Aptel).
  • landline — 11-digit Iranian fixed-line + province detection (all 31 provinces).
  • postal_code — 10-digit Iranian postal code.
  • car_plate — Iranian vehicle licence plate (12 ب 345 - 67).
  • bill — قبض / pay-slip: bill-id and pay-id checksums + bill-type detection (water / electricity / gas / phone / mobile / municipality / tax / fines).
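
For readers unfamiliar with the checksums named above, here is a hedged, std-only sketch of three of them: the national-ID check digit, the Luhn check for cards, and the mod-97 check for Sheba. The function names are hypothetical and are not the crate's API. Rejecting IDs made of a single repeated digit is a common convention that is assumed here.

```rust
/// Parse a string of exactly `len` ASCII digits into their numeric values.
fn ascii_digits(s: &str, len: usize) -> Option<Vec<u32>> {
    if s.len() != len {
        return None;
    }
    // `to_digit(10)` only accepts ASCII 0-9, so anything else yields None.
    s.chars().map(|c| c.to_digit(10)).collect()
}

/// National ID: weight the first nine digits by 10..=2, reduce mod 11,
/// and compare against the tenth digit.
fn is_valid_national_id(code: &str) -> bool {
    let Some(d) = ascii_digits(code, 10) else { return false };
    if d.iter().all(|&x| x == d[0]) {
        return false; // assumed convention: reject 0000000000, 1111111111, ...
    }
    let sum: u32 = d[..9]
        .iter()
        .enumerate()
        .map(|(i, &x)| x * (10 - i as u32))
        .sum();
    let r = sum % 11;
    let expected = if r < 2 { r } else { 11 - r };
    d[9] == expected
}

/// Bank card: Luhn check over 16 digits (double every second digit from the right).
fn is_valid_card_luhn(card: &str) -> bool {
    let Some(d) = ascii_digits(card, 16) else { return false };
    let sum: u32 = d
        .iter()
        .rev()
        .enumerate()
        .map(|(i, &x)| {
            if i % 2 == 1 {
                let y = x * 2;
                if y > 9 { y - 9 } else { y }
            } else {
                x
            }
        })
        .sum();
    sum % 10 == 0
}

/// Sheba: "IR" + 24 digits; move the first four characters to the end,
/// replace I and R with 18 and 27, and require the number mod 97 to be 1.
fn is_valid_sheba(iban: &str) -> bool {
    let well_formed = iban.len() == 26
        && iban.starts_with("IR")
        && iban[2..].bytes().all(|b| b.is_ascii_digit());
    if !well_formed {
        return false;
    }
    let rearranged = format!("{}1827{}", &iban[4..], &iban[2..4]);
    let rem = rearranged
        .bytes()
        .fold(0u32, |rem, b| (rem * 10 + u32::from(b - b'0')) % 97);
    rem == 1
}
```

The sketch covers only the arithmetic; issuer lookup, operator detection, and prefix extraction need reference tables that are not reproduced here.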

=== Entity recognition (14 kinds)

Phone number, Jalali date (numeric + textual), money amount, national ID, bank card (Luhn-validated), IBAN (checksum-validated), postal code, car plate, time expression, mention, hashtag, URL, e-mail. Match results are post-validated against the issuing checksums to suppress false positives.

=== Persian numbers (matches Hazm + persian-tools)

  • to_words(1234) → "یک هزار و دویست و سی و چهار".
  • from_words("دو میلیون و پانصد هزار") → 2_500_000.
  • format(1_234_567) → "۱،۲۳۴،۵۶۷" (Persian thousand separators).
  • ordinal(3) → "سوم".
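
As an illustration of the digit-grouping step, the following sketch formats an integer with Persian digits and a separator every three digits. `format_persian` is a hypothetical helper, and the separator is assumed to be U+060C (the Arabic comma); the crate may use a different character.

```rust
/// Render `n` with Persian digits (U+06F0..=U+06F9), inserting a separator
/// between each group of three digits counted from the right.
fn format_persian(n: u64) -> String {
    const SEPARATOR: char = '\u{060C}'; // assumed separator character
    let digits = n.to_string();
    let len = digits.len();
    let mut out = String::new();
    for (i, ch) in digits.chars().enumerate() {
        // A separator goes before every digit that starts a full group of three.
        if i > 0 && (len - i) % 3 == 0 {
            out.push(SEPARATOR);
        }
        let d = ch.to_digit(10).expect("to_string yields ASCII digits");
        out.push(char::from_u32(0x06F0 + d).expect("valid Persian digit"));
    }
    out
}
```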

=== Money parser

Structured parsing of mixed numeric / spelled-out monetary values: "دو میلیون و پانصد هزار تومان" → MoneyAmount { value: 2_500_000, unit: Toman }, with Toman ↔ Rial conversion.
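
The Toman ↔ Rial conversion is a factor of ten (1 Toman = 10 Rial). The sketch below mirrors the shape of the value shown above, but the types and methods are hypothetical stand-ins written for illustration, not the crate's definitions.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Unit {
    Toman,
    Rial,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct MoneyAmount {
    value: u64,
    unit: Unit,
}

impl MoneyAmount {
    /// Express the amount in Rial (1 Toman = 10 Rial).
    fn to_rial(self) -> MoneyAmount {
        match self.unit {
            Unit::Rial => self,
            Unit::Toman => MoneyAmount { value: self.value * 10, unit: Unit::Rial },
        }
    }

    /// Express the amount in Toman; Rial values that are not a multiple
    /// of ten are truncated by the integer division.
    fn to_toman(self) -> MoneyAmount {
        match self.unit {
            Unit::Toman => self,
            Unit::Rial => MoneyAmount { value: self.value / 10, unit: Unit::Toman },
        }
    }
}
```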

=== Light Persian stemmer

Lucene-style suffix-stripping for plurals, possessives, comparatives, and verb endings: "کتاب‌هایم" → "کتاب".

=== ZWNJ insertion

Heuristic morphological glue (opt-in via config or one-shot helper): "میروم" → "می‌روم", "کتابها" → "کتاب‌ها".

=== Persian → Latin transliteration

Character-level romanisation: "سلام" → "slam", "ایران" → "ayran". Useful for search-key generation.

=== Finglish ↔ Persian conversion

Persian-in-Latin-script (Finglish) → Persian script via a ~250-word dictionary plus character-level transliteration with ZWNJ-aware digraphs (kh sh ch zh gh oo ou aa ee): "salam khoobi?" → "سلام خوبی؟".

=== Chat / GenZ register conversion (style::*)

  • to_formal(text) — informal goftari → written neveshtar.
  • to_chat(text) — formal Persian / Finglish → contracted goftari.
  • to_genz(text) — chat + English-loanword swap (مهمانی → پارتی, جذاب → کول, واقعاً → ریلی).

All three accept Persian or Finglish input and auto-detect.

=== Persian phonetic matching (phonetic::soundex)

Soundex-style codec that collapses Persian homophone groups (ص = س = ث, ز = ذ = ض = ظ, ت = ط, …) so that fuzzy-name lookup and search work without indexing every spelling variant.
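
The core idea, folding homophone letters onto one representative, can be sketched as follows. This is a simplified illustration and not the `phonetic::soundex` codec: it covers only the three groups named above, and `fold_homophones` and `sounds_alike` are hypothetical names.

```rust
/// Replace each letter with the representative of its homophone group.
fn fold_homophones(word: &str) -> String {
    word.chars()
        .map(|c| match c {
            'ص' | 'ث' => 'س',       // the "s" group
            'ذ' | 'ض' | 'ظ' => 'ز', // the "z" group
            'ط' => 'ت',             // the "t" group
            other => other,
        })
        .collect()
}

/// Two spellings match when their folded keys are identical.
fn sounds_alike(a: &str, b: &str) -> bool {
    fold_homophones(a) == fold_homophones(b)
}
```

Indexing the folded key alongside the original spelling lets a lookup for one variant find the others with an exact-match query.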

=== Tantivy analyzer (opt-in)

With the tantivy Cargo feature, parsitext::tantivy_analyzer::PersianTokenizer plugs into link:https://docs.rs/tantivy[tantivy]'s search index: ZWNJ-aware tokenisation, optional stemmer pass, optional Arabic→Persian character normalisation, and correct UTF-8 byte offsets for highlighting.

=== Spell-check primitives

Levenshtein edit distance + dictionary-based suggestion engine. Ships with a bundled ~250-word common-word list (spell_dict::COMMON_WORDS, spell::suggest_builtin); bring your own list with spell::suggest for domain-specific vocab.
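
To show how edit distance drives suggestions, here is a std-only sketch of the standard two-row Levenshtein algorithm with a dictionary filter on top. `levenshtein` and `closest_words` are hypothetical helpers; the signatures of the crate's own spell functions are not shown in this README and may differ.

```rust
/// Levenshtein distance over Unicode scalar values (not bytes), so each
/// Persian letter counts as one edit.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] = distance between the processed prefix of `a` and b[..j].
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            let best = (prev[j] + cost) // substitute (or keep)
                .min(prev[j + 1] + 1)   // delete from `a`
                .min(cur[j] + 1);       // insert into `a`
            cur.push(best);
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Dictionary words within `max_dist` edits of `word`, nearest first.
fn closest_words<'a>(word: &str, dict: &[&'a str], max_dist: usize) -> Vec<&'a str> {
    let mut hits: Vec<(usize, &'a str)> = dict
        .iter()
        .map(|&w| (levenshtein(word, w), w))
        .filter(|&(d, _)| d <= max_dist)
        .collect();
    hits.sort_by_key(|&(d, _)| d); // stable: ties keep dictionary order
    hits.into_iter().map(|(_, w)| w).collect()
}
```

A linear scan like this is fine for a list of a few hundred words; larger vocabularies usually call for an index such as a BK-tree.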

=== Optional jalali integration

With the jalali Cargo feature, detected date entities are validated against the real Jalali calendar (rejects e.g. Esfand 30 in non-leap years) and Parsitext::parse_jalali_date returns a structured JalaliDate.

=== Text utilities

  • Sentence splitting on . ؟ ! ؛ with ZWNJ-awareness.
  • Tokenisation that keeps ZWNJ-joined compound words together.
  • Text statistics — word count, Persian ratio, digit count, sentence count, unique-token count.
  • Batch processing — parallel via Rayon (parallel feature, default on).
  • Language detection — is_persian, contains_persian, script::has_arabic, script::has_persian, script::is_pure_persian, script::to_arabic.

=== Geographic data (geo)

All 31 Iranian provinces (id, English/Persian names, capital, telephone area code, slug) plus a curated set of major cities for province lookup: geo::find_province_by_city("اصفهان") → Isfahan, geo::get_cities_of_province(19) → Fars cities.

=== URL helpers (url_fix)

Persian-aware percent-encoding (encode/decode) and fix(url) to render percent-encoded Persian URLs back into readable form while leaving the scheme + authority untouched.
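
A rough sketch of the decoding idea follows: percent-escapes are turned back into bytes and reinterpreted as UTF-8, and only the part after the authority is touched. `percent_decode` and `readable_url` are hypothetical names, the scheme/authority split is a simple heuristic, and the crate's actual handling of edge cases may differ.

```rust
/// Decode %XX escapes into bytes and reinterpret the result as UTF-8.
/// Malformed escapes are copied through unchanged.
fn percent_decode(input: &str) -> String {
    let bytes = input.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'%' && i + 2 < bytes.len() {
            let hi = (bytes[i + 1] as char).to_digit(16);
            let lo = (bytes[i + 2] as char).to_digit(16);
            if let (Some(hi), Some(lo)) = (hi, lo) {
                out.push((hi * 16 + lo) as u8);
                i += 3;
                continue;
            }
        }
        out.push(bytes[i]);
        i += 1;
    }
    String::from_utf8_lossy(&out).into_owned()
}

/// Decode only what follows the authority, so "scheme://host" stays as-is.
fn readable_url(url: &str) -> String {
    let path_start = match url.find("://") {
        // First '/' after the authority, or the end if there is no path.
        Some(s) => url[s + 3..].find('/').map_or(url.len(), |p| s + 3 + p),
        None => 0, // no scheme: treat the whole input as a path
    };
    let (prefix, rest) = url.split_at(path_start);
    format!("{prefix}{}", percent_decode(rest))
}
```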

=== Relative time (time_diff)

describe(seconds) and describe_between(from, to) produce Persian phrases like "۲ روز پیش" / "۳ ساعت دیگر" for any signed offset.
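
The phrase construction can be sketched as: pick the largest unit that fits, render the count in Persian digits, and append a past or future word. The sketch assumes that negative offsets mean the past, stops at days, and uses "اکنون" for a zero offset. All three are assumptions, and `describe_offset` is a hypothetical name rather than the crate's function.

```rust
/// Render a non-negative integer with Persian digits.
fn persian_number(n: u64) -> String {
    n.to_string()
        .chars()
        .map(|c| {
            let d = c.to_digit(10).expect("to_string yields ASCII digits");
            char::from_u32(0x06F0 + d).expect("valid Persian digit")
        })
        .collect()
}

/// Describe a signed offset in seconds as "<count> <unit> <past|future word>".
fn describe_offset(offset_secs: i64) -> String {
    if offset_secs == 0 {
        return "اکنون".to_string(); // "now" (assumed wording)
    }
    let abs = offset_secs.unsigned_abs();
    let (count, unit) = if abs < 60 {
        (abs, "ثانیه") // seconds
    } else if abs < 3_600 {
        (abs / 60, "دقیقه") // minutes
    } else if abs < 86_400 {
        (abs / 3_600, "ساعت") // hours
    } else {
        (abs / 86_400, "روز") // days
    };
    // Assumed sign convention: negative = in the past.
    let direction = if offset_secs < 0 { "پیش" } else { "دیگر" };
    format!("{} {} {}", persian_number(count), unit, direction)
}
```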

=== Optional features

[cols="1,1,3", options="header"]
|===
| Feature | Default | What it enables

| parallel | ✓ | Rayon-powered process_batch
| serde | | Serialize/Deserialize on all public output types
| jalali | | Validate Jalali dates and parse them into JalaliDate via the link:https://crates.io/crates/jalali-calendar[jalali-calendar] crate
| tantivy | | PersianTokenizer for the link:https://crates.io/crates/tantivy[tantivy] search engine
|===

== Quick start

[source,toml]
----
[dependencies]
parsitext = "0.1"
----

Or with optional features:

[source,toml]
----
parsitext = { version = "0.1", features = ["serde"] }
----

[source,rust]
----
use parsitext::{Parsitext, ParsitextConfig};

let pt = Parsitext::default();
let result = pt.process("سلام داداش، قيمتش حدود ١.٥ میلیون تومنه؟");

println!("{}", result.normalized);
for entity in &result.entities {
    println!("{entity}");
}
----

== Examples

Each runnable example lives in examples/:

[cols="1,2", options="header"]
|===
| Example | What it demonstrates

| basic_normalize | Orthography, digit, and ZWNJ normalisation.
| entity_detection | Detect phone, date, money, URL, and mention entities.
| batch_processing | Process texts in parallel with Rayon.
| custom_rules | Apply user-defined whole-word replacement rules.
| validators | National ID, IBAN, bank card, phone-operator validation, number↔words, money parsing, stemmer.
| showcase | Car plates, time expressions, ZWNJ insertion, transliteration, spell suggestions, money formatting, landline province detection, Jalali date parsing.
| finglish_chat | Finglish→Persian conversion, formal/chat/GenZ register transformation, Soundex matching, spell suggestions.
|===

Run an example by name, for instance:

[source,bash]
----
cargo run --example basic_normalize -p parsitext
----

== Benchmarks

[source,bash]
----
cargo bench -p parsitext
----

The benches/ directory contains normalizer.rs and pipeline.rs, measuring normalisation throughput and full pipeline performance across representative Persian text inputs.

== Documentation

[source,bash]
----
cargo doc -p parsitext --all-features
----

== License

Apache-2.0.