parsitext 0.1.2

High-performance Persian (Farsi) text processing engine for Rust — normalization, tokenization, entity recognition.

= parsitext

image:https://img.shields.io/crates/v/parsitext.svg[crates.io,link=https://crates.io/crates/parsitext] image:https://docs.rs/parsitext/badge.svg[docs.rs,link=https://docs.rs/parsitext] image:https://github.com/obsernetics/rust-lib/actions/workflows/ci.yml/badge.svg[CI,link=https://github.com/obsernetics/rust-lib/actions/workflows/ci.yml] image:https://img.shields.io/badge/license-Apache--2.0-blue.svg[License] image:https://img.shields.io/badge/MSRV-1.85-orange.svg[MSRV] image:https://deps.rs/repo/github/obsernetics/rust-lib/status.svg[Dependencies,link=https://deps.rs/repo/github/obsernetics/rust-lib]

High-performance Persian (Farsi) text processing engine for Rust.

== Features

=== Normalisation pipeline

  • ZWNJ normalisation — detect and clean up misplaced Zero-Width Non-Joiners in compound words.
  • Digit unification — convert between Persian (۰–۹), Arabic-Indic (٠–٩), and Latin (0–9) scripts.
  • Orthography fixes — normalise Arabic character variants (ك ي ى ة) to Persian canonical forms.
  • Diacritics removal — strip Arabic harakat (fathah, kasrah, shadda, etc.).
  • Spacing cleanup — collapse multiple whitespace, strip BOM and invisible chars.
  • Repetition reduction — "خیییلی" → "خییلی" (preserves digit runs in IDs).
  • Slang normalisation (opt-in) — map common informal/goftari forms to written Persian.
  • Profanity filter (opt-in) — Light and Medium levels.
  • Custom rules — user-supplied word-boundary-aware replacements.
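
As a rough illustration of what a few of these steps involve, here is a minimal standard-library sketch. It is not parsitext's implementation: the function names are hypothetical, the mappings for ى and ة are assumptions about what "Persian canonical forms" means, and the rule "keep at most two repeats" is inferred from the "خیییلی" → "خییلی" example.

```rust
/// Map ASCII and Arabic-Indic digits to Persian digits (U+06F0..U+06F9).
fn to_persian_digit(c: char) -> char {
    let offset = match c {
        '0'..='9' => c as u32 - '0' as u32,
        '\u{0660}'..='\u{0669}' => c as u32 - 0x0660,
        _ => return c,
    };
    char::from_u32(0x06F0 + offset).unwrap_or(c)
}

/// Replace Arabic letter variants with the forms usually used in Persian text.
/// The targets chosen for U+0649 and U+0629 are assumptions.
fn fix_orthography(c: char) -> char {
    match c {
        '\u{0643}' => '\u{06A9}',              // Arabic kaf -> Persian kaf
        '\u{064A}' | '\u{0649}' => '\u{06CC}', // Arabic yeh / alef maksura -> Farsi yeh
        '\u{0629}' => '\u{0647}',              // teh marbuta -> heh
        _ => c,
    }
}

/// Harakat from fathatan (U+064B) through sukun (U+0652).
fn is_diacritic(c: char) -> bool {
    ('\u{064B}'..='\u{0652}').contains(&c)
}

/// Strip diacritics and BOM, fix letters and digits, collapse whitespace.
/// ZWNJ (U+200C) is not whitespace, so it survives this step.
fn normalize(input: &str) -> String {
    let mapped: String = input
        .chars()
        .filter(|c| !is_diacritic(*c) && *c != '\u{FEFF}')
        .map(fix_orthography)
        .map(to_persian_digit)
        .collect();
    mapped.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Keep at most two consecutive copies of a character, except in digit runs.
fn reduce_repetition(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut prev: Option<char> = None;
    let mut run = 0usize;
    for c in input.chars() {
        if Some(c) == prev {
            run += 1;
        } else {
            prev = Some(c);
            run = 1;
        }
        if run <= 2 || c.is_numeric() {
            out.push(c);
        }
    }
    out
}
```

Running the letter and digit maps in a single pass over `chars()` keeps the sketch allocation-light; a real pipeline would presumably make each stage configurable.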

=== Iranian validators (matches persian-tools.js)

  • national_id — کد ملی, 10-digit checksum.
  • sheba — شبا / IBAN, mod-97 checksum + bank-issuer lookup (English + Persian names).
  • bank_card — 16-digit Luhn + bank-issuer lookup from BIN.
  • phone — 11-digit Iranian mobile + canonical form + operator detection (MCI / Irancell / RighTel / Shatel / Aptel).
  • landline — 11-digit Iranian fixed-line + province detection (all 31 provinces).
  • postal_code — 10-digit Iranian postal code.
  • car_plate — Iranian vehicle licence plate (12 ب 345 - 67).
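
The checksum rules behind three of these validators are widely documented, and the sketch below shows them with the standard library only. The function names are hypothetical and are not the crate's API; the sketch covers only the arithmetic (no issuer lookup), and it assumes an Iranian IBAN is `IR` followed by 24 ASCII digits.

```rust
/// Parse ASCII digits into their values; None if any char is not 0-9.
fn digits(s: &str) -> Option<Vec<u32>> {
    s.chars().map(|c| c.to_digit(10)).collect()
}

/// National ID: weight the first nine digits 10..=2, take the sum mod 11,
/// and compare the result with the tenth (check) digit.
fn national_id_checksum_ok(code: &str) -> bool {
    let d = match digits(code) {
        Some(d) if d.len() == 10 => d,
        _ => return false,
    };
    let sum: u32 = d[..9]
        .iter()
        .enumerate()
        .map(|(i, &x)| x * (10 - i as u32))
        .sum();
    let r = sum % 11;
    let expected = if r < 2 { r } else { 11 - r };
    d[9] == expected
}

/// Luhn: from the right, double every second digit (subtracting 9 when the
/// result exceeds 9) and require the total to be divisible by 10.
fn luhn_ok(number: &str) -> bool {
    let d = match digits(number) {
        Some(d) if !d.is_empty() => d,
        _ => return false,
    };
    let sum: u32 = d
        .iter()
        .rev()
        .enumerate()
        .map(|(i, &x)| {
            if i % 2 == 1 {
                let y = x * 2;
                if y > 9 { y - 9 } else { y }
            } else {
                x
            }
        })
        .sum();
    sum % 10 == 0
}

/// IBAN mod-97: move the first four characters to the end, read letters as
/// 10..=35, and require the resulting number mod 97 to equal 1.
fn iban_mod97_ok(iban: &str) -> bool {
    if iban.len() != 26 || !iban.is_ascii() || !iban.starts_with("IR") {
        return false;
    }
    if !iban[2..].chars().all(|c| c.is_ascii_digit()) {
        return false;
    }
    let (head, tail) = iban.split_at(4);
    let mut rem: u32 = 0;
    for c in tail.chars().chain(head.chars()) {
        let v = match c.to_digit(36) {
            Some(v) => v,
            None => return false,
        };
        // Digits append one decimal place, letters append two.
        rem = if v < 10 { (rem * 10 + v) % 97 } else { (rem * 100 + v) % 97 };
    }
    rem == 1
}
```

Reducing modulo 97 after every character keeps the running value inside a `u32`, so no big-integer type is needed.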

=== Entity recognition (14 kinds)

Phone number, Jalali date (numeric + textual), money amount, national ID, bank card (Luhn-validated), IBAN (checksum-validated), postal code, car plate, time expression, mention, hashtag, URL, e-mail. Match results are post-validated against the issuing checksums to suppress false positives.

=== Persian numbers (matches Hazm + persian-tools)

  • to_words(1234) → "یک هزار و دویست و سی و چهار".
  • from_words("دو میلیون و پانصد هزار") → 2_500_000.
  • format(1_234_567) → "۱،۲۳۴،۵۶۷" (Persian thousand separators).
  • ordinal(3) → "سوم".
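
The digit-grouping part of `format` can be sketched in a few lines. The helper below is hypothetical, not the crate's function, and the separator is assumed to be U+060C, chosen to match the sample output in the list above.

```rust
/// Render `n` with Persian digits, inserting a separator every three digits.
fn format_persian(n: u64) -> String {
    let ascii = n.to_string();
    let len = ascii.len();
    let mut out = String::new();
    for (i, b) in ascii.bytes().enumerate() {
        // A separator goes before any digit that starts a full group of three.
        if i > 0 && (len - i) % 3 == 0 {
            out.push('\u{060C}');
        }
        let digit = (b - b'0') as u32;
        out.push(char::from_u32(0x06F0 + digit).expect("valid Persian digit"));
    }
    out
}
```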

=== Money parser

Structured parsing of mixed numeric / spelled-out monetary values: "دو میلیون و پانصد هزار تومان" → MoneyAmount { value: 2_500_000, unit: Toman }, with Toman ↔ Rial conversion.
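
The conversion itself rests on the fixed ratio of 1 Toman = 10 Rial. The sketch below uses hypothetical `Amount` and `Unit` types as local stand-ins; they are not the crate's `MoneyAmount`, whose exact fields and methods are not shown here.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Unit {
    Toman,
    Rial,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Amount {
    value: u64,
    unit: Unit,
}

impl Amount {
    /// Convert to Rial (1 Toman = 10 Rial); None if the value would overflow.
    fn to_rial(self) -> Option<Amount> {
        match self.unit {
            Unit::Rial => Some(self),
            Unit::Toman => self
                .value
                .checked_mul(10)
                .map(|value| Amount { value, unit: Unit::Rial }),
        }
    }

    /// Convert to Toman, dropping any remainder below 10 Rial.
    fn to_toman(self) -> Amount {
        match self.unit {
            Unit::Toman => self,
            Unit::Rial => Amount { value: self.value / 10, unit: Unit::Toman },
        }
    }
}
```

Integer division in `to_toman` is lossy for amounts that are not a multiple of 10 Rial; whether the real crate rounds, truncates, or uses a decimal type is not stated above.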

=== Light Persian stemmer

Lucene-style suffix-stripping for plurals, possessives, comparatives, and verb endings. "کتاب‌هایم" → "کتاب".
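
To make the idea of suffix stripping concrete, here is a deliberately tiny sketch. The suffix table is an illustrative assumption covering only a handful of endings, the function name is hypothetical, and the crate's real rule set is certainly larger.

```rust
const ZWNJ: char = '\u{200C}';

/// A few suffixes, longest first so that compound endings win.
const SUFFIXES: &[&str] = &[
    "\u{0647}\u{0627}\u{06CC}\u{0645}", // plural + first-person possessive
    "\u{062A}\u{0631}\u{06CC}\u{0646}", // superlative
    "\u{0647}\u{0627}\u{06CC}",         // plural + ezafe
    "\u{0647}\u{0627}",                 // plural
    "\u{062A}\u{0631}",                 // comparative
];

/// Strip the first matching suffix, then any ZWNJ left dangling at the end.
/// Stems shorter than two characters are rejected to limit over-stemming.
fn light_stem(word: &str) -> &str {
    for &suffix in SUFFIXES {
        if let Some(stem) = word.strip_suffix(suffix) {
            let stem = stem.trim_end_matches(ZWNJ);
            if stem.chars().count() >= 2 {
                return stem;
            }
        }
    }
    word
}
```

Returning a borrowed slice avoids allocation, which works because this kind of stemming only ever removes a tail of the input.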

=== ZWNJ insertion

Heuristic morphological glue (opt-in via config or one-shot helper): "میروم" → "می‌روم", "کتابها" → "کتاب‌ها".
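
A naive version of this heuristic, handling only the two patterns in the examples, might look like the sketch below. The function name and the length thresholds are assumptions. Without a lexicon it will also split ordinary words that merely end in the plural letters, which is presumably why the feature is opt-in.

```rust
const ZWNJ: char = '\u{200C}';
const MI: &str = "\u{0645}\u{06CC}"; // imperfective verb prefix
const HA: &str = "\u{0647}\u{0627}"; // plural suffix

/// Insert ZWNJ after a leading verb prefix and before a trailing plural suffix.
fn insert_zwnj(word: &str) -> String {
    // Require at least three characters after the prefix to skip short nouns.
    let with_prefix = match word.strip_prefix(MI) {
        Some(rest) if rest.chars().count() >= 3 && !rest.starts_with(ZWNJ) => {
            format!("{}{}{}", MI, ZWNJ, rest)
        }
        _ => word.to_string(),
    };
    if let Some(stem) = with_prefix.strip_suffix(HA) {
        if stem.chars().count() >= 2 && !stem.ends_with(ZWNJ) {
            return format!("{}{}{}", stem, ZWNJ, HA);
        }
    }
    with_prefix
}
```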

=== Persian → Latin transliteration

Character-level romanisation: "سلام" → "slam", "ایران" → "ayran". Useful for search-key generation.
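
Because the mapping is per character, short vowels that Persian script does not write never appear in the output, which is why "سلام" comes out as "slam". The sketch below shows the shape of such a table; it is a partial, hypothetical mapping (only the letters needed for the two examples plus a few more), not the crate's table.

```rust
/// Look up the Latin rendering of a single Persian letter, if we know one.
fn romanize_char(c: char) -> Option<&'static str> {
    Some(match c {
        '\u{0627}' => "a",  // alef
        '\u{0628}' => "b",  // beh
        '\u{062A}' => "t",  // teh
        '\u{062F}' => "d",  // dal
        '\u{0631}' => "r",  // reh
        '\u{0633}' => "s",  // seen
        '\u{0634}' => "sh", // sheen
        '\u{0644}' => "l",  // lam
        '\u{0645}' => "m",  // meem
        '\u{0646}' => "n",  // noon
        '\u{06A9}' => "k",  // keheh
        '\u{06CC}' => "y",  // Farsi yeh
        _ => return None,
    })
}

/// Romanise known letters and pass everything else through unchanged.
fn transliterate(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match romanize_char(c) {
            Some(latin) => out.push_str(latin),
            None => out.push(c),
        }
    }
    out
}
```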

=== Spell-check primitives

Levenshtein edit distance + dictionary-based suggestion engine. Bring your own word list; no built-in dictionary (licence-free Persian dictionaries are not bundled).
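
For readers unfamiliar with the technique, here is a compact sketch of Levenshtein distance and a suggestion function built on top of it. Both function names and the `max_distance` parameter are hypothetical; the crate's own signatures may differ.

```rust
/// Levenshtein distance over Unicode scalar values, using two rolling rows.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut cur = vec![0usize; b.len() + 1];
    for (i, &ca) in a.iter().enumerate() {
        cur[0] = i + 1;
        for (j, &cb) in b.iter().enumerate() {
            let cost = usize::from(ca != cb);
            cur[j + 1] = (prev[j] + cost) // substitution (or match)
                .min(prev[j + 1] + 1)     // deletion
                .min(cur[j] + 1);         // insertion
        }
        std::mem::swap(&mut prev, &mut cur);
    }
    prev[b.len()]
}

/// Return dictionary words within `max_distance` edits, closest first.
fn suggest<'a>(word: &str, dictionary: &[&'a str], max_distance: usize) -> Vec<&'a str> {
    let mut scored: Vec<(usize, &'a str)> = dictionary
        .iter()
        .map(|&w| (levenshtein(word, w), w))
        .filter(|&(d, _)| d <= max_distance)
        .collect();
    scored.sort_by_key(|&(d, _)| d);
    scored.into_iter().map(|(_, w)| w).collect()
}
```

Comparing `char`s rather than bytes matters for Persian, where every letter occupies two bytes in UTF-8 and a byte-level distance would double-count edits.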

=== Optional jalali integration

With the jalali Cargo feature, detected date entities are validated against the real Jalali calendar (rejects e.g. Esfand 30 in non-leap years) and Parsitext::parse_jalali_date returns a structured JalaliDate.

=== Text utilities

  • Sentence splitting on . ؟ ! ؛ with ZWNJ-awareness.
  • Tokenisation that keeps ZWNJ-joined compound words together.
  • Text statistics — word count, Persian ratio, digit count, sentence count, unique-token count.
  • Batch processing — parallel via Rayon (parallel feature, default on).
  • Language detection — is_persian / contains_persian.
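
Two of these utilities are easy to approximate with the standard library, as in the sketch below. The function names are hypothetical, "Persian" is approximated as the Unicode Arabic block (U+0600..U+06FF), and the splitter is naive: it would also break on the dot in a decimal number, which a real implementation presumably guards against.

```rust
/// Treat any character in the Arabic block as Persian script.
fn is_persian_char(c: char) -> bool {
    ('\u{0600}'..='\u{06FF}').contains(&c)
}

fn contains_persian(text: &str) -> bool {
    text.chars().any(is_persian_char)
}

/// Share of alphabetic characters that are Persian; 0.0 for text without letters.
fn persian_ratio(text: &str) -> f64 {
    let letters: Vec<char> = text.chars().filter(|c| c.is_alphabetic()).collect();
    if letters.is_empty() {
        return 0.0;
    }
    let persian = letters.iter().filter(|c| is_persian_char(**c)).count();
    persian as f64 / letters.len() as f64
}

/// Split after '.', '!', the Arabic question mark (U+061F) and the Arabic
/// semicolon (U+061B), keeping each terminator with its sentence.
fn split_sentences(text: &str) -> Vec<&str> {
    text.split_inclusive(|c: char| matches!(c, '.' | '!' | '\u{061F}' | '\u{061B}'))
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .collect()
}
```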

=== Optional features

[cols="1,1,3", options="header"]
|===
| Feature | Default | What it enables

| parallel | ✓ | Rayon-powered process_batch
| serde | | Serialize/Deserialize on all public output types
| jalali | | Validate Jalali dates and parse them into JalaliDate via the jalali-calendar crate
|===

== Quick start

[source,toml]
----
[dependencies]
parsitext = "0.1"
----

Or with optional features:

[source,toml]
----
parsitext = { version = "0.1", features = ["serde"] }
----

[source,rust]
----
use parsitext::{Parsitext, ParsitextConfig};

let pt = Parsitext::default();
let result = pt.process("سلام داداش، قيمتش حدود ١.٥ میلیون تومنه؟");

println!("{}", result.normalized);
for entity in &result.entities {
    println!("{entity}");
}
----

== Examples

Each runnable example lives in examples/:

[cols="1,2", options="header"]
|===
| Example | What it demonstrates

| basic_normalize | Orthography, digit, and ZWNJ normalisation.
| entity_detection | Detect phone, date, money, URL, and mention entities.
| batch_processing | Process texts in parallel with Rayon.
| custom_rules | Apply user-defined whole-word replacement rules.
| validators | National ID, IBAN, bank card, phone-operator validation, number↔words, money parsing, stemmer.
| showcase | Car plates, time expressions, ZWNJ insertion, transliteration, spell suggestions, money formatting, landline province detection, Jalali date parsing.
|===

Run one with the command below, replacing <name> with an example name from the table:

[source,bash]
----
cargo run --example <name> -p parsitext
----

== Benchmarks

[source,bash]
----
cargo bench -p parsitext
----

The benches/ directory contains normalizer.rs and pipeline.rs, measuring normalisation throughput and full pipeline performance across representative Persian text inputs.

== Documentation

[source,bash]
----
cargo doc -p parsitext --all-features
----

== License

Apache-2.0.