§parsitext
High-performance Persian (Farsi) text processing engine for Rust.
Built for Iranian production workloads — single-pass normalisation, nine entity-recognition patterns, ZWNJ-aware tokenisation, and Rayon-parallel batch processing.
§Quick start
use parsitext::{Parsitext, ParsitextConfig};
// Use defaults: orthography + digits + ZWNJ + entity detection.
let pt = Parsitext::default();
let result = pt.process("سلام داداش، قيمتش حدود ١.٥ میلیون تومنه؟");
// Arabic ي is normalised to Persian ی; Arabic-Indic ١ → Persian ۱.
assert!(result.normalized.contains('ی'));
assert!(result.normalized.contains('۱'));
// Entity recognised: MoneyAmount.
assert!(!result.entities.is_empty());
println!("{}", result.entities[0]);
§Normalisation pipeline
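As a rough illustration of what the orthography and digit stages in this pipeline do, the sketch below maps Arabic letter variants and non-Persian digits to their Persian forms in a single pass over the characters. It is a standalone example that uses only the standard library; `normalize_char` and `normalize` are hypothetical names chosen here, not the crate's API, and mapping ASCII digits to Persian digits is an assumption about the default digit target.

```rust
/// Hypothetical helper (not part of parsitext): map one character to its
/// Persian-normalised form.
fn normalize_char(c: char) -> char {
    match c {
        '\u{0643}' => '\u{06A9}', // Arabic kaf (ك) -> Persian keheh (ک)
        '\u{064A}' => '\u{06CC}', // Arabic yeh (ي) -> Persian yeh (ی)
        '\u{0629}' => '\u{0647}', // teh marbuta (ة) -> heh (ه)
        // Arabic-Indic digits U+0660..=U+0669 sit exactly 0x90 below the
        // Persian (Extended Arabic-Indic) digits U+06F0..=U+06F9.
        '\u{0660}'..='\u{0669}' => char::from_u32(c as u32 + 0x90).unwrap_or(c),
        // Assumption: ASCII digits are also unified to Persian digits.
        '0'..='9' => char::from_u32(c as u32 - '0' as u32 + 0x06F0).unwrap_or(c),
        _ => c,
    }
}

/// Single pass: every character is visited once and mapped independently.
fn normalize(text: &str) -> String {
    text.chars().map(normalize_char).collect()
}
```

Because each character is mapped independently, both substitutions happen in one traversal of the input rather than one traversal per rule, which is the general idea behind a single-pass normaliser.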
Parsitext::process(text)
  └─ Normalizer::normalize
       ├─ orthography::fix_arabic_chars    (Arabic ك ي ة → Persian ک ی ه)
       ├─ digits::{to_persian,to_latin}    (digit script unification)
       ├─ zwnj::normalize_zwnj             (strip misplaced U+200C)
       ├─ diacritics::remove_diacritics    (harakat, opt-in)
       ├─ spacing::reduce_repetitions      (خیییلی → خییلی)
       ├─ spacing::normalize_spaces        (whitespace collapse)
       ├─ SlangReplacer                    (goftari→neveshtar, opt-in)
       ├─ ProfanityFilter                  (*** replacement, opt-in)
       └─ CustomRules                      (user replacements, opt-in)
  └─ tokenizer::tokenize                   (whitespace + punctuation split)
  └─ EntityRecognizer::detect              (phone, date, money, …)
§Feature flags
| Feature | Default | Effect |
|---|---|---|
| parallel | ✓ | Rayon-powered Parsitext::process_batch |
| serde | | Serialize/Deserialize on all public types |
parsitext = { version = "0.1", features = ["serde"] }
Re-exports§
pub use config::CustomRule;
pub use config::DigitTarget;
pub use config::ParsitextConfig;
pub use config::ParsitextConfigBuilder;
pub use config::ProcessingMode;
pub use config::ProfanityLevel;
pub use entity::Entity;
pub use entity::EntityKind;
pub use entity::Span;
pub use money::MoneyAmount;
pub use money::MoneyUnit;
pub use stats::TextStats;
Modules§
- config
- Configuration types and builder for Parsitext.
- diacritics
- Arabic diacritics (harakat / تشکیل) removal.
- entity
- Structured-entity recognition for Iranian Persian text.
- finglish
- Finglish (Persian written in Latin script) → Persian conversion.
- jalali (feature: jalali)
- Optional integration with the jalali-calendar crate.
- money
- Structured parsing of Persian money expressions.
- numbers
- Persian number ↔ word conversion and formatting.
- phonetic
- Persian phonetic matching — a Soundex-style codec.
- sentence
- Sentence boundary detection for Persian text.
- spell
- Spell-checking primitives.
- spell_dict
- A bundled high-frequency Persian word list for crate::spell.
- stats
- Text statistics for Persian documents.
- stemmer
- Light Persian stemmer.
- style
- Persian text register conversions: formal ↔ chat ↔ GenZ.
- tantivy_analyzer (feature: tantivy)
- Tantivy tokenizer and stemmer for Persian text (gated by the tantivy Cargo feature).
- transliterate
- Persian → Latin transliteration (romanisation).
- validators
- Validators for Iranian identifiers and document numbers.
- zwnj_insert
- Heuristic ZWNJ insertion: the inverse of the normaliser's ZWNJ removal pass.
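To make the idea of heuristic ZWNJ insertion concrete, here is a minimal standalone sketch that attaches the verb prefixes می and نمی to the following word with U+200C instead of a space. The function name `attach_verb_prefixes` and the choice of rule are assumptions made for illustration; the actual zwnj_insert module may use different heuristics and a different API.

```rust
/// Zero-width non-joiner, the character Persian uses inside compound words.
const ZWNJ: char = '\u{200C}';

/// Hypothetical helper (not the crate's API): join the verb prefixes
/// "می" and "نمی" to the next word with ZWNJ.
/// Note: runs of whitespace are collapsed to a single space as a side effect
/// of `split_whitespace`.
fn attach_verb_prefixes(text: &str) -> String {
    const MI: &str = "\u{0645}\u{06CC}"; // می
    const NEMI: &str = "\u{0646}\u{0645}\u{06CC}"; // نمی

    let mut out = String::with_capacity(text.len());
    let mut words = text.split_whitespace().peekable();
    while let Some(word) = words.next() {
        out.push_str(word);
        // Only emit a separator when another word follows.
        if words.peek().is_some() {
            let is_prefix = word == MI || word == NEMI;
            out.push(if is_prefix { ZWNJ } else { ' ' });
        }
    }
    out
}
```

A rule this simple will misfire when می is used as a standalone noun, which is why such insertion is best treated as a heuristic rather than a guaranteed-correct transformation.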
Structs§
- Parsitext
- The main entry point for all Persian text processing.
- ProcessedText
- The output of Parsitext::process.
- ProcessingStats
- Per-document processing statistics attached to every ProcessedText.
Enums§
- Error
- Errors returned by this crate.