Crate parsitext

§parsitext

High-performance Persian (Farsi) text processing engine for Rust.

It targets production workloads on Iranian Persian text and provides single-pass normalisation, nine entity-recognition patterns, ZWNJ-aware tokenisation, and Rayon-parallel batch processing.

§Quick start

use parsitext::Parsitext;

// Use defaults: orthography + digits + ZWNJ + entity detection.
let pt = Parsitext::default();
let result = pt.process("سلام داداش، قيمتش حدود ١.٥ میلیون تومنه؟");

// Arabic ي is normalised to Persian ی; Arabic-Indic ١ → Persian ۱.
assert!(result.normalized.contains('ی'));
assert!(result.normalized.contains('۱'));

// Entity recognised: MoneyAmount.
assert!(!result.entities.is_empty());
println!("{}", result.entities[0]);

§Normalisation pipeline

Parsitext::process(text)
  └─ Normalizer::normalize
       ├─ orthography::fix_arabic_chars      (Arabic ك ي ة → Persian ک ی ه)
       ├─ digits::{to_persian,to_latin}      (digit script unification)
       ├─ zwnj::normalize_zwnj               (strip misplaced U+200C)
       ├─ diacritics::remove_diacritics      (harakat, opt-in)
       ├─ spacing::reduce_repetitions        (خیییلی → خییلی)
       ├─ spacing::normalize_spaces          (whitespace collapse)
       ├─ SlangReplacer                      (goftari→neveshtar, opt-in)
       ├─ ProfanityFilter                    (*** replacement, opt-in)
       └─ CustomRules                        (user replacements, opt-in)
  └─ tokenizer::tokenize                    (whitespace + punctuation split)
  └─ EntityRecognizer::detect               (phone, date, money, …)
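
The first stages of the pipeline are character-level rewrites, which a short dependency-free sketch can make concrete. The code below is not parsitext's implementation: every `*_sketch` name is hypothetical, and mapping ASCII digits to Persian as well as capping letter runs at two are assumptions made for illustration.

```rust
/// Map Arabic letters to their Persian counterparts (ك → ک, ي → ی, ة → ه).
fn fix_arabic_char_sketch(c: char) -> char {
    match c {
        '\u{0643}' => '\u{06A9}', // ك → ک
        '\u{064A}' => '\u{06CC}', // ي → ی
        '\u{0629}' => '\u{0647}', // ة → ه
        other => other,
    }
}

/// Map ASCII (0-9) and Arabic-Indic (U+0660..U+0669) digits to Persian
/// digits (U+06F0..U+06F9). Converting ASCII digits too is an assumption.
fn digit_to_persian_sketch(c: char) -> char {
    let offset = match c {
        '0'..='9' => c as u32 - '0' as u32,
        '\u{0660}'..='\u{0669}' => c as u32 - 0x0660,
        _ => return c,
    };
    char::from_u32(0x06F0 + offset).unwrap_or(c)
}

/// Collapse runs of the same letter longer than `max_run`.
/// Non-letters are left alone so that numbers such as 1000 stay intact.
fn reduce_repetitions_sketch(text: &str, max_run: usize) -> String {
    let mut out = String::with_capacity(text.len());
    let mut prev: Option<char> = None;
    let mut run = 0usize;
    for c in text.chars() {
        if Some(c) == prev {
            run += 1;
        } else {
            prev = Some(c);
            run = 1;
        }
        if run <= max_run || !c.is_alphabetic() {
            out.push(c);
        }
    }
    out
}

/// Chain the stages in one function: letters, digits, repetitions, spaces.
fn normalize_sketch(text: &str) -> String {
    let mapped: String = text
        .chars()
        .map(fix_arabic_char_sketch)
        .map(digit_to_persian_sketch)
        .collect();
    let reduced = reduce_repetitions_sketch(&mapped, 2);
    // Collapse any run of whitespace into a single space and trim the ends.
    reduced.split_whitespace().collect::<Vec<_>>().join(" ")
}

fn main() {
    let input = "قيمتش   حدود ١.٥ میلیون";
    println!("{}", normalize_sketch(input));
}
```

The sketch covers only the always-on character stages; the opt-in stages (diacritics, slang, profanity, custom rules) depend on data and configuration that this page does not show.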

§Feature flags

| Feature | Default | Effect |
|---|---|---|
| parallel | | Rayon-powered Parsitext::process_batch |
| serde | | Serialize/Deserialize on all public types |

parsitext = { version = "0.1", features = ["serde"] }
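
To give a feel for what parallel batch processing involves, here is a minimal sketch that splits a batch across worker threads and returns results in input order. It deliberately uses `std::thread::scope` rather than Rayon so that it runs without dependencies, and it is not the crate's code: `batch_sketch` and `process_one_sketch` are hypothetical names, and the real signature of `Parsitext::process_batch` is not shown on this page.

```rust
use std::thread;

/// Stand-in for per-document work: collapse whitespace (hypothetical helper).
fn process_one_sketch(text: &str) -> String {
    text.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Process `texts` on up to `workers` threads, preserving input order.
fn batch_sketch(texts: &[&str], workers: usize) -> Vec<String> {
    let workers = workers.max(1);
    // `chunks` panics on a size of zero, so clamp to at least one.
    let chunk_size = texts.len().div_ceil(workers).max(1);
    thread::scope(|s| {
        // Spawn every worker first, then join them in spawn order.
        let handles: Vec<_> = texts
            .chunks(chunk_size)
            .map(|chunk| {
                s.spawn(move || {
                    chunk
                        .iter()
                        .map(|t| process_one_sketch(t))
                        .collect::<Vec<String>>()
                })
            })
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker thread panicked"))
            .collect()
    })
}

fn main() {
    let docs = ["سلام   دنیا", "  متن  دوم ", "سوم"];
    for line in batch_sketch(&docs, 2) {
        println!("{line}");
    }
}
```

A work-stealing pool such as Rayon balances uneven documents better than the fixed chunks used here; the sketch only shows the order-preserving fan-out and join.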

Re-exports§

pub use config::CustomRule;
pub use config::DigitTarget;
pub use config::ParsitextConfig;
pub use config::ParsitextConfigBuilder;
pub use config::ProcessingMode;
pub use config::ProfanityLevel;
pub use entity::Entity;
pub use entity::EntityKind;
pub use entity::Span;
pub use money::MoneyAmount;
pub use money::MoneyUnit;
pub use stats::TextStats;

Modules§

config
Configuration types and builder for Parsitext.
diacritics
Arabic diacritics (harakat / تشکیل) removal.
entity
Structured-entity recognition for Iranian Persian text.
finglish
Finglish (Persian written in Latin script) → Persian conversion.
jalali (requires feature jalali)
Optional integration with the jalali-calendar crate.
money
Structured parsing of Persian money expressions.
numbers
Persian number ↔ word conversion and formatting.
phonetic
Persian phonetic matching — a Soundex-style codec.
sentence
Sentence boundary detection for Persian text.
spell
Spell-checking primitives.
spell_dict
A bundled high-frequency Persian word list for crate::spell.
stats
Text statistics for Persian documents.
stemmer
Light Persian stemmer.
style
Persian text register conversions: formal ↔ chat ↔ GenZ.
tantivy_analyzer (requires feature tantivy)
Tantivy tokenizer and stemmer for Persian text (gated by the tantivy Cargo feature).
transliterate
Persian → Latin transliteration (romanisation).
validators
Validators for Iranian identifiers and document numbers.
zwnj_insert
Heuristic ZWNJ insertion — the inverse of the normaliser’s ZWNJ removal pass.
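
Several of these modules, like the normaliser, revolve around the zero-width non-joiner (U+200C), which separates the parts of a single Persian word without introducing a space. As a rough illustration of what ZWNJ-aware tokenisation can mean, the sketch below splits on whitespace and punctuation, keeps a ZWNJ that sits inside a word, and drops one that dangles at a word edge. The names and the punctuation set are assumptions for illustration, not the crate's tokenizer.

```rust
const ZWNJ: char = '\u{200C}';

/// ASCII punctuation plus a few Persian marks: ، ؛ ؟ « » (assumed set).
fn is_punctuation_sketch(c: char) -> bool {
    c.is_ascii_punctuation()
        || matches!(c, '\u{060C}' | '\u{061B}' | '\u{061F}' | '\u{00AB}' | '\u{00BB}')
}

/// Push the pending word, trimming ZWNJs that dangle at either edge.
fn flush_sketch(current: &mut String, tokens: &mut Vec<String>) {
    let trimmed = current.trim_matches(ZWNJ);
    if !trimmed.is_empty() {
        tokens.push(trimmed.to_string());
    }
    current.clear();
}

/// Split on whitespace and punctuation; a ZWNJ inside a word stays put.
fn tokenize_sketch(text: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut current = String::new();
    for c in text.chars() {
        if c.is_whitespace() {
            flush_sketch(&mut current, &mut tokens);
        } else if is_punctuation_sketch(c) {
            flush_sketch(&mut current, &mut tokens);
            tokens.push(c.to_string());
        } else {
            // U+200C is not whitespace, so it accumulates with the word.
            current.push(c);
        }
    }
    flush_sketch(&mut current, &mut tokens);
    tokens
}

fn main() {
    // "می‌روم" contains a ZWNJ between "می" and "روم" and stays one token.
    for token in tokenize_sketch("من می\u{200C}روم، تو؟") {
        println!("{token}");
    }
}
```

Because every punctuation mark becomes its own token, a decimal such as ۱.۵ would be split in three by this sketch; a production tokenizer would need extra rules for numbers.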

Structs§

Parsitext
The main entry point for all Persian text processing.
ProcessedText
The output of Parsitext::process.
ProcessingStats
Per-document processing statistics attached to every ProcessedText.

Enums§

Error
Errors returned by this crate.