Crate harmorp

§harmorp

Indonesian stemmer implementing the Enhanced Confix-Stripping (ECS) variant of Nazief-Adriani (Asian et al., 2007).

§Enhancements over the original Nazief-Adriani algorithm

The original Nazief-Adriani (1996) strips one prefix and one suffix per pass and stops after a fixed number of rounds. This implementation adds four improvements:

  1. Iterative confix-stripping: up to four prefix+suffix passes per word, so deeply nested forms like mempertimbangkan (mem+per+timbang+kan) and pembelajaran (pem+bel+ajar+an) resolve correctly.

  2. Nasal-assimilation restoration: the me(N)- and pe(N)- families reconstruct the dropped consonant from the phonological context (e.g. menulis → restore dropped t → tulis; menyapu → restore dropped s → sapu).

  3. Phonotactic validity guards: candidate stems that begin with two consecutive consonants (a CC onset, which native Indonesian roots do not allow) are discarded before selection, preventing spurious over-stripping.

  4. Two-path candidate generation: each pass tries both orderings (prefix first then suffix, and suffix first then prefix) and ranks candidates with both affixes stripped above prefix-only ones, so in most cases the best candidate can be chosen without a dictionary.
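Items 2 and 3 can be illustrated with a small standalone sketch. It is not the crate's implementation: the function names `restore_nasal` and `has_valid_onset`, the rule table, and the exemption for the digraphs ng, ny, kh and sy are all assumptions made for illustration, and only the vowel-initial case of each nasal prefix is handled.

```rust
/// Returns true if `c` is one of the five Indonesian vowel letters.
fn is_vowel(c: char) -> bool {
    matches!(c, 'a' | 'e' | 'i' | 'o' | 'u')
}

/// Phonotactic guard (hypothetical): reject stems starting with two consonant letters.
/// Assumption: the digraphs ng, ny, kh, sy spell one consonant and are allowed.
fn has_valid_onset(stem: &str) -> bool {
    let mut chars = stem.chars();
    match (chars.next(), chars.next()) {
        (Some(a), Some(b)) if !is_vowel(a) && !is_vowel(b) => {
            matches!((a, b), ('n', 'g') | ('n', 'y') | ('k', 'h') | ('s', 'y'))
        }
        (Some(_), _) => true, // vowel-initial, CV-initial, or a single letter
        _ => false,           // empty string
    }
}

/// Nasal-assimilation restoration (hypothetical rule table).
/// Each rule pairs an assimilated prefix with the consonant it usually swallows.
/// A rule fires only when the remainder starts with a vowel.
fn restore_nasal(word: &str) -> Option<String> {
    const RULES: [(&str, &str); 8] = [
        ("meny", "s"), ("meng", "k"), ("mem", "p"), ("men", "t"),
        ("peny", "s"), ("peng", "k"), ("pem", "p"), ("pen", "t"),
    ];
    RULES.iter().find_map(|(prefix, dropped)| {
        let rest = word.strip_prefix(*prefix)?;
        if rest.starts_with(is_vowel) {
            Some(format!("{dropped}{rest}"))
        } else {
            None
        }
    })
}
```

The restored form is only a candidate: for a vowel-initial root such as ambil (mengambil), the unrestored remainder is the right answer, so a full stemmer would keep both forms and rank them.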

§Performance characteristics

  • Dictionary lookup: O(k) in the key length k, independent of dictionary size, via mmap-backed FST (fst 0.4)
  • Stem cache: O(1) expected via DashMap (sharded concurrent hashmap)
  • Hot path: zero heap allocation (SmallVec, &str slices)
  • Batch processing: sequential, with the GIL released in the PyO3 bindings
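The allocation-free hot path can be pictured with a sketch like the one below, in which every candidate stem is a `&str` slice borrowed from the input word. The suffix table and the name `suffix_candidates` are hypothetical, and a lazy iterator stands in for SmallVec so the example depends only on the standard library.

```rust
/// Suffixes tried by this sketch (a hypothetical subset, not the crate's table).
static SUFFIXES: [&str; 6] = ["kan", "an", "i", "lah", "kah", "nya"];

/// Yields each stem obtained by cutting one known suffix off `word`.
/// Every item is a slice of `word`, so no String is allocated.
fn suffix_candidates<'a>(word: &'a str) -> impl Iterator<Item = &'a str> + 'a {
    SUFFIXES
        .iter()
        .filter_map(move |suffix| word.strip_suffix(*suffix))
        // Drop remainders too short to be a plausible root (assumed threshold).
        .filter(|stem| stem.len() >= 2)
}
```

Because the iterator is lazy, a caller can stop at the first acceptable candidate without materialising the rest.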

Structs§

FstDict
FST-backed dictionary loaded from a binary file via mmap.
IndonesianStemmer
Thread-safe Indonesian stemmer with optional FST dictionary and result cache.
NullDict
Null dictionary — always returns false (no-dict mode).
StemmerConfig
Configuration for IndonesianStemmer.

Traits§

Dictionary
Abstraction over dictionary backends (FST via mmap, or HashSet for tests).
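A dictionary abstraction of this kind might look like the sketch below. The trait name `DictSketch`, its `contains` method, and the `pick` helper are hypothetical stand-ins chosen for illustration; the actual signatures of `Dictionary`, `NullDict` and `FstDict` are not shown on this page and may differ. The fallback to the top-ranked candidate when nothing matches is likewise an assumption.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a dictionary backend trait.
trait DictSketch {
    fn contains(&self, word: &str) -> bool;
}

/// No-dictionary mode: every lookup misses.
struct NoDict;

impl DictSketch for NoDict {
    fn contains(&self, _word: &str) -> bool {
        false
    }
}

/// In-memory backend, convenient for tests.
struct SetDict(HashSet<String>);

impl DictSketch for SetDict {
    fn contains(&self, word: &str) -> bool {
        self.0.contains(word)
    }
}

/// Returns the first candidate the dictionary accepts.
/// Assumption: if none is accepted, fall back to the first (best-ranked) candidate.
fn pick<'a>(dict: &dyn DictSketch, candidates: &[&'a str]) -> Option<&'a str> {
    candidates
        .iter()
        .copied()
        .find(|c| dict.contains(c))
        .or_else(|| candidates.first().copied())
}
```

Taking `&dyn DictSketch` keeps the selection logic identical in dictionary and no-dictionary modes; only the backend passed in changes.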