§harmorp
Indonesian stemmer implementing the Enhanced Confix-Stripping (ECS) variant of Nazief-Adriani (Asian et al., 2007).
§Enhancements over the original Nazief-Adriani algorithm
The original Nazief-Adriani (1996) strips one prefix and one suffix per pass and stops after a fixed number of rounds. This implementation adds four improvements:
- Iterative confix-stripping: up to four prefix+suffix passes per word, so deeply nested forms like mempertimbangkan (mem+per+timbang+kan) and pembelajaran (pe+bel+ajar+an) resolve correctly.
- Nasal-assimilation restoration: the me(N)- and pe(N)- families reconstruct the dropped consonant from the phonological context (e.g. menulis → restore dropped t → tulis; menyapu → restore dropped s → sapu).
- Phonotactic validity guards: candidate stems that begin with two consecutive consonants (a CC onset, which native Indonesian roots do not allow) are discarded before selection, preventing spurious over-stripping.
- Two-path candidate generation: each pass explores both prefix-first-then-suffix and suffix-first-then-prefix orderings and ranks combined (both stripped) candidates above prefix-only ones, so the best candidate is chosen without a dictionary in most cases.
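The nasal-assimilation and phonotactic-guard items can be illustrated with a minimal sketch. This is not the crate's code: the function names `is_vowel`, `has_cc_onset` and `men_candidates`, the rule table, and the digraph exemption (ng, ny, kh, sy treated as single consonants) are all assumptions made for illustration. For clarity it allocates a `String` per candidate, and it covers only the me(N)- family.

```rust
/// Returns true for the five Indonesian vowel letters.
fn is_vowel(c: char) -> bool {
    matches!(c, 'a' | 'i' | 'u' | 'e' | 'o')
}

/// Phonotactic guard: true when the stem starts with two consonant letters.
/// Assumption: the digraphs ng, ny, kh and sy spell one consonant and are exempt.
fn has_cc_onset(stem: &str) -> bool {
    let mut chars = stem.chars();
    match (chars.next(), chars.next()) {
        (Some(a), Some(b)) => {
            let digraph = matches!(
                (a, b),
                ('n', 'g') | ('n', 'y') | ('k', 'h') | ('s', 'y')
            );
            !is_vowel(a) && !is_vowel(b) && !digraph
        }
        _ => false,
    }
}

/// Generates stem candidates for a me(N)- word.
/// Each rule pairs a surface prefix with the consonants that may have been
/// dropped by assimilation; "" means nothing was dropped.
fn men_candidates(word: &str) -> Vec<String> {
    // Hypothetical rule table; longer nasal spellings are tried first.
    let rules: [(&str, &[&str]); 5] = [
        ("meny", &["s"]),
        ("meng", &["k", ""]),
        ("mem", &["p", ""]),
        ("men", &["t", ""]),
        ("me", &[""]),
    ];
    let mut out = Vec::new();
    for &(prefix, restored) in rules.iter() {
        if let Some(rest) = word.strip_prefix(prefix) {
            for r in restored {
                let candidate = format!("{}{}", r, rest);
                // Discard phonotactically invalid stems such as "pbaca".
                if !has_cc_onset(&candidate) {
                    out.push(candidate);
                }
            }
        }
    }
    out
}
```

In this sketch the guard does the contextual work: for membaca the restored form pbaca and the me- split mbaca are both rejected, leaving only baca. Words such as menulis still yield several candidates (tulis, ulis, nulis), which is where candidate ranking or a dictionary would decide.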
§Performance characteristics
- Dictionary lookup: time proportional to word length and independent of dictionary size, via mmap-backed FST (fst 0.4)
- Stem cache: O(1) via DashMap (sharded concurrent hashmap with per-shard locks)
- Hot path: zero heap allocation (SmallVec, &str slices)
- Batch processing: sequential, with the GIL released when called through PyO3
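The "zero heap allocation" point relies on returning sub-slices of the input rather than new strings. A minimal sketch of that idea, assuming a hypothetical helper `strip_suffix_once`, a three-entry suffix list and a minimum stem length of three bytes (none of which are taken from the crate):

```rust
/// Strips at most one suffix and returns a slice borrowed from `word`.
/// No `String` is created, so the call does not touch the heap.
fn strip_suffix_once(word: &str) -> &str {
    // Hypothetical suffix list, longest first.
    const SUFFIXES: [&str; 3] = ["kan", "an", "i"];
    for &suffix in SUFFIXES.iter() {
        if let Some(stem) = word.strip_suffix(suffix) {
            // Assumed minimum stem length, to avoid stripping very short words.
            if stem.len() >= 3 {
                return stem;
            }
        }
    }
    word
}
```

Because the result borrows from the argument, `strip_suffix_once("makanan")` yields `"makan"` as a view into the original buffer. Prefix restoration, which has to put a dropped consonant back, cannot be expressed as a pure slice and would need a small stack buffer (the role SmallVec presumably plays) to stay allocation-free.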
Structs§
- FstDict: FST-backed dictionary loaded from a binary file via mmap.
- IndonesianStemmer: Thread-safe Indonesian stemmer with optional FST dictionary and result cache.
- NullDict: Null dictionary that always returns false (no-dict mode).
- StemmerConfig: Configuration for IndonesianStemmer.
Traits§
- Dictionary: Abstraction over dictionary backends (FST via mmap, or HashSet for tests).
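As a rough picture of how such a dictionary abstraction can be shaped, here is a minimal sketch. The method name `contains`, the `SetDict` type and the `pick` helper are assumptions for illustration; the trait's real signature is not shown on this page, and the `NullDict` below is a stand-in written to match the description above.

```rust
use std::collections::HashSet;

/// Hypothetical shape of the trait; the method name is an assumption.
pub trait Dictionary {
    fn contains(&self, word: &str) -> bool;
}

/// No-dictionary mode: every lookup misses.
pub struct NullDict;

impl Dictionary for NullDict {
    fn contains(&self, _word: &str) -> bool {
        false
    }
}

/// In-memory backend of the kind that is convenient in tests.
pub struct SetDict(HashSet<String>);

impl SetDict {
    pub fn new(words: &[&str]) -> Self {
        SetDict(words.iter().map(|w| w.to_string()).collect())
    }
}

impl Dictionary for SetDict {
    fn contains(&self, word: &str) -> bool {
        self.0.contains(word)
    }
}

/// Picks the first candidate the dictionary confirms, falling back to the
/// first candidate overall when nothing is confirmed.
pub fn pick<'a>(candidates: &[&'a str], dict: &dyn Dictionary) -> Option<&'a str> {
    candidates
        .iter()
        .copied()
        .find(|c| dict.contains(c))
        .or_else(|| candidates.first().copied())
}
```

With a `SetDict` containing tulis, `pick(&["ulis", "tulis"], &dict)` selects tulis, while the same call with `NullDict` falls back to the first candidate, so ranking alone determines the result in no-dict mode.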