harmorp
An Indonesian stemmer implementing the Enhanced Confix-Stripping (ECS) variant of the Nazief-Adriani algorithm (Asian et al., 2007).
Enhancements over the original Nazief-Adriani
The original Nazief-Adriani (1996) applies one prefix and one suffix strip per pass. This implementation adds four improvements:
| Enhancement | Effect |
|---|---|
| Iterative confix-stripping (up to 4 passes) | Handles deeply nested forms: mempertimbangkan → timbang, pembelajaran → ajar |
| Nasal-assimilation restoration | Reconstructs dropped consonants: menulis → tulis (t), menyapu → sapu (s) |
| Phonotactic validity guards | Discards CC-onset candidates (invalid in Indonesian), preventing over-stripping |
| Two-path candidate generation | Explores both prefix-first and suffix-first orderings; ranks combined candidates higher for better no-dict accuracy |
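To make the nasal-assimilation row concrete, here is a minimal sketch of how a dropped consonant might be restored after a me(N)- prefix. The function name `men_candidates` and the four-entry rule table are assumptions made for illustration; they are not taken from harmorp's source and cover only the most common surface forms.

```rust
fn is_vowel(c: char) -> bool {
    matches!(c, 'a' | 'e' | 'i' | 'o' | 'u')
}

/// Strip a me(N)- prefix and return possible roots (illustrative rules only).
fn men_candidates(word: &str) -> Vec<String> {
    // (surface prefix, consonant it may have absorbed); longest prefixes first
    // so that "meny" is tried before "men".
    const RULES: [(&str, char); 4] =
        [("meny", 's'), ("meng", 'k'), ("mem", 'p'), ("men", 't')];
    let mut out = Vec::new();
    for &(prefix, dropped) in RULES.iter() {
        let rest = match word.strip_prefix(prefix) {
            Some(rest) if !rest.is_empty() => rest,
            _ => continue,
        };
        if rest.starts_with(is_vowel) {
            // A vowel after the nasal suggests a consonant was dropped.
            out.push(format!("{}{}", dropped, rest));
            // meng- also attaches directly to vowel-initial roots.
            if prefix == "meng" {
                out.push(rest.to_string());
            }
        } else {
            // Consonant-initial remainder: nothing to restore.
            out.push(rest.to_string());
        }
        break;
    }
    out
}
```

Under these rules `men_candidates("mengambil")` yields two candidates, `kambil` and `ambil`, which is the kind of ambiguity a dictionary lookup can settle.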
Additional features
- Thread-safe cache: O(1) repeated lookups via DashMap (sharded concurrent hashmap)
- FST dictionary: Optional root-word lookup via mmap-backed FST, in time proportional to word length and independent of dictionary size
- Zero heap allocation: Hot path uses SmallVec and &str slices
- Batch processing: Efficient multi-word stemming via stem_batch
- Python bindings: Optional PyO3 bindings (feature-gated)
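The caching idea can be sketched with the standard library alone. This is not harmorp's implementation: it swaps DashMap for a single `RwLock<HashMap>` so the example has no dependencies, and `CachedStemmer` and `stem_uncached` are hypothetical names standing in for the real types.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Hypothetical memoising wrapper; `F` stands in for the real stemming routine.
struct CachedStemmer<F: Fn(&str) -> String> {
    cache: RwLock<HashMap<String, String>>,
    stem_uncached: F,
}

impl<F: Fn(&str) -> String> CachedStemmer<F> {
    fn new(stem_uncached: F) -> Self {
        Self {
            cache: RwLock::new(HashMap::new()),
            stem_uncached,
        }
    }

    fn stem(&self, word: &str) -> String {
        // Fast path: shared read lock, no recomputation.
        if let Some(hit) = self.cache.read().unwrap().get(word) {
            return hit.clone();
        }
        // Slow path: compute without holding a lock, then publish the result.
        let stem = (self.stem_uncached)(word);
        self.cache
            .write()
            .unwrap()
            .insert(word.to_string(), stem.clone());
        stem
    }
}
```

A sharded map reduces contention by splitting the table into independently locked segments; the single lock here keeps the sketch short at the cost of serialising writers.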
Installation
[dependencies]
harmorp = "0.1.1"
Usage
Basic
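use harmorp::IndonesianStemmer;
let stemmer = IndonesianStemmer::new();
// Word/stem pairs are the ones listed in the enhancements table above;
// the `stem` method name is assumed by analogy with `stem_batch`.
assert_eq!(stemmer.stem("mempertimbangkan"), "timbang");
assert_eq!(stemmer.stem("pembelajaran"), "ajar");
assert_eq!(stemmer.stem("menulis"), "tulis");
assert_eq!(stemmer.stem("menyapu"), "sapu");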
Batch processing
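use harmorp::IndonesianStemmer;
let stemmer = IndonesianStemmer::new();
// Illustrative inputs chosen to match the output shown below.
let words = vec!["membaca", "menulis", "berjalan"];
let stems = stemmer.stem_batch(&words);
// ["baca", "tulis", "jalan"]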
With FST dictionary
An FST dictionary improves accuracy for ambiguous nasal-assimilation cases
(e.g. meng- + vowel-initial roots). Build one with the [fst] crate.
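use harmorp::IndonesianStemmer;
// "path/to/dictionary.fst" is a placeholder; substitute your own FST file.
let stemmer = IndonesianStemmer::with_fst("path/to/dictionary.fst");
// Illustrative meng- + vowel-initial case; assumes the dictionary contains "ambil".
assert_eq!(stemmer.stem("mengambil"), "ambil");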
If the file does not exist, the stemmer silently falls back to no-dictionary mode, so this is safe to use during development.
Python bindings
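# Module and class names are assumed to mirror the Rust API.
from harmorp import IndonesianStemmer
stemmer = IndonesianStemmer()
print(stemmer.stem("membaca"))  # baca
print(stemmer.stem_batch(["membaca", "menulis"]))  # ['baca', 'tulis']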
Algorithm
The ECS variant of Nazief-Adriani strips affixes iteratively (up to 4 passes):
1. -nya clitic: possessive/determiner (bukunya → buku)
2. Iterative confix-stripping: per pass, strip one prefix family + one derivational suffix
   - Prefix families: me(N)-, pe(N)-, ber-, ter-, se-, ke-, di-
   - Derivational suffixes (priority order): -kan > -an > -i
3. Inflectional suffix fallback: -lah, -kah, -tah, -pun (only when no prefix matched)
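The order of the steps above can be sketched as a small loop. This is a simplified illustration, not the crate's code: prefixes are matched as literal strings (no nasal restoration), `stem_sketch` and `strip_first` are hypothetical names, and the `MIN_ROOT` length check is an arbitrary stand-in for the dictionary and phonotactic checks a real stemmer would use.

```rust
const PREFIXES: [&str; 7] = ["mem", "per", "ber", "ter", "di", "ke", "se"];
const DERIVATIONAL: [&str; 3] = ["kan", "an", "i"]; // priority order
const INFLECTIONAL: [&str; 4] = ["lah", "kah", "tah", "pun"];
const MIN_ROOT: usize = 4; // assumed guard against over-stripping

/// Strip the first affix in `affixes` that matches and leaves a long enough root.
fn strip_first<'a>(word: &'a str, affixes: &[&str], suffix: bool) -> Option<&'a str> {
    affixes.iter().find_map(|a| {
        let rest = if suffix {
            word.strip_suffix(*a)
        } else {
            word.strip_prefix(*a)
        };
        rest.filter(|r| r.len() >= MIN_ROOT)
    })
}

fn stem_sketch(word: &str) -> &str {
    // Step 1: the -nya clitic.
    let mut current = strip_first(word, &["nya"], true).unwrap_or(word);
    let mut any_prefix = false;
    // Step 2: up to four passes, one prefix and one derivational suffix each.
    for _ in 0..4 {
        let before = current;
        if let Some(rest) = strip_first(current, &PREFIXES, false) {
            current = rest;
            any_prefix = true;
        }
        if let Some(rest) = strip_first(current, &DERIVATIONAL, true) {
            current = rest;
        }
        if current == before {
            break; // nothing stripped this pass, so stop early
        }
    }
    // Step 3: inflectional fallback, only when no prefix matched.
    if !any_prefix {
        current = strip_first(current, &INFLECTIONAL, true).unwrap_or(current);
    }
    current
}
```

With these tables, `mempertimbangkan` loses `mem-` and `-kan` in the first pass and `per-` in the second, leaving `timbang`.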
Phonotactic validity (no CC-onset) is enforced on every candidate to prevent over-stemming.
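A CC-onset guard of this kind might look like the sketch below. The helper names are hypothetical, and treating `ng`, `ny`, `kh` and `sy` as single consonants is an assumption about how the check should handle Indonesian digraphs, not a documented harmorp behaviour.

```rust
fn is_vowel(c: char) -> bool {
    matches!(c, 'a' | 'e' | 'i' | 'o' | 'u')
}

/// True when the candidate begins with two consonant sounds (a CC onset).
fn has_cc_onset(candidate: &str) -> bool {
    // Digraphs spell one consonant sound, so count them as a single unit.
    let digraph_len = ["ng", "ny", "kh", "sy"]
        .iter()
        .find(|d| candidate.starts_with(**d))
        .map_or(0, |d| d.len());
    let mut chars = candidate[digraph_len..].chars();
    if digraph_len > 0 {
        // Digraph followed by another consonant.
        chars.next().map_or(false, |c| !is_vowel(c))
    } else {
        match (chars.next(), chars.next()) {
            (Some(a), Some(b)) => !is_vowel(a) && !is_vowel(b),
            _ => false,
        }
    }
}

/// Keep only candidates whose onset passes the guard.
fn filter_candidates<'a>(candidates: &[&'a str]) -> Vec<&'a str> {
    candidates
        .iter()
        .copied()
        .filter(|c| !has_cc_onset(c))
        .collect()
}
```

For example, stripping `ke-` from `kerja` would leave `rja`, which the guard rejects, so the unstripped form survives.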
Without a dictionary, nasal-assimilation ambiguity is resolved by preferring the longer candidate. With a dictionary, the first candidate found in the FST wins.
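The two resolution rules can be written down in a few lines. In this sketch a `HashSet` stands in for the FST, `resolve` and `longest` are hypothetical names, and falling back to the longest candidate when the dictionary contains none of them is an assumption the text above does not specify.

```rust
use std::collections::HashSet;

/// Longest candidate; on a tie the earlier one is kept.
fn longest<'a>(candidates: &[&'a str]) -> Option<&'a str> {
    let mut best: Option<&'a str> = None;
    for &c in candidates {
        if best.map_or(true, |b| c.len() > b.len()) {
            best = Some(c);
        }
    }
    best
}

/// Pick a root from candidates listed in generation order.
fn resolve<'a>(candidates: &[&'a str], dict: Option<&HashSet<&str>>) -> Option<&'a str> {
    match dict {
        // With a dictionary: the first candidate that is a known root wins.
        // Assumption: if none is known, fall back to the no-dictionary rule.
        Some(roots) => candidates
            .iter()
            .copied()
            .find(|c| roots.contains(*c))
            .or_else(|| longest(candidates)),
        // Without a dictionary: prefer the longer candidate.
        None => longest(candidates),
    }
}
```

Given the candidates `kambil` and `ambil`, the no-dictionary rule picks `kambil`, while a dictionary containing `ambil` picks the correct root.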
Performance
Benchmarked with cargo bench --bench stemmer_bench:
| Scenario | Typical latency |
|---|---|
| Cache hit (warm) | ~50 ns |
| Single word (cold) | ~1–5 µs |
| 10 000-word batch (hot) | ~5 ms |
Documentation
- API Documentation - Complete API reference
- Algorithm Documentation - Detailed algorithm explanation
- Performance Documentation - Performance characteristics and benchmarks
- Benchmark Comparison - harmorp vs sastrawi performance comparison
License
MIT — see LICENSE.
Sponsor
If you find this project useful, consider sponsoring its development.