harmorp 0.1.2

Enhanced Nazief-Adriani Indonesian stemmer: iterative ECS, nasal-assimilation restoration, phonotactic guards, FST dictionary, zero-alloc hot path
Documentation

harmorp

CI crates.io docs.rs License: MIT

An Indonesian stemmer implementing the Enhanced Confix-Stripping (ECS) variant of the Nazief-Adriani algorithm (Asian et al., 2007).

Enhancements over the original Nazief-Adriani

The original Nazief-Adriani (1996) applies one prefix and one suffix strip per pass. This implementation adds four improvements:

Enhancement Effect
Iterative confix-stripping (up to 4 passes) Handles deeply nested forms: mempertimbangkantimbang, pembelajaranajar
Nasal-assimilation restoration Reconstructs dropped consonants: menulistulis (t), menyapusapu (s)
Phonotactic validity guards Discards CC-onset candidates (invalid in Indonesian), preventing over-stripping
Two-path candidate generation Explores both prefix-first and suffix-first orderings; ranks combined candidates higher for better no-dict accuracy

Additional features

  • Thread-safe cache: O(1) repeated lookups via DashMap (lock-free sharded hashmap)
  • FST dictionary: Optional O(1) amortised root-word lookup via mmap-backed FST
  • Zero heap allocation: Hot path uses SmallVec and &str slices
  • Batch processing: Efficient multi-word stemming via stem_batch
  • Python bindings: Optional PyO3 bindings (feature-gated)

Installation

[dependencies]
harmorp = "0.1.1"

Usage

Basic

use harmorp::IndonesianStemmer;

let stemmer = IndonesianStemmer::new();

assert_eq!(stemmer.stem("membaca"),      "baca");
assert_eq!(stemmer.stem("pembelajaran"), "ajar");
assert_eq!(stemmer.stem("pengembangan"), "kembang");
assert_eq!(stemmer.stem("memperbaiki"),  "baik");

Batch processing

use harmorp::IndonesianStemmer;

let stemmer = IndonesianStemmer::new();
let words = vec![
    "membaca".to_string(),
    "menulis".to_string(),
    "berjalan".to_string(),
];
let stems = stemmer.stem_batch(&words);
// ["baca", "tulis", "jalan"]

With FST dictionary

An FST dictionary improves accuracy for ambiguous nasal-assimilation cases (e.g. meng- + vowel-initial roots). Build one with the [fst] crate.

use harmorp::IndonesianStemmer;

let stemmer = IndonesianStemmer::with_fst("exports/dictionary.fst");
assert_eq!(stemmer.stem("mengambil"), "ambil");

If the file does not exist the stemmer silently falls back to no-dictionary mode, so this is safe to use during development.

Python bindings

cargo build --features python
import harmorp

stemmer = harmorp.Stemmer()
print(stemmer.stem("membaca"))        # baca
print(stemmer.stem_batch(["membaca", "menulis"]))  # ['baca', 'tulis']

Algorithm

The ECS variant of Nazief-Adriani strips affixes iteratively (up to 4 passes):

  1. -nya clitic — possessive/determiner (bukunyabuku)
  2. Iterative confix-stripping — per pass: strip one prefix family + one derivational suffix
    • Prefix families: me(N)-, pe(N)-, ber-, ter-, se-, ke-, di-
    • Derivational suffixes (priority order): -kan > -an > -i
  3. Inflectional suffix fallback-lah, -kah, -tah, -pun (only when no prefix matched)

Phonotactic validity (no CC-onset) is enforced on every candidate to prevent over-stemming.

Without a dictionary, nasal-assimilation ambiguity is resolved by preferring the longer candidate. With a dictionary, the first candidate found in the FST wins.

Performance

Benchmarked with cargo bench --bench stemmer_bench:

Scenario Typical latency
Cache hit (warm) ~50 ns
Single word (cold) ~1–5 µs
10 000-word batch (hot) ~5 ms

Documentation

License

MIT — see LICENSE.

Sponsor

If you find this project useful, consider sponsoring its development.