sastrawi-rs 0.5.0

High-performance Indonesian stemmer (Nazief-Adriani + ECS). Zero-regex, FST-powered, Rust 2024.

sastrawi-rs

High-performance Indonesian & Javanese stemmer for Rust — Zero-regex, zero-copy, FST-powered.


A Rust 2024 implementation of stemming algorithms for Bahasa Indonesia and Bahasa Jawa. The Indonesian stemmer is based on the Nazief-Adriani / Enhanced Confix Stripping (ECS) algorithm; the Javanese stemmer is a dedicated engine covering several speech levels (Ngoko, Krama Alus, Krama Inggil). The crate is a fork of iDevoid/rust-sastrawi, itself a Rust port of PHP Sastrawi by Andy Librian.

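Note: The crate is published as sastrawi-rs on crates.io but imported as sastrawi in Rust code. The library name differs from the package name, so the import path is not sastrawi_rs, which is what Cargo's default hyphen-to-underscore mapping would produce.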


Quick Start

Add to Cargo.toml:

[dependencies]
sastrawi-rs = "0.5"

This crate provides two independent stemmers that share the same zero-copy, FST-based architecture:

| Stemmer | Import Path | Dictionary | Language |
| --- | --- | --- | --- |
| Indonesian | sastrawi::{Dictionary, Stemmer} | ~26k root words | Bahasa Indonesia |
| Javanese | sastrawi::javanese::{JavaneseDictionary, JavaneseStemmer} | ~1.3k root words | Bahasa Jawa (Ngoko/Krama) |

They are completely independent — changing one has zero effect on the other.


🇮🇩 Indonesian Stemmer

What's New (vs. original rust-sastrawi)

| Feature | Old | New |
| --- | --- | --- |
| Engine | Regex-based rules | Zero-regex manual string slicing |
| Dictionary | HashMap on every call | FST (Finite State Transducer) with OnceLock |
| Allocation | Heap strings everywhere | Cow<'a, str> zero-copy API |
| Prefix rules | Basic me-/ber- | Full Nazief-Adriani: me-, pe-, ber-, ter-, se-, di-, ke-, ku-, kau- |
| menge-/penge- | ❌ | ✅ Monosyllabic base words (mengebom→bom) |
| nge- | ❌ | ✅ Informal/colloquial prefix (ngecat→cat) |
| Confix | ❌ | ✅ ke-an, per-an, ber-an, se-nya simultaneous strip |
| Loanword suffixes | ❌ | ✅ -isme, -isasi, -isir, -is |
| Hyphenated clitics | ❌ | ✅ kuasa-Mu, allah-lah, nikmat-Ku |
| Stopword filter | ❌ | ✅ stem_sentence_filtered + is_stopword |
| Backtracking | Partial | Full Longest-Root / Conservative Stemming |
| Edition | Rust 2018 | Rust 2024 |
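
The pairing of a lazily initialized dictionary with plain string slicing can be sketched with the standard library alone. This is only an illustration under stated assumptions: the crate stores its word list in an FST, whereas the HashSet, the four sample roots, and the names ROOTS, roots, is_root, and strip_known_prefix below are hypothetical and exist only for this example.

```rust
use std::collections::HashSet;
use std::sync::OnceLock;

// Hypothetical stand-in for the bundled dictionary (the crate itself uses an FST).
static ROOTS: OnceLock<HashSet<&'static str>> = OnceLock::new();

/// Builds the set on first use; every later call reuses the same instance.
fn roots() -> &'static HashSet<&'static str> {
    ROOTS.get_or_init(|| ["aman", "tani", "bangun", "bom"].iter().copied().collect())
}

fn is_root(word: &str) -> bool {
    roots().contains(word)
}

/// Tries a few plain prefixes by slicing, with no regex involved.
/// Returns a borrowed sub-slice of the input when the remainder is a known root.
fn strip_known_prefix(word: &str) -> Option<&str> {
    ["di", "ke", "se"]
        .iter()
        .filter_map(|prefix| word.strip_prefix(*prefix))
        .find(|rest| is_root(rest))
}
```

Calling strip_known_prefix("dibom") yields Some("bom") as a slice of the original string, so no new String is allocated for the result.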

Usage

use sastrawi::{Dictionary, Stemmer};

fn main() {
    let dict = Dictionary::new();
    let stemmer = Stemmer::new(&dict);

    let sentence = "Perekonomian Indonesia sedang dalam pertumbuhan yang membanggakan";
    for word in stemmer.stem_sentence(sentence) {
        print!("{} ", word); // ekonomi indonesia sedang dalam tumbuh yang bangga
    }
}

use sastrawi::{Dictionary, Stemmer};

let dict = Dictionary::new();
let stemmer = Stemmer::new(&dict);

assert_eq!(stemmer.stem_word("membangunkan").as_ref(), "bangun");
assert_eq!(stemmer.stem_word("keberuntunganmu").as_ref(), "untung");
assert_eq!(stemmer.stem_word("mengebom").as_ref(), "bom");       // menge-
assert_eq!(stemmer.stem_word("ngecat").as_ref(), "cat");          // nge- informal
assert_eq!(stemmer.stem_word("keamanan").as_ref(), "aman");       // ke-an confix
assert_eq!(stemmer.stem_word("pertanian").as_ref(), "tani");      // per-an confix
assert_eq!(stemmer.stem_word("idealisasi").as_ref(), "ideal");    // -isasi loanword
assert_eq!(stemmer.stem_word("kuasa-Mu").as_ref(), "kuasa");      // hyphen clitic

Stopword Filtering

Common function words (yang, di, dari, dalam, dengan, …) carry little semantic value for indexing or NLP analysis. stem_sentence_filtered drops them from the output:

use sastrawi::{Dictionary, Stemmer};

let dict = Dictionary::new();
let stemmer = Stemmer::new(&dict);

let sentence = "Perekonomian Indonesia sedang dalam pertumbuhan yang membanggakan";

// Without filter — all tokens included
let all: Vec<_> = stemmer.stem_sentence(sentence).collect();
// ["ekonomi", "indonesia", "sedang", "dalam", "tumbuh", "yang", "bangga"]

// With stopword filter — function words removed
let filtered: Vec<_> = stemmer.stem_sentence_filtered(sentence).collect();
// ["ekonomi", "indonesia", "tumbuh", "bangga"]

assert!(stemmer.is_stopword("yang"));     // true
assert!(!stemmer.is_stopword("ekonomi")); // false
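
The filtering stage on its own can be sketched as a lazy iterator adapter over whitespace tokens. The code below is std-only and does not call the crate; the six-word STOPWORDS list and the function content_words are assumptions for illustration (the crate keeps its stopwords in an FST), and is_stopword here is a free function rather than the stemmer method.

```rust
// Hypothetical, abbreviated stopword list for illustration only.
const STOPWORDS: &[&str] = &["yang", "di", "dari", "dalam", "dengan", "sedang"];

fn is_stopword(word: &str) -> bool {
    STOPWORDS.iter().any(|s| s.eq_ignore_ascii_case(word))
}

/// Yields borrowed tokens of `sentence`, skipping stopwords.
/// Nothing is collected or copied until the caller consumes the iterator.
fn content_words(sentence: &str) -> impl Iterator<Item = &str> + '_ {
    sentence.split_whitespace().filter(|tok| !is_stopword(tok))
}
```

Because the adapter borrows from the input sentence, a caller can chain it with further per-token work without intermediate vectors.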

Custom Dictionary

use sastrawi::{Dictionary, Stemmer};

let words = &["aman", "tani", "bangun", "bom"];
let dict = Dictionary::custom(words);
let stemmer = Stemmer::new(&dict);

assert_eq!(stemmer.stem_word("keamanan").as_ref(), "aman");

Indonesian API Reference

// Initialization
let dict = Dictionary::new();                   // bundled dictionary (~26k words)
let dict = Dictionary::custom(&["word", ...]); // custom word list
let stemmer = Stemmer::new(&dict);

// Stemming
stemmer.stem_word(word)               // → Cow<'_, str>  (zero-copy when unchanged)
stemmer.stem_sentence(sentence)       // → impl Iterator<Item = Cow<str>>
stemmer.stem_sentence_filtered(sent)  // → Iterator with stopwords removed

// Utilities
stemmer.is_stopword(word)             // → bool
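
The zero-copy behaviour noted for stem_word can be illustrated with a small std-only function. normalize is a hypothetical helper, not part of the crate; it shows the general Cow pattern of borrowing the input when nothing changes and allocating only when it must.

```rust
use std::borrow::Cow;

/// Lowercases `word`, borrowing the input when it is already lowercase.
fn normalize(word: &str) -> Cow<'_, str> {
    if word.chars().any(char::is_uppercase) {
        // Something has to change, so a new String is allocated.
        Cow::Owned(word.to_lowercase())
    } else {
        // Nothing to change: hand back the caller's own slice.
        Cow::Borrowed(word)
    }
}
```

A caller that only reads the result can treat both variants as &str through deref, which is why the assertions in this README compare against string literals.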

Indonesian Stemming Pipeline (Nazief-Adriani + ECS)

Input word
  │
  ├─ 0. Lowercase + hyphen-clitic strip  (kuasa-Mu → kuasa)
  ├─ 1. Dictionary lookup                → return if found
  ├─ 2. Remove Particle                  (-lah, -kah, -tah, -pun)
  ├─ 3. Remove Possessive                (-ku, -mu, -nya)
  ├─ 4. Remove Suffix + Prefix           (-kan/-an/-i + me-/pe-/ber-/ter-…)
  ├─ 5. ECS Confix                       (ke-an, per-an, ber-an simultaneously)
  ├─ 6. Prefix-only                      (Longest Root preference on original word)
  └─ 7. Pengembalian Akhir               (backtracking over suffix combinations)
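
Steps 1 to 4 of the pipeline above can be approximated in a few lines of std-only Rust. This is a deliberately reduced sketch, not the crate's implementation: it handles only plain prefixes (no nasal assimilation, no confix step, no backtracking), uses a slice as the dictionary, and every name in it (stem_sketch, strip_one, the constant tables) is hypothetical.

```rust
const PARTICLES: &[&str] = &["lah", "kah", "tah", "pun"];
const POSSESSIVES: &[&str] = &["nya", "ku", "mu"];
const DERIVATIONAL: &[&str] = &["kan", "an", "i"];
const PLAIN_PREFIXES: &[&str] = &["di", "ke", "se", "ber", "ter"];

/// Removes the first matching suffix, keeping at least two bytes of the word.
fn strip_one<'a>(word: &'a str, suffixes: &[&str]) -> &'a str {
    for suffix in suffixes {
        if let Some(rest) = word.strip_suffix(*suffix) {
            if rest.len() >= 2 {
                return rest;
            }
        }
    }
    word
}

fn stem_sketch(word: &str, dict: &[&str]) -> String {
    let lower = word.to_lowercase();
    let known = |w: &str| dict.iter().any(|d| *d == w);
    let mut current: &str = &lower;
    // Step 1: a word that is already a root is returned unchanged.
    if known(current) {
        return current.to_string();
    }
    // Steps 2-4: particle, possessive, derivational suffix, with a lookup after each.
    for group in [PARTICLES, POSSESSIVES, DERIVATIONAL].iter() {
        current = strip_one(current, group);
        if known(current) {
            return current.to_string();
        }
    }
    // Step 4 (prefix half): try plain prefixes on what is left.
    for prefix in PLAIN_PREFIXES.iter() {
        if let Some(rest) = current.strip_prefix(*prefix) {
            if known(rest) {
                return rest.to_string();
            }
        }
    }
    // No root found: fall back to the lowercased input.
    lower
}
```

The real pipeline adds the confix and backtracking stages precisely because a fixed strip order like this one can remove a suffix that actually belongs to the root.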

🫙 Javanese Stemmer (Bahasa Jawa) 🆕

v0.4.0 introduces a dedicated Universal Javanese stemmer based on adaptations of the Nazief-Adriani algorithm for Javanese [⁵][⁶]. It is fully isolated from the Indonesian stemmer — different dictionary, different pipeline, different module.

Usage

use sastrawi::javanese::{JavaneseDictionary, JavaneseStemmer};

// Uses the bundled Javanese dictionary (~1.3k pure root words)
let jv_dict = JavaneseDictionary::new();
let jv_stemmer = JavaneseStemmer::new(&jv_dict);

// Ngoko — Anuswara Meluluhkan
assert_eq!(jv_stemmer.stem_word("mangan").as_ref(), "pangan");    // m- + pangan
assert_eq!(jv_stemmer.stem_word("nulis").as_ref(), "tulis");       // n- + tulis

// Ngoko — Anuswara Menempel
assert_eq!(jv_stemmer.stem_word("ndawuhi").as_ref(), "dawuh");     // n- attaches to d
assert_eq!(jv_stemmer.stem_word("mbalang").as_ref(), "balang");    // m- attaches to b

// Krama Passives + Causatives
assert_eq!(jv_stemmer.stem_word("dipunjupuk").as_ref(), "jupuk");
assert_eq!(jv_stemmer.stem_word("lampahaken").as_ref(), "lampah");

// Circumfix Backtracking
assert_eq!(jv_stemmer.stem_word("nglebetaken").as_ref(), "lebet"); // ng-…-aken

// Monosyllabic root (nge-)
assert_eq!(jv_stemmer.stem_word("ngecet").as_ref(), "cet");

Custom Dictionary

use sastrawi::javanese::{JavaneseDictionary, JavaneseStemmer};

let roots = &["gawa", "jupuk", "tulis", "dawuh"];
let jv_dict = JavaneseDictionary::custom(roots);
let jv_stemmer = JavaneseStemmer::new(&jv_dict);

assert_eq!(jv_stemmer.stem_word("nggawa").as_ref(), "gawa");
assert_eq!(jv_stemmer.stem_word("ndawuhi").as_ref(), "dawuh");

Javanese Affix Rules Reference

All rules are derived from academic paper adaptations of Nazief-Adriani for Javanese [⁵][⁶][⁷].

1. Ater-ater Anuswara (Nasalization) [⁶]

The most complex aspect of Javanese morphology. Rules differ by whether the nasal replaces the initial consonant (Meluluhkan) or simply prepends to it (Menempel).

| Prefix | Meluluhkan (replaces) | Menempel (attaches to) | Vowel-initial |
| --- | --- | --- | --- |
| m- | p → m (macul←pacul), w → m (maca←waca) | b (mbalang←balang) | ✅ (munggah←unggah) |
| n- | t → n (nulis←tulis), th → n (nuthuk←thuthuk) | d, dh, j (ndawuh←dawuh, njupuk←jupuk) | ✅ |
| ng- | k → ng (ngirim←kirim) | g (ngguyu←guyu) | ✅ (ngombe←ombe) |
| ny- | s → ny (nyapu←sapu), c → ny (nyekel←cekel) | — | — |
| nge- | Monosyllabic roots (ngecet←cet, ngecat←cat) [special] | — | — |
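
The table can be read as a candidate generator: for a nasalized form, list every root the rules allow, then keep the first one found in the dictionary. The sketch below encodes the table rows directly; it is std-only, ignores suffixes, and the names ANUSWARA_RULES, anuswara_candidates, and resolve_anuswara are assumptions for this example rather than the crate's API.

```rust
// (surface nasal, consonants it may have replaced, consonants it attaches to,
//  whether a vowel-initial root is allowed). Longer nasals come first.
const ANUSWARA_RULES: [(&str, &str, &str, bool); 4] = [
    ("ng", "k", "g", true),
    ("ny", "s c", "", false),
    ("n", "t th", "dj", true),
    ("m", "p w", "b", true),
];

/// Candidate roots for a nasal-prefixed form, in the order they are tried.
fn anuswara_candidates(word: &str) -> Vec<String> {
    let mut out = Vec::new();
    // nge- marks a monosyllabic root: ngecet <- cet.
    if let Some(rest) = word.strip_prefix("nge") {
        out.push(rest.to_string());
    }
    for (nasal, replaced, attached, vowel_ok) in ANUSWARA_RULES.iter() {
        let rest = match word.strip_prefix(*nasal) {
            Some(r) => r,
            None => continue,
        };
        if rest.starts_with(|c: char| attached.contains(c)) {
            // Menempel: the nasal was simply prepended.
            out.push(rest.to_string());
        } else {
            // Meluluhkan: restore each consonant the nasal may have replaced.
            for consonant in replaced.split_whitespace() {
                out.push(format!("{consonant}{rest}"));
            }
            if *vowel_ok && rest.starts_with(|c: char| "aeiou".contains(c)) {
                out.push(rest.to_string());
            }
        }
        break; // only the first (longest) matching nasal applies
    }
    out
}

/// Picks the first candidate that exists in `dict`.
fn resolve_anuswara(word: &str, dict: &[&str]) -> Option<String> {
    anuswara_candidates(word)
        .into_iter()
        .find(|c| dict.iter().any(|d| *d == c.as_str()))
}
```

The dictionary lookup is what disambiguates: nulis produces tulis, thulis, and ulis as candidates, and only the word list can say which one is a real root.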

2. Ater-ater Tripurusa & General Prefixes

| Category | Prefixes |
| --- | --- |
| Krama Passive | dipun- |
| Tripurusa Ngoko | di-, dak-, tak-, kok-, ko- |
| Nominal Derivation [⁷] | pan-, pam-, pang- (allomorphs before labial/velar) |
| General | pa-, pi-, ka-, sa-, ma-, ke-, pra- |
| Formal/Literary [⁷] | kuma-, kapi-, we-, a-, ben- |
| Dialectal (Jawa Timur) [⁷] | tar-, tok- |

3. Panambang (Suffixes) & Allomorphs

| Category | Suffixes |
| --- | --- |
| Particles | -a, -i, -e, -en, -an, -na, -no (Dialect) |
| Causative | -ake (Ngoko), -aken (Krama) |
| Possessives | -ku, -mu, -ne, -ane (allomorph), -ipun (Krama) |
| Vowel Sandhi Allomorphs [⁶] | -kake, -kaken (gunakake←guna), -ni (larani←lara), -nan |
| Complex suffixes | -ana, -nan |

4. Circumfix Backtracking (Confiks)

The pipeline performs exhaustive suffix-then-prefix stripping, meaning all circumfix combinations are resolved automatically without hardcoded confix rules. Examples:

dipunlampahaken → lampah   (dipun- … -aken)
nggunakake      → guna     (ng- … -kake allomorph)
nglebetaken     → lebet    (ng- … -aken)
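
The idea of resolving circumfixes by trying every suffix against every prefix can be sketched as two nested loops with a dictionary check in the middle. The affix lists below are a small hypothetical subset chosen to cover the three examples, ng- is treated as a plain strippable prefix, and stem_exhaustive is not the crate's function.

```rust
// The empty string stands for "no suffix" / "no prefix".
const JV_SUFFIXES: [&str; 6] = ["", "kaken", "kake", "aken", "ake", "i"];
const JV_PREFIXES: [&str; 4] = ["", "dipun", "di", "ng"];

/// Tries every suffix x prefix combination and returns the first
/// remainder found in `dict`, borrowed from the input word.
fn stem_exhaustive<'a>(word: &'a str, dict: &[&str]) -> Option<&'a str> {
    for suffix in JV_SUFFIXES.iter() {
        let stem = match word.strip_suffix(*suffix) {
            Some(s) => s,
            None => continue,
        };
        for prefix in JV_PREFIXES.iter() {
            if let Some(root) = stem.strip_prefix(*prefix) {
                if dict.iter().any(|d| *d == root) {
                    return Some(root);
                }
            }
        }
    }
    None
}
```

No pair such as dipun-…-aken is listed anywhere in the code; the pairing falls out of the search, which is the property the paragraph above describes.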

Javanese API Reference

// Initialization
let jv_dict = JavaneseDictionary::new();                 // bundled (~1.3k root words)
let jv_dict = JavaneseDictionary::custom(&["w1", ...]); // custom list (for testing)
let jv_stemmer = JavaneseStemmer::new(&jv_dict);

// API surface identical to the Indonesian stemmer
jv_stemmer.stem_word(word)          // → Cow<'_, str>
jv_stemmer.stem_sentence(sentence)  // → impl Iterator<Item = Cow<str>>

Javanese Stemming Pipeline

Input word
  │
  ├─ 0. Lowercase + hyphen strip (ngisin-isini → isin via first segment)
  ├─ 1. Dictionary lookup        → return if found
  ├─ 2. Min length guard (< 3)   → return as-is
  ├─ 3. Exhaustive Suffix scan   → try ALL possessives × ALL particles
  │      └─ Dictionary check at each combination → return if found
  ├─ 4. Prefix removal on each suffix-stripped candidate
  │      └─ Anuswara (m-/n-/ng-/ny-/nge-) + Standard (di-/dipun-/pan-/…)
  └─ 5. Backtracking (Pengembalian Akhir) with known suffix combinations

Note on Infixes (Seselan): Javanese infixes -um-, -in-, -el-, -er- are intentionally not implemented in v0.4.0 as they require character-level mid-word insertion detection that conflicts with the zero-regex philosophy. Planned for v0.5.0 with an Aho-Corasick approach.


🔬 MorphAnalyzer: Dictionary-Free Morphological Analyzer

MorphAnalyzer is a zero-dependency, no-dictionary morphological analyzer for Indonesian. It detects affix patterns and returns candidate roots without validating them against any dictionary.

Honest caveat: Because there is no dictionary, MorphAnalyzer cannot resolve all morphophonemic mutations (e.g. men- + tulis = menulis, but stripping men- yields ulis not tulis without knowing the root starts with t). It is accurate for affix detection and candidate generation, but not as a standalone stemmer.

When to use MorphAnalyzer vs Stemmer

| Need | Use |
| --- | --- |
| Single validated root from a word | Stemmer (requires dictionary) |
| Does this word have any affix? | MorphAnalyzer |
| What prefix/suffix does this word have? | MorphAnalyzer |
| Generate candidate roots for autocomplete or admin UI | MorphAnalyzer |
| Validate that a game submission is morphologically plausible | MorphAnalyzer |
| ML feature: prefix/suffix signals | MorphAnalyzer |

Usage

use sastrawi::{MorphAnalyzer, MorphAnalysis};

let ma = MorphAnalyzer::new(); // Zero allocation — no dictionary loaded

// --- Affix detection ---
let r = ma.analyze("membangunkan");
assert!(r.has_affix);
assert_eq!(r.prefix.as_deref(), Some("me"));
assert_eq!(r.suffix.as_deref(), Some("kan"));
// candidate_roots may contain ["mbangun"] — final resolution needs Stemmer+dictionary

// --- Plain word (no affix) ---
let r = ma.analyze("buku");
assert!(!r.has_affix);
assert!(r.candidate_roots.is_empty());

// --- Confix (ke-an, per-an, ber-an) ---
let r = ma.analyze("keamanan");
assert_eq!(r.prefix.as_deref(), Some("ke"));
assert_eq!(r.suffix.as_deref(), Some("an"));
assert!(r.candidate_roots.contains(&"aman".to_string()));

// --- nge- informal prefix ---
let r = ma.analyze("ngecat");
assert_eq!(r.prefix.as_deref(), Some("nge"));
assert!(r.candidate_roots.contains(&"cat".to_string()));

// --- Possessive suffix ---
let r = ma.analyze("rumahnya");
assert_eq!(r.suffix.as_deref(), Some("nya"));
assert!(r.candidate_roots.contains(&"rumah".to_string()));

// --- Hyphen-clitic stripped before analysis ---
let r = ma.analyze("kuasa-Mu");
assert_eq!(r.word, "kuasa"); // clitic segment after hyphen is dropped

MorphAnalysis struct

pub struct MorphAnalysis {
    pub word: String,                  // normalized (lowercased, hyphen-stripped) input
    pub prefix: Option<String>,        // detected prefix ("me", "ber", "nge", "ke", …)
    pub suffix: Option<String>,        // detected suffix ("kan", "an", "lah", "ku", …)
    pub candidate_roots: Vec<String>,  // plausible roots (may be ambiguous — see caveats)
    pub has_affix: bool,               // true if prefix OR suffix detected
}

Detection pipeline (priority order)

Input
  │
  ├─ 0. Lowercase + hyphen-clitic strip
  ├─ 1. Guard: word < 4 chars → return as-is (no analysis)
  ├─ 2. CONFIX (ke-an, per-an, ber-an, me-kan)  ← highest precision, both sides locked
  ├─ 3. PREFIX (me-/ber-/ter-/pe-/nge-/di-/se-/ke-/ku-/kau-)
  ├─ 4. PARTICLE (-lah, -kah, -tah, -pun), then prefix on remainder
  ├─ 5. POSSESSIVE (-ku, -mu, -nya), then prefix on remainder
  └─ 6. DERIVATIONAL SUFFIX (-kan, -an, -isme, -isasi, -isir)
         Guard: skipped if result < 3 chars, or -i suffix on short words
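
Steps 1 and 2 of this pipeline, the length guard and the confix check where both sides must match, can be sketched without a dictionary. The function detect_confix and the table CONFIXES are hypothetical names for this example; the minimum candidate length of three bytes is likewise an assumption borrowed from the guard in step 6.

```rust
const CONFIXES: [(&str, &str); 4] =
    [("ke", "an"), ("per", "an"), ("ber", "an"), ("me", "kan")];

/// Returns (prefix, candidate root, suffix) when both halves of a confix match.
/// The candidate is not validated against any dictionary.
fn detect_confix(word: &str) -> Option<(&'static str, &str, &'static str)> {
    // Step 1 guard: very short words are not analysed.
    if word.chars().count() < 4 {
        return None;
    }
    CONFIXES.iter().find_map(|(prefix, suffix)| {
        let root = word.strip_prefix(*prefix)?.strip_suffix(*suffix)?;
        if root.len() >= 3 {
            Some((*prefix, root, *suffix))
        } else {
            None
        }
    })
}
```

For membangunkan this returns the candidate mbangun, the same unresolved form mentioned in the usage example, because restoring the root's first consonant needs a dictionary.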

Known limitations (by design)

| Limitation | Reason | Workaround |
| --- | --- | --- |
| menulis → root ulis, not tulis | The nasal of men- replaces the root's initial t; restoring it needs a dictionary | Use Stemmer for final root |
| mengirim → root irim, not kirim | The nasal of meng- replaces the root's initial k; restoring it needs a dictionary | Use Stemmer for final root |
| Ambiguous: mengada → ["ada", "ngada"] | Both are morphologically valid without a dictionary | Check candidates against a dictionary |
| -is / -i suffix not stripped aggressively | Avoids false positives on short words | Intentional guard |
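
The "check candidates against a dictionary" workaround can be sketched as follows. The code does not call MorphAnalyzer; it generates its own candidates from the standard meN- assimilation pattern (t after men-, k after meng-, p after mem-, s after meny-) and then filters them through a word list. men_candidates, pick_root, and MEN_RULES are hypothetical names, and the candidate set is not meant to match what MorphAnalyzer returns.

```rust
// (surface prefix, consonant the nasal may have replaced). Longer prefixes first.
const MEN_RULES: [(&str, &str); 4] =
    [("meny", "s"), ("meng", "k"), ("mem", "p"), ("men", "t")];

/// Returns the raw remainder plus the remainder with the lost consonant restored.
fn men_candidates(word: &str) -> Vec<String> {
    for (prefix, lost) in MEN_RULES.iter() {
        if let Some(rest) = word.strip_prefix(*prefix) {
            return vec![rest.to_string(), format!("{lost}{rest}")];
        }
    }
    Vec::new()
}

/// Keeps the first candidate that the dictionary confirms.
fn pick_root(word: &str, dict: &[&str]) -> Option<String> {
    men_candidates(word)
        .into_iter()
        .find(|c| dict.iter().any(|d| *d == c.as_str()))
}
```

Without the final lookup, ulis and tulis are equally plausible on the surface, which is the limitation the table records.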

MorphAnalyzer API

let ma = MorphAnalyzer::new();    // or MorphAnalyzer::default()
let r: MorphAnalysis = ma.analyze(word);  // works on any &str

🏗 Architecture

sastrawi-rs/
├── src/
│   ├── lib.rs                # Public API re-exports (both stemmers)
│   ├── stemmer.rs            # Indonesian: Nazief-Adriani pipeline + backtracking
│   ├── affixation.rs         # Indonesian: Prefix/suffix/confix orchestration
│   ├── affix_rules.rs        # Indonesian: Zero-regex morphological rules
│   ├── dictionary.rs         # Indonesian: FST-based dictionary (OnceLock)
│   ├── tokenizer.rs          # Shared zero-copy &str tokenizer
│   ├── stopword.rs           # Indonesian: FST stopword filter
│   └── javanese/
│       ├── mod.rs            # Javanese module re-exports
│       ├── stemmer.rs        # Javanese: Exhaustive suffix×prefix pipeline
│       ├── affixation.rs     # Javanese: Anuswara + Standard prefix orchestration
│       ├── affix_rules.rs    # Javanese: Meluluhkan/Menempel morphological rules
│       └── dictionary.rs     # Javanese: FST-based dictionary (OnceLock)
├── data/
│   ├── words.txt             # ~26k Indonesian root words (Kateglo, CC-BY-NC-SA 3.0)
│   ├── stopwords.txt         # Common Indonesian stopwords
│   └── javanese_words.txt    # ~1.3k Javanese root words (Riza et al. 2018, CC-BY 4.0)
├── build.rs                  # Compiles word lists → FST at build time
├── tests/test.rs             # Indonesian integration tests (6 suites, 200+ cases)
└── tests/javanese_test.rs    # Javanese integration tests (6 suites, 60+ cases)

📊 Performance

| Operation | Old (regex) | New (zero-regex FST) |
| --- | --- | --- |
| Dictionary lookup | HashMap on every call, O(n) in dictionary size | FST lookup, O(k) where k = key length |
| Prefix stripping | Regex compile + match | Direct string slice comparison |
| Memory | Regex DFA state machines | FST bytes + OnceLock |
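
The O(k) lookup bound can be made concrete with a plain trie, where a lookup follows one edge per character of the key regardless of how many words are stored. This is an analogy only: an FST additionally shares suffixes and is stored as compact bytes, and the TrieNode type below is a hypothetical teaching structure, not what the crate uses.

```rust
use std::collections::HashMap;

#[derive(Default)]
struct TrieNode {
    children: HashMap<char, TrieNode>,
    is_word: bool,
}

impl TrieNode {
    fn insert(&mut self, word: &str) {
        let mut node = self;
        for ch in word.chars() {
            node = node.children.entry(ch).or_default();
        }
        node.is_word = true;
    }

    /// Walks at most one edge per character of `word`.
    fn contains(&self, word: &str) -> bool {
        let mut node = self;
        for ch in word.chars() {
            match node.children.get(&ch) {
                Some(next) => node = next,
                None => return false,
            }
        }
        node.is_word
    }
}
```

Adding more words to the structure makes it larger but does not lengthen the walk for any individual key.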

🧪 Testing

cargo test --release
# 12 test suites total (6 Indonesian + 6 Javanese), 260+ word cases

Indonesian Suites

| Suite | Coverage |
| --- | --- |
| test_stem_word | 160+ Nazief-Adriani morphological cases |
| test_stem_sentence | Full sentence pipeline |
| test_nge_informal_prefix | ngecat, ngegas, ngelepas |
| test_ecs_confixes | keamanan, pertanian, berhadapan |
| test_loanword_suffixes | -isasi, -isir, -isme, -is |
| test_stopword_filter | stem_sentence_filtered, is_stopword |

Javanese Suites

| Suite | Coverage |
| --- | --- |
| test_javanese_anuswara_nasalization | Meluluhkan & Menempel for m-, n-, ng-, ny- |
| test_javanese_tripurusa_and_general_prefixes | di-, dipun-, dak-, ko-, ka-, kuma-, etc. |
| test_javanese_suffixes | -ake, -aken, -an, -ni, -ipun, -ne, -no, etc. |
| test_javanese_confixes_and_backtracking | dipunlampahaken, kepanasan, dituruake |
| test_javanese_complex_sandhi_and_circumfixes | Vowel allomorphs (nglarani, nggunakake, nglampahi) |
| test_javanese_new_affix_rules | pan-/pam-/pang-, nge-, -ane, kapi-, tar-/tok-, we- |

🆕 Indonesian Extensions (2020–2026 Research)

Based on recent Indonesian NLP research (ECS [¹], IndoMorph [²], Aksara v1.5 [³]):

A. nge- Informal Prefix

A colloquial prefix that mirrors menge-. It is common in Jakarta informal speech and on social media (see, e.g., MPStemmer [⁴]).

ngecat    → cat
ngegas    → gas
ngelamar  → lamar
ngelepas  → lepas

B. Confixes — ECS (Enhanced Confix Stripping)

Simultaneous prefix+suffix removal, reported to improve accuracy over plain Nazief-Adriani [¹].

keamanan    → aman    (ke-…-an)
pertanian   → tani    (per-…-an)
berhadapan  → hadap   (ber-…-an)

C. Superlative se-nya Particle

selengkapnya  → lengkap
seberhasilnya → hasil

D. Loanword Suffixes

idealisasi → ideal   (-isasi)
legalisir  → legal   (-isir)
idealisme  → ideal   (-isme)
idealis    → ideal   (-is)

📚 References & Credits

Indonesian Stemmer

  • [¹] ECS: Arifin, A., Mahendra, P., & Ciptaningtyas, H. T. (2009). Enhanced Confix Stripping Stemmer and Ants Algorithm for Classifying News Document in Indonesian Language.
  • [²] IndoMorph: Kamajaya, I., & Moeljadi, D. (2025). IndoMorph: a Morphology Engine for Indonesian. ACL Anthology.
  • [³] Aksara: Universitas Indonesia (2023). Aksara v1.5: Indonesian NLP tool conforming to UD v2 guidelines. GitHub.
  • [⁴] MPStemmer: Prabono, A. G. (2020). Mpstemmer: a multi-phase stemmer for standard and nonstandard Indonesian words. GitHub.
  • Algorithm: Nazief & Adriani (1996, 2007) — "Confix Stripping: Approach to Stemming Algorithm for Bahasa Indonesia"
  • PHP Sastrawi: Andy Librian — original PHP implementation
  • rust-sastrawi: iDevoid — original Rust port (2019)

Javanese Stemmer

  • [⁵] Javanese Nazief-Adriani: Stemming Javanese: Another Adaptation of the Nazief-Adriani affix rules (2020) — ISRITI. Neliti.
  • [⁶] Javanese ECS: Ngoko Javanese Stemmer using Enhanced Confix Stripping. ResearchGate.
  • [⁷] Complete Javanese Affix Taxonomy: Semantic Scholar (2021–2023) — A complete list of Javanese prefix/suffix rules including pan-/pam-/pang- nominal derivation, literary prefixes (kapi-, we-, a-), and dialectal forms (tar-, tok-). Semantic Scholar.
  • JV Dictionary: Riza, H. et al. (2018). Indonesian Javanese Dictionary Starter Kit. Mendeley Data (CC-BY 4.0).

General

  • sastrawi-rs: ibahasa Team — this modernized fork (2026)
  • ID Dictionary: Kateglo by Ivan Lanin (CC-BY-NC-SA 3.0)

📄 License

MIT — see LICENSE. Dictionary data: Kateglo (CC-BY-NC-SA 3.0) — non-commercial use only for the bundled Indonesian word list. Javanese dictionary: Riza et al. 2018 (CC-BY 4.0).