sastrawi-rs
High-performance Indonesian & Javanese stemmer for Rust: zero-regex, zero-copy, FST-powered.
A fully modernized Rust 2024 implementation of stemming algorithms for Bahasa Indonesia and Bahasa Jawa. Based on the Nazief-Adriani / Enhanced Confix Stripping (ECS) algorithm for Indonesian, and a dedicated multi-strata engine (Ngoko, Krama Alus, Krama Inggil) for Javanese. Fork of iDevoid/rust-sastrawi, itself a Rust port of PHP Sastrawi by Andy Librian.
Note: The crate is published as `sastrawi-rs` on crates.io but imported as `sastrawi` in Rust code (hyphens become underscores per Rust convention).
Quick Start
Add to Cargo.toml:
[dependencies]
sastrawi-rs = "0.5"
This crate provides two independent stemmers that share the same zero-copy, FST-based architecture:
| Stemmer | Import Path | Dictionary | Language |
|---|---|---|---|
| Indonesian | `sastrawi::{Dictionary, Stemmer}` | ~26k root words | Bahasa Indonesia |
| Javanese | `sastrawi::javanese::{JavaneseDictionary, JavaneseStemmer}` | ~1.3k root words | Bahasa Jawa (Ngoko/Krama) |
They are completely independent: changing one has no effect on the other.
Indonesian Stemmer
What's New (vs. original rust-sastrawi)
| Feature | Old | New |
|---|---|---|
| Engine | Regex-based rules | Zero-regex manual string slicing |
| Dictionary | HashMap on every call | FST (Finite State Transducer) with OnceLock |
| Allocation | Heap strings everywhere | Cow<'a, str> zero-copy API |
| Prefix rules | Basic me-/ber- | Full Nazief-Adriani: me-, pe-, ber-, ter-, se-, di-, ke-, ku-, kau- |
| `menge-`/`penge-` | ❌ | ✅ Monosyllabic base words (mengebom→bom) |
| `nge-` | ❌ | ✅ Informal/colloquial prefix (ngecat→cat) |
| Confix | ❌ | ✅ ke-an, per-an, ber-an, se-nya simultaneous strip |
| Loanword suffixes | ❌ | ✅ -isme, -isasi, -isir, -is |
| Hyphenated clitics | ❌ | ✅ kuasa-Mu, allah-lah, nikmat-Ku |
| Stopword filter | ❌ | ✅ `stem_sentence_filtered` + `is_stopword` |
| Backtracking | Partial | Full Longest-Root / Conservative Stemming |
| Edition | Rust 2018 | Rust 2024 |
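As a rough illustration of the zero-regex and `Cow<'a, str>` rows above, the sketch below strips a prefix by plain slicing and allocates only when lowercasing forces it. `strip_prefix_cow` is a hypothetical helper written for this README, not a function exported by the crate, and it ignores the morphophonemic rules the real stemmer applies.

```rust
use std::borrow::Cow;

/// Hypothetical helper (not part of sastrawi-rs): strips `prefix` from `word`
/// without regex, borrowing from the input whenever no lowercasing is needed.
fn strip_prefix_cow<'a>(word: &'a str, prefix: &str) -> Cow<'a, str> {
    if word.chars().any(char::is_uppercase) {
        // Lowercasing creates a new String, so the result must be owned.
        let lower = word.to_lowercase();
        if let Some(rest) = lower.strip_prefix(prefix) {
            return Cow::Owned(rest.to_string());
        }
        Cow::Owned(lower)
    } else {
        // Already lowercase: return a slice of the caller's buffer.
        Cow::Borrowed(word.strip_prefix(prefix).unwrap_or(word))
    }
}
```

Calling `strip_prefix_cow("bermain", "ber")` returns a borrowed slice of the input, while `strip_prefix_cow("Bermain", "ber")` has to allocate because of the case change.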
Usage
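use sastrawi::{Dictionary, Stemmer};
let dict = Dictionary::new();
let stemmer = Stemmer::new(&dict);
assert_eq!(stemmer.stem_word("mengebom"), "bom"); // menge-
assert_eq!(stemmer.stem_word("ngecat"), "cat"); // nge- informal
assert_eq!(stemmer.stem_word("keamanan"), "aman"); // ke-an confix
assert_eq!(stemmer.stem_word("pertanian"), "tani"); // per-an confix
assert_eq!(stemmer.stem_word("idealisasi"), "ideal"); // -isasi loanword
assert_eq!(stemmer.stem_word("kuasa-Mu"), "kuasa"); // hyphen clitic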
Stopword Filtering
Common function words (yang, di, dari, dalam, dengan, …) carry little semantic value for indexing or NLP analysis. `stem_sentence_filtered` removes them automatically:
use sastrawi::{Dictionary, Stemmer};
let dict = Dictionary::new();
let stemmer = Stemmer::new(&dict);
let sentence = "Perekonomian Indonesia sedang dalam pertumbuhan yang membanggakan";
// Without filter: all tokens included
let all: Vec<_> = stemmer.stem_sentence(sentence).collect();
// ["ekonomi", "indonesia", "sedang", "dalam", "tumbuh", "yang", "bangga"]
// With stopword filter: function words removed
let filtered: Vec<_> = stemmer.stem_sentence_filtered(sentence).collect();
// ["ekonomi", "indonesia", "tumbuh", "bangga"]
assert!(stemmer.is_stopword("yang")); // true
assert!(!stemmer.is_stopword("ekonomi")); // false
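The filtering step on its own can be pictured with the standalone sketch below. It is not the crate's implementation: `filter_stopwords` is a hypothetical function, the stopword list is passed in as a plain slice instead of the bundled FST, and no stemming is performed.

```rust
/// Hypothetical stand-in for a stopword filter: lowercases each
/// whitespace-separated token and drops the ones found in `stopwords`.
fn filter_stopwords(sentence: &str, stopwords: &[&str]) -> Vec<String> {
    sentence
        .split_whitespace()
        .map(str::to_lowercase)
        .filter(|tok| !stopwords.contains(&tok.as_str()))
        .collect()
}
```

With the stopwords `["sedang", "dalam", "yang"]`, the example sentence above keeps only `perekonomian`, `indonesia`, `pertumbuhan`, and `membanggakan` (still unstemmed in this sketch).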
Custom Dictionary
use ;
let words = &;
let dict = custom;
let stemmer = new;
assert_eq!;
Indonesian API Reference
// Initialization
let dict = Dictionary::new(); // bundled dictionary (~26k words)
let dict = Dictionary::custom(words); // custom word list
let stemmer = Stemmer::new(&dict);
// Stemming
stemmer.stem_word(word) // → Cow<'_, str> (zero-copy when unchanged)
stemmer.stem_sentence(sentence) // → impl Iterator<Item = Cow<str>>
stemmer.stem_sentence_filtered(sentence) // → Iterator with stopwords removed
// Utilities
stemmer.is_stopword(word) // → bool
Indonesian Stemming Pipeline (Nazief-Adriani + ECS)
Input word
│
├─ 0. Lowercase + hyphen-clitic strip (kuasa-Mu → kuasa)
├─ 1. Dictionary lookup → return if found
├─ 2. Remove Particle (-lah, -kah, -tah, -pun)
├─ 3. Remove Possessive (-ku, -mu, -nya)
├─ 4. Remove Suffix + Prefix (-kan/-an/-i + me-/pe-/ber-/ter-…)
├─ 5. ECS Confix (ke-an, per-an, ber-an simultaneously)
├─ 6. Prefix-only (Longest Root preference on original word)
└─ 7. Pengembalian Akhir (backtracking over suffix combinations)
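Steps 1 to 4 of the pipeline above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the crate's code: `strip_any` and `stem_sketch` are hypothetical names, a `HashSet` stands in for the FST dictionary, only a handful of plain prefixes are tried, and nasal prefixes, confixes, and backtracking are left out.

```rust
use std::collections::HashSet;

/// Hypothetical helper: removes the first matching suffix, keeping at least
/// two bytes of stem so a word is never stripped down to nothing.
fn strip_any<'a>(word: &'a str, suffixes: &[&str]) -> &'a str {
    for s in suffixes {
        if let Some(rest) = word.strip_suffix(s) {
            if rest.len() >= 2 {
                return rest;
            }
        }
    }
    word
}

/// Simplified sketch of steps 1 to 4 (assumes lowercase input).
fn stem_sketch<'a>(word: &'a str, dict: &HashSet<&str>) -> &'a str {
    // 1. Dictionary lookup: known roots are returned untouched.
    if dict.contains(word) {
        return word;
    }
    // 2. Particle.
    let w = strip_any(word, &["lah", "kah", "tah", "pun"]);
    if dict.contains(w) {
        return w;
    }
    // 3. Possessive.
    let w = strip_any(w, &["nya", "ku", "mu"]);
    if dict.contains(w) {
        return w;
    }
    // 4. Derivational suffix, then a few plain prefixes.
    let w = strip_any(w, &["kan", "an", "i"]);
    if dict.contains(w) {
        return w;
    }
    for p in ["ber", "ter", "di", "ke", "se"] {
        if let Some(rest) = w.strip_prefix(p) {
            if dict.contains(rest) {
                return rest;
            }
        }
    }
    // Nothing validated: fall back to the original word.
    word
}
```

The early dictionary check is what keeps a root such as `buku` from losing its final `-ku` as if it were a possessive.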
Javanese Stemmer (Bahasa Jawa)
v0.4.0 introduces a dedicated Universal Javanese stemmer based on adaptations of the Nazief-Adriani algorithm for Javanese [⁵][⁶]. It is fully isolated from the Indonesian stemmer: different dictionary, different pipeline, different module.
Usage
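use sastrawi::javanese::{JavaneseDictionary, JavaneseStemmer};
// Uses the bundled Javanese dictionary (~1.3k pure root words)
let jv_dict = JavaneseDictionary::new();
let jv_stemmer = JavaneseStemmer::new(&jv_dict);
// Ngoko: Anuswara Meluluhkan
assert_eq!(jv_stemmer.stem_word("mangan"), "pangan"); // m- + pangan
assert_eq!(jv_stemmer.stem_word("nulis"), "tulis"); // n- + tulis
// Ngoko: Anuswara Menempel
assert_eq!(jv_stemmer.stem_word("ndawuh"), "dawuh"); // n- attaches to d
assert_eq!(jv_stemmer.stem_word("mbalang"), "balang"); // m- attaches to b
// Krama Passives + Causatives
assert_eq!(jv_stemmer.stem_word("dipunlampahaken"), "lampah");
// Circumfix Backtracking
assert_eq!(jv_stemmer.stem_word("nglebetaken"), "lebet"); // ng-…-aken
// Monosyllabic root (nge-)
assert_eq!(jv_stemmer.stem_word("ngecet"), "cet");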
Custom Dictionary
use ;
let roots = &;
let jv_dict = custom;
let jv_stemmer = new;
assert_eq!;
assert_eq!;
Javanese Affix Rules Reference
All rules are derived from academic paper adaptations of Nazief-Adriani for Javanese [⁵][⁶][⁷].
1. Ater-ater Anuswara (Nasalization) [⁶]
The most complex aspect of Javanese morphology. Rules differ by whether the nasal replaces the initial consonant (Meluluhkan) or simply prepends to it (Menempel).
| Prefix | Meluluhkan (replaces) | Menempel (attaches to) | Vowel-initial |
|---|---|---|---|
| `m-` | p → m (macul→pacul), w → m (maca→waca) | b (mbalang→balang) | ✅ (munggah→unggah) |
| `n-` | t → n (nulis→tulis), th → n (nuthuk→thuthuk) | d, dh, j (ndawuh→dawuh, njupuk→jupuk) | ❌ |
| `ng-` | k → ng (ngirim→kirim) | g (ngguyu→guyu) | ✅ (ngombe→ombe) |
| `ny-` | s → ny (nyapu→sapu), c → ny (nyekel→cekel) | ❌ | ❌ |
| `nge-` | Monosyllabic roots (ngecet→cet, ngecat→cat) [special] | ❌ | ❌ |
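One way to read the table above is as a candidate generator: each nasal prefix yields a short list of possible roots, and the dictionary decides which one is real. The sketch below follows that reading. `nasal_candidates` and `resolve_nasal` are hypothetical names, the dictionary is a plain slice, and the crate's actual rule ordering may differ.

```rust
/// True when `s` begins with one of the five plain vowels.
fn starts_with_vowel(s: &str) -> bool {
    matches!(s.chars().next(), Some('a' | 'i' | 'u' | 'e' | 'o'))
}

/// Hypothetical helper: lists unvalidated root candidates for a
/// nasal-prefixed word, following the Meluluhkan/Menempel table.
fn nasal_candidates(word: &str) -> Vec<String> {
    let mut out = Vec::new();
    // Longer nasals are tested first so "ny"/"ng" are not read as "n".
    if let Some(rest) = word.strip_prefix("ny") {
        out.push(format!("s{rest}")); // Meluluhkan: s → ny
        out.push(format!("c{rest}")); // Meluluhkan: c → ny
    } else if let Some(rest) = word.strip_prefix("ng") {
        if let Some(mono) = rest.strip_prefix('e') {
            out.push(mono.to_string()); // nge- on a monosyllabic root
        }
        out.push(format!("k{rest}")); // Meluluhkan: k → ng
        if rest.starts_with('g') || starts_with_vowel(rest) {
            out.push(rest.to_string()); // Menempel on g, or vowel-initial root
        }
    } else if let Some(rest) = word.strip_prefix('m') {
        out.push(format!("p{rest}")); // Meluluhkan: p → m
        out.push(format!("w{rest}")); // Meluluhkan: w → m
        if rest.starts_with('b') || starts_with_vowel(rest) {
            out.push(rest.to_string()); // Menempel on b, or vowel-initial root
        }
    } else if let Some(rest) = word.strip_prefix('n') {
        out.push(format!("t{rest}")); // Meluluhkan: t → n
        out.push(format!("th{rest}")); // Meluluhkan: th → n
        if rest.starts_with('d') || rest.starts_with('j') {
            out.push(rest.to_string()); // Menempel on d, dh, j
        }
    }
    out
}

/// Returns the first candidate that the (toy) dictionary recognizes.
fn resolve_nasal(word: &str, dict: &[&str]) -> Option<String> {
    nasal_candidates(word)
        .into_iter()
        .find(|c| dict.contains(&c.as_str()))
}
```

Generating several candidates and validating them afterwards avoids having to guess which consonant the nasal replaced.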
2. Ater-ater Tripurusa & General Prefixes
| Category | Prefixes |
|---|---|
| Krama Passive | dipun- |
| Tripurusa Ngoko | di-, dak-, tak-, kok-, ko- |
| Nominal Derivation [⁷] | pan-, pam-, pang- (allomorphs before labial/velar) |
| General | pa-, pi-, ka-, sa-, ma-, ke-, pra- |
| Formal/Literary [⁷] | kuma-, kapi-, we-, a-, ben- |
| Dialectal (Jawa Timur) [⁷] | tar-, tok- |
3. Panambang (Suffixes) & Allomorphs
| Category | Suffixes |
|---|---|
| Particles | -a, -i, -e, -en, -an, -na, -no (Dialect) |
| Causative | -ake (Ngoko), -aken (Krama) |
| Possessives | -ku, -mu, -ne, -ane (allomorph), -ipun (Krama) |
| Vowel Sandhi Allomorphs [⁶] | -kake, -kaken (gunakake→guna), -ni (larani→lara), -nan |
| Complex suffixes | -ana, -nan |
4. Circumfix Backtracking (Confiks)
The pipeline performs exhaustive suffix-then-prefix stripping, meaning all circumfix combinations are resolved automatically without hardcoded confix rules. Examples:
dipunlampahaken → lampah (dipun- … -aken)
nggunakake → guna (ng- … -kake allomorph)
nglebetaken → lebet (ng- … -aken)
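The idea of resolving circumfixes by exhaustive suffix-then-prefix stripping can be sketched as two nested loops with a dictionary check at the innermost step. The code below is an illustration only: `strip_circumfix` is a hypothetical function, the affix lists are a small subset chosen to cover the three examples, and `ng-` is stripped literally without the full nasalization rules.

```rust
/// Hypothetical sketch: tries every suffix (including none), then every
/// prefix (including none), and returns the first root found in `dict`.
fn strip_circumfix<'a>(word: &'a str, dict: &[&str]) -> Option<&'a str> {
    // The empty string means "strip nothing on this side".
    const SUFFIXES: [&str; 5] = ["", "kaken", "kake", "aken", "ake"];
    const PREFIXES: [&str; 4] = ["", "dipun", "di", "ng"];
    for suf in SUFFIXES {
        let Some(stem) = word.strip_suffix(suf) else { continue };
        for pre in PREFIXES {
            if let Some(root) = stem.strip_prefix(pre) {
                if dict.contains(&root) {
                    return Some(root);
                }
            }
        }
    }
    None
}
```

Because every prefix is tried against every suffix-stripped stem, no pairing such as `dipun-` with `-aken` has to be listed explicitly.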
Javanese API Reference
// Initialization
let jv_dict = JavaneseDictionary::new(); // bundled (~1.3k root words)
let jv_dict = JavaneseDictionary::custom(roots); // custom list (for testing)
let jv_stemmer = JavaneseStemmer::new(&jv_dict);
// Identical API surface as Indonesian stemmer
jv_stemmer.stem_word(word) // → Cow<'_, str>
jv_stemmer.stem_sentence(sentence) // → impl Iterator<Item = Cow<str>>
Javanese Stemming Pipeline
Input word
│
├─ 0. Lowercase + hyphen strip (ngisin-isini → isin via first segment)
├─ 1. Dictionary lookup → return if found
├─ 2. Min length guard (< 3) → return as-is
├─ 3. Exhaustive Suffix scan → try ALL possessives × ALL particles
│  └─ Dictionary check at each combination → return if found
├─ 4. Prefix removal on each suffix-stripped candidate
│  └─ Anuswara (m-/n-/ng-/ny-/nge-) + Standard (di-/dipun-/pan-/…)
└─ 5. Backtracking (Pengembalian Akhir) with known suffix combinations
Note on Infixes (Seselan): Javanese infixes `-um-`, `-in-`, `-el-`, `-er-` are intentionally not implemented in v0.4.0 as they require character-level mid-word insertion detection that conflicts with the zero-regex philosophy. Planned for v0.5.0 with an Aho-Corasick approach.
MorphAnalyzer: Dictionary-Free Morphological Analyzer
MorphAnalyzer is a zero-dependency, no-dictionary morphological analyzer for Indonesian. It detects affix patterns and returns candidate roots without validating them against any dictionary.
Honest caveat: Because there is no dictionary, `MorphAnalyzer` cannot resolve all morphophonemic mutations (e.g. `men-` + `tulis` = `menulis`, but stripping `men-` yields `ulis`, not `tulis`, without knowing the root starts with `t`). It is accurate for affix detection and candidate generation, but not as a standalone stemmer.
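The caveat can be made concrete with a tiny dictionary-free sketch. `men_candidates` is a hypothetical function, not the `MorphAnalyzer` API: it strips `men-` and, when the remainder starts with a vowel, also proposes the form with a restored `t`, since nothing in the word itself says which one is the real root.

```rust
/// Hypothetical sketch: candidate roots for a `men-` word, produced
/// without any dictionary. An empty Vec means the prefix is absent.
fn men_candidates(word: &str) -> Vec<String> {
    match word.strip_prefix("men") {
        // A vowel after `men-` may hide a dropped `t` (menulis ← tulis).
        Some(rest) if matches!(rest.chars().next(), Some('a' | 'i' | 'u' | 'e' | 'o')) => {
            vec![rest.to_string(), format!("t{rest}")]
        }
        // A consonant after `men-` is kept as-is (mendengar ← dengar).
        Some(rest) => vec![rest.to_string()],
        None => Vec::new(),
    }
}
```

For `menulis` this yields both `ulis` and `tulis`; choosing between them is exactly the step that needs a dictionary-backed stemmer.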
When to use MorphAnalyzer vs Stemmer
| Need | Use |
|---|---|
| Single validated root from a word | Stemmer (requires dictionary) |
| Does this word have any affix? | MorphAnalyzer |
| What prefix/suffix does this word have? | MorphAnalyzer |
| Generate candidate roots for autocomplete or admin UI | MorphAnalyzer |
| Validate that a game submission is morphologically plausible | MorphAnalyzer |
| ML feature: prefix/suffix signals | MorphAnalyzer |
Usage
use ;
let ma = MorphAnalyzer::new(); // Zero allocation: no dictionary loaded
// --- Affix detection ---
let r = ma.analyze;
assert!;
assert_eq!;
assert_eq!;
// candidate_roots may contain ["mbangun"]; final resolution needs Stemmer + dictionary
// --- Plain word (no affix) ---
let r = ma.analyze;
assert!;
assert!;
// --- Confix (ke-an, per-an, ber-an) ---
let r = ma.analyze;
assert_eq!;
assert_eq!;
assert!;
// --- nge- informal prefix ---
let r = ma.analyze;
assert_eq!;
assert!;
// --- Possessive suffix ---
let r = ma.analyze;
assert_eq!;
assert!;
// --- Hyphen-clitic stripped before analysis ---
let r = ma.analyze;
assert_eq!; // clitic segment after hyphen is dropped
MorphAnalysis struct
Detection pipeline (priority order)
Input
│
├─ 0. Lowercase + hyphen-clitic strip
├─ 1. Guard: word < 4 chars → return as-is (no analysis)
├─ 2. CONFIX (ke-an, per-an, ber-an, me-kan): highest precision, both sides locked
├─ 3. PREFIX (me-/ber-/ter-/pe-/nge-/di-/se-/ke-/ku-/kau-)
├─ 4. PARTICLE (-lah, -kah, -tah, -pun), then prefix on remainder
├─ 5. POSSESSIVE (-ku, -mu, -nya), then prefix on remainder
└─ 6. DERIVATIONAL SUFFIX (-kan, -an, -isme, -isasi, -isir)
      Guard: skipped if result < 3 chars, or -i suffix on short words
Known limitations (by design)
| Limitation | Reason | Workaround |
|---|---|---|
| `menulis` → root `ulis`, not `tulis` | `men-` drops the nasal; restoring `t` needs a dictionary | Use `Stemmer` for final root |
| `mengirim` → root `irim`, not `kirim` | `meng-` drops `k`, which needs a dictionary to restore | Use `Stemmer` for final root |
| Ambiguous: `mengada` → `["ada", "ngada"]` | Both morphologically valid without a dictionary | Check candidates against a dictionary |
| `-is` / `-i` suffix not stripped aggressively | Avoids false positives on short words | Intentional guard |
MorphAnalyzer API
let ma = MorphAnalyzer::new(); // or MorphAnalyzer::default()
let r: MorphAnalysis = ma.analyze(word); // works on any &str
Architecture
sastrawi-rs/
├── src/
│   ├── lib.rs              # Public API re-exports (both stemmers)
│   ├── stemmer.rs          # Indonesian: Nazief-Adriani pipeline + backtracking
│   ├── affixation.rs       # Indonesian: Prefix/suffix/confix orchestration
│   ├── affix_rules.rs      # Indonesian: Zero-regex morphological rules
│   ├── dictionary.rs       # Indonesian: FST-based dictionary (OnceLock)
│   ├── tokenizer.rs        # Shared zero-copy &str tokenizer
│   ├── stopword.rs         # Indonesian: FST stopword filter
│   └── javanese/
│       ├── mod.rs          # Javanese module re-exports
│       ├── stemmer.rs      # Javanese: Exhaustive suffix×prefix pipeline
│       ├── affixation.rs   # Javanese: Anuswara + Standard prefix orchestration
│       ├── affix_rules.rs  # Javanese: Meluluhkan/Menempel morphological rules
│       └── dictionary.rs   # Javanese: FST-based dictionary (OnceLock)
├── data/
│   ├── words.txt           # ~26k Indonesian root words (Kateglo, CC-BY-NC-SA 3.0)
│   ├── stopwords.txt       # Common Indonesian stopwords
│   └── javanese_words.txt  # ~1.3k Javanese root words (Riza et al. 2018, CC-BY 4.0)
├── build.rs                # Compiles word lists → FST at build time
├── tests/test.rs           # Indonesian integration tests (6 suites, 200+ cases)
└── tests/javanese_test.rs  # Javanese integration tests (6 suites, 60+ cases)
Performance
| Operation | Old (regex) | New (zero-regex FST) |
|---|---|---|
| Dictionary lookup | O(n) HashMap | O(k) FST where k = key length |
| Prefix stripping | Regex compile + match | Direct string slice comparison |
| Memory | Regex DFA state machines | FST bytes + OnceLock |
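The "build once, look up many times" pattern behind the `OnceLock` entries can be sketched with the standard library alone. The example below is an assumption-laden stand-in: a sorted `Vec` with binary search replaces the FST (which needs an external crate), and `WORDS`, `dictionary`, and `is_root` are hypothetical names with a four-word toy list.

```rust
use std::sync::OnceLock;

/// Process-wide word list, initialized on first use and never rebuilt.
static WORDS: OnceLock<Vec<&'static str>> = OnceLock::new();

/// Returns the shared, sorted word list (toy data for illustration).
fn dictionary() -> &'static [&'static str] {
    WORDS.get_or_init(|| {
        let mut words = vec!["tulis", "baca", "main", "aman"];
        words.sort_unstable(); // binary search requires sorted input
        words
    })
}

/// Looks a word up without allocating or re-initializing anything.
fn is_root(word: &str) -> bool {
    dictionary().binary_search(&word).is_ok()
}
```

Every call after the first reuses the same backing storage, which is the property the table contrasts with rebuilding a lookup structure per call.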
Testing
# 12 test suites total (6 Indonesian + 6 Javanese), 260+ word cases
Indonesian Suites
| Suite | Coverage |
|---|---|
| `test_stem_word` | 160+ Nazief-Adriani morphological cases |
| `test_stem_sentence` | Full sentence pipeline |
| `test_nge_informal_prefix` | ngecat, ngegas, ngelepas |
| `test_ecs_confixes` | keamanan, pertanian, berhadapan |
| `test_loanword_suffixes` | -isasi, -isir, -isme, -is |
| `test_stopword_filter` | `stem_sentence_filtered`, `is_stopword` |
Javanese Suites
| Suite | Coverage |
|---|---|
| `test_javanese_anuswara_nasalization` | Meluluhkan & Menempel for m-, n-, ng-, ny- |
| `test_javanese_tripurusa_and_general_prefixes` | di-, dipun-, dak-, ko-, ka-, kuma-, etc. |
| `test_javanese_suffixes` | -ake, -aken, -an, -ni, -ipun, -ne, -no, etc. |
| `test_javanese_confixes_and_backtracking` | dipunlampahaken, kepanasan, dituruake |
| `test_javanese_complex_sandhi_and_circumfixes` | Vowel allomorphs (nglarani, nggunakake, nglampahi) |
| `test_javanese_new_affix_rules` | pan-/pam-/pang-, nge-, -ane, kapi-, tar-/tok-, we- |
Indonesian Extensions (2020–2026 Research)
Based on recent Indonesian NLP research (ECS [¹], IndoMorph [²], Aksara v1.5 [³]):
A. nge- Informal Prefix
Colloquial prefix, mirror of menge-. Common in Jakarta informal speech and social media (e.g. MPStemmer [⁴]).
ngecat → cat
ngegas → gas
ngelamar → lamar
ngelepas → lepas
B. Confixes ā ECS (Enhanced Confix Stripping)
Simultaneous prefix+suffix removal, proven to outperform plain Nazief-Adriani in accuracy [¹].
keamanan → aman (ke-…-an)
pertanian → tani (per-…-an)
berhadapan → hadap (ber-…-an)
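Simultaneous stripping means the prefix and suffix are removed as a pair and accepted only if the result is a known root. A minimal sketch of that check follows. `strip_confix` is a hypothetical function, the confix list covers only the three pairs shown here, and a slice stands in for the dictionary.

```rust
/// Hypothetical sketch of confix stripping: a (prefix, suffix) pair is
/// removed together, and the result counts only if `dict` contains it.
fn strip_confix<'a>(word: &'a str, dict: &[&str]) -> Option<&'a str> {
    const CONFIXES: [(&str, &str); 3] = [("ke", "an"), ("per", "an"), ("ber", "an")];
    for (pre, suf) in CONFIXES {
        // Both sides must match before the candidate is even considered.
        let candidate = word.strip_prefix(pre).and_then(|s| s.strip_suffix(suf));
        if let Some(root) = candidate {
            if dict.contains(&root) {
                return Some(root);
            }
        }
    }
    None
}
```

Requiring both halves at once is what keeps a word like `kebun` from being treated as `ke-` plus a root.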
C. Superlative se-nya Particle
selengkapnya → lengkap
seberhasilnya → hasil
D. Loanword Suffixes
idealisasi → ideal (-isasi)
legalisir → legal (-isir)
idealisme → ideal (-isme)
idealis → ideal (-is)
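Loanword suffixes overlap (`-is` is a tail of `-isasi`'s first syllable pattern and of many native words), so a sketch of this rule needs longest-first ordering and a length guard. In the code below, `strip_loan_suffix` is a hypothetical function and the four-character minimum root length is an assumed threshold for illustration, not a value taken from the crate.

```rust
/// Hypothetical sketch: strips one loanword suffix, longest first,
/// unless the remaining root would be shorter than four characters.
fn strip_loan_suffix(word: &str) -> &str {
    // Ordered longest-first so "-isasi" wins over the shorter "-is".
    const SUFFIXES: [&str; 4] = ["isasi", "isme", "isir", "is"];
    for suf in SUFFIXES {
        if let Some(root) = word.strip_suffix(suf) {
            // Assumed guard: protects short native words such as "tulis".
            if root.chars().count() >= 4 {
                return root;
            }
        }
    }
    word
}
```

With this guard, `idealis` becomes `ideal` while `tulis` and `manis` are left untouched.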
References & Credits
Indonesian Stemmer
- [¹] ECS: Arifin, A., Mahendra, P., & Ciptaningtyas, H. T. (2009). Enhanced Confix Stripping Stemmer and Ants Algorithm for Classifying News Document in Indonesian Language.
- [²] IndoMorph: Kamajaya, I., & Moeljadi, D. (2025). IndoMorph: a Morphology Engine for Indonesian. ACL Anthology.
- [³] Aksara: Universitas Indonesia (2023). Aksara v1.5: Indonesian NLP tool conforming to UD v2 guidelines. GitHub.
- [⁴] MPStemmer: Prabono, A. G. (2020). Mpstemmer: a multi-phase stemmer for standard and nonstandard Indonesian words. GitHub.
- Algorithm: Nazief & Adriani (1996, 2007), "Confix Stripping: Approach to Stemming Algorithm for Bahasa Indonesia"
- PHP Sastrawi: Andy Librian, original PHP implementation
- rust-sastrawi: iDevoid, original Rust port (2019)
Javanese Stemmer
- [⁵] Javanese Nazief-Adriani: Stemming Javanese: Another Adaptation of the Nazief-Adriani affix rules (2020). ISRITI. Neliti.
- [⁶] Javanese ECS: Ngoko Javanese Stemmer using Enhanced Confix Stripping. ResearchGate.
- [⁷] Complete Javanese Affix Taxonomy (2021–2023): a complete list of Javanese prefix/suffix rules including pan-/pam-/pang- nominal derivation, literary prefixes (kapi-, we-, a-), and dialectal forms (tar-, tok-). Semantic Scholar.
- JV Dictionary: Riza, H. et al. (2018). Indonesian Javanese Dictionary Starter Kit. Mendeley Data (CC-BY 4.0).
General
- sastrawi-rs: ibahasa Team, this modernized fork (2026)
- ID Dictionary: Kateglo by Ivan Lanin (CC-BY-NC-SA 3.0)
License
MIT: see LICENSE. Dictionary data: Kateglo (CC-BY-NC-SA 3.0), non-commercial use only for the bundled Indonesian word list. Javanese dictionary: Riza et al. 2018 (CC-BY 4.0).