kham
Thai word segmentation engine written in Rust. Fast, no_std-compatible core library with bindings for Python, WebAssembly, C, a command-line interface, and database extensions for PostgreSQL and SQLite.
Website & live demo: kham.io
Features
- newmm algorithm — DAG-based maximal matching constrained to Thai Character Cluster (TCC) boundaries
- Compound-first DP scoring — minimises token count before maximising dictionary matches, then uses TNC frequency as tiebreaker; F1 1.000 on 228 curated test cases; 94.9% sentence-level agreement with PyThaiNLP newmm
- Zero-copy API — `segment()` returns `&str` slices into the original input; no heap allocation per token
- `no_std` core — `kham-core` compiles for bare-metal targets (`alloc` only)
- Built-in dictionary — 62,102-word CC0-licensed Thai word list embedded at compile time; `dict_merge()` overlay adds custom words without a full trie rebuild
- Thai FTS pipeline — `FtsTokenizer` adds stopword filtering, POS tagging, NER, RTGS romanization, phonetic soundex, abbreviation expansion, and OOV n-gram fallback
- Named entity recognition — gazetteer-based NER (~36,600 entries): provinces, countries, Wikipedia places/orgs, person and family names
- Part-of-speech tagging — 13-category lookup table (~9,000 entries)
- Phonetic encoding — lk82, udom83, MetaSound, and Thai–English cross-language Soundex
- Confidence scoring — `Token::confidence: f32` on every token; `0.0` for Unknown, `1.0` for unambiguous dict match; intermediate values from TNC frequency and boundary ambiguity
- Streaming iterator — `Tokenizer::segment_stream(text)` returns a `TokenStream` with `next_word()`, `next_known()`, and `next_above_confidence(f32)` for lazy, filtered iteration
- Spell correction — `SpellChecker::suggestions(word, n)`: Levenshtein ≤ 2 over the built-in dictionary, re-ranked by lk82 phonetic similarity and TNC frequency; `did_you_mean(word)` returns the single best correction; `correct_text(text)` corrects an entire passage
- Keyword extraction — `KeyExtractor::extract(text, n)`: TF × inverse-corpus-frequency scoring; `extract_phrases(text, n)` adds bigram and trigram keyphrases; stopwords and single-char tokens excluded
- RTGS romanization — table lookup (415 entries) with rule-based fallback for OOV Thai words; `romanize_or_rule()` per-token; `romanize_sentence(text)` for a whole passage
- Number normalization — Thai digits ↔ ASCII, spelled-out number words ↔ integer, Thai Baht currency text
- Abbreviation expansion — 118-entry built-in TSV (months, era markers, ranks, agencies)
- Date parsing — 7 input formats, Buddhist Era and Gregorian, round-trips to ISO 8601 and Thai text
- Sentence segmentation — Thai terminators, Paiyannoi, punctuation, with decimal/abbreviation-aware dot rules
- Multi-target — Rust crate, Python wheel, WASM module, C shared library, CLI binary, PostgreSQL FTS parser, SQLite FTS5 tokenizer
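The "minimise token count first, then prefer dictionary matches" idea from the list above can be illustrated with a small dynamic program. This is only a sketch under loud assumptions: `segment_min_tokens` is a hypothetical name, the dictionary is a plain slice, candidate boundaries are Unicode scalar values rather than TCC clusters, and the TNC frequency tiebreaker is left out. It is not kham's implementation.

```rust
/// Hypothetical sketch, not kham's API: segment `text` with the fewest tokens,
/// breaking ties in favour of fewer out-of-dictionary tokens.
/// Assumption: any single char may stand alone as an unknown token.
pub fn segment_min_tokens<'a>(text: &'a str, dict: &[&str]) -> Vec<&'a str> {
    // Byte offset of every char boundary, including the end of the string.
    let bounds: Vec<usize> = text
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(text.len()))
        .collect();
    let n = bounds.len() - 1;

    // best[j] = (token_count, unknown_count, previous_boundary) for the prefix ending at j.
    let mut best: Vec<Option<(usize, usize, usize)>> = vec![None; n + 1];
    best[0] = Some((0, 0, 0));

    for i in 0..n {
        let Some((tokens, unknown, _)) = best[i] else { continue };
        for j in i + 1..=n {
            let piece = &text[bounds[i]..bounds[j]];
            let in_dict = dict.iter().any(|w| *w == piece);
            // Multi-char pieces must be dictionary words.
            if !in_dict && j != i + 1 {
                continue;
            }
            let cand = (tokens + 1, unknown + usize::from(!in_dict), i);
            // Lexicographic compare: token count first, unknown count second.
            if best[j].map_or(true, |b| (cand.0, cand.1) < (b.0, b.1)) {
                best[j] = Some(cand);
            }
        }
    }

    // Walk the back-pointers from the end to recover the chosen path.
    let mut out = Vec::new();
    let mut j = n;
    while j > 0 {
        let (_, _, i) = best[j].expect("every boundary is reachable via single chars");
        out.push(&text[bounds[i]..bounds[j]]);
        j = i;
    }
    out.reverse();
    out
}
```

With a dictionary containing both กิน, ข้าว and the compound กินข้าว, the three-token path wins over the four-token one, which is the compound-first behaviour described above.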
Packages
| Crate | Registry | Docs | Description |
|---|---|---|---|
| kham-core | crates.io | (this file) | Pure Rust engine, no_std compatible |
| kham-cli | crates.io | (this file) | kham binary |
| kham-python | PyPI | kham-python/README.md | Python bindings via PyO3 / maturin |
| kham-wasm | npm | kham-wasm/README.md | WebAssembly bindings via wasm-bindgen |
| kham-capi | crates.io | kham-capi/README.md | C FFI with cbindgen-generated header |
| kham-pg | PGXN | kham-pg/README.md | PostgreSQL text search parser for Thai |
| kham-sqlite | — | kham-sqlite/README.md | SQLite FTS5 tokenizer for Thai |
Quick start
Rust
[dependencies]
kham-core = "0.8"
use Tokenizer;
let tok = new;
let tokens = tok.segment;
for t in &tokens
// กินข้าว (Thai)
// กับ (Thai)
// ปลา (Thai)
Mixed script works out of the box:
let tokens = tok.segment;
assert_eq!; // Thai
assert_eq!; // Number
assert_eq!; // Thai
CLI
# Confidence scores
# Filter by confidence threshold
# Structured output
# Romanize Thai to RTGS Latin
# Spell check a word
# Keyword extraction
# FTS pipeline — kind, POS, NE, stopword, synonyms (one token per line)
# ทักษิณ kind=Person pos=- ne=Person stop=false syn=-
# เดิน kind=Thai pos=Verb ne=- stop=false syn=-
# ทาง kind=Thai pos=Noun ne=- stop=true syn=-
# ไป kind=Thai pos=Verb ne=- stop=true syn=-
# กรุงเทพ kind=Place pos=- ne=Place stop=false syn=-
# FTS + phonetic encoding — syn= shows the lk82 code
|
# กินข้าว kind=Thai pos=- ne=- stop=false syn=1619
# กับ kind=Thai pos=Conj ne=- stop=true syn=1400
# ปลา kind=Thai pos=Noun ne=- stop=false syn=4800
| RUST_LOG=debug
Other targets
| Target | Quick link |
|---|---|
| Python | kham-python/README.md |
| JavaScript / TypeScript (WASM) | kham-wasm/README.md |
| C | kham-capi/README.md |
| PostgreSQL FTS | kham-pg/README.md |
| SQLite FTS5 | kham-sqlite/README.md |
Token contract
- `span` — byte offsets; slice with `&input[token.span.clone()]`
- `char_span` — Unicode scalar-value offsets for Python/JavaScript indexing
- `confidence` — `0.0` for Unknown tokens; `1.0` for unambiguous single-path dict matches; intermediate values reflect TNC frequency weight and competing-edge count from the newmm DP pass
- Joining all `token.text` values (whitespace kept) reconstructs the original input exactly
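The difference between byte offsets and scalar-value offsets matters for Thai because each Thai character occupies three bytes in UTF-8. The following sketch shows how one could derive a character span from a byte span using only the standard library; `byte_to_char_span` is a hypothetical helper for illustration, not part of kham.

```rust
use std::ops::Range;

/// Hypothetical helper, not kham's API: convert a byte range into a range of
/// Unicode scalar-value offsets. Panics if the byte range does not fall on
/// char boundaries, exactly as direct string slicing would.
pub fn byte_to_char_span(input: &str, bytes: Range<usize>) -> Range<usize> {
    // Count the scalar values that precede the span, then those inside it.
    let start = input[..bytes.start].chars().count();
    let len = input[bytes.start..bytes.end].chars().count();
    start..start + len
}
```

For the input กินข้าวกับปลา, the token กับ sits at bytes 21..30 but at characters 7..10, which is the value a Python or JavaScript caller would index with.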
TokenStream
segment_stream returns a lazy iterator that avoids collecting into a Vec:
use Tokenizer;
let tok = new;
let mut stream = tok.segment_stream;
// skip whitespace
while let Some = stream.next_word
// skip whitespace + Unknown tokens
let mut stream = tok.segment_stream;
while let Some = stream.next_known
// filter by confidence threshold
let mut stream = tok.segment_stream;
while let Some = stream.next_above_confidence
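To make the lazy, filtered iteration concrete, here is a minimal stand-alone sketch of a stream wrapper that skips low-confidence tokens on demand. Everything in it is hypothetical (`SketchToken`, `SketchStream`, `collect_above`), and the inclusive `>=` comparison is an assumption; kham's own `TokenStream` may compare differently.

```rust
/// Hypothetical token type for illustration only.
pub struct SketchToken<'a> {
    pub text: &'a str,
    pub confidence: f32,
}

/// Hypothetical wrapper that pulls tokens from an inner iterator one at a time.
pub struct SketchStream<I> {
    inner: I,
}

impl<'a, I: Iterator<Item = SketchToken<'a>>> SketchStream<I> {
    pub fn new(inner: I) -> Self {
        Self { inner }
    }

    /// Advance until a token meets the threshold; nothing is collected up front.
    /// Assumption: the threshold is inclusive.
    pub fn next_above_confidence(&mut self, min: f32) -> Option<SketchToken<'a>> {
        self.inner.find(|t| t.confidence >= min)
    }
}

/// Usage helper: drain a stream built from (text, confidence) pairs.
pub fn collect_above<'a>(pairs: &[(&'a str, f32)], min: f32) -> Vec<&'a str> {
    let tokens = pairs
        .iter()
        .map(|&(text, confidence)| SketchToken { text, confidence });
    let mut stream = SketchStream::new(tokens);
    let mut out = Vec::new();
    while let Some(t) = stream.next_above_confidence(min) {
        out.push(t.text);
    }
    out
}
```

The point of the design is that filtering happens inside `next_*`, so a caller that stops early never pays for the remaining tokens.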
Full-Text Search
FtsTokenizer wraps the segmenter with the full NLP pipeline:
use FtsTokenizer;
let fts = new;
let tokens = fts.segment_for_fts;
for t in &tokens
// ทักษิณ ne=Some(Person) pos=None stop=false
// เดิน ne=None pos=Verb stop=false
// ทาง ne=None pos=None stop=true
// ไป ne=None pos=Verb stop=true
// กรุงเทพ ne=Some(Place) pos=None stop=false ← merged from กรุง+เทพ
// Flat lexeme list for tsvector (stopwords removed)
let lexemes = fts.lexemes;
// → ["กินข้าว", "ปลา"]
Builder options:
use FtsTokenizer;
use AbbrevMap;
use SynonymMap;
use StopwordSet;
use RomanizationMap;
use SoundexAlgorithm;
let fts = builder
.abbrevs // ก.ค. → กรกฎาคม before segmentation
.synonyms
.stopwords
.romanization // adds RTGS to synonyms: กิน → "kin"
.soundex // adds lk82 code to synonyms for Thai/Named tokens
.ngram_size // trigrams for Unknown tokens (0 = disable)
.number_normalize // Thai digits → ASCII synonym (default: true)
.build;
FtsToken fields: text, position, kind, is_stop, synonyms, trigrams, pos, ne.
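The `trigrams` field holds the n-gram fallback for Unknown tokens. As a rough illustration, the sketch below produces overlapping n-grams over Unicode scalar values. The function name `char_ngrams` is hypothetical, and whether kham slices on scalar values or on TCC clusters is not stated here, so treat the unit of slicing as an assumption.

```rust
/// Hypothetical sketch, not kham's API: overlapping n-grams over Unicode
/// scalar values, returned as slices into `text`. `n == 0` disables output,
/// mirroring the builder option described above.
pub fn char_ngrams(text: &str, n: usize) -> Vec<&str> {
    if n == 0 {
        return Vec::new();
    }
    // Byte offset of every char boundary, including the end of the string.
    let bounds: Vec<usize> = text
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(text.len()))
        .collect();
    // Each window of n + 1 boundaries delimits exactly n chars.
    bounds.windows(n + 1).map(|w| &text[w[0]..w[n]]).collect()
}
```

Inputs shorter than `n` yield no n-grams, so short unknown tokens simply fall through without extra index entries.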
Number normalization
use ;
thai_digits_to_ascii // "123"
parse_thai_word // Some(123)
u64_to_thai_word // "หนึ่งร้อยยี่สิบสาม"
parse_thai_baht
// Some(BahtAmount { baht: 100, satang: 50 })
to_thai_baht_text // "หนึ่งร้อยบาทถ้วน"
In FtsTokenizer, number normalization runs automatically: TokenKind::Number tokens get their ASCII form added to synonyms. Opt out with .number_normalize(false).
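The digit mapping itself is mechanical: Thai digits ๐ to ๙ occupy the contiguous code points U+0E50 to U+0E59. A minimal stand-alone sketch of that mapping follows; the name `thai_digits_to_ascii_sketch` is hypothetical and chosen so it is not mistaken for the library function.

```rust
/// Hypothetical sketch, not kham's implementation: map Thai digits to ASCII
/// digits and leave every other character untouched.
pub fn thai_digits_to_ascii_sketch(text: &str) -> String {
    text.chars()
        .map(|c| match c {
            // U+0E50..=U+0E59 are contiguous, so the offset from '๐' is the digit value.
            '๐'..='๙' => char::from(b'0' + (c as u32 - '๐' as u32) as u8),
            other => other,
        })
        .collect()
}
```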
Abbreviation expansion
use AbbrevMap;
let map = builtin;
assert_eq!;
assert_eq!;
let exps = map.lookup.unwrap;
assert_eq!;
Built-in TSV covers 12 month abbreviations, era markers, military/police ranks, government agencies, and Bangkok districts. Use with FtsTokenizerBuilder::abbrevs(AbbrevMap::builtin()).
Date parsing
use ;
let d = parse_thai_date.unwrap;
assert_eq!; // BE 2567 → CE 2024
let d = parse_thai_date.unwrap;
assert_eq!;
let d = parse_thai_date.unwrap;
assert_eq!;
Supported formats: full month name, abbreviated month, era marker (พ.ศ. / ค.ศ.), วันที่ prefix, slash/dash-separated, Thai digits. Era inferred when omitted: year ≥ 2300 → Buddhist Era.
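The era rule can be written down in a few lines. The sketch below assumes only what is stated above (a year of 2300 or later is read as Buddhist Era) plus the fixed 543-year offset between the Buddhist Era and the Common Era; `to_gregorian_year` is a hypothetical name, not kham's API.

```rust
/// Offset between Buddhist Era and Common Era years.
pub const BE_OFFSET: i32 = 543;

/// Hypothetical sketch: infer the era of a bare year and return the CE year.
/// Years >= 2300 are treated as Buddhist Era, per the rule described above.
pub fn to_gregorian_year(year: i32) -> i32 {
    if year >= 2300 {
        year - BE_OFFSET
    } else {
        year
    }
}
```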
Sentence segmentation
use split_sentences;
let text = "สวัสดีครับ! วันนี้อากาศดีมาก\nเราไปกินข้าวกันเถอะ";
let sents = split_sentences;
assert_eq!;
assert_eq!;
assert_eq!;
| Character | Rule |
|---|---|
| ๚ ๛ | Always splits |
| ฯ | Splits unless part of ฯลฯ |
| \n | Always splits |
| ! ? | Always splits |
| . | Splits only when followed by whitespace or end-of-string |
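The rules in the table can be sketched as a single pass over the characters. This is an illustration under assumptions, not kham's implementation: `split_sentences_sketch` is a hypothetical name, terminators stay attached to their sentence, surrounding whitespace is trimmed, empty pieces are dropped, and the abbreviation-aware part of the dot rule is not modelled.

```rust
/// Hypothetical sketch of the table's rules; abbreviation handling is omitted.
pub fn split_sentences_sketch(text: &str) -> Vec<&str> {
    let chars: Vec<(usize, char)> = text.char_indices().collect();
    let mut out = Vec::new();
    let mut start = 0usize; // byte offset where the current sentence begins

    for (i, &(pos, c)) in chars.iter().enumerate() {
        let next = chars.get(i + 1).map(|&(_, n)| n);
        let is_break = match c {
            '๚' | '๛' | '\n' | '!' | '?' => true,
            // A dot splits only before whitespace or at end-of-string,
            // so decimals such as 3.50 stay intact.
            '.' => next.map_or(true, |n| n.is_whitespace()),
            'ฯ' => !inside_etc(&chars, i),
            _ => false,
        };
        if is_break {
            let end = pos + c.len_utf8();
            let sentence = text[start..end].trim();
            if !sentence.is_empty() {
                out.push(sentence);
            }
            start = end;
        }
    }
    let tail = text[start..].trim();
    if !tail.is_empty() {
        out.push(tail);
    }
    out
}

/// True when the ฯ at index `i` is the first or last character of "ฯลฯ".
fn inside_etc(chars: &[(usize, char)], i: usize) -> bool {
    let at = |k: usize| chars.get(k).map(|&(_, c)| c);
    let opens = at(i + 1) == Some('ล') && at(i + 2) == Some('ฯ');
    let closes = i >= 2 && at(i - 1) == Some('ล') && at(i - 2) == Some('ฯ');
    opens || closes
}
```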
Spell checking
use SpellChecker;
let checker = builtin;
// Ranked suggestions (Levenshtein ≤ 2, re-ranked by soundex + TNC freq)
let suggestions = checker.suggestions;
for s in &suggestions
// กินข้าว (edit=1, soundex=true, freq=1342)
// Single best correction — None if the word is already in the dictionary
if let Some = checker.did_you_mean
// Correct an entire passage — Unknown tokens (≥ 2 chars) are replaced
let corrected = checker.correct_text;
println!; // ผมกินข้าวกับปลา
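The candidate filter rests on edit distance. A minimal sketch of Levenshtein distance over Unicode scalar values is shown below, together with the "≤ 2" check. The function names are hypothetical, and counting edits per scalar value (rather than per TCC cluster) is an assumption; the soundex and frequency re-ranking steps are not modelled.

```rust
/// Hypothetical sketch: classic two-row Levenshtein distance over chars.
pub fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] = distance between the processed prefix of `a` and b[..j].
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let substitute = prev[j] + usize::from(ca != cb);
            let delete = prev[j + 1] + 1;
            let insert = cur[j] + 1;
            cur.push(substitute.min(delete).min(insert));
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Candidate filter matching the "Levenshtein ≤ 2" rule described above.
pub fn within_two_edits(a: &str, b: &str) -> bool {
    levenshtein(a, b) <= 2
}
```

A missing tone mark, as in กินขาว versus กินข้าว, is a single insertion and therefore well inside the threshold.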
Keyword extraction
use KeyExtractor;
let extractor = builtin;
let text = "นายกรัฐมนตรีประกาศนโยบายเศรษฐกิจใหม่สำหรับประชาชน";
// Top-N unigram keywords
let keywords = extractor.extract;
for kw in &keywords
// Bigram and trigram keyphrases
let phrases = extractor.extract_phrases;
for p in &phrases
Stopwords and single-character tokens are excluded. Scoring uses TF × IDF-proxy where IDF-proxy = (max_tnc_freq + 1) / (tnc_freq + 1) — rare words score higher.
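Written as code, the scoring formula above looks roughly like the sketch below. The function names and the choice of `f64` are assumptions made for illustration; only the formula itself comes from the description.

```rust
/// IDF-proxy = (max_tnc_freq + 1) / (tnc_freq + 1), as described above.
pub fn idf_proxy(tnc_freq: u64, max_tnc_freq: u64) -> f64 {
    (max_tnc_freq as f64 + 1.0) / (tnc_freq as f64 + 1.0)
}

/// Hypothetical sketch: term frequency in the document times the IDF-proxy.
pub fn keyword_score(tf: u32, tnc_freq: u64, max_tnc_freq: u64) -> f64 {
    f64::from(tf) * idf_proxy(tnc_freq, max_tnc_freq)
}
```

The `+ 1` in both numerator and denominator keeps the ratio finite for words with a corpus frequency of zero, which then receive the highest possible weight.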
Named entity recognition
The built-in gazetteer (~36,600 entries) covers Thai provinces, 246 countries, 17,000+ Wikipedia places/orgs, and 9,000+ person and family names. Multi-token matching merges compound names split by the segmenter:
กรุงเทพ → segmenter splits → กรุง + เทพ
→ NE tagger merges → กรุงเทพ Named(Place)
See ADR-001 for the person-name import decision.
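The merge step can be pictured as a greedy longest-match over adjacent tokens. The sketch below is illustrative only: `merge_named` is a hypothetical name, the gazetteer is a plain slice (a real one would more likely be a hash set or trie), and the greedy strategy is an assumption about how such merging could work, not a description of kham's tagger.

```rust
/// Hypothetical sketch: join runs of up to `max_len` adjacent tokens when
/// their concatenation appears in the gazetteer, preferring the longest run.
pub fn merge_named(tokens: &[&str], gazetteer: &[&str], max_len: usize) -> Vec<String> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < tokens.len() {
        let longest = max_len.min(tokens.len() - i);
        let mut merged = None;
        // Try the longest candidate first; runs of one token are never "merged".
        for n in (2..=longest).rev() {
            let candidate: String = tokens[i..i + n].concat();
            if gazetteer.iter().any(|g| *g == candidate) {
                merged = Some((candidate, n));
                break;
            }
        }
        match merged {
            Some((name, n)) => {
                out.push(name);
                i += n;
            }
            None => {
                out.push(tokens[i].to_string());
                i += 1;
            }
        }
    }
    out
}
```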
Phonetic encoding (Soundex)
use ;
use ;
assert_eq!; // same consonant group → "1600"
assert!;
// Thai–English cross-language (Suwanvisat & Prasitjutrakul 1998)
let en = thai_english_soundex;
let th = thai_english_soundex;
assert_eq!; // shared phonetic prefix
FTS integration — emit the soundex code as a synonym:
let fts = builder
.soundex
.build;
Building
Prerequisites:
| Target | Tool | Install |
|---|---|---|
| All | Rust ≥ 1.85 | curl -sSf https://sh.rustup.rs \| sh |
| WASM | wasm-pack | cargo install wasm-pack |
| Python | maturin | pip install maturin |
| C | cbindgen | cargo install cbindgen |
| PostgreSQL | Docker with BuildKit | docs.docker.com |
| SQLite (macOS) | Homebrew sqlite | brew install sqlite |
| SQLite (Linux) | SQLite dev headers | apt install libsqlite3-dev |
CI
| Job | What it checks |
|---|---|
| fmt | cargo fmt --check |
| clippy | cargo clippy -- -D warnings |
| test | Unit + integration + doc tests, stable and MSRV 1.85, Linux and macOS |
| no_std | kham-core compiles for thumbv7em-none-eabihf |
| wasm | Unit tests (cargo test -p kham-wasm) + wasm-pack build --target web |
| python | maturin develop + pytest on Python 3.11 and 3.12 |
| pg_regress | SQL regress suites (kham_fts, kham_features, kham_thai, kham_operators, kham_ranking, kham_advanced) in Docker PostgreSQL 17 |
Further reading
| Document | Contents |
|---|---|
| doc/roadmap.md | Release history, pending action checklist, corpus import plan |
| doc/architecture.md | Crate graph, pipeline flowcharts, module responsibilities |
| doc/benchmarks.md | Throughput numbers, PostgreSQL and SQLite FTS5 benchmarks |
| doc/dict-format.md | dict.bin binary format, DARTS lifecycle, data sources |
| doc/adr-001-ne-person-name-import-strategy.md | Person name import strategy |
| doc/adr-002-syllables-corpus-import-decision.md | Why syllables_th.txt is excluded |
| doc/adr-003-orchid-pos-tag-mapping.md | ORCHID 44-tag → 13-category POS mapping |
License
Licensed under either of:
at your option.