kham-core 0.8.1

Pure Rust Thai word segmentation engine — no_std compatible
# kham

Thai word segmentation engine written in Rust. Fast, `no_std`-compatible core library with bindings for Python, WebAssembly, C, a command-line interface, and database extensions for PostgreSQL and SQLite.

**Website & live demo: [kham.io](https://kham.io)**

[![CI](https://github.com/preedep/kham/actions/workflows/ci.yml/badge.svg)](https://github.com/preedep/kham/actions/workflows/ci.yml)
[![crates.io](https://img.shields.io/crates/v/kham-core.svg)](https://crates.io/crates/kham-core)
[![PyPI](https://img.shields.io/pypi/v/kham.svg)](https://pypi.org/project/kham/)
[![npm](https://img.shields.io/npm/v/kham-wasm.svg)](https://www.npmjs.com/package/kham-wasm)
[![Website](https://img.shields.io/badge/website-kham.io-blue)](https://kham.io)

## Features

- **newmm algorithm** — DAG-based maximal matching constrained to Thai Character Cluster (TCC) boundaries
- **Compound-first DP scoring** — minimises token count before maximising dictionary matches, then uses TNC frequency as tiebreaker; F1 1.000 on 228 curated test cases; 94.9% sentence-level agreement with PyThaiNLP newmm
- **Zero-copy API** — `segment()` returns `&str` slices into the original input; no heap allocation per token
- **`no_std` core** — `kham-core` compiles for bare-metal targets (`alloc` only)
- **Built-in dictionary** — 62,102-word CC0-licensed Thai word list embedded at compile time; `dict_merge()` overlay adds custom words without a full trie rebuild
- **Thai FTS pipeline** — `FtsTokenizer` adds stopword filtering, POS tagging, NER, RTGS romanization, phonetic soundex, abbreviation expansion, and OOV n-gram fallback
- **Named entity recognition** — gazetteer-based NER (~36,600 entries): provinces, countries, Wikipedia places/orgs, person and family names
- **Part-of-speech tagging** — 13-category lookup table (~9,000 entries)
- **Phonetic encoding** — lk82, udom83, MetaSound, and Thai–English cross-language Soundex
- **Confidence scoring** — `Token::confidence: f32` on every token; `0.0` for Unknown, `1.0` for unambiguous dict match; intermediate values from TNC frequency and boundary ambiguity
- **Streaming iterator** — `Tokenizer::segment_stream(text)` returns a `TokenStream` with `next_word()`, `next_known()`, and `next_above_confidence(f32)` for lazy, filtered iteration
- **Spell correction** — `SpellChecker::suggestions(word, n)`: Levenshtein ≤ 2 over the built-in dictionary, re-ranked by lk82 phonetic similarity and TNC frequency; `did_you_mean(word)` returns the single best correction; `correct_text(text)` corrects an entire passage
- **Keyword extraction** — `KeyExtractor::extract(text, n)`: TF × inverse-corpus-frequency scoring; `extract_phrases(text, n)` adds bigram and trigram keyphrases; stopwords and single-char tokens excluded
- **RTGS romanization** — table lookup (415 entries) with rule-based fallback for OOV Thai words; `romanize_or_rule()` per-token; `romanize_sentence(text)` for a whole passage
- **Number normalization** — Thai digits ↔ ASCII, spelled-out number words ↔ integer, Thai Baht currency text
- **Abbreviation expansion** — 118-entry built-in TSV (months, era markers, ranks, agencies)
- **Date parsing** — 7 input formats, Buddhist Era and Gregorian, round-trips to ISO 8601 and Thai text
- **Sentence segmentation** — Thai terminators, Paiyannoi, punctuation, with decimal/abbreviation-aware dot rules
- **Multi-target** — Rust crate, Python wheel, WASM module, C shared library, CLI binary, PostgreSQL FTS parser, SQLite FTS5 tokenizer
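
As a rough illustration of the compound-first idea in the list above, the sketch below runs a small dynamic program that prefers the split with the fewest tokens. It is not kham's implementation: `segment_fewest_tokens`, the cost values, and the toy dictionary are assumptions made for this example. It splits on `char` boundaries rather than TCC boundaries and leaves out the TNC frequency tiebreaker.

```rust
use std::collections::HashSet;

/// Toy segmenter (hypothetical helper, not part of kham-core).
/// Every token is either a dictionary word (cost 1) or a single
/// unknown char (cost 10); the cheapest total split wins.
fn segment_fewest_tokens<'a>(text: &'a str, dict: &HashSet<&str>) -> Vec<&'a str> {
    // Byte offsets of every char boundary, including the end of the string.
    let mut bounds: Vec<usize> = text.char_indices().map(|(i, _)| i).collect();
    bounds.push(text.len());
    let n = bounds.len() - 1;

    // best[i] = (cost, previous boundary index) for the prefix ending at boundary i.
    let mut best: Vec<(u32, usize)> = vec![(u32::MAX, 0); n + 1];
    best[0] = (0, 0);

    for end in 1..=n {
        for start in 0..end {
            if best[start].0 == u32::MAX {
                continue; // prefix not reachable
            }
            let piece = &text[bounds[start]..bounds[end]];
            let cost = if dict.contains(piece) {
                1
            } else if end - start == 1 {
                10 // single unknown char
            } else {
                continue; // multi-char piece that is not in the dictionary
            };
            let total = best[start].0 + cost;
            if total < best[end].0 {
                best[end] = (total, start);
            }
        }
    }

    // Walk the back-pointers from the end of the input to recover the tokens.
    let mut out = Vec::new();
    let mut i = n;
    while i > 0 {
        let start = best[i].1;
        out.push(&text[bounds[start]..bounds[i]]);
        i = start;
    }
    out.reverse();
    out
}
```

With a dictionary containing `กิน`, `ข้าว`, `กินข้าว`, `กับ`, and `ปลา`, the input `กินข้าวกับปลา` comes back as three tokens, because the compound `กินข้าว` costs less than `กิน` followed by `ข้าว`.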

## Packages

| Crate | Registry | Docs | Description |
|---|---|---|---|
| `kham-core` | [crates.io](https://crates.io/crates/kham-core) | (this file) | Pure Rust engine, `no_std` compatible |
| `kham-cli` | [crates.io](https://crates.io/crates/kham-cli) | (this file) | `kham` binary |
| `kham-python` | [PyPI](https://pypi.org/project/kham/) | [kham-python/README.md](kham-python/README.md) | Python bindings via PyO3 / maturin |
| `kham-wasm` | [npm](https://www.npmjs.com/package/kham-wasm) | [kham-wasm/README.md](kham-wasm/README.md) | WebAssembly bindings via wasm-bindgen |
| `kham-capi` | [crates.io](https://crates.io/crates/kham-capi) | [kham-capi/README.md](kham-capi/README.md) | C FFI with cbindgen-generated header |
| `kham-pg` | [PGXN](https://pgxn.org/dist/kham_pg/) | [kham-pg/README.md](kham-pg/README.md) | PostgreSQL text search parser for Thai |
| `kham-sqlite` | | [kham-sqlite/README.md](kham-sqlite/README.md) | SQLite FTS5 tokenizer for Thai |

---

## Quick start

### Rust

```toml
[dependencies]
kham-core = "0.8"
```

```rust
use kham_core::Tokenizer;

let tok = Tokenizer::new();
let tokens = tok.segment("กินข้าวกับปลา");
for t in &tokens {
    println!("{} ({:?})", t.text, t.kind);
}
// กินข้าว (Thai)
// กับ     (Thai)
// ปลา     (Thai)
```

Mixed script works out of the box:

```rust
use kham_core::Tokenizer;

let tok = Tokenizer::new();
let tokens = tok.segment("ธนาคาร100แห่ง");
assert_eq!(tokens[0].text, "ธนาคาร"); // Thai
assert_eq!(tokens[1].text, "100");     // Number
assert_eq!(tokens[2].text, "แห่ง");   // Thai
```

### CLI

```bash
cargo install kham-cli
```

```bash
kham "กินข้าวกับปลา"               # กินข้าว|กับ|ปลา
kham --sep " / " "สวัสดีชาวโลก"    # สวัสดี / ชาว / โลก
kham --kind "ธนาคาร100แห่ง"        # ธนาคาร:Thai|100:Number|แห่ง:Thai
kham --spans "กินข้าวกับปลา"       # กินข้าว:0-7|กับ:7-10|ปลา:10-13

# Confidence scores
kham --confidence "กินข้าวกับปลา"  # กินข้าว:conf=0.95|กับ:conf=1.00|ปลา:conf=1.00

# Filter by confidence threshold
kham --min-confidence 0.9 "กินข้าวกับปลา"

# Structured output
kham --format json "กินข้าวกับปลา"
kham --format csv  "กินข้าวกับปลา"

# Romanize Thai to RTGS Latin
kham --romanize "กินข้าวกับปลา"    # kin khao kap pla

# Spell check a word
kham --spell "กีนข้าว"             # ranked suggestions with edit distance + freq
kham --spell --top-n 3 "ประเทส"

# Keyword extraction
kham --keywords "นักวิทยาศาสตร์ค้นพบดาวเคราะห์ใหม่ในระบบสุริยะ"
kham --keywords --top-n 5 --format json "..."

# FTS pipeline — kind, POS, NE, stopword, synonyms (one token per line)
kham --fts "ทักษิณเดินทางไปกรุงเทพ"
# ทักษิณ  kind=Person  pos=-     ne=Person  stop=false  syn=-
# เดิน    kind=Thai    pos=Verb  ne=-       stop=false  syn=-
# ทาง     kind=Thai    pos=Noun  ne=-       stop=true   syn=-
# ไป      kind=Thai    pos=Verb  ne=-       stop=true   syn=-
# กรุงเทพ kind=Place   pos=-     ne=Place   stop=false  syn=-

# FTS + phonetic encoding — syn= shows the lk82 code
kham --fts --soundex lk82 "กินข้าวกับปลา" | column -t
# กินข้าว  kind=Thai  pos=-     ne=-  stop=false  syn=1619
# กับ      kind=Thai  pos=Conj  ne=-  stop=true   syn=1400
# ปลา      kind=Thai  pos=Noun  ne=-  stop=false  syn=4800

echo "กินข้าว" | kham           # stdin
RUST_LOG=debug kham "กินข้าว"  # per-token trace + timing
```

### Other targets

| Target | Quick link |
|---|---|
| Python | [kham-python/README.md](kham-python/README.md) |
| JavaScript / TypeScript (WASM) | [kham-wasm/README.md](kham-wasm/README.md) |
| C | [kham-capi/README.md](kham-capi/README.md) |
| PostgreSQL FTS | [kham-pg/README.md](kham-pg/README.md) |
| SQLite FTS5 | [kham-sqlite/README.md](kham-sqlite/README.md) |

---

## Token contract

```rust
pub struct Token<'a> {
    pub text: &'a str,            // zero-copy slice of the input string
    pub span: Range<usize>,       // byte offsets in the original string
    pub char_span: Range<usize>,  // Unicode scalar-value (char) offsets
    pub kind: TokenKind,          // Thai | Latin | Number | Punctuation | Emoji | Whitespace | Unknown | Named(NamedEntityKind)
    pub confidence: f32,          // 0.0 (Unknown) … 1.0 (unambiguous dict match)
}
```

- `span` — byte offsets; slice with `&input[token.span.clone()]`
- `char_span` — Unicode scalar-value offsets for Python/JavaScript indexing
- `confidence` — `0.0` for Unknown tokens; `1.0` for unambiguous single-path dict matches; intermediate values reflect TNC frequency weight and competing-edge count from the newmm DP pass
- Joining all `token.text` values (whitespace kept) reconstructs the original input exactly
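
To make the difference between `span` and `char_span` concrete, here is a small standard-library sketch that converts a byte range into a `char` range. The helper name `byte_span_to_char_span` is hypothetical and is not part of the kham-core API; it assumes both ends of the byte range fall on `char` boundaries.

```rust
use std::ops::Range;

/// Converts a byte range of `input` into a range of char
/// (Unicode scalar value) offsets. Hypothetical helper for illustration.
fn byte_span_to_char_span(input: &str, span: Range<usize>) -> Range<usize> {
    // Chars before the span give the starting char offset.
    let start = input[..span.start].chars().count();
    // Chars inside the span give its length in chars.
    let len = input[span].chars().count();
    start..start + len
}
```

For `"ธนาคาร100แห่ง"`, the byte ranges `0..18`, `18..21`, and `21..33` map to the char ranges `0..6`, `6..9`, and `9..13`, since each Thai character here takes three bytes in UTF-8 and each ASCII digit takes one.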

### TokenStream

`segment_stream` returns a lazy iterator that avoids collecting into a `Vec`:

```rust
use kham_core::Tokenizer;

let tok = Tokenizer::new();
let mut stream = tok.segment_stream("ธนาคาร100แห่ง");

// skip whitespace
while let Some(t) = stream.next_word() {
    println!("{}", t.text);
}

// skip whitespace + Unknown tokens
let mut stream = tok.segment_stream("ธนาคาร xyzqqq แห่ง");
while let Some(t) = stream.next_known() {
    println!("{} conf={:.2}", t.text, t.confidence);
}

// filter by confidence threshold
let mut stream = tok.segment_stream("ธนาคาร100แห่ง");
while let Some(t) = stream.next_above_confidence(0.8) {
    println!("{} conf={:.2}", t.text, t.confidence);
}
```

---

## Full-Text Search

`FtsTokenizer` wraps the segmenter with the full NLP pipeline:

```rust
use kham_core::fts::FtsTokenizer;

let fts = FtsTokenizer::new();

let tokens = fts.segment_for_fts("ทักษิณเดินทางไปกรุงเทพ");
for t in &tokens {
    println!("{} ne={:?} pos={:?} stop={}", t.text, t.ne, t.pos, t.is_stop);
}
// ทักษิณ  ne=Some(Person)  pos=None        stop=false
// เดิน    ne=None          pos=Some(Verb)  stop=false
// ทาง     ne=None          pos=None        stop=true
// ไป      ne=None          pos=Some(Verb)  stop=true
// กรุงเทพ ne=Some(Place)   pos=None        stop=false  ← merged from กรุง+เทพ

// Flat lexeme list for tsvector (stopwords removed)
let lexemes = fts.lexemes("กินข้าวกับปลา");
// → ["กินข้าว", "ปลา"]
```

Builder options:

```rust
use kham_core::fts::FtsTokenizer;
use kham_core::abbrev::AbbrevMap;
use kham_core::synonym::SynonymMap;
use kham_core::stopwords::StopwordSet;
use kham_core::romanizer::RomanizationMap;
use kham_core::soundex::SoundexAlgorithm;

let fts = FtsTokenizer::builder()
    .abbrevs(AbbrevMap::builtin())             // ก.ค. → กรกฎาคม before segmentation
    .synonyms(SynonymMap::from_tsv(include_str!("synonyms.tsv")))
    .stopwords(StopwordSet::from_text("ซื้อ\nขาย\n"))
    .romanization(RomanizationMap::builtin())  // adds RTGS to synonyms: กิน → "kin"
    .soundex(SoundexAlgorithm::Lk82)          // adds lk82 code to synonyms for Thai/Named tokens
    .ngram_size(3)                             // trigrams for Unknown tokens (0 = disable)
    .number_normalize(true)                    // Thai digits → ASCII synonym (default: true)
    .build();
```

`FtsToken` fields: `text`, `position`, `kind`, `is_stop`, `synonyms`, `trigrams`, `pos`, `ne`.
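
The n-gram fallback configured by `.ngram_size(3)` can be pictured as a sliding window over the characters of an Unknown token. The sketch below is an assumption about the general technique, not kham's code: `char_ngrams` is a hypothetical helper, it windows over `char`s, and it returns nothing for tokens shorter than `n`, which may differ from what `FtsTokenizer` does.

```rust
/// Returns every window of `n` consecutive chars in `token`
/// as a slice of the original string. Hypothetical helper.
fn char_ngrams(token: &str, n: usize) -> Vec<&str> {
    if n == 0 {
        return Vec::new(); // 0 disables n-gram generation
    }
    // Byte offsets of every char boundary, including the end of the token.
    let mut bounds: Vec<usize> = token.char_indices().map(|(i, _)| i).collect();
    bounds.push(token.len());
    let chars = bounds.len() - 1;
    if chars < n {
        return Vec::new(); // token too short for a full window
    }
    (0..=chars - n)
        .map(|i| &token[bounds[i]..bounds[i + n]])
        .collect()
}
```

For example, `char_ngrams("xyzqq", 3)` yields `xyz`, `yzq`, and `zqq`, which lets a partial match on an out-of-vocabulary token still hit the index.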

---

## Number normalization

```rust
use kham_core::number::{
    thai_digits_to_ascii, parse_thai_word, u64_to_thai_word,
    parse_thai_baht, to_thai_baht_text,
};

thai_digits_to_ascii("๑๒๓");             // "123"
parse_thai_word("หนึ่งร้อยยี่สิบสาม");  // Some(123)
u64_to_thai_word(123);                   // "หนึ่งร้อยยี่สิบสาม"
parse_thai_baht("หนึ่งร้อยบาทห้าสิบสตางค์");
// Some(BahtAmount { baht: 100, satang: 50 })
to_thai_baht_text(100, 0);              // "หนึ่งร้อยบาทถ้วน"
```

In `FtsTokenizer`, number normalization runs automatically: `TokenKind::Number` tokens get their ASCII form added to `synonyms`. Opt out with `.number_normalize(false)`.
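
The digit mapping itself is simple: Thai digits occupy U+0E50 through U+0E59 in Unicode, in the same order as ASCII `0` through `9`. The following standalone sketch shows that mapping. The function name `thai_digits_to_ascii_sketch` is made up for this example; it is not the library's `thai_digits_to_ascii`, whose exact signature is not shown here.

```rust
/// Replaces Thai digits (U+0E50..=U+0E59) with ASCII digits and
/// leaves every other character unchanged. Illustrative only.
fn thai_digits_to_ascii_sketch(text: &str) -> String {
    text.chars()
        .map(|c| match c {
            // The offset from U+0E50 is the digit's numeric value.
            '\u{0E50}'..='\u{0E59}' => char::from(b'0' + (c as u32 - 0x0E50) as u8),
            other => other,
        })
        .collect()
}
```

Calling it on `"๑๒๓"` gives `"123"`, and non-digit text such as `"ปี ๒๕๖๗"` becomes `"ปี 2567"`.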

---

## Abbreviation expansion

```rust
use kham_core::abbrev::AbbrevMap;

let map = AbbrevMap::builtin();
assert_eq!(map.expand_text("วันที่5ก.ค.2567"), "วันที่5กรกฎาคม2567");
assert_eq!(map.expand_text("พ.ศ.2567"),        "พุทธศักราช2567");

let exps = map.lookup("ดร.").unwrap();
assert_eq!(exps, &["ดอกเตอร์"]);
```

Built-in TSV covers 12 month abbreviations, era markers, military/police ranks, government agencies, and Bangkok districts. Use with `FtsTokenizerBuilder::abbrevs(AbbrevMap::builtin())`.

---

## Date parsing

```rust
use kham_core::date::{parse_thai_date, Era};

let d = parse_thai_date("5 กรกฎาคม 2567").unwrap();
assert_eq!(d.to_iso8601(), "2024-07-05"); // BE 2567 → CE 2024

let d = parse_thai_date("๕ ก.ค. ๒๕๖๗").unwrap();
assert_eq!(d.to_iso8601(), "2024-07-05");

let d = parse_thai_date("5/7/2567").unwrap();
assert_eq!(d.era, Era::Buddhist);
```

Supported formats: full month name, abbreviated month, era marker (`พ.ศ.` / `ค.ศ.`), `วันที่` prefix, slash/dash-separated, Thai digits. Era inferred when omitted: year ≥ 2300 → Buddhist Era.
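
The era rule above can be sketched in a few lines. This is an illustration under stated assumptions, not the crate's code: `EraSketch`, `infer_era`, and `to_gregorian_year` are hypothetical names, and treating every year below 2300 as Gregorian is an assumption, since the text only states the Buddhist Era side of the rule. The 543-year offset between Buddhist Era and Common Era matches the `2567` to `2024` conversion shown in the example.

```rust
/// Stand-in for the library's era type (hypothetical).
#[derive(Debug, PartialEq, Clone, Copy)]
enum EraSketch {
    Buddhist,
    Gregorian,
}

/// Infers the era when the input carries no era marker.
/// Assumption: anything below 2300 is treated as Gregorian.
fn infer_era(year: i32) -> EraSketch {
    if year >= 2300 {
        EraSketch::Buddhist
    } else {
        EraSketch::Gregorian
    }
}

/// Converts a year in the given era to a Gregorian (CE) year.
fn to_gregorian_year(year: i32, era: EraSketch) -> i32 {
    match era {
        EraSketch::Buddhist => year - 543, // BE = CE + 543
        EraSketch::Gregorian => year,
    }
}
```

So a bare `2567` is read as Buddhist Era and converts to 2024, while a bare `2024` is left as is.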

---

## Sentence segmentation

```rust
use kham_core::sentence::split_sentences;

let text = "สวัสดีครับ! วันนี้อากาศดีมาก\nเราไปกินข้าวกันเถอะ";
let sents = split_sentences(text);
assert_eq!(sents[0].text, "สวัสดีครับ!");
assert_eq!(sents[1].text, "วันนี้อากาศดีมาก");
assert_eq!(sents[2].text, "เราไปกินข้าวกันเถอะ");
```

| Character | Rule |
|---|---|
| Thai terminators | Always splits |
| Paiyannoi (`ฯ`) | Splits unless part of `ฯลฯ` |
| `\n` | Always splits |
| `!` `?` | Always splits |
| `.` | Splits only when followed by whitespace or end-of-string |
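
The `.` rule in the last row is what keeps decimals such as `3.14` in one sentence. A minimal sketch of that single check follows; `dot_ends_sentence` is a hypothetical helper, and the real splitter presumably applies further abbreviation-aware rules that are not reproduced here.

```rust
/// Returns true if the '.' at byte offset `idx` is followed by
/// whitespace or is the last character of `text`.
/// Assumes `text[idx..]` starts with '.'. Hypothetical helper.
fn dot_ends_sentence(text: &str, idx: usize) -> bool {
    // '.' is one byte, so idx + 1 is always a valid char boundary.
    match text[idx + 1..].chars().next() {
        None => true,                  // end of string
        Some(c) => c.is_whitespace(),  // space, tab, newline, ...
    }
}
```

Under this check the dot in `"3.14 บาท"` does not split, while the dots in `"ok. next"` and `"end."` do.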

---

## Spell checking

```rust
use kham_core::spell::SpellChecker;

let checker = SpellChecker::builtin();

// Ranked suggestions (Levenshtein ≤ 2, re-ranked by soundex + TNC freq)
let suggestions = checker.suggestions("กีนข้าว", 5);
for s in &suggestions {
    println!("{} (edit={}, soundex={}, freq={})", s.word, s.edit_distance, s.soundex_match, s.freq_score);
}
// กินข้าว (edit=1, soundex=true, freq=1342)

// Single best correction — None if the word is already in the dictionary
if let Some(fix) = checker.did_you_mean("กีนข้าว") {
    println!("{}", fix); // กินข้าว
}

// Correct an entire passage — Unknown tokens (≥ 2 chars) are replaced
let corrected = checker.correct_text("ผมกีนข้าวกับปลา");
println!("{}", corrected); // ผมกินข้าวกับปลา
```
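
The candidate filter mentioned in the comments above is an edit-distance threshold. For readers unfamiliar with it, here is a standard two-row Levenshtein implementation over `char`s. It is a generic textbook version written for this README, not the code inside `SpellChecker`; in particular, whether kham measures distance over chars, bytes, or character clusters is an assumption left open here.

```rust
/// Classic Levenshtein distance counted in chars
/// (insertions, deletions, substitutions each cost 1).
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();

    // prev[j] = distance between the first i chars of `a` and first j chars of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let substitute = prev[j] + usize::from(ca != cb);
            let delete = prev[j + 1] + 1;
            let insert = cur[j] + 1;
            cur.push(substitute.min(delete).min(insert));
        }
        prev = cur;
    }
    prev[b.len()]
}
```

The misspelling `กีนข้าว` is at distance 1 from `กินข้าว` (one vowel mark substituted), so it passes a `≤ 2` filter.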

---

## Keyword extraction

```rust
use kham_core::keyword::KeyExtractor;

let extractor = KeyExtractor::builtin();
let text = "นายกรัฐมนตรีประกาศนโยบายเศรษฐกิจใหม่สำหรับประชาชน";

// Top-N unigram keywords
let keywords = extractor.extract(text, 3);
for kw in &keywords {
    println!("{} (score={:.3}, count={})", kw.word, kw.score, kw.count);
}

// Bigram and trigram keyphrases
let phrases = extractor.extract_phrases(text, 5);
for p in &phrases {
    println!("{} (score={:.3}, count={})", p.word, p.score, p.count);
}
```

Stopwords and single-character tokens are excluded. Scoring uses TF × IDF-proxy where IDF-proxy = `(max_tnc_freq + 1) / (tnc_freq + 1)` — rare words score higher.
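
The scoring formula can be written out directly. In this sketch `keyword_score` is a hypothetical function, and using the raw in-document count as TF is an assumption; the library may normalise term frequency differently.

```rust
/// TF x IDF-proxy, with IDF-proxy = (max_tnc_freq + 1) / (tnc_freq + 1).
/// `count` is how often the word occurs in the document (assumed TF).
fn keyword_score(count: u32, tnc_freq: u64, max_tnc_freq: u64) -> f64 {
    let idf_proxy = (max_tnc_freq as f64 + 1.0) / (tnc_freq as f64 + 1.0);
    f64::from(count) * idf_proxy
}
```

With `max_tnc_freq = 99`, a word seen twice with corpus frequency 9 scores `2 × 100 / 10 = 20`, whereas a word seen once with corpus frequency 99 scores only 1. The `+ 1` terms keep the ratio finite for words whose corpus frequency is 0.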

---

## Named entity recognition

The built-in gazetteer (~36,600 entries) covers Thai provinces, 246 countries, 17,000+ Wikipedia places/orgs, and 9,000+ person and family names. Multi-token matching merges compound names split by the segmenter:

```
กรุงเทพ  → segmenter splits → กรุง + เทพ
         → NE tagger merges → กรุงเทพ  Named(Place)
```

See [ADR-001](doc/adr-001-ne-person-name-import-strategy.md) for the person-name import decision.
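
The merge step shown in the diagram can be approximated with a greedy longest-match over adjacent token spans. Everything in this sketch is an assumption for illustration: `merge_named_entities`, the `max_window` parameter, the string labels, and the `HashMap` gazetteer are not taken from kham-core, whose actual data structures are not shown in this README.

```rust
use std::collections::HashMap;
use std::ops::Range;

/// Greedily merges runs of adjacent tokens whose joined text is in the
/// gazetteer. `spans` are contiguous byte ranges into `text`.
/// Returns (token text, optional entity label) pairs. Hypothetical helper.
fn merge_named_entities<'a>(
    text: &'a str,
    spans: &[Range<usize>],
    gazetteer: &HashMap<&str, &'static str>,
    max_window: usize,
) -> Vec<(&'a str, Option<&'static str>)> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < spans.len() {
        // Try the longest window first so compound names win over their parts.
        let longest = (i + max_window.max(1)).min(spans.len());
        let mut matched = None;
        for j in (i + 1..=longest).rev() {
            let candidate = &text[spans[i].start..spans[j - 1].end];
            if let Some(label) = gazetteer.get(candidate) {
                matched = Some((candidate, *label, j));
                break;
            }
        }
        match matched {
            Some((candidate, label, j)) => {
                out.push((candidate, Some(label)));
                i = j; // skip every token that was merged
            }
            None => {
                out.push((&text[spans[i].clone()], None));
                i += 1;
            }
        }
    }
    out
}
```

Given the tokens `ไป`, `กรุง`, `เทพ` and a gazetteer entry `กรุงเทพ → "Place"`, the result is `ไป` with no label followed by the merged `กรุงเทพ` labelled `Place`.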

---

## Phonetic encoding (Soundex)

```rust
use kham_core::soundex::{lk82, udom83, metasound, sounds_like, SoundexAlgorithm};
use kham_core::soundex::{thai_english_soundex, sounds_like_cross_lang};

assert_eq!(lk82("กาน"), lk82("ขาน")); // same consonant group → "1600"
assert!(sounds_like("กาน", "คาน", SoundexAlgorithm::Lk82));

// Thai–English cross-language (Suwanvisat & Prasitjutrakul 1998)
let en = thai_english_soundex("McDonald");
let th = thai_english_soundex("แมคโดนัลด์");
assert_eq!(&en[..3], &th[..3]); // shared phonetic prefix
```

FTS integration — emit the soundex code as a synonym:

```rust
use kham_core::fts::FtsTokenizer;
use kham_core::soundex::SoundexAlgorithm;

let fts = FtsTokenizer::builder()
    .soundex(SoundexAlgorithm::Lk82)
    .build();
```

---

## Building

```bash
cargo build                          # all crates
cargo test --release                 # all tests
cargo test -p kham-core --release    # core only
cargo bench -p kham-core             # throughput benchmarks
cargo run -p kham-bench-accuracy     # word-boundary P/R/F1
cargo run -p kham-bench-accuracy -- --threshold 0.95  # CI gate
```

Prerequisites:

| Target | Tool | Install |
|---|---|---|
| All | Rust ≥ 1.85 | `curl -sSf https://sh.rustup.rs \| sh` |
| WASM | `wasm-pack` | `cargo install wasm-pack` |
| Python | `maturin` | `pip install maturin` |
| C | `cbindgen` | `cargo install cbindgen` |
| PostgreSQL | Docker with BuildKit | [docs.docker.com](https://docs.docker.com/engine/install/) |
| SQLite (macOS) | Homebrew sqlite | `brew install sqlite` |
| SQLite (Linux) | SQLite dev headers | `apt install libsqlite3-dev` |

---

## CI

| Job | What it checks |
|---|---|
| `fmt` | `cargo fmt --check` |
| `clippy` | `cargo clippy -D warnings` |
| `test` | Unit + integration + doc tests, stable and MSRV 1.85, Linux and macOS |
| `no_std` | `kham-core` compiles for `thumbv7em-none-eabihf` |
| `wasm` | Unit tests (`cargo test -p kham-wasm`) + `wasm-pack build --target web` |
| `python` | `maturin develop` + `pytest` on Python 3.11 and 3.12 |
| `pg_regress` | SQL regress suites (kham_fts, kham_features, kham_thai, kham_operators, kham_ranking, kham_advanced) in Docker PostgreSQL 17 |

---

## Further reading

| Document | Contents |
|---|---|
| [doc/roadmap.md](doc/roadmap.md) | Release history, pending action checklist, corpus import plan |
| [doc/architecture.md](doc/architecture.md) | Crate graph, pipeline flowcharts, module responsibilities |
| [doc/benchmarks.md](doc/benchmarks.md) | Throughput numbers, PostgreSQL and SQLite FTS5 benchmarks |
| [doc/dict-format.md](doc/dict-format.md) | `dict.bin` binary format, DARTS lifecycle, data sources |
| [doc/adr-001-ne-person-name-import-strategy.md](doc/adr-001-ne-person-name-import-strategy.md) | Person name import strategy |
| [doc/adr-002-syllables-corpus-import-decision.md](doc/adr-002-syllables-corpus-import-decision.md) | Why syllables_th.txt is excluded |
| [doc/adr-003-orchid-pos-tag-mapping.md](doc/adr-003-orchid-pos-tag-mapping.md) | ORCHID 44-tag → 13-category POS mapping |

---

## License

Licensed under either of:

- [MIT License](LICENSE-MIT)
- [Apache License, Version 2.0](LICENSE-APACHE)

at your option.