# kham

Thai word segmentation engine written in Rust. Fast, `no_std`-compatible core library with bindings for Python, WebAssembly, C, and a command-line interface.
## Features
- newmm algorithm — DAG-based maximal matching constrained to Thai Character Cluster (TCC) boundaries
- Multi-target — single core library ships as a Rust crate, Python wheel, WASM module, C shared library, and CLI binary
- Zero-copy API — `segment()` returns `&str` slices into the original input; no heap allocation per token
- no_std core — `kham-core` compiles for bare-metal targets (`alloc` only, no `std` dependency)
- Built-in dictionary — 62,102-word CC0-licensed Thai word list embedded at compile time; custom dictionaries loaded at runtime
- TNC frequency scoring — Thai National Corpus (CC0) raw counts guide the DP scorer to prefer statistically common segmentations when multiple dictionary paths tie
- Pre-compiled DARTS — Double-Array Trie is built once at compile time (`build.rs`) and loaded from a binary blob at runtime (~64 µs vs ~960 ms construction from text)
- Text normalization — tone-mark (วรรณยุกต์) dedup and Sara Am composition before segmentation
- Thai FTS pipeline — `FtsTokenizer` adds stopword filtering (1,029 built-in entries, PyThaiNLP Apache-2.0), synonym expansion (TSV-driven `SynonymMap`), and character n-gram fallback for OOV tokens; ready for PostgreSQL `tsvector` integration
- Structured CLI logging — `RUST_LOG`-controlled output with coloured log levels via `env_logger` + `colored`
## Packages

| Crate | Registry | Description |
|---|---|---|
| `kham-core` | crates.io | Pure Rust engine, `no_std` compatible |
| `kham-cli` | crates.io | `kham` binary (clap) |
| `kham-python` | PyPI | Python bindings via PyO3 / maturin |
| `kham-wasm` | npm | WebAssembly bindings via wasm-bindgen |
| `kham-capi` | crates.io | C FFI with cbindgen-generated header; includes FTS API |
| `kham-pg` | PGXN (coming soon) | PostgreSQL extension: custom text search parser for Thai |
## Quick start

### Rust
```toml
[dependencies]
kham-core = "0.1"
```

```rust
use kham_core::Tokenizer;

let tok = Tokenizer::new();
let tokens = tok.segment("กินข้าวกับปลา");
for t in &tokens {
    println!("{} ({:?})", t.text, t.kind);
}
// กิน (Thai)
// ข้าว (Thai)
// ...
```
Mixed script works out of the box:

```rust
let tokens = tok.segment("ธนาคาร100แห่ง");
assert_eq!(tokens[0].text, "ธนาคาร"); // Thai
assert_eq!(tokens[1].text, "100");   // Number
assert_eq!(tokens[2].text, "แห่ง");  // Thai
```
For input that may contain stacked tone marks or decomposed Sara Am, normalize first:

```rust
let normalized = tok.normalize(input); // tone dedup + Sara Am composition
let tokens = tok.segment(&normalized); // tokens borrow `normalized`
```
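As a conceptual sketch of these two normalization rules (not kham's implementation), in Python:

```python
import re

# Thai tone marks: U+0E48 MAI EK .. U+0E4B MAI CHATTAWA
TONE_RUN = re.compile(r"([\u0e48-\u0e4b])\1+")

def normalize(text: str) -> str:
    # 1. Collapse accidental repeats of the same tone mark: "ก้้" -> "ก้"
    text = TONE_RUN.sub(r"\1", text)
    # 2. Compose decomposed Sara Am:
    #    NIKHAHIT (U+0E4D) + SARA AA (U+0E32) -> SARA AM (U+0E33)
    return text.replace("\u0e4d\u0e32", "\u0e33")

assert normalize("ก้้") == "ก้"
assert normalize("ก\u0e4d\u0e32") == "กำ"
```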
### Python

```python
import kham

# Simple — list of token strings
words = kham.segment("กินข้าวกับปลา")
# ['กิน', 'ข้าว', 'กับ', 'ปลา']

# Rich — Token objects with span information
tokens = kham.tokenize("ธนาคาร100แห่ง")
for t in tokens:
    print(t.text, t.char_start, t.char_end, t.kind)
# ธนาคาร 0 6 Thai
# 100 6 9 Number
# แห่ง 9 13 Thai
```

Token attributes: `text`, `byte_start`, `byte_end`, `char_start`, `char_end`, `kind`.
### JavaScript / TypeScript (WASM)

```js
import init, { segment, tokenize } from "kham-wasm";

await init();

// Simple — array of token strings
const words = segment("กินข้าวกับปลา");
console.log(words); // ["กิน", "ข้าว", "กับ", "ปลา"]

// Rich — Token objects with span information
const tokens = tokenize("ธนาคาร100แห่ง");
for (const t of tokens) {
  console.log(t.text, t.char_start, t.char_end, t.kind);
}
// ธนาคาร 0 6 Thai
// 100 6 9 Number
// แห่ง 9 13 Thai
```

Token properties: `text`, `byte_start`, `byte_end`, `char_start`, `char_end`, `kind`.

Note on JS string offsets: `char_start`/`char_end` are Unicode scalar-value counts. For BMP text these equal JavaScript's `string.slice()` indices. For surrogate-pair emoji, use `byte_start`/`byte_end` with `TextEncoder` for precise byte-level slicing.
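The byte/char distinction is easy to verify: every Thai letter is one Unicode scalar value but three UTF-8 bytes, so the two span kinds diverge on Thai text. A standalone Python check:

```python
text = "ธนาคาร100แห่ง"

# "ธนาคาร" is 6 scalar values -> char span 0..6 ...
prefix = text[:6]
assert prefix == "ธนาคาร"

# ... but 18 UTF-8 bytes -> byte span 0..18 (3 bytes per Thai letter)
assert len(prefix.encode("utf-8")) == 18

# "100" follows at chars 6..9, which is bytes 18..21
assert text[6:9] == "100"
assert len(text[:9].encode("utf-8")) == 21
```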
### PostgreSQL

`kham-pg` is a PostgreSQL extension that registers a custom text search parser so you can index and query Thai text with `tsvector` / `tsquery`.

Prerequisites: Docker with BuildKit (for the test runner), or PostgreSQL dev headers and `pg_config` for a local install.
```sh
# Build and run pg_regress tests in Docker (67 tests across 4 suites)
# Manual install (if pg_config is in PATH)
# PGXN distribution zip (for upload to pgxn.org)
```
Once installed:
```sql
-- Register the extension
CREATE EXTENSION kham_pg;

-- Inspect token types produced by the parser
SELECT * FROM ts_token_type('kham');
-- 1  thai     Thai word
-- 2  latin    Latin script token
-- 3  number   Numeric token
-- 4  punct    Punctuation
-- 5  emoji    Emoji token
-- 6  unknown  Unknown / OOV token

-- Tokenise a document
SELECT * FROM ts_parse('kham', 'กินข้าวกับปลา');
-- 1  กิน
-- 1  ข้าว
-- 1  กับ
-- 1  ปลา

-- Build a tsvector (all token types indexed; kham_dict uses the simple template)
SELECT to_tsvector('kham', 'กินข้าวกับปลา');
-- 'กิน':1 'กับ':3 'ข้าว':2 'ปลา':4

-- Full-text search (AND)
SELECT title FROM articles
WHERE to_tsvector('kham', body) @@ plainto_tsquery('kham', 'ข้าว ปลา');

-- Phrase search (adjacent tokens)
SELECT title FROM articles
WHERE to_tsvector('kham', body) @@ phraseto_tsquery('kham', 'กิน ข้าว');

-- GIN index for large tables
CREATE INDEX ON articles USING GIN (to_tsvector('kham', body));

-- Ranked results
SELECT title,
       ts_rank(to_tsvector('kham', body), plainto_tsquery('kham', 'ปลา')) AS rank
FROM articles
WHERE to_tsvector('kham', body) @@ plainto_tsquery('kham', 'ปลา')
ORDER BY rank DESC;
```
Note: `ts_headline` is not supported — the kham parser has no HEADLINE callback. This is a known limitation of custom parsers in PostgreSQL.
## CLI

```sh
# Positional argument
kham "กินข้าวกับปลา"
# กิน|ข้าว|กับ|ปลา

# Custom separator
kham --sep " / " "สวัสดีชาวโลก"
# สวัสดี / ชาว / โลก

# Show token kinds
kham --kind "ธนาคาร100แห่ง"
# ธนาคาร:Thai|100:Number|แห่ง:Thai

# Show Unicode char spans
kham --spans "กินข้าวกับปลา"
# กิน:0-3|ข้าว:3-7|กับ:7-10|ปลา:10-13

# Combine kind and spans
kham --kind --spans "กินข้าว"
# กิน:Thai:0-3|ข้าว:Thai:3-7

# Normalize before segmenting
kham --normalize "กินข้าวกับปลา"

# Custom dictionary
kham --dict my_words.txt "กินข้าวกับปลา"

# Pipeline / stdin
echo "กินข้าวกับปลา" | kham
cat input.txt | kham > segmented.txt
```
Full options:

```text
Usage: kham [OPTIONS] [TEXT]

Arguments:
  [TEXT]  Thai text to segment. Reads from stdin line-by-line if omitted.

Options:
  -d, --dict <FILE>  Path to a custom word-list file (newline-separated)
  -s, --sep <SEP>    Output separator between tokens [default: |]
  -w, --whitespace   Include whitespace tokens in output
  -n, --normalize    Run normalize() before segmenting
  -k, --kind         Append token kind after each token (e.g. กิน:Thai)
      --spans        Append Unicode char span after each token (e.g. กิน:0-3)
  -h, --help         Print help
  -V, --version      Print version
```
Debug and timing output is controlled by the `RUST_LOG` environment variable:

```sh
RUST_LOG=info kham "กินข้าวกับปลา"
RUST_LOG=debug kham "กินข้าวกับปลา"
```
## C

Generate the header and link `libkham_capi`:

```c
#include <stdio.h>
#include "kham.h"

// Simple — array of token strings
KhamTokens *tokens = kham_segment("กินข้าวกับปลา");
for (size_t i = 0; i < tokens->len; i++)
    printf("%s\n", tokens->items[i]);
kham_tokens_free(tokens);

// Rich — KhamToken structs with full span information
KhamTokenList *list = kham_tokenize("ธนาคาร100แห่ง");
for (size_t i = 0; i < list->len; i++) {
    const KhamToken *t = &list->tokens[i];
    printf("%s char %zu..%zu %s\n", t->text, t->char_start, t->char_end, t->kind);
}
// ธนาคาร char 0..6 Thai
// 100 char 6..9 Number
// แห่ง char 9..13 Thai
kham_token_list_free(list);
```

`KhamToken` fields: `text`, `byte_start`, `byte_end`, `char_start`, `char_end`, `kind` (all null-terminated UTF-8 strings or `size_t`).
### FTS API (C)

Run the full Thai FTS pipeline from C to get stopword flags, synonym expansions, and OOV trigrams:

```c
// Annotated FTS tokens (all non-whitespace, with metadata)
KhamFtsTokenList *fts = kham_fts_tokenize("กินข้าวกับปลา");
for (size_t i = 0; i < fts->len; i++) {
    const KhamFtsToken *t = &fts->tokens[i];
    printf("%s pos=%zu stop=%d synonyms=%zu trigrams=%zu\n",
           t->text, t->position, (int)t->is_stop, t->synonyms_len, t->trigrams_len);
}
// กิน pos=0 stop=0 synonyms=0 trigrams=0
// ข้าว pos=1 stop=0 synonyms=0 trigrams=0
// กับ pos=2 stop=1 synonyms=0 trigrams=0
// ปลา pos=3 stop=0 synonyms=0 trigrams=0
kham_fts_token_list_free(fts);

// Flat lexeme array for tsvector population (stopwords removed)
size_t n = 0;
char **lexemes = kham_fts_lexemes("กินข้าวกับปลา", &n);
// lexemes[0] = "กิน", lexemes[1] = "ข้าว", lexemes[2] = "ปลา" (n = 3)
kham_lexemes_free(lexemes, n);
```

`KhamFtsToken` fields: `text`, `position` (`size_t`), `kind`, `is_stop` (`bool`), `synonyms`/`synonyms_len`, `trigrams`/`trigrams_len`.
## Token contract

Every `segment()` call returns `Vec<Token>`:

- `span` — byte offsets; use to slice `&str` directly (`&input[token.span.clone()]`)
- `char_span` — Unicode scalar-value offsets; use for Python/JavaScript string indexing where strings are char- or code-unit-indexed
- Both spans are always valid UTF-8 boundaries
- Joining all `token.text` values (with whitespace kept) reconstructs the original input exactly
```rust
use kham_core::Tokenizer;

let tok = Tokenizer::new();
let input = "ธนาคาร100แห่ง";
let tokens = tok.segment(input);

// ธนาคาร: 6 chars, 18 bytes
assert_eq!(tokens[0].char_span, 0..6);
assert_eq!(tokens[0].span, 0..18);

// 100: 3 chars, 3 bytes
assert_eq!(tokens[1].char_span, 6..9);
assert_eq!(tokens[1].span, 18..21);
```
## Custom dictionary

```rust
use kham_core::Tokenizer;

// From a string (newline-separated words)
let tok = Tokenizer::builder()
    .dict_words("ลินุกซ์\nเคอร์เนล")
    .build();

// From a file (requires the `std` feature)
let tok = Tokenizer::builder()
    .dict_file("my_words.txt")?
    .build();

// Keep whitespace tokens
let tok = Tokenizer::builder()
    .keep_whitespace(true)
    .build();
```
## Full-Text Search (FTS)

`kham-core` ships a complete Thai FTS pipeline on top of the segmenter. The `kham-pg` PostgreSQL extension (Phase 2) wraps this pipeline as a custom text search parser — see the PostgreSQL quick start above.
### Basic indexing

```rust
use kham_core::FtsTokenizer;

let fts = FtsTokenizer::new(); // built-in stopwords, no synonyms

// All tokens with metadata
let tokens = fts.segment_for_fts("กินข้าวกับปลา");
for t in &tokens {
    println!("{} pos={} stop={}", t.text, t.position, t.is_stop);
}
// กิน pos=0 stop=false
// ข้าว pos=1 stop=false
// กับ pos=2 stop=true   ← conjunction → filtered at index time
// ปลา pos=3 stop=false

// Flat lexeme list for tsvector (stopwords removed)
let lexemes = fts.lexemes("กินข้าวกับปลา");
// → ["กิน", "ข้าว", "ปลา"]
```
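The index-time flow (segment, then drop stopwords while preserving order) can be sketched independently of kham; the stopword set below is a toy subset chosen for illustration:

```python
STOPWORDS = {"กับ", "และ", "ที่"}  # toy subset; kham ships 1,029 entries

def lexemes(tokens: list[str]) -> list[str]:
    # Keep non-stopword tokens in their original order for tsvector population
    return [t for t in tokens if t not in STOPWORDS]

tokens = ["กิน", "ข้าว", "กับ", "ปลา"]
print(lexemes(tokens))  # ['กิน', 'ข้าว', 'ปลา']
```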
### Synonym expansion

Define a TSV file where each line maps a canonical form to one or more equivalents:

```text
คอม	คอมพิวเตอร์	computer
รถไฟฟ้า	BTS	MRT	รถไฟใต้ดิน
```

```rust
use kham_core::{FtsTokenizer, SynonymMap};

let synonyms = SynonymMap::from_tsv(include_str!("synonyms.tsv"));
let fts = FtsTokenizer::builder().synonyms(synonyms).build();

let lexemes = fts.lexemes("ซื้อคอมใหม่");
// → ["ซื้อ", "คอม", "คอมพิวเตอร์", "computer", "ใหม่"]
//            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ expanded
```
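The TSV convention can be modelled in a few lines; this Python sketch uses hypothetical helper names, not kham's API, to show the parse, merge, and expand behaviour described above:

```python
def parse_synonym_tsv(tsv: str) -> dict[str, list[str]]:
    # Each line: canonical<TAB>synonym1<TAB>synonym2 ...
    out: dict[str, list[str]] = {}
    for line in tsv.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2:
            # Duplicate canonicals merge rather than overwrite
            out.setdefault(fields[0], []).extend(fields[1:])
    return out

def expand(token: str, synmap: dict[str, list[str]]) -> list[str]:
    # Index-time expansion: the token itself, then its synonyms
    return [token] + synmap.get(token, [])

synmap = parse_synonym_tsv("คอม\tคอมพิวเตอร์\tcomputer\nรถไฟฟ้า\tBTS\tMRT\tรถไฟใต้ดิน")
print(expand("คอม", synmap))  # ['คอม', 'คอมพิวเตอร์', 'computer']
```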
### Custom stopwords

```rust
use kham_core::{FtsTokenizer, StopwordSet};

// Add domain-specific stopwords on top of the built-in list
let extra = StopwordSet::from_text("ครับ\nค่ะ\nนะ");
let fts = FtsTokenizer::builder().stopwords(extra).build();
```
### OOV (out-of-vocabulary) n-grams

Words not in the dictionary are emitted as `TokenKind::Unknown`. The FTS pipeline automatically generates character n-grams for these tokens so they remain searchable:

```rust
// Default ngram_size = 3 (trigrams)
// Unknown token "สกรีน" (3-char TCC clusters) → ["สกร", "กรี", "รีน"]

// Disable n-gram generation:
let fts = FtsTokenizer::builder().ngram_size(0).build();
```
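A per-scalar sliding window reproduces the example above; this is a simplified sketch, since kham's ngram module may window over TCC clusters rather than raw scalar values:

```python
def char_ngrams(text: str, n: int = 3) -> list[str]:
    # Sliding window of n Unicode scalar values; short tokens pass through whole
    if len(text) < n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("สกรีน"))  # ['สกร', 'กรี', 'รีน']
```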
### FtsToken fields

| Field | Type | Description |
|---|---|---|
| `text` | `String` | Token text (normalised) |
| `position` | `usize` | Ordinal index in non-whitespace sequence (0-based) |
| `kind` | `TokenKind` | Thai / Latin / Number / … / Unknown |
| `is_stop` | `bool` | Matched the stopword list |
| `synonyms` | `Vec<String>` | Synonym expansions (empty if none) |
| `trigrams` | `Vec<String>` | Char n-grams for Unknown tokens only |
## Architecture

### Workspace crate graph

```mermaid
graph LR
    core["<b>kham-core</b><br/><i>no_std · alloc only</i><br/>segmentation engine"]
    cli["<b>kham-cli</b><br/>kham binary<br/>(clap)"]
    python["<b>kham-python</b><br/>Python wheel<br/>(PyO3 · maturin)"]
    wasm["<b>kham-wasm</b><br/>WASM module<br/>(wasm-bindgen)"]
    capi["<b>kham-capi</b><br/>C shared library<br/>(cbindgen)<br/>segment · FTS · lexemes"]
    pg["<b>kham-pg</b><br/>PostgreSQL extension<br/>(C shim · cdylib)"]
    core --> cli
    core --> python
    core --> wasm
    core --> capi
    core --> pg
```
### Core module responsibilities

```mermaid
classDiagram
    direction LR
    class normalizer {
        +normalize(text) String
        --
        tone-mark dedup
        Sara Am composition
    }
    class pre_tokenizer {
        +pre_tokenize(text) Vec~Token~
        +classify_char(c) TokenKind
        --
        Unicode script split
        Thai · Latin · Number
        Emoji · Punct · WS
    }
    class tcc {
        +tcc_boundaries(text) Vec~usize~
        +tcc_iter(text) Iterator
        --
        Thai Character Cluster
        boundary detection
        Theeramunkong 2000
    }
    class dict {
        +builtin_dict() Dict
        +from_word_list(text) Dict
        +from_bytes(data) Dict
        +contains(word) bool
        +prefixes(text) Vec~str~
        --
        Double-Array Trie
        O(k) byte-level lookup
        pre-compiled binary blob
        built-in CC0 word list
    }
    class freq {
        +FreqMap::builtin() FreqMap
        +from_tsv(data) FreqMap
        +get(word) u32
        --
        TNC raw occurrence counts
        CC0 · 106k entries
        DP tie-breaking scorer
    }
    class segmenter {
        +segment(text) Vec~Token~
        +normalize(text) String
        --
        newmm DAG algorithm
        DP over TCC boundaries
        min unknowns · max dict words
        TNC freq · min token count
    }
    class token {
        +text : &str
        +span : Range~usize~
        +char_span : Range~usize~
        +kind : TokenKind
        --
        Thai · Latin · Number
        Punctuation · Emoji
        Whitespace · Unknown
    }
    class stopwords {
        +StopwordSet::builtin() StopwordSet
        +from_text(data) StopwordSet
        +contains(word) bool
        --
        1029 entries · Apache-2.0
        sorted Vec binary search
        O(log n) lookup
    }
    class synonym {
        +SynonymMap::from_tsv(data) SynonymMap
        +expand(word) Option~slice~
        +has_synonyms(word) bool
        --
        BTreeMap canonical→synonyms
        TSV format
        duplicate canonicals merge
    }
    class ngram {
        +char_ngrams(text, n) Iterator
        +token_ngrams(tokens, n) Iterator
        --
        zero-alloc char slices
        OOV fallback indexing
        phrase proximity
    }
    class fts {
        +FtsTokenizer::new() FtsTokenizer
        +segment_for_fts(text) Vec~FtsToken~
        +index_tokens(text) Vec~FtsToken~
        +lexemes(text) Vec~String~
        --
        FtsToken: text · position
        is_stop · synonyms · trigrams
        PostgreSQL tsvector entry point
    }
    segmenter ..> normalizer : calls
    segmenter ..> pre_tokenizer : calls
    segmenter ..> tcc : calls
    segmenter ..> dict : queries
    segmenter ..> freq : scores
    segmenter ..> token : emits
    pre_tokenizer ..> token : emits
    fts ..> segmenter : wraps
    fts ..> stopwords : filters
    fts ..> synonym : expands
    fts ..> ngram : OOV grams
```
### Segmentation pipeline

```mermaid
flowchart TD
    INPUT(["<b>raw &str</b>"])
    subgraph OPTIONAL["optional — call before segment()"]
        NORM["<b>normalizer::normalize()</b>\ntone-mark dedup\nSara Am อํ+อา → อำ"]
    end
    PRE["<b>pre_tokenizer::pre_tokenize()</b>\nUnicode script classification\nsplit into homogeneous spans"]
    SPLIT{span kind?}
    PASS["pass through\nas-is"]
    subgraph THAI_PATH["Thai span processing"]
        TCC["<b>tcc::tcc_boundaries()</b>\nTCC boundary positions\n= legal word-break points"]
        DICT["<b>dict::prefixes()</b>\nDARTS prefix search\nat each boundary"]
        DAG["<b>DP over boundary graph</b>\nminimise unknown tokens\nmaximise dict-word count\nTNC frequency score · fewest tokens"]
    end
    MERGE(["<b>Vec<Token<'_>></b>\nzero-copy &str slices"])
    INPUT --> OPTIONAL
    OPTIONAL --> PRE
    PRE --> SPLIT
    SPLIT -->|"Thai"| TCC
    SPLIT -->|"Latin · Number\nEmoji · Punct · WS"| PASS
    TCC --> DICT
    DICT --> DAG
    DAG --> MERGE
    PASS --> MERGE
```
### DAG segmentation detail

```mermaid
flowchart LR
    subgraph INPUT["Thai span: กินข้าว"]
        direction LR
        C0(["pos 0"])
        C1(["pos 3\nกิ"])
        C2(["pos 6\nน"])
        C3(["pos 9\nข้"])
        C4(["pos 15\nาว"])
        C5(["pos 21\nend"])
    end
    C0 -->|"กิน ✓ dict"| C2
    C0 -.->|"กิ unknown"| C1
    C1 -.->|"น unknown"| C2
    C2 -->|"ข้าว ✓ dict"| C5
    C2 -.->|"ข้ unknown"| C3
    C3 -.->|"าว unknown"| C4
    BEST["DP picks bold path:\nกิน · ข้าว\n= 2 dict words"]
    C5 --- BEST
```
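The selection rule in the diagram (fewest unknown spans, then fewest tokens) can be reproduced with a small dynamic program over a toy dictionary. This sketch ignores TCC boundary restrictions and TNC frequency tie-breaking, both of which the real engine applies:

```python
# Toy dictionary; the built-in dictionary has 62,102 entries
DICT = {"กิน", "ข้าว", "กับ", "ปลา", "กิ"}
MAX_WORD = max(len(w) for w in DICT)

def segment(text: str) -> list[str]:
    n = len(text)
    # best[i] = (unknown_count, token_count, tokens) for the prefix text[:i]
    best = [(0, 0, [])] + [None] * n
    for i in range(1, n + 1):
        for j in range(max(0, i - MAX_WORD), i):
            if best[j] is None:
                continue
            piece = text[j:i]
            unk = 0 if piece in DICT else 1
            cand = (best[j][0] + unk, best[j][1] + 1, best[j][2] + [piece])
            # Lexicographic order: fewer unknown spans first, then fewer tokens
            if best[i] is None or cand[:2] < best[i][:2]:
                best[i] = cand
    return best[n][2]

print(segment("กินข้าวกับปลา"))  # ['กิน', 'ข้าว', 'กับ', 'ปลา']
```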
## Prerequisites

### All targets

| Tool | Version | Install |
|---|---|---|
| Rust toolchain | ≥ 1.85 (MSRV) | `curl -sSf https://sh.rustup.rs \| sh` |
| Cargo | ships with Rust | — |

Verify: `rustc --version`
### WASM (kham-wasm)

| Tool | Version | Install |
|---|---|---|
| `wasm32-unknown-unknown` target | — | `rustup target add wasm32-unknown-unknown` |
| `wasm-pack` | ≥ 0.13 | `cargo install wasm-pack` |

`wasm-pack` wraps `cargo build --target wasm32-unknown-unknown` and `wasm-bindgen-cli` to produce the `.wasm` binary and JavaScript/TypeScript glue in one step.
### Python (kham-python)

| Tool | Version | Install |
|---|---|---|
| Python | ≥ 3.8 | system package manager or python.org |
| `maturin` | ≥ 1.0 | `pip install maturin` |

`maturin` compiles the PyO3 extension module and installs it into the active virtual environment. Always run inside a venv or conda environment.
```sh
python -m venv .venv && source .venv/bin/activate
cd kham-python && maturin develop --release
```
The crate targets Python ≥ 3.8 (abi3-py38 stable ABI) — a single wheel runs on 3.8 through 3.13+.
### C (kham-capi)

| Tool | Version | Install |
|---|---|---|
| `cbindgen` | ≥ 0.26 | `cargo install cbindgen` |
| C compiler | any C11-capable compiler | system package manager |
### PostgreSQL (kham-pg)

| Tool | Version | Install |
|---|---|---|
| Docker with BuildKit | ≥ 24 | docs.docker.com |
| `make` | any | system package manager |

For local (non-Docker) builds, also install:

| Tool | Version | Install |
|---|---|---|
| PostgreSQL dev headers | 14–17 | Linux: `apt install postgresql-server-dev-17` · macOS: `brew install postgresql@17` |
| `pg_config` | ships with dev headers | — |
| C compiler | any C11-capable compiler | system package manager |
| GNU gettext | any | macOS only: `brew install gettext` (provides `libintl.h` required by PG headers) |
`cbindgen` reads `kham-capi/src/lib.rs` and `kham-capi/cbindgen.toml` to generate `kham.h`. Link against the compiled `libkham_capi` (`.so` / `.dylib` / `.dll`).

```sh
# macOS:   target/release/libkham_capi.dylib
# Linux:   target/release/libkham_capi.so
# Windows: target/release/kham_capi.dll
```
## Building

The `kham-core` build script (`build.rs`) pre-compiles the built-in dictionary into a binary DARTS blob (`$OUT_DIR/dict.bin`) on every `cargo build`. It only reruns when `build.rs` or `data/words_th.txt` change.

Binding targets (after installing the prerequisites above):
&&
### Deploy script

`scripts/deploy.sh` publishes any combination of packages in the correct dependency order. It runs `cargo fmt`, `cargo clippy`, and `cargo test` before any upload, and requires a `MATURIN_PYPI_TOKEN` env var for PyPI plus an active `npm login` session for npm.
## CI / CD

Two GitHub Actions workflows run automatically.

CI (`ci.yml`) — every push and pull request to `main` / `develop`:

| Job | What it checks |
|---|---|
| `fmt` | `cargo fmt --check` |
| `clippy` | `cargo clippy -D warnings` |
| `test` | Unit + integration + doc tests on stable and MSRV 1.85, Linux and macOS |
| `no_std` | `kham-core` compiles for `thumbv7em-none-eabihf` (bare metal) |
| `wasm` | `wasm-pack build --target web` succeeds |
| `python` | `maturin develop` on Python 3.8 and 3.12 |
| `bench_compile` | Benchmark suite compiles without errors |
| `pg_regress` | 67 SQL correctness tests across 4 suites (kham_fts, kham_thai, kham_operators, kham_ranking) inside Docker PostgreSQL 17 |
Release (`release.yml`) — on `v*.*.*` tag push

Publishes to all registries after the CI gate passes:

```mermaid
flowchart LR
    TAG(["git tag v0.1.0\ngit push --tags"])
    CI["CI gate\n(full test matrix)"]
    CRATES["crates.io\nkham-core + kham-cli"]
    PYPI["PyPI\nkham wheels\n(manylinux · macOS · Windows)"]
    NPM["npm\nkham-wasm"]
    GH["GitHub Release\nauto release notes\n+ wheel artifacts"]
    TAG --> CI
    CI --> CRATES
    CI --> PYPI
    CI --> NPM
    CRATES --> GH
    PYPI --> GH
    NPM --> GH
```
### Required secrets

| Secret | Used for |
|---|---|
| `CARGO_REGISTRY_TOKEN` | crates.io publish |
| `NPM_TOKEN` | npm publish |
| PyPI — no secret needed | OIDC trusted publishing; configure via pypi.org Trusted Publisher |
To cut a release:

```sh
git tag v0.1.0
git push --tags
```
## Benchmarks

### Environment
| Field | Value |
|---|---|
| CPU | Apple M-series (arm64) |
| OS | macOS 26.4.1 |
| Rust | 1.94.1 (stable) |
| Profile | release (LTO enabled) |
| Built-in dictionary | 62,102 words · 669,387 DARTS states · 5.1 MiB |
| TNC frequency table | 106,125 entries |
### Segmentation throughput (`segment/by_length`)
Pure Thai input, built-in dictionary, no custom dict.
| Input | Size | Time (median) | Throughput |
|---|---|---|---|
| short | 37 B | 879 ns | 42.3 MiB/s |
| medium | 182 B | 3.80 µs | 45.1 MiB/s |
| long | 546 B | 10.9 µs | 47.1 MiB/s |
### Mixed-script throughput (`segment/mixed`)

Thai + Latin + Number in the same input, measuring pre-tokenizer boundary overhead.

| Input | Size | Time (median) | Throughput |
|---|---|---|---|
| sparse (ธนาคาร100แห่ง) | 26 B | 744 ns | 42.3 MiB/s |
| medium (multi-boundary) | 74 B | 1.73 µs | 43.5 MiB/s |
| dense (alternating script) | 29 B | 535 ns | 55.3 MiB/s |
### Normalize + segment (`normalize_then_segment/medium`)

| Operation | Time (median) |
|---|---|
| `normalize()` then `segment()` on medium input | 4.09 µs |
### Normalization throughput (`normalize/thai`)
| Input | Size | Time (median) | Throughput |
|---|---|---|---|
| short | 37 B | 79.9 ns | 465 MiB/s |
| medium | 182 B | 199 ns | 864 MiB/s |
| long | 546 B | 507 ns | 1.0 GiB/s |
### Dictionary construction (`dict/construction`)

| Operation | Time (median) | Notes |
|---|---|---|
| `builtin_dict()` — binary blob load | 78 µs | pay-once startup cost |
| `Dict::from_word_list` — 62k words | 980 ms | only when merging a custom dict |
| `Dict::from_word_list` — 8-word list | 3.72 µs | small custom dict |
| `dict/file/read_and_build` — disk + build | 1.01 s | `kham --dict <file>` startup |
| `Tokenizer::builder().dict_file().build()` | 1.04 s | full CLI code path with custom dict |

`builtin_dict()` is ~12,500× faster than `Dict::from_word_list` because the DARTS trie is pre-compiled by `build.rs` at compile time; the runtime cost is a single O(S) binary decode pass. `Dict::from_word_list` runs only when a user-supplied custom dictionary is merged with the built-in list.
### Dictionary lookup (`dict/contains`, `dict/prefixes`)

| Operation | Time (median) | Throughput |
|---|---|---|
| `contains` — hit (3-byte word กิน) | 7.1 ns | 1.18 GiB/s |
| `contains` — hit (12-byte word สวัสดี) | 18.3 ns | 940 MiB/s |
| `contains` — miss (ASCII non-word) | 744 ps | 7.5–8.8 GiB/s |
| `prefixes` — short anchor (7 B) | 42.3 ns | 473 MiB/s |
| `prefixes` — medium anchor (60 B) | 36.7 ns | 1.52 GiB/s |
| `prefixes` — long anchor (97 B) | 74.5 ns | 1.24 GiB/s |
### TNC frequency table (`freq/construction`, `freq/get`)

| Operation | Time (median) | Notes |
|---|---|---|
| `FreqMap::builtin()` — parse 106k TSV entries | 22.1 ms | pay-once startup cost |
| `FreqMap::get` — common word hit (กิน) | 67.8 ns | O(log n) BTreeMap |
| `FreqMap::get` — rare word hit | 48.6 ns | |
| `FreqMap::get` — miss | 56.5 ns | |

The `FreqMap::builtin()` startup cost (~22 ms) is the dominant component of `Tokenizer::new()`. It is paid once per tokenizer instance; the returned `FreqMap` is reused across all `segment()` calls.
Run locally:

```sh
cargo bench
# HTML report: target/criterion/report/index.html
```
### PostgreSQL extension (kham-pg)

The `kham-pg` extension is benchmarked at the SQL level using `pgbench` inside the Docker test container, plus system-level CPU/memory via `docker stats`.

#### 1 · Latency — psql `\timing`

```sql
\timing on
SELECT to_tsvector('kham', 'กินข้าวกับปลา Python 3 สำหรับนักพัฒนา');

-- Per-node breakdown
EXPLAIN (ANALYZE, BUFFERS)
SELECT to_tsvector('kham', body) FROM documents LIMIT 1000;
```
#### 2 · Throughput — pgbench

Create `bench_fts.sql`:

```sql
SELECT to_tsvector('kham', 'กินข้าวกับปลา Python 3 สำหรับนักพัฒนา');
```

Run via Docker:

```sh
# Terminal 1 — watch CPU/memory while bench runs
docker stats

# Terminal 2 — throughput bench (4 clients, 30 seconds)
pgbench -c 4 -T 30 -f bench_fts.sql
# Output: TPS, latency avg/stddev
```
#### 3 · Index build time — realistic workload

```sql
CREATE TABLE docs (id serial, body text);

INSERT INTO docs (body)
SELECT 'กินข้าวกับปลา Python ' || g
FROM generate_series(1, 100000) g;

\timing on
CREATE INDEX ON docs USING GIN (to_tsvector('kham', body));

-- Query latency against the index
SELECT count(*) FROM docs
WHERE to_tsvector('kham', body) @@ plainto_tsquery('kham', 'ปลา');
```
## Dictionary and corpus data

| File | License | Entries | Purpose |
|---|---|---|---|
| `data/words_th.txt` | CC0 | 62,102 words | Built-in segmentation dictionary |
| `data/tnc_freq.txt` | CC0 | 106,125 entries | TNC raw counts → DP tie-breaking scorer |
| `data/stopwords_th.txt` | Apache-2.0 (PyThaiNLP) | 1,029 words | FTS stopword filter |

Custom dictionaries are newline-separated plain text files; lines beginning with `#` are treated as comments.
The frequency table is embedded at compile time and loaded into a `FreqMap` at runtime. The newmm DP scorer uses it as the third tiebreaker — after minimising unknown tokens and maximising dictionary matches — so statistically more common segmentations are preferred when multiple paths are otherwise equal. Frequency data is kept separate from `dict.bin`; do not merge them.

The stopword list is sourced from PyThaiNLP (Apache-2.0) and embedded via `include_str!`. Attribution is preserved in the header of `stopwords_th.txt`. The list is sorted and deduplicated at runtime into a `StopwordSet` backed by binary search.
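A minimal sketch of the same sorted-list-plus-binary-search design, using Python's `bisect` over a toy subset:

```python
import bisect

# Toy subset; the built-in list has 1,029 sorted, deduplicated entries
STOPWORDS = sorted({"กับ", "ที่", "และ", "ใน"})

def is_stop(word: str) -> bool:
    # O(log n) membership via binary search over the sorted list
    i = bisect.bisect_left(STOPWORDS, word)
    return i < len(STOPWORDS) and STOPWORDS[i] == word

print(is_stop("กับ"), is_stop("ปลา"))  # True False
```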
Constraint: Never ship BEST corpus data or any non-Apache-2.0/CC0 material in this repository.
## Pre-compiled DARTS binary (dict.bin)

`build.rs` compiles the built-in word list into a binary Double-Array Trie blob (`$OUT_DIR/dict.bin`) once at build time. At runtime, `builtin_dict()` loads this blob via `Dict::from_bytes`, which is ~15,000× faster than reconstructing the trie from the text word list (~64 µs vs ~960 ms).
### File format

All multi-byte integers are little-endian. The file begins with a fixed 16-byte header followed immediately by the two DARTS arrays.

| Offset | Size (bytes) | Field | Type | Description |
|---|---|---|---|---|
| 0 | 4 | `magic` | `[u8;4]` | `b"KDAM"` — file-type identifier |
| 4 | 1 | `version` | `u8` | Format version; currently `0x01` |
| 5 | 3 | `reserved` | `[u8;3]` | Zero-filled; reserved for future flags |
| 8 | 4 | `base_len` | `u32` | Number of `i32` elements in the `base` array |
| 12 | 4 | `check_len` | `u32` | Number of `i32` elements in the `check` array |
| 16 | `base_len×4` | `base[]` | `i32[]` | DARTS base offsets, little-endian |
| `16 + base_len×4` | `check_len×4` | `check[]` | `i32[]` | DARTS parent-state indices, little-endian (`-1` = unused slot) |
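The layout can be exercised with Python's `struct` module: a synthetic blob with made-up 4-element arrays, packed and then decoded exactly per the table above:

```python
import struct

# Build a tiny dict.bin-shaped blob: 16-byte header + base[] + check[]
base = [0, 1, -1, 7]
check = [-1, 0, 0, 1]
blob = (b"KDAM" + bytes([1, 0, 0, 0])                # magic, version, 3 reserved bytes
        + struct.pack("<II", len(base), len(check))  # base_len, check_len (LE u32)
        + struct.pack(f"<{len(base)}i", *base)       # base[]  (LE i32)
        + struct.pack(f"<{len(check)}i", *check))    # check[] (LE i32)

# Decode it back, validating the same conditions Dict::from_bytes checks
assert len(blob) >= 16, "dict.bin too short"
assert blob[0:4] == b"KDAM", "dict.bin: bad magic"
assert blob[4] == 0x01, "dict.bin: unsupported version"
base_len, check_len = struct.unpack_from("<II", blob, 8)
decoded_base = list(struct.unpack_from(f"<{base_len}i", blob, 16))
decoded_check = list(struct.unpack_from(f"<{check_len}i", blob, 16 + 4 * base_len))
assert decoded_base == base and decoded_check == check
```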
### Lifecycle

```mermaid
flowchart LR
    WL(["words_th.txt\n62k words · CC0"])
    BS["build.rs\nbuild_trie() → from_trie()\nBFS base-allocation\nFreeBitmap O(n/64)"]
    BIN(["$OUT_DIR/dict.bin\n16-byte header\n+ base[] + check[]"])
    IB["include_bytes!\nembedded in binary"]
    RT["Dict::from_bytes()\none-pass LE decode\nO(S) — ~64 µs"]
    BD(["builtin_dict()\nready Dict"])
    WL --> BS --> BIN --> IB --> RT --> BD
    FQ(["tnc_freq.txt\n106k entries · CC0"])
    FM["include_str!\nembedded at compile time"]
    FP["FreqMap::builtin()\nparse TSV → BTreeMap"]
    FS(["FreqMap\nDP tie-breaking scorer"])
    FQ --> FM --> FP --> FS
```
### Validity guarantees

`Dict::from_bytes` panics on malformed input rather than returning an error, because failures always indicate a stale or corrupted build artifact — not a recoverable runtime condition. A clean `cargo build` regenerates a valid blob automatically.

| Condition checked | Panic message |
|---|---|
| `data.len() < 16` | `"dict.bin too short"` |
| Bytes 0–3 ≠ `b"KDAM"` | `"dict.bin: bad magic"` |
| Byte 4 ≠ `0x01` | `"dict.bin: unsupported version"` |
## License

Licensed under either of:

- Apache License, Version 2.0
- MIT License

at your option.