# kham

Thai word segmentation engine written in Rust. Fast, `no_std`-compatible core library with bindings for Python, WebAssembly, C, and a command-line interface.
## Features
- newmm algorithm — DAG-based maximal matching constrained to Thai Character Cluster (TCC) boundaries
- Multi-target — single core library ships as a Rust crate, Python wheel, WASM module, C shared library, and CLI binary
- Zero-copy API — `segment()` returns `&str` slices into the original input; no heap allocation per token
- no_std core — `kham-core` compiles for bare-metal targets (`alloc` only, no `std` dependency)
- Built-in dictionary — 62,102-word CC0-licensed Thai word list embedded at compile time; custom dictionaries loaded at runtime
- TNC frequency scoring — Thai National Corpus (CC0) raw counts guide the DP scorer to prefer statistically common segmentations when multiple dictionary paths tie
- Pre-compiled DARTS — Double-Array Trie is built once at compile time (`build.rs`) and loaded from a binary blob at runtime (~64 µs vs ~960 ms construction from text)
- Text normalization — tone-mark (วรรณยุกต์) dedup and Sara Am composition before segmentation
- Thai FTS pipeline — `FtsTokenizer` adds stopword filtering (1,029 built-in entries, PyThaiNLP Apache-2.0), synonym expansion (TSV-driven `SynonymMap`), and character n-gram fallback for OOV tokens; ready for PostgreSQL `tsvector` integration
- Structured CLI logging — `RUST_LOG`-controlled output with coloured log levels via `env_logger` + `colored`
## Packages

| Crate | Registry | Description |
|---|---|---|
| `kham-core` | crates.io | Pure Rust engine, `no_std` compatible |
| `kham-cli` | crates.io | `kham` binary (clap) |
| `kham-python` | PyPI | Python bindings via PyO3 / maturin |
| `kham-wasm` | npm | WebAssembly bindings via wasm-bindgen |
| `kham-capi` | crates.io | C FFI with cbindgen-generated header; includes FTS API |
| `kham-pg` | PGXN (coming soon) | PostgreSQL extension: custom text search parser for Thai |
## Quick start

### Rust
```toml
[dependencies]
kham-core = "0.1"
```

```rust
use kham_core::Tokenizer;

let tok = Tokenizer::new();
let tokens = tok.segment("กินข้าวกับปลา");
for t in &tokens {
    println!("{} ({:?})", t.text, t.kind);
}
// กิน (Thai)
// ข้าว (Thai)
// ...
```
Mixed script works out of the box:

```rust
let tokens = tok.segment("ธนาคาร100แห่ง");
assert_eq!(tokens[0].text, "ธนาคาร"); // Thai
assert_eq!(tokens[1].text, "100");   // Number
assert_eq!(tokens[2].text, "แห่ง");  // Thai
```
For input that may contain stacked tone marks or decomposed Sara Am, normalize first:

```rust
let normalized = tok.normalize(input); // tone dedup + Sara Am composition
let tokens = tok.segment(&normalized); // tokens borrow `normalized`
```
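As a conceptual sketch of these two normalization rules (not kham's implementation), in Python:

```python
import re

# Thai tone marks: U+0E48 MAI EK .. U+0E4B MAI CHATTAWA
TONE_RUN = re.compile(r"([\u0e48-\u0e4b])\1+")

def normalize(text: str) -> str:
    # 1. Collapse accidental repeats of the same tone mark: "ก้้" -> "ก้"
    text = TONE_RUN.sub(r"\1", text)
    # 2. Compose decomposed Sara Am:
    #    NIKHAHIT (U+0E4D) + SARA AA (U+0E32) -> SARA AM (U+0E33)
    return text.replace("\u0e4d\u0e32", "\u0e33")

assert normalize("ก้้") == "ก้"
assert normalize("ก\u0e4d\u0e32") == "กำ"
```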
### Python

```python
import kham

# Simple — list of token strings
words = kham.segment("กินข้าวกับปลา")
# ['กิน', 'ข้าว', 'กับ', 'ปลา']

# Rich — Token objects with span information
tokens = kham.tokenize("ธนาคาร100แห่ง")
for t in tokens:
    print(t.text, t.char_start, t.char_end, t.kind)
# ธนาคาร 0 6 Thai
# 100 6 9 Number
# แห่ง 9 13 Thai
```

Token attributes: `text`, `byte_start`, `byte_end`, `char_start`, `char_end`, `kind`.
### JavaScript / TypeScript (WASM)

```js
import init, { segment, tokenize } from "kham-wasm";

await init();

// Simple — array of token strings
const words = segment("กินข้าวกับปลา");
console.log(words); // ["กิน", "ข้าว", "กับ", "ปลา"]

// Rich — Token objects with span information
const tokens = tokenize("ธนาคาร100แห่ง");
for (const t of tokens) {
  console.log(t.text, t.char_start, t.char_end, t.kind);
}
// ธนาคาร 0 6 Thai
// 100 6 9 Number
// แห่ง 9 13 Thai
```

Token properties: `text`, `byte_start`, `byte_end`, `char_start`, `char_end`, `kind`.

Note on JS string offsets: `char_start`/`char_end` are Unicode scalar-value counts. For BMP text these equal JavaScript's `string.slice()` indices. For surrogate-pair emoji, use `byte_start`/`byte_end` with `TextEncoder` for precise byte-level slicing.
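The byte/char distinction is easy to verify: every Thai letter is one Unicode scalar value but three UTF-8 bytes, so the two span kinds diverge on Thai text. A standalone Python check:

```python
text = "ธนาคาร100แห่ง"

# "ธนาคาร" is 6 scalar values -> char span 0..6 ...
prefix = text[:6]
assert prefix == "ธนาคาร"

# ... but 18 UTF-8 bytes -> byte span 0..18 (3 bytes per Thai letter)
assert len(prefix.encode("utf-8")) == 18

# "100" follows at chars 6..9, which is bytes 18..21
assert text[6:9] == "100"
assert len(text[:9].encode("utf-8")) == 21
```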
### PostgreSQL

`kham-pg` is a PostgreSQL extension that registers a custom text search parser so you can index and query Thai text with `tsvector` / `tsquery`.

Prerequisites: Docker with BuildKit (for the test runner), or PostgreSQL dev headers and `pg_config` for a local install.
```sh
# Build and run pg_regress tests in Docker (67 tests across 4 suites)
# Manual install (if pg_config is in PATH)
# PGXN distribution zip (for upload to pgxn.org)
```
Once installed:
```sql
-- Register the extension
CREATE EXTENSION kham_pg;

-- Inspect token types produced by the parser
SELECT * FROM ts_token_type('kham');
-- 1  thai     Thai word
-- 2  latin    Latin script token
-- 3  number   Numeric token
-- 4  punct    Punctuation
-- 5  emoji    Emoji token
-- 6  unknown  Unknown / OOV token

-- Tokenise a document
SELECT * FROM ts_parse('kham', 'กินข้าวกับปลา');
-- 1  กิน
-- 1  ข้าว
-- 1  กับ
-- 1  ปลา

-- Build a tsvector (all token types indexed; kham_dict uses the simple template)
SELECT to_tsvector('kham', 'กินข้าวกับปลา');
-- 'กิน':1 'กับ':3 'ข้าว':2 'ปลา':4

-- Full-text search (AND)
SELECT title FROM articles
WHERE to_tsvector('kham', body) @@ plainto_tsquery('kham', 'ข้าว ปลา');

-- Phrase search (adjacent tokens)
SELECT title FROM articles
WHERE to_tsvector('kham', body) @@ phraseto_tsquery('kham', 'กิน ข้าว');

-- GIN index for large tables
CREATE INDEX ON articles USING GIN (to_tsvector('kham', body));

-- Ranked results
SELECT title,
       ts_rank(to_tsvector('kham', body), plainto_tsquery('kham', 'ปลา')) AS rank
FROM articles
WHERE to_tsvector('kham', body) @@ plainto_tsquery('kham', 'ปลา')
ORDER BY rank DESC;
```
Note: `ts_headline` is not supported — the kham parser has no HEADLINE callback. This is a known limitation of custom parsers in PostgreSQL.
## CLI

```sh
# Positional argument
kham "กินข้าวกับปลา"
# กิน|ข้าว|กับ|ปลา

# Custom separator
kham --sep " / " "สวัสดีชาวโลก"
# สวัสดี / ชาว / โลก

# Show token kinds
kham --kind "ธนาคาร100แห่ง"
# ธนาคาร:Thai|100:Number|แห่ง:Thai

# Show Unicode char spans
kham --spans "กินข้าวกับปลา"
# กิน:0-3|ข้าว:3-7|กับ:7-10|ปลา:10-13

# Combine kind and spans
kham --kind --spans "กินข้าว"
# กิน:Thai:0-3|ข้าว:Thai:3-7

# Normalize before segmenting
kham --normalize "กินข้าวกับปลา"

# Custom dictionary
kham --dict my_words.txt "กินข้าวกับปลา"

# Pipeline / stdin
echo "กินข้าวกับปลา" | kham
cat input.txt | kham > segmented.txt
```
Full options:

```text
Usage: kham [OPTIONS] [TEXT]

Arguments:
  [TEXT]  Thai text to segment. Reads from stdin line-by-line if omitted.

Options:
  -d, --dict <FILE>  Path to a custom word-list file (newline-separated)
  -s, --sep <SEP>    Output separator between tokens [default: |]
  -w, --whitespace   Include whitespace tokens in output
  -n, --normalize    Run normalize() before segmenting
  -k, --kind         Append token kind after each token (e.g. กิน:Thai)
      --spans        Append Unicode char span after each token (e.g. กิน:0-3)
  -h, --help         Print help
  -V, --version      Print version
```
Debug and timing output is controlled by the `RUST_LOG` environment variable:

```sh
RUST_LOG=info kham "กินข้าวกับปลา"
RUST_LOG=debug kham "กินข้าวกับปลา"
```
## C

Generate the header and link `libkham_capi`:

```c
#include <stdio.h>
#include "kham.h"

// Simple — array of token strings
KhamTokens *tokens = kham_segment("กินข้าวกับปลา");
for (size_t i = 0; i < tokens->len; i++)
    printf("%s\n", tokens->items[i]);
kham_tokens_free(tokens);

// Rich — KhamToken structs with full span information
KhamTokenList *list = kham_tokenize("ธนาคาร100แห่ง");
for (size_t i = 0; i < list->len; i++) {
    const KhamToken *t = &list->tokens[i];
    printf("%s char %zu..%zu %s\n", t->text, t->char_start, t->char_end, t->kind);
}
// ธนาคาร char 0..6 Thai
// 100 char 6..9 Number
// แห่ง char 9..13 Thai
kham_token_list_free(list);
```

`KhamToken` fields: `text`, `byte_start`, `byte_end`, `char_start`, `char_end`, `kind` (all null-terminated UTF-8 strings or `size_t`).
### FTS API (C)

Run the full Thai FTS pipeline from C to get stopword flags, synonym expansions, and OOV trigrams:

```c
// Annotated FTS tokens (all non-whitespace, with metadata)
KhamFtsTokenList *fts = kham_fts_tokenize("กินข้าวกับปลา");
for (size_t i = 0; i < fts->len; i++) {
    const KhamFtsToken *t = &fts->tokens[i];
    printf("%s pos=%zu stop=%d synonyms=%zu trigrams=%zu\n",
           t->text, t->position, (int)t->is_stop, t->synonyms_len, t->trigrams_len);
}
// กิน pos=0 stop=0 synonyms=0 trigrams=0
// ข้าว pos=1 stop=0 synonyms=0 trigrams=0
// กับ pos=2 stop=1 synonyms=0 trigrams=0
// ปลา pos=3 stop=0 synonyms=0 trigrams=0
kham_fts_token_list_free(fts);

// Flat lexeme array for tsvector population (stopwords removed)
size_t n = 0;
char **lexemes = kham_fts_lexemes("กินข้าวกับปลา", &n);
// lexemes[0] = "กิน", lexemes[1] = "ข้าว", lexemes[2] = "ปลา" (n = 3)
kham_lexemes_free(lexemes, n);
```

`KhamFtsToken` fields: `text`, `position` (`size_t`), `kind`, `is_stop` (`bool`), `synonyms`/`synonyms_len`, `trigrams`/`trigrams_len`.
## Token contract

Every `segment()` call returns `Vec<Token>`:

- `span` — byte offsets; use to slice `&str` directly (`&input[token.span.clone()]`)
- `char_span` — Unicode scalar-value offsets; use for Python/JavaScript string indexing where strings are char- or code-unit-indexed
- Both spans are always valid UTF-8 boundaries
- Joining all `token.text` values (with whitespace kept) reconstructs the original input exactly
```rust
use kham_core::Tokenizer;

let tok = Tokenizer::new();
let input = "ธนาคาร100แห่ง";
let tokens = tok.segment(input);

// ธนาคาร: 6 chars, 18 bytes
assert_eq!(tokens[0].char_span, 0..6);
assert_eq!(tokens[0].span, 0..18);

// 100: 3 chars, 3 bytes
assert_eq!(tokens[1].char_span, 6..9);
assert_eq!(tokens[1].span, 18..21);
```
## Custom dictionary

```rust
use kham_core::Tokenizer;

// From a string (newline-separated words)
let tok = Tokenizer::builder()
    .dict_words("ลินุกซ์\nเคอร์เนล")
    .build();

// From a file (requires the `std` feature)
let tok = Tokenizer::builder()
    .dict_file("my_words.txt")?
    .build();

// Keep whitespace tokens
let tok = Tokenizer::builder()
    .keep_whitespace(true)
    .build();
```
## Full-Text Search (FTS)

`kham-core` ships a complete Thai FTS pipeline on top of the segmenter. The `kham-pg` PostgreSQL extension (Phase 2) wraps this pipeline as a custom text search parser — see the PostgreSQL quick start above.
### Basic indexing

```rust
use kham_core::FtsTokenizer;

let fts = FtsTokenizer::new(); // built-in stopwords, no synonyms

// All tokens with metadata
let tokens = fts.segment_for_fts("กินข้าวกับปลา");
for t in &tokens {
    println!("{} pos={} stop={}", t.text, t.position, t.is_stop);
}
// กิน pos=0 stop=false
// ข้าว pos=1 stop=false
// กับ pos=2 stop=true   ← conjunction → filtered at index time
// ปลา pos=3 stop=false

// Flat lexeme list for tsvector (stopwords removed)
let lexemes = fts.lexemes("กินข้าวกับปลา");
// → ["กิน", "ข้าว", "ปลา"]
```
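The index-time flow (segment, then drop stopwords while preserving order) can be sketched independently of kham; the stopword set below is a toy subset chosen for illustration:

```python
STOPWORDS = {"กับ", "และ", "ที่"}  # toy subset; kham ships 1,029 entries

def lexemes(tokens: list[str]) -> list[str]:
    # Keep non-stopword tokens in their original order for tsvector population
    return [t for t in tokens if t not in STOPWORDS]

tokens = ["กิน", "ข้าว", "กับ", "ปลา"]
print(lexemes(tokens))  # ['กิน', 'ข้าว', 'ปลา']
```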
### Synonym expansion

Define a TSV file where each line maps a canonical form to one or more equivalents:

```text
คอม	คอมพิวเตอร์	computer
รถไฟฟ้า	BTS	MRT	รถไฟใต้ดิน
```

```rust
use kham_core::{FtsTokenizer, SynonymMap};

let synonyms = SynonymMap::from_tsv(include_str!("synonyms.tsv"));
let fts = FtsTokenizer::builder().synonyms(synonyms).build();

let lexemes = fts.lexemes("ซื้อคอมใหม่");
// → ["ซื้อ", "คอม", "คอมพิวเตอร์", "computer", "ใหม่"]
//            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ expanded
```
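The TSV convention can be modelled in a few lines; this Python sketch uses hypothetical helper names, not kham's API, to show the parse, merge, and expand behaviour described above:

```python
def parse_synonym_tsv(tsv: str) -> dict[str, list[str]]:
    # Each line: canonical<TAB>synonym1<TAB>synonym2 ...
    out: dict[str, list[str]] = {}
    for line in tsv.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2:
            # Duplicate canonicals merge rather than overwrite
            out.setdefault(fields[0], []).extend(fields[1:])
    return out

def expand(token: str, synmap: dict[str, list[str]]) -> list[str]:
    # Index-time expansion: the token itself, then its synonyms
    return [token] + synmap.get(token, [])

synmap = parse_synonym_tsv("คอม\tคอมพิวเตอร์\tcomputer\nรถไฟฟ้า\tBTS\tMRT\tรถไฟใต้ดิน")
print(expand("คอม", synmap))  # ['คอม', 'คอมพิวเตอร์', 'computer']
```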
### Custom stopwords

```rust
use kham_core::{FtsTokenizer, StopwordSet};

// Add domain-specific stopwords on top of the built-in list
let extra = StopwordSet::from_text("ครับ\nค่ะ\nนะ");
let fts = FtsTokenizer::builder().stopwords(extra).build();
```
### OOV (out-of-vocabulary) n-grams

Words not in the dictionary are emitted as `TokenKind::Unknown`. The FTS pipeline automatically generates character n-grams for these tokens so they remain searchable:

```rust
// Default ngram_size = 3 (trigrams)
// Unknown token "สกรีน" (3-char TCC clusters) → ["สกร", "กรี", "รีน"]

// Disable n-gram generation:
let fts = FtsTokenizer::builder().ngram_size(0).build();
```
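A per-scalar sliding window reproduces the example above; this is a simplified sketch, since kham's ngram module may window over TCC clusters rather than raw scalar values:

```python
def char_ngrams(text: str, n: int = 3) -> list[str]:
    # Sliding window of n Unicode scalar values; short tokens pass through whole
    if len(text) < n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("สกรีน"))  # ['สกร', 'กรี', 'รีน']
```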
### FtsToken fields

| Field | Type | Description |
|---|---|---|
| `text` | `String` | Token text (normalised) |
| `position` | `usize` | Ordinal index in non-whitespace sequence (0-based) |
| `kind` | `TokenKind` | Thai / Latin / Number / … / Unknown |
| `is_stop` | `bool` | Matched the stopword list |
| `synonyms` | `Vec<String>` | Synonym expansions (empty if none) |
| `trigrams` | `Vec<String>` | Char n-grams for Unknown tokens only |
## Architecture

### Workspace crate graph

```mermaid
graph LR
    core["<b>kham-core</b><br/><i>no_std · alloc only</i><br/>segmentation engine"]
    cli["<b>kham-cli</b><br/>kham binary<br/>(clap)"]
    python["<b>kham-python</b><br/>Python wheel<br/>(PyO3 · maturin)"]
    wasm["<b>kham-wasm</b><br/>WASM module<br/>(wasm-bindgen)"]
    capi["<b>kham-capi</b><br/>C shared library<br/>(cbindgen)<br/>segment · FTS · lexemes"]
    pg["<b>kham-pg</b><br/>PostgreSQL extension<br/>(C shim · cdylib)"]
    core --> cli
    core --> python
    core --> wasm
    core --> capi
    core --> pg
```
### Core module responsibilities

```mermaid
classDiagram
    direction LR
    class normalizer {
        +normalize(text) String
        --
        tone-mark dedup
        Sara Am composition
    }
    class pre_tokenizer {
        +pre_tokenize(text) Vec~Token~
        +classify_char(c) TokenKind
        --
        Unicode script split
        Thai · Latin · Number
        Emoji · Punct · WS
    }
    class tcc {
        +tcc_boundaries(text) Vec~usize~
        +tcc_iter(text) Iterator
        --
        Thai Character Cluster
        boundary detection
        Theeramunkong 2000
    }
    class dict {
        +builtin_dict() Dict
        +from_word_list(text) Dict
        +from_bytes(data) Dict
        +contains(word) bool
        +prefixes(text) Vec~str~
        --
        Double-Array Trie
        O(k) byte-level lookup
        pre-compiled binary blob
        built-in CC0 word list
    }
    class freq {
        +FreqMap::builtin() FreqMap
        +from_tsv(data) FreqMap
        +get(word) u32
        --
        TNC raw occurrence counts
        CC0 · 106k entries
        DP tie-breaking scorer
    }
    class segmenter {
        +segment(text) Vec~Token~
        +normalize(text) String
        --
        newmm DAG algorithm
        DP over TCC boundaries
        min unknowns · max dict words
        TNC freq · min token count
    }
    class token {
        +text : &str
        +span : Range~usize~
        +char_span : Range~usize~
        +kind : TokenKind
        --
        Thai · Latin · Number
        Punctuation · Emoji
        Whitespace · Unknown
    }
    class stopwords {
        +StopwordSet::builtin() StopwordSet
        +from_text(data) StopwordSet
        +contains(word) bool
        --
        1029 entries · Apache-2.0
        sorted Vec binary search
        O(log n) lookup
    }
    class synonym {
        +SynonymMap::from_tsv(data) SynonymMap
        +expand(word) Option~slice~
        +has_synonyms(word) bool
        --
        BTreeMap canonical→synonyms
        TSV format
        duplicate canonicals merge
    }
    class ngram {
        +char_ngrams(text, n) Iterator
        +token_ngrams(tokens, n) Iterator
        --
        zero-alloc char slices
        OOV fallback indexing
        phrase proximity
    }
    class fts {
        +FtsTokenizer::new() FtsTokenizer
        +segment_for_fts(text) Vec~FtsToken~
        +index_tokens(text) Vec~FtsToken~
        +lexemes(text) Vec~String~
        --
        FtsToken: text · position
        is_stop · synonyms · trigrams
        PostgreSQL tsvector entry point
    }
    segmenter ..> normalizer : calls
    segmenter ..> pre_tokenizer : calls
    segmenter ..> tcc : calls
    segmenter ..> dict : queries
    segmenter ..> freq : scores
    segmenter ..> token : emits
    pre_tokenizer ..> token : emits
    fts ..> segmenter : wraps
    fts ..> stopwords : filters
    fts ..> synonym : expands
    fts ..> ngram : OOV grams
```
### Segmentation pipeline

```mermaid
flowchart TD
    INPUT(["<b>raw &str</b>"])
    subgraph OPTIONAL["optional — call before segment()"]
        NORM["<b>normalizer::normalize()</b>\ntone-mark dedup\nSara Am อํ+อา → อำ"]
    end
    PRE["<b>pre_tokenizer::pre_tokenize()</b>\nUnicode script classification\nsplit into homogeneous spans"]
    SPLIT{span kind?}
    PASS["pass through\nas-is"]
    subgraph THAI_PATH["Thai span processing"]
        TCC["<b>tcc::tcc_boundaries()</b>\nTCC boundary positions\n= legal word-break points"]
        DICT["<b>dict::prefixes()</b>\nDARTS prefix search\nat each boundary"]
        DAG["<b>DP over boundary graph</b>\nminimise unknown tokens\nmaximise dict-word count\nTNC frequency score · fewest tokens"]
    end
    MERGE(["<b>Vec<Token<'_>></b>\nzero-copy &str slices"])
    INPUT --> OPTIONAL
    OPTIONAL --> PRE
    PRE --> SPLIT
    SPLIT -->|"Thai"| TCC
    SPLIT -->|"Latin · Number\nEmoji · Punct · WS"| PASS
    TCC --> DICT
    DICT --> DAG
    DAG --> MERGE
    PASS --> MERGE
```
### DAG segmentation detail

```mermaid
flowchart LR
    subgraph INPUT["Thai span: กินข้าว"]
        direction LR
        C0(["pos 0"])
        C1(["pos 3\nกิ"])
        C2(["pos 6\nน"])
        C3(["pos 9\nข้"])
        C4(["pos 15\nาว"])
        C5(["pos 21\nend"])
    end
    C0 -->|"กิน ✓ dict"| C2
    C0 -.->|"กิ unknown"| C1
    C1 -.->|"น unknown"| C2
    C2 -->|"ข้าว ✓ dict"| C5
    C2 -.->|"ข้ unknown"| C3
    C3 -.->|"าว unknown"| C4
    BEST["DP picks bold path:\nกิน · ข้าว\n= 2 dict words"]
    C5 --- BEST
```
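The selection rule in the diagram (fewest unknown spans, then fewest tokens) can be reproduced with a small dynamic program over a toy dictionary. This sketch ignores TCC boundary restrictions and TNC frequency tie-breaking, both of which the real engine applies:

```python
# Toy dictionary; the built-in dictionary has 62,102 entries
DICT = {"กิน", "ข้าว", "กับ", "ปลา", "กิ"}
MAX_WORD = max(len(w) for w in DICT)

def segment(text: str) -> list[str]:
    n = len(text)
    # best[i] = (unknown_count, token_count, tokens) for the prefix text[:i]
    best = [(0, 0, [])] + [None] * n
    for i in range(1, n + 1):
        for j in range(max(0, i - MAX_WORD), i):
            if best[j] is None:
                continue
            piece = text[j:i]
            unk = 0 if piece in DICT else 1
            cand = (best[j][0] + unk, best[j][1] + 1, best[j][2] + [piece])
            # Lexicographic order: fewer unknown spans first, then fewer tokens
            if best[i] is None or cand[:2] < best[i][:2]:
                best[i] = cand
    return best[n][2]

print(segment("กินข้าวกับปลา"))  # ['กิน', 'ข้าว', 'กับ', 'ปลา']
```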
## Prerequisites

### All targets

| Tool | Version | Install |
|---|---|---|
| Rust toolchain | ≥ 1.85 (MSRV) | `curl -sSf https://sh.rustup.rs \| sh` |
| Cargo | ships with Rust | — |

Verify: `rustc --version`
### WASM (kham-wasm)

| Tool | Version | Install |
|---|---|---|
| `wasm32-unknown-unknown` target | — | `rustup target add wasm32-unknown-unknown` |
| `wasm-pack` | ≥ 0.13 | `cargo install wasm-pack` |

`wasm-pack` wraps `cargo build --target wasm32-unknown-unknown` and `wasm-bindgen-cli` to produce the `.wasm` binary and JavaScript/TypeScript glue in one step.
### Python (kham-python)

| Tool | Version | Install |
|---|---|---|
| Python | ≥ 3.8 | system package manager or python.org |
| `maturin` | ≥ 1.0 | `pip install maturin` |

`maturin` compiles the PyO3 extension module and installs it into the active virtual environment. Always run inside a venv or conda environment.
```sh
python -m venv .venv && source .venv/bin/activate
cd kham-python && maturin develop --release
```
The crate targets Python ≥ 3.8 (abi3-py38 stable ABI) — a single wheel runs on 3.8 through 3.13+.
### C (kham-capi)

| Tool | Version | Install |
|---|---|---|
| `cbindgen` | ≥ 0.26 | `cargo install cbindgen` |
| C compiler | any C11-capable compiler | system package manager |
### PostgreSQL (kham-pg)

| Tool | Version | Install |
|---|---|---|
| Docker with BuildKit | ≥ 24 | docs.docker.com |
| `make` | any | system package manager |

For local (non-Docker) builds, also install:

| Tool | Version | Install |
|---|---|---|
| PostgreSQL dev headers | 14–17 | Linux: `apt install postgresql-server-dev-17` · macOS: `brew install postgresql@17` |
| `pg_config` | ships with dev headers | — |
| C compiler | any C11-capable compiler | system package manager |
| GNU gettext | any | macOS only: `brew install gettext` (provides `libintl.h` required by PG headers) |
`cbindgen` reads `kham-capi/src/lib.rs` and `kham-capi/cbindgen.toml` to generate `kham.h`. Link against the compiled `libkham_capi` (`.so` / `.dylib` / `.dll`).

```sh
# macOS:   target/release/libkham_capi.dylib
# Linux:   target/release/libkham_capi.so
# Windows: target/release/kham_capi.dll
```
## Building

The `kham-core` build script (`build.rs`) pre-compiles the built-in dictionary into a binary DARTS blob (`$OUT_DIR/dict.bin`) on every `cargo build`. It only reruns when `build.rs` or `data/words_th.txt` change.

Binding targets (after installing the prerequisites above):
&&
### Deploy script

`scripts/deploy.sh` publishes any combination of packages in the correct dependency order. It runs `cargo fmt`, `cargo clippy`, and `cargo test` before any upload, and requires a `MATURIN_PYPI_TOKEN` env var for PyPI plus an active `npm login` session for npm.
## CI / CD

Two GitHub Actions workflows run automatically.

CI (`ci.yml`) — every push and pull request to `main` / `develop`:

| Job | What it checks |
|---|---|
| `fmt` | `cargo fmt --check` |
| `clippy` | `cargo clippy -D warnings` |
| `test` | Unit + integration + doc tests on stable and MSRV 1.85, Linux and macOS |
| `no_std` | `kham-core` compiles for `thumbv7em-none-eabihf` (bare metal) |
| `wasm` | `wasm-pack build --target web` succeeds |
| `python` | `maturin develop` on Python 3.8 and 3.12 |
| `bench_compile` | Benchmark suite compiles without errors |
| `pg_regress` | 67 SQL correctness tests across 4 suites (kham_fts, kham_thai, kham_operators, kham_ranking) inside Docker PostgreSQL 17 |
Release (`release.yml`) — on `v*.*.*` tag push

Publishes to all registries after the CI gate passes:

```mermaid
flowchart LR
    TAG(["git tag v0.1.0\ngit push --tags"])
    CI["CI gate\n(full test matrix)"]
    CRATES["crates.io\nkham-core + kham-cli"]
    PYPI["PyPI\nkham wheels\n(manylinux · macOS · Windows)"]
    NPM["npm\nkham-wasm"]
    GH["GitHub Release\nauto release notes\n+ wheel artifacts"]
    TAG --> CI
    CI --> CRATES
    CI --> PYPI
    CI --> NPM
    CRATES --> GH
    PYPI --> GH
    NPM --> GH
```
### Required secrets

| Secret | Used for |
|---|---|
| `CARGO_REGISTRY_TOKEN` | crates.io publish |
| `NPM_TOKEN` | npm publish |
| PyPI — no secret needed | OIDC trusted publishing; configure via pypi.org Trusted Publisher |
To cut a release:

```sh
git tag v0.1.0
git push --tags
```
## Benchmarks

### Environment
| Field | Value |
|---|---|
| CPU | Apple M-series (arm64) |
| OS | macOS 26.4.1 |
| Rust | 1.94.1 (stable) |
| Profile | release (LTO enabled) |
| Built-in dictionary | 62,102 words · 669,387 DARTS states · 5.1 MiB |
| TNC frequency table | 106,125 entries |
### Segmentation throughput (`segment/by_length`)
Pure Thai input, built-in dictionary, no custom dict.
| Input | Size | Time (median) | Throughput |
|---|---|---|---|
| short | 37 B | 879 ns | 42.3 MiB/s |
| medium | 182 B | 3.80 µs | 45.1 MiB/s |
| long | 546 B | 10.9 µs | 47.1 MiB/s |
### Mixed-script throughput (`segment/mixed`)

Thai + Latin + Number in the same input, measuring pre-tokenizer boundary overhead.

| Input | Size | Time (median) | Throughput |
|---|---|---|---|
| sparse (ธนาคาร100แห่ง) | 26 B | 744 ns | 42.3 MiB/s |
| medium (multi-boundary) | 74 B | 1.73 µs | 43.5 MiB/s |
| dense (alternating script) | 29 B | 535 ns | 55.3 MiB/s |
### Normalize + segment (`normalize_then_segment/medium`)

| Operation | Time (median) |
|---|---|
| `normalize()` then `segment()` on medium input | 4.09 µs |
### Normalization throughput (`normalize/thai`)
| Input | Size | Time (median) | Throughput |
|---|---|---|---|
| short | 37 B | 79.9 ns | 465 MiB/s |
| medium | 182 B | 199 ns | 864 MiB/s |
| long | 546 B | 507 ns | 1.0 GiB/s |
### Dictionary construction (`dict/construction`)

| Operation | Time (median) | Notes |
|---|---|---|
| `builtin_dict()` — binary blob load | 78 µs | pay-once startup cost |
| `Dict::from_word_list` — 62k words | 980 ms | only when merging a custom dict |
| `Dict::from_word_list` — 8-word list | 3.72 µs | small custom dict |
| `dict/file/read_and_build` — disk + build | 1.01 s | `kham --dict <file>` startup |
| `Tokenizer::builder().dict_file().build()` | 1.04 s | full CLI code path with custom dict |

`builtin_dict()` is ~12,500× faster than `Dict::from_word_list` because the DARTS trie is pre-compiled by `build.rs` at compile time; the runtime cost is a single O(S) binary decode pass. `Dict::from_word_list` runs only when a user-supplied custom dictionary is merged with the built-in list.
### Dictionary lookup (`dict/contains`, `dict/prefixes`)

| Operation | Time (median) | Throughput |
|---|---|---|
| `contains` — hit (3-byte word กิน) | 7.1 ns | 1.18 GiB/s |
| `contains` — hit (12-byte word สวัสดี) | 18.3 ns | 940 MiB/s |
| `contains` — miss (ASCII non-word) | 744 ps | 7.5–8.8 GiB/s |
| `prefixes` — short anchor (7 B) | 42.3 ns | 473 MiB/s |
| `prefixes` — medium anchor (60 B) | 36.7 ns | 1.52 GiB/s |
| `prefixes` — long anchor (97 B) | 74.5 ns | 1.24 GiB/s |
### TNC frequency table (`freq/construction`, `freq/get`)

| Operation | Time (median) | Notes |
|---|---|---|
| `FreqMap::builtin()` — parse 106k TSV entries | 22.1 ms | pay-once startup cost |
| `FreqMap::get` — common word hit (กิน) | 67.8 ns | O(log n) BTreeMap |
| `FreqMap::get` — rare word hit | 48.6 ns | |
| `FreqMap::get` — miss | 56.5 ns | |

The `FreqMap::builtin()` startup cost (~22 ms) is the dominant component of `Tokenizer::new()`. It is paid once per tokenizer instance; the returned `FreqMap` is reused across all `segment()` calls.
Run locally:

```sh
cargo bench
# HTML report: target/criterion/report/index.html
```
### PostgreSQL extension (kham-pg)

The `kham-pg` extension is benchmarked at the SQL level using `pgbench` inside the Docker test container, plus system-level CPU/memory via `docker stats`.

#### 1 · Latency — psql `\timing`

```sql
\timing on
SELECT to_tsvector('kham', 'กินข้าวกับปลา Python 3 สำหรับนักพัฒนา');

-- Per-node breakdown
EXPLAIN (ANALYZE, BUFFERS)
SELECT to_tsvector('kham', body) FROM documents LIMIT 1000;
```
#### 2 · Throughput — pgbench

Create `bench_fts.sql`:

```sql
SELECT to_tsvector('kham', 'กินข้าวกับปลา Python 3 สำหรับนักพัฒนา');
```

Run via Docker:

```sh
# Terminal 1 — watch CPU/memory while bench runs
docker stats

# Terminal 2 — throughput bench (4 clients, 30 seconds)
pgbench -c 4 -T 30 -f bench_fts.sql
# Output: TPS, latency avg/stddev
```
#### 3 · Index build time — realistic workload

```sql
CREATE TABLE docs (id serial, body text);

INSERT INTO docs (body)
SELECT 'กินข้าวกับปลา Python ' || g
FROM generate_series(1, 100000) g;

\timing on
CREATE INDEX ON docs USING GIN (to_tsvector('kham', body));

-- Query latency against the index
SELECT count(*) FROM docs
WHERE to_tsvector('kham', body) @@ plainto_tsquery('kham', 'ปลา');
```
## Dictionary and corpus data

| File | License | Entries | Purpose |
|---|---|---|---|
| `data/words_th.txt` | CC0 | 62,102 words | Built-in segmentation dictionary |
| `data/tnc_freq.txt` | CC0 | 106,125 entries | TNC raw counts → DP tie-breaking scorer |
| `data/stopwords_th.txt` | Apache-2.0 (PyThaiNLP) | 1,029 words | FTS stopword filter |

Custom dictionaries are newline-separated plain text files; lines beginning with `#` are treated as comments.
The frequency table is embedded at compile time and loaded into a `FreqMap` at runtime. The newmm DP scorer uses it as the third tiebreaker — after minimising unknown tokens and maximising dictionary matches — so statistically more common segmentations are preferred when multiple paths are otherwise equal. Frequency data is kept separate from `dict.bin`; do not merge them.

The stopword list is sourced from PyThaiNLP (Apache-2.0) and embedded via `include_str!`. Attribution is preserved in the header of `stopwords_th.txt`. The list is sorted and deduplicated at runtime into a `StopwordSet` backed by binary search.
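A minimal sketch of the same sorted-list-plus-binary-search design, using Python's `bisect` over a toy subset:

```python
import bisect

# Toy subset; the built-in list has 1,029 sorted, deduplicated entries
STOPWORDS = sorted({"กับ", "ที่", "และ", "ใน"})

def is_stop(word: str) -> bool:
    # O(log n) membership via binary search over the sorted list
    i = bisect.bisect_left(STOPWORDS, word)
    return i < len(STOPWORDS) and STOPWORDS[i] == word

print(is_stop("กับ"), is_stop("ปลา"))  # True False
```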
Constraint: Never ship BEST corpus data or any non-Apache-2.0/CC0 material in this repository.
## Pre-compiled DARTS binary (dict.bin)

`build.rs` compiles the built-in word list into a binary Double-Array Trie blob (`$OUT_DIR/dict.bin`) once at build time. At runtime, `builtin_dict()` loads this blob via `Dict::from_bytes`, which is ~15,000× faster than reconstructing the trie from the text word list (~64 µs vs ~960 ms).
### File format

All multi-byte integers are little-endian. The file begins with a fixed 16-byte header followed immediately by the two DARTS arrays.

| Offset | Size (bytes) | Field | Type | Description |
|---|---|---|---|---|
| 0 | 4 | `magic` | `[u8;4]` | `b"KDAM"` — file-type identifier |
| 4 | 1 | `version` | `u8` | Format version; currently `0x01` |
| 5 | 3 | `reserved` | `[u8;3]` | Zero-filled; reserved for future flags |
| 8 | 4 | `base_len` | `u32` | Number of `i32` elements in the `base` array |
| 12 | 4 | `check_len` | `u32` | Number of `i32` elements in the `check` array |
| 16 | `base_len×4` | `base[]` | `i32[]` | DARTS base offsets, little-endian |
| `16 + base_len×4` | `check_len×4` | `check[]` | `i32[]` | DARTS parent-state indices, little-endian (`-1` = unused slot) |
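The layout can be exercised with Python's `struct` module: a synthetic blob with made-up 4-element arrays, packed and then decoded exactly per the table above:

```python
import struct

# Build a tiny dict.bin-shaped blob: 16-byte header + base[] + check[]
base = [0, 1, -1, 7]
check = [-1, 0, 0, 1]
blob = (b"KDAM" + bytes([1, 0, 0, 0])                # magic, version, 3 reserved bytes
        + struct.pack("<II", len(base), len(check))  # base_len, check_len (LE u32)
        + struct.pack(f"<{len(base)}i", *base)       # base[]  (LE i32)
        + struct.pack(f"<{len(check)}i", *check))    # check[] (LE i32)

# Decode it back, validating the same conditions Dict::from_bytes checks
assert len(blob) >= 16, "dict.bin too short"
assert blob[0:4] == b"KDAM", "dict.bin: bad magic"
assert blob[4] == 0x01, "dict.bin: unsupported version"
base_len, check_len = struct.unpack_from("<II", blob, 8)
decoded_base = list(struct.unpack_from(f"<{base_len}i", blob, 16))
decoded_check = list(struct.unpack_from(f"<{check_len}i", blob, 16 + 4 * base_len))
assert decoded_base == base and decoded_check == check
```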
### Lifecycle

```mermaid
flowchart LR
    WL(["words_th.txt\n62k words · CC0"])
    BS["build.rs\nbuild_trie() → from_trie()\nBFS base-allocation\nFreeBitmap O(n/64)"]
    BIN(["$OUT_DIR/dict.bin\n16-byte header\n+ base[] + check[]"])
    IB["include_bytes!\nembedded in binary"]
    RT["Dict::from_bytes()\none-pass LE decode\nO(S) — ~64 µs"]
    BD(["builtin_dict()\nready Dict"])
    WL --> BS --> BIN --> IB --> RT --> BD
    FQ(["tnc_freq.txt\n106k entries · CC0"])
    FM["include_str!\nembedded at compile time"]
    FP["FreqMap::builtin()\nparse TSV → BTreeMap"]
    FS(["FreqMap\nDP tie-breaking scorer"])
    FQ --> FM --> FP --> FS
```
### Validity guarantees

`Dict::from_bytes` panics on malformed input rather than returning an error, because failures always indicate a stale or corrupted build artifact — not a recoverable runtime condition. A clean `cargo build` regenerates a valid blob automatically.

| Condition checked | Panic message |
|---|---|
| `data.len() < 16` | `"dict.bin too short"` |
| Bytes 0–3 ≠ `b"KDAM"` | `"dict.bin: bad magic"` |
| Byte 4 ≠ `0x01` | `"dict.bin: unsupported version"` |
## License

Licensed under either of:

- Apache License, Version 2.0
- MIT License

at your option.