DAG-based maximal matching segmenter (newmm algorithm).
The segmenter builds a directed acyclic word graph over the input text, using TCC boundaries as candidate split points, then finds the path that maximises the number of dictionary matches (fewest unknown tokens).
§Pipeline
raw text
│
▼ (optional) Tokenizer::normalize() ← dedupes tone marks + composes Sara Am
│
▼ pre_tokenize()
[Thai span] [Number span] [Latin span] …
│
▼ (Thai spans only) tcc_boundaries()
TCC boundary positions: [0, b1, b2, …, len]
│
▼ DP over boundary indices
path of (start, end) pairs that maximises dict matches
│
▼
Vec<Token<'_>>
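The DP step above can be pictured as a longest-path computation over the boundary list. The following is a minimal sketch, not the crate's internal API: the name best_path, the HashSet dictionary, and the O(n²) edge probing are all illustrative assumptions (the real segmenter constrains candidate edges with its compiled dictionary).

use std::collections::HashSet;

// Illustrative boundary DP (assumed helper, not part of kham_core).
// `boundaries` are candidate split offsets including 0 and text.len(),
// as produced by the TCC step; `dict` stands in for the compiled
// dictionary.
fn best_path(text: &str, boundaries: &[usize], dict: &HashSet<&str>) -> Vec<(usize, usize)> {
    let n = boundaries.len();
    let mut score = vec![i32::MIN; n]; // best match count ending at boundary i
    let mut prev = vec![0usize; n];    // predecessor boundary on that path
    score[0] = 0;
    for i in 1..n {
        for j in 0..i {
            if score[j] == i32::MIN {
                continue; // boundary j unreachable
            }
            // A dictionary word scores 1, an unknown span 0, so the
            // winning path is the one with the fewest unknown tokens.
            let gain = i32::from(dict.contains(&text[boundaries[j]..boundaries[i]]));
            if score[j] + gain > score[i] {
                score[i] = score[j] + gain;
                prev[i] = j;
            }
        }
    }
    // Walk predecessors back from the final boundary to recover the
    // chosen (start, end) pairs.
    let mut path = Vec::new();
    let mut i = n - 1;
    while i > 0 {
        path.push((boundaries[prev[i]], boundaries[i]));
        i = prev[i];
    }
    path.reverse();
    path
}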
§Normalization and zero-copy
Tokenizer::segment is zero-copy: every Token borrows directly from
the &str you pass in. As a result, segment() cannot normalize the text
internally, because normalization may reorder or remove characters,
producing a new allocation with different byte offsets.
For input that may contain floating vowels (สระลอย) in the wrong order, stacked tone marks, or a decomposed Sara Am, use the two-step pattern:
use kham_core::Tokenizer;
let tok = Tokenizer::new();
let normalized = tok.normalize("กเินข้าว"); // reorder vowels, dedupe tone marks, compose Sara Am
let tokens = tok.segment(&normalized); // tokens borrow `normalized`
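Because segment() is zero-copy, the normalized String must outlive the returned tokens. Using only the calls shown above, the following scoping fails to borrow-check, which is the intended safety behaviour:

let tok = Tokenizer::new();
let tokens = {
    let normalized = tok.normalize("กเินข้าว");
    tok.segment(&normalized) // ERROR: `normalized` is dropped while still borrowed
};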
§Structs
- Tokenizer: High-level tokenizer. Holds a compiled dictionary and segmentation options.
- TokenizerBuilder: Builder for Tokenizer.