
Module segmenter


DAG-based maximal matching segmenter (newmm algorithm).

The segmenter builds a directed acyclic graph (DAG) of candidate word spans over the input text, using TCC boundaries as the only candidate split points, then finds the path through the graph that maximises the number of dictionary matches (equivalently, minimises the number of unknown tokens).
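The maximisation step can be sketched as a dynamic program over the candidate boundaries. This is a minimal illustration, not the crate's implementation: it assumes a flat `HashSet` dictionary (the real `Tokenizer` holds a compiled dictionary) and scores each out-of-dictionary piece as one unknown token.

```rust
use std::collections::HashSet;

/// Toy maximal-matching DP. `bounds` are candidate split offsets in bytes
/// (including 0 and `text.len()`), playing the role of TCC boundaries.
/// Returns the (start, end) byte spans of the path with fewest unknown tokens.
fn maximal_match(text: &str, bounds: &[usize], dict: &HashSet<&str>) -> Vec<(usize, usize)> {
    let n = bounds.len();
    // cost[i] = fewest unknown tokens needed to segment text[bounds[i]..]
    let mut cost = vec![usize::MAX; n];
    let mut next = vec![0usize; n]; // best successor boundary index
    cost[n - 1] = 0;
    for i in (0..n - 1).rev() {
        for j in (i + 1)..n {
            let piece = &text[bounds[i]..bounds[j]];
            let step = if dict.contains(piece) { 0 } else { 1 };
            if cost[j] != usize::MAX && cost[j] + step < cost[i] {
                cost[i] = cost[j] + step;
                next[i] = j;
            }
        }
    }
    // Walk the best-successor chain to recover the chosen spans.
    let mut spans = Vec::new();
    let mut i = 0;
    while i < n - 1 {
        spans.push((bounds[i], bounds[next[i]]));
        i = next[i];
    }
    spans
}
```

The quadratic scan over `(i, j)` pairs is for clarity; a production segmenter would bound the inner loop by the longest dictionary entry.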

§Pipeline

raw text
  │
  ▼  (optional) Tokenizer::normalize()   ← deduplicates tone marks, composes Sara Am
  │
  ▼  pre_tokenize()
[Thai span] [Number span] [Latin span] …
  │
  ▼  (Thai spans only) tcc_boundaries()
TCC boundary positions: [0, b1, b2, …, len]
  │
  ▼  DP over boundary indices
path of (start, end) pairs that maximises dict matches
  │
  ▼
Vec<Token<'_>>
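The first stage above, splitting the input into script spans, can be sketched as a scan that groups consecutive characters by class. This is an illustrative stand-in for `pre_tokenize()`, assuming a three-way classification (the real function may distinguish more classes, e.g. whitespace or punctuation):

```rust
/// Coarse script classes, mirroring the span kinds in the pipeline diagram.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Script {
    Thai,
    Number,
    Latin,
    Other,
}

fn classify(c: char) -> Script {
    if ('\u{0E00}'..='\u{0E7F}').contains(&c) {
        Script::Thai // Unicode Thai block
    } else if c.is_ascii_digit() {
        Script::Number
    } else if c.is_ascii_alphabetic() {
        Script::Latin
    } else {
        Script::Other
    }
}

/// Split `text` into maximal runs of one script class, borrowing the input.
fn script_spans(text: &str) -> Vec<(Script, &str)> {
    let mut runs: Vec<(Script, usize, usize)> = Vec::new();
    for (i, c) in text.char_indices() {
        let s = classify(c);
        match runs.last_mut() {
            // Extend the current run if the class is unchanged…
            Some((last, _, end)) if *last == s => *end = i + c.len_utf8(),
            // …otherwise open a new run at this byte offset.
            _ => runs.push((s, i, i + c.len_utf8())),
        }
    }
    runs.into_iter().map(|(s, a, b)| (s, &text[a..b])).collect()
}
```

Only the Thai spans would then be handed to `tcc_boundaries()` and the DP; number and Latin spans pass through as whole tokens.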

§Normalization and zero-copy

Tokenizer::segment is zero-copy: every Token borrows directly from the &str you pass in. This means segment() cannot internally normalize the text (normalization may reorder/remove characters, producing a new allocation with different byte offsets).
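The zero-copy contract can be made concrete with a small sketch. The field names here are illustrative, not the crate's actual `Token` definition; the point is the lifetime: every token borrows from the input `&str`, so the input must outlive the token vector.

```rust
/// Hypothetical borrowed token: the matched slice plus its byte span.
#[derive(Debug, PartialEq)]
struct Token<'a> {
    text: &'a str,
    start: usize,
    end: usize,
}

/// Turning DP output spans into tokens allocates only the Vec itself;
/// each `text` field is a view into `input`, never a copy.
fn spans_to_tokens<'a>(input: &'a str, spans: &[(usize, usize)]) -> Vec<Token<'a>> {
    spans
        .iter()
        .map(|&(start, end)| Token { text: &input[start..end], start, end })
        .collect()
}
```

Because the spans are byte offsets into `input`, any transformation that changes the byte layout (such as normalization) must happen before segmentation, on a separately owned string.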

For input that may contain misordered vowels (สระลอย, “floating vowels”), stacked tone marks, or decomposed Sara Am, use the two-step pattern:

use kham_core::Tokenizer;

let tok = Tokenizer::new();
let normalized = tok.normalize("กเินข้าว"); // fix any encoding issues
let tokens = tok.segment(&normalized);       // tokens borrow `normalized`
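The reason `normalize()` must return an owned `String` follows from plain Unicode arithmetic, independent of the crate's API: decomposed Sara Am is two scalar values (nikhahit U+0E4D + sara aa U+0E32), while the composed form is the single character U+0E33, so recomposition changes the byte length and shifts every later offset.

```rust
/// Byte lengths of decomposed vs composed Sara Am in UTF-8.
/// Both U+0E4D and U+0E32 encode as 3 bytes; so does U+0E33.
fn sara_am_byte_lengths() -> (usize, usize) {
    let decomposed = "\u{0E4D}\u{0E32}"; // nikhahit + sara aa: 6 bytes
    let composed = "\u{0E33}";           // precomposed Sara Am: 3 bytes
    (decomposed.len(), composed.len())
}
```

A token ending after the vowel would land at different byte offsets in the two forms, which is exactly why `segment()` refuses to normalize behind your back.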

§Structs

Tokenizer
High-level tokenizer. Holds a compiled dictionary and segmentation options.
TokenizerBuilder
Builder for Tokenizer.
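The builder pair above suggests the usual Rust consuming-builder pattern. This sketch is hypothetical: the option names (`keep_whitespace`, `add_word`) are illustrative placeholders, not taken from the crate's actual API.

```rust
/// Illustrative builder shape; field and method names are assumptions.
#[derive(Default)]
struct TokenizerBuilder {
    keep_whitespace: bool,
    custom_words: Vec<String>,
}

struct Tokenizer {
    keep_whitespace: bool,
    custom_words: Vec<String>,
}

impl TokenizerBuilder {
    fn new() -> Self {
        Self::default()
    }

    /// Each setter consumes and returns the builder, enabling chaining.
    fn keep_whitespace(mut self, yes: bool) -> Self {
        self.keep_whitespace = yes;
        self
    }

    /// Add a user word on top of the compiled dictionary.
    fn add_word(mut self, word: &str) -> Self {
        self.custom_words.push(word.to_string());
        self
    }

    /// Finalize into an immutable `Tokenizer`.
    fn build(self) -> Tokenizer {
        Tokenizer {
            keep_whitespace: self.keep_whitespace,
            custom_words: self.custom_words,
        }
    }
}
```

The consuming style keeps the built `Tokenizer` immutable: all configuration happens before `build()`, so segmentation options cannot change mid-use.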