Unicode script classifier and pre-tokenizer.
Splits raw input into coarse, script-homogeneous Token spans before
the main segmenter runs. The segmenter only needs to apply the expensive
DAG algorithm to Thai spans; all other spans pass through unchanged.
§Pipeline position
raw text
│
▼
pre_tokenize() ← this module
│ splits into [Thai | Latin | Number | Whitespace | Emoji | Punctuation | Unknown]
▼
segmenter ← processes Thai spans with tcc + dict
│
▼
Vec<Token<'_>>
§Example
use kham_core::pre_tokenizer::pre_tokenize;
use kham_core::TokenKind;
let spans = pre_tokenize("ธนาคาร100แห่ง");
assert_eq!(spans[0].kind, TokenKind::Thai); // "ธนาคาร"
assert_eq!(spans[1].kind, TokenKind::Number); // "100"
assert_eq!(spans[2].kind, TokenKind::Thai); // "แห่ง"
§Functions
- classify_char - Classify a single Unicode scalar value into a TokenKind.
- is_emoji - Returns true if c belongs to one of the major Unicode emoji blocks.
- pre_tokenize - Split text into a sequence of script-homogeneous Token spans.
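To make the pre-tokenization step concrete, here is a minimal, self-contained sketch of how classify_char and pre_tokenize could work: classify each scalar value by Unicode range, then group consecutive characters of the same kind into maximal spans. This is a hypothetical re-implementation written from the signatures above, not the actual kham_core code; the real module's TokenKind, emoji ranges, and Token type may differ (for simplicity the sketch returns (TokenKind, &str) pairs instead of Token spans).

```rust
/// Hypothetical sketch; the real kham_core implementation may differ.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum TokenKind { Thai, Latin, Number, Whitespace, Emoji, Punctuation, Unknown }

/// Classify a single Unicode scalar value into a TokenKind.
fn classify_char(c: char) -> TokenKind {
    match c {
        '\u{0E00}'..='\u{0E7F}' => TokenKind::Thai, // Thai block U+0E00..U+0E7F
        'a'..='z' | 'A'..='Z' => TokenKind::Latin,
        '0'..='9' => TokenKind::Number,
        '\u{1F300}'..='\u{1FAFF}' => TokenKind::Emoji, // one major emoji range
        c if c.is_whitespace() => TokenKind::Whitespace,
        c if c.is_ascii_punctuation() => TokenKind::Punctuation,
        _ => TokenKind::Unknown,
    }
}

/// Split `text` into maximal runs of characters sharing one TokenKind.
fn pre_tokenize(text: &str) -> Vec<(TokenKind, &str)> {
    let mut spans = Vec::new();
    let mut start = 0;
    let mut kind = None;
    for (i, c) in text.char_indices() {
        let k = classify_char(c);
        match kind {
            Some(prev) if prev == k => {} // same run continues
            Some(prev) => {
                // kind changed: close the previous span, start a new one
                spans.push((prev, &text[start..i]));
                start = i;
                kind = Some(k);
            }
            None => kind = Some(k),
        }
    }
    if let Some(prev) = kind {
        spans.push((prev, &text[start..])); // flush the final span
    }
    spans
}

fn main() {
    // Mirrors the module example: Thai | Number | Thai
    let spans = pre_tokenize("ธนาคาร100แห่ง");
    assert_eq!(spans[0], (TokenKind::Thai, "ธนาคาร"));
    assert_eq!(spans[1], (TokenKind::Number, "100"));
    assert_eq!(spans[2], (TokenKind::Thai, "แห่ง"));
    println!("{:?}", spans);
}
```

Because the spans borrow from the input, the split allocates only the Vec itself, which matches the "pass through unchanged" design: non-Thai spans reach the segmenter as cheap slices of the original text.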