
Module pre_tokenizer

Unicode script classifier and pre-tokenizer.

Splits raw input into coarse, script-homogeneous Token spans before the main segmenter runs. The segmenter only needs to apply the expensive DAG algorithm to Thai spans; all other spans pass through unchanged.
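The grouping pass can be pictured as a single scan that classifies each scalar value and flushes a span whenever the kind changes. The sketch below uses simplified stand-ins for the module's types: the real `TokenKind` carries more variants (e.g. `Emoji`), the real `Token` may carry offset metadata, and the classification ranges here are assumptions for illustration only.

```rust
// Simplified stand-ins for the crate's types; the real definitions
// in kham_core may differ (more variants, span offsets, etc.).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TokenKind { Thai, Latin, Number, Whitespace, Punctuation, Unknown }

#[derive(Debug, PartialEq)]
struct Token<'a> { text: &'a str, kind: TokenKind }

// Assumed classification: the Thai Unicode block is U+0E00..=U+0E7F.
fn classify(c: char) -> TokenKind {
    match c {
        '\u{0E00}'..='\u{0E7F}' => TokenKind::Thai,
        'a'..='z' | 'A'..='Z' => TokenKind::Latin,
        '0'..='9' => TokenKind::Number,
        c if c.is_whitespace() => TokenKind::Whitespace,
        c if c.is_ascii_punctuation() => TokenKind::Punctuation,
        _ => TokenKind::Unknown,
    }
}

// One pass over the input: consecutive chars of the same kind
// accumulate into a single span; a kind change flushes the span.
fn pre_tokenize(text: &str) -> Vec<Token<'_>> {
    let mut spans = Vec::new();
    let mut start = 0;
    let mut current: Option<TokenKind> = None;
    for (i, c) in text.char_indices() {
        let kind = classify(c);
        match current {
            Some(k) if k == kind => {}
            Some(k) => {
                spans.push(Token { text: &text[start..i], kind: k });
                start = i;
                current = Some(kind);
            }
            None => current = Some(kind),
        }
    }
    if let Some(k) = current {
        spans.push(Token { text: &text[start..], kind: k });
    }
    spans
}
```

Because the scan only compares adjacent kinds, it runs in linear time over the input and allocates nothing beyond the output vector, which is what lets the expensive DAG step be confined to the Thai spans.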

§Pipeline position

raw text
   │
   ▼
pre_tokenize()   ← this module
   │  splits into [Thai | Latin | Number | Whitespace | Emoji | Punctuation | Unknown]
   ▼
segmenter        ← processes Thai spans with tcc + dict
   │
   ▼
Vec<Token<'_>>

§Example

use kham_core::pre_tokenizer::pre_tokenize;
use kham_core::TokenKind;

let spans = pre_tokenize("ธนาคาร100แห่ง");
assert_eq!(spans[0].kind, TokenKind::Thai);   // "ธนาคาร"
assert_eq!(spans[1].kind, TokenKind::Number); // "100"
assert_eq!(spans[2].kind, TokenKind::Thai);   // "แห่ง"

Functions§

classify_char
Classify a single Unicode scalar value into a TokenKind.
is_emoji
Returns true if c belongs to one of the major Unicode emoji blocks.
pre_tokenize
Split text into a sequence of script-homogeneous Token spans.
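As a rough illustration of the kind of range test `is_emoji` performs, the sketch below checks a few major emoji blocks. The specific block list is an assumption for this example; the crate's actual set of "major Unicode emoji blocks" may differ.

```rust
// Hedged sketch: membership test over a handful of major emoji
// blocks. The exact block list used by kham_core is an assumption.
fn is_emoji(c: char) -> bool {
    matches!(c as u32,
        0x1F300..=0x1F5FF   // Miscellaneous Symbols and Pictographs
        | 0x1F600..=0x1F64F // Emoticons
        | 0x1F680..=0x1F6FF // Transport and Map Symbols
        | 0x1F900..=0x1F9FF // Supplemental Symbols and Pictographs
        | 0x2600..=0x26FF   // Miscellaneous Symbols
        | 0x2700..=0x27BF   // Dingbats
    )
}
```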