§kham-core
Pure Rust Thai word segmentation engine. no_std compatible (requires alloc).
§Quick start
use kham_core::Tokenizer;
let tokenizer = Tokenizer::new();
let tokens = tokenizer.segment("กินข้าวกับปลา");
for token in &tokens {
    println!("{} ({:?})", token.text, token.kind);
}

§Mixed script
Non-Thai spans (Latin, numbers, emoji) pass through unchanged alongside Thai tokens:
use kham_core::{Tokenizer, TokenKind};
let tok = Tokenizer::new();
let tokens = tok.segment("ธนาคาร100แห่ง");
assert_eq!(tokens[1].text, "100");
assert_eq!(tokens[1].kind, TokenKind::Number);
// char_span is suitable for Python/JS string indexing
assert_eq!(tokens[0].char_span, 0..6); // ธนาคาร = 6 chars
assert_eq!(tokens[1].char_span, 6..9); // 100 = 3 chars

§Custom dictionary
Merge extra words with the built-in dictionary using the builder:
use kham_core::Tokenizer;
let tok = Tokenizer::builder()
.dict_words("ปัญญาประดิษฐ์\n")
.build();
let tokens = tok.segment("ปัญญาประดิษฐ์คือ");
assert!(tokens.iter().any(|t| t.text == "ปัญญาประดิษฐ์"));

§Normalize then segment
Tokenizer::segment is zero-copy: returned tokens borrow from the input. For input with stacked tone marks or out-of-order leading vowels (สระลอย), normalize into a new String first, then borrow it:
use kham_core::Tokenizer;
let tok = Tokenizer::new();
let normalized = tok.normalize("กเินข้าว"); // fixes the misordered leading vowel (สระลอย)
let tokens = tok.segment(&normalized); // tokens borrow `normalized`
assert!(!tokens.is_empty());

§Full-text search pipeline
fts::FtsTokenizer wraps the segmenter with stopword tagging, synonym expansion, POS tagging, and named-entity recognition in one call:
use kham_core::fts::FtsTokenizer;
let fts = FtsTokenizer::new();
// All tokens with metadata (position, kind, stopword flag, synonyms, …)
let tokens = fts.segment_for_fts("กินข้าวกับปลา");
assert!(tokens.iter().any(|t| t.text == "กับ" && t.is_stop));
// Only indexable tokens (stopwords excluded, positions preserved)
let indexed = fts.index_tokens("กินข้าวกับปลา");
assert!(indexed.iter().all(|t| !t.is_stop));
// Flat list of lexeme strings ready for a tsvector
let lexemes = fts.lexemes("กินข้าวกับปลา");
assert!(lexemes.iter().any(|l| l == "กิน" || l == "ปลา"));

Re-exports§
pub use error::KhamError;
pub use segmenter::Tokenizer;
pub use segmenter::TokenizerBuilder;
pub use token::NamedEntityKind;
pub use token::Token;
pub use token::TokenKind;
Modules§
- abbrev
- Thai abbreviation expansion.
- date
- Thai date normalization.
- dict
- Dictionary backed by a Double-Array Trie (DARTS).
- error
- Error types for kham-core.
- freq
- Word frequency table built from the Thai National Corpus (TNC).
- fts
- Full-text search pipeline for Thai text.
- ne
- Named entity tagging via a gazetteer (word-list approach).
- ngram
- Character-level and token-level n-gram generation for Thai FTS.
- normalizer
- Thai text normalizer.
- number
- Thai number normalization.
- pos
- Part-of-speech tagging for Thai words.
- pre_tokenizer
- Unicode script classifier and pre-tokenizer.
- romanizer
- RTGS romanization of segmented Thai words.
- segmenter
- DAG-based maximal matching segmenter (newmm algorithm).
- sentence
- Thai sentence segmentation.
- soundex
- Thai phonetic encoding (Soundex) — lk82, udom83, MetaSound, and Thai–English cross-language.
- stopwords
- Thai stopword filter.
- synonym
- Synonym expansion for Thai full-text search.
- tcc
- Thai Character Cluster (TCC) boundary detection.
- token
- Token types returned by the segmenter.
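
A closing aside on the char_span field shown in the Mixed script example: character indices (as Python/JS use) and Rust's byte-based &str offsets diverge for Thai, since UTF-8 encodes each Thai character in 3 bytes. The helper below is a hypothetical std-only sketch of that mapping, not part of kham-core:

```rust
use std::ops::Range;

/// Map a character-index range (Python/JS style) onto UTF-8 byte offsets,
/// so the result can be used with Rust's byte-based `&str` slicing.
fn char_span_to_byte_range(s: &str, chars: Range<usize>) -> Range<usize> {
    // Byte offset of every char boundary, plus the end-of-string offset.
    let mut bounds = s
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(s.len()));
    let start = bounds.by_ref().nth(chars.start).unwrap();
    let end = if chars.end > chars.start {
        bounds.nth(chars.end - chars.start - 1).unwrap()
    } else {
        start
    };
    start..end
}

fn main() {
    let text = "ธนาคาร100แห่ง";
    // "ธนาคาร" is 6 chars but 18 bytes (3 bytes per Thai char in UTF-8).
    let span = char_span_to_byte_range(text, 0..6);
    assert_eq!(span, 0..18);
    assert_eq!(&text[span], "ธนาคาร");
    // "100" occupies chars 6..9, i.e. bytes 18..21 (ASCII digits are 1 byte).
    assert_eq!(char_span_to_byte_range(text, 6..9), 18..21);
}
```

This is why the docs recommend char_span for interop with Python/JS string indexing, while byte offsets remain the natural form for slicing the original &str in Rust.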