Crate kham_core

§kham-core

A pure-Rust Thai word segmentation engine. The crate is no_std compatible (requires alloc).

§Quick start

use kham_core::Tokenizer;

let tokenizer = Tokenizer::new();
let tokens = tokenizer.segment("กินข้าวกับปลา");
for token in &tokens {
    println!("{} ({:?})", token.text, token.kind);
}

§Mixed script

Non-Thai spans (Latin, numbers, emoji) pass through unchanged alongside Thai tokens:

use kham_core::{Tokenizer, TokenKind};

let tok = Tokenizer::new();
let tokens = tok.segment("ธนาคาร100แห่ง");
assert_eq!(tokens[1].text, "100");
assert_eq!(tokens[1].kind, TokenKind::Number);
// char_span is suitable for Python/JS string indexing
assert_eq!(tokens[0].char_span, 0..6); // ธนาคาร = 6 chars
assert_eq!(tokens[1].char_span, 6..9); // 100 = 3 chars
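The char_span values above are counted in chars, not bytes; Thai characters take three bytes each in UTF-8, so the two unit systems diverge immediately. The relationship can be sketched with std alone (byte_to_char_range is an illustrative helper, not part of kham-core):

```rust
use std::ops::Range;

/// Illustrative helper (not a crate item): convert a byte range into
/// the equivalent char range for the same &str. Thai characters are
/// 3 bytes each in UTF-8, so byte and char indices diverge quickly.
fn byte_to_char_range(s: &str, bytes: Range<usize>) -> Range<usize> {
    let start = s[..bytes.start].chars().count();
    let end = s[..bytes.end].chars().count();
    start..end
}
```

For "ธนาคาร100แห่ง", the 6 Thai chars of "ธนาคาร" occupy the first 18 bytes, so the byte range 0..18 maps to the char range 0..6 shown above, and the 3 ASCII bytes of "100" (18..21) map to 6..9.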

§Custom dictionary

Merge extra words with the built-in dictionary using the builder:

use kham_core::Tokenizer;

let tok = Tokenizer::builder()
    .dict_words("ปัญญาประดิษฐ์\n")
    .build();
let tokens = tok.segment("ปัญญาประดิษฐ์คือ");
assert!(tokens.iter().any(|t| t.text == "ปัญญาประดิษฐ์"));

§Normalize then segment

Tokenizer::segment is zero-copy. For input with stacked tone marks or misordered สระลอย (leading vowels), normalize into a new String first, then segment the borrowed result:

use kham_core::Tokenizer;

let tok = Tokenizer::new();
let normalized = tok.normalize("กเินข้าว"); // reorders the misplaced สระลอย (leading vowel)
let tokens = tok.segment(&normalized);       // tokens borrow `normalized`
assert!(!tokens.is_empty());
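The borrow in the example follows from zero-copy tokens being slices of the input string. A minimal sketch in plain std of why normalization must materialize a String before segmenting (Tok and segment_stub are illustrative stand-ins, not crate items):

```rust
/// Illustrative stand-in (not a crate type): a zero-copy token is just
/// a slice borrowed from the input string.
struct Tok<'a> {
    text: &'a str,
}

/// Stand-in segmenter: every token borrows from `input`, copying nothing.
fn segment_stub(input: &str) -> Vec<Tok<'_>> {
    input.split_whitespace().map(|text| Tok { text }).collect()
}

fn normalize_then_segment(raw: &str) -> Vec<String> {
    // A real pipeline would call normalize() here; we just copy.
    let normalized: String = raw.to_string();
    let tokens = segment_stub(&normalized); // tokens borrow `normalized`
    // `normalized` must stay alive while `tokens` is in use, which is
    // why segment() cannot normalize in place and still return
    // borrowed tokens.
    tokens.iter().map(|t| t.text.to_string()).collect()
}
```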

§Full-text search pipeline

fts::FtsTokenizer wraps the segmenter with stopword tagging, synonym expansion, POS tagging, and named-entity recognition in one call:

use kham_core::fts::FtsTokenizer;

let fts = FtsTokenizer::new();

// All tokens with metadata (position, kind, stopword flag, synonyms, …)
let tokens = fts.segment_for_fts("กินข้าวกับปลา");
assert!(tokens.iter().any(|t| t.text == "กับ" && t.is_stop));

// Only indexable tokens (stopwords excluded, positions preserved)
let indexed = fts.index_tokens("กินข้าวกับปลา");
assert!(indexed.iter().all(|t| !t.is_stop));

// Flat list of lexeme strings ready for a tsvector
let lexemes = fts.lexemes("กินข้าวกับปลา");
assert!(lexemes.iter().any(|l| l == "กิน" || l == "ปลา"));
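The index_tokens step can be pictured as a position-preserving stopword filter: tokens are dropped, but the surviving tokens keep their original positions so phrase and proximity queries still line up. A hedged sketch in plain std (filter_for_index is an illustrative name, not the crate's API):

```rust
use std::collections::HashSet;

/// Illustrative sketch of the indexing stage of an FTS pipeline:
/// drop stopwords but keep each surviving token's original position.
fn filter_for_index<'a>(
    tokens: &[&'a str],
    stopwords: &HashSet<&str>,
) -> Vec<(usize, &'a str)> {
    tokens
        .iter()
        .enumerate()
        .filter(|(_, t)| !stopwords.contains(*t))
        .map(|(pos, t)| (pos, *t))
        .collect()
}
```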

Re-exports§

pub use error::KhamError;
pub use segmenter::Tokenizer;
pub use segmenter::TokenizerBuilder;
pub use token::NamedEntityKind;
pub use token::Token;
pub use token::TokenKind;

Modules§

abbrev
Thai abbreviation expansion.
date
Thai date normalization.
dict
Dictionary backed by a Double-Array Trie (DARTS).
error
Error types for kham-core.
freq
Word frequency table built from the Thai National Corpus (TNC).
fts
Full-text search pipeline for Thai text.
ne
Named entity tagging via a gazetteer (word-list approach).
ngram
Character-level and token-level n-gram generation for Thai FTS.
normalizer
Thai text normalizer.
number
Thai number normalization.
pos
Part-of-speech tagging for Thai words.
pre_tokenizer
Unicode script classifier and pre-tokenizer.
romanizer
RTGS romanization of segmented Thai words.
segmenter
DAG-based maximal matching segmenter (newmm algorithm).
sentence
Thai sentence segmentation.
soundex
Thai phonetic encoding (Soundex) — lk82, udom83, MetaSound, and Thai–English cross-language.
stopwords
Thai stopword filter.
synonym
Synonym expansion for Thai full-text search.
tcc
Thai Character Cluster (TCC) boundary detection.
token
Token types returned by the segmenter.
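The segmenter module above describes DAG-based maximal matching (the newmm algorithm). Its core idea can be sketched as dynamic programming over a DAG of dictionary matches, minimizing the token count; this is a simplification, since real newmm also applies TCC boundary rules and handles spans not covered by the dictionary:

```rust
use std::collections::HashSet;

/// Illustrative sketch (not the crate's implementation): segment `text`
/// into the fewest dictionary words via dynamic programming over the
/// DAG of dictionary matches ending at each char position.
fn maximal_match(text: &str, dict: &HashSet<&str>) -> Option<Vec<String>> {
    let chars: Vec<char> = text.chars().collect();
    let n = chars.len();
    // cost[j] = minimal token count covering chars[..j];
    // back[j]  = start of the last token on that best path.
    let mut cost = vec![usize::MAX; n + 1];
    let mut back = vec![0usize; n + 1];
    cost[0] = 0;
    for i in 0..n {
        if cost[i] == usize::MAX {
            continue; // chars[..i] is not segmentable
        }
        for j in (i + 1)..=n {
            let word: String = chars[i..j].iter().collect();
            if dict.contains(word.as_str()) && cost[i] + 1 < cost[j] {
                cost[j] = cost[i] + 1;
                back[j] = i;
            }
        }
    }
    if cost[n] == usize::MAX {
        return None; // no full segmentation exists
    }
    // Walk the back pointers to recover the token sequence.
    let mut out = Vec::new();
    let mut j = n;
    while j > 0 {
        let i = back[j];
        out.push(chars[i..j].iter().collect::<String>());
        j = i;
    }
    out.reverse();
    Some(out)
}
```

With a toy dictionary of {"กิน", "ข้าว", "กับ", "ปลา"}, this recovers the segmentation from the Quick start example; the production segmenter replaces the HashSet with the Double-Array Trie from the dict module so matches at each position are enumerated without building candidate Strings.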