Crate kham_core

kham-core

A pure-Rust Thai word segmentation engine. Compatible with no_std (requires alloc).

Quick start

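use kham_core::Tokenizer;

// Create a tokenizer and segment a Thai sentence ("eat rice with fish").
let tokenizer = Tokenizer::new();
let tokens = tokenizer.segment("กินข้าวกับปลา");

// Print each token's text together with its kind.
for token in &tokens {
    println!("{} ({:?})", token.text, token.kind);
}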

Re-exports

pub use error::KhamError;
pub use segmenter::Tokenizer;
pub use segmenter::TokenizerBuilder;
pub use token::NamedEntityKind;
pub use token::Token;
pub use token::TokenKind;

Modules

dict: Dictionary backed by a Double-Array Trie (DARTS).
error: Error types for kham-core.
freq: Word frequency table built from the Thai National Corpus (TNC).
fts: Full-text search pipeline for Thai text.
ne: Named entity tagging via a gazetteer (word-list approach).
ngram: Character-level and token-level n-gram generation for Thai FTS.
normalizer: Thai text normalizer.
pos: Part-of-speech tagging for Thai words.
pre_tokenizer: Unicode script classifier and pre-tokenizer.
romanizer: RTGS romanization of segmented Thai words.
segmenter: DAG-based maximal matching segmenter (newmm algorithm).
stopwords: Thai stopword filter.
synonym: Synonym expansion for Thai full-text search.
tcc: Thai Character Cluster (TCC) boundary detection.
token: Token types returned by the segmenter.
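
The segmenter module is described as a DAG-based maximal matching segmenter. As a rough illustration of that general idea, and not of kham-core's actual implementation, the sketch below treats every dictionary match as an edge in a graph over character positions and picks the path with the fewest tokens. Everything here is an assumption made for illustration: the free function segment, the HashSet dictionary (the crate's dict module uses a double-array trie), and the UNKNOWN_PENALTY constant. It also ignores TCC boundaries and word frequencies.

```rust
use std::collections::HashSet;

/// Hypothetical penalty for a character that no dictionary word covers.
const UNKNOWN_PENALTY: usize = 10;

/// Toy DAG-based maximal matching: split `text` into the cheapest sequence
/// of dictionary words, falling back to single characters when nothing matches.
fn segment<'a>(text: &'a str, dict: &HashSet<&str>) -> Vec<&'a str> {
    // Byte offsets of every char boundary, plus the end of the string.
    let bounds: Vec<usize> = text
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(text.len()))
        .collect();
    let n = bounds.len() - 1; // number of characters

    // best[i] = (cost of the cheapest path from char i to the end, next position).
    let mut best: Vec<(usize, usize)> = vec![(0, n); n + 1];

    // Fill the table from the end of the text back to the start.
    for i in (0..n).rev() {
        // Fallback edge: emit one unknown character.
        let mut choice = (best[i + 1].0 + UNKNOWN_PENALTY, i + 1);
        // Dictionary edges: every word that starts at position i.
        for j in (i + 1)..=n {
            if dict.contains(&text[bounds[i]..bounds[j]]) {
                let cost = best[j].0 + 1;
                // `<=` prefers the longer word when two paths cost the same.
                if cost <= choice.0 {
                    choice = (cost, j);
                }
            }
        }
        best[i] = choice;
    }

    // Follow the recorded edges from the start to recover the tokens.
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < n {
        let j = best[i].1;
        tokens.push(&text[bounds[i]..bounds[j]]);
        i = j;
    }
    tokens
}

fn main() {
    let dict: HashSet<&str> = ["กิน", "ข้าว", "กับ", "ปลา"].iter().copied().collect();
    for token in segment("กินข้าวกับปลา", &dict) {
        println!("{token}");
    }
}
```

Searching the whole graph rather than greedily taking the longest word at each position matters when a long early match leaves an unmatchable remainder: with the toy dictionary {ab, abc, cde}, the input abcde is split as ab + cde instead of abc followed by two unknown characters.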