pub struct KeyExtractor { /* private fields */ }
Thai keyword extractor using TF × inverse-corpus-frequency scoring.
Backed by the built-in 62k-word tokenizer, the TNC frequency table (~106k entries), and the Thai stopword list (~1 029 entries).
Construction is O(n) in the size of the TNC table, so reuse the returned
instance rather than calling builtin() on every query.
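One way to follow the reuse advice is to build the instance once and hand out a shared reference. The sketch below shows the pattern with the standard library's `OnceLock`. `Extractor`, `build`, `shared`, and `BUILDS` are hypothetical stand-ins and not part of this crate; the counter exists only to show that construction runs once.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

// Counts how many times the expensive constructor runs (illustration only).
static BUILDS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for a type that is costly to construct.
struct Extractor {
    table: Vec<u32>,
}

impl Extractor {
    /// Simulates O(n) construction by filling a large table.
    fn build() -> Self {
        BUILDS.fetch_add(1, Ordering::SeqCst);
        Extractor { table: (0..100_000).collect() }
    }
}

/// Returns the process-wide instance, building it on first use only.
fn shared() -> &'static Extractor {
    static INSTANCE: OnceLock<Extractor> = OnceLock::new();
    INSTANCE.get_or_init(Extractor::build)
}
```

With the real type, the closure passed to `get_or_init` would call `KeyExtractor::builtin()`. That only compiles if `KeyExtractor` is `Send + Sync`, which this page does not state, so treat it as an assumption to verify.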
§Filtering rules
A token is eligible as a keyword when all of the following hold:
- Kind is Thai, Latin, Number, or Named (whitespace, punctuation, emoji, and unknown tokens are always skipped)
- Character length ≥ 2 (single-char tokens are too coarse to be keywords)
- Not in the built-in Thai stopword list
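The three rules above can be read as a single predicate. The following is a minimal sketch, not the crate's code. `TokenKind` and `is_eligible` are hypothetical names, the stopword set is passed in by the caller, and "character length" is assumed to mean a count of Unicode scalar values (`char`s) rather than bytes or grapheme clusters.

```rust
use std::collections::HashSet;

/// Hypothetical token categories mirroring the kinds named in the rules;
/// the crate's real enum may be shaped differently.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TokenKind {
    Thai,
    Latin,
    Number,
    Named,
    Whitespace,
    Punctuation,
    Emoji,
    Unknown,
}

/// Returns true when a token passes all three eligibility rules.
fn is_eligible(text: &str, kind: TokenKind, stopwords: &HashSet<&str>) -> bool {
    // Rule 1: only content-bearing kinds qualify.
    let kind_ok = matches!(
        kind,
        TokenKind::Thai | TokenKind::Latin | TokenKind::Number | TokenKind::Named
    );
    // Rule 2: at least two characters (assumed: counted as `char`s, not bytes).
    let long_enough = text.chars().count() >= 2;
    // Rule 3: not a stopword.
    kind_ok && long_enough && !stopwords.contains(text)
}
```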
§Examples
use kham_core::keyword::KeyExtractor;
let kex = KeyExtractor::builtin();
// Rare domain-specific word outranks a common word
// "ซอฟต์แวร์" (software) is rare in TNC and should appear as a top keyword
let kws = kex.extract("นักพัฒนาซอฟต์แวร์เขียนซอฟต์แวร์ทุกวัน", 5);
assert!(kws.iter().any(|k| k.word == "ซอฟต์แวร์"));
§Implementations
impl KeyExtractor
pub fn builtin() -> Self
Create a keyword extractor backed by the built-in tokenizer, TNC frequency table, and Thai stopword list.
§Examples
use kham_core::keyword::KeyExtractor;
let kex = KeyExtractor::builtin();
assert!(!kex.extract("กินข้าวกับปลา", 5).is_empty());
pub fn extract(&self, text: &str, max_n: usize) -> Vec<Keyword>
Extract up to max_n keywords from text, ranked by TF × IDF score (IDF here is the inverse corpus frequency from the TNC table).
Returns an empty Vec when text is empty, contains no eligible
content words, or max_n is zero.
Ties in score are broken alphabetically so results are deterministic.
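The ranking behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the crate's implementation. The page does not give the formulas, so the sketch assumes term frequency is the token count divided by the number of eligible tokens, and inverse corpus frequency is `ln(total / (1 + freq))`. `Scored`, `rank`, `corpus_freq`, and `corpus_total` are hypothetical names, and the input is assumed to be already tokenized and filtered.

```rust
use std::collections::HashMap;

/// Hypothetical result type with the same two fields the examples use.
#[derive(Debug, Clone, PartialEq)]
struct Scored {
    word: String,
    score: f64,
}

/// Rank already-filtered tokens by TF × inverse corpus frequency.
/// Ties on score fall back to ascending word order.
fn rank(
    tokens: &[&str],
    corpus_freq: &HashMap<&str, u64>,
    corpus_total: u64,
    max_n: usize,
) -> Vec<Scored> {
    if tokens.is_empty() || max_n == 0 {
        return Vec::new();
    }
    // Count occurrences of each distinct token.
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for &t in tokens {
        *counts.entry(t).or_insert(0) += 1;
    }
    let n = tokens.len() as f64;
    let mut out: Vec<Scored> = counts
        .into_iter()
        .map(|(w, c)| {
            let tf = c as f64 / n;
            let cf = corpus_freq.get(w).copied().unwrap_or(0);
            // Assumed smoothing: words missing from the table get the largest weight.
            let icf = (corpus_total as f64 / (1.0 + cf as f64)).ln();
            Scored { word: w.to_string(), score: tf * icf }
        })
        .collect();
    // Highest score first; equal scores are ordered by word for determinism.
    out.sort_by(|a, b| b.score.total_cmp(&a.score).then_with(|| a.word.cmp(&b.word)));
    out.truncate(max_n);
    out
}
```

The tie-break here compares strings by Unicode code point, which is deterministic but is not Thai dictionary order. Whether the crate's "alphabetical" ordering means the same thing is not stated on this page.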
§Examples
use kham_core::keyword::KeyExtractor;
let kex = KeyExtractor::builtin();
// Edge cases
assert!(kex.extract("", 5).is_empty());
assert!(kex.extract("กินข้าวกับปลา", 0).is_empty());
// Score order is non-increasing
let kws = kex.extract("การเรียนภาษาโปรแกรมมิ่งเป็นทักษะสำคัญสำหรับนักพัฒนา", 10);
for pair in kws.windows(2) {
assert!(
pair[0].score >= pair[1].score,
"out-of-order: {:?} before {:?}", pair[0], pair[1]
);
}
pub fn extract_phrases(&self, text: &str, max_n: usize) -> Vec<Keyword>
Extract up to max_n multi-word keyphrases (bigrams and trigrams) from
text, ranked by TF × average-IDF score.
Phrases are formed from adjacent content tokens, that is, tokens that pass
the same eligibility rules as extract(): non-whitespace, non-punctuation,
non-emoji, non-unknown, character length ≥ 2, and not a stopword. A
bigram is two such consecutive tokens; a trigram is three.
The IDF for a phrase is the average IDF of its constituent words.
Returns an empty Vec when text has fewer than 2 eligible tokens or
max_n is zero.
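The phrase scoring can be sketched in the same spirit. The code below is a hedged illustration, not the crate's implementation. It assumes n-grams are taken over the sequence of eligible tokens after filtering, that phrase TF is the raw phrase count, and that ties are broken by ascending phrase text. None of these details is confirmed by this page. `rank_phrases` and the `idf` map of per-word weights are hypothetical.

```rust
use std::collections::HashMap;

/// Build space-joined bigrams and trigrams from already-filtered tokens and
/// score each one as (phrase count) × (mean per-word weight).
fn rank_phrases(
    tokens: &[&str],
    idf: &HashMap<&str, f64>,
    max_n: usize,
) -> Vec<(String, f64)> {
    if tokens.len() < 2 || max_n == 0 {
        return Vec::new();
    }
    // Count every window of 2 and 3 consecutive tokens.
    let mut counts: HashMap<Vec<&str>, usize> = HashMap::new();
    for n in 2..=3 {
        for w in tokens.windows(n) {
            *counts.entry(w.to_vec()).or_insert(0) += 1;
        }
    }
    let mut out: Vec<(String, f64)> = counts
        .into_iter()
        .map(|(words, c)| {
            // Average the weights of the constituent words (0.0 if unknown).
            let sum: f64 = words
                .iter()
                .map(|w| idf.get(*w).copied().unwrap_or(0.0))
                .sum();
            let mean = sum / words.len() as f64;
            (words.join(" "), c as f64 * mean)
        })
        .collect();
    // Highest score first; equal scores are ordered by phrase text.
    out.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    out.truncate(max_n);
    out
}
```

Averaging keeps a phrase's weight on the same scale as a single word's, so one very rare word cannot dominate a trigram the way a sum would allow.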
§Example
use kham_core::keyword::KeyExtractor;
let kex = KeyExtractor::builtin();
let phrases = kex.extract_phrases("นักพัฒนาซอฟต์แวร์เขียนโค้ดทุกวัน", 5);
// Each keyword's word field holds a space-separated phrase
assert!(phrases.iter().all(|k| k.word.contains(' ')));