pub struct StopwordSet { /* private fields */ }
A sorted set of stopwords supporting O(log n) lookup.
Construct once per process with StopwordSet::builtin and reuse across
segmentation calls.
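The lookup cost and the construct-once advice above can be illustrated with a minimal, self-contained sketch. It does not use kham_core: SortedWords and shared_words are hypothetical stand-ins, and the sorted Vec with binary search is only one plausible way to get O(log n) lookup, not necessarily how StopwordSet is implemented. The two-word list is made up for illustration.

```rust
use std::sync::OnceLock;

/// Hypothetical stand-in for a sorted stopword set (not the crate's type).
pub struct SortedWords {
    words: Vec<String>, // invariant: sorted and deduplicated
}

impl SortedWords {
    /// Sort and deduplicate once, at construction time.
    pub fn new(mut words: Vec<String>) -> Self {
        words.sort();
        words.dedup();
        Self { words }
    }

    /// O(log n) membership test: binary search over the sorted vector.
    pub fn contains(&self, word: &str) -> bool {
        self.words
            .binary_search_by(|w| w.as_str().cmp(word))
            .is_ok()
    }
}

/// Build the set on first call, then return the same instance on every
/// later call, so the sorting cost is paid once per process.
pub fn shared_words() -> &'static SortedWords {
    static WORDS: OnceLock<SortedWords> = OnceLock::new();
    WORDS.get_or_init(|| {
        // Illustrative data only; the duplicate is removed by `new`.
        SortedWords::new(vec![
            "และ".to_string(),
            "ของ".to_string(),
            "และ".to_string(),
        ])
    })
}
```

With the real type, the same pattern would presumably wrap StopwordSet::builtin() in the OnceLock initializer so that every segmentation call shares one set.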
Implementations
impl StopwordSet
pub fn builtin() -> Self
Load the built-in Thai stopword list (1,029 entries, taken from PyThaiNLP, licensed under Apache-2.0).
pub fn from_text(data: &str) -> Self
Build a StopwordSet from a newline-separated word list.
Lines beginning with # and blank lines are ignored.
BOM characters (\u{FEFF}) are stripped from every line.
The resulting set is sorted and deduplicated.
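The parsing rules above can be sketched as a standalone function. This is an illustration under stated assumptions, not the crate's code: parse_word_list is a hypothetical helper, the trimming of surrounding whitespace is an assumption the documentation does not state, and so is the choice to strip the BOM before checking for a comment marker.

```rust
/// Hypothetical helper mirroring the documented `from_text` rules.
/// Returns a sorted, deduplicated list of words.
pub fn parse_word_list(data: &str) -> Vec<String> {
    let mut words: Vec<String> = data
        .lines()
        // Strip BOM characters from every line (done first, so a BOM in
        // front of '#' does not hide the comment marker: an assumption).
        .map(|line| line.replace('\u{FEFF}', ""))
        // Assumption: surrounding whitespace is not significant.
        .map(|line| line.trim().to_string())
        // Ignore blank lines and lines beginning with '#'.
        .filter(|line| !line.is_empty() && !line.starts_with('#'))
        .collect();
    words.sort();
    words.dedup();
    words
}
```

Given an input with a BOM-prefixed comment line, a blank line, and a repeated word, the function returns each remaining word exactly once, in sorted order.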
pub fn builtin_with_extra(extra: &str) -> Self
Load the built-in list plus additional words from extra.
extra uses the same format as from_text: newline-separated words,
# comment lines and blank lines ignored, BOM stripped.
The combined set is sorted and deduplicated.
Use this when you have domain-specific function words to suppress in addition to the standard Thai stopword list.
Example
use kham_core::stopwords::StopwordSet;
let stops = StopwordSet::builtin_with_extra("ดาวน์โหลด\nอัปโหลด\n");
assert!(stops.contains("และ")); // built-in
assert!(stops.contains("ดาวน์โหลด")); // extra