Named entity tagging via a gazetteer (word-list approach).
NeTagger relabels pre-segmented Thai tokens that appear in the
gazetteer from TokenKind::Thai to TokenKind::Named(kind).
The tagger runs as a post-processing pass after segmentation — it
does not change the segmentation boundaries, only the token kind.
Multi-token matching: NeTagger::tag_tokens uses greedy
longest-match over consecutive Thai tokens, so compound names split
by the segmenter (e.g. กรุง+เทพ → กรุงเทพ) are correctly
identified and merged into a single TokenKind::Named token.
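The greedy longest-match pass described above can be sketched as follows. This is a minimal, self-contained illustration and not the crate's actual implementation: `Tok`, `Kind`, `NeKind`, `tag_tokens_sketch`, and the `max_span` cap are hypothetical stand-ins for the real types and for whatever bound `NeTagger::tag_tokens` uses internally.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for NamedEntityKind / TokenKind / the token type.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NeKind { Person, Place, Org }

#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq, Eq)]
enum Kind { Thai, Other, Named(NeKind) }

#[derive(Debug, Clone, PartialEq, Eq)]
struct Tok { text: String, kind: Kind }

/// Greedy longest-match over runs of consecutive Thai tokens.
/// `max_span` (an assumption) caps how many tokens may be merged into one name.
fn tag_tokens_sketch(
    tokens: &[Tok],
    gaz: &HashMap<String, NeKind>,
    max_span: usize,
) -> Vec<Tok> {
    let mut out = Vec::with_capacity(tokens.len());
    let mut i = 0;
    while i < tokens.len() {
        // Non-Thai tokens are copied through unchanged.
        if tokens[i].kind != Kind::Thai {
            out.push(tokens[i].clone());
            i += 1;
            continue;
        }
        // Number of consecutive Thai tokens starting at `i`, capped at `max_span`.
        let run = tokens[i..]
            .iter()
            .take(max_span.max(1))
            .take_while(|t| t.kind == Kind::Thai)
            .count();
        // Try the longest candidate first, then progressively shorter ones.
        let mut matched = false;
        for len in (1..=run).rev() {
            let joined: String = tokens[i..i + len]
                .iter()
                .map(|t| t.text.as_str())
                .collect();
            if let Some(&ne) = gaz.get(&joined) {
                // Merge the matched tokens into a single named token.
                out.push(Tok { text: joined, kind: Kind::Named(ne) });
                i += len;
                matched = true;
                break;
            }
        }
        // No gazetteer entry starts here: keep the token as plain Thai.
        if !matched {
            out.push(tokens[i].clone());
            i += 1;
        }
    }
    out
}
```

With a gazetteer containing only กรุงเทพ, the token sequence ไป, กรุง, เทพ would come out of this sketch as ไป (still Thai) followed by a single merged กรุงเทพ token tagged as a place. Trying the longest span first keeps a short entry from shadowing a longer compound name that begins with the same token.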
Three entity categories are supported: NamedEntityKind::Person,
NamedEntityKind::Place, and NamedEntityKind::Org.
§Data format
Tab-separated text file, one entry per line:
# Thai word<TAB>NE_TAG
กรุงเทพ<TAB>PLACE
ทักษิณ<TAB>PERSON
ปตท<TAB>ORG
Lines beginning with # and blank lines are ignored.
Duplicate keys: last entry wins.
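To make the format rules concrete, here is a rough sketch of a parser that follows them: it skips `#` comments and blank lines, and a later duplicate overwrites an earlier one. `parse_gazetteer_sketch` and `NeKind` are hypothetical names, not part of kham_core. Silently skipping lines with no tab or with an unrecognised tag is an assumption here; the real loader may treat such lines differently.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for NamedEntityKind.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NeKind { Person, Place, Org }

/// Parse `word<TAB>NE_TAG` lines into a lookup table.
fn parse_gazetteer_sketch(tsv: &str) -> HashMap<String, NeKind> {
    let mut map = HashMap::new();
    for line in tsv.lines() {
        let line = line.trim();
        // Comments and blank lines are ignored.
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        // Assumption: a line without a tab separator is skipped.
        let (word, tag) = match line.split_once('\t') {
            Some(pair) => pair,
            None => continue,
        };
        let kind = match tag.trim() {
            "PERSON" => NeKind::Person,
            "PLACE" => NeKind::Place,
            "ORG" => NeKind::Org,
            // Assumption: an unknown tag is skipped rather than reported.
            _ => continue,
        };
        // `insert` overwrites any earlier value, so the last entry wins.
        map.insert(word.trim().to_string(), kind);
    }
    map
}
```

Relying on `HashMap::insert` gives the "last entry wins" behaviour with no extra bookkeeping, since inserting an existing key replaces its value.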
§Example
use kham_core::ne::NeTagger;
use kham_core::token::NamedEntityKind;
let tagger = NeTagger::from_tsv("กรุงเทพ\tPLACE\nทักษิณ\tPERSON\n");
assert_eq!(tagger.tag("กรุงเทพ"), Some(NamedEntityKind::Place));
assert_eq!(tagger.tag("xyz"), None);
Structs§
- NeTagger: Gazetteer-based named entity tagger.