
Module ne


Named entity tagging via a gazetteer (word-list approach).

NeTagger relabels pre-segmented Thai tokens that appear in the gazetteer, changing their kind from TokenKind::Thai to TokenKind::Named(kind). The tagger runs as a post-processing pass after segmentation. A single-token match changes only the token kind and leaves the segmentation boundaries untouched.

Multi-token matching: NeTagger::tag_tokens uses greedy longest-match over consecutive Thai tokens, so compound names split by the segmenter (e.g. กรุง + เทพ → กรุงเทพ) are identified and merged into a single TokenKind::Named token.

Three entity categories are supported: NamedEntityKind::Person, NamedEntityKind::Place, and NamedEntityKind::Org.
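The greedy longest-match pass described above can be illustrated with a small standalone sketch. It does not use kham_core and is not the crate's implementation: the names Tok, Kind, EntityKind, tag_longest_match, and the max_span cap are all hypothetical, chosen only to show the idea of trying the longest run of consecutive Thai tokens first and falling back to shorter runs.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the crate's token types (assumption, not the real API).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EntityKind {
    Person,
    Place,
    Org,
}

#[derive(Debug, Clone, PartialEq, Eq)]
enum Kind {
    Thai,
    Named(EntityKind),
}

#[derive(Debug, Clone, PartialEq, Eq)]
struct Tok {
    text: String,
    kind: Kind,
}

/// Greedy longest-match over consecutive Thai tokens.
/// `max_span` caps how many tokens may be joined into one candidate (assumed parameter).
fn tag_longest_match(
    tokens: &[Tok],
    gazetteer: &HashMap<String, EntityKind>,
    max_span: usize,
) -> Vec<Tok> {
    let mut out = Vec::with_capacity(tokens.len());
    let mut i = 0;
    while i < tokens.len() {
        let mut matched = None;
        if tokens[i].kind == Kind::Thai {
            // Length of the run of consecutive Thai tokens starting at i, capped at max_span.
            let run = tokens[i..]
                .iter()
                .take(max_span)
                .take_while(|t| t.kind == Kind::Thai)
                .count();
            // Try the longest candidate first, then progressively shorter ones.
            for len in (1..=run).rev() {
                let joined: String = tokens[i..i + len]
                    .iter()
                    .map(|t| t.text.as_str())
                    .collect();
                if let Some(&kind) = gazetteer.get(&joined) {
                    matched = Some((len, joined, kind));
                    break;
                }
            }
        }
        match matched {
            // Merge the matched run into a single Named token.
            Some((len, text, kind)) => {
                out.push(Tok { text, kind: Kind::Named(kind) });
                i += len;
            }
            // No gazetteer entry starts here: keep the token unchanged.
            None => {
                out.push(tokens[i].clone());
                i += 1;
            }
        }
    }
    out
}
```

With a gazetteer containing only กรุงเทพ as a place, the token sequence ไป, กรุง, เทพ would come out as two tokens: ไป unchanged and a merged กรุงเทพ tagged as a place. The cap on span length is a design choice of this sketch to bound the work per position; the crate may bound it differently or not at all.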

§Data format

Tab-separated text file, one entry per line:

# Thai word<TAB>NE_TAG
กรุงเทพ<TAB>PLACE
ทักษิณ<TAB>PERSON
ปตท<TAB>ORG

Lines beginning with # and blank lines are ignored. Duplicate keys: last entry wins.
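The parsing rules above (comments and blank lines skipped, last duplicate wins) can be sketched in a few lines of standalone Rust. This is an illustration under assumptions, not the crate's loader: parse_gazetteer and EntityKind are hypothetical names, and silently skipping malformed lines or unknown tags is a choice made for this sketch that the source does not specify.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for the crate's NamedEntityKind (assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EntityKind {
    Person,
    Place,
    Org,
}

/// Parse tab-separated `word<TAB>NE_TAG` lines into a lookup table.
fn parse_gazetteer(tsv: &str) -> HashMap<String, EntityKind> {
    let mut map = HashMap::new();
    for line in tsv.lines() {
        let line = line.trim();
        // Blank lines and lines beginning with '#' are ignored.
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        // Split on the first tab; lines without one are skipped (assumed behavior).
        if let Some((word, tag)) = line.split_once('\t') {
            let kind = match tag.trim() {
                "PERSON" => EntityKind::Person,
                "PLACE" => EntityKind::Place,
                "ORG" => EntityKind::Org,
                // Unknown tags are skipped (assumed behavior).
                _ => continue,
            };
            // HashMap::insert overwrites an existing key, so the last entry wins.
            map.insert(word.trim().to_string(), kind);
        }
    }
    map
}
```

Because HashMap::insert replaces the previous value for an existing key, the "last entry wins" rule falls out of the data structure without extra bookkeeping.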

§Example

use kham_core::ne::NeTagger;
use kham_core::token::NamedEntityKind;

let tagger = NeTagger::from_tsv("กรุงเทพ\tPLACE\nทักษิณ\tPERSON\n");
assert_eq!(tagger.tag("กรุงเทพ"), Some(NamedEntityKind::Place));
assert_eq!(tagger.tag("xyz"), None);

Structs§

NeTagger
Gazetteer-based named entity tagger.