Module tcc


Thai Character Cluster (TCC) boundary detection.

Implements the TCC rules from Theeramunkong et al. (2000). A TCC is the smallest indivisible Thai orthographic unit: roughly an optional leading vowel, one consonant, its upper vowels, an optional tone mark, and at most one trailing sign (thanthakat, following vowel, or nikhahit).

Pattern (simplified)

TCC = LEAD? CONSONANT UPPER* TONE? (THANTHAKAT | FOLLOW | NIKHAHIT)?
    | NON_THAI+
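
The simplified pattern can be read as a small hand-written matcher. The following is a minimal sketch, not this module's implementation: the function name `sketch_cluster_ends`, the helper predicates, and the code-point ranges chosen for each class are assumptions made for illustration (in particular, folding the below-base vowel signs into UPPER and treating any unmatched Thai character as a one-character cluster).

```rust
// Hypothetical character classes; the real module may classify differently.
fn is_lead(c: char) -> bool {
    ('\u{0E40}'..='\u{0E44}').contains(&c) // leading vowels
}
fn is_consonant(c: char) -> bool {
    ('\u{0E01}'..='\u{0E2E}').contains(&c)
}
fn is_upper(c: char) -> bool {
    // Above-base vowel signs, plus below-base signs and maitaikhu (assumption).
    matches!(c, '\u{0E31}' | '\u{0E34}'..='\u{0E39}' | '\u{0E47}')
}
fn is_tone(c: char) -> bool {
    ('\u{0E48}'..='\u{0E4B}').contains(&c)
}
fn is_final(c: char) -> bool {
    // THANTHAKAT | NIKHAHIT | FOLLOW (sara a, sara aa, sara am)
    matches!(c, '\u{0E4C}' | '\u{0E4D}' | '\u{0E30}' | '\u{0E32}' | '\u{0E33}')
}
fn is_thai(c: char) -> bool {
    ('\u{0E00}'..='\u{0E7F}').contains(&c)
}

/// Returns the byte offset at which each cluster ends.
fn sketch_cluster_ends(text: &str) -> Vec<usize> {
    let chars: Vec<(usize, char)> = text.char_indices().collect();
    let mut ends = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let start = i;
        let mut j = start;
        if is_lead(chars[j].1) {
            j += 1; // LEAD?
        }
        if j < chars.len() && is_consonant(chars[j].1) {
            j += 1; // CONSONANT
            while j < chars.len() && is_upper(chars[j].1) {
                j += 1; // UPPER*
            }
            if j < chars.len() && is_tone(chars[j].1) {
                j += 1; // TONE?
            }
            if j < chars.len() && is_final(chars[j].1) {
                j += 1; // (THANTHAKAT | FOLLOW | NIKHAHIT)?
            }
            i = j;
        } else if !is_thai(chars[start].1) {
            // NON_THAI+: consume the whole run of non-Thai characters.
            j = start;
            while j < chars.len() && !is_thai(chars[j].1) {
                j += 1;
            }
            i = j;
        } else {
            // Fallback (assumption): an unmatched Thai character stands alone.
            i = start + 1;
        }
        let end = if i < chars.len() { chars[i].0 } else { text.len() };
        ends.push(end);
    }
    ends
}
```

Under these assumptions, `sketch_cluster_ends("เด็ก")` yields two clusters: the leading vowel, consonant, and maitaikhu form the first, and the final consonant forms the second.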

TCC segmentation is used as a pre-pass by the main segmenter to ensure that word boundaries always fall on TCC boundaries.
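
One way a segmenter might enforce this constraint is to discard any candidate word boundary that is not also a TCC boundary. The sketch below is illustrative only; `on_tcc_boundary` and `keep_aligned` are hypothetical helpers, not functions of this module, and the sketch assumes the TCC boundary offsets are sorted in ascending order.

```rust
/// True if `offset` is one of the (sorted) TCC boundary byte offsets.
fn on_tcc_boundary(tcc_bounds: &[usize], offset: usize) -> bool {
    tcc_bounds.binary_search(&offset).is_ok()
}

/// Keeps only the candidate word boundaries that fall on a TCC boundary.
fn keep_aligned(tcc_bounds: &[usize], candidates: &[usize]) -> Vec<usize> {
    candidates
        .iter()
        .copied()
        .filter(|&c| on_tcc_boundary(tcc_bounds, c))
        .collect()
}
```

A candidate that would split a consonant from its vowel sign is dropped, so the later word-level search only ever considers cuts between whole clusters.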

Functions

tcc_boundaries
Return the byte offsets of every TCC boundary in text.
tcc_iter
Iterate over the TCCs in text as &str slices.
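
The two views are closely related: a list of boundary byte offsets is enough to recover the clusters as `&str` slices. The exact signatures of `tcc_boundaries` and `tcc_iter` are not shown here, so the sketch below uses a hypothetical standalone helper, `slices_from_bounds`, and assumes the offsets are sorted and lie on UTF-8 character boundaries (slicing panics otherwise).

```rust
/// Splits `text` at the given sorted byte offsets.
/// Offsets equal to the current start (such as a leading 0) are skipped,
/// so the helper works whether or not the list includes the start of the text.
fn slices_from_bounds<'a>(text: &'a str, bounds: &[usize]) -> Vec<&'a str> {
    let mut start = 0;
    let mut out = Vec::with_capacity(bounds.len());
    for &end in bounds {
        if end > start {
            out.push(&text[start..end]);
            start = end;
        }
    }
    out
}
```

Because the slices borrow from `text`, no cluster is copied; this is presumably why an iterator over `&str` is offered alongside the raw offsets.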