Thai Character Cluster (TCC) boundary detection.
Implements the TCC rules from Theeramunkong et al. (2000). A TCC is the smallest indivisible Thai orthographic unit: roughly an optional leading vowel, one consonant, its upper vowels, an optional tone mark, and an optional trailing vowel.
§Pattern (simplified)
TCC = LEAD? CONSONANT UPPER* TONE? (THANTHAKAT | FOLLOW | NIKHAHIT)?
    | NON_THAI+

TCC segmentation is used as a pre-pass by the main segmenter to ensure that word boundaries always fall on TCC boundaries.
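The simplified pattern above can be read as a small greedy matcher. The following is a minimal sketch of that reading, not this module's implementation: the function name `sketch_tcc_boundaries`, the code-point ranges chosen for each class, the choice to report only the end offset of each cluster, and the fallback that treats an unmatched Thai character as a one-character cluster are all assumptions made for illustration.

```rust
// Character classes for the simplified pattern. The code-point ranges below
// are assumptions of this sketch; the module may classify characters differently.
fn is_lead(c: char) -> bool {
    matches!(c, '\u{0E40}'..='\u{0E44}') // leading vowels
}
fn is_consonant(c: char) -> bool {
    matches!(c, '\u{0E01}'..='\u{0E2E}')
}
fn is_upper(c: char) -> bool {
    // vowels written above or below the consonant, plus mai taikhu
    matches!(c, '\u{0E31}' | '\u{0E34}'..='\u{0E39}' | '\u{0E47}')
}
fn is_tone(c: char) -> bool {
    matches!(c, '\u{0E48}'..='\u{0E4B}')
}
fn is_final(c: char) -> bool {
    // THANTHAKAT | FOLLOW | NIKHAHIT
    matches!(c, '\u{0E4C}' | '\u{0E30}' | '\u{0E32}' | '\u{0E33}' | '\u{0E4D}')
}
fn is_thai(c: char) -> bool {
    matches!(c, '\u{0E00}'..='\u{0E7F}')
}

/// Hypothetical helper: returns the byte offset at which each cluster ends.
fn sketch_tcc_boundaries(text: &str) -> Vec<usize> {
    let chars: Vec<(usize, char)> = text.char_indices().collect();
    let mut ends = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let start = i;
        let mut j = i;
        // LEAD?
        if is_lead(chars[j].1) {
            j += 1;
        }
        if j < chars.len() && is_consonant(chars[j].1) {
            // CONSONANT UPPER* TONE? (THANTHAKAT | FOLLOW | NIKHAHIT)?
            j += 1;
            while j < chars.len() && is_upper(chars[j].1) {
                j += 1;
            }
            if j < chars.len() && is_tone(chars[j].1) {
                j += 1;
            }
            if j < chars.len() && is_final(chars[j].1) {
                j += 1;
            }
            i = j;
        } else if !is_thai(chars[start].1) {
            // NON_THAI+
            i = start;
            while i < chars.len() && !is_thai(chars[i].1) {
                i += 1;
            }
        } else {
            // Fallback (assumption): an unmatched Thai character stands alone.
            i = start + 1;
        }
        // Convert the character index back to a byte offset.
        ends.push(if i < chars.len() { chars[i].0 } else { text.len() });
    }
    ends
}
```

Because every Thai character in this block occupies three bytes in UTF-8, a two-character cluster such as a consonant plus a trailing vowel ends at byte offset 6. Offsets produced this way always fall on `char` boundaries, so they are safe to use for slicing the original `&str`.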
Functions§
- tcc_boundaries - Return the byte offsets of every TCC boundary in text.
- tcc_iter - Iterate over the TCCs in text as &str slices.
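The two functions expose the same information in different forms: byte offsets versus borrowed slices. As a rough illustration of how one maps to the other, the sketch below turns a list of cluster end offsets into `&str` slices. The helper `slices_from_boundaries` is hypothetical and is not part of this module; it also assumes the offsets are ascending, lie on `char` boundaries, and do not include a leading 0, which may differ from what `tcc_boundaries` returns.

```rust
/// Hypothetical helper: split `text` at the given cluster end offsets.
/// Panics if an offset is out of range or not on a `char` boundary.
fn slices_from_boundaries<'a>(text: &'a str, ends: &[usize]) -> Vec<&'a str> {
    let mut start = 0;
    let mut out = Vec::with_capacity(ends.len());
    for &end in ends {
        // Each slice borrows from `text`, so no characters are copied.
        out.push(&text[start..end]);
        start = end;
    }
    out
}
```

Returning slices that borrow from the input avoids allocation per cluster, which is presumably why the iterator form yields `&str` rather than `String`.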