Thai text normalizer.
Applies two transformations in order:
- Tone mark (วรรณยุกต์) dedup: consecutive tone marks on the same consonant are collapsed to the last one. This handles accidental double keystrokes (e.g. อ่ อ้ → อ้) as well as identical repetitions (อ่ อ่ → อ่).
- Sara am composition: the two-character sequence nikhahit (อํ U+0E4D) + sara aa (อา U+0E32) is composed into the single sara am character (อำ U+0E33), as Unicode intends.
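The two rules can be sketched in a few lines of Rust. This is only an illustration, not the module's actual code: `is_tone_mark`, `dedup_tone_marks`, `compose_sara_am` and `normalize_sketch` are hypothetical names, and the sketch assumes that "tone mark" means exactly the four characters U+0E48 to U+0E4B.

```rust
/// Assumption: tone marks are mai ek, mai tho, mai tri and mai chattawa
/// (U+0E48..=U+0E4B). The real predicate may cover more characters.
fn is_tone_mark(c: char) -> bool {
    ('\u{0E48}'..='\u{0E4B}').contains(&c)
}

/// Rule 1: in a run of consecutive tone marks, keep only the last one.
fn dedup_tone_marks(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        // If the previous output character is also a tone mark, drop it
        // so that the newer mark replaces it.
        if is_tone_mark(c) && out.chars().next_back().map_or(false, is_tone_mark) {
            out.pop();
        }
        out.push(c);
    }
    out
}

/// Rule 2: nikhahit (U+0E4D) + sara aa (U+0E32) becomes sara am (U+0E33).
fn compose_sara_am(input: &str) -> String {
    input.replace("\u{0E4D}\u{0E32}", "\u{0E33}")
}

/// Hypothetical stand-in for `normalize`: apply rule 1, then rule 2.
fn normalize_sketch(input: &str) -> String {
    compose_sara_am(&dedup_tone_marks(input))
}

fn main() {
    // mai ek followed by mai tho collapses to mai tho.
    assert_eq!(normalize_sketch("อ\u{0E48}\u{0E49}"), "อ\u{0E49}");
    // nikhahit + sara aa composes to sara am.
    assert_eq!(normalize_sketch("น\u{0E4D}\u{0E32}"), "น\u{0E33}");
}
```

Keeping the two rules as separate passes mirrors the stated order of application; a single fused pass would give the same result for these two rules, because they touch disjoint sets of characters.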
§Why สระลอย reorder is not included
Reordering a misplaced leading vowel (เ แ โ ใ ไ) requires knowing whether
that vowel belongs to the consonant before it or the consonant after it
in the code stream. In correctly encoded Thai text the sequence
consonant + lead_vowel is common at word boundaries (e.g. ว + โ in
“ชาวโลก”), and a simple look-ahead cannot distinguish that from a truly
misplaced vowel without full Thai Character Cluster (TCC) analysis. Correct TCC analysis
requires the same character predicates used here, creating a dependency
cycle. Applications that need สระลอย correction should pre-process the
text with a dedicated TCC-aware utility before calling normalize.
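The ambiguity described above can be seen with a deliberately naive detector. The sketch below is hypothetical (`is_thai_consonant`, `is_lead_vowel` and `naive_misplaced_vowel_positions` are illustrative names, not part of this module) and assumes consonants are U+0E01 to U+0E2E and leading vowels are U+0E40 to U+0E44. It flags every leading vowel that directly follows a consonant, and therefore raises a false alarm on the correctly spelled word “ชาวโลก”.

```rust
/// Assumption: Thai consonants are ก..ฮ (U+0E01..=U+0E2E).
fn is_thai_consonant(c: char) -> bool {
    ('\u{0E01}'..='\u{0E2E}').contains(&c)
}

/// Leading vowels เ แ โ ใ ไ (U+0E40..=U+0E44).
fn is_lead_vowel(c: char) -> bool {
    ('\u{0E40}'..='\u{0E44}').contains(&c)
}

/// Naive look-behind: report the character index of every leading vowel
/// that immediately follows a consonant. This is NOT a valid correction
/// rule; it exists only to show why such a rule misfires.
fn naive_misplaced_vowel_positions(text: &str) -> Vec<usize> {
    let chars: Vec<char> = text.chars().collect();
    chars
        .windows(2)
        .enumerate()
        .filter(|(_, w)| is_thai_consonant(w[0]) && is_lead_vowel(w[1]))
        .map(|(i, _)| i + 1)
        .collect()
}

fn main() {
    // "ชาวโลก" is correctly encoded, yet ว + โ matches the naive pattern:
    // the โ at character index 3 is flagged.
    assert_eq!(naive_misplaced_vowel_positions("ชาวโลก"), vec![3]);
}
```

Telling this case apart from a genuinely misplaced vowel needs knowledge of where one cluster ends and the next begins, which is the TCC-level analysis that this module leaves to a separate utility.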
§NFC note
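Full Unicode NFC normalisation is not applied. No Thai character has a
canonical decomposition (sara am, U+0E33, has only a compatibility
decomposition), so NFC neither composes nor decomposes Thai text. Most
Thai characters have combining class 0; the exceptions are sara u, sara uu
and phinthu (U+0E38 to U+0E3A) and the four tone marks (U+0E48 to U+0E4B),
which NFC may reorder relative to one another but never replaces.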
The two rules above cover all practically observed Thai normalisation
issues. Mixed-script Latin text is passed through unchanged; callers
that require full NFC on Latin portions should pre-process the text
with a Unicode normalisation library before calling normalize.
Functions§
- normalize: Normalise Thai text into canonical form.