Module normalizer


Thai text normalizer.

Applies two transformations in order:

  1. วรรณยุกต์ (tone mark) dedup: consecutive tone marks on the same consonant are collapsed to the last one. This handles accidental double keystrokes (e.g. อ่ อ้ → อ้) as well as identical repetitions (อ่ อ่ → อ่).

  2. Sara am composition: the two-character sequence nikhahit (อํ U+0E4D) + sara aa (อา U+0E32) is composed into the single sara am character (อำ U+0E33), whose Unicode compatibility decomposition is exactly that sequence.
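The two rules above can be sketched in a few lines of standard-library Rust. This is an illustrative approximation, not the module's actual implementation: the names `is_tone_mark`, `dedup_tone_marks`, `compose_sara_am` and `normalize_sketch` are hypothetical. The sketch assumes that "tone marks" means U+0E48 to U+0E4B and that only directly adjacent code points are considered.

```rust
/// Assumed set of Thai tone marks: mai ek, mai tho, mai tri, mai chattawa.
fn is_tone_mark(c: char) -> bool {
    matches!(c, '\u{0E48}'..='\u{0E4B}')
}

/// Rule 1 (sketch): in a run of adjacent tone marks, keep only the last one.
fn dedup_tone_marks(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        let prev = out.chars().next_back();
        if is_tone_mark(c) && prev.map_or(false, is_tone_mark) {
            out.pop(); // drop the earlier mark; the later one wins
        }
        out.push(c);
    }
    out
}

/// Rule 2 (sketch): nikhahit (U+0E4D) + sara aa (U+0E32) becomes sara am (U+0E33).
fn compose_sara_am(input: &str) -> String {
    input.replace("\u{0E4D}\u{0E32}", "\u{0E33}")
}

/// Hypothetical stand-in for `normalize`: applies rule 1, then rule 2.
fn normalize_sketch(input: &str) -> String {
    compose_sara_am(&dedup_tone_marks(input))
}
```

Under these assumptions, `normalize_sketch("ก\u{0E48}\u{0E49}")` keeps only the second tone mark, and `normalize_sketch("น\u{0E4D}\u{0E32}")` yields the consonant followed by U+0E33. Text containing neither pattern, including Latin text, passes through unchanged.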

§Why สระลอย (floating vowel) reorder is not included

Reordering a misplaced leading vowel (เ แ โ ใ ไ) requires knowing whether that vowel belongs to the consonant before it or the consonant after it in the code stream. In correctly encoded Thai text the sequence consonant + lead_vowel is common at word boundaries (e.g. ว + โ in “ชาวโลก”), and a simple look-ahead cannot distinguish that from a truly misplaced vowel without full analysis at the TCC (Thai Character Cluster) level. Correct TCC analysis requires the same character predicates used here, which would create a dependency cycle. Applications that need สระลอย correction should pre-process the text with a dedicated TCC-aware utility before calling normalize.
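The ambiguity can be made concrete with a naive adjacency check. The sketch below is not part of this module; the predicate names and the ranges used (U+0E01 to U+0E2E for consonants, U+0E40 to U+0E44 for leading vowels) are assumptions chosen for illustration.

```rust
/// Assumed leading-vowel set: เ แ โ ใ ไ (U+0E40 to U+0E44).
fn is_lead_vowel(c: char) -> bool {
    matches!(c, '\u{0E40}'..='\u{0E44}')
}

/// Assumed consonant range: ก through ฮ (U+0E01 to U+0E2E).
fn is_consonant(c: char) -> bool {
    matches!(c, '\u{0E01}'..='\u{0E2E}')
}

/// Returns the char indices at which a consonant is immediately
/// followed by a leading vowel (hypothetical helper).
fn consonant_then_lead_vowel(text: &str) -> Vec<usize> {
    let chars: Vec<char> = text.chars().collect();
    chars
        .windows(2)
        .enumerate()
        .filter(|(_, w)| is_consonant(w[0]) && is_lead_vowel(w[1]))
        .map(|(i, _)| i)
        .collect()
}
```

With these predicates, the correctly spelled "ชาวโลก" is flagged at index 2 (ว followed by โ), exactly as a mistyped "กเ" is flagged at index 0. Adjacency alone gives no way to tell the two cases apart, which is why the reorder is left to a TCC-aware tool.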

§NFC note

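Full Unicode NFC normalisation is not applied. No Thai character has a canonical decomposition, so NFC never composes or decomposes Thai text. Its only possible effect on Thai is canonical reordering of the few marks with a non-zero combining class (the tone marks U+0E48 to U+0E4B, sara u U+0E38, sara uu U+0E39 and phinthu U+0E3A), which this module does not perform. The two rules above are intended to cover the Thai normalisation issues observed in practice. Mixed-script Latin text is passed through unchanged; callers that require full NFC on Latin portions should pre-process the text with a Unicode normalisation library before calling normalize.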

Functions§

normalize
Normalise Thai text into canonical form.