rust-canto
A Rust library for segmenting Cantonese text and converting Chinese characters to Jyutping (粵拼)/Yale romanization (耶魯拼音). Compiles to WebAssembly for use as a Typst plugin.
Features
- Word segmentation — splits Cantonese text into natural word units using a trie + dynamic programming algorithm
- Jyutping annotation — converts each word to its Jyutping romanization
- Yale annotation — converts each word to its Yale romanization
- Mixed input — handles mixed Chinese/English/punctuation input gracefully
- WASM output — compiles to
.wasmfor use as a Typst plugin viawasm-minimal-protocol
Usage as a Typst Plugin
Prerequisites
Install the WebAssembly build target:
Build
The compiled plugin will be at:
target/wasm32-unknown-unknown/release/rust_canto.wasm
It is a standalone binary file that can be copied to your project.
In Typst
You can use my Typst package
auto-canto to retrieve this
crate's output conveniently.
If you wish you process this crate's output yourself, you may load the plugin
and call annotate() with your input text:
// replace with the relative path
#let canto = plugin("rust_canto.wasm")
#let to-jyutping-words(txt) = {
json(canto.annotate(bytes(txt)))
}
#let data = to-jyutping-words("今日我要上堂")
The annotate function returns a JSON array of {word, jyutping, yale} objects,
so that my Typst package
pycantonese-parser can
process it.
English words and punctuation are returned with null as the Jyutping:
Algorithm
Text is segmented using a trie + dynamic programming approach:
1. Building the trie
A trie is built at startup from three bundled data files derived from rime-cantonese:
chars.tsv(34,000+ entries) — single-character readings with optional frequency weights (e.g.佢 keoi5and佢 heoi5 3%). Each character's readings are inserted in descending weight order so thatreadings[0]always holds the most common pronunciation. Entries with no percentage are treated as the primary reading (weight 100) and take precedence over those with an explicit percentage.words.tsv(103,000+ entries) — multi-character word readings. These build full paths through the trie and are loaded afterchars.tsvso that single-character nodes are already in place.- lettered.tsv (1,000+ entries) – Latin+CJK word readings. They are loaded after
words.tsv. freq.txt(266,000+ entries) — word frequencies used as a tiebreaker during segmentation (see below).
2. Segmentation
Input text is tokenised in a single left-to-right pass using dynamic
programming over the trie. dp[i] holds the best (token_count, total_freq)
for the first i characters; the goal is to minimise token_count and, on a
tie, maximise total_freq. For example, 好學生 can split as 好學 + 生 or
好 + 學生; both yield two tokens, but 學生 (freq 71,278) beats 好學
(freq 2,847), so 好 + 學生 wins.
Each character position is resolved by three rules applied in priority order:
Trie walk. For every possible start position, the trie is walked
left-to-right to find all matching words. A match contributes one token and
carries the word's Jyutping reading and frequency. Mixed Latin+CJK entries such
as AB膠 and 做part-time, as well as hyphenated entries like chok-cheat,
are stored in the trie and matched here.
Alpha-run fallback. If the trie finds no reading for a span, the span may
still be merged into one token if it is a contiguous run of non-CJK
alphanumeric characters. Hyphens (-), underscores (_), and apostrophes
(') are allowed as internal connectors but not at the start or end of the
span, so part-time, rust_canto, and i'm each become one token while a
bare - remains a single-character token. The resulting token has no Jyutping
reading. This rule only fires when the trie has no entry for the span, so a
word like ge that appears in the lettered dictionary correctly receives its
reading ge3 rather than None.
Single-character fallback. Any character not covered by the above —
whitespace, punctuation, symbols — becomes its own token. The trie is still
consulted for a reading, which is how single-character lettered entries such as
% → pat6 sen1 are handled. In particular, % is never absorbed into an
alpha run, so 3% always splits into two tokens 3 and %, allowing the
Cantonese reading of % to be displayed independently.
3. Romanization
Each segmented token's Jyutping reading is taken directly from the trie.
Yale romanization is then derived from the Jyutping by converting initials
(z→j, c→ch, j→y), finals (eoi→eui, eo/oe→eu, etc.),
and applying tone diacritics (macron for tone 1, acute for tone 2, grave for
tone 4, acute for tone 5; tones 3 and 6 are unmarked). Low-register tones
(4–6) additionally insert h after the vowel nucleus and before any stop coda
(-p, -t, -k, -m, -n, -ng).
Data Sources
The bundled dictionary data is derived from rime-cantonese, licensed under CC BY 4.0.
Related Projects
- auto-canto — the Typst package that processes the output of this crate for automatic Catonese annotation
- PyCantonese — the Python library that inspired this project
- to-jyutping — to NodeJS package that inspired the trie structure in this project
License
MIT
Data bundled from rime-cantonese is licensed under CC BY 4.0 — see
data/README.md for details.