rust-canto 0.3.1

Convert Chinese characters to Jyutping (粵拼) / Yale romanization (耶魯)
Documentation

rust-canto

A Rust library for segmenting Cantonese text and converting Chinese characters to Jyutping (粵拼)/Yale romanization (耶魯拼音). Compiles to WebAssembly for use as a Typst plugin.

Features

  • Word segmentation — splits Cantonese text into natural word units using a trie + dynamic programming algorithm
  • Jyutping annotation — converts each word to its Jyutping romanization
  • Yale annotation — converts each word to its Yale romanization
  • Mixed input — handles mixed Chinese/English/punctuation input gracefully
  • WASM output — compiles to .wasm for use as a Typst plugin via wasm-minimal-protocol

Usage as a Typst Plugin

Prerequisites

Install the WebAssembly build target:

rustup target add wasm32-unknown-unknown

Build

cargo build --release --target wasm32-unknown-unknown

The compiled plugin will be at:

target/wasm32-unknown-unknown/release/rust_canto.wasm

It is a standalone binary file that can be copied to your project.

In Typst

You can use my Typst package auto-canto to retrieve this crate's output conveniently.

If you wish you process this crate's output yourself, you may load the plugin and call annotate() with your input text:

// replace with the relative path
#let canto = plugin("rust_canto.wasm")

#let to-jyutping-words(txt) = {
  json(canto.annotate(bytes(txt)))
}

#let data = to-jyutping-words("今日我要上堂")

The annotate function returns a JSON array of {word, jyutping, yale} objects, so that my Typst package pycantonese-parser can process it.

[
  {
    word: "今日",
    jyutping: "gam1 jat6",
    yale: ["gām", "yaht"],
  },
  {
    word: "",
    jyutping: "ngo5",
    yale: ["ngóh"],
  },
  {
    word: "",
    jyutping: "jiu3",
    yale: ["yiu"],
  },
  {
    word: "上堂",
    jyutping: "soeng5 tong4",
    yale: ["séuhng","tòhng"],
  },
]

English words and punctuation are returned with null as the Jyutping:

[
  {
    word: "",
    jyutping: "keoi",
    yale: ["kéuih"],
  },
  {
    word: "",
    jyutping: "jau6",
    yale: ["yauh"],
  },
  {
    word: "chem",
    jyutping: kem1,
    yale: ["kēm"],
  },
  {
    word: "",
    jyutping: "tong4",
    yale: ["tòhng"],
  },
  {
    word: "",
    jyutping: none,
    yale: none
  },
]

Algorithm

Text is segmented using a trie + dynamic programming approach:

1. Building the trie

A trie is built at startup from three bundled data files derived from rime-cantonese:

  • chars.tsv (34,000+ entries) — single-character readings with optional frequency weights (e.g. 佢 keoi5 and 佢 heoi5 3%). Each character's readings are inserted in descending weight order so that readings[0] always holds the most common pronunciation. Entries with no percentage are treated as the primary reading (weight 100) and take precedence over those with an explicit percentage.
  • words.tsv (103,000+ entries) — multi-character word readings. These build full paths through the trie and are loaded after chars.tsv so that single-character nodes are already in place.
  • lettered.tsv (1,000+ entries) – Latin+CJK word readings. They are loaded after words.tsv.
  • freq.txt (266,000+ entries) — word frequencies used as a tiebreaker during segmentation (see below).

2. Segmentation

Input text is tokenised in a single left-to-right pass using dynamic programming over the trie. dp[i] holds the best (token_count, total_freq) for the first i characters; the goal is to minimise token_count and, on a tie, maximise total_freq. For example, 好學生 can split as 好學 + 生 or 好 + 學生; both yield two tokens, but 學生 (freq 71,278) beats 好學 (freq 2,847), so 好 + 學生 wins.

Each character position is resolved by three rules applied in priority order:

Trie walk. For every possible start position, the trie is walked left-to-right to find all matching words. A match contributes one token and carries the word's Jyutping reading and frequency. Mixed Latin+CJK entries such as AB膠 and 做part-time, as well as hyphenated entries like chok-cheat, are stored in the trie and matched here.

Alpha-run fallback. If the trie finds no reading for a span, the span may still be merged into one token if it is a contiguous run of non-CJK alphanumeric characters. Hyphens (-), underscores (_), and apostrophes (') are allowed as internal connectors but not at the start or end of the span, so part-time, rust_canto, and i'm each become one token while a bare - remains a single-character token. The resulting token has no Jyutping reading. This rule only fires when the trie has no entry for the span, so a word like ge that appears in the lettered dictionary correctly receives its reading ge3 rather than None.

Single-character fallback. Any character not covered by the above — whitespace, punctuation, symbols — becomes its own token. The trie is still consulted for a reading, which is how single-character lettered entries such as %pat6 sen1 are handled. In particular, % is never absorbed into an alpha run, so 3% always splits into two tokens 3 and %, allowing the Cantonese reading of % to be displayed independently.

3. Romanization

Each segmented token's Jyutping reading is taken directly from the trie. Yale romanization is then derived from the Jyutping by converting initials (zj, cch, jy), finals (eoieui, eo/oeeu, etc.), and applying tone diacritics (macron for tone 1, acute for tone 2, grave for tone 4, acute for tone 5; tones 3 and 6 are unmarked). Low-register tones (4–6) additionally insert h after the vowel nucleus and before any stop coda (-p, -t, -k, -m, -n, -ng).

Data Sources

The bundled dictionary data is derived from rime-cantonese, licensed under CC BY 4.0.

Related Projects

  • auto-canto — the Typst package that processes the output of this crate for automatic Catonese annotation
  • PyCantonese — the Python library that inspired this project
  • to-jyutping — to NodeJS package that inspired the trie structure in this project

License

MIT

Data bundled from rime-cantonese is licensed under CC BY 4.0 — see data/README.md for details.