rust-canto
A Rust library for segmenting Cantonese text and converting Chinese characters to Jyutping (粵拼) romanization. Compiles to WebAssembly for use as a Typst plugin.
Features
- Word segmentation — splits Cantonese text into natural word units using a trie + dynamic programming algorithm
- Jyutping annotation — converts each word to its Jyutping romanization
- Mixed input — handles mixed Chinese/English/punctuation input gracefully
- WASM output — compiles to .wasm for use as a Typst plugin via wasm-minimal-protocol
Usage as a Typst Plugin
Prerequisites
Install the WebAssembly build target (wasm32-unknown-unknown), typically with rustup target add wasm32-unknown-unknown.
Build
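Build the crate in release mode for the wasm32-unknown-unknown target (typically cargo build --release --target wasm32-unknown-unknown). The compiled plugin will be at: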
target/wasm32-unknown-unknown/release/rust_canto.wasm
In Typst
Load the plugin and call annotate() with your input text:
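#let canto = plugin("rust_canto.wasm")
#let to-jyutping-words(txt) = {
  // annotate() takes the text as bytes and returns JSON-encoded bytes
  let arr = json(canto.annotate(bytes(txt)))
  // Convert each [word, jyutping] pair into a dictionary
  arr.map(p => ("word": p.at(0), "jyutping": p.at(1)))
}
#let data = to-jyutping-words("今日我要上堂")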
The annotate function returns a JSON array of [word, jyutping] pairs.
English words and punctuation are returned with null as the Jyutping.
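To make the return shape concrete, here is a minimal Rust sketch that serialises (word, optional Jyutping) pairs into that kind of JSON array. The helper names json_string and to_json are hypothetical and are not taken from the library. The sample values in the usage note are illustrative, not captured plugin output, and the exact whitespace the plugin emits may differ.

```rust
/// Escape a string for embedding in JSON (quotes, backslashes, control characters).
/// Hypothetical helper for illustration; not part of the rust-canto API.
fn json_string(s: &str) -> String {
    let mut out = String::from("\"");
    for ch in s.chars() {
        match ch {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// Serialise pairs as a JSON array of two-element arrays.
/// A missing Jyutping (English words, punctuation) becomes JSON null.
fn to_json(pairs: &[(&str, Option<&str>)]) -> String {
    let items: Vec<String> = pairs
        .iter()
        .map(|(word, jp)| {
            let jp = jp.map(json_string).unwrap_or_else(|| "null".to_string());
            format!("[{},{}]", json_string(word), jp)
        })
        .collect();
    format!("[{}]", items.join(","))
}
```

For instance, to_json(&[("我", Some("ngo5")), ("hello", None)]) yields a two-element array whose second pair has null in the Jyutping slot, which is what the Typst side sees as none after json() decoding.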
Algorithm
Text is segmented using a trie + dynamic programming approach:
- A trie is built at startup from the bundled words.tsv (103,000+ entries) and chars.tsv (34,000+ characters) datasets, derived from rime-cantonese.
- For each position in the input, all possible word matches are found by walking the trie left-to-right.
- Dynamic programming selects the segmentation that minimises token count, using word frequency from freq.txt as a tiebreaker, so 學生 (freq 71,278) beats 好學 (freq 2,847) when both produce the same token count.
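The steps above can be sketched in a few dozen lines of Rust. This is a simplified illustration under stated assumptions, not the library's actual code: the Trie, Node, and segment names are hypothetical, ties are broken by the sum of word frequencies along the path (the source only says frequency is the tiebreaker), and any character not covered by a dictionary word falls back to a single-character token with frequency 0. It does not group English words or punctuation the way the real plugin does.

```rust
use std::collections::HashMap;

/// One trie node; `freq` is Some(..) when a dictionary word ends here.
#[derive(Default)]
struct Node {
    children: HashMap<char, Node>,
    freq: Option<u64>,
}

#[derive(Default)]
struct Trie {
    root: Node,
}

impl Trie {
    fn insert(&mut self, word: &str, freq: u64) {
        let mut node = &mut self.root;
        for ch in word.chars() {
            node = node.children.entry(ch).or_default();
        }
        node.freq = Some(freq);
    }

    /// All dictionary words starting at `start`, as (end index, frequency).
    fn matches(&self, chars: &[char], start: usize) -> Vec<(usize, u64)> {
        let mut out = Vec::new();
        let mut node = &self.root;
        for (i, ch) in chars[start..].iter().enumerate() {
            match node.children.get(ch) {
                Some(next) => node = next,
                None => break,
            }
            if let Some(f) = node.freq {
                out.push((start + i + 1, f));
            }
        }
        out
    }
}

/// Segment `text` into the fewest tokens; ties go to the higher total frequency
/// (an assumption of this sketch).
fn segment(trie: &Trie, text: &str) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let n = chars.len();
    // best[i] = (token count, total frequency, start of last token) for chars[..i]
    let mut best: Vec<Option<(usize, u64, usize)>> = vec![None; n + 1];
    best[0] = Some((0, 0, 0));
    for i in 0..n {
        let (count, freq, _) = match best[i] {
            Some(b) => b,
            None => continue,
        };
        let mut candidates = trie.matches(&chars, i);
        // Fallback: a single character is always a valid token (frequency 0).
        candidates.push((i + 1, 0));
        for (end, f) in candidates {
            let cand = (count + 1, freq + f, i);
            let better = match best[end] {
                None => true,
                Some((c, fr, _)) => cand.0 < c || (cand.0 == c && cand.1 > fr),
            };
            if better {
                best[end] = Some(cand);
            }
        }
    }
    // Follow the stored start indices backwards to recover the tokens.
    let mut tokens = Vec::new();
    let mut end = n;
    while end > 0 {
        let (_, _, start) = best[end].expect("every prefix is reachable via the fallback");
        tokens.push(chars[start..end].iter().collect::<String>());
        end = start;
    }
    tokens.reverse();
    tokens
}
```

With a trie holding only 學生 (71,278) and 好學 (2,847), segmenting 好學生 gives two candidate two-token splits, 好學 + 生 and 好 + 學生. The frequency tiebreak selects the latter, matching the example in the list above.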
Data Sources
The bundled dictionary data is derived from rime-cantonese, licensed under CC BY 4.0.
Related Projects
- PyCantonese — the Python library that inspired this project
- to-jyutping — the Node.js package that inspired the trie structure in this project
License
MIT
Data bundled from rime-cantonese is licensed under CC BY 4.0 — see
data/README.md for details.