# rust-canto 0.3.4

A Rust library for segmenting Cantonese text and converting Chinese characters
to Jyutping (粵拼)/Yale romanization (耶魯拼音). Compiles to WebAssembly for use as a
[Typst](https://typst.app) plugin.

## Features

- **Word segmentation** — splits Cantonese text into natural word units using a
  trie + dynamic programming algorithm
- **Jyutping annotation** — converts each word to its Jyutping romanization
- **Yale annotation** — converts each word to its Yale romanization
- **Mixed input** — handles mixed Chinese/English/punctuation input gracefully
- **WASM output** — compiles to `.wasm` for use as a Typst plugin via
  [`wasm-minimal-protocol`](https://github.com/astrale-sharp/wasm-minimal-protocol)

## Development & Building

### Prerequisites

- [Rust](https://rustup.rs/) stable 1.80+
- `wasm32-unknown-unknown` target: `rustup target add wasm32-unknown-unknown`
- **Clang**: required to compile the `zstd` compression library
  - Ubuntu/Debian: `sudo apt install clang`
  - macOS: included with Xcode command line tools

### Standard development build

```sh
cargo build --release --target wasm32-unknown-unknown
```

`build.rs` regenerates `OUT_DIR/trie.dat` automatically every time it is run.
[`OUT_DIR`](https://doc.rust-lang.org/cargo/reference/environment-variables.html#environment-variables-cargo-sets-for-crates)
is an environment variable set by Cargo pointing to the build's output
directory; writing there keeps the output path "stable" from Cargo's
perspective.
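As an illustration of this pattern only (the helper name and placeholder payload below are hypothetical; the real build script serializes the trie data):

```rust
use std::{env, fs, path::{Path, PathBuf}};

// Hypothetical stand-in for the real trie serialization: it only
// demonstrates writing a generated file into a stable output directory.
fn write_trie(out_dir: &Path) -> PathBuf {
    let dest = out_dir.join("trie.dat");
    fs::write(&dest, b"placeholder").unwrap();
    dest
}

fn main() {
    // Cargo sets OUT_DIR when it runs a build script; fall back to the
    // temp dir so this sketch also runs standalone.
    let out_dir = env::var("OUT_DIR")
        .map(PathBuf::from)
        .unwrap_or_else(|_| env::temp_dir());
    write_trie(&out_dir);
    // Re-run the build script only when the source data changes.
    println!("cargo:rerun-if-changed=data/chars.tsv");
}
```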

### Production build (optimized WASM)

The project comes with a build script.

```sh
chmod +x build.sh
./build.sh
```

This compiles the library to WASM and runs `wasm-opt` with size-reduction flags
(`-Oz`, `--strip-debug`, `--disable-reference-types`), producing
`rust_canto.wasm` in the project root, ready for use in Typst.

### In Typst

You can use my Typst package
[`auto-canto`](https://github.com/VincentTam/auto-canto) to retrieve this
crate's output conveniently.

If you wish to process this crate's output yourself, you may load the plugin
and call `annotate()` with your input text:

```typ
// replace with the relative path
#let canto = plugin("rust_canto.wasm")

#let to-jyutping-words(txt) = {
  json(canto.annotate(bytes(txt)))
}

#let data = to-jyutping-words("今日我要上堂")
```

The `annotate` function returns a JSON array of `{word, jyutping, yale}`
objects, so that my Typst package
[canto-parser](https://typst.app/universe/package/canto-parser) can process it.

```json
[
  {
    "word": "今日",
    "jyutping": "gam1 jat6",
    "yale": ["gām", "yaht"]
  },
  {
    "word": "我",
    "jyutping": "ngo5",
    "yale": ["ngóh"]
  },
  {
    "word": "要",
    "jyutping": "jiu3",
    "yale": ["yiu"]
  },
  {
    "word": "上堂",
    "jyutping": "soeng5 tong4",
    "yale": ["séuhng", "tòhng"]
  }
]
```

Words and punctuation without a dictionary reading are returned with `null`
for both romanizations:

```json
[
  {
    "word": "佢",
    "jyutping": "keoi5",
    "yale": ["kéuih"]
  },
  {
    "word": "有",
    "jyutping": "jau6",
    "yale": ["yauh"]
  },
  {
    "word": "chem",
    "jyutping": "kem1",
    "yale": ["kēm"]
  },
  {
    "word": "堂",
    "jyutping": "tong4",
    "yale": ["tòhng"]
  },
  {
    "word": "?",
    "jyutping": null,
    "yale": null
  }
]
```

## Algorithm

Text is segmented using a **trie + dynamic programming** approach:

### 1. Building the trie

A trie is built at startup from three bundled data files derived from
[rime-cantonese](https://github.com/rime/rime-cantonese):

- **`chars.tsv`** (34,000+ entries) — single-character readings with optional
  frequency weights (e.g. `佢 keoi5` and `佢 heoi5 3%`). Each character's
  readings are inserted in descending weight order so that `readings[0]` always
  holds the most common pronunciation. Entries with no percentage are treated as
  the primary reading (weight 100) and take precedence over those with an
  explicit percentage.
- **`words.tsv`** (103,000+ entries) — multi-character word readings. These
  build full paths through the trie and are loaded after `chars.tsv` so that
  single-character nodes are already in place.
- **`lettered.tsv`** (1,000+ entries) — Latin+CJK word readings, loaded after
  `words.tsv`.
- **`freq.txt`** (266,000+ entries) — word frequencies used as a tiebreaker
  during segmentation (see below).
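As a sketch of this structure (illustrative names, not the crate's actual types), inserting readings so that the highest-weight one ends up first can look like:

```rust
use std::collections::HashMap;

// Sketch of a trie node holding weighted Jyutping readings.
#[derive(Default)]
struct Node {
    children: HashMap<char, Node>,
    // Readings kept in descending weight order, so readings[0] is
    // always the most common pronunciation.
    readings: Vec<(String, u32)>,
}

impl Node {
    fn insert(&mut self, word: &str, reading: &str, weight: u32) {
        let mut node = self;
        for ch in word.chars() {
            node = node.children.entry(ch).or_default();
        }
        node.readings.push((reading.to_string(), weight));
        // Entries without an explicit percentage are loaded with
        // weight 100, so they sort ahead of weighted variants.
        node.readings.sort_by(|a, b| b.1.cmp(&a.1));
    }

    fn lookup(&self, word: &str) -> Option<&str> {
        let mut node = self;
        for ch in word.chars() {
            node = node.children.get(&ch)?;
        }
        node.readings.first().map(|(r, _)| r.as_str())
    }
}

fn main() {
    let mut root = Node::default();
    root.insert("佢", "heoi5", 3);   // variant reading, 3%
    root.insert("佢", "keoi5", 100); // primary reading
    assert_eq!(root.lookup("佢"), Some("keoi5"));
}
```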

### 2. Segmentation

Input text is tokenised in a single left-to-right pass using dynamic
programming over the trie. `dp[i]` holds the best `(token_count, total_freq)`
for the first `i` characters; the goal is to minimise `token_count` and, on a
tie, maximise `total_freq`. For example, `好學生` can split as `好學 + 生` or
`好 + 學生`; both yield two tokens, but `學生` (freq 71,278) beats `好學`
(freq 2,847), so `好 + 學生` wins.

Each character position is resolved by three rules applied in priority order:

**Trie walk.** For every possible start position, the trie is walked
left-to-right to find all matching words. A match contributes one token and
carries the word's Jyutping reading and frequency. Mixed Latin+CJK entries such
as `AB膠` and `做part-time`, as well as hyphenated entries like `chok-cheat`,
are stored in the trie and matched here.

**Alpha-run fallback.** If the trie finds no reading for a span, the span may
still be merged into one token if it is a contiguous run of non-CJK
alphanumeric characters. Hyphens (`-`), underscores (`_`), and apostrophes
(`'`) are allowed as internal connectors but not at the start or end of the
span, so `part-time`, `rust_canto`, and `i'm` each become one token while a
bare `-` remains a single-character token. The resulting token has no Jyutping
reading. This rule only fires when the trie has no entry for the span, so a
word like `ge` that appears in the lettered dictionary correctly receives its
reading `ge3` rather than `None`.

**Single-character fallback.** Any character not covered by the above —
whitespace, punctuation, symbols — becomes its own token. The trie is still
consulted for a reading, which is how single-character lettered entries such as
`%` → `pat6 sen1` are handled. In particular, `%` is never absorbed into an
alpha run, so `3%` always splits into two tokens `3` and `%`, allowing the
Cantonese reading of `%` to be displayed independently.
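A stripped-down sketch of the recurrence above (a flat `HashMap` stands in for the trie, the single-character fallback covers every position, and the frequencies are the illustrative ones from the `好學生` example):

```rust
use std::collections::HashMap;

// dp[i] holds the best (token_count, total_freq, start of last token)
// for the first i characters: minimise token_count, then maximise
// total_freq on a tie.
fn segment(text: &str, dict: &HashMap<&str, u64>) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let n = chars.len();
    let mut dp: Vec<(usize, u64, usize)> = vec![(usize::MAX, 0, 0); n + 1];
    dp[0] = (0, 0, 0);
    for i in 0..n {
        if dp[i].0 == usize::MAX {
            continue;
        }
        for j in i + 1..=n {
            let span: String = chars[i..j].iter().collect();
            // Dictionary words carry their frequency; any single
            // character is always a fallback token with frequency 0.
            let freq = match dict.get(span.as_str()) {
                Some(&f) => f,
                None if j == i + 1 => 0,
                None => continue,
            };
            let cand = (dp[i].0 + 1, dp[i].1 + freq, i);
            let cur = &dp[j];
            if cand.0 < cur.0 || (cand.0 == cur.0 && cand.1 > cur.1) {
                dp[j] = cand;
            }
        }
    }
    // Walk the backpointers to recover the tokens.
    let mut tokens = Vec::new();
    let mut j = n;
    while j > 0 {
        let i = dp[j].2;
        tokens.push(chars[i..j].iter().collect());
        j = i;
    }
    tokens.reverse();
    tokens
}

fn main() {
    let dict = HashMap::from([("好學", 2_847u64), ("學生", 71_278), ("好", 1), ("生", 1)]);
    // Both splits cost two tokens, but 學生 outweighs 好學.
    assert_eq!(segment("好學生", &dict), vec!["好", "學生"]);
}
```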

### 3. Romanization

Each segmented token's Jyutping reading is taken directly from the trie.
Yale romanization is then derived from the Jyutping by converting initials
(`z`→`j`, `c`→`ch`, `j`→`y`), finals (`eoi`→`eui`, `eo`/`oe`→`eu`, etc.),
and applying tone diacritics (macron for tone 1, acute for tone 2, grave for
tone 4, acute for tone 5; tones 3 and 6 are unmarked). Low-register tones
(4–6) additionally insert `h` after the vowel nucleus, before any coda
(stop `-p`/`-t`/`-k` or nasal `-m`/`-n`/`-ng`).
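The tone step alone can be sketched like this (a simplification: ASCII Jyutping with a trailing tone digit is assumed, and the initial/final respelling is omitted to keep the sketch short):

```rust
// Strip the tone digit, mark the first vowel with the tone diacritic,
// and insert `h` after the vowel nucleus for tones 4-6.
fn yale_tone(syllable: &str) -> String {
    // Trailing ASCII tone digit assumed.
    let tone = syllable.as_bytes()[syllable.len() - 1] - b'0';
    let base = &syllable[..syllable.len() - 1];

    // Locate the vowel nucleus: the first contiguous run of vowels.
    let is_vowel = |c: char| "aeiou".contains(c);
    let start = base.find(is_vowel).unwrap_or(base.len());
    let end = base[start..]
        .find(|c| !is_vowel(c))
        .map_or(base.len(), |i| start + i);

    // Macron for tone 1, acute for tones 2/5, grave for tone 4;
    // tones 3 and 6 are unmarked.
    let marked = |v: char| -> char {
        let vowels = "aeiou";
        let idx = vowels.find(v).unwrap();
        let row = match tone {
            1 => "āēīōū",
            2 | 5 => "áéíóú",
            4 => "àèìòù",
            _ => vowels,
        };
        row.chars().nth(idx).unwrap()
    };

    let mut out = String::new();
    for (i, c) in base.char_indices() {
        out.push(if i == start { marked(c) } else { c });
        // Low-register tones insert `h` right after the nucleus.
        if i + 1 == end && tone >= 4 {
            out.push('h');
        }
    }
    out
}

fn main() {
    assert_eq!(yale_tone("tong4"), "tòhng");
    assert_eq!(yale_tone("gam1"), "gām");
    assert_eq!(yale_tone("ngo5"), "ngóh");
}
```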

## Data Sources

The bundled dictionary data is derived from
[rime-cantonese](https://github.com/rime/rime-cantonese), licensed under
[CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).

## Related Projects

- [auto-canto](https://github.com/VincentTam/auto-canto) — the Typst package
  that processes the output of this crate for automatic Cantonese annotation
- [PyCantonese](https://pycantonese.org) — the Python library that inspired
  this project
- [to-jyutping](https://github.com/CanCLID/to-jyutping) — the Node.js package
  that inspired the trie structure in this project

## License

MIT

Data bundled from rime-cantonese is licensed under CC BY 4.0 — see
[`data/README.md`](data/README.md) for details.