# rust-mando
Convert Chinese characters (漢字) to pīnyīn, with word segmentation powered by `jieba-rs` for accurate heteronym resolution. Pronunciation lookups are provided by `chinese-dictionary`.
## Why word segmentation?
Mandarin Chinese is written without spaces, and many characters have multiple
pronunciations (多音字, heteronyms) depending on the word they belong to.
Without segmentation, a converter sees the 樂 in 音樂 and in 快樂 as the same
isolated character and may pick the wrong reading. rust-mando segments text
into words first, then resolves each character's pronunciation in context.
- 音樂 → yīn yuè (樂 is read as yuè in 音樂)
- 中國 → Zhōng guó (not Zhòng guó — 中 is correctly read as zhōng in 中國)
- 快樂 → kuài lè (not kuài yuè — 樂 is correctly read as lè in 快樂)
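The effect of word-level lookup can be sketched with a toy table (hypothetical data structures for illustration only; the real crate uses `jieba-rs` and `chinese-dictionary`):

```rust
use std::collections::HashMap;

/// Toy word-level reading table: the word decides the reading of 樂.
fn word_readings() -> HashMap<&'static str, Vec<&'static str>> {
    HashMap::from([
        ("音樂", vec!["yīn", "yuè"]),
        ("快樂", vec!["kuài", "lè"]),
    ])
}

/// Toy per-character table: 樂 alone is ambiguous between two readings.
fn char_readings(c: char) -> Vec<&'static str> {
    match c {
        '樂' => vec!["yuè", "lè"],
        _ => vec![],
    }
}

fn main() {
    let words = word_readings();
    // Word context picks a single reading for 樂 in each word...
    assert_eq!(words["音樂"][1], "yuè");
    assert_eq!(words["快樂"][1], "lè");
    // ...while a character-only lookup leaves two candidates.
    assert_eq!(char_readings('樂').len(), 2);
}
```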
## Prerequisites
`dict/dict.txt.big` must be present before any build (including `cargo test`
and `cargo demo`). Download it once:
`build.rs` compresses it to `$OUT_DIR/dict.dat` (zstd level 19) automatically
on every build. The file is listed in `.gitignore` and is not committed.
## Usage
### Output targets
| Build | ABI |
|---|---|
| `cargo xtask build` | wasm-minimal-protocol |
| `cargo demo <cmd> <input>` | native CLI |
| `cargo test` | native unit tests |
### Style reference
| Style | Example (中) | Description |
|---|---|---|
| `"marks"` | zhōng | Tone diacritic on the vowel (default) |
| `"numbers"` | zhong1 | Tone number at the end of the syllable |
Any unrecognised style string falls back to `"marks"`.
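A minimal sketch of that fallback behaviour (hypothetical types, not the crate's internal code):

```rust
/// Hypothetical style enum; the crate's internal representation may differ.
#[derive(Debug, PartialEq)]
enum Style {
    Marks,
    Numbers,
}

/// Any string other than "numbers" falls back to tone marks, matching the
/// documented behaviour for unrecognised style strings.
fn parse_style(s: &str) -> Style {
    match s {
        "numbers" => Style::Numbers,
        _ => Style::Marks,
    }
}

fn main() {
    assert_eq!(parse_style("numbers"), Style::Numbers);
    assert_eq!(parse_style("marks"), Style::Marks);
    assert_eq!(parse_style("bogus"), Style::Marks); // falls back
}
```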
### As a Typst plugin
After building the WASM module, copy
`rust_mando.wasm` next to your `.typ` file and load it with `plugin()`:
```typst
#let mando = plugin("rust_mando.wasm")

// Flat pīnyīn string — non-Chinese tokens are omitted
#str(mando.pinyin_flat(bytes("北京歡迎你"), bytes("marks")))
// → "běi jīng huān yíng nǐ"

// Segmented JSON — pinyin is null for non-Chinese tokens
#json(mando.pinyin_segmented(bytes("你好world!"), bytes("marks")))
// → [
//   {"word": "你好", "pinyin": ["nǐ", "hǎo"]},
//   {"word": "world", "pinyin": null},
//   {"word": "!", "pinyin": null},
// ]
```
### As a native CLI
| Command | Description |
|---|---|
| `flat` | Space-separated pīnyīn (marks + numbers) |
| `segment` | Word-boundary breakdown with pīnyīn per segment |
| `heteronyms` | All possible readings per character (多音字) |
### As a library
Add the crate to your `Cargo.toml`:

```toml
[dependencies]
rust-mando = "0.1.0"
```
#### `to_pinyin_flat(text: &str, style: &str) -> String`
Returns a space-separated string of pīnyīn syllables. Non-Chinese tokens (Latin words, punctuation, whitespace) are omitted from the output entirely.
```rust
use rust_mando::to_pinyin_flat;

to_pinyin_flat("北京歡迎你", "marks");   // "běi jīng huān yíng nǐ"
to_pinyin_flat("北京歡迎你", "numbers"); // "bei3 jing1 huan1 ying2 ni3"
to_pinyin_flat("你好world!", "marks");   // "nǐ hǎo" — non-Chinese omitted
```
#### `to_pinyin_segmented(text: &str, style: &str) -> Vec<Segment>`
Returns one `Segment` per jieba word boundary. `pinyin` is `None` (JSON
`null`) when the word contains no Chinese characters.
```rust
use rust_mando::to_pinyin_segmented;

let segs = to_pinyin_segmented("自然語言處理", "marks");
// [
//   Segment { word: "自然語言", pinyin: Some(["zì", "rán", "yǔ", "yán"]) },
//   Segment { word: "處理", pinyin: Some(["chǔ", "lǐ"]) },
// ]

let mixed = to_pinyin_segmented("你好world!", "marks");
// [
//   Segment { word: "你好", pinyin: Some(["nǐ", "hǎo"]) },
//   Segment { word: "world", pinyin: None },
//   Segment { word: "!", pinyin: None },
// ]

// Access pinyin — use as_deref().unwrap_or(&[]) to handle None gracefully:
for seg in &segs {
    println!("{} → {}", seg.word, seg.pinyin.as_deref().unwrap_or(&[]).join(" "));
}
// 自然語言 → zì rán yǔ yán
// 處理 → chǔ lǐ
```
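Whether `pinyin` is `Some` or `None` follows from whether the word contains any Chinese characters at all. A minimal sketch of such a check, assuming the CJK Unified Ideographs block (the crate's actual classification may differ):

```rust
/// Returns true if the word contains at least one character in the CJK
/// Unified Ideographs block (U+4E00–U+9FFF). Hypothetical helper, not the
/// crate's API; extension blocks are ignored for brevity.
fn has_chinese(word: &str) -> bool {
    word.chars().any(|c| ('\u{4E00}'..='\u{9FFF}').contains(&c))
}

fn main() {
    assert!(has_chinese("你好"));   // would get Some(pinyin)
    assert!(!has_chinese("world")); // would get None
    assert!(!has_chinese("!"));     // would get None
}
```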
#### `to_pinyin_heteronyms(text: &str, style: &str) -> Vec<WordEntry>`
Returns one `WordEntry` per jieba word boundary. Each `WordEntry` contains the
original word and a `Vec<CharEntry>`. Each `CharEntry` holds the character and
all its possible readings — not just the most common one. Non-Chinese
characters appear with an empty `readings` vec.
```rust
use rust_mando::to_pinyin_heteronyms;

let entries = to_pinyin_heteronyms("中國音樂", "marks");
// [
//   WordEntry {
//     word: "中國",
//     chars: [
//       CharEntry { char: "中", readings: ["Zhōng", "zhōng", "zhòng"] },
//       CharEntry { char: "國", readings: ["Guó", "guó"] },
//     ]
//   },
//   WordEntry {
//     word: "音樂",
//     chars: [
//       CharEntry { char: "音", readings: ["yīn"] },
//       CharEntry { char: "樂", readings: ["Lè", "Yuè", "lè", "yuè"] },
//     ]
//   },
// ]

for entry in &entries {
    for c in &entry.chars {
        println!("{} → [{}]", c.char, c.readings.join(" / "));
    }
}
// 中 → [Zhōng / zhōng / zhòng]
// 國 → [Guó / guó]
// 音 → [yīn]
// 樂 → [Lè / Yuè / lè / yuè]
```
### As a WebAssembly module
```sh
# Prerequisites
rustup target add wasm32-unknown-unknown

# Build → rust_mando.wasm at workspace root
cargo xtask build

# Skip wasm-opt for faster iteration
# Compare .wasm size before and after wasm-opt

# Run native unit tests
cargo test
```
#### WASM exports
All three functions take UTF-8 bytes and return UTF-8 bytes. Structured results are JSON-encoded.
##### `pinyin_flat(text, style) → bytes`

Returns a space-separated pīnyīn string. Non-Chinese tokens are omitted.

```typst
#str(mando.pinyin_flat(bytes("北京歡迎你"), bytes("marks")))
// → "běi jīng huān yíng nǐ"
```
##### `pinyin_segmented(text, style) → bytes`

Returns a JSON array of `{"word": …, "pinyin": […] | null}` objects.

```typst
#json(mando.pinyin_segmented(bytes("你好world"), bytes("marks")))
// → [{"word":"你好","pinyin":["nǐ","hǎo"]},{"word":"world","pinyin":null}]
```
##### `pinyin_heteronyms(text, style) → bytes`

Returns a JSON array of all possible readings per character.

```typst
#json(mando.pinyin_heteronyms(bytes("中國"), bytes("marks")))
// → [{"word":"中國","chars":[
//      {"char":"中","readings":["Zhōng","zhōng","zhòng"]},
//      {"char":"國","readings":["Guó","guó"]}
//    ]}]
```
## Related projects
- `rust-canto` — Cantonese romanisation (Jyutping) Typst plugin by the same author
- `jieba-wasm` — WASM bindings for jieba-rs for web apps
- `pinyin` — lightweight character-by-character pīnyīn crate (no word segmentation)
## License
MIT