rust-mando 0.1.1

Convert Chinese characters to pinyin with jieba word segmentation

rust-mando

Convert Chinese characters (漢字) to pīnyīn, with word segmentation powered by jieba-rs for accurate heteronym resolution. Pronunciation lookups are provided by chinese-dictionary.

Why word segmentation?

Mandarin Chinese is written without spaces, and many characters have multiple pronunciations (多音字, heteronyms) depending on the word they belong to. Without segmentation, a converter sees the 樂 in 音樂 and in 快樂 as the same isolated character and may pick the wrong reading for one of them. rust-mando segments text into words first, then resolves each character's pronunciation in context.

音樂    →  yīn yuè        (樂 read as yuè in 音樂)
中國    →  Zhōng guó      (not Zhòng guó — 中 is correctly read as zhōng in 中國)
快樂    →  kuài lè        (not kuài yuè — 樂 is correctly read as lè in 快樂)
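
The difference between context-free and word-first lookup can be sketched with two toy tables. This is only an illustration of the idea: the tables, `read_by_char` and `read_by_word` are hypothetical and are not the crate's dictionary or API.

```rust
use std::collections::HashMap;

/// Toy word-level table (hypothetical data, not the crate's dictionary).
fn word_table() -> HashMap<&'static str, Vec<&'static str>> {
    HashMap::from([
        ("音樂", vec!["yīn", "yuè"]),
        ("快樂", vec!["kuài", "lè"]),
    ])
}

/// Toy character-level table: one default reading per character.
fn char_table() -> HashMap<char, &'static str> {
    HashMap::from([('音', "yīn"), ('快', "kuài"), ('樂', "lè")])
}

/// Context-free lookup: each character gets its default reading.
fn read_by_char(word: &str) -> Vec<String> {
    let chars = char_table();
    word.chars()
        .map(|c| match chars.get(&c) {
            Some(py) => py.to_string(),
            None => c.to_string(), // unknown characters pass through
        })
        .collect()
}

/// Word-first lookup: use the whole-word entry when one exists,
/// otherwise fall back to per-character defaults.
fn read_by_word(word: &str) -> Vec<String> {
    let words = word_table();
    match words.get(word) {
        Some(py) => py.iter().map(|s| s.to_string()).collect(),
        None => read_by_char(word),
    }
}
```

With these tables, `read_by_char("音樂")` yields `yīn lè` (the wrong reading of 樂), while `read_by_word("音樂")` yields `yīn yuè`.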

Prerequisites

dict/dict.txt.big must be present before any build (including cargo test and cargo demo). Download it once:

curl -L https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big \
     -o dict/dict.txt.big

build.rs compresses it to $OUT_DIR/dict.dat (zstd level 19) automatically on every build. The file is listed in .gitignore and is not committed.

Usage

Output targets

Build                       ABI
cargo xtask build           wasm-minimal-protocol
cargo demo <cmd> <input>    native CLI
cargo test                  native unit tests

Style reference

style        Example    Description
"marks"      zhōng      Tone diacritic on the vowel (default)
"numbers"    zhong1     Tone number at the end of the syllable

Any unrecognised style string falls back to "marks".
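
To make the two styles and the fallback rule concrete, here is a minimal sketch. It assumes lowercase input and the usual tone-mark placement convention (a or e takes the mark, in "ou" the o does, otherwise the last vowel). `Style`, `parse_style` and `numbered_to_marks` are hypothetical names for illustration, not part of rust-mando's API.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Style {
    Marks,
    Numbers,
}

/// Map a style string to a variant; anything unrecognised becomes `Marks`.
fn parse_style(s: &str) -> Style {
    match s {
        "numbers" => Style::Numbers,
        _ => Style::Marks,
    }
}

/// Vowels and their four tone-marked forms (tones 1 to 4).
static MARKS: [(char, [char; 4]); 6] = [
    ('a', ['ā', 'á', 'ǎ', 'à']),
    ('e', ['ē', 'é', 'ě', 'è']),
    ('i', ['ī', 'í', 'ǐ', 'ì']),
    ('o', ['ō', 'ó', 'ǒ', 'ò']),
    ('u', ['ū', 'ú', 'ǔ', 'ù']),
    ('ü', ['ǖ', 'ǘ', 'ǚ', 'ǜ']),
];

/// Convert a lowercase numbered syllable such as "zhong1" to "zhōng".
/// Input without a trailing tone digit 1-5 is returned unchanged.
fn numbered_to_marks(syllable: &str) -> String {
    let tone = match syllable.chars().last().and_then(|c| c.to_digit(10)) {
        Some(t @ 1..=5) => t as usize,
        _ => return syllable.to_string(),
    };
    // The tone digit is ASCII, so dropping one byte is safe.
    let mut chars: Vec<char> = syllable[..syllable.len() - 1].chars().collect();
    if tone == 5 {
        return chars.into_iter().collect(); // neutral tone: no mark
    }
    let is_vowel = |c: char| MARKS.iter().any(|(v, _)| *v == c);
    // Placement: 'a' or 'e' takes the mark; in "ou" the 'o' does;
    // otherwise the last vowel does.
    let target = chars
        .iter()
        .position(|&c| c == 'a' || c == 'e')
        .or_else(|| chars.windows(2).position(|w| w[0] == 'o' && w[1] == 'u'))
        .or_else(|| chars.iter().rposition(|&c| is_vowel(c)));
    if let Some(i) = target {
        let vowel = chars[i];
        if let Some((_, marked)) = MARKS.iter().find(|(v, _)| *v == vowel) {
            chars[i] = marked[tone - 1];
        }
    }
    chars.into_iter().collect()
}
```

Here `parse_style("bogus")` returns `Style::Marks`, and `numbered_to_marks("zhong1")` returns `zhōng`.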

As a Typst plugin

After building the WASM module, copy rust_mando.wasm next to your .typ file and load it with plugin():

#let mando = plugin("rust_mando.wasm")

// Flat pīnyīn string — non-Chinese tokens are omitted
#str(mando.pinyin_flat(bytes("北京歡迎你"), bytes("marks")))
// → "běi jīng huān yíng nǐ"

// Segmented JSON — pinyin is null for non-Chinese tokens
#json(mando.pinyin_segmented(bytes("你好world!"), bytes("marks")))
// → [
//     {"word": "你好",  "pinyin": ["nǐ", "hǎo"]},
//     {"word": "world", "pinyin": null},
//     {"word": "!",    "pinyin": null},
//   ]

As a native CLI

cargo demo <COMMAND> <INPUT>

Command       Description
flat          Space-separated pīnyīn (marks + numbers)
segment       Word-boundary breakdown with pīnyīn per segment
heteronyms    All possible readings per character (多音字)

cargo demo flat       今天天氣真好
cargo demo segment    自然語言處理
cargo demo heteronyms 中國音樂

As a library

[dependencies]
rust-mando = "0.1.0"

to_pinyin_flat(text: &str, style: &str) -> String

Returns a space-separated string of pīnyīn syllables. Non-Chinese tokens (Latin words, punctuation, whitespace) are omitted from the output entirely.

use rust_mando::to_pinyin_flat;

to_pinyin_flat("北京歡迎你", "marks");    // "běi jīng huān yíng nǐ"
to_pinyin_flat("北京歡迎你", "numbers");  // "bei3 jing1 huan1 ying2 ni3"
to_pinyin_flat("你好world!", "marks");   // "nǐ hǎo"  — non-Chinese omitted
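
One plausible way to decide whether a token counts as Chinese is a code-point range check. The sketch below is an assumption about how such a filter could work, not a description of the crate's internals (which may rely on dictionary lookups instead); `is_han` and `contains_han` are hypothetical helpers.

```rust
/// True for characters in the CJK Unified Ideographs block
/// (U+4E00..=U+9FFF) or Extension A (U+3400..=U+4DBF).
/// Rarer extension blocks are deliberately left out of this sketch.
fn is_han(c: char) -> bool {
    matches!(c, '\u{4E00}'..='\u{9FFF}' | '\u{3400}'..='\u{4DBF}')
}

/// True when at least one character of the token is a Han character.
fn contains_han(token: &str) -> bool {
    token.chars().any(is_han)
}
```

Under this check, `contains_han("你好")` is true, while `contains_han("world")` and `contains_han("!")` are false.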

to_pinyin_segmented(text: &str, style: &str) -> Vec<Segment>

Returns one Segment per jieba word boundary. pinyin is None (JSON null) when the word contains no Chinese characters.

use rust_mando::{to_pinyin_segmented, Segment};

let segs = to_pinyin_segmented("自然語言處理", "marks");
// [
//   Segment { word: "自然語言", pinyin: Some(["zì", "rán", "yǔ", "yán"]) },
//   Segment { word: "處理",     pinyin: Some(["chǔ", "lǐ"]) },
// ]

let mixed = to_pinyin_segmented("你好world!", "marks");
// [
//   Segment { word: "你好",  pinyin: Some(["nǐ", "hǎo"]) },
//   Segment { word: "world", pinyin: None },
//   Segment { word: "!",    pinyin: None },
// ]

// Access pinyin — use as_deref().unwrap_or(&[]) to handle None gracefully:
for seg in &segs {
    let py = seg.pinyin.as_deref().unwrap_or(&[]).join(" ");
    println!("{} → {}", seg.word, py);
}
// 自然語言 → zì rán yǔ yán
// 處理 → chǔ lǐ
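
The Option-handling pattern above can be tried without the crate by using a local stand-in type. In this sketch, `Segment` mirrors only the two fields shown in the examples, and `seg`, `flatten` and `untranslated` are hypothetical helpers. That the flat output equals the flattened segments is an assumption based on the documented behaviour, not something taken from the crate's source.

```rust
/// Local stand-in for the crate's Segment (same two fields as in the examples).
struct Segment {
    word: String,
    pinyin: Option<Vec<String>>,
}

/// Convenience constructor for the stand-in type.
fn seg(word: &str, pinyin: Option<Vec<&str>>) -> Segment {
    Segment {
        word: word.to_string(),
        pinyin: pinyin.map(|v| v.into_iter().map(String::from).collect()),
    }
}

/// Join every syllable of every Chinese segment with single spaces,
/// skipping segments whose pinyin is None.
fn flatten(segs: &[Segment]) -> String {
    segs.iter()
        .filter_map(|s| s.pinyin.as_deref())
        .flatten()
        .map(String::as_str)
        .collect::<Vec<_>>()
        .join(" ")
}

/// Collect the words that carry no pinyin (Latin text, punctuation, ...).
fn untranslated(segs: &[Segment]) -> Vec<&str> {
    segs.iter()
        .filter(|s| s.pinyin.is_none())
        .map(|s| s.word.as_str())
        .collect()
}
```

For the three segments of "你好world!" shown earlier, `flatten` returns `nǐ hǎo` and `untranslated` returns `world` and `!`.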

to_pinyin_heteronyms(text: &str, style: &str) -> Vec<WordEntry>

Returns one WordEntry per jieba word boundary. Each WordEntry contains the original word and a Vec<CharEntry>. Each CharEntry holds the character and all its possible readings — not just the most common one. Non-Chinese characters appear with an empty readings vec.

use rust_mando::{to_pinyin_heteronyms, WordEntry, CharEntry};

let entries = to_pinyin_heteronyms("中國音樂", "marks");
// [
//   WordEntry {
//     word: "中國",
//     chars: [
//       CharEntry { ch: "中", readings: ["Zhōng", "zhōng", "zhòng"] },
//       CharEntry { ch: "國", readings: ["Guó", "guó"] },
//     ]
//   },
//   WordEntry {
//     word: "音樂",
//     chars: [
//       CharEntry { ch: "音", readings: ["yīn"] },
//       CharEntry { ch: "樂", readings: ["Lè", "Yuè", "lè", "yuè"] },
//     ]
//   },
// ]

for entry in &entries {
    for ch in &entry.chars {
        println!("{} → [{}]", ch.ch, ch.readings.join(" / "));
    }
}
// 中 → [Zhōng / zhōng / zhòng]
// 國 → [Guó / guó]
// 音 → [yīn]
// 樂 → [Lè / Yuè / lè / yuè]
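
In the sample output, several readings differ only in capitalisation (for example Guó and guó), so a character with more than one entry is not necessarily a true heteronym. A minimal sketch of a case-insensitive filter follows; `CharEntry` here is a local stand-in with the same field names as the loop above, and `entry`, `distinct_readings` and `heteronym_chars` are hypothetical helpers.

```rust
/// Local stand-in mirroring the fields used in the loop above.
struct CharEntry {
    ch: String,
    readings: Vec<String>,
}

/// Convenience constructor for the stand-in type.
fn entry(ch: &str, readings: &[&str]) -> CharEntry {
    CharEntry {
        ch: ch.to_string(),
        readings: readings.iter().map(|r| r.to_string()).collect(),
    }
}

/// Lowercase every reading and drop duplicates, keeping first-seen order.
fn distinct_readings(e: &CharEntry) -> Vec<String> {
    let mut out: Vec<String> = Vec::new();
    for r in &e.readings {
        let lower = r.to_lowercase();
        if !out.contains(&lower) {
            out.push(lower);
        }
    }
    out
}

/// Characters that still have more than one reading after case-folding.
fn heteronym_chars(entries: &[CharEntry]) -> Vec<&str> {
    entries
        .iter()
        .filter(|e| distinct_readings(e).len() > 1)
        .map(|e| e.ch.as_str())
        .collect()
}
```

Applied to the four characters of 中國音樂 with the readings listed above, `heteronym_chars` keeps 中 and 樂 and drops 國 and 音.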

As a WebAssembly module

# Prerequisites
rustup target add wasm32-unknown-unknown

# Build → rust_mando.wasm at workspace root
cargo xtask build

# Skip wasm-opt for faster iteration
cargo xtask build --no-opt

# Compare .wasm size before and after wasm-opt
cargo xtask sizes

# Run native unit tests
cargo xtask test

WASM exports

All three functions take UTF-8 bytes and return UTF-8 bytes. Structured results are JSON-encoded.
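
The bytes-in, bytes-out boundary can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the crate's export code: `call_with_bytes` is a hypothetical wrapper, and `convert` stands in for whichever string-level function sits behind the export.

```rust
/// Decode both arguments as UTF-8, run a string-level converter,
/// and hand the result back as UTF-8 bytes.
/// Invalid UTF-8 in either argument becomes an error message.
fn call_with_bytes<F>(text: &[u8], style: &[u8], convert: F) -> Result<Vec<u8>, String>
where
    F: Fn(&str, &str) -> String,
{
    let text = std::str::from_utf8(text)
        .map_err(|e| format!("text is not valid UTF-8: {e}"))?;
    let style = std::str::from_utf8(style)
        .map_err(|e| format!("style is not valid UTF-8: {e}"))?;
    Ok(convert(text, style).into_bytes())
}
```

Keeping the decoding in one place means each export only has to supply the string-level function, and structured results can be serialised to a JSON string before the final `into_bytes()`.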

pinyin_flat(text, style) → bytes

Returns a space-separated pīnyīn string. Non-Chinese tokens are omitted.

#str(mando.pinyin_flat(bytes("北京歡迎你"), bytes("marks")))
// → "běi jīng huān yíng nǐ"

pinyin_segmented(text, style) → bytes

Returns a JSON array of {"word": …, "pinyin": […] | null} objects.

#json(mando.pinyin_segmented(bytes("你好world"), bytes("marks")))
// → [{"word":"你好","pinyin":["nǐ","hǎo"]},{"word":"world","pinyin":null}]

pinyin_heteronyms(text, style) → bytes

Returns a JSON array of {"word": …, "chars": […]} objects, where each entry in chars lists every possible reading for one character.

#json(mando.pinyin_heteronyms(bytes("中國"), bytes("marks")))
// → [{"word":"中國","chars":[
//      {"char":"中","readings":["Zhōng","zhōng","zhòng"]},
//      {"char":"國","readings":["Guó","guó"]}
//    ]}]

Related projects

  • rust-canto — Cantonese romanisation (Jyutping) Typst plugin by the same author
  • jieba-wasm — WASM bindings for jieba-rs for web apps
  • pinyin — lightweight character-by-character pīnyīn crate (no word segmentation)

License

MIT