rust-mando 0.1.2

Convert Chinese characters to pinyin with jieba word segmentation
# rust-mando

Convert Chinese characters (漢字) to pīnyīn, with word segmentation powered by
[jieba-rs](https://crates.io/crates/jieba-rs) for accurate heteronym resolution.
Pronunciation lookups use a compact binary table built from
[CC-CEDICT](https://www.mdbg.net/chinese/dictionary?page=cedict) at compile time.

## Why word segmentation?

Mandarin Chinese is written without spaces, and many characters have multiple
pronunciations (多音字, heteronyms) depending on the word they belong to.
Without segmentation, a converter sees 樂 in isolation in both 音樂 and 快樂
and may produce the wrong reading for one of them. `rust-mando` segments text
into words first, then resolves each character's pronunciation in context.

```
音樂    →  yīn yuè        (樂 read as yuè in 音樂)
中國    →  Zhōng guó      (not Zhòng guó — 中 is correctly read as zhōng in 中國)
快樂    →  kuài lè        (not kuài yuè — 樂 is correctly read as lè in 快樂)
```
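
The difference can be sketched with a toy lookup. The tables and function names below are hypothetical stand-ins for illustration only, not rust-mando's data or API; the crate itself relies on jieba-rs for segmentation and a CC-CEDICT-derived table for readings.

```rust
use std::collections::HashMap;

/// Toy per-character table: every character has one fixed default reading.
fn char_reading(c: char) -> Option<&'static str> {
    match c {
        '音' => Some("yīn"),
        '快' => Some("kuài"),
        '樂' => Some("lè"), // single default, which is wrong inside 音樂
        _ => None,
    }
}

/// Toy word-level table: readings are stored per word, so 樂 can differ.
fn word_table() -> HashMap<&'static str, Vec<&'static str>> {
    HashMap::from([
        ("音樂", vec!["yīn", "yuè"]),
        ("快樂", vec!["kuài", "lè"]),
    ])
}

/// Character-by-character conversion that ignores word context.
fn by_char(word: &str) -> String {
    word.chars()
        .filter_map(char_reading)
        .collect::<Vec<_>>()
        .join(" ")
}

/// Word-level conversion: try the whole word first, then fall back to characters.
fn by_word(word: &str) -> String {
    let table = word_table();
    if let Some(syllables) = table.get(word) {
        return syllables.join(" ");
    }
    by_char(word)
}
```

With these toy tables, `by_char("音樂")` yields `yīn lè`, while `by_word("音樂")` yields `yīn yuè`.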

## Usage

### Output targets

| Command                     | Target                |
|-----------------------------|-----------------------|
| `cargo xtask build`         | wasm-minimal-protocol |
| `cargo demo <cmd> <input>`  | native CLI            |
| `cargo test`                | native unit tests     |

### Style reference

| `style`     | Example  | Description                            |
|-------------|----------|----------------------------------------|
| `"marks"`   | `zhōng`  | Tone diacritic on the vowel (default)  |
| `"numbers"` | `zhong1` | Tone number at the end of the syllable |

Any unrecognised style string falls back to `"marks"`.
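
The two styles and the fallback rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the crate's implementation: `Style`, `parse_style`, `numbered_to_marked`, and `render` are hypothetical names, and the mark placement follows the common pinyin convention (`a` or `e` takes the mark, otherwise the `o` in `ou`, otherwise the last vowel). CC-CEDICT's `u:` spelling of `ü` is not handled here.

```rust
/// Output style; anything unrecognised is treated as `Marks`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Style {
    Marks,
    Numbers,
}

fn parse_style(s: &str) -> Style {
    match s {
        "numbers" => Style::Numbers,
        _ => Style::Marks, // covers "marks" and any unknown string
    }
}

/// Convert a tone-numbered syllable such as "zhong1" into "zhōng".
fn numbered_to_marked(syllable: &str) -> String {
    static MARKED: [(char, [char; 4]); 6] = [
        ('a', ['ā', 'á', 'ǎ', 'à']),
        ('e', ['ē', 'é', 'ě', 'è']),
        ('i', ['ī', 'í', 'ǐ', 'ì']),
        ('o', ['ō', 'ó', 'ǒ', 'ò']),
        ('u', ['ū', 'ú', 'ǔ', 'ù']),
        ('ü', ['ǖ', 'ǘ', 'ǚ', 'ǜ']),
    ];
    // Without a trailing tone digit, return the input unchanged.
    let tone = match syllable.chars().last().and_then(|c| c.to_digit(10)) {
        Some(t) => t as usize,
        None => return syllable.to_string(),
    };
    let mut chars: Vec<char> = syllable.chars().collect();
    chars.pop(); // drop the tone digit
    if (1..=4).contains(&tone) {
        let is_vowel = |c: &char| MARKED.iter().any(|(v, _)| v == c);
        // Placement: 'a' or 'e' first, then the 'o' of "ou", then the last vowel.
        let pos = chars
            .iter()
            .position(|&c| c == 'a' || c == 'e')
            .or_else(|| chars.windows(2).position(|w| w[0] == 'o' && w[1] == 'u'))
            .or_else(|| chars.iter().rposition(is_vowel));
        if let Some(i) = pos {
            if let Some((_, marks)) = MARKED.iter().find(|(v, _)| *v == chars[i]) {
                chars[i] = marks[tone - 1];
            }
        }
    }
    // Any other digit (for example a neutral-tone 5) is just stripped.
    chars.into_iter().collect()
}

/// Render a tone-numbered syllable in the requested style.
fn render(numbered: &str, style: &str) -> String {
    match parse_style(style) {
        Style::Numbers => numbered.to_string(),
        Style::Marks => numbered_to_marked(numbered),
    }
}
```

Under these assumptions, `render("zhong1", "marks")` and `render("zhong1", "anything-else")` both produce `zhōng`, while `render("zhong1", "numbers")` leaves the syllable as `zhong1`.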

### As a Typst plugin

After building the WASM module with `cargo xtask build` (see
[Output targets](#output-targets)), copy `rust_mando.wasm` next to your
`.typ` file and load it with `plugin()`:

```typ
#let mando = plugin("rust_mando.wasm")

// Flat pīnyīn string — non-Chinese tokens are omitted
#str(mando.pinyin_flat(bytes("北京歡迎你"), bytes("marks")))
// → "běi jīng huān yíng nǐ"

// Segmented JSON — pinyin is null for non-Chinese tokens
#json(mando.pinyin_segmented(bytes("你好world!"), bytes("marks")))
// → [
//     {"word": "你好",  "pinyin": ["nǐ", "hǎo"]},
//     {"word": "world", "pinyin": null},
//     {"word": "!",    "pinyin": null},
//   ]
```

### As a native CLI

```sh
cargo demo <COMMAND> <INPUT>
```

| Command      | Description                                               |
|--------------|-----------------------------------------------------------|
| `flat`       | Space-separated pīnyīn in both `marks` and `numbers` styles |
| `segment`    | Word-boundary breakdown with pīnyīn per segment           |

```sh
cargo demo flat       今天天氣真好
cargo demo segment    自然語言處理
```

### As a library

```toml
[dependencies]
rust-mando = "0.1.2"
```

#### `to_pinyin_flat(text: &str, style: &str) -> String`

Returns a space-separated string of pīnyīn syllables. Non-Chinese tokens
(Latin words, punctuation, whitespace) are omitted from the output entirely.

```rust
use rust_mando::to_pinyin_flat;

to_pinyin_flat("北京歡迎你", "marks");    // "běi jīng huān yíng nǐ"
to_pinyin_flat("北京歡迎你", "numbers");  // "bei3 jing1 huan1 ying2 ni3"
to_pinyin_flat("你好world!", "marks");   // "nǐ hǎo"  — non-Chinese omitted
```
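
One way to picture the "non-Chinese tokens are omitted" rule is a per-token check for CJK ideographs. The sketch below is an assumption about how such a filter could look, not the crate's actual code; `is_han`, `contains_han`, and `keep_chinese` are hypothetical helpers, and only the basic CJK Unified Ideographs block and Extension A are covered.

```rust
/// True for characters in CJK Unified Ideographs (U+4E00..=U+9FFF)
/// or Extension A (U+3400..=U+4DBF). Rarer extension blocks are ignored.
fn is_han(c: char) -> bool {
    matches!(c, '\u{4E00}'..='\u{9FFF}' | '\u{3400}'..='\u{4DBF}')
}

/// True if the token contains at least one Han character.
fn contains_han(token: &str) -> bool {
    token.chars().any(is_han)
}

/// Keep only the tokens that contain Han characters, preserving order.
fn keep_chinese<'a>(tokens: &[&'a str]) -> Vec<&'a str> {
    tokens
        .iter()
        .copied()
        .filter(|t| contains_han(t))
        .collect()
}
```

Applied to the tokens `["你好", "world", "!"]`, this filter keeps only `"你好"`, which matches the `"nǐ hǎo"` result shown above.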

#### `to_pinyin_segmented(text: &str, style: &str) -> Vec<Segment>`

Returns one `Segment` per jieba word boundary. `pinyin` is `None` (JSON
`null`) when the word contains no Chinese characters.

```rust
use rust_mando::{to_pinyin_segmented, Segment};

let segs = to_pinyin_segmented("自然語言處理", "marks");
// [
//   Segment { word: "自然語言", pinyin: Some(["zì", "rán", "yǔ", "yán"]) },
//   Segment { word: "處理",     pinyin: Some(["chǔ", "lǐ"]) },
// ]

let mixed = to_pinyin_segmented("你好world!", "marks");
// [
//   Segment { word: "你好",  pinyin: Some(["nǐ", "hǎo"]) },
//   Segment { word: "world", pinyin: None },
//   Segment { word: "!",    pinyin: None },
// ]

// Access pinyin — use as_deref().unwrap_or(&[]) to handle None gracefully:
for seg in &segs {
    let py = seg.pinyin.as_deref().unwrap_or(&[]).join(" ");
    println!("{} → {}", seg.word, py);
}
// 自然語言 → zì rán yǔ yán
// 處理 → chǔ lǐ
```
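
Because non-Chinese segments are kept with `pinyin: None`, the segmented output can be stitched back into annotated text without losing the original characters. The sketch below uses a local stand-in `Segment` struct that mirrors the two fields shown above so the example compiles on its own; the field types are assumed from the debug output, and `annotate` is a hypothetical helper, not part of the crate.

```rust
/// Local stand-in mirroring the fields shown above (assumed types);
/// the real `Segment` is exported by rust_mando.
#[derive(Debug, Clone, PartialEq)]
struct Segment {
    word: String,
    pinyin: Option<Vec<String>>,
}

/// Rebuild the text, appending "(pinyin)" after each Chinese word and
/// passing non-Chinese segments through unchanged.
fn annotate(segments: &[Segment]) -> String {
    segments
        .iter()
        .map(|seg| match &seg.pinyin {
            Some(py) => format!("{}({})", seg.word, py.join(" ")),
            None => seg.word.clone(),
        })
        .collect()
}
```

For the `mixed` example above, this would give `你好(nǐ hǎo)world!`.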

## Related projects

- [rust-canto](https://crates.io/crates/rust-canto): Cantonese romanisation
  (Jyutping) Typst plugin by the same author
- [jieba-wasm](https://github.com/fengkx/jieba-wasm): WASM bindings for
  jieba-rs for web apps
- [pinyin](https://crates.io/crates/pinyin): lightweight character-by-character
  pīnyīn crate (no word segmentation)
- [chinese_dictionary](https://crates.io/crates/chinese_dictionary): a crate
  that inspired me to use CC-CEDICT for pinyin conversion

## License

MIT