# rust-mando

Convert Chinese characters (漢字) to pīnyīn, with word segmentation powered by
[jieba-rs](https://crates.io/crates/jieba-rs) for accurate heteronym resolution.
Pronunciation lookups are provided by
[chinese-dictionary](https://crates.io/crates/chinese-dictionary).

## Why word segmentation?

Mandarin Chinese is written without spaces, and many characters have multiple
pronunciations (多音字, heteronyms) depending on the word they belong to.
Without segmentation, the 樂 in 音樂 and in 快樂 is looked up in isolation and
may receive the wrong reading. `rust-mando` segments text into words first, then
resolves each character's pronunciation in the context of its word.

```
音樂    →  yīn yuè      (樂 read as yuè in 音樂 per jieba segmentation)
中國    →  Zhōng guó    (not Zhòng guó — 中 is correctly read as zhōng in 中國)
快樂    →  kuài lè      (not kuài yuè — 樂 is correctly read as lè in 快樂)
```
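The effect of word-level lookup can be sketched with a toy table (illustration
only — the crate's real dictionary and segmentation come from jieba-rs and
chinese-dictionary):

```rust
use std::collections::HashMap;

/// Toy word→reading table (illustration only, not the crate's real data).
fn toy_dict() -> HashMap<&'static str, &'static str> {
    HashMap::from([
        ("音樂", "yīn yuè"), // 樂 read as yuè inside this word
        ("快樂", "kuài lè"), // 樂 read as lè inside this word
    ])
}

fn main() {
    let dict = toy_dict();
    // Keying readings by word disambiguates 樂; a per-character table could not.
    println!("{}", dict["音樂"]); // yīn yuè
    println!("{}", dict["快樂"]); // kuài lè
}
```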

## Usage
### Output targets

| Build                              | ABI                    |
|------------------------------------|------------------------|
| `cargo xtask build`                | wasm-minimal-protocol  |
| `cargo demo <cmd> <input>`         | native CLI             |
| `cargo test`                       | native unit tests      |

### Dictionary

When targeting `wasm32`, `build.rs` compresses `dict/dict.txt.big` to
`$OUT_DIR/dict.dat` (zstd level 19, host-only C binding). At runtime
`ruzstd` decompresses it in memory (pure Rust, WASM-safe), giving jieba
full traditional Chinese (繁體) segmentation coverage.
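Under that split, the manifest might look roughly like this (a sketch — the
crate versions shown are illustrative, not the ones this project pins):

```toml
[build-dependencies]
zstd = "0.13"     # host-only C binding; build.rs compresses dict.txt.big

[dependencies]
ruzstd = "0.7"    # pure Rust; decompresses dict.dat at runtime inside WASM
```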

Pinyin lookups use the
[chinese-dictionary](https://crates.io/crates/chinese-dictionary) crate, which
provides comprehensive pronunciation data including heteronym support.

The native CLI uses jieba's built-in dictionary (no dict file needed).

### Typst usage

After [building the standalone WASM module](#as-a-webassembly-module):

```typ
#let mando = plugin("path/to/your/rust_mando.wasm")
#let flat(txt, style) = str(mando.pinyin_flat(bytes(txt), bytes(style)))
#flat("北京歡迎你", "tones")   // "běi jīng huān yíng nǐ"
```


### As a native CLI

```sh
cargo demo <COMMAND> <INPUT>
```

| Command | Description |
|---|---|
| `flat` | Space-separated pīnyīn (in `tones` or `numbers` style) |
| `segmented` | Word-boundary breakdown with pīnyīn per segment |
| `heteronyms` | All possible readings per character (多音字) |

```sh
cargo demo flat       今天天氣真好
cargo demo segmented  自然語言處理
cargo demo heteronyms 中國音樂
```

### As a library

```toml
[dependencies]
rust-mando = "0.1.0"
```

#### `to_pinyin_flat(text: &str, style: &str) -> String`

Returns a space-separated string of pīnyīn syllables.

Note: Any characters that do not have a valid pīnyīn representation (such as
Latin letters, numbers, or symbols) are omitted from the output.

```rust
use rust_mando::to_pinyin_flat;

to_pinyin_flat("北京歡迎你", "tones");    // "běi jīng huān yíng nǐ"
to_pinyin_flat("北京歡迎你", "numbers");  // "bei3 jing1 huan1 ying2 ni3"
// Mixed input: non-Chinese characters are omitted
to_pinyin_flat("你好world", "tones");    // "nǐ hǎo"
```

#### `to_pinyin_segmented(text: &str, style: &str) -> Vec<Segment>`

Returns one `Segment` per jieba word boundary. Each `Segment` contains the
original word and a `Vec<String>` of syllables — one entry per character,
with non-Chinese characters appearing as themselves.

```rust
use rust_mando::{to_pinyin_segmented, Segment};

let segs: Vec<Segment> = to_pinyin_segmented("自然語言處理", "tones");
// [
//   Segment { word: "自然語言", pinyin: ["zì", "rán", "yǔ", "yán"] },
//   Segment { word: "處理",     pinyin: ["chǔ", "lǐ"] },
// ]

// Access fields directly:
for seg in &segs {
    println!("{} → {}", seg.word, seg.pinyin.join(" "));
}
// 自然語言 → zì rán yǔ yán
// 處理 → chǔ lǐ
```

#### `to_pinyin_heteronyms(text: &str) -> Vec<WordEntry>`

Returns one `WordEntry` per jieba word boundary. Each `WordEntry` contains
the original word and a `Vec<CharEntry>`. Each `CharEntry` holds the character
and **all** its possible tone-marked readings — not just the most likely one.
Non-Chinese characters appear with an empty `readings` vec.

```rust
use rust_mando::{to_pinyin_heteronyms, WordEntry, CharEntry};

let entries: Vec<WordEntry> = to_pinyin_heteronyms("中國音樂");
// [
//   WordEntry {
//     word: "中國",
//     chars: [
//       CharEntry { ch: "中", readings: ["Zhōng", "zhōng", "zhòng"] },
//       CharEntry { ch: "國", readings: ["Guó", "guó"] },
//     ]
//   },
//   WordEntry {
//     word: "音樂",
//     chars: [
//       CharEntry { ch: "音", readings: ["yīn"] },
//       CharEntry { ch: "樂", readings: ["Lè", "Yuè", "lè", "yuè"] },
//     ]
//   },
// ]

// Access fields directly:
for entry in &entries {
    for ch in &entry.chars {
        println!("{} → [{}]", ch.ch, ch.readings.join(" / "));
    }
}
// 中 → [Zhōng / zhōng / zhòng]
// 國 → [Guó / guó]
// 音 → [yīn]
// 樂 → [Lè / Yuè / lè / yuè]
```

### Style reference

| `style`     | Example (中)   | Description                            |
|-------------|----------------|----------------------------------------|
| `"tones"`   | `zhōng`        | Tone diacritic on the vowel            |
| `"numbers"` | `zhong1`       | Tone number at the end of the syllable |

Any unrecognised style string falls back to `"tones"`.
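A hypothetical sketch of that fallback behaviour (the enum and function names
here are illustrative, not the crate's internals):

```rust
// Illustrative only: how an unrecognised style string could fall back
// to tone diacritics.
#[derive(Debug, PartialEq)]
enum Style {
    Tones,
    Numbers,
}

fn parse_style(s: &str) -> Style {
    match s {
        "numbers" => Style::Numbers,
        // "tones" and anything unrecognised fall back to tone marks
        _ => Style::Tones,
    }
}

fn main() {
    assert_eq!(parse_style("numbers"), Style::Numbers);
    assert_eq!(parse_style("tones"), Style::Tones);
    assert_eq!(parse_style("typo"), Style::Tones); // fallback
}
```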

### As a WebAssembly module

```sh
cargo xtask build                    # wasm32-unknown-unknown + wasm-opt -Oz
cargo xtask build --no-opt           # skip wasm-opt for faster iteration
cargo xtask sizes                    # compare .wasm size before/after opt
cargo xtask test                     # run tests under wasm-pack test --node
```

#### WASM exports

The WASM module exposes three functions for use with Typst and other WASM hosts:

- **`pinyin_flat(text: &[u8], style: &[u8]) -> Vec<u8>`** — Returns a
space-separated pīnyīn string as UTF-8 bytes.

  ```typ
  #str(mando.pinyin_flat(bytes("北京歡迎你"), bytes("tones")))
  // → "běi jīng huān yíng nǐ"
  ```

- **`pinyin_segmented(text: &[u8], style: &[u8]) -> Vec<u8>`** — Returns a JSON
array of segments (word + syllables) as UTF-8 bytes.

  ```typ
  #str(mando.pinyin_segmented(bytes("自然語言"), bytes("tones")))
  // → "[{\"word\":\"自然語言\",\"pinyin\":[\"zì\",\"rán\",\"yǔ\",\"yán\"]}]"
  ```

- **`pinyin_heteronyms(text: &[u8], style: &[u8]) -> Vec<u8>`** — Returns a JSON
array of all possible readings per character as UTF-8 bytes.

  ```typ
  #str(mando.pinyin_heteronyms(bytes("中國"), bytes("tones")))
  // → "[{\"word\":\"中國\",\"chars\":[{\"char\":\"中\",\"readings\":[\"Zhōng\",\"zhōng\",\"zhòng\"]},{\"char\":\"國\",\"readings\":[\"Guó\",\"guó\"]}]}]"
  ```
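The segment JSON shape above can be reproduced with a std-only sketch
(hand-rolled here for illustration; the crate's actual serialisation code may
differ):

```rust
// Illustration of the JSON shape pinyin_segmented returns; hand-rolled,
// not the crate's real serialiser.
struct Segment {
    word: String,
    pinyin: Vec<String>,
}

fn to_json(segs: &[Segment]) -> String {
    let items: Vec<String> = segs
        .iter()
        .map(|s| {
            let syllables: Vec<String> =
                s.pinyin.iter().map(|p| format!("\"{}\"", p)).collect();
            format!(
                "{{\"word\":\"{}\",\"pinyin\":[{}]}}",
                s.word,
                syllables.join(",")
            )
        })
        .collect();
    format!("[{}]", items.join(","))
}

fn main() {
    let segs = vec![Segment {
        word: "處理".into(),
        pinyin: vec!["chǔ".into(), "lǐ".into()],
    }];
    println!("{}", to_json(&segs));
    // [{"word":"處理","pinyin":["chǔ","lǐ"]}]
}
```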

## Related projects

- [rust-canto](https://crates.io/crates/rust-canto) — Cantonese romanisation
  (Jyutping) by the same author
- [jieba-wasm](https://github.com/fengkx/jieba-wasm) — WASM bindings for
  jieba-rs for web apps
- [pinyin](https://crates.io/crates/pinyin) — a lightweight crate for
  character-by-character conversion to pīnyīn

## License

MIT