inputx-pinyin
Self-developed Mandarin Pinyin input method engine for Rust — segmenter,
fuzzy syllables, FST-backed dict, L0 / L1+ ranking, WASM-ready via the
companion inputx-pinyin-wasm crate.
Powers the Inputx IME on iOS and the web. This crate is the standalone, reusable Pinyin engine, and is also publishable to crates.io for any downstream project that wants a clean, permissively licensed Mandarin Pinyin stack.
License: MIT OR Apache-2.0 (dual). Built from scratch on permissively-licensed sources only — see License below for the full attribution chain.
What's in the box
- 414,325 FST entries (`data/pinyin.fst`, ~9 MB):
  - 44,357 single-character readings from Unihan `kHanyuPinlu` + `kMandarin`
  - 369,968 multi-character phrase readings: pypinyin canonical override for ~43k phrases, Unihan cartesian product for the long tail
  - 105 hand-curated heteronym entries collapse known-noise readings (银行 → yinhang, 重新 → chongxin, 着陆 → zhuolu, …)
- Segmenter: DP all-splits enumeration of a pinyin buffer
- Fuzzy syllables: 9 toggleable consonant/vowel-pair tolerances (z⇄zh, n⇄l, en⇄eng, …) for non-standard typists
- L0 user-learning: 3-pick auto-pin per `(input, word)` pair, with JSON-serializable snapshot for cross-session persistence
- Streaming prefix scan (`prefix_for_each`): zero-allocation visitor over FST entries matching a prefix, used by Inputx's per-keystroke partial-input completion (zho → 中国, zhong → 中国/中华/中央, …)
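The segmenter item above describes a DP enumeration of every way to split a pinyin buffer. A minimal sketch of that idea follows, assuming a toy syllable set and ASCII-only input. The name `all_splits` and its signature are hypothetical and are not taken from the crate's API.

```rust
use std::collections::HashSet;

/// Enumerate every way to split `input` into pieces found in `syllables`.
/// `splits_from[i]` holds all segmentations of the suffix `input[i..]`;
/// the table is filled right to left so each suffix is solved once.
/// Assumption: `input` is ASCII, so byte offsets are valid slice points.
fn all_splits<'a>(input: &'a str, syllables: &HashSet<&str>) -> Vec<Vec<&'a str>> {
    let n = input.len();
    let mut splits_from: Vec<Vec<Vec<&'a str>>> = vec![Vec::new(); n + 1];
    // The empty suffix has exactly one segmentation: the empty one.
    splits_from[n].push(Vec::new());

    for start in (0..n).rev() {
        let mut found = Vec::new();
        for end in (start + 1)..=n {
            let piece = &input[start..end];
            if !syllables.contains(piece) {
                continue;
            }
            // Prepend `piece` to every segmentation of the remainder.
            for tail in &splits_from[end] {
                let mut seg = Vec::with_capacity(tail.len() + 1);
                seg.push(piece);
                seg.extend_from_slice(tail);
                found.push(seg);
            }
        }
        splits_from[start] = found;
    }
    std::mem::take(&mut splits_from[0])
}
```

With the toy set `{xi, an, xian}`, the buffer `xian` yields both `xi|an` and `xian`, which is the ambiguity a ranking stage would then have to resolve.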
Quick start
# Cargo.toml
[dependencies]
inputx-pinyin = "1.0"
use golia_pinyin::...;
let eng = new;
let dict = eng.dict;
// Exact-syllable lookup (FST-backed).
let cands = dict.lookup("zhongguo");
// → ["中国", "中过", ...]
// Streaming prefix scan — visitor sees every entry whose pinyin starts
// with the given string, with no Vec allocation up front. Used for
// partial-input candidate generation in hot IME loops.
dict.prefix_for_each;
// Tell the engine the user picked a word. After 3 picks of the same
// (input, word), it's auto-pinned to L0 for that input.
dict.record_pick(input, word);
The crate name on crates.io is `inputx-pinyin`, but the lib name is `golia_pinyin` for ergonomic imports, so `use golia_pinyin::...` works directly.
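The 3-pick auto-pin rule mentioned in the Quick start can be illustrated with a small counter. This is only a sketch under assumptions: `PickLog`, `record`, and `pinned_for` are hypothetical names, not the crate's types, and the choice that a later word reaching three picks replaces an earlier pin is an assumption rather than documented behavior.

```rust
use std::collections::HashMap;

/// Picks needed before a word is pinned (the "3-pick" rule).
const PIN_THRESHOLD: u32 = 3;

/// Hypothetical stand-in for an L0 user-learning store.
#[derive(Default)]
struct PickLog {
    /// Pick counts per (input, word) pair.
    counts: HashMap<(String, String), u32>,
    /// Currently pinned word per input.
    pinned: HashMap<String, String>,
}

impl PickLog {
    /// Record one pick. Returns true if `word` is pinned for `input` afterwards.
    fn record(&mut self, input: &str, word: &str) -> bool {
        let key = (input.to_owned(), word.to_owned());
        let count = self.counts.entry(key).or_insert(0);
        *count += 1;
        if *count >= PIN_THRESHOLD {
            // Assumption: the most recent word to reach the threshold wins.
            self.pinned.insert(input.to_owned(), word.to_owned());
        }
        self.pinned_for(input) == Some(word)
    }

    /// The pinned word for `input`, if any.
    fn pinned_for(&self, input: &str) -> Option<&str> {
        self.pinned.get(input).map(String::as_str)
    }
}
```

Both maps use plain string keys, so a snapshot could be serialized for cross-session persistence; the serialization itself is left out of this sketch.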
Performance
The companion Inputx IME enforces a per-keystroke 16 ms / 60 Hz frame budget on the full candidate-refresh pipeline (exact lookup + 简拼 initials + prefix scan + filter), measured in its perfgate unit test.
Bare engine numbers (Apple Silicon, release; run cargo bench -p inputx-pinyin):
| Op | Latency |
|---|---|
| `dict.lookup("zhongguo")` (exact, multi-candidate) | ~530 ns |
| `dict.lookup_into("zhongguo", &mut buf)` (reused buf) | ~510 ns |
| `dict.lookup("ni")` (single-syllable, large fanout) | ~7.3 µs |
| `dict.lookup("xxxxxx")` (miss) | ~160 ns |
| `dict.prefix_for_each("zhong", _)` (4-letter prefix) | ~770 µs |
| `dict.prefix_for_each("z", _)` (worst case, ~50k entries) | ~4.8 ms |
| `dict.prefix_for_each_raw` (vs `_for_each`) | ~10% faster |
| `dict.prefix_exists("zhong")` (early-terminate bool) | < 100 ns |
| `dict.record_pick(input, word)` | < 1 µs |
| `segment("zhongguorenmin")` (4-syllable input) | < 2 µs |
| `encode::char_to_pinyin('中')` (reverse lookup, cached) | < 50 ns |
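The gap between a full prefix scan and an early-terminating existence check comes from how much of the sorted key range gets visited. The sketch below illustrates the visitor pattern with a `BTreeMap` standing in for the FST (both iterate keys in sorted order). `ToyDict`, `scan_prefix`, and `any_with_prefix` are hypothetical names, and the crate's real visitor signature may differ.

```rust
use std::collections::BTreeMap;
use std::ops::{Bound, ControlFlow};

/// Toy sorted dictionary: pinyin key -> candidate words.
type ToyDict = BTreeMap<String, Vec<String>>;

/// Call `visit(key, word)` for every entry whose key starts with `prefix`.
/// No result Vec is built; the visitor returns `Break` to stop the scan early.
fn scan_prefix<F>(dict: &ToyDict, prefix: &str, mut visit: F)
where
    F: FnMut(&str, &str) -> ControlFlow<()>,
{
    // In sorted order, matching keys form one contiguous run that
    // begins at the first key >= prefix.
    let bounds = (Bound::Included(prefix), Bound::Unbounded);
    for (key, words) in dict.range::<str, _>(bounds) {
        if !key.starts_with(prefix) {
            break; // walked past the run of matching keys
        }
        for word in words {
            if visit(key.as_str(), word.as_str()).is_break() {
                return;
            }
        }
    }
}

/// Existence check built on the same scan: stops at the first match.
fn any_with_prefix(dict: &ToyDict, prefix: &str) -> bool {
    let mut found = false;
    scan_prefix(dict, prefix, |_, _| {
        found = true;
        ControlFlow::Break(())
    });
    found
}
```

In this toy version the existence check touches at most one entry, while a scan over a one-letter prefix walks every key under it, which matches the shape of the numbers in the table.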
Tools (--features tools)
Maintainer-only binaries for regenerating the FST from upstream sources. Library consumers don't need these:
License
Engine code is dual-licensed under MIT OR Apache-2.0, at your option. © 2026 GOLIA K.K.
Bundled data and its licenses
| Source | License | What it contributes |
|---|---|---|
| Unihan Database | Unicode License v3 | Per-char pinyin readings (kHanyuPinlu, kMandarin) |
| jieba (fxsjy/jieba) | MIT | ~349k Mandarin phrase entries used to seed phrase readings |
| pypinyin (mozillazg/python-pinyin) | MIT | ~47k hand-curated canonical phrase pronunciations |
| Leipzig Corpora Collection | CC-BY 4.0 | zho_wikipedia_2018_1M, zho_news_2020_100K — frequency weights |
| SUBTLEX-CH-WF | CC-BY 4.0 | Spoken-register frequency weights from film subtitles |
The published crate only ships the derived FST + integer frequency scores — none of the source corpora text is redistributed.