inputx-pinyin 1.0.1

Self-developed Mandarin Pinyin input method engine — segmenter, fuzzy syllables, FST dict, WASM-ready. Powers the Inputx IME.

inputx-pinyin

Self-developed Mandarin Pinyin input method engine for Rust — segmenter, fuzzy syllables, FST-backed dict, L0 / L1+ ranking, WASM-ready via the companion inputx-pinyin-wasm crate.

Powers the Inputx IME on iOS and the web. This crate is the standalone, reusable Pinyin engine, also published to crates.io for any downstream project that wants a clean, permissively licensed Mandarin Pinyin stack.

License: MIT OR Apache-2.0 (dual). Built from scratch on permissively-licensed sources only — see License below for the full attribution chain.

Read this in 简体中文 · 日本語.

What's in the box

  • 414,325 FST entries (data/pinyin.fst, ~9 MB):
    • 44,357 single-character readings from Unihan kHanyuPinlu + kMandarin
    • 369,968 multi-character phrase readings — pypinyin canonical override for ~43k phrases, Unihan cartesian product for the long tail
    • 105 hand-curated heteronym entries that collapse known-noise readings (银行 → yinhang, 重新 → chongxin, 着陆 → zhuolu, …)
  • Segmenter — DP all-splits enumeration of a pinyin buffer
  • Fuzzy syllables — 9 toggleable consonant/vowel-pair tolerances (z⇄zh, n⇄l, en⇄eng, …) for non-standard typists
  • L0 user-learning — 3-pick auto-pin per (input, word) pair, with JSON-serializable snapshot for cross-session persistence
  • Streaming prefix scan (prefix_for_each) — zero-allocation visitor over FST entries matching a prefix, used by Inputx's per-keystroke partial-input completion (zho → 中国, zhong → 中国/中华/中央, …)
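
The segmenter bullet above describes a DP enumeration of all splits. A minimal sketch of that idea follows, assuming an ASCII, toneless pinyin buffer and a caller-supplied syllable table. The function name all_splits and the tiny table in main are hypothetical and are not the crate's API.

```rust
use std::collections::HashSet;

/// Enumerate every way to split `buf` into syllables found in `syllables`.
/// Works back to front: `splits[i]` holds all segmentations of `buf[i..]`.
/// An empty buffer yields exactly one (empty) segmentation.
fn all_splits(buf: &str, syllables: &HashSet<&str>) -> Vec<Vec<String>> {
    // Assumption: the buffer is plain ASCII, so byte offsets are char boundaries.
    if !buf.is_ascii() {
        return Vec::new();
    }
    let n = buf.len();
    let mut splits: Vec<Vec<Vec<String>>> = vec![Vec::new(); n + 1];
    splits[n].push(Vec::new()); // the empty suffix has one split: nothing

    for start in (0..n).rev() {
        // Toneless pinyin syllables are at most 6 letters long (e.g. "zhuang").
        for end in (start + 1)..=n.min(start + 6) {
            let piece = &buf[start..end];
            if !syllables.contains(piece) {
                continue;
            }
            // Prepend this syllable to every segmentation of the remainder.
            let tails = splits[end].clone();
            for tail in tails {
                let mut seg = Vec::with_capacity(tail.len() + 1);
                seg.push(piece.to_string());
                seg.extend(tail);
                splits[start].push(seg);
            }
        }
    }
    std::mem::take(&mut splits[0])
}

fn main() {
    // Hypothetical three-entry table; a real table lists every valid syllable.
    let table: HashSet<&str> = ["xi", "an", "xian"].iter().copied().collect();
    for split in all_splits("xian", &table) {
        println!("{}", split.join(" "));
    }
}
```

With this table, the buffer xian has two segmentations, xi + an and xian, which is the kind of ambiguity the ranking layers then have to resolve.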

Quick start

# Cargo.toml
[dependencies]
inputx-pinyin = "1.0"

use golia_pinyin::{PinyinEngine, PinyinDict};

let eng = PinyinEngine::new();
let dict = eng.dict();

// Exact-syllable lookup (FST-backed).
let cands = dict.lookup("zhongguo");
// → ["中国", "中过", ...]

// Streaming prefix scan — visitor sees every entry whose pinyin starts
// with the given string, with no Vec allocation up front. Used for
// partial-input candidate generation in hot IME loops.
dict.prefix_for_each("zho", |pinyin, word, freq| {
    println!("{pinyin} {word} (freq={freq})");
});

// Tell the engine the user picked a word. After 3 picks of the same
// (input, word), it's auto-pinned to L0 for that input.
dict.record_pick("zhongguo", "中国");

The crate name on crates.io is inputx-pinyin, but the library target is named golia_pinyin, so code imports it as use golia_pinyin::... (as in the example above).
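
The 3-pick rule mentioned in the quick start can be modelled with two hash maps. The sketch below is a stand-in under stated assumptions, not the crate's implementation: PickLearner, PIN_THRESHOLD, and rank are hypothetical names, the most recent word to reach the threshold takes the pin for that input, and the JSON snapshot is left out because it would need a serialization crate.

```rust
use std::collections::HashMap;

/// Picks needed before a word is pinned (the "3-pick" rule from the README).
const PIN_THRESHOLD: u32 = 3;

/// Hypothetical stand-in for an L0 user-learning layer.
#[derive(Default)]
struct PickLearner {
    /// Pick counts per (input, word) pair.
    counts: HashMap<(String, String), u32>,
    /// Pinned word per input.
    pinned: HashMap<String, String>,
}

impl PickLearner {
    /// Record one pick; returns true if this pick caused the word to be pinned.
    fn record_pick(&mut self, input: &str, word: &str) -> bool {
        let n = self
            .counts
            .entry((input.to_string(), word.to_string()))
            .or_insert(0);
        *n += 1;
        if *n == PIN_THRESHOLD {
            // Assumption: a later word reaching the threshold replaces the pin.
            self.pinned.insert(input.to_string(), word.to_string());
            true
        } else {
            false
        }
    }

    /// Move the pinned word for `input`, if present, to the front of `cands`.
    fn rank<'a>(&self, input: &str, mut cands: Vec<&'a str>) -> Vec<&'a str> {
        if let Some(pin) = self.pinned.get(input) {
            if let Some(pos) = cands.iter().position(|c| *c == pin.as_str()) {
                let word = cands.remove(pos);
                cands.insert(0, word);
            }
        }
        cands
    }
}

fn main() {
    let mut learner = PickLearner::default();
    for _ in 0..3 {
        learner.record_pick("zhongguo", "中国");
    }
    let ranked = learner.rank("zhongguo", vec!["中过", "中国"]);
    println!("{ranked:?}");
}
```

Keeping the counts keyed by the (input, word) pair means a word picked under one spelling does not affect ranking under another, which matches the per-pair wording above.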

Performance

The companion Inputx IME enforces a per-keystroke 16 ms / 60 Hz frame budget on the full candidate-refresh pipeline (exact lookup + 简拼 initials + prefix scan + filter), measured in its perfgate unit test.

Bare engine numbers (Apple Silicon, release):

| Op | Latency |
|---|---|
| dict.lookup("zhongguo") | < 100 ns |
| dict.prefix_for_each("zhong", ...) | ~0.8 ms |
| dict.prefix_for_each("z", ...) worst case | ~4.5 ms |
| dict.record_pick(input, word) | < 1 µs |
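
One way to check figures like these on your own hardware is a small wall-clock harness. The sketch below uses only std::time. The workload is a stand-in linear scan over synthetic strings, not a call into inputx-pinyin, and sample, median, and FRAME_BUDGET are hypothetical names.

```rust
use std::time::{Duration, Instant};

/// Frame budget quoted in the README: 16 ms per keystroke.
const FRAME_BUDGET: Duration = Duration::from_millis(16);

/// Run `op` `iters` times and return each run's wall-clock duration.
fn sample<F: FnMut()>(iters: usize, mut op: F) -> Vec<Duration> {
    (0..iters)
        .map(|_| {
            let start = Instant::now();
            op();
            start.elapsed()
        })
        .collect()
}

/// Median of the samples (upper median for even counts); None if empty.
/// Sorts the slice in place.
fn median(samples: &mut [Duration]) -> Option<Duration> {
    if samples.is_empty() {
        return None;
    }
    samples.sort();
    Some(samples[samples.len() / 2])
}

fn main() {
    // Stand-in workload: a linear prefix filter over synthetic keys.
    // Replace the closure body with the real operation under test.
    let words: Vec<String> = (0..50_000).map(|i| format!("zh{i}")).collect();
    let mut samples = sample(50, || {
        let hits = words.iter().filter(|w| w.starts_with("zh1")).count();
        // Keep the optimizer from discarding the work.
        std::hint::black_box(hits);
    });
    let med = median(&mut samples).expect("at least one sample");
    println!("median = {med:?}, within budget = {}", med <= FRAME_BUDGET);
}
```

For operations in the sub-microsecond range, such as the exact lookup, timing a batch of calls and dividing by the batch size is likely to be more reliable than timing each call on its own.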

Tools (--features tools)

Maintainer-only binaries for regenerating the FST from upstream sources. Library consumers don't need these:

cargo run --features tools --release --bin unihan-extract-readings
cargo run --features tools --release --bin compose-phrase-readings
cargo run --features tools --release --bin pinyin-fetch-corpus
cargo run --features tools --release --bin pinyin-build-weights
cargo run --features tools --release --bin pinyin-build-fst

License

Engine code is © 2026 GOLIA K.K. and dual-licensed under MIT OR Apache-2.0, at your option.

Bundled data and its licenses

| Source | License | What it contributes |
|---|---|---|
| Unihan Database | Unicode License v3 | Per-char pinyin readings (kHanyuPinlu, kMandarin) |
| jieba (fxsjy/jieba) | MIT | ~349k Mandarin phrase entries used to seed phrase readings |
| pypinyin (mozillazg/python-pinyin) | MIT | ~47k hand-curated canonical phrase pronunciations |
| Leipzig Corpora Collection | CC-BY 4.0 | Frequency weights from zho_wikipedia_2018_1M and zho_news_2020_100K |
| SUBTLEX-CH-WF | CC-BY 4.0 | Spoken-register frequency weights from film subtitles |

The published crate ships only the derived FST and integer frequency scores; none of the source corpus text is redistributed.