wubi
Self-developed Wubi 86 (五笔字型) encoder + dictionary for Rust.
PHF + FST backed, with a built-in L0 / L1+ ranking model and per-user
auto-learning. WASM-ready via the companion `wubi-wasm` crate.
License: MIT OR Apache-2.0 (dual). The dictionary data derives from the public Wubi 86 standard (王永民, 1986); no GPL or LGPL data is embedded.
What's in the box
- 135,822 FST entries:
  - 25 一级简码 (level-1 short codes), 616 二级简码 (level-2 short codes), 5,173 三级简码 (level-3 short codes)
  - 53 hand-curated 字根 (component roots) + 70,317 algorithmically decomposed CJK characters
  - 61,205 phrases (词组)
- Wubi 86 encoder, all four canonical rules
- L0 / L1+ ranking — immutable layered lexicon (Auto < Phrase < Zigen < Jianma3 < Jianma2 < Jianma1 by base weight) plus a mutable per-user override layer with a 3-pick auto-promotion rule
- Layer prefs — host-tunable multipliers per layer
- Reproducible weight pipeline (`wubi-build-weights`) with CI byte-diff verify
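The layered ranking and layer prefs listed above can be pictured with a small sketch. Only the layer order (Auto < Phrase < Zigen < Jianma3 < Jianma2 < Jianma1) and the "L0 + layer + freq" sort order come from this README. The `Layer`, `LayerPrefs`, `Candidate`, and `rank` names, the numeric base weights, and the way multipliers are applied are assumptions made for illustration, not the crate's actual types.

```rust
/// Immutable lexicon layers, declared in ascending base-weight order.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Layer {
    Auto,
    Phrase,
    Zigen,
    Jianma3,
    Jianma2,
    Jianma1,
}

impl Layer {
    /// Hypothetical base weights; only their relative order is documented.
    fn base_weight(self) -> f64 {
        match self {
            Layer::Auto => 1.0,
            Layer::Phrase => 2.0,
            Layer::Zigen => 3.0,
            Layer::Jianma3 => 4.0,
            Layer::Jianma2 => 5.0,
            Layer::Jianma1 => 6.0,
        }
    }
}

/// Host-tunable multipliers, one per layer, indexed by `Layer as usize`.
struct LayerPrefs {
    multipliers: [f64; 6],
}

impl Default for LayerPrefs {
    fn default() -> Self {
        LayerPrefs { multipliers: [1.0; 6] }
    }
}

#[derive(Debug, Clone)]
struct Candidate {
    word: String,
    layer: Layer,
    freq: u64,
    /// True when the user override layer (L0) has pinned this entry.
    pinned: bool,
}

/// Effective layer weight after applying the host's multiplier.
fn layer_score(c: &Candidate, prefs: &LayerPrefs) -> f64 {
    c.layer.base_weight() * prefs.multipliers[c.layer as usize]
}

/// Sort best-first: pinned entries, then layer score, then frequency.
fn rank(cands: &mut [Candidate], prefs: &LayerPrefs) {
    cands.sort_by(|a, b| {
        b.pinned
            .cmp(&a.pinned)
            .then_with(|| layer_score(b, prefs).total_cmp(&layer_score(a, prefs)))
            .then_with(|| b.freq.cmp(&a.freq))
    });
}
```

With default prefs a Jianma1 entry outranks a high-frequency Auto entry. Raising the Auto multiplier far enough reverses that, which is the kind of control a host-tunable multiplier is meant to give.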
Quick start
```toml
# Cargo.toml
[dependencies]
wubi = "0.1"
```

```rust
use wubi::WubiDict;

let dict = WubiDict::embedded();

// Lookup is L0 + layer + freq ranked.
let candidates = dict.lookup("khlg");
// → ["中国", "跑车", "跨国", "䟧", ...]

// Hot loop (IME use case): reuse the buffer.
let mut buf = Vec::new();
dict.lookup_into("khlg", &mut buf);

// Tell the dict the user picked a candidate. After 3 picks of the same
// (code, word), it's auto-pinned to L0. Pin/forget/layer-prefs APIs are
// also exposed for explicit host control.
dict.record_pick("khlg", "中国");
```
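The 3-pick rule described in the comment above can be modelled as a small counter keyed by (code, word). This is a sketch under assumptions, not the crate's implementation: `UserLayer`, its fields, `AUTO_PIN_THRESHOLD`, and the shapes of `pin`, `forget`, and `pinned` are hypothetical. The README only states that three picks of the same pair auto-pin it to L0 and that pin and forget APIs exist.

```rust
use std::collections::HashMap;

/// Number of picks after which a (code, word) pair is promoted (from the README).
const AUTO_PIN_THRESHOLD: u32 = 3;

/// Hypothetical mutable per-user override layer (L0).
#[derive(Default)]
struct UserLayer {
    /// Pick counts per (code, word) pair.
    picks: HashMap<(String, String), u32>,
    /// Words pinned to the top for each code, in pin order.
    pinned: HashMap<String, Vec<String>>,
}

impl UserLayer {
    /// Record one pick. Returns true only when this pick triggers promotion.
    fn record_pick(&mut self, code: &str, word: &str) -> bool {
        let reached = {
            let count = self
                .picks
                .entry((code.to_owned(), word.to_owned()))
                .or_insert(0);
            *count += 1;
            *count == AUTO_PIN_THRESHOLD
        };
        if reached {
            self.pin(code, word);
        }
        reached
    }

    /// Explicitly pin a word for a code (idempotent).
    fn pin(&mut self, code: &str, word: &str) {
        let words = self.pinned.entry(code.to_owned()).or_default();
        if !words.iter().any(|w| w == word) {
            words.push(word.to_owned());
        }
    }

    /// Drop both the pick history and any pin for this pair.
    fn forget(&mut self, code: &str, word: &str) {
        self.picks.remove(&(code.to_owned(), word.to_owned()));
        if let Some(words) = self.pinned.get_mut(code) {
            words.retain(|w| w != word);
        }
    }

    /// Words currently pinned for a code.
    fn pinned(&self, code: &str) -> &[String] {
        self.pinned.get(code).map(Vec::as_slice).unwrap_or(&[])
    }
}
```

Keeping the counter separate from the pin list means an explicit `pin` from the host and an automatic promotion end up in the same place, and `forget` can undo either.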
See the workspace README for the full architecture overview, weight pipeline, and ROADMAP.
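As a rough illustration of why quick-start candidates such as 中国 and 跑车 can share one code, here is a sketch of the standard Wubi 86 phrase rules: two characters take the first two keys of each; three characters take the first key of the first two plus the first two keys of the third; four or more take the first key of the first, second, third, and last characters. The `phrase_code` function and its input shape are hypothetical and not part of this crate's API. The caller supplies each character's full code.

```rust
/// Assemble a Wubi 86 phrase code from the full codes of its characters.
/// Returns None for fewer than two characters or when a code is too short.
fn phrase_code(full_codes: &[&str]) -> Option<String> {
    // First `n` keys of a code, or None if the code has fewer than `n` keys.
    fn take(code: &str, n: usize) -> Option<&str> {
        code.get(..n)
    }

    let code = match full_codes {
        // A phrase needs at least two characters.
        [] | [_] => return None,
        // Two characters: 2 + 2 keys.
        [a, b] => format!("{}{}", take(a, 2)?, take(b, 2)?),
        // Three characters: 1 + 1 + 2 keys.
        [a, b, c] => format!("{}{}{}", take(a, 1)?, take(b, 1)?, take(c, 2)?),
        // Four or more: first key of the 1st, 2nd, 3rd, and last character.
        [a, b, c, .., last] => format!(
            "{}{}{}{}",
            take(a, 1)?,
            take(b, 1)?,
            take(c, 1)?,
            take(last, 1)?
        ),
    };
    Some(code)
}
```

With the full codes khk (中) and lgyi (国), the two-character rule yields khlg. 跑 (khqn) plus 车 (lgnh) produces the same string, which is why a ranking step is needed to order the candidates.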
Performance (Apple Silicon, release)
| Op | Latency |
|---|---|
| 字根 / 一级简码 PHF lookup | ~10 ns |
| `dict.lookup` (1–6 cand) | 270–620 ns |
| `dict.lookup` miss | ~145 ns |
| `dict.prefix` (~5K cand) | ~1.3 ms |
| Encoder | 8–15 ns |
Tools (--features tools)
- `wubi-fetch-corpus`: download corpora declared in `data/corpus/manifest.toml`, SHA-verified, cached locally
- `wubi-build-weights`: scan corpora, derive `data/weights/weights.tsv` and `data/weights/provenance.toml`; `verify` mode for CI byte-diff
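A CI byte-diff check of the kind mentioned for `wubi-build-weights` usually comes down to two properties: the generator must be deterministic, and the regenerated bytes must equal the committed file exactly. The sketch below illustrates both with hypothetical names (`render_weights_tsv`, `first_diff`) and an assumed two-column `word<TAB>weight` layout. The real tool's file format and verification interface are not documented in this README.

```rust
use std::collections::HashMap;

/// Render weights as TSV in sorted order, so the output does not depend
/// on HashMap iteration order. The column layout is an assumption.
fn render_weights_tsv(weights: &HashMap<String, u64>) -> String {
    let mut rows: Vec<(&String, &u64)> = weights.iter().collect();
    rows.sort(); // sorts by word, then by weight
    let mut out = String::new();
    for (word, weight) in rows {
        out.push_str(&format!("{word}\t{weight}\n"));
    }
    out
}

/// Byte-diff two buffers. Returns the offset of the first mismatch,
/// or None when the buffers are identical.
fn first_diff(committed: &[u8], regenerated: &[u8]) -> Option<usize> {
    let shared = committed.len().min(regenerated.len());
    committed
        .iter()
        .zip(regenerated)
        .position(|(a, b)| a != b)
        .or(if committed.len() != regenerated.len() {
            // One buffer is a strict prefix of the other.
            Some(shared)
        } else {
            None
        })
}
```

In a CI job the committed file would be read from disk and compared against a fresh render. Any `Some(offset)` fails the build and points at where the outputs diverge.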
Stability
Pre-1.0. APIs stabilizing toward v0.2.0 (mechanism) and v1.0 (real corpus weights). Dictionary entries are stable; codes for existing characters won't change.