# wubi Rewrite Plan
A linear, gated rewrite of `wubi` into a high-performance, WASM-ready
Wubi 86 encoder + dictionary library. This plan is the source-of-truth for
all sub-tasks; check items off as they ship.
**License**: MIT OR Apache-2.0. **Final spinout target**: `github.com/goliajp/wubi`.
---
## Baseline (where we start)
- Crate exists at `core/crates/wubi/`, workspace member.
- Algorithm: rule 1 (键名 / 成字 / 单笔画) ✓, rules 2/3/4 implemented but unverified.
- Data: 25 键名 + 5 单笔画 + 25 一级简码 = 55 entries.
- Tests: 9/9 unit, 55/55 verify vs rime.
- Storage: text + `HashMap` (parsed at startup via `OnceLock`). No PHF, no FST.
- API: `WubiTable` struct + `encode(&Decomp, &ZigenTable) -> Result<String, _>`. Heap-allocates on every call.
## Targets at end of plan
| Metric | Target | Notes |
|---|---|---|
| Cold start (load → first lookup ready) | < 1ms | Static-init only |
| 字根 / 一级简码 lookup | < 50ns | PHF compile-time hash |
| Dict exact lookup | < 250ns p99 | FST traversal |
| Dict prefix iteration (k=2) | < 1µs first hit | FST native streamer |
| Single-char encode | < 500ns | Zero alloc, ASCII-only |
| Heap RSS at runtime | < 1 MB | Static data in `.rodata` |
| `wubi86.fst` artifact | < 1 MB @ 130k | succinct encoding |
| Native crate binary footprint | < 2 MB | dead-strip-friendly |
| WASM bundle (release) | < 2 MB total | acceptable for web-ime distro |
| `wasm-pack build` time | < 60s | reasonable dev loop |
## Stack decisions (settled)
1. **`#![no_std]` lib + `extern crate alloc`** — WASM-friendly, embeddable.
2. **PHF** (`phf` crate) for small static maps (字根 ≤ 200, 简码 ≤ 25, sets ≤ 25).
3. **FST** (`fst` crate, no_std default-features=false) for the main dictionary (~130k entries when scaled).
4. **Zero-alloc encoder**: `encode_into(decomp, &mut [u8; 4]) -> Result<usize, _>`.
5. **Closure-injected lookup**: encoder takes `Fn(char) -> Option<u8>` so build.rs and lib share the same algorithm.
6. **build.rs** does codegen: PHF from txt → `OUT_DIR/*.rs`; FST from txt+algorithm → `OUT_DIR/wubi86.fst`. Both `include_*!`'d into lib.
7. **`wubi-wasm`** sibling crate hosts `wasm-bindgen` wrapper. Core lib stays bindgen-free.
8. **Dual API**: zero-alloc (`encode_into`) + ergonomic (`encode -> EncodedCode`).
9. **Crate split** (when we spin out): core lib, tools (CLI), wasm wrapper. For now everything lives in `core/crates/wubi/` except the wasm crate.
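Decisions 4 and 5 can be illustrated with a minimal sketch. It is not the crate's codec: `encode_roots_into`, `SketchError`, and `demo_lookup` are hypothetical names, the input is a plain slice of 字根 rather than the real decomposition type, and the 末笔识别码 step for characters with fewer than four 字根 is left out. It only shows the shape of the idea: the caller injects the lookup as a closure and the output goes into a caller-owned `[u8; 4]`, so nothing is heap-allocated.

```rust
/// Errors for this sketch only; the crate's real `EncodeError` may differ.
#[derive(Debug, PartialEq, Eq)]
pub enum SketchError {
    /// The decomposition contained no 字根.
    Empty,
    /// The injected lookup had no key letter for this 字根.
    UnknownRoot(char),
}

/// Writes one key letter per selected 字根 into `out` and returns the number
/// of bytes written. Selection: every root when there are at most four,
/// otherwise the first three plus the last one.
pub fn encode_roots_into<F>(
    roots: &[char],
    lookup: F,
    out: &mut [u8; 4],
) -> Result<usize, SketchError>
where
    F: Fn(char) -> Option<u8>,
{
    if roots.is_empty() {
        return Err(SketchError::Empty);
    }
    let n = roots.len().min(4);
    for i in 0..n {
        // The fourth slot always takes the last root, not the fourth one.
        let root = if i == 3 { roots[roots.len() - 1] } else { roots[i] };
        out[i] = lookup(root).ok_or(SketchError::UnknownRoot(root))?;
    }
    Ok(n)
}

/// Stand-in lookup with a handful of entries (assumption: the real table is
/// generated from `data/zigen86.txt`).
pub fn demo_lookup(c: char) -> Option<u8> {
    match c {
        '王' => Some(b'g'),
        '木' => Some(b's'),
        '氵' => Some(b'i'),
        '日' => Some(b'j'),
        '月' => Some(b'e'),
        _ => None,
    }
}
```

Because the lookup is just an `Fn(char) -> Option<u8>`, a build script could pass a closure over a freshly parsed table while the library passes its static map, and both would run the same encoding function.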
---
## Phase 0 — Foundation ✓
Set up deps + tooling. No behavior change.
- [x] `Cargo.toml`: add `phf`, `fst`, `criterion`, build-dep `phf_codegen`
- [x] `cargo build` succeeds (deps resolved; only 1 unused-import warning, addressed in Phase 1)
- [x] Confirm `wasm32-unknown-unknown` target installed
- [x] Confirm `wasm-pack` available at `~/.cargo/bin/wasm-pack`
- [x] `[[bench]]` declaration deferred to Phase 5 (Cargo demands the file exist on parse)
**Gate ✓:** 9/9 unit, 55/55 verify, no regressions.
## Phase 1 — Codec refactor (self-contained, zero-alloc) ✓
- [x] `src/codec.rs` — self-contained module: types + `encode_with_lookup` taking `Fn(char) -> Option<u8>` closure, output to `&mut [u8; 4]`
- [x] `src/encode.rs` — thin wrapper exposing `encode_into` (zero alloc) + `encode -> EncodedCode` (stack-only struct)
- [x] `src/stroke.rs` — re-export shim for back-compat
- [x] `src/zigen.rs` / `src/jianma.rs` — free-function lookup (HashMap-backed for now; PHF in Phase 2)
- [x] Tests: 14/14 (added 5 codec tests on top of original 9)
**Gate ✓:** 14/14 unit, 55/55 verify, encoder hot path is zero-alloc (`encode_into(&mut [u8; 4]) -> Result<usize>`).
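As an illustration of what a "stack-only struct" for the ergonomic API could look like, here is a minimal sketch. The field layout and the `from_bytes` / `as_bytes` / `as_str` methods are assumptions for the example; the real `EncodedCode` in `src/encode.rs` may be shaped differently.

```rust
/// Hypothetical stack-only code holder: up to four ASCII letters plus a length.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct EncodedCode {
    buf: [u8; 4],
    len: u8,
}

impl EncodedCode {
    /// Builds a code from raw bytes, rejecting empty, over-long, or
    /// non-lowercase-ASCII input.
    pub fn from_bytes(bytes: &[u8]) -> Option<Self> {
        if bytes.is_empty() || bytes.len() > 4 {
            return None;
        }
        if !bytes.iter().all(|b| b.is_ascii_lowercase()) {
            return None;
        }
        let mut buf = [0u8; 4];
        buf[..bytes.len()].copy_from_slice(bytes);
        Some(Self { buf, len: bytes.len() as u8 })
    }

    /// The code as bytes, without the unused tail of the buffer.
    pub fn as_bytes(&self) -> &[u8] {
        &self.buf[..self.len as usize]
    }

    /// The code as a string slice borrowed from the struct itself.
    pub fn as_str(&self) -> &str {
        // Construction guarantees ASCII, which is always valid UTF-8.
        std::str::from_utf8(self.as_bytes()).unwrap_or("")
    }
}
```

A value like this is `Copy` and five bytes wide, so returning it from `encode` costs no allocation while still giving callers a `&str` view.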
## Phase 2 — PHF static maps
Replace HashMap-based zigen / jianma1 with compile-time PHF maps generated by
build.rs. Add PHF sets for 键名 / 单笔画 (replacing string-contains scans).
- [ ] `build.rs`:
  - Read `data/zigen86.txt` → emit `OUT_DIR/zigen.phf.rs` (`phf::Map<char, u8>`)
  - Read `data/jianma1.txt` → emit `OUT_DIR/jianma1.phf.rs` (`phf::Map<u8, char>`)
  - Emit `OUT_DIR/jianming.phf.rs` (`phf::Set<char>`) from `JIANMING_ZIGEN` const in codec
  - Emit `OUT_DIR/danbi.phf.rs` (`phf::Set<char>`) from `DAN_BI_HUA` const
  - `cargo:rerun-if-changed=` for all data + codec.rs
- [ ] `src/zigen.rs`: `include!(...)` the phf code; `pub fn lookup(c: char) -> Option<u8>`
- [ ] `src/jianma.rs`: same pattern; `pub fn lookup_jianma1(letter: u8) -> Option<char>`
- [ ] Codec uses PHF sets via `pub use` re-exports (so the constants `JIANMING_ZIGEN` / `DAN_BI_HUA` are no longer scanned)
- [ ] Encoder convenience wrapper uses `zigen::lookup` directly (no closure indirection)
**Gate:** 14/14 tests, 55/55 verify still ✓. Microbench (manual): zigen lookup ~30ns.
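The first half of the `build.rs` work, reading the data file into key/value pairs before handing them to the PHF code generator, might look like the sketch below. The file format is an assumption (one `字根 letter` pair per line, whitespace-separated, `#` for comments); the actual layout of `data/zigen86.txt` is not specified in this plan, and `parse_zigen_table` is a hypothetical name. The sketch uses only `std` and stops before code generation.

```rust
/// Parses an assumed `字根 <whitespace> letter` table into (root, key) pairs.
/// Blank lines and lines starting with `#` are skipped. Duplicate roots are
/// rejected so a bad data file fails early with a line number.
pub fn parse_zigen_table(text: &str) -> Result<Vec<(char, u8)>, String> {
    let mut entries: Vec<(char, u8)> = Vec::new();
    for (idx, raw) in text.lines().enumerate() {
        let line_no = idx + 1;
        let line = raw.trim();
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        let mut fields = line.split_whitespace();
        let (root, letter) = match (fields.next(), fields.next()) {
            (Some(r), Some(l)) => (r, l),
            _ => return Err(format!("line {}: expected two fields", line_no)),
        };
        // The root must be exactly one Unicode scalar value.
        let mut chars = root.chars();
        let c = match (chars.next(), chars.next()) {
            (Some(c), None) => c,
            _ => return Err(format!("line {}: root must be one char", line_no)),
        };
        // The key must be exactly one lowercase ASCII letter.
        let key = match letter.as_bytes() {
            [b] if b.is_ascii_lowercase() => *b,
            _ => return Err(format!("line {}: key must be a-z", line_no)),
        };
        if entries.iter().any(|&(k, _)| k == c) {
            return Err(format!("line {}: duplicate root {}", line_no, c));
        }
        entries.push((c, key));
    }
    Ok(entries)
}
```

The resulting pairs would then be fed to the map builder and written under `OUT_DIR`; that step depends on `phf_codegen` and is omitted here.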
## Phase 3 — FST dictionary pipeline ✓
- [x] `build.rs` path-imports `src/codec.rs`, parses seed, encodes algorithmically, merges with 一级简码 entries, builds FST via `fst::MapBuilder` with composite key `code\0word`, value = weight
- [x] `src/dict.rs`: `WubiDict::embedded()` reads the `include_bytes!`'d `OUT_DIR/wubi86.fst`, exposes `lookup(code)` and `prefix(prefix)` returning `Vec<String>` / `Vec<(String, String)>`
- [x] 6 dict unit tests pass: g→一, gggg→王, ggll→一, miss returns empty, prefix("g") returns g+gggg+ggll
- [x] FST artifact: 915 bytes for 55 entries
**Gate ✓:** 20/20 unit, 55/55 verify, FST artifact exists.
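To show why the composite key `code\0word` supports both exact and prefix queries, here is a small sketch that uses a sorted `BTreeMap` as a stand-in for the FST. That swap is deliberate: the `fst` crate is not used here, and `SketchDict`, `composite_key`, and `split_key` are hypothetical names. What carries over is the key ordering: because `\0` sorts below every letter, all words for an exact code sit in one contiguous run that starts at `code\0`.

```rust
use std::collections::BTreeMap;

/// Builds the composite key `code\0word` as raw bytes.
pub fn composite_key(code: &str, word: &str) -> Vec<u8> {
    let mut key = Vec::with_capacity(code.len() + 1 + word.len());
    key.extend_from_slice(code.as_bytes());
    key.push(0);
    key.extend_from_slice(word.as_bytes());
    key
}

/// Splits a composite key back into `(code, word)` at the first `\0`.
pub fn split_key(key: &[u8]) -> Option<(&str, &str)> {
    let sep = key.iter().position(|&b| b == 0)?;
    let code = std::str::from_utf8(&key[..sep]).ok()?;
    let word = std::str::from_utf8(&key[sep + 1..]).ok()?;
    Some((code, word))
}

/// Sorted-map stand-in for the FST: composite key -> weight.
pub struct SketchDict {
    map: BTreeMap<Vec<u8>, u64>,
}

impl SketchDict {
    pub fn new(entries: &[(&str, &str, u64)]) -> Self {
        let map = entries
            .iter()
            .map(|&(code, word, weight)| (composite_key(code, word), weight))
            .collect();
        Self { map }
    }

    /// Exact lookup: every key that starts with `code\0`.
    pub fn lookup(&self, code: &str) -> Vec<String> {
        let mut start = code.as_bytes().to_vec();
        start.push(0);
        self.scan(&start).into_iter().map(|(_, word)| word).collect()
    }

    /// Prefix lookup: every key that starts with the raw prefix bytes.
    pub fn prefix(&self, prefix: &str) -> Vec<(String, String)> {
        self.scan(prefix.as_bytes())
    }

    /// Walks the sorted keys from `start` while they still share it as a prefix.
    fn scan(&self, start: &[u8]) -> Vec<(String, String)> {
        self.map
            .range(start.to_vec()..)
            .take_while(|(key, _)| key.starts_with(start))
            .filter_map(|(key, _)| split_key(key))
            .map(|(code, word)| (code.to_string(), word.to_string()))
            .collect()
    }
}
```

With the three entries from the dict tests (`g`, `gggg`, `ggll`), `prefix("g")` walks all three keys in order, while `lookup("g")` stops after the single `g\0一` key.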
## Phase 4 — Lib API polish + wasm32 native build ✓
- [x] `cargo build -p wubi --target wasm32-unknown-unknown --release` succeeds (1.4s clean build)
- [x] Public API surface is the 12 re-exports in `lib.rs` (codec types, encode, dict, lookups)
- [x] `golia_wubi.wasm` (cdylib) is 86 bytes — empty without bindgen exports (real WASM API in Phase 7)
- [x] `libgolia_wubi.rlib` is 236KB — usable as Rust dep on wasm
- [ ] `#![no_std]` deferred — current std-using lib builds wasm32 fine; defer no_std until a real no_std consumer demands it
**Gate ✓:** wasm32-unknown-unknown release build succeeds, zero warnings.
## Phase 5 — criterion bench ✓
| Bench | Target | Measured | Status |
|---|---|---|---|
| `zigen_lookup_hit` | < 50ns | 7.4ns | ✓ 6.7× under |
| `zigen_lookup_miss` | < 50ns | 7.7ns | ✓ |
| `jianma1_lookup_hit` | < 50ns | 20ns | ✓ |
| `dict_lookup_gggg` (4-byte) | < 250ns | 356ns | △ slightly over |
| `dict_lookup_g` (1-byte) | < 250ns | 1.13µs | ✗ alloc-dominated |
| `dict_lookup_miss_zzzz` | — | 128ns | — |
| `dict_prefix_g` | < 1µs | 470ns | ✓ |
| `encode_jianming_wang` | < 500ns | 10ns | ✓ 50× under |
| `encode_dan_bi_hua_yi` | < 500ns | 7.2ns | ✓ |
| `encode_2_zigen_synthetic` | < 500ns | 14.4ns | ✓ |
**Notes:**
- Encoder is 10ns — hot path is essentially perfect. Algorithm-level cost dwarfed by ASCII bookkeeping.
- `dict_lookup_g`'s 1.13µs is allocation overhead (`to_ascii_lowercase().into_bytes().clone()` × 2). Backlog: zero-alloc `lookup_into(&mut [u8])` variant. Non-blocking — at typing rate (~30 keystrokes/sec) total cost is ~30µs/sec.
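- FST artifact at 55 entries is 915 bytes (≈16.6 bytes/entry). A straight linear extrapolation to 130k entries gives ≈2.2 MB, which is over the < 1 MB target. FST prefix/suffix sharing may bring this down, but the target stays unconfirmed until measured at scale.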
**Gate ✓:** baseline measured, encoder + small-table lookups crush their targets.
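The backlog item for a zero-alloc lookup hinges on normalizing the query without creating a `String` or `Vec`. A minimal sketch of just that step follows; `normalize_code` is a hypothetical helper, and the assumption is that codes are at most four ASCII letters, so a caller-owned `[u8; 4]` is enough.

```rust
/// Lowercases `code` into `buf` and returns the filled prefix of `buf`.
/// Returns `None` for empty, over-long, or non-alphabetic input.
/// No heap allocation happens on any path.
pub fn normalize_code<'a>(code: &str, buf: &'a mut [u8; 4]) -> Option<&'a [u8]> {
    let bytes = code.as_bytes();
    if bytes.is_empty() || bytes.len() > buf.len() {
        return None;
    }
    for (dst, src) in buf.iter_mut().zip(bytes) {
        if !src.is_ascii_alphabetic() {
            return None;
        }
        *dst = src.to_ascii_lowercase();
    }
    Some(&buf[..bytes.len()])
}
```

The returned slice could be used directly as the start of the composite key when walking the dictionary, which would remove the two clones noted above from the 1-byte lookup path.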
## Phase 4 — `#![no_std]` + lib API polish
Make the lib `no_std + alloc`. Trim public surface. Document.
- [ ] `src/lib.rs`: `#![no_std]`; `extern crate alloc;` (gated behind feature if needed)
- [ ] Replace any `std::*` paths with `core::*` / `alloc::*`
- [ ] Verify `cargo build --target wasm32-unknown-unknown --release` succeeds
- [ ] Public API via `lib.rs` re-exports:
  - `pub use codec::{Stroke, Shape, DecompRef, EncodeError, encode_with_lookup};`
  - `pub use encode::{encode_into, encode, EncodedCode};`
  - `pub use zigen::lookup as lookup_zigen;`
  - `pub use jianma::lookup_jianma1;`
  - `pub use dict::WubiDict;`
- [ ] Module-level `//! ...` doc comments on every public module
- [ ] Top-level rustdoc example in `lib.rs`
**Gate:** `cargo build --target wasm32-unknown-unknown` succeeds. Public API surface < 25 items. `cargo doc` builds without warnings.
## Phase 5 — criterion bench
Real numbers vs the targets above.
- [ ] `benches/lookup.rs`:
  - `bench zigen_lookup` — 1k iters of `lookup_zigen('王')`
  - `bench dict_exact` — `dict.lookup("gggg")` first hit
  - `bench dict_prefix` — `dict.prefix("g").take(5)` total
  - `bench encode_jianming` — encode 王
  - `bench encode_dan_bi_hua` — encode 一
  - `bench encode_2_zigen` — synthetic 2-字根 input
- [ ] `cargo bench` runs and reports numbers
- [ ] Add measured numbers to PLAN.md as a "Latest bench" section
**Gate:** numbers within 2× of targets. Anything way over → optimize before continuing.
## Phase 6 — Native WASM build verification ✓
Folded into Phase 4: `cargo build -p wubi --target wasm32-unknown-unknown --release` succeeds clean (1.4s). `libgolia_wubi.rlib` 236KB; bare cdylib `.wasm` is 86 bytes (no exports without bindgen — that's Phase 7).
**Gate ✓:** clean build.
## Phase 7 — `wubi-wasm` wrapper crate ✓
- [x] `core/crates/wubi-wasm/` created, added to workspace
- [x] `Cargo.toml`: cdylib, deps `wubi` (path) + `wasm-bindgen` + `js-sys`; `wasm-opt = false` (host wasm-opt rejects bulk-memory)
- [x] `src/lib.rs`: `WubiEngine` with `new`, `len` getter, `lookup(code) -> Array<string>`, `prefix(prefix) -> Array<{code, word}>`
- [x] `wasm-pack build --target web --release` produces:
  - `pkg/golia_wubi_wasm_bg.wasm` — **33KB**
  - `pkg/golia_wubi_wasm.js` — 9.6KB
  - `pkg/golia_wubi_wasm.d.ts` — 2.3KB (TypeScript types, free)
  - `pkg/package.json` — 465B
  - **Total: 56KB** (1/40 of the 2MB target)
**Gate ✓:** wasm-pack output usable as an npm package.
## Phase 8 — Web demo ✓
- [x] `core/crates/wubi-wasm/web/index.html` — input box + clickable candidate chips + "已上屏" output, dark/light reactive styling
- [x] `web/serve.sh` — runs wasm-pack then serves the crate root via `python3 -m http.server`; demo lives at `/web/` and references `/pkg/`
- [x] Smoke: `curl` confirmed `/web/`, `/web/index.html`, `/pkg/*.js`, `/pkg/*.wasm` all 200
- [x] Bundle size record: 56KB total, well under 2MB target
**Gate ✓:** end-to-end web flow works locally.
## Phase 8 — Web demo
Tiny static site that loads the WASM and lets a user try Wubi lookups.
- [ ] `core/crates/wubi-wasm/web/` directory:
  - `index.html` — input box + candidate list, vanilla DOM
  - `main.js` — imports the wasm-pack output, wires up keystrokes
  - `package.json` — declares `wubi` (built via `wasm-pack build`)
- [ ] Build script: `web/build.sh` runs `wasm-pack build --target web --release`
- [ ] Smoke test: open `web/index.html` (e.g., via `python3 -m http.server`), type `gggg`, see 王
- [ ] Bundle size measurement, recorded in PLAN.md
**Gate:** local web demo works end-to-end. WASM bundle < 2 MB.
## Phase 9 — Tools / CLI cleanup ✓
- [x] `tools/gen_dict.rs` — adapted to new `EncodedCode` API
- [x] `tools/verify.rs` — adapted, reports 55/55 ✓ vs rime
- [ ] (Optional) `tools/lookup.rs` REPL — backlog
- [x] All bins still build
**Gate ✓:** verify reports 55/55 ✓.
## Phase 10 — Data scaling (correctness across all rules)
Push beyond the initial 55 verified entries.
**Round 1 (this session, 78/78 ✓):**
- [x] Algorithm bug fixed: 2-stroke 成字字根 → 3-letter code (was 4-letter). Caught by `八` mismatch.
- [x] `zigen86.txt` extras: 亻、氵、忄、阝、辶、钅、厶 (偏旁-form 字根 commonly needed in 2- and 3-字根 decompositions)
- [x] Rule 2 verified on 16 chars: 全 信 好 明 吗 朋 卡 吕 占 林 多 少 时 江 汉 油
- [x] Rule 3 verified on 4 chars: 河 想 沙 没
- [x] 2-stroke 成字字根 verified on 3 chars: 二 七 八
**Round 2 — backlog:**
- [ ] Rule 4 (4+ 字根): need more 字根 in table (囗、宀-extended、 ⺍、⺝、廴、彳、攵 …)
- [ ] Rule 1 sub-case 3-stroke 成字字根 (e.g., 上 = HHGG, 三 = DGGG)
- [ ] Push table to ≥ 100 字根 entries
- [ ] Push seed to 30 / 30 / 30 (rule 2 / 3 / 4)
- [ ] `data/jianma2.txt` (二级简码) — most common 200
- [ ] `data/jianma3.txt` (三级简码) — most common 100
- [ ] build.rs picks up new layers
**Current state:** 78/78 ✓ vs rime. Algorithm proven on rules 1 (3 sub-cases — 键名/单笔/2-stroke 成字), 2, 3.
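The 成字字根 sub-cases (the 2-stroke fix from round 1 and the 3-stroke backlog item) follow one pattern: key letter, first stroke, second stroke when there are three or more, last stroke. The sketch below illustrates that rule under stated assumptions: the `Stroke` enum and `encode_chengzi_into` are hypothetical stand-ins, not the codec's real types, and 单笔画 (one stroke) is treated as out of scope because the plan handles it as a separate rule 1 sub-case.

```rust
/// The five basic strokes, each mapped to its home key.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Stroke {
    Heng, // 一
    Shu,  // 丨
    Pie,  // 丿
    Na,   // 丶
    Zhe,  // 乙
}

impl Stroke {
    pub fn letter(self) -> u8 {
        match self {
            Stroke::Heng => b'g',
            Stroke::Shu => b'h',
            Stroke::Pie => b't',
            Stroke::Na => b'y',
            Stroke::Zhe => b'n',
        }
    }
}

/// Encodes a 成字字根 from its key letter and stroke sequence.
/// Two strokes give a 3-letter code; three or more give a 4-letter code.
/// Returns `None` for fewer than two strokes (单笔画 is handled elsewhere).
pub fn encode_chengzi_into(key: u8, strokes: &[Stroke], out: &mut [u8; 4]) -> Option<usize> {
    let (first, last) = match strokes {
        [] | [_] => return None,
        [first, .., last] => (*first, *last),
    };
    out[0] = key;
    out[1] = first.letter();
    if strokes.len() == 2 {
        // 2-stroke case: key + first + last, no fourth letter.
        out[2] = last.letter();
        return Some(3);
    }
    out[2] = strokes[1].letter();
    out[3] = last.letter();
    Some(4)
}
```

Under this rule 八 (key `w`, 丿 then 丶) comes out as `wty`, and 上 (key `h`, 丨 一 一) as `hhgg`, matching the examples listed in this phase.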
## Phase 11 — Lab8 integration
Replace lab8-core's reliance on `data/wubi86_full.txt` (rime, LGPL).
- [ ] `lab8-core/Cargo.toml`: add `wubi = { path = "../wubi" }`
- [ ] `lab8-core/src/wubi/table.rs`: lookup goes through `golia_wubi::WubiDict::embedded()` rather than parsing txt
- [ ] Remove `WUBI_FULL` injection mechanism (the rime fallback path)
- [ ] Update `data/wubi86.txt` stub or remove (now superseded)
- [ ] `cargo test -p lab8-core` passes
- [ ] Mac: rebuild Lab8IME.app, hotkey check (after relogin) — `gggg` etc. still work
- [ ] iOS: rebuild Lab8Keyboard.appex, install on panda's iphone — same Wubi behavior, no rime data on the binary
**Gate:** Lab8 IME works on iOS with wubi data only. No LGPL data in shipped binary.
## Phase 12 — Spinout
Move out of `lab8-ime/` into its own repo.
- [ ] Confirm at this point wubi has: lib + wasm crate + tools + bench + demo + data + tests + docs + license
- [ ] Create `~/workspace/labs/wubi/` (or directly under `goliajp` GitHub org)
- [ ] `git init`, copy `core/crates/wubi/` + `core/crates/wubi-wasm/`
- [ ] Set up workspace `Cargo.toml` at the new repo root
- [ ] Add `README.md` (high-level intro, build steps, examples for Rust + WASM)
- [ ] Add `CHANGELOG.md`, basic `CONTRIBUTING.md`
- [ ] GitHub Actions CI: build, test, fmt, clippy, build wasm
- [ ] Publish initial tag `v0.1.0`
- [ ] In `lab8-ime`: switch dep to `wubi = { git = "..." , tag = "v0.1.0" }`
**Gate:** `cargo test --all` passes in the new repo. `lab8-ime` builds with `wubi` as a git dep.
## Phase 13 — Publish
- [ ] crates.io: `wubi` and `wubi-wasm` published (semver 0.1)
- [ ] npm: `@goliapkg/wubi` published as wasm-pack output
- [ ] Documentation site (or just `docs.rs` link) — verify all public items rendered
- [ ] Tagged release v0.1.0 with release notes
**Gate:** `cargo install wubi-tools` works (if we ship a CLI binary). `npm install @goliapkg/wubi` produces a usable wasm dep.
---
## Execution rules
- Each phase has a **Gate**: do not start the next phase until the gate passes.
- Each unchecked item is one focused commit (or a small batch of related commits).
- Performance changes are accompanied by criterion numbers; no "feels faster" claims.
- Algorithm changes are accompanied by `verify` re-run; never regress correctness.
- Anything that requires a 字根 / 简码 / decomp data update goes through the verify tool against rime as the cross-check oracle.
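The cross-check against an oracle can be sketched as a plain comparison that produces an `N/M` summary plus a list of mismatches. This is an illustration only: `verify`, `Report`, and `Mismatch` are hypothetical names, the oracle is assumed to be already loaded into a map from word to accepted codes, and how `tools/verify.rs` actually reads rime data is not covered here.

```rust
use std::collections::HashMap;

/// One disagreement between our code and the oracle's accepted codes.
#[derive(Debug, PartialEq, Eq)]
pub struct Mismatch {
    pub word: String,
    pub ours: String,
    /// Accepted codes joined with `/`, or `None` if the oracle lacks the word.
    pub oracle: Option<String>,
}

pub struct Report {
    pub passed: usize,
    pub total: usize,
    pub mismatches: Vec<Mismatch>,
}

impl Report {
    /// Formats the result the way the plan quotes it, e.g. `78/78`.
    pub fn summary(&self) -> String {
        format!("{}/{}", self.passed, self.total)
    }
}

/// Checks every `(word, code)` pair we generated against the oracle.
/// A pair passes when the oracle lists that exact code for the word.
pub fn verify(ours: &[(&str, &str)], oracle: &HashMap<&str, Vec<&str>>) -> Report {
    let mut mismatches = Vec::new();
    for &(word, code) in ours {
        let accepted = oracle.get(word);
        let ok = accepted.map_or(false, |codes| codes.contains(&code));
        if !ok {
            mismatches.push(Mismatch {
                word: word.to_string(),
                ours: code.to_string(),
                oracle: accepted.map(|codes| codes.join("/")),
            });
        }
    }
    Report {
        passed: ours.len() - mismatches.len(),
        total: ours.len(),
        mismatches,
    }
}
```

A 4-letter code for a 2-stroke 成字字根 would show up here as a single mismatch with both codes side by side, which is the kind of signal that exposed the `八` bug in Phase 10.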
## Latest progress
- Phases 0–9: **✓ done** (foundation, codec refactor, PHF, FST, wasm32, criterion, wasm-bindgen, web demo, CLI cleanup).
- Phase 10 round 1: **✓ done** — 78/78 ✓ vs rime; algorithm bug (2-stroke 成字字根) caught & fixed; rules 1/2/3 fully verified.
- Phase 10 round 2 (more decomp data + 简码 layers + 字根 expansion): incremental.
- Phases 11–13 (Lab8 integration, spinout, publish): pending.
**Hard numbers shipped:**
- 20 unit tests, 78/78 verify vs rime
- zigen lookup 7.4ns, encode 7–15ns, FST exact 128–356ns (1-byte `g` lookup: 1.13µs, alloc-dominated)
- WASM bundle 56KB total
- Web demo runs locally, 33KB wasm
## Latest bench
(Will populate from Phase 5 onward.)

| Bench | Target | Measured | Status |
|---|---|---|---|
| zigen_lookup | < 50ns | — | — |
| dict_exact | < 250ns | — | — |
| dict_prefix(k=2) | < 1µs | — | — |
| encode_jianming | < 500ns | — | — |
| encode_dan_bi_hua | < 500ns | — | — |
| encode_2_zigen | < 500ns | — | — |