wubi 0.4.0

Self-developed Wubi 86 encoder, dictionary, and dataset (PHF + FST, WASM-ready).
Documentation
# wubi Rewrite Plan

A linear, gated rewrite of `wubi` into a high-performance, WASM-ready
Wubi 86 encoder + dictionary library. This plan is the source-of-truth for
all sub-tasks; check items off as they ship.

**License**: MIT OR Apache-2.0. **Final spinout target**: `github.com/goliajp/wubi`.

---

## Baseline (where we start)

- Crate exists at `core/crates/wubi/`, workspace member.
- Algorithm: rule 1 (键名 / 成字 / 单笔画) ✓, rules 2/3/4 implemented but unverified.
- Data: 25 键名 + 5 单笔画 + 25 一级简码 = 55 entries.
- Tests: 9/9 unit, 55/55 verify vs rime.
- Storage: text + `HashMap` (parsed at startup via `OnceLock`). No PHF, no FST.
- API: `WubiTable` struct + `encode(&Decomp, &ZigenTable) -> Result<String, _>`. Heap-allocates on every call.

## Targets at end of plan

| Dimension                               | Target           | Why                                |
|-----------------------------------------|------------------|------------------------------------|
| Cold start (load → first lookup ready)  | < 1ms            | Static-init only                   |
| 字根 / 一级简码 lookup                   | < 50ns           | PHF compile-time hash              |
| Dict exact lookup                       | < 250ns p99      | FST traversal                      |
| Dict prefix iteration (k=2)             | < 1µs first hit  | FST native streamer                |
| Single-char encode                      | < 500ns          | Zero alloc, ASCII-only              |
| Heap RSS at runtime                     | < 1 MB           | Static data in `.rodata`           |
| `wubi86.fst` artifact                   | < 1 MB @ 130k    | succinct encoding                  |
| Native crate binary footprint           | < 2 MB           | dead-strip-friendly                 |
| WASM bundle (release)                   | < 2 MB total     | acceptable for web-ime distro      |
| `wasm-pack build` time                  | < 60s            | reasonable dev loop                |

## Stack decisions (settled)

1. **`#![no_std]` lib + `extern crate alloc`** — WASM-friendly, embeddable.
2. **PHF** (`phf` crate) for small static maps (字根 ≤ 200, 简码 ≤ 25, sets ≤ 25).
3. **FST** (`fst` crate, no_std default-features=false) for the main dictionary (~13万 conditions when scaled).
4. **Zero-alloc encoder**: `encode_into(decomp, &mut [u8; 4]) -> Result<usize, _>`.
5. **Closure-injected lookup**: encoder takes `Fn(char) -> Option<u8>` so build.rs and lib share the same algorithm.
6. **build.rs** does codegen: PHF from txt → `OUT_DIR/*.rs`; FST from txt+algorithm → `OUT_DIR/wubi86.fst`. Both `include_*!`'d into lib.
7. **`wubi-wasm`** sibling crate hosts `wasm-bindgen` wrapper. Core lib stays bindgen-free.
8. **Dual API**: zero-alloc (`encode_into`) + ergonomic (`encode -> EncodedCode`).
9. **Crate split** (when we spin out): core lib, tools (CLI), wasm wrapper. For now everything lives in `core/crates/wubi/` except the wasm crate.

---

## Phase 0 — Foundation ✓

Set up deps + tooling. No behavior change.

- [x] `Cargo.toml`: add `phf`, `fst`, `criterion`, build-dep `phf_codegen`
- [x] `cargo build` succeeds (deps resolved; only 1 unused-import warning, addressed in Phase 1)
- [x] Confirm `wasm32-unknown-unknown` target installed
- [x] Confirm `wasm-pack` available at `~/.cargo/bin/wasm-pack`
- [x] `[[bench]]` declaration deferred to Phase 5 (Cargo demands the file exist on parse)

**Gate ✓:** 9/9 unit, 55/55 verify, no regressions.

## Phase 1 — Codec refactor (self-contained, zero-alloc) ✓

- [x] `src/codec.rs` — self-contained module: types + `encode_with_lookup` taking `Fn(char) -> Option<u8>` closure, output to `&mut [u8; 4]`
- [x] `src/encode.rs` — thin wrapper exposing `encode_into` (zero alloc) + `encode -> EncodedCode` (stack-only struct)
- [x] `src/stroke.rs` — re-export shim for back-compat
- [x] `src/zigen.rs` / `src/jianma.rs` — free-function lookup (HashMap-backed for now; PHF in Phase 2)
- [x] Tests: 14/14 (added 5 codec tests on top of original 9)

**Gate ✓:** 14/14 unit, 55/55 verify, encoder hot path is zero-alloc (`encode_into(&mut [u8; 4]) -> Result<usize>`).

## Phase 2 — PHF static maps

Replace HashMap-based zigen / jianma1 with compile-time PHF maps generated by
build.rs. Add PHF sets for 键名 / 单笔画 (replacing string-contains scans).

- [ ] `build.rs`:
  - Read `data/zigen86.txt` → emit `OUT_DIR/zigen.phf.rs` (`phf::Map<char, u8>`)
  - Read `data/jianma1.txt` → emit `OUT_DIR/jianma1.phf.rs` (`phf::Map<u8, char>`)
  - Emit `OUT_DIR/jianming.phf.rs` (`phf::Set<char>`) from `JIANMING_ZIGEN` const in codec
  - Emit `OUT_DIR/danbi.phf.rs` (`phf::Set<char>`) from `DAN_BI_HUA` const
  - `cargo:rerun-if-changed=` for all data + codec.rs
- [ ] `src/zigen.rs`: `include!(...)` the phf code; `pub fn lookup(c: char) -> Option<u8>`
- [ ] `src/jianma.rs`: same pattern; `pub fn lookup_jianma1(letter: u8) -> Option<char>`
- [ ] Codec uses PHF sets via `pub use` re-exports (so the constants `JIANMING_ZIGEN` / `DAN_BI_HUA` are no longer scanned)
- [ ] Encoder convenience wrapper uses `zigen::lookup` directly (no closure indirection)

**Gate:** 9/9 tests, 55/55 verify still ✓. Microbench (manual): zigen lookup ~30ns.

## Phase 3 — FST dictionary pipeline ✓

- [x] `build.rs` path-imports `src/codec.rs`, parses seed, encodes algorithmically, merges with 一级简码 entries, builds FST via `fst::MapBuilder` with composite key `code\0word`, value = weight
- [x] `src/dict.rs`: `WubiDict::embedded()` reads `include_bytes!('s OUT_DIR/wubi86.fst)`, exposes `lookup(code)` and `prefix(prefix)` returning `Vec<String>` / `Vec<(String, String)>`
- [x] 6 dict unit tests pass: g→一, gggg→王, ggll→一, miss returns empty, prefix("g") returns g+gggg+ggll
- [x] FST artifact: 915 bytes for 55 entries

**Gate ✓:** 20/20 unit, 55/55 verify, FST artifact exists.

## Phase 4 — Lib API polish + wasm32 native build ✓

- [x] `cargo build -p wubi --target wasm32-unknown-unknown --release` succeeds (1.4s clean build)
- [x] Public API surface is the 12 re-exports in `lib.rs` (codec types, encode, dict, lookups)
- [x] `golia_wubi.wasm` (cdylib) is 86 bytes — empty without bindgen exports (real WASM API in Phase 7)
- [x] `libgolia_wubi.rlib` is 236KB — usable as Rust dep on wasm
- [ ] `#![no_std]` deferred — current std-using lib builds wasm32 fine; defer no_std until a real no_std consumer demands it

**Gate ✓:** wasm32-unknown-unknown release build succeeds, zero warnings.

## Phase 5 — criterion bench ✓

| Bench                       | Target   | Measured | Status |
|-----------------------------|----------|----------|--------|
| `zigen_lookup_hit`          | < 50ns   | 7.4ns    | ✓ 6.7× under |
| `zigen_lookup_miss`         | < 50ns   | 7.7ns    ||
| `jianma1_lookup_hit`        | < 50ns   | 20ns     ||
| `dict_lookup_gggg` (4-byte) | < 250ns  | 356ns    | △ slightly over |
| `dict_lookup_g` (1-byte)    | < 250ns  | 1.13µs   | ✗ alloc-dominated |
| `dict_lookup_miss_zzzz`     || 128ns    ||
| `dict_prefix_g`             | < 1µs    | 470ns    ||
| `encode_jianming_wang`      | < 500ns  | 10ns     | ✓ 50× under |
| `encode_dan_bi_hua_yi`      | < 500ns  | 7.2ns    ||
| `encode_2_zigen_synthetic`  | < 500ns  | 14.4ns   ||

**Notes:**
- Encoder is 10ns — hot path is essentially perfect. Algorithm-level cost dwarfed by ASCII bookkeeping.
- `dict_lookup_g`'s 1.13µs is allocation overhead (`to_ascii_lowercase().into_bytes().clone()` × 2). Backlog: zero-alloc `lookup_into(&mut [u8])` variant. Non-blocking — at typing rate (~30 keystrokes/sec) total cost is ~30µs/sec.
- FST artifact at 55 entries is 915 bytes; will scale roughly linearly. Even at 130k entries → ~1MB, on target.

**Gate ✓:** baseline measured, encoder + small-table lookups crush their targets.

## Phase 4 — `#![no_std]` + lib API polish

Make the lib `no_std + alloc`. Trim public surface. Document.

- [ ] `src/lib.rs`: `#![no_std]`; `extern crate alloc;` (gated behind feature if needed)
- [ ] Replace any `std::*` paths with `core::*` / `alloc::*`
- [ ] Verify `cargo build --target wasm32-unknown-unknown --release` succeeds
- [ ] Public API via `lib.rs` re-exports:
  - `pub use codec::{Stroke, Shape, DecompRef, EncodeError, encode_with_lookup};`
  - `pub use encode::{encode_into, encode, EncodedCode};`
  - `pub use zigen::lookup as lookup_zigen;`
  - `pub use jianma::lookup_jianma1;`
  - `pub use dict::WubiDict;`
- [ ] Module-level `//! ...` doc comments on every public module
- [ ] Top-level rustdoc example in `lib.rs`

**Gate:** `cargo build --target wasm32-unknown-unknown` succeeds. Public API surface < 25 items. `cargo doc` builds without warnings.

## Phase 5 — criterion bench

Real numbers vs the targets above.

- [ ] `benches/lookup.rs`:
  - `bench zigen_lookup` — 1k iters of `lookup_zigen('王')`
  - `bench dict_exact``dict.lookup("gggg")` first hit
  - `bench dict_prefix``dict.prefix("g").take(5)` total
  - `bench encode_jianming` — encode 王
  - `bench encode_dan_bi_hua` — encode 一
  - `bench encode_2_zigen` — synthetic 2-字根 input
- [ ] `cargo bench` runs and reports numbers
- [ ] Add measured numbers to PLAN.md as a "Latest bench" section

**Gate:** numbers within 2× of targets. Anything way over → optimize before continuing.

## Phase 6 — Native WASM build verification ✓

Folded into Phase 4: `cargo build -p wubi --target wasm32-unknown-unknown --release` succeeds clean (1.4s). `libgolia_wubi.rlib` 236KB; bare cdylib `.wasm` is 86 bytes (no exports without bindgen — that's Phase 7).

**Gate ✓:** clean build.

## Phase 7 — `wubi-wasm` wrapper crate ✓

- [x] `core/crates/wubi-wasm/` created, added to workspace
- [x] `Cargo.toml`: cdylib, deps `wubi` (path) + `wasm-bindgen` + `js-sys`; `wasm-opt = false` (host wasm-opt rejects bulk-memory)
- [x] `src/lib.rs`: `WubiEngine` with `new`, `len` getter, `lookup(code) -> Array<string>`, `prefix(prefix) -> Array<{code, word}>`
- [x] `wasm-pack build --target web --release` produces:
  - `pkg/golia_wubi_wasm_bg.wasm`**33KB**
  - `pkg/golia_wubi_wasm.js` — 9.6KB
  - `pkg/golia_wubi_wasm.d.ts` — 2.3KB (TypeScript types, free)
  - `pkg/package.json` — 465B
  - **Total: 56KB** (1/40 of the 2MB target)

**Gate ✓:** wasm-pack output usable as an npm package.

## Phase 8 — Web demo ✓

- [x] `core/crates/wubi-wasm/web/index.html` — input box + clickable candidate chips + "已上屏" output, dark/light reactive styling
- [x] `web/serve.sh` — runs wasm-pack then serves the crate root via `python3 -m http.server`; demo lives at `/web/` and references `/pkg/`
- [x] Smoke: `curl` confirmed `/web/`, `/web/index.html`, `/pkg/*.js`, `/pkg/*.wasm` all 200
- [x] Bundle size record: 56KB total, well under 2MB target

**Gate ✓:** end-to-end web flow works locally.

## Phase 8 — Web demo

Tiny static site that loads the WASM and lets a user try Wubi lookups.

- [ ] `core/crates/wubi-wasm/web/` directory:
  - `index.html` — input box + candidate list, vanilla DOM
  - `main.js` — imports the wasm-pack output, wires up keystrokes
  - `package.json` — declares `wubi` (built via `wasm-pack build`)
- [ ] Build script: `web/build.sh` runs `wasm-pack build --target web --release`
- [ ] Smoke test: open `web/index.html` (e.g., via `python3 -m http.server`), type `gggg`, see 王
- [ ] Bundle size measurement, recorded in PLAN.md

**Gate:** local web demo works end-to-end. WASM bundle < 2 MB.

## Phase 9 — Tools / CLI cleanup ✓

- [x] `tools/gen_dict.rs` — adapted to new `EncodedCode` API
- [x] `tools/verify.rs` — adapted, reports 55/55 ✓ vs rime
- [ ] Optional `tools/lookup.rs` REPL — backlog
- [x] All bins still build

**Gate ✓:** verify reports 55/55 ✓.

## Phase 10 — Data scaling (correctness across all rules)

Push beyond the initial 55 verified entries.

**Round 1 (this session, 78/78 ✓):**

- [x] Algorithm bug fixed: 2-stroke 成字字根 → 3-letter code (was 4-letter). Caught by `` mismatch.
- [x] `zigen86.txt` extras: 亻、氵、忄、阝、辶、钅、厶 (字根 偏旁 forms commonly needed in 2/3-字根 decomps)
- [x] Rule 2 verified on 16 chars: 全 信 好 明 吗 朋 卡 吕 占 林 多 少 时 江 汉 油
- [x] Rule 3 verified on 4 chars: 河 想 沙 没
- [x] 2-stroke 成字字根 verified on 3 chars: 二 七 八

**Round 2 — backlog:**

- [ ] Rule 4 (4+ 字根): need more 字根 in table (囗、宀-extended、 ⺍、⺝、廴、彳、攵 …)
- [ ] Rule 1 sub-case 3-stroke 成字字根 (e.g., 上 = HHGG, 三 = DGGG)
- [ ] Push table to ≥ 100 字根 entries
- [ ] Push seed to 30 / 30 / 30 (rule 2 / 3 / 4)
- [ ] `data/jianma2.txt` (二级简码) — most common 200
- [ ] `data/jianma3.txt` (三级简码) — most common 100
- [ ] build.rs picks up new layers

**Current state:** 78/78 ✓ vs rime. Algorithm proven on rules 1 (3 sub-cases — 键名/单笔/2-stroke 成字), 2, 3.

## Phase 11 — Lab8 integration

Replace lab8-core's reliance on `data/wubi86_full.txt` (rime, LGPL).

- [ ] `lab8-core/Cargo.toml`: add `wubi = { path = "../wubi" }`
- [ ] `lab8-core/src/wubi/table.rs`: lookup goes through `golia_wubi::WubiDict::embedded()` rather than parsing txt
- [ ] Remove `WUBI_FULL` injection mechanism (the rime fallback path)
- [ ] Update `data/wubi86.txt` stub or remove (now superseded)
- [ ] `cargo test -p lab8-core` passes
- [ ] Mac: rebuild Lab8IME.app, hotkey check (after relogin) — `gggg` etc. still work
- [ ] iOS: rebuild Lab8Keyboard.appex, install on panda's iphone — same Wubi behavior, no rime data on the binary

**Gate:** Lab8 IME works on iOS with wubi data only. No LGPL data in shipped binary.

## Phase 12 — Spinout

Move out of `lab8-ime/` into its own repo.

- [ ] Confirm at this point wubi has: lib + wasm crate + tools + bench + demo + data + tests + docs + license
- [ ] Create `~/workspace/labs/wubi/` (or directly under `goliajp` GitHub org)
- [ ] `git init`, copy `core/crates/wubi/` + `core/crates/wubi-wasm/`
- [ ] Set up workspace `Cargo.toml` at the new repo root
- [ ] Add `README.md` (high-level intro, build steps, examples for Rust + WASM)
- [ ] Add `CHANGELOG.md`, basic `CONTRIBUTING.md`
- [ ] GitHub Actions CI: build, test, fmt, clippy, build wasm
- [ ] Publish initial tag `v0.1.0`
- [ ] In `lab8-ime`: switch dep to `wubi = { git = "..." , tag = "v0.1.0" }`

**Gate:** `cargo test --all` passes in the new repo. `lab8-ime` builds with `wubi` as a git dep.

## Phase 13 — Publish

- [ ] crates.io: `wubi` and `wubi-wasm` published (semver 0.1)
- [ ] npm: `@goliapkg/wubi` published as wasm-pack output
- [ ] Documentation site (or just `docs.rs` link) — verify all public items rendered
- [ ] Tagged release v0.1.0 with release notes

**Gate:** `cargo install wubi-tools` works (if we ship a CLI binary). `npm install @goliapkg/wubi` produces a usable wasm dep.

---

## Execution rules

- Each phase has a **Gate**: do not start the next phase until the gate passes.
- Each unchecked item is one focused commit (or a small batch of related commits).
- Performance changes are accompanied by criterion numbers; no "feels faster" claims.
- Algorithm changes are accompanied by `verify` re-run; never regress correctness.
- Anything that requires a 字根 / 简码 / decomp data update goes through the verify tool against rime as the cross-check oracle.

## Latest progress

- Phases 0–9: **✓ done** (foundation, codec refactor, PHF, FST, wasm32, criterion, wasm-bindgen, web demo, CLI cleanup).
- Phase 10 round 1: **✓ done** — 78/78 ✓ vs rime; algorithm bug (2-stroke 成字字根) caught & fixed; rules 1/2/3 fully verified.
- Phase 10 round 2 (more decomp data + 简码 layers + 字根 expansion): incremental.
- Phases 11–13 (Lab8 integration, spinout, publish): pending.

**Hard numbers shipped:**
- 20 unit tests, 78/78 verify vs rime
- zigen lookup 7.4ns, encode 7–15ns, FST exact 128–356ns
- WASM bundle 56KB total
- Web demo runs locally, 33KB wasm

## Latest bench

(Will populate from Phase 5 onward.)

| Bench               | Target  | Measured | Status |
|---------------------|---------|----------|--------|
| zigen_lookup        | < 50ns  |||
| dict_exact          | < 250ns |||
| dict_prefix(k=2)    | < 1µs   |||
| encode_jianming     | < 500ns |||
| encode_dan_bi_hua   | < 500ns |||
| encode_2_zigen      | < 500ns |||