Crate inputx_wubi_data

inputx-wubi-data — embedded Wubi 86 IDFv1 dict + lookup helpers for the inputx-wubi engine, packaged as a publishable stone.

Successor to inputx-wubi-cement under the v1.5 D11 taxonomy correction (2026-05). Under that correction, "cement" means an application’s own source code (your own wubi.rs / engine.rs), not a published crate. The historical -cement-suffixed crate is deprecated and re-exports from this crate for backward compatibility.

§What’s in the box

  • EMBEDDED_WUBI_IDF — IDFv1 binary dict blob with the wubi Layer enum index encoded in EntryFlags::engine_tag() (v1.4.7 sub-phase A4 step 2).
  • wubi_idf_reader — process-global OnceLock<IdfReader> over the embedded blob; amortizes the 4 MB parse + sha256 verify across the process lifetime.
  • layer_from_idf_tag — reverse of Layer::as_u8; decodes an IDF entry’s engine_tag back into the originating wubi Layer.
  • table module — process-global stateful WubiDict cache + per-code lookup helpers (lookup, lookup_with_scores, lookup_with_layer, lookup_with_freq_layer, prefix_predictions, record_pick, export_l0, import_l0) + rare-CJK toggle (set_show_rare / show_rare) + warmup helper.

§What’s NOT here

  • Stateful WubiEngine (buffer / handle_letter / auto-commit / commit_index / L0 pin state machine) — that classifies as application cement per the v1.5 D11 correction and now lives in the Inputx monorepo’s inputx-core/src/wubi/engine.rs. IME implementers copying this stone are expected to bring their own state machine matching their UI ergonomics.

Structs§

L0Snapshot
Re-export of the wubi L0 snapshot type so hosts can build / destructure it without depending on the inputx-wubi crate directly. The snapshot holds the persistent state of the L0 layer. The caller serializes / deserializes it however it likes (TOML, MessagePack, sqlite, …); the crate intentionally has no serde dependency.

Constants§

EMBEDDED_WUBI_IDF
Embedded IDFv1 wubi dict blob, sourced from inputx-wubi-data/data/words.idf at compile time. Each entry’s EntryFlags::engine_tag() carries the wubi Layer enum index (v1.4.7 sub-phase A4 step 2 schema bump), so cement-side fills can reconstruct (word, layer, raw_freq) without re-reading the inputx_wubi::WubiDict table.

Functions§

export_l0
Snapshot the current L0 state (pins + pending pick counts + layer prefs) for host-side persistence. Host stores it however it wants (UserDefaults on Apple platforms, IndexedDB in web, etc.) and feeds it back via import_l0 on next launch.
import_l0
Restore a previously-exported L0 snapshot. Entries whose (code, word) no longer exist in the lexicon (e.g., after a wubi data version bump removed an extension char) are silently dropped. Returns the count of accepted pins.
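The drop-and-count behaviour of import_l0 can be pictured with a minimal sketch. Everything here is a hypothetical stand-in: the Pin alias, the HashSet lexicon, and import_pins_sketch are illustration only, since the real L0Snapshot layout and lexicon type are not shown on this page.

```rust
use std::collections::HashSet;

/// A (code, word) pair such as ("jj", "是"). Hypothetical stand-in for a pin entry.
type Pin = (String, String);

/// Keeps only the snapshot pins that still exist in `lexicon`, appends them
/// to `live_pins`, and returns how many were accepted. Stale pins are
/// dropped without an error, mirroring the "silently dropped" wording above.
fn import_pins_sketch(
    lexicon: &HashSet<Pin>,
    snapshot: &[Pin],
    live_pins: &mut Vec<Pin>,
) -> usize {
    let before = live_pins.len();
    live_pins.extend(snapshot.iter().filter(|p| lexicon.contains(*p)).cloned());
    live_pins.len() - before
}
```

A host could compare the returned count with the number of pins it stored to detect that a data version bump invalidated some of them.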
is_displayable
true iff every character in word is below the rare-CJK threshold (U+20000 — start of CJK Extension B). Pinyin-side composer also calls this so the user-facing rare-char toggle applies uniformly to both engines (item 54).
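The rule stated for is_displayable is small enough to sketch directly. The names RARE_CJK_START and is_displayable_sketch are made up for this illustration; only the U+20000 threshold comes from the description above.

```rust
/// First code point of CJK Extension B, used as the rare-CJK threshold.
const RARE_CJK_START: u32 = 0x20000;

/// Sketch: true iff every character of `word` sits below the threshold.
/// An empty string is trivially displayable because `all` is vacuously true.
fn is_displayable_sketch(word: &str) -> bool {
    word.chars().all(|c| (c as u32) < RARE_CJK_START)
}
```

Note that this is a pure code-point comparison, so anything at or above U+20000 (Extension B and all later planes) counts as rare in this sketch.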
layer_from_idf_tag
Decode an IDF wubi entry’s EntryFlags::engine_tag() back into the originating inputx_wubi::Layer variant. Falls back to Layer::Auto on out-of-range bytes (defensive — the writer only emits 0..=5).
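A decode-with-fallback of this shape might look like the sketch below. The local Layer enum copies the variant names listed on this page, but the numeric order of the variants is an assumption; the real inputx_wubi::Layer may number them differently.

```rust
/// Local stand-in for inputx_wubi::Layer. The discriminant order is assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Layer {
    Jianma1,
    Jianma2,
    Jianma3,
    Zigen,
    Phrase,
    Auto,
}

impl Layer {
    /// Encode the variant as its index (0..=5 under the assumed order).
    fn as_u8(self) -> u8 {
        self as u8
    }
}

/// Reverse of `as_u8`. Any byte outside 0..=4 decodes to `Auto`, which
/// covers both the legitimate tag 5 and the defensive out-of-range case.
fn layer_from_tag_sketch(tag: u8) -> Layer {
    match tag {
        0 => Layer::Jianma1,
        1 => Layer::Jianma2,
        2 => Layer::Jianma3,
        3 => Layer::Zigen,
        4 => Layer::Phrase,
        _ => Layer::Auto,
    }
}
```

Falling back instead of returning an Option keeps the caller's hot path free of error handling, at the cost of hiding a corrupt tag.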
lookup
Exact lookup for code. Returns the candidates ranked by L0/L1, with rare CJK candidates filtered unless show_rare() is true.
lookup_with_freq_layer
Per-code lookup exposing raw frequency (separate from layer.base · pref) — used by the v1.4.7 composite hot path for orthodox score decomposition into (log_prior_q4 = Q4·ln(1+freq), log_likelihood_q4 = Q4·ln(layer.base() · pref · demotes)). Rare-CJK filter applied uniformly with lookup_with_layer.
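The two formulas quoted above can be written out as a rough sketch. This assumes Q4 is a plain scale factor supplied by the caller and that results are rounded to the nearest integer; neither detail is confirmed on this page, and both function names are hypothetical.

```rust
/// Sketch of log_prior_q4 = Q4 * ln(1 + freq).
/// `q4` is an assumed scale factor; rounding to nearest is also an assumption.
fn log_prior_q4_sketch(freq: u32, q4: f64) -> i32 {
    (q4 * (freq as f64).ln_1p()).round() as i32
}

/// Sketch of log_likelihood_q4 = Q4 * ln(layer_base * pref * demotes).
/// All three factors are expected to be positive.
fn log_likelihood_q4_sketch(layer_base: f64, pref: f64, demotes: f64, q4: f64) -> i32 {
    (q4 * (layer_base * pref * demotes).ln()).round() as i32
}
```

Keeping the raw frequency separate from the layer, preference, and demotion factors lets a caller add the two log terms, which equals the log of the multiplied score up to rounding.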
lookup_with_layer
Layer-aware variant: each candidate also carries its origin Layer (Jianma1/2/3, Zigen, Phrase, Auto). Composite dispatch uses the layer tag to make context-aware ranking decisions — e.g. demoting low-confidence Auto / Phrase wubi candidates when the buffer shape suggests pinyin intent, while keeping high-confidence Jianma simcodes untouched (the 伙-rule: wubi simcodes always lead at their code).
lookup_with_scores
Scored variant of lookup. Returns (word, score) tuples for the composite cross-engine merge. Rare-CJK filter applied here too.
prefix_predictions
Prefix-prediction lookup: (word, freq, code_len) for every dict entry whose code strictly extends prefix (no exact-code matches). Rare-CJK filter applied uniformly with lookup. Wired into the composite dispatch so Wubi gets the same prefix-prediction shape as pinyin / JP (e.g. jj exact 是 stays at #0, predictions 日/时 follow).
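The "strictly extends prefix" rule can be illustrated over an ordered map. This is a sketch only: the crate is described as FST-backed, and the BTreeMap, the function name, and the absence of any ranking or rare-CJK filtering are simplifications made here.

```rust
use std::collections::BTreeMap;

/// Returns (word, freq, code_len) for every entry whose code starts with
/// `prefix` and is strictly longer than it. Exact-code matches are skipped.
fn prefix_predictions_sketch(
    dict: &BTreeMap<String, Vec<(String, u32)>>,
    prefix: &str,
) -> Vec<(String, u32, usize)> {
    let mut out = Vec::new();
    // Keys are sorted, so all codes sharing the prefix form one contiguous run.
    for (code, entries) in dict.range(prefix.to_string()..) {
        if !code.starts_with(prefix) {
            break; // Left the run of matching codes.
        }
        if code.len() == prefix.len() {
            continue; // Exact match: not a prediction.
        }
        for (word, freq) in entries {
            out.push((word.clone(), *freq, code.len()));
        }
    }
    out
}
```

Returning code_len alongside each word lets the caller decide how to rank shorter extensions against longer ones.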
record_pick
Notify the dictionary that the user committed word for code. The internal pick counter advances; on threshold the word auto-pins. All learning logic lives in wubi — this is just a passthrough so the IME layer doesn’t need to know about counters.
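The counter-then-pin behaviour described for record_pick might be modelled as below. The PickTracker type, its fields, and the caller-chosen threshold are all hypothetical; the actual threshold and bookkeeping live in the wubi crate and are not documented on this page.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical model of pick counting with auto-pin on a threshold.
struct PickTracker {
    threshold: u32,
    counts: HashMap<(String, String), u32>,
    pins: HashSet<(String, String)>,
}

impl PickTracker {
    fn new(threshold: u32) -> Self {
        PickTracker { threshold, counts: HashMap::new(), pins: HashSet::new() }
    }

    /// Records one commit of `word` for `code`.
    /// Returns true only on the pick that causes the word to be pinned.
    fn record_pick(&mut self, code: &str, word: &str) -> bool {
        let key = (code.to_string(), word.to_string());
        if self.pins.contains(&key) {
            return false; // Already pinned: nothing left to learn.
        }
        let reached = {
            let count = self.counts.entry(key.clone()).or_insert(0);
            *count += 1;
            *count >= self.threshold
        };
        if reached {
            self.counts.remove(&key); // The pending counter is no longer needed.
            self.pins.insert(key);
        }
        reached
    }
}
```

With a threshold of 3, the third commit of the same (code, word) pair pins it and later commits are no-ops.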
set_show_rare
show_rare
warmup
Force-init the embedded WubiDict and exercise common lookup paths so the OS faults the FST’s .rodata pages into RAM and any internal fst::Map streamer state is primed. Idempotent: it relies on OnceLock::get_or_init for the dict and on WubiDict::lookup for the page-touch effect. Called from Session::warmup so a host can offload the cold-path cost to a background thread at startup instead of paying it on the user’s first keystroke. The first (cold) call takes roughly 100-300 ms on iPhone; repeat calls take under 1 ms.
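The background-thread pattern mentioned for warmup could be sketched as follows. The toy dictionary and all function names are placeholders; a real host would call this crate's warmup function inside the spawned thread instead.

```rust
use std::sync::OnceLock;
use std::thread;

/// Toy stand-in for the embedded dictionary (the real one is FST-backed).
fn toy_dict() -> &'static Vec<(String, String)> {
    static DICT: OnceLock<Vec<(String, String)>> = OnceLock::new();
    DICT.get_or_init(|| vec![("jj".to_string(), "是".to_string())])
}

/// Force-init the dict and run one lookup so later calls hit warm state.
/// Returns the number of matches only so the effect can be observed.
fn warmup_sketch() -> usize {
    toy_dict().iter().filter(|(code, _)| code.as_str() == "jj").count()
}

/// Host-side startup: push the cold-path cost onto a background thread.
fn spawn_warmup() -> thread::JoinHandle<usize> {
    thread::spawn(warmup_sketch)
}
```

The host does not need to join the handle; if the first keystroke arrives before warmup finishes, OnceLock makes the lookup thread wait for the in-flight initialization rather than repeating it.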
wubi_idf_reader
Process-global IdfReader over EMBEDDED_WUBI_IDF. Parses the 4 MB blob’s header, FST, and entry-table sections once, amortizing the few-millisecond cost over the whole process lifetime; subsequent wubi_idf_reader().lookup(code) calls are O(|code|) FST walks with zero allocation per query.
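The parse-once pattern behind a process-global reader can be shown with std::sync::OnceLock. ToyReader, BLOB, and PARSE_COUNT are invented for this sketch; the real IdfReader parsing and sha256 verification are not reproduced here.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Counts how many times the (pretend) parse actually runs.
static PARSE_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for the embedded dictionary blob.
static BLOB: &[u8] = b"toy blob";

/// Stand-in for a parsed reader over the blob.
struct ToyReader {
    len: usize,
}

/// Returns the process-global reader, parsing the blob on first use only.
fn toy_reader() -> &'static ToyReader {
    static READER: OnceLock<ToyReader> = OnceLock::new();
    READER.get_or_init(|| {
        PARSE_COUNT.fetch_add(1, Ordering::SeqCst);
        ToyReader { len: BLOB.len() }
    })
}
```

Every caller receives the same &'static reference, so the one-time parse cost is paid by whichever thread gets there first.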