`inputx-wubi-data`: embedded Wubi 86 IDFv1 dict + lookup helpers for the `inputx-wubi` engine, packaged as a publishable stone.

Successor to `inputx-wubi-cement` under the v1.5 D11 taxonomy correction (2026-05): cement = an application’s source code (your own `wubi.rs` / `engine.rs`), NOT a published crate. The historical `-cement`-suffix crate is deprecated and re-exports from this crate for backward compat.
§What’s in the box
- `EMBEDDED_WUBI_IDF`: IDFv1 binary dict blob with the wubi `Layer` enum index encoded in `EntryFlags::engine_tag()` (v1.4.7 sub-phase A4 step 2).
- `wubi_idf_reader`: process-global `OnceLock<IdfReader>` over the embedded blob; amortizes the 4 MB parse + sha256 verify across the process lifetime.
- `layer_from_idf_tag`: reverse of `Layer::as_u8`; decodes an IDF entry’s engine_tag back into the originating wubi `Layer`.
- `table` module: process-global stateful `WubiDict` cache + per-code lookup helpers (`lookup`, `lookup_with_scores`, `lookup_with_layer`, `lookup_with_freq_layer`, `prefix_predictions`, `record_pick`, `export_l0`, `import_l0`).
- rare-CJK toggle (`set_show_rare` / `show_rare`) + warmup helper.
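The process-global reader pattern mentioned for `wubi_idf_reader` can be illustrated with the standard library alone. This is a minimal sketch, not the crate's implementation: `DemoReader`, `DEMO_BLOB`, and `demo_reader` are hypothetical stand-ins, and the "parse" step is a placeholder for the real parse + checksum work.

```rust
use std::sync::OnceLock;

/// Hypothetical stand-in for the real `IdfReader`; it only records
/// how many bytes it was built from.
pub struct DemoReader {
    pub blob_len: usize,
}

impl DemoReader {
    /// Placeholder for the expensive parse + verify step.
    fn parse(blob: &[u8]) -> Self {
        DemoReader { blob_len: blob.len() }
    }
}

/// Hypothetical embedded blob (assumption: the real crate embeds its
/// dict file at compile time; this is just demo bytes).
pub static DEMO_BLOB: &[u8] = b"not a real IDF blob";

/// Process-global accessor: the init closure runs at most once, and
/// every later call returns the same `&'static DemoReader`.
pub fn demo_reader() -> &'static DemoReader {
    static READER: OnceLock<DemoReader> = OnceLock::new();
    READER.get_or_init(|| DemoReader::parse(DEMO_BLOB))
}
```

Because `OnceLock::get_or_init` is thread-safe, callers on any thread can use the accessor without extra locking, and the one-time cost is paid by whichever call arrives first.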
§What’s NOT here
- Stateful `WubiEngine` (buffer / `handle_letter` / auto-commit / commit_index / L0 pin state machine): that classifies as application cement per the v1.5 D11 correction and now lives in the Inputx monorepo’s `inputx-core/src/wubi/engine.rs`. IME implementers copying this stone are expected to bring their own state machine matching their UI ergonomics.
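For readers wondering what "bring your own state machine" might look like, here is a deliberately small sketch. Everything in it is hypothetical: `DemoEngine`, its `HashMap`-backed dictionary, and the auto-commit policy are assumptions for illustration and do not mirror the real `WubiEngine`. The only domain fact relied on is that Wubi codes are at most four letters long.

```rust
use std::collections::HashMap;

/// Wubi codes are at most four letters, so a full buffer can auto-commit.
const MAX_CODE_LEN: usize = 4;

/// Hypothetical host-side engine: a code buffer plus a lookup table
/// mapping code -> ranked candidates.
pub struct DemoEngine {
    buffer: String,
    dict: HashMap<String, Vec<String>>,
}

impl DemoEngine {
    pub fn new(dict: HashMap<String, Vec<String>>) -> Self {
        DemoEngine { buffer: String::new(), dict }
    }

    /// Feed one letter. Returns `Some(word)` when the buffer reaches
    /// `MAX_CODE_LEN` and a top candidate exists. A full buffer with no
    /// candidate is cleared and yields `None` (an assumed policy).
    pub fn handle_letter(&mut self, c: char) -> Option<String> {
        if !c.is_ascii_lowercase() {
            return None; // ignore anything that is not a code letter
        }
        self.buffer.push(c);
        if self.buffer.len() < MAX_CODE_LEN {
            return None; // keep buffering
        }
        let top = self.dict.get(&self.buffer).and_then(|v| v.first().cloned());
        self.buffer.clear();
        top
    }

    /// Current uncommitted code, e.g. for rendering a preedit string.
    pub fn buffer(&self) -> &str {
        &self.buffer
    }
}
```

A real host would replace the `HashMap` with calls into this crate's lookup helpers and add whatever commit, backspace, and candidate-selection handling its UI needs.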
Structs§
- `L0Snapshot`: Re-export of the wubi L0 snapshot type so hosts can build / destructure it without depending on the `inputx-wubi` crate directly. Persistent state of the L0 layer. Caller serializes / deserializes this however it likes (TOML, MessagePack, sqlite, …); the crate intentionally has no `serde` dependency.
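Since the snapshot carries no `serde` support, a host has to pick its own encoding. The sketch below shows one serde-free approach using a plain tab-separated text format. `DemoSnapshot` and its single `pins` field are hypothetical; the real `L0Snapshot` fields are not documented on this page and may differ.

```rust
/// Hypothetical snapshot shape (assumption: the real `L0Snapshot`
/// has its own fields, which are not shown here).
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct DemoSnapshot {
    /// Pinned `(code, word)` pairs.
    pub pins: Vec<(String, String)>,
}

/// Encode as one `code<TAB>word` line per pin.
pub fn encode(s: &DemoSnapshot) -> String {
    s.pins
        .iter()
        .map(|(code, word)| format!("{code}\t{word}\n"))
        .collect()
}

/// Decode the format written by `encode`, skipping lines without a tab.
pub fn decode(text: &str) -> DemoSnapshot {
    let pins = text
        .lines()
        .filter_map(|line| line.split_once('\t'))
        .map(|(code, word)| (code.to_string(), word.to_string()))
        .collect();
    DemoSnapshot { pins }
}
```

The same destructure-then-write approach would apply to any storage backend; only the `encode` / `decode` pair changes.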
Constants§
- `EMBEDDED_WUBI_IDF`: Embedded IDFv1 wubi dict blob, sourced from `inputx-wubi-data/data/words.idf` at compile time. Each entry’s `EntryFlags::engine_tag()` carries the wubi `Layer` enum index (v1.4.7 sub-phase A4 step 2 schema bump), so cement-side fills can reconstruct `(word, layer, raw_freq)` without re-reading the `inputx_wubi::WubiDict` table.
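The tag round trip (a `Layer` index written into `engine_tag`, then decoded by `layer_from_idf_tag` with a fallback to `Auto` for out-of-range bytes) might look roughly like this. `DemoLayer` is a hypothetical mirror of the real enum: the variant names come from this page, but the numeric order assigned to them here is an assumption.

```rust
/// Hypothetical mirror of the wubi `Layer` enum. Variant names are taken
/// from the docs; the discriminant order is assumed for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DemoLayer {
    Jianma1,
    Jianma2,
    Jianma3,
    Zigen,
    Phrase,
    Auto,
}

impl DemoLayer {
    /// Encode the variant as its index (0..=5 in this sketch).
    pub fn as_u8(self) -> u8 {
        self as u8
    }
}

/// Decode a tag byte back into a layer. Bytes outside 0..=5 fall back
/// to `Auto` instead of panicking.
pub fn demo_layer_from_tag(tag: u8) -> DemoLayer {
    match tag {
        0 => DemoLayer::Jianma1,
        1 => DemoLayer::Jianma2,
        2 => DemoLayer::Jianma3,
        3 => DemoLayer::Zigen,
        4 => DemoLayer::Phrase,
        _ => DemoLayer::Auto,
    }
}
```

Making the decoder total (every `u8` maps to some variant) keeps a corrupted or newer-format blob from crashing the host, at the cost of silently treating unknown tags as `Auto`.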
Functions§
- `export_l0`: Snapshot the current L0 state (pins + pending pick counts + layer prefs) for host-side persistence. Host stores it however it wants (UserDefaults on Apple platforms, IndexedDB in web, etc.) and feeds it back via `import_l0` on next launch.
- `import_l0`: Restore a previously-exported L0 snapshot. Entries whose `(code, word)` no longer exist in the lexicon (e.g., after a wubi data version bump removed an extension char) are silently dropped. Returns the count of accepted pins.
- `is_displayable`: `true` iff every character in `word` is below the rare-CJK threshold (`U+20000`, the start of CJK Extension B). Pinyin-side composer also calls this so the user-facing rare-char toggle applies uniformly to both engines (item 54).
- `layer_from_idf_tag`: Decode an IDF wubi entry’s `EntryFlags::engine_tag()` back into the originating `inputx_wubi::Layer` variant. Falls back to `Layer::Auto` on out-of-range bytes (defensive; the writer only emits 0..=5).
- `lookup`: Exact lookup for `code`. Returns the candidates ranked by L0/L1, with rare CJK candidates filtered unless `show_rare()` is `true`.
- `lookup_with_freq_layer`: Per-code lookup exposing raw frequency (separate from layer.base · pref), used by the v1.4.7 composite hot path for orthodox score decomposition into (log_prior_q4 = Q4·ln(1+freq), log_likelihood_q4 = Q4·ln(layer.base() · pref · demotes)). Rare-CJK filter applied uniformly with `lookup_with_layer`.
- `lookup_with_layer`: Layer-aware variant: each candidate also carries its origin Layer (Jianma1/2/3, Zigen, Phrase, Auto). Composite dispatch uses the layer tag to make context-aware ranking decisions, e.g. demoting low-confidence Auto / Phrase wubi candidates when the buffer shape suggests pinyin intent, while keeping high-confidence Jianma simcodes untouched (the 伙-rule: wubi simcodes always lead at their code).
- `lookup_with_scores`: Scored variant of `lookup`. Returns `(word, score)` tuples for the composite cross-engine merge. Rare-CJK filter applied here too.
- `prefix_predictions`: Prefix-prediction lookup: `(word, freq, code_len)` for every dict entry whose code strictly extends `prefix` (no exact-code matches). Rare-CJK filter applied uniformly with `lookup`. Wired into the composite dispatch so Wubi gets the same prefix-prediction shape as pinyin / JP (e.g. `jj` exact 是 stays at #0, predictions 日/时 follow).
- `record_pick`: Notify the dictionary that the user committed `word` for `code`. The internal pick counter advances; on threshold the word auto-pins. All learning logic lives in `wubi`; this is just a passthrough so the IME layer doesn’t need to know about counters.
- `set_show_rare`
- `show_rare`
- `warmup`: Force-init the embedded `WubiDict` and exercise common lookup paths so the OS faults the FST’s `.rodata` pages into RAM and any internal `fst::Map` streamer state is primed. Idempotent: relies on `OnceLock::get_or_init` for the dict, and `WubiDict::lookup` for the page-touch effect. Called from `Session::warmup` so a host can off-load the cold-path cost to a background thread at startup instead of paying it on the user’s first keystroke. ~100-300ms on iPhone cold; <1ms idempotent.
- `wubi_idf_reader`: Process-global `IdfReader` over `EMBEDDED_WUBI_IDF`. Parses the 4 MB header / FST / entry-table sections once and amortizes the ~few-ms cost over the whole process lifetime; subsequent `wubi_idf_reader().lookup(code)` calls are O(|code|) FST walks with zero allocation per query.
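The rare-CJK rule described for `is_displayable`, together with the `set_show_rare` / `show_rare` toggle, can be sketched as follows. All `demo_*` names are hypothetical, and two details are assumptions rather than documented behavior: that the toggle is a process-global atomic flag, and that it defaults to hiding rare characters. The threshold itself (`U+20000`, the first code point of CJK Extension B) comes from the docs above.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// First code point of CJK Unified Ideographs Extension B.
const RARE_CJK_START: u32 = 0x20000;

/// Assumed process-global toggle; `false` (hide rare chars) is an
/// assumed default.
static SHOW_RARE: AtomicBool = AtomicBool::new(false);

pub fn demo_set_show_rare(on: bool) {
    SHOW_RARE.store(on, Ordering::Relaxed);
}

pub fn demo_show_rare() -> bool {
    SHOW_RARE.load(Ordering::Relaxed)
}

/// `true` iff every character of `word` lies below the rare-CJK threshold.
pub fn demo_is_displayable(word: &str) -> bool {
    word.chars().all(|c| (c as u32) < RARE_CJK_START)
}

/// Apply the filter the way the lookup helpers are described to:
/// pass everything through when the toggle is on, otherwise drop
/// candidates containing rare characters.
pub fn demo_filter(candidates: Vec<String>) -> Vec<String> {
    if demo_show_rare() {
        return candidates;
    }
    candidates
        .into_iter()
        .filter(|w| demo_is_displayable(w))
        .collect()
}
```

Keeping the predicate separate from the toggle is what lets a second engine (the pinyin composer, per the docs) reuse the same rule without touching Wubi-specific lookup code.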