inputx-dict-format
IDFv1 binary dict format for IME engines — mmap zero-copy reader, deterministic writer, probability-native (Q4 log priors), shared layout across pinyin / wubi / Japanese / future Korean and Vietnamese.
Why
IME dict files are read on every keystroke and rebuilt rarely. The hot-path constraint is mmap-friendly zero-copy decoding; the rebuild constraint is deterministic output (so two builds from the same corpus produce byte-identical files, verifiable by sha256).
IDFv1 sets one binary layout for all of those engines so the runtime
reader code, dual-path verification harness, and OTA delivery (post-v2)
stay a single implementation. Per-engine semantics ride in the
engine_kind byte and the match_type byte per entry.
Layout (96-byte header + sections)
+---------+--------------+-------------+--------------+--------------+
| Header  | String pool  | Entry table | FST code idx | FST word idx |
| 64 + 32 | varlen, pad8 | N × 16 B    | varlen       | varlen       |
+---------+--------------+-------------+--------------+--------------+
| Bigram block (optional, v2+)    |
| Embedding block (optional, v2+) |
+---------------------------------+
- Header (64 B): magic b"IDFv", format_version, engine_kind, flag word, section offsets and sizes, embedding metadata.
- sha256 trailer (32 B): payload hash covers everything from byte 96 onward; reader rejects on mismatch.
- String pool: deduplicated UTF-8, null-terminated, padded to 8-byte alignment. Entries refer to byte offsets (u24).
- Entry table: fixed 16 B per entry, laid out as word_offset (u24), code_offset (u24), log_prior (i16 Q4), match_type (u8 → inputx_scoring::MatchType variant), flags (u8), raw_freq (u32; pre-quantization corpus freq, lossless tiebreaker for entries that land in the same Q4 log_prior bucket; the v1.4.7 schema bump repurposed the previously unused bigram_offset slot), 2 B reserved.
- EntryFlags: BLACKLIST (bit 0), CURATED_OVERRIDE (bit 1), USER_ADDED (bit 2), plus bits 5-7 ENGINE_TAG_MASK for an engine-specific 3-bit payload (used by EngineKind::Wubi to carry the Layer enum index; zero for other engines).
- FST code / word indexes: inputx_fsa::Fsa blobs (code → entry_index, word → entry_index). v1.4.3 ships with empty indexes; reader falls back to a linear scan over the entry table. v1.4.6+ fills them.
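To make the 16-byte entry record concrete, here is a minimal decoding sketch. It is not the crate's reader. It assumes little-endian byte order, fields packed in the order listed above with no padding, and that "Q4" means four fractional bits (value = raw / 16); none of these are confirmed by this README. The names Entry, decode_entry, and read_u24_le are hypothetical.

```rust
/// One decoded 16-byte entry record. Hypothetical type for illustration;
/// field names follow the layout list, byte order is an assumption.
#[derive(Debug, PartialEq)]
struct Entry {
    word_offset: u32,  // u24 byte offset into the string pool
    code_offset: u32,  // u24 byte offset into the string pool
    log_prior_q4: i16, // raw fixed-point value
    match_type: u8,
    flags: u8,
    raw_freq: u32,
}

const ENTRY_SIZE: usize = 16;

// Flag bits as listed under EntryFlags.
const BLACKLIST: u8 = 1 << 0;
#[allow(dead_code)]
const CURATED_OVERRIDE: u8 = 1 << 1;
#[allow(dead_code)]
const USER_ADDED: u8 = 1 << 2;
const ENGINE_TAG_MASK: u8 = 0b1110_0000; // bits 5-7

/// Read a 3-byte unsigned integer (assumed little-endian).
fn read_u24_le(b: &[u8]) -> u32 {
    u32::from(b[0]) | (u32::from(b[1]) << 8) | (u32::from(b[2]) << 16)
}

/// Decode one record: 3 + 3 + 2 + 1 + 1 + 4 + 2 (reserved) = 16 bytes.
fn decode_entry(rec: &[u8; ENTRY_SIZE]) -> Entry {
    Entry {
        word_offset: read_u24_le(&rec[0..3]),
        code_offset: read_u24_le(&rec[3..6]),
        log_prior_q4: i16::from_le_bytes([rec[6], rec[7]]),
        match_type: rec[8],
        flags: rec[9],
        raw_freq: u32::from_le_bytes([rec[10], rec[11], rec[12], rec[13]]),
        // rec[14..16] is reserved and ignored here.
    }
}

impl Entry {
    /// Log prior as a float, assuming Q4 = 4 fractional bits.
    fn log_prior(&self) -> f32 {
        f32::from(self.log_prior_q4) / 16.0
    }

    /// The 3-bit engine-specific payload carried in bits 5-7 of flags.
    fn engine_tag(&self) -> u8 {
        (self.flags & ENGINE_TAG_MASK) >> 5
    }

    fn is_blacklisted(&self) -> bool {
        (self.flags & BLACKLIST) != 0
    }
}
```

Because every record is the same size, a linear-scan fallback of the kind described for v1.4.3 can walk the entry table in 16-byte chunks and call a decoder like this on each one.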
API
Reader (no_std + alloc clean, std for mmap):
use ;
let r = open?;
assert_eq!;
for entry in r.lookup
Writer (std):
use ;
use ;
let mut b = new;
b.add_entry;
let sha = b.build?;
println!;
Determinism
IdfBuilder::build is deterministic given the input entry set:
- Entries are sorted by (code, word, log_prior).
- Exact (code, word) duplicates are deduped.
- String pool entries are sorted unique UTF-8.
- Section bytes are written in a fixed order.
Two builds from the same input produce byte-identical files and the same payload sha256. This is the verification gate at every snapshot rebuild (PLAN-dict-format-IDFv1.md §"Build determinism").
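The normalization rules above can be sketched as follows. This is an illustration, not IdfBuilder's code. RawEntry, normalize, and build_string_pool are hypothetical names. Two details are assumptions because this README does not state them: which duplicate survives dedup (here, the first in sort order) and whether padding applies to the whole pool or to each string (here, the whole pool).

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical input record. The derived Ord compares fields in
/// declaration order, which gives the (code, word, log_prior) sort key.
#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct RawEntry {
    code: String,
    word: String,
    log_prior_q4: i16,
}

/// Sort by (code, word, log_prior), then drop exact (code, word) duplicates.
/// Assumption: the first entry in sort order is the one kept.
fn normalize(mut entries: Vec<RawEntry>) -> Vec<RawEntry> {
    entries.sort();
    entries.dedup_by(|next, kept| next.code == kept.code && next.word == kept.word);
    entries
}

/// Build a pool of sorted, unique, null-terminated UTF-8 strings and pad
/// the end of the pool to an 8-byte boundary (assumed padding scope).
/// Returns the pool bytes and the byte offset of each string.
fn build_string_pool(entries: &[RawEntry]) -> (Vec<u8>, BTreeMap<String, u32>) {
    // BTreeSet yields strings in byte order, independent of input order.
    let unique: BTreeSet<&str> = entries
        .iter()
        .flat_map(|e| [e.code.as_str(), e.word.as_str()])
        .collect();

    let mut pool = Vec::new();
    let mut offsets = BTreeMap::new();
    for s in unique {
        offsets.insert(s.to_owned(), pool.len() as u32);
        pool.extend_from_slice(s.as_bytes());
        pool.push(0); // null terminator
    }
    while pool.len() % 8 != 0 {
        pool.push(0); // alignment padding
    }
    (pool, offsets)
}
```

Feeding the same entries in any order through normalize and build_string_pool yields the same entry sequence and the same pool bytes. Byte-identical output files, and therefore a stable sha256, depend on exactly this property.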
License
Dual-licensed under MIT or Apache-2.0.