inputx-dict-format

IDFv1 binary dict format for IME engines — mmap zero-copy reader, deterministic writer, probability-native (Q4 log priors), shared layout across pinyin / wubi / Japanese / future Korean and Vietnamese.

Why

IME dict files are read on every keystroke and rebuilt rarely. The hot-path constraint is mmap-friendly zero-copy decoding; the rebuild constraint is deterministic output (so two builds from the same corpus produce byte-identical files, verifiable by sha256).

IDFv1 defines one binary layout for all supported engines (pinyin, wubi, Japanese, and later Korean and Vietnamese), so the runtime reader, the dual-path verification harness, and OTA delivery (post-v2) each stay a single implementation. Per-engine semantics are carried by the engine_kind byte in the header and the match_type byte in each entry.

Layout (96-byte header + sections)

+---------+---------------+--------------+---------------+----------------+
| Header  | String pool   | Entry table  | FST code idx  | FST word idx   |
| 64 + 32 | varlen, pad8  | N × 16 B     | varlen        | varlen         |
+---------+---------------+--------------+---------------+----------------+
                                         | Bigram block (optional, v2+)   |
                                         | Embedding block (optional, v2+)|
                                         +--------------------------------+

Header (64 B) — magic b"IDFv", format_version, engine_kind, flag word, section offsets and sizes, embedding metadata.
sha256 trailer (32 B) — payload hash covers everything from byte 96 onward; reader rejects on mismatch.
String pool — deduplicated UTF-8, null-terminated, padded to 8-byte alignment. Entries refer to byte offsets (u24).
Entry table — fixed 16 B per entry: word_offset (u24), code_offset (u24), log_prior (i16 Q4), match_type (u8 → inputx_scoring::MatchType variant), flags (u8), raw_freq (u32 — pre-quantization corpus freq, lossless tiebreaker for entries that land in the same Q4 log_prior bucket; v1.4.7 schema bump repurposed the previously-unused bigram_offset slot), 2 B reserved.
EntryFlags — BLACKLIST (bit 0), CURATED_OVERRIDE (bit 1), USER_ADDED (bit 2), plus bits 5-7 ENGINE_TAG_MASK for an engine-specific 3-bit payload (used by EngineKind::Wubi to carry the Layer enum index; zero for other engines).
FST code / word indexes — inputx_fsa::Fsa blobs (code → entry_index, word → entry_index). v1.4.3 ships with empty indexes; reader falls back to a linear scan over the entry table. v1.4.6+ fills them.
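
As a rough illustration of the entry table and EntryFlags fields listed above, the following sketch decodes one 16-byte record by hand. It is not the crate's reader: the field order within the record, little-endian byte order, the reading of "Q4" as four fractional bits, and every name here (RawEntry, decode_entry, read_u24_le, log_prior_as_f32, engine_tag) are assumptions made for the example.

```rust
/// Hypothetical decoded view of one 16-byte entry record.
#[derive(Debug, PartialEq)]
struct RawEntry {
    word_offset: u32,  // u24 in the file, widened to u32
    code_offset: u32,  // u24 in the file, widened to u32
    log_prior_q4: i16, // fixed-point log prior
    match_type: u8,
    flags: u8,
    raw_freq: u32,
}

const ENTRY_SIZE: usize = 16;

// Flag bits as listed in the layout description.
const BLACKLIST: u8 = 1 << 0;
const CURATED_OVERRIDE: u8 = 1 << 1;
const USER_ADDED: u8 = 1 << 2;
const ENGINE_TAG_MASK: u8 = 0b1110_0000; // bits 5-7

/// Read a 3-byte unsigned integer (little-endian is assumed).
fn read_u24_le(b: &[u8]) -> u32 {
    u32::from(b[0]) | (u32::from(b[1]) << 8) | (u32::from(b[2]) << 16)
}

/// Decode one record. The field order is an assumption that follows the
/// order of the field list; the real on-disk order may differ.
fn decode_entry(rec: &[u8; ENTRY_SIZE]) -> RawEntry {
    RawEntry {
        word_offset: read_u24_le(&rec[0..3]),
        code_offset: read_u24_le(&rec[3..6]),
        log_prior_q4: i16::from_le_bytes([rec[6], rec[7]]),
        match_type: rec[8],
        flags: rec[9],
        raw_freq: u32::from_le_bytes([rec[10], rec[11], rec[12], rec[13]]),
        // rec[14..16] is the 2-byte reserved tail and is ignored here.
    }
}

/// Convert the fixed-point prior to a float, assuming 4 fractional bits.
fn log_prior_as_f32(q4: i16) -> f32 {
    f32::from(q4) / 16.0
}

/// Extract the 3-bit engine-specific payload from bits 5-7.
fn engine_tag(flags: u8) -> u8 {
    (flags & ENGINE_TAG_MASK) >> 5
}
```

Under these assumptions the fields sum to 3 + 3 + 2 + 1 + 1 + 4 + 2 = 16 bytes, which matches the fixed entry size.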

API

Reader (builds cleanly under no_std + alloc; std is needed only for mmap):

use inputx_dict_format::{IdfReader, EngineKind};

let r = IdfReader::open("data/private-dict/v0.0.1/pinyin/words.idf")?;
assert_eq!(r.engine_kind(), EngineKind::Pinyin);
for entry in r.lookup(b"jixu") {
    println!("{} log_prior={}", entry.word, entry.log_prior);
}

Writer (std):

use inputx_dict_format::{IdfBuilder, EngineKind, EntryFlags};
use inputx_scoring::{MatchType, log_prior_from_freq};

let mut b = IdfBuilder::new(EngineKind::Pinyin);
b.add_entry(
    "jixu",
    "继续",
    i16::try_from(log_prior_from_freq(44_652)).unwrap_or(i16::MAX),
    MatchType::Exact,
    EntryFlags::default(),
);
let sha = b.build("words.idf".as_ref())?;
println!("built words.idf with sha256 {:x?}", sha);

Determinism

IdfBuilder::build is deterministic given the input entry set:

Entries are sorted by (code, word, log_prior).
Exact (code, word) duplicates are deduped.
String pool entries are sorted unique UTF-8.
Section bytes are written in a fixed order.

Two builds from the same input produce byte-identical files and the same payload sha256. This is the verification gate at every snapshot rebuild (PLAN-dict-format-IDFv1.md §"Build determinism").
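
The ordering rules above can be sketched in plain Rust as follows. This is a minimal illustration, not IdfBuilder's actual code: InputEntry, canonicalize, and build_string_pool are hypothetical names, the choice of which duplicate survives is an assumption (the first after sorting, i.e. the lowest log_prior), and so is the idea that codes and words share one pool.

```rust
use std::collections::BTreeSet;

/// Hypothetical input record. The derived Ord compares fields in
/// declaration order, which gives the (code, word, log_prior) sort key.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
struct InputEntry {
    code: String,
    word: String,
    log_prior: i16,
}

/// Sort by (code, word, log_prior), then drop exact (code, word) duplicates.
/// Assumption: the first entry of each duplicate run is the one kept.
fn canonicalize(mut entries: Vec<InputEntry>) -> Vec<InputEntry> {
    entries.sort();
    entries.dedup_by(|later, earlier| {
        later.code == earlier.code && later.word == earlier.word
    });
    entries
}

/// Build a string pool: sorted unique UTF-8 strings, each null-terminated,
/// with the whole pool zero-padded to an 8-byte boundary.
fn build_string_pool(entries: &[InputEntry]) -> Vec<u8> {
    // BTreeSet iterates in byte-wise lexicographic order, so the output
    // does not depend on the order in which entries were added.
    let mut unique: BTreeSet<&str> = BTreeSet::new();
    for e in entries {
        unique.insert(e.code.as_str());
        unique.insert(e.word.as_str());
    }
    let mut pool = Vec::new();
    for s in unique {
        pool.extend_from_slice(s.as_bytes());
        pool.push(0); // null terminator
    }
    while pool.len() % 8 != 0 {
        pool.push(0); // pad8
    }
    pool
}
```

Because both steps depend only on the set of entries and not on insertion order, feeding the same entries in a different order yields the same bytes, which is the property the sha256 gate checks.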

License

Dual-licensed under MIT or Apache-2.0.

inputx-dict-format 1.4.0
