inputx_dict_format/lib.rs
//! `inputx-dict-format` — IDFv1 binary dict format for IME engines.
//!
//! Probability-native (Q4 fixed-point log priors per
//! [`inputx_scoring`]), mmap zero-copy reader, deterministic writer.
//! Same binary layout across pinyin / wubi / Japanese / future Korean
//! and Vietnamese engines.
//!
//! # Architecture (per `.claude/PLAN-dict-format-IDFv1.md`)
//!
//! ```text
//! +---------+---------------+--------------+---------------+----------------+
//! | Header  | String pool   | Entry table  | FST code idx  | FST word idx   |
//! | 64 B    | varlen, pad8  | N × 16 B     | varlen        | varlen         |
//! +---------+---------------+--------------+---------------+----------------+
//! | Bigram block (optional)        |
//! | Embedding block (optional)     |
//! | Padding to 8-byte EOF          |
//! +--------------------------------+
//! ```
//!
//! - **Header**: magic `b"IDFv"`, format_version (currently 1), section
//!   offsets, sha256 of payload.
//! - **String pool**: deduplicated UTF-8 with byte offsets.
//! - **Entry table**: fixed 16 B per entry; carries word_offset (u24),
//!   code_offset (u24), log_prior (i16 Q4), match_type (u8), flags (u8),
//!   bigram_offset (u32, 0 if absent), embedding_offset (u32, 0 if
//!   absent).
//! - **FST code index**: [`inputx_fsa::Fsa`] mapping code bytes →
//!   entry_index (first hit; multi-reading entries follow as a run).
//! - **FST word index**: reverse, word → entry_index, for L0 / blacklist
//!   joins.
//!
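//! The `pad8` and Q4 notes above can be sketched as follows. This is a
//! minimal illustration, not the crate's implementation: `pad8`,
//! `q4_from_f32`, and `q4_to_f32` are hypothetical helper names, and the
//! sketch assumes "Q4" means a signed 16-bit value with 4 fractional bits
//! (a step of 1/16).
//!
//! ```rust
//! /// Round a byte length up to the next multiple of 8 (hypothetical helper).
//! fn pad8(len: usize) -> usize {
//!     (len + 7) & !7
//! }
//!
//! /// Encode a log prior as Q4: scale by 16, round, and saturate to i16.
//! /// Assumes 4 fractional bits; the real codec may differ.
//! fn q4_from_f32(log_prior: f32) -> i16 {
//!     (log_prior * 16.0).round().clamp(i16::MIN as f32, i16::MAX as f32) as i16
//! }
//!
//! /// Decode a Q4 value back to a floating-point log prior.
//! fn q4_to_f32(raw: i16) -> f32 {
//!     f32::from(raw) / 16.0
//! }
//! ```
//!
//! Under these assumptions `pad8(13)` is `16` and `q4_to_f32(-40)` is
//! `-2.5`.
//!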
//! # Reader / writer
//!
//! - [`reader::IdfReader::open`] — mmap a `.idf` file and validate header.
//! - [`reader::IdfReader::lookup`] — exact code → iterator of entries.
//! - [`reader::IdfReader::prefix_top_k`] — prefix scan top-k by log_prior.
//! - [`writer::IdfBuilder`] (gated on `std`) — deterministic build:
//!   sort + dedupe entries, build FST, write atomically via tmpfile + rename.
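//!
//! The "tmpfile + rename" step can be sketched with the standard library
//! alone. This is an illustration under assumptions, not the builder's
//! actual code: `write_atomic` is a hypothetical helper, and the `.tmp`
//! suffix is an arbitrary choice.
//!
//! ```rust
//! use std::fs::{self, File};
//! use std::io::{self, Write};
//! use std::path::Path;
//!
//! /// Write `bytes` to `dest` so readers never observe a partial file.
//! fn write_atomic(dest: &Path, bytes: &[u8]) -> io::Result<()> {
//!     // The temporary file sits next to `dest` so the rename stays on one filesystem.
//!     let mut tmp_name = dest.as_os_str().to_owned();
//!     tmp_name.push(".tmp");
//!     let tmp = Path::new(&tmp_name);
//!
//!     let mut file = File::create(tmp)?;
//!     file.write_all(bytes)?;
//!     file.sync_all()?; // flush to disk before the rename publishes the file
//!     fs::rename(tmp, dest) // replaces `dest` in one step on POSIX systems
//! }
//! ```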

#![cfg_attr(not(feature = "std"), no_std)]

#[cfg(feature = "std")]
extern crate std;

extern crate alloc;

pub mod codec;
pub mod reader;
#[cfg(feature = "std")]
pub mod writer;

pub use codec::{EngineKind, EntryFlags, Header, Version, MAGIC, HEADER_SIZE, ENTRY_SIZE};
pub use reader::{Entry, IdfReader};
#[cfg(feature = "std")]
pub use writer::IdfBuilder;