//! `inputx-dict-format`: IDFv1 binary dict format for IME engines.
//!
//! Probability-native (Q4 fixed-point log priors per
//! [`inputx_scoring`]), mmap zero-copy reader, deterministic writer.
//! Same binary layout across pinyin / wubi / Japanese / future Korean
//! and Vietnamese engines.
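//!
//! As a rough illustration of the Q4 encoding, the sketch below assumes
//! that "Q4" means four fractional bits (one raw unit = 1/16) stored in an
//! `i16`. The helper names are hypothetical and are not taken from
//! [`inputx_scoring`]:
//!
//! ```rust
//! /// Raw units per 1.0 of log prior, ASSUMING four fractional bits.
//! const Q4_ONE: f32 = 16.0;
//!
//! /// Hypothetical helper: quantize a log prior to Q4.
//! /// Float-to-int `as` casts saturate, so out-of-range inputs clamp.
//! fn encode_q4(log_prior: f32) -> i16 {
//!     (log_prior * Q4_ONE).round() as i16
//! }
//!
//! /// Hypothetical helper: recover the approximate log prior.
//! fn decode_q4(raw: i16) -> f32 {
//!     f32::from(raw) / Q4_ONE
//! }
//! ```
//!
//! Under that assumption an `i16` spans roughly -2048 to +2048 in steps
//! of 1/16.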
//!
//! # Architecture (per `.claude/PLAN-dict-format-IDFv1.md`)
//!
//! ```text
//! +---------+---------------+--------------+---------------+----------------+
//! | Header  | String pool   | Entry table  | FST code idx  | FST word idx   |
//! | 64 B    | varlen, pad8  | N × 16 B     | varlen        | varlen         |
//! +---------+---------------+--------------+---------------+----------------+
//!                                          | Bigram block (optional)        |
//!                                          | Embedding block (optional)     |
//!                                          | Padding to 8-byte EOF          |
//!                                          +--------------------------------+
//! ```
//!
//! - **Header**: magic `b"IDFv"`, format_version (currently 1), section
//!   offsets, sha256 of payload.
//! - **String pool**: deduplicated UTF-8 with byte offsets.
//! - **Entry table**: fixed 16 B per entry; carries word_offset (u24),
//!   code_offset (u24), log_prior (i16 Q4), match_type (u8), flags (u8),
//!   bigram_offset (u32, 0 if absent), embedding_offset (u32, 0 if
//!   absent).
//! - **FST code index**: [`inputx_fsa::Fsa`] mapping code bytes →
//!   entry_index (first hit; multi-reading entries follow as a run).
//! - **FST word index**: reverse, word → entry_index, for L0 / blacklist
//!   joins.
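//!
//! A minimal sketch of two checks implied by the layout above: testing for
//! the magic and reading a `u24` offset. It assumes the magic sits at byte 0
//! of the header and that integers are little-endian; neither is stated
//! here. The names are hypothetical, not this crate's API:
//!
//! ```rust
//! const SKETCH_MAGIC: &[u8; 4] = b"IDFv";
//! const SKETCH_HEADER_SIZE: usize = 64;
//!
//! /// True if `file` can hold a full header and starts with the magic.
//! fn has_idf_magic(file: &[u8]) -> bool {
//!     file.len() >= SKETCH_HEADER_SIZE && file.starts_with(SKETCH_MAGIC)
//! }
//!
//! /// Read a 3-byte unsigned integer at `at` (ASSUMED little-endian).
//! /// Returns `None` if the three bytes are not fully in bounds.
//! fn read_u24_le(bytes: &[u8], at: usize) -> Option<u32> {
//!     let b = bytes.get(at..at.checked_add(3)?)?;
//!     Some(u32::from(b[0]) | (u32::from(b[1]) << 8) | (u32::from(b[2]) << 16))
//! }
//! ```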
//!
//! # Reader / writer
//!
//! - [`reader::IdfReader::open`]: mmap a `.idf` file and validate the header.
//! - [`reader::IdfReader::lookup`]: exact code → iterator of entries.
//! - [`reader::IdfReader::prefix_top_k`]: prefix scan, top-k by log_prior.
//! - [`writer::IdfBuilder`] (gated on `std`): deterministic build:
//!   sort + dedupe entries, build FST, write atomically via tmpfile + rename.
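//!
//! The sort + dedupe step and the prefix top-k scan can be sketched over a
//! plain in-memory vector. This is only an illustration: it uses no FST and
//! no mmap, and the tuple type, the tie-breaking rule (keep the highest
//! prior) and the function names are assumptions, not this crate's API:
//!
//! ```rust
//! /// Hypothetical in-memory entry: (code, word, Q4 log prior).
//! type SketchEntry = (&'static str, &'static str, i16);
//!
//! /// Sort by (code, word, prior descending), then drop repeated
//! /// (code, word) pairs, keeping the first one (the highest prior).
//! fn sort_dedupe(mut entries: Vec<SketchEntry>) -> Vec<SketchEntry> {
//!     entries.sort_by(|a, b| (a.0, a.1).cmp(&(b.0, b.1)).then(b.2.cmp(&a.2)));
//!     entries.dedup_by(|next, kept| next.0 == kept.0 && next.1 == kept.1);
//!     entries
//! }
//!
//! /// Top `k` entries whose code starts with `prefix`, best prior first.
//! /// `sorted` must come from `sort_dedupe`, so matches form one run.
//! fn prefix_top_k<'a>(
//!     sorted: &'a [SketchEntry],
//!     prefix: &str,
//!     k: usize,
//! ) -> Vec<&'a SketchEntry> {
//!     // Binary search for the first code that is not below the prefix.
//!     let start = sorted.partition_point(|e| e.0 < prefix);
//!     let mut hits: Vec<&SketchEntry> = sorted[start..]
//!         .iter()
//!         .take_while(|e| e.0.starts_with(prefix))
//!         .collect();
//!     hits.sort_by(|a, b| b.2.cmp(&a.2)); // stable: ties keep code order
//!     hits.truncate(k);
//!     hits
//! }
//! ```
//!
//! Sorting on a total key before writing is what makes the output
//! independent of input order in this sketch.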

#![cfg_attr(not(feature = "std"), no_std)]

#[cfg(feature = "std")]
extern crate std;

extern crate alloc;

pub mod codec;
pub mod reader;
#[cfg(feature = "std")]
pub mod writer;

pub use codec::{EngineKind, EntryFlags, Header, Version, MAGIC, HEADER_SIZE, ENTRY_SIZE};
pub use reader::{Entry, IdfReader};
#[cfg(feature = "std")]
pub use writer::IdfBuilder;