Skip to main content

Module codec

Module codec 

Source
Expand description

IDFv1 binary layout — header / entry / flag types + LE codec helpers.

Wire layout per .claude/PLAN-dict-format-IDFv1.md. Little-endian throughout (x86_64 + arm64 are both LE; cross-platform safe). Alignment-strict: section boundaries are 8-byte aligned; Entry is exactly 16 bytes packed.

Structs§

EntryFlags
Per-entry flag bits. Bit assignments are stable across IDFv1.
EntryRecord
Per-entry record (16 bytes packed). word_offset and code_offset are u24 (3 bytes); they point into the string pool. log_prior is signed Q4 fixed-point (one log unit per 16 integer steps, per inputx_scoring::Q4). raw_freq is the original pre-quantization corpus frequency (added v1.4.7 sub-phase A4 step 1) — it lets cement-side cement rebuild a lossless tiebreaker when two entries land in the same Q4 log_prior bucket (e.g. 乎/护 for code hu, both quantize to Q4=170; raw_freq distinguishes them).
Header
Header, mirrors the on-disk 64-byte layout exactly. All integer fields are little-endian on disk; the public struct holds host-byte values (decoded on parse, encoded on to_bytes).

Enums§

EngineKind
Which engine the dict serves. Stable u8 across versions so dispatch code can match without translation.
Version
Format version. v1 = the layout described in .claude/PLAN-dict-format-IDFv1.md. Future v2+ will live alongside via the format_version header byte; v1 readers MUST reject unknown versions with a clear error.

Constants§

ENTRY_SIZE
Fixed per-entry size in bytes.
FULL_HEADER_SIZE
Total on-disk header region (header + sha256 area). Sections begin here.
HEADER_SIZE
Fixed header size in bytes. Sections begin at HEADER_SIZE.
MAGIC
Magic bytes at file offset 0. ASCII "IDFv".
SHA256_SIZE
Reserved byte length for the sha256 region immediately after the 64-byte header proper. The full on-disk header region is HEADER_SIZE + SHA256_SIZE = 96 bytes; sections begin at offset 96.

Functions§

decode_match_type
Decode EntryRecord::match_type back to inputx_scoring::MatchType. Inline payload fields are zeroed; callers attach runtime values.
encode_match_type
Encode an inputx_scoring::MatchType into the single u8 stored in EntryRecord::match_type. Round-trippable via decode_match_type. Inline payload (proximity / fuzzy cost / bigram_links) is lost on encode — the writer uses Exact for entries that have a fixed dict-baseline classification; runtime paths attach the inline payload based on how the buffer matched.