# OnPair binary format
The decode-side state of a compressed column is three byte buffers — the
**dictionary bytes**, the **dictionary offsets**, and the **codes** — plus the
scalar `bits` (code width, `9..=16`). This document specifies their binary
layout and the invariants a decoder relies on.
Decoding is a gather-copy: the code stream is a sequence of dictionary indices;
each index names a token (a short byte string), and the decoded output is the
tokens concatenated in code order.
## Conventions
- All multi-byte integers are **little-endian**.
- `MAX_TOKEN_SIZE = 16` — the maximum token length in bytes and the decoder's
fixed read width.
- `N` — the number of dictionary tokens. `N <= 2^bits`.
- `M` — the number of codes (decoded values = number of tokens emitted).
## 1. Dictionary offsets — `dict_offsets`
`N + 1` little-endian **`u32`** values delimiting the tokens within
`dict_bytes`:
```
dict_offsets: [ o0=0, o1, o2, …, oN ] (N+1 × u32 LE)
token i = dict_bytes[ o_i .. o_{i+1} )
```
Invariants:
- `o0 == 0`.
- Strictly increasing: `o_i < o_{i+1}` for all `i` (every token is non-empty).
- Each token length `o_{i+1} - o_i` is in `1 ..= MAX_TOKEN_SIZE`.
- `oN` is the **logical length** of the dictionary (sum of all token lengths).
`N = dict_offsets.len() - 1`. An empty dictionary is `[0]` (`N == 0`).
## 2. Dictionary bytes — `dict_bytes`
The `N` token byte strings concatenated in index order, with no separators,
followed by **trailing decoder padding**:
```
dict_bytes: [ token0 ‖ token1 ‖ … ‖ token_{N-1} ‖ <padding> ]
└─────────── oN logical bytes ───────────┘
```
The decoder reads a fixed `MAX_TOKEN_SIZE` bytes starting at every token offset
(then slices to the token's true length — the branchless "fat read, advance by
`len`" pattern), so the buffer must stay readable `MAX_TOKEN_SIZE` past the
**highest** offset, which is the last token's:
```
dict_bytes.len() >= o_{N-1} + MAX_TOKEN_SIZE
```
Equivalently, the slack past the logical end `oN` is **variable**:
```
padding = MAX_TOKEN_SIZE - len(last token) // 0 .. MAX_TOKEN_SIZE-1
```
— `0` when the last token is a full `MAX_TOKEN_SIZE` bytes, up to
`MAX_TOKEN_SIZE - 1` (15) when it is a single byte. (For a width-`W` token cap
the range is `0 ..= W-1`.) Padding byte values are unspecified; the decoder
never uses them as token data. A writer may emit exactly this minimum or
over-allocate to the `MAX_TOKEN_SIZE - 1` worst case (what `compress` does);
both are valid.
## 3. Codes
`M` values, each a dictionary index in `[0, N)`. Every code fits in `bits` bits
because `N <= 2^bits`.
**Packed form**: codes are bit-packed **LSB-first at `bits` bits each** into a
stream of little-endian `u64` words. A code whose bits straddle a 64-bit word
boundary is split — low bits in the current word, remaining high bits open the
next.
```
word k holds codes packed LSB-first: bit position of code j = j * bits
packed size = ceil(M * bits / 8) bytes
```
A word-at-a-time unpacker over-reads up to 7 bytes when it extracts the last
code (its `u64` load starts near the end of the buffer). This is a reader
concern, not a format one: the unpacker reads the bulk fast with full-width word
loads and switches to an exact, masked tail for the final few codes whose load
would cross `ceil(M * bits / 8)` — the same fast-region-plus-exact-tail split
the decode loop uses (§Decode, `fat::decode_loop`). No padding word is emitted.
Reference onpair appends one zero `u64` after the packed codes (its unpacker
over-reads unconditionally). OnPair neither writes nor requires it; a reader
ingesting such a buffer ignores any bytes past `ceil(M * bits / 8)`.
**In-memory form** (this crate): `codes` is materialized as a `[u16]`, one
element per code, unpacked — the `bits`-wide packing is applied only when
serializing. The two are equivalent under `bits`.
Invariant: every code `< N` (indexes a real token).
## Related: row boundaries
A column also carries `code_boundaries`, `R + 1` offsets into the code stream
that delimit the `R` input rows (row `r`'s codes are
`codes[ b_r .. b_{r+1} )`). Their integer width matches the input offset type
(`u32`/`u64`). They are an index over the codes, not part of the code/dict
encoding, and are not required to decode the column as one flat stream.
## Decode
```
out = []
for c in codes: # in order
out.extend( dict_bytes[ dict_offsets[c] .. dict_offsets[c+1] ) )
```
The fast path reads a fixed `MAX_TOKEN_SIZE` bytes from `dict_offsets[c]` and
advances `out` by the true token length; the dictionary padding (§2) makes that
read in bounds.
## Validation
A decoder fed an untrusted/deserialized column should check, in `O(N)` (plus
`O(M)` for the codes), before decoding:
1. `dict_offsets` strictly increasing, every token length `<= MAX_TOKEN_SIZE`;
2. `dict_bytes.len() >= o_{N-1} + MAX_TOKEN_SIZE` (the padding above);
3. every code `< N`.
In this crate these are `Parts::validate_dictionary` (1–2, dictionary only) and
`Parts::validate` (1–3). The safe decoders (`decompress`, `decompress_into`)
enforce 1–2 once up front and check 3 per code in the decode loop.