# linflate: Custom DEFLATE Decompressor Design
Zero-copy, SIMD-optimized, cache-aware DEFLATE decoder. Now a standalone shared
crate (extracted from ljar-rs), used by both ljar-rs and lgz-rs.
Same philosophy as lbzip2-rs's `decode_block_into`: caller owns the output
buffer, no per-call allocation, hot path tuned for modern CPU cache hierarchy.
---
## Architecture Decision: libdeflate Model
After analyzing miniz_oxide, zlib-ng, libdeflate, and zlib-rs:
**We follow libdeflate's full-buffer model**, not zlib-ng's streaming model.
Rationale:
- ljar-rs already has the full compressed segment in a ring slot (zero-copy
pointer). The decompressor receives `&[u8]` compressed + `&mut [u8]` output.
- Full-buffer eliminates: circular window arithmetic, `& mask` on every
back-ref, `have/wsize/wnext` bookkeeping, all "copy from window" paths.
- Simpler code = fewer branches = better prediction.
- libdeflate is the fastest known DEFLATE decompressor. We copy its design.
**We add zlib-ng's SIMD match copy** on top of libdeflate's architecture:
- libdeflate relies on auto-vectorization for match copies.
- zlib-ng/zlib-rs use explicit SIMD chunk copies (`__m128i`/`__m256i`).
- We use Rust's const-generic chunk approach (from zlib-rs) for clean SIMD
without `#[cfg]` spaghetti.
**We add lbzip2-rs's zero-copy output pattern** (`decode_into(&mut [u8])`):
- Caller pre-allocates output buffer (known from ZIP Central Directory).
- Decompressor writes directly into it. No intermediate Vec, no realloc.
---
## Module Structure
```
src/
lib.rs — public API: inflate_into(), inflate_to_vec(), inflate_segment*()
bitreader.rs — 64-bit branchless bit reader
tables.rs — Huffman table build + packed entry format
fastloop.rs — the hot decode loop (literals + matches)
copy.rs — SIMD match copy (const-generic N)
fixed.rs — pre-built fixed Huffman tables (static)
```
---
## 1. Bit Reader (`bitreader.rs`)
### Design: branchless 64-bit, libdeflate/zlib-ng style
```rust
pub struct BitReader<'a> {
ptr: *const u8, // current input position
end: *const u8, // one-past-end of input
buf: u64, // shift register, LSB-first
bits: u32, // number of valid bits in buf (0..63)
}
```
**Struct size: 32 bytes — one cache line.**
### Branchless refill (4 instructions, no branch)
```rust
#[inline(always)]
pub unsafe fn refill(&mut self) {
// Load 8 bytes (may overread — safe if caller guarantees 8 bytes
// past input end are readable, OR we clamp at end).
let raw = core::ptr::read_unaligned(self.ptr as *const u64);
self.buf |= u64::from_le(raw) << (self.bits as u8);
let advance = ((63 ^ self.bits) >> 3) as usize; // branchless byte count
self.ptr = self.ptr.add(advance);
self.bits |= 56; // guarantee ≥ 56 bits
}
```
Guarantees: after `refill()`, `self.bits >= 56`. A lit/len decode needs at
most 15 bits, dist at most 15 bits, extra at most 13 bits = 43 bits max
per symbol. One refill per symbol is always sufficient.
### peek/consume split
```rust
#[inline(always)]
pub fn peek(&self, n: u32) -> u32 {
(self.buf as u32) & ((1u32 << n) - 1)
}
#[inline(always)]
pub fn consume(&mut self, n: u32) {
self.buf >>= n;
self.bits -= n;
}
// Combined: consume n bits and return them.
#[inline(always)]
pub fn take(&mut self, n: u32) -> u32 {
let v = self.peek(n);
self.consume(n);
v
}
```
### BMI2 variant (x86_64 with BMI2)
```rust
#[cfg(target_arch = "x86_64")]
#[inline(always)]
pub fn extract_var(&self, n: u32) -> u64 {
#[cfg(target_feature = "bmi2")]
{ unsafe { core::arch::x86_64::_bzhi_u64(self.buf, n as u64) } }
#[cfg(not(target_feature = "bmi2"))]
{ self.buf & ((1u64 << n) - 1) }
}
```
---
## 2. Huffman Tables (`tables.rs`)
### Design: 11-bit first level, packed u32 entries (libdeflate pattern)
```rust
const LITLEN_TABLEBITS: u32 = 11; // 2048 main entries
const DIST_TABLEBITS: u32 = 8; // 256 main entries
const PRECODE_TABLEBITS: u32 = 7; // 128 entries (max precode len = 7)
// Litlen: max 2342 entries × 4 bytes = 9.4 KB
// Dist: max 402 entries × 4 bytes = 1.6 KB
// Total: ~11 KB — fits in 32 KB L1-D with headroom for stack + output window.
```
### Entry format (packed u32)
**Literal:**
```
bit 31: 1 (LITERAL flag)
bit 23-16: literal byte value
bit 3-0: code length in bits
```
**Length (match):**
```
bit 31: 0
bit 24-16: length base value
bit 11-8: code length
bit 4-0: code_length + num_extra_bits (consume this many total)
```
**Subtable pointer:**
```
bit 15: 1 (SUBTABLE flag)
bit 14-8: subtable index offset
bit 3-0: main table bits
```
Key micro-optimization from libdeflate: `bits_to_consume = entry as u8`.
The full u32 entry can be subtracted from `bits` because only the low byte
matters. Saves a mask instruction.
### Table build
```rust
pub struct DecompressTables {
litlen: [u32; 2342],
dist: [u32; 402],
precode: [u32; 128],
// Scratch for code length processing (on stack, not heap)
lens: [u8; 288 + 32],
sorted_syms: [u16; 288],
}
```
**All on the stack or in a thread-local.** No heap allocation per block.
Thread-local pool (same pattern as lbzip2-rs's `TT_BUF`):
```rust
thread_local! {
static TABLES: RefCell<Option<Box<DecompressTables>>> = RefCell::new(None);
}
pub fn with_tables<R>(f: impl FnOnce(&mut DecompressTables) -> R) -> R {
TABLES.with(|cell| {
let mut opt = cell.borrow_mut();
let tables = opt.get_or_insert_with(|| Box::new(DecompressTables::zeroed()));
f(tables)
})
}
```
### #[cold] slow path
```rust
#[inline(always)]
fn decode_litlen(tables: &DecompressTables, bits: &mut BitReader) -> u32 {
let entry = tables.litlen[bits.peek(LITLEN_TABLEBITS) as usize];
if likely(entry & SUBTABLE_FLAG == 0) {
bits.consume(entry as u8 as u32);
return entry;
}
decode_litlen_slow(tables, bits, entry) // #[cold]
}
#[cold]
#[inline(never)]
fn decode_litlen_slow(tables: &DecompressTables, bits: &mut BitReader, entry: u32) -> u32 {
// Walk subtable...
}
```
---
## 3. Fast Loop (`fastloop.rs`)
### Design: libdeflate's preload-before-copy pattern
The key insight: after decoding length+distance, we **preload** the next
litlen entry and **then** refill bits. The match copy runs while the
preloaded entry's cache line is settling in L1.
```rust
pub unsafe fn inflate_fast(
bits: &mut BitReader,
tables: &DecompressTables,
out: &mut [u8],
) -> Result<usize, InflateError> {
let out_ptr = out.as_mut_ptr();
let out_end = out_ptr.add(out.len());
let mut out_pos = out_ptr;
// Fastloop bounds: stop 260 bytes before output end, 15 bytes before input end.
let out_fast_end = out_end.sub(260);
let in_fast_end = bits.end.sub(15);
bits.refill();
let mut entry = tables.litlen[bits.peek(LITLEN_TABLEBITS) as usize];
while bits.ptr < in_fast_end && out_pos < out_fast_end {
// ── Literal fast path (up to 3 per iteration) ────────────
let saved_buf = bits.buf;
bits.buf >>= entry as u8;
bits.bits -= entry as u8 as u32;
if entry & LITERAL_FLAG != 0 {
*out_pos = (entry >> 16) as u8;
out_pos = out_pos.add(1);
entry = tables.litlen[bits.peek(LITLEN_TABLEBITS) as usize];
if entry & LITERAL_FLAG != 0 {
bits.buf >>= entry as u8;
bits.bits -= entry as u8 as u32;
*out_pos = (entry >> 16) as u8;
out_pos = out_pos.add(1);
entry = tables.litlen[bits.peek(LITLEN_TABLEBITS) as usize];
if entry & LITERAL_FLAG != 0 {
bits.buf >>= entry as u8;
bits.bits -= entry as u8 as u32;
*out_pos = (entry >> 16) as u8;
out_pos = out_pos.add(1);
entry = tables.litlen[bits.peek(LITLEN_TABLEBITS) as usize];
continue;
}
}
}
// ── End of block ─────────────────────────────────────────
if entry & EOB_FLAG != 0 {
break;
}
// ── Match: length + distance ─────────────────────────────
let length = (entry >> 16) as usize
+ (extract_var(saved_buf, entry) >> ((entry >> 8) as u8)) as usize;
// Decode distance
let dist_entry = tables.dist[bits.peek(DIST_TABLEBITS) as usize];
// ... extract distance + extra bits ...
let dist = /* computed distance */ ;
// ── PRELOAD next entry BEFORE match copy ─────────────────
bits.refill();
entry = tables.litlen[bits.peek(LITLEN_TABLEBITS) as usize];
// ── Match copy (SIMD-accelerated) ────────────────────────
copy_match::<CHUNK_SIZE>(out_ptr, out_pos, dist, length);
out_pos = out_pos.add(length);
}
// ── Generic (safe) loop for remainder near buffer edges ──────
// Byte-at-a-time decode for final symbols...
Ok(out_pos.offset_from(out_ptr) as usize)
}
```
### Preload hides latency
The `entry = tables.litlen[...]` after refill initiates an L1 load (~4–5
cycles). While the CPU executes that load, the match copy runs (~3–20 cycles
for typical matches). By the time the loop iterates, `entry` is ready. This
is the same trick libdeflate uses — **instruction-level parallelism across
the loop boundary**.
---
## 4. SIMD Match Copy (`copy.rs`)
### Design: zlib-rs const-generic N + zlib-ng short-distance handling
```rust
/// Copy `length` bytes from `out_pos - dist` to `out_pos`.
///
/// Caller guarantees: `dist <= out_pos - out_start` and
/// `out_pos + length <= out_end - CHUNK_SIZE` (overwrite headroom).
#[inline(always)]
pub unsafe fn copy_match<const N: usize>(
out_start: *const u8,
out_pos: *mut u8,
dist: usize,
length: usize,
) {
let src = out_pos.sub(dist);
if dist >= N {
// Non-overlapping: N-byte chunk copy
copy_chunks::<N>(src, out_pos, length);
} else if dist == 1 {
// RLE: fill with single byte
let b = *src;
core::ptr::write_bytes(out_pos, b, length);
} else if dist >= 8 {
// Semi-short: word-stride copy (libdeflate pattern)
copy_stride_word(src, out_pos, dist, length);
} else {
// Short distance (2..7): SIMD tile or stride copy
#[cfg(target_arch = "x86_64")]
{
if is_x86_feature_detected!("avx2") {
copy_short_avx2(src, out_pos, dist, length);
return;
}
}
copy_stride_word(src, out_pos, dist, length);
}
}
/// Non-overlapping N-byte chunk copy.
#[inline(always)]
unsafe fn copy_chunks<const N: usize>(
mut src: *const u8,
mut dst: *mut u8,
length: usize,
) {
let end = src.add(length);
while src < end {
let chunk: [u8; N] = core::ptr::read_unaligned(src as *const [u8; N]);
core::ptr::write_unaligned(dst as *mut [u8; N], chunk);
src = src.add(N);
dst = dst.add(N);
}
}
// With N=32, LLVM emits vmovdqu ymm load + vmovdqu ymm store.
// With N=16, LLVM emits movdqu xmm load + movdqu xmm store.
```
### AVX2 short-distance tiling (zlib-ng pattern)
```rust
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn copy_short_avx2(
src: *const u8,
mut dst: *mut u8,
dist: usize,
length: usize,
) {
use core::arch::x86_64::*;
// Load source bytes and tile across 32-byte register.
let src_vec = _mm_loadu_si128(src as *const __m128i);
let wide = _mm256_inserti128_si256(_mm256_castsi128_si256(src_vec), src_vec, 1);
// Permutation LUT: for dist D, PERM_LUT[D-2] contains shuffle indices
// that tile a D-byte pattern across 32 bytes.
let perm = _mm256_load_si256(PERM_LUT[dist - 2].as_ptr() as *const __m256i);
let tiled = _mm256_shuffle_epi8(wide, perm);
let end = dst.add(length);
while dst < end {
_mm256_storeu_si256(dst as *mut __m256i, tiled);
dst = dst.add(32);
}
}
/// Pre-computed permutation indices for dist 2..7.
/// Each entry tiles the first `dist` bytes across 32 bytes.
static PERM_LUT: [[u8; 32]; 6] = [
// dist=2: [0,1, 0,1, 0,1, ...]
[0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1, 0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1],
// dist=3: [0,1,2, 0,1,2, 0,1,2, ...]
[0,1,2,0,1,2,0,1,2,0,1,2,0,1,2,0, 0,1,2,0,1,2,0,1,2,0,1,2,0,1,2,0],
// dist=4: [0,1,2,3, 0,1,2,3, ...]
[0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3, 0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3],
// dist=5: [0,1,2,3,4, 0,1,2,3,4, ...]
[0,1,2,3,4,0,1,2,3,4,0,1,2,3,4,0, 0,1,2,3,4,0,1,2,3,4,0,1,2,3,4,0],
// dist=6: [0,1,2,3,4,5, 0,1,2,3,4,5, ...]
[0,1,2,3,4,5,0,1,2,3,4,5,0,1,2,3, 0,1,2,3,4,5,0,1,2,3,4,5,0,1,2,3],
// dist=7: [0,1,2,3,4,5,6, 0,1,2,3,4,5,6, ...]
[0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1, 0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1],
];
```
### Runtime dispatch (once per call, not per match)
```rust
const CHUNK_SIZE: usize = {
#[cfg(target_arch = "x86_64")]
{ 32 } // AVX2
#[cfg(target_arch = "aarch64")]
{ 16 } // NEON
#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
{ 8 } // fallback
};
```
---
## 5. Public API (`lib.rs`)
```rust
/// Decompress DEFLATE data into a pre-allocated output buffer.
///
/// Returns the number of bytes written to `output`.
/// This is the zero-copy path — no allocation beyond what the caller provides.
///
/// The decompressor may write up to `CHUNK_SIZE` bytes past the actual
/// decompressed size (overwrite headroom for SIMD). Caller must account
/// for this in the allocation.
pub fn inflate_into(
compressed: &[u8],
output: &mut [u8],
) -> Result<usize, InflateError> {
with_tables(|tables| {
let mut bits = BitReader::new(compressed);
// Parse DEFLATE blocks until BFINAL=1
loop {
let bfinal = bits.take(1);
let btype = bits.take(2);
match btype {
0 => decode_stored(&mut bits, output, &mut out_pos)?,
1 => {
build_fixed_tables(tables);
unsafe { inflate_fast(&mut bits, tables, output) }?;
}
2 => {
decode_dynamic_header(&mut bits, tables)?;
unsafe { inflate_fast(&mut bits, tables, output) }?;
}
_ => return Err(InflateError::InvalidBlockType),
}
if bfinal != 0 { break; }
}
Ok(out_pos)
})
}
/// Convenience: allocates output buffer and decompresses.
pub fn inflate_to_vec(compressed: &[u8], max_size: usize) -> Result<Vec<u8>, InflateError> {
let mut output = vec![0u8; max_size + CHUNK_SIZE]; // SIMD headroom
let written = inflate_into(compressed, &mut output)?;
output.truncate(written);
Ok(output)
}
```
---
## 6. Cache Budget Analysis
On a typical modern x86_64 (Intel 12th gen / AMD Zen4):
| BitReader struct | 32 B | L1 (1 cache line) |
| litlen table | 9.4 KB | L1 |
| dist table | 1.6 KB | L1 |
| precode table | 512 B | L1 |
| lens + sorted_syms | 1.3 KB | L1 |
| Stack locals (loop vars) | ~128 B | registers + L1 |
| **Total hot working set** | **~13 KB** | **< 32 KB L1-D** ✅ |
| Output buffer (current page) | 4 KB | L1 (write-combine) |
| Input buffer (current region) | ~64 B ahead | L1 (prefetched by HW) |
The entire hot decode state fits in L1. No L2 misses in the inner loop
except for back-references with large distance (> 32 KB window), which
are inherently cache-cold but rare in typical DEFLATE streams.
---
## 7. Integration with ljar-rs / lgz-rs
### Worker thread hot path
```rust
// In ljar's or lgz's worker thread:
let segment = &data[item.start_byte..item.end_byte];
let mut out = vec![0u8; item.output_size_hint + linflate::OVERWRITE_HEADROOM];
match linflate::inflate_into(segment, &mut out) {
Ok(written) => { out.truncate(written); Ok(out) }
Err(e) => Err(e),
};
```
### Future: true zero-copy into pre-allocated slab
When the caller knows `uncompressed_size` (e.g. from ZIP Central Directory)
AND the entry is single-segment (no parallel split):
```rust
// Pre-allocate once for the entire entry
let mut slab = vec![0u8; entry.uncompressed_size as usize + linflate::OVERWRITE_HEADROOM];
// Worker decompresses directly into the slab — no Vec per segment
linflate::inflate_into(segment, &mut slab[offset..])?;
// Writer receives (entry_idx, written_len) — no data copy
```
---
## 8. Implementation Order
1. `bitreader.rs` — branchless refill, peek/consume, BMI2 variant
2. `tables.rs` — packed u32 entry format, 11-bit litlen, table build
3. `fixed.rs` — static fixed Huffman tables
4. `copy.rs` — SIMD match copy with const-generic N, AVX2 short-distance
5. `fastloop.rs` — the hot loop with preload-before-copy
6. `mod.rs` — public API, block parsing, dynamic header decode
7. Integration tests: verify against miniz_oxide on all 20 Maven JARs
8. Benchmarks: compare against miniz_oxide, measure throughput per core
---
## 9. Key Differences from Each Library
| Buffer model | Full-buffer (libdeflate) | Callers already have segments in memory |
| Huffman entry size | u32 packed (libdeflate) | Length base + extra baked in, one expression per match |
| First-level table bits | 11 (libdeflate) | Eliminates subtable lookups for real streams |
| Bit refill | Branchless XOR (zlib-ng) | 4 instructions, no branch |
| Literal unroll | 3 per iter (zlib-ng) | Best for literal-heavy class files |
| Match copy SIMD | Const-generic N (zlib-rs) | Clean Rust, LLVM emits right instructions |
| Short-dist SIMD | AVX2 shuffle LUT (zlib-ng) | One instruction tiles any dist 2..7 |
| Allocation | Thread-local table pool (lbzip2-rs) | Zero heap churn per block |
| Output model | `inflate_into(&mut [u8])` (lbzip2-rs) | Caller owns buffer, zero-copy |
| Slow path | `#[cold] #[inline(never)]` (lbzip2-rs) | Keep fast path hot, push rare code out |