linflate 0.1.3 - Docs.rs

# linflate: Custom DEFLATE Decompressor Design

Zero-copy, SIMD-optimized, cache-aware DEFLATE decoder. Now a standalone shared
crate (extracted from ljar-rs), used by both ljar-rs and lgz-rs.
Same philosophy as lbzip2-rs's `decode_block_into`: caller owns the output
buffer, no per-call allocation, hot path tuned for modern CPU cache hierarchy.

---

## Architecture Decision: libdeflate Model

After analyzing miniz_oxide, zlib-ng, libdeflate, and zlib-rs:

**We follow libdeflate's full-buffer model**, not zlib-ng's streaming model.

Rationale:
- ljar-rs already has the full compressed segment in a ring slot (zero-copy
  pointer). The decompressor receives `&[u8]` compressed + `&mut [u8]` output.
- Full-buffer eliminates: circular window arithmetic, `& mask` on every
  back-ref, `have/wsize/wnext` bookkeeping, all "copy from window" paths.
- Simpler code = fewer branches = better prediction.
- libdeflate is the fastest known DEFLATE decompressor. We copy its design.

**We add zlib-ng's SIMD match copy** on top of libdeflate's architecture:
- libdeflate relies on auto-vectorization for match copies.
- zlib-ng/zlib-rs use explicit SIMD chunk copies (`__m128i`/`__m256i`).
- We use Rust's const-generic chunk approach (from zlib-rs) for clean SIMD
  without `#[cfg]` spaghetti.

**We add lbzip2-rs's zero-copy output pattern** (`decode_into(&mut [u8])`):
- Caller pre-allocates output buffer (known from ZIP Central Directory).
- Decompressor writes directly into it. No intermediate Vec, no realloc.

---

## Module Structure

```
src/
  lib.rs          — public API: inflate_into(), inflate_to_vec(), inflate_segment*()
  bitreader.rs    — 64-bit branchless bit reader
  tables.rs       — Huffman table build + packed entry format
  fastloop.rs     — the hot decode loop (literals + matches)
  copy.rs         — SIMD match copy (const-generic N)
  fixed.rs        — pre-built fixed Huffman tables (static)
```

---

## 1. Bit Reader (`bitreader.rs`)

### Design: branchless 64-bit, libdeflate/zlib-ng style

```rust
pub struct BitReader<'a> {
    ptr: *const u8,       // current input position
    end: *const u8,       // one-past-end of input
    buf: u64,             // shift register, LSB-first
    bits: u32,            // number of valid bits in buf (0..63)
}
```

**Struct size: 32 bytes — one cache line.**

### Branchless refill (4 instructions, no branch)

```rust
#[inline(always)]
pub unsafe fn refill(&mut self) {
    // Load 8 bytes (may overread — safe if caller guarantees 8 bytes
    // past input end are readable, OR we clamp at end).
    let raw = core::ptr::read_unaligned(self.ptr as *const u64);
    self.buf |= u64::from_le(raw) << (self.bits as u8);
    let advance = ((63 ^ self.bits) >> 3) as usize;  // branchless byte count
    self.ptr = self.ptr.add(advance);
    self.bits |= 56;  // guarantee ≥ 56 bits
}
```

Guarantees: after `refill()`, `self.bits >= 56`. A lit/len decode needs at
most 15 bits, dist at most 15 bits, extra at most 13 bits = 43 bits max
per symbol. One refill per symbol is always sufficient.

### peek/consume split

```rust
#[inline(always)]
pub fn peek(&self, n: u32) -> u32 {
    (self.buf as u32) & ((1u32 << n) - 1)
}

#[inline(always)]
pub fn consume(&mut self, n: u32) {
    self.buf >>= n;
    self.bits -= n;
}

// Combined: consume n bits and return them.
#[inline(always)]
pub fn take(&mut self, n: u32) -> u32 {
    let v = self.peek(n);
    self.consume(n);
    v
}
```

### BMI2 variant (x86_64 with BMI2)

```rust
#[cfg(target_arch = "x86_64")]
#[inline(always)]
pub fn extract_var(&self, n: u32) -> u64 {
    #[cfg(target_feature = "bmi2")]
    { unsafe { core::arch::x86_64::_bzhi_u64(self.buf, n as u64) } }
    #[cfg(not(target_feature = "bmi2"))]
    { self.buf & ((1u64 << n) - 1) }
}
```

---

## 2. Huffman Tables (`tables.rs`)

### Design: 11-bit first level, packed u32 entries (libdeflate pattern)

```rust
const LITLEN_TABLEBITS: u32 = 11;   // 2048 main entries
const DIST_TABLEBITS: u32 = 8;      // 256 main entries
const PRECODE_TABLEBITS: u32 = 7;   // 128 entries (max precode len = 7)

// Litlen: max 2342 entries × 4 bytes = 9.4 KB
// Dist:   max 402 entries × 4 bytes  = 1.6 KB
// Total: ~11 KB — fits in 32 KB L1-D with headroom for stack + output window.
```

### Entry format (packed u32)

**Literal:**
```
bit 31:    1 (LITERAL flag)
bit 23-16: literal byte value
bit 3-0:   code length in bits
```

**Length (match):**
```
bit 31:    0
bit 24-16: length base value
bit 11-8:  code length
bit 4-0:   code_length + num_extra_bits (consume this many total)
```

**Subtable pointer:**
```
bit 15:    1 (SUBTABLE flag)
bit 14-8:  subtable index offset
bit 3-0:   main table bits
```

Key micro-optimization from libdeflate: `bits_to_consume = entry as u8`.
The full u32 entry can be subtracted from `bits` because only the low byte
matters. Saves a mask instruction.

### Table build

```rust
pub struct DecompressTables {
    litlen: [u32; 2342],
    dist: [u32; 402],
    precode: [u32; 128],
    // Scratch for code length processing (on stack, not heap)
    lens: [u8; 288 + 32],
    sorted_syms: [u16; 288],
}
```

**All on the stack or in a thread-local.** No heap allocation per block.

Thread-local pool (same pattern as lbzip2-rs's `TT_BUF`):
```rust
thread_local! {
    static TABLES: RefCell<Option<Box<DecompressTables>>> = RefCell::new(None);
}

pub fn with_tables<R>(f: impl FnOnce(&mut DecompressTables) -> R) -> R {
    TABLES.with(|cell| {
        let mut opt = cell.borrow_mut();
        let tables = opt.get_or_insert_with(|| Box::new(DecompressTables::zeroed()));
        f(tables)
    })
}
```

### #[cold] slow path

```rust
#[inline(always)]
fn decode_litlen(tables: &DecompressTables, bits: &mut BitReader) -> u32 {
    let entry = tables.litlen[bits.peek(LITLEN_TABLEBITS) as usize];
    if likely(entry & SUBTABLE_FLAG == 0) {
        bits.consume(entry as u8 as u32);
        return entry;
    }
    decode_litlen_slow(tables, bits, entry)  // #[cold]
}

#[cold]
#[inline(never)]
fn decode_litlen_slow(tables: &DecompressTables, bits: &mut BitReader, entry: u32) -> u32 {
    // Walk subtable...
}
```

---

## 3. Fast Loop (`fastloop.rs`)

### Design: libdeflate's preload-before-copy pattern

The key insight: after decoding length+distance, we **preload** the next
litlen entry and **then** refill bits. The match copy runs while the
preloaded entry's cache line is settling in L1.

```rust
pub unsafe fn inflate_fast(
    bits: &mut BitReader,
    tables: &DecompressTables,
    out: &mut [u8],
) -> Result<usize, InflateError> {
    let out_ptr = out.as_mut_ptr();
    let out_end = out_ptr.add(out.len());
    let mut out_pos = out_ptr;

    // Fastloop bounds: stop 260 bytes before output end, 15 bytes before input end.
    let out_fast_end = out_end.sub(260);
    let in_fast_end = bits.end.sub(15);

    bits.refill();
    let mut entry = tables.litlen[bits.peek(LITLEN_TABLEBITS) as usize];

    while bits.ptr < in_fast_end && out_pos < out_fast_end {
        // ── Literal fast path (up to 3 per iteration) ────────────
        let saved_buf = bits.buf;
        bits.buf >>= entry as u8;
        bits.bits -= entry as u8 as u32;

        if entry & LITERAL_FLAG != 0 {
            *out_pos = (entry >> 16) as u8;
            out_pos = out_pos.add(1);
            entry = tables.litlen[bits.peek(LITLEN_TABLEBITS) as usize];

            if entry & LITERAL_FLAG != 0 {
                bits.buf >>= entry as u8;
                bits.bits -= entry as u8 as u32;
                *out_pos = (entry >> 16) as u8;
                out_pos = out_pos.add(1);
                entry = tables.litlen[bits.peek(LITLEN_TABLEBITS) as usize];

                if entry & LITERAL_FLAG != 0 {
                    bits.buf >>= entry as u8;
                    bits.bits -= entry as u8 as u32;
                    *out_pos = (entry >> 16) as u8;
                    out_pos = out_pos.add(1);
                    entry = tables.litlen[bits.peek(LITLEN_TABLEBITS) as usize];
                    continue;
                }
            }
        }

        // ── End of block ─────────────────────────────────────────
        if entry & EOB_FLAG != 0 {
            break;
        }

        // ── Match: length + distance ─────────────────────────────
        let length = (entry >> 16) as usize
            + (extract_var(saved_buf, entry) >> ((entry >> 8) as u8)) as usize;

        // Decode distance
        let dist_entry = tables.dist[bits.peek(DIST_TABLEBITS) as usize];
        // ... extract distance + extra bits ...
        let dist = /* computed distance */ ;

        // ── PRELOAD next entry BEFORE match copy ─────────────────
        bits.refill();
        entry = tables.litlen[bits.peek(LITLEN_TABLEBITS) as usize];

        // ── Match copy (SIMD-accelerated) ────────────────────────
        copy_match::<CHUNK_SIZE>(out_ptr, out_pos, dist, length);
        out_pos = out_pos.add(length);
    }

    // ── Generic (safe) loop for remainder near buffer edges ──────
    // Byte-at-a-time decode for final symbols...

    Ok(out_pos.offset_from(out_ptr) as usize)
}
```

### Preload hides latency

The `entry = tables.litlen[...]` after refill initiates an L1 load (~4–5
cycles). While the CPU executes that load, the match copy runs (~3–20 cycles
for typical matches). By the time the loop iterates, `entry` is ready. This
is the same trick libdeflate uses — **instruction-level parallelism across
the loop boundary**.

---

## 4. SIMD Match Copy (`copy.rs`)

### Design: zlib-rs const-generic N + zlib-ng short-distance handling

```rust
/// Copy `length` bytes from `out_pos - dist` to `out_pos`.
///
/// Caller guarantees: `dist <= out_pos - out_start` and
/// `out_pos + length <= out_end - CHUNK_SIZE` (overwrite headroom).
#[inline(always)]
pub unsafe fn copy_match<const N: usize>(
    out_start: *const u8,
    out_pos: *mut u8,
    dist: usize,
    length: usize,
) {
    let src = out_pos.sub(dist);

    if dist >= N {
        // Non-overlapping: N-byte chunk copy
        copy_chunks::<N>(src, out_pos, length);
    } else if dist == 1 {
        // RLE: fill with single byte
        let b = *src;
        core::ptr::write_bytes(out_pos, b, length);
    } else if dist >= 8 {
        // Semi-short: word-stride copy (libdeflate pattern)
        copy_stride_word(src, out_pos, dist, length);
    } else {
        // Short distance (2..7): SIMD tile or stride copy
        #[cfg(target_arch = "x86_64")]
        {
            if is_x86_feature_detected!("avx2") {
                copy_short_avx2(src, out_pos, dist, length);
                return;
            }
        }
        copy_stride_word(src, out_pos, dist, length);
    }
}

/// Non-overlapping N-byte chunk copy.
#[inline(always)]
unsafe fn copy_chunks<const N: usize>(
    mut src: *const u8,
    mut dst: *mut u8,
    length: usize,
) {
    let end = src.add(length);
    while src < end {
        let chunk: [u8; N] = core::ptr::read_unaligned(src as *const [u8; N]);
        core::ptr::write_unaligned(dst as *mut [u8; N], chunk);
        src = src.add(N);
        dst = dst.add(N);
    }
}
// With N=32, LLVM emits vmovdqu ymm load + vmovdqu ymm store.
// With N=16, LLVM emits movdqu xmm load + movdqu xmm store.
```

### AVX2 short-distance tiling (zlib-ng pattern)

```rust
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn copy_short_avx2(
    src: *const u8,
    mut dst: *mut u8,
    dist: usize,
    length: usize,
) {
    use core::arch::x86_64::*;

    // Load source bytes and tile across 32-byte register.
    let src_vec = _mm_loadu_si128(src as *const __m128i);
    let wide = _mm256_inserti128_si256(_mm256_castsi128_si256(src_vec), src_vec, 1);

    // Permutation LUT: for dist D, PERM_LUT[D-2] contains shuffle indices
    // that tile a D-byte pattern across 32 bytes.
    let perm = _mm256_load_si256(PERM_LUT[dist - 2].as_ptr() as *const __m256i);
    let tiled = _mm256_shuffle_epi8(wide, perm);

    let end = dst.add(length);
    while dst < end {
        _mm256_storeu_si256(dst as *mut __m256i, tiled);
        dst = dst.add(32);
    }
}

/// Pre-computed permutation indices for dist 2..7.
/// Each entry tiles the first `dist` bytes across 32 bytes.
static PERM_LUT: [[u8; 32]; 6] = [
    // dist=2: [0,1, 0,1, 0,1, ...]
    [0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1, 0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1],
    // dist=3: [0,1,2, 0,1,2, 0,1,2, ...]
    [0,1,2,0,1,2,0,1,2,0,1,2,0,1,2,0, 0,1,2,0,1,2,0,1,2,0,1,2,0,1,2,0],
    // dist=4: [0,1,2,3, 0,1,2,3, ...]
    [0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3, 0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3],
    // dist=5: [0,1,2,3,4, 0,1,2,3,4, ...]
    [0,1,2,3,4,0,1,2,3,4,0,1,2,3,4,0, 0,1,2,3,4,0,1,2,3,4,0,1,2,3,4,0],
    // dist=6: [0,1,2,3,4,5, 0,1,2,3,4,5, ...]
    [0,1,2,3,4,5,0,1,2,3,4,5,0,1,2,3, 0,1,2,3,4,5,0,1,2,3,4,5,0,1,2,3],
    // dist=7: [0,1,2,3,4,5,6, 0,1,2,3,4,5,6, ...]
    [0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1, 0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1],
];
```

### Runtime dispatch (once per call, not per match)

```rust
const CHUNK_SIZE: usize = {
    #[cfg(target_arch = "x86_64")]
    { 32 }  // AVX2
    #[cfg(target_arch = "aarch64")]
    { 16 }  // NEON
    #[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
    { 8 }   // fallback
};
```

---

## 5. Public API (`lib.rs`)

```rust
/// Decompress DEFLATE data into a pre-allocated output buffer.
///
/// Returns the number of bytes written to `output`.
/// This is the zero-copy path — no allocation beyond what the caller provides.
///
/// The decompressor may write up to `CHUNK_SIZE` bytes past the actual
/// decompressed size (overwrite headroom for SIMD). Caller must account
/// for this in the allocation.
pub fn inflate_into(
    compressed: &[u8],
    output: &mut [u8],
) -> Result<usize, InflateError> {
    with_tables(|tables| {
        let mut bits = BitReader::new(compressed);
        // Parse DEFLATE blocks until BFINAL=1
        loop {
            let bfinal = bits.take(1);
            let btype = bits.take(2);
            match btype {
                0 => decode_stored(&mut bits, output, &mut out_pos)?,
                1 => {
                    build_fixed_tables(tables);
                    unsafe { inflate_fast(&mut bits, tables, output) }?;
                }
                2 => {
                    decode_dynamic_header(&mut bits, tables)?;
                    unsafe { inflate_fast(&mut bits, tables, output) }?;
                }
                _ => return Err(InflateError::InvalidBlockType),
            }
            if bfinal != 0 { break; }
        }
        Ok(out_pos)
    })
}

/// Convenience: allocates output buffer and decompresses.
pub fn inflate_to_vec(compressed: &[u8], max_size: usize) -> Result<Vec<u8>, InflateError> {
    let mut output = vec![0u8; max_size + CHUNK_SIZE]; // SIMD headroom
    let written = inflate_into(compressed, &mut output)?;
    output.truncate(written);
    Ok(output)
}
```

---

## 6. Cache Budget Analysis

On a typical modern x86_64 (Intel 12th gen / AMD Zen4):

| Component | Size | Fits in |
|-----------|------|---------|
| BitReader struct | 32 B | L1 (1 cache line) |
| litlen table | 9.4 KB | L1 |
| dist table | 1.6 KB | L1 |
| precode table | 512 B | L1 |
| lens + sorted_syms | 1.3 KB | L1 |
| Stack locals (loop vars) | ~128 B | registers + L1 |
| **Total hot working set** | **~13 KB** | **< 32 KB L1-D** ✅ |
| Output buffer (current page) | 4 KB | L1 (write-combine) |
| Input buffer (current region) | ~64 B ahead | L1 (prefetched by HW) |

The entire hot decode state fits in L1. No L2 misses in the inner loop
except for back-references with large distance (> 32 KB window), which
are inherently cache-cold but rare in typical DEFLATE streams.

---

## 7. Integration with ljar-rs / lgz-rs

### Worker thread hot path

```rust
// In ljar's or lgz's worker thread:
let segment = &data[item.start_byte..item.end_byte];

let mut out = vec![0u8; item.output_size_hint + linflate::OVERWRITE_HEADROOM];
match linflate::inflate_into(segment, &mut out) {
    Ok(written) => { out.truncate(written); Ok(out) }
    Err(e) => Err(e),
};
```

### Future: true zero-copy into pre-allocated slab

When the caller knows `uncompressed_size` (e.g. from ZIP Central Directory)
AND the entry is single-segment (no parallel split):

```rust
// Pre-allocate once for the entire entry
let mut slab = vec![0u8; entry.uncompressed_size as usize + linflate::OVERWRITE_HEADROOM];
// Worker decompresses directly into the slab — no Vec per segment
linflate::inflate_into(segment, &mut slab[offset..])?;
// Writer receives (entry_idx, written_len) — no data copy
```

---

## 8. Implementation Order

1. `bitreader.rs` — branchless refill, peek/consume, BMI2 variant
2. `tables.rs` — packed u32 entry format, 11-bit litlen, table build
3. `fixed.rs` — static fixed Huffman tables
4. `copy.rs` — SIMD match copy with const-generic N, AVX2 short-distance
5. `fastloop.rs` — the hot loop with preload-before-copy
6. `mod.rs` — public API, block parsing, dynamic header decode
7. Integration tests: verify against miniz_oxide on all 20 Maven JARs
8. Benchmarks: compare against miniz_oxide, measure throughput per core

---

## 9. Key Differences from Each Library

| Decision | Our choice | Why |
|----------|-----------|-----|
| Buffer model | Full-buffer (libdeflate) | Callers already have segments in memory |
| Huffman entry size | u32 packed (libdeflate) | Length base + extra baked in, one expression per match |
| First-level table bits | 11 (libdeflate) | Eliminates subtable lookups for real streams |
| Bit refill | Branchless XOR (zlib-ng) | 4 instructions, no branch |
| Literal unroll | 3 per iter (zlib-ng) | Best for literal-heavy class files |
| Match copy SIMD | Const-generic N (zlib-rs) | Clean Rust, LLVM emits right instructions |
| Short-dist SIMD | AVX2 shuffle LUT (zlib-ng) | One instruction tiles any dist 2..7 |
| Allocation | Thread-local table pool (lbzip2-rs) | Zero heap churn per block |
| Output model | `inflate_into(&mut [u8])` (lbzip2-rs) | Caller owns buffer, zero-copy |
| Slow path | `#[cold] #[inline(never)]` (lbzip2-rs) | Keep fast path hot, push rare code out |