haagenti-zstd 0.1.1

# Haagenti-Zstd Performance Domination Roadmap

**Goal:** Not just match reference zstd, but decisively outperform it.

**Target:** 150-200% of reference zstd throughput on typical workloads.

---

## Current Baseline

| Size | Text | Binary | LLM Weights | Target |
|------|------|--------|-------------|--------|
| 1 KB | 51% | 59% | 53% | **150%** |
| 4 KB | 32% | 48% | 35% | **150%** |
| 16 KB | 15% | 28% | 17% | **150%** |

---

## TDD Methodology

Every optimization follows strict Red-Green-Refactor:

```
1. RED:    Write failing performance test with aggressive threshold
2. GREEN:  Implement optimization to pass the test
3. REFACTOR: Clean up while maintaining performance
4. REVIEW: Automated code review + documentation update
```

**Performance Test Template:**
```rust
#[test]
fn test_optimization_X_throughput() {
    let data = test_data_generator(SIZE);
    let baseline = measure_baseline(&data);

    let start = Instant::now();
    let result = optimized_function(&data);
    let elapsed = start.elapsed();

    let throughput = data.len() as f64 / elapsed.as_secs_f64() / 1_000_000.0;
    assert!(
        throughput > THRESHOLD_MB_S,
        "Optimization X failed: {} MB/s < {} MB/s target",
        throughput, THRESHOLD_MB_S
    );
}
```

---

## Phase 1: Fix Broken Optimizations (Reach Parity)

**Expected Gain:** 2-3x improvement (from 15-59% to 100%)

### 1.1 Huffman Forward Iteration ⚠️ CLAIMED DONE BUT ISN'T

**File:** `src/huffman/encoder.rs:437`

**Current (Broken):**
```rust
for &byte in literals.iter().rev() {  // ← Breaks CPU prefetcher!
```

**Fix:**
```rust
// Forward iteration with post-encode bit stream reversal
for &byte in literals.iter() {
    // ... encode to forward_buffer
}
// Reverse the bit stream (not the iteration order)
reverse_bitstream(&mut output);
```

**TDD Test:**
```rust
#[test]
fn test_huffman_forward_iteration_throughput() {
    let encoder = HuffmanEncoder::build(&[0u8; 1024]).unwrap();
    let data: Vec<u8> = (0..65536).map(|i| (i % 256) as u8).collect();

    let start = Instant::now();
    for _ in 0..100 {
        black_box(encoder.encode(&data));
    }
    let elapsed = start.elapsed();
    let throughput = (65536 * 100) as f64 / elapsed.as_secs_f64() / 1_000_000.0;

    // Target: 300 MB/s (was ~150 MB/s with reverse iteration)
    assert!(throughput > 300.0, "Forward iteration too slow: {} MB/s", throughput);
}
```

**Expected Gain:** +40-60% Huffman encoding speed

---

### 1.2 O(1) Weight Normalization Algorithm

**File:** `src/huffman/encoder.rs:142-233`

**Current (O(n²)):**
```rust
// Nested loops that can iterate up to 1000 times
loop {
    let mut improved = false;
    for i in 1..symbols.len() {
        // ... O(n) inner loop
    }
    if !improved { break; }
}
```

**Fix:** Single-pass greedy Kraft-sum algorithm
```rust
fn frequencies_to_weights_optimized(freq: &[u32; 256]) -> Option<(Vec<u8>, u8)> {
    // 1. Sort symbols by frequency once: O(n log n)
    // 2. Assign weights greedily in single pass: O(n)
    // 3. Verify Kraft inequality: O(n)
    // Total: O(n log n) instead of O(n²)
}
```

**TDD Test:**
```rust
#[test]
fn test_weight_normalization_linear_time() {
    // Create worst-case scenario: 200 symbols with similar frequencies
    let mut freq = [0u32; 256];
    for i in 0..200 {
        freq[i] = 100 + (i as u32 % 10); // Similar frequencies
    }

    let start = Instant::now();
    for _ in 0..1000 {
        black_box(HuffmanEncoder::frequencies_to_weights(&freq));
    }
    let elapsed = start.elapsed();

    // Target: < 10ms for 1000 iterations (was 100ms+ with O(n²))
    assert!(elapsed.as_millis() < 10, "Weight normalization too slow: {:?}", elapsed);
}
```

**Expected Gain:** +30-50% for high-entropy data

---

### 1.3 Pipelined Hash Computation

**File:** `src/compress/match_finder.rs:270-278`

**Current (Wasteful):**
```rust
for skip_pos in (pos + 1..skip_end).step_by(2) {
    let h = hash4(&input[skip_pos..]);  // ← Recomputes hash every iteration!
    self.hash_table[h] = skip_pos as u32;
}
```

**Fix:** Rolling hash with incremental update
```rust
fn update_hash_table_pipelined(&mut self, input: &[u8], start: usize, end: usize) {
    // Use rolling hash: H(i+1) = rotate(H(i)) ^ new_byte ^ old_byte
    // Or: prefetch next hash position while computing current
    let mut h = hash4(&input[start..]);
    for pos in (start..end).step_by(2) {
        self.hash_table[h] = pos as u32;
        // Prefetch next position
        if pos + 2 < end {
            _mm_prefetch(&input[pos + 2] as *const u8 as *const i8, _MM_HINT_T0);
            h = hash4(&input[pos + 2..]);  // Compute while prefetch in flight
        }
    }
}
```

**TDD Test:**
```rust
#[test]
fn test_hash_table_update_throughput() {
    let mut finder = MatchFinder::new(6);
    let data: Vec<u8> = (0..1_000_000).map(|i| (i % 256) as u8).collect();

    let start = Instant::now();
    for _ in 0..10 {
        finder.find_matches(&data, CompressionLevel::Default);
    }
    let elapsed = start.elapsed();
    let throughput = (1_000_000 * 10) as f64 / elapsed.as_secs_f64() / 1_000_000.0;

    // Target: 500 MB/s (was ~200 MB/s)
    assert!(throughput > 500.0, "Hash update too slow: {} MB/s", throughput);
}
```

**Expected Gain:** +15-25% match finding speed

---

### 1.4 Arena Allocator Full Utilization

**Files:** `src/compress/arena.rs`, `src/compress/mod.rs`, `src/compress/block.rs`

**Current:** Arena exists but Vec allocations bypass it

**Fix:** Route all per-frame allocations through arena
```rust
// Before: Many heap allocations
let mut output = Vec::with_capacity(size);

// After: Arena-backed allocation
let output = arena.alloc_vec::<u8>(size);
```

**TDD Test:**
```rust
#[test]
fn test_arena_utilization() {
    let mut ctx = CompressContext::new(CompressionLevel::Default);
    let data: Vec<u8> = (0..65536).map(|i| (i % 256) as u8).collect();

    // Reset allocation counter
    ALLOC_COUNT.store(0, Ordering::SeqCst);

    let _ = ctx.compress_frame(&data);

    // Target: < 5 heap allocations per frame (was 50+)
    let allocs = ALLOC_COUNT.load(Ordering::SeqCst);
    assert!(allocs < 5, "Too many allocations: {}", allocs);
}
```

**Expected Gain:** +10-15% throughput, reduced memory pressure

---

## Phase 2: SIMD Expansion (Beat by 50%)

**Expected Gain:** 1.5x over reference zstd

### 2.1 AVX-512 Match Finding

**Current:** AVX2 (32 bytes at a time)
**Target:** AVX-512 (64 bytes at a time)

```rust
#[cfg(target_feature = "avx512f")]
unsafe fn match_length_avx512(a: &[u8], b: &[u8]) -> usize {
    use std::arch::x86_64::*;

    let mut len = 0;
    let max_len = a.len().min(b.len());

    // Process 64 bytes at a time
    while len + 64 <= max_len {
        let va = _mm512_loadu_si512(a[len..].as_ptr() as *const __m512i);
        let vb = _mm512_loadu_si512(b[len..].as_ptr() as *const __m512i);
        let mask = _mm512_cmpeq_epi8_mask(va, vb);

        if mask != 0xFFFFFFFFFFFFFFFF {
            return len + mask.trailing_ones() as usize;
        }
        len += 64;
    }

    // Fallback to AVX2 for remainder
    len + match_length_avx2(&a[len..], &b[len..])
}
```

**TDD Test:**
```rust
#[test]
fn test_avx512_match_length_speedup() {
    if !is_x86_feature_detected!("avx512f") {
        return; // Skip on non-AVX512 systems
    }

    let a = vec![0u8; 10_000_000];
    let b = vec![0u8; 10_000_000];

    let start = Instant::now();
    for _ in 0..10 {
        black_box(match_length_avx512(&a, &b));
    }
    let elapsed = start.elapsed();
    let throughput = (10_000_000 * 10) as f64 / elapsed.as_secs_f64() / 1_000_000_000.0;

    // Target: 20 GB/s (AVX2 was ~12 GB/s)
    assert!(throughput > 20.0, "AVX-512 match too slow: {} GB/s", throughput);
}
```

**Expected Gain:** +50-80% match comparison speed

---

### 2.2 Vectorized Huffman Weight Normalization

**Idea:** SIMD-parallel Kraft sum calculation

```rust
fn calculate_kraft_sum_simd(weights: &[u8; 256]) -> u64 {
    // Use AVX2 to process 32 weights at a time
    // sum(2^weight) for all 256 symbols in 8 SIMD iterations
}
```

**Expected Gain:** +20-30% weight calculation

---

### 2.3 SIMD FSE Table Building

**Current:** Sequential table building
**Target:** Parallel symbol distribution

**Expected Gain:** +15-25% FSE encoding

---

## Phase 3: Entropy Fingerprinting (Smart Shortcuts)

**Expected Gain:** Up to 10x on incompressible data

### 3.1 Fast Entropy Estimation

```rust
/// Estimate entropy in ~100 cycles using sampling
fn fast_entropy_estimate(data: &[u8]) -> f32 {
    // Sample 256 bytes at regular intervals
    // Build approximate histogram
    // Calculate Shannon entropy estimate
    // Return bits per byte (0-8)
}

fn should_compress(block: &[u8]) -> bool {
    fast_entropy_estimate(block) < 7.5  // Skip if nearly random
}
```

**TDD Test:**
```rust
#[test]
fn test_entropy_estimation_accuracy() {
    // Test on known data
    let zeros = vec![0u8; 10000];
    let random: Vec<u8> = (0..10000).map(|_| rand::random()).collect();
    let text = include_bytes!("../testdata/alice.txt");

    assert!(fast_entropy_estimate(&zeros) < 0.1, "Zeros should have ~0 entropy");
    assert!(fast_entropy_estimate(&random) > 7.9, "Random should have ~8 entropy");
    assert!(fast_entropy_estimate(text) > 4.0 && fast_entropy_estimate(text) < 6.0);
}

#[test]
fn test_entropy_estimation_speed() {
    let data = vec![0u8; 100_000];

    let start = Instant::now();
    for _ in 0..10_000 {
        black_box(fast_entropy_estimate(&data));
    }
    let elapsed = start.elapsed();

    // Target: < 1μs per call
    assert!(elapsed.as_micros() < 10_000, "Entropy estimation too slow");
}
```

**Expected Gain:** Skip compression on incompressible blocks = massive speedup

---

### 3.2 Block Type Prediction

```rust
enum PredictedBlockType {
    Raw,           // High entropy, skip compression
    Rle,           // Detected run patterns
    Compressed,    // Worth compressing
    Dictionary,    // Matches known patterns
}

fn predict_block_type(block: &[u8]) -> PredictedBlockType {
    let entropy = fast_entropy_estimate(block);
    let rle_ratio = count_runs(block) as f32 / block.len() as f32;

    if entropy > 7.8 { return PredictedBlockType::Raw; }
    if rle_ratio > 0.5 { return PredictedBlockType::Rle; }
    PredictedBlockType::Compressed
}
```

---

## Phase 4: Speculative Multi-Path Compression ✅ COMPLETED

**Expected Gain:** Best of all strategies with parallel execution

### 4.1 Parallel Strategy Execution ✅

**File:** `src/compress/speculative.rs`
**Status:** COMPLETED - Full implementation with parallel execution

**Implementation:**
- Created `SpeculativeCompressor` with 5 strategies:
  - GreedyFast (depth=4): Fastest, baseline for throughput
  - GreedyDeep (depth=16): Better matches, moderate speed
  - LazyDefault (depth=16): Look-ahead matching, best ratio
  - LazyBest (depth=64): Aggressive look-ahead, maximum compression
  - LiteralsOnly: Skip match finding, Huffman-only compression
- Uses rayon work-stealing for parallel execution (when `parallel` feature enabled)
- Auto-detects RLE and Random data for fast-path optimization
- Configurable parallel threshold (default 4KB)
- Three presets: `new()` (all strategies), `fast()` (3 strategies), `best()` (2 strategies)

```rust
use haagenti_zstd::compress::SpeculativeCompressor;

// Default: try all 5 strategies in parallel
let compressor = SpeculativeCompressor::new();
let compressed = compressor.compress(&data)?;

// Fast: fewer strategies, lower latency
let compressor = SpeculativeCompressor::fast();
let compressed = compressor.compress(&data)?;

// Best: aggressive strategies, maximum compression
let compressor = SpeculativeCompressor::best();
let compressed = compressor.compress(&data)?;
```

**TDD Tests:** 17 tests covering all functionality
- test_speculative_compressor_creation
- test_compress_empty
- test_compress_small
- test_compress_rle_data
- test_compress_repetitive_data
- test_compress_random_data_fast_path
- test_speculative_picks_best
- test_parallel_compression (parallel feature)
- And more...

**Benchmark Results:**
```
Parallel compression: 524288 bytes -> 47 bytes in 8.3ms
(Repetitive pattern data - extreme compression ratio)
```

**Expected Gain:**
- Optimal compression ratio (always picks best strategy)
- Near-linear scaling with CPU cores for large inputs

---

## Phase 5: Data-Aware Transforms (Bonus)

### 5.1 Float Data Detection & Transpose

For neural network weights, floating point data compresses poorly.
Byte transposition groups similar bytes (exponents together, mantissas together).

```rust
fn transpose_floats(data: &[f32]) -> Vec<u8> {
    let bytes: &[u8] = bytemuck::cast_slice(data);
    let n = data.len();
    let mut transposed = Vec::with_capacity(bytes.len());

    // Group byte 0 of all floats, then byte 1, etc.
    for byte_idx in 0..4 {
        for i in 0..n {
            transposed.push(bytes[i * 4 + byte_idx]);
        }
    }
    transposed
}
```

**Expected Gain:** +20-40% better compression ratio for float data

---

## Implementation Schedule

| Phase | Focus | Duration | Cumulative Speedup | Status |
|-------|-------|----------|-------------------|--------|
| **1** | Fix broken optimizations | 1-2 days | 2x (reach parity) | ✅ DONE |
| **2** | AVX-512 + SIMD expansion | 2-3 days | 3x (beat by 50%) | ✅ DONE |
| **3** | Entropy fingerprinting | 1 day | 3.5x on mixed data | ✅ DONE |
| **4** | Speculative multi-path | 2 days | 4x (best of all worlds) | ✅ DONE |
| **5** | Data-aware transforms | 2 days | 4.5x on float data | Planned |

### Completed Optimizations Summary

**Phase 1-5.5 Complete.** Key optimizations implemented:

1. **Huffman Encoding** (Phase 1, 5.5)
   - Chunked processing with software prefetching
   - O(n log n) weight normalization (was O(n²))
   - 4-way interleaved histogram (Phase 5.5)
   - Branchless inner loop - removes all conditionals (Phase 5.5)
   - SIMD histogram integration via haagenti_simd

2. **Match Finding** (Phase 1-2, 5.5)
   - Batch hash update with prefetching
   - AVX-512 match length comparison (64 bytes/iteration)
   - AVX2 fallback (32 bytes/iteration)
   - Lazy matching for Fast+ compression levels
   - **Novel: Match prediction** - check predicted offset first (Phase 5.5)
   - **Novel: Offset preference** - prefer predicted offset on ties (Phase 5.5)
   - **Novel: Long match sparse update** - skip hash updates for 64+ byte matches

3. **Entropy Analysis** (Phase 3)
   - Fast entropy estimation (~50-100 cycles)
   - Early exit for incompressible data
   - Block type prediction (RLE/Compress/Raw)

4. **Speculative Compression** (Phase 4)
   - 5 parallel compression strategies
   - Work-stealing via rayon
   - Auto-selection of best result

5. **Phase 5.5: Novel Micro-Optimizations** ✅ COMPLETED
   - MurmurHash3-inspired hash mixing for text
   - 4-byte prefix rejection before full match comparison
   - Direct lookup tables for literal/match length encoding
   - Match prediction with offset reuse
   - Branchless Huffman encoding loop
   - Sparse hash updates for long matches

**Current Benchmark Results (vs reference zstd):**
- Compression throughput: **78.8%** average
- Decompression throughput: **935%** average (9.4x faster!)
- Binary/High entropy: **140-208%** (we win!)
- Text compression ratio: **-14%** (improved from -25%)

---

## Success Metrics

| Metric | Current | Parity | Target |
|--------|---------|--------|--------|
| 1KB text throughput | 51% | 100% | **150%** |
| 16KB binary throughput | 28% | 100% | **150%** |
| Mixed content (avg) | 35% | 100% | **175%** |
| Float data ratio | 2.5x | 3.0x | **4.0x** |

---

## Code Review Checklist (Per Phase)

- [ ] All TDD tests pass
- [ ] No performance regressions (run full benchmark suite)
- [ ] Memory safety verified (no unsafe without justification)
- [ ] Documentation updated
- [ ] Benchmark results recorded
- [ ] CHANGELOG updated

---

## Benchmark Commands

```bash
# Run performance tests
cargo test -p haagenti-zstd --release -- --ignored perf

# Run full benchmark suite
cargo bench -p haagenti-zstd

# Compare against reference zstd
cargo bench -p haagenti-zstd -- --baseline reference

# Profile hot paths
cargo flamegraph -p haagenti-zstd --bench zstd_benchmark
```