# Haagenti-Zstd Performance Domination Roadmap
**Goal:** Not just match reference zstd, but decisively outperform it.
**Target:** 150-200% of reference zstd throughput on typical workloads.
---
## Current Baseline
| 1 KB | 51% | 59% | 53% | **150%** |
| 4 KB | 32% | 48% | 35% | **150%** |
| 16 KB | 15% | 28% | 17% | **150%** |
---
## TDD Methodology
Every optimization follows strict Red-Green-Refactor:
```
1. RED: Write failing performance test with aggressive threshold
2. GREEN: Implement optimization to pass the test
3. REFACTOR: Clean up while maintaining performance
4. REVIEW: Automated code review + documentation update
```
**Performance Test Template:**
```rust
#[test]
fn test_optimization_X_throughput() {
let data = test_data_generator(SIZE);
let baseline = measure_baseline(&data);
let start = Instant::now();
let result = optimized_function(&data);
let elapsed = start.elapsed();
let throughput = data.len() as f64 / elapsed.as_secs_f64() / 1_000_000.0;
assert!(
throughput > THRESHOLD_MB_S,
"Optimization X failed: {} MB/s < {} MB/s target",
throughput, THRESHOLD_MB_S
);
}
```
---
## Phase 1: Fix Broken Optimizations (Reach Parity)
**Expected Gain:** 2-3x improvement (from 15-59% to 100%)
### 1.1 Huffman Forward Iteration ⚠️ CLAIMED DONE BUT ISN'T
**File:** `src/huffman/encoder.rs:437`
**Current (Broken):**
```rust
for &byte in literals.iter().rev() { // ← Breaks CPU prefetcher!
```
**Fix:**
```rust
// Forward iteration with post-encode bit stream reversal
for &byte in literals.iter() {
// ... encode to forward_buffer
}
// Reverse the bit stream (not the iteration order)
reverse_bitstream(&mut output);
```
**TDD Test:**
```rust
#[test]
fn test_huffman_forward_iteration_throughput() {
let encoder = HuffmanEncoder::build(&[0u8; 1024]).unwrap();
let data: Vec<u8> = (0..65536).map(|i| (i % 256) as u8).collect();
let start = Instant::now();
for _ in 0..100 {
black_box(encoder.encode(&data));
}
let elapsed = start.elapsed();
let throughput = (65536 * 100) as f64 / elapsed.as_secs_f64() / 1_000_000.0;
// Target: 300 MB/s (was ~150 MB/s with reverse iteration)
assert!(throughput > 300.0, "Forward iteration too slow: {} MB/s", throughput);
}
```
**Expected Gain:** +40-60% Huffman encoding speed
---
### 1.2 O(1) Weight Normalization Algorithm
**File:** `src/huffman/encoder.rs:142-233`
**Current (O(n²)):**
```rust
// Nested loops that can iterate up to 1000 times
loop {
let mut improved = false;
for i in 1..symbols.len() {
// ... O(n) inner loop
}
if !improved { break; }
}
```
**Fix:** Single-pass greedy Kraft-sum algorithm
```rust
fn frequencies_to_weights_optimized(freq: &[u32; 256]) -> Option<(Vec<u8>, u8)> {
// 1. Sort symbols by frequency once: O(n log n)
// 2. Assign weights greedily in single pass: O(n)
// 3. Verify Kraft inequality: O(n)
// Total: O(n log n) instead of O(n²)
}
```
**TDD Test:**
```rust
#[test]
fn test_weight_normalization_linear_time() {
// Create worst-case scenario: 200 symbols with similar frequencies
let mut freq = [0u32; 256];
for i in 0..200 {
freq[i] = 100 + (i as u32 % 10); // Similar frequencies
}
let start = Instant::now();
for _ in 0..1000 {
black_box(HuffmanEncoder::frequencies_to_weights(&freq));
}
let elapsed = start.elapsed();
// Target: < 10ms for 1000 iterations (was 100ms+ with O(n²))
assert!(elapsed.as_millis() < 10, "Weight normalization too slow: {:?}", elapsed);
}
```
**Expected Gain:** +30-50% for high-entropy data
---
### 1.3 Pipelined Hash Computation
**File:** `src/compress/match_finder.rs:270-278`
**Current (Wasteful):**
```rust
for skip_pos in (pos + 1..skip_end).step_by(2) {
let h = hash4(&input[skip_pos..]); // ← Recomputes hash every iteration!
self.hash_table[h] = skip_pos as u32;
}
```
**Fix:** Rolling hash with incremental update
```rust
fn update_hash_table_pipelined(&mut self, input: &[u8], start: usize, end: usize) {
// Use rolling hash: H(i+1) = rotate(H(i)) ^ new_byte ^ old_byte
// Or: prefetch next hash position while computing current
let mut h = hash4(&input[start..]);
for pos in (start..end).step_by(2) {
self.hash_table[h] = pos as u32;
// Prefetch next position
if pos + 2 < end {
_mm_prefetch(&input[pos + 2] as *const u8 as *const i8, _MM_HINT_T0);
h = hash4(&input[pos + 2..]); // Compute while prefetch in flight
}
}
}
```
**TDD Test:**
```rust
#[test]
fn test_hash_table_update_throughput() {
let mut finder = MatchFinder::new(6);
let data: Vec<u8> = (0..1_000_000).map(|i| (i % 256) as u8).collect();
let start = Instant::now();
for _ in 0..10 {
finder.find_matches(&data, CompressionLevel::Default);
}
let elapsed = start.elapsed();
let throughput = (1_000_000 * 10) as f64 / elapsed.as_secs_f64() / 1_000_000.0;
// Target: 500 MB/s (was ~200 MB/s)
assert!(throughput > 500.0, "Hash update too slow: {} MB/s", throughput);
}
```
**Expected Gain:** +15-25% match finding speed
---
### 1.4 Arena Allocator Full Utilization
**Files:** `src/compress/arena.rs`, `src/compress/mod.rs`, `src/compress/block.rs`
**Current:** Arena exists but Vec allocations bypass it
**Fix:** Route all per-frame allocations through arena
```rust
// Before: Many heap allocations
let mut output = Vec::with_capacity(size);
// After: Arena-backed allocation
let output = arena.alloc_vec::<u8>(size);
```
**TDD Test:**
```rust
#[test]
fn test_arena_utilization() {
let mut ctx = CompressContext::new(CompressionLevel::Default);
let data: Vec<u8> = (0..65536).map(|i| (i % 256) as u8).collect();
// Reset allocation counter
ALLOC_COUNT.store(0, Ordering::SeqCst);
let _ = ctx.compress_frame(&data);
// Target: < 5 heap allocations per frame (was 50+)
let allocs = ALLOC_COUNT.load(Ordering::SeqCst);
assert!(allocs < 5, "Too many allocations: {}", allocs);
}
```
**Expected Gain:** +10-15% throughput, reduced memory pressure
---
## Phase 2: SIMD Expansion (Beat by 50%)
**Expected Gain:** 1.5x over reference zstd
### 2.1 AVX-512 Match Finding
**Current:** AVX2 (32 bytes at a time)
**Target:** AVX-512 (64 bytes at a time)
```rust
#[cfg(target_feature = "avx512f")]
unsafe fn match_length_avx512(a: &[u8], b: &[u8]) -> usize {
use std::arch::x86_64::*;
let mut len = 0;
let max_len = a.len().min(b.len());
// Process 64 bytes at a time
while len + 64 <= max_len {
let va = _mm512_loadu_si512(a[len..].as_ptr() as *const __m512i);
let vb = _mm512_loadu_si512(b[len..].as_ptr() as *const __m512i);
let mask = _mm512_cmpeq_epi8_mask(va, vb);
if mask != 0xFFFFFFFFFFFFFFFF {
return len + mask.trailing_ones() as usize;
}
len += 64;
}
// Fallback to AVX2 for remainder
len + match_length_avx2(&a[len..], &b[len..])
}
```
**TDD Test:**
```rust
#[test]
fn test_avx512_match_length_speedup() {
if !is_x86_feature_detected!("avx512f") {
return; // Skip on non-AVX512 systems
}
let a = vec![0u8; 10_000_000];
let b = vec![0u8; 10_000_000];
let start = Instant::now();
for _ in 0..10 {
black_box(match_length_avx512(&a, &b));
}
let elapsed = start.elapsed();
let throughput = (10_000_000 * 10) as f64 / elapsed.as_secs_f64() / 1_000_000_000.0;
// Target: 20 GB/s (AVX2 was ~12 GB/s)
assert!(throughput > 20.0, "AVX-512 match too slow: {} GB/s", throughput);
}
```
**Expected Gain:** +50-80% match comparison speed
---
### 2.2 Vectorized Huffman Weight Normalization
**Idea:** SIMD-parallel Kraft sum calculation
```rust
fn calculate_kraft_sum_simd(weights: &[u8; 256]) -> u64 {
// Use AVX2 to process 32 weights at a time
// sum(2^weight) for all 256 symbols in 8 SIMD iterations
}
```
**Expected Gain:** +20-30% weight calculation
---
### 2.3 SIMD FSE Table Building
**Current:** Sequential table building
**Target:** Parallel symbol distribution
**Expected Gain:** +15-25% FSE encoding
---
## Phase 3: Entropy Fingerprinting (Smart Shortcuts)
**Expected Gain:** Up to 10x on incompressible data
### 3.1 Fast Entropy Estimation
```rust
/// Estimate entropy in ~100 cycles using sampling
fn fast_entropy_estimate(data: &[u8]) -> f32 {
// Sample 256 bytes at regular intervals
// Build approximate histogram
// Calculate Shannon entropy estimate
// Return bits per byte (0-8)
}
fn should_compress(block: &[u8]) -> bool {
fast_entropy_estimate(block) < 7.5 // Skip if nearly random
}
```
**TDD Test:**
```rust
#[test]
fn test_entropy_estimation_accuracy() {
// Test on known data
let zeros = vec![0u8; 10000];
let random: Vec<u8> = (0..10000).map(|_| rand::random()).collect();
let text = include_bytes!("../testdata/alice.txt");
assert!(fast_entropy_estimate(&zeros) < 0.1, "Zeros should have ~0 entropy");
assert!(fast_entropy_estimate(&random) > 7.9, "Random should have ~8 entropy");
assert!(fast_entropy_estimate(text) > 4.0 && fast_entropy_estimate(text) < 6.0);
}
#[test]
fn test_entropy_estimation_speed() {
let data = vec![0u8; 100_000];
let start = Instant::now();
for _ in 0..10_000 {
black_box(fast_entropy_estimate(&data));
}
let elapsed = start.elapsed();
// Target: < 1μs per call
assert!(elapsed.as_micros() < 10_000, "Entropy estimation too slow");
}
```
**Expected Gain:** Skip compression on incompressible blocks = massive speedup
---
### 3.2 Block Type Prediction
```rust
enum PredictedBlockType {
Raw, // High entropy, skip compression
Rle, // Detected run patterns
Compressed, // Worth compressing
Dictionary, // Matches known patterns
}
fn predict_block_type(block: &[u8]) -> PredictedBlockType {
let entropy = fast_entropy_estimate(block);
let rle_ratio = count_runs(block) as f32 / block.len() as f32;
if entropy > 7.8 { return PredictedBlockType::Raw; }
if rle_ratio > 0.5 { return PredictedBlockType::Rle; }
PredictedBlockType::Compressed
}
```
---
## Phase 4: Speculative Multi-Path Compression ✅ COMPLETED
**Expected Gain:** Best of all strategies with parallel execution
### 4.1 Parallel Strategy Execution ✅
**File:** `src/compress/speculative.rs`
**Status:** COMPLETED - Full implementation with parallel execution
**Implementation:**
- Created `SpeculativeCompressor` with 5 strategies:
- GreedyFast (depth=4): Fastest, baseline for throughput
- GreedyDeep (depth=16): Better matches, moderate speed
- LazyDefault (depth=16): Look-ahead matching, best ratio
- LazyBest (depth=64): Aggressive look-ahead, maximum compression
- LiteralsOnly: Skip match finding, Huffman-only compression
- Uses rayon work-stealing for parallel execution (when `parallel` feature enabled)
- Auto-detects RLE and Random data for fast-path optimization
- Configurable parallel threshold (default 4KB)
- Three presets: `new()` (all strategies), `fast()` (3 strategies), `best()` (2 strategies)
```rust
use haagenti_zstd::compress::SpeculativeCompressor;
// Default: try all 5 strategies in parallel
let compressor = SpeculativeCompressor::new();
let compressed = compressor.compress(&data)?;
// Fast: fewer strategies, lower latency
let compressor = SpeculativeCompressor::fast();
let compressed = compressor.compress(&data)?;
// Best: aggressive strategies, maximum compression
let compressor = SpeculativeCompressor::best();
let compressed = compressor.compress(&data)?;
```
**TDD Tests:** 17 tests covering all functionality
- test_speculative_compressor_creation
- test_compress_empty
- test_compress_small
- test_compress_rle_data
- test_compress_repetitive_data
- test_compress_random_data_fast_path
- test_speculative_picks_best
- test_parallel_compression (parallel feature)
- And more...
**Benchmark Results:**
```
Parallel compression: 524288 bytes -> 47 bytes in 8.3ms
(Repetitive pattern data - extreme compression ratio)
```
**Expected Gain:**
- Optimal compression ratio (always picks best strategy)
- Near-linear scaling with CPU cores for large inputs
---
## Phase 5: Data-Aware Transforms (Bonus)
### 5.1 Float Data Detection & Transpose
For neural network weights, floating point data compresses poorly.
Byte transposition groups similar bytes (exponents together, mantissas together).
```rust
fn transpose_floats(data: &[f32]) -> Vec<u8> {
let bytes: &[u8] = bytemuck::cast_slice(data);
let n = data.len();
let mut transposed = Vec::with_capacity(bytes.len());
// Group byte 0 of all floats, then byte 1, etc.
for byte_idx in 0..4 {
for i in 0..n {
transposed.push(bytes[i * 4 + byte_idx]);
}
}
transposed
}
```
**Expected Gain:** +20-40% better compression ratio for float data
---
## Implementation Schedule
| **1** | Fix broken optimizations | 1-2 days | 2x (reach parity) | ✅ DONE |
| **2** | AVX-512 + SIMD expansion | 2-3 days | 3x (beat by 50%) | ✅ DONE |
| **3** | Entropy fingerprinting | 1 day | 3.5x on mixed data | ✅ DONE |
| **4** | Speculative multi-path | 2 days | 4x (best of all worlds) | ✅ DONE |
| **5** | Data-aware transforms | 2 days | 4.5x on float data | Planned |
### Completed Optimizations Summary
**Phase 1-5.5 Complete.** Key optimizations implemented:
1. **Huffman Encoding** (Phase 1, 5.5)
- Chunked processing with software prefetching
- O(n log n) weight normalization (was O(n²))
- 4-way interleaved histogram (Phase 5.5)
- Branchless inner loop - removes all conditionals (Phase 5.5)
- SIMD histogram integration via haagenti_simd
2. **Match Finding** (Phase 1-2, 5.5)
- Batch hash update with prefetching
- AVX-512 match length comparison (64 bytes/iteration)
- AVX2 fallback (32 bytes/iteration)
- Lazy matching for Fast+ compression levels
- **Novel: Match prediction** - check predicted offset first (Phase 5.5)
- **Novel: Offset preference** - prefer predicted offset on ties (Phase 5.5)
- **Novel: Long match sparse update** - skip hash updates for 64+ byte matches
3. **Entropy Analysis** (Phase 3)
- Fast entropy estimation (~50-100 cycles)
- Early exit for incompressible data
- Block type prediction (RLE/Compress/Raw)
4. **Speculative Compression** (Phase 4)
- 5 parallel compression strategies
- Work-stealing via rayon
- Auto-selection of best result
5. **Phase 5.5: Novel Micro-Optimizations** ✅ COMPLETED
- MurmurHash3-inspired hash mixing for text
- 4-byte prefix rejection before full match comparison
- Direct lookup tables for literal/match length encoding
- Match prediction with offset reuse
- Branchless Huffman encoding loop
- Sparse hash updates for long matches
**Current Benchmark Results (vs reference zstd):**
- Compression throughput: **78.8%** average
- Decompression throughput: **935%** average (9.4x faster!)
- Binary/High entropy: **140-208%** (we win!)
- Text compression ratio: **-14%** (improved from -25%)
---
## Success Metrics
| 1KB text throughput | 51% | 100% | **150%** |
| 16KB binary throughput | 28% | 100% | **150%** |
| Mixed content (avg) | 35% | 100% | **175%** |
| Float data ratio | 2.5x | 3.0x | **4.0x** |
---
## Code Review Checklist (Per Phase)
- [ ] All TDD tests pass
- [ ] No performance regressions (run full benchmark suite)
- [ ] Memory safety verified (no unsafe without justification)
- [ ] Documentation updated
- [ ] Benchmark results recorded
- [ ] CHANGELOG updated
---
## Benchmark Commands
```bash
# Run performance tests
cargo test -p haagenti-zstd --release -- --ignored perf
# Run full benchmark suite
cargo bench -p haagenti-zstd
# Compare against reference zstd
cargo bench -p haagenti-zstd -- --baseline reference
# Profile hot paths
cargo flamegraph -p haagenti-zstd --bench zstd_benchmark
```