# Performance Optimization Strategies

This document explains the performance optimization techniques used in pixo. These are general principles that apply to any high-performance code, illustrated with concrete examples from this library.

## The Optimization Mindset

Before diving into techniques, it's worth understanding the hierarchy of optimization impact:

```text
┌─────────────────────────────────────────────────────────────────┐
│  1. Algorithm Choice           (10x-1000x improvement possible) │
├─────────────────────────────────────────────────────────────────┤
│  2. Data Structure Selection   (2x-100x improvement possible)   │
├─────────────────────────────────────────────────────────────────┤
│  3. Memory Access Patterns     (2x-10x improvement possible)    │
├─────────────────────────────────────────────────────────────────┤
│  4. Low-Level Optimizations    (1.1x-2x improvement possible)   │
└─────────────────────────────────────────────────────────────────┘
```

Always start at the top. A better algorithm beats a perfectly-optimized bad algorithm every time.

## Core Principle: Do Less Work

The fastest code is code that doesn't run. Every optimization ultimately reduces the amount of work the CPU must do.

### Example: Deferred Operations

The Adler-32 checksum algorithm requires a modulo operation (`% 65521`) to prevent overflow. The naive approach applies it after every byte:

```text
// Slow: modulo per byte
for byte in data {
    s1 = (s1 + byte) % 65521;
    s2 = (s2 + s1) % 65521;
}
```

But modulo is expensive! Our optimization: do the math to find how many bytes we can process before overflow, then only apply modulo at chunk boundaries:

```rust,ignore
// From src/compress/adler32.rs
const NMAX: usize = 5552;  // Max bytes before overflow

for chunk in data.chunks(NMAX) {
    for &b in chunk {
        s1 += b as u32;
        s2 += s1;
    }
    // Modulo once per chunk instead of per byte
    s1 %= MOD_ADLER;
    s2 %= MOD_ADLER;
}
```

**Result**: The naive loop performs two modulo operations per byte, or 11,104 per 5552-byte chunk; the deferred version performs just 2 per chunk, a 5552x reduction in the most expensive operation.
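
Where does 5552 come from? It is the largest chunk length for which the running sums cannot overflow a `u32`, even in the worst case. A quick sketch of that bound (an illustration, not library code):

```rust
fn main() {
    const MOD_ADLER: u64 = 65521;

    // Worst case entering a chunk: s1 and s2 are both MOD_ADLER - 1,
    // and every byte is 0xFF. After n bytes, s2 grows the fastest:
    //   s2 <= (MOD_ADLER - 1) * (n + 1) + 255 * n * (n + 1) / 2
    // and this must stay below 2^32 to avoid overflow.
    let fits = |n: u64| (MOD_ADLER - 1) * (n + 1) + 255 * n * (n + 1) / 2 < (1u64 << 32);

    assert!(fits(5552));   // 5552 bytes are safe...
    assert!(!fits(5553));  // ...but 5553 would overflow
}
```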

### Example: Early Exit

When selecting the best PNG filter, we try all 5 filters and pick the one with the lowest "score" (sum of absolute values). But if we find a perfect score of 0, we can stop immediately:

```rust,ignore
// From src/png/filter.rs
let score = score_filter(&scratch.sub);
if score < best_score {
    best_score = score;
    best_filter = FILTER_SUB;
    if best_score == 0 {  // Can't do better than 0!
        return;
    }
}
```

Similarly, for LZ77 matching, if we find a "good enough" match, we skip expensive lazy matching:

```rust,ignore
// From src/compress/lz77.rs
const GOOD_MATCH_LENGTH: usize = 32;

// Skip lazy matching for long matches - they're already good enough
if self.lazy_matching && length < GOOD_MATCH_LENGTH {
    // ... try next position
}
```

## Lookup Tables: Trading Memory for Speed

Computation takes time. Memory lookups are often faster. When you repeatedly compute the same function on a small domain, precompute all results.

### Example: Length and Distance Code Lookup

DEFLATE needs to convert match lengths (3-258) to symbol codes. The naive approach searches through a table:

```text
// Slow: linear search
fn length_code(length: u16) -> u16 {
    for i in 0..LENGTH_BASE.len() - 1 {
        if length < LENGTH_BASE[i + 1] {
            return 257 + i as u16;
        }
    }
    285  // lengths past the last base map to the final code
}
```

Our optimization: precompute a direct lookup table at compile time:

```rust,ignore
// From src/compress/deflate.rs
const LENGTH_LOOKUP: [(u8, u8); 256] = {
    let mut table = [(0u8, 0u8); 256];
    // ... populate at compile time
    table
};

#[inline]
fn length_code(length: u16) -> (u16, u8, u16) {
    let idx = (length - 3) as usize;
    let (code_offset, extra_bits) = LENGTH_LOOKUP[idx];  // O(1) lookup
    // ...
}
```

**Result**: O(n) search becomes O(1) lookup. For hot paths called millions of times, this matters.

### Example: Bit Reversal Table

DEFLATE requires reversing the bit order of Huffman codes. Computing this involves a loop:

```text
// Slow: bit-by-bit reversal
fn reverse_bits_slow(mut code: u16, length: u8) -> u32 {
    let mut result = 0;
    for _ in 0..length {
        result = (result << 1) | (code & 1);
        code >>= 1;
    }
    result
}
```

Our optimization: precompute all 256 byte reversals, then combine them:

```rust,ignore
// From src/compress/deflate.rs
const REVERSE_BYTE: [u8; 256] = { /* computed at compile time */ };

#[inline]
fn reverse_bits(code: u16, length: u8) -> u32 {
    let low = REVERSE_BYTE[code as u8 as usize] as u16;
    let high = REVERSE_BYTE[(code >> 8) as u8 as usize] as u16;
    let reversed = (low << 8) | high;
    (reversed >> (16 - length)) as u32
}
```

**Result**: 16 loop iterations become 2 table lookups and some bit shifts.
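
The table itself can be generated in a `const` block, so it costs nothing at runtime. A minimal sketch of one way to build it (the actual initializer in pixo may differ):

```rust
const REVERSE_BYTE: [u8; 256] = {
    let mut table = [0u8; 256];
    let mut i = 0;
    while i < 256 {
        // Classic bit-twiddling reversal: swap nibbles, then pairs, then bits
        let mut b = i as u8;
        b = (b >> 4) | (b << 4);
        b = ((b & 0xCC) >> 2) | ((b & 0x33) << 2);
        b = ((b & 0xAA) >> 1) | ((b & 0x55) << 1);
        table[i] = b;
        i += 1;
    }
    table
};
```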

## Integer Arithmetic vs Floating Point

Integer operations are generally faster and more predictable than floating-point. When you don't need the precision of floats, use integers with fixed-point arithmetic.

### Example: Color Space Conversion

RGB to YCbCr conversion uses these floating-point formulas:

```text
Y  = 0.299×R + 0.587×G + 0.114×B
Cb = -0.169×R - 0.331×G + 0.5×B + 128
Cr = 0.5×R - 0.419×G - 0.081×B + 128
```

The naive approach uses `f32` with `round()` and `clamp()`. Our optimization: scale coefficients by 256 and use integer math with bit shifts:

```rust,ignore
// From src/color.rs
#[inline]
pub fn rgb_to_ycbcr(r: u8, g: u8, b: u8) -> (u8, u8, u8) {
    let r = r as i32;
    let g = g as i32;
    let b = b as i32;

    // Fixed-point coefficients (scaled by 256)
    // +128 for rounding before right shift
    let y = (77 * r + 150 * g + 29 * b + 128) >> 8;
    let cb = ((-43 * r - 85 * g + 128 * b + 128) >> 8) + 128;
    let cr = ((128 * r - 107 * g - 21 * b + 128) >> 8) + 128;

    (y as u8, cb as u8, cr as u8)
}
```

**Key insight**: `>> 8` (right shift by 8) is equivalent to `/ 256`, but much faster. We've eliminated all floating-point operations.
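
A quick way to gain confidence in the fixed-point version is to spot-check boundary colors, where the expected output is unambiguous. A minimal sanity check (assuming the `rgb_to_ycbcr` above is in scope):

```rust,ignore
fn main() {
    // Pure white: Y saturates at 255, chroma sits at the neutral point (128)
    assert_eq!(rgb_to_ycbcr(255, 255, 255), (255, 128, 128));
    // Pure black: zero luma, neutral chroma
    assert_eq!(rgb_to_ycbcr(0, 0, 0), (0, 128, 128));
}
```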

## Choosing the Right Data Type

Using smaller data types can improve performance through better cache utilization and SIMD vectorization.

### Memory Hierarchy Matters

```text
┌─────────────────────────────────────────────────────────────┐
│  L1 Cache:   ~1 ns access,   64 KB                          │
├─────────────────────────────────────────────────────────────┤
│  L2 Cache:   ~4 ns access,   256 KB                         │
├─────────────────────────────────────────────────────────────┤
│  L3 Cache:   ~12 ns access,  8 MB                           │
├─────────────────────────────────────────────────────────────┤
│  RAM:        ~100 ns access, GBs                            │
└─────────────────────────────────────────────────────────────┘
```

If your data fits in a smaller cache level, everything runs faster. Using `u8` instead of `u32` means 4x more data fits in cache.

### Example: Hash Table Sizing

In LZ77, we store positions as `i32` (4 bytes) instead of `usize` (8 bytes on 64-bit):

```rust,ignore
// From src/compress/lz77.rs
pub struct Lz77Compressor {
    head: Vec<i32>,  // 4 bytes, not 8
    prev: Vec<i32>,  // 4 bytes, not 8
    // ...
}
```

This halves the memory footprint of these tables, so twice as many entries fit in each cache line. We use `i32` (signed) so we can use `-1` as a sentinel for "no entry", a common pattern.
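
A minimal sketch of the sentinel convention (illustrative, not pixo's actual API):

```rust
const NO_ENTRY: i32 = -1;

/// Returns the most recent position stored for `hash`, if any.
fn head_lookup(head: &[i32], hash: usize) -> Option<usize> {
    match head[hash] {
        NO_ENTRY => None,
        pos => Some(pos as usize), // valid positions are always >= 0
    }
}
```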

## Batching Operations

Processing items one at a time has overhead. Processing them in batches amortizes that overhead.

### Example: Multi-Byte Comparison

When finding LZ77 match lengths, the naive approach compares byte by byte:

```text
// Slow: byte-by-byte
while length < max && data[pos1 + length] == data[pos2 + length] {
    length += 1;
}
```

Our optimization: compare 8 bytes at a time using `u64`:

```rust,ignore
// From src/compress/lz77.rs
// Compare 8 bytes at a time using u64
while length + 8 <= max_len {
    let a = u64::from_ne_bytes(data[pos1+length..pos1+length+8].try_into().unwrap());
    let b = u64::from_ne_bytes(data[pos2+length..pos2+length+8].try_into().unwrap());
    if a != b {
        // Find the first differing byte using trailing zeros
        // (relies on little-endian byte order from `from_ne_bytes`)
        let xor = a ^ b;
        length += (xor.trailing_zeros() / 8) as usize;
        return length;
    }
    length += 8;
}
```

**Key insight**: on little-endian targets, `trailing_zeros()` on the XOR tells us exactly where the first difference is, without a loop.
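
A self-contained version of the idea, with the byte-order assumption made explicit via `from_le_bytes` and a scalar loop for the tail (a sketch, not pixo's exact code):

```rust
fn match_length(data: &[u8], pos1: usize, pos2: usize, max_len: usize) -> usize {
    let mut length = 0;
    // Bulk phase: compare 8 bytes per iteration
    while length + 8 <= max_len {
        let a = u64::from_le_bytes(data[pos1 + length..pos1 + length + 8].try_into().unwrap());
        let b = u64::from_le_bytes(data[pos2 + length..pos2 + length + 8].try_into().unwrap());
        if a != b {
            // With little-endian loads, the low bits of the XOR
            // correspond to the earliest differing byte
            return length + ((a ^ b).trailing_zeros() / 8) as usize;
        }
        length += 8;
    }
    // Tail phase: fewer than 8 bytes remain
    while length < max_len && data[pos1 + length] == data[pos2 + length] {
        length += 1;
    }
    length
}
```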

### Example: Batch Bit Writing

The JPEG bit writer was originally implemented as a bit-by-bit loop. Our optimization processes multiple bits per iteration:

```rust,ignore
// From src/bits.rs (BitWriterMsb)
while remaining > 0 {
    let space = self.bit_position;
    let to_write = remaining.min(space);

    // Write up to 8 bits at once
    let shift = remaining - to_write;
    let mask = (1u32 << to_write) - 1;
    let bits = ((val >> shift) & mask) as u8;

    self.bit_position -= to_write;
    self.current_byte |= bits << self.bit_position;
    remaining -= to_write;
    // ...
}
```

## SIMD: Single Instruction, Multiple Data

SIMD instructions process multiple values simultaneously. A 128-bit register can hold 16 bytes, and one instruction operates on all 16.

```text
┌────────────────────────────────────────────────────────┐
│                   Scalar Addition                       │
│                                                         │
│   a[0] + b[0] = c[0]  (1 operation)                    │
│   a[1] + b[1] = c[1]  (1 operation)                    │
│   a[2] + b[2] = c[2]  (1 operation)                    │
│   ...                                                   │
│   a[15] + b[15] = c[15]  (1 operation)                 │
│                                                         │
│   Total: 16 operations                                  │
└────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────┐
│                   SIMD Addition                         │
│                                                         │
│   ┌─────────────────────────────────────────────────┐  │
│   │ a[0] a[1] a[2] ... a[15] │  (128-bit register)  │  │
│   └─────────────────────────────────────────────────┘  │
│                      +                                  │
│   ┌─────────────────────────────────────────────────┐  │
│   │ b[0] b[1] b[2] ... b[15] │  (128-bit register)  │  │
│   └─────────────────────────────────────────────────┘  │
│                      =                                  │
│   ┌─────────────────────────────────────────────────┐  │
│   │ c[0] c[1] c[2] ... c[15] │  (128-bit register)  │  │
│   └─────────────────────────────────────────────────┘  │
│                                                         │
│   Total: 1 operation (16x speedup potential)           │
└────────────────────────────────────────────────────────┘
```

### Example: SIMD Adler-32

The standard Adler-32 loop processes one byte at a time. With SSSE3, we process 16 bytes at once:

```rust,ignore
// From src/simd/x86_64.rs (simplified)
pub unsafe fn adler32_ssse3(data: &[u8]) -> u32 {
    // Process 16 bytes at a time
    let chunk = _mm_loadu_si128(ptr as *const __m128i);

    // Use SIMD multiply-add for weighted sums
    // (the actual implementation is more complex)
}
```

### Example: SIMD PNG Filter Scoring

Calculating the sum of absolute values for filter selection:

```rust,ignore
// From src/simd/x86_64.rs
pub unsafe fn score_filter_sse2(filtered: &[u8]) -> u64 {
    let mut sum = _mm_setzero_si128();

    // Process 16 bytes at a time
    while remaining >= 16 {
        let chunk = _mm_loadu_si128(ptr as *const __m128i);
        // Sum absolute differences from zero
        let sad = _mm_sad_epu8(chunk, _mm_setzero_si128());
        sum = _mm_add_epi64(sum, sad);
    }
    // ...
}
```

**Key SIMD operations used**:

- `_mm_loadu_si128`: Load 16 bytes into a register
- `_mm_sad_epu8`: Sum of Absolute Differences (perfect for filter scoring)
- `_mm_cmpeq_epi8`: Compare 16 bytes at once (for match length)
- `_mm_movemask_epi8`: Convert comparison results to a bitmask

### Runtime Feature Detection

Different CPUs support different SIMD instruction sets. We detect at runtime:

```rust,ignore
// From src/simd/mod.rs
pub fn adler32(data: &[u8]) -> u32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("ssse3") {
            return unsafe { x86_64::adler32_ssse3(data) };
        }
    }

    // Fallback to scalar implementation
    fallback::adler32(data)
}
```

This pattern ensures the code runs everywhere while taking advantage of faster instructions when available.

## Caching Computed Results

When you compute the same thing repeatedly, cache the result.

### Example: Fixed Huffman Code Caching

DEFLATE's fixed Huffman codes are the same for every compression. Instead of regenerating them each time:

```rust,ignore
// From src/compress/huffman.rs
use std::sync::LazyLock;

static FIXED_LITERAL_CODES: LazyLock<Vec<HuffmanCode>> = LazyLock::new(|| {
    // Generate codes once, on first access
    let mut lengths = vec![0u8; 288];
    // ... set up lengths
    generate_canonical_codes(&lengths)
});

#[inline]
pub fn fixed_literal_codes() -> &'static [HuffmanCode] {
    &FIXED_LITERAL_CODES  // O(1) after first call
}
```

`LazyLock` ensures thread-safe, one-time initialization.

### Example: Reusing Scratch Buffers

When processing PNG rows, we need temporary buffers for each filter type. Instead of allocating new buffers per row:

```rust,ignore
// From src/png/filter.rs
struct AdaptiveScratch {
    none: Vec<u8>,
    sub: Vec<u8>,
    up: Vec<u8>,
    avg: Vec<u8>,
    paeth: Vec<u8>,
}

impl AdaptiveScratch {
    fn clear(&mut self) {
        // Just reset length, keep capacity
        self.none.clear();
        self.sub.clear();
        // ...
    }
}
```

We allocate once and reuse. The vectors keep their capacity between rows, avoiding repeated allocation.
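
The general shape of the pattern, as a minimal sketch with hypothetical names: `clear()` resets the length but keeps the capacity, so only the first iteration pays for allocation.

```rust
fn process_rows(rows: &[&[u8]]) {
    let mut scratch: Vec<u8> = Vec::new();
    for row in rows {
        scratch.clear();                 // length -> 0, capacity unchanged
        scratch.extend_from_slice(row);  // reuses the existing allocation
        // ... filter `scratch` in place
    }
}
```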

## Transform Before Processing

Sometimes the best optimization isn't in the algorithm—it's in reshaping the data before the algorithm runs.

### The Pattern

```text
┌─────────────────────────────────────────────────────────────────┐
│  Input Data          Transform           Optimized Data         │
│  ──────────    →    ───────────    →    ──────────────          │
│  [scattered]        [reorder for        [spatially              │
│                      locality]           coherent]              │
│                                                                 │
│  The same algorithm runs faster on well-structured data.        │
└─────────────────────────────────────────────────────────────────┘
```

### Example: Palette Reordering (Zeng Algorithm)

PNG palette images assign an index (0-255) to each pixel. The naive approach assigns indices in first-occurrence order. But compression works better when similar values are adjacent.

```text
Before: Pixel sequence uses scattered indices
┌─────────────────────────────────────────┐
│  3  7  3  12  7  3  45  12  7  3  ...  │  Indices jump around
└─────────────────────────────────────────┘

After: Reorder palette so spatially adjacent pixels have adjacent indices
┌─────────────────────────────────────────┐
│  1  2  1  3   2  1  4   3   2  1  ...  │  Smooth transitions
└─────────────────────────────────────────┘
```

The Zeng algorithm builds a **co-occurrence matrix** counting which colors appear next to each other, then reorders the palette to minimize index differences between neighbors.

**Result**: 20-40% smaller files for palette images, with zero change to the compression algorithm itself.
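
As a concrete illustration, here is a minimal sketch of the co-occurrence counting step (illustrative only; pixo's implementation differs in its details). `counts[a][b]` records how often palette indices `a` and `b` appear as horizontal neighbors:

```rust
fn cooccurrence(indices: &[u8], width: usize) -> Vec<[u32; 256]> {
    let mut counts = vec![[0u32; 256]; 256];
    for row in indices.chunks_exact(width) {
        for pair in row.windows(2) {
            let (a, b) = (pair[0] as usize, pair[1] as usize);
            counts[a][b] += 1;
            counts[b][a] += 1;
        }
    }
    counts
}
```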

### Example: Progressive JPEG Scan Order

Instead of encoding each 8×8 block completely before moving to the next, progressive JPEG groups coefficients by frequency:

```text
Sequential: Block₁[all 64 coeffs] → Block₂[all 64] → Block₃[all 64] → ...

Progressive: All DC coeffs → All low AC → All high AC
             (similar data grouped together = better compression)
```

Same data, different order — 5-15% smaller files.

### When to Apply This Pattern

Ask: "Are there relationships in my data that the algorithm can't see?"

- **Spatial locality**: Adjacent pixels often have similar values
- **Frequency correlation**: Similar-frequency DCT coefficients have similar distributions
- **Temporal patterns**: Sequential data often has predictable deltas

Transform the data to expose these patterns before compression.

---

## Structure the Search Space

When facing a combinatorial decision problem, don't enumerate all possibilities. Model it as a graph and find the shortest path.

### The Problem: Exponential Choices

Many optimization problems seem to require trying every combination:

```text
Quantization example:
─────────────────────
64 coefficients, each could round up or down
Brute force: 2^64 possibilities = 18 quintillion combinations

LZ77 parsing example:
─────────────────────
At each position: literal OR match of length 3, 4, 5, ... 258
10,000 byte file = astronomical combinations
```

### The Solution: Model as a Graph

Instead of brute force, observe that:

1. **Decisions are sequential** — you make choices one position at a time
2. **Future costs don't depend on how you got here** — only on your current state
3. **You can compute optimal paths efficiently** — using DP or shortest-path algorithms

```text
Trellis Quantization Graph
══════════════════════════

Position 0        Position 1        Position 2
   [4]───────────────[2]───────────────[0]
    │╲               │╲               │
    │ ╲cost          │ ╲              │
   [5]──╲───────────[3]──╲───────────[1]
          ╲               ╲
          [6]─────────────[4]─────────────[2]

Each node = a possible quantized value
Each edge = cost (rate + λ × distortion) to transition
Solution  = shortest path from start to end
```

**Key insight**: You only need to track the _best way to reach each state_, not all possible paths. This reduces 2^N to O(N × S²) where S is states per position.

### Example: Optimal LZ77 Parsing

Greedy LZ77 takes the longest match at each position. But sometimes a shorter match enables a better match later:

```text
Data: "ABCABCABCD"

Greedy:     ABC (match 3) + ABCD (match 4) = suboptimal
Optimal:    ABCABC (match 6) + D (literal) = better

The greedy choice at position 3 blocks a longer match at position 0.
```

The optimal approach:

1. Build a graph: each position has edges for "literal" and all valid match lengths
2. Edge costs = actual bit cost (from Huffman statistics)
3. Find shortest path with forward DP

```rust,ignore
// Forward DP: cost[i] = minimum bits to encode bytes 0..i
let mut cost = vec![u64::MAX; n + 1];
cost[0] = 0;

for i in 0..n {
    if cost[i] == u64::MAX {
        continue; // position i is unreachable
    }

    // Try a literal
    let lit_cost = cost[i] + literal_bits(data[i]);
    if lit_cost < cost[i + 1] {
        cost[i + 1] = lit_cost;
    }

    // Try each match length
    for (len, dist) in find_matches(data, i) {
        let match_cost = cost[i] + match_bits(len, dist);
        if match_cost < cost[i + len] {
            cost[i + len] = match_cost;
        }
    }
}
```

**Result**: 3-8% smaller files by finding globally optimal parse instead of locally greedy.

### When to Apply This Pattern

Ask: "Am I making sequential decisions where each choice affects future options?"

If yes, model it as a graph:

- **Nodes** = states (what information do I need to make the next decision?)
- **Edges** = choices with costs
- **Solution** = shortest path (Viterbi, Dijkstra, or forward DP)

The key properties that make this work:

- **Optimal substructure**: Best solution contains best sub-solutions
- **Limited state space**: You don't need to track everything—just enough to evaluate future costs

---

## Multi-Pass Processing

Sometimes you need information about the whole input to make optimal decisions. The solution: analyze first, then process.

### The Trade-off

```text
┌─────────────────────────────────────────────────────────────────┐
│                    Single-Pass vs Multi-Pass                     │
│                                                                  │
│  Single-pass:  Faster, streaming-friendly, but uses estimates   │
│  Multi-pass:   Slower, requires buffering, but optimal choices  │
│                                                                  │
│  When encoding time matters less than output quality,           │
│  multi-pass wins.                                               │
└─────────────────────────────────────────────────────────────────┘
```

### Example: Optimized Huffman Tables

DEFLATE can use fixed Huffman codes (single-pass) or build optimal codes for the specific data (two-pass):

```text
Pass 1: Count symbol frequencies
        literals: {0: 5, 1: 12, 2: 8, ...}
        distances: {1: 20, 2: 15, ...}

Pass 2: Build optimal Huffman tree from frequencies
        Encode with custom codes
```

**Result**: 2-5% smaller files. The table overhead is ~300 bits; savings are usually much larger.
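
Pass 1 is cheap. As a sketch, counting literal frequencies is a single array walk (simplified: real DEFLATE also counts length and distance symbols):

```rust
fn literal_frequencies(data: &[u8]) -> [u32; 256] {
    let mut freq = [0u32; 256];
    for &byte in data {
        freq[byte as usize] += 1;
    }
    freq
}
```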

### Example: Iterative Refinement

Some problems have circular dependencies—you need to solve A to solve B, but B affects A:

```text
LZ77 + Huffman Chicken-and-Egg:
──────────────────────────────
Best LZ77 parse depends on Huffman bit costs
Huffman bit costs depend on LZ77 symbol frequencies
Which do you compute first?
```

The solution: iterate until convergence.

```text
Iteration 1: Greedy LZ77 → rough frequencies → initial Huffman costs
Iteration 2: Optimal LZ77 with costs → better frequencies → refined costs
Iteration 3: Optimal LZ77 with refined costs → even better frequencies
...
Iteration N: Converged to (near-)optimal solution
```

Zopfli uses 15 iterations by default. Each iteration improves by smaller amounts until the solution stabilizes.

### When to Apply This Pattern

Multi-pass is worth it when:

- **Output size matters more than encoding speed** (storage, bandwidth)
- **The input fits in memory** (can buffer for second pass)
- **Decisions early in the stream affect optimal choices later**

---

## Algorithm Selection: The AAN Fast DCT

Sometimes the biggest wins come from choosing a better algorithm entirely.

### The Problem: DCT is Expensive

The naive 1D 8-point DCT needs 64 multiplications; a fully naive 2D 8×8 would take 4096. Separability drops this to 2×8³ = 1024 multiplications, but that's still costly:

```text
// Naive 1D DCT
for k in 0..8 {
    for n in 0..8 {
        output[k] += input[n] * cos((2n+1) * k * π / 16);
    }
}
```

That's O(n²) — 64 multiplications for 8 points.

### The Solution: AAN Algorithm

The Arai-Agui-Nakajima (AAN) algorithm reduces this to just **5 multiplications and 29 additions**:

```rust,ignore
// From src/jpeg/dct.rs
fn aan_dct_1d(data: &mut [f32; 8]) {
    // Stage 1: Butterfly operations (additions only)
    let tmp0 = data[0] + data[7];
    let tmp7 = data[0] - data[7];
    // ... more additions

    // Rotations (the only multiplications)
    let z1 = (tmp12 + tmp13) * A1;
    let z5 = (tmp10 - tmp12) * A5;
    // ... just 5 multiplications total

    // Apply post-scaling
    for i in 0..8 {
        data[i] *= S[i];
    }
}
```

**Result**: 5 multiplications instead of 64, a 12x reduction in the most expensive operation. (The 8 post-scaling multiplications shown above are typically folded into the JPEG quantization step, so they come for free.)
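
The full 2D transform follows from separability: run the 1D transform over the 8 rows, then over the 8 columns. A sketch, assuming the `aan_dct_1d` above:

```rust,ignore
fn dct_2d(block: &mut [[f32; 8]; 8]) {
    // Rows first...
    for row in block.iter_mut() {
        aan_dct_1d(row);
    }
    // ...then columns (gather, transform, scatter)
    for c in 0..8 {
        let mut col = [0.0f32; 8];
        for r in 0..8 {
            col[r] = block[r][c];
        }
        aan_dct_1d(&mut col);
        for r in 0..8 {
            block[r][c] = col[r];
        }
    }
}
```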

## Parallel Processing

Modern CPUs have multiple cores. Use them!

### Example: Parallel PNG Filter Selection

Each row of a PNG can be filtered independently, because filters read the previous *unfiltered* row rather than another row's filtered output. We parallelize with rayon:

```rust,ignore
// From src/png/filter.rs
#[cfg(feature = "parallel")]
fn apply_filters_parallel(data: &[u8], height: usize, ...) -> Vec<u8> {
    let rows: Vec<Vec<u8>> = (0..height)
        .into_par_iter()  // Parallel iterator!
        .map(|y| {
            // Each row processed independently
            let row = &data[y * row_bytes..(y+1) * row_bytes];
            filter_row(row, prev_row, ...)
        })
        .collect();

    // Combine results
    rows.into_iter().flatten().collect()
}
```

**Key insight**: Compression has inherent parallelism at the row level. The work per row is significant enough to amortize threading overhead.

## Bits, Bytes, and Binary Representation

Understanding binary representation is fundamental to compression optimization.

### Bit Shifting for Division and Multiplication

Bit shifts are much faster than division/multiplication by powers of 2:

```text
x >> n  is equivalent to  x / (2^n)   but faster (for unsigned values)
x << n  is equivalent to  x * (2^n)   but faster
```

We use this extensively for fixed-point arithmetic:

```rust,ignore
// Division by 256 using bit shift
let y = (77 * r + 150 * g + 29 * b + 128) >> 8;
```

### Bit Masking for Modulo

For powers of 2, bitwise AND is faster than modulo:

```text
x & (n-1)  is equivalent to  x % n  when n is a power of 2
```

We use this for hash table indexing:

```rust,ignore
// From src/compress/lz77.rs
const HASH_SIZE: usize = 1 << 15;  // 32768 (power of 2)

fn hash3(data: &[u8], pos: usize) -> usize {
    // ... compute hash
    (hash >> 17) as usize & (HASH_SIZE - 1)  // Fast modulo
}
```

### Finding Differences with XOR

XOR highlights differences between values:

```text
a ^ b = 0  means a == b
a ^ b ≠ 0  means a ≠ b, and the set bits show where they differ
```

We use this for fast string matching:

```rust,ignore
// Find first differing byte in 8-byte comparison
let xor = a ^ b;
if xor != 0 {
    // trailing_zeros tells us the position of first difference
    let diff_pos = (xor.trailing_zeros() / 8) as usize;
}
```

## Summary: The Optimization Checklist

When optimizing, consider these techniques in order of impact:

1. **Choose the right algorithm**

   - AAN DCT instead of naive DCT
   - Hash tables instead of linear search

2. **Transform data before processing**

   - Reorder for spatial locality (palette sorting)
   - Group similar data together (progressive encoding)

3. **Structure the search space**

   - Model decisions as a graph
   - Use DP/Viterbi instead of brute-force enumeration
   - Find optimal paths, not all paths

4. **Consider multi-pass when output quality matters**

   - Analyze first, then optimize (optimized Huffman)
   - Iterate when there are circular dependencies

5. **Use appropriate data structures**

   - Lookup tables for repeated computations
   - Power-of-2 sizes for fast modulo

6. **Reduce work**

   - Defer expensive operations (batch modulo)
   - Early exit when possible
   - Skip unnecessary work (good match threshold)

7. **Use efficient numeric representations**

   - Integer arithmetic instead of floating-point
   - Smaller types when range permits
   - Fixed-point for fractional values

8. **Batch operations**

   - Process 8 bytes at once with u64
   - Process 16 bytes at once with SIMD
   - Write multiple bits at once

9. **Cache results**

   - Precompute lookup tables
   - Lazy-initialize constants
   - Reuse scratch buffers

10. **Parallelize**

    - Use rayon for row-level parallelism
    - Runtime feature detection for SIMD

## Next Steps

These optimization patterns are applied throughout pixo. To see them in action:

- **Data transformation**: `src/png/mod.rs` (Zeng palette sorting)
- **Graph-based optimization**: `src/jpeg/trellis.rs` (trellis quantization)
- **Optimal parsing**: `src/compress/lz77.rs` (forward DP for LZ77)
- **Multi-pass processing**: `src/compress/deflate.rs` (iterative refinement)
- **Lookup tables**: `src/compress/deflate.rs` (LENGTH_LOOKUP, DISTANCE_LOOKUP_SMALL)
- **Fixed-point arithmetic**: `src/color.rs` (rgb_to_ycbcr)
- **SIMD implementations**: `src/simd/x86_64.rs`
- **Fast DCT algorithm**: `src/jpeg/dct.rs`
- **Batch bit operations**: `src/bits.rs`

Continue to [Compression Evolution](./compression-evolution.md) to explore the history and philosophy behind modern compression techniques.

---

## References

- Agner Fog's optimization manuals: <https://www.agner.org/optimize/>
- Intel Intrinsics Guide: <https://www.intel.com/content/www/us/en/docs/intrinsics-guide/>
- Arai, Y., Agui, T., and Nakajima, M. (1988). "A Fast DCT-SQ Scheme for Images". Transactions of the IEICE.
- Drepper, U. (2007). "What Every Programmer Should Know About Memory".