succinctly 0.7.0

# Optimisation Techniques

This directory documents optimization techniques used in the succinctly library, organized by category. Each document covers the theory, implementation patterns, prior art, and practical lessons learned.

## Quick Reference

| Category             | Document                                                 | Key Techniques                                  |
|----------------------|----------------------------------------------------------|-------------------------------------------------|
| **Bit-level**        | [bit-manipulation.md](bit-manipulation.md)               | Popcount, CTZ/CLZ, broadword select, PDEP/PEXT  |
| **Lookup tables**    | [lookup-tables.md](lookup-tables.md)                     | Byte tables, nibble tables, state machine tables |
| **SIMD**             | [simd.md](simd.md)                                      | AVX2, AVX-512, SSE4.2, NEON                     |
| **Memory**           | [cache-memory.md](cache-memory.md)                       | Alignment, layout, prefetching, bandwidth        |
| **Data structures**  | [hierarchical-structures.md](hierarchical-structures.md) | Rank directory, select index, RangeMin           |
| **Control flow**     | [branchless.md](branchless.md)                           | CMOV, arithmetic selection, SIMD masking         |
| **Access patterns**  | [access-patterns.md](access-patterns.md)                 | Sequential, strided, random, search algorithms   |
| **Allocation**       | [zero-copy.md](zero-copy.md)                             | Type punning, memory mapping, streaming          |
| **Parallelism**      | [parallel-prefix.md](parallel-prefix.md)                 | Prefix XOR, cumulative index, carry propagation  |
| **Parsing**          | [state-machines.md](state-machines.md)                   | PFSM, fast-path bypass, two-stage pipeline       |
| **Compact encoding** | [end-positions.md](end-positions.md)                     | Bitmap encoding, advance index, sequential cursor|

---

## Performance Impact Summary

### Highest Impact Techniques

| Technique                      | Speedup            | Category         | Document                                                 |
|--------------------------------|--------------------|------------------|----------------------------------------------------------|
| Cumulative Index               | **627x**           | Hierarchical     | [hierarchical-structures.md](hierarchical-structures.md) |
| RangeMin Index                 | **40x**            | Hierarchical     | [hierarchical-structures.md](hierarchical-structures.md) |
| Byte Lookup Tables             | **11x**            | Lookup           | [lookup-tables.md](lookup-tables.md)                     |
| BMI2 PDEP Toggle               | **10x**            | Bit manipulation | [bit-manipulation.md](bit-manipulation.md)               |
| Lightweight DSV Index          | **5-9x**           | Cache/Memory     | [cache-memory.md](cache-memory.md)                       |
| AVX-512 VPOPCNTDQ              | **5.2x**           | SIMD             | [simd.md](simd.md)                                      |
| Exponential Search             | **3.1x**           | Access patterns  | [access-patterns.md](access-patterns.md)                 |
| NEON VMINV L1 Building         | **2.8x**           | SIMD             | [simd.md](simd.md)                                      |
| SSE4.1 PHMINPOSUW L1/L2       | **1-3%** (10M+)    | SIMD             | [simd.md](simd.md)                                      |
| AVX2 JSON Parser               | **1.78x**          | SIMD             | [simd.md](simd.md)                                      |
| PFSM Tables                    | **1.77x**          | State machines   | [state-machines.md](state-machines.md)                   |
| NEON DSV Index                 | **1.8x**           | SIMD             | [simd.md](simd.md)                                      |
| NEON 256B Popcount             | **1.15x**          | SIMD             | [simd.md](simd.md)                                      |
| SIMD Unquoted Structural Skip  | **3-8%**           | SIMD             | [simd.md](simd.md)                                      |
| Lazy line + direct indexing    | **2-6%**           | State machines   | [state-machines.md](state-machines.md)                   |
| P12-A Build Mitigation         | **11-85%** (build) | Compact encoding | [end-positions.md](end-positions.md)                     |
| O1 Sequential Cursor           | **3-13%** (query)  | Compact encoding | [end-positions.md](end-positions.md)                     |
| O2 Gap-Skip rank1              | **2-6%** (query)   | Compact encoding | [end-positions.md](end-positions.md)                     |
| jq Lazy String Slicing         | **8.5%** (vs regression), **2%** (vs baseline) | Control flow | src/jq/eval.rs |

### Notable Failures (Instructive)

| Technique                   | Penalty        | Reason                                       | Document                                       |
|-----------------------------|----------------|----------------------------------------------|-------------------------------------------------|
| BMI2 PDEP BitWriter         | **-71%**       | Wrong use case (consecutive bits)            | [bit-manipulation.md](bit-manipulation.md)     |
| NEON PFSM Shuffle           | **-47%**       | Serial dependency overhead                   | [state-machines.md](state-machines.md)         |
| BMI1 Mask Iteration         | **-26%**       | Optimised <1% of runtime                     | [simd.md](simd.md)                            |
| NEON Batched Popcount       | **-25%**       | Prefix sum inherently sequential             | [parallel-prefix.md](parallel-prefix.md)       |
| AVX-512 JSON Parser         | **-10%**       | Memory-bound, not compute-bound              | [simd.md](simd.md)                            |
| AVX-512 YAML Parser         | **-7%**        | Memory bandwidth bottleneck + benchmark flaw | [simd.md](simd.md)                            |
| SIMD Lookahead Quote Skip   | **-2 to -6%** | Short strings, SIMD overhead dominates       | [state-machines.md](state-machines.md)         |
| jq `#[inline(always)]` on generics | **Hangs tests** (60s vs 2s) | Code explosion in 18K-line generic-heavy file | src/jq/eval.rs |
| jq `#[cold]` on error paths | **-0.3%**      | Compiler already optimizes cold paths         | src/jq/eval.rs |
| jq `#[inline]` hint         | **-1.3%**      | Compiler's decisions were already optimal      | src/jq/eval.rs |

---

## Decision Framework

### When to Use Each Technique

```
Is the operation compute-bound or memory-bound?
├── Compute-bound → Consider wider SIMD (AVX-512)
└── Memory-bound → Stick with AVX2/NEON

Is the bottleneck a hot loop?
├── Yes → Profile which operation dominates
│   ├── Branch-heavy → Use lookup tables or branchless
│   ├── Many small allocations → Use buffer reuse or arenas
│   └── Random access → Consider sorting or cumulative index
└── No → Focus elsewhere

Is data access sequential or random?
├── Sequential → Hardware prefetcher handles it
│   └── Consider exponential search for queries
└── Random → This is expensive
    ├── Can you sort indices first?
    ├── Can you use a smaller data structure?
    └── Can you precompute (cumulative index)?

How many states in your FSM?
├── 2-4 states → Pack into single table lookup
├── 5-16 states → Standard table-driven
└── Many states → Consider minimisation
```

---

## Key Principles

### 1. Measure First

> "Premature optimization is the root of all evil" - Knuth

Always profile before optimising. The bottleneck is often not where you expect.

### 2. Simpler Is Often Faster

The DSV lightweight index (simple array) beat the theoretically optimal 3-level BitVec by 5-9x due to better cache behavior.

### 3. Wider SIMD ≠ Faster

AVX-512 is slower than AVX2 for memory-bound workloads. The memory system can't keep up with wider vectors.

**Evidence:**
- JSON parsing: AVX-512 was 7-17% slower, removed from codebase
- YAML parsing: AVX-512 was 7% slower at realistic 64B chunks
- Pattern: Sequential text parsing saturates RAM bandwidth at AVX2 width already

### 4. Algorithmic > Micro

A 627x speedup from cumulative index (algorithmic) dwarfs any SIMD optimization.

### 5. Profile Against Production

PFSM batched was 40% faster than reference but 25% slower than production. Always benchmark against what you're replacing.

### 6. Benchmark Design Matters

Micro-benchmarks must reflect real usage patterns, not just measure primitive operations in isolation.

**YAML P8 lesson**: Benchmark measured loop iterations (AVX-512: 4 iterations for 256B, AVX2: 8 iterations) and showed "2x speedup" - but real parsing does the same total work for both. The apparent win was measuring *fewer function calls*, not *more work per call*.

**Pattern to avoid**: Comparing different loop structures instead of comparing equivalent work.

### 7. Fast Paths Beat Micro-Optimizations

Simple fast-path checks that avoid work entirely beat trying to make work faster.

**jq lazy slicing example**: Three simple `if` checks to detect identity slices (`[:]`, `[0:]`, full slice after resolution) gave **8.5% speedup** by avoiding Unicode character counting. This beat all attempts at micro-optimization via compiler hints.

**Pattern**: Check for common cases (identity, empty, full) before doing expensive work.

### 8. Trust the Compiler on Generic Code

Rust's default inlining is excellent, especially for generic-heavy code. Forcing inlining with `#[inline(always)]` on generic functions can cause code explosion.

**jq inline(always) failure**: Adding `#[inline(always)]` to 3 generic helper functions in an 18K-line file with 227 generic functions caused tests to hang (60+ seconds instead of 2 seconds). The exponential monomorphization × inlining created pathologically large functions that LLVM struggled to optimize.

**Evidence from attempts**:
- `#[inline(always)]`: 0.9% gain but **hangs tests** ✗
- `#[cold]` only: 0.3% **regression** ✗
- `#[inline]` hint: 1.3% **regression** ✗

**Pattern**: Prefer `#[inline]` (hint) over `#[inline(always)]` (force). Better yet, let the compiler decide. Override only with strong profiling evidence showing the compiler is wrong.

---

## Reading Order

For newcomers to performance optimization:

1. **[cache-memory.md](cache-memory.md)** - Understanding memory hierarchy is fundamental
2. **[access-patterns.md](access-patterns.md)** - How access patterns affect cache
3. **[hierarchical-structures.md](hierarchical-structures.md)** - Data structure design for performance
4. **[bit-manipulation.md](bit-manipulation.md)** - Foundational bit tricks
5. **[lookup-tables.md](lookup-tables.md)** - Trading memory for computation
6. **[branchless.md](branchless.md)** - Eliminating branch mispredictions
7. **[simd.md](simd.md)** - Parallel processing with vectors
8. **[zero-copy.md](zero-copy.md)** - Avoiding unnecessary work
9. **[parallel-prefix.md](parallel-prefix.md)** - Cumulative operations
10. **[state-machines.md](state-machines.md)** - Optimising parsers
11. **[end-positions.md](end-positions.md)** - Bitmap encoding for position data (case study)

---

## References

### Books

- Knuth, D. E. "The Art of Computer Programming" (Volumes 1-4A)
- Warren, H. S. "Hacker's Delight" (2nd ed., 2012)
- Hennessy & Patterson "Computer Architecture: A Quantitative Approach"

### Papers

- Vigna, S. "Broadword Implementation of Rank/Select Queries" (2008)
- Zhou et al. "Space-Efficient, High-Performance Rank & Select" (2013) - Poppy
- Langdale & Lemire "Parsing Gigabytes of JSON per Second" (2019)
- Mytkowicz et al. "Data-Parallel Finite-State Machines" (2014)
- Navarro & Sadakane "Fully Functional Succinct Trees" (2014)

### Online Resources

- Drepper, U. "What Every Programmer Should Know About Memory"
- Fog, A. "Optimizing Software in C++"
- Intel "Intrinsics Guide" - https://www.intel.com/content/www/us/en/docs/intrinsics-guide
- ARM "NEON Intrinsics Reference"
- Anderson, S. "Bit Twiddling Hacks" (Stanford Graphics)

---

## Contributing

When adding new optimization documentation:

1. Create a new `.md` file in this directory
2. Follow the existing structure:
   - Overview table
   - Technique explanations with code examples
   - Usage in succinctly (file locations, results)
   - Key lessons
   - References
3. Update this README with the new document
4. Link to relevant source files using relative paths

---

## See Also

- [history.md](history.md) - Complete record of all optimizations attempted (successes and failures with exact measurements)
- [end-positions.md](end-positions.md) - Detailed case study of three EndPositions designs (Vec, 3-bitmap, 2-bitmap)
- [../../.claude/skills/](../../.claude/skills/) - Claude Code skill files for specific topics