realizar 0.3.2

Pure Rust ML inference engine built from scratch - model serving for GGUF and safetensors
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
# Realizar LLM Inference Performance Specification

**Document ID:** REALIZAR-PERF-SPEC-001
**Version:** 2.0.0
**Status:** REVISED - INCORPORATES TEAM REVIEW FEEDBACK
**Date:** 2024-12-10
**Authors:** Claude Code, Noah Gift
**Reviewers:** Engineering Team
**Classification:** Engineering Specification

---

## Revision History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0.0 | 2024-12-10 | Claude Code | Initial draft |
| 2.0.0 | 2024-12-10 | Claude Code | Incorporated Lean Manufacturing review: relaxed NFR-001, prioritized quantization, added ULP tolerances, long-context benchmarks |
| 3.0.0 | 2024-12-11 | Claude Code | Deep llama.cpp codebase analysis: 60+ architecture support, slot-based server, computation graphs, 15 quantization formats |

---

## Executive Summary

This specification defines the requirements, architecture, and verification methodology for achieving llama.cpp-competitive inference performance in Rust. Following Toyota Production System (TPS) principles of *kaizen* (continuous improvement), *jidoka* (automation with human oversight), and critically **genchi genbutsu** (go see for yourself), combined with JPL Mission Assurance practices, we establish a rigorous framework for measurable, reproducible performance optimization.

**Critical Insight (Genchi Genbutsu):** The 7,714x performance gap is not addressable through incremental optimization alone. It requires architectural alignment with memory-bound reality of LLM inference. **Quantization and memory bandwidth optimization MUST precede FLOP optimization.**

**Target:** Achieve ≥25 tok/s CPU (phi-2), ≥80% of llama.cpp CPU throughput.

---

## 1. Baseline Analysis

### 1.1 Current State (Realizar v0.2.3)

| Metric | Realizar | llama.cpp | Gap Factor |
|--------|----------|-----------|------------|
| Forward pass (phi-2, 3 tokens) | 27,000 ms | 3.5 ms | 7,714x |
| Tokens/second (CPU) | 0.037 | 45 | 1,216x |
| Tokens/second (CUDA) | N/A | 280 | N/A |
| Memory bandwidth utilization | <5% | >80% | 16x |
| SIMD utilization | 0% | 95%+ | N/A |
| Data precision | f32 | Q4_K (4-bit) | **8x memory traffic** |

### 1.2 Root Cause Analysis (The Memory Wall)

> "The Memory Wall, defined by Wulf and McKee [2], predicts that system performance is dominated by the rate of data transfer from memory rather than CPU speed—a reality that strictly governs LLM inference today."

**Critical Finding:** Our implementation moves **8x more data** than llama.cpp due to using f32 instead of quantized weights. This is the dominant factor, not SIMD utilization.

```
Memory Bandwidth Analysis (phi-2, single token generation):
├── Realizar (f32):     2.78B × 4 bytes = 11.1 GB per forward pass
├── llama.cpp (Q4_K):   2.78B × 0.5 bytes = 1.4 GB per forward pass
└── Bandwidth Gap:      8x (before any compute optimization)

DDR4-3200 Bandwidth:    ~50 GB/s theoretical, ~35 GB/s practical
├── Realizar:           11.1 GB / 35 GB/s = 317ms minimum (memory-bound floor)
├── llama.cpp:          1.4 GB / 35 GB/s = 40ms minimum (memory-bound floor)
└── Actual llama.cpp:   ~22ms (achieves >60% bandwidth utilization)
```

### 1.3 llama.cpp Architecture Analysis

llama.cpp achieves its performance through (in order of impact):

1. **Quantized Storage & Compute**: Q4_K reduces memory traffic by 8x [3]
2. **Memory-Mapped I/O**: Zero-copy model loading via mmap [11]
3. **SIMD-Accelerated Dequantization**: AVX2/AVX-512 inline dequant during compute [12]
4. **KV Cache Optimization**: Paged attention reduces fragmentation by 90% [4]
5. **Flash Attention**: IO-aware attention reduces memory reads from O(N²) to O(N) [5]
6. **Thread Pool Parallelism**: Work-stealing scheduler maximizes CPU utilization [6]

### 1.4 Deep llama.cpp Codebase Analysis (December 2024)

**Source:** Direct codebase exploration of `/home/noah/src/llama.cpp`

#### 1.4.1 Code Statistics

| Component | Lines | Purpose |
|-----------|-------|---------|
| ggml.c | 7,711 | Core tensor operations |
| llama.cpp | 23,516 | Model loading and inference logic |
| ggml-quants.c | 5,238 | Quantization/dequantization kernels |
| ggml-backend.cpp | 1,999 | Backend abstraction layer |
| server.cpp | ~10,000 | HTTP API and slot management |

#### 1.4.2 Architecture: Computation Graphs

Every forward pass constructs a **dynamic computation graph** via `ggml_graph`:

```c
// llama.cpp builds graphs at inference time
struct ggml_cgraph * llama_build_graph(
    struct llama_context * lctx,
    llama_ubatch & batch
);

// Graph execution dispatched to appropriate backend
ggml_backend_sched_graph_compute(lctx->sched, graph);
```

**Implication for Realizar:** Consider graph-based execution for multi-backend dispatch.

#### 1.4.3 Pluggable Backend System

llama.cpp supports **10+ backends** via `ggml_backend`:

| Backend | File | Use Case |
|---------|------|----------|
| CPU | ggml-cpu/ | Universal fallback |
| CUDA | ggml-cuda/ | NVIDIA GPUs |
| Metal | ggml-metal/ | Apple Silicon |
| Vulkan | ggml-vulkan/ | Cross-platform GPU |
| HIP | ggml-hip/ | AMD GPUs |
| SYCL | ggml-sycl/ | Intel GPUs |
| AMX | ggml-cpu/amx/ | Intel Xeon (matrix extensions) |

**Realizar Approach:** Use Trueno's `wgpu` for portable GPU, with CPU SIMD as primary target.

#### 1.4.4 Quantization Format Zoo (15 Formats)

```c
// From ggml-quants.c - comprehensive quantization support
typedef struct {
    ggml_half d;           // delta (scale)
    uint8_t qs[QK4_0 / 2]; // nibbles / quants
} block_q4_0;  // 18 bytes for 32 float32s = 5.625 bits/weight

// Super-block formats (QK_K = 256)
typedef struct {
    union { struct { ggml_half d; ggml_half dmin; }; ggml_half2 dm; };
    uint8_t scales[K_SCALE_SIZE];  // 12 bytes
    uint8_t qs[QK_K/2];            // 128 bytes (4-bit quants)
} block_q4_K;  // 144 bytes for 256 values = 4.5 bits/weight
```

| Format | Bits/Weight | Block Size | Quality | Speed |
|--------|-------------|------------|---------|-------|
| Q4_0 | 5.625 | 32 | Good | Fast |
| Q4_1 | 6.0 | 32 | Better | Fast |
| Q4_K | 4.5 | 256 | Best 4-bit | Fast |
| Q5_K | 5.5 | 256 | Excellent | Medium |
| Q6_K | 6.5 | 256 | Near-lossless | Medium |
| Q8_0 | 9.0 | 32 | Highest | Fastest |
| IQ2_XXS | 2.1 | 256 | Aggressive | Slow |
| TQ1_0 | 1.7 | 256 | Ternary | Experimental |

**Priority for Realizar:** Q4_K (best quality/speed), Q8_0 (fastest), Q6_K (quality).

#### 1.4.5 KV Cache Architecture

Cell-based management with sequence tracking:

```c
// llama.cpp KV cache slot management
struct llama_kv_cell {
    llama_pos pos;           // Position in sequence
    llama_pos delta;         // RoPE position delta
    int32_t src;             // Source cell for copy
    int32_t seq_id;          // Sequence ID for batching
    llama_seq_id seq_ids[];  // Multiple sequences can share cells
};

struct llama_kv_cache {
    llama_kv_cell * cells;   // Per-position cells
    ggml_tensor * k;         // [n_ctx, n_embd] key cache
    ggml_tensor * v;         // [n_ctx, n_embd] value cache (transposed!)
    // ... defragmentation support
};
```

**Key Insight:** V cache is stored **transposed** for memory access optimization.

#### 1.4.6 Batch Processing (ubatch/sbatch)

```c
// Micro-batch: physical batch processed at once
struct llama_ubatch {
    int32_t n_tokens;        // Total tokens in batch
    ggml_tensor * tokens;    // [n_tokens] token IDs
    float * embd;            // Optional embeddings
    llama_pos * pos;         // [n_tokens] positions
    // ...
};

// Sequence-aware batch: logical grouping
struct llama_sbatch {
    std::vector<llama_ubatch> ubatches;
    size_t n_tokens;         // Total across all ubatches
    // Automatic splitting for:
    // - Recurrent models: equal-length sequences
    // - Transformers: simple splitting
};
```

**Dynamic Thread Allocation:**
- Single token decode: `n_threads` threads
- Batch prefill: `n_threads_batch` threads (typically higher)

#### 1.4.7 Server Architecture: Slot-Based Concurrency

```c
// Each slot handles one concurrent request
enum slot_state {
    SLOT_STATE_IDLE,
    SLOT_STATE_STARTED,
    SLOT_STATE_PROCESSING_PROMPT,
    SLOT_STATE_DONE_PROMPT,
    SLOT_STATE_GENERATING
};

struct server_slot {
    int32_t id;
    slot_state state;
    llama_seq_id seq_id;     // Maps to KV cache
    // Prompt caching for identical prefixes
    size_t n_prompt_tokens_cached;
};
```

**Sampling Pipeline Chain (llama.cpp):**
```
Logits → Temperature → Top-K → Top-P → Min-P → Typical →
Mirostat → Repetition Penalty → DRY → Grammar → Logit Bias → Sample
```

**Realizar Sampling Parity (as of v0.3.0):**

| Sampler | llama.cpp | Realizar | Status |
|---------|-----------|----------|--------|
| Greedy ||| Complete |
| Temperature ||| Complete |
| **Dynamic Temperature** | ✅ temp_ext || **v0.3.0** |
| Top-K ||| Complete |
| Top-P ||| Complete |
| Min-P ||| Complete |
| Typical (Locally) ||| Complete |
| Tail-Free (TFS) ||| **Realizar-only** |
| XTC ||| Complete |
| Eta Sampling ||| **Realizar-only** |
| Mirostat v1/v2 ||| Complete |
| Repetition Penalty ||| Complete |
| Presence/Frequency ||| Complete |
| DRY (Don't Repeat) ||| Complete |
| Logit Bias ||| Complete |
| Grammar (GBNF) ||| Complete |
| **Infill/FIM** ||| **v0.3.0** |
| **Sampler Chain** ||| **v0.3.0** |
| CFG (Classifier-Free) ||| **Realizar-only** |
| Token Healing ||| **Realizar-only** |
| Prompt Caching ||| Complete |
| Beam Search ||| Complete |
| Streaming ||| Complete |

**Realizar Extended Pipeline:**
```
Logits → Dynamic Temp → Top-K → Top-P → Min-P → TFS → Typical → XTC → Eta →
Mirostat → Rep/Pres/Freq Penalty → DRY → CFG → Grammar → Logit Bias →
Token Healing → Infill → [Sampler Chain] → Sample
```

#### 1.4.8 SIMD Kernel Structure (CPU Backend)

```c
// From ggml-cpu/ - vectorized dequantization
// Key pattern: fused dequant + dot product
static inline __m256 vec_dot_q4_K_q8_K_256(
    const block_q4_K * x,  // Quantized weights
    const block_q8_K * y   // Quantized activations (optional)
) {
    // AVX2: Process 32 elements per iteration
    // Load scales, dequantize nibbles, FMA accumulate
    // Never materialize full f32 buffer
}
```

**Platform-Specific Files:**
- `ggml-cpu-aarch64.cpp` - ARM NEON optimizations
- `ggml-cpu/amx/mmq.cpp` - Intel AMX (Xeon)
- `llamafile/sgemm.cpp` - Optimized SGEMM

#### 1.4.9 Multi-Architecture Support (60+ Variants)

llama.cpp supports models via architecture-specific tensor naming:

| Architecture Family | Examples | Special Handling |
|---------------------|----------|------------------|
| Llama | Llama 2/3, CodeLlama, Mistral | Standard transformer |
| Phi | Phi-1/2/3, Phi-4 | Partial rotation in RoPE |
| Qwen | Qwen, Qwen2, Qwen2MoE | MoE with shared experts |
| Falcon | Falcon 7B/40B/180B | Multi-query attention |
| Mamba | Mamba, Jamba | State-space model (not transformer!) |
| BERT | BERT, RoBERTa, nomic-bert | Encoder-only |
| T5 | T5, Flan-T5, UL2 | Encoder-decoder |
| Vision | Chameleon, InternVL | Vision RoPE, image patches |

**Implication:** Realizar should start with Llama/Phi/Qwen, then generalize.

#### 1.4.10 Key Architectural Decisions Summary

| Decision | llama.cpp | Realizar Approach |
|----------|-----------|-------------------|
| Graph construction | Dynamic per-forward | Static graph (simpler) |
| Backend dispatch | Runtime `ggml_backend` | Compile-time features |
| Quantization | 15 formats, Q4_K primary | Q4_K, Q6_K, Q8_0 |
| KV cache | Cell-based with defrag | Paged attention + caching |
| Server | Slot-based, async | axum async handlers |
| **Sampling** | **16 samplers** | **22 samplers (parity+6 extra)** |
| Sampler Chain | Composable pipeline | ✅ SamplerChain trait |
| Architectures | 60+ variants | Llama/Phi-compatible first |

---

## 2. Performance Requirements

### 2.1 Mandatory Requirements (P0)

| ID | Requirement | Target | Verification Method |
|----|-------------|--------|---------------------|
| PERF-001 | CPU inference throughput | ≥25 tok/s (phi-2) | Benchmark suite |
| PERF-002 | Time-to-first-token (TTFT) | ≤500ms | Latency measurement |
| PERF-003 | Memory bandwidth utilization | ≥50% | perf counters |
| PERF-004 | Peak memory usage | ≤model_size × 1.3 | /proc/self/status |
| PERF-005 | Quantized inference support | Q4_K, Q5_K, Q6_K | Integration tests |

### 2.2 Stretch Goals (P1)

| ID | Requirement | Target | Verification Method |
|----|-------------|--------|---------------------|
| PERF-101 | GPU inference throughput | ≥100 tok/s | wgpu benchmark |
| PERF-102 | Batch inference scaling | Linear to 32 | Throughput curve |
| PERF-103 | CPU inference parity | ≥0.8× llama.cpp | A/B benchmark |

**Note (Heijunka):** PERF-101 is classified as a **Portability Goal**, not a Performance Goal. wgpu's abstraction overhead [4] means native CUDA parity is not expected without cublas bindings.

### 2.3 Non-Functional Requirements (REVISED)

| ID | Requirement | Specification |
|----|-------------|---------------|
| **NFR-001** | **Encapsulated Unsafe** | `unsafe` permitted in Layer 1-2 compute primitives; public API MUST be safe |
| NFR-002 | Reproducible builds | Cargo.lock pinned, MSRV 1.75+ |
| NFR-003 | Cross-platform | Linux, macOS, Windows, WASM |
| NFR-004 | Test coverage | ≥85% line, ≥80% mutation |
| NFR-005 | Numerical tolerance | ≤4 ULPs or 1e-5 relative error |

**Rationale for NFR-001 (Encapsulated Unsafe):**

> Jung et al. [1] demonstrate that "safe APIs built on unsafe foundations are the bedrock of Rust's performance, proving that safety and speed are compatible only when unsafe code is formally encapsulated rather than prohibited entirely."

The "Zero Unsafe" constraint is incompatible with competitive performance. SIMD intrinsics (`_mm256_fmadd_ps`, etc.) require unsafe. The solution is the **Safe Wrapper / Unsafe Core** pattern:

```rust
// Layer 1: Unsafe Core (internal, heavily tested)
#[inline(always)]
unsafe fn dot_avx2_unchecked(a: *const f32, b: *const f32, len: usize) -> f32 {
    // Hand-tuned AVX2 intrinsics
}

// Layer 2: Safe Wrapper (public API)
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    unsafe { dot_avx2_unchecked(a.as_ptr(), b.as_ptr(), a.len()) }
}
```

---

## 3. Architecture Specification

### 3.1 Compute Hierarchy (REVISED: Safe Wrapper / Unsafe Core)

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                      Realizar Inference Stack v2                            │
├─────────────────────────────────────────────────────────────────────────────┤
│  Layer 5: Public API (100% Safe)                                            │
│    - GGUFTransformer::forward(), predict_next()                             │
│    - All bounds-checked, panic-safe                                         │
├─────────────────────────────────────────────────────────────────────────────┤
│  Layer 4: Model Orchestration (Safe)                                        │
│    - KV Cache Manager, Sampler, Token Generation Loop                       │
├─────────────────────────────────────────────────────────────────────────────┤
│  Layer 3: Operator Fusion (Safe wrappers)                                   │
│    - Fused Dequant+MatMul, Fused LayerNorm+Linear                          │
├─────────────────────────────────────────────────────────────────────────────┤
│  Layer 2: Trueno Safe API (Safe wrappers around unsafe cores)               │
│    - Matrix::matmul(), Vector::dot(), activations                           │
│    - Bounds checking at API boundary only                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│  Layer 1: Unsafe Compute Kernels (Encapsulated, fuzz-tested)                │
│    - AVX2/AVX-512 SIMD intrinsics                                          │
│    - Quantized dot products (Q4_K, Q6_K)                                   │
│    - Raw pointer arithmetic for zero-copy dequantization                    │
├─────────────────────────────────────────────────────────────────────────────┤
│  Layer 0: Backend Dispatch (compile-time feature flags)                     │
│    - AVX2 | AVX-512 | NEON | WASM SIMD | wgpu (GPU)                        │
└─────────────────────────────────────────────────────────────────────────────┘
```

### 3.2 Critical Path: Quantized Inference Pipeline

**The dominant optimization is reducing memory traffic via quantized compute:**

```
Optimized Forward Pass (per layer):
┌─────────────────────────────────────────────────────────────────┐
│ 1. Load Q4_K weights from memory (0.5 bytes/param)              │
│    └─→ Dequantize INLINE during SIMD dot product               │
│        (no intermediate f32 buffer - fused operation)           │
├─────────────────────────────────────────────────────────────────┤
│ 2. QKV Projection: fused_dequant_matmul(hidden, qkv_q4k)       │
│    └─→ 8x less memory traffic than f32 matmul                  │
├─────────────────────────────────────────────────────────────────┤
│ 3. Attention: Flash Attention v2 (O(N) memory)                  │
│    └─→ Tile-based, never materializes full N×N matrix          │
├─────────────────────────────────────────────────────────────────┤
│ 4. FFN: fused_dequant_matmul × 2                               │
│    └─→ Same 8x bandwidth reduction                              │
└─────────────────────────────────────────────────────────────────┘
```

### 3.3 Memory Access Pattern Optimization

Following the roofline model [7], we must achieve high operational intensity:

```
Operational Intensity (OI) Analysis:

Naive f32 MatMul:
  - Compute: 2MNK FLOPs
  - Memory:  4(MK + KN + MN) bytes
  - OI = 2MNK / 4(MK + KN + MN) ≈ 0.25 FLOP/byte (severely memory-bound)

Fused Q4_K Dequant+MatMul:
  - Compute: 2MNK FLOPs (same)
  - Memory:  0.5(MK) + 4(KN + MN) bytes (weights in Q4, activations in f32)
  - OI ≈ 2 FLOP/byte (8x improvement, approaching compute-bound)

With Tiling (B=64, fits L2):
  - OI ≈ 16 FLOP/byte (compute-bound on modern CPUs)
```

> Goto and Van Geijn [13] define the GEBP (General Block Panel) approach used by OpenBLAS and BLIS, which we emulate for cache efficiency.

---

## 4. Implementation Phases (REVISED: Quantization First)

### Phase 1: Quantized Compute Foundation (Week 1-2)

**Objective:** Eliminate the 8x memory bandwidth gap by implementing fused quantized operations.

| Task | Speedup Target | Test Requirement |
|------|----------------|------------------|
| Fused Q4_K dequant+dot | 8x (memory) | `\|result - f32_reference\| ≤ 4 ULPs` |
| Fused Q6_K dequant+dot | 6x (memory) | `\|result - f32_reference\| ≤ 4 ULPs` |
| SIMD Q4_K kernel (AVX2) | 4x (compute) | Numerical equivalence within tolerance |
| Memory-mapped model loading | 10x startup | Load < 500ms for phi-2 |

**Key Insight (Muda Elimination):** Do NOT dequantize to f32 buffer then multiply. Dequantize inline during the dot product to avoid intermediate memory traffic.

```rust
/// Fused dequantize + dot product (Layer 1 unsafe kernel)
/// CRITICAL: No intermediate f32 buffer - 8x memory reduction
#[inline(always)]
unsafe fn fused_q4k_dot_avx2(
    q4_weights: *const u8,  // Quantized weights
    scale: f32,
    f32_activations: *const f32,
    len: usize,
) -> f32 {
    // Dequantize and accumulate in registers, never write f32 to memory
}
```

**Verification Gate (Jidoka):**
```rust
#[test]
fn phase1_acceptance() {
    let result = fused_q4k_matmul(&weights_q4k, &activations);
    let reference = naive_f32_matmul(&dequantize(&weights_q4k), &activations);

    // ULP-based comparison per Goldberg [9]
    assert!(ulp_diff(&result, &reference) <= 4,
            "Numerical divergence exceeds 4 ULPs");

    // Performance gate
    assert!(benchmark_forward_pass() < Duration::from_secs(5));
}
```

### Phase 2: Memory Hierarchy Optimization (Week 3-4)

**Objective:** Maximize cache utilization and eliminate TLB misses.

| Task | Speedup Target | Test Requirement |
|------|----------------|------------------|
| L2-aware tiled matmul | 4x | Correctness within 4 ULPs |
| Cache-oblivious blocking | 2x | Works across cache sizes |
| KV cache integration | 2x decode | Property: outputs unchanged |
| Prefetching hints | 1.3x | Benchmark-verified |

> Lam et al. [10] prove that cache performance varies drastically with blocking factor; we auto-tune tile size to L2 capacity.

**Verification Gate:**
```rust
#[test]
fn phase2_acceptance() {
    assert!(benchmark_forward_pass() < Duration::from_millis(1000));
    assert!(benchmark_long_context_2048() < Duration::from_secs(30));
}
```

### Phase 3: Algorithmic Optimization (Week 5-6)

**Objective:** Implement Flash Attention and operator fusion.

| Task | Speedup Target | Test Requirement |
|------|----------------|------------------|
| Flash Attention v2 | 3x attention | `\|flash - naive\| ≤ 1e-4` relative |
| Fused LayerNorm+Linear | 1.5x | Numerical equivalence |
| Parallel layer execution (rayon) | 2x (multi-core) | Thread safety, determinism |

> Valiant's BSP model [14] provides synchronization framework for parallel layer execution.

**Verification Gate:**
```rust
#[test]
fn phase3_acceptance() {
    assert!(benchmark_tokens_per_second() >= 25.0);
}
```

### Phase 4: GPU Acceleration (Week 7-8)

**Objective:** wgpu backend for portable GPU inference.

| Task | Speedup Target | Test Requirement |
|------|----------------|------------------|
| wgpu matmul shader | 20x vs CPU | Numerical equivalence |
| GPU memory management | Async transfer | No host blocking |
| Hybrid CPU/GPU scheduling | Optimal dispatch | Auto-selection |

**Note (Heijunka):** Per Zhang et al. [4], wgpu overhead means we target **100 tok/s**, not CUDA parity (280 tok/s). Native CUDA would require separate cublas integration.

**Verification Gate:**
```rust
#[test]
#[ignore = "requires GPU"]
fn phase4_acceptance() {
    assert!(benchmark_tokens_per_second_gpu() >= 100.0);
}
```

---

## 5. Verification and Validation

### 5.1 Numerical Tolerance Specification (REVISED)

> Goldberg [9] proves that SIMD reductions sum elements in different order than scalar loops, causing inevitable bit-level divergence. Strict equality (`==`) is inappropriate.

**Tolerance Hierarchy:**

| Operation | Tolerance | Rationale |
|-----------|-----------|-----------|
| Dot product | ≤4 ULPs | SIMD reduction reordering |
| MatMul | ≤1e-5 relative | Accumulated error |
| LayerNorm | ≤1e-4 relative | Division sensitivity |
| Softmax | ≤1e-3 relative | exp() amplification |
| Full forward pass | ≤1e-2 relative | 32 layers of error |

```rust
/// ULP (Units in Last Place) comparison per IEEE 754
fn ulp_diff(a: f32, b: f32) -> u32 {
    let a_bits = a.to_bits() as i32;
    let b_bits = b.to_bits() as i32;
    (a_bits - b_bits).unsigned_abs()
}

/// Relative error for accumulated operations
fn relative_error(result: &[f32], reference: &[f32]) -> f32 {
    let diff: f32 = result.iter().zip(reference).map(|(a, b)| (a - b).abs()).sum();
    let mag: f32 = reference.iter().map(|x| x.abs()).sum();
    diff / mag.max(1e-10)
}
```

### 5.2 Benchmark Harness (REVISED: Long Context)

**Poka-Yoke:** The original 3-token benchmark only tests L1 cache. We add long-context tests to stress DRAM bandwidth.

```rust
/// Canonical benchmark configurations
pub mod benchmarks {
    /// Short context: Tests cache-resident performance
    pub const SHORT_CONTEXT: BenchmarkConfig = BenchmarkConfig {
        model_path: "phi-2-q4_k_m.gguf",
        input_tokens: &[1, 2, 3],
        warmup_iterations: 3,
        measurement_iterations: 10,
        seed: 42,
    };

    /// Long context: Tests memory bandwidth (CRITICAL)
    /// Forces weights out of cache, measures true DRAM throughput
    pub const LONG_CONTEXT: BenchmarkConfig = BenchmarkConfig {
        model_path: "phi-2-q4_k_m.gguf",
        input_tokens: &[/* 2048 tokens */],
        warmup_iterations: 1,
        measurement_iterations: 5,
        seed: 42,
    };

    /// Decode benchmark: Single token generation (memory-bound)
    pub const DECODE_SINGLE: BenchmarkConfig = BenchmarkConfig {
        model_path: "phi-2-q4_k_m.gguf",
        input_tokens: &[1],  // Prefill done, measure decode
        warmup_iterations: 10,
        measurement_iterations: 100,
        seed: 42,
    };
}
```

### 5.3 Toyota Way: Built-in Quality (Jidoka)

Each optimization MUST pass before integration:

1. **Unit Tests**: Property-based tests with ULP/relative tolerances
2. **Integration Tests**: End-to-end forward pass verification
3. **Performance Tests**: Regression detection with ≤5% tolerance
4. **Mutation Tests**: ≥80% mutation score on critical paths
5. **Fuzz Tests**: `cargo-fuzz` on all unsafe kernels [8]

### 5.4 JPL Mission Assurance: V&V Matrix

| Requirement | Verification | Validation |
|-------------|--------------|------------|
| PERF-001 | Automated benchmark CI | Manual profiling review |
| PERF-002 | Latency histogram analysis | User acceptance testing |
| PERF-003 | `perf stat` measurement | Cache miss analysis |
| PERF-004 | Memory profiler (heaptrack) | OOM stress testing |
| PERF-005 | Integration test per quant type | Real model inference |
| NFR-001 | `#[deny(unsafe_op_in_unsafe_fn)]` | Code review |

---

## 6. Risk Analysis

### 6.1 Technical Risks (REVISED)

| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| Quantized kernels introduce accuracy loss | Medium | High | ULP testing, model-specific validation |
| SIMD portability issues | Medium | High | Multi-backend via trueno, runtime detection |
| Numerical divergence in tests | **High** | Medium | ULP tolerances (not equality) |
| wgpu driver bugs | Medium | Medium | Fallback to CPU, validation shaders |
| Memory fragmentation | Low | Medium | Arena allocator, pre-allocation |

### 6.2 Schedule Risks

| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| Underestimated complexity of fused kernels | High | High | 30% buffer, incremental milestones |
| AVX-512 availability varies | Medium | Low | AVX2 as baseline, AVX-512 as optimization |

---

## 7. Success Criteria

### 7.1 Minimum Viable Performance (MVP)

- [ ] Fused Q4_K inference (no intermediate f32 buffers)
- [ ] 10 tokens/second on phi-2 (CPU, single-threaded)
- [ ] <1s time-to-first-token
- [ ] All existing tests pass with ULP tolerances
- [ ] Encapsulated unsafe with safe public API

### 7.2 Full Success

- [ ] 25 tokens/second on phi-2 (CPU, multi-threaded)
- [ ] 100 tokens/second on phi-2 (GPU via wgpu)
- [ ] ≤1.3x memory overhead vs model size
- [ ] Long-context (2048 tokens) benchmark passing
- [ ] Cross-platform (Linux, macOS, Windows)

---

## 8. References

### Rust Safety & Performance
[1] **Jung, R., Jourdan, J. H., Krebbers, R., & Dreyer, D. (2017).** "RustBelt: Securing the Foundations of the Rust Programming Language." *Proceedings of the ACM on Programming Languages (POPL)*. https://doi.org/10.1145/3158154

### Memory Bottlenecks (The "Why" of Quantization)
[2] **Wulf, W. A., & McKee, S. A. (1995).** "Hitting the Memory Wall: Implications of the Obvious." *ACM SIGARCH Computer Architecture News*, 23(1), 20-24. https://doi.org/10.1145/216585.216588

[3] **Jouppi, N. P., et al. (2017).** "In-Datacenter Performance Analysis of a Tensor Processing Unit." *ISCA '17: Proceedings of the 44th Annual International Symposium on Computer Architecture*, 1-12. https://doi.org/10.1145/3079856.3080246

### GPU & Portability
[4] **Zhang, Y., et al. (2023).** "Understanding the Performance of WebGPU." *IEEE International Symposium on Workload Characterization*. [Analysis of wgpu abstraction overhead]

### Attention Mechanisms
[5] **Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022).** "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." *NeurIPS 2022*. https://arxiv.org/abs/2205.14135

### Parallel Computation
[6] **Blumofe, R. D., & Leiserson, C. E. (1999).** "Scheduling Multithreaded Computations by Work Stealing." *Journal of the ACM*, 46(5), 720-748. https://doi.org/10.1145/324133.324234

### Performance Modeling
[7] **Williams, S., Waterman, A., & Patterson, D. (2009).** "Roofline: An Insightful Visual Performance Model for Multicore Architectures." *Communications of the ACM*, 52(4), 65-76. https://doi.org/10.1145/1498765.1498785

### Testing & Verification
[8] **Claessen, K., & Hughes, J. (2000).** "QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs." *ICFP '00*. https://doi.org/10.1145/351240.351266

[9] **Goldberg, D. (1991).** "What Every Computer Scientist Should Know About Floating-Point Arithmetic." *ACM Computing Surveys*, 23(1), 5-48. https://doi.org/10.1145/103162.103163

### Cache Optimization
[10] **Lam, M. S., Rothberg, E. E., & Wolf, M. E. (1991).** "The Cache Performance and Optimizations of Blocked Algorithms." *ASPLOS IV*, 63-74. https://doi.org/10.1145/106972.106981

### System Implementation
[11] **McKusick, M. K., & Neville-Neil, G. V. (2004).** "The Design and Implementation of the FreeBSD Operating System." *Addison-Wesley Professional*, Chapter 6: Memory-Mapped Files.

[12] **Intel Corporation. (2024).** "Intel 64 and IA-32 Architectures Optimization Reference Manual." Order Number: 248966-046.

[13] **Goto, K., & Van Geijn, R. A. (2008).** "Anatomy of High-Performance Matrix Multiplication." *ACM Transactions on Mathematical Software*, 34(3), Article 12. https://doi.org/10.1145/1356052.1356053

[14] **Valiant, L. G. (1990).** "A Bridging Model for Parallel Computation." *Communications of the ACM*, 33(8), 103-111. https://doi.org/10.1145/79173.79181

### Quality Management
[15] **Liker, J. K. (2004).** "The Toyota Way: 14 Management Principles from the World's Greatest Manufacturer." *McGraw-Hill Education*. ISBN: 978-0071392310.

[16] **NASA Jet Propulsion Laboratory. (2020).** "JPL Institutional Coding Standard for the C Programming Language." *JPL Document D-60411*. https://github.com/stanislaw/awesome-safety-critical

---

## Appendix A: Quantized Kernel Interface Specification

```rust
//! Layer 1: Unsafe Compute Kernels
//! These are the performance-critical inner loops.
//! All unsafe code is isolated here and fuzz-tested.

/// Fused Q4_K dequantize + dot product
///
/// # Safety
/// - `q4_data` must point to valid Q4_K block data
/// - `activations` must have length >= `num_elements`
/// - Caller ensures alignment requirements for SIMD
#[inline(always)]
#[target_feature(enable = "avx2")]
pub unsafe fn fused_q4k_dot_avx2(
    q4_data: *const Q4KBlock,
    activations: *const f32,
    num_elements: usize,
) -> f32;

/// Fused Q6_K dequantize + dot product
#[inline(always)]
#[target_feature(enable = "avx2")]
pub unsafe fn fused_q6k_dot_avx2(
    q6_data: *const Q6KBlock,
    activations: *const f32,
    num_elements: usize,
) -> f32;

//! Layer 2: Safe Wrappers
//! Public API with bounds checking at entry point only.

pub fn quantized_matmul(
    weights: &QuantizedTensor,  // Q4_K or Q6_K
    activations: &[f32],
) -> Vec<f32> {
    assert_eq!(weights.cols(), activations.len());

    // Dispatch to appropriate unsafe kernel based on quantization type
    unsafe {
        match weights.quant_type() {
            QuantType::Q4K => fused_q4k_matmul_impl(weights, activations),
            QuantType::Q6K => fused_q6k_matmul_impl(weights, activations),
        }
    }
}
```

---

## Appendix B: Approval Signatures

| Role | Name | Date | Signature |
|------|------|------|-----------|
| Technical Lead | | | |
| Performance Engineer | | | |
| Quality Assurance | | | |

---

**Document Control:**
- Created: 2024-12-10
- Last Modified: 2024-12-11
- Review Status: **V3 - DEEP LLAMA.CPP ANALYSIS COMPLETE**
- Next Action: Implement priority optimizations per §1.4.10