aprender-serve 0.34.0

Pure Rust ML inference engine built from scratch - model serving for GGUF and safetensors
# Claude Code Development Guide for Realizar

## Project Overview

**Realizar** - Pure Rust ML inference engine built from scratch for GGUF and Safetensors model serving.

- **Philosophy:** Total control, zero compromise - build everything ourselves except HTTP infrastructure
- **Architecture:** Model parsers → Inference engine → Trueno compute primitives
- **Methodology:** EXTREME TDD with mutation testing, property-based testing, 85%+ coverage
- **Quality Target:** TDG Score ≥95.0/100 (A+)

## CRITICAL: Contract-First Design

**NEVER write code before writing a provable contract.**

All code changes MUST have a corresponding contract (YAML in ../provable-contracts/contracts/<project>/ or .pmat-work/<TICKET>/contract.json) BEFORE implementation. This is enforced by `pmat comply` CB-1400.

- Use `pmat comply check` to verify contract coverage
- Minimum verification level: L1 (recommended L3+)
- See docs/agent-instructions/provable-contract-first-agents.md for the full workflow

## Critical Dependencies - ALWAYS USE LATEST

### Trueno (SIMD/GPU Compute Primitives)

**IMPORTANT:** Trueno is actively developed and frequently updated. **ALWAYS check for the latest version.**

```bash
# Check trueno version before any development work
cd ../trueno && git pull && grep "^version" Cargo.toml
```

**Current Integration:**
- Path: `../trueno`
- Features: `["gpu"]` for GPU acceleration
- Status: v0.4.2 (2025-11-21) - SIMD attribute compliance, PMAT integration, zero warnings

**Update Workflow:**
1. Pull latest trueno: `cd ../trueno && git pull`
2. Check version: `grep "^version" Cargo.toml`
3. Update realizar's Cargo.toml with new version
4. Test integration: `cargo test --lib`
5. Commit with clear message about trueno version bump

**Trueno Capabilities:**
- Vector operations: add, sub, mul, div, dot, sum, norm_l1, norm_l2
- SIMD backends: AVX2, SSE2, NEON, WASM, Scalar
- GPU backend: wgpu-based (optional feature)
- Activation functions: ReLU, sigmoid, GELU, swish, mish, selu, hardswish
- Performance: 2-11x SIMD speedups on compute-bound operations

**Trueno GPU Kernels (trueno-gpu crate):**
- `GemmKernel` - Matrix multiplication (naive, tiled, tensor core)
- `AttentionKernel` - FlashAttention-style tiled attention with online softmax
- `SoftmaxKernel` - Numerically stable softmax with warp shuffle
- `LayerNormKernel` - Fused layer normalization
- `QuantizeKernel` - Q4_K dequantization fused with matmul
- `Q5KKernel` - Q5_K dequantization
- `Q6KKernel` - Q6_K dequantization

## ⚠️ CRITICAL ANTI-PATTERN: NO HAND-ROLLED PTX

**NEVER write PTX strings directly in realizar code.**

### Why This Is Forbidden

1. **Trueno exists** - The `trueno-gpu` crate has tested, optimized kernels
2. **PTX is fragile** - Syntax errors, wrong compute capabilities, shared memory limits
3. **Trueno has trueno-explain** - Static analysis tool to find PTX bugs
4. **Maintenance burden** - Hand-rolled PTX must be updated for each GPU generation
5. **Testing** - Trueno kernels have property tests; hand-rolled PTX does not

### The Anti-Pattern (DO NOT DO THIS)

```rust
// ❌ WRONG - Hand-rolled PTX string in realizar
fn generate_attention_ptx(seq_len: u32, head_dim: u32) -> String {
    format!(r"
.version 8.0
.target sm_89
.address_size 64
.visible .entry attention(...) {{
    // 200 lines of hand-written PTX
}}
")
}
```

### The Correct Pattern (DO THIS)

```rust
// ✅ CORRECT - Use trueno-gpu kernels
use trueno_gpu::kernels::{AttentionKernel, Kernel};

let kernel = AttentionKernel::new(seq_len, head_dim)
    .with_causal()
    .with_tiles(64, 64);
let ptx = kernel.emit_ptx();
```

### If Trueno Is Missing a Kernel

1. **Add it to trueno-gpu** - Push to `../trueno`, not realizar
2. **Use the PTX builder API** - `PtxKernel::new().param().build(|ctx| {...})`
3. **Add property tests** - Ensure kernel works for all valid dimensions
4. **Use trueno-explain** - Run `trueno-explain bugs --kernel <name>` to find issues

## ⚠️ CRITICAL: LAYOUT-002 Row-Major Mandate

**Realizar is EXCLUSIVELY row-major. All data from GGUF is transposed by aprender at import.**

### Why This Matters

GGUF uses column-major layout (GGML convention). Realizar's fused Q4K/Q6K kernels expect row-major layout. Using the wrong layout produces **garbage output**.

```
GGUF (column-major)     Realizar (row-major)
─────────────────────   ─────────────────────
W[i,j] at j*rows + i    W[i,j] at i*cols + j

Same bytes → WRONG interpretation → "olumbia+lsi nunca/localENTS" (garbage)
```
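
The two index formulas above can be sketched for a plain `f32` matrix as follows. This is only an illustration: the function names are hypothetical, and the real conversion in aprender operates on quantized blocks, which this sketch does not attempt to model.

```rust
/// Index of W[i, j] in a row-major buffer with `cols` columns.
pub fn row_major_index(i: usize, j: usize, cols: usize) -> usize {
    i * cols + j
}

/// Index of W[i, j] in a column-major buffer with `rows` rows.
pub fn col_major_index(i: usize, j: usize, rows: usize) -> usize {
    j * rows + i
}

/// Copy a column-major `rows x cols` matrix into a new row-major buffer.
/// (Hypothetical helper for unquantized f32 data only.)
pub fn col_major_to_row_major(src: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    assert_eq!(src.len(), rows * cols, "buffer length must equal rows * cols");
    let mut dst = vec![0.0; rows * cols];
    for i in 0..rows {
        for j in 0..cols {
            dst[row_major_index(i, j, cols)] = src[col_major_index(i, j, rows)];
        }
    }
    dst
}
```

For a 2 x 3 matrix, `W[0,1]` sits at offset 2 in the column-major buffer but at offset 1 in the row-major one, so reading one layout with the other's formula silently returns a different element.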

### The Architecture

```
┌─────────────────────────────────────────────────────────┐
│              REALIZAR DOMAIN (Row-Major Only)            │
│                                                          │
│  APR file ──► GGUF loader ──► fused_q4k_dot ──► output  │
│  (already row-major,         (expects row-major)         │
│   transposed by aprender)                                │
└─────────────────────────────────────────────────────────┘
```

**Realizar never handles layout conversion.** Aprender's converter (`src/format/converter/write.rs`) transposes GGUF data during import. By the time data reaches realizar, it's already row-major.

### FORBIDDEN: Trueno Column-Major Kernels

```rust
// ❌ NEVER USE - These expect column-major layout
use trueno::backends::q4k::matmul_q4k_f32_colmajor;
use trueno::backends::q6k::matmul_q6k_f32_colmajor;

// ✅ ALWAYS USE - Row-major kernels in realizar
use crate::quantize::fused_q4k_parallel_matvec;
use crate::quantize::fused_q6k_parallel_matvec;
```

### Key Implementation Files

| File | Responsibility |
|------|----------------|
| `src/quantize/fused_k.rs` | Row-major Q4K/Q6K matmul kernels |
| `src/quantize/parallel_k.rs` | Parallel row-major kernels (ONE WAY ONLY) |
| `src/gguf/loader.rs` | Loads APR (pre-transposed by aprender) |

### DELETED: Legacy Aliases (2026-02-03)

These confusing aliases were **purged** to enforce ONE WAY ONLY:
- ~~`fused_q6k_colmajor_matvec`~~ → Use `fused_q6k_parallel_matvec`
- ~~`fused_q4k_auto_matvec_into`~~ → Use `fused_q4k_parallel_matvec_into`

**If you see these function names in old code, they no longer exist.**

### Falsification Test (F-LAYOUT-001)

```bash
# Test that GGUF→APR→realizar produces coherent output
apr import model.gguf -o model.apr
realizar run model.apr --prompt "2+2=" --max-tokens 10
# Expected: "4" (coherent math)
# NOT: "olumbia+lsi" (garbage = layout bug)
```

## ⚠️ CRITICAL: PMAT-216 GPU Parity Mandate

**GPU inference MUST match CPU inference. This is enforced by CI.**

### Root Cause (Five Whys)

| Why | Answer |
|-----|--------|
| 1. Why garbage GPU output? | LM head produces wrong values |
| 2. Why wrong LM head? | Weight matrix not properly transposed |
| 3. Why not transposed? | `lm_head_weight_t` contained original data |
| 4. Why? | Argument order in `from_apr_weights` swapped |
| 5. Why? | No type safety on weight parameters |

### Fix Applied (2026-02-05)

1. **Type-safe wrappers** in `types.rs`:
   - `LmHeadWeight` - Original layout [vocab_size, hidden_dim]
   - `LmHeadWeightTransposed` - GPU layout [hidden_dim, vocab_size]

2. **Runtime validation** in `from_apr_weights`:
   - Checks first row of original == first column of transposed
   - Fails with `PMAT-216: Arguments may be swapped` on mismatch

3. **Mandatory parity test** (`tests/gpu_cpu_trace_compare.rs`):
   ```bash
   cargo test --features cuda --test gpu_cpu_trace_compare
   # Expected: CPU L2 ≈ GPU L2 (diff < 0.01%)
   ```
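
A minimal sketch of how the type-safe wrappers and the first-row/first-column check described above could fit together. The wrapper names and the `Arguments may be swapped` message come from this guide; the field layout, the `validate_lm_head` function, and the length-mismatch message are assumptions made for illustration, not the code in `types.rs`.

```rust
/// LM head weights in the original layout [vocab_size, hidden_dim], row-major.
pub struct LmHeadWeight(pub Vec<f32>);

/// LM head weights in the GPU layout [hidden_dim, vocab_size], row-major.
pub struct LmHeadWeightTransposed(pub Vec<f32>);

/// Hypothetical validation: first row of the original must equal the
/// first column of the transposed matrix.
pub fn validate_lm_head(
    original: &LmHeadWeight,
    transposed: &LmHeadWeightTransposed,
    vocab_size: usize,
    hidden_dim: usize,
) -> Result<(), String> {
    let expected_len = vocab_size * hidden_dim;
    if original.0.len() != expected_len || transposed.0.len() != expected_len {
        // Assumed message; the guide only documents the swapped-argument error.
        return Err("PMAT-216: weight length does not match vocab_size * hidden_dim".to_string());
    }
    for j in 0..hidden_dim {
        // original[0, j] is at index j; transposed[j, 0] is at index j * vocab_size.
        if original.0[j] != transposed.0[j * vocab_size] {
            return Err("PMAT-216: Arguments may be swapped".to_string());
        }
    }
    Ok(())
}
```

Because the two layouts are distinct types, passing them in the wrong order is a compile error, and the runtime check catches the case where untransposed data was wrapped in the transposed type.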

### Why Tracing Didn't Catch This

| Gap | Impact |
|-----|--------|
| `GpuModel` has no `forward_traced` | Can't trace GPU layer-by-layer |
| No `TracedForward` trait | CPU/GPU can diverge silently |
| No parity test in CI | GPU bugs ship undetected |

### Mandatory GPU Verification

```rust
// ALWAYS compare CPU vs GPU for new models:
let cpu_trace = apr_model.forward_traced(&tokens)?;
let gpu_logits = gpu_model.forward_gpu(&tokens)?;
let cpu_l2 = cpu_trace.logits.iter().map(|x| x * x).sum::<f32>().sqrt();
let gpu_l2 = gpu_logits.iter().map(|x| x * x).sum::<f32>().sqrt();
assert!((cpu_l2 - gpu_l2).abs() / cpu_l2 < 0.01, "GPU diverged from CPU!");
```

### Aprender (ML Library)

**IMPORTANT:** Aprender is actively developed and frequently released. **ALWAYS check for the latest version.**

```bash
# Check aprender version and status
cd ../aprender && git pull && grep "^version" Cargo.toml
```

**Current Status:**
- Version: v0.1.0 (released to crates.io 2024-11-18)
- TDG Score: 95.6/100 (A+)
- Test Coverage: 97.72%
- Path: `../aprender`

**Aprender Primitives (Fallback Option):**
- `Vector<T>` - Generic 1D array with sum, mean, dot, norm, variance
- `Matrix<T>` - Row-major 2D array with matmul, transpose, Cholesky
- **Pure Rust:** Forbids unsafe code entirely
- **Battle-tested:** 149 tests (127 unit + 22 property)

**When to Use Aprender:**
- If trueno has compilation issues (rare)
- For pure Rust fallback without SIMD/GPU
- Can swap implementations transparently

**Update Workflow:**
1. Pull latest aprender: `cd ../aprender && git pull`
2. Check if relevant for inference primitives
3. Consider integration if trueno unavailable
4. Document in commit message

## Python Usage Policy

**IMPORTANT: Avoid Python unless absolutely necessary. This is a pure Rust project.**

### When Python IS Acceptable
- Generating reference values from HuggingFace transformers for verification
- Quick one-off debugging comparisons (not permanent scripts)
- No Rust equivalent exists for the task

### When Python is NOT Acceptable
- Production code (use Rust)
- Build scripts (use Rust/Makefile)
- Tests (use Rust tests)
- Benchmarks (use Criterion)

### If Python Is Required, Use `uv`

**NEVER use pip, virtualenv, conda, or poetry. ONLY use `uv`.**

```bash
# Run Python script with dependencies
uv run --with torch --with transformers python script.py

# Or use inline script dependencies (PEP 723)
uv run script.py  # If script has # /// script metadata

# Interactive REPL with deps
uv run --with torch python
```

**Why uv:**
- Fast dependency resolution (10-100x faster than pip)
- Deterministic environments
- No need to manage venvs manually
- Works with pyproject.toml or inline deps

## Ground Truth Verification

**CRITICAL: Always verify inference outputs against multiple reference implementations.**

All reference implementations live in `~/src/`:

### Reference Implementations (Priority Order)

1. **llama.cpp** (`~/src/llama.cpp`) - Primary reference for GGUF inference
   ```bash
   cd ~/src/llama.cpp
   ./llama-cli -m /path/to/model.gguf -p "prompt" -n 1 --verbose
   # Or for embeddings/hidden states:
   ./llama-embedding -m /path/to/model.gguf -p "prompt"
   ```

2. **Ollama** (`~/src/ollama`) - Production GGUF serving reference
   ```bash
   ollama run tinyllama "prompt" --verbose
   # Check logs for token probabilities
   ```

3. **HuggingFace Transformers** - FP32 ground truth (via uv)
   ```bash
   uv run --with torch --with transformers python3 << 'EOF'
   from transformers import AutoModelForCausalLM, AutoTokenizer
   model = AutoModelForCausalLM.from_pretrained("model-name")
   # Get logits, hidden states, etc.
   EOF
   ```

4. **Candle** (`~/src/candle`) - Rust reference implementation
   ```bash
   cd ~/src/candle
   cargo run --release --example llama -- --model /path/to/model --prompt "test"
   ```

### Verification Checklist

When debugging inference issues, verify in order:

1. **Embedding lookup** - Token → embedding vector
   - Compare L2 norm and first 10 elements with HF
   - Note: GGUF may use Q4_K quantized embeddings

2. **RMSNorm** - Layer normalization
   - Compare L2 norm before/after norm
   - Verify weight values match

3. **Attention projections** (Q/K/V) - Per-layer
   - Compare Q output L2 with HF for same input
   - Check per-head L2 norms

4. **FFN projections** (gate/up/down) - Per-layer
   - Check FFN hidden (gate * up) L2
   - Verify FFN output doesn't cause catastrophic cancellation

5. **Layer-by-layer hidden state L2** - Track through all layers
   - Should closely match HF layer-by-layer
   - Watch for divergence accumulation

6. **Final logits** - Top-k comparison
   - Compare L2 norm (should be within 10%)
   - Verify top-5 tokens match HF top-5
   - Check cosine similarity > 0.99
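
The final-logits checks in step 6 can be sketched as a small helper. All function names here are hypothetical, and treating "top-5 tokens match" as set equality (ignoring order within the top 5) is an assumption; a stricter ordered comparison may be what is intended.

```rust
/// Euclidean (L2) norm of a vector.
pub fn l2_norm(v: &[f32]) -> f32 {
    v.iter().map(|x| x * x).sum::<f32>().sqrt()
}

/// Cosine similarity between two equal-length vectors.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have equal length");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    dot / (l2_norm(a) * l2_norm(b))
}

/// Indices of the `k` largest logits, highest first.
pub fn top_k_indices(logits: &[f32], k: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..logits.len()).collect();
    idx.sort_by(|&i, &j| logits[j].total_cmp(&logits[i]));
    idx.truncate(k);
    idx
}

/// Applies the three step-6 checks: L2 within 10%, same top-5 set,
/// cosine similarity above 0.99.
pub fn logits_match(reference: &[f32], candidate: &[f32]) -> bool {
    if reference.len() != candidate.len() || reference.is_empty() {
        return false;
    }
    let ref_l2 = l2_norm(reference);
    let l2_ok = (ref_l2 - l2_norm(candidate)).abs() <= 0.10 * ref_l2;
    let mut ref_top = top_k_indices(reference, 5);
    let mut cand_top = top_k_indices(candidate, 5);
    ref_top.sort_unstable();
    cand_top.sort_unstable();
    l2_ok && ref_top == cand_top && cosine_similarity(reference, candidate) > 0.99
}
```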

### Quantization Tolerance

Expected differences due to quantization:
- **Q4_K**: ±5% element-wise, <1% L2 norm
- **Q6_K**: ±2% element-wise, <0.5% L2 norm
- **FP16**: ±0.1% element-wise
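
One way to turn these tolerances into an automated check is sketched below. The guide does not say what the element-wise percentage is relative to; this sketch assumes it is relative to the largest magnitude in the reference tensor, and the type and function names are hypothetical.

```rust
/// Tolerances as fractions (0.05 == 5%).
#[derive(Clone, Copy, Debug)]
pub struct Tolerance {
    pub element_wise: f32,
    pub l2_norm: f32,
}

pub const Q4_K: Tolerance = Tolerance { element_wise: 0.05, l2_norm: 0.01 };
pub const Q6_K: Tolerance = Tolerance { element_wise: 0.02, l2_norm: 0.005 };

/// Returns true when `quantized` stays within `tol` of `reference`.
/// Assumption: element-wise error is scaled by max |reference|.
pub fn within_tolerance(reference: &[f32], quantized: &[f32], tol: Tolerance) -> bool {
    if reference.len() != quantized.len() || reference.is_empty() {
        return false;
    }
    let scale = reference.iter().fold(0.0f32, |m, x| m.max(x.abs()));
    let ref_l2 = reference.iter().map(|x| x * x).sum::<f32>().sqrt();
    let q_l2 = quantized.iter().map(|x| x * x).sum::<f32>().sqrt();
    let elements_ok = reference
        .iter()
        .zip(quantized)
        .all(|(r, q)| (r - q).abs() <= tol.element_wise * scale);
    let l2_ok = (ref_l2 - q_l2).abs() <= tol.l2_norm * ref_l2;
    elements_ok && l2_ok
}
```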

### Creating Verification Scripts

Store verification scripts in `examples/par_*` (parity tests):
```
examples/
  par_001_*.rs     # Token embedding verification
  par_002_*.rs     # Layer-by-layer hidden states
  par_003_*.rs     # Logit comparison
  debug_*.rs       # One-off debugging scripts
```

## Development Workflow

### Before Starting Any Work

```bash
# 1. Check ecosystem versions
cd ../trueno && git pull && grep "^version" Cargo.toml
cd ../aprender && git pull && grep "^version" Cargo.toml
cd ../realizar

# 2. Update dependencies if needed
# Edit Cargo.toml with new versions

# 3. Verify clean build
cargo clean
cargo test --lib

# 4. Check quality baselines
pmat analyze tdg
pmat analyze satd
pmat analyze complexity
```

## Code Search (pmat query)

**NEVER use grep or rg for code discovery.** Use `pmat query` instead: it returns quality-annotated, ranked results with TDG scores and fault annotations.

```bash
# Find functions by intent
pmat query "inference forward pass" --limit 10

# Find high-quality code
pmat query "attention mechanism" --min-grade A --exclude-tests

# Find with fault annotations (unwrap, panic, unsafe, etc.)
pmat query "tokenizer decode" --faults

# Filter by complexity
pmat query "gguf loading" --max-complexity 10

# Cross-project search (e.g., find trueno SIMD kernels)
pmat query "simd matmul" --include-project ../trueno

# Search across the stack
pmat query "quantization Q4_K" --include-project ../aprender
pmat query "model checkpoint" --include-project ../entrenar

# Git history search (find code by commit intent via RRF fusion)
pmat query "fix inference output" -G
pmat query "kernel optimization" --git-history

# Enrichment flags (combine freely)
pmat query "attention mechanism" --churn           # git volatility (commit count, churn score)
pmat query "gguf loading" --duplicates             # code clone detection (MinHash+LSH)
pmat query "tokenizer" --entropy                   # pattern diversity (repetitive vs unique)
pmat query "forward pass" --churn --duplicates --entropy --faults -G  # full audit
```

### Coverage-Guided Search (pmat 3.0.0+)

**Use `pmat query --coverage` to find untested code. NEVER parse coverage JSON manually.**

```bash
# Find top uncovered functions (no query needed)
pmat query --coverage-gaps

# Find uncovered functions matching a semantic query
pmat query "quantization" --coverage --uncovered-only

# Use pre-existing coverage data (avoids re-running cargo llvm-cov)
pmat query --coverage-gaps --coverage-file /path/to/coverage.json

# Coverage auto-detection: runs `cargo llvm-cov report --json` automatically
# Prerequisite: run `cargo llvm-cov test --lib --no-report` first to generate data
```

**Workflow for coverage improvement:**
1. `cargo llvm-cov test --lib --no-report` — generate coverage data
2. `pmat query --coverage-gaps` — find top uncovered functions
3. Write tests targeting those functions
4. `make coverage` — verify improvement

### EXTREME TDD Methodology

**Follow RED-GREEN-REFACTOR:**

1. **RED:** Write failing tests first
   - Comprehensive test coverage (edge cases, errors, valid inputs)
   - Property-based tests for mathematical correctness
   - Document expected behavior

2. **GREEN:** Minimal implementation to pass tests
   - Focus on correctness, not optimization
   - Use clear, readable code
   - Leverage trueno primitives where applicable

3. **REFACTOR:** Clean up and optimize
   - Fix clippy warnings (zero tolerance)
   - Apply rustfmt formatting
   - Extract helper functions
   - Document with examples

**Quality Gates (all must pass):**
```bash
make fmt-check     # Format check
make clippy        # Zero warnings
make test          # All tests pass
make test-fast     # < 5 minutes
make coverage      # <10 minutes, aim for 85%+
```

### Trueno Integration Patterns

**Prefer Trueno for Compute:**
```rust
// Good: Use trueno for vector operations
use trueno::Vector;

let a = Vector::from_slice(&[1.0, 2.0, 3.0]);
let b = Vector::from_slice(&[4.0, 5.0, 6.0]);
let result = a.dot(&b); // SIMD-accelerated
```

**Matrix Operations:**
```rust
// Good: Use trueno for matrix multiplication
use trueno::Matrix;

let weights = Matrix::from_slice(128, 256, &data);    // 128 x 256
let input = Matrix::from_slice(1, 128, &input_data);  // 1 x 128
// Inner dimensions must agree: (1 x 128) · (128 x 256) = 1 x 256
let output = input.matmul(&weights); // GPU-accelerated if available
```

**Activation Functions:**
```rust
// Good: Use trueno activations for inference
use trueno::Vector;

let logits = Vector::from_slice(&[0.1, -0.5, 0.3]);
let activated = logits.relu(); // SIMD-accelerated ReLU
```

## Phase 1 Roadmap Progress

### Week 1-2: Model Parsers ✅ COMPLETE
- ✅ GGUF parser (header + metadata + tensor_info)
- ✅ Safetensors parser (JSON metadata + zero-copy data)
- ✅ 26 tests passing
- ✅ TDG Score: 96.2/100 (A+)
- ✅ Zero SATD violations

### Week 3-4: Transformer Components ✅ COMPLETE
- ✅ Layer normalization (7 tests, epsilon-based normalization)
- ✅ Linear layer (6 tests, weight/bias loading)
- ✅ GELU activation (5 tests, tanh approximation)
- ✅ Feed-forward networks (FFN) (6 tests, 2-layer with GELU)
- ✅ Softmax activation (6 tests, numerically stable)
- ✅ Attention mechanism (8 tests, scaled dot-product attention)
- ✅ RoPE position embeddings (11 tests, rotary position encoding)
- ✅ KV cache management (10 tests, efficient inference caching)
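
"Numerically stable" softmax in the list above usually refers to subtracting the maximum logit before exponentiating, so that `exp` never overflows. The sketch below shows that standard technique; it is not taken from realizar's source, and the function name is illustrative.

```rust
/// Softmax with max-subtraction for numerical stability.
/// Returns an empty vector for empty input.
pub fn softmax(logits: &[f32]) -> Vec<f32> {
    if logits.is_empty() {
        return Vec::new();
    }
    // Shifting by the max leaves the result unchanged but keeps exp() <= 1.
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}
```

Without the shift, `exp(1000.0)` overflows to infinity in `f32` and the division yields NaN; with it, two equal large logits come out as probability 0.5 each.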

### Week 5-6: Quantization ✅ COMPLETE
- ✅ Q4_0 dequantization (4-bit, block size 32)
- ✅ Q8_0 dequantization (8-bit, block size 32)
- ✅ Dequantization for inference
- ✅ EXTREME TDD (5 comprehensive tests)
- [ ] Mixed precision support (deferred)
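
For orientation, Q4_0 dequantization of one 32-element block can be sketched as below, following the GGML convention of 16 packed bytes with low nibbles holding elements 0..15 and high nibbles holding elements 16..31. The struct and function names are hypothetical, and the scale is held as `f32` here for simplicity, whereas the on-disk format stores it as a 16-bit float.

```rust
pub const Q4_0_BLOCK_SIZE: usize = 32;

/// One Q4_0 block: a scale plus 32 four-bit quants packed into 16 bytes.
/// Assumption: scale already widened from f16 to f32.
pub struct Q4Block {
    pub scale: f32,
    pub quants: [u8; Q4_0_BLOCK_SIZE / 2],
}

/// Dequantize a block: value = (nibble - 8) * scale.
pub fn dequantize_q4_0(block: &Q4Block) -> [f32; Q4_0_BLOCK_SIZE] {
    let mut out = [0.0f32; Q4_0_BLOCK_SIZE];
    for (j, &byte) in block.quants.iter().enumerate() {
        let lo = (byte & 0x0F) as i32 - 8; // element j
        let hi = (byte >> 4) as i32 - 8;   // element j + 16
        out[j] = lo as f32 * block.scale;
        out[j + Q4_0_BLOCK_SIZE / 2] = hi as f32 * block.scale;
    }
    out
}
```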

### Week 7-8: Tokenizer & Inference ✅ COMPLETE
- ✅ Basic tokenizer (10 tests, encode/decode)
- ✅ Embedding layer (6 tests, token to vector)
- ✅ Complete Model struct (5 tests, end-to-end inference)
- ✅ Generation loop (6 tests, token sampling)
- ✅ Sampling strategies (16 tests, greedy/top-k/top-p)
- ✅ BPE tokenizer (14 tests, byte pair encoding)
- ✅ SentencePiece tokenizer (14 tests, unigram model)
- ✅ HTTP API with axum (8 tests, REST endpoints)
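
The greedy, top-k, and top-p strategies listed above differ only in which candidate tokens they keep before a token is drawn. The sketch below shows the candidate-selection step for each; the function names are illustrative rather than realizar's API, and the final random draw is omitted so the example stays deterministic.

```rust
/// Greedy: index of the largest logit, or None for empty input.
pub fn greedy(logits: &[f32]) -> Option<usize> {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
}

/// Indices sorted by descending score.
fn sorted_desc(scores: &[f32]) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..scores.len()).collect();
    idx.sort_by(|&i, &j| scores[j].total_cmp(&scores[i]));
    idx
}

/// Top-k: keep the k highest-scoring tokens.
pub fn top_k_candidates(scores: &[f32], k: usize) -> Vec<usize> {
    let mut idx = sorted_desc(scores);
    idx.truncate(k);
    idx
}

/// Top-p (nucleus): keep the smallest prefix of tokens, by descending
/// probability, whose cumulative mass reaches `p`.
pub fn top_p_candidates(probs: &[f32], p: f32) -> Vec<usize> {
    let mut cumulative = 0.0;
    let mut keep = Vec::new();
    for i in sorted_desc(probs) {
        keep.push(i);
        cumulative += probs[i];
        if cumulative >= p {
            break;
        }
    }
    keep
}
```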

## Quality Standards

**Mandatory Requirements:**
- **TDG Score:** ≥95.0/100 (A+ grade)
- **Test Coverage:** ≥85%
- **Mutation Score:** ≥80%
- **Cyclomatic Complexity:** ≤10 per function
- **Clippy Warnings:** 0 (zero tolerance)
- **SATD Comments:** 0 (implement or remove TODOs)

**Testing Requirements:**
- Unit tests for all public APIs
- Property-based tests for mathematical operations
- Integration tests for end-to-end workflows
- Benchmark tests for performance-critical paths

## Git Workflow

**Branch Policy:** Work directly on `main` branch (per CLAUDE.md in ~/.claude/)

**Commit Message Format:**
```
<type>: <subject>

<body>

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
```

**Types:**
- `feat`: New feature
- `fix`: Bug fix
- `perf`: Performance improvement
- `refactor`: Code restructuring
- `test`: Add/update tests
- `docs`: Documentation
- `chore`: Maintenance (deps, config)

## Monitoring Ecosystem Updates

**Daily Checks (if actively developing):**
```bash
# Quick version check
cd ../trueno && git log --oneline -1 && grep "^version" Cargo.toml
cd ../aprender && git log --oneline -1 && grep "^version" Cargo.toml
```

**When to Update Realizar:**
- New trueno version with relevant features (vector ops, activations)
- Bug fixes in trueno that affect realizar
- Performance improvements in trueno SIMD/GPU backends
- New aprender primitives useful for inference

**Testing After Updates:**
1. `cargo clean` - Clear build artifacts
2. `cargo test --lib` - Verify all tests pass
3. `cargo clippy --lib -- -D warnings` - Zero warnings
4. `make quality-gates` - Full quality suite
5. Commit with version bump and rationale

## Architecture Principles

**1. Pure Rust from Scratch:**
- Build all ML components ourselves (parsers, transformer, quantization, tokenizer)
- Use trueno for compute primitives only
- HTTP server is swappable (axum default)

**2. Zero Unsafe in Public API:**
- All unsafe code isolated in trueno/aprender
- Realizar public API is 100% safe Rust

**3. Backend Agnostic:**
- Trueno handles SIMD/GPU dispatch automatically
- Fallback to scalar for unknown architectures
- WASM support via trueno

**4. Swappable HTTP Server:**
```rust
pub trait HttpServer {
    fn serve(&self, addr: &str) -> Result<()>;
}

// Currently: axum
// Future: hyper, actix-web, custom
```
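
To illustrate what "swappable" means in practice, the sketch below implements the trait for a stand-in backend and calls it through a trait object. `DryRunServer`, `start`, and the `String`-based `Result` alias are all hypothetical; the guide does not show realizar's actual error type or any backend implementation.

```rust
use std::net::SocketAddr;

/// Assumed error type for this sketch only.
pub type Result<T> = std::result::Result<T, String>;

pub trait HttpServer {
    fn serve(&self, addr: &str) -> Result<()>;
}

/// Hypothetical backend that only validates the address and returns.
pub struct DryRunServer;

impl HttpServer for DryRunServer {
    fn serve(&self, addr: &str) -> Result<()> {
        addr.parse::<SocketAddr>()
            .map(|_| ())
            .map_err(|e| format!("invalid address {addr}: {e}"))
    }
}

/// Callers depend on the trait, not on a concrete backend.
pub fn start(server: &dyn HttpServer, addr: &str) -> Result<()> {
    server.serve(addr)
}
```

Any type implementing `HttpServer` can be passed to `start`, which is the property that would let an axum-backed implementation be replaced without touching call sites.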

## Performance Targets

**Inference Latency (1B models):**
- p50: <100ms
- p95: <200ms
- p99: <500ms
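
Checking measured latencies against these targets can be sketched with a nearest-rank percentile. The guide does not specify a percentile method, so nearest-rank is an assumption, and both function names are hypothetical.

```rust
/// Nearest-rank percentile of latency samples in milliseconds.
/// Returns None for empty input or a percentile outside 0..=100.
pub fn percentile(samples_ms: &[f64], pct: f64) -> Option<f64> {
    if samples_ms.is_empty() || !(0.0..=100.0).contains(&pct) {
        return None;
    }
    let mut sorted = samples_ms.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // rank = ceil(pct * n / 100), clamped to at least 1 (1-indexed).
    let rank = (pct * sorted.len() as f64 / 100.0).ceil().max(1.0) as usize;
    Some(sorted[rank - 1])
}

/// True when p50 < 100 ms, p95 < 200 ms, and p99 < 500 ms.
pub fn meets_latency_targets(samples_ms: &[f64]) -> bool {
    match (
        percentile(samples_ms, 50.0),
        percentile(samples_ms, 95.0),
        percentile(samples_ms, 99.0),
    ) {
        (Some(p50), Some(p95), Some(p99)) => p50 < 100.0 && p95 < 200.0 && p99 < 500.0,
        _ => false,
    }
}
```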

**Memory Usage:**
- Model: As loaded (no unnecessary copies)
- Runtime: <512MB overhead
- KV cache: Bounded and configurable

**Throughput:**
- Single request: Minimize latency
- Batch inference: Maximize throughput (Phase 2)

## Resources

**Documentation:**
- GGUF Spec: https://github.com/ggerganov/ggml/blob/master/docs/gguf.md
- Safetensors Spec: https://github.com/huggingface/safetensors
- Trueno README: ../trueno/README.md
- Aprender README: ../aprender/README.md

**Related Projects:**
- [Trueno](https://github.com/paiml/trueno) - SIMD/GPU compute primitives
- [Aprender](https://github.com/paiml/aprender) - ML library in pure Rust
- [Renacer](https://github.com/paiml/renacer) - Profiling tools
- [paiml-mcp-agent-toolkit](https://github.com/paiml/paiml-mcp-agent-toolkit) - Quality gates
- [bashrs](https://github.com/paiml/bashrs) - Script enforcement

**Quality Tools:**
- `pmat`: Multi-dimensional analysis (TDG, complexity, SATD, coverage)
- `cargo-mutants`: Mutation testing
- `cargo-llvm-cov`: Code coverage
- `proptest`: Property-based testing

---

## Hardware Environment

**GPU:** NVIDIA GeForce RTX 4090
- CUDA Compute Capability: 8.9 (Ada Lovelace)
- VRAM: 24GB GDDR6X
- Tensor Cores: 4th Gen (FP16/BF16/INT8)
- CUDA Cores: 16384
- Memory Bandwidth: 1008 GB/s

**⚠️ CRITICAL: CUDA IS ALWAYS AVAILABLE**

99% of development happens on RTX 4090. CUDA hardware is ALWAYS present.

- **NEVER** say "CUDA not available" or "skipping because no GPU"
- **NEVER** use `CudaExecutor::is_available()` as a pre-check - it can return false even when GPU exists
- **ALWAYS** try `CudaExecutor::new(0)` directly and inspect the actual error if it fails
- If CUDA tests fail, **investigate the real error** - don't assume hardware is missing

```rust
// ❌ WRONG - Don't pre-check availability
if !CudaExecutor::is_available() {
    println!("Skipping: CUDA not available");  // WRONG ASSUMPTION
    return;
}

// ✅ CORRECT - Try to create and show actual error
match CudaExecutor::new(0) {
    Ok(exec) => { /* use it */ },
    Err(e) => {
        eprintln!("CUDA init failed: {:?}", e);  // Shows REAL problem
        // Investigate: library path? permissions? driver version?
    }
}
```

**⚠️ MANDATORY GPU TESTING:**
```bash
# ALWAYS run GPU tests - RTX 4090 is available
cargo test --lib --features cuda

# For integration tests with multiple CudaExecutor instances, use single thread
# to avoid CUDA_ERROR_NOT_INITIALIZED race condition:
cargo test --test cuda_combinatorial_coverage --features cuda -- --test-threads=1

# DO NOT use #[ignore] for GPU tests
# ALL GPU tests must execute, not be skipped
```

**Benchmark Targets (RTX 4090):**
- Ollama phi2:2.7b: ~225-266 tok/s (baseline)
- llama.cpp CUDA: ~256 tok/s
- Target: <1.25x gap to Ollama
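
Reading the target as "baseline throughput divided by measured throughput must stay below 1.25" (an interpretation, since the guide does not define "gap"), the check reduces to a one-line ratio. The function names are hypothetical.

```rust
/// Gap factor: how many times faster the baseline is than the measurement.
pub fn throughput_gap(baseline_tok_s: f64, measured_tok_s: f64) -> f64 {
    baseline_tok_s / measured_tok_s
}

/// True when the measured throughput is within the <1.25x gap target.
pub fn meets_gap_target(baseline_tok_s: f64, measured_tok_s: f64) -> bool {
    throughput_gap(baseline_tok_s, measured_tok_s) < 1.25
}
```

Under this reading, against the lower Ollama figure of 225 tok/s the measured throughput must exceed 180 tok/s.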

**Development Iteration ("implement using pmat work"):**
1. `pmat analyze satd` - check SATD
2. `cargo clippy --lib --features cuda` - zero warnings
3. `cargo test --lib --features cuda` - **ALL tests including GPU**
4. Update spec with results

---

## CRITICAL: TUI Simulation Debugging (Probar-Style)

**⚠️ MANDATORY FOR ALL GPU/CUDA DEBUGGING**

When debugging GPU scheduler issues (CUDA vs wgpu parity, buffer management, kernel execution),
you MUST use TUI simulation workflow tests. This pattern was proven critical in PARITY-114 where
it detected a **state accumulation bug** that simple unit tests missed.

### Why TUI Simulation is Required

1. **Watches the Flow**: Step-by-step visualization of data through schedulers
2. **Catches State Bugs**: Sequential operations reveal accumulation/leakage issues
3. **Provides Diagnosis**: Automatic analysis of failure ratios (8x = accumulator bug, 4x = tile bug)
4. **Probar Alignment**: Matches probar's proven TUI testing methodology

### TUI Simulation Test Pattern

```rust
/// Example: TUI simulation for scheduler parity testing
#[test]
#[cfg(feature = "cuda")]
fn test_scheduler_parity_tui_simulation() {
    use realizar::gpu::{CudaScheduler, HybridScheduler};

    println!("╔══════════════════════════════════════════════════════════════════════╗");
    println!("║  TUI SIMULATION: Watch Data Flow Through Schedulers                  ║");
    println!("╚══════════════════════════════════════════════════════════════════════╝");

    let mut sim = MatmulSimulator::new();

    // Define steps
    let step_init = sim.add_step("INIT", "Initialize test matrices");
    let step_cpu = sim.add_step("CPU", "Compute reference");
    let step_cuda = sim.add_step("CUDA", "Execute via CudaScheduler");
    let step_check = sim.add_step("CHECK", "Verify parity");

    // Execute with visual feedback
    sim.start_step(step_init);
    println!("  ◐ Initializing...");
    // ... setup code ...
    sim.complete_step(step_init, values, None);
    println!("  ● Complete");

    // Render final TUI frame
    println!("{}", sim.render_final());
}
```

### State Isolation Test Pattern

**CRITICAL**: Always test sequential operations to catch state bugs:

```rust
/// Test for state accumulation bugs
#[test]
fn test_scheduler_state_isolation() {
    let mut scheduler = CudaScheduler::new().unwrap();

    // Same operation twice - results MUST be identical
    let r1 = scheduler.matmul(&a, &b, m, k, n).unwrap();
    let r2 = scheduler.matmul(&a, &b, m, k, n).unwrap();

    assert_eq!(r1[0], r2[0], "State leak detected: first={}, second={}", r1[0], r2[0]);
}
```

### Running TUI Workflow Tests

```bash
# Run all GPU parity workflow tests with visual output
cargo test --test gpu_parity_workflow --features cuda -- --nocapture

# Specific TUI simulation test
cargo test --test gpu_parity_workflow test_parity_114_tui_simulation --features cuda -- --nocapture
```

### Failure Analysis Guide

| Ratio | Diagnosis | Check |
|-------|-----------|-------|
| 8x | Accumulator/tile loop bug | Inner loop iterations, FMA instruction |
| 4x | Partial tile accumulation | n_tiles calculation, tile bounds |
| 2x | Half iterations | Loop termination condition |
| Varies | State accumulation | Output buffer not cleared between calls |
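
The table can be applied mechanically, as in the sketch below. The enum, the function, and the 5% matching tolerance are assumptions for illustration; the ratio is taken as expected divided by observed, and results that differ across repeated runs of the same operation are classified as state accumulation before any fixed ratio is considered.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Diagnosis {
    Parity,
    HalfIterations,          // 2x
    PartialTileAccumulation, // 4x
    AccumulatorOrTileLoop,   // 8x
    StateAccumulation,       // ratio varies between runs
    Unclassified,
}

/// Classify repeated results of the same operation against `expected`.
pub fn diagnose(expected: f32, results: &[f32]) -> Diagnosis {
    let first = match results.first() {
        Some(&f) => f,
        None => return Diagnosis::Unclassified,
    };
    // Identical operations must give identical results.
    if results.iter().any(|&r| r != first) {
        return Diagnosis::StateAccumulation;
    }
    if first == 0.0 {
        return Diagnosis::Unclassified;
    }
    let ratio = expected / first;
    // Assumed tolerance: within 5% of the table's ratio.
    let near = |target: f32| (ratio - target).abs() <= 0.05 * target;
    if near(1.0) {
        Diagnosis::Parity
    } else if near(2.0) {
        Diagnosis::HalfIterations
    } else if near(4.0) {
        Diagnosis::PartialTileAccumulation
    } else if near(8.0) {
        Diagnosis::AccumulatorOrTileLoop
    } else {
        Diagnosis::Unclassified
    }
}
```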

### Bug Discovery: PARITY-114 Case Study

The TUI simulation discovered that **the same operation produced different results**:

```
Op 1: 4×64×8, expected 64, got 8
Op 3: 4×64×8, expected 64, got 16  ← DIFFERENT from Op 1!
```

This proved the output buffer was accumulating between calls rather than being cleared.
Simple unit tests would NOT have caught this - only sequential TUI simulation revealed it.
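
The class of bug is easy to reproduce on the CPU. The sketch below is a hypothetical stand-in, not the CUDA scheduler: it reuses an output buffer and accumulates into it, so it only gives correct results when the buffer is cleared between calls. The absolute values differ from the PARITY-114 numbers above; only the "same call, different answer" pattern is the point.

```rust
/// Hypothetical CPU stand-in for a scheduler that reuses its output buffer.
pub struct ReusedBufferMatmul {
    out: Vec<f32>,
    clear_between_calls: bool,
}

impl ReusedBufferMatmul {
    pub fn new(clear_between_calls: bool) -> Self {
        Self { out: Vec::new(), clear_between_calls }
    }

    /// Computes A[m x k] * B[k x n] (row-major) by accumulating into `out`.
    pub fn matmul(&mut self, a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
        assert_eq!(a.len(), m * k);
        assert_eq!(b.len(), k * n);
        // The bug: without this reset, results from earlier calls leak in.
        if self.clear_between_calls || self.out.len() != m * n {
            self.out = vec![0.0; m * n];
        }
        for i in 0..m {
            for j in 0..n {
                for p in 0..k {
                    self.out[i * n + j] += a[i * k + p] * b[p * n + j];
                }
            }
        }
        self.out.clone()
    }
}
```

A single call passes in both configurations; only calling the same operation twice and comparing the results exposes the leak.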

---

**Last Updated:** 2026-01-21
**Realizar Version:** 0.8.0
**GPU Spec Version:** v5.2.0 (CUDA Monolith Shattered + Lint Zero)
**Trueno Version:** 0.16.0
**Aprender Version:** 0.27.0
**Entrenar Version:** 0.7.2
**paiml-mcp-agent-toolkit Version:** v2.200.0 (with Known Defects Scorer, SATD Detector, Defect Analyzer)
**TDG Score:** 93.9/100 (A)
**Rust Project Score:** 137.9/134 (103%, Grade A+)
**Test Coverage:** 80.97% (region), 88.75% (function), 80.08% (lines)
**Total Tests:** 6324 (all passing), 32 ignored
**Mutation Score:** 100% on api.rs (18/18 viable mutants caught)
**Documentation:** 15.0/15 (100%) ✅ Perfect score!
**Known Defects:** 20.0/20 (100%) ✅ Perfect score!
**Dependency Health:** 10.5/12 (87.5%) - Modular feature flags
**Benchmarks:** 4 suites (tensor_ops, inference, cache, tokenizer)
**Examples:** 7 (inference, api_server, tokenization, safetensors_loading, model_cache, gguf_loading, convert_and_bench_apr)
**Performance:**
  - **APR Q4_0: 17.0-17.3 tok/s (1.36x faster than GGUF)** ✅ v0.3.4
  - GGUF Q4_0: 12.5-13.0 tok/s (Candle parity exceeded)
  - APR F32: 0.1 tok/s (memory bandwidth limited)
  - <1ms p50 for 5-token generation
  - **38-41% of llama.cpp** (target: 100%+)
**CLI Binary:** `realizar serve --demo` (65% coverage)
**Quality Improvements:**
  - Added workspace-level lints (unsafe_op_in_unsafe_fn, unreachable_pub, checked_conversions)
  - Created .clippy.toml for cognitive complexity thresholds
  - Fixed critical unwrap() in safetensors.rs (replaced with expect())
  - Updated to latest trueno v0.4.2 with SIMD attribute compliance and PMAT integration
  - Integrated paiml-mcp-agent-toolkit v2.200.0 (Known Defects, SATD, Defect Analysis)
**GPU Performance Parity (M29-M32):**
  - M29: Error Recovery (ErrorRecoveryStrategy, DegradationManager, FailureIsolator)
  - M30: Resource Management (ConnectionPool, ResourceLimiter, ResourceMonitor)
  - M31: Resilience (RetryPolicy, CircuitBreaker, BulkheadManager)
  - M32: Diagnostics (Logger, PhaseTimer, MemoryTracker, DiagnosticsCollector, DebugMode)
**APR Q4_0 Format (v0.3.5):**
  - `QuantizedAprTransformerQ4` - Pure Rust quantized inference
  - RoPE (Rotary Position Embeddings) with configurable theta
  - Grouped Query Attention (GQA) for TinyLlama compatibility
  - SIMD matmul via `fused_q4_0_q8_0_parallel_matvec`
  - **Parallel attention heads** via rayon (32 heads parallelized)
  - **Parallel FFN up/gate** via rayon::join
  - **KV Cache** for efficient autoregressive generation
    - `AprKVCache` stores K/V per layer, avoids recomputation
    - `forward_with_cache()` for context-aware generation
    - `causal_attention_cached()` with parallel head processing
  - **13-19 tok/s** context-aware generation (32-45% of llama.cpp)
**CUDA Refactor (v5.2.0):**
  - Shattered 23K-line cuda.rs monolith into 9 atomic modules
  - Split 21K-line executor.rs into domain submodules (activations, core, gemm, layer, quantized, workspace)
  - Split 15K-line impl_main.rs into 9 focused submodules
  - 65 files cleaned for zero clippy warnings
  - Fixed broken benchmarks (GGUFTransformer → AprTransformer)
**Latest Achievement:** CUDA monolith shattered + comprehensive lint cleanup (65 files, 2089 insertions, 1040 deletions)
**Completed:** Weeks 1-8 + GPU parity M1-M32 + APR Q4_0 (M2) + Rayon (M3) + KV Cache (M4) + CUDA Refactor (v5.2.0)


## Stack Documentation Search

**IMPORTANT: Proactively use the batuta RAG oracle when:**
- Looking up SIMD/GPU patterns from trueno
- Finding inference patterns from TGI ground truth corpus
- Understanding quantization approaches (GGUF, APR formats)
- Researching KV cache, attention, or batching implementations

```bash
# Search across the entire Sovereign AI Stack
batuta oracle --rag "your question here"

# Examples for realizar development
batuta oracle --rag "KV cache optimization patterns"
batuta oracle --rag "continuous batching TGI"
batuta oracle --rag "CUDA kernel matmul implementation"
batuta oracle --rag "quantization Q4_K dequantization"
batuta oracle --rag "FlashAttention tiled attention"

# Reindex if needed (persists to ~/.cache/batuta/rag/)
batuta oracle --rag-index
```

The RAG index includes 335 documents across:
- All Sovereign AI Stack repos (trueno, aprender, entrenar, etc.)
- Python ground truth corpora (HuggingFace, JAX, vLLM patterns)
- Rust ground truth corpora (TGI inference patterns, MLOps)

The index updates automatically via post-commit hooks and via `ora-fresh` on shell login.
- To check freshness manually: `ora-fresh`
- To force a full reindex: `batuta oracle --rag-index --force`

## SSC Training / Blackwell: Inference NOT Affected (2026-03-22)

- **Inference is NOT affected** by the Blackwell training JIT bug (trueno#200)
- **realizar uses cuBLAS (GPU) or trueno SIMD (CPU)** for all GEMMs — pre-compiled kernels, no JIT
- **NF4 fused kernel and cuBLAS backward kernels** are training-only (entrenar) — realizar never calls them
- **When the SSC model ships**: realizar loads the LoRA adapter via standard PEFT/safetensors path — no special Blackwell handling needed
- **Trained model (LoRA adapter)**: Architecture-independent — works on any GPU or CPU
- **Key tickets**: trueno#200 (Blackwell JIT), trueno#203 (pre-compiled kernels), entrenar#300 (cuBLAS backward)