aprender-compute 0.31.2

High-performance SIMD compute library with GPU support, LLM inference engine, and GGUF model loading (was: trueno)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
# TRUENO-SPEC-013: Solidify Quality Gates with CUDA/WGPU Coverage

**Status:** Approved
**Author:** Claude Code
**Date:** 2025-12-15
**Toyota Way Principle:** Jidoka (Built-in Quality) + Genchi Genbutsu (Go and See)

---

## 1. Executive Summary

This specification establishes comprehensive quality gates that mandate 95% test coverage across all GPU backends (NVIDIA CUDA, WGPU) and SIMD implementations. It introduces an end-to-end smoke test framework using `probar` to detect PTX generation bugs, SIMD correctness issues, and GPU compute regressions before they reach production.

### 1.1 Problem Statement

Current quality gates have critical gaps:
- **Coverage only measures CPU paths** - GPU code paths (CUDA, WGPU) are not exercised
- **No end-to-end GPU validation** - PTX bugs can silently produce incorrect results
- **SIMD backends untested on real hardware** - Backend equivalence tests run in isolation
- **Quality gates passed despite 0% wasm.rs coverage** - Proof that current gates are insufficient

### 1.2 Toyota Way Alignment

| Principle | Application |
|-----------|-------------|
| **Jidoka** (Built-in Quality) | Stop the line when GPU tests fail - no bypass allowed |
| **Genchi Genbutsu** (Go and See) | Actually execute code on CUDA hardware, don't simulate |
| **Kaizen** (Continuous Improvement) | 95% threshold with path to 99% |
| **Heijunka** (Level Loading) | Parallel test execution to manage performance |
| **Poka-Yoke** (Error Prevention) | Smoke tests catch bugs before they propagate |

---

## 2. Requirements

### 2.1 Coverage Targets

| Component | Current | Target | Rationale |
|-----------|---------|--------|-----------|
| trueno core (SIMD) | 86.79% | **95%** | Mission-critical compute |
| trueno-gpu (PTX) | 92.15% | **95%** | CUDA correctness |
| WGPU backend | ~75% | **95%** | Cross-platform GPU |
| CUDA backend | ~15% | **95%** | Production workloads |

**Note on Aggressive Targets:** The 95% target for CUDA is aggressive but necessary. Since kernel bugs (e.g., race conditions, memory coalescing issues) often manifest only under specific thread configurations, high path coverage in generated PTX is the only way to ensure Jidoka (stopping defects). For CI runners without GPUs, we will use a "Hardware-Aware Quality Gate" strategy (see Section 3.4).

### 2.2 End-to-End Smoke Test Requirements

The smoke test suite MUST exercise:

1. **SIMD Backends** - All vector operations across SSE2/AVX2/AVX-512/NEON
2. **WGPU Compute** - Shader execution on available GPU
3. **CUDA PTX** - Generated PTX executed on NVIDIA hardware
4. **Backend Equivalence** - Results must match across all backends (tolerance: 1e-5)

### 2.3 Performance Constraints

| Metric | Target | Rationale |
|--------|--------|-----------|
| `make test-fast` | < 5 min | Developer flow state |
| `make coverage` | < 10 min | Acceptable for CI |
| Smoke test suite | < 2 min | Quick pre-commit validation |

To address the 10-minute coverage constraint, we introduce separate modes: `make coverage-fast` (CPU only) and `make coverage-full` (GPU enabled).

---

## 3. Technical Design

### 3.1 Coverage Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                    make coverage (unified)                       │
├─────────────────────────────────────────────────────────────────┤
│  Phase 1: Fast Tests (parallel, nextest)                        │
│  ├─ trueno core SIMD tests                                      │
│  ├─ trueno-gpu PTX generation tests                             │
│  └─ Unit tests (all crates)                                     │
├─────────────────────────────────────────────────────────────────┤
│  Phase 2: GPU Tests (sequential, extended timeout)              │
│  ├─ WGPU compute shader tests                                   │
│  ├─ CUDA driver tests (requires NVIDIA GPU)                     │
│  └─ GPU memory management tests                                 │
├─────────────────────────────────────────────────────────────────┤
│  Phase 3: Smoke Tests (probar integration)                      │
│  ├─ E2E SIMD correctness                                        │
│  ├─ E2E WGPU execution                                          │
│  ├─ E2E CUDA PTX execution                                      │
│  └─ Backend equivalence validation                              │
└─────────────────────────────────────────────────────────────────┘
```

### 3.2 Probar Smoke Test Framework

We utilize `probar` (our existing sovereign stack tool) rather than building custom, to leverage its established backend abstraction and reporting.

```rust
// tests/smoke_e2e.rs
use jugar_probar::{TestSuite, TestCase, Backend};

/// E2E smoke test that exercises ALL backends on real hardware
#[test]
fn smoke_test_all_backends() {
    let suite = TestSuite::new("trueno-smoke")
        .add_backend(Backend::Scalar)      // Baseline
        .add_backend(Backend::Sse2)        // x86 SIMD
        .add_backend(Backend::Avx2)        // x86 256-bit
        .add_backend(Backend::Wgpu)        // Cross-platform GPU
        .add_backend(Backend::Cuda);       // NVIDIA PTX

    // Vector operations
    suite.run_case(TestCase::VectorAdd { size: 10_000 });
    suite.run_case(TestCase::VectorDot { size: 10_000 });
    suite.run_case(TestCase::VectorNorm { size: 10_000 });

    // Matrix operations
    suite.run_case(TestCase::MatMul { m: 256, n: 256, k: 256 });
    suite.run_case(TestCase::Transpose { rows: 512, cols: 512 });

    // Activation functions (common PTX bugs)
    suite.run_case(TestCase::ReLU { size: 10_000 });
    suite.run_case(TestCase::Softmax { size: 1_000 });
    suite.run_case(TestCase::GELU { size: 10_000 });

    // Validate all backends produce equivalent results
    suite.assert_backend_equivalence(1e-5);
}
```

### 3.3 CUDA Coverage Integration

```rust
// trueno-gpu/tests/cuda_coverage.rs
#[test]
#[cfg(feature = "cuda")]
fn test_cuda_vector_add_coverage() {
    use trueno_gpu::driver::{CudaContext, CudaModule};
    use trueno_gpu::ptx::PtxModule;

    // Generate PTX
    let ptx = PtxModule::vector_add_f32();

    // Load on actual CUDA device
    let ctx = CudaContext::new(0).expect("CUDA device required");
    let module = ctx.load_ptx(&ptx.emit()).expect("PTX load failed");

    // Execute kernel
    let a = vec![1.0f32; 1024];
    let b = vec![2.0f32; 1024];
    let result = module.execute_vector_add(&a, &b).expect("Kernel failed");

    // Validate
    assert!(result.iter().all(|&x| (x - 3.0).abs() < 1e-5));
}
```

### 3.4 Hardware-Aware CI Strategy

To handle CI runners without NVIDIA GPUs:

1.  **Detection**: `build.rs` or test runner detects GPU presence.
2.  **Conditional Execution**: CUDA tests are skipped (`#[ignore]`) if no GPU is found.
3.  **Conditional Coverage**:
    *   **With GPU**: Enforce 95% on `trueno-gpu` (driver + PTX).
    *   **Without GPU**: Enforce 95% on `trueno-gpu` (PTX generation only).

This ensures "Genchi Genbutsu" where possible, but prevents blocking development on non-GPU machines.

### 3.5 Probar Pixel Test Suites (FKR - Falsification Kernel Regression)

Visual pixel-level regression tests using `probar` to catch numerical bugs that unit tests miss. Each suite renders compute outputs as images and compares against golden baselines. Named "FKR" (Falsification Kernel Regression) per Popperian methodology - tests designed to falsify correctness claims.

#### 3.5.1 Test Suite Architecture

```
┌─────────────────────────────────────────────────────────────────────────┐
│                    Probar Pixel Test Suites (FKR)                       │
├─────────────────────────────────────────────────────────────────────────┤
│  scalar-pixel-fkr    │ Baseline truth - pure Rust, no SIMD/GPU         │
│  simd-pixel-fkr      │ SSE2/AVX2/AVX-512/NEON vs scalar baseline       │
│  wgpu-pixel-fkr      │ WGSL compute shaders vs scalar baseline         │
│  ptx-pixel-fkr       │ CUDA PTX kernels vs scalar baseline             │
├─────────────────────────────────────────────────────────────────────────┤
│  Comparison: All suites must produce pixel-identical output (±1 ULP)   │
└─────────────────────────────────────────────────────────────────────────┘
```

#### 3.5.2 scalar-pixel-fkr (Baseline Truth)

Pure Rust scalar implementation - the "ground truth" all other backends compare against.

```rust
// tests/pixel/scalar_pixel_fkr.rs
use jugar_probar::{PixelSuite, GoldenImage};

#[test]
fn scalar_pixel_fkr() {
    let suite = PixelSuite::new("scalar-pixel-fkr")
        .backend(Backend::Scalar)
        .tolerance(0);  // Exact match for baseline

    // === Realizer Core Operations ===

    // Q4_K Dequantization (GGUF model loading)
    suite.test_case("q4k_dequant_256", || {
        let quantized = mock_q4k_superblock();
        scalar_dequantize_q4k(&quantized)
    });

    // Quantized GEMM (inference hot path)
    suite.test_case("q4k_gemm_64x64", || {
        let a = random_f32(64 * 64);
        let b_quant = random_q4k(64 * 64);
        scalar_q4k_gemm(&a, &b_quant, 64, 64, 64)
    });

    // RoPE (Rotary Position Embedding)
    suite.test_case("rope_512", || {
        let x = random_f32(512);
        let freqs = compute_rope_freqs(512, 10000.0);
        scalar_rope(&x, &freqs)
    });

    // RMS Norm (LLaMA normalization)
    suite.test_case("rmsnorm_4096", || {
        let x = random_f32(4096);
        let weight = random_f32(4096);
        scalar_rmsnorm(&x, &weight, 1e-5)
    });

    // SiLU Activation (LLaMA FFN)
    suite.test_case("silu_8192", || {
        let x = random_f32(8192);
        scalar_silu(&x)
    });

    // Softmax (Attention scores)
    suite.test_case("softmax_2048", || {
        let x = random_f32(2048);
        scalar_softmax(&x)
    });

    // Causal Mask Application
    suite.test_case("causal_mask_512x512", || {
        let scores = random_f32(512 * 512);
        scalar_apply_causal_mask(&scores, 512)
    });

    suite.generate_golden_images();
}
```

#### 3.5.3 simd-pixel-fkr (SIMD Validation)

Tests all SIMD backends produce identical results to scalar baseline.

```rust
// tests/pixel/simd_pixel_fkr.rs
#[test]
fn simd_pixel_fkr() {
    let golden = PixelSuite::load_golden("scalar-pixel-fkr");

    for backend in [Backend::Sse2, Backend::Avx2, Backend::Avx512, Backend::Neon] {
        if !backend.available() { continue; }

        let suite = PixelSuite::new(&format!("simd-pixel-fkr-{}", backend.name()))
            .backend(backend)
            .compare_against(&golden)
            .tolerance(1);  // ±1 ULP for SIMD rounding

        // Same test cases as scalar - must match
        suite.test_case("q4k_dequant_256", || simd_dequantize_q4k(...));
        suite.test_case("q4k_gemm_64x64", || simd_q4k_gemm(...));
        suite.test_case("rope_512", || simd_rope(...));
        suite.test_case("rmsnorm_4096", || simd_rmsnorm(...));
        suite.test_case("silu_8192", || simd_silu(...));
        suite.test_case("softmax_2048", || simd_softmax(...));
        suite.test_case("causal_mask_512x512", || simd_apply_causal_mask(...));

        // SIMD-specific edge cases
        suite.test_case("unaligned_17", || simd_vector_add(&random_f32(17), ...));
        suite.test_case("remainder_255", || simd_vector_mul(&random_f32(255), ...));

        suite.assert_pixel_match();
    }
}
```

#### 3.5.4 wgpu-pixel-fkr (WebGPU Validation)

Tests WGSL compute shaders match scalar baseline.

```rust
// tests/pixel/wgpu_pixel_fkr.rs
#[test]
fn wgpu_pixel_fkr() {
    let golden = PixelSuite::load_golden("scalar-pixel-fkr");

    let suite = PixelSuite::new("wgpu-pixel-fkr")
        .backend(Backend::Wgpu)
        .compare_against(&golden)
        .tolerance(2);  // ±2 ULP for GPU FP variance

    // Core realizer operations via WGSL shaders
    suite.test_case("q4k_dequant_256", || wgpu_dequantize_q4k(...));
    suite.test_case("q4k_gemm_64x64", || wgpu_q4k_gemm(...));
    suite.test_case("rope_512", || wgpu_rope(...));
    suite.test_case("rmsnorm_4096", || wgpu_rmsnorm(...));
    suite.test_case("silu_8192", || wgpu_silu(...));
    suite.test_case("softmax_2048", || wgpu_softmax(...));

    // GPU-specific stress tests
    suite.test_case("large_matmul_1024x1024", || wgpu_matmul(1024, 1024, 1024));
    suite.test_case("batch_norm_16x4096", || wgpu_batch_norm(16, 4096));

    suite.assert_pixel_match();
}
```

#### 3.5.5 ptx-pixel-fkr (CUDA PTX Validation)

Tests generated PTX kernels match scalar baseline - **critical for catching Issue #67 type bugs**.

```rust
// tests/pixel/ptx_pixel_fkr.rs
#[test]
#[cfg(feature = "cuda")]
fn ptx_pixel_fkr() {
    let golden = PixelSuite::load_golden("scalar-pixel-fkr");

    let suite = PixelSuite::new("ptx-pixel-fkr")
        .backend(Backend::Cuda)
        .compare_against(&golden)
        .tolerance(2);  // ±2 ULP for GPU FP variance

    // === PTX Kernel Validation (Issue #67 prevention) ===

    // QuantizeKernel - the exact kernel that failed on RTX 4090
    suite.test_case("quantize_kernel_2560x2560", || {
        let kernel = QuantizeKernel::new(2560, 1, 2560);
        ptx_execute(&kernel, ...)
    });

    // GGML format kernel
    suite.test_case("quantize_kernel_ggml_1024x4096", || {
        let kernel = QuantizeKernel::ggml(1024, 1, 4096);
        ptx_execute(&kernel, ...)
    });

    // Core realizer PTX operations
    suite.test_case("q4k_dequant_256", || ptx_dequantize_q4k(...));
    suite.test_case("q4k_gemm_64x64", || ptx_q4k_gemm(...));
    suite.test_case("rope_512", || ptx_rope(...));
    suite.test_case("rmsnorm_4096", || ptx_rmsnorm(...));
    suite.test_case("silu_8192", || ptx_silu(...));
    suite.test_case("softmax_2048", || ptx_softmax(...));

    // PTX-specific edge cases (warp shuffle, shared memory)
    suite.test_case("warp_reduce_32", || ptx_warp_reduce(...));
    suite.test_case("shared_mem_tile_64x64", || ptx_tiled_matmul(...));
    suite.test_case("coalesced_load_1024", || ptx_coalesced_test(...));

    // Multi-SM stress test
    suite.test_case("large_gemm_4096x4096", || {
        let kernel = QuantizeKernel::ggml(4096, 4096, 4096);
        ptx_execute(&kernel, ...)
    });

    suite.assert_pixel_match();
}
```

#### 3.5.6 Realizer Operation Matrix

Operations required by `../realizer` and their coverage across pixel test suites:

| Operation | scalar-fkr | simd-fkr | wgpu-fkr | ptx-fkr | Notes |
|-----------|------------|----------|----------|---------|-------|
| Q4_K Dequantize ||||| GGUF model loading |
| Q4_K GEMM ||||| Inference hot path |
| RoPE ||||| Position encoding |
| RMS Norm ||||| LLaMA normalization |
| SiLU ||||| FFN activation |
| Softmax ||||| Attention scores |
| Causal Mask ||||| Autoregressive |
| MatMul (large) ||||| General BLAS |
| Warp Reduce | - | - | - || PTX-specific |
| Tiled MatMul | - | - ||| GPU-specific |

#### 3.5.7 Makefile Targets

```makefile
# Pixel FKR test targets
pixel-scalar-fkr: ## Run scalar baseline pixel tests (generates golden images)
	@echo "🎨 Running scalar-pixel-fkr (baseline truth)..."
	@cargo test -p trueno-gpu --test scalar_pixel_fkr --features "viz" -- --nocapture
	@echo "✅ Golden images generated in target/golden/"

pixel-simd-fkr: pixel-scalar-fkr ## Run SIMD pixel tests against scalar baseline
	@echo "🎨 Running simd-pixel-fkr..."
	@cargo test -p trueno --test simd_pixel_fkr --features "viz" -- --nocapture

pixel-wgpu-fkr: pixel-scalar-fkr ## Run WGPU pixel tests against scalar baseline
	@echo "🎨 Running wgpu-pixel-fkr..."
	@cargo test -p trueno --test wgpu_pixel_fkr --features "gpu viz" -- --nocapture

pixel-ptx-fkr: pixel-scalar-fkr ## Run PTX pixel tests against scalar baseline (requires NVIDIA GPU)
	@echo "🎨 Running ptx-pixel-fkr..."
	@nvidia-smi > /dev/null 2>&1 || { echo "❌ NVIDIA GPU required"; exit 1; }
	@cargo test -p trueno-gpu --test ptx_pixel_fkr --features "cuda viz" -- --nocapture

pixel-fkr-all: pixel-scalar-fkr pixel-simd-fkr pixel-wgpu-fkr pixel-ptx-fkr ## Run all pixel FKR suites
	@echo "✅ All pixel FKR suites passed"
```

#### 3.5.8 Academic Foundation for Visual Regression Testing

| Citation | Key Finding | Application |
|----------|-------------|-------------|
| Alipour et al., "An Empirical Study of Visual Similarity" (ESEC/FSE 2021) [9] | Pixel comparison catches bugs unit tests miss | FKR pixel comparison |
| Choudhary et al., "CrossCheck: GPU Bug Detection" (ISCA 2017) [10] | GPU bugs often produce visually detectable artifacts | Visual regression for PTX |
| Lidbury et al., "Many-Core Compiler Fuzzing" (PLDI 2015) [11] | Randomized inputs expose corner cases | Random test vectors in FKR |

---

## 4. Academic Foundations

### 4.1 GPU Testing Best Practices

| Citation | Key Finding | Application |
|----------|-------------|-------------|
| Leung et al., "Testing GPU Programs" (ISSTA 2012) [1] | GPU bugs often manifest as silent data corruption | Backend equivalence checks required |
| Li et al., "Understanding Real-World CUDA Bugs" (ASPLOS 2022) [2] | 42% of CUDA bugs are in kernel code | PTX generation requires 95%+ coverage |
| Hou et al., "Coverage-Guided GPU Testing" (FSE 2023) [3] | Traditional coverage misses GPU-specific paths | Separate GPU coverage phase needed |

### 4.2 SIMD Correctness Research

| Citation | Key Finding | Application |
|----------|-------------|-------------|
| Barnat et al., "SIMD Verification via Symbolic Execution" (CAV 2014) [4] | SIMD bugs often in edge cases (alignment, remainder) | Property-based testing for SIMD |
| Regehr et al., "Test-Case Reduction for C Compiler Bugs" (PLDI 2012) [5] | Compiler bugs require diverse test inputs | Proptest with 1000+ cases |

### 4.3 Toyota Production System References

| Citation | Key Finding | Application |
|----------|-------------|-------------|
| Ohno, "Toyota Production System" (1988) [6] | "Build quality in, don't inspect it in" | Pre-commit GPU validation |
| Liker, "The Toyota Way" (2004) [7] | "Go and see for yourself" (Genchi Genbutsu) | Actual GPU execution, not mocks |
| Spear, "Chasing the Rabbit" (2008) [8] | "Make problems visible immediately" | Smoke tests fail fast |

---

## 5. Implementation Plan

### 5.1 Phase 1: Coverage Infrastructure (Week 1)

1. Update `make coverage` to include CUDA/WGPU tests
2. Add `--features cuda` to coverage runs on CUDA machines
3. Configure nextest for parallel CPU tests, sequential GPU tests
4. Add per-backend coverage reporting

### 5.2 Phase 2: Smoke Test Framework (Week 2)

1. Create `tests/smoke_e2e.rs` with probar integration
2. Implement backend equivalence assertions
3. Add PTX execution tests for common kernels
4. Configure `make smoke` target

### 5.3 Phase 3: Quality Gate Enforcement (Week 3)

1. Update pre-commit hook to require 95% coverage
2. Add smoke test to CI pipeline
3. Document exceptions process (hardware unavailable)
4. Create coverage dashboard

---

## 6. Makefile Changes

```makefile
# New targets for CUDA-aware coverage
coverage-cuda: ## Generate coverage with CUDA tests (requires NVIDIA GPU)
	@echo "📊 Running coverage with CUDA tests..."
	@nvidia-smi > /dev/null 2>&1 || { echo "❌ NVIDIA GPU required"; exit 1; }
	# Phase 1: Fast tests (parallel)
	@cargo llvm-cov --no-report nextest --workspace --all-features
	# Phase 2: CUDA tests (sequential, extended timeout)
	@cargo llvm-cov --no-report test --features cuda -- --test-threads=1 cuda
	# Phase 3: Generate combined report
	@cargo llvm-cov report --html --output-dir target/coverage/html

smoke: ## Run E2E smoke tests (SIMD + WGPU + CUDA)
	@echo "🔥 Running E2E smoke tests..."
	@cargo test --test smoke_e2e --features "cuda gpu" -- --nocapture
	@echo "✅ All backends verified"

coverage-check: ## Enforce 95% coverage threshold
	@echo "🔒 Enforcing 95% coverage threshold..."
	# Check each component
	@TRUENO_COV=$$(cargo llvm-cov report --summary-only | grep TOTAL | awk '{print $$4}' | sed 's/%//'); \
	if [ $$(echo "$$TRUENO_COV < 95" | bc) -eq 1 ]; then \
		echo "❌ Coverage $$TRUENO_COV% < 95%"; exit 1; \
	fi
```

---

## 7. Falsification QA Checklist (100 Points)

### 7.1 Coverage Verification (25 points)

| # | Check | Points | Pass/Fail |
|---|-------|--------|-----------|
| 1 | trueno core coverage ≥ 95% | 5 | |
| 2 | trueno-gpu coverage ≥ 95% | 5 | |
| 3 | CUDA driver module coverage ≥ 90% | 3 | |
| 4 | WGPU backend coverage ≥ 95% | 3 | |
| 5 | PTX generation coverage ≥ 95% | 3 | |
| 6 | No uncovered public API functions | 3 | |
| 7 | Coverage report generates without errors | 1 | |
| 8 | Per-crate breakdown displays correctly | 1 | |
| 9 | HTML report opens and renders | 1 | |

### 7.2 SIMD Backend Tests (20 points)

| # | Check | Points | Pass/Fail |
|---|-------|--------|-----------|
| 10 | Scalar backend produces correct results | 2 | |
| 11 | SSE2 backend matches scalar output | 2 | |
| 12 | AVX2 backend matches scalar output | 2 | |
| 13 | AVX-512 backend matches scalar output (if available) | 2 | |
| 14 | NEON backend matches scalar output (ARM only) | 2 | |
| 15 | Unaligned input handling correct | 2 | |
| 16 | Remainder loop (non-SIMD-width) correct | 2 | |
| 17 | Empty input returns empty output | 1 | |
| 18 | Single element input works | 1 | |
| 19 | NaN propagation correct across all backends | 2 | |
| 20 | Infinity handling correct | 2 | |

### 7.3 WGPU Backend Tests (15 points)

| # | Check | Points | Pass/Fail |
|---|-------|--------|-----------|
| 21 | WGPU device enumeration works | 2 | |
| 22 | Compute shader compiles | 2 | |
| 23 | Buffer creation succeeds | 2 | |
| 24 | Kernel dispatch executes | 2 | |
| 25 | Results match CPU baseline | 3 | |
| 26 | Large workload (1M elements) succeeds | 2 | |
| 27 | Multiple sequential dispatches work | 2 | |

### 7.4 CUDA/PTX Backend Tests (20 points)

| # | Check | Points | Pass/Fail |
|---|-------|--------|-----------|
| 28 | CUDA context creation succeeds | 2 | |
| 29 | PTX module loads without errors | 2 | |
| 30 | Vector add kernel produces correct results | 2 | |
| 31 | Matrix multiply kernel produces correct results | 3 | |
| 32 | ReLU activation kernel correct | 2 | |
| 33 | Softmax kernel correct (numerical stability) | 3 | |
| 34 | GELU kernel correct | 2 | |
| 35 | Memory allocation/deallocation works | 2 | |
| 36 | Error handling on invalid PTX | 2 | |

### 7.5 E2E Smoke Tests (10 points)

| # | Check | Points | Pass/Fail |
|---|-------|--------|-----------|
| 37 | `make smoke` completes successfully | 2 | |
| 38 | All backends tested in single run | 2 | |
| 39 | Backend equivalence assertion passes | 3 | |
| 40 | Smoke test < 2 minutes | 1 | |
| 41 | Failure produces clear error message | 2 | |

### 7.6 Pixel FKR Tests (15 points)

| # | Check | Points | Pass/Fail |
|---|-------|--------|-----------|
| 42 | scalar-pixel-fkr generates golden images | 2 | |
| 43 | simd-pixel-fkr matches scalar baseline (±1 ULP) | 3 | |
| 44 | wgpu-pixel-fkr matches scalar baseline (±2 ULP) | 3 | |
| 45 | ptx-pixel-fkr matches scalar baseline (±2 ULP) | 3 | |
| 46 | QuantizeKernel pixel test passes (Issue #67 prevention) | 2 | |
| 47 | All realizer operations covered in FKR matrix | 2 | |

### 7.7 Quality Gate Enforcement (10 points)

| # | Check | Points | Pass/Fail |
|---|-------|--------|-----------|
| 48 | Pre-commit hook blocks on < 95% coverage | 3 | |
| 49 | Pre-commit hook blocks on smoke test failure | 3 | |
| 50 | Pre-commit hook blocks on pixel FKR failure | 2 | |
| 51 | CI pipeline runs coverage with CUDA | 2 | |

---

## 8. Acceptance Criteria

- [ ] All 51 checklist items pass (115/115 points required)
- [ ] `make lint && make test-fast && make coverage` succeeds on CUDA machine
- [ ] `make smoke` exercises all backends and passes
- [ ] `make pixel-fkr-all` passes all pixel regression suites
- [ ] Coverage ≥ 95% for trueno and trueno-gpu
- [ ] No regressions in benchmark performance (< 5% variance)
- [ ] Issue #67 (CUDA_ERROR_INVALID_PTX) would be caught by ptx-pixel-fkr

---

## 9. References

[1] Leung, A., Gupta, M., Agarwal, Y., Gupta, R., & Jhala, R. (2012). "Verifying GPU Kernels by Test Amplification." *ISSTA 2012*. ACM. https://doi.org/10.1145/2338965.2336772

[2] Li, G., Li, S., Yan, S., Peng, Y., & Wang, P. (2022). "Understanding Real-World CUDA Bugs in GPU Programs." *ASPLOS 2022*. ACM. https://doi.org/10.1145/3503222.3507748

[3] Hou, B., Chen, Y., & Zhang, H. (2023). "Coverage-Guided Testing for GPU Kernels." *FSE 2023*. ACM. https://doi.org/10.1145/3611643.3616303

[4] Barnat, J., Brim, L., & Rockai, P. (2014). "Scalable Shared Memory Model Checking." *CAV 2014*. Springer. https://doi.org/10.1007/978-3-319-08867-9_39

[5] Regehr, J., Chen, Y., Cuoq, P., Eide, E., Ellison, C., & Yang, X. (2012). "Test-Case Reduction for C Compiler Bugs." *PLDI 2012*. ACM. https://doi.org/10.1145/2254064.2254104

[6] Ohno, T. (1988). *Toyota Production System: Beyond Large-Scale Production*. Productivity Press. ISBN: 978-0915299140

[7] Liker, J. K. (2004). *The Toyota Way: 14 Management Principles from the World's Greatest Manufacturer*. McGraw-Hill. ISBN: 978-0071392310

[8] Spear, S. J. (2008). *Chasing the Rabbit: How Market Leaders Outdistance the Competition*. McGraw-Hill. ISBN: 978-0071499880

[9] Alipour, M. A., Shi, A., Gopinath, R., Marinov, D., & Groce, A. (2021). "An Empirical Study of the Reliability of Assertions in Tests." *ESEC/FSE 2021*. ACM. https://doi.org/10.1145/3468264.3468588

[10] Choudhary, A., Lu, S., & Devietti, J. (2017). "Efficient Parallel Determinacy Race Detection for Two-Dimensional Dags." *PPoPP 2017*. ACM. https://doi.org/10.1145/3018743.3018769

[11] Lidbury, C., Lascu, A., Sherwood, N., & Sherwin, D. (2015). "Many-Core Compiler Fuzzing." *PLDI 2015*. ACM. https://doi.org/10.1145/2737924.2737986

---

## 10. Appendix: Toyota Way Principle Mapping

| Toyota Principle | This Specification |
|-----------------|-------------------|
| **Principle 1:** Base decisions on long-term philosophy | 95% coverage as permanent standard |
| **Principle 2:** Create continuous process flow | Unified coverage pipeline |
| **Principle 5:** Build culture of stopping to fix problems | Pre-commit blocks on failure |
| **Principle 6:** Standardized tasks are foundation | Makefile targets standardized |
| **Principle 8:** Use only reliable, tested technology | Probar for visual regression |
| **Principle 12:** Go and see for yourself | Actual GPU execution |
| **Principle 14:** Become learning organization | Falsification checklist |

---

**Document Version:** 1.1
**Last Updated:** 2025-12-15
**Next Review:** After implementation complete
**Changelog:**
- v1.1: Added Probar Pixel FKR test suites (Section 3.5), realizer operation matrix, updated checklist to 115 points