aprender-compute 0.31.2

High-performance SIMD compute library with GPU support, LLM inference engine, and GGUF model loading (was: trueno)
# CUDA Library Parity Specification for the Sovereign AI Stack

**Version**: 1.0.0
**Status**: Draft
**Date**: 2026-03-04
**Authors**: Sovereign AI Stack Team
**Toyota Way Alignment**: Jidoka (built-in quality), Kaizen (continuous improvement), Genchi Genbutsu (go and see)
**Batuta Oracle Confidence**: trueno 85-95% for linear algebra; gaps identified for FFT, solvers, image processing, RNG

## Executive Summary

This specification defines the roadmap for achieving feature parity between NVIDIA's CUDA Library ecosystem (cuBLAS, cuSPARSE, cuFFT, cuSOLVER, cuTENSOR, NPP, nvCOMP, nvJPEG, cuRAND, cuPQC) and the Sovereign AI Stack's pure Rust implementation. The gap analysis reveals that **trueno** already covers dense BLAS (GEMM, GEMV, batched matmul) and data compression (LZ4), but lacks dedicated sparse linear algebra, FFT, general solvers, image processing, and GPU RNG. Each new capability is specified across four design dimensions: provable-contracts verification, Brick profiling integration, pure Rust PTX implementation, and world-class algorithmic foundations from peer-reviewed research.

### Sovereign Stack Current Coverage

| CUDA Library | Sovereign Equivalent | Coverage | Gap |
|---|---|---|---|
| cuBLAS (Level 1-3) | trueno (vector_ops, matrix_ops, GEMM/GEMV) | ~70% | Batched BLAS, mixed-precision GEMM variants |
| cuBLASLt | trueno-gpu (tiled GEMM, tensor core GEMM) | ~50% | Algorithm selection, epilogue fusion |
| cuSPARSE | **None** | 0% | Full sparse library needed |
| cuFFT | **None** (rustfft CPU only in aprender) | 0% | GPU FFT needed |
| cuSOLVER | trueno (Jacobi eigendecomp only) | ~5% | LU, QR, SVD, Cholesky |
| cuTENSOR | **None** | 0% | N-dimensional contractions |
| NPP | **None** (rmedia for video, not GPU image) | 0% | GPU image processing |
| nvCOMP | trueno (LZ4), trueno-zram | ~20% | GPU-accelerated codecs |
| nvJPEG/nvTIFF | **None** | 0% | GPU image codecs |
| cuRAND | **None** | 0% | GPU RNG |
| cuPQC | aprender (AES-256-GCM, Ed25519 — CPU) | ~10% | GPU-accelerated crypto |

---

## 1. New Crate Architecture

Each CUDA library equivalent becomes a module within the trueno workspace, following the existing pattern of `trueno-gpu`, `trueno-quant`, etc.

```
trueno/                          (workspace root)
├── trueno/                      (core: SIMD vector/matrix ops)
├── trueno-gpu/                  (PTX generation + GPU kernels)
│   └── src/kernels/
│       ├── sparse/              NEW: SpMV, SpMM, SpGEMM kernels
│       ├── fft/                 NEW: Stockham FFT kernels
│       ├── solve/               NEW: LU, QR, SVD, Cholesky kernels
│       ├── image/               NEW: Convolution, morphology, histogram
│       ├── rand/                NEW: Philox, XORWOW counter-based RNG
│       └── tensor/              NEW: N-dimensional contraction kernels
├── crates/
│   ├── trueno-sparse/           NEW: Sparse matrix formats + CPU SIMD
│   ├── trueno-fft/              NEW: FFT CPU + GPU unified API
│   ├── trueno-solve/            NEW: Dense/sparse solvers
│   ├── trueno-image/            NEW: Image processing primitives
│   └── trueno-rand/             NEW: Parallel RNG
└── contracts/                   NEW: provable-contracts YAML
    ├── sparse-spmv-v1.yaml
    ├── fft-stockham-v1.yaml
    ├── solve-lu-v1.yaml
    ├── solve-qr-v1.yaml
    ├── solve-svd-v1.yaml
    ├── image-conv2d-v1.yaml
    └── rand-philox-v1.yaml
```

### Dependency Graph

```
trueno-image ──► trueno-fft (convolution via FFT for large kernels)
trueno-solve ──► trueno-sparse (sparse solvers use sparse formats)
trueno-solve ──► trueno (dense BLAS for panel factorization)
trueno-sparse ──► trueno (dense sub-operations within sparse)
trueno-fft ──► trueno (butterfly arithmetic)
trueno-rand ──► trueno-gpu (PTX kernel generation)
All ──► trueno-gpu (GPU kernel infrastructure)
All ──► provable-contracts (YAML verification contracts)
```

### PAIML MCP Agent Toolkit Alignment

Per pmat conventions, each new crate follows:

- **Uniform Contracts**: `BaseAnalysisContract` pattern with `#[serde(flatten)]` for shared parameters
- **Service Layer**: `impl Service for SparseService { type Input; type Output; type Error; }`
- **MCP Tool Registration**: Each crate exposes pmat-compatible analysis tools
- **TDG Grade**: A- minimum (score >= 88/100)
- **Coverage**: >= 95% line coverage, >= 80% mutation score
- **EXTREME TDD**: 8+ tests per feature, property-based testing for all numerical code
- **Zero SATD**: No TODO/FIXME/HACK in shipped code
- **Feature Gating**: Optional heavy features behind Cargo feature flags

---

## 2. trueno-sparse: Sparse Linear Algebra

### 2A. Provable-Contracts Design

**Contract**: `sparse-spmv-v1.yaml`

```yaml
metadata:
  version: "1.0.0"
  author: "Sovereign AI Stack"
  description: "Sparse matrix-vector multiply (CSR format)"
  references:
    - "Merrill & Garland, 'Merge-Based Parallel Sparse Matrix-Vector Multiplication', PPoPP 2016"
    - "LAProof: Princeton formal verification of SpMV in C (Appel et al.)"

equations:
  spmv:
    formula: "y_i = α · Σ_j A_{ij} · x_j + β · y_i"
    domain: "A ∈ ℝ^{m×n} (CSR), x ∈ ℝ^n, y ∈ ℝ^m, α,β ∈ ℝ"
    codomain: "y ∈ ℝ^m"
    invariants:
      - "Output length equals number of rows: len(y) == m"
      - "Backward error bounded: |y_computed - y_exact| ≤ nnz_per_row · u · ||A||_inf · ||x||_inf"

proof_obligations:
  - type: invariant
    property: "Output dimension correctness"
    formal: "len(result) == A.rows()"
    applies_to: all

  - type: equivalence
    property: "Sparse result matches dense reference"
    formal: "|sparse_spmv(A,x) - dense_matvec(A.to_dense(), x)| < ε"
    tolerance: 1.0e-5
    applies_to: all

  - type: bound
    property: "Backward error bound (LAProof)"
    formal: "||Ax_computed - Ax_exact||_inf ≤ n · u · ||A||_inf · ||x||_inf"
    tolerance: "n * f32::EPSILON"
    applies_to: all

  - type: linearity
    property: "SpMV is linear in x"
    formal: "|spmv(A, αx + βy) - α·spmv(A,x) - β·spmv(A,y)| < ε"
    tolerance: 1.0e-4
    applies_to: all

  - type: equivalence
    property: "SIMD matches scalar within ULP"
    tolerance: 8.0
    applies_to: simd

  - type: equivalence
    property: "GPU matches CPU within tolerance"
    tolerance: 16.0
    applies_to: gpu

kernel_structure:
  phases:
    - name: format_validation
      description: "Validate CSR invariants (sorted columns, valid offsets)"
      invariant: "offsets[0] == 0 && offsets[m] == nnz && offsets non-decreasing (empty rows allowed)"
    - name: merge_path_partition
      description: "Binary search to partition work equally among thread blocks"
      invariant: "Each partition processes approximately nnz/num_blocks elements"
    - name: merge_spmv
      description: "Parallel merge of row-offsets and NZ-data"
      invariant: "Each element accumulated into correct row"
    - name: reduction
      description: "Reduce partial row sums across thread blocks"
      invariant: "y[i] = sum of all partial contributions to row i"

simd_dispatch:
  spmv_csr:
    scalar: "spmv_csr_scalar"
    avx2: "spmv_csr_avx2"
    avx512: "spmv_csr_avx512"
    neon: "spmv_csr_neon"
    ptx: "spmv_csr_merge_gpu"

enforcement:
  format_invariants:
    description: "CSR offsets must be non-decreasing with offsets[0] = 0"
    check: "contract_tests::FALSIFY-SPARSE-001"
    severity: "ERROR"
  no_oob_access:
    description: "Column indices must be in [0, n)"
    check: "contract_tests::FALSIFY-SPARSE-002"
    severity: "ERROR"
  backward_error:
    description: "Numerical error bounded by LAProof formula"
    check: "contract_tests::FALSIFY-SPARSE-003"
    severity: "ERROR"

falsification_tests:
  - id: FALSIFY-SPARSE-001
    rule: "CSR format validity"
    prediction: "Reject CSR with non-monotonic offsets"
    test: "Construct invalid CSR, assert error returned"
    if_fails: "Format validation missing or incomplete"

  - id: FALSIFY-SPARSE-002
    rule: "Dimension correctness"
    prediction: "spmv(A[m,n], x[n]) produces y[m]"
    test: "proptest with random dimensions"
    if_fails: "Index arithmetic error in merge-path"

  - id: FALSIFY-SPARSE-003
    rule: "Backward error bound"
    prediction: "|y_sparse - y_dense| ≤ nnz_per_row * eps * |A|_inf * |x|_inf"
    test: "proptest with random CSR + dense reference"
    if_fails: "Accumulation order error or missing Kahan summation"

  - id: FALSIFY-SPARSE-004
    rule: "SIMD-scalar equivalence"
    prediction: "|y_simd - y_scalar| ≤ 8 ULP per element"
    test: "proptest with AVX2 vs scalar on same input"
    if_fails: "SIMD reassociation exceeds tolerance"

  - id: FALSIFY-SPARSE-005
    rule: "GPU-CPU equivalence"
    prediction: "|y_gpu - y_cpu| ≤ 16 ULP per element"
    test: "Run merge-based GPU kernel vs CPU scalar reference"
    if_fails: "GPU kernel accumulation or memory access error"

kani_harnesses:
  - id: KANI-SPARSE-001
    obligation: "Format invariants"
    property: "CSR offsets form valid partition of [0, nnz)"
    bound: 16
    strategy: exhaustive
    harness: verify_csr_offsets_valid

  - id: KANI-SPARSE-002
    obligation: "Merge-path partition correctness"
    property: "Binary search finds correct merge-path diagonal"
    bound: 32
    strategy: bounded_int
    harness: verify_merge_path_partition
```

**Contract**: `sparse-spmm-v1.yaml` (SpMM) and `sparse-spgemm-v1.yaml` (SpGEMM) follow the same pattern with matrix-output equations.

### 2B. Brick Profiling Design

```rust
// New BrickId variants for sparse operations
pub enum BrickId {
    // ... existing variants ...
    // Sparse operations (new)
    SpMV,           // Sparse matrix-vector
    SpMM,           // Sparse matrix-dense matrix
    SpGEMM,         // Sparse matrix-sparse matrix
    FormatConvert,  // CSR↔COO↔BSR conversion
}

// New BrickCategory
pub enum BrickCategory {
    // ... existing ...
    Sparse,  // All sparse operations
}
```

**Roofline Classification**:
- SpMV: **memory-bound** (arithmetic intensity ~0.25 FLOP/byte for f32 CSR)
- SpMM: **mixed** (depends on dense matrix width; memory-bound for narrow, compute-bound for wide)
- SpGEMM: **memory-bound** (irregular access, hash-table accumulation)

**Tuner Features** (extend `TunerFeatures` v1.2.0):
```rust
// Sparse-specific features (6 new dimensions)
pub struct SparseTunerFeatures {
    pub nnz: u64,                    // Total nonzeros
    pub rows: u64,                   // Matrix rows
    pub cols: u64,                   // Matrix columns
    pub avg_nnz_per_row: f32,        // Average nonzeros per row
    pub row_length_variance: f32,    // Variance of row lengths (key for algorithm selection)
    pub format: SparseFormat,        // CSR, COO, BSR, ELL
}
```

**Algorithm Selection via BrickTuner**:
- `row_length_variance < 0.1`: Row-split SpMV (regular matrices)
- `row_length_variance >= 0.1`: Merge-based SpMV (irregular matrices)
- `avg_nnz_per_row > 32 && GPU available`: GPU merge-based
- `avg_nnz_per_row <= 32`: CPU SIMD (memory-bound, GPU launch overhead dominates)
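The selection rules above can be sketched as a small dispatch function. The enum and function names here are illustrative, not the shipped BrickTuner API:

```rust
/// Hypothetical sketch of the BrickTuner SpMV dispatch heuristic.
#[derive(Debug, PartialEq)]
pub enum SpmvAlgorithm {
    RowSplitCpu, // regular matrices: simple row-split is already load-balanced
    MergeCpu,    // irregular matrices: merge-based balances work per thread
    MergeGpu,    // long rows + GPU: launch overhead is amortized
}

pub fn select_spmv_algorithm(
    avg_nnz_per_row: f32,
    row_length_variance: f32,
    gpu_available: bool,
) -> SpmvAlgorithm {
    // GPU pays off only when rows are long enough to amortize launch overhead.
    if gpu_available && avg_nnz_per_row > 32.0 {
        return SpmvAlgorithm::MergeGpu;
    }
    // Low row-length variance means a regular matrix; row-split suffices.
    if row_length_variance < 0.1 {
        SpmvAlgorithm::RowSplitCpu
    } else {
        SpmvAlgorithm::MergeCpu
    }
}
```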

**ModelTracer Extensions** (MLT-06):
```rust
pub struct SparseOperationTrace {
    pub format: SparseFormat,
    pub nnz: u64,
    pub effective_bandwidth_gbps: f64,
    pub achieved_vs_roofline_pct: f64,
    pub load_balance_efficiency: f64,  // 1.0 = perfect balance
}
```

### 2C. Pure Rust Implementation

#### Sparse Formats (CPU + SIMD)

```rust
/// Compressed Sparse Row format
pub struct CsrMatrix<T> {
    pub rows: usize,
    pub cols: usize,
    pub offsets: Vec<u32>,    // len = rows + 1
    pub col_indices: Vec<u32>, // len = nnz
    pub values: Vec<T>,        // len = nnz
}

/// Coordinate format (for construction)
pub struct CooMatrix<T> {
    pub rows: usize,
    pub cols: usize,
    pub row_indices: Vec<u32>,
    pub col_indices: Vec<u32>,
    pub values: Vec<T>,
}

/// Block Sparse Row (for structured sparsity)
pub struct BsrMatrix<T> {
    pub rows: usize,
    pub cols: usize,
    pub block_rows: usize,
    pub block_cols: usize,
    pub offsets: Vec<u32>,
    pub col_indices: Vec<u32>,
    pub values: Vec<T>,  // stored as dense blocks
}

/// SELL-C-sigma (Sliced ELLPACK with sorting)
pub struct SellMatrix<T> {
    pub rows: usize,
    pub cols: usize,
    pub slice_size: usize,     // typically 32 or 256
    pub slice_starts: Vec<u32>,
    pub col_indices: Vec<u32>,
    pub values: Vec<T>,
}
```
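To make the format relationships concrete, a minimal COO-to-CSR conversion over raw triplet arrays might look like the following. This is a sketch: the shipped converter would also validate indices against the matrix dimensions per FALSIFY-SPARSE-001/002.

```rust
/// Build CSR arrays (offsets, col_indices, values) from COO triplets.
/// Sorts entries by (row, col) so the resulting CSR has sorted columns.
pub fn coo_to_csr(
    rows: usize,
    entries: &mut [(u32, u32, f32)], // (row, col, value)
) -> (Vec<u32>, Vec<u32>, Vec<f32>) {
    entries.sort_by_key(|&(r, c, _)| (r, c));

    // offsets[i+1] - offsets[i] = number of nonzeros in row i
    let mut offsets = vec![0u32; rows + 1];
    for &(r, _, _) in entries.iter() {
        offsets[r as usize + 1] += 1;
    }
    // Prefix-sum the counts into the non-decreasing offsets array.
    for i in 0..rows {
        offsets[i + 1] += offsets[i];
    }

    let col_indices = entries.iter().map(|e| e.1).collect();
    let values = entries.iter().map(|e| e.2).collect();
    (offsets, col_indices, values)
}
```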

#### SIMD SpMV (AVX2)

Algorithm: Row-split with 8-wide f32 accumulation.

```rust
// Backend trait extension
pub trait SparseBackend {
    fn spmv_csr(
        alpha: f32, matrix: &CsrMatrix<f32>, x: &[f32],
        beta: f32, y: &mut [f32],
    );
    fn spmm_csr(
        alpha: f32, matrix: &CsrMatrix<f32>, b: &[f32], b_cols: usize,
        beta: f32, c: &mut [f32],
    );
}
```

For AVX2 SpMV: gather from `x` using `_mm256_i32gather_ps` with column indices, FMA accumulate, horizontal sum per row. For rows shorter than 8 elements, use masked operations.
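The scalar reference — the baseline that FALSIFY-SPARSE-004/005 falsify the SIMD and GPU variants against — can be sketched over raw CSR arrays (a sketch, not the shipped trait impl):

```rust
/// Scalar reference for y = alpha * A * x + beta * y over raw CSR arrays.
pub fn spmv_csr_scalar(
    alpha: f32,
    offsets: &[u32],     // len = rows + 1, non-decreasing
    col_indices: &[u32], // len = nnz
    values: &[f32],      // len = nnz
    x: &[f32],
    beta: f32,
    y: &mut [f32],
) {
    for row in 0..y.len() {
        let (start, end) = (offsets[row] as usize, offsets[row + 1] as usize);
        let mut acc = 0.0f32;
        for k in start..end {
            // Gather from x via the column index; the AVX2 path does the
            // same with _mm256_i32gather_ps + FMA.
            acc += values[k] * x[col_indices[k] as usize];
        }
        y[row] = alpha * acc + beta * y[row];
    }
}
```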

#### GPU Kernel: Merge-Based SpMV (PTX)

Per Merrill & Garland (PPoPP 2016):

```rust
pub struct MergeSpMVKernel {
    pub rows: u32,
    pub nnz: u32,
    pub items_per_thread: u32,
}

impl Kernel for MergeSpMVKernel {
    fn name(&self) -> &str { "spmv_merge" }

    fn build_ptx(&self) -> PtxKernel {
        // 1. Each thread block computes its merge-path diagonal
        //    via binary search on (row_offsets, nz_indices)
        // 2. Each thread walks its portion of the merge path:
        //    - When advancing in row_offsets: emit partial sum for previous row
        //    - When advancing in nz_indices: accumulate val[j] * x[col[j]]
        // 3. Block-level reduction of partial row sums via shared memory
        // 4. Inter-block reduction for rows that span multiple blocks
        PtxKernel::new("spmv_merge")
            .param(PtxType::U64, "offsets_ptr")
            .param(PtxType::U64, "col_indices_ptr")
            .param(PtxType::U64, "values_ptr")
            .param(PtxType::U64, "x_ptr")
            .param(PtxType::U64, "y_ptr")
            .param(PtxType::U32, "rows")
            .param(PtxType::U32, "nnz")
            .param(PtxType::F32, "alpha")
            .param(PtxType::F32, "beta")
            .shared_memory(4096)  // Partial row sums
            .build(|ctx| {
                // Merge-path binary search + accumulation
                // (full PTX generation via builder extension traits)
            })
    }
}
```

### 2D. World-Class Algorithmic Foundation

**Primary Algorithm**: Merge-Based SpMV (Merrill & Garland, PPoPP 2016)
- Frames SpMV as a merge of row-offsets list and NZ-data list
- Work-balanced: every thread processes the same number of merge-path steps
- No format conversion required (works directly on CSR)
- Block-strategy derivatives report up to 1.84x over cuSPARSE on irregular matrices (Cluster Computing 2024)

**Formal Verification**: LAProof (Princeton, Appel et al.)
- Machine-checked backward error bounds for CSR SpMV
- Error bound: `||r||_inf ≤ n · u · ||A||_inf · ||x||_inf` where `u = f32::EPSILON/2`
- Verified Jacobi iterative solver convergence
- Translates directly to Kani harnesses for bounded verification
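The LAProof-style acceptance check behind FALSIFY-SPARSE-003 can be sketched as a simple predicate; the function name and the f64-reference convention here are assumptions for illustration:

```rust
/// Check a computed f32 SpMV result against an f64 reference under the
/// LAProof-style bound: |y_computed - y_exact| <= n * u * ||A||_inf * ||x||_inf,
/// where n bounds the nonzeros per row and u = f32::EPSILON / 2.
pub fn backward_error_ok(
    y_computed: &[f32],
    y_reference: &[f64], // computed in f64 as the "exact" proxy
    max_nnz_per_row: usize,
    a_inf_norm: f64,
    x_inf_norm: f64,
) -> bool {
    let u = (f32::EPSILON as f64) / 2.0; // unit roundoff for f32
    let bound = max_nnz_per_row as f64 * u * a_inf_norm * x_inf_norm;
    y_computed
        .iter()
        .zip(y_reference)
        .all(|(&c, &r)| ((c as f64) - r).abs() <= bound)
}
```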

**Adaptive Format Selection**: MANet (arXiv 2018)
- ML-based sparse format selection (CSR vs ELL vs COO vs BSR)
- Integrates with BrickTuner via `row_length_variance` feature

**Reference Papers**:
1. Merrill & Garland, "Merge-Based Parallel Sparse Matrix-Vector Multiplication", PPoPP 2016
2. Acc-SpMM, arXiv:2501.09251 (Tensor Core SpMM, January 2025)
3. HC-SpMM, arXiv:2412.08902 (Hybrid Core SpMM, December 2024)
4. LAProof, Princeton CS (formal SpMV verification in Coq/VST)
5. Block Strategy with Adaptive Storage, Cluster Computing 2024 (1.84x over cuSPARSE)

---

## 3. trueno-fft: Fast Fourier Transforms

### 3A. Provable-Contracts Design

**Contract**: `fft-stockham-v1.yaml`

```yaml
metadata:
  version: "1.0.0"
  author: "Sovereign AI Stack"
  description: "GPU-accelerated FFT via Stockham auto-sort algorithm"
  references:
    - "Stockham, 'High-speed convolution and correlation', AFIPS 1966"
    - "VkFFT: Efficient Vulkan/CUDA/HIP/OpenCL/Metal FFT, IEEE 2023"
    - "Cooley & Tukey, 'An algorithm for the machine calculation of complex Fourier series', 1965"

equations:
  dft:
    formula: "X_k = Σ_{n=0}^{N-1} x_n · exp(-2πi·k·n/N)"
    domain: "x ∈ ℂ^N, N ≥ 1"
    codomain: "X ∈ ℂ^N"
    invariants:
      - "Parseval's theorem: Σ|x_n|² = (1/N)·Σ|X_k|²"
      - "Linearity: DFT(αx + βy) = α·DFT(x) + β·DFT(y)"
      - "Inverse: IDFT(DFT(x)) = x"

  r2c:
    formula: "X_k = Σ_{n=0}^{N-1} x_n · exp(-2πi·k·n/N), x ∈ ℝ^N"
    domain: "x ∈ ℝ^N"
    codomain: "X ∈ ℂ^{N/2+1} (Hermitian symmetry)"
    invariants:
      - "X_{N-k} = conj(X_k) (conjugate symmetry for real input)"
      - "Output length = N/2 + 1"

proof_obligations:
  - type: invariant
    property: "Parseval energy conservation"
    formal: "|Σ|x_n|² - (1/N)·Σ|X_k|²| < ε"
    tolerance: 1.0e-4
    applies_to: all

  - type: invariant
    property: "Inverse roundtrip"
    formal: "|IFFT(FFT(x)) - x| < ε element-wise"
    tolerance: 1.0e-5
    applies_to: all

  - type: linearity
    property: "FFT linearity"
    formal: "|FFT(αx + βy) - α·FFT(x) - β·FFT(y)| < ε"
    tolerance: 1.0e-4
    applies_to: all

  - type: equivalence
    property: "GPU matches CPU reference"
    tolerance: 32.0
    applies_to: gpu

  - type: conservation
    property: "Parseval energy conservation"
    formal: "energy_time_domain ≈ energy_freq_domain"
    tolerance: 1.0e-4
    applies_to: all

kernel_structure:
  phases:
    - name: twiddle_precompute
      description: "Precompute twiddle factors in FP64, store as FP32"
      invariant: "|W_k - exp(-2πik/N)| < f64::EPSILON"
    - name: butterfly_stages
      description: "log2(N) stages of butterfly operations (Stockham)"
      invariant: "Each stage halves the DFT problem size"
    - name: transpose
      description: "Shared memory transpose between stages (bank-conflict-free)"
      invariant: "Data permutation is correct"
    - name: output_scale
      description: "Optional 1/N scaling for inverse"
      invariant: "Scale applied uniformly"

simd_dispatch:
  fft_1d:
    scalar: "fft_1d_scalar"
    avx2: "fft_1d_avx2"
    avx512: "fft_1d_avx512"
    neon: "fft_1d_neon"
    ptx: "fft_stockham_gpu"

falsification_tests:
  - id: FALSIFY-FFT-001
    rule: "Parseval energy conservation"
    prediction: "Time-domain energy equals frequency-domain energy within tolerance"
    test: "proptest with random complex vectors, N = 2^k for k in 4..16"
    if_fails: "Butterfly coefficient or twiddle factor error"

  - id: FALSIFY-FFT-002
    rule: "Inverse roundtrip"
    prediction: "IFFT(FFT(x)) reproduces x within tolerance"
    test: "proptest with random input, all supported sizes"
    if_fails: "Twiddle sign error or missing conjugation in inverse"

  - id: FALSIFY-FFT-003
    rule: "Known-value test (DFT of impulse)"
    prediction: "FFT([1, 0, 0, ..., 0]) = [1, 1, 1, ..., 1]"
    test: "Exact comparison for impulse input"
    if_fails: "Fundamental algorithm error"

  - id: FALSIFY-FFT-004
    rule: "GPU-CPU equivalence"
    prediction: "|FFT_gpu(x) - FFT_cpu(x)| ≤ 32 ULP per element"
    test: "Run Stockham GPU vs scalar CPU on same input"
    if_fails: "Shared memory transpose or twiddle error in PTX"

kani_harnesses:
  - id: KANI-FFT-001
    obligation: "Butterfly index correctness"
    property: "Butterfly pairs are non-overlapping and cover all elements"
    bound: 16
    strategy: bounded_int
    harness: verify_butterfly_indices

  - id: KANI-FFT-002
    obligation: "Twiddle factor indexing"
    property: "Twiddle index k is in [0, N/2) at each stage"
    bound: 16
    strategy: bounded_int
    harness: verify_twiddle_indices
```

### 3B. Brick Profiling Design

```rust
pub enum BrickId {
    // ... existing ...
    // FFT operations (new)
    FFT1D,
    FFT2D,
    FFT3D,
    IFFT1D,
    TwiddlePrecompute,
    ButterflyStage,
}

pub enum BrickCategory {
    // ... existing ...
    FFT,
}
```

**Roofline Classification**:
- Small FFT (N < 1024): **compute-bound** (butterfly arithmetic dominates)
- Large FFT (N > 65536): **memory-bound** (data movement between stages dominates)
- 2D/3D FFT: **memory-bound** (dominated by transpose between dimensions)

**Tuner Features**:
```rust
pub struct FFTTunerFeatures {
    pub n: u64,                      // Transform size
    pub batch_size: u64,             // Number of transforms
    pub dimensions: u8,              // 1D, 2D, 3D
    pub is_real: bool,               // R2C vs C2C
    pub is_power_of_two: bool,       // Fast path available
    pub largest_prime_factor: u32,   // For mixed-radix planning
}
```
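The `largest_prime_factor` feature can be computed by plain trial division, which is ample for planner-sized inputs; a minimal sketch:

```rust
/// Largest prime factor of n, by trial division. Used by the (hypothetical)
/// FFT planner to decide between pure radix-2/4/8, mixed-radix, and
/// Bluestein fallback paths.
pub fn largest_prime_factor(mut n: u64) -> u32 {
    let mut largest = 1u64;
    let mut p = 2u64;
    while p * p <= n {
        while n % p == 0 {
            largest = p; // p divides n, so p is a prime factor
            n /= p;
        }
        p += 1;
    }
    if n > 1 {
        largest = n; // whatever remains is itself prime
    }
    largest as u32
}
```

A size with `largest_prime_factor <= 8` can use the fast mixed-radix path; anything larger falls through to Bluestein.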

### 3C. Pure Rust Implementation

#### CPU FFT (Stockham Auto-Sort)

```rust
pub struct FftPlan {
    pub n: usize,
    pub direction: FftDirection,
    pub twiddles: Vec<Complex<f32>>,   // Precomputed in f64, stored as f32
    pub radix_sequence: Vec<usize>,    // Mixed-radix decomposition
}

pub enum FftDirection { Forward, Inverse }

pub trait Fft {
    fn fft_1d(&self, input: &[Complex<f32>], output: &mut [Complex<f32>]);
    fn ifft_1d(&self, input: &[Complex<f32>], output: &mut [Complex<f32>]);
    fn fft_r2c(&self, input: &[f32], output: &mut [Complex<f32>]);
    fn fft_c2r(&self, input: &[Complex<f32>], output: &mut [f32]);
    fn fft_2d(&self, input: &[Complex<f32>], output: &mut [Complex<f32>],
              nx: usize, ny: usize);
}
```

**SIMD Butterfly**: AVX2 processes 4 complex f32 (8 floats) per instruction:
- Load pairs, multiply by twiddle (FMA), add/subtract for butterfly
- Interleaved real/imaginary format for coalesced access
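For reference, the butterfly being vectorized is the standard radix-2 combine X_k = E_k + W^k·O_k, X_{k+N/2} = E_k − W^k·O_k. A minimal scalar decimation-in-time sketch of it follows; this is reference code only — the production path is the iterative Stockham formulation described below:

```rust
/// Recursive radix-2 DIT FFT over (re, im) pairs; N must be a power of two.
pub fn fft_ref(input: &[(f64, f64)]) -> Vec<(f64, f64)> {
    let n = input.len();
    assert!(n.is_power_of_two());
    if n == 1 {
        return input.to_vec();
    }
    let even: Vec<_> = input.iter().step_by(2).copied().collect();
    let odd: Vec<_> = input.iter().skip(1).step_by(2).copied().collect();
    let (fe, fo) = (fft_ref(&even), fft_ref(&odd));

    let mut out = vec![(0.0, 0.0); n];
    for k in 0..n / 2 {
        // Twiddle W^k = exp(-2*pi*i*k/N)
        let ang = -2.0 * std::f64::consts::PI * k as f64 / n as f64;
        let (wr, wi) = (ang.cos(), ang.sin());
        // Complex multiply W^k * O_k
        let tr = wr * fo[k].0 - wi * fo[k].1;
        let ti = wr * fo[k].1 + wi * fo[k].0;
        out[k] = (fe[k].0 + tr, fe[k].1 + ti);         // butterfly top
        out[k + n / 2] = (fe[k].0 - tr, fe[k].1 - ti); // butterfly bottom
    }
    out
}
```

The impulse known-value test (FALSIFY-FFT-003) exercises exactly this reference.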

#### GPU FFT (Stockham with Shared Memory)

Per VkFFT (IEEE 2023) design:

```rust
pub struct StockhamFFTKernel {
    pub n: u32,
    pub radix: u32,           // 2, 4, 8 (register-level)
    pub threads_per_block: u32,
    pub ffts_per_block: u32,  // Multiple small FFTs per block (bank conflict avoidance)
}

impl Kernel for StockhamFFTKernel {
    fn name(&self) -> &str { "fft_stockham" }

    fn build_ptx(&self) -> PtxKernel {
        // Each stage:
        // 1. Load from global/shared into registers (coalesced)
        // 2. Compute radix-N butterfly in registers
        // 3. Multiply by twiddle factors
        // 4. Store to shared memory (with padding for bank conflicts)
        // 5. Barrier sync
        // 6. Next stage reads from shared
        //
        // Final stage stores to global memory
        //
        // Key optimizations:
        // - Radix-4/8 reduces shared memory round-trips
        // - Multiple FFTs per block amortizes sync overhead
        // - FP64 twiddle precomputation for accuracy (VkFFT approach)
        todo!()
    }
}
```

**Non-Power-of-Two Support**: Bluestein's chirp-z transform:
1. Choose the smallest power of two M ≥ 2N − 1 and zero-pad to length M
2. Convolve with the chirp sequence (three size-M power-of-two FFTs)
3. Extract the N-point result
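One planning detail worth pinning down: the circular convolution in step 2 only equals the required linear convolution when the padded length M satisfies M ≥ 2N − 1. A hypothetical planner helper:

```rust
/// Smallest power-of-two FFT length usable by Bluestein for an N-point
/// transform: the circular convolution of the length-N chirp sequences
/// is alias-free only once M >= 2N - 1.
pub fn bluestein_fft_len(n: usize) -> usize {
    assert!(n >= 1);
    (2 * n - 1).next_power_of_two()
}
```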

### 3D. World-Class Algorithmic Foundation

**Primary Algorithm**: Stockham Auto-Sort FFT
- No bit-reversal permutation (unlike Cooley-Tukey)
- Ping-pong buffers give natural coalesced access
- Mixed-radix decomposition: N = r1 × r2 × ... × rk for small primes

**Twiddle Factor Precision**: VkFFT approach
- Precompute in FP128 (or FP64), store as FP32
- More precise than cuFFT (IEEE 2023 paper demonstrates this)

**Non-Power-of-Two**: Rader's algorithm (prime N) + Bluestein's (arbitrary N)

**Reference Papers**:
1. Stockham, "High-speed convolution and correlation", AFIPS 1966
2. VkFFT, "Efficient Vulkan/CUDA/HIP/OpenCL/Metal FFT library", IEEE 2023
3. TurboFNO, arXiv:2504.11681 (fused FFT-GEMM-iFFT, April 2025)
4. Mixed-Precision FFT, arXiv:2508.10202 (Pareto-optimal precision, 2025)

---

## 4. trueno-solve: Dense & Sparse Solvers

### 4A. Provable-Contracts Design

**Contract**: `solve-lu-v1.yaml`

```yaml
metadata:
  version: "1.0.0"
  author: "Sovereign AI Stack"
  description: "LU factorization with partial pivoting"
  references:
    - "Golub & Van Loan, 'Matrix Computations', 4th ed., 2013"
    - "High Accuracy Low Precision QR, arXiv:1912.05508"

equations:
  lu_factorization:
    formula: "PA = LU where L lower triangular (unit diagonal), U upper triangular, P permutation"
    domain: "A ∈ ℝ^{n×n}"
    codomain: "L ∈ ℝ^{n×n}, U ∈ ℝ^{n×n}, P ∈ {0,1}^{n×n}"
    invariants:
      - "PA = LU within backward error"
      - "L is unit lower triangular: L_{ii} = 1, L_{ij} = 0 for j > i"
      - "U is upper triangular: U_{ij} = 0 for j < i"
      - "P is a permutation matrix"

proof_obligations:
  - type: invariant
    property: "Factorization backward error"
    formal: "||PA - LU||_F / ||A||_F < n · u"
    tolerance: "n * f32::EPSILON"
    applies_to: all

  - type: invariant
    property: "Triangular structure"
    formal: "L[i][j] == 0 for j > i && U[i][j] == 0 for j < i"
    applies_to: all

  - type: invariant
    property: "Solution accuracy for Ax=b"
    formal: "||Ax_computed - b||_inf / (||A||_inf · ||x||_inf) < κ(A) · n · u"
    applies_to: all

  - type: equivalence
    property: "GPU matches CPU reference"
    tolerance: 32.0
    applies_to: gpu
```
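An unblocked reference factorization satisfying the contract above can be sketched as follows (in-place, row-major; the production kernel would block the trailing update into GEMM calls per Golub & Van Loan — this sketch is for the contract tests, not performance):

```rust
/// In-place LU with partial pivoting on a row-major n x n matrix.
/// On success, L (unit diagonal, implicit) and U overwrite `a`, and
/// piv[k] records which row was swapped into position k (PA = LU).
pub fn lu_factor(a: &mut [f64], n: usize, piv: &mut [usize]) -> Result<(), &'static str> {
    for k in 0..n {
        // Partial pivot: largest |a[i][k]| for i >= k bounds the multipliers by 1.
        let mut p = k;
        for i in k + 1..n {
            if a[i * n + k].abs() > a[p * n + k].abs() {
                p = i;
            }
        }
        if a[p * n + k] == 0.0 {
            return Err("singular matrix");
        }
        piv[k] = p;
        if p != k {
            for j in 0..n {
                a.swap(k * n + j, p * n + j);
            }
        }
        // Eliminate below the pivot; the multipliers become column k of L.
        for i in k + 1..n {
            let m = a[i * n + k] / a[k * n + k];
            a[i * n + k] = m;
            for j in k + 1..n {
                a[i * n + j] -= m * a[k * n + j];
            }
        }
    }
    Ok(())
}
```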

**Contract**: `solve-qr-v1.yaml`

```yaml
equations:
  qr_factorization:
    formula: "A = QR where Q orthogonal (Q^T Q = I), R upper triangular"
    domain: "A ∈ ℝ^{m×n}, m ≥ n"
    codomain: "Q ∈ ℝ^{m×m}, R ∈ ℝ^{m×n}"
    invariants:
      - "Q^T Q = I (orthogonality)"
      - "||A - QR||_F / ||A||_F < √n · u"
      - "R is upper triangular"

proof_obligations:
  - type: invariant
    property: "Orthogonality"
    formal: "||Q^T Q - I||_F < √n · u"
    tolerance: "sqrt(n) * f32::EPSILON"
    applies_to: all

  - type: invariant
    property: "Factorization accuracy"
    formal: "||A - QR||_F / ||A||_F < √n · u"
    tolerance: "sqrt(n) * f32::EPSILON"
    applies_to: all
```

**Contract**: `solve-svd-v1.yaml`

```yaml
equations:
  svd:
    formula: "A = UΣV^T where U,V orthogonal, Σ diagonal with σ_1 ≥ σ_2 ≥ ... ≥ σ_{min(m,n)} ≥ 0"
    domain: "A ∈ ℝ^{m×n}"
    codomain: "U ∈ ℝ^{m×m}, Σ ∈ ℝ^{m×n}, V ∈ ℝ^{n×n}"
    invariants:
      - "U^T U = I, V^T V = I"
      - "Σ diagonal with non-negative decreasing entries"
      - "||A - UΣV^T||_F / ||A||_F < √n · u"

proof_obligations:
  - type: invariant
    property: "Orthogonality of U"
    formal: "||U^T U - I||_F < √m · u"
    tolerance: "sqrt(m) * f32::EPSILON"
    applies_to: all

  - type: invariant
    property: "Orthogonality of V"
    formal: "||V^T V - I||_F < √n · u"
    tolerance: "sqrt(n) * f32::EPSILON"
    applies_to: all

  - type: ordering
    property: "Singular values non-increasing"
    formal: "σ_i ≥ σ_{i+1} for all i"
    applies_to: all

  - type: bound
    property: "Singular values non-negative"
    formal: "σ_i ≥ 0 for all i"
    applies_to: all

  - type: invariant
    property: "Reconstruction accuracy"
    formal: "||A - UΣV^T||_F / ||A||_F < √min(m,n) · u"
    tolerance: "sqrt(min(m,n)) * f32::EPSILON"
    applies_to: all
```

### 4B. Brick Profiling Design

```rust
pub enum BrickId {
    // ... existing ...
    LUFactorize,
    LUSolve,
    QRFactorize,
    QRSolve,
    SVDCompute,
    CholeskyFactorize,
    CholeskySolve,
    EigenSymmetric,
    TriangularSolve,
}

pub enum BrickCategory {
    // ... existing ...
    Solver,
}
```

**Roofline Classification**:
- Panel factorization (LU/QR column operations): **memory-bound**
- Trailing matrix update (GEMM): **compute-bound**
- SVD (Jacobi rotations): **compute-bound** for small matrices
- Triangular solve: **memory-bound**

**Tuner Features**:
```rust
pub struct SolverTunerFeatures {
    pub n: u64,                      // Matrix dimension
    pub m: u64,                      // Rows (for rectangular)
    pub algorithm: SolverAlgorithm,  // LU, QR, SVD, Cholesky
    pub is_positive_definite: bool,  // Can use Cholesky
    pub is_symmetric: bool,          // Can use symmetric eigensolvers
    pub condition_number_estimate: f64, // For algorithm selection
    pub batch_size: u64,             // For batched solvers
}
```
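As a sketch of how these features might drive algorithm selection (the thresholds and the trimmed-down struct below are illustrative, not part of the tuner API):

```rust
/// Hypothetical selection heuristic over a trimmed copy of the tuner
/// features above. Thresholds are illustrative, not measured.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum SolverAlgorithm { Lu, Qr, Svd, Cholesky }

pub struct SolverTunerFeatures {
    pub n: u64,
    pub m: u64,
    pub is_positive_definite: bool,
    pub is_symmetric: bool,
    pub condition_number_estimate: f64,
}

pub fn select_algorithm(f: &SolverTunerFeatures) -> SolverAlgorithm {
    if f.is_positive_definite {
        // Cholesky: ~2x cheaper than LU, but SPD-only.
        SolverAlgorithm::Cholesky
    } else if f.condition_number_estimate > 1.0 / f32::EPSILON as f64 {
        // Effectively singular in f32: fall back to rank-revealing SVD.
        SolverAlgorithm::Svd
    } else if f.m > f.n {
        // Rectangular least-squares system: QR.
        SolverAlgorithm::Qr
    } else {
        SolverAlgorithm::Lu
    }
}
```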

### 4C. Pure Rust Implementation

#### Block LU Factorization

```rust
pub trait Solver {
    /// LU factorization with partial pivoting: PA = LU
    fn lu_factorize(&self, a: &mut [f32], n: usize, pivot: &mut [i32]) -> SolverResult<()>;

    /// Solve Ax = b using LU factors
    fn lu_solve(&self, lu: &[f32], n: usize, pivot: &[i32], b: &mut [f32]) -> SolverResult<()>;

    /// QR factorization via Householder reflections: A = QR
    fn qr_factorize(&self, a: &mut [f32], m: usize, n: usize,
                     tau: &mut [f32]) -> SolverResult<()>;

    /// SVD: A = U Σ V^T
    fn svd(&self, a: &mut [f32], m: usize, n: usize,
           u: &mut [f32], s: &mut [f32], vt: &mut [f32]) -> SolverResult<()>;

    /// Cholesky: A = LL^T for positive definite A
    fn cholesky(&self, a: &mut [f32], n: usize) -> SolverResult<()>;
}
```
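A minimal scalar reference for the first two methods, useful as the CPU oracle in contract tests (row-major layout assumed; `lu_factorize_ref` / `lu_solve_ref` are illustrative names, not part of the trait):

```rust
/// In-place PA = LU with partial pivoting on a row-major n×n matrix.
/// `pivot[k]` records the row swapped into position k at step k.
pub fn lu_factorize_ref(a: &mut [f32], n: usize, pivot: &mut [i32]) -> Result<(), &'static str> {
    for k in 0..n {
        // Partial pivoting: pick the largest |a[i][k]| for i >= k.
        let p = (k..n)
            .max_by(|&i, &j| a[i * n + k].abs().total_cmp(&a[j * n + k].abs()))
            .unwrap();
        if a[p * n + k] == 0.0 { return Err("singular matrix"); }
        pivot[k] = p as i32;
        if p != k { for j in 0..n { a.swap(k * n + j, p * n + j); } }
        for i in k + 1..n {
            let l = a[i * n + k] / a[k * n + k];
            a[i * n + k] = l; // store L strictly below the diagonal
            for j in k + 1..n { a[i * n + j] -= l * a[k * n + j]; }
        }
    }
    Ok(())
}

/// Solve Ax = b given the packed LU factors and pivot vector.
pub fn lu_solve_ref(lu: &[f32], n: usize, pivot: &[i32], b: &mut [f32]) {
    for k in 0..n { b.swap(k, pivot[k] as usize); } // apply P
    for i in 0..n { // forward substitution: Ly = Pb (unit diagonal)
        for j in 0..i { b[i] -= lu[i * n + j] * b[j]; }
    }
    for i in (0..n).rev() { // back substitution: Ux = y
        for j in i + 1..n { b[i] -= lu[i * n + j] * b[j]; }
        b[i] /= lu[i * n + i];
    }
}
```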

**GPU Strategy**: Block hybrid approach
1. **Panel factorization**: Small matrix operations on CPU (latency-sensitive)
2. **Trailing matrix update**: GEMM on GPU (compute-intensive)
3. **Look-ahead**: Overlap next panel with current update

For SVD, use one-sided Jacobi (arXiv:1707.05141), which is inherently parallel: rotations on disjoint column pairs are independent and can be applied concurrently. For large matrices, the bidiagonal divide-and-conquer approach (arXiv:2508.11467) reports up to a 1293x speedup over rocSOLVER.

### 4D. World-Class Algorithmic Foundation

**LU**: Block LU with partial pivoting (Golub & Van Loan Ch. 3)
- Panel factorization + Schur complement update
- GPU: trailing update is GEMM (90%+ of compute)

**QR**: Householder with block reflectors (arXiv:1912.05508)
- TensorCore-accelerated: 2.9-14.7x speedup for tall-skinny matrices
- Block size tuned to shared memory capacity

**SVD**: Hierarchically blocked one-sided Jacobi (arXiv:1707.05141)
- Inherently parallel: rotations on disjoint column pairs are independent
- Maps to GPU memory hierarchy: warp-level → block-level → grid-level
- For large matrices: bidiagonal divide-and-conquer (arXiv:2508.11467)
- Randomized SVD for approximate low-rank (arXiv:2110.03423)
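A scalar sketch of the one-sided Jacobi idea (column-major layout, singular values only; `jacobi_singular_values` is an illustrative name, not the batched GPU kernel): rotate column pairs until all pairs are mutually orthogonal, then read the singular values off as column norms.

```rust
/// One-sided (Hestenes) Jacobi sweep over an m×n column-major matrix.
/// Destroys `a`; returns singular values in non-increasing order.
pub fn jacobi_singular_values(a: &mut [f32], m: usize, n: usize) -> Vec<f32> {
    for _sweep in 0..30 {
        let mut rotated = false;
        for p in 0..n {
            for q in p + 1..n {
                // Gram entries of the (p,q) column pair.
                let (mut app, mut aqq, mut apq) = (0.0f32, 0.0, 0.0);
                for i in 0..m {
                    let (x, y) = (a[p * m + i], a[q * m + i]);
                    app += x * x; aqq += y * y; apq += x * y;
                }
                if apq.abs() <= 1e-9 * (app * aqq).sqrt() { continue; }
                rotated = true;
                // Rotation angle that zeroes the (p,q) inner product.
                let zeta = (aqq - app) / (2.0 * apq);
                let t = zeta.signum() / (zeta.abs() + (1.0 + zeta * zeta).sqrt());
                let c = 1.0 / (1.0 + t * t).sqrt();
                let s = c * t;
                for i in 0..m {
                    let (x, y) = (a[p * m + i], a[q * m + i]);
                    a[p * m + i] = c * x - s * y;
                    a[q * m + i] = s * x + c * y;
                }
            }
        }
        if !rotated { break; } // all pairs orthogonal: converged
    }
    let mut sv: Vec<f32> = (0..n)
        .map(|j| a[j * m..(j + 1) * m].iter().map(|x| x * x).sum::<f32>().sqrt())
        .collect();
    sv.sort_by(|x, y| y.total_cmp(x)); // non-increasing, per the contract
    sv
}
```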

**Cholesky**: Block Cholesky with GEMM-heavy updates
- Positive definite only; ~2x faster than LU
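A scalar reference for the Cholesky method above (row-major, lower triangle computed in place; illustrative, not the blocked GPU path):

```rust
/// In-place A = L·L^T on the lower triangle of a row-major n×n SPD matrix.
/// Fails if a non-positive pivot appears (A not positive definite).
pub fn cholesky_ref(a: &mut [f32], n: usize) -> Result<(), &'static str> {
    for j in 0..n {
        // Diagonal entry: subtract squares of the row of L built so far.
        let mut d = a[j * n + j];
        for k in 0..j { d -= a[j * n + k] * a[j * n + k]; }
        if d <= 0.0 { return Err("matrix is not positive definite"); }
        let d = d.sqrt();
        a[j * n + j] = d;
        // Column below the diagonal.
        for i in j + 1..n {
            let mut s = a[i * n + j];
            for k in 0..j { s -= a[i * n + k] * a[j * n + k]; }
            a[i * n + j] = s / d;
        }
    }
    Ok(())
}
```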

**Reference Papers**:
1. Golub & Van Loan, "Matrix Computations", 4th ed., 2013
2. Efficient Batch SVD on GPUs, arXiv:2601.17979 (January 2026)
3. GPU Divide-and-Conquer SVD, arXiv:2508.11467 (1293x over rocSOLVER)
4. High Accuracy Low Precision QR, arXiv:1912.05508 (TensorCore QR)
5. Batched QR and SVD on GPUs, arXiv:1707.05141 (one-sided Jacobi)
6. Performant Unified GPU SVD Kernels, arXiv:2508.06339 (Julia-based)

---

## 5. trueno-image: GPU Image Processing

### 5A. Provable-Contracts Design

**Contract**: `image-conv2d-v1.yaml`

```yaml
metadata:
  version: "1.0.0"
  author: "Sovereign AI Stack"
  description: "2D image convolution (separable and non-separable)"
  references:
    - "Gonzalez & Woods, 'Digital Image Processing', 4th ed."
    - "Parallel Canny Edge Detection, PLOS One 2024"

equations:
  conv2d:
    formula: "O(i,j) = Σ_p Σ_q I(i-p, j-q) · K(p, q)"
    domain: "I ∈ ℝ^{H×W}, K ∈ ℝ^{kh×kw}"
    codomain: "O ∈ ℝ^{H×W} (same-padding)"
    invariants:
      - "Output dimensions match input (with same-padding)"
      - "Separable kernel: K = v · h^T gives same result as 2D convolution"
      - "Commutativity: I * K = K * I (true convolution flips the kernel; cross-correlation is not commutative)"

  separable_conv2d:
    formula: "O = (I * v) * h^T where K = v · h^T"
    domain: "I ∈ ℝ^{H×W}, v ∈ ℝ^{kh}, h ∈ ℝ^{kw}"
    codomain: "O ∈ ℝ^{H×W}"
    invariants:
      - "Result matches non-separable convolution within tolerance"
      - "Complexity: O(H·W·(kh+kw)) vs O(H·W·kh·kw)"

proof_obligations:
  - type: equivalence
    property: "Separable matches non-separable"
    formal: "|conv2d(I, v*h^T) - separable_conv2d(I, v, h)| < ε"
    tolerance: 1.0e-5
    applies_to: all

  - type: invariant
    property: "Gaussian blur is symmetric"
    formal: "blur(I, σ) has same result for any axis ordering"
    applies_to: all

  - type: equivalence
    property: "GPU matches CPU"
    tolerance: 4.0  # ULP
    applies_to: simd

falsification_tests:
  - id: FALSIFY-IMG-001
    rule: "Separable equivalence"
    prediction: "Separable Gaussian matches 2D Gaussian within tolerance"
    test: "proptest with random images and Gaussian kernels"
    if_fails: "Separable decomposition or border handling error"

  - id: FALSIFY-IMG-002
    rule: "Identity kernel"
    prediction: "Convolution with delta kernel reproduces input exactly"
    test: "conv2d(I, [[0,0,0],[0,1,0],[0,0,0]]) == I"
    if_fails: "Off-by-one in kernel centering"

  - id: FALSIFY-IMG-003
    rule: "Canny edge detection"
    prediction: "Edges are 1-pixel wide (non-maximum suppression property)"
    test: "Verify no two adjacent pixels are both marked as edges along gradient direction"
    if_fails: "NMS implementation error"
```

### 5B. Brick Profiling Design

```rust
pub enum BrickId {
    // ... existing ...
    Conv2D,
    GaussianBlur,
    SobelFilter,
    Canny,
    Morphology,
    Histogram,
    Resize,
    ColorConvert,
}

pub enum BrickCategory {
    // ... existing ...
    ImageProcessing,
}
```

### 5C. Pure Rust Implementation

```rust
pub trait ImageOps {
    // Filtering
    fn conv2d(&self, image: &ImageBuf, kernel: &[f32], kw: usize, kh: usize,
              border: BorderMode) -> ImageBuf;
    fn gaussian_blur(&self, image: &ImageBuf, sigma: f32) -> ImageBuf;
    fn sobel(&self, image: &ImageBuf) -> (ImageBuf, ImageBuf); // (gx, gy)
    fn canny(&self, image: &ImageBuf, low: f32, high: f32) -> ImageBuf;

    // Morphology
    fn dilate(&self, image: &ImageBuf, element: &StructElement) -> ImageBuf;
    fn erode(&self, image: &ImageBuf, element: &StructElement) -> ImageBuf;

    // Geometry
    fn resize(&self, image: &ImageBuf, new_w: usize, new_h: usize,
              interp: Interpolation) -> ImageBuf;

    // Color
    fn rgb_to_gray(&self, image: &ImageBuf) -> ImageBuf;
    fn rgb_to_hsv(&self, image: &ImageBuf) -> ImageBuf;

    // Analysis
    fn histogram(&self, image: &ImageBuf, bins: usize) -> Vec<u32>;
    fn connected_components(&self, binary: &ImageBuf) -> (ImageBuf, u32);
}

pub struct ImageBuf {
    pub data: Vec<u8>,     // or Vec<f32>
    pub width: usize,
    pub height: usize,
    pub channels: usize,
    pub dtype: ImageDType,
}

pub enum BorderMode { Replicate, Reflect, Zero, Wrap }
pub enum Interpolation { Nearest, Bilinear, Bicubic, Lanczos }
```

**GPU Strategy**: Each operation is a separate PTX kernel:
- **Conv2D**: Shared memory tiling with halo (border elements loaded from global)
- **Gaussian blur**: Two 1D passes (separable): O(kh+kw) work per pixel instead of O(kh·kw)
- **Sobel**: 3x3 shared memory tile, 8 neighbor reads per pixel
- **Histogram**: Per-block shared memory histogram + global atomicAdd reduction
- **Resize**: Output-centric parallelism (each thread computes one output pixel)

### 5D. World-Class Algorithmic Foundation

**Canny Pipeline** (GPU, per PLOS One 2024):
1. Gaussian blur (separable, shared memory)
2. Sobel gradient magnitude + direction
3. Non-maximum suppression (per-pixel, gradient direction lookup)
4. Hysteresis thresholding (parallel connected component propagation)

**Separable Convolution**: Two 1D passes — first horizontal (row-parallel), then vertical (column-parallel). Each pass uses shared memory tiling with halo regions. This is the standard approach used by NPP internally.
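A CPU sketch of the two-pass scheme (zero-padding border, cross-correlation orientation; function names are illustrative, not the GPU kernels):

```rust
/// Horizontal 1D pass over a row-major w×h single-channel image.
pub fn conv1d_rows(img: &[f32], w: usize, h: usize, k: &[f32]) -> Vec<f32> {
    let r = (k.len() / 2) as isize; // kernel radius
    let mut out = vec![0.0f32; w * h];
    for y in 0..h {
        for x in 0..w as isize {
            let mut acc = 0.0;
            for (t, &kv) in k.iter().enumerate() {
                let xx = x + t as isize - r;
                if xx >= 0 && (xx as usize) < w { acc += img[y * w + xx as usize] * kv; }
            }
            out[y * w + x as usize] = acc;
        }
    }
    out
}

/// Vertical 1D pass.
pub fn conv1d_cols(img: &[f32], w: usize, h: usize, k: &[f32]) -> Vec<f32> {
    let r = (k.len() / 2) as isize;
    let mut out = vec![0.0f32; w * h];
    for y in 0..h as isize {
        for x in 0..w {
            let mut acc = 0.0;
            for (t, &kv) in k.iter().enumerate() {
                let yy = y + t as isize - r;
                if yy >= 0 && (yy as usize) < h { acc += img[yy as usize * w + x] * kv; }
            }
            out[y as usize * w + x] = acc;
        }
    }
    out
}

/// Separable filter for K = v · h^T: horizontal pass, then vertical pass.
pub fn separable_conv2d(img: &[f32], w: usize, h: usize, vk: &[f32], hk: &[f32]) -> Vec<f32> {
    conv1d_cols(&conv1d_rows(img, w, h, hk), w, h, vk)
}
```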

**Reference Papers**:
1. Parallel Canny Edge Detection, PLOS One 2024 (1.21x over CUDA Canny)
2. cubic: CUDA-accelerated Bioimage Computing, arXiv:2510.14143

---

## 6. trueno-rand: Parallel Random Number Generation

### 6A. Provable-Contracts Design

**Contract**: `rand-philox-v1.yaml`

```yaml
metadata:
  version: "1.0.0"
  author: "Sovereign AI Stack"
  description: "Philox 4x32-10 counter-based PRNG for GPU"
  references:
    - "Salmon et al., 'Parallel Random Numbers: As Easy as 1, 2, 3', SC 2011"

equations:
  philox4x32:
    formula: "output = philox_round^10(counter, key) where each round applies a Feistel network"
    domain: "counter ∈ ℤ^4 (128-bit), key ∈ ℤ^2 (64-bit)"
    codomain: "output ∈ {0,1}^128 (4 × 32-bit random values)"
    invariants:
      - "Deterministic: same (counter, key) always produces same output"
      - "Period: 2^128 (counter space)"
      - "Statistical quality: passes BigCrush (TestU01)"

proof_obligations:
  - type: invariant
    property: "Determinism"
    formal: "philox(c, k) == philox(c, k) always"
    applies_to: all

  - type: invariant
    property: "Counter independence"
    formal: "philox(c1, k) and philox(c2, k) are statistically independent for c1 ≠ c2"
    applies_to: all

  - type: equivalence
    property: "GPU matches CPU bit-exact"
    tolerance: 0.0
    applies_to: simd

falsification_tests:
  - id: FALSIFY-RNG-001
    rule: "Determinism"
    prediction: "Same counter+key always produces same output"
    test: "Generate twice with same state, compare bit-exact"
    if_fails: "Uninitialized state or race condition"

  - id: FALSIFY-RNG-002
    rule: "Uniformity"
    prediction: "Chi-squared test passes for bucket distribution"
    test: "Generate 1M samples, chi-squared goodness-of-fit"
    if_fails: "Round function is biased"

  - id: FALSIFY-RNG-003
    rule: "GPU-CPU bit-exactness"
    prediction: "GPU and CPU produce identical bits"
    test: "Generate 10K values on GPU and CPU, compare bit-exact"
    if_fails: "Integer arithmetic divergence (should be impossible for Philox)"

kani_harnesses:
  - id: KANI-RNG-001
    obligation: "Feistel round invertibility"
    property: "Each round is bijective (no information loss)"
    bound: 4
    strategy: exhaustive
    harness: verify_feistel_bijective
```

### 6B-C. Implementation

**Philox 4x32-10** (Salmon et al., SC 2011):
- Counter-based: embarrassingly parallel (no state sharing between threads)
- Each GPU thread computes `philox(global_thread_id + offset, key)` for 4 random u32s
- 10 rounds of Feistel network using widening multiply
- Passes BigCrush statistical test suite

```rust
pub trait GpuRng {
    fn fill_uniform_f32(&self, output: &mut GpuBuffer<f32>, seed: u64, offset: u64);
    fn fill_normal_f32(&self, output: &mut GpuBuffer<f32>, seed: u64, mean: f32, std: f32);
    fn fill_uniform_u32(&self, output: &mut GpuBuffer<u32>, seed: u64, offset: u64);
}
```

**Normal distribution**: Box-Muller transform applied to pairs of uniform values.
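The transform can be sketched as follows (scalar reference; the GPU kernel would apply this element-wise to pairs of Philox outputs):

```rust
/// Box-Muller: maps two uniforms u1 ∈ (0,1], u2 ∈ [0,1) to two
/// independent standard-normal samples.
pub fn box_muller(u1: f32, u2: f32) -> (f32, f32) {
    use std::f32::consts::PI;
    let r = (-2.0 * u1.ln()).sqrt(); // radius from the first uniform
    let theta = 2.0 * PI * u2;       // angle from the second
    (r * theta.cos(), r * theta.sin())
}
```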

### 6D. World-Class Algorithmic Foundation

**Reference**: Salmon et al., "Parallel Random Numbers: As Easy as 1, 2, 3", SC 2011
- Counter-based design: no per-thread state, perfect for GPU
- Philox: 10 rounds of Feistel using 32×32→64 widening multiply
- Threefry: 20 rounds using rotations and XOR (backup for non-multiply hardware)
- Both pass BigCrush and Crush from TestU01
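A scalar sketch of the Philox 4x32-10 round structure, with the multiplier and Weyl constants from Salmon et al. (illustrative CPU reference, not the PTX kernel):

```rust
// Round multipliers and Weyl key increments from Salmon et al., SC 2011.
const PHILOX_M0: u32 = 0xD251_1F53;
const PHILOX_M1: u32 = 0xCD9E_8D57;
const PHILOX_W0: u32 = 0x9E37_79B9;
const PHILOX_W1: u32 = 0xBB67_AE85;

/// 32×32→64 widening multiply, split into (hi, lo) halves.
fn mulhilo(a: u32, b: u32) -> (u32, u32) {
    let p = (a as u64) * (b as u64);
    ((p >> 32) as u32, p as u32)
}

/// Ten Feistel rounds over a 128-bit counter and 64-bit key.
/// Pure function of its inputs: any (counter, key) pair can be
/// evaluated independently, which is what makes it GPU-friendly.
pub fn philox4x32_10(mut ctr: [u32; 4], mut key: [u32; 2]) -> [u32; 4] {
    for _ in 0..10 {
        let (hi0, lo0) = mulhilo(PHILOX_M0, ctr[0]);
        let (hi1, lo1) = mulhilo(PHILOX_M1, ctr[2]);
        ctr = [hi1 ^ ctr[1] ^ key[0], lo1, hi0 ^ ctr[3] ^ key[1], lo0];
        key[0] = key[0].wrapping_add(PHILOX_W0); // bump the key each round
        key[1] = key[1].wrapping_add(PHILOX_W1);
    }
    ctr
}
```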

---

## 7. trueno-tensor: N-Dimensional Tensor Contractions

### 7A. Provable-Contracts Design

**Contract**: `tensor-contraction-v1.yaml`

```yaml
metadata:
  version: "1.0.0"
  description: "General tensor contraction (Einstein summation)"
  references:
    - "Springer et al., 'Design of a High-Performance GEMM-like Tensor-Tensor Multiplication', ACM TOMS 2018"

equations:
  contraction:
    formula: "C_{free_C} = α · Σ_{contracted} A_{free_A, contracted} · B_{free_B, contracted} + β · C_{free_C}"
    domain: "A ∈ ℝ^{d1×...×dk}, B ∈ ℝ^{dk+1×...×dn}"
    codomain: "C with dimensions = free indices of A ∪ free indices of B"
    invariants:
      - "Contracted dimensions must match between A and B"
      - "Output dimensions = free_A dimensions × free_B dimensions"
      - "For rank-2 tensors, reduces to GEMM"

proof_obligations:
  - type: invariant
    property: "Reduces to GEMM for matrices"
    formal: "contraction(A[m,k], B[k,n], 'ik,kj->ij') == gemm(A, B)"
    tolerance: 1.0e-5
    applies_to: all

  - type: invariant
    property: "Output shape correctness"
    formal: "output_shape == infer_shape(A_shape, B_shape, contraction_indices)"
    applies_to: all

  - type: associativity
    property: "Contraction order independence"
    formal: "|contract(contract(A,B), C) - contract(A, contract(B,C))| < ε"
    tolerance: 1.0e-3
    applies_to: all
```

### 7C. Pure Rust Implementation

**Strategy**: Reduce all tensor contractions to GEMM via TTGT (Transpose-Transpose-GEMM-Transpose):
1. Reshape/transpose A to 2D: (free_A, contracted)
2. Reshape/transpose B to 2D: (contracted, free_B)
3. Call existing trueno GEMM
4. Reshape/transpose result to output dimensions

This leverages the heavily optimized GEMM already in trueno-gpu (tiled, tensor core variants).

For specialized contractions (e.g., batched outer products, trace), emit custom PTX kernels that avoid the transpose overhead.

```rust
pub fn einsum(
    subscripts: &str,          // e.g., "ijk,jkl->il"
    inputs: &[&Tensor],
    output: &mut Tensor,
) -> Result<(), TensorError>;
```
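A minimal illustration of TTGT for the fixed pattern `"ijk,kl->ijl"`, where the free indices of A fold directly into the GEMM row dimension so no transpose is needed (a naive GEMM stands in for the tuned trueno kernel; function names are illustrative):

```rust
/// Naive row-major GEMM: C(m×n) = A(m×k) · B(k×n).
fn gemm_ref(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut c = vec![0.0f32; m * n];
    for i in 0..m {
        for p in 0..k {
            let av = a[i * k + p];
            for j in 0..n { c[i * n + j] += av * b[p * n + j]; }
        }
    }
    c
}

/// Contract A[i,j,k] with B[k,l] over k, producing C[i,j,l].
/// A already has k innermost, so the "TT" of TTGT is a no-op here;
/// the general case would permute axes before and after the GEMM.
pub fn contract_ijk_kl(a: &[f32], b: &[f32], di: usize, dj: usize, dk: usize, dl: usize) -> Vec<f32> {
    gemm_ref(a, b, di * dj, dk, dl) // (di·dj, dk) × (dk, dl) → (di·dj, dl)
}
```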

---

## 8. Enhanced Dense BLAS (cuBLAS Parity)

### Missing Operations (Priority Order)

| Operation | cuBLAS Name | Status | Priority |
|---|---|---|---|
| General GEMM | gemm | Exists (tiled, tensor core) | Done |
| Batched GEMM | gemmBatched | Exists (batched_4d) | Done |
| Strided Batched | gemmStridedBatched | Partial | P1 |
| Mixed Precision | gemmEx | Missing | P1 |
| Symmetric Rank-k | syrk/syr2k | Missing | P2 |
| Triangular Solve | trsm | Missing (needed for solvers) | P1 |
| Triangular Multiply | trmm | Missing | P2 |
| Symmetric Multiply | symm/hemm | Missing | P2 |
| Level-2 GEMV | gemv | Exists (CoalescedGemvKernel) | Done |
| Level-1 (dot, nrm2, etc.) | Various | Exists in trueno core | Done |
| Epilogue Fusion | gemmEx with epilogue | Missing | P2 |

**Contract**: `blas-trsm-v1.yaml` — triangular solve is critical for LU/QR/Cholesky solvers.

```yaml
equations:
  trsm:
    formula: "Solve AX = B for X, where A is triangular"
    domain: "A ∈ ℝ^{n×n} (triangular), B ∈ ℝ^{n×nrhs}"
    codomain: "X ∈ ℝ^{n×nrhs}"
    invariants:
      - "AX = B within backward error"
      - "||AX - B||_F / (||A||_F · ||X||_F) < n · u"
```
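A scalar reference for the lower-triangular, non-unit-diagonal case of this contract (forward substitution; illustrative, not the GPU kernel):

```rust
/// Solve A·X = B in place, where A is a row-major n×n lower-triangular
/// matrix (non-unit diagonal) and B is row-major n×nrhs.
pub fn trsm_lower_ref(a: &[f32], n: usize, b: &mut [f32], nrhs: usize) {
    for i in 0..n {
        for r in 0..nrhs {
            let mut s = b[i * nrhs + r];
            // Forward substitution: subtract already-solved rows.
            for j in 0..i { s -= a[i * n + j] * b[j * nrhs + r]; }
            b[i * nrhs + r] = s / a[i * n + i];
        }
    }
}
```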

---

## 9. Enhanced Compression (nvCOMP Parity)

### Current State

trueno has LZ4 CPU + GPU kernels. trueno-zram provides ZRAM integration. Missing:

| Algorithm | nvCOMP | Sovereign Stack | Priority |
|---|---|---|---|
| LZ4 | Yes | Yes (trueno LZ4 kernel) | Done |
| Zstd | Yes | trueno-zram (CPU) | P1: GPU kernel |
| Snappy | Yes | Missing | P3 |
| GDeflate | Yes | Missing | P3 |
| Cascaded | Yes | Missing | P3 |
| ANS | Yes | Missing | P2 |

**Strategy**: Focus on Zstd GPU kernel (most impactful for KV cache compression in realizar) and ANS (used for neural codec compression).

---

## 10. Implementation Roadmap

### Phase 1: Foundation (Months 1-3)

| Priority | Crate | Deliverable | Contract |
|---|---|---|---|
| P0 | trueno-sparse | CSR/COO formats + CPU SIMD SpMV | sparse-spmv-v1.yaml |
| P0 | trueno-sparse | Merge-based GPU SpMV kernel | sparse-spmv-v1.yaml |
| P0 | trueno-gpu | Triangular solve (TRSM) kernel | blas-trsm-v1.yaml |
| P0 | trueno-gpu | Mixed-precision GEMM (FP16 in, FP32 acc) | blas-gemmex-v1.yaml |

### Phase 2: Solvers + FFT (Months 3-6)

| Priority | Crate | Deliverable | Contract |
|---|---|---|---|
| P1 | trueno-solve | LU factorization (block, GPU) | solve-lu-v1.yaml |
| P1 | trueno-solve | QR factorization (Householder, GPU) | solve-qr-v1.yaml |
| P1 | trueno-solve | SVD (one-sided Jacobi, batched) | solve-svd-v1.yaml |
| P1 | trueno-fft | 1D C2C + R2C Stockham FFT (GPU) | fft-stockham-v1.yaml |
| P1 | trueno-fft | 2D FFT via 1D + transpose | fft-2d-v1.yaml |
| P1 | trueno-sparse | SpMM kernel (row-split + merge) | sparse-spmm-v1.yaml |

### Phase 3: Image + Ecosystem (Months 6-9)

| Priority | Crate | Deliverable | Contract |
|---|---|---|---|
| P2 | trueno-image | Separable conv2d + Gaussian blur | image-conv2d-v1.yaml |
| P2 | trueno-image | Sobel + Canny edge detection | image-canny-v1.yaml |
| P2 | trueno-image | Histogram + morphology | image-histogram-v1.yaml |
| P2 | trueno-rand | Philox 4x32-10 GPU RNG | rand-philox-v1.yaml |
| P2 | trueno-tensor | Einstein summation via TTGT | tensor-contraction-v1.yaml |
| P2 | trueno-solve | Cholesky factorization | solve-cholesky-v1.yaml |

### Phase 4: Polish + Parity (Months 9-12)

| Priority | Crate | Deliverable | Contract |
|---|---|---|---|
| P3 | trueno-sparse | SpGEMM (sparse × sparse) | sparse-spgemm-v1.yaml |
| P3 | trueno-sparse | BSR + SELL formats | sparse-formats-v1.yaml |
| P3 | trueno-fft | 3D FFT + batched | fft-3d-v1.yaml |
| P3 | trueno-fft | Bluestein non-power-of-two | fft-bluestein-v1.yaml |
| P3 | trueno-image | Resize (bilinear/bicubic/Lanczos) | image-resize-v1.yaml |
| P3 | trueno-image | Color conversion + connected components | image-color-v1.yaml |
| P3 | trueno-gpu | Zstd GPU compression kernel | comp-zstd-v1.yaml |

---

## 11. Quality Gates (Per PAIML MCP Agent Toolkit Standards)

Every new crate and kernel must pass before merge to master:

### Code Quality

| Gate | Threshold | Tool |
|---|---|---|
| Test Coverage | >= 95% line | `cargo llvm-cov` |
| Mutation Score | >= 80% kill | `cargo mutants` |
| TDG Grade | >= A- (88+) | `pmat tdg` |
| Cyclomatic Complexity | <= 15 per fn | `pmat complexity` |
| SATD Count | 0 | `pmat satd` |
| Clippy | 0 warnings | `cargo clippy -- -D warnings` |
| Rustfmt | Pass | `cargo fmt --check` |

### Numerical Quality

| Gate | Verification | Tool |
|---|---|---|
| Backward Error | Per-contract tolerance | proptest + contract tests |
| SIMD-Scalar Parity | <= 8 ULP | proptest with ULP comparison |
| GPU-CPU Parity | <= 32 ULP | contract tests with GPU runner |
| Known-Value Tests | Exact match | Hardcoded reference values |
| Property Tests | 10K+ iterations | proptest |
| Kani Verification | All harnesses pass | `cargo kani` |

### Performance Quality

| Gate | Threshold | Tool |
|---|---|---|
| Roofline Efficiency | >= 50% of theoretical | BrickProfiler + roofline model |
| Budget Compliance | Meet contract budget | ComputeBrick.verify() |
| No Performance Regression | < 5% degradation | Criterion.rs + CI baseline |

### Provable-Contracts Pipeline

For each new operation:
1. **Phase 1 (Extract)**: Identify paper, extract equations
2. **Phase 2 (Specify)**: Write YAML contract with obligations + falsification tests
3. **Phase 3 (Scaffold)**: Generate trait + test stubs via `pv scaffold`
4. **Phase 4 (Implement)**: Scalar reference first, then SIMD, then GPU
5. **Phase 5 (Falsify)**: Property-based tests via `pv probar`
6. **Phase 6 (Verify)**: Kani bounded model checking via `pv verify`
7. **Phase 7 (Prove)**: Lean 4 proofs for critical properties (optional)

---

## 12. Integration with Sovereign Stack Components

### Batuta Oracle Integration Points

Per oracle recommendations:
- **trueno** is the compute foundation (85-95% confidence for linear algebra)
- **provable-contracts** validates correctness (95% confidence for verification)
- **pepita** provides kernel-level I/O primitives (io_uring, ublk)
- **repartir** distributes multi-GPU work (CPU/GPU executors, checkpointing)

### Downstream Consumer Benefits

| Consumer | New Capability Unlocked |
|---|---|
| **aprender** | PCA via SVD, spectral clustering via eigendecomp, graph Laplacian via sparse |
| **entrenar** | FFT-based convolution layers, sparse attention, QR-based optimizer |
| **realizar** | Sparse attention patterns, FFT for audio models, image preprocessing |
| **trueno-rag** | Sparse retrieval indices, FFT for audio chunking |
| **rmedia** | GPU image processing pipeline (resize, filter, color convert) |
| **simular** | FFT for physics simulation, sparse systems for FEM |

### Cross-Crate Contract Binding

Each new trueno capability gets a binding entry in `provable-contracts/contracts/trueno/binding.yaml`:

```yaml
version: "1.0.0"
target_crate: trueno
bindings:
  - contract: sparse-spmv-v1.yaml
    equation: spmv
    module_path: "trueno_sparse::ops::spmv_csr"
    function: spmv_csr
    status: not_implemented

  - contract: fft-stockham-v1.yaml
    equation: dft
    module_path: "trueno_fft::fft_1d"
    function: fft_1d
    status: not_implemented
```

---

## 13. Competitive Positioning

### Sovereign Stack Advantages Over CUDA Libraries

| Advantage | Details |
|---|---|
| **No vendor lock-in** | wgpu backend supports Vulkan, Metal, DX12, WebGPU |
| **Pure Rust safety** | Zero unsafe in public API, ownership-based memory safety |
| **Zero external toolchains** | No nvcc, no LLVM, no C/C++ compiler needed |
| **Cross-platform** | Same code runs on CPU SIMD, GPU, and WASM |
| **Formal verification** | provable-contracts with Kani + Lean 4 (CUDA has no equivalent) |
| **ML-based auto-tuning** | BrickTuner with learned kernel selection (CUDA relies on manual selection) |
| **Built-in profiling** | Every operation profiled via Brick with roofline analysis |
| **Cargo ecosystem** | `cargo add trueno-sparse` vs manual CMake + library linking |

### Where CUDA Libraries Still Lead

| Gap | CUDA Advantage | Timeline to Close |
|---|---|---|
| **Maturity** | 15+ years of optimization | 2-3 years for parity |
| **Sparse algorithm breadth** | 20+ SpMV algorithms, preconditioners | 6-12 months for core |
| **FFT performance** | Heavily tuned for each GPU architecture | 6-12 months for competitive |
| **Solver completeness** | Full LAPACK equivalent | 12-18 months |
| **Image processing breadth** | 200+ NPP functions | 12-24 months for core 50 |
| **Multi-GPU** | NCCL, cuBLASMp | Partially covered by repartir |

---

## Appendix A: Batuta Oracle Capability Map

```
┌──────────────────────────────────────────────────────────────┐
│                    Sovereign AI Stack                         │
│                                                              │
│  ┌─────────────────── Orchestration ──────────────────────┐  │
│  │ batuta (workflow)  repartir (distributed)  pforge (MCP) │  │
│  └────────────────────────────────────────────────────────┘  │
│                                                              │
│  ┌─────────────── Training & Inference ───────────────────┐  │
│  │ entrenar (train)   realizar (infer)   whisper-apr (ASR) │  │
│  └────────────────────────────────────────────────────────┘  │
│                                                              │
│  ┌──────────────────── ML Algorithms ─────────────────────┐  │
│  │ aprender (classical ML + neural + text + audio)         │  │
│  └────────────────────────────────────────────────────────┘  │
│                                                              │
│  ┌─────────────── Compute Primitives ─────────────────────┐  │
│  │ trueno (SIMD/GPU)   trueno-gpu (PTX)   pepita (kernel)  │  │
│  │ trueno-db (analytics)  trueno-graph  trueno-rag          │  │
│  │                                                          │  │
│  │ ┌── NEW (This Spec) ─────────────────────────────────┐  │  │
│  │ │ trueno-sparse    trueno-fft     trueno-solve        │  │  │
│  │ │ trueno-image     trueno-rand    trueno-tensor       │  │  │
│  │ └────────────────────────────────────────────────────┘  │  │
│  └────────────────────────────────────────────────────────┘  │
│                                                              │
│  ┌──────────────── Quality & Profiling ───────────────────┐  │
│  │ pmat (TDG)  certeza (coverage)  provable-contracts      │  │
│  │ probar (property)  renacer (strace)  batuta (oracle)    │  │
│  └────────────────────────────────────────────────────────┘  │
│                                                              │
│  ┌─────────────────── Transpilers ────────────────────────┐  │
│  │ depyler (Py→Rust)  decy (C→Rust)  ruchy (script→Rust)  │  │
│  └────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────┘
```

## Appendix B: CUDA Library → Sovereign Stack Mapping

| CUDA Library | Sovereign Crate | Status |
|---|---|---|
| cuBLAS Level 1 | trueno (vector_ops) | Complete |
| cuBLAS Level 2 | trueno (GEMV) + trueno-gpu (CoalescedGemvKernel) | ~80% |
| cuBLAS Level 3 | trueno-gpu (GEMM variants, batched) | ~70% |
| cuBLASLt | trueno-gpu (algorithm selection via BrickTuner) | ~50% |
| cuBLASMp | repartir (distributed executor) | ~30% |
| cuSPARSE | **trueno-sparse** (NEW) | 0% → Phase 1 |
| cuSPARSELt | **trueno-sparse** (structured sparsity) | 0% → Phase 3 |
| cuFFT | **trueno-fft** (NEW) | 0% → Phase 2 |
| cuFFTMp | repartir + trueno-fft (distributed) | 0% → Phase 4 |
| cuSOLVER Dense | **trueno-solve** (NEW) | 5% → Phase 2 |
| cuSOLVER Sparse | trueno-solve + trueno-sparse | 0% → Phase 3 |
| cuTENSOR | **trueno-tensor** (NEW) | 0% → Phase 3 |
| NPP | **trueno-image** (NEW) | 0% → Phase 3 |
| nvCOMP | trueno (LZ4) + trueno-zram | ~20% → Phase 4 |
| nvJPEG | **trueno-image** (codec module) | 0% → Phase 4 |
| cuRAND | **trueno-rand** (NEW) | 0% → Phase 3 |
| cuPQC | aprender (CPU crypto) | ~10% → Future |
| MathDx | trueno-gpu (device-side math) | ~40% |

---

*This specification is a living document. Each phase completion triggers a review and update cycle.*

*Generated with guidance from batuta oracle, provable-contracts framework, and arXiv research survey.*