aprender-compute 0.32.0

High-performance SIMD compute library with GPU support, LLM inference engine, and GGUF model loading (was: trueno)
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
1161
1162
1163
1164
1165
1166
1167
1168
1169
1170
1171
1172
1173
1174
1175
1176
1177
1178
1179
1180
1181
1182
1183
1184
1185
1186
1187
1188
1189
1190
1191
1192
1193
1194
1195
1196
1197
1198
1199
1200
1201
1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
1214
1215
1216
1217
1218
1219
1220
1221
1222
1223
1224
1225
1226
1227
1228
1229
1230
1231
1232
1233
1234
1235
1236
1237
1238
1239
1240
1241
1242
1243
1244
1245
1246
1247
1248
1249
1250
1251
1252
1253
1254
1255
1256
1257
1258
1259
1260
1261
1262
1263
1264
1265
1266
1267
1268
1269
1270
1271
1272
1273
1274
1275
1276
1277
1278
1279
1280
1281
1282
1283
1284
1285
1286
1287
1288
1289
1290
1291
1292
1293
1294
1295
1296
1297
1298
1299
1300
1301
1302
1303
1304
1305
1306
1307
1308
1309
1310
1311
1312
1313
1314
1315
1316
1317
1318
1319
1320
1321
1322
1323
1324
1325
1326
1327
1328
1329
1330
1331
1332
1333
1334
1335
1336
1337
1338
1339
1340
1341
1342
1343
1344
1345
1346
1347
1348
1349
1350
1351
1352
1353
1354
1355
1356
1357
1358
1359
1360
1361
1362
1363
1364
1365
1366
1367
1368
1369
1370
1371
1372
1373
1374
1375
1376
1377
1378
1379
1380
1381
1382
1383
1384
1385
1386
1387
1388
1389
1390
1391
1392
1393
1394
1395
1396
1397
1398
1399
1400
1401
1402
1403
1404
1405
1406
1407
1408
1409
1410
1411
1412
1413
1414
1415
1416
1417
1418
1419
1420
1421
1422
1423
1424
1425
1426
1427
1428
1429
1430
1431
1432
1433
1434
1435
1436
1437
1438
1439
1440
1441
1442
1443
1444
1445
1446
1447
1448
1449
1450
1451
1452
1453
1454
1455
1456
1457
1458
1459
1460
1461
1462
1463
1464
1465
1466
1467
1468
1469
1470
1471
1472
1473
1474
1475
1476
1477
1478
1479
1480
1481
1482
1483
1484
1485
1486
1487
1488
1489
1490
1491
1492
1493
1494
1495
1496
1497
1498
1499
1500
1501
1502
1503
1504
1505
1506
1507
1508
1509
1510
1511
1512
1513
1514
1515
1516
1517
1518
1519
1520
1521
1522
1523
1524
1525
1526
1527
1528
1529
1530
1531
1532
1533
1534
1535
1536
1537
1538
1539
1540
1541
1542
1543
1544
1545
1546
1547
1548
1549
1550
1551
1552
1553
1554
1555
1556
1557
1558
1559
1560
1561
1562
1563
1564
1565
1566
1567
1568
1569
1570
1571
1572
1573
1574
1575
1576
1577
1578
1579
1580
1581
1582
1583
1584
1585
1586
1587
1588
1589
1590
1591
1592
1593
1594
1595
1596
1597
1598
1599
1600
1601
1602
1603
1604
1605
1606
1607
1608
1609
1610
1611
1612
1613
1614
1615
1616
1617
1618
1619
1620
1621
1622
1623
1624
1625
1626
1627
1628
1629
1630
1631
1632
1633
1634
1635
1636
1637
1638
1639
1640
1641
1642
1643
1644
1645
1646
1647
1648
1649
1650
1651
1652
1653
1654
1655
1656
1657
1658
1659
1660
1661
1662
1663
1664
1665
1666
1667
1668
1669
1670
1671
1672
1673
1674
1675
1676
1677
1678
1679
1680
1681
1682
1683
1684
1685
1686
1687
1688
1689
1690
1691
1692
1693
1694
1695
1696
1697
1698
1699
1700
1701
1702
1703
1704
1705
1706
1707
1708
1709
1710
1711
1712
1713
1714
1715
1716
1717
1718
1719
1720
1721
1722
1723
1724
1725
1726
1727
1728
1729
1730
1731
1732
1733
1734
1735
1736
1737
1738
1739
1740
1741
1742
1743
1744
1745
1746
1747
1748
1749
1750
1751
1752
1753
1754
1755
1756
1757
1758
1759
1760
1761
1762
1763
1764
1765
1766
1767
1768
1769
1770
1771
1772
1773
1774
1775
1776
1777
1778
1779
1780
1781
1782
1783
1784
1785
1786
1787
1788
1789
1790
1791
1792
1793
1794
1795
1796
1797
1798
1799
1800
1801
1802
1803
1804
1805
1806
1807
1808
1809
1810
1811
1812
1813
1814
1815
1816
1817
1818
1819
1820
1821
1822
1823
1824
1825
1826
1827
1828
1829
1830
1831
1832
1833
1834
1835
1836
1837
1838
1839
1840
1841
1842
1843
1844
1845
1846
1847
1848
1849
1850
1851
1852
1853
1854
1855
1856
1857
1858
1859
1860
1861
1862
1863
1864
1865
1866
1867
1868
1869
1870
1871
1872
1873
1874
1875
1876
1877
1878
1879
1880
1881
1882
1883
1884
1885
1886
1887
1888
1889
1890
1891
1892
1893
1894
1895
1896
1897
1898
1899
1900
1901
1902
1903
1904
1905
1906
1907
1908
1909
1910
1911
1912
1913
1914
1915
1916
1917
1918
1919
1920
1921
1922
1923
1924
1925
1926
1927
1928
1929
1930
1931
1932
1933
1934
1935
1936
1937
1938
1939
1940
1941
1942
1943
1944
1945
1946
1947
1948
1949
1950
1951
1952
1953
1954
1955
1956
1957
1958
1959
1960
1961
1962
1963
1964
1965
1966
1967
1968
1969
1970
1971
1972
1973
1974
1975
1976
1977
1978
1979
1980
1981
1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
2024
2025
2026
2027
2028
2029
2030
2031
2032
2033
2034
2035
2036
2037
2038
2039
2040
2041
2042
2043
2044
2045
2046
2047
2048
2049
2050
2051
2052
2053
2054
2055
2056
2057
2058
2059
2060
2061
2062
2063
2064
2065
2066
2067
2068
2069
2070
2071
2072
2073
2074
2075
2076
2077
2078
2079
2080
2081
2082
2083
2084
2085
2086
2087
2088
2089
2090
2091
2092
2093
2094
2095
2096
2097
2098
2099
2100
2101
2102
2103
2104
2105
2106
2107
2108
2109
2110
2111
2112
2113
2114
2115
2116
2117
2118
2119
2120
2121
2122
2123
2124
2125
2126
2127
2128
2129
2130
2131
2132
2133
2134
2135
2136
2137
2138
2139
2140
2141
2142
2143
2144
2145
2146
2147
2148
2149
2150
2151
2152
2153
2154
2155
2156
2157
2158
2159
2160
2161
2162
2163
2164
2165
2166
2167
2168
2169
2170
2171
2172
2173
2174
2175
2176
2177
2178
2179
2180
2181
2182
2183
2184
2185
2186
2187
2188
2189
2190
2191
2192
2193
2194
2195
2196
2197
2198
2199
2200
2201
2202
2203
2204
2205
2206
2207
2208
2209
2210
2211
2212
# Trueno: PyTorch/NumPy Replacement Specification

**Version**: 1.0
**Date**: 2025-11-17
**Status**: Living Document
**Priority**: CRITICAL - Core Strategic Positioning

---

## Executive Summary

Trueno is a high-performance compute library designed as a **drop-in replacement** for NumPy (data processing) and PyTorch (inference + training) in Rust-native applications. This specification defines the roadmap from current state (v0.2.0) to full replacement parity.

**Current State (v0.2.0)**:
- **NumPy Replacement**: ~35% complete (1D operations only)
- **PyTorch Replacement**: ~15% complete (inference only, no autograd)
- **Strategic Positioning**: High-performance 1D vector compute library with GPU acceleration

**Target State (v1.0.0)**:
- **NumPy Replacement**: 80%+ complete (multi-dimensional arrays, broadcasting, core ops)
- **PyTorch Replacement**: 60%+ complete (autograd, training, common layers)
- **Strategic Positioning**: Production-ready deep learning library for Rust

---

## 1. Current State Analysis (v0.2.0)

### 1.1 NumPy Replacement: ~35% Complete

#### ✅ What Works Today

**1D Element-wise Operations** (100% coverage):
```rust
use trueno::Vector;

// NumPy: np.add(a, b), np.exp(a), np.log(a)
let a = Vector::from_slice(&[1.0, 2.0, 3.0]);
let b = Vector::from_slice(&[4.0, 5.0, 6.0]);
let result = a.add(&b).unwrap();  // Works identically to NumPy
```

**Supported operations**:
- Arithmetic: `add`, `sub`, `mul`, `div`, `neg`
- Transcendental: `exp`, `log`, `pow`, `sqrt`
- Trigonometric: `sin`, `cos`, `tan`
- Reductions: `sum`, `mean`, `std`, `min`, `max`, `norm`
- Linear algebra: `dot`, `matmul` (1D→2D via reshape)

**Reductions** (100% coverage for 1D):
```rust
// NumPy: np.sum(arr), np.mean(arr), np.std(arr)
let data = Vector::from_slice(&[1.0, 2.0, 3.0, 4.0, 5.0]);
let total = data.sum().unwrap();      // 15.0
let average = data.mean().unwrap();   // 3.0
let stdev = data.std().unwrap();      // ~1.414
```

**GPU Acceleration** (14 operations):
- **OpComplexity::Low** (>100K threshold): `vec_add`, `dot`, `relu`, `leaky_relu`, `elu`, `sigmoid`, `tanh`, `swish`, `gelu`, `clip`
- **OpComplexity::Medium** (>10K threshold): `softmax`, `log_softmax`
- **OpComplexity::High** (>1K threshold): `matmul`, `convolve2d`

**Performance**:
- CPU SIMD: 2-8x faster than scalar (SSE2/AVX/AVX2)
- GPU: 10-50x faster than scalar for large workloads (>100K elements)

#### ❌ Critical Gaps (Blocking NumPy Replacement)

**Multi-dimensional Arrays** (0% coverage):
```python
# NumPy - WORKS
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])  # 2D array (2×3)
arr.shape  # (2, 3)

# Trueno - MISSING
# Currently only supports 1D Vector<f32>
```

**Broadcasting** (0% coverage):
```python
# NumPy - WORKS
a = np.array([[1, 2, 3]])        # Shape (1, 3)
b = np.array([[1], [2], [3]])    # Shape (3, 1)
result = a + b                    # Shape (3, 3) via broadcasting

# Trueno - MISSING
# No broadcasting support, must manually tile/repeat
```

**Advanced Indexing** (0% coverage):
```python
# NumPy - WORKS
arr = np.array([1, 2, 3, 4, 5])
arr[1:4]        # Slicing: [2, 3, 4]
arr[[0, 2, 4]]  # Fancy indexing: [1, 3, 5]
arr[arr > 2]    # Boolean masking: [3, 4, 5]

# Trueno - MISSING
# Only supports full array operations
```

**Reshaping/Transposing** (0% coverage):
```python
# NumPy - WORKS
arr = np.array([1, 2, 3, 4, 5, 6])
arr.reshape(2, 3)   # [[1, 2, 3], [4, 5, 6]]
arr.transpose()     # Swap axes

# Trueno - MISSING
# No reshape or transpose operations
```

**Multiple Data Types** (0% coverage):
```python
# NumPy - WORKS
np.array([1, 2, 3], dtype=np.int32)
np.array([1.0, 2.0], dtype=np.float64)

# Trueno - ONLY f32
# Only supports Vector<f32>, no int32/float64/etc.
```

#### 🎯 Viable NumPy Use Cases Today

**1. 1D Signal Processing**:
```rust
use trueno::Vector;

// Audio processing: apply filter to 1D signal
let signal = Vector::from_slice(&audio_samples);
let filtered = signal.convolve2d(&filter_kernel).unwrap();
```

**2. Vector Similarity/Distance**:
```rust
// Compute cosine similarity between document embeddings
let doc1 = Vector::from_slice(&embedding1);
let doc2 = Vector::from_slice(&embedding2);
let similarity = doc1.dot(&doc2).unwrap() / (doc1.norm().unwrap() * doc2.norm().unwrap());
```

**3. Large-scale Element-wise Transformations** (GPU-accelerated):
```rust
// Activate 1M logits with GPU acceleration
let logits = Vector::from_slice(&vec![...; 1_000_000]);
let probs = logits.sigmoid().unwrap();  // Auto-uses GPU for >100K elements
```

---

### 1.2 PyTorch Replacement: ~15% Complete

#### ✅ What Works Today

**GPU-Accelerated Activations** (Inference Only):
```rust
use trueno::Vector;

// PyTorch: F.relu(x), F.sigmoid(x), F.gelu(x)
let hidden = Vector::from_slice(&[...]);
let activated = hidden.relu().unwrap();      // ReLU
let squashed = hidden.sigmoid().unwrap();    // Sigmoid
let gelu_out = hidden.gelu().unwrap();       // GELU (BERT/GPT)
let attention = scores.softmax().unwrap();   // Softmax (transformers)
```

**Supported activations** (14 GPU-accelerated):
- Standard: ReLU, Leaky ReLU, ELU, Sigmoid, Tanh
- Modern: Swish, GELU, Softmax, Log-Softmax
- Utility: Clip (gradient clipping)

**Basic Tensor Operations** (1D only):
```rust
// PyTorch: torch.dot(a, b), torch.matmul(A, B)
let a = Vector::from_slice(&[1.0, 2.0, 3.0]);
let b = Vector::from_slice(&[4.0, 5.0, 6.0]);
let dot_product = a.dot(&b).unwrap();  // Works like PyTorch
```

**Performance** (GPU-accelerated):
- Element-wise ops (>100K): 10-50x faster than scalar
- Multi-pass ops (>10K): 5-20x faster than scalar (softmax, log_softmax)
- Matrix ops (>1K×1K): 10-50x faster than naive CPU

#### ❌ Critical Blockers (Preventing PyTorch Replacement)

**No Autograd** (0% coverage):
```python
# PyTorch - WORKS
import torch
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()
y.backward()       # Compute gradients
print(x.grad)      # [2.0, 4.0, 6.0]

# Trueno - MISSING
# Cannot train neural networks without gradient computation
```

**No Layers** (0% coverage):
```python
# PyTorch - WORKS
import torch.nn as nn
linear = nn.Linear(784, 128)
conv = nn.Conv2d(3, 64, kernel_size=3)
norm = nn.BatchNorm2d(64)

# Trueno - MISSING
# No nn::Module trait, no layer abstractions
```

**No Loss Functions** (0% coverage):
```python
# PyTorch - WORKS
criterion = nn.CrossEntropyLoss()
loss = criterion(outputs, targets)

# Trueno - MISSING
# Must manually implement loss computation
```

**No Optimizers** (0% coverage):
```python
# PyTorch - WORKS
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Trueno - MISSING
# No SGD, Adam, AdamW, etc.
```

**No Multi-dimensional Tensors** (0% coverage):
```python
# PyTorch - WORKS
x = torch.randn(32, 3, 224, 224)  # Batch of images (N, C, H, W)

# Trueno - MISSING
# Only 1D Vector<f32>
```

**No Model Saving/Loading** (0% coverage):
```python
# PyTorch - WORKS
torch.save(model.state_dict(), 'model.pth')
model.load_state_dict(torch.load('model.pth'))

# Trueno - MISSING
# No checkpoint serialization
```

#### 🎯 Viable PyTorch Use Cases Today

**1. Neural Network Inference** (Forward Pass Only, 1D):
```rust
use trueno::Vector;

// Inference for a simple 1D feedforward network
let input = Vector::from_slice(&features);
let hidden1 = input.matmul(&weights1).unwrap().add(&bias1).unwrap().relu().unwrap();
let hidden2 = hidden1.matmul(&weights2).unwrap().add(&bias2).unwrap().gelu().unwrap();
let logits = hidden2.matmul(&weights3).unwrap().add(&bias3).unwrap();
let probs = logits.softmax().unwrap();
```

**2. Custom Activation Function Testing**:
```rust
// Compare custom activation vs standard activations
let data = Vector::from_slice(&test_data);
let relu_out = data.relu().unwrap();
let swish_out = data.swish().unwrap();
let gelu_out = data.gelu().unwrap();
```

**3. Batch Activation Processing** (GPU-accelerated):
```rust
// Process large batches of activations efficiently
let batch_logits = Vector::from_slice(&vec![...; 1_000_000]);  // 10K batch × 100 classes
let batch_probs = batch_logits.softmax().unwrap();  // GPU-accelerated
```

---

## 2. Strategic Gaps Analysis

### 2.1 Blocker Priorities (Must-Have for Replacement)

**Priority 1: Multi-dimensional Tensors** (Blocks 90% of use cases)
- **Blocker for**: Images, sequences, batches, convolutions, attention
- **Impact**: Cannot process 2D/3D/4D data (images, videos, point clouds)
- **Current workaround**: Flatten to 1D (loses structure, inefficient)
- **Target**: Phase 2 (v0.4.0 - v0.6.0)

**Priority 2: Autograd Engine** (Blocks training)
- **Blocker for**: Neural network training, gradient-based optimization
- **Impact**: Inference-only library (cannot train models)
- **Current workaround**: None (fundamental missing feature)
- **Target**: Phase 3 (v0.7.0)

**Priority 3: Broadcasting** (Blocks 70% of operations)
- **Blocker for**: Shape-polymorphic operations, batch processing
- **Impact**: Must manually tile/repeat tensors (inefficient, error-prone)
- **Current workaround**: Manual shape manipulation
- **Target**: Phase 2 (v0.5.0)

### 2.2 Nice-to-Have Features (Improve Usability)

**Multiple Data Types** (f16, f64, i32, etc.):
- **Impact**: Limited to f32 (no mixed precision, no integer ops)
- **Workaround**: Cast externally, use only f32
- **Target**: Phase 2 (v0.6.0)

**Advanced Indexing/Slicing**:
- **Impact**: Verbose data access, cannot use NumPy-style slicing
- **Workaround**: Manual index computation
- **Target**: Phase 2 (v0.5.0)

**Model Serialization**:
- **Impact**: Cannot save/load trained models
- **Workaround**: Manual weight export/import
- **Target**: Phase 3 (v1.0.0)

---

## 3. Phased Roadmap to Replacement Parity

### Phase 1: Complete 1D Operations (v0.2.x - v0.3.0)
**Timeline**: 2-3 months
**Goal**: Best-in-class 1D vector compute
**NumPy Parity**: ~40% (still 1D only)
**PyTorch Parity**: ~18% (inference only)

**Deliverables**:
- [x] GPU softmax/log_softmax (v0.2.1) ✅
- [ ] Remaining activations: hardswish, mish, selu
- [ ] GPU reductions: argmax/argmin, sum/mean/std
- [ ] GPU binary ops: add/sub/mul/div
- [ ] Async GPU API (batch operations)
- [ ] Comprehensive benchmarks (validate 10-50x claims)
- [ ] WASM SIMD128 backend

**Success Metrics**:
- ✅ All common 1D operations GPU-accelerated
- ✅ 10-50x GPU speedup validated by benchmarks
- ✅ Used in production by ≥3 projects

---

### Phase 2: Multi-Dimensional Tensors (v0.4.0 - v0.6.0)
**Timeline**: 6-12 months
**Goal**: NumPy-competitive for 2D/3D arrays
**NumPy Parity**: ~80% (multi-dim, broadcasting, core ops)
**PyTorch Parity**: ~20% (still no autograd)

#### v0.4.0: Tensor Type Foundation (3-4 months)

**Core Type**:
```rust
pub struct Tensor<T, const N: usize> {
    data: Vec<T>,
    shape: [usize; N],
    strides: [usize; N],
    offset: usize,
}

// Examples:
let scalar = Tensor::<f32, 0>::from_scalar(5.0);        // 0D scalar
let vector = Tensor::<f32, 1>::from_slice(&[1, 2, 3]);  // 1D vector
let matrix = Tensor::<f32, 2>::new([2, 3]);             // 2D matrix (2×3)
let image = Tensor::<f32, 3>::new([3, 224, 224]);       // 3D image (C, H, W)
let batch = Tensor::<f32, 4>::new([32, 3, 224, 224]);   // 4D batch (N, C, H, W)
```

**Design Trade-offs (Const Generics for Rank)**:

The use of `const N: usize` for tensor rank provides compile-time dimensionality checking—a significant safety advantage over Python-based frameworks. However, this creates a trade-off between static safety and dynamic flexibility:

**Advantages of Static Rank**:
- Compile-time shape verification (prevents runtime dimension errors)
- Zero-cost abstraction (no runtime rank checks)
- Type-safe operations (e.g., `matmul` requires 2D tensors)
- Excellent for fixed architectures (CNNs, transformers)

**Challenges**:
- Dynamic rank use cases (e.g., processing lists of tensors with varying ranks)
- Generic functions that accept tensors of any rank
- Interop with dynamic frameworks (loading ONNX models)

**Mitigation Strategy** (Future consideration):
```rust
// Option 1: AnyTensor enum (adds complexity but enables dynamic use cases)
pub enum AnyTensor<T> {
    Rank0(Tensor<T, 0>),
    Rank1(Tensor<T, 1>),
    Rank2(Tensor<T, 2>),
    Rank3(Tensor<T, 3>),
    Rank4(Tensor<T, 4>),
}

// Option 2: Trait-based approach for rank-polymorphic functions
pub trait TensorLike {
    fn rank(&self) -> usize;
    fn shape(&self) -> &[usize];
}
```

For v0.4.0, prioritize the static `Tensor<T, N>` design for 80% use cases (ranks 0-4). Dynamic rank support can be added in v0.5.x if user feedback indicates strong demand.

**Storage**:
- **Row-major layout** (C-contiguous, matches NumPy default)
- Owned data vs. views (zero-copy slicing)
- Shape, strides, offset tracking for views

**Storage Layout Considerations** (Performance):

While row-major is the default, the `strides` field enables representing column-major layouts **without changing the underlying data**. This is critical for performance-sensitive operations like matrix multiplication, where performance can be significantly improved when one matrix is row-major and the other is column-major [Goto & van de Geijn, 2008].

```rust
// Zero-copy transpose (just swap strides)
pub fn transpose(&self) -> Tensor<T, 2> {
    Tensor {
        data: self.data.clone(),  // Shared data (Arc in production)
        shape: [self.shape[1], self.shape[0]],
        strides: [self.strides[1], self.strides[0]],  // Swapped!
        offset: self.offset,
    }
}
```

**Key Insight**: By supporting arbitrary strides, we can:
- Implement zero-copy transpose (critical for linear algebra)
- Represent both row-major and column-major layouts
- Enable optimized BLAS routines (which often require specific layouts)
- Avoid expensive data reorganization

This design aligns with established practices in NumPy and BLAS libraries [Van der Walt et al., 2011].

**2D Operations**:
```rust
// Transpose
let mat = Tensor::<f32, 2>::new([2, 3]);  // 2×3
let transposed = mat.transpose();         // 3×2

// Reshape
let vec = Tensor::<f32, 1>::new([6]);     // [1, 2, 3, 4, 5, 6]
let mat = vec.reshape([2, 3]);            // [[1, 2, 3], [4, 5, 6]]

// Flatten
let mat = Tensor::<f32, 2>::new([2, 3]);  // 2×3
let vec = mat.flatten();                  // 1D [6]

// Slicing
let mat = Tensor::<f32, 2>::new([5, 5]);
let sub = mat.slice([1..3, 2..4]);        // Sub-matrix (2×2 view)
```

**Row/Column Access**:
```rust
let mat = Tensor::<f32, 2>::new([3, 4]);
let row0 = mat.row(0);     // View of first row
let col1 = mat.column(1);  // View of second column
```

**2D Matrix Multiplication** (optimized):
```rust
// NumPy: C = A @ B
let a = Tensor::<f32, 2>::new([128, 256]);   // 128×256
let b = Tensor::<f32, 2>::new([256, 512]);   // 256×512
let c = a.matmul(&b).unwrap();               // 128×512 (GPU-accelerated)
```

**Deliverables**:
- Tensor<T, N> type with const generics
- Shape, strides, offset tracking
- Views vs owned data (zero-copy slicing)
- Transpose, reshape, flatten
- 2D matmul (GPU-accelerated, optimized)
- Row/column indexing

**Success Metrics**:
- Can represent images, batches, sequences
- Zero-copy views for efficient slicing
- 2D matmul competitive with NumPy

#### v0.5.0: Broadcasting (2-3 months)

**Broadcasting Rules** (NumPy-compatible):
```rust
// NumPy broadcasting semantics
let a = Tensor::<f32, 2>::new([3, 1]);   // Shape (3, 1)
let b = Tensor::<f32, 2>::new([1, 4]);   // Shape (1, 4)
let c = a.add(&b).unwrap();              // Shape (3, 4) via broadcasting

// Rules:
// 1. If ranks differ, prepend 1s to smaller rank
// 2. Dimensions are compatible if they're equal or one is 1
// 3. Broadcast stretches dimension 1 to match other dimension
```

**Shape Compatibility Checking**:
```rust
fn can_broadcast(shape_a: &[usize], shape_b: &[usize]) -> bool {
    // Implementation of NumPy broadcasting rules
}
```

**Element-wise Ops with Broadcasting**:
```rust
// All element-wise ops support broadcasting
let a = Tensor::<f32, 2>::new([64, 1]);     // Batch of 64 vectors
let bias = Tensor::<f32, 1>::new([128]);    // Bias vector
let result = a.add(&bias).unwrap();         // (64, 128) via broadcasting
```

**Broadcasting Implementation Quality** (*Jidoka* - Build Quality In):

Broadcasting is notoriously complex to implement correctly [Van der Walt et al., 2011]. The implementation must be heavily validated:

**1. Property-Based Testing Against NumPy**:
```rust
use proptest::prelude::*;

proptest! {
    #[test]
    fn test_broadcasting_matches_numpy(
        a_shape in prop::array::uniform2(1usize..10),
        b_shape in prop::array::uniform2(1usize..10),
    ) {
        // Generate compatible shapes for broadcasting
        let a = Tensor::<f32, 2>::randn(a_shape);
        let b = Tensor::<f32, 2>::randn(b_shape);

        // Compare Trueno result with NumPy
        let trueno_result = a.add(&b).unwrap();
        let numpy_result = numpy_add(a.to_numpy(), b.to_numpy());

        assert_tensors_close(trueno_result, numpy_result, eps=1e-5);
    }
}
```

This differential testing approach [McKeeman, 1998] catches subtle bugs that unit tests miss.

**2. Fused Kernel Optimization** (GPU Performance):

Naive broadcasting creates large intermediate tensors, consuming excessive memory. Modern tensor compilers avoid this by fusing broadcast + operation into a single kernel:

```rust
// Naive implementation (inefficient - materializes (64, 128) tensor)
let broadcasted_bias = bias.broadcast_to([64, 128]);  // Allocates 32KB
let result = a.add(&broadcasted_bias);                // Allocates another 32KB

// Fused kernel (optimal - no intermediate allocation)
let result = a.add_broadcasted(&bias);  // Single GPU kernel, direct to output
```

**GPU Kernel Pseudocode**:
```wgsl
@compute @workgroup_size(256)
fn add_with_broadcast(
    @builtin(global_invocation_id) gid: vec3<u32>
) {
    let row = gid.x;
    let col = gid.y;

    // Compute broadcasted indices on-the-fly
    let a_idx = row * a_stride[0] + (col % a_shape[1]) * a_stride[1];
    let b_idx = (row % b_shape[0]) * b_stride[0] + col * b_stride[1];

    output[row * output_stride + col] = a[a_idx] + b[b_idx];
}
```

This fused approach is standard in modern frameworks like JAX [Bradbury et al., 2018] and achieves:
- Zero intermediate memory allocation
- Single GPU kernel launch (reduces overhead)
- Improved memory bandwidth utilization

For v0.5.0, implement fused broadcasting for element-wise ops on GPU. This aligns with best practices in tensor compiler design [Chen et al., 2018].

**Advanced Indexing**:
```rust
// Boolean masking
let arr = Tensor::<f32, 1>::from_slice(&[1, 2, 3, 4, 5]);
let mask = arr.gt(2.0);                     // [false, false, true, true, true]
let filtered = arr.masked_select(&mask);    // [3, 4, 5]

// Integer array indexing
let indices = Tensor::<i32, 1>::from_slice(&[0, 2, 4]);
let selected = arr.index_select(&indices);  // [1, 3, 5]

// Slicing syntax (via macro)
let mat = Tensor::<f32, 2>::new([10, 10]);
let sub = tensor_slice!(mat, [1..5, ::2]);  // Rows 1-4, every other column
```

**Deliverables**:
- NumPy-compatible broadcasting rules
- Shape compatibility checking
- Broadcasted element-wise operations
- Boolean masking
- Integer array indexing
- Slicing macro (NumPy-style syntax)

**Success Metrics**:
- Pass 80%+ of NumPy broadcasting tests
- Idiomatic batch processing
- Efficient memory usage (avoid unnecessary copies)

#### v0.6.0: NumPy Parity (Core Ops) (3-4 months)

**All NumPy dtypes**:
```rust
pub enum DType {
    F16, F32, F64,
    I8, I16, I32, I64,
    U8, U16, U32, U64,
    Bool,
}

let f32_tensor = Tensor::<f32, 2>::new([3, 3]);
let i32_tensor = Tensor::<i32, 2>::new([3, 3]);
let f64_tensor = Tensor::<f64, 2>::new([3, 3]);
```

**NumPy-style API**:
```rust
// Creation functions
let zeros = Tensor::<f32, 2>::zeros([3, 4]);
let ones = Tensor::<f32, 2>::ones([3, 4]);
let arange = Tensor::<f32, 1>::arange(0.0, 10.0, 1.0);
let linspace = Tensor::<f32, 1>::linspace(0.0, 1.0, 100);

// Concatenation/stacking
let a = Tensor::<f32, 2>::new([2, 3]);
let b = Tensor::<f32, 2>::new([2, 3]);
let cat = Tensor::concatenate(&[a, b], axis=0).unwrap();  // (4, 3)
let stack = Tensor::stack(&[a, b], axis=0).unwrap();      // (2, 2, 3)

// Splitting
let parts = Tensor::split(&tensor, 3, axis=0);  // Split into 3 parts

// Conditional operations
let mask = x.gt(0.5);
let result = Tensor::where(&mask, &x, &y);  // x if mask else y
let indices = Tensor::argwhere(&mask);      // Indices where mask is true
```

**Deliverables**:
- Generic dtype support (F16, F32, F64, I32, I64, U32, etc.)
- NumPy creation functions (zeros, ones, arange, linspace)
- Concatenate, stack, split
- Conditional operations (where, argwhere)
- Comprehensive test suite (80%+ NumPy compatibility)

**Performance Target**:
- 80-120% of NumPy speed (within 20% either way)

**Success Metrics**:
- Pass 80%+ of NumPy test suite (for covered ops)
- ≥10 production deployments
- Competitive performance with NumPy

---

### Phase 3: Autograd & Training (v0.7.0 - v1.0.0)
**Timeline**: 12-18 months
**Goal**: PyTorch-competitive for training
**NumPy Parity**: ~80% (stable)
**PyTorch Parity**: ~60% (training-ready)

#### v0.7.0: Autograd Engine (4-6 months)

**Computational Graph**:
```rust
pub struct Tensor<T, const N: usize> {
    data: TensorData<T, N>,
    grad: Option<Box<Tensor<T, N>>>,
    grad_fn: Option<Box<dyn BackwardFunction>>,
    requires_grad: bool,
}

// Example usage
let x = Tensor::<f32, 1>::from_slice(&[1.0, 2.0, 3.0]).requires_grad();
let y = (&x * &x).sum();  // y = sum(x^2)
y.backward();             // Compute gradients
println!("{:?}", x.grad()); // Some([2.0, 4.0, 6.0])
```

**Reverse-mode Automatic Differentiation**:
- Dynamic graph construction (like PyTorch eager mode)
- Gradient tape for backpropagation
- Efficient memory management (release intermediates)

**Core Operations with Gradients**:
```rust
// All element-wise ops
add_backward, sub_backward, mul_backward, div_backward
exp_backward, log_backward, pow_backward, sqrt_backward
sin_backward, cos_backward, tanh_backward, sigmoid_backward

// Reductions
sum_backward, mean_backward, max_backward

// Linear algebra
matmul_backward, conv2d_backward

// Activations (all 14 GPU ops)
relu_backward, sigmoid_backward, gelu_backward, softmax_backward, etc.
```

**Autograd Quality Gates** (*Jidoka* - Halt the Line on Defects):

The autograd engine is the most complex and highest-risk component. A single bug in gradient computation causes silent training failures that are extremely difficult to debug [Baydin et al., 2018]. **Gradient checking must be a first-class citizen** in the testing suite.

**1. Automatic Gradient Checking** (Required for Every Operation):

```rust
/// Verify analytical gradients against numerical gradients
pub fn check_gradients<F>(
    op: F,
    inputs: &[Tensor],
    eps: f64,
    tolerance: f64
) -> Result<(), GradientCheckError>
where
    F: Fn(&[Tensor]) -> Tensor
{
    // Compute analytical gradients via backward()
    let output = op(inputs);
    output.backward();
    let analytical_grads: Vec<_> = inputs.iter().map(|x| x.grad().unwrap()).collect();

    // Compute numerical gradients via finite differences
    let numerical_grads = compute_numerical_gradients(op, inputs, eps);

    // Compare (must be very close)
    for (analytical, numerical) in analytical_grads.iter().zip(&numerical_grads) {
        assert_tensors_close(analytical, numerical, tolerance)?;
    }

    Ok(())
}

/// Numerical gradient: f'(x) ≈ (f(x + ε) - f(x - ε)) / (2ε)
fn compute_numerical_gradients<F>(
    op: F,
    inputs: &[Tensor],
    eps: f64
) -> Vec<Tensor>
where
    F: Fn(&[Tensor]) -> Tensor
{
    inputs.iter().map(|input| {
        let mut grad = Tensor::zeros_like(input);

        for idx in 0..input.numel() {
            // f(x + ε)
            let mut input_plus = input.clone();
            input_plus.data[idx] += eps;
            let output_plus = op(&[input_plus]).sum().item();

            // f(x - ε)
            let mut input_minus = input.clone();
            input_minus.data[idx] -= eps;
            let output_minus = op(&[input_minus]).sum().item();

            // Central difference
            grad.data[idx] = (output_plus - output_minus) / (2.0 * eps);
        }

        grad
    }).collect()
}
```

**2. Automated Testing for All Operations**:

```rust
#[cfg(test)]
mod autograd_tests {
    use super::*;

    macro_rules! test_gradient {
        ($op_name:ident, $op:expr, $input_shape:expr) => {
            #[test]
            fn $op_name() {
                let x = Tensor::randn($input_shape).requires_grad();
                let y = Tensor::randn($input_shape).requires_grad();

                check_gradients(
                    |inputs| $op(&inputs[0], &inputs[1]),
                    &[x, y],
                    eps = 1e-5,
                    tolerance = 1e-4
                ).expect("Gradient check failed");
            }
        };
    }

    test_gradient!(test_add_grad, |x, y| x + y, [10, 10]);
    test_gradient!(test_mul_grad, |x, y| x * y, [10, 10]);
    test_gradient!(test_matmul_grad, |x, y| x.matmul(y), [10, 20]);
    test_gradient!(test_relu_grad, |x, _| x.relu(), [100]);
    test_gradient!(test_softmax_grad, |x, _| x.softmax(dim=1), [10, 10]);
    // ... (test every single operation)
}
```

This automated verification prevents entire classes of subtle, hard-to-debug training failures. Research on deep learning testing emphasizes gradient checking as essential [Pei et al., 2017; Zhang et al., 2020].

**3. Property-Based Testing for Gradient Laws**:

```rust
use proptest::prelude::*;

proptest! {
    #[test]
    fn test_chain_rule(
        x in tensor_strategy([10])
    ) {
        // d/dx[f(g(x))] = f'(g(x)) * g'(x)
        let x = x.requires_grad();
        let y = x.relu();  // g(x)
        let z = y.sum();   // f(g(x))

        z.backward();

        // Verify chain rule holds
        let grad_z_wrt_x = x.grad().unwrap();
        // ... verify against manually computed gradient
    }

    #[test]
    fn test_gradient_linearity(
        x in tensor_strategy([10]),
        a in -10.0f32..10.0,
        b in -10.0f32..10.0
    ) {
        // d/dx[a*f(x) + b*g(x)] = a*f'(x) + b*g'(x)
        // ... verify linearity property
    }
}
```

**Memory Optimization**:
```rust
// Gradient checkpointing (save memory for deep networks)
let checkpoint = Checkpoint::new();
let y = checkpoint.run(|| {
    // Forward pass here (intermediates not saved)
    model.forward(x)
});

// In-place operations (where safe)
let mut x = Tensor::<f32, 2>::new([3, 3]);
x.relu_();  // In-place ReLU (saves memory, but cannot backward through this)
```

**Deliverables**:
- Reverse-mode AD engine
- Dynamic graph construction
- Gradient tape with backward functions
- All element-wise ops with gradients
- Matmul, conv2d with gradients
- All 14 activations with gradients
- Gradient checkpointing
- In-place operations

**Success Metrics**:
- Can train simple feedforward networks
- Gradients match PyTorch (< 1e-5 error)
- Memory-efficient gradient computation

#### v0.8.0: Neural Network Layers (3-4 months)

**nn::Module Trait**:
```rust
pub trait Module {
    type Input;
    type Output;

    fn forward(&self, input: Self::Input) -> Self::Output;
    fn parameters(&self) -> Vec<&Tensor>;
    fn zero_grad(&mut self);
}

// Example: Linear layer
pub struct Linear {
    weight: Tensor<f32, 2>,
    bias: Option<Tensor<f32, 1>>,
}

impl Module for Linear {
    type Input = Tensor<f32, 2>;
    type Output = Tensor<f32, 2>;

    fn forward(&self, input: Self::Input) -> Self::Output {
        let output = input.matmul(&self.weight.transpose());
        if let Some(ref bias) = self.bias {
            output.add(bias)
        } else {
            output
        }
    }

    fn parameters(&self) -> Vec<&Tensor> {
        let mut params = vec![&self.weight];
        if let Some(ref bias) = self.bias {
            params.push(bias);
        }
        params
    }
}
```

**Core Layers**:
```rust
// Fully connected
let linear = nn::Linear::new(784, 128);

// Convolutional
let conv2d = nn::Conv2d::new(3, 64, kernel_size=3, stride=1, padding=1);
let maxpool = nn::MaxPool2d::new(kernel_size=2, stride=2);

// Normalization
let batchnorm = nn::BatchNorm2d::new(64);
let layernorm = nn::LayerNorm::new(128);

// Regularization
let dropout = nn::Dropout::new(p=0.5);
```

**Sequential Container**:
```rust
let model = nn::Sequential::new()
    .add(nn::Linear::new(784, 256))
    .add(nn::ReLU::new())
    .add(nn::Dropout::new(0.5))
    .add(nn::Linear::new(256, 128))
    .add(nn::ReLU::new())
    .add(nn::Linear::new(128, 10));

let output = model.forward(input);
```

**Loss Functions**:
```rust
// Cross-entropy loss (classification)
let criterion = nn::CrossEntropyLoss::new();
let loss = criterion.forward(outputs, targets);

// MSE loss (regression)
let mse = nn::MSELoss::new();
let loss = mse.forward(predictions, targets);

// Binary cross-entropy (binary classification)
let bce = nn::BCELoss::new();
let loss = bce.forward(predictions, targets);

// Custom loss support
trait Loss {
    fn forward(&self, predictions: &Tensor, targets: &Tensor) -> Tensor;
}
```

**Deliverables**:
- nn::Module trait
- Linear (fully connected) layer
- Conv2d, MaxPool2d layers
- BatchNorm, LayerNorm
- Dropout
- Sequential container
- CrossEntropyLoss, MSELoss, BCELoss

**Success Metrics**:
- Can build ResNet-18, VGG-16
- Layer API matches PyTorch ergonomics
- Automatic parameter tracking

#### v0.9.0: Optimizers (2-3 months)

**Optimizer Trait**:
```rust
pub trait Optimizer {
    fn step(&mut self, params: &mut [Tensor]);
    fn zero_grad(&mut self, params: &mut [Tensor]);
}

// SGD with momentum
let optimizer = optim::SGD::new(
    model.parameters(),
    lr=0.01,
    momentum=0.9,
    nesterov=true
);

// Training loop
for (inputs, targets) in dataloader {
    optimizer.zero_grad();
    let outputs = model.forward(inputs);
    let loss = criterion.forward(outputs, targets);
    loss.backward();
    optimizer.step();
}
```

**Core Optimizers**:
```rust
// SGD (with momentum, Nesterov)
let sgd = optim::SGD::new(params, lr=0.01, momentum=0.9, nesterov=true);

// Adam (with weight decay, AMSGrad)
let adam = optim::Adam::new(params, lr=0.001, betas=(0.9, 0.999), weight_decay=0.01);

// AdamW (decoupled weight decay)
let adamw = optim::AdamW::new(params, lr=0.001, weight_decay=0.01);

// RMSprop
let rmsprop = optim::RMSprop::new(params, lr=0.01, alpha=0.99);
```

**Learning Rate Schedulers**:
```rust
// Step decay
let scheduler = optim::StepLR::new(optimizer, step_size=30, gamma=0.1);

// Exponential decay
let scheduler = optim::ExponentialLR::new(optimizer, gamma=0.95);

// Cosine annealing
let scheduler = optim::CosineAnnealingLR::new(optimizer, T_max=100);

// Usage in training loop
for epoch in 0..num_epochs {
    train_one_epoch();
    scheduler.step();
}
```

**Deliverables**:
- Optimizer trait
- SGD (with momentum, Nesterov)
- Adam (with weight decay, AMSGrad)
- AdamW, RMSprop
- Learning rate schedulers (StepLR, ExponentialLR, CosineAnnealing)

**Success Metrics**:
- Can train ResNet-50 to convergence
- Optimizer behavior matches PyTorch
- Efficient gradient updates (in-place where possible)

#### v1.0.0: Training-Ready (3-4 months)

**Model Serialization**:
```rust
// Save checkpoint
let checkpoint = Checkpoint {
    model: model.state_dict(),
    optimizer: optimizer.state_dict(),
    epoch: 42,
    loss: 0.123,
};
checkpoint.save("model.pth")?;

// Load checkpoint
let checkpoint = Checkpoint::load("model.pth")?;
model.load_state_dict(checkpoint.model);
optimizer.load_state_dict(checkpoint.optimizer);

// ONNX export
model.export_onnx("model.onnx")?;
```

**Distributed Training** (Data Parallelism):
```rust
// Initialize distributed training
let world_size = 4;  // 4 GPUs
let rank = get_rank();

// Wrap model in DistributedDataParallel
let ddp_model = nn::parallel::DistributedDataParallel::new(model, rank);

// Training loop (gradients synchronized automatically)
for (inputs, targets) in dataloader {
    let outputs = ddp_model.forward(inputs);
    let loss = criterion.forward(outputs, targets);
    loss.backward();
    optimizer.step();  // Gradients averaged across GPUs
}
```

**Gradient Synchronization**:
- AllReduce for gradient averaging
- Efficient communication (NCCL-like)
- Fault tolerance (checkpoint/resume)

**Production Features**:
- Mixed precision training (FP16/BF16)
- Gradient accumulation
- Gradient clipping
- Early stopping
- TensorBoard logging

**Deliverables**:
- Model save/load (state_dict)
- ONNX export
- Distributed data parallelism
- Gradient synchronization
- Mixed precision support
- Production training utilities

**Performance Target**:
- 60-80% of PyTorch speed (within 20-40%)

**Success Metrics**:
- ✅ Train ResNet-50 on CIFAR-10 in <30 minutes (single GPU)
- ✅ 60-80% of PyTorch performance
- ✅ Used in production ML training pipelines

**Full MNIST/CIFAR-10 Examples**:
```rust
// Complete CIFAR-10 training example
use trueno::{nn, optim, data};

fn main() {
    // Load data
    let train_loader = data::DataLoader::new("cifar10", train=true, batch_size=128);
    let test_loader = data::DataLoader::new("cifar10", train=false, batch_size=128);

    // Build ResNet-18
    let model = nn::resnet18(num_classes=10);

    // Loss and optimizer
    let criterion = nn::CrossEntropyLoss::new();
    let optimizer = optim::Adam::new(model.parameters(), lr=0.001);

    // Training loop
    for epoch in 0..100 {
        for (images, labels) in train_loader {
            optimizer.zero_grad();
            let outputs = model.forward(images);
            let loss = criterion.forward(outputs, labels);
            loss.backward();
            optimizer.step();
        }

        // Validation
        let accuracy = evaluate(&model, &test_loader);
        println!("Epoch {}: Accuracy {:.2}%", epoch, accuracy * 100.0);
    }
}
```

---

## 4. Performance Targets

### 4.1 NumPy Performance Parity

**Target**: 80-120% of NumPy speed (within 20% either way)

**Benchmark Methodology**:
- Test sizes: 100, 1K, 10K, 100K, 1M, 10M elements
- Operations: add, mul, matmul, sum, mean, softmax
- Backends: Scalar, SIMD (AVX2), GPU
- Compare against: NumPy 1.26+ (with OpenBLAS)

**Expected Performance**:

| Operation | Size | NumPy (baseline) | Trueno CPU (SIMD) | Trueno GPU | Speedup |
|-----------|------|------------------|-------------------|------------|---------|
| add | 1K | 10 µs | 8 µs | - | 1.25x |
| add | 100K | 1 ms | 500 µs | - | 2x |
| add | 1M | 10 ms | 5 ms | 200 µs | 2x (CPU), 50x (GPU) |
| matmul | 128×128 | 200 µs | 180 µs | - | 1.1x |
| matmul | 1K×1K | 15 ms | 12 ms | 1.5 ms | 1.25x (CPU), 10x (GPU) |
| softmax | 10K | 120 µs | 100 µs | 60 µs | 1.2x (CPU), 2x (GPU) |
| softmax | 1M | 12 ms | 10 ms | 600 µs | 1.2x (CPU), 20x (GPU) |

**Key Insight**: GPU wins at >100K elements (transfer overhead amortized)

### 4.2 PyTorch Performance Parity

**Target**: 60-80% of PyTorch speed (within 20-40%)

**Benchmark Methodology**:
- Networks: ResNet-18, ResNet-50, VGG-16, Transformer (BERT-base)
- Tasks: CIFAR-10 training, ImageNet inference, NLP fine-tuning
- Hardware: NVIDIA A100 (GPU), AMD EPYC (CPU)
- Compare against: PyTorch 2.0+ (with cuDNN)

**Expected Performance**:

| Task | PyTorch (baseline) | Trueno | Speedup |
|------|-------------------|--------|---------|
| ResNet-50 training (CIFAR-10, 1 epoch) | 45s | 60s | 0.75x |
| ResNet-18 inference (ImageNet, batch=32) | 120 ms | 150 ms | 0.80x |
| BERT-base forward pass (seq_len=128) | 8 ms | 12 ms | 0.67x |
| Softmax (1M elements) | 800 µs | 600 µs | 1.33x |

**Key Insight**: Trueno may be faster for specific ops (softmax, activations) but slower for full models (cuDNN optimization)

---

## 5. API Compatibility Strategy

### 5.1 NumPy Compatibility

**Goal**: Minimize porting effort from NumPy → Trueno

**Approach 1: Idiomatic Rust API** (Primary)
```rust
// Rust-native API (ownership, error handling)
let a = Tensor::<f32, 2>::from_slice(&[1.0, 2.0, 3.0, 4.0], [2, 2]);
let b = Tensor::<f32, 2>::from_slice(&[5.0, 6.0, 7.0, 8.0], [2, 2]);
let c = a.matmul(&b).unwrap();
```

**Approach 2: NumPy-like Macros** (Convenience)
```rust
// NumPy-style syntax via macros
let a = tensor![[1.0, 2.0], [3.0, 4.0]];
let b = tensor![[5.0, 6.0], [7.0, 8.0]];
let c = a.matmul(&b).unwrap();
```

**Approach 3: Function API** (Migration Path)
```rust
// NumPy function-style API
use trueno::np;

let a = np::array([[1.0, 2.0], [3.0, 4.0]]);
let b = np::array([[5.0, 6.0], [7.0, 8.0]]);
let c = np::matmul(&a, &b);
```

**Migration Guide**: Document NumPy → Trueno equivalents
```rust
// NumPy                          // Trueno
np.array([1, 2, 3])               Tensor::from_slice(&[1, 2, 3])
arr.reshape(2, 3)                 tensor.reshape([2, 3])
arr.transpose()                   tensor.transpose()
np.dot(a, b)                      a.dot(&b)
np.matmul(a, b)                   a.matmul(&b)
```

### 5.2 PyTorch Compatibility

**Goal**: Familiar API for PyTorch users

**Approach 1: Rust-native nn::Module** (Primary)
```rust
use trueno::nn;

struct MyModel {
    fc1: nn::Linear,
    fc2: nn::Linear,
}

impl nn::Module for MyModel {
    type Input = Tensor<f32, 2>;
    type Output = Tensor<f32, 2>;

    fn forward(&self, x: Self::Input) -> Self::Output {
        let x = self.fc1.forward(x).relu();
        let x = self.fc2.forward(x);
        x.softmax(dim=1)
    }
}
```

**Approach 2: Macro-based Sequential** (Convenience)
```rust
let model = sequential![
    nn::Linear::new(784, 256),
    nn::ReLU::new(),
    nn::Linear::new(256, 10),
];
```

**Migration Guide**: Document PyTorch → Trueno equivalents
```rust
// PyTorch                        // Trueno
torch.tensor([1, 2, 3])           Tensor::from_slice(&[1, 2, 3])
x.requires_grad_(True)            x.requires_grad()
y.backward()                      y.backward()
x.grad                            x.grad()
nn.Linear(784, 128)               nn::Linear::new(784, 128)
F.relu(x)                         x.relu()
```

---

## 6. Use Case Validation

### 6.1 NumPy Replacement Use Cases

**Use Case 1: Scientific Computing** (Phase 2 - v0.6.0)
```rust
// Matrix operations for linear algebra
use trueno::Tensor;

let a = Tensor::<f64, 2>::random([1000, 1000]);
let b = Tensor::<f64, 2>::random([1000, 1000]);

// Solve Ax = b
let x = linalg::solve(&a, &b).unwrap();

// Eigenvalues/eigenvectors
let (eigenvalues, eigenvectors) = linalg::eig(&a).unwrap();
```

**Use Case 2: Data Preprocessing** (Phase 2 - v0.5.0)
```rust
// Normalize features for ML
let data = Tensor::<f32, 2>::from_csv("data.csv");
let mean = data.mean(axis=0, keepdims=true);
let std = data.std(axis=0, keepdims=true);
let normalized = (data - mean) / std;
```

**Use Case 3: Image Processing** (Phase 2 - v0.4.0)
```rust
// Apply filters to images
let image = Tensor::<f32, 3>::from_image("photo.jpg");  // (C, H, W)
let kernel = Tensor::<f32, 2>::gaussian_kernel(sigma=1.0);
let blurred = image.convolve2d(&kernel);
```

### 6.2 PyTorch Replacement Use Cases

**Use Case 1: Transfer Learning** (Phase 3 - v1.0.0)
```rust
// Fine-tune ResNet-50 on custom dataset
use trueno::{nn, optim, vision};

let mut model = vision::resnet50(pretrained=true);
model.fc = nn::Linear::new(2048, num_classes);  // Replace final layer

let criterion = nn::CrossEntropyLoss::new();
let optimizer = optim::Adam::new(model.parameters(), lr=0.001);

for epoch in 0..10 {
    for (images, labels) in train_loader {
        optimizer.zero_grad();
        let outputs = model.forward(images);
        let loss = criterion.forward(outputs, labels);
        loss.backward();
        optimizer.step();
    }
}
```

**Use Case 2: Custom Neural Network** (Phase 3 - v0.8.0)
```rust
// Build custom architecture
use trueno::nn;

struct Autoencoder {
    encoder: nn::Sequential,
    decoder: nn::Sequential,
}

impl nn::Module for Autoencoder {
    type Input = Tensor<f32, 2>;
    type Output = Tensor<f32, 2>;

    fn forward(&self, x: Self::Input) -> Self::Output {
        let encoded = self.encoder.forward(x);
        let decoded = self.decoder.forward(encoded);
        decoded
    }
}

let model = Autoencoder {
    encoder: sequential![
        nn::Linear::new(784, 256),
        nn::ReLU::new(),
        nn::Linear::new(256, 64),
    ],
    decoder: sequential![
        nn::Linear::new(64, 256),
        nn::ReLU::new(),
        nn::Linear::new(256, 784),
    ],
};
```

**Use Case 3: Inference Optimization** (Phase 3 - v0.8.0)
```rust
// Deploy trained model for fast inference
use trueno::nn;

let model = nn::load("model.pth").unwrap();
model.eval();  // Set to inference mode (disables dropout, etc.)

// Inference on GPU
let input = Tensor::<f32, 4>::from_image_batch(&images);  // (N, C, H, W)
let output = model.forward(input);  // GPU-accelerated
let predictions = output.argmax(dim=1);  // Get class predictions
```

---

## 7. Migration Path for Existing Projects

### 7.1 NumPy → Trueno Migration

**Step 1: Identify Vectorizable Code**
- Use trueno-analyze tool to scan Python codebase
- Identify NumPy operations that can be replaced

**Step 2: Incremental Replacement**
- Start with 1D operations (available today in v0.2.0)
- Use FFI to call Trueno from Python (via PyO3)
- Gradually replace more operations as Trueno gains features

**Step 3: Full Transpilation**
- Use Depyler (NumPy → Trueno transpiler) for automated conversion
- Manually verify correctness
- Benchmark performance improvements

**Example Migration**:
```python
# Before (NumPy)
import numpy as np

def process_signals(data):
    normalized = (data - data.mean()) / data.std()
    activated = 1.0 / (1.0 + np.exp(-normalized))
    return activated.sum()

# After (Trueno via PyO3)
import trueno_py

def process_signals(data):
    tensor = trueno_py.Tensor.from_numpy(data)
    normalized = tensor.normalize()
    activated = normalized.sigmoid()
    return activated.sum()
```

### 7.2 PyTorch → Trueno Migration

**Step 1: Identify Inference-Only Models**
- Target models that don't need retraining
- Export PyTorch model weights

**Step 2: Rewrite Forward Pass**
- Translate PyTorch forward() to Trueno (Phase 3)
- Load pretrained weights

**Step 3: Optimize Performance**
- Use Trueno GPU acceleration
- Benchmark against PyTorch inference

**Example Migration**:
```python
# Before (PyTorch)
import torch
import torch.nn as nn

class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 256)
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return torch.softmax(x, dim=1)

# After (Trueno - Phase 3)
use trueno::nn;

struct SimpleNet {
    fc1: nn::Linear,
    fc2: nn::Linear,
}

impl nn::Module for SimpleNet {
    type Input = Tensor<f32, 2>;
    type Output = Tensor<f32, 2>;

    fn forward(&self, x: Self::Input) -> Self::Output {
        let x = self.fc1.forward(x).relu();
        let x = self.fc2.forward(x);
        x.softmax(dim=1)
    }
}
```

---

## 8. Risk Analysis & Mitigation

### 8.1 Technical Risks

**Risk 1: GPU Performance Gap**
- **Threat**: Trueno GPU slower than PyTorch (cuDNN optimized)
- **Probability**: HIGH
- **Impact**: HIGH (undermines value proposition)
- **Mitigation**:
  - Focus on operations where we can win (activations, reductions)
  - Document where PyTorch is faster (cuDNN convolutions)
  - Target 60-80% of PyTorch speed (acceptable for Rust benefits)
  - Optimize critical kernels (matmul, conv2d)

**Risk 2: Autograd Complexity**
- **Threat**: Reverse-mode AD too complex to implement correctly
- **Probability**: MEDIUM
- **Impact**: HIGH (blocks training entirely)
- **Mitigation**:
  - Study PyTorch autograd internals
  - Start with simple ops (add, mul) before complex (matmul, conv)
  - Comprehensive gradient checking tests
  - Hire autograd expert if needed

**Risk 3: Memory Overhead**
- **Threat**: Autograd graph consumes excessive memory
- **Probability**: MEDIUM
- **Impact**: MEDIUM (limits model size)
- **Mitigation**:
  - Implement gradient checkpointing
  - Release intermediates eagerly
  - Provide in-place operations where safe
  - Profile memory usage, optimize hotspots

### 8.2 Strategic Risks

**Risk 1: PyTorch Ecosystem Lock-in**
- **Threat**: Users prefer PyTorch due to ecosystem (libraries, pretrained models)
- **Probability**: HIGH
- **Impact**: MEDIUM (limits adoption)
- **Mitigation**:
  - Provide ONNX import/export
  - Load PyTorch pretrained weights
  - Target Rust-native ML use cases (embedded, production)
  - Emphasize benefits: type safety, performance, no Python overhead

**Risk 2: Feature Creep**
- **Threat**: Trying to match every PyTorch feature delays v1.0
- **Probability**: MEDIUM
- **Impact**: HIGH (never reach production-ready state)
- **Mitigation**:
  - Focus on 80% use cases (ResNet, BERT, simple CNNs)
  - Explicitly declare non-goals (symbolic computation, JIT compilation)
  - Ship v1.0 with core features, add advanced features in v1.x

**Risk 3: Rust Ergonomics**
- **Threat**: Rust's ownership model makes ML code verbose/difficult
- **Probability**: MEDIUM
- **Impact**: MEDIUM (poor developer experience)
- **Mitigation**:
  - Provide ergonomic macros (tensor!, sequential!)
  - Smart default behaviors (auto-grad on by default)
  - Comprehensive examples and tutorials
  - Gather user feedback, iterate on API

---

## 9. Success Metrics & KPIs

### 9.1 Technical Metrics

**Phase 1 (v0.3.0) - 1D Operations Complete**:
- ✅ All common 1D operations GPU-accelerated
- ✅ 10-50x GPU speedup validated by benchmarks
- ✅ >90% test coverage
- ✅ Mutation testing ≥80% kill rate

**Phase 2 (v0.6.0) - NumPy Parity**:
- ✅ 80-120% of NumPy performance
- ✅ Pass 80%+ of NumPy test suite (for covered ops)
- ✅ Multi-dimensional arrays (0D-4D)
- ✅ Broadcasting support

**Phase 3 (v1.0.0) - Training Ready**:
- ✅ Train ResNet-50 on CIFAR-10 in <30 minutes (single GPU)
- ✅ 60-80% of PyTorch performance
- ✅ Autograd matches PyTorch (< 1e-5 gradient error)

### 9.2 Adoption Metrics

**Phase 1**:
- ✅ Used in production by ≥3 projects
- ✅ ≥100 GitHub stars
- ✅ ≥10 contributors

**Phase 2**:
- ✅ ≥10 production deployments
- ✅ ≥500 GitHub stars
- ✅ ≥50 contributors

**Phase 3**:
- ✅ Used in production ML training pipelines
- ✅ ≥1,000 GitHub stars
- ✅ ≥100 contributors
- ✅ Featured in Rust ML blog posts/talks

### 9.3 Quality Metrics (Continuous)

**EXTREME TDD Standards** (All phases):
- Test coverage: ≥90%
- Mutation testing: ≥80% kill rate
- Property-based tests: All core operations
- Backend equivalence: GPU vs SIMD vs Scalar (< 1e-5 error)
- **Differential testing**: All operations vs NumPy/PyTorch (Phase 2+)
- **Fuzz testing**: Robustness against malformed inputs (Phase 2+)
- Gradient checking: All autograd operations (Phase 3)
- Zero clippy warnings
- PMAT TDG: ≥B+ (85/100)

**Additional Testing Strategies** (*Jidoka* - Build Quality In):

**1. Differential Testing** (Catch Numerical Bugs):

For every operation, run it in Trueno, NumPy, and PyTorch with the same random inputs and assert outputs are numerically close [McKeeman, 1998]:

```rust
#[test]
fn test_matmul_vs_numpy() {
    let a = Tensor::randn([100, 200]);
    let b = Tensor::randn([200, 300]);

    // Trueno result
    let trueno_result = a.matmul(&b).unwrap();

    // NumPy result (via PyO3)
    let numpy_result = numpy_matmul(a.to_numpy(), b.to_numpy());

    // Must be very close
    assert_tensors_close(&trueno_result, &numpy_result, eps=1e-5);
}
```

This is incredibly powerful for finding subtle bugs in numerical implementations that unit tests miss.

**2. Fuzz Testing** (Security + Robustness):

Use `cargo-fuzz` to feed random, potentially malformed data into functions [Miller et al., 1990]:

```rust
// fuzz/fuzz_targets/model_loading.rs
#![no_main]
use libfuzzer_sys::fuzz_target;
use trueno::nn;

fuzz_target!(|data: &[u8]| {
    // Try to load model from arbitrary bytes
    // Should not crash, even on malformed input
    let _ = nn::load_from_bytes(data);
});
```

Fuzz testing is essential for:
- Model loading (ONNX, checkpoint files)
- Complex control flow (broadcasting, indexing)
- Ensuring no panics on malformed input
- Security (prevent exploitation via malicious model files)

---

## 10. Non-Goals

### What Trueno Will NOT Be

❌ **100% PyTorch-compatible** - Inspired by, not clone of
- We will NOT replicate every PyTorch feature
- Focus on core 80% use cases (ResNet, BERT, simple CNNs)
- Explicitly omit: JIT compilation, symbolic computation, TorchScript

❌ **Research-first library** - Production performance is priority
- We will NOT prioritize cutting-edge research features
- Focus on battle-tested architectures
- Add experimental features only after proven in production

❌ **Python-first** - Rust-native, Python bindings optional
- Primary API is Rust, not Python
- PyO3 bindings for interop, but Rust is first-class
- Performance over Python compatibility

❌ **Dynamic typing** - Static typing for safety
- Compile-time shape checking where possible
- Type errors at compile time, not runtime
- Trade ergonomics for safety/performance

❌ **Symbolic computation** - Eager execution only
- No graph optimization, no lazy evaluation
- Simple mental model: operations execute immediately
- Focus on imperative PyTorch-style API

---

## 11. Appendix: Competitive Analysis

### 11.1 Trueno vs. NumPy

| Feature | NumPy | Trueno v0.2.0 | Trueno v0.6.0 (Target) |
|---------|-------|---------------|------------------------|
| **Multi-dim arrays** | ✅ 0D-32D | ❌ 1D only | ✅ 0D-4D |
| **Broadcasting** | ✅ Full | ❌ None | ✅ Full |
| **Data types** | ✅ 20+ | ❌ f32 only | ✅ 10+ (f16/f32/f64/i32/etc.) |
| **GPU acceleration** | ❌ CPU only | ✅ 14 ops | ✅ 50+ ops |
| **Performance (SIMD)** | ❌ Limited | ✅ 2-8x faster | ✅ 2-8x faster |
| **Performance (GPU)** | ❌ N/A | ✅ 10-50x faster | ✅ 10-50x faster |
| **Type safety** | ❌ Runtime | ✅ Compile-time | ✅ Compile-time |
| **Memory safety** | ❌ C backend | ✅ Rust safe | ✅ Rust safe |

**Verdict**: Trueno v0.6.0 will match NumPy functionality with superior performance and safety.

### 11.2 Trueno vs. PyTorch

| Feature | PyTorch | Trueno v0.2.0 | Trueno v1.0.0 (Target) |
|---------|---------|---------------|------------------------|
| **Autograd** | ✅ Full | ❌ None | ✅ Full |
| **GPU ops** | ✅ 1000+ | ✅ 14 | ✅ 100+ |
| **Training** | ✅ Full | ❌ Inference only | ✅ Full |
| **Layers** | ✅ 100+ | ❌ None | ✅ 20+ core layers |
| **Optimizers** | ✅ 10+ | ❌ None | ✅ 5+ (SGD, Adam, AdamW, etc.) |
| **Pretrained models** | ✅ 1000+ | ❌ None | ✅ 10+ (ResNet, BERT, etc.) |
| **Performance (cuDNN)** | ✅ Baseline | ❌ N/A | ✅ 60-80% |
| **Type safety** | ❌ Runtime | ✅ Compile-time | ✅ Compile-time |
| **Memory safety** | ❌ Python/C++ | ✅ Rust safe | ✅ Rust safe |

**Verdict**: Trueno v1.0.0 will enable training with 60-80% of PyTorch performance, prioritizing safety.

### 11.3 Trueno vs. Other Rust ML Libraries

**vs. burn** (Rust deep learning):
- burn: Dynamic graphs, flexible, but less mature
- Trueno: Focus on production performance, GPU optimization, NumPy/PyTorch API

**vs. candle** (HuggingFace):
- candle: Inference-focused, minimal dependencies
- Trueno: Training + inference, SIMD + GPU, broader API surface

**vs. ndarray** (Rust NumPy):
- ndarray: CPU-only, no GPU, no autograd
- Trueno: GPU-accelerated, autograd (Phase 3), training-ready

**Trueno Differentiation**:
- ✅ GPU-first design (14 ops today, 100+ by v1.0)
- ✅ EXTREME TDD (>90% coverage, mutation testing)
- ✅ PyTorch/NumPy compatibility (familiar API)
- ✅ Production focus (safety, performance, PMAT quality gates)

---

## 12. Future Directions & Developer Experience

### 12.1 Tensor Compiler Integration (*Long-Term Vision*)

**Current State**: Trueno uses hand-written WGSL shaders for GPU operations. While effective, this approach has limitations:
- **cuDNN Gap**: PyTorch's cuDNN integration is heavily optimized by NVIDIA for specific architectures
- **Maintenance Burden**: Each operation requires custom shader implementation
- **Architecture-Specific**: Hard to optimize for different GPUs (NVIDIA vs AMD vs Apple M-series)

**Future Direction** (Post-v1.0): Integrate with a tensor compiler backend like Apache TVM or MLIR [Chen et al., 2018; Lattner et al., 2020]. These tools automatically generate highly optimized kernels for a given operation and hardware target.

**Benefits**:
- **Match/Exceed cuDNN**: Auto-optimization can match or beat vendor libraries
- **Multi-Hardware**: Single codebase targets NVIDIA, AMD, Apple, Intel GPUs
- **Reduced Maintenance**: Compiler generates kernels, we write high-level ops
- **Novel Architectures**: Easy to support new hardware (e.g., Google TPUs)

**Integration Path** (Not v1.0, but architecture should not preclude this):
```rust
// High-level operation (unchanged)
let result = a.matmul(&b);

// Under the hood (v1.0: hand-written WGSL)
gpu.execute_wgsl_shader("matmul.wgsl", ...);

// Under the hood (future: TVM/MLIR compiler)
let compiled_kernel = tvm::compile_matmul(a.shape(), b.shape(), target_gpu);
gpu.execute_compiled_kernel(compiled_kernel, ...);
```

This aligns with the direction of modern ML frameworks like JAX (uses XLA compiler) and PyTorch (integrating TorchInductor).

### 12.2 Error Message Quality (*Respect for People*)

The quality of compiler error messages will be make-or-break for usability [Ko et al., 2011]. Rust's ownership system already provides excellent error messages; Trueno must match this standard for tensor operations.

**Bad Error Message** (Current Rust libraries):
```
thread 'main' panicked at 'assertion failed: self.shape[1] == other.shape[0]'
```

**Good Error Message** (Trueno Target):
```
Error: Cannot multiply tensors due to incompatible shapes

  Found:
    Left tensor:  shape (3, 4)  [3 rows, 4 columns]
    Right tensor: shape (5, 6)  [5 rows, 6 columns]

  Problem: Inner dimensions do not match (4 ≠ 5)

  For matrix multiplication A @ B:
    - A must have shape (m, k)
    - B must have shape (k, n)
    - Result will have shape (m, n)

  Hint: Did you mean to transpose one of the tensors?
    - a.transpose().matmul(&b)  // Shape (4, 3) @ (5, 6)
    - a.matmul(&b.transpose())  // Shape (3, 4) @ (6, 5)
```

**Implementation Strategy**:
```rust
impl Tensor {
    pub fn matmul(&self, other: &Tensor) -> Result<Tensor, TruenoError> {
        if self.shape[1] != other.shape[0] {
            return Err(TruenoError::ShapeMismatch {
                operation: "matmul",
                left_shape: self.shape.clone(),
                right_shape: other.shape.clone(),
                issue: format!(
                    "Inner dimensions do not match ({} ≠ {})",
                    self.shape[1], other.shape[0]
                ),
                hint: "Did you mean to transpose one of the tensors?",
            });
        }
        // ... implementation
    }
}
```

Investing in diagnostic quality pays enormous dividends in user adoption. Research on HCI shows that error message quality significantly impacts developer productivity [Meyer et al., 2017].

### 12.3 Ecosystem - Model Hub (*Reduce Activation Energy*)

**Strategic Risk**: PyTorch Ecosystem Lock-in is the biggest external threat. Users prefer PyTorch due to the vast ecosystem of pretrained models.

**Mitigation** (Beyond ONNX Support):

Create a **Trueno Model Hub** with popular architectures pre-ported and ready to use:

```rust
use trueno::hub;

// Load pretrained ResNet-50 (ImageNet weights)
let model = hub::load("resnet50", pretrained=true)?;

// Immediate transfer learning
let mut model = model;
model.fc = nn::Linear::new(2048, num_custom_classes);

let optimizer = optim::Adam::new(model.parameters(), lr=0.001);
// ... train on custom dataset
```

**Model Hub Contents** (Target for v1.0):
- **Vision**: ResNet-{18,34,50,101,152}, VGG-{11,13,16,19}, MobileNet-V2/V3, EfficientNet
- **NLP**: BERT-{base,large}, GPT-2, DistilBERT
- **Detection**: YOLO-v5, Faster R-CNN (Phase 4)

**Implementation**:
- Host pretrained weights (convert from PyTorch using ONNX)
- Provide model definitions in idiomatic Trueno code
- Benchmark inference speed vs PyTorch
- Document transfer learning workflows

This drastically lowers the activation energy for new users who want to do transfer learning—the dominant use case in industry [Amershi et al., 2019]. The ecosystem effect is a primary driver of platform adoption.

---

## 13. Tiered TDD-X Workflow & Quality Gates (Certeza Insights)

This section integrates lessons from the **certeza** project (https://github.com/paiml/certeza), a scientific experiment achieving **97.7% mutation score** through asymptotic test effectiveness. Certeza demonstrates that EXTREME TDD is achievable and sustainable when verification is properly tiered [1,2,3].

### 13.1 The Flow State Problem

**Critical Finding**: Different verification techniques operate at different time scales. Fast feedback enables flow state; slow feedback causes context switching waste [certeza spec §2.2]. Running mutation testing on every file save destroys productivity (10-100x loss) [4].

**Anti-Pattern** ❌:
```bash
# NEVER do this - destroys flow state
watch -n 1 'cargo mutants'  # Running mutation tests every second
```

**Solution**: Tiered feedback loops that match verification intensity to development phase [5,6].

### 13.2 Tiered Workflow (Certeza Model)

```
┌──────────────────────────────────────────────────────────┐
│              Tiered TDD-X Workflow                        │
├──────────────────────────────────────────────────────────┤
│                                                           │
│  TIER 1: ON-SAVE (Sub-second feedback)                   │
│  ├─ Write failing unit/property test                     │
│  ├─ Implement minimal code                               │
│  ├─ Run focused test suite                               │
│  ├─ Static analysis (cargo check, clippy)                │
│  └─ Iterate rapidly in flow state                        │
│                                                           │
│  TIER 2: ON-COMMIT (1-5 minutes)                         │
│  ├─ Full property-based test suite                       │
│  ├─ Coverage analysis (target: 90%+ line)                │
│  ├─ Integration tests                                    │
│  ├─ Differential testing (vs NumPy/PyTorch)              │
│  └─ Pre-commit hook enforcement                          │
│                                                           │
│  TIER 3: ON-MERGE/NIGHTLY (Hours)                        │
│  ├─ Comprehensive mutation testing                       │
│  │  └─ Target: ≥80% score, analyze survivors            │
│  ├─ Formal verification (critical paths)                 │
│  ├─ Performance benchmarks                               │
│  ├─ Security audits                                      │
│  └─ CI/CD gate for main branch                           │
│                                                           │
└──────────────────────────────────────────────────────────┘
```

**Workflow Mapping** (Trueno-Specific):

| Activity | Tier | Frequency | Time Budget | Purpose |
|----------|------|-----------|-------------|---------|
| Unit tests (focused) | 1 | Every save | <1s | Rapid iteration |
| Clippy (fast mode) | 1 | Every save | <1s | Quick lint check |
| Property tests (focused) | 1 | Every save | <3s | Design validation |
| Full test suite | 2 | Every commit | <5m | Regression prevention |
| Coverage analysis | 2 | Every commit | <2m | Completeness check |
| Differential testing | 2 | Every commit | <5m | NumPy/PyTorch equivalence |
| Mutation testing | 3 | Pre-merge/nightly | <2h | Test quality assurance |
| Formal verification | 3 | Pre-merge | <4h | Proof of invariants |
| Benchmarks (full suite) | 3 | Nightly | <1h | Performance regression |

### 13.3 Risk-Based Verification Matrix

**Core Principle**: Not all code requires the same verification intensity. Apply rigorous techniques to high-risk components [certeza spec §2.3].

**Trueno Component Classification**:

```
       │ Low Complexity  │ Medium Complexity │ High Complexity
───────┼─────────────────┼───────────────────┼─────────────────
High   │ Property +      │ Property +        │ Property +
Crit   │ Coverage +      │ Coverage +        │ Coverage +
       │ Mutation (85%)  │ Mutation (90%)    │ Mutation (90%) +
       │                 │                   │ Formal Verify
───────┼─────────────────┼───────────────────┼─────────────────
Medium │ Property +      │ Property +        │ Property +
Crit   │ Coverage (95%)  │ Coverage +        │ Coverage +
       │                 │ Mutation (80%)    │ Mutation (85%)
───────┼─────────────────┼───────────────────┼─────────────────
Low    │ Unit Tests +    │ Unit Tests +      │ Property +
Crit   │ Coverage (90%)  │ Property +        │ Coverage (95%)
       │                 │ Coverage (90%)    │
```

**Component Risk Assessment** (Trueno-Specific):

| Risk Level | Component Examples | Verification Approach | Rationale |
|------------|-------------------|----------------------|-----------|
| **Very High** | SIMD intrinsics (`unsafe` blocks), GPU compute shaders (WGSL), autograd reverse-mode AD engine, memory allocators | Property + Coverage (95%) + Mutation (90%) + Formal | Memory safety, correctness-critical, high complexity |
| **High** | Tensor operations (matmul, conv2d), broadcasting logic, gradient computation, optimizer updates | Property + Coverage (95%) + Mutation (85-90%) | Core algorithms, numerical correctness critical |
| **Medium** | Activation functions (relu, sigmoid), loss functions, data loaders, layer implementations | Property + Coverage (90%) + Mutation (80%) | Important but less complex, well-understood algorithms |
| **Low** | Utility functions, simple accessors, configuration parsing, error messages | Unit tests + Coverage (90%) | Low complexity, limited failure modes |

**Resource Allocation**: Spend 40% of verification time on the 5-10% highest-risk code (SIMD intrinsics, GPU shaders, autograd engine) [7].

### 13.4 Testing Pyramid Distribution

Following certeza's empirically validated distribution [certeza README]:

```
               ┌─────────────────┐
               │  Formal (Kani)  │  <- Invariant proofs
               │   ~1-5% code    │     (capacity >= len, no overflows)
               ├─────────────────┤
               │   Integration   │  <- System properties
               │    ~10% tests   │     (end-to-end workflows)
               ├─────────────────┤
               │  Property-Based │  <- Algorithmic correctness
               │    ~30% tests   │     (commutativity, associativity)
               ├─────────────────┤
               │   Unit Tests    │  <- Basic functionality
               │    ~60% tests   │     (edge cases, error handling)
               └─────────────────┘

         Coverage ────────────────────> Mutation ────────────────────> Properties
```

**Trueno Target Distribution** (v1.0.0):
- **Unit tests**: ~150-200 tests (60%) - Basic operation correctness, edge cases
- **Property-based tests**: ~75-100 properties (30%) - Equivalence with NumPy/PyTorch, mathematical properties
- **Integration tests**: ~25-30 scenarios (10%) - End-to-end training, model inference
- **Formal verification**: ~3-5 proofs (1-5% of code) - Tensor invariants, bounds checking

**Property-Based Testing Emphasis** (Inspired by certeza's 53 properties):

```rust
use proptest::prelude::*;

proptest! {
    #[test]
    fn test_add_commutative(
        a in prop::collection::vec(-1000.0f32..1000.0, 1..10000),
        b in prop::collection::vec(-1000.0f32..1000.0, 1..10000)
    ) {
        let a_vec = Vector::from_slice(&a);
        let b_vec = Vector::from_slice(&b);

        let ab = a_vec.add(&b_vec).unwrap();
        let ba = b_vec.add(&a_vec).unwrap();

        // Commutativity: a + b == b + a
        assert_tensors_close(&ab, &ba, 1e-5);
    }

    #[test]
    fn test_numpy_equivalence(
        a in prop::collection::vec(-100.0f32..100.0, 1..1000)
    ) {
        // Differential testing: Trueno must match NumPy
        let trueno_result = Vector::from_slice(&a).exp().unwrap();
        let numpy_result = call_numpy_exp(&a);  // Via PyO3

        assert_tensors_close(&trueno_result, &numpy_result, 1e-5);
    }
}
```

### 13.5 Quality Gates Per Tier

**Tier 1: ON-SAVE** (Sub-second):
```bash
make tier1  # Certeza-style Makefile target
```
- `cargo check` passes
-`cargo clippy --lib` (fast mode) - zero warnings
- ✅ Unit tests pass (focused subset)
- ✅ Property tests pass (small case count: PROPTEST_CASES=10)

**Tier 2: ON-COMMIT** (1-5 minutes):
```bash
make tier2
```
- `cargo fmt --check` - code formatted
-`cargo clippy --all-targets --all-features` - zero warnings
- ✅ Full test suite passes (`cargo test --all-features`)
- ✅ Coverage ≥90% (`cargo llvm-cov`)
- ✅ Property tests pass (full cases: PROPTEST_CASES=256-1000)
- ✅ Differential tests pass (vs NumPy/PyTorch, error < 1e-5)
- ✅ PMAT TDG ≥B+ (85/100)
- ✅ Zero SATD comments (TODO/FIXME/HACK)

**Tier 3: ON-MERGE/NIGHTLY** (Hours):
```bash
make tier3
```
- ✅ Mutation testing ≥80% kill rate (`cargo mutants`)
- ✅ Benchmarks: no regressions >5%
- ✅ Security audit (`cargo audit`, `cargo deny`)
- ✅ Formal verification passes (Kani proofs for critical invariants)
- ✅ PMAT repo score ≥90/110

### 13.6 Anti-Patterns to Avoid

Based on certeza specification §2.2 and empirical findings:

1. **❌ Running Mutation Tests on Every Save**
   - **Problem**: Destroys flow state (10-100x productivity loss)
   - **Solution**: Run mutation tests only in Tier 3 (pre-merge/nightly)

2. **❌ Chasing Metrics Without Understanding**
   - **Problem**: Goodhart's Law - "When a measure becomes a target, it ceases to be a good measure"
   - **Solution**: Mutation analysis is a learning exercise, not just a metric

3. **❌ Applying Full Verification to Low-Risk Code**
   - **Problem**: Over-processing waste (Toyota Way: *Muda*)
   - **Solution**: Risk-based resource allocation (40% time on 5-10% highest-risk code)

4. **❌ Ignoring Cognitive Load Limits**
   - **Problem**: Analyzing 1000 surviving mutants in one session causes burnout
   - **Solution**: Time-boxing (2-hour sessions), batching, pair programming for mutation analysis

5. **❌ Skipping Differential Testing**
   - **Problem**: Missing NumPy/PyTorch incompatibilities until production
   - **Solution**: Every operation must have differential test (vs NumPy/PyTorch)

6. **❌ No Gradient Checking for Autograd**
   - **Problem**: Analytical gradients may be incorrect (chain rule bugs)
   - **Solution**: Automated gradient checking (analytical vs numerical) for every autograd operation

### 13.7 Certeza as Empirical Proof

**Key Finding**: certeza achieved **97.7% mutation score** with 231 tests, proving that EXTREME TDD targets are achievable and sustainable when properly tiered [certeza README].

**Certeza Achievement Breakdown**:
- **Unit tests**: 150+ tests (60%)
- **Property-based tests**: 53 properties (30%)
- **Integration tests**: 26 scenarios (10%)
- **Formal verification**: 3 Kani proofs (capacity invariant, push/pop correctness, bounds checking)
- **Mutation score**: 97.7% (only ~10 surviving mutants out of 400+)
- **Coverage**: >95% line coverage

**Lesson for Trueno**: If a simple vector data structure (certeza::TruenoVec) can achieve 97.7% mutation score, then trueno's operations (elementwise ops, reductions, activations) can achieve ≥90% mutation score by following the same tiered workflow.

**Workflow Commands** (Trueno should adopt certeza's Makefile structure):

```bash
# Tier 1: Sub-second feedback (flow state)
make tier1

# Tier 2: Pre-commit (1-5 minutes)
make tier2

# Tier 3: Pre-merge/nightly (hours)
make tier3

# Kaizen: Continuous improvement cycle
make kaizen  # Analyze coverage, complexity, TDG, binary size

# Demo mode: Interactive demonstration
make demo-mode
```

### 13.8 References (Certeza-Specific)

1. Certeza Specification: "Asymptotic Test Effectiveness: A Practical Framework for High-Assurance Rust Verification" (v1.1, ~14K words)
2. Certeza README: 97.7% mutation score achievement with 231 tests
3. Jia, Y., & Harman, M. (2011). An analysis and survey of the development of mutation testing. *IEEE TSE, 37(5)*.
4. Fowler, M. (2013). *Refactoring: Improving the Design of Existing Code* (2nd ed.). Chapter on Test-Driven Development and flow state.
5. Google Testing Blog. (2014). Test Sizes. https://testing.googleblog.com/2010/12/test-sizes.html
6. Humble, J., & Farley, D. (2010). *Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation*. Addison-Wesley.
7. McConnell, S. (2004). *Code Complete* (2nd ed.). Microsoft Press. Chapter 20: Software Quality Landscape.

---

## 14. Conclusion

Trueno is strategically positioned to become the **de facto PyTorch/NumPy replacement for Rust**. This specification defines a clear, achievable roadmap:

**v0.2.0 → v0.3.0** (2-3 months): Complete 1D operations, validate GPU performance
**v0.3.0 → v0.6.0** (6-12 months): Multi-dimensional tensors, broadcasting, NumPy parity
**v0.6.0 → v1.0.0** (12-18 months): Autograd, training, PyTorch parity

**Total timeline**: 20-33 months to production-ready deep learning library.

**Key Success Factors**:
1. **Focus**: Prioritize 80% use cases, defer advanced features
2. **Quality**: Maintain EXTREME TDD standards throughout
3. **Performance**: Validate GPU speedup claims, optimize critical paths
4. **Adoption**: Ship early, gather feedback, iterate on API

**Next Steps**:
1. ✅ Complete v0.2.1 (softmax/log_softmax GPU) - DONE
2. Benchmark all GPU ops (validate 10-50x claims)
3. Implement remaining activations (hardswish, mish, selu)
4. Begin Phase 2 design (Tensor<T, N> type)

---

## 15. Academic References

This specification is grounded in academic research and industry best practices. Key publications informing Trueno's design:

**Deep Learning Frameworks**:
1. Abadi, M., et al. (2016). TensorFlow: A System for Large-Scale Machine Learning. *OSDI 16*.
2. Paszke, A., et al. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. *NeurIPS 2019*.
3. Chen, T., et al. (2018). TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. *OSDI 18*.
4. Lattner, C., et al. (2020). MLIR: A Compiler Infrastructure for the End of Moore's Law. *arXiv:2002.11054*.

**Automatic Differentiation**:
5. Baydin, A. G., et al. (2018). Automatic Differentiation in Machine Learning: a Survey. *Journal of Machine Learning Research, 18*.
6. Griewank, A., & Walther, A. (2008). *Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation*. SIAM.

**Numerical Computing**:
7. Van der Walt, S., et al. (2011). The NumPy Array: A Structure for Efficient Numerical Computation. *Computing in Science & Engineering, 13(2)*.
8. Goto, K., & van de Geijn, R. (2008). Anatomy of High-Performance Matrix Multiplication. *ACM Transactions on Mathematical Software, 34(3)*.
9. Bradbury, J., et al. (2018). JAX: composable transformations of Python+NumPy programs. *Google Research*.

**Software Testing**:
10. Pei, K., et al. (2017). DeepXplore: Automated Whitebox Testing of Deep Learning Systems. *SOSP 2017*.
11. Zhang, H., et al. (2020). An Empirical Study of Common Bugs in Deep Learning Applications. *ISSRE 2020*.
12. McKeeman, W. M. (1998). Differential testing for software. *Digital Technical Journal, 10(1)*.
13. Miller, B. P., et al. (1990). An empirical study of the reliability of UNIX utilities. *CACM, 33(12)*.

**Human-Computer Interaction**:
14. Ko, A. J., et al. (2011). The state of the art in error messages: an empirical study. *ISSTA 2011*.
15. Meyer, A. N., et al. (2017). The pragmatic programmer's programmer: A study of professional developers' learning strategies. *ICER 2017*.

**Software Engineering for ML**:
16. Amershi, S., et al. (2019). Software Engineering for Machine Learning: A Case Study. *ICSE-SEIP 2019*.
17. Sculley, D., et al. (2015). Hidden Technical Debt in Machine Learning Systems. *NeurIPS 2015*.
18. Jia, X., et al. (2019). A Survey of Software Engineering for Machine Learning. *arXiv:1906.07548*.

**Foundational Architectures** (Target Use Cases):
19. Vaswani, A., et al. (2017). Attention Is All You Need. *NeurIPS 2017*. (Transformers)
20. He, K., et al. (2016). Deep Residual Learning for Image Recognition. *CVPR 2016*. (ResNets)

---

**Document Version**: 1.2
**Last Updated**: 2025-11-17
**Status**: Living Document (update as roadmap evolves)
**Owner**: Trueno Core Team

**Changelog**:
- **v1.2** (2025-11-17): Added Section 13 - Tiered TDD-X Workflow & Quality Gates based on certeza scientific experiment (97.7% mutation score). Includes: tiered workflow (tier1/tier2/tier3), risk-based verification matrix, testing pyramid distribution, quality gates per tier, anti-patterns, certeza empirical proof. ~270 lines added.
- **v1.1** (2025-11-17): Added Kaizen improvements: tensor type trade-offs, storage layout considerations, broadcasting quality gates, autograd gradient checking, differential/fuzz testing, tensor compiler vision, error message quality, model hub strategy, academic citations
- **v1.0** (2025-11-17): Initial comprehensive specification