# Trueno: PyTorch/NumPy Replacement Specification
**Version**: 1.0
**Date**: 2025-11-17
**Status**: Living Document
**Priority**: CRITICAL - Core Strategic Positioning
---
## Executive Summary
Trueno is a high-performance compute library designed as a **drop-in replacement** for NumPy (data processing) and PyTorch (inference + training) in Rust-native applications. This specification defines the roadmap from current state (v0.2.0) to full replacement parity.
**Current State (v0.2.0)**:
- **NumPy Replacement**: ~35% complete (1D operations only)
- **PyTorch Replacement**: ~15% complete (inference only, no autograd)
- **Strategic Positioning**: High-performance 1D vector compute library with GPU acceleration
**Target State (v1.0.0)**:
- **NumPy Replacement**: 80%+ complete (multi-dimensional arrays, broadcasting, core ops)
- **PyTorch Replacement**: 60%+ complete (autograd, training, common layers)
- **Strategic Positioning**: Production-ready deep learning library for Rust
---
## 1. Current State Analysis (v0.2.0)
### 1.1 NumPy Replacement: ~35% Complete
#### ✅ What Works Today
**1D Element-wise Operations** (100% coverage):
```rust
use trueno::Vector;
// NumPy: np.add(a, b), np.exp(a), np.log(a)
let a = Vector::from_slice(&[1.0, 2.0, 3.0]);
let b = Vector::from_slice(&[4.0, 5.0, 6.0]);
let result = a.add(&b).unwrap(); // Works identically to NumPy
```
**Supported operations**:
- Arithmetic: `add`, `sub`, `mul`, `div`, `neg`
- Transcendental: `exp`, `log`, `pow`, `sqrt`
- Trigonometric: `sin`, `cos`, `tan`
- Reductions: `sum`, `mean`, `std`, `min`, `max`, `norm`
- Linear algebra: `dot`, `matmul` (1D→2D via reshape)
**Reductions** (100% coverage for 1D):
```rust
// NumPy: np.sum(arr), np.mean(arr), np.std(arr)
let data = Vector::from_slice(&[1.0, 2.0, 3.0, 4.0, 5.0]);
let total = data.sum().unwrap(); // 15.0
let average = data.mean().unwrap(); // 3.0
let stdev = data.std().unwrap(); // ~1.414
```
**GPU Acceleration** (14 operations):
- **OpComplexity::Low** (>100K threshold): `vec_add`, `dot`, `relu`, `leaky_relu`, `elu`, `sigmoid`, `tanh`, `swish`, `gelu`, `clip`
- **OpComplexity::Medium** (>10K threshold): `softmax`, `log_softmax`
- **OpComplexity::High** (>1K threshold): `matmul`, `convolve2d`
**Performance**:
- CPU SIMD: 2-8x faster than scalar (SSE2/AVX/AVX2)
- GPU: 10-50x faster than scalar for large workloads (>100K elements)
#### ❌ Critical Gaps (Blocking NumPy Replacement)
**Multi-dimensional Arrays** (0% coverage):
```python
# NumPy - WORKS
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]]) # 2D array (2×3)
arr.shape # (2, 3)
# Trueno - MISSING
# Currently only supports 1D Vector<f32>
```
**Broadcasting** (0% coverage):
```python
# NumPy - WORKS
a = np.array([[1, 2, 3]]) # Shape (1, 3)
b = np.array([[1], [2], [3]]) # Shape (3, 1)
result = a + b # Shape (3, 3) via broadcasting
# Trueno - MISSING
# No broadcasting support, must manually tile/repeat
```
**Advanced Indexing** (0% coverage):
```python
# NumPy - WORKS
arr = np.array([1, 2, 3, 4, 5])
arr[1:4] # Slicing: [2, 3, 4]
arr[[0, 2, 4]] # Fancy indexing: [1, 3, 5]
arr[arr > 2] # Boolean masking: [3, 4, 5]
# Trueno - MISSING
# Only supports full array operations
```
**Reshaping/Transposing** (0% coverage):
```python
# NumPy - WORKS
arr = np.array([1, 2, 3, 4, 5, 6])
arr.reshape(2, 3) # [[1, 2, 3], [4, 5, 6]]
arr.transpose() # Swap axes
# Trueno - MISSING
# No reshape or transpose operations
```
**Multiple Data Types** (0% coverage):
```python
# NumPy - WORKS
np.array([1, 2, 3], dtype=np.int32)
np.array([1.0, 2.0], dtype=np.float64)
# Trueno - ONLY f32
# Only supports Vector<f32>, no int32/float64/etc.
```
#### 🎯 Viable NumPy Use Cases Today
**1. 1D Signal Processing**:
```rust
use trueno::Vector;
// Audio processing: apply filter to 1D signal
let signal = Vector::from_slice(&audio_samples);
let filtered = signal.convolve2d(&filter_kernel).unwrap();
```
**2. Vector Similarity/Distance**:
```rust
// Compute cosine similarity between document embeddings
let doc1 = Vector::from_slice(&embedding1);
let doc2 = Vector::from_slice(&embedding2);
let similarity = doc1.dot(&doc2).unwrap() / (doc1.norm().unwrap() * doc2.norm().unwrap());
```
**3. Large-scale Element-wise Transformations** (GPU-accelerated):
```rust
// Activate 1M logits with GPU acceleration
let logits = Vector::from_slice(&vec![...; 1_000_000]);
let probs = logits.sigmoid().unwrap(); // Auto-uses GPU for >100K elements
```
---
### 1.2 PyTorch Replacement: ~15% Complete
#### ✅ What Works Today
**GPU-Accelerated Activations** (Inference Only):
```rust
use trueno::Vector;
// PyTorch: F.relu(x), F.sigmoid(x), F.gelu(x)
let hidden = Vector::from_slice(&[...]);
let activated = hidden.relu().unwrap(); // ReLU
let squashed = hidden.sigmoid().unwrap(); // Sigmoid
let gelu_out = hidden.gelu().unwrap(); // GELU (BERT/GPT)
let attention = scores.softmax().unwrap(); // Softmax (transformers)
```
**Supported activations** (14 GPU-accelerated):
- Standard: ReLU, Leaky ReLU, ELU, Sigmoid, Tanh
- Modern: Swish, GELU, Softmax, Log-Softmax
- Utility: Clip (gradient clipping)
**Basic Tensor Operations** (1D only):
```rust
// PyTorch: torch.dot(a, b), torch.matmul(A, B)
let a = Vector::from_slice(&[1.0, 2.0, 3.0]);
let b = Vector::from_slice(&[4.0, 5.0, 6.0]);
let dot_product = a.dot(&b).unwrap(); // Works like PyTorch
```
**Performance** (GPU-accelerated):
- Element-wise ops (>100K): 10-50x faster than scalar
- Multi-pass ops (>10K): 5-20x faster than scalar (softmax, log_softmax)
- Matrix ops (>1K×1K): 10-50x faster than naive CPU
#### ❌ Critical Blockers (Preventing PyTorch Replacement)
**No Autograd** (0% coverage):
```python
# PyTorch - WORKS
import torch
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()
y.backward() # Compute gradients
print(x.grad) # [2.0, 4.0, 6.0]
# Trueno - MISSING
# Cannot train neural networks without gradient computation
```
**No Layers** (0% coverage):
```python
# PyTorch - WORKS
import torch.nn as nn
linear = nn.Linear(784, 128)
conv = nn.Conv2d(3, 64, kernel_size=3)
norm = nn.BatchNorm2d(64)
# Trueno - MISSING
# No nn::Module trait, no layer abstractions
```
**No Loss Functions** (0% coverage):
```python
# PyTorch - WORKS
criterion = nn.CrossEntropyLoss()
loss = criterion(outputs, targets)
# Trueno - MISSING
# Must manually implement loss computation
```
**No Optimizers** (0% coverage):
```python
# PyTorch - WORKS
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
optimizer.zero_grad()
loss.backward()
optimizer.step()
# Trueno - MISSING
# No SGD, Adam, AdamW, etc.
```
**No Multi-dimensional Tensors** (0% coverage):
```python
# PyTorch - WORKS
x = torch.randn(32, 3, 224, 224) # Batch of images (N, C, H, W)
# Trueno - MISSING
# Only 1D Vector<f32>
```
**No Model Saving/Loading** (0% coverage):
```python
# PyTorch - WORKS
torch.save(model.state_dict(), 'model.pth')
model.load_state_dict(torch.load('model.pth'))
# Trueno - MISSING
# No checkpoint serialization
```
#### 🎯 Viable PyTorch Use Cases Today
**1. Neural Network Inference** (Forward Pass Only, 1D):
```rust
use trueno::Vector;
// Inference for a simple 1D feedforward network
let input = Vector::from_slice(&features);
let hidden1 = input.matmul(&weights1).unwrap().add(&bias1).unwrap().relu().unwrap();
let hidden2 = hidden1.matmul(&weights2).unwrap().add(&bias2).unwrap().gelu().unwrap();
let logits = hidden2.matmul(&weights3).unwrap().add(&bias3).unwrap();
let probs = logits.softmax().unwrap();
```
**2. Custom Activation Function Testing**:
```rust
// Compare custom activation vs standard activations
let data = Vector::from_slice(&test_data);
let relu_out = data.relu().unwrap();
let swish_out = data.swish().unwrap();
let gelu_out = data.gelu().unwrap();
```
**3. Batch Activation Processing** (GPU-accelerated):
```rust
// Process large batches of activations efficiently
let batch_logits = Vector::from_slice(&vec![...; 1_000_000]); // 10K batch × 100 classes
let batch_probs = batch_logits.softmax().unwrap(); // GPU-accelerated
```
---
## 2. Strategic Gaps Analysis
### 2.1 Blocker Priorities (Must-Have for Replacement)
**Priority 1: Multi-dimensional Tensors** (Blocks 90% of use cases)
- **Blocker for**: Images, sequences, batches, convolutions, attention
- **Impact**: Cannot process 2D/3D/4D data (images, videos, point clouds)
- **Current workaround**: Flatten to 1D (loses structure, inefficient)
- **Target**: Phase 2 (v0.4.0 - v0.6.0)
**Priority 2: Autograd Engine** (Blocks training)
- **Blocker for**: Neural network training, gradient-based optimization
- **Impact**: Inference-only library (cannot train models)
- **Current workaround**: None (fundamental missing feature)
- **Target**: Phase 3 (v0.7.0)
**Priority 3: Broadcasting** (Blocks 70% of operations)
- **Blocker for**: Shape-polymorphic operations, batch processing
- **Impact**: Must manually tile/repeat tensors (inefficient, error-prone)
- **Current workaround**: Manual shape manipulation
- **Target**: Phase 2 (v0.5.0)
### 2.2 Nice-to-Have Features (Improve Usability)
**Multiple Data Types** (f16, f64, i32, etc.):
- **Impact**: Limited to f32 (no mixed precision, no integer ops)
- **Workaround**: Cast externally, use only f32
- **Target**: Phase 2 (v0.6.0)
**Advanced Indexing/Slicing**:
- **Impact**: Verbose data access, cannot use NumPy-style slicing
- **Workaround**: Manual index computation
- **Target**: Phase 2 (v0.5.0)
**Model Serialization**:
- **Impact**: Cannot save/load trained models
- **Workaround**: Manual weight export/import
- **Target**: Phase 3 (v1.0.0)
---
## 3. Phased Roadmap to Replacement Parity
### Phase 1: Complete 1D Operations (v0.2.x - v0.3.0)
**Timeline**: 2-3 months
**Goal**: Best-in-class 1D vector compute
**NumPy Parity**: ~40% (still 1D only)
**PyTorch Parity**: ~18% (inference only)
**Deliverables**:
- [x] GPU softmax/log_softmax (v0.2.1) ✅
- [ ] Remaining activations: hardswish, mish, selu
- [ ] GPU reductions: argmax/argmin, sum/mean/std
- [ ] GPU binary ops: add/sub/mul/div
- [ ] Async GPU API (batch operations)
- [ ] Comprehensive benchmarks (validate 10-50x claims)
- [ ] WASM SIMD128 backend
**Success Metrics**:
- ✅ All common 1D operations GPU-accelerated
- ✅ 10-50x GPU speedup validated by benchmarks
- ✅ Used in production by ≥3 projects
---
### Phase 2: Multi-Dimensional Tensors (v0.4.0 - v0.6.0)
**Timeline**: 6-12 months
**Goal**: NumPy-competitive for 2D/3D arrays
**NumPy Parity**: ~80% (multi-dim, broadcasting, core ops)
**PyTorch Parity**: ~20% (still no autograd)
#### v0.4.0: Tensor Type Foundation (3-4 months)
**Core Type**:
```rust
pub struct Tensor<T, const N: usize> {
data: Vec<T>,
shape: [usize; N],
strides: [usize; N],
offset: usize,
}
// Examples:
let scalar = Tensor::<f32, 0>::from_scalar(5.0); // 0D scalar
let vector = Tensor::<f32, 1>::from_slice(&[1, 2, 3]); // 1D vector
let matrix = Tensor::<f32, 2>::new([2, 3]); // 2D matrix (2×3)
let image = Tensor::<f32, 3>::new([3, 224, 224]); // 3D image (C, H, W)
let batch = Tensor::<f32, 4>::new([32, 3, 224, 224]); // 4D batch (N, C, H, W)
```
**Design Trade-offs (Const Generics for Rank)**:
The use of `const N: usize` for tensor rank provides compile-time dimensionality checking—a significant safety advantage over Python-based frameworks. However, this creates a trade-off between static safety and dynamic flexibility:
**Advantages of Static Rank**:
- Compile-time shape verification (prevents runtime dimension errors)
- Zero-cost abstraction (no runtime rank checks)
- Type-safe operations (e.g., `matmul` requires 2D tensors)
- Excellent for fixed architectures (CNNs, transformers)
**Challenges**:
- Dynamic rank use cases (e.g., processing lists of tensors with varying ranks)
- Generic functions that accept tensors of any rank
- Interop with dynamic frameworks (loading ONNX models)
**Mitigation Strategy** (Future consideration):
```rust
// Option 1: AnyTensor enum (adds complexity but enables dynamic use cases)
pub enum AnyTensor<T> {
Rank0(Tensor<T, 0>),
Rank1(Tensor<T, 1>),
Rank2(Tensor<T, 2>),
Rank3(Tensor<T, 3>),
Rank4(Tensor<T, 4>),
}
// Option 2: Trait-based approach for rank-polymorphic functions
pub trait TensorLike {
fn rank(&self) -> usize;
fn shape(&self) -> &[usize];
}
```
For v0.4.0, prioritize the static `Tensor<T, N>` design for 80% use cases (ranks 0-4). Dynamic rank support can be added in v0.5.x if user feedback indicates strong demand.
**Storage**:
- **Row-major layout** (C-contiguous, matches NumPy default)
- Owned data vs. views (zero-copy slicing)
- Shape, strides, offset tracking for views
**Storage Layout Considerations** (Performance):
While row-major is the default, the `strides` field enables representing column-major layouts **without changing the underlying data**. This is critical for performance-sensitive operations like matrix multiplication, where performance can be significantly improved when one matrix is row-major and the other is column-major [Goto & van de Geijn, 2008].
```rust
// Zero-copy transpose (just swap strides)
pub fn transpose(&self) -> Tensor<T, 2> {
Tensor {
data: self.data.clone(), // Shared data (Arc in production)
shape: [self.shape[1], self.shape[0]],
strides: [self.strides[1], self.strides[0]], // Swapped!
offset: self.offset,
}
}
```
**Key Insight**: By supporting arbitrary strides, we can:
- Implement zero-copy transpose (critical for linear algebra)
- Represent both row-major and column-major layouts
- Enable optimized BLAS routines (which often require specific layouts)
- Avoid expensive data reorganization
This design aligns with established practices in NumPy and BLAS libraries [Van der Walt et al., 2011].
**2D Operations**:
```rust
// Transpose
let mat = Tensor::<f32, 2>::new([2, 3]); // 2×3
let transposed = mat.transpose(); // 3×2
// Reshape
let vec = Tensor::<f32, 1>::new([6]); // [1, 2, 3, 4, 5, 6]
let mat = vec.reshape([2, 3]); // [[1, 2, 3], [4, 5, 6]]
// Flatten
let mat = Tensor::<f32, 2>::new([2, 3]); // 2×3
let vec = mat.flatten(); // 1D [6]
// Slicing
let mat = Tensor::<f32, 2>::new([5, 5]);
let sub = mat.slice([1..3, 2..4]); // Sub-matrix (2×2 view)
```
**Row/Column Access**:
```rust
let mat = Tensor::<f32, 2>::new([3, 4]);
let row0 = mat.row(0); // View of first row
let col1 = mat.column(1); // View of second column
```
**2D Matrix Multiplication** (optimized):
```rust
// NumPy: C = A @ B
let a = Tensor::<f32, 2>::new([128, 256]); // 128×256
let b = Tensor::<f32, 2>::new([256, 512]); // 256×512
let c = a.matmul(&b).unwrap(); // 128×512 (GPU-accelerated)
```
**Deliverables**:
- Tensor<T, N> type with const generics
- Shape, strides, offset tracking
- Views vs owned data (zero-copy slicing)
- Transpose, reshape, flatten
- 2D matmul (GPU-accelerated, optimized)
- Row/column indexing
**Success Metrics**:
- Can represent images, batches, sequences
- Zero-copy views for efficient slicing
- 2D matmul competitive with NumPy
#### v0.5.0: Broadcasting (2-3 months)
**Broadcasting Rules** (NumPy-compatible):
```rust
// NumPy broadcasting semantics
let a = Tensor::<f32, 2>::new([3, 1]); // Shape (3, 1)
let b = Tensor::<f32, 2>::new([1, 4]); // Shape (1, 4)
let c = a.add(&b).unwrap(); // Shape (3, 4) via broadcasting
// Rules:
// 1. If ranks differ, prepend 1s to smaller rank
// 2. Dimensions are compatible if they're equal or one is 1
// 3. Broadcast stretches dimension 1 to match other dimension
```
**Shape Compatibility Checking**:
```rust
fn can_broadcast(shape_a: &[usize], shape_b: &[usize]) -> bool {
// Implementation of NumPy broadcasting rules
}
```
**Element-wise Ops with Broadcasting**:
```rust
// All element-wise ops support broadcasting
let a = Tensor::<f32, 2>::new([64, 1]); // Batch of 64 vectors
let bias = Tensor::<f32, 1>::new([128]); // Bias vector
let result = a.add(&bias).unwrap(); // (64, 128) via broadcasting
```
**Broadcasting Implementation Quality** (*Jidoka* - Build Quality In):
Broadcasting is notoriously complex to implement correctly [Van der Walt et al., 2011]. The implementation must be heavily validated:
**1. Property-Based Testing Against NumPy**:
```rust
use proptest::prelude::*;
proptest! {
#[test]
fn test_broadcasting_matches_numpy(
a_shape in prop::array::uniform2(1usize..10),
b_shape in prop::array::uniform2(1usize..10),
) {
// Generate compatible shapes for broadcasting
let a = Tensor::<f32, 2>::randn(a_shape);
let b = Tensor::<f32, 2>::randn(b_shape);
// Compare Trueno result with NumPy
let trueno_result = a.add(&b).unwrap();
let numpy_result = numpy_add(a.to_numpy(), b.to_numpy());
assert_tensors_close(trueno_result, numpy_result, eps=1e-5);
}
}
```
This differential testing approach [McKeeman, 1998] catches subtle bugs that unit tests miss.
**2. Fused Kernel Optimization** (GPU Performance):
Naive broadcasting creates large intermediate tensors, consuming excessive memory. Modern tensor compilers avoid this by fusing broadcast + operation into a single kernel:
```rust
// Naive implementation (inefficient - materializes (64, 128) tensor)
let broadcasted_bias = bias.broadcast_to([64, 128]); // Allocates 32KB
let result = a.add(&broadcasted_bias); // Allocates another 32KB
// Fused kernel (optimal - no intermediate allocation)
let result = a.add_broadcasted(&bias); // Single GPU kernel, direct to output
```
**GPU Kernel Pseudocode**:
```wgsl
@compute @workgroup_size(256)
fn add_with_broadcast(
@builtin(global_invocation_id) gid: vec3<u32>
) {
let row = gid.x;
let col = gid.y;
// Compute broadcasted indices on-the-fly
let a_idx = row * a_stride[0] + (col % a_shape[1]) * a_stride[1];
let b_idx = (row % b_shape[0]) * b_stride[0] + col * b_stride[1];
output[row * output_stride + col] = a[a_idx] + b[b_idx];
}
```
This fused approach is standard in modern frameworks like JAX [Bradbury et al., 2018] and achieves:
- Zero intermediate memory allocation
- Single GPU kernel launch (reduces overhead)
- Improved memory bandwidth utilization
For v0.5.0, implement fused broadcasting for element-wise ops on GPU. This aligns with best practices in tensor compiler design [Chen et al., 2018].
**Advanced Indexing**:
```rust
// Boolean masking
let arr = Tensor::<f32, 1>::from_slice(&[1, 2, 3, 4, 5]);
let mask = arr.gt(2.0); // [false, false, true, true, true]
let filtered = arr.masked_select(&mask); // [3, 4, 5]
// Integer array indexing
let indices = Tensor::<i32, 1>::from_slice(&[0, 2, 4]);
let selected = arr.index_select(&indices); // [1, 3, 5]
// Slicing syntax (via macro)
let mat = Tensor::<f32, 2>::new([10, 10]);
let sub = tensor_slice!(mat, [1..5, ::2]); // Rows 1-4, every other column
```
**Deliverables**:
- NumPy-compatible broadcasting rules
- Shape compatibility checking
- Broadcasted element-wise operations
- Boolean masking
- Integer array indexing
- Slicing macro (NumPy-style syntax)
**Success Metrics**:
- Pass 80%+ of NumPy broadcasting tests
- Idiomatic batch processing
- Efficient memory usage (avoid unnecessary copies)
#### v0.6.0: NumPy Parity (Core Ops) (3-4 months)
**All NumPy dtypes**:
```rust
pub enum DType {
F16, F32, F64,
I8, I16, I32, I64,
U8, U16, U32, U64,
Bool,
}
let f32_tensor = Tensor::<f32, 2>::new([3, 3]);
let i32_tensor = Tensor::<i32, 2>::new([3, 3]);
let f64_tensor = Tensor::<f64, 2>::new([3, 3]);
```
**NumPy-style API**:
```rust
// Creation functions
let zeros = Tensor::<f32, 2>::zeros([3, 4]);
let ones = Tensor::<f32, 2>::ones([3, 4]);
let arange = Tensor::<f32, 1>::arange(0.0, 10.0, 1.0);
let linspace = Tensor::<f32, 1>::linspace(0.0, 1.0, 100);
// Concatenation/stacking
let a = Tensor::<f32, 2>::new([2, 3]);
let b = Tensor::<f32, 2>::new([2, 3]);
let cat = Tensor::concatenate(&[a, b], axis=0).unwrap(); // (4, 3)
let stack = Tensor::stack(&[a, b], axis=0).unwrap(); // (2, 2, 3)
// Splitting
let parts = Tensor::split(&tensor, 3, axis=0); // Split into 3 parts
// Conditional operations
let mask = x.gt(0.5);
let result = Tensor::where(&mask, &x, &y); // x if mask else y
let indices = Tensor::argwhere(&mask); // Indices where mask is true
```
**Deliverables**:
- Generic dtype support (F16, F32, F64, I32, I64, U32, etc.)
- NumPy creation functions (zeros, ones, arange, linspace)
- Concatenate, stack, split
- Conditional operations (where, argwhere)
- Comprehensive test suite (80%+ NumPy compatibility)
**Performance Target**:
- 80-120% of NumPy speed (within 20% either way)
**Success Metrics**:
- Pass 80%+ of NumPy test suite (for covered ops)
- ≥10 production deployments
- Competitive performance with NumPy
---
### Phase 3: Autograd & Training (v0.7.0 - v1.0.0)
**Timeline**: 12-18 months
**Goal**: PyTorch-competitive for training
**NumPy Parity**: ~80% (stable)
**PyTorch Parity**: ~60% (training-ready)
#### v0.7.0: Autograd Engine (4-6 months)
**Computational Graph**:
```rust
pub struct Tensor<T, const N: usize> {
data: TensorData<T, N>,
grad: Option<Box<Tensor<T, N>>>,
grad_fn: Option<Box<dyn BackwardFunction>>,
requires_grad: bool,
}
// Example usage
let x = Tensor::<f32, 1>::from_slice(&[1.0, 2.0, 3.0]).requires_grad();
let y = (&x * &x).sum(); // y = sum(x^2)
y.backward(); // Compute gradients
println!("{:?}", x.grad()); // Some([2.0, 4.0, 6.0])
```
**Reverse-mode Automatic Differentiation**:
- Dynamic graph construction (like PyTorch eager mode)
- Gradient tape for backpropagation
- Efficient memory management (release intermediates)
**Core Operations with Gradients**:
```rust
// All element-wise ops
add_backward, sub_backward, mul_backward, div_backward
exp_backward, log_backward, pow_backward, sqrt_backward
sin_backward, cos_backward, tanh_backward, sigmoid_backward
// Reductions
sum_backward, mean_backward, max_backward
// Linear algebra
matmul_backward, conv2d_backward
// Activations (all 14 GPU ops)
relu_backward, sigmoid_backward, gelu_backward, softmax_backward, etc.
```
**Autograd Quality Gates** (*Jidoka* - Halt the Line on Defects):
The autograd engine is the most complex and highest-risk component. A single bug in gradient computation causes silent training failures that are extremely difficult to debug [Baydin et al., 2018]. **Gradient checking must be a first-class citizen** in the testing suite.
**1. Automatic Gradient Checking** (Required for Every Operation):
```rust
/// Verify analytical gradients against numerical gradients
pub fn check_gradients<F>(
op: F,
inputs: &[Tensor],
eps: f64,
tolerance: f64
) -> Result<(), GradientCheckError>
where
F: Fn(&[Tensor]) -> Tensor
{
// Compute analytical gradients via backward()
let output = op(inputs);
output.backward();
let analytical_grads: Vec<_> = inputs.iter().map(|x| x.grad().unwrap()).collect();
// Compute numerical gradients via finite differences
let numerical_grads = compute_numerical_gradients(op, inputs, eps);
// Compare (must be very close)
for (analytical, numerical) in analytical_grads.iter().zip(&numerical_grads) {
assert_tensors_close(analytical, numerical, tolerance)?;
}
Ok(())
}
/// Numerical gradient: f'(x) ≈ (f(x + ε) - f(x - ε)) / (2ε)
fn compute_numerical_gradients<F>(
op: F,
inputs: &[Tensor],
eps: f64
) -> Vec<Tensor>
where
F: Fn(&[Tensor]) -> Tensor
{
inputs.iter().map(|input| {
let mut grad = Tensor::zeros_like(input);
for idx in 0..input.numel() {
// f(x + ε)
let mut input_plus = input.clone();
input_plus.data[idx] += eps;
let output_plus = op(&[input_plus]).sum().item();
// f(x - ε)
let mut input_minus = input.clone();
input_minus.data[idx] -= eps;
let output_minus = op(&[input_minus]).sum().item();
// Central difference
grad.data[idx] = (output_plus - output_minus) / (2.0 * eps);
}
grad
}).collect()
}
```
**2. Automated Testing for All Operations**:
```rust
#[cfg(test)]
mod autograd_tests {
use super::*;
macro_rules! test_gradient {
($op_name:ident, $op:expr, $input_shape:expr) => {
#[test]
fn $op_name() {
let x = Tensor::randn($input_shape).requires_grad();
let y = Tensor::randn($input_shape).requires_grad();
check_gradients(
|inputs| $op(&inputs[0], &inputs[1]),
&[x, y],
eps = 1e-5,
tolerance = 1e-4
).expect("Gradient check failed");
}
};
}
test_gradient!(test_add_grad, |x, y| x + y, [10, 10]);
test_gradient!(test_mul_grad, |x, y| x * y, [10, 10]);
test_gradient!(test_matmul_grad, |x, y| x.matmul(y), [10, 20]);
test_gradient!(test_relu_grad, |x, _| x.relu(), [100]);
test_gradient!(test_softmax_grad, |x, _| x.softmax(dim=1), [10, 10]);
// ... (test every single operation)
}
```
This automated verification prevents entire classes of subtle, hard-to-debug training failures. Research on deep learning testing emphasizes gradient checking as essential [Pei et al., 2017; Zhang et al., 2020].
**3. Property-Based Testing for Gradient Laws**:
```rust
use proptest::prelude::*;
proptest! {
#[test]
fn test_chain_rule(
x in tensor_strategy([10])
) {
// d/dx[f(g(x))] = f'(g(x)) * g'(x)
let x = x.requires_grad();
let y = x.relu(); // g(x)
let z = y.sum(); // f(g(x))
z.backward();
// Verify chain rule holds
let grad_z_wrt_x = x.grad().unwrap();
// ... verify against manually computed gradient
}
#[test]
fn test_gradient_linearity(
x in tensor_strategy([10]),
a in -10.0f32..10.0,
b in -10.0f32..10.0
) {
// d/dx[a*f(x) + b*g(x)] = a*f'(x) + b*g'(x)
// ... verify linearity property
}
}
```
**Memory Optimization**:
```rust
// Gradient checkpointing (save memory for deep networks)
let checkpoint = Checkpoint::new();
model.forward(x)
});
// In-place operations (where safe)
let mut x = Tensor::<f32, 2>::new([3, 3]);
x.relu_(); // In-place ReLU (saves memory, but cannot backward through this)
```
**Deliverables**:
- Reverse-mode AD engine
- Dynamic graph construction
- Gradient tape with backward functions
- All element-wise ops with gradients
- Matmul, conv2d with gradients
- All 14 activations with gradients
- Gradient checkpointing
- In-place operations
**Success Metrics**:
- Can train simple feedforward networks
- Gradients match PyTorch (< 1e-5 error)
- Memory-efficient gradient computation
#### v0.8.0: Neural Network Layers (3-4 months)
**nn::Module Trait**:
```rust
pub trait Module {
type Input;
type Output;
fn forward(&self, input: Self::Input) -> Self::Output;
fn parameters(&self) -> Vec<&Tensor>;
fn zero_grad(&mut self);
}
// Example: Linear layer
pub struct Linear {
weight: Tensor<f32, 2>,
bias: Option<Tensor<f32, 1>>,
}
impl Module for Linear {
type Input = Tensor<f32, 2>;
type Output = Tensor<f32, 2>;
fn forward(&self, input: Self::Input) -> Self::Output {
let output = input.matmul(&self.weight.transpose());
if let Some(ref bias) = self.bias {
output.add(bias)
} else {
output
}
}
fn parameters(&self) -> Vec<&Tensor> {
let mut params = vec![&self.weight];
if let Some(ref bias) = self.bias {
params.push(bias);
}
params
}
}
```
**Core Layers**:
```rust
// Fully connected
let linear = nn::Linear::new(784, 128);
// Convolutional
let conv2d = nn::Conv2d::new(3, 64, kernel_size=3, stride=1, padding=1);
let maxpool = nn::MaxPool2d::new(kernel_size=2, stride=2);
// Normalization
let batchnorm = nn::BatchNorm2d::new(64);
let layernorm = nn::LayerNorm::new(128);
// Regularization
let dropout = nn::Dropout::new(p=0.5);
```
**Sequential Container**:
```rust
let model = nn::Sequential::new()
.add(nn::Linear::new(784, 256))
.add(nn::ReLU::new())
.add(nn::Dropout::new(0.5))
.add(nn::Linear::new(256, 128))
.add(nn::ReLU::new())
.add(nn::Linear::new(128, 10));
let output = model.forward(input);
```
**Loss Functions**:
```rust
// Cross-entropy loss (classification)
let criterion = nn::CrossEntropyLoss::new();
let loss = criterion.forward(outputs, targets);
// MSE loss (regression)
let mse = nn::MSELoss::new();
let loss = mse.forward(predictions, targets);
// Binary cross-entropy (binary classification)
let bce = nn::BCELoss::new();
let loss = bce.forward(predictions, targets);
// Custom loss support
trait Loss {
fn forward(&self, predictions: &Tensor, targets: &Tensor) -> Tensor;
}
```
**Deliverables**:
- nn::Module trait
- Linear (fully connected) layer
- Conv2d, MaxPool2d layers
- BatchNorm, LayerNorm
- Dropout
- Sequential container
- CrossEntropyLoss, MSELoss, BCELoss
**Success Metrics**:
- Can build ResNet-18, VGG-16
- Layer API matches PyTorch ergonomics
- Automatic parameter tracking
#### v0.9.0: Optimizers (2-3 months)
**Optimizer Trait**:
```rust
pub trait Optimizer {
fn step(&mut self, params: &mut [Tensor]);
fn zero_grad(&mut self, params: &mut [Tensor]);
}
// SGD with momentum
let optimizer = optim::SGD::new(
model.parameters(),
lr=0.01,
momentum=0.9,
nesterov=true
);
// Training loop
for (inputs, targets) in dataloader {
optimizer.zero_grad();
let outputs = model.forward(inputs);
let loss = criterion.forward(outputs, targets);
loss.backward();
optimizer.step();
}
```
**Core Optimizers**:
```rust
// SGD (with momentum, Nesterov)
let sgd = optim::SGD::new(params, lr=0.01, momentum=0.9, nesterov=true);
// Adam (with weight decay, AMSGrad)
let adam = optim::Adam::new(params, lr=0.001, betas=(0.9, 0.999), weight_decay=0.01);
// AdamW (decoupled weight decay)
let adamw = optim::AdamW::new(params, lr=0.001, weight_decay=0.01);
// RMSprop
let rmsprop = optim::RMSprop::new(params, lr=0.01, alpha=0.99);
```
**Learning Rate Schedulers**:
```rust
// Step decay
let scheduler = optim::StepLR::new(optimizer, step_size=30, gamma=0.1);
// Exponential decay
let scheduler = optim::ExponentialLR::new(optimizer, gamma=0.95);
// Cosine annealing
let scheduler = optim::CosineAnnealingLR::new(optimizer, T_max=100);
// Usage in training loop
for epoch in 0..num_epochs {
train_one_epoch();
scheduler.step();
}
```
**Deliverables**:
- Optimizer trait
- SGD (with momentum, Nesterov)
- Adam (with weight decay, AMSGrad)
- AdamW, RMSprop
- Learning rate schedulers (StepLR, ExponentialLR, CosineAnnealing)
**Success Metrics**:
- Can train ResNet-50 to convergence
- Optimizer behavior matches PyTorch
- Efficient gradient updates (in-place where possible)
#### v1.0.0: Training-Ready (3-4 months)
**Model Serialization**:
```rust
// Save checkpoint
let checkpoint = Checkpoint {
model: model.state_dict(),
optimizer: optimizer.state_dict(),
epoch: 42,
loss: 0.123,
};
checkpoint.save("model.pth")?;
// Load checkpoint
let checkpoint = Checkpoint::load("model.pth")?;
model.load_state_dict(checkpoint.model);
optimizer.load_state_dict(checkpoint.optimizer);
// ONNX export
model.export_onnx("model.onnx")?;
```
**Distributed Training** (Data Parallelism):
```rust
// Initialize distributed training
let world_size = 4; // 4 GPUs
let rank = get_rank();
// Wrap model in DistributedDataParallel
let ddp_model = nn::parallel::DistributedDataParallel::new(model, rank);
// Training loop (gradients synchronized automatically)
for (inputs, targets) in dataloader {
let outputs = ddp_model.forward(inputs);
let loss = criterion.forward(outputs, targets);
loss.backward();
optimizer.step(); // Gradients averaged across GPUs
}
```
**Gradient Synchronization**:
- AllReduce for gradient averaging
- Efficient communication (NCCL-like)
- Fault tolerance (checkpoint/resume)
**Production Features**:
- Mixed precision training (FP16/BF16)
- Gradient accumulation
- Gradient clipping
- Early stopping
- TensorBoard logging
**Deliverables**:
- Model save/load (state_dict)
- ONNX export
- Distributed data parallelism
- Gradient synchronization
- Mixed precision support
- Production training utilities
**Performance Target**:
- 60-80% of PyTorch speed (within 20-40%)
**Success Metrics**:
- ✅ Train ResNet-50 on CIFAR-10 in <30 minutes (single GPU)
- ✅ 60-80% of PyTorch performance
- ✅ Used in production ML training pipelines
**Full MNIST/CIFAR-10 Examples**:
```rust
// Complete CIFAR-10 training example
use trueno::{nn, optim, data};
fn main() {
// Load data
let train_loader = data::DataLoader::new("cifar10", train=true, batch_size=128);
let test_loader = data::DataLoader::new("cifar10", train=false, batch_size=128);
// Build ResNet-18
let model = nn::resnet18(num_classes=10);
// Loss and optimizer
let criterion = nn::CrossEntropyLoss::new();
let optimizer = optim::Adam::new(model.parameters(), lr=0.001);
// Training loop
for epoch in 0..100 {
for (images, labels) in train_loader {
optimizer.zero_grad();
let outputs = model.forward(images);
let loss = criterion.forward(outputs, labels);
loss.backward();
optimizer.step();
}
// Validation
let accuracy = evaluate(&model, &test_loader);
println!("Epoch {}: Accuracy {:.2}%", epoch, accuracy * 100.0);
}
}
```
---
## 4. Performance Targets
### 4.1 NumPy Performance Parity
**Target**: 80-120% of NumPy speed (within 20% either way)
**Benchmark Methodology**:
- Test sizes: 100, 1K, 10K, 100K, 1M, 10M elements
- Operations: add, mul, matmul, sum, mean, softmax
- Backends: Scalar, SIMD (AVX2), GPU
- Compare against: NumPy 1.26+ (with OpenBLAS)
**Expected Performance**:
| add | 1K | 10 µs | 8 µs | - | 1.25x |
| add | 100K | 1 ms | 500 µs | - | 2x |
| add | 1M | 10 ms | 5 ms | 200 µs | 2x (CPU), 50x (GPU) |
| matmul | 128×128 | 200 µs | 180 µs | - | 1.1x |
| matmul | 1K×1K | 15 ms | 12 ms | 1.5 ms | 1.25x (CPU), 10x (GPU) |
| softmax | 10K | 120 µs | 100 µs | 60 µs | 1.2x (CPU), 2x (GPU) |
| softmax | 1M | 12 ms | 10 ms | 600 µs | 1.2x (CPU), 20x (GPU) |
**Key Insight**: GPU wins at >100K elements (transfer overhead amortized)
### 4.2 PyTorch Performance Parity
**Target**: 60-80% of PyTorch speed (within 20-40%)
**Benchmark Methodology**:
- Networks: ResNet-18, ResNet-50, VGG-16, Transformer (BERT-base)
- Tasks: CIFAR-10 training, ImageNet inference, NLP fine-tuning
- Hardware: NVIDIA A100 (GPU), AMD EPYC (CPU)
- Compare against: PyTorch 2.0+ (with cuDNN)
**Expected Performance**:
| ResNet-50 training (CIFAR-10, 1 epoch) | 45s | 60s | 0.75x |
| ResNet-18 inference (ImageNet, batch=32) | 120 ms | 150 ms | 0.80x |
| BERT-base forward pass (seq_len=128) | 8 ms | 12 ms | 0.67x |
| Softmax (1M elements) | 800 µs | 600 µs | 1.33x |
**Key Insight**: Trueno may be faster for specific ops (softmax, activations) but slower for full models (cuDNN optimization)
---
## 5. API Compatibility Strategy
### 5.1 NumPy Compatibility
**Goal**: Minimize porting effort from NumPy → Trueno
**Approach 1: Idiomatic Rust API** (Primary)
```rust
// Rust-native API (ownership, error handling)
let a = Tensor::<f32, 2>::from_slice(&[1.0, 2.0, 3.0, 4.0], [2, 2]);
let b = Tensor::<f32, 2>::from_slice(&[5.0, 6.0, 7.0, 8.0], [2, 2]);
let c = a.matmul(&b).unwrap();
```
**Approach 2: NumPy-like Macros** (Convenience)
```rust
// NumPy-style syntax via macros
let a = tensor![[1.0, 2.0], [3.0, 4.0]];
let b = tensor![[5.0, 6.0], [7.0, 8.0]];
let c = a.matmul(&b).unwrap();
```
**Approach 3: Function API** (Migration Path)
```rust
// NumPy function-style API
use trueno::np;
let a = np::array([[1.0, 2.0], [3.0, 4.0]]);
let b = np::array([[5.0, 6.0], [7.0, 8.0]]);
let c = np::matmul(&a, &b);
```
**Migration Guide**: Document NumPy → Trueno equivalents
```rust
// NumPy // Trueno
np.array([1, 2, 3]) Tensor::from_slice(&[1, 2, 3])
arr.reshape(2, 3) tensor.reshape([2, 3])
arr.transpose() tensor.transpose()
np.dot(a, b) a.dot(&b)
np.matmul(a, b) a.matmul(&b)
```
### 5.2 PyTorch Compatibility
**Goal**: Familiar API for PyTorch users
**Approach 1: Rust-native nn::Module** (Primary)
```rust
use trueno::nn;
struct MyModel {
fc1: nn::Linear,
fc2: nn::Linear,
}
impl nn::Module for MyModel {
type Input = Tensor<f32, 2>;
type Output = Tensor<f32, 2>;
fn forward(&self, x: Self::Input) -> Self::Output {
let x = self.fc1.forward(x).relu();
let x = self.fc2.forward(x);
x.softmax(dim=1)
}
}
```
**Approach 2: Macro-based Sequential** (Convenience)
```rust
let model = sequential![
nn::Linear::new(784, 256),
nn::ReLU::new(),
nn::Linear::new(256, 10),
];
```
**Migration Guide**: Document PyTorch → Trueno equivalents
```rust
// PyTorch // Trueno
torch.tensor([1, 2, 3]) Tensor::from_slice(&[1, 2, 3])
x.requires_grad_(True) x.requires_grad()
y.backward() y.backward()
x.grad x.grad()
nn.Linear(784, 128) nn::Linear::new(784, 128)
F.relu(x) x.relu()
```
---
## 6. Use Case Validation
### 6.1 NumPy Replacement Use Cases
**Use Case 1: Scientific Computing** (Phase 2 - v0.6.0)
```rust
// Matrix operations for linear algebra
use trueno::Tensor;
let a = Tensor::<f64, 2>::random([1000, 1000]);
let b = Tensor::<f64, 2>::random([1000, 1000]);
// Solve Ax = b
let x = linalg::solve(&a, &b).unwrap();
// Eigenvalues/eigenvectors
let (eigenvalues, eigenvectors) = linalg::eig(&a).unwrap();
```
**Use Case 2: Data Preprocessing** (Phase 2 - v0.5.0)
```rust
// Normalize features for ML
let data = Tensor::<f32, 2>::from_csv("data.csv");
let mean = data.mean(axis=0, keepdims=true);
let std = data.std(axis=0, keepdims=true);
let normalized = (data - mean) / std;
```
**Use Case 3: Image Processing** (Phase 2 - v0.4.0)
```rust
// Apply filters to images
let image = Tensor::<f32, 3>::from_image("photo.jpg"); // (C, H, W)
let kernel = Tensor::<f32, 2>::gaussian_kernel(sigma=1.0);
let blurred = image.convolve2d(&kernel);
```
### 6.2 PyTorch Replacement Use Cases
**Use Case 1: Transfer Learning** (Phase 3 - v1.0.0)
```rust
// Fine-tune ResNet-50 on custom dataset
use trueno::{nn, optim, vision};
let mut model = vision::resnet50(pretrained=true);
model.fc = nn::Linear::new(2048, num_classes); // Replace final layer
let criterion = nn::CrossEntropyLoss::new();
let optimizer = optim::Adam::new(model.parameters(), lr=0.001);
for epoch in 0..10 {
for (images, labels) in train_loader {
optimizer.zero_grad();
let outputs = model.forward(images);
let loss = criterion.forward(outputs, labels);
loss.backward();
optimizer.step();
}
}
```
**Use Case 2: Custom Neural Network** (Phase 3 - v0.8.0)
```rust
// Build custom architecture
use trueno::nn;
struct Autoencoder {
encoder: nn::Sequential,
decoder: nn::Sequential,
}
impl nn::Module for Autoencoder {
type Input = Tensor<f32, 2>;
type Output = Tensor<f32, 2>;
fn forward(&self, x: Self::Input) -> Self::Output {
let encoded = self.encoder.forward(x);
let decoded = self.decoder.forward(encoded);
decoded
}
}
let model = Autoencoder {
encoder: sequential![
nn::Linear::new(784, 256),
nn::ReLU::new(),
nn::Linear::new(256, 64),
],
decoder: sequential![
nn::Linear::new(64, 256),
nn::ReLU::new(),
nn::Linear::new(256, 784),
],
};
```
**Use Case 3: Inference Optimization** (Phase 3 - v0.8.0)
```rust
// Deploy trained model for fast inference
use trueno::nn;
let model = nn::load("model.pth").unwrap();
model.eval(); // Set to inference mode (disables dropout, etc.)
// Inference on GPU
let input = Tensor::<f32, 4>::from_image_batch(&images); // (N, C, H, W)
let output = model.forward(input); // GPU-accelerated
let predictions = output.argmax(dim=1); // Get class predictions
```
---
## 7. Migration Path for Existing Projects
### 7.1 NumPy → Trueno Migration
**Step 1: Identify Vectorizable Code**
- Use trueno-analyze tool to scan Python codebase
- Identify NumPy operations that can be replaced
**Step 2: Incremental Replacement**
- Start with 1D operations (available today in v0.2.0)
- Use FFI to call Trueno from Python (via PyO3)
- Gradually replace more operations as Trueno gains features
**Step 3: Full Transpilation**
- Use Depyler (NumPy → Trueno transpiler) for automated conversion
- Manually verify correctness
- Benchmark performance improvements
**Example Migration**:
```python
# Before (NumPy)
import numpy as np
def process_signals(data):
normalized = (data - data.mean()) / data.std()
activated = 1.0 / (1.0 + np.exp(-normalized))
return activated.sum()
# After (Trueno via PyO3)
import trueno_py
def process_signals(data):
tensor = trueno_py.Tensor.from_numpy(data)
normalized = tensor.normalize()
activated = normalized.sigmoid()
return activated.sum()
```
### 7.2 PyTorch → Trueno Migration
**Step 1: Identify Inference-Only Models**
- Target models that don't need retraining
- Export PyTorch model weights
**Step 2: Rewrite Forward Pass**
- Translate PyTorch forward() to Trueno (Phase 3)
- Load pretrained weights
**Step 3: Optimize Performance**
- Use Trueno GPU acceleration
- Benchmark against PyTorch inference
**Example Migration**:
```python
# Before (PyTorch)
import torch
import torch.nn as nn
class SimpleNet(nn.Module):
def __init__(self):
super().__init__()
self.fc1 = nn.Linear(784, 256)
self.fc2 = nn.Linear(256, 10)
def forward(self, x):
x = torch.relu(self.fc1(x))
x = self.fc2(x)
return torch.softmax(x, dim=1)
# After (Trueno - Phase 3)
use trueno::nn;
struct SimpleNet {
fc1: nn::Linear,
fc2: nn::Linear,
}
impl nn::Module for SimpleNet {
type Input = Tensor<f32, 2>;
type Output = Tensor<f32, 2>;
fn forward(&self, x: Self::Input) -> Self::Output {
let x = self.fc1.forward(x).relu();
let x = self.fc2.forward(x);
x.softmax(dim=1)
}
}
```
---
## 8. Risk Analysis & Mitigation
### 8.1 Technical Risks
**Risk 1: GPU Performance Gap**
- **Threat**: Trueno GPU slower than PyTorch (cuDNN optimized)
- **Probability**: HIGH
- **Impact**: HIGH (undermines value proposition)
- **Mitigation**:
- Focus on operations where we can win (activations, reductions)
- Document where PyTorch is faster (cuDNN convolutions)
- Target 60-80% of PyTorch speed (acceptable for Rust benefits)
- Optimize critical kernels (matmul, conv2d)
**Risk 2: Autograd Complexity**
- **Threat**: Reverse-mode AD too complex to implement correctly
- **Probability**: MEDIUM
- **Impact**: HIGH (blocks training entirely)
- **Mitigation**:
- Study PyTorch autograd internals
- Start with simple ops (add, mul) before complex (matmul, conv)
- Comprehensive gradient checking tests
- Hire autograd expert if needed
**Risk 3: Memory Overhead**
- **Threat**: Autograd graph consumes excessive memory
- **Probability**: MEDIUM
- **Impact**: MEDIUM (limits model size)
- **Mitigation**:
- Implement gradient checkpointing
- Release intermediates eagerly
- Provide in-place operations where safe
- Profile memory usage, optimize hotspots
### 8.2 Strategic Risks
**Risk 1: PyTorch Ecosystem Lock-in**
- **Threat**: Users prefer PyTorch due to ecosystem (libraries, pretrained models)
- **Probability**: HIGH
- **Impact**: MEDIUM (limits adoption)
- **Mitigation**:
- Provide ONNX import/export
- Load PyTorch pretrained weights
- Target Rust-native ML use cases (embedded, production)
- Emphasize benefits: type safety, performance, no Python overhead
**Risk 2: Feature Creep**
- **Threat**: Trying to match every PyTorch feature delays v1.0
- **Probability**: MEDIUM
- **Impact**: HIGH (never reach production-ready state)
- **Mitigation**:
- Focus on 80% use cases (ResNet, BERT, simple CNNs)
- Explicitly declare non-goals (symbolic computation, JIT compilation)
- Ship v1.0 with core features, add advanced features in v1.x
**Risk 3: Rust Ergonomics**
- **Threat**: Rust's ownership model makes ML code verbose/difficult
- **Probability**: MEDIUM
- **Impact**: MEDIUM (poor developer experience)
- **Mitigation**:
- Provide ergonomic macros (tensor!, sequential!)
- Smart default behaviors (auto-grad on by default)
- Comprehensive examples and tutorials
- Gather user feedback, iterate on API
---
## 9. Success Metrics & KPIs
### 9.1 Technical Metrics
**Phase 1 (v0.3.0) - 1D Operations Complete**:
- ✅ All common 1D operations GPU-accelerated
- ✅ 10-50x GPU speedup validated by benchmarks
- ✅ >90% test coverage
- ✅ Mutation testing ≥80% kill rate
**Phase 2 (v0.6.0) - NumPy Parity**:
- ✅ 80-120% of NumPy performance
- ✅ Pass 80%+ of NumPy test suite (for covered ops)
- ✅ Multi-dimensional arrays (0D-4D)
- ✅ Broadcasting support
**Phase 3 (v1.0.0) - Training Ready**:
- ✅ Train ResNet-50 on CIFAR-10 in <30 minutes (single GPU)
- ✅ 60-80% of PyTorch performance
- ✅ Autograd matches PyTorch (< 1e-5 gradient error)
### 9.2 Adoption Metrics
**Phase 1**:
- ✅ Used in production by ≥3 projects
- ✅ ≥100 GitHub stars
- ✅ ≥10 contributors
**Phase 2**:
- ✅ ≥10 production deployments
- ✅ ≥500 GitHub stars
- ✅ ≥50 contributors
**Phase 3**:
- ✅ Used in production ML training pipelines
- ✅ ≥1,000 GitHub stars
- ✅ ≥100 contributors
- ✅ Featured in Rust ML blog posts/talks
### 9.3 Quality Metrics (Continuous)
**EXTREME TDD Standards** (All phases):
- Test coverage: ≥90%
- Mutation testing: ≥80% kill rate
- Property-based tests: All core operations
- Backend equivalence: GPU vs SIMD vs Scalar (< 1e-5 error)
- **Differential testing**: All operations vs NumPy/PyTorch (Phase 2+)
- **Fuzz testing**: Robustness against malformed inputs (Phase 2+)
- Gradient checking: All autograd operations (Phase 3)
- Zero clippy warnings
- PMAT TDG: ≥B+ (85/100)
**Additional Testing Strategies** (*Jidoka* - Build Quality In):
**1. Differential Testing** (Catch Numerical Bugs):
For every operation, run it in Trueno, NumPy, and PyTorch with the same random inputs and assert outputs are numerically close [McKeeman, 1998]:
```rust
#[test]
fn test_matmul_vs_numpy() {
let a = Tensor::randn([100, 200]);
let b = Tensor::randn([200, 300]);
// Trueno result
let trueno_result = a.matmul(&b).unwrap();
// NumPy result (via PyO3)
let numpy_result = numpy_matmul(a.to_numpy(), b.to_numpy());
// Must be very close
assert_tensors_close(&trueno_result, &numpy_result, eps=1e-5);
}
```
This is incredibly powerful for finding subtle bugs in numerical implementations that unit tests miss.
**2. Fuzz Testing** (Security + Robustness):
Use `cargo-fuzz` to feed random, potentially malformed data into functions [Miller et al., 1990]:
```rust
// fuzz/fuzz_targets/model_loading.rs
#![no_main]
use libfuzzer_sys::fuzz_target;
use trueno::nn;
fuzz_target!(|data: &[u8]| {
// Try to load model from arbitrary bytes
// Should not crash, even on malformed input
let _ = nn::load_from_bytes(data);
});
```
Fuzz testing is essential for:
- Model loading (ONNX, checkpoint files)
- Complex control flow (broadcasting, indexing)
- Ensuring no panics on malformed input
- Security (prevent exploitation via malicious model files)
---
## 10. Non-Goals
### What Trueno Will NOT Be
❌ **100% PyTorch-compatible** - Inspired by, not clone of
- We will NOT replicate every PyTorch feature
- Focus on core 80% use cases (ResNet, BERT, simple CNNs)
- Explicitly omit: JIT compilation, symbolic computation, TorchScript
❌ **Research-first library** - Production performance is priority
- We will NOT prioritize cutting-edge research features
- Focus on battle-tested architectures
- Add experimental features only after proven in production
❌ **Python-first** - Rust-native, Python bindings optional
- Primary API is Rust, not Python
- PyO3 bindings for interop, but Rust is first-class
- Performance over Python compatibility
❌ **Dynamic typing** - Static typing for safety
- Compile-time shape checking where possible
- Type errors at compile time, not runtime
- Trade ergonomics for safety/performance
❌ **Symbolic computation** - Eager execution only
- No graph optimization, no lazy evaluation
- Simple mental model: operations execute immediately
- Focus on imperative PyTorch-style API
---
## 11. Appendix: Competitive Analysis
### 11.1 Trueno vs. NumPy
| **Multi-dim arrays** | ✅ 0D-32D | ❌ 1D only | ✅ 0D-4D |
| **Broadcasting** | ✅ Full | ❌ None | ✅ Full |
| **Data types** | ✅ 20+ | ❌ f32 only | ✅ 10+ (f16/f32/f64/i32/etc.) |
| **GPU acceleration** | ❌ CPU only | ✅ 14 ops | ✅ 50+ ops |
| **Performance (SIMD)** | ❌ Limited | ✅ 2-8x faster | ✅ 2-8x faster |
| **Performance (GPU)** | ❌ N/A | ✅ 10-50x faster | ✅ 10-50x faster |
| **Type safety** | ❌ Runtime | ✅ Compile-time | ✅ Compile-time |
| **Memory safety** | ❌ C backend | ✅ Rust safe | ✅ Rust safe |
**Verdict**: Trueno v0.6.0 will match NumPy functionality with superior performance and safety.
### 11.2 Trueno vs. PyTorch
| **Autograd** | ✅ Full | ❌ None | ✅ Full |
| **GPU ops** | ✅ 1000+ | ✅ 14 | ✅ 100+ |
| **Training** | ✅ Full | ❌ Inference only | ✅ Full |
| **Layers** | ✅ 100+ | ❌ None | ✅ 20+ core layers |
| **Optimizers** | ✅ 10+ | ❌ None | ✅ 5+ (SGD, Adam, AdamW, etc.) |
| **Pretrained models** | ✅ 1000+ | ❌ None | ✅ 10+ (ResNet, BERT, etc.) |
| **Performance (cuDNN)** | ✅ Baseline | ❌ N/A | ✅ 60-80% |
| **Type safety** | ❌ Runtime | ✅ Compile-time | ✅ Compile-time |
| **Memory safety** | ❌ Python/C++ | ✅ Rust safe | ✅ Rust safe |
**Verdict**: Trueno v1.0.0 will enable training with 60-80% of PyTorch performance, prioritizing safety.
### 11.3 Trueno vs. Other Rust ML Libraries
**vs. burn** (Rust deep learning):
- burn: Dynamic graphs, flexible, but less mature
- Trueno: Focus on production performance, GPU optimization, NumPy/PyTorch API
**vs. candle** (HuggingFace):
- candle: Inference-focused, minimal dependencies
- Trueno: Training + inference, SIMD + GPU, broader API surface
**vs. ndarray** (Rust NumPy):
- ndarray: CPU-only, no GPU, no autograd
- Trueno: GPU-accelerated, autograd (Phase 3), training-ready
**Trueno Differentiation**:
- ✅ GPU-first design (14 ops today, 100+ by v1.0)
- ✅ EXTREME TDD (>90% coverage, mutation testing)
- ✅ PyTorch/NumPy compatibility (familiar API)
- ✅ Production focus (safety, performance, PMAT quality gates)
---
## 12. Future Directions & Developer Experience
### 12.1 Tensor Compiler Integration (*Long-Term Vision*)
**Current State**: Trueno uses hand-written WGSL shaders for GPU operations. While effective, this approach has limitations:
- **cuDNN Gap**: PyTorch's cuDNN integration is heavily optimized by NVIDIA for specific architectures
- **Maintenance Burden**: Each operation requires custom shader implementation
- **Architecture-Specific**: Hard to optimize for different GPUs (NVIDIA vs AMD vs Apple M-series)
**Future Direction** (Post-v1.0): Integrate with a tensor compiler backend like Apache TVM or MLIR [Chen et al., 2018; Lattner et al., 2020]. These tools automatically generate highly optimized kernels for a given operation and hardware target.
**Benefits**:
- **Match/Exceed cuDNN**: Auto-optimization can match or beat vendor libraries
- **Multi-Hardware**: Single codebase targets NVIDIA, AMD, Apple, Intel GPUs
- **Reduced Maintenance**: Compiler generates kernels, we write high-level ops
- **Novel Architectures**: Easy to support new hardware (e.g., Google TPUs)
**Integration Path** (Not v1.0, but architecture should not preclude this):
```rust
// High-level operation (unchanged)
let result = a.matmul(&b);
// Under the hood (v1.0: hand-written WGSL)
gpu.execute_wgsl_shader("matmul.wgsl", ...);
// Under the hood (future: TVM/MLIR compiler)
let compiled_kernel = tvm::compile_matmul(a.shape(), b.shape(), target_gpu);
gpu.execute_compiled_kernel(compiled_kernel, ...);
```
This aligns with the direction of modern ML frameworks like JAX (uses XLA compiler) and PyTorch (integrating TorchInductor).
### 12.2 Error Message Quality (*Respect for People*)
The quality of compiler error messages will be make-or-break for usability [Ko et al., 2011]. Rust's ownership system already provides excellent error messages; Trueno must match this standard for tensor operations.
**Bad Error Message** (Current Rust libraries):
```
thread 'main' panicked at 'assertion failed: self.shape[1] == other.shape[0]'
```
**Good Error Message** (Trueno Target):
```
Error: Cannot multiply tensors due to incompatible shapes
Found:
Left tensor: shape (3, 4) [3 rows, 4 columns]
Right tensor: shape (5, 6) [5 rows, 6 columns]
Problem: Inner dimensions do not match (4 ≠ 5)
For matrix multiplication A @ B:
- A must have shape (m, k)
- B must have shape (k, n)
- Result will have shape (m, n)
Hint: Did you mean to transpose one of the tensors?
- a.transpose().matmul(&b) // Shape (4, 3) @ (5, 6)
- a.matmul(&b.transpose()) // Shape (3, 4) @ (6, 5)
```
**Implementation Strategy**:
```rust
impl Tensor {
pub fn matmul(&self, other: &Tensor) -> Result<Tensor, TruenoError> {
if self.shape[1] != other.shape[0] {
return Err(TruenoError::ShapeMismatch {
operation: "matmul",
left_shape: self.shape.clone(),
right_shape: other.shape.clone(),
issue: format!(
"Inner dimensions do not match ({} ≠ {})",
self.shape[1], other.shape[0]
),
hint: "Did you mean to transpose one of the tensors?",
});
}
// ... implementation
}
}
```
Investing in diagnostic quality pays enormous dividends in user adoption. Research on HCI shows that error message quality significantly impacts developer productivity [Meyer et al., 2017].
### 12.3 Ecosystem - Model Hub (*Reduce Activation Energy*)
**Strategic Risk**: PyTorch Ecosystem Lock-in is the biggest external threat. Users prefer PyTorch due to the vast ecosystem of pretrained models.
**Mitigation** (Beyond ONNX Support):
Create a **Trueno Model Hub** with popular architectures pre-ported and ready to use:
```rust
use trueno::hub;
// Load pretrained ResNet-50 (ImageNet weights)
let model = hub::load("resnet50", pretrained=true)?;
// Immediate transfer learning
let mut model = model;
model.fc = nn::Linear::new(2048, num_custom_classes);
let optimizer = optim::Adam::new(model.parameters(), lr=0.001);
// ... train on custom dataset
```
**Model Hub Contents** (Target for v1.0):
- **Vision**: ResNet-{18,34,50,101,152}, VGG-{11,13,16,19}, MobileNet-V2/V3, EfficientNet
- **NLP**: BERT-{base,large}, GPT-2, DistilBERT
- **Detection**: YOLO-v5, Faster R-CNN (Phase 4)
**Implementation**:
- Host pretrained weights (convert from PyTorch using ONNX)
- Provide model definitions in idiomatic Trueno code
- Benchmark inference speed vs PyTorch
- Document transfer learning workflows
This drastically lowers the activation energy for new users who want to do transfer learning—the dominant use case in industry [Amershi et al., 2019]. The ecosystem effect is a primary driver of platform adoption.
---
## 13. Tiered TDD-X Workflow & Quality Gates (Certeza Insights)
This section integrates lessons from the **certeza** project (https://github.com/paiml/certeza), a scientific experiment achieving **97.7% mutation score** through asymptotic test effectiveness. Certeza demonstrates that EXTREME TDD is achievable and sustainable when verification is properly tiered [1,2,3].
### 13.1 The Flow State Problem
**Critical Finding**: Different verification techniques operate at different time scales. Fast feedback enables flow state; slow feedback causes context switching waste [certeza spec §2.2]. Running mutation testing on every file save destroys productivity (10-100x loss) [4].
**Anti-Pattern** ❌:
```bash
# NEVER do this - destroys flow state
watch -n 1 'cargo mutants' # Running mutation tests every second
```
**Solution**: Tiered feedback loops that match verification intensity to development phase [5,6].
### 13.2 Tiered Workflow (Certeza Model)
```
┌──────────────────────────────────────────────────────────┐
│ Tiered TDD-X Workflow │
├──────────────────────────────────────────────────────────┤
│ │
│ TIER 1: ON-SAVE (Sub-second feedback) │
│ ├─ Write failing unit/property test │
│ ├─ Implement minimal code │
│ ├─ Run focused test suite │
│ ├─ Static analysis (cargo check, clippy) │
│ └─ Iterate rapidly in flow state │
│ │
│ TIER 2: ON-COMMIT (1-5 minutes) │
│ ├─ Full property-based test suite │
│ ├─ Coverage analysis (target: 90%+ line) │
│ ├─ Integration tests │
│ ├─ Differential testing (vs NumPy/PyTorch) │
│ └─ Pre-commit hook enforcement │
│ │
│ TIER 3: ON-MERGE/NIGHTLY (Hours) │
│ ├─ Comprehensive mutation testing │
│ │ └─ Target: ≥80% score, analyze survivors │
│ ├─ Formal verification (critical paths) │
│ ├─ Performance benchmarks │
│ ├─ Security audits │
│ └─ CI/CD gate for main branch │
│ │
└──────────────────────────────────────────────────────────┘
```
**Workflow Mapping** (Trueno-Specific):
| Unit tests (focused) | 1 | Every save | <1s | Rapid iteration |
| Clippy (fast mode) | 1 | Every save | <1s | Quick lint check |
| Property tests (focused) | 1 | Every save | <3s | Design validation |
| Full test suite | 2 | Every commit | <5m | Regression prevention |
| Coverage analysis | 2 | Every commit | <2m | Completeness check |
| Differential testing | 2 | Every commit | <5m | NumPy/PyTorch equivalence |
| Mutation testing | 3 | Pre-merge/nightly | <2h | Test quality assurance |
| Formal verification | 3 | Pre-merge | <4h | Proof of invariants |
| Benchmarks (full suite) | 3 | Nightly | <1h | Performance regression |
### 13.3 Risk-Based Verification Matrix
**Core Principle**: Not all code requires the same verification intensity. Apply rigorous techniques to high-risk components [certeza spec §2.3].
**Trueno Component Classification**:
```
│ Low Complexity │ Medium Complexity │ High Complexity
───────┼─────────────────┼───────────────────┼─────────────────
High │ Property + │ Property + │ Property +
Crit │ Coverage + │ Coverage + │ Coverage +
│ Mutation (85%) │ Mutation (90%) │ Mutation (90%) +
│ │ │ Formal Verify
───────┼─────────────────┼───────────────────┼─────────────────
Medium │ Property + │ Property + │ Property +
Crit │ Coverage (95%) │ Coverage + │ Coverage +
│ │ Mutation (80%) │ Mutation (85%)
───────┼─────────────────┼───────────────────┼─────────────────
Low │ Unit Tests + │ Unit Tests + │ Property +
Crit │ Coverage (90%) │ Property + │ Coverage (95%)
│ │ Coverage (90%) │
```
**Component Risk Assessment** (Trueno-Specific):
| **Very High** | SIMD intrinsics (`unsafe` blocks), GPU compute shaders (WGSL), autograd reverse-mode AD engine, memory allocators | Property + Coverage (95%) + Mutation (90%) + Formal | Memory safety, correctness-critical, high complexity |
| **High** | Tensor operations (matmul, conv2d), broadcasting logic, gradient computation, optimizer updates | Property + Coverage (95%) + Mutation (85-90%) | Core algorithms, numerical correctness critical |
| **Medium** | Activation functions (relu, sigmoid), loss functions, data loaders, layer implementations | Property + Coverage (90%) + Mutation (80%) | Important but less complex, well-understood algorithms |
| **Low** | Utility functions, simple accessors, configuration parsing, error messages | Unit tests + Coverage (90%) | Low complexity, limited failure modes |
**Resource Allocation**: Spend 40% of verification time on the 5-10% highest-risk code (SIMD intrinsics, GPU shaders, autograd engine) [7].
### 13.4 Testing Pyramid Distribution
Following certeza's empirically validated distribution [certeza README]:
```
┌─────────────────┐
│ Formal (Kani) │ <- Invariant proofs
│ ~1-5% code │ (capacity >= len, no overflows)
├─────────────────┤
│ Integration │ <- System properties
│ ~10% tests │ (end-to-end workflows)
├─────────────────┤
│ Property-Based │ <- Algorithmic correctness
│ ~30% tests │ (commutativity, associativity)
├─────────────────┤
│ Unit Tests │ <- Basic functionality
│ ~60% tests │ (edge cases, error handling)
└─────────────────┘
Coverage ────────────────────> Mutation ────────────────────> Properties
```
**Trueno Target Distribution** (v1.0.0):
- **Unit tests**: ~150-200 tests (60%) - Basic operation correctness, edge cases
- **Property-based tests**: ~75-100 properties (30%) - Equivalence with NumPy/PyTorch, mathematical properties
- **Integration tests**: ~25-30 scenarios (10%) - End-to-end training, model inference
- **Formal verification**: ~3-5 proofs (1-5% of code) - Tensor invariants, bounds checking
**Property-Based Testing Emphasis** (Inspired by certeza's 53 properties):
```rust
use proptest::prelude::*;
proptest! {
#[test]
fn test_add_commutative(
a in prop::collection::vec(-1000.0f32..1000.0, 1..10000),
b in prop::collection::vec(-1000.0f32..1000.0, 1..10000)
) {
let a_vec = Vector::from_slice(&a);
let b_vec = Vector::from_slice(&b);
let ab = a_vec.add(&b_vec).unwrap();
let ba = b_vec.add(&a_vec).unwrap();
// Commutativity: a + b == b + a
assert_tensors_close(&ab, &ba, 1e-5);
}
#[test]
fn test_numpy_equivalence(
a in prop::collection::vec(-100.0f32..100.0, 1..1000)
) {
// Differential testing: Trueno must match NumPy
let trueno_result = Vector::from_slice(&a).exp().unwrap();
let numpy_result = call_numpy_exp(&a); // Via PyO3
assert_tensors_close(&trueno_result, &numpy_result, 1e-5);
}
}
```
### 13.5 Quality Gates Per Tier
**Tier 1: ON-SAVE** (Sub-second):
```bash
make tier1 # Certeza-style Makefile target
```
- ✅ `cargo check` passes
- ✅ `cargo clippy --lib` (fast mode) - zero warnings
- ✅ Unit tests pass (focused subset)
- ✅ Property tests pass (small case count: PROPTEST_CASES=10)
**Tier 2: ON-COMMIT** (1-5 minutes):
```bash
make tier2
```
- ✅ `cargo fmt --check` - code formatted
- ✅ `cargo clippy --all-targets --all-features` - zero warnings
- ✅ Full test suite passes (`cargo test --all-features`)
- ✅ Coverage ≥90% (`cargo llvm-cov`)
- ✅ Property tests pass (full cases: PROPTEST_CASES=256-1000)
- ✅ Differential tests pass (vs NumPy/PyTorch, error < 1e-5)
- ✅ PMAT TDG ≥B+ (85/100)
- ✅ Zero SATD comments (TODO/FIXME/HACK)
**Tier 3: ON-MERGE/NIGHTLY** (Hours):
```bash
make tier3
```
- ✅ Mutation testing ≥80% kill rate (`cargo mutants`)
- ✅ Benchmarks: no regressions >5%
- ✅ Security audit (`cargo audit`, `cargo deny`)
- ✅ Formal verification passes (Kani proofs for critical invariants)
- ✅ PMAT repo score ≥90/110
### 13.6 Anti-Patterns to Avoid
Based on certeza specification §2.2 and empirical findings:
1. **❌ Running Mutation Tests on Every Save**
- **Problem**: Destroys flow state (10-100x productivity loss)
- **Solution**: Run mutation tests only in Tier 3 (pre-merge/nightly)
2. **❌ Chasing Metrics Without Understanding**
- **Problem**: Goodhart's Law - "When a measure becomes a target, it ceases to be a good measure"
- **Solution**: Mutation analysis is a learning exercise, not just a metric
3. **❌ Applying Full Verification to Low-Risk Code**
- **Problem**: Over-processing waste (Toyota Way: *Muda*)
- **Solution**: Risk-based resource allocation (40% time on 5-10% highest-risk code)
4. **❌ Ignoring Cognitive Load Limits**
- **Problem**: Analyzing 1000 surviving mutants in one session causes burnout
- **Solution**: Time-boxing (2-hour sessions), batching, pair programming for mutation analysis
5. **❌ Skipping Differential Testing**
- **Problem**: Missing NumPy/PyTorch incompatibilities until production
- **Solution**: Every operation must have differential test (vs NumPy/PyTorch)
6. **❌ No Gradient Checking for Autograd**
- **Problem**: Analytical gradients may be incorrect (chain rule bugs)
- **Solution**: Automated gradient checking (analytical vs numerical) for every autograd operation
### 13.7 Certeza as Empirical Proof
**Key Finding**: certeza achieved **97.7% mutation score** with 231 tests, proving that EXTREME TDD targets are achievable and sustainable when properly tiered [certeza README].
**Certeza Achievement Breakdown**:
- **Unit tests**: 150+ tests (60%)
- **Property-based tests**: 53 properties (30%)
- **Integration tests**: 26 scenarios (10%)
- **Formal verification**: 3 Kani proofs (capacity invariant, push/pop correctness, bounds checking)
- **Mutation score**: 97.7% (only ~10 surviving mutants out of 400+)
- **Coverage**: >95% line coverage
**Lesson for Trueno**: If a simple vector data structure (certeza::TruenoVec) can achieve 97.7% mutation score, then trueno's operations (elementwise ops, reductions, activations) can achieve ≥90% mutation score by following the same tiered workflow.
**Workflow Commands** (Trueno should adopt certeza's Makefile structure):
```bash
# Tier 1: Sub-second feedback (flow state)
make tier1
# Tier 2: Pre-commit (1-5 minutes)
make tier2
# Tier 3: Pre-merge/nightly (hours)
make tier3
# Kaizen: Continuous improvement cycle
make kaizen # Analyze coverage, complexity, TDG, binary size
# Demo mode: Interactive demonstration
make demo-mode
```
### 13.8 References (Certeza-Specific)
1. Certeza Specification: "Asymptotic Test Effectiveness: A Practical Framework for High-Assurance Rust Verification" (v1.1, ~14K words)
2. Certeza README: 97.7% mutation score achievement with 231 tests
3. Jia, Y., & Harman, M. (2011). An analysis and survey of the development of mutation testing. *IEEE TSE, 37(5)*.
4. Fowler, M. (2013). *Refactoring: Improving the Design of Existing Code* (2nd ed.). Chapter on Test-Driven Development and flow state.
5. Google Testing Blog. (2014). Test Sizes. https://testing.googleblog.com/2010/12/test-sizes.html
6. Humble, J., & Farley, D. (2010). *Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation*. Addison-Wesley.
7. McConnell, S. (2004). *Code Complete* (2nd ed.). Microsoft Press. Chapter 20: Software Quality Landscape.
---
## 14. Conclusion
Trueno is strategically positioned to become the **de facto PyTorch/NumPy replacement for Rust**. This specification defines a clear, achievable roadmap:
**v0.2.0 → v0.3.0** (2-3 months): Complete 1D operations, validate GPU performance
**v0.3.0 → v0.6.0** (6-12 months): Multi-dimensional tensors, broadcasting, NumPy parity
**v0.6.0 → v1.0.0** (12-18 months): Autograd, training, PyTorch parity
**Total timeline**: 20-33 months to production-ready deep learning library.
**Key Success Factors**:
1. **Focus**: Prioritize 80% use cases, defer advanced features
2. **Quality**: Maintain EXTREME TDD standards throughout
3. **Performance**: Validate GPU speedup claims, optimize critical paths
4. **Adoption**: Ship early, gather feedback, iterate on API
**Next Steps**:
1. ✅ Complete v0.2.1 (softmax/log_softmax GPU) - DONE
2. Benchmark all GPU ops (validate 10-50x claims)
3. Implement remaining activations (hardswish, mish, selu)
4. Begin Phase 2 design (Tensor<T, N> type)
---
## 15. Academic References
This specification is grounded in academic research and industry best practices. Key publications informing Trueno's design:
**Deep Learning Frameworks**:
1. Abadi, M., et al. (2016). TensorFlow: A System for Large-Scale Machine Learning. *OSDI 16*.
2. Paszke, A., et al. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. *NeurIPS 2019*.
3. Chen, T., et al. (2018). TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. *OSDI 18*.
4. Lattner, C., et al. (2020). MLIR: A Compiler Infrastructure for the End of Moore's Law. *arXiv:2002.11054*.
**Automatic Differentiation**:
5. Baydin, A. G., et al. (2018). Automatic Differentiation in Machine Learning: a Survey. *Journal of Machine Learning Research, 18*.
6. Griewank, A., & Walther, A. (2008). *Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation*. SIAM.
**Numerical Computing**:
7. Van der Walt, S., et al. (2011). The NumPy Array: A Structure for Efficient Numerical Computation. *Computing in Science & Engineering, 13(2)*.
8. Goto, K., & van de Geijn, R. (2008). Anatomy of High-Performance Matrix Multiplication. *ACM Transactions on Mathematical Software, 34(3)*.
9. Bradbury, J., et al. (2018). JAX: composable transformations of Python+NumPy programs. *Google Research*.
**Software Testing**:
10. Pei, K., et al. (2017). DeepXplore: Automated Whitebox Testing of Deep Learning Systems. *SOSP 2017*.
11. Zhang, H., et al. (2020). An Empirical Study of Common Bugs in Deep Learning Applications. *ISSRE 2020*.
12. McKeeman, W. M. (1998). Differential testing for software. *Digital Technical Journal, 10(1)*.
13. Miller, B. P., et al. (1990). An empirical study of the reliability of UNIX utilities. *CACM, 33(12)*.
**Human-Computer Interaction**:
14. Ko, A. J., et al. (2011). The state of the art in error messages: an empirical study. *ISSTA 2011*.
15. Meyer, A. N., et al. (2017). The pragmatic programmer's programmer: A study of professional developers' learning strategies. *ICER 2017*.
**Software Engineering for ML**:
16. Amershi, S., et al. (2019). Software Engineering for Machine Learning: A Case Study. *ICSE-SEIP 2019*.
17. Sculley, D., et al. (2015). Hidden Technical Debt in Machine Learning Systems. *NeurIPS 2015*.
18. Jia, X., et al. (2019). A Survey of Software Engineering for Machine Learning. *arXiv:1906.07548*.
**Foundational Architectures** (Target Use Cases):
19. Vaswani, A., et al. (2017). Attention Is All You Need. *NeurIPS 2017*. (Transformers)
20. He, K., et al. (2016). Deep Residual Learning for Image Recognition. *CVPR 2016*. (ResNets)
---
**Document Version**: 1.2
**Last Updated**: 2025-11-17
**Status**: Living Document (update as roadmap evolves)
**Owner**: Trueno Core Team
**Changelog**:
- **v1.2** (2025-11-17): Added Section 13 - Tiered TDD-X Workflow & Quality Gates based on certeza scientific experiment (97.7% mutation score). Includes: tiered workflow (tier1/tier2/tier3), risk-based verification matrix, testing pyramid distribution, quality gates per tier, anti-patterns, certeza empirical proof. ~270 lines added.
- **v1.1** (2025-11-17): Added Kaizen improvements: tensor type trade-offs, storage layout considerations, broadcasting quality gates, autograd gradient checking, differential/fuzz testing, tensor compiler vision, error message quality, model hub strategy, academic citations
- **v1.0** (2025-11-17): Initial comprehensive specification