# CUDA Library Parity Specification for the Sovereign AI Stack
**Version**: 1.0.0
**Status**: Draft
**Date**: 2026-03-04
**Authors**: Sovereign AI Stack Team
**Toyota Way Alignment**: Jidoka (built-in quality), Kaizen (continuous improvement), Genchi Genbutsu (go and see)
**Batuta Oracle Confidence**: trueno 85-95% for linear algebra; gaps identified for FFT, solvers, image processing, RNG
## Executive Summary
This specification defines the roadmap for achieving feature parity between NVIDIA's CUDA Library ecosystem (cuBLAS, cuSPARSE, cuFFT, cuSOLVER, cuTENSOR, NPP, nvCOMP, nvJPEG, cuRAND, cuPQC) and the Sovereign AI Stack's pure Rust implementation. The gap analysis reveals that **trueno** already covers dense BLAS (GEMM, GEMV, batched matmul) and data compression (LZ4), but lacks dedicated sparse linear algebra, FFT, general solvers, image processing, and GPU RNG. Each new capability is specified across four design dimensions: provable-contracts verification, Brick profiling integration, pure Rust PTX implementation, and world-class algorithmic foundations from peer-reviewed research.
### Sovereign Stack Current Coverage
| CUDA Library | Sovereign Equivalent | Coverage | Gap |
|---|---|---|---|
| cuBLAS (Level 1-3) | trueno (vector_ops, matrix_ops, GEMM/GEMV) | ~70% | Batched BLAS, mixed-precision GEMM variants |
| cuBLASLt | trueno-gpu (tiled GEMM, tensor core GEMM) | ~50% | Algorithm selection, epilogue fusion |
| cuSPARSE | **None** | 0% | Full sparse library needed |
| cuFFT | **None** (rustfft CPU only in aprender) | 0% | GPU FFT needed |
| cuSOLVER | trueno (Jacobi eigendecomp only) | ~5% | LU, QR, SVD, Cholesky |
| cuTENSOR | **None** | 0% | N-dimensional contractions |
| NPP | **None** (rmedia for video, not GPU image) | 0% | GPU image processing |
| nvCOMP | trueno (LZ4), trueno-zram | ~20% | GPU-accelerated codecs |
| nvJPEG/nvTIFF | **None** | 0% | GPU image codecs |
| cuRAND | **None** | 0% | GPU RNG |
| cuPQC | aprender (AES-256-GCM, Ed25519 — classical crypto, CPU) | ~10% | GPU-accelerated post-quantum crypto |
---
## 1. New Crate Architecture
Each CUDA library equivalent becomes a module within the trueno workspace, following the existing pattern of `trueno-gpu`, `trueno-quant`, etc.
```
trueno/                          (workspace root)
├── trueno/                      (core: SIMD vector/matrix ops)
├── trueno-gpu/                  (PTX generation + GPU kernels)
│   └── src/kernels/
│       ├── sparse/              NEW: SpMV, SpMM, SpGEMM kernels
│       ├── fft/                 NEW: Stockham FFT kernels
│       ├── solve/               NEW: LU, QR, SVD, Cholesky kernels
│       ├── image/               NEW: Convolution, morphology, histogram
│       ├── rand/                NEW: Philox, XORWOW counter-based RNG
│       └── tensor/              NEW: N-dimensional contraction kernels
├── crates/
│   ├── trueno-sparse/           NEW: Sparse matrix formats + CPU SIMD
│   ├── trueno-fft/              NEW: FFT CPU + GPU unified API
│   ├── trueno-solve/            NEW: Dense/sparse solvers
│   ├── trueno-image/            NEW: Image processing primitives
│   └── trueno-rand/             NEW: Parallel RNG
└── contracts/                   NEW: provable-contracts YAML
    ├── sparse-spmv-v1.yaml
    ├── fft-stockham-v1.yaml
    ├── solve-lu-v1.yaml
    ├── solve-qr-v1.yaml
    ├── solve-svd-v1.yaml
    ├── image-conv2d-v1.yaml
    └── rand-philox-v1.yaml
```
### Dependency Graph
```
trueno-image ──► trueno-fft (convolution via FFT for large kernels)
trueno-solve ──► trueno-sparse (sparse solvers use sparse formats)
trueno-solve ──► trueno (dense BLAS for panel factorization)
trueno-sparse ──► trueno (dense sub-operations within sparse)
trueno-fft ──► trueno (butterfly arithmetic)
trueno-rand ──► trueno-gpu (PTX kernel generation)
All ──► trueno-gpu (GPU kernel infrastructure)
All ──► provable-contracts (YAML verification contracts)
```
### PAIML MCP Agent Toolkit Alignment
Per pmat conventions, each new crate follows:
- **Uniform Contracts**: `BaseAnalysisContract` pattern with `#[serde(flatten)]` for shared parameters
- **Service Layer**: `impl Service for SparseService { type Input; type Output; type Error; }`
- **MCP Tool Registration**: Each crate exposes pmat-compatible analysis tools
- **TDG Grade**: A- minimum (score >= 88/100)
- **Coverage**: >= 95% line coverage, >= 80% mutation score
- **EXTREME TDD**: 8+ tests per feature, property-based testing for all numerical code
- **Zero SATD**: No TODO/FIXME/HACK in shipped code
- **Feature Gating**: Optional heavy features behind Cargo feature flags
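The service-layer convention above can be sketched as follows. This is a minimal illustration of the `impl Service for SparseService` pattern quoted in the bullet; the trait shape, the `run` method, and the toy body are assumptions, not an existing pmat API.

```rust
// Illustrative pmat-style service layer. `Service`, `SparseService`, and
// the associated types follow the convention quoted above; the `run`
// method and its toy body are assumptions for illustration only.
pub trait Service {
    type Input;
    type Output;
    type Error;
    fn run(&self, input: Self::Input) -> Result<Self::Output, Self::Error>;
}

pub struct SparseService;

impl Service for SparseService {
    type Input = Vec<f32>;
    type Output = usize;
    type Error = String;

    fn run(&self, input: Self::Input) -> Result<Self::Output, Self::Error> {
        // Toy body: count nonzeros in the input vector.
        Ok(input.iter().filter(|v| **v != 0.0).count())
    }
}
```

The associated-type triple keeps every crate's entry point uniform, which is what makes MCP tool registration mechanical.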
---
## 2. trueno-sparse: Sparse Linear Algebra
### 2A. Provable-Contracts Design
**Contract**: `sparse-spmv-v1.yaml`
```yaml
metadata:
  version: "1.0.0"
  author: "Sovereign AI Stack"
  description: "Sparse matrix-vector multiply (CSR format)"
  references:
    - "Merrill & Garland, 'Merge-Based Parallel Sparse Matrix-Vector Multiplication', PPoPP 2016"
    - "LAProof: Princeton formal verification of SpMV in C (Appel et al.)"
equations:
  spmv:
    formula: "y_i = α · Σ_j A_{ij} · x_j + β · y_i"
    domain: "A ∈ ℝ^{m×n} (CSR), x ∈ ℝ^n, y ∈ ℝ^m, α,β ∈ ℝ"
    codomain: "y ∈ ℝ^m"
    invariants:
      - "Output length equals number of rows: len(y) == m"
      - "Backward error bounded: |Ay - y_exact| ≤ nnz_per_row · u · |A| · |x|"
proof_obligations:
  - type: invariant
    property: "Output dimension correctness"
    formal: "len(result) == A.rows()"
    applies_to: all
  - type: equivalence
    property: "Sparse result matches dense reference"
    formal: "|sparse_spmv(A,x) - dense_matvec(A.to_dense(), x)| < ε"
    tolerance: 1.0e-5
    applies_to: all
  - type: bound
    property: "Backward error bound (LAProof)"
    formal: "||Ax_computed - Ax_exact||_inf ≤ n · u · ||A||_inf · ||x||_inf"
    tolerance: "n * f32::EPSILON"
    applies_to: all
  - type: linearity
    property: "SpMV is linear in x"
    formal: "|spmv(A, αx + βy) - α·spmv(A,x) - β·spmv(A,y)| < ε"
    tolerance: 1.0e-4
    applies_to: all
  - type: equivalence
    property: "SIMD matches scalar within ULP"
    tolerance: 8.0
    applies_to: simd
  - type: equivalence
    property: "GPU matches CPU within tolerance"
    tolerance: 16.0
    applies_to: ptx
kernel_structure:
  phases:
    - name: format_validation
      description: "Validate CSR invariants (sorted columns, valid offsets)"
      invariant: "offsets[0] == 0 && offsets[m] == nnz && offsets monotonically increasing"
    - name: merge_path_partition
      description: "Binary search to partition work equally among thread blocks"
      invariant: "Each partition processes approximately nnz/num_blocks elements"
    - name: merge_spmv
      description: "Parallel merge of row-offsets and NZ-data"
      invariant: "Each element accumulated into correct row"
    - name: reduction
      description: "Reduce partial row sums across thread blocks"
      invariant: "y[i] = sum of all partial contributions to row i"
simd_dispatch:
  spmv_csr:
    scalar: "spmv_csr_scalar"
    avx2: "spmv_csr_avx2"
    avx512: "spmv_csr_avx512"
    neon: "spmv_csr_neon"
    ptx: "spmv_csr_merge_gpu"
enforcement:
  format_invariants:
    description: "CSR offsets must be monotonically increasing with offsets[0]=0"
    check: "contract_tests::FALSIFY-SPARSE-001"
    severity: "ERROR"
  no_oob_access:
    description: "Column indices must be in [0, n)"
    check: "contract_tests::FALSIFY-SPARSE-002"
    severity: "ERROR"
  backward_error:
    description: "Numerical error bounded by LAProof formula"
    check: "contract_tests::FALSIFY-SPARSE-003"
    severity: "ERROR"
falsification_tests:
  - id: FALSIFY-SPARSE-001
    rule: "CSR format validity"
    prediction: "Reject CSR with non-monotonic offsets"
    test: "Construct invalid CSR, assert error returned"
    if_fails: "Format validation missing or incomplete"
  - id: FALSIFY-SPARSE-002
    rule: "Dimension correctness"
    prediction: "spmv(A[m,n], x[n]) produces y[m]"
    test: "proptest with random dimensions"
    if_fails: "Index arithmetic error in merge-path"
  - id: FALSIFY-SPARSE-003
    rule: "Backward error bound"
    prediction: "|y_sparse - y_dense| ≤ nnz_per_row * eps * |A|_inf * |x|_inf"
    test: "proptest with random CSR + dense reference"
    if_fails: "Accumulation order error or missing Kahan summation"
  - id: FALSIFY-SPARSE-004
    rule: "SIMD-scalar equivalence"
    prediction: "|y_simd - y_scalar| ≤ 8 ULP per element"
    test: "proptest with AVX2 vs scalar on same input"
    if_fails: "SIMD reassociation exceeds tolerance"
  - id: FALSIFY-SPARSE-005
    rule: "GPU-CPU equivalence"
    prediction: "|y_gpu - y_cpu| ≤ 16 ULP per element"
    test: "Run merge-based GPU kernel vs CPU scalar reference"
    if_fails: "GPU kernel accumulation or memory access error"
kani_harnesses:
  - id: KANI-SPARSE-001
    obligation: "Format invariants"
    property: "CSR offsets form valid partition of [0, nnz)"
    bound: 16
    strategy: exhaustive
    harness: verify_csr_offsets_valid
  - id: KANI-SPARSE-002
    obligation: "Merge-path partition correctness"
    property: "Binary search finds correct merge-path diagonal"
    bound: 32
    strategy: bounded_int
    harness: verify_merge_path_partition
```
**Contract**: `sparse-spmm-v1.yaml` (SpMM) and `sparse-spgemm-v1.yaml` (SpGEMM) follow the same pattern with matrix-output equations.
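The format-validation phase and FALSIFY-SPARSE-001/002 can be made concrete with a small sketch. The function name and error type are illustrative, not trueno API; the checks mirror the contract's `format_invariants` and `no_oob_access` rules.

```rust
// Hedged sketch of CSR validation per FALSIFY-SPARSE-001/002.
// `validate_csr` and the String error are illustrative names only.
pub fn validate_csr(rows: usize, cols: usize, offsets: &[u32],
                    col_indices: &[u32], nnz: usize) -> Result<(), String> {
    if offsets.len() != rows + 1 {
        return Err("offsets must have rows + 1 entries".into());
    }
    if offsets[0] != 0 || offsets[rows] as usize != nnz {
        return Err("offsets must start at 0 and end at nnz".into());
    }
    // Monotonic (non-decreasing: empty rows are legal)
    if offsets.windows(2).any(|w| w[0] > w[1]) {
        return Err("offsets must be monotonically non-decreasing".into());
    }
    // no_oob_access: column indices in [0, cols)
    if col_indices.iter().any(|&c| c as usize >= cols) {
        return Err("column index out of range".into());
    }
    Ok(())
}
```

A falsification test then constructs a deliberately broken CSR (non-monotonic offsets, out-of-range column) and asserts an error is returned.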
### 2B. Brick Profiling Design
```rust
// New BrickId variants for sparse operations
pub enum BrickId {
    // ... existing variants ...

    // Sparse operations (new)
    SpMV,          // Sparse matrix-vector
    SpMM,          // Sparse matrix-dense matrix
    SpGEMM,        // Sparse matrix-sparse matrix
    FormatConvert, // CSR↔COO↔BSR conversion
}

// New BrickCategory
pub enum BrickCategory {
    // ... existing ...
    Sparse, // All sparse operations
}
```
**Roofline Classification**:
- SpMV: **memory-bound** (arithmetic intensity ~0.25 FLOP/byte for f32 CSR)
- SpMM: **mixed** (depends on dense matrix width; memory-bound for narrow, compute-bound for wide)
- SpGEMM: **memory-bound** (irregular access, hash-table accumulation)
**Tuner Features** (extend `TunerFeatures` v1.2.0):
```rust
// Sparse-specific features (6 new dimensions)
pub struct SparseTunerFeatures {
    pub nnz: u64,                 // Total nonzeros
    pub rows: u64,                // Matrix rows
    pub cols: u64,                // Matrix columns
    pub avg_nnz_per_row: f32,     // Average nonzeros per row
    pub row_length_variance: f32, // Variance of row lengths (key for algorithm selection)
    pub format: SparseFormat,     // CSR, COO, BSR, ELL
}
```
**Algorithm Selection via BrickTuner**:
- `row_length_variance < 0.1`: Row-split SpMV (regular matrices)
- `row_length_variance >= 0.1`: Merge-based SpMV (irregular matrices)
- `avg_nnz_per_row > 32 && GPU available`: GPU merge-based
- `avg_nnz_per_row <= 32`: CPU SIMD (memory-bound, GPU launch overhead dominates)
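The selection rules above can be sketched as a pure function over the tuner features. The enum and function names are illustrative (the real BrickTuner presumably learns these thresholds rather than hard-coding them); the thresholds mirror the bullets exactly, with the GPU rule taking precedence.

```rust
// Minimal sketch of the BrickTuner selection heuristics listed above.
// `SpmvAlgorithm` and `select_spmv` are illustrative names, not trueno API.
#[derive(Debug, PartialEq)]
pub enum SpmvAlgorithm {
    RowSplitCpu, // regular rows: simple row-split SIMD
    MergeCpu,    // irregular rows: merge-based load balance
    MergeGpu,    // enough work per row to amortize kernel launch
}

pub fn select_spmv(row_length_variance: f32, avg_nnz_per_row: f32,
                   gpu_available: bool) -> SpmvAlgorithm {
    if avg_nnz_per_row > 32.0 && gpu_available {
        SpmvAlgorithm::MergeGpu
    } else if row_length_variance < 0.1 {
        SpmvAlgorithm::RowSplitCpu
    } else {
        SpmvAlgorithm::MergeCpu
    }
}
```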
**ModelTracer Extensions** (MLT-06):
```rust
pub struct SparseOperationTrace {
    pub format: SparseFormat,
    pub nnz: u64,
    pub effective_bandwidth_gbps: f64,
    pub achieved_vs_roofline_pct: f64,
    pub load_balance_efficiency: f64, // 1.0 = perfect balance
}
```
### 2C. Pure Rust Implementation
#### Sparse Formats (CPU + SIMD)
```rust
/// Compressed Sparse Row format
pub struct CsrMatrix<T> {
    pub rows: usize,
    pub cols: usize,
    pub offsets: Vec<u32>,     // len = rows + 1
    pub col_indices: Vec<u32>, // len = nnz
    pub values: Vec<T>,        // len = nnz
}

/// Coordinate format (for construction)
pub struct CooMatrix<T> {
    pub rows: usize,
    pub cols: usize,
    pub row_indices: Vec<u32>,
    pub col_indices: Vec<u32>,
    pub values: Vec<T>,
}

/// Block Sparse Row (for structured sparsity)
pub struct BsrMatrix<T> {
    pub rows: usize,
    pub cols: usize,
    pub block_rows: usize,
    pub block_cols: usize,
    pub offsets: Vec<u32>,
    pub col_indices: Vec<u32>,
    pub values: Vec<T>, // stored as dense blocks
}

/// SELL-C-sigma (Sliced ELLPACK with sorting)
pub struct SellMatrix<T> {
    pub rows: usize,
    pub cols: usize,
    pub slice_size: usize, // typically 32 or 256
    pub slice_starts: Vec<u32>,
    pub col_indices: Vec<u32>,
    pub values: Vec<T>,
}
```
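Since COO is the construction format and CSR the compute format, the conversion between them is the workhorse behind `FormatConvert`. A minimal sketch via counting sort on row indices (function name illustrative; it returns the three CSR arrays rather than building the struct, to stay self-contained):

```rust
// Sketch of CooMatrix -> CsrMatrix conversion via counting sort on rows.
// `coo_to_csr` is an illustrative name; assumes unsorted COO triples.
pub fn coo_to_csr(rows: usize, row_indices: &[u32], col_indices: &[u32],
                  values: &[f32]) -> (Vec<u32>, Vec<u32>, Vec<f32>) {
    let nnz = values.len();
    let mut offsets = vec![0u32; rows + 1];
    for &r in row_indices { offsets[r as usize + 1] += 1; } // histogram
    for i in 0..rows { offsets[i + 1] += offsets[i]; }      // exclusive prefix sum
    let mut cols = vec![0u32; nnz];
    let mut vals = vec![0f32; nnz];
    let mut cursor = offsets.clone(); // next free slot per row
    for k in 0..nnz {
        let r = row_indices[k] as usize;
        let dst = cursor[r] as usize;
        cols[dst] = col_indices[k];
        vals[dst] = values[k];
        cursor[r] += 1;
    }
    (offsets, cols, vals)
}
```

The scatter is stable within each row, so a column sort per row (required by the `format_validation` phase) can follow if the COO input was not column-sorted.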
#### SIMD SpMV (AVX2)
Algorithm: Row-split with 8-wide f32 accumulation.
```rust
// Backend trait extension
pub trait SparseBackend {
    fn spmv_csr(
        alpha: f32, matrix: &CsrMatrix<f32>, x: &[f32],
        beta: f32, y: &mut [f32],
    );

    fn spmm_csr(
        alpha: f32, matrix: &CsrMatrix<f32>, b: &[f32], b_cols: usize,
        beta: f32, c: &mut [f32],
    );
}
```
For AVX2 SpMV: gather from `x` using `_mm256_i32gather_ps` with column indices, FMA accumulate, horizontal sum per row. For rows shorter than 8 elements, use masked operations.
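The scalar baseline that the AVX2 gather path (and FALSIFY-SPARSE-004/005) must match is worth writing out. This sketch takes the raw CSR arrays instead of `CsrMatrix` to stay self-contained; semantics are `y = alpha*A*x + beta*y`.

```rust
// Scalar reference SpMV over raw CSR arrays: y = alpha*A*x + beta*y.
// This is the `spmv_csr_scalar` dispatch target the SIMD/GPU variants
// are checked against (within 8/16 ULP per the contract).
pub fn spmv_csr_scalar(alpha: f32, offsets: &[u32], col_indices: &[u32],
                       values: &[f32], x: &[f32], beta: f32, y: &mut [f32]) {
    for row in 0..y.len() {
        let (start, end) = (offsets[row] as usize, offsets[row + 1] as usize);
        let mut acc = 0.0f32;
        for k in start..end {
            // Indirect access into x is the gather the AVX2 path vectorizes
            acc += values[k] * x[col_indices[k] as usize];
        }
        y[row] = alpha * acc + beta * y[row];
    }
}
```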
#### GPU Kernel: Merge-Based SpMV (PTX)
Per Merrill & Garland (PPoPP 2016):
```rust
pub struct MergeSpMVKernel {
    pub rows: u32,
    pub nnz: u32,
    pub items_per_thread: u32,
}

impl Kernel for MergeSpMVKernel {
    fn name(&self) -> &str { "spmv_merge" }

    fn build_ptx(&self) -> PtxKernel {
        // 1. Each thread block computes its merge-path diagonal
        //    via binary search on (row_offsets, nz_indices)
        // 2. Each thread walks its portion of the merge path:
        //    - When advancing in row_offsets: emit partial sum for previous row
        //    - When advancing in nz_indices: accumulate val[j] * x[col[j]]
        // 3. Block-level reduction of partial row sums via shared memory
        // 4. Inter-block reduction for rows that span multiple blocks
        PtxKernel::new("spmv_merge")
            .param(PtxType::U64, "offsets_ptr")
            .param(PtxType::U64, "col_indices_ptr")
            .param(PtxType::U64, "values_ptr")
            .param(PtxType::U64, "x_ptr")
            .param(PtxType::U64, "y_ptr")
            .param(PtxType::U32, "rows")
            .param(PtxType::U32, "nnz")
            .param(PtxType::F32, "alpha")
            .param(PtxType::F32, "beta")
            .shared_memory(4096) // Partial row sums
            .build(|ctx| {
                // Merge-path binary search + accumulation
                // (full PTX generation via builder extension traits)
            })
    }
}
```
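The merge-path diagonal search in step 1 (and the `verify_merge_path_partition` Kani harness) has a compact CPU model. This is a sketch of the Merrill & Garland partition: diagonal `d` is split into `(rows consumed, nonzeros consumed)` by binary-searching the row end-offsets; the function name is illustrative.

```rust
// CPU model of the merge-path diagonal search (Merrill & Garland).
// Merges A = offsets[1..=m] (row end positions) with B = 0..nnz.
// Returns (rows consumed, nonzeros consumed) on diagonal `d`, with
// d in 0..=(m + nnz). Ties consume the row boundary first.
pub fn merge_path_search(d: usize, offsets: &[u32], nnz: usize) -> (usize, usize) {
    let m = offsets.len() - 1;
    let mut lo = d.saturating_sub(nnz); // at least d - nnz rows consumed
    let mut hi = d.min(m);              // at most min(d, m) rows consumed
    while lo < hi {
        let mid = (lo + hi + 1) / 2;    // candidate row count, biased up
        if (offsets[mid] as usize) <= d - mid {
            lo = mid;                   // row `mid - 1` ends within the diagonal
        } else {
            hi = mid - 1;
        }
    }
    (lo, d - lo)
}
```

Equal spacing of the diagonals `d = t * items_per_thread` is what gives every thread the same number of merge steps regardless of row-length skew.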
### 2D. World-Class Algorithmic Foundation
**Primary Algorithm**: Merge-Based SpMV (Merrill & Garland, PPoPP 2016)
- Frames SpMV as a merge of row-offsets list and NZ-data list
- Work-balanced: every thread processes the same number of merge-path steps
- No format conversion required (works directly on CSR)
- Merge-based block strategies achieve up to 1.84x over cuSPARSE on irregular matrices (Cluster Computing 2024)
**Formal Verification**: LAProof (Princeton, Appel et al.)
- Machine-checked backward error bounds for CSR SpMV
- Error bound: `||r||_inf ≤ n · u · ||A||_inf · ||x||_inf` where `u = f32::EPSILON/2`
- Verified Jacobi iterative solver convergence
- Translates directly to Kani harnesses for bounded verification
**Adaptive Format Selection**: MANet (arXiv 2018)
- ML-based sparse format selection (CSR vs ELL vs COO vs BSR)
- Integrates with BrickTuner via `row_length_variance` feature
**Reference Papers**:
1. Merrill & Garland, "Merge-Based Parallel Sparse Matrix-Vector Multiplication", PPoPP 2016
2. Acc-SpMM, arXiv:2501.09251 (Tensor Core SpMM, January 2025)
3. HC-SpMM, arXiv:2412.08902 (Hybrid Core SpMM, December 2024)
4. LAProof, Princeton CS (formal SpMV verification in Coq/VST)
5. Block Strategy with Adaptive Storage, Cluster Computing 2024 (1.84x over cuSPARSE)
---
## 3. trueno-fft: Fast Fourier Transforms
### 3A. Provable-Contracts Design
**Contract**: `fft-stockham-v1.yaml`
```yaml
metadata:
  version: "1.0.0"
  author: "Sovereign AI Stack"
  description: "GPU-accelerated FFT via Stockham auto-sort algorithm"
  references:
    - "Stockham, 'High-speed convolution and correlation', AFIPS 1966"
    - "VkFFT: Efficient Vulkan/CUDA/HIP/OpenCL/Metal FFT, IEEE 2023"
    - "Cooley & Tukey, 'An algorithm for the machine calculation of complex Fourier series', 1965"
equations:
  dft:
    formula: "X_k = Σ_{n=0}^{N-1} x_n · exp(-2πi·k·n/N)"
    domain: "x ∈ ℂ^N, N ≥ 1"
    codomain: "X ∈ ℂ^N"
    invariants:
      - "Parseval's theorem: Σ|x_n|² = (1/N)·Σ|X_k|²"
      - "Linearity: DFT(αx + βy) = α·DFT(x) + β·DFT(y)"
      - "Inverse: IDFT(DFT(x)) = x"
  r2c:
    formula: "X_k = Σ_{n=0}^{N-1} x_n · exp(-2πi·k·n/N), x ∈ ℝ^N"
    domain: "x ∈ ℝ^N"
    codomain: "X ∈ ℂ^{N/2+1} (Hermitian symmetry)"
    invariants:
      - "X_{N-k} = conj(X_k) (conjugate symmetry for real input)"
      - "Output length = N/2 + 1"
proof_obligations:
  - type: invariant
    property: "Parseval energy conservation"
    formal: "|Σ|x_n|² - (1/N)·Σ|X_k|²| < ε"
    tolerance: 1.0e-4
    applies_to: all
  - type: invariant
    property: "Inverse roundtrip"
    formal: "|IFFT(FFT(x)) - x| < ε element-wise"
    tolerance: 1.0e-5
    applies_to: all
  - type: linearity
    property: "FFT linearity"
    formal: "|FFT(αx + βy) - α·FFT(x) - β·FFT(y)| < ε"
    tolerance: 1.0e-4
    applies_to: all
  - type: equivalence
    property: "GPU matches CPU reference"
    tolerance: 32.0
    applies_to: ptx
  - type: conservation
    property: "Parseval energy conservation"
    formal: "energy_time_domain ≈ energy_freq_domain"
    tolerance: 1.0e-4
    applies_to: all
kernel_structure:
  phases:
    - name: twiddle_precompute
      description: "Precompute twiddle factors in FP64, store as FP32"
      invariant: "|W_k - exp(-2πik/N)| < f64::EPSILON"
    - name: butterfly_stages
      description: "log2(N) stages of butterfly operations (Stockham)"
      invariant: "Each radix-r stage reduces the DFT size by a factor of r"
    - name: transpose
      description: "Shared memory transpose between stages (bank-conflict-free)"
      invariant: "Data permutation is correct"
    - name: output_scale
      description: "Optional 1/N scaling for inverse"
      invariant: "Scale applied uniformly"
simd_dispatch:
  fft_1d:
    scalar: "fft_1d_scalar"
    avx2: "fft_1d_avx2"
    avx512: "fft_1d_avx512"
    neon: "fft_1d_neon"
    ptx: "fft_stockham_gpu"
falsification_tests:
  - id: FALSIFY-FFT-001
    rule: "Parseval energy conservation"
    prediction: "Time-domain energy equals frequency-domain energy within tolerance"
    test: "proptest with random complex vectors, N = 2^k for k in 4..16"
    if_fails: "Butterfly coefficient or twiddle factor error"
  - id: FALSIFY-FFT-002
    rule: "Inverse roundtrip"
    prediction: "IFFT(FFT(x)) reproduces x within tolerance"
    test: "proptest with random input, all supported sizes"
    if_fails: "Twiddle sign error or missing conjugation in inverse"
  - id: FALSIFY-FFT-003
    rule: "Known-value test (DFT of impulse)"
    prediction: "FFT([1, 0, 0, ..., 0]) = [1, 1, 1, ..., 1]"
    test: "Exact comparison for impulse input"
    if_fails: "Fundamental algorithm error"
  - id: FALSIFY-FFT-004
    rule: "GPU-CPU equivalence"
    prediction: "|FFT_gpu(x) - FFT_cpu(x)| ≤ 32 ULP per element"
    test: "Run Stockham GPU vs scalar CPU on same input"
    if_fails: "Shared memory transpose or twiddle error in PTX"
kani_harnesses:
  - id: KANI-FFT-001
    obligation: "Butterfly index correctness"
    property: "Butterfly pairs are non-overlapping and cover all elements"
    bound: 16
    strategy: bounded_int
    harness: verify_butterfly_indices
  - id: KANI-FFT-002
    obligation: "Twiddle factor indexing"
    property: "Twiddle index k is in [0, N/2) at each stage"
    bound: 16
    strategy: bounded_int
    harness: verify_twiddle_indices
```
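The falsification tests above all compare against an oracle, and the natural oracle is the naive O(N²) DFT of the `dft` equation. A dependency-free sketch (complex numbers as `(re, im)` tuples; the function name is illustrative):

```rust
// Naive O(N^2) DFT, the test oracle behind FALSIFY-FFT-001/003.
// X_k = sum_n x_n * exp(-2*pi*i*k*n/N), complex as (re, im) tuples.
pub fn dft_naive(x: &[(f64, f64)]) -> Vec<(f64, f64)> {
    let n = x.len();
    (0..n).map(|k| {
        let mut acc = (0.0f64, 0.0f64);
        for (j, &(re, im)) in x.iter().enumerate() {
            let ang = -2.0 * std::f64::consts::PI * (k * j) as f64 / n as f64;
            let (c, s) = (ang.cos(), ang.sin());
            acc.0 += re * c - im * s; // complex multiply, real part
            acc.1 += re * s + im * c; // complex multiply, imag part
        }
        acc
    }).collect()
}
```

With this oracle, the Parseval check is a one-liner: sum `|x_n|²` in time domain and compare against `(1/N)·Σ|X_k|²`.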
### 3B. Brick Profiling Design
```rust
pub enum BrickId {
    // ... existing ...

    // FFT operations (new)
    FFT1D,
    FFT2D,
    FFT3D,
    IFFT1D,
    TwiddlePrecompute,
    ButterflyStage,
}

pub enum BrickCategory {
    // ... existing ...
    FFT,
}
```
**Roofline Classification**:
- Small FFT (N < 1024): **compute-bound** (butterfly arithmetic dominates)
- Large FFT (N > 65536): **memory-bound** (data movement between stages dominates)
- 2D/3D FFT: **memory-bound** (dominated by transpose between dimensions)
**Tuner Features**:
```rust
pub struct FFTTunerFeatures {
    pub n: u64,                    // Transform size
    pub batch_size: u64,           // Number of transforms
    pub dimensions: u8,            // 1D, 2D, 3D
    pub is_real: bool,             // R2C vs C2C
    pub is_power_of_two: bool,     // Fast path available
    pub largest_prime_factor: u32, // For mixed-radix planning
}
```
### 3C. Pure Rust Implementation
#### CPU FFT (Stockham Auto-Sort)
```rust
pub struct FftPlan {
    pub n: usize,
    pub direction: FftDirection,
    pub twiddles: Vec<Complex<f32>>, // Precomputed in f64, stored as f32
    pub radix_sequence: Vec<usize>,  // Mixed-radix decomposition
}

pub enum FftDirection { Forward, Inverse }

pub trait Fft {
    fn fft_1d(&self, input: &[Complex<f32>], output: &mut [Complex<f32>]);
    fn ifft_1d(&self, input: &[Complex<f32>], output: &mut [Complex<f32>]);
    fn fft_r2c(&self, input: &[f32], output: &mut [Complex<f32>]);
    fn fft_c2r(&self, input: &[Complex<f32>], output: &mut [f32]);
    fn fft_2d(&self, input: &[Complex<f32>], output: &mut [Complex<f32>],
              nx: usize, ny: usize);
}
```
**SIMD Butterfly**: AVX2 processes 4 complex f32 (8 floats) per instruction:
- Load pairs, multiply by twiddle (FMA), add/subtract for butterfly
- Interleaved real/imaginary format for coalesced access
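The Stockham recursion itself is compact enough to sketch in scalar form, which makes the buffer ping-pong (the property that removes bit-reversal) concrete. This is a minimal radix-2 decimation-in-frequency sketch, independent of the trueno API; `(f64, f64)` tuples stand in for `Complex<f32>`, and only power-of-two sizes are handled.

```rust
// Minimal radix-2 Stockham FFT (decimation in frequency), scalar sketch.
// `eo` tracks which buffer holds live data; buffers swap every stage,
// so no bit-reversal permutation is ever needed. Power-of-two n only.
fn stockham_step(n: usize, s: usize, eo: bool,
                 x: &mut [(f64, f64)], y: &mut [(f64, f64)]) {
    if n == 1 {
        // Odd number of stages: copy back so the result lands in the caller's buffer
        if eo { y[..s].copy_from_slice(&x[..s]); }
        return;
    }
    let m = n / 2;
    for p in 0..m {
        let ang = -2.0 * std::f64::consts::PI * p as f64 / n as f64;
        let w = (ang.cos(), ang.sin()); // twiddle exp(-2*pi*i*p/n)
        for q in 0..s {
            let a = x[q + s * p];
            let b = x[q + s * (p + m)];
            y[q + s * 2 * p] = (a.0 + b.0, a.1 + b.1);
            let d = (a.0 - b.0, a.1 - b.1);
            y[q + s * (2 * p + 1)] = (d.0 * w.0 - d.1 * w.1,
                                      d.0 * w.1 + d.1 * w.0);
        }
    }
    stockham_step(m, 2 * s, !eo, y, x); // ping-pong: buffers swap roles
}

pub fn fft_stockham(x: &mut [(f64, f64)]) {
    let mut work = vec![(0.0, 0.0); x.len()];
    stockham_step(x.len(), 1, false, x, &mut work); // result ends in `x`
}
```

The inner `q` loop is where the SIMD (and GPU thread-level) parallelism lives: all `s` butterflies at a given `p` share one twiddle and have unit-stride, coalesced access.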
#### GPU FFT (Stockham with Shared Memory)
Per VkFFT (IEEE 2023) design:
```rust
pub struct StockhamFFTKernel {
    pub n: u32,
    pub radix: u32,            // 2, 4, 8 (register-level)
    pub threads_per_block: u32,
    pub ffts_per_block: u32,   // Multiple small FFTs per block (bank conflict avoidance)
}

impl Kernel for StockhamFFTKernel {
    fn name(&self) -> &str { "fft_stockham" }

    fn build_ptx(&self) -> PtxKernel {
        // Each stage:
        // 1. Load from global/shared into registers (coalesced)
        // 2. Compute radix-N butterfly in registers
        // 3. Multiply by twiddle factors
        // 4. Store to shared memory (with padding for bank conflicts)
        // 5. Barrier sync
        // 6. Next stage reads from shared
        //
        // Final stage stores to global memory
        //
        // Key optimizations:
        // - Radix-4/8 reduces shared memory round-trips
        // - Multiple FFTs per block amortizes sync overhead
        // - FP64 twiddle precomputation for accuracy (VkFFT approach)
        PtxKernel::new("fft_stockham")
            .param(PtxType::U64, "input_ptr")
            .param(PtxType::U64, "output_ptr")
            .param(PtxType::U64, "twiddle_ptr")
            .param(PtxType::U32, "n")
            .shared_memory(8192) // Stage ping-pong buffers
            .build(|ctx| {
                // Stockham stages + twiddle multiply
                // (full PTX generation via builder extension traits)
            })
    }
}
```
**Non-Power-of-Two Support**: Bluestein's chirp-z transform:
1. Zero-pad to next power-of-two
2. Convolve with chirp sequence (3 power-of-two FFTs)
3. Extract N-point result
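The planning side of Bluestein's transform is simple enough to pin down. A sketch of the two ingredients, with illustrative names: the padded length (first power of two that fits the length-(2N-1) linear convolution) and the chirp sequence `w_k = exp(-iπk²/N)`, with `k²` reduced mod `2N` to avoid integer overflow (valid because the exponent has period `2N` in `k²`).

```rust
// Bluestein planning sketch: padded FFT length and chirp sequence.
// Names are illustrative; complex values as (re, im) tuples.
pub fn bluestein_fft_len(n: usize) -> usize {
    // Linear convolution of length-n sequences needs >= 2n - 1 points
    (2 * n - 1).next_power_of_two()
}

pub fn chirp(n: usize) -> Vec<(f64, f64)> {
    // w[k] = exp(-i * pi * k^2 / n); k^2 reduced mod 2n since
    // exp(-i*pi*(k^2 + 2n)/n) = exp(-i*pi*k^2/n)
    (0..n).map(|k| {
        let ang = -std::f64::consts::PI * ((k * k) % (2 * n)) as f64 / n as f64;
        (ang.cos(), ang.sin())
    }).collect()
}
```

The three power-of-two FFTs in step 2 all run at `bluestein_fft_len(n)`, so an arbitrary-N plan reuses the power-of-two Stockham kernels unchanged.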
### 3D. World-Class Algorithmic Foundation
**Primary Algorithm**: Stockham Auto-Sort FFT
- No bit-reversal permutation (unlike Cooley-Tukey)
- Ping-pong buffers give natural coalesced access
- Mixed-radix decomposition: N = r1 × r2 × ... × rk for small primes
**Twiddle Factor Precision**: VkFFT approach
- Precompute in FP128 (or FP64), store as FP32
- More precise than cuFFT (IEEE 2023 paper demonstrates this)
**Non-Power-of-Two**: Rader's algorithm (prime N) + Bluestein's (arbitrary N)
**Reference Papers**:
1. Stockham, "High-speed convolution and correlation", AFIPS 1966
2. VkFFT, "Efficient Vulkan/CUDA/HIP/OpenCL/Metal FFT library", IEEE 2023
3. TurboFNO, arXiv:2504.11681 (fused FFT-GEMM-iFFT, April 2025)
4. Mixed-Precision FFT, arXiv:2508.10202 (Pareto-optimal precision, 2025)
---
## 4. trueno-solve: Dense & Sparse Solvers
### 4A. Provable-Contracts Design
**Contract**: `solve-lu-v1.yaml`
```yaml
metadata:
  version: "1.0.0"
  author: "Sovereign AI Stack"
  description: "LU factorization with partial pivoting"
  references:
    - "Golub & Van Loan, 'Matrix Computations', 4th ed., 2013"
    - "High Accuracy Low Precision QR, arXiv:1912.05508"
equations:
  lu_factorization:
    formula: "PA = LU where L lower triangular (unit diagonal), U upper triangular, P permutation"
    domain: "A ∈ ℝ^{n×n}"
    codomain: "L ∈ ℝ^{n×n}, U ∈ ℝ^{n×n}, P ∈ {0,1}^{n×n}"
    invariants:
      - "PA = LU within backward error"
      - "L is unit lower triangular: L_{ii} = 1, L_{ij} = 0 for j > i"
      - "U is upper triangular: U_{ij} = 0 for j < i"
      - "P is a permutation matrix"
proof_obligations:
  - type: invariant
    property: "Factorization backward error"
    formal: "||PA - LU||_F / ||A||_F < n · u"
    tolerance: "n * f32::EPSILON"
    applies_to: all
  - type: invariant
    property: "Triangular structure"
    formal: "L[i][j] == 0 for j > i && U[i][j] == 0 for j < i"
    applies_to: all
  - type: invariant
    property: "Solution accuracy for Ax=b"
    formal: "||Ax_computed - b||_inf / (||A||_inf · ||x||_inf) < κ(A) · n · u"
    applies_to: all
  - type: equivalence
    property: "GPU matches CPU reference"
    tolerance: 32.0
    applies_to: ptx
```
**Contract**: `solve-qr-v1.yaml`
```yaml
equations:
  qr_factorization:
    formula: "A = QR where Q orthogonal (Q^T Q = I), R upper triangular"
    domain: "A ∈ ℝ^{m×n}, m ≥ n"
    codomain: "Q ∈ ℝ^{m×m}, R ∈ ℝ^{m×n}"
    invariants:
      - "Q^T Q = I (orthogonality)"
      - "||A - QR||_F / ||A||_F < √n · u"
      - "R is upper triangular"
proof_obligations:
  - type: invariant
    property: "Orthogonality"
    formal: "||Q^T Q - I||_F < √n · u"
    tolerance: "sqrt(n) * f32::EPSILON"
    applies_to: all
  - type: invariant
    property: "Factorization accuracy"
    formal: "||A - QR||_F / ||A||_F < √n · u"
    tolerance: "sqrt(n) * f32::EPSILON"
    applies_to: all
```
**Contract**: `solve-svd-v1.yaml`
```yaml
equations:
  svd:
    formula: "A = UΣV^T where U,V orthogonal, Σ diagonal with σ_1 ≥ σ_2 ≥ ... ≥ σ_min(m,n) ≥ 0"
    domain: "A ∈ ℝ^{m×n}"
    codomain: "U ∈ ℝ^{m×m}, Σ ∈ ℝ^{m×n}, V ∈ ℝ^{n×n}"
    invariants:
      - "U^T U = I, V^T V = I"
      - "Σ diagonal with non-negative decreasing entries"
      - "||A - UΣV^T||_F / ||A||_F < √n · u"
proof_obligations:
  - type: invariant
    property: "Orthogonality of U"
    formal: "||U^T U - I||_F < √m · u"
    tolerance: "sqrt(m) * f32::EPSILON"
    applies_to: all
  - type: invariant
    property: "Orthogonality of V"
    formal: "||V^T V - I||_F < √n · u"
    tolerance: "sqrt(n) * f32::EPSILON"
    applies_to: all
  - type: ordering
    property: "Singular values non-increasing"
    formal: "σ_i ≥ σ_{i+1} for all i"
    applies_to: all
  - type: bound
    property: "Singular values non-negative"
    formal: "σ_i ≥ 0 for all i"
    applies_to: all
  - type: invariant
    property: "Reconstruction accuracy"
    formal: "||A - UΣV^T||_F / ||A||_F < √min(m,n) · u"
    tolerance: "sqrt(min(m,n)) * f32::EPSILON"
    applies_to: all
```
### 4B. Brick Profiling Design
```rust
pub enum BrickId {
    // ... existing ...
    LUFactorize,
    LUSolve,
    QRFactorize,
    QRSolve,
    SVDCompute,
    CholeskyFactorize,
    CholeskySolve,
    EigenSymmetric,
    TriangularSolve,
}

pub enum BrickCategory {
    // ... existing ...
    Solver,
}
```
**Roofline Classification**:
- Panel factorization (LU/QR column operations): **memory-bound**
- Trailing matrix update (GEMM): **compute-bound**
- SVD (Jacobi rotations): **compute-bound** for small matrices
- Triangular solve: **memory-bound**
**Tuner Features**:
```rust
pub struct SolverTunerFeatures {
    pub n: u64,                         // Matrix dimension
    pub m: u64,                         // Rows (for rectangular)
    pub algorithm: SolverAlgorithm,     // LU, QR, SVD, Cholesky
    pub is_positive_definite: bool,     // Can use Cholesky
    pub is_symmetric: bool,             // Can use symmetric eigensolvers
    pub condition_number_estimate: f64, // For algorithm selection
    pub batch_size: u64,                // For batched solvers
}
```
### 4C. Pure Rust Implementation
#### Block LU Factorization
```rust
pub trait Solver {
    /// LU factorization with partial pivoting: PA = LU
    fn lu_factorize(&self, a: &mut [f32], n: usize, pivot: &mut [i32]) -> SolverResult<()>;

    /// Solve Ax = b using LU factors
    fn lu_solve(&self, lu: &[f32], n: usize, pivot: &[i32], b: &mut [f32]) -> SolverResult<()>;

    /// QR factorization via Householder reflections: A = QR
    fn qr_factorize(&self, a: &mut [f32], m: usize, n: usize,
                    tau: &mut [f32]) -> SolverResult<()>;

    /// SVD: A = U Σ V^T
    fn svd(&self, a: &mut [f32], m: usize, n: usize,
           u: &mut [f32], s: &mut [f32], vt: &mut [f32]) -> SolverResult<()>;

    /// Cholesky: A = LL^T for positive definite A
    fn cholesky(&self, a: &mut [f32], n: usize) -> SolverResult<()>;
}
```
**GPU Strategy**: Block hybrid approach
1. **Panel factorization**: Small matrix operations on CPU (latency-sensitive)
2. **Trailing matrix update**: GEMM on GPU (compute-intensive)
3. **Look-ahead**: Overlap next panel with current update
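The CPU panel step in the scheme above reduces to unblocked LU with partial pivoting, which also serves as the correctness oracle for the GPU path. A minimal standalone sketch (f64 for clarity; the function names echo the `Solver` trait but the `usize` pivot array and error type are illustrative):

```rust
// Unblocked LU with partial pivoting; `a` is row-major n x n, L and U
// overwrite `a` (L's unit diagonal implicit). Illustrative sketch only.
pub fn lu_factorize(a: &mut [f64], n: usize, pivot: &mut [usize])
    -> Result<(), &'static str>
{
    for k in 0..n {
        // Partial pivoting: largest |a[i][k]| for i >= k
        let p = (k..n).max_by(|&i, &j| a[i * n + k].abs()
                .partial_cmp(&a[j * n + k].abs()).unwrap()).unwrap();
        if a[p * n + k] == 0.0 { return Err("singular matrix"); }
        pivot[k] = p;
        if p != k {
            for j in 0..n { a.swap(k * n + j, p * n + j); }
        }
        for i in (k + 1)..n {
            let m = a[i * n + k] / a[k * n + k];
            a[i * n + k] = m;                     // store L multiplier
            for j in (k + 1)..n {
                a[i * n + j] -= m * a[k * n + j]; // Schur complement update
            }
        }
    }
    Ok(())
}

pub fn lu_solve(lu: &[f64], n: usize, pivot: &[usize], b: &mut [f64]) {
    for k in 0..n { b.swap(k, pivot[k]); }        // apply P to b
    for i in 0..n {                               // forward: L y = P b
        for j in 0..i { b[i] -= lu[i * n + j] * b[j]; }
    }
    for i in (0..n).rev() {                       // backward: U x = y
        for j in (i + 1)..n { b[i] -= lu[i * n + j] * b[j]; }
        b[i] /= lu[i * n + i];
    }
}
```

In the blocked version, the inner double loop over `(i, j)` becomes the trailing GEMM that moves to the GPU.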
For SVD, use one-sided Jacobi (arXiv:1707.05141), which is inherently parallel: each rotation is independent. For large matrices, the divide-and-conquer approach (arXiv:2508.11467) achieves up to 1293x over rocSOLVER.
### 4D. World-Class Algorithmic Foundation
**LU**: Block LU with partial pivoting (Golub & Van Loan Ch. 3)
- Panel factorization + Schur complement update
- GPU: trailing update is GEMM (90%+ of compute)
**QR**: Householder with block reflectors (arXiv:1912.05508)
- TensorCore-accelerated: 2.9-14.7x speedup for tall-skinny matrices
- Block size tuned to shared memory capacity
**SVD**: Hierarchically blocked one-sided Jacobi (arXiv:1707.05141)
- Inherently parallel: each Jacobi rotation independent
- Maps to GPU memory hierarchy: warp-level → block-level → grid-level
- For large matrices: bidiagonal divide-and-conquer (arXiv:2508.11467)
- Randomized SVD for approximate low-rank (arXiv:2110.03423)
**Cholesky**: Block Cholesky with GEMM-heavy updates
- Positive definite only; ~2x faster than LU
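The unblocked Cholesky reference is short enough to sketch outright; the blocked variant replaces the inner dot products with GEMM on the trailing block. Sketch only (f64, lower triangle overwrites `a`, names illustrative):

```rust
// Unblocked Cholesky: A = L L^T for SPD input; row-major n x n, the
// lower triangle of `a` is overwritten with L. Illustrative sketch.
pub fn cholesky(a: &mut [f64], n: usize) -> Result<(), &'static str> {
    for j in 0..n {
        let mut d = a[j * n + j];
        for k in 0..j { d -= a[j * n + k] * a[j * n + k]; }
        if d <= 0.0 { return Err("matrix not positive definite"); }
        let ljj = d.sqrt();
        a[j * n + j] = ljj;
        for i in (j + 1)..n {
            let mut s = a[i * n + j];
            for k in 0..j { s -= a[i * n + k] * a[j * n + k]; }
            a[i * n + j] = s / ljj;
        }
    }
    Ok(())
}
```

The `d <= 0.0` check doubles as the cheapest positive-definiteness test, which is how the tuner's `is_positive_definite` flag can be validated at runtime.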
**Reference Papers**:
1. Golub & Van Loan, "Matrix Computations", 4th ed., 2013
2. Efficient Batch SVD on GPUs, arXiv:2601.17979 (January 2026)
3. GPU Divide-and-Conquer SVD, arXiv:2508.11467 (1293x over rocSOLVER)
4. High Accuracy Low Precision QR, arXiv:1912.05508 (TensorCore QR)
5. Batched QR and SVD on GPUs, arXiv:1707.05141 (one-sided Jacobi)
6. Performant Unified GPU SVD Kernels, arXiv:2508.06339 (Julia-based)
---
## 5. trueno-image: GPU Image Processing
### 5A. Provable-Contracts Design
**Contract**: `image-conv2d-v1.yaml`
```yaml
metadata:
  version: "1.0.0"
  author: "Sovereign AI Stack"
  description: "2D image convolution (separable and non-separable)"
  references:
    - "Gonzalez & Woods, 'Digital Image Processing', 4th ed."
    - "Parallel Canny Edge Detection, PLOS One 2024"
equations:
  conv2d:
    formula: "O(i,j) = Σ_p Σ_q I(i-p, j-q) · K(p, q)"
    domain: "I ∈ ℝ^{H×W}, K ∈ ℝ^{kh×kw}"
    codomain: "O ∈ ℝ^{H×W} (same-padding)"
    invariants:
      - "Output dimensions match input (with same-padding)"
      - "Separable kernel: K = v · h^T gives same result as 2D convolution"
      - "Commutativity: I * K = K * I (true convolution; cross-correlation requires flipping K)"
  separable_conv2d:
    formula: "O = (I * v) * h^T where K = v · h^T"
    domain: "I ∈ ℝ^{H×W}, v ∈ ℝ^{kh}, h ∈ ℝ^{kw}"
    codomain: "O ∈ ℝ^{H×W}"
    invariants:
      - "Result matches non-separable convolution within tolerance"
      - "Complexity: O(H·W·(kh+kw)) vs O(H·W·kh·kw)"
proof_obligations:
  - type: equivalence
    property: "Separable matches non-separable"
    formal: "|conv2d(I, v*h^T) - separable_conv2d(I, v, h)| < ε"
    tolerance: 1.0e-5
    applies_to: all
  - type: invariant
    property: "Gaussian blur is symmetric"
    formal: "blur(I, σ) has same result for any axis ordering"
    applies_to: all
  - type: equivalence
    property: "GPU matches CPU"
    tolerance: 4.0
    applies_to: ptx
falsification_tests:
  - id: FALSIFY-IMG-001
    rule: "Separable equivalence"
    prediction: "Separable Gaussian matches 2D Gaussian within tolerance"
    test: "proptest with random images and Gaussian kernels"
    if_fails: "Separable decomposition or border handling error"
  - id: FALSIFY-IMG-002
    rule: "Identity kernel"
    prediction: "Convolution with delta kernel reproduces input exactly"
    test: "conv2d(I, [[0,0,0],[0,1,0],[0,0,0]]) == I"
    if_fails: "Off-by-one in kernel centering"
  - id: FALSIFY-IMG-003
    rule: "Canny edge detection"
    prediction: "Edges are 1-pixel wide (non-maximum suppression property)"
    test: "Verify no two adjacent pixels are both marked as edges along gradient direction"
    if_fails: "NMS implementation error"
```
### 5B. Brick Profiling Design
```rust
pub enum BrickId {
    // ... existing ...
    Conv2D,
    GaussianBlur,
    SobelFilter,
    Canny,
    Morphology,
    Histogram,
    Resize,
    ColorConvert,
}

pub enum BrickCategory {
    // ... existing ...
    ImageProcessing,
}
```
### 5C. Pure Rust Implementation
```rust
pub trait ImageOps {
    // Filtering
    fn conv2d(&self, image: &ImageBuf, kernel: &[f32], kw: usize, kh: usize,
              border: BorderMode) -> ImageBuf;
    fn gaussian_blur(&self, image: &ImageBuf, sigma: f32) -> ImageBuf;
    fn sobel(&self, image: &ImageBuf) -> (ImageBuf, ImageBuf); // (gx, gy)
    fn canny(&self, image: &ImageBuf, low: f32, high: f32) -> ImageBuf;

    // Morphology
    fn dilate(&self, image: &ImageBuf, element: &StructElement) -> ImageBuf;
    fn erode(&self, image: &ImageBuf, element: &StructElement) -> ImageBuf;

    // Geometry
    fn resize(&self, image: &ImageBuf, new_w: usize, new_h: usize,
              interp: Interpolation) -> ImageBuf;

    // Color
    fn rgb_to_gray(&self, image: &ImageBuf) -> ImageBuf;
    fn rgb_to_hsv(&self, image: &ImageBuf) -> ImageBuf;

    // Analysis
    fn histogram(&self, image: &ImageBuf, bins: usize) -> Vec<u32>;
    fn connected_components(&self, binary: &ImageBuf) -> (ImageBuf, u32);
}

pub struct ImageBuf {
    pub data: Vec<u8>, // or Vec<f32>
    pub width: usize,
    pub height: usize,
    pub channels: usize,
    pub dtype: ImageDType,
}

pub enum BorderMode { Replicate, Reflect, Zero, Wrap }
pub enum Interpolation { Nearest, Bilinear, Bicubic, Lanczos }
```
**GPU Strategy**: Each operation is a separate PTX kernel:
- **Conv2D**: Shared memory tiling with halo (border elements loaded from global)
- **Gaussian blur**: Two 1D passes (separable) for O(kh+kw) instead of O(kh*kw)
- **Sobel**: 3x3 kernel over a shared memory tile, 8 neighbor reads per pixel
- **Histogram**: Per-block shared memory histogram + global atomicAdd reduction
- **Resize**: Output-centric parallelism (each thread computes one output pixel)
### 5D. World-Class Algorithmic Foundation
**Canny Pipeline** (GPU, per PLOS One 2024):
1. Gaussian blur (separable, shared memory)
2. Sobel gradient magnitude + direction
3. Non-maximum suppression (per-pixel, gradient direction lookup)
4. Hysteresis thresholding (parallel connected component propagation)
**Separable Convolution**: Two 1D passes — first horizontal (row-parallel), then vertical (column-parallel). Each pass uses shared memory tiling with halo regions. This is the standard approach used by NPP internally.
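A scalar CPU sketch of the two 1D passes makes the separable claim testable (this mirrors FALSIFY-IMG-001). Zero border, odd kernel lengths, centered taps; function names and the raw-slice image layout are illustrative, not the `ImageOps` API.

```rust
// Separable filtering sketch (zero border, odd kernel, centered taps).
// Two 1D passes over a row-major width x height f32 image; applying
// `h` horizontally then `v` vertically equals filtering with v * h^T.
pub fn filter_1d_h(img: &[f32], width: usize, height: usize, k: &[f32]) -> Vec<f32> {
    let r = k.len() / 2;
    let mut out = vec![0.0; img.len()];
    for y in 0..height {
        for x in 0..width {
            let mut acc = 0.0;
            for (t, &kv) in k.iter().enumerate() {
                let xi = x as isize + t as isize - r as isize;
                if xi >= 0 && (xi as usize) < width {       // zero border
                    acc += img[y * width + xi as usize] * kv;
                }
            }
            out[y * width + x] = acc;
        }
    }
    out
}

pub fn filter_1d_v(img: &[f32], width: usize, height: usize, k: &[f32]) -> Vec<f32> {
    let r = k.len() / 2;
    let mut out = vec![0.0; img.len()];
    for y in 0..height {
        for x in 0..width {
            let mut acc = 0.0;
            for (t, &kv) in k.iter().enumerate() {
                let yi = y as isize + t as isize - r as isize;
                if yi >= 0 && (yi as usize) < height {      // zero border
                    acc += img[yi as usize * width + x] * kv;
                }
            }
            out[y * width + x] = acc;
        }
    }
    out
}
```

On the GPU, each pass maps to one kernel launch with a shared memory tile plus halo; the 1D halo is what makes the separable version cheaper than the 2D tile.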
**Reference Papers**:
1. Parallel Canny Edge Detection, PLOS One 2024 (1.21x over CUDA Canny)
2. cubic: CUDA-accelerated Bioimage Computing, arXiv:2510.14143
---
## 6. trueno-rand: Parallel Random Number Generation
### 6A. Provable-Contracts Design
**Contract**: `rand-philox-v1.yaml`
```yaml
metadata:
  version: "1.0.0"
  author: "Sovereign AI Stack"
  description: "Philox 4x32-10 counter-based PRNG for GPU"
  references:
    - "Salmon et al., 'Parallel Random Numbers: As Easy as 1, 2, 3', SC 2011"
equations:
  philox4x32:
    formula: "output = philox_round^10(counter, key) where each round applies Feistel network"
    domain: "counter ∈ ℤ^4 (128-bit), key ∈ ℤ^2 (64-bit)"
    codomain: "output ∈ {0,1}^128 (4 × 32-bit random values)"
    invariants:
      - "Deterministic: same (counter, key) always produces same output"
      - "Period: 2^128 (counter space)"
      - "Statistical quality: passes BigCrush (TestU01)"
proof_obligations:
  - type: invariant
    property: "Determinism"
    formal: "philox(c, k) == philox(c, k) always"
    applies_to: all
  - type: invariant
    property: "Counter independence"
    formal: "philox(c1, k) and philox(c2, k) are statistically independent for c1 ≠ c2"
    applies_to: all
  - type: equivalence
    property: "GPU matches CPU bit-exact"
    tolerance: 0.0
    applies_to: simd
falsification_tests:
  - id: FALSIFY-RNG-001
    rule: "Determinism"
    prediction: "Same counter+key always produces same output"
    test: "Generate twice with same state, compare bit-exact"
    if_fails: "Uninitialized state or race condition"
  - id: FALSIFY-RNG-002
    rule: "Uniformity"
    prediction: "Chi-squared test passes for bucket distribution"
    test: "Generate 1M samples, chi-squared goodness-of-fit"
    if_fails: "Round function is biased"
  - id: FALSIFY-RNG-003
    rule: "GPU-CPU bit-exactness"
    prediction: "GPU and CPU produce identical bits"
    test: "Generate 10K values on GPU and CPU, compare bit-exact"
    if_fails: "Integer arithmetic divergence (should be impossible for Philox)"
kani_harnesses:
  - id: KANI-RNG-001
    obligation: "Feistel round invertibility"
    property: "Each round is bijective (no information loss)"
    bound: 4
    strategy: exhaustive
    harness: verify_feistel_bijective
```
### 6B-C. Implementation
**Philox 4x32-10** (Salmon et al., SC 2011):
- Counter-based: embarrassingly parallel (no state sharing between threads)
- Each GPU thread computes `philox(global_thread_id + offset, key)` for 4 random u32s
- 10 rounds of Feistel network using widening multiply
- Passes BigCrush statistical test suite
```rust
pub trait GpuRng {
    fn fill_uniform_f32(&self, output: &mut GpuBuffer<f32>, seed: u64, offset: u64);
    fn fill_normal_f32(&self, output: &mut GpuBuffer<f32>, seed: u64, mean: f32, std: f32);
    fn fill_uniform_u32(&self, output: &mut GpuBuffer<u32>, seed: u64, offset: u64);
}
```
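The round structure described above can be made concrete with a scalar reference sketch (constants from Salmon et al., SC 2011; this sketches the CPU reference path, not the production PTX kernel):

```rust
// Scalar reference sketch of Philox 4x32-10 (constants from Salmon et al.).
// Counter-based: the entire state is the (counter, key) input, so every GPU
// thread can evaluate its own counter with no shared state.
const M0: u64 = 0xD251_1F53; // round multipliers
const M1: u64 = 0xCD9E_8D57;
const W0: u32 = 0x9E37_79B9; // Weyl key increments
const W1: u32 = 0xBB67_AE85;

fn philox4x32_10(mut ctr: [u32; 4], mut key: [u32; 2]) -> [u32; 4] {
    for _ in 0..10 {
        // One Feistel-style round built on two 32x32 -> 64 widening multiplies.
        let p0 = M0 * ctr[0] as u64;
        let p1 = M1 * ctr[2] as u64;
        ctr = [
            (p1 >> 32) as u32 ^ ctr[1] ^ key[0],
            p1 as u32,
            (p0 >> 32) as u32 ^ ctr[3] ^ key[1],
            p0 as u32,
        ];
        // Bump the key between rounds (Weyl sequence).
        key[0] = key[0].wrapping_add(W0);
        key[1] = key[1].wrapping_add(W1);
    }
    ctr
}
```

Because each round is a bijection on the counter for a fixed key, distinct counters always yield distinct outputs, which is what makes the contract's counter-independence obligation plausible to falsify statistically rather than structurally.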
**Normal distribution**: Box-Muller transform applied to pairs of uniform values.
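As a sketch (a hypothetical helper, not part of the `GpuRng` trait), the transform maps two uniforms in (0, 1] to two independent normal samples:

```rust
// Box-Muller sketch (illustrative helper): two uniforms in (0, 1] become two
// independent normal samples, scaled to the requested (mean, std).
fn box_muller(u1: f32, u2: f32, mean: f32, std: f32) -> (f32, f32) {
    let r = (-2.0 * u1.ln()).sqrt();               // radius from the first uniform
    let theta = 2.0 * std::f32::consts::PI * u2;   // angle from the second uniform
    (mean + std * r * theta.cos(), mean + std * r * theta.sin())
}
```

Note the open interval: `u1` must be drawn from (0, 1] rather than [0, 1) so that `ln(u1)` never hits `-inf`.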
### 6D. World-Class Algorithmic Foundation
**Reference**: Salmon et al., "Parallel Random Numbers: As Easy as 1, 2, 3", SC 2011
- Counter-based design: no per-thread state, perfect for GPU
- Philox: 10 rounds of Feistel using 32×32→64 widening multiply
- Threefry: 20 rounds using rotations and XOR (backup for non-multiply hardware)
- Both pass BigCrush and Crush from TestU01
---
## 7. trueno-tensor: N-Dimensional Tensor Contractions
### 7A. Provable-Contracts Design
**Contract**: `tensor-contraction-v1.yaml`
```yaml
metadata:
  version: "1.0.0"
  description: "General tensor contraction (Einstein summation)"
  references:
    - "Springer et al., 'Design of a High-Performance GEMM-like Tensor-Tensor Multiplication', ACM TOMS 2018"
equations:
  contraction:
    formula: "C_{free_C} = α · Σ_{contracted} A_{free_A, contracted} · B_{free_B, contracted} + β · C_{free_C}"
    domain: "A ∈ ℝ^{d1×...×dk}, B ∈ ℝ^{dk+1×...×dn}"
    codomain: "C with dimensions = free indices of A ∪ free indices of B"
    invariants:
      - "Contracted dimensions must match between A and B"
      - "Output dimensions = free_A dimensions × free_B dimensions"
      - "For rank-2 tensors, reduces to GEMM"
proof_obligations:
  - type: invariant
    property: "Reduces to GEMM for matrices"
    formal: "contraction(A[m,k], B[k,n], 'ik,kj->ij') == gemm(A, B)"
    tolerance: 1.0e-5
    applies_to: all
  - type: invariant
    property: "Output shape correctness"
    formal: "output_shape == infer_shape(A_shape, B_shape, contraction_indices)"
    applies_to: all
  - type: associativity
    property: "Contraction order independence"
    formal: "|contract(contract(A,B), C) - contract(A, contract(B,C))| < ε"
    tolerance: 1.0e-3
    applies_to: all
```
### 7C. Pure Rust Implementation
**Strategy**: Reduce all tensor contractions to GEMM via TTGT (Transpose-Transpose-GEMM-Transpose):
1. Reshape/transpose A to 2D: (free_A, contracted)
2. Reshape/transpose B to 2D: (contracted, free_B)
3. Call existing trueno GEMM
4. Reshape/transpose result to output dimensions
This leverages the heavily optimized GEMM already in trueno-gpu (tiled, tensor core variants).
For specialized contractions (e.g., batched outer products, trace), emit custom PTX kernels that avoid the transpose overhead.
```rust
pub fn einsum(
    subscripts: &str, // e.g., "ijk,jkl->il"
    inputs: &[&Tensor],
    output: &mut Tensor,
) -> Result<(), TensorError>;
```
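For intuition, the TTGT reduction can be sketched on the CPU for the pattern `"ijk,kl->ijl"`, where row-major layout makes steps 1, 2, and 4 free, so no transposes are materialized (helper names and shapes are illustrative):

```rust
// Naive row-major GEMM standing in for the optimized trueno kernel.
fn matmul(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut c = vec![0.0; m * n];
    for i in 0..m {
        for p in 0..k {
            for j in 0..n {
                c[i * n + j] += a[i * k + p] * b[p * n + j];
            }
        }
    }
    c
}

// TTGT sketch for einsum "ijk,kl->ijl" (hypothetical helper).
fn contract_ijk_kl(a: &[f32], b: &[f32], (i, j, k): (usize, usize, usize), l: usize) -> Vec<f32> {
    // Steps 1-2: row-major A[i][j][k] is already laid out as a (i*j, k) matrix
    // and B[k][l] as (k, l), so the "transposes" are no-ops here.
    // Step 3: one GEMM of shape (i*j, k) x (k, l).
    matmul(a, b, i * j, k, l)
    // Step 4: the (i*j, l) result is already C[i][j][l] in row-major order.
}
```

For less friendly index patterns (e.g., contracting a leading axis), steps 1 and 4 become real transpose kernels, which is exactly the overhead the specialized PTX paths mentioned below avoid.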
---
## 8. Enhanced Dense BLAS (cuBLAS Parity)
### Missing Operations (Priority Order)
| Operation | cuBLAS Name | Status | Priority |
|---|---|---|---|
| General GEMM | gemm | Exists (tiled, tensor core) | Done |
| Batched GEMM | gemmBatched | Exists (batched_4d) | Done |
| Strided Batched | gemmStridedBatched | Partial | P1 |
| Mixed Precision | gemmEx | Missing | P1 |
| Symmetric Rank-k | syrk/syr2k | Missing | P2 |
| Triangular Solve | trsm | Missing (needed for solvers) | P1 |
| Triangular Multiply | trmm | Missing | P2 |
| Symmetric Multiply | symm/hemm | Missing | P2 |
| Level-2 GEMV | gemv | Exists (CoalescedGemvKernel) | Done |
| Level-1 (dot, nrm2, etc.) | Various | Exists in trueno core | Done |
| Epilogue Fusion | gemmEx with epilogue | Missing | P2 |
**Contract**: `blas-trsm-v1.yaml` — triangular solve is critical for LU/QR/Cholesky solvers.
```yaml
equations:
  trsm:
    formula: "Solve AX = B for X, where A is triangular"
    domain: "A ∈ ℝ^{n×n} (triangular), B ∈ ℝ^{n×nrhs}"
    codomain: "X ∈ ℝ^{n×nrhs}"
    invariants:
      - "AX = B within backward error"
      - "||AX - B||_F / (||A||_F · ||X||_F) < n · u"
```
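The scalar core of TRSM is forward substitution; a minimal reference sketch for the lower-triangular, single right-hand-side case (illustrative, not the trueno-gpu kernel, which would block and batch this):

```rust
// Forward substitution sketch: solve A x = b for lower-triangular A
// (row-major, n x n, assumed non-singular), one right-hand side.
fn trsm_lower(a: &[f32], b: &[f32], n: usize) -> Vec<f32> {
    let mut x = b.to_vec();
    for i in 0..n {
        for j in 0..i {
            // Subtract the contribution of already-solved unknowns.
            let s = a[i * n + j] * x[j];
            x[i] -= s;
        }
        x[i] /= a[i * n + i]; // diagonal pivot; assumes a[i][i] != 0
    }
    x
}
```

The GPU version blocks this into diagonal-block solves plus GEMM updates on the off-diagonal panels, which is why TRSM sits on the critical path for the LU/QR/Cholesky solvers in Phase 2.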
---
## 9. Enhanced Compression (nvCOMP Parity)
### Current State
trueno has LZ4 CPU + GPU kernels. trueno-zram provides ZRAM integration. Missing:
| Algorithm | nvCOMP | Sovereign Stack | Priority |
|---|---|---|---|
| LZ4 | Yes | Yes (trueno LZ4 kernel) | Done |
| Zstd | Yes | trueno-zram (CPU) | P1: GPU kernel |
| Snappy | Yes | Missing | P3 |
| GDeflate | Yes | Missing | P3 |
| Cascaded | Yes | Missing | P3 |
| ANS | Yes | Missing | P2 |
**Strategy**: Focus on Zstd GPU kernel (most impactful for KV cache compression in realizar) and ANS (used for neural codec compression).
---
## 10. Implementation Roadmap
### Phase 1: Foundation (Months 1-3)
| Priority | Crate | Deliverable | Contract |
|---|---|---|---|
| P0 | trueno-sparse | CSR/COO formats + CPU SIMD SpMV | sparse-spmv-v1.yaml |
| P0 | trueno-sparse | Merge-based GPU SpMV kernel | sparse-spmv-v1.yaml |
| P0 | trueno-gpu | Triangular solve (TRSM) kernel | blas-trsm-v1.yaml |
| P0 | trueno-gpu | Mixed-precision GEMM (FP16 in, FP32 acc) | blas-gemmex-v1.yaml |
### Phase 2: Solvers + FFT (Months 3-6)
| Priority | Crate | Deliverable | Contract |
|---|---|---|---|
| P1 | trueno-solve | LU factorization (block, GPU) | solve-lu-v1.yaml |
| P1 | trueno-solve | QR factorization (Householder, GPU) | solve-qr-v1.yaml |
| P1 | trueno-solve | SVD (one-sided Jacobi, batched) | solve-svd-v1.yaml |
| P1 | trueno-fft | 1D C2C + R2C Stockham FFT (GPU) | fft-stockham-v1.yaml |
| P1 | trueno-fft | 2D FFT via 1D + transpose | fft-2d-v1.yaml |
| P1 | trueno-sparse | SpMM kernel (row-split + merge) | sparse-spmm-v1.yaml |
### Phase 3: Image + Ecosystem (Months 6-9)
| Priority | Crate | Deliverable | Contract |
|---|---|---|---|
| P2 | trueno-image | Separable conv2d + Gaussian blur | image-conv2d-v1.yaml |
| P2 | trueno-image | Sobel + Canny edge detection | image-canny-v1.yaml |
| P2 | trueno-image | Histogram + morphology | image-histogram-v1.yaml |
| P2 | trueno-rand | Philox 4x32-10 GPU RNG | rand-philox-v1.yaml |
| P2 | trueno-tensor | Einstein summation via TTGT | tensor-contraction-v1.yaml |
| P2 | trueno-solve | Cholesky factorization | solve-cholesky-v1.yaml |
### Phase 4: Polish + Parity (Months 9-12)
| Priority | Crate | Deliverable | Contract |
|---|---|---|---|
| P3 | trueno-sparse | SpGEMM (sparse × sparse) | sparse-spgemm-v1.yaml |
| P3 | trueno-sparse | BSR + SELL formats | sparse-formats-v1.yaml |
| P3 | trueno-fft | 3D FFT + batched | fft-3d-v1.yaml |
| P3 | trueno-fft | Bluestein non-power-of-two | fft-bluestein-v1.yaml |
| P3 | trueno-image | Resize (bilinear/bicubic/Lanczos) | image-resize-v1.yaml |
| P3 | trueno-image | Color conversion + connected components | image-color-v1.yaml |
| P3 | trueno-gpu | Zstd GPU compression kernel | comp-zstd-v1.yaml |
---
## 11. Quality Gates (Per PAIML MCP Agent Toolkit Standards)
Every new crate and kernel must pass the following gates before merging to master:
### Code Quality
| Gate | Threshold | Tool |
|---|---|---|
| Test Coverage | >= 95% line | `cargo llvm-cov` |
| Mutation Score | >= 80% kill | `cargo mutants` |
| TDG Grade | >= A- (88+) | `pmat tdg` |
| Cyclomatic Complexity | <= 15 per fn | `pmat complexity` |
| SATD Count | 0 | `pmat satd` |
| Clippy | 0 warnings | `cargo clippy -- -D warnings` |
| Rustfmt | Pass | `cargo fmt --check` |
### Numerical Quality
| Gate | Verification | Tool |
|---|---|---|
| Backward Error | Per-contract tolerance | proptest + contract tests |
| SIMD-Scalar Parity | <= 8 ULP | proptest with ULP comparison |
| GPU-CPU Parity | <= 32 ULP | contract tests with GPU runner |
| Known-Value Tests | Exact match | Hardcoded reference values |
| Property Tests | 10K+ iterations | proptest |
| Kani Verification | All harnesses pass | `cargo kani` |
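The SIMD-scalar and GPU-CPU parity gates rely on a ULP-distance metric, sketched here (a hypothetical helper; the monotone bit-pattern mapping is the standard trick for comparing floats in ULPs):

```rust
// ULP distance sketch: map f32 bit patterns onto a monotone integer line so
// that adjacent representable floats differ by exactly 1, then take |a - b|.
fn ulp_distance(a: f32, b: f32) -> u64 {
    fn ordered(x: f32) -> i64 {
        let bits = x.to_bits() as i32;
        // Flip negative floats so integer order matches float order
        // (also maps -0.0 and +0.0 to the same point).
        (if bits < 0 { i32::MIN - bits } else { bits }) as i64
    }
    ordered(a).abs_diff(ordered(b))
}
```

A SIMD kernel passes the gate when `ulp_distance(simd, scalar) <= 8` for every output element; the GPU gate uses the looser 32-ULP bound from the table above.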
### Performance Quality
| Gate | Threshold | Tool |
|---|---|---|
| Roofline Efficiency | >= 50% of theoretical | BrickProfiler + roofline model |
| Budget Compliance | Meet contract budget | ComputeBrick.verify() |
| No Performance Regression | < 5% degradation | Criterion.rs + CI baseline |
### Provable-Contracts Pipeline
For each new operation:
1. **Phase 1 (Extract)**: Identify paper, extract equations
2. **Phase 2 (Specify)**: Write YAML contract with obligations + falsification tests
3. **Phase 3 (Scaffold)**: Generate trait + test stubs via `pv scaffold`
4. **Phase 4 (Implement)**: Scalar reference first, then SIMD, then GPU
5. **Phase 5 (Falsify)**: Property-based tests via `pv probar`
6. **Phase 6 (Verify)**: Kani bounded model checking via `pv verify`
7. **Phase 7 (Prove)**: Lean 4 proofs for critical properties (optional)
---
## 12. Integration with Sovereign Stack Components
### Batuta Oracle Integration Points
Per oracle recommendations:
- **trueno** is the compute foundation (85-95% confidence for linear algebra)
- **provable-contracts** validates correctness (95% confidence for verification)
- **pepita** provides kernel-level I/O primitives (io_uring, ublk)
- **repartir** distributes multi-GPU work (CPU/GPU executors, checkpointing)
### Downstream Consumer Benefits
| Consumer | New Capability Unlocked |
|---|---|
| **aprender** | PCA via SVD, spectral clustering via eigendecomp, graph Laplacian via sparse |
| **entrenar** | FFT-based convolution layers, sparse attention, QR-based optimizer |
| **realizar** | Sparse attention patterns, FFT for audio models, image preprocessing |
| **trueno-rag** | Sparse retrieval indices, FFT for audio chunking |
| **rmedia** | GPU image processing pipeline (resize, filter, color convert) |
| **simular** | FFT for physics simulation, sparse systems for FEM |
### Cross-Crate Contract Binding
Each new trueno capability gets a binding entry in `provable-contracts/contracts/trueno/binding.yaml`:
```yaml
version: "1.0.0"
target_crate: trueno
bindings:
  - contract: sparse-spmv-v1.yaml
    equation: spmv
    module_path: "trueno_sparse::ops::spmv_csr"
    function: spmv_csr
    status: not_implemented
  - contract: fft-stockham-v1.yaml
    equation: dft
    module_path: "trueno_fft::fft_1d"
    function: fft_1d
    status: not_implemented
```
---
## 13. Competitive Positioning
### Sovereign Stack Advantages Over CUDA Libraries
| Advantage | Details |
|---|---|
| **No vendor lock-in** | wgpu backend supports Vulkan, Metal, DX12, WebGPU |
| **Pure Rust safety** | Zero unsafe in public API, ownership-based memory safety |
| **Zero external toolchains** | No nvcc, no LLVM, no C/C++ compiler needed |
| **Cross-platform** | Same code runs on CPU SIMD, GPU, and WASM |
| **Formal verification** | provable-contracts with Kani + Lean 4 (CUDA has no equivalent) |
| **ML-based auto-tuning** | BrickTuner with learned kernel selection (CUDA relies on manual selection) |
| **Built-in profiling** | Every operation profiled via Brick with roofline analysis |
| **Cargo ecosystem** | `cargo add trueno-sparse` vs manual CMake + library linking |
### Where CUDA Libraries Still Lead
| Gap | CUDA Advantage | Timeline to Close |
|---|---|---|
| **Maturity** | 15+ years of optimization | 2-3 years for parity |
| **Sparse algorithm breadth** | 20+ SpMV algorithms, preconditioners | 6-12 months for core |
| **FFT performance** | Heavily tuned for each GPU architecture | 6-12 months for competitive |
| **Solver completeness** | Full LAPACK equivalent | 12-18 months |
| **Image processing breadth** | 200+ NPP functions | 12-24 months for core 50 |
| **Multi-GPU** | NCCL, cuBLASMp | Partially covered by repartir |
---
## Appendix A: Batuta Oracle Capability Map
```
┌──────────────────────────────────────────────────────────────┐
│ Sovereign AI Stack │
│ │
│ ┌─────────────────── Orchestration ──────────────────────┐ │
│ │ batuta (workflow) repartir (distributed) pforge (MCP) │ │
│ └────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────── Training & Inference ───────────────────┐ │
│ │ entrenar (train) realizar (infer) whisper-apr (ASR) │ │
│ └────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────── ML Algorithms ─────────────────────┐ │
│ │ aprender (classical ML + neural + text + audio) │ │
│ └────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────── Compute Primitives ─────────────────────┐ │
│ │ trueno (SIMD/GPU) trueno-gpu (PTX) pepita (kernel) │ │
│ │ trueno-db (analytics) trueno-graph trueno-rag │ │
│ │ │ │
│ │ ┌── NEW (This Spec) ─────────────────────────────────┐ │ │
│ │ │ trueno-sparse trueno-fft trueno-solve │ │ │
│ │ │ trueno-image trueno-rand trueno-tensor │ │ │
│ │ └────────────────────────────────────────────────────┘ │ │
│ └────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────── Quality & Profiling ───────────────────┐ │
│ │ pmat (TDG) certeza (coverage) provable-contracts │ │
│ │ probar (property) renacer (strace) batuta (oracle) │ │
│ └────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────── Transpilers ────────────────────────┐ │
│ │ depyler (Py→Rust) decy (C→Rust) ruchy (script→Rust) │ │
│ └────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘
```
## Appendix B: CUDA Library → Sovereign Stack Mapping
| CUDA Library | Sovereign Crate | Status |
|---|---|---|
| cuBLAS Level 1 | trueno (vector_ops) | Complete |
| cuBLAS Level 2 | trueno (GEMV) + trueno-gpu (CoalescedGemvKernel) | ~80% |
| cuBLAS Level 3 | trueno-gpu (GEMM variants, batched) | ~70% |
| cuBLASLt | trueno-gpu (algorithm selection via BrickTuner) | ~50% |
| cuBLASMp | repartir (distributed executor) | ~30% |
| cuSPARSE | **trueno-sparse** (NEW) | 0% → Phase 1 |
| cuSPARSELt | **trueno-sparse** (structured sparsity) | 0% → Phase 3 |
| cuFFT | **trueno-fft** (NEW) | 0% → Phase 2 |
| cuFFTMp | repartir + trueno-fft (distributed) | 0% → Phase 4 |
| cuSOLVER Dense | **trueno-solve** (NEW) | 5% → Phase 2 |
| cuSOLVER Sparse | trueno-solve + trueno-sparse | 0% → Phase 3 |
| cuTENSOR | **trueno-tensor** (NEW) | 0% → Phase 3 |
| NPP | **trueno-image** (NEW) | 0% → Phase 3 |
| nvCOMP | trueno (LZ4) + trueno-zram | ~20% → Phase 4 |
| nvJPEG | **trueno-image** (codec module) | 0% → Phase 4 |
| cuRAND | **trueno-rand** (NEW) | 0% → Phase 3 |
| cuPQC | aprender (CPU crypto) | ~10% → Future |
| MathDx | trueno-gpu (device-side math) | ~40% |
---
*This specification is a living document. Each phase completion triggers a review and update cycle.*
*Generated with guidance from batuta oracle, provable-contracts framework, and arXiv research survey.*