# Multi-Target High-Performance Compute Library

**trueno** (Spanish: "thunder") provides unified compute primitives across CPU SIMD, GPU, and WebAssembly.
## Table of Contents
- Features
- Installation
- Quick Start
- Performance
- trueno-gpu: Pure Rust CUDA
- Training (WGPU)
- Operations
- Development
- Contributing
- License
## Features

- **CPU SIMD**: x86 (SSE2/AVX/AVX2/AVX-512), ARM (NEON), WASM (SIMD128)
- **GPU**: Pure Rust PTX generation via `trueno-gpu` (no nvcc required)
- **Native Blackwell**: sm_121 support via PTX 8.8 ISA
- **JIT Disk Cache**: Compiled PTX cached at `~/.cache/trueno/ptx/` for instant reload
- **Cross-platform GPU**: Vulkan/Metal/DX12/WebGPU via `wgpu`
- **Auto-dispatch**: Runtime selection of optimal backend
- **Zero unsafe in public API**: Safety via type system
## Installation

```toml
[dependencies]
trueno = "0.16"

# Optional: GPU support for large matrices
trueno = { version = "0.16", features = ["gpu"] }

# Optional: Pure Rust CUDA PTX generation
trueno-gpu = "0.4"
```
## Quick Start

```rust
use trueno::{batched_matmul_4d, Eigen, Matrix, Vector};

// Vector operations - auto-selects best SIMD backend
let a = Vector::from_slice(&[1.0, 2.0, 3.0, 4.0]);
let b = Vector::from_slice(&[5.0, 6.0, 7.0, 8.0]);

let sum = a.add(&b).unwrap();      // [6.0, 8.0, 10.0, 12.0]
let dot = a.dot(&b).unwrap();      // 70.0
let activated = a.relu().unwrap(); // ReLU activation

// Matrix operations
let m = Matrix::from_vec(2, 2, vec![1.0, 2.0, 3.0, 4.0]).unwrap();
let product = m.matmul(&m).unwrap(); // Matrix multiplication
let transposed = m.transpose();      // Transpose

// Batched matmul for transformers (Q @ K^T pattern)
let batch = 2; let heads = 4; let seq = 8; let dim = 64;
let q: Vec<f32> = vec![0.0; batch * heads * seq * dim]; // [batch, heads, seq, dim]
let kt: Vec<f32> = vec![0.0; batch * heads * dim * seq]; // [batch, heads, dim, seq]
let attn = batched_matmul_4d(&q, &kt, batch, heads, seq, dim).unwrap();

// Eigendecomposition (PCA, spectral analysis)
let cov = Matrix::from_vec(2, 2, vec![3.0, 1.0, 1.0, 3.0]).unwrap();
let eigen = Eigen::new(&cov).unwrap();
let eigenvalues = eigen.eigenvalues; // [4.0, 2.0]
```
## Performance
| Operation | SIMD Speedup | Notes |
|---|---|---|
| Dot product | 6-17x | AVX-512 for compute-bound |
| Matrix multiply | 2-10x | GPU for 500x500+ |
| Reductions (sum, max, min) | 3-12x | AVX-512 optimal |
| Element-wise (add, mul) | 1-2x | Memory-bound |
| Convolution 2D | 5-8x | AVX2/AVX-512 optimized |
### Benchmark Results (AMD Ryzen 9 7950X)
| Benchmark | Throughput |
|---|---|
| Vector recip (AVX-512, 10K) | 10.0 Gelem/s |
| Vector recip (AVX2, 10K) | 9.7 Gelem/s |
| PTX module emit | 3.1 µs |
| PTX kernel build | 81 ns |
| Launch config | 1.7 ns |
**GPU Note:** GPU acceleration benefits matrix multiply only. Element-wise operations use CPU SIMD (GPU transfer overhead exceeds compute time).
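The note above can be sanity-checked with a back-of-envelope model. This is a sketch; the bandwidth and FLOP figures are illustrative assumptions, not measured trueno numbers:

```rust
/// Estimated seconds to move `n` f32 elements (2 inputs + 1 output) over a
/// bus with `gbps` effective bandwidth, vs. seconds to add them at `flops`.
fn transfer_vs_compute(n: f64, gbps: f64, flops: f64) -> (f64, f64) {
    ((n * 4.0 * 3.0) / (gbps * 1e9), n / flops)
}

fn main() {
    // Illustrative assumptions: PCIe ~25 GB/s effective, GPU ~10 TFLOP/s.
    let (transfer_s, compute_s) = transfer_vs_compute(10e6, 25.0, 10e12);
    // Moving the data costs orders of magnitude more than the adds themselves.
    assert!(transfer_s > 1000.0 * compute_s);
    println!("transfer: {:.2} ms, compute: {:.4} ms",
             transfer_s * 1e3, compute_s * 1e3);
}
```

For a 10M-element add, the transfer alone costs milliseconds while the arithmetic costs about a microsecond, which is why element-wise ops stay on CPU SIMD.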
## trueno-gpu: Pure Rust CUDA
Generate CUDA PTX kernels without nvcc, LLVM, or external toolchains:
```rust
use trueno_gpu::kernels::{GemmKernel, SoftmaxKernel};

// Generate optimized GEMM kernel (supports sm_121 Blackwell via PTX 8.8)
let gemm = GemmKernel::tensor_core(m, n, k);
let ptx = gemm.emit_ptx(); // Pure Rust PTX generation

// Generate softmax with warp shuffle reduction
let softmax = SoftmaxKernel::new(n);
let ptx = softmax.emit_ptx();

// Available kernels: GEMM, Softmax, LayerNorm, Attention, Quantize (Q4K/Q5K/Q6K)
// Reliable PTX->cubin compilation via fixed cuLinkCreate path
```
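The no-toolchain approach works because PTX is a textual ISA: a kernel is just a string a Rust program can assemble directly. A minimal sketch of the idea (the `PtxModule` type and `emit` method here are hypothetical, not trueno-gpu's actual API):

```rust
// Minimal sketch: emit a valid PTX 8.8 header and an empty kernel entry as a
// plain Rust string. No nvcc, LLVM, or external toolchain involved.
struct PtxModule {
    target: &'static str, // e.g. "sm_121" for Blackwell
}

impl PtxModule {
    fn emit(&self, kernel: &str) -> String {
        format!(
            ".version 8.8\n.target {}\n.address_size 64\n\n\
             .visible .entry {}()\n{{\n    ret;\n}}\n",
            self.target, kernel
        )
    }
}

fn main() {
    let module = PtxModule { target: "sm_121" };
    let ptx = module.emit("noop");
    assert!(ptx.contains(".visible .entry noop()"));
    println!("{ptx}");
}
```

The emitted string can then be handed to the CUDA driver for PTX->cubin compilation at load time.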
## Training (WGPU)
trueno now supports backward pass computation via WGSL compute shaders, enabling neural network training on AMD, Intel Arc, and Apple Silicon GPUs through Vulkan, Metal, DX12, and WebGPU -- no CUDA required.
**7 backward ops implemented:**

- `silu_backward` -- SiLU activation gradient
- `gemm_backward_a` -- weight gradient (dL/dA)
- `gemm_backward_b` -- input gradient (dL/dB)
- `rmsnorm_backward` -- RMSNorm gradient
- `rope_backward` -- rotary position embedding gradient
- `adamw_step` -- AdamW optimizer parameter update
- `nf4_dequant` -- NF4 4-bit dequantization for QLoRA
All 7 shaders verified on AMD Radeon Pro W5700X via Vulkan with 8 FALSIFY contract tests passing.
```rust
use trueno::gpu::GpuDevice;

let dev = GpuDevice::new()?;

// Backward pass: compute SiLU gradient
dev.silu_backward(/* input, upstream gradient, gradient output buffer */)?;

// Optimizer step: AdamW update
dev.adamw_step(/* params, gradients, moment buffers, hyperparameters */)?;
```
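For reference, the scalar math behind two of these shaders can be sketched as follows. This is a sketch of the semantics, not the WGSL source, and the function signatures are illustrative:

```rust
fn sigmoid(x: f32) -> f32 { 1.0 / (1.0 + (-x).exp()) }

// d/dx silu(x) = sigmoid(x) * (1 + x * (1 - sigmoid(x)))
fn silu_backward(x: f32, grad_out: f32) -> f32 {
    let s = sigmoid(x);
    grad_out * s * (1.0 + x * (1.0 - s))
}

// One AdamW step for a single parameter; `m` and `v` are the running
// first/second moments, `t` the 1-based step count, `wd` decoupled weight decay.
fn adamw_step(p: &mut f32, g: f32, m: &mut f32, v: &mut f32,
              lr: f32, beta1: f32, beta2: f32, eps: f32, wd: f32, t: i32) {
    *m = beta1 * *m + (1.0 - beta1) * g;
    *v = beta2 * *v + (1.0 - beta2) * g * g;
    let m_hat = *m / (1.0 - beta1.powi(t));
    let v_hat = *v / (1.0 - beta2.powi(t));
    *p -= lr * (m_hat / (v_hat.sqrt() + eps) + wd * *p);
}

fn main() {
    // Check silu_backward against a central finite difference.
    let (x, h) = (0.7f32, 1e-3f32);
    let silu = |x: f32| x * sigmoid(x);
    let fd = (silu(x + h) - silu(x - h)) / (2.0 * h);
    assert!((silu_backward(x, 1.0) - fd).abs() < 1e-3);

    // A positive gradient should decrease the parameter.
    let (mut p, mut m, mut v) = (1.0f32, 0.0f32, 0.0f32);
    adamw_step(&mut p, 0.5, &mut m, &mut v, 0.01, 0.9, 0.999, 1e-8, 0.01, 1);
    assert!(p < 1.0);
}
```

The FALSIFY contract tests mentioned above check the GPU shaders against exactly this kind of scalar reference.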
## Operations

**Vector:** add, sub, mul, div, dot, sum, min, max, argmin, argmax, norm_l1, norm_l2, normalize, recip, sqrt, abs, clamp

**Activations:** relu, leaky_relu, elu, sigmoid, tanh, gelu, swish, softmax, log_softmax, silu

**Matrix:** matmul, batched_matmul, batched_matmul_4d, transpose, matvec, convolve2d, pooling (max/avg), topk, gather, pad

**Statistics:** mean, variance, stddev, covariance, correlation, zscore

**Eigen:** symmetric eigendecomposition (Jacobi algorithm)

**GPU Kernels:** GEMM (naive/tiled/tensor core), Softmax, LayerNorm, RMSNorm, Attention, GEMV, Quantization
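The Jacobi approach used for eigendecomposition can be sketched for the 2x2 case. This is an illustration of the algorithm, not trueno's implementation:

```rust
/// One Jacobi rotation diagonalizes a symmetric 2x2 matrix [[a, b], [b, d]]
/// exactly, returning its two eigenvalues. For larger symmetric matrices the
/// same rotation is applied to off-diagonal pairs until they are near zero.
fn jacobi_2x2(a: f64, b: f64, d: f64) -> (f64, f64) {
    if b == 0.0 {
        return (a, d); // already diagonal
    }
    // Rotation angle that zeroes the off-diagonal entry.
    let theta = 0.5 * (2.0 * b).atan2(a - d);
    let (s, c) = theta.sin_cos();
    let e1 = c * c * a + 2.0 * s * c * b + s * s * d;
    let e2 = s * s * a - 2.0 * s * c * b + c * c * d;
    (e1, e2)
}

fn main() {
    // Same matrix as the Quick Start covariance example: [[3, 1], [1, 3]].
    let (e1, e2) = jacobi_2x2(3.0, 1.0, 3.0);
    assert!((e1 - 4.0).abs() < 1e-12);
    assert!((e2 - 2.0).abs() < 1e-12);
}
```

Because each rotation works only on one 2x2 sub-problem, the method is simple to vectorize, which is why it suits a SIMD-first library.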
## Development

### Ecosystem
Part of the Pragmatic AI Labs stack:
- trueno-gpu - Pure Rust PTX generation (no nvcc)
- trueno-db - GPU-first analytics database
- trueno-graph - Graph algorithms
- trueno-rag - RAG pipeline
- 🤖 Coursera Hugging Face AI Development Specialization - Build production AI systems with Hugging Face in pure Rust
### Usage

Add trueno to your Cargo.toml:

```toml
[dependencies]
trueno = "0.16"
```

Then use it in your code:

```rust
use trueno::Vector;

let a = Vector::from_slice(&[1.0, 2.0, 3.0]);
let b = Vector::from_slice(&[4.0, 5.0, 6.0]);
let result = a.add(&b).unwrap();
```
The library auto-selects the best SIMD backend at runtime. No configuration needed.
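Runtime backend selection follows the standard Rust feature-detection pattern. A generic sketch of the idea, not trueno's internals:

```rust
/// Plain scalar fallback that works on every target.
fn add_scalar(a: &[f32], b: &[f32], out: &mut [f32]) {
    for i in 0..a.len() {
        out[i] = a[i] + b[i];
    }
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn add_avx2(a: &[f32], b: &[f32], out: &mut [f32]) {
    // A real kernel would use core::arch intrinsics; compiling this loop
    // with AVX2 enabled is enough to illustrate the dispatch.
    for i in 0..a.len() {
        out[i] = a[i] + b[i];
    }
}

/// Public entry point: checks CPU features at runtime and dispatches.
fn add(a: &[f32], b: &[f32], out: &mut [f32]) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was just verified at runtime.
            unsafe { add_avx2(a, b, out) };
            return;
        }
    }
    add_scalar(a, b, out);
}

fn main() {
    let (a, b) = ([1.0f32, 2.0, 3.0, 4.0], [5.0f32, 6.0, 7.0, 8.0]);
    let mut out = [0.0f32; 4];
    add(&a, &b, &mut out);
    assert_eq!(out, [6.0, 8.0, 10.0, 12.0]);
}
```

The feature check runs once per call here; a production library would typically cache the chosen function pointer so dispatch cost is paid only at startup.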
## Contributing

Contributions are welcome. Please ensure:

- All tests pass: `cargo test --all-features`
- Coverage stays above 90%: `make coverage`
- No clippy warnings: `cargo clippy --all-features -- -D warnings`
- Code is formatted: `cargo fmt`
### MSRV
Minimum Supported Rust Version: 1.89
## See Also
- Cookbook — 34 runnable examples
## License
MIT - see LICENSE