Table of Contents
- What is trueno?
- Installation
- Usage
- Features
- Architecture
- Performance
- Quality
- Sovereign AI Stack
- Documentation
- Contributing
- License
What is trueno?
trueno (Spanish: "thunder") is a pure Rust high-performance compute library providing unified primitives across CPU SIMD, GPU, and WebAssembly. It is the foundation layer of the PAIML Sovereign AI Stack, powering tensor operations, quantized inference kernels, and training backward passes for the entire ecosystem.
trueno auto-detects the best available hardware at runtime -- AVX-512 on server CPUs, AVX2 on desktops, NEON on ARM, SIMD128 on WASM -- and dispatches to hand-tuned kernels without configuration. For large matrices, wgpu GPU compute (Vulkan/Metal/DX12/WebGPU) is available as an optional backend. For NVIDIA hardware, trueno-gpu generates CUDA PTX kernels in pure Rust with no external toolchain.
Installation
```toml
[dependencies]
trueno = "0.17"

# Optional: wgpu GPU support for large matrices
trueno = { version = "0.17", features = ["gpu"] }

# Optional: pure Rust CUDA PTX generation
trueno-gpu = "0.4"
```
Usage
Vector and Matrix Operations
```rust
use trueno::{Eigen, Matrix, Vector};

// Vector operations -- auto-selects best SIMD backend
let a = Vector::from_slice(&[1.0, 2.0, 3.0, 4.0]);
let b = Vector::from_slice(&[5.0, 6.0, 7.0, 8.0]);

let sum = a.add(&b).unwrap();       // [6.0, 8.0, 10.0, 12.0]
let dot = a.dot(&b).unwrap();       // 70.0
let activated = a.relu().unwrap();  // ReLU activation

// Matrix operations
let m = Matrix::from_vec(2, 2, vec![1.0, 2.0, 3.0, 4.0]).unwrap();
let product = m.matmul(&m).unwrap(); // Matrix multiplication
let transposed = m.transpose();      // Transpose

// Eigendecomposition (PCA, spectral analysis)
let cov = Matrix::from_vec(2, 2, vec![3.0, 1.0, 1.0, 3.0]).unwrap();
let eigen = Eigen::new(&cov).unwrap();
let eigenvalues = eigen.eigenvalues; // [4.0, 2.0]
```
Batched Matmul for Transformers
```rust
use trueno::Matrix;

// Q @ K^T pattern: batched 4D matmul over [batch, heads, seq, dim]
let batch = 2; let heads = 4; let seq = 8; let dim = 64;
let q: Vec<f32> = vec![0.0; batch * heads * seq * dim];
let kt: Vec<f32> = vec![0.0; batch * heads * dim * seq];
let attn = Matrix::batched_matmul_4d(&q, &kt, batch, heads, seq, dim).unwrap();
```
GPU Training (wgpu Backward Pass)
```rust
use trueno::gpu::GpuDevice;

let dev = GpuDevice::new()?;

// Backward pass: compute SiLU gradient
dev.silu_backward(&input, &grad_output, &mut grad_input)?;

// Optimizer step: AdamW update
dev.adamw_step(&mut params, &grads, &mut m, &mut v, lr, step)?;
```
Pure Rust CUDA PTX (trueno-gpu)
```rust
use trueno_gpu::kernels::{GemmKernel, SoftmaxKernel};

// Generate optimized GEMM kernel (supports sm_121 Blackwell via PTX 8.8)
let gemm = GemmKernel::tensor_core(1024, 1024, 1024);
let ptx = gemm.emit_ptx(); // Pure Rust PTX generation, no nvcc

// Generate softmax with warp shuffle reduction
let softmax = SoftmaxKernel::new(4096);
let ptx = softmax.emit_ptx();
```
Features
BLIS 5-Loop GEMM
Production-grade matrix multiplication implementing the BLIS
(BLAS-like Library Instantiation Software) 5-loop algorithm with
compile-time microkernel specialization via trueno-gemm-codegen.
Cache-oblivious tiling for L1/L2/L3 with architecture-specific
register blocking.
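To make the loop structure concrete, here is a toy sketch of the BLIS-style blocked loop nest: the three outer loops tile N, K, and M into cache-sized panels, and an inner triple loop plays the role of the macro-kernel. This is an illustration of the blocking idea only, not trueno's tuned implementation, and the block sizes here are arbitrary, not tuned for any cache.

```rust
/// Toy cache-blocked GEMM sketch of the BLIS loop nest (row-major,
/// C += A * B). Block sizes are illustrative, not tuned.
const NC: usize = 4; // columns-of-B panel
const KC: usize = 4; // K-dimension panel
const MC: usize = 4; // rows-of-A panel

fn gemm_blocked(m: usize, n: usize, k: usize, a: &[f32], b: &[f32], c: &mut [f32]) {
    for jc in (0..n).step_by(NC) {
        for pc in (0..k).step_by(KC) {
            for ic in (0..m).step_by(MC) {
                // "Macro-kernel": plain triple loop over one block
                for i in ic..(ic + MC).min(m) {
                    for p in pc..(pc + KC).min(k) {
                        let aip = a[i * k + p];
                        for j in jc..(jc + NC).min(n) {
                            c[i * n + j] += aip * b[p * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

The real algorithm additionally packs each panel into contiguous buffers and dispatches to an architecture-specific register-blocked microkernel at the innermost level.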
SIMD Auto-Detection
Runtime detection and dispatch across four ISA families:
| ISA | Instructions | Use Case |
|---|---|---|
| AVX-512 | 512-bit vectors, VNNI | Server CPUs (Zen 4, Sapphire Rapids) |
| AVX2 | 256-bit vectors, FMA | Desktop CPUs (Haswell+) |
| NEON | 128-bit vectors | ARM (Apple Silicon, Graviton, Jetson) |
| SIMD128 | 128-bit vectors | WebAssembly |
No configuration required. trueno probes cpuid at startup and
selects the widest available path.
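The probe-then-dispatch strategy can be approximated with the standard library's feature-detection macros; a minimal sketch (not trueno's actual dispatch code):

```rust
/// Pick the widest SIMD path available at runtime, mirroring the
/// probe-once-then-dispatch idea (illustrative only).
fn select_backend() -> &'static str {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx512f") {
            return "avx512";
        }
        if is_x86_feature_detected!("avx2") {
            return "avx2";
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            return "neon";
        }
    }
    "scalar"
}

fn main() {
    println!("selected backend: {}", select_backend());
}
```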
Quantization Kernels
Fused dequantization + matmul kernels for quantized LLM inference:
| Kernel | Block Size | Bytes/Block | Description |
|---|---|---|---|
| Q4K | 256 | 144 | 4-bit quantization with K-means centroids |
| Q5K | 256 | 176 | 5-bit quantization |
| Q6K | 256 | 210 | 6-bit quantization |
| Q8K | 256 | 292 | 8-bit quantization |
Each kernel has scalar, AVX2, and AVX-512 implementations with parity tests ensuring identical output across all paths.
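The Bytes/Block column translates directly into effective bits per weight and a compression ratio against f32; a quick check using the figures from the table above:

```rust
/// Effective bits per weight for a 256-value quantization block,
/// computed from the Bytes/Block figures in the table above.
fn bits_per_weight(bytes_per_block: usize, block_size: usize) -> f64 {
    (bytes_per_block * 8) as f64 / block_size as f64
}

fn main() {
    for (name, bytes) in [("Q4K", 144), ("Q5K", 176), ("Q6K", 210), ("Q8K", 292)] {
        let bpw = bits_per_weight(bytes, 256);
        let ratio = 32.0 / bpw; // f32 stores 32 bits per weight
        println!("{name}: {bpw:.2} bits/weight, {ratio:.2}x smaller than f32");
    }
}
```

The overhead beyond the raw quantized bits (e.g. 4.5 bits/weight for Q4K rather than 4.0) comes from the per-block scales and offsets stored alongside the packed values.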
wgpu GPU Compute
Cross-platform GPU via WGSL compute shaders (Vulkan, Metal, DX12, WebGPU). Seven backward pass operations for neural network training without CUDA:
- `silu_backward` -- SiLU activation gradient
- `gemm_backward_a` / `gemm_backward_b` -- weight and input gradients
- `rmsnorm_backward` -- RMSNorm gradient
- `rope_backward` -- rotary position embedding gradient
- `adamw_step` -- AdamW optimizer parameter update
- `nf4_dequant` -- NF4 4-bit dequantization for QLoRA
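As a sanity reference, the SiLU gradient those shaders compute has a simple closed form, d/dx [x * sigmoid(x)] = s(x) * (1 + x * (1 - s(x))); a scalar sketch (not the WGSL source):

```rust
/// Scalar reference for the SiLU gradient:
/// d/dx [x * sigmoid(x)] = s(x) * (1 + x * (1 - s(x))).
fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

fn silu_grad(x: f32) -> f32 {
    let s = sigmoid(x);
    s * (1.0 + x * (1.0 - s))
}

fn main() {
    // At x = 0: s = 0.5, so the gradient is 0.5
    println!("silu'(0) = {}", silu_grad(0.0));
}
```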
LZ4/ZSTD Compression
SIMD-accelerated compression integrated into the tensor pipeline, used by the APR v2 model format for compressed tensor storage.
Pure Rust CUDA (trueno-gpu)
Generate CUDA PTX kernels without nvcc, LLVM, or external toolchains.
Supports sm_121 Blackwell via PTX 8.8 ISA. Available kernels: GEMM
(naive/tiled/tensor core), Softmax, LayerNorm, RMSNorm, Attention,
GEMV, Quantization (Q4K/Q5K/Q6K). JIT disk cache at
~/.cache/trueno/ptx/ for instant kernel reload.
Feature Flags
| Feature | Description |
|---|---|
| `gpu` | wgpu GPU compute (Vulkan/Metal/DX12/WebGPU) |
| `gpu-wasm` | WebGPU for WASM targets |
| `parallel` | Rayon parallel iterators |
| `cuda-monitor` | NVIDIA device info via trueno-gpu |
| `cuda` | Full CUDA support via trueno-gpu |
| `ml-tuner` | ML-based kernel selection via aprender RandomForest |
| `execution-graph` | PTX-to-kernel call graph profiling |
| `tui-monitor` | Terminal monitoring dashboard |
| `hardware-detect` | Hardware capability detection and caching |
| `dhat-heap` | Heap profiling via dhat-rs |
Architecture
```
trueno (core library)
|
|-- src/backends/
|   |-- scalar.rs      Portable reference implementations
|   |-- avx2.rs        AVX2 SIMD kernels (256-bit)
|   |-- avx512.rs      AVX-512 SIMD kernels (512-bit)
|   |-- neon.rs        ARM NEON kernels (128-bit)
|   |-- wasm_simd.rs   WASM SIMD128 kernels
|   |-- gpu/           wgpu WGSL compute shaders
|   +-- dispatch.rs    Runtime backend selection
|
|-- src/matrix.rs      Matrix ops, BLIS GEMM, batched matmul
|-- src/vector.rs      Vector ops, activations, reductions
|-- src/quant/         Q4K/Q5K/Q6K/Q8K fused kernels
|-- src/eigen.rs       Symmetric eigendecomposition (Jacobi)
|-- src/conv.rs        2D convolution, pooling
+-- src/tuner.rs       ML-based kernel selection

trueno-gpu (companion crate)
|
+-- Pure Rust PTX generation for NVIDIA GPUs (sm_50 -- sm_121)
```
Performance
| Operation | SIMD Speedup | Notes |
|---|---|---|
| Dot product | 6-17x | AVX-512 for compute-bound |
| Matrix multiply | 2-10x | GPU for 500x500+ |
| Reductions (sum, max, min) | 3-12x | AVX-512 optimal |
| Element-wise (add, mul) | 1-2x | Memory-bound |
| Convolution 2D | 5-8x | AVX2/AVX-512 optimized |
GPU acceleration pays off for large matrix multiplications (roughly 500x500 and up). Element-wise operations stay on CPU SIMD because the cost of transferring data to the GPU exceeds the compute time saved.
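The memory-bound behavior of element-wise ops follows from arithmetic intensity: `c[i] = a[i] + b[i]` on f32 performs one flop per 12 bytes moved, so the achievable rate is capped by bandwidth rather than ALU width. A back-of-envelope roofline check (the 50 GB/s bandwidth figure is an assumption for illustration, not a measured number):

```rust
/// Roofline ceiling in GFLOP/s for a memory-bound kernel:
/// intensity (flop/byte) times bandwidth (byte/s).
fn roofline_gflops(intensity_flops_per_byte: f64, bandwidth_bytes_per_s: f64) -> f64 {
    intensity_flops_per_byte * bandwidth_bytes_per_s / 1e9
}

fn main() {
    // f32 element-wise add: 1 flop per 12 bytes (two loads + one store)
    let intensity = 1.0 / 12.0;
    // Assumed DRAM bandwidth (machine-dependent):
    let bw = 50e9;
    println!("arithmetic intensity: {intensity:.3} flop/byte");
    println!("memory-bound ceiling: {:.1} GFLOP/s at 50 GB/s", roofline_gflops(intensity, bw));
}
```

A ceiling of a few GFLOP/s is far below what even AVX2 can compute, which is why wider vectors yield only the modest 1-2x listed in the table.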
Operations
Vector: add, sub, mul, div, dot, sum, min, max, argmin, argmax, norm_l1, norm_l2, normalize, recip, sqrt, abs, clamp
Activations: relu, leaky_relu, elu, sigmoid, tanh, gelu, swish, softmax, log_softmax, silu
Matrix: matmul, batched_matmul, batched_matmul_4d, transpose, matvec, convolve2d, pooling (max/avg), topk, gather, pad
Statistics: mean, variance, stddev, covariance, correlation, zscore
Eigen: symmetric eigendecomposition (Jacobi algorithm)
GPU Kernels: GEMM (naive/tiled/tensor core), Softmax, LayerNorm, RMSNorm, Attention, GEMV, Quantization (Q4K/Q5K/Q6K)
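The Jacobi algorithm behind the symmetric eigendecomposition repeatedly applies plane rotations that zero one off-diagonal entry at a time. For a 2x2 symmetric matrix a single rotation diagonalizes it completely, which makes for a compact standalone sketch (illustrative only, not trueno's n x n implementation):

```rust
/// One Jacobi rotation for the 2x2 symmetric matrix [[a, b], [b, d]].
/// The rotation angle theta satisfies tan(2*theta) = 2b / (a - d);
/// for 2x2 this single rotation yields the eigenvalues directly.
fn jacobi_2x2(a: f64, b: f64, d: f64) -> (f64, f64) {
    if b == 0.0 {
        return (a, d); // already diagonal
    }
    let theta = 0.5 * (2.0 * b).atan2(a - d);
    let (s, c) = theta.sin_cos();
    // Diagonal entries of R^T * M * R
    let l1 = c * c * a + 2.0 * s * c * b + s * s * d;
    let l2 = s * s * a - 2.0 * s * c * b + c * c * d;
    (l1, l2)
}

fn main() {
    // [[3, 1], [1, 3]] has eigenvalues 4 and 2
    let (l1, l2) = jacobi_2x2(3.0, 1.0, 3.0);
    println!("eigenvalues: {l1} {l2}");
}
```

For general n x n matrices the sweep repeats this rotation over every off-diagonal pair until the off-diagonal norm falls below a tolerance.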
Quality
Falsifiable Commitments
| Claim | Falsification Test | Status |
|---|---|---|
| AVX-512 and AVX2 produce identical results to scalar | Parity tests across all vector/matrix ops | Passing |
| Q4K/Q5K/Q6K fused kernels match scalar reference | Per-block output comparison with epsilon tolerance | Passing |
| BLIS GEMM matches naive matmul | Property-based tests (proptest) against reference | Passing |
| wgpu backward ops produce correct gradients | 8 FALSIFY contract tests on AMD Radeon Pro W5700X | Passing |
| PTX generation produces valid NVIDIA ISA | Regex validation of emitted PTX instructions | Passing |
| Runtime SIMD detection never selects unsupported ISA | Feature probe matches cpuid at startup | Passing |
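The parity rows above all follow one test pattern: run the scalar reference and the optimized path on identical input and compare within an epsilon. A hedged sketch of that pattern (both paths here are stand-ins, not trueno's kernels or test harness):

```rust
/// Scalar reference dot product.
fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Stand-in for a SIMD path: same math, 4-wide chunked accumulation,
/// which (like real SIMD) can reassociate floating-point additions.
fn dot_chunked(a: &[f32], b: &[f32]) -> f32 {
    a.chunks(4)
        .zip(b.chunks(4))
        .map(|(ca, cb)| ca.iter().zip(cb).map(|(x, y)| x * y).sum::<f32>())
        .sum()
}

/// Parity check: both paths must agree within an absolute epsilon.
fn assert_parity(a: &[f32], b: &[f32], eps: f32) {
    let (s, f) = (dot_scalar(a, b), dot_chunked(a, b));
    assert!((s - f).abs() <= eps, "parity violated: {s} vs {f}");
}

fn main() {
    let a: Vec<f32> = (0..64).map(|i| i as f32 * 0.5).collect();
    let b: Vec<f32> = (0..64).map(|i| (64 - i) as f32).collect();
    assert_parity(&a, &b, 1e-3);
    println!("parity ok");
}
```

The epsilon accounts for reassociated floating-point accumulation; for integer-exact kernels (e.g. quantized blocks) the comparison can demand bitwise equality instead.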
Metrics
- 97% test coverage
- Zero clippy warnings (`-D warnings`)
- Mutation testing via cargo-mutants
- provable-contracts enforcement: 100 bindings, AllImplemented
Sovereign AI Stack
trueno is the compute foundation for the PAIML Sovereign AI Stack.
| Layer | Crate | Relationship |
|---|---|---|
| Orchestration | batuta | Stack coordinator, uses trueno for analysis |
| ML | aprender | ML algorithms built on trueno SIMD |
| Training | entrenar | Autograd and LoRA training on trueno tensors |
| Inference | realizar | LLM inference using trueno quantization kernels |
| Distribution | repartir | Distributed compute with trueno tensor integration |
| Database | trueno-db | GPU-first analytics database |
| Graph | trueno-graph | Graph algorithms for code analysis |
| RAG | trueno-rag | RAG pipeline (chunking, BM25+vector, RRF) |
| Visualization | trueno-viz | Terminal and PNG visualization |
| Compression | trueno-zram-core | SIMD/GPU memory compression |
| Block Device | trueno-ublk | GPU-accelerated ZRAM replacement |
| PTX Codegen | trueno-gpu | Pure Rust CUDA PTX generation |
| Contracts | provable-contracts | YAML contract verification |
Documentation
- API Reference: docs.rs/trueno
- Examples: `examples/` -- 34 runnable examples
- Benchmarks: `benches/` -- criterion benchmarks
- trueno-gpu docs: docs.rs/trueno-gpu
Contributing
- All tests pass: `cargo test --all-features`
- Coverage stays above 90%: `make coverage`
- No clippy warnings: `cargo clippy --all-features -- -D warnings`
- Code is formatted: `cargo fmt`
License
MIT -- see LICENSE.
Part of the Aprender monorepo — 70 workspace crates.