Trueno ⚡
Multi-Target High-Performance Compute Library
Trueno (Spanish: "thunder") provides unified, high-performance compute primitives across three execution targets:
- CPU SIMD - x86 (SSE2/AVX/AVX2/AVX-512), ARM (NEON), WASM (SIMD128)
- GPU - Vulkan/Metal/DX12/WebGPU via `wgpu`
- WebAssembly - Portable SIMD128 for browser/edge deployment
Quick Start
```rust
use trueno::{Matrix, Vector};

// Vector operations
let a = Vector::from_slice(&[1.0, 2.0, 3.0, 4.0]);
let b = Vector::from_slice(&[5.0, 6.0, 7.0, 8.0]);

// Auto-selects best backend (AVX2/GPU/WASM)
let result = a.add(&b).unwrap();
assert_eq!(result.as_slice(), &[6.0, 8.0, 10.0, 12.0]);

let dot_product = a.dot(&b).unwrap(); // 70.0
let sum = a.sum().unwrap();           // 10.0
let max = a.max().unwrap();           // 4.0

// Matrix operations (NEW in v0.1)
let m1 = Matrix::from_vec(2, 2, vec![1.0, 2.0, 3.0, 4.0]).unwrap();
let m2 = Matrix::identity(2);
let product = m1.matmul(&m2).unwrap(); // Matrix multiplication
let transposed = m1.transpose();       // Matrix transpose
```
Performance
Trueno delivers exceptional performance through multi-level SIMD optimization:
SSE2 (128-bit SIMD) vs Scalar
| Operation | Speedup | Use Case |
|---|---|---|
| Dot Product | 340% faster | Machine learning, signal processing |
| Sum Reduction | 315% faster | Statistics, aggregations |
| Max Finding | 348% faster | Data analysis, optimization |
| Element-wise Add | 3-10% faster | Memory-bound (limited SIMD benefit) |
| Element-wise Mul | 5-6% faster | Memory-bound (limited SIMD benefit) |
AVX2 (256-bit SIMD) vs SSE2
| Operation | Speedup | Notes |
|---|---|---|
| Dot Product | 182% faster | FMA (fused multiply-add) acceleration |
| Element-wise Add | 15% faster | Memory bandwidth limited |
| Element-wise Mul | 12% faster | Memory bandwidth limited |
Key Insights:
- SIMD excels at compute-intensive operations (dot product, reductions)
- Element-wise operations are memory-bound, limiting SIMD gains
- AVX2's FMA provides significant acceleration for dot products
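To make the FMA point concrete, here is a minimal AVX2 dot-product kernel in the style such a backend typically uses. This is an illustrative sketch, not Trueno's actual implementation; the function name, chunking, and tail handling are assumptions.

```rust
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn dot_fma(a: &[f32], b: &[f32]) -> f32 {
    use std::arch::x86_64::*;
    debug_assert_eq!(a.len(), b.len());

    let chunks = a.len() / 8;
    let mut acc = _mm256_setzero_ps();
    for i in 0..chunks {
        let va = _mm256_loadu_ps(a.as_ptr().add(i * 8));
        let vb = _mm256_loadu_ps(b.as_ptr().add(i * 8));
        // One fused instruction performs 8 multiplies and 8 adds per iteration.
        acc = _mm256_fmadd_ps(va, vb, acc);
    }

    // Horizontal reduction of the 8 accumulator lanes, then the scalar tail.
    let mut lanes = [0.0f32; 8];
    _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
    let mut sum: f32 = lanes.iter().sum();
    for i in (chunks * 8)..a.len() {
        sum += a[i] * b[i];
    }
    sum
}
```

Because the multiply and accumulate are fused, the loop issues half the arithmetic instructions of a separate mul/add pair, which is where the dot-product speedup comes from.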
📊 See the Performance Guide and AVX2 Benchmarks for detailed analysis.
Features
- 🚀 Write Once, Optimize Everywhere: Single algorithm, multiple backends
- ⚡ Runtime Dispatch: Auto-select best implementation based on CPU features
- 🛡️ Zero Unsafe in Public API: Safety via type system, `unsafe` isolated in backends
- 📊 Benchmarked Performance: Every optimization proves ≥10% speedup
- 🧪 Extreme TDD: >90% test coverage, mutation testing, property-based tests
- 🎯 Production Ready: PMAT quality gates, Toyota Way principles
Design Principles
Write Once, Optimize Everywhere
```rust
// Same code runs optimally on x86, ARM, WASM, GPU
let result = a.add(&b).unwrap();
```
Trueno automatically selects the best backend:
- x86_64: AVX-512 → AVX2 → AVX → SSE2 → Scalar
- ARM: NEON → Scalar
- WASM: SIMD128 → Scalar
- GPU: Vulkan/Metal/DX12/WebGPU (large datasets)
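A simplified sketch of how this kind of runtime selection typically works on x86_64, using `is_x86_feature_detected!`. The enum and function below are illustrative only, not Trueno's internal types.

```rust
/// Illustrative SIMD-level detection (not Trueno's actual `Backend` enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SimdLevel {
    Avx512,
    Avx2,
    Sse2,
    Scalar,
}

fn detect_simd_level() -> SimdLevel {
    #[cfg(target_arch = "x86_64")]
    {
        // Checked once at runtime, so a single binary adapts to the host CPU.
        if is_x86_feature_detected!("avx512f") {
            return SimdLevel::Avx512;
        }
        if is_x86_feature_detected!("avx2") {
            return SimdLevel::Avx2;
        }
        if is_x86_feature_detected!("sse2") {
            return SimdLevel::Sse2;
        }
    }
    SimdLevel::Scalar
}
```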
Safety First
```rust
// Public API is 100% safe Rust
let result = vector.add(&other)?; // Returns Result<Vector, TruenoError>

// Size mismatches caught at runtime
let a = Vector::from_slice(&[1.0, 2.0, 3.0]);
let b = Vector::from_slice(&[1.0, 2.0]);
assert!(a.add(&b).is_err()); // SizeMismatch error
```
Performance Targets
| Operation | Size | Target Speedup vs Scalar | Backend |
|---|---|---|---|
| `add()` | 1K | 8x | AVX2 |
| `add()` | 100K | 16x | GPU |
| `dot()` | 10K | 12x | AVX2 + FMA |
| `sum()` | 1M | 20x | GPU |
All optimizations benchmarked with Criterion.rs, minimum 10% improvement required
Installation
Add to your Cargo.toml:
```toml
[dependencies]
trueno = "0.1"
```
For bleeding-edge features:
```toml
[dependencies]
trueno = { git = "https://github.com/paiml/trueno" }
```
Usage Examples
Basic Vector Operations
```rust
use trueno::Vector;

// Element-wise addition
let a = Vector::from_slice(&[1.0, 2.0, 3.0]);
let b = Vector::from_slice(&[4.0, 5.0, 6.0]);
let sum = a.add(&b).unwrap();
assert_eq!(sum.as_slice(), &[5.0, 7.0, 9.0]);

// Element-wise multiplication
let product = a.mul(&b).unwrap();
assert_eq!(product.as_slice(), &[4.0, 10.0, 18.0]);

// Dot product
let dot = a.dot(&b).unwrap();
assert_eq!(dot, 32.0); // 1*4 + 2*5 + 3*6

// Reductions
let total = a.sum().unwrap();   // 6.0
let maximum = a.max().unwrap(); // 3.0
```
Backend Selection
```rust
use trueno::{Backend, Vector};

// Auto-select best backend (recommended)
let v = Vector::from_slice(&[1.0, 2.0, 3.0]); // Uses Backend::Auto

// Explicit backend (for testing/benchmarking)
let v = Vector::from_slice_with_backend(&[1.0, 2.0, 3.0], Backend::Scalar);
let v = Vector::from_slice_with_backend(&[1.0, 2.0, 3.0], Backend::Sse2);
```
Error Handling
```rust
use trueno::Vector;

let a = Vector::from_slice(&[1.0, 2.0, 3.0]);
let b = Vector::from_slice(&[1.0, 2.0]);

match a.add(&b) {
    Ok(_sum) => println!("addition succeeded"),
    Err(e) => eprintln!("error: {:?}", e), // TruenoError::SizeMismatch
}
```
Ecosystem Integration
Trueno integrates with the Pragmatic AI Labs transpiler ecosystem:
Ruchy
```text
# Ruchy syntax
let v = Vector([1.0, 2.0]) + Vector([3.0, 4.0])
# Transpiles to: trueno::Vector::add()
```
Depyler (Python → Rust)
```python
# Python/NumPy code
result = np.dot(a, b)
# Transpiles to: trueno::Vector::dot(&a, &b)
```
Decy (C → Rust)
```c
// C SIMD intrinsics
__m256 result = _mm256_add_ps(a, b);
// Transpiles to: trueno::Vector::add() (safe!)
```
Development
Prerequisites
```bash
# Install Rust (if not already installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Install development tools (see the Makefile)
```
Building
```bash
# Development build
cargo build

# Release build (optimized)
cargo build --release

# Run tests
cargo test

# Fast test run (<5 min target; see the Makefile)
```
Quality Gates
Trueno enforces EXTREME TDD quality standards:
```bash
# Run all quality gates (pre-commit)
make quality-gates

# Individual gates: formatting, clippy, tests, coverage (see the Makefile)
```
Quality Metrics:
- ✅ Test Coverage: 100% (target >85%)
- ✅ PMAT TDG Score: 96.1/100 (A+)
- ✅ Clippy Warnings: 0
- ✅ Property Tests: 10 tests × 100 cases each
- ✅ Cyclomatic Complexity: Median 1.0 (very low)
PMAT Integration
```bash
# Technical Debt Grading
# Complexity analysis
# Repository health score
```
Testing Philosophy
Trueno uses multi-layered testing:
- Unit Tests (30 tests): Basic functionality, edge cases, error paths
- Property Tests (10 tests × 100 cases): Mathematical properties verification (see the sketch below)
  - Commutativity: `a + b == b + a`
  - Associativity: `(a + b) + c == a + (b + c)`
  - Identity elements: `a + 0 == a`, `a * 1 == a`
  - Distributive: `a * (b + c) == a*b + a*c`
- Integration Tests: Backend selection, large datasets
- Benchmarks: Performance regression prevention (Criterion.rs)
- Mutation Tests: Test suite effectiveness (>80% kill rate)
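As an illustration, a commutativity property test written with the proptest crate could look like the following. This is a sketch that assumes the `from_slice`/`add`/`as_slice` API shown earlier; the test name and value ranges are arbitrary.

```rust
use proptest::prelude::*;
use trueno::Vector;

proptest! {
    // a + b must equal b + a element-wise for equal-length inputs.
    #[test]
    fn add_is_commutative(
        (xs, ys) in (1usize..64).prop_flat_map(|len| (
            proptest::collection::vec(-1.0e3f32..1.0e3, len),
            proptest::collection::vec(-1.0e3f32..1.0e3, len),
        ))
    ) {
        let a = Vector::from_slice(&xs);
        let b = Vector::from_slice(&ys);
        prop_assert_eq!(
            a.add(&b).unwrap().as_slice(),
            b.add(&a).unwrap().as_slice()
        );
    }
}
```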
Run property tests with verbose output, for example `cargo test -- --nocapture`.
Benchmarking
```bash
# Run all benchmarks
cargo bench

# Benchmark specific operation (filter by name)
cargo bench dot
```
Benchmark results are stored in target/criterion/ and include:
- Throughput (elements/second)
- Latency (mean, median, p95, p99)
- Backend comparison (Scalar vs SIMD vs GPU)
- Regression detection
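A minimal Criterion benchmark for the dot product could be structured as follows. This is a sketch; the benchmark name and input sizes are illustrative, not the project's actual bench files.

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use trueno::Vector;

fn bench_dot(c: &mut Criterion) {
    // 10K-element inputs, reused across iterations.
    let a = Vector::from_slice(&vec![1.0f32; 10_000]);
    let b = Vector::from_slice(&vec![2.0f32; 10_000]);
    c.bench_function("dot_10k", |bencher| {
        bencher.iter(|| black_box(a.dot(&b).unwrap()))
    });
}

criterion_group!(benches, bench_dot);
criterion_main!(benches);
```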
Examples
Trueno includes several runnable examples demonstrating real-world use cases:
```bash
# Run an example: cargo run --release --example <name>

# Machine Learning: Cosine similarity, L2 normalization, k-NN
# Performance: Compare Scalar vs SSE2 backends
# Backend Detection: Runtime CPU feature detection
```
ML Example Features:
- Document similarity for recommendation systems
- Feature normalization for neural networks
- k-Nearest Neighbors classification
- Demonstrates 340% speedup for dot products
See examples/ directory for complete code.
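As a flavor of the ML example, cosine similarity falls directly out of `dot` and `norm_l2`. The sketch below assumes `dot` returns a `Result` and `norm_l2` returns the norm as `f32`; see examples/ for the real code.

```rust
use trueno::Vector;

/// Cosine similarity: dot(a, b) / (|a| * |b|), in [-1, 1] for non-zero inputs.
fn cosine_similarity(a: &Vector<f32>, b: &Vector<f32>) -> f32 {
    let dot = a.dot(b).unwrap();
    let norm_product = a.norm_l2() * b.norm_l2();
    if norm_product == 0.0 {
        0.0 // convention: treat a zero vector as dissimilar to everything
    } else {
        dot / norm_product
    }
}
```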
Project Structure
```
trueno/
├── src/
│   ├── lib.rs             # Public API, backend enum, auto-selection
│   ├── error.rs           # Error types (TruenoError)
│   ├── vector.rs          # Vector<T> implementation
│   └── backends/          # Backend implementations (future)
│       ├── scalar.rs
│       ├── simd/
│       │   ├── avx2.rs
│       │   ├── avx512.rs
│       │   └── neon.rs
│       ├── gpu.rs
│       └── wasm.rs
├── benches/               # Criterion benchmarks (future)
├── docs/
│   └── specifications/    # Design specifications
├── Cargo.toml             # Dependencies, optimization flags
├── Makefile               # Quality gates, development commands
└── README.md              # This file
```
Roadmap
Phase 1: Scalar Baseline ✅ COMPLETE
- Core `Vector<f32>` API (add, mul, dot, sum, max)
- Error handling with `TruenoError`
- 100% test coverage (40 tests)
- Property-based tests (PROPTEST_CASES=100)
- PMAT quality gates integration
- Documentation and README
Phase 2: x86 SIMD ✅ COMPLETE
- Runtime CPU feature detection (`is_x86_feature_detected!`)
- SSE2 implementation (baseline x86_64)
- Benchmarks proving โฅ10% speedup (66.7% of tests, avg 178.5%)
- Auto-dispatch based on CPU features
- Backend trait architecture
- Comprehensive performance analysis
Phase 3: AVX2 SIMD ✅ COMPLETE
- AVX2 implementation with FMA support (256-bit SIMD)
- Benchmarks proving exceptional speedups (1.82x for dot product)
- Performance analysis and documentation
- All quality gates passing (0 warnings, 78 tests)
Phase 4: ARM SIMD ✅ COMPLETE
- ARM NEON implementation (128-bit SIMD)
- Runtime feature detection (ARMv7/ARMv8/AArch64)
- Cross-platform compilation support
- Comprehensive tests with cross-validation
- Benchmarks on ARM hardware (pending ARM access)
Phase 5: WebAssembly ✅ COMPLETE
- WASM SIMD128 implementation (128-bit SIMD)
- All 5 operations with f32x4 intrinsics
- Comprehensive tests with cross-validation
- Browser deployment example (future)
- Edge computing use case (future)
Phase 6: GPU Compute
- `wgpu` integration
- Compute shader kernels (WGSL)
- Host-device memory transfer optimization
- GPU dispatch heuristics (OpComplexity)
- Multi-GPU support
Phase 7: Advanced Operations ✅ COMPLETE
- Element-wise subtraction (sub) and division (div)
- Reductions: min, max, sum, sum_kahan (Kahan summation)
- Index finding: argmax, argmin
- Vector norms: norm_l2 (Euclidean norm), normalize (unit vector)
- Activation functions: ReLU, Leaky ReLU, ELU, Sigmoid, Softmax/Log-Softmax, GELU, Swish/SiLU
- Preprocessing: zscore, minmax_normalize, clip
- Statistical operations: mean, variance, stddev, covariance, correlation
Phase 8: Matrix Operations 🚧 IN PROGRESS
- Matrix type with row-major storage (NumPy-compatible)
- Matrix multiplication (matmul) - naive O(n³) (see the sketch below)
- Matrix transpose
- Matrix-vector operations
- SIMD-optimized matmul
- GPU dispatch for large matrices
Phase 8 Progress: Core matrix operations complete with 24 tests passing (611 total).
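For reference, the naive O(n³) kernel over row-major storage is the classic triple loop. The sketch below is standalone and independent of Trueno's `Matrix` type.

```rust
/// Multiply an (m x k) matrix by a (k x n) matrix, both stored row-major in
/// flat slices, returning the (m x n) product, also row-major.
fn matmul_naive(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    let mut c = vec![0.0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for p in 0..k {
                // Row-major indexing: element (r, col) of a (rows x cols)
                // matrix lives at r * cols + col.
                acc += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = acc;
        }
    }
    c
}
```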
Phase 7 Status: ✅ COMPLETE - Core vector operations with 587 tests passing. The library now supports:
- Element-wise operations: add, sub, mul, div, abs (absolute value), neg (negation/unary minus), clamp (range constraint), lerp (linear interpolation), fma (fused multiply-add), sqrt (square root), recip (reciprocal), pow (power), exp (exponential), ln (natural logarithm), sin (sine), cos (cosine), tan (tangent), asin (arcsine), acos (arccosine), atan (arctangent), sinh (hyperbolic sine), cosh (hyperbolic cosine), tanh (hyperbolic tangent), asinh (inverse hyperbolic sine), acosh (inverse hyperbolic cosine), atanh (inverse hyperbolic tangent), floor (round down), ceil (round up), round (round to nearest), trunc (truncate toward zero), fract (fractional part), signum (sign function), copysign (copy sign from one vector to another), minimum (element-wise minimum of two vectors), maximum (element-wise maximum of two vectors)
- Scalar operations: scale (scalar multiplication with full SIMD support)
- Dot product: Optimized for ML/scientific computing
- Reductions: sum (naive + Kahan), min, max, sum_of_squares, mean (arithmetic average), variance (population variance), stddev (standard deviation), covariance (population covariance between two vectors), correlation (Pearson correlation coefficient)
- Activation functions: relu (rectified linear unit - max(0, x)), leaky_relu (leaky ReLU with configurable negative slope), elu (exponential linear unit with smooth gradients), sigmoid (logistic function - 1/(1+e^-x)), softmax (convert logits to probability distribution), log_softmax (numerically stable log of softmax for cross-entropy loss), gelu (Gaussian Error Linear Unit - smooth activation used in transformers like BERT/GPT), swish/silu (Swish/Sigmoid Linear Unit - self-gated activation used in EfficientNet/MobileNet v3)
- Preprocessing: zscore (z-score normalization/standardization), minmax_normalize (min-max scaling to [0,1] range), clip (constrain values to [min,max] range)
- Index operations: argmin, argmax
- Vector norms: L1 (Manhattan), L2 (Euclidean), L∞ (max norm), normalization to unit vectors
- Numerical stability: Kahan summation for accurate floating-point accumulation (sketched below)
- FMA optimization: Hardware-accelerated fused multiply-add on AVX2 and NEON platforms
- Mathematical functions: Element-wise square root, reciprocal, power, exponential, logarithm, trigonometric (sine, cosine, tangent), inverse trigonometric (arcsine, arccosine, arctangent), hyperbolic functions (sinh, cosh, tanh), and inverse hyperbolic functions (asinh, acosh, atanh) for ML (neural network activations), signal processing (waveforms, oscillators, phase recovery, FM demodulation), physics simulations, graphics (perspective projection, inverse transformations, lighting models, camera orientation), navigation (GPS, spherical trigonometry, bearing calculations, heading calculations), robotics (orientation calculations, inverse kinematics, steering angles), and Fourier analysis
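The Kahan summation behind `sum_kahan` keeps a running compensation term so that rounding error does not grow with input length. Below is a standalone sketch of the algorithm, not Trueno's SIMD implementation.

```rust
/// Compensated (Kahan) summation: tracks the low-order bits lost by each
/// addition in `c` and feeds them back in, so accumulated error stays small
/// instead of growing with the number of elements.
fn kahan_sum(xs: &[f32]) -> f32 {
    let mut sum = 0.0f32;
    let mut c = 0.0f32; // running compensation for lost low-order bits
    for &x in xs {
        let y = x - c;
        let t = sum + y;
        c = (t - sum) - y; // recovers what was rounded away when forming `t`
        sum = t;
    }
    sum
}
```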
Contributing
We welcome contributions! Please follow these guidelines:
- Quality Gates: All PRs must pass `make quality-gates`
  - Zero clippy warnings
  - 100% formatted code
  - All tests passing
  - Coverage >85%
- Testing: Include tests for new features
  - Unit tests for basic functionality
  - Property tests for mathematical operations
  - Benchmarks for performance claims
- Documentation: Update README and docs for new features
- Toyota Way Principles:
  - Jidoka (built-in quality): Tests catch issues immediately
  - Kaizen (continuous improvement): Every PR makes the codebase better
  - Genchi Genbutsu (go and see): Benchmark claims, measure reality
License
This project is licensed under the MIT License - see the LICENSE file for details.
Authors
- Pragmatic AI Labs - https://github.com/paiml
Acknowledgments
- Inspired by NumPy, Eigen, and ndarray
- SIMD guidance from `std::arch` documentation
- GPU compute via the `wgpu` project
- Quality standards from the Toyota Production System
- PMAT quality gates by Pragmatic AI Labs
Citation
If you use Trueno in academic work, please cite the project repository: https://github.com/paiml/trueno
Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: contact@paiml.com
Built with EXTREME TDD and Toyota Way principles 🚀⚡