Trueno ⚡

Multi-Target High-Performance Compute Library

Trueno (Spanish: "thunder") provides unified, high-performance compute primitives across three execution targets:

CPU SIMD - x86 (SSE2/AVX/AVX2/AVX-512), ARM (NEON), WASM (SIMD128)
GPU - Vulkan/Metal/DX12/WebGPU via wgpu
WebAssembly - Portable SIMD128 for browser/edge deployment

Quick Start

use trueno::{Vector, Matrix};

// Vector operations
let a = Vector::from_slice(&[1.0, 2.0, 3.0, 4.0]);
let b = Vector::from_slice(&[5.0, 6.0, 7.0, 8.0]);

// Auto-selects best backend (AVX2/GPU/WASM)
let result = a.add(&b).unwrap();
assert_eq!(result.as_slice(), &[6.0, 8.0, 10.0, 12.0]);

let dot_product = a.dot(&b).unwrap();  // 70.0
let sum = a.sum().unwrap();            // 10.0
let max = a.max().unwrap();            // 4.0

// Matrix operations (NEW in v0.1)
let m1 = Matrix::from_vec(2, 2, vec![1.0, 2.0, 3.0, 4.0]).unwrap();
let m2 = Matrix::identity(2);
let product = m1.matmul(&m2).unwrap();  // Matrix multiplication
let transposed = m1.transpose();        // Matrix transpose

Performance

Trueno delivers exceptional performance through multi-level SIMD optimization:

SSE2 (128-bit SIMD) vs Scalar

Operation	Speedup	Use Case
Dot Product	340% faster	Machine learning, signal processing
Sum Reduction	315% faster	Statistics, aggregations
Max Finding	348% faster	Data analysis, optimization
Element-wise Add	3-10% faster	Memory-bound (limited SIMD benefit)
Element-wise Mul	5-6% faster	Memory-bound (limited SIMD benefit)

AVX2 (256-bit SIMD) vs SSE2

Operation	Speedup	Notes
Dot Product	182% faster	FMA (fused multiply-add) acceleration
Element-wise Add	15% faster	Memory bandwidth limited
Element-wise Mul	12% faster	Memory bandwidth limited

Key Insights:

SIMD excels at compute-intensive operations (dot product, reductions)
Element-wise operations are memory-bound, limiting SIMD gains
AVX2's FMA provides significant acceleration for dot products

📖 See Performance Guide and AVX2 Benchmarks for detailed analysis.

Features

🚀 Write Once, Optimize Everywhere: Single algorithm, multiple backends
⚡ Runtime Dispatch: Auto-select best implementation based on CPU features
🛡️ Zero Unsafe in Public API: Safety via type system, unsafe isolated in backends
📊 Benchmarked Performance: Every optimization proves ≥10% speedup
🧪 Extreme TDD: >90% test coverage, mutation testing, property-based tests
🎯 Production Ready: PMAT quality gates, Toyota Way principles

Design Principles

Write Once, Optimize Everywhere

// Same code runs optimally on x86, ARM, WASM, GPU
let result = a.add(&b).unwrap();

Trueno automatically selects the best backend:

x86_64: AVX-512 → AVX2 → AVX → SSE2 → Scalar
ARM: NEON → Scalar
WASM: SIMD128 → Scalar
GPU: Vulkan/Metal/DX12/WebGPU (large datasets)

Safety First

// Public API is 100% safe Rust
let result = vector.add(&other)?;  // Returns Result<Vector, TruenoError>

// Size mismatches caught at runtime
let a = Vector::from_slice(&[1.0, 2.0]);
let b = Vector::from_slice(&[1.0, 2.0, 3.0]);
assert!(a.add(&b).is_err());  // SizeMismatch error

Performance Targets

Operation	Size	Target Speedup vs Scalar	Backend
`add()`	1K	8x	AVX2
`add()`	100K	16x	GPU
`dot()`	10K	12x	AVX2 + FMA
`sum()`	1M	20x	GPU

All optimizations benchmarked with Criterion.rs, minimum 10% improvement required

Installation

Add to your Cargo.toml:

[dependencies]
trueno = "0.1"

For bleeding-edge features:

[dependencies]
trueno = { git = "https://github.com/paiml/trueno" }

Usage Examples

Basic Vector Operations

use trueno::Vector;

// Element-wise addition
let a = Vector::from_slice(&[1.0, 2.0, 3.0]);
let b = Vector::from_slice(&[4.0, 5.0, 6.0]);
let sum = a.add(&b).unwrap();
assert_eq!(sum.as_slice(), &[5.0, 7.0, 9.0]);

// Element-wise multiplication
let product = a.mul(&b).unwrap();
assert_eq!(product.as_slice(), &[4.0, 10.0, 18.0]);

// Dot product
let dot = a.dot(&b).unwrap();
assert_eq!(dot, 32.0);  // 1*4 + 2*5 + 3*6

// Reductions
let total = a.sum().unwrap();  // 6.0
let maximum = a.max().unwrap();  // 3.0

Backend Selection

use trueno::{Vector, Backend};

// Auto-select best backend (recommended)
let v = Vector::from_slice(&data);  // Uses Backend::Auto

// Explicit backend (for testing/benchmarking)
let v = Vector::from_slice_with_backend(&data, Backend::AVX2);
let v = Vector::from_slice_with_backend(&data, Backend::GPU);

Error Handling

use trueno::{Vector, TruenoError};

let a = Vector::from_slice(&[1.0, 2.0]);
let b = Vector::from_slice(&[1.0, 2.0, 3.0]);

match a.add(&b) {
    Ok(result) => println!("Sum: {:?}", result.as_slice()),
    Err(TruenoError::SizeMismatch { expected, actual }) => {
        eprintln!("Size mismatch: expected {}, got {}", expected, actual);
    }
    Err(e) => eprintln!("Error: {}", e),
}

Ecosystem Integration

Trueno integrates with the Pragmatic AI Labs transpiler ecosystem:

Ruchy

# Ruchy syntax
let v = Vector([1.0, 2.0]) + Vector([3.0, 4.0])
# Transpiles to: trueno::Vector::add()

Depyler (Python → Rust)

# Python/NumPy code
import numpy as np
result = np.dot(a, b)
# Transpiles to: trueno::Vector::dot(&a, &b)

Decy (C → Rust)

// C SIMD intrinsics
__m256 result = _mm256_add_ps(a, b);
// Transpiles to: trueno::Vector::add() (safe!)

Development

Prerequisites

# Install Rust (if not already installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Install development tools
make install-tools

Building

# Development build
make build

# Release build (optimized)
make build-release

# Run tests
make test

# Fast test run (<5 min target)
make test-fast

Quality Gates

Trueno enforces EXTREME TDD quality standards:

# Run all quality gates (pre-commit)
make quality-gates

# Individual gates
make lint       # Zero warnings policy
make fmt-check  # Format verification
make test-fast  # All tests (<5 min)
make coverage   # >85% required (<10 min)
make mutate     # Mutation testing (>80% kill rate)

Quality Metrics:

✅ Test Coverage: 100% (target >85%)
✅ PMAT TDG Score: 96.1/100 (A+)
✅ Clippy Warnings: 0
✅ Property Tests: 10 tests × 100 cases each
✅ Cyclomatic Complexity: Median 1.0 (very low)

PMAT Integration

# Technical Debt Grading
make pmat-tdg

# Complexity analysis
make pmat-analyze

# Repository health score
make pmat-score

Testing Philosophy

Trueno uses multi-layered testing:

Unit Tests (30 tests): Basic functionality, edge cases, error paths
Property Tests (10 tests × 100 cases): Mathematical properties verification
- Commutativity: a + b == b + a
- Associativity: (a + b) + c == a + (b + c)
- Identity elements: a + 0 == a, a * 1 == a
- Distributive: a * (b + c) == a*b + a*c
Integration Tests: Backend selection, large datasets
Benchmarks: Performance regression prevention (Criterion.rs)
Mutation Tests: Test suite effectiveness (>80% kill rate)

Run property tests with verbose output:

cargo test property_tests -- --nocapture

Benchmarking

# Run all benchmarks
make bench

# Benchmark specific operation
cargo bench -- add
cargo bench -- dot

Benchmark results are stored in target/criterion/ and include:

Throughput (elements/second)
Latency (mean, median, p95, p99)
Backend comparison (Scalar vs SIMD vs GPU)
Regression detection

Examples

Trueno includes several runnable examples demonstrating real-world use cases:

# Machine Learning: Cosine similarity, L2 normalization, k-NN
cargo run --release --example ml_similarity

# Performance: Compare Scalar vs SSE2 backends
cargo run --release --example performance_demo

# Backend Detection: Runtime CPU feature detection
cargo run --release --example backend_detection

ML Example Features:

Document similarity for recommendation systems
Feature normalization for neural networks
k-Nearest Neighbors classification
Demonstrates 340% speedup for dot products

See examples/ directory for complete code.

Project Structure

trueno/
├── src/
│   ├── lib.rs          # Public API, backend enum, auto-selection
│   ├── error.rs        # Error types (TruenoError)
│   ├── vector.rs       # Vector<T> implementation
│   └── backends/       # Backend implementations (future)
│       ├── scalar.rs
│       ├── simd/
│       │   ├── avx2.rs
│       │   ├── avx512.rs
│       │   └── neon.rs
│       ├── gpu.rs
│       └── wasm.rs
├── benches/            # Criterion benchmarks (future)
├── docs/
│   └── specifications/ # Design specifications
├── Cargo.toml          # Dependencies, optimization flags
├── Makefile            # Quality gates, development commands
└── README.md           # This file

Roadmap

Phase 1: Scalar Baseline ✅ COMPLETE

Core Vector<f32> API (add, mul, dot, sum, max)
Error handling with TruenoError
100% test coverage (40 tests)
Property-based tests (PROPTEST_CASES=100)
PMAT quality gates integration
Documentation and README

Phase 2: x86 SIMD ✅ COMPLETE

Runtime CPU feature detection (is_x86_feature_detected!)
SSE2 implementation (baseline x86_64)
Benchmarks proving ≥10% speedup (66.7% of tests, avg 178.5%)
Auto-dispatch based on CPU features
Backend trait architecture
Comprehensive performance analysis

Phase 3: AVX2 SIMD ✅ COMPLETE

AVX2 implementation with FMA support (256-bit SIMD)
Benchmarks proving exceptional speedups (1.82x for dot product)
Performance analysis and documentation
All quality gates passing (0 warnings, 78 tests)

Phase 4: ARM SIMD ✅ COMPLETE

ARM NEON implementation (128-bit SIMD)
Runtime feature detection (ARMv7/ARMv8/AArch64)
Cross-platform compilation support
Comprehensive tests with cross-validation
Benchmarks on ARM hardware (pending ARM access)

Phase 5: WebAssembly ✅ COMPLETE

WASM SIMD128 implementation (128-bit SIMD)
All 5 operations with f32x4 intrinsics
Comprehensive tests with cross-validation
Browser deployment example (future)
Edge computing use case (future)

Phase 6: GPU Compute

wgpu integration
Compute shader kernels (WGSL)
Host-device memory transfer optimization
GPU dispatch heuristics (OpComplexity)
Multi-GPU support

Phase 7: Advanced Operations ✅ COMPLETE

Element-wise subtraction (sub) and division (div)
Reductions: min, max, sum, sum_kahan (Kahan summation)
Index finding: argmax, argmin
Vector norms: norm_l2 (Euclidean norm), normalize (unit vector)
Activation functions: ReLU, Leaky ReLU, ELU, Sigmoid, Softmax/Log-Softmax, GELU, Swish/SiLU
Preprocessing: zscore, minmax_normalize, clip
Statistical operations: mean, variance, stddev, covariance, correlation

Phase 8: Matrix Operations 🚧 IN PROGRESS

Matrix type with row-major storage (NumPy-compatible)
Matrix multiplication (matmul) - naive O(n³)
Matrix transpose
Matrix-vector operations
SIMD-optimized matmul
GPU dispatch for large matrices

Phase 8 Progress: Core matrix operations complete with 24 tests passing (611 total).

Phase 7 Status: ✅ COMPLETE - Core vector operations with 587 tests passing. The library now supports:

Element-wise operations: add, sub, mul, div, abs (absolute value), neg (negation/unary minus), clamp (range constraint), lerp (linear interpolation), fma (fused multiply-add), sqrt (square root), recip (reciprocal), pow (power), exp (exponential), ln (natural logarithm), sin (sine), cos (cosine), tan (tangent), asin (arcsine), acos (arccosine), atan (arctangent), sinh (hyperbolic sine), cosh (hyperbolic cosine), tanh (hyperbolic tangent), asinh (inverse hyperbolic sine), acosh (inverse hyperbolic cosine), atanh (inverse hyperbolic tangent), floor (round down), ceil (round up), round (round to nearest), trunc (truncate toward zero), fract (fractional part), signum (sign function), copysign (copy sign from one vector to another), minimum (element-wise minimum of two vectors), maximum (element-wise maximum of two vectors)
Scalar operations: scale (scalar multiplication with full SIMD support)
Dot product: Optimized for ML/scientific computing
Reductions: sum (naive + Kahan), min, max, sum_of_squares, mean (arithmetic average), variance (population variance), stddev (standard deviation), covariance (population covariance between two vectors), correlation (Pearson correlation coefficient)
Activation functions: relu (rectified linear unit - max(0, x)), leaky_relu (leaky ReLU with configurable negative slope), elu (exponential linear unit with smooth gradients), sigmoid (logistic function - 1/(1+e^-x)), softmax (convert logits to probability distribution), log_softmax (numerically stable log of softmax for cross-entropy loss), gelu (Gaussian Error Linear Unit - smooth activation used in transformers like BERT/GPT), swish/silu (Swish/Sigmoid Linear Unit - self-gated activation used in EfficientNet/MobileNet v3)
Preprocessing: zscore (z-score normalization/standardization), minmax_normalize (min-max scaling to [0,1] range), clip (constrain values to [min,max] range)
Index operations: argmin, argmax
Vector norms: L1 (Manhattan), L2 (Euclidean), L∞ (max norm), normalization to unit vectors
Numerical stability: Kahan summation for accurate floating-point accumulation
FMA optimization: Hardware-accelerated fused multiply-add on AVX2 and NEON platforms
Mathematical functions: Element-wise square root, reciprocal, power, exponential, logarithm, trigonometric (sine, cosine, tangent), inverse trigonometric (arcsine, arccosine, arctangent), hyperbolic functions (sinh, cosh, tanh), and inverse hyperbolic functions (asinh, acosh, atanh) for ML (neural network activations), signal processing (waveforms, oscillators, phase recovery, FM demodulation), physics simulations, graphics (perspective projection, inverse transformations, lighting models, camera orientation), navigation (GPS, spherical trigonometry, bearing calculations, heading calculations), robotics (orientation calculations, inverse kinematics, steering angles), and Fourier analysis

Contributing

We welcome contributions! Please follow these guidelines:

Quality Gates: All PRs must pass make quality-gates
- Zero clippy warnings
- 100% formatted code
- All tests passing
- Coverage >85%
Testing: Include tests for new features
- Unit tests for basic functionality
- Property tests for mathematical operations
- Benchmarks for performance claims
Documentation: Update README and docs for new features
Toyota Way Principles:
- Jidoka (built-in quality): Tests catch issues immediately
- Kaizen (continuous improvement): Every PR makes the codebase better
- Genchi Genbutsu (go and see): Benchmark claims, measure reality

License

This project is licensed under the MIT License - see the LICENSE file for details.

Authors

Pragmatic AI Labs - https://github.com/paiml

Acknowledgments

Inspired by NumPy, Eigen, and ndarray
SIMD guidance from std::arch documentation
GPU compute via wgpu project
Quality standards from Toyota Production System
PMAT quality gates by Pragmatic AI Labs

Citation

If you use Trueno in academic work, please cite:

@software{trueno2025,
  title = {Trueno: Multi-Target High-Performance Compute Library},
  author = {Pragmatic AI Labs},
  year = {2025},
  url = {https://github.com/paiml/trueno}
}

Support

Issues: GitHub Issues
Discussions: GitHub Discussions
Email: contact@paiml.com

Built with EXTREME TDD and Toyota Way principles 🚗⚡

trueno 0.1.0