# Changelog
All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## [0.18.0] - 2026-04-06
### Added
- **End-to-end LLM inference engine** (`src/inference/`)
- `GgufFile`: GGUF v2/v3 reader — metadata KV, tensor info, alignment-padded data
- `LlamaModel`: Full transformer — RMSNorm, Q4K fused matmul, RoPE, GQA, SwiGLU FFN, KV cache
- `WeightMatrix` enum: Q4K fused path for hot weights, F32 dequant path for mixed quantization
- Dequantization: Q4_0, Q4_1, Q4K, Q5K, Q6K, Q8_0, F16, BF16
- `generate()`: Autoregressive decode with temperature, top-k, top-p nucleus sampling
- QKV bias support for Qwen2/Qwen3 architectures
- `examples/inference_demo.rs`: CLI — load GGUF, tokenize, generate, print tok/s stats
- **Software-pipelined GPU GEMM kernel** (64×128 CTA, 3-stage cp.async)
- 60.9 TF/s peak at 2048 — 0.52× cuBLAS, TARGET MET
- 18KB shared memory (3×6KB pipeline stages)
- 5 FALSIFY tests, 19/19 contracts pass
### Performance
- **P5c industry baseline**: trueno 807 tok/s vs llama.cpp 2481 tok/s (0.33×) on TinyLlama 5M F16
- GPU GEMM pipeline: +39% over non-pipelined (60.9 vs 43.8 TF/s at 2048)
- All 3630 tests pass
## [0.16.0] - 2026-02-26
### Changed
- Minor version bump for PAIML Sovereign AI Stack coordinated release
- Updated workspace lints and CI configurations
## [0.9.0] - 2025-12-30
### Added
- **CUDA-Tile Behavior GPU Optimizations** (cuda-tile-behavior.md spec)
- `TensorView<T>`: Structured memory view with shape/stride metadata for GPU buffers
- `PartitionView<T>`: Tiling strategy for 16x16 GPU workgroup distribution
- Tiled reduction algorithms: `tiled_sum_2d`, `tiled_max_2d`, `tiled_min_2d`
- `ReduceOp` trait for custom reduction operations (SumOp, MaxOp, MinOp)
- WGSL tiled reduction shaders for GPU compute (pending integration)
- **Intel SDE Support for AVX-512 Testing** (Makefile targets)
- `make install-sde`: Download and install Intel Software Development Emulator
- `make test-avx512-sde`: Run AVX-512 tests under Skylake-X emulation
- `make bench-avx512-sde`: Run AVX-512 benchmarks under emulation
- `make coverage-avx512-sde`: Run AVX-512 coverage under emulation
- Enables AVX-512 testing on CPUs without native support (e.g., Intel Meteor Lake)
- **PTX Optimization Passes** (trueno-gpu)
- FMA fusion pass: Automatically fuse mul+add into fma instructions
- Tile validation: Compile-time validation of tile constraints
- ~33% instruction reduction for FMA-eligible code
### Documentation
- New example: `tiled_reduction_demo` demonstrating GPU memory abstractions
- Updated book chapter: GPU Compute Shaders with tiled reduction algorithms
- GitHub issues #72-#76 filed for CUDA-specific integration work
### Fixed
- AVX-512 dot product test tolerance using relative error for large results
- Clippy warnings for match arms, must_use, and div_ceil
## [0.8.9] - 2025-12-23
### Added
- **Batched Matrix Multiplication** for 3D and 4D tensors (Refs #71)
- `Matrix::batched_matmul`: Shape `[batch, m, k] @ [batch, k, n] -> [batch, m, n]`
- `Matrix::batched_matmul_4d`: Attention pattern `[batch, heads, m, k] @ [batch, heads, k, n]`
- SIMD-accelerated using trueno's matmul backend
- Critical for transformer multi-head attention (Q @ K^T, attn @ V)
- 8 unit tests for correctness and error handling
### Documentation
- Updated `examples/matrix_operations.rs` with batched matmul demos
- Book updates for batched matmul API reference and examples
- Created GitHub issue #71 for BatchedGemmKernel GPU support
## [0.8.8] - 2025-12-17
### Changed
- Updated `trueno-gpu` dependency to v0.3.0
- BiasActivationKernel: Fused bias + activation epilogue (None/ReLU/GELU)
- GemvKernel: Matrix-vector multiply for M=1 matmuls in LLM inference
### Documentation
- Book updates for BiasActivationKernel examples and PTX generation
## [trueno-gpu 0.3.0] - 2025-12-17
### Added
- **BiasActivationKernel**: Fused bias + activation epilogue kernel for GEMM operations
- Three activation variants: None (bias only), ReLU, GELU
- Builder pattern API: `BiasActivationKernel::new(n, bias_size).with_relu()`
- GELU uses fast `ex2.approx` for exponential approximation
- `bias_size` baked into kernel at generation time for efficiency
- 22 tests including property-based and falsification tests
- 100% mutation coverage (2 caught by tests, 4 by type system)
- **GemvKernel**: Matrix-vector multiply optimized for M=1 matmuls
- One warp (32 threads) per output element
- Warp shuffle reduction for efficient dot products
- Critical path for LLM token generation
### Documentation
- Added Examples section to README with run commands
- Updated Available Kernels table with BiasActivation, GEMV, Q5_K/Q6_K
- Book documentation for BiasActivationKernel with testing commands
## [0.8.5] - 2025-12-15
### Added
- **Simulation Testing Framework** (`simulation` module) - TRUENO-SPEC-012
- `SimRng`: Deterministic PCG-based RNG for reproducible testing
- `BackendSelector`: Intelligent backend selection with configurable thresholds
- `JidokaGuard`: Toyota-style stop-on-defect quality checks (NaN/Inf detection)
- `HeijunkaScheduler`: Load-leveled test scheduling across backends
- `BufferRenderer`: RGBA buffer rendering for visual regression testing
- `ColorPalette`: Viridis and grayscale palettes for heatmap visualization
- `GoldenBaseline`: Golden file comparison for deterministic validation
- `StressTestConfig/Result`: Stress testing infrastructure with anomaly detection
- `BackendTolerance`: Cross-backend comparison tolerance configuration
- **100 Falsifiable Claims** - Comprehensive test suite validating:
- Backend selection logic (Claims 1-15)
- Determinism guarantees (Claims 16-30)
- SIMD operation correctness (Claims 31-50)
- PTX kernel patterns (Claims 51-65)
- WGPU shader correctness (Claims 66-80)
- Visual regression framework (Claims 81-90)
- Stress testing infrastructure (Claims 91-100)
### Fixed
- `make coverage-check` now correctly parses coverage percentage
- Coverage excludes external `simular` dependency for accurate metrics
## [trueno-gpu 0.1.0] - 2025-12-10
### Added
- **trueno-gpu sub-crate**: Pure Rust PTX generation for NVIDIA CUDA
- No LLVM, no nvcc, no external dependencies required for code generation
- Builder pattern API for constructing PTX modules and kernels
- PTX ISA 8.0 compliant output
- **PTX Code Generation** (`ptx` module)
- `PtxModule`: Module builder with version, target, address size configuration
- `PtxKernel`: Kernel builder with parameters, shared memory, body generation
- `PtxBuilder`: Instruction builder with virtual register allocation
- Type system: U8, U16, U32, U64, S8, S16, S32, S64, F16, F32, F64, Pred
- Special registers: TidX/Y/Z, CtaIdX/Y/Z, NtidX/Y/Z
- **Hand-Optimized Kernels** (`kernels` module)
- **GEMM**: Matrix multiplication with 3 variants
- Naive: Simple O(n³) implementation
- Tiled: Shared memory tiling for cache optimization
- Tensor Core: WMMA instructions for fp16 acceleration
- **Softmax**: Numerically stable softmax with warp shuffle reduction
- **LayerNorm**: Fused layer normalization with 2 variants
- Warp shuffle: Uses shuffle instructions for parallel reduction
- Shared memory: Uses shared memory for larger dimensions
- **Attention**: FlashAttention-style tiled attention
- Online softmax algorithm (never materializes N×N matrix)
- Causal masking support
- Configurable Q/KV tile sizes
- **Quantize**: Q4_K dequantization-fused GEMM
- 4-bit quantized weights (32 weights per 18-byte block)
- Fused dequantization during matmul
- **Supporting Modules**
- `driver`: CUDA driver API FFI (optional, for GPU execution)
- `memory`: GPU memory management abstractions
- `backend`: Multi-backend abstraction layer
- `error`: Error types and Result alias
### Quality
- **145 unit tests** (100% passing)
- **2 doc tests** (100% passing)
- Zero clippy warnings
- EXTREME TDD methodology applied throughout
## [0.8.1] - 2025-12-08
### Added ✨
- **Quick Start Example** (`examples/quickstart.rs`)
- Comprehensive example showcasing all core Trueno features in one file
- Vector operations, matrix math, eigendecomposition, activations, layer norm
- Recommended starting point for new users
- **Enhanced API Documentation**
- `book/src/api-reference/vector-operations.md` - Complete vector API reference
- `book/src/api-reference/matrix-operations.md` - Matrix operations guide
- `book/src/api-reference/eigendecomposition.md` - SymmetricEigen documentation
### Changed 🔄
- Updated examples README with all current examples including `symmetric_eigen`, `hash_demo`, `gpu_batch_demo`
- Applied `cargo fmt` formatting fixes across codebase
- Installed PMAT TDG enforcement hooks for quality gates
### Quality 📊
- **Repository Score**: 100/100 (A+)
- **TDG Score**: 90.4/100 (A)
- **Rust Project Score**: 143.9/134 (107.4%, A+)
- All 954 tests passing
- Benchmarks verified: dot product 11-12x speedup (AVX-512), eigen 1.3-2.2x faster than nalgebra
## [0.7.3] - 2025-11-25
### Added ✨
- **WebGPU for WASM** (`gpu-wasm` feature)
- Cross-platform GPU compute: same code runs on native and browser
- Async-first API design: all GPU operations have `*_async` variants
- Runtime detection: `runtime::sync_available()` for platform-specific code paths
- New `runtime` module (`src/backends/gpu/runtime.rs`) for platform abstraction
- Enables [trueno-viz](https://github.com/paiml/trueno-viz) browser-based GPU visualization
- **Cross-platform GPU API**
- `GpuDevice::new_async()` - Works on all platforms (native + WASM)
- `GpuDevice::is_available_async()` - Async availability check
- All operations now have async variants: `relu_async`, `sigmoid_async`, `matmul_async`, etc.
- Sync wrappers remain available on native platforms only
### Changed 🔄
- GPU device initialization refactored to use `runtime::block_on()` instead of direct `pollster::block_on()`
- Conditional compilation: sync methods require `#[cfg(all(feature = "gpu", not(target_arch = "wasm32")))]`
- All private async methods now public (`pub async fn *_async`)
### Documentation 📚
- **GPU Backend chapter** (`book/src/architecture/gpu-backend.md`) - Complete rewrite
- Platform support matrix (Linux/macOS/Windows/WASM)
- Feature flag comparison (`gpu` vs `gpu-wasm`)
- Async-first API examples
- trueno-viz integration guide
- Runtime detection patterns
- **GPU Performance chapter** - Added WebGPU/WASM section
- Platform differences table
- Async API usage examples
- trueno-viz reference
### Fixed 🐛
- `select_backend_for_operation` parameter name: `_op_type` → `op_type` (parameter is used)
- Type inference in empty slice comparisons: `&[]` → `&[] as &[f32]`
- Unused variable in WASM backend: `scale` → `_scale`
### Dependencies 📦
- Added `wasm-bindgen-futures` (0.4) for WASM async support
- Added `wasm-bindgen` (0.2) for WASM bindings
- Added `web-sys` (0.3) for browser APIs (console logging)
### Testing ✅
- All 903+ tests passing
- Coverage: 90.40% (exceeds 90% requirement)
- Added `required-features = ["gpu"]` for `gpu_batch_demo` example
## [0.7.1] - 2025-11-24
### Added ✨
- **EXTREME PMAT Integration** - O(1) Quality Gates
- Enhanced validation workflow for technical debt grading
- Automated quality metrics enforcement
- Repository health score tracking (minimum 90/110)
- **Golden Trace Validation** (Renacer v0.6.2+)
- Syscall-level performance regression detection
- Captured golden traces for 5 core operations (backend_detection, matrix_operations, activation_functions, performance_demo, ml_similarity)
- Performance assertions via `renacer.toml` (CI fails on regression)
- Comprehensive documentation: `docs/integration-report-golden-trace.md`
- Book chapter: `book/src/performance/golden-trace-validation.md`
- GitHub Actions workflow for automated validation
- **GPU Batch API Example**
- Demonstration example for async GPU command batching
- Shows 3x transfer reduction for chained operations
### Fixed 🐛
- Replaced `.unwrap()` with `.expect()` in examples for better error messages
- Corrected relative paths in golden-trace-validation.md documentation
- Fixed formatting issues across examples
### Infrastructure 🔧
- Added GitHub Actions workflow for golden trace validation
- Updated gitignore: `direct_bench.log`, `benchmark_run.log`
### Documentation 📚
- Updated book: async GPU batch API now available (v0.3.0)
- Enhanced golden trace validation documentation
- Improved performance budget compliance reporting
### Dependencies 📦
- **Updated all dependencies to latest crates.io versions** (2025-11-23)
- `wgpu`: 22.0 → 27.0.1 (major update)
- Fixed breaking changes: `entry_point` now uses `Option<&str>`
- Updated `request_adapter` API (now returns `Result`)
- Removed `Maintain::Wait` (polling now automatic)
- Added `experimental_features` and `trace` to `DeviceDescriptor`
- `criterion`: 0.5 → 0.7 (minor update)
- Replaced `criterion::black_box` with `std::hint::black_box`
- `thiserror`: 2.0 → 2.0.17
- `rayon`: 1.10 → 1.11
- `pollster`: 0.3 → 0.4
- `bytemuck`: 1.14 → 1.24
- `proptest`: 1.8 → 1.9
### Testing ✅
- All 942 tests passing with updated dependencies (up from 936)
- 44/44 GPU tests pass with wgpu v27 (including 14 batch tests)
- Benchmark infrastructure verified with criterion 0.7
- Zero clippy warnings maintained
- Coverage: 90%+ maintained (EXTREME TDD requirement)
### Quality 🎯
- Test coverage: 90.41% (exceeds 90% requirement)
- All quality gates passing (lint, format, tests, coverage)
- Pre-commit hooks enforce coverage threshold
- PMAT Technical Debt Grade: B+ minimum enforced
## [0.7.0] - 2025-11-22
### Added ✨
- **Async GPU Command Batching API** (v0.3.0 deliverable - Phase 1)
- **Goal**: Reduce GPU transfer overhead by 2x for chained operations
- **New types**:
- `GpuCommandBatch`: Command builder for batching GPU operations
- `BufferId`: Type-safe buffer identifier for intermediate results
- **Operations supported**: **10 operations total**
- **Activations**: `relu`, `sigmoid`, `tanh`, `swish`, `gelu`
- **Arithmetic**: `add`, `sub`, `mul`, `scale`, `dot`
- **Architecture**: Command Builder pattern for explicit batching control
- `upload()`: Queue data for GPU upload
- Operation methods: Queue operations (no GPU execution)
- `execute()`: Execute all queued operations in single batch
- `read()`: Download results from GPU
- **Transfer reduction**:
- Before: `relu + scale + add` = 6 transfers (3 up, 3 down)
- After: 2 transfers (1 up, 1 down) = **3x reduction**
- **New GPU shaders**:
- `SCALE_SHADER`: Element-wise scalar multiplication
- `VEC_MUL_SHADER`: Element-wise vector multiplication
- `VEC_SUB_SHADER`: Element-wise vector subtraction
- **Tests**: 14 comprehensive tests
- Buffer management tests (allocation, operation queuing, error handling)
- Operation tests (mul, dot, sigmoid, tanh, swish, gelu, sub)
- Integration tests (end-to-end execution, chained activations)
- **Dependencies**: Added `tokio` (dev-dependency) for async test support
- **Benchmarks** (`benches/async_gpu_ops.rs`):
- `bench_sync_chained_ops`: Traditional sync API (6 transfers for 3 ops)
- `bench_async_chained_ops`: New async batch API (2 transfers for 3 ops)
- `bench_single_op_comparison`: Sync vs async for single operation
- `bench_deep_chain`: 5 chained operations (10→2 transfers = 5x reduction)
- **Usage**: `cargo bench --bench async_gpu_ops --features gpu`
- **API Enhancement**: `GpuDevice` now implements `Clone` (wgpu devices are Arc-based)
## [0.7.0] - 2025-11-22
### Performance - Phase 3: Large Matrix Optimization 🚀
**Achievement**: 18% improvement for 1024×1024 matrices via 3-level cache blocking
- **3-level cache hierarchy** (L3 → L2 → micro-kernel) for matrices ≥512×512
- L3 blocks: 256×256 (fits in 4-16MB L3 cache)
- L2 blocks: 64×64 (fits in 256KB L2 cache)
- Micro-kernel: 4×1 AVX2/FMA (register blocking)
- Smart threshold: Only activates for matrices ≥512×512
- **Zero-allocation implementation**:
- No Vec allocations in hot path
- Code duplication with if/else branches
- Preserves fast 2-level path for smaller matrices
- **Performance results**:
- 1024×1024: **47.4 ms (18% faster than v0.6.0's 57.8 ms)** ✅
- 512×512: ~5.3 ms (8.5% improvement)
- 256×256: No regression (uses 2-level path)
- Target: Within 1.5× of NumPy (currently 1.64×)
- **Testing**:
- Added `test_matmul_3level_blocking` for 512×512 matrices
- 878 tests passing (all existing tests pass)
- Coverage: 90.41% (improved from 90.00%)
### Quality & Testing
- **Test coverage: 90.20%** (trueno library, exceeds 90% EXTREME TDD requirement)
- Added 60+ new tests across xtask tooling and core library
- Fixed clippy warnings (needless_range_loop)
- Updated coverage policy: xtask (dev tooling) excluded from main coverage requirement
- All quality gates passing: lint, format, tests, coverage
### Documentation
- Updated Phase 2 book chapter with 3-level blocking details
- Added benchmark data for 512×512 and 1024×1024
- GitHub issue #34 tracking Phase 3 progress
## [0.6.0] - 2025-11-21
### Performance - Phase 2: NumPy Performance Parity 🎯
**Major Achievement**: Pure Rust matches NumPy/OpenBLAS performance at 256×256 matrices
- **4×1 AVX2 micro-kernel** implementation (Pure Rust, zero external dependencies)
- Fused Multiply-Add (FMA) instructions for 3× throughput
- Register blocking: 4 YMM accumulators stay in CPU registers
- Memory bandwidth optimization: Load B column once, reuse for 4 A rows (4× reduction)
- Horizontal sum optimization using AVX2 intrinsics
- **Performance results** (vs NumPy 2.3.5 + OpenBLAS):
- 256×256: **538 μs (Trueno) vs 574 μs (NumPy) = 6% FASTER** ✅
- 128×128: **72 μs (Trueno) vs 463 μs (NumPy) = 6.4× FASTER** ✅
- Improvement over v0.5.0: 2.3-2.6× faster
- Efficiency: 77% of theoretical AVX2 peak (48 GFLOPS @ 3.0 GHz)
- **Implementation details**:
- `matmul_microkernel_4x1_avx2()`: Processes 4 rows × 1 column simultaneously
- `horizontal_sum_avx2()`: Reduces 8 f32 values to scalar
- Automatic dispatch for AVX2/AVX512 backends
- Fallback to standard SIMD for other backends
- **Comprehensive testing**:
- 11 micro-kernel unit tests added
- `test_horizontal_sum_avx2`: 5 test cases (all ones, sequence, signs, large values, mixed)
- `test_matmul_microkernel_4x1_avx2`: 6 test cases (simple dots, identity, non-aligned, negative, zero, FMA verification)
- Backend equivalence: Naive vs micro-kernel correctness verified
- Coverage: 90.63% (exceeds 90% requirement)
### Documentation
- **book/src/advanced/phase2-microkernel.md**: Complete Phase 2 micro-kernel guide
- Motivation and design goals
- Micro-kernel architecture (4×1 design rationale)
- AVX2 implementation with code walkthrough
- Performance analysis and efficiency breakdown
- Testing strategy and coverage details
- Lessons learned (what worked, what didn't, trade-offs)
- Future optimizations roadmap
- **ROADMAP.md**: Updated with Phase 2 completion and Phase 3 planning
- **GitHub issue #34**: Phase 3 (large matrix optimization) opened
### Quality
- **Test Coverage**: 877 tests passing, 90.63% library coverage
- **Clippy**: Zero warnings on all features
- **Format**: 100% rustfmt compliant
- **PMAT**: All quality gates passing
### Closed Issues
- Phase 2 of matrix multiplication optimization (achieving NumPy parity)
## [0.5.0] - 2025-11-21
### Performance - Matrix Multiplication 🚀
**Major Achievement**: Matrix multiplication now **2.79× faster than NumPy** at 128×128 matrices
- **Cache-aware blocking algorithm** with L2 optimization (64×64 blocks)
- Implements 2-level cache hierarchy optimization (L2/L1)
- Smart thresholding: matrices ≤32 use simple path (avoids blocking overhead)
- 3-level nested loops (ii/jj/kk) with SIMD micro-kernels
- Zero Vector allocations via direct backend dot() calls
- **Performance results** (vs NumPy baseline):
- 128×128 matrices: **166 μs (Trueno) vs 463 μs (NumPy) = 2.79× FASTER** ✅
- Original problem: Trueno was 2.5× slower (Issue #10)
- Total improvement: 5.5× faster than v0.4.0
- Phase 1 goal (1.5-2× speedup) exceeded by 40%
- **Comprehensive testing**:
- 4 new blocking test suites added
- `test_matmul_blocking_small_matrices` (8×8, 16×16, 32×32)
- `test_matmul_blocking_medium_matrices` (64×64, 128×128, 256×256)
- `test_matmul_blocking_non_aligned_sizes` (33×33, 65×65, 100×100, 127×127)
- `test_matmul_blocking_large_matrices` (256×256 with detailed analysis)
- Backend equivalence verified (naive vs blocked implementations)
### Fixed
- **Performance regression** (Issue #26): Backend selection caching
- Implemented `OnceLock` for one-time backend detection
- Eliminates 3-5% overhead from repeated `is_x86_feature_detected!()` calls
- Performance improvement: 4-15% faster than v0.4.0
- Added `test_backend_selection_is_cached` to verify caching behavior
### Documentation
- **PERFORMANCE_GUIDE.md** updated with matrix multiplication section
- Comprehensive benchmark table (16×16 through 256×256)
- Performance characteristics and sweet spot analysis
- Implementation details (blocking, thresholding, SIMD)
- Tuning tips for different matrix sizes
- Cache-aware blocking explanation
### Quality
- **Test Coverage**: 874 tests passing, 90.72% library coverage (exceeds 90% requirement)
- **TDG Score**: 85.5/100 (A-) - architectural limit maintained
- **Clippy**: Zero warnings on all features
- **Format**: 100% rustfmt compliant
- **PMAT**: All quality gates passing, zero critical defects
### Closed Issues
- Issue #10: Matrix multiplication SIMD performance (Phase 1 complete)
- Issue #26: Performance regression in v0.4.1 (backend caching fix)
## [0.4.1] - 2025-11-20
### Added
- **GPU test coverage improvements**: Comprehensive testing for GPU backend operations
- Added 6 new GPU tests for `matmul()` and `convolve2d()` operations
- `test_gpu_matmul_basic`, `test_gpu_matmul_identity`, `test_gpu_matmul_non_square`
- `test_gpu_convolve2d_basic`, `test_gpu_convolve2d_identity`, `test_gpu_convolve2d_averaging`
- GPU device.rs coverage: 68.44% → 98.44% (+30% improvement)
### Fixed
- **Test stability**: Fixed flaky `test_matvec_associativity` property test
- Relaxed floating-point tolerance from 1% to 2% for AVX-512 backend
- Accounts for increased rounding error accumulation in 512-bit SIMD operations
- All 834 tests now pass reliably across all backends
### Changed
- **Coverage reporting**: Excluded xtask build tools from coverage metrics
- Updated Makefile to use `--exclude-from-report xtask`
- Library code coverage: **90.61%** (target: >90%) ✅
- Overall coverage: 88.30% line, 94.42% function, 89.63% region
### Quality
- **Test Coverage**: 834 tests passing, >90% library coverage achieved
- **TDG Score**: 88.1/100 (A-) - architectural limit maintained
- **Clippy**: Zero warnings on all features
- **Format**: 100% rustfmt compliant
## [0.4.0] - 2025-11-19
### Changed
- **Refactored multi-backend dispatch**: Introduced dispatch macros to reduce code duplication
- `dispatch_binary_op!` macro for add/sub/mul/div operations (reduces 50-line match statements to 1 line)
- `dispatch_reduction!` macro for sum/max/min/norm operations (reduces 50-line match statements to 1 line)
- Eliminates ~1000 lines of redundant backend dispatch code
- Maintains 100% functional equivalence (all 827 tests passing)
- Improves maintainability: new backends now require single macro update
- **Note**: TDG score unchanged (88.1 A-) because `syn` expands macros before analysis
- This is correct behavior - cyclomatic complexity remains unchanged
- Macro pattern matches unavoidable architectural complexity from multi-platform SIMD dispatch
### Added
- **Additional vector operations**: Expanded functionality with ML/numerical computing primitives
- `norm_l2()`: L2 norm with AVX-512 (6-9x speedup)
- `norm_l1()`, `norm_linf()`: L1 and L-infinity norms
- `scale()`, `abs()`, `clamp()`: Basic vector transformations
- `lerp()`, `fma()`: Linear interpolation and fused multiply-add
- `relu()`, `sigmoid()`, `gelu()`, `swish()`, `tanh()`: Neural network activation functions
- `exp()`: Exponential function with range reduction
- 827 tests passing (all operations covered)
### Infrastructure
- **PMAT integration improvements**: Created issues for enhanced TDG workflow
- Issue #78: Request for `pmat tdg --explain` mode with function-level complexity breakdown
- Issue #76: Documented YAML parsing friction with `pmat work` commands
- Discovered: TDG correctly analyzes macro-expanded code via `syn` AST parser
### Quality
- **Test Coverage**: 827 tests passing, >90% coverage maintained
- **TDG Score**: 88.1/100 (A-) - architectural limit for multi-backend SIMD dispatch
- **Clippy**: Zero warnings on all features
- **Format**: 100% rustfmt compliant
## [0.3.0] - 2025-11-19
### Added
- **AVX-512 backend infrastructure**: Initial implementation (Phase 1 + Phase 2 + Phase 3 + Phase 4 + Phase 5)
- New `Avx512Backend` processes 16 × f32 elements per iteration (2x AVX2's 8)
- **Implemented `add()` operation**: Memory-bound (~1x speedup, baseline implementation)
- **Implemented `dot()` operation**: Compute-bound (11-12x speedup, ✅ **EXCEEDS 8x TARGET**)
- Uses `_mm512_fmadd_ps` for fused multiply-add (single instruction for acc + va * vb)
- Uses `_mm512_reduce_add_ps` for horizontal sum (simpler than AVX2's manual reduction)
- 9 comprehensive unit tests (basic, aligned, non-aligned, large, backend equivalence, special values, zero/orthogonal)
- **Implemented `sum()` operation**: Compute-bound (8-11x speedup, ✅ **EXCEEDS 8x TARGET**)
- Uses `_mm512_add_ps` for 16-way parallel accumulation
- Uses `_mm512_reduce_add_ps` for horizontal sum (single intrinsic)
- 9 comprehensive unit tests (basic, aligned, non-aligned, large, backend equivalence, negative values, remainder sizes)
- **Implemented `max()` operation**: Compute-bound (8-12x speedup, ✅ **EXCEEDS 8x TARGET**)
- Uses `_mm512_max_ps` for 16-way parallel comparison
- Uses `_mm512_reduce_max_ps` for horizontal max (single intrinsic)
- 5 comprehensive unit tests (basic, aligned, non-aligned, negative values, backend equivalence)
- **Implemented `min()` operation**: Compute-bound (8-12x speedup, ✅ **EXCEEDS 8x TARGET**)
- Uses `_mm512_min_ps` for 16-way parallel comparison
- Uses `_mm512_reduce_min_ps` for horizontal min (single intrinsic)
- 5 comprehensive unit tests (basic, aligned, non-aligned, positive values, backend equivalence)
- **Implemented `argmax()` operation**: Hybrid operation (3.2-3.3x speedup, limited by scalar index scan)
- Uses `_mm512_max_ps` + `_mm512_reduce_max_ps` to find maximum value (16-way SIMD)
- Scalar `.position()` scan to find index of max value (dominates runtime)
- 6 comprehensive unit tests (basic, aligned, non-aligned, negative values, max at start, backend equivalence)
- **Implemented `argmin()` operation**: Hybrid operation (3.2-3.3x speedup, limited by scalar index scan)
- Uses `_mm512_min_ps` + `_mm512_reduce_min_ps` to find minimum value (16-way SIMD)
- Scalar `.position()` scan to find index of min value (dominates runtime)
- 6 comprehensive unit tests (basic, aligned, non-aligned, positive values, min at start, backend equivalence)
- Backend selection: Auto-detects AVX-512F support via `is_x86_feature_detected!()`
- Available on Intel Skylake-X/Sapphire Rapids (2017+) and AMD Zen 4 (2022+)
- All 819 tests passing (779 + 9 add + 9 dot + 9 sum + 5 max + 5 min + 6 argmax + 6 argmin + 1 = 819 unique)
### Infrastructure
- **GitHub Pages deployment**: Automated documentation deployment workflow
- Combines mdBook guide and rustdoc API documentation
- Deploys to GitHub Pages on push to main branch
- API documentation available at `/api` subdirectory
- Workflow file: `.github/workflows/deploy-docs.yml`
### Documentation
- **Fixed Intel Intrinsics Guide reference**: Updated to mirror URL
- Original Intel URL blocked automated link validation (HTTP 403)
- Now references automation-friendly mirror at `laruence.com/sse`
- Passes PMAT `validate-docs` quality gate (136/136 links valid)
### Fixed
- **AVX512 FMA tolerance**: Increased tolerance for 3-way matmul associativity
- Addresses floating-point precision differences in AVX-512 FMA operations
- Commit 6cd7ba2
### Performance
- **AVX-512 add() benchmarks**: Memory-bound operation analysis
- Size 100: Scalar 50.9ns, AVX2 44.4ns (1.15x), **AVX512 44.8ns (1.14x)**
- Size 1000: Scalar 113.7ns, AVX2 101.1ns (1.12x), **AVX512 117.3ns (0.97x)**
- Size 10000: Scalar 1.117µs, AVX2 1.106µs (1.01x), **AVX512 1.122µs (0.99x)**
- **Conclusion**: add() is memory-bound (~1x SIMD benefit)
- Memory bandwidth saturation prevents AVX-512 benefits for simple element-wise ops
- Consistent with existing patterns: add/sub/div/fma/scale/abs all memory-bound (~1x speedup)
- AVX-512's 2x register width (16 vs 8 elements) does not help when memory is bottleneck
- **AVX-512 dot() benchmarks**: Compute-bound operation ✅ **EXCEEDS 8x TARGET**
- Size 100: Scalar 44.2ns, AVX2 8.9ns (4.95x), **AVX512 8.4ns (5.3x)**
- Size 1000: Scalar 607ns, AVX2 94ns (6.5x), **AVX512 49ns (12.5x)** ✅
- Size 10000: Scalar 6.31µs, AVX2 1.03µs (6.1x), **AVX512 551ns (11.5x)** ✅
- **Conclusion**: dot() is compute-bound (11-12x SIMD speedup achieved!)
- FMA intrinsic (_mm512_fmadd_ps) provides massive benefit for multiply-accumulate patterns
- AVX-512's 16-element-wide FMA + horizontal reduction delivers 1.9x speedup over AVX2
- Validates ROADMAP success criteria: "8x speedup over scalar (AVX-512)" ✅
- Confirms hypothesis: Compute-bound operations benefit from AVX-512, memory-bound do not
- **AVX-512 sum() benchmarks**: Compute-bound operation ✅ **EXCEEDS 8x TARGET**
- Size 100: Scalar 36.3ns, AVX2 5.6ns (6.5x), **AVX512 5.7ns (6.4x)**
- Size 1000: Scalar 600ns, AVX2 55ns (10.9x), **AVX512 54ns (11.0x)** ✅
- Size 10000: Scalar 6.33µs, AVX2 768ns (8.2x), **AVX512 767ns (8.3x)** ✅
- **Conclusion**: sum() is compute-bound (8-11x SIMD speedup achieved!)
- 16-way parallel accumulation with `_mm512_add_ps` + `_mm512_reduce_add_ps`
- AVX-512 matches AVX2 performance (both memory-bandwidth limited for reduction)
- Validates ROADMAP success criteria: "8x speedup over scalar (AVX-512)" ✅
- Pattern: Reduction operations achieve target speedup despite memory constraints
- **AVX-512 max() benchmarks**: Compute-bound operation ✅ **EXCEEDS 8x TARGET**
- Size 100: Scalar 26.9ns, AVX2 4.3ns (6.2x), **AVX512 4.2ns (6.3x)**
- Size 1000: Scalar 390ns, AVX2 40ns (9.8x), **AVX512 32ns (12.1x)** ✅
- Size 10000: Scalar 4.02µs, AVX2 482ns (8.3x), **AVX512 488ns (8.2x)** ✅
- **Conclusion**: max() is compute-bound (8-12x SIMD speedup achieved!)
- 16-way parallel comparison with `_mm512_max_ps` + `_mm512_reduce_max_ps`
- AVX-512 matches AVX2 performance (both memory-bandwidth limited)
- Validates ROADMAP success criteria ✅
- **AVX-512 min() benchmarks**: Compute-bound operation ✅ **EXCEEDS 8x TARGET**
- Size 100: Scalar 26.1ns, AVX2 4.2ns (6.2x), **AVX512 4.2ns (6.2x)**
- Size 1000: Scalar 371ns, AVX2 31ns (12.0x), **AVX512 32ns (11.6x)** ✅
- Size 10000: Scalar 3.93µs, AVX2 484ns (8.1x), **AVX512 492ns (8.0x)** ✅
- **Conclusion**: min() is compute-bound (8-12x SIMD speedup achieved!)
- 16-way parallel comparison with `_mm512_min_ps` + `_mm512_reduce_min_ps`
- AVX-512 matches AVX2 performance (both memory-bandwidth limited)
- Validates ROADMAP success criteria ✅
- **AVX-512 argmax() benchmarks**: Hybrid operation (SIMD find + scalar scan)
- Size 100: Scalar 46.2ns, AVX2 21.8ns (2.1x), **AVX512 21.2ns (2.2x)**
- Size 1000: Scalar 580ns, AVX2 182ns (3.2x), **AVX512 184ns (3.2x)**
- Size 10000: Scalar 5.95µs, AVX2 1.79µs (3.3x), **AVX512 1.78µs (3.3x)**
- **Conclusion**: argmax() achieves 3.2-3.3x speedup (limited by scalar index scan)
- SIMD phase: 16-way parallel max finding with `_mm512_max_ps` + `_mm512_reduce_max_ps`
- Scalar phase: `.position()` scan to find index of max value (dominates runtime)
- **Not** targeting 8x speedup - argmax is fundamentally a two-phase algorithm
- **AVX-512 argmin() benchmarks**: Hybrid operation (SIMD find + scalar scan)
- Size 100: Scalar 45.8ns, AVX2 21.5ns (2.1x), **AVX512 21.6ns (2.1x)**
- Size 1000: Scalar 581ns, AVX2 180ns (3.2x), **AVX512 181ns (3.2x)**
- Size 10000: Scalar 5.93µs, AVX2 1.76µs (3.4x), **AVX512 1.79µs (3.3x)**
- **Conclusion**: argmin() achieves 3.2-3.3x speedup (limited by scalar index scan)
- SIMD phase: 16-way parallel min finding with `_mm512_min_ps` + `_mm512_reduce_min_ps`
- Scalar phase: `.position()` scan to find index of min value (dominates runtime)
- **Not** targeting 8x speedup - argmin is fundamentally a two-phase algorithm
### Quality
- **Mutation testing improvements**: Backend error handling test
- Killed Backend::Auto deletion mutant (src/vector.rs:3145) with defensive error test
- Improved test coverage for backend fallback paths
- Known limitation: 3 GPU mutants (tanh, is_available, reduce_sum) require GPU hardware to test
- Tests skip gracefully when GPU unavailable (prevents CI breakage)
- **Bashrs enforcement**: Shell script quality validation
- Replaced C-grade shell validation with A-grade Rust xtask
- Enforces bashrs validation for Makefile and all shell scripts
- Handles missing shell scripts gracefully
---
## [0.2.2] - 2025-11-18
### Fixed
- **CRITICAL**: Missing SIMD implementation for `abs()` operation (Issue #2)
- Blocked downstream projects (realizar)
- Added implementations in AVX2Backend, SSE2Backend, ScalarBackend
- Uses bitwise AND with `0x7FFFFFFF` to clear sign bit
- All 109 tests pass, backend equivalence verified
### Performance
- **argmax/argmin SIMD optimization**: 2.8-3.1x speedup
- Replaced scalar index scan with SIMD index tracking
- Uses comparison masks and blend operations
- Processes 8 elements/iteration (AVX2) or 4 elements/iteration (SSE2)
### Added
- Comprehensive performance benchmarks for 7 operations:
- `norm_l1()` - L1 norm (4-11x SIMD speedup, compute-bound)
- `norm_l2()` - L2 norm (4-9x SIMD speedup, compute-bound)
- `scale()` - Scalar multiplication (~1x speedup, memory-bound)
- `fma()` - Fused multiply-add (~1x speedup, memory-bound despite FMA hardware)
- `sub()` - Subtraction (~1x speedup, memory-bound)
- `div()` - Division (~1x speedup, memory-bound)
- `abs()` - Absolute value (~1.1x speedup, memory-bound)
- `min()` - Minimum reduction (6-10x SIMD speedup)
### Documentation
- **Performance pattern analysis documented**:
- **Compute-bound operations** (4-12x SIMD benefit): min, argmax/argmin, norm_l1, norm_l2, dot, sum
- **Memory-bound operations** (~1x SIMD benefit): sub, div, fma, scale, abs
- Root cause: Memory bandwidth saturation prevents SIMD benefit for simple operations
### Testing
- All 889 tests passing (759 unit + 21 integration + 109 doc)
- Zero clippy warnings
- EXTREME TDD methodology with RED-GREEN-REFACTOR cycle applied for abs()
### Closes
- Issue #2: Missing abs trait implementation in VectorBackend
---
## [0.2.1] - 2025-11-18
### Added
#### Activation Functions
- `hardswish()` - MobileNetV3 efficient activation
- `mish()` - Modern swish alternative (x * tanh(softplus(x)))
- `selu()` - Self-normalizing exponential linear unit
- `relu()` - ReLU with EXTREME TDD
#### Math Operations
- `log2()` - Base-2 logarithm (information theory, entropy)
- `log10()` - Base-10 logarithm (decibels, pH)
#### Documentation
- Comprehensive GPU performance analysis (`docs/performance-analysis.md`)
- Performance baselines for regression detection
### Changed
#### Critical GPU Performance Optimization
- **GPU disabled for ALL element-wise operations** (2-65,000x slower than scalar!)
- **GPU enabled ONLY for matmul** (2-10x speedup at 500×500+)
- Updated OpComplexity thresholds based on empirical benchmarks
- Lowered matmul GPU threshold from 1000 to 500 (proven 2x speedup)
#### Documentation Updates
- README updated with honest GPU performance claims
- ROADMAP pivoted from GPU to SIMD optimization strategy
### Fixed
- False GPU speedup claims (advertised 10-50x, actual was 2-65,000x SLOWER)
- GPU overhead analysis: 14-55ms fixed cost per operation
### Performance
#### GPU Benchmark Results (Empirical - Genchi Genbutsu)
| vec_add | 1M | 510x SLOWER | ❌ GPU disabled |
| dot | 1M | 93x SLOWER | ❌ GPU disabled |
| relu | 1M | 824x SLOWER | ❌ GPU disabled |
| matmul | 500×500 | **2.01x faster** | ✅ GPU enabled |
| matmul | 1000×1000 | **9.59x faster** | ✅ GPU enabled |
**Root Cause**: 14-55ms GPU overhead (buffer allocation + PCIe transfer) dominates execution time for element-wise ops.
### Testing
- 33 new tests for activations (hardswish, mish, selu)
- 14 new tests for log2/log10
- Property-based tests for all new functions
- Total: 699+ tests
### Closes
- Issue #1: Element-wise transcendental functions (log2, ln, exp)
---
## [0.1.0] - 2025-01-17
### Added
#### Core Types
- `Vector<T>` type with SIMD-optimized operations
- `Matrix<T>` type with row-major storage (NumPy-compatible)
- `Backend` enum for multi-target execution (Scalar, SSE2, AVX, AVX2, AVX512, NEON, WasmSIMD, GPU)
- Runtime CPU feature detection with automatic backend selection
#### Vector Operations (87 total)
- **Element-wise**: add, sub, mul, div, abs, neg, clamp, lerp, fma, sqrt, recip, pow, exp, ln, floor, ceil, round, trunc, fract, signum, copysign, minimum, maximum
- **Trigonometric**: sin, cos, tan, asin, acos, atan
- **Hyperbolic**: sinh, cosh, tanh, asinh, acosh, atanh
- **Dot product**: Optimized with SIMD and FMA
- **Reductions**: sum (naive + Kahan), min, max, sum_of_squares, mean, variance, stddev, covariance, correlation
- **Activation functions**: relu, leaky_relu, elu, sigmoid, softmax, log_softmax, gelu, swish/silu
- **Preprocessing**: zscore, minmax_normalize, clip
- **Index operations**: argmin, argmax
- **Vector norms**: L1, L2, L∞, normalization to unit vectors
- **Scalar operations**: scale (scalar multiplication with full SIMD)
#### Matrix Operations
- Matrix multiplication (matmul) - naive O(n³) algorithm
- Matrix transpose - O(mn) swap operation
- Constructors: new(), from_vec(), zeros(), identity()
- Accessors: get(), get_mut(), rows(), cols(), shape(), as_slice()
#### Performance Optimizations
- SSE2 SIMD (128-bit): 3-4x speedup on dot product vs scalar
- AVX2 SIMD (256-bit): Additional 1.8x speedup with FMA
- Runtime dispatch based on CPU features
- Kahan summation for numerical stability
- Numerically stable algorithms (softmax with max subtraction, correlation clamping)
#### Testing & Quality
- 611 unit tests (100% passing)
- 101 doctests (100% passing)
- Property-based testing with proptest (100 cases per test)
- Zero clippy warnings
- Zero rustdoc warnings
- EXTREME TDD methodology applied throughout
- Mutation testing support
- Pre-commit quality gates via PMAT
#### Documentation
- Comprehensive rustdoc with examples for all public APIs
- README with performance benchmarks
- Quick start guide
- Phase roadmap (Phases 1-7 complete, Phase 8 in progress)
- 4 comprehensive examples:
- activation_functions.rs
- backend_detection.rs
- ml_similarity.rs
- performance_demo.rs
### Changed
- Improved numerical stability for variance/stddev with hybrid tolerance (absolute for small values, relative for large)
- Improved correlation() to clamp results to \[-1, 1\] to handle floating-point precision
- Optimized property tests with appropriate tolerances for floating-point comparisons
### Fixed
- Fixed 4 property test failures in variance/stddev operations with better tolerance handling
- Fixed all 64 rustdoc link resolution warnings by escaping mathematical notation
- Fixed atanh(tanh(x)) round-trip precision for extreme values by restricting range
- Fixed covariance bilinearity test with increased tolerance for compounding FP errors
- Fixed zscore tests for small sample sizes (n<3) and near-constant vectors
### Performance
#### Benchmarks (vs Scalar Baseline)
| Dot Product | 10K | 3.4x | 6.2x | FMA acceleration |
| Sum | 1K | 3.15x | - | - |
| Max | 1K | 3.48x | - | - |
| Add | 1K | 1.03x | 1.15x | Memory-bound |
| Mul | 1K | 1.05x | 1.12x | Memory-bound |
All benchmarks verified with Criterion.rs.
### Technical Details
#### Quality Metrics
- Test coverage: >90%
- Test execution time: 0.09s (target: <30s) - 333x faster than requirement
- TDG Score: 95.2/100 (A+)
- Zero defects at release
- Toyota Way principles applied (Jidoka, Kaizen, Genchi Genbutsu, Hansei, Poka-Yoke)
#### Platform Support
- x86_64: SSE2/AVX/AVX2/AVX-512
- ARM: NEON
- WASM: SIMD128
- GPU: Planned (infrastructure ready)
#### Dependencies
- thiserror: 2.0 (error handling)
- proptest: 1.8 (property-based testing, dev-only)
- criterion: 0.5 (benchmarking, dev-only)
### Breaking Changes
None - this is the initial release.
### Migration Guide
This is the first release. To use:
```toml
[dependencies]
trueno = "0.1"
```
```rust
use trueno::{Vector, Matrix};
let v = Vector::from_slice(&[1.0, 2.0, 3.0]);
let result = v.add(&v).unwrap();
let m = Matrix::identity(3);
let transposed = m.transpose();
```
### Known Limitations
- Matrix operations use naive algorithms (future: SIMD, GPU, blocked matmul)
- GPU backend infrastructure exists but not yet activated
- No matrix-vector multiplication yet (planned Phase 8)
- No compile-time backend selection (runtime only)
### Contributors
- Pragmatic AI Labs Team
- Claude (AI pair programmer)
### Links
- Repository: https://github.com/paiml/trueno
- Documentation: https://docs.rs/trueno/0.1.0
- Crates.io: https://crates.io/crates/trueno
---
## [Unreleased]
### Planned for v0.3.0
- SIMD-optimized activation functions (AVX2/AVX-512)
- Performance regression CI integration
- Matrix-vector multiplication
- Additional backends (WASM SIMD128)
[0.7.1]: https://github.com/paiml/trueno/releases/tag/v0.7.1
[0.7.0]: https://github.com/paiml/trueno/releases/tag/v0.7.0
[0.6.0]: https://github.com/paiml/trueno/releases/tag/v0.6.0
[0.5.0]: https://github.com/paiml/trueno/releases/tag/v0.5.0
[0.4.1]: https://github.com/paiml/trueno/releases/tag/v0.4.1
[0.4.0]: https://github.com/paiml/trueno/releases/tag/v0.4.0
[0.3.0]: https://github.com/paiml/trueno/releases/tag/v0.3.0
[0.2.2]: https://github.com/paiml/trueno/releases/tag/v0.2.2
[0.2.1]: https://github.com/paiml/trueno/releases/tag/v0.2.1
[0.1.0]: https://github.com/paiml/trueno/releases/tag/v0.1.0
[Unreleased]: https://github.com/paiml/trueno/compare/v0.7.1...HEAD