aprender-compute 0.29.0

High-performance SIMD compute library with GPU support, LLM inference engine, and GGUF model loading (was: trueno)
# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.18.0] - 2026-04-06

### Added

- **End-to-end LLM inference engine** (`src/inference/`)
  - `GgufFile`: GGUF v2/v3 reader — metadata KV, tensor info, alignment-padded data
  - `LlamaModel`: Full transformer — RMSNorm, Q4K fused matmul, RoPE, GQA, SwiGLU FFN, KV cache
  - `WeightMatrix` enum: Q4K fused path for hot weights, F32 dequant path for mixed quantization
  - Dequantization: Q4_0, Q4_1, Q4K, Q5K, Q6K, Q8_0, F16, BF16
  - `generate()`: Autoregressive decode with temperature, top-k, top-p nucleus sampling
  - QKV bias support for Qwen2/Qwen3 architectures
  - `examples/inference_demo.rs`: CLI — load GGUF, tokenize, generate, print tok/s stats

- **Software-pipelined GPU GEMM kernel** (64×128 CTA, 3-stage cp.async)
  - 60.9 TF/s peak at 2048 — 0.52× cuBLAS, TARGET MET
  - 18KB shared memory (3×6KB pipeline stages)
  - 5 FALSIFY tests, 19/19 contracts pass
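
The temperature and top-p sampling that `generate()` applies at each decode step can be sketched in plain Rust. This is an illustrative sketch only; the function name and structure are not trueno's actual API:

```rust
// Minimal sketch of temperature + top-p (nucleus) filtering over raw logits.
// `top_p_filter` is a hypothetical name, not trueno's API.
fn top_p_filter(logits: &[f32], temperature: f32, top_p: f32) -> Vec<f32> {
    // Softmax with temperature (numerically stable via max subtraction).
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits
        .iter()
        .map(|&l| ((l - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();
    let mut probs: Vec<(usize, f32)> = exps.iter().map(|e| e / sum).enumerate().collect();
    // Sort descending and keep the smallest prefix whose mass reaches top_p.
    probs.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    let mut cum = 0.0;
    let mut kept = vec![0.0f32; logits.len()];
    for (idx, p) in probs {
        if cum >= top_p {
            break;
        }
        cum += p;
        kept[idx] = p;
    }
    // Renormalize the surviving probabilities before sampling from them.
    let kept_sum: f32 = kept.iter().sum();
    kept.iter().map(|p| p / kept_sum).collect()
}

fn main() {
    let probs = top_p_filter(&[2.0, 1.0, 0.1, -1.0], 1.0, 0.9);
    // Mass concentrates on the highest-logit tokens and still sums to 1.
    assert!((probs.iter().sum::<f32>() - 1.0).abs() < 1e-5);
    assert!(probs[0] > probs[1]);
}
```

Top-k filtering composes the same way: truncate the sorted list at k entries before renormalizing.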

### Performance

- **P5c industry baseline**: trueno 807 tok/s vs llama.cpp 2481 tok/s (0.33×) on TinyLlama 5M F16
- GPU GEMM pipeline: +39% over non-pipelined (60.9 vs 43.8 TF/s at 2048)
- All 3630 tests pass

## [0.16.0] - 2026-02-26

### Changed
- Minor version bump for PAIML Sovereign AI Stack coordinated release
- Updated workspace lints and CI configurations

## [0.9.0] - 2025-12-30

### Added

- **CUDA-Tile Behavior GPU Optimizations** (cuda-tile-behavior.md spec)
  - `TensorView<T>`: Structured memory view with shape/stride metadata for GPU buffers
  - `PartitionView<T>`: Tiling strategy for 16x16 GPU workgroup distribution
  - Tiled reduction algorithms: `tiled_sum_2d`, `tiled_max_2d`, `tiled_min_2d`
  - `ReduceOp` trait for custom reduction operations (SumOp, MaxOp, MinOp)
  - WGSL tiled reduction shaders for GPU compute (pending integration)

- **Intel SDE Support for AVX-512 Testing** (Makefile targets)
  - `make install-sde`: Download and install Intel Software Development Emulator
  - `make test-avx512-sde`: Run AVX-512 tests under Skylake-X emulation
  - `make bench-avx512-sde`: Run AVX-512 benchmarks under emulation
  - `make coverage-avx512-sde`: Run AVX-512 coverage under emulation
  - Enables AVX-512 testing on CPUs without native support (e.g., Intel Meteor Lake)

- **PTX Optimization Passes** (trueno-gpu)
  - FMA fusion pass: Automatically fuse mul+add into fma instructions
  - Tile validation: Compile-time validation of tile constraints
  - ~33% instruction reduction for FMA-eligible code

### Documentation

- New example: `tiled_reduction_demo` demonstrating GPU memory abstractions
- Updated book chapter: GPU Compute Shaders with tiled reduction algorithms
- GitHub issues #72-#76 filed for CUDA-specific integration work

### Fixed

- AVX-512 dot product test tolerance using relative error for large results
- Clippy warnings for match arms, must_use, and div_ceil

## [0.8.9] - 2025-12-23

### Added

- **Batched Matrix Multiplication** for 3D and 4D tensors (Refs #71)
  - `Matrix::batched_matmul`: Shape `[batch, m, k] @ [batch, k, n] -> [batch, m, n]`
  - `Matrix::batched_matmul_4d`: Attention pattern `[batch, heads, m, k] @ [batch, heads, k, n]`
  - SIMD-accelerated using trueno's matmul backend
  - Critical for transformer multi-head attention (Q @ K^T, attn @ V)
  - 8 unit tests for correctness and error handling
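
The shape contract above can be pinned down with a scalar reference implementation, the kind of oracle a unit test compares the SIMD path against. This is a sketch over flat row-major buffers, not trueno's actual code:

```rust
// Reference (non-SIMD) batched matmul: [batch, m, k] @ [batch, k, n] -> [batch, m, n],
// all tensors stored as flat row-major slices.
fn batched_matmul_ref(a: &[f32], b: &[f32], batch: usize, m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; batch * m * n];
    for bi in 0..batch {
        // Per-batch offsets into the flat buffers.
        let (ao, bo, co) = (bi * m * k, bi * k * n, bi * m * n);
        for i in 0..m {
            for kk in 0..k {
                let av = a[ao + i * k + kk];
                for j in 0..n {
                    out[co + i * n + j] += av * b[bo + kk * n + j];
                }
            }
        }
    }
    out
}

fn main() {
    // Two batches: identity @ B must return B unchanged.
    let a = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0];
    let b = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
    let c = batched_matmul_ref(&a, &b, 2, 2, 2, 2);
    assert_eq!(c, b.to_vec());
}
```

The 4D attention variant reduces to the same loop with `batch * heads` treated as a single batch dimension.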

### Documentation

- Updated `examples/matrix_operations.rs` with batched matmul demos
- Book updates for batched matmul API reference and examples
- Created GitHub issue #71 for BatchedGemmKernel GPU support

## [0.8.8] - 2025-12-17

### Changed

- Updated `trueno-gpu` dependency to v0.3.0
  - BiasActivationKernel: Fused bias + activation epilogue (None/ReLU/GELU)
  - GemvKernel: Matrix-vector multiply for M=1 matmuls in LLM inference

### Documentation

- Book updates for BiasActivationKernel examples and PTX generation

## [trueno-gpu 0.3.0] - 2025-12-17

### Added

- **BiasActivationKernel**: Fused bias + activation epilogue kernel for GEMM operations
  - Three activation variants: None (bias only), ReLU, GELU
  - Builder pattern API: `BiasActivationKernel::new(n, bias_size).with_relu()`
  - GELU uses fast `ex2.approx` for exponential approximation
  - `bias_size` baked into kernel at generation time for efficiency
  - 22 tests including property-based and falsification tests
  - 100% mutation coverage (2 caught by tests, 4 by type system)

- **GemvKernel**: Matrix-vector multiply optimized for M=1 matmuls
  - One warp (32 threads) per output element
  - Warp shuffle reduction for efficient dot products
  - Critical path for LLM token generation
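
The per-element math the fused bias + activation epilogue applies can be shown on the CPU. This sketch assumes the common tanh-approximation GELU; the PTX kernel's exact formula (it uses `ex2.approx`) may differ:

```rust
// CPU sketch of a fused bias + GELU epilogue: add the broadcast bias, then
// apply the tanh-approximation GELU in the same pass over the output buffer.
fn bias_gelu_epilogue(out: &mut [f32], bias: &[f32]) {
    let n = bias.len();
    for (i, v) in out.iter_mut().enumerate() {
        let x = *v + bias[i % n]; // bias broadcast across rows
        // GELU(x) ~= 0.5*x*(1 + tanh(sqrt(2/pi)*(x + 0.044715*x^3)))
        let inner = 0.797_884_6 * (x + 0.044_715 * x * x * x);
        *v = 0.5 * x * (1.0 + inner.tanh());
    }
}

fn main() {
    let mut v = [0.0f32, 1.0, -1.0];
    bias_gelu_epilogue(&mut v, &[0.0, 0.0, 0.0]);
    assert!(v[0].abs() < 1e-6); // GELU(0) = 0
    assert!((v[1] - 0.8412).abs() < 1e-3); // GELU(1) ~= 0.8412
}
```

Fusing the epilogue matters because it removes one full read-modify-write pass over the GEMM output.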

### Documentation

- Added Examples section to README with run commands
- Updated Available Kernels table with BiasActivation, GEMV, Q5_K/Q6_K
- Book documentation for BiasActivationKernel with testing commands

## [0.8.5] - 2025-12-15

### Added

- **Simulation Testing Framework** (`simulation` module) - TRUENO-SPEC-012
  - `SimRng`: Deterministic PCG-based RNG for reproducible testing
  - `BackendSelector`: Intelligent backend selection with configurable thresholds
  - `JidokaGuard`: Toyota-style stop-on-defect quality checks (NaN/Inf detection)
  - `HeijunkaScheduler`: Load-leveled test scheduling across backends
  - `BufferRenderer`: RGBA buffer rendering for visual regression testing
  - `ColorPalette`: Viridis and grayscale palettes for heatmap visualization
  - `GoldenBaseline`: Golden file comparison for deterministic validation
  - `StressTestConfig/Result`: Stress testing infrastructure with anomaly detection
  - `BackendTolerance`: Cross-backend comparison tolerance configuration

- **100 Falsifiable Claims** - Comprehensive test suite validating:
  - Backend selection logic (Claims 1-15)
  - Determinism guarantees (Claims 16-30)
  - SIMD operation correctness (Claims 31-50)
  - PTX kernel patterns (Claims 51-65)
  - WGPU shader correctness (Claims 66-80)
  - Visual regression framework (Claims 81-90)
  - Stress testing infrastructure (Claims 91-100)
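
The determinism guarantee behind `SimRng` rests on a small, platform-independent generator. A minimal PCG32 (PCG-XSH-RR) step, following the published reference algorithm rather than trueno's exact code, looks like this:

```rust
// Minimal PCG32 (PCG-XSH-RR): 64-bit LCG state, 32-bit permuted output.
// Same seed and stream produce the same sequence on every platform.
struct Pcg32 {
    state: u64,
    inc: u64,
}

impl Pcg32 {
    fn new(seed: u64, stream: u64) -> Self {
        let mut rng = Pcg32 { state: 0, inc: (stream << 1) | 1 };
        rng.next_u32();
        rng.state = rng.state.wrapping_add(seed);
        rng.next_u32();
        rng
    }

    fn next_u32(&mut self) -> u32 {
        let old = self.state;
        // LCG state advance with the standard PCG multiplier.
        self.state = old.wrapping_mul(6364136223846793005).wrapping_add(self.inc);
        // Output permutation: xorshift-high, then a state-dependent rotate.
        let xorshifted = (((old >> 18) ^ old) >> 27) as u32;
        let rot = (old >> 59) as u32;
        xorshifted.rotate_right(rot)
    }
}

fn main() {
    let mut a = Pcg32::new(42, 54);
    let mut b = Pcg32::new(42, 54);
    // Determinism: identical seeds yield identical streams.
    assert_eq!(a.next_u32(), b.next_u32());
    assert_eq!(a.next_u32(), b.next_u32());
}
```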

### Fixed

- `make coverage-check` now correctly parses coverage percentage
- Coverage excludes external `simular` dependency for accurate metrics

## [trueno-gpu 0.1.0] - 2025-12-10

### Added

- **trueno-gpu sub-crate**: Pure Rust PTX generation for NVIDIA CUDA
  - No LLVM, no nvcc, no external dependencies required for code generation
  - Builder pattern API for constructing PTX modules and kernels
  - PTX ISA 8.0 compliant output

- **PTX Code Generation** (`ptx` module)
  - `PtxModule`: Module builder with version, target, address size configuration
  - `PtxKernel`: Kernel builder with parameters, shared memory, body generation
  - `PtxBuilder`: Instruction builder with virtual register allocation
  - Type system: U8, U16, U32, U64, S8, S16, S32, S64, F16, F32, F64, Pred
  - Special registers: TidX/Y/Z, CtaIdX/Y/Z, NtidX/Y/Z

- **Hand-Optimized Kernels** (`kernels` module)
  - **GEMM**: Matrix multiplication with 3 variants
    - Naive: Simple O(n³) implementation
    - Tiled: Shared memory tiling for cache optimization
    - Tensor Core: WMMA instructions for fp16 acceleration
  - **Softmax**: Numerically stable softmax with warp shuffle reduction
  - **LayerNorm**: Fused layer normalization with 2 variants
    - Warp shuffle: Uses shuffle instructions for parallel reduction
    - Shared memory: Uses shared memory for larger dimensions
  - **Attention**: FlashAttention-style tiled attention
    - Online softmax algorithm (never materializes N×N matrix)
    - Causal masking support
    - Configurable Q/KV tile sizes
  - **Quantize**: Q4_K dequantization-fused GEMM
    - 4-bit quantized weights (32 weights per 18-byte block)
    - Fused dequantization during matmul

- **Supporting Modules**
  - `driver`: CUDA driver API FFI (optional, for GPU execution)
  - `memory`: GPU memory management abstractions
  - `backend`: Multi-backend abstraction layer
  - `error`: Error types and Result alias
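
The online-softmax trick the attention kernel relies on can be demonstrated on the CPU: maintain a running maximum and a running sum that is rescaled whenever the maximum grows, so a full row of scores never needs to be held at once. A sketch:

```rust
// Streaming (online) softmax: one pass keeping a running max `m` and a
// rescaled running sum `d`, so scores can be consumed tile by tile instead
// of materializing the full N x N score matrix.
fn online_softmax(scores: &[f32]) -> Vec<f32> {
    let mut m = f32::NEG_INFINITY; // running max
    let mut d = 0.0f32; // running sum of exp(s - m)
    for &s in scores {
        let m_new = m.max(s);
        // Rescale the old sum to the new max before adding the new term.
        d = d * (m - m_new).exp() + (s - m_new).exp();
        m = m_new;
    }
    scores.iter().map(|&s| (s - m).exp() / d).collect()
}

fn main() {
    let p = online_softmax(&[1.0, 2.0, 3.0]);
    assert!((p.iter().sum::<f32>() - 1.0).abs() < 1e-5);
    assert!(p[2] > p[1] && p[1] > p[0]);
}
```

In the attention kernel the same rescaling is applied to the partial output accumulator, which is what lets the N×N matrix stay virtual.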

### Quality

- **145 unit tests** (100% passing)
- **2 doc tests** (100% passing)
- Zero clippy warnings
- EXTREME TDD methodology applied throughout

## [0.8.1] - 2025-12-08

### Added ✨

- **Quick Start Example** (`examples/quickstart.rs`)
  - Comprehensive example showcasing all core Trueno features in one file
  - Vector operations, matrix math, eigendecomposition, activations, layer norm
  - Recommended starting point for new users

- **Enhanced API Documentation**
  - `book/src/api-reference/vector-operations.md` - Complete vector API reference
  - `book/src/api-reference/matrix-operations.md` - Matrix operations guide
  - `book/src/api-reference/eigendecomposition.md` - SymmetricEigen documentation

### Changed 🔄

- Updated examples README with all current examples including `symmetric_eigen`, `hash_demo`, `gpu_batch_demo`
- Applied `cargo fmt` formatting fixes across codebase
- Installed PMAT TDG enforcement hooks for quality gates

### Quality 📊

- **Repository Score**: 100/100 (A+)
- **TDG Score**: 90.4/100 (A)
- **Rust Project Score**: 143.9/134 (107.4%, A+)
- All 954 tests passing
- Benchmarks verified: dot product 11-12x speedup (AVX-512), eigen 1.3-2.2x faster than nalgebra

## [0.7.3] - 2025-11-25

### Added ✨

- **WebGPU for WASM** (`gpu-wasm` feature)
  - Cross-platform GPU compute: same code runs on native and browser
  - Async-first API design: all GPU operations have `*_async` variants
  - Runtime detection: `runtime::sync_available()` for platform-specific code paths
  - New `runtime` module (`src/backends/gpu/runtime.rs`) for platform abstraction
  - Enables [trueno-viz](https://github.com/paiml/trueno-viz) browser-based GPU visualization

- **Cross-platform GPU API**
  - `GpuDevice::new_async()` - Works on all platforms (native + WASM)
  - `GpuDevice::is_available_async()` - Async availability check
  - All operations now have async variants: `relu_async`, `sigmoid_async`, `matmul_async`, etc.
  - Sync wrappers remain available on native platforms only

### Changed 🔄

- GPU device initialization refactored to use `runtime::block_on()` instead of direct `pollster::block_on()`
- Conditional compilation: sync methods require `#[cfg(all(feature = "gpu", not(target_arch = "wasm32")))]`
- All private async methods now public (`pub async fn *_async`)

### Documentation 📚

- **GPU Backend chapter** (`book/src/architecture/gpu-backend.md`) - Complete rewrite
  - Platform support matrix (Linux/macOS/Windows/WASM)
  - Feature flag comparison (`gpu` vs `gpu-wasm`)
  - Async-first API examples
  - trueno-viz integration guide
  - Runtime detection patterns

- **GPU Performance chapter** - Added WebGPU/WASM section
  - Platform differences table
  - Async API usage examples
  - trueno-viz reference

### Fixed 🐛

- `select_backend_for_operation` parameter name: `_op_type` → `op_type` (parameter is used)
- Type inference in empty slice comparisons: `&[]` → `&[] as &[f32]`
- Unused variable in WASM backend: `scale` → `_scale`

### Dependencies 📦

- Added `wasm-bindgen-futures` (0.4) for WASM async support
- Added `wasm-bindgen` (0.2) for WASM bindings
- Added `web-sys` (0.3) for browser APIs (console logging)

### Testing ✅

- All 903+ tests passing
- Coverage: 90.40% (exceeds 90% requirement)
- Added `required-features = ["gpu"]` for `gpu_batch_demo` example

## [0.7.1] - 2025-11-24

### Added ✨

- **EXTREME PMAT Integration** - O(1) Quality Gates
  - Enhanced validation workflow for technical debt grading
  - Automated quality metrics enforcement
  - Repository health score tracking (minimum 90/110)

- **Golden Trace Validation** (Renacer v0.6.2+)
  - Syscall-level performance regression detection
  - Captured golden traces for 5 core operations (backend_detection, matrix_operations, activation_functions, performance_demo, ml_similarity)
  - Performance assertions via `renacer.toml` (CI fails on regression)
  - Comprehensive documentation: `docs/integration-report-golden-trace.md`
  - Book chapter: `book/src/performance/golden-trace-validation.md`
  - GitHub Actions workflow for automated validation

- **GPU Batch API Example**
  - Demonstration example for async GPU command batching
  - Shows 3x transfer reduction for chained operations

### Fixed 🐛

- Replaced `.unwrap()` with `.expect()` in examples for better error messages
- Corrected relative paths in golden-trace-validation.md documentation
- Fixed formatting issues across examples

### Infrastructure 🔧

- Added GitHub Actions workflow for golden trace validation
- Updated gitignore: `direct_bench.log`, `benchmark_run.log`

### Documentation 📚

- Updated book: async GPU batch API now available (v0.3.0)
- Enhanced golden trace validation documentation
- Improved performance budget compliance reporting

### Dependencies 📦

- **Updated all dependencies to latest crates.io versions** (2025-11-23)
  - `wgpu`: 22.0 → 27.0.1 (major update)
    - Fixed breaking changes: `entry_point` now uses `Option<&str>`
    - Updated `request_adapter` API (now returns `Result`)
    - Removed `Maintain::Wait` (polling now automatic)
    - Added `experimental_features` and `trace` to `DeviceDescriptor`
  - `criterion`: 0.5 → 0.7 (minor update)
    - Replaced `criterion::black_box` with `std::hint::black_box`
  - `thiserror`: 2.0 → 2.0.17
  - `rayon`: 1.10 → 1.11
  - `pollster`: 0.3 → 0.4
  - `bytemuck`: 1.14 → 1.24
  - `proptest`: 1.8 → 1.9

### Testing ✅

- All 942 tests passing with updated dependencies (up from 936)
- 44/44 GPU tests pass with wgpu v27 (including 14 batch tests)
- Benchmark infrastructure verified with criterion 0.7
- Zero clippy warnings maintained
- Coverage: 90%+ maintained (EXTREME TDD requirement)

### Quality 🎯

- Test coverage: 90.41% (exceeds 90% requirement)
- All quality gates passing (lint, format, tests, coverage)
- Pre-commit hooks enforce coverage threshold
- PMAT Technical Debt Grade: B+ minimum enforced

## [0.7.0] - 2025-11-22

### Added ✨

- **Async GPU Command Batching API** (v0.3.0 deliverable - Phase 1)
  - **Goal**: Reduce GPU transfer overhead by 2x for chained operations
  - **New types**:
    - `GpuCommandBatch`: Command builder for batching GPU operations
    - `BufferId`: Type-safe buffer identifier for intermediate results
  - **Operations supported**: **10 operations total**
    - **Activations**: `relu`, `sigmoid`, `tanh`, `swish`, `gelu`
    - **Arithmetic**: `add`, `sub`, `mul`, `scale`, `dot`
  - **Architecture**: Command Builder pattern for explicit batching control
    - `upload()`: Queue data for GPU upload
    - Operation methods: Queue operations (no GPU execution)
    - `execute()`: Execute all queued operations in single batch
    - `read()`: Download results from GPU
  - **Transfer reduction**:
    - Before: `relu + scale + add` = 6 transfers (3 up, 3 down)
    - After: 2 transfers (1 up, 1 down) = **3x reduction**
  - **New GPU shaders**:
    - `SCALE_SHADER`: Element-wise scalar multiplication
    - `VEC_MUL_SHADER`: Element-wise vector multiplication
    - `VEC_SUB_SHADER`: Element-wise vector subtraction
  - **Tests**: 14 comprehensive tests
    - Buffer management tests (allocation, operation queuing, error handling)
    - Operation tests (mul, dot, sigmoid, tanh, swish, gelu, sub)
    - Integration tests (end-to-end execution, chained activations)
  - **Dependencies**: Added `tokio` (dev-dependency) for async test support
  - **Benchmarks** (`benches/async_gpu_ops.rs`):
    - `bench_sync_chained_ops`: Traditional sync API (6 transfers for 3 ops)
    - `bench_async_chained_ops`: New async batch API (2 transfers for 3 ops)
    - `bench_single_op_comparison`: Sync vs async for single operation
    - `bench_deep_chain`: 5 chained operations (10→2 transfers = 5x reduction)
    - **Usage**: `cargo bench --bench async_gpu_ops --features gpu`
  - **API Enhancement**: `GpuDevice` now implements `Clone` (wgpu devices are Arc-based)

### Performance - Phase 3: Large Matrix Optimization 🚀

**Achievement**: 18% improvement for 1024×1024 matrices via 3-level cache blocking

- **3-level cache hierarchy** (L3 → L2 → micro-kernel) for matrices ≥512×512
  - L3 blocks: 256×256 (fits in 4-16MB L3 cache)
  - L2 blocks: 64×64 (fits in 256KB L2 cache)
  - Micro-kernel: 4×1 AVX2/FMA (register blocking)
  - Smart threshold: Only activates for matrices ≥512×512

- **Zero-allocation implementation**:
  - No Vec allocations in hot path
  - Code duplication with if/else branches
  - Preserves fast 2-level path for smaller matrices

- **Performance results**:
  - 1024×1024: **47.4 ms (18% faster than v0.6.0's 57.8 ms)**
  - 512×512: ~5.3 ms (8.5% improvement)
  - 256×256: No regression (uses 2-level path)
  - Target: Within 1.5× of NumPy (currently 1.64×)

- **Testing**:
  - Added `test_matmul_3level_blocking` for 512×512 matrices
  - 878 tests passing (all existing tests pass)
  - Coverage: 90.41% (improved from 90.00%)
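
The blocking scheme described above has a characteristic loop nest. A scalar sketch with one cache level (`BLOCK` standing in for the 64×64 L2 tile; the real kernel adds the outer 256×256 L3 level and the AVX2 micro-kernel):

```rust
// Cache-blocked matmul sketch: walk the matrices in BLOCK x BLOCK tiles so the
// working set of each inner loop nest stays cache-resident. Scalar inner loop;
// the production kernel swaps it for a 4x1 AVX2/FMA micro-kernel.
const BLOCK: usize = 64;

fn blocked_matmul(a: &[f32], b: &[f32], c: &mut [f32], n: usize) {
    for ii in (0..n).step_by(BLOCK) {
        for kk in (0..n).step_by(BLOCK) {
            for jj in (0..n).step_by(BLOCK) {
                // C[ii.., jj..] += A[ii.., kk..] * B[kk.., jj..], one tile.
                for i in ii..(ii + BLOCK).min(n) {
                    for k in kk..(kk + BLOCK).min(n) {
                        let av = a[i * n + k];
                        for j in jj..(jj + BLOCK).min(n) {
                            c[i * n + j] += av * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
}

fn main() {
    let n = 65; // non-multiple of BLOCK exercises the edge tiles
    let a: Vec<f32> = (0..n * n).map(|i| (i % 7) as f32).collect();
    let b: Vec<f32> = (0..n * n).map(|i| (i % 5) as f32).collect();
    let mut c = vec![0.0f32; n * n];
    blocked_matmul(&a, &b, &mut c, n);
    // Spot-check c[0][0] against a direct dot product.
    let direct: f32 = (0..n).map(|k| a[k] * b[k * n]).sum();
    assert!((c[0] - direct).abs() < 1e-3);
}
```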

### Quality & Testing

- **Test coverage: 90.20%** (trueno library, exceeds 90% EXTREME TDD requirement)
- Added 60+ new tests across xtask tooling and core library
- Fixed clippy warnings (needless_range_loop)
- Updated coverage policy: xtask (dev tooling) excluded from main coverage requirement
- All quality gates passing: lint, format, tests, coverage

### Documentation

- Updated Phase 2 book chapter with 3-level blocking details
- Added benchmark data for 512×512 and 1024×1024
- GitHub issue #34 tracking Phase 3 progress

## [0.6.0] - 2025-11-21

### Performance - Phase 2: NumPy Performance Parity 🎯

**Major Achievement**: Pure Rust matches NumPy/OpenBLAS performance at 256×256 matrices

- **4×1 AVX2 micro-kernel** implementation (Pure Rust, zero external dependencies)
  - Fused Multiply-Add (FMA) instructions for 3× throughput
  - Register blocking: 4 YMM accumulators stay in CPU registers
  - Memory bandwidth optimization: Load B column once, reuse for 4 A rows (4× reduction)
  - Horizontal sum optimization using AVX2 intrinsics

- **Performance results** (vs NumPy 2.3.5 + OpenBLAS):
  - 256×256: **538 μs (Trueno) vs 574 μs (NumPy) = 6% FASTER**
  - 128×128: **72 μs (Trueno) vs 463 μs (NumPy) = 6.4× FASTER**
  - Improvement over v0.5.0: 2.3-2.6× faster
  - Efficiency: 77% of theoretical AVX2 peak (48 GFLOPS @ 3.0 GHz)

- **Implementation details**:
  - `matmul_microkernel_4x1_avx2()`: Processes 4 rows × 1 column simultaneously
  - `horizontal_sum_avx2()`: Reduces 8 f32 values to scalar
  - Automatic dispatch for AVX2/AVX512 backends
  - Fallback to standard SIMD for other backends

- **Comprehensive testing**:
  - 11 micro-kernel unit tests added
  - `test_horizontal_sum_avx2`: 5 test cases (all ones, sequence, signs, large values, mixed)
  - `test_matmul_microkernel_4x1_avx2`: 6 test cases (simple dots, identity, non-aligned, negative, zero, FMA verification)
  - Backend equivalence: Naive vs micro-kernel correctness verified
  - Coverage: 90.63% (exceeds 90% requirement)

### Documentation

- **book/src/advanced/phase2-microkernel.md**: Complete Phase 2 micro-kernel guide
  - Motivation and design goals
  - Micro-kernel architecture (4×1 design rationale)
  - AVX2 implementation with code walkthrough
  - Performance analysis and efficiency breakdown
  - Testing strategy and coverage details
  - Lessons learned (what worked, what didn't, trade-offs)
  - Future optimizations roadmap

- **ROADMAP.md**: Updated with Phase 2 completion and Phase 3 planning
- **GitHub issue #34**: Phase 3 (large matrix optimization) opened

### Quality

- **Test Coverage**: 877 tests passing, 90.63% library coverage
- **Clippy**: Zero warnings on all features
- **Format**: 100% rustfmt compliant
- **PMAT**: All quality gates passing

### Closed Issues

- Phase 2 of matrix multiplication optimization (achieving NumPy parity)

## [0.5.0] - 2025-11-21

### Performance - Matrix Multiplication 🚀

**Major Achievement**: Matrix multiplication now **2.79× faster than NumPy** at 128×128 matrices

- **Cache-aware blocking algorithm** with L2 optimization (64×64 blocks)
  - Implements 2-level cache hierarchy optimization (L2/L1)
  - Smart thresholding: matrices ≤32 use simple path (avoids blocking overhead)
  - 3-level nested loops (ii/jj/kk) with SIMD micro-kernels
  - Zero Vector allocations via direct backend dot() calls

- **Performance results** (vs NumPy baseline):
  - 128×128 matrices: **166 μs (Trueno) vs 463 μs (NumPy) = 2.79× FASTER**
  - Original problem: Trueno was 2.5× slower (Issue #10)
  - Total improvement: 5.5× faster than v0.4.0
  - Phase 1 goal (1.5-2× speedup) exceeded by 40%

- **Comprehensive testing**:
  - 4 new blocking test suites added
  - `test_matmul_blocking_small_matrices` (8×8, 16×16, 32×32)
  - `test_matmul_blocking_medium_matrices` (64×64, 128×128, 256×256)
  - `test_matmul_blocking_non_aligned_sizes` (33×33, 65×65, 100×100, 127×127)
  - `test_matmul_blocking_large_matrices` (256×256 with detailed analysis)
  - Backend equivalence verified (naive vs blocked implementations)

### Fixed

- **Performance regression** (Issue #26): Backend selection caching
  - Implemented `OnceLock` for one-time backend detection
  - Eliminates 3-5% overhead from repeated `is_x86_feature_detected!()` calls
  - Performance improvement: 4-15% faster than v0.4.0
  - Added `test_backend_selection_is_cached` to verify caching behavior
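
The caching fix follows the standard `std::sync::OnceLock` pattern: run feature detection once, then serve the stored result on every later call. A minimal sketch (the enum and threshold logic are illustrative, not trueno's exact code):

```rust
use std::sync::OnceLock;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Backend {
    Scalar,
    Avx2,
    Avx512,
}

// Detect once; every subsequent call returns the cached value without
// re-running is_x86_feature_detected!.
fn selected_backend() -> Backend {
    static BACKEND: OnceLock<Backend> = OnceLock::new();
    *BACKEND.get_or_init(|| {
        #[cfg(target_arch = "x86_64")]
        {
            if is_x86_feature_detected!("avx512f") {
                return Backend::Avx512;
            }
            if is_x86_feature_detected!("avx2") {
                return Backend::Avx2;
            }
        }
        Backend::Scalar
    })
}

fn main() {
    // Both calls return the same value; only the first runs detection.
    assert_eq!(selected_backend(), selected_backend());
}
```

`OnceLock::get_or_init` is thread-safe, so concurrent first calls still perform detection exactly once.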

### Documentation

- **PERFORMANCE_GUIDE.md** updated with matrix multiplication section
  - Comprehensive benchmark table (16×16 through 256×256)
  - Performance characteristics and sweet spot analysis
  - Implementation details (blocking, thresholding, SIMD)
  - Tuning tips for different matrix sizes
  - Cache-aware blocking explanation

### Quality

- **Test Coverage**: 874 tests passing, 90.72% library coverage (exceeds 90% requirement)
- **TDG Score**: 85.5/100 (A-) - architectural limit maintained
- **Clippy**: Zero warnings on all features
- **Format**: 100% rustfmt compliant
- **PMAT**: All quality gates passing, zero critical defects

### Closed Issues

- Issue #10: Matrix multiplication SIMD performance (Phase 1 complete)
- Issue #26: Performance regression in v0.4.1 (backend caching fix)

## [0.4.1] - 2025-11-20

### Added
- **GPU test coverage improvements**: Comprehensive testing for GPU backend operations
  - Added 6 new GPU tests for `matmul()` and `convolve2d()` operations
  - `test_gpu_matmul_basic`, `test_gpu_matmul_identity`, `test_gpu_matmul_non_square`
  - `test_gpu_convolve2d_basic`, `test_gpu_convolve2d_identity`, `test_gpu_convolve2d_averaging`
  - GPU device.rs coverage: 68.44% → 98.44% (+30% improvement)

### Fixed
- **Test stability**: Fixed flaky `test_matvec_associativity` property test
  - Relaxed floating-point tolerance from 1% to 2% for AVX-512 backend
  - Accounts for increased rounding error accumulation in 512-bit SIMD operations
  - All 834 tests now pass reliably across all backends

### Changed
- **Coverage reporting**: Excluded xtask build tools from coverage metrics
  - Updated Makefile to use `--exclude-from-report xtask`
  - Library code coverage: **90.61%** (target: >90%) ✅
  - Overall coverage: 88.30% line, 94.42% function, 89.63% region

### Quality
- **Test Coverage**: 834 tests passing, >90% library coverage achieved
- **TDG Score**: 88.1/100 (A-) - architectural limit maintained
- **Clippy**: Zero warnings on all features
- **Format**: 100% rustfmt compliant

## [0.4.0] - 2025-11-19

### Changed
- **Refactored multi-backend dispatch**: Introduced dispatch macros to reduce code duplication
  - `dispatch_binary_op!` macro for add/sub/mul/div operations (reduces 50-line match statements to 1 line)
  - `dispatch_reduction!` macro for sum/max/min/norm operations (reduces 50-line match statements to 1 line)
  - Eliminates ~1000 lines of redundant backend dispatch code
  - Maintains 100% functional equivalence (all 827 tests passing)
  - Improves maintainability: new backends now require single macro update
  - **Note**: TDG score unchanged (88.1 A-) because `syn` expands macros before analysis
    - This is correct behavior - cyclomatic complexity remains unchanged
    - Macro pattern matches unavoidable architectural complexity from multi-platform SIMD dispatch

### Added
- **Additional vector operations**: Expanded functionality with ML/numerical computing primitives
  - `norm_l2()`: L2 norm with AVX-512 (6-9x speedup)
  - `norm_l1()`, `norm_linf()`: L1 and L-infinity norms
  - `scale()`, `abs()`, `clamp()`: Basic vector transformations
  - `lerp()`, `fma()`: Linear interpolation and fused multiply-add
  - `relu()`, `sigmoid()`, `gelu()`, `swish()`, `tanh()`: Neural network activation functions
  - `exp()`: Exponential function with range reduction
  - 827 tests passing (all operations covered)
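
The range-reduction idea behind `exp()` can be sketched as: write x = k·ln2 + r with |r| ≤ ln2/2, approximate exp(r) with a short polynomial, then reassemble as 2^k · exp(r). The Taylor polynomial below is an illustrative choice, not necessarily the one trueno uses:

```rust
// exp via range reduction: x = k*ln2 + r keeps |r| small enough that a short
// polynomial for exp(r) is accurate; 2^k is then an exact power-of-two scale.
fn exp_range_reduced(x: f32) -> f32 {
    const LN2: f32 = std::f32::consts::LN_2;
    let k = (x / LN2).round();
    let r = x - k * LN2; // |r| <= ln2/2 ~= 0.3466
    // 5th-order Taylor polynomial for exp(r), evaluated in Horner form.
    let p = 1.0
        + r * (1.0 + r * (0.5 + r * (1.0 / 6.0 + r * (1.0 / 24.0 + r / 120.0))));
    p * 2.0f32.powi(k as i32)
}

fn main() {
    for &x in &[0.0f32, 1.0, -2.5, 3.7] {
        let approx = exp_range_reduced(x);
        assert!((approx - x.exp()).abs() / x.exp() < 1e-4);
    }
}
```

SIMD versions of this are branch-free: the rounding, polynomial, and 2^k scaling all vectorize per lane.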

### Infrastructure
- **PMAT integration improvements**: Created issues for enhanced TDG workflow
  - Issue #78: Request for `pmat tdg --explain` mode with function-level complexity breakdown
  - Issue #76: Documented YAML parsing friction with `pmat work` commands
  - Discovered: TDG correctly analyzes macro-expanded code via `syn` AST parser

### Quality
- **Test Coverage**: 827 tests passing, >90% coverage maintained
- **TDG Score**: 88.1/100 (A-) - architectural limit for multi-backend SIMD dispatch
- **Clippy**: Zero warnings on all features
- **Format**: 100% rustfmt compliant

## [0.3.0] - 2025-11-19

### Added
- **AVX-512 backend infrastructure**: Initial implementation (Phase 1 + Phase 2 + Phase 3 + Phase 4 + Phase 5)
  - New `Avx512Backend` processes 16 × f32 elements per iteration (2x AVX2's 8)
  - **Implemented `add()` operation**: Memory-bound (~1x speedup, baseline implementation)
  - **Implemented `dot()` operation**: Compute-bound (11-12x speedup, ✅ **EXCEEDS 8x TARGET**)
    - Uses `_mm512_fmadd_ps` for fused multiply-add (single instruction for acc + va * vb)
    - Uses `_mm512_reduce_add_ps` for horizontal sum (simpler than AVX2's manual reduction)
    - 9 comprehensive unit tests (basic, aligned, non-aligned, large, backend equivalence, special values, zero/orthogonal)
  - **Implemented `sum()` operation**: Compute-bound (8-11x speedup, ✅ **EXCEEDS 8x TARGET**)
    - Uses `_mm512_add_ps` for 16-way parallel accumulation
    - Uses `_mm512_reduce_add_ps` for horizontal sum (single intrinsic)
    - 9 comprehensive unit tests (basic, aligned, non-aligned, large, backend equivalence, negative values, remainder sizes)
  - **Implemented `max()` operation**: Compute-bound (8-12x speedup, ✅ **EXCEEDS 8x TARGET**)
    - Uses `_mm512_max_ps` for 16-way parallel comparison
    - Uses `_mm512_reduce_max_ps` for horizontal max (single intrinsic)
    - 5 comprehensive unit tests (basic, aligned, non-aligned, negative values, backend equivalence)
  - **Implemented `min()` operation**: Compute-bound (8-12x speedup, ✅ **EXCEEDS 8x TARGET**)
    - Uses `_mm512_min_ps` for 16-way parallel comparison
    - Uses `_mm512_reduce_min_ps` for horizontal min (single intrinsic)
    - 5 comprehensive unit tests (basic, aligned, non-aligned, positive values, backend equivalence)
  - **Implemented `argmax()` operation**: Hybrid operation (3.2-3.3x speedup, limited by scalar index scan)
    - Uses `_mm512_max_ps` + `_mm512_reduce_max_ps` to find maximum value (16-way SIMD)
    - Scalar `.position()` scan to find index of max value (dominates runtime)
    - 6 comprehensive unit tests (basic, aligned, non-aligned, negative values, max at start, backend equivalence)
  - **Implemented `argmin()` operation**: Hybrid operation (3.2-3.3x speedup, limited by scalar index scan)
    - Uses `_mm512_min_ps` + `_mm512_reduce_min_ps` to find minimum value (16-way SIMD)
    - Scalar `.position()` scan to find index of min value (dominates runtime)
    - 6 comprehensive unit tests (basic, aligned, non-aligned, positive values, min at start, backend equivalence)
  - Backend selection: Auto-detects AVX-512F support via `is_x86_feature_detected!()`
  - Available on Intel Skylake-X/Sapphire Rapids (2017+) and AMD Zen 4 (2022+)
  - All 819 tests passing (779 prior, plus new suites: 9 add, 9 dot, 9 sum, 5 max, 5 min, 6 argmax, 6 argmin, and 1 backend test, deduplicated)
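The auto-detection described above can be sketched as follows. This is a minimal illustration of runtime dispatch via `is_x86_feature_detected!`; the `Backend` enum and `select_backend` function here are illustrative, not the library's actual API surface.

```rust
// Minimal sketch of runtime backend selection (illustrative names).
#[derive(Debug)]
enum Backend {
    Scalar,
    Avx2,
    Avx512,
}

fn select_backend() -> Backend {
    #[cfg(target_arch = "x86_64")]
    {
        // `is_x86_feature_detected!` queries CPUID at runtime, so a
        // binary built without -C target-feature still dispatches to
        // the widest ISA the host CPU supports.
        if is_x86_feature_detected!("avx512f") {
            return Backend::Avx512;
        }
        if is_x86_feature_detected!("avx2") {
            return Backend::Avx2;
        }
    }
    Backend::Scalar
}

fn main() {
    // Which variant is returned depends on the host CPU.
    println!("selected backend: {:?}", select_backend());
}
```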

### Infrastructure
- **GitHub Pages deployment**: Automated documentation deployment workflow
  - Combines mdBook guide and rustdoc API documentation
  - Deploys to GitHub Pages on push to main branch
  - API documentation available at `/api` subdirectory
  - Workflow file: `.github/workflows/deploy-docs.yml`

### Documentation
- **Fixed Intel Intrinsics Guide reference**: Updated to mirror URL
  - Original Intel URL blocked automated link validation (HTTP 403)
  - Now references automation-friendly mirror at `laruence.com/sse`
  - Passes PMAT `validate-docs` quality gate (136/136 links valid)

### Fixed
- **AVX512 FMA tolerance**: Increased tolerance for 3-way matmul associativity
  - Addresses floating-point precision differences in AVX-512 FMA operations
  - Commit 6cd7ba2

### Performance
- **AVX-512 add() benchmarks**: Memory-bound operation analysis
  - Size 100:   Scalar 50.9ns, AVX2 44.4ns (1.15x), **AVX512 44.8ns (1.14x)**
  - Size 1000:  Scalar 113.7ns, AVX2 101.1ns (1.12x), **AVX512 117.3ns (0.97x)**
  - Size 10000: Scalar 1.117µs, AVX2 1.106µs (1.01x), **AVX512 1.122µs (0.99x)**
  - **Conclusion**: add() is memory-bound (~1x SIMD benefit)
  - Memory bandwidth saturation prevents AVX-512 benefits for simple element-wise ops
  - Consistent with existing patterns: add/sub/div/fma/scale/abs all memory-bound (~1x speedup)
  - AVX-512's 2x register width (16 vs 8 elements) does not help when memory is bottleneck

- **AVX-512 dot() benchmarks**: Compute-bound operation ✅ **EXCEEDS 8x TARGET**
  - Size 100:   Scalar 44.2ns, AVX2 8.9ns (4.95x), **AVX512 8.4ns (5.3x)**
  - Size 1000:  Scalar 607ns, AVX2 94ns (6.5x), **AVX512 49ns (12.5x)**
  - Size 10000: Scalar 6.31µs, AVX2 1.03µs (6.1x), **AVX512 551ns (11.5x)**
  - **Conclusion**: dot() is compute-bound (11-12x SIMD speedup achieved!)
  - FMA intrinsic (_mm512_fmadd_ps) provides massive benefit for multiply-accumulate patterns
  - AVX-512's 16-element-wide FMA + horizontal reduction delivers 1.9x speedup over AVX2
  - Validates ROADMAP success criteria: "8x speedup over scalar (AVX-512)" ✅
  - Confirms hypothesis: Compute-bound operations benefit from AVX-512, memory-bound do not

- **AVX-512 sum() benchmarks**: Compute-bound operation ✅ **EXCEEDS 8x TARGET**
  - Size 100:   Scalar 36.3ns, AVX2 5.6ns (6.5x), **AVX512 5.7ns (6.4x)**
  - Size 1000:  Scalar 600ns, AVX2 55ns (10.9x), **AVX512 54ns (11.0x)**
  - Size 10000: Scalar 6.33µs, AVX2 768ns (8.2x), **AVX512 767ns (8.3x)**
  - **Conclusion**: sum() is compute-bound (8-11x SIMD speedup achieved!)
  - 16-way parallel accumulation with `_mm512_add_ps` + `_mm512_reduce_add_ps`
  - AVX-512 matches AVX2 performance (both memory-bandwidth limited for reduction)
  - Validates ROADMAP success criteria: "8x speedup over scalar (AVX-512)" ✅
  - Pattern: Reduction operations achieve target speedup despite memory constraints

- **AVX-512 max() benchmarks**: Compute-bound operation ✅ **EXCEEDS 8x TARGET**
  - Size 100:   Scalar 26.9ns, AVX2 4.3ns (6.2x), **AVX512 4.2ns (6.3x)**
  - Size 1000:  Scalar 390ns, AVX2 40ns (9.8x), **AVX512 32ns (12.1x)**
  - Size 10000: Scalar 4.02µs, AVX2 482ns (8.3x), **AVX512 488ns (8.2x)**
  - **Conclusion**: max() is compute-bound (8-12x SIMD speedup achieved!)
  - 16-way parallel comparison with `_mm512_max_ps` + `_mm512_reduce_max_ps`
  - AVX-512 matches AVX2 performance (both memory-bandwidth limited)
  - Validates ROADMAP success criteria ✅

- **AVX-512 min() benchmarks**: Compute-bound operation ✅ **EXCEEDS 8x TARGET**
  - Size 100:   Scalar 26.1ns, AVX2 4.2ns (6.2x), **AVX512 4.2ns (6.2x)**
  - Size 1000:  Scalar 371ns, AVX2 31ns (12.0x), **AVX512 32ns (11.6x)**
  - Size 10000: Scalar 3.93µs, AVX2 484ns (8.1x), **AVX512 492ns (8.0x)**
  - **Conclusion**: min() is compute-bound (8-12x SIMD speedup achieved!)
  - 16-way parallel comparison with `_mm512_min_ps` + `_mm512_reduce_min_ps`
  - AVX-512 matches AVX2 performance (both memory-bandwidth limited)
  - Validates ROADMAP success criteria ✅

- **AVX-512 argmax() benchmarks**: Hybrid operation (SIMD find + scalar scan)
  - Size 100:   Scalar 46.2ns, AVX2 21.8ns (2.1x), **AVX512 21.2ns (2.2x)**
  - Size 1000:  Scalar 580ns, AVX2 182ns (3.2x), **AVX512 184ns (3.2x)**
  - Size 10000: Scalar 5.95µs, AVX2 1.79µs (3.3x), **AVX512 1.78µs (3.3x)**
  - **Conclusion**: argmax() achieves 3.2-3.3x speedup (limited by scalar index scan)
  - SIMD phase: 16-way parallel max finding with `_mm512_max_ps` + `_mm512_reduce_max_ps`
  - Scalar phase: `.position()` scan to find index of max value (dominates runtime)
  - **Not** targeting 8x speedup - argmax is fundamentally a two-phase algorithm
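The two-phase structure can be modeled in plain Rust as below. This is a scalar stand-in (NaN handling omitted, function name illustrative): phase 1 is the value reduction that the real kernel runs 16 lanes wide with `_mm512_max_ps`/`_mm512_reduce_max_ps`, and phase 2 is the scalar index scan that dominates the runtime.

```rust
// Two-phase argmax: SIMD-friendly value reduction, then a scalar scan.
fn argmax_two_phase(data: &[f32]) -> Option<usize> {
    if data.is_empty() {
        return None;
    }
    // Phase 1: find the maximum value (vectorized in the real kernel).
    let max = data.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    // Phase 2: first index holding that value (inherently sequential).
    data.iter().position(|&x| x == max)
}

fn main() {
    // Returns the first occurrence of the maximum.
    assert_eq!(argmax_two_phase(&[3.0, 9.0, 1.0, 9.0]), Some(1));
    assert_eq!(argmax_two_phase(&[]), None);
}
```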

- **AVX-512 argmin() benchmarks**: Hybrid operation (SIMD find + scalar scan)
  - Size 100:   Scalar 45.8ns, AVX2 21.5ns (2.1x), **AVX512 21.6ns (2.1x)**
  - Size 1000:  Scalar 581ns, AVX2 180ns (3.2x), **AVX512 181ns (3.2x)**
  - Size 10000: Scalar 5.93µs, AVX2 1.76µs (3.4x), **AVX512 1.79µs (3.3x)**
  - **Conclusion**: argmin() achieves 3.2-3.3x speedup (limited by scalar index scan)
  - SIMD phase: 16-way parallel min finding with `_mm512_min_ps` + `_mm512_reduce_min_ps`
  - Scalar phase: `.position()` scan to find index of min value (dominates runtime)
  - **Not** targeting 8x speedup - argmin is fundamentally a two-phase algorithm

### Quality
- **Mutation testing improvements**: Backend error handling test
  - Killed Backend::Auto deletion mutant (src/vector.rs:3145) with defensive error test
  - Improved test coverage for backend fallback paths
  - Known limitation: 3 GPU mutants (tanh, is_available, reduce_sum) require GPU hardware to test
  - Tests skip gracefully when GPU unavailable (prevents CI breakage)
- **Bashrs enforcement**: Shell script quality validation
  - Replaced C-grade shell validation with A-grade Rust xtask
  - Enforces bashrs validation for Makefile and all shell scripts
  - Handles missing shell scripts gracefully

---

## [0.2.2] - 2025-11-18

### Fixed
- **CRITICAL**: Missing SIMD implementation for `abs()` operation (Issue #2)
  - Blocked downstream projects (realizar)
  - Added implementations in AVX2Backend, SSE2Backend, ScalarBackend
  - Uses bitwise AND with `0x7FFFFFFF` to clear sign bit
  - All 109 tests pass, backend equivalence verified
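A scalar model of the sign-bit trick: IEEE 754 floats keep their sign in the top bit, so ANDing the bit pattern with `0x7FFF_FFFF` clears it. The AVX2 path applies the same mask 8 lanes at a time (the function name below is illustrative).

```rust
// Scalar model of the SIMD abs kernel: clear the IEEE 754 sign bit.
fn abs_sign_mask(x: f32) -> f32 {
    f32::from_bits(x.to_bits() & 0x7FFF_FFFF)
}

fn main() {
    assert_eq!(abs_sign_mask(-3.5), 3.5);
    assert_eq!(abs_sign_mask(2.0), 2.0);
    // Unlike comparison-based abs, masking also normalizes -0.0 to +0.0.
    assert_eq!(abs_sign_mask(-0.0).to_bits(), 0.0f32.to_bits());
}
```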

### Performance
- **argmax/argmin SIMD optimization**: 2.8-3.1x speedup
  - Replaced scalar index scan with SIMD index tracking
  - Uses comparison masks and blend operations
  - Processes 8 elements/iteration (AVX2) or 4 elements/iteration (SSE2)
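A scalar model of that mask-and-blend pattern, using 4 lanes for brevity (the AVX2 kernel does this 8-wide with compare and blend intrinsics such as `_mm256_cmp_ps`/`_mm256_blendv_ps`; `argmax_lanes` is an illustrative name, not the library API):

```rust
// Lane-wise index tracking: each lane keeps its own best value/index,
// then a horizontal reduction and scalar tail finish the job.
fn argmax_lanes(data: &[f32]) -> usize {
    const LANES: usize = 4;
    let mut best_val = [f32::NEG_INFINITY; LANES];
    let mut best_idx = [0usize; LANES];
    let chunks = data.len() / LANES;
    for i in 0..chunks {
        for l in 0..LANES {
            let idx = i * LANES + l;
            // In SIMD this is one compare mask plus two blends.
            if data[idx] > best_val[l] {
                best_val[l] = data[idx];
                best_idx[l] = idx;
            }
        }
    }
    // Horizontal reduction across lanes (ties go to the lower index).
    let mut bv = f32::NEG_INFINITY;
    let mut bi = 0usize;
    for l in 0..LANES {
        if best_val[l] > bv || (best_val[l] == bv && best_idx[l] < bi) {
            bv = best_val[l];
            bi = best_idx[l];
        }
    }
    // Scalar tail for the remainder that does not fill a full vector.
    for idx in chunks * LANES..data.len() {
        if data[idx] > bv {
            bv = data[idx];
            bi = idx;
        }
    }
    bi
}

fn main() {
    assert_eq!(argmax_lanes(&[1.0, 5.0, 3.0, 5.0, 2.0]), 1);
    assert_eq!(argmax_lanes(&[2.0]), 0);
}
```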

### Added
- Comprehensive performance benchmarks for 8 operations:
  - `norm_l1()` - L1 norm (4-11x SIMD speedup, compute-bound)
  - `norm_l2()` - L2 norm (4-9x SIMD speedup, compute-bound)
  - `scale()` - Scalar multiplication (~1x speedup, memory-bound)
  - `fma()` - Fused multiply-add (~1x speedup, memory-bound despite FMA hardware)
  - `sub()` - Subtraction (~1x speedup, memory-bound)
  - `div()` - Division (~1x speedup, memory-bound)
  - `abs()` - Absolute value (~1.1x speedup, memory-bound)
  - `min()` - Minimum reduction (6-10x SIMD speedup)

### Documentation
- **Performance pattern analysis documented**:
  - **Compute-bound operations** (2.8-12x SIMD benefit): min, argmax/argmin, norm_l1, norm_l2, dot, sum
  - **Memory-bound operations** (~1x SIMD benefit): sub, div, fma, scale, abs
  - Root cause: Memory bandwidth saturation prevents SIMD benefit for simple operations

### Testing
- All 889 tests passing (759 unit + 21 integration + 109 doc)
- Zero clippy warnings
- EXTREME TDD methodology with RED-GREEN-REFACTOR cycle applied for abs()

### Closes
- Issue #2: Missing abs trait implementation in VectorBackend

---

## [0.2.1] - 2025-11-18

### Added

#### Activation Functions
- `hardswish()` - MobileNetV3 efficient activation
- `mish()` - Modern swish alternative (x * tanh(softplus(x)))
- `selu()` - Self-normalizing exponential linear unit
- `relu()` - ReLU with EXTREME TDD
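For reference, the standard definitions behind these activations can be sketched as below. This uses the usual published formulas and SELU constants; it is not the library's internal code.

```rust
// mish(x) = x * tanh(softplus(x)), with softplus(x) = ln(1 + e^x).
fn mish(x: f32) -> f32 {
    x * x.exp().ln_1p().tanh()
}

// hardswish(x) = x * relu6(x + 3) / 6, MobileNetV3's cheap swish.
fn hardswish(x: f32) -> f32 {
    x * (x + 3.0).clamp(0.0, 6.0) / 6.0
}

// SELU with the standard self-normalizing constants.
fn selu(x: f32) -> f32 {
    const ALPHA: f32 = 1.673_263_2;
    const LAMBDA: f32 = 1.050_701;
    if x > 0.0 {
        LAMBDA * x
    } else {
        LAMBDA * ALPHA * (x.exp() - 1.0)
    }
}

fn main() {
    assert_eq!(mish(0.0), 0.0);
    assert_eq!(hardswish(-3.0), 0.0); // clamps to zero below -3
    assert_eq!(hardswish(3.0), 3.0);  // identity above +3
    assert!((selu(1.0) - 1.050_701).abs() < 1e-4);
}
```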

#### Math Operations
- `log2()` - Base-2 logarithm (information theory, entropy)
- `log10()` - Base-10 logarithm (decibels, pH)

#### Documentation
- Comprehensive GPU performance analysis (`docs/performance-analysis.md`)
- Performance baselines for regression detection

### Changed

#### Critical GPU Performance Optimization
- **GPU disabled for ALL element-wise operations** (2-65,000x slower than scalar!)
- **GPU enabled ONLY for matmul** (2-10x speedup at 500×500+)
- Updated OpComplexity thresholds based on empirical benchmarks
- Lowered matmul GPU threshold from 1000 to 500 (proven 2x speedup)

#### Documentation Updates
- README updated with honest GPU performance claims
- ROADMAP pivoted from GPU to SIMD optimization strategy

### Fixed
- False GPU speedup claims (advertised 10-50x, actual was 2-65,000x SLOWER)
- GPU overhead analysis: 14-55ms fixed cost per operation

### Performance

#### GPU Benchmark Results (Empirical - Genchi Genbutsu)
| Operation | Size | GPU vs Scalar | Result |
|-----------|------|---------------|--------|
| vec_add | 1M | 510x SLOWER | ❌ GPU disabled |
| dot | 1M | 93x SLOWER | ❌ GPU disabled |
| relu | 1M | 824x SLOWER | ❌ GPU disabled |
| matmul | 500×500 | **2.01x faster** | ✅ GPU enabled |
| matmul | 1000×1000 | **9.59x faster** | ✅ GPU enabled |

**Root Cause**: 14-55ms GPU overhead (buffer allocation + PCIe transfer) dominates execution time for element-wise ops.
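A hypothetical sketch of the resulting dispatch policy, derived from the table above (`use_gpu` and its cutoff are illustrative, not the library's actual `OpComplexity` API):

```rust
// Illustrative size-threshold dispatch: only matmul above a cutoff
// amortizes the fixed 14-55ms GPU overhead.
fn use_gpu(op: &str, n: usize) -> bool {
    match op {
        // Empirically ~2x faster at 500x500, ~10x at 1000x1000.
        "matmul" => n >= 500,
        // Element-wise ops: buffer allocation + PCIe transfer dominates.
        _ => false,
    }
}

fn main() {
    assert!(!use_gpu("vec_add", 1_000_000)); // 510x slower on GPU
    assert!(use_gpu("matmul", 1000));        // 9.59x faster on GPU
    assert!(!use_gpu("matmul", 100));        // below the cutoff
}
```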

### Testing
- 33 new tests for activations (hardswish, mish, selu)
- 14 new tests for log2/log10
- Property-based tests for all new functions
- Total: 699+ tests

### Closes
- Issue #1: Element-wise transcendental functions (log2, ln, exp)

---

## [0.1.0] - 2025-01-17

### Added

#### Core Types
- `Vector<T>` type with SIMD-optimized operations
- `Matrix<T>` type with row-major storage (NumPy-compatible)
- `Backend` enum for multi-target execution (Scalar, SSE2, AVX, AVX2, AVX512, NEON, WasmSIMD, GPU)
- Runtime CPU feature detection with automatic backend selection

#### Vector Operations (87 total)
- **Element-wise**: add, sub, mul, div, abs, neg, clamp, lerp, fma, sqrt, recip, pow, exp, ln, floor, ceil, round, trunc, fract, signum, copysign, minimum, maximum
- **Trigonometric**: sin, cos, tan, asin, acos, atan
- **Hyperbolic**: sinh, cosh, tanh, asinh, acosh, atanh
- **Dot product**: Optimized with SIMD and FMA
- **Reductions**: sum (naive + Kahan), min, max, sum_of_squares, mean, variance, stddev, covariance, correlation
- **Activation functions**: relu, leaky_relu, elu, sigmoid, softmax, log_softmax, gelu, swish/silu
- **Preprocessing**: zscore, minmax_normalize, clip
- **Index operations**: argmin, argmax
- **Vector norms**: L1, L2, L∞, normalization to unit vectors
- **Scalar operations**: scale (scalar multiplication with full SIMD)

#### Matrix Operations
- Matrix multiplication (matmul) - naive O(n³) algorithm
- Matrix transpose - O(mn) swap operation
- Constructors: new(), from_vec(), zeros(), identity()
- Accessors: get(), get_mut(), rows(), cols(), shape(), as_slice()

#### Performance Optimizations
- SSE2 SIMD (128-bit): 3-4x speedup on dot product vs scalar
- AVX2 SIMD (256-bit): Additional 1.8x speedup with FMA
- Runtime dispatch based on CPU features
- Kahan summation for numerical stability
- Numerically stable algorithms (softmax with max subtraction, correlation clamping)
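The compensated summation mentioned above can be sketched as a textbook Kahan loop (a minimal sketch, not the library's exact implementation):

```rust
// Kahan summation: `c` carries the low-order bits that a plain
// running sum would discard, bounding error independent of length.
fn kahan_sum(data: &[f32]) -> f32 {
    let mut sum = 0.0f32;
    let mut c = 0.0f32;
    for &x in data {
        let y = x - c;     // subtract the prior compensation
        let t = sum + y;   // big + small: low bits of y are lost...
        c = (t - sum) - y; // ...and recovered here
        sum = t;
    }
    sum
}

fn main() {
    // 10_000 ones appended after 1.0e8 vanish under naive f32 summation.
    let mut data = vec![1.0e8f32];
    data.extend(std::iter::repeat(1.0f32).take(10_000));
    let naive: f32 = data.iter().sum();
    assert_eq!(naive, 1.0e8); // every +1.0 rounds away
    assert!((kahan_sum(&data) - 100_010_000.0).abs() <= 16.0);
}
```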

#### Testing & Quality
- 611 unit tests (100% passing)
- 101 doctests (100% passing)
- Property-based testing with proptest (100 cases per test)
- Zero clippy warnings
- Zero rustdoc warnings
- EXTREME TDD methodology applied throughout
- Mutation testing support
- Pre-commit quality gates via PMAT

#### Documentation
- Comprehensive rustdoc with examples for all public APIs
- README with performance benchmarks
- Quick start guide
- Phase roadmap (Phases 1-7 complete, Phase 8 in progress)
- 4 comprehensive examples:
  - activation_functions.rs
  - backend_detection.rs
  - ml_similarity.rs
  - performance_demo.rs

### Changed
- Improved numerical stability for variance/stddev with hybrid tolerance (absolute for small values, relative for large)
- Improved correlation() to clamp results to \[-1, 1\] to handle floating-point precision
- Optimized property tests with appropriate tolerances for floating-point comparisons

### Fixed
- Fixed 4 property test failures in variance/stddev operations with better tolerance handling
- Fixed all 64 rustdoc link resolution warnings by escaping mathematical notation
- Fixed atanh(tanh(x)) round-trip precision for extreme values by restricting range
- Fixed covariance bilinearity test with increased tolerance for compounding FP errors
- Fixed zscore tests for small sample sizes (n<3) and near-constant vectors

### Performance

#### Benchmarks (vs Scalar Baseline)
| Operation | Size | SSE2 | AVX2 | Notes |
|-----------|------|------|------|-------|
| Dot Product | 10K | 3.4x | 6.2x | FMA acceleration |
| Sum | 1K | 3.15x | - | - |
| Max | 1K | 3.48x | - | - |
| Add | 1K | 1.03x | 1.15x | Memory-bound |
| Mul | 1K | 1.05x | 1.12x | Memory-bound |

All benchmarks verified with Criterion.rs.

### Technical Details

#### Quality Metrics
- Test coverage: >90%
- Test execution time: 0.09s (target: <30s) - 333x faster than requirement
- TDG Score: 95.2/100 (A+)
- Zero defects at release
- Toyota Way principles applied (Jidoka, Kaizen, Genchi Genbutsu, Hansei, Poka-Yoke)

#### Platform Support
- x86_64: SSE2/AVX/AVX2/AVX-512
- ARM: NEON
- WASM: SIMD128
- GPU: Planned (infrastructure ready)

#### Dependencies
- thiserror: 2.0 (error handling)
- proptest: 1.8 (property-based testing, dev-only)
- criterion: 0.5 (benchmarking, dev-only)

### Breaking Changes
None - this is the initial release.

### Migration Guide
This is the first release. To use:

```toml
[dependencies]
trueno = "0.1"
```

```rust
use trueno::{Vector, Matrix};

let v = Vector::from_slice(&[1.0, 2.0, 3.0]);
let result = v.add(&v).unwrap();

let m = Matrix::identity(3);
let transposed = m.transpose();
```

### Known Limitations
- Matrix operations use naive algorithms (future: SIMD, GPU, blocked matmul)
- GPU backend infrastructure exists but not yet activated
- No matrix-vector multiplication yet (planned Phase 8)
- No compile-time backend selection (runtime only)

### Contributors
- Pragmatic AI Labs Team
- Claude (AI pair programmer)

### Links
- Repository: https://github.com/paiml/trueno
- Documentation: https://docs.rs/trueno/0.1.0
- Crates.io: https://crates.io/crates/trueno

---

## [Unreleased]

### Planned for v0.3.0
- SIMD-optimized activation functions (AVX2/AVX-512)
- Performance regression CI integration
- Matrix-vector multiplication
- Additional backends (WASM SIMD128)

[0.7.1]: https://github.com/paiml/trueno/releases/tag/v0.7.1
[0.7.0]: https://github.com/paiml/trueno/releases/tag/v0.7.0
[0.6.0]: https://github.com/paiml/trueno/releases/tag/v0.6.0
[0.5.0]: https://github.com/paiml/trueno/releases/tag/v0.5.0
[0.4.1]: https://github.com/paiml/trueno/releases/tag/v0.4.1
[0.4.0]: https://github.com/paiml/trueno/releases/tag/v0.4.0
[0.3.0]: https://github.com/paiml/trueno/releases/tag/v0.3.0
[0.2.2]: https://github.com/paiml/trueno/releases/tag/v0.2.2
[0.2.1]: https://github.com/paiml/trueno/releases/tag/v0.2.1
[0.1.0]: https://github.com/paiml/trueno/releases/tag/v0.1.0
[Unreleased]: https://github.com/paiml/trueno/compare/v0.7.1...HEAD