# CLAUDE.md

## Project Overview

**Trueno** — Rust library for unified high-performance compute:
CPU SIMD (SSE2/AVX/AVX2/AVX-512/NEON/WASM SIMD128),
NVIDIA CUDA (pure Rust PTX via `trueno-gpu`, no nvcc),
cross-platform GPU (wgpu: Vulkan/Metal/DX12/WebGPU).

**Canonical spec:** `docs/specifications/trueno-spec.md`
(ONE spec, sub-specs in `docs/specifications/sub/`).

**Core Principles**: Write once, optimize everywhere; runtime dispatch;
zero `unsafe` in the public API; every optimization must prove a ≥10% speedup;
≥90% test coverage; **contract-first** (no kernel ships without a YAML
contract in `contracts/`).
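Runtime dispatch can be sketched as follows. This is an illustrative shape only, not Trueno's actual API: `Backend`, `detect_backend`, and `dot` are hypothetical names, and the single scalar arm stands in for the real per-backend kernels.

```rust
// Hypothetical sketch of runtime dispatch: probe CPU features once,
// then route each call to the best available backend.
#[derive(Clone, Copy)]
enum Backend {
    Avx2,
    Sse2,
    Scalar,
}

fn detect_backend() -> Backend {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            return Backend::Avx2;
        }
        if is_x86_feature_detected!("sse2") {
            return Backend::Sse2;
        }
    }
    Backend::Scalar
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    match detect_backend() {
        // Real code would dispatch to the AVX2/SSE2 kernels here;
        // the scalar path doubles as the universal fallback.
        Backend::Avx2 | Backend::Sse2 | Backend::Scalar => {
            a.iter().zip(b).map(|(x, y)| x * y).sum()
        }
    }
}
```

Detection happens once per call here for brevity; a real implementation would cache the probe result.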

## Development Commands

```bash
# Build
cargo build                                    # standard
cargo build --all-features                     # all features
cargo build --target wasm32-unknown-unknown    # WASM

# Test
cargo test --all-features                      # all tests
cargo test property_tests                      # proptest
cargo test backend_equivalence                 # cross-backend
cargo test --test integration_tests            # integration

# Coverage — ONLY use make coverage (never cargo llvm-cov or tarpaulin)
make coverage                                  # MUST be ≥90%
make coverage-check                            # exits with error if <90%

# Lint
cargo clippy --all-features -- -D warnings
cargo fmt -- --check

# Bench
cargo bench --no-fail-fast

# Profile (Renacer v0.5.0+)
make profile                                   # benchmark profiling
make profile-flamegraph                        # flamegraph
make profile-otlp-jaeger                       # OTLP → Jaeger

# Golden trace validation
renacer --assert renacer.toml -- \
  ./target/release/examples/backend_detection

# Quality gates
pmat analyze tdg --min-grade B+
pmat repo-score . --min-score 90
cargo mutants --timeout 120 --minimum-pass-rate 80
```

## Code Search

**Use `pmat query` for code discovery, not grep/rg.**
See global CLAUDE.md for full flag reference.

```bash
pmat query "simd kernel" --limit 10
pmat query "matrix multiply" --min-grade A --exclude-tests
pmat query "gpu compute" --faults --churn -G   # full audit
```

## Backend Story Policy (NEVER VIOLATE)

Every operation MUST work on ALL backends: Scalar, SSE2,
AVX/AVX2, AVX-512, NEON, WASM SIMD128, CUDA, wgpu.
**NO EXCEPTIONS.** If GPU acceleration isn't beneficial,
the GPU method must fall back to CPU.

When adding a new operation:
1. **Write contract FIRST** (`contracts/my-op-v1.yaml`)
2. Add to `VectorBackend` trait in `src/backends/mod.rs`
3. Implement in ALL backend modules
4. Add WGSL shader if GPU-accelerable
5. Add sync + async device methods
6. Add integration test in `tests/backend_story.rs`

Enforcement: `tests/backend_story.rs` + CI.
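The trait-plus-fallback shape behind this policy can be sketched as below. The names (`VectorBackend`, `Scalar`, `Gpu`) are illustrative assumptions, not the real definitions in `src/backends/mod.rs`; the point is that every backend implements every op, and a backend that can't accelerate an op delegates to the scalar path rather than omitting it.

```rust
// Every backend implements the full trait: no partial backends.
trait VectorBackend {
    fn add(&self, a: &[f32], b: &[f32], out: &mut [f32]);
    fn dot(&self, a: &[f32], b: &[f32]) -> f32;
}

struct Scalar;

impl VectorBackend for Scalar {
    fn add(&self, a: &[f32], b: &[f32], out: &mut [f32]) {
        for ((o, x), y) in out.iter_mut().zip(a).zip(b) {
            *o = x + y;
        }
    }
    fn dot(&self, a: &[f32], b: &[f32]) -> f32 {
        a.iter().zip(b).map(|(x, y)| x * y).sum()
    }
}

struct Gpu;

impl VectorBackend for Gpu {
    fn add(&self, a: &[f32], b: &[f32], out: &mut [f32]) {
        // GPU not beneficial for this op/size: fall back to CPU,
        // never leave the method unimplemented.
        Scalar.add(a, b, out);
    }
    fn dot(&self, a: &[f32], b: &[f32]) -> f32 {
        Scalar.dot(a, b)
    }
}
```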

## Provable-Contracts Integration

**Contract-first design:** `build.rs` reads
`../provable-contracts/contracts/trueno/binding.yaml`
(38/38 bindings implemented), sets `CONTRACT_*` env vars,
enforces **AllImplemented** policy — any `not_implemented`
binding fails the build.

**Escape-proof pipeline:** Equation → Lean proof →
YAML validation → build.rs codegen → `#[contract]` proc
macro → FALSIFY tests.
See `docs/specifications/sub/contracts.md` for details.

```bash
pv lint contracts/                  # 7-gate quality check
pv verify-bindings                  # check binding registry
pv score                            # PVScore metric
```

## Quality Standards

**Every commit**: clippy clean, all tests pass,
≥90% coverage (`make coverage`), rustfmt, PMAT TDG ≥B+.

**Every PR**: Tests for new code (unit, property, backend
equivalence, mutation, benchmark), rustdoc updated,
benchmarks prove ≥10% improvement.

**Test categories**: Unit (edge cases: empty inputs, NaN, subnormals),
property-based (proptest), backend equivalence
(f32 tolerance < 1e-5), mutation (≥80% kill rate),
benchmarks.
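A backend-equivalence check boils down to the sketch below. The helper names are assumptions; `simd_mul` is a stand-in for a real AVX2/NEON/GPU kernel, with the scalar version serving as the correctness reference.

```rust
// Scalar reference implementation.
fn scalar_mul(a: &[f32], s: f32) -> Vec<f32> {
    a.iter().map(|x| x * s).collect()
}

// Stand-in for a SIMD/GPU backend; the real suite would call the
// accelerated kernel for the same op here.
fn simd_mul(a: &[f32], s: f32) -> Vec<f32> {
    a.iter().map(|x| x * s).collect()
}

// Elementwise comparison within the mandated f32 tolerance (< 1e-5).
fn assert_equivalent(a: &[f32], b: &[f32], tol: f32) {
    assert_eq!(a.len(), b.len());
    for (x, y) in a.iter().zip(b) {
        assert!((x - y).abs() < tol, "mismatch: {x} vs {y}");
    }
}
```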

## LAYOUT-002: Row-Major Mandate (Q4K/Q6K)

The Sovereign AI Stack uses **ROW-MAJOR exclusively** for
APR/GGUF data. Column-major kernels exist for internal
BLAS-style ops only. Garbage inference output is the classic
symptom of a wrong kernel layout. Aprender handles the
GGUF→APR transpose during import.
See `book/src/advanced/phase15-fused-q4k.md` for the fused
dequant+dot spec.
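The layout bug this mandate prevents is just an indexing mismatch, as the minimal sketch below shows: applying the column-major formula to row-major data silently reads the wrong element, which at inference scale produces garbage output.

```rust
// Row-major: elements of a row are contiguous.
fn row_major_index(row: usize, col: usize, cols: usize) -> usize {
    row * cols + col
}

// Column-major: elements of a column are contiguous.
fn col_major_index(row: usize, col: usize, rows: usize) -> usize {
    col * rows + row
}
```

For a 2×3 matrix stored row-major as `[1, 2, 3, 4, 5, 6]`, element (1, 0) is `4`; indexing the same buffer with the column-major formula reads `2` instead, with no error raised.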

## Safety Rules

- `unsafe` ONLY in backend implementations, never in public API
- Every `unsafe` block needs a safety comment
- SIMD intrinsics wrapped in `#[target_feature]` functions
- Always handle SIMD remainder with scalar fallback
- GPU only for >100K elements (PCIe transfer ~0.5ms)
- f32 comparisons: tolerance < 1e-5
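The first four rules combine into the pattern sketched below (x86_64, SSE2; an illustrative kernel, not Trueno's actual code): intrinsics live behind `#[target_feature]`, the `unsafe` entry has a safety comment, the public wrapper is safe and feature-gated, and the tail that doesn't fill a SIMD lane falls back to a scalar loop.

```rust
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse2")]
unsafe fn sum_sse2(a: &[f32]) -> f32 {
    use std::arch::x86_64::*;
    let chunks = a.len() / 4;
    let mut acc = _mm_setzero_ps();
    for i in 0..chunks {
        // SAFETY: i * 4 + 3 < a.len(), so the 4-wide read is in
        // bounds; _mm_loadu_ps permits unaligned pointers.
        acc = _mm_add_ps(acc, _mm_loadu_ps(a.as_ptr().add(i * 4)));
    }
    let mut lanes = [0.0f32; 4];
    _mm_storeu_ps(lanes.as_mut_ptr(), acc);
    let mut total: f32 = lanes.iter().sum();
    // Remainder that didn't fill a full SIMD lane: scalar fallback.
    total += a[chunks * 4..].iter().sum::<f32>();
    total
}

// Safe public entry point: no `unsafe` leaks into the API.
fn sum(a: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    if is_x86_feature_detected!("sse2") {
        // SAFETY: SSE2 support was verified at runtime just above.
        return unsafe { sum_sse2(a) };
    }
    a.iter().sum()
}
```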

## trueno-gpu (Pure Rust CUDA)

Pure Rust PTX generation — no nvcc, no LLVM. Kernels: GEMM
(naive/tiled/tensor core), Softmax, LayerNorm, Attention
(FlashAttention-style), Q4_K dequantization.

```bash
cargo test -p trueno-gpu --features cuda   # requires GPU
cargo test -p trueno-gpu property_tests    # no hardware
```

## WGPU Inference & Training

**Inference**: `WgslForwardPass` — RMSNorm, GEMV, SiLU, RoPE.
27.6 tok/s on Radeon Pro W5700X.

**Training**: 9 shaders in `src/backends/gpu/shaders/backward.rs`
(6 backward + adamw + nf4_dequant + cross_entropy_forward).
All FALSIFY tests pass. Full training loop on AMD/Intel/Apple.

## Blackwell (trueno#200, trueno#203)

**JIT Bug (#200)**: `cuModuleLoadDataEx` fails on sm_121 during
active GPU work. Forward kernels work after pre-warming;
backward kernels crash. Inference unaffected (cuBLAS/SIMD).

**Fix (#203)**: Dimension-independent kernels (M,K,N as runtime
params, ~15 types vs 50+ variants) + pre-compiled cubin pipeline
(`build.rs → nvcc → include_bytes!()` → zero JIT at runtime).

## Stack Search

```bash
batuta oracle --rag "your question here"
batuta oracle --rag-index               # reindex
```