# baracuda

> **About the name.** Yes, we know — it's spelled **barracuda** (two Rs). That
> name was taken on crates.io, so we dropped one R and kept swimming.
A unified Rust ML-op facade over the NVIDIA CUDA ecosystem.




## What baracuda is
baracuda is a Rust workspace that exposes every primitive an ML framework
expects — the union of PyTorch (`torch.*` + `nn.functional`) and JAX
(`jax.lax.*` + `jax.numpy.*`) — through a single `Plan`-based crate surface
called [`baracuda-kernels`]. Internally each plan dispatches to:
1. The appropriate NVIDIA-library wrapper crate (cuBLAS, cuDNN, cuFFT,
cuSOLVER, cuRAND, cuSPARSE, cuTENSOR, NPP, CV-CUDA, CUTLASS) when one
already covers the op well, or
2. A bespoke hand-rolled `.cu` kernel shipped in [`baracuda-kernels-sys`]
when no NVIDIA library covers the op (or covers it poorly at the shapes
that matter for modern transformer / vision / GNN workloads).
Callers import **one** crate (`baracuda-kernels`) and reach for **one** API
style. The dispatch decision — which is observable through
`Plan::sku()` for telemetry — is otherwise invisible. Switching from a
CUTLASS-backed SKU to a bespoke-backed SKU is a layout flag, not an import
change.
baracuda is for downstream Rust ML / inference / training frameworks that
need access to the full CUDA stack without re-vendoring it themselves. The
workspace also ships idiomatic stand-alone wrappers for every CUDA library
under `crates/baracuda-<lib>` if you want to skip the kernel facade and
talk to one library directly.
## Status
**In active development — alpha.67.** **2280+ GPU tests passing,
zero failures** across the 6 critical test crates on an RTX 4070
(sm_89; the `baracuda-kernels` suite alone is 2180/0 across 513 test
binaries). alpha.67 (Phase 74, Fuel-ask) ships the **plain dense FP
GEMM family**: 12 cuBLAS-backed flat C symbols
(`baracuda_kernels_gemm_dense_{f32, f64, f16, bf16}_*`) with
RRR / RCR / **CRR** layouts, flexible leading dims, and
strided-batch folded into the base signature — plus the
`DenseGemmPlan<T>` typed plan, the `ReduceToPlan<T, N>`
broadcast-reverse reduction facade, `UnaryKind::Step`, and gelu
flavor-disambiguation docs. This closes the last non-baracuda CUDA
surface in Fuel (its own cuBLAS MatMul wrapper). alpha.66 (Fuel-ask)
exposed **per-device VRAM queries** on the Driver-API `Device` —
`vram_free()` / `vram_total()` / `vram_info()` wrap
`cuMemGetInfo_v2` so downstream optimizers can read runtime memory
pressure without round-tripping through a failed allocation. Phase 73 follow-up (alpha.65) lands a **17-33× decode
speedup** via a focused FlashDecodingPlan (split-K, seq_q=1),
**4× win at GQA shapes** via the new `num_kv_heads` descriptor
field, the long-awaited `FlashSdpaPlan` GQA-broadcast routing fix,
and a SDPA gap closure that makes baracuda's standard MHA shape
**50% faster than PyTorch** by making `fa2` a default feature. The
ConcatPlan (KV-cache decode) and reduce_axis (small-shape rows)
kernels also got significant rewrites — 13× and 2.6-15.4×
respectively. Phase 63 (alpha.63, Fuel-ask) closes the FlashAttention
saved-tensor wiring gap: **NEW
`baracuda_kernels_fa2_sdpa_lse_size(batch, num_heads, seq_q) -> usize`**
dense LSE pre-allocation helper (sibling of the existing varlen
`_varlen_lse_size`); **load-bearing "LSE saved-tensor contract"
docs** on the FW + BW trailblazers naming the exact FW→saved-LSE→BW
handoff; **new docs guide** at `docs/guides/fa2-saved-tensor-contract.md`
showing the wiring pattern downstream autograd frameworks should
use. **Test investment**: 8 new tests — 3 host-only `lse_size`
helper sanity tests + 5 GPU FW→BW roundtrip tests (f16/bf16 ×
d128 causal/noncausal + d64 causal) + 4 BW feature-surface tests
backfilling the gaps left by `fa2_backward_smoke.rs` (BW with
sliding window, softcap, ALiBi, and all-features composed). The
existing `softmax_lse` output on `baracuda_kernels_fa2_sdpa_<dt>_run`
(v1) and `..._run_v2` (Phase 59a) — turns out Fuel didn't need a
new "v3 with lse" variant; the FW has been writing LSE since
alpha.56, just lacked the size helper + saved-tensor wiring
documentation. Phase 62 (alpha.62, Fuel-ask) lifts the in-place
op contract from contig-only (Phase 61) to **strided** by shipping
11 new affine in-place FFI symbols (4 contig int dtype backfill —
`i32`/`i64`/`u8`/`i8` matching forward affine; 7 strided variants
across the full forward-strided dtype matrix — `f32`/`f64`/`i32`/`i64`/`u8`/`bf16`/`f16`)
and documenting the **stride-equality precondition** for
same-pointer aliasing on the unary / binary / ternary strided
trailblazers as a stable public contract. NEW
`baracuda_kernels_types::strides_equal` host helper for callers
to validate the precondition before dispatch. Zero new bespoke
kernels for the elementwise unary/binary/ternary families —
their existing strided launchers are aliasing-safe under the
contract. Test investment: 14 host-only unit tests +
17 GPU smoke tests for the new FFI surface + 7 aliasing-contract
proof tests across contig + strided trailblazers (also backfills
the Phase 61 contig contract that shipped without test coverage).
Phase 61 (alpha.61) completed the alpha.55 baseline with bf16/f16
in-place affine + the original contig same-pointer contract doc.
Phase 60 lifted FA2 FW to the full Candle-fork-extended
9-head_dim set ({32, 64, 96, 128, 160, 192, 224, 256, 512}) via
12 new vendored `.cu` files + 32 FW smoke tests. Phase 59a + 59b
added the full FA2 v2.8.3 surface (FW + BW + varlen across
head_dims 32-256, GQA, ALiBi, sliding window, softcap) plus 48
new smoke tests, closing Fuel's FA2-retirement requirements. Phase 59c (consolidation pass, alpha.59) fixed a
pre-existing parallel-test race in the bespoke flash kernel's
SMEM-carveout call surfaced by Phase 59a's 5-head_dim fanout, plus
updated `flash_sdpa_backward_smoke` to force the bespoke backend on
f16/bf16 (Phase 59b made FA2 the new default BW backend, breaking
source-compat for the existing bespoke BW smoke tests).
Phase 42-44
add three opt-in backends (FA2, mHC.cu, ozIMMU); none are on the
default build path. Phase 44b internalized the ozIMMU sources
(clean-fork; cutf submodule retired; Linux + Windows both build).
Phase 49 adds the `baracuda-optim` sibling crate (Adam / LAMB / SGD
via vendored Apex `multi_tensor_apply`) — gated behind the `optim`
feature so inference-only consumers don't pay the FFI surface cost.
Phase 55 adds the `baracuda-transformer-engine` sibling crate
(NVIDIA TransformerEngine FP8 cast + delayed-scaling recipe,
Apache-2.0) — gated behind the `tensor_engine` feature. On Ada
(sm_89) the FP8 wins are bandwidth-saving only (KV cache, weights);
the recipe machinery is forward-compatible with Hopper / Blackwell
where the MMA throughput win also materializes.
Phase coverage (see [`ARCHITECTURE.md`](ARCHITECTURE.md) for the phase
matrix):
| Phase | Scope | Status |
| --- | --- | --- |
| 59a | FA2 FW expansion (alpha.59) — full upstream feature parity (head_dim fanout {32,64,96,192,256}; GQA; ALiBi; sliding window; softcap): vendored 20 new `.cu` files from Dao-AILab FA2 v2.8.3 (head_dims 32/64/96/192/256 × {fp16, bf16} × {causal, non-causal}) bringing the FW vendor coverage to the full upstream set of {32, 64, 96, 128, 192, 256}. Upstream FA2 v2.8.3 does NOT ship head_dims 160/224/512 — those are permanently out-of-scope (no source). Launcher (`kernels/attention/fa2_launcher.cu`) rewritten to dispatch all 6 supported head_dims via runtime switch. NEW `..._run_v2` + `..._can_implement_v2` FFI entry points (+4 symbols) carrying ALiBi slopes + per-head-or-per-batch layout selector + sliding window left/right bounds + Gemma-2-style softcap. v1 entry points preserved for backwards-compat. GQA-divisible head counts (`num_heads % num_heads_k == 0`) now accepted on the FA2 path. `FlashSdpaDescriptor` is now `#[non_exhaustive]` with `::new(...)` + chainable `with_window_size_left`/`with_window_size_right`/`with_softcap` builders (Phase 32 convention). `FlashSdpaArgs` gained `alibi_slopes: Option<TensorRef<f32, 2>>`. Bespoke backend rejects sliding-window/softcap/ALiBi at select-time with clear errors. ~33 descriptor + ~30 args callsites migrated to the builder pattern. 4 new smoke test files (26 new test functions): `fa2_hdim_fanout_smoke` (20), `fa2_gqa_smoke` (1), `fa2_alibi_smoke` (3), `fa2_sliding_window_smoke` (3), `fa2_softcap_smoke` (4). Out of scope (Phase 59b territory): BW path, varlen, split-KV. | done |
| 59c | Bespoke flash SMEM-carveout race fix + flash_sdpa_backward smoke test routing fix (alpha.59 consolidation pass): added `std::mutex`-serialized helper `set_dynamic_smem_serialized` around all `cudaFuncSetAttribute(MaxDynamicSharedMemorySize)` calls in `baracuda_flash_sdpa.cuh` + `baracuda_flash_sdpa_sm89.cuh` (5 call sites total: FW + BW dQ + BW dKdV + sm_89 FW + sm_89 strided FW). Pre-existing flake (root cause: Phase 6 / Milestone 6.6 host wrapper) that surfaced as `CutlassInternal(1001)` (= `cudaErrorMissingConfiguration`) at ~33% rate on Phase 59a's 20-test hdim fanout, specifically for d_k=96 + fp16 (smem ~50 KiB, just past the 48 KiB cudaFuncSetAttribute trigger). Confirmed fix via 3 stress runs after fix: 60/60 tests pass. Also fixed `flash_sdpa_backward_smoke`'s f16/bf16 paths to explicitly request `BackendKind::Bespoke` — Phase 59b made FA2 the default BW backend for f16/bf16 (more permissive heuristic), which broke source-compat for the existing bespoke BW smoke tests (they fed `lse: f16` not `lse_f32`). | done |
| 59b | FA2 BW + varlen (alpha.59; closes Fuel's FA2-retirement requirements): vendored 24 new BW `.cu` files (`flash_bwd_hdim{32,64,96,128,192,256}_{fp16,bf16}_{,causal}_sm80.cu` — full FA2 v2.8.3 BW set, mirrors 59a FW vendor 1:1) plus 3 new BW headers (`flash_bwd_kernel.h`, `flash_bwd_launch_template.h`, `flash_bwd_preprocess_kernel.h`). **Key finding**: varlen does NOT have a separate .cu file family upstream — FA2 v2.8.3 dispatches varlen via a runtime `cu_seqlens_q != nullptr` check inside the existing FW/BW launch templates, so the same per-(headdim, dtype, causal) instantiations serve dense and varlen callers. NEW `kernels/attention/fa2_backward_launcher.cu` (BW dispatch, supports dense + varlen via two `fill_*_params` helpers) + `fa2_varlen_launcher.cu` (varlen FW). +12 new FFI symbols (BW dense ×2 + can_implement ×2 + workspace_size; varlen FW ×2 + can_implement ×2 + lse_size; varlen BW ×2 + can_implement ×2 + workspace_size). API: `FlashSdpaBackwardDescriptor` is now `#[non_exhaustive]` with `::new(...)` + sliding-window/softcap builders. `FlashSdpaBackwardArgs` gained `lse_f32: Option<TensorRef<f32, 3>>` (FA2 stores LSE in f32 regardless of T) + `alibi_slopes`. `FlashSdpaBackwardPlan` extended with `BackendChoice::FlashAttentionV2` arm (additive — bespoke path source-compat preserved). NEW `FlashSdpaVarlenPlan` / `FlashSdpaVarlenBackwardPlan` plan families with packed-batch `[total_q, H, D]` layout + `cu_seqlens_q`/`cu_seqlens_k` index tensors + f32 LSE `[H, total_q + 128*B]`. BW workspace = `dq_accum + dsoftmax_d` (sizes via `..._backward_workspace_size`); launcher zero-fills via `cudaMemsetAsync`. Determinism: FA2 BW uses atomicAdd into dq_accum, so NOT bit-stable run-to-run (precision SKU tags this honestly). 2 new smoke test files: `fa2_backward_smoke.rs` (12 tests: workspace sizing + eligibility + e2e BW for d ∈ {64,128,192,256} × {f16,bf16} × {causal,non-causal}), `fa2_varlen_smoke.rs` (5 tests: plan selection, lse_size formula, varlen FW with 3 packed sequences, varlen BW with 2 sequences, varlen × GQA). | done |
| 60 | FA2 head_dim {160, 224, 512} FW expansion (alpha.60) — **corrects Phase 59a's incorrect "permanently out-of-scope" claim**. The Candle fork (`EricLBuehler/candle`) has carried hd160/192/224/256 since 2023-07 (PR #245 by Laurent Mazare); hd224 was restored by PR #2688 (Michael Feil, 2024-12-31); hd512 was added by PR #3417 (Eric Buehler, merged 2026-03-28 — adds the `cudaDeviceGetAttribute(cudaDevAttrMaxSharedMemoryPerBlockOptin)` SMEM opt-in path and updates the splitkv block-size formula). Phase 60 vendors the 12 missing FW `.cu` files from those PRs into baracuda's FA2 tree (8 hd160/224 from `EricLBuehler/candle@main`; 4 hd512 from `huggingface/candle@5430d32c`) plus the corresponding `flash_fwd_launch_template.h` + `static_switch.h` patches. **BW path NOT extended** — hd160/224 fall on FA2 BW kernel's `kBlockKSmem = (kHeadDim % 64 == 0) ? 64 : 32` constraint (BW atom_layout assumes 64); hd512 needs `kBlockM = 32` to fit any SMEM budget but BW kernel_traits static-asserts `kBlockM >= 64`. Upstream FA2 and the Candle fork ship no BW for these three either — limitation is fundamental to FA2's BW algorithm, not an oversight. Phase 60 attempted both paths; the experiment + reasoning is documented in `VENDOR.md`, in code comments at the dropped registration sites, and in `FA2_BW_SUPPORTED_HEAD_DIMS` (kept at `{32, 64, 96, 128, 192, 256}`). Callers needing BW at hd160/224/512 transparently fall back to the bespoke 3-kernel SDPA BW pipeline (the only path that was supporting them previously, anyway). 12 new FW smoke test functions in `fa2_hdim_fanout_smoke`. `FA2_SUPPORTED_HEAD_DIMS` (FW) lifted to `{32, 64, 96, 128, 160, 192, 224, 256, 512}` — full Candle-fork-extended set. | done |
| 63 | FlashAttention saved-tensor contract — dense LSE size helper + FW/BW wiring docs (alpha.63, Fuel-ask). Closes the wiring gap for downstream autograd integration of baracuda's FA2 backward. **1 new FFI symbol**: `baracuda_kernels_fa2_sdpa_lse_size(batch, num_heads, seq_q) -> usize` returning f32 element count — sibling of the existing `_varlen_lse_size` (Phase 59b). FW v1 + v2 have written `softmax_lse` since alpha.56 — turns out Fuel didn't need a new "v3 with lse" forward variant as their ask suggested; just the size helper for pre-allocation + clarity on the saved-tensor pattern. **Documented "LSE saved-tensor contract"** as a stable public contract on the FW + BW trailblazers (`baracuda_kernels_fa2_sdpa_f16_run` / `..._backward_f16_run`): pre-allocate via `_lse_size`, pass same f32 buffer to FW as output and BW as input, ALiBi/sliding-window/softcap params must match between FW and BW. **NEW docs guide** at `docs/guides/fa2-saved-tensor-contract.md` (~180 lines, ASCII handoff diagram + side-by-side FW/BW code samples). BW head_dim cap confirmed at 256 (matches Fuel's Vulkan limit); hd160/224/512 BW remains structurally not supported by FA2 (per Phase 60), callers fall back to bespoke `SdpaBackwardPlan`. **Test investment**: 3 host-only `lse_size` helper tests + 5 GPU FW→BW roundtrip tests (the load-bearing wiring proof — pre-allocate LSE, FW writes it, BW reads it, verify dQ/dK/dV finite + non-zero) + 4 BW feature-surface tests (sliding window, softcap, ALiBi, all-features composed — backfills the gaps in the existing `fa2_backward_smoke.rs` which only tested the base path with `alibi_slopes: None`). All 12 new tests pass on RTX 4070. Option B (recompute-LSE backward variant) explicitly rejected: 2× backward compute for zero functional benefit when the saved-tensor pattern already works. PagedAttention backward filed as "ask if needed" per Fuel — not preemptively built. | done |
| 62 | Strided in-place op support + comprehensive test investment (alpha.62, Fuel-ask). Lifts the in-place contract from contig-only (Phase 61) to strided. **11 new FFI symbols** on the affine in-place family: 4 contig int dtype backfill (`i32`/`i64`/`u8`/`i8` matching forward affine matrix) + 7 strided variants across the full forward-strided dtype set (`f32`/`f64`/`i32`/`i64`/`u8`/`bf16`/`f16`). Half-precision strided uses the same f32-scalar / upcast-to-f32 / downcast pattern as the forward strided f16/bf16 kernels. **Same-pointer aliasing contract documented for the strided trailblazers** (`unary_neg_f32_strided_run`, `binary_add_f32_strided_run`, `ternary_clamp_f32_strided_run`) as a stable public contract: aliasing is safe IFF the aliased input's stride array equals `stride_y` element-for-element. NEW `baracuda_kernels_types::strides_equal(a, b)` host helper for callers to validate the precondition before dispatching. **Zero new bespoke CUDA kernels** for the elementwise unary/binary/ternary families — their existing strided launchers are structurally aliasing-safe under the contract (each thread reads its own stride-offset cell before writing, same per-thread pattern as the contig case). Unblocks Fuel's strided in-place op fanout (every existing strided unary/binary/ternary forward kernel becomes an in-place candidate with `x_ptr == y_ptr` + equal strides). **Test investment**: 14 new host-only unit tests for `strides_equal` + `contiguous_stride` in `baracuda-kernels-types`; 17 new GPU direct-FFI smoke tests covering all 11 new affine in-place symbols + backfill tests for the alpha.55 baseline (f32/f64) and alpha.61 half-precision (bf16/f16) contig in-place; 7 aliasing-contract proof tests across contig + strided trailblazers (also backfills the Phase 61 contig contract that shipped without test coverage). Multi-pass families (Softmax / LayerNorm / RMSNorm / etc.) explicitly out of scope — same as Phase 61. | done |
| 61 | In-place op infrastructure completion + same-pointer aliasing contract (alpha.61, Fuel-ask) — 2 new bf16/f16 FFI symbols + docstring tightening. (1) `baracuda_kernels_affine_inplace_{bf16,f16}_run` complete the 4-dtype matrix on top of the alpha.60 f32/f64 in-place affine helper, with f32-scalar ABI matching the forward `affine_{bf16,f16}_run` convention (avoids passing `__nv_bfloat16`/`__half` by value through the C ABI). Kernels reuse the forward upcast-to-f32 / downcast-to-storage pattern from `affine_contig_kernel_{f16,bf16}`. Unblocks Fuel's `Op::AddScalar`/`Op::MulScalar` in-place rewrites + weight-decay scaling on bf16/f16 model weights without the previous Cast → Affine → Cast scratch-buffer round-trip. (2) Documented same-pointer aliasing safety as a stable public contract on the three contig elementwise trailblazers — `unary_neg_f32_run` (covers ~30 plain unary launchers + `unary_param_*` family across all dtypes via the existing "Same device-pointer contract..." inheritance line), `binary_add_f32_run` (covers ~20 binary launchers), `ternary_clamp_f32_run` (already documented since alpha.36). Unblocks Fuel's planned in-place expansion (16+ unary in-place op families + 4 binary in-place op families + ClampInplace + PowIInplace) with zero new baracuda symbols for the elementwise case — Fuel dispatches the forward symbol with `x_ptr == y_ptr` (or `a_ptr == y_ptr` for binary). Strided in-place variants (Phase 62 candidate) deferred — v1 contract from Fuel's executor is contiguous + zero-offset. | done |
| 0 | Crate scaffolding, shared type vocabulary | done |
| 1 | int8 GEMM RRR (Fuel-blocking, 18 SKUs) | done |
| 2 | FP8 / int4 / bin GEMM completion | done |
| 3 | Elementwise + shape / layout (Categories B, B', C, C', D, N) | done |
| 4 | Reductions + scans + random (Categories E, F, Q) | done |
| 5 | Normalization + softmax + loss (Categories G, H, R) | done |
| 6 | Attention + linalg + FFT (Categories K, Linalg, U) | done |
| 7 | Convolution + pooling + indexing + embedding + segment (Categories I, J, L, M, S) | done |
| 8 | Quantization helpers + GGUF + MoE (Category P, V) | done |
| 9 | Sort / topk / image / NMS (Categories O, T) | done |
| 10 | sm_89 (Ada Lovelace) tuning sweep | done |
| 11 | Fuel feedback integration (alpha.27) — ScalarType ergonomics, Conv/Pool fanout, GGUF Q8_K MMVQ, i64 indices, Sparsemax cap lift, atomicAdd-via-CAS, build-env probe | done |
| 12 | PowI + ArgMax/Min u32/i32 outputs (alpha.28) — `IndexOutputElement` sealed trait | done |
| 13 | WriteSlice + Contiguize + sub-byte casts + Triu/Tril (alpha.29) — KV-cache fast path, retires Fuel's D2H/CPU/H2D fallback, plus `DeviceBuffer::zero()` (alpha.30) | done |
| 14 | Strided FFI siblings (alpha.31) — Affine, PowI, Triu/Tril, RoPE+SDPA, GGUF MMVQ activation-strided + W byte offset; 56 new FFI symbols | done |
| 15 | Quick wins + correctness cleanup (alpha.32) — MMVQ alignment guard, OneHot/Nonzero i64 wrappers, MoE fixture race fix | done |
| 16 | Pool completion (alpha.33) — bit-exact adaptive pool {1,2,3}d, bespoke LpPool {1,2}d, bespoke FractionalMaxPool {2,3}d; 48 new FFI symbols | done |
| 17 | SDPA / attention completion (alpha.34) — Flash SDPA sm_89 strided FW + SDPA BW GQA-broadcast atomicAdd | done |
| 18 | Sub-byte / quantized completeness (alpha.35) — f16/bf16 activations for `GgufMmvqPlan` across all 11 block formats × contig + strided; 44 new FFI symbols | done |
| 19 | Fuel retirement asks (alpha.36) — pool/conv FFI facade for cuDNN-backed plans + Upsample Nearest2d + NEW im2col/im2col1d/col2im1d bespoke; vendored Fuel Q8_1 for inspection; 140 new FFI symbols. Surfaced 1.0-freeze prereq for broader library-backed FFI facade audit | done |
| 20 | MoE — Item 4 from Fuel retirement (alpha.37): batched MMVQ × N-experts (36 new FFI symbols across 11 GGUF block formats × 3 activation dtypes + 3 pure-FP); MoE absorb-and-expose proved to be a no-op (Fuel hadn't evolved their kernels since Phase 8.5 vendor; 5 baracuda-side symbols already match) + 2 direct-FFI smoke tests | done |
| 21 | Bilinear interpolate expansion (alpha.38): `align_corners` + scale-factor overrides + f16/bf16 fanout (FW+BW). Breaking change to existing f32/f64 signatures. | done |
| 22 | MMVQ ncols≥64 debug assertion + cuSOLVER FFI facade (alpha.39): 10 cuSOLVER-backed plan families (Cholesky, LU, QR+ormqr, SVD/SvdBatched/SvdaBatched, eigh real+complex, eig, lstsq, solve, inverse) wrapped behind ~50 flat C symbols in `baracuda-kernels-sys/src/cusolver_facade.rs`; closes the Phase 19 library-backed FFI facade gap for cuSOLVER. No feature gate (cuSOLVER ships with the CUDA toolkit). | done |
| 23 | cuFFT + cuRAND FFI facade (alpha.40): 6 cuFFT plan families (FFT 1d/Nd C2C, R2C, C2R) × c32/c64 + f32/f64 + 2 cuRAND families (Uniform, Normal) × f32/f64 = 32 flat C symbols in `baracuda-kernels-sys/src/{cufft,curand}_facade.rs`. cuSPARSE skipped — no baracuda-kernels plans wrap it today. | done |
| 24 | Cutlass GEMM re-export FFI facade (alpha.41): 210 trampolines (70 SKU families × {run, workspace_size, can_implement}) in `baracuda-kernels-sys/src/cutlass_reexport.rs` exposing the full Cutlass GEMM surface (fp16/bf16/tf32/f32_simt/f64/s8/u8 × {rcr, rrr} × {plain, bias, bias+relu/gelu/silu} + strided-batched fp16/bf16). cuTENSOR / NPP / CV-CUDA skipped — no baracuda-kernels plans wrap them. Completes the Phase 19 library-backed FFI facade 1.0-freeze prereq. | done |
| 25-26 | Segment/EmbeddingBag BW completion + BatchedOrmqrWy complex (alpha.42): 9 new Rust plans + 24 new FFI symbols for Segment Max/Min/Prod BW (sorted + unsorted, f32/f64), Unsorted Segment Prod FW (`atomicCAS`-retry mul), EmbeddingBag Max FW+BW (f32/f64/f16/bf16 × i32/i64). Plus BatchedOrmqrWy complex (Complex32, Complex64) via the bespoke WY-block kernels + cuBLAS C/Z gemmStridedBatched (4 new bespoke FFI + 2 cuBLAS symbols). | done |
| 27 | Q8_1 perf inspection (alpha.42 doc-only): Multi-M MMVQ opportunity identified, kept doc-only — bigger ROI than reformatting Q8_1. | done |
| 28 | API hygiene for 1.0 prep (alpha.43): new `KernelDtype` umbrella marker trait extending `Element`/`IntElement`/`FpElement`/`BinElement`; `#[non_exhaustive]` audit across the op-family `*Kind` enums + auxiliary tag enums + `Error` types. `ElementKind` / `LayoutSku` / `ArchSku` / `EpilogueKind` / `ActivationKind` / `Workspace` intentionally left exhaustive (hot-path-dispatched). | done |
| 29 | Cross-implementation benchmark suite (alpha.44): 10 new criterion+CUDA-event benches comparing baracuda against cuBLAS / cuDNN at LLM-typical shapes (GEMM f32/f16/bf16, MMVQ all qtypes, Softmax, LayerNorm, RMSNorm, Conv2d, MaxPool2d, Reductions, Elementwise, Flash SDPA+GQA). ~2,750 LOC of bench code + 13 bench binaries total. Critical finding: baracuda f16/bf16 GEMM is **2-4× slower than cuBLAS at M=1/M=32** (decode regime); validates the deferred Phase 27 multi-M MMVQ port. See [`BENCHMARKS.md`](crates/baracuda-kernels-bench/BENCHMARKS.md) for the methodology + sample run. | done |
| 30 | f16/bf16 GEMM cuBLAS fast-path (alpha.45): adds `PlanPreference::prefer_backend: Option<BackendKind>` + thread-local cuBLAS-handle cache to `GemmPlan`. Heuristic: cuBLAS for f16/bf16 at `2 ≤ M < 128` (decode batch); CUTLASS otherwise. **3× speedup at M=32 f16** (55.6µs → 19.0µs, parity with cuBLAS direct). M=1 stays on CUTLASS (cuBLAS RCR→col-major transa=T mapping slower than CUTLASS sm_80 GEMV-tile at K=N≥2048). Capture-mode auto-fallback to CUTLASS (cuBLAS-classic not capture-safe). 9 new smoke tests. | done |
| 31 | Fuel Phase 6c.2 storage.rs unblock (alpha.46): 5 gaps closed — ELU α parameter (breaking; 8 sigs modified), `powf` (8 new), `step` + `gelu_erf` (16 new), cast `u32`/`i16` (36 new × 2 directions), `reduce_sum_to`/`reduce_max_to` broadcast-reverse reductions (8 new). **~76 new/modified FFI symbols + 17 new smoke tests.** Unblocks Fuel's full PTX retirement (AFFINE/UNARY/BINARY/CAST/REDUCE/INDEXING/TERNARY/FILL/SORT modules). | done |
| 32 | Descriptor `#[non_exhaustive]` + builder pattern (alpha.47): 18 descriptors retrofitted with `::new()` builders + chainable setters (`with_stride`/`with_padding`/`with_dilation`/etc.). Conv {1,2,3}d + ConvTranspose {1,2,3}d + Pool {1,2,3}d + AdaptivePool {1,2,3}d + LpPool {1,2}d + FractionalMaxPool {2,3}d + Interpolate + InterpolateBackward. **Breaking change for downstream struct-literal callers** — pre-1.0 hardening. Migration: `Conv2dDescriptor { ... }` → `Conv2dDescriptor::new(input_shape, filter_shape, element).with_stride(...)`. | done |
| 33 | Multi-M MMVQ via Q8_1 staging (alpha.48): closes Phase 27's deferred opportunity. NEW `GgufMmvqMultiMPlan` + `quantize_q8_1` staging kernel + 4 Q8_0 multi-M launchers (M ∈ {1, 2, 4, 8}). **Bench: 7.29-7.96× speedup at M=8** on Llama-2 7B layer shapes (4096²; 11008×4096; 32000×4096). Q8_0 only this phase (clean partial); 9 remaining block formats (Q4_0/Q4_1/Q5_0/Q5_1/Q2_K..Q6_K) are mechanical fanout for a follow-up. 8 new FFI symbols (3 staging + 4 multi-M + 1 workspace). | done |
| 34 | Multi-M MMVQ block format fanout (alpha.49): 9 remaining GGUF formats shipped — Q4_0, Q4_1, Q5_0, Q5_1, Q2_K, Q3_K, Q4_K, Q5_K, Q6_K. 36 new FFI symbols (9 fmts × 4 M-sizes). **Bench at N=K=4096 M=8**: Q5_0 **17.32×**, Q5_1 15.05×, Q4_0 12.78×, Q4_1 12.15×, Q8_0 8.79× — type-0/1 formats massively exceeded Phase 27's 3-7× target. K-quants (Q2_K..Q6_K) hit 3-7× at M=8 (larger 256-elem super-blocks dilute weight-reuse savings). Q8_K MMVQ correctly rejected at select() — bespoke per Phase 11.4. | done |
| 35 | Test-infra hardening (alpha.50): **first zero-failure regression** in the entire Phase 22-35 sweep (2229/0 across 638 binaries). Five fixes: (a) `mmvq_w_offset_alignment_misaligned_rejected_debug` `#[cfg(debug_assertions)]` gate; (b) cuBLAS handle retry with 5× linear backoff (Phase 30 parallel-init race); (c) cuDNN handle retry on CTC path (1001 NOT_INITIALIZED race); (d) `Stream::capture` panic-safe Drop guard (ThreadLocal capture state leak under cargo's thread reuse → cudaErrorStreamCaptureImplicit on subsequent tests); (e) **`cudaResourceDesc` 48→128 byte expansion + `repr(align(8))`** (Rust struct under-allocated by 16+ bytes AND missing 8-byte alignment that the union's `void*`/`size_t` arms require — caused release-only STATUS_ACCESS_VIOLATION in wave5_smoke). | done |
| 36 | Fuel 6c.4 unblock — Phase 1/3 (alpha.51): RoPE apply with precomputed cos/sin tables (FW+BW × 4 fp dtypes; 16 symbols) + Fill missing dtypes & strided variant (3 new contig + 11 strided; 28 symbols) + Argsort dtype fanout (u8/i8/u32/i16/bf16/f16/fp8e4m3; 14 symbols). 58 new FFI declarations total. | done |
| 37 | Fuel 6c.4 part 2/4 (alpha.52): Reduce family Gap 1 — `reduce_min_to`/`prod_to` broadcast-reverse for 4 fp dtypes (16 symbols) + integer-dtype single-axis sum/min/max/prod + argmin/argmax for U8/I8/U32/I16/I32/I64 (48 symbols, with U64/I64 widened accumulator + store-time narrow on Sum/Prod). 64 new FFI declarations total. Documented bit-exact wrap-on-overflow contract for u8/u32 sum/prod. | done |
| 38 | Fuel 6c.4 part 3/4 (alpha.53): Ternary `where_cond` dtype-matrix fanout — Cond lifted to template parameter; U8 (existing, untouched) + U32 + I64 cond × {f32/f64/f16/bf16, u8/i8/u32/i16/i32/i64, fp8e4m3} value × {contig, strided}. 87 new FFI declarations (58 `_run` + 29 `_can_implement`). Existing `where_<value>_run` family preserved bit-identically (default Cond=uint8_t). | done |
| 39 | Fuel 6c.4 part 4/4 (alpha.53, bundled with Phase 38): Indexing Tier 1 — NEW scatter (pure assign) + index_add for {f32/f64/f16/bf16} × {i32, i64idx} (16 syms) + gather u8idx extras for {f32, f64} (2 syms). 18 new FFI symbols total. Existing per-axis stride arrays meant no separate contig/strided split needed. f16/bf16 index_add uses the Phase 11.3 `atomic::add<T>` atomicCAS helper. Scatter documented + tested with disjoint-target indices (last-writer-wins on collisions, caller-aware non-determinism). | done |
| 40 | Fuel 6c.4 final cleanup (alpha.54): multi-block radix argsort via CUB `DeviceSegmentedRadixSort` for `row_len > 1024` (4 dtypes × 3 entries = 12 syms; bitonic stays for ≤1024) + Indexing Tier 2 integer value-dtype matrix (gather/index_select/scatter for u8/i8/u16/i16/u32/i32/i64 × i32/i64idx = 38 syms; index_add for i32/u32/i64 only = 6 syms). 56 new C symbols total. New `atomic::add<int64_t>` specialization via `unsigned long long*` reinterpret. Tier 3 (fp8e4m3 + sub-32-bit ints for index_add) deferred — no concrete caller. | done |
| 41 | Fuel 6c.5 final unblock (alpha.55): RoPE interleaved-pair (Gap 7) + RoPE THD-layout (Gap 8) variants. 28 new FFI symbols (FW+BW × 4 fp dtypes × 2 variants + `_can_implement` companions). **Closes the entire Fuel 6c.4/6c.5 batch ask** — Fuel can now drop the last `Id::Reduce` PTX module + retire `fuel-cuda-kernels` workspace member + drop the `cudaforge` build dep. Discovery: existing `rope_apply_*` was already using `(2k, 2k+1)` pairing (not `(i, i+d/2)` as the brief stated) → interleaved symbols are name-aliases on the same kernel; THD is genuinely new. | done |
| 42 | Flash Attention v2 vendor + `FlashSdpaPlan` backend (alpha.56): Tri Dao's FA2 v2.8.3 (BSD-3) vendored under `crates/baracuda-kernels-sys/vendor/flash-attention/` — Tier 1 (head_dim=128, fp16+bf16, sm_80, FW only) — wired as `BackendKind::FlashAttentionV2` on `FlashSdpaPlan` behind the `fa2` cargo feature. Heuristic routes long-context (seq_q×seq_k ≥ 1024×1024) shapes to FA2, bespoke otherwise; `PlanPreference::prefer_backend` overrides. PyTorch shim headers (at::PhiloxCudaState + C10_CUDA_CHECK) decouple the vendor from torch deps. Tier 2 (BW, varlen, paged, other head_dims) deferred. | done |
| 43 | mHC.cu vendor + `HyperConnectionPlan` family (alpha.56): DeepSeek-AI's Manifold-Constrained Hyper-Connections residual-mixing op (arXiv:2512.24880) from AndreSlavescu/mHC.cu (MIT) vendored under `crates/baracuda-kernels-sys/vendor/mhc/` — Tier 1 (static-H, bf16 only) — exposed as `HyperConnectionPlan` behind the `mhc` cargo feature. Replaces bare `y = x + sublayer(x)` residual with a learned `n×n` Sinkhorn-Knopp doubly-stochastic mixing matrix. Tier 2 (BW, dynamic-H, fp16/f32) deferred. Requires cuBLAS-Lt (already linked). | done |
| 44 | ozIMMU FP64-via-Int8-TC backend (alpha.56): enp1s0/ozIMMU (MIT) — Ootomo/Ozaki/Yokota's Ozaki-scheme DGEMM that synthesizes FP64 from S² int8 tensor-core matmuls — vendored under `crates/baracuda-ozimmu-sys/vendor/ozimmu/` with `cutf` submodule pinned alongside. NEW `baracuda-ozimmu-sys` + `baracuda-ozimmu` sibling crates. Wired into `GemmPlan` f64 path as opt-in `BackendKind::Ozaki { slices }` (default stays on CUTLASS/cuBLAS DGEMM — Ozaki is NOT bit-equivalent). Two patches: direct-link mode (no LD_PRELOAD), exclude `cublas.cu`/`culip.cu`. | done |
| 46 | FlashInfer cherry-pick — paged-KV decode + sort-free sampling + cascade attention (alpha.57 Checkpoint A, **closed in the alpha.58 consolidation pass**): surgical extraction of three FlashInfer kernel families (Apache-2.0, v0.6.12, commit `eee0d75f`) vendored under `crates/baracuda-kernels-sys/vendor/flashinfer/` (~12 kLOC across 25 headers, no wholesale wrap). NEW plan families: `BatchPagedDecodePlan` + `PagedKvAppendPlan` (vLLM-style paged KV cache decode), `TopKTopPSamplingPlan` (sort-free TopK/TopP/MinP/combined samplers), `CascadeAttentionPlan` (LSE-merge for prefix-cache sharing). NEW `BackendKind::FlashInfer` + `RandomKind::Multinomial` discriminants. NEW `flashinfer` cargo feature on both `baracuda-kernels-sys` and `baracuda-kernels` (default OFF). 7 MSVC-portability patches to vendored headers (see `vendor/flashinfer/VENDOR.md`). **Checkpoint B (alpha.58 consolidation)**: `flashinfer_paged_decode_launcher.cu` now compiles cleanly under MSVC nvcc — root cause was `std::max(unsigned long, size_t)` type mismatch inside `decode.cuh` (the earlier hypothesis about `cudaLaunchKernel_ptsz` was incorrect). Patched via `static_cast<size_t>(...)` on both arguments; launcher TU also carries a defensive `cudaLaunchKernel` shim macro. All 4 launchers now build under the `flashinfer` feature. | done |
| 44b | ozIMMU clean-fork + cutf elimination + Windows port (alpha.57): full internalize of ozIMMU sources (no longer vendored — we own them at `crates/baracuda-ozimmu-sys/cuda/`). `cutf` submodule eliminated entirely (upstream went offline); ~360 LOC of useful FP / cp_async utilities preserved as native `baracuda_fp_bits.cuh` + `baracuda_cp_async.cuh`; ~2,200 LOC of cutf duplicates deleted. Portable `baracuda::Uint128` replaces `__uint128_t` for Windows compile (typedef alias on Linux — bit-for-bit preservation). LD_PRELOAD path removed entirely. Linux + Windows both build clean. | done |
| 44c | ozIMMU RIKEN-RCCS perf-enhancement variants (alpha.57, no version bump): folds in `accelerator_for_ozIMMU` (Uchino/Ozaki/Imamura 2024, arXiv:2409.13313) — three new variants `EF` (group-wise error-free summation; chains int8 cublasGemmEx with `beta_i=1` to delay int32→f64 materialization), `RN` (nearest-rounding `(a+t)-t` split; ~2 extra effective bits per slice), `H` (= EF + RN), plus n-blocking (chunk `n > 12288` into 8192-wide pieces on the int8 GEMM call). Variant selected via `BackendKind::Ozaki { slices }` discriminant's high-3-bits field; `ozaki_slices::{base,ef,rn,h}(s)` helper constructors in `baracuda-kernels-types::sku`. NEW `OzakiVariant` enum + `Handle::dgemm_with_variant` on `baracuda-ozimmu`. Source-compatible with Phase 44b callers (`slices: 8` decodes as Base/S=8). **Discovered + fixed a pre-existing Phase 44b MSVC bug** in `axby` / `axy_complex`: upstream's `(1l << 44)` overflows on Windows (where `long` is 32-bit, LLP64) → silent `inf` output. Fixed by switching to `static_cast<double>(1ull << 44)`. 9 new accuracy/variant/n-blocking smoke tests, all green on RTX 4070; the pre-existing Phase 44b accuracy_smoke tests (4 cases) also unbreak. | done |
| 47 | Fused Linear Cross-Entropy (alpha.56, single-kernel port from LinkedIn's Liger-Kernel BSD-2): NEW `FusedLinearCrossEntropyPlan` family that fuses lm_head GEMM + CE loss in a chunked outer loop, never materializing the `[BT, V]` logits tensor. At BT=16K, V=128K, bf16 (Llama-3-class) saves **5-10 GiB of activation memory**. Bespoke per-chunk fused softmax+CE+gradient kernel (FP32 accumulator across 4 fp dtypes — f16/bf16/f32/f64); GEMMs dispatched via `cublasGemmEx`. Backward produces `grad_input`+`grad_weight` during the FW pass (chunked loop); BW call just scales by `dy_scalar` (no-op when `dy=1.0`, the typical "CE is the last layer" case). 16 new bespoke FFI symbols (per_row + per_row_cast + scalar_finalize + inplace_scale, each × 4 dtypes) + 1 count-non-ignore helper + `cublasGemmEx` binding. NEW `LossKind::FusedLinearCrossEntropy` variant. **Algorithm credit**: LinkedIn Liger-Kernel (BSD-2-Clause, clean-room CUDA reimplementation — no source vendored). | done |
| 45 | SmoothQuant compose + YaRN/LongRoPE Rust helper (alpha.56, no version bump — consolidation phase will bump): **two zero-new-CUDA pure-Rust additions**. (a) `SmoothQuantLinearPlan<TIn, TWQ>` (in `crates/baracuda-kernels/src/quantize/smoothquant.rs`) composes the existing Phase 8.3 `quantized_linear_w8a8` kernel + `fill_<dt>` broadcast for the per-tensor activation scale. Caller supplies pre-smoothed-and-quantized int8 activations + int8 weights (smoothing itself is offline Python per the SmoothQuant paper — mit-han-lab/smoothquant MIT, Xiao et al. ICML 2023; not in scope). (b) `RopeScaledTableBuilder` + `RopeScaling` enum (Linear / YaRN / LongRoPE, in `crates/baracuda-kernels/src/attention/rope_scaling.rs`) — host-side cos/sin table builder feeding the Phase 36 `rope_apply_<dt>_run` kernel. YaRN (jquesnelle/yarn MIT, Peng et al. arXiv:2309.00071) implements §3.2 NTK-by-parts frequency interpolation + §3.3 attention-temperature absorption into cos/sin. LongRoPE (microsoft/LongRoPE MIT, Ding et al. arXiv:2402.13753) multiplies inv-freq by caller-supplied per-dim factors (evolutionary search itself is offline + out of scope). Existing Phase 36 `RopeApply*` types source-compat preserved. | done |
| 51 | Arbitrary-mask `FlashSdpaPlan` + spec-decode composition doc (alpha.57, no version bump — consolidation phase will bump): NEW optional `mask: TensorRef<f32, 4>` field on `FlashSdpaArgs` routing to a bespoke arbmask SDPA kernel that adds an f32 `[B, H, Q, K]` additive bias to `S = Q·K^T·scale` before softmax. Unlocks spec-decode tree masks (EAGLE / Medusa / lookahead), MoE expert masking, prefix-LM, sliding-window with attention sinks — all entirely from caller-side composition. 4 dtypes (f32/f16/bf16/f64) × `_run` + `_can_implement` = 8 new FFI symbols. `is_causal` composes with the mask correctly (`-INF + finite == -INF`). New header `baracuda_attn_arbmask.cuh` reuses Phase 6.6's online-softmax tile pipeline; 1 new .cu instantiation file. FA2 vendor untouched (FA2 v2.8.3's `Mask` template has no arbitrary-mask hook). Runnable example at `crates/baracuda-kernels/examples/speculative_decode_compose.rs`; design doc at [`docs/guides/spec-decode.md`](docs/guides/spec-decode.md). FW only; BW deferred. | done |
| 50 | Mamba-2 SSD chunk-scan + Dao-AILab causal-conv1d (alpha.57, gated behind `mamba` cargo feature): **opens the state-space LLM class (Mamba-2 8B, Codestral-Mamba, Falcon-Mamba, Zamba2 — Mamba-1 selective_scan deferred to Phase 50b).** NEW `SsdChunkScanPlan` + `SsdChunkScanBackwardPlan` (lives under `attention` because of the SSD-as-attention duality) and `CausalConv1dPlan` + `CausalConv1dBackwardPlan` (top-level module — bespoke kernels, no cuDNN dep). Vendor attribution + LICENSE at `crates/baracuda-kernels-sys/vendor/causal-conv1d/` (Tri Dao, BSD-3) and `crates/baracuda-kernels-sys/vendor/mamba/` (state-spaces/mamba, Apache-2.0). Hand-port of the upstream Triton SSD reference + causal-conv1d primitive. **Dtypes**: causal-conv1d f32/f16/bf16/f64 × widths 2/3/4 × {SiLU, identity}; SSD f32/f16/bf16 (no f64 upstream). FW caps state at D,N ≤ 256; BW tighter at 64 (SMEM budget). 30 new FFI symbols (8 causal-conv1d FW + 8 BW + 6 SSD FW + 6 SSD BW + 2 can_implement extras). 5 new smoke tests (causal_conv1d_smoke/bw + ssd_chunk_scan_smoke/bw + mamba2_block_smoke). | done |
| 50b | Mamba-1 `selective_scan` (alpha.57, gated behind the same `mamba` cargo feature as Phase 50): **completes the state-space LLM coverage by adding the original Mamba-1 op family that powers Mamba-7B, Falcon-Mamba, and Codestral-Mamba** — Phase 50's SSD covers Mamba-2 / Codestral-Mamba / Falcon-Mamba / Zamba2, but every Mamba-1-shipping model still uses v1's `selective_scan`, not v2's SSD reformulation. NEW `SelectiveScanPlan` + `SelectiveScanBackwardPlan` (sibling to `SsdChunkScanPlan` under `attention/`). Shape `(B, L, D, N)` with the full Mamba-1 surface: optional `D[d]` skip, optional SiLU-gated `z[t, d]` tail, optional `delta_bias[d]` + optional `softplus(delta)` mapping (all 9 args of upstream `selective_scan_fn` wired). Dtypes f32/f16/bf16 (complex deferred — no shipping LLM uses it). Hand-port of `state-spaces/mamba`'s `csrc/selective_scan/` under Apache-2.0; same `vendor/mamba/` directory as Phase 50 (VENDOR.md updated, no new LICENSE file). FW caps state at `N ≤ 256`; BW uses two-pass record-then-reverse with `B*D*L*N*sizeof(T)` workspace. NEW `AttentionKind::SelectiveScan = 8` variant (`#[non_exhaustive]` so source-compat). 17 new FFI symbols (3 FW + 3 FW-can-impl + 3 BW + 1 workspace-bytes + module-internal launchers). 3 new smoke tests (selective_scan_smoke covering 4 option-combinations + f16/bf16 loose-tol, selective_scan_bw_smoke with FD checks on du/ddelta/dA + topology rejection, mamba1_block_smoke end-to-end). | done |
| 52 | NCCL foundation crate pair (alpha.57, no version bump — consolidation phase will bump): `baracuda-nccl-sys` (raw FFI types + libloading lazy-resolve, NO bindgen / NO link-time dep) + `baracuda-nccl` (safe `Communicator` with full collective surface — `all_reduce` / `reduce` / `reduce_scatter` / `all_gather` / `broadcast` / `send` / `recv` + group API + `NcclMem` + custom `pre_mul_sum` reduction op + `register` / `deregister` for zero-copy). **The distributed-roadmap prerequisite** for Ring Attention, distributed MoE, Megatron-LM tensor parallelism, FSDP-style shard collectives — Phase 52 only ships the substrate; consumer plans land in Phase 53+. Spec-named API: `Communicator::new_single_gpu` / `new_with_id`, cached infallible `rank()` / `world_size()`, `NcclReduceOp` / `NcclUniqueId` / `NcclDataType` aliases, `NcclUniqueId::generate()`. Linux-primary (NCCL ships with the CUDA toolkit there); Windows builds clean and defers the "is NCCL installed?" question to first `nccl()` call (loader fails with `LoaderError::LibraryNotFound`). 20 new smoke tests (10 dtype mapping — runs on every host; 10 `#[ignore]` NCCL-required). No baracuda-kernels integration in this phase. | done |
| 49 | Apex optimizer subset (alpha.57, gated behind `optim` cargo feature): **deliberate scope expansion — training-framework-adjacent.** NEW sibling crate `baracuda-optim` (~600 LOC Rust + ~750 LOC CUDA) vendoring the NVIDIA Apex (BSD-3-Clause) `multi_tensor_apply` idiom + fused Adam / LAMB / SGD functors. Single launch over thousands of parameter tensors (Apex `MAX_TENSORS_PER_LAUNCH = 110` per batch, multi-launch transparent) — eliminates the ~10,000-launch optimizer step overhead on 32B-param models. Plans: `AdamStepPlan<T>` (f32/f16/bf16 + AdamW mode), `LambStepPlan` (f32; two-stage with atomicAdd-fused L2-norm + sqrt + trust-ratio scaling), `SgdStepPlan<T>` (f32/f16/bf16 + momentum + Nesterov + weight-decay). Inference-only consumers (e.g. Fuel) don't pay the FFI surface cost — the vendored sources only build / link when the feature is enabled. Re-exported under `baracuda_kernels::optim` when enabled. **Measured 41× speedup at 1000-tensor multi-tensor Adam vs 1000 individual launches on RTX 4070** (0.173 ms vs 7.096 ms; smoke test in `crates/baracuda-optim/tests/multi_tensor_dispatch_smoke.rs`). 4 smoke tests, 6 GPU tests total, all green. | done |
| 53 | bitsandbytes NF4 dequant + GEMV vendor — QLoRA inference (alpha.57, gated behind `bnb_nf4` cargo feature): **opens the QLoRA-trained Llama / Mistral / Qwen inference class** by vendoring the bitsandbytes (Dettmers et al. arXiv:2305.14314, MIT) NF4 (NormalFloat 4-bit) dequant + GEMV kernels. NF4 is the dominant 4-bit format for QLoRA-trained prebuilts on the HuggingFace Hub — **distinct from GGUF Q4_0** (symmetric int4*scale, llama.cpp, Phase 8) and AWQ int4 (asymmetric int4 + zero-points). NF4 uses a 16-entry **non-uniform quantile codebook** derived from the inverse CDF of `N(0, 1)` — dequant is a 16-entry lookup, not arithmetic; better accuracy than symmetric int4 for normally-distributed weights. NEW plan trio: `Nf4DequantizePlan<T>` (bulk unpack `[N/2, K]` u8 → `[N, K]` T), `Nf4MmvqPlan<T>` (M=1 single-vector decode GEMV), `Nf4MmvqMultiMPlan<T>` (M ∈ {1, 2, 4, 8} batched-decode with weight gmem reuse, Phase 33 pattern applied to NF4). 11 new FFI symbols (3 dequant + 2 M=1 + 6 multi-M). Pack layout matches bitsandbytes upstream `Linear4bit`: pair-packed nibbles in `[N/2, K]` u8 (N must be even) + `[N * (K/block_size)]` f32 per-block absmax (block_size typically 64). Activation/output dtypes f16+bf16 (PyTorch convention); f32 accumulator. Codebook reproduced bit-identical to upstream as device-side switch + host-side `NF4_CODEBOOK: [f32; 16]` const + `nf4_pack_weight` host helper. Vendor metadata at `crates/baracuda-kernels-sys/vendor/bitsandbytes/{LICENSE,AUTHORS,VENDOR.md}`. 3 smoke test files (dequant roundtrip, M=1 GEMV f16+bf16, multi-M f16 vs M=1-looped). Out of scope: 8-bit optimizers (Phase 49 overlap), LLM.int8 (Phase 45 obsoletes), FP4 (different codebook — separate phase if asked), double quantization (Tier 2). | done |
| 54 | xFormers BlockSparseAttention + 2:4 sparse GEMM (alpha.57, no version bump; clean-room hand-port of facebookresearch/xformers BSD-3-Clause algorithmic reference): NEW `SdpaBlockSparsePlan` (`xformers_blocksparse` cargo feature) — block-sparse SDPA FW where the attention mask is a per-block boolean pattern `[B, H, num_blocks_q × num_blocks_k]` (uint8); only the active (q_block, k_block) pairs participate in the QK^T matmul + online-softmax accumulation. Different from Phase 51's arbitrary additive-mask path (which still computes every cell) — block-sparse actually SKIPS compute on masked blocks → real wall-clock speedup on long-context attention with known sparse patterns. NEW `GemmSparse24Plan` (`xformers_sparse24` cargo feature) — 2:4 structured sparsity GEMM consuming pre-compressed `[M, K/2]` weights + `[M, K/8]` u16 metadata. **Tier-1 implementation**: inflate-then-dense reference matmul (correctness first; sparse-tensor-core `mma.sp.sync` / cuSPARSELt backend deferred to Tier 2). 16 new FFI symbols (8 block-sparse SDPA × 4 dtypes × 2 entries; 11 sparse24 × 3 dtypes × 4 entries with workspace_bytes helper). NEW `AttentionKind::BlockSparseAttention = 9` variant (`#[non_exhaustive]` so source-compat). 3 smoke tests (block-sparse all-ones-matches-dense + diagonal-band + empty-pattern; sparse24 matches host reference + K-rejection + throughput timing). Vendor attribution at `crates/baracuda-kernels-sys/vendor/xformers/` (no upstream sources vendored verbatim — algorithmic reference only). NOT vendored from xFormers: memory-efficient attention (overlaps with FA2 vendor); fused biases / RoPE / norm (overlaps with existing baracuda phases); Triton kernel paths (no Triton toolchain). | done (Tier 1) |
| 55 | TransformerEngine FP8 cast + delayed-scaling recipe (alpha.57, gated behind `tensor_engine` cargo feature; clean-room hand-port of NVIDIA TransformerEngine Apache-2.0 algorithm — only the cast + recipe subset). NEW sibling crate pair `baracuda-transformer-engine-sys` + `baracuda-transformer-engine`. The differentiated value of TE is the **per-tensor delayed-scaling recipe with amax history** for stable FP8 training; that's the load-bearing piece this phase ships. Public API: `Fp8Recipe` (RAII handle holding amax_history ring + scale + scale_inv device scalars), `Fp8CastPlan<TIn>` (fused FP8 cast + `max(\|x\|)` amax reduction in one kernel — atomicMax into `amax_history[write_pos]`), `Fp8DequantPlan<TOut>` (symmetric dequant via `scale_inv`). Both formats: E4M3 (max=448) for fwd/weights, E5M2 (max=57344) for grads. Wide dtypes: f32/f16/bf16. 4 new C-ABI symbols (`baracuda_te_fused_cast_amax_run` / `baracuda_te_dequant_run` / `baracuda_te_recipe_update_run` / `baracuda_te_recipe_init_run`) + format/dtype id helpers. **NO cuDNN dep** (cast/recipe paths don't need it — cuDNN is only needed for `fused_attn`, which we skip). **NO pybind11** (raw C ABI, not Python). 10 GPU smoke tests, all green on RTX 4070. **Sm_89 reality check**: FP8 storage + cast intrinsics work natively on Ada, but tensor-core FP8 MMA throughput equals BF16 — so the wins here are bandwidth-saving (KV cache, weight storage, activation memory) not compute. Recipe machinery is forward-compatible with Hopper (sm_90a) / Blackwell (sm_100) where the MMA throughput win also materializes. Deliberately NOT lifted: `normalization` (Phase 5), `fused_rope` (Phase 14/36/41), `fused_attn` (Phase 17/42; cuDNN dep), `fused_softmax` (Phase 5), `activation` (Phase 3/31), `gemm` (Phase 1+24+30), `comm_gemm_overlap` (Hopper TMA), `fused_router` (Phase 8+20), `hadamard`/`newton_schulz`/`swizzle`/`permutation` (niche), `multi_tensor` (Phase 49), `dropout` (composable), Python bindings (out of scope). Algorithmic reference: `transformer_engine/common/{cast,recipe}/*.cu` upstream + FP8 spec Micikevicius et al. 2022 (arXiv:2209.05433). Vendor attribution + full Apache-2.0 text at `crates/baracuda-transformer-engine-sys/ATTRIBUTION.md`. | done |
| 57 | Megatron-LM tensor-parallel primitives (alpha.57, no version bump; gated behind `megatron_tp` cargo feature). **NEW sibling crate `baracuda-megatron`** — pure-composition over `baracuda-cublas` (local GEMM via `cublasSgemm` for f32, `cublasGemmEx` with `Compute32F` accumulator for f16/bf16) + `baracuda-nccl` (cross-rank `all_gather` / `all_reduce` collectives). **NO new CUDA kernels** — foundational TP primitives for Megatron-style models are pure orchestration; the kernel substrate already exists in baracuda. NEW plans: `ColumnParallelLinearPlan<T>` (splits W along output dim; FW `Y_local = X @ W_local^T` + `all_gather`; BW `dX_partial = dY_local @ W_local` + `all_reduce(Sum)`, `dW_local = dY_local^T @ X` local) and `RowParallelLinearPlan<T>` (splits W along input dim; FW `Y_partial = X_local @ W_local^T` + `all_reduce(Sum)`; BW `dX_local = dY @ W_local` local, `dW_local = dY^T @ X_local` local — **no BW collective**, the Megatron pairing only needs one collective per layer-pair). NEW `TensorParallelContext` borrow type holding `&Communicator` + `in_features` / `out_features` / cached `rank` / `world_size`. Dtypes: f32 always; f16 + bf16 behind the crate-level `half-crate` feature (which the kernel-facade `megatron_tp` feature pulls in). Tier 1 scope — bias rejected at call site with a Tier-2 marker error (caller can Affine-add post-FW; matters for RowParallel where the bias must be added **after** the all_reduce so it doesn't get summed N times). 5 smoke tests across 3 files: `column_parallel_smoke` (FW + BW, single-rank, matches CPU `Linear` ref), `row_parallel_smoke` (same), `multi_rank_scaffold` (`#[ignore]`-gated 2-GPU scaffold — exits cleanly on single-GPU dev boxes). Algorithmic reference: Shoeybi et al. arXiv:1909.08053 (NVIDIA Megatron-LM, Apache-2.0); no source vendored. Out of scope: async overlap (Hopper TMA); sequence parallelism (Phase 56's domain); pipeline parallelism (future phase); VocabParallelEmbedding (future polish); distributed gradient accumulation (Phase 58's domain); expert parallelism (separate phase). | done |
| 58 | DistributedAdam — ZeRO-1-style sharded optimizer state (alpha.57, no version bump; gated behind the new `distributed_optim` cargo feature on `baracuda-optim`, pulls `baracuda-nccl` as optional dep). **Pure-Rust composition** over Phase 49 [`AdamStepPlan`] + Phase 52 NCCL collectives — **NO new CUDA kernels**, **NO new `baracuda-kernels-sys` FFI**. NEW `DistributedAdamStepPlan<T>` wrapping the inner Adam plan + a borrowed `&Communicator`; orchestrates the canonical ZeRO-1 protocol: `all_reduce(grads, Sum, in-place)` → local Adam step on this rank's `1/world_size` shard → `all_gather(updated_params, in-place)`. **Single-rank degenerate case** (`world_size == 1`) elides both collectives and reduces to `AdamStepPlan::step` bit-exactly (smoke test verifies this on single-GPU dev hardware). f32 + f16 + bf16 dtypes, AdamW + classic mode, mixed-precision `step_with_f32_state` variant. NEW `shard_range(n, rank, world_size)` helper matching `torch.chunk` semantics. Phase 58 constraint: tensor element counts must be `world_size`-multiples (ring all_gather symmetry); per-tensor broadcast fallback for ragged shards is future work. Out of scope: ZeRO-2 (gradient sharding); ZeRO-3 (parameter sharding during FW/BW); DistributedLamb / DistributedSGD (same pattern, defer until concrete demand); CPU-offload optimizer state; 8-bit distributed optimizer state. 3 smoke test files (4 pure-Rust shard_range tests run unconditionally; 2 single-rank GPU smokes `#[ignore]`-gated for NCCL; 1 multi-rank scaffold `#[ignore]`-gated for 2+ GPU validation). Algorithmic reference: Rajbhandari et al. SC20 "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models", Microsoft DeepSpeed (Apache-2.0; no source vendored — pure Rust composition). | done |
| 56 | Ring Attention — sequence-parallel attention (alpha.57, no version bump — consolidation phase will bump; gated behind the new `ring_attention` cargo feature, pulls `baracuda-nccl` + `baracuda-nccl-sys` as optional deps). **First Phase 52 NCCL consumer** — proves the substrate. Hand-port of Liu/Yan/Abbeel 2023 (arXiv:2310.01889; algorithmic reference at `https://github.com/lhao499/RingAttention`, Apache-2.0 — no JAX source vendored, clean-room CUDA implementation). NEW `RingAttentionPlan<T>` + `RingAttentionDescriptor` + `RingAttentionArgs` in `crates/baracuda-kernels/src/attention/ring_attention.rs`; bespoke `kernels/attention/ring_attention_kernel.cu` (~480 LOC kernel + ~390 LOC plan). Per-rank online-softmax fold of resident K/V chunk into persistent `(o_acc, m_acc, l_acc)` f32 state; ring rotation via `comm.send`/`recv` inside `group_start`/`group_end`; finalize kernel emits `y = o_acc / l_acc` (+ optional `lse`). **Tier 1 scope**: f16/bf16 (f32/f64 deferred), `head_dim == 128`, FW only (BW Tier 2), no GQA broadcast, no arbitrary additive mask. Causal masking applied on **global** indices (each step kernel takes `q_global_base` + `k_global_base` so masking is consistent across rotation steps). 12 new FFI symbols (`workspace_bytes` + dtype-independent `init_run` + 5 per-dtype × 2 dtypes: `step_run` / `step_can_implement` / `finalize_run` / `finalize_can_implement`). Unlocks **million-token context length** across N GPUs with O(N/P) memory where N = total seq len, P = ring size. Complementary to Phase 57's tensor-parallelism (sequence-dim sharding vs head-dim sharding compose). 4 smoke tests (3 single-rank degenerate cases validating against `FlashSdpaPlan` ground truth: f16 + bf16 + f16 causal — all pass on RTX 4070; 1 multi-rank scaffold `#[ignore]`-gated for 2+ GPU validation). Single-rank `world_size == 1` reduces to standard FlashAttention math (the validation path on single-GPU hardware). | done |
| 48 | Marlin + AWQ 4-bit GEMM vendor + GPTQ→Marlin repack utility (alpha.57, no version bump — consolidation phase will bump; gated behind the new `marlin` + `awq` cargo features on `baracuda-kernels-sys` and `baracuda-kernels`). **Two complementary 4-bit GEMM vendors** completing the "4-bit hub coverage" started in Phase 53 (NF4). **Marlin** (IST-DASLab, Apache-2.0 + §3 patent grant, vendored at `crates/baracuda-kernels-sys/vendor/marlin/`) — state-of-the-art W4A16 GEMM for the decode-batch regime, ~3.87× speedup over FP16 GEMM at M ∈ [1, 32] on Ampere / Ada per the paper. **Symmetric** int4 (zero-point fused into dequant as `q - 8`); group size 128 or per-channel; sm_80/86/89 only (sm_90 needs WGMMA rewrite — Marlin v2 territory, deferred). NEW `Int4MarlinGemmPlan<f16>`. **AWQ** (mit-han-lab, MIT — no patent grant, vendored at `crates/baracuda-kernels-sys/vendor/awq/`) — natively supports the **most-deployed 4-bit format on the Hugging Face Hub** (Llama / Mistral / Qwen `*-AWQ`). **Asymmetric** int4 with explicit per-group zero-points; group size 64 or 128; loads directly from HF checkpoints without repack. NEW `Int4AwqGemmPlan<f16>`. **GPTQ→Marlin repack utility** — pure-Rust host-side `gptq_to_marlin_repack` bridging GPTQ asymmetric checkpoints into Marlin's symmetric layout via zero-point absorption (trailblazer implementation; act_order=True deferred, the upstream Marlin intra-fragment permutation table is documented but uses identity permutation in the trailblazer). 4 new FFI symbols total (2 Marlin: `_run` + `_can_implement`; 4 AWQ: `_run` + `_workspace_bytes` + `_can_implement` + dequant stub). AWQ vendor source patched to strip the upstream `<torch/extension.h>` host wrapper (`__asm__ __volatile__` → `asm volatile` for MSVC nvcc portability) and re-export only the device-side `__global__` template kernel. Marlin needs `--expt-relaxed-constexpr` (constexpr `ceildiv` called from `__global__`). Both kernels build clean on RTX 4070 with the gated features. 3 smoke test files (marlin_smoke `#[ignore]` GPU + descriptor validation; awq_smoke `#[ignore]` GPU + descriptor validation; gptq_to_marlin_smoke pure-Rust roundtrip + zp-fold verification + clamp-at-extremes). | done |
| 46+ | Phase 46-51 mainstream-techniques roadmap (FlashInfer cherry-pick, Marlin/AWQ); Hopper sm_90a / Blackwell sm_100; 1.0 freeze. | pending (see [`ROADMAP.md`](ROADMAP.md)) |
API stability is **not** promised before beta.0. Breaking changes ship in
each alpha bump and are documented in the workspace `CHANGELOG.md`.
## Quick start
Add the kernel facade and the driver crate:
```toml
[dependencies]
baracuda-kernels = { version = "0.0.1-alpha.64", features = ["sm89", "cudnn"] }
baracuda-driver = "0.0.1-alpha.64"
```
A representative example — single-axis numerically stable softmax over a
device-resident tensor:
```rust,no_run
use baracuda_driver::{Context, Device, DeviceBuffer, Stream};
use baracuda_kernels::{
PlanPreference, SoftmaxArgs, SoftmaxDescriptor, SoftmaxKind, SoftmaxPlan,
TensorMut, TensorRef, Workspace,
};
fn main() -> Result<(), Box<dyn std::error::Error>> {
// 1. Standard CUDA bring-up via baracuda-driver.
let ctx = Context::new(&Device::get(0)?)?;
let stream = Stream::new(&ctx)?;
// 2. Allocate device input + output buffers (rank-2: rows × cols).
let rows = 32i32;
let cols = 1024i32;
let n_elems = (rows * cols) as usize;
let dev_x: DeviceBuffer<f32> = DeviceBuffer::zeros(&ctx, n_elems)?;
let mut dev_y: DeviceBuffer<f32> = DeviceBuffer::zeros(&ctx, n_elems)?;
// 3. Build the descriptor — pure shape + dtype + op-kind, no handles.
let desc = SoftmaxDescriptor::<2> {
kind: SoftmaxKind::Softmax,
input_shape: [rows, cols],
softmax_axis: 1,
element: <f32 as baracuda_kernels::KernelDtype>::KIND,
};
// 4. Plan selection — picks a kernel SKU (bespoke softmax kernel here).
let plan = SoftmaxPlan::<f32, 2>::select(&stream, &desc, PlanPreference::default())?;
// 5. Args carry the per-call tensor handles + strides.
let args = SoftmaxArgs {
x: TensorRef { data: dev_x.as_slice(), shape: [rows, cols], stride: [cols as i64, 1] },
y: TensorMut { data: dev_y.as_slice_mut(), shape: [rows, cols], stride: [cols as i64, 1] },
};
// 6. Launch. Workspace::None for plans that need no scratch.
plan.run(&stream, Workspace::None, args)?;
stream.synchronize()?;
Ok(())
}
```
The same `select` → `run` shape applies to every op. GEMM, attention,
conv2d, FFT, scatter — the descriptor / args fields differ per family but
the lifecycle is identical. See the [`crates/baracuda-kernels`
README](crates/baracuda-kernels/README.md) for the int8-GEMM variant of
the same example.
## Workspace layout
The user-facing crates a typical caller will reach for:
```text
baracuda-kernels # the unified Plan-based ML op facade
baracuda-kernels-types # shared type vocabulary (Element, TensorRef, KernelSku, ...)
baracuda-kernels-sys # raw FFI to bespoke .cu kernels
baracuda-kernels-bench # criterion harness for sm_89 perf sweeps (not published)
baracuda-cutlass # safe wrapper for CUTLASS GEMM (float, int8 RCR, batched, grouped)
baracuda-driver # safe wrapper for the CUDA Driver API
baracuda-runtime # safe wrapper for the CUDA Runtime API
```
The per-library wrappers used internally by the facade (you can also use
them stand-alone):
```text
baracuda-cublas{,-sys} # cuBLAS + cuBLASLt + cuBLASXt
baracuda-cudnn{,-sys} # cuDNN classic + Graph API
baracuda-cudf{,-sys} # cuDF (RAPIDS dataframe; Linux-only)
baracuda-cufft{,-sys} # cuFFT
baracuda-cusolver{,-sys} # cuSOLVER dense + sparse + Rf + Mg
baracuda-cusparse{,-sys} # cuSPARSE
baracuda-curand{,-sys} # cuRAND
baracuda-cutensor{,-sys} # cuTENSOR
baracuda-cutlass{,-sys} # CUTLASS GEMM kernel templates
baracuda-cutlass-kernels-sys # CUTLASS kernel-only compile target
baracuda-cuvs{,-sys} # RAPIDS cuVS GPU vector search (Phase 71)
baracuda-cvcuda{,-sys} # CV-CUDA image processing
baracuda-flashinfer{,-sys} # FlashInfer paged-KV + cascade + sampling (Phase 46/66)
baracuda-npp{,-sys} # NPP
baracuda-nccl{,-sys} # NCCL
baracuda-nvcomp{,-sys} # nvCOMP
baracuda-nvimagecodec{,-sys} # nvImageCodec (Phase 70 — supersedes nvJPEG)
baracuda-nvjpeg{,-sys} # nvJPEG (kept for back-compat)
baracuda-nvshmem{,-sys} # NVSHMEM symmetric-heap RDMA (Phase 69)
baracuda-ozimmu{,-sys} # ozIMMU Ozaki-scheme DGEMM (Phase 44)
baracuda-transformer-engine{,-sys} # TransformerEngine FP8 (Phase 55)
```
And the supporting low-level crates (FFI, build infrastructure, profiling):
```text
baracuda-cuda-sys # Driver + Runtime FFI
baracuda-nvrtc{,-sys} # runtime CUDA C++ → PTX
baracuda-nvjitlink{,-sys} # CUDA 12+ JIT linker
baracuda-cupti{,-sys} # profiling APIs
baracuda-nvml{,-sys} # device monitoring
baracuda-cufile{,-sys} # GPUDirect Storage (Linux-only)
baracuda-tensorrt{,-sys} # TensorRT inference runtime (Phase 68 — vtable-dispatch C++ shim)
baracuda-forge # build-time .cu → PTX compiler driver
baracuda-build # build.rs helpers
baracuda-core # loader + Error plumbing
baracuda-types{,-derive} # pure-data types: Half, BFloat16, Complex, DeviceRepr
```
Vendor-track training / optimization crates (opt-in via feature gates on
`baracuda-kernels`):
```text
baracuda-megatron # Megatron-LM TP primitives (Phase 57; composition, no kernels)
baracuda-optim # NVIDIA Apex multi-tensor optimizers (Phase 49)
```
The full umbrella crate (`baracuda`) re-exports everything behind cargo
features — convenient when you want everything; overkill when you don't.
## Hardware support
baracuda targets **Ampere and newer** by design. Pre-Ampere GPUs lack the
tensor-core instructions and async-copy primitives the bespoke kernels are
written against (`mma.sync.m16n8k*`, `cp.async`, `ldmatrix`), and we have
no desire to ship a slower SIMT fallback for hardware that's eight years
old.
| Compute capability | NVIDIA marketing names | baracuda support |
| --- | --- | --- |
| sm_80 | Ampere (A100, A40, A30, RTX 30xx) | **default baseline** |
| sm_89 | Ada Lovelace (RTX 40xx, L40, L4) | feature-gated specialized kernels (FP8, larger Flash Attention tiles) |
| sm_90a | Hopper async (H100, H200) | stubs in place; full specialization pending Phase 11 |
| sm_100 | Blackwell | post-Phase-11 |
| ≤ sm_75 (Turing, Volta, Pascal, …) | — | **unsupported** |
The default `sm80` build runs forward-compatibly on Ada and Hopper through
JIT-compiled PTX; turn on `sm89` to pick up the FP8 and Flash-Attention
sibling plans tuned for Ada's larger register file.
## Cargo features
The kernel facade exposes a broad opt-in feature set. Architecture and
library-integration features are off by default — pick what your
deployment needs. **Important:** unless a feature is enabled, its
plans are not built, not linked, and not present in the public API.
Always check the feature gate when scanning for an op family.
### Architecture targets
| Feature | Default | Effect |
| --- | --- | --- |
| `sm80` | yes | Ampere-baseline kernel set (RTX 30xx, A100). |
| `sm89` | no | Ada Lovelace specializations (FP8 GEMM, `FlashSdpaSm89Plan`). |
| `sm90a` | no | Hopper-specialized kernels (stubs today; tracked for Phase 7x). |
### NVIDIA library integrations
| Feature | Default | Phase | Effect |
| --- | --- | --- | --- |
| `cudnn` | no | 7 | Link cuDNN. Enables Conv2d / Pool2d / `CtcLossCudnnPlan`. Separate NVIDIA download. |
### Vendored kernel families
| Feature | Default | Phase | Effect |
| --- | --- | --- | --- |
| `fa2` | no | 42 | Vendored Tri Dao FlashAttention v2 (BSD-3). Long-context routing on `FlashSdpaPlan`. |
| `mhc` | no | 43 | Vendored DeepSeek-AI mHC.cu (MIT). `HyperConnectionPlan` (learned residuals). |
| `ozimmu` | no | 44 | Vendored ozIMMU (MIT). Ozaki-scheme DGEMM via S² int8 tensor-core matmuls. |
| `flashinfer` | no | 46/66 | Vendored FlashInfer (Apache-2.0). Paged-KV decode/prefill, cascade, sampling. |
| `marlin` | no | 48 | Vendored IST-DASLab Marlin (Apache-2.0). Symmetric int4 W4A16 GEMM. |
| `awq` | no | 48 | Vendored mit-han-lab llm-awq (MIT). Asymmetric int4 W4A16 GEMM. |
| `optim` | no | 49 | Vendored NVIDIA Apex (BSD-3). Multi-tensor Adam/Lamb/SGD step kernels. |
| `mamba` | no | 50 | Vendored Mamba-2 SSD + Dao causal-conv1d. State-space LLM ops. |
| `bnb_nf4` | no | 53 | Vendored bitsandbytes NF4 (MIT). 4-bit non-uniform quantile QLoRA. |
| `xformers_blocksparse` | no | 54 | Clean-room port of xFormers block-sparse SDPA (BSD-3). |
| `xformers_sparse24` | no | 54 | Clean-room port of xFormers 2:4 structured sparsity GEMM (BSD-3). |
| `tensor_engine` | no | 55 | Vendored NVIDIA TransformerEngine (Apache-2.0). FP8 cast/dequant + delayed-scaling recipe. |
| `ring_attention` | no | 56 | Sequence-parallel ring attention (Apache-2.0 reference). Pulls in NCCL. |
| `megatron_tp` | no | 57 | Megatron-LM tensor-parallel primitives (composition, no new kernels). |
| `nvshmem` | no | 69 | NVSHMEM host-side wrapper (one-sided RDMA, sibling to NCCL). Linux-only. |
Sibling crates also have their own feature gates:
- `baracuda-cuvs` — `cuvs` feature in `baracuda-kernels-sys` ecosystem
pulls in RAPIDS cuVS vector-search (Phase 71, Linux-only). Not yet
re-exported by `baracuda-kernels` itself.
- `baracuda-tensorrt` — `shim` feature builds the vtable-dispatch C++
shim required to call TensorRT (Phase 68 — no flat C ABI upstream).
- `baracuda-flashinfer` — same `flashinfer` feature as above.
**Notes:**
- `cudnn` is off by default because cuDNN is a separate NVIDIA download
not bundled with the stock CUDA toolkit installer. Enabling it without
cuDNN installed produces a linker error on `cudnn.lib` / `libcudnn.so`
— see the [Building](#building) section for auto-discovery paths.
- Most vendored features add 30s–5 min to first build (template-heavy
CUDA). Subsequent builds incremental.
- Features that pull in optional sibling crates (`optim`, `tensor_engine`,
`megatron_tp`, `nvshmem`, `ring_attention`) only compile the sibling
when the feature is enabled — inference-only consumers don't pay the
surface cost.
## Building
Requirements:
- **CUDA Toolkit ≥ 12.0** with `nvcc` on `PATH`. baracuda is tested on
12.x and 13.x.
- **cuDNN 9.x** (only if you enable the `cudnn` feature) — separate
NVIDIA download, not bundled with the toolkit.
- **A working Rust toolchain ≥ 1.85** (workspace MSRV pinned in
`rust-toolchain.toml`).
- **Windows users**: `lld-link.exe` somewhere on `PATH`. The CUDA `nvcc`
invocation links through it; the install location is typically
`C:\Program Files\LLVM\bin`. Install the LLVM Windows package and add
that directory to `PATH` if `cargo build` complains about
`lld-link.exe` not being found.
A typical full build with all GPU-side features (CUDA toolkit + cuDNN
present):
```bash
cargo build -p baracuda-kernels --features sm89,cudnn --release
```
Or, to verify the public API surface compiles without the full kernel
build (fast — type-check only):
```bash
cargo check -p baracuda-kernels --features sm89,cudnn
```
The `baracuda-kernels-sys` build script auto-discovers cuDNN at the
following paths in order: `CUDNN_PATH` / `CUDNN_ROOT` / `CUDNN_HOME` env
vars, then `C:\Program Files\NVIDIA\CUDNN\v<X.Y>\` on Windows, then the
CUDA toolkit's own `lib/` directory (pre-cuDNN-9 layout), then the
standard Linux distro paths under `/usr/lib/`.
## Troubleshooting
### Windows: Git-for-Windows fake `link.exe` shadowing the MSVC linker
Git-for-Windows ships a GNU coreutils binary named `link.exe` at
`C:\Program Files\Git\usr\bin\link.exe` — its job is to create a hard
link, **not** to link object files. If that directory appears on `PATH`
ahead of the MSVC linker (or LLVM's `lld-link.exe`), `cargo build`
invokes the coreutils binary instead of the real linker and fails with a
cryptic error (it doesn't understand `/OUT:` and friends).
baracuda's `baracuda-kernels-sys` and `baracuda-cutlass-sys` build
scripts probe `PATH` on Windows and emit a `cargo:warning` if they
detect this shadowing. **Fix:** re-order `PATH` so the MSVC linker
(typically reached via the Visual Studio "x64 Native Tools Command
Prompt") or LLVM's `lld-link.exe` (`C:\Program Files\LLVM\bin\`) appears
before `C:\Program Files\Git\usr\bin\`. Building from the VS x64 Native
Tools prompt is the most reliable option; alternatively, install LLVM
and put its `bin` directory ahead of Git's on the user/system `PATH`.
## Testing
baracuda's GPU integration tests are gated behind `#[ignore]` so a
host-only `cargo test` doesn't try to launch a kernel on a machine
without an NVIDIA driver. To run them you need a working GPU plus the
`--ignored` flag:
```bash
# Host-only tests (compile + reference logic; no GPU access):
cargo test -p baracuda-kernels --lib
# Full GPU integration sweep — RTX 30xx / 40xx / 50xx required:
cargo test -p baracuda-kernels --release -- --ignored
# Verify the workspace-level API surface compiles (no GPU needed):
cargo check -p baracuda-kernels --features sm89,cudnn
```
The full regression on an RTX 4070 covers 324 binary targets at
~1630 tests passing. Individual op-family suites take 30–90 seconds;
the full sweep is 25–40 minutes.
## Benchmarks
The `baracuda-kernels-bench` crate is a criterion-based harness with
CUDA-event-timed throughput sweeps across GEMM, Flash Attention, and
Conv2d at LLM-typical and ResNet-typical shapes. It is **not** published
to crates.io (it depends on a working GPU).
```bash
cargo bench -p baracuda-kernels-bench --features sm89,cudnn
```
The full sweep takes ~30 minutes on an RTX 4070. Scope to a single family
with `--bench gemm` / `--bench flash_attention` / `--bench conv2d`. See
[`crates/baracuda-kernels-bench/BENCH-sm89.md`](crates/baracuda-kernels-bench/BENCH-sm89.md)
for the baseline table format and methodology.
## Project documentation
- [`ARCHITECTURE.md`](ARCHITECTURE.md) — layered design, Plan-Descriptor-Args
pattern, `KernelSku` taxonomy, dispatcher design, workspace contract,
sibling-plan pattern, vendoring convention, phase roadmap.
- `OP-MATRIX.md` — full op × dtype × backend coverage matrix (planned).
- `LESSONS.md` — postmortems, ABI footguns, performance traps (planned).
- Per-crate `README.md` files under `crates/<name>/`.
## License
Dual-licensed under [MIT](LICENSE-MIT) **or** [Apache-2.0](LICENSE-APACHE).
Pick whichever fits your project. Contributions accepted under the same
terms.
NVIDIA's CUDA libraries (`libcuda`, `libcudart`, `libcublas`, `libcudnn`,
…) are **not** redistributed by this project. You obtain them from NVIDIA
separately — either through the CUDA Toolkit installer or through each
library's dedicated download page. baracuda's loader opens whatever the
host driver / toolkit has installed.
## Vendor attribution
A small number of bespoke kernels in `baracuda-kernels-sys` are vendored
from upstream open-source projects (huggingface/candle's CUDA kernel set
via `fuel-cuda-kernels`; llama.cpp's `ggml-cuda` GGUF block-format
quantization + MMVQ; `guoqingbao/attention.rs`'s fused MoE expert
kernels). Each adapted source carries an `SPDX-FileCopyrightText:` +
`SPDX-License-Identifier:` header; the consolidated provenance is in
[`crates/baracuda-kernels-sys/LICENSE-thirdparty.md`](crates/baracuda-kernels-sys/LICENSE-thirdparty.md).
[**FlashAttention v2**](https://github.com/Dao-AILab/flash-attention)
(Tri Dao, BSD-3-Clause, pinned at `v2.8.3` /
`060c9188beec3a8b62b33a3bfa6d5d2d44975fab`) is vendored at
[`crates/baracuda-kernels-sys/vendor/flash-attention/`](crates/baracuda-kernels-sys/vendor/flash-attention/)
with verbatim `LICENSE` + `AUTHORS` files and full vendor / scope notes
in [`VENDOR.md`](crates/baracuda-kernels-sys/vendor/flash-attention/VENDOR.md).
Gated behind the `fa2` cargo feature on `baracuda-kernels-sys` and
`baracuda-kernels`; exposed through a backend-choice path on
`FlashSdpaPlan` (Phase 42).
[**mHC.cu**](https://github.com/AndreSlavescu/mHC.cu) (Andre Slavescu,
MIT, pinned at `a426939c2dbc11c443db041bcff12b65d1b6482a`) — unofficial
CUDA implementation of DeepSeek-AI's
[*Manifold-Constrained Hyper-Connections*](https://arxiv.org/abs/2512.24880)
paper — is vendored at
[`crates/baracuda-kernels-sys/vendor/mhc/`](crates/baracuda-kernels-sys/vendor/mhc/)
with the verbatim upstream `LICENSE`, an `AUTHORS` file, and full
vendor / scope notes in
[`VENDOR.md`](crates/baracuda-kernels-sys/vendor/mhc/VENDOR.md). Gated
behind the `mhc` cargo feature on `baracuda-kernels-sys` and
`baracuda-kernels`; exposed through the new `HyperConnectionPlan`
(Phase 43, Tier 1: static-H FW, bf16 weights / f32 activations).
[**FlashInfer**](https://github.com/flashinfer-ai/flashinfer) (NVIDIA
+ FlashInfer community, Apache-2.0 with full patent grant, pinned at
`v0.6.12` / `eee0d75f91f64c520bfaed07e39a850ea4ddde23`) — a curated
~12 kLOC subset of the FlashInfer header tree is vendored at
[`crates/baracuda-kernels-sys/vendor/flashinfer/`](crates/baracuda-kernels-sys/vendor/flashinfer/)
with verbatim upstream `LICENSE` + `NOTICE` and full vendor / scope /
patch notes in
[`VENDOR.md`](crates/baracuda-kernels-sys/vendor/flashinfer/VENDOR.md).
Gated behind the `flashinfer` cargo feature on `baracuda-kernels-sys`
and `baracuda-kernels`; exposes three NEW plan families — paged-KV
decode + append (`BatchPagedDecodePlan` + `PagedKvAppendPlan` for
vLLM-style serving), sort-free sampling (`TopKTopPSamplingPlan` —
combined TopK/TopP/MinP via a single-kernel rejection sampler), and
cascade attention LSE merge (`CascadeAttentionPlan` for prefix-cache
sharing). Surgical cherry-pick (not a wholesale wrap) — Hopper /
Blackwell / NVSHMEM / Mamba / MLA / POD paths intentionally skipped
to keep the build cost contained. Phase 46.
The [`baracuda-forge`](crates/baracuda-forge) build-time kernel-compiler
crate is a vendored fork of [`cudaforge`](https://github.com/guoqingbao/cudaforge)
by **Guoqing Bao** — see [`crates/baracuda-forge/NOTICE`](crates/baracuda-forge/NOTICE)
for the upstream commit hash.
The [`baracuda-cutlass`](crates/baracuda-cutlass) safe wrapper for NVIDIA
CUTLASS — plan-based GEMM and grouped-GEMM with caller-supplied
workspace, MoE-friendly variable-M-per-group dispatch — was specified
by the **Fuel ML library team**. See
[`crates/baracuda-cutlass/NOTICE`](crates/baracuda-cutlass/NOTICE) for
the design lineage.
[`baracuda-kernels`]: crates/baracuda-kernels
[`baracuda-kernels-sys`]: crates/baracuda-kernels-sys