baracuda

A great barracuda — the project's namesake, minus one letter.

About the name. Yes, we know — it's spelled barracuda (two Rs). That name was taken on crates.io, so we dropped one R and kept swimming.

A unified Rust ML-op facade over the NVIDIA CUDA ecosystem.

What baracuda is

baracuda is a Rust workspace that exposes every primitive an ML framework expects — the union of PyTorch (torch.* + nn.functional) and JAX (jax.lax.* + jax.numpy.*) — through a single Plan-based crate surface called baracuda-kernels. Internally each plan dispatches to:

The appropriate NVIDIA-library wrapper crate (cuBLAS, cuDNN, cuFFT, cuSOLVER, cuRAND, cuSPARSE, cuTENSOR, NPP, CV-CUDA, CUTLASS) when one already covers the op well, or
A bespoke hand-rolled .cu kernel shipped in baracuda-kernels-sys when no NVIDIA library covers the op (or covers it poorly at the shapes that matter for modern transformer / vision / GNN workloads).

Callers import one crate (baracuda-kernels) and reach for one API style. The dispatch decision — which is observable through Plan::sku() for telemetry — is otherwise invisible. Switching from a CUTLASS-backed SKU to a bespoke-backed SKU is a layout flag, not an import change.

baracuda is for downstream Rust ML / inference / training frameworks that need access to the full CUDA stack without re-vendoring it themselves. The workspace also ships idiomatic stand-alone wrappers for every CUDA library under crates/baracuda-<lib> if you want to skip the kernel facade and talk to one library directly.

Status

In active development — alpha.67. 2280+ GPU tests passing, zero failures across the 6 critical test crates on an RTX 4070 (sm_89; the baracuda-kernels suite alone is 2180/0 across 513 test binaries). alpha.67 (Phase 74, Fuel-ask) ships the plain dense FP GEMM family: 12 cuBLAS-backed flat C symbols (baracuda_kernels_gemm_dense_{f32, f64, f16, bf16}_*) with RRR / RCR / CRR layouts, flexible leading dims, and strided-batch folded into the base signature — plus the DenseGemmPlan<T> typed plan, the ReduceToPlan<T, N> broadcast-reverse reduction facade, UnaryKind::Step, and gelu flavor-disambiguation docs. This closes the last non-baracuda CUDA surface in Fuel (its own cuBLAS MatMul wrapper). alpha.66 (Fuel-ask) exposed per-device VRAM queries on the Driver-API Device — vram_free() / vram_total() / vram_info() wrap cuMemGetInfo_v2 so downstream optimizers can read runtime memory pressure without round-tripping through a failed allocation. Phase 73 follow-up (alpha.65) lands a 17-33× decode speedup via a focused FlashDecodingPlan (split-K, seq_q=1), 4× win at GQA shapes via the new num_kv_heads descriptor field, the long-awaited FlashSdpaPlan GQA-broadcast routing fix, and a SDPA gap closure that makes baracuda's standard MHA shape 50% faster than PyTorch by making fa2 a default feature. The ConcatPlan (KV-cache decode) and reduce_axis (small-shape rows) kernels also got significant rewrites — 13× and 2.6-15.4× respectively. Phase 63 (alpha.63, Fuel-ask) closes the FlashAttention saved-tensor wiring gap: NEW baracuda_kernels_fa2_sdpa_lse_size(batch, num_heads, seq_q) -> usize dense LSE pre-allocation helper (sibling of the existing varlen _varlen_lse_size); load-bearing "LSE saved-tensor contract" docs on the FW + BW trailblazers naming the exact FW→saved-LSE→BW handoff; new docs guide at docs/guides/fa2-saved-tensor-contract.md showing the wiring pattern downstream autograd frameworks should use. Test investment: 8 new tests — 3 host-only lse_size helper sanity tests + 5 GPU FW→BW roundtrip tests (f16/bf16 × d128 causal/noncausal + d64 causal) + 4 BW feature-surface tests backfilling the gaps left by fa2_backward_smoke.rs (BW with sliding window, softcap, ALiBi, and all-features composed). The existing softmax_lse output on baracuda_kernels_fa2_sdpa_<dt>_run (v1) and ..._run_v2 (Phase 59a) — turns out Fuel didn't need a new "v3 with lse" variant; the FW has been writing LSE since alpha.56, just lacked the size helper + saved-tensor wiring documentation. Phase 62 (alpha.62, Fuel-ask) lifts the in-place op contract from contig-only (Phase 61) to strided by shipping 11 new affine in-place FFI symbols (4 contig int dtype backfill — i32/i64/u8/i8 matching forward affine; 7 strided variants across the full forward-strided dtype matrix — f32/f64/i32/i64/u8/bf16/f16) and documenting the stride-equality precondition for same-pointer aliasing on the unary / binary / ternary strided trailblazers as a stable public contract. NEW baracuda_kernels_types::strides_equal host helper for callers to validate the precondition before dispatch. Zero new bespoke kernels for the elementwise unary/binary/ternary families — their existing strided launchers are aliasing-safe under the contract. Test investment: 14 host-only unit tests + 17 GPU smoke tests for the new FFI surface + 7 aliasing-contract proof tests across contig + strided trailblazers (also backfills the Phase 61 contig contract that shipped without test coverage). Phase 61 (alpha.61) completed the alpha.55 baseline with bf16/f16 in-place affine + the original contig same-pointer contract doc. Phase 60 lifted FA2 FW to the full Candle-fork-extended 9-head_dim set ({32, 64, 96, 128, 160, 192, 224, 256, 512}) via 12 new vendored .cu files + 32 FW smoke tests. Phase 59a + 59b added the full FA2 v2.8.3 surface (FW + BW + varlen across head_dims 32-256, GQA, ALiBi, sliding window, softcap) plus 48 new smoke tests, closing Fuel's FA2-retirement requirements. Phase 59c (consolidation pass, alpha.59) fixed a pre-existing parallel-test race in the bespoke flash kernel's SMEM-carveout call surfaced by Phase 59a's 5-head_dim fanout, plus updated flash_sdpa_backward_smoke to force the bespoke backend on f16/bf16 (Phase 59b made FA2 the new default BW backend, breaking source-compat for the existing bespoke BW smoke tests). Phase 42-44 add three opt-in backends (FA2, mHC.cu, ozIMMU); none are on the default build path. Phase 44b internalized the ozIMMU sources (clean-fork; cutf submodule retired; Linux + Windows both build). Phase 49 adds the baracuda-optim sibling crate (Adam / LAMB / SGD via vendored Apex multi_tensor_apply) — gated behind the optim feature so inference-only consumers don't pay the FFI surface cost. Phase 55 adds the baracuda-transformer-engine sibling crate (NVIDIA TransformerEngine FP8 cast + delayed-scaling recipe, Apache-2.0) — gated behind the tensor_engine feature. On Ada (sm_89) the FP8 wins are bandwidth-saving only (KV cache, weights); the recipe machinery is forward-compatible with Hopper / Blackwell where the MMA throughput win also materializes.

Phase coverage (see ARCHITECTURE.md for the phase matrix):

Phase	Scope	Status
59a	FA2 FW expansion (alpha.59) — full upstream feature parity (head_dim fanout {32,64,96,192,256}; GQA; ALiBi; sliding window; softcap): vendored 20 new `.cu` files from Dao-AILab FA2 v2.8.3 (head_dims 32/64/96/192/256 × {fp16, bf16} × {causal, non-causal}) bringing the FW vendor coverage to the full upstream set of {32, 64, 96, 128, 192, 256}. Upstream FA2 v2.8.3 does NOT ship head_dims 160/224/512 — those are permanently out-of-scope (no source). Launcher (`kernels/attention/fa2_launcher.cu`) rewritten to dispatch all 6 supported head_dims via runtime switch. NEW `..._run_v2` + `..._can_implement_v2` FFI entry points (+4 symbols) carrying ALiBi slopes + per-head-or-per-batch layout selector + sliding window left/right bounds + Gemma-2-style softcap. v1 entry points preserved for backwards-compat. GQA-divisible head counts (`num_heads % num_heads_k == 0`) now accepted on the FA2 path. `FlashSdpaDescriptor` is now `#[non_exhaustive]` with `::new(...)` + chainable `with_window_size_left`/`with_window_size_right`/`with_softcap` builders (Phase 32 convention). `FlashSdpaArgs` gained `alibi_slopes: Option<TensorRef<f32, 2>>`. Bespoke backend rejects sliding-window/softcap/ALiBi at select-time with clear errors. ~33 descriptor + ~30 args callsites migrated to the builder pattern. 4 new smoke test files (26 new test functions): `fa2_hdim_fanout_smoke` (20), `fa2_gqa_smoke` (1), `fa2_alibi_smoke` (3), `fa2_sliding_window_smoke` (3), `fa2_softcap_smoke` (4). Out of scope (Phase 59b territory): BW path, varlen, split-KV.	done
59c	Bespoke flash SMEM-carveout race fix + flash_sdpa_backward smoke test routing fix (alpha.59 consolidation pass): added `std::mutex`-serialized helper `set_dynamic_smem_serialized` around all `cudaFuncSetAttribute(MaxDynamicSharedMemorySize)` calls in `baracuda_flash_sdpa.cuh` + `baracuda_flash_sdpa_sm89.cuh` (5 call sites total: FW + BW dQ + BW dKdV + sm_89 FW + sm_89 strided FW). Pre-existing flake (root cause: Phase 6 / Milestone 6.6 host wrapper) that surfaced as `CutlassInternal(1001)` (= `cudaErrorMissingConfiguration`) at ~33% rate on Phase 59a's 20-test hdim fanout, specifically for d_k=96 + fp16 (smem ~50 KiB, just past the 48 KiB cudaFuncSetAttribute trigger). Confirmed fix via 3 stress runs after fix: 60/60 tests pass. Also fixed `flash_sdpa_backward_smoke`'s f16/bf16 paths to explicitly request `BackendKind::Bespoke` — Phase 59b made FA2 the default BW backend for f16/bf16 (more permissive heuristic), which broke source-compat for the existing bespoke BW smoke tests (they fed `lse: f16` not `lse_f32`).	done
59b	FA2 BW + varlen (alpha.59; closes Fuel's FA2-retirement requirements): vendored 24 new BW `.cu` files (`flash_bwd_hdim{32,64,96,128,192,256}_{fp16,bf16}_{,causal}_sm80.cu` — full FA2 v2.8.3 BW set, mirrors 59a FW vendor 1:1) plus 3 new BW headers (`flash_bwd_kernel.h`, `flash_bwd_launch_template.h`, `flash_bwd_preprocess_kernel.h`). Key finding: varlen does NOT have a separate .cu file family upstream — FA2 v2.8.3 dispatches varlen via a runtime `cu_seqlens_q != nullptr` check inside the existing FW/BW launch templates, so the same per-(headdim, dtype, causal) instantiations serve dense and varlen callers. NEW `kernels/attention/fa2_backward_launcher.cu` (BW dispatch, supports dense + varlen via two `fill__params` helpers) + `fa2_varlen_launcher.cu` (varlen FW). +12 new FFI symbols (BW dense ×2 + can_implement ×2 + workspace_size; varlen FW ×2 + can_implement ×2 + lse_size; varlen BW ×2 + can_implement ×2 + workspace_size). API: `FlashSdpaBackwardDescriptor` is now `#[non_exhaustive]` with `::new(...)` + sliding-window/softcap builders. `FlashSdpaBackwardArgs` gained `lse_f32: Option<TensorRef<f32, 3>>` (FA2 stores LSE in f32 regardless of T) + `alibi_slopes`. `FlashSdpaBackwardPlan` extended with `BackendChoice::FlashAttentionV2` arm (additive — bespoke path source-compat preserved). NEW `FlashSdpaVarlenPlan` / `FlashSdpaVarlenBackwardPlan` plan families with packed-batch `[total_q, H, D]` layout + `cu_seqlens_q`/`cu_seqlens_k` index tensors + f32 LSE `[H, total_q + 128B]`. BW workspace = `dq_accum + dsoftmax_d` (sizes via `..._backward_workspace_size`); launcher zero-fills via `cudaMemsetAsync`. Determinism: FA2 BW uses atomicAdd into dq_accum, so NOT bit-stable run-to-run (precision SKU tags this honestly). 2 new smoke test files: `fa2_backward_smoke.rs` (12 tests: workspace sizing + eligibility + e2e BW for d ∈ {64,128,192,256} × {f16,bf16} × {causal,non-causal}), `fa2_varlen_smoke.rs` (5 tests: plan selection, lse_size formula, varlen FW with 3 packed sequences, varlen BW with 2 sequences, varlen × GQA).	done
60	FA2 head_dim {160, 224, 512} FW expansion (alpha.60) — corrects Phase 59a's incorrect "permanently out-of-scope" claim. The Candle fork (`EricLBuehler/candle`) has carried hd160/192/224/256 since 2023-07 (PR #245 by Laurent Mazare); hd224 was restored by PR #2688 (Michael Feil, 2024-12-31); hd512 was added by PR #3417 (Eric Buehler, merged 2026-03-28 — adds the `cudaDeviceGetAttribute(cudaDevAttrMaxSharedMemoryPerBlockOptin)` SMEM opt-in path and updates the splitkv block-size formula). Phase 60 vendors the 12 missing FW `.cu` files from those PRs into baracuda's FA2 tree (8 hd160/224 from `EricLBuehler/candle@main`; 4 hd512 from `huggingface/candle@5430d32c`) plus the corresponding `flash_fwd_launch_template.h` + `static_switch.h` patches. BW path NOT extended — hd160/224 fall on FA2 BW kernel's `kBlockKSmem = (kHeadDim % 64 == 0) ? 64 : 32` constraint (BW atom_layout assumes 64); hd512 needs `kBlockM = 32` to fit any SMEM budget but BW kernel_traits static-asserts `kBlockM >= 64`. Upstream FA2 and the Candle fork ship no BW for these three either — limitation is fundamental to FA2's BW algorithm, not an oversight. Phase 60 attempted both paths; the experiment + reasoning is documented in `VENDOR.md`, in code comments at the dropped registration sites, and in `FA2_BW_SUPPORTED_HEAD_DIMS` (kept at `{32, 64, 96, 128, 192, 256}`). Callers needing BW at hd160/224/512 transparently fall back to the bespoke 3-kernel SDPA BW pipeline (the only path that was supporting them previously, anyway). 12 new FW smoke test functions in `fa2_hdim_fanout_smoke`. `FA2_SUPPORTED_HEAD_DIMS` (FW) lifted to `{32, 64, 96, 128, 160, 192, 224, 256, 512}` — full Candle-fork-extended set.	done
63	FlashAttention saved-tensor contract — dense LSE size helper + FW/BW wiring docs (alpha.63, Fuel-ask). Closes the wiring gap for downstream autograd integration of baracuda's FA2 backward. 1 new FFI symbol: `baracuda_kernels_fa2_sdpa_lse_size(batch, num_heads, seq_q) -> usize` returning f32 element count — sibling of the existing `_varlen_lse_size` (Phase 59b). FW v1 + v2 have written `softmax_lse` since alpha.56 — turns out Fuel didn't need a new "v3 with lse" forward variant as their ask suggested; just the size helper for pre-allocation + clarity on the saved-tensor pattern. Documented "LSE saved-tensor contract" as a stable public contract on the FW + BW trailblazers (`baracuda_kernels_fa2_sdpa_f16_run` / `..._backward_f16_run`): pre-allocate via `_lse_size`, pass same f32 buffer to FW as output and BW as input, ALiBi/sliding-window/softcap params must match between FW and BW. NEW docs guide at `docs/guides/fa2-saved-tensor-contract.md` (~180 lines, ASCII handoff diagram + side-by-side FW/BW code samples). BW head_dim cap confirmed at 256 (matches Fuel's Vulkan limit); hd160/224/512 BW remains structurally not supported by FA2 (per Phase 60), callers fall back to bespoke `SdpaBackwardPlan`. Test investment: 3 host-only `lse_size` helper tests + 5 GPU FW→BW roundtrip tests (the load-bearing wiring proof — pre-allocate LSE, FW writes it, BW reads it, verify dQ/dK/dV finite + non-zero) + 4 BW feature-surface tests (sliding window, softcap, ALiBi, all-features composed — backfills the gaps in the existing `fa2_backward_smoke.rs` which only tested the base path with `alibi_slopes: None`). All 12 new tests pass on RTX 4070. Option B (recompute-LSE backward variant) explicitly rejected: 2× backward compute for zero functional benefit when the saved-tensor pattern already works. PagedAttention backward filed as "ask if needed" per Fuel — not preemptively built.	done
62	Strided in-place op support + comprehensive test investment (alpha.62, Fuel-ask). Lifts the in-place contract from contig-only (Phase 61) to strided. 11 new FFI symbols on the affine in-place family: 4 contig int dtype backfill (`i32`/`i64`/`u8`/`i8` matching forward affine matrix) + 7 strided variants across the full forward-strided dtype set (`f32`/`f64`/`i32`/`i64`/`u8`/`bf16`/`f16`). Half-precision strided uses the same f32-scalar / upcast-to-f32 / downcast pattern as the forward strided f16/bf16 kernels. Same-pointer aliasing contract documented for the strided trailblazers (`unary_neg_f32_strided_run`, `binary_add_f32_strided_run`, `ternary_clamp_f32_strided_run`) as a stable public contract: aliasing is safe IFF the aliased input's stride array equals `stride_y` element-for-element. NEW `baracuda_kernels_types::strides_equal(a, b)` host helper for callers to validate the precondition before dispatching. Zero new bespoke CUDA kernels for the elementwise unary/binary/ternary families — their existing strided launchers are structurally aliasing-safe under the contract (each thread reads its own stride-offset cell before writing, same per-thread pattern as the contig case). Unblocks Fuel's strided in-place op fanout (every existing strided unary/binary/ternary forward kernel becomes an in-place candidate with `x_ptr == y_ptr` + equal strides). Test investment: 14 new host-only unit tests for `strides_equal` + `contiguous_stride` in `baracuda-kernels-types`; 17 new GPU direct-FFI smoke tests covering all 11 new affine in-place symbols + backfill tests for the alpha.55 baseline (f32/f64) and alpha.61 half-precision (bf16/f16) contig in-place; 7 aliasing-contract proof tests across contig + strided trailblazers (also backfills the Phase 61 contig contract that shipped without test coverage). Multi-pass families (Softmax / LayerNorm / RMSNorm / etc.) explicitly out of scope — same as Phase 61.	done
61	In-place op infrastructure completion + same-pointer aliasing contract (alpha.61, Fuel-ask) — 2 new bf16/f16 FFI symbols + docstring tightening. (1) `baracuda_kernels_affine_inplace_{bf16,f16}_run` complete the 4-dtype matrix on top of the alpha.60 f32/f64 in-place affine helper, with f32-scalar ABI matching the forward `affine_{bf16,f16}_run` convention (avoids passing `__nv_bfloat16`/`__half` by value through the C ABI). Kernels reuse the forward upcast-to-f32 / downcast-to-storage pattern from `affine_contig_kernel_{f16,bf16}`. Unblocks Fuel's `Op::AddScalar`/`Op::MulScalar` in-place rewrites + weight-decay scaling on bf16/f16 model weights without the previous Cast → Affine → Cast scratch-buffer round-trip. (2) Documented same-pointer aliasing safety as a stable public contract on the three contig elementwise trailblazers — `unary_neg_f32_run` (covers ~30 plain unary launchers + `unary_param_*` family across all dtypes via the existing "Same device-pointer contract..." inheritance line), `binary_add_f32_run` (covers ~20 binary launchers), `ternary_clamp_f32_run` (already documented since alpha.36). Unblocks Fuel's planned in-place expansion (16+ unary in-place op families + 4 binary in-place op families + ClampInplace + PowIInplace) with zero new baracuda symbols for the elementwise case — Fuel dispatches the forward symbol with `x_ptr == y_ptr` (or `a_ptr == y_ptr` for binary). Strided in-place variants (Phase 62 candidate) deferred — v1 contract from Fuel's executor is contiguous + zero-offset.	done
0	Crate scaffolding, shared type vocabulary	done
1	int8 GEMM RRR (Fuel-blocking, 18 SKUs)	done
2	FP8 / int4 / bin GEMM completion	done
3	Elementwise + shape / layout (Categories B, B', C, C', D, N)	done
4	Reductions + scans + random (Categories E, F, Q)	done
5	Normalization + softmax + loss (Categories G, H, R)	done
6	Attention + linalg + FFT (Categories K, Linalg, U)	done
7	Convolution + pooling + indexing + embedding + segment (Categories I, J, L, M, S)	done
8	Quantization helpers + GGUF + MoE (Category P, V)	done
9	Sort / topk / image / NMS (Categories O, T)	done
10	sm_89 (Ada Lovelace) tuning sweep	done
11	Fuel feedback integration (alpha.27) — ScalarType ergonomics, Conv/Pool fanout, GGUF Q8_K MMVQ, i64 indices, Sparsemax cap lift, atomicAdd-via-CAS, build-env probe	done
12	PowI + ArgMax/Min u32/i32 outputs (alpha.28) — `IndexOutputElement` sealed trait	done
13	WriteSlice + Contiguize + sub-byte casts + Triu/Tril (alpha.29) — KV-cache fast path, retires Fuel's D2H/CPU/H2D fallback, plus `DeviceBuffer::zero()` (alpha.30)	done
14	Strided FFI siblings (alpha.31) — Affine, PowI, Triu/Tril, RoPE+SDPA, GGUF MMVQ activation-strided + W byte offset; 56 new FFI symbols	done
15	Quick wins + correctness cleanup (alpha.32) — MMVQ alignment guard, OneHot/Nonzero i64 wrappers, MoE fixture race fix	done
16	Pool completion (alpha.33) — bit-exact adaptive pool {1,2,3}d, bespoke LpPool {1,2}d, bespoke FractionalMaxPool {2,3}d; 48 new FFI symbols	done
17	SDPA / attention completion (alpha.34) — Flash SDPA sm_89 strided FW + SDPA BW GQA-broadcast atomicAdd	done
18	Sub-byte / quantized completeness (alpha.35) — f16/bf16 activations for `GgufMmvqPlan` across all 11 block formats × contig + strided; 44 new FFI symbols	done
19	Fuel retirement asks (alpha.36) — pool/conv FFI facade for cuDNN-backed plans + Upsample Nearest2d + NEW im2col/im2col1d/col2im1d bespoke; vendored Fuel Q8_1 for inspection; 140 new FFI symbols. Surfaced 1.0-freeze prereq for broader library-backed FFI facade audit	done
20	MoE — Item 4 from Fuel retirement (alpha.37): batched MMVQ × N-experts (36 new FFI symbols across 11 GGUF block formats × 3 activation dtypes + 3 pure-FP); MoE absorb-and-expose proved to be a no-op (Fuel hadn't evolved their kernels since Phase 8.5 vendor; 5 baracuda-side symbols already match) + 2 direct-FFI smoke tests	done
21	Bilinear interpolate expansion (alpha.38): `align_corners` + scale-factor overrides + f16/bf16 fanout (FW+BW). Breaking change to existing f32/f64 signatures.	done
22	MMVQ ncols≥64 debug assertion + cuSOLVER FFI facade (alpha.39): 10 cuSOLVER-backed plan families (Cholesky, LU, QR+ormqr, SVD/SvdBatched/SvdaBatched, eigh real+complex, eig, lstsq, solve, inverse) wrapped behind ~50 flat C symbols in `baracuda-kernels-sys/src/cusolver_facade.rs`; closes the Phase 19 library-backed FFI facade gap for cuSOLVER. No feature gate (cuSOLVER ships with the CUDA toolkit).	done
23	cuFFT + cuRAND FFI facade (alpha.40): 6 cuFFT plan families (FFT 1d/Nd C2C, R2C, C2R) × c32/c64 + f32/f64 + 2 cuRAND families (Uniform, Normal) × f32/f64 = 32 flat C symbols in `baracuda-kernels-sys/src/{cufft,curand}_facade.rs`. cuSPARSE skipped — no baracuda-kernels plans wrap it today.	done
24	Cutlass GEMM re-export FFI facade (alpha.41): 210 trampolines (70 SKU families × {run, workspace_size, can_implement}) in `baracuda-kernels-sys/src/cutlass_reexport.rs` exposing the full Cutlass GEMM surface (fp16/bf16/tf32/f32_simt/f64/s8/u8 × {rcr, rrr} × {plain, bias, bias+relu/gelu/silu} + strided-batched fp16/bf16). cuTENSOR / NPP / CV-CUDA skipped — no baracuda-kernels plans wrap them. Completes the Phase 19 library-backed FFI facade 1.0-freeze prereq.	done
25-26	Segment/EmbeddingBag BW completion + BatchedOrmqrWy complex (alpha.42): 9 new Rust plans + 24 new FFI symbols for Segment Max/Min/Prod BW (sorted + unsorted, f32/f64), Unsorted Segment Prod FW (`atomicCAS`-retry mul), EmbeddingBag Max FW+BW (f32/f64/f16/bf16 × i32/i64). Plus BatchedOrmqrWy complex (Complex32, Complex64) via the bespoke WY-block kernels + cuBLAS C/Z gemmStridedBatched (4 new bespoke FFI + 2 cuBLAS symbols).	done
27	Q8_1 perf inspection (alpha.42 doc-only): Multi-M MMVQ opportunity identified, kept doc-only — bigger ROI than reformatting Q8_1.	done
28	API hygiene for 1.0 prep (alpha.43): new `KernelDtype` umbrella marker trait extending `Element`/`IntElement`/`FpElement`/`BinElement`; `#[non_exhaustive]` audit across the op-family `*Kind` enums + auxiliary tag enums + `Error` types. `ElementKind` / `LayoutSku` / `ArchSku` / `EpilogueKind` / `ActivationKind` / `Workspace` intentionally left exhaustive (hot-path-dispatched).	done
29	Cross-implementation benchmark suite (alpha.44): 10 new criterion+CUDA-event benches comparing baracuda against cuBLAS / cuDNN at LLM-typical shapes (GEMM f32/f16/bf16, MMVQ all qtypes, Softmax, LayerNorm, RMSNorm, Conv2d, MaxPool2d, Reductions, Elementwise, Flash SDPA+GQA). ~2,750 LOC of bench code + 13 bench binaries total. Critical finding: baracuda f16/bf16 GEMM is 2-4× slower than cuBLAS at M=1/M=32 (decode regime); validates the deferred Phase 27 multi-M MMVQ port. See `BENCHMARKS.md` for the methodology + sample run.	done
30	f16/bf16 GEMM cuBLAS fast-path (alpha.45): adds `PlanPreference::prefer_backend: Option<BackendKind>` + thread-local cuBLAS-handle cache to `GemmPlan`. Heuristic: cuBLAS for f16/bf16 at `2 ≤ M < 128` (decode batch); CUTLASS otherwise. 3× speedup at M=32 f16 (55.6µs → 19.0µs, parity with cuBLAS direct). M=1 stays on CUTLASS (cuBLAS RCR→col-major transa=T mapping slower than CUTLASS sm_80 GEMV-tile at K=N≥2048). Capture-mode auto-fallback to CUTLASS (cuBLAS-classic not capture-safe). 9 new smoke tests.	done
31	Fuel Phase 6c.2 storage.rs unblock (alpha.46): 5 gaps closed — ELU α parameter (breaking; 8 sigs modified), `powf` (8 new), `step` + `gelu_erf` (16 new), cast `u32`/`i16` (36 new × 2 directions), `reduce_sum_to`/`reduce_max_to` broadcast-reverse reductions (8 new). ~76 new/modified FFI symbols + 17 new smoke tests. Unblocks Fuel's full PTX retirement (AFFINE/UNARY/BINARY/CAST/REDUCE/INDEXING/TERNARY/FILL/SORT modules).	done
32	Descriptor `#[non_exhaustive]` + builder pattern (alpha.47): 18 descriptors retrofitted with `::new()` builders + chainable setters (`with_stride`/`with_padding`/`with_dilation`/etc.). Conv {1,2,3}d + ConvTranspose {1,2,3}d + Pool {1,2,3}d + AdaptivePool {1,2,3}d + LpPool {1,2}d + FractionalMaxPool {2,3}d + Interpolate + InterpolateBackward. Breaking change for downstream struct-literal callers — pre-1.0 hardening. Migration: `Conv2dDescriptor { ... }` → `Conv2dDescriptor::new(input_shape, filter_shape, element).with_stride(...)`.	done
33	Multi-M MMVQ via Q8_1 staging (alpha.48): closes Phase 27's deferred opportunity. NEW `GgufMmvqMultiMPlan` + `quantize_q8_1` staging kernel + 4 Q8_0 multi-M launchers (M ∈ {1, 2, 4, 8}). Bench: 7.29-7.96× speedup at M=8 on Llama-2 7B layer shapes (4096²; 11008×4096; 32000×4096). Q8_0 only this phase (clean partial); 9 remaining block formats (Q4_0/Q4_1/Q5_0/Q5_1/Q2_K..Q6_K) are mechanical fanout for a follow-up. 8 new FFI symbols (3 staging + 4 multi-M + 1 workspace).	done
34	Multi-M MMVQ block format fanout (alpha.49): 9 remaining GGUF formats shipped — Q4_0, Q4_1, Q5_0, Q5_1, Q2_K, Q3_K, Q4_K, Q5_K, Q6_K. 36 new FFI symbols (9 fmts × 4 M-sizes). Bench at N=K=4096 M=8: Q5_0 17.32×, Q5_1 15.05×, Q4_0 12.78×, Q4_1 12.15×, Q8_0 8.79× — type-0/1 formats massively exceeded Phase 27's 3-7× target. K-quants (Q2_K..Q6_K) hit 3-7× at M=8 (larger 256-elem super-blocks dilute weight-reuse savings). Q8_K MMVQ correctly rejected at select() — bespoke per Phase 11.4.	done
35	Test-infra hardening (alpha.50): first zero-failure regression in the entire Phase 22-35 sweep (2229/0 across 638 binaries). Five fixes: (a) `mmvq_w_offset_alignment_misaligned_rejected_debug` `#[cfg(debug_assertions)]` gate; (b) cuBLAS handle retry with 5× linear backoff (Phase 30 parallel-init race); (c) cuDNN handle retry on CTC path (1001 NOT_INITIALIZED race); (d) `Stream::capture` panic-safe Drop guard (ThreadLocal capture state leak under cargo's thread reuse → cudaErrorStreamCaptureImplicit on subsequent tests); (e) `cudaResourceDesc` 48→128 byte expansion + `repr(align(8))` (Rust struct under-allocated by 16+ bytes AND missing 8-byte alignment that the union's `void*`/`size_t` arms require — caused release-only STATUS_ACCESS_VIOLATION in wave5_smoke).	done
36	Fuel 6c.4 unblock — Phase 1/3 (alpha.51): RoPE apply with precomputed cos/sin tables (FW+BW × 4 fp dtypes; 16 symbols) + Fill missing dtypes & strided variant (3 new contig + 11 strided; 28 symbols) + Argsort dtype fanout (u8/i8/u32/i16/bf16/f16/fp8e4m3; 14 symbols). 58 new FFI declarations total.	done
37	Fuel 6c.4 part 2/4 (alpha.52): Reduce family Gap 1 — `reduce_min_to`/`prod_to` broadcast-reverse for 4 fp dtypes (16 symbols) + integer-dtype single-axis sum/min/max/prod + argmin/argmax for U8/I8/U32/I16/I32/I64 (48 symbols, with U64/I64 widened accumulator + store-time narrow on Sum/Prod). 64 new FFI declarations total. Documented bit-exact wrap-on-overflow contract for u8/u32 sum/prod.	done
38	Fuel 6c.4 part 3/4 (alpha.53): Ternary `where_cond` dtype-matrix fanout — Cond lifted to template parameter; U8 (existing, untouched) + U32 + I64 cond × {f32/f64/f16/bf16, u8/i8/u32/i16/i32/i64, fp8e4m3} value × {contig, strided}. 87 new FFI declarations (58 `_run` + 29 `_can_implement`). Existing `where_<value>_run` family preserved bit-identically (default Cond=uint8_t).	done
39	Fuel 6c.4 part 4/4 (alpha.53, bundled with Phase 38): Indexing Tier 1 — NEW scatter (pure assign) + index_add for {f32/f64/f16/bf16} × {i32, i64idx} (16 syms) + gather u8idx extras for {f32, f64} (2 syms). 18 new FFI symbols total. Existing per-axis stride arrays meant no separate contig/strided split needed. f16/bf16 index_add uses the Phase 11.3 `atomic::add<T>` atomicCAS helper. Scatter documented + tested with disjoint-target indices (last-writer-wins on collisions, caller-aware non-determinism).	done
40	Fuel 6c.4 final cleanup (alpha.54): multi-block radix argsort via CUB `DeviceSegmentedRadixSort` for `row_len > 1024` (4 dtypes × 3 entries = 12 syms; bitonic stays for ≤1024) + Indexing Tier 2 integer value-dtype matrix (gather/index_select/scatter for u8/i8/u16/i16/u32/i32/i64 × i32/i64idx = 38 syms; index_add for i32/u32/i64 only = 6 syms). 56 new C symbols total. New `atomic::add<int64_t>` specialization via `unsigned long long*` reinterpret. Tier 3 (fp8e4m3 + sub-32-bit ints for index_add) deferred — no concrete caller.	done
41	Fuel 6c.5 final unblock (alpha.55): RoPE interleaved-pair (Gap 7) + RoPE THD-layout (Gap 8) variants. 28 new FFI symbols (FW+BW × 4 fp dtypes × 2 variants + `_can_implement` companions). Closes the entire Fuel 6c.4/6c.5 batch ask — Fuel can now drop the last `Id::Reduce` PTX module + retire `fuel-cuda-kernels` workspace member + drop the `cudaforge` build dep. Discovery: existing `rope_apply_*` was already using `(2k, 2k+1)` pairing (not `(i, i+d/2)` as the brief stated) → interleaved symbols are name-aliases on the same kernel; THD is genuinely new.	done
42	Flash Attention v2 vendor + `FlashSdpaPlan` backend (alpha.56): Tri Dao's FA2 v2.8.3 (BSD-3) vendored under `crates/baracuda-kernels-sys/vendor/flash-attention/` — Tier 1 (head_dim=128, fp16+bf16, sm_80, FW only) — wired as `BackendKind::FlashAttentionV2` on `FlashSdpaPlan` behind the `fa2` cargo feature. Heuristic routes long-context (seq_q×seq_k ≥ 1024×1024) shapes to FA2, bespoke otherwise; `PlanPreference::prefer_backend` overrides. PyTorch shim headers (at::PhiloxCudaState + C10_CUDA_CHECK) decouple the vendor from torch deps. Tier 2 (BW, varlen, paged, other head_dims) deferred.	done
43	mHC.cu vendor + `HyperConnectionPlan` family (alpha.56): DeepSeek-AI's Manifold-Constrained Hyper-Connections residual-mixing op (arXiv:2512.24880) from AndreSlavescu/mHC.cu (MIT) vendored under `crates/baracuda-kernels-sys/vendor/mhc/` — Tier 1 (static-H, bf16 only) — exposed as `HyperConnectionPlan` behind the `mhc` cargo feature. Replaces bare `y = x + sublayer(x)` residual with a learned `n×n` Sinkhorn-Knopp doubly-stochastic mixing matrix. Tier 2 (BW, dynamic-H, fp16/f32) deferred. Requires cuBLAS-Lt (already linked).	done
44	ozIMMU FP64-via-Int8-TC backend (alpha.56): enp1s0/ozIMMU (MIT) — Ootomo/Ozaki/Yokota's Ozaki-scheme DGEMM that synthesizes FP64 from S² int8 tensor-core matmuls — vendored under `crates/baracuda-ozimmu-sys/vendor/ozimmu/` with `cutf` submodule pinned alongside. NEW `baracuda-ozimmu-sys` + `baracuda-ozimmu` sibling crates. Wired into `GemmPlan` f64 path as opt-in `BackendKind::Ozaki { slices }` (default stays on CUTLASS/cuBLAS DGEMM — Ozaki is NOT bit-equivalent). Two patches: direct-link mode (no LD_PRELOAD), exclude `cublas.cu`/`culip.cu`.	done
46	FlashInfer cherry-pick — paged-KV decode + sort-free sampling + cascade attention (alpha.57 Checkpoint A, closed in the alpha.58 consolidation pass): surgical extraction of three FlashInfer kernel families (Apache-2.0, v0.6.12, commit `eee0d75f`) vendored under `crates/baracuda-kernels-sys/vendor/flashinfer/` (~12 kLOC across 25 headers, no wholesale wrap). NEW plan families: `BatchPagedDecodePlan` + `PagedKvAppendPlan` (vLLM-style paged KV cache decode), `TopKTopPSamplingPlan` (sort-free TopK/TopP/MinP/combined samplers), `CascadeAttentionPlan` (LSE-merge for prefix-cache sharing). NEW `BackendKind::FlashInfer` + `RandomKind::Multinomial` discriminants. NEW `flashinfer` cargo feature on both `baracuda-kernels-sys` and `baracuda-kernels` (default OFF). 7 MSVC-portability patches to vendored headers (see `vendor/flashinfer/VENDOR.md`). Checkpoint B (alpha.58 consolidation): `flashinfer_paged_decode_launcher.cu` now compiles cleanly under MSVC nvcc — root cause was `std::max(unsigned long, size_t)` type mismatch inside `decode.cuh` (the earlier hypothesis about `cudaLaunchKernel_ptsz` was incorrect). Patched via `static_cast<size_t>(...)` on both arguments; launcher TU also carries a defensive `cudaLaunchKernel` shim macro. All 4 launchers now build under the `flashinfer` feature.	done
44b	ozIMMU clean-fork + cutf elimination + Windows port (alpha.57): full internalize of ozIMMU sources (no longer vendored — we own them at `crates/baracuda-ozimmu-sys/cuda/`). `cutf` submodule eliminated entirely (upstream went offline); ~360 LOC of useful FP / cp_async utilities preserved as native `baracuda_fp_bits.cuh` + `baracuda_cp_async.cuh`; ~2,200 LOC of cutf duplicates deleted. Portable `baracuda::Uint128` replaces `__uint128_t` for Windows compile (typedef alias on Linux — bit-for-bit preservation). LD_PRELOAD path removed entirely. Linux + Windows both build clean.	done
44c	ozIMMU RIKEN-RCCS perf-enhancement variants (alpha.57, no version bump): folds in `accelerator_for_ozIMMU` (Uchino/Ozaki/Imamura 2024, arXiv:2409.13313) — three new variants `EF` (group-wise error-free summation; chains int8 cublasGemmEx with `beta_i=1` to delay int32→f64 materialization), `RN` (nearest-rounding `(a+t)-t` split; ~2 extra effective bits per slice), `H` (= EF + RN), plus n-blocking (chunk `n > 12288` into 8192-wide pieces on the int8 GEMM call). Variant selected via `BackendKind::Ozaki { slices }` discriminant's high-3-bits field; `ozaki_slices::{base,ef,rn,h}(s)` helper constructors in `baracuda-kernels-types::sku`. NEW `OzakiVariant` enum + `Handle::dgemm_with_variant` on `baracuda-ozimmu`. Source-compatible with Phase 44b callers (`slices: 8` decodes as Base/S=8). Discovered + fixed a pre-existing Phase 44b MSVC bug in `axby` / `axy_complex`: upstream's `(1l << 44)` overflows on Windows (where `long` is 32-bit, LLP64) → silent `inf` output. Fixed by switching to `static_cast<double>(1ull << 44)`. 9 new accuracy/variant/n-blocking smoke tests, all green on RTX 4070; the pre-existing Phase 44b accuracy_smoke tests (4 cases) also unbreak.	done
47	Fused Linear Cross-Entropy (alpha.56, single-kernel port from LinkedIn's Liger-Kernel BSD-2): NEW `FusedLinearCrossEntropyPlan` family that fuses lm_head GEMM + CE loss in a chunked outer loop, never materializing the `[BT, V]` logits tensor. At BT=16K, V=128K, bf16 (Llama-3-class) saves 5-10 GiB of activation memory. Bespoke per-chunk fused softmax+CE+gradient kernel (FP32 accumulator across 4 fp dtypes — f16/bf16/f32/f64); GEMMs dispatched via `cublasGemmEx`. Backward produces `grad_input`+`grad_weight` during the FW pass (chunked loop); BW call just scales by `dy_scalar` (no-op when `dy=1.0`, the typical "CE is the last layer" case). 16 new bespoke FFI symbols (per_row + per_row_cast + scalar_finalize + inplace_scale, each × 4 dtypes) + 1 count-non-ignore helper + `cublasGemmEx` binding. NEW `LossKind::FusedLinearCrossEntropy` variant. Algorithm credit: LinkedIn Liger-Kernel (BSD-2-Clause, clean-room CUDA reimplementation — no source vendored).	done
45	SmoothQuant compose + YaRN/LongRoPE Rust helper (alpha.56, no version bump — consolidation phase will bump): two zero-new-CUDA pure-Rust additions. (a) `SmoothQuantLinearPlan<TIn, TWQ>` (in `crates/baracuda-kernels/src/quantize/smoothquant.rs`) composes the existing Phase 8.3 `quantized_linear_w8a8` kernel + `fill_<dt>` broadcast for the per-tensor activation scale. Caller supplies pre-smoothed-and-quantized int8 activations + int8 weights (smoothing itself is offline Python per the SmoothQuant paper — mit-han-lab/smoothquant MIT, Xiao et al. ICML 2023; not in scope). (b) `RopeScaledTableBuilder` + `RopeScaling` enum (Linear / YaRN / LongRoPE, in `crates/baracuda-kernels/src/attention/rope_scaling.rs`) — host-side cos/sin table builder feeding the Phase 36 `rope_apply_<dt>_run` kernel. YaRN (jquesnelle/yarn MIT, Peng et al. arXiv:2309.00071) implements §3.2 NTK-by-parts frequency interpolation + §3.3 attention-temperature absorption into cos/sin. LongRoPE (microsoft/LongRoPE MIT, Ding et al. arXiv:2402.13753) multiplies inv-freq by caller-supplied per-dim factors (evolutionary search itself is offline + out of scope). Existing Phase 36 `RopeApply*` types source-compat preserved.	done
51	Arbitrary-mask `FlashSdpaPlan` + spec-decode composition doc (alpha.57, no version bump — consolidation phase will bump): NEW optional `mask: TensorRef<f32, 4>` field on `FlashSdpaArgs` routing to a bespoke arbmask SDPA kernel that adds an f32 `[B, H, Q, K]` additive bias to `S = Q·K^T·scale` before softmax. Unlocks spec-decode tree masks (EAGLE / Medusa / lookahead), MoE expert masking, prefix-LM, sliding-window with attention sinks — all entirely from caller-side composition. 4 dtypes (f32/f16/bf16/f64) × `_run` + `_can_implement` = 8 new FFI symbols. `is_causal` composes with the mask correctly (`-INF + finite == -INF`). New header `baracuda_attn_arbmask.cuh` reuses Phase 6.6's online-softmax tile pipeline; 1 new .cu instantiation file. FA2 vendor untouched (FA2 v2.8.3's `Mask` template has no arbitrary-mask hook). Runnable example at `crates/baracuda-kernels/examples/speculative_decode_compose.rs`; design doc at `docs/guides/spec-decode.md`. FW only; BW deferred.	done
50	Mamba-2 SSD chunk-scan + Dao-AILab causal-conv1d (alpha.57, gated behind `mamba` cargo feature): opens the state-space LLM class (Mamba-2 8B, Codestral-Mamba, Falcon-Mamba, Zamba2 — Mamba-1 selective_scan deferred to Phase 50b). NEW `SsdChunkScanPlan` + `SsdChunkScanBackwardPlan` (lives under `attention` because of the SSD-as-attention duality) and `CausalConv1dPlan` + `CausalConv1dBackwardPlan` (top-level module — bespoke kernels, no cuDNN dep). Vendor attribution + LICENSE at `crates/baracuda-kernels-sys/vendor/causal-conv1d/` (Tri Dao, BSD-3) and `crates/baracuda-kernels-sys/vendor/mamba/` (state-spaces/mamba, Apache-2.0). Hand-port of the upstream Triton SSD reference + causal-conv1d primitive. Dtypes: causal-conv1d f32/f16/bf16/f64 × widths 2/3/4 × {SiLU, identity}; SSD f32/f16/bf16 (no f64 upstream). FW caps state at D,N ≤ 256; BW tighter at 64 (SMEM budget). 30 new FFI symbols (8 causal-conv1d FW + 8 BW + 6 SSD FW + 6 SSD BW + 2 can_implement extras). 5 new smoke tests (causal_conv1d_smoke/bw + ssd_chunk_scan_smoke/bw + mamba2_block_smoke).	done
50b	Mamba-1 `selective_scan` (alpha.57, gated behind the same `mamba` cargo feature as Phase 50): completes the state-space LLM coverage by adding the original Mamba-1 op family that powers Mamba-7B, Falcon-Mamba, and Codestral-Mamba — Phase 50's SSD covers Mamba-2 / Codestral-Mamba / Falcon-Mamba / Zamba2, but every Mamba-1-shipping model still uses v1's `selective_scan`, not v2's SSD reformulation. NEW `SelectiveScanPlan` + `SelectiveScanBackwardPlan` (sibling to `SsdChunkScanPlan` under `attention/`). Shape `(B, L, D, N)` with the full Mamba-1 surface: optional `D[d]` skip, optional SiLU-gated `z[t, d]` tail, optional `delta_bias[d]` + optional `softplus(delta)` mapping (all 9 args of upstream `selective_scan_fn` wired). Dtypes f32/f16/bf16 (complex deferred — no shipping LLM uses it). Hand-port of `state-spaces/mamba`'s `csrc/selective_scan/` under Apache-2.0; same `vendor/mamba/` directory as Phase 50 (VENDOR.md updated, no new LICENSE file). FW caps state at `N ≤ 256`; BW uses two-pass record-then-reverse with `BDLNsizeof(T)` workspace. NEW `AttentionKind::SelectiveScan = 8` variant (`#[non_exhaustive]` so source-compat). 17 new FFI symbols (3 FW + 3 FW-can-impl + 3 BW + 1 workspace-bytes + module-internal launchers). 3 new smoke tests (selective_scan_smoke covering 4 option-combinations + f16/bf16 loose-tol, selective_scan_bw_smoke with FD checks on du/ddelta/dA + topology rejection, mamba1_block_smoke end-to-end).	done
52	NCCL foundation crate pair (alpha.57, no version bump — consolidation phase will bump): `baracuda-nccl-sys` (raw FFI types + libloading lazy-resolve, NO bindgen / NO link-time dep) + `baracuda-nccl` (safe `Communicator` with full collective surface — `all_reduce` / `reduce` / `reduce_scatter` / `all_gather` / `broadcast` / `send` / `recv` + group API + `NcclMem` + custom `pre_mul_sum` reduction op + `register` / `deregister` for zero-copy). The distributed-roadmap prerequisite for Ring Attention, distributed MoE, Megatron-LM tensor parallelism, FSDP-style shard collectives — Phase 52 only ships the substrate; consumer plans land in Phase 53+. Spec-named API: `Communicator::new_single_gpu` / `new_with_id`, cached infallible `rank()` / `world_size()`, `NcclReduceOp` / `NcclUniqueId` / `NcclDataType` aliases, `NcclUniqueId::generate()`. Linux-primary (NCCL ships with the CUDA toolkit there); Windows builds clean and defers the "is NCCL installed?" question to first `nccl()` call (loader fails with `LoaderError::LibraryNotFound`). 20 new smoke tests (10 dtype mapping — runs on every host; 10 `#[ignore]` NCCL-required). No baracuda-kernels integration in this phase.	done
49	Apex optimizer subset (alpha.57, gated behind `optim` cargo feature): deliberate scope expansion — training-framework-adjacent. NEW sibling crate `baracuda-optim` (~600 LOC Rust + ~750 LOC CUDA) vendoring the NVIDIA Apex (BSD-3-Clause) `multi_tensor_apply` idiom + fused Adam / LAMB / SGD functors. Single launch over thousands of parameter tensors (Apex `MAX_TENSORS_PER_LAUNCH = 110` per batch, multi-launch transparent) — eliminates the ~10,000-launch optimizer step overhead on 32B-param models. Plans: `AdamStepPlan<T>` (f32/f16/bf16 + AdamW mode), `LambStepPlan` (f32; two-stage with atomicAdd-fused L2-norm + sqrt + trust-ratio scaling), `SgdStepPlan<T>` (f32/f16/bf16 + momentum + Nesterov + weight-decay). Inference-only consumers (e.g. Fuel) don't pay the FFI surface cost — the vendored sources only build / link when the feature is enabled. Re-exported under `baracuda_kernels::optim` when enabled. Measured 41× speedup at 1000-tensor multi-tensor Adam vs 1000 individual launches on RTX 4070 (0.173 ms vs 7.096 ms; smoke test in `crates/baracuda-optim/tests/multi_tensor_dispatch_smoke.rs`). 4 smoke tests, 6 GPU tests total, all green.	done
53	bitsandbytes NF4 dequant + GEMV vendor — QLoRA inference (alpha.57, gated behind `bnb_nf4` cargo feature): opens the QLoRA-trained Llama / Mistral / Qwen inference class by vendoring the bitsandbytes (Dettmers et al. arXiv:2305.14314, MIT) NF4 (NormalFloat 4-bit) dequant + GEMV kernels. NF4 is the dominant 4-bit format for QLoRA-trained prebuilts on the HuggingFace Hub — distinct from GGUF Q4_0 (symmetric int4scale, llama.cpp, Phase 8) and AWQ int4 (asymmetric int4 + zero-points). NF4 uses a 16-entry non-uniform quantile codebook* derived from the inverse CDF of `N(0, 1)` — dequant is a 16-entry lookup, not arithmetic; better accuracy than symmetric int4 for normally-distributed weights. NEW plan trio: `Nf4DequantizePlan<T>` (bulk unpack `[N/2, K]` u8 → `[N, K]` T), `Nf4MmvqPlan<T>` (M=1 single-vector decode GEMV), `Nf4MmvqMultiMPlan<T>` (M ∈ {1, 2, 4, 8} batched-decode with weight gmem reuse, Phase 33 pattern applied to NF4). 11 new FFI symbols (3 dequant + 2 M=1 + 6 multi-M). Pack layout matches bitsandbytes upstream `Linear4bit`: pair-packed nibbles in `[N/2, K]` u8 (N must be even) + `[N * (K/block_size)]` f32 per-block absmax (block_size typically 64). Activation/output dtypes f16+bf16 (PyTorch convention); f32 accumulator. Codebook reproduced bit-identical to upstream as device-side switch + host-side `NF4_CODEBOOK: [f32; 16]` const + `nf4_pack_weight` host helper. Vendor metadata at `crates/baracuda-kernels-sys/vendor/bitsandbytes/{LICENSE,AUTHORS,VENDOR.md}`. 3 smoke test files (dequant roundtrip, M=1 GEMV f16+bf16, multi-M f16 vs M=1-looped). Out of scope: 8-bit optimizers (Phase 49 overlap), LLM.int8 (Phase 45 obsoletes), FP4 (different codebook — separate phase if asked), double quantization (Tier 2).	done
54	xFormers BlockSparseAttention + 2:4 sparse GEMM (alpha.57, no version bump; clean-room hand-port of facebookresearch/xformers BSD-3-Clause algorithmic reference): NEW `SdpaBlockSparsePlan` (`xformers_blocksparse` cargo feature) — block-sparse SDPA FW where the attention mask is a per-block boolean pattern `[B, H, num_blocks_q × num_blocks_k]` (uint8); only the active (q_block, k_block) pairs participate in the QK^T matmul + online-softmax accumulation. Different from Phase 51's arbitrary additive-mask path (which still computes every cell) — block-sparse actually SKIPS compute on masked blocks → real wall-clock speedup on long-context attention with known sparse patterns. NEW `GemmSparse24Plan` (`xformers_sparse24` cargo feature) — 2:4 structured sparsity GEMM consuming pre-compressed `[M, K/2]` weights + `[M, K/8]` u16 metadata. Tier-1 implementation: inflate-then-dense reference matmul (correctness first; sparse-tensor-core `mma.sp.sync` / cuSPARSELt backend deferred to Tier 2). 16 new FFI symbols (8 block-sparse SDPA × 4 dtypes × 2 entries; 11 sparse24 × 3 dtypes × 4 entries with workspace_bytes helper). NEW `AttentionKind::BlockSparseAttention = 9` variant (`#[non_exhaustive]` so source-compat). 3 smoke tests (block-sparse all-ones-matches-dense + diagonal-band + empty-pattern; sparse24 matches host reference + K-rejection + throughput timing). Vendor attribution at `crates/baracuda-kernels-sys/vendor/xformers/` (no upstream sources vendored verbatim — algorithmic reference only). NOT vendored from xFormers: memory-efficient attention (overlaps with FA2 vendor); fused biases / RoPE / norm (overlaps with existing baracuda phases); Triton kernel paths (no Triton toolchain).	done (Tier 1)
55	TransformerEngine FP8 cast + delayed-scaling recipe (alpha.57, gated behind `tensor_engine` cargo feature; clean-room hand-port of NVIDIA TransformerEngine Apache-2.0 algorithm — only the cast + recipe subset). NEW sibling crate pair `baracuda-transformer-engine-sys` + `baracuda-transformer-engine`. The differentiated value of TE is the per-tensor delayed-scaling recipe with amax history for stable FP8 training; that's the load-bearing piece this phase ships. Public API: `Fp8Recipe` (RAII handle holding amax_history ring + scale + scale_inv device scalars), `Fp8CastPlan<TIn>` (fused FP8 cast + `max(\|x\|)` amax reduction in one kernel — atomicMax into `amax_history[write_pos]`), `Fp8DequantPlan<TOut>` (symmetric dequant via `scale_inv`). Both formats: E4M3 (max=448) for fwd/weights, E5M2 (max=57344) for grads. Wide dtypes: f32/f16/bf16. 4 new C-ABI symbols (`baracuda_te_fused_cast_amax_run` / `baracuda_te_dequant_run` / `baracuda_te_recipe_update_run` / `baracuda_te_recipe_init_run`) + format/dtype id helpers. NO cuDNN dep (cast/recipe paths don't need it — cuDNN is only needed for `fused_attn`, which we skip). NO pybind11 (raw C ABI, not Python). 10 GPU smoke tests, all green on RTX 4070. Sm_89 reality check: FP8 storage + cast intrinsics work natively on Ada, but tensor-core FP8 MMA throughput equals BF16 — so the wins here are bandwidth-saving (KV cache, weight storage, activation memory) not compute. Recipe machinery is forward-compatible with Hopper (sm_90a) / Blackwell (sm_100) where the MMA throughput win also materializes. Deliberately NOT lifted: `normalization` (Phase 5), `fused_rope` (Phase 14/36/41), `fused_attn` (Phase 17/42; cuDNN dep), `fused_softmax` (Phase 5), `activation` (Phase 3/31), `gemm` (Phase 1+24+30), `comm_gemm_overlap` (Hopper TMA), `fused_router` (Phase 8+20), `hadamard`/`newton_schulz`/`swizzle`/`permutation` (niche), `multi_tensor` (Phase 49), `dropout` (composable), Python bindings (out of scope). Algorithmic reference: `transformer_engine/common/{cast,recipe}/*.cu` upstream + FP8 spec Micikevicius et al. 2022 (arXiv:2209.05433). Vendor attribution + full Apache-2.0 text at `crates/baracuda-transformer-engine-sys/ATTRIBUTION.md`.	done
57	Megatron-LM tensor-parallel primitives (alpha.57, no version bump; gated behind `megatron_tp` cargo feature). NEW sibling crate `baracuda-megatron` — pure-composition over `baracuda-cublas` (local GEMM via `cublasSgemm` for f32, `cublasGemmEx` with `Compute32F` accumulator for f16/bf16) + `baracuda-nccl` (cross-rank `all_gather` / `all_reduce` collectives). NO new CUDA kernels — foundational TP primitives for Megatron-style models are pure orchestration; the kernel substrate already exists in baracuda. NEW plans: `ColumnParallelLinearPlan<T>` (splits W along output dim; FW `Y_local = X @ W_local^T` + `all_gather`; BW `dX_partial = dY_local @ W_local` + `all_reduce(Sum)`, `dW_local = dY_local^T @ X` local) and `RowParallelLinearPlan<T>` (splits W along input dim; FW `Y_partial = X_local @ W_local^T` + `all_reduce(Sum)`; BW `dX_local = dY @ W_local` local, `dW_local = dY^T @ X_local` local — no BW collective, the Megatron pairing only needs one collective per layer-pair). NEW `TensorParallelContext` borrow type holding `&Communicator` + `in_features` / `out_features` / cached `rank` / `world_size`. Dtypes: f32 always; f16 + bf16 behind the crate-level `half-crate` feature (which the kernel-facade `megatron_tp` feature pulls in). Tier 1 scope — bias rejected at call site with a Tier-2 marker error (caller can Affine-add post-FW; matters for RowParallel where the bias must be added after the all_reduce so it doesn't get summed N times). 5 smoke tests across 3 files: `column_parallel_smoke` (FW + BW, single-rank, matches CPU `Linear` ref), `row_parallel_smoke` (same), `multi_rank_scaffold` (`#[ignore]`-gated 2-GPU scaffold — exits cleanly on single-GPU dev boxes). Algorithmic reference: Shoeybi et al. arXiv:1909.08053 (NVIDIA Megatron-LM, Apache-2.0); no source vendored. Out of scope: async overlap (Hopper TMA); sequence parallelism (Phase 56's domain); pipeline parallelism (future phase); VocabParallelEmbedding (future polish); distributed gradient accumulation (Phase 58's domain); expert parallelism (separate phase).	done
58	DistributedAdam — ZeRO-1-style sharded optimizer state (alpha.57, no version bump; gated behind the new `distributed_optim` cargo feature on `baracuda-optim`, pulls `baracuda-nccl` as optional dep). Pure-Rust composition over Phase 49 [`AdamStepPlan`] + Phase 52 NCCL collectives — NO new CUDA kernels, NO new `baracuda-kernels-sys` FFI. NEW `DistributedAdamStepPlan<T>` wrapping the inner Adam plan + a borrowed `&Communicator`; orchestrates the canonical ZeRO-1 protocol: `all_reduce(grads, Sum, in-place)` → local Adam step on this rank's `1/world_size` shard → `all_gather(updated_params, in-place)`. Single-rank degenerate case (`world_size == 1`) elides both collectives and reduces to `AdamStepPlan::step` bit-exactly (smoke test verifies this on single-GPU dev hardware). f32 + f16 + bf16 dtypes, AdamW + classic mode, mixed-precision `step_with_f32_state` variant. NEW `shard_range(n, rank, world_size)` helper matching `torch.chunk` semantics. Phase 58 constraint: tensor element counts must be `world_size`-multiples (ring all_gather symmetry); per-tensor broadcast fallback for ragged shards is future work. Out of scope: ZeRO-2 (gradient sharding); ZeRO-3 (parameter sharding during FW/BW); DistributedLamb / DistributedSGD (same pattern, defer until concrete demand); CPU-offload optimizer state; 8-bit distributed optimizer state. 3 smoke test files (4 pure-Rust shard_range tests run unconditionally; 2 single-rank GPU smokes `#[ignore]`-gated for NCCL; 1 multi-rank scaffold `#[ignore]`-gated for 2+ GPU validation). Algorithmic reference: Rajbhandari et al. SC20 "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models", Microsoft DeepSpeed (Apache-2.0; no source vendored — pure Rust composition).	done
56	Ring Attention — sequence-parallel attention (alpha.57, no version bump — consolidation phase will bump; gated behind the new `ring_attention` cargo feature, pulls `baracuda-nccl` + `baracuda-nccl-sys` as optional deps). First Phase 52 NCCL consumer — proves the substrate. Hand-port of Liu/Yan/Abbeel 2023 (arXiv:2310.01889; algorithmic reference at `https://github.com/lhao499/RingAttention`, Apache-2.0 — no JAX source vendored, clean-room CUDA implementation). NEW `RingAttentionPlan<T>` + `RingAttentionDescriptor` + `RingAttentionArgs` in `crates/baracuda-kernels/src/attention/ring_attention.rs`; bespoke `kernels/attention/ring_attention_kernel.cu` (~480 LOC kernel + ~390 LOC plan). Per-rank online-softmax fold of resident K/V chunk into persistent `(o_acc, m_acc, l_acc)` f32 state; ring rotation via `comm.send`/`recv` inside `group_start`/`group_end`; finalize kernel emits `y = o_acc / l_acc` (+ optional `lse`). Tier 1 scope: f16/bf16 (f32/f64 deferred), `head_dim == 128`, FW only (BW Tier 2), no GQA broadcast, no arbitrary additive mask. Causal masking applied on global indices (each step kernel takes `q_global_base` + `k_global_base` so masking is consistent across rotation steps). 12 new FFI symbols (`workspace_bytes` + dtype-independent `init_run` + 5 per-dtype × 2 dtypes: `step_run` / `step_can_implement` / `finalize_run` / `finalize_can_implement`). Unlocks million-token context length across N GPUs with O(N/P) memory where N = total seq len, P = ring size. Complementary to Phase 57's tensor-parallelism (sequence-dim sharding vs head-dim sharding compose). 4 smoke tests (3 single-rank degenerate cases validating against `FlashSdpaPlan` ground truth: f16 + bf16 + f16 causal — all pass on RTX 4070; 1 multi-rank scaffold `#[ignore]`-gated for 2+ GPU validation). Single-rank `world_size == 1` reduces to standard FlashAttention math (the validation path on single-GPU hardware).	done
48	Marlin + AWQ 4-bit GEMM vendor + GPTQ→Marlin repack utility (alpha.57, no version bump — consolidation phase will bump; gated behind the new `marlin` + `awq` cargo features on `baracuda-kernels-sys` and `baracuda-kernels`). Two complementary 4-bit GEMM vendors completing the "4-bit hub coverage" started in Phase 53 (NF4). Marlin (IST-DASLab, Apache-2.0 + §3 patent grant, vendored at `crates/baracuda-kernels-sys/vendor/marlin/`) — state-of-the-art W4A16 GEMM for the decode-batch regime, ~3.87× speedup over FP16 GEMM at M ∈ [1, 32] on Ampere / Ada per the paper. Symmetric int4 (zero-point fused into dequant as `q - 8`); group size 128 or per-channel; sm_80/86/89 only (sm_90 needs WGMMA rewrite — Marlin v2 territory, deferred). NEW `Int4MarlinGemmPlan<f16>`. AWQ (mit-han-lab, MIT — no patent grant, vendored at `crates/baracuda-kernels-sys/vendor/awq/`) — natively supports the most-deployed 4-bit format on the Hugging Face Hub (Llama / Mistral / Qwen `-AWQ`). Asymmetric* int4 with explicit per-group zero-points; group size 64 or 128; loads directly from HF checkpoints without repack. NEW `Int4AwqGemmPlan<f16>`. GPTQ→Marlin repack utility — pure-Rust host-side `gptq_to_marlin_repack` bridging GPTQ asymmetric checkpoints into Marlin's symmetric layout via zero-point absorption (trailblazer implementation; act_order=True deferred, the upstream Marlin intra-fragment permutation table is documented but uses identity permutation in the trailblazer). 4 new FFI symbols total (2 Marlin: `_run` + `_can_implement`; 4 AWQ: `_run` + `_workspace_bytes` + `_can_implement` + dequant stub). AWQ vendor source patched to strip the upstream `<torch/extension.h>` host wrapper (`__asm__ __volatile__` → `asm volatile` for MSVC nvcc portability) and re-export only the device-side `__global__` template kernel. Marlin needs `--expt-relaxed-constexpr` (constexpr `ceildiv` called from `__global__`). Both kernels build clean on RTX 4070 with the gated features. 3 smoke test files (marlin_smoke `#[ignore]` GPU + descriptor validation; awq_smoke `#[ignore]` GPU + descriptor validation; gptq_to_marlin_smoke pure-Rust roundtrip + zp-fold verification + clamp-at-extremes).	done
46+	Phase 46-51 mainstream-techniques roadmap (FlashInfer cherry-pick, Marlin/AWQ); Hopper sm_90a / Blackwell sm_100; 1.0 freeze.	pending (see `ROADMAP.md`)

API stability is not promised before beta.0. Breaking changes ship in each alpha bump and are documented in the workspace CHANGELOG.md.

Quick start

Add the kernel facade and the driver crate:

[dependencies]

baracuda-kernels = { version = "0.0.1-alpha.64", features = ["sm89", "cudnn"] }

baracuda-driver  = "0.0.1-alpha.64"

A representative example — single-axis numerically stable softmax over a device-resident tensor:

use baracuda_driver::{Context, Device, DeviceBuffer, Stream};
use baracuda_kernels::{
    PlanPreference, SoftmaxArgs, SoftmaxDescriptor, SoftmaxKind, SoftmaxPlan,
    TensorMut, TensorRef, Workspace,
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // 1. Standard CUDA bring-up via baracuda-driver.
    let ctx = Context::new(&Device::get(0)?)?;
    let stream = Stream::new(&ctx)?;

    // 2. Allocate device input + output buffers (rank-2: rows × cols).
    let rows = 32i32;
    let cols = 1024i32;
    let n_elems = (rows * cols) as usize;
    let dev_x: DeviceBuffer<f32> = DeviceBuffer::zeros(&ctx, n_elems)?;
    let mut dev_y: DeviceBuffer<f32> = DeviceBuffer::zeros(&ctx, n_elems)?;

    // 3. Build the descriptor — pure shape + dtype + op-kind, no handles.
    let desc = SoftmaxDescriptor::<2> {
        kind: SoftmaxKind::Softmax,
        input_shape: [rows, cols],
        softmax_axis: 1,
        element: <f32 as baracuda_kernels::KernelDtype>::KIND,
    };

    // 4. Plan selection — picks a kernel SKU (bespoke softmax kernel here).
    let plan = SoftmaxPlan::<f32, 2>::select(&stream, &desc, PlanPreference::default())?;

    // 5. Args carry the per-call tensor handles + strides.
    let args = SoftmaxArgs {
        x: TensorRef { data: dev_x.as_slice(), shape: [rows, cols], stride: [cols as i64, 1] },
        y: TensorMut { data: dev_y.as_slice_mut(), shape: [rows, cols], stride: [cols as i64, 1] },
    };

    // 6. Launch. Workspace::None for plans that need no scratch.
    plan.run(&stream, Workspace::None, args)?;
    stream.synchronize()?;
    Ok(())
}

The same select → run shape applies to every op. GEMM, attention, conv2d, FFT, scatter — the descriptor / args fields differ per family but the lifecycle is identical. See the crates/baracuda-kernels README for the int8-GEMM variant of the same example.

Workspace layout

The user-facing crates a typical caller will reach for:

baracuda-kernels             # the unified Plan-based ML op facade
baracuda-kernels-types       # shared type vocabulary (Element, TensorRef, KernelSku, ...)
baracuda-kernels-sys         # raw FFI to bespoke .cu kernels
baracuda-kernels-bench       # criterion harness for sm_89 perf sweeps (not published)
baracuda-cutlass             # safe wrapper for CUTLASS GEMM (float, int8 RCR, batched, grouped)
baracuda-driver              # safe wrapper for the CUDA Driver API
baracuda-runtime             # safe wrapper for the CUDA Runtime API

The per-library wrappers used internally by the facade (you can also use them stand-alone):

baracuda-cublas{,-sys}       # cuBLAS + cuBLASLt + cuBLASXt
baracuda-cudnn{,-sys}        # cuDNN classic + Graph API
baracuda-cudf{,-sys}         # cuDF (RAPIDS dataframe; Linux-only)
baracuda-cufft{,-sys}        # cuFFT
baracuda-cusolver{,-sys}     # cuSOLVER dense + sparse + Rf + Mg
baracuda-cusparse{,-sys}     # cuSPARSE
baracuda-curand{,-sys}       # cuRAND
baracuda-cutensor{,-sys}     # cuTENSOR
baracuda-cutlass{,-sys}      # CUTLASS GEMM kernel templates
baracuda-cutlass-kernels-sys # CUTLASS kernel-only compile target
baracuda-cuvs{,-sys}         # RAPIDS cuVS GPU vector search (Phase 71)
baracuda-cvcuda{,-sys}       # CV-CUDA image processing
baracuda-flashinfer{,-sys}   # FlashInfer paged-KV + cascade + sampling (Phase 46/66)
baracuda-npp{,-sys}          # NPP
baracuda-nccl{,-sys}         # NCCL
baracuda-nvcomp{,-sys}       # nvCOMP
baracuda-nvimagecodec{,-sys} # nvImageCodec (Phase 70 — supersedes nvJPEG)
baracuda-nvjpeg{,-sys}       # nvJPEG (kept for back-compat)
baracuda-nvshmem{,-sys}      # NVSHMEM symmetric-heap RDMA (Phase 69)
baracuda-ozimmu{,-sys}       # ozIMMU Ozaki-scheme DGEMM (Phase 44)
baracuda-transformer-engine{,-sys}  # TransformerEngine FP8 (Phase 55)

And the supporting low-level crates (FFI, build infrastructure, profiling):

baracuda-cuda-sys            # Driver + Runtime FFI
baracuda-nvrtc{,-sys}        # runtime CUDA C++ → PTX
baracuda-nvjitlink{,-sys}    # CUDA 12+ JIT linker
baracuda-cupti{,-sys}        # profiling APIs
baracuda-nvml{,-sys}         # device monitoring
baracuda-cufile{,-sys}       # GPUDirect Storage (Linux-only)
baracuda-tensorrt{,-sys}     # TensorRT inference runtime (Phase 68 — vtable-dispatch C++ shim)
baracuda-forge              # build-time .cu → PTX compiler driver
baracuda-build              # build.rs helpers
baracuda-core                # loader + Error plumbing
baracuda-types{,-derive}     # pure-data types: Half, BFloat16, Complex, DeviceRepr

Vendor-track training / optimization crates (opt-in via feature gates on baracuda-kernels):

baracuda-megatron            # Megatron-LM TP primitives (Phase 57; composition, no kernels)
baracuda-optim               # NVIDIA Apex multi-tensor optimizers (Phase 49)

The full umbrella crate (baracuda) re-exports everything behind cargo features — convenient when you want everything; overkill when you don't.

Hardware support

baracuda targets Ampere and newer by design. Pre-Ampere GPUs lack the tensor-core instructions and async-copy primitives the bespoke kernels are written against (mma.sync.m16n8k*, cp.async, ldmatrix), and we have no desire to ship a slower SIMT fallback for hardware that's eight years old.

Compute capability	NVIDIA marketing names	baracuda support
sm_80	Ampere (A100, A40, A30, RTX 30xx)	default baseline
sm_89	Ada Lovelace (RTX 40xx, L40, L4)	feature-gated specialized kernels (FP8, larger Flash Attention tiles)
sm_90a	Hopper async (H100, H200)	stubs in place; full specialization pending Phase 11
sm_100	Blackwell	post-Phase-11
≤ sm_75 (Turing, Volta, Pascal, …)	—	unsupported

The default sm80 build runs forward-compatibly on Ada and Hopper through JIT-compiled PTX; turn on sm89 to pick up the FP8 and Flash-Attention sibling plans tuned for Ada's larger register file.

Cargo features

The kernel facade exposes a broad opt-in feature set. Architecture and library-integration features are off by default — pick what your deployment needs. Important: unless a feature is enabled, its plans are not built, not linked, and not present in the public API. Always check the feature gate when scanning for an op family.

Architecture targets

Feature	Default	Effect
`sm80`	yes	Ampere-baseline kernel set (RTX 30xx, A100).
`sm89`	no	Ada Lovelace specializations (FP8 GEMM, `FlashSdpaSm89Plan`).
`sm90a`	no	Hopper-specialized kernels (stubs today; tracked for Phase 7x).

NVIDIA library integrations

Feature	Default	Phase	Effect
`cudnn`	no	7	Link cuDNN. Enables Conv2d / Pool2d / `CtcLossCudnnPlan`. Separate NVIDIA download.

Vendored kernel families

Feature	Default	Phase	Effect
`fa2`	no	42	Vendored Tri Dao FlashAttention v2 (BSD-3). Long-context routing on `FlashSdpaPlan`.
`mhc`	no	43	Vendored DeepSeek-AI mHC.cu (MIT). `HyperConnectionPlan` (learned residuals).
`ozimmu`	no	44	Vendored ozIMMU (MIT). Ozaki-scheme DGEMM via S² int8 tensor-core matmuls.
`flashinfer`	no	46/66	Vendored FlashInfer (Apache-2.0). Paged-KV decode/prefill, cascade, sampling.
`marlin`	no	48	Vendored IST-DASLab Marlin (Apache-2.0). Symmetric int4 W4A16 GEMM.
`awq`	no	48	Vendored mit-han-lab llm-awq (MIT). Asymmetric int4 W4A16 GEMM.
`optim`	no	49	Vendored NVIDIA Apex (BSD-3). Multi-tensor Adam/Lamb/SGD step kernels.
`mamba`	no	50	Vendored Mamba-2 SSD + Dao causal-conv1d. State-space LLM ops.
`bnb_nf4`	no	53	Vendored bitsandbytes NF4 (MIT). 4-bit non-uniform quantile QLoRA.
`xformers_blocksparse`	no	54	Clean-room port of xFormers block-sparse SDPA (BSD-3).
`xformers_sparse24`	no	54	Clean-room port of xFormers 2:4 structured sparsity GEMM (BSD-3).
`tensor_engine`	no	55	Vendored NVIDIA TransformerEngine (Apache-2.0). FP8 cast/dequant + delayed-scaling recipe.
`ring_attention`	no	56	Sequence-parallel ring attention (Apache-2.0 reference). Pulls in NCCL.
`megatron_tp`	no	57	Megatron-LM tensor-parallel primitives (composition, no new kernels).
`nvshmem`	no	69	NVSHMEM host-side wrapper (one-sided RDMA, sibling to NCCL). Linux-only.

Sibling crates also have their own feature gates:

baracuda-cuvs — cuvs feature in baracuda-kernels-sys ecosystem pulls in RAPIDS cuVS vector-search (Phase 71, Linux-only). Not yet re-exported by baracuda-kernels itself.
baracuda-tensorrt — shim feature builds the vtable-dispatch C++ shim required to call TensorRT (Phase 68 — no flat C ABI upstream).
baracuda-flashinfer — same flashinfer feature as above.

Notes:

cudnn is off by default because cuDNN is a separate NVIDIA download not bundled with the stock CUDA toolkit installer. Enabling it without cuDNN installed produces a linker error on cudnn.lib / libcudnn.so — see the Building section for auto-discovery paths.
Most vendored features add 30s–5 min to first build (template-heavy CUDA). Subsequent builds incremental.
Features that pull in optional sibling crates (optim, tensor_engine, megatron_tp, nvshmem, ring_attention) only compile the sibling when the feature is enabled — inference-only consumers don't pay the surface cost.

Building

Requirements:

CUDA Toolkit ≥ 12.0 with nvcc on PATH. baracuda is tested on 12.x and 13.x.
cuDNN 9.x (only if you enable the cudnn feature) — separate NVIDIA download, not bundled with the toolkit.
A working Rust toolchain ≥ 1.85 (workspace MSRV pinned in rust-toolchain.toml).
Windows users: lld-link.exe somewhere on PATH. The CUDA nvcc invocation links through it; the install location is typically C:\Program Files\LLVM\bin. Install the LLVM Windows package and add that directory to PATH if cargo build complains about lld-link.exe not being found.

A typical full build with all GPU-side features (CUDA toolkit + cuDNN present):

cargo build -p baracuda-kernels --features sm89,cudnn --release

Or, to verify the public API surface compiles without the full kernel build (fast — type-check only):

cargo check -p baracuda-kernels --features sm89,cudnn

The baracuda-kernels-sys build script auto-discovers cuDNN at the following paths in order: CUDNN_PATH / CUDNN_ROOT / CUDNN_HOME env vars, then C:\Program Files\NVIDIA\CUDNN\v<X.Y>\ on Windows, then the CUDA toolkit's own lib/ directory (pre-cuDNN-9 layout), then the standard Linux distro paths under /usr/lib/.

Troubleshooting

Windows: Git-for-Windows fake `link.exe` shadowing the MSVC linker

Git-for-Windows ships a GNU coreutils binary named link.exe at C:\Program Files\Git\usr\bin\link.exe — its job is to create a hard link, not to link object files. If that directory appears on PATH ahead of the MSVC linker (or LLVM's lld-link.exe), cargo build invokes the coreutils binary instead of the real linker and fails with a cryptic error (it doesn't understand /OUT: and friends).

baracuda's baracuda-kernels-sys and baracuda-cutlass-sys build scripts probe PATH on Windows and emit a cargo:warning if they detect this shadowing. Fix: re-order PATH so the MSVC linker (typically reached via the Visual Studio "x64 Native Tools Command Prompt") or LLVM's lld-link.exe (C:\Program Files\LLVM\bin\) appears before C:\Program Files\Git\usr\bin\. Building from the VS x64 Native Tools prompt is the most reliable option; alternatively, install LLVM and put its bin directory ahead of Git's on the user/system PATH.

Testing

baracuda's GPU integration tests are gated behind #[ignore] so a host-only cargo test doesn't try to launch a kernel on a machine without an NVIDIA driver. To run them you need a working GPU plus the --ignored flag:

# Host-only tests (compile + reference logic; no GPU access):

cargo test -p baracuda-kernels --lib


# Full GPU integration sweep — RTX 30xx / 40xx / 50xx required:

cargo test -p baracuda-kernels --release -- --ignored


# Verify the workspace-level API surface compiles (no GPU needed):

cargo check -p baracuda-kernels --features sm89,cudnn

The full regression on an RTX 4070 covers 324 binary targets at ~1630 tests passing. Individual op-family suites take 30–90 seconds; the full sweep is 25–40 minutes.

Benchmarks

The baracuda-kernels-bench crate is a criterion-based harness with CUDA-event-timed throughput sweeps across GEMM, Flash Attention, and Conv2d at LLM-typical and ResNet-typical shapes. It is not published to crates.io (it depends on a working GPU).

cargo bench -p baracuda-kernels-bench --features sm89,cudnn

The full sweep takes ~30 minutes on an RTX 4070. Scope to a single family with --bench gemm / --bench flash_attention / --bench conv2d. See crates/baracuda-kernels-bench/BENCH-sm89.md for the baseline table format and methodology.

Project documentation

ARCHITECTURE.md — layered design, Plan-Descriptor-Args pattern, KernelSku taxonomy, dispatcher design, workspace contract, sibling-plan pattern, vendoring convention, phase roadmap.
OP-MATRIX.md — full op × dtype × backend coverage matrix (planned).
LESSONS.md — postmortems, ABI footguns, performance traps (planned).
Per-crate README.md files under crates/<name>/.

License

Dual-licensed under MIT or Apache-2.0. Pick whichever fits your project. Contributions accepted under the same terms.

NVIDIA's CUDA libraries (libcuda, libcudart, libcublas, libcudnn, …) are not redistributed by this project. You obtain them from NVIDIA separately — either through the CUDA Toolkit installer or through each library's dedicated download page. baracuda's loader opens whatever the host driver / toolkit has installed.

Vendor attribution

A small number of bespoke kernels in baracuda-kernels-sys are vendored from upstream open-source projects (huggingface/candle's CUDA kernel set via fuel-cuda-kernels; llama.cpp's ggml-cuda GGUF block-format quantization + MMVQ; guoqingbao/attention.rs's fused MoE expert kernels). Each adapted source carries an SPDX-FileCopyrightText: + SPDX-License-Identifier: header; the consolidated provenance is in crates/baracuda-kernels-sys/LICENSE-thirdparty.md.

FlashAttention v2 (Tri Dao, BSD-3-Clause, pinned at v2.8.3 / 060c9188beec3a8b62b33a3bfa6d5d2d44975fab) is vendored at crates/baracuda-kernels-sys/vendor/flash-attention/ with verbatim LICENSE + AUTHORS files and full vendor / scope notes in VENDOR.md. Gated behind the fa2 cargo feature on baracuda-kernels-sys and baracuda-kernels; exposed through a backend-choice path on FlashSdpaPlan (Phase 42).

mHC.cu (Andre Slavescu, MIT, pinned at a426939c2dbc11c443db041bcff12b65d1b6482a) — unofficial CUDA implementation of DeepSeek-AI's Manifold-Constrained Hyper-Connections paper — is vendored at crates/baracuda-kernels-sys/vendor/mhc/ with the verbatim upstream LICENSE, an AUTHORS file, and full vendor / scope notes in VENDOR.md. Gated behind the mhc cargo feature on baracuda-kernels-sys and baracuda-kernels; exposed through the new HyperConnectionPlan (Phase 43, Tier 1: static-H FW, bf16 weights / f32 activations).

FlashInfer (NVIDIA

FlashInfer community, Apache-2.0 with full patent grant, pinned at v0.6.12 / eee0d75f91f64c520bfaed07e39a850ea4ddde23) — a curated ~12 kLOC subset of the FlashInfer header tree is vendored at crates/baracuda-kernels-sys/vendor/flashinfer/ with verbatim upstream LICENSE + NOTICE and full vendor / scope / patch notes in VENDOR.md. Gated behind the flashinfer cargo feature on baracuda-kernels-sys and baracuda-kernels; exposes three NEW plan families — paged-KV decode + append (BatchPagedDecodePlan + PagedKvAppendPlan for vLLM-style serving), sort-free sampling (TopKTopPSamplingPlan — combined TopK/TopP/MinP via a single-kernel rejection sampler), and cascade attention LSE merge (CascadeAttentionPlan for prefix-cache sharing). Surgical cherry-pick (not a wholesale wrap) — Hopper / Blackwell / NVSHMEM / Mamba / MLA / POD paths intentionally skipped to keep the build cost contained. Phase 46.

The baracuda-forge build-time kernel-compiler crate is a vendored fork of cudaforge by Guoqing Bao — see crates/baracuda-forge/NOTICE for the upstream commit hash.

The baracuda-cutlass safe wrapper for NVIDIA CUTLASS — plan-based GEMM and grouped-GEMM with caller-supplied workspace, MoE-friendly variable-M-per-group dispatch — was specified by the Fuel ML library team. See crates/baracuda-cutlass/NOTICE for the design lineage.

baracuda 0.0.1-alpha.68