baracuda

About the name. Yes, we know — it's spelled barracuda (two Rs). That name was taken on crates.io, so we dropped one R and kept swimming.
A unified Rust ML-op facade over the NVIDIA CUDA ecosystem.
What baracuda is
baracuda is a Rust workspace that exposes every primitive an ML framework
expects — the union of PyTorch (torch.* + nn.functional) and JAX
(jax.lax.* + jax.numpy.*) — through a single Plan-based crate surface
called baracuda-kernels. Internally each plan dispatches to:
- The appropriate NVIDIA-library wrapper crate (cuBLAS, cuDNN, cuFFT, cuSOLVER, cuRAND, cuSPARSE, cuTENSOR, NPP, CV-CUDA, CUTLASS) when one already covers the op well, or
- A bespoke hand-rolled
.cukernel shipped inbaracuda-kernels-syswhen no NVIDIA library covers the op (or covers it poorly at the shapes that matter for modern transformer / vision / GNN workloads).
Callers import one crate (baracuda-kernels) and reach for one API
style. The dispatch decision — which is observable through
Plan::sku() for telemetry — is otherwise invisible. Switching from a
CUTLASS-backed SKU to a bespoke-backed SKU is a layout flag, not an import
change.
baracuda is for downstream Rust ML / inference / training frameworks that
need access to the full CUDA stack without re-vendoring it themselves. The
workspace also ships idiomatic stand-alone wrappers for every CUDA library
under crates/baracuda-<lib> if you want to skip the kernel facade and
talk to one library directly.
Status
In active development — alpha.67. 2280+ GPU tests passing,
zero failures across the 6 critical test crates on an RTX 4070
(sm_89; the baracuda-kernels suite alone is 2180/0 across 513 test
binaries). alpha.67 (Phase 74, Fuel-ask) ships the plain dense FP
GEMM family: 12 cuBLAS-backed flat C symbols
(baracuda_kernels_gemm_dense_{f32, f64, f16, bf16}_*) with
RRR / RCR / CRR layouts, flexible leading dims, and
strided-batch folded into the base signature — plus the
DenseGemmPlan<T> typed plan, the ReduceToPlan<T, N>
broadcast-reverse reduction facade, UnaryKind::Step, and gelu
flavor-disambiguation docs. This closes the last non-baracuda CUDA
surface in Fuel (its own cuBLAS MatMul wrapper). alpha.66 (Fuel-ask)
exposed per-device VRAM queries on the Driver-API Device —
vram_free() / vram_total() / vram_info() wrap
cuMemGetInfo_v2 so downstream optimizers can read runtime memory
pressure without round-tripping through a failed allocation. Phase 73 follow-up (alpha.65) lands a 17-33× decode
speedup via a focused FlashDecodingPlan (split-K, seq_q=1),
4× win at GQA shapes via the new num_kv_heads descriptor
field, the long-awaited FlashSdpaPlan GQA-broadcast routing fix,
and a SDPA gap closure that makes baracuda's standard MHA shape
50% faster than PyTorch by making fa2 a default feature. The
ConcatPlan (KV-cache decode) and reduce_axis (small-shape rows)
kernels also got significant rewrites — 13× and 2.6-15.4×
respectively. Phase 63 (alpha.63, Fuel-ask) closes the FlashAttention
saved-tensor wiring gap: NEW
baracuda_kernels_fa2_sdpa_lse_size(batch, num_heads, seq_q) -> usize
dense LSE pre-allocation helper (sibling of the existing varlen
_varlen_lse_size); load-bearing "LSE saved-tensor contract"
docs on the FW + BW trailblazers naming the exact FW→saved-LSE→BW
handoff; new docs guide at docs/guides/fa2-saved-tensor-contract.md
showing the wiring pattern downstream autograd frameworks should
use. Test investment: 8 new tests — 3 host-only lse_size
helper sanity tests + 5 GPU FW→BW roundtrip tests (f16/bf16 ×
d128 causal/noncausal + d64 causal) + 4 BW feature-surface tests
backfilling the gaps left by fa2_backward_smoke.rs (BW with
sliding window, softcap, ALiBi, and all-features composed). The
existing softmax_lse output on baracuda_kernels_fa2_sdpa_<dt>_run
(v1) and ..._run_v2 (Phase 59a) — turns out Fuel didn't need a
new "v3 with lse" variant; the FW has been writing LSE since
alpha.56, just lacked the size helper + saved-tensor wiring
documentation. Phase 62 (alpha.62, Fuel-ask) lifts the in-place
op contract from contig-only (Phase 61) to strided by shipping
11 new affine in-place FFI symbols (4 contig int dtype backfill —
i32/i64/u8/i8 matching forward affine; 7 strided variants
across the full forward-strided dtype matrix — f32/f64/i32/i64/u8/bf16/f16)
and documenting the stride-equality precondition for
same-pointer aliasing on the unary / binary / ternary strided
trailblazers as a stable public contract. NEW
baracuda_kernels_types::strides_equal host helper for callers
to validate the precondition before dispatch. Zero new bespoke
kernels for the elementwise unary/binary/ternary families —
their existing strided launchers are aliasing-safe under the
contract. Test investment: 14 host-only unit tests +
17 GPU smoke tests for the new FFI surface + 7 aliasing-contract
proof tests across contig + strided trailblazers (also backfills
the Phase 61 contig contract that shipped without test coverage).
Phase 61 (alpha.61) completed the alpha.55 baseline with bf16/f16
in-place affine + the original contig same-pointer contract doc.
Phase 60 lifted FA2 FW to the full Candle-fork-extended
9-head_dim set ({32, 64, 96, 128, 160, 192, 224, 256, 512}) via
12 new vendored .cu files + 32 FW smoke tests. Phase 59a + 59b
added the full FA2 v2.8.3 surface (FW + BW + varlen across
head_dims 32-256, GQA, ALiBi, sliding window, softcap) plus 48
new smoke tests, closing Fuel's FA2-retirement requirements. Phase 59c (consolidation pass, alpha.59) fixed a
pre-existing parallel-test race in the bespoke flash kernel's
SMEM-carveout call surfaced by Phase 59a's 5-head_dim fanout, plus
updated flash_sdpa_backward_smoke to force the bespoke backend on
f16/bf16 (Phase 59b made FA2 the new default BW backend, breaking
source-compat for the existing bespoke BW smoke tests).
Phase 42-44
add three opt-in backends (FA2, mHC.cu, ozIMMU); none are on the
default build path. Phase 44b internalized the ozIMMU sources
(clean-fork; cutf submodule retired; Linux + Windows both build).
Phase 49 adds the baracuda-optim sibling crate (Adam / LAMB / SGD
via vendored Apex multi_tensor_apply) — gated behind the optim
feature so inference-only consumers don't pay the FFI surface cost.
Phase 55 adds the baracuda-transformer-engine sibling crate
(NVIDIA TransformerEngine FP8 cast + delayed-scaling recipe,
Apache-2.0) — gated behind the tensor_engine feature. On Ada
(sm_89) the FP8 wins are bandwidth-saving only (KV cache, weights);
the recipe machinery is forward-compatible with Hopper / Blackwell
where the MMA throughput win also materializes.
Phase coverage (see ARCHITECTURE.md for the phase
matrix):
| Phase | Scope | Status |
|---|---|---|
| 59a | FA2 FW expansion (alpha.59) — full upstream feature parity (head_dim fanout {32,64,96,192,256}; GQA; ALiBi; sliding window; softcap): vendored 20 new .cu files from Dao-AILab FA2 v2.8.3 (head_dims 32/64/96/192/256 × {fp16, bf16} × {causal, non-causal}) bringing the FW vendor coverage to the full upstream set of {32, 64, 96, 128, 192, 256}. Upstream FA2 v2.8.3 does NOT ship head_dims 160/224/512 — those are permanently out-of-scope (no source). Launcher (kernels/attention/fa2_launcher.cu) rewritten to dispatch all 6 supported head_dims via runtime switch. NEW ..._run_v2 + ..._can_implement_v2 FFI entry points (+4 symbols) carrying ALiBi slopes + per-head-or-per-batch layout selector + sliding window left/right bounds + Gemma-2-style softcap. v1 entry points preserved for backwards-compat. GQA-divisible head counts (num_heads % num_heads_k == 0) now accepted on the FA2 path. FlashSdpaDescriptor is now #[non_exhaustive] with ::new(...) + chainable with_window_size_left/with_window_size_right/with_softcap builders (Phase 32 convention). FlashSdpaArgs gained alibi_slopes: Option<TensorRef<f32, 2>>. Bespoke backend rejects sliding-window/softcap/ALiBi at select-time with clear errors. ~33 descriptor + ~30 args callsites migrated to the builder pattern. 4 new smoke test files (26 new test functions): fa2_hdim_fanout_smoke (20), fa2_gqa_smoke (1), fa2_alibi_smoke (3), fa2_sliding_window_smoke (3), fa2_softcap_smoke (4). Out of scope (Phase 59b territory): BW path, varlen, split-KV. |
done |
| 59c | Bespoke flash SMEM-carveout race fix + flash_sdpa_backward smoke test routing fix (alpha.59 consolidation pass): added std::mutex-serialized helper set_dynamic_smem_serialized around all cudaFuncSetAttribute(MaxDynamicSharedMemorySize) calls in baracuda_flash_sdpa.cuh + baracuda_flash_sdpa_sm89.cuh (5 call sites total: FW + BW dQ + BW dKdV + sm_89 FW + sm_89 strided FW). Pre-existing flake (root cause: Phase 6 / Milestone 6.6 host wrapper) that surfaced as CutlassInternal(1001) (= cudaErrorMissingConfiguration) at ~33% rate on Phase 59a's 20-test hdim fanout, specifically for d_k=96 + fp16 (smem ~50 KiB, just past the 48 KiB cudaFuncSetAttribute trigger). Confirmed fix via 3 stress runs after fix: 60/60 tests pass. Also fixed flash_sdpa_backward_smoke's f16/bf16 paths to explicitly request BackendKind::Bespoke — Phase 59b made FA2 the default BW backend for f16/bf16 (more permissive heuristic), which broke source-compat for the existing bespoke BW smoke tests (they fed lse: f16 not lse_f32). |
done |
| 59b | FA2 BW + varlen (alpha.59; closes Fuel's FA2-retirement requirements): vendored 24 new BW .cu files (flash_bwd_hdim{32,64,96,128,192,256}_{fp16,bf16}_{,causal}_sm80.cu — full FA2 v2.8.3 BW set, mirrors 59a FW vendor 1:1) plus 3 new BW headers (flash_bwd_kernel.h, flash_bwd_launch_template.h, flash_bwd_preprocess_kernel.h). Key finding: varlen does NOT have a separate .cu file family upstream — FA2 v2.8.3 dispatches varlen via a runtime cu_seqlens_q != nullptr check inside the existing FW/BW launch templates, so the same per-(headdim, dtype, causal) instantiations serve dense and varlen callers. NEW kernels/attention/fa2_backward_launcher.cu (BW dispatch, supports dense + varlen via two fill_*_params helpers) + fa2_varlen_launcher.cu (varlen FW). +12 new FFI symbols (BW dense ×2 + can_implement ×2 + workspace_size; varlen FW ×2 + can_implement ×2 + lse_size; varlen BW ×2 + can_implement ×2 + workspace_size). API: FlashSdpaBackwardDescriptor is now #[non_exhaustive] with ::new(...) + sliding-window/softcap builders. FlashSdpaBackwardArgs gained lse_f32: Option<TensorRef<f32, 3>> (FA2 stores LSE in f32 regardless of T) + alibi_slopes. FlashSdpaBackwardPlan extended with BackendChoice::FlashAttentionV2 arm (additive — bespoke path source-compat preserved). NEW FlashSdpaVarlenPlan / FlashSdpaVarlenBackwardPlan plan families with packed-batch [total_q, H, D] layout + cu_seqlens_q/cu_seqlens_k index tensors + f32 LSE [H, total_q + 128*B]. BW workspace = dq_accum + dsoftmax_d (sizes via ..._backward_workspace_size); launcher zero-fills via cudaMemsetAsync. Determinism: FA2 BW uses atomicAdd into dq_accum, so NOT bit-stable run-to-run (precision SKU tags this honestly). 2 new smoke test files: fa2_backward_smoke.rs (12 tests: workspace sizing + eligibility + e2e BW for d ∈ {64,128,192,256} × {f16,bf16} × {causal,non-causal}), fa2_varlen_smoke.rs (5 tests: plan selection, lse_size formula, varlen FW with 3 packed sequences, varlen BW with 2 sequences, varlen × GQA). |
done |
| 60 | FA2 head_dim {160, 224, 512} FW expansion (alpha.60) — corrects Phase 59a's incorrect "permanently out-of-scope" claim. The Candle fork (EricLBuehler/candle) has carried hd160/192/224/256 since 2023-07 (PR #245 by Laurent Mazare); hd224 was restored by PR #2688 (Michael Feil, 2024-12-31); hd512 was added by PR #3417 (Eric Buehler, merged 2026-03-28 — adds the cudaDeviceGetAttribute(cudaDevAttrMaxSharedMemoryPerBlockOptin) SMEM opt-in path and updates the splitkv block-size formula). Phase 60 vendors the 12 missing FW .cu files from those PRs into baracuda's FA2 tree (8 hd160/224 from EricLBuehler/candle@main; 4 hd512 from huggingface/candle@5430d32c) plus the corresponding flash_fwd_launch_template.h + static_switch.h patches. BW path NOT extended — hd160/224 fall on FA2 BW kernel's kBlockKSmem = (kHeadDim % 64 == 0) ? 64 : 32 constraint (BW atom_layout assumes 64); hd512 needs kBlockM = 32 to fit any SMEM budget but BW kernel_traits static-asserts kBlockM >= 64. Upstream FA2 and the Candle fork ship no BW for these three either — limitation is fundamental to FA2's BW algorithm, not an oversight. Phase 60 attempted both paths; the experiment + reasoning is documented in VENDOR.md, in code comments at the dropped registration sites, and in FA2_BW_SUPPORTED_HEAD_DIMS (kept at {32, 64, 96, 128, 192, 256}). Callers needing BW at hd160/224/512 transparently fall back to the bespoke 3-kernel SDPA BW pipeline (the only path that was supporting them previously, anyway). 12 new FW smoke test functions in fa2_hdim_fanout_smoke. FA2_SUPPORTED_HEAD_DIMS (FW) lifted to {32, 64, 96, 128, 160, 192, 224, 256, 512} — full Candle-fork-extended set. |
done |
| 63 | FlashAttention saved-tensor contract — dense LSE size helper + FW/BW wiring docs (alpha.63, Fuel-ask). Closes the wiring gap for downstream autograd integration of baracuda's FA2 backward. 1 new FFI symbol: baracuda_kernels_fa2_sdpa_lse_size(batch, num_heads, seq_q) -> usize returning f32 element count — sibling of the existing _varlen_lse_size (Phase 59b). FW v1 + v2 have written softmax_lse since alpha.56 — turns out Fuel didn't need a new "v3 with lse" forward variant as their ask suggested; just the size helper for pre-allocation + clarity on the saved-tensor pattern. Documented "LSE saved-tensor contract" as a stable public contract on the FW + BW trailblazers (baracuda_kernels_fa2_sdpa_f16_run / ..._backward_f16_run): pre-allocate via _lse_size, pass same f32 buffer to FW as output and BW as input, ALiBi/sliding-window/softcap params must match between FW and BW. NEW docs guide at docs/guides/fa2-saved-tensor-contract.md (~180 lines, ASCII handoff diagram + side-by-side FW/BW code samples). BW head_dim cap confirmed at 256 (matches Fuel's Vulkan limit); hd160/224/512 BW remains structurally not supported by FA2 (per Phase 60), callers fall back to bespoke SdpaBackwardPlan. Test investment: 3 host-only lse_size helper tests + 5 GPU FW→BW roundtrip tests (the load-bearing wiring proof — pre-allocate LSE, FW writes it, BW reads it, verify dQ/dK/dV finite + non-zero) + 4 BW feature-surface tests (sliding window, softcap, ALiBi, all-features composed — backfills the gaps in the existing fa2_backward_smoke.rs which only tested the base path with alibi_slopes: None). All 12 new tests pass on RTX 4070. Option B (recompute-LSE backward variant) explicitly rejected: 2× backward compute for zero functional benefit when the saved-tensor pattern already works. PagedAttention backward filed as "ask if needed" per Fuel — not preemptively built. |
done |
| 62 | Strided in-place op support + comprehensive test investment (alpha.62, Fuel-ask). Lifts the in-place contract from contig-only (Phase 61) to strided. 11 new FFI symbols on the affine in-place family: 4 contig int dtype backfill (i32/i64/u8/i8 matching forward affine matrix) + 7 strided variants across the full forward-strided dtype set (f32/f64/i32/i64/u8/bf16/f16). Half-precision strided uses the same f32-scalar / upcast-to-f32 / downcast pattern as the forward strided f16/bf16 kernels. Same-pointer aliasing contract documented for the strided trailblazers (unary_neg_f32_strided_run, binary_add_f32_strided_run, ternary_clamp_f32_strided_run) as a stable public contract: aliasing is safe IFF the aliased input's stride array equals stride_y element-for-element. NEW baracuda_kernels_types::strides_equal(a, b) host helper for callers to validate the precondition before dispatching. Zero new bespoke CUDA kernels for the elementwise unary/binary/ternary families — their existing strided launchers are structurally aliasing-safe under the contract (each thread reads its own stride-offset cell before writing, same per-thread pattern as the contig case). Unblocks Fuel's strided in-place op fanout (every existing strided unary/binary/ternary forward kernel becomes an in-place candidate with x_ptr == y_ptr + equal strides). Test investment: 14 new host-only unit tests for strides_equal + contiguous_stride in baracuda-kernels-types; 17 new GPU direct-FFI smoke tests covering all 11 new affine in-place symbols + backfill tests for the alpha.55 baseline (f32/f64) and alpha.61 half-precision (bf16/f16) contig in-place; 7 aliasing-contract proof tests across contig + strided trailblazers (also backfills the Phase 61 contig contract that shipped without test coverage). Multi-pass families (Softmax / LayerNorm / RMSNorm / etc.) explicitly out of scope — same as Phase 61. |
done |
| 61 | In-place op infrastructure completion + same-pointer aliasing contract (alpha.61, Fuel-ask) — 2 new bf16/f16 FFI symbols + docstring tightening. (1) baracuda_kernels_affine_inplace_{bf16,f16}_run complete the 4-dtype matrix on top of the alpha.60 f32/f64 in-place affine helper, with f32-scalar ABI matching the forward affine_{bf16,f16}_run convention (avoids passing __nv_bfloat16/__half by value through the C ABI). Kernels reuse the forward upcast-to-f32 / downcast-to-storage pattern from affine_contig_kernel_{f16,bf16}. Unblocks Fuel's Op::AddScalar/Op::MulScalar in-place rewrites + weight-decay scaling on bf16/f16 model weights without the previous Cast → Affine → Cast scratch-buffer round-trip. (2) Documented same-pointer aliasing safety as a stable public contract on the three contig elementwise trailblazers — unary_neg_f32_run (covers ~30 plain unary launchers + unary_param_* family across all dtypes via the existing "Same device-pointer contract..." inheritance line), binary_add_f32_run (covers ~20 binary launchers), ternary_clamp_f32_run (already documented since alpha.36). Unblocks Fuel's planned in-place expansion (16+ unary in-place op families + 4 binary in-place op families + ClampInplace + PowIInplace) with zero new baracuda symbols for the elementwise case — Fuel dispatches the forward symbol with x_ptr == y_ptr (or a_ptr == y_ptr for binary). Strided in-place variants (Phase 62 candidate) deferred — v1 contract from Fuel's executor is contiguous + zero-offset. |
done |
| 0 | Crate scaffolding, shared type vocabulary | done |
| 1 | int8 GEMM RRR (Fuel-blocking, 18 SKUs) | done |
| 2 | FP8 / int4 / bin GEMM completion | done |
| 3 | Elementwise + shape / layout (Categories B, B', C, C', D, N) | done |
| 4 | Reductions + scans + random (Categories E, F, Q) | done |
| 5 | Normalization + softmax + loss (Categories G, H, R) | done |
| 6 | Attention + linalg + FFT (Categories K, Linalg, U) | done |
| 7 | Convolution + pooling + indexing + embedding + segment (Categories I, J, L, M, S) | done |
| 8 | Quantization helpers + GGUF + MoE (Category P, V) | done |
| 9 | Sort / topk / image / NMS (Categories O, T) | done |
| 10 | sm_89 (Ada Lovelace) tuning sweep | done |
| 11 | Fuel feedback integration (alpha.27) — ScalarType ergonomics, Conv/Pool fanout, GGUF Q8_K MMVQ, i64 indices, Sparsemax cap lift, atomicAdd-via-CAS, build-env probe | done |
| 12 | PowI + ArgMax/Min u32/i32 outputs (alpha.28) — IndexOutputElement sealed trait |
done |
| 13 | WriteSlice + Contiguize + sub-byte casts + Triu/Tril (alpha.29) — KV-cache fast path, retires Fuel's D2H/CPU/H2D fallback, plus DeviceBuffer::zero() (alpha.30) |
done |
| 14 | Strided FFI siblings (alpha.31) — Affine, PowI, Triu/Tril, RoPE+SDPA, GGUF MMVQ activation-strided + W byte offset; 56 new FFI symbols | done |
| 15 | Quick wins + correctness cleanup (alpha.32) — MMVQ alignment guard, OneHot/Nonzero i64 wrappers, MoE fixture race fix | done |
| 16 | Pool completion (alpha.33) — bit-exact adaptive pool {1,2,3}d, bespoke LpPool {1,2}d, bespoke FractionalMaxPool {2,3}d; 48 new FFI symbols | done |
| 17 | SDPA / attention completion (alpha.34) — Flash SDPA sm_89 strided FW + SDPA BW GQA-broadcast atomicAdd | done |
| 18 | Sub-byte / quantized completeness (alpha.35) — f16/bf16 activations for GgufMmvqPlan across all 11 block formats × contig + strided; 44 new FFI symbols |
done |
| 19 | Fuel retirement asks (alpha.36) — pool/conv FFI facade for cuDNN-backed plans + Upsample Nearest2d + NEW im2col/im2col1d/col2im1d bespoke; vendored Fuel Q8_1 for inspection; 140 new FFI symbols. Surfaced 1.0-freeze prereq for broader library-backed FFI facade audit | done |
| 20 | MoE — Item 4 from Fuel retirement (alpha.37): batched MMVQ × N-experts (36 new FFI symbols across 11 GGUF block formats × 3 activation dtypes + 3 pure-FP); MoE absorb-and-expose proved to be a no-op (Fuel hadn't evolved their kernels since Phase 8.5 vendor; 5 baracuda-side symbols already match) + 2 direct-FFI smoke tests | done |
| 21 | Bilinear interpolate expansion (alpha.38): align_corners + scale-factor overrides + f16/bf16 fanout (FW+BW). Breaking change to existing f32/f64 signatures. |
done |
| 22 | MMVQ ncols≥64 debug assertion + cuSOLVER FFI facade (alpha.39): 10 cuSOLVER-backed plan families (Cholesky, LU, QR+ormqr, SVD/SvdBatched/SvdaBatched, eigh real+complex, eig, lstsq, solve, inverse) wrapped behind ~50 flat C symbols in baracuda-kernels-sys/src/cusolver_facade.rs; closes the Phase 19 library-backed FFI facade gap for cuSOLVER. No feature gate (cuSOLVER ships with the CUDA toolkit). |
done |
| 23 | cuFFT + cuRAND FFI facade (alpha.40): 6 cuFFT plan families (FFT 1d/Nd C2C, R2C, C2R) × c32/c64 + f32/f64 + 2 cuRAND families (Uniform, Normal) × f32/f64 = 32 flat C symbols in baracuda-kernels-sys/src/{cufft,curand}_facade.rs. cuSPARSE skipped — no baracuda-kernels plans wrap it today. |
done |
| 24 | Cutlass GEMM re-export FFI facade (alpha.41): 210 trampolines (70 SKU families × {run, workspace_size, can_implement}) in baracuda-kernels-sys/src/cutlass_reexport.rs exposing the full Cutlass GEMM surface (fp16/bf16/tf32/f32_simt/f64/s8/u8 × {rcr, rrr} × {plain, bias, bias+relu/gelu/silu} + strided-batched fp16/bf16). cuTENSOR / NPP / CV-CUDA skipped — no baracuda-kernels plans wrap them. Completes the Phase 19 library-backed FFI facade 1.0-freeze prereq. |
done |
| 25-26 | Segment/EmbeddingBag BW completion + BatchedOrmqrWy complex (alpha.42): 9 new Rust plans + 24 new FFI symbols for Segment Max/Min/Prod BW (sorted + unsorted, f32/f64), Unsorted Segment Prod FW (atomicCAS-retry mul), EmbeddingBag Max FW+BW (f32/f64/f16/bf16 × i32/i64). Plus BatchedOrmqrWy complex (Complex32, Complex64) via the bespoke WY-block kernels + cuBLAS C/Z gemmStridedBatched (4 new bespoke FFI + 2 cuBLAS symbols). |
done |
| 27 | Q8_1 perf inspection (alpha.42 doc-only): Multi-M MMVQ opportunity identified, kept doc-only — bigger ROI than reformatting Q8_1. | done |
| 28 | API hygiene for 1.0 prep (alpha.43): new KernelDtype umbrella marker trait extending Element/IntElement/FpElement/BinElement; #[non_exhaustive] audit across the op-family *Kind enums + auxiliary tag enums + Error types. ElementKind / LayoutSku / ArchSku / EpilogueKind / ActivationKind / Workspace intentionally left exhaustive (hot-path-dispatched). |
done |
| 29 | Cross-implementation benchmark suite (alpha.44): 10 new criterion+CUDA-event benches comparing baracuda against cuBLAS / cuDNN at LLM-typical shapes (GEMM f32/f16/bf16, MMVQ all qtypes, Softmax, LayerNorm, RMSNorm, Conv2d, MaxPool2d, Reductions, Elementwise, Flash SDPA+GQA). ~2,750 LOC of bench code + 13 bench binaries total. Critical finding: baracuda f16/bf16 GEMM is 2-4× slower than cuBLAS at M=1/M=32 (decode regime); validates the deferred Phase 27 multi-M MMVQ port. See BENCHMARKS.md for the methodology + sample run. |
done |
| 30 | f16/bf16 GEMM cuBLAS fast-path (alpha.45): adds PlanPreference::prefer_backend: Option<BackendKind> + thread-local cuBLAS-handle cache to GemmPlan. Heuristic: cuBLAS for f16/bf16 at 2 ≤ M < 128 (decode batch); CUTLASS otherwise. 3× speedup at M=32 f16 (55.6µs → 19.0µs, parity with cuBLAS direct). M=1 stays on CUTLASS (cuBLAS RCR→col-major transa=T mapping slower than CUTLASS sm_80 GEMV-tile at K=N≥2048). Capture-mode auto-fallback to CUTLASS (cuBLAS-classic not capture-safe). 9 new smoke tests. |
done |
| 31 | Fuel Phase 6c.2 storage.rs unblock (alpha.46): 5 gaps closed — ELU α parameter (breaking; 8 sigs modified), powf (8 new), step + gelu_erf (16 new), cast u32/i16 (36 new × 2 directions), reduce_sum_to/reduce_max_to broadcast-reverse reductions (8 new). ~76 new/modified FFI symbols + 17 new smoke tests. Unblocks Fuel's full PTX retirement (AFFINE/UNARY/BINARY/CAST/REDUCE/INDEXING/TERNARY/FILL/SORT modules). |
done |
| 32 | Descriptor #[non_exhaustive] + builder pattern (alpha.47): 18 descriptors retrofitted with ::new() builders + chainable setters (with_stride/with_padding/with_dilation/etc.). Conv {1,2,3}d + ConvTranspose {1,2,3}d + Pool {1,2,3}d + AdaptivePool {1,2,3}d + LpPool {1,2}d + FractionalMaxPool {2,3}d + Interpolate + InterpolateBackward. Breaking change for downstream struct-literal callers — pre-1.0 hardening. Migration: Conv2dDescriptor { ... } → Conv2dDescriptor::new(input_shape, filter_shape, element).with_stride(...). |
done |
| 33 | Multi-M MMVQ via Q8_1 staging (alpha.48): closes Phase 27's deferred opportunity. NEW GgufMmvqMultiMPlan + quantize_q8_1 staging kernel + 4 Q8_0 multi-M launchers (M ∈ {1, 2, 4, 8}). Bench: 7.29-7.96× speedup at M=8 on Llama-2 7B layer shapes (4096²; 11008×4096; 32000×4096). Q8_0 only this phase (clean partial); 9 remaining block formats (Q4_0/Q4_1/Q5_0/Q5_1/Q2_K..Q6_K) are mechanical fanout for a follow-up. 8 new FFI symbols (3 staging + 4 multi-M + 1 workspace). |
done |
| 34 | Multi-M MMVQ block format fanout (alpha.49): 9 remaining GGUF formats shipped — Q4_0, Q4_1, Q5_0, Q5_1, Q2_K, Q3_K, Q4_K, Q5_K, Q6_K. 36 new FFI symbols (9 fmts × 4 M-sizes). Bench at N=K=4096 M=8: Q5_0 17.32×, Q5_1 15.05×, Q4_0 12.78×, Q4_1 12.15×, Q8_0 8.79× — type-0/1 formats massively exceeded Phase 27's 3-7× target. K-quants (Q2_K..Q6_K) hit 3-7× at M=8 (larger 256-elem super-blocks dilute weight-reuse savings). Q8_K MMVQ correctly rejected at select() — bespoke per Phase 11.4. | done |
| 35 | Test-infra hardening (alpha.50): first zero-failure regression in the entire Phase 22-35 sweep (2229/0 across 638 binaries). Five fixes: (a) mmvq_w_offset_alignment_misaligned_rejected_debug #[cfg(debug_assertions)] gate; (b) cuBLAS handle retry with 5× linear backoff (Phase 30 parallel-init race); (c) cuDNN handle retry on CTC path (1001 NOT_INITIALIZED race); (d) Stream::capture panic-safe Drop guard (ThreadLocal capture state leak under cargo's thread reuse → cudaErrorStreamCaptureImplicit on subsequent tests); (e) cudaResourceDesc 48→128 byte expansion + repr(align(8)) (Rust struct under-allocated by 16+ bytes AND missing 8-byte alignment that the union's void*/size_t arms require — caused release-only STATUS_ACCESS_VIOLATION in wave5_smoke). |
done |
| 36 | Fuel 6c.4 unblock — Phase 1/3 (alpha.51): RoPE apply with precomputed cos/sin tables (FW+BW × 4 fp dtypes; 16 symbols) + Fill missing dtypes & strided variant (3 new contig + 11 strided; 28 symbols) + Argsort dtype fanout (u8/i8/u32/i16/bf16/f16/fp8e4m3; 14 symbols). 58 new FFI declarations total. | done |
| 37 | Fuel 6c.4 part 2/4 (alpha.52): Reduce family Gap 1 — reduce_min_to/prod_to broadcast-reverse for 4 fp dtypes (16 symbols) + integer-dtype single-axis sum/min/max/prod + argmin/argmax for U8/I8/U32/I16/I32/I64 (48 symbols, with U64/I64 widened accumulator + store-time narrow on Sum/Prod). 64 new FFI declarations total. Documented bit-exact wrap-on-overflow contract for u8/u32 sum/prod. |
done |
| 38 | Fuel 6c.4 part 3/4 (alpha.53): Ternary where_cond dtype-matrix fanout — Cond lifted to template parameter; U8 (existing, untouched) + U32 + I64 cond × {f32/f64/f16/bf16, u8/i8/u32/i16/i32/i64, fp8e4m3} value × {contig, strided}. 87 new FFI declarations (58 _run + 29 _can_implement). Existing where_<value>_run family preserved bit-identically (default Cond=uint8_t). |
done |
| 39 | Fuel 6c.4 part 4/4 (alpha.53, bundled with Phase 38): Indexing Tier 1 — NEW scatter (pure assign) + index_add for {f32/f64/f16/bf16} × {i32, i64idx} (16 syms) + gather u8idx extras for {f32, f64} (2 syms). 18 new FFI symbols total. Existing per-axis stride arrays meant no separate contig/strided split needed. f16/bf16 index_add uses the Phase 11.3 atomic::add<T> atomicCAS helper. Scatter documented + tested with disjoint-target indices (last-writer-wins on collisions, caller-aware non-determinism). |
done |
| 40 | Fuel 6c.4 final cleanup (alpha.54): multi-block radix argsort via CUB DeviceSegmentedRadixSort for row_len > 1024 (4 dtypes × 3 entries = 12 syms; bitonic stays for ≤1024) + Indexing Tier 2 integer value-dtype matrix (gather/index_select/scatter for u8/i8/u16/i16/u32/i32/i64 × i32/i64idx = 38 syms; index_add for i32/u32/i64 only = 6 syms). 56 new C symbols total. New atomic::add<int64_t> specialization via unsigned long long* reinterpret. Tier 3 (fp8e4m3 + sub-32-bit ints for index_add) deferred — no concrete caller. |
done |
| 41 | Fuel 6c.5 final unblock (alpha.55): RoPE interleaved-pair (Gap 7) + RoPE THD-layout (Gap 8) variants. 28 new FFI symbols (FW+BW × 4 fp dtypes × 2 variants + _can_implement companions). Closes the entire Fuel 6c.4/6c.5 batch ask — Fuel can now drop the last Id::Reduce PTX module + retire fuel-cuda-kernels workspace member + drop the cudaforge build dep. Discovery: existing rope_apply_* was already using (2k, 2k+1) pairing (not (i, i+d/2) as the brief stated) → interleaved symbols are name-aliases on the same kernel; THD is genuinely new. |
done |
| 42 | Flash Attention v2 vendor + FlashSdpaPlan backend (alpha.56): Tri Dao's FA2 v2.8.3 (BSD-3) vendored under crates/baracuda-kernels-sys/vendor/flash-attention/ — Tier 1 (head_dim=128, fp16+bf16, sm_80, FW only) — wired as BackendKind::FlashAttentionV2 on FlashSdpaPlan behind the fa2 cargo feature. Heuristic routes long-context (seq_q×seq_k ≥ 1024×1024) shapes to FA2, bespoke otherwise; PlanPreference::prefer_backend overrides. PyTorch shim headers (at::PhiloxCudaState + C10_CUDA_CHECK) decouple the vendor from torch deps. Tier 2 (BW, varlen, paged, other head_dims) deferred. |
done |
| 43 | mHC.cu vendor + HyperConnectionPlan family (alpha.56): DeepSeek-AI's Manifold-Constrained Hyper-Connections residual-mixing op (arXiv:2512.24880) from AndreSlavescu/mHC.cu (MIT) vendored under crates/baracuda-kernels-sys/vendor/mhc/ — Tier 1 (static-H, bf16 only) — exposed as HyperConnectionPlan behind the mhc cargo feature. Replaces bare y = x + sublayer(x) residual with a learned n×n Sinkhorn-Knopp doubly-stochastic mixing matrix. Tier 2 (BW, dynamic-H, fp16/f32) deferred. Requires cuBLAS-Lt (already linked). |
done |
| 44 | ozIMMU FP64-via-Int8-TC backend (alpha.56): enp1s0/ozIMMU (MIT) — Ootomo/Ozaki/Yokota's Ozaki-scheme DGEMM that synthesizes FP64 from S² int8 tensor-core matmuls — vendored under crates/baracuda-ozimmu-sys/vendor/ozimmu/ with cutf submodule pinned alongside. NEW baracuda-ozimmu-sys + baracuda-ozimmu sibling crates. Wired into GemmPlan f64 path as opt-in BackendKind::Ozaki { slices } (default stays on CUTLASS/cuBLAS DGEMM — Ozaki is NOT bit-equivalent). Two patches: direct-link mode (no LD_PRELOAD), exclude cublas.cu/culip.cu. |
done |
| 46 | FlashInfer cherry-pick — paged-KV decode + sort-free sampling + cascade attention (alpha.57 Checkpoint A, closed in the alpha.58 consolidation pass): surgical extraction of three FlashInfer kernel families (Apache-2.0, v0.6.12, commit eee0d75f) vendored under crates/baracuda-kernels-sys/vendor/flashinfer/ (~12 kLOC across 25 headers, no wholesale wrap). NEW plan families: BatchPagedDecodePlan + PagedKvAppendPlan (vLLM-style paged KV cache decode), TopKTopPSamplingPlan (sort-free TopK/TopP/MinP/combined samplers), CascadeAttentionPlan (LSE-merge for prefix-cache sharing). NEW BackendKind::FlashInfer + RandomKind::Multinomial discriminants. NEW flashinfer cargo feature on both baracuda-kernels-sys and baracuda-kernels (default OFF). 7 MSVC-portability patches to vendored headers (see vendor/flashinfer/VENDOR.md). Checkpoint B (alpha.58 consolidation): flashinfer_paged_decode_launcher.cu now compiles cleanly under MSVC nvcc — root cause was std::max(unsigned long, size_t) type mismatch inside decode.cuh (the earlier hypothesis about cudaLaunchKernel_ptsz was incorrect). Patched via static_cast<size_t>(...) on both arguments; launcher TU also carries a defensive cudaLaunchKernel shim macro. All 4 launchers now build under the flashinfer feature. |
done |
| 44b | ozIMMU clean-fork + cutf elimination + Windows port (alpha.57): full internalize of ozIMMU sources (no longer vendored — we own them at crates/baracuda-ozimmu-sys/cuda/). cutf submodule eliminated entirely (upstream went offline); ~360 LOC of useful FP / cp_async utilities preserved as native baracuda_fp_bits.cuh + baracuda_cp_async.cuh; ~2,200 LOC of cutf duplicates deleted. Portable baracuda::Uint128 replaces __uint128_t for Windows compile (typedef alias on Linux — bit-for-bit preservation). LD_PRELOAD path removed entirely. Linux + Windows both build clean. |
done |
| 44c | ozIMMU RIKEN-RCCS perf-enhancement variants (alpha.57, no version bump): folds in accelerator_for_ozIMMU (Uchino/Ozaki/Imamura 2024, arXiv:2409.13313) — three new variants EF (group-wise error-free summation; chains int8 cublasGemmEx with beta_i=1 to delay int32→f64 materialization), RN (nearest-rounding (a+t)-t split; ~2 extra effective bits per slice), H (= EF + RN), plus n-blocking (chunk n > 12288 into 8192-wide pieces on the int8 GEMM call). Variant selected via BackendKind::Ozaki { slices } discriminant's high-3-bits field; ozaki_slices::{base,ef,rn,h}(s) helper constructors in baracuda-kernels-types::sku. NEW OzakiVariant enum + Handle::dgemm_with_variant on baracuda-ozimmu. Source-compatible with Phase 44b callers (slices: 8 decodes as Base/S=8). Discovered + fixed a pre-existing Phase 44b MSVC bug in axby / axy_complex: upstream's (1l << 44) overflows on Windows (where long is 32-bit, LLP64) → silent inf output. Fixed by switching to static_cast<double>(1ull << 44). 9 new accuracy/variant/n-blocking smoke tests, all green on RTX 4070; the pre-existing Phase 44b accuracy_smoke tests (4 cases) also unbreak. |
done |
| 47 | Fused Linear Cross-Entropy (alpha.56, single-kernel port from LinkedIn's Liger-Kernel BSD-2): NEW FusedLinearCrossEntropyPlan family that fuses lm_head GEMM + CE loss in a chunked outer loop, never materializing the [BT, V] logits tensor. At BT=16K, V=128K, bf16 (Llama-3-class) saves 5-10 GiB of activation memory. Bespoke per-chunk fused softmax+CE+gradient kernel (FP32 accumulator across 4 fp dtypes — f16/bf16/f32/f64); GEMMs dispatched via cublasGemmEx. Backward produces grad_input+grad_weight during the FW pass (chunked loop); BW call just scales by dy_scalar (no-op when dy=1.0, the typical "CE is the last layer" case). 16 new bespoke FFI symbols (per_row + per_row_cast + scalar_finalize + inplace_scale, each × 4 dtypes) + 1 count-non-ignore helper + cublasGemmEx binding. NEW LossKind::FusedLinearCrossEntropy variant. Algorithm credit: LinkedIn Liger-Kernel (BSD-2-Clause, clean-room CUDA reimplementation — no source vendored). |
done |
| 45 | SmoothQuant compose + YaRN/LongRoPE Rust helper (alpha.56, no version bump — consolidation phase will bump): two zero-new-CUDA pure-Rust additions. (a) SmoothQuantLinearPlan<TIn, TWQ> (in crates/baracuda-kernels/src/quantize/smoothquant.rs) composes the existing Phase 8.3 quantized_linear_w8a8 kernel + fill_<dt> broadcast for the per-tensor activation scale. Caller supplies pre-smoothed-and-quantized int8 activations + int8 weights (smoothing itself is offline Python per the SmoothQuant paper — mit-han-lab/smoothquant MIT, Xiao et al. ICML 2023; not in scope). (b) RopeScaledTableBuilder + RopeScaling enum (Linear / YaRN / LongRoPE, in crates/baracuda-kernels/src/attention/rope_scaling.rs) — host-side cos/sin table builder feeding the Phase 36 rope_apply_<dt>_run kernel. YaRN (jquesnelle/yarn MIT, Peng et al. arXiv:2309.00071) implements §3.2 NTK-by-parts frequency interpolation + §3.3 attention-temperature absorption into cos/sin. LongRoPE (microsoft/LongRoPE MIT, Ding et al. arXiv:2402.13753) multiplies inv-freq by caller-supplied per-dim factors (evolutionary search itself is offline + out of scope). Existing Phase 36 RopeApply* types source-compat preserved. |
done |
| 51 | Arbitrary-mask FlashSdpaPlan + spec-decode composition doc (alpha.57, no version bump — consolidation phase will bump): NEW optional mask: TensorRef<f32, 4> field on FlashSdpaArgs routing to a bespoke arbmask SDPA kernel that adds an f32 [B, H, Q, K] additive bias to S = Q·K^T·scale before softmax. Unlocks spec-decode tree masks (EAGLE / Medusa / lookahead), MoE expert masking, prefix-LM, sliding-window with attention sinks — all entirely from caller-side composition. 4 dtypes (f32/f16/bf16/f64) × _run + _can_implement = 8 new FFI symbols. is_causal composes with the mask correctly (-INF + finite == -INF). New header baracuda_attn_arbmask.cuh reuses Phase 6.6's online-softmax tile pipeline; 1 new .cu instantiation file. FA2 vendor untouched (FA2 v2.8.3's Mask template has no arbitrary-mask hook). Runnable example at crates/baracuda-kernels/examples/speculative_decode_compose.rs; design doc at docs/guides/spec-decode.md. FW only; BW deferred. |
done |
| 50 | Mamba-2 SSD chunk-scan + Dao-AILab causal-conv1d (alpha.57, gated behind mamba cargo feature): opens the state-space LLM class (Mamba-2 8B, Codestral-Mamba, Falcon-Mamba, Zamba2 — Mamba-1 selective_scan deferred to Phase 50b). NEW SsdChunkScanPlan + SsdChunkScanBackwardPlan (lives under attention because of the SSD-as-attention duality) and CausalConv1dPlan + CausalConv1dBackwardPlan (top-level module — bespoke kernels, no cuDNN dep). Vendor attribution + LICENSE at crates/baracuda-kernels-sys/vendor/causal-conv1d/ (Tri Dao, BSD-3) and crates/baracuda-kernels-sys/vendor/mamba/ (state-spaces/mamba, Apache-2.0). Hand-port of the upstream Triton SSD reference + causal-conv1d primitive. Dtypes: causal-conv1d f32/f16/bf16/f64 × widths 2/3/4 × {SiLU, identity}; SSD f32/f16/bf16 (no f64 upstream). FW caps state at D,N ≤ 256; BW tighter at 64 (SMEM budget). 30 new FFI symbols (8 causal-conv1d FW + 8 BW + 6 SSD FW + 6 SSD BW + 2 can_implement extras). 5 new smoke tests (causal_conv1d_smoke/bw + ssd_chunk_scan_smoke/bw + mamba2_block_smoke). |
done |
| 50b | Mamba-1 selective_scan (alpha.57, gated behind the same mamba cargo feature as Phase 50): completes the state-space LLM coverage by adding the original Mamba-1 op family that powers Mamba-7B, Falcon-Mamba, and Codestral-Mamba — Phase 50's SSD covers Mamba-2 / Codestral-Mamba / Falcon-Mamba / Zamba2, but every Mamba-1-shipping model still uses v1's selective_scan, not v2's SSD reformulation. NEW SelectiveScanPlan + SelectiveScanBackwardPlan (sibling to SsdChunkScanPlan under attention/). Shape (B, L, D, N) with the full Mamba-1 surface: optional D[d] skip, optional SiLU-gated z[t, d] tail, optional delta_bias[d] + optional softplus(delta) mapping (all 9 args of upstream selective_scan_fn wired). Dtypes f32/f16/bf16 (complex deferred — no shipping LLM uses it). Hand-port of state-spaces/mamba's csrc/selective_scan/ under Apache-2.0; same vendor/mamba/ directory as Phase 50 (VENDOR.md updated, no new LICENSE file). FW caps state at N ≤ 256; BW uses two-pass record-then-reverse with B*D*L*N*sizeof(T) workspace. NEW AttentionKind::SelectiveScan = 8 variant (#[non_exhaustive] so source-compat). 17 new FFI symbols (3 FW + 3 FW-can-impl + 3 BW + 1 workspace-bytes + module-internal launchers). 3 new smoke tests (selective_scan_smoke covering 4 option-combinations + f16/bf16 loose-tol, selective_scan_bw_smoke with FD checks on du/ddelta/dA + topology rejection, mamba1_block_smoke end-to-end). |
done |
| 52 | NCCL foundation crate pair (alpha.57, no version bump — consolidation phase will bump): baracuda-nccl-sys (raw FFI types + libloading lazy-resolve, NO bindgen / NO link-time dep) + baracuda-nccl (safe Communicator with full collective surface — all_reduce / reduce / reduce_scatter / all_gather / broadcast / send / recv + group API + NcclMem + custom pre_mul_sum reduction op + register / deregister for zero-copy). The distributed-roadmap prerequisite for Ring Attention, distributed MoE, Megatron-LM tensor parallelism, FSDP-style shard collectives — Phase 52 only ships the substrate; consumer plans land in Phase 53+. Spec-named API: Communicator::new_single_gpu / new_with_id, cached infallible rank() / world_size(), NcclReduceOp / NcclUniqueId / NcclDataType aliases, NcclUniqueId::generate(). Linux-primary (NCCL ships with the CUDA toolkit there); Windows builds clean and defers the "is NCCL installed?" question to first nccl() call (loader fails with LoaderError::LibraryNotFound). 20 new smoke tests (10 dtype mapping — runs on every host; 10 #[ignore] NCCL-required). No baracuda-kernels integration in this phase. |
done |
| 49 | Apex optimizer subset (alpha.57, gated behind optim cargo feature): deliberate scope expansion — training-framework-adjacent. NEW sibling crate baracuda-optim (~600 LOC Rust + ~750 LOC CUDA) vendoring the NVIDIA Apex (BSD-3-Clause) multi_tensor_apply idiom + fused Adam / LAMB / SGD functors. Single launch over thousands of parameter tensors (Apex MAX_TENSORS_PER_LAUNCH = 110 per batch, multi-launch transparent) — eliminates the ~10,000-launch optimizer step overhead on 32B-param models. Plans: AdamStepPlan<T> (f32/f16/bf16 + AdamW mode), LambStepPlan (f32; two-stage with atomicAdd-fused L2-norm + sqrt + trust-ratio scaling), SgdStepPlan<T> (f32/f16/bf16 + momentum + Nesterov + weight-decay). Inference-only consumers (e.g. Fuel) don't pay the FFI surface cost — the vendored sources only build / link when the feature is enabled. Re-exported under baracuda_kernels::optim when enabled. Measured 41× speedup at 1000-tensor multi-tensor Adam vs 1000 individual launches on RTX 4070 (0.173 ms vs 7.096 ms; smoke test in crates/baracuda-optim/tests/multi_tensor_dispatch_smoke.rs). 4 smoke tests, 6 GPU tests total, all green. |
done |
| 53 | bitsandbytes NF4 dequant + GEMV vendor — QLoRA inference (alpha.57, gated behind bnb_nf4 cargo feature): opens the QLoRA-trained Llama / Mistral / Qwen inference class by vendoring the bitsandbytes (Dettmers et al. arXiv:2305.14314, MIT) NF4 (NormalFloat 4-bit) dequant + GEMV kernels. NF4 is the dominant 4-bit format for QLoRA-trained prebuilts on the HuggingFace Hub — distinct from GGUF Q4_0 (symmetric int4*scale, llama.cpp, Phase 8) and AWQ int4 (asymmetric int4 + zero-points). NF4 uses a 16-entry non-uniform quantile codebook derived from the inverse CDF of N(0, 1) — dequant is a 16-entry lookup, not arithmetic; better accuracy than symmetric int4 for normally-distributed weights. NEW plan trio: Nf4DequantizePlan<T> (bulk unpack [N/2, K] u8 → [N, K] T), Nf4MmvqPlan<T> (M=1 single-vector decode GEMV), Nf4MmvqMultiMPlan<T> (M ∈ {1, 2, 4, 8} batched-decode with weight gmem reuse, Phase 33 pattern applied to NF4). 11 new FFI symbols (3 dequant + 2 M=1 + 6 multi-M). Pack layout matches bitsandbytes upstream Linear4bit: pair-packed nibbles in [N/2, K] u8 (N must be even) + [N * (K/block_size)] f32 per-block absmax (block_size typically 64). Activation/output dtypes f16+bf16 (PyTorch convention); f32 accumulator. Codebook reproduced bit-identical to upstream as device-side switch + host-side NF4_CODEBOOK: [f32; 16] const + nf4_pack_weight host helper. Vendor metadata at crates/baracuda-kernels-sys/vendor/bitsandbytes/{LICENSE,AUTHORS,VENDOR.md}. 3 smoke test files (dequant roundtrip, M=1 GEMV f16+bf16, multi-M f16 vs M=1-looped). Out of scope: 8-bit optimizers (Phase 49 overlap), LLM.int8 (Phase 45 obsoletes), FP4 (different codebook — separate phase if asked), double quantization (Tier 2). |
done |
| 54 | xFormers BlockSparseAttention + 2:4 sparse GEMM (alpha.57, no version bump; clean-room hand-port of facebookresearch/xformers BSD-3-Clause algorithmic reference): NEW SdpaBlockSparsePlan (xformers_blocksparse cargo feature) — block-sparse SDPA FW where the attention mask is a per-block boolean pattern [B, H, num_blocks_q × num_blocks_k] (uint8); only the active (q_block, k_block) pairs participate in the QK^T matmul + online-softmax accumulation. Different from Phase 51's arbitrary additive-mask path (which still computes every cell) — block-sparse actually SKIPS compute on masked blocks → real wall-clock speedup on long-context attention with known sparse patterns. NEW GemmSparse24Plan (xformers_sparse24 cargo feature) — 2:4 structured sparsity GEMM consuming pre-compressed [M, K/2] weights + [M, K/8] u16 metadata. Tier-1 implementation: inflate-then-dense reference matmul (correctness first; sparse-tensor-core mma.sp.sync / cuSPARSELt backend deferred to Tier 2). 16 new FFI symbols (8 block-sparse SDPA × 4 dtypes × 2 entries; 11 sparse24 × 3 dtypes × 4 entries with workspace_bytes helper). NEW AttentionKind::BlockSparseAttention = 9 variant (#[non_exhaustive] so source-compat). 3 smoke tests (block-sparse all-ones-matches-dense + diagonal-band + empty-pattern; sparse24 matches host reference + K-rejection + throughput timing). Vendor attribution at crates/baracuda-kernels-sys/vendor/xformers/ (no upstream sources vendored verbatim — algorithmic reference only). NOT vendored from xFormers: memory-efficient attention (overlaps with FA2 vendor); fused biases / RoPE / norm (overlaps with existing baracuda phases); Triton kernel paths (no Triton toolchain). |
done (Tier 1) |
| 55 | TransformerEngine FP8 cast + delayed-scaling recipe (alpha.57, gated behind tensor_engine cargo feature; clean-room hand-port of NVIDIA TransformerEngine Apache-2.0 algorithm — only the cast + recipe subset). NEW sibling crate pair baracuda-transformer-engine-sys + baracuda-transformer-engine. The differentiated value of TE is the per-tensor delayed-scaling recipe with amax history for stable FP8 training; that's the load-bearing piece this phase ships. Public API: Fp8Recipe (RAII handle holding amax_history ring + scale + scale_inv device scalars), Fp8CastPlan<TIn> (fused FP8 cast + max(|x|) amax reduction in one kernel — atomicMax into amax_history[write_pos]), Fp8DequantPlan<TOut> (symmetric dequant via scale_inv). Both formats: E4M3 (max=448) for fwd/weights, E5M2 (max=57344) for grads. Wide dtypes: f32/f16/bf16. 4 new C-ABI symbols (baracuda_te_fused_cast_amax_run / baracuda_te_dequant_run / baracuda_te_recipe_update_run / baracuda_te_recipe_init_run) + format/dtype id helpers. NO cuDNN dep (cast/recipe paths don't need it — cuDNN is only needed for fused_attn, which we skip). NO pybind11 (raw C ABI, not Python). 10 GPU smoke tests, all green on RTX 4070. Sm_89 reality check: FP8 storage + cast intrinsics work natively on Ada, but tensor-core FP8 MMA throughput equals BF16 — so the wins here are bandwidth-saving (KV cache, weight storage, activation memory) not compute. Recipe machinery is forward-compatible with Hopper (sm_90a) / Blackwell (sm_100) where the MMA throughput win also materializes. Deliberately NOT lifted: normalization (Phase 5), fused_rope (Phase 14/36/41), fused_attn (Phase 17/42; cuDNN dep), fused_softmax (Phase 5), activation (Phase 3/31), gemm (Phase 1+24+30), comm_gemm_overlap (Hopper TMA), fused_router (Phase 8+20), hadamard/newton_schulz/swizzle/permutation (niche), multi_tensor (Phase 49), dropout (composable), Python bindings (out of scope). Algorithmic reference: transformer_engine/common/{cast,recipe}/*.cu upstream + FP8 spec Micikevicius et al. 2022 (arXiv:2209.05433). Vendor attribution + full Apache-2.0 text at crates/baracuda-transformer-engine-sys/ATTRIBUTION.md. |
done |
| 57 | Megatron-LM tensor-parallel primitives (alpha.57, no version bump; gated behind megatron_tp cargo feature). NEW sibling crate baracuda-megatron — pure-composition over baracuda-cublas (local GEMM via cublasSgemm for f32, cublasGemmEx with Compute32F accumulator for f16/bf16) + baracuda-nccl (cross-rank all_gather / all_reduce collectives). NO new CUDA kernels — foundational TP primitives for Megatron-style models are pure orchestration; the kernel substrate already exists in baracuda. NEW plans: ColumnParallelLinearPlan<T> (splits W along output dim; FW Y_local = X @ W_local^T + all_gather; BW dX_partial = dY_local @ W_local + all_reduce(Sum), dW_local = dY_local^T @ X local) and RowParallelLinearPlan<T> (splits W along input dim; FW Y_partial = X_local @ W_local^T + all_reduce(Sum); BW dX_local = dY @ W_local local, dW_local = dY^T @ X_local local — no BW collective, the Megatron pairing only needs one collective per layer-pair). NEW TensorParallelContext borrow type holding &Communicator + in_features / out_features / cached rank / world_size. Dtypes: f32 always; f16 + bf16 behind the crate-level half-crate feature (which the kernel-facade megatron_tp feature pulls in). Tier 1 scope — bias rejected at call site with a Tier-2 marker error (caller can Affine-add post-FW; matters for RowParallel where the bias must be added after the all_reduce so it doesn't get summed N times). 5 smoke tests across 3 files: column_parallel_smoke (FW + BW, single-rank, matches CPU Linear ref), row_parallel_smoke (same), multi_rank_scaffold (#[ignore]-gated 2-GPU scaffold — exits cleanly on single-GPU dev boxes). Algorithmic reference: Shoeybi et al. arXiv:1909.08053 (NVIDIA Megatron-LM, Apache-2.0); no source vendored. Out of scope: async overlap (Hopper TMA); sequence parallelism (Phase 56's domain); pipeline parallelism (future phase); VocabParallelEmbedding (future polish); distributed gradient accumulation (Phase 58's domain); expert parallelism (separate phase). |
done |
| 58 | DistributedAdam — ZeRO-1-style sharded optimizer state (alpha.57, no version bump; gated behind the new distributed_optim cargo feature on baracuda-optim, pulls baracuda-nccl as optional dep). Pure-Rust composition over Phase 49 [AdamStepPlan] + Phase 52 NCCL collectives — NO new CUDA kernels, NO new baracuda-kernels-sys FFI. NEW DistributedAdamStepPlan<T> wrapping the inner Adam plan + a borrowed &Communicator; orchestrates the canonical ZeRO-1 protocol: all_reduce(grads, Sum, in-place) → local Adam step on this rank's 1/world_size shard → all_gather(updated_params, in-place). Single-rank degenerate case (world_size == 1) elides both collectives and reduces to AdamStepPlan::step bit-exactly (smoke test verifies this on single-GPU dev hardware). f32 + f16 + bf16 dtypes, AdamW + classic mode, mixed-precision step_with_f32_state variant. NEW shard_range(n, rank, world_size) helper matching torch.chunk semantics. Phase 58 constraint: tensor element counts must be world_size-multiples (ring all_gather symmetry); per-tensor broadcast fallback for ragged shards is future work. Out of scope: ZeRO-2 (gradient sharding); ZeRO-3 (parameter sharding during FW/BW); DistributedLamb / DistributedSGD (same pattern, defer until concrete demand); CPU-offload optimizer state; 8-bit distributed optimizer state. 3 smoke test files (4 pure-Rust shard_range tests run unconditionally; 2 single-rank GPU smokes #[ignore]-gated for NCCL; 1 multi-rank scaffold #[ignore]-gated for 2+ GPU validation). Algorithmic reference: Rajbhandari et al. SC20 "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models", Microsoft DeepSpeed (Apache-2.0; no source vendored — pure Rust composition). |
done |
| 56 | Ring Attention — sequence-parallel attention (alpha.57, no version bump — consolidation phase will bump; gated behind the new ring_attention cargo feature, pulls baracuda-nccl + baracuda-nccl-sys as optional deps). First Phase 52 NCCL consumer — proves the substrate. Hand-port of Liu/Yan/Abbeel 2023 (arXiv:2310.01889; algorithmic reference at https://github.com/lhao499/RingAttention, Apache-2.0 — no JAX source vendored, clean-room CUDA implementation). NEW RingAttentionPlan<T> + RingAttentionDescriptor + RingAttentionArgs in crates/baracuda-kernels/src/attention/ring_attention.rs; bespoke kernels/attention/ring_attention_kernel.cu (~480 LOC kernel + ~390 LOC plan). Per-rank online-softmax fold of resident K/V chunk into persistent (o_acc, m_acc, l_acc) f32 state; ring rotation via comm.send/recv inside group_start/group_end; finalize kernel emits y = o_acc / l_acc (+ optional lse). Tier 1 scope: f16/bf16 (f32/f64 deferred), head_dim == 128, FW only (BW Tier 2), no GQA broadcast, no arbitrary additive mask. Causal masking applied on global indices (each step kernel takes q_global_base + k_global_base so masking is consistent across rotation steps). 12 new FFI symbols (workspace_bytes + dtype-independent init_run + 5 per-dtype × 2 dtypes: step_run / step_can_implement / finalize_run / finalize_can_implement). Unlocks million-token context length across N GPUs with O(N/P) memory where N = total seq len, P = ring size. Complementary to Phase 57's tensor-parallelism (sequence-dim sharding vs head-dim sharding compose). 4 smoke tests (3 single-rank degenerate cases validating against FlashSdpaPlan ground truth: f16 + bf16 + f16 causal — all pass on RTX 4070; 1 multi-rank scaffold #[ignore]-gated for 2+ GPU validation). Single-rank world_size == 1 reduces to standard FlashAttention math (the validation path on single-GPU hardware). |
done |
| 48 | Marlin + AWQ 4-bit GEMM vendor + GPTQ→Marlin repack utility (alpha.57, no version bump — consolidation phase will bump; gated behind the new marlin + awq cargo features on baracuda-kernels-sys and baracuda-kernels). Two complementary 4-bit GEMM vendors completing the "4-bit hub coverage" started in Phase 53 (NF4). Marlin (IST-DASLab, Apache-2.0 + §3 patent grant, vendored at crates/baracuda-kernels-sys/vendor/marlin/) — state-of-the-art W4A16 GEMM for the decode-batch regime, ~3.87× speedup over FP16 GEMM at M ∈ [1, 32] on Ampere / Ada per the paper. Symmetric int4 (zero-point fused into dequant as q - 8); group size 128 or per-channel; sm_80/86/89 only (sm_90 needs WGMMA rewrite — Marlin v2 territory, deferred). NEW Int4MarlinGemmPlan<f16>. AWQ (mit-han-lab, MIT — no patent grant, vendored at crates/baracuda-kernels-sys/vendor/awq/) — natively supports the most-deployed 4-bit format on the Hugging Face Hub (Llama / Mistral / Qwen *-AWQ). Asymmetric int4 with explicit per-group zero-points; group size 64 or 128; loads directly from HF checkpoints without repack. NEW Int4AwqGemmPlan<f16>. GPTQ→Marlin repack utility — pure-Rust host-side gptq_to_marlin_repack bridging GPTQ asymmetric checkpoints into Marlin's symmetric layout via zero-point absorption (trailblazer implementation; act_order=True deferred, the upstream Marlin intra-fragment permutation table is documented but uses identity permutation in the trailblazer). 4 new FFI symbols total (2 Marlin: _run + _can_implement; 4 AWQ: _run + _workspace_bytes + _can_implement + dequant stub). AWQ vendor source patched to strip the upstream <torch/extension.h> host wrapper (__asm__ __volatile__ → asm volatile for MSVC nvcc portability) and re-export only the device-side __global__ template kernel. Marlin needs --expt-relaxed-constexpr (constexpr ceildiv called from __global__). Both kernels build clean on RTX 4070 with the gated features. 3 smoke test files (marlin_smoke #[ignore] GPU + descriptor validation; awq_smoke #[ignore] GPU + descriptor validation; gptq_to_marlin_smoke pure-Rust roundtrip + zp-fold verification + clamp-at-extremes). |
done |
| 46+ | Phase 46-51 mainstream-techniques roadmap (FlashInfer cherry-pick, Marlin/AWQ); Hopper sm_90a / Blackwell sm_100; 1.0 freeze. | pending (see ROADMAP.md) |
API stability is not promised before beta.0. Breaking changes ship in
each alpha bump and are documented in the workspace CHANGELOG.md.
Quick start
Add the kernel facade and the driver crate:
[]
= { = "0.0.1-alpha.64", = ["sm89", "cudnn"] }
= "0.0.1-alpha.64"
A representative example — single-axis numerically stable softmax over a device-resident tensor:
use ;
use ;
The same select → run shape applies to every op. GEMM, attention,
conv2d, FFT, scatter — the descriptor / args fields differ per family but
the lifecycle is identical. See the crates/baracuda-kernels
README for the int8-GEMM variant of
the same example.
Workspace layout
The user-facing crates a typical caller will reach for:
baracuda-kernels # the unified Plan-based ML op facade
baracuda-kernels-types # shared type vocabulary (Element, TensorRef, KernelSku, ...)
baracuda-kernels-sys # raw FFI to bespoke .cu kernels
baracuda-kernels-bench # criterion harness for sm_89 perf sweeps (not published)
baracuda-cutlass # safe wrapper for CUTLASS GEMM (float, int8 RCR, batched, grouped)
baracuda-driver # safe wrapper for the CUDA Driver API
baracuda-runtime # safe wrapper for the CUDA Runtime API
The per-library wrappers used internally by the facade (you can also use them stand-alone):
baracuda-cublas{,-sys} # cuBLAS + cuBLASLt + cuBLASXt
baracuda-cudnn{,-sys} # cuDNN classic + Graph API
baracuda-cudf{,-sys} # cuDF (RAPIDS dataframe; Linux-only)
baracuda-cufft{,-sys} # cuFFT
baracuda-cusolver{,-sys} # cuSOLVER dense + sparse + Rf + Mg
baracuda-cusparse{,-sys} # cuSPARSE
baracuda-curand{,-sys} # cuRAND
baracuda-cutensor{,-sys} # cuTENSOR
baracuda-cutlass{,-sys} # CUTLASS GEMM kernel templates
baracuda-cutlass-kernels-sys # CUTLASS kernel-only compile target
baracuda-cuvs{,-sys} # RAPIDS cuVS GPU vector search (Phase 71)
baracuda-cvcuda{,-sys} # CV-CUDA image processing
baracuda-flashinfer{,-sys} # FlashInfer paged-KV + cascade + sampling (Phase 46/66)
baracuda-npp{,-sys} # NPP
baracuda-nccl{,-sys} # NCCL
baracuda-nvcomp{,-sys} # nvCOMP
baracuda-nvimagecodec{,-sys} # nvImageCodec (Phase 70 — supersedes nvJPEG)
baracuda-nvjpeg{,-sys} # nvJPEG (kept for back-compat)
baracuda-nvshmem{,-sys} # NVSHMEM symmetric-heap RDMA (Phase 69)
baracuda-ozimmu{,-sys} # ozIMMU Ozaki-scheme DGEMM (Phase 44)
baracuda-transformer-engine{,-sys} # TransformerEngine FP8 (Phase 55)
And the supporting low-level crates (FFI, build infrastructure, profiling):
baracuda-cuda-sys # Driver + Runtime FFI
baracuda-nvrtc{,-sys} # runtime CUDA C++ → PTX
baracuda-nvjitlink{,-sys} # CUDA 12+ JIT linker
baracuda-cupti{,-sys} # profiling APIs
baracuda-nvml{,-sys} # device monitoring
baracuda-cufile{,-sys} # GPUDirect Storage (Linux-only)
baracuda-tensorrt{,-sys} # TensorRT inference runtime (Phase 68 — vtable-dispatch C++ shim)
baracuda-forge # build-time .cu → PTX compiler driver
baracuda-build # build.rs helpers
baracuda-core # loader + Error plumbing
baracuda-types{,-derive} # pure-data types: Half, BFloat16, Complex, DeviceRepr
Vendor-track training / optimization crates (opt-in via feature gates on
baracuda-kernels):
baracuda-megatron # Megatron-LM TP primitives (Phase 57; composition, no kernels)
baracuda-optim # NVIDIA Apex multi-tensor optimizers (Phase 49)
The full umbrella crate (baracuda) re-exports everything behind cargo
features — convenient when you want everything; overkill when you don't.
Hardware support
baracuda targets Ampere and newer by design. Pre-Ampere GPUs lack the
tensor-core instructions and async-copy primitives the bespoke kernels are
written against (mma.sync.m16n8k*, cp.async, ldmatrix), and we have
no desire to ship a slower SIMT fallback for hardware that's eight years
old.
| Compute capability | NVIDIA marketing names | baracuda support |
|---|---|---|
| sm_80 | Ampere (A100, A40, A30, RTX 30xx) | default baseline |
| sm_89 | Ada Lovelace (RTX 40xx, L40, L4) | feature-gated specialized kernels (FP8, larger Flash Attention tiles) |
| sm_90a | Hopper async (H100, H200) | stubs in place; full specialization pending Phase 11 |
| sm_100 | Blackwell | post-Phase-11 |
| ≤ sm_75 (Turing, Volta, Pascal, …) | — | unsupported |
The default sm80 build runs forward-compatibly on Ada and Hopper through
JIT-compiled PTX; turn on sm89 to pick up the FP8 and Flash-Attention
sibling plans tuned for Ada's larger register file.
Cargo features
The kernel facade exposes a broad opt-in feature set. Architecture and library-integration features are off by default — pick what your deployment needs. Important: unless a feature is enabled, its plans are not built, not linked, and not present in the public API. Always check the feature gate when scanning for an op family.
Architecture targets
| Feature | Default | Effect |
|---|---|---|
sm80 |
yes | Ampere-baseline kernel set (RTX 30xx, A100). |
sm89 |
no | Ada Lovelace specializations (FP8 GEMM, FlashSdpaSm89Plan). |
sm90a |
no | Hopper-specialized kernels (stubs today; tracked for Phase 7x). |
NVIDIA library integrations
| Feature | Default | Phase | Effect |
|---|---|---|---|
cudnn |
no | 7 | Link cuDNN. Enables Conv2d / Pool2d / CtcLossCudnnPlan. Separate NVIDIA download. |
Vendored kernel families
| Feature | Default | Phase | Effect |
|---|---|---|---|
fa2 |
no | 42 | Vendored Tri Dao FlashAttention v2 (BSD-3). Long-context routing on FlashSdpaPlan. |
mhc |
no | 43 | Vendored DeepSeek-AI mHC.cu (MIT). HyperConnectionPlan (learned residuals). |
ozimmu |
no | 44 | Vendored ozIMMU (MIT). Ozaki-scheme DGEMM via S² int8 tensor-core matmuls. |
flashinfer |
no | 46/66 | Vendored FlashInfer (Apache-2.0). Paged-KV decode/prefill, cascade, sampling. |
marlin |
no | 48 | Vendored IST-DASLab Marlin (Apache-2.0). Symmetric int4 W4A16 GEMM. |
awq |
no | 48 | Vendored mit-han-lab llm-awq (MIT). Asymmetric int4 W4A16 GEMM. |
optim |
no | 49 | Vendored NVIDIA Apex (BSD-3). Multi-tensor Adam/Lamb/SGD step kernels. |
mamba |
no | 50 | Vendored Mamba-2 SSD + Dao causal-conv1d. State-space LLM ops. |
bnb_nf4 |
no | 53 | Vendored bitsandbytes NF4 (MIT). 4-bit non-uniform quantile QLoRA. |
xformers_blocksparse |
no | 54 | Clean-room port of xFormers block-sparse SDPA (BSD-3). |
xformers_sparse24 |
no | 54 | Clean-room port of xFormers 2:4 structured sparsity GEMM (BSD-3). |
tensor_engine |
no | 55 | Vendored NVIDIA TransformerEngine (Apache-2.0). FP8 cast/dequant + delayed-scaling recipe. |
ring_attention |
no | 56 | Sequence-parallel ring attention (Apache-2.0 reference). Pulls in NCCL. |
megatron_tp |
no | 57 | Megatron-LM tensor-parallel primitives (composition, no new kernels). |
nvshmem |
no | 69 | NVSHMEM host-side wrapper (one-sided RDMA, sibling to NCCL). Linux-only. |
Sibling crates also have their own feature gates:
baracuda-cuvs—cuvsfeature inbaracuda-kernels-sysecosystem pulls in RAPIDS cuVS vector-search (Phase 71, Linux-only). Not yet re-exported bybaracuda-kernelsitself.baracuda-tensorrt—shimfeature builds the vtable-dispatch C++ shim required to call TensorRT (Phase 68 — no flat C ABI upstream).baracuda-flashinfer— sameflashinferfeature as above.
Notes:
cudnnis off by default because cuDNN is a separate NVIDIA download not bundled with the stock CUDA toolkit installer. Enabling it without cuDNN installed produces a linker error oncudnn.lib/libcudnn.so— see the Building section for auto-discovery paths.- Most vendored features add 30s–5 min to first build (template-heavy CUDA). Subsequent builds incremental.
- Features that pull in optional sibling crates (
optim,tensor_engine,megatron_tp,nvshmem,ring_attention) only compile the sibling when the feature is enabled — inference-only consumers don't pay the surface cost.
Building
Requirements:
- CUDA Toolkit ≥ 12.0 with
nvcconPATH. baracuda is tested on 12.x and 13.x. - cuDNN 9.x (only if you enable the
cudnnfeature) — separate NVIDIA download, not bundled with the toolkit. - A working Rust toolchain ≥ 1.85 (workspace MSRV pinned in
rust-toolchain.toml). - Windows users:
lld-link.exesomewhere onPATH. The CUDAnvccinvocation links through it; the install location is typicallyC:\Program Files\LLVM\bin. Install the LLVM Windows package and add that directory toPATHifcargo buildcomplains aboutlld-link.exenot being found.
A typical full build with all GPU-side features (CUDA toolkit + cuDNN present):
Or, to verify the public API surface compiles without the full kernel build (fast — type-check only):
The baracuda-kernels-sys build script auto-discovers cuDNN at the
following paths in order: CUDNN_PATH / CUDNN_ROOT / CUDNN_HOME env
vars, then C:\Program Files\NVIDIA\CUDNN\v<X.Y>\ on Windows, then the
CUDA toolkit's own lib/ directory (pre-cuDNN-9 layout), then the
standard Linux distro paths under /usr/lib/.
Troubleshooting
Windows: Git-for-Windows fake link.exe shadowing the MSVC linker
Git-for-Windows ships a GNU coreutils binary named link.exe at
C:\Program Files\Git\usr\bin\link.exe — its job is to create a hard
link, not to link object files. If that directory appears on PATH
ahead of the MSVC linker (or LLVM's lld-link.exe), cargo build
invokes the coreutils binary instead of the real linker and fails with a
cryptic error (it doesn't understand /OUT: and friends).
baracuda's baracuda-kernels-sys and baracuda-cutlass-sys build
scripts probe PATH on Windows and emit a cargo:warning if they
detect this shadowing. Fix: re-order PATH so the MSVC linker
(typically reached via the Visual Studio "x64 Native Tools Command
Prompt") or LLVM's lld-link.exe (C:\Program Files\LLVM\bin\) appears
before C:\Program Files\Git\usr\bin\. Building from the VS x64 Native
Tools prompt is the most reliable option; alternatively, install LLVM
and put its bin directory ahead of Git's on the user/system PATH.
Testing
baracuda's GPU integration tests are gated behind #[ignore] so a
host-only cargo test doesn't try to launch a kernel on a machine
without an NVIDIA driver. To run them you need a working GPU plus the
--ignored flag:
# Host-only tests (compile + reference logic; no GPU access):
# Full GPU integration sweep — RTX 30xx / 40xx / 50xx required:
# Verify the workspace-level API surface compiles (no GPU needed):
The full regression on an RTX 4070 covers 324 binary targets at ~1630 tests passing. Individual op-family suites take 30–90 seconds; the full sweep is 25–40 minutes.
Benchmarks
The baracuda-kernels-bench crate is a criterion-based harness with
CUDA-event-timed throughput sweeps across GEMM, Flash Attention, and
Conv2d at LLM-typical and ResNet-typical shapes. It is not published
to crates.io (it depends on a working GPU).
The full sweep takes ~30 minutes on an RTX 4070. Scope to a single family
with --bench gemm / --bench flash_attention / --bench conv2d. See
crates/baracuda-kernels-bench/BENCH-sm89.md
for the baseline table format and methodology.
Project documentation
ARCHITECTURE.md— layered design, Plan-Descriptor-Args pattern,KernelSkutaxonomy, dispatcher design, workspace contract, sibling-plan pattern, vendoring convention, phase roadmap.OP-MATRIX.md— full op × dtype × backend coverage matrix (planned).LESSONS.md— postmortems, ABI footguns, performance traps (planned).- Per-crate
README.mdfiles undercrates/<name>/.
License
Dual-licensed under MIT or Apache-2.0. Pick whichever fits your project. Contributions accepted under the same terms.
NVIDIA's CUDA libraries (libcuda, libcudart, libcublas, libcudnn,
…) are not redistributed by this project. You obtain them from NVIDIA
separately — either through the CUDA Toolkit installer or through each
library's dedicated download page. baracuda's loader opens whatever the
host driver / toolkit has installed.
Vendor attribution
A small number of bespoke kernels in baracuda-kernels-sys are vendored
from upstream open-source projects (huggingface/candle's CUDA kernel set
via fuel-cuda-kernels; llama.cpp's ggml-cuda GGUF block-format
quantization + MMVQ; guoqingbao/attention.rs's fused MoE expert
kernels). Each adapted source carries an SPDX-FileCopyrightText: +
SPDX-License-Identifier: header; the consolidated provenance is in
crates/baracuda-kernels-sys/LICENSE-thirdparty.md.
FlashAttention v2
(Tri Dao, BSD-3-Clause, pinned at v2.8.3 /
060c9188beec3a8b62b33a3bfa6d5d2d44975fab) is vendored at
crates/baracuda-kernels-sys/vendor/flash-attention/
with verbatim LICENSE + AUTHORS files and full vendor / scope notes
in VENDOR.md.
Gated behind the fa2 cargo feature on baracuda-kernels-sys and
baracuda-kernels; exposed through a backend-choice path on
FlashSdpaPlan (Phase 42).
mHC.cu (Andre Slavescu,
MIT, pinned at a426939c2dbc11c443db041bcff12b65d1b6482a) — unofficial
CUDA implementation of DeepSeek-AI's
Manifold-Constrained Hyper-Connections
paper — is vendored at
crates/baracuda-kernels-sys/vendor/mhc/
with the verbatim upstream LICENSE, an AUTHORS file, and full
vendor / scope notes in
VENDOR.md. Gated
behind the mhc cargo feature on baracuda-kernels-sys and
baracuda-kernels; exposed through the new HyperConnectionPlan
(Phase 43, Tier 1: static-H FW, bf16 weights / f32 activations).
FlashInfer (NVIDIA
- FlashInfer community, Apache-2.0 with full patent grant, pinned at
v0.6.12/eee0d75f91f64c520bfaed07e39a850ea4ddde23) — a curated ~12 kLOC subset of the FlashInfer header tree is vendored atcrates/baracuda-kernels-sys/vendor/flashinfer/with verbatim upstreamLICENSE+NOTICEand full vendor / scope / patch notes inVENDOR.md. Gated behind theflashinfercargo feature onbaracuda-kernels-sysandbaracuda-kernels; exposes three NEW plan families — paged-KV decode + append (BatchPagedDecodePlan+PagedKvAppendPlanfor vLLM-style serving), sort-free sampling (TopKTopPSamplingPlan— combined TopK/TopP/MinP via a single-kernel rejection sampler), and cascade attention LSE merge (CascadeAttentionPlanfor prefix-cache sharing). Surgical cherry-pick (not a wholesale wrap) — Hopper / Blackwell / NVSHMEM / Mamba / MLA / POD paths intentionally skipped to keep the build cost contained. Phase 46.
The baracuda-forge build-time kernel-compiler
crate is a vendored fork of cudaforge
by Guoqing Bao — see crates/baracuda-forge/NOTICE
for the upstream commit hash.
The baracuda-cutlass safe wrapper for NVIDIA
CUTLASS — plan-based GEMM and grouped-GEMM with caller-supplied
workspace, MoE-friendly variable-M-per-group dispatch — was specified
by the Fuel ML library team. See
crates/baracuda-cutlass/NOTICE for
the design lineage.