baracuda

About the name. Yes, we know — it's spelled barracuda (two Rs). That name was taken on crates.io, so we dropped one R and kept swimming.
A unified Rust ML-op facade over the NVIDIA CUDA ecosystem.
What baracuda is
baracuda is a Rust workspace that exposes every primitive an ML framework
expects — the union of PyTorch (torch.* + nn.functional) and JAX
(jax.lax.* + jax.numpy.*) — through a single Plan-based crate surface
called baracuda-kernels. Internally each plan dispatches to:
- The appropriate NVIDIA-library wrapper crate (cuBLAS, cuDNN, cuFFT, cuSOLVER, cuRAND, cuSPARSE, cuTENSOR, NPP, CV-CUDA, CUTLASS) when one already covers the op well, or
- A bespoke hand-rolled
.cukernel shipped inbaracuda-kernels-syswhen no NVIDIA library covers the op (or covers it poorly at the shapes that matter for modern transformer / vision / GNN workloads).
Callers import one crate (baracuda-kernels) and reach for one API
style. The dispatch decision — which is observable through
Plan::sku() for telemetry — is otherwise invisible. Switching from a
CUTLASS-backed SKU to a bespoke-backed SKU is a layout flag, not an import
change.
baracuda is for downstream Rust ML / inference / training frameworks that
need access to the full CUDA stack without re-vendoring it themselves. The
workspace also ships idiomatic stand-alone wrappers for every CUDA library
under crates/baracuda-<lib> if you want to skip the kernel facade and
talk to one library directly.
Status
In active development — alpha.45. Roughly 2175 GPU tests passing on an RTX 4070 (sm_89), across 616 binary targets.
Phase coverage (see ARCHITECTURE.md for the phase
matrix):
| Phase | Scope | Status |
|---|---|---|
| 0 | Crate scaffolding, shared type vocabulary | done |
| 1 | int8 GEMM RRR (Fuel-blocking, 18 SKUs) | done |
| 2 | FP8 / int4 / bin GEMM completion | done |
| 3 | Elementwise + shape / layout (Categories B, B', C, C', D, N) | done |
| 4 | Reductions + scans + random (Categories E, F, Q) | done |
| 5 | Normalization + softmax + loss (Categories G, H, R) | done |
| 6 | Attention + linalg + FFT (Categories K, Linalg, U) | done |
| 7 | Convolution + pooling + indexing + embedding + segment (Categories I, J, L, M, S) | done |
| 8 | Quantization helpers + GGUF + MoE (Category P, V) | done |
| 9 | Sort / topk / image / NMS (Categories O, T) | done |
| 10 | sm_89 (Ada Lovelace) tuning sweep | done |
| 11 | Fuel feedback integration (alpha.27) — ScalarType ergonomics, Conv/Pool fanout, GGUF Q8_K MMVQ, i64 indices, Sparsemax cap lift, atomicAdd-via-CAS, build-env probe | done |
| 12 | PowI + ArgMax/Min u32/i32 outputs (alpha.28) — IndexOutputElement sealed trait |
done |
| 13 | WriteSlice + Contiguize + sub-byte casts + Triu/Tril (alpha.29) — KV-cache fast path, retires Fuel's D2H/CPU/H2D fallback, plus DeviceBuffer::zero() (alpha.30) |
done |
| 14 | Strided FFI siblings (alpha.31) — Affine, PowI, Triu/Tril, RoPE+SDPA, GGUF MMVQ activation-strided + W byte offset; 56 new FFI symbols | done |
| 15 | Quick wins + correctness cleanup (alpha.32) — MMVQ alignment guard, OneHot/Nonzero i64 wrappers, MoE fixture race fix | done |
| 16 | Pool completion (alpha.33) — bit-exact adaptive pool {1,2,3}d, bespoke LpPool {1,2}d, bespoke FractionalMaxPool {2,3}d; 48 new FFI symbols | done |
| 17 | SDPA / attention completion (alpha.34) — Flash SDPA sm_89 strided FW + SDPA BW GQA-broadcast atomicAdd | done |
| 18 | Sub-byte / quantized completeness (alpha.35) — f16/bf16 activations for GgufMmvqPlan across all 11 block formats × contig + strided; 44 new FFI symbols |
done |
| 19 | Fuel retirement asks (alpha.36) — pool/conv FFI facade for cuDNN-backed plans + Upsample Nearest2d + NEW im2col/im2col1d/col2im1d bespoke; vendored Fuel Q8_1 for inspection; 140 new FFI symbols. Surfaced 1.0-freeze prereq for broader library-backed FFI facade audit | done |
| 20 | MoE — Item 4 from Fuel retirement (alpha.37): batched MMVQ × N-experts (36 new FFI symbols across 11 GGUF block formats × 3 activation dtypes + 3 pure-FP); MoE absorb-and-expose proved to be a no-op (Fuel hadn't evolved their kernels since Phase 8.5 vendor; 5 baracuda-side symbols already match) + 2 direct-FFI smoke tests | done |
| 21 | Bilinear interpolate expansion (alpha.38): align_corners + scale-factor overrides + f16/bf16 fanout (FW+BW). Breaking change to existing f32/f64 signatures. |
done |
| 22 | MMVQ ncols≥64 debug assertion + cuSOLVER FFI facade (alpha.39): 10 cuSOLVER-backed plan families (Cholesky, LU, QR+ormqr, SVD/SvdBatched/SvdaBatched, eigh real+complex, eig, lstsq, solve, inverse) wrapped behind ~50 flat C symbols in baracuda-kernels-sys/src/cusolver_facade.rs; closes the Phase 19 library-backed FFI facade gap for cuSOLVER. No feature gate (cuSOLVER ships with the CUDA toolkit). |
done |
| 23 | cuFFT + cuRAND FFI facade (alpha.40): 6 cuFFT plan families (FFT 1d/Nd C2C, R2C, C2R) × c32/c64 + f32/f64 + 2 cuRAND families (Uniform, Normal) × f32/f64 = 32 flat C symbols in baracuda-kernels-sys/src/{cufft,curand}_facade.rs. cuSPARSE skipped — no baracuda-kernels plans wrap it today. |
done |
| 24 | Cutlass GEMM re-export FFI facade (alpha.41): 210 trampolines (70 SKU families × {run, workspace_size, can_implement}) in baracuda-kernels-sys/src/cutlass_reexport.rs exposing the full Cutlass GEMM surface (fp16/bf16/tf32/f32_simt/f64/s8/u8 × {rcr, rrr} × {plain, bias, bias+relu/gelu/silu} + strided-batched fp16/bf16). cuTENSOR / NPP / CV-CUDA skipped — no baracuda-kernels plans wrap them. Completes the Phase 19 library-backed FFI facade 1.0-freeze prereq. |
done |
| 25-26 | Segment/EmbeddingBag BW completion + BatchedOrmqrWy complex (alpha.42): 9 new Rust plans + 24 new FFI symbols for Segment Max/Min/Prod BW (sorted + unsorted, f32/f64), Unsorted Segment Prod FW (atomicCAS-retry mul), EmbeddingBag Max FW+BW (f32/f64/f16/bf16 × i32/i64). Plus BatchedOrmqrWy complex (Complex32, Complex64) via the bespoke WY-block kernels + cuBLAS C/Z gemmStridedBatched (4 new bespoke FFI + 2 cuBLAS symbols). |
done |
| 27 | Q8_1 perf inspection (alpha.42 doc-only): Multi-M MMVQ opportunity identified, kept doc-only — bigger ROI than reformatting Q8_1. | done |
| 28 | API hygiene for 1.0 prep (alpha.43): new KernelDtype umbrella marker trait extending Element/IntElement/FpElement/BinElement; #[non_exhaustive] audit across the op-family *Kind enums + auxiliary tag enums + Error types. ElementKind / LayoutSku / ArchSku / EpilogueKind / ActivationKind / Workspace intentionally left exhaustive (hot-path-dispatched). |
done |
| 29 | Cross-implementation benchmark suite (alpha.44): 10 new criterion+CUDA-event benches comparing baracuda against cuBLAS / cuDNN at LLM-typical shapes (GEMM f32/f16/bf16, MMVQ all qtypes, Softmax, LayerNorm, RMSNorm, Conv2d, MaxPool2d, Reductions, Elementwise, Flash SDPA+GQA). ~2,750 LOC of bench code + 13 bench binaries total. Critical finding: baracuda f16/bf16 GEMM is 2-4× slower than cuBLAS at M=1/M=32 (decode regime); validates the deferred Phase 27 multi-M MMVQ port. See BENCHMARKS.md for the methodology + sample run. |
done |
| 30 | f16/bf16 GEMM cuBLAS fast-path (alpha.45): adds PlanPreference::prefer_backend: Option<BackendKind> + thread-local cuBLAS-handle cache to GemmPlan. Heuristic: cuBLAS for f16/bf16 at 2 ≤ M < 128 (decode batch); CUTLASS otherwise. 3× speedup at M=32 f16 (55.6µs → 19.0µs, parity with cuBLAS direct). M=1 stays on CUTLASS (cuBLAS RCR→col-major transa=T mapping slower than CUTLASS sm_80 GEMV-tile at K=N≥2048). Capture-mode auto-fallback to CUTLASS (cuBLAS-classic not capture-safe). 9 new smoke tests. |
done |
| 31+ | Fuel Phase 6c.2 asks (ELU α + powf + step + gelu_erf + cast u32/i16 + reduce_sum_to/reduce_max_to ~76 FFI symbols), Phase 27 multi-M MMVQ port, descriptor #[non_exhaustive] + builder (Phase 28b), Hopper / Blackwell, 1.0 freeze |
pending (see ROADMAP.md) |
API stability is not promised before beta.0. Breaking changes ship in
each alpha bump and are documented in the workspace CHANGELOG.md.
Quick start
Add the kernel facade and the driver crate:
[]
= { = "0.0.1-alpha.45", = ["sm89", "cudnn"] }
= "0.0.1-alpha.45"
A representative example — single-axis numerically stable softmax over a device-resident tensor:
use ;
use ;
The same select → run shape applies to every op. GEMM, attention,
conv2d, FFT, scatter — the descriptor / args fields differ per family but
the lifecycle is identical. See the crates/baracuda-kernels
README for the int8-GEMM variant of
the same example.
Workspace layout
The user-facing crates a typical caller will reach for:
baracuda-kernels # the unified Plan-based ML op facade
baracuda-kernels-types # shared type vocabulary (Element, TensorRef, KernelSku, ...)
baracuda-kernels-sys # raw FFI to bespoke .cu kernels
baracuda-kernels-bench # criterion harness for sm_89 perf sweeps (not published)
baracuda-cutlass # safe wrapper for CUTLASS GEMM (float, int8 RCR, batched, grouped)
baracuda-driver # safe wrapper for the CUDA Driver API
baracuda-runtime # safe wrapper for the CUDA Runtime API
The per-library wrappers used internally by the facade (you can also use them stand-alone):
baracuda-cublas{,-sys} # cuBLAS + cuBLASLt + cuBLASXt
baracuda-cudnn{,-sys} # cuDNN classic + Graph API
baracuda-cufft{,-sys} # cuFFT
baracuda-cusolver{,-sys} # cuSOLVER dense + sparse + Rf + Mg
baracuda-cusparse{,-sys} # cuSPARSE
baracuda-curand{,-sys} # cuRAND
baracuda-cutensor{,-sys} # cuTENSOR
baracuda-npp{,-sys} # NPP
baracuda-nccl{,-sys} # NCCL
baracuda-cvcuda{,-sys} # CV-CUDA
baracuda-nvjpeg{,-sys} # nvJPEG
baracuda-nvcomp{,-sys} # nvCOMP
And the supporting low-level crates (FFI, build infrastructure, profiling):
baracuda-cuda-sys # Driver + Runtime FFI
baracuda-nvrtc{,-sys} # runtime CUDA C++ → PTX
baracuda-nvjitlink{,-sys} # CUDA 12+ JIT linker
baracuda-cupti{,-sys} # profiling APIs
baracuda-nvml{,-sys} # device monitoring
baracuda-cufile{,-sys} # GPUDirect Storage (Linux-only)
baracuda-tensorrt{,-sys} # TensorRT inference runtime
baracuda-forge # build-time .cu → PTX compiler driver
baracuda-build # build.rs helpers
baracuda-core # loader + Error plumbing
baracuda-types{,-derive} # pure-data types: Half, BFloat16, Complex, DeviceRepr
The full umbrella crate (baracuda) re-exports everything behind cargo
features — convenient when you want everything; overkill when you don't.
Hardware support
baracuda targets Ampere and newer by design. Pre-Ampere GPUs lack the
tensor-core instructions and async-copy primitives the bespoke kernels are
written against (mma.sync.m16n8k*, cp.async, ldmatrix), and we have
no desire to ship a slower SIMT fallback for hardware that's eight years
old.
| Compute capability | NVIDIA marketing names | baracuda support |
|---|---|---|
| sm_80 | Ampere (A100, A40, A30, RTX 30xx) | default baseline |
| sm_89 | Ada Lovelace (RTX 40xx, L40, L4) | feature-gated specialized kernels (FP8, larger Flash Attention tiles) |
| sm_90a | Hopper async (H100, H200) | stubs in place; full specialization pending Phase 11 |
| sm_100 | Blackwell | post-Phase-11 |
| ≤ sm_75 (Turing, Volta, Pascal, …) | — | unsupported |
The default sm80 build runs forward-compatibly on Ada and Hopper through
JIT-compiled PTX; turn on sm89 to pick up the FP8 and Flash-Attention
sibling plans tuned for Ada's larger register file.
Cargo features
The kernel facade exposes a small feature set:
| Feature | Default | Effect |
|---|---|---|
sm80 |
yes | Build the Ampere-baseline kernel set. |
sm89 |
no | Build the Ada Lovelace specializations (FP8 GEMM, FlashSdpaSm89Plan). |
sm90a |
no | Build the Hopper-specialized kernels (stubs today). |
cudnn |
no | Link cuDNN and enable conv / pool / CtcLossCudnnPlan. |
cudnn is off by default because cuDNN is a separate NVIDIA download not
bundled with the stock CUDA toolkit installer. Enabling it without cuDNN
installed produces a linker error on cudnn.lib / libcudnn.so — see
the building section for the auto-discovery paths the build script probes.
Building
Requirements:
- CUDA Toolkit ≥ 12.0 with
nvcconPATH. baracuda is tested on 12.x and 13.x. - cuDNN 9.x (only if you enable the
cudnnfeature) — separate NVIDIA download, not bundled with the toolkit. - A working Rust toolchain ≥ 1.85 (workspace MSRV pinned in
rust-toolchain.toml). - Windows users:
lld-link.exesomewhere onPATH. The CUDAnvccinvocation links through it; the install location is typicallyC:\Program Files\LLVM\bin. Install the LLVM Windows package and add that directory toPATHifcargo buildcomplains aboutlld-link.exenot being found.
A typical full build with all GPU-side features (CUDA toolkit + cuDNN present):
Or, to verify the public API surface compiles without the full kernel build (fast — type-check only):
The baracuda-kernels-sys build script auto-discovers cuDNN at the
following paths in order: CUDNN_PATH / CUDNN_ROOT / CUDNN_HOME env
vars, then C:\Program Files\NVIDIA\CUDNN\v<X.Y>\ on Windows, then the
CUDA toolkit's own lib/ directory (pre-cuDNN-9 layout), then the
standard Linux distro paths under /usr/lib/.
Troubleshooting
Windows: Git-for-Windows fake link.exe shadowing the MSVC linker
Git-for-Windows ships a GNU coreutils binary named link.exe at
C:\Program Files\Git\usr\bin\link.exe — its job is to create a hard
link, not to link object files. If that directory appears on PATH
ahead of the MSVC linker (or LLVM's lld-link.exe), cargo build
invokes the coreutils binary instead of the real linker and fails with a
cryptic error (it doesn't understand /OUT: and friends).
baracuda's baracuda-kernels-sys and baracuda-cutlass-sys build
scripts probe PATH on Windows and emit a cargo:warning if they
detect this shadowing. Fix: re-order PATH so the MSVC linker
(typically reached via the Visual Studio "x64 Native Tools Command
Prompt") or LLVM's lld-link.exe (C:\Program Files\LLVM\bin\) appears
before C:\Program Files\Git\usr\bin\. Building from the VS x64 Native
Tools prompt is the most reliable option; alternatively, install LLVM
and put its bin directory ahead of Git's on the user/system PATH.
Testing
baracuda's GPU integration tests are gated behind #[ignore] so a
host-only cargo test doesn't try to launch a kernel on a machine
without an NVIDIA driver. To run them you need a working GPU plus the
--ignored flag:
# Host-only tests (compile + reference logic; no GPU access):
# Full GPU integration sweep — RTX 30xx / 40xx / 50xx required:
# Verify the workspace-level API surface compiles (no GPU needed):
The full regression on an RTX 4070 covers 324 binary targets at ~1630 tests passing. Individual op-family suites take 30–90 seconds; the full sweep is 25–40 minutes.
Benchmarks
The baracuda-kernels-bench crate is a criterion-based harness with
CUDA-event-timed throughput sweeps across GEMM, Flash Attention, and
Conv2d at LLM-typical and ResNet-typical shapes. It is not published
to crates.io (it depends on a working GPU).
The full sweep takes ~30 minutes on an RTX 4070. Scope to a single family
with --bench gemm / --bench flash_attention / --bench conv2d. See
crates/baracuda-kernels-bench/BENCH-sm89.md
for the baseline table format and methodology.
Project documentation
ARCHITECTURE.md— layered design, Plan-Descriptor-Args pattern,KernelSkutaxonomy, dispatcher design, workspace contract, sibling-plan pattern, vendoring convention, phase roadmap.OP-MATRIX.md— full op × dtype × backend coverage matrix (planned).LESSONS.md— postmortems, ABI footguns, performance traps (planned).- Per-crate
README.mdfiles undercrates/<name>/.
License
Dual-licensed under MIT or Apache-2.0. Pick whichever fits your project. Contributions accepted under the same terms.
NVIDIA's CUDA libraries (libcuda, libcudart, libcublas, libcudnn,
…) are not redistributed by this project. You obtain them from NVIDIA
separately — either through the CUDA Toolkit installer or through each
library's dedicated download page. baracuda's loader opens whatever the
host driver / toolkit has installed.
Vendor attribution
A small number of bespoke kernels in baracuda-kernels-sys are vendored
from upstream open-source projects (huggingface/candle's CUDA kernel set
via fuel-cuda-kernels; llama.cpp's ggml-cuda GGUF block-format
quantization + MMVQ; guoqingbao/attention.rs's fused MoE expert
kernels). Each adapted source carries an SPDX-FileCopyrightText: +
SPDX-License-Identifier: header; the consolidated provenance is in
crates/baracuda-kernels-sys/LICENSE-thirdparty.md.
The baracuda-forge build-time kernel-compiler
crate is a vendored fork of cudaforge
by Guoqing Bao — see crates/baracuda-forge/NOTICE
for the upstream commit hash.
The baracuda-cutlass safe wrapper for NVIDIA
CUTLASS — plan-based GEMM and grouped-GEMM with caller-supplied
workspace, MoE-friendly variable-M-per-group dispatch — was specified
by the Fuel ML library team. See
crates/baracuda-cutlass/NOTICE for
the design lineage.