baracuda

About the name. Yes, we know — it's spelled barracuda (two Rs). That name was taken on crates.io, so we dropped one R and kept swimming.
A unified Rust ML-op facade over the NVIDIA CUDA ecosystem.
What baracuda is
baracuda is a Rust workspace that exposes every primitive an ML framework
expects — the union of PyTorch (torch.* + nn.functional) and JAX
(jax.lax.* + jax.numpy.*) — through a single Plan-based crate surface
called baracuda-kernels. Internally each plan dispatches to:
- The appropriate NVIDIA-library wrapper crate (cuBLAS, cuDNN, cuFFT, cuSOLVER, cuRAND, cuSPARSE, cuTENSOR, NPP, CV-CUDA, CUTLASS) when one already covers the op well, or
- A bespoke hand-rolled .cu kernel shipped in baracuda-kernels-sys when no NVIDIA library covers the op (or covers it poorly at the shapes that matter for modern transformer / vision / GNN workloads).
Callers import one crate (baracuda-kernels) and reach for one API
style. The dispatch decision — which is observable through
Plan::sku() for telemetry — is otherwise invisible. Switching from a
CUTLASS-backed SKU to a bespoke-backed SKU is a layout flag, not an import
change.
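As a rough host-only illustration of that idea, the sketch below records a dispatch decision at plan-selection time and exposes it through a sku() accessor. Layout, Backend, Sku, GemmPlan and the layout-to-backend rule are all hypothetical stand-ins invented for this example; only the notion of an observable SKU comes from the text above.

```rust
/// Hypothetical operand layouts for a GEMM (R = row-major, C = column-major).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Layout {
    Rcr,
    Rrr,
}

/// Hypothetical backend families a plan could dispatch to.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Backend {
    Cutlass,
    Bespoke,
}

/// Hypothetical SKU record: which backend serves which layout.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Sku {
    pub backend: Backend,
    pub layout: Layout,
}

/// A plan remembers the SKU chosen at selection time.
pub struct GemmPlan {
    sku: Sku,
}

impl GemmPlan {
    /// Pick a backend from the layout flag alone.
    /// The RCR -> CUTLASS, RRR -> bespoke rule is an assumption for illustration.
    pub fn select(layout: Layout) -> Self {
        let backend = match layout {
            Layout::Rcr => Backend::Cutlass,
            Layout::Rrr => Backend::Bespoke,
        };
        GemmPlan { sku: Sku { backend, layout } }
    }

    /// Expose the dispatch decision, e.g. for telemetry.
    pub fn sku(&self) -> Sku {
        self.sku
    }
}
```

The caller's code is the same for both layouts; only the value passed to select differs, and the backend can be read back from the plan afterwards.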
baracuda is for downstream Rust ML / inference / training frameworks that
need access to the full CUDA stack without re-vendoring it themselves. The
workspace also ships idiomatic stand-alone wrappers for every CUDA library
under crates/baracuda-<lib> if you want to skip the kernel facade and
talk to one library directly.
Status
In active development — alpha.25. Roughly 1630 GPU tests passing on an RTX 4070 (sm_89), across 324 binary targets.
Phase coverage (see ARCHITECTURE.md for the phase
matrix):
| Phase | Scope | Status |
|---|---|---|
| 0 | Crate scaffolding, shared type vocabulary | done |
| 1 | int8 GEMM RRR (Fuel-blocking, 18 SKUs) | done |
| 2 | FP8 / int4 / bin GEMM completion | done |
| 3 | Elementwise + shape / layout (Categories B, B', C, C', D, N) | done |
| 4 | Reductions + scans + random (Categories E, F, Q) | done |
| 5 | Normalization + softmax + loss (Categories G, H, R) | done |
| 6 | Attention + linalg + FFT (Categories K, Linalg, U) | done |
| 7 | Convolution + pooling + indexing + embedding + segment (Categories I, J, L, M, S) | done |
| 8 | Quantization helpers + GGUF + MoE (Categories P, V) | done |
| 9 | Sort / topk / image / NMS (Categories O, T) | done |
| 10 | sm_89 (Ada Lovelace) tuning sweep | done |
| 11 | sm_90a (Hopper) / Blackwell forward-compat | pending |
| 12 | API freeze + 1.0 stability + benchmark suite | pending |
API stability is not promised before beta.0. Breaking changes ship in
each alpha bump and are documented in the workspace CHANGELOG.md.
Quick start
Add the kernel facade and the driver crate:
[dependencies]
baracuda-kernels = { version = "0.0.1-alpha.25", features = ["sm89", "cudnn"] }
baracuda-driver = "0.0.1-alpha.25"
A representative example — single-axis numerically stable softmax over a device-resident tensor:
use ;
use ;
The same select → run shape applies to every op. GEMM, attention,
conv2d, FFT, scatter — the descriptor / args fields differ per family but
the lifecycle is identical. See the crates/baracuda-kernels
README for the int8-GEMM variant of
the same example.
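To make the select → run lifecycle concrete without a GPU, here is a host-only mock of a numerically stable softmax plan. SoftmaxDescriptor, SoftmaxPlan and their method signatures are hypothetical and run on plain slices; they are not baracuda-kernels' API, which operates on device-resident tensors.

```rust
/// Hypothetical descriptor: the problem shape, fixed at plan-selection time.
#[derive(Clone, Copy, Debug)]
pub struct SoftmaxDescriptor {
    pub rows: usize,
    pub cols: usize,
}

/// Hypothetical plan: validated once, then reusable across many `run` calls.
pub struct SoftmaxPlan {
    desc: SoftmaxDescriptor,
}

impl SoftmaxPlan {
    /// "select": validate the descriptor and fix the strategy.
    pub fn select(desc: SoftmaxDescriptor) -> Result<Self, String> {
        if desc.rows == 0 || desc.cols == 0 {
            return Err("softmax needs a non-empty shape".to_string());
        }
        Ok(SoftmaxPlan { desc })
    }

    /// "run": softmax over the last axis of a row-major buffer.
    /// Subtracting the row maximum before `exp` keeps the computation stable.
    pub fn run(&self, input: &[f32], output: &mut [f32]) -> Result<(), String> {
        let n = self.desc.rows * self.desc.cols;
        if input.len() != n || output.len() != n {
            return Err("buffer length does not match the descriptor".to_string());
        }
        for (src, dst) in input
            .chunks_exact(self.desc.cols)
            .zip(output.chunks_exact_mut(self.desc.cols))
        {
            let max = src.iter().copied().fold(f32::NEG_INFINITY, f32::max);
            let mut sum = 0.0f32;
            for (d, &s) in dst.iter_mut().zip(src) {
                *d = (s - max).exp();
                sum += *d;
            }
            for d in dst.iter_mut() {
                *d /= sum;
            }
        }
        Ok(())
    }
}
```

The split mirrors the pattern described above: shape validation happens once in select, and run only checks that the buffers match the descriptor it was planned for.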
Workspace layout
The user-facing crates a typical caller will reach for:
baracuda-kernels # the unified Plan-based ML op facade
baracuda-kernels-types # shared type vocabulary (Element, TensorRef, KernelSku, ...)
baracuda-kernels-sys # raw FFI to bespoke .cu kernels
baracuda-kernels-bench # criterion harness for sm_89 perf sweeps (not published)
baracuda-cutlass # safe wrapper for CUTLASS GEMM (float, int8 RCR, batched, grouped)
baracuda-driver # safe wrapper for the CUDA Driver API
baracuda-runtime # safe wrapper for the CUDA Runtime API
The per-library wrappers used internally by the facade (you can also use them stand-alone):
baracuda-cublas{,-sys} # cuBLAS + cuBLASLt + cuBLASXt
baracuda-cudnn{,-sys} # cuDNN classic + Graph API
baracuda-cufft{,-sys} # cuFFT
baracuda-cusolver{,-sys} # cuSOLVER dense + sparse + Rf + Mg
baracuda-cusparse{,-sys} # cuSPARSE
baracuda-curand{,-sys} # cuRAND
baracuda-cutensor{,-sys} # cuTENSOR
baracuda-npp{,-sys} # NPP
baracuda-nccl{,-sys} # NCCL
baracuda-cvcuda{,-sys} # CV-CUDA
baracuda-nvjpeg{,-sys} # nvJPEG
baracuda-nvcomp{,-sys} # nvCOMP
And the supporting low-level crates (FFI, build infrastructure, profiling):
baracuda-cuda-sys # Driver + Runtime FFI
baracuda-nvrtc{,-sys} # runtime CUDA C++ → PTX
baracuda-nvjitlink{,-sys} # CUDA 12+ JIT linker
baracuda-cupti{,-sys} # profiling APIs
baracuda-nvml{,-sys} # device monitoring
baracuda-cufile{,-sys} # GPUDirect Storage (Linux-only)
baracuda-tensorrt{,-sys} # TensorRT inference runtime
baracuda-forge # build-time .cu → PTX compiler driver
baracuda-build # build.rs helpers
baracuda-core # loader + Error plumbing
baracuda-types{,-derive} # pure-data types: Half, BFloat16, Complex, DeviceRepr
The full umbrella crate (baracuda) re-exports everything behind cargo
features — convenient when you want everything; overkill when you don't.
Hardware support
baracuda targets Ampere and newer by design. Pre-Ampere GPUs lack the
tensor-core instructions and async-copy primitives the bespoke kernels are
written against (mma.sync.m16n8k*, cp.async, ldmatrix), and we have
no desire to ship a slower SIMT fallback for hardware that's eight years
old.
| Compute capability | NVIDIA marketing names | baracuda support |
|---|---|---|
| sm_80 | Ampere (A100, A30; sm_86 parts such as A40 and RTX 30xx run the same build) | default baseline |
| sm_89 | Ada Lovelace (RTX 40xx, L40, L4) | feature-gated specialized kernels (FP8, larger Flash Attention tiles) |
| sm_90a | Hopper async (H100, H200) | stubs in place; full specialization pending Phase 11 |
| sm_100 | Blackwell | post-Phase-11 |
| ≤ sm_75 (Turing, Volta, Pascal, …) | — | unsupported |
The default sm80 build runs forward-compatibly on Ada and Hopper through
JIT-compiled PTX; turn on sm89 to pick up the FP8 and Flash-Attention
sibling plans tuned for Ada's larger register file.
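The support table can be read as a simple mapping from compute capability to support tier. The sketch below encodes it that way; the Support enum and support_for function are hypothetical names for this example, not part of baracuda.

```rust
/// Support tiers, taken from the hardware table above.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Support {
    /// sm_75 and older.
    Unsupported,
    /// Default sm80 kernel set.
    AmpereBaseline,
    /// sm89 feature: FP8 and larger Flash Attention tiles.
    AdaSpecialized,
    /// sm90a feature: stubs until Phase 11.
    HopperStubs,
    /// Blackwell and later: after Phase 11.
    PostPhase11,
}

/// Map a (major, minor) compute capability to its tier.
pub fn support_for(major: u32, minor: u32) -> Support {
    match (major, minor) {
        (0..=7, _) => Support::Unsupported,
        (8, 9) => Support::AdaSpecialized,
        (8, _) => Support::AmpereBaseline,
        (9, _) => Support::HopperStubs,
        _ => Support::PostPhase11,
    }
}
```

For example, an RTX 4070 reports compute capability 8.9 and lands in the Ada tier, while a Turing card at 7.5 is rejected.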
Cargo features
The kernel facade exposes a small feature set:
| Feature | Default | Effect |
|---|---|---|
| sm80 | yes | Build the Ampere-baseline kernel set. |
| sm89 | no | Build the Ada Lovelace specializations (FP8 GEMM, FlashSdpaSm89Plan). |
| sm90a | no | Build the Hopper-specialized kernels (stubs today). |
| cudnn | no | Link cuDNN and enable conv / pool / CtcLossCudnnPlan. |
cudnn is off by default because cuDNN is a separate NVIDIA download not
bundled with the stock CUDA toolkit installer. Enabling it without cuDNN
installed produces a linker error on cudnn.lib / libcudnn.so; see
the Building section for the auto-discovery paths the build script probes.
Building
Requirements:
- CUDA Toolkit ≥ 12.0 with nvcc on PATH. baracuda is tested on 12.x and 13.x.
- cuDNN 9.x (only if you enable the cudnn feature); a separate NVIDIA download, not bundled with the toolkit.
- A working Rust toolchain ≥ 1.85 (workspace MSRV pinned in rust-toolchain.toml).
- Windows users: lld-link.exe somewhere on PATH. The CUDA nvcc invocation links through it; the install location is typically C:\Program Files\LLVM\bin. Install the LLVM Windows package and add that directory to PATH if cargo build complains about lld-link.exe not being found.
A typical full build with all GPU-side features (CUDA toolkit + cuDNN present):
Or, to verify the public API surface compiles without the full kernel build (fast — type-check only):
The baracuda-kernels-sys build script auto-discovers cuDNN at the
following paths in order: CUDNN_PATH / CUDNN_ROOT / CUDNN_HOME env
vars, then C:\Program Files\NVIDIA\CUDNN\v<X.Y>\ on Windows, then the
CUDA toolkit's own lib/ directory (pre-cuDNN-9 layout), then the
standard Linux distro paths under /usr/lib/.
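The probe order above could be expressed as an ordered candidate list, roughly as follows. This is a sketch, not the actual build script: cudnn_candidates and its parameters are hypothetical, the relative order of the three environment variables is assumed to match the order they are listed in, and the versioned v<X.Y> directory is left for a later enumeration step.

```rust
use std::path::PathBuf;

/// Build the ordered list of directories to probe for cuDNN.
/// `env` abstracts environment lookup so the order can be tested without
/// touching the real process environment.
pub fn cudnn_candidates(
    env: impl Fn(&str) -> Option<String>,
    cuda_root: Option<&str>,
    windows: bool,
) -> Vec<PathBuf> {
    let mut out = Vec::new();
    // 1. Explicit overrides, in the order the text lists them (assumed).
    for var in ["CUDNN_PATH", "CUDNN_ROOT", "CUDNN_HOME"] {
        if let Some(dir) = env(var) {
            out.push(PathBuf::from(dir));
        }
    }
    // 2. Default Windows install root; its v<X.Y> children are not expanded here.
    if windows {
        out.push(PathBuf::from(r"C:\Program Files\NVIDIA\CUDNN"));
    }
    // 3. The CUDA toolkit's own lib/ directory (pre-cuDNN-9 layout).
    if let Some(root) = cuda_root {
        out.push(PathBuf::from(root).join("lib"));
    }
    // 4. Standard Linux distro location.
    if !windows {
        out.push(PathBuf::from("/usr/lib"));
    }
    out
}
```

A build script would pass a closure such as |k| std::env::var(k).ok() and stop at the first candidate that actually contains the library.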
Testing
baracuda's GPU integration tests are gated behind #[ignore] so a
host-only cargo test doesn't try to launch a kernel on a machine
without an NVIDIA driver. To run them you need a working GPU plus the
--ignored flag:
# Host-only tests (compile + reference logic; no GPU access):
# Full GPU integration sweep — RTX 30xx / 40xx / 50xx required:
# Verify the workspace-level API surface compiles (no GPU needed):
The full regression on an RTX 4070 runs roughly 1630 passing tests across 324 binary targets. Individual op-family suites take 30–90 seconds; the full sweep is 25–40 minutes.
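For readers unfamiliar with the gating pattern, a minimal sketch of a host-only test next to an ignored GPU test might look like this. The function and test names are hypothetical, and the ignored test is a placeholder that does not launch a kernel.

```rust
/// Host-side reference: sum of a slice, used to check a device result.
pub fn reference_sum(xs: &[f32]) -> f32 {
    xs.iter().sum()
}

/// Runs under plain `cargo test`: no GPU access.
#[test]
fn reference_sum_adds_elements() {
    assert_eq!(reference_sum(&[1.0, 2.0, 3.5]), 6.5);
}

/// Skipped by default; runs only when `--ignored` is passed to the test binary.
#[test]
#[ignore = "requires an NVIDIA GPU and driver"]
fn device_sum_matches_reference() {
    // A real test would launch a kernel here and compare against `reference_sum`.
    // This placeholder only exercises the host reference.
    assert_eq!(reference_sum(&[0.5; 4]), 2.0);
}
```

The reason string on the ignore attribute shows up in the test runner's output, which makes it clear why a test was skipped on a machine without a GPU.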
Benchmarks
The baracuda-kernels-bench crate is a criterion-based harness with
CUDA-event-timed throughput sweeps across GEMM, Flash Attention, and
Conv2d at LLM-typical and ResNet-typical shapes. It is not published
to crates.io (it depends on a working GPU).
The full sweep takes ~30 minutes on an RTX 4070. Scope to a single family
with --bench gemm / --bench flash_attention / --bench conv2d. See
crates/baracuda-kernels-bench/BENCH-sm89.md
for the baseline table format and methodology.
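When reading GEMM throughput numbers, the usual conversion from an event-timed duration to tera-operations per second is sketched below. It assumes the conventional count of 2·M·N·K operations per GEMM and a time in milliseconds, the unit CUDA event timing reports; the function names are hypothetical and the harness's own methodology is the one described in BENCH-sm89.md.

```rust
/// Operations for one M x N x K GEMM, counting each multiply-add as two.
pub fn gemm_flops(m: u64, n: u64, k: u64) -> u64 {
    2 * m * n * k
}

/// Convert an operation count and an elapsed time in milliseconds
/// to tera-operations per second.
pub fn tflops(flops: u64, elapsed_ms: f64) -> f64 {
    (flops as f64) / (elapsed_ms * 1e-3) / 1e12
}
```

For integer GEMMs the same arithmetic yields TOPS rather than TFLOP/s.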
Project documentation
- ARCHITECTURE.md: layered design, Plan-Descriptor-Args pattern, KernelSku taxonomy, dispatcher design, workspace contract, sibling-plan pattern, vendoring convention, phase roadmap.
- OP-MATRIX.md: full op × dtype × backend coverage matrix (planned).
- LESSONS.md: postmortems, ABI footguns, performance traps (planned).
- Per-crate README.md files under crates/<name>/.
License
Dual-licensed under MIT or Apache-2.0. Pick whichever fits your project. Contributions accepted under the same terms.
NVIDIA's CUDA libraries (libcuda, libcudart, libcublas, libcudnn,
…) are not redistributed by this project. You obtain them from NVIDIA
separately — either through the CUDA Toolkit installer or through each
library's dedicated download page. baracuda's loader opens whatever the
host driver / toolkit has installed.
Vendor attribution
A small number of bespoke kernels in baracuda-kernels-sys are vendored
from upstream open-source projects (huggingface/candle's CUDA kernel set
via fuel-cuda-kernels; llama.cpp's ggml-cuda GGUF block-format
quantization + MMVQ; guoqingbao/attention.rs's fused MoE expert
kernels). Each adapted source carries an SPDX-FileCopyrightText: +
SPDX-License-Identifier: header; the consolidated provenance is in
crates/baracuda-kernels-sys/LICENSE-thirdparty.md.
The baracuda-forge build-time kernel-compiler
crate is a vendored fork of cudaforge
by Guoqing Bao — see crates/baracuda-forge/NOTICE
for the upstream commit hash.
The baracuda-cutlass safe wrapper for NVIDIA
CUTLASS — plan-based GEMM and grouped-GEMM with caller-supplied
workspace, MoE-friendly variable-M-per-group dispatch — was specified
by the Fuel ML library team. See
crates/baracuda-cutlass/NOTICE for
the design lineage.