baracuda-kernels-sys
Raw extern "C" entry points for the bespoke .cu kernels behind
baracuda-kernels. You almost certainly want baracuda-kernels
instead — it wraps these unsafe calls with typed plans, lifetime-
checked device buffers, the workspace contract, and a proper Rust API.
This crate exists at the layer where the workspace's other *-sys
crates do: just the C ABI bindings + a build.rs that compiles the
.cu sources via baracuda-forge.
What's compiled here
kernels/ ships hand-rolled CUDA C++ for every op family that doesn't
fit (or doesn't fit well) inside the NVIDIA library SKUs:
kernels/
attention/ RoPE, ALiBi, SDPA, Flash SDPA (sm_80 + sm_89), KV-cache append
elementwise/ unary / binary / ternary math + activations + casts + affine + fill
embedding/ embedding + embedding_bag (FW + BW)
fft/ fftshift / ifftshift index-permutation helpers
gemm/ int8 RRR, FP8 (sm_89), int4, bin GEMM
gguf/ llama.cpp GGUF block-format dequant + MMVQ (Q4_0..Q8_K + k-quants)
image/ interpolate, grid_sample, pixel_shuffle, ROI ops, NMS
indexing/ gather, scatter_add, index_select, masked_fill, one_hot, nonzero
linalg/ ormqr WY-blocked + reflector-by-reflector, batched QR helpers
loss/ CTCLoss DP forward + backward
moe/ scalar GGUF MoE + WMMA MoE + combined WMMA+GGUF MoE
norm/ RMSNorm, LayerNorm, BatchNorm, GroupNorm, InstanceNorm (FW + BW)
quantize/ per-tensor / per-channel / per-token / per-group quantize / dequantize
random/ dropout + bernoulli on top of cuRAND-uniform
segment/ sorted + unsorted segment sum / mean / max / min / prod
softmax/ softmax, log-softmax, Gumbel-softmax, sparsemax (FW + BW)
sort/ block-bitonic sort + topk + kthvalue + msort + searchsorted + bincount + histogram + unique
include/ shared headers — mma.sync wrappers, cp.async, smem swizzle, warp reductions, epilogue helpers
Every kernel uses the same calling convention: extern "C" int32_t
baracuda_kernels_<family>_<spec>_run(...) returning a status code (see
below), paired with _can_implement and _workspace_size helpers
where applicable. The safe layer in baracuda-kernels wraps each
launcher behind a Plan.
Build script — CUDA + cuDNN auto-discovery
The build script (build.rs) drives baracuda-forge::KernelBuilder to
compile the selected kernel set with nvcc. The CUDA toolkit is
auto-discovered through baracuda-forge's standard probe (CUDA_PATH /
CUDA_HOME / Windows installer layouts / Linux distro paths).
When the cudnn feature is enabled the build script also probes for
cuDNN. Search order, first hit wins:
CUDNN_PATH,CUDNN_ROOT,CUDNN_HOME— explicit env-var roots.C:\Program Files\NVIDIA\CUDNN\v<X.Y>\lib\<cuda_ver>\x64\— the standard Windows installer layout for cuDNN 9+ (versioned by both cuDNN release and target CUDA toolkit).- Legacy "drop into CUDA toolkit" layout —
$CUDA_PATH\lib\x64\on Windows or$CUDA_PATH/lib64/on Linux. The pre-cuDNN-9 tarball convention. - Linux distro paths —
/usr/lib/x86_64-linux-gnu/and/usr/local/cuda/lib64/.
If cuDNN can't be found and the cudnn feature is enabled, the build
fails with a typed error pointing at the search paths it tried.
Cargo features
| Feature | Default | Effect |
|---|---|---|
sm80 |
yes | Ampere baseline — runs forward-compatibly on Ada and Hopper. |
sm89 |
no | Ada Lovelace specializations (adds FP8 kernels, sm_89 Flash Attention). |
sm90a |
no | Hopper-specialized kernels (stubs today). |
cudnn |
no | Compile + link the cuDNN-backed launchers (conv / pool / CTC-cuDNN). |
Status codes
All *_run and *_can_implement entry points return an int32_t:
| Code | Meaning |
|---|---|
0 |
success |
1 |
misaligned operand |
2 |
invalid problem (M, N, or K non-positive, or shape inconsistency) |
3 |
not supported (this kernel doesn't implement the requested shape) |
4 |
workspace too small or null when required |
5 |
internal kernel error (typically a launch failure) |
The safe layer maps these to baracuda_cutlass::Error variants (the
shared error type for the kernel facade). See
ARCHITECTURE.md for the
mapping.
Safety
Functions take raw void* pointers, integer dimensions, and a
cudaStream_t cast to *mut c_void. They are unsafe because:
- Pointers are dereferenced without bounds checking.
- They're assumed to be valid device addresses.
- When a workspace pointer is non-null, it's assumed to point at the
number of writable device bytes the caller asked
*_workspace_sizefor. - The stream is assumed to be a live CUDA stream owned by the calling thread's current context.
If any of those are violated, you'll get either i32 == 5 (the launch
failed for a CUDA-reported reason) or undefined behavior at the GPU
level. The safe layer in baracuda-kernels enforces all of these.
Vendored kernels
A subset of kernels in this crate are vendored from upstream open-
source projects (HuggingFace candle via fuel-cuda-kernels for the
elementwise core; llama.cpp's ggml-cuda for GGUF dequant + MMVQ;
guoqingbao/attention.rs for fused MoE). Each adapted source carries
an SPDX-FileCopyrightText: + SPDX-License-Identifier: header; the
consolidated provenance is in
LICENSE-thirdparty.md.