baracuda-kernels-sys

Raw extern "C" entry points for the bespoke .cu kernels behind baracuda-kernels. You almost certainly want baracuda-kernels instead — it wraps these unsafe calls with typed plans, lifetime- checked device buffers, the workspace contract, and a proper Rust API.

This crate exists at the layer where the workspace's other *-sys crates do: just the C ABI bindings + a build.rs that compiles the .cu sources via baracuda-forge.

What's compiled here

kernels/ ships hand-rolled CUDA C++ for every op family that doesn't fit (or doesn't fit well) inside the NVIDIA library SKUs:

kernels/
  attention/      RoPE, ALiBi, SDPA, Flash SDPA (sm_80 + sm_89), KV-cache append
  elementwise/    unary / binary / ternary math + activations + casts + affine + fill
  embedding/      embedding + embedding_bag (FW + BW)
  fft/            fftshift / ifftshift index-permutation helpers
  gemm/           int8 RRR, FP8 (sm_89), int4, bin GEMM
  gguf/           llama.cpp GGUF block-format dequant + MMVQ (Q4_0..Q8_K + k-quants)
  image/          interpolate, grid_sample, pixel_shuffle, ROI ops, NMS
  indexing/       gather, scatter_add, index_select, masked_fill, one_hot, nonzero
  linalg/         ormqr WY-blocked + reflector-by-reflector, batched QR helpers
  loss/           CTCLoss DP forward + backward
  moe/            scalar GGUF MoE + WMMA MoE + combined WMMA+GGUF MoE
  norm/           RMSNorm, LayerNorm, BatchNorm, GroupNorm, InstanceNorm (FW + BW)
  quantize/       per-tensor / per-channel / per-token / per-group quantize / dequantize
  random/         dropout + bernoulli on top of cuRAND-uniform
  segment/        sorted + unsorted segment sum / mean / max / min / prod
  softmax/        softmax, log-softmax, Gumbel-softmax, sparsemax (FW + BW)
  sort/           block-bitonic sort + topk + kthvalue + msort + searchsorted + bincount + histogram + unique
  include/        shared headers — mma.sync wrappers, cp.async, smem swizzle, warp reductions, epilogue helpers

Every kernel uses the same calling convention: extern "C" int32_t baracuda_kernels_<family>_<spec>_run(...) returning a status code (see below), paired with _can_implement and _workspace_size helpers where applicable. The safe layer in baracuda-kernels wraps each launcher behind a Plan.

Build script — CUDA + cuDNN auto-discovery

The build script (build.rs) drives baracuda-forge::KernelBuilder to compile the selected kernel set with nvcc. The CUDA toolkit is auto-discovered through baracuda-forge's standard probe (CUDA_PATH / CUDA_HOME / Windows installer layouts / Linux distro paths).

When the cudnn feature is enabled the build script also probes for cuDNN. Search order, first hit wins:

CUDNN_PATH, CUDNN_ROOT, CUDNN_HOME — explicit env-var roots.
C:\Program Files\NVIDIA\CUDNN\v<X.Y>\lib\<cuda_ver>\x64\ — the standard Windows installer layout for cuDNN 9+ (versioned by both cuDNN release and target CUDA toolkit).
Legacy "drop into CUDA toolkit" layout — $CUDA_PATH\lib\x64\ on Windows or $CUDA_PATH/lib64/ on Linux. The pre-cuDNN-9 tarball convention.
Linux distro paths — /usr/lib/x86_64-linux-gnu/ and /usr/local/cuda/lib64/.

If cuDNN can't be found and the cudnn feature is enabled, the build fails with a typed error pointing at the search paths it tried.

Cargo features

Feature	Default	Effect
`sm80`	yes	Ampere baseline — runs forward-compatibly on Ada and Hopper.
`sm89`	no	Ada Lovelace specializations (adds FP8 kernels, sm_89 Flash Attention).
`sm90a`	no	Hopper-specialized kernels (stubs today).
`cudnn`	no	Compile + link the cuDNN-backed launchers (conv / pool / CTC-cuDNN).

Status codes

All *_run and *_can_implement entry points return an int32_t:

Code	Meaning
`0`	success
`1`	misaligned operand
`2`	invalid problem (M, N, or K non-positive, or shape inconsistency)
`3`	not supported (this kernel doesn't implement the requested shape)
`4`	workspace too small or null when required
`5`	internal kernel error (typically a launch failure)

The safe layer maps these to baracuda_cutlass::Error variants (the shared error type for the kernel facade). See ARCHITECTURE.md for the mapping.

Safety

Functions take raw void* pointers, integer dimensions, and a cudaStream_t cast to *mut c_void. They are unsafe because:

Pointers are dereferenced without bounds checking.
They're assumed to be valid device addresses.
When a workspace pointer is non-null, it's assumed to point at the number of writable device bytes the caller asked *_workspace_size for.
The stream is assumed to be a live CUDA stream owned by the calling thread's current context.

If any of those are violated, you'll get either i32 == 5 (the launch failed for a CUDA-reported reason) or undefined behavior at the GPU level. The safe layer in baracuda-kernels enforces all of these.

Vendored kernels

A subset of kernels in this crate are vendored from upstream open- source projects (HuggingFace candle via fuel-cuda-kernels for the elementwise core; llama.cpp's ggml-cuda for GGUF dequant + MMVQ; guoqingbao/attention.rs for fused MoE). Each adapted source carries an SPDX-FileCopyrightText: + SPDX-License-Identifier: header; the consolidated provenance is in LICENSE-thirdparty.md.