baracuda-forge

Build-time CUDA kernel compiler for the baracuda ecosystem. Drop it into your [build-dependencies] and compile .cu kernels to a static library or PTX with nvcc, with incremental rebuilds, parallel compilation, GPU compute-capability auto-detection, and integrated CUTLASS support.

baracuda-forge is the build-time companion to baracuda's runtime wrappers (baracuda-driver, baracuda-runtime, etc.). Use forge to turn your .cu files into a library, then use the runtime crates to launch the kernels from Rust.

If you'd rather not roll your own kernels: baracuda-kernels ships a curated, growing set of ML kernels (GEMM family across float and int dtypes, with the rest of the PyTorch + JAX op set landing across the alpha.16+ phases) behind a safe Rust facade. baracuda-kernels-sys uses forge internally to build its .cu sources; baracuda-kernels wraps them with typed plans. Reach for forge when you have a bespoke kernel the curated set doesn't cover yet.

Quick start

// build.rs
use baracuda_forge::KernelBuilder;

fn main() -> Result<(), baracuda_forge::Error> {
    let out_dir = std::env::var("OUT_DIR").unwrap();

    KernelBuilder::new()
        .source_dir("src/kernels")
        .arg("-O3")
        .build_lib(format!("{out_dir}/libkernels.a"))?;

    println!("cargo:rustc-link-search={out_dir}");
    println!("cargo:rustc-link-lib=kernels");
    Ok(())
}

# Cargo.toml

[build-dependencies]

baracuda-forge = "0.0.1-alpha.5"

Features

Incremental builds — SHA-256 content hashing skips kernels whose source, args, and watched headers haven't changed since the last build.
Auto-detection — toolkit location via $CUDA_PATH / $CUDA_HOME / $NVCC / $PATH (sharing logic with baracuda-build); GPU compute capability via $CUDA_COMPUTE_CAP or nvidia-smi.
C++ standard auto-select — defaults to c++20 on CUDA ≥ 12.0, c++17 on older toolkits. Override with .cpp_std("c++17").
Per-kernel compute caps — wildcard patterns route different .cu files to different sm_* targets in one build.
Parallel compilation — rayon thread pool, optional --threads=N for heavy template-instantiation files (CUTLASS, flash-attention).
CUTLASS — with_cutlass(commit) does sparse-checkout fetch + caches under ~/.cudaforge/git/checkouts/; pair with baracuda-cutlass-sys for the safe Rust-side header pin.
Custom git deps — with_git_dependency() for header-only dependencies beyond CUTLASS.
PTX output — build_ptx() instead of build_lib() if you want PTX blobs to ship and JIT at runtime via baracuda-driver.

Per-kernel compute capability

use baracuda_forge::KernelBuilder;
# fn main() -> Result<(), baracuda_forge::Error> {
KernelBuilder::new()
    .source_glob("src/**/*.cu")
    .with_compute_override("sm90_*.cu", 90)  // Hopper kernels
    .with_compute_override("sm80_*.cu", 80)  // Ampere kernels
    .build_lib("libkernels.a")?;
# Ok(())
# }

With CUTLASS

use baracuda_forge::KernelBuilder;
# fn main() -> Result<(), baracuda_forge::Error> {
KernelBuilder::new()
    .source_dir("src/kernels")
    .with_cutlass(None)  // default pinned commit
    .arg("-DUSE_CUTLASS")
    .build_lib("libkernels.a")?;
# Ok(())
# }

Docker / CI

nvidia-smi doesn't work during docker build (only docker run --gpus all). For Docker builds, set CUDA_COMPUTE_CAP and use require_explicit_compute_cap() to fail fast if it's missing:

use baracuda_forge::KernelBuilder;
# fn main() -> Result<(), baracuda_forge::Error> {
KernelBuilder::new()
    .require_explicit_compute_cap()?
    .source_dir("src/kernels")
    .build_lib("libkernels.a")?;
# Ok(())
# }

Compatibility

Aspect	Value	Note
MSRV	1.75	Inherited from the workspace floor.
CUDA toolkit	11.4 – 13.x	C++ standard auto-selects: `c++17` on 11.x, `c++20` on 12.0+.
Host compiler	MSVC, GCC, Clang	Whatever your nvcc accepts. Override via `NVCC_CCBIN`.
Target OS	Linux, Windows	Same as nvcc. macOS unsupported by NVIDIA since 2019.
GPU at build time	Not required	Compute cap can be set via `CUDA_COMPUTE_CAP` (use `require_explicit_compute_cap()` in Docker).

Feature flags

baracuda-forge ships no Cargo features today — it depends on baracuda-build unconditionally and uses serde only at build time (not in the consumer's runtime crate).

For CUTLASS version pinning, see baracuda-cutlass-sys, which has a cutlass-2-11 feature flag that downgrades the default CUTLASS 4.x to the CUDA-11.4-compatible 2.11.x line.

End-to-end example

examples/forge_hello.rs wires this crate up to baracuda-driver for the canonical "compile → load → launch → verify" loop:

cargo run -p baracuda-examples --bin forge_hello --features forge-hello --release

That builds examples/kernels/forge_hello.cu to PTX via KernelBuilder, loads the PTX through Module::load_ptx, runs 1,048,576 element vector-add on the GPU, and asserts an exact match against the CPU reference.

Acknowledgments

baracuda-forge is a vendored fork of cudaforge by Guoqing Bao, with modifications to fit baracuda's workspace conventions (error type alignment, shared toolkit detection via baracuda-build, C++ standard auto-selection from detected CUDA version). The two projects are heading in different directions — cudaforge stays a single focused crate, and baracuda integrates kernel building into a broader CUDA-stack workspace — so we vendored rather than depended-on, but the bulk of the algorithm and API is Guoqing's work.

Big thank you to Guoqing for releasing cudaforge under permissive terms. See NOTICE for the full provenance and the upstream commit hash.

License

Dual-licensed under MIT or Apache-2.0, matching the workspace and the upstream cudaforge license. See LICENSE-MIT and LICENSE-APACHE at the repository root.

baracuda-forge 0.0.1-alpha.68