cudaforge 0.1.3

CudaForge

Advanced CUDA kernel builder for Rust with incremental builds, auto-detection, and external dependency support.

Features

  • 🚀 Incremental Builds - Only recompile modified kernels using content hashing
  • 🔍 Auto-Detection - Automatically find CUDA toolkit, nvcc, and compute capability
  • 🎯 Per-Kernel Compute Cap - Override compute capability for specific kernels by filename
  • 📦 External Dependencies - Built-in CUTLASS support, or fetch any git repo
  • ⚡ Parallel Compilation - Configurable thread percentage for parallel builds
  • 📁 Flexible Sources - Directory, glob, files, or exclude patterns

Installation

Add to your Cargo.toml:

[build-dependencies]
cudaforge = "0.1"

Quick Start

Building a Static Library

// build.rs
use cudaforge::{KernelBuilder, Result};

fn main() -> Result<()> {
    let out_dir = std::env::var("OUT_DIR")?;
    
    KernelBuilder::new()
        .source_dir("src/kernels")
        .arg("-O3")
        .arg("-std=c++17")
        .arg("--use_fast_math")
        .build_lib(format!("{}/libkernels.a", out_dir))?;
    
    println!("cargo:rustc-link-search={}", out_dir);
    println!("cargo:rustc-link-lib=kernels");
    println!("cargo:rustc-link-lib=dylib=cudart");
    
    Ok(())
}

Building PTX Files

use cudaforge::KernelBuilder;

fn main() -> cudaforge::Result<()> {
    let output = KernelBuilder::new()
        .source_glob("src/**/*.cu")
        .build_ptx()?;
    
    // Generate Rust file with const declarations
    output.write("src/kernels.rs")?;
    
    Ok(())
}

Compute Capability

Auto-Detection

CudaForge automatically detects compute capability in this order:

  1. CUDA_COMPUTE_CAP environment variable (supports "90", "90a", "100a")
  2. nvidia-smi --query-gpu=compute_cap

For sm_90+ architectures, the 'a' suffix is automatically added for async features.
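
The suffix rule described above can be sketched as a small helper. This is an illustrative function, not CudaForge's actual internal API:

```rust
/// Map a numeric compute capability to an nvcc arch string.
/// Mirrors the rule described above: sm_90 and newer get the
/// 'a' suffix to unlock architecture-specific async features.
/// (Illustrative sketch; not CudaForge's real implementation.)
fn arch_for_cap(cap: u32) -> String {
    if cap >= 90 {
        format!("sm_{}a", cap)
    } else {
        format!("sm_{}", cap)
    }
}

fn main() {
    assert_eq!(arch_for_cap(80), "sm_80");    // Ampere: no suffix
    assert_eq!(arch_for_cap(90), "sm_90a");   // Hopper: suffix added
    assert_eq!(arch_for_cap(100), "sm_100a"); // suffix added for 90+
    println!("ok");
}
```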

Per-Kernel Override

Override compute capability for specific kernels using filename patterns:

KernelBuilder::new()
    .source_dir("src")
    .with_compute_override("sm90_*.cu", 90)  // Hopper (auto → sm_90a)
    .with_compute_override("sm80_*.cu", 80)  // Ampere (sm_80)
    .with_compute_override_arch("sm100_*.cu", "100a")  // Explicit arch string
    .build_lib("libkernels.a")?;

String-Based Architecture

For explicit control over GPU architecture including suffix:

KernelBuilder::new()
    .compute_cap_arch("90a")  // Explicit sm_90a
    .source_dir("src")
    .build_lib("libkernels.a")?;

Numeric (Auto-Suffix)

KernelBuilder::new()
    .compute_cap(90)  // Auto-selects sm_90a for 90+
    .source_dir("src")
    .build_lib("libkernels.a")?;

Source Selection

Directory (Recursive)

KernelBuilder::new()
    .source_dir("src/kernels")  // All .cu files recursively

Glob Pattern

KernelBuilder::new()
    .source_glob("src/**/*.cu")

Specific Files

KernelBuilder::new()
    .source_files(vec!["src/kernel1.cu", "src/kernel2.cu"])

With Exclusions

KernelBuilder::new()
    .source_dir("src/kernels")
    .exclude(&["*_test.cu", "deprecated/*", "wip_*.cu"])

Watch Additional Files

Track header files that should trigger rebuilds:

KernelBuilder::new()
    .source_dir("src/kernels")
    .watch(vec!["src/common.cuh", "src/utils.cuh"])

External Dependencies

CUTLASS Integration

KernelBuilder::new()
    .source_dir("src")
    .with_cutlass(Some("7127592069c2fe01b041e174ba4345ef9b279671"))
    .arg("-DUSE_CUTLASS")
    .arg("-std=c++17")
    .build_lib("libkernels.a")?;

Custom Git Repository

Fetch include directories from any git repository:

KernelBuilder::new()
    .source_dir("src")
    .with_git_dependency(
        "my_lib",                              // Name
        "https://github.com/org/my_lib.git",   // Repository
        "abc123def456",                        // Commit hash
        vec!["include", "src/include"],        // Include paths
        false,                                 // Do not recurse submodules
    )
    .build_lib("libkernels.a")?;

Local Include Paths

KernelBuilder::new()
    .source_dir("src")
    .include_path("third_party/include")
    .include_path("/opt/cuda/samples/common/inc")
    .build_lib("libkernels.a")?;

Parallel Compilation

Thread Percentage

Use a percentage of available threads:

KernelBuilder::new()
    .thread_percentage(0.5)  // 50% of available threads
    .source_dir("src")
    .build_lib("libkernels.a")?;

Maximum Threads

Set an absolute limit:

KernelBuilder::new()
    .max_threads(8)  // Use at most 8 threads
    .source_dir("src")
    .build_lib("libkernels.a")?;
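
How a percentage and a hard cap combine into a concrete thread count can be sketched as below. The rounding and the floor of 1 are assumptions for illustration, not CudaForge's documented behavior:

```rust
/// Resolve a thread percentage and an optional hard cap into a
/// concrete thread count. Rounding mode and the minimum of 1 are
/// assumptions; CudaForge's exact policy may differ.
fn resolve_threads(available: usize, percentage: f64, max: Option<usize>) -> usize {
    let from_pct = ((available as f64) * percentage).round() as usize;
    let capped = match max {
        Some(m) => from_pct.min(m),
        None => from_pct,
    };
    capped.max(1) // always compile with at least one thread
}

fn main() {
    assert_eq!(resolve_threads(16, 0.5, None), 8);    // 50% of 16 threads
    assert_eq!(resolve_threads(16, 0.5, Some(4)), 4); // cap wins
    assert_eq!(resolve_threads(2, 0.1, None), 1);     // never drops to 0
    println!("ok");
}
```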

Environment Variables

  • CUDAFORGE_THREADS - Override thread count
  • RAYON_NUM_THREADS - Alternative for compatibility
  • NVCC_THREADS - Number of threads for nvcc internal parallelism

Pattern-Based Threading

Enable multiple nvcc threads only for specific files (supports globs):

KernelBuilder::new()
    .nvcc_thread_patterns(&[
        "gemm_*.cu",       // Matches filename (gemm_vp8.cu)
        "**/special/*.cu", // Matches path
        "flash_api",       // Matches substring
    ])
    .build_lib("libkernels.a")?;

CUDA Toolkit Detection

CudaForge automatically locates the CUDA toolkit in this order:

  1. NVCC environment variable
  2. nvcc in PATH
  3. CUDA_HOME/bin/nvcc
  4. /usr/local/cuda/bin/nvcc
  5. Common installation paths
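
The fallback chain above can be sketched as a function with the environment and filesystem probes injected, so the ordering is visible and testable without a CUDA install. This is a simplified model, not CudaForge's real lookup code:

```rust
/// Sketch of the nvcc lookup chain described above. Environment
/// and filesystem checks are injected as closures; CudaForge's
/// actual implementation differs in detail.
fn find_nvcc(
    env: impl Fn(&str) -> Option<String>,
    exists: impl Fn(&str) -> bool,
) -> Option<String> {
    // 1. The NVCC environment variable wins outright.
    if let Some(p) = env("NVCC") {
        return Some(p);
    }
    // 2. nvcc on PATH (modeled here as a plain existence probe).
    if exists("nvcc") {
        return Some("nvcc".to_string());
    }
    // 3. CUDA_HOME/bin/nvcc.
    if let Some(home) = env("CUDA_HOME") {
        let p = format!("{}/bin/nvcc", home);
        if exists(&p) {
            return Some(p);
        }
    }
    // 4. The conventional default install location.
    let default = "/usr/local/cuda/bin/nvcc";
    if exists(default) {
        return Some(default.to_string());
    }
    None
}

fn main() {
    // CUDA_HOME set, nothing earlier in the chain available:
    let found = find_nvcc(
        |k| if k == "CUDA_HOME" { Some("/opt/cuda-12.1".into()) } else { None },
        |p| p == "/opt/cuda-12.1/bin/nvcc",
    );
    assert_eq!(found.as_deref(), Some("/opt/cuda-12.1/bin/nvcc"));
    println!("ok");
}
```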

Manual Override

KernelBuilder::new()
    .cuda_root("/opt/cuda-12.1")

Incremental Builds

Incremental builds are enabled by default. CudaForge tracks:

  • File content hashes (SHA-256)
  • Compute capability used
  • Compiler arguments
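
The rebuild decision keys on all three inputs above. CudaForge hashes file contents with SHA-256; the sketch below substitutes std's DefaultHasher so it runs without extra dependencies, but the idea is the same: if the key is unchanged, the kernel is skipped.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Build a cache key from the inputs CudaForge tracks. The crate
/// uses SHA-256 over file contents; std's DefaultHasher stands in
/// here purely for illustration.
fn cache_key(source: &str, arch: &str, args: &[&str]) -> u64 {
    let mut h = DefaultHasher::new();
    source.hash(&mut h);
    arch.hash(&mut h);
    args.hash(&mut h);
    h.finish()
}

fn main() {
    let base = cache_key("__global__ void k() {}", "sm_90a", &["-O3"]);
    // Unchanged inputs: same key, so the kernel is skipped.
    assert_eq!(base, cache_key("__global__ void k() {}", "sm_90a", &["-O3"]));
    // Any change to arch or flags forces a recompile.
    assert_ne!(base, cache_key("__global__ void k() {}", "sm_80", &["-O3"]));
    assert_ne!(base, cache_key("__global__ void k() {}", "sm_90a", &["-O0"]));
    println!("ok");
}
```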

To disable:

KernelBuilder::new()
    .no_incremental()

Full Example

use cudaforge::{KernelBuilder, Result};

fn main() -> Result<()> {
    println!("cargo:rerun-if-changed=build.rs");
    println!("cargo:rerun-if-env-changed=CUDA_COMPUTE_CAP");
    
    let out_dir = std::env::var("OUT_DIR")?;
    
    // Build with full feature set
    KernelBuilder::new()
        // Source selection
        .source_dir("src/kernels")
        .exclude(&["*_test.cu", "deprecated/*"])
        .watch(vec!["src/common.cuh"])
        
        // Per-kernel compute cap
        .with_compute_override("sm90_*.cu", 90)
        .with_compute_override("sm80_*.cu", 80)
        
        // External dependencies
        .with_cutlass(None)
        
        // Compiler options
        .arg("--expt-relaxed-constexpr")
        .arg("-std=c++17")
        .arg("-O3")
        .arg("--use_fast_math")
        
        // Parallel build
        .thread_percentage(0.5)
        
        // Build
        .build_lib(format!("{}/libkernels.a", out_dir))?;
    
    println!("cargo:rustc-link-search={}", out_dir);
    println!("cargo:rustc-link-lib=kernels");
    println!("cargo:rustc-link-lib=dylib=cudart");
    
    Ok(())
}

Multiple Builders

Use multiple builders in sequence for different output types or configurations:

use cudaforge::KernelBuilder;

fn main() -> cudaforge::Result<()> {
    let out_dir = std::env::var("OUT_DIR").expect("OUT_DIR");
    
    // Builder 1: Static library for main kernels
    KernelBuilder::new()
        .source_dir("src/kernels")
        .exclude(&["*_ptx.cu"])
        .compute_cap(90)
        .arg("-O3")
        .build_lib(format!("{}/libkernels.a", out_dir))?;
    
    // Builder 2: PTX files for runtime compilation
    let ptx_output = KernelBuilder::new()
        .source_glob("src/ptx_kernels/*.cu")
        .compute_cap(80)
        .build_ptx()?;
    
    ptx_output.write("src/kernels.rs")?;
    
    // Builder 3: Separate lib with CUTLASS
    KernelBuilder::new()
        .source_dir("src/cutlass_kernels")
        .with_cutlass(None)
        .compute_cap_arch("100a")
        .build_lib(format!("{}/libcutlass.a", out_dir))?;
    
    println!("cargo:rustc-link-search={}", out_dir);
    println!("cargo:rustc-link-lib=kernels");
    println!("cargo:rustc-link-lib=cutlass");
    println!("cargo:rustc-link-lib=dylib=cudart");
    
    Ok(())
}

Environment Variables

Variable           Description
-----------------  -----------------------------------------
CUDA_COMPUTE_CAP   Default compute capability (e.g., 80, 90)
NVCC               Path to nvcc binary
CUDA_HOME          CUDA installation root
NVCC_CCBIN         C++ compiler for nvcc
CUDAFORGE_THREADS  Override thread count
NVCC_THREADS       nvcc internal parallel threads
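
In a build.rs, each of these variables can be registered with Cargo so that changing one invalidates the build script. A minimal sketch (the directive list mirrors the table above; `rerun_directives` is a hypothetical helper name):

```rust
/// Emit cargo directives so a change to any build-affecting
/// environment variable reruns the build script. The variable
/// list mirrors the table above.
fn rerun_directives() -> Vec<String> {
    ["CUDA_COMPUTE_CAP", "NVCC", "CUDA_HOME",
     "NVCC_CCBIN", "CUDAFORGE_THREADS", "NVCC_THREADS"]
        .iter()
        .map(|v| format!("cargo:rerun-if-env-changed={}", v))
        .collect()
}

fn main() {
    for d in rerun_directives() {
        println!("{}", d);
    }
}
```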

Docker Builds

[!IMPORTANT] GPU is NOT accessible during docker build — only during docker run --gpus all.

When building CUDA kernels inside a Dockerfile, nvidia-smi cannot be used to auto-detect compute capability. You must explicitly set CUDA_COMPUTE_CAP:

Dockerfile Example

FROM nvidia/cuda:12.8.0-devel-ubuntu22.04

# Install Rust
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y

# Set compute capability for the build
ARG CUDA_COMPUTE_CAP=90
ENV CUDA_COMPUTE_CAP=${CUDA_COMPUTE_CAP}

# Build with explicit compute cap
WORKDIR /app
COPY . .
RUN cargo build --release

Build for different GPU architectures:

# Build for Hopper (sm_90)
docker build --build-arg CUDA_COMPUTE_CAP=90 -t myapp:hopper .

# Build for Blackwell (sm_100)
docker build --build-arg CUDA_COMPUTE_CAP=100 -t myapp:blackwell .

# Build for Ampere (sm_80)
docker build --build-arg CUDA_COMPUTE_CAP=80 -t myapp:ampere .

Fail-Fast Mode

For CI/Docker builds, use require_explicit_compute_cap() to fail immediately if compute capability is not set:

KernelBuilder::new()
    .require_explicit_compute_cap()?  // Fails fast if CUDA_COMPUTE_CAP not set
    .source_dir("src/kernels")
    .build_lib("libkernels.a")?;

Migration from bindgen_cuda

Old API                    New API
-------------------------  ------------------------
Builder::default()         KernelBuilder::new()
.kernel_paths(vec![...])   .source_files(vec![...])
.kernel_paths_glob("...")  .source_glob("...")
.include_paths(vec![...])  .include_path("...")
Bindings                   PtxOutput

Backward compatibility aliases are available:

  • cudaforge::Builder → KernelBuilder
  • cudaforge::Bindings → PtxOutput

License

MIT OR Apache-2.0