# CudaForge

<!-- Uncomment these after publishing to crates.io:
[![Crates.io](https://img.shields.io/crates/v/cudaforge.svg)](https://crates.io/crates/cudaforge)
[![Documentation](https://docs.rs/cudaforge/badge.svg)](https://docs.rs/cudaforge)
![License](https://img.shields.io/crates/l/cudaforge.svg)
-->
![Version](https://img.shields.io/badge/version-0.1.5-blue)
![License](https://img.shields.io/badge/license-MIT%2FApache--2.0-green)
![CUDA](https://img.shields.io/badge/CUDA-12.x-76B900?logo=nvidia)

Advanced CUDA kernel builder for Rust with **incremental builds**, **auto-detection**, and **external dependency support**.

## Features

- 🚀 **Incremental Builds** - Only recompile modified kernels using content hashing
- 🔍 **Auto-Detection** - Automatically find CUDA toolkit, nvcc, and compute capability
- 🎯 **Per-Kernel Compute Cap** - Override compute capability for specific kernels by filename
- 📦 **External Dependencies** - Built-in CUTLASS support, or fetch any git repo
- ⚡ **Parallel Compilation** - Configurable thread percentage for parallel builds
- 📁 **Flexible Sources** - Directory, glob, files, or exclude patterns

## Installation

Add to your `Cargo.toml`:

```toml
[build-dependencies]
cudaforge = "0.1"
```

## Quick Start

### Building a Static Library

```rust
// build.rs
use cudaforge::{KernelBuilder, Result};

fn main() -> Result<()> {
    let out_dir = std::env::var("OUT_DIR")?;
    
    KernelBuilder::new()
        .source_dir("src/kernels")
        .arg("-O3")
        .arg("-std=c++17")
        .arg("--use_fast_math")
        .build_lib(format!("{}/libkernels.a", out_dir))?;
    
    println!("cargo:rustc-link-search={}", out_dir);
    println!("cargo:rustc-link-lib=kernels");
    println!("cargo:rustc-link-lib=dylib=cudart");
    
    Ok(())
}
```

### Building PTX Files

```rust
use cudaforge::KernelBuilder;

fn main() -> cudaforge::Result<()> {
    let output = KernelBuilder::new()
        .source_glob("src/**/*.cu")
        .build_ptx()?;
    
    // Generate Rust file with const declarations
    output.write("src/kernels.rs")?;
    
    Ok(())
}
```

## Compute Capability

### Auto-Detection

CudaForge automatically detects compute capability in this order:
1. `CUDA_COMPUTE_CAP` environment variable (supports "90", "90a", "100a")
2. `nvidia-smi --query-gpu=compute_cap`

For sm_90+ architectures, the 'a' suffix is automatically added for async features.
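The suffix rule can be sketched as a small helper (hypothetical, not cudaforge's actual API): numeric caps of 90 and above get the `a` suffix, lower caps are used verbatim.

```rust
// Hypothetical sketch of the arch-string resolution described above:
// caps >= 90 gain the 'a' suffix for architecture-specific features.
fn arch_string(cap: u32) -> String {
    if cap >= 90 {
        format!("sm_{}a", cap)
    } else {
        format!("sm_{}", cap)
    }
}

fn main() {
    assert_eq!(arch_string(80), "sm_80");
    assert_eq!(arch_string(90), "sm_90a");
    assert_eq!(arch_string(100), "sm_100a");
    println!("ok");
}
```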

### Per-Kernel Override

Override compute capability for specific kernels using filename patterns:

```rust
KernelBuilder::new()
    .source_dir("src")
    .with_compute_override("sm90_*.cu", 90)  // Hopper (auto → sm_90a)
    .with_compute_override("sm80_*.cu", 80)  // Ampere (sm_80)
    .with_compute_override_arch("sm100_*.cu", "100a")  // Explicit arch string
    .build_lib("libkernels.a")?;
```

### String-Based Architecture

For explicit control over GPU architecture including suffix:

```rust
KernelBuilder::new()
    .compute_cap_arch("90a")  // Explicit sm_90a
    .source_dir("src")
    .build_lib("libkernels.a")?;
```

### Numeric (Auto-Suffix)

```rust
KernelBuilder::new()
    .compute_cap(90)  // Auto-selects sm_90a for 90+
    .source_dir("src")
    .build_lib("libkernels.a")?;
```

## Source Selection

### Directory (Recursive)

```rust
KernelBuilder::new()
    .source_dir("src/kernels")  // All .cu files recursively
```

### Glob Pattern

```rust
KernelBuilder::new()
    .source_glob("src/**/*.cu")
```

### Specific Files

```rust
KernelBuilder::new()
    .source_files(vec!["src/kernel1.cu", "src/kernel2.cu"])
```

### With Exclusions

```rust
KernelBuilder::new()
    .source_dir("src/kernels")
    .exclude(&["*_test.cu", "deprecated/*", "wip_*.cu"])
```

### Watch Additional Files

Track header files that should trigger rebuilds:

```rust
KernelBuilder::new()
    .source_dir("src/kernels")
    .watch(vec!["src/common.cuh", "src/utils.cuh"])
```

## External Dependencies

### CUTLASS Integration

```rust
KernelBuilder::new()
    .source_dir("src")
    .with_cutlass(Some("7127592069c2fe01b041e174ba4345ef9b279671"))
    .arg("-DUSE_CUTLASS")
    .arg("-std=c++17")
    .build_lib("libkernels.a")?;
```

### Custom Git Repository

Fetch include directories from any git repository:

```rust
KernelBuilder::new()
    .source_dir("src")
    .with_git_dependency(
        "my_lib",                              // Name
        "https://github.com/org/my_lib.git",   // Repository
        "abc123def456",                        // Commit hash
        vec!["include", "src/include"],        // Include paths
        vec!["src/kernels", "third_party"],    // Extra sparse-checkout paths
        false,                                 // Do not recurse submodules
    )
    .build_lib("libkernels.a")?;
```

`include_paths` are added to nvcc as `-I...` include directories.

`extra_paths` are fetched into the sparse checkout but are not added as include directories automatically. Use them when your build needs additional source trees, generated files, templates, or other repo content beyond headers.
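The distinction can be illustrated with a tiny sketch (hypothetical, not cudaforge internals): only `include_paths` turn into `-I<dir>` flags on the nvcc command line.

```rust
// Illustrative sketch: each include path becomes a `-I<dir>` nvcc flag,
// the same way `.include_path(...)` and git-dependency `include_paths`
// end up on the compiler command line. extra_paths get no such flag.
fn include_flags(include_paths: &[&str]) -> Vec<String> {
    include_paths.iter().map(|p| format!("-I{}", p)).collect()
}

fn main() {
    assert_eq!(
        include_flags(&["include", "src/include"]),
        vec!["-Iinclude", "-Isrc/include"]
    );
    println!("ok");
}
```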

If you need to compile source files from the fetched dependency, you can fetch the checkout root and reference files from there:

```rust
let builder = KernelBuilder::new()
    .source_dir("src")
    .with_git_dependency(
        "my_lib",
        "https://github.com/org/my_lib.git",
        "abc123def456",
        vec!["include"],
        vec!["src/kernels"],
        false,
    );

let my_lib_root = builder.fetch_git_dependency("my_lib")?;
let builder = builder.source_files(vec![my_lib_root.join("src/kernels/my_kernel.cu")]);
```

### Local Include Paths

```rust
KernelBuilder::new()
    .source_dir("src")
    .include_path("third_party/include")
    .include_path("/opt/cuda/samples/common/inc")
    .build_lib("libkernels.a")?;
```

## Parallel Compilation

### Thread Percentage

Use a percentage of available threads:

```rust
KernelBuilder::new()
    .thread_percentage(0.5)  // 50% of available threads
    .source_dir("src")
    .build_lib("libkernels.a")?;
```

### Maximum Threads

Set an absolute limit:

```rust
KernelBuilder::new()
    .max_threads(8)  // Use at most 8 threads
    .source_dir("src")
    .build_lib("libkernels.a")?;
```

### Environment Variables

- `CUDAFORGE_THREADS` - Override thread count
- `RAYON_NUM_THREADS` - Alternative for compatibility
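Putting the knobs together, the effective thread count could be resolved roughly like this (a hypothetical sketch, not cudaforge's actual implementation): an explicit override such as `CUDAFORGE_THREADS` wins, otherwise the requested percentage of available threads applies, clamped by `max_threads`, with a floor of one thread.

```rust
// Hypothetical thread-count resolution: override > percentage,
// clamped by max_threads, never below one thread.
fn resolve_threads(
    env_override: Option<usize>,
    available: usize,
    percentage: f64,
    max_threads: Option<usize>,
) -> usize {
    if let Some(n) = env_override {
        return n.max(1);
    }
    let mut n = (available as f64 * percentage).round() as usize;
    if let Some(m) = max_threads {
        n = n.min(m);
    }
    n.max(1)
}

fn main() {
    assert_eq!(resolve_threads(None, 16, 0.5, None), 8);
    assert_eq!(resolve_threads(None, 16, 0.5, Some(4)), 4);
    assert_eq!(resolve_threads(Some(2), 16, 0.5, None), 2);
    assert_eq!(resolve_threads(None, 1, 0.1, None), 1); // floor of one
    println!("ok");
}
```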

### Pattern-Based Threading

Enable multiple nvcc threads only for specific files (supports globs):

```rust
KernelBuilder::new()
    .nvcc_thread_patterns(&[
        "gemm_*.cu",       // Matches filename (gemm_vp8.cu)
        "**/special/*.cu", // Matches path
        "flash_api",       // Matches substring
    ], 4)  // Use 4 nvcc threads for matching files
    .build_lib("libkernels.a")?;
```
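The three matching modes above (filename glob, path glob, substring) can be sketched as follows; this is a hypothetical matcher for illustration, not cudaforge's real implementation, and it simplifies by letting `*` cross `/` boundaries.

```rust
// Classic iterative `*`-wildcard matching with backtracking; `*`
// matches any run of characters (including '/', a simplification).
fn wildcard_match(pattern: &str, text: &str) -> bool {
    let (p, t): (Vec<char>, Vec<char>) =
        (pattern.chars().collect(), text.chars().collect());
    let (mut pi, mut ti) = (0usize, 0usize);
    let (mut star, mut mark) = (None, 0usize);
    while ti < t.len() {
        if pi < p.len() && p[pi] == '*' {
            star = Some(pi);
            mark = ti;
            pi += 1;
        } else if pi < p.len() && p[pi] == t[ti] {
            pi += 1;
            ti += 1;
        } else if let Some(s) = star {
            // backtrack: let the last '*' absorb one more character
            pi = s + 1;
            mark += 1;
            ti = mark;
        } else {
            return false;
        }
    }
    while pi < p.len() && p[pi] == '*' {
        pi += 1;
    }
    pi == p.len()
}

// Patterns with wildcards match the full path or the bare filename;
// plain patterns match as substrings.
fn uses_extra_threads(patterns: &[&str], path: &str) -> bool {
    patterns.iter().any(|pat| {
        if pat.contains('*') {
            let file = path.rsplit('/').next().unwrap_or(path);
            wildcard_match(pat, path) || wildcard_match(pat, file)
        } else {
            path.contains(pat)
        }
    })
}

fn main() {
    assert!(uses_extra_threads(&["gemm_*.cu"], "src/gemm_vp8.cu"));
    assert!(uses_extra_threads(&["**/special/*.cu"], "src/special/k.cu"));
    assert!(uses_extra_threads(&["flash_api"], "src/flash_api_fwd.cu"));
    assert!(!uses_extra_threads(&["gemm_*.cu"], "src/conv2d.cu"));
    println!("ok");
}
```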

## CUDA Toolkit Detection

CudaForge automatically locates the CUDA toolkit in this order:

1. `NVCC` environment variable
2. `nvcc` in `PATH`
3. `CUDA_HOME/bin/nvcc`
4. `/usr/local/cuda/bin/nvcc`
5. Common installation paths
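The lookup order can be pictured as an ordered candidate list (a hypothetical helper, not cudaforge's actual code; the real builder also probes `nvcc` on `PATH` and a set of common install prefixes before giving up):

```rust
use std::env;
use std::path::PathBuf;

// Ordered nvcc candidates: NVCC env var first, then CUDA_HOME/bin/nvcc,
// then the conventional /usr/local/cuda fallback.
fn nvcc_candidates() -> Vec<PathBuf> {
    let mut c = Vec::new();
    if let Ok(p) = env::var("NVCC") {
        c.push(PathBuf::from(p));
    }
    if let Ok(home) = env::var("CUDA_HOME") {
        c.push(PathBuf::from(home).join("bin/nvcc"));
    }
    c.push(PathBuf::from("/usr/local/cuda/bin/nvcc"));
    c
}

fn main() {
    let candidates = nvcc_candidates();
    // The hard-coded fallback is always the last candidate.
    assert_eq!(
        candidates.last().unwrap(),
        &PathBuf::from("/usr/local/cuda/bin/nvcc")
    );
    println!("ok");
}
```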

### Manual Override

```rust
KernelBuilder::new()
    .cuda_root("/opt/cuda-12.1")
```

## Incremental Builds

Incremental builds are enabled by default. CudaForge tracks:
- File content hashes (SHA-256)
- Compute capability used
- Compiler arguments
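The bookkeeping above amounts to fingerprinting each compilation unit and recompiling when the fingerprint changes. A minimal sketch (hypothetical; cudaforge hashes file contents with SHA-256, while std's `DefaultHasher` is used here only to keep the example dependency-free):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Fingerprint of (source contents, arch, compiler args); a kernel is
// recompiled when this differs from the previously recorded value.
fn fingerprint(source: &str, arch: &str, args: &[&str]) -> u64 {
    let mut h = DefaultHasher::new();
    source.hash(&mut h);
    arch.hash(&mut h);
    args.hash(&mut h);
    h.finish()
}

fn needs_rebuild(prev: Option<u64>, current: u64) -> bool {
    prev != Some(current)
}

fn main() {
    let a = fingerprint("__global__ void k() {}", "sm_90a", &["-O3"]);
    let b = fingerprint("__global__ void k() {}", "sm_90a", &["-O0"]);
    assert!(!needs_rebuild(Some(a), a)); // unchanged -> skip
    assert!(needs_rebuild(Some(a), b));  // args changed -> rebuild
    assert!(needs_rebuild(None, a));     // no record -> rebuild
    println!("ok");
}
```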

To disable:

```rust
KernelBuilder::new()
    .no_incremental()
```

## Full Example

```rust
use cudaforge::{KernelBuilder, Result};

fn main() -> Result<()> {
    println!("cargo:rerun-if-changed=build.rs");
    println!("cargo:rerun-if-env-changed=CUDA_COMPUTE_CAP");
    
    let out_dir = std::env::var("OUT_DIR")?;
    
    // Build with full feature set
    KernelBuilder::new()
        // Source selection
        .source_dir("src/kernels")
        .exclude(&["*_test.cu", "deprecated/*"])
        .watch(vec!["src/common.cuh"])
        
        // Per-kernel compute cap
        .with_compute_override("sm90_*.cu", 90)
        .with_compute_override("sm80_*.cu", 80)
        
        // External dependencies
        .with_cutlass(None)
        
        // Compiler options
        .arg("--expt-relaxed-constexpr")
        .arg("-std=c++17")
        .arg("-O3")
        .arg("--use_fast_math")
        
        // Parallel build
        .thread_percentage(0.5)
        
        // Build
        .build_lib(format!("{}/libkernels.a", out_dir))?;
    
    println!("cargo:rustc-link-search={}", out_dir);
    println!("cargo:rustc-link-lib=kernels");
    println!("cargo:rustc-link-lib=dylib=cudart");
    
    Ok(())
}
```

## Multiple Builders

Use multiple builders in sequence for different output types or configurations:

```rust
use cudaforge::KernelBuilder;

fn main() -> cudaforge::Result<()> {
    let out_dir = std::env::var("OUT_DIR").expect("OUT_DIR");
    
    // Builder 1: Static library for main kernels
    KernelBuilder::new()
        .source_dir("src/kernels")
        .exclude(&["*_ptx.cu"])
        .compute_cap(90)
        .arg("-O3")
        .build_lib(format!("{}/libkernels.a", out_dir))?;
    
    // Builder 2: PTX files for runtime compilation
    let ptx_output = KernelBuilder::new()
        .source_glob("src/ptx_kernels/*.cu")
        .compute_cap(80)
        .build_ptx()?;
    
    ptx_output.write("src/kernels.rs")?;
    
    // Builder 3: Separate lib with CUTLASS
    KernelBuilder::new()
        .source_dir("src/cutlass_kernels")
        .with_cutlass(None)
        .compute_cap_arch("100a")
        .build_lib(format!("{}/libcutlass.a", out_dir))?;
    
    println!("cargo:rustc-link-search={}", out_dir);
    println!("cargo:rustc-link-lib=kernels");
    println!("cargo:rustc-link-lib=cutlass");
    
    Ok(())
}
```

## Environment Variables

| Variable | Description |
|----------|-------------|
| `CUDA_COMPUTE_CAP` | Default compute capability (e.g., `80`, `90`) |
| `NVCC` | Path to nvcc binary |
| `CUDA_HOME` | CUDA installation root |
| `NVCC_CCBIN` | C++ compiler for nvcc |
| `CUDAFORGE_THREADS` | Override thread count |

## Docker Builds

> [!IMPORTANT]
> **GPU is NOT accessible during `docker build`** — only during `docker run --gpus all`.

When building CUDA kernels inside a Dockerfile, `nvidia-smi` cannot be used to auto-detect compute capability. You must explicitly set `CUDA_COMPUTE_CAP`:

### Dockerfile Example

```dockerfile
FROM nvidia/cuda:12.8.0-devel-ubuntu22.04

# Install Rust
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y

# Set compute capability for the build
ARG CUDA_COMPUTE_CAP=90
ENV CUDA_COMPUTE_CAP=${CUDA_COMPUTE_CAP}

# Build with explicit compute cap
WORKDIR /app
COPY . .
RUN cargo build --release
```

Build for different GPU architectures:

```bash
# Build for Hopper (sm_90)
docker build --build-arg CUDA_COMPUTE_CAP=90 -t myapp:hopper .

# Build for Blackwell (sm_100)
docker build --build-arg CUDA_COMPUTE_CAP=100 -t myapp:blackwell .

# Build for Ampere (sm_80)
docker build --build-arg CUDA_COMPUTE_CAP=80 -t myapp:ampere .
```

### Fail-Fast Mode

For CI/Docker builds, use `require_explicit_compute_cap()` to fail immediately if compute capability is not set:

```rust
KernelBuilder::new()
    .require_explicit_compute_cap()?  // Fails fast if CUDA_COMPUTE_CAP not set
    .source_dir("src/kernels")
    .build_lib("libkernels.a")?;
```

## Migration from bindgen_cuda

| Old API | New API |
|---------|---------|
| `Builder::default()` | `KernelBuilder::new()` |
| `.kernel_paths(vec![...])` | `.source_files(vec![...])` |
| `.kernel_paths_glob("...")` | `.source_glob("...")` |
| `.include_paths(vec![...])` | `.include_path("...")` |
| `Bindings` | `PtxOutput` |

Backward compatibility aliases are available:
- `cudaforge::Builder` → `KernelBuilder`
- `cudaforge::Bindings` → `PtxOutput`

## License

MIT OR Apache-2.0