# GPU Backend
Trueno provides two GPU acceleration options:
1. **wgpu (Cross-platform)** - Vulkan, Metal, DX12, WebGPU via [wgpu](https://wgpu.rs/)
2. **CUDA (NVIDIA)** - Native PTX code generation via [trueno-gpu](../architecture/ptx-generation.md)
## CUDA Support (trueno-gpu)
For NVIDIA GPUs, trueno-gpu provides **pure Rust PTX code generation** without requiring LLVM, nvcc, or external toolchains.
### Quick Start with CUDA
```rust
use trueno_gpu::ptx::{PtxModule, PtxKernel, PtxType};
use trueno_gpu::kernels::{GemmKernel, Kernel};

// Generate optimized GEMM kernel
let kernel = GemmKernel::tiled(1024, 1024, 1024, 32);
let ptx = kernel.emit_ptx();

// PTX can be loaded via CUDA driver API
println!("{}", ptx);
```
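The emitted PTX is plain text, so it can be JIT-loaded through the CUDA driver API. A minimal sketch, assuming the community `cust` crate for the driver bindings and an illustrative entry-point name (`gemm_tiled`) that must match whatever the kernel builder actually emits:

```rust
use cust::module::Module;
use cust::error::CudaResult;

fn load_gemm(ptx: &str) -> CudaResult<()> {
    // Initialize the CUDA driver and create a context on the default device
    let _ctx = cust::quick_init()?;

    // JIT-compile the PTX string into a module (cuModuleLoadData under the hood)
    let module = Module::from_ptx(ptx, &[])?;

    // Look up the kernel entry point; "gemm_tiled" is an illustrative name
    let _gemm = module.get_function("gemm_tiled")?;

    // The function can now be launched via cust's `launch!` macro
    Ok(())
}
```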
### Running CUDA Examples
```bash
# PTX code generation (no GPU required)
cargo run -p trueno-gpu --example ptx_quickstart
cargo run -p trueno-gpu --example gemm_kernel
# CUDA runtime examples (requires NVIDIA GPU)
cargo run -p trueno-gpu --example cuda_monitor
cargo run -p trueno-gpu --example flash_attention_cuda
```
### Pre-built CUDA Kernels
| Kernel | Description | Example |
|--------|-------------|---------|
| GEMM | Matrix multiplication (naive/tiled/tensor core) | `gemm_kernel` |
| Softmax | Numerically stable softmax | `ptx_quickstart` |
| LayerNorm | Layer normalization | `simple_attention_cuda` |
| Attention | Multi-head attention | `flash_attention_cuda` |
| Quantize | Q4_K/Q5_K/Q6_K quantization | `q4k_gemm` |
See [PTX Code Generation](./ptx-generation.md) for detailed documentation.
---
## wgpu Support (Cross-Platform)
For cross-platform GPU compute, Trueno uses [wgpu](https://wgpu.rs/), supporting Vulkan, Metal, DX12, and WebGPU.
## Overview
The wgpu backend enables massive parallelism for compute-heavy operations like matrix multiplication. It supports both native platforms (Linux, macOS, Windows) and WebAssembly (via WebGPU in browsers).
### Key Features
- **Cross-platform**: Single codebase for native and WASM
- **Async-first**: All operations have async variants for non-blocking execution
- **Sync wrappers**: Native platforms get convenient sync APIs
- **Automatic fallback**: Falls back to SIMD when GPU unavailable
## Platform Support
| Platform | Backend | Sync API | Async API |
|----------|---------|----------|-----------|
| Linux | Vulkan | ✅ | ✅ |
| macOS | Metal | ✅ | ✅ |
| Windows | DX12/Vulkan | ✅ | ✅ |
| WASM (Browser) | WebGPU | ❌ | ✅ |
**Note**: WASM cannot use sync APIs because JavaScript's single-threaded model prohibits blocking the main thread.
## Feature Flags
```toml
[dependencies]
# Pick one of the following for your target:
trueno = { version = "0.7.3", features = ["gpu"] }        # Native GPU
# trueno = { version = "0.7.3", features = ["gpu-wasm"] }  # WASM GPU (WebGPU)
```
### Feature Differences
| Component | `gpu` | `gpu-wasm` |
|-----------|-------|------------|
| wgpu | ✅ | ✅ |
| pollster (sync runtime) | ✅ | ❌ |
| wasm-bindgen-futures | ❌ | ✅ |
| Sync methods | ✅ | ❌ |
| Async methods | ✅ | ✅ |
## API Design
### Sync API (Native Only)
```rust
use trueno::backends::gpu::GpuDevice;

// Initialize device
let device = GpuDevice::new()?;

// Check availability
if GpuDevice::is_available() {
    // Execute operations
    device.matmul(&a, &b, &mut result, m, k, n)?;
    device.relu(&input, &mut output)?;
    let dot = device.dot(&a, &b)?;
}
```
### Async API (All Platforms)
```rust
use trueno::backends::gpu::GpuDevice;

// Initialize device
let device = GpuDevice::new_async().await?;

// Check availability
if GpuDevice::is_available_async().await {
    // Execute operations
    device.matmul_async(&a, &b, &mut result, m, k, n).await?;
    device.relu_async(&input, &mut output).await?;
    let dot = device.dot_async(&a, &b).await?;
}
```
### Runtime Detection
```rust
use trueno::backends::gpu::{runtime, GpuDevice};

if runtime::sync_available() {
    // Can use sync APIs (native only)
    let device = GpuDevice::new()?;
} else {
    // Must use async APIs (WASM)
    let device = GpuDevice::new_async().await?;
}
```
## Available Operations
### Element-wise Operations
| Operation | Sync | Async | Definition |
|-----------|------|-------|------------|
| `relu` | ✅ | ✅ | max(0, x) |
| `leaky_relu` | ✅ | ✅ | max(αx, x) |
| `elu` | ✅ | ✅ | x if x>0, else α(eˣ-1) |
| `sigmoid` | ✅ | ✅ | 1/(1+e⁻ˣ) |
| `tanh` | ✅ | ✅ | tanh(x) |
| `swish` | ✅ | ✅ | x·sigmoid(x) |
| `gelu` | ✅ | ✅ | Gaussian Error Linear Unit |
| `clip` | ✅ | ✅ | clamp(x, min, max) |
| `softmax` | ✅ | ✅ | exp(x)/Σexp(x) |
| `log_softmax` | ✅ | ✅ | log(softmax(x)) |
### Vector Operations
| Operation | Sync | Async | Description |
|-----------|------|-------|-------------|
| `vec_add` | ✅ | ✅ | Element-wise addition |
| `dot` | ✅ | ✅ | Dot product with reduction |
### Matrix Operations
| Operation | Sync | Async | Description |
|-----------|------|-------|-------------|
| `matmul` | ✅ | ✅ | Matrix multiplication |
| `convolve2d` | ✅ | ✅ | 2D convolution |
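A short end-to-end sketch tying the matrix and element-wise operations together, using the sync signatures shown above (the 2×3 · 3×2 dimensions and data are illustrative):

```rust
use trueno::backends::gpu::GpuDevice;

fn dense_relu_layer() -> Result<Vec<f32>, String> {
    let device = GpuDevice::new()?;

    // A is 2x3 (m x k), B is 3x2 (k x n), so C is 2x2 (m x n)
    let (m, k, n) = (2, 3, 2);
    let a = vec![1.0_f32, -2.0, 3.0, 4.0, 5.0, -6.0];
    let b = vec![1.0_f32, 0.5, -1.0, 2.0, 0.25, -0.75];
    let mut c = vec![0.0_f32; m * n];

    // C = A x B on the GPU
    device.matmul(&a, &b, &mut c, m, k, n)?;

    // Apply ReLU element-wise to the result
    let mut activated = vec![0.0_f32; c.len()];
    device.relu(&c, &mut activated)?;

    Ok(activated)
}
```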
## WebGPU for WASM
The `gpu-wasm` feature enables GPU compute in browsers via WebGPU. This is particularly useful for:
- **Browser-based ML inference**: Run models client-side
- **Interactive visualizations**: GPU-accelerated data processing
- **Scientific computing in browsers**: Heavy computations without server round-trips
### Example: trueno-viz
[trueno-viz](https://github.com/paiml/trueno-viz) demonstrates Trueno's WebGPU capabilities for browser-based visualization:
```rust
// In a WASM context, use the async API
use trueno::backends::gpu::GpuDevice;
use wasm_bindgen::prelude::*;

#[wasm_bindgen]
pub async fn process_data(input: &[f32]) -> Result<Vec<f32>, JsValue> {
    let device = GpuDevice::new_async().await
        .map_err(|e| JsValue::from_str(&e))?;

    let mut output = vec![0.0; input.len()];
    device.relu_async(input, &mut output).await
        .map_err(|e| JsValue::from_str(&e))?;

    Ok(output)
}
```
### WASM Build Configuration
```toml
# Cargo.toml
[target.'cfg(target_arch = "wasm32")'.dependencies]
trueno = { version = "0.7.3", features = ["gpu-wasm"] }
wasm-bindgen = "0.2"
wasm-bindgen-futures = "0.4"
```
Build with:
```bash
wasm-pack build --target web --features gpu-wasm
```
## Batch API
For chaining multiple GPU operations, use the batch API to minimize transfer overhead:
```rust
use trueno::backends::gpu::{GpuDevice, GpuCommandBatch};
let device = GpuDevice::new()?;
let mut batch = GpuCommandBatch::new(device);
// Queue operations (no GPU execution yet)
let input = batch.upload(&[1.0, 2.0, -3.0, 4.0]);
let a = batch.relu(input);
let b = batch.scale(a, 2.0);
// Execute batch in single GPU round-trip
batch.execute().await?;
// Read result
let result = batch.read(b).await?;
```
See [GPU Performance](../performance/gpu-performance.md) for detailed batch API documentation.
## Performance Considerations
### When to Use GPU
✅ **Use GPU for**:
- Matrix multiplication >500×500
- 2D convolutions with large kernels
- Batched operations (multiple ops chained)
❌ **Use SIMD instead for**:
- Vector operations (add, mul, dot)
- Small matrices (<500×500)
- Single operations (transfer overhead dominates)
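As a rough illustration of these guidelines, a caller can pick the backend from the problem size before dispatching. The sketch below reuses the `GpuDevice` sync API from earlier sections; `simd_matmul` is a hypothetical stand-in for the SIMD path, and the 500×500 cutoff mirrors the thresholds above:

```rust
use trueno::backends::gpu::GpuDevice;

const GPU_MATMUL_THRESHOLD: usize = 500;

fn matmul_dispatch(
    a: &[f32],
    b: &[f32],
    c: &mut [f32],
    m: usize,
    k: usize,
    n: usize,
) -> Result<(), String> {
    // Large matrices amortize the ~3.5ms transfer overhead; small ones do not
    if m.min(k).min(n) >= GPU_MATMUL_THRESHOLD && GpuDevice::is_available() {
        let device = GpuDevice::new()?;
        device.matmul(a, b, c, m, k, n)
    } else {
        // Hypothetical SIMD fallback (not part of the GPU API)
        simd_matmul(a, b, c, m, k, n);
        Ok(())
    }
}

// Naive CPU reference standing in for the SIMD backend
fn simd_matmul(a: &[f32], b: &[f32], c: &mut [f32], m: usize, k: usize, n: usize) {
    for i in 0..m {
        for j in 0..n {
            c[i * n + j] = (0..k).map(|p| a[i * k + p] * b[p * n + j]).sum();
        }
    }
}
```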
### Transfer Overhead
GPU operations incur ~3.5ms fixed overhead per operation:
| Stage | Overhead |
|-------|----------|
| Buffer creation | ~0.5ms |
| CPU→GPU transfer | ~1.5ms |
| Kernel dispatch | ~0.3ms |
| GPU→CPU readback | ~1.2ms |
This overhead makes GPU slower than SIMD for simple operations. See [GPU Performance](../performance/gpu-performance.md) for benchmarks.
## Implementation Details
### Runtime Module
The `runtime` module (`src/backends/gpu/runtime.rs`) provides platform-specific async runtime helpers:
```rust
use std::future::Future;

// Native: Uses pollster for blocking
#[cfg(all(feature = "gpu", not(target_arch = "wasm32")))]
pub fn block_on<F: Future>(f: F) -> F::Output {
    pollster::block_on(f)
}

// Check if sync operations are available
pub const fn sync_available() -> bool {
    #[cfg(not(target_arch = "wasm32"))]
    { true }
    #[cfg(target_arch = "wasm32")]
    { false }
}

// WASM: Spawn async tasks
#[cfg(all(feature = "gpu-wasm", target_arch = "wasm32"))]
pub fn spawn_local<F: Future<Output = ()> + 'static>(f: F) {
    wasm_bindgen_futures::spawn_local(f);
}
```
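A sketch of how calling code might use these helpers, assuming they are re-exported at `trueno::backends::gpu::runtime` as in the Runtime Detection example above:

```rust
use trueno::backends::gpu::{runtime, GpuDevice};

// Native: drive an async GPU call from synchronous code
#[cfg(not(target_arch = "wasm32"))]
fn relu_blocking(device: &GpuDevice, input: &[f32], output: &mut [f32]) -> Result<(), String> {
    runtime::block_on(device.relu_async(input, output))
}

// WASM: fire-and-forget an async task onto the JS event loop
#[cfg(target_arch = "wasm32")]
fn relu_spawned(device: GpuDevice, input: Vec<f32>) {
    runtime::spawn_local(async move {
        let mut output = vec![0.0; input.len()];
        let _ = device.relu_async(&input, &mut output).await;
    });
}
```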
### Conditional Compilation
Sync methods are only available on native platforms:
```rust
#[cfg(all(feature = "gpu", not(target_arch = "wasm32")))]
pub fn relu(&self, input: &[f32], result: &mut [f32]) -> Result<(), String> {
    runtime::block_on(self.relu_async(input, result))
}

// Async always available
pub async fn relu_async(&self, input: &[f32], result: &mut [f32]) -> Result<(), String> {
    // Implementation
}
```
## Next Steps
- **[GPU Performance](../performance/gpu-performance.md)** - Detailed benchmarks and thresholds
- **[WASM Backend](./wasm-backend.md)** - SIMD128 for non-GPU WASM
- **[Backend Selection](./backend-selection.md)** - How Trueno chooses backends