trueno-gpu
Pure Rust PTX generation for NVIDIA CUDA - no LLVM, no nvcc, no external dependencies.
Philosophy
Own the Stack - Build everything from first principles for complete control, auditability, and reproducibility.
Features
- Pure Rust PTX Generation: Generate PTX assembly directly from Rust code
- No External Dependencies: No LLVM, nvcc, or CUDA toolkit required for code generation
- Builder Pattern API: Ergonomic API for constructing PTX modules and kernels
- Hand-Optimized Kernels: Pre-built kernels for common ML operations
Quick Start
use …;

// Build a vector addition kernel
let module = …::new()
    .version(…)
    .target(…)
    .address_size(…);

let ptx_source = module.emit();
assert!(…);
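To make the builder pattern concrete without depending on the crate, here is a minimal, self-contained sketch. `SketchModule`, its fields, and its defaults are hypothetical stand-ins and not the trueno-gpu API; the only thing taken as given is that a PTX module opens with the standard `.version`, `.target`, and `.address_size` directives.

```rust
/// Hypothetical stand-in for a PTX module builder (NOT the trueno-gpu type).
#[derive(Debug, Clone)]
struct SketchModule {
    version: (u32, u32),
    target: String,
    address_size: u32,
}

impl SketchModule {
    /// Assumed defaults for illustration: PTX ISA 8.0, sm_70, 64-bit addresses.
    fn new() -> Self {
        Self {
            version: (8, 0),
            target: "sm_70".to_string(),
            address_size: 64,
        }
    }

    /// Each setter takes `self` by value and returns it, so calls can be chained.
    fn version(mut self, major: u32, minor: u32) -> Self {
        self.version = (major, minor);
        self
    }

    fn target(mut self, target: &str) -> Self {
        self.target = target.to_string();
        self
    }

    fn address_size(mut self, bits: u32) -> Self {
        self.address_size = bits;
        self
    }

    /// Emit the three directives that open a PTX module.
    fn emit(&self) -> String {
        format!(
            ".version {}.{}\n.target {}\n.address_size {}\n",
            self.version.0, self.version.1, self.target, self.address_size
        )
    }
}

fn main() {
    let ptx = SketchModule::new()
        .version(8, 0)
        .target("sm_80")
        .address_size(64)
        .emit();
    assert!(ptx.starts_with(".version 8.0"));
    print!("{ptx}");
}
```

Because the output is a plain string, this style of generator can be unit-tested on machines without a GPU or the CUDA toolkit.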
Available Kernels
| Kernel | Description |
|---|---|
| GEMM | Matrix multiplication (naive, tiled, tensor core) |
| GEMV | Matrix-vector multiply with warp shuffle reduction |
| Softmax | Numerically stable softmax with warp shuffle |
| LayerNorm | Fused layer normalization |
| Attention | FlashAttention-style tiled attention |
| BiasActivation | Fused bias + activation epilogue (None/ReLU/GELU) |
| Quantize | Q4_K/Q5_K/Q6_K dequantization fused with matmul |
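As an illustration of what "numerically stable" means for the Softmax entry, the following is a CPU reference sketch, not the crate's kernel. The function name `softmax` is an assumption for this example. Subtracting the row maximum before exponentiating keeps `exp` from overflowing; on the GPU, the max and sum reductions would be the steps performed with warp shuffles, whereas here they are plain iterator folds.

```rust
/// CPU reference for numerically stable softmax over one row.
/// Hypothetical helper for illustration; not part of the trueno-gpu API.
fn softmax(row: &[f32]) -> Vec<f32> {
    if row.is_empty() {
        return Vec::new();
    }
    // Reduction 1: row maximum, so every exponent below is <= 0.
    let max = row.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = row.iter().map(|&x| (x - max).exp()).collect();
    // Reduction 2: sum of the shifted exponentials.
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

fn main() {
    // A naive exp(1000.0) overflows f32; the shifted version stays finite.
    let probs = softmax(&[1000.0, 1001.0, 1002.0]);
    let total: f32 = probs.iter().sum();
    assert!(probs.iter().all(|p| p.is_finite()));
    assert!((total - 1.0).abs() < 1e-5);
    println!("{probs:?}");
}
```

A reference like this is useful for checking GPU output within a floating-point tolerance.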
Usage
use …;

// Create a tiled GEMM kernel
let kernel = …::tiled(…);
let ptx = kernel.emit_ptx();

// The PTX can be loaded by the CUDA driver API
println!(…);
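To show the loop structure that "tiled" refers to, here is a CPU sketch of a tiled matrix multiply. It is not the generated kernel: `gemm_tiled`, its parameter order, and the row-major layout are assumptions made for this example. On a GPU the `tile` x `tile` blocks would typically be staged in shared memory; here the tiling only changes the order in which the same products are accumulated.

```rust
/// Row-major C = A * B, with A of shape m x k and B of shape k x n,
/// processed in `tile` x `tile` blocks.
/// Hypothetical CPU reference; not part of the trueno-gpu API.
fn gemm_tiled(a: &[f32], b: &[f32], m: usize, n: usize, k: usize, tile: usize) -> Vec<f32> {
    assert!(tile > 0, "tile size must be non-zero");
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    let mut c = vec![0.0f32; m * n];
    for i0 in (0..m).step_by(tile) {
        for j0 in (0..n).step_by(tile) {
            for p0 in (0..k).step_by(tile) {
                // Accumulate one tile-by-tile partial product into the output block.
                for i in i0..(i0 + tile).min(m) {
                    for j in j0..(j0 + tile).min(n) {
                        let mut acc = c[i * n + j];
                        for p in p0..(p0 + tile).min(k) {
                            acc += a[i * k + p] * b[p * n + j];
                        }
                        c[i * n + j] = acc;
                    }
                }
            }
        }
    }
    c
}

fn main() {
    // A is 2x3, B is 3x2; a tile size of 2 exercises the ragged edge along k.
    let a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0];
    let b = [7.0, 8.0, 9.0, 10.0, 11.0, 12.0];
    let c = gemm_tiled(&a, &b, 2, 2, 3, 2);
    assert_eq!(c, vec![58.0, 64.0, 139.0, 154.0]);
    println!("{c:?}");
}
```

The `.min(...)` bounds handle dimensions that are not multiples of the tile size, which is the same edge case a GPU kernel has to guard against.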
Examples
- PTX quickstart: basic vector addition
- GEMM kernel variants (naive, tiled, tensor core)
- Bias + Activation epilogue kernel (ReLU, GELU)
- Quantized GEMM (Q5_K, Q6_K formats)
- FlashAttention (requires CUDA)
- Register allocation visualization
Modules
- ptx: PTX code generation (builder pattern)
- kernels: Hand-optimized GPU kernels
- driver: CUDA driver API (minimal FFI, optional)
- memory: GPU memory management
- backend: Multi-backend abstraction
Requirements
- Rust 1.70+
- For GPU execution: NVIDIA CUDA driver (optional, only needed to run generated PTX)
License
MIT License - see LICENSE for details.
Part of Trueno
This crate is part of the Trueno high-performance compute library.