oxicuda-ptx
Pure Rust PTX code generation DSL and intermediate representation for GPU kernel development.
Part of the OxiCUDA project.
Overview
oxicuda-ptx provides a complete Rust-native DSL and intermediate representation
for generating NVIDIA PTX (Parallel Thread Execution) assembly code at runtime.
It eliminates the dependency on nvcc, the proprietary CUDA Toolkit, or any
C/C++ compiler toolchain -- PTX text is constructed entirely from safe Rust code.
The crate spans every stage of kernel authoring: a typed IR that catches operand mismatches at construction time, a fluent builder API for rapid kernel prototyping, high-level templates for common workloads (GEMM, reduction, softmax, elementwise), and Tensor Core helpers covering three hardware generations (WMMA, MMA, WGMMA). A disk-based cache avoids redundant regeneration of identical kernels.
Architecture
| Module | Purpose |
|---|---|
ir |
Typed intermediate representation -- registers, ~40 opcodes, basic blocks, functions, modules |
builder |
Fluent builder API -- KernelBuilder (params, shared mem, target) and BodyBuilder (register alloc, arithmetic, control flow, sync) |
templates |
Parameterized kernel templates -- elementwise, GEMM, reduction, softmax |
tensor_core |
Tensor Core instruction generation -- WMMA (sm_70+), MMA (sm_80+), WGMMA (sm_90+) |
emit |
PTX text printer and structural validator |
arch |
Architecture definitions and capability queries (sm_75 through sm_120) |
cache |
Disk-based content-addressable PTX kernel cache |
error |
Error types for all PTX generation failure modes |
Supported Data Types
F16, BF16, F32, F64, U8--U64, S8--S64, Pred, B16--B64.
Supported Architectures
sm_75 (Turing), sm_80/sm_86 (Ampere), sm_89 (Ada Lovelace), sm_90/sm_90a (Hopper), sm_100 (Blackwell), sm_120 (Next-gen Blackwell).
Quick Start
use *;
// Build a vector-add kernel targeting Ampere
let ptx = new
.target
.param
.param
.param
.param
.body
.build
.expect;
Low-Level IR
use *;
let mut alloc = new;
let tid = alloc.alloc;
let inst = MovSpecial ;
assert!;
Templates
High-level templates handle shared memory layout, thread coordination, and architecture-specific optimizations automatically:
ElementwiseTemplate-- unary/binary ops (add, relu, sigmoid)ReductionTemplate-- parallel block-level reductions (sum, max, min)GemmTemplate-- tiled matrix multiplication with epilogue supportSoftmaxTemplate-- numerically stable row-wise softmax
Features
All functionality is available by default -- no optional feature flags. The crate is 100% pure Rust with zero external tool requirements for PTX text generation.
License
Apache-2.0 -- (C) 2026 COOLJAPAN OU (Team KitaSan)