§OxiCUDA PTX – Pure Rust PTX Code Generation DSL
oxicuda-ptx provides a complete Rust-native DSL and intermediate representation
for generating NVIDIA PTX (Parallel Thread Execution) assembly code at runtime.
It eliminates the dependency on nvcc, the proprietary CUDA Toolkit, or any
C/C++ compiler toolchain – PTX text is constructed entirely from safe Rust code.
§Crate Architecture
The crate is organized into the following major subsystems:
| Module | Purpose |
|---|---|
| `ir` | Typed intermediate representation for PTX instructions |
| `builder` | Ergonomic fluent builder API for kernel construction |
| `templates` | High-level templates for GEMM, reduction, softmax, etc. |
| `tensor_core` | Tensor Core instruction helpers (WMMA, MMA, WGMMA) |
| `emit` | PTX text printer and validation passes |
| `arch` | Architecture definitions and capability queries (sm_75+) |
| `cache` | Disk-based PTX kernel cache for avoiding regeneration |
| `error` | Error types for all PTX generation failure modes |
§Supported Architectures
PTX generation targets NVIDIA architectures from Turing through Blackwell:
- `sm_75` – Turing (RTX 20xx, T4)
- `sm_80` / `sm_86` – Ampere (A100, RTX 30xx)
- `sm_89` – Ada Lovelace (RTX 40xx, L40)
- `sm_90` / `sm_90a` – Hopper (H100, H200)
- `sm_100` – Blackwell (B100, B200)
- `sm_120` – Next-generation Blackwell
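The mapping from an architecture value to the PTX `.target` directive can be sketched in plain Rust. This is a standalone illustration; the crate's actual `SmVersion` type and method names may differ:

```rust
// A standalone sketch of how an SmVersion-style enum can map to the PTX
// `.target` directive; the crate's real SmVersion API may differ.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
enum SmVersion {
    Sm75,
    Sm80,
    Sm86,
    Sm89,
    Sm90,
    Sm90a,
    Sm100,
    Sm120,
}

impl SmVersion {
    /// The `.target` directive placed near the top of a PTX module.
    fn target_directive(self) -> &'static str {
        match self {
            SmVersion::Sm75 => ".target sm_75",
            SmVersion::Sm80 => ".target sm_80",
            SmVersion::Sm86 => ".target sm_86",
            SmVersion::Sm89 => ".target sm_89",
            SmVersion::Sm90 => ".target sm_90",
            SmVersion::Sm90a => ".target sm_90a",
            SmVersion::Sm100 => ".target sm_100",
            SmVersion::Sm120 => ".target sm_120",
        }
    }

    /// Capability check via the derived ordering (newer >= older).
    fn at_least(self, other: SmVersion) -> bool {
        self >= other
    }
}

fn main() {
    assert_eq!(SmVersion::Sm80.target_directive(), ".target sm_80");
    assert!(SmVersion::Sm90.at_least(SmVersion::Sm75));
    assert!(!SmVersion::Sm75.at_least(SmVersion::Sm89));
    println!("{}", SmVersion::Sm90a.target_directive());
}
```

Deriving `Ord` from declaration order keeps "is this architecture at least X?" queries trivial, which is the shape of check architecture-aware code generation needs most often.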
§Design Principles
- **Type-safe IR**: Every PTX register carries its type, preventing mismatched operand types at construction time rather than at `ptxas` assembly time.
- **Zero external tools**: No `nvcc`, `ptxas`, or CUDA Toolkit installation required for PTX text generation (only needed for final binary compilation).
- **Architecture-aware**: Templates and builders automatically select optimal instruction sequences based on the target `SmVersion`.
- **Composable**: The IR, builder, and template layers compose freely – templates produce the same IR types that manual builders do.
- **Cacheable**: Generated PTX is deterministic and can be cached to disk via `PtxCache`, keyed by kernel name, parameters, and target architecture.
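The cacheability principle rests on determinism: the same (name, parameters, target) triple must always map to the same cache entry. A minimal sketch of such a key, using only the standard library (the crate's actual `PtxCacheKey` may be structured differently):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical sketch of a PtxCacheKey-style key. The real PtxCache API
/// may differ; the point is that identical inputs always produce the
/// same cache file name, so regeneration can be skipped.
#[derive(Hash)]
struct CacheKey<'a> {
    kernel_name: &'a str,
    params: &'a [&'a str],
    target: &'a str, // e.g. "sm_80"
}

impl CacheKey<'_> {
    fn file_name(&self) -> String {
        let mut h = DefaultHasher::new();
        self.hash(&mut h);
        format!("{}-{}-{:016x}.ptx", self.kernel_name, self.target, h.finish())
    }
}

fn main() {
    let a = CacheKey {
        kernel_name: "vector_add",
        params: &["a_ptr", "b_ptr", "c_ptr", "n"],
        target: "sm_80",
    };
    let b = CacheKey {
        kernel_name: "vector_add",
        params: &["a_ptr", "b_ptr", "c_ptr", "n"],
        target: "sm_80",
    };
    // Deterministic: identical inputs yield the same cache file name.
    assert_eq!(a.file_name(), b.file_name());
    let c = CacheKey { target: "sm_90", ..a };
    // A different target architecture yields a different entry.
    assert_ne!(a.file_name(), c.file_name());
    println!("{}", a.file_name());
}
```

Note that `DefaultHasher` is only stable within one process; a real on-disk cache would want a stable content hash so entries survive recompilation.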
§Quick Start
```rust
use oxicuda_ptx::prelude::*;

// Build a vector-add kernel targeting Ampere
let ptx = KernelBuilder::new("vector_add")
    .target(SmVersion::Sm80)
    .param("a_ptr", PtxType::U64)
    .param("b_ptr", PtxType::U64)
    .param("c_ptr", PtxType::U64)
    .param("n", PtxType::U32)
    .body(|b| {
        let gid = b.global_thread_id_x();
        // ... load, add, store ...
    })
    .build()
    .expect("PTX generation failed");
```

§Low-Level IR Usage
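As a concrete picture of what the IR layer emits, consider the canonical PTX sequence a helper like `global_thread_id_x()` must produce: `gid = ctaid.x * ntid.x + tid.x`. A standalone sketch (not the crate's actual emitter) using `%r<N>` virtual registers:

```rust
// Allocate the next sequential %r<N> virtual register name.
fn fresh(n: &mut u32) -> String {
    *n += 1;
    format!("%r{}", *n)
}

// Emit the classic global-thread-index sequence:
// gid = ctaid.x * ntid.x + tid.x
fn emit_global_tid_x(next_reg: &mut u32) -> (String, String) {
    let tid = fresh(next_reg);
    let ctaid = fresh(next_reg);
    let ntid = fresh(next_reg);
    let gid = fresh(next_reg);
    let ptx = format!(
        "mov.u32 {tid}, %tid.x;\n\
         mov.u32 {ctaid}, %ctaid.x;\n\
         mov.u32 {ntid}, %ntid.x;\n\
         mad.lo.u32 {gid}, {ctaid}, {ntid}, {tid};\n"
    );
    (gid, ptx)
}

fn main() {
    let mut regs = 0;
    let (gid, ptx) = emit_global_tid_x(&mut regs);
    assert_eq!(gid, "%r4");
    assert!(ptx.contains("mad.lo.u32 %r4, %r2, %r3, %r1;"));
    print!("{ptx}");
}
```

The `mad.lo.u32` fused multiply-add keeps the index computation to four instructions; a typed IR makes it impossible to accidentally mix, say, a `u64` pointer register into this `u32` arithmetic.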
```rust
use oxicuda_ptx::ir::*;

let mut alloc = RegisterAllocator::new();
let tid = alloc.alloc(PtxType::U32);
let inst = Instruction::MovSpecial {
    dst: tid,
    special: SpecialReg::TidX,
};
assert!(inst.emit().contains("%tid.x"));
```

§Template-Based Generation
For common patterns, use the high-level templates, which handle shared memory, thread coordination, and architecture-specific optimizations automatically:
- `templates::elementwise` – Unary/binary elementwise ops (add, relu, sigmoid)
- `templates::reduction` – Parallel block-level reductions (sum, max, min)
- `templates::gemm` – Matrix multiplication kernels
- `templates::softmax` – Numerically stable row-wise softmax
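At its core, an elementwise template only needs to splice the right PTX arithmetic instruction into a fixed load/compute/store skeleton. A hypothetical sketch of that op-to-instruction mapping (the names mirror, but are not, the crate's actual `ElementwiseOp` API):

```rust
// Hypothetical op enum; the crate's ElementwiseOp may differ.
enum ElementwiseOp {
    Add,
    Mul,
    Relu,
}

/// Render the PTX instruction for one f32 elementwise op.
fn op_instruction(op: &ElementwiseOp, dst: &str, a: &str, b: &str) -> String {
    match op {
        ElementwiseOp::Add => format!("add.f32 {dst}, {a}, {b};"),
        ElementwiseOp::Mul => format!("mul.f32 {dst}, {a}, {b};"),
        // relu(x) = max(x, 0); 0f00000000 is the PTX hex literal for 0.0f
        ElementwiseOp::Relu => format!("max.f32 {dst}, {a}, 0f00000000;"),
    }
}

fn main() {
    assert_eq!(
        op_instruction(&ElementwiseOp::Add, "%f3", "%f1", "%f2"),
        "add.f32 %f3, %f1, %f2;"
    );
    assert_eq!(
        op_instruction(&ElementwiseOp::Relu, "%f2", "%f1", ""),
        "max.f32 %f2, %f1, 0f00000000;"
    );
    println!("ok");
}
```

Everything else in the template (bounds check on `n`, pointer arithmetic, grid-stride loop) stays identical across ops, which is why a single template can cover add, relu, sigmoid, and friends.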
§Tensor Core Support
The tensor_core module provides configuration and code generation helpers
for NVIDIA Tensor Core instructions across three generations:
- `tensor_core::wmma` – WMMA for Volta/Turing+ (`wmma.load`, `wmma.mma`)
- `tensor_core::mma` – MMA for Ampere+ (`mma.sync.aligned`)
- `tensor_core::wgmma` – WGMMA for Hopper+ (warp-group level MMA)
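Much of the work in these helpers is assembling the long dotted instruction name from a shape and data types. A standalone sketch for the Ampere-style `mma.sync.aligned` case (illustrative only; the crate's `MmaConfig`/`MmaShape` may be richer):

```rust
// Hypothetical shape struct; the crate's MmaShape may differ.
struct MmaShape {
    m: u32,
    n: u32,
    k: u32,
}

/// Format an mma.sync.aligned instruction name for row-major A,
/// column-major B, input type `dtype`, and accumulator type `ctype`.
fn mma_instruction(shape: &MmaShape, dtype: &str, ctype: &str) -> String {
    format!(
        "mma.sync.aligned.m{}n{}k{}.row.col.{ctype}.{dtype}.{dtype}.{ctype}",
        shape.m, shape.n, shape.k
    )
}

fn main() {
    // The common Ampere f16 shape: one warp computes a 16x8x16 tile
    // with f32 accumulation.
    let s = MmaShape { m: 16, n: 8, k: 16 };
    assert_eq!(
        mma_instruction(&s, "f16", "f32"),
        "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32"
    );
    println!("{}", mma_instruction(&s, "f16", "f32"));
}
```

Encoding the shape as data rather than a string makes it possible for architecture-aware code to reject shapes an `SmVersion` does not support before any PTX text is produced.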
Re-exports§
```rust
pub use analysis::bank_conflict::{BankConflict, BankConflictReport, analyze_bank_conflicts};
pub use analysis::constant_folding::fold_constants;
pub use analysis::dead_code::eliminate_dead_code;
pub use analysis::instruction_scheduling::{SchedulingReport, SchedulingStrategy, schedule_instructions};
pub use analysis::register_pressure::{RegisterPressureReport, analyze_register_pressure};
pub use analysis::strength_reduction::reduce_strength;
pub use arch::{ArchCapabilities, SmVersion};
pub use ir::{
    AtomOp, BasicBlock, CacheQualifier, CmpOp, FenceScope, ImmValue, Instruction,
    MemorySpace, MmaShape, MulMode, Operand, PtxFunction, PtxModule, PtxType,
    Register, RegisterAllocator, RoundingMode, SpecialReg, SurfaceOp, TextureDim,
    VectorWidth, WgmmaShape, WmmaLayout, WmmaOp, WmmaShape,
};
pub use builder::{BodyBuilder, KernelBuilder};
pub use error::PtxGenError;
pub use cache::{PtxCache, PtxCacheKey};
pub use profile_guided::{
    Bottleneck, BranchProfile, CodeGenDecision, HotSpot, KernelProfile,
    MemoryAccessProfile, ProfileData, ProfileGuidedOptimizer, ProfileMetrics,
    StallReason, TileConfig, apply_profile_decisions,
};
pub use tui_explorer::{ExplorerConfig, PtxExplorer};
pub use emit::printer::{emit_function, emit_function_standalone, emit_module, try_emit_module};
pub use emit::validator::{ValidationError, ValidationResult, validate_ptx, validate_ptx_for_target};
pub use templates::elementwise::{ElementwiseOp, ElementwiseTemplate};
pub use templates::gemm::{EpilogueKind, GemmTemplate};
pub use templates::reduction::{ReductionOp, ReductionTemplate};
pub use templates::softmax::SoftmaxTemplate;
pub use tensor_core::mma::MmaConfig;
pub use tensor_core::wgmma::WgmmaConfig;
pub use tensor_core::wmma::WmmaConfig;
```
Modules§
- `analysis` – Static analysis passes for PTX instruction sequences.
- `arch` – NVIDIA GPU architecture definitions and capability queries.
- `builder` – Ergonomic fluent builder API for PTX kernel construction.
- `cache` – Disk-based PTX kernel cache with content-addressable storage.
- `emit` – PTX text emission (printer) and structural validation.
- `error` – Error types for PTX code generation.
- `features` – Feature flags for `oxicuda-ptx`.
- `ir` – Typed intermediate representation for PTX instructions.
- `prelude` – Convenient wildcard import for the most commonly needed types.
- `profile_guided` – Profile-guided code generation using autotune/profiling data.
- `templates` – High-level parameterized kernel templates for common GPU workloads.
- `tensor_core` – Tensor Core instruction configuration and generation helpers.
- `tui_explorer` – Visual PTX explorer for terminal-based PTX analysis.