§OxiCUDA PTX – Pure Rust PTX Code Generation DSL
oxicuda-ptx provides a complete Rust-native DSL and intermediate representation
for generating NVIDIA PTX (Parallel Thread Execution) assembly code at runtime.
It eliminates the dependency on nvcc, the proprietary CUDA Toolkit, or any
C/C++ compiler toolchain – PTX text is constructed entirely from safe Rust code.
§Crate Architecture
The crate is organized into the following major subsystems:
| Module | Purpose |
|---|---|
| `ir` | Typed intermediate representation for PTX instructions |
| `builder` | Ergonomic fluent builder API for kernel construction |
| `templates` | High-level templates for GEMM, reduction, softmax, etc. |
| `tensor_core` | Tensor Core instruction helpers (WMMA, MMA, WGMMA) |
| `emit` | PTX text printer and validation passes |
| `arch` | Architecture definitions and capability queries (sm_75+) |
| `cache` | Disk-based PTX kernel cache for avoiding regeneration |
| `error` | Error types for all PTX generation failure modes |
§Supported Architectures
PTX generation targets NVIDIA architectures from Turing through Blackwell:
- `sm_75` – Turing (RTX 20xx, T4)
- `sm_80` / `sm_86` – Ampere (A100, RTX 30xx)
- `sm_89` – Ada Lovelace (RTX 40xx, L40)
- `sm_90` / `sm_90a` – Hopper (H100, H200)
- `sm_100` – Blackwell (B100, B200)
- `sm_120` – Next-generation Blackwell
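The mapping from an architecture value to the PTX `.target` directive can be sketched in plain Rust. This is a standalone illustration; the crate's actual `SmVersion` type and method names may differ:

```rust
// A standalone sketch of how an SmVersion-style enum can map to the PTX
// `.target` directive; the crate's real SmVersion API may differ.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
enum SmVersion {
    Sm75,
    Sm80,
    Sm86,
    Sm89,
    Sm90,
    Sm90a,
    Sm100,
    Sm120,
}

impl SmVersion {
    /// The `.target` directive placed near the top of a PTX module.
    fn target_directive(self) -> &'static str {
        match self {
            SmVersion::Sm75 => ".target sm_75",
            SmVersion::Sm80 => ".target sm_80",
            SmVersion::Sm86 => ".target sm_86",
            SmVersion::Sm89 => ".target sm_89",
            SmVersion::Sm90 => ".target sm_90",
            SmVersion::Sm90a => ".target sm_90a",
            SmVersion::Sm100 => ".target sm_100",
            SmVersion::Sm120 => ".target sm_120",
        }
    }

    /// Capability check via the derived ordering (newer >= older).
    fn at_least(self, other: SmVersion) -> bool {
        self >= other
    }
}

fn main() {
    assert_eq!(SmVersion::Sm80.target_directive(), ".target sm_80");
    assert!(SmVersion::Sm90.at_least(SmVersion::Sm75));
    assert!(!SmVersion::Sm75.at_least(SmVersion::Sm89));
    println!("{}", SmVersion::Sm90a.target_directive());
}
```

Deriving `Ord` from declaration order keeps "is this architecture at least X?" queries trivial, which is the shape of check architecture-aware code generation needs most often.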
§Design Principles
- **Type-safe IR**: Every PTX register carries its type, preventing mismatched operand types at construction time rather than at `ptxas` assembly time.
- **Zero external tools**: No `nvcc`, `ptxas`, or CUDA Toolkit installation required for PTX text generation (only needed for final binary compilation).
- **Architecture-aware**: Templates and builders automatically select optimal instruction sequences based on the target `SmVersion`.
- **Composable**: The IR, builder, and template layers compose freely – templates produce the same IR types that manual builders do.
- **Cacheable**: Generated PTX is deterministic and can be cached to disk via `PtxCache`, keyed by kernel name, parameters, and target architecture.
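The cacheability principle rests on determinism: the same (name, parameters, target) triple must always map to the same cache entry. A minimal sketch of such a key, using only the standard library (the crate's actual `PtxCacheKey` may be structured differently):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical sketch of a PtxCacheKey-style key. The real PtxCache API
/// may differ; the point is that identical inputs always produce the
/// same cache file name, so regeneration can be skipped.
#[derive(Hash)]
struct CacheKey<'a> {
    kernel_name: &'a str,
    params: &'a [&'a str],
    target: &'a str, // e.g. "sm_80"
}

impl CacheKey<'_> {
    fn file_name(&self) -> String {
        let mut h = DefaultHasher::new();
        self.hash(&mut h);
        format!("{}-{}-{:016x}.ptx", self.kernel_name, self.target, h.finish())
    }
}

fn main() {
    let a = CacheKey {
        kernel_name: "vector_add",
        params: &["a_ptr", "b_ptr", "c_ptr", "n"],
        target: "sm_80",
    };
    let b = CacheKey {
        kernel_name: "vector_add",
        params: &["a_ptr", "b_ptr", "c_ptr", "n"],
        target: "sm_80",
    };
    // Deterministic: identical inputs yield the same cache file name.
    assert_eq!(a.file_name(), b.file_name());
    let c = CacheKey { target: "sm_90", ..a };
    // A different target architecture yields a different entry.
    assert_ne!(a.file_name(), c.file_name());
    println!("{}", a.file_name());
}
```

Note that `DefaultHasher` is only stable within one process; a real on-disk cache would want a stable content hash so entries survive recompilation.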
§Quick Start
```rust
use oxicuda_ptx::prelude::*;

// Build a vector-add kernel targeting Ampere
let ptx = KernelBuilder::new("vector_add")
    .target(SmVersion::Sm80)
    .param("a_ptr", PtxType::U64)
    .param("b_ptr", PtxType::U64)
    .param("c_ptr", PtxType::U64)
    .param("n", PtxType::U32)
    .body(|b| {
        let gid = b.global_thread_id_x();
        // ... load, add, store ...
    })
    .build()
    .expect("PTX generation failed");
```

§Low-Level IR Usage
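As a concrete picture of what the IR layer emits, consider the canonical PTX sequence a helper like `global_thread_id_x()` must produce: `gid = ctaid.x * ntid.x + tid.x`. A standalone sketch (not the crate's actual emitter) using `%r<N>` virtual registers:

```rust
// Allocate the next sequential %r<N> virtual register name.
fn fresh(n: &mut u32) -> String {
    *n += 1;
    format!("%r{}", *n)
}

// Emit the classic global-thread-index sequence:
// gid = ctaid.x * ntid.x + tid.x
fn emit_global_tid_x(next_reg: &mut u32) -> (String, String) {
    let tid = fresh(next_reg);
    let ctaid = fresh(next_reg);
    let ntid = fresh(next_reg);
    let gid = fresh(next_reg);
    let ptx = format!(
        "mov.u32 {tid}, %tid.x;\n\
         mov.u32 {ctaid}, %ctaid.x;\n\
         mov.u32 {ntid}, %ntid.x;\n\
         mad.lo.u32 {gid}, {ctaid}, {ntid}, {tid};\n"
    );
    (gid, ptx)
}

fn main() {
    let mut regs = 0;
    let (gid, ptx) = emit_global_tid_x(&mut regs);
    assert_eq!(gid, "%r4");
    assert!(ptx.contains("mad.lo.u32 %r4, %r2, %r3, %r1;"));
    print!("{ptx}");
}
```

The `mad.lo.u32` fused multiply-add keeps the index computation to four instructions; a typed IR makes it impossible to accidentally mix, say, a `u64` pointer register into this `u32` arithmetic.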
```rust
use oxicuda_ptx::ir::*;

let mut alloc = RegisterAllocator::new();
let tid = alloc.alloc(PtxType::U32);
let inst = Instruction::MovSpecial {
    dst: tid,
    special: SpecialReg::TidX,
};
assert!(inst.emit().contains("%tid.x"));
```

§Template-Based Generation
For common patterns, use the high-level templates, which handle shared memory, thread coordination, and architecture-specific optimizations automatically:
- `templates::elementwise` – Unary/binary elementwise ops (add, relu, sigmoid)
- `templates::reduction` – Parallel block-level reductions (sum, max, min)
- `templates::gemm` – Matrix multiplication kernels
- `templates::softmax` – Numerically stable row-wise softmax
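At its core, an elementwise template only needs to splice the right PTX arithmetic instruction into a fixed load/compute/store skeleton. A hypothetical sketch of that op-to-instruction mapping (the names mirror, but are not, the crate's actual `ElementwiseOp` API):

```rust
// Hypothetical op enum; the crate's ElementwiseOp may differ.
enum ElementwiseOp {
    Add,
    Mul,
    Relu,
}

/// Render the PTX instruction for one f32 elementwise op.
fn op_instruction(op: &ElementwiseOp, dst: &str, a: &str, b: &str) -> String {
    match op {
        ElementwiseOp::Add => format!("add.f32 {dst}, {a}, {b};"),
        ElementwiseOp::Mul => format!("mul.f32 {dst}, {a}, {b};"),
        // relu(x) = max(x, 0); 0f00000000 is the PTX hex literal for 0.0f
        ElementwiseOp::Relu => format!("max.f32 {dst}, {a}, 0f00000000;"),
    }
}

fn main() {
    assert_eq!(
        op_instruction(&ElementwiseOp::Add, "%f3", "%f1", "%f2"),
        "add.f32 %f3, %f1, %f2;"
    );
    assert_eq!(
        op_instruction(&ElementwiseOp::Relu, "%f2", "%f1", ""),
        "max.f32 %f2, %f1, 0f00000000;"
    );
    println!("ok");
}
```

Everything else in the template (bounds check on `n`, pointer arithmetic, grid-stride loop) stays identical across ops, which is why a single template can cover add, relu, sigmoid, and friends.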
§Tensor Core Support
The tensor_core module provides configuration and code generation helpers
for NVIDIA Tensor Core instructions across three generations:
- `tensor_core::wmma` – WMMA for Volta/Turing+ (`wmma.load`, `wmma.mma`)
- `tensor_core::mma` – MMA for Ampere+ (`mma.sync.aligned`)
- `tensor_core::wgmma` – WGMMA for Hopper+ (warp-group level MMA)
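Much of the work in these helpers is assembling the long dotted instruction name from a shape and data types. A standalone sketch for the Ampere-style `mma.sync.aligned` case (illustrative only; the crate's `MmaConfig`/`MmaShape` may be richer):

```rust
// Hypothetical shape struct; the crate's MmaShape may differ.
struct MmaShape {
    m: u32,
    n: u32,
    k: u32,
}

/// Format an mma.sync.aligned instruction name for row-major A,
/// column-major B, input type `dtype`, and accumulator type `ctype`.
fn mma_instruction(shape: &MmaShape, dtype: &str, ctype: &str) -> String {
    format!(
        "mma.sync.aligned.m{}n{}k{}.row.col.{ctype}.{dtype}.{dtype}.{ctype}",
        shape.m, shape.n, shape.k
    )
}

fn main() {
    // The common Ampere f16 shape: one warp computes a 16x8x16 tile
    // with f32 accumulation.
    let s = MmaShape { m: 16, n: 8, k: 16 };
    assert_eq!(
        mma_instruction(&s, "f16", "f32"),
        "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32"
    );
    println!("{}", mma_instruction(&s, "f16", "f32"));
}
```

Encoding the shape as data rather than a string makes it possible for architecture-aware code to reject shapes an `SmVersion` does not support before any PTX text is produced.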
Re-exports§
```rust
pub use analysis::bank_conflict::{BankConflict, BankConflictReport, analyze_bank_conflicts};
pub use analysis::constant_folding::fold_constants;
pub use analysis::dead_code::eliminate_dead_code;
pub use analysis::instruction_scheduling::{SchedulingReport, SchedulingStrategy, schedule_instructions};
pub use analysis::register_pressure::{RegisterPressureReport, analyze_register_pressure};
pub use analysis::strength_reduction::reduce_strength;
pub use arch::{ArchCapabilities, SmVersion};
pub use ir::{
    AtomOp, BasicBlock, CacheQualifier, CmpOp, FenceScope, ImmValue, Instruction,
    MemorySpace, MmaShape, MulMode, Operand, PtxFunction, PtxModule, PtxType,
    Register, RegisterAllocator, RoundingMode, SpecialReg, SurfaceOp, TextureDim,
    VectorWidth, WgmmaShape, WmmaLayout, WmmaOp, WmmaShape,
};
pub use builder::{BodyBuilder, KernelBuilder};
pub use error::PtxGenError;
pub use cache::{PtxCache, PtxCacheKey};
pub use profile_guided::{
    Bottleneck, BranchProfile, CodeGenDecision, HotSpot, KernelProfile,
    MemoryAccessProfile, ProfileData, ProfileGuidedOptimizer, ProfileMetrics,
    StallReason, TileConfig, apply_profile_decisions,
};
pub use tui_explorer::{ExplorerConfig, PtxExplorer};
pub use emit::printer::{emit_function, emit_function_standalone, emit_module, try_emit_module};
pub use emit::validator::{ValidationError, ValidationResult, validate_ptx, validate_ptx_for_target};
pub use templates::elementwise::{ElementwiseOp, ElementwiseTemplate};
pub use templates::gemm::{EpilogueKind, GemmTemplate};
pub use templates::reduction::{ReductionOp, ReductionTemplate};
pub use templates::softmax::SoftmaxTemplate;
pub use tensor_core::mma::MmaConfig;
pub use tensor_core::wgmma::WgmmaConfig;
pub use tensor_core::wmma::WmmaConfig;
```
Modules§
- `analysis` – Static analysis passes for PTX instruction sequences.
- `arch` – NVIDIA GPU architecture definitions and capability queries.
- `builder` – Ergonomic fluent builder API for PTX kernel construction.
- `cache` – Disk-based PTX kernel cache with content-addressable storage.
- `emit` – PTX text emission (printer) and structural validation.
- `error` – Error types for PTX code generation.
- `features` – Feature flags for `oxicuda-ptx`.
- `ir` – Typed intermediate representation for PTX instructions.
- `prelude` – Convenient wildcard import for the most commonly needed types.
- `profile_guided` – Profile-guided code generation using autotune/profiling data.
- `templates` – High-level parameterized kernel templates for common GPU workloads.
- `tensor_core` – Tensor Core instruction configuration and generation helpers.
- `tui_explorer` – Visual PTX explorer for terminal-based PTX analysis.