Skip to main content

Crate oxicuda_ptx

Crate oxicuda_ptx 

Source
Expand description

§OxiCUDA PTX – Pure Rust PTX Code Generation DSL

oxicuda-ptx provides a complete Rust-native DSL and intermediate representation for generating NVIDIA PTX (Parallel Thread Execution) assembly code at runtime. It eliminates the dependency on nvcc, the proprietary CUDA Toolkit, or any C/C++ compiler toolchain – PTX text is constructed entirely from safe Rust code.

§Crate Architecture

The crate is organized into six major subsystems:

ModulePurpose
irTyped intermediate representation for PTX instructions
builderErgonomic fluent builder API for kernel construction
templatesHigh-level templates for GEMM, reduction, softmax, etc.
tensor_coreTensor Core instruction helpers (WMMA, MMA, WGMMA)
emitPTX text printer and validation passes
archArchitecture definitions and capability queries (sm_75+)
cacheDisk-based PTX kernel cache for avoiding regeneration
errorError types for all PTX generation failure modes

§Supported Architectures

PTX generation targets NVIDIA architectures from Turing through Blackwell:

  • sm_75 – Turing (RTX 20xx, T4)
  • sm_80 / sm_86 – Ampere (A100, RTX 30xx)
  • sm_89 – Ada Lovelace (RTX 40xx, L40)
  • sm_90 / sm_90a – Hopper (H100, H200)
  • sm_100 – Blackwell (B100, B200)
  • sm_120 – Next-generation Blackwell

§Design Principles

  1. Type-safe IR: Every PTX register carries its type, preventing mismatched operand types at construction time rather than at ptxas assembly time.
  2. Zero external tools: No nvcc, ptxas, or CUDA Toolkit installation required for PTX text generation (only needed for final binary compilation).
  3. Architecture-aware: Templates and builders automatically select optimal instruction sequences based on the target SmVersion.
  4. Composable: The IR, builder, and template layers compose freely – templates produce the same IR types that manual builders do.
  5. Cacheable: Generated PTX is deterministic and can be cached to disk via PtxCache, keyed by kernel name, parameters, and target architecture.

§Quick Start

use oxicuda_ptx::prelude::*;

// Build a vector-add kernel targeting Ampere
let ptx = KernelBuilder::new("vector_add")
    .target(SmVersion::Sm80)
    .param("a_ptr", PtxType::U64)
    .param("b_ptr", PtxType::U64)
    .param("c_ptr", PtxType::U64)
    .param("n", PtxType::U32)
    .body(|b| {
        let gid = b.global_thread_id_x();
        // ... load, add, store ...
    })
    .build()
    .expect("PTX generation failed");

§Low-Level IR Usage

use oxicuda_ptx::ir::*;

let mut alloc = RegisterAllocator::new();
let tid = alloc.alloc(PtxType::U32);

let inst = Instruction::MovSpecial {
    dst: tid,
    special: SpecialReg::TidX,
};
assert!(inst.emit().contains("%tid.x"));

§Template-Based Generation

For common patterns, use the high-level templates which handle shared memory, thread coordination, and architecture-specific optimizations automatically:

§Tensor Core Support

The tensor_core module provides configuration and code generation helpers for NVIDIA Tensor Core instructions across three generations:

Re-exports§

pub use analysis::bank_conflict::BankConflict;
pub use analysis::bank_conflict::BankConflictReport;
pub use analysis::bank_conflict::analyze_bank_conflicts;
pub use analysis::constant_folding::fold_constants;
pub use analysis::dead_code::eliminate_dead_code;
pub use analysis::instruction_scheduling::SchedulingReport;
pub use analysis::instruction_scheduling::SchedulingStrategy;
pub use analysis::instruction_scheduling::schedule_instructions;
pub use analysis::register_pressure::RegisterPressureReport;
pub use analysis::register_pressure::analyze_register_pressure;
pub use analysis::strength_reduction::reduce_strength;
pub use arch::ArchCapabilities;
pub use arch::SmVersion;
pub use ir::AtomOp;
pub use ir::BasicBlock;
pub use ir::CacheQualifier;
pub use ir::CmpOp;
pub use ir::FenceScope;
pub use ir::ImmValue;
pub use ir::Instruction;
pub use ir::MemorySpace;
pub use ir::MmaShape;
pub use ir::MulMode;
pub use ir::Operand;
pub use ir::PtxFunction;
pub use ir::PtxModule;
pub use ir::PtxType;
pub use ir::Register;
pub use ir::RegisterAllocator;
pub use ir::RoundingMode;
pub use ir::SpecialReg;
pub use ir::SurfaceOp;
pub use ir::TextureDim;
pub use ir::VectorWidth;
pub use ir::WgmmaShape;
pub use ir::WmmaLayout;
pub use ir::WmmaOp;
pub use ir::WmmaShape;
pub use builder::BodyBuilder;
pub use builder::KernelBuilder;
pub use error::PtxGenError;
pub use cache::PtxCache;
pub use cache::PtxCacheKey;
pub use profile_guided::Bottleneck;
pub use profile_guided::BranchProfile;
pub use profile_guided::CodeGenDecision;
pub use profile_guided::HotSpot;
pub use profile_guided::KernelProfile;
pub use profile_guided::MemoryAccessProfile;
pub use profile_guided::ProfileData;
pub use profile_guided::ProfileGuidedOptimizer;
pub use profile_guided::ProfileMetrics;
pub use profile_guided::StallReason;
pub use profile_guided::TileConfig;
pub use profile_guided::apply_profile_decisions;
pub use tui_explorer::ExplorerConfig;
pub use tui_explorer::PtxExplorer;
pub use emit::printer::emit_function;
pub use emit::printer::emit_function_standalone;
pub use emit::printer::emit_module;
pub use emit::printer::try_emit_module;
pub use emit::validator::ValidationError;
pub use emit::validator::ValidationResult;
pub use emit::validator::validate_ptx;
pub use emit::validator::validate_ptx_for_target;
pub use templates::elementwise::ElementwiseOp;
pub use templates::elementwise::ElementwiseTemplate;
pub use templates::gemm::EpilogueKind;
pub use templates::gemm::GemmTemplate;
pub use templates::reduction::ReductionOp;
pub use templates::reduction::ReductionTemplate;
pub use templates::softmax::SoftmaxTemplate;
pub use tensor_core::mma::MmaConfig;
pub use tensor_core::wgmma::WgmmaConfig;
pub use tensor_core::wmma::WmmaConfig;

Modules§

analysis
Static analysis passes for PTX instruction sequences. Analysis passes for PTX intermediate representation.
arch
NVIDIA GPU architecture definitions and capability queries. NVIDIA GPU architecture definitions and capability queries.
builder
Ergonomic fluent builder API for PTX kernel construction. High-level PTX kernel builder DSL.
cache
Disk-based PTX kernel cache with content-addressable storage. Disk-based PTX kernel cache.
emit
PTX text emission (printer) and structural validation. PTX text emission and validation.
error
Error types for PTX code generation. Error types for PTX code generation.
features
Feature flags for oxicuda-ptx.
ir
Typed intermediate representation for PTX instructions. PTX Intermediate Representation.
prelude
Convenient wildcard import for the most commonly needed types.
profile_guided
Profile-guided code generation using autotune/profiling data. Profile-guided code generation for PTX kernels.
templates
High-level parameterized kernel templates for common GPU workloads. High-level PTX kernel templates for common GPU operations.
tensor_core
Tensor Core instruction configuration and generation helpers. Tensor Core instruction generation helpers.
tui_explorer
Visual PTX explorer for terminal-based PTX analysis. Visual PTX explorer for terminal-based PTX analysis.