// oxicuda_ptx/lib.rs
//! # `OxiCUDA` PTX -- Pure Rust PTX Code Generation DSL
//!
//! `oxicuda-ptx` provides a complete Rust-native DSL and intermediate representation
//! for generating NVIDIA PTX (Parallel Thread Execution) assembly code at runtime.
//! It eliminates the dependency on `nvcc`, the proprietary CUDA Toolkit, or any
//! C/C++ compiler toolchain -- PTX text is constructed entirely from safe Rust code.
//!
//! ## Crate Architecture
//!
//! The crate is organized into eight major subsystems:
//!
//! | Module | Purpose |
//! |----------------|-----------------------------------------------------------|
//! | [`ir`] | Typed intermediate representation for PTX instructions |
//! | [`builder`] | Ergonomic fluent builder API for kernel construction |
//! | [`templates`] | High-level templates for GEMM, reduction, softmax, etc. |
//! | [`tensor_core`]| Tensor Core instruction helpers (WMMA, MMA, WGMMA) |
//! | [`emit`] | PTX text printer and validation passes |
//! | [`arch`] | Architecture definitions and capability queries (`sm_75`+) |
//! | [`cache`] | Disk-based PTX kernel cache for avoiding regeneration |
//! | [`error`] | Error types for all PTX generation failure modes |
//!
//! ## Supported Architectures
//!
//! PTX generation targets NVIDIA architectures from Turing through Blackwell:
//!
//! - **`sm_75`** -- Turing (RTX 20xx, T4)
//! - **`sm_80` / `sm_86`** -- Ampere (A100, RTX 30xx)
//! - **`sm_89`** -- Ada Lovelace (RTX 40xx, L40)
//! - **`sm_90` / `sm_90a`** -- Hopper (H100, H200)
//! - **`sm_100`** -- Blackwell (B100, B200)
//! - **`sm_120`** -- Consumer Blackwell (GeForce RTX 50xx)
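//!
//! A capability check can be sketched as follows (the `for_sm` constructor and
//! `supports_wgmma` query are illustrative names -- see [`ArchCapabilities`]
//! for the actual API):
//!
//! ```rust,ignore
//! use oxicuda_ptx::arch::{ArchCapabilities, SmVersion};
//!
//! // Query what the target architecture supports before choosing a code path.
//! let caps = ArchCapabilities::for_sm(SmVersion::Sm90);
//! if caps.supports_wgmma() {
//!     // Hopper+: emit warp-group MMA; earlier targets fall back to mma.sync.
//! }
//! ```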
//!
//! ## Design Principles
//!
//! 1. **Type-safe IR**: Every PTX register carries its type, preventing mismatched
//!    operand types at construction time rather than at `ptxas` assembly time.
//! 2. **Zero external tools**: No `nvcc`, `ptxas`, or CUDA Toolkit installation
//!    required for PTX text generation (only needed for final binary compilation).
//! 3. **Architecture-aware**: Templates and builders automatically select optimal
//!    instruction sequences based on the target [`SmVersion`].
//! 4. **Composable**: The IR, builder, and template layers compose freely --
//!    templates produce the same IR types that manual builders do.
//! 5. **Cacheable**: Generated PTX is deterministic and can be cached to disk
//!    via [`PtxCache`], keyed by kernel name, parameters, and target architecture.
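//!
//! The caching flow can be sketched as follows (`open`, `get_or_insert`, and the
//! helper `generate_kernel` are illustrative names -- see [`PtxCache`] and
//! [`PtxCacheKey`] for the actual API):
//!
//! ```rust,ignore
//! use oxicuda_ptx::arch::SmVersion;
//! use oxicuda_ptx::cache::{PtxCache, PtxCacheKey};
//!
//! let cache = PtxCache::open("/tmp/oxicuda-ptx-cache")?;
//! let key = PtxCacheKey::new("vector_add", SmVersion::Sm80, &[("n", "1024")]);
//! // Regenerate only on a cache miss; identical inputs yield identical PTX.
//! let ptx = cache.get_or_insert(&key, || generate_kernel())?;
//! ```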
//!
//! ## Quick Start
//!
//! ```rust,no_run
//! use oxicuda_ptx::prelude::*;
//!
//! // Build a vector-add kernel targeting Ampere
//! let ptx = KernelBuilder::new("vector_add")
//!     .target(SmVersion::Sm80)
//!     .param("a_ptr", PtxType::U64)
//!     .param("b_ptr", PtxType::U64)
//!     .param("c_ptr", PtxType::U64)
//!     .param("n", PtxType::U32)
//!     .body(|b| {
//!         let gid = b.global_thread_id_x();
//!         // ... load, add, store ...
//!     })
//!     .build()
//!     .expect("PTX generation failed");
//! ```
//!
//! ## Low-Level IR Usage
//!
//! ```rust
//! use oxicuda_ptx::ir::*;
//!
//! let mut alloc = RegisterAllocator::new();
//! let tid = alloc.alloc(PtxType::U32);
//!
//! let inst = Instruction::MovSpecial {
//!     dst: tid,
//!     special: SpecialReg::TidX,
//! };
//! assert!(inst.emit().contains("%tid.x"));
//! ```
//!
//! ## Template-Based Generation
//!
//! For common patterns, use the high-level templates which handle shared memory,
//! thread coordination, and architecture-specific optimizations automatically:
//!
//! - [`templates::elementwise`] -- Unary/binary elementwise ops (add, relu, sigmoid)
//! - [`templates::reduction`] -- Parallel block-level reductions (sum, max, min)
//! - [`templates::gemm`] -- Matrix multiplication kernels
//! - [`templates::softmax`] -- Numerically stable row-wise softmax
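//!
//! As a sketch, a block-level sum reduction could be generated like this (the
//! constructor and `generate` method are illustrative names -- see
//! [`templates::reduction`] for the actual API):
//!
//! ```rust,ignore
//! use oxicuda_ptx::prelude::*;
//!
//! // Generate an f32 sum-reduction kernel tuned for Ampere.
//! let ptx = ReductionTemplate::new(ReductionOp::Sum, PtxType::F32)
//!     .target(SmVersion::Sm80)
//!     .generate()?;
//! ```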
//!
//! ## Tensor Core Support
//!
//! The [`tensor_core`] module provides configuration and code generation helpers
//! for NVIDIA Tensor Core instructions across three generations:
//!
//! - [`tensor_core::wmma`] -- WMMA for Volta/Turing+ (`wmma.load`, `wmma.mma`)
//! - [`tensor_core::mma`] -- MMA for Ampere+ (`mma.sync.aligned`)
//! - [`tensor_core::wgmma`] -- WGMMA for Hopper+ (warp-group level MMA)
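//!
//! As a sketch, an MMA configuration might look like this (the constructor,
//! shape variant, and `is_supported_on` query are illustrative names -- see
//! [`tensor_core::mma`] for the actual API):
//!
//! ```rust,ignore
//! use oxicuda_ptx::prelude::*;
//! use oxicuda_ptx::tensor_core::mma::MmaConfig;
//!
//! // m16n8k16 half-precision MMA with f32 accumulation, available on sm_80+.
//! let cfg = MmaConfig::new(MmaShape::M16N8K16, PtxType::F16, PtxType::F32);
//! assert!(cfg.is_supported_on(SmVersion::Sm80));
//! ```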

// ---------------------------------------------------------------------------
// Lint configuration
// ---------------------------------------------------------------------------
#![warn(clippy::all)]
#![warn(clippy::pedantic)]
#![warn(clippy::nursery)]
#![warn(missing_docs)]
#![warn(rustdoc::broken_intra_doc_links)]
#![warn(rustdoc::private_intra_doc_links)]
#![deny(unsafe_code)]
#![allow(clippy::module_name_repetitions)]
#![allow(clippy::must_use_candidate)]
#![allow(clippy::missing_errors_doc)]

// ---------------------------------------------------------------------------
// Module declarations
// ---------------------------------------------------------------------------

/// Static analysis passes for PTX instruction sequences.
pub mod analysis;

/// NVIDIA GPU architecture definitions and capability queries.
pub mod arch;

/// Ergonomic fluent builder API for PTX kernel construction.
pub mod builder;

/// Disk-based PTX kernel cache with content-addressable storage.
pub mod cache;

/// PTX text emission (printer) and structural validation.
pub mod emit;

/// Error types for PTX code generation.
pub mod error;

/// Typed intermediate representation for PTX instructions.
pub mod ir;

/// Profile-guided code generation using autotune/profiling data.
pub mod profile_guided;

/// High-level parameterized kernel templates for common GPU workloads.
pub mod templates;

/// Tensor Core instruction configuration and generation helpers.
pub mod tensor_core;

/// Visual PTX explorer for terminal-based PTX analysis.
pub mod tui_explorer;

// ---------------------------------------------------------------------------
// Public re-exports of key types
// ---------------------------------------------------------------------------

// Analysis
pub use analysis::bank_conflict::{BankConflict, BankConflictReport, analyze_bank_conflicts};
pub use analysis::constant_folding::fold_constants;
pub use analysis::dead_code::eliminate_dead_code;
pub use analysis::instruction_scheduling::{
    SchedulingReport, SchedulingStrategy, schedule_instructions,
};
pub use analysis::register_pressure::{RegisterPressureReport, analyze_register_pressure};
pub use analysis::strength_reduction::reduce_strength;

// Architecture
pub use arch::{ArchCapabilities, SmVersion};

// IR core types
pub use ir::{
    AtomOp, BasicBlock, CacheQualifier, CmpOp, FenceScope, ImmValue, Instruction, MemorySpace,
    MmaShape, MulMode, Operand, PtxFunction, PtxModule, PtxType, Register, RegisterAllocator,
    RoundingMode, SpecialReg, SurfaceOp, TextureDim, VectorWidth, WgmmaShape, WmmaLayout, WmmaOp,
    WmmaShape,
};

// Builder API
pub use builder::{BodyBuilder, KernelBuilder};

// Error type
pub use error::PtxGenError;

// Cache
pub use cache::{PtxCache, PtxCacheKey};

// Profile-guided optimisation
pub use profile_guided::{
    Bottleneck, BranchProfile, CodeGenDecision, HotSpot, KernelProfile, MemoryAccessProfile,
    ProfileData, ProfileGuidedOptimizer, ProfileMetrics, StallReason, TileConfig,
    apply_profile_decisions,
};

// TUI Explorer
pub use tui_explorer::{ExplorerConfig, PtxExplorer};

// Emit utilities
pub use emit::printer::{emit_function, emit_function_standalone, emit_module, try_emit_module};
pub use emit::validator::{
    ValidationError, ValidationResult, validate_ptx, validate_ptx_for_target,
};

// Templates
pub use templates::elementwise::{ElementwiseOp, ElementwiseTemplate};
pub use templates::gemm::{EpilogueKind, GemmTemplate};
pub use templates::reduction::{ReductionOp, ReductionTemplate};
pub use templates::softmax::SoftmaxTemplate;

// Tensor Core configurations
pub use tensor_core::mma::MmaConfig;
pub use tensor_core::wgmma::WgmmaConfig;
pub use tensor_core::wmma::WmmaConfig;

// ---------------------------------------------------------------------------
// Prelude module
// ---------------------------------------------------------------------------

/// Convenient wildcard import for the most commonly needed types.
///
/// ```rust
/// use oxicuda_ptx::prelude::*;
/// ```
///
/// This re-exports the types you need for typical kernel construction:
/// architecture selection, the builder API, core IR types, error handling,
/// and the PTX cache.
pub mod prelude {
    // Architecture
    pub use crate::arch::{ArchCapabilities, SmVersion};

    // Builder API (primary entry point for most users)
    pub use crate::builder::{BodyBuilder, KernelBuilder};

    // Core IR types frequently used in builder closures
    pub use crate::ir::{
        AtomOp, BasicBlock, CacheQualifier, CmpOp, FenceScope, ImmValue, Instruction, MemorySpace,
        MulMode, Operand, PtxFunction, PtxModule, PtxType, Register, RegisterAllocator,
        RoundingMode, SpecialReg, VectorWidth,
    };

    // Tensor Core shape types (needed for MMA instructions in body closures)
    pub use crate::ir::{MmaShape, WgmmaShape, WmmaLayout, WmmaOp, WmmaShape};

    // Error type
    pub use crate::error::PtxGenError;

    // Cache
    pub use crate::cache::{PtxCache, PtxCacheKey};

    // Emit utilities
    pub use crate::emit::printer::emit_module;
    pub use crate::emit::validator::{ValidationResult, validate_ptx};

    // Template types
    pub use crate::templates::elementwise::{ElementwiseOp, ElementwiseTemplate};
    pub use crate::templates::gemm::{EpilogueKind, GemmTemplate};
    pub use crate::templates::reduction::{ReductionOp, ReductionTemplate};
    pub use crate::templates::softmax::SoftmaxTemplate;

    // Tensor Core configurations
    pub use crate::tensor_core::mma::MmaConfig;
    pub use crate::tensor_core::wgmma::WgmmaConfig;
    pub use crate::tensor_core::wmma::WmmaConfig;
}

// ---------------------------------------------------------------------------
// Features documentation module
// ---------------------------------------------------------------------------

/// Feature flags for `oxicuda-ptx`.
///
/// Currently `oxicuda-ptx` has no optional feature gates -- all functionality
/// is available by default as part of the pure-Rust design philosophy.
///
/// Future feature flags may include:
///
/// - `serde`: Serialization support for IR types and cache keys
/// - `rayon`: Parallel template generation for batch kernel compilation
/// - `tracing`: Structured logging and diagnostics for the code generation pipeline
pub mod features {}