// oxicuda_ptx/lib.rs
//! # `OxiCUDA` PTX -- Pure Rust PTX Code Generation DSL
//!
//! `oxicuda-ptx` provides a complete Rust-native DSL and intermediate representation
//! for generating NVIDIA PTX (Parallel Thread Execution) assembly code at runtime.
//! It eliminates the dependency on `nvcc`, the proprietary CUDA Toolkit, or any
//! C/C++ compiler toolchain -- PTX text is constructed entirely from safe Rust code.
//!
//! ## Crate Architecture
//!
//! The crate is organized into eight major subsystems:
//!
//! | Module | Purpose |
//! |----------------|-----------------------------------------------------------|
//! | [`ir`] | Typed intermediate representation for PTX instructions |
//! | [`builder`] | Ergonomic fluent builder API for kernel construction |
//! | [`templates`] | High-level templates for GEMM, reduction, softmax, etc. |
//! | [`tensor_core`]| Tensor Core instruction helpers (WMMA, MMA, WGMMA) |
//! | [`emit`] | PTX text printer and validation passes |
//! | [`arch`] | Architecture definitions and capability queries (`sm_75`+) |
//! | [`cache`] | Disk-based PTX kernel cache for avoiding regeneration |
//! | [`error`] | Error types for all PTX generation failure modes |
//!
//! ## Supported Architectures
//!
//! PTX generation targets NVIDIA architectures from Turing through Blackwell:
//!
//! - **`sm_75`** -- Turing (RTX 20xx, T4)
//! - **`sm_80` / `sm_86`** -- Ampere (A100, RTX 30xx)
//! - **`sm_89`** -- Ada Lovelace (RTX 40xx, L40)
//! - **`sm_90` / `sm_90a`** -- Hopper (H100, H200)
//! - **`sm_100`** -- Blackwell (B100, B200)
//! - **`sm_120`** -- Consumer Blackwell (GeForce RTX 50xx)
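//!
//! A capability check can be sketched as follows (the `for_sm` constructor and
//! `supports_wgmma` query are illustrative names -- see [`ArchCapabilities`]
//! for the actual API):
//!
//! ```rust,ignore
//! use oxicuda_ptx::arch::{ArchCapabilities, SmVersion};
//!
//! // Query what the target architecture supports before choosing a code path.
//! let caps = ArchCapabilities::for_sm(SmVersion::Sm90);
//! if caps.supports_wgmma() {
//!     // Hopper+: emit warp-group MMA; earlier targets fall back to mma.sync.
//! }
//! ```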
//!
//! ## Design Principles
//!
//! 1. **Type-safe IR**: Every PTX register carries its type, preventing mismatched
//!    operand types at construction time rather than at `ptxas` assembly time.
//! 2. **Zero external tools**: No `nvcc`, `ptxas`, or CUDA Toolkit installation
//!    required for PTX text generation (only needed for final binary compilation).
//! 3. **Architecture-aware**: Templates and builders automatically select optimal
//!    instruction sequences based on the target [`SmVersion`].
//! 4. **Composable**: The IR, builder, and template layers compose freely --
//!    templates produce the same IR types that manual builders do.
//! 5. **Cacheable**: Generated PTX is deterministic and can be cached to disk
//!    via [`PtxCache`], keyed by kernel name, parameters, and target architecture.
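//!
//! The caching flow can be sketched as follows (`open`, `get_or_insert`, and the
//! helper `generate_kernel` are illustrative names -- see [`PtxCache`] and
//! [`PtxCacheKey`] for the actual API):
//!
//! ```rust,ignore
//! use oxicuda_ptx::arch::SmVersion;
//! use oxicuda_ptx::cache::{PtxCache, PtxCacheKey};
//!
//! let cache = PtxCache::open("/tmp/oxicuda-ptx-cache")?;
//! let key = PtxCacheKey::new("vector_add", SmVersion::Sm80, &[("n", "1024")]);
//! // Regenerate only on a cache miss; identical inputs yield identical PTX.
//! let ptx = cache.get_or_insert(&key, || generate_kernel())?;
//! ```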
//!
//! ## Quick Start
//!
//! ```rust,no_run
//! use oxicuda_ptx::prelude::*;
//!
//! // Build a vector-add kernel targeting Ampere
//! let ptx = KernelBuilder::new("vector_add")
//!     .target(SmVersion::Sm80)
//!     .param("a_ptr", PtxType::U64)
//!     .param("b_ptr", PtxType::U64)
//!     .param("c_ptr", PtxType::U64)
//!     .param("n", PtxType::U32)
//!     .body(|b| {
//!         let gid = b.global_thread_id_x();
//!         // ... load, add, store ...
//!     })
//!     .build()
//!     .expect("PTX generation failed");
//! ```
//!
//! ## Low-Level IR Usage
//!
//! ```rust
//! use oxicuda_ptx::ir::*;
//!
//! let mut alloc = RegisterAllocator::new();
//! let tid = alloc.alloc(PtxType::U32);
//!
//! let inst = Instruction::MovSpecial {
//!     dst: tid,
//!     special: SpecialReg::TidX,
//! };
//! assert!(inst.emit().contains("%tid.x"));
//! ```
//!
//! ## Template-Based Generation
//!
//! For common patterns, use the high-level templates which handle shared memory,
//! thread coordination, and architecture-specific optimizations automatically:
//!
//! - [`templates::elementwise`] -- Unary/binary elementwise ops (add, relu, sigmoid)
//! - [`templates::reduction`] -- Parallel block-level reductions (sum, max, min)
//! - [`templates::gemm`] -- Matrix multiplication kernels
//! - [`templates::softmax`] -- Numerically stable row-wise softmax
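//!
//! As a sketch, a block-level sum reduction could be generated like this (the
//! constructor and `generate` method are illustrative names -- see
//! [`templates::reduction`] for the actual API):
//!
//! ```rust,ignore
//! use oxicuda_ptx::prelude::*;
//!
//! // Generate an f32 sum-reduction kernel tuned for Ampere.
//! let ptx = ReductionTemplate::new(ReductionOp::Sum, PtxType::F32)
//!     .target(SmVersion::Sm80)
//!     .generate()?;
//! ```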
//!
//! ## Tensor Core Support
//!
//! The [`tensor_core`] module provides configuration and code generation helpers
//! for NVIDIA Tensor Core instructions across three generations:
//!
//! - [`tensor_core::wmma`] -- WMMA for Volta/Turing+ (`wmma.load`, `wmma.mma`)
//! - [`tensor_core::mma`] -- MMA for Ampere+ (`mma.sync.aligned`)
//! - [`tensor_core::wgmma`] -- WGMMA for Hopper+ (warp-group level MMA)
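//!
//! As a sketch, an MMA configuration might look like this (the constructor,
//! shape variant, and `is_supported_on` query are illustrative names -- see
//! [`tensor_core::mma`] for the actual API):
//!
//! ```rust,ignore
//! use oxicuda_ptx::prelude::*;
//! use oxicuda_ptx::tensor_core::mma::MmaConfig;
//!
//! // m16n8k16 half-precision MMA with f32 accumulation, available on sm_80+.
//! let cfg = MmaConfig::new(MmaShape::M16N8K16, PtxType::F16, PtxType::F32);
//! assert!(cfg.is_supported_on(SmVersion::Sm80));
//! ```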

// ---------------------------------------------------------------------------
// Lint configuration
// ---------------------------------------------------------------------------
#![warn(clippy::all)]
#![warn(clippy::pedantic)]
#![warn(clippy::nursery)]
#![warn(missing_docs)]
#![warn(rustdoc::broken_intra_doc_links)]
#![warn(rustdoc::private_intra_doc_links)]
#![deny(unsafe_code)]
#![allow(clippy::module_name_repetitions)]
#![allow(clippy::must_use_candidate)]
#![allow(clippy::missing_errors_doc)]

// ---------------------------------------------------------------------------
// Module declarations
// ---------------------------------------------------------------------------

/// Static analysis passes for PTX instruction sequences.
pub mod analysis;

/// NVIDIA GPU architecture definitions and capability queries.
pub mod arch;

/// Ergonomic fluent builder API for PTX kernel construction.
pub mod builder;

/// Disk-based PTX kernel cache with content-addressable storage.
pub mod cache;

/// PTX text emission (printer) and structural validation.
pub mod emit;

/// Error types for PTX code generation.
pub mod error;

/// Typed intermediate representation for PTX instructions.
pub mod ir;

/// Profile-guided code generation using autotune/profiling data.
pub mod profile_guided;

/// High-level parameterized kernel templates for common GPU workloads.
pub mod templates;

/// Tensor Core instruction configuration and generation helpers.
pub mod tensor_core;

/// Visual PTX explorer for terminal-based PTX analysis.
pub mod tui_explorer;

// ---------------------------------------------------------------------------
// Public re-exports of key types
// ---------------------------------------------------------------------------

// Analysis
pub use analysis::bank_conflict::{BankConflict, BankConflictReport, analyze_bank_conflicts};
pub use analysis::constant_folding::fold_constants;
pub use analysis::dead_code::eliminate_dead_code;
pub use analysis::instruction_scheduling::{
    SchedulingReport, SchedulingStrategy, schedule_instructions,
};
pub use analysis::register_pressure::{RegisterPressureReport, analyze_register_pressure};
pub use analysis::strength_reduction::reduce_strength;

// Architecture
pub use arch::{ArchCapabilities, SmVersion};

// IR core types
pub use ir::{
    AtomOp, BasicBlock, CacheQualifier, CmpOp, FenceScope, ImmValue, Instruction, MemorySpace,
    MmaShape, MulMode, Operand, PtxFunction, PtxModule, PtxType, Register, RegisterAllocator,
    RoundingMode, SpecialReg, SurfaceOp, TextureDim, VectorWidth, WgmmaShape, WmmaLayout, WmmaOp,
    WmmaShape,
};

// Builder API
pub use builder::{BodyBuilder, KernelBuilder};

// Error type
pub use error::PtxGenError;

// Cache
pub use cache::{PtxCache, PtxCacheKey};

// Profile-guided optimisation
pub use profile_guided::{
    Bottleneck, BranchProfile, CodeGenDecision, HotSpot, KernelProfile, MemoryAccessProfile,
    ProfileData, ProfileGuidedOptimizer, ProfileMetrics, StallReason, TileConfig,
    apply_profile_decisions,
};

// TUI Explorer
pub use tui_explorer::{ExplorerConfig, PtxExplorer};

// Emit utilities
pub use emit::printer::{emit_function, emit_function_standalone, emit_module, try_emit_module};
pub use emit::validator::{
    ValidationError, ValidationResult, validate_ptx, validate_ptx_for_target,
};

// Templates
pub use templates::elementwise::{ElementwiseOp, ElementwiseTemplate};
pub use templates::gemm::{EpilogueKind, GemmTemplate};
pub use templates::reduction::{ReductionOp, ReductionTemplate};
pub use templates::softmax::SoftmaxTemplate;

// Tensor Core configurations
pub use tensor_core::mma::MmaConfig;
pub use tensor_core::wgmma::WgmmaConfig;
pub use tensor_core::wmma::WmmaConfig;

// ---------------------------------------------------------------------------
// Prelude module
// ---------------------------------------------------------------------------

/// Convenient wildcard import for the most commonly needed types.
///
/// ```rust
/// use oxicuda_ptx::prelude::*;
/// ```
///
/// This re-exports the types you need for typical kernel construction:
/// architecture selection, the builder API, core IR types, error handling,
/// and the PTX cache.
pub mod prelude {
    // Architecture
    pub use crate::arch::{ArchCapabilities, SmVersion};

    // Builder API (primary entry point for most users)
    pub use crate::builder::{BodyBuilder, KernelBuilder};

    // Core IR types frequently used in builder closures
    pub use crate::ir::{
        AtomOp, BasicBlock, CacheQualifier, CmpOp, FenceScope, ImmValue, Instruction, MemorySpace,
        MulMode, Operand, PtxFunction, PtxModule, PtxType, Register, RegisterAllocator,
        RoundingMode, SpecialReg, VectorWidth,
    };

    // Tensor Core shape types (needed for MMA instructions in body closures)
    pub use crate::ir::{MmaShape, WgmmaShape, WmmaLayout, WmmaOp, WmmaShape};

    // Error type
    pub use crate::error::PtxGenError;

    // Cache
    pub use crate::cache::{PtxCache, PtxCacheKey};

    // Emit utilities
    pub use crate::emit::printer::emit_module;
    pub use crate::emit::validator::{ValidationResult, validate_ptx};

    // Template types
    pub use crate::templates::elementwise::{ElementwiseOp, ElementwiseTemplate};
    pub use crate::templates::gemm::{EpilogueKind, GemmTemplate};
    pub use crate::templates::reduction::{ReductionOp, ReductionTemplate};
    pub use crate::templates::softmax::SoftmaxTemplate;

    // Tensor Core configurations
    pub use crate::tensor_core::mma::MmaConfig;
    pub use crate::tensor_core::wgmma::WgmmaConfig;
    pub use crate::tensor_core::wmma::WmmaConfig;
}

// ---------------------------------------------------------------------------
// Features documentation module
// ---------------------------------------------------------------------------

/// Feature flags for `oxicuda-ptx`.
///
/// Currently `oxicuda-ptx` has no optional feature gates -- all functionality
/// is available by default as part of the pure-Rust design philosophy.
///
/// Future feature flags may include:
///
/// - `serde`: Serialization support for IR types and cache keys
/// - `rayon`: Parallel template generation for batch kernel compilation
/// - `tracing`: Structured logging and diagnostics for the code generation pipeline
pub mod features {}