//! # `OxiCUDA` PTX -- Pure Rust PTX Code Generation DSL
//!
//! `oxicuda-ptx` provides a complete Rust-native DSL and intermediate representation
//! for generating NVIDIA PTX (Parallel Thread Execution) assembly code at runtime.
//! It eliminates the dependency on `nvcc`, the proprietary CUDA Toolkit, or any
//! C/C++ compiler toolchain -- PTX text is constructed entirely from safe Rust code.
//!
//! ## Crate Architecture
//!
//! The crate is organized into the following major subsystems:
//!
//! | Module | Purpose |
//! |----------------|-----------------------------------------------------------|
//! | [`ir`] | Typed intermediate representation for PTX instructions |
//! | [`builder`] | Ergonomic fluent builder API for kernel construction |
//! | [`templates`] | High-level templates for GEMM, reduction, softmax, etc. |
//! | [`tensor_core`] | Tensor Core instruction helpers (WMMA, MMA, WGMMA)       |
//! | [`emit`] | PTX text printer and validation passes |
//! | [`arch`] | Architecture definitions and capability queries (`sm_75`+) |
//! | [`cache`] | Disk-based PTX kernel cache for avoiding regeneration |
//! | [`error`] | Error types for all PTX generation failure modes |
//!
//! ## Supported Architectures
//!
//! PTX generation targets NVIDIA architectures from Turing through Blackwell:
//!
//! - **`sm_75`** -- Turing (RTX 20xx, T4)
//! - **`sm_80` / `sm_86`** -- Ampere (A100, RTX 30xx)
//! - **`sm_89`** -- Ada Lovelace (RTX 40xx, L40)
//! - **`sm_90` / `sm_90a`** -- Hopper (H100, H200)
//! - **`sm_100`** -- Blackwell (B100, B200)
//! - **`sm_120`** -- Next-generation Blackwell
//!
//! ## Design Principles
//!
//! 1. **Type-safe IR**: Every PTX register carries its type, so mismatched
//!    operand types are caught at construction time rather than at `ptxas`
//!    assembly time.
//! 2. **Zero external tools**: No `nvcc`, `ptxas`, or CUDA Toolkit installation
//! required for PTX text generation (only needed for final binary compilation).
//! 3. **Architecture-aware**: Templates and builders automatically select optimal
//! instruction sequences based on the target [`SmVersion`].
//! 4. **Composable**: The IR, builder, and template layers compose freely --
//! templates produce the same IR types that manual builders do.
//! 5. **Cacheable**: Generated PTX is deterministic and can be cached to disk
//! via [`PtxCache`], keyed by kernel name, parameters, and target architecture.
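//!
//! As a sketch of the caching idea (illustrative only, not `PtxCache`'s actual
//! keying scheme), a deterministic key can be derived from exactly the inputs
//! that determine the generated PTX text:
//!
//! ```rust
//! use std::collections::hash_map::DefaultHasher;
//! use std::hash::{Hash, Hasher};
//!
//! // Hypothetical helper: hash the kernel name, parameter list, and target
//! // architecture into a single cache key.
//! fn cache_key(kernel: &str, params: &[(&str, &str)], arch: &str) -> u64 {
//!     let mut h = DefaultHasher::new();
//!     kernel.hash(&mut h);
//!     params.hash(&mut h);
//!     arch.hash(&mut h);
//!     h.finish()
//! }
//!
//! // Identical inputs always yield the same key, so cached PTX can be reused.
//! assert_eq!(
//!     cache_key("vector_add", &[("n", "u32")], "sm_80"),
//!     cache_key("vector_add", &[("n", "u32")], "sm_80"),
//! );
//! ```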
//!
//! ## Quick Start
//!
//! ```rust,no_run
//! use oxicuda_ptx::prelude::*;
//!
//! // Build a vector-add kernel targeting Ampere
//! let ptx = KernelBuilder::new("vector_add")
//!     .target(SmVersion::Sm80)
//!     .param("a_ptr", PtxType::U64)
//!     .param("b_ptr", PtxType::U64)
//!     .param("c_ptr", PtxType::U64)
//!     .param("n", PtxType::U32)
//!     .body(|b| {
//!         let gid = b.global_thread_id_x();
//!         // ... load, add, store ...
//!     })
//!     .build()
//!     .expect("PTX generation failed");
//! ```
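//!
//! The `global_thread_id_x()` helper lowers to the standard PTX idiom for a
//! global thread index (shown here as a sketch; exact register numbering is
//! chosen by the allocator):
//!
//! ```text
//! mov.u32    %r1, %ctaid.x;          // block index
//! mov.u32    %r2, %ntid.x;           // block dimension
//! mov.u32    %r3, %tid.x;            // thread index within the block
//! mad.lo.s32 %r4, %r1, %r2, %r3;     // gid = ctaid.x * ntid.x + tid.x
//! ```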
//!
//! ## Low-Level IR Usage
//!
//! ```rust
//! use oxicuda_ptx::ir::*;
//!
//! let mut alloc = RegisterAllocator::new();
//! let tid = alloc.alloc(PtxType::U32);
//!
//! let inst = Instruction::MovSpecial {
//!     dst: tid,
//!     special: SpecialReg::TidX,
//! };
//! assert!(inst.emit().contains("%tid.x"));
//! ```
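//!
//! The emit step is, at its core, string formatting over a typed instruction
//! enum. A minimal standalone model of the idea (hypothetical types, not the
//! crate's actual definitions) looks like:
//!
//! ```rust
//! // Simplified stand-in for the typed-IR pattern described above.
//! #[derive(Clone, Copy)]
//! enum Special { TidX, CtaidX }
//!
//! impl Special {
//!     fn ptx_name(self) -> &'static str {
//!         match self {
//!             Special::TidX => "%tid.x",
//!             Special::CtaidX => "%ctaid.x",
//!         }
//!     }
//! }
//!
//! // Emit a `mov` from a special register into virtual register %r<dst>.
//! fn emit_mov_special(dst: u32, s: Special) -> String {
//!     format!("mov.u32 %r{}, {};", dst, s.ptx_name())
//! }
//!
//! assert_eq!(emit_mov_special(1, Special::TidX), "mov.u32 %r1, %tid.x;");
//! ```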
//!
//! ## Template-Based Generation
//!
//! For common patterns, use the high-level templates which handle shared memory,
//! thread coordination, and architecture-specific optimizations automatically:
//!
//! - [`templates::elementwise`] -- Unary/binary elementwise ops (add, relu, sigmoid)
//! - [`templates::reduction`] -- Parallel block-level reductions (sum, max, min)
//! - [`templates::gemm`] -- Matrix multiplication kernels
//! - [`templates::softmax`] -- Numerically stable row-wise softmax
//!
//! ## Tensor Core Support
//!
//! The [`tensor_core`] module provides configuration and code generation helpers
//! for NVIDIA Tensor Core instructions across three generations:
//!
//! - [`tensor_core::wmma`] -- WMMA for Volta/Turing+ (`wmma.load`, `wmma.mma`)
//! - [`tensor_core::mma`] -- MMA for Ampere+ (`mma.sync.aligned`)
//! - [`tensor_core::wgmma`] -- WGMMA for Hopper+ (warp-group level MMA)
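//!
//! For reference, an Ampere-generation `mma.sync` instruction of the kind
//! these helpers target has the following shape (a representative example for
//! an m16n8k16 f16 tile with f32 accumulation; register names are illustrative):
//!
//! ```text
//! mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32
//!     {%f0, %f1, %f2, %f3},      // D accumulator fragment
//!     {%r0, %r1, %r2, %r3},      // A fragment (packed f16x2 registers)
//!     {%r4, %r5},                // B fragment
//!     {%f4, %f5, %f6, %f7};      // C accumulator fragment
//! ```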
// ---------------------------------------------------------------------------
// Lint configuration
// ---------------------------------------------------------------------------
// NOTE: exact lint set assumed; these are conventional choices for a
// fully documented, pure-Rust crate.
#![warn(missing_docs)]
#![warn(clippy::pedantic)]
// ---------------------------------------------------------------------------
// Module declarations
// ---------------------------------------------------------------------------
/// Static analysis passes for PTX instruction sequences.
pub mod analysis;
/// NVIDIA GPU architecture definitions and capability queries.
pub mod arch;
/// Ergonomic fluent builder API for PTX kernel construction.
pub mod builder;
/// Disk-based PTX kernel cache with content-addressable storage.
pub mod cache;
/// PTX text emission (printer) and structural validation.
pub mod emit;
/// Error types for PTX code generation.
pub mod error;
/// Typed intermediate representation for PTX instructions.
pub mod ir;
/// Profile-guided code generation using autotune/profiling data.
pub mod pgo;
/// High-level parameterized kernel templates for common GPU workloads.
pub mod templates;
/// Tensor Core instruction configuration and generation helpers.
pub mod tensor_core;
/// Visual PTX explorer for terminal-based PTX analysis.
pub mod tui;
// ---------------------------------------------------------------------------
// Public re-exports of key types
// ---------------------------------------------------------------------------
// Analysis
pub use analysis::{eliminate_dead_code, fold_constants, reduce_strength};
// Architecture
pub use arch::SmVersion;
// IR core types
pub use ir::{Instruction, PtxType, RegisterAllocator, SpecialReg};
// Builder API
pub use builder::KernelBuilder;
// Error type
pub use error::PtxGenError;
// Cache
pub use cache::PtxCache;
// Templates
pub use templates::SoftmaxTemplate;
// Tensor Core configurations
pub use tensor_core::{MmaConfig, WgmmaConfig, WmmaConfig};
// ---------------------------------------------------------------------------
// Prelude module
// ---------------------------------------------------------------------------
/// Convenient wildcard import for the most commonly needed types.
///
/// ```rust
/// use oxicuda_ptx::prelude::*;
/// ```
///
/// This re-exports the types you need for typical kernel construction:
/// architecture selection, the builder API, core IR types, error handling,
/// and the PTX cache.
pub mod prelude {
    pub use crate::arch::SmVersion;
    pub use crate::builder::KernelBuilder;
    pub use crate::cache::PtxCache;
    pub use crate::error::PtxGenError;
    pub use crate::ir::{Instruction, PtxType, RegisterAllocator, SpecialReg};
}
// ---------------------------------------------------------------------------
// Features documentation module
// ---------------------------------------------------------------------------
/// Feature flags for `oxicuda-ptx`.
///
/// Currently `oxicuda-ptx` has no optional feature gates -- all functionality
/// is available by default as part of the pure-Rust design philosophy.
///
/// Future feature flags may include:
///
/// - `serde`: Serialization support for IR types and cache keys
/// - `rayon`: Parallel template generation for batch kernel compilation
/// - `tracing`: Structured logging and diagnostics for the code generation pipeline
pub mod features {}