//! CUDA PTX Generation and Execution Module
//!
//! Provides NVIDIA CUDA-specific PTX code generation and execution via `trueno-gpu`.
//! This is an optional backend for maximum performance on NVIDIA hardware.
//!
//! ## Architecture
//!
//! ```text
//! +-----------------------+
//! | CudaExecutor API | <- High-level execution API
//! +-----------------------+
//! | CudaKernels API | <- PTX generation
//! +-----------------------+
//! | trueno_gpu::driver | <- CUDA runtime (context, stream, memory)
//! +-----------------------+
//! | trueno_gpu::kernels | <- Hand-optimized PTX kernels
//! +-----------------------+
//! | trueno_gpu::ptx | <- Pure Rust PTX generation
//! +-----------------------+
//! ```
//!
//! ## Module Organization
//!
//! - `kernels`: Kernel type definitions and PTX generation
//! - `memory`: GPU memory pool and staging buffers
//! - `types`: Weight loading types and transformer workspace
//! - `pipeline`: Async pipeline and PTX optimization
//! - `executor`: CUDA execution engine (planned split into submodules)
//!
//! ## Available Kernels
//!
//! - **GEMM**: Matrix multiplication (naive, tiled, tensor core)
//! - **Softmax**: Numerically stable softmax with warp shuffle
//! - **LayerNorm**: Fused layer normalization
//! - **Attention**: FlashAttention-style tiled attention
//! - **Quantize**: Q4_K/Q5_K/Q6_K dequantization-fused GEMM/GEMV
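//!
//! ## Softmax Reference (illustrative)
//!
//! As a CPU-side sketch of what the softmax kernel computes (documentation
//! only, not the GPU implementation), the numerically stable form subtracts
//! the row maximum before exponentiating:
//!
//! ```rust
//! fn softmax_reference(x: &[f32]) -> Vec<f32> {
//!     // Subtract the maximum so exp(x - max) <= 1 and the sum cannot overflow.
//!     let max = x.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
//!     let exps: Vec<f32> = x.iter().map(|v| (v - max).exp()).collect();
//!     let sum: f32 = exps.iter().sum();
//!     exps.iter().map(|e| e / sum).collect()
//! }
//!
//! let probs = softmax_reference(&[1.0f32, 2.0, 3.0]);
//! assert!((probs.iter().sum::<f32>() - 1.0).abs() < 1e-6);
//! ```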
// Submodules (names per the Module Organization section above)
pub mod executor;
pub mod kernels;
pub mod memory;
pub mod pipeline;
pub mod types;

// Re-export everything for backwards compatibility
pub use executor::*;
pub use kernels::*;
pub use memory::*;
pub use pipeline::*;
pub use types::*;
// The executor module (~21K lines) is still monolithic; future work is to
// split it into submodules:
// - executor/core.rs: Basic context and profiling
// - executor/weights.rs: Weight loading and caching
// - executor/gemm.rs: GEMM/GEMV operations
// - executor/quantized.rs: Quantized GEMV operations
// - executor/activations.rs: GELU, SiLU, RMSNorm, RoPE
// - executor/attention.rs: Flash attention, incremental attention
// - executor/layer.rs: Transformer layer operations
// - executor/forward.rs: Forward pass methods
// - executor/graph.rs: CUDA graph capture and replay
// - executor/kv_cache.rs: KV cache management
pub use executor::CudaExecutor;
pub use executor::GpuProfile;