//! Element-wise GPU Kernels
//!
//! Simple element-wise operations for transformer forward passes.
//!
//! ## Available Kernels
//!
//! ### Residual Connection Kernels
//! - [`ResidualAddKernel`]: Element-wise addition for residual connections
//! - [`BatchedResidualAddKernel`]: Batched version processing M sequences
//! - [`FusedResidualRmsNormKernel`]: Fused residual add + RMSNorm
//!
//! ### Activation and Element-wise Arithmetic Kernels
//! - [`ReluKernel`]: ReLU activation
//! - [`SiluKernel`]: SiLU/Swish activation
//! - [`GeluKernel`]: GELU activation (approximate)
//! - [`ElementwiseMulKernel`]: Element-wise multiplication
//! - [`ScaleKernel`]: Scalar multiplication
//!
//! ### SwiGLU Kernels
//! - [`FusedSwigluKernel`]: Fused SiLU + multiply
//! - [`BatchedSwigluKernel`]: Batched SwiGLU
//!
//! ### KV Cache Kernels
//! - [`KvCacheScatterKernel`]: Scatter K/V to cache
//! - [`KvCacheScatterIndirectKernel`]: CUDA Graph-compatible variant of [`KvCacheScatterKernel`]
//!
//! ### RoPE Kernels
//! - [`RopeKernel`]: Standard adjacent-pair RoPE
//! - [`RopeIndirectKernel`]: CUDA Graph-compatible variant of [`RopeKernel`]
//! - [`RopeNeoxKernel`]: NEOX-style RoPE (split halves)
//! - [`RopeNeoxIndirectKernel`]: CUDA Graph-compatible variant of [`RopeNeoxKernel`]
//! - [`BatchedRopeKernel`]: Multi-sequence batched RoPE
//! - [`PreciseRopeKernel`]: High-precision RoPE for theta=1M
//! - [`PreciseRopeIndirectKernel`]: CUDA Graph-compatible variant of [`PreciseRopeKernel`]
//!
//! ### Transform Kernels
//! - [`TransposeKernel`]: Matrix transpose
//! - [`InterleavedToBatchedKernel`]: Layout conversion from interleaved to batched
//! - [`BatchedToInterleavedKernel`]: Layout conversion from batched to interleaved
//! - [`ExtractSingleHeadKernel`]: Extract one head
//! - [`CopySingleHeadKernel`]: Copy to head position
//! - [`BatchedTransposeKernel`]: Batched transpose
//! - [`BatchedScaleKernel`]: Batched scale
//! - [`BatchedSoftmaxKernel`]: Row-wise softmax
//!
//! ## PAR-023: Async Pipeline Support
//!
//! These kernels are designed for GPU-resident execution without host synchronization.
// Re-export all kernel types
pub use ;
pub use ;
pub use ;
pub use ;
pub use ;
pub use ;