Helpers for building TMA (Tensor Memory Accelerator) descriptors.
Structs§
Functions§
- remap_storage_for_tma - CUDA's TMA loads f32 as tf32 internally; remap explicitly so the descriptor matches.
- tma_meta_tiled - Build a tiled TensorMapMeta with the defaults shared by every current call site (no interleave, no prefetch, OOB-fill = zero, elem_stride = [1; rank]).
- transpose_inner_for_tma - TMA assumes the last stride is contiguous and discards it. For ColMajor inputs we therefore swap the inner two dims so the contiguous one ends up last. The tensor's own metadata stays in its original layout; only the TMA descriptor sees the transposed form.
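
The inner-dimension swap described for transpose_inner_for_tma can be illustrated with a minimal sketch. The `Layout` enum, the `TmaView` struct, and the function `transpose_inner_sketch` below are hypothetical names introduced only for illustration; they are not this module's actual types or signatures, and strides are assumed to be counted in elements.

```rust
/// Hypothetical layout tag; the real crate's type may differ.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Layout {
    RowMajor,
    ColMajor,
}

/// Hypothetical view of the shape and strides handed to a descriptor builder.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct TmaView {
    pub dims: Vec<u64>,
    /// Strides in elements, in the same order as `dims` (an assumption).
    pub strides: Vec<u64>,
}

/// Return the dims/strides that the descriptor should see.
/// The caller's slices are only borrowed, so the tensor's own
/// metadata keeps its original layout.
pub fn transpose_inner_sketch(dims: &[u64], strides: &[u64], layout: Layout) -> TmaView {
    assert_eq!(dims.len(), strides.len(), "dims and strides must have equal rank");
    let mut view = TmaView {
        dims: dims.to_vec(),
        strides: strides.to_vec(),
    };
    let rank = view.dims.len();
    // For column-major inputs the contiguous (stride-1) dim is second to last;
    // swapping the inner two dims moves it to the last position.
    if layout == Layout::ColMajor && rank >= 2 {
        view.dims.swap(rank - 2, rank - 1);
        view.strides.swap(rank - 2, rank - 1);
    }
    view
}
```

Under these assumptions, a 4x3 column-major matrix with dims [4, 3] and strides [1, 4] yields a view with dims [3, 4] and strides [4, 1], while a row-major input passes through unchanged.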