Module tma

Helpers for building TMA (Tensor Memory Accelerator) descriptors.

Structs§

Metadata

Functions§

remap_storage_for_tma
CUDA’s TMA loads f32 as tf32 internally; remap explicitly so the descriptor matches.
tma_meta_tiled
Build a tiled TensorMapMeta with the defaults shared by every current call site (no interleave, no prefetch, OOB-fill = zero, elem_stride = [1; rank]).
transpose_inner_for_tma
TMA treats the last dimension as contiguous and drops its stride from the descriptor. For ColMajor inputs we therefore swap the two innermost dims so the contiguous one ends up last. The tensor’s own metadata keeps its original layout; only the TMA descriptor sees the transposed form.
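
The three helpers can be illustrated with a small standalone sketch. It is not this module's implementation: the `Storage` enum, the `TiledMetaSketch` struct, and every `*_sketch` function below are hypothetical stand-ins, since the real signatures are not shown on this page. The sketch only mirrors the behaviour described above: f32 is remapped to tf32, column-major inputs get their two innermost dims swapped in a copy, and the tiled defaults are no interleave, no prefetch, zero OOB fill, and an element stride of 1 per dimension.

```rust
/// Hypothetical element types; the crate's real storage enum is not shown here.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Storage {
    F32,
    Tf32,
    F16,
}

/// Sketch of the remap step: f32 becomes tf32, everything else passes through.
pub fn remap_storage_sketch(storage: Storage) -> Storage {
    match storage {
        Storage::F32 => Storage::Tf32,
        other => other,
    }
}

/// Sketch of the inner-dimension swap. Returns new vectors, so the caller's
/// shape and strides (the tensor's own metadata) are left untouched.
pub fn transpose_inner_sketch(shape: &[usize], strides: &[usize]) -> (Vec<usize>, Vec<usize>) {
    assert_eq!(shape.len(), strides.len(), "shape and strides must have equal rank");
    let mut shape = shape.to_vec();
    let mut strides = strides.to_vec();
    let rank = shape.len();
    if rank >= 2 {
        shape.swap(rank - 2, rank - 1);
        strides.swap(rank - 2, rank - 1);
    }
    (shape, strides)
}

/// Hypothetical descriptor metadata carrying the defaults listed above.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct TiledMetaSketch {
    pub storage: Storage,
    pub shape: Vec<usize>,
    pub strides: Vec<usize>,
    pub elem_stride: Vec<usize>,
    pub interleave: bool,
    pub prefetch: bool,
    pub oob_fill_zero: bool,
}

/// Builds the descriptor-side view of a tensor, combining the steps above.
pub fn tiled_meta_sketch(
    storage: Storage,
    shape: &[usize],
    strides: &[usize],
    col_major: bool,
) -> TiledMetaSketch {
    let (shape, strides) = if col_major {
        transpose_inner_sketch(shape, strides)
    } else {
        (shape.to_vec(), strides.to_vec())
    };
    let rank = shape.len();
    TiledMetaSketch {
        storage: remap_storage_sketch(storage),
        shape,
        strides,
        elem_stride: vec![1; rank], // elem_stride = [1; rank]
        interleave: false,          // no interleave
        prefetch: false,            // no prefetch
        oob_fill_zero: true,        // OOB-fill = zero
    }
}
```

For a column-major 4 x 8 f32 matrix with shape `[4, 8]` and strides `[1, 4]`, `tiled_meta_sketch(Storage::F32, &[4, 8], &[1, 4], true)` yields a descriptor view with shape `[8, 4]`, strides `[4, 1]`, and storage `Tf32`, while the input slices are unchanged.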