Module transpose

GPU-accelerated 2D matrix transpose.

Transposes a 2D matrix [rows, cols] to [cols, rows]. Supports F32 and F16 dtypes.
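As a point of comparison for the GPU kernel, the index mapping can be sketched as a CPU reference. This is a minimal sketch, assuming a contiguous row-major layout; `transpose_2d_ref` is a hypothetical helper written for illustration and is not part of this module's API.

```rust
/// CPU reference for a 2D transpose (hypothetical helper, not part of this module).
/// Assumes `input` is a contiguous row-major [rows, cols] buffer.
fn transpose_2d_ref(input: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    assert_eq!(input.len(), rows * cols, "input length must equal rows * cols");
    let mut output = vec![0.0f32; rows * cols];
    for r in 0..rows {
        for c in 0..cols {
            // input is [rows, cols]; output is [cols, rows].
            output[c * rows + r] = input[r * cols + c];
        }
    }
    output
}
```

A reference like this is mainly useful in tests, where the GPU result can be compared element by element against the CPU result.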

Functions

permute_021_bf16
permute_021_bf16_to_f32: Fused permute_021 + bf16→f32 cast. Replaces the two-pass sequence permute_021_bf16 (bf16 → bf16) followed by cast_bf16_to_f32 (bf16 → f32) with a single dispatch that reads bf16 in [A, B, C] order and writes f32 in [B, A, C] order, halving the global-memory traffic on the post-FA SDPA output buffer. Wave P4.10.
permute_021_f32: Encode a 3D permutation: [A, B, C] -> [B, A, C] (f32).
transpose_2d: Encode a 2D matrix transpose: output[col, row] = input[row, col].
transpose_last2_bf16: Swap the last two axes of a 3D bf16 tensor: [A, B, C] -> [A, C, B].
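
The index mappings behind the 3D functions above can be sketched on the CPU as follows. This is an illustrative sketch under stated assumptions: tensors are contiguous and row-major, bf16 values are carried as raw `u16` bit patterns, and every name ending in `_ref` (plus `bf16_bits_to_f32`) is a hypothetical helper, not the signature of the GPU encoders listed here.

```rust
/// [A, B, C] -> [B, A, C] on a row-major buffer (hypothetical CPU reference).
fn permute_021_ref<T: Copy>(input: &[T], a: usize, b: usize, c: usize) -> Vec<T> {
    assert_eq!(input.len(), a * b * c);
    let mut out = Vec::with_capacity(input.len());
    for bi in 0..b {
        for ai in 0..a {
            // The innermost axis C is untouched, so whole rows of length c move together.
            let src = (ai * b + bi) * c;
            out.extend_from_slice(&input[src..src + c]);
        }
    }
    out
}

/// [A, B, C] -> [A, C, B]: swap the last two axes (hypothetical CPU reference).
fn transpose_last2_ref<T: Copy>(input: &[T], a: usize, b: usize, c: usize) -> Vec<T> {
    assert_eq!(input.len(), a * b * c);
    let mut out = Vec::with_capacity(input.len());
    for ai in 0..a {
        for ci in 0..c {
            for bi in 0..b {
                out.push(input[(ai * b + bi) * c + ci]);
            }
        }
    }
    out
}

/// Widen a bf16 bit pattern to f32: bf16 is the upper 16 bits of an IEEE-754 f32.
fn bf16_bits_to_f32(bits: u16) -> f32 {
    f32::from_bits((bits as u32) << 16)
}

/// Fused permute + cast in one pass: each element is read once as bf16 and
/// written once as f32, with no intermediate bf16 buffer (hypothetical CPU reference).
fn permute_021_bf16_to_f32_ref(input: &[u16], a: usize, b: usize, c: usize) -> Vec<f32> {
    assert_eq!(input.len(), a * b * c);
    let mut out = Vec::with_capacity(input.len());
    for bi in 0..b {
        for ai in 0..a {
            let src = (ai * b + bi) * c;
            out.extend(input[src..src + c].iter().map(|&bits| bf16_bits_to_f32(bits)));
        }
    }
    out
}
```

The fused variant shows why a single pass helps: the two-pass version writes a permuted bf16 buffer and then reads it back for the cast, while the fused loop touches each element once on the way in and once on the way out.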