GPU-accelerated 2D matrix transpose.
Transposes a 2D matrix [rows, cols] to [cols, rows].
Supports F32 and F16 dtypes.
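The index mapping behind the transpose can be sketched on the CPU. This is only a reference under stated assumptions: the function name `transpose_2d_reference` is hypothetical, the data is assumed to be stored row-major, and nothing here reflects the module's actual GPU dispatch API.

```rust
/// Hypothetical CPU reference (not part of this module's API).
/// Transposes a row-major [rows, cols] matrix into a row-major [cols, rows] matrix,
/// i.e. output[col, row] = input[row, col].
fn transpose_2d_reference<T: Copy>(input: &[T], rows: usize, cols: usize) -> Vec<T> {
    assert_eq!(input.len(), rows * cols, "input length must equal rows * cols");
    let mut output = Vec::with_capacity(input.len());
    // Fill the output in row-major order: each output row is one input column.
    for col in 0..cols {
        for row in 0..rows {
            output.push(input[row * cols + col]);
        }
    }
    output
}
```

A reference like this may be useful for checking a GPU result element by element on small shapes.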
Functions
- permute_021_bf16
- permute_021_bf16_to_f32
  Fused permute_021 + bf16→f32 cast. Replaces the two-pass sequence permute_021_bf16 (bf16 → bf16); cast_bf16_to_f32 (bf16 → f32) with a single dispatch that reads bf16 in [A, B, C] order and writes f32 in [B, A, C] order, halving the global-memory traffic on the post-FA SDPA output buffer. Wave P4.10.
- permute_021_f32
  Encode a 3D permutation: [A, B, C] -> [B, A, C] (bf16).
- transpose_2d
  Encode a 2D matrix transpose: output[col, row] = input[row, col].
- transpose_last2_bf16
  Swap the last two axes of a 3D bf16 tensor: [A, B, C] -> [A, C, B].
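
The permutations listed above reduce to simple index arithmetic, sketched here as CPU references. The function names ending in `_reference` are hypothetical and are not this module's API. The sketch assumes contiguous row-major storage and represents bf16 values as raw `u16` bit patterns. It illustrates the layouts only, not the GPU kernels.

```rust
/// Widen a bf16 value, given as raw bits, to f32.
/// bf16 is the upper 16 bits of an IEEE-754 f32, so widening is a left shift.
fn bf16_bits_to_f32(bits: u16) -> f32 {
    f32::from_bits((bits as u32) << 16)
}

/// Hypothetical CPU reference for the fused permute + cast:
/// reads bf16 at [a, b, c] and writes f32 at [b, a, c] in a single pass.
fn permute_021_bf16_to_f32_reference(input: &[u16], a: usize, b: usize, c: usize) -> Vec<f32> {
    assert_eq!(input.len(), a * b * c, "input length must equal a * b * c");
    let mut output = vec![0.0f32; input.len()];
    for ia in 0..a {
        for ib in 0..b {
            for ic in 0..c {
                let src = (ia * b + ib) * c + ic; // offset in [A, B, C]
                let dst = (ib * a + ia) * c + ic; // offset in [B, A, C]
                output[dst] = bf16_bits_to_f32(input[src]);
            }
        }
    }
    output
}

/// Hypothetical CPU reference for swapping the last two axes:
/// [A, B, C] -> [A, C, B].
fn transpose_last2_reference<T: Copy>(input: &[T], a: usize, b: usize, c: usize) -> Vec<T> {
    assert_eq!(input.len(), a * b * c, "input length must equal a * b * c");
    let mut output = input.to_vec();
    for ia in 0..a {
        for ib in 0..b {
            for ic in 0..c {
                let src = (ia * b + ib) * c + ic; // offset in [A, B, C]
                let dst = (ia * c + ic) * b + ib; // offset in [A, C, B]
                output[dst] = input[src];
            }
        }
    }
    output
}
```

In the fused variant each element is read once as bf16 and written once as f32, so no intermediate permuted bf16 buffer is written and then read back.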