Matrix transpose kernel: out-of-place B = A^T with AVX2 8×8 micro-kernel.
Matches transpose-kernel-v1.yaml.
Three phases: outer_blocking -> avx2_8x8_microkernel -> remainder.
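The three phases can be pictured with a plain-Rust sketch. This is not the crate's implementation: the name `transpose_blocked`, the `f32` element type, the row-major layout, and the argument order are assumptions made for illustration, and the inner tile loop is a scalar stand-in for the AVX2 micro-kernel.

```rust
const TILE: usize = 8;

/// Blocked out-of-place transpose (illustrative sketch, hypothetical signature).
/// `a` is rows x cols, row-major; `b` receives the cols x rows transpose.
pub fn transpose_blocked(a: &[f32], b: &mut [f32], rows: usize, cols: usize) {
    assert_eq!(a.len(), rows * cols, "a must hold rows * cols elements");
    assert_eq!(b.len(), rows * cols, "b must hold rows * cols elements");

    // Largest multiples of TILE that fit in each dimension.
    let full_rows = rows - rows % TILE;
    let full_cols = cols - cols % TILE;

    // Phase 1: outer blocking over complete 8x8 tiles.
    for i0 in (0..full_rows).step_by(TILE) {
        for j0 in (0..full_cols).step_by(TILE) {
            // Phase 2: scalar stand-in for the 8x8 micro-kernel.
            for i in i0..i0 + TILE {
                for j in j0..j0 + TILE {
                    b[j * rows + i] = a[i * cols + j];
                }
            }
        }
    }

    // Phase 3: remainder. First the right-hand strip (all rows, leftover columns)...
    for i in 0..rows {
        for j in full_cols..cols {
            b[j * rows + i] = a[i * cols + j];
        }
    }
    // ...then the bottom strip (leftover rows, columns already tiled above).
    for i in full_rows..rows {
        for j in 0..full_cols {
            b[j * rows + i] = a[i * cols + j];
        }
    }
}
```

Splitting the remainder into a right strip and a bottom strip keeps the tiled region free of bounds checks on tile size, so every element is written exactly once.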
§Algorithm
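The matrix is processed in 8×8 blocks. For each block, the kernel loads 8 source rows into YMM registers, performs a three-phase in-register transpose (unpack → shuffle → permute), and stores the 8 transposed rows. Each transposed row is written with one contiguous 32-byte store, so a block takes 8 stores where the scalar loop issues 64 strided element writes, which reduces cache misses on the destination.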
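A minimal sketch of the unpack → shuffle → permute sequence follows, using the `std::arch::x86_64` intrinsics named in the references. It assumes `f32` elements (8 per 32-byte row). The names `tile_8x8_avx` and `transpose_tile_8x8`, and the stride parameters, are hypothetical and not taken from the crate. The three intrinsics used here only require AVX, so the sketch gates on `avx`; the real kernel may gate differently.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Transpose one 8x8 f32 tile in registers (illustrative sketch).
/// Strides are in elements. Caller must guarantee 8 readable rows of 8 floats
/// at `src` and 8 writable rows of 8 floats at `dst`, and that AVX is available.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn tile_8x8_avx(src: *const f32, src_stride: usize, dst: *mut f32, dst_stride: usize) {
    // Load 8 source rows, one YMM register each.
    let mut r = [_mm256_setzero_ps(); 8];
    for i in 0..8 {
        r[i] = _mm256_loadu_ps(src.add(i * src_stride));
    }

    // Phase 1 (unpack): interleave adjacent row pairs within each 128-bit lane.
    let t0 = _mm256_unpacklo_ps(r[0], r[1]);
    let t1 = _mm256_unpackhi_ps(r[0], r[1]);
    let t2 = _mm256_unpacklo_ps(r[2], r[3]);
    let t3 = _mm256_unpackhi_ps(r[2], r[3]);
    let t4 = _mm256_unpacklo_ps(r[4], r[5]);
    let t5 = _mm256_unpackhi_ps(r[4], r[5]);
    let t6 = _mm256_unpacklo_ps(r[6], r[7]);
    let t7 = _mm256_unpackhi_ps(r[6], r[7]);

    // Phase 2 (shuffle): gather 4 elements of one column per lane.
    // 0x44 picks the low pair of each operand, 0xEE the high pair.
    let u0 = _mm256_shuffle_ps::<0x44>(t0, t2);
    let u1 = _mm256_shuffle_ps::<0xEE>(t0, t2);
    let u2 = _mm256_shuffle_ps::<0x44>(t1, t3);
    let u3 = _mm256_shuffle_ps::<0xEE>(t1, t3);
    let u4 = _mm256_shuffle_ps::<0x44>(t4, t6);
    let u5 = _mm256_shuffle_ps::<0xEE>(t4, t6);
    let u6 = _mm256_shuffle_ps::<0x44>(t5, t7);
    let u7 = _mm256_shuffle_ps::<0xEE>(t5, t7);

    // Phase 3 (permute): join 128-bit halves so each register holds a full column.
    // 0x20 takes both low lanes, 0x31 takes both high lanes.
    let out = [
        _mm256_permute2f128_ps::<0x20>(u0, u4),
        _mm256_permute2f128_ps::<0x20>(u1, u5),
        _mm256_permute2f128_ps::<0x20>(u2, u6),
        _mm256_permute2f128_ps::<0x20>(u3, u7),
        _mm256_permute2f128_ps::<0x31>(u0, u4),
        _mm256_permute2f128_ps::<0x31>(u1, u5),
        _mm256_permute2f128_ps::<0x31>(u2, u6),
        _mm256_permute2f128_ps::<0x31>(u3, u7),
    ];

    // Store 8 transposed rows, each as one contiguous 32-byte write.
    for j in 0..8 {
        _mm256_storeu_ps(dst.add(j * dst_stride), out[j]);
    }
}

/// Safe wrapper for a single 8x8 tile: uses the AVX path when the CPU
/// supports it and a scalar loop otherwise (hypothetical helper).
pub fn transpose_tile_8x8(src: &[f32; 64], dst: &mut [f32; 64]) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            // Both arrays hold exactly 8 rows of 8 floats, stride 8.
            unsafe { tile_8x8_avx(src.as_ptr(), 8, dst.as_mut_ptr(), 8) };
            return;
        }
    }
    for i in 0..8 {
        for j in 0..8 {
            dst[j * 8 + i] = src[i * 8 + j];
        }
    }
}
```

Unaligned loads and stores are used so the sketch places no alignment requirement on the caller; an implementation that controls allocation could use the aligned variants instead.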
§References
- Lam, Rothberg & Wolf (1991) The Cache Performance and Optimizations of Blocked Algorithms
- Intel Intrinsics Guide: _mm256_unpacklo_ps, _mm256_shuffle_ps, _mm256_permute2f128_ps
§Functions
- transpose - Transpose a matrix: B = A^T. Dispatches to AVX2 or scalar.
- transpose_avx2 ⚠ - AVX2 matrix transpose using 8×8 in-register micro-kernel.
- transpose_scalar - Scalar reference transpose: B[j * rows + i] = A[i * cols + j].
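The scalar indexing rule can be written out directly. The sketch below assumes row-major `f32` slices and a hypothetical signature; the name `transpose_scalar_ref` is chosen so it is not mistaken for the crate's own `transpose_scalar`, whose actual parameters are not shown on this page.

```rust
/// Scalar reference transpose (illustrative sketch, hypothetical signature).
/// `a` is rows x cols, row-major; `b` receives the cols x rows transpose.
pub fn transpose_scalar_ref(a: &[f32], b: &mut [f32], rows: usize, cols: usize) {
    assert_eq!(a.len(), rows * cols, "a must hold rows * cols elements");
    assert_eq!(b.len(), rows * cols, "b must hold rows * cols elements");
    for i in 0..rows {
        for j in 0..cols {
            // Element (i, j) of A becomes element (j, i) of B.
            b[j * rows + i] = a[i * cols + j];
        }
    }
}
```

For a 2×3 input `[1, 2, 3, 4, 5, 6]`, this writes the 3×2 output `[1, 4, 2, 5, 3, 6]`. The reads from `a` are sequential while the writes to `b` are strided by `rows`, which is the access pattern the blocked kernel is designed to avoid.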