Module transpose

Matrix transpose kernel: out-of-place B = A^T using an AVX2 8×8 micro-kernel.

Matches transpose-kernel-v1.yaml. Three phases: outer_blocking -> avx2_8x8_microkernel -> remainder.
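The three phases can be sketched as a loop structure. This is a minimal illustration, not the module's actual code: the function name `transpose_blocked`, its signature, the `f32` element type and the row-major layout are all assumptions, and a scalar loop stands in for the AVX2 micro-kernel.

```rust
/// Tile edge length; the 8×8 block size comes from the module description.
const BLOCK: usize = 8;

/// Hypothetical sketch: transpose a row-major `rows × cols` matrix `a`
/// into `b`, which is then row-major `cols × rows`.
fn transpose_blocked(a: &[f32], b: &mut [f32], rows: usize, cols: usize) {
    assert_eq!(a.len(), rows * cols);
    assert_eq!(b.len(), rows * cols);

    // Largest multiples of BLOCK that fit in each dimension.
    let full_rows = rows - rows % BLOCK;
    let full_cols = cols - cols % BLOCK;

    // Phase 1: outer blocking over the full 8×8 tiles.
    for i0 in (0..full_rows).step_by(BLOCK) {
        for j0 in (0..full_cols).step_by(BLOCK) {
            // Phase 2: micro-kernel slot. A scalar stand-in is used here;
            // a vectorized kernel would replace these two loops.
            for i in i0..i0 + BLOCK {
                for j in j0..j0 + BLOCK {
                    b[j * rows + i] = a[i * cols + j];
                }
            }
        }
    }

    // Phase 3: remainder. First the right-edge columns for every row,
    // then the bottom rows for the columns already covered by tiles.
    for i in 0..rows {
        for j in full_cols..cols {
            b[j * rows + i] = a[i * cols + j];
        }
    }
    for i in full_rows..rows {
        for j in 0..full_cols {
            b[j * rows + i] = a[i * cols + j];
        }
    }
}
```

Splitting the remainder into a right edge and a bottom edge keeps every element visited exactly once, including the corner where both dimensions have leftovers.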

§Algorithm

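Process the matrix in 8×8 blocks. For each block, load 8 source rows into YMM registers, perform a three-phase in-register transpose (unpack → shuffle → permute), then store the 8 transposed rows. Each transposed row is written with one contiguous 32-byte store, so a block touches the destination with 8 row-wide stores instead of the 64 strided single-element stores of the scalar loop, which reduces destination cache misses.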
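One common way to realize the unpack → shuffle → permute sequence for a single tile is sketched below. It is an illustration under assumptions rather than this module's implementation: the names `tile_8x8_avx2` and `transpose_8x8`, the stride parameters (in elements) and the `f32` element type are hypothetical. The intrinsics are those listed in the references, plus `_mm256_unpackhi_ps` and unaligned loads and stores.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Hypothetical 8×8 f32 tile kernel. Safety: the CPU must support AVX2 and
/// both pointers must be valid for 8 rows of 8 floats at the given strides.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn tile_8x8_avx2(src: *const f32, ss: usize, dst: *mut f32, ds: usize) {
    unsafe {
        // Load 8 source rows, one YMM register per row.
        let r0 = _mm256_loadu_ps(src);
        let r1 = _mm256_loadu_ps(src.add(ss));
        let r2 = _mm256_loadu_ps(src.add(2 * ss));
        let r3 = _mm256_loadu_ps(src.add(3 * ss));
        let r4 = _mm256_loadu_ps(src.add(4 * ss));
        let r5 = _mm256_loadu_ps(src.add(5 * ss));
        let r6 = _mm256_loadu_ps(src.add(6 * ss));
        let r7 = _mm256_loadu_ps(src.add(7 * ss));

        // Phase 1 (unpack): interleave pairs of rows within each 128-bit lane.
        let t0 = _mm256_unpacklo_ps(r0, r1);
        let t1 = _mm256_unpackhi_ps(r0, r1);
        let t2 = _mm256_unpacklo_ps(r2, r3);
        let t3 = _mm256_unpackhi_ps(r2, r3);
        let t4 = _mm256_unpacklo_ps(r4, r5);
        let t5 = _mm256_unpackhi_ps(r4, r5);
        let t6 = _mm256_unpacklo_ps(r6, r7);
        let t7 = _mm256_unpackhi_ps(r6, r7);

        // Phase 2 (shuffle): gather 4-element column pieces per lane.
        let s0 = _mm256_shuffle_ps::<0x44>(t0, t2);
        let s1 = _mm256_shuffle_ps::<0xEE>(t0, t2);
        let s2 = _mm256_shuffle_ps::<0x44>(t1, t3);
        let s3 = _mm256_shuffle_ps::<0xEE>(t1, t3);
        let s4 = _mm256_shuffle_ps::<0x44>(t4, t6);
        let s5 = _mm256_shuffle_ps::<0xEE>(t4, t6);
        let s6 = _mm256_shuffle_ps::<0x44>(t5, t7);
        let s7 = _mm256_shuffle_ps::<0xEE>(t5, t7);

        // Phase 3 (permute): join matching 128-bit halves into full columns,
        // then store each as one contiguous 32-byte destination row.
        _mm256_storeu_ps(dst, _mm256_permute2f128_ps::<0x20>(s0, s4));
        _mm256_storeu_ps(dst.add(ds), _mm256_permute2f128_ps::<0x20>(s1, s5));
        _mm256_storeu_ps(dst.add(2 * ds), _mm256_permute2f128_ps::<0x20>(s2, s6));
        _mm256_storeu_ps(dst.add(3 * ds), _mm256_permute2f128_ps::<0x20>(s3, s7));
        _mm256_storeu_ps(dst.add(4 * ds), _mm256_permute2f128_ps::<0x31>(s0, s4));
        _mm256_storeu_ps(dst.add(5 * ds), _mm256_permute2f128_ps::<0x31>(s1, s5));
        _mm256_storeu_ps(dst.add(6 * ds), _mm256_permute2f128_ps::<0x31>(s2, s6));
        _mm256_storeu_ps(dst.add(7 * ds), _mm256_permute2f128_ps::<0x31>(s3, s7));
    }
}

/// Safe wrapper (hypothetical): checks bounds, uses the AVX2 kernel when the
/// CPU reports support, and otherwise falls back to a scalar loop.
fn transpose_8x8(src: &[f32], ss: usize, dst: &mut [f32], ds: usize) {
    assert!(ss >= 8 && ds >= 8);
    assert!(src.len() >= 7 * ss + 8 && dst.len() >= 7 * ds + 8);
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // Sound: AVX2 was detected and the asserts above bound all accesses.
            unsafe { tile_8x8_avx2(src.as_ptr(), ss, dst.as_mut_ptr(), ds) };
            return;
        }
    }
    for i in 0..8 {
        for j in 0..8 {
            dst[j * ds + i] = src[i * ss + j];
        }
    }
}
```

The sketch uses unaligned loads and stores so it works for any stride; whether the real kernel requires 32-byte alignment is not stated in the source.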

§References

  • Lam, Rothberg & Wolf (1991) Cache Performance of Blocked Algorithms
  • Intel Intrinsics Guide: _mm256_unpacklo_ps, _mm256_shuffle_ps, _mm256_permute2f128_ps

§Functions

transpose
Transpose a matrix: B = A^T. Dispatches to AVX2 or scalar.
transpose_avx2
AVX2 matrix transpose using 8×8 in-register micro-kernel.
transpose_scalar
Scalar reference transpose: B[j * rows + i] = A[i * cols + j].
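The indexing rule for the scalar reference can be written out directly. The signature below is an assumption (the listing gives only the formula), as are the `f32` element type and the row-major layout with `rows` and `cols` describing the input matrix A.

```rust
/// Sketch of a scalar reference transpose with an assumed signature.
/// `a` is row-major `rows × cols`; `b` receives row-major `cols × rows`.
fn transpose_scalar(a: &[f32], b: &mut [f32], rows: usize, cols: usize) {
    assert_eq!(a.len(), rows * cols);
    assert_eq!(b.len(), rows * cols);
    for i in 0..rows {
        for j in 0..cols {
            // Element (i, j) of A becomes element (j, i) of B.
            b[j * rows + i] = a[i * cols + j];
        }
    }
}
```

For a 2×3 input `[1, 2, 3, 4, 5, 6]`, this fills `b` with `[1, 4, 2, 5, 3, 6]`, the 3×2 transpose. A routine like this is also a convenient oracle for testing a vectorized path against.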