Function lokacore::arch::x86_64::transpose4

pub fn transpose4(r0: &mut m128, r1: &mut m128, r2: &mut m128, r3: &mut m128)

Transposes the four m128 values in place, treating them as the rows of a 4x4 matrix of f32 lanes.

The Intel Intrinsics Guide gives the following pseudocode for the corresponding _MM_TRANSPOSE4_PS macro:

__m128 tmp3, tmp2, tmp1, tmp0;
tmp0 := _mm_unpacklo_ps(row0, row1);
tmp2 := _mm_unpacklo_ps(row2, row3);
tmp1 := _mm_unpackhi_ps(row0, row1);
tmp3 := _mm_unpackhi_ps(row2, row3);
row0 := _mm_movelh_ps(tmp0, tmp2);
row1 := _mm_movehl_ps(tmp2, tmp0);
row2 := _mm_movelh_ps(tmp1, tmp3);
row3 := _mm_movehl_ps(tmp3, tmp1);
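
The pseudocode above can be sketched in Rust with the raw core::arch intrinsics. This is only an illustration of the shuffle sequence, not lokacore's actual implementation: the name transpose4_sketch and the array-based signature are hypothetical, chosen so the example does not depend on the m128 wrapper. The lane comments assume rows named a, b, c, d. A scalar fallback is included only so the sketch also compiles on non-x86_64 targets.

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::{
    _mm_loadu_ps, _mm_movehl_ps, _mm_movelh_ps, _mm_storeu_ps, _mm_unpackhi_ps,
    _mm_unpacklo_ps,
};

/// Hypothetical helper (not part of lokacore): transposes a 4x4 `f32` matrix,
/// stored as four rows, in place using the SSE shuffle sequence.
#[cfg(target_arch = "x86_64")]
pub fn transpose4_sketch(rows: &mut [[f32; 4]; 4]) {
    // SAFETY: SSE is part of the x86_64 baseline, and every pointer refers to
    // a full `[f32; 4]`, which the unaligned load/store intrinsics accept.
    unsafe {
        let row0 = _mm_loadu_ps(rows[0].as_ptr()); // [a0, a1, a2, a3]
        let row1 = _mm_loadu_ps(rows[1].as_ptr()); // [b0, b1, b2, b3]
        let row2 = _mm_loadu_ps(rows[2].as_ptr()); // [c0, c1, c2, c3]
        let row3 = _mm_loadu_ps(rows[3].as_ptr()); // [d0, d1, d2, d3]

        // Interleave pairs of rows.
        let tmp0 = _mm_unpacklo_ps(row0, row1); // [a0, b0, a1, b1]
        let tmp2 = _mm_unpacklo_ps(row2, row3); // [c0, d0, c1, d1]
        let tmp1 = _mm_unpackhi_ps(row0, row1); // [a2, b2, a3, b3]
        let tmp3 = _mm_unpackhi_ps(row2, row3); // [c2, d2, c3, d3]

        // Combine 64-bit halves to form the transposed rows.
        _mm_storeu_ps(rows[0].as_mut_ptr(), _mm_movelh_ps(tmp0, tmp2)); // [a0, b0, c0, d0]
        _mm_storeu_ps(rows[1].as_mut_ptr(), _mm_movehl_ps(tmp2, tmp0)); // [a1, b1, c1, d1]
        _mm_storeu_ps(rows[2].as_mut_ptr(), _mm_movelh_ps(tmp1, tmp3)); // [a2, b2, c2, d2]
        _mm_storeu_ps(rows[3].as_mut_ptr(), _mm_movehl_ps(tmp3, tmp1)); // [a3, b3, c3, d3]
    }
}

/// Scalar fallback with the same result, used only on non-x86_64 targets.
#[cfg(not(target_arch = "x86_64"))]
pub fn transpose4_sketch(rows: &mut [[f32; 4]; 4]) {
    let src = *rows;
    for i in 0..4 {
        for j in 0..4 {
            rows[i][j] = src[j][i];
        }
    }
}
```

Calling transpose4_sketch on the rows [0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15] leaves [0, 4, 8, 12], [1, 5, 9, 13], [2, 6, 10, 14], [3, 7, 11, 15]. Note the swapped argument order in the two _mm_movehl_ps calls: that intrinsic places the upper half of its second argument in the lower half of the result.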