

Macro avx512_microkernel_broadcast_b 

avx512_microkernel_broadcast_b!() { /* proc-macro */ }

Generate an AVX-512 broadcast-B microkernel (faer-style).

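Layout: A is packed as MR×K in column-major order (the MR elements of each K step are contiguous), and B is packed as K×NR in row-major order (the NR elements of each K step are contiguous). C is MR×NR with stride ldc; each column of C is handled as MR/16 zmm-sized chunks of 16 elements.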
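The layout above can be written down as index arithmetic. The following is a minimal sketch, not code from this crate: the function names are hypothetical, and it assumes that ldc is the distance in elements between consecutive columns of C, which the description implies but does not state.

```rust
/// Offset of A[i, k] in a packed MR×K panel (MR contiguous per K step).
/// Hypothetical helper, for illustration only.
pub fn packed_a_offset(i: usize, k: usize, mr: usize) -> usize {
    debug_assert!(i < mr);
    k * mr + i
}

/// Offset of B[k, j] in a packed K×NR panel (NR contiguous per K step).
/// Hypothetical helper, for illustration only.
pub fn packed_b_offset(k: usize, j: usize, nr: usize) -> usize {
    debug_assert!(j < nr);
    k * nr + j
}

/// Offset of C[i, j], ASSUMING `ldc` is the element distance between
/// consecutive columns (column-major C).
pub fn c_offset(i: usize, j: usize, ldc: usize) -> usize {
    debug_assert!(i < ldc);
    j * ldc + i
}
```

With MR = 48 (an illustrative value), `packed_a_offset(3, 2, 48)` skips two full K steps of 48 elements and then three rows, so one K step of A is always a contiguous run of MR elements that can be loaded as MR/16 vectors.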

Strategy (broadcast-B):

  • Each K step: load MR/16 zmm from A, broadcast NR B scalars
  • Each accumulator holds 16 elements of one column of C
  • Total accumulators = (MR/16) × NR
  • Per K step: MR/16 A loads + NR B broadcasts + (MR/16) × NR FMAs
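The strategy above can be modelled in portable scalar Rust, with a 16-element array standing in for each zmm register. This is a sketch of the technique under stated assumptions, not the code the macro generates: the function name is hypothetical, the element type is assumed to be f32 (16 lanes per 512-bit register), C is assumed column-major with column stride ldc, and the kernel is assumed to accumulate into C rather than overwrite it.

```rust
const LANES: usize = 16; // 32-bit lanes in one 512-bit zmm register (f32 assumed)

/// Scalar model of a broadcast-B microkernel computing C += A * B.
/// Hypothetical name; `a` is MR×K packed, `b` is K×NR packed,
/// `c` is column-major with column stride `ldc` (an assumption).
pub fn microkernel_model(
    mr: usize, nr: usize, k: usize,
    a: &[f32], b: &[f32], c: &mut [f32], ldc: usize,
) {
    assert!(mr > 0 && nr > 0 && mr % LANES == 0 && ldc >= mr);
    assert!(a.len() >= mr * k && b.len() >= k * nr);
    assert!(c.len() >= (nr - 1) * ldc + mr);
    let chunks = mr / LANES;

    // One accumulator "register" per (column, chunk): (MR/16) × NR in total.
    let mut acc = vec![[0.0f32; LANES]; chunks * nr];

    for p in 0..k {
        let a_step = &a[p * mr..(p + 1) * mr]; // MR contiguous elements of A
        let b_step = &b[p * nr..(p + 1) * nr]; // NR contiguous elements of B
        for ch in 0..chunks {
            // Models one zmm load from A, reused for all NR columns.
            let a_vec = &a_step[ch * LANES..(ch + 1) * LANES];
            for j in 0..nr {
                let bj = b_step[j]; // models broadcasting one B scalar
                let acc_reg = &mut acc[j * chunks + ch];
                for l in 0..LANES {
                    // Models one lane of a fused multiply-add.
                    acc_reg[l] = a_vec[l].mul_add(bj, acc_reg[l]);
                }
            }
        }
    }

    // Write-back: each column of C receives its MR/16 chunks.
    for j in 0..nr {
        for ch in 0..chunks {
            for l in 0..LANES {
                c[j * ldc + ch * LANES + l] += acc[j * chunks + ch][l];
            }
        }
    }
}
```

Each accumulator only ever sees elements of a single column of C, which is why the write-back needs no shuffles or transposes: a chunk is stored directly to its column at offset `j * ldc + ch * 16`.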

Advantage over broadcast-A: NR can be small (e.g. 6), which keeps the B panel tiny and allows KC to stay large (256 or more). This matches faer’s nano-gemm approach.
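The counts and the panel-size argument can be checked with a little arithmetic. The sketch below uses hypothetical names and makes three assumptions that are not stated on this page: elements are f32 (4 bytes), the register budget is the 32 zmm registers that AVX-512 provides in 64-bit mode, and one extra register is reserved for the broadcast B value.

```rust
/// zmm registers available with AVX-512 in 64-bit mode.
pub const ZMM_REGS: usize = 32;
/// f32 lanes per zmm register (f32 element type is an assumption).
pub const LANES_PER_ZMM: usize = 16;

/// Per-K-step work of a broadcast-B kernel (hypothetical helper type).
#[derive(Debug, PartialEq)]
pub struct StepCost {
    pub accumulators: usize,
    pub a_loads: usize,
    pub b_broadcasts: usize,
    pub fmas: usize,
}

pub fn per_step_cost(mr: usize, nr: usize) -> StepCost {
    assert!(mr % LANES_PER_ZMM == 0);
    let chunks = mr / LANES_PER_ZMM;
    StepCost {
        accumulators: chunks * nr,
        a_loads: chunks,
        b_broadcasts: nr,
        fmas: chunks * nr,
    }
}

/// Conservative register check: accumulators + A vectors + one register
/// for the broadcast B value (the "+ 1" is an assumption of this sketch).
pub fn fits_in_registers(mr: usize, nr: usize) -> bool {
    let cost = per_step_cost(mr, nr);
    cost.accumulators + cost.a_loads + 1 <= ZMM_REGS
}

/// Size in bytes of a packed KC×NR panel of B, assuming f32 elements.
pub fn b_panel_bytes(kc: usize, nr: usize) -> usize {
    kc * nr * std::mem::size_of::<f32>()
}
```

For the illustrative tile MR = 48, NR = 6, this gives 18 accumulators, 3 A loads, 6 broadcasts and 18 FMAs per K step, using 22 of the 32 registers. With KC = 256, `b_panel_bytes(256, 6)` is 6144 bytes, which shows how a small NR keeps the B panel compact even at a large KC.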