avx512_microkernel_broadcast_b!() { /* proc-macro */ }
Generate an AVX-512 broadcast-B microkernel (faer-style).
Layout: A is MR×K, packed column-major (MR contiguous elements per K step).
B is K×NR, packed row-major (NR contiguous elements per K step).
C is MR×NR with stride ldc between columns; each column is stored as MR/16 zmm chunks.
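The layout above can be sketched as three index functions. This is only an illustration: the helper names `a_index`, `b_index` and `c_index` are hypothetical and not part of the macro, and treating `ldc` as the distance between consecutive columns of C is an assumption read off the "chunks per column" wording.

```rust
/// Offset of A[i, k] in the packed A panel (hypothetical helper).
/// The MR elements of one K step are contiguous.
fn a_index(i: usize, k: usize, mr: usize) -> usize {
    k * mr + i
}

/// Offset of B[k, j] in the packed B panel (hypothetical helper).
/// The NR elements of one K step are contiguous.
fn b_index(k: usize, j: usize, nr: usize) -> usize {
    k * nr + j
}

/// Offset of C[i, j], assuming `ldc` is the stride between columns.
fn c_index(i: usize, j: usize, ldc: usize) -> usize {
    j * ldc + i
}
```

With MR = 32 and NR = 6, for example, K step 2 of A starts at offset 64 and K step 2 of B starts at offset 12.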
Strategy (broadcast-B):
- Each K step: load MR/16 zmm vectors from A and broadcast each of the NR B scalars
- Each accumulator holds 16 elements of one column of C
- Total accumulators = (MR/16) × NR
- Per K step: MR/16 A loads + NR B broadcasts + (MR/16) × NR FMAs
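The strategy above can be modelled in portable scalar Rust, with each `[f32; 16]` array standing in for one zmm accumulator. This is a reference sketch and not the code the macro generates: the function name `microkernel_ref`, the `f32` element type (inferred from 16 lanes per zmm), and the choice to accumulate into C rather than overwrite it are all assumptions.

```rust
const LANES: usize = 16; // f32 lanes per 512-bit zmm register (assumed element type)

/// Scalar model of a broadcast-B microkernel: C += A * B (hypothetical name).
/// `a` is MR x K packed column-major, `b` is K x NR packed row-major,
/// `c` is column-major with column stride `ldc` (assumption).
fn microkernel_ref<const MR: usize, const NR: usize>(
    k: usize,
    a: &[f32],
    b: &[f32],
    c: &mut [f32],
    ldc: usize,
) {
    assert!(MR % LANES == 0);
    assert!(a.len() >= MR * k && b.len() >= k * NR);
    assert!(ldc >= MR && c.len() >= (NR - 1) * ldc + MR);

    let chunks = MR / LANES;
    // (MR/16) x NR accumulators, each modelling one zmm register.
    let mut acc = vec![[0.0f32; LANES]; chunks * NR];

    for p in 0..k {
        let a_step = &a[p * MR..(p + 1) * MR]; // MR/16 vector loads from A
        let b_step = &b[p * NR..(p + 1) * NR];
        for j in 0..NR {
            let bj = b_step[j]; // one B scalar, broadcast across all lanes
            for v in 0..chunks {
                let reg = &mut acc[j * chunks + v];
                for l in 0..LANES {
                    reg[l] += a_step[v * LANES + l] * bj; // one FMA per (v, j)
                }
            }
        }
    }

    // Write back: each accumulator covers 16 rows of one column of C.
    for j in 0..NR {
        for v in 0..chunks {
            for l in 0..LANES {
                c[j * ldc + v * LANES + l] += acc[j * chunks + v][l];
            }
        }
    }
}
```

A call such as `microkernel_ref::<16, 2>(3, &a, &b, &mut c, 16)` uses 1 × 2 = 2 accumulators and performs, per K step, 1 A load, 2 broadcasts and 2 vector FMAs, matching the counts listed above.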
Advantage over broadcast-A: NR can be small (6), which keeps the B panel tiny and lets KC stay large (256+). This matches faer’s nano-gemm approach.
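To put rough numbers on this trade-off, the following sketch computes the accumulator count and the packed B panel footprint. The helper names are hypothetical, and the 4-byte element size and the MR = 32 tile used in the example are assumptions for illustration only.

```rust
/// Number of zmm accumulators for an MR x NR tile: (MR/16) x NR.
/// Hypothetical helper; assumes 16 elements per zmm register.
fn accumulators(mr: usize, nr: usize) -> usize {
    (mr / 16) * nr
}

/// Size in bytes of a packed KC x NR B panel (hypothetical helper).
fn b_panel_bytes(kc: usize, nr: usize, elem_size: usize) -> usize {
    kc * nr * elem_size
}
```

Under these assumptions, NR = 6 with KC = 256 gives a B panel of 256 × 6 × 4 = 6144 bytes, and an MR = 32 tile needs (32/16) × 6 = 12 accumulators out of the 32 zmm registers AVX-512 provides.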