pub fn tmp_buffer_bytes(num_heads: u32, head_dim: u32) -> usize
Computes the size, in bytes, of the temporary buffer needed for TQ SDPA.
Uses the same formula as F16 SDPA: the buffer holds NWG partial output vectors plus the S/M values.
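
The sizing described above could look roughly like the sketch below. It is not the actual implementation, and every detail not stated above is an assumption: the name `tmp_buffer_bytes_sketch` and the constant `NWG = 32` are hypothetical, elements are assumed to be `f32`, S and M are assumed to be one scalar each per partial result, and one set of partials is assumed per head.

```rust
/// Hypothetical workgroup count. The real NWG value is not given in the
/// documentation above; 32 is only a placeholder.
const NWG: usize = 32;

/// Sketch of the temporary-buffer sizing, under the assumptions listed above:
/// each head gets NWG partial results, and each partial result is one output
/// vector of `head_dim` floats plus two scalars (S and M).
pub fn tmp_buffer_bytes_sketch(num_heads: u32, head_dim: u32) -> usize {
    // Floats per partial result: the output vector plus the S and M scalars.
    let floats_per_partial = head_dim as usize + 2;
    // Assumed element type is f32 (4 bytes).
    num_heads as usize * NWG * floats_per_partial * std::mem::size_of::<f32>()
}
```

Under these assumptions the size grows linearly in both `num_heads` and `head_dim`, and it is independent of sequence length because only per-workgroup partials are stored.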