2-D row-major slice + concat-by-column primitives.
Used by hf2q’s ADR-020 Track 1 multi-head SDPA on GpuTape: Q/K/V tensors are sliced into per-head views, each head runs the single-head SDPA chain, and per-head context outputs are concatenated back into the full attention output.
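The per-head flow described above can be sketched on the CPU as follows. This is only an illustration, not the GpuTape implementation: `per_head_placeholder` and `multi_head_round_trip` are hypothetical names, the placeholder merely doubles each value in place of the real single-head SDPA chain, and the layout in which head `h` occupies columns `[h * head_dim, (h + 1) * head_dim)` is an assumption.

```rust
/// Hypothetical stand-in for the single-head SDPA chain.
/// It only doubles each element so the data flow is easy to follow.
fn per_head_placeholder(head: &[f32]) -> Vec<f32> {
    head.iter().map(|x| x * 2.0).collect()
}

/// Slice a row-major [rows, n_heads * head_dim] tensor into per-head
/// [rows, head_dim] views, process each head, and concatenate the results
/// back by column. Assumes head h owns columns h*head_dim..(h+1)*head_dim.
fn multi_head_round_trip(q: &[f32], rows: usize, n_heads: usize, head_dim: usize) -> Vec<f32> {
    let cols = n_heads * head_dim;
    assert_eq!(q.len(), rows * cols, "input must be rows * n_heads * head_dim long");

    // Pre-zeroed output; each head fills only its own column slab.
    let mut out = vec![0.0f32; rows * cols];

    for h in 0..n_heads {
        let start = h * head_dim;

        // Slice step: head[r, c] = q[r, start + c].
        let mut head = Vec::with_capacity(rows * head_dim);
        for r in 0..rows {
            head.extend_from_slice(&q[r * cols + start..r * cols + start + head_dim]);
        }

        // Per-head work (placeholder for single-head SDPA).
        let ctx = per_head_placeholder(&head);

        // Concat step: out[r, start + c] = ctx[r, c].
        for r in 0..rows {
            out[r * cols + start..r * cols + start + head_dim]
                .copy_from_slice(&ctx[r * head_dim..(r + 1) * head_dim]);
        }
    }
    out
}
```

Because every head writes a disjoint column range, the order in which heads are processed does not affect the result.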
Two kernels:
- slice_2d_cols_f32(input[rows, in_cols], output[rows, out_cols], (in_cols, out_cols, start_col)) produces output[r, c] = input[r, start_col + c].
- copy_2d_cols_into_f32(src[rows, src_cols], dst[rows, dst_cols], (src_cols, dst_cols, start)) writes dst[r, start + c] = src[r, c] for c < src_cols. Caller pre-zeros (or pre-populates) dst; this kernel writes the slab only.
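The two index formulas can be checked against a plain CPU reference. The sketch below is an assumption-laden illustration rather than the kernels themselves: the `_ref` function names and the bounds assertions are hypothetical additions, and only the indexing is taken from the descriptions above.

```rust
/// CPU reference for the slice kernel:
/// output[r, c] = input[r, start_col + c] for c < out_cols.
fn slice_2d_cols_ref(
    input: &[f32],
    rows: usize,
    in_cols: usize,
    out_cols: usize,
    start_col: usize,
) -> Vec<f32> {
    assert_eq!(input.len(), rows * in_cols);
    // Hypothetical bounds check: the slice must fit inside the input width.
    assert!(start_col + out_cols <= in_cols, "slice exceeds input width");

    let mut output = vec![0.0f32; rows * out_cols];
    for r in 0..rows {
        for c in 0..out_cols {
            output[r * out_cols + c] = input[r * in_cols + start_col + c];
        }
    }
    output
}

/// CPU reference for the copy kernel:
/// dst[r, start + c] = src[r, c] for c < src_cols.
/// Columns of dst outside the slab are left untouched.
fn copy_2d_cols_into_ref(
    src: &[f32],
    dst: &mut [f32],
    rows: usize,
    src_cols: usize,
    dst_cols: usize,
    start: usize,
) {
    assert_eq!(src.len(), rows * src_cols);
    assert_eq!(dst.len(), rows * dst_cols);
    // Hypothetical bounds check: the slab must fit inside the destination width.
    assert!(start + src_cols <= dst_cols, "slab exceeds destination width");

    for r in 0..rows {
        for c in 0..src_cols {
            dst[r * dst_cols + start + c] = src[r * src_cols + c];
        }
    }
}
```

For example, slicing columns 1..3 out of a 2 x 4 input and copying the result into a destination pre-filled with -1.0 at column offset 2 overwrites only columns 2 and 3 of each row; columns 0 and 1 keep their pre-populated values.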
Statics
Functions
- dispatch_copy_2d_cols_into_f32: Write src[rows, src_cols] into dst[rows, dst_cols] at column offset start_col. Does NOT touch dst columns outside the slab; the caller pre-zeros (or pre-populates) dst.
- dispatch_slice_2d_cols_f32: Slice output[r, c] = input[r, start_col + c] for c < out_cols.
- register