Module slice_concat_2d

2-D row-major slice + concat-by-column primitives.

Used by hf2q’s ADR-020 Track 1 multi-head SDPA on GpuTape: Q/K/V tensors are sliced into per-head views, each head runs the single-head SDPA chain, and per-head context outputs are concatenated back into the full attention output.

Two kernels:

  • slice_2d_cols_f32(input[rows, in_cols], output[rows, out_cols], (in_cols, out_cols, start_col)) produces output[r, c] = input[r, start_col + c].
  • copy_2d_cols_into_f32(src[rows, src_cols], dst[rows, dst_cols], (src_cols, dst_cols, start)) writes dst[r, start + c] = src[r, c] for c < src_cols. Caller pre-zeros (or pre-populates) dst; this kernel writes the slab only.
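The indexing rules of the two kernels can be sketched as a plain CPU reference. This is a minimal illustration, not the crate's implementation: the function names `slice_2d_cols_ref`, `copy_2d_cols_into_ref`, and `split_and_rejoin_heads` are hypothetical, the bounds checks are an assumption (the description above does not say how out-of-range columns are handled), and the per-head SDPA step is left as a placeholder comment.

```rust
/// CPU reference for the slice rule: output[r, c] = input[r, start_col + c].
/// Hypothetical helper, not part of the crate's API.
fn slice_2d_cols_ref(
    input: &[f32],
    rows: usize,
    in_cols: usize,
    out_cols: usize,
    start_col: usize,
) -> Vec<f32> {
    assert_eq!(input.len(), rows * in_cols);
    // Assumption: the requested window must lie inside the input row.
    assert!(start_col + out_cols <= in_cols);
    let mut output = vec![0.0f32; rows * out_cols];
    for r in 0..rows {
        let s = r * in_cols + start_col;
        output[r * out_cols..(r + 1) * out_cols].copy_from_slice(&input[s..s + out_cols]);
    }
    output
}

/// CPU reference for the copy rule: dst[r, start_col + c] = src[r, c] for c < src_cols.
/// Columns of dst outside the slab are left untouched.
fn copy_2d_cols_into_ref(
    src: &[f32],
    dst: &mut [f32],
    rows: usize,
    src_cols: usize,
    dst_cols: usize,
    start_col: usize,
) {
    assert_eq!(src.len(), rows * src_cols);
    assert_eq!(dst.len(), rows * dst_cols);
    // Assumption: the slab must fit inside the destination row.
    assert!(start_col + src_cols <= dst_cols);
    for r in 0..rows {
        let d = r * dst_cols + start_col;
        dst[d..d + src_cols].copy_from_slice(&src[r * src_cols..(r + 1) * src_cols]);
    }
}

/// Slice a [rows, n_heads * head_dim] tensor into per-head views and
/// concatenate them back by column. With no per-head work, this is the identity.
fn split_and_rejoin_heads(x: &[f32], rows: usize, n_heads: usize, head_dim: usize) -> Vec<f32> {
    let cols = n_heads * head_dim;
    let mut out = vec![0.0f32; rows * cols]; // caller pre-zeros dst
    for h in 0..n_heads {
        let head = slice_2d_cols_ref(x, rows, cols, head_dim, h * head_dim);
        // (per-head attention would transform `head` here)
        copy_2d_cols_into_ref(&head, &mut out, rows, head_dim, cols, h * head_dim);
    }
    out
}
```

Because the copy writes only its own slab, running it once per head with offsets `h * head_dim` fills every column of the destination exactly once, which is why the slice-then-concat round trip reproduces the input.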

Statics§

SLICE_CONCAT_2D_SHADER_SOURCE

Functions§

dispatch_copy_2d_cols_into_f32
Write src[rows, src_cols] into dst[rows, dst_cols] at column offset start_col. Does NOT touch dst columns outside the slab — caller pre-zeros (or pre-populates) dst.
dispatch_slice_2d_cols_f32
Slice output[r, c] = input[r, start_col + c] for c < out_cols.
register