Skip to main content

Module qkv_split

Module qkv_split 

Source
Expand description

GPU-accelerated split of a fused QKV tensor into separate Q/K/V outputs.

Input layout (per token, contiguous f32):

qkv[t, :] = [ Q (q_sp) | K (k_sp) | V (v_sp) ]   (length = qkv_ch)

Where q_sp = n_k_heads * d_k, k_sp = n_k_heads * d_k, and v_sp = n_v_heads * d_v. The kernel writes each input element to exactly one of {q, k, v} in a single dispatch — replacing the prior CPU download → triple-loop split → 3× upload round-trip used by the qwen35 Gated DeltaNet prefill path.

ADR-005 W-5b.18 (2026-04-27): targets the 838 ms / 17.5 ms-per-layer layer.qkv_deinterleave bucket in hf2q::gpu_delta_net.

Production caller: hf2q::inference::models::qwen35::gpu_delta_net:: apply_proj (prefill seq>1 branch).

Structs§

QkvSplitParams
Parameters for a fused-QKV split operation.

Statics§

QKV_SPLIT_SHADER_SOURCE
MSL source for the QKV-split kernel (embedded at compile time).

Functions§

dispatch_qkv_split_f32
Dispatch a fused-QKV split on the GPU.
register
Register the QKV-split shader source with the given kernel registry.