Expand description
GPU-accelerated split of a fused QKV tensor into separate Q/K/V outputs.
Input layout (per token, contiguous f32):
qkv[t, :] = [ Q (q_sp) | K (k_sp) | V (v_sp) ] (length = qkv_ch)Where q_sp = n_k_heads * d_k, k_sp = n_k_heads * d_k, and
v_sp = n_v_heads * d_v. The kernel writes each input element to exactly
one of {q, k, v} in a single dispatch — replacing the prior CPU
download → triple-loop split → 3× upload round-trip used by the qwen35
Gated DeltaNet prefill path.
ADR-005 W-5b.18 (2026-04-27): targets the 838 ms / 17.5 ms-per-layer
layer.qkv_deinterleave bucket in hf2q::gpu_delta_net.
Production caller: hf2q::inference::models::qwen35::gpu_delta_net:: apply_proj (prefill seq>1 branch).
Structs§
- QkvSplit
Params - Parameters for a fused-QKV split operation.
Statics§
- QKV_
SPLIT_ SHADER_ SOURCE - MSL source for the QKV-split kernel (embedded at compile time).
Functions§
- dispatch_
qkv_ split_ f32 - Dispatch a fused-QKV split on the GPU.
- register
- Register the QKV-split shader source with the given kernel registry.