GPU-accelerated tiled-GQA broadcast: [T, Hg, K] → [T, H, K] F32.
Replaces the hf2q-side CPU triple-loop tiled-replicate at
gpu_delta_net.rs:893-940 (q_expanded / k_expanded fill,
~497 ms / 10.4 ms-per-layer at PP4106 per the W-5b.17 audit).
Mapping:
    dst[t, h, k] = src[t, h % Hg, k]
where Hg = n_k_heads, H = n_v_heads, K = head_dim. The “tiled”
variant matches Qwen3.6 GGUF tensor layout (per
project_qwen36_gqa_tiled_vs_block and gpu_delta_net.rs:834-866),
and is the same convention as llama.cpp’s ggml_repeat_4d graph op.
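For reference, the mapping above can be sketched as a plain CPU loop. This is only an illustration, not the code at gpu_delta_net.rs or the MSL kernel: the helper name `repeat_tiled_ref` is hypothetical, and the sketch assumes contiguous row-major [T, Hg, K] and [T, H, K] buffers with H a multiple of Hg.

```rust
/// CPU reference for the tiled-GQA broadcast: dst[t, h, k] = src[t, h % Hg, k].
/// Hypothetical helper for illustration; not part of this module's API.
/// Assumes contiguous row-major buffers and H divisible by Hg.
fn repeat_tiled_ref(src: &[f32], t_len: usize, hg: usize, h: usize, k: usize) -> Vec<f32> {
    assert!(hg > 0 && h % hg == 0, "assumed: H is a multiple of Hg");
    assert_eq!(src.len(), t_len * hg * k, "src must hold T * Hg * K elements");

    let mut dst = vec![0.0f32; t_len * h * k];
    for t in 0..t_len {
        for head in 0..h {
            // Tiled convention: output head `head` reads source head `head % Hg`.
            let src_off = (t * hg + head % hg) * k;
            let dst_off = (t * h + head) * k;
            dst[dst_off..dst_off + k].copy_from_slice(&src[src_off..src_off + k]);
        }
    }
    dst
}

fn main() {
    // T = 1, Hg = 2, H = 4, K = 2: source heads are [1, 2] and [10, 20].
    let src = [1.0f32, 2.0, 10.0, 20.0];
    let dst = repeat_tiled_ref(&src, 1, 2, 4, 2);
    // Output heads 0, 1, 2, 3 read source heads 0, 1, 0, 1.
    assert_eq!(dst, vec![1.0, 2.0, 10.0, 20.0, 1.0, 2.0, 10.0, 20.0]);
}
```

With Hg = 2 and H = 4, the tiled order repeats the source heads as 0, 1, 0, 1, rather than grouping the copies of each source head together.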
ADR-005 W-5b.19 (2026-04-27): single-dispatch GPU broadcast eliminates
the chunk-wrapper’s CPU memcpy bucket. Production caller:
hf2q::inference::models::qwen35::gpu_delta_net::apply_gated_delta_net_chunk
(chunk-prefill GQA pre-expansion).
Structs§
- RepeatTiledParams
- Parameters for a tiled-GQA broadcast operation.
Statics§
- REPEAT_TILED_SHADER_SOURCE
- MSL source for the tiled-repeat kernel (embedded at compile time).
Functions§
- dispatch_repeat_tiled_f32
- Dispatch a tiled-GQA broadcast on the GPU.
- register
- Register the repeat-tiled shader source with the given kernel registry.