Module repeat_tiled

GPU-accelerated tiled-GQA broadcast: [T, Hg, K] → [T, H, K] F32.

Replaces the hf2q-side CPU triple-loop tiled replicate at gpu_delta_net.rs:893-940 (the q_expanded / k_expanded fill), which the W-5b.17 audit measured at ~497 ms, or 10.4 ms per layer, at PP4106.

Mapping:

dst[t, h, k] = src[t, h % Hg, k]

Where Hg = n_k_heads, H = n_v_heads, K = head_dim. The “tiled” variant matches Qwen3.6 GGUF tensor layout (per project_qwen36_gqa_tiled_vs_block and gpu_delta_net.rs:834-866), and is the same convention as llama.cpp’s ggml_repeat_4d graph op.
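
As a point of reference for the mapping above, here is a minimal CPU sketch of the same index arithmetic. It is not the crate's kernel or the hf2q loop it replaces: the function name repeat_tiled_ref and its parameter names are hypothetical, and it assumes contiguous row-major buffers and that H is a multiple of Hg.

```rust
/// Hypothetical CPU reference for the tiled mapping
/// dst[t, h, k] = src[t, h % Hg, k].
///
/// Assumptions (not taken from the crate): `src` is a contiguous row-major
/// [T, Hg, K] buffer, the result is a contiguous row-major [T, H, K] buffer,
/// and H is a multiple of Hg.
fn repeat_tiled_ref(src: &[f32], t_len: usize, hg: usize, h: usize, k_len: usize) -> Vec<f32> {
    assert!(hg > 0 && h % hg == 0, "H must be a multiple of Hg");
    assert_eq!(src.len(), t_len * hg * k_len, "src must hold T * Hg * K values");

    let mut dst = vec![0.0f32; t_len * h * k_len];
    for t in 0..t_len {
        for head in 0..h {
            // Tiled: destination head `head` reads source head `head % Hg`,
            // so the Hg source heads repeat as a whole group H / Hg times.
            let src_off = (t * hg + head % hg) * k_len;
            let dst_off = (t * h + head) * k_len;
            dst[dst_off..dst_off + k_len].copy_from_slice(&src[src_off..src_off + k_len]);
        }
    }
    dst
}
```

With T = 1, Hg = 2, H = 4, K = 2 and src = [1, 2, 3, 4] (source head 0 is [1, 2], source head 1 is [3, 4]), the sketch yields [1, 2, 3, 4, 1, 2, 3, 4]: destination heads 0 and 2 copy source head 0, and destination heads 1 and 3 copy source head 1.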

ADR-005 W-5b.19 (2026-04-27): single-dispatch GPU broadcast eliminates the chunk-wrapper’s CPU memcpy bucket. Production caller: hf2q::inference::models::qwen35::gpu_delta_net::apply_gated_delta_net_chunk (chunk-prefill GQA pre-expansion).

Structs§

RepeatTiledParams: Parameters for a tiled-GQA broadcast operation.

Statics§

REPEAT_TILED_SHADER_SOURCE: MSL source for the tiled-repeat kernel (embedded at compile time).

Functions§

dispatch_repeat_tiled_f32: Dispatch a tiled-GQA broadcast on the GPU.
register: Register the repeat-tiled shader source with the given kernel registry.