KV cache GPU copy dispatch.
Copies new K or V data directly from a source GPU buffer into a pre-allocated KV cache buffer at the correct write position, with optional modulo wrapping for sliding window (ring buffer) caches.
This eliminates the CPU round-trip that append_bf16 requires:
instead of GPU -> CPU (as_slice) -> CPU (copy loop) -> shared buffer,
the GPU copies directly between two shared Metal buffers.
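The write-position logic described above can be sketched on the CPU as a reference. This is a minimal sketch, not the crate's API: the helper names `cache_row` and `copy_into_cache`, the `sliding_window` flag, and the layout (one contiguous row of `row_len` elements per token position) are all assumptions made for illustration. The real kernels run on the GPU and may use a different layout, such as head-major.

```rust
/// Hypothetical helper: the cache row that new token `i` lands in.
/// With `sliding_window` set, the position wraps modulo `capacity`
/// (ring buffer); otherwise it is used as-is.
fn cache_row(write_pos: usize, i: usize, capacity: usize, sliding_window: bool) -> usize {
    let pos = write_pos + i;
    if sliding_window {
        pos % capacity
    } else {
        pos
    }
}

/// Hypothetical CPU reference: copy every `row_len`-element row of `src`
/// into `cache`, starting at `write_pos`.
/// Assumed layout: `cache` holds `capacity` contiguous rows.
fn copy_into_cache(
    src: &[f32],
    cache: &mut [f32],
    row_len: usize,
    write_pos: usize,
    capacity: usize,
    sliding_window: bool,
) {
    assert!(row_len > 0 && src.len() % row_len == 0);
    assert_eq!(cache.len(), capacity * row_len);
    let n_new = src.len() / row_len;
    for i in 0..n_new {
        let row = cache_row(write_pos, i, capacity, sliding_window);
        // A linear (non-wrapping) cache must have room for every new row.
        assert!(row < capacity, "write past the end of a non-wrapping cache");
        let dst = row * row_len;
        cache[dst..dst + row_len].copy_from_slice(&src[i * row_len..(i + 1) * row_len]);
    }
}
```

With a capacity of 4 rows and a write position of 3, three new rows land in rows 3, 0, and 1 when wrapping is enabled. Without wrapping, the same call would trip the bounds assertion, which is why a linear cache has to be pre-allocated for the full sequence length.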
Statics
- KV_CACHE_COPY_SHADER_SOURCE - MSL source for the KV cache copy kernel (embedded at compile time).
Functions
- dispatch_kv_cache_copy - Dispatch a GPU copy from a source bf16 buffer into a KV cache buffer.
- dispatch_kv_cache_copy_batch_f32 - Dispatch a batched GPU copy from a source f32 buffer into an f32 KV cache.
- dispatch_kv_cache_copy_batch_f32_kv_dual - Fused single-position K + V cache copy (F32 source → F32 cache), DECODE shape.
- dispatch_kv_cache_copy_batch_f32_to_f16 - Dispatch a batched F32→F16 copy from a source f32 buffer into an f16 KV cache.
- dispatch_kv_cache_copy_batch_f32_to_f16_kv_dual - Fused single-position K + V cache copy (F32 source → F16 cache), DECODE shape.
- dispatch_kv_cache_copy_f32 - Dispatch a GPU copy from a source f32 buffer into an f32 KV cache buffer.
- dispatch_kv_cache_copy_seq_bf16 - Multi-position, all-heads KV cache copy (BF16 source → F32 cache, batched prefill).
- dispatch_kv_cache_copy_seq_bf16_to_bf16_head_major - ADR-030 iter-95: bit-exact BF16→BF16 strided cache copy from pf_k_perm (head-major BF16) to bf16_xlen_cache (head-major BF16).
- dispatch_kv_cache_copy_seq_f32 - Multi-position, all-heads KV cache copy (F32 source → F32 cache, batched prefill).
- dispatch_kv_cache_copy_seq_f32_dual - Fused K + V cache copy (F32 source → F32 cache). Wave P4.11.
- dispatch_kv_cache_copy_seq_f32_to_f16 - Multi-position, all-heads KV cache copy (F32 source → F16 cache, batched prefill).
- dispatch_kv_cache_copy_seq_f32_to_f16_dual - Fused K + V cache copy (F32 source → F16 cache). Wave P4.11 f16-cache variant of dispatch_kv_cache_copy_seq_f32_dual.
- register - Register KV cache copy shader source with the given kernel registry.