pub struct DispatchRecord {
    pub pipeline: ComputePipelineState,
    pub threadgroups: MTLSize,
    pub threads_per_tg: MTLSize,
    pub threadgroup_mem: Vec<(u64, u64)>,
    pub params_bytes: Vec<u8>,
    pub params_slot: u64,
    pub buffer_slots: Vec<u64>,
    pub op_kind: CapturedOpKind,
    pub kernel_name: String,
}
Pre-baked dispatch record for hot decode paths.
ADR-029 iter-175 Step 1d: first piece of the multi-week “Option A” refactor. The gemma4 decode gap analysis localized the gap to per-dispatch CPU orchestration (forward_mlx::forward_decode → encode_one_layer → dispatch_qmatmul → quantized_matmul_ggml → dispatch_mv → encoder.encode_threadgroups_with_args).
At gemma4 decode m=1, every dispatch within the inner loop has
load-time-immutable shape: the kernel pipeline, threadgroup
geometry, params struct bytes, and binding-slot layout are fully
determined by the weight + ggml_type and never change across the
thousands of decode tokens that follow. DispatchRecord captures
that state once at model-load (or on first-call lazy-init) so the
hot path skips:
- KernelRegistry::get_pipeline* HashMap lookups
- match expressions over ggml_type for kernel-name + geometry
- MTLSize::new construction (already-known values)
- param-struct field stores + bytemuck::bytes_of conversion
Only the runtime-varying buffers (input, output) need to be passed
to CommandEncoder::dispatch_record. Weight buffers are baked
inline via the bake_buffers slot list.
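The bake-once idea can be sketched without Metal. Everything in this example is a hypothetical stand-in (Registry, Quant, Size, Baked, select_kernel, the kernel names and the geometry numbers); none of it is the crate's real API. It only illustrates that a baked record pays for the registry lookup, the type match, and the params encoding once rather than once per token.

```rust
use std::collections::HashMap;

// Stand-in for MTLSize (hypothetical).
#[derive(Clone, Copy, Debug, PartialEq)]
struct Size { w: u64, h: u64, d: u64 }

// Stand-in for a ggml quantization type (hypothetical variants).
#[derive(Clone, Copy, Debug)]
enum Quant { Q4, Q8 }

// Stand-in for a kernel registry; counts lookups so the saving is visible.
struct Registry { pipelines: HashMap<String, u32>, lookups: usize }

impl Registry {
    fn new() -> Self {
        let mut pipelines = HashMap::new();
        pipelines.insert("mv_q4".to_string(), 1);
        pipelines.insert("mv_q8".to_string(), 2);
        Registry { pipelines, lookups: 0 }
    }

    fn get_pipeline(&mut self, name: &str) -> u32 {
        self.lookups += 1;
        self.pipelines[name]
    }
}

// Simplified record holding the state that never changes after load.
#[derive(Clone, Debug, PartialEq)]
struct Baked {
    pipeline: u32,
    threadgroups: Size,
    threads_per_tg: Size,
    params_bytes: Vec<u8>,
    kernel_name: String,
}

// The per-type match that the hot path wants to avoid repeating.
fn select_kernel(q: Quant) -> (&'static str, u64) {
    match q {
        Quant::Q4 => ("mv_q4", 8),
        Quant::Q8 => ("mv_q8", 4),
    }
}

// Does the lookup, the match, the geometry math and the params encoding.
fn bake(reg: &mut Registry, q: Quant, rows: u64, cols: u64) -> Baked {
    let (name, rows_per_tg) = select_kernel(q);
    let mut params_bytes = Vec::with_capacity(16);
    params_bytes.extend_from_slice(&rows.to_le_bytes());
    params_bytes.extend_from_slice(&cols.to_le_bytes());
    Baked {
        pipeline: reg.get_pipeline(name),
        threadgroups: Size { w: (rows + rows_per_tg - 1) / rows_per_tg, h: 1, d: 1 },
        threads_per_tg: Size { w: 32, h: 2, d: 1 },
        params_bytes,
        kernel_name: name.to_string(),
    }
}

// Returns how many registry lookups a decode run of `tokens` steps performs.
fn lookups_for(tokens: usize, prebake: bool) -> usize {
    let mut reg = Registry::new();
    if prebake {
        let rec = bake(&mut reg, Quant::Q4, 4096, 4096);
        for _ in 0..tokens {
            let _replayed = &rec; // hot path: reuse the stored state
        }
    } else {
        for _ in 0..tokens {
            let _rec = bake(&mut reg, Quant::Q4, 4096, 4096); // rebuilt every token
        }
    }
    reg.lookups
}

fn main() {
    println!("prebaked lookups: {}", lookups_for(1000, true));
    println!("unbaked lookups:  {}", lookups_for(1000, false));
}
```

In this toy model the saving is one lookup versus one per token; the real saving also includes the skipped match, MTLSize construction, and params encoding.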
§Bake-time invariants
- buffer_slots.len() == bake_buffers.len() + runtime_buffer_count for the call site contract; the call site documents runtime_buffer_count and the order of runtime buffers.
- params_bytes.len() is whatever the kernel’s KernelArg::Bytes expects (typically 8-byte aligned per Metal struct layout).
- threadgroup_mem is (slot, byte_length) pairs; empty when the kernel doesn’t request [[threadgroup]] memory.
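A call site could assert the first two invariants in a debug build. The sketch below is an assumption about how such a check might look: RecordShape and check_bake_invariants are hypothetical names, and the expected params length is supplied by the caller because only the kernel knows it. threadgroup_mem is not checked, since an empty list is valid.

```rust
// Hypothetical stand-in carrying only the fields the checks need.
struct RecordShape {
    buffer_slots: Vec<u64>,
    params_bytes: Vec<u8>,
}

// Verifies the slot-count and params-length invariants for one call site.
fn check_bake_invariants(
    rec: &RecordShape,
    bake_buffer_count: usize,
    runtime_buffer_count: usize,
    expected_params_len: usize,
) -> Result<(), String> {
    let expected_slots = bake_buffer_count + runtime_buffer_count;
    if rec.buffer_slots.len() != expected_slots {
        return Err(format!(
            "buffer_slots has {} entries, expected {}",
            rec.buffer_slots.len(),
            expected_slots
        ));
    }
    if rec.params_bytes.len() != expected_params_len {
        return Err(format!(
            "params_bytes is {} bytes, kernel expects {}",
            rec.params_bytes.len(),
            expected_params_len
        ));
    }
    Ok(())
}

// Builds a shape with the given slot count and params length.
fn shape(slots: usize, params_len: usize) -> RecordShape {
    RecordShape {
        buffer_slots: (0..slots as u64).collect(),
        params_bytes: vec![0u8; params_len],
    }
}

fn main() {
    // One baked weight buffer plus two runtime buffers (input, output).
    println!("{:?}", check_bake_invariants(&shape(3, 16), 1, 2, 16));
    // A missing slot is reported as an error.
    println!("{:?}", check_bake_invariants(&shape(2, 16), 1, 2, 16));
}
```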
§Coherence
dispatch_record produces a byte-identical Metal command stream
to the equivalent encode_threadgroups_with_args* call. Capture
mode is supported (replays into CapturedNode::Dispatch exactly
like the unbaked path). See dispatch_record for the lockstep.
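The coherence claim can be illustrated with a mock encoder that records commands instead of issuing Metal calls. The Cmd enum, the Record and Encoder types, and encode_with_args are simplified stand-ins invented for this sketch, and the command order is an assumption. For simplicity the sketch passes every buffer at call time and zips them against buffer_slots; it does not model baked weight buffers or capture mode.

```rust
// Stand-in command stream; the real encoder talks to Metal instead.
#[derive(Debug, Clone, PartialEq)]
enum Cmd {
    SetPipeline(u32),
    SetBytes { slot: u64, bytes: Vec<u8> },
    SetBuffer { slot: u64, buffer: u32 },
    SetThreadgroupMem { slot: u64, len: u64 },
    Dispatch { threadgroups: [u64; 3], threads_per_tg: [u64; 3] },
}

// Hypothetical, simplified mirror of the documented fields.
struct Record {
    pipeline: u32,
    threadgroups: [u64; 3],
    threads_per_tg: [u64; 3],
    threadgroup_mem: Vec<(u64, u64)>,
    params_bytes: Vec<u8>,
    params_slot: u64,
    buffer_slots: Vec<u64>,
}

#[derive(Default)]
struct Encoder {
    stream: Vec<Cmd>,
}

impl Encoder {
    // Unbaked path: the caller assembles every argument on each call.
    fn encode_with_args(
        &mut self,
        pipeline: u32,
        params: Option<(u64, &[u8])>,
        buffers: &[(u64, u32)],
        threadgroup_mem: &[(u64, u64)],
        threadgroups: [u64; 3],
        threads_per_tg: [u64; 3],
    ) {
        self.stream.push(Cmd::SetPipeline(pipeline));
        if let Some((slot, bytes)) = params {
            self.stream.push(Cmd::SetBytes { slot, bytes: bytes.to_vec() });
        }
        for &(slot, buffer) in buffers {
            self.stream.push(Cmd::SetBuffer { slot, buffer });
        }
        for &(slot, len) in threadgroup_mem {
            self.stream.push(Cmd::SetThreadgroupMem { slot, len });
        }
        self.stream.push(Cmd::Dispatch { threadgroups, threads_per_tg });
    }

    // Baked path: only the buffers vary; everything else comes from the record.
    fn dispatch_record(&mut self, rec: &Record, runtime_buffers: &[u32]) {
        assert_eq!(rec.buffer_slots.len(), runtime_buffers.len());
        self.stream.push(Cmd::SetPipeline(rec.pipeline));
        if !rec.params_bytes.is_empty() {
            self.stream.push(Cmd::SetBytes {
                slot: rec.params_slot,
                bytes: rec.params_bytes.clone(),
            });
        }
        for (&slot, &buffer) in rec.buffer_slots.iter().zip(runtime_buffers) {
            self.stream.push(Cmd::SetBuffer { slot, buffer });
        }
        for &(slot, len) in &rec.threadgroup_mem {
            self.stream.push(Cmd::SetThreadgroupMem { slot, len });
        }
        self.stream.push(Cmd::Dispatch {
            threadgroups: rec.threadgroups,
            threads_per_tg: rec.threads_per_tg,
        });
    }
}

// Encodes the same dispatch through both paths and compares the streams.
fn streams_match() -> bool {
    let params: [u8; 8] = 1u64.to_le_bytes();
    let rec = Record {
        pipeline: 7,
        threadgroups: [512, 1, 1],
        threads_per_tg: [32, 2, 1],
        threadgroup_mem: vec![(0, 1024)],
        params_bytes: params.to_vec(),
        params_slot: 3,
        buffer_slots: vec![0, 1, 2],
    };

    let mut baked = Encoder::default();
    baked.dispatch_record(&rec, &[10, 11, 12]);

    let mut unbaked = Encoder::default();
    unbaked.encode_with_args(
        7,
        Some((3, &params[..])),
        &[(0u64, 10u32), (1, 11), (2, 12)],
        &[(0u64, 1024u64)],
        [512, 1, 1],
        [32, 2, 1],
    );

    baked.stream == unbaked.stream
}

fn main() {
    println!("streams match: {}", streams_match());
}
```

Comparing recorded streams like this is one way a test could guard the lockstep between the two paths; whether the crate tests it this way is not stated in the source.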
Fields§
pipeline: ComputePipelineState
Pipeline reference, looked up once at bake time.
threadgroups: MTLSize
Threadgroup count.
threads_per_tg: MTLSize
Threads per threadgroup.
threadgroup_mem: Vec<(u64, u64)>
Threadgroup shared-memory bindings: (slot_index, byte_length).
Empty when the kernel doesn’t allocate [[threadgroup]] memory.
params_bytes: Vec<u8>
Pre-encoded params struct bytes (bound as KernelArg::Bytes).
Empty when the kernel has no inline-bytes parameter.
params_slot: u64
Slot index for params_bytes. Ignored when params_bytes is empty.
buffer_slots: Vec<u64>
Slot indices for runtime buffer arguments, in caller order.
dispatch_record zips runtime_buffers against this list.
op_kind: CapturedOpKind
CapturedOpKind used when the encoder is in capture mode.
kernel_name: String
Diagnostic label (kernel name) for debug/timing.
Trait Implementations§
impl Clone for DispatchRecord
fn clone(&self) -> DispatchRecord
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.