Expand description
Per-command-buffer GPU timing accumulator for kernel-level profiling.
Hf2q’s HF2Q_DECODE_PROFILE=1 instrumentation tracks CPU-side wall
clock per layer phase, but does not attribute time to specific GPU
kernel dispatches. The MoE dwq46 0.93× decode parity gap residual
(per ADR-012 §Optimize / Task #15) cannot be localized further
without per-cb (or per-dispatch) GPU timing.
This module exposes a thread-safe accumulator keyed by string label.
Each labeled commit_and_wait records the cb’s GPU wall-clock
(MTLCommandBuffer.GPUEndTime - GPUStartTime). At decode end,
dump() produces a sorted breakdown showing which labeled cb
contributed the most GPU time per token.
Per-DISPATCH timing (using MTLCounterSampleBuffer.sampleCounters)
is a separate Metal API surface deferred to a future ADR; this
module establishes the per-CB ground truth first.
Structs§
- Profile
Entry - Per-label accumulator entry.
Functions§
- dump
- Dump the profile table sorted by descending total_ns.
- is_
enabled - Whether per-CB profiling is enabled via
MLX_PROFILE_CB=1. - record
- Record a labeled GPU duration.
- reset
- Reset the profile table. Typically called at start of decode.