Per-command-buffer + per-dispatch GPU timing accumulator for kernel-level profiling.
Hf2q’s HF2Q_DECODE_PROFILE=1 instrumentation tracks CPU-side wall
clock per layer phase, but does not attribute time to specific GPU
kernel dispatches. The MoE dwq46 0.93× decode parity gap residual
(per ADR-012 §Optimize / Task #15) cannot be localized further
without per-cb (or per-dispatch) GPU timing.
This module exposes two thread-safe accumulators:
- Per-CB (MLX_PROFILE_CB=1): a HashMap keyed by string label. Each labeled commit_and_wait records the cb’s GPU wall-clock (MTLCommandBuffer.GPUEndTime - GPUStartTime).
- Per-dispatch (MLX_PROFILE_DISPATCH=1, ADR-015 iter63): a flat Vec<DispatchEntry> populated from MTLCounterSampleBuffer.sampleCounters between set_compute_pipeline_state and dispatch_threads at every encode* site. Dump groups entries by their owning cb_label, preserving insertion order within each group.
At decode end, dump / dump_dispatches produce sorted
breakdowns showing which labeled cb (and which kernel within each cb)
contributed the most GPU time per token.
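The per-CB accumulator described above can be sketched as a mutex-guarded map with a sorted read-out. This is a minimal illustration, not the module's actual code: the names CbTable and LabelTotals and their fields are hypothetical, since the real ProfileEntry layout and record signature are not shown on this page.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical per-label totals (the real `ProfileEntry` fields are not shown here).
#[derive(Debug, Default, Clone, PartialEq)]
pub struct LabelTotals {
    pub count: u64,
    pub total_ns: u64,
}

/// Thread-safe table keyed by command-buffer label (hypothetical type).
#[derive(Default)]
pub struct CbTable {
    inner: Mutex<HashMap<String, LabelTotals>>,
}

impl CbTable {
    /// Add one GPU duration, in nanoseconds, under `label`.
    pub fn record(&self, label: &str, gpu_ns: u64) {
        let mut map = self.inner.lock().unwrap();
        let entry = map.entry(label.to_string()).or_default();
        entry.count += 1;
        entry.total_ns += gpu_ns;
    }

    /// Rows sorted by descending `total_ns`; ties are broken by label
    /// so the output order is deterministic.
    pub fn sorted(&self) -> Vec<(String, LabelTotals)> {
        let map = self.inner.lock().unwrap();
        let mut rows: Vec<(String, LabelTotals)> =
            map.iter().map(|(k, v)| (k.clone(), v.clone())).collect();
        rows.sort_by(|a, b| {
            b.1.total_ns
                .cmp(&a.1.total_ns)
                .then_with(|| a.0.cmp(&b.0))
        });
        rows
    }
}
```

Dividing each row's total_ns by the number of decoded tokens would give the per-token figure the dump is meant to surface.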
§Cross-validation (ADR-015 iter63 Risk R3)
Per-dispatch numbers are an upper bound on cost because they are measured serialized: the
withBarrier:YES requirement on sampleCountersInBuffer serializes
the encoder under MTLDispatchTypeConcurrent. The sum of per-dispatch times within a CB will
therefore be ≥ the matching MLX_PROFILE_CB total. Acceptable
drift: ≤ 5%; > 10% indicates a clock-domain or sampling bug.
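The cross-check can be expressed as a small helper that compares the per-dispatch sum for one CB against that CB's per-CB total. This is a sketch under assumptions: the function and enum names are hypothetical, the 5% to 10% band is not classified by the text (it is called Elevated here purely for illustration), and using the absolute value of the drift is also an assumption.

```rust
/// Drift classification. `Elevated` covers the 5%..=10% band that the
/// source text leaves unspecified (hypothetical naming).
#[derive(Debug, PartialEq)]
pub enum Drift {
    Acceptable,
    Elevated,
    Suspicious,
}

/// Relative drift of the per-dispatch sum against the per-CB total.
/// Returns `None` when the per-CB total is zero (nothing to compare).
pub fn relative_drift(dispatch_sum_ns: u64, cb_total_ns: u64) -> Option<f64> {
    if cb_total_ns == 0 {
        return None;
    }
    Some((dispatch_sum_ns as f64 - cb_total_ns as f64) / cb_total_ns as f64)
}

/// Apply the thresholds from the text: <= 5% acceptable, > 10% suspicious.
/// Assumption: the magnitude is used, so a negative drift is judged the same way.
pub fn classify(drift: f64) -> Drift {
    let magnitude = drift.abs();
    if magnitude <= 0.05 {
        Drift::Acceptable
    } else if magnitude <= 0.10 {
        Drift::Elevated
    } else {
        Drift::Suspicious
    }
}
```

A negative drift (per-dispatch sum below the per-CB total) would contradict the upper-bound expectation and is probably worth reporting separately.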
§Apple Silicon caveat (NEW Risk discovered iter63 impl)
Verified at runtime: AGXG17XFamilyComputeContext (M-series, macOS 26)
supports counter sampling only at
MTLCounterSamplingPoint::AtStageBoundary, never
AtDispatchBoundary. The latter is required for sampling between
dispatches inside a persistent compute encoder (which mlx-native
uses to amortize ~800 encoder create/end cycles per forward pass).
On such hardware, MLX_PROFILE_DISPATCH=1 gracefully degrades to a
no-op + one-shot stderr warning; only the per-CB path
(MLX_PROFILE_CB=1) populates. The kit is forward-compatible for
AMD/Intel discrete and any future Apple silicon that reports
AtDispatchBoundary support.
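The degrade-to-no-op behaviour with a one-shot warning could look roughly like the following. This is an illustrative sketch, not the module's code: dispatch_sampling_active and its supports_dispatch_boundary parameter are hypothetical, with the flag standing in for whatever the device capability query (for example Metal's supportsCounterSampling) reports.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Set once the unsupported-hardware warning has been printed.
static WARNED: AtomicBool = AtomicBool::new(false);

/// Decide whether per-dispatch sampling should run (hypothetical helper).
///
/// `requested` mirrors MLX_PROFILE_DISPATCH=1; `supports_dispatch_boundary`
/// is an assumed flag obtained from the device capability query.
pub fn dispatch_sampling_active(requested: bool, supports_dispatch_boundary: bool) -> bool {
    if !requested {
        return false;
    }
    if supports_dispatch_boundary {
        return true;
    }
    // `swap` returns the previous value, so only the first caller sees
    // `false` and prints the warning; later callers stay silent.
    if !WARNED.swap(true, Ordering::Relaxed) {
        eprintln!(
            "MLX_PROFILE_DISPATCH=1 ignored: device lacks AtDispatchBoundary \
             counter sampling; only per-CB profiling will populate"
        );
    }
    false
}
```

Checking the capability once up front keeps the hot encode path free of per-dispatch branching beyond a single boolean test.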
Structs§
- DispatchEntry - One per-dispatch timing entry within a CB (ADR-015 iter63 Phase A).
- ProfileEntry - Per-label accumulator entry.
Functions§
- convert_gpu_ticks_to_ns - Convert a raw GPU tick value to ns using the most recent (cpu_ns, gpu_ticks) pair, falling back to a 1:1 ratio when no pair has been recorded yet.
- dump - Dump the per-CB profile table sorted by descending total_ns.
- dump_dispatches - Dump per-dispatch entries grouped by cb_label, preserving CB-arrival order within each group.
- is_dispatch_enabled - Whether per-DISPATCH profiling is enabled via MLX_PROFILE_DISPATCH=1.
- is_enabled - Whether per-CB profiling is enabled via MLX_PROFILE_CB=1.
- record - Record a labeled GPU duration.
- record_clock_pair - Record a (cpu_ns, gpu_ticks) snapshot from MTLDevice.sampleTimestamps. Most recent snapshot wins.
- record_dispatch - Append a per-dispatch entry to the global dispatch table.
- reset - Reset the profile tables. Typically called at start of decode.
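One plausible reading of the record_clock_pair / convert_gpu_ticks_to_ns pairing is sketched below. The scaling rule (ticks multiplied by cpu_ns / gpu_ticks from the latest snapshot) is an assumption, as is the ClockPair type; the page only states that the most recent pair is used and that the fallback is a 1:1 ratio.

```rust
use std::sync::Mutex;

/// Holds the most recent (cpu_ns, gpu_ticks) snapshot (hypothetical type).
#[derive(Default)]
pub struct ClockPair {
    last: Mutex<Option<(u64, u64)>>,
}

impl ClockPair {
    /// Store a snapshot; the most recent one wins.
    pub fn record(&self, cpu_ns: u64, gpu_ticks: u64) {
        *self.last.lock().unwrap() = Some((cpu_ns, gpu_ticks));
    }

    /// Convert GPU ticks to nanoseconds.
    /// Assumption: the ratio cpu_ns / gpu_ticks of the latest snapshot is
    /// the scale factor. Falls back to 1:1 when no usable pair exists.
    pub fn ticks_to_ns(&self, ticks: u64) -> u64 {
        let pair = *self.last.lock().unwrap();
        match pair {
            Some((cpu_ns, gpu_ticks)) if gpu_ticks != 0 => {
                // Widen to u128 so the intermediate product cannot overflow.
                ((ticks as u128 * cpu_ns as u128) / gpu_ticks as u128) as u64
            }
            _ => ticks,
        }
    }
}
```

A ratio derived from the difference between two snapshots would likely be more robust against clock offsets, but that goes beyond what the item descriptions state.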