Module kernel_profile

Per-command-buffer + per-dispatch GPU timing accumulator for kernel-level profiling.

Hf2q’s HF2Q_DECODE_PROFILE=1 instrumentation tracks CPU-side wall-clock time per layer phase, but does not attribute time to specific GPU kernel dispatches. The residual MoE dwq46 0.93× decode parity gap (per ADR-012 §Optimize / Task #15) cannot be localized further without per-CB (or per-dispatch) GPU timing.

This module exposes two thread-safe accumulators:

Per-CB (MLX_PROFILE_CB=1): a HashMap keyed by string label. Each labeled commit_and_wait records the CB’s GPU wall-clock time (MTLCommandBuffer.GPUEndTime - GPUStartTime).
Per-dispatch (MLX_PROFILE_DISPATCH=1, ADR-015 iter63): a flat Vec<DispatchEntry> populated from MTLCounterSampleBuffer.sampleCounters between set_compute_pipeline_state and dispatch_threads at every encode* site. The dump groups entries by their owning cb_label, preserving insertion order within each group.
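The two accumulators described above can be pictured with a minimal sketch. Everything named here (`Profiler`, `CbTotals`, `DispatchSample`, and their fields) is hypothetical and chosen for illustration; the real `ProfileEntry` and `DispatchEntry` layouts may differ, and the sketch assumes durations have already been converted to nanoseconds.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical per-label totals (stand-in for the module's `ProfileEntry`).
#[derive(Debug, Default, Clone, PartialEq)]
pub struct CbTotals {
    pub count: u64,
    pub total_ns: u64,
}

/// Hypothetical per-dispatch record (stand-in for `DispatchEntry`).
#[derive(Debug, Clone, PartialEq)]
pub struct DispatchSample {
    pub cb_label: String,
    pub kernel: String,
    pub gpu_ns: u64,
}

/// Two independently locked tables: a label-keyed map and a flat, append-only list.
#[derive(Default)]
pub struct Profiler {
    per_cb: Mutex<HashMap<String, CbTotals>>,
    per_dispatch: Mutex<Vec<DispatchSample>>,
}

impl Profiler {
    /// Add one command-buffer duration to the running totals for `label`.
    pub fn record_cb(&self, label: &str, gpu_ns: u64) {
        let mut table = self.per_cb.lock().unwrap();
        let entry = table.entry(label.to_string()).or_default();
        entry.count += 1;
        entry.total_ns += gpu_ns;
    }

    /// Append one dispatch sample; insertion order is preserved by the Vec.
    pub fn record_dispatch(&self, cb_label: &str, kernel: &str, gpu_ns: u64) {
        self.per_dispatch.lock().unwrap().push(DispatchSample {
            cb_label: cb_label.to_string(),
            kernel: kernel.to_string(),
            gpu_ns,
        });
    }

    /// Snapshot of the totals for one label, if any were recorded.
    pub fn cb_totals(&self, label: &str) -> Option<CbTotals> {
        self.per_cb.lock().unwrap().get(label).cloned()
    }

    /// Number of dispatch samples recorded so far.
    pub fn dispatch_count(&self) -> usize {
        self.per_dispatch.lock().unwrap().len()
    }
}
```

Keeping the two tables behind separate mutexes means a per-CB record never contends with a per-dispatch append; whether the module itself does this is not stated in the description above.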

At the end of decode, dump and dump_dispatches produce sorted breakdowns showing which labeled CB (and which kernel within each CB) contributed the most GPU time per token.
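As an illustration of the two dump shapes, the following sketch sorts per-CB rows by descending total and groups dispatch rows by their CB label while keeping insertion order. The function names and tuple layouts are assumptions made for this example, not the module's actual signatures.

```rust
use std::collections::HashMap;

/// Sort (label, total_ns) rows by descending total_ns.
/// Ties are broken by label so the output is deterministic.
pub fn sort_by_total_desc(mut rows: Vec<(String, u64)>) -> Vec<(String, u64)> {
    rows.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    rows
}

/// Group (cb_label, kernel, gpu_ns) entries by cb_label.
/// Groups appear in first-seen order; entries keep insertion order within a group.
pub fn group_by_cb(entries: &[(String, String, u64)]) -> Vec<(String, Vec<(String, u64)>)> {
    // Maps a label to the position of its group in `groups`.
    let mut index: HashMap<&str, usize> = HashMap::new();
    let mut groups: Vec<(String, Vec<(String, u64)>)> = Vec::new();
    for (cb, kernel, ns) in entries {
        let i = *index.entry(cb.as_str()).or_insert_with(|| {
            groups.push((cb.clone(), Vec::new()));
            groups.len() - 1
        });
        groups[i].1.push((kernel.clone(), *ns));
    }
    groups
}
```

A Vec of groups plus an index map is used instead of a plain HashMap because HashMap iteration order is unspecified, which would lose the insertion order the description calls for.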

§Cross-validation (ADR-015 iter63 Risk R3)

Per-dispatch numbers are an upper-bound, serialized cost: the withBarrier:YES requirement on sampleCountersInBuffer serializes the encoder under MTLDispatchTypeConcurrent. The sum of per-dispatch times within a CB will therefore be ≥ the matching MLX_PROFILE_CB total. Acceptable drift: ≤ 5%; > 10% indicates a clock-domain or sampling bug.
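The cross-check can be sketched as a relative-drift computation against the two thresholds quoted above. The names `relative_drift`, `classify`, and `Drift` are hypothetical. The text does not say how the 5% to 10% band should be treated, so the sketch labels it `Borderline` as an assumption, and it compares the magnitude of the drift even though a negative value is not expected.

```rust
/// Outcome of comparing the per-dispatch sum with the per-CB total.
#[derive(Debug, PartialEq)]
pub enum Drift {
    /// Drift of at most 5%.
    Acceptable,
    /// Between 5% and 10%; the source text does not classify this band (assumption).
    Borderline,
    /// Above 10%: points at a clock-domain or sampling bug.
    Suspect,
}

/// Relative drift of the per-dispatch sum over the per-CB total.
/// Returns None when the per-CB total is zero and no ratio can be formed.
pub fn relative_drift(dispatch_sum_ns: u64, cb_total_ns: u64) -> Option<f64> {
    if cb_total_ns == 0 {
        return None;
    }
    Some((dispatch_sum_ns as f64 - cb_total_ns as f64) / cb_total_ns as f64)
}

/// Classify a drift value against the 5% and 10% thresholds.
pub fn classify(drift: f64) -> Drift {
    let magnitude = drift.abs();
    if magnitude <= 0.05 {
        Drift::Acceptable
    } else if magnitude > 0.10 {
        Drift::Suspect
    } else {
        Drift::Borderline
    }
}
```

For example, a per-dispatch sum of 1040 ns against a per-CB total of 1000 ns is a 4% drift and falls in the acceptable range.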

§Apple Silicon caveat (NEW Risk discovered iter63 impl)

Verified at runtime: AGXG17XFamilyComputeContext (M-series, macOS 26) supports counter sampling only at MTLCounterSamplingPoint::AtStageBoundary, never AtDispatchBoundary. The latter is required for sampling between dispatches inside a persistent compute encoder (which mlx-native uses to amortize ~800 encoder create/end cycles per forward pass). On such hardware, MLX_PROFILE_DISPATCH=1 degrades gracefully to a no-op plus a one-shot stderr warning; only the per-CB path (MLX_PROFILE_CB=1) is populated. The kit is forward-compatible with discrete AMD/Intel GPUs and any future Apple silicon that reports AtDispatchBoundary support.
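The degrade-to-no-op behavior with a single warning can be sketched as follows. This is not the module's code: `OneShot` and `dispatch_sampling_active` are hypothetical names, and the capability flag is passed in as a plain bool rather than queried from Metal.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// A latch that reports `true` only on its first use.
pub struct OneShot(AtomicBool);

impl OneShot {
    pub const fn new() -> Self {
        OneShot(AtomicBool::new(false))
    }

    /// Returns true exactly once, even when called from several threads.
    pub fn first_time(&self) -> bool {
        !self.0.swap(true, Ordering::Relaxed)
    }
}

/// Decide whether per-dispatch sampling should run.
/// `requested` stands for MLX_PROFILE_DISPATCH=1 being set;
/// `supports_dispatch_boundary` stands for the device reporting AtDispatchBoundary.
pub fn dispatch_sampling_active(
    requested: bool,
    supports_dispatch_boundary: bool,
    warned: &OneShot,
) -> bool {
    if !requested {
        return false;
    }
    if supports_dispatch_boundary {
        return true;
    }
    // Unsupported hardware: warn on the first call only, then stay silent.
    if warned.first_time() {
        eprintln!("per-dispatch profiling unavailable on this device; falling back to no-op");
    }
    false
}
```

In a real module the latch would more likely be a `static`, so that every encode site shares the same warning state; the sketch takes it as a parameter only to keep the example easy to exercise.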

Structs§

DispatchEntry: One per-dispatch timing entry within a CB (ADR-015 iter63 Phase A).
ProfileEntry: Per-label accumulator entry.

Functions§

convert_gpu_ticks_to_ns: Convert a raw GPU tick value to ns using the most recent (cpu_ns, gpu_ticks) pair, falling back to a 1:1 ratio when no pair has been recorded yet.
dump: Dump the per-CB profile table sorted by descending total_ns.
dump_dispatches: Dump per-dispatch entries grouped by cb_label, preserving CB-arrival order within each group.
is_dispatch_enabled: Whether per-DISPATCH profiling is enabled via MLX_PROFILE_DISPATCH=1.
is_enabled: Whether per-CB profiling is enabled via MLX_PROFILE_CB=1.
record: Record a labeled GPU duration.
record_clock_pair: Record a (cpu_ns, gpu_ticks) snapshot from MTLDevice.sampleTimestamps. Most recent snapshot wins.
record_dispatch: Append a per-dispatch entry to the global dispatch table.
reset: Reset the profile tables. Typically called at start of decode.
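The interplay between record_clock_pair and convert_gpu_ticks_to_ns can be illustrated with a small sketch. It assumes the conversion scales ticks by the ratio cpu_ns / gpu_ticks taken from the most recent pair; the descriptions above do not spell out the formula, so treat that ratio, the `ClockPair` type, and its method names as assumptions. A pair with zero gpu_ticks is treated like a missing pair.

```rust
use std::sync::Mutex;

/// Holds the most recent (cpu_ns, gpu_ticks) snapshot, if any.
pub struct ClockPair(Mutex<Option<(u64, u64)>>);

impl ClockPair {
    pub const fn new() -> Self {
        ClockPair(Mutex::new(None))
    }

    /// Store a snapshot; the most recent one wins.
    pub fn record(&self, cpu_ns: u64, gpu_ticks: u64) {
        *self.0.lock().unwrap() = Some((cpu_ns, gpu_ticks));
    }

    /// Convert raw GPU ticks to nanoseconds.
    /// Falls back to a 1:1 ratio when no usable pair has been recorded.
    pub fn ticks_to_ns(&self, ticks: u64) -> u64 {
        match *self.0.lock().unwrap() {
            Some((cpu_ns, gpu_ticks)) if gpu_ticks != 0 => {
                // Widen to u128 so the multiplication cannot overflow.
                let ns = (ticks as u128) * (cpu_ns as u128) / (gpu_ticks as u128);
                ns.min(u64::MAX as u128) as u64
            }
            _ => ticks,
        }
    }
}
```

With a recorded pair of (2000, 1000), each tick counts as 2 ns, so 500 ticks convert to 1000 ns; before any pair is recorded, 500 ticks convert to 500 ns.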