Skip to main content

Module kernel_profile

Module kernel_profile 

Source
Expand description

Per-command-buffer GPU timing accumulator for kernel-level profiling.

Hf2q’s HF2Q_DECODE_PROFILE=1 instrumentation tracks CPU-side wall clock per layer phase, but does not attribute time to specific GPU kernel dispatches. The MoE dwq46 0.93× decode parity gap residual (per ADR-012 §Optimize / Task #15) cannot be localized further without per-cb (or per-dispatch) GPU timing.

This module exposes a thread-safe accumulator keyed by string label. Each labeled commit_and_wait records the cb’s GPU wall-clock (MTLCommandBuffer.GPUEndTime - GPUStartTime). At decode end, dump() produces a sorted breakdown showing which labeled cb contributed the most GPU time per token.

Per-DISPATCH timing (using MTLCounterSampleBuffer.sampleCounters) is a separate Metal API surface deferred to a future ADR; this module establishes the per-CB ground truth first.

Structs§

ProfileEntry
Per-label accumulator entry.

Functions§

dump
Dump the profile table sorted by descending total_ns.
is_enabled
Whether per-CB profiling is enabled via MLX_PROFILE_CB=1.
record
Record a labeled GPU duration.
reset
Reset the profile table. Typically called at start of decode.