Cross-backend GPU-side timer trait — PLAYBOOK § Phase 1.1.
Replaces the Instant::now() calls inside FERRUM_*_PROF probes
(crates/ferrum-models/src/moe/forward.rs, qwen3_moe.rs, etc.).
Those calls measure CPU-side dispatch plus queue depth; they DON’T see GPU
execution time, so the “per-op µs” figures they report have misled
all the perf debugging built on top of them this session.
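The gap between dispatch time and execution time can be illustrated with a CPU-only analogy. This is a minimal sketch, not code from this crate: a worker thread stands in for the asynchronous device queue, a sleep stands in for the kernel, and the function name `dispatch_vs_completion` is hypothetical.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for an asynchronous device queue: the "launch"
/// returns immediately and the "kernel" (a sleep) runs on a worker thread.
/// Returns (time seen by a naive probe, time seen after synchronizing).
pub fn dispatch_vs_completion(kernel_time: Duration) -> (Duration, Duration) {
    let start = Instant::now();

    // "Launch": hand the work to another thread and return right away.
    let worker = thread::spawn(move || {
        thread::sleep(kernel_time); // simulated device execution
    });

    // What an Instant::now() probe around the launch sees: dispatch only.
    let dispatch = start.elapsed();

    // What a synchronizing timer sees: dispatch plus execution.
    worker.join().expect("worker did not panic");
    let completion = start.elapsed();

    (dispatch, completion)
}
```

With a 30 ms simulated kernel, the first value is typically a few tens of microseconds while the second is at least 30 ms, which is the kind of discrepancy described above.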
§Backend behaviour
- CUDA (cudarc::driver::sys events): cuEventRecord is asynchronous on the stream; elapsed_ms() calls cuEventSynchronize, then cuEventElapsedTime. Overhead per scope: ~5µs (event create + record × 2 + sync at read). Accuracy: ±0.5µs.
- Metal: Metal’s MTLCommandBuffer exposes gpuStartTime / gpuEndTime per command buffer. For sub-command-buffer scope we wrap the section in an explicit sync() boundary. This adds command-buffer commit overhead (~50-100µs) but gives accurate on-GPU timing. Caveat: on Metal the sync-wrap inflates each timed scope’s CPU side; use sparingly.
- CPU: Instant. (CPU is the “GPU” here, so wall-clock is correct.)
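As a rough illustration of the CPU case, a wall-clock timer might look like the sketch below. The trait `ScopeTimer` and struct `WallClockTimer` are hypothetical simplifications: the real BackendTimer methods take a backend context, which is omitted here. Returning 0.0 before a start/end pair has been recorded is also an assumption of this sketch.

```rust
use std::time::Instant;

/// Hypothetical, simplified timer trait. The real trait's record_start and
/// record_end take a backend context; that parameter is left out here.
pub trait ScopeTimer {
    fn new() -> Self;
    fn record_start(&mut self);
    fn record_end(&mut self);
    /// Milliseconds between the most recent start/end pair.
    fn elapsed_ms(&self) -> f64;
}

/// Wall-clock timer: on CPU there is no device to wait on, so the
/// duration of the CPU work is the measurement.
pub struct WallClockTimer {
    start: Option<Instant>,
    end: Option<Instant>,
}

impl ScopeTimer for WallClockTimer {
    fn new() -> Self {
        Self { start: None, end: None }
    }

    fn record_start(&mut self) {
        self.start = Some(Instant::now());
        self.end = None; // invalidate any previous scope
    }

    fn record_end(&mut self) {
        self.end = Some(Instant::now());
    }

    fn elapsed_ms(&self) -> f64 {
        match (self.start, self.end) {
            (Some(s), Some(e)) => e.duration_since(s).as_secs_f64() * 1000.0,
            // Assumption of this sketch: an incomplete scope reports zero.
            _ => 0.0,
        }
    }
}
```

A GPU backend would keep the same shape but record events on the stream in record_start / record_end and synchronize only inside elapsed_ms.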
§Usage
```rust
use ferrum_kernels::backend::timer::BackendTimer;

// `B` is the backend type parameter; `ctx` is its context.
let mut timer = <B as Backend>::Timer::new();

// Bracket the op to be measured.
timer.record_start(&mut ctx);
Backend::rms_norm(&mut ctx, &x, &w, eps, &mut out, tokens, dim);
timer.record_end(&mut ctx);

// elapsed_ms() returns milliseconds; convert to microseconds for logging.
let us = timer.elapsed_ms() * 1000.0;
tracing::info!("rms_norm: {us:.1} us");
```

Hot loops should reuse a single Timer instance across scopes via
record_start / record_end, because new() allocates events on CUDA.
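The reuse advice can be sketched as follows. Everything here is a hypothetical stand-in: `Timer` just wraps Instant (so construction is cheap, unlike the CUDA case), and `profile_layers` and `fake_op` exist only to give the loop something to time.

```rust
use std::time::Instant;

/// Hypothetical stand-in for a backend timer. On CUDA, construction would
/// allocate events; here it only captures an Instant.
pub struct Timer {
    start: Instant,
    elapsed_us: f64,
}

impl Timer {
    pub fn new() -> Self {
        Self { start: Instant::now(), elapsed_us: 0.0 }
    }
    pub fn record_start(&mut self) {
        self.start = Instant::now();
    }
    pub fn record_end(&mut self) {
        self.elapsed_us = self.start.elapsed().as_secs_f64() * 1e6;
    }
    pub fn elapsed_us(&self) -> f64 {
        self.elapsed_us
    }
}

/// Placeholder workload so each scope has something to measure.
fn fake_op(seed: usize, n: usize) -> f64 {
    (0..n).map(|i| ((seed + i) as f64).sqrt()).sum()
}

/// Times two scopes per layer with ONE timer and returns the accumulated
/// microseconds as (first_scope_us, second_scope_us).
pub fn profile_layers(layers: usize) -> (f64, f64) {
    let mut timer = Timer::new(); // constructed once, outside the loop
    let (mut first_us, mut second_us) = (0.0, 0.0);
    let mut sink = 0.0f64;

    for layer in 0..layers {
        timer.record_start();
        sink += fake_op(layer, 1_000);
        timer.record_end();
        first_us += timer.elapsed_us();

        // Same timer, next scope: no new allocation per scope.
        timer.record_start();
        sink += fake_op(layer, 4_000);
        timer.record_end();
        second_us += timer.elapsed_us();
    }

    std::hint::black_box(sink); // keep the workload from being optimized out
    (first_us, second_us)
}
```

The point of the design is that per-scope cost stays at two record calls plus one read, while the construction cost is paid once per loop rather than once per op.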
Structs§
- CpuTimer: CPU timer, wall-clock via Instant. There’s no GPU to wait on, so the “GPU time” is just the CPU work duration.
Traits§
- BackendTimer: GPU-side timer scoped to a single Backend context.
Functions§
- finish_probe_timer: Close a timer started by [start_probe_timer] and return the elapsed microseconds. None propagates the “disabled” state so the caller can keep the if let Some(us) = ... { record(us) } pattern.
- finish_probe_timer_traced: Convenience wrapper: close a timer AND push a chrome-trace event in one call. When FERRUM_TRACE_OUT is unset, the trace push is a no-op (cheap atomic check inside [global_trace]).
- start_probe_timer_if: Start a timer iff enabled is true; None is the disabled state. Pair with finish_probe_timer at the end of the scope. The env/config gate is intentionally resolved by the caller so hot probes do not read process env while a token/layer loop is running.
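The Option-based gating described for these helpers can be sketched as below. The names `start_probe_if`, `finish_probe` and `run_layers` are hypothetical and the signatures are simplified assumptions: the real functions work with a backend timer and context, whereas this sketch uses Instant so that it stands alone.

```rust
use std::time::Instant;

/// Hypothetical simplified start helper: None is the disabled state.
pub fn start_probe_if(enabled: bool) -> Option<Instant> {
    enabled.then(Instant::now)
}

/// Hypothetical simplified finish helper: returns elapsed microseconds,
/// or None when the probe was never started.
pub fn finish_probe(timer: Option<Instant>) -> Option<f64> {
    timer.map(|t| t.elapsed().as_secs_f64() * 1e6)
}

/// The caller resolves the gate once (for example from an env var read
/// before the loop) and passes a plain bool into the hot path.
pub fn run_layers(profile: bool, layers: usize) -> Vec<f64> {
    let mut samples = Vec::new();
    for _ in 0..layers {
        let probe = start_probe_if(profile);

        // Placeholder for the real per-layer work.
        std::hint::black_box((0..1_000u64).sum::<u64>());

        // Disabled probes fall through without recording anything.
        if let Some(us) = finish_probe(probe) {
            samples.push(us);
        }
    }
    samples
}
```

With profiling off, the loop pays only for a branch on a bool per scope; no environment lookup and no clock read happens inside the loop.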