Module timer

Cross-backend GPU-side timer trait — PLAYBOOK § Phase 1.1.

Replaces the Instant::now() calls inside the FERRUM_*_PROF probes (crates/ferrum-models/src/moe/forward.rs, qwen3_moe.rs, etc.). Those calls measure CPU-side dispatch time plus queue depth; they do not see GPU execution time, so the “per-op µs” figures they report have misled all of the perf debugging built on top of them this session.

§Backend behaviour

  • CUDA (cudarc::driver::sys events): cuEventRecord is asynchronous on the stream; elapsed_ms() calls cuEventSynchronize + cuEventElapsedTime. Overhead per scope: ~5µs (event create + record × 2 + sync at read). Accuracy: ±0.5µs.
  • Metal: MTLCommandBuffer exposes gpuStartTime / gpuEndTime per command buffer. For sub-command-buffer scopes we wrap the section in an explicit sync() boundary. This adds command-buffer commit overhead (~50-100µs) but gives accurate on-GPU timing. Caveat: the sync-wrap inflates the CPU-side cost of each timed scope, so use it sparingly.

  • CPU: Instant. (The CPU is the “GPU” here, so wall-clock time is correct.)
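
To make the CPU case concrete, here is a minimal sketch of what a wall-clock timer behind a backend-timer trait could look like. The trait name `SketchTimer`, the associated `Ctx` type, and the struct `SketchCpuTimer` are hypothetical stand-ins invented for illustration; the real `BackendTimer` and `CpuTimer` signatures are not shown on this page and may differ.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the real `BackendTimer` trait. The associated
/// context type and the exact method signatures are assumptions.
pub trait SketchTimer {
    type Ctx;
    fn new() -> Self;
    fn record_start(&mut self, ctx: &mut Self::Ctx);
    fn record_end(&mut self, ctx: &mut Self::Ctx);
    /// Milliseconds between the last start/end pair (0.0 if none yet).
    fn elapsed_ms(&self) -> f64;
}

/// Hypothetical CPU timer: there is no device to wait on, so the
/// wall-clock duration is the measurement.
pub struct SketchCpuTimer {
    start: Option<Instant>,
    elapsed: Duration,
}

impl SketchTimer for SketchCpuTimer {
    type Ctx = ();

    fn new() -> Self {
        SketchCpuTimer { start: None, elapsed: Duration::ZERO }
    }

    fn record_start(&mut self, _ctx: &mut ()) {
        self.start = Some(Instant::now());
    }

    fn record_end(&mut self, _ctx: &mut ()) {
        // Ignore an end without a matching start instead of panicking.
        if let Some(started) = self.start.take() {
            self.elapsed = started.elapsed();
        }
    }

    fn elapsed_ms(&self) -> f64 {
        self.elapsed.as_secs_f64() * 1000.0
    }
}
```

A GPU backend would presumably keep the same shape but record events on the stream in `record_start` / `record_end` and defer the synchronization to `elapsed_ms()`, as described in the list above.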

§Usage

use ferrum_kernels::backend::timer::BackendTimer;

let mut timer = <B as Backend>::Timer::new();
timer.record_start(&mut ctx);
Backend::rms_norm(&mut ctx, &x, &w, eps, &mut out, tokens, dim);
timer.record_end(&mut ctx);
let us = timer.elapsed_ms() * 1000.0;
tracing::info!("rms_norm: {us:.1} us");

Hot loops should reuse a single Timer instance across scopes via record_start / record_end, because new() allocates events on CUDA.
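
The reuse pattern can be sketched as follows. `ScopeTimer` and `time_layers` are hypothetical names used only for this illustration, and the timer here is backed by `Instant` so the example runs without a GPU; with a real backend timer, the point is that `new()` runs once outside the loop.

```rust
use std::time::Instant;

/// Hypothetical timer standing in for `<B as Backend>::Timer`. On CUDA the
/// real `new()` is where events would be allocated, hence the reuse below.
struct ScopeTimer {
    start: Option<Instant>,
    last_ms: f64,
}

impl ScopeTimer {
    fn new() -> Self {
        ScopeTimer { start: None, last_ms: 0.0 }
    }

    fn record_start(&mut self) {
        self.start = Some(Instant::now());
    }

    fn record_end(&mut self) {
        if let Some(started) = self.start.take() {
            self.last_ms = started.elapsed().as_secs_f64() * 1000.0;
        }
    }

    fn elapsed_ms(&self) -> f64 {
        self.last_ms
    }
}

/// Times `layers` scopes with ONE timer instance and returns the
/// per-scope durations in microseconds.
fn time_layers(layers: usize, mut work: impl FnMut(usize)) -> Vec<f64> {
    let mut timer = ScopeTimer::new(); // created once, outside the hot loop
    let mut per_layer_us = Vec::with_capacity(layers);
    for layer in 0..layers {
        timer.record_start();
        work(layer);
        timer.record_end();
        per_layer_us.push(timer.elapsed_ms() * 1000.0);
    }
    per_layer_us
}
```

Each `record_start` / `record_end` pair overwrites the previous measurement, so the elapsed value has to be read before the next scope begins.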

Structs§

CpuTimer
CPU timer — wall-clock via Instant. There’s no GPU to wait on, so the “GPU time” is just the CPU work duration.

Traits§

BackendTimer
GPU-side timer scoped to a single Backend context.

Functions§

finish_probe_timer
Close a timer started by [start_probe_timer] and return the elapsed microseconds. None propagates the “disabled” state so the caller can keep the if let Some(us) = ... { record(us) } pattern.
finish_probe_timer_traced
Convenience wrapper: close a timer AND push a chrome-trace event in one call. When FERRUM_TRACE_OUT is unset, the trace push is a no-op (cheap atomic check inside [global_trace]).
start_probe_timer_if
Start a timer iff enabled is true — None is the disabled state. Pair with finish_probe_timer at the end of the scope. The env/config gate is intentionally resolved by the caller so hot probes do not read process env while a token/layer loop is running.
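
The caller-resolved gate and the `if let Some(us)` pattern described above might look like this in practice. Everything here is a simplified sketch: `start_probe_if`, `finish_probe`, `prof_enabled_from_env`, and the environment variable `EXAMPLE_PROF` are hypothetical stand-ins, and the real functions likely take a backend context and return a backend timer rather than an `Instant`.

```rust
use std::time::Instant;

/// Hypothetical stand-in for `start_probe_timer_if`: `None` means disabled.
fn start_probe_if(enabled: bool) -> Option<Instant> {
    enabled.then(Instant::now)
}

/// Hypothetical stand-in for `finish_probe_timer`: returns elapsed
/// microseconds, or `None` if the probe was never started.
fn finish_probe(timer: Option<Instant>) -> Option<f64> {
    timer.map(|started| started.elapsed().as_secs_f64() * 1e6)
}

/// Resolve the gate once, before the hot loop. `EXAMPLE_PROF` is a
/// made-up variable name for this sketch.
fn prof_enabled_from_env() -> bool {
    std::env::var_os("EXAMPLE_PROF").is_some()
}

/// Runs `layers` iterations of `work`, recording one sample per
/// iteration only when profiling is enabled.
fn run_layers(layers: usize, prof_enabled: bool, mut work: impl FnMut()) -> Vec<f64> {
    let mut samples_us = Vec::new();
    for _ in 0..layers {
        // No env lookup here: the caller already decided.
        let timer = start_probe_if(prof_enabled);
        work();
        if let Some(us) = finish_probe(timer) {
            samples_us.push(us);
        }
    }
    samples_us
}
```

A caller would evaluate `prof_enabled_from_env()` once at startup and pass the resulting `bool` down, so the disabled path inside the loop costs only an `Option` check.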