vor
Opinionated cross-platform performance instrumentation for Rust and Python (see bindings/python) that gathers CPU, GPU, and I/O metrics into a single profiler. It does both halves of the job: measuring your code, and visualizing the system metrics, either live or after the fact from a recorded capture.
Annotate a function and the same scope goes to a puffin flame chart, a tracing span, and (with the cuda feature) an NVTX range. With the viz feature, vor also draws an egui panel with that flame chart, frame-rate bars, and live system and GPU metrics.
https://github.com/user-attachments/assets/ac7643ab-504b-4033-98f5-5dc419937414
Highlights
- Macros for functions, methods, and whole impl blocks: #[profile], #[all_functions], #[skip]. const fns are left alone.
- The same annotations work on native macOS, web/wasm, and NVIDIA (NVTX).
- An egui panel with frame bars, a puffin flame chart, and one line plot per metric, with pin, pause, range-select, and zoom.
- System metrics sampled for you every frame: frame time, resident memory, and per-frame I/O.
- Headless capture: set VOR_RECORD and the same instrumentation streams system, GPU, and named metrics (plus opt-in flame frames) to a .vor file you replay later, with no panel in the binary.
- Python bindings: profile Python with @vor.profile and record_metric, capturing to the same .vor stream the Rust tools read.
- Live GPU metrics in the panel: Apple Silicon via IOKit and IOReport (no sudo), NVIDIA via NVML.
- Sinks that write a Chrome trace on native or push to the browser DevTools timeline on web.
Install
vor is feature-gated, so pull in only what your platform needs.
[dependencies]
vor = { version = "0.2", features = ["viz", "mac"] }
Or cargo add vor --features viz,mac.
| feature | adds |
|---|---|
| (none) | instrumentation macros plus puffin/tracing scopes (no cost until enable()) |
| viz | the egui profiler panel (vor::viz) |
| mac | macOS: ChromeTraceSink, resident-memory sampling, and the IOKit/IOReport GPU collector |
| web | wasm: BrowserSink (DevTools User Timing), JS-heap memory, browser-safe puffin |
| cuda | NVIDIA: live GPU rows via NVML, plus an NVTX range per scope for Nsight Systems |
These features are independent; combine them as needed, for example ["viz", "mac", "cuda"].
Instrumenting code
// A single function or method.
// Every method in an impl. Scopes are named Renderer::sort,
// Renderer::shade, and so on, with no per-method attribute.
// An ad-hoc block scope.
Turn collection on once, and mark a boundary per rendered frame:
Until enable() is called the puffin half does nothing. The tracing half is always live for whatever subscriber you install.
Headless capture (VOR_RECORD)
The panel is one consumer of vor's per-frame samples; a file stream is another. For a headless job (ML training or inference, a server, a batch tool) set VOR_RECORD and the same instrumentation writes each frame_mark to an append-only .vor capture. No panel, no egui, no render loop, and nothing extra in the binary when the variable is unset.
VOR_RECORD=/scratch/run.vor
record_metric(name, value) is the headless-friendly, generics-free counterpart to a panel Metric<R>: the latest value per name is snapshotted into each frame's record. Good for loss, learning rate, tokens/sec, or batch size. To label a row with a unit, call record_metric_unit(name, unit) once (e.g. at startup); metrics stay unitless otherwise.
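The "latest value per name" behaviour can be pictured with a small store. This is a minimal sketch, not vor's implementation: MetricStore, its fields, the f64 value type, and the snapshot method are all hypothetical names chosen for illustration. Only the two function names and the overwrite-then-snapshot semantics come from the text above.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the store behind record_metric (not vor's real type).
#[derive(Default)]
pub struct MetricStore {
    latest: BTreeMap<String, f64>,    // assumed value type: f64
    units: BTreeMap<String, String>,
}

impl MetricStore {
    /// Overwrite the value for `name`; only the most recent one survives.
    pub fn record_metric(&mut self, name: &str, value: f64) {
        self.latest.insert(name.to_string(), value);
    }

    /// Register a unit once; metrics without one stay unitless.
    pub fn record_metric_unit(&mut self, name: &str, unit: &str) {
        self.units.insert(name.to_string(), unit.to_string());
    }

    /// Copy the current values into a per-frame record as (name, unit, value),
    /// ordered by name. Values are kept, so a metric that was not updated this
    /// frame repeats its last value.
    pub fn snapshot(&self) -> Vec<(String, Option<String>, f64)> {
        self.latest
            .iter()
            .map(|(name, value)| (name.clone(), self.units.get(name).cloned(), *value))
            .collect()
    }
}
```

Calling record_metric("loss", ..) twice within one step leaves a single loss entry in that step's snapshot, which is why it suits slowly changing scalars rather than event counts.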
Every metric, system or user, is a column: a (name, unit) pair with a stable id, declared once before any value references it. System columns are declared in the header; a user metric is declared the first time it appears, taking the unit registered for it. Frame records then carry values by id, so a name or unit is never repeated per frame. Each record holds one system sample, the frame's user scalars, and (opt-in) one puffin flame frame. Records are length-delimited and compressed one at a time, so a reader can tail a growing file or stop cleanly at a truncated final record left by a crashed job:
[header] [u32 len][lz4 record] [u32 len][lz4 record] ...
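The column model described above (declare a (name, unit) pair once, then refer to it by id) can be sketched as follows. Entry, Columns, and write_frame are hypothetical names, ids are assumed to be sequential u32 values, and for brevity every column is declared lazily here, whereas the text says system columns are declared in the header.

```rust
use std::collections::HashMap;

/// One entry of a hypothetical capture stream (illustrative, not the .vor format).
#[derive(Debug, PartialEq)]
pub enum Entry {
    /// Declares a column once: a stable id plus its (name, unit) pair.
    Declare { id: u32, name: String, unit: Option<String> },
    /// A frame's values, keyed by column id only.
    Frame(Vec<(u32, f64)>),
}

#[derive(Default)]
pub struct Columns {
    ids: HashMap<String, u32>,
}

impl Columns {
    /// Return the id for `name`, emitting a declaration the first time it is seen.
    fn id_for(&mut self, name: &str, unit: Option<&str>, out: &mut Vec<Entry>) -> u32 {
        if let Some(&id) = self.ids.get(name) {
            return id;
        }
        let id = self.ids.len() as u32;
        self.ids.insert(name.to_string(), id);
        out.push(Entry::Declare { id, name: name.to_string(), unit: unit.map(String::from) });
        id
    }

    /// Encode one frame; a declaration always precedes the first frame that uses it.
    pub fn write_frame(&mut self, values: &[(&str, Option<&str>, f64)], out: &mut Vec<Entry>) {
        let mut row = Vec::with_capacity(values.len());
        for &(name, unit, value) in values {
            row.push((self.id_for(name, unit, out), value));
        }
        out.push(Entry::Frame(row));
    }
}
```

After the first frame, each further frame costs one id and one value per metric, which is the property that keeps names and units out of the per-frame records.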
The default is metrics-only (tens of bytes per step lz4'd), since that time series is what a long run actually wants. Flame frames are heavier and gated behind env vars:
| variable | effect |
|---|---|
| VOR_RECORD | output path (/scratch/run.vor); unset disables recording |
| VOR_RECORD_FLAME=1 | also capture puffin flame frames (default off, metrics only) |
| VOR_RECORD_EVERY=N | capture a flame frame on 1 step in N |
| VOR_RECORD_MAX_FRAMES=N | stop capturing flame frames after N of them (metrics continue) |
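How the three flame variables could combine is sketched below. FlameConfig and FlameGate are hypothetical types, and several details are assumptions rather than documented behaviour: that EVERY defaults to 1, that MAX_FRAMES is unlimited when unset, and that the captured step is the first of each group of N.

```rust
/// Hypothetical parsed form of the VOR_RECORD_* flame variables.
pub struct FlameConfig {
    pub enabled: bool,           // VOR_RECORD_FLAME=1
    pub every: u64,              // VOR_RECORD_EVERY=N (assumed default: 1)
    pub max_frames: Option<u64>, // VOR_RECORD_MAX_FRAMES=N (assumed: no limit when unset)
}

impl FlameConfig {
    /// Build a config from a lookup such as `|k| std::env::var(k).ok()`.
    pub fn from_lookup(get: impl Fn(&str) -> Option<String>) -> Self {
        let num = |key: &str| get(key).and_then(|v| v.trim().parse::<u64>().ok());
        FlameConfig {
            enabled: get("VOR_RECORD_FLAME").as_deref() == Some("1"),
            every: num("VOR_RECORD_EVERY").filter(|n| *n > 0).unwrap_or(1),
            max_frames: num("VOR_RECORD_MAX_FRAMES"),
        }
    }
}

pub struct FlameGate {
    config: FlameConfig,
    captured: u64,
}

impl FlameGate {
    pub fn new(config: FlameConfig) -> Self {
        FlameGate { config, captured: 0 }
    }

    /// Decide whether the 0-based `step` should carry a flame frame.
    /// Metrics are unaffected; this only gates the heavier flame payload.
    pub fn should_capture(&mut self, step: u64) -> bool {
        if !self.config.enabled || step % self.config.every != 0 {
            return false;
        }
        if let Some(max) = self.config.max_frames {
            if self.captured >= max {
                return false;
            }
        }
        self.captured += 1;
        true
    }
}
```

With FLAME=1, EVERY=3, and MAX_FRAMES=2, this gate selects steps 0 and 3 and then goes quiet, while the metrics-only stream would keep growing.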
Read a capture back with vor::Reader (header columns, then frames one at a time; stops at EOF or a torn trailing record, so the same code reads a finished or still-growing file):
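// Argument and return shapes are assumed here; the path is the VOR_RECORD example above.
let mut reader = vor::Reader::open("/scratch/run.vor").unwrap();
for column in reader.columns() {
    // header columns: one (name, unit) pair per stable id
}
while let Some(frame) = reader.next_frame().unwrap() {
    // one frame at a time: system sample, user scalars, optional flame frame
}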
next_frame returns None at EOF or a partial trailing record and keeps the buffered bytes, so the same loop reads a finished file (stop at None) or one still being written (retry after None to pick up new frames as they land).
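The "return None but keep the buffered bytes" behaviour can be illustrated with a standalone reader for the [u32 len][record] framing. This is a sketch under stated assumptions, not vor::Reader: the length prefix is assumed little-endian, the header is assumed already consumed, and the payload is returned as-is with no lz4 step. RecordReader, feed, next_record, and encode_record are hypothetical names.

```rust
/// Incremental reader for `[u32 len][payload]` records (illustrative only).
#[derive(Default)]
pub struct RecordReader {
    buf: Vec<u8>, // everything fed so far; consumed bytes are not reclaimed in this sketch
    pos: usize,   // offset of the first unconsumed byte
}

impl RecordReader {
    /// Append bytes as they arrive from the growing file.
    pub fn feed(&mut self, bytes: &[u8]) {
        self.buf.extend_from_slice(bytes);
    }

    /// Return the next complete record, or None if only a partial one is buffered.
    /// Partial bytes stay in place, so a later `feed` can complete them.
    pub fn next_record(&mut self) -> Option<Vec<u8>> {
        let rest = &self.buf[self.pos..];
        if rest.len() < 4 {
            return None; // the length prefix itself is incomplete
        }
        let len = u32::from_le_bytes([rest[0], rest[1], rest[2], rest[3]]) as usize;
        let payload = rest.get(4..4 + len)?.to_vec(); // None on a torn payload
        self.pos += 4 + len;
        Some(payload)
    }
}

/// Writer side of the same framing, handy for building test input.
pub fn encode_record(payload: &[u8]) -> Vec<u8> {
    let mut out = (payload.len() as u32).to_le_bytes().to_vec();
    out.extend_from_slice(payload);
    out
}
```

Because a torn record and a not-yet-finished record look identical to the reader, the caller decides what None means: stop for a post-mortem, or sleep and feed more bytes when tailing.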
Replaying in the panel (viz)
With the viz feature, vor::viz::ReplayState renders a capture through the same frame bars, flame chart, and metric rows as the live panel, fed from the stream instead of in-process sampling:
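// Argument shapes are assumed here; the path is the VOR_RECORD example above.
let mut state = vor::viz::ReplayState::open("/scratch/run.vor").unwrap();
// each egui frame (ui is the egui::Ui being drawn into):
state.show(ui);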
With follow on (default) it tails a growing file, so you can watch a job live on the same host; off, or once the file stops growing, it is a post-mortem of the last few hundred frames. Click a bar to pause and inspect, shift-drag to zoom a frame range, and if the run captured flame frames the pinned step's flame chart fills in. examples/replay.rs wires this into a window. (Very long post-mortem runs that need scrolling past the bounded ring are future work.)
The in-app panel (viz)
vor owns the system rows (frame_ms, memory_mb, io_ms, io_MB, and gpu_* where supported). You describe only your own per-frame workload.
use VecDeque;
use ;
const
const METRICS: & =
&;
let mut state = new;
let cap = FRAME_MS.history_capacity;
let mut history: = with_capacity;
// Once per displayed frame, inside your egui update. Skip the tick
// and the push while paused so every graph freezes together instead
// of scrolling under the pinned cursor:
if !state.is_paused
show; // draw the panel
PanelState::tick() advances vor's own system ring. Push one workload record per tick so the two stay aligned, and gate both on is_paused() as above.
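Why both the tick and the push sit behind the same pause check is easier to see with two bounded rings side by side. The sketch below does not use vor: Panel is a hypothetical stand-in that models only a pause flag and a system-sample ring, and on_frame and push_bounded are illustrative helpers.

```rust
use std::collections::VecDeque;

/// Push onto a ring that never holds more than `cap` items.
pub fn push_bounded<T>(ring: &mut VecDeque<T>, cap: usize, item: T) {
    if cap == 0 {
        return;
    }
    if ring.len() == cap {
        ring.pop_front(); // drop the oldest sample
    }
    ring.push_back(item);
}

/// Hypothetical stand-in for the panel state (not vor's PanelState).
pub struct Panel {
    pub paused: bool,
    pub system: VecDeque<f64>, // e.g. frame_ms samples
    pub cap: usize,
}

impl Panel {
    pub fn new(cap: usize) -> Self {
        Panel { paused: false, system: VecDeque::with_capacity(cap), cap }
    }

    pub fn is_paused(&self) -> bool {
        self.paused
    }

    /// Advance the system ring by one sample.
    pub fn tick(&mut self, frame_ms: f64) {
        push_bounded(&mut self.system, self.cap, frame_ms);
    }
}

/// One displayed frame: both rings advance together, or neither does.
pub fn on_frame(panel: &mut Panel, history: &mut VecDeque<u32>, frame_ms: f64, workload: u32) {
    if !panel.is_paused() {
        panel.tick(frame_ms);
        push_bounded(history, panel.cap, workload);
    }
}
```

If only one of the two rings kept advancing while paused, index i in the workload history would no longer describe the same frame as index i in the system ring, and a pinned cursor would show mismatched values.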
Panel interactions
The bars and every metric plot share one time axis: a pin, a zoom range, and pause apply to all of them at once.
| action | effect |
|---|---|
| click a frame bar | pin the cursor on that frame (all graphs) and pause |
| shift-drag the bars | zoom every graph to that frame range (pins the slowest frame) |
| pause/resume button | freeze / follow the live stream (PanelState::toggle_pause) |
| scroll over the flame chart | zoom the flame chart's within-frame time; drag pans, double-click resets |
| profiler chip | annotate frame_ms with vor's own per-frame cost |
System and GPU metrics
vor samples these itself on each tick():
| metric | source | platforms |
|---|---|---|
| frame_ms | wall time between ticks | all |
| memory_mb | RSS on mac, performance.memory on web (Chromium) | mac, web |
| io_ms, io_MB | your record_io(ns, bytes) calls, drained per frame | all |
| gpu_util | IOKit IOAccelerator on mac, NVML utilization on cuda | mac, cuda |
| gpu_sm | IOKit IOAccelerator renderer utilization | mac |
| gpu_power | IOReport GPU Energy on mac, NVML power draw on cuda | mac, cuda |
| pcie | NVML PCIe TX+RX | cuda |
| gpu_mem | IOKit IOAccelerator in-use memory on mac, NVML used on cuda | mac, cuda |
| gpu_temp | NVML core temperature | cuda |
| gpu_clock | NVML SM clock | cuda |
The panel starts a background thread that polls the GPU backend (mac or cuda, no sudo), and the rows show only the metrics that backend supplies: gpu_sm is macOS-only (NVML has no SM-occupancy counter), while pcie, gpu_temp, and gpu_clock are NVIDIA-only (the macOS backend doesn't read them). On a platform with no backend, including the browser (which gives a web page no GPU-telemetry API), the GPU rows are dropped rather than drawn as flat zeros.
Feed I/O time from anywhere, including background threads:
record_io(ns, bytes); // lock-free accumulator
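A lock-free accumulator that is fed from any thread and drained once per frame can be built from two atomics. The sketch below shows that general technique and makes no claim about vor's internals: IoAccumulator and drain are hypothetical names, and io_MB is assumed to mean 10^6 bytes.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical accumulator pair (not vor's real type).
pub struct IoAccumulator {
    ns: AtomicU64,
    bytes: AtomicU64,
}

impl IoAccumulator {
    pub const fn new() -> Self {
        IoAccumulator { ns: AtomicU64::new(0), bytes: AtomicU64::new(0) }
    }

    /// Callable from any thread: two atomic adds, no lock taken.
    pub fn record_io(&self, ns: u64, bytes: u64) {
        self.ns.fetch_add(ns, Ordering::Relaxed);
        self.bytes.fetch_add(bytes, Ordering::Relaxed);
    }

    /// Called once per frame: take the totals and reset both counters to zero.
    /// Returns (io_ms, io_MB), with MB assumed to be 1e6 bytes.
    pub fn drain(&self) -> (f64, f64) {
        let ns = self.ns.swap(0, Ordering::Relaxed);
        let bytes = self.bytes.swap(0, Ordering::Relaxed);
        (ns as f64 / 1e6, bytes as f64 / 1e6)
    }
}
```

The two counters are swapped independently, so a record_io call racing with drain may land its time in one frame and its bytes in the next. Totals over the run are still exact, which is usually an acceptable trade for never blocking an I/O thread.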
Sinks (offline traces)
Install a sink once at startup, then drop the returned guard to flush.
// macOS. Open the output in chrome://tracing or Perfetto.
use ;
let guard = ChromeTraceSink .install;
// Web. Spans show up in the DevTools Performance tab.
use ;
let guard = BrowserSink.install;
NVIDIA (cuda)
The cuda feature does two independent things on NVIDIA hardware:
- Fills the panel's GPU rows (gpu_util, pcie, gpu_power, gpu_mem, gpu_temp, and gpu_clock) from NVML, the same way mac fills its rows from IOKit and IOReport.
- Opens an NVTX range per scope, so your instrumented code lines up on an Nsight Systems timeline next to CUDA and GPU work. No code changes are needed: the same #[profile], #[all_functions], and profile_scope! carry over.
Neither needs a CUDA toolkit to build. nvtx vendors its headers and compiles them with cc; nvml-wrapper loads libnvidia-ml from the driver at runtime, so the GPU rows populate on any machine with an NVIDIA driver installed.
Other utilities
- FrameStats: an HDR histogram of per-frame nanoseconds, with p50_ns, p95_ns, p99_ns, and mean_ns.
- calibrate() and empty_span_ns(): measure the per-span instrumentation overhead so you can subtract it.
- current_memory_bytes(): process memory on supported platforms.
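To make the percentile and overhead-subtraction ideas concrete, here is a small sketch that computes nearest-rank percentiles over an exact sorted sample. It is deliberately not an HDR histogram (which buckets values to bound memory), and FrameSample, percentile_ns, and corrected_ns are hypothetical names, not vor's API.

```rust
/// Exact per-frame sample, sorted once at construction (illustrative only).
pub struct FrameSample {
    sorted_ns: Vec<u64>,
}

impl FrameSample {
    pub fn new(mut frames_ns: Vec<u64>) -> Self {
        frames_ns.sort_unstable();
        FrameSample { sorted_ns: frames_ns }
    }

    /// Nearest-rank percentile: the smallest sample with at least `p` percent
    /// of all samples at or below it. Returns None for an empty sample.
    pub fn percentile_ns(&self, p: usize) -> Option<u64> {
        let n = self.sorted_ns.len();
        if n == 0 {
            return None;
        }
        let rank = (p * n + 99) / 100; // ceil(p * n / 100) in integer math
        Some(self.sorted_ns[rank.clamp(1, n) - 1])
    }

    pub fn mean_ns(&self) -> Option<f64> {
        if self.sorted_ns.is_empty() {
            return None;
        }
        let sum: u128 = self.sorted_ns.iter().map(|&v| v as u128).sum();
        Some(sum as f64 / self.sorted_ns.len() as f64)
    }
}

/// Subtract a measured per-span overhead (the kind of figure empty_span_ns()
/// is described as reporting) from a raw duration, clamping at zero.
pub fn corrected_ns(raw_ns: u64, spans: u64, empty_span_ns: u64) -> u64 {
    raw_ns.saturating_sub(spans.saturating_mul(empty_span_ns))
}
```

For ten frames of 10 through 100 ns, the 50th percentile is 50 and both the 95th and 99th land on the slowest frame, which shows how tail percentiles stay coarse until the sample is large.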
Examples
examples/custom_metrics.rs is headless and shows the API shape (#[profile],
#[all_functions], caller-defined metrics, the PanelState loop):
examples/headless.rs profiles an ML-style loop with no panel, records it when
VOR_RECORD is set, and reads the capture back with vor::Reader:
VOR_RECORD=/tmp/run.vor cargo run --example headless
VOR_RECORD=/tmp/run.vor VOR_RECORD_FLAME=1 cargo run --example headless
examples/replay.rs opens that capture in the panel, tailing it live or
replaying it after the fact:
examples/live_panel.rs opens a window and renders the live panel, so it doubles
as an end-to-end check of each platform backend. Pick the feature set for the
machine you are on:
# macOS (Apple Silicon): live gpu_util / gpu_sm / gpu_power via IOKit + IOReport
# NVIDIA box: live gpu_util / pcie / gpu_power via NVML, plus NVTX ranges
# Web / browser: the standalone demo in examples/web/ renders the panel in a canvas
&&
(examples/web/ is a minimal eframe + trunk app; GPU rows are absent in the browser,
so it verifies the web build, the panel, and the DevTools timeline path.)
Run the GPU smoke tests directly (each asserts the backend returns sane readings; run on the matching machine):
License
Dual-licensed under MIT or Apache-2.0.