vor

Opinionated cross-platform performance instrumentation for Rust and Python (see bindings/python) that brings CPU, GPU, and I/O metrics together in a single profiler. It does both halves of the job: measuring your code and visualizing the system metrics, either live or later from a recorded capture.

Annotate a function and the same scope goes to a puffin flame chart, a tracing span, and (with the cuda feature) an NVTX range. With the viz feature, vor also draws an egui panel with that flame chart, frame-rate bars, and live system and GPU metrics.

https://github.com/user-attachments/assets/ac7643ab-504b-4033-98f5-5dc419937414

Highlights

Macros for functions, methods, and whole impl blocks: #[profile], #[all_functions], #[skip]. const fns are left alone.
The same annotations work on native macOS, web/wasm, and NVIDIA (NVTX).
An egui panel with frame bars, a puffin flame chart, and one line plot per metric, with pin, pause, range-select, and zoom.
System metrics sampled for you every frame: frame time, resident memory, and per-frame I/O.
Headless capture: set VOR_RECORD and the same instrumentation streams system, GPU, and named metrics (plus opt-in flame frames) to a .vor file you replay later, with no panel in the binary.
Python bindings: profile Python with @vor.profile and record_metric, capturing to the same .vor stream the Rust tools read.
Live GPU metrics in the panel: Apple Silicon via IOKit and IOReport (no sudo), NVIDIA via NVML.
Sinks that write a Chrome trace on native or push to the browser DevTools timeline on web.

Install

vor is feature-gated, so pull in only what your platform needs.

[dependencies]
vor = { version = "0.2", features = ["viz", "mac"] }

Or cargo add vor --features viz,mac.

feature	adds
(none)	instrumentation macros plus puffin/tracing scopes (no cost until `enable()`)
`viz`	the egui profiler panel (`vor::viz`)
`mac`	macOS: `ChromeTraceSink`, resident-memory sampling, and the IOKit/IOReport GPU collector
`web`	wasm: `BrowserSink` (DevTools User Timing), JS-heap memory, browser-safe puffin
`cuda`	NVIDIA: live GPU rows via NVML, plus an NVTX range per scope for Nsight Systems

These features are independent; combine them as needed, for example ["viz", "mac", "cuda"].

Instrumenting code

// A single function or method.
#[vor::profile]
fn render(frame: u32) { /* ... */ }

// Every method in an impl. Scopes are named Renderer::sort,
// Renderer::shade, and so on, with no per-method attribute.
struct Renderer { /* ... */ }

#[vor::all_functions]
impl Renderer {
    fn sort(&self)  { /* ... */ }
    fn shade(&self) { /* ... */ }

    // Keep a hot trivial helper out of the flame chart.
    #[vor::skip]
    fn dirty(&self) -> bool { /* ... */ }
}

// An ad-hoc block scope.
fn step() {
    vor::profile_scope!("expensive_part");
    /* ... */
}

Turn collection on once, and mark a boundary per rendered frame:

fn main() {
    vor::enable();          // switch puffin scope collection on
    loop {
        // ... your frame ...
        vor::frame_mark();  // group scopes into this frame
    }
}

Until enable() is called the puffin half does nothing. The tracing half is always live for whatever subscriber you install.
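
The "no cost until enable()" behaviour is commonly built from a single atomic flag checked at scope entry. The following is a minimal sketch of that general pattern, not vor's or puffin's implementation: Collector, Scope, and every method on them are hypothetical names made up for illustration.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::time::Instant;

/// Hypothetical collector; the names here are illustrative, not vor's internals.
struct Collector {
    enabled: AtomicBool,
    scopes_recorded: AtomicU64,
}

impl Collector {
    const fn new() -> Self {
        Collector {
            enabled: AtomicBool::new(false),
            scopes_recorded: AtomicU64::new(0),
        }
    }

    fn enable(&self) {
        self.enabled.store(true, Ordering::Relaxed);
    }

    /// Opens a scope. While disabled this is one atomic load and no clock read.
    fn scope(&self) -> Scope<'_> {
        let on = self.enabled.load(Ordering::Relaxed);
        Scope { collector: self, start: on.then(Instant::now) }
    }

    fn recorded(&self) -> u64 {
        self.scopes_recorded.load(Ordering::Relaxed)
    }
}

/// RAII guard: records the scope on drop only if it was armed on entry.
struct Scope<'a> {
    collector: &'a Collector,
    start: Option<Instant>,
}

impl Drop for Scope<'_> {
    fn drop(&mut self) {
        if let Some(start) = self.start {
            let _elapsed = start.elapsed(); // a real collector would store this
            self.collector.scopes_recorded.fetch_add(1, Ordering::Relaxed);
        }
    }
}

fn main() {
    let collector = Collector::new();
    { let _scope = collector.scope(); } // ignored: collection is still off
    println!("before enable: {}", collector.recorded());
    collector.enable();
    { let _scope = collector.scope(); } // counted
    println!("after enable: {}", collector.recorded());
}
```

The design point is that a disabled scope never reads the clock, so instrumentation left in a release build stays cheap until it is switched on.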

Headless capture (`VOR_RECORD`)

The panel is one consumer of vor's per-frame samples; a file stream is another. For a headless job (ML training or inference, a server, a batch tool), set VOR_RECORD and the same instrumentation appends one record per frame_mark call to a .vor capture. There is no panel, no egui, and no render loop, and nothing extra in the binary when the variable is unset.

fn main() {
    vor::enable();                     // arms the recorder if VOR_RECORD is set
    for step in 0..steps {
        let loss = train_step();       // your code; #[vor::profile] on the hot fns inside
        vor::record_metric("loss", loss);  // optional named scalars
        vor::frame_mark();             // one record per step
    }
    vor::flush_recording();            // write the tail before exit
}

VOR_RECORD=/scratch/run.vor cargo run --release    # capture
cargo run --release                                # no recording, instrumentation only

record_metric(name, value) is the headless-friendly, generics-free counterpart to a panel Metric<R>: the latest value per name is snapshotted into each frame's record. Good for loss, learning rate, tokens/sec, or batch size. To label a row with a unit, call record_metric_unit(name, unit) once (e.g. at startup); metrics stay unitless otherwise.
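
The "latest value per name" rule can be modelled in a few lines. This is a sketch of the semantics described above, not vor's code: NamedMetrics and its methods are hypothetical, and keeping a value across frames until it is overwritten is an assumption.

```rust
use std::collections::BTreeMap;

/// Hypothetical model of named scalars; not vor's actual types.
#[derive(Default)]
struct NamedMetrics {
    latest: BTreeMap<String, f64>,
}

impl NamedMetrics {
    /// Overwrites any earlier value recorded under the same name.
    fn record(&mut self, name: &str, value: f64) {
        self.latest.insert(name.to_string(), value);
    }

    /// Copies the current values into a per-frame record, ordered by name.
    /// Assumption: values persist until overwritten rather than being cleared.
    fn snapshot(&self) -> Vec<(String, f64)> {
        self.latest.iter().map(|(k, v)| (k.clone(), *v)).collect()
    }
}

fn main() {
    let mut metrics = NamedMetrics::default();
    metrics.record("loss", 0.9);
    metrics.record("loss", 0.7); // same frame: only the later value survives
    metrics.record("lr", 0.001);
    println!("{:?}", metrics.snapshot());
}
```

If that model holds, calling record_metric several times per step costs nothing extra in the capture: each frame stores one value per name.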

Every metric, system or user, is a column: a (name, unit) pair with a stable id, declared once before any value references it. System columns are declared in the header; a user metric is declared the first time it appears, taking the unit registered for it. Frame records then carry values by id, so a name or unit is never repeated per frame. Each record holds one system sample, the frame's user scalars, and (opt-in) one puffin flame frame. Records are length-delimited and compressed one at a time, so a reader can tail a growing file or stop cleanly at a truncated final record left by a crashed job:

[header] [u32 len][lz4 record] [u32 len][lz4 record] ...
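
Length-delimited framing is what makes a torn tail harmless. The sketch below shows the idea with opaque payloads; it leaves out the header and the lz4 step, and the little-endian length prefix is an assumption, since the byte order is not stated here. The function names are hypothetical.

```rust
/// Appends one record as [u32 length][payload].
/// Assumption: the length prefix is little-endian.
fn append_record(out: &mut Vec<u8>, payload: &[u8]) {
    out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    out.extend_from_slice(payload);
}

/// Returns every complete record and ignores a truncated final one.
fn complete_records(mut bytes: &[u8]) -> Vec<&[u8]> {
    let mut records = Vec::new();
    while bytes.len() >= 4 {
        let len = u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]) as usize;
        if bytes.len() - 4 < len {
            break; // the writer stopped mid-record; stop cleanly
        }
        records.push(&bytes[4..4 + len]);
        bytes = &bytes[4 + len..];
    }
    records
}

fn main() {
    let mut file = Vec::new();
    append_record(&mut file, b"frame-0");
    append_record(&mut file, b"frame-1");

    // Simulate a crash that cut off the last three bytes.
    file.truncate(file.len() - 3);

    for record in complete_records(&file) {
        println!("{}", String::from_utf8_lossy(record));
    }
}
```

Because each record is self-contained, a reader never needs the end of the file to make sense of the beginning.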

The default is metrics-only (tens of bytes per step lz4'd), since that time series is what a long run actually wants. Flame frames are heavier and gated behind env vars:

variable	effect
`VOR_RECORD`	output path (e.g. `/scratch/run.vor`); unset disables recording
`VOR_RECORD_FLAME=1`	also capture puffin flame frames (default off, metrics only)
`VOR_RECORD_EVERY=N`	capture a flame frame on 1 step in N
`VOR_RECORD_MAX_FRAMES=N`	stop capturing flame frames after N of them (metrics continue)
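
How the three flame variables might combine can be sketched as a small policy function. This is a reading of the table above, not vor's code: FlamePolicy, its fields, and the choice to capture on steps divisible by N are all assumptions.

```rust
/// Hypothetical settings, as they might be parsed from the VOR_RECORD_* variables.
struct FlamePolicy {
    enabled: bool,           // VOR_RECORD_FLAME=1
    every: u64,              // VOR_RECORD_EVERY=N (0 is treated as 1 here)
    max_frames: Option<u64>, // VOR_RECORD_MAX_FRAMES=N
}

impl FlamePolicy {
    /// Decides whether this step should carry a flame frame.
    fn should_capture(&self, step: u64, captured_so_far: u64) -> bool {
        if !self.enabled {
            return false;
        }
        if let Some(max) = self.max_frames {
            if captured_so_far >= max {
                return false; // flame budget spent; metrics would continue
            }
        }
        // Assumption: the captured step in each window is the one divisible by N.
        step % self.every.max(1) == 0
    }
}

/// Counts how many flame frames a run of `steps` steps would capture.
fn count_captured(policy: &FlamePolicy, steps: u64) -> u64 {
    let mut captured = 0;
    for step in 0..steps {
        if policy.should_capture(step, captured) {
            captured += 1;
        }
    }
    captured
}

fn main() {
    let policy = FlamePolicy { enabled: true, every: 10, max_frames: Some(3) };
    println!("flame frames in 100 steps: {}", count_captured(&policy, 100));
}
```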

Read a capture back with vor::Reader. It yields the header columns first, then frames one at a time:

let mut reader = vor::Reader::open("/scratch/run.vor").unwrap();
for column in reader.columns() {            // system columns (name, unit)
    println!("{} ({})", column.name, column.unit);
}
while let Some(frame) = reader.next_frame().unwrap() {
    // frame.system aligns to reader.columns(); frame.user are (name, value) scalars,
    // units in reader.user_columns(); frame.flame is a serialized puffin frame.
}

next_frame returns None at EOF or a partial trailing record and keeps the buffered bytes, so the same loop reads a finished file (stop at None) or one still being written (retry after None to pick up new frames as they land).
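
The "keep the buffered bytes and retry" behaviour can be illustrated with a tiny stateful reader. This is a sketch of the idea only: TailReader and feed are hypothetical, the payload is treated as opaque bytes (no header, no lz4), and a little-endian u32 length prefix is assumed.

```rust
/// Hypothetical incremental reader over [u32 len][payload] records.
#[derive(Default)]
struct TailReader {
    buf: Vec<u8>,
}

impl TailReader {
    /// Adds bytes that were newly read from the growing file.
    fn feed(&mut self, bytes: &[u8]) {
        self.buf.extend_from_slice(bytes);
    }

    /// Returns the next complete record. Returns None while only a partial
    /// record is buffered, and leaves those bytes in place for a later retry.
    fn next_frame(&mut self) -> Option<Vec<u8>> {
        let head = self.buf.get(..4)?;
        let len = u32::from_le_bytes([head[0], head[1], head[2], head[3]]) as usize;
        if self.buf.len() < 4 + len {
            return None;
        }
        let frame = self.buf[4..4 + len].to_vec();
        self.buf.drain(..4 + len);
        Some(frame)
    }
}

fn main() {
    let mut reader = TailReader::default();

    // The writer has flushed the length prefix and one payload byte so far.
    reader.feed(&[3, 0, 0, 0, b'a']);
    println!("first try: {:?}", reader.next_frame());

    // The rest of the record lands; the retry now succeeds.
    reader.feed(b"bc");
    println!("second try: {:?}", reader.next_frame());
}
```

A tailing loop would sleep briefly after each None, read whatever the file has gained, feed it in, and try again.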

Replaying in the panel (`viz`)

With the viz feature, vor::viz::ReplayState renders a capture through the same frame bars, flame chart, and metric rows as the live panel, fed from the stream instead of in-process sampling:

let mut state = vor::viz::ReplayState::open("/scratch/run.vor").unwrap();
// each egui frame:
state.show(ui);

With follow on (the default), it tails a growing file, so you can watch a job live on the same host. With follow off, or once the file stops growing, it is a post-mortem view of the last few hundred frames. Click a bar to pause and inspect, or shift-drag to zoom to a frame range; if the run captured flame frames, the pinned step's flame chart fills in. examples/replay.rs wires this into a window. (Scrolling past the bounded ring in very long post-mortem runs is future work.)

The in-app panel (`viz`)

vor owns the system rows (frame_ms, memory_mb, io_ms, io_MB, and gpu_* where supported). You describe only your own per-frame workload.

use std::collections::VecDeque;
use vor::viz::{Metric, PanelConfig, PanelState, show};

#[derive(Clone, Copy)]
struct AppFrame { visible: u32 }

const fn visible_of(f: &AppFrame) -> f64 { f.visible as f64 }
const METRICS: &[Metric<AppFrame>] =
    &[Metric::new("visible", visible_of, "splats").as_integer()];

let mut state = PanelState::new(PanelConfig::FRAME_MS);
let cap = PanelConfig::FRAME_MS.history_capacity;
let mut history: VecDeque<AppFrame> = VecDeque::with_capacity(cap);

// Once per displayed frame, inside your egui update. Skip the tick
// and the push while paused so every graph freezes together instead
// of scrolling under the pinned cursor:
if !state.is_paused() {
    state.tick();                              // sample system metrics, mark a puffin frame
    if history.len() >= cap { history.pop_front(); }
    history.push_back(AppFrame { visible: 1_500_000 });
}
show(ui, &mut state, &history, METRICS);       // draw the panel

PanelState::tick() advances vor's own system ring. Push one workload record per tick so the two stay aligned, and gate both on is_paused() as above.

Panel interactions

The bars and every metric plot share one time axis: a pin, a zoom range, and pause apply to all of them at once.

action	effect
click a frame bar	pin the cursor on that frame (all graphs) and pause
shift-drag the bars	zoom every graph to that frame range (pins the slowest frame)
pause/resume button	freeze / follow the live stream (`PanelState::toggle_pause`)
scroll over the flame chart	zoom the flame chart's within-frame time; drag pans, double-click resets
profiler chip	annotate `frame_ms` with vor's own per-frame cost

System and GPU metrics

vor samples these itself on each tick():

metric	source	platforms
`frame_ms`	wall time between ticks	all
`memory_mb`	RSS on `mac`, `performance.memory` on `web` (Chromium)	`mac`, `web`
`io_ms`, `io_MB`	your `record_io(ns, bytes)` calls, drained per frame	all
`gpu_util`	IOKit `IOAccelerator` on `mac`, NVML utilization on `cuda`	`mac`, `cuda`
`gpu_sm`	IOKit `IOAccelerator` renderer utilization	`mac`
`gpu_power`	IOReport `GPU Energy` on `mac`, NVML power draw on `cuda`	`mac`, `cuda`
`pcie`	NVML PCIe TX+RX	`cuda`
`gpu_mem`	IOKit `IOAccelerator` in-use memory on `mac`, NVML used on `cuda`	`mac`, `cuda`
`gpu_temp`	NVML core temperature	`cuda`
`gpu_clock`	NVML SM clock	`cuda`

The panel starts a background thread that polls the GPU backend (mac or cuda, no sudo), and the rows show only the metrics that backend supplies. gpu_sm is macOS-only (NVML has no SM-occupancy counter), while pcie, gpu_temp, and gpu_clock are NVIDIA-only (the macOS backend doesn't read them). On a platform with no backend, including the browser (which gives a web page no GPU-telemetry API), the GPU rows are dropped rather than drawn as flat zeros.
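
One way to get "dropped rather than drawn as flat zeros" is to model each reading as optional and filter before drawing. The sketch below assumes that approach; GpuReading and visible_rows are hypothetical and cover only four of the rows for brevity.

```rust
/// Hypothetical reading: a field is None when the backend cannot supply it.
#[derive(Default)]
struct GpuReading {
    gpu_util: Option<f64>,
    gpu_sm: Option<f64>,
    gpu_power: Option<f64>,
    pcie: Option<f64>,
}

/// Keeps only the rows that carry a value, so a missing metric is never
/// confused with a real reading of zero.
fn visible_rows(reading: &GpuReading) -> Vec<(&'static str, f64)> {
    [
        ("gpu_util", reading.gpu_util),
        ("gpu_sm", reading.gpu_sm),
        ("gpu_power", reading.gpu_power),
        ("pcie", reading.pcie),
    ]
    .iter()
    .filter_map(|&(name, value)| value.map(|v| (name, v)))
    .collect()
}

fn main() {
    // Shaped like the mac backend in the table above: no pcie row.
    let mac_like = GpuReading {
        gpu_util: Some(42.0),
        gpu_sm: Some(30.0),
        gpu_power: Some(5.5),
        pcie: None,
    };
    for (name, value) in visible_rows(&mac_like) {
        println!("{name}: {value}");
    }

    // No backend at all (for example in the browser): nothing to draw.
    println!("rows without a backend: {}", visible_rows(&GpuReading::default()).len());
}
```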

Feed I/O time from anywhere, including background threads:

vor::record_io(elapsed_ns, bytes);   // lock-free accumulator
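
A lock-free accumulator that is drained once per frame usually comes down to two atomic counters. This is a minimal sketch of that pattern under that assumption, not vor's internals; IoAccumulator, drain, and hammer are hypothetical names.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

/// Hypothetical accumulator; not vor's actual implementation.
struct IoAccumulator {
    nanos: AtomicU64,
    bytes: AtomicU64,
}

impl IoAccumulator {
    const fn new() -> Self {
        IoAccumulator { nanos: AtomicU64::new(0), bytes: AtomicU64::new(0) }
    }

    /// Safe to call from any thread; never blocks.
    fn record(&self, elapsed_ns: u64, bytes: u64) {
        self.nanos.fetch_add(elapsed_ns, Ordering::Relaxed);
        self.bytes.fetch_add(bytes, Ordering::Relaxed);
    }

    /// Returns the totals since the last drain and resets both counters.
    /// The two swaps are separate, so a record racing with a drain may be
    /// split across two frames, but nothing is lost or double-counted.
    fn drain(&self) -> (u64, u64) {
        (
            self.nanos.swap(0, Ordering::Relaxed),
            self.bytes.swap(0, Ordering::Relaxed),
        )
    }
}

/// Records from several threads at once and waits for them all to finish.
fn hammer(acc: &IoAccumulator, threads: usize, per_thread: usize) {
    thread::scope(|s| {
        for _ in 0..threads {
            s.spawn(|| {
                for _ in 0..per_thread {
                    acc.record(10, 512);
                }
            });
        }
    });
}

fn main() {
    let acc = IoAccumulator::new();
    hammer(&acc, 4, 1000);
    let (ns, bytes) = acc.drain();
    println!("this frame: {ns} ns, {bytes} bytes");
}
```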

Sinks (offline traces)

Install a sink once at startup, then drop the returned guard to flush.

// macOS. Open the output in chrome://tracing or Perfetto.
use vor::{ChromeTraceSink, Sink};
let guard = ChromeTraceSink { path: "trace.json".into() }.install();

// Web. Spans show up in the DevTools Performance tab.
use vor::{BrowserSink, Sink};
let guard = BrowserSink.install();
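
Because the flush happens when the guard is dropped, how the guard is bound matters. The sketch below uses a stand-in FlushGuard (hypothetical, not vor's type) to show a general Rust rule: a value bound to a name lives to the end of its block, while `let _ = ...` drops it immediately.

```rust
use std::cell::RefCell;
use std::rc::Rc;

type Log = Rc<RefCell<Vec<&'static str>>>;

/// Hypothetical guard that "flushes" by appending to a log when dropped.
struct FlushGuard {
    log: Log,
}

impl Drop for FlushGuard {
    fn drop(&mut self) {
        self.log.borrow_mut().push("flushed");
    }
}

fn install(log: &Log) -> FlushGuard {
    FlushGuard { log: Rc::clone(log) }
}

/// Binding the guard to a name keeps it alive until the block ends.
fn run_with_named_guard() -> Vec<&'static str> {
    let log: Log = Rc::new(RefCell::new(Vec::new()));
    {
        let _guard = install(&log);
        log.borrow_mut().push("work");
    } // _guard dropped here, after the work
    let events = log.borrow().clone();
    events
}

/// `let _ =` does not bind the value, so the guard is dropped at once.
fn run_with_underscore() -> Vec<&'static str> {
    let log: Log = Rc::new(RefCell::new(Vec::new()));
    {
        let _ = install(&log);
        log.borrow_mut().push("work");
    }
    let events = log.borrow().clone();
    events
}

fn main() {
    println!("named guard: {:?}", run_with_named_guard());
    println!("underscore:  {:?}", run_with_underscore());
}
```

With a real sink the second form would flush before any spans were produced, so keep the guard in a named binding that lives as long as the code you want traced.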

NVIDIA (`cuda`)

The cuda feature does two independent things on NVIDIA hardware:

Fills the panel's GPU rows from NVML (gpu_util, gpu_power, pcie, gpu_mem, gpu_temp, and gpu_clock), the NVIDIA counterpart to the IOKit and IOReport rows that mac provides.
Opens an NVTX range per scope, so your instrumented code lines up on an Nsight Systems timeline next to CUDA and GPU work. No code changes are needed: the same #[profile], #[all_functions], and profile_scope! carry over.

Neither needs a CUDA toolkit to build. nvtx vendors its headers and compiles them with cc; nvml-wrapper loads libnvidia-ml from the driver at runtime, so the GPU rows populate on any machine with an NVIDIA driver installed.

Other utilities

FrameStats: an HDR histogram of per-frame nanoseconds, with p50_ns, p95_ns, p99_ns, and mean_ns.
calibrate() and empty_span_ns(): measure the per-span instrumentation overhead so you can subtract it.
current_memory_bytes(): process memory on supported platforms.
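
For readers unfamiliar with the percentile names, here is what p50, p95, and p99 mean on a small set of frame times. This sketch uses an exact nearest-rank calculation over a sorted vector, which is a simpler stand-in for the HDR histogram that FrameStats uses; the function names are hypothetical.

```rust
/// Nearest-rank percentile over exact, already sorted samples.
/// A plain stand-in for illustration, not an HDR histogram.
fn percentile_ns(sorted: &[u64], p: f64) -> Option<u64> {
    if sorted.is_empty() {
        return None;
    }
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.clamp(1, sorted.len()) - 1])
}

/// Arithmetic mean of the samples.
fn mean_ns(samples: &[u64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    Some(samples.iter().sum::<u64>() as f64 / samples.len() as f64)
}

fn main() {
    // Ten made-up frame times with two slow outliers.
    let mut frames = vec![16, 17, 16, 33, 16, 18, 16, 50, 17, 16];
    frames.sort_unstable();

    println!("p50  = {:?}", percentile_ns(&frames, 50.0));
    println!("p95  = {:?}", percentile_ns(&frames, 95.0));
    println!("p99  = {:?}", percentile_ns(&frames, 99.0));
    println!("mean = {:?}", mean_ns(&frames));
}
```

The gap between the median and the tail percentiles is the reason to look at both: two slow frames barely move p50 but dominate p95 and p99.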

Examples

examples/custom_metrics.rs is headless and shows the API shape (#[profile], #[all_functions], caller-defined metrics, the PanelState loop):

cargo run --features viz --example custom_metrics

examples/headless.rs profiles an ML-style loop with no panel, records it when VOR_RECORD is set, and reads the capture back with vor::Reader:

VOR_RECORD=/tmp/run.vor cargo run --example headless              # capture
VOR_RECORD=/tmp/run.vor VOR_RECORD_FLAME=1 cargo run --example headless --features mac
cargo run --example headless -- /tmp/run.vor                     # summarize the capture

examples/replay.rs opens that capture in the panel, tailing it live or replaying it after the fact:

cargo run --example replay --features viz,mac -- /tmp/run.vor

examples/live_panel.rs opens a window and renders the live panel, so it doubles as an end-to-end check of each platform backend. Pick the feature set for the machine you are on:

# macOS (Apple Silicon): live gpu_util / gpu_sm / gpu_power via IOKit + IOReport
cargo run --example live_panel --features viz,mac

# NVIDIA box: live gpu_util / pcie / gpu_power via NVML, plus NVTX ranges
cargo run --example live_panel --features viz,cuda

# Web / browser: the standalone demo in examples/web/ renders the panel in a canvas
cd examples/web && trunk serve --open   # needs: cargo install trunk; rustup target add wasm32-unknown-unknown

(examples/web/ is a minimal eframe + trunk app; GPU rows are absent in the browser, so it verifies the web build, the panel, and the DevTools timeline path.)

Run the GPU smoke tests directly (each asserts the backend returns sane readings; run on the matching machine):

cargo test --features viz,mac  poll_yields_sane_readings   # macOS
cargo test --features viz,cuda poll_yields_sane_readings   # NVIDIA host

License

Dual-licensed under MIT or Apache-2.0.

vor 0.2.1
