# vor
Opinionated cross-platform performance instrumentation for Rust and Python (see [bindings/python](bindings/python)) that brings CPU, GPU, and I/O metrics into a single profiler. It does both halves of the job: measuring your code and visualizing the system metrics, either live or after the fact from a recorded capture.
Annotate a function and the same scope goes to a puffin flame chart, a `tracing` span, and (with the `cuda` feature) an NVTX range. With the `viz` feature, vor also draws an egui panel with that flame chart, frame-rate bars, and live system and GPU metrics.
https://github.com/user-attachments/assets/ac7643ab-504b-4033-98f5-5dc419937414
## Highlights
- Macros for functions, methods, and whole `impl` blocks: `#[profile]`, `#[all_functions]`, `#[skip]`. `const fn`s are left alone.
- The same annotations work on native macOS, web/wasm, and NVIDIA (NVTX).
- An egui panel with frame bars, a puffin flame chart, and one line plot per metric, with pin, pause, range-select, and zoom.
- System metrics sampled for you every frame: frame time, resident memory, and per-frame I/O.
- Headless capture: set `VOR_RECORD` and the same instrumentation streams system, GPU, and named metrics (plus opt-in flame frames) to a `.vor` file you replay later, with no panel in the binary.
- Python bindings: profile Python with `@vor.profile` and `record_metric`, capturing to the same `.vor` stream the Rust tools read.
- Live GPU metrics in the panel: Apple Silicon via IOKit and IOReport (no `sudo`), NVIDIA via NVML.
- Sinks that write a Chrome trace on native or push to the browser DevTools timeline on web.
## Install
vor is feature-gated, so pull in only what your platform needs.
```toml
[dependencies]
vor = { version = "0.2", features = ["viz", "mac"] }
```
Or `cargo add vor --features viz,mac`.
| feature | adds |
| --------- | ---------------------------------------------------------------------------------------- |
| *(none)* | instrumentation macros plus puffin/tracing scopes (no cost until `enable()`) |
| `viz` | the egui profiler panel (`vor::viz`) |
| `mac` | macOS: `ChromeTraceSink`, resident-memory sampling, and the IOKit/IOReport GPU collector |
| `web` | wasm: `BrowserSink` (DevTools User Timing), JS-heap memory, browser-safe puffin |
| `cuda` | NVIDIA: live GPU rows via NVML, plus an NVTX range per scope for Nsight Systems |
These features are independent; combine them as needed, for example `["viz", "mac", "cuda"]`.
## Instrumenting code
```rust
// A single function or method.
#[vor::profile]
fn render(frame: u32) { /* ... */ }

// Every method in an impl. Scopes are named Renderer::sort,
// Renderer::shade, and so on, with no per-method attribute.
struct Renderer { /* ... */ }

#[vor::all_functions]
impl Renderer {
    fn sort(&self) { /* ... */ }
    fn shade(&self) { /* ... */ }

    // Keep a hot trivial helper out of the flame chart.
    #[vor::skip]
    fn dirty(&self) -> bool { /* ... */ }
}

// An ad-hoc block scope.
fn step() {
    vor::profile_scope!("expensive_part");
    /* ... */
}
```
Turn collection on once, and mark a boundary per rendered frame:
```rust
fn main() {
    vor::enable(); // switch puffin scope collection on
    loop {
        // ... your frame ...
        vor::frame_mark(); // group scopes into this frame
    }
}
```
Until `enable()` is called the puffin half does nothing. The `tracing` half is always live for whatever subscriber you install.
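To make the "does nothing until enabled" behavior concrete, here is a minimal std-only sketch of a scope guard gated on a global switch. It is not vor's or puffin's implementation; `COLLECTING`, `enable_collection`, `ScopeGuard`, and `finished_scope_names` are hypothetical names chosen for illustration.

```rust
use std::cell::RefCell;
use std::sync::atomic::{AtomicBool, Ordering};
use std::time::{Duration, Instant};

// Hypothetical global switch, standing in for whatever `enable()` flips.
static COLLECTING: AtomicBool = AtomicBool::new(false);

thread_local! {
    // Scopes that finished on this thread: (name, elapsed time).
    static FINISHED: RefCell<Vec<(&'static str, Duration)>> = RefCell::new(Vec::new());
}

/// Turns scope collection on for the rest of the process.
pub fn enable_collection() {
    COLLECTING.store(true, Ordering::Relaxed);
}

/// RAII guard: captures a start time only if collection was on at entry.
pub struct ScopeGuard {
    name: &'static str,
    start: Option<Instant>,
}

impl ScopeGuard {
    pub fn new(name: &'static str) -> Self {
        let on = COLLECTING.load(Ordering::Relaxed);
        Self { name, start: on.then(Instant::now) }
    }
}

impl Drop for ScopeGuard {
    fn drop(&mut self) {
        // A guard created while collection was off records nothing.
        if let Some(start) = self.start {
            FINISHED.with(|f| f.borrow_mut().push((self.name, start.elapsed())));
        }
    }
}

/// Names of the scopes recorded on the calling thread, in completion order.
pub fn finished_scope_names() -> Vec<&'static str> {
    FINISHED.with(|f| f.borrow().iter().map(|(name, _)| *name).collect())
}
```

In this sketch a disabled guard costs one relaxed atomic load and no clock read, which is the usual reason to gate collection behind a switch.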
## Headless capture (`VOR_RECORD`)
The panel is one consumer of vor's per-frame samples; a file stream is another. For a headless job (ML training or inference, a server, a batch tool) set `VOR_RECORD` and the same instrumentation writes each `frame_mark` to an append-only `.vor` capture. No panel, no egui, no render loop, and nothing extra in the binary when the variable is unset.
```rust
fn main() {
    vor::enable(); // arms the recorder if VOR_RECORD is set
    let steps = 1_000; // placeholder step count
    for _step in 0..steps {
        // #[vor::profile] on the hot fns inside; assumed here to return the loss
        let loss = train_step();
        vor::record_metric("loss", loss); // optional named scalars
        vor::frame_mark(); // one record per step
    }
    vor::flush_recording(); // write the tail before exit
}
```
```sh
VOR_RECORD=/scratch/run.vor cargo run --release # capture
cargo run --release # no recording, instrumentation only
```
`record_metric(name, value)` is the headless-friendly, generics-free counterpart to a panel `Metric<R>`: the latest value per name is snapshotted into each frame's record. Good for loss, learning rate, tokens/sec, or batch size. To label a row with a unit, call `record_metric_unit(name, unit)` once (e.g. at startup); metrics stay unitless otherwise.
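The "latest value per name" behavior can be pictured with a small std-only sketch. This is an illustration under assumptions, not vor's internals: `NamedMetrics` and its methods are hypothetical, and it assumes a value stays in place across frames until it is overwritten.

```rust
use std::collections::BTreeMap;

/// Hypothetical store: keeps the most recent value per metric name.
#[derive(Default)]
pub struct NamedMetrics {
    latest: BTreeMap<String, f64>,
    units: BTreeMap<String, String>,
}

impl NamedMetrics {
    /// Overwrites any earlier value recorded under `name`.
    pub fn record(&mut self, name: &str, value: f64) {
        self.latest.insert(name.to_owned(), value);
    }

    /// Registers a unit for `name`; metrics without one stay unitless.
    pub fn set_unit(&mut self, name: &str, unit: &str) {
        self.units.insert(name.to_owned(), unit.to_owned());
    }

    /// Copies the current values out for one frame record, sorted by name.
    /// Each entry is (name, latest value, unit if one was registered).
    pub fn snapshot(&self) -> Vec<(String, f64, Option<String>)> {
        self.latest
            .iter()
            .map(|(name, value)| (name.clone(), *value, self.units.get(name).cloned()))
            .collect()
    }
}
```

Recording `loss` twice within a step leaves only the second value in that step's snapshot, which matches the "snapshot, not event log" framing above.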
Every metric, system or user, is a column: a `(name, unit)` pair with a stable id, declared once before any value references it. System columns are declared in the header; a user metric is declared the first time it appears, taking the unit registered for it. Frame records then carry values by id, so a name or unit is never repeated per frame. Each record holds one system sample, the frame's user scalars, and (opt-in) one puffin flame frame. Records are length-delimited and compressed one at a time, so a reader can tail a growing file or stop cleanly at a truncated final record left by a crashed job:
```
[header] [u32 len][lz4 record] [u32 len][lz4 record] ...
```
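The declare-once column scheme described before the layout diagram could be sketched roughly as follows. `Column`, `ColumnTable`, and `intern` are hypothetical names, and the id assignment (sequential from zero) is an assumption rather than the documented `.vor` encoding.

```rust
use std::collections::HashMap;

/// A declared column: a name and a unit (empty string for unitless).
#[derive(Debug, Clone, PartialEq)]
pub struct Column {
    pub name: String,
    pub unit: String,
}

/// Hypothetical table that hands out one stable id per column name.
#[derive(Default)]
pub struct ColumnTable {
    columns: Vec<Column>,
    ids: HashMap<String, u32>,
}

impl ColumnTable {
    /// Returns the id for `name`, declaring the column on first use.
    /// The bool is true when a declaration still has to be written to
    /// the stream, that is, the first time the name appears.
    pub fn intern(&mut self, name: &str, unit: &str) -> (u32, bool) {
        if let Some(&id) = self.ids.get(name) {
            return (id, false); // already declared; the unit is not re-read
        }
        let id = self.columns.len() as u32;
        self.columns.push(Column { name: name.to_owned(), unit: unit.to_owned() });
        self.ids.insert(name.to_owned(), id);
        (id, true)
    }

    /// Looks a column up by id, as a reader would when decoding a frame.
    pub fn column(&self, id: u32) -> Option<&Column> {
        self.columns.get(id as usize)
    }
}
```

Once a column is declared, a frame record only needs `(id, value)` pairs, which is what keeps per-step records small.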
The default is metrics-only (tens of bytes per step after lz4 compression), since that time series is what a long run actually wants. Flame frames are heavier and gated behind env vars:
| variable                  | effect                                                          |
| ------------------------- | --------------------------------------------------------------- |
| `VOR_RECORD`              | output path (e.g. `/scratch/run.vor`); unset disables recording |
| `VOR_RECORD_FLAME=1`      | also capture puffin flame frames (default off, metrics only)    |
| `VOR_RECORD_EVERY=N`      | capture a flame frame on 1 step in N                            |
| `VOR_RECORD_MAX_FRAMES=N` | stop capturing flame frames after N of them (metrics continue)  |
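One way to read how the three flame variables combine is the sketch below. It is an interpretation, not vor's code: `FlamePolicy` is a hypothetical type, and it assumes that an unset `VOR_RECORD_EVERY` means every step and that "1 step in N" picks the steps whose 0-based index is a multiple of N.

```rust
/// Hypothetical parsed form of the flame-related `VOR_RECORD_*` variables.
#[derive(Debug, Clone, Copy)]
pub struct FlamePolicy {
    pub enabled: bool,           // VOR_RECORD_FLAME=1
    pub every: u64,              // VOR_RECORD_EVERY=N (assumed default: 1)
    pub max_frames: Option<u64>, // VOR_RECORD_MAX_FRAMES=N (None = no cap)
}

impl FlamePolicy {
    /// Builds a policy from raw variable values. Unset, zero, or
    /// unparsable values fall back to the assumed defaults.
    pub fn from_vars(flame: Option<&str>, every: Option<&str>, max: Option<&str>) -> Self {
        Self {
            enabled: flame == Some("1"),
            every: every
                .and_then(|v| v.parse::<u64>().ok())
                .filter(|n| *n > 0)
                .unwrap_or(1),
            max_frames: max.and_then(|v| v.parse::<u64>().ok()),
        }
    }

    /// Decides whether `step` (0-based) should carry a flame frame,
    /// given how many flame frames have been captured so far.
    pub fn should_capture(&self, step: u64, captured: u64) -> bool {
        self.enabled
            && step % self.every == 0
            && self.max_frames.map_or(true, |max| captured < max)
    }
}
```

A caller would feed it values such as `std::env::var("VOR_RECORD_EVERY").ok().as_deref()`. Metrics are unaffected by this policy; only the flame payload is skipped.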
Read a capture back with `vor::Reader` (header columns, then frames one at a time; stops at EOF or a torn trailing record, so the same code reads a finished or still-growing file):
```rust
let mut reader = vor::Reader::open("/scratch/run.vor").unwrap();
for column in reader.columns() { // system columns (name, unit)
    println!("{} ({})", column.name, column.unit);
}
while let Some(frame) = reader.next_frame().unwrap() {
    // frame.system aligns to reader.columns(); frame.user are (name, value) scalars,
    // units in reader.user_columns(); frame.flame is a serialized puffin frame.
}
```
`next_frame` returns `None` at EOF or a partial trailing record and keeps the buffered bytes, so the same loop reads a finished file (stop at `None`) or one still being written (retry after `None` to pick up new frames as they land).
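The "return `None` and keep the buffered bytes" behavior is a common pattern for length-delimited streams. A std-only sketch of it follows; it is not `vor::Reader` itself. `FrameBuffer` and `encode_record` are hypothetical, the little-endian length prefix is an assumption, and lz4 decompression of the payload is left out.

```rust
/// Hypothetical incremental decoder for `[u32 len][payload]` records.
#[derive(Default)]
pub struct FrameBuffer {
    buf: Vec<u8>,
}

impl FrameBuffer {
    /// Appends bytes newly read from the (possibly still growing) file.
    pub fn feed(&mut self, bytes: &[u8]) {
        self.buf.extend_from_slice(bytes);
    }

    /// Returns the next complete payload, or `None` if the buffer ends in
    /// a partial record. Partial bytes stay buffered for a later retry.
    pub fn next_record(&mut self) -> Option<Vec<u8>> {
        if self.buf.len() < 4 {
            return None; // length prefix not fully written yet
        }
        // Assumption: the length prefix is little-endian.
        let len =
            u32::from_le_bytes([self.buf[0], self.buf[1], self.buf[2], self.buf[3]]) as usize;
        let end = 4usize.checked_add(len)?;
        if self.buf.len() < end {
            return None; // torn trailing record
        }
        let payload = self.buf[4..end].to_vec();
        self.buf.drain(..end);
        Some(payload)
    }
}

/// Frames one payload the way a writer would: length prefix, then bytes.
pub fn encode_record(payload: &[u8]) -> Vec<u8> {
    let mut out = (payload.len() as u32).to_le_bytes().to_vec();
    out.extend_from_slice(payload);
    out
}
```

Because a `None` never consumes anything, a tailing reader can sleep, feed whatever new bytes arrived, and call `next_record` again, while a post-mortem reader simply stops at the first `None`.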
### Replaying in the panel (`viz`)
With the `viz` feature, `vor::viz::ReplayState` renders a capture through the same frame bars, flame chart, and metric rows as the live panel, fed from the stream instead of in-process sampling:
```rust
let mut state = vor::viz::ReplayState::open("/scratch/run.vor").unwrap();
// each egui frame:
state.show(ui);
```
With `follow` on (default) it tails a growing file, so you can watch a job live on the same host; off, or once the file stops growing, it is a post-mortem of the last few hundred frames. Click a bar to pause and inspect, shift-drag to zoom a frame range, and if the run captured flame frames the pinned step's flame chart fills in. `examples/replay.rs` wires this into a window. (Very long post-mortem runs that need scrolling past the bounded ring are future work.)
## The in-app panel (`viz`)
vor owns the system rows (`frame_ms`, `memory_mb`, `io_ms`, `io_MB`, and `gpu_*` where supported). You describe only your own per-frame workload.
```rust
use std::collections::VecDeque;
use vor::viz::{Metric, PanelConfig, PanelState, show};

#[derive(Clone, Copy)]
struct AppFrame { visible: u32 }

const fn visible_of(f: &AppFrame) -> f64 { f.visible as f64 }

const METRICS: &[Metric<AppFrame>] =
    &[Metric::new("visible", visible_of, "splats").as_integer()];

let mut state = PanelState::new(PanelConfig::FRAME_MS);
let cap = PanelConfig::FRAME_MS.history_capacity;
let mut history: VecDeque<AppFrame> = VecDeque::with_capacity(cap);

// Once per displayed frame, inside your egui update. Skip the tick
// and the push while paused so every graph freezes together instead
// of scrolling under the pinned cursor:
if !state.is_paused() {
    state.tick(); // sample system metrics, mark a puffin frame
    if history.len() >= cap { history.pop_front(); }
    history.push_back(AppFrame { visible: 1_500_000 });
}
show(ui, &mut state, &history, METRICS); // draw the panel
```
`PanelState::tick()` advances vor's own system ring. Push one workload record per `tick` so the two stay aligned, and gate both on `is_paused()` as above.
### Panel interactions
The bars and every metric plot share one time axis: a pin, a zoom range, and pause apply to all of them at once.
| action | effect |
| --------------------------- | --------------------------------------------------------------- |
| click a frame bar | pin the cursor on that frame (all graphs) and pause |
| shift-drag the bars | zoom every graph to that frame range (pins the slowest frame) |
| pause/resume button | freeze / follow the live stream (`PanelState::toggle_pause`) |
| scroll over the flame chart | zoom the flame chart's within-frame time; drag pans, double-click resets |
| profiler chip | annotate `frame_ms` with vor's own per-frame cost |
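The shift-drag row says the zoom pins the slowest frame in the selected range. A plausible reading of that rule, as a std-only sketch (the function name and the clamping behavior are assumptions, not the panel's actual code):

```rust
use std::ops::Range;

/// Returns the index of the slowest frame inside `range`, clamped to the
/// history length, or `None` if the clamped range is empty.
pub fn slowest_in_range(frame_ms: &[f64], range: Range<usize>) -> Option<usize> {
    let end = range.end.min(frame_ms.len());
    let start = range.start.min(end);
    frame_ms[start..end]
        .iter()
        .enumerate()
        // total_cmp gives a total order over f64, so NaN cannot panic here.
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(offset, _)| start + offset)
}
```

The returned index is relative to the full history, so the same value can drive the cursor on every graph that shares the time axis.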
## System and GPU metrics
vor samples these itself on each `tick()`:
| metric | source | platforms |
| ---------------- | ------------------------------------------------------------ | -------------- |
| `frame_ms` | wall time between ticks | all |
| `memory_mb` | RSS on `mac`, `performance.memory` on `web` (Chromium) | `mac`, `web` |
| `io_ms`, `io_MB` | your `record_io(ns, bytes)` calls, drained per frame | all |
| `gpu_util` | IOKit `IOAccelerator` on `mac`, NVML utilization on `cuda` | `mac`, `cuda` |
| `gpu_sm` | IOKit `IOAccelerator` renderer utilization | `mac` |
| `gpu_power` | IOReport `GPU Energy` on `mac`, NVML power draw on `cuda` | `mac`, `cuda` |
| `pcie` | NVML PCIe TX+RX | `cuda` |
| `gpu_mem` | IOKit `IOAccelerator` in-use memory on `mac`, NVML used on `cuda` | `mac`, `cuda` |
| `gpu_temp` | NVML core temperature | `cuda` |
| `gpu_clock` | NVML SM clock | `cuda` |
The panel starts a background thread that polls the GPU backend (`mac` or `cuda`, no `sudo`), and the rows show only the metrics that backend supplies: `gpu_sm` is macOS-only (NVML has no SM-occupancy counter), while `pcie`, `gpu_temp`, and `gpu_clock` are NVIDIA-only (the macOS backend doesn't read them). On a platform with no backend, including the browser (which gives a web page no GPU-telemetry API), the GPU rows are dropped rather than drawn as flat zeros.
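Dropping rows a backend cannot supply, rather than plotting zeros, falls out naturally if each reading is optional. A small sketch of that idea, with a hypothetical `GpuSample` type that covers only four of the rows for brevity:

```rust
/// Hypothetical per-sample GPU readings; `None` means the active
/// backend does not supply that metric.
#[derive(Debug, Default, Clone, Copy)]
pub struct GpuSample {
    pub gpu_util: Option<f64>,
    pub gpu_sm: Option<f64>,
    pub gpu_power: Option<f64>,
    pub pcie: Option<f64>,
}

impl GpuSample {
    /// Lists only the rows that have a reading, in a fixed display order.
    pub fn rows(&self) -> Vec<(&'static str, f64)> {
        vec![
            ("gpu_util", self.gpu_util),
            ("gpu_sm", self.gpu_sm),
            ("gpu_power", self.gpu_power),
            ("pcie", self.pcie),
        ]
        .into_iter()
        .filter_map(|(name, value)| value.map(|v| (name, v)))
        .collect()
    }
}
```

With no backend at all, every field is `None` and `rows()` is empty, so nothing is drawn.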
Feed I/O time from anywhere, including background threads:
```rust
vor::record_io(elapsed_ns, bytes); // lock-free accumulator
```
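A lock-free accumulator that is drained once per frame can be as small as two atomics. The following is a sketch of that general technique, not vor's `record_io` implementation; `IoAccumulator` and its methods are hypothetical names.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical accumulator: any thread adds, the frame loop drains.
pub struct IoAccumulator {
    nanos: AtomicU64,
    bytes: AtomicU64,
}

impl IoAccumulator {
    pub const fn new() -> Self {
        Self { nanos: AtomicU64::new(0), bytes: AtomicU64::new(0) }
    }

    /// Adds one I/O operation's elapsed time and size without locking.
    pub fn record(&self, elapsed_ns: u64, bytes: u64) {
        self.nanos.fetch_add(elapsed_ns, Ordering::Relaxed);
        self.bytes.fetch_add(bytes, Ordering::Relaxed);
    }

    /// Takes the totals since the last drain and resets both to zero.
    /// The two swaps are separate operations, so a concurrent `record`
    /// can land its time in one frame and its bytes in the next.
    pub fn drain(&self) -> (u64, u64) {
        (
            self.nanos.swap(0, Ordering::Relaxed),
            self.bytes.swap(0, Ordering::Relaxed),
        )
    }
}
```

The split between frames noted in the `drain` comment is the price of avoiding a lock; totals over a whole run still add up exactly.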
## Sinks (offline traces)
Install a sink once at startup, then drop the returned guard to flush.
```rust
// macOS. Open the output in chrome://tracing or Perfetto.
use vor::{ChromeTraceSink, Sink};
let guard = ChromeTraceSink { path: "trace.json".into() }.install();
```
```rust
// Web. Spans show up in the DevTools Performance tab.
use vor::{BrowserSink, Sink};
let guard = BrowserSink.install();
```
## NVIDIA (`cuda`)
The `cuda` feature does two independent things on NVIDIA hardware:
- Fills the panel's NVIDIA GPU rows (`gpu_util`, `pcie`, `gpu_power`, and the other NVML-backed rows in the table above) from [NVML](https://crates.io/crates/nvml-wrapper), as the `mac` feature does from IOKit and IOReport.
- Opens an [NVTX](https://github.com/NVIDIA/NVTX) range per scope, so your instrumented code lines up on an Nsight Systems timeline next to CUDA and GPU work. No code changes are needed: the same `#[profile]`, `#[all_functions]`, and `profile_scope!` carry over.
Neither needs a CUDA toolkit to build. `nvtx` vendors its headers and compiles them with `cc`; `nvml-wrapper` loads `libnvidia-ml` from the driver at runtime, so the GPU rows populate on any machine with an NVIDIA driver installed.
## Other utilities
- `FrameStats`: an HDR histogram of per-frame nanoseconds, with `p50_ns`, `p95_ns`, `p99_ns`, and `mean_ns`.
- `calibrate()` and `empty_span_ns()`: measure the per-span instrumentation overhead so you can subtract it.
- `current_memory_bytes()`: process memory on supported platforms.
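To show what the percentile accessors and the overhead subtraction amount to, here is a std-only sketch. It is deliberately not an HDR histogram: it computes exact nearest-rank percentiles over raw samples, which is simpler but needs memory proportional to the sample count. `percentile_ns` and `corrected_ns` are hypothetical helpers, not part of vor's API.

```rust
/// Exact nearest-rank percentile over raw nanosecond samples.
/// Returns `None` for an empty slice or a `pct` outside 0..=100.
pub fn percentile_ns(samples: &[u64], pct: f64) -> Option<u64> {
    if samples.is_empty() || !(0.0..=100.0).contains(&pct) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // Nearest rank: the smallest value with at least pct% of the
    // samples at or below it. Rank 0 is treated as rank 1.
    let rank = (pct * sorted.len() as f64 / 100.0).ceil() as usize;
    Some(sorted[rank.max(1) - 1])
}

/// Subtracts the measured per-span overhead from a raw duration that
/// contained `spans_inside` instrumented scopes. Saturates at zero.
pub fn corrected_ns(raw_ns: u64, spans_inside: u64, empty_span_ns: u64) -> u64 {
    raw_ns.saturating_sub(spans_inside.saturating_mul(empty_span_ns))
}
```

For example, a 1,000 ns measurement that wrapped three scopes at 40 ns of overhead each corrects to 880 ns.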
## Examples
`examples/custom_metrics.rs` is headless and shows the API shape (`#[profile]`,
`#[all_functions]`, caller-defined metrics, the `PanelState` loop):
```sh
cargo run --features viz --example custom_metrics
```
`examples/headless.rs` profiles an ML-style loop with no panel, records it when
`VOR_RECORD` is set, and reads the capture back with `vor::Reader`:
```sh
VOR_RECORD=/tmp/run.vor cargo run --example headless # capture
VOR_RECORD=/tmp/run.vor VOR_RECORD_FLAME=1 cargo run --example headless --features mac # capture with flame frames
cargo run --example headless -- /tmp/run.vor # summarize the capture
```
`examples/replay.rs` opens that capture in the panel, tailing it live or
replaying it after the fact:
```sh
cargo run --example replay --features viz,mac -- /tmp/run.vor
```
`examples/live_panel.rs` opens a window and renders the live panel, so it doubles
as an end-to-end check of each platform backend. Pick the feature set for the
machine you are on:
```sh
# macOS (Apple Silicon): live gpu_util / gpu_sm / gpu_power via IOKit + IOReport
cargo run --example live_panel --features viz,mac
# NVIDIA box: live gpu_util / pcie / gpu_power via NVML, plus NVTX ranges
cargo run --example live_panel --features viz,cuda
# Web / browser: the standalone demo in examples/web/ renders the panel in a canvas
# (needs: cargo install trunk; rustup target add wasm32-unknown-unknown)
cd examples/web && trunk serve --open
```
(`examples/web/` is a minimal `eframe` + trunk app; GPU rows are absent in the browser,
so it verifies the `web` build, the panel, and the DevTools timeline path.)
Run the GPU smoke tests directly (each asserts the backend returns sane readings;
run on the matching machine):
```sh
cargo test --features viz,mac poll_yields_sane_readings # macOS
cargo test --features viz,cuda poll_yields_sane_readings # NVIDIA host
```
## License
Dual-licensed under MIT or Apache-2.0.