# vor

Opinionated cross-platform performance instrumentation for Rust and Python (see [bindings/python](bindings/python)) that brings CPU, GPU, and I/O metrics into a single profiler. It does both halves of the job: measuring your code and visualizing system metrics, either live or later from a recorded capture.

Annotate a function and the same scope goes to a puffin flame chart, a `tracing` span, and (with the `cuda` feature) an NVTX range. With the `viz` feature, vor also draws an egui panel with that flame chart, frame-rate bars, and live system and GPU metrics.

https://github.com/user-attachments/assets/ac7643ab-504b-4033-98f5-5dc419937414

## Highlights

- Macros for functions, methods, and whole `impl` blocks: `#[profile]`, `#[all_functions]`, `#[skip]`. `const fn`s are left alone.
- The same annotations work on native macOS, web/wasm, and NVIDIA (NVTX).
- An egui panel with frame bars, a puffin flame chart, and one line plot per metric, with pin, pause, range-select, and zoom.
- System metrics sampled for you every frame: frame time, resident memory, and per-frame I/O.
- Headless capture: set `VOR_RECORD` and the same instrumentation streams system, GPU, and named metrics (plus opt-in flame frames) to a `.vor` file you replay later, with no panel in the binary.
- Python bindings: profile Python with `@vor.profile` and `record_metric`, capturing to the same `.vor` stream the Rust tools read.
- Live GPU metrics in the panel: Apple Silicon via IOKit and IOReport (no `sudo`), NVIDIA via NVML.
- Sinks that write a Chrome trace on native or push to the browser DevTools timeline on web.

## Install

vor is feature-gated, so pull in only what your platform needs.

```toml
[dependencies]
vor = { version = "0.2", features = ["viz", "mac"] }
```

Or `cargo add vor --features viz,mac`.

| feature   | adds                                                                                     |
| --------- | ---------------------------------------------------------------------------------------- |
| *(none)*  | instrumentation macros plus puffin/tracing scopes (no cost until `enable()`)             |
| `viz`     | the egui profiler panel (`vor::viz`)                                                 |
| `mac`     | macOS: `ChromeTraceSink`, resident-memory sampling, and the IOKit/IOReport GPU collector |
| `web`     | wasm: `BrowserSink` (DevTools User Timing), JS-heap memory, browser-safe puffin          |
| `cuda`    | NVIDIA: live GPU rows via NVML, plus an NVTX range per scope for Nsight Systems           |

These features are independent; combine them as needed, for example `["viz", "mac", "cuda"]`.

## Instrumenting code

```rust
// A single function or method.
#[vor::profile]
fn render(frame: u32) { /* ... */ }

// Every method in an impl. Scopes are named Renderer::sort,
// Renderer::shade, and so on, with no per-method attribute.
struct Renderer { /* ... */ }

#[vor::all_functions]
impl Renderer {
    fn sort(&self)  { /* ... */ }
    fn shade(&self) { /* ... */ }

    // Keep a hot trivial helper out of the flame chart.
    #[vor::skip]
    fn dirty(&self) -> bool { /* ... */ }
}

// An ad-hoc block scope.
fn step() {
    vor::profile_scope!("expensive_part");
    /* ... */
}
```

Turn collection on once, and mark a boundary per rendered frame:

```rust
fn main() {
    vor::enable();          // switch puffin scope collection on
    loop {
        // ... your frame ...
        vor::frame_mark();  // group scopes into this frame
    }
}
```

Until `enable()` is called the puffin half does nothing. The `tracing` half is always live for whatever subscriber you install.
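
A common way to make scopes nearly free until they are switched on is a global atomic flag checked when each scope opens. The sketch below illustrates that pattern with the standard library only. It is not vor's or puffin's actual implementation, and `SCOPES_ON`, `RECORDED`, `Scope`, and `recorded_scopes` are hypothetical names.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::time::Instant;

// Hypothetical stand-ins for a profiler's global state.
static SCOPES_ON: AtomicBool = AtomicBool::new(false);
static RECORDED: AtomicUsize = AtomicUsize::new(0);

/// RAII scope: times the enclosing block only while collection is on.
pub struct Scope {
    start: Option<Instant>,
}

impl Scope {
    pub fn new() -> Self {
        // A disabled scope costs one relaxed atomic load and no clock read.
        let on = SCOPES_ON.load(Ordering::Relaxed);
        Scope { start: on.then(Instant::now) }
    }
}

impl Drop for Scope {
    fn drop(&mut self) {
        if let Some(start) = self.start {
            let _elapsed = start.elapsed(); // a real profiler would report this
            RECORDED.fetch_add(1, Ordering::Relaxed);
        }
    }
}

/// Switch collection on for every scope opened from now on.
pub fn enable() {
    SCOPES_ON.store(true, Ordering::Relaxed);
}

/// Number of scopes that were timed so far.
pub fn recorded_scopes() -> usize {
    RECORDED.load(Ordering::Relaxed)
}
```

A scope opened before `enable()` never reads the clock, so leaving annotations in release builds stays cheap under this design.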

## Headless capture (`VOR_RECORD`)

The panel is one consumer of vor's per-frame samples; a file stream is another. For a headless job (ML training or inference, a server, a batch tool) set `VOR_RECORD` and the same instrumentation writes each `frame_mark` to an append-only `.vor` capture. No panel, no egui, no render loop, and nothing extra in the binary when the variable is unset.

```rust
fn main() {
    vor::enable();                         // arms the recorder if VOR_RECORD is set
    let steps = 1_000;                     // illustrative step count
    for _step in 0..steps {
        let loss = train_step();           // your step, assumed to return the loss; #[vor::profile] on the hot fns inside
        vor::record_metric("loss", loss);  // optional named scalars
        vor::frame_mark();                 // one record per step
    }
    vor::flush_recording();                // write the tail before exit
}
```

```sh
VOR_RECORD=/scratch/run.vor cargo run --release    # capture
cargo run --release                                # no recording, instrumentation only
```

`record_metric(name, value)` is the headless-friendly, generics-free counterpart to a panel `Metric<R>`: the latest value per name is snapshotted into each frame's record. Good for loss, learning rate, tokens/sec, or batch size. To label a row with a unit, call `record_metric_unit(name, unit)` once (e.g. at startup); metrics stay unitless otherwise.
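
The "latest value per name" behavior can be pictured as a small map that is overwritten on every call and copied out once per frame. This is a minimal sketch under stated assumptions, not vor's internals: `NamedMetrics` and its methods are hypothetical, values are assumed to be `f64`, and a metric that is not updated in a frame is assumed to repeat its previous value.

```rust
use std::collections::BTreeMap;

/// Hypothetical latest-value store: one slot per metric name.
#[derive(Default)]
pub struct NamedMetrics {
    latest: BTreeMap<String, f64>,
    units: BTreeMap<String, String>,
}

impl NamedMetrics {
    /// Overwrite the current value for `name`; earlier values in the same frame are lost.
    pub fn record(&mut self, name: &str, value: f64) {
        self.latest.insert(name.to_owned(), value);
    }

    /// Register a unit once; names without one stay unitless.
    pub fn set_unit(&mut self, name: &str, unit: &str) {
        self.units.insert(name.to_owned(), unit.to_owned());
    }

    /// Unit registered for `name`, if any.
    pub fn unit_of(&self, name: &str) -> Option<&str> {
        self.units.get(name).map(String::as_str)
    }

    /// Copy the latest value of every metric, sorted by name, into a frame record.
    pub fn snapshot(&self) -> Vec<(String, f64)> {
        self.latest.iter().map(|(k, v)| (k.clone(), *v)).collect()
    }
}
```

Calling `record("loss", ..)` several times within one step leaves only the last value in that step's snapshot, which is why this shape suits scalars such as loss or learning rate rather than event counts.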

Every metric, system or user, is a column: a `(name, unit)` pair with a stable id, declared once before any value references it. System columns are declared in the header; a user metric is declared the first time it appears, taking the unit registered for it. Frame records then carry values by id, so a name or unit is never repeated per frame. Each record holds one system sample, the frame's user scalars, and (opt-in) one puffin flame frame. Records are length-delimited and compressed one at a time, so a reader can tail a growing file or stop cleanly at a truncated final record left by a crashed job:

```
[header] [u32 len][lz4 record] [u32 len][lz4 record] ...
```
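
To see why length-delimited records make tailing and crash recovery simple, here is a sketch of a splitter for `[u32 len][payload]` records. It is illustrative only: `RecordTail` is a hypothetical type, the length prefix is assumed to be little-endian, the header is assumed to be already consumed, and lz4 decompression of each payload is left out.

```rust
/// Incremental splitter for `[u32 len][payload]` records (hypothetical helper).
#[derive(Default)]
pub struct RecordTail {
    buf: Vec<u8>,
}

impl RecordTail {
    /// Append bytes newly read from the capture file.
    pub fn feed(&mut self, bytes: &[u8]) {
        self.buf.extend_from_slice(bytes);
    }

    /// Return the next complete payload, or `None` if only a partial record is buffered.
    /// Partial bytes stay in the buffer so a later `feed` can complete them.
    pub fn next_record(&mut self) -> Option<Vec<u8>> {
        let p = self.buf.get(..4)?;
        // Byte order is an assumption of this sketch.
        let len = u32::from_le_bytes([p[0], p[1], p[2], p[3]]) as usize;
        let end = 4usize.checked_add(len)?;
        if self.buf.len() < end {
            return None; // torn or still-growing record
        }
        let payload = self.buf[4..end].to_vec();
        self.buf.drain(..end);
        Some(payload)
    }
}
```

Because a record is only handed out once all `len` bytes are present, a truncated final record from a crashed job is indistinguishable from one that is still being written, and both are handled by the same `None` path.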

The default is metrics-only (tens of bytes per step lz4'd), since that time series is what a long run actually wants. Flame frames are heavier and gated behind env vars:

| variable                  | effect                                                          |
| ------------------------- | --------------------------------------------------------------- |
| `VOR_RECORD`              | output path (e.g. `/scratch/run.vor`); unset disables recording |
| `VOR_RECORD_FLAME=1`      | also capture puffin flame frames (default off, metrics only)    |
| `VOR_RECORD_EVERY=N`      | capture a flame frame on 1 step in N                            |
| `VOR_RECORD_MAX_FRAMES=N` | stop capturing flame frames after N of them (metrics continue)  |
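
How the three flame variables combine can be sketched as a small policy object. This is one plausible reading of the table, not vor's code: `FlamePolicy` is hypothetical, and the choices that the first step of each group of N is the one captured, that `EVERY` defaults to 1, and that only the literal `1` enables flame capture are assumptions.

```rust
/// Flame-capture settings parsed from the environment (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct FlamePolicy {
    pub enabled: bool,
    pub every: u64,
    pub max_frames: Option<u64>,
}

impl FlamePolicy {
    /// `lookup` abstracts the environment so the parser can be exercised in tests;
    /// pass `|k| std::env::var(k).ok()` to read the real one.
    pub fn from_lookup(lookup: impl Fn(&str) -> Option<String>) -> Self {
        let num = |key: &str| lookup(key).and_then(|v| v.trim().parse::<u64>().ok());
        FlamePolicy {
            enabled: lookup("VOR_RECORD_FLAME").as_deref() == Some("1"),
            // Treat a missing, unparsable, or zero value as "every step" (assumption).
            every: num("VOR_RECORD_EVERY").filter(|n| *n > 0).unwrap_or(1),
            max_frames: num("VOR_RECORD_MAX_FRAMES"),
        }
    }

    /// Should 0-based step `step` carry a flame frame, given `captured` frames so far?
    pub fn capture(&self, step: u64, captured: u64) -> bool {
        self.enabled
            && step % self.every == 0
            && self.max_frames.map_or(true, |max| captured < max)
    }
}
```

With `VOR_RECORD_FLAME=1`, `VOR_RECORD_EVERY=10`, and `VOR_RECORD_MAX_FRAMES=2`, this policy would capture flame frames on steps 0 and 10 and then fall back to metrics only.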

Read a capture back with `vor::Reader` (header columns, then frames one at a time; stops at EOF or a torn trailing record, so the same code reads a finished or still-growing file):

```rust
let mut reader = vor::Reader::open("/scratch/run.vor").unwrap();
for column in reader.columns() {            // system columns (name, unit)
    println!("{} ({})", column.name, column.unit);
}
while let Some(frame) = reader.next_frame().unwrap() {
    // frame.system aligns to reader.columns(); frame.user are (name, value) scalars,
    // units in reader.user_columns(); frame.flame is a serialized puffin frame.
}
```

`next_frame` returns `None` at EOF or a partial trailing record and keeps the buffered bytes, so the same loop reads a finished file (stop at `None`) or one still being written (retry after `None` to pick up new frames as they land).

### Replaying in the panel (`viz`)

With the `viz` feature, `vor::viz::ReplayState` renders a capture through the same frame bars, flame chart, and metric rows as the live panel, fed from the stream instead of in-process sampling:

```rust
let mut state = vor::viz::ReplayState::open("/scratch/run.vor").unwrap();
// each egui frame:
state.show(ui);
```

With `follow` on (default) it tails a growing file, so you can watch a job live on the same host; off, or once the file stops growing, it is a post-mortem of the last few hundred frames. Click a bar to pause and inspect, shift-drag to zoom a frame range, and if the run captured flame frames the pinned step's flame chart fills in. `examples/replay.rs` wires this into a window. (Very long post-mortem runs that need scrolling past the bounded ring are future work.)

## The in-app panel (`viz`)

vor owns the system rows (`frame_ms`, `memory_mb`, `io_ms`, `io_MB`, and `gpu_*` where supported). You describe only your own per-frame workload.

```rust
use std::collections::VecDeque;
use vor::viz::{Metric, PanelConfig, PanelState, show};

#[derive(Clone, Copy)]
struct AppFrame { visible: u32 }

const fn visible_of(f: &AppFrame) -> f64 { f.visible as f64 }
const METRICS: &[Metric<AppFrame>] =
    &[Metric::new("visible", visible_of, "splats").as_integer()];

let mut state = PanelState::new(PanelConfig::FRAME_MS);
let cap = PanelConfig::FRAME_MS.history_capacity;
let mut history: VecDeque<AppFrame> = VecDeque::with_capacity(cap);

// Once per displayed frame, inside your egui update. Skip the tick
// and the push while paused so every graph freezes together instead
// of scrolling under the pinned cursor:
if !state.is_paused() {
    state.tick();                              // sample system metrics, mark a puffin frame
    if history.len() >= cap { history.pop_front(); }
    history.push_back(AppFrame { visible: 1_500_000 });
}
show(ui, &mut state, &history, METRICS);       // draw the panel
```

`PanelState::tick()` advances vor's own system ring. Push one workload record per `tick` so the two stay aligned, and gate both on `is_paused()` as above.

### Panel interactions

The bars and every metric plot share one time axis: a pin, a zoom range, and pause apply to all of them at once.

| action                      | effect                                                          |
| --------------------------- | --------------------------------------------------------------- |
| click a frame bar           | pin the cursor on that frame (all graphs) and pause             |
| shift-drag the bars         | zoom every graph to that frame range (pins the slowest frame)   |
| pause/resume button         | freeze / follow the live stream (`PanelState::toggle_pause`)    |
| scroll over the flame chart | zoom the flame chart's within-frame time; drag pans, double-click resets |
| profiler chip               | annotate `frame_ms` with vor's own per-frame cost           |

## System and GPU metrics

vor samples these itself on each `tick()`:

| metric           | source                                                       | platforms      |
| ---------------- | ------------------------------------------------------------ | -------------- |
| `frame_ms`       | wall time between ticks                                      | all            |
| `memory_mb`      | RSS on `mac`, `performance.memory` on `web` (Chromium)       | `mac`, `web`   |
| `io_ms`, `io_MB` | your `record_io(ns, bytes)` calls, drained per frame         | all            |
| `gpu_util`       | IOKit `IOAccelerator` on `mac`, NVML utilization on `cuda`   | `mac`, `cuda`  |
| `gpu_sm`         | IOKit `IOAccelerator` renderer utilization                   | `mac`          |
| `gpu_power`      | IOReport `GPU Energy` on `mac`, NVML power draw on `cuda`     | `mac`, `cuda`  |
| `pcie`           | NVML PCIe TX+RX                                              | `cuda`         |
| `gpu_mem`        | IOKit `IOAccelerator` in-use memory on `mac`, NVML used on `cuda` | `mac`, `cuda` |
| `gpu_temp`       | NVML core temperature                                        | `cuda`         |
| `gpu_clock`      | NVML SM clock                                                | `cuda`         |

The panel starts a background thread that polls the GPU backend (`mac` or `cuda`, no `sudo`), and the rows show only the metrics that backend supplies: `gpu_sm` is macOS-only (NVML has no SM-occupancy counter), while `pcie`, `gpu_temp`, and `gpu_clock` are NVIDIA-only (the macOS backend doesn't read them). On a platform with no backend, including the browser (which gives a web page no GPU-telemetry API), the GPU rows are dropped rather than drawn as flat zeros.

Feed I/O time from anywhere, including background threads:

```rust
vor::record_io(elapsed_ns, bytes);   // lock-free accumulator
```
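
A lock-free accumulator that is drained once per frame can be built from two atomic counters. The sketch below shows the general technique with the standard library; `IoAccumulator` is a hypothetical type, the `u64` argument types are assumed, and vor's real implementation may differ.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical per-frame I/O accumulator shared by all threads.
pub struct IoAccumulator {
    nanos: AtomicU64,
    bytes: AtomicU64,
}

impl IoAccumulator {
    pub const fn new() -> Self {
        IoAccumulator { nanos: AtomicU64::new(0), bytes: AtomicU64::new(0) }
    }

    /// Called from any thread: two relaxed adds, no lock.
    pub fn record(&self, elapsed_ns: u64, bytes: u64) {
        self.nanos.fetch_add(elapsed_ns, Ordering::Relaxed);
        self.bytes.fetch_add(bytes, Ordering::Relaxed);
    }

    /// Called once per frame: take the totals and reset both counters to zero.
    /// The two swaps are not one atomic step, so a concurrent `record` can land
    /// its time in this frame and its bytes in the next; nothing is lost.
    pub fn drain(&self) -> (u64, u64) {
        (
            self.nanos.swap(0, Ordering::Relaxed),
            self.bytes.swap(0, Ordering::Relaxed),
        )
    }
}
```

Using `swap(0)` rather than a load followed by a store means an add that races with the drain is counted in either this frame or the next, never dropped.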

## Sinks (offline traces)

Install a sink once at startup, then drop the returned guard to flush.

```rust
// macOS. Open the output in chrome://tracing or Perfetto.
use vor::{ChromeTraceSink, Sink};
let guard = ChromeTraceSink { path: "trace.json".into() }.install();
```

```rust
// Web. Spans show up in the DevTools Performance tab.
use vor::{BrowserSink, Sink};
let guard = BrowserSink.install();
```

## NVIDIA (`cuda`)

The `cuda` feature does two independent things on NVIDIA hardware:

- Fills the panel's NVIDIA GPU rows (`gpu_util`, `pcie`, `gpu_power`, and the other NVML-sourced rows in the metrics table) from [NVML](https://crates.io/crates/nvml-wrapper), the same way `mac` fills its rows from IOKit and IOReport.
- Opens an [NVTX](https://github.com/NVIDIA/NVTX) range per scope, so your instrumented code lines up on an Nsight Systems timeline next to CUDA and GPU work. No code changes are needed: the same `#[profile]`, `#[all_functions]`, and `profile_scope!` carry over.

Neither needs a CUDA toolkit to build. `nvtx` vendors its headers and compiles them with `cc`; `nvml-wrapper` loads `libnvidia-ml` from the driver at runtime, so the GPU rows populate on any machine with an NVIDIA driver installed.

## Other utilities

- `FrameStats`: an HDR histogram of per-frame nanoseconds, with `p50_ns`, `p95_ns`, `p99_ns`, and `mean_ns`.
- `calibrate()` and `empty_span_ns()`: measure the per-span instrumentation overhead so you can subtract it.
- `current_memory_bytes()`: process memory on supported platforms.
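
To make the percentile names concrete, here is a sketch that computes them from raw per-frame samples using the nearest-rank method. It deliberately swaps in a plain sorted vector for the HDR histogram that `FrameStats` uses, so it shows what the numbers mean rather than how vor stores them; `percentile_ns` and `mean_ns` here are free functions written for this example.

```rust
/// Nearest-rank percentile of frame times in nanoseconds.
/// `p` must be in (0, 100]; returns `None` for an empty sample set or invalid `p`.
pub fn percentile_ns(samples: &[u64], p: f64) -> Option<u64> {
    if samples.is_empty() || !(p > 0.0 && p <= 100.0) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // Nearest rank: the smallest value with at least p% of samples at or below it.
    let rank = (p * sorted.len() as f64 / 100.0).ceil() as usize;
    Some(sorted[rank.clamp(1, sorted.len()) - 1])
}

/// Arithmetic mean of frame times in nanoseconds.
pub fn mean_ns(samples: &[u64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let sum: u128 = samples.iter().map(|&s| u128::from(s)).sum();
    Some(sum as f64 / samples.len() as f64)
}
```

Sorting every query is O(n log n), which is fine for a one-off summary; a histogram trades exactness for constant-time recording, which matters when samples arrive every frame.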

## Examples

`examples/custom_metrics.rs` is headless and shows the API shape (`#[profile]`,
`#[all_functions]`, caller-defined metrics, the `PanelState` loop):

```sh
cargo run --features viz --example custom_metrics
```

`examples/headless.rs` profiles an ML-style loop with no panel, records it when
`VOR_RECORD` is set, and reads the capture back with `vor::Reader`:

```sh
VOR_RECORD=/tmp/run.vor cargo run --example headless              # capture
VOR_RECORD=/tmp/run.vor VOR_RECORD_FLAME=1 cargo run --example headless --features mac
cargo run --example headless -- /tmp/run.vor                     # summarize the capture
```

`examples/replay.rs` opens that capture in the panel, tailing it live or
replaying it after the fact:

```sh
cargo run --example replay --features viz,mac -- /tmp/run.vor
```

`examples/live_panel.rs` opens a window and renders the live panel, so it doubles
as an end-to-end check of each platform backend. Pick the feature set for the
machine you are on:

```sh
# macOS (Apple Silicon): live gpu_util / gpu_sm / gpu_power via IOKit + IOReport
cargo run --example live_panel --features viz,mac

# NVIDIA box: live gpu_util / pcie / gpu_power via NVML, plus NVTX ranges
cargo run --example live_panel --features viz,cuda

# Web / browser: the standalone demo in examples/web/ renders the panel in a canvas
cd examples/web && trunk serve --open   # needs: cargo install trunk; rustup target add wasm32-unknown-unknown
```

(`examples/web/` is a minimal `eframe` + trunk app; GPU rows are absent in the browser,
so it verifies the `web` build, the panel, and the DevTools timeline path.)

Run the GPU smoke tests directly (each asserts the backend returns sane readings;
run on the matching machine):

```sh
cargo test --features viz,mac  poll_yields_sane_readings   # macOS
cargo test --features viz,cuda poll_yields_sane_readings   # NVIDIA host
```

## License

Dual-licensed under MIT or Apache-2.0.