raytop 0.1.2

A real-time TUI monitor for Ray clusters
raytop-0.1.2 is not a library.

raytop

Crates.io License

A real-time TUI monitor for Ray clusters — like htop for distributed GPU training.

raytop provides a unified view of cluster-wide resource utilization — including CPU, GPU, and memory — as well as per-node breakdowns, per-GPU utilization via Prometheus metrics, running job status, and live actor counts. All data is retrieved from the Ray dashboard API and rendered in a continuously updating display, similar in spirit to htop but tailored for multi-node GPU clusters.

Install

Install from crates.io:

cargo install raytop

Usage

Point raytop at your Ray dashboard endpoint. The dashboard is typically available on port 8265 of the Ray head node.

raytop --master http://<HEAD_IP>:8265

Use j/k or arrow keys to navigate nodes, Enter to open the detail panel, Tab to switch focus between jobs and nodes, t to cycle themes, and q to quit.

Build from Source

make build   # cargo build --release
make install # cargo install --path .
make fmt     # cargo fmt
make clean   # cargo clean

Launch Ray Cluster

The included sbatch script provisions a Ray cluster inside Docker containers across Slurm-managed nodes. It starts a Ray head process on the first allocated node and Ray worker processes on all remaining nodes, then polls until every node has successfully registered with the cluster. Once the cluster is fully formed, the script exits while the detached containers continue running.

salloc -N 2 bash examples/ray/ray.sbatch
# or
sbatch -N 2 examples/ray/ray.sbatch

To use a custom container image (either a local tarball or a registry reference):

sbatch -N 4 examples/ray/ray.sbatch --image /fsx/ray+latest.tar.gz

Examples

How It Works

  1. Cluster status — REST API (/api/cluster_status) for cluster-wide CPU/GPU/memory allocation
  2. Node discovery — REST API (/api/v0/nodes) for per-node info and state
  3. Per-GPU metrics — Prometheus scraping (/api/prometheus/sd) for real-time GPU utilization, GRAM usage
  4. Jobs & actors — REST API (/api/jobs/, /api/v0/actors) for running jobs and actor counts per node
  5. Async — Background tokio task fetches all data in parallel, TUI never blocks

License

Apache-2.0