raytop 0.1.1

raytop

A real-time TUI monitor for Ray clusters — like htop for distributed GPU training.

Monitors cluster-wide CPU/GPU/memory utilization, per-node resource usage, per-GPU utilization via Prometheus metrics, running jobs, and live actor counts — all from the Ray dashboard API.

Install

Install from crates.io:

cargo install raytop

Usage

Point raytop at your Ray dashboard endpoint. The dashboard is typically available on port 8265 of the Ray head node.

raytop --master http://<HEAD_IP>:8265

Use j/k or arrow keys to navigate nodes, Enter to open the detail panel, Tab to switch focus between jobs and nodes, t to cycle themes, and q to quit.

Build from Source

make build   # cargo build --release
make install # cargo install --path .
make fmt     # cargo fmt
make clean   # cargo clean

Examples

  • verl Training — PPO/GRPO training on a Ray cluster using FSDP2 backend

How It Works

  1. Cluster status: REST API (/api/cluster_status) for cluster-wide CPU/GPU/memory allocation
  2. Node discovery: REST API (/api/v0/nodes) for per-node info and state
  3. Per-GPU metrics: Prometheus scraping (/api/prometheus/sd) for real-time GPU utilization and GPU memory (GRAM) usage
  4. Jobs & actors: REST API (/api/jobs/, /api/v0/actors) for running jobs and actor counts per node
  5. Async: a background tokio task fetches all data in parallel, so the TUI never blocks
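
The steps above can be sketched in a few lines of Rust. This is an illustration only, not raytop's actual code: the names `ENDPOINTS`, `endpoint_url`, `fetch_stub`, and `fetch_all` are hypothetical, `fetch_stub` stands in for a real HTTP GET, and plain `std` threads with a channel are swapped in for the tokio task so the sketch compiles without any dependencies.

```rust
use std::sync::mpsc;
use std::thread;

/// Dashboard endpoint paths named in the list above.
const ENDPOINTS: [&str; 5] = [
    "/api/cluster_status",
    "/api/v0/nodes",
    "/api/prometheus/sd",
    "/api/jobs/",
    "/api/v0/actors",
];

/// Join the dashboard base URL (the value given to --master) with an
/// endpoint path, tolerating a trailing slash on the base.
fn endpoint_url(master: &str, path: &str) -> String {
    format!(
        "{}/{}",
        master.trim_end_matches('/'),
        path.trim_start_matches('/')
    )
}

/// Hypothetical stand-in for an HTTP GET. A real client would issue the
/// request here and return the response body.
fn fetch_stub(url: &str) -> String {
    format!("<response from {url}>")
}

/// Fetch every endpoint on its own thread and collect (path, body) pairs,
/// sorted by path so the result order is deterministic.
fn fetch_all(master: &str, fetch: fn(&str) -> String) -> Vec<(String, String)> {
    let (tx, rx) = mpsc::channel();
    for path in ENDPOINTS {
        let tx = tx.clone();
        let url = endpoint_url(master, path);
        thread::spawn(move || {
            // A send error only means the receiver is gone, so it is ignored.
            let _ = tx.send((path.to_string(), fetch(&url)));
        });
    }
    // Drop the original sender so the receiving iterator ends once every
    // worker thread has sent its result and exited.
    drop(tx);
    let mut results: Vec<(String, String)> = rx.iter().collect();
    results.sort();
    results
}
```

Calling `fetch_all("http://<HEAD_IP>:8265", fetch_stub)` returns one entry per endpoint. In this sketch the caller waits for all results; a UI loop that must stay responsive could instead keep the receiver and poll it with `try_recv` between redraws.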

License

Apache-2.0