raytop 0.1.0

A real-time TUI monitor for Ray clusters — like htop for distributed GPU training.

Monitors cluster-wide CPU/GPU/memory utilization, per-node resource usage, per-GPU utilization via Prometheus metrics, running jobs, and live actor counts — all from the Ray dashboard API.
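
As a rough illustration of the per-GPU part, the sketch below parses gauge lines in the Prometheus text exposition format and extracts one reading per GPU. It is not raytop's actual parser. The `GpuSample` type and both function names are hypothetical. The caller supplies the metric name, and the `GpuIndex` label key is an assumption about how Ray labels its GPU metrics. The label splitting is deliberately naive and would break on label values that contain commas.

```rust
/// One per-GPU reading (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
struct GpuSample {
    gpu_index: String,
    utilization: f64,
}

/// Look up `key="value"` inside the text between `{` and `}`.
/// Simplification: assumes label values contain no commas.
fn label<'a>(labels: &'a str, key: &str) -> Option<&'a str> {
    labels.split(',').find_map(|pair| {
        let (k, v) = pair.split_once('=')?;
        (k.trim() == key).then(|| v.trim().trim_matches('"'))
    })
}

/// Collect samples for exactly `metric` from Prometheus text output.
/// Comment lines (`# HELP`, `# TYPE`) and other metrics are skipped.
fn parse_gpu_utilization(text: &str, metric: &str) -> Vec<GpuSample> {
    text.lines()
        .map(str::trim)
        .filter(|line| !line.starts_with('#'))
        .filter_map(|line| {
            // Require `metric{` so a longer metric name sharing the prefix is not matched.
            let rest = line.strip_prefix(metric)?.strip_prefix('{')?;
            let (labels, value) = rest.split_once('}')?;
            let utilization = value.split_whitespace().next()?.parse::<f64>().ok()?;
            // "GpuIndex" is an assumed label key.
            let gpu_index = label(labels, "GpuIndex")?.to_string();
            Some(GpuSample { gpu_index, utilization })
        })
        .collect()
}
```

Lines that lack the assumed label or carry a non-numeric value are dropped rather than reported as errors, which keeps a single malformed line from blanking the whole view.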

Build

make build   # cargo build --release
make install # cargo install
make fmt     # cargo fmt
make clean   # cargo clean

# monitor a Ray cluster
raytop --master http://<HEAD_IP>:8265

Build Docker Image

make docker

Launch Ray Cluster

salloc -N 2 bash scripts/ray.sbatch
# or
sbatch -N 2 scripts/ray.sbatch

The script starts a Ray head and workers inside Docker containers across the allocated Slurm nodes, probes until all nodes have registered, then exits. The cluster stays alive in the detached containers.
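
The probe-until-registered step follows a common polling pattern. The real logic lives in the shell script; the Rust sketch below only illustrates the pattern and is not a translation of scripts/ray.sbatch. The function name, its parameters, and the `count_alive` callback (a stand-in for whatever reports the number of registered nodes) are all hypothetical.

```rust
use std::thread;
use std::time::Duration;

/// Poll `count_alive` until it reports at least `expected` nodes.
/// Returns Ok(number of probes used) on success, or Err(last observed count)
/// once `max_attempts` probes have been made without reaching `expected`.
fn wait_for_nodes<P>(
    expected: usize,
    max_attempts: u32,
    interval: Duration,
    mut count_alive: P,
) -> Result<u32, usize>
where
    P: FnMut() -> usize,
{
    let mut last = 0;
    for attempt in 1..=max_attempts {
        last = count_alive();
        if last >= expected {
            return Ok(attempt);
        }
        // Do not sleep after the final failed probe.
        if attempt < max_attempts {
            thread::sleep(interval);
        }
    }
    Err(last)
}
```

Bounding the number of probes keeps a job from hanging indefinitely if a worker never joins; whether the actual script uses a bound is not stated here.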

Pass a custom image:

sbatch -N 4 scripts/ray.sbatch --image /fsx/ray+latest.tar.gz

How It Works

  1. Cluster status: REST API (/api/cluster_status) for cluster-wide CPU/GPU/memory allocation
  2. Node discovery: REST API (/api/v0/nodes) for per-node info and state
  3. Per-GPU metrics: Prometheus scraping (/api/prometheus/sd) for real-time GPU utilization and GPU memory (GRAM) usage
  4. Jobs & actors: REST API (/api/jobs/, /api/v0/actors) for running jobs and per-node actor counts
  5. Async: a background tokio task fetches all data in parallel, so the TUI never blocks
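
The fetch-in-the-background design can be sketched with the standard library alone. This is not raytop's code: it swaps the tokio task for plain `std::thread` and an `mpsc` channel so the example has no dependencies, and the `fetch` callback stands in for a real HTTP GET. Only the five dashboard paths come from the list above; `Snapshot`, `endpoints`, `fetch_all`, `spawn_poller`, and `latest` are hypothetical names.

```rust
use std::sync::mpsc::{self, Receiver};
use std::thread;
use std::time::Duration;

/// Dashboard paths taken from the list above.
const PATHS: [&str; 5] = [
    "/api/cluster_status",
    "/api/v0/nodes",
    "/api/prometheus/sd",
    "/api/jobs/",
    "/api/v0/actors",
];

/// Join the `--master` base URL with each dashboard path.
fn endpoints(master: &str) -> Vec<String> {
    let base = master.trim_end_matches('/');
    PATHS.iter().map(|path| format!("{base}{path}")).collect()
}

/// Raw results of one refresh, in the same order as `PATHS` (hypothetical type).
struct Snapshot {
    bodies: Vec<Result<String, String>>,
}

/// Fetch every URL concurrently, one scoped thread per endpoint.
/// `fetch` is a placeholder for a real HTTP GET.
fn fetch_all<F>(urls: &[String], fetch: F) -> Snapshot
where
    F: Fn(&str) -> Result<String, String> + Sync,
{
    let fetch = &fetch;
    let bodies: Vec<Result<String, String>> = thread::scope(|scope| {
        let handles: Vec<_> = urls
            .iter()
            .map(|url| scope.spawn(move || fetch(url.as_str())))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().unwrap_or_else(|_| Err("fetch panicked".to_string())))
            .collect()
    });
    Snapshot { bodies }
}

/// Start a background worker that refreshes every `interval` and sends each
/// snapshot to the UI side. The worker stops once the receiver is dropped.
fn spawn_poller<F>(master: &str, interval: Duration, fetch: F) -> Receiver<Snapshot>
where
    F: Fn(&str) -> Result<String, String> + Send + Sync + 'static,
{
    let urls = endpoints(master);
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || loop {
        if tx.send(fetch_all(&urls, &fetch)).is_err() {
            break;
        }
        thread::sleep(interval);
    });
    rx
}

/// Non-blocking read for the draw loop: drain the channel and keep the newest snapshot.
fn latest(rx: &Receiver<Snapshot>) -> Option<Snapshot> {
    rx.try_iter().last()
}
```

In this sketch the draw loop calls `latest` once per frame and keeps rendering the previous snapshot when it returns `None`, so a slow or failing endpoint delays fresh data but never stalls input handling.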

License

Apache-2.0