raytop

A real-time TUI monitor for Ray clusters — like htop for distributed GPU training.

Monitors cluster-wide CPU/GPU/memory utilization, per-node resource usage, per-GPU utilization via Prometheus metrics, running jobs, and live actor counts — all from the Ray dashboard API.

Build

make build   # cargo build --release
make install # cargo install
make fmt     # cargo fmt
make clean   # cargo clean

Usage

# monitor a Ray cluster (8265 is the default Ray dashboard port)
raytop --master http://<HEAD_IP>:8265

Build Docker Image

make docker

Launch Ray Cluster

salloc -N 2 bash scripts/ray.sbatch
# or
sbatch -N 2 scripts/ray.sbatch

The script starts a Ray head and workers inside Docker containers across the allocated Slurm nodes, polls until every node has registered with the head, and then exits. The cluster keeps running in the detached containers after the script returns.
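The "probe until all nodes are registered" step is an ordinary bounded polling loop. The sketch below is not the code in scripts/ray.sbatch; it only illustrates the idea in Rust. The function name `wait_for_nodes`, its parameters, and the `count_nodes` callback (a stand-in for whatever actually queries the head for registered nodes) are all hypothetical.

```rust
use std::thread;
use std::time::Duration;

/// Poll `count_nodes` until it reports at least `expected` nodes.
/// Returns the attempt number that succeeded, or an error once
/// `max_attempts` polls have passed without enough nodes.
/// `count_nodes` is a hypothetical stand-in for a real cluster query.
fn wait_for_nodes<F>(
    mut count_nodes: F,
    expected: usize,
    max_attempts: u32,
    delay: Duration,
) -> Result<u32, String>
where
    F: FnMut() -> usize,
{
    for attempt in 1..=max_attempts {
        let seen = count_nodes();
        if seen >= expected {
            return Ok(attempt);
        }
        // Not all nodes have joined yet; wait before probing again.
        thread::sleep(delay);
    }
    Err(format!(
        "fewer than {expected} nodes registered after {max_attempts} attempts"
    ))
}
```

Bounding the number of attempts means a node that never joins produces an error instead of a job that hangs until the Slurm time limit.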

Pass a custom image:

sbatch -N 4 scripts/ray.sbatch --image /fsx/ray+latest.tar.gz

Examples

verl Training — PPO/GRPO training on a Ray cluster

How It Works

Cluster status: REST API (/api/cluster_status) for cluster-wide CPU/GPU/memory allocation
Node discovery: REST API (/api/v0/nodes) for per-node info and state
Per-GPU metrics: Prometheus scraping (/api/prometheus/sd) for real-time GPU utilization and GPU memory (GRAM) usage
Jobs & actors: REST API (/api/jobs/, /api/v0/actors) for running jobs and actor counts per node
Async: a background tokio task fetches all data in parallel, so the TUI never blocks on network I/O
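The fetch pattern in the list above can be sketched with the standard library alone. This is an illustration, not raytop's actual code: it swaps tokio for plain threads and an `std::sync::mpsc` channel so that it compiles without external crates. `fetch` is a hypothetical placeholder that only builds the request URL instead of performing an HTTP GET. `fetch_all`, `spawn_poller`, `drain_latest`, and the `rounds` parameter are likewise invented for the example. Only the endpoint paths come from this README.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Endpoint paths listed in this README.
const ENDPOINTS: [&str; 5] = [
    "/api/cluster_status",
    "/api/v0/nodes",
    "/api/prometheus/sd",
    "/api/jobs/",
    "/api/v0/actors",
];

/// Hypothetical placeholder for an HTTP GET: it only builds the URL.
/// A real client call would go here.
fn fetch(master: &str, path: &str) -> String {
    format!("{}{}", master.trim_end_matches('/'), path)
}

/// Fetch every endpoint on its own scoped thread, then collect the
/// results in the same order as `ENDPOINTS`.
fn fetch_all(master: &str) -> Vec<String> {
    thread::scope(|s| {
        // Spawn all requests first so they run concurrently.
        let handles: Vec<_> = ENDPOINTS
            .iter()
            .map(|path| s.spawn(move || fetch(master, path)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("fetch thread panicked"))
            .collect()
    })
}

/// Background worker: fetch a snapshot `rounds` times and send each one
/// to the UI over a channel. (`rounds` keeps the sketch finite; a real
/// monitor would loop until shutdown.)
fn spawn_poller(
    master: String,
    interval: Duration,
    rounds: usize,
) -> mpsc::Receiver<Vec<String>> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        for _ in 0..rounds {
            if tx.send(fetch_all(&master)).is_err() {
                break; // the UI side dropped its receiver
            }
            thread::sleep(interval);
        }
    });
    rx
}

/// UI side: take whatever snapshots are ready without waiting.
/// Only the newest one is kept.
fn drain_latest(rx: &mpsc::Receiver<Vec<String>>, latest: &mut Option<Vec<String>>) {
    while let Ok(snapshot) = rx.try_recv() {
        *latest = Some(snapshot);
    }
}
```

A render loop would call `drain_latest` once per frame and draw from `latest`. Because `try_recv` returns immediately when nothing is queued, a slow or unreachable dashboard delays only the next data refresh, not key handling or redraws.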

License

Apache-2.0