# raytop
A real-time TUI monitor for Ray clusters — like `htop` for distributed GPU training.
<p align="center">
<img src="images/raytop.gif" alt="raytop" width="800">
</p>
Monitors cluster-wide CPU/GPU/memory utilization, per-node resource usage,
per-GPU utilization via Prometheus metrics, running jobs, and live actor counts —
all from the Ray dashboard API.
## Build
```bash
make build     # cargo build --release
make install   # cargo install
make fmt       # cargo fmt
make clean     # cargo clean
```
## Usage
```bash
raytop --master http://<HEAD_IP>:8265
```
## Build Docker Image
```bash
make docker
```
## Launch Ray Cluster
```bash
salloc -N 2 bash scripts/ray.sbatch
# or
sbatch -N 2 scripts/ray.sbatch
```
The script starts a Ray head and workers inside Docker containers across the Slurm nodes,
probes the cluster until every node is registered, then exits. The cluster stays alive in
the detached containers.
Pass a custom image:
```bash
sbatch -N 4 scripts/ray.sbatch --image /fsx/ray+latest.tar.gz
```
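The readiness probe described above can be sketched as a small shell loop. This is an illustrative sketch, not the actual contents of `scripts/ray.sbatch`; it assumes the `/api/v0/nodes` endpoint listed under How It Works, and the `wait_for_nodes` helper name and timeout are made up for the example:

```shell
# Sketch of the readiness probe (assumed logic; the real script may differ).
# count_alive counts nodes reporting state ALIVE in a /api/v0/nodes
# response read from stdin.
count_alive() { grep -o '"state": *"ALIVE"' | wc -l; }

wait_for_nodes() {
  local head=$1 expected=$2 alive
  for _ in $(seq 1 60); do                          # ~5 min total
    alive=$(curl -sf "$head/api/v0/nodes" | count_alive)
    [ "${alive:-0}" -ge "$expected" ] && return 0   # all nodes registered
    sleep 5
  done
  return 1                                          # timed out
}

# Example invocation inside a Slurm job:
# wait_for_nodes "http://$HEAD_IP:8265" "$SLURM_NNODES"
```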
## Examples
- [verl Training](examples/) — PPO/GRPO training on a Ray cluster
## How It Works
1. **Cluster status** — REST API (`/api/cluster_status`) for cluster-wide CPU/GPU/memory allocation
2. **Node discovery** — REST API (`/api/v0/nodes`) for per-node info and state
3. **Per-GPU metrics** — Prometheus scraping (`/api/prometheus/sd`) for real-time GPU utilization and GPU memory (GRAM) usage
4. **Jobs & actors** — REST API (`/api/jobs/`, `/api/v0/actors`) for running jobs and actor counts per node
5. **Async** — a background `tokio` task fetches all endpoints in parallel, so the TUI never blocks on I/O
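As a rough illustration of step 1, the snippet below fetches and summarizes a cluster-status payload. It is a sketch only: the `summarize`/`fetch_status` helpers and the `usage` field shape are assumptions for the example, not the exact schema the Ray dashboard returns (which varies by version):

```python
import json
from urllib.request import urlopen  # stdlib only


def summarize(payload: dict) -> str:
    """Reduce a cluster-status payload to a one-line summary.

    The "usage" field and its {resource: [used, total]} shape are
    illustrative assumptions, not the exact dashboard schema.
    """
    usage = payload["usage"]
    parts = []
    for res in ("CPU", "GPU"):
        used, total = usage.get(res, (0, 0))
        parts.append(f"{res} {used}/{total}")
    return "  ".join(parts)


def fetch_status(head: str) -> dict:
    # head is the same address raytop takes, e.g. "http://<HEAD_IP>:8265"
    with urlopen(f"{head}/api/cluster_status") as resp:
        return json.load(resp)


# Offline demo with a hand-written payload in the assumed shape:
sample = {"usage": {"CPU": [12, 64], "GPU": [3, 8]}}
print(summarize(sample))  # CPU 12/64  GPU 3/8
```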
## License
Apache-2.0