# raytop

A real-time TUI monitor for Ray clusters — like `htop` for distributed GPU training.

<p align="center">
  <img src="images/raytop.gif" alt="raytop" width="800">
</p>
Monitors cluster-wide CPU/GPU/memory utilization, per-node resource usage,
per-GPU utilization via Prometheus metrics, running jobs, and live actor counts —
all from the Ray dashboard API.

## Build

```bash
make build   # cargo build --release
make install # cargo install
make fmt     # cargo fmt
make clean   # cargo clean

# monitor a Ray cluster
raytop --master http://<HEAD_IP>:8265
```

## Build Docker Image

```bash
make docker
```

## Launch Ray Cluster

```bash
salloc -N 2 bash scripts/ray.sbatch
# or
sbatch -N 2 scripts/ray.sbatch
```

The script starts a Ray head + workers inside Docker containers across Slurm nodes,
probes until all nodes are registered, then exits. The cluster stays alive in the
detached containers.
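
The probe-and-wait step can be sketched in Python like this. The endpoint path comes from the "How It Works" section below; the field names used to unpack the response are assumptions about the dashboard's JSON shape, not a verbatim copy of what `scripts/ray.sbatch` does:

```python
# Sketch of the launch script's readiness probe: poll the Ray dashboard's
# node list until the expected number of nodes report ALIVE.
import json
import time
import urllib.request


def count_alive(nodes: list) -> int:
    """Count node records whose state is ALIVE."""
    return sum(1 for n in nodes if n.get("state") == "ALIVE")


def wait_for_cluster(dashboard_url: str, expected: int, timeout: float = 300.0) -> bool:
    """Poll `<dashboard_url>/api/v0/nodes` until `expected` nodes are ALIVE."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{dashboard_url}/api/v0/nodes") as resp:
                payload = json.load(resp)
            # The nesting below is an assumed response shape.
            nodes = payload.get("data", {}).get("result", {}).get("result", [])
            if count_alive(nodes) >= expected:
                return True
        except OSError:
            pass  # the head node's dashboard may not be listening yet
        time.sleep(5)
    return False
```

Once `wait_for_cluster` returns, the script can safely exit: the containers keep running detached, so the cluster outlives the Slurm job step that launched it.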

Pass a custom image:

```bash
sbatch -N 4 scripts/ray.sbatch --image /fsx/ray+latest.tar.gz
```

## Examples

- [verl Training](examples/) — PPO/GRPO training on a Ray cluster

## How It Works

1. **Cluster status** — REST API (`/api/cluster_status`) for cluster-wide CPU/GPU/memory allocation
2. **Node discovery** — REST API (`/api/v0/nodes`) for per-node info and state
3. **Per-GPU metrics** — Prometheus scraping (`/api/prometheus/sd`) for real-time GPU utilization, GRAM usage
4. **Jobs & actors** — REST API (`/api/jobs/`, `/api/v0/actors`) for running jobs and actor counts per node
5. **Async** — A background `tokio` task fetches all endpoints in parallel, so the TUI never blocks
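
The fetch pattern in step 5 can be sketched with `asyncio` in place of `tokio`: every dashboard endpoint is queried concurrently, and the UI only ever reads a completed snapshot. The `fetch` function here is a stand-in stub, not raytop's actual HTTP client:

```python
# Sketch of the parallel background fetch: gather all dashboard endpoints
# concurrently into one snapshot the UI thread can render without blocking.
import asyncio


async def fetch(endpoint: str) -> dict:
    # Stand-in for an HTTP GET against the Ray dashboard; real code would
    # use an async HTTP client and parse the JSON response.
    await asyncio.sleep(0)
    return {"endpoint": endpoint}


async def snapshot() -> dict:
    """Query all endpoints in parallel and return them keyed by path."""
    endpoints = ["/api/cluster_status", "/api/v0/nodes", "/api/jobs/", "/api/v0/actors"]
    results = await asyncio.gather(*(fetch(e) for e in endpoints))
    return dict(zip(endpoints, results))
```

Because the slowest endpoint bounds the whole gather, one refresh costs roughly one round trip rather than four sequential ones.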

## License

Apache-2.0