raytop 0.1.1

A real-time TUI monitor for Ray clusters
# raytop

A real-time TUI monitor for Ray clusters — like `htop` for distributed GPU training.

<p align="center">
  <img src="images/raytop.gif" alt="rdmatop" width="800">
</p>


Monitors cluster-wide CPU/GPU/memory utilization, per-node resource usage,
per-GPU utilization via Prometheus metrics, running jobs, and live actor counts —
all from the Ray dashboard API.

## Install

Install from [crates.io](https://crates.io/crates/raytop):

```bash
cargo install raytop
```

## Usage

Point `raytop` at your Ray dashboard endpoint. The dashboard is typically
available on port 8265 of the Ray head node.

```bash
raytop --master http://<HEAD_IP>:8265
```

Use `j`/`k` or arrow keys to navigate nodes, `Enter` to open the detail panel,
`Tab` to switch focus between jobs and nodes, `t` to cycle themes, and `q` to quit.

## Build from Source

```bash
make build   # cargo build --release
make install # cargo install --path .
make fmt     # cargo fmt
make clean   # cargo clean
```

## Examples

- [verl Training]examples/ — PPO/GRPO training on a Ray cluster using FSDP2 backend

## How It Works

1. **Cluster status** — REST API (`/api/cluster_status`) for cluster-wide CPU/GPU/memory allocation
2. **Node discovery** — REST API (`/api/v0/nodes`) for per-node info and state
3. **Per-GPU metrics** — Prometheus scraping (`/api/prometheus/sd`) for real-time GPU utilization, GRAM usage
4. **Jobs & actors** — REST API (`/api/jobs/`, `/api/v0/actors`) for running jobs and actor counts per node
5. **Async** — Background `tokio` task fetches all data in parallel, TUI never blocks

## License

Apache-2.0