raytop
A real-time TUI monitor for Ray clusters — like htop for distributed GPU training.
Monitors cluster-wide CPU/GPU/memory utilization, per-node resource usage, per-GPU utilization via Prometheus metrics, running jobs, and live actor counts — all from the Ray dashboard API.
Build
# monitor a Ray cluster
Build Docker Image
Launch Ray Cluster
# or
The script starts a Ray head + workers inside Docker containers across Slurm nodes, probes until all nodes are registered, then exits. The cluster stays alive in the detached containers.
Pass a custom image:
Examples
- verl Training — PPO/GRPO training on a Ray cluster
How It Works
- Cluster status: REST API (/api/cluster_status) for cluster-wide CPU/GPU/memory allocation
- Node discovery: REST API (/api/v0/nodes) for per-node info and state
- Per-GPU metrics: Prometheus scraping (/api/prometheus/sd) for real-time GPU utilization and GRAM (GPU memory) usage
- Jobs & actors: REST API (/api/jobs/, /api/v0/actors) for running jobs and actor counts per node
- Async: a background tokio task fetches all data in parallel, so the TUI never blocks
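To make the per-GPU metrics step more concrete, here is a minimal sketch of reading GPU utilization out of a Prometheus text-format response. It is not raytop's actual code: the names `Sample`, `parse_sample`, and `gpu_utilization` are hypothetical, the metric name `ray_node_gpus_utilization` and the labels `ip` and `GpuIndex` are assumptions about what the Ray metrics endpoint exposes, and the parser deliberately ignores label values that contain commas or escaped quotes. It uses only the standard library, so the HTTP fetch and the tokio task are left out.

```rust
use std::collections::HashMap;

/// One sample parsed from the Prometheus text exposition format.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, Clone, PartialEq)]
pub struct Sample {
    pub name: String,
    pub labels: HashMap<String, String>,
    pub value: f64,
}

/// Parse one line such as
/// `ray_node_gpus_utilization{GpuIndex="0",ip="10.0.0.1"} 87.5`.
/// Returns `None` for comments, blank lines, and lines that do not parse.
/// Simplification: label values with commas or escaped quotes are not handled.
pub fn parse_sample(line: &str) -> Option<Sample> {
    let line = line.trim();
    if line.is_empty() || line.starts_with('#') {
        return None;
    }
    // Split into metric name, raw label string, and the trailing "value [timestamp]".
    let (name, label_str, tail) = match line.find('{') {
        Some(open) => {
            let close = line.rfind('}')?;
            if close < open {
                return None;
            }
            (&line[..open], &line[open + 1..close], &line[close + 1..])
        }
        None => {
            let (name, tail) = line.split_once(char::is_whitespace)?;
            (name, "", tail)
        }
    };
    let mut labels = HashMap::new();
    for pair in label_str.split(',').filter(|p| !p.trim().is_empty()) {
        let (key, raw) = pair.split_once('=')?;
        labels.insert(
            key.trim().to_string(),
            raw.trim().trim_matches('"').to_string(),
        );
    }
    // The value is the first token after the labels; a timestamp may follow it.
    let value = tail.split_whitespace().next()?.parse::<f64>().ok()?;
    Some(Sample {
        name: name.trim().to_string(),
        labels,
        value,
    })
}

/// Collect `(node ip, gpu index, value)` for every sample of `metric`,
/// sorted by ip and then by gpu index (compared as strings).
/// The label names `ip` and `GpuIndex` are assumptions, not confirmed by the README.
pub fn gpu_utilization(body: &str, metric: &str) -> Vec<(String, String, f64)> {
    let mut out: Vec<(String, String, f64)> = body
        .lines()
        .filter_map(parse_sample)
        .filter(|s| s.name == metric)
        .map(|s| {
            let ip = s.labels.get("ip").cloned().unwrap_or_default();
            let gpu = s.labels.get("GpuIndex").cloned().unwrap_or_default();
            (ip, gpu, s.value)
        })
        .collect();
    out.sort_by(|a, b| (&a.0, &a.1).cmp(&(&b.0, &b.1)));
    out
}
```

Calling `gpu_utilization(body, "ray_node_gpus_utilization")` on a scraped response body would yield one row per GPU, which a TUI could render as per-node bars. In an async setup, a function like this would run inside the background task after each fetch, so parsing never happens on the render path.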
License
Apache-2.0