Spark Dashboard
Real-time hardware and LLM inference monitoring for Linux systems with NVIDIA GPUs. Developed and tested on the NVIDIA DGX Spark, but works on any Linux host with NVIDIA drivers — discrete-GPU workstations, DGX boxes, cloud VMs. A Rust backend collects GPU, CPU, memory, disk, and network metrics alongside vLLM engine statistics and streams them over WebSocket to a React frontend.

Quick Start
Install on your Linux host
Run as your normal user on any Linux host with NVIDIA drivers (requires Rust 1.75+):
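```sh
# Crate name assumed to match the binary name used throughout this README.
cargo install spark-dashboard
sudo ~/.cargo/bin/spark-dashboard service install
```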
The dashboard is now served on port 3000. See Install on your Linux host for the full guide, config overrides, and uninstall.
Develop locally
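Copy the env template, fill in the deploy settings, then start the dev loop (the invocation below assumes the script is run from the repo root):

```sh
cp .env.example .env   # set DEPLOY_USER and DEPLOY_HOST first (see Configuration)
./dev/dev.sh
```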
Open http://localhost:5173 in your browser. See dev/README.md
for details on what each script does.
Features
Hardware Monitoring (1s polling via NVML, sysinfo, procfs)
- GPU utilization, temperature, power draw, clock frequencies, fan speed
- GPU event detection — thermal throttling, hardware slowdown, power brake
- CPU aggregate and per-core utilization with heatmap
- Memory breakdown — CPU RAM and GPU VRAM separately on discrete-GPU hosts, or a single unified pool on systems where CPU and GPU share memory (e.g. DGX Spark GB10, GH200)
- Disk and network I/O throughput
LLM Engine Monitoring (vLLM via Prometheus metrics)
- Tokens per second (generation + prompt)
- Time to first token, inter-token latency, end-to-end latency, queue time
- Active/queued requests, batch size
- KV cache utilization, prefix cache hit rate
- Automatic engine discovery via process scan and Docker API
- SLO goodput (throughput counting only requests that meet their latency targets)
Multi-Engine Support
- Run and monitor any number of inference engines side by side — each vLLM process or container is detected automatically and gets its own tab
- All Engines overview tab with running-engine count, summed throughput, and weighted-mean latencies across every detected engine
- Per-engine drill-down tabs with provider chips showing the engine type (vLLM) and source (Docker / host process)
- Auto-rotating tab carousel with a configurable interval (default 3 s); rotation pauses automatically the moment you interact with a tab so you can focus on a single engine without fighting the UI
Dashboard
- Arc gauges, time-series charts, sparklines, per-core heatmap
- 15-minute rolling history with circular buffers
- Connection status badge, staleness detection, auto-reconnect
Architecture
```
┌──────────────────────┐        WebSocket (JSON)         ┌────────────────────┐
│     Rust Backend     │ ──────────────────────────────▶ │   React Frontend   │
│                      │                                 │                    │
│  Tokio tasks:        │                                 │  useMetrics hook   │
│  ├─ metrics_collector│ broadcast channel (capacity 16) │  ├─ WebSocket conn │
│  │  (GPU/CPU/mem/…) ─┼──▶ tx ──▶ ws_handler ──▶ client │  ├─ batch flush 2s │
│  └─ engine_collector │                                 │  └─ circular bufs  │
│     (vLLM/Docker)    │                                 │                    │
│                      │    Static files (rust-embed)    │  Recharts, Tailwind│
│  Axum router         │ ◀──── production only ───────── │  shadcn/ui         │
└──────────────────────┘                                 └────────────────────┘
  Linux host (e.g. DGX Spark)                              Browser
```
Two independent Tokio tasks run in parallel — one for hardware metrics (NVML,
sysinfo, procfs) and one for engine detection/polling. Both feed into a
broadcast channel that fans out to all connected WebSocket clients. In
production the frontend is embedded in the binary via rust-embed; in
development, Vite serves the frontend locally and proxies API/WebSocket
traffic to the remote backend.
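A minimal sketch of that fan-out with Tokio's broadcast channel, under simplified names (`Snapshot` stands in for the real snapshot type; the actual collector and handler bodies differ):

```rust
use std::time::Duration;
use tokio::sync::broadcast;

// Simplified stand-in for the backend's snapshot type.
#[derive(Clone, Debug)]
struct Snapshot {
    gpu_util: f32,
}

#[tokio::main]
async fn main() {
    // Capacity 16: a slow client lags (drops old snapshots) instead of
    // blocking the collectors.
    let (tx, _) = broadcast::channel::<Snapshot>(16);

    // Stand-in for metrics_collector; engine_collector clones tx the same way.
    let producer = tx.clone();
    tokio::spawn(async move {
        loop {
            let _ = producer.send(Snapshot { gpu_util: 0.42 });
            tokio::time::sleep(Duration::from_secs(1)).await;
        }
    });

    // Each connected WebSocket client subscribes for its own receiver.
    let mut rx = tx.subscribe();
    for _ in 0..3 {
        if let Ok(snap) = rx.recv().await {
            println!("would serialize to JSON and push to the client: {snap:?}");
        }
    }
}
```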
Configuration
All operator config lives in a repo-root .env file. Copy the template and
edit:
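```sh
cp .env.example .env
```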
| Variable | Purpose |
|---|---|
| `DEPLOY_USER` | SSH user on the remote host (required) |
| `DEPLOY_HOST` | Hostname or IP of the remote host (required) |
| `DEPLOY_DIR` | Project path on the remote host, relative to remote home (default `spark-dashboard`) |
| `VITE_BACKEND_URL` | Where Vite proxies `/ws` and `/api` (default `http://localhost:3000`) |
Legacy SPARK_USER / SPARK_HOST / SPARK_DIR are still accepted as a
fallback when DEPLOY_* are unset — dev.sh prints a one-line deprecation
note. The scripts in dev/ source this file; Vite picks up VITE_* variables
automatically. .env is gitignored — never commit it.
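A filled-in example (every value here is a placeholder):

```sh
DEPLOY_USER=jane
DEPLOY_HOST=192.168.1.50
DEPLOY_DIR=spark-dashboard
VITE_BACKEND_URL=http://localhost:3000
```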
Install on your Linux host
The dashboard runs as a supervised systemd service. Two install paths; both
build from source on the host.
Option A — via cargo (recommended)
```sh
# On the host. Requires Rust 1.75+, NVIDIA drivers, and internet access.
# Crate name assumed to match the binary.
cargo install spark-dashboard
sudo ~/.cargo/bin/spark-dashboard service install
```
`cargo install` pulls the crate from crates.io and compiles it locally. `service install` copies the binary to `/usr/local/bin`, creates a locked-down `spark-dashboard` system user (added to the `video`, `render`, and `docker` groups for NVML and Docker access), writes the systemd unit, and enables it.
Why the explicit `~/.cargo/bin/` path? `cargo install` drops the binary in `~/.cargo/bin`, which isn't on sudo's sanitized `secure_path` and isn't always on the user's interactive `PATH` either (depends on how Rust was installed). Passing the absolute path makes the command work regardless. After `service install` copies the binary to `/usr/local/bin`, subsequent `sudo spark-dashboard …` calls (e.g. `service status`, `service uninstall`) resolve normally.
Option B — from a local checkout
Use this when you want to install without crates.io: to audit the source, install on an air-gapped host, or deploy an unreleased commit.
On the host, run the repo's install script as your normal user; it escalates to sudo only for the systemd wiring step.
This builds the frontend (`npm run build`) and the Rust binary (`cargo build --release`), then hands off to the same `service install` logic as Option A. You'll be prompted for your sudo password once, when the service is installed.
Managing the service
Optional overrides live in /etc/spark-dashboard/config.env — set
SPARK_DASHBOARD_PORT, SPARK_DASHBOARD_BIND, SPARK_DASHBOARD_POLL_INTERVAL,
SPARK_DASHBOARD_GPU_INDEX, or RUST_LOG, then
sudo systemctl restart spark-dashboard.
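For example (illustrative values; set any subset of the variables above):

```sh
# /etc/spark-dashboard/config.env
SPARK_DASHBOARD_PORT=8080
SPARK_DASHBOARD_POLL_INTERVAL=2000
RUST_LOG=debug
```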
Upgrade
For Option A, re-run `cargo install spark-dashboard` followed by `sudo ~/.cargo/bin/spark-dashboard service install`. For Option B, pull the latest commit in your checkout and re-run the install script.
Re-running service install is idempotent: it stops the service, swaps the
binary, and starts it again, preserving /etc/spark-dashboard/config.env.
Uninstall
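Uninstall via the `service` subcommand (see CLI options below); the sketch assumes `--purge` additionally clears `/etc/spark-dashboard`:

```sh
sudo spark-dashboard service uninstall           # remove the service and binary
sudo spark-dashboard service uninstall --purge   # assumed: also remove config
```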
CLI options
```
spark-dashboard [OPTIONS]                              run the server (default)
spark-dashboard service install [--prefix /usr/local]
spark-dashboard service uninstall [--purge]
spark-dashboard service status

-p, --port <PORT>         Listen port [default: 3000] [env: SPARK_DASHBOARD_PORT]
-b, --bind <BIND>         Bind address [default: 0.0.0.0] [env: SPARK_DASHBOARD_BIND]
    --poll-interval <MS>  Polling interval ms [default: 1000] [env: SPARK_DASHBOARD_POLL_INTERVAL]
    --gpu-index <IDX>     NVML GPU index to monitor [default: 0] [env: SPARK_DASHBOARD_GPU_INDEX]
    --engine <TYPE>       Manual engine type (e.g. vllm)
    --engine-url <URL>    Manual engine endpoint (requires --engine)
```
On multi-GPU hosts use --gpu-index to select which device the dashboard
monitors. Engines are auto-detected via process scan and Docker API. Use
--engine and --engine-url to override when auto-detection doesn't work.
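For example, pointing the dashboard at a vLLM server it didn't auto-detect (the exact URL shape expected by `--engine-url` is an assumption; vLLM exposes Prometheus metrics under `/metrics` on its API port, 8000 by default):

```sh
spark-dashboard --engine vllm --engine-url http://localhost:8000/metrics
```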
Development
Prerequisites
- Local machine (macOS or Linux): Node.js 20+, npm, rsync, ssh
- Remote host: Linux + NVIDIA drivers, Rust 1.75+, SSH access with key-based auth (no password prompts)
- Optional: `brew install fswatch` for instant file-change detection (the watcher falls back to 2s polling without it)
Running the dev environment
The `dev/dev.sh` script handles everything:
- Syncs the full project to the remote host via rsync
- Builds the Rust backend on the remote host (`cargo build --release`)
- Starts the backend on the remote host (port 3000)
- Starts the Vite dev server locally (port 5173)
- Watches `src/` and `Cargo.toml` for Rust changes — auto-syncs and rebuilds on the remote host
| What you edit | What happens |
|---|---|
| Frontend files (`frontend/src/`) | Vite hot-reloads instantly in the browser |
| Backend files (`src/`, `Cargo.toml`) | Auto-detected → rsync to remote host → rebuild → restart (~compile time) |
While `dev.sh` is running, you can tail the backend logs in another terminal; press Ctrl+C in the `dev.sh` terminal to stop everything (it cleans up the remote process too).
How the proxy works
By default, Vite proxies /ws and /api to localhost:3000 — this works out
of the box with any SSH tunnel that maps the remote host's port 3000 to your
local machine.
```
Browser → localhost:5173/ws  → Vite proxy → localhost:3000/ws  (forwarded to remote)
Browser → localhost:5173/api → Vite proxy → localhost:3000/api (forwarded to remote)
```
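A plain SSH local port-forward is enough, e.g. using the DEPLOY_* values from `.env`:

```sh
ssh -L 3000:localhost:3000 "$DEPLOY_USER@$DEPLOY_HOST"
```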
To connect directly over the network instead, set in `.env`:

```sh
VITE_BACKEND_URL=http://<remote-host>:3000
```
The frontend connects to the WebSocket using window.location.host, so the
proxy is transparent — no code changes between dev and production.
Releases
Releases are cut from main via release-please: conventional commits drive the version bump, and merging the release PR tags `vX.Y.Z` and triggers `cargo publish` to crates.io. main always reflects the latest stable version; see CHANGELOG.md for release notes.
Testing
```sh
# Frontend (script name assumed; see frontend/package.json)
cd frontend && npm test

# Backend (on Linux)
cargo test
```
Backend tests include platform-aware stubs — GPU and memory tests validate real NVML/procfs parsing on Linux, with compile-time stubs on other platforms.
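The stub pattern is compile-time `cfg`, roughly like this (a minimal sketch with illustrative names, not the project's actual code):

```rust
// Platform-aware stub: the real build queries NVML, other platforms compile a no-op.
#[cfg(target_os = "linux")]
fn gpu_temperature_c() -> Option<u32> {
    // On Linux the real implementation would query NVML here.
    None
}

#[cfg(not(target_os = "linux"))]
fn gpu_temperature_c() -> Option<u32> {
    // Stub so the crate and its tests still compile on macOS/Windows.
    None
}

fn main() {
    println!("GPU temp: {:?}", gpu_temperature_c());
}
```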
Project Structure
```
├── src/
│   ├── main.rs             CLI args, task spawning, server startup
│   ├── server.rs           Axum router, static file serving
│   ├── ws.rs               WebSocket handler
│   ├── metrics/
│   │   ├── mod.rs          MetricsSnapshot, collector loop
│   │   ├── gpu.rs          NVML GPU metrics + event detection
│   │   ├── cpu.rs          CPU aggregate + per-core
│   │   ├── memory.rs       System RAM + GPU VRAM + unified-memory detection
│   │   ├── disk.rs         Disk I/O rates
│   │   └── network.rs      Network I/O rates
│   └── engines/
│       ├── mod.rs          Engine trait, state machine, collector
│       ├── detector.rs     Process scan + Docker discovery
│       ├── vllm.rs         vLLM adapter (Prometheus parsing)
│       └── prometheus.rs   Prometheus text-format parser
├── frontend/
│   └── src/
│       ├── hooks/          useMetrics, useMetricsHistory
│       ├── components/
│       │   ├── views/      Dashboard, GlanceableView, DetailedView
│       │   ├── engines/    EngineSection, EngineCard
│       │   ├── charts/     TimeSeriesChart, Sparkline, CoreHeatmap
│       │   └── gauges/     ArcGauge
│       ├── types/          TypeScript type definitions
│       └── lib/            Circular buffer, formatting, theme
├── dev/
│   ├── dev.sh              Dev loop (local frontend + remote backend)
│   └── README.md           Operator docs
├── .env.example            Configuration template
├── LICENSE                 MIT
├── CONTRIBUTING.md
└── Cargo.toml
```
Contributing
See CONTRIBUTING.md.
License
MIT — see LICENSE.