# all-smi
`all-smi` is a command-line utility for monitoring GPU hardware across multiple systems. It provides a real-time view of GPU utilization, memory usage, temperature, power consumption, and other metrics. The tool is designed to be a cross-platform alternative to `nvidia-smi`, with support for NVIDIA GPUs, Apple Silicon GPUs, and NVIDIA Jetson platforms.
The application presents a terminal-based user interface with cluster overview, interactive sorting, and both local and remote monitoring capabilities. It also provides an API mode for Prometheus metrics integration.

## Features
### GPU Monitoring
- **Real-time Metrics:** Displays comprehensive GPU information including:
- GPU Name and Driver Version
- Utilization Percentage with color-coded status
- Memory Usage (Used/Total in GB)
- Temperature in Celsius
- Clock Frequency in MHz
- Power Consumption in Watts
- **Multi-GPU Support:** Handles multiple GPUs per system with individual monitoring
- **Interactive Sorting:** Sort GPUs by utilization, memory usage, or default (hostname+index) order
### Cluster Management
- **Cluster Overview Dashboard:** Real-time statistics showing:
- Total nodes and GPUs across the cluster
- Average utilization and memory usage
- Temperature statistics with standard deviation
- Total and average power consumption
- **Live Statistics History:** Visual graphs showing utilization, memory, and temperature trends
- **Tabbed Interface:** Switch between "All" view and individual host tabs
### Process Information
- **GPU Process Monitoring:** Lists processes running on GPUs with:
- Process ID (PID) and Parent PID
- Process Name and Command Line
- GPU Memory Usage
- User and State Information
- **Interactive Sorting:** Sort processes by PID or memory usage
- **System Integration:** Full process details from system information
### Cross-Platform Support
- **Linux:** Supports NVIDIA GPUs via `nvidia-smi` command
- **macOS:** Supports Apple Silicon GPUs via `powermetrics` and Metal framework
- **NVIDIA Jetson:** Special support for Tegra-based systems with DLA (Deep Learning Accelerator)
### Remote Monitoring
- **Multi-Host Support:** Monitor up to 128+ remote systems simultaneously
- **Connection Management:** Optimized networking with connection pooling and retry logic
- **Storage Monitoring:** Disk usage information for remote hosts
- **High Availability:** Resilient to connection failures with automatic retry
### Interactive UI
- **Keyboard Controls:**
- Navigation: Arrow keys, Page Up/Down for scrolling
- Sorting: 'd' (default), 'u' (utilization), 'g' (GPU memory), 'p' (PID), 'm' (memory)
- Interface: '1' or 'h' (help), 'q' (quit), Tab switching
- **Color-Coded Status:** Green (≤60%), Yellow (60-80%), Red (>80%) for resource usage
- **Responsive Design:** Adapts to terminal size with optimized space allocation
- **Help System:** Comprehensive built-in help with context-sensitive shortcuts
## Technology Stack
- **Language:** Rust 2021 Edition
- **Async Runtime:** Tokio for high-performance networking
- **Key Dependencies:**
- `crossterm`: Terminal manipulation and UI
- `axum`: Web framework for API mode
- `reqwest`: HTTP client for remote monitoring
- `chrono`: Date/time handling
- `clap`: Command-line argument parsing
- `serde`: Serialization for data exchange
- `metal`/`objc`: Apple Silicon GPU integration on macOS
- `sysinfo`: System information gathering
## Installation
### Prerequisites
- **Rust:** Version 1.75 or later with Cargo
- **Linux (NVIDIA):** `nvidia-smi` command must be available
- **macOS:** Requires `sudo` privileges for `powermetrics` access
- **Network:** For remote monitoring functionality
### Building from Source
1. **Clone the repository:**
```bash
git clone https://github.com/inureyes/all-smi.git
cd all-smi
```
2. **Build the project:**
```bash
cargo build --release
cargo build --release --bin mock-server
```
3. **Run tests:**
```bash
cargo test
cargo clippy
cargo fmt --check
```
## Usage
### Command Overview
```bash
# Show help
./target/release/all-smi --help
# Local monitoring (requires sudo on macOS)
sudo ./target/release/all-smi view
# Remote monitoring
./target/release/all-smi view --hosts http://node1:9090 http://node2:9090
./target/release/all-smi view --hostfile hosts.csv
# API mode
./target/release/all-smi api --port 9090
```
### View Mode (Interactive Monitoring)
The `view` mode provides a terminal-based interface with real-time updates.
#### Local Mode
```bash
# Monitor local GPUs (requires sudo on macOS)
sudo ./target/release/all-smi view
# With custom refresh interval
sudo ./target/release/all-smi view --interval 5
```
#### Remote Monitoring
Monitor multiple remote systems running in API mode:
```bash
# Direct host specification
./target/release/all-smi view --hosts http://gpu-node1:9090 http://gpu-node2:9090
# Using host file
./target/release/all-smi view --hostfile hosts.csv --interval 2
```
Host file format (CSV):
```
http://gpu-node1:9090
http://gpu-node2:9090
http://gpu-node3:9090
```
#### Keyboard Controls
- **Navigation:** ←→ (switch tabs), ↑↓ (scroll), PgUp/PgDn (page navigation)
- **Sorting:** 'd' (default), 'u' (utilization), 'g' (GPU memory), 'p' (PID), 'm' (memory)
- **Interface:** '1'/'h' (help), 'q' (quit), ESC (close help)
### API Mode (Prometheus Metrics)
Expose GPU metrics in Prometheus format for integration with monitoring systems:
```bash
# Start API server
./target/release/all-smi api --port 9090
# Custom bind address
./target/release/all-smi api --port 8080 --bind 0.0.0.0
```
Metrics available at `http://localhost:9090/metrics` include:
- `all_smi_gpu_utilization`
- `all_smi_gpu_memory_used_bytes`
- `all_smi_gpu_memory_total_bytes`
- `all_smi_gpu_temperature_celsius`
- `all_smi_gpu_power_consumption_watts`
- `all_smi_disk_total_bytes`
- `all_smi_disk_available_bytes`
## Development and Testing
### Mock Server for Testing
The included mock server simulates realistic GPU clusters for development and testing:
```bash
# Build mock server
cargo build --release --bin mock-server
# Start single mock instance
./target/release/mock-server --port-range 9090
# Start multiple instances
./target/release/mock-server --port-range 10001-10010 -o hosts.csv
# Custom configuration
./target/release/mock-server --port-range 10001-10005 \
--gpu-name "NVIDIA H100 80GB HBM3" -o test-hosts.csv
```
Mock server features:
- **8 GPUs per node** with realistic metrics
- **Randomized values** that change over time
- **Storage simulation** with various disk sizes
- **Template-based responses** for performance
- **Instance naming** with node-XXXX format
### Testing High-Scale Scenarios
```bash
# Start 128 mock servers
./target/release/mock-server --port-range 10001-10128 -o large-cluster.csv &
# Monitor large cluster
./target/release/all-smi view --hostfile large-cluster.csv --interval 1
```
## Architecture
### Core Components
- **GPU Abstraction Layer:** Platform-specific readers implementing the `GpuReader` trait
- **Async Networking:** Concurrent remote data collection with connection pooling
- **Terminal UI:** Double-buffered rendering with responsive layout
- **Data Processing:** Real-time metrics aggregation and historical tracking
### Platform Support
- **NVIDIA GPUs:** Via `nvidia-smi` command parsing
- **Apple Silicon:** Via `powermetrics` and Metal framework integration
- **NVIDIA Jetson:** Specialized Tegra platform support with DLA monitoring
### Performance Optimizations
- **Connection Management:** 64 concurrent connections with retry logic
- **Adaptive Intervals:** 2-6 second refresh based on cluster size
- **Memory Efficiency:** Stream processing and connection pooling
- **Rendering:** Double buffering to prevent display flickering
## Contributing
Contributions are welcome! Areas for contribution include:
- **Platform Support:** Additional GPU vendors or operating systems
- **Features:** New metrics, visualization improvements, or monitoring capabilities
- **Performance:** Optimization for larger clusters or resource usage
- **Documentation:** Examples, tutorials, or API documentation
Please submit pull requests or open issues for bugs, feature requests, or questions.
## License
This project is licensed under the MIT License. See the LICENSE file for details.
## Changelog
### Recent Updates
- **v0.3.0:** Multi-architecture support, optimized space allocation, enhanced UI
- **v0.2.2:** GPU sorting functionality with hotkeys
- **v0.2.1:** Help system improvements and code refactoring
- **v0.2.0:** Remote monitoring and cluster management features
- **v0.1.0:** Initial release with local GPU monitoring