all-smi
all-smi is a command-line utility for monitoring GPU hardware across multiple systems. It provides a real-time view of GPU utilization, memory usage, temperature, power consumption, and other metrics. The tool is designed to be a cross-platform alternative to nvidia-smi, with support for NVIDIA GPUs, Apple Silicon GPUs, and NVIDIA Jetson platforms.
The application presents a terminal-based user interface with cluster overview, interactive sorting, and both local and remote monitoring capabilities. It also provides an API mode for Prometheus metrics integration.


Features
GPU Monitoring
- Real-time Metrics: Displays comprehensive GPU information including:
  - GPU Name and Driver Version
  - Utilization Percentage with color-coded status
  - Memory Usage (Used/Total in GB)
  - Temperature in Celsius
  - Clock Frequency in MHz
  - Power Consumption in Watts
- Multi-GPU Support: Handles multiple GPUs per system with individual monitoring
- Interactive Sorting: Sort GPUs by utilization, memory usage, or default (hostname+index) order
Cluster Management
- Cluster Overview Dashboard: Real-time statistics showing:
  - Total nodes and GPUs across the cluster
  - Average utilization and memory usage
  - Temperature statistics with standard deviation
  - Total and average power consumption
- Live Statistics History: Visual graphs showing utilization, memory, and temperature trends
- Tabbed Interface: Switch between "All" view and individual host tabs
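The temperature statistics above reduce to a mean and a standard deviation computed over every GPU in the cluster. A minimal sketch of that aggregation (illustrative only, not all-smi's internal code; the sample temperatures are hypothetical):

```rust
/// Mean and population standard deviation of a set of samples.
fn mean_and_std(samples: &[f64]) -> (f64, f64) {
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    let var = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    (mean, var.sqrt())
}

fn main() {
    // Hypothetical per-GPU temperatures (Celsius) across a small cluster.
    let temps = [62.0, 58.0, 71.0, 65.0];
    let (mean, std) = mean_and_std(&temps);
    println!("avg {:.1} C, stddev {:.2}", mean, std);
}
```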
Process Information
- GPU Process Monitoring: Lists processes running on GPUs with:
  - Process ID (PID) and Parent PID
  - Process Name and Command Line
  - GPU Memory Usage
  - User and State Information
- Interactive Sorting: Sort processes by PID or memory usage
- System Integration: Full process details from system information
Cross-Platform Support
- Linux: Supports NVIDIA GPUs via NVML, with the `nvidia-smi` command as a fallback
- macOS: Supports Apple Silicon GPUs via `powermetrics` and the Metal framework
- NVIDIA Jetson: Special support for Tegra-based systems with DLA (Deep Learning Accelerator)
Remote Monitoring
- Multi-Host Support: Monitor 256+ remote systems simultaneously
- Connection Management: Optimized networking with connection pooling and retry logic
  - Supports up to 128 concurrent connections
  - Automatic retry with exponential backoff
  - TCP keepalive for persistent connections
- Storage Monitoring: Disk usage information for remote hosts
- High Availability: Resilient to connection failures with automatic retry
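Exponential backoff for retries can be sketched as follows; the base delay and cap chosen here are illustrative assumptions, not all-smi's actual settings:

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): base * 2^attempt,
/// capped so repeated failures never wait unboundedly long.
/// Base and cap values are illustrative, not all-smi's settings.
fn backoff_delay(attempt: u32, base_ms: u64, cap_ms: u64) -> Duration {
    let exp = attempt.min(16); // avoid shift overflow on large attempt counts
    Duration::from_millis(base_ms.saturating_mul(1u64 << exp).min(cap_ms))
}

fn main() {
    for attempt in 0..5 {
        println!("retry {attempt}: wait {:?}", backoff_delay(attempt, 100, 5_000));
    }
}
```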
Interactive UI
- Keyboard Controls:
  - Navigation: Arrow keys, Page Up/Down for scrolling
  - Sorting: 'd' (default), 'u' (utilization), 'g' (GPU memory), 'p' (PID), 'm' (memory)
  - Interface: '1' or 'h' (help), 'q' (quit), Tab switching
- Color-Coded Status: Green (≤60%), Yellow (60-80%), Red (>80%) for resource usage
- Responsive Design: Adapts to terminal size with optimized space allocation
- Help System: Comprehensive built-in help with context-sensitive shortcuts
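The color coding above maps a resource-usage percentage to one of three states. A sketch of that mapping (how exactly the 60% boundary is treated is an assumption):

```rust
/// Traffic-light status for a usage percentage, following the
/// thresholds listed above (handling of exactly 60% is an assumption).
#[derive(Debug, PartialEq)]
enum Status {
    Green,
    Yellow,
    Red,
}

fn status_for(percent: f64) -> Status {
    if percent > 80.0 {
        Status::Red
    } else if percent > 60.0 {
        Status::Yellow
    } else {
        Status::Green
    }
}

fn main() {
    for p in [45.0, 72.0, 93.0] {
        println!("{p:>5.1}% -> {:?}", status_for(p));
    }
}
```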
Technology Stack
- Language: Rust 2021 Edition
- Async Runtime: Tokio for high-performance networking
- Key Dependencies:
  - `crossterm`: Terminal manipulation and UI
  - `axum`: Web framework for API mode
  - `reqwest`: HTTP client for remote monitoring
  - `chrono`: Date/time handling
  - `clap`: Command-line argument parsing
  - `serde`: Serialization for data exchange
  - `metal`/`objc`: Apple Silicon GPU integration on macOS
  - `sysinfo`: System information gathering
Installation
Option 1: Install via Homebrew (macOS/Linux)
The easiest way to install all-smi on macOS and Linux is through Homebrew: add the project's tap, then `brew install all-smi`.
Option 2: Install from Cargo
Install all-smi through Cargo with `cargo install all-smi`.
After installation, the binary will be available in your $PATH as all-smi.
Option 3: Download Pre-built Binary
Download the latest release from the GitHub releases page:
- Go to https://github.com/inureyes/all-smi/releases
- Download the appropriate binary for your platform
- Extract the archive and place the binary in your `$PATH`
Option 4: Build from Source
Prerequisites
- Rust: Version 1.75 or later with Cargo
- Linux (NVIDIA): CUDA, and the `nvidia-smi` command must be available
- macOS: Requires `sudo` privileges for `powermetrics` access
- Network: For remote monitoring functionality
Building from Source
- Clone the repository:

```bash
git clone https://github.com/inureyes/all-smi.git
cd all-smi
```

- Build the project:

```bash
# Build the main application
cargo build --release

# Build mock server for testing
cargo build --release --features mock
```

- Run tests:

```bash
cargo test
```
Usage
Command Overview
```bash
# Show help
all-smi --help

# Local monitoring (requires sudo on macOS)
sudo all-smi local

# Remote monitoring
all-smi view --hostfile hosts.csv

# API mode
all-smi api --port 9090
```
Quick Start with Make Commands
For development and testing, you can use the provided Makefile, which includes targets to run local monitoring, run remote monitoring with a hosts file, start the mock server for testing, build a release version, and run the tests.
View Mode (Interactive Monitoring)
The view mode provides a terminal-based interface with real-time updates.
Local Mode
```bash
# Monitor local GPUs (requires sudo on macOS)
sudo all-smi local

# With custom refresh interval (seconds)
sudo all-smi local --interval 5
```
Remote Monitoring
Monitor multiple remote systems running in API mode:
```bash
# Direct host specification
all-smi view --hosts http://gpu-node1:9090 http://gpu-node2:9090

# Using host file
all-smi view --hostfile hosts.csv
```
Host file format (CSV):
```
http://gpu-node1:9090
http://gpu-node2:9090
http://gpu-node3:9090
```
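Reading such a host file amounts to collecting the non-empty lines; a minimal sketch (whether comment lines are permitted is an assumption, not documented behavior):

```rust
/// Collect host endpoints from a hosts file: one URL per line,
/// skipping blank lines and (assumed) `#` comment lines.
fn parse_hosts(contents: &str) -> Vec<String> {
    contents
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty() && !line.starts_with('#'))
        .map(String::from)
        .collect()
}

fn main() {
    let file = "http://gpu-node1:9090\nhttp://gpu-node2:9090\n\nhttp://gpu-node3:9090\n";
    println!("{:?}", parse_hosts(file));
}
```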
Keyboard Controls
- Navigation: ←→ (switch tabs), ↑↓ (scroll), PgUp/PgDn (page navigation)
- Sorting: 'd' (default), 'u' (utilization), 'g' (GPU memory), 'p' (PID), 'm' (memory)
- Interface: '1'/'h' (help), 'q' (quit), ESC (close help)
API Mode (Prometheus Metrics)
Expose GPU metrics in Prometheus format for integration with monitoring systems:
```bash
# Start API server
all-smi api --port 9090

# Custom bind address (exact flag name: see `all-smi api --help`)
all-smi api --port 9090 --bind 0.0.0.0
```
Metrics available at http://localhost:9090/metrics include:
GPU Metrics:
- `all_smi_gpu_utilization`
- `all_smi_gpu_memory_used_bytes`
- `all_smi_gpu_memory_total_bytes`
- `all_smi_gpu_temperature_celsius`
- `all_smi_gpu_power_consumption_watts`
- `all_smi_gpu_frequency_mhz`
CPU Metrics:
- `all_smi_cpu_utilization`
- `all_smi_cpu_socket_count`
- `all_smi_cpu_core_count`
- `all_smi_cpu_thread_count`
- `all_smi_cpu_frequency_mhz`
- `all_smi_cpu_temperature_celsius`
- `all_smi_cpu_power_consumption_watts`
- `all_smi_cpu_socket_utilization` (per-socket for multi-socket systems)
Apple Silicon Specific:
- `all_smi_cpu_p_core_count`
- `all_smi_cpu_e_core_count`
- `all_smi_cpu_gpu_core_count`
- `all_smi_cpu_p_core_utilization`
- `all_smi_cpu_e_core_utilization`
Storage Metrics:
- `all_smi_disk_total_bytes`
- `all_smi_disk_available_bytes`
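A scrape of the `/metrics` endpoint returns standard Prometheus text format. A minimal sketch of pulling (name, value) pairs out of such a response, using metric names from the lists above with hypothetical sample values:

```rust
/// Parse Prometheus text-format lines into (metric name, value) pairs,
/// dropping comments and labels for brevity. Sketch only.
fn parse_metrics(body: &str) -> Vec<(String, f64)> {
    body.lines()
        .filter(|line| !line.starts_with('#') && !line.trim().is_empty())
        .filter_map(|line| {
            // Value is the last space-separated token; labels sit in `{...}`.
            let (name_part, value) = line.rsplit_once(' ')?;
            let name = name_part.split('{').next()?.to_string();
            Some((name, value.parse().ok()?))
        })
        .collect()
}

fn main() {
    let body = "# HELP all_smi_gpu_utilization GPU utilization (%)\n\
                all_smi_gpu_utilization{gpu=\"0\"} 87.5\n\
                all_smi_gpu_temperature_celsius{gpu=\"0\"} 64\n";
    for (name, value) in parse_metrics(body) {
        println!("{name} = {value}");
    }
}
```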
Development and Testing
Mock Server for Testing
The included mock server simulates realistic GPU and CPU clusters for development and testing.
Basic Usage
Build the mock server with the `mock` Cargo feature, then start one or more instances. Options cover starting multiple instances at once, custom GPU configuration, the starting index for node naming, and simulating node failures for testing.
Platform-Specific Testing
Test different hardware platforms with realistic CPU and GPU metrics:
- NVIDIA GPU servers (default): Intel/AMD CPUs with NVIDIA GPUs
- Apple Silicon systems: M1/M2/M3 with P/E cores
- Intel CPU servers
- AMD CPU servers
- NVIDIA Jetson platforms
Platform-Specific Features
- NVIDIA Platform: Multi-socket Intel/AMD CPUs with NVIDIA GPUs
- Apple Silicon: P-core/E-core CPU monitoring with integrated GPU metrics
- Intel Platform: Intel Xeon processors with hyperthreading
- AMD Platform: AMD EPYC/Ryzen processors with SMT
- Jetson Platform: ARM-based Tegra processors with integrated GPUs
Mock server features:
- 8 GPUs per node with realistic metrics
- Platform-specific CPU metrics (socket count, core types, utilization)
- Randomized values that change over time
- Storage simulation with various disk sizes (1TB/4TB/12TB)
- Template-based responses for performance
- Instance naming with node-XXXX format (customizable with --start-index)
- Failure simulation for testing high availability (--failure-nodes)
- Automatic node naming prevents duplicates across multiple instances
Testing High-Scale Scenarios
Using the Mock Cluster Script (Recommended)
For large deployments, use the included start-mock-cluster.sh script:
Run `./start-mock-cluster.sh` to start 200 mock servers across 4 processes (50 ports each); the script also supports custom configurations and stopping all mock servers.
The script automatically:
- Calculates optimal process distribution
- Sets proper file descriptor limits
- Prevents duplicate node names across processes
- Combines all host files into a single `hosts.csv`
Manual High-Scale Testing
To do this manually, start 128 mock server processes (backgrounded, one port each) and point all-smi at the generated host list to monitor the large cluster. Mixed platform environments can be tested by launching several mock instances with different platform settings, then combining their host files and monitoring the combined list.
Architecture
Core Components
- GPU Abstraction Layer: Platform-specific readers implementing the `GpuReader` trait
- Async Networking: Concurrent remote data collection with connection pooling
- Terminal UI: Double-buffered rendering with responsive layout
- Data Processing: Real-time metrics aggregation and historical tracking
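The abstraction layer can be pictured as follows. The `GpuReader` trait name comes from this README, but the fields and method signature shown are assumptions for illustration, not the crate's actual API:

```rust
/// Illustrative per-GPU snapshot; field names are assumptions.
struct GpuInfo {
    name: String,
    utilization: f64,  // percent
    memory_used: u64,  // bytes
    memory_total: u64, // bytes
}

/// Hypothetical shape of the platform abstraction.
trait GpuReader {
    fn get_gpu_info(&self) -> Vec<GpuInfo>;
}

// Each platform supplies its own implementation (NVML on Linux,
// powermetrics on macOS, ...); a mock reader stands in here.
struct MockReader;

impl GpuReader for MockReader {
    fn get_gpu_info(&self) -> Vec<GpuInfo> {
        vec![GpuInfo {
            name: "Mock GPU".into(),
            utilization: 42.0,
            memory_used: 8 << 30,
            memory_total: 16 << 30,
        }]
    }
}

fn main() {
    let reader: Box<dyn GpuReader> = Box::new(MockReader);
    for gpu in reader.get_gpu_info() {
        println!("{}: {:.0}% util, {}/{} GiB", gpu.name, gpu.utilization,
                 gpu.memory_used >> 30, gpu.memory_total >> 30);
    }
}
```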
Platform Support
- NVIDIA GPUs: Via direct NVML queries (default) with `nvidia-smi` command parsing as a fallback
- Apple Silicon: Via `powermetrics` and Metal framework integration
- NVIDIA Jetson: Specialized Tegra platform support with DLA monitoring
Performance Optimizations
- Connection Management: 128 concurrent connections with retry logic
- Adaptive Intervals: 2-6 second refresh based on cluster size
- Memory Efficiency: Stream processing and connection pooling
- Rendering: Double buffering to prevent display flickering
- File Descriptor Management: Automatic handling for large-scale deployments
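The adaptive interval can be pictured as a step function of cluster size. The README states a 2-6 second range; the breakpoints below are assumptions for illustration:

```rust
/// Illustrative mapping from cluster size to refresh interval.
/// The 2-6 s range comes from the README; breakpoints are assumed.
fn refresh_interval_secs(host_count: usize) -> u64 {
    match host_count {
        0..=10 => 2,
        11..=50 => 3,
        51..=100 => 4,
        101..=200 => 5,
        _ => 6,
    }
}

fn main() {
    for n in [1, 32, 64, 150, 256] {
        println!("{n} hosts -> refresh every {}s", refresh_interval_secs(n));
    }
}
```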
Contributing
Contributions are welcome! Areas for contribution include:
- Platform Support: Additional GPU vendors or operating systems
- Features: New metrics, visualization improvements, or monitoring capabilities
- Performance: Optimization for larger clusters or resource usage
- Documentation: Examples, tutorials, or API documentation
Please submit pull requests or open issues for bugs, feature requests, or questions.
License
This project is licensed under the MIT License. See the LICENSE file for details.
Changelog
Recent Updates
- v0.4.2 (2025/07/12): Eliminate PowerMetrics temp file growth with in-memory buffer, Homebrew installation support
- v0.4.1 (2025/07/10): Mock server improvements, efficient Apple Silicon and NVIDIA GPU support
- v0.4.0 (2025/07/08): Architectural refactoring, Smart sudo detection and comprehensive unit testing
- v0.3.3 (2025/07/07): CPU, Memory, and ANE support, and UI fixes
- v0.3.2 (2025/07/06): Cargo.toml for publishing and release process
- v0.3.1 (2025/07/06): GitHub actions and Dockerfile, and UI fixes
- v0.3.0 (2025/07/06): Multi-architecture support, optimized space allocation, enhanced UI
- v0.2.2 (2025/07/06): GPU sorting functionality with hotkeys
- v0.2.1 (2025/07/05): Help system improvements and code refactoring
- v0.2.0 (2025/07/05): Remote monitoring and cluster management features
- v0.1.1 (2025/07/04): ANE (Apple Neural Engine) support, page navigation keys, and scrolling fixes
- v0.1.0 (2024/08/11): Initial release with local GPU monitoring