all-smi 0.3.2

all-smi

all-smi is a command-line utility for monitoring GPU hardware across multiple systems. It provides a real-time view of GPU utilization, memory usage, temperature, power consumption, and other metrics. The tool is designed to be a cross-platform alternative to nvidia-smi, with support for NVIDIA GPUs, Apple Silicon GPUs, and NVIDIA Jetson platforms.

The application presents a terminal-based user interface with cluster overview, interactive sorting, and both local and remote monitoring capabilities. It also provides an API mode for Prometheus metrics integration.

Features

GPU Monitoring

  • Real-time Metrics: Displays comprehensive GPU information including:
    • GPU Name and Driver Version
    • Utilization Percentage with color-coded status
    • Memory Usage (Used/Total in GB)
    • Temperature in Celsius
    • Clock Frequency in MHz
    • Power Consumption in Watts
  • Multi-GPU Support: Handles multiple GPUs per system with individual monitoring
  • Interactive Sorting: Sort GPUs by utilization, memory usage, or default (hostname+index) order
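
The three sort orders listed above can be sketched in a few lines of Rust. This is only an illustration, not the tool's own code: the `GpuRow` struct, its field names, the `SortKey` enum, and the choice to sort utilization and memory in descending order are all assumptions made for the example.

```rust
/// Hypothetical row describing one GPU; the field names are illustrative.
#[derive(Debug, Clone, PartialEq)]
pub struct GpuRow {
    pub hostname: String,
    pub index: u32,
    pub utilization: f64, // percent
    pub memory_used: u64, // bytes
}

/// Hypothetical sort modes mirroring the 'd', 'u', and 'g' hotkeys.
#[derive(Debug, Clone, Copy)]
pub enum SortKey {
    Default,
    Utilization,
    Memory,
}

/// Sorts rows in place. The default order is hostname, then GPU index;
/// the other two modes put the busiest GPU first (an assumption).
pub fn sort_gpus(rows: &mut [GpuRow], key: SortKey) {
    match key {
        SortKey::Default => rows.sort_by(|a, b| {
            a.hostname.cmp(&b.hostname).then(a.index.cmp(&b.index))
        }),
        // total_cmp gives a total order on f64, so NaN cannot cause a panic.
        SortKey::Utilization => rows.sort_by(|a, b| b.utilization.total_cmp(&a.utilization)),
        SortKey::Memory => rows.sort_by(|a, b| b.memory_used.cmp(&a.memory_used)),
    }
}
```

Because `sort_by` is stable, rows that compare equal keep their previous relative order, so switching from the default order to another key leaves ties in hostname+index order.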

Cluster Management

  • Cluster Overview Dashboard: Real-time statistics showing:
    • Total nodes and GPUs across the cluster
    • Average utilization and memory usage
    • Temperature statistics with standard deviation
    • Total and average power consumption
  • Live Statistics History: Visual graphs showing utilization, memory, and temperature trends
  • Tabbed Interface: Switch between "All" view and individual host tabs
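
As a rough illustration of the dashboard statistics, the mean and standard deviation of a set of temperature readings could be computed as follows. The function name is made up for this sketch, and it uses the population standard deviation; the README does not say which variant the tool uses.

```rust
/// Returns (mean, population standard deviation) of the readings,
/// or None when there are no readings to aggregate.
pub fn mean_and_std_dev(values: &[f64]) -> Option<(f64, f64)> {
    if values.is_empty() {
        return None;
    }
    let n = values.len() as f64;
    let mean = values.iter().sum::<f64>() / n;
    // Average of squared deviations from the mean (population variance).
    let variance = values.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / n;
    Some((mean, variance.sqrt()))
}
```

Returning `Option` keeps the empty-cluster case explicit, which avoids dividing by zero when no host has reported yet.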

Process Information

  • GPU Process Monitoring: Lists processes running on GPUs with:
    • Process ID (PID) and Parent PID
    • Process Name and Command Line
    • GPU Memory Usage
    • User and State Information
  • Interactive Sorting: Sort processes by PID or memory usage
  • System Integration: Full process details from system information

Cross-Platform Support

  • Linux: Supports NVIDIA GPUs via the nvidia-smi command
  • macOS: Supports Apple Silicon GPUs via powermetrics and Metal framework
  • NVIDIA Jetson: Special support for Tegra-based systems with DLA (Deep Learning Accelerator)

Remote Monitoring

  • Multi-Host Support: Monitor 128 or more remote systems simultaneously
  • Connection Management: Optimized networking with connection pooling and retry logic
  • Storage Monitoring: Disk usage information for remote hosts
  • High Availability: Resilient to connection failures with automatic retry
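
The automatic-retry idea can be sketched with the standard library alone. This is a synchronous stand-in, not the tool's networking code, which according to the Technology Stack section is built on Tokio and reqwest. The function name, the doubling backoff, and the parameters are assumptions chosen for the example.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Calls `op` until it succeeds or `max_attempts` is reached, doubling the
/// delay after each failure. `op` receives the 1-based attempt number and
/// is always called at least once.
pub fn retry_with_backoff<T, E>(
    max_attempts: u32,
    base_delay: Duration,
    mut op: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 1;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(_) => {
                // Delay grows as base, 2*base, 4*base, ... without overflowing.
                let factor = 2u32.saturating_pow(attempt - 1);
                sleep(base_delay.saturating_mul(factor));
                attempt += 1;
            }
        }
    }
}
```

A caller would wrap a single host poll in the closure, so one unreachable node delays only its own request instead of failing the whole refresh.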

Interactive UI

  • Keyboard Controls:
    • Navigation: Arrow keys, Page Up/Down for scrolling
    • Sorting: 'd' (default), 'u' (utilization), 'g' (GPU memory), 'p' (PID), 'm' (memory)
    • Interface: '1' or 'h' (help), 'q' (quit), Tab switching
  • Color-Coded Status: Green (≤60%), Yellow (>60% to 80%), Red (>80%) for resource usage
  • Responsive Design: Adapts to terminal size with optimized space allocation
  • Help System: Comprehensive built-in help with context-sensitive shortcuts
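
The color thresholds listed above amount to a small classification function. A minimal sketch follows, with a made-up enum and function name; it treats exactly 60% as green and exactly 80% as yellow, which is how the stated bounds read.

```rust
/// Status colors used for resource usage; the type name is illustrative.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StatusColor {
    Green,
    Yellow,
    Red,
}

/// Maps a usage percentage to a color: up to 60 is green,
/// above 60 and up to 80 is yellow, anything higher is red.
pub fn status_color(percent: f64) -> StatusColor {
    if percent <= 60.0 {
        StatusColor::Green
    } else if percent <= 80.0 {
        StatusColor::Yellow
    } else {
        StatusColor::Red
    }
}
```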

Technology Stack

  • Language: Rust 2021 Edition
  • Async Runtime: Tokio for high-performance networking
  • Key Dependencies:
    • crossterm: Terminal manipulation and UI
    • axum: Web framework for API mode
    • reqwest: HTTP client for remote monitoring
    • chrono: Date/time handling
    • clap: Command-line argument parsing
    • serde: Serialization for data exchange
    • metal/objc: Apple Silicon GPU integration on macOS
    • sysinfo: System information gathering

Installation

Prerequisites

  • Rust: Version 1.75 or later with Cargo
  • Linux (NVIDIA): The nvidia-smi command must be available
  • macOS: Requires sudo privileges for powermetrics access
  • Network: Connectivity to the remote hosts (only needed for remote monitoring)

Building from Source

  1. Clone the repository:

    git clone https://github.com/inureyes/all-smi.git
    cd all-smi
    
  2. Build the project:

    # Build the main application
    cargo build --release
    
    # Build mock server for testing
    cargo build --release --bin mock-server
    
  3. Run tests, lints, and format checks:

    cargo test
    cargo clippy
    cargo fmt --check
    

Usage

Command Overview

# Show help
./target/release/all-smi --help

# Local monitoring (requires sudo on macOS)
sudo ./target/release/all-smi view

# Remote monitoring
./target/release/all-smi view --hosts http://node1:9090 http://node2:9090
./target/release/all-smi view --hostfile hosts.csv

# API mode
./target/release/all-smi api --port 9090

View Mode (Interactive Monitoring)

The view mode provides a terminal-based interface with real-time updates.

Local Mode

# Monitor local GPUs (requires sudo on macOS)
sudo ./target/release/all-smi view

# With custom refresh interval
sudo ./target/release/all-smi view --interval 5

Remote Monitoring

Monitor multiple remote systems running in API mode:

# Direct host specification
./target/release/all-smi view --hosts http://gpu-node1:9090 http://gpu-node2:9090

# Using host file
./target/release/all-smi view --hostfile hosts.csv --interval 2

Host file format (CSV, one host URL per line):

http://gpu-node1:9090
http://gpu-node2:9090
http://gpu-node3:9090
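
Reading a host file in this shape takes only a few lines. The sketch below is an assumption about how such a file could be parsed, not the tool's parser: the function name is hypothetical, and trimming whitespace and skipping blank lines are choices made for the example.

```rust
/// Splits host file contents into one URL per non-empty line.
/// Surrounding whitespace is trimmed and blank lines are ignored.
pub fn parse_hostfile(contents: &str) -> Vec<String> {
    contents
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(String::from)
        .collect()
}
```

In a program, the input would typically come from `std::fs::read_to_string("hosts.csv")`, with the resulting list handed to the remote collector.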

Keyboard Controls

  • Navigation: ←→ (switch tabs), ↑↓ (scroll), PgUp/PgDn (page navigation)
  • Sorting: 'd' (default), 'u' (utilization), 'g' (GPU memory), 'p' (PID), 'm' (memory)
  • Interface: '1'/'h' (help), 'q' (quit), ESC (close help)

API Mode (Prometheus Metrics)

Expose GPU metrics in Prometheus format for integration with monitoring systems:

# Start API server
./target/release/all-smi api --port 9090

# Custom bind address
./target/release/all-smi api --port 8080 --bind 0.0.0.0

With the server listening on port 9090, metrics are available at http://localhost:9090/metrics and include:

  • all_smi_gpu_utilization
  • all_smi_gpu_memory_used_bytes
  • all_smi_gpu_memory_total_bytes
  • all_smi_gpu_temperature_celsius
  • all_smi_gpu_power_consumption_watts
  • all_smi_disk_total_bytes
  • all_smi_disk_available_bytes
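
For readers unfamiliar with the Prometheus text format, each sample is a line of the form `name{label="value",...} number`. The sketch below builds such a line. The helper names and the `gpu` and `host` labels in the usage note are assumptions; the README lists only metric names, not the labels the tool attaches.

```rust
/// Formats one sample line in the Prometheus text exposition format.
/// Label names and values are supplied by the caller.
pub fn format_gauge(name: &str, labels: &[(&str, &str)], value: f64) -> String {
    let label_text = labels
        .iter()
        .map(|(key, val)| format!("{}=\"{}\"", key, escape_label_value(val)))
        .collect::<Vec<_>>()
        .join(",");
    if label_text.is_empty() {
        format!("{} {}", name, value)
    } else {
        // `{{` and `}}` emit literal braces around the label set.
        format!("{}{{{}}} {}", name, label_text, value)
    }
}

/// Escapes backslash, double quote, and newline, as the format requires
/// inside label values.
fn escape_label_value(value: &str) -> String {
    value
        .replace('\\', "\\\\")
        .replace('"', "\\\"")
        .replace('\n', "\\n")
}
```

For example, `format_gauge("all_smi_gpu_utilization", &[("gpu", "0"), ("host", "node-0001")], 42.5)` yields `all_smi_gpu_utilization{gpu="0",host="node-0001"} 42.5`.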

Development and Testing

Mock Server for Testing

The included mock server simulates realistic GPU clusters for development and testing:

# Build mock server
cargo build --release --bin mock-server

# Start single mock instance
./target/release/mock-server --port-range 9090

# Start multiple instances
./target/release/mock-server --port-range 10001-10010 -o hosts.csv

# Custom configuration
./target/release/mock-server --port-range 10001-10005 \
  --gpu-name "NVIDIA H100 80GB HBM3" -o test-hosts.csv

Mock server features:

  • 8 GPUs per node with realistic metrics
  • Randomized values that change over time
  • Storage simulation with various disk sizes
  • Template-based responses for performance
  • Instance naming with node-XXXX format
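
The `--port-range` values in the commands above take two shapes: a single port (`9090`) or an inclusive range (`10001-10010`). A possible way to parse both is sketched here. The function names and error messages are hypothetical and are not taken from the mock server's source.

```rust
use std::ops::RangeInclusive;

/// Parses "9090" or "10001-10010" into an inclusive range of ports.
pub fn parse_port_range(spec: &str) -> Result<RangeInclusive<u16>, String> {
    match spec.split_once('-') {
        // A single port is treated as a range of length one.
        None => {
            let port = parse_port(spec)?;
            Ok(port..=port)
        }
        Some((start, end)) => {
            let (start, end) = (parse_port(start)?, parse_port(end)?);
            if start > end {
                return Err(format!("range start {} exceeds end {}", start, end));
            }
            Ok(start..=end)
        }
    }
}

/// Parses one port number, reporting the offending text on failure.
fn parse_port(text: &str) -> Result<u16, String> {
    text.trim()
        .parse::<u16>()
        .map_err(|e| format!("invalid port '{}': {}", text, e))
}
```

Parsing into `u16` rejects values above 65535 without an extra bounds check.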

Testing High-Scale Scenarios

# Start 128 mock servers
./target/release/mock-server --port-range 10001-10128 -o large-cluster.csv &

# Monitor large cluster
./target/release/all-smi view --hostfile large-cluster.csv --interval 1

Architecture

Core Components

  • GPU Abstraction Layer: Platform-specific readers implementing the GpuReader trait
  • Async Networking: Concurrent remote data collection with connection pooling
  • Terminal UI: Double-buffered rendering with responsive layout
  • Data Processing: Real-time metrics aggregation and historical tracking
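
To make the abstraction-layer idea concrete, here is a minimal sketch of how a reader trait can hide platform differences. Only the trait name `GpuReader` comes from this README; the method signature, the `GpuSample` struct, and the `FixedReader` test double are invented for illustration and may not match the real trait.

```rust
/// Hypothetical snapshot of one GPU; the real data type is not shown here.
#[derive(Debug, Clone, PartialEq)]
pub struct GpuSample {
    pub name: String,
    pub utilization: f64, // percent
    pub temperature_c: u32,
}

/// Assumed shape of the trait: each platform backend returns its samples.
pub trait GpuReader {
    fn read(&self) -> Vec<GpuSample>;
}

/// Stand-in backend that returns canned data, useful in tests.
pub struct FixedReader(pub Vec<GpuSample>);

impl GpuReader for FixedReader {
    fn read(&self) -> Vec<GpuSample> {
        self.0.clone()
    }
}

/// Callers depend only on the trait, so backends can be mixed freely.
pub fn collect_all(readers: &[Box<dyn GpuReader>]) -> Vec<GpuSample> {
    readers.iter().flat_map(|reader| reader.read()).collect()
}
```

With this shape, a backend per platform (nvidia-smi parsing, powermetrics, Tegra) would implement the same trait and the UI code would never need to know which one is active.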

Platform Support

  • NVIDIA GPUs: Via parsing of nvidia-smi command output
  • Apple Silicon: Via powermetrics and Metal framework integration
  • NVIDIA Jetson: Specialized Tegra platform support with DLA monitoring

Performance Optimizations

  • Connection Management: 64 concurrent connections with retry logic
  • Adaptive Intervals: 2-6 second refresh based on cluster size
  • Memory Efficiency: Stream processing and connection pooling
  • Rendering: Double buffering to prevent display flickering
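
The adaptive interval can be pictured as a step function from cluster size to refresh period. The README states only the 2 to 6 second range; the breakpoints and the function name below are assumptions picked to show the idea.

```rust
/// Picks a refresh interval in seconds from the number of monitored nodes.
/// The result always falls in the documented 2..=6 range; the node-count
/// breakpoints are hypothetical.
pub fn adaptive_interval_secs(node_count: usize) -> u64 {
    match node_count {
        0..=10 => 2,
        11..=50 => 3,
        51..=100 => 4,
        101..=200 => 5,
        _ => 6,
    }
}
```

Larger clusters get a longer interval so that one polling round can finish before the next one starts.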

Contributing

Contributions are welcome! Areas for contribution include:

  • Platform Support: Additional GPU vendors or operating systems
  • Features: New metrics, visualization improvements, or monitoring capabilities
  • Performance: Optimization for larger clusters or resource usage
  • Documentation: Examples, tutorials, or API documentation

Please submit pull requests or open issues for bugs, feature requests, or questions.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Changelog

Recent Updates

  • v0.3.0: Multi-architecture support, optimized space allocation, enhanced UI
  • v0.2.2: GPU sorting functionality with hotkeys
  • v0.2.1: Help system improvements and code refactoring
  • v0.2.0: Remote monitoring and cluster management features
  • v0.1.0: Initial release with local GPU monitoring