# API Metrics Reference
`all-smi` provides comprehensive hardware metrics in Prometheus format through its API mode. This document details all available metrics across different hardware platforms.
## Starting API Mode
```bash
# Start API server on TCP port
all-smi api --port 9090
# Custom update interval (default: 3 seconds)
all-smi api --port 9090 --interval 5
# Include process information
all-smi api --port 9090 --processes
```
Metrics are available at `http://localhost:9090/metrics`.
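The endpoint returns the Prometheus text exposition format, which is simple enough to inspect without a Prometheus server. The following is a minimal sketch using only the standard library; `parse_metrics` and the `EXAMPLE` payload are hypothetical names and data for illustration, not part of `all-smi`, and the parser covers only plain `name{labels} value` sample lines.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text exposition into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical payload; real output depends on the hardware present.
EXAMPLE = '''\
# HELP all_smi_gpu_utilization GPU utilization percentage
# TYPE all_smi_gpu_utilization gauge
all_smi_gpu_utilization{gpu_index="0",gpu_name="Example GPU"} 42
all_smi_cpu_utilization 12.5
'''

for name, labels, value in parse_metrics(EXAMPLE):
    print(name, labels, value)
```

Against a running server, the text could be obtained with `urllib.request.urlopen("http://localhost:9090/metrics").read().decode()` and passed to the same function.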
### Unix Domain Socket Support (Unix Only)
For local IPC scenarios, API mode supports Unix Domain Sockets:
```bash
# Use default socket path
all-smi api --socket
# Linux: /var/run/all-smi.sock (or /tmp/all-smi.sock)
# macOS: /tmp/all-smi.sock
# Use custom socket path
all-smi api --socket /custom/path/all-smi.sock
# TCP and Unix socket simultaneously
all-smi api --port 9090 --socket
# Unix socket only (disable TCP)
all-smi api --port 0 --socket
```
Access metrics via Unix socket:
```bash
curl --unix-socket /tmp/all-smi.sock http://localhost/metrics
```
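```python
# Python example (requires the third-party requests-unixsocket package)
import requests_unixsocket

session = requests_unixsocket.Session()
# The socket path is percent-encoded into the host part of the URL
r = session.get('http+unix://%2Ftmp%2Fall-smi.sock/metrics')
r.raise_for_status()
print(r.text)
```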
**Security**: Socket permissions are set to `0600` (owner-only access).
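If adding a third-party dependency is undesirable, the same request can be made with the standard library alone. This is a sketch under the assumption that the server speaks plain HTTP over the socket (as the `curl` example suggests); `UnixHTTPConnection` and `fetch_metrics` are hypothetical helper names, and the default path is the macOS/fallback path listed above.

```python
import http.client
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that talks over a Unix domain socket instead of TCP."""

    def __init__(self, socket_path, timeout=5.0):
        # The host is only used for the Host header; routing is by socket path.
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self.socket_path)
        self.sock = sock


def fetch_metrics(socket_path="/tmp/all-smi.sock"):
    """Return (status, body) for GET /metrics; raises OSError if unreachable."""
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request("GET", "/metrics")
        response = conn.getresponse()
        return response.status, response.read().decode("utf-8")
    finally:
        conn.close()
```

Because the socket is created with `0600` permissions, the calling process needs to run as the same user that started `all-smi`; otherwise the connection attempt fails with a permission error.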
## Available Metrics
### GPU Metrics (All Platforms)
| Metric | Description | Unit | Labels |
|--------|-------------|------|--------|
| `all_smi_gpu_utilization` | GPU utilization percentage | percent | `gpu_index`, `gpu_name` |
| `all_smi_gpu_memory_used_bytes` | GPU memory used | bytes | `gpu_index`, `gpu_name` |
| `all_smi_gpu_memory_total_bytes` | GPU memory total | bytes | `gpu_index`, `gpu_name` |
| `all_smi_gpu_temperature_celsius` | GPU temperature | celsius | `gpu_index`, `gpu_name` |
| `all_smi_gpu_power_consumption_watts` | GPU power consumption | watts | `gpu_index`, `gpu_name` |
| `all_smi_gpu_frequency_mhz` | GPU frequency | MHz | `gpu_index`, `gpu_name` |
| `all_smi_gpu_info` | GPU device information | info | `gpu_index`, `gpu_name`, `driver_version` |
### Unified AI Acceleration Library Labels
The `all_smi_gpu_info` metric includes standardized labels for AI acceleration libraries across all GPU/accelerator platforms. These unified labels allow platform-agnostic queries and dashboards:
| Label | Description | Example Values |
|-------|-------------|----------------|
| `lib_name` | Name of the AI acceleration library | `CUDA`, `ROCm`, `Metal` |
| `lib_version` | Version of the AI acceleration library | `13.0`, `7.0.2`, `Metal 3` |
#### Platform-Specific Library Mappings
| Platform | `lib_name` | `lib_version` Source | Platform-Specific Label |
|----------|------------|----------------------|-------------------------|
| NVIDIA GPU | `CUDA` | CUDA version | `cuda_version` |
| AMD GPU | `ROCm` | ROCm version | `rocm_version` |
| NVIDIA Jetson | `CUDA` | CUDA version | `cuda_version` |
| Apple Silicon | `Metal` | Metal version | N/A |
**Note**: Platform-specific labels (e.g., `cuda_version`, `rocm_version`) are maintained for backward compatibility with existing queries and dashboards.
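For consumers that read `all_smi_gpu_info` label sets directly, the mapping above can be applied in a few lines. The sketch below prefers the unified labels and falls back to the platform-specific ones; `unified_library` and `LEGACY_LABELS` are hypothetical names, and the label sets in the usage note are invented for illustration.

```python
# Platform-specific label -> unified library name, per the mapping table above.
LEGACY_LABELS = {
    "cuda_version": "CUDA",
    "rocm_version": "ROCm",
}


def unified_library(labels):
    """Return (lib_name, lib_version) for one all_smi_gpu_info label set.

    Prefers the unified labels and falls back to the platform-specific
    ones, so it also works on label sets that only carry the older labels.
    """
    if "lib_name" in labels and "lib_version" in labels:
        return labels["lib_name"], labels["lib_version"]
    for legacy, lib_name in LEGACY_LABELS.items():
        if legacy in labels:
            return lib_name, labels[legacy]
    return None, None
```

For example, `unified_library({"gpu_index": "0", "rocm_version": "7.0.2"})` yields `("ROCm", "7.0.2")`, while a label set with neither kind of label yields `(None, None)`.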
#### Example PromQL Queries
```promql
# Count devices by AI library type
count by (lib_name) (all_smi_gpu_info)
# Get all CUDA devices with version 12 or higher
all_smi_gpu_info{lib_name="CUDA", lib_version=~"1[2-9].*|[2-9][0-9].*"}
# Alert on outdated ROCm versions (< 7.0); two-digit major versions are treated as current
all_smi_gpu_info{lib_name="ROCm", lib_version!~"([7-9]|[1-9][0-9]).*"} == 1
# Cross-platform library distribution
sum by (lib_name, lib_version) (all_smi_gpu_info)
# Find all devices using Metal (Apple Silicon)
all_smi_gpu_info{lib_name="Metal"}
# Monitor library version consistency across cluster (more than one version of the same library)
count by (lib_name) (count by (lib_name, lib_version) (all_smi_gpu_info)) > 1
```
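Label matchers such as `=~` and `!~` are fully anchored in PromQL, so a pattern has to match the whole label value, not a substring. The sketch below mimics that behavior with `re.fullmatch` to check which version strings the CUDA pattern from the queries above accepts; `promql_match` is a hypothetical helper, and Python's `re` is only an approximation of the RE2 engine Prometheus uses.

```python
import re

# Pattern copied from the "CUDA version 12 or higher" query above.
CUDA_12_PLUS = r"1[2-9].*|[2-9][0-9].*"


def promql_match(pattern, value):
    """Approximate PromQL's anchored regex matching with re.fullmatch."""
    return re.fullmatch(pattern, value) is not None


for version in ["11.8", "12.4", "13.0", "9.2"]:
    print(version, promql_match(CUDA_12_PLUS, version))

# An unanchored search would accept values the PromQL matcher rejects.
print(re.search(CUDA_12_PLUS, "v12.4") is not None, promql_match(CUDA_12_PLUS, "v12.4"))
```

Running version strings through a check like this before deploying an alert is a cheap way to catch patterns that silently exclude, or include, an unexpected release.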
### NVIDIA GPU Specific Metrics
| Metric | Description | Unit | Labels |
|--------|-------------|------|--------|
| `all_smi_gpu_pcie_gen_current` | Current PCIe generation | - | `gpu_index`, `gpu_name` |
| `all_smi_gpu_pcie_width_current` | Current PCIe link width | - | `gpu_index`, `gpu_name` |
| `all_smi_gpu_performance_state` | GPU performance state (P0=0, P1=1, etc.) | - | `gpu_index`, `gpu_name` |
| `all_smi_gpu_clock_graphics_max_mhz` | Maximum graphics clock | MHz | `gpu_index`, `gpu_name` |
| `all_smi_gpu_clock_memory_max_mhz` | Maximum memory clock | MHz | `gpu_index`, `gpu_name` |
| `all_smi_gpu_power_limit_current_watts` | Current power limit | watts | `gpu_index`, `gpu_name` |
| `all_smi_gpu_power_limit_max_watts` | Maximum power limit | watts | `gpu_index`, `gpu_name` |
### NVIDIA Jetson Specific Metrics
| Metric | Description | Unit | Labels |
|--------|-------------|------|--------|
| `all_smi_dla_utilization` | DLA (Deep Learning Accelerator) utilization | percent | `gpu_index`, `gpu_name` |
### AMD GPU Specific Metrics
AMD GPUs (Radeon and Instinct series) provide comprehensive monitoring through ROCm and the DRM subsystem:
| Metric | Description | Unit | Labels |
|--------|-------------|------|--------|
| `all_smi_gpu_fan_speed_rpm` | GPU fan speed | RPM | `gpu_index`, `gpu_name` |
| `all_smi_amd_rocm_version` | AMD ROCm version installed | info | `instance`, `version` |
| `all_smi_gpu_memory_gtt_bytes`| GTT (Graphics Translation Table) memory usage | bytes | `gpu_index`, `gpu_name` |
| `all_smi_gpu_memory_vram_bytes`| VRAM (Video RAM) usage | bytes | `gpu_index`, `gpu_name` |
**Additional Details Available** (in `all_smi_gpu_info` labels):
- **Driver Version**: AMDGPU kernel driver version (e.g., "30.10.1")
- **ROCm Version**: ROCm software stack version (e.g., "7.0.2")
- **PCIe Information**: Current link generation and width, max GPU/system link capabilities
- **VBIOS**: Version and date information
- **Power Management**: Current, minimum, and maximum power cap values
- **ASIC Information**: Device ID, revision ID, ASIC name
- **Memory Clock**: Current memory clock frequency
**Process Tracking**:
- AMD GPU process detection uses `fdinfo` from `/proc/<pid>/fdinfo/` for accurate memory tracking
- Tracks both VRAM and GTT memory usage per process
- Available with `--processes` flag in API mode
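To make the `fdinfo` mechanism concrete, the sketch below parses the `drm-memory-<region>` keys that the Linux DRM subsystem exposes for each open GPU file descriptor. It is an illustration only, not `all-smi`'s implementation; `parse_fdinfo_memory` and the sample content are hypothetical, and the exact set of keys varies by kernel and driver version.

```python
UNIT_FACTORS = {"": 1, "KiB": 1024, "MiB": 1024 ** 2}


def parse_fdinfo_memory(text):
    """Extract drm-memory-<region> values (in bytes) from one fdinfo file body."""
    usage = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep or not key.startswith("drm-memory-"):
            continue
        parts = rest.split()
        if not parts or not parts[0].isdigit():
            continue
        unit = parts[1] if len(parts) > 1 else ""
        region = key[len("drm-memory-"):]
        # Unknown unit suffixes are treated as plain bytes in this sketch.
        usage[region] = int(parts[0]) * UNIT_FACTORS.get(unit, 1)
    return usage


# Hypothetical fdinfo content for one open DRM file descriptor.
EXAMPLE = """\
pos:\t0
flags:\t02100002
drm-driver:\tamdgpu
drm-client-id:\t42
drm-memory-vram:\t2048 KiB
drm-memory-gtt:\t512 KiB
"""

print(parse_fdinfo_memory(EXAMPLE))
```

A process can hold several file descriptors for the same DRM client, so a real collector would typically deduplicate by `drm-client-id` before summing per-process totals.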
**Platform Requirements**:
- Requires ROCm drivers and `libamdgpu_top` library
- Requires root access to the `/dev/dri` devices, or a user that belongs to the `video` or `render` group
- Only available in glibc builds (not musl static builds)
### Apple Silicon GPU Specific Metrics
| Metric | Description | Unit | Labels |
|--------|-------------|------|--------|
| `all_smi_ane_utilization` | ANE utilization | mW | `gpu_index`, `gpu_name` |
| `all_smi_ane_power_watts` | ANE power consumption | watts | `gpu_index`, `gpu_name` |
| `all_smi_thermal_pressure_info` | Thermal pressure level | info | `gpu_index`, `gpu_name`, `level` |
Note: For Apple Silicon (M1/M2/M3/M4), `all_smi_gpu_temperature_celsius` is not available; the thermal pressure level is provided instead.
### Tenstorrent NPU Metrics
#### Basic NPU Metrics
| Metric | Description | Unit | Labels |
|--------|-------------|------|--------|
| `all_smi_gpu_utilization` | NPU utilization percentage | percent | `gpu_index`, `gpu_name` |
| `all_smi_gpu_memory_used_bytes` | NPU memory used | bytes | `gpu_index`, `gpu_name` |
| `all_smi_gpu_memory_total_bytes` | NPU memory total | bytes | `gpu_index`, `gpu_name` |
| `all_smi_gpu_temperature_celsius` | NPU ASIC temperature | celsius | `gpu_index`, `gpu_name` |
| `all_smi_gpu_power_consumption_watts` | NPU power consumption | watts | `gpu_index`, `gpu_name` |
| `all_smi_gpu_frequency_mhz` | NPU AI clock frequency | MHz | `gpu_index`, `gpu_name` |
| `all_smi_gpu_info` | NPU device information | info | `gpu_index`, `gpu_name`, `driver_version` |
| `all_smi_npu_firmware_info` | NPU firmware version | info | `npu`, `instance`, `uuid`, `index`, `firmware` |
#### Tenstorrent-Specific Metrics
| Metric | Description | Unit | Labels |
|--------|-------------|------|--------|
| `all_smi_tenstorrent_board_info` | Board and architecture information | info | `npu`, `instance`, `uuid`, `index`, `board_type`, `board_id`, `architecture` |
| `all_smi_tenstorrent_collection_method_info` | Data collection method used | info | `npu`, `instance`, `uuid`, `index`, `method` |
| **Firmware Versions** | | | |
| `all_smi_tenstorrent_arc_firmware_info` | ARC firmware version | info | `npu`, `instance`, `uuid`, `index`, `version` |
| `all_smi_tenstorrent_eth_firmware_info` | Ethernet firmware version | info | `npu`, `instance`, `uuid`, `index`, `version` |
| `all_smi_tenstorrent_ddr_firmware_info` | DDR firmware version | info | `npu`, `instance`, `uuid`, `index`, `version` |
| `all_smi_tenstorrent_spibootrom_firmware_info` | SPI Boot ROM firmware version | info | `npu`, `instance`, `uuid`, `index`, `version` |
| `all_smi_tenstorrent_firmware_date_info` | Firmware build date | info | `npu`, `instance`, `uuid`, `index`, `date` |
| **Temperature Sensors** | | | |
| `all_smi_tenstorrent_asic_temperature_celsius` | ASIC temperature | celsius | `npu`, `instance`, `uuid`, `index` |
| `all_smi_tenstorrent_vreg_temperature_celsius` | Voltage regulator temperature | celsius | `npu`, `instance`, `uuid`, `index` |
| `all_smi_tenstorrent_inlet_temperature_celsius` | Inlet temperature | celsius | `npu`, `instance`, `uuid`, `index` |
| `all_smi_tenstorrent_outlet1_temperature_celsius`| Outlet 1 temperature | celsius | `npu`, `instance`, `uuid`, `index` |
| `all_smi_tenstorrent_outlet2_temperature_celsius`| Outlet 2 temperature | celsius | `npu`, `instance`, `uuid`, `index` |
| **Clock Frequencies** | | | |
| `all_smi_tenstorrent_aiclk_mhz` | AI clock frequency | MHz | `npu`, `instance`, `uuid`, `index` |
| `all_smi_tenstorrent_axiclk_mhz` | AXI clock frequency | MHz | `npu`, `instance`, `uuid`, `index` |
| `all_smi_tenstorrent_arcclk_mhz` | ARC clock frequency | MHz | `npu`, `instance`, `uuid`, `index` |
| **Power and Electrical** | | | |
| `all_smi_tenstorrent_voltage_volts` | Core voltage | volts | `npu`, `instance`, `uuid`, `index` |
| `all_smi_tenstorrent_current_amperes` | Current draw | amperes | `npu`, `instance`, `uuid`, `index` |
| `all_smi_tenstorrent_power_raw_watts` | Raw power consumption | watts | `npu`, `instance`, `uuid`, `index` |
| `all_smi_tenstorrent_tdp_limit_watts` | TDP limit | watts | `npu`, `instance`, `uuid`, `index` |
| `all_smi_tenstorrent_tdc_limit_amperes` | TDC limit | amperes | `npu`, `instance`, `uuid`, `index` |
| **Status and Health** | | | |
| `all_smi_tenstorrent_heartbeat` | Device heartbeat counter | counter | `npu`, `instance`, `uuid`, `index` |
| `all_smi_tenstorrent_arc0_health` | ARC0 health counter | counter | `npu`, `instance`, `uuid`, `index` |
| `all_smi_tenstorrent_arc3_health` | ARC3 health counter | counter | `npu`, `instance`, `uuid`, `index` |
| `all_smi_tenstorrent_faults` | Fault register value | gauge | `npu`, `instance`, `uuid`, `index` |
| `all_smi_tenstorrent_throttler` | Throttler state register | gauge | `npu`, `instance`, `uuid`, `index` |
| `all_smi_tenstorrent_pcie_status_info` | PCIe status register | info | `npu`, `instance`, `uuid`, `index`, `status` |
| `all_smi_tenstorrent_eth_status_info` | Ethernet status register | info | `npu`, `instance`, `uuid`, `index`, `port`, `status` |
| `all_smi_tenstorrent_ddr_status` | DDR status register | gauge | `npu`, `instance`, `uuid`, `index` |
| **Fan Metrics** | | | |
| `all_smi_tenstorrent_fan_speed_percent` | Fan speed percentage | percent | `npu`, `instance`, `uuid`, `index` |
| `all_smi_tenstorrent_fan_rpm` | Fan speed in RPM | gauge | `npu`, `instance`, `uuid`, `index` |
| **PCIe Information** | | | |
| `all_smi_tenstorrent_pcie_generation` | PCIe generation | gauge | `npu`, `instance`, `uuid`, `index` |
| `all_smi_tenstorrent_pcie_width` | PCIe link width | gauge | `npu`, `instance`, `uuid`, `index` |
| `all_smi_tenstorrent_pcie_address_info` | PCIe address | info | `npu`, `instance`, `uuid`, `index`, `address` |
| `all_smi_tenstorrent_pcie_device_info` | PCIe device identification | info | `npu`, `instance`, `uuid`, `index`, `vendor_id`, `device_id` |
| **DRAM Information** | | | |
| `all_smi_tenstorrent_dram_info` | DRAM configuration | info | `npu`, `instance`, `uuid`, `index`, `speed` |
Note: Tenstorrent NPUs use the same basic metric names as GPUs for compatibility with existing monitoring infrastructure. Additional Tenstorrent-specific metrics provide detailed hardware monitoring capabilities.
### Rebellions NPU Metrics
#### Basic NPU Metrics
| Metric | Description | Unit | Labels |
|--------|-------------|------|--------|
| `all_smi_gpu_utilization` | NPU utilization percentage | percent | `gpu_index`, `gpu_name` |
| `all_smi_gpu_memory_used_bytes` | NPU memory used | bytes | `gpu_index`, `gpu_name` |
| `all_smi_gpu_memory_total_bytes` | NPU memory total | bytes | `gpu_index`, `gpu_name` |
| `all_smi_gpu_temperature_celsius` | NPU temperature | celsius | `gpu_index`, `gpu_name` |
| `all_smi_gpu_power_consumption_watts` | NPU power consumption | watts | `gpu_index`, `gpu_name` |
| `all_smi_gpu_frequency_mhz` | NPU clock frequency | MHz | `gpu_index`, `gpu_name` |
| `all_smi_gpu_info` | NPU device information | info | `gpu_index`, `gpu_name`, `driver_version` |
#### Rebellions-Specific Metrics
| Metric | Description | Unit | Labels |
|--------|-------------|------|--------|
| `all_smi_rebellions_device_info` | Device model and variant information | info | `npu`, `instance`, `uuid`, `index`, `model`, `variant` |
| `all_smi_rebellions_firmware_info` | NPU firmware version | info | `npu`, `instance`, `uuid`, `index`, `firmware_version` |
| `all_smi_rebellions_kmd_info` | Kernel Mode Driver version | info | `npu`, `instance`, `uuid`, `index`, `kmd_version` |
| `all_smi_rebellions_device_status` | Device operational status | gauge | `npu`, `instance`, `uuid`, `index` |
| `all_smi_rebellions_performance_state` | NPU performance state (P0-P15) | gauge | `npu`, `instance`, `uuid`, `index` |
| `all_smi_rebellions_pcie_generation` | PCIe generation (Gen4) | gauge | `npu`, `instance`, `uuid`, `index` |
| `all_smi_rebellions_pcie_width` | PCIe link width (x16) | gauge | `npu`, `instance`, `uuid`, `index` |
| `all_smi_rebellions_memory_bandwidth_gbps`| Memory bandwidth capacity | gauge | `npu`, `instance`, `uuid`, `index` |
| `all_smi_rebellions_compute_tops` | Compute capacity in TOPS | gauge | `npu`, `instance`, `uuid`, `index` |
Note: Rebellions NPUs support ATOM, ATOM+, and ATOM Max variants with varying compute and memory capabilities. All variants use PCIe Gen4 x16 interface.
### Furiosa NPU Metrics
#### Basic NPU Metrics
| Metric | Description | Unit | Labels |
|--------|-------------|------|--------|
| `all_smi_gpu_utilization` | NPU utilization percentage | percent | `gpu_index`, `gpu_name` |
| `all_smi_gpu_memory_used_bytes` | NPU memory used | bytes | `gpu_index`, `gpu_name` |
| `all_smi_gpu_memory_total_bytes` | NPU memory total | bytes | `gpu_index`, `gpu_name` |
| `all_smi_gpu_temperature_celsius` | NPU temperature | celsius | `gpu_index`, `gpu_name` |
| `all_smi_gpu_power_consumption_watts` | NPU power consumption | watts | `gpu_index`, `gpu_name` |
| `all_smi_gpu_frequency_mhz` | NPU clock frequency | MHz | `gpu_index`, `gpu_name` |
| `all_smi_gpu_info` | NPU device information | info | `gpu_index`, `gpu_name`, `driver_version` |
#### Furiosa-Specific Metrics
| Metric | Description | Unit | Labels |
|--------|-------------|------|--------|
| `all_smi_furiosa_device_info` | Device architecture and model info | info | `npu`, `instance`, `uuid`, `index`, `architecture`, `model` |
| `all_smi_furiosa_firmware_info` | NPU firmware version | info | `npu`, `instance`, `uuid`, `index`, `firmware_version` |
| `all_smi_furiosa_pert_info` | PERT (runtime) version | info | `npu`, `instance`, `uuid`, `index`, `pert_version` |
| `all_smi_furiosa_liveness_status` | Device liveness status | gauge | `npu`, `instance`, `uuid`, `index` |
| `all_smi_furiosa_core_count` | Number of cores in NPU | gauge | `npu`, `instance`, `uuid`, `index` |
| `all_smi_furiosa_core_status` | Core availability status | gauge | `npu`, `instance`, `uuid`, `index`, `core` |
| `all_smi_furiosa_pe_utilization` | Processing Element utilization | percent | `npu`, `instance`, `uuid`, `index`, `core` |
| `all_smi_furiosa_core_frequency_mhz` | Per-core frequency | MHz | `npu`, `instance`, `uuid`, `index`, `core` |
| `all_smi_furiosa_power_governor_info` | Power governor mode | info | `npu`, `instance`, `uuid`, `index`, `governor` |
| `all_smi_furiosa_error_count` | Cumulative error count | counter | `npu`, `instance`, `uuid`, `index` |
| `all_smi_furiosa_pcie_generation` | PCIe generation | gauge | `npu`, `instance`, `uuid`, `index` |
| `all_smi_furiosa_pcie_width` | PCIe link width | gauge | `npu`, `instance`, `uuid`, `index` |
| `all_smi_furiosa_memory_bandwidth_utilization` | Memory bandwidth utilization | percent | `npu`, `instance`, `uuid`, `index` |
Note: Furiosa NPUs use the RNGD architecture with 8 cores per NPU. Each core contains multiple Processing Elements (PEs) that handle neural network computations. The power governor supports OnDemand mode for dynamic power management.
### Intel Gaudi NPU Metrics
#### Basic NPU Metrics
| Metric | Description | Unit | Labels |
|--------|-------------|------|--------|
| `all_smi_gpu_utilization` | NPU utilization percentage | percent | `gpu_index`, `gpu_name` |
| `all_smi_gpu_memory_used_bytes` | NPU memory used | bytes | `gpu_index`, `gpu_name` |
| `all_smi_gpu_memory_total_bytes` | NPU memory total | bytes | `gpu_index`, `gpu_name` |
| `all_smi_gpu_temperature_celsius` | NPU temperature | celsius | `gpu_index`, `gpu_name` |
| `all_smi_gpu_power_consumption_watts` | NPU power consumption | watts | `gpu_index`, `gpu_name` |
| `all_smi_gpu_frequency_mhz` | NPU clock frequency | MHz | `gpu_index`, `gpu_name` |
| `all_smi_gpu_info` | NPU device information | info | `gpu_index`, `gpu_name`, `driver_version` |
#### Intel Gaudi-Specific Metrics
| Metric | Description | Unit | Labels |
|--------|-------------|------|--------|
| `all_smi_gaudi_device_info` | Device model and information | info | `npu`, `instance`, `uuid`, `index` |
| `all_smi_gaudi_internal_name_info` | Internal device name (e.g., HL-325L) | info | `npu`, `instance`, `uuid`, `index`, `internal_name` |
| `all_smi_gaudi_driver_info` | Habana driver version | info | `npu`, `instance`, `uuid`, `index`, `version` |
| `all_smi_gaudi_aip_utilization_percent` | AIP (AI Processor) utilization | percent | `npu`, `instance`, `uuid`, `index` |
| `all_smi_gaudi_memory_used_bytes` | HBM memory used | bytes | `npu`, `instance`, `uuid`, `index` |
| `all_smi_gaudi_memory_total_bytes` | HBM total memory | bytes | `npu`, `instance`, `uuid`, `index` |
| `all_smi_gaudi_memory_utilization_percent` | HBM memory utilization percentage | percent | `npu`, `instance`, `uuid`, `index` |
| `all_smi_gaudi_power_draw_watts` | Current power consumption | watts | `npu`, `instance`, `uuid`, `index` |
| `all_smi_gaudi_power_max_watts` | Maximum power limit | watts | `npu`, `instance`, `uuid`, `index` |
| `all_smi_gaudi_power_utilization_percent` | Power utilization percentage | percent | `npu`, `instance`, `uuid`, `index` |
| `all_smi_gaudi_temperature_celsius` | AIP temperature | celsius | `npu`, `instance`, `uuid`, `index` |
Note: Intel Gaudi NPUs (Gaudi 1/2/3) are monitored via the `hl-smi` command-line tool running as a background process. Device names are automatically mapped from internal identifiers (e.g., HL-325L) to human-friendly names (e.g., Intel Gaudi 3 PCIe LP). The tool supports various form factors including PCIe, OAM, UBB, and HLS variants.
### Google TPU Metrics
#### Basic NPU Metrics
| Metric | Description | Unit | Labels |
|--------|-------------|------|--------|
| `all_smi_gpu_utilization` | TPU utilization percentage | percent | `gpu_index`, `gpu_name` |
| `all_smi_gpu_memory_used_bytes` | TPU memory used | bytes | `gpu_index`, `gpu_name` |
| `all_smi_gpu_memory_total_bytes` | TPU memory total | bytes | `gpu_index`, `gpu_name` |
| `all_smi_gpu_temperature_celsius` | TPU temperature | celsius | `gpu_index`, `gpu_name` |
| `all_smi_gpu_power_consumption_watts` | TPU power consumption | watts | `gpu_index`, `gpu_name` |
| `all_smi_gpu_frequency_mhz` | TPU clock frequency | MHz | `gpu_index`, `gpu_name` |
| `all_smi_gpu_info` | TPU device information | info | `gpu_index`, `gpu_name`, `driver_version` |
#### TPU-Specific Metrics
| Metric | Description | Unit | Labels |
|--------|-------------|------|--------|
| `all_smi_tpu_utilization_percent` | TPU duty cycle utilization | percent| `npu`, `instance`, `uuid`, `index` |
| `all_smi_tpu_memory_used_bytes` | TPU HBM memory used | bytes | `npu`, `instance`, `uuid`, `index` |
| `all_smi_tpu_memory_total_bytes` | TPU HBM memory total | bytes | `npu`, `instance`, `uuid`, `index` |
| `all_smi_tpu_memory_utilization_percent` | TPU HBM memory utilization percentage| percent| `npu`, `instance`, `uuid`, `index` |
| `all_smi_tpu_chip_version_info` | TPU chip version information | info | `npu`, `instance`, `uuid`, `index`, `version` |
| `all_smi_tpu_accelerator_type_info` | TPU accelerator type information | info | `npu`, `instance`, `uuid`, `index`, `type` |
| `all_smi_tpu_core_count` | Number of TPU cores | gauge | `npu`, `instance`, `uuid`, `index` |
| `all_smi_tpu_tensorcore_count` | Number of TensorCores per chip | gauge | `npu`, `instance`, `uuid`, `index` |
| `all_smi_tpu_memory_type_info` | TPU memory type (HBM2/HBM2e/HBM3e) | info | `npu`, `instance`, `uuid`, `index`, `type` |
| `all_smi_tpu_runtime_version_info` | TPU runtime/library version | info | `npu`, `instance`, `uuid`, `index`, `version` |
| `all_smi_tpu_power_max_watts` | TPU maximum power limit | watts | `npu`, `instance`, `uuid`, `index` |
| `all_smi_tpu_hlo_queue_size` | Number of pending HLO programs | gauge | `npu`, `instance`, `uuid`, `index` |
| `all_smi_tpu_hlo_exec_mean_microseconds` | HLO execution timing (mean) | µs | `npu`, `instance`, `uuid`, `index` |
| `all_smi_tpu_hlo_exec_p50_microseconds` | HLO execution timing (P50) | µs | `npu`, `instance`, `uuid`, `index` |
| `all_smi_tpu_hlo_exec_p90_microseconds` | HLO execution timing (P90) | µs | `npu`, `instance`, `uuid`, `index` |
| `all_smi_tpu_hlo_exec_p95_microseconds` | HLO execution timing (P95) | µs | `npu`, `instance`, `uuid`, `index` |
| `all_smi_tpu_hlo_exec_p999_microseconds` | HLO execution timing (P99.9) | µs | `npu`, `instance`, `uuid`, `index` |
Note: Google Cloud TPUs (v2-v7/Ironwood) are monitored via the `tpu-info` command-line tool running in streaming mode. Metrics include duty cycle utilization, HBM memory tracking, and chip configuration details.
### CPU Metrics (All Platforms)
| Metric | Description | Unit | Labels |
|--------|-------------|------|--------|
| `all_smi_cpu_utilization` | CPU utilization percentage | percent | - |
| `all_smi_cpu_socket_count` | Number of CPU sockets | count | - |
| `all_smi_cpu_core_count` | Total number of CPU cores | count | - |
| `all_smi_cpu_thread_count` | Total number of CPU threads| count | - |
| `all_smi_cpu_frequency_mhz` | CPU frequency | MHz | - |
| `all_smi_cpu_temperature_celsius` | CPU temperature | celsius | - |
| `all_smi_cpu_power_consumption_watts` | CPU power consumption | watts | - |
| `all_smi_cpu_socket_utilization` | Per-socket CPU utilization | percent | `socket` |
### Apple Silicon CPU Specific Metrics
| Metric | Description | Unit | Labels |
|--------|-------------|------|--------|
| `all_smi_cpu_p_core_count` | Number of performance cores | count | - |
| `all_smi_cpu_e_core_count` | Number of efficiency cores | count | - |
| `all_smi_cpu_gpu_core_count` | Number of integrated GPU cores | count | - |
| `all_smi_cpu_p_core_utilization` | P-core utilization percentage | percent | - |
| `all_smi_cpu_e_core_utilization` | E-core utilization percentage | percent | - |
| `all_smi_cpu_p_cluster_frequency_mhz` | P-cluster frequency | MHz | - |
| `all_smi_cpu_e_cluster_frequency_mhz` | E-cluster frequency | MHz | - |
### Memory Metrics (All Platforms)
| Metric | Description | Unit | Labels |
|--------|-------------|------|--------|
| `all_smi_memory_total_bytes` | Total system memory | bytes | - |
| `all_smi_memory_used_bytes` | Used system memory | bytes | - |
| `all_smi_memory_available_bytes` | Available system memory | bytes | - |
| `all_smi_memory_free_bytes` | Free system memory | bytes | - |
| `all_smi_memory_utilization` | Memory utilization percentage | percent | - |
| `all_smi_swap_total_bytes` | Total swap space | bytes | - |
| `all_smi_swap_used_bytes` | Used swap space | bytes | - |
| `all_smi_swap_free_bytes` | Free swap space | bytes | - |
### Linux-Specific Memory Metrics
| Metric | Description | Unit | Labels |
|--------|-------------|------|--------|
| `all_smi_memory_buffers_bytes` | Memory used for buffers | bytes | - |
| `all_smi_memory_cached_bytes` | Memory used for cache | bytes | - |
### Storage Metrics
| Metric | Description | Unit | Labels |
|--------|-------------|------|--------|
| `all_smi_disk_total_bytes` | Total disk space | bytes | `mount_point` |
| `all_smi_disk_available_bytes` | Available disk space | bytes | `mount_point` |
Note: Storage metrics exclude Docker bind mounts and are filtered to show only relevant filesystems.
### Chassis/Node-Level Metrics
Chassis metrics provide visibility into system-wide power consumption, thermal conditions, and cooling status at the node level. These metrics aggregate information from CPU, GPU, ANE, and BMC sensors.
#### Common Chassis Metrics (All Platforms)
| Metric | Description | Unit | Labels |
|--------|-------------|------|--------|
| `all_smi_chassis_power_watts` | Total chassis power consumption (CPU+GPU+ANE) | watts | `hostname`, `instance` |
#### Apple Silicon Chassis Metrics
| Metric | Description | Unit | Labels |
|--------|-------------|------|--------|
| `all_smi_chassis_thermal_pressure_info` | Thermal pressure level | info | `hostname`, `instance`, `level` |
| `all_smi_chassis_cpu_power_watts` | CPU power consumption | watts | `hostname`, `instance` |
| `all_smi_chassis_gpu_power_watts` | GPU power consumption | watts | `hostname`, `instance` |
| `all_smi_chassis_ane_power_watts` | ANE (Apple Neural Engine) power | watts | `hostname`, `instance` |
#### Server Chassis Metrics (BMC-enabled Systems)
| Metric | Description | Unit | Labels |
|--------|-------------|------|--------|
| `all_smi_chassis_inlet_temperature_celsius` | Chassis inlet temperature | celsius | `hostname`, `instance` |
| `all_smi_chassis_outlet_temperature_celsius`| Chassis outlet temperature | celsius | `hostname`, `instance` |
| `all_smi_chassis_fan_speed_rpm` | Fan speed | RPM | `hostname`, `instance`, `fan_id`, `fan_name` |
Note: Chassis metrics provide a unified view of node-level power consumption and thermal conditions, useful for cluster-wide capacity planning and power monitoring.
### Runtime Environment Metrics
| Metric | Description | Unit | Labels |
|--------|-------------|------|--------|
| `all_smi_runtime_environment` | Current runtime environment (container or VM) | gauge | `hostname`, `environment` |
| `all_smi_container_runtime_info` | Container runtime environment information | gauge | `hostname`, `runtime`, `container_id` |
| `all_smi_kubernetes_pod_info` | Kubernetes pod information (K8s only) | gauge | `hostname`, `pod_name`, `namespace` |
| `all_smi_virtualization_info` | Virtualization environment information | gauge | `hostname`, `vm_type`, `hypervisor` |
Runtime environment metrics are detected at startup and provide information about the execution context:
- Container environments: Docker, Kubernetes, Podman, containerd, LXC, CRI-O, Backend.AI
- Virtualization platforms: VMware, VirtualBox, KVM, QEMU, Hyper-V, Xen, AWS EC2, Google Cloud, Azure, DigitalOcean, Parallels
### Process Metrics (When --processes Flag is Used)
| Metric | Description | Unit | Labels |
|--------|-------------|------|--------|
| `all_smi_gpu_process_memory_bytes` | GPU memory used by process | bytes | `gpu_index`, `gpu_name`, `pid`, `process_name`, `user` |
| `all_smi_gpu_process_sm_util` | Process GPU SM utilization | percent | `gpu_index`, `gpu_name`, `pid`, `process_name`, `user` |
| `all_smi_gpu_process_mem_util` | Process GPU memory utilization | percent | `gpu_index`, `gpu_name`, `pid`, `process_name`, `user` |
| `all_smi_gpu_process_enc_util` | Process GPU encoder utilization | percent | `gpu_index`, `gpu_name`, `pid`, `process_name`, `user` |
| `all_smi_gpu_process_dec_util` | Process GPU decoder utilization | percent | `gpu_index`, `gpu_name`, `pid`, `process_name`, `user` |
## Platform Support Matrix
| Platform | GPU Metrics | CPU Metrics | Memory Metrics | Process Metrics |
|----------|-------------|-------------|----------------|-----------------|
| Linux + NVIDIA | ✓ Full | ✓ Full | ✓ Full | ✓ Full |
| Linux + Intel Gaudi | ✓ Full | ✓ Full | ✓ Full | ✗ N/A [7] |
| Linux + Tenstorrent | ✓ Full [3] | ✓ Full | ✓ Full | ✗ N/A [4] |
| Linux + Rebellions | ✓ Full | ✓ Full | ✓ Full | ✗ N/A [5] |
| Linux + Furiosa | ✓ Full | ✓ Full | ✓ Full | ✗ N/A [6] |
| Linux + Google TPU | ✓ Full | ✓ Full | ✓ Full | ✗ N/A [8] |
| macOS + Apple Silicon | ✓ Partial [1] | ✓ Enhanced [2] | ✓ Full | ✓ Basic |
| NVIDIA Jetson | ✓ Full + DLA | ✓ Full | ✓ Full | ✓ Full |

- [1] Apple Silicon (M1/M2/M3/M4) GPU metrics do not include temperature (thermal pressure provided instead)
- [2] Apple Silicon (M1/M2/M3/M4) provides enhanced P-core/E-core metrics and cluster frequencies
- [3] Tenstorrent provides extensive hardware monitoring including multiple temperature sensors, health counters, and status registers
- [4] Tenstorrent NPUs do not expose per-process GPU usage information
- [5] Rebellions NPUs do not expose per-process GPU usage information
- [6] Furiosa NPUs do not expose per-process GPU usage information
- [7] Intel Gaudi NPUs do not expose per-process GPU usage information via hl-smi
- [8] Google Cloud TPUs do not expose per-process GPU usage information via tpu-info
## Example Prometheus Queries
### Basic Monitoring
```promql
# Average GPU utilization across all GPUs
avg(all_smi_gpu_utilization)
# Memory usage percentage per GPU
(all_smi_gpu_memory_used_bytes / all_smi_gpu_memory_total_bytes) * 100
# GPUs running above 80°C
all_smi_gpu_temperature_celsius > 80
```
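The memory-percentage query works per GPU because PromQL pairs the two series by their full label sets before dividing. The sketch below reproduces that one-to-one matching on plain dictionaries; `memory_usage_percent` and the sample values are hypothetical, and, unlike PromQL, it skips zero totals instead of producing `+Inf` or `NaN`.

```python
def memory_usage_percent(used, total):
    """Divide used by total for series whose label sets match exactly.

    `used` and `total` map a frozenset of (label, value) pairs to a sample
    value, mimicking PromQL's default one-to-one vector matching.
    """
    result = {}
    for labels, used_bytes in used.items():
        total_bytes = total.get(labels)
        if total_bytes:  # skip unmatched series and zero totals
            result[labels] = used_bytes / total_bytes * 100
    return result


# Hypothetical samples: GPU 1 has no matching "total" series and is dropped.
gpu0 = frozenset({("gpu_index", "0"), ("gpu_name", "Example GPU")})
gpu1 = frozenset({("gpu_index", "1"), ("gpu_name", "Example GPU")})
used = {gpu0: 6 * 1024 ** 3, gpu1: 3 * 1024 ** 3}
total = {gpu0: 24 * 1024 ** 3}

for labels, percent in memory_usage_percent(used, total).items():
    print(dict(labels), percent)
```

The dropped series mirrors what happens in Prometheus when one side of a binary operation is missing: the result simply has no sample for that GPU, which is worth remembering when a panel shows fewer devices than expected.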
### Power Monitoring
```promql
# Total power consumption across all GPUs
sum(all_smi_gpu_power_consumption_watts)
# Power efficiency (utilization per watt)
all_smi_gpu_utilization / all_smi_gpu_power_consumption_watts
```
### AMD GPU Specific
```promql
# AMD GPUs with high fan speed (potential cooling issues)
all_smi_gpu_fan_speed_rpm > 3000
# VRAM utilization percentage
(all_smi_gpu_memory_vram_bytes / all_smi_gpu_memory_total_bytes) * 100
# AMD GPUs approaching power cap
all_smi_gpu_power_consumption_watts / all_smi_amd_power_cap_watts > 0.9
# Total GPU-accessible memory in use (VRAM + GTT)
all_smi_gpu_memory_vram_bytes + all_smi_gpu_memory_gtt_bytes
# AMD GPU thermal efficiency (utilization per degree)
all_smi_gpu_utilization / all_smi_gpu_temperature_celsius
```
### Apple Silicon Specific
```promql
# P-core vs E-core utilization comparison
all_smi_cpu_p_core_utilization - all_smi_cpu_e_core_utilization
# ANE power consumption in watts
all_smi_ane_power_watts
```
### Tenstorrent NPU Specific
```promql
# NPUs with high temperature on any sensor
max by (instance) ({
  __name__=~"all_smi_tenstorrent_.*_temperature_celsius",
  instance=~"tt.*"
}) > 80
# Power efficiency by board type
all_smi_gpu_utilization / on(instance) group_left(board_type)
(all_smi_tenstorrent_board_info * 0 + all_smi_gpu_power_consumption_watts)
# Throttling detection
all_smi_tenstorrent_throttler > 0
# Health monitoring - ARC processors not incrementing
rate(all_smi_tenstorrent_arc0_health[5m]) == 0
```
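The health check above relies on the ARC counters advancing between scrapes. As a rough illustration of what `rate(...) == 0` detects, the sketch below compares two snapshots of a counter keyed by device; `stalled_counters` and the device keys are hypothetical, and unlike `rate()` it looks at only two samples, not a time window.

```python
def stalled_counters(previous, current):
    """Return device keys whose counter did not advance between two scrapes.

    A lower value is treated as a counter reset (for example after a device
    restart), not as a stall, which matches how Prometheus counters behave.
    """
    return sorted(
        key
        for key, value in current.items()
        if key in previous and value == previous[key]
    )


# Hypothetical heartbeat snapshots taken one scrape apart.
previous = {"tt0": 100, "tt1": 50}
current = {"tt0": 100, "tt1": 57, "tt2": 1}

print(stalled_counters(previous, current))
```

Devices that appear only in the newer snapshot are ignored because there is nothing to compare against yet.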
### Rebellions NPU Specific
```promql
# NPUs in low performance state
all_smi_rebellions_performance_state > 0
# Devices with non-operational status
all_smi_rebellions_device_status != 1
# Power efficiency (TOPS per watt)
all_smi_rebellions_compute_tops / all_smi_gpu_power_consumption_watts
# Memory capacity nearly exhausted (> 90% used)
(all_smi_gpu_memory_used_bytes / all_smi_gpu_memory_total_bytes) > 0.9
```
### Furiosa NPU Specific
```promql
# NPUs with unavailable cores
all_smi_furiosa_core_status == 0
# Average PE utilization across all cores
avg by (instance) (all_smi_furiosa_pe_utilization)
# NPUs with high error rates
rate(all_smi_furiosa_error_count[5m]) > 0.1
# Power governor not in OnDemand mode
all_smi_furiosa_power_governor_info{governor!="OnDemand"}
# Memory bandwidth bottleneck detection
all_smi_furiosa_memory_bandwidth_utilization > 80
```
### Intel Gaudi NPU Specific
```promql
# NPUs with high AIP utilization
all_smi_gaudi_aip_utilization_percent > 80
# HBM memory utilization across cluster
avg by (instance) (all_smi_gaudi_memory_utilization_percent)
# NPUs approaching power limit
all_smi_gaudi_power_draw_watts / all_smi_gaudi_power_max_watts > 0.9
# Power efficiency (AIP utilization per watt)
all_smi_gaudi_aip_utilization_percent / all_smi_gaudi_power_draw_watts
# NPUs running hot (temperature > 70°C)
all_smi_gaudi_temperature_celsius > 70
# Total HBM memory usage across all Gaudi NPUs
sum(all_smi_gaudi_memory_used_bytes)
# Gaudi NPUs by device variant
count by (internal_name) (all_smi_gaudi_internal_name_info)
# Driver version consistency check (fires when more than one driver version is present)
count(count by (version) (all_smi_gaudi_driver_info)) > 1
```
### Google TPU Specific
```promql
# TPU utilization across all chips
avg(all_smi_tpu_utilization_percent)
# HBM memory utilization percentage
all_smi_tpu_memory_utilization_percent
# Count TPUs by accelerator type
count by (type) (all_smi_tpu_accelerator_type_info)
# Monitor HLO queue size
all_smi_tpu_hlo_queue_size > 5
# Alert on high HLO execution latency
all_smi_tpu_hlo_exec_p90_microseconds > 1000000
```
### Process Monitoring
```promql
# Top 5 GPU memory consumers
topk(5, all_smi_gpu_process_memory_bytes)
# Processes using more than 1GB GPU memory
all_smi_gpu_process_memory_bytes > 1073741824
```
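The same two selections can be applied to already-parsed samples, for instance in a script that reads `/metrics` directly. This is a small sketch; `top_memory_consumers`, `over_threshold`, and the sample processes are hypothetical, and each sample is assumed to be a `(labels, bytes)` pair.

```python
import heapq

GIB = 1024 ** 3  # 1073741824 bytes, the threshold used in the query above


def top_memory_consumers(samples, k=5):
    """Return the k largest (labels, bytes) pairs, similar to topk(k, ...)."""
    return heapq.nlargest(k, samples, key=lambda item: item[1])


def over_threshold(samples, threshold=GIB):
    """Keep processes using strictly more than `threshold` bytes."""
    return [item for item in samples if item[1] > threshold]


# Hypothetical per-process samples.
samples = [
    ({"pid": "101", "process_name": "trainer"}, 8 * GIB),
    ({"pid": "202", "process_name": "viewer"}, 256 * 1024 ** 2),
    ({"pid": "303", "process_name": "server"}, GIB),
]

print(top_memory_consumers(samples, k=2))
print(over_threshold(samples))
```

Note that a process using exactly 1 GiB is not reported by the threshold filter, matching the strict `>` comparison in the PromQL query.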
### Chassis/Node-Level Monitoring
```promql
# Total power consumption across all nodes
sum(all_smi_chassis_power_watts)
# Nodes with high power consumption (> 3000W)
all_smi_chassis_power_watts > 3000
# Power breakdown by component (Apple Silicon)
sum by (hostname) (all_smi_chassis_cpu_power_watts)
sum by (hostname) (all_smi_chassis_gpu_power_watts)
sum by (hostname) (all_smi_chassis_ane_power_watts)
# Nodes with non-nominal thermal pressure
all_smi_chassis_thermal_pressure_info{level!="Nominal"}
# Average chassis power per node
avg(all_smi_chassis_power_watts)
# Nodes with high inlet temperature
all_smi_chassis_inlet_temperature_celsius > 35
# Delta between inlet and outlet temperature (thermal dissipation)
all_smi_chassis_outlet_temperature_celsius - all_smi_chassis_inlet_temperature_celsius
# Fan speed monitoring
avg by (hostname) (all_smi_chassis_fan_speed_rpm)
```
### Runtime Environment Monitoring
```promql
# All containers running in Kubernetes
all_smi_container_runtime_info{runtime="Kubernetes"}
# All instances running in AWS EC2
all_smi_virtualization_info{vm_type="AWS EC2"}
# Containers running in Backend.AI
all_smi_runtime_environment{environment="Backend.AI"}
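# GPU utilization summed per runtime environment
# (join on hostname first, then aggregate; aggregating first would drop the hostname label)
sum by (environment) (all_smi_gpu_utilization * on(hostname) group_left(environment) all_smi_runtime_environment)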
```
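The queries above can also be run programmatically through the Prometheus HTTP API. The following is a minimal sketch, assuming a Prometheus server that already scrapes `all-smi`. The server address `http://prometheus.example:9090` and the sample response values are hypothetical. The `/api/v1/query` endpoint and the response shape are those of the standard Prometheus instant-query API.

```python
import json
from urllib.parse import urlencode

# Hypothetical Prometheus server address (an assumption); adjust to your deployment.
PROMETHEUS_URL = "http://prometheus.example:9090"


def build_query_url(expr: str, base_url: str = PROMETHEUS_URL) -> str:
    """Return the instant-query URL for a PromQL expression."""
    params = urlencode({"query": expr})
    return f"{base_url}/api/v1/query?{params}"


def parse_instant_vector(body: str) -> list[tuple[dict, float]]:
    """Extract (labels, value) pairs from an instant-query JSON response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample value is a pair: [unix_timestamp, "value-as-string"].
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


# Made-up response body so the sketch runs without a live server.
sample_body = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"hostname": "node-a"}, "value": [1700000000, "3120.5"]},
            {"metric": {"hostname": "node-b"}, "value": [1700000000, "3480"]},
        ],
    },
})

url = build_query_url("all_smi_chassis_power_watts > 3000")
hot_nodes = parse_instant_vector(sample_body)
print(url)
for labels, watts in hot_nodes:
    print(labels.get("hostname"), watts)
```

In a real deployment, the body would come from an HTTP GET on the URL returned by `build_query_url` (for example with `urllib.request.urlopen`) instead of `sample_body`.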
## Integration Examples
### Grafana Dashboard
Create a comprehensive monitoring dashboard with:
- GPU utilization heatmap
- Memory usage time series
- Power consumption stacked graph
- Temperature alerts
- Process resource usage table
### AlertManager Rules
```yaml
groups:
  - name: gpu_alerts
    rules:
      - alert: HighGPUTemperature
        expr: all_smi_gpu_temperature_celsius > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu_name }} is running hot"
      - alert: GPUMemoryExhausted
        expr: (all_smi_gpu_memory_used_bytes / all_smi_gpu_memory_total_bytes) > 0.95
        for: 5m
        labels:
          severity: critical
      - alert: TenstorrentNPUFault
        expr: all_smi_tenstorrent_faults > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Tenstorrent NPU {{ $labels.instance }} has fault condition"
      - alert: TenstorrentNPUThrottling
        expr: all_smi_tenstorrent_throttler > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Tenstorrent NPU {{ $labels.instance }} is throttling"
      - alert: RebellionsNPULowPerformance
        expr: all_smi_rebellions_performance_state > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Rebellions NPU {{ $labels.instance }} stuck in low performance state P{{ $value }}"
      - alert: FuriosaNPUCoreFailure
        expr: all_smi_furiosa_core_status == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Furiosa NPU {{ $labels.instance }} has unavailable core {{ $labels.core }}"
      - alert: FuriosaNPUHighErrorRate
        expr: rate(all_smi_furiosa_error_count[5m]) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Furiosa NPU {{ $labels.instance }} experiencing high error rate"
      - alert: GaudiNPUHighTemperature
        expr: all_smi_gaudi_temperature_celsius > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Intel Gaudi NPU {{ $labels.instance }} is running hot at {{ $value }}°C"
      - alert: GaudiNPUPowerLimitApproaching
        expr: all_smi_gaudi_power_draw_watts / all_smi_gaudi_power_max_watts > 0.95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Intel Gaudi NPU {{ $labels.instance }} approaching power limit"
      - alert: GaudiNPUHBMMemoryExhausted
        expr: all_smi_gaudi_memory_utilization_percent > 95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Intel Gaudi NPU {{ $labels.instance }} HBM memory nearly exhausted"
      - alert: ChassisHighPowerConsumption
        expr: all_smi_chassis_power_watts > 3500
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Chassis {{ $labels.hostname }} power consumption is high at {{ $value }}W"
      - alert: ChassisThermalPressureElevated
        expr: all_smi_chassis_thermal_pressure_info{level!="Nominal"} == 1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Chassis {{ $labels.hostname }} thermal pressure elevated to {{ $labels.level }}"
      - alert: ChassisHighInletTemperature
        expr: all_smi_chassis_inlet_temperature_celsius > 40
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Chassis {{ $labels.hostname }} inlet temperature is high at {{ $value }}°C"
```
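Ratio expressions such as the one in `GPUMemoryExhausted` divide series pairwise: each `used` series is matched with the `total` series that carries the same labels. The sketch below mimics that matching in plain Python so a threshold can be sanity-checked against sample readings. The readings and the `(gpu_index, gpu_name)` key are illustrative assumptions. The `for: 5m` hold period is not modeled.

```python
GIB = 1024 ** 3


def memory_exhausted(used: dict, total: dict, threshold: float = 0.95) -> dict:
    """Return {label_key: ratio} for devices whose used/total exceeds threshold.

    Both inputs map a label key, here (gpu_index, gpu_name), to a byte count.
    """
    alerts = {}
    for key, used_bytes in used.items():
        total_bytes = total.get(key)
        # Skip series without a matching partner. A zero total is also skipped
        # here, whereas PromQL would produce +Inf or NaN for that division.
        if not total_bytes:
            continue
        ratio = used_bytes / total_bytes
        if ratio > threshold:
            alerts[key] = ratio
    return alerts


# Made-up readings for two devices.
used = {("0", "NVIDIA H100"): 78 * GIB, ("1", "NVIDIA H100"): 40 * GIB}
total = {("0", "NVIDIA H100"): 80 * GIB, ("1", "NVIDIA H100"): 80 * GIB}

alerts = memory_exhausted(used, total)
for (gpu_index, gpu_name), ratio in alerts.items():
    print(f"GPU {gpu_index} ({gpu_name}) at {ratio:.1%} memory")
```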
## Update Intervals
The metrics update interval can be configured with the `--interval` flag (in seconds):
- Default: 3 seconds
- Minimum recommended: 1 second
- Maximum recommended: 60 seconds
Shorter intervals provide fresher data but increase system load. For production monitoring, an interval of 5-10 seconds is typically sufficient.
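As a sketch of how a launcher script might apply these recommendations, the helper below clamps a requested interval to the recommended 1-60 second range before building the `all-smi api` command line. The helper names are hypothetical, and it assumes `--interval` takes whole seconds, as in the examples earlier in this document.

```python
DEFAULT_INTERVAL = 3   # seconds, the documented default
MIN_INTERVAL = 1       # recommended minimum
MAX_INTERVAL = 60      # recommended maximum


def clamp_interval(seconds: int) -> int:
    """Clamp an interval to the recommended range."""
    return max(MIN_INTERVAL, min(MAX_INTERVAL, seconds))


def api_command(port: int, interval: int = DEFAULT_INTERVAL) -> list[str]:
    """Build the argument list for `all-smi api` with a clamped interval."""
    return [
        "all-smi", "api",
        "--port", str(port),
        "--interval", str(clamp_interval(interval)),
    ]


# Usable with subprocess.run(cmd) on a host where all-smi is installed.
cmd = api_command(9090, interval=5)
print(" ".join(cmd))
```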
## Notes
1. All metrics follow Prometheus naming conventions
2. Labels are used to differentiate between multiple devices
3. Info metrics (ending in `_info`) provide static metadata
4. Some metrics may not be available on all platforms
5. Process metrics require the `--processes` flag and may impact performance
6. Tenstorrent NPU metrics include comprehensive hardware monitoring data:
    - Multiple temperature sensors (ASIC, voltage regulator, inlet/outlet)
    - Detailed firmware versions and health counters
    - Power limits (TDP/TDC) and throttling information
    - PCIe and DDR status registers for diagnostics
7. Tenstorrent utilization is calculated based on power consumption as a proxy metric
8. Rebellions NPU metrics include:
    - Performance state monitoring (P0-P15) for power management
    - Device status and KMD version tracking
    - Support for ATOM, ATOM+, and ATOM Max variants
    - PCIe Gen4 x16 interface metrics
9. Furiosa NPU metrics include:
    - Per-core PE utilization monitoring
    - Core availability status tracking
    - Power governor mode information
    - Error counting and liveness monitoring
    - RNGD architecture with 8 cores per NPU
10. Intel Gaudi NPU metrics include:
    - AIP (AI Processor) utilization monitoring
    - HBM memory usage and utilization tracking (up to 128GB per device)
    - Power consumption with configurable power limits (up to 850W)
    - Temperature monitoring
    - Automatic device name mapping (HL-325L → Intel Gaudi 3 PCIe LP)
    - Support for Gaudi 1/2/3 across PCIe, OAM, UBB, and HLS form factors
    - Background process monitoring via hl-smi with circular buffer
11. Chassis/Node-level metrics include:
    - Total chassis power consumption aggregating CPU, GPU, and ANE power
    - Thermal pressure monitoring (Apple Silicon)
    - Individual power component breakdown (CPU, GPU, ANE)
    - Inlet/outlet temperature monitoring (BMC-enabled servers)
    - Fan speed monitoring with per-fan granularity
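
Notes 2 and 3 can be illustrated with a small parser for the Prometheus text format. This is a minimal sketch, not a full implementation of the exposition format: it ignores timestamps and does not unescape label values. The sample text is made up, including the `HELP` line and the driver version, and real `all-smi` output may carry additional labels. It also assumes info metrics report a constant value of `1`, the usual Prometheus convention.

```python
import re

# Matches `name{labels} value` or `name value`.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> list[tuple[str, dict[str, str], float]]:
    """Parse exposition text into (metric_name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Illustrative scrape output with made-up values.
sample_text = """\
# HELP all_smi_gpu_utilization GPU utilization percentage
# TYPE all_smi_gpu_utilization gauge
all_smi_gpu_utilization{gpu_index="0",gpu_name="NVIDIA H100"} 87
all_smi_gpu_utilization{gpu_index="1",gpu_name="NVIDIA H100"} 12
all_smi_gpu_info{gpu_index="0",gpu_name="NVIDIA H100",driver_version="550.54"} 1
"""

samples = parse_metrics(sample_text)

# Labels distinguish devices: one utilization series per gpu_index.
utilization = {
    labels["gpu_index"]: value
    for name, labels, value in samples
    if name == "all_smi_gpu_utilization"
}

# Info metrics carry their metadata in labels rather than in the value.
drivers = {
    labels["gpu_index"]: labels["driver_version"]
    for name, labels, _ in samples
    if name == "all_smi_gpu_info"
}

print(utilization)
print(drivers)
```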