all-smi 0.7.1

Command-line utility for monitoring GPU hardware. It provides a real-time view of GPU utilization, memory usage, temperature, power consumption, and other metrics.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
# API Metrics Reference

`all-smi` provides comprehensive hardware metrics in Prometheus format through its API mode. This document details all available metrics across different hardware platforms.

## Starting API Mode

```bash
# Start API server
all-smi api --port 9090

# Custom update interval (default: 3 seconds)
all-smi api --port 9090 --interval 5

# Include process information
all-smi api --port 9090 --processes
```

Metrics are available at `http://localhost:9090/metrics`

## Available Metrics

### GPU Metrics (All Platforms)

| Metric                                | Description                | Unit    | Labels                                    |
|---------------------------------------|----------------------------|---------|-------------------------------------------|
| `all_smi_gpu_utilization`             | GPU utilization percentage | percent | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_memory_used_bytes`       | GPU memory used            | bytes   | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_memory_total_bytes`      | GPU memory total           | bytes   | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_temperature_celsius`     | GPU temperature            | celsius | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_power_consumption_watts` | GPU power consumption      | watts   | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_frequency_mhz`           | GPU frequency              | MHz     | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_info`                    | GPU device information     | info    | `gpu_index`, `gpu_name`, `driver_version` |

### NVIDIA GPU Specific Metrics

| Metric                                  | Description                              | Unit  | Labels                  |
|-----------------------------------------|------------------------------------------|-------|-------------------------|
| `all_smi_gpu_pcie_gen_current`          | Current PCIe generation                  | -     | `gpu_index`, `gpu_name` |
| `all_smi_gpu_pcie_width_current`        | Current PCIe link width                  | -     | `gpu_index`, `gpu_name` |
| `all_smi_gpu_performance_state`         | GPU performance state (P0=0, P1=1, etc.) | -     | `gpu_index`, `gpu_name` |
| `all_smi_gpu_clock_graphics_max_mhz`    | Maximum graphics clock                   | MHz   | `gpu_index`, `gpu_name` |
| `all_smi_gpu_clock_memory_max_mhz`      | Maximum memory clock                     | MHz   | `gpu_index`, `gpu_name` |
| `all_smi_gpu_power_limit_current_watts` | Current power limit                      | watts | `gpu_index`, `gpu_name` |
| `all_smi_gpu_power_limit_max_watts`     | Maximum power limit                      | watts | `gpu_index`, `gpu_name` |

### NVIDIA Jetson Specific Metrics

| Metric                    | Description                                 | Unit    | Labels                  |
|---------------------------|---------------------------------------------|---------|-------------------------|
| `all_smi_dla_utilization` | DLA (Deep Learning Accelerator) utilization | percent | `gpu_index`, `gpu_name` |

### Apple Silicon GPU Specific Metrics

| Metric                          | Description            | Unit  | Labels                           |
|---------------------------------|------------------------|-------|----------------------------------|
| `all_smi_ane_utilization`       | ANE utilization        | mW    | `gpu_index`, `gpu_name`          |
| `all_smi_ane_power_watts`       | ANE power consumption  | watts | `gpu_index`, `gpu_name`          |
| `all_smi_thermal_pressure_info` | Thermal pressure level | info  | `gpu_index`, `gpu_name`, `level` |

Note: For Apple Silicon (M1/M2/M3/M4), `gpu_temperature_celsius` is not available; thermal pressure level is provided instead.

### Tenstorrent NPU Metrics

#### Basic NPU Metrics
| Metric                                | Description                | Unit    | Labels                                    |
|---------------------------------------|----------------------------|---------|-------------------------------------------|
| `all_smi_gpu_utilization`             | NPU utilization percentage | percent | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_memory_used_bytes`       | NPU memory used            | bytes   | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_memory_total_bytes`      | NPU memory total           | bytes   | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_temperature_celsius`     | NPU ASIC temperature       | celsius | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_power_consumption_watts` | NPU power consumption      | watts   | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_frequency_mhz`           | NPU AI clock frequency     | MHz     | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_info`                    | NPU device information     | info    | `gpu_index`, `gpu_name`, `driver_version` |
| `all_smi_npu_firmware_info`           | NPU firmware version       | info    | `npu`, `instance`, `uuid`, `index`, `firmware` |

#### Tenstorrent-Specific Metrics
| Metric                                          | Description                        | Unit    | Labels                                                    |
|-------------------------------------------------|------------------------------------|---------|-----------------------------------------------------------|
| `all_smi_tenstorrent_board_info`                | Board and architecture information | info    | `npu`, `instance`, `uuid`, `index`, `board_type`, `board_id`, `architecture` |
| `all_smi_tenstorrent_collection_method_info`    | Data collection method used        | info    | `npu`, `instance`, `uuid`, `index`, `method`             |
| **Firmware Versions**                           |                                    |         |                                                           |
| `all_smi_tenstorrent_arc_firmware_info`         | ARC firmware version               | info    | `npu`, `instance`, `uuid`, `index`, `version`            |
| `all_smi_tenstorrent_eth_firmware_info`         | Ethernet firmware version          | info    | `npu`, `instance`, `uuid`, `index`, `version`            |
| `all_smi_tenstorrent_ddr_firmware_info`         | DDR firmware version               | info    | `npu`, `instance`, `uuid`, `index`, `version`            |
| `all_smi_tenstorrent_spibootrom_firmware_info`  | SPI Boot ROM firmware version      | info    | `npu`, `instance`, `uuid`, `index`, `version`            |
| `all_smi_tenstorrent_firmware_date_info`        | Firmware build date                | info    | `npu`, `instance`, `uuid`, `index`, `date`               |
| **Temperature Sensors**                         |                                    |         |                                                           |
| `all_smi_tenstorrent_asic_temperature_celsius`  | ASIC temperature                   | celsius | `npu`, `instance`, `uuid`, `index`                       |
| `all_smi_tenstorrent_vreg_temperature_celsius`  | Voltage regulator temperature      | celsius | `npu`, `instance`, `uuid`, `index`                       |
| `all_smi_tenstorrent_inlet_temperature_celsius` | Inlet temperature                  | celsius | `npu`, `instance`, `uuid`, `index`                       |
| `all_smi_tenstorrent_outlet1_temperature_celsius`| Outlet 1 temperature              | celsius | `npu`, `instance`, `uuid`, `index`                       |
| `all_smi_tenstorrent_outlet2_temperature_celsius`| Outlet 2 temperature              | celsius | `npu`, `instance`, `uuid`, `index`                       |
| **Clock Frequencies**                           |                                    |         |                                                           |
| `all_smi_tenstorrent_aiclk_mhz`                | AI clock frequency                 | MHz     | `npu`, `instance`, `uuid`, `index`                       |
| `all_smi_tenstorrent_axiclk_mhz`               | AXI clock frequency                | MHz     | `npu`, `instance`, `uuid`, `index`                       |
| `all_smi_tenstorrent_arcclk_mhz`               | ARC clock frequency                | MHz     | `npu`, `instance`, `uuid`, `index`                       |
| **Power and Electrical**                        |                                    |         |                                                           |
| `all_smi_tenstorrent_voltage_volts`            | Core voltage                       | volts   | `npu`, `instance`, `uuid`, `index`                       |
| `all_smi_tenstorrent_current_amperes`          | Current draw                       | amperes | `npu`, `instance`, `uuid`, `index`                       |
| `all_smi_tenstorrent_power_raw_watts`          | Raw power consumption              | watts   | `npu`, `instance`, `uuid`, `index`                       |
| `all_smi_tenstorrent_tdp_limit_watts`          | TDP limit                          | watts   | `npu`, `instance`, `uuid`, `index`                       |
| `all_smi_tenstorrent_tdc_limit_amperes`        | TDC limit                          | amperes | `npu`, `instance`, `uuid`, `index`                       |
| **Status and Health**                           |                                    |         |                                                           |
| `all_smi_tenstorrent_heartbeat`                | Device heartbeat counter           | counter | `npu`, `instance`, `uuid`, `index`                       |
| `all_smi_tenstorrent_arc0_health`              | ARC0 health counter                | counter | `npu`, `instance`, `uuid`, `index`                       |
| `all_smi_tenstorrent_arc3_health`              | ARC3 health counter                | counter | `npu`, `instance`, `uuid`, `index`                       |
| `all_smi_tenstorrent_faults`                   | Fault register value               | gauge   | `npu`, `instance`, `uuid`, `index`                       |
| `all_smi_tenstorrent_throttler`                | Throttler state register           | gauge   | `npu`, `instance`, `uuid`, `index`                       |
| `all_smi_tenstorrent_pcie_status_info`         | PCIe status register               | info    | `npu`, `instance`, `uuid`, `index`, `status`             |
| `all_smi_tenstorrent_eth_status_info`          | Ethernet status register           | info    | `npu`, `instance`, `uuid`, `index`, `port`, `status`     |
| `all_smi_tenstorrent_ddr_status`               | DDR status register                | gauge   | `npu`, `instance`, `uuid`, `index`                       |
| **Fan Metrics**                                 |                                    |         |                                                           |
| `all_smi_tenstorrent_fan_speed_percent`        | Fan speed percentage               | percent | `npu`, `instance`, `uuid`, `index`                       |
| `all_smi_tenstorrent_fan_rpm`                  | Fan speed in RPM                   | gauge   | `npu`, `instance`, `uuid`, `index`                       |
| **PCIe Information**                            |                                    |         |                                                           |
| `all_smi_tenstorrent_pcie_generation`          | PCIe generation                    | gauge   | `npu`, `instance`, `uuid`, `index`                       |
| `all_smi_tenstorrent_pcie_width`               | PCIe link width                    | gauge   | `npu`, `instance`, `uuid`, `index`                       |
| `all_smi_tenstorrent_pcie_address_info`        | PCIe address                       | info    | `npu`, `instance`, `uuid`, `index`, `address`            |
| `all_smi_tenstorrent_pcie_device_info`         | PCIe device identification         | info    | `npu`, `instance`, `uuid`, `index`, `vendor_id`, `device_id` |
| **DRAM Information**                            |                                    |         |                                                           |
| `all_smi_tenstorrent_dram_info`                | DRAM configuration                 | info    | `npu`, `instance`, `uuid`, `index`, `speed`              |

Note: Tenstorrent NPUs use the same basic metric names as GPUs for compatibility with existing monitoring infrastructure. Additional Tenstorrent-specific metrics provide detailed hardware monitoring capabilities.

### Rebellions NPU Metrics

#### Basic NPU Metrics
| Metric                                | Description                | Unit    | Labels                                    |
|---------------------------------------|----------------------------|---------|-------------------------------------------|
| `all_smi_gpu_utilization`             | NPU utilization percentage | percent | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_memory_used_bytes`       | NPU memory used            | bytes   | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_memory_total_bytes`      | NPU memory total           | bytes   | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_temperature_celsius`     | NPU temperature            | celsius | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_power_consumption_watts` | NPU power consumption      | watts   | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_frequency_mhz`           | NPU clock frequency        | MHz     | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_info`                    | NPU device information     | info    | `gpu_index`, `gpu_name`, `driver_version` |

#### Rebellions-Specific Metrics
| Metric                                    | Description                          | Unit  | Labels                                                               |
|-------------------------------------------|--------------------------------------|-------|----------------------------------------------------------------------|
| `all_smi_rebellions_device_info`          | Device model and variant information | info  | `npu`, `instance`, `uuid`, `index`, `model`, `variant`              |
| `all_smi_rebellions_firmware_info`        | NPU firmware version                 | info  | `npu`, `instance`, `uuid`, `index`, `firmware_version`              |
| `all_smi_rebellions_kmd_info`             | Kernel Mode Driver version           | info  | `npu`, `instance`, `uuid`, `index`, `kmd_version`                   |
| `all_smi_rebellions_device_status`        | Device operational status            | gauge | `npu`, `instance`, `uuid`, `index`                                  |
| `all_smi_rebellions_performance_state`    | NPU performance state (P0-P15)       | gauge | `npu`, `instance`, `uuid`, `index`                                  |
| `all_smi_rebellions_pcie_generation`      | PCIe generation (Gen4)               | gauge | `npu`, `instance`, `uuid`, `index`                                  |
| `all_smi_rebellions_pcie_width`           | PCIe link width (x16)                | gauge | `npu`, `instance`, `uuid`, `index`                                  |
| `all_smi_rebellions_memory_bandwidth_gbps`| Memory bandwidth capacity            | gauge | `npu`, `instance`, `uuid`, `index`                                  |
| `all_smi_rebellions_compute_tops`         | Compute capacity in TOPS             | gauge | `npu`, `instance`, `uuid`, `index`                                  |

Note: Rebellions NPUs support ATOM, ATOM+, and ATOM Max variants with varying compute and memory capabilities. All variants use PCIe Gen4 x16 interface.

### Furiosa NPU Metrics

#### Basic NPU Metrics
| Metric                                | Description                | Unit    | Labels                                    |
|---------------------------------------|----------------------------|---------|-------------------------------------------|
| `all_smi_gpu_utilization`             | NPU utilization percentage | percent | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_memory_used_bytes`       | NPU memory used            | bytes   | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_memory_total_bytes`      | NPU memory total           | bytes   | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_temperature_celsius`     | NPU temperature            | celsius | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_power_consumption_watts` | NPU power consumption      | watts   | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_frequency_mhz`           | NPU clock frequency        | MHz     | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_info`                    | NPU device information     | info    | `gpu_index`, `gpu_name`, `driver_version` |

#### Furiosa-Specific Metrics
| Metric                                      | Description                            | Unit    | Labels                                                               |
|---------------------------------------------|----------------------------------------|---------|----------------------------------------------------------------------|
| `all_smi_furiosa_device_info`               | Device architecture and model info     | info    | `npu`, `instance`, `uuid`, `index`, `architecture`, `model`         |
| `all_smi_furiosa_firmware_info`             | NPU firmware version                   | info    | `npu`, `instance`, `uuid`, `index`, `firmware_version`              |
| `all_smi_furiosa_pert_info`                 | PERT (runtime) version                 | info    | `npu`, `instance`, `uuid`, `index`, `pert_version`                  |
| `all_smi_furiosa_liveness_status`           | Device liveness status                 | gauge   | `npu`, `instance`, `uuid`, `index`                                  |
| `all_smi_furiosa_core_count`                | Number of cores in NPU                 | gauge   | `npu`, `instance`, `uuid`, `index`                                  |
| `all_smi_furiosa_core_status`               | Core availability status               | gauge   | `npu`, `instance`, `uuid`, `index`, `core`                          |
| `all_smi_furiosa_pe_utilization`            | Processing Element utilization         | percent | `npu`, `instance`, `uuid`, `index`, `core`                          |
| `all_smi_furiosa_core_frequency_mhz`        | Per-core frequency                     | MHz     | `npu`, `instance`, `uuid`, `index`, `core`                          |
| `all_smi_furiosa_power_governor_info`       | Power governor mode                    | info    | `npu`, `instance`, `uuid`, `index`, `governor`                      |
| `all_smi_furiosa_error_count`               | Cumulative error count                 | counter | `npu`, `instance`, `uuid`, `index`                                  |
| `all_smi_furiosa_pcie_generation`           | PCIe generation                        | gauge   | `npu`, `instance`, `uuid`, `index`                                  |
| `all_smi_furiosa_pcie_width`                | PCIe link width                        | gauge   | `npu`, `instance`, `uuid`, `index`                                  |
| `all_smi_furiosa_memory_bandwidth_utilization` | Memory bandwidth utilization        | percent | `npu`, `instance`, `uuid`, `index`                                  |

Note: Furiosa NPUs use the RNGD architecture with 8 cores per NPU. Each core contains multiple Processing Elements (PEs) that handle neural network computations. The power governor supports OnDemand mode for dynamic power management.

### CPU Metrics (All Platforms)

| Metric                                | Description                | Unit    | Labels   |
|---------------------------------------|----------------------------|---------|----------|
| `all_smi_cpu_utilization`             | CPU utilization percentage | percent | -        |
| `all_smi_cpu_socket_count`            | Number of CPU sockets      | count   | -        |
| `all_smi_cpu_core_count`              | Total number of CPU cores  | count   | -        |
| `all_smi_cpu_thread_count`            | Total number of CPU threads| count   | -        |
| `all_smi_cpu_frequency_mhz`           | CPU frequency              | MHz     | -        |
| `all_smi_cpu_temperature_celsius`     | CPU temperature            | celsius | -        |
| `all_smi_cpu_power_consumption_watts` | CPU power consumption      | watts   | -        |
| `all_smi_cpu_socket_utilization`      | Per-socket CPU utilization | percent | `socket` |

### Apple Silicon CPU Specific Metrics

| Metric                                | Description                    | Unit    | Labels |
|---------------------------------------|--------------------------------|---------|--------|
| `all_smi_cpu_p_core_count`            | Number of performance cores    | count   | -      |
| `all_smi_cpu_e_core_count`            | Number of efficiency cores     | count   | -      |
| `all_smi_cpu_gpu_core_count`          | Number of integrated GPU cores | count   | -      |
| `all_smi_cpu_p_core_utilization`      | P-core utilization percentage  | percent | -      |
| `all_smi_cpu_e_core_utilization`      | E-core utilization percentage  | percent | -      |
| `all_smi_cpu_p_cluster_frequency_mhz` | P-cluster frequency            | MHz     | -      |
| `all_smi_cpu_e_cluster_frequency_mhz` | E-cluster frequency            | MHz     | -      |

### Memory Metrics (All Platforms)

| Metric                           | Description                   | Unit    | Labels |
|----------------------------------|-------------------------------|---------|--------|
| `all_smi_memory_total_bytes`     | Total system memory           | bytes   | -      |
| `all_smi_memory_used_bytes`      | Used system memory            | bytes   | -      |
| `all_smi_memory_available_bytes` | Available system memory       | bytes   | -      |
| `all_smi_memory_free_bytes`      | Free system memory            | bytes   | -      |
| `all_smi_memory_utilization`     | Memory utilization percentage | percent | -      |
| `all_smi_swap_total_bytes`       | Total swap space              | bytes   | -      |
| `all_smi_swap_used_bytes`        | Used swap space               | bytes   | -      |
| `all_smi_swap_free_bytes`        | Free swap space               | bytes   | -      |

### Linux-Specific Memory Metrics

| Metric                         | Description             | Unit  | Labels |
|--------------------------------|-------------------------|-------|--------|
| `all_smi_memory_buffers_bytes` | Memory used for buffers | bytes | -      |
| `all_smi_memory_cached_bytes`  | Memory used for cache   | bytes | -      |

### Storage Metrics

| Metric                         | Description          | Unit  | Labels        |
|--------------------------------|----------------------|-------|---------------|
| `all_smi_disk_total_bytes`     | Total disk space     | bytes | `mount_point` |
| `all_smi_disk_available_bytes` | Available disk space | bytes | `mount_point` |

Note: Storage metrics exclude Docker bind mounts and are filtered to show only relevant filesystems.

### Process Metrics (When --processes Flag is Used)

| Metric                             | Description                     | Unit    | Labels                                                 |
|------------------------------------|---------------------------------|---------|--------------------------------------------------------|
| `all_smi_gpu_process_memory_bytes` | GPU memory used by process      | bytes   | `gpu_index`, `gpu_name`, `pid`, `process_name`, `user` |
| `all_smi_gpu_process_sm_util`      | Process GPU SM utilization      | percent | `gpu_index`, `gpu_name`, `pid`, `process_name`, `user` |
| `all_smi_gpu_process_mem_util`     | Process GPU memory utilization  | percent | `gpu_index`, `gpu_name`, `pid`, `process_name`, `user` |
| `all_smi_gpu_process_enc_util`     | Process GPU encoder utilization | percent | `gpu_index`, `gpu_name`, `pid`, `process_name`, `user` |
| `all_smi_gpu_process_dec_util`     | Process GPU decoder utilization | percent | `gpu_index`, `gpu_name`, `pid`, `process_name`, `user` |

## Platform Support Matrix

| Platform                     | GPU Metrics    | CPU Metrics    | Memory Metrics | Process Metrics |
|------------------------------|----------------|----------------|----------------|-----------------|  
| Linux + NVIDIA               | ✓ Full         | ✓ Full         | ✓ Full         | ✓ Full          |
| Linux + Tenstorrent          | ✓ Full***      | ✓ Full         | ✓ Full         | ✗ N/A****       |
| Linux + Rebellions           | ✓ Full         | ✓ Full         | ✓ Full         | ✗ N/A*****      |
| Linux + Furiosa              | ✓ Full         | ✓ Full         | ✓ Full         | ✗ N/A******     |
| macOS + Apple Silicon        | ✓ Partial*     | ✓ Enhanced**   | ✓ Full         | ✓ Basic         |
| NVIDIA Jetson                | ✓ Full + DLA   | ✓ Full         | ✓ Full         | ✓ Full          |

*Apple Silicon (M1/M2/M3/M4) GPU metrics do not include temperature (thermal pressure provided instead)  
**Apple Silicon (M1/M2/M3/M4) provides enhanced P-core/E-core metrics and cluster frequencies  
***Tenstorrent provides extensive hardware monitoring including multiple temperature sensors, health counters, and status registers  
****Tenstorrent NPUs do not expose per-process GPU usage information  
*****Rebellions NPUs do not expose per-process GPU usage information  
******Furiosa NPUs do not expose per-process GPU usage information

## Example Prometheus Queries

### Basic Monitoring
```promql
# Average GPU utilization across all GPUs
avg(all_smi_gpu_utilization)

# Memory usage percentage per GPU
(all_smi_gpu_memory_used_bytes / all_smi_gpu_memory_total_bytes) * 100

# GPUs running above 80°C
all_smi_gpu_temperature_celsius > 80
```

### Power Monitoring
```promql
# Total power consumption across all GPUs
sum(all_smi_gpu_power_consumption_watts)

# Power efficiency (utilization per watt)
all_smi_gpu_utilization / all_smi_gpu_power_consumption_watts
```

### Apple Silicon Specific
```promql
# P-core vs E-core utilization comparison
all_smi_cpu_p_core_utilization - all_smi_cpu_e_core_utilization

# ANE power consumption in watts
all_smi_ane_power_watts
```

### Tenstorrent NPU Specific
```promql
# NPUs with high temperature on any sensor
max by (instance) ({
  __name__=~"all_smi_tenstorrent_.*_temperature_celsius",
  instance=~"tt.*"
}) > 80

# Power efficiency by board type
all_smi_gpu_utilization / on(instance) group_left(board_type) 
  (all_smi_tenstorrent_board_info * 0 + all_smi_gpu_power_consumption_watts)

# Throttling detection
all_smi_tenstorrent_throttler > 0

# Health monitoring - ARC processors not incrementing
rate(all_smi_tenstorrent_arc0_health[5m]) == 0
```

### Rebellions NPU Specific
```promql
# NPUs in low performance state
all_smi_rebellions_performance_state > 0

# Devices with non-operational status
all_smi_rebellions_device_status != 1

# Power efficiency (TOPS per watt)
all_smi_rebellions_compute_tops / all_smi_gpu_power_consumption_watts

# Memory bandwidth saturation check
(all_smi_gpu_memory_used_bytes / all_smi_gpu_memory_total_bytes) > 0.9
```

### Furiosa NPU Specific
```promql
# NPUs with unavailable cores
all_smi_furiosa_core_status == 0

# Average PE utilization across all cores
avg by (instance) (all_smi_furiosa_pe_utilization)

# NPUs with high error rates
rate(all_smi_furiosa_error_count[5m]) > 0.1

# Power governor not in OnDemand mode
all_smi_furiosa_power_governor_info{governor!="OnDemand"}

# Memory bandwidth bottleneck detection
all_smi_furiosa_memory_bandwidth_utilization > 80
```

### Process Monitoring
```promql
# Top 5 GPU memory consumers
topk(5, all_smi_gpu_process_memory_bytes)

# Processes using more than 1GB GPU memory
all_smi_gpu_process_memory_bytes > 1073741824
```

## Integration Examples

### Grafana Dashboard
Create a comprehensive monitoring dashboard with:
- GPU utilization heatmap
- Memory usage time series
- Power consumption stacked graph
- Temperature alerts
- Process resource usage table

### AlertManager Rules
```yaml
groups:
  - name: gpu_alerts
    rules:
      - alert: HighGPUTemperature
        expr: all_smi_gpu_temperature_celsius > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu_name }} is running hot"
          
      - alert: GPUMemoryExhausted
        expr: (all_smi_gpu_memory_used_bytes / all_smi_gpu_memory_total_bytes) > 0.95
        for: 5m
        labels:
          severity: critical
          
      - alert: TenstorrentNPUFault
        expr: all_smi_tenstorrent_faults > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Tenstorrent NPU {{ $labels.instance }} has fault condition"
          
      - alert: TenstorrentNPUThrottling
        expr: all_smi_tenstorrent_throttler > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Tenstorrent NPU {{ $labels.instance }} is throttling"
          
      - alert: RebellionsNPULowPerformance
        expr: all_smi_rebellions_performance_state > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Rebellions NPU {{ $labels.instance }} stuck in low performance state P{{ $value }}"
          
      - alert: FuriosaNPUCoreFailure
        expr: all_smi_furiosa_core_status == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Furiosa NPU {{ $labels.instance }} has unavailable core {{ $labels.core }}"
          
      - alert: FuriosaNPUHighErrorRate
        expr: rate(all_smi_furiosa_error_count[5m]) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Furiosa NPU {{ $labels.instance }} experiencing high error rate"
```

## Update Intervals

The metrics update interval can be configured:
- Default: 3 seconds
- Minimum recommended: 1 second
- Maximum recommended: 60 seconds

Higher update rates provide more real-time data but increase system load. For production monitoring, 5-10 seconds is typically sufficient.

## Notes

1. All metrics follow Prometheus naming conventions
2. Labels are used to differentiate between multiple devices
3. Info metrics (ending in `_info`) provide static metadata
4. Some metrics may not be available on all platforms
5. Process metrics require the `--processes` flag and may impact performance
6. Tenstorrent NPU metrics include comprehensive hardware monitoring data:
   - Multiple temperature sensors (ASIC, voltage regulator, inlet/outlet)
   - Detailed firmware versions and health counters
   - Power limits (TDP/TDC) and throttling information
   - PCIe and DDR status registers for diagnostics
7. Tenstorrent utilization is calculated based on power consumption as a proxy metric
8. Rebellions NPU metrics include:
   - Performance state monitoring (P0-P15) for power management
   - Device status and KMD version tracking
   - Support for ATOM, ATOM+, and ATOM Max variants
   - PCIe Gen4 x16 interface metrics
9. Furiosa NPU metrics include:
   - Per-core PE utilization monitoring
   - Core availability status tracking
   - Power governor mode information
   - Error counting and liveness monitoring
   - RNGD architecture with 8 cores per NPU