all-smi 0.17.3

Command-line utility for monitoring GPU hardware. It provides a real-time view of GPU utilization, memory usage, temperature, power consumption, and other metrics.
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
# API Metrics Reference

`all-smi` provides comprehensive hardware metrics in Prometheus format through its API mode. This document details all available metrics across different hardware platforms.

## Starting API Mode

```bash
# Start API server on TCP port
all-smi api --port 9090

# Custom update interval (default: 3 seconds)
all-smi api --port 9090 --interval 5

# Include process information
all-smi api --port 9090 --processes
```

Metrics are available at `http://localhost:9090/metrics`

### Unix Domain Socket Support (Unix Only)

For local IPC scenarios, API mode supports Unix Domain Sockets:

```bash
# Use default socket path
all-smi api --socket
# Linux: /var/run/all-smi.sock (or /tmp/all-smi.sock)
# macOS: /tmp/all-smi.sock

# Use custom socket path
all-smi api --socket /custom/path/all-smi.sock

# TCP and Unix socket simultaneously
all-smi api --port 9090 --socket

# Unix socket only (disable TCP)
all-smi api --port 0 --socket
```

Access metrics via Unix socket:
```bash
curl --unix-socket /tmp/all-smi.sock http://localhost/metrics
```

```python
# Python example
import requests_unixsocket
session = requests_unixsocket.Session()
r = session.get('http+unix://%2Ftmp%2Fall-smi.sock/metrics')
```

**Security**: Socket permissions are set to `0600` (owner-only access).

## Available Metrics

### GPU Metrics (All Platforms)

| Metric                                | Description                | Unit    | Labels                                    |
|---------------------------------------|----------------------------|---------|-------------------------------------------|
| `all_smi_gpu_utilization`             | GPU utilization percentage | percent | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_memory_used_bytes`       | GPU memory used            | bytes   | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_memory_total_bytes`      | GPU memory total           | bytes   | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_temperature_celsius`     | GPU temperature            | celsius | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_power_consumption_watts` | GPU power consumption      | watts   | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_frequency_mhz`           | GPU frequency              | MHz     | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_info`                    | GPU device information     | info    | `gpu_index`, `gpu_name`, `driver_version` |

### Unified AI Acceleration Library Labels

The `all_smi_gpu_info` metric includes standardized labels for AI acceleration libraries across all GPU/accelerator platforms. These unified labels allow platform-agnostic queries and dashboards:

| Label         | Description                              | Example Values                    |
|---------------|------------------------------------------|-----------------------------------|
| `lib_name`    | Name of the AI acceleration library      | `CUDA`, `ROCm`, `Metal`          |
| `lib_version` | Version of the AI acceleration library   | `13.0`, `7.0.2`, `Metal 3`       |

#### Platform-Specific Library Mappings

| Platform          | lib_name | lib_version Source | Platform-Specific Label |
|-------------------|----------|-------------------|-------------------------|
| NVIDIA GPU        | `CUDA`   | CUDA version      | `cuda_version`         |
| AMD GPU           | `ROCm`   | ROCm version      | `rocm_version`         |
| NVIDIA Jetson     | `CUDA`   | CUDA version      | `cuda_version`         |
| Apple Silicon     | `Metal`  | Metal version     | N/A                    |

**Note**: Platform-specific labels (e.g., `cuda_version`, `rocm_version`) are maintained for backward compatibility with existing queries and dashboards.

#### Example PromQL Queries

```promql
# Count devices by AI library type
count by (lib_name) (all_smi_gpu_info)

# Get all CUDA devices with version 12 or higher
all_smi_gpu_info{lib_name="CUDA", lib_version=~"1[2-9].*|[2-9][0-9].*"}

# Alert on outdated ROCm versions (< 7.0)
all_smi_gpu_info{lib_name="ROCm", lib_version!~"[7-9].*"} == 1

# Cross-platform library distribution
sum by (lib_name, lib_version) (all_smi_gpu_info)

# Find all devices using Metal (Apple Silicon)
all_smi_gpu_info{lib_name="Metal"}

# Monitor library version consistency across cluster
count by (lib_name, lib_version) (all_smi_gpu_info) > 1
```

### NVIDIA GPU Specific Metrics

| Metric                                  | Description                              | Unit  | Labels                  |
|-----------------------------------------|------------------------------------------|-------|-------------------------|
| `all_smi_gpu_pcie_gen_current`          | Current PCIe generation                  | -     | `gpu_index`, `gpu_name` |
| `all_smi_gpu_pcie_width_current`        | Current PCIe link width                  | -     | `gpu_index`, `gpu_name` |
| `all_smi_gpu_performance_state`         | GPU performance state (P0=0, P1=1, etc.) | -     | `gpu_index`, `gpu_name` |
| `all_smi_gpu_clock_graphics_max_mhz`    | Maximum graphics clock                   | MHz   | `gpu_index`, `gpu_name` |
| `all_smi_gpu_clock_memory_max_mhz`      | Maximum memory clock                     | MHz   | `gpu_index`, `gpu_name` |
| `all_smi_gpu_power_limit_current_watts` | Current power limit                      | watts | `gpu_index`, `gpu_name` |
| `all_smi_gpu_power_limit_max_watts`     | Maximum power limit                      | watts | `gpu_index`, `gpu_name` |

### NVIDIA Jetson Specific Metrics

| Metric                    | Description                                 | Unit    | Labels                  |
|---------------------------|---------------------------------------------|---------|-------------------------|
| `all_smi_dla_utilization` | DLA (Deep Learning Accelerator) utilization | percent | `gpu_index`, `gpu_name` |

### AMD GPU Specific Metrics

AMD GPUs (Radeon and Instinct series) provide comprehensive monitoring through ROCm and the DRM subsystem:

| Metric                        | Description                              | Unit    | Labels                                      |
|-------------------------------|------------------------------------------|---------|---------------------------------------------|
| `all_smi_gpu_fan_speed_rpm`   | GPU fan speed                            | RPM     | `gpu_index`, `gpu_name`                     |
| `all_smi_amd_rocm_version`    | AMD ROCm version installed               | info    | `instance`, `version`                       |
| `all_smi_gpu_memory_gtt_bytes`| GTT (GPU Translation Table) memory usage | bytes   | `gpu_index`, `gpu_name`                     |
| `all_smi_gpu_memory_vram_bytes`| VRAM (Video RAM) usage                  | bytes   | `gpu_index`, `gpu_name`                     |

**Additional Details Available** (in `all_smi_gpu_info` labels):
- **Driver Version**: AMDGPU kernel driver version (e.g., "30.10.1")
- **ROCm Version**: ROCm software stack version (e.g., "7.0.2")
- **PCIe Information**: Current link generation and width, max GPU/system link capabilities
- **VBIOS**: Version and date information
- **Power Management**: Current, minimum, and maximum power cap values
- **ASIC Information**: Device ID, revision ID, ASIC name
- **Memory Clock**: Current memory clock frequency

**Process Tracking**:
- AMD GPU process detection uses `fdinfo` from `/proc/<pid>/fdinfo/` for accurate memory tracking
- Tracks both VRAM and GTT memory usage per process
- Available with `--processes` flag in API mode

**Platform Requirements**:
- Requires ROCm drivers and `libamdgpu_top` library
- Requires sudo access to `/dev/dri` devices or user in `video`/`render` groups
- Only available in glibc builds (not musl static builds)

### Apple Silicon GPU Specific Metrics

| Metric                          | Description            | Unit  | Labels                           |
|---------------------------------|------------------------|-------|----------------------------------|
| `all_smi_ane_utilization`       | ANE utilization        | mW    | `gpu_index`, `gpu_name`          |
| `all_smi_ane_power_watts`       | ANE power consumption  | watts | `gpu_index`, `gpu_name`          |
| `all_smi_thermal_pressure_info` | Thermal pressure level | info  | `gpu_index`, `gpu_name`, `level` |

Note: For Apple Silicon (M1/M2/M3/M4), `gpu_temperature_celsius` is not available; thermal pressure level is provided instead.

### Tenstorrent NPU Metrics

#### Basic NPU Metrics
| Metric                                | Description                | Unit    | Labels                                    |
|---------------------------------------|----------------------------|---------|-------------------------------------------|
| `all_smi_gpu_utilization`             | NPU utilization percentage | percent | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_memory_used_bytes`       | NPU memory used            | bytes   | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_memory_total_bytes`      | NPU memory total           | bytes   | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_temperature_celsius`     | NPU ASIC temperature       | celsius | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_power_consumption_watts` | NPU power consumption      | watts   | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_frequency_mhz`           | NPU AI clock frequency     | MHz     | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_info`                    | NPU device information     | info    | `gpu_index`, `gpu_name`, `driver_version` |
| `all_smi_npu_firmware_info`           | NPU firmware version       | info    | `npu`, `instance`, `uuid`, `index`, `firmware` |

#### Tenstorrent-Specific Metrics
| Metric                                          | Description                        | Unit    | Labels                                                    |
|-------------------------------------------------|------------------------------------|---------|-----------------------------------------------------------|
| `all_smi_tenstorrent_board_info`                | Board and architecture information | info    | `npu`, `instance`, `uuid`, `index`, `board_type`, `board_id`, `architecture` |
| `all_smi_tenstorrent_collection_method_info`    | Data collection method used        | info    | `npu`, `instance`, `uuid`, `index`, `method`             |
| **Firmware Versions**                           |                                    |         |                                                           |
| `all_smi_tenstorrent_arc_firmware_info`         | ARC firmware version               | info    | `npu`, `instance`, `uuid`, `index`, `version`            |
| `all_smi_tenstorrent_eth_firmware_info`         | Ethernet firmware version          | info    | `npu`, `instance`, `uuid`, `index`, `version`            |
| `all_smi_tenstorrent_ddr_firmware_info`         | DDR firmware version               | info    | `npu`, `instance`, `uuid`, `index`, `version`            |
| `all_smi_tenstorrent_spibootrom_firmware_info`  | SPI Boot ROM firmware version      | info    | `npu`, `instance`, `uuid`, `index`, `version`            |
| `all_smi_tenstorrent_firmware_date_info`        | Firmware build date                | info    | `npu`, `instance`, `uuid`, `index`, `date`               |
| **Temperature Sensors**                         |                                    |         |                                                           |
| `all_smi_tenstorrent_asic_temperature_celsius`  | ASIC temperature                   | celsius | `npu`, `instance`, `uuid`, `index`                       |
| `all_smi_tenstorrent_vreg_temperature_celsius`  | Voltage regulator temperature      | celsius | `npu`, `instance`, `uuid`, `index`                       |
| `all_smi_tenstorrent_inlet_temperature_celsius` | Inlet temperature                  | celsius | `npu`, `instance`, `uuid`, `index`                       |
| `all_smi_tenstorrent_outlet1_temperature_celsius`| Outlet 1 temperature              | celsius | `npu`, `instance`, `uuid`, `index`                       |
| `all_smi_tenstorrent_outlet2_temperature_celsius`| Outlet 2 temperature              | celsius | `npu`, `instance`, `uuid`, `index`                       |
| **Clock Frequencies**                           |                                    |         |                                                           |
| `all_smi_tenstorrent_aiclk_mhz`                | AI clock frequency                 | MHz     | `npu`, `instance`, `uuid`, `index`                       |
| `all_smi_tenstorrent_axiclk_mhz`               | AXI clock frequency                | MHz     | `npu`, `instance`, `uuid`, `index`                       |
| `all_smi_tenstorrent_arcclk_mhz`               | ARC clock frequency                | MHz     | `npu`, `instance`, `uuid`, `index`                       |
| **Power and Electrical**                        |                                    |         |                                                           |
| `all_smi_tenstorrent_voltage_volts`            | Core voltage                       | volts   | `npu`, `instance`, `uuid`, `index`                       |
| `all_smi_tenstorrent_current_amperes`          | Current draw                       | amperes | `npu`, `instance`, `uuid`, `index`                       |
| `all_smi_tenstorrent_power_raw_watts`          | Raw power consumption              | watts   | `npu`, `instance`, `uuid`, `index`                       |
| `all_smi_tenstorrent_tdp_limit_watts`          | TDP limit                          | watts   | `npu`, `instance`, `uuid`, `index`                       |
| `all_smi_tenstorrent_tdc_limit_amperes`        | TDC limit                          | amperes | `npu`, `instance`, `uuid`, `index`                       |
| **Status and Health**                           |                                    |         |                                                           |
| `all_smi_tenstorrent_heartbeat`                | Device heartbeat counter           | counter | `npu`, `instance`, `uuid`, `index`                       |
| `all_smi_tenstorrent_arc0_health`              | ARC0 health counter                | counter | `npu`, `instance`, `uuid`, `index`                       |
| `all_smi_tenstorrent_arc3_health`              | ARC3 health counter                | counter | `npu`, `instance`, `uuid`, `index`                       |
| `all_smi_tenstorrent_faults`                   | Fault register value               | gauge   | `npu`, `instance`, `uuid`, `index`                       |
| `all_smi_tenstorrent_throttler`                | Throttler state register           | gauge   | `npu`, `instance`, `uuid`, `index`                       |
| `all_smi_tenstorrent_pcie_status_info`         | PCIe status register               | info    | `npu`, `instance`, `uuid`, `index`, `status`             |
| `all_smi_tenstorrent_eth_status_info`          | Ethernet status register           | info    | `npu`, `instance`, `uuid`, `index`, `port`, `status`     |
| `all_smi_tenstorrent_ddr_status`               | DDR status register                | gauge   | `npu`, `instance`, `uuid`, `index`                       |
| **Fan Metrics**                                 |                                    |         |                                                           |
| `all_smi_tenstorrent_fan_speed_percent`        | Fan speed percentage               | percent | `npu`, `instance`, `uuid`, `index`                       |
| `all_smi_tenstorrent_fan_rpm`                  | Fan speed in RPM                   | gauge   | `npu`, `instance`, `uuid`, `index`                       |
| **PCIe Information**                            |                                    |         |                                                           |
| `all_smi_tenstorrent_pcie_generation`          | PCIe generation                    | gauge   | `npu`, `instance`, `uuid`, `index`                       |
| `all_smi_tenstorrent_pcie_width`               | PCIe link width                    | gauge   | `npu`, `instance`, `uuid`, `index`                       |
| `all_smi_tenstorrent_pcie_address_info`        | PCIe address                       | info    | `npu`, `instance`, `uuid`, `index`, `address`            |
| `all_smi_tenstorrent_pcie_device_info`         | PCIe device identification         | info    | `npu`, `instance`, `uuid`, `index`, `vendor_id`, `device_id` |
| **DRAM Information**                            |                                    |         |                                                           |
| `all_smi_tenstorrent_dram_info`                | DRAM configuration                 | info    | `npu`, `instance`, `uuid`, `index`, `speed`              |

Note: Tenstorrent NPUs use the same basic metric names as GPUs for compatibility with existing monitoring infrastructure. Additional Tenstorrent-specific metrics provide detailed hardware monitoring capabilities.

### Rebellions NPU Metrics

#### Basic NPU Metrics
| Metric                                | Description                | Unit    | Labels                                    |
|---------------------------------------|----------------------------|---------|-------------------------------------------|
| `all_smi_gpu_utilization`             | NPU utilization percentage | percent | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_memory_used_bytes`       | NPU memory used            | bytes   | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_memory_total_bytes`      | NPU memory total           | bytes   | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_temperature_celsius`     | NPU temperature            | celsius | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_power_consumption_watts` | NPU power consumption      | watts   | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_frequency_mhz`           | NPU clock frequency        | MHz     | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_info`                    | NPU device information     | info    | `gpu_index`, `gpu_name`, `driver_version` |

#### Rebellions-Specific Metrics
| Metric                                    | Description                          | Unit  | Labels                                                               |
|-------------------------------------------|--------------------------------------|-------|----------------------------------------------------------------------|
| `all_smi_rebellions_device_info`          | Device model and variant information | info  | `npu`, `instance`, `uuid`, `index`, `model`, `variant`              |
| `all_smi_rebellions_firmware_info`        | NPU firmware version                 | info  | `npu`, `instance`, `uuid`, `index`, `firmware_version`              |
| `all_smi_rebellions_kmd_info`             | Kernel Mode Driver version           | info  | `npu`, `instance`, `uuid`, `index`, `kmd_version`                   |
| `all_smi_rebellions_device_status`        | Device operational status            | gauge | `npu`, `instance`, `uuid`, `index`                                  |
| `all_smi_rebellions_performance_state`    | NPU performance state (P0-P15)       | gauge | `npu`, `instance`, `uuid`, `index`                                  |
| `all_smi_rebellions_pcie_generation`      | PCIe generation (Gen4)               | gauge | `npu`, `instance`, `uuid`, `index`                                  |
| `all_smi_rebellions_pcie_width`           | PCIe link width (x16)                | gauge | `npu`, `instance`, `uuid`, `index`                                  |
| `all_smi_rebellions_memory_bandwidth_gbps`| Memory bandwidth capacity            | gauge | `npu`, `instance`, `uuid`, `index`                                  |
| `all_smi_rebellions_compute_tops`         | Compute capacity in TOPS             | gauge | `npu`, `instance`, `uuid`, `index`                                  |

Note: Rebellions NPUs support ATOM, ATOM+, and ATOM Max variants with varying compute and memory capabilities. All variants use PCIe Gen4 x16 interface.

### Furiosa NPU Metrics

#### Basic NPU Metrics
| Metric                                | Description                | Unit    | Labels                                    |
|---------------------------------------|----------------------------|---------|-------------------------------------------|
| `all_smi_gpu_utilization`             | NPU utilization percentage | percent | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_memory_used_bytes`       | NPU memory used            | bytes   | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_memory_total_bytes`      | NPU memory total           | bytes   | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_temperature_celsius`     | NPU temperature            | celsius | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_power_consumption_watts` | NPU power consumption      | watts   | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_frequency_mhz`           | NPU clock frequency        | MHz     | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_info`                    | NPU device information     | info    | `gpu_index`, `gpu_name`, `driver_version` |

#### Furiosa-Specific Metrics
| Metric                                      | Description                            | Unit    | Labels                                                               |
|---------------------------------------------|----------------------------------------|---------|----------------------------------------------------------------------|
| `all_smi_furiosa_device_info`               | Device architecture and model info     | info    | `npu`, `instance`, `uuid`, `index`, `architecture`, `model`         |
| `all_smi_furiosa_firmware_info`             | NPU firmware version                   | info    | `npu`, `instance`, `uuid`, `index`, `firmware_version`              |
| `all_smi_furiosa_pert_info`                 | PERT (runtime) version                 | info    | `npu`, `instance`, `uuid`, `index`, `pert_version`                  |
| `all_smi_furiosa_liveness_status`           | Device liveness status                 | gauge   | `npu`, `instance`, `uuid`, `index`                                  |
| `all_smi_furiosa_core_count`                | Number of cores in NPU                 | gauge   | `npu`, `instance`, `uuid`, `index`                                  |
| `all_smi_furiosa_core_status`               | Core availability status               | gauge   | `npu`, `instance`, `uuid`, `index`, `core`                          |
| `all_smi_furiosa_pe_utilization`            | Processing Element utilization         | percent | `npu`, `instance`, `uuid`, `index`, `core`                          |
| `all_smi_furiosa_core_frequency_mhz`        | Per-core frequency                     | MHz     | `npu`, `instance`, `uuid`, `index`, `core`                          |
| `all_smi_furiosa_power_governor_info`       | Power governor mode                    | info    | `npu`, `instance`, `uuid`, `index`, `governor`                      |
| `all_smi_furiosa_error_count`               | Cumulative error count                 | counter | `npu`, `instance`, `uuid`, `index`                                  |
| `all_smi_furiosa_pcie_generation`           | PCIe generation                        | gauge   | `npu`, `instance`, `uuid`, `index`                                  |
| `all_smi_furiosa_pcie_width`                | PCIe link width                        | gauge   | `npu`, `instance`, `uuid`, `index`                                  |
| `all_smi_furiosa_memory_bandwidth_utilization` | Memory bandwidth utilization        | percent | `npu`, `instance`, `uuid`, `index`                                  |

Note: Furiosa NPUs use the RNGD architecture with 8 cores per NPU. Each core contains multiple Processing Elements (PEs) that handle neural network computations. The power governor supports OnDemand mode for dynamic power management.

### Intel Gaudi NPU Metrics

#### Basic NPU Metrics
| Metric                                | Description                | Unit    | Labels                                    |
|---------------------------------------|----------------------------|---------|-------------------------------------------|
| `all_smi_gpu_utilization`             | NPU utilization percentage | percent | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_memory_used_bytes`       | NPU memory used            | bytes   | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_memory_total_bytes`      | NPU memory total           | bytes   | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_temperature_celsius`     | NPU temperature            | celsius | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_power_consumption_watts` | NPU power consumption      | watts   | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_frequency_mhz`           | NPU clock frequency        | MHz     | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_info`                    | NPU device information     | info    | `gpu_index`, `gpu_name`, `driver_version` |

#### Intel Gaudi-Specific Metrics
| Metric                                        | Description                              | Unit    | Labels                                                        |
|-----------------------------------------------|------------------------------------------|---------|---------------------------------------------------------------|
| `all_smi_gaudi_device_info`                   | Device model and information             | info    | `npu`, `instance`, `uuid`, `index`                           |
| `all_smi_gaudi_internal_name_info`            | Internal device name (e.g., HL-325L)     | info    | `npu`, `instance`, `uuid`, `index`, `internal_name`          |
| `all_smi_gaudi_driver_info`                   | Habana driver version                    | info    | `npu`, `instance`, `uuid`, `index`, `version`                |
| `all_smi_gaudi_aip_utilization_percent`       | AIP (AI Processor) utilization           | percent | `npu`, `instance`, `uuid`, `index`                           |
| `all_smi_gaudi_memory_used_bytes`             | HBM memory used                          | bytes   | `npu`, `instance`, `uuid`, `index`                           |
| `all_smi_gaudi_memory_total_bytes`            | HBM total memory                         | bytes   | `npu`, `instance`, `uuid`, `index`                           |
| `all_smi_gaudi_memory_utilization_percent`    | HBM memory utilization percentage        | percent | `npu`, `instance`, `uuid`, `index`                           |
| `all_smi_gaudi_power_draw_watts`              | Current power consumption                | watts   | `npu`, `instance`, `uuid`, `index`                           |
| `all_smi_gaudi_power_max_watts`               | Maximum power limit                      | watts   | `npu`, `instance`, `uuid`, `index`                           |
| `all_smi_gaudi_power_utilization_percent`     | Power utilization percentage             | percent | `npu`, `instance`, `uuid`, `index`                           |
| `all_smi_gaudi_temperature_celsius`           | AIP temperature                          | celsius | `npu`, `instance`, `uuid`, `index`                           |

Note: Intel Gaudi NPUs (Gaudi 1/2/3) are monitored via the `hl-smi` command-line tool running as a background process. Device names are automatically mapped from internal identifiers (e.g., HL-325L) to human-friendly names (e.g., Intel Gaudi 3 PCIe LP). The tool supports various form factors including PCIe, OAM, UBB, and HLS variants.

### Google TPU Metrics

#### Basic NPU Metrics
| Metric                                | Description                | Unit    | Labels                                    |
|---------------------------------------|----------------------------|---------|-------------------------------------------|
| `all_smi_gpu_utilization`             | TPU utilization percentage | percent | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_memory_used_bytes`       | TPU memory used            | bytes   | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_memory_total_bytes`      | TPU memory total           | bytes   | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_temperature_celsius`     | TPU temperature            | celsius | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_power_consumption_watts` | TPU power consumption      | watts   | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_frequency_mhz`           | TPU clock frequency        | MHz     | `gpu_index`, `gpu_name`                   |
| `all_smi_gpu_info`                    | TPU device information     | info    | `gpu_index`, `gpu_name`, `driver_version` |

#### TPU-Specific Metrics
| Metric                                     | Description                          | Unit  | Labels                                                   |
|--------------------------------------------|--------------------------------------|-------|----------------------------------------------------------|
| `all_smi_tpu_utilization_percent`          | TPU duty cycle utilization           | percent| `npu`, `instance`, `uuid`, `index`                      |
| `all_smi_tpu_memory_used_bytes`            | TPU HBM memory used                  | bytes | `npu`, `instance`, `uuid`, `index`                       |
| `all_smi_tpu_memory_total_bytes`           | TPU HBM memory total                 | bytes | `npu`, `instance`, `uuid`, `index`                       |
| `all_smi_tpu_memory_utilization_percent`   | TPU HBM memory utilization percentage| percent| `npu`, `instance`, `uuid`, `index`                      |
| `all_smi_tpu_chip_version_info`            | TPU chip version information         | info  | `npu`, `instance`, `uuid`, `index`, `version`            |
| `all_smi_tpu_accelerator_type_info`        | TPU accelerator type information     | info  | `npu`, `instance`, `uuid`, `index`, `type`               |
| `all_smi_tpu_core_count`                   | Number of TPU cores                  | gauge | `npu`, `instance`, `uuid`, `index`                       |
| `all_smi_tpu_tensorcore_count`             | Number of TensorCores per chip       | gauge | `npu`, `instance`, `uuid`, `index`                       |
| `all_smi_tpu_memory_type_info`             | TPU memory type (HBM2/HBM2e/HBM3e)    | info  | `npu`, `instance`, `uuid`, `index`, `type`               |
| `all_smi_tpu_runtime_version_info`         | TPU runtime/library version          | info  | `npu`, `instance`, `uuid`, `index`, `version`            |
| `all_smi_tpu_power_max_watts`              | TPU maximum power limit              | watts | `npu`, `instance`, `uuid`, `index`                       |
| `all_smi_tpu_hlo_queue_size`               | Number of pending HLO programs       | gauge | `npu`, `instance`, `uuid`, `index`                       |
| `all_smi_tpu_hlo_exec_mean_microseconds`   | HLO execution timing (mean)          | µs    | `npu`, `instance`, `uuid`, `index`                       |
| `all_smi_tpu_hlo_exec_p50_microseconds`    | HLO execution timing (P50)           | µs    | `npu`, `instance`, `uuid`, `index`                       |
| `all_smi_tpu_hlo_exec_p90_microseconds`    | HLO execution timing (P90)           | µs    | `npu`, `instance`, `uuid`, `index`                       |
| `all_smi_tpu_hlo_exec_p95_microseconds`    | HLO execution timing (P95)           | µs    | `npu`, `instance`, `uuid`, `index`                       |
| `all_smi_tpu_hlo_exec_p999_microseconds`   | HLO execution timing (P99.9)         | µs    | `npu`, `instance`, `uuid`, `index`                       |

Note: Google Cloud TPUs (v2-v7/Ironwood) are monitored via the `tpu-info` command-line tool running in streaming mode. Metrics include duty cycle utilization, HBM memory tracking, and chip configuration details.

### CPU Metrics (All Platforms)

| Metric                                | Description                | Unit    | Labels   |
|---------------------------------------|----------------------------|---------|----------|
| `all_smi_cpu_utilization`             | CPU utilization percentage | percent | -        |
| `all_smi_cpu_socket_count`            | Number of CPU sockets      | count   | -        |
| `all_smi_cpu_core_count`              | Total number of CPU cores  | count   | -        |
| `all_smi_cpu_thread_count`            | Total number of CPU threads| count   | -        |
| `all_smi_cpu_frequency_mhz`           | CPU frequency              | MHz     | -        |
| `all_smi_cpu_temperature_celsius`     | CPU temperature            | celsius | -        |
| `all_smi_cpu_power_consumption_watts` | CPU power consumption      | watts   | -        |
| `all_smi_cpu_socket_utilization`      | Per-socket CPU utilization | percent | `socket` |

### Apple Silicon CPU Specific Metrics

| Metric                                | Description                    | Unit    | Labels |
|---------------------------------------|--------------------------------|---------|--------|
| `all_smi_cpu_p_core_count`            | Number of performance cores    | count   | -      |
| `all_smi_cpu_e_core_count`            | Number of efficiency cores     | count   | -      |
| `all_smi_cpu_gpu_core_count`          | Number of integrated GPU cores | count   | -      |
| `all_smi_cpu_p_core_utilization`      | P-core utilization percentage  | percent | -      |
| `all_smi_cpu_e_core_utilization`      | E-core utilization percentage  | percent | -      |
| `all_smi_cpu_p_cluster_frequency_mhz` | P-cluster frequency            | MHz     | -      |
| `all_smi_cpu_e_cluster_frequency_mhz` | E-cluster frequency            | MHz     | -      |

### Memory Metrics (All Platforms)

| Metric                           | Description                   | Unit    | Labels |
|----------------------------------|-------------------------------|---------|--------|
| `all_smi_memory_total_bytes`     | Total system memory           | bytes   | -      |
| `all_smi_memory_used_bytes`      | Used system memory            | bytes   | -      |
| `all_smi_memory_available_bytes` | Available system memory       | bytes   | -      |
| `all_smi_memory_free_bytes`      | Free system memory            | bytes   | -      |
| `all_smi_memory_utilization`     | Memory utilization percentage | percent | -      |
| `all_smi_swap_total_bytes`       | Total swap space              | bytes   | -      |
| `all_smi_swap_used_bytes`        | Used swap space               | bytes   | -      |
| `all_smi_swap_free_bytes`        | Free swap space               | bytes   | -      |

### Linux-Specific Memory Metrics

| Metric                         | Description             | Unit  | Labels |
|--------------------------------|-------------------------|-------|--------|
| `all_smi_memory_buffers_bytes` | Memory used for buffers | bytes | -      |
| `all_smi_memory_cached_bytes`  | Memory used for cache   | bytes | -      |

### Storage Metrics

| Metric                         | Description          | Unit  | Labels        |
|--------------------------------|----------------------|-------|---------------|
| `all_smi_disk_total_bytes`     | Total disk space     | bytes | `mount_point` |
| `all_smi_disk_available_bytes` | Available disk space | bytes | `mount_point` |

Note: Storage metrics exclude Docker bind mounts and are filtered to show only relevant filesystems.

### Chassis/Node-Level Metrics

Chassis metrics provide visibility into system-wide power consumption, thermal conditions, and cooling status at the node level. These metrics aggregate information from CPU, GPU, ANE, and BMC sensors.

#### Common Chassis Metrics (All Platforms)

| Metric                              | Description                                    | Unit    | Labels                  |
|-------------------------------------|------------------------------------------------|---------|-------------------------|
| `all_smi_chassis_power_watts`       | Total chassis power consumption (CPU+GPU+ANE)  | watts   | `hostname`, `instance`  |

#### Apple Silicon Chassis Metrics

| Metric                                   | Description                               | Unit    | Labels                           |
|------------------------------------------|-------------------------------------------|---------|----------------------------------|
| `all_smi_chassis_thermal_pressure_info`  | Thermal pressure level                    | info    | `hostname`, `instance`, `level`  |
| `all_smi_chassis_cpu_power_watts`        | CPU power consumption                     | watts   | `hostname`, `instance`           |
| `all_smi_chassis_gpu_power_watts`        | GPU power consumption                     | watts   | `hostname`, `instance`           |
| `all_smi_chassis_ane_power_watts`        | ANE (Apple Neural Engine) power           | watts   | `hostname`, `instance`           |

#### Server Chassis Metrics (BMC-enabled Systems)

| Metric                                      | Description                      | Unit    | Labels                                     |
|---------------------------------------------|----------------------------------|---------|-------------------------------------------|
| `all_smi_chassis_inlet_temperature_celsius` | Chassis inlet temperature        | celsius | `hostname`, `instance`                    |
| `all_smi_chassis_outlet_temperature_celsius`| Chassis outlet temperature       | celsius | `hostname`, `instance`                    |
| `all_smi_chassis_fan_speed_rpm`             | Fan speed                        | RPM     | `hostname`, `instance`, `fan_id`, `fan_name` |

Note: Chassis metrics provide a unified view of node-level power consumption and thermal conditions, useful for cluster-wide capacity planning and power monitoring.

### Runtime Environment Metrics

| Metric                              | Description                                      | Unit  | Labels                                           |
|-------------------------------------|--------------------------------------------------|-------|--------------------------------------------------|
| `all_smi_runtime_environment`       | Current runtime environment (container or VM)    | gauge | `hostname`, `environment`                        |
| `all_smi_container_runtime_info`    | Container runtime environment information        | gauge | `hostname`, `runtime`, `container_id`            |
| `all_smi_kubernetes_pod_info`       | Kubernetes pod information (K8s only)            | gauge | `hostname`, `pod_name`, `namespace`              |
| `all_smi_virtualization_info`       | Virtualization environment information           | gauge | `hostname`, `vm_type`, `hypervisor`             |

Runtime environment metrics are detected at startup and provide information about the execution context:
- Container environments: Docker, Kubernetes, Podman, containerd, LXC, CRI-O, Backend.AI
- Virtualization platforms: VMware, VirtualBox, KVM, QEMU, Hyper-V, Xen, AWS EC2, Google Cloud, Azure, DigitalOcean, Parallels

### Process Metrics (When --processes Flag is Used)

| Metric                             | Description                     | Unit    | Labels                                                 |
|------------------------------------|---------------------------------|---------|--------------------------------------------------------|
| `all_smi_gpu_process_memory_bytes` | GPU memory used by process      | bytes   | `gpu_index`, `gpu_name`, `pid`, `process_name`, `user` |
| `all_smi_gpu_process_sm_util`      | Process GPU SM utilization      | percent | `gpu_index`, `gpu_name`, `pid`, `process_name`, `user` |
| `all_smi_gpu_process_mem_util`     | Process GPU memory utilization  | percent | `gpu_index`, `gpu_name`, `pid`, `process_name`, `user` |
| `all_smi_gpu_process_enc_util`     | Process GPU encoder utilization | percent | `gpu_index`, `gpu_name`, `pid`, `process_name`, `user` |
| `all_smi_gpu_process_dec_util`     | Process GPU decoder utilization | percent | `gpu_index`, `gpu_name`, `pid`, `process_name`, `user` |

## Platform Support Matrix

| Platform                     | GPU Metrics    | CPU Metrics    | Memory Metrics | Process Metrics |
|------------------------------|----------------|----------------|----------------|-----------------|
| Linux + NVIDIA               | ✓ Full         | ✓ Full         | ✓ Full         | ✓ Full          |
| Linux + Intel Gaudi          | ✓ Full         | ✓ Full         | ✓ Full         | ✗ N/A*******    |
| Linux + Tenstorrent          | ✓ Full***      | ✓ Full         | ✓ Full         | ✗ N/A****       |
| Linux + Rebellions           | ✓ Full         | ✓ Full         | ✓ Full         | ✗ N/A*****      |
| Linux + Furiosa              | ✓ Full         | ✓ Full         | ✓ Full         | ✗ N/A******     |
| Linux + Google TPU           | ✓ Full         | ✓ Full         | ✓ Full         | ✗ N/A********    |
| macOS + Apple Silicon        | ✓ Partial*     | ✓ Enhanced**   | ✓ Full         | ✓ Basic         |
| NVIDIA Jetson                | ✓ Full + DLA   | ✓ Full         | ✓ Full         | ✓ Full          |

*Apple Silicon (M1/M2/M3/M4) GPU metrics do not include temperature (thermal pressure provided instead)
**Apple Silicon (M1/M2/M3/M4) provides enhanced P-core/E-core metrics and cluster frequencies
***Tenstorrent provides extensive hardware monitoring including multiple temperature sensors, health counters, and status registers
****Tenstorrent NPUs do not expose per-process GPU usage information
*****Rebellions NPUs do not expose per-process GPU usage information
******Furiosa NPUs do not expose per-process GPU usage information
*******Intel Gaudi NPUs do not expose per-process GPU usage information via hl-smi
********Google Cloud TPUs do not expose per-process GPU usage information via tpu-info

## Example Prometheus Queries

### Basic Monitoring
```promql
# Average GPU utilization across all GPUs
avg(all_smi_gpu_utilization)

# Memory usage percentage per GPU
(all_smi_gpu_memory_used_bytes / all_smi_gpu_memory_total_bytes) * 100

# GPUs running above 80°C
all_smi_gpu_temperature_celsius > 80
```

### Power Monitoring
```promql
# Total power consumption across all GPUs
sum(all_smi_gpu_power_consumption_watts)

# Power efficiency (utilization per watt)
all_smi_gpu_utilization / all_smi_gpu_power_consumption_watts
```

### AMD GPU Specific
```promql
# AMD GPUs with high fan speed (potential cooling issues)
all_smi_gpu_fan_speed_rpm > 3000

# VRAM utilization percentage
(all_smi_gpu_memory_vram_bytes / all_smi_gpu_memory_total_bytes) * 100

# AMD GPUs approaching power cap
all_smi_gpu_power_consumption_watts / all_smi_amd_power_cap_watts > 0.9

# Memory bandwidth usage (VRAM + GTT)
all_smi_gpu_memory_vram_bytes + all_smi_gpu_memory_gtt_bytes

# AMD GPU thermal efficiency (utilization per degree)
all_smi_gpu_utilization / all_smi_gpu_temperature_celsius
```

### Apple Silicon Specific
```promql
# P-core vs E-core utilization comparison
all_smi_cpu_p_core_utilization - all_smi_cpu_e_core_utilization

# ANE power consumption in watts
all_smi_ane_power_watts
```

### Tenstorrent NPU Specific
```promql
# NPUs with high temperature on any sensor
max by (instance) ({
  __name__=~"all_smi_tenstorrent_.*_temperature_celsius",
  instance=~"tt.*"
}) > 80

# Power efficiency by board type
all_smi_gpu_utilization / on(instance) group_left(board_type) 
  (all_smi_tenstorrent_board_info * 0 + all_smi_gpu_power_consumption_watts)

# Throttling detection
all_smi_tenstorrent_throttler > 0

# Health monitoring - ARC processors not incrementing
rate(all_smi_tenstorrent_arc0_health[5m]) == 0
```

### Rebellions NPU Specific
```promql
# NPUs in low performance state
all_smi_rebellions_performance_state > 0

# Devices with non-operational status
all_smi_rebellions_device_status != 1

# Power efficiency (TOPS per watt)
all_smi_rebellions_compute_tops / all_smi_gpu_power_consumption_watts

# Memory bandwidth saturation check
(all_smi_gpu_memory_used_bytes / all_smi_gpu_memory_total_bytes) > 0.9
```

### Furiosa NPU Specific
```promql
# NPUs with unavailable cores
all_smi_furiosa_core_status == 0

# Average PE utilization across all cores
avg by (instance) (all_smi_furiosa_pe_utilization)

# NPUs with high error rates
rate(all_smi_furiosa_error_count[5m]) > 0.1

# Power governor not in OnDemand mode
all_smi_furiosa_power_governor_info{governor!="OnDemand"}

# Memory bandwidth bottleneck detection
all_smi_furiosa_memory_bandwidth_utilization > 80
```

### Intel Gaudi NPU Specific
```promql
# NPUs with high AIP utilization
all_smi_gaudi_aip_utilization_percent > 80

# HBM memory utilization across cluster
avg by (instance) (all_smi_gaudi_memory_utilization_percent)

# NPUs approaching power limit
all_smi_gaudi_power_draw_watts / all_smi_gaudi_power_max_watts > 0.9

# Power efficiency (AIP utilization per watt)
all_smi_gaudi_aip_utilization_percent / all_smi_gaudi_power_draw_watts

# NPUs running hot (temperature > 70°C)
all_smi_gaudi_temperature_celsius > 70

# Total HBM memory usage across all Gaudi NPUs
sum(all_smi_gaudi_memory_used_bytes)

# Gaudi NPUs by device variant
count by (internal_name) (all_smi_gaudi_internal_name_info)

# Driver version consistency check
count by (version) (all_smi_gaudi_driver_info) > 1
```

### Google TPU Specific
```promql
# TPU utilization across all chips
avg(all_smi_tpu_utilization_percent)

# HBM memory utilization percentage
all_smi_tpu_memory_utilization_percent

# Count TPUs by accelerator type
count by (type) (all_smi_tpu_accelerator_type_info)

# Monitor HLO queue size
all_smi_tpu_hlo_queue_size > 5

# Alert on high HLO execution latency
all_smi_tpu_hlo_exec_p90_microseconds > 1000000
```

### Process Monitoring
```promql
# Top 5 GPU memory consumers
topk(5, all_smi_gpu_process_memory_bytes)

# Processes using more than 1GB GPU memory
all_smi_gpu_process_memory_bytes > 1073741824
```

### Chassis/Node-Level Monitoring
```promql
# Total power consumption across all nodes
sum(all_smi_chassis_power_watts)

# Nodes with high power consumption (> 3000W)
all_smi_chassis_power_watts > 3000

# Power breakdown by component (Apple Silicon)
sum by (hostname) (all_smi_chassis_cpu_power_watts)
sum by (hostname) (all_smi_chassis_gpu_power_watts)
sum by (hostname) (all_smi_chassis_ane_power_watts)

# Nodes with non-nominal thermal pressure
all_smi_chassis_thermal_pressure_info{level!="Nominal"}

# Average chassis power per node
avg(all_smi_chassis_power_watts)

# Nodes with high inlet temperature
all_smi_chassis_inlet_temperature_celsius > 35

# Delta between inlet and outlet temperature (thermal dissipation)
all_smi_chassis_outlet_temperature_celsius - all_smi_chassis_inlet_temperature_celsius

# Fan speed monitoring
avg by (hostname) (all_smi_chassis_fan_speed_rpm)
```

### Runtime Environment Monitoring
```promql
# All containers running in Kubernetes
all_smi_container_runtime_info{runtime="Kubernetes"}

# All instances running in AWS EC2
all_smi_virtualization_info{vm_type="AWS EC2"}

# Containers running in Backend.AI
all_smi_runtime_environment{environment="Backend.AI"}

# Group metrics by runtime environment
sum by (environment) (all_smi_gpu_utilization) * on(hostname) group_left(environment) all_smi_runtime_environment
```

## Integration Examples

### Grafana Dashboard
Create a comprehensive monitoring dashboard with:
- GPU utilization heatmap
- Memory usage time series
- Power consumption stacked graph
- Temperature alerts
- Process resource usage table

### AlertManager Rules
```yaml
groups:
  - name: gpu_alerts
    rules:
      - alert: HighGPUTemperature
        expr: all_smi_gpu_temperature_celsius > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu_name }} is running hot"
          
      - alert: GPUMemoryExhausted
        expr: (all_smi_gpu_memory_used_bytes / all_smi_gpu_memory_total_bytes) > 0.95
        for: 5m
        labels:
          severity: critical
          
      - alert: TenstorrentNPUFault
        expr: all_smi_tenstorrent_faults > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Tenstorrent NPU {{ $labels.instance }} has fault condition"
          
      - alert: TenstorrentNPUThrottling
        expr: all_smi_tenstorrent_throttler > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Tenstorrent NPU {{ $labels.instance }} is throttling"
          
      - alert: RebellionsNPULowPerformance
        expr: all_smi_rebellions_performance_state > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Rebellions NPU {{ $labels.instance }} stuck in low performance state P{{ $value }}"
          
      - alert: FuriosaNPUCoreFailure
        expr: all_smi_furiosa_core_status == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Furiosa NPU {{ $labels.instance }} has unavailable core {{ $labels.core }}"
          
      - alert: FuriosaNPUHighErrorRate
        expr: rate(all_smi_furiosa_error_count[5m]) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Furiosa NPU {{ $labels.instance }} experiencing high error rate"

      - alert: GaudiNPUHighTemperature
        expr: all_smi_gaudi_temperature_celsius > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Intel Gaudi NPU {{ $labels.instance }} is running hot at {{ $value }}°C"

      - alert: GaudiNPUPowerLimitApproaching
        expr: all_smi_gaudi_power_draw_watts / all_smi_gaudi_power_max_watts > 0.95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Intel Gaudi NPU {{ $labels.instance }} approaching power limit"

      - alert: GaudiNPUHBMMemoryExhausted
        expr: all_smi_gaudi_memory_utilization_percent > 95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Intel Gaudi NPU {{ $labels.instance }} HBM memory nearly exhausted"

      - alert: ChassisHighPowerConsumption
        expr: all_smi_chassis_power_watts > 3500
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Chassis {{ $labels.hostname }} power consumption is high at {{ $value }}W"

      - alert: ChassisThermalPressureElevated
        expr: all_smi_chassis_thermal_pressure_info{level!="Nominal"} == 1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Chassis {{ $labels.hostname }} thermal pressure elevated to {{ $labels.level }}"

      - alert: ChassisHighInletTemperature
        expr: all_smi_chassis_inlet_temperature_celsius > 40
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Chassis {{ $labels.hostname }} inlet temperature is high at {{ $value }}°C"
```

## Update Intervals

The metrics update interval can be configured:
- Default: 3 seconds
- Minimum recommended: 1 second
- Maximum recommended: 60 seconds

Higher update rates provide more real-time data but increase system load. For production monitoring, 5-10 seconds is typically sufficient.

## Notes

1. All metrics follow Prometheus naming conventions
2. Labels are used to differentiate between multiple devices
3. Info metrics (ending in `_info`) provide static metadata
4. Some metrics may not be available on all platforms
5. Process metrics require the `--processes` flag and may impact performance
6. Tenstorrent NPU metrics include comprehensive hardware monitoring data:
   - Multiple temperature sensors (ASIC, voltage regulator, inlet/outlet)
   - Detailed firmware versions and health counters
   - Power limits (TDP/TDC) and throttling information
   - PCIe and DDR status registers for diagnostics
7. Tenstorrent utilization is calculated based on power consumption as a proxy metric
8. Rebellions NPU metrics include:
   - Performance state monitoring (P0-P15) for power management
   - Device status and KMD version tracking
   - Support for ATOM, ATOM+, and ATOM Max variants
   - PCIe Gen4 x16 interface metrics
9. Furiosa NPU metrics include:
   - Per-core PE utilization monitoring
   - Core availability status tracking
   - Power governor mode information
   - Error counting and liveness monitoring
   - RNGD architecture with 8 cores per NPU
10. Intel Gaudi NPU metrics include:
    - AIP (AI Processor) utilization monitoring
    - HBM memory usage and utilization tracking (up to 128GB per device)
    - Power consumption with configurable power limits (up to 850W)
    - Temperature monitoring
    - Automatic device name mapping (HL-325L → Intel Gaudi 3 PCIe LP)
    - Support for Gaudi 1/2/3 across PCIe, OAM, UBB, and HLS form factors
    - Background process monitoring via hl-smi with circular buffer
11. Chassis/Node-level metrics include:
    - Total chassis power consumption aggregating CPU, GPU, and ANE power
    - Thermal pressure monitoring (Apple Silicon)
    - Individual power component breakdown (CPU, GPU, ANE)
    - Inlet/outlet temperature monitoring (BMC-enabled servers)
    - Fan speed monitoring with per-fan granularity