# Inferno Monitoring & Observability Guide
Complete monitoring setup for Inferno v0.8.0 using Prometheus, Grafana, and Alertmanager.
## Overview
This monitoring stack provides:
- **Prometheus**: Metrics collection and storage
- **Grafana**: Visualization and dashboarding
- **Alertmanager**: Alert routing and notifications
- **ServiceMonitor**: Kubernetes Prometheus Operator integration
- **PrometheusRule**: Alert definitions in Kubernetes format
## Architecture
```
Inferno Pods (metrics endpoint :9090/metrics)
↓
ServiceMonitor (discovers pods via Kubernetes SD)
↓
Prometheus (scrapes every 30s, evaluates rules)
↓
PrometheusRule (triggers alerts, computes recording rules)
↓
Alertmanager (routes alerts) → Slack/PagerDuty/Email
↓
Grafana (queries Prometheus, displays dashboards)
```
## Installation
### Prerequisites
- Prometheus Operator (or standalone Prometheus)
- Grafana 7.0+
- Kubernetes 1.20+
### Option 1: Helm Install (With Prometheus Operator)
```bash
# Install Prometheus Operator (if not already installed)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack -n monitoring --create-namespace
# Apply Inferno ServiceMonitor
kubectl apply -f monitoring/servicemonitor.yaml
# Apply Inferno PrometheusRule
kubectl apply -f monitoring/prometheusrule.yaml
# Apply Grafana datasource
kubectl apply -f monitoring/grafana-datasource.yaml
```
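For reference, a `ServiceMonitor` like the one in `monitoring/servicemonitor.yaml` generally looks like the sketch below. The namespace and label selector here are assumptions; align them with the labels your Inferno Service actually carries.
```yaml
# Sketch only -- namespace and selector labels are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: inferno
  namespace: monitoring
spec:
  namespaceSelector:
    matchNames:
      - inferno-prod
  selector:
    matchLabels:
      app.kubernetes.io/name: inferno
  endpoints:
    - port: metrics      # name of the Service port that exposes :9090/metrics
      path: /metrics
      interval: 30s      # matches the 30s scrape cadence in the architecture above
```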
### Option 2: Manual Install (Standalone Prometheus)
```bash
# Create monitoring namespace
kubectl create namespace monitoring
# Create Prometheus ConfigMap with scrape config
# (key the file as prometheus.yaml so it matches --config.file below)
kubectl create configmap prometheus-config \
  --from-file=prometheus.yaml=monitoring/prometheus-config.yaml \
  -n monitoring
# Create the rules ConfigMap referenced by the deployment below
# (point --from-file at wherever your rule files actually live)
kubectl create configmap prometheus-rules \
  --from-file=monitoring/rules/ \
  -n monitoring
# Create Prometheus deployment
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:latest
          ports:
            - containerPort: 9090
          volumeMounts:
            - name: config
              mountPath: /etc/prometheus
            - name: rules
              mountPath: /etc/prometheus/rules
          args:
            - '--config.file=/etc/prometheus/prometheus.yaml'
            - '--storage.tsdb.path=/prometheus'
      volumes:
        - name: config
          configMap:
            name: prometheus-config
        - name: rules
          configMap:
            name: prometheus-rules
EOF
```
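For the standalone route, the scrape config packed into `monitoring/prometheus-config.yaml` needs a Kubernetes service-discovery section along these lines. This is a minimal sketch; the pod label used in the `keep` rule is an assumption, so adjust it to your deployment's labels.
```yaml
# Minimal standalone scrape config (sketch; pod label is an assumption)
global:
  scrape_interval: 30s
  evaluation_interval: 30s
rule_files:
  - /etc/prometheus/rules/*.yaml
scrape_configs:
  - job_name: inferno
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: [inferno-prod]
    relabel_configs:
      # Keep only Inferno pods...
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
        regex: inferno
        action: keep
      # ...and only their metrics port
      - source_labels: [__meta_kubernetes_pod_container_port_number]
        regex: "9090"
        action: keep
```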
### Enable in Helm Chart
The Inferno Helm chart has built-in monitoring support:
```bash
# Enable ServiceMonitor in Helm values
helm install inferno ./helm/inferno \
-f helm/inferno/values-prod.yaml \
--set monitoring.serviceMonitor.enabled=true
# Or with environment variable
HELM_ENABLE_MONITORING=true helm install inferno ./helm/inferno \
-f helm/inferno/values-prod.yaml
```
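The same flag can live in a values file instead of on the command line; the key path below is taken from the `--set` flag above:
```yaml
# helm/inferno/values-prod.yaml (excerpt)
monitoring:
  serviceMonitor:
    enabled: true
```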
## Metrics
### Request Metrics
| Metric | Type | Description |
|---|---|---|
| `inferno_http_requests_total` | Counter | Total HTTP requests by endpoint and status |
| `inferno_http_request_duration_seconds` | Histogram | HTTP request latency distribution |
| `inferno_http_requests_in_progress` | Gauge | Current in-flight requests |
### Inference Metrics
| Metric | Type | Description |
|---|---|---|
| `inferno_inference_requests_total` | Counter | Total inference requests by model |
| `inferno_inference_duration_seconds` | Histogram | Inference latency distribution by model |
| `inferno_inference_errors_total` | Counter | Inference errors by model |
| `inferno_tokens_generated_total` | Counter | Total tokens generated by model |
### Queue Metrics
| Metric | Type | Description |
|---|---|---|
| `inferno_queue_pending_requests` | Gauge | Current pending requests |
| `inferno_queue_max_capacity` | Gauge | Queue capacity limit |
| `inferno_queue_processed_total` | Counter | Total processed requests |
| `inferno_queue_dropped_total` | Counter | Dropped requests (queue full) |
### Cache Metrics
| Metric | Type | Description |
|---|---|---|
| `inferno_cache_hits_total` | Counter | Cache hit count |
| `inferno_cache_misses_total` | Counter | Cache miss count |
| `inferno_cache_evictions_total` | Counter | Cache evictions |
| `inferno_cache_size_bytes` | Gauge | Current cache size |
### Model Metrics
| Metric | Type | Description |
|---|---|---|
| `inferno_models_loaded` | Gauge | Number of loaded models |
| `inferno_model_size_bytes` | Gauge | Model size in bytes |
| `inferno_model_load_duration_seconds` | Histogram | Model load time |
| `inferno_model_load_errors_total` | Counter | Model load failures |
### Resource Metrics
These are exposed by the kubelet/cAdvisor and node-exporter rather than by Inferno itself:

| Metric | Type | Description |
|---|---|---|
| `container_cpu_usage_seconds_total` | Counter | Container CPU usage |
| `container_memory_usage_bytes` | Gauge | Container memory usage |
| `node_filesystem_avail_bytes` | Gauge | Available disk space |
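All `inferno_*` series are served in the standard Prometheus text exposition format on `:9090/metrics`, which makes spot checks easy. The label values in the expected output below are illustrative:
```bash
kubectl port-forward -n inferno-prod svc/inferno 9090:9090
curl -s http://localhost:9090/metrics | grep inferno_http_requests_total
# Expected shape (label values illustrative):
#   inferno_http_requests_total{endpoint="/v1/completions",status="200"} 10492
```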
## Alerts
### Critical Alerts
| Alert | Condition | Duration | Action |
|---|---|---|---|
| `InfernoPodDown` | Pod not responding | 2 min | Page on-call engineer |
| `InfernoQueueCritical` | >500 pending requests | 2 min | Page on-call engineer |
| `InfernoPodMemoryCritical` | >3900Mi memory | 2 min | Page on-call engineer |
| `InfernoDiskSpaceCritical` | <5% available | 2 min | Page on-call engineer |
| `InfernoPersistenceWriteFailure` | Write errors detected | 5 min | Page on-call engineer |
### Warning Alerts
| Alert | Condition | Duration | Action |
|---|---|---|---|
| `InfernoHighLatency` | P95 latency >1s | 5 min | Email ops team |
| `InfernoHighErrorRate` | >5% 5xx errors | 5 min | Email ops team |
| `InfernoQueueBacklog` | >100 pending requests | 5 min | Email ops team |
| `InfernoPodCPUHigh` | >1800m CPU | 5 min | Email ops team |
| `InfernoPodMemoryHigh` | >3500Mi memory | 5 min | Email ops team |
| `InfernoDiskSpaceLow` | <15% available | 5 min | Email ops team |
### Info Alerts
| Alert | Description |
|---|---|
| `InfernoCacheHitRateLow` | Cache efficiency warning |
| `InfernoRateLimitExceeded` | Client rate limiting active |
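As a reference for `monitoring/prometheusrule.yaml`, the `InfernoHighLatency` warning from the table above would be expressed roughly like this; the group name and annotation wording are illustrative:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: inferno-alerts
  namespace: monitoring
spec:
  groups:
    - name: inferno.latency   # illustrative group name
      rules:
        - alert: InfernoHighLatency
          expr: |
            histogram_quantile(0.95,
              sum by (le) (rate(inferno_http_request_duration_seconds_bucket[5m]))) > 1
          for: 5m
          labels:
            severity: warning
          annotations:
            description: "P95 API latency has been above 1s for 5 minutes."
```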
## Grafana Dashboards
### Overview Dashboard
Shows:
- Pod status and health
- Request rate (5m average)
- API latency percentiles (P95, P99)
- 5xx error rate
- Request queue depth
- Inference latency (P95)
- Cache hit rate
- Pod memory usage
### Importing Dashboards
```bash
# Option 1: Via Grafana UI
# 1. Go to Dashboards → Import
# 2. Upload JSON file: monitoring/grafana-dashboard.json
# 3. Select Prometheus datasource
# Option 2: Via ConfigMap (Kubernetes)
kubectl create configmap grafana-dashboard-inferno \
--from-file=monitoring/grafana-dashboard.json \
-n monitoring
# Add label for auto-discovery (if using sidecar)
kubectl label configmap grafana-dashboard-inferno \
grafana_dashboard=1 -n monitoring
```
### Custom Dashboards
You can create additional dashboards for:
- **Model Performance**: Per-model metrics (latency, throughput, errors)
- **Queue Analysis**: Detailed queue depth, wait times, throughput
- **Cache Analysis**: Hit rate trends, eviction rates, memory usage
- **Resource Utilization**: CPU, memory, disk trends over time
- **Business Metrics**: Requests per customer, tokens per hour, cost per inference
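For instance, a tokens-per-hour panel on the Business Metrics dashboard can be driven directly by the counter from the Metrics section (rates are per-second, hence the scaling):
```promql
# Tokens generated per hour, per model
sum by (model) (rate(inferno_tokens_generated_total[1h])) * 3600
```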
## PromQL Queries
### Common Queries
```promql
# Request rate (requests/second)
rate(inferno_http_requests_total[5m])
# P95 API latency
histogram_quantile(0.95, rate(inferno_http_request_duration_seconds_bucket[5m]))
# Error rate percentage (aggregate both sides, so 5xx traffic divides by total traffic)
100 * sum(rate(inferno_http_requests_total{status=~"5.."}[5m])) / sum(rate(inferno_http_requests_total[5m]))
# Queue utilization
inferno_queue_pending_requests / inferno_queue_max_capacity
# Cache hit rate
rate(inferno_cache_hits_total[5m]) / (rate(inferno_cache_hits_total[5m]) + rate(inferno_cache_misses_total[5m]))
# P95 inference latency by model
histogram_quantile(0.95, sum by (le, model) (rate(inferno_inference_duration_seconds_bucket[5m])))
# Per-model error rate
sum by (model) (rate(inferno_inference_errors_total[5m]))
# Pod memory usage (MiB)
container_memory_usage_bytes{pod=~"inferno.*"} / 1024 / 1024
# Pod CPU usage (millicores)
rate(container_cpu_usage_seconds_total{pod=~"inferno.*"}[5m]) * 1000
```
## Alert Configuration
### Slack Integration
```bash
# Configure Alertmanager for Slack
cat > /etc/alertmanager/slack-config.yaml <<EOF
global:
  slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
route:
  receiver: 'slack-notifications'
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
receivers:
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
        color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
EOF
```
### PagerDuty Integration
```bash
cat > /etc/alertmanager/pagerduty-config.yaml <<EOF
global:
  resolve_timeout: 5m
route:
  receiver: 'pagerduty'
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 30m
receivers:
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
        description: '{{ .GroupLabels.alertname }}'
        details:
          firing: '{{ range .Alerts.Firing }}{{ .Annotations.description }}{{ end }}'
EOF
```
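Before reloading Alertmanager, validate either file with `amtool` and push a synthetic alert through the route end-to-end. This assumes Alertmanager is reachable locally on its default port 9093:
```bash
# Validate the config before loading it
amtool check-config /etc/alertmanager/slack-config.yaml
# Fire a synthetic alert to exercise the receiver (labels are illustrative)
amtool alert add alertname=InfernoTestAlert severity=warning \
  --alertmanager.url=http://localhost:9093
```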
## Troubleshooting
### Prometheus Not Scraping Metrics
```bash
# Check ServiceMonitor is created
kubectl get servicemonitor -n inferno-prod
# Check Prometheus targets
kubectl port-forward -n monitoring svc/prometheus 9090:9090
# Visit http://localhost:9090/targets
# Verify the metrics endpoint responds
kubectl port-forward -n inferno-prod svc/inferno 9090:9090
curl http://localhost:9090/metrics
```
### Alerts Not Firing
```bash
# Check PrometheusRule
kubectl get prometheusrule -n inferno-prod
# Check rule evaluation and test alert queries manually
kubectl port-forward -n monitoring svc/prometheus 9090:9090
# Visit http://localhost:9090/rules and http://localhost:9090/alerts
```
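For standalone Prometheus (raw rule files rather than the PrometheusRule CRD), `promtool` catches syntax errors before the server loads the rules; the path here is illustrative:
```bash
promtool check rules monitoring/rules/inferno-alerts.yaml
```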
### No Data in Grafana
```bash
# Verify datasource connection
# Grafana UI → Configuration → Data Sources → Prometheus
# Click "Save & Test"
# Check Prometheus has data
kubectl port-forward -n monitoring svc/prometheus 9090:9090
curl "http://localhost:9090/api/v1/query?query=up"
```
## Performance Tuning
### Prometheus Configuration
```yaml
# Adjust scrape interval for more/less data
scrape_interval: 30s # Default: 30s, increase for less load
evaluation_interval: 30s # Default: 30s, increase for less CPU
```
### Retention Policies
```bash
# Limit retention time (15d is also the Prometheus default)
prometheus --storage.tsdb.retention.time=15d
# Cap on-disk metrics at 50GB (default is unlimited)
prometheus --storage.tsdb.retention.size=50GB
```
### Grafana Performance
```bash
# Update dashboard refresh rate
# In Grafana dashboard: Refresh interval dropdown
# Use higher values (5m, 10m) for less load on Prometheus
```
## Metrics Collection Overhead
| Component | CPU | Memory | Data Volume |
|---|---|---|---|
| ServiceMonitor scrape | <5% | <10Mi | 1-2MB/hour |
| Alert evaluation | <2% | <5Mi | None |
| Grafana queries | Varies | 50-100Mi | None |
## Best Practices
1. **Alert on symptoms, not causes**
- Alert on high latency, not CPU
- Alert on queue depth, not throughput
2. **Use recording rules** (see the sketch after this list)
- Pre-compute common aggregations
- Reduces query load on Prometheus
3. **Meaningful alert descriptions**
- Include runbook links
- Suggest remediation steps
4. **Regular testing**
- Test alert receivers regularly
- Verify dashboard accuracy
5. **Document custom metrics**
- Explain what each metric means
- List dependencies
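For item 2, a recording rule that pre-computes the 5xx error ratio used throughout this guide might look like the following (raw Prometheus rule format; the rule name is a suggested convention):
```yaml
groups:
  - name: inferno.recording
    rules:
      # Pre-computed 5xx error ratio, queryable as one cheap series
      - record: inferno:http_error_rate:ratio_5m
        expr: |
          sum(rate(inferno_http_requests_total{status=~"5.."}[5m]))
            /
          sum(rate(inferno_http_requests_total[5m]))
```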
## Next Steps
1. Deploy Prometheus and Grafana
2. Import dashboards into Grafana
3. Configure alert receivers (Slack, PagerDuty, email)
4. Create runbooks for each alert
5. Set up on-call rotation
6. Monitor the monitoring system
## Support
- **Documentation**: [Helm Chart README](../helm/inferno/README.md)
- **GitHub**: https://github.com/ringo380/inferno
- **Issues**: https://github.com/ringo380/inferno/issues
---
**Version**: Inferno v0.8.0
**Updated**: 2024-Q4
**Prometheus Operator**: v0.50+