# Inferno Monitoring & Observability Guide
Complete monitoring setup for Inferno v0.8.0 using Prometheus, Grafana, and Alertmanager.
## Overview
This monitoring stack provides:
- **Prometheus**: Metrics collection and storage
- **Grafana**: Visualization and dashboarding
- **Alertmanager**: Alert routing and notifications
- **ServiceMonitor**: Kubernetes Prometheus Operator integration
- **PrometheusRule**: Alert definitions in Kubernetes format
## Architecture
```
Inferno Pods (metrics endpoint :9090/metrics)
↓
ServiceMonitor (discovers pods via Kubernetes SD)
↓
Prometheus (scrapes every 30s, evaluates rules)
↓
PrometheusRule (triggers alerts, computes recording rules)
↓
Alertmanager (routes alerts) → Slack/PagerDuty/Email
↓
Grafana (queries Prometheus, displays dashboards)
```
## Installation
### Prerequisites
- Prometheus Operator (or standalone Prometheus)
- Grafana 7.0+
- Kubernetes 1.20+
### Option 1: Helm Install (With Prometheus Operator)
```bash
# Install Prometheus Operator (if not already installed)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack -n monitoring --create-namespace
# Apply Inferno ServiceMonitor
kubectl apply -f monitoring/servicemonitor.yaml
# Apply Inferno PrometheusRule
kubectl apply -f monitoring/prometheusrule.yaml
# Apply Grafana datasource
kubectl apply -f monitoring/grafana-datasource.yaml
```
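For reference, a `ServiceMonitor` like the one in `monitoring/servicemonitor.yaml` generally looks like the sketch below. The namespace and label selector here are assumptions; align them with the labels your Inferno Service actually carries.
```yaml
# Sketch only -- namespace and selector labels are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: inferno
  namespace: monitoring
spec:
  namespaceSelector:
    matchNames:
      - inferno-prod
  selector:
    matchLabels:
      app.kubernetes.io/name: inferno
  endpoints:
    - port: metrics      # name of the Service port that exposes :9090/metrics
      path: /metrics
      interval: 30s      # matches the 30s scrape cadence in the architecture above
```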
### Option 2: Manual Install (Standalone Prometheus)
```bash
# Create monitoring namespace
kubectl create namespace monitoring
# Create Prometheus ConfigMap with scrape config
# (key the file as prometheus.yaml so it matches --config.file below)
kubectl create configmap prometheus-config \
  --from-file=prometheus.yaml=monitoring/prometheus-config.yaml \
  -n monitoring
# Create the rules ConfigMap referenced by the deployment below
# (point --from-file at wherever your rule files actually live)
kubectl create configmap prometheus-rules \
  --from-file=monitoring/rules/ \
  -n monitoring
# Create Prometheus deployment
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:latest
          ports:
            - containerPort: 9090
          volumeMounts:
            - name: config
              mountPath: /etc/prometheus
            - name: rules
              mountPath: /etc/prometheus/rules
          args:
            - '--config.file=/etc/prometheus/prometheus.yaml'
            - '--storage.tsdb.path=/prometheus'
      volumes:
        - name: config
          configMap:
            name: prometheus-config
        - name: rules
          configMap:
            name: prometheus-rules
EOF
```
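For the standalone route, the scrape config packed into `monitoring/prometheus-config.yaml` needs a Kubernetes service-discovery section along these lines. This is a minimal sketch; the pod label used in the `keep` rule is an assumption, so adjust it to your deployment's labels.
```yaml
# Minimal standalone scrape config (sketch; pod label is an assumption)
global:
  scrape_interval: 30s
  evaluation_interval: 30s
rule_files:
  - /etc/prometheus/rules/*.yaml
scrape_configs:
  - job_name: inferno
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: [inferno-prod]
    relabel_configs:
      # Keep only Inferno pods...
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
        regex: inferno
        action: keep
      # ...and only their metrics port
      - source_labels: [__meta_kubernetes_pod_container_port_number]
        regex: "9090"
        action: keep
```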
### Enable in Helm Chart
The Inferno Helm chart has built-in monitoring support:
```bash
# Enable ServiceMonitor in Helm values
helm install inferno ./helm/inferno \
-f helm/inferno/values-prod.yaml \
--set monitoring.serviceMonitor.enabled=true
# Or with environment variable
HELM_ENABLE_MONITORING=true helm install inferno ./helm/inferno \
-f helm/inferno/values-prod.yaml
```
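The same flag can live in a values file instead of on the command line; the key path below is taken from the `--set` flag above:
```yaml
# helm/inferno/values-prod.yaml (excerpt)
monitoring:
  serviceMonitor:
    enabled: true
```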
## Metrics
### Request Metrics
| Metric | Type | Description |
|---|---|---|
| `inferno_http_requests_total` | Counter | Total HTTP requests by endpoint and status |
| `inferno_http_request_duration_seconds` | Histogram | HTTP request latency distribution |
| `inferno_http_requests_in_progress` | Gauge | Current in-flight requests |
### Inference Metrics
| Metric | Type | Description |
|---|---|---|
| `inferno_inference_requests_total` | Counter | Total inference requests by model |
| `inferno_inference_duration_seconds` | Histogram | Inference latency distribution by model |
| `inferno_inference_errors_total` | Counter | Inference errors by model |
| `inferno_tokens_generated_total` | Counter | Total tokens generated by model |
### Queue Metrics
| Metric | Type | Description |
|---|---|---|
| `inferno_queue_pending_requests` | Gauge | Current pending requests |
| `inferno_queue_max_capacity` | Gauge | Queue capacity limit |
| `inferno_queue_processed_total` | Counter | Total processed requests |
| `inferno_queue_dropped_total` | Counter | Dropped requests (queue full) |
### Cache Metrics
| Metric | Type | Description |
|---|---|---|
| `inferno_cache_hits_total` | Counter | Cache hit count |
| `inferno_cache_misses_total` | Counter | Cache miss count |
| `inferno_cache_evictions_total` | Counter | Cache evictions |
| `inferno_cache_size_bytes` | Gauge | Current cache size |
### Model Metrics
| Metric | Type | Description |
|---|---|---|
| `inferno_models_loaded` | Gauge | Number of loaded models |
| `inferno_model_size_bytes` | Gauge | Model size in bytes |
| `inferno_model_load_duration_seconds` | Histogram | Model load time |
| `inferno_model_load_errors_total` | Counter | Model load failures |
### Resource Metrics
These are exposed by the kubelet/cAdvisor and node-exporter rather than by Inferno itself:

| Metric | Type | Description |
|---|---|---|
| `container_cpu_usage_seconds_total` | Counter | Container CPU usage |
| `container_memory_usage_bytes` | Gauge | Container memory usage |
| `node_filesystem_avail_bytes` | Gauge | Available disk space |
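All `inferno_*` series are served in the standard Prometheus text exposition format on `:9090/metrics`, which makes spot checks easy. The label values in the expected output below are illustrative:
```bash
kubectl port-forward -n inferno-prod svc/inferno 9090:9090
curl -s http://localhost:9090/metrics | grep inferno_http_requests_total
# Expected shape (label values illustrative):
#   inferno_http_requests_total{endpoint="/v1/completions",status="200"} 10492
```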
## Alerts
### Critical Alerts
| Alert | Condition | Duration | Action |
|---|---|---|---|
| `InfernoPodDown` | Pod not responding | 2 min | Page on-call engineer |
| `InfernoQueueCritical` | >500 pending requests | 2 min | Page on-call engineer |
| `InfernoPodMemoryCritical` | >3900Mi memory | 2 min | Page on-call engineer |
| `InfernoDiskSpaceCritical` | <5% available | 2 min | Page on-call engineer |
| `InfernoPersistenceWriteFailure` | Write errors detected | 5 min | Page on-call engineer |
### Warning Alerts
| Alert | Condition | Duration | Action |
|---|---|---|---|
| `InfernoHighLatency` | P95 latency >1s | 5 min | Email ops team |
| `InfernoHighErrorRate` | >5% 5xx errors | 5 min | Email ops team |
| `InfernoQueueBacklog` | >100 pending requests | 5 min | Email ops team |
| `InfernoPodCPUHigh` | >1800m CPU | 5 min | Email ops team |
| `InfernoPodMemoryHigh` | >3500Mi memory | 5 min | Email ops team |
| `InfernoDiskSpaceLow` | <15% available | 5 min | Email ops team |
### Info Alerts
| Alert | Description |
|---|---|
| `InfernoCacheHitRateLow` | Cache efficiency warning |
| `InfernoRateLimitExceeded` | Client rate limiting active |
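As a reference for `monitoring/prometheusrule.yaml`, the `InfernoHighLatency` warning from the table above would be expressed roughly like this; the group name and annotation wording are illustrative:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: inferno-alerts
  namespace: monitoring
spec:
  groups:
    - name: inferno.latency   # illustrative group name
      rules:
        - alert: InfernoHighLatency
          expr: |
            histogram_quantile(0.95,
              sum by (le) (rate(inferno_http_request_duration_seconds_bucket[5m]))) > 1
          for: 5m
          labels:
            severity: warning
          annotations:
            description: "P95 API latency has been above 1s for 5 minutes."
```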
## Grafana Dashboards
### Overview Dashboard
Shows:
- Pod status and health
- Request rate (5m average)
- API latency percentiles (P95, P99)
- 5xx error rate
- Request queue depth
- Inference latency (P95)
- Cache hit rate
- Pod memory usage
### Importing Dashboards
```bash
# Option 1: Via Grafana UI
# 1. Go to Dashboards → Import
# 2. Upload JSON file: monitoring/grafana-dashboard.json
# 3. Select Prometheus datasource
# Option 2: Via ConfigMap (Kubernetes)
kubectl create configmap grafana-dashboard-inferno \
--from-file=monitoring/grafana-dashboard.json \
-n monitoring
# Add label for auto-discovery (if using sidecar)
kubectl label configmap grafana-dashboard-inferno \
grafana_dashboard=1 -n monitoring
```
### Custom Dashboards
You can create additional dashboards for:
- **Model Performance**: Per-model metrics (latency, throughput, errors)
- **Queue Analysis**: Detailed queue depth, wait times, throughput
- **Cache Analysis**: Hit rate trends, eviction rates, memory usage
- **Resource Utilization**: CPU, memory, disk trends over time
- **Business Metrics**: Requests per customer, tokens per hour, cost per inference
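For instance, a tokens-per-hour panel on the Business Metrics dashboard can be driven directly by the counter from the Metrics section (rates are per-second, hence the scaling):
```promql
# Tokens generated per hour, per model
sum by (model) (rate(inferno_tokens_generated_total[1h])) * 3600
```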
## PromQL Queries
### Common Queries
```promql
# Request rate (requests/second)
rate(inferno_http_requests_total[5m])
# P95 API latency
histogram_quantile(0.95, rate(inferno_http_request_duration_seconds_bucket[5m]))
# Error rate percentage (aggregate both sides, so 5xx traffic divides by total traffic)
100 * sum(rate(inferno_http_requests_total{status=~"5.."}[5m])) / sum(rate(inferno_http_requests_total[5m]))
# Queue utilization
inferno_queue_pending_requests / inferno_queue_max_capacity
# Cache hit rate
rate(inferno_cache_hits_total[5m]) / (rate(inferno_cache_hits_total[5m]) + rate(inferno_cache_misses_total[5m]))
# P95 inference latency by model
histogram_quantile(0.95, sum by (le, model) (rate(inferno_inference_duration_seconds_bucket[5m])))
# Per-model error rate
sum by (model) (rate(inferno_inference_errors_total[5m]))
# Pod memory usage (MiB)
container_memory_usage_bytes{pod=~"inferno.*"} / 1024 / 1024
# Pod CPU usage (millicores)
rate(container_cpu_usage_seconds_total{pod=~"inferno.*"}[5m]) * 1000
```
## Alert Configuration
### Slack Integration
```bash
# Configure Alertmanager for Slack
cat > /etc/alertmanager/slack-config.yaml <<EOF
global:
  slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
route:
  receiver: 'slack-notifications'
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
receivers:
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
        color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
EOF
```
### PagerDuty Integration
```bash
cat > /etc/alertmanager/pagerduty-config.yaml <<EOF
global:
  resolve_timeout: 5m
route:
  receiver: 'pagerduty'
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 30m
receivers:
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
        description: '{{ .GroupLabels.alertname }}'
        details:
          firing: '{{ range .Alerts.Firing }}{{ .Annotations.description }}{{ end }}'
EOF
```
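Before reloading Alertmanager, validate either file with `amtool` and push a synthetic alert through the route end-to-end. This assumes Alertmanager is reachable locally on its default port 9093:
```bash
# Validate the config before loading it
amtool check-config /etc/alertmanager/slack-config.yaml
# Fire a synthetic alert to exercise the receiver (labels are illustrative)
amtool alert add alertname=InfernoTestAlert severity=warning \
  --alertmanager.url=http://localhost:9093
```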
## Troubleshooting
### Prometheus Not Scraping Metrics
```bash
# Check ServiceMonitor is created
kubectl get servicemonitor -n inferno-prod
# Check Prometheus targets
kubectl port-forward -n monitoring svc/prometheus 9090:9090
# Visit http://localhost:9090/targets
# Verify the metrics endpoint responds
kubectl port-forward -n inferno-prod svc/inferno 9090:9090
curl http://localhost:9090/metrics
```
### Alerts Not Firing
```bash
# Check PrometheusRule
kubectl get prometheusrule -n inferno-prod
# Check rule evaluation and test alert queries manually
kubectl port-forward -n monitoring svc/prometheus 9090:9090
# Visit http://localhost:9090/rules and http://localhost:9090/alerts
```
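For standalone Prometheus (raw rule files rather than the PrometheusRule CRD), `promtool` catches syntax errors before the server loads the rules; the path here is illustrative:
```bash
promtool check rules monitoring/rules/inferno-alerts.yaml
```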
### No Data in Grafana
```bash
# Verify datasource connection
# Grafana UI → Configuration → Data Sources → Prometheus
# Click "Save & Test"
# Check Prometheus has data
kubectl port-forward -n monitoring svc/prometheus 9090:9090
curl "http://localhost:9090/api/v1/query?query=up"
```
## Performance Tuning
### Prometheus Configuration
```yaml
# Adjust scrape interval for more/less data
scrape_interval: 30s # Default: 30s, increase for less load
evaluation_interval: 30s # Default: 30s, increase for less CPU
```
### Retention Policies
```bash
# Limit retention time (15d is also the Prometheus default)
prometheus --storage.tsdb.retention.time=15d
# Cap on-disk metrics at 50GB (default is unlimited)
prometheus --storage.tsdb.retention.size=50GB
```
### Grafana Performance
```bash
# Update dashboard refresh rate
# In Grafana dashboard: Refresh interval dropdown
# Use higher values (5m, 10m) for less load on Prometheus
```
## Metrics Collection Overhead
| Component | CPU | Memory | Data Volume |
|---|---|---|---|
| ServiceMonitor scrape | <5% | <10Mi | 1-2MB/hour |
| Alert evaluation | <2% | <5Mi | None |
| Grafana queries | Varies | 50-100Mi | None |
## Best Practices
1. **Alert on symptoms, not causes**
- Alert on high latency, not CPU
- Alert on queue depth, not throughput
2. **Use recording rules** (see the sketch after this list)
- Pre-compute common aggregations
- Reduces query load on Prometheus
3. **Meaningful alert descriptions**
- Include runbook links
- Suggest remediation steps
4. **Regular testing**
- Test alert receivers regularly
- Verify dashboard accuracy
5. **Document custom metrics**
- Explain what each metric means
- List dependencies
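For item 2, a recording rule that pre-computes the 5xx error ratio used throughout this guide might look like the following (raw Prometheus rule format; the rule name is a suggested convention):
```yaml
groups:
  - name: inferno.recording
    rules:
      # Pre-computed 5xx error ratio, queryable as one cheap series
      - record: inferno:http_error_rate:ratio_5m
        expr: |
          sum(rate(inferno_http_requests_total{status=~"5.."}[5m]))
            /
          sum(rate(inferno_http_requests_total[5m]))
```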
## Next Steps
1. Deploy Prometheus and Grafana
2. Import dashboards into Grafana
3. Configure alert receivers (Slack, PagerDuty, email)
4. Create runbooks for each alert
5. Set up on-call rotation
6. Monitor the monitoring system
## Support
- **Documentation**: [Helm Chart README](../helm/inferno/README.md)
- **GitHub**: https://github.com/ringo380/inferno
- **Issues**: https://github.com/ringo380/inferno/issues
---
**Version**: Inferno v0.8.0
**Updated**: 2024-Q4
**Prometheus Operator**: v0.50+