kotoba-monitoring 0.1.21

# KotobaDB Monitoring & Metrics

**Comprehensive monitoring and metrics collection system for KotobaDB** with health checks, performance monitoring, and Prometheus integration.

## Features

- **Metrics Collection**: Automatic collection of database, system, and application metrics
- **Health Checks**: Comprehensive health monitoring with configurable checks
- **Performance Monitoring**: Real-time performance tracking and analysis
- **Prometheus Integration**: Native Prometheus metrics export and scraping
- **Alerting System**: Configurable alerting rules and notifications
- **Custom Metrics**: Extensible metrics collection framework
- **Historical Data**: Time-series metrics storage and querying

## Quick Start

Add to your `Cargo.toml`:

```toml
[dependencies]
kotoba-monitoring = "0.1.0"
```

### Basic Monitoring Setup

```rust
use kotoba_monitoring::*;
use std::collections::HashMap;
use std::sync::Arc;
use std::time::Duration;

// Create monitoring configuration
let monitoring_config = MonitoringConfig {
    enable_metrics: true,
    enable_health_checks: true,
    collection_interval: Duration::from_secs(15),
    health_check_interval: Duration::from_secs(30),
    prometheus_config: PrometheusConfig {
        enabled: true,
        address: "127.0.0.1".to_string(),
        port: 9090,
        path: "/metrics".to_string(),
        global_labels: HashMap::new(),
    },
    ..Default::default()
};

// Create metrics collector (assuming you have a KotobaDB instance)
let collector = Arc::new(MetricsCollector::new(db_instance, monitoring_config.clone()));

// Create health checker
let health_checker = HealthChecker::new(monitoring_config.clone());
health_checker.add_default_checks().await?;

// Create Prometheus exporter
let exporter = PrometheusExporter::new(Arc::clone(&collector), monitoring_config.prometheus_config)?;

// Start monitoring
collector.start().await?;
health_checker.start().await?;
exporter.start().await?;

println!("Monitoring system started");

// Monitor health
let health = health_checker.check_health().await?;
println!("System health: {:?}", health.overall_status);

// Get performance metrics
let performance = collector.get_performance_metrics().await?;
println!("Queries per second: {}", performance.query_metrics.queries_per_second);
```

### Custom Metrics

```rust
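// Record custom metrics.
// Note: `hashmap!` is not part of std; this example assumes the macro from
// the `maplit` crate (or an equivalent) is in scope.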
collector.record_metric(
    "custom_operation_duration",
    1.5,
    hashmap! {
        "operation".to_string() => "user_registration".to_string(),
        "region".to_string() => "us-east".to_string(),
    }
).await?;

// Record using Prometheus helpers
use kotoba_monitoring::prometheus_exporter::*;

record_counter("user_registrations_total", 1, &[("region", "us-east")]);
record_gauge("active_users", 150.0, &[("service", "auth")]);
record_histogram("request_duration", 0.25, &[("endpoint", "/api/users")]);
```

### Health Checks

```rust
// Create custom health check
struct CustomHealthCheck;

#[async_trait::async_trait]
impl HealthCheck for CustomHealthCheck {
    async fn check_health(&self) -> HealthCheckResult {
        // Your custom health check logic
        let status = HealthStatus::Healthy;
        let message = "Custom service is healthy".to_string();

        HealthCheckResult {
            name: "custom_service".to_string(),
            status,
            message,
            duration: Duration::from_millis(50),
            details: hashmap! {
                "version".to_string() => "1.2.3".to_string(),
                "uptime".to_string() => "2h 30m".to_string(),
            },
        }
    }
}

// Register custom health check
health_checker.register_check("custom".to_string(), Box::new(CustomHealthCheck)).await?;
```

## Metrics Categories

### Database Metrics

```rust
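// Automatically collected database metrics.
// `Utc` and `Duration::hours` come from chrono here, not std::time.
use chrono::{Duration, Utc};

let now = Utc::now();
let db_metrics = collector.get_metrics(
    "database_connections_active",
    now - Duration::hours(1),
    now,
).await?;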
```

Available database metrics:
- `kotoba_db_connections_active`: Active database connections
- `kotoba_db_connections_total`: Total database connections
- `kotoba_db_queries_total`: Total number of queries
- `kotoba_db_query_latency_seconds`: Query latency histogram
- `kotoba_db_storage_size_bytes`: Total storage size
- `kotoba_db_storage_used_bytes`: Used storage size

### System Metrics (with `system` feature)

```toml
[dependencies]
kotoba-monitoring = { version = "0.1.0", features = ["system"] }
```

Available system metrics:
- `system_cpu_usage_percent`: CPU usage percentage
- `system_memory_usage_bytes`: Memory usage in bytes
- `system_memory_usage_percent`: Memory usage percentage
- `system_disk_usage_bytes`: Disk usage in bytes
- `system_disk_usage_percent`: Disk usage percentage

### Cluster Metrics (with `cluster` feature)

```toml
[dependencies]
kotoba-monitoring = { version = "0.1.0", features = ["cluster"] }
```

Available cluster metrics:
- `kotoba_cluster_nodes_total`: Total cluster nodes
- `kotoba_cluster_nodes_active`: Active cluster nodes
- `kotoba_cluster_leader_changes_total`: Leader change events

## Prometheus Integration

### Configuration

```rust
let prometheus_config = PrometheusConfig {
    enabled: true,
    address: "0.0.0.0".to_string(),  // Listen on all interfaces
    port: 9090,
    path: "/metrics".to_string(),
    global_labels: hashmap! {
        "service".to_string() => "kotoba-db".to_string(),
        "environment".to_string() => "production".to_string(),
    },
};
```

### Accessing Metrics

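Once the exporter is running with the configuration above, metrics are available at `http://localhost:9090/metrics`. Port 9090 is also the default listen port of the Prometheus server itself, so pick a different exporter port if both run on the same host.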

```bash
# View metrics
curl http://localhost:9090/metrics

# Example output
# HELP kotoba_db_connections_active Number of active database connections
# TYPE kotoba_db_connections_active gauge
# kotoba_db_connections_active{service="kotoba-db",environment="production"} 5

# HELP kotoba_db_query_latency_seconds Database query latency in seconds
# TYPE kotoba_db_query_latency_seconds histogram
# ...
```
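
To make the output format concrete, here is a rough sketch of how one gauge sample with global labels could be rendered in the Prometheus text exposition format. `render_gauge` is a hypothetical helper written for this README, not the exporter's actual code, and it does not escape special characters in label values.

```rust
use std::collections::BTreeMap;

/// Renders one gauge sample in the Prometheus text exposition format.
/// Hypothetical helper for illustration; the real exporter may differ.
/// A `BTreeMap` keeps label order deterministic (sorted by label name).
/// Label values are written as-is: quotes and backslashes are NOT escaped here.
fn render_gauge(name: &str, help: &str, labels: &BTreeMap<String, String>, value: f64) -> String {
    let label_str = labels
        .iter()
        .map(|(k, v)| format!("{}=\"{}\"", k, v))
        .collect::<Vec<_>>()
        .join(",");
    // Omit the braces entirely when there are no labels.
    let series = if label_str.is_empty() {
        name.to_string()
    } else {
        format!("{}{{{}}}", name, label_str)
    };
    format!("# HELP {name} {help}\n# TYPE {name} gauge\n{series} {value}\n")
}
```

Because the labels are sorted by name, `environment` appears before `service` in the rendered line. Prometheus does not care about label order, so this only affects how the text looks.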

### Prometheus Configuration

Add to your `prometheus.yml`:

```yaml
scrape_configs:
  - job_name: 'kotoba-db'
    static_configs:
      - targets: ['localhost:9090']
    scrape_interval: 15s
```

### Grafana Dashboards

Use the exported metrics to create dashboards in Grafana:

- **Database Performance**: Query latency, throughput, connection counts
- **Storage Usage**: Disk usage, I/O operations, cache hit rates
- **System Resources**: CPU, memory, network usage
- **Health Status**: Service health indicators and alerts

## Alerting System

### Alert Rules

```rust
let alerting_config = AlertingConfig {
    enabled: true,
    rules: vec![
        AlertRule {
            name: "High CPU Usage".to_string(),
            description: "CPU usage is above 80%".to_string(),
            query: "system_cpu_usage_percent > 80".to_string(),
            threshold: AlertThreshold::GreaterThan(80.0),
            evaluation_interval: Duration::from_secs(60),
            severity: AlertSeverity::Warning,
            labels: hashmap! {
                "team".to_string() => "infrastructure".to_string(),
            },
        },
        AlertRule {
            name: "Database Down".to_string(),
            description: "Database health check is failing".to_string(),
            query: "health_check_status{check_name=\"database\"} == 0".to_string(),
            threshold: AlertThreshold::Equal(0.0),
            evaluation_interval: Duration::from_secs(30),
            severity: AlertSeverity::Critical,
            labels: hashmap! {
                "service".to_string() => "database".to_string(),
            },
        },
    ],
    notifications: vec![
        NotificationConfig {
            notification_type: NotificationType::Slack,
            config: hashmap! {
                "webhook_url".to_string() => "https://hooks.slack.com/...".to_string(),
                "channel".to_string() => "#alerts".to_string(),
            },
        },
    ],
};
```

### Alert Severities

- **Info**: Informational alerts (e.g., version updates)
- **Warning**: Warning conditions (e.g., high resource usage)
- **Error**: Error conditions (e.g., service degradation)
- **Critical**: Critical conditions (e.g., service down)
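
The four levels form a natural ranking, which is handy when a channel should only receive alerts above a certain level. The sketch below uses a stand-in enum named `Severity`. Whether the crate's own `AlertSeverity` derives `Ord` is not documented here, so treat this as an assumption.

```rust
/// Illustrative stand-in for the crate's `AlertSeverity`; the real type may
/// derive different traits. Variants are declared from least to most severe
/// so the derived `Ord` matches that ranking.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Severity {
    Info,
    Warning,
    Error,
    Critical,
}

/// Hypothetical routing rule: notify only when an alert is at least as
/// severe as the channel's configured minimum.
fn should_notify(alert: Severity, channel_minimum: Severity) -> bool {
    alert >= channel_minimum
}
```

For example, a paging channel with a minimum of `Severity::Error` would receive `Critical` alerts but not `Warning` ones.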

### Notification Channels

- **Email**: SMTP-based email notifications
- **Slack**: Slack webhook notifications
- **Webhook**: HTTP webhook notifications
- **PagerDuty**: PagerDuty integration

## Health Checks

### Built-in Health Checks

The system includes several built-in health checks:

- **Database**: Database connectivity and responsiveness
- **Memory**: Memory usage monitoring
- **Disk**: Disk space availability
- **CPU**: CPU usage monitoring
- **Network**: Network connectivity (cluster mode)

### Health Status Levels

- **Healthy**: All systems operational
- **Degraded**: Some non-critical issues detected
- **Unhealthy**: Critical issues requiring attention
- **Unknown**: Health status cannot be determined
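
An overall status has to be derived from the individual check results somehow. One plausible policy is "worst result wins", sketched below with a stand-in `Status` enum. Two things here are assumptions rather than documented crate behavior: the ranking (in particular where `Unknown` sits) and the choice to report `Unknown` when no checks are registered.

```rust
/// Illustrative stand-in for the crate's `HealthStatus`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Healthy,
    Degraded,
    Unhealthy,
    Unknown,
}

/// Assumed ranking: Healthy < Unknown < Degraded < Unhealthy.
fn rank(status: Status) -> u8 {
    match status {
        Status::Healthy => 0,
        Status::Unknown => 1,
        Status::Degraded => 2,
        Status::Unhealthy => 3,
    }
}

/// Returns the worst status among all checks, or `Unknown` if there are none.
fn overall_status(checks: &[Status]) -> Status {
    checks
        .iter()
        .copied()
        .max_by_key(|s| rank(*s))
        .unwrap_or(Status::Unknown)
}
```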

### Custom Health Checks

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

struct ExternalServiceHealthCheck {
    service_url: String,
}

#[async_trait::async_trait]
impl HealthCheck for ExternalServiceHealthCheck {
    async fn check_health(&self) -> HealthCheckResult {
        let start = Instant::now();

        // Check external service
        let client = reqwest::Client::new();
        let response = client
            .get(&self.service_url)
            .timeout(Duration::from_secs(5))
            .send()
            .await;

        let duration = start.elapsed();

        match response {
            Ok(resp) if resp.status().is_success() => HealthCheckResult {
                name: "external_service".to_string(),
                status: HealthStatus::Healthy,
                message: "External service is responding".to_string(),
                duration,
                details: hashmap! {
                    "response_time_ms".to_string() => duration.as_millis().to_string(),
                    "status_code".to_string() => resp.status().as_u16().to_string(),
                },
            },
            Ok(resp) => HealthCheckResult {
                name: "external_service".to_string(),
                status: HealthStatus::Degraded,
                message: format!("External service returned status {}", resp.status()),
                duration,
                details: HashMap::new(),
            },
            Err(e) => HealthCheckResult {
                name: "external_service".to_string(),
                status: HealthStatus::Unhealthy,
                message: format!("External service unreachable: {}", e),
                duration,
                details: HashMap::new(),
            },
        }
    }
}
```

## Performance Monitoring

### Real-time Metrics

```rust
// Get current performance snapshot
let performance = collector.get_performance_metrics().await?;

println!("Query Performance:");
println!("  Total queries: {}", performance.query_metrics.total_queries);
println!("  Queries/sec: {:.2}", performance.query_metrics.queries_per_second);
println!("  Avg latency: {:.2}ms", performance.query_metrics.avg_query_latency_ms);
println!("  P95 latency: {:.2}ms", performance.query_metrics.p95_query_latency_ms);

println!("Storage Performance:");
println!("  Total size: {} bytes", performance.storage_metrics.total_size_bytes);
println!("  Used size: {} bytes", performance.storage_metrics.used_size_bytes);
println!("  Cache hit rate: {:.2}%", performance.storage_metrics.cache_hit_rate * 100.0);
```

### Historical Analysis

```rust
// Get metrics for the last hour (chrono types, not std::time::Duration)
use chrono::{Duration, Utc};

let to = Utc::now();
let from = to - Duration::hours(1);

let query_latencies = collector.get_metrics("query_latency", from, to).await?;

// Guard against dividing by zero when no points fall in the window
if !query_latencies.is_empty() {
    let avg_latency = query_latencies.iter()
        .map(|p| p.value)
        .sum::<f64>() / query_latencies.len() as f64;

    println!("Average query latency over last hour: {:.2}ms", avg_latency);
}
```
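
Averages hide tail latency, which is why the performance snapshot also reports a P95 value. As a sketch of how a percentile could be computed from raw samples such as those returned by `get_metrics`, the function below uses the nearest-rank method. That choice is an assumption: the crate may compute `p95_query_latency_ms` differently, for example from histogram buckets.

```rust
/// Nearest-rank percentile over a set of samples (`p` in 0.0..=100.0).
/// Returns `None` for an empty slice or an out-of-range `p`.
/// NaN samples are not treated specially; `total_cmp` gives them a fixed
/// position in the sort order instead of panicking.
fn percentile(samples: &[f64], p: f64) -> Option<f64> {
    if samples.is_empty() || !(0.0..=100.0).contains(&p) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Nearest rank: the smallest value with at least p% of samples at or below it.
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    // Rank is 1-based; clamp to 1 so p = 0 yields the minimum.
    Some(sorted[rank.max(1) - 1])
}
```

Called as `percentile(&values, 95.0)`, this returns a value that actually occurred in the window, not an interpolated one.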

## Configuration

### Advanced Configuration

```rust
let config = MonitoringConfig {
    enable_metrics: true,
    enable_health_checks: true,
    collection_interval: Duration::from_secs(10),  // More frequent collection
    health_check_interval: Duration::from_secs(20),
    retention_period: Duration::from_secs(7200),  // 2 hours retention
    max_metrics_points: 50000,                    // Higher limit
    prometheus_config: PrometheusConfig {
        enabled: true,
        address: "127.0.0.1".to_string(),
        port: 9090,
        path: "/metrics".to_string(),
        global_labels: hashmap! {
            "cluster".to_string() => "production".to_string(),
            "region".to_string() => "us-east-1".to_string(),
        },
    },
    alerting_config: AlertingConfig {
        enabled: true,
        rules: vec![/* alert rules */],
        notifications: vec![/* notification configs */],
    },
};
```

### Environment Variables

```bash
# Prometheus configuration
export KOTOBA_METRICS_PORT=9090
export KOTOBA_METRICS_PATH=/metrics

# Alerting configuration
export KOTOBA_ALERT_SLACK_WEBHOOK=https://hooks.slack.com/...
export KOTOBA_ALERT_EMAIL_SMTP=smtp.gmail.com:587
```
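
This README does not say whether the crate reads these variables itself or leaves that to the application. If you need to read them yourself, a minimal sketch could look like this. The function names are hypothetical, and the fallback values simply mirror the examples above.

```rust
use std::env;

/// Parses a port from an optional raw string, falling back to `default`
/// when the value is missing or not a valid u16.
fn port_from(raw: Option<&str>, default: u16) -> u16 {
    raw.and_then(|s| s.trim().parse::<u16>().ok())
        .unwrap_or(default)
}

/// Reads `KOTOBA_METRICS_PORT` and `KOTOBA_METRICS_PATH` from the process
/// environment. The defaults (9090, "/metrics") are an assumption taken from
/// the examples in this README, not confirmed crate behavior.
fn metrics_endpoint_from_env() -> (u16, String) {
    let port = port_from(env::var("KOTOBA_METRICS_PORT").ok().as_deref(), 9090);
    let path = env::var("KOTOBA_METRICS_PATH").unwrap_or_else(|_| "/metrics".to_string());
    (port, path)
}
```

The resulting pair can then be copied into the `port` and `path` fields of `PrometheusConfig`.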

## Architecture

```
┌─────────────────────────────────────────┐
│            Application Layer            │
├─────────────────────────────────────────┤
│    ┌───────────┬──────────┬──────────┐  │
│    │  Metrics  │  Health  │ Alerting │  │ ← Monitoring Components
│    │ Collector │ Checker  │          │  │
│    └───────────┴──────────┴──────────┘  │
├─────────────────────────────────────────┤
│    ┌─────────────────────────────────┐  │
│    │    Prometheus HTTP Server       │  │ ← Metrics Export
│    └─────────────────────────────────┘  │
├─────────────────────────────────────────┤
│    ┌─────────────────────────────────┐  │
│    │      Metrics Storage            │  │ ← Time-series Storage
│    └─────────────────────────────────┘  │
├─────────────────────────────────────────┤
│          Database Integration           │ ← KotobaDB Integration
└─────────────────────────────────────────┘
```

## Integration Examples

### With KotobaDB

```rust
use kotoba_db::DB;
use kotoba_monitoring::*;
use std::sync::Arc;

// Create database
let db = DB::open_lsm("./database").await?;

// Wrap database for monitoring
struct MonitoredKotobaDB {
    db: DB,
}

#[async_trait::async_trait]
impl MonitoredDatabase for MonitoredKotobaDB {
    async fn get_database_metrics(&self) -> Result<DatabaseMetrics, MonitoringError> {
        // Implement database metrics collection
        Ok(DatabaseMetrics {
            active_connections: 10,
            total_connections: 15,
            uptime_seconds: 3600,
            version: env!("CARGO_PKG_VERSION").to_string(),
        })
    }

    async fn get_query_metrics(&self) -> Result<QueryMetrics, MonitoringError> {
        // Implement query metrics collection
        Ok(QueryMetrics {
            total_queries: 1000,
            queries_per_second: 50.0,
            avg_query_latency_ms: 25.0,
            p95_query_latency_ms: 50.0,
            p99_query_latency_ms: 100.0,
            slow_queries: 5,
            failed_queries: 1,
        })
    }

    async fn get_storage_metrics(&self) -> Result<StorageMetrics, MonitoringError> {
        // Implement storage metrics collection
        Ok(StorageMetrics {
            total_size_bytes: 1_000_000_000,
            used_size_bytes: 500_000_000,
            read_operations: 10000,
            write_operations: 5000,
            read_bytes_per_sec: 100_000.0,
            write_bytes_per_sec: 50_000.0,
            cache_hit_rate: 0.95,
            io_latency_ms: 10.0,
        })
    }
}

let monitored_db = Arc::new(MonitoredKotobaDB { db });
let collector = Arc::new(MetricsCollector::new(monitored_db, config));
```

### With Custom Metrics

```rust
// Custom application metrics.
// The function returns a Result so that `?` compiles; this assumes the error
// from `record_metric` is (or converts into) `MonitoringError`.
async fn record_business_metrics(collector: &MetricsCollector) -> Result<(), MonitoringError> {
    // Business logic metrics
    collector.record_metric(
        "orders_total",
        150.0,
        hashmap! {
            "status".to_string() => "completed".to_string(),
            "region".to_string() => "us-east".to_string(),
        }
    ).await?;

    collector.record_metric(
        "revenue_total",
        25000.0,
        hashmap! {
            "currency".to_string() => "USD".to_string(),
            "period".to_string() => "daily".to_string(),
        }
    ).await?;

    Ok(())
}
```

## Best Practices

### Monitoring Setup

1. **Start Simple**: Begin with basic health checks and essential metrics
2. **Define SLOs**: Set Service Level Objectives before configuring alerts
3. **Use Labels**: Properly label metrics for effective querying and aggregation
4. **Monitor Trends**: Focus on trends rather than absolute values
5. **Test Alerts**: Regularly test alerting rules to confirm they fire and reach the intended channel
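
As a worked example for item 2, an availability SLO translates directly into an error budget. The sketch below is a hypothetical helper, not part of the crate. It takes totals like the `total_queries` and `failed_queries` fields of `QueryMetrics` and reports how much of the budget has been used.

```rust
/// Fraction of the error budget consumed, given an availability SLO
/// (for example 0.99) and observed totals. A result above 1.0 means the
/// budget is exhausted.
/// Returns `None` when there is no traffic or the SLO is outside [0, 1),
/// since an SLO of 1.0 leaves no budget to divide by.
fn error_budget_consumed(slo: f64, total: u64, failed: u64) -> Option<f64> {
    if total == 0 || !(0.0..1.0).contains(&slo) {
        return None;
    }
    let allowed_failures = (1.0 - slo) * total as f64;
    Some(failed as f64 / allowed_failures)
}
```

With a 99% SLO, 1000 queries allow about 10 failures, so 5 failed queries consume roughly half of the budget.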

### Alert Configuration

1. **Start with Critical**: Configure alerts for truly critical conditions first
2. **Use Appropriate Thresholds**: Set thresholds based on historical data
3. **Avoid Noise**: Use aggregation and filtering to reduce false positives
4. **Escalation Paths**: Define clear escalation procedures for different alert severities
5. **Regular Review**: Regularly review and adjust alerting rules

### Performance Considerations

1. **Metrics Overhead**: Monitor the performance impact of metrics collection
2. **Storage Limits**: Configure appropriate retention periods and limits
3. **Network Usage**: Consider network overhead for distributed deployments
4. **Resource Usage**: Allocate sufficient resources for monitoring components

## Troubleshooting

### Common Issues

#### Metrics Not Appearing in Prometheus

```bash
# Check if metrics endpoint is accessible
curl http://localhost:9090/metrics

# Verify Prometheus configuration
# Check scrape target status in Prometheus UI
```

#### High Memory Usage

```rust
// Reduce metrics retention
let config = MonitoringConfig {
    retention_period: Duration::from_secs(1800), // 30 minutes
    max_metrics_points: 10000,
    ..Default::default()
};
```
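
To see why these two settings bound memory, consider a series that evicts points by both age and count. The sketch below is a hypothetical in-memory structure that mirrors `retention_period` and `max_metrics_points`. The crate's real storage may be organized quite differently.

```rust
use std::collections::VecDeque;
use std::time::Duration;

/// One stored sample. `at` is the time since an arbitrary epoch, which keeps
/// the sketch deterministic and free of wall-clock calls.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Point {
    at: Duration,
    value: f64,
}

/// Hypothetical series bounded by both age and count.
struct BoundedSeries {
    points: VecDeque<Point>,
    retention: Duration,
    max_points: usize,
}

impl BoundedSeries {
    fn new(retention: Duration, max_points: usize) -> Self {
        Self { points: VecDeque::new(), retention, max_points }
    }

    /// Appends a point (timestamps are assumed to be non-decreasing), then
    /// evicts from the front anything older than the retention window or
    /// beyond the count limit.
    fn push(&mut self, point: Point) {
        self.points.push_back(point);
        let cutoff = point.at.saturating_sub(self.retention);
        while let Some(oldest) = self.points.front() {
            if oldest.at < cutoff || self.points.len() > self.max_points {
                self.points.pop_front();
            } else {
                break;
            }
        }
    }

    fn len(&self) -> usize {
        self.points.len()
    }
}
```

Whichever limit is hit first wins, so memory per series never exceeds `max_points` entries regardless of how long the retention window is.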

#### Slow Health Checks

```rust
// Increase health check intervals
let config = MonitoringConfig {
    health_check_interval: Duration::from_secs(60), // Less frequent
    ..Default::default()
};
```

#### Alert Spam

```rust
// Add hysteresis to alert rules
// Use rate limiting for notifications
// Implement alert aggregation
```
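
The first suggestion, hysteresis, can be sketched with two thresholds: the alert fires above one value and clears only after dropping below a lower one, so a metric hovering around a single threshold does not flap. `HysteresisAlert` is a hypothetical type for illustration; `AlertRule` as shown in this README has no such field.

```rust
/// Hypothetical hysteresis gate (not part of kotoba-monitoring): fires when
/// the value rises above `fire_above` and clears only once it falls below
/// `clear_below`.
struct HysteresisAlert {
    fire_above: f64,
    clear_below: f64,
    firing: bool,
}

impl HysteresisAlert {
    fn new(fire_above: f64, clear_below: f64) -> Self {
        assert!(
            clear_below <= fire_above,
            "clear threshold must not exceed fire threshold"
        );
        Self { fire_above, clear_below, firing: false }
    }

    /// Feeds one sample and returns `true` only when the alert changes state,
    /// which is the moment a notification would be sent.
    fn observe(&mut self, value: f64) -> bool {
        let next = if self.firing {
            // Stay firing until the value drops below the lower threshold.
            value >= self.clear_below
        } else {
            value > self.fire_above
        };
        let changed = next != self.firing;
        self.firing = next;
        changed
    }
}
```

With thresholds of 80 and 70, a CPU reading that moves 85, 78, 82, 79 triggers one notification instead of four, because the alert stays active until the reading drops below 70.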

## Future Enhancements

- **Distributed Tracing**: Request tracing across services
- **Anomaly Detection**: ML-based anomaly detection in metrics
- **Predictive Alerting**: Predictive maintenance alerts
- **Custom Dashboards**: Built-in dashboard generation
- **Metrics Federation**: Cross-cluster metrics aggregation

## Contributing

1. Fork the repository
2. Create a feature branch
3. Add tests for new metrics/checks
4. Update documentation
5. Submit a pull request

## License

Licensed under the MIT License.

---

**KotobaDB Monitoring & Metrics** - *Comprehensive observability for modern databases* 📊📈