rustchain-community 1.0.0

Open-source AI agent framework with core functionality and plugin system
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
# RustChain Monitoring and Observability Guide

*Complete guide to monitoring, logging, and observability for RustChain deployments*

## Table of Contents
- [Overview]#overview
- [Built-in Monitoring Features]#built-in-monitoring-features
- [Metrics Collection]#metrics-collection
- [Logging Configuration]#logging-configuration
- [Health Checks]#health-checks
- [Alerting Setup]#alerting-setup
- [Performance Monitoring]#performance-monitoring
- [Security Monitoring]#security-monitoring
- [Troubleshooting]#troubleshooting
- [Integration Guides]#integration-guides

## Overview

RustChain includes comprehensive monitoring and observability features designed for production environments. This guide covers setup, configuration, and best practices for monitoring your RustChain deployment.

### Key Monitoring Components

- **Metrics Collection**: Prometheus-compatible metrics
- **Structured Logging**: JSON logs with tracing integration
- **Health Endpoints**: System and component health checks  
- **Audit Trails**: Cryptographic integrity tracking
- **Performance Metrics**: Execution time and resource usage
- **Security Events**: Policy violations and safety incidents

## Built-in Monitoring Features

### Metrics Endpoint

RustChain exposes Prometheus-compatible metrics on `/metrics`:

```bash
# Enable metrics in configuration
export RUSTCHAIN_METRICS_ENABLED=true
export RUSTCHAIN_METRICS_PORT=9090

# Start with metrics enabled
cargo run --bin rustchain --features observability -- serve --metrics
```

### Health Check Endpoints

```bash
# Basic health check
curl http://localhost:8080/health

# Detailed health with component status
curl http://localhost:8080/health/detailed

# Readiness check (for Kubernetes)
curl http://localhost:8080/ready

# Liveness check (for Kubernetes)
curl http://localhost:8080/live
```

### Audit Trail Access

```bash
# View audit events
rustchain audit report --format json --since "24h ago"

# Export audit trail for analysis
rustchain audit export --output audit_trail.json --timerange "7d"

# Real-time audit monitoring
rustchain audit tail --format json | jq '.event_type'
```

## Metrics Collection

### Core Metrics

RustChain automatically collects these key metrics:

```yaml
# Mission Execution Metrics
rustchain_missions_total{status="success|failure|timeout"}
rustchain_mission_duration_seconds{mission_type="agent|chain|tool"}
rustchain_mission_steps_total{step_type="llm|command|create_file"}

# System Metrics  
rustchain_memory_usage_bytes{component="runtime|llm|memory_store"}
rustchain_cpu_usage_percent{component="executor|policy_engine"}
rustchain_disk_usage_bytes{path="/tmp|/data"}

# LLM Integration Metrics
rustchain_llm_requests_total{provider="openai|anthropic|ollama"}
rustchain_llm_response_time_seconds{provider="openai|anthropic|ollama"}
rustchain_llm_tokens_total{direction="prompt|completion"}
rustchain_llm_errors_total{provider="openai|anthropic|ollama",error_type="rate_limit|timeout|auth"}

# Security Metrics
rustchain_policy_violations_total{policy_type="safety|security|resource"}
rustchain_safety_checks_total{result="pass|fail|warn"}
rustchain_audit_events_total{event_type="mission_start|mission_complete|error"}

# Performance Metrics
rustchain_request_duration_seconds{endpoint="/api/v1/*"}
rustchain_concurrent_missions{status="running|queued"}
rustchain_tool_execution_total{tool="file_create|command|http"}
```

### Custom Metrics

Add custom metrics to your missions:

```yaml
# Mission with custom metrics
name: "Data Processing with Metrics"
steps:
  - id: "process_data"
    step_type: "tool"
    parameters:
      tool: "data_processor"
      metrics:
        - name: "records_processed"
          type: "counter"
          labels: {"dataset": "sales"}
        - name: "processing_time_seconds"  
          type: "histogram"
          buckets: [0.1, 0.5, 1.0, 5.0]
```

### Prometheus Configuration

```yaml
# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'rustchain'
    static_configs:
      - targets: ['localhost:9090']
    scrape_interval: 5s
    metrics_path: /metrics
    
  - job_name: 'rustchain-health'
    static_configs:
      - targets: ['localhost:8080']
    scrape_interval: 30s
    metrics_path: /health/metrics
```

## Logging Configuration

### Structured Logging Setup

```toml
# rustchain.toml - Logging configuration
[logging]
level = "info"
format = "json"
output = "file"
file_path = "/var/log/rustchain/app.log"
max_file_size = "100MB"
max_files = 10

# Component-specific log levels
[logging.components]
mission_executor = "debug"
policy_engine = "warn"
llm_integration = "info"
audit_system = "info"

# Sampling for high-volume logs
[logging.sampling]
enabled = true
rate = 0.1  # Sample 10% of debug logs
```

### Environment Variable Configuration

```bash
# Set logging via environment variables
export RUST_LOG="rustchain=info,rustchain::engine=debug"
export RUSTCHAIN_LOG_FORMAT="json"
export RUSTCHAIN_LOG_FILE="/var/log/rustchain/app.log"

# Enable specific subsystem logging
export RUST_LOG="rustchain::audit=debug,rustchain::policy=warn"
```

### Log Parsing Examples

#### Using jq for JSON Logs

```bash
# Filter by log level
tail -f /var/log/rustchain/app.log | jq 'select(.level == "ERROR")'

# Mission execution logs
tail -f /var/log/rustchain/app.log | jq 'select(.target | startswith("rustchain::engine"))'

# Performance analysis
tail -f /var/log/rustchain/app.log | jq 'select(.fields.duration_ms) | {mission: .fields.mission_id, duration: .fields.duration_ms}'

# Error aggregation
tail -f /var/log/rustchain/app.log | jq -r 'select(.level == "ERROR") | .fields.error' | sort | uniq -c
```

#### Using Fluentd for Log Forwarding

```xml
<source>
  @type tail
  path /var/log/rustchain/app.log
  pos_file /var/log/td-agent/rustchain.log.pos
  tag rustchain
  format json
</source>

<match rustchain>
  @type elasticsearch
  host elasticsearch.local
  port 9200
  index_name rustchain-logs
  type_name _doc
</match>
```

## Health Checks

### Basic Health Check Response

```json
{
  "status": "healthy",
  "timestamp": "2024-01-15T10:30:00Z",
  "uptime_seconds": 3600,
  "version": "1.0.0",
  "components": {
    "runtime": "healthy",
    "memory_store": "healthy", 
    "llm_providers": "healthy",
    "policy_engine": "healthy",
    "audit_system": "healthy"
  }
}
```

### Detailed Health Check Response

```json
{
  "status": "healthy",
  "timestamp": "2024-01-15T10:30:00Z",
  "uptime_seconds": 3600,
  "version": "1.0.0",
  "components": {
    "runtime": {
      "status": "healthy",
      "memory_usage_mb": 45.2,
      "cpu_usage_percent": 12.5,
      "active_missions": 3
    },
    "memory_store": {
      "status": "healthy",
      "total_entries": 150,
      "memory_usage_mb": 8.3,
      "hit_rate_percent": 85.2
    },
    "llm_providers": {
      "status": "healthy",
      "active_providers": ["openai", "anthropic"],
      "total_requests": 1247,
      "avg_response_time_ms": 850
    },
    "policy_engine": {
      "status": "healthy",
      "policies_loaded": 12,
      "violations_last_hour": 0
    },
    "audit_system": {
      "status": "healthy",
      "events_last_hour": 45,
      "storage_usage_mb": 23.1
    }
  }
}
```

### Custom Health Checks

```rust
// Add custom health checks
use rustchain::health::{HealthCheck, HealthStatus};

struct CustomDatabaseCheck {
    db_pool: DatabasePool,
}

impl HealthCheck for CustomDatabaseCheck {
    async fn check(&self) -> HealthStatus {
        match self.db_pool.get().await {
            Ok(_) => HealthStatus::Healthy,
            Err(e) => HealthStatus::Unhealthy {
                message: format!("Database connection failed: {}", e)
            }
        }
    }
    
    fn name(&self) -> &'static str {
        "database"
    }
}
```

## Alerting Setup

### Prometheus AlertManager Rules

```yaml
# rustchain-alerts.yml
groups:
  - name: rustchain
    rules:
      # High error rate
      - alert: HighMissionFailureRate
        expr: rate(rustchain_missions_total{status="failure"}[5m]) > 0.1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High mission failure rate detected"
          description: "Mission failure rate is {{ $value }} per second"
          
      # Memory usage
      - alert: HighMemoryUsage
        expr: rustchain_memory_usage_bytes / (1024*1024*1024) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Memory usage is {{ $value }}GB"
          
      # LLM provider issues
      - alert: LLMProviderDown
        expr: up{job="rustchain"} == 0 or rate(rustchain_llm_errors_total[5m]) > 0.5
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "LLM provider issues detected"
          description: "LLM provider {{ $labels.provider }} is experiencing issues"
          
      # Policy violations
      - alert: PolicyViolations
        expr: increase(rustchain_policy_violations_total[15m]) > 5
        for: 0s
        labels:
          severity: critical
        annotations:
          summary: "Multiple policy violations detected"
          description: "{{ $value }} policy violations in the last 15 minutes"
```

### Grafana Dashboard

```json
{
  "dashboard": {
    "title": "RustChain Monitoring",
    "panels": [
      {
        "title": "Mission Success Rate",
        "type": "stat",
        "targets": [
          {
            "expr": "rate(rustchain_missions_total{status=\"success\"}[5m]) / rate(rustchain_missions_total[5m])",
            "legendFormat": "Success Rate"
          }
        ]
      },
      {
        "title": "Mission Duration",
        "type": "graph", 
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rustchain_mission_duration_seconds_bucket)",
            "legendFormat": "95th percentile"
          },
          {
            "expr": "histogram_quantile(0.5, rustchain_mission_duration_seconds_bucket)",
            "legendFormat": "Median"
          }
        ]
      },
      {
        "title": "System Resources",
        "type": "graph",
        "targets": [
          {
            "expr": "rustchain_memory_usage_bytes / (1024*1024)",
            "legendFormat": "Memory Usage (MB)"
          },
          {
            "expr": "rustchain_cpu_usage_percent",
            "legendFormat": "CPU Usage %"
          }
        ]
      }
    ]
  }
}
```

## Performance Monitoring

### Response Time Analysis

```bash
# Mission execution times by type
curl -s http://localhost:9090/metrics | grep rustchain_mission_duration | \
  awk -F'[{}]' '{print $2 " " $3}' | sort -k2 -n

# LLM provider performance comparison
curl -s http://localhost:9090/metrics | grep rustchain_llm_response_time | \
  awk -F'[{}]' '{print $2 " " $3}' | sort -k2 -n
```

### Resource Usage Tracking

```yaml
# Mission with resource monitoring
name: "Resource Intensive Task"
config:
  resource_limits:
    max_memory_mb: 500
    max_cpu_percent: 80
    max_duration_seconds: 300
  monitoring:
    track_memory: true
    track_cpu: true
    sample_interval_seconds: 5
    
steps:
  - id: "heavy_computation"
    step_type: "tool"
    parameters:
      tool: "data_processor"
      resource_profile: "memory_intensive"
```

### Performance Baselines

```bash
# Establish performance baselines
rustchain benchmark --output baseline.json --iterations 100

# Compare current performance to baseline
rustchain benchmark --baseline baseline.json --threshold 0.2

# Continuous performance testing
rustchain benchmark --continuous --alert-threshold 0.3
```

## Security Monitoring

### Audit Event Analysis

```bash
# Real-time security monitoring
rustchain audit tail --filter "event_type=policy_violation" | \
  jq '{time: .timestamp, policy: .policy_name, violation: .violation_type}'

# Daily security report
rustchain audit report --format json --since "1d ago" | \
  jq -r '.events[] | select(.event_type == "policy_violation") | 
         [.timestamp, .policy_name, .violation_type] | @csv'
```

### Security Metrics Dashboard

```yaml
# Security-focused metrics
rustchain_security_events_total{event_type="policy_violation|safety_check|auth_failure"}
rustchain_failed_authentications_total{source="api|cli|webhook"}
rustchain_suspicious_activities_total{activity_type="unusual_pattern|rate_limit_exceeded"}
```

## Troubleshooting

### Common Monitoring Issues

#### Metrics Not Appearing

```bash
# Check metrics endpoint
curl -v http://localhost:9090/metrics

# Verify feature compilation
cargo build --features observability

# Check configuration
rustchain config show | grep -i metrics
```

#### High Memory Usage

```bash
# Analyze memory usage by component
curl -s http://localhost:9090/metrics | grep rustchain_memory_usage_bytes

# Check for memory leaks in missions
rustchain audit report --filter "event_type=memory_warning" --since "1h ago"

# Optimize memory store configuration
export RUSTCHAIN_MEMORY_STORE_TTL=3600
export RUSTCHAIN_MEMORY_STORE_MAX_SIZE=1000
```

#### Slow Mission Execution

```bash
# Identify slow missions
rustchain audit report --format json --since "1h ago" | \
  jq -r '.events[] | select(.event_type == "mission_complete") | 
         [.mission_id, .duration_ms] | @csv' | sort -t, -k2 -n

# LLM provider performance analysis
curl -s http://localhost:9090/metrics | grep rustchain_llm_response_time
```

### Log Analysis Scripts

```bash
#!/bin/bash
# analyze_logs.sh - Comprehensive log analysis

LOG_FILE=${1:-/var/log/rustchain/app.log}

echo "=== RustChain Log Analysis ==="
echo "Log file: $LOG_FILE"
echo "Analysis time: $(date)"
echo

# Error summary
echo "Top 10 Errors:"
grep '"level":"ERROR"' "$LOG_FILE" | \
  jq -r '.fields.error' | sort | uniq -c | sort -nr | head -10

# Mission performance
echo -e "\nMission Performance (last 1000 entries):"
tail -n 1000 "$LOG_FILE" | grep '"target":"rustchain::engine"' | \
  jq -r 'select(.fields.duration_ms) | [.fields.mission_id, .fields.duration_ms] | @csv' | \
  awk -F, '{sum+=$2; count++} END {print "Average duration:", sum/count "ms"}'

# Policy violations
echo -e "\nPolicy Violations:"
grep '"event_type":"policy_violation"' "$LOG_FILE" | \
  jq -r '[.timestamp, .policy_name, .violation_type] | @csv' | tail -5
```

## Integration Guides

### ELK Stack Integration

#### Logstash Configuration

```ruby
input {
  file {
    path => "/var/log/rustchain/app.log"
    codec => json
    tags => ["rustchain"]
  }
}

filter {
  if "rustchain" in [tags] {
    mutate {
      add_field => { "service" => "rustchain" }
    }
    
    if [level] == "ERROR" {
      mutate {
        add_tag => ["error"]
      }
    }
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "rustchain-logs-%{+YYYY.MM.dd}"
  }
}
```

### Datadog Integration

```yaml
# datadog-rustchain.yaml
logs:
  - type: file
    path: /var/log/rustchain/app.log
    service: rustchain
    source: rust
    tags:
      - env:production
      - team:ai-platform

# Custom metrics
init_config:

instances:
  - prometheus_url: http://localhost:9090/metrics
    namespace: rustchain
    metrics:
      - rustchain_missions_total
      - rustchain_mission_duration_seconds
      - rustchain_memory_usage_bytes
```

### New Relic Integration

```toml
# newrelic.toml
[newrelic]
license_key = "your-license-key"
app_name = "RustChain"

[newrelic.distributed_tracing]
enabled = true

[newrelic.attributes]
enabled = true
include = ["request.*", "mission.*"]
```

### Kubernetes Monitoring

```yaml
# ServiceMonitor for Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: rustchain
spec:
  selector:
    matchLabels:
      app: rustchain
  endpoints:
  - port: metrics
    interval: 15s
    path: /metrics
    
---
# PodMonitor for detailed metrics
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: rustchain-pods
spec:
  selector:
    matchLabels:
      app: rustchain
  podMetricsEndpoints:
  - port: metrics
    interval: 15s
```

### Sample Grafana Queries

```promql
# Mission success rate over time
rate(rustchain_missions_total{status="success"}[5m]) / rate(rustchain_missions_total[5m])

# 95th percentile mission duration
histogram_quantile(0.95, rate(rustchain_mission_duration_seconds_bucket[5m]))

# Memory usage trend
rustchain_memory_usage_bytes / 1024 / 1024

# Error rate by component
rate(rustchain_errors_total[5m]) by (component)

# LLM provider comparison
avg_over_time(rustchain_llm_response_time_seconds[1h]) by (provider)
```

## Conclusion

This monitoring and observability setup provides comprehensive visibility into your RustChain deployment. Key benefits:

- **Proactive Issue Detection**: Catch problems before they impact users
- **Performance Optimization**: Identify bottlenecks and optimization opportunities  
- **Security Monitoring**: Track policy violations and security events
- **Operational Insights**: Understand usage patterns and system behavior
- **Compliance**: Maintain audit trails and meet regulatory requirements

For advanced monitoring scenarios or custom integrations, consult the [RustChain API Reference](API_REFERENCE.md) and consider the Enterprise Edition for additional monitoring features.