ruvector-tiny-dancer-core 2.0.6

Production-grade AI agent routing system with FastGRNN neural inference
# Tiny Dancer Observability Guide

This guide covers Tiny Dancer's observability features: Prometheus metrics, OpenTelemetry distributed tracing, and structured logging.

## Table of Contents

1. [Overview](#overview)
2. [Prometheus Metrics](#prometheus-metrics)
3. [Distributed Tracing](#distributed-tracing)
4. [Structured Logging](#structured-logging)
5. [Integration Guide](#integration-guide)
6. [Examples](#examples)
7. [Best Practices](#best-practices)
8. [Troubleshooting](#troubleshooting)
9. [Reference](#reference)

## Overview

Tiny Dancer provides three layers of observability:

- **Prometheus Metrics**: Real-time performance metrics and system health
- **OpenTelemetry Tracing**: Distributed tracing for request flow analysis
- **Structured Logging**: Context-rich logs with the `tracing` crate

All three work together to provide complete visibility into your routing system.

## Prometheus Metrics

### Available Metrics

#### Request Metrics

```
tiny_dancer_routing_requests_total{status="success|failure"}
```
Counter tracking total routing requests by status.

```
tiny_dancer_routing_latency_seconds{operation="total"}
```
Histogram of routing operation latency in seconds.

#### Feature Engineering Metrics

```
tiny_dancer_feature_engineering_duration_seconds{batch_size="1-10|11-50|51-100|100+"}
```
Histogram of feature engineering duration by batch size.

#### Model Inference Metrics

```
tiny_dancer_model_inference_duration_seconds{model_type="fastgrnn"}
```
Histogram of model inference duration.

#### Circuit Breaker Metrics

```
tiny_dancer_circuit_breaker_state
```
Gauge showing circuit breaker state:
- 0 = Closed (healthy)
- 1 = Half-Open (testing)
- 2 = Open (failing)
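
When consuming this gauge in your own tooling, it can help to decode the numeric value back into a named state. The following is a minimal sketch; the `CircuitState` enum and its methods are hypothetical names introduced here for illustration and are not part of the crate's documented API.

```rust
/// Hypothetical mirror of the three circuit breaker states listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CircuitState {
    Closed,
    HalfOpen,
    Open,
}

impl CircuitState {
    /// Numeric value as documented for `tiny_dancer_circuit_breaker_state`.
    fn as_gauge(self) -> i64 {
        match self {
            CircuitState::Closed => 0,
            CircuitState::HalfOpen => 1,
            CircuitState::Open => 2,
        }
    }

    /// Decode a scraped gauge value; values outside 0..=2 yield `None`.
    fn from_gauge(value: i64) -> Option<Self> {
        match value {
            0 => Some(CircuitState::Closed),
            1 => Some(CircuitState::HalfOpen),
            2 => Some(CircuitState::Open),
            _ => None,
        }
    }
}
```

Returning `None` for unknown values keeps a dashboard or alerting helper from silently treating an unexpected reading as healthy.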

#### Routing Decision Metrics

```
tiny_dancer_routing_decisions_total{model_type="lightweight|powerful"}
```
Counter of routing decisions by target model type.

```
tiny_dancer_confidence_scores{decision_type="lightweight|powerful"}
```
Histogram of confidence scores by decision type.

```
tiny_dancer_uncertainty_estimates{decision_type="lightweight|powerful"}
```
Histogram of uncertainty estimates.

#### Candidate Metrics

```
tiny_dancer_candidates_processed_total{batch_size_range="1-10|11-50|51-100|100+"}
```
Counter of total candidates processed by batch size range.
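
The batch-size labels keep cardinality bounded by grouping counts into four ranges. A sketch of such a mapping is shown below; `batch_size_range` is a hypothetical helper, and the boundary handling (exactly 100 falls into `51-100`, an empty batch into `1-10`) is an assumption, since the label names alone do not settle it.

```rust
/// Map a candidate count to one of the four documented label values.
/// Assumption: 100 belongs to "51-100" and only larger batches use "100+".
/// Assumption: an empty batch (0) is reported under "1-10".
fn batch_size_range(count: usize) -> &'static str {
    match count {
        0..=10 => "1-10",
        11..=50 => "11-50",
        51..=100 => "51-100",
        _ => "100+",
    }
}
```

Using a fixed set of ranges rather than the raw count means the metric never produces more than four series per label.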

#### Error Metrics

```
tiny_dancer_errors_total{error_type="inference_error|circuit_breaker_open|..."}
```
Counter of errors by type.

### Using Metrics

```rust
use ruvector_tiny_dancer_core::{Router, RouterConfig};

// Create router (metrics are automatically collected)
let router = Router::new(RouterConfig::default())?;

// Process requests...
let response = router.route(request)?;

// Export metrics in Prometheus format
let metrics = router.export_metrics()?;
println!("{}", metrics);
```
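
To check an exported value programmatically (for example in a smoke test), the exported text can be scanned for a metric name. The sketch below assumes `export_metrics()` returns the standard Prometheus text exposition format; `sum_metric` is a hypothetical helper, and it deliberately ignores edge cases such as escaped characters in label values.

```rust
/// Sum every sample of `metric` across all label combinations.
/// Returns `None` when the metric does not appear at all.
fn sum_metric(exposition: &str, metric: &str) -> Option<f64> {
    let mut total: Option<f64> = None;
    for line in exposition.lines() {
        let line = line.trim();
        // Skip blank lines and `# HELP` / `# TYPE` comments.
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        // The metric name ends at the first '{' (labels) or whitespace (value).
        let name_end = line
            .find(|c: char| c == '{' || c.is_whitespace())
            .unwrap_or(line.len());
        if &line[..name_end] != metric {
            continue;
        }
        // The value is the first token after the closing '}' (or after the name).
        let rest = match line.rfind('}') {
            Some(i) => &line[i + 1..],
            None => &line[name_end..],
        };
        if let Some(value) = rest
            .split_whitespace()
            .next()
            .and_then(|token| token.parse::<f64>().ok())
        {
            *total.get_or_insert(0.0) += value;
        }
    }
    total
}
```

For instance, `sum_metric(&metrics, "tiny_dancer_routing_requests_total")` would add the `success` and `failure` series together, which gives a quick signal that the router has recorded any requests.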

### Prometheus Configuration

```yaml
scrape_configs:
  - job_name: 'tiny-dancer'
    scrape_interval: 15s
    static_configs:
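      # Address where your application serves its /metrics endpoint.
      # Note that 9090 is also the Prometheus server's own default port.
      - targets: ['localhost:9090']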
```

### Example Grafana Dashboard

```json
{
  "dashboard": {
    "title": "Tiny Dancer Routing",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [{
          "expr": "rate(tiny_dancer_routing_requests_total[5m])"
        }]
      },
      {
        "title": "P95 Latency",
        "targets": [{
          "expr": "histogram_quantile(0.95, rate(tiny_dancer_routing_latency_seconds_bucket[5m]))"
        }]
      },
      {
        "title": "Circuit Breaker State",
        "targets": [{
          "expr": "tiny_dancer_circuit_breaker_state"
        }]
      },
      {
        "title": "Lightweight vs Powerful Routing",
        "targets": [{
          "expr": "rate(tiny_dancer_routing_decisions_total[5m])"
        }]
      }
    ]
  }
}
```

## Distributed Tracing

### OpenTelemetry Integration

Tiny Dancer integrates with OpenTelemetry for distributed tracing, supporting exporters like Jaeger, Zipkin, and more.

### Trace Spans

The following spans are automatically created:

- `routing_request`: Complete routing operation
- `circuit_breaker_check`: Circuit breaker validation
- `feature_engineering`: Feature extraction and engineering
- `model_inference`: Neural model inference (per candidate)
- `uncertainty_estimation`: Uncertainty quantification

### Configuration

```rust
use ruvector_tiny_dancer_core::{TracingConfig, TracingSystem};

// Configure tracing
let config = TracingConfig {
    service_name: "tiny-dancer".to_string(),
    service_version: "1.0.0".to_string(),
    jaeger_agent_endpoint: Some("localhost:6831".to_string()),
    sampling_ratio: 1.0, // Sample 100% of traces
    enable_stdout: false,
};

// Initialize tracing
let tracing_system = TracingSystem::new(config);
tracing_system.init()?;

// Your application code...

// Shutdown and flush traces
tracing_system.shutdown();
```

### Jaeger Setup

```bash
# Run Jaeger all-in-one
docker run -d \
  -p 6831:6831/udp \
  -p 16686:16686 \
  jaegertracing/all-in-one:latest

# Access Jaeger UI at http://localhost:16686
```

### Trace Context Propagation

```rust
use ruvector_tiny_dancer_core::TraceContext;

// Get trace context from current span
if let Some(ctx) = TraceContext::from_current() {
    println!("Trace ID: {}", ctx.trace_id);
    println!("Span ID: {}", ctx.span_id);

    // W3C Trace Context format for HTTP headers
    let traceparent = ctx.to_w3c_traceparent();
    // Example: "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
}
```
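
On the receiving side of an HTTP call, the `traceparent` header has to be parsed back into its parts. The sketch below handles only version `00` of the W3C Trace Context format (`version-traceid-parentid-flags`); the `TraceParent` struct and both functions are hypothetical names for illustration, not part of the crate.

```rust
/// Parsed fields of a version-00 `traceparent` header.
#[derive(Debug, PartialEq, Eq)]
struct TraceParent {
    trace_id: String, // 32 lowercase hex characters
    span_id: String,  // 16 lowercase hex characters
    sampled: bool,    // lowest bit of the flags byte
}

/// True when `s` is exactly `len` lowercase hexadecimal characters.
fn is_lower_hex(s: &str, len: usize) -> bool {
    s.len() == len && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

/// Parse a version-00 header; any malformed input yields `None`.
fn parse_traceparent(header: &str) -> Option<TraceParent> {
    let parts: Vec<&str> = header.trim().split('-').collect();
    if parts.len() != 4 || parts[0] != "00" {
        return None;
    }
    let (trace_id, span_id, flags) = (parts[1], parts[2], parts[3]);
    if !is_lower_hex(trace_id, 32) || !is_lower_hex(span_id, 16) || !is_lower_hex(flags, 2) {
        return None;
    }
    // The specification treats all-zero IDs as invalid.
    if trace_id.bytes().all(|b| b == b'0') || span_id.bytes().all(|b| b == b'0') {
        return None;
    }
    let flags = u8::from_str_radix(flags, 16).ok()?;
    Some(TraceParent {
        trace_id: trace_id.to_string(),
        span_id: span_id.to_string(),
        sampled: (flags & 0x01) == 0x01,
    })
}

/// Render the header again, keeping only the sampled flag.
fn format_traceparent(tp: &TraceParent) -> String {
    let flags: u8 = if tp.sampled { 1 } else { 0 };
    format!("00-{}-{}-{:02x}", tp.trace_id, tp.span_id, flags)
}
```

Rejecting malformed headers outright, rather than attempting partial recovery, means a bad upstream value starts a fresh trace instead of corrupting an existing one.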

### Custom Spans

```rust
use ruvector_tiny_dancer_core::RoutingSpan;
use tracing::info_span;

// Create custom span
let span = info_span!("my_operation", param1 = "value");
let _guard = span.enter();

// Or use pre-defined span helpers
let span = RoutingSpan::routing_request(candidate_count);
let _guard = span.enter();
```

## Structured Logging

### Log Levels

Tiny Dancer uses the `tracing` crate for structured logging:

- **ERROR**: Critical failures (circuit breaker open, inference errors)
- **WARN**: Warnings (model path not found, degraded performance)
- **INFO**: Normal operations (router initialization, request completion)
- **DEBUG**: Detailed information (feature extraction, inference results)
- **TRACE**: Very detailed information (internal state changes)

### Example Logs

```
INFO tiny_dancer_router: Initializing Tiny Dancer router
INFO tiny_dancer_router: Circuit breaker enabled with threshold: 5
INFO tiny_dancer_router: Processing routing request candidate_count=3
DEBUG tiny_dancer_router: Extracting features batch_size=3
DEBUG tiny_dancer_router: Model inference completed candidate_id="candidate-1" confidence=0.92
DEBUG tiny_dancer_router: Routing decision made candidate_id="candidate-1" use_lightweight=true uncertainty=0.08
INFO tiny_dancer_router: Routing request completed successfully inference_time_us=245 lightweight_routes=2 powerful_routes=1
```
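
The `key=value` fields at the end of each line can be pulled out with a few lines of code when you need ad hoc analysis of plain-text logs. This is a minimal sketch that assumes the layout shown in the sample lines above; `parse_fields` is a hypothetical helper, and it does not handle quoted values that contain spaces.

```rust
use std::collections::HashMap;

/// Collect `key=value` tokens from one log line, stripping surrounding quotes.
/// Tokens without '=' (level, target, message words) are ignored.
fn parse_fields(line: &str) -> HashMap<String, String> {
    line.split_whitespace()
        .filter_map(|token| token.split_once('='))
        .map(|(key, value)| (key.to_string(), value.trim_matches('"').to_string()))
        .collect()
}
```

For anything beyond quick inspection, JSON-formatted output (see the logging configuration in this guide) is usually easier to process reliably than parsing the text format.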

### Configuring Logging

```rust
use tracing_subscriber::{layer::SubscriberExt, util::SubscriberInitExt};

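// Option A: basic setup
tracing_subscriber::fmt()
    .with_max_level(tracing::Level::INFO)
    .init();

// Option B: JSON formatting (requires tracing-subscriber's `json` feature).
// Use only one option per process: installing a second global subscriber panics.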
tracing_subscriber::registry()
    .with(tracing_subscriber::fmt::layer().json())
    .with(tracing_subscriber::filter::LevelFilter::from_level(
        tracing::Level::DEBUG
    ))
    .init();
```

## Integration Guide

### Complete Setup

```rust
use ruvector_tiny_dancer_core::{
    Router, RouterConfig, TracingConfig, TracingSystem
};
use tracing_subscriber;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // 1. Initialize structured logging
    tracing_subscriber::fmt()
        .with_max_level(tracing::Level::INFO)
        .init();

    // 2. Initialize distributed tracing
    let tracing_config = TracingConfig {
        service_name: "my-service".to_string(),
        service_version: "1.0.0".to_string(),
        jaeger_agent_endpoint: Some("localhost:6831".to_string()),
        sampling_ratio: 0.1, // Sample 10% in production
        enable_stdout: false,
    };
    let tracing_system = TracingSystem::new(tracing_config);
    tracing_system.init()?;

    // 3. Create router (metrics automatically enabled)
    let router = Router::new(RouterConfig::default())?;

    // 4. Process requests (all observability automatic)
    let response = router.route(request)?;

    // 5. Periodically export metrics (e.g., to HTTP endpoint)
    let metrics = router.export_metrics()?;

    // 6. Cleanup
    tracing_system.shutdown();

    Ok(())
}
```

### HTTP Metrics Endpoint

```rust
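use std::sync::Arc;

use axum::{extract::State, routing::get, Router};
// Alias avoids the name clash with axum's `Router`.
use ruvector_tiny_dancer_core::Router as TinyDancerRouter;

// axum handlers receive shared state through the `State` extractor.
async fn metrics_handler(State(router): State<Arc<TinyDancerRouter>>) -> String {
    router.export_metrics().unwrap_or_default()
}

// `router` is the Tiny Dancer router created earlier (see Complete Setup).
let app: Router = Router::new()
    .route("/metrics", get(metrics_handler))
    .with_state(Arc::new(router));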
```

## Examples

### 1. Metrics Only

```bash
cargo run --example metrics_example
```

Demonstrates Prometheus metrics collection and export.

### 2. Tracing Only

```bash
# Start Jaeger first
docker run -d -p6831:6831/udp -p16686:16686 jaegertracing/all-in-one:latest

# Run example
cargo run --example tracing_example
```

Shows distributed tracing with OpenTelemetry.

### 3. Full Observability

```bash
cargo run --example full_observability
```

Combines metrics, tracing, and structured logging.

## Best Practices

### Production Configuration

1. **Sampling**: Don't trace every request in production
   ```rust
   sampling_ratio: 0.01, // 1% sampling
   ```

2. **Log Levels**: Use INFO or WARN in production
   ```rust
   .with_max_level(tracing::Level::INFO)
   ```

3. **Metrics Cardinality**: Be careful with high-cardinality labels
   - ✓ Good: `{model_type="lightweight"}`
   - ✗ Bad: `{candidate_id="12345"}` (too many unique values)

4. **Performance**: Metrics collection is very lightweight (<1μs overhead)
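
To make the effect of `sampling_ratio` concrete, here is a sketch of a deterministic ratio-based sampling decision derived from the trace ID, similar in spirit to ratio-based samplers in OpenTelemetry SDKs. Whether Tiny Dancer applies `sampling_ratio` exactly this way is an assumption; `should_sample` is a hypothetical function.

```rust
/// Decide whether to keep a trace, using its ID as the source of randomness.
/// The same trace ID always gets the same answer, so every service that
/// sees the trace makes a consistent decision.
fn should_sample(trace_id: u128, sampling_ratio: f64) -> bool {
    if sampling_ratio >= 1.0 {
        return true;
    }
    // Also covers NaN, which fails every comparison.
    if !(sampling_ratio > 0.0) {
        return false;
    }
    // Scale the ratio onto the 63-bit range of the shifted low bits.
    let threshold = (sampling_ratio * (1u64 << 63) as f64) as u64;
    let low_bits = (trace_id as u64) >> 1;
    low_bits < threshold
}
```

With `sampling_ratio: 0.01`, roughly one trace in a hundred passes the threshold, assuming trace IDs are uniformly random.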

### Alerting Rules

Example Prometheus alerting rules:

```yaml
groups:
  - name: tiny_dancer
    rules:
      - alert: HighErrorRate
        expr: rate(tiny_dancer_errors_total[5m]) > 0.05
        for: 5m
        annotations:
          summary: "Error rate above 0.05 errors/s for an error type"

      - alert: CircuitBreakerOpen
        expr: tiny_dancer_circuit_breaker_state == 2
        for: 1m
        annotations:
          summary: "Circuit breaker is open"

      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(tiny_dancer_routing_latency_seconds_bucket[5m])) > 0.01
        for: 5m
        annotations:
          summary: "P95 latency above 10ms"
```
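
The `HighLatency` rule relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below is a simplified reimplementation for intuition only; it is not Prometheus's code, and the bucket values in the usage note are invented.

```rust
/// Estimate quantile `q` from `(upper_bound, cumulative_count)` pairs.
/// Buckets must be sorted by bound and end with the `+Inf` bucket.
fn histogram_quantile(q: f64, buckets: &[(f64, f64)]) -> Option<f64> {
    let total = buckets.last()?.1;
    if total <= 0.0 || !(0.0..=1.0).contains(&q) {
        return None;
    }
    let rank = q * total;
    // First bucket whose cumulative count reaches the target rank.
    let idx = buckets.iter().position(|&(_, count)| count >= rank)?;
    let (upper, count) = buckets[idx];
    if upper.is_infinite() {
        // The quantile lies beyond the last finite bound; report that bound.
        return if idx > 0 { Some(buckets[idx - 1].0) } else { None };
    }
    // The lowest bucket is assumed to start at zero.
    let (lower, prev) = if idx == 0 { (0.0, 0.0) } else { buckets[idx - 1] };
    if count == prev {
        return Some(upper);
    }
    Some(lower + (upper - lower) * (rank - prev) / (count - prev))
}
```

With illustrative buckets `[(0.001, 50.0), (0.005, 90.0), (0.01, 100.0), (f64::INFINITY, 100.0)]`, the 95th percentile falls halfway through the 5 ms to 10 ms bucket, so the estimate is about 7.5 ms and the 10 ms alert would not fire. The result is only as precise as the bucket boundaries allow.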

### Debugging Performance Issues

1. **Check metrics** for high-level patterns
   ```promql
   rate(tiny_dancer_routing_requests_total[5m])
   ```

2. **Use traces** to identify bottlenecks
   - Look for long spans
   - Identify slow candidates

3. **Review logs** for error details (piping to `jq` assumes JSON-formatted log lines)
   ```bash
   grep "ERROR" logs.txt | jq .
   ```

## Troubleshooting

### Metrics Not Appearing

- Ensure router is processing requests
- Check metrics export: `router.export_metrics()?`
- Verify Prometheus scrape configuration

### Traces Not in Jaeger

- Confirm Jaeger is running: `docker ps`
- Check endpoint: `jaeger_agent_endpoint: Some("localhost:6831".to_string())` (the Jaeger container must publish UDP port 6831)
- Verify sampling ratio > 0
- Call `tracing_system.shutdown()` to flush

### High Memory Usage

- Reduce sampling ratio
- Reduce the number of histogram buckets
- Reduce log verbosity to INFO or WARN

## Reference

- [Prometheus Documentation](https://prometheus.io/docs/)
- [OpenTelemetry Specification](https://opentelemetry.io/docs/)
- [Tracing Crate](https://docs.rs/tracing/)
- [Jaeger Documentation](https://www.jaegertracing.io/docs/)