# Inference Module
Token-aware rate limiting, budgets, cost tracking, and routing for LLM/AI traffic.
## Overview
The inference module provides specialized handling for LLM inference endpoints:
- **Token-based rate limiting** - Limits based on token consumption, not just requests
- **Token budgets** - Daily/monthly cumulative usage limits
- **Cost tracking** - Dollar cost attribution per model
- **Guardrails** - Prompt injection and PII detection
- **Model routing** - Route to different providers based on model
- **Fallback** - Automatic failover between providers
- **Streaming support** - Token counting for SSE responses
## Architecture
```
┌─────────────────────────────────────────────────────────────────────┐
│                          Inference Pipeline                         │
└─────────────────────────────────────────────────────────────────────┘

       Request
          │
          ▼
┌───────────────────┐
│   Extract Model   │   Parse model from request body
│    & Estimate     │   Count input tokens (tiktoken)
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│    Token Rate     │   Check tokens-per-minute limit
│    Limit Check    │   (separate from request rate limit)
└─────────┬─────────┘
          │
     ┌────┴────┐
     │ Limited?│───Yes──▶  Return 429 + retry-after
     └────┬────┘
          │ No
          ▼
┌───────────────────┐
│   Budget Check    │   Check daily/monthly token budget
│                   │
└─────────┬─────────┘
          │
     ┌────┴────┐
     │Exhausted│───Yes──▶  Return 429 + budget error
     └────┬────┘
          │ No
          ▼
┌───────────────────┐
│    Guardrails     │   Prompt injection detection
│    (optional)     │   PII detection
└─────────┬─────────┘
          │
     ┌────┴────┐
     │ Blocked?│───Yes──▶  Return 400 + violation details
     └────┬────┘
          │ No
          ▼
┌───────────────────┐
│   Model Routing   │   Select upstream by model pattern
│                   │   gpt-4* → OpenAI, claude-* → Anthropic
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│   Upstream Pool   │   Least-tokens-queued balancing
│     Selection     │   Inference-aware health checks
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│      Forward      │   Stream response, count output tokens
│      Request      │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│   Cost Tracking   │   Calculate $ cost from token usage
│    & Metrics      │   Update budget, emit metrics
└───────────────────┘
```
## Token Rate Limiting
Unlike request-based rate limiting, token rate limiting accounts for the actual computational cost of LLM requests.
### Configuration
```kdl
route "/v1/chat/completions" {
service-type "inference"
inference {
provider "openai"
rate-limit {
// Token-based limits
tokens-per-minute 100000
burst-tokens 10000
// Can also combine with request limits
requests-per-minute 100
}
}
}
```
### How It Works
1. **Request arrives** - Parse the `model` and `messages` from the request body
2. **Count input tokens** - Use tiktoken (or estimation) to count tokens
3. **Check rate limit** - Verify tokens-per-minute not exceeded
4. **Track response tokens** - For streaming responses, count tokens in SSE events
5. **Update usage** - Add input + output tokens to usage counter
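Conceptually, the limiter is a token bucket whose units are LLM tokens rather than requests. A minimal sketch of that idea (the `TokenRateLimiter` name and `try_consume` signature are illustrative, not the module's actual API; a real implementation would also need to be shareable across connections):
```rust
use std::time::Instant;

pub struct TokenRateLimiter {
    tokens_per_minute: f64,
    burst_tokens: f64,
    available: f64,
    last_refill: Instant,
}

impl TokenRateLimiter {
    pub fn try_consume(&mut self, estimated_tokens: u64) -> bool {
        // Refill proportionally to elapsed time, capped at the burst size
        let elapsed = self.last_refill.elapsed().as_secs_f64();
        self.available = (self.available + elapsed * self.tokens_per_minute / 60.0)
            .min(self.burst_tokens);
        self.last_refill = Instant::now();

        if self.available >= estimated_tokens as f64 {
            self.available -= estimated_tokens as f64;
            true
        } else {
            false // caller responds with 429 + retry-after
        }
    }
}
```
Because output tokens are unknown up front, the check is made against the input-token estimate; the counter is corrected once the response's actual usage is known.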
### Token Counting
The module supports multiple token counting strategies:
```rust
pub enum TokenCounter {
    // Exact counting using tiktoken (requires feature)
    Tiktoken(TiktokenEncoder),
    // Estimation based on character/word count
    Estimation(EstimationConfig),
    // Provider-specific counters
    OpenAI(OpenAICounter),
    Anthropic(AnthropicCounter),
}
```
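For the `Estimation` strategy, a common rule of thumb is roughly four characters per token for English text. A sketch of what such a counter could look like (the `EstimationConfig` field shown here is an assumption for illustration):
```rust
pub struct EstimationConfig {
    /// Average characters per token; ~4.0 is a common heuristic for English
    chars_per_token: f64,
}

impl EstimationConfig {
    pub fn estimate(&self, text: &str) -> u64 {
        // Round up so short messages never count as zero tokens
        (text.chars().count() as f64 / self.chars_per_token).ceil() as u64
    }
}
```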
**Tiktoken Feature:**
```toml
[features]
tiktoken = ["tiktoken-rs"]
```
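With the feature enabled, exact counts can come from the `tiktoken-rs` crate. A minimal sketch, assuming the `cl100k_base` encoding (used by gpt-3.5-turbo and gpt-4):
```rust
use tiktoken_rs::cl100k_base;

fn count_tokens_exact(text: &str) -> anyhow::Result<usize> {
    // Loads the BPE ranks for OpenAI's cl100k encoding; in practice the
    // encoder would be built once and cached, since loading is expensive
    let bpe = cl100k_base()?;
    Ok(bpe.encode_with_special_tokens(text).len())
}
```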
## Token Budgets
Cumulative token usage limits over time periods.
### Configuration
```kdl
route "/v1/chat/completions" {
inference {
budget {
// Daily budget (resets at midnight UTC)
daily-limit 1000000
// Monthly budget (resets on 1st)
monthly-limit 10000000
// Enforce budget (reject requests when exhausted)
enforce true
// Alert when reaching threshold
alert-threshold 0.8
}
}
}
```
### Budget Tracking
```rust
pub struct TokenBudgetTracker {
    daily_usage: AtomicU64,
    monthly_usage: AtomicU64,
    daily_limit: Option<u64>,
    monthly_limit: Option<u64>,
    last_daily_reset: AtomicU64,
    last_monthly_reset: AtomicU64,
}

impl TokenBudgetTracker {
    pub fn check(&self, tokens: u64) -> BudgetResult;
    pub fn record_usage(&self, tokens: u64);
    pub fn remaining_daily(&self) -> Option<u64>;
    pub fn remaining_monthly(&self) -> Option<u64>;
}
```
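A sketch of how `check` and the daily rollover could work (indexing days from the Unix epoch and the `BudgetResult`/`BudgetPeriod` variants are assumptions for illustration; the monthly window is analogous):
```rust
use std::sync::atomic::Ordering;
use std::time::{SystemTime, UNIX_EPOCH};

impl TokenBudgetTracker {
    fn roll_daily_window(&self) {
        // Days since the Unix epoch; a changed index means we crossed midnight UTC
        let today = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .unwrap()
            .as_secs() / 86_400;
        if self.last_daily_reset.swap(today, Ordering::Relaxed) != today {
            self.daily_usage.store(0, Ordering::Relaxed);
        }
    }

    pub fn check(&self, tokens: u64) -> BudgetResult {
        self.roll_daily_window();
        match self.daily_limit {
            Some(limit) if self.daily_usage.load(Ordering::Relaxed) + tokens > limit => {
                BudgetResult::Exhausted(BudgetPeriod::Daily)
            }
            _ => BudgetResult::Ok,
        }
    }
}
```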
### Budget Response
When the budget is exhausted:
```json
{
  "error": {
    "type": "budget_exceeded",
    "message": "Daily token budget exhausted",
    "budget_type": "daily",
    "limit": 1000000,
    "used": 1000000,
    "resets_at": "2024-01-16T00:00:00Z"
  }
}
```
## Cost Tracking
Calculate dollar costs based on token usage and model pricing.
### Configuration
```kdl
route "/v1/chat/completions" {
inference {
cost-attribution {
enabled true
pricing {
model "gpt-4*" {
input-cost-per-million 30.0
output-cost-per-million 60.0
}
model "gpt-3.5-turbo*" {
input-cost-per-million 0.5
output-cost-per-million 1.5
}
model "claude-3-opus*" {
input-cost-per-million 15.0
output-cost-per-million 75.0
}
model "claude-3-sonnet*" {
input-cost-per-million 3.0
output-cost-per-million 15.0
}
}
// Add cost header to response
include-header true
header-name "X-Inference-Cost"
}
}
}
```
### Cost Calculator
```rust
pub struct CostCalculator {
    pricing: HashMap<GlobPattern, ModelPricing>,
}

pub struct ModelPricing {
    input_cost_per_million: f64,
    output_cost_per_million: f64,
}

impl CostCalculator {
    pub fn calculate(&self, model: &str, input_tokens: u64, output_tokens: u64) -> f64;
}
```
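The arithmetic is `tokens / 1,000,000 × rate`, summed over input and output. A sketch of `calculate` (the `lookup` helper that resolves a model name against the glob patterns is hypothetical):
```rust
impl CostCalculator {
    pub fn calculate(&self, model: &str, input_tokens: u64, output_tokens: u64) -> f64 {
        // Unknown models are priced at zero rather than rejected
        let Some(p) = self.lookup(model) else { return 0.0 };
        input_tokens as f64 / 1_000_000.0 * p.input_cost_per_million
            + output_tokens as f64 / 1_000_000.0 * p.output_cost_per_million
    }
}
```
For example, with the pricing above, a `gpt-4` request with 1,500 input and 500 output tokens costs 0.0015 × $30 + 0.0005 × $60 = $0.075.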
### Metrics
Cost metrics are exported for monitoring:
```
# Total cost in dollars
zentinel_inference_cost_dollars_total{route="chat", model="gpt-4"} 1.23
# Cost per request
zentinel_inference_cost_per_request{route="chat", model="gpt-4", quantile="0.5"} 0.05
```
## Guardrails
Semantic inspection for prompt injection and PII detection.
### Configuration
```kdl
route "/v1/chat/completions" {
inference {
guardrails {
prompt-injection {
enabled true
agent "guardrail-agent"
action "block"
// Or "warn" to log but allow
}
pii-detection {
enabled true
agent "pii-agent"
action "redact"
// Types: ssn, credit-card, email, phone, etc.
types ["ssn", "credit-card"]
}
}
}
}
```
### Guardrail Response
When blocked:
```json
{
  "error": {
    "type": "guardrail_violation",
    "message": "Request blocked by guardrail",
    "guardrail": "prompt_injection",
    "confidence": 0.95,
    "details": "Detected injection pattern in user message"
  }
}
```
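As a sketch, the configured `action` might translate into handling like the following (the `Verdict` and `Decision` types and their fields are illustrative, not the module's API; spans are assumed sorted so redaction can run back-to-front):
```rust
pub enum GuardrailAction {
    Block,  // reject with 400 + violation details
    Warn,   // log the finding but forward the request
    Redact, // rewrite offending spans before forwarding
}

fn apply_verdict(action: GuardrailAction, verdict: &Verdict, body: &mut String) -> Decision {
    if !verdict.violation {
        return Decision::Forward;
    }
    match action {
        GuardrailAction::Block => Decision::Reject(400),
        GuardrailAction::Warn => {
            log::warn!(
                "guardrail {} flagged request (confidence {:.2})",
                verdict.guardrail, verdict.confidence
            );
            Decision::Forward
        }
        GuardrailAction::Redact => {
            // Apply spans back-to-front so earlier byte offsets stay valid
            for span in verdict.spans.iter().rev() {
                body.replace_range(span.clone(), "[REDACTED]");
            }
            Decision::Forward
        }
    }
}
```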
## Model Routing
Route requests to different upstreams based on the model.
### Configuration
```kdl
route "/v1/chat/completions" {
inference {
model-routing {
// Pattern-based routing
model "gpt-4*" upstream="openai"
model "gpt-3.5*" upstream="openai"
model "claude-*" upstream="anthropic" provider="anthropic"
model "llama-*" upstream="local-llama"
// Default if no pattern matches
default-upstream "openai"
}
}
}
```
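Selection amounts to a first-match scan over the patterns in configuration order, falling back to the default. A sketch (`GlobPattern` also appears in the cost calculator above; its `matches` method is assumed here):
```rust
pub struct ModelRouter {
    // (pattern, upstream name) pairs, kept in configuration order
    rules: Vec<(GlobPattern, String)>,
    default_upstream: String,
}

impl ModelRouter {
    pub fn route(&self, model: &str) -> &str {
        self.rules
            .iter()
            .find(|(pattern, _)| pattern.matches(model))
            .map(|(_, upstream)| upstream.as_str())
            .unwrap_or(&self.default_upstream)
    }
}
```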
### Model Mapping for Fallback
Map models between providers when falling back:
```kdl
route "/v1/chat/completions" {
inference {
fallback {
upstreams {
upstream "anthropic" provider="anthropic" {
model-mapping {
"gpt-4" "claude-3-opus"
"gpt-4-turbo" "claude-3-opus"
"gpt-3.5-turbo" "claude-3-sonnet"
}
}
}
triggers {
on-health-failure true
on-budget-exhausted true
on-error-codes [429, 503]
}
}
}
}
```
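On failover, the mapping rewrites the `model` field in the request body before the request is retried against the fallback provider. A sketch using `serde_json` (the `ModelMapping` wrapper is illustrative):
```rust
use serde_json::Value;
use std::collections::HashMap;

pub struct ModelMapping(HashMap<String, String>);

impl ModelMapping {
    /// Rewrite the request's model to the fallback provider's equivalent.
    /// Returns false if no mapping exists, so the caller can skip this upstream.
    pub fn rewrite(&self, body: &mut Value) -> bool {
        let Some(model) = body.get("model").and_then(Value::as_str) else {
            return false;
        };
        match self.0.get(model) {
            Some(mapped) => {
                body["model"] = Value::String(mapped.clone());
                true
            }
            None => false,
        }
    }
}
```
With the mapping above, a request that fails over from OpenAI with `"model": "gpt-4"` is retried against Anthropic as `"claude-3-opus"`.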
## Streaming Support
Token counting for Server-Sent Events (SSE) streaming responses.
### How It Works
1. **Detect streaming** - Check for `stream: true` in request
2. **Proxy SSE** - Forward events from upstream to client
3. **Parse events** - Extract token chunks from SSE data
4. **Count tokens** - Accumulate output tokens from chunks
5. **Final count** - Use the token count from the `usage` field when present, otherwise fall back to the accumulated sum
### SSE Event Parsing
```rust
pub struct StreamingTokenCounter {
    accumulated_tokens: AtomicU64,
}

impl StreamingTokenCounter {
    pub fn process_event(&self, event: &SseEvent) {
        // Extract the text delta from the first choice, if present
        if let Some(choice) = event.data.get("choices").and_then(|c| c.get(0)) {
            if let Some(content) = choice.get("delta").and_then(|d| d.get("content")) {
                let tokens = estimate_tokens(content.as_str().unwrap_or(""));
                self.accumulated_tokens.fetch_add(tokens, Ordering::Relaxed);
            }
        }
    }

    pub fn finalize(&self, usage: Option<&Usage>) -> u64 {
        // Prefer the provider's exact count; fall back to the running estimate
        usage
            .map(|u| u.completion_tokens)
            .unwrap_or_else(|| self.accumulated_tokens.load(Ordering::Relaxed))
    }
}
```
## Load Balancing for Inference
Special load balancing algorithm for LLM endpoints.
### Least Tokens Queued
```rust
pub struct LeastTokensQueuedBalancer {
    targets: Vec<InferenceTarget>,
}

pub struct InferenceTarget {
    address: SocketAddr,
    queued_tokens: AtomicU64,
    processing_tokens: AtomicU64,
    max_tokens: u64,
}

impl LeastTokensQueuedBalancer {
    pub fn select(&self, estimated_tokens: u64) -> Option<&InferenceTarget> {
        // Find the target with the lowest (queued + processing) token count
        // that still has capacity for the estimated tokens
        self.targets
            .iter()
            .filter(|t| t.has_capacity(estimated_tokens))
            .min_by_key(|t| t.total_tokens())
    }
}
```
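The `has_capacity` and `total_tokens` helpers referenced above are not shown in the snippet; a sketch of how they might look (the capacity rule is an assumption):
```rust
impl InferenceTarget {
    fn total_tokens(&self) -> u64 {
        self.queued_tokens.load(Ordering::Relaxed)
            + self.processing_tokens.load(Ordering::Relaxed)
    }

    fn has_capacity(&self, estimated_tokens: u64) -> bool {
        // Admit the request only if it fits within the target's token budget
        self.total_tokens() + estimated_tokens <= self.max_tokens
    }
}
```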
### Inference Health Checks
Specialized health checks for inference endpoints:
```kdl
upstream "openai" {
health-check {
type "inference"
// Query /v1/models endpoint
models-endpoint "/v1/models"
// Or send a minimal completion probe
probe {
model "gpt-3.5-turbo"
messages [{"role": "user", "content": "hi"}]
max-tokens 1
}
interval-secs 30
timeout-secs 10
}
}
```
## Metrics
Inference-specific metrics:
```
# Token rate limiting
zentinel_inference_rate_limit_tokens_allowed_total{route="chat"}
zentinel_inference_rate_limit_tokens_limited_total{route="chat"}
zentinel_inference_rate_limit_current_tokens{route="chat"}
# Token budgets
zentinel_inference_budget_tokens_used{route="chat", period="daily"}
zentinel_inference_budget_tokens_remaining{route="chat", period="daily"}
zentinel_inference_budget_exhausted_total{route="chat", period="daily"}
# Token usage
zentinel_inference_input_tokens_total{route="chat", model="gpt-4"}
zentinel_inference_output_tokens_total{route="chat", model="gpt-4"}
zentinel_inference_tokens_per_request{route="chat", model="gpt-4", quantile="0.5"}
# Cost
zentinel_inference_cost_dollars_total{route="chat", model="gpt-4"}
# Guardrails
zentinel_inference_guardrail_blocked_total{route="chat", guardrail="prompt_injection"}
zentinel_inference_guardrail_latency_ms{route="chat", guardrail="prompt_injection"}
# Model routing
zentinel_inference_model_requests_total{route="chat", model="gpt-4", upstream="openai"}
zentinel_inference_fallback_total{route="chat", from="openai", to="anthropic"}
```
## Example Configuration
Complete inference route configuration:
```kdl
upstreams {
    upstream "openai" {
        target "api.openai.com:443"
        tls {
            sni "api.openai.com"
        }
        health-check {
            type "inference"
            models-endpoint "/v1/models"
            interval-secs 30
        }
    }
    upstream "anthropic" {
        target "api.anthropic.com:443"
        tls {
            sni "api.anthropic.com"
        }
    }
}

agents {
    agent "guardrail-agent" {
        type "custom"
        transport {
            unix-socket "/var/run/guardrail.sock"
        }
        events ["request-body"]
        timeout-ms 100
    }
}

routes {
    route "chat" {
        matches {
            path-prefix "/v1/chat/completions"
        }
        service-type "inference"
        upstream "openai"
        inference {
            provider "openai"
            rate-limit {
                tokens-per-minute 100000
                burst-tokens 10000
                requests-per-minute 100
            }
            budget {
                daily-limit 1000000
                monthly-limit 10000000
                enforce true
                alert-threshold 0.8
            }
            cost-attribution {
                enabled true
                pricing {
                    model "gpt-4*" {
                        input-cost-per-million 30.0
                        output-cost-per-million 60.0
                    }
                    model "gpt-3.5-turbo*" {
                        input-cost-per-million 0.5
                        output-cost-per-million 1.5
                    }
                }
            }
            guardrails {
                prompt-injection {
                    enabled true
                    agent "guardrail-agent"
                    action "block"
                }
            }
            model-routing {
                model "gpt-4*" upstream="openai"
                model "claude-*" upstream="anthropic" provider="anthropic"
                default-upstream "openai"
            }
            fallback {
                upstreams {
                    upstream "anthropic" provider="anthropic" {
                        model-mapping {
                            "gpt-4" "claude-3-opus"
                            "gpt-3.5-turbo" "claude-3-sonnet"
                        }
                    }
                }
                triggers {
                    on-health-failure true
                    on-budget-exhausted true
                    on-error-codes [429, 503]
                }
            }
        }
    }
}
```