Inference routing module for LLM/AI traffic patterns
This module provides:
- Token-based rate limiting (tokens/minute instead of requests/second; see the sketch after this list)
- Token budget tracking (cumulative usage per period)
- Cost attribution (per-model pricing)
- Multi-provider token counting (OpenAI, Anthropic, generic)
- Model-aware load balancing (LeastTokensQueued strategy)
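
As a concrete illustration of the first bullet, here is a minimal token-bucket sketch denominated in tokens rather than requests, with refill expressed per minute. TokenBucket, its fields, and try_acquire are illustrative names for the technique, not this module's API; the numbers mirror the example config below.

    use std::time::Instant;

    /// Hypothetical token bucket: capacity is the burst allowance,
    /// refill rate is expressed in tokens per minute.
    struct TokenBucket {
        capacity: f64,       // burst-tokens
        tokens: f64,         // tokens currently available
        refill_per_sec: f64, // tokens-per-minute / 60
        last_refill: Instant,
    }

    impl TokenBucket {
        fn new(tokens_per_minute: f64, burst_tokens: f64) -> Self {
            Self {
                capacity: burst_tokens,
                tokens: burst_tokens,
                refill_per_sec: tokens_per_minute / 60.0,
                last_refill: Instant::now(),
            }
        }

        /// Try to admit a request estimated to consume `cost` tokens.
        fn try_acquire(&mut self, cost: f64) -> bool {
            let now = Instant::now();
            let elapsed = now.duration_since(self.last_refill).as_secs_f64();
            self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
            self.last_refill = now;
            if self.tokens >= cost {
                self.tokens -= cost;
                true
            } else {
                false
            }
        }
    }

    fn main() {
        // Mirrors the example config: 100_000 tokens/minute, 10_000 burst.
        let mut bucket = TokenBucket::new(100_000.0, 10_000.0);
        assert!(bucket.try_acquire(2_500.0)); // a ~2.5k-token prompt is admitted
    }

A common refinement is to admit on an estimate and reconcile against provider-reported usage afterwards, since the true token count is only known once the response arrives.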
Example Usage

route "/v1/chat/completions" {
    inference {
        provider "openai"
        rate-limit {
            tokens-per-minute 100000
            burst-tokens 10000
        }
        budget {
            period "daily"
            limit 1000000
            enforce true
        }
        cost-attribution {
            pricing {
                model "gpt-4*" {
                    input-cost-per-million 30.0
                    output-cost-per-million 60.0
                }
            }
        }
        routing {
            strategy "least-tokens-queued"
        }
    }
    upstream "llm-pool" { ... }
}
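
The pricing block above translates into straightforward arithmetic: cost = input_tokens / 1e6 × input-cost-per-million + output_tokens / 1e6 × output-cost-per-million. A minimal sketch of that calculation, where ModelPricing is an illustrative name rather than this module's CostCalculator:

    /// Illustrative per-model pricing, matching the `gpt-4*` block above.
    struct ModelPricing {
        input_cost_per_million: f64,
        output_cost_per_million: f64,
    }

    impl ModelPricing {
        /// Cost in dollars for one request.
        fn cost(&self, input_tokens: u64, output_tokens: u64) -> f64 {
            input_tokens as f64 / 1_000_000.0 * self.input_cost_per_million
                + output_tokens as f64 / 1_000_000.0 * self.output_cost_per_million
        }
    }

    fn main() {
        let gpt4 = ModelPricing { input_cost_per_million: 30.0, output_cost_per_million: 60.0 };
        // 1,200 prompt tokens + 800 completion tokens:
        // 1200/1e6 * 30 + 800/1e6 * 60 = 0.036 + 0.048 = $0.084
        println!("${:.3}", gpt4.cost(1_200, 800));
    }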
Structs
- CostCalculator - Cost calculator for inference requests.
- GuardrailProcessor - Guardrail processor for semantic content analysis.
- InferenceCheckResult - Result of an inference rate limit check.
- InferenceMetrics - Inference-specific metrics collector.
- InferenceRateLimitManager - Manager for inference rate limiting, budgets, and cost tracking.
- InferenceRouteStats - Stats for a route’s inference configuration.
- StreamingTokenCounter - Streaming token counter for SSE responses.
- StreamingTokenResult - Result of streaming token counting.
- TiktokenManager - Manages cached tiktoken BPE instances for different encodings.
- TokenBudgetTracker - Token budget tracker for per-tenant usage tracking.
- TokenCounter - Token counter for a specific provider.
- TokenEstimate - Token count estimate with metadata.
- TokenRateLimiter - Token-based rate limiter for inference endpoints.
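
StreamingTokenCounter above counts tokens on SSE responses, where usage often arrives only in the final event, if at all. A minimal sketch of the fallback idea, assuming a crude ~4-characters-per-token heuristic over `data:` payload lines; SseTokenEstimator is an illustrative name, not this struct's API:

    /// Illustrative streaming counter: accumulates an estimate from SSE
    /// `data:` lines for providers that don't report usage.
    struct SseTokenEstimator {
        chars_seen: usize,
    }

    impl SseTokenEstimator {
        fn new() -> Self {
            Self { chars_seen: 0 }
        }

        /// Feed one SSE chunk; only `data:` payload lines contribute.
        fn feed(&mut self, chunk: &str) {
            for line in chunk.lines() {
                if let Some(payload) = line.strip_prefix("data:") {
                    let payload = payload.trim();
                    if payload != "[DONE]" {
                        self.chars_seen += payload.len();
                    }
                }
            }
        }

        /// Rough token estimate once the stream ends (~4 chars/token).
        fn estimate(&self) -> usize {
            self.chars_seen / 4
        }
    }

    fn main() {
        let mut counter = SseTokenEstimator::new();
        counter.feed("data: {\"delta\":\"Hello\"}\n\n");
        counter.feed("data: [DONE]\n\n");
        println!("~{} tokens", counter.estimate());
    }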
Enums
- PiiCheckResult - Result of a PII detection check.
- PromptInjectionResult - Result of a prompt injection check.
- TiktokenEncoding - Tiktoken encoding types.
- TokenCountSource - Source of token count.
- TokenRateLimitResult - Result of a rate limit check.
- TokenSource - Source of token count information.
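
PiiCheckResult and PromptInjectionResult are verdict types produced by guardrail checks. A deliberately naive sketch of the shape such a check might take, assuming a substring denylist (real guardrails use semantic analysis, as the GuardrailProcessor description indicates); InjectionVerdict and check_prompt_injection are hypothetical names:

    /// Illustrative guardrail verdict, loosely in the spirit of PromptInjectionResult.
    enum InjectionVerdict {
        Clean,
        Suspicious { matched: &'static str },
    }

    fn check_prompt_injection(prompt: &str) -> InjectionVerdict {
        // A deliberately naive denylist for illustration only.
        const PATTERNS: &[&str] = &["ignore previous instructions", "disregard the system prompt"];
        let lower = prompt.to_lowercase();
        for &p in PATTERNS {
            if lower.contains(p) {
                return InjectionVerdict::Suspicious { matched: p };
            }
        }
        InjectionVerdict::Clean
    }

    fn main() {
        match check_prompt_injection("Please ignore previous instructions and ...") {
            InjectionVerdict::Suspicious { matched } => println!("flagged: {matched}"),
            InjectionVerdict::Clean => println!("clean"),
        }
    }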
Traits
- InferenceProviderAdapter - Trait for provider-specific token extraction and estimation.
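
A hedged sketch of what an adapter in the spirit of InferenceProviderAdapter could look like: an exact count extracted from the response when the provider reports usage, with a character-based estimate as the fallback. The trait, its method names, and GenericAdapter are illustrative, not this module's actual signatures:

    /// Hypothetical adapter shape: exact usage when reported, estimate otherwise.
    trait ProviderAdapter {
        /// Exact token usage parsed from the provider response, if present.
        fn extract_usage(&self, response_body: &str) -> Option<u64>;
        /// Fallback estimate from raw text (~4 chars/token heuristic).
        fn estimate_tokens(&self, text: &str) -> u64 {
            (text.len() as u64 + 3) / 4
        }
    }

    struct GenericAdapter;

    impl ProviderAdapter for GenericAdapter {
        fn extract_usage(&self, response_body: &str) -> Option<u64> {
            // Crude scan for `"total_tokens": N` without a JSON dependency.
            let idx = response_body.find("\"total_tokens\"")?;
            let rest = &response_body[idx..];
            let colon = rest.find(':')?;
            let digits: String = rest[colon + 1..]
                .trim_start()
                .chars()
                .take_while(|c| c.is_ascii_digit())
                .collect();
            digits.parse().ok()
        }
    }

    fn main() {
        let adapter = GenericAdapter;
        let body = r#"{"usage":{"prompt_tokens":12,"completion_tokens":30,"total_tokens":42}}"#;
        assert_eq!(adapter.extract_usage(body), Some(42));
        assert_eq!(adapter.estimate_tokens("four score and seven years ago"), 8);
    }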
Functions
- create_inference_provider - Create a provider adapter based on the configured provider type.
- create_provider - Create a provider adapter based on provider type.
- extract_inference_content - Extract message content from an inference request body.
- is_sse_response - Check if a response appears to be SSE based on content type.
- tiktoken_manager - Get the global tiktoken manager.
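
For is_sse_response, the conventional signal is a Content-Type of text/event-stream, ignoring any parameters such as charset. A minimal sketch under that assumption (looks_like_sse is an illustrative name, not this function's signature):

    /// Illustrative SSE detection: SSE responses carry `text/event-stream`.
    fn looks_like_sse(content_type: &str) -> bool {
        content_type
            .split(';') // drop parameters such as charset
            .next()
            .map(|mime| mime.trim().eq_ignore_ascii_case("text/event-stream"))
            .unwrap_or(false)
    }

    fn main() {
        assert!(looks_like_sse("text/event-stream; charset=utf-8"));
        assert!(!looks_like_sse("application/json"));
    }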