Module inference

Inference routing module for LLM/AI traffic patterns

This module provides:

  • Token-based rate limiting (tokens/minute instead of requests/second)
  • Token budget tracking (cumulative usage per period)
  • Cost attribution (per-model pricing)
  • Multi-provider token counting (OpenAI, Anthropic, generic)
  • Model-aware load balancing (LeastTokensQueued strategy)

Example Usage

route "/v1/chat/completions" {
    inference {
        provider "openai"
        rate-limit {
            tokens-per-minute 100000
            burst-tokens 10000
        }
        budget {
            period "daily"
            limit 1000000
            enforce true
        }
        cost-attribution {
            pricing {
                model "gpt-4*" {
                    input-cost-per-million 30.0
                    output-cost-per-million 60.0
                }
            }
        }
        routing {
            strategy "least-tokens-queued"
        }
    }
    upstream "llm-pool" { ... }
}
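
The rate-limit block above is denominated in model tokens rather than requests. As a rough sketch of that behavior (not the actual TokenRateLimiter implementation; every name below is illustrative), think of a token bucket that refills at tokens-per-minute and is capped at burst-tokens:

// Illustrative sketch only: a token bucket measured in LLM tokens, refilled at
// tokens_per_minute. Field and method names are assumptions, not this module's API.
use std::time::Instant;

struct TokenBucket {
    capacity: f64,        // burst-tokens: the most that can be spent at once
    available: f64,       // tokens currently available
    refill_per_sec: f64,  // tokens-per-minute / 60
    last_refill: Instant,
}

impl TokenBucket {
    fn new(tokens_per_minute: f64, burst_tokens: f64) -> Self {
        Self {
            capacity: burst_tokens,
            available: burst_tokens,
            refill_per_sec: tokens_per_minute / 60.0,
            last_refill: Instant::now(),
        }
    }

    /// Try to spend estimated_tokens for a request; deny if the bucket is short.
    fn try_consume(&mut self, estimated_tokens: f64) -> bool {
        let now = Instant::now();
        let elapsed = now.duration_since(self.last_refill).as_secs_f64();
        self.available = (self.available + elapsed * self.refill_per_sec).min(self.capacity);
        self.last_refill = now;
        if self.available >= estimated_tokens {
            self.available -= estimated_tokens;
            true
        } else {
            false
        }
    }
}

fn main() {
    // Matches the example config: 100_000 tokens/minute with a 10_000-token burst.
    let mut bucket = TokenBucket::new(100_000.0, 10_000.0);
    assert!(bucket.try_consume(4_000.0));   // fits within the burst allowance
    assert!(!bucket.try_consume(20_000.0)); // exceeds what is currently available
}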

Structs

CostCalculator
Cost calculator for inference requests.
GuardrailProcessor
Guardrail processor for semantic content analysis.
InferenceCheckResult
Result of an inference rate limit check.
InferenceMetrics
Inference-specific metrics collector.
InferenceRateLimitManager
Manager for inference rate limiting, budgets, and cost tracking.
InferenceRouteStats
Stats for a route’s inference configuration.
StreamingTokenCounter
Streaming token counter for SSE responses.
StreamingTokenResult
Result of streaming token counting.
TiktokenManager
Manages cached tiktoken BPE instances for different encodings.
TokenBudgetTracker
Token budget tracker for per-tenant usage tracking.
TokenCounter
Token counter for a specific provider.
TokenEstimate
Token count estimate with metadata.
TokenRateLimiter
Token-based rate limiter for inference endpoints.
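
For orientation, the arithmetic behind CostCalculator-style attribution is just token counts scaled by the per-million prices from the pricing block; presumably the real type also resolves model patterns such as "gpt-4*" to a pricing entry, as in the example config. The sketch below is illustrative only and its names are assumptions:

// Illustrative cost-attribution arithmetic; not the actual CostCalculator.
struct ModelPricing {
    input_cost_per_million: f64,
    output_cost_per_million: f64,
}

fn request_cost(pricing: &ModelPricing, input_tokens: u64, output_tokens: u64) -> f64 {
    input_tokens as f64 / 1_000_000.0 * pricing.input_cost_per_million
        + output_tokens as f64 / 1_000_000.0 * pricing.output_cost_per_million
}

fn main() {
    // A gpt-4-class request at $30/$60 per million tokens (from the example config).
    let gpt4 = ModelPricing { input_cost_per_million: 30.0, output_cost_per_million: 60.0 };
    let cost = request_cost(&gpt4, 1_200, 350);
    println!("attributed cost: ${cost:.4}"); // 1200*30/1e6 + 350*60/1e6 = 0.057
}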

Enums

PiiCheckResult
Result of a PII detection check.
PromptInjectionResult
Result of a prompt injection check.
TiktokenEncoding
Tiktoken encoding types.
TokenCountSource
Source of token count.
TokenRateLimitResult
Result of a rate limit check.
TokenSource
Source of token count information.
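
TokenSource and TokenCountSource record where a count came from, which matters because provider-reported usage is typically exact while local estimates are heuristic. A minimal sketch of that precedence, with hypothetical names:

// Illustrative only: prefer the provider's exact usage figure, fall back to an
// estimate, and remember which one was used. Not this module's actual enums.
#[derive(Debug)]
enum CountSource {
    ProviderReported, // exact figure echoed back by the upstream provider
    Estimated,        // heuristic fallback when no usage data is present
}

fn resolve_count(provider_usage: Option<u64>, estimate: u64) -> (u64, CountSource) {
    match provider_usage {
        Some(exact) => (exact, CountSource::ProviderReported),
        None => (estimate, CountSource::Estimated),
    }
}

fn main() {
    assert!(matches!(resolve_count(Some(412), 500), (412, CountSource::ProviderReported)));
    assert!(matches!(resolve_count(None, 500), (500, CountSource::Estimated)));
}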

Traits

InferenceProviderAdapter
Trait for provider-specific token extraction and estimation.
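
The trait below is a hedged, minimal stand-in for what a provider adapter covers: pulling exact usage out of a provider response and falling back to an estimate otherwise. Its shape is an assumption (the actual InferenceProviderAdapter signature is not reproduced here), and serde_json is assumed for body parsing:

// Illustrative adapter shape, not the real InferenceProviderAdapter trait.
use serde_json::Value;

trait ProviderTokenAdapter {
    /// Pull exact (prompt, completion) token usage out of a response body, if present.
    fn extract_usage(&self, body: &Value) -> Option<(u64, u64)>;
    /// Fall back to an estimate when the provider reports no usage.
    fn estimate_tokens(&self, text: &str) -> u64;
}

struct OpenAiLike;

impl ProviderTokenAdapter for OpenAiLike {
    fn extract_usage(&self, body: &Value) -> Option<(u64, u64)> {
        // OpenAI-style responses report usage.prompt_tokens / usage.completion_tokens.
        let usage = body.get("usage")?;
        Some((
            usage.get("prompt_tokens")?.as_u64()?,
            usage.get("completion_tokens")?.as_u64()?,
        ))
    }

    fn estimate_tokens(&self, text: &str) -> u64 {
        // Rough heuristic: roughly four characters per token for English text.
        (text.chars().count() as u64 + 3) / 4
    }
}

fn main() {
    let adapter = OpenAiLike;
    let body = serde_json::json!({"usage": {"prompt_tokens": 128, "completion_tokens": 64}});
    assert_eq!(adapter.extract_usage(&body), Some((128, 64)));
    assert_eq!(adapter.estimate_tokens("How many tokens is this?"), 6);
}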

Functions

create_inference_provider
Create a provider adapter based on the configured provider type.
create_provider
Create a provider adapter based on provider type.
extract_inference_content
Extract message content from an inference request body.
is_sse_response
Check if a response appears to be SSE based on content type.
tiktoken_manager
Get the global tiktoken manager.
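
As context for is_sse_response: streaming completions are delivered as server-sent events with a text/event-stream content type, so a content-type check along these lines is the usual heuristic (hypothetical helper, not this module's exact logic):

// Illustrative SSE detection based on the conventional text/event-stream content type.
fn looks_like_sse(content_type: Option<&str>) -> bool {
    content_type
        .map(|ct| ct.to_ascii_lowercase().starts_with("text/event-stream"))
        .unwrap_or(false)
}

fn main() {
    assert!(looks_like_sse(Some("text/event-stream; charset=utf-8")));
    assert!(!looks_like_sse(Some("application/json")));
    assert!(!looks_like_sse(None));
}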