Module inference

Inference routing module for LLM/AI traffic patterns

This module provides:

  • Token-based rate limiting (tokens/minute instead of requests/second)
  • Token budget tracking (cumulative usage per period)
  • Cost attribution (per-model pricing)
  • Multi-provider token counting (OpenAI, Anthropic, generic)
  • Model-aware load balancing (LeastTokensQueued strategy)

Example Usage

route "/v1/chat/completions" {
    inference {
        provider "openai"
        rate-limit {
            tokens-per-minute 100000
            burst-tokens 10000
        }
        budget {
            period "daily"
            limit 1000000
            enforce true
        }
        cost-attribution {
            pricing {
                model "gpt-4*" {
                    input-cost-per-million 30.0
                    output-cost-per-million 60.0
                }
            }
        }
        routing {
            strategy "least-tokens-queued"
        }
    }
    upstream "llm-pool" { ... }
}
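
The rate-limit block above is denominated in model tokens rather than requests. As a rough sketch of that behavior (not the actual TokenRateLimiter implementation; every name below is illustrative), think of a token bucket that refills at tokens-per-minute and is capped at burst-tokens:

// Illustrative sketch only: a token bucket measured in LLM tokens, refilled at
// tokens_per_minute. Field and method names are assumptions, not this module's API.
use std::time::Instant;

struct TokenBucket {
    capacity: f64,        // burst-tokens: the most that can be spent at once
    available: f64,       // tokens currently available
    refill_per_sec: f64,  // tokens-per-minute / 60
    last_refill: Instant,
}

impl TokenBucket {
    fn new(tokens_per_minute: f64, burst_tokens: f64) -> Self {
        Self {
            capacity: burst_tokens,
            available: burst_tokens,
            refill_per_sec: tokens_per_minute / 60.0,
            last_refill: Instant::now(),
        }
    }

    /// Try to spend estimated_tokens for a request; deny if the bucket is short.
    fn try_consume(&mut self, estimated_tokens: f64) -> bool {
        let now = Instant::now();
        let elapsed = now.duration_since(self.last_refill).as_secs_f64();
        self.available = (self.available + elapsed * self.refill_per_sec).min(self.capacity);
        self.last_refill = now;
        if self.available >= estimated_tokens {
            self.available -= estimated_tokens;
            true
        } else {
            false
        }
    }
}

fn main() {
    // Matches the example config: 100_000 tokens/minute with a 10_000-token burst.
    let mut bucket = TokenBucket::new(100_000.0, 10_000.0);
    assert!(bucket.try_consume(4_000.0));   // fits within the burst allowance
    assert!(!bucket.try_consume(20_000.0)); // exceeds what is currently available
}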

Structs

CostCalculator
Cost calculator for inference requests.
GuardrailProcessor
Guardrail processor for semantic content analysis.
InferenceCheckResult
Result of an inference rate limit check.
InferenceMetrics
Inference-specific metrics collector.
InferenceRateLimitManager
Manager for inference rate limiting, budgets, and cost tracking.
InferenceRouteStats
Stats for a route’s inference configuration.
StreamingTokenCounter
Streaming token counter for SSE responses.
StreamingTokenResult
Result of streaming token counting.
TiktokenManager
Manages cached tiktoken BPE instances for different encodings.
TokenBudgetTracker
Token budget tracker for per-tenant usage tracking.
TokenCounter
Token counter for a specific provider.
TokenEstimate
Token count estimate with metadata.
TokenRateLimiter
Token-based rate limiter for inference endpoints.
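
For orientation, the arithmetic behind CostCalculator-style attribution is just token counts scaled by the per-million prices from the pricing block; presumably the real type also resolves model patterns such as "gpt-4*" to a pricing entry, as in the example config. The sketch below is illustrative only and its names are assumptions:

// Illustrative cost-attribution arithmetic; not the actual CostCalculator.
struct ModelPricing {
    input_cost_per_million: f64,
    output_cost_per_million: f64,
}

fn request_cost(pricing: &ModelPricing, input_tokens: u64, output_tokens: u64) -> f64 {
    input_tokens as f64 / 1_000_000.0 * pricing.input_cost_per_million
        + output_tokens as f64 / 1_000_000.0 * pricing.output_cost_per_million
}

fn main() {
    // A gpt-4-class request at $30/$60 per million tokens (from the example config).
    let gpt4 = ModelPricing { input_cost_per_million: 30.0, output_cost_per_million: 60.0 };
    let cost = request_cost(&gpt4, 1_200, 350);
    println!("attributed cost: ${cost:.4}"); // 1200*30/1e6 + 350*60/1e6 = 0.057
}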

Enums

PiiCheckResult
Result of a PII detection check.
PromptInjectionResult
Result of a prompt injection check.
TiktokenEncoding
Tiktoken encoding types.
TokenCountSource
Source of token count.
TokenRateLimitResult
Result of a rate limit check.
TokenSource
Source of token count information.
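
TokenSource and TokenCountSource record where a count came from, which matters because provider-reported usage is typically exact while local estimates are heuristic. A minimal sketch of that precedence, with hypothetical names:

// Illustrative only: prefer the provider's exact usage figure, fall back to an
// estimate, and remember which one was used. Not this module's actual enums.
#[derive(Debug)]
enum CountSource {
    ProviderReported, // exact figure echoed back by the upstream provider
    Estimated,        // heuristic fallback when no usage data is present
}

fn resolve_count(provider_usage: Option<u64>, estimate: u64) -> (u64, CountSource) {
    match provider_usage {
        Some(exact) => (exact, CountSource::ProviderReported),
        None => (estimate, CountSource::Estimated),
    }
}

fn main() {
    assert!(matches!(resolve_count(Some(412), 500), (412, CountSource::ProviderReported)));
    assert!(matches!(resolve_count(None, 500), (500, CountSource::Estimated)));
}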

Traits

InferenceProviderAdapter
Trait for provider-specific token extraction and estimation.
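
The trait below is a hedged, minimal stand-in for what a provider adapter covers: pulling exact usage out of a provider response and falling back to an estimate otherwise. Its shape is an assumption (the actual InferenceProviderAdapter signature is not reproduced here), and serde_json is assumed for body parsing:

// Illustrative adapter shape, not the real InferenceProviderAdapter trait.
use serde_json::Value;

trait ProviderTokenAdapter {
    /// Pull exact (prompt, completion) token usage out of a response body, if present.
    fn extract_usage(&self, body: &Value) -> Option<(u64, u64)>;
    /// Fall back to an estimate when the provider reports no usage.
    fn estimate_tokens(&self, text: &str) -> u64;
}

struct OpenAiLike;

impl ProviderTokenAdapter for OpenAiLike {
    fn extract_usage(&self, body: &Value) -> Option<(u64, u64)> {
        // OpenAI-style responses report usage.prompt_tokens / usage.completion_tokens.
        let usage = body.get("usage")?;
        Some((
            usage.get("prompt_tokens")?.as_u64()?,
            usage.get("completion_tokens")?.as_u64()?,
        ))
    }

    fn estimate_tokens(&self, text: &str) -> u64 {
        // Rough heuristic: roughly four characters per token for English text.
        (text.chars().count() as u64 + 3) / 4
    }
}

fn main() {
    let adapter = OpenAiLike;
    let body = serde_json::json!({"usage": {"prompt_tokens": 128, "completion_tokens": 64}});
    assert_eq!(adapter.extract_usage(&body), Some((128, 64)));
    assert_eq!(adapter.estimate_tokens("How many tokens is this?"), 6);
}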

Functions

create_inference_provider
Create a provider adapter based on the configured provider type.
create_provider
Create a provider adapter based on provider type.
extract_inference_content
Extract message content from an inference request body.
is_sse_response
Check if a response appears to be SSE based on content type.
tiktoken_manager
Get the global tiktoken manager.
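
As context for is_sse_response: streaming completions are delivered as server-sent events with a text/event-stream content type, so a content-type check along these lines is the usual heuristic (hypothetical helper, not this module's exact logic):

// Illustrative SSE detection based on the conventional text/event-stream content type.
fn looks_like_sse(content_type: Option<&str>) -> bool {
    content_type
        .map(|ct| ct.to_ascii_lowercase().starts_with("text/event-stream"))
        .unwrap_or(false)
}

fn main() {
    assert!(looks_like_sse(Some("text/event-stream; charset=utf-8")));
    assert!(!looks_like_sse(Some("application/json")));
    assert!(!looks_like_sse(None));
}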