OmniLLM
An AI-native, production-grade Rust library for provider-neutral LLM access with multi-key load balancing, per-key rate limiting, protocol conversion, circuit breaking, and lock-free cost tracking.
Documentation
- Detailed Usage Guide
- Skill Guide
- Architecture Notes
- Implementation Notes
- API docs on docs.rs
- OmniLLM Skill Source
- OmniLLM Skill README
AI-Native Project
OmniLLM ships with a first-party OmniLLM Skill in skill/. The
skill teaches coding agents how to work with OmniLLM's actual runtime and conversion
surfaces instead of guessing from generic Rust or generic SDK patterns.
The bundled Skill is tuned for repository-native signals such as:
- `GatewayBuilder`, `Gateway`, `KeyConfig`, `PoolConfig`
- `ProviderEndpoint`, `EndpointProtocol`, `ProviderProtocol`, `LlmRequest`, `LlmStreamEvent`
- `ApiRequest`, `WireFormat`, `emit_transport_request`, `transcode_*`
- `ReplayFixture`, `sanitize_transport_request`, `OMNILLM_RESPONSES_*`
- runtime errors like `NoAvailableKey`, `BudgetExceeded`, and `Protocol(...)`
Bundled Skill
The repository includes the OmniLLM Skill in skill/. The
installation guide lives in skill/README.md, and the
website version lives in
website/docs/skill.md.
Install The Skill
See skill/README.md for GitHub-based Vercel Labs
skills installer commands for Claude Code, Codex, and OpenCode.
Use The Skill
After installing it, ask your agent to:
- integrate `omnillm` into a Rust project
- configure a multi-key runtime gateway
- transcode between provider protocols or typed endpoint formats
- explain replay sanitization and fixture-safe testing
- debug OmniLLM-specific errors and configuration issues
Repository Docs Site
The documentation site source lives in the website/ directory of this GitHub repository.
Features
- Canonical `Responses + Capability Layer` hybrid request/response model
- Additive provider primitive protocol mode for raw provider-native payloads
- Runtime endpoint profiles through `EndpointProtocol`, including official URL derivation and `*_compat` full-URL modes
- Additive multi-endpoint API layer with canonical request/response types for generation, embeddings, images, audio, and rerank
- Protocol-aware dispatch for OpenAI Responses, OpenAI Chat Completions, Claude Messages, and Gemini GenerateContent
- Raw JSON and typed transcoders between supported protocols and endpoint families
- Message-level `raw_message` preservation for higher-fidelity round trips
- Embedded provider support registry for OpenAI, Azure OpenAI, Anthropic, Gemini, Vertex AI, Bedrock, and OpenAI-compatible endpoints
- Replay fixture sanitization helpers for safe record/replay-style testing
- Multi-key load balancing with per-key rate limiting and circuit breaking
- Lock-free budget tracking with pre-reserve + settle accounting
- Non-streaming `call`, canonical streaming `stream`, primitive `primitive_call`, primitive SSE/binary `primitive_stream`, and primitive WebSocket `primitive_realtime` APIs
- Bundled OmniLLM Skill in `skill/` for AI-native repo guidance
Dual Protocol Modes
OmniLLM now exposes two runtime protocol forms:
| Mode | Entry points | Payload model | Use when | Budget |
|---|---|---|---|---|
| OpenAI Responses canonical | `Gateway::call`, `Gateway::stream` | `LlmRequest`, `LlmResponse`, `LlmStreamEvent` | You want provider-neutral generation with existing OpenAI Responses-centered semantics and provider transcoding | Shared `BudgetTracker` |
| Provider primitive | `Gateway::primitive_call`, `Gateway::primitive_stream`, `Gateway::primitive_realtime` | `PrimitiveRequest`, `PrimitiveResponse`, `PrimitiveStreamEvent`, `PrimitiveRealtimeSession` | You need raw provider-native APIs such as OpenAI Images/Audio/Realtime, Anthropic Messages/Count Tokens, Gemini GenerateContent/CountTokens/Live, or OpenAI-compatible raw payloads | Shared `BudgetTracker` |
The canonical path remains the default and does not require primitive configuration.
Primitive mode is explicit: configure a PrimitiveProviderEndpoint, send a
PrimitiveRequest, and OmniLLM preserves the provider-native request and response
payloads. Usage extraction is side-channel telemetry used for budget settlement;
it does not rewrite the returned primitive body.
Primitive provider support is intentionally scoped. OmniLLM is not a full provider admin SDK; admin, billing, webhooks, fine-tuning, evals, tunings, managed-agent platforms, and hosted RAG/vector-store administration remain deferred unless a current Spec explicitly promotes them.
Current primitive support tiers:
WebSocket realtime support is implemented for OpenAI Realtime and Gemini Live. WebRTC transport is intentionally not claimed as implemented and remains planned until feature-gated tests cover it.
| Tier | Provider coverage | Budget class |
|---|---|---|
| P0 core | OpenAI Responses/Chat/Images/Audio/Embeddings, Anthropic Messages/Count Tokens/Batches/Files, Gemini Generate/Stream/Count/Embed/Files/Caches | token or media fallback |
| P1 HTTP gaps | OpenAI Files/Uploads/Models/Audio Translations/Image edits/variations, Anthropic Models/Files hardening, Gemini Models/Operations/Files/Caches hardening | zero-cost metadata, upload/storage, or billable-unit fallback |
| P2 async jobs | Batch lifecycle provider APIs | async job usage when observed |
| P3 transports | OpenAI Audio Speech binary chunks, OpenAI Realtime WebSocket, Gemini Live WebSocket; WebRTC remains planned | close/cancel/provider-error/no-usage fallback settlement |
| Deferred | admin, billing, fine-tuning, evals, tunings, managed agents, hosted RAG control plane, SDK helpers | out of scope |
```rust
use omnillm::{Gateway, PrimitiveRequest}; // import paths are assumptions; see docs.rs
use serde_json::json;
use tokio_util::sync::CancellationToken;

// Hedged sketch: the PrimitiveRequest constructor and primitive_call signature are assumptions.
async fn primitive_sketch(gateway: &Gateway) -> Result<(), Box<dyn std::error::Error>> {
    let body = json!({
        "model": "claude-sonnet-4",
        "max_tokens": 64,
        "messages": [{ "role": "user", "content": "Hello!" }]
    });
    let request = PrimitiveRequest::new("anthropic_messages", body);
    let response = gateway.primitive_call(request, CancellationToken::new()).await?;
    println!("{response:?}"); // provider-native payload is preserved in the response
    Ok(())
}
```
Canonical Model
Generation stays centered on the existing OpenAI Responses semantic model:
- `LlmRequest` / `LlmResponse` are still the canonical generation types.
- `ApiRequest` / `ApiResponse` add separate canonical types for embeddings, image generations, audio transcriptions, audio speech, and rerank.
- `ConversionReport<T>` makes bridge semantics explicit with `bridged`, `lossy`, and `loss_reasons`.
This keeps generation normalized around "generate one response" while avoiding capability lock-in to any single wire protocol.
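Consuming a `ConversionReport<T>` might look roughly like the hedged sketch below; `bridged`, `lossy`, and `loss_reasons` come from this README, while the `value` field and the exact type shape are assumptions.

```rust
use omnillm::ConversionReport; // import path is an assumption

// Hedged sketch: inspect bridge metadata before using the converted value.
fn unwrap_report<T>(report: ConversionReport<T>) -> T {
    if report.lossy {
        eprintln!("bridged with loss: {:?}", report.loss_reasons);
    }
    report.value // `value` is inferred from the multi-endpoint example below
}
```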
Endpoint Families
Current typed endpoint coverage:
| Endpoint | Canonical type | Implemented wire formats |
|---|---|---|
| Generation | `LlmRequest` / `LlmResponse` | `open_ai_responses`, `open_ai_chat_completions`, `anthropic_messages`, `gemini_generate_content` |
| Embeddings | `EmbeddingRequest` / `EmbeddingResponse` | `open_ai_embeddings` |
| Image generation | `ImageGenerationRequest` / `ImageGenerationResponse` | `open_ai_image_generations` |
| Audio transcription | `AudioTranscriptionRequest` / `AudioTranscriptionResponse` | `open_ai_audio_transcriptions` |
| Audio speech | `AudioSpeechRequest` / `AudioSpeechResponse` | `open_ai_audio_speech` |
| Rerank | `RerankRequest` / `RerankResponse` | `open_ai_rerank` |
Provider support is exposed through embedded_provider_registry(). The registry distinguishes:
- `native`: implemented with provider-native wire format
- `compatible`: OpenAI-compatible or wrapper-style support
- `planned`: listed in the matrix but not yet implemented as a codec/runtime adapter
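As an illustration only, iterating the registry might look like the sketch below; `embedded_provider_registry()` and the native/compatible/planned tiers come from this README, while the entry field names are assumptions.

```rust
use omnillm::embedded_provider_registry; // import path is an assumption

// Hedged sketch: print each provider/endpoint pair with its support tier.
for entry in embedded_provider_registry() {
    println!("{} {} -> {:?}", entry.provider, entry.endpoint, entry.support);
}
```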
Quick Start
```rust
// Illustrative sketch: import paths, constructors, and argument shapes are assumptions; see docs.rs.
use omnillm::{GatewayBuilder, KeyConfig, LlmRequest};
use tokio_util::sync::CancellationToken;

#[tokio::main]
async fn main() {
    let gateway = GatewayBuilder::new()
        .add_key(KeyConfig::new("sk-your-key")) // hypothetical KeyConfig constructor
        .build()
        .expect("failed to build gateway");
    let request = LlmRequest::new("gpt-4.1-mini", "Hello!"); // hypothetical LlmRequest constructor
    let response = gateway.call(request, CancellationToken::new()).await.expect("call failed");
    println!("{response:?}");
}
```
Runtime Endpoint Profiles
Runtime configuration uses EndpointProtocol, while ProviderProtocol remains
the low-level wire-protocol enum for parsing, emission, and transcoding.
Names such as ClaudeMessages and GeminiGenerateContent come directly from
the upstream API families OmniLLM models, so treat them as wire-shape
identifiers rather than preferred runtime configuration presets.
```rust
use omnillm::{EndpointProtocol, ProviderEndpoint}; // import path is an assumption

// Variant name and argument shapes are assumptions; see docs.rs for the real constructors.
let endpoint = ProviderEndpoint::new(EndpointProtocol::ClaudeMessages, "https://api.anthropic.com")
    .with_auth("sk-ant-your-key");
```
Use official EndpointProtocol variants when OmniLLM should derive standard
upstream paths from a host or prefix. Use *_compat variants when the upstream
wrapper already exposes the full request URL.
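For contrast with the official-variant example above, a `*_compat` profile might be configured like this hedged sketch; the specific variant name is an assumption, while the full-URL behavior is documented above.

```rust
// Hedged sketch: *_compat variants take the complete upstream request URL instead of a host.
let compat = ProviderEndpoint::new(
    EndpointProtocol::OpenAiChatCompletionsCompat, // hypothetical variant name
    "https://my-wrapper.example.com/v1/chat/completions",
)
.with_auth("sk-wrapper-key");
```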
For OpenAI Chat Completions wrappers that reject bare string content,
construct chat input with Message.parts: OmniLLM emits plain-text chat
messages as typed content arrays such as
[{ "type": "text", "text": "hi?" }].
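A hedged sketch of building such a message follows; `Message.parts` comes from this README, while the `Role` and part constructor names are assumptions.

```rust
use omnillm::{Message, MessagePart, Role}; // import path and type names besides Message are assumptions

// Hedged sketch: typed parts instead of a bare string content field.
let message = Message {
    role: Role::User,
    parts: vec![MessagePart::text("hi?")], // emitted as [{ "type": "text", "text": "hi?" }]
    ..Default::default()
};
```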
Prompt Cache
OmniLLM exposes prompt caching as a typed generation capability instead of a raw provider-only JSON escape hatch:
```rust
use omnillm::{CapabilitySet, PromptCacheMode}; // import path and PromptCacheMode name are assumptions

// Struct-literal shape is an assumption; the BestEffort/Required modes come from this README.
let capabilities = CapabilitySet { prompt_cache: Some(PromptCacheMode::BestEffort), ..Default::default() };
```
Provider behavior is intentionally explicit:
- OpenAI Responses and Chat Completions emit `prompt_cache_key` and `prompt_cache_retention`; OpenAI breakpoint requests are partial support because caching is automatic prefix matching.
- Claude Messages emits provider-native `cache_control` on supported tool, system, message, or content-block boundaries.
- Gemini GenerateContent does not support typed prompt cache; `BestEffort` becomes a lossy bridge report and `Required` returns an error before transport.
- `TokenUsage.prompt_cache` preserves cached/read/write token telemetry so callers can verify cache hits from provider usage, not assumptions (see the sketch after this list).
- Budget estimates stay conservative and do not assume cache hits; actual settlement applies cache-aware pricing only when both provider telemetry and known cache rates are present.
For safer stable-prefix construction, use PromptLayoutBuilder to keep dynamic user/RAG content in the suffix and generate stable prefix keys that exclude dynamic content:
```rust
use omnillm::PromptLayoutBuilder; // import path is an assumption

// Argument shapes are assumptions; the builder method names come from this README.
let request = PromptLayoutBuilder::new("gpt-4.1-mini")
    .instructions("You are a concise support assistant.")
    .stable_message("Shared policy and tool documentation reused across requests.")
    .user_input("What is my current plan?")
    .stable_prefix_cache_key("support-prompt-v1")
    .build();
```
Protocol Transcoding
```rust
// transcode_request comes from this README; the import path, enum variants, and argument
// order are assumptions.
use omnillm::{transcode_request, ProviderProtocol};

let raw_chat = r#"{
  "model": "gpt-4.1-mini",
  "messages": [{
    "role": "user",
    "content": [{ "type": "text", "text": "Hello!" }]
  }],
  "max_tokens": 32
}"#;

let raw_responses = transcode_request(
    raw_chat,
    ProviderProtocol::OpenAiChatCompletions,
    ProviderProtocol::OpenAiResponses,
)?;
```
Typed multi-endpoint transcoding keeps bridge metadata:
```rust
// transcode_api_request and the bridged/lossy/loss_reasons fields come from this README;
// the import path, WireFormat variants, and argument order are assumptions.
use omnillm::{transcode_api_request, WireFormat};

let raw_chat = r#"{
  "model": "gpt-4.1-mini",
  "messages": [{
    "role": "user",
    "content": [{ "type": "text", "text": "Hello!" }]
  }],
  "max_tokens": 32
}"#;

let report = transcode_api_request(
    raw_chat,
    WireFormat::OpenAiChatCompletions,
    WireFormat::OpenAiResponses,
)?;
assert!(report.bridged);
assert!(!report.lossy);
println!("loss reasons: {:?}", report.loss_reasons);
```
If you bridge from the canonical Responses model to a narrower protocol, loss_reasons will tell you exactly what was dropped, such as unsupported builtin tools or provider-specific metadata.
Multi-Endpoint API
```rust
// ApiRequest, EmbeddingRequest, WireFormat, and emit_transport_request come from this README;
// the import path, constructors, TransportBody name, and transport field names are assumptions.
use omnillm::{emit_transport_request, ApiRequest, EmbeddingRequest, TransportBody, WireFormat};

let request = ApiRequest::Embeddings(EmbeddingRequest::new("text-embedding-3-small", vec!["Hello!"]));
let transport = emit_transport_request(&request, WireFormat::OpenAiEmbeddings)?;
assert_eq!(transport.value.method, "POST");
if let TransportBody::Json(body) = &transport.value.body {
    println!("{body}");
}
```
Local demo:
Replay Sanitization
ReplayFixture, sanitize_transport_request, sanitize_transport_response, and sanitize_json_value are intended for record/replay tests. They redact common secrets by default:
- auth headers
- query-string tokens such as API keys
- JSON fields such as `api_key`, `token`, `secret`
- large binary/base64 payload fields
```rust
// TransportRequest and sanitize_transport_request come from this README; the import path,
// field names, and the exact redaction placeholder are assumptions.
use omnillm::{sanitize_transport_request, TransportRequest};
use serde_json::json;

let request = TransportRequest {
    headers: vec![("authorization".into(), "Bearer sk-live-secret".into())],
    body: json!({ "api_key": "sk-live-secret", "input": "Hello!" }),
    ..Default::default()
};
let sanitized = sanitize_transport_request(&request);
assert_eq!(sanitized.headers[0].1, "[redacted]");
```
Live Responses Demo
Optional live test:
The live demo and live tests read all endpoint configuration from environment variables or a local ignored .env file. See .env.example.
Gateway Builder
```rust
use std::time::Duration;
// Import path and argument shapes are assumptions; the builder methods come from this README.
use omnillm::{GatewayBuilder, KeyConfig, PoolConfig};

let gateway = GatewayBuilder::new()
    .add_key(KeyConfig::new("sk-your-key")) // hypothetical KeyConfig constructor
    .budget_limit_usd(25.0)
    .pool_config(PoolConfig::default())
    .request_timeout(Duration::from_secs(60))
    .build()
    .expect("failed to build gateway");
```
Observability
```rust
// pool_status comes from this README; the exact return shape is an assumption.
for status in gateway.pool_status() {
    println!("{status:?}");
}
```