# OmniLLM
An AI-native, production-grade Rust library for provider-neutral LLM access with multi-key load balancing, per-key rate limiting, protocol conversion, circuit breaking, and lock-free cost tracking.
## Documentation
- [Detailed Usage Guide](https://github.com/aiomni/omnillm/blob/main/website/docs/usage.md)
- [Skill Guide](https://github.com/aiomni/omnillm/blob/main/website/docs/skill.md)
- [Architecture Notes](https://github.com/aiomni/omnillm/blob/main/website/docs/architecture.md)
- [Implementation Notes](https://github.com/aiomni/omnillm/blob/main/website/docs/implementation.md)
- [API docs on docs.rs](https://docs.rs/omnillm)
- [OmniLLM Skill Source](./skill)
- [OmniLLM Skill README](./skill/README.md)
## AI-Native Project
OmniLLM ships with a first-party OmniLLM Skill in [`skill/`](./skill). The
skill teaches coding agents how to work with OmniLLM's actual runtime and conversion
surfaces instead of guessing from generic Rust or generic SDK patterns.
The bundled Skill is tuned for repository-native signals such as:
- `GatewayBuilder`, `Gateway`, `KeyConfig`, `PoolConfig`
- `ProviderEndpoint`, `EndpointProtocol`, `ProviderProtocol`, `LlmRequest`, `LlmStreamEvent`
- `ApiRequest`, `WireFormat`, `emit_transport_request`, `transcode_*`
- `ReplayFixture`, `sanitize_transport_request`, `OMNILLM_RESPONSES_*`
- runtime errors like `NoAvailableKey`, `BudgetExceeded`, and `Protocol(...)`
### Bundled Skill
The repository includes the OmniLLM Skill in [`skill/`](./skill). The
installation guide lives in [`skill/README.md`](./skill/README.md), and the
website version lives in
[`website/docs/skill.md`](./website/docs/skill.md).
### Install The Skill
See [`skill/README.md`](./skill/README.md) for install commands using the
GitHub-based Vercel Labs `skills` installer, covering Claude Code, Codex, and OpenCode.
### Use The Skill
After installing it, ask your agent to:
- integrate `omnillm` into a Rust project
- configure a multi-key runtime gateway
- transcode between provider protocols or typed endpoint formats
- explain replay sanitization and fixture-safe testing
- debug OmniLLM-specific errors and configuration issues
## Repository Docs Site
The documentation site source lives in the GitHub repository:
- [website/docs](https://github.com/aiomni/omnillm/tree/main/website/docs)
- [website/theme](https://github.com/aiomni/omnillm/tree/main/website/theme)
- [skill](https://github.com/aiomni/omnillm/tree/main/skill)
- [GitHub Pages workflow](https://github.com/aiomni/omnillm/blob/main/.github/workflows/gh-pages.yml)
## Features
- Canonical `Responses + Capability Layer` hybrid request/response model
- Additive provider primitive protocol mode for raw provider-native payloads
- Runtime endpoint profiles through `EndpointProtocol`, including official URL derivation and `*_compat` full-URL modes
- Additive multi-endpoint API layer with canonical request/response types for generation, embeddings, images, audio, and rerank
- Protocol-aware dispatch for OpenAI Responses, OpenAI Chat Completions, Claude Messages, and Gemini GenerateContent
- Raw JSON and typed transcoders between supported protocols and endpoint families
- Message-level `raw_message` preservation for higher-fidelity round trips
- Embedded provider support registry for OpenAI, Azure OpenAI, Anthropic, Gemini, Vertex AI, Bedrock, and OpenAI-compatible endpoints
- Replay fixture sanitization helpers for safe record/replay style testing
- Multi-key load balancing with per-key rate limiting and circuit breaking
- Lock-free budget tracking with pre-reserve + settle accounting
- Non-streaming `call`, canonical streaming `stream`, primitive `primitive_call`, primitive SSE/binary `primitive_stream`, and primitive WebSocket `primitive_realtime` APIs
- Bundled OmniLLM Skill in `skill/` for AI-native repo guidance
## Dual Protocol Modes
OmniLLM now exposes two runtime protocol forms:

| Mode | Gateway APIs | Core types | Use when | Budget tracking |
| --- | --- | --- | --- | --- |
| OpenAI Responses canonical | `Gateway::call`, `Gateway::stream` | `LlmRequest`, `LlmResponse`, `LlmStreamEvent` | You want provider-neutral generation with existing OpenAI Responses-centered semantics and provider transcoding | Shared `BudgetTracker` |
| Provider primitive | `Gateway::primitive_call`, `Gateway::primitive_stream`, `Gateway::primitive_realtime` | `PrimitiveRequest`, `PrimitiveResponse`, `PrimitiveStreamEvent`, `PrimitiveRealtimeSession` | You need raw provider-native APIs such as OpenAI Images/Audio/Realtime, Anthropic Messages/Count Tokens, Gemini GenerateContent/CountTokens/Live, or OpenAI-compatible raw payloads | Shared `BudgetTracker` |
The canonical path remains the default and does not require primitive configuration.
Primitive mode is explicit: configure a `PrimitiveProviderEndpoint`, send a
`PrimitiveRequest`, and OmniLLM preserves the provider-native request and response
payloads. Usage extraction is side-channel telemetry used for budget settlement;
it does not rewrite the returned primitive body.
Primitive provider support is intentionally scoped. OmniLLM is not a full provider
admin SDK; admin, billing, webhooks, fine-tuning, evals, tunings, managed-agent
platforms, and hosted RAG/vector-store administration remain deferred unless a
current Spec explicitly promotes them.
Current primitive support tiers:

| Tier | Coverage | Usage settlement |
| --- | --- | --- |
| P0 core | OpenAI Responses/Chat/Images/Audio/Embeddings, Anthropic Messages/Count Tokens/Batches/Files, Gemini Generate/Stream/Count/Embed/Files/Caches | token or media fallback |
| P1 HTTP gaps | OpenAI Files/Uploads/Models/Audio Translations/Image edits/variations, Anthropic Models/Files hardening, Gemini Models/Operations/Files/Caches hardening | zero-cost metadata, upload/storage, or billable-unit fallback |
| P2 async jobs | Batch lifecycle provider APIs | async job usage when observed |
| P3 transports | OpenAI Audio Speech binary chunks, OpenAI Realtime WebSocket, Gemini Live WebSocket; WebRTC remains planned | close/cancel/provider-error/no-usage fallback settlement |
| Deferred | admin, billing, fine-tuning, evals, tunings, managed agents, hosted RAG control plane, SDK helpers | out of scope |

WebSocket realtime support is implemented for OpenAI Realtime and Gemini Live. WebRTC transport is intentionally not claimed as implemented and remains planned until feature-gated tests cover it.
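A minimal primitive call against the OpenAI Responses endpoint: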
```rust
use omnillm::{
    GatewayBuilder, KeyConfig, PrimitiveeEndpointKind, PrimitiveProviderEndpoint,
    PrimitiveProviderKind, PrimitiveRequest, ProviderEndpoint, ProviderPrimitiveWireFormat,
};
use serde_json::json;
use tokio_util::sync::CancellationToken;

# async fn demo() -> Result<(), Box<dyn std::error::Error>> {
let gateway = GatewayBuilder::new(ProviderEndpoint::openai_responses())
    .primitive_endpoint(PrimitiveProviderEndpoint::openai())
    .add_key(KeyConfig::new("sk-key", "openai"))
    .budget_limit_usd(10.0)
    .build()?;

let response = gateway
    .primitive_call(
        PrimitiveRequest::json(
            PrimitiveProviderKind::OpenAi,
            PrimitiveEndpointKind::Responses,
            ProviderPrimitiveWireFormat::OpenAiResponses,
            "gpt-4o",
            json!({ "model": "gpt-4o", "input": "hello" }),
        ),
        CancellationToken::new(),
    )
    .await?;

println!("status={} usage={:?}", response.status, response.usage);
# Ok(())
# }
```
## Canonical Model
Generation stays centered on the existing Responses API semantic model:
- `LlmRequest` / `LlmResponse` are still the canonical generation types.
- `ApiRequest` / `ApiResponse` add separate canonical types for embeddings, image generations, audio transcriptions, audio speech, and rerank.
- `ConversionReport<T>` makes bridge semantics explicit with `bridged`, `lossy`, and `loss_reasons`.
This keeps generation normalized around "generate one response" while avoiding capability lock-in to any single wire protocol.
## Endpoint Families
Current typed endpoint coverage:

| Endpoint family | Canonical types | Wire formats |
| --- | --- | --- |
| Generation | `LlmRequest` / `LlmResponse` | `open_ai_responses`, `open_ai_chat_completions`, `anthropic_messages`, `gemini_generate_content` |
| Embeddings | `EmbeddingRequest` / `EmbeddingResponse` | `open_ai_embeddings` |
| Image generation | `ImageGenerationRequest` / `ImageGenerationResponse` | `open_ai_image_generations` |
| Audio transcription | `AudioTranscriptionRequest` / `AudioTranscriptionResponse` | `open_ai_audio_transcriptions` |
| Audio speech | `AudioSpeechRequest` / `AudioSpeechResponse` | `open_ai_audio_speech` |
| Rerank | `RerankRequest` / `RerankResponse` | `open_ai_rerank` |
Provider support is exposed through `embedded_provider_registry()`. The registry distinguishes:
- `native`: implemented with provider-native wire format
- `compatible`: OpenAI-compatible or wrapper-style support
- `planned`: listed in the matrix but not yet implemented as a codec/runtime adapter
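A minimal sketch for inspecting the matrix at runtime, assuming only that the value returned by `embedded_provider_registry()` implements `Debug` (the exact entry shape is not pinned down here):

```rust
// Minimal sketch: dump the embedded support matrix. Entry fields are not
// specified here; Debug formatting is assumed to be available.
let registry = omnillm::embedded_provider_registry();
println!("{registry:#?}");
```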
## Quick Start
```rust
use omnillm::{
    GenerationConfig, GatewayBuilder, KeyConfig, LlmRequest, Message, MessageRole,
    ProviderEndpoint, RequestItem,
};
use tokio_util::sync::CancellationToken;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let gateway = GatewayBuilder::new(ProviderEndpoint::openai_responses())
        .add_key(KeyConfig::new("sk-key-1", "prod-1").tpm_limit(90_000).rpm_limit(500))
        .budget_limit_usd(50.0)
        .build()?;

    let req = LlmRequest {
        model: "gpt-4.1-mini".into(),
        instructions: Some("Answer concisely".into()),
        input: vec![RequestItem::from(Message::text(MessageRole::User, "Hello!"))],
        messages: Vec::new(),
        capabilities: Default::default(),
        generation: GenerationConfig {
            max_output_tokens: Some(256),
            ..Default::default()
        },
        metadata: Default::default(),
        vendor_extensions: Default::default(),
    };

    let resp = gateway.call(req, CancellationToken::new()).await?;
    println!("{}", resp.content_text);
    Ok(())
}
```
## Runtime Endpoint Profiles
Runtime configuration uses `EndpointProtocol`, while `ProviderProtocol` remains
the low-level wire-protocol enum for parsing, emission, and transcoding.
Names such as `ClaudeMessages` and `GeminiGenerateContent` come directly from
the upstream API families OmniLLM models, so treat them as wire-shape
identifiers rather than preferred runtime configuration presets.
```rust
use omnillm::{AuthScheme, EndpointProtocol, ProviderEndpoint};

let endpoint = ProviderEndpoint::new(
    EndpointProtocol::OpenAiChatCompletionsCompat,
    "https://your-openai-compatible-host/v1/chat/completions",
)
.with_auth(AuthScheme::Header {
    name: "x-api-key".into(),
});
```
Use official `EndpointProtocol` variants when OmniLLM should derive standard
upstream paths from a host or prefix. Use `*_compat` variants when the upstream
wrapper already exposes the full request URL.
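For contrast with the `*_compat` example above, a hedged sketch of the official form, assuming `EndpointProtocol::ClaudeMessages` (a wire-family name mentioned above) is accepted by `ProviderEndpoint::new` together with a bare host from which OmniLLM derives the standard path; the URL is illustrative:

```rust
use omnillm::{EndpointProtocol, ProviderEndpoint};

// Hedged sketch: with an official variant, OmniLLM derives the standard
// upstream path, so only a host/prefix is supplied (illustrative URL).
let endpoint = ProviderEndpoint::new(
    EndpointProtocol::ClaudeMessages,
    "https://api.anthropic.com",
);
```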
For OpenAI Chat Completions wrappers that reject bare string `content`,
construct chat input with `Message.parts`: OmniLLM emits plain-text chat
messages as typed `content` arrays such as
`[{ "type": "text", "text": "hi?" }]`.
## Prompt Cache
OmniLLM exposes prompt caching as a typed generation capability instead of a raw provider-only JSON escape hatch:
```rust
use omnillm::{
    CacheBreakpoint, CapabilitySet, PromptCacheKey, PromptCachePolicy,
    PromptCacheRetention,
};

let capabilities = CapabilitySet {
    prompt_cache: Some(PromptCachePolicy::BestEffort {
        key: Some(PromptCacheKey::Explicit { value: "tenant-a".into() }),
        retention: PromptCacheRetention::Long,
        breakpoint: CacheBreakpoint::Auto,
        vendor_extensions: Default::default(),
    }),
    ..Default::default()
};
```
Provider behavior is intentionally explicit:
- OpenAI Responses and Chat Completions emit `prompt_cache_key` and `prompt_cache_retention`; OpenAI breakpoint requests are partial support because caching is automatic prefix matching.
- Claude Messages emits provider-native `cache_control` on supported tool, system, message, or content-block boundaries.
- Gemini GenerateContent does not support typed prompt cache; `BestEffort` becomes a lossy bridge report and `Required` returns an error before transport.
- `TokenUsage.prompt_cache` preserves cached/read/write token telemetry so callers can verify cache hits from provider usage, not assumptions.
- Budget estimates stay conservative and do not assume cache hits; actual settlement applies cache-aware pricing only when both provider telemetry and known cache rates are present.
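As a sketch of verifying cache hits from telemetry: `TokenUsage.prompt_cache` is the documented carrier, but the `usage` field path on `LlmResponse` is an assumption here.

```rust
use omnillm::LlmResponse;

// Hedged sketch: log provider-reported cache telemetry after a call.
// `resp.usage` is an assumed field path; `prompt_cache` holding optional
// cached/read/write token counts follows the description above.
fn log_prompt_cache(resp: &LlmResponse) {
    if let Some(cache) = &resp.usage.prompt_cache {
        println!("prompt cache telemetry: {cache:?}");
    }
}
```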
For safer stable-prefix construction, use `PromptLayoutBuilder` to keep dynamic user/RAG content in the suffix and derive stable prefix cache keys that exclude it:
```rust
use omnillm::{Message, MessageRole, PromptLayoutBuilder, PromptCacheRetention};

let request = PromptLayoutBuilder::new("gpt-5.4")
    .instructions("Answer using the stable policy document.")
    .stable_message(Message::text(MessageRole::User, "Stable policy context"))
    .user_input("What changed for my account?")
    .stable_prefix_cache_key("support-bot", Some("tenant-a"), PromptCacheRetention::Long, false)
    .build();
```
## Protocol Transcoding
```rust
use omnillm::{transcode_request, ProviderProtocol};

# fn demo() -> Result<(), Box<dyn std::error::Error>> {
let raw_chat = r#"{
  "model": "gpt-4.1-mini",
  "messages": [{
    "role": "user",
    "content": [{ "type": "text", "text": "Hello!" }]
  }],
  "max_tokens": 32
}"#;

let raw_responses = transcode_request(
    ProviderProtocol::OpenAiChatCompletions,
    ProviderProtocol::OpenAiResponses,
    raw_chat,
)?;
# Ok(())
# }
```
Typed multi-endpoint transcoding keeps bridge metadata:
```rust
use omnillm::{transcode_api_request, WireFormat};

# fn demo() -> Result<(), Box<dyn std::error::Error>> {
let raw_chat = r#"{
  "model": "gpt-4.1-mini",
  "messages": [{
    "role": "user",
    "content": [{ "type": "text", "text": "Hello!" }]
  }],
  "max_tokens": 32
}"#;

let report = transcode_api_request(
    WireFormat::OpenAiChatCompletions,
    WireFormat::OpenAiResponses,
    raw_chat,
)?;
assert!(report.bridged);
assert!(!report.lossy);
println!("{}", report.value);
# Ok(())
# }
```
If you bridge from the canonical Responses model to a narrower protocol, `loss_reasons` will tell you exactly what was dropped, such as unsupported builtin tools or provider-specific metadata.
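A hedged sketch, reusing `report` from the example above: `bridged`, `lossy`, and `loss_reasons` are the documented `ConversionReport` fields, but treating `loss_reasons` as an iterable of Debug-printable entries is an assumption.

```rust
// Hedged sketch: surface exactly what a lossy bridge dropped.
if report.lossy {
    for reason in &report.loss_reasons {
        eprintln!("bridge loss: {reason:?}");
    }
}
```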
## Multi-Endpoint API
```rust
use omnillm::{
    emit_transport_request, ApiRequest, EmbeddingInput, EmbeddingRequest, RequestBody,
    WireFormat,
};

# fn demo() -> Result<(), Box<dyn std::error::Error>> {
let request = ApiRequest::Embeddings(EmbeddingRequest {
    model: "text-embedding-3-small".into(),
    input: vec![EmbeddingInput::Text { text: "hello".into() }],
    dimensions: Some(256),
    encoding_format: None,
    user: None,
    vendor_extensions: Default::default(),
});

let transport = emit_transport_request(WireFormat::OpenAiEmbeddings, &request)?;
assert_eq!(transport.value.path, "/embeddings");
if let RequestBody::Json { value } = transport.value.body {
    println!("{}", value);
}
# Ok(())
# }
```
Local demo:
```sh
cargo run --example multi_endpoint_demo
```
## Replay Sanitization
`ReplayFixture`, `sanitize_transport_request`, `sanitize_transport_response`, and `sanitize_json_value` are intended for record/replay tests. They redact common secrets by default:
- auth headers
- query tokens such as `ak`
- JSON fields such as `api_key`, `token`, `secret`
- large binary/base64 payload fields
```rust
use omnillm::{sanitize_transport_request, HttpMethod, RequestBody, TransportRequest};
use serde_json::json;

let request = TransportRequest {
    method: HttpMethod::Post,
    path: "/responses?ak=secret".into(),
    headers: [("Authorization".into(), "Bearer secret".into())]
        .into_iter()
        .collect(),
    accept: None,
    body: RequestBody::Json {
        value: json!({ "api_key": "secret", "input": "hello" }),
    },
};

let sanitized = sanitize_transport_request(&request);
assert_eq!(sanitized.path, "/responses?ak=<redacted:ak>");
```
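`sanitize_json_value` covers bare JSON values; a sketch assuming its signature mirrors the by-reference, return-a-sanitized-copy shape of `sanitize_transport_request`:

```rust
use omnillm::sanitize_json_value;
use serde_json::json;

// Hedged sketch: field-level redaction on a raw JSON value. The exact
// signature is an assumption, not a documented contract.
let sanitized = sanitize_json_value(&json!({
    "api_key": "secret",
    "input": "hello",
}));
println!("{sanitized}");
```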
## Live Responses Demo
```sh
cp .env.example .env
cargo run --example responses_live_demo
```
Optional live tests:
```sh
cargo test responses_vision_demo -- --ignored --nocapture
cargo test responses_function_tool_demo -- --ignored --nocapture
```
The live demo and live tests read all endpoint configuration from environment variables or a local ignored `.env` file. See `.env.example`.
## Gateway Builder
```rust
use std::time::Duration;

use omnillm::{GatewayBuilder, KeyConfig, PoolConfig, ProviderEndpoint};

let gateway = GatewayBuilder::new(ProviderEndpoint::claude_messages())
    .add_key(KeyConfig::new("sk-key-1", "claude-prod-1"))
    .budget_limit_usd(100.0)
    .pool_config(PoolConfig::default())
    .request_timeout(Duration::from_secs(120))
    .build()
    .expect("at least one key required");
```
## Observability
```rust
for status in gateway.pool_status() {
    println!(
        "Key {:20} available={} inflight={}/{}",
        status.label, status.available, status.tpm_inflight, status.tpm_limit,
    );
}

println!("Budget remaining: ${:.4}", gateway.budget_remaining_usd());
```