# OmniLLM

An AI-native, production-grade Rust library for provider-neutral LLM access with multi-key load balancing, per-key rate limiting, protocol conversion, circuit breaking, and lock-free cost tracking.

## Documentation

- [Detailed Usage Guide](https://github.com/aiomni/omnillm/blob/main/website/docs/usage.md)
- [Skill Guide](https://github.com/aiomni/omnillm/blob/main/website/docs/skill.md)
- [Architecture Notes](https://github.com/aiomni/omnillm/blob/main/website/docs/architecture.md)
- [Implementation Notes](https://github.com/aiomni/omnillm/blob/main/website/docs/implementation.md)
- [API docs on docs.rs](https://docs.rs/omnillm)
- [OmniLLM Skill Source](./skill)
- [OmniLLM Skill README](./skill/README.md)

## AI-Native Project

OmniLLM ships with a first-party OmniLLM Skill in [`skill/`](./skill). The
skill teaches coding agents how to work with OmniLLM's actual runtime and conversion
surfaces instead of guessing from generic Rust or generic SDK patterns.

The bundled Skill is tuned for repository-native signals such as:

- `GatewayBuilder`, `Gateway`, `KeyConfig`, `PoolConfig`
- `ProviderEndpoint`, `EndpointProtocol`, `ProviderProtocol`, `LlmRequest`, `LlmStreamEvent`
- `ApiRequest`, `WireFormat`, `emit_transport_request`, `transcode_*`
- `ReplayFixture`, `sanitize_transport_request`, `OMNILLM_RESPONSES_*`
- runtime errors like `NoAvailableKey`, `BudgetExceeded`, and `Protocol(...)`

### Bundled Skill

The repository includes the OmniLLM Skill in [`skill/`](./skill). The
installation guide lives in [`skill/README.md`](./skill/README.md), and the
website version lives in
[`website/docs/skill.md`](./website/docs/skill.md).

### Install The Skill

See [`skill/README.md`](./skill/README.md) for commands that install the skill from
GitHub with the Vercel Labs `skills` installer for Claude Code, Codex, and OpenCode.

### Use The Skill

After installing it, ask your agent to:

- integrate `omnillm` into a Rust project
- configure a multi-key runtime gateway
- transcode between provider protocols or typed endpoint formats
- explain replay sanitization and fixture-safe testing
- debug OmniLLM-specific errors and configuration issues

## Repository Docs Site

The documentation site source lives in the GitHub repository:

- [website/docs](https://github.com/aiomni/omnillm/tree/main/website/docs)
- [website/theme](https://github.com/aiomni/omnillm/tree/main/website/theme)
- [skill](https://github.com/aiomni/omnillm/tree/main/skill)
- [GitHub Pages workflow](https://github.com/aiomni/omnillm/blob/main/.github/workflows/gh-pages.yml)

## Features

- Canonical `Responses + Capability Layer` hybrid request/response model
- Additive provider primitive protocol mode for raw provider-native payloads
- Runtime endpoint profiles through `EndpointProtocol`, including official URL derivation and `*_compat` full-URL modes
- Additive multi-endpoint API layer with canonical request/response types for generation, embeddings, images, audio, and rerank
- Protocol-aware dispatch for OpenAI Responses, OpenAI Chat Completions, Claude Messages, and Gemini GenerateContent
- Raw JSON and typed transcoders between supported protocols and endpoint families
- Message-level `raw_message` preservation for higher-fidelity round trips
- Embedded provider support registry for OpenAI, Azure OpenAI, Anthropic, Gemini, Vertex AI, Bedrock, and OpenAI-compatible endpoints
- Replay fixture sanitization helpers for safe record/replay style testing
- Multi-key load balancing with per-key rate limiting and circuit breaking
- Lock-free budget tracking with pre-reserve + settle accounting
- Non-streaming `call`, canonical streaming `stream`, primitive `primitive_call`, primitive SSE/binary `primitive_stream`, and primitive WebSocket `primitive_realtime` APIs
- Bundled OmniLLM Skill in `skill/` for AI-native repo guidance

## Dual Protocol Modes

OmniLLM exposes two runtime protocol modes:

| Mode | Entry points | Payload model | Use when | Budget |
| --- | --- | --- | --- | --- |
| OpenAI Responses canonical | `Gateway::call`, `Gateway::stream` | `LlmRequest`, `LlmResponse`, `LlmStreamEvent` | You want provider-neutral generation with existing OpenAI Responses-centered semantics and provider transcoding | Shared `BudgetTracker` |
| Provider primitive | `Gateway::primitive_call`, `Gateway::primitive_stream`, `Gateway::primitive_realtime` | `PrimitiveRequest`, `PrimitiveResponse`, `PrimitiveStreamEvent`, `PrimitiveRealtimeSession` | You need raw provider-native APIs such as OpenAI Images/Audio/Realtime, Anthropic Messages/Count Tokens, Gemini GenerateContent/CountTokens/Live, or OpenAI-compatible raw payloads | Shared `BudgetTracker` |

The canonical path remains the default and does not require primitive configuration.
Primitive mode is explicit: configure a `PrimitiveProviderEndpoint`, send a
`PrimitiveRequest`, and OmniLLM preserves the provider-native request and response
payloads. Usage extraction is side-channel telemetry used for budget settlement;
it does not rewrite the returned primitive body.

Primitive provider support is intentionally scoped. OmniLLM is not a full provider
admin SDK; admin, billing, webhooks, fine-tuning, evals, tunings, managed-agent
platforms, and hosted RAG/vector-store administration remain deferred unless a
current Spec explicitly promotes them.

WebSocket realtime support is implemented for OpenAI Realtime and Gemini Live. WebRTC transport is not yet implemented and remains planned until feature-gated tests cover it.

Current primitive support tiers:

| Tier | Provider coverage | Budget class |
| --- | --- | --- |
| P0 core | OpenAI Responses/Chat/Images/Audio/Embeddings, Anthropic Messages/Count Tokens/Batches/Files, Gemini Generate/Stream/Count/Embed/Files/Caches | token or media fallback |
| P1 HTTP gaps | OpenAI Files/Uploads/Models/Audio Translations/Image edits/variations, Anthropic Models/Files hardening, Gemini Models/Operations/Files/Caches hardening | zero-cost metadata, upload/storage, or billable-unit fallback |
| P2 async jobs | Batch lifecycle provider APIs | async job usage when observed |
| P3 transports | OpenAI Audio Speech binary chunks, OpenAI Realtime WebSocket, Gemini Live WebSocket; WebRTC remains planned | close/cancel/provider-error/no-usage fallback settlement |
| Deferred | admin, billing, fine-tuning, evals, tunings, managed agents, hosted RAG control plane, SDK helpers | out of scope |

```rust
use omnillm::{
    GatewayBuilder, KeyConfig, PrimitiveEndpointKind, PrimitiveProviderEndpoint,
    PrimitiveProviderKind, PrimitiveRequest, ProviderEndpoint, ProviderPrimitiveWireFormat,
};
use serde_json::json;
use tokio_util::sync::CancellationToken;

# async fn demo() -> Result<(), Box<dyn std::error::Error>> {
let gateway = GatewayBuilder::new(ProviderEndpoint::openai_responses())
    .primitive_endpoint(PrimitiveProviderEndpoint::openai())
    .add_key(KeyConfig::new("sk-key", "openai"))
    .budget_limit_usd(10.0)
    .build()?;

let response = gateway
    .primitive_call(
        PrimitiveRequest::json(
            PrimitiveProviderKind::OpenAi,
            PrimitiveEndpointKind::Responses,
            ProviderPrimitiveWireFormat::OpenAiResponses,
            "gpt-4o",
            json!({"model":"gpt-4o","input":"hello"}),
        ),
        CancellationToken::new(),
    )
    .await?;
println!("status={} usage={:?}", response.status, response.usage);
# Ok(())
# }
```

## Canonical Model

Generation stays centered on the existing Responses API semantic model:

- `LlmRequest` / `LlmResponse` are still the canonical generation types.
- `ApiRequest` / `ApiResponse` add separate canonical types for embeddings, image generations, audio transcriptions, audio speech, and rerank.
- `ConversionReport<T>` makes bridge semantics explicit with `bridged`, `lossy`, and `loss_reasons`.

This keeps generation normalized around "generate one response" while avoiding capability lock-in to any single wire protocol.

## Endpoint Families

Current typed endpoint coverage:

| Endpoint | Canonical type | Implemented wire formats |
| --- | --- | --- |
| Generation | `LlmRequest` / `LlmResponse` | `open_ai_responses`, `open_ai_chat_completions`, `anthropic_messages`, `gemini_generate_content` |
| Embeddings | `EmbeddingRequest` / `EmbeddingResponse` | `open_ai_embeddings` |
| Image generation | `ImageGenerationRequest` / `ImageGenerationResponse` | `open_ai_image_generations` |
| Audio transcription | `AudioTranscriptionRequest` / `AudioTranscriptionResponse` | `open_ai_audio_transcriptions` |
| Audio speech | `AudioSpeechRequest` / `AudioSpeechResponse` | `open_ai_audio_speech` |
| Rerank | `RerankRequest` / `RerankResponse` | `open_ai_rerank` |

Provider support is exposed through `embedded_provider_registry()`. The registry distinguishes three support levels (a sketch for inspecting it follows this list):

- `native`: implemented with provider-native wire format
- `compatible`: OpenAI-compatible or wrapper-style support
- `planned`: listed in the matrix but not yet implemented as a codec/runtime adapter
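
A minimal sketch of inspecting the matrix at runtime; it assumes only that `embedded_provider_registry()` is re-exported at the crate root and returns a value with a `Debug` implementation, since this README does not document the registry's concrete shape.

```rust
use omnillm::embedded_provider_registry;

// Hedged sketch: dump the embedded support matrix. Debug formatting is used
// because the registry's concrete type and fields are not documented in this
// README; consult docs.rs for the real accessors.
let registry = embedded_provider_registry();
println!("{registry:?}");
```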

## Quick Start

```rust
use omnillm::{
    GenerationConfig, GatewayBuilder, KeyConfig, LlmRequest, Message, MessageRole,
    ProviderEndpoint, RequestItem,
};
use tokio_util::sync::CancellationToken;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let gateway = GatewayBuilder::new(ProviderEndpoint::openai_responses())
        .add_key(KeyConfig::new("sk-key-1", "prod-1").tpm_limit(90_000).rpm_limit(500))
        .budget_limit_usd(50.0)
        .build()?;

    let req = LlmRequest {
        model: "gpt-4.1-mini".into(),
        instructions: Some("Answer concisely".into()),
        input: vec![RequestItem::from(Message::text(MessageRole::User, "Hello!"))],
        messages: Vec::new(),
        capabilities: Default::default(),
        generation: GenerationConfig {
            max_output_tokens: Some(256),
            ..Default::default()
        },
        metadata: Default::default(),
        vendor_extensions: Default::default(),
    };

    let resp = gateway.call(req, CancellationToken::new()).await?;
    println!("{}", resp.content_text);
    Ok(())
}
```
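
The feature list and the mode table above also name canonical streaming through `Gateway::stream`, which the Quick Start does not show. The sketch below is an assumption, not a documented shape: it guesses that `stream` mirrors `call`'s arguments and yields a `futures`-compatible stream of `LlmStreamEvent` values; check docs.rs for the real signature.

```rust
use futures::StreamExt;
use tokio_util::sync::CancellationToken;

// Hedged sketch only: the return type of `Gateway::stream` and the shape of
// `LlmStreamEvent` are not shown in this README, so events are printed with
// Debug formatting. `req` is an `LlmRequest` built as in the Quick Start.
let mut events = gateway.stream(req, CancellationToken::new()).await?;
while let Some(event) = events.next().await {
    println!("{event:?}");
}
```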

## Runtime Endpoint Profiles

Runtime configuration uses `EndpointProtocol`, while `ProviderProtocol` remains
the low-level wire-protocol enum for parsing, emission, and transcoding.
Names such as `ClaudeMessages` and `GeminiGenerateContent` come directly from
the upstream API families that OmniLLM models, so treat them as wire-shape
identifiers rather than preferred runtime configuration presets.

```rust
use omnillm::{AuthScheme, EndpointProtocol, ProviderEndpoint};

let endpoint = ProviderEndpoint::new(
    EndpointProtocol::OpenAiChatCompletionsCompat,
    "https://your-openai-compatible-host/v1/chat/completions",
)
.with_auth(AuthScheme::Header {
    name: "x-api-key".into(),
});
```

Use official `EndpointProtocol` variants when OmniLLM should derive standard
upstream paths from a host or prefix. Use `*_compat` variants when the upstream
wrapper already exposes the full request URL.
For OpenAI Chat Completions wrappers that reject bare string `content`,
construct chat input with `Message.parts`: OmniLLM emits plain-text chat
messages as typed `content` arrays such as
`[{ "type": "text", "text": "hi?" }]`.

## Prompt Cache

OmniLLM exposes prompt caching as a typed generation capability instead of a raw provider-only JSON escape hatch:

```rust
use omnillm::{
    CacheBreakpoint, CapabilitySet, PromptCacheKey, PromptCachePolicy,
    PromptCacheRetention,
};

let capabilities = CapabilitySet {
    prompt_cache: Some(PromptCachePolicy::BestEffort {
        key: Some(PromptCacheKey::Explicit { value: "tenant-a".into() }),
        retention: PromptCacheRetention::Long,
        breakpoint: CacheBreakpoint::Auto,
        vendor_extensions: Default::default(),
    }),
    ..Default::default()
};
```

Provider behavior is intentionally explicit:

- OpenAI Responses and Chat Completions emit `prompt_cache_key` and `prompt_cache_retention`; breakpoint requests are only partially supported on OpenAI because its caching uses automatic prefix matching.
- Claude Messages emits provider-native `cache_control` on supported tool, system, message, or content-block boundaries.
- Gemini GenerateContent does not support typed prompt cache; `BestEffort` becomes a lossy bridge report and `Required` returns an error before transport.
- `TokenUsage.prompt_cache` preserves cached/read/write token telemetry so callers can verify cache hits from provider usage rather than from assumptions; a minimal sketch follows this list.
- Budget estimates stay conservative and do not assume cache hits; actual settlement applies cache-aware pricing only when both provider telemetry and known cache rates are present.
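
A minimal sketch of checking that telemetry, assuming the response exposes its token usage as a `usage` field (the field name is an assumption; only `TokenUsage.prompt_cache` itself is named above):

```rust
// Hedged sketch: verify cache behavior from provider-reported usage rather
// than assuming a hit. `resp` is the `LlmResponse` returned by `Gateway::call`
// in the Quick Start; the `usage` field name is assumed, and Debug output
// avoids relying on inner field names beyond the documented
// `TokenUsage.prompt_cache`.
let resp = gateway.call(req, CancellationToken::new()).await?;
println!("usage (includes prompt_cache telemetry): {:?}", resp.usage);
```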

For safer stable-prefix construction, use `PromptLayoutBuilder` to keep dynamic user/RAG content in the suffix and to generate stable prefix keys that exclude it:

```rust
use omnillm::{Message, MessageRole, PromptLayoutBuilder, PromptCacheRetention};

let request = PromptLayoutBuilder::new("gpt-5.4")
    .instructions("Answer using the stable policy document.")
    .stable_message(Message::text(MessageRole::User, "Stable policy context"))
    .user_input("What changed for my account?")
    .stable_prefix_cache_key("support-bot", Some("tenant-a"), PromptCacheRetention::Long, false)
    .build();
```

## Protocol Transcoding

```rust
use omnillm::{transcode_request, ProviderProtocol};

let raw_chat = r#"{
  "model": "gpt-4.1-mini",
  "messages": [{
    "role": "user",
    "content": [{ "type": "text", "text": "Hello!" }]
  }],
  "max_tokens": 32
}"#;

let raw_responses = transcode_request(
    ProviderProtocol::OpenAiChatCompletions,
    ProviderProtocol::OpenAiResponses,
    raw_chat,
)?;
```

Typed multi-endpoint transcoding keeps bridge metadata:

```rust
use omnillm::{transcode_api_request, WireFormat};

let raw_chat = r#"{
  "model": "gpt-4.1-mini",
  "messages": [{
    "role": "user",
    "content": [{ "type": "text", "text": "Hello!" }]
  }],
  "max_tokens": 32
}"#;

let report = transcode_api_request(
    WireFormat::OpenAiChatCompletions,
    WireFormat::OpenAiResponses,
    raw_chat,
)?;

assert!(report.bridged);
assert!(!report.lossy);
println!("{}", report.value);
```

If you bridge from the canonical Responses model to a narrower protocol, `loss_reasons` will tell you exactly what was dropped, such as unsupported builtin tools or provider-specific metadata.
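
Continuing the example above, a minimal sketch of surfacing those reasons (assuming the `loss_reasons` entries are iterable and implement `Debug`, which this README does not state):

```rust
// Hedged sketch: report exactly what a lossy bridge dropped. `report` is the
// `ConversionReport` from the typed transcoding example above; the element
// type of `loss_reasons` is not documented here, so Debug formatting is used.
if report.lossy {
    for reason in &report.loss_reasons {
        eprintln!("dropped during bridge: {reason:?}");
    }
}
```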

## Multi-Endpoint API

```rust
use omnillm::{
    emit_transport_request, ApiRequest, EmbeddingInput, EmbeddingRequest, RequestBody, WireFormat,
};

let request = ApiRequest::Embeddings(EmbeddingRequest {
    model: "text-embedding-3-small".into(),
    input: vec![EmbeddingInput::Text { text: "hello".into() }],
    dimensions: Some(256),
    encoding_format: None,
    user: None,
    vendor_extensions: Default::default(),
});

let transport = emit_transport_request(WireFormat::OpenAiEmbeddings, &request)?;
assert_eq!(transport.value.path, "/embeddings");

if let RequestBody::Json { value } = transport.value.body {
    println!("{}", value);
}
```

Local demo:

```sh
cargo run --example multi_endpoint_demo
```

## Replay Sanitization

`ReplayFixture`, `sanitize_transport_request`, `sanitize_transport_response`, and `sanitize_json_value` are intended for record/replay tests. They redact common secrets by default:

- auth headers
- query tokens such as `ak`
- JSON fields such as `api_key`, `token`, `secret`
- large binary/base64 payload fields

```rust
use omnillm::{sanitize_transport_request, HttpMethod, RequestBody, TransportRequest};
use serde_json::json;

let request = TransportRequest {
    method: HttpMethod::Post,
    path: "/responses?ak=secret".into(),
    headers: [("Authorization".into(), "Bearer secret".into())]
        .into_iter()
        .collect(),
    accept: None,
    body: RequestBody::Json {
        value: json!({ "api_key": "secret", "input": "hello" }),
    },
};

let sanitized = sanitize_transport_request(&request);
assert_eq!(sanitized.path, "/responses?ak=<redacted:ak>");
```

## Live Responses Demo

```sh
cp .env.example .env
cargo run --example responses_live_demo
```

Optional live test:

```sh
cargo test responses_vision_demo -- --ignored --nocapture
cargo test responses_function_tool_demo -- --ignored --nocapture
```

The live demo and live tests read all endpoint configuration from environment variables or a local, git-ignored `.env` file. See `.env.example`.

## Gateway Builder

```rust
use std::time::Duration;
use omnillm::{GatewayBuilder, KeyConfig, PoolConfig, ProviderEndpoint};

let gateway = GatewayBuilder::new(ProviderEndpoint::claude_messages())
    .add_key(KeyConfig::new("sk-key-1", "claude-prod-1"))
    .budget_limit_usd(100.0)
    .pool_config(PoolConfig::default())
    .request_timeout(Duration::from_secs(120))
    .build()
    .expect("at least one key required");
```

## Observability

```rust
for status in gateway.pool_status() {
    println!(
        "Key {:20} available={} inflight={}/{}",
        status.label, status.available, status.tpm_inflight, status.tpm_limit,
    );
}

println!("Budget remaining: ${:.4}", gateway.budget_remaining_usd());
```