OmniLLM

An AI-native, production-grade Rust library for provider-neutral LLM access with multi-key load balancing, per-key rate limiting, protocol conversion, circuit breaking, and lock-free cost tracking.

AI-Native Project

OmniLLM ships with a first-party OmniLLM Skill in skill/. The skill teaches coding agents how to work with OmniLLM's actual runtime and conversion surfaces instead of guessing from generic Rust or generic SDK patterns.

The bundled Skill is tuned for repository-native signals such as:

  • GatewayBuilder, Gateway, KeyConfig, PoolConfig
  • ProviderEndpoint, EndpointProtocol, ProviderProtocol, LlmRequest, LlmStreamEvent
  • ApiRequest, WireFormat, emit_transport_request, transcode_*
  • ReplayFixture, sanitize_transport_request, OMNILLM_RESPONSES_*
  • runtime errors like NoAvailableKey, BudgetExceeded, and Protocol(...)

Bundled Skill

The repository includes the OmniLLM Skill in skill/. The installation guide lives in skill/README.md, and the website version lives in website/docs/skill.md.

Install The Skill

See skill/README.md for GitHub-based Vercel Labs skills installer commands for Claude Code, Codex, and OpenCode.

Use The Skill

After installing it, ask your agent to:

  • integrate omnillm into a Rust project
  • configure a multi-key runtime gateway
  • transcode between provider protocols or typed endpoint formats
  • explain replay sanitization and fixture-safe testing
  • debug OmniLLM-specific errors and configuration issues

Repository Docs Site

The documentation site source lives in the GitHub repository.

Features

  • Canonical Responses + Capability Layer hybrid request/response model
  • Additive provider primitive protocol mode for raw provider-native payloads
  • Runtime endpoint profiles through EndpointProtocol, including official URL derivation and *_compat full-URL modes
  • Additive multi-endpoint API layer with canonical request/response types for generation, embeddings, images, audio, and rerank
  • Protocol-aware dispatch for OpenAI Responses, OpenAI Chat Completions, Claude Messages, and Gemini GenerateContent
  • Raw JSON and typed transcoders between supported protocols and endpoint families
  • Message-level raw_message preservation for higher-fidelity round trips
  • Embedded provider support registry for OpenAI, Azure OpenAI, Anthropic, Gemini, Vertex AI, Bedrock, and OpenAI-compatible endpoints
  • Replay fixture sanitization helpers for safe record/replay style testing
  • Multi-key load balancing with per-key rate limiting and circuit breaking
  • Lock-free budget tracking with pre-reserve + settle accounting
  • Canonical call (non-streaming) and stream (streaming) APIs, plus primitive primitive_call, SSE/binary primitive_stream, and WebSocket primitive_realtime APIs
  • Bundled OmniLLM Skill in skill/ for AI-native repo guidance

Dual Protocol Modes

OmniLLM now exposes two runtime protocol forms:

OpenAI Responses canonical
  • Entry points: Gateway::call, Gateway::stream
  • Payload model: LlmRequest, LlmResponse, LlmStreamEvent
  • Use when: you want provider-neutral generation with the existing OpenAI Responses-centered semantics and provider transcoding
  • Budget: shared BudgetTracker

Provider primitive
  • Entry points: Gateway::primitive_call, Gateway::primitive_stream, Gateway::primitive_realtime
  • Payload model: PrimitiveRequest, PrimitiveResponse, PrimitiveStreamEvent, PrimitiveRealtimeSession
  • Use when: you need raw provider-native APIs such as OpenAI Images/Audio/Realtime, Anthropic Messages/Count Tokens, Gemini GenerateContent/CountTokens/Live, or OpenAI-compatible raw payloads
  • Budget: shared BudgetTracker

The canonical path remains the default and does not require primitive configuration. Primitive mode is explicit: configure a PrimitiveProviderEndpoint, send a PrimitiveRequest, and OmniLLM preserves the provider-native request and response payloads. Usage extraction is side-channel telemetry used for budget settlement; it does not rewrite the returned primitive body.

Primitive provider support is intentionally scoped. OmniLLM is not a full provider admin SDK; admin, billing, webhooks, fine-tuning, evals, tunings, managed-agent platforms, and hosted RAG/vector-store administration remain deferred unless a current Spec explicitly promotes them.

Current primitive support tiers:

  • P0 core: OpenAI Responses/Chat/Images/Audio/Embeddings, Anthropic Messages/Count Tokens/Batches/Files, Gemini Generate/Stream/Count/Embed/Files/Caches (budget class: token or media fallback)
  • P1 HTTP gaps: OpenAI Files/Uploads/Models/Audio Translations/Image edits/variations, Anthropic Models/Files hardening, Gemini Models/Operations/Files/Caches hardening (budget class: zero-cost metadata, upload/storage, or billable-unit fallback)
  • P2 async jobs: batch lifecycle provider APIs (budget class: async job usage when observed)
  • P3 transports: OpenAI Audio Speech binary chunks, OpenAI Realtime WebSocket, Gemini Live WebSocket; WebRTC remains planned (budget class: close/cancel/provider-error/no-usage fallback settlement)
  • Deferred: admin, billing, fine-tuning, evals, tunings, managed agents, hosted RAG control plane, SDK helpers (out of scope)

WebSocket realtime support is implemented for OpenAI Realtime and Gemini Live. WebRTC transport is intentionally not claimed as implemented and remains planned until feature-gated tests cover it.

A minimal primitive-mode call against OpenAI Responses:

use omnillm::{
    GatewayBuilder, KeyConfig, PrimitiveEndpointKind, PrimitiveProviderEndpoint,
    PrimitiveProviderKind, PrimitiveRequest, ProviderEndpoint, ProviderPrimitiveWireFormat,
};
use serde_json::json;
use tokio_util::sync::CancellationToken;

# async fn demo() -> Result<(), Box<dyn std::error::Error>> {
let gateway = GatewayBuilder::new(ProviderEndpoint::openai_responses())
    .primitive_endpoint(PrimitiveProviderEndpoint::openai())
    .add_key(KeyConfig::new("sk-key", "openai"))
    .budget_limit_usd(10.0)
    .build()?;

let response = gateway
    .primitive_call(
        PrimitiveRequest::json(
            PrimitiveProviderKind::OpenAi,
            PrimitiveEndpointKind::Responses,
            ProviderPrimitiveWireFormat::OpenAiResponses,
            "gpt-4o",
            json!({"model":"gpt-4o","input":"hello"}),
        ),
        CancellationToken::new(),
    )
    .await?;
println!("status={} usage={:?}", response.status, response.usage);
# Ok(())
# }

Canonical Model

Generation stays centered on the existing Responses API semantic model:

  • LlmRequest / LlmResponse are still the canonical generation types.
  • ApiRequest / ApiResponse add separate canonical types for embeddings, image generations, audio transcriptions, audio speech, and rerank.
  • ConversionReport<T> makes bridge semantics explicit with bridged, lossy, and loss_reasons.

This keeps generation normalized around "generate one response" while avoiding capability lock-in to any single wire protocol.

Endpoint Families

Current typed endpoint coverage:

  • Generation: LlmRequest / LlmResponse; wire formats: open_ai_responses, open_ai_chat_completions, anthropic_messages, gemini_generate_content
  • Embeddings: EmbeddingRequest / EmbeddingResponse; wire format: open_ai_embeddings
  • Image generation: ImageGenerationRequest / ImageGenerationResponse; wire format: open_ai_image_generations
  • Audio transcription: AudioTranscriptionRequest / AudioTranscriptionResponse; wire format: open_ai_audio_transcriptions
  • Audio speech: AudioSpeechRequest / AudioSpeechResponse; wire format: open_ai_audio_speech
  • Rerank: RerankRequest / RerankResponse; wire format: open_ai_rerank

Provider support is exposed through embedded_provider_registry(); a quick inspection sketch follows the list below. The registry distinguishes:

  • native: implemented with provider-native wire format
  • compatible: OpenAI-compatible or wrapper-style support
  • planned: listed in the matrix but not yet implemented as a codec/runtime adapter
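
A minimal way to inspect that matrix at runtime, assuming embedded_provider_registry() is exported at the crate root and that its return type implements Debug (neither detail is confirmed by this README):

// Dump the embedded provider support matrix (native / compatible / planned).
// Debug formatting is an assumption here; adjust to the actual registry API.
let registry = omnillm::embedded_provider_registry();
println!("{registry:#?}");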

Quick Start

use omnillm::{
    GenerationConfig, GatewayBuilder, KeyConfig, LlmRequest, Message, MessageRole,
    ProviderEndpoint, RequestItem,
};
use tokio_util::sync::CancellationToken;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let gateway = GatewayBuilder::new(ProviderEndpoint::openai_responses())
        .add_key(KeyConfig::new("sk-key-1", "prod-1").tpm_limit(90_000).rpm_limit(500))
        .budget_limit_usd(50.0)
        .build()?;

    let req = LlmRequest {
        model: "gpt-4.1-mini".into(),
        instructions: Some("Answer concisely".into()),
        input: vec![RequestItem::from(Message::text(MessageRole::User, "Hello!"))],
        messages: Vec::new(),
        capabilities: Default::default(),
        generation: GenerationConfig {
            max_output_tokens: Some(256),
            ..Default::default()
        },
        metadata: Default::default(),
        vendor_extensions: Default::default(),
    };

    let resp = gateway.call(req, CancellationToken::new()).await?;
    println!("{}", resp.content_text);
    Ok(())
}

Runtime Endpoint Profiles

Runtime configuration uses EndpointProtocol, while ProviderProtocol remains the low-level wire-protocol enum for parsing, emission, and transcoding. Names such as ClaudeMessages and GeminiGenerateContent come directly from the upstream API families OmniLLM models, so treat them as wire-shape identifiers rather than preferred runtime configuration presets.

use omnillm::{AuthScheme, EndpointProtocol, ProviderEndpoint};

let endpoint = ProviderEndpoint::new(
    EndpointProtocol::OpenAiChatCompletionsCompat,
    "https://your-openai-compatible-host/v1/chat/completions",
)
.with_auth(AuthScheme::Header {
    name: "x-api-key".into(),
});

Use official EndpointProtocol variants when OmniLLM should derive standard upstream paths from a host or prefix. Use *_compat variants when the upstream wrapper already exposes the full request URL. For OpenAI Chat Completions wrappers that reject bare string content, construct chat input with Message.parts: OmniLLM emits plain-text chat messages as typed content arrays such as [{ "type": "text", "text": "hi?" }].
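
For contrast with the *_compat example above, a minimal sketch of an official-variant profile. The EndpointProtocol::OpenAiResponses variant name and the host-only base URL are assumptions; the point is that official variants let OmniLLM derive the standard upstream path rather than taking a full request URL:

use omnillm::{EndpointProtocol, ProviderEndpoint};

// Official variant (assumed name): supply only the host/prefix and let
// OmniLLM derive the standard OpenAI Responses path.
let endpoint = ProviderEndpoint::new(
    EndpointProtocol::OpenAiResponses,
    "https://api.openai.com",
);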

Prompt Cache

OmniLLM exposes prompt caching as a typed generation capability instead of a raw provider-only JSON escape hatch:

use omnillm::{
    CacheBreakpoint, CapabilitySet, PromptCacheKey, PromptCachePolicy,
    PromptCacheRetention,
};

let capabilities = CapabilitySet {
    prompt_cache: Some(PromptCachePolicy::BestEffort {
        key: Some(PromptCacheKey::Explicit { value: "tenant-a".into() }),
        retention: PromptCacheRetention::Long,
        breakpoint: CacheBreakpoint::Auto,
        vendor_extensions: Default::default(),
    }),
    ..Default::default()
};

Provider behavior is intentionally explicit:

  • OpenAI Responses and Chat Completions emit prompt_cache_key and prompt_cache_retention; OpenAI breakpoint requests are partial support because caching is automatic prefix matching.
  • Claude Messages emits provider-native cache_control on supported tool, system, message, or content-block boundaries.
  • Gemini GenerateContent does not support typed prompt cache; BestEffort becomes a lossy bridge report and Required returns an error before transport.
  • TokenUsage.prompt_cache preserves cached/read/write token telemetry so callers can verify cache hits from provider usage, not assumptions.
  • Budget estimates stay conservative and do not assume cache hits; actual settlement applies cache-aware pricing only when both provider telemetry and known cache rates are present.

For safer stable-prefix construction, use PromptLayoutBuilder to keep dynamic user/RAG content in the suffix and generate stable prefix keys that exclude dynamic content:

use omnillm::{Message, MessageRole, PromptLayoutBuilder, PromptCacheRetention};

let request = PromptLayoutBuilder::new("gpt-5.4")
    .instructions("Answer using the stable policy document.")
    .stable_message(Message::text(MessageRole::User, "Stable policy context"))
    .user_input("What changed for my account?")
    .stable_prefix_cache_key("support-bot", Some("tenant-a"), PromptCacheRetention::Long, false)
    .build();

Protocol Transcoding

use omnillm::{transcode_request, ProviderProtocol};

let raw_chat = r#"{
  "model": "gpt-4.1-mini",
  "messages": [{
    "role": "user",
    "content": [{ "type": "text", "text": "Hello!" }]
  }],
  "max_tokens": 32
}"#;

let raw_responses = transcode_request(
    ProviderProtocol::OpenAiChatCompletions,
    ProviderProtocol::OpenAiResponses,
    raw_chat,
)?;

Typed multi-endpoint transcoding keeps bridge metadata:

use omnillm::{transcode_api_request, WireFormat};

let raw_chat = r#"{
  "model": "gpt-4.1-mini",
  "messages": [{
    "role": "user",
    "content": [{ "type": "text", "text": "Hello!" }]
  }],
  "max_tokens": 32
}"#;

let report = transcode_api_request(
    WireFormat::OpenAiChatCompletions,
    WireFormat::OpenAiResponses,
    raw_chat,
)?;

assert!(report.bridged);
assert!(!report.lossy);
println!("{}", report.value);

If you bridge from the canonical Responses model to a narrower protocol, loss_reasons will tell you exactly what was dropped, such as unsupported builtin tools or provider-specific metadata.
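
A small sketch of that inspection, reusing report from the typed example above; the element type of loss_reasons is not documented here, so Debug formatting is assumed:

if report.lossy {
    // Each loss reason names one part of the request that the narrower
    // target protocol could not represent.
    for reason in &report.loss_reasons {
        eprintln!("lossy bridge: {reason:?}");
    }
}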

Multi-Endpoint API

use omnillm::{
    emit_transport_request, ApiRequest, EmbeddingInput, EmbeddingRequest, RequestBody, WireFormat,
};

let request = ApiRequest::Embeddings(EmbeddingRequest {
    model: "text-embedding-3-small".into(),
    input: vec![EmbeddingInput::Text { text: "hello".into() }],
    dimensions: Some(256),
    encoding_format: None,
    user: None,
    vendor_extensions: Default::default(),
});

let transport = emit_transport_request(WireFormat::OpenAiEmbeddings, &request)?;
assert_eq!(transport.value.path, "/embeddings");

if let RequestBody::Json { value } = transport.value.body {
    println!("{}", value);
}

Local demo:

cargo run --example multi_endpoint_demo

Replay Sanitization

ReplayFixture, sanitize_transport_request, sanitize_transport_response, and sanitize_json_value are intended for record/replay tests. They redact common secrets by default:

  • auth headers
  • query tokens such as ak
  • JSON fields such as api_key, token, secret
  • large binary/base64 payload fields

use omnillm::{sanitize_transport_request, HttpMethod, RequestBody, TransportRequest};
use serde_json::json;

let request = TransportRequest {
    method: HttpMethod::Post,
    path: "/responses?ak=secret".into(),
    headers: [("Authorization".into(), "Bearer secret".into())]
        .into_iter()
        .collect(),
    accept: None,
    body: RequestBody::Json {
        value: json!({ "api_key": "secret", "input": "hello" }),
    },
};

let sanitized = sanitize_transport_request(&request);
assert_eq!(sanitized.path, "/responses?ak=<redacted:ak>");

Live Responses Demo

cp .env.example .env
cargo run --example responses_live_demo

Optional live test:

cargo test responses_vision_demo -- --ignored --nocapture
cargo test responses_function_tool_demo -- --ignored --nocapture

The live demo and live tests read all endpoint configuration from environment variables or a local ignored .env file. See .env.example.

Gateway Builder

use std::time::Duration;
use omnillm::{GatewayBuilder, KeyConfig, PoolConfig, ProviderEndpoint};

let gateway = GatewayBuilder::new(ProviderEndpoint::claude_messages())
    .add_key(KeyConfig::new("sk-key-1", "claude-prod-1"))
    .budget_limit_usd(100.0)
    .pool_config(PoolConfig::default())
    .request_timeout(Duration::from_secs(120))
    .build()
    .expect("at least one key required");

Observability

for status in gateway.pool_status() {
    println!(
        "Key {:20} available={} inflight={}/{}",
        status.label, status.available, status.tpm_inflight, status.tpm_limit,
    );
}

println!("Budget remaining: ${:.4}", gateway.budget_remaining_usd());