# zeph-llm
LLM provider abstraction with Ollama, Claude, OpenAI, Gemini, and Candle backends.
## Overview
Defines the LlmProvider trait and ships concrete backends for Ollama, Claude, OpenAI, Google Gemini, and OpenAI-compatible endpoints. Includes an orchestrator for multi-model coordination, a router for model selection, an optional Candle backend for local inference, and an SQLite-backed response cache with blake3 key hashing and TTL expiry.
## Key modules
| Module | Description |
|---|---|
| `provider` | `LlmProvider` trait — unified inference interface; `name()` returns `&str` (no longer `&'static str`); `Message` carries `MessageMetadata` with `agent_visible`/`user_visible` flags for dual-visibility control |
| `ollama` | Ollama HTTP backend |
| `claude` | Anthropic Claude backend with `with_client()` builder for a shared `reqwest::Client` |
| `openai` | OpenAI backend with `with_client()` builder for a shared `reqwest::Client` |
| `gemini` | Google Gemini backend (`generateContent` + `streamGenerateContent?alt=sse`); system prompt mapped to `systemInstruction`, assistant role to `"model"`, consecutive same-role messages merged, thinking parts surfaced as `StreamChunk::Thinking`, `functionCall` parts in the SSE stream emitted as `StreamChunk::ToolUse`; configured via `[llm.gemini]` and `ZEPH_GEMINI_API_KEY` |
| `compatible` | Generic OpenAI-compatible endpoint backend |
| `candle_provider` | Local inference via Candle (optional feature) |
| `orchestrator` | Multi-model coordination and fallback; `send_with_retry()` helper deduplicates retry logic |
| `router` | Model selection and routing logic; strategies include EMA latency tracking and Thompson Sampling (Beta distributions). `RouterProvider` dispatches to the configured strategy and records outcomes per provider. Providers are stored as `Arc<[AnyProvider]>`, so `clone()` on every LLM request is O(1) regardless of chain length |
| `vision` | Image input support — base64-encoded images in LLM requests; optional dedicated `vision_model` per provider |
| `extractor` | `chat_typed<T>()` — typed LLM output via JSON Schema (schemars); per-`TypeId` schema caching |
| `sse` | Shared `sse_to_chat_stream()` helpers for Claude and OpenAI SSE parsing |
| `stt` | `SpeechToText` trait and `WhisperProvider` (OpenAI Whisper, feature-gated behind `stt`) |
| `candle_whisper` | Local offline STT via Candle (whisper-tiny/base/small, feature-gated behind `candle`) |
| `http` | `default_client()` — shared HTTP client with standard timeouts and user-agent |
| `error` | `LlmError` — unified error type; `ContextLengthExceeded` variant with `is_context_length_error()` heuristic matching across provider error formats (Claude, OpenAI, Ollama) |
Re-exports: `LlmProvider`, `LlmError`
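The `is_context_length_error()` heuristic mentioned in the module table amounts to substring matching on provider error messages. A minimal sketch follows; the matched substrings here are illustrative assumptions, not the crate's actual list:

```rust
/// Illustrative sketch of a context-length detection heuristic.
/// The substrings below are assumptions, not zeph-llm's actual list.
fn is_context_length_error(message: &str) -> bool {
    let msg = message.to_ascii_lowercase();
    [
        "context_length_exceeded", // OpenAI-style error code
        "prompt is too long",      // Claude-style message
        "context length",          // generic / Ollama-style wording
    ]
    .iter()
    .any(|needle| msg.contains(needle))
}

fn main() {
    assert!(is_context_length_error(
        "Error: this model's maximum context length is 128000 tokens"
    ));
    assert!(!is_context_length_error("429 rate limit exceeded"));
    println!("heuristic ok");
}
```

Lowercasing first keeps the match case-insensitive across providers that capitalize differently.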
## Router strategies
The router supports multiple strategies for ordering providers in the fallback chain. Set the strategy in `[llm.router]`:
### EMA (default)
Exponential moving average latency tracking. After each response, `EmaTracker` records the provider's latency and periodically reorders the chain so the fastest reliable provider is tried first.
```toml
[llm.router.ema]
enabled = true
alpha = 0.1           # smoothing factor; lower = slower to adapt
reorder_interval = 60 # seconds between reordering

[llm.router]
strategy = "ema"
```
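The EMA update itself is a one-line fold; a minimal sketch of the tracking logic (struct and method names here are illustrative, not `EmaTracker`'s real API):

```rust
/// Illustrative EMA latency tracker (names are not zeph-llm's actual API).
struct EmaLatency {
    alpha: f64,          // smoothing factor; lower = slower to adapt
    ema_ms: Option<f64>, // None until the first sample arrives
}

impl EmaLatency {
    fn new(alpha: f64) -> Self {
        Self { alpha, ema_ms: None }
    }

    /// Fold one observed latency sample into the moving average.
    fn record(&mut self, sample_ms: f64) {
        self.ema_ms = Some(match self.ema_ms {
            None => sample_ms, // first sample seeds the average
            Some(prev) => self.alpha * sample_ms + (1.0 - self.alpha) * prev,
        });
    }
}

fn main() {
    let mut t = EmaLatency::new(0.1);
    t.record(100.0);
    t.record(50.0); // 0.1 * 50 + 0.9 * 100 = 95
    assert!((t.ema_ms.unwrap() - 95.0).abs() < 1e-9);
    println!("ema = {:?}", t.ema_ms);
}
```

With a low `alpha`, a single slow response barely moves the average, which is why the chain is only reordered periodically rather than after every request.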
### Thompson Sampling
Adaptive model selection using Beta distributions. Each provider maintains a Beta(alpha, beta) distribution initialized with a uniform prior (1, 1). On each request the router samples all distributions and picks the provider with the highest sample; after the response it updates alpha (success) or beta (failure). This naturally balances exploration of less-tested providers with exploitation of known-good ones.
State persists across restarts to `~/.zeph/router_thompson_state.json` (configurable). Stale entries for removed providers are pruned automatically on startup.
```toml
[llm.router]
providers = ["claude", "openai", "ollama"]
strategy = "thompson"
# thompson_state_path = "~/.zeph/router_thompson_state.json" # optional
```
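The sample-pick-update loop described above can be sketched in standalone form. Everything below is illustrative: the names, the tiny deterministic PRNG, and the order-statistics Beta sampler (valid for integer parameters) stand in for the crate's actual implementation.

```rust
/// Illustrative Thompson Sampling over providers (not zeph-llm's actual API).
/// Beta(a, b) with integer a, b is sampled as the a-th smallest of a+b-1
/// uniform draws (an order-statistics identity), avoiding external crates.
struct Arm { alpha: u32, beta: u32 } // Beta(1, 1) is the uniform prior

struct Rng(u64); // xorshift64, deterministic for the demo
impl Rng {
    fn next_f64(&mut self) -> f64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
}

fn sample_beta(rng: &mut Rng, a: u32, b: u32) -> f64 {
    let mut u: Vec<f64> = (0..a + b - 1).map(|_| rng.next_f64()).collect();
    u.sort_by(|x, y| x.partial_cmp(y).unwrap());
    u[(a - 1) as usize]
}

/// Pick the arm whose sampled success rate is highest.
fn select(rng: &mut Rng, arms: &[Arm]) -> usize {
    let samples: Vec<f64> = arms
        .iter()
        .map(|arm| sample_beta(rng, arm.alpha, arm.beta))
        .collect();
    samples
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
        .map(|(i, _)| i)
        .unwrap()
}

/// Bayesian update: success bumps alpha, failure bumps beta.
fn record(arm: &mut Arm, success: bool) {
    if success { arm.alpha += 1 } else { arm.beta += 1 }
}

fn main() {
    let mut rng = Rng(42);
    let mut arms = vec![Arm { alpha: 1, beta: 1 }, Arm { alpha: 1, beta: 1 }];
    // Simulate a reliable provider 0 and a flaky provider 1.
    for _ in 0..50 {
        record(&mut arms[0], true);
        record(&mut arms[1], false);
    }
    let wins = (0..100).filter(|_| select(&mut rng, &arms) == 0).count();
    assert!(wins > 90); // posterior mass has converged on provider 0
    println!("provider 0 chosen {wins}/100 times");
}
```

Because each pick is a random draw from the posterior rather than an argmax over point estimates, a provider with few observations still occasionally wins a round, which is the exploration behavior the note below describes.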
CLI commands are available for inspecting and managing Thompson state; in the TUI, `/router stats` displays the same information in the dashboard.
> [!NOTE]
> Thompson Sampling is most useful when you have multiple providers with varying reliability and want the router to automatically converge on the best one while still occasionally probing alternatives.
### Cascade routing
The cascade strategy tries providers in order and escalates to the next when a quality threshold is not met. Configure via `[llm.router.cascade]`:
```toml
[llm.router]
strategy = "cascade"
providers = ["ollama", "claude", "openai"]

[llm.router.cascade]
quality_threshold = 0.7
max_escalations = 2
cost_tiers = ["ollama", "claude", "openai"] # optional: explicit cheapest-first ordering
```
`cost_tiers` reorders providers once at construction time (zero per-request cost). Providers absent from the list are appended after the listed ones in original chain order; unknown names are silently ignored.
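The described reordering reduces to a stable sort by tier index. A sketch, with an illustrative function name (not the crate's actual code):

```rust
/// Illustrative sketch of cost_tiers reordering (not zeph-llm's actual code):
/// listed providers come first in tier order, unlisted providers keep their
/// original relative order afterwards, and unknown tier names simply never
/// match anything, so they are ignored.
fn apply_cost_tiers(chain: Vec<String>, cost_tiers: &[&str]) -> Vec<String> {
    let tier_index = |name: &str| {
        cost_tiers
            .iter()
            .position(|t| *t == name)
            .unwrap_or(cost_tiers.len()) // absent => after all listed ones
    };
    let mut chain = chain;
    // sort_by_key is stable, so unlisted providers keep chain order
    chain.sort_by_key(|name| tier_index(name));
    chain
}

fn main() {
    let chain: Vec<String> =
        vec!["claude".into(), "openai".into(), "ollama".into()];
    let ordered = apply_cost_tiers(chain, &["ollama", "claude"]);
    assert_eq!(ordered, vec!["ollama", "claude", "openai"]);
    println!("{ordered:?}");
}
```

Stability of the sort is what guarantees the "appended in original chain order" behavior for providers missing from `cost_tiers`.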
## Claude extended thinking
`ClaudeProvider` supports two thinking modes via `ThinkingConfig`:
| Mode | Description |
|---|---|
| `Extended { budget_tokens }` | Allocates a fixed token budget (1024–128000) for visible reasoning; emits the `interleaved-thinking-2025-05-14` beta header on Sonnet 4.6 with tools |
| `Adaptive { effort? }` | Lets the model allocate the thinking budget automatically |
```toml
[llm.claude]
thinking = { mode = "extended", budget_tokens = 16000 }
```
CLI: `--thinking extended:16000` or `--thinking adaptive`. When thinking is enabled and `max_tokens` is below 16000, it is raised automatically. Thinking deltas are parsed from the SSE stream and suppressed from the user-facing output; `MessagePart::ThinkingBlock` variants preserve thinking blocks verbatim across tool-use turns.
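The automatic `max_tokens` adjustment amounts to a floor clamp; a sketch under the behavior described above (the function name is illustrative, not the crate's):

```rust
/// Illustrative sketch: ensure max_tokens can contain the thinking budget.
/// The 16000 floor mirrors the documented behavior; the name is hypothetical.
fn effective_max_tokens(requested: u32, thinking_enabled: bool) -> u32 {
    if thinking_enabled {
        requested.max(16_000) // raised automatically when thinking is on
    } else {
        requested
    }
}

fn main() {
    assert_eq!(effective_max_tokens(4096, true), 16_000);
    assert_eq!(effective_max_tokens(32_000, true), 32_000);
    assert_eq!(effective_max_tokens(4096, false), 4096);
    println!("ok");
}
```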
## Gemini configuration
```toml
[llm]
provider = "gemini"

[llm.gemini]
model = "gemini-2.0-flash" # or "gemini-2.5-pro" for extended thinking
max_tokens = 8192
# base_url = "https://generativelanguage.googleapis.com/v1beta"
```
Store the API key in the vault: `zeph vault set ZEPH_GEMINI_API_KEY AIza...`
> [!NOTE]
> Gemini does not expose an embeddings endpoint. For semantic memory and skill matching, pair Gemini with an Ollama embedding model via `[llm.orchestrator]`.
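The role mapping and same-role merging that the `gemini` module performs (assistant becomes `"model"`, consecutive same-role messages are merged, since the API expects alternating turns) can be sketched as follows; the function and tuple shapes are illustrative, not the crate's types:

```rust
/// Illustrative sketch of Gemini request shaping (not zeph-llm's actual code).
/// Input is (role, text) pairs; system prompts are handled separately via
/// systemInstruction and are not covered here.
fn to_gemini_contents(messages: &[(&str, &str)]) -> Vec<(String, String)> {
    let mut out: Vec<(String, String)> = Vec::new();
    for (role, text) in messages {
        // Gemini uses "model" where other APIs use "assistant"
        let role = if *role == "assistant" { "model" } else { "user" };
        if out.last().is_some_and(|(r, _)| r == role) {
            // merge consecutive same-role messages into one turn
            let last = out.last_mut().unwrap();
            last.1.push('\n');
            last.1.push_str(text);
        } else {
            out.push((role.to_string(), text.to_string()));
        }
    }
    out
}

fn main() {
    let msgs = [("user", "hi"), ("user", "there"), ("assistant", "hello")];
    let contents = to_gemini_contents(&msgs);
    assert_eq!(contents.len(), 2);
    assert_eq!(contents[0].1, "hi\nthere");
    assert_eq!(contents[1].0, "model");
    println!("{contents:?}");
}
```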
## Features
| Feature | Default | Description |
|---|---|---|
| `schema` | on | schemars dependency, `chat_typed`, `Extractor`, and per-`TypeId` schema caching |
| `mock` | off | `MockProvider` for unit testing without a live LLM endpoint |
| `stt` | off | `WhisperProvider` using the OpenAI Whisper API (requires `reqwest/multipart`) |
| `candle` | off | Local GGUF inference via Candle; pulls in candle-core, candle-nn, candle-transformers, hf-hub, tokenizers, symphonia, rubato |
| `cuda` | off | Enables the CUDA backend for Candle (implies `candle`) |
| `metal` | off | Enables the Metal backend for Candle on Apple Silicon (implies `candle`) |
To compile without schemars, build with default features disabled: `cargo build --no-default-features`.
## Installation
## License
MIT