zeph-llm

LLM provider abstraction with Ollama, Claude, OpenAI, Gemini, and Candle backends.

Overview

Defines the LlmProvider trait and ships concrete backends for Ollama, Claude, OpenAI, Google Gemini, and OpenAI-compatible endpoints. Includes an orchestrator for multi-model coordination, a router for model selection, an optional Candle backend for local inference, and an SQLite-backed response cache with blake3 key hashing and TTL expiry.

Key modules

Module	Description
`provider`	`LlmProvider` trait — unified inference interface; `name()` returns `&str` (no longer `&'static str`); `Message` carries `MessageMetadata` with `agent_visible`/`user_visible` flags for dual-visibility control
`ollama`	Ollama HTTP backend
`claude`	Anthropic Claude backend with `with_client()` builder for shared `reqwest::Client`
`openai`	OpenAI backend with `with_client()` builder for shared `reqwest::Client`
`gemini`	Google Gemini backend (`generateContent` + `streamGenerateContent?alt=sse`); system prompt mapped to `systemInstruction`, `assistant` role to `"model"`, consecutive same-role message merging, thinking parts surfaced as `StreamChunk::Thinking`, `functionCall` parts in SSE stream emitted as `StreamChunk::ToolUse`; configured via `[llm.gemini]` and `ZEPH_GEMINI_API_KEY`
`compatible`	Generic OpenAI-compatible endpoint backend
`candle_provider`	Local inference via Candle (optional feature)
`orchestrator`	Multi-model coordination and fallback; `send_with_retry()` helper deduplicates retry logic
`router`	Model selection and routing logic with two strategies: EMA latency tracking and Thompson Sampling (Beta distributions). `RouterProvider` dispatches to the configured strategy and records outcomes per provider. Providers stored as `Arc<[AnyProvider]>` — `clone()` on every LLM request is O(1) regardless of chain length
`vision`	Image input support — base64-encoded images in LLM requests; optional dedicated `vision_model` per provider
`extractor`	`chat_typed<T>()` — typed LLM output via JSON Schema (`schemars`); per-`TypeId` schema caching
`sse`	Shared `sse_to_chat_stream()` helpers for Claude and OpenAI SSE parsing
`stt`	`SpeechToText` trait and `WhisperProvider` (OpenAI Whisper, feature-gated behind `stt`)
`candle_whisper`	Local offline STT via Candle (whisper-tiny/base/small, feature-gated behind `candle`)
`http`	`default_client()` — shared HTTP client with standard timeouts and user-agent
`error`	`LlmError` — unified error type; `ContextLengthExceeded` variant with `is_context_length_error()` heuristic matching across provider error formats (Claude, OpenAI, Ollama)

Re-exports: LlmProvider, LlmError

Router strategies

The router supports two strategies for ordering providers in the fallback chain. Set the strategy in [llm.router]:

EMA (default)

Exponential moving average latency tracking. After each response, EmaTracker records provider latency and periodically reorders the chain so the fastest reliable provider is tried first.

[llm]
router_ema_enabled      = true
router_ema_alpha        = 0.1   # smoothing factor; lower = slower to adapt
router_reorder_interval = 60    # seconds between reordering

[llm.router]
strategy = "ema"

Thompson Sampling

Adaptive model selection using Beta distributions. Each provider maintains a Beta(alpha, beta) distribution initialized with a uniform prior (1, 1). On each request the router samples all distributions and picks the provider with the highest sample; after the response it updates alpha (success) or beta (failure). This naturally balances exploration of less-tested providers with exploitation of known-good ones.

State persists across restarts to ~/.zeph/router_thompson_state.json (configurable). Stale entries for removed providers are pruned automatically on startup.

[llm.router]
chain    = ["claude", "openai", "ollama"]
strategy = "thompson"
# thompson_state_path = "~/.zeph/router_thompson_state.json"  # optional

CLI commands for inspecting and managing Thompson state:

zeph router stats   # show per-provider alpha/beta and success rate
zeph router reset   # reset all distributions to uniform prior

TUI: /router stats displays the same information in the dashboard.

[!NOTE] Thompson Sampling is most useful when you have multiple providers with varying reliability and want the router to automatically converge on the best one while still occasionally probing alternatives.

Cascade routing

The cascade strategy tries providers in order and escalates to the next when a quality threshold is not met. Configure via [llm.router.cascade]:

[llm.router]
strategy = "cascade"
chain = ["ollama", "claude", "openai"]

[llm.router.cascade]
quality_threshold = 0.7
max_escalations = 2
cost_tiers = ["ollama", "claude", "openai"]  # optional: explicit cheapest-first ordering

cost_tiers reorders providers once at construction time (zero per-request cost). Providers absent from the list are appended after listed ones in original chain order. Unknown names are silently ignored.

Complexity triage routing

The triage strategy classifies each request into a complexity tier before inference and routes it to the provider pool configured for that tier. This avoids sending simple queries to expensive models and reserves high-capability models for genuinely complex tasks.

[llm.router]
strategy = "triage"

[llm.complexity_routing]
simple_providers  = ["ollama"]
medium_providers  = ["ollama", "openai"]
complex_providers = ["claude", "openai"]
expert_providers  = ["claude"]

Tier assignment uses a lightweight classifier (TriageClassifier) that runs before the primary LLM call. The classifier dispatches to LlmRoutingStrategy::Triage on the RouterProvider.

[!TIP] Use ClassifierMode::Judge to route classification through a separate LLM call when heuristic scoring is insufficient for your workload.

PILOT LinUCB bandit routing

The bandit strategy applies a contextual LinUCB bandit to provider selection. On each request, context features (query complexity score, recent per-provider latency, time-of-day bucket) are assembled into a feature vector; the bandit computes an upper confidence bound per provider and selects the highest. After each response, the reward signal (success × inverse latency) updates the ridge regression weights.

State is persisted to ~/.zeph/router_bandit_state.json (configurable) and restored on restart.

[llm.router]
strategy = "bandit"
chain    = ["ollama", "claude", "openai"]

[llm.router.bandit]
alpha            = 1.0     # exploration parameter; higher = more exploration
state_path       = "~/.zeph/router_bandit_state.json"
feature_dim      = 8       # dimensionality of the context feature vector

[!NOTE] PILOT (Provider Intelligent Linucb Online Tracking) is most effective when providers have meaningfully different latency/quality profiles and the workload has varied query complexity. For uniform workloads, Thompson Sampling may converge faster.

[!TIP] Inspect learned weights and UCB scores with zeph router stats (same command as Thompson Sampling) or /router stats in the TUI.

SLM provider recommendations

For cost-sensitive or resource-constrained deployments, the following Small Language Models are verified to work well with Zeph:

Task	Recommended SLM	Notes
Embeddings	`nomic-embed-text` (Ollama)	Default embedding model
Simple queries / routing	`qwen3:8b` (Ollama)	Fast, low memory footprint
Summarization / compaction	`qwen3:8b` or `phi-4-mini`	Good quality at 8B scale
Graph extraction	`qwen3:8b`	Structured output via JSON Schema
STT	`whisper-tiny` / `whisper-base` (Candle)	Local offline, no API key

Pair SLMs with a cloud provider for complex/expert tasks using triage routing:

[llm.router]
strategy = "triage"

[llm.complexity_routing]
simple_providers  = ["ollama"]   # qwen3:8b handles simple queries
medium_providers  = ["ollama"]
complex_providers = ["claude"]
expert_providers  = ["claude"]

Claude extended thinking

ClaudeProvider supports two thinking modes via ThinkingConfig:

Mode	Description
`Extended { budget_tokens }`	Allocates a fixed token budget (1024–128000) for visible reasoning; emits `interleaved-thinking-2025-05-14` beta header on Sonnet 4.6 with tools
`Adaptive { effort? }`	Lets the model allocate thinking budget automatically

[llm.claude]
thinking = { mode = "extended", budget_tokens = 16000 }

CLI: --thinking extended:16000 or --thinking adaptive. When thinking is enabled and max_tokens is below 16000, it is raised automatically. Thinking deltas are parsed from the SSE stream and suppressed from the user-facing output; MessagePart::ThinkingBlock variants preserve thinking blocks verbatim across tool-use turns.

Gemini configuration

[llm]
provider = "gemini"

[llm.gemini]
model = "gemini-2.0-flash"   # or "gemini-2.5-pro" for extended thinking
max_tokens = 8192
# base_url = "https://generativelanguage.googleapis.com/v1beta"

Store the API key in the vault: zeph vault set ZEPH_GEMINI_API_KEY AIza...

[!NOTE] Gemini does not expose an embeddings endpoint. For semantic memory and skill matching, pair Gemini with an Ollama embedding model via [llm.orchestrator].

Features

Feature	Default	Description
`schema`	on	`schemars` dependency, `chat_typed`, `Extractor`, and per-`TypeId` schema caching
`mock`	off	`MockProvider` for unit testing without a live LLM endpoint
`stt`	off	`WhisperProvider` using OpenAI Whisper API (requires `reqwest/multipart`)
`candle`	off	Local GGUF inference via Candle; pulls in `candle-core`, `candle-nn`, `candle-transformers`, `hf-hub`, `tokenizers`, `symphonia`, `rubato`
`cuda`	off	Enables CUDA backend for Candle (implies `candle`)
`metal`	off	Enables Metal backend for Candle on Apple Silicon (implies `candle`)

To compile without schemars:

cargo build -p zeph-llm --no-default-features

Installation

cargo add zeph-llm

# Without schemars (chat_typed and Extractor not available)
cargo add zeph-llm --no-default-features

# With local inference via Candle
cargo add zeph-llm --features candle

# With OpenAI Whisper STT
cargo add zeph-llm --features stt

Documentation

Full documentation: https://bug-ops.github.io/zeph/

License

MIT

zeph-llm 0.18.1