zeph 0.20.0

Lightweight AI agent with hybrid inference, skills-first architecture, and multi-channel I/O
# LLM Providers

Zeph supports multiple LLM backends. Choose based on your needs:

| Provider | Type | Embeddings | Vision | Streaming | Best For |
|----------|------|-----------|--------|-----------|----------|
| Ollama | Local | Yes | Yes | Yes | Privacy, free, offline |
| Claude | Cloud | No | Yes | Yes | Quality, reasoning, prompt caching |
| OpenAI | Cloud | Yes | Yes | Yes | Ecosystem, GPT-4o, GPT-5 |
| Gemini | Cloud | Yes | Yes | Yes | Google ecosystem, long context, extended thinking |
| Compatible | Cloud | Varies | Varies | Varies | Together AI, Groq, Fireworks |
| Candle | Local | No | No | No | Minimal footprint |

Claude does not support embeddings natively. Use a multi-provider setup with `embed = true` on an Ollama or OpenAI provider entry to combine Claude chat with local embeddings. Gemini supports embeddings via the `text-embedding-004` model — set `embedding_model` in the Gemini `[[llm.providers]]` entry to enable.

## Quick Setup

**Ollama** (default — no API key needed):

```bash
ollama pull mistral:7b
ollama pull qwen3-embedding
zeph
```

**Claude**:

```bash
ZEPH_CLAUDE_API_KEY=sk-ant-... zeph
```

**OpenAI**:

```bash
ZEPH_LLM_PROVIDER=openai ZEPH_OPENAI_API_KEY=sk-... zeph
```

**Gemini**:

```bash
ZEPH_LLM_PROVIDER=gemini ZEPH_GEMINI_API_KEY=AIza... zeph
```

## Gemini

Zeph supports Google Gemini as a first-class LLM backend. Gemini is a strong choice when you want access to Google's latest models (Gemini 2.5 Pro, Gemini 2.0 Flash), very long context windows, extended thinking, or native multimodal reasoning.

### Why Gemini

Google's Gemini 2.5 family brings extended thinking (visible as streaming `Thinking` chunks in Zeph's TUI), native tool use, vision, and embeddings. For tasks that require deep reasoning over large codebases or long documents, Gemini's context capacity complements Zeph's existing RAG pipeline.

### Integration Overview

The `GeminiProvider` translates Zeph's internal message format to Gemini's `generateContent` API:

- The system prompt becomes a top-level `systemInstruction` field (Gemini's required format).
- The `assistant` role is mapped to `"model"` (Gemini's terminology for the model turn).
- Consecutive messages with the same role are automatically merged — Gemini requires strict user/model alternation.
- If the conversation starts with a model turn, a synthetic empty user message is prepended to satisfy the API contract.
- Tool definitions are converted to Gemini `functionDeclarations` with JSON schema normalization (`$ref` inlining, `anyOf`/`oneOf` to `nullable` conversion, type name uppercasing).
- Vision inputs are sent as `inlineData` parts with base64-encoded image data.
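
The role mapping, merging, and leading-turn rules above can be sketched in a few lines of Rust. This is a minimal illustration, not Zeph's actual `GeminiProvider` code: the `Message` and `GeminiContent` types, the `to_gemini_contents` function, and the newline used to join merged turns are all assumptions made for the example.

```rust
// Hypothetical types for illustration; Zeph's real message structs may differ.
#[derive(Debug, Clone, PartialEq)]
pub struct Message {
    pub role: String, // "user" or "assistant" in this sketch
    pub text: String,
}

#[derive(Debug, Clone, PartialEq)]
pub struct GeminiContent {
    pub role: &'static str, // "user" or "model"
    pub text: String,
}

/// Map roles, merge consecutive same-role turns, and make sure the
/// conversation starts with a user turn.
pub fn to_gemini_contents(messages: &[Message]) -> Vec<GeminiContent> {
    let mut out: Vec<GeminiContent> = Vec::new();
    for m in messages {
        // "assistant" becomes "model"; every other role is treated as "user" here.
        let role = if m.role == "assistant" { "model" } else { "user" };
        let same_role_as_previous = out.last().map_or(false, |prev| prev.role == role);
        if same_role_as_previous {
            // Merge into the previous turn (newline separator is an assumption).
            let prev = out.last_mut().expect("checked non-empty above");
            prev.text.push('\n');
            prev.text.push_str(&m.text);
        } else {
            out.push(GeminiContent { role, text: m.text.clone() });
        }
    }
    // A leading model turn gets a synthetic empty user turn in front of it.
    if out.first().map(|c| c.role) == Some("model") {
        out.insert(0, GeminiContent { role: "user", text: String::new() });
    }
    out
}
```

Doing the merge before the leading-turn check keeps the output strictly alternating even when the history begins with several assistant messages in a row.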

Streaming uses `streamGenerateContent?alt=sse`. Thinking parts (returned with `thought: true` by Gemini 2.5 models) are surfaced as `StreamChunk::Thinking` and shown in the TUI sidebar.

### Configuration

```toml
[llm]
[[llm.providers]]
type = "gemini"
model = "gemini-2.0-flash"           # default; use "gemini-2.5-pro" for extended thinking
max_tokens = 8192
# embedding_model = "text-embedding-004"  # enable Gemini embeddings (optional)
# thinking_level = "medium"              # minimal, low, medium, high (Gemini 2.5+)
# thinking_budget = 8192                 # token budget for thinking; -1 = dynamic, 0 = off
# include_thoughts = true                # surface thinking chunks in TUI
# base_url = "https://generativelanguage.googleapis.com/v1beta"  # default
```

Store the API key in the vault (recommended):

```bash
zeph vault set ZEPH_GEMINI_API_KEY AIza...
```

Or export it as an environment variable:

```bash
export ZEPH_GEMINI_API_KEY=AIza...
```

Run `zeph init` and choose Gemini as the provider to have the wizard generate a complete config with all Gemini parameters, including the thinking level prompt.

### Capabilities

| Feature | Gemini 2.0 Flash | Gemini 2.5 Pro |
|---------|-----------------|----------------|
| Chat | Yes | Yes |
| Streaming (SSE) | Yes | Yes |
| Tool use | Yes | Yes |
| Streaming tool use | Yes | Yes |
| Vision | Yes | Yes |
| Embeddings | Yes (`text-embedding-004`) | Yes (`text-embedding-004`) |
| Extended thinking | No | Yes (`thinking_level` / `thinking_budget`) |
| Remote model discovery | Yes | Yes |

### Embeddings

Set `embedding_model` in the Gemini `[[llm.providers]]` entry to enable Gemini embeddings. When set, `supports_embeddings()` returns `true` and Zeph uses `POST /v1beta/models/{model}:embedContent` for semantic memory and skill matching — no Ollama dependency required.

```toml
[[llm.providers]]
type = "gemini"
model = "gemini-2.0-flash"
embedding_model = "text-embedding-004"
```
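
As a rough sketch of how the endpoint path relates to the config, the following Rust snippet builds the `embedContent` URL from `base_url` and `embedding_model`. The function names are hypothetical and mirror the behavior described above; treating an empty string as "not set" is an assumption.

```rust
/// Build the embedContent endpoint URL from the provider's base URL and
/// embedding model name. Hypothetical helper, not part of Zeph's public API.
pub fn embed_content_url(base_url: &str, embedding_model: &str) -> String {
    // Tolerate a trailing slash on the configured base URL.
    format!(
        "{}/models/{}:embedContent",
        base_url.trim_end_matches('/'),
        embedding_model
    )
}

/// Embeddings count as supported only when an embedding model is configured.
/// Treating an empty string as unset is an assumption of this sketch.
pub fn supports_embeddings(embedding_model: Option<&str>) -> bool {
    embedding_model.map_or(false, |m| !m.is_empty())
}
```

With the default `base_url`, the model name `text-embedding-004` resolves to a path ending in `/v1beta/models/text-embedding-004:embedContent`.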

### Streaming and Thinking

When streaming is active, Zeph emits chunks as they arrive from the SSE stream (`streamGenerateContent?alt=sse`). For Gemini 2.5 models that return thinking parts, the TUI shows a "Thinking…" indicator while the model reasons and then switches to the response stream. Both paths use the same retry infrastructure (`send_with_retry`) — HTTP 429 (rate limit) and 503 (service unavailable) responses trigger automatic backoff and retry.
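
The retry behavior can be pictured with the sketch below. It is not the real `send_with_retry`: the function names, the exponential schedule, the 500 ms base delay, and the 30 s cap are all assumptions chosen for illustration. Only the retryable status codes (429 and 503) come from the description above.

```rust
use std::time::Duration;

/// Only rate-limit and service-unavailable responses are retried.
pub fn is_retryable(status: u16) -> bool {
    matches!(status, 429 | 503)
}

/// Exponential backoff: base * 2^attempt, capped at `cap`.
/// The schedule itself is an assumption of this sketch.
pub fn backoff_delay(attempt: u32, base: Duration, cap: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.saturating_mul(factor).min(cap)
}

/// Call `send` until it returns a non-retryable status or the retry budget
/// is used up. `wait` is injected so callers decide how to sleep.
pub fn retry_sketch<S, W>(mut send: S, mut wait: W, max_retries: u32) -> u16
where
    S: FnMut() -> u16,
    W: FnMut(Duration),
{
    let mut attempt = 0;
    loop {
        let status = send();
        if !is_retryable(status) || attempt >= max_retries {
            return status;
        }
        wait(backoff_delay(
            attempt,
            Duration::from_millis(500),
            Duration::from_secs(30),
        ));
        attempt += 1;
    }
}
```

Passing `std::thread::sleep` as `wait` gives blocking behavior; an async runtime would substitute its own timer.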

Configure thinking via `thinking_level` (categorical) or `thinking_budget` (token count). Both fields are optional and apply only to Gemini 2.5+ models.
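
The special values of `thinking_budget` listed in the configuration comments (`-1` for dynamic, `0` for off) can be modeled as a small enum. This is a hedged sketch: the `ThinkingBudget` type and `parse_thinking_budget` function are hypothetical, and rejecting values below `-1` is an assumption rather than documented behavior.

```rust
/// Hypothetical interpretation of the `thinking_budget` config value.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ThinkingBudget {
    Off,         // thinking_budget = 0
    Dynamic,     // thinking_budget = -1
    Tokens(u32), // any positive token count
}

pub fn parse_thinking_budget(raw: i32) -> Result<ThinkingBudget, String> {
    match raw {
        0 => Ok(ThinkingBudget::Off),
        -1 => Ok(ThinkingBudget::Dynamic),
        // A positive i32 always fits in u32, so the cast is lossless.
        n if n > 0 => Ok(ThinkingBudget::Tokens(n as u32)),
        // Rejecting other negative values is an assumption of this sketch.
        n => Err(format!("invalid thinking_budget: {}", n)),
    }
}
```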

### Streaming Tool Use

Gemini delivers `functionCall` parts as complete objects within a single SSE event (not incrementally chunked). The SSE parser collects all `functionCall` parts from the event's `parts` array and emits a single `StreamChunk::ToolUse` with all tool calls. When an event contains both text and function call parts, tool calls take priority and any text in that event is dropped (matching the non-streaming behavior).
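
The priority rule can be sketched as follows. The `Part`, `ToolCall`, and `StreamChunk` definitions here are simplified stand-ins (only the `ToolUse` variant name comes from the text above; `Content` and the field names are assumptions), so read this as an illustration of the rule rather than the parser itself.

```rust
// Simplified stand-ins for the parts of one SSE event.
#[derive(Debug, Clone, PartialEq)]
pub enum Part {
    Text(String),
    FunctionCall { name: String, args_json: String },
}

#[derive(Debug, Clone, PartialEq)]
pub struct ToolCall {
    pub name: String,
    pub args_json: String,
}

#[derive(Debug, Clone, PartialEq)]
pub enum StreamChunk {
    Content(String), // hypothetical name for a plain text chunk
    ToolUse(Vec<ToolCall>),
}

/// Turn the parts of a single event into at most one chunk.
pub fn chunk_from_event(parts: &[Part]) -> Option<StreamChunk> {
    // Collect every function call in the event, preserving order.
    let calls: Vec<ToolCall> = parts
        .iter()
        .filter_map(|p| match p {
            Part::FunctionCall { name, args_json } => Some(ToolCall {
                name: name.clone(),
                args_json: args_json.clone(),
            }),
            Part::Text(_) => None,
        })
        .collect();
    // Tool calls take priority: any text in the same event is dropped.
    if !calls.is_empty() {
        return Some(StreamChunk::ToolUse(calls));
    }
    let text: String = parts
        .iter()
        .filter_map(|p| match p {
            Part::Text(t) => Some(t.as_str()),
            Part::FunctionCall { .. } => None,
        })
        .collect();
    if text.is_empty() {
        None
    } else {
        Some(StreamChunk::Content(text))
    }
}
```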

Streaming tool use is available on all Gemini models that support function calling, including Gemini 2.0 Flash.

## Switching Providers

Change the `type` field in the `[[llm.providers]]` entry. All skills, memory, and tools work the same regardless of which provider is active.

```toml
[llm]
[[llm.providers]]
type = "claude"   # ollama, claude, openai, gemini, candle, compatible
model = "claude-sonnet-4-6"
```

At runtime, use the `/provider <name>` command to switch providers:

```
> /provider claude
Switched to Claude (claude-sonnet-4-6)
```

The chosen provider is now the active provider for this channel. On the next session start, Zeph automatically restores the last-used provider for that channel.

### Provider Persistence

Zeph remembers the last provider you used per channel (CLI, TUI, Telegram). When you restart or switch channels, your preferred provider is restored automatically:

- **CLI/TUI**: last provider is saved globally (both share the same `channel_id = ""`)
- **Telegram**: last provider is saved per chat (when configured with per-chat wiring)

Persistence is enabled by default. To set it explicitly:

```toml
[session]
provider_persistence = true     # default: enabled
```

Disable it to always start with the default provider:

```toml
[session]
provider_persistence = false
```

Provider preferences are stored in SQLite alongside session metadata. If you switch providers and the session crashes before a successful turn, the previous provider preference is restored on the next session start.
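
One way to get this crash-safe behavior is to persist a switch only after the first successful turn. The sketch below models that idea with two in-memory maps. It is an interpretation of the description above, not Zeph's storage code: `ProviderPrefs`, its methods, and the use of a `HashMap` in place of the SQLite table are all assumptions.

```rust
use std::collections::HashMap;

/// Hypothetical model of per-channel provider preferences.
#[derive(Default)]
pub struct ProviderPrefs {
    saved: HashMap<String, String>,   // stands in for the SQLite table
    pending: HashMap<String, String>, // in-memory only; lost on a crash
}

impl ProviderPrefs {
    /// `/provider <name>`: remember the switch, but do not persist it yet.
    pub fn switch(&mut self, channel_id: &str, provider: &str) {
        self.pending
            .insert(channel_id.to_string(), provider.to_string());
    }

    /// After a successful turn, commit the pending switch for this channel.
    pub fn on_successful_turn(&mut self, channel_id: &str) {
        if let Some(provider) = self.pending.remove(channel_id) {
            self.saved.insert(channel_id.to_string(), provider);
        }
    }

    /// A crash discards everything that was never committed.
    pub fn simulate_crash(&mut self) {
        self.pending.clear();
    }

    /// On session start, use the saved preference or fall back to the default.
    pub fn restore<'a>(&'a self, channel_id: &str, default_provider: &'a str) -> &'a str {
        self.saved
            .get(channel_id)
            .map(String::as_str)
            .unwrap_or(default_provider)
    }
}
```

In this model the CLI and TUI would both use the empty string as the channel key, matching the shared `channel_id = ""` noted above.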

## Response Caching

Enable SQLite-backed response caching to avoid redundant LLM calls for identical requests. The cache key is a blake3 hash of the full message history and model name. Streaming responses bypass the cache.

```toml
[llm]
response_cache_enabled = true
response_cache_ttl_secs = 3600  # 1 hour (default)
```
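
To make the key and TTL semantics concrete, here is a minimal in-memory sketch. It deliberately swaps in the standard library's `DefaultHasher` for blake3 and a `HashMap` for SQLite so that it stays dependency-free; the `cache_key` function and `ResponseCache` type are hypothetical names, and timestamps are passed in explicitly rather than read from the clock.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Derive a cache key from the model name and the full (role, text) history.
/// DefaultHasher is a stand-in for blake3 in this sketch.
pub fn cache_key(model: &str, history: &[(String, String)]) -> u64 {
    let mut hasher = DefaultHasher::new();
    model.hash(&mut hasher);
    history.hash(&mut hasher);
    hasher.finish()
}

/// In-memory stand-in for the SQLite-backed cache.
pub struct ResponseCache {
    ttl_secs: u64,
    entries: HashMap<u64, (u64, String)>, // key -> (expires_at, response)
}

impl ResponseCache {
    pub fn new(ttl_secs: u64) -> Self {
        ResponseCache { ttl_secs, entries: HashMap::new() }
    }

    pub fn put(&mut self, key: u64, response: String, now_secs: u64) {
        self.entries.insert(key, (now_secs + self.ttl_secs, response));
    }

    /// Return the cached response only while it is younger than the TTL.
    pub fn get(&self, key: u64, now_secs: u64) -> Option<&str> {
        match self.entries.get(&key) {
            Some((expires_at, response)) if now_secs < *expires_at => Some(response.as_str()),
            _ => None,
        }
    }
}
```

Because the key covers the whole history, appending a single message produces a different key, so only exact repeats of a request hit the cache.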

See [Memory and Context — LLM Response Cache](memory.md#llm-response-cache) for details.

## Per-Subsystem Embedding Providers

Every subsystem that generates vector embeddings has its own `embed_provider` or `embedding_provider` config field. Pointing these at a dedicated embedding provider (e.g., a local Ollama model) prevents embedding requests from saturating the chat provider's connection pool or triggering guardrails.

| Config field | Subsystem |
|---|---|
| `[memory.semantic] embed_provider` | Semantic memory — stores and retrieves conversation embeddings |
| `[skills] embedding_provider` | Skill matcher — finds relevant skills by embedding similarity |
| `[skills.mining] embedding_provider` | Skill mining — deduplicates candidate skills during self-learning |
| `[index] embed_provider` | Code indexer — embeds AST chunks for RAG retrieval |
| `[mcp.tool_discovery] embedding_provider` | MCP tool registry — indexes discovered tools by description |

When a field is empty or omitted, the subsystem falls back to the agent's primary LLM provider. For deployments using Claude (which does not support embeddings) or any cloud provider where embedding volume is significant, set all five fields to a dedicated embedding provider:

```toml
[[llm.providers]]
name = "embed"
type = "ollama"
model = "nomic-embed-text"
embed = true

[memory.semantic]
embed_provider = "embed"

[skills]
embedding_provider = "embed"

[skills.mining]
embedding_provider = "embed"

[index]
embed_provider = "embed"

[mcp.tool_discovery]
embedding_provider = "embed"
```

This ensures that a burst of embedding requests (e.g., during code indexing or skill hot-reload) does not compete with ongoing chat inference.
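
The fallback rule (empty or omitted field means "use the primary provider") amounts to a one-line resolution step per subsystem. The helper below is a hypothetical sketch of that rule, not a function from Zeph's codebase.

```rust
/// Pick the embedding provider for a subsystem: the configured name if it is
/// present and non-empty, otherwise the agent's primary provider.
pub fn resolve_embed_provider<'a>(configured: Option<&'a str>, primary: &'a str) -> &'a str {
    match configured {
        Some(name) if !name.is_empty() => name,
        _ => primary,
    }
}
```

With the config above, every subsystem resolves to `"embed"`; removing a field from one table sends that subsystem's requests back to the primary provider.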

## Next Steps

- [Use a Cloud Provider](../guides/cloud-provider.md): Claude, OpenAI, and compatible API setup
- [Model Orchestrator](../advanced/orchestrator.md): multi-provider routing with fallback chains
- [Adaptive Inference](../advanced/adaptive-inference.md): Thompson Sampling and EMA-based provider routing
- [SkillOrchestra](../advanced/skill-orchestra.md): RL-based adaptive routing that learns from execution outcomes
- [Local Inference (Candle)](../advanced/candle.md): HuggingFace GGUF models