# Semantic Memory
Enable semantic search to retrieve contextually relevant messages from conversation history using vector similarity.
Semantic memory requires an embedding model; Ollama with `qwen3-embedding` is the default. The Claude API does not support embeddings natively, so use the [orchestrator](../advanced/orchestrator.md) to route embeddings through Ollama while using Claude for chat.
## Vector Backend
Zeph supports two vector backends for storing embeddings:
| Backend | Best for | External dependency |
|---------|----------|---------------------|
| `qdrant` (default) | Production, multi-user, large datasets | Qdrant server |
| `sqlite` | Development, single-user, offline, quick setup | None |
The `sqlite` backend stores vectors in the same SQLite database as conversation history and performs cosine similarity search in-process. It requires no external services, making it ideal for local development and single-user deployments.
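To make "cosine similarity search in-process" concrete, here is a minimal sketch of a brute-force scan over stored vectors. It is an illustration only, not Zeph's implementation: the function names and the `(message_id, vector)` tuple layout are assumptions made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, stored, k=5):
    """Score every stored vector against the query and keep the k best.

    `stored` is a hypothetical list of (message_id, vector) pairs.
    """
    scored = [(message_id, cosine_similarity(query, vec)) for message_id, vec in stored]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

A linear scan like this costs time proportional to the number of stored vectors, which is one reason an in-process backend suits single-user datasets better than large ones.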
## Setup with SQLite Backend (Quickstart)
No external services needed:
```toml
[memory]
vector_backend = "sqlite"
[memory.semantic]
enabled = true
recall_limit = 5
```
The vector tables are created automatically via migration `011_vector_store.sql`.
## Setup with Qdrant Backend
1. **Start Qdrant:**
```bash
docker compose up -d qdrant
```
2. **Enable semantic memory in config:**
```toml
[memory]
vector_backend = "qdrant" # default, can be omitted
[memory.semantic]
enabled = true
recall_limit = 5
```
3. **Automatic setup:** The Qdrant collection (`zeph_conversations`) is created automatically on first use with the correct vector dimensions (1024 for `qwen3-embedding`) and the Cosine distance metric. No manual initialization is required.
## How It Works
- **Hybrid search:** Recall uses both Qdrant vector similarity and SQLite FTS5 keyword search, merging results with configurable weights. This improves recall quality, especially for queries that depend on exact term matches.
- **Automatic embedding:** Messages are embedded asynchronously using the configured `embedding_model` and stored in Qdrant alongside SQLite.
- **FTS5 index:** All messages are automatically indexed in an SQLite FTS5 virtual table via triggers, enabling BM25-ranked keyword search with zero configuration.
- **Graceful degradation:** If Qdrant is unavailable, Zeph falls back to FTS5-only keyword search instead of returning empty results.
- **Startup backfill:** On startup, if Qdrant is available, Zeph calls `embed_missing()` to backfill embeddings for any messages stored while Qdrant was offline.
## Hybrid Search Weights
Configure the balance between vector (semantic) and keyword (BM25) search:
```toml
[memory.semantic]
enabled = true
recall_limit = 5
vector_weight = 0.7 # Weight for Qdrant vector similarity
keyword_weight = 0.3 # Weight for FTS5 keyword relevance
```
When Qdrant is unavailable, only keyword search runs (effectively `keyword_weight = 1.0`).
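The weighted merge and the keyword-only fallback can be sketched roughly as follows. This is a simplified illustration, not Zeph's actual code: `merge_hybrid` is a hypothetical name, and it assumes both score sets have already been normalized to the range 0 to 1 (the normalization step is not described in this section).

```python
def merge_hybrid(vector_hits, keyword_hits,
                 vector_weight=0.7, keyword_weight=0.3, limit=5):
    """Blend two score maps (message_id -> normalized score) into one ranking.

    Pass vector_hits=None to model an unavailable vector backend; the
    function then ranks by keyword score alone (keyword weight of 1.0).
    """
    if vector_hits is None:
        vector_hits, vector_weight, keyword_weight = {}, 0.0, 1.0

    combined = {}
    for message_id, score in vector_hits.items():
        combined[message_id] = combined.get(message_id, 0.0) + vector_weight * score
    for message_id, score in keyword_hits.items():
        combined[message_id] = combined.get(message_id, 0.0) + keyword_weight * score

    ranked = sorted(combined.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:limit]
```

A message found by both searches accumulates both weighted contributions, so it can outrank a message that scores well on only one side.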
## Temporal Decay
Enable time-based score attenuation to prefer recent context over stale information:
```toml
[memory.semantic]
temporal_decay_enabled = true
temporal_decay_half_life_days = 30 # Score halves every 30 days
```
Scores decay exponentially: at 1 half-life a message retains 50% of its original score, at 2 half-lives 25%, and so on. Adjust `temporal_decay_half_life_days` based on how quickly your project context changes.
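The half-life behavior described above corresponds to multiplying the score by 0.5 raised to the number of elapsed half-lives. A minimal sketch, where `decayed_score` is a hypothetical helper name rather than a Zeph function:

```python
def decayed_score(score, age_days, half_life_days=30.0):
    """Attenuate a recall score exponentially with message age.

    After one half-life the score is halved, after two it is quartered.
    """
    return score * 0.5 ** (age_days / half_life_days)
```

With the default 30-day half-life, a message that is 90 days old keeps 12.5% of its original score, so a shorter half-life makes sense for fast-moving projects.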
## MMR Re-ranking
Enable Maximal Marginal Relevance to diversify recall results and reduce redundancy:
```toml
[memory.semantic]
mmr_enabled = true
mmr_lambda = 0.7 # 0.0 = max diversity, 1.0 = pure relevance
```
MMR iteratively selects results that are both relevant to the query and dissimilar to already-selected items. The default `mmr_lambda = 0.7` works well for most use cases. Lower it if you see too many semantically similar results in recall.
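The following sketch shows the standard greedy MMR selection loop to illustrate how `mmr_lambda` trades relevance against diversity. It is a generic textbook formulation, not Zeph's code; the function names and the `(id, vector)` candidate layout are assumptions for this example.

```python
import math


def _cos(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mmr_select(query_vec, candidates, k=5, mmr_lambda=0.7):
    """Greedily pick up to k candidate ids, balancing relevance and novelty.

    `candidates` is a hypothetical list of (id, vector) pairs.
    """
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < k:
        best, best_score = None, -math.inf
        for item in remaining:
            relevance = _cos(query_vec, item[1])
            # Redundancy is the similarity to the closest already-selected item.
            redundancy = max((_cos(item[1], s[1]) for s in selected), default=0.0)
            score = mmr_lambda * relevance - (1.0 - mmr_lambda) * redundancy
            if score > best_score:
                best, best_score = item, score
        selected.append(best)
        remaining.remove(best)
    return [item[0] for item in selected]
```

With two near-duplicate candidates and one distinct candidate, `mmr_lambda = 1.0` returns both near-duplicates first, while a low value such as 0.3 swaps the second near-duplicate for the distinct one.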
## Autosave Assistant Responses
By default, only user messages are embedded. Enable `autosave_assistant` to also embed assistant responses for richer semantic recall:
```toml
[memory]
autosave_assistant = true
autosave_min_length = 20 # Skip embedding for very short replies
```
Short responses (below `autosave_min_length` bytes) are still saved to SQLite but skip the embedding step. User messages always generate embeddings regardless of this setting.
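The embedding decision described above can be summarized as a small predicate. This is a sketch of the stated rules only: `should_embed` is a hypothetical name, and the handling of roles other than `user` and `assistant` is an assumption, since this section does not cover them.

```python
def should_embed(role, content, autosave_assistant=False, autosave_min_length=20):
    """Return True if a message should be sent to the embedding model."""
    if role == "user":
        # User messages are always embedded, regardless of autosave settings.
        return True
    if role == "assistant":
        if not autosave_assistant:
            return False
        # The threshold is measured in bytes, so encode before counting.
        return len(content.encode("utf-8")) >= autosave_min_length
    # Assumption: other roles (system, tool) are not embedded.
    return False
```

Because the limit is in bytes rather than characters, non-ASCII replies reach the threshold with fewer visible characters.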
## Memory Export and Import
Back up or migrate conversation data with portable JSON snapshots:
```bash
zeph memory export conversations.json
zeph memory import conversations.json
```
See [CLI Reference — `zeph memory`](../reference/cli.md#zeph-memory) for details.
## Semantic Response Caching
Complement exact-match response caching with embedding-based similarity matching:
```toml
[llm]
response_cache_enabled = true
semantic_cache_enabled = true # Enable semantic cache (default: false)
semantic_cache_threshold = 0.95 # Cosine similarity for cache hit (default: 0.95)
semantic_cache_max_candidates = 10 # Max entries examined per lookup (default: 10)
```
Lower the threshold (e.g., 0.92) for more cache hits with slightly less precise matching. Increase `semantic_cache_max_candidates` for better recall at the cost of lookup latency.
## Write-Time Importance Scoring
Score messages by decision-relevance at write time to improve recall quality:
```toml
[memory.semantic]
importance_enabled = true # Enable importance scoring (default: false)
importance_weight = 0.15 # Blend weight in recall ranking (default: 0.15)
```
Messages with high importance scores (architectural decisions, key constraints, user preferences) receive a recall boost proportional to `importance_weight`. The score is computed by an LLM classifier at message persist time and stored in the `importance_score` column (migration 039).
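One plausible way to read "a recall boost proportional to `importance_weight`" is an additive blend, sketched below. The exact formula Zeph uses is not given in this section, so treat both the function name `blend_importance` and the additive form as assumptions.

```python
def blend_importance(base_score, importance_score, importance_weight=0.15):
    """Add an importance boost to a recall score.

    Assumes importance_score is in [0, 1]; the boost is
    importance_weight * importance_score.
    """
    return base_score + importance_weight * importance_score
```

Under this reading, a maximally important message gains at most `importance_weight` over an otherwise identical message with zero importance, so the default of 0.15 nudges the ranking without overriding similarity.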
## SleepGate: Automatic Forgetting
Over time, the vector index accumulates stale embeddings. Enable SleepGate to periodically remove low-value entries:
```toml
[memory.forgetting]
enabled = true
interval_secs = 86400 # Run every 24 hours (default)
retention_threshold = 0.30 # Score below which entries are forgotten (default: 0.30)
```
SleepGate scores entries on recency, access frequency, and semantic density. A built-in compression predictor preserves load-bearing entries even if their retention score is low.
Forgotten entries are soft-deleted — removed from the vector index but retained in SQLite for potential restoration.
See [SleepGate](../advanced/sleep-gate.md) for tuning guidelines and interaction with other memory features.
## Storage Architecture
| Store | Role |
|-------|------|
| SQLite | Source of truth for message text, conversations, summaries, skill usage |
| Qdrant or SQLite vectors | Vector index for semantic similarity search (embeddings only) |
Both stores work together: SQLite holds the data, the vector backend enables similarity search over it. With the Qdrant backend, the `embeddings_metadata` table in SQLite maps message IDs to Qdrant point IDs. With the SQLite backend, vectors are stored directly in `vector_points` and `vector_point_payloads` tables.
The `messages` table includes `agent_visible`, `user_visible`, and `compacted_at` columns (migration `013_message_metadata.sql`) plus an index on `conversation_id`. Semantic recall and FTS5 keyword search filter by `agent_visible=1`, ensuring compacted messages are excluded from retrieval results.