zeph 0.18.3

Lightweight AI agent with hybrid inference, skills-first architecture, and multi-channel I/O
# Memory and Context

Zeph uses a dual-store memory system: SQLite for structured conversation history and a configurable vector backend (Qdrant or embedded SQLite) for semantic search across past sessions.

## Conversation History

All messages are stored in SQLite. The CLI channel provides persistent input history with arrow-key navigation, prefix search, and Emacs keybindings. History persists across restarts.

When conversations grow long, Zeph compacts history automatically using a two-tier strategy. The soft tier fires at `soft_compaction_threshold` (default 0.70): it prunes tool outputs and applies pre-computed deferred summaries without an LLM call. The hard tier fires at `hard_compaction_threshold` (default 0.90): it runs full LLM-based chunked compaction. Compaction uses dual-visibility flags on each message: original messages are marked `agent_visible=false` (hidden from the LLM) while remaining `user_visible=true` (preserved in UI). A summary is inserted as `agent_visible=true, user_visible=false` — visible to the LLM but hidden from the user. This is performed atomically via `replace_conversation()` in SQLite. The result: the user retains full scroll-back history while the LLM operates on a compact context.
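
The tier selection described above can be sketched as follows. This is a minimal illustration, assuming the thresholds are fractions of the context budget in use; `CompactionTier` and `select_tier` are hypothetical names, not Zeph's API.

```rust
/// Which compaction tier (if any) applies at the current context usage.
#[derive(Debug, PartialEq)]
enum CompactionTier {
    /// Below the soft threshold: nothing to do.
    Skip,
    /// Prune tool outputs and apply deferred summaries (no LLM call).
    Soft,
    /// Full LLM-based chunked compaction.
    Hard,
}

/// Hypothetical helper: picks a tier from the fraction of the budget in use.
/// Assumption: a budget of 0 means no threshold can fire.
fn select_tier(used_tokens: usize, budget_tokens: usize, soft: f64, hard: f64) -> CompactionTier {
    if budget_tokens == 0 {
        return CompactionTier::Skip;
    }
    let usage = used_tokens as f64 / budget_tokens as f64;
    if usage >= hard {
        CompactionTier::Hard
    } else if usage >= soft {
        CompactionTier::Soft
    } else {
        CompactionTier::Skip
    }
}
```

With the default thresholds, a context at 75% of its budget would land in the soft tier and one at 95% in the hard tier.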

## Semantic Memory

With semantic memory enabled, messages are embedded as vectors for similarity search. Ask "what did we discuss about the API yesterday?" and Zeph retrieves relevant context from past sessions automatically. Both vector similarity and keyword (FTS5) search respect visibility boundaries — only `agent_visible=true` messages are indexed and returned, so compacted originals never appear in recall results.

Two vector backends are available:

| Backend | Use case | Dependency |
|---------|----------|------------|
| `qdrant` (default) | Production, large datasets | External Qdrant server |
| `sqlite` | Development, single-user, offline | None (embedded) |

Semantic memory uses hybrid search — vector similarity combined with SQLite FTS5 keyword search — to improve recall quality. When the vector backend is unavailable, Zeph falls back to keyword-only search.

### Result Quality: MMR and Temporal Decay

Two post-processing stages improve recall quality beyond raw similarity:

- **Temporal decay** attenuates scores based on message age. A configurable half-life (default: 30 days) ensures recent context is preferred over stale information. Scores decay exponentially: a message at 1 half-life gets 50% weight, at 2 half-lives 25%, etc.
- **MMR re-ranking** (Maximal Marginal Relevance) reduces redundancy in results by penalizing candidates too similar to already-selected items. The `mmr_lambda` parameter (default: 0.7) controls the relevance-diversity trade-off: higher values favor relevance, lower values favor diversity.

Both are disabled by default. Enable them in `[memory.semantic]`:

```toml
[memory.semantic]
enabled = true
recall_limit = 5
temporal_decay_enabled = true
temporal_decay_half_life_days = 30
mmr_enabled = true
mmr_lambda = 0.7
```
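
To make the two stages concrete, here is a minimal sketch of exponential half-life decay and greedy MMR selection. It uses the textbook MMR formula (`lambda * relevance - (1 - lambda) * max_similarity_to_selected`); whether Zeph uses exactly this form is an assumption, and the function names are illustrative only.

```rust
/// Exponential decay: the weight halves every `half_life_days`.
fn temporal_decay(score: f64, age_days: f64, half_life_days: f64) -> f64 {
    score * 0.5_f64.powf(age_days / half_life_days)
}

/// Cosine similarity between two equal-length vectors (0.0 for zero vectors).
fn cosine(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Greedy MMR: returns the indices of up to `k` candidates.
/// `relevance[i]` is candidate i's similarity to the query,
/// `embeddings[i]` is its vector.
fn mmr_select(relevance: &[f64], embeddings: &[Vec<f64>], lambda: f64, k: usize) -> Vec<usize> {
    let mut selected: Vec<usize> = Vec::new();
    let mut remaining: Vec<usize> = (0..relevance.len()).collect();
    while selected.len() < k && !remaining.is_empty() {
        let mut best_pos = 0;
        let mut best_score = f64::NEG_INFINITY;
        for (pos, &i) in remaining.iter().enumerate() {
            // Highest similarity to anything already selected (0.0 if none yet).
            let max_sim = selected
                .iter()
                .map(|&j| cosine(&embeddings[i], &embeddings[j]))
                .fold(0.0_f64, f64::max);
            let score = lambda * relevance[i] - (1.0 - lambda) * max_sim;
            if score > best_score {
                best_score = score;
                best_pos = pos;
            }
        }
        selected.push(remaining.remove(best_pos));
    }
    selected
}
```

With `lambda = 0.7`, a near-duplicate of an already selected result is penalized enough that a less relevant but different candidate can overtake it; with `lambda = 1.0` the ranking reduces to plain relevance order.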

### Quick Setup

Embedded SQLite vectors (no external dependencies):

```toml
[memory]
vector_backend = "sqlite"

[memory.semantic]
enabled = true
recall_limit = 5
```

Qdrant (production):

```toml
[memory]
vector_backend = "qdrant"  # default

[memory.semantic]
enabled = true
recall_limit = 5
```

See [Set Up Semantic Memory](../guides/semantic-memory.md) for the full setup guide.

## Cross-Session History Restore

When a session is resumed, Zeph restores previous message history from SQLite. The restore pipeline applies `sanitize_tool_pairs()` to ensure every `ToolUse` message has a matching `ToolResult`. Orphaned `ToolUse` or `ToolResult` parts at session boundaries — caused by session interruptions or compaction boundary splits — are detected and stripped before the history reaches the LLM. This prevents Claude API 400 errors that occur when the API receives unmatched tool call pairs.
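
The idea behind orphan stripping can be sketched as below. This is a simplified illustration, not the real `sanitize_tool_pairs()`: the `Part` enum and `strip_orphans` are hypothetical, and pairs are matched by id only, ignoring message order.

```rust
use std::collections::HashSet;

/// Simplified message part; the real types are assumed to be richer.
#[derive(Debug, Clone, PartialEq)]
enum Part {
    Text(String),
    ToolUse { id: String },
    ToolResult { tool_use_id: String },
}

/// Drops every ToolUse without a matching ToolResult, and vice versa.
fn strip_orphans(parts: Vec<Part>) -> Vec<Part> {
    // Collect the ids seen on each side of the pair.
    let uses: HashSet<String> = parts
        .iter()
        .filter_map(|p| match p {
            Part::ToolUse { id } => Some(id.clone()),
            _ => None,
        })
        .collect();
    let results: HashSet<String> = parts
        .iter()
        .filter_map(|p| match p {
            Part::ToolResult { tool_use_id } => Some(tool_use_id.clone()),
            _ => None,
        })
        .collect();
    // Keep text untouched; keep tool parts only when the other half exists.
    parts
        .into_iter()
        .filter(|p| match p {
            Part::ToolUse { id } => results.contains(id),
            Part::ToolResult { tool_use_id } => uses.contains(tool_use_id),
            Part::Text(_) => true,
        })
        .collect()
}
```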

## Context Engineering

Token counts throughout the context pipeline are computed by `TokenCounter` — a shared BPE tokenizer (`cl100k_base`) with a DashMap cache. This replaced the previous `chars / 4` heuristic and provides accurate budget allocation, especially for non-ASCII content and tool schemas. See [Token Efficiency — Token Counting](../architecture/token-efficiency.md#token-counting) for implementation details.

When `context_budget_tokens` is set (default: 0 = unlimited), Zeph allocates the context window proportionally:

| Allocation | Share | Purpose |
|-----------|-------|---------|
| Summaries | 15% | Compressed conversation history |
| Semantic recall | 25% | Relevant messages from past sessions |
| Recent history | 60% | Most recent messages in current conversation |
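
As a worked example of the split in the table, the sketch below divides a budget into the three shares. `Allocation` and `allocate` are hypothetical names, and giving the rounding remainder to recent history is an assumption of this sketch; other allocations such as graph facts are not modeled.

```rust
#[derive(Debug, PartialEq)]
struct Allocation {
    summaries: usize,
    semantic_recall: usize,
    recent_history: usize,
}

/// Splits a token budget 15% / 25% / 60%. Returns None when the budget is 0 (unlimited).
fn allocate(budget_tokens: usize) -> Option<Allocation> {
    if budget_tokens == 0 {
        return None;
    }
    let summaries = budget_tokens * 15 / 100;
    let semantic_recall = budget_tokens * 25 / 100;
    // Assumption: recent history receives the remainder so rounding loses no tokens.
    let recent_history = budget_tokens - summaries - semantic_recall;
    Some(Allocation { summaries, semantic_recall, recent_history })
}
```

For `context_budget_tokens = 8000`, this yields 1200 tokens for summaries, 2000 for semantic recall, and 4800 for recent history.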

A two-tier pruning system manages overflow:

1. **Tool output pruning** (cheap) — replaces old tool outputs with short placeholders
2. **Chunked LLM compaction** (fallback) — splits middle messages into ~4096-token chunks, summarizes them in parallel (up to 4 concurrent LLM calls), then merges partial summaries. Falls back to single-pass if any chunk fails.

Both tiers run automatically. See [Context Engineering](../advanced/context.md) for tuning options.

## Project Context

Drop a `ZEPH.md` file in your project root and Zeph discovers it automatically. Project-specific instructions are included in every prompt as a `<project_context>` block. Zeph walks up the directory tree looking for `ZEPH.md`, `ZEPH.local.md`, or `.zeph/config.md`.

## Embeddable Trait and EmbeddingRegistry

The `Embeddable` trait provides a generic interface for any type that can be embedded in Qdrant. It requires `id()`, `content_for_embedding()`, `content_hash()`, and `to_payload()` methods. `EmbeddingRegistry<T: Embeddable>` is a generic sync/search engine that delta-syncs items by BLAKE3 content hash and performs cosine similarity search. This pattern is used internally by skill matching, MCP tool registry, and code indexing.
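
The delta-sync idea can be illustrated with a small sketch. The method names come from the text above, but the signatures are guesses, `to_payload()` is omitted, and the standard library's `DefaultHasher` stands in for BLAKE3 so the example has no dependencies. `Skill` and `needs_embedding` are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Assumed shape of the trait; signatures are illustrative, `to_payload()` is left out.
trait Embeddable {
    fn id(&self) -> String;
    fn content_for_embedding(&self) -> String;
    fn content_hash(&self) -> String;
}

/// Hypothetical item type used only for this example.
struct Skill {
    name: String,
    description: String,
}

impl Embeddable for Skill {
    fn id(&self) -> String {
        self.name.clone()
    }
    fn content_for_embedding(&self) -> String {
        format!("{}: {}", self.name, self.description)
    }
    fn content_hash(&self) -> String {
        // Stand-in hash; the text above says the real registry uses BLAKE3.
        let mut hasher = DefaultHasher::new();
        self.content_for_embedding().hash(&mut hasher);
        format!("{:016x}", hasher.finish())
    }
}

/// Returns the ids whose content hash is new or changed relative to `stored`
/// (a map from item id to the hash recorded at the last sync).
fn needs_embedding<T: Embeddable>(items: &[T], stored: &HashMap<String, String>) -> Vec<String> {
    items
        .iter()
        .filter(|item| stored.get(&item.id()) != Some(&item.content_hash()))
        .map(|item| item.id())
        .collect()
}
```

Only the returned ids would need a fresh embedding call; unchanged items are skipped, which is the point of syncing by content hash.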

## Credential Scrubbing

When `memory.redact_credentials` is enabled (default: `true`), Zeph scrubs credential patterns from message content before sending it to the LLM context pipeline. This prevents accidental leakage of API keys, tokens, and passwords stored in conversation history. The scrubbing runs via `scrub_content()` in the context builder and covers the same patterns as the output redaction system (see [Security — Secret Redaction](../reference/security.md#secret-redaction)).

## Autosave Assistant Responses

By default, only user messages generate vector embeddings. Enable `autosave_assistant` to persist assistant responses to SQLite and optionally embed them for semantic recall:

```toml
[memory]
autosave_assistant = true    # Save assistant messages (default: false)
autosave_min_length = 20     # Minimum content length for embedding (default: 20)
```

When enabled, assistant responses shorter than `autosave_min_length` are saved to SQLite without generating an embedding (via `save_only()`). Responses meeting the threshold go through the full embedding pipeline. User messages always generate embeddings regardless of this setting.
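
The embedding decision above reduces to a small predicate. A minimal sketch, assuming the length is measured in characters; `Role` and `should_embed` are hypothetical names.

```rust
#[derive(Debug, Clone, Copy)]
enum Role {
    User,
    Assistant,
}

/// Decides whether a message goes through the embedding pipeline.
/// Assumption: `min_len` is compared against the character count.
fn should_embed(role: Role, content: &str, autosave_assistant: bool, min_len: usize) -> bool {
    match role {
        // User messages always generate embeddings.
        Role::User => true,
        // Assistant messages need autosave enabled and enough content.
        Role::Assistant => autosave_assistant && content.chars().count() >= min_len,
    }
}
```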

## Memory Snapshots

Export and import conversation history as portable JSON files for backup, migration, or sharing between instances.

```bash
# Export all conversations, messages, and summaries
zeph memory export backup.json

# Import into another instance (duplicates are skipped)
zeph memory import backup.json
```

The snapshot format (version 1) includes conversations, messages with multipart content, and summaries. Import uses `INSERT OR IGNORE` semantics — existing messages with matching IDs are skipped, so importing the same file twice is safe.

## LLM Response Cache

Cache identical LLM requests to avoid redundant API calls. The cache is SQLite-backed and keyed by a BLAKE3 hash of the message history and model name.

```toml
[llm]
response_cache_enabled = true   # Enable response caching (default: false)
response_cache_ttl_secs = 3600  # Cache entry lifetime in seconds (default: 3600)

[memory]
response_cache_cleanup_interval_secs = 3600  # Interval for purging expired cache entries (default: 3600)
```

A periodic background task purges expired entries. The cleanup interval is configurable via `[memory] response_cache_cleanup_interval_secs` (default: 3600 seconds). Streaming responses bypass the cache entirely — only non-streaming completions are cached.

### Semantic Response Caching

In addition to exact-match caching, Zeph supports embedding-based similarity matching for cache lookups. When `semantic_cache_enabled = true`, the system embeds incoming message context and searches for cached responses with cosine similarity above `semantic_cache_threshold` (default: 0.95). This allows cache hits even when messages are paraphrased or slightly different.

```toml
[llm]
response_cache_enabled = true
semantic_cache_enabled = true          # Enable semantic similarity matching (default: false)
semantic_cache_threshold = 0.95        # Cosine similarity threshold for cache hit (default: 0.95)
semantic_cache_max_candidates = 10     # Max entries to examine per lookup (default: 10)
```

The threshold controls the tradeoff between hit rate and relevance: lower values (for example, 0.92) produce more hits but risk returning less relevant cached responses, while higher values (for example, 0.98) are more conservative. `semantic_cache_max_candidates` controls how many entries are examined per query; raising it to 50 or more improves recall at the cost of latency.
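
A threshold-gated lookup of this kind might look like the sketch below. `CacheEntry` and `semantic_lookup` are hypothetical, and examining the first `max_candidates` entries in order is an assumption; how candidates are actually chosen is not specified here.

```rust
/// Hypothetical cache entry: the embedded request context and its cached response.
struct CacheEntry {
    embedding: Vec<f64>,
    response: String,
}

/// Cosine similarity between two equal-length vectors (0.0 for zero vectors).
fn cosine(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Returns the best cached response whose similarity reaches `threshold`,
/// looking at no more than `max_candidates` entries.
fn semantic_lookup<'a>(
    query: &[f64],
    entries: &'a [CacheEntry],
    threshold: f64,
    max_candidates: usize,
) -> Option<&'a str> {
    entries
        .iter()
        .take(max_candidates)
        .map(|e| (cosine(query, &e.embedding), e))
        .filter(|(sim, _)| *sim >= threshold)
        .max_by(|a, b| a.0.total_cmp(&b.0))
        .map(|(_, e)| e.response.as_str())
}
```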

## Write-Time Importance Scoring

When `importance_enabled = true`, each message receives an importance score (0.0-1.0) at write time. The score is computed by an LLM classifier that evaluates how decision-relevant the message content is. During semantic recall, the importance score is blended with the similarity score using `importance_weight` (default: 0.15), boosting recall of architecturally significant decisions and key facts.

```toml
[memory.semantic]
importance_enabled = true         # Enable write-time importance scoring (default: false)
importance_weight = 0.15          # Blend weight for importance in recall ranking (default: 0.15)
```

The weight controls how much importance influences the final recall ranking: `0.0` disables importance entirely (pure similarity), `1.0` makes importance the dominant signal. The default `0.15` provides a subtle boost to high-importance messages without disrupting similarity-based ranking.
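
One way to read the blend is as a linear interpolation between similarity and importance. The exact formula Zeph uses is not given above, so treat this as an assumption that merely matches the described endpoints; `blended_score` is a hypothetical name.

```rust
/// Assumed blend: linear interpolation between similarity and importance.
/// weight = 0.0 gives pure similarity; weight = 1.0 gives pure importance.
fn blended_score(similarity: f64, importance: f64, weight: f64) -> f64 {
    let w = weight.clamp(0.0, 1.0);
    (1.0 - w) * similarity + w * importance
}
```

Under this reading, a message with similarity 0.80 and importance 1.0 scores 0.83 at the default weight of 0.15: `0.85 * 0.80 + 0.15 * 1.0`.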

## Native Memory Tools

When a memory backend is configured, Zeph registers two native tools that the model can invoke explicitly during a conversation, in addition to automatic recall that runs at context-build time.

### `memory_search`

Searches long-term memory across three sources and returns a combined markdown result:

- **Semantic recall** — vector similarity search against past messages (same as automatic recall)
- **Key facts** — structured facts extracted and stored via `memory_save`
- **Session summaries** — summaries from other conversations, excluding the current session

The model invokes this tool when it needs to actively retrieve information rather than rely on what was injected automatically. Example: the user asks "what was the API key format we agreed on last week?" and the model has no relevant context in the current window.

**Parameters:**

| Parameter | Type | Description |
|-----------|------|-------------|
| `query` | string (required) | Natural language search query |
| `limit` | integer (optional, default 5) | Maximum number of results per source |

### `memory_save`

Persists content to long-term memory as a key fact, making it retrievable in future sessions.

The model uses this when it identifies information worth preserving explicitly — decisions, preferences, or facts the user stated that should survive context compaction. Content is validated (non-empty, max 4096 characters) before being stored via `remember()`.

**Parameters:**

| Parameter | Type | Description |
|-----------|------|-------------|
| `content` | string (required) | The information to persist (max 4096 characters) |

### Registration

`MemoryToolExecutor` is registered in the tool chain only when a memory backend is configured. If `[memory]` is absent or `[memory.semantic]` is disabled, neither tool appears in the model's tool list.

## Query-Aware Memory Routing

By default, semantic recall queries both SQLite FTS5 (keyword) and Qdrant (vector) backends and merges results via reciprocal rank fusion. Query-aware routing selects the optimal backend(s) per query, avoiding unnecessary work.

```toml
[memory.routing]
strategy = "heuristic"   # Currently the only strategy (default)
```

The heuristic router classifies queries into four routes:

| Route | Backend | When |
|-------|---------|------|
| Keyword | SQLite FTS5 | Code patterns (`::`, `/`), snake_case identifiers, short queries (<=3 words) |
| Semantic | Qdrant vectors | Question words (`what`, `how`, `why`, ...), long natural language (>=6 words) |
| Hybrid | Both + RRF merge | Medium-length queries without clear signals (4-5 words, no question word) |
| Graph | Graph store + Hybrid fallback | Relationship patterns (`related to`, `opinion on`, `connection between`, `know about`). Requires `graph-memory` feature; falls back to Hybrid when disabled |

Question words override code pattern heuristics: `"how does error_handling work"` routes Semantic, not Keyword. Relationship patterns take priority over all other heuristics: `"how is Rust related to this project"` routes Graph, not Semantic.

The agent calls `recall_routed()` on `SemanticMemory`, which delegates to the configured router before querying. When Qdrant is unavailable, Semantic-route queries return empty results; Hybrid-route queries fall back to FTS5 only.
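
The precedence rules above can be sketched as a small classifier. This is an illustration of the documented heuristics, not the actual router: `Route`, `classify`, and the helper names are hypothetical, and the question-word list is a subset because the table elides the full list.

```rust
#[derive(Debug, PartialEq)]
enum Route {
    Keyword,
    Semantic,
    Hybrid,
    Graph,
}

const RELATIONSHIP_PATTERNS: [&str; 4] =
    ["related to", "opinion on", "connection between", "know about"];
// Subset only: the full list of question words is not given in the text.
const QUESTION_WORDS: [&str; 3] = ["what", "how", "why"];

fn is_snake_case(word: &str) -> bool {
    word.contains('_')
        && word
            .chars()
            .all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '_')
}

fn looks_like_code(query: &str) -> bool {
    query.contains("::") || query.contains('/') || query.split_whitespace().any(is_snake_case)
}

fn classify(query: &str, graph_enabled: bool) -> Route {
    let lower = query.to_lowercase();
    let words: Vec<&str> = lower.split_whitespace().collect();

    // 1. Relationship patterns take priority over everything else.
    if RELATIONSHIP_PATTERNS.iter().any(|p| lower.contains(*p)) {
        return if graph_enabled { Route::Graph } else { Route::Hybrid };
    }
    // 2. Question words override code-pattern heuristics.
    let has_question_word = words
        .iter()
        .any(|w| QUESTION_WORDS.contains(&w.trim_matches(|c: char| !c.is_alphanumeric())));
    if has_question_word {
        return Route::Semantic;
    }
    // 3. Code patterns, snake_case identifiers, and short queries go to keyword search.
    if looks_like_code(&lower) || words.len() <= 3 {
        return Route::Keyword;
    }
    // 4. Long natural language is semantic; the rest is hybrid.
    if words.len() >= 6 { Route::Semantic } else { Route::Hybrid }
}
```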

## Adaptive Memory Admission Control (A-MAC)

By default, every message that crosses the minimum length threshold is embedded and stored in the vector backend. A-MAC adds a learned gate that evaluates each candidate message against the current memory state before committing the write. Only messages that are sufficiently novel — dissimilar to recently stored content — are admitted, preventing the vector index from filling with near-duplicate information.

A-MAC is disabled by default. Enable it in `[memory.admission]`:

```toml
[memory.admission]
enabled = true
threshold = 0.40            # Composite score threshold; messages below this are rejected (default: 0.40)
fast_path_margin = 0.15     # Skip full check and admit immediately when score >= threshold + margin (default: 0.15)
admission_provider = "fast" # Provider name for LLM-assisted admission decisions (optional)

[memory.admission.weights]
future_utility = 0.30       # LLM-estimated future reuse probability (heuristic mode only)
factual_confidence = 0.15   # Inverse of hedging markers (e.g. "I think", "maybe")
semantic_novelty = 0.30     # 1 - max similarity to existing memories
temporal_recency = 0.10     # Always 1.0 at write time
content_type_prior = 0.15   # Role-based prior (user messages score higher)
```

The `fast_path_margin` short-circuits the admission check for clearly novel messages: any candidate whose composite score is at least `threshold + fast_path_margin` is admitted immediately, which reduces embedding lookups on low-similarity content. When `admission_provider` is set, borderline cases (composite score near `threshold`) are escalated to an LLM for a binary admit/reject decision; without it, the threshold comparison is the sole gate.
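
As a worked example, the composite score can be read as a weighted sum of the five factors, compared against the threshold and the fast-path margin. The weighted-sum form is an assumption consistent with the weights summing to 1.0; `Factors`, `composite`, and `decide` are hypothetical, and LLM escalation of borderline scores is left out.

```rust
/// Five per-message factors, each assumed to lie in [0, 1].
/// The same struct is reused for the weights.
struct Factors {
    future_utility: f64,
    factual_confidence: f64,
    semantic_novelty: f64,
    temporal_recency: f64,
    content_type_prior: f64,
}

#[derive(Debug, PartialEq)]
enum Decision {
    /// Score cleared threshold + margin: admitted without the full check.
    FastAdmit,
    /// Score cleared the threshold only.
    Admit,
    Reject,
}

/// Assumed composite: weighted sum of the factors.
fn composite(w: &Factors, f: &Factors) -> f64 {
    w.future_utility * f.future_utility
        + w.factual_confidence * f.factual_confidence
        + w.semantic_novelty * f.semantic_novelty
        + w.temporal_recency * f.temporal_recency
        + w.content_type_prior * f.content_type_prior
}

fn decide(score: f64, threshold: f64, fast_path_margin: f64) -> Decision {
    if score >= threshold + fast_path_margin {
        Decision::FastAdmit
    } else if score >= threshold {
        Decision::Admit
    } else {
        Decision::Reject
    }
}
```

With the default weights, a message scoring 0.5 on future utility, 0.2 on novelty, and 1.0 on the other three factors gets `0.15 + 0.15 + 0.06 + 0.10 + 0.15 = 0.61`, which clears `0.40 + 0.15` and takes the fast path.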

### RL-Based Admission Strategy

The default `heuristic` strategy uses static weights and an optional LLM call for the `future_utility` factor. The `rl` strategy replaces the `future_utility` LLM call with a trained logistic regression model that learns from actual recall outcomes.

The RL model collects `(query, content, was_recalled)` triples from every admitted and rejected message over time. When the training corpus reaches `rl_min_samples`, the model is trained and deployed. Below that threshold the system automatically falls back to `heuristic`.

```toml
[memory.admission]
enabled = true
admission_strategy = "rl"          # "heuristic" (default) or "rl"
rl_min_samples = 500               # Training samples required before RL activates (default: 500)
rl_retrain_interval_secs = 3600    # Background retraining interval in seconds (default: 3600)
```

> [!WARNING]
> `admission_strategy = "rl"` is currently a preview feature. The model infrastructure is wired and sample collection is active, but the trained model is not yet connected to the admission path — the system will emit a startup warning and fall back to `heuristic`. Full RL-gated admission is tracked in [#2416](https://github.com/bug-ops/zeph/issues/2416).

> [!NOTE]
> Migration 055 adds the tables required for RL sample storage. Run `zeph --migrate-config` when upgrading an existing installation.

## MemScene Consolidation

MemScene groups semantically related messages into *scenes* — short-lived narrative units covering a coherent sub-topic within a session. Scenes are detected automatically in the background and consolidated into a single embedding before the individual messages are demoted in the recall index. This compresses the vector space without discarding information: a scene embedding captures the collective meaning of its member messages, and scene summaries are searchable in future sessions.

MemScene is configured under `[memory.tiers]`:

```toml
[memory.tiers]
scene_enabled = true
scene_similarity_threshold = 0.80  # Minimum cosine similarity for messages to be grouped into the same scene (default: 0.80)
scene_batch_size = 10              # Number of messages to evaluate per consolidation cycle (default: 10)
scene_provider = "fast"            # Provider name for scene summary generation
```

`scene_provider` must reference a `[[llm.providers]]` entry. If unset, the default provider is used. Scenes are stored in SQLite alongside their member message IDs and can be inspected with `zeph memory stats`.

## Active Context Compression

Zeph supports two compression strategies for managing context growth:

```toml
[memory.compression]
strategy = "reactive"    # Default — compress only when reactive compaction fires
```

**Reactive** (default) relies on the existing two-tier compaction pipeline (Tier 1 tool output pruning, Tier 2 chunked LLM compaction). No additional configuration needed.

**Proactive** fires compression before reactive compaction when the current token count exceeds `threshold_tokens`:

```toml
[memory.compression]
strategy = "proactive"
threshold_tokens = 80000       # Fire when context exceeds this token count (>= 1000)
max_summary_tokens = 4000      # Cap for the compressed summary (>= 128)
# model = ""                   # Reserved for future per-compression model selection (currently unused)
```

Proactive and reactive compression are mutually exclusive per turn: if proactive compression fires, reactive compaction is skipped for that turn (and vice versa). The `compacted_this_turn` flag resets at the start of each turn.

Proactive compression emits two metrics: `compression_events` (count) and `compression_tokens_saved` (cumulative tokens freed).

> [!NOTE]
> Validation rejects `threshold_tokens < 1000` and `max_summary_tokens < 128` at startup.

### Tool Output Archive (Memex)

When `archive_tool_outputs = true`, Zeph saves the full body of every tool output in the compaction range to SQLite before summarization begins. The archived entries are stored in the `tool_overflow` table with `archive_type = 'archive'` and are excluded from the normal overflow cleanup pass.

During compaction the LLM sees placeholder messages instead of the full outputs, keeping the summarization prompt small. After the LLM produces its summary, Zeph appends UUID reference lines (one per archived output) to the summary text. This gives you a complete audit trail of tool outputs that survived context compaction.

This feature is disabled by default because it increases SQLite storage usage. Enable it when you need durable tool output history across long sessions:

```toml
[memory.compression]
archive_tool_outputs = true
```

> [!TIP]
> Tool output archiving relies on database migration 054. Run `zeph --migrate-config` if you are upgrading an existing installation.

## Failure-Driven Compression Guidelines

When `[memory.compression_guidelines]` is enabled, the agent learns from its own compaction mistakes. After each hard compaction, it watches the next several LLM responses for a two-signal context-loss indicator: an uncertainty phrase (e.g. "I don't recall", "I'm not sure if") combined with a prior-context reference (e.g. "earlier you mentioned", "we discussed before"). When both signals appear together in the same response, the pair is recorded as a compression failure in SQLite.
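
The two-signal check can be sketched as a pair of phrase lookups. The phrase lists below contain only the examples quoted above, and simple case-insensitive substring matching is an assumption; `contains_any` and `is_context_loss` are hypothetical names.

```rust
// Only the example phrases from the text; the real lists are presumably longer.
const UNCERTAINTY_PHRASES: [&str; 2] = ["i don't recall", "i'm not sure if"];
const PRIOR_CONTEXT_PHRASES: [&str; 2] = ["earlier you mentioned", "we discussed before"];

fn contains_any(text: &str, phrases: &[&str]) -> bool {
    phrases.iter().any(|p| text.contains(*p))
}

/// True only when both signals appear in the same response.
fn is_context_loss(response: &str) -> bool {
    let lower = response.to_lowercase();
    contains_any(&lower, &UNCERTAINTY_PHRASES) && contains_any(&lower, &PRIOR_CONTEXT_PHRASES)
}
```

Requiring both signals keeps ordinary hedging ("I'm not sure if this compiles") from being recorded as a compression failure.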

A background updater wakes on a configurable interval, and when the number of unprocessed failure pairs exceeds `update_threshold`, it calls the LLM to synthesize updated compression guidelines. The resulting guidelines are sanitized to strip prompt-injection attempts and stored in SQLite. Every subsequent compaction prompt includes the active guidelines inside a `<compression-guidelines>` block, steering the summarizer to preserve categories of information that were lost before.

The feature is disabled by default:

```toml
[memory.compression_guidelines]
enabled = true
update_threshold = 5             # Minimum failure pairs before triggering an update (default: 5)
max_guidelines_tokens = 500      # Token budget for the guidelines document (default: 500)
max_pairs_per_update = 10        # Failure pairs consumed per update cycle (default: 10)
detection_window_turns = 10      # Turns after hard compaction to watch for context loss (default: 10)
update_interval_secs = 300       # Seconds between background updater checks (default: 300)
max_stored_pairs = 100           # Maximum unused failure pairs retained (default: 100)
```

> [!NOTE]
> Guidelines are injected only when `enabled = true` and at least one guidelines version exists in SQLite. The guidelines document grows incrementally as the agent accumulates failure experience.

### Per-Category Compression Guidelines

By default a single global guidelines document is maintained for the entire conversation. When `categorized_guidelines = true`, the updater maintains **four independent documents** — one per content category — and injects only the relevant document during compaction:

| Category | Content covered |
|----------|----------------|
| `tool_output` | Tool call results, shell output, file reads |
| `assistant_reasoning` | Agent reasoning steps and explanations |
| `user_context` | User instructions, preferences, and goals |
| `unknown` | Messages that do not match a category |

Each category runs its own update cycle: a category is updated only when its unprocessed failure pair count reaches `update_threshold`, avoiding unnecessary LLM calls for categories that have few failures.

Enable per-category guidelines alongside the base feature:

```toml
[memory.compression_guidelines]
enabled = true
categorized_guidelines = true    # Maintain separate guidelines per content category (default: false)
update_threshold = 5
```

> [!TIP]
> Per-category guidelines reduce the chance that tool-output compression rules interfere with how assistant reasoning is compressed, and vice versa. Enable this when you have long sessions mixing heavy tool use with extended reasoning chains.

## Graph Memory

With the `graph-memory` feature enabled, Zeph extracts entities and relationships from conversations and stores them as a knowledge graph in SQLite. This enables multi-hop reasoning ("how is X related to Y?"), temporal fact tracking ("user switched from vim to neovim"), and cross-session entity linking.

Graph memory is opt-in and complementary to vector + keyword search. After each user message, a background task extracts entities and edges via LLM. On subsequent turns, matched graph facts are injected into the context as a system message alongside recalled messages. The context budget allocates 4% of available tokens to graph facts (taken proportionally from summaries, semantic recall, cross-session, and code context allocations). Messages flagged with injection patterns skip extraction for security.

```toml
[memory.graph]
enabled = true
max_hops = 2
recall_limit = 10
```

See [Graph Memory](graph-memory.md) for the full concept guide.

## Session Summary on Shutdown

When a session ends (graceful shutdown), Zeph checks whether a session summary already exists
for the conversation. If none does — which is typical for short or interrupted sessions that
never triggered hard compaction — it generates a lightweight LLM summary of the recent messages
and stores it in the `zeph_session_summaries` vector collection. This makes the session
retrievable by `search_session_summaries` in future conversations, enabling cross-session recall
even for brief interactions.

The guard is SQLite-authoritative: if a summary record exists in SQLite (written by either the
shutdown path or a previous hard compaction), the shutdown path is skipped. This handles the edge
case where a Qdrant write failed but the SQLite record succeeded.

```toml
[memory]
shutdown_summary = true              # default: true
shutdown_summary_min_messages = 4   # skip sessions with fewer user turns
shutdown_summary_max_messages = 20  # cap LLM input to the last N messages
```

The LLM call is bounded by a 5-second timeout (10 seconds worst-case if the structured output
call times out and falls back to plain text). Errors are logged as warnings and never propagate
to the caller — shutdown completes regardless.

## Structured Anchored Summarization

When hard compaction fires, the summarizer can produce structured summaries anchored to specific information categories. The `AnchoredSummary` format replaces free-form prose with five mandatory sections:

1. **Session Intent** — what the user is trying to accomplish
2. **Files Modified** — file paths, function names, structs referenced
3. **Decisions Made** — architectural or implementation decisions with rationale
4. **Open Questions** — unresolved items or ambiguities
5. **Next Steps** — concrete actions to take immediately

Anchored summaries are validated for completeness (`session_intent` and `next_steps` must be non-empty) and rendered as Markdown with `[anchored summary]` headers for context injection. This structured format reduces information loss during compaction compared to unstructured prose summaries.
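
The completeness rule can be sketched with a plain struct. The field types here are assumptions (the real `AnchoredSummary` may differ), and `is_complete` is a hypothetical method name.

```rust
/// Assumed shape: one field per mandatory section.
struct AnchoredSummary {
    session_intent: String,
    files_modified: Vec<String>,
    decisions_made: Vec<String>,
    open_questions: Vec<String>,
    next_steps: Vec<String>,
}

impl AnchoredSummary {
    /// Mirrors the rule above: `session_intent` and `next_steps` must be non-empty.
    /// The other sections may legitimately be empty.
    fn is_complete(&self) -> bool {
        !self.session_intent.trim().is_empty() && !self.next_steps.is_empty()
    }
}
```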

## SleepGate Forgetting Pass

Over time, the vector index accumulates stale or low-value embeddings that dilute recall quality. SleepGate implements a periodic forgetting pass inspired by memory consolidation during sleep: it scans stored embeddings, scores them on recency, access frequency, and semantic density, then soft-deletes entries below the retention threshold.

```toml
[memory.forgetting]
enabled = true
interval_secs = 86400          # Run forgetting pass every N seconds (default: 86400 = 24h)
retention_threshold = 0.30     # Composite score below which entries are forgotten (default: 0.30)
```

SleepGate also integrates a performance-floor compression predictor that estimates whether removing a candidate embedding would degrade recall quality for recent queries. Entries that the predictor flags as load-bearing are preserved regardless of their retention score.

Forgotten entries are soft-deleted (marked in SQLite, removed from the vector index) and can be restored manually if needed.

## Multi-Vector Chunking

Long messages (tool outputs, code blocks, large paste operations) that exceed the embedding model's token limit are automatically split into overlapping chunks, each embedded independently. During recall, chunk scores are aggregated back to the parent message using max-pooling, so a message is retrieved if any of its chunks is relevant.

This runs in the real-time embedding path — no configuration is needed. The chunk size and overlap are derived from the embedding model's context window.
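
A minimal sketch of overlapping chunking and max-pooled scoring follows. The chunk size and overlap are explicit parameters here for illustration, whereas Zeph derives them from the embedding model; `chunk_with_overlap` and `parent_score` are hypothetical names.

```rust
/// Splits `tokens` into windows of `size` that overlap by `overlap` tokens.
fn chunk_with_overlap<T: Clone>(tokens: &[T], size: usize, overlap: usize) -> Vec<Vec<T>> {
    assert!(size > 0 && overlap < size, "overlap must be smaller than chunk size");
    let step = size - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < tokens.len() {
        let end = (start + size).min(tokens.len());
        chunks.push(tokens[start..end].to_vec());
        if end == tokens.len() {
            break; // the last window already reaches the end of the input
        }
        start += step;
    }
    chunks
}

/// Max-pooling: a parent message scores as high as its best chunk.
fn parent_score(chunk_scores: &[f64]) -> Option<f64> {
    chunk_scores.iter().copied().reduce(f64::max)
}
```

Ten tokens with a chunk size of 4 and an overlap of 1 produce three chunks, each sharing one token with its neighbor.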

## BATS Budget Hint

The Budget-Aware Token Steering (BATS) system injects a budget hint into the system prompt that tells the LLM how much context space remains. This helps the model produce appropriately-sized responses and make better decisions about when to use tools versus answering from context.

BATS also implements a utility-based 5-way action policy that evaluates each agent turn against five action categories (respond, search, tool-use, delegate, wait) and selects the action with the highest expected utility given the current context budget and conversation state.

## Cost-Sensitive Store Routing

When multiple storage backends are available (SQLite vectors, Qdrant, graph store), the memory system routes write operations to the backend with the lowest cost for the given content type. Short factual statements are routed to the graph store, long narratives to vector storage, and structured data to SQLite key-value pairs.

```toml
[memory.routing]
cost_sensitive = true          # Enable cost-aware write routing (default: false)
```

## Goal-Conditioned Write Gate

When enabled, the write gate evaluates whether a candidate memory entry is relevant to the user's current goal before admitting it. This prevents the memory system from storing tangential information during long exploratory sessions.

The goal text is extracted from the most recent `/plan` goal or from the first user message in the session if no plan is active.

## Kumiho Belief Revision

Kumiho implements belief revision for the graph memory store. When new information contradicts an existing entity-relationship fact, Kumiho evaluates the conflict using temporal recency and source reliability, then either updates the existing edge, creates a versioned override, or flags the conflict for user resolution.

This is paired with D-MEM RPE (Reward Prediction Error) routing for graph memory, which uses prediction errors from graph queries to adaptively weight the graph store's contribution to hybrid recall.

## Next Steps

- [Set Up Semantic Memory](../guides/semantic-memory.md) — Qdrant setup guide
- [Context Budgets](context-budgets.md) — BATS budget hints and allocation strategy
- [SleepGate](../advanced/sleep-gate.md) — automatic memory forgetting and index hygiene
- [Graph Memory](graph-memory.md) — entity-relationship tracking and multi-hop reasoning
- [Context Engineering](../advanced/context.md) — budget allocation, compaction, recall tuning