rsclaw 2026.5.20

AI Agent Engine Compatible with OpenClaw
# `src/kb/` — Knowledge Base

User-managed RAG knowledge base. See `docs/specs/2026-05-19-knowledge-base.md`
for the full design and `docs/adr/0001-knowledge-base.md` for the decision
record. Week 1 plan: `docs/plans/2026-05-19-kb-mvp-week1-foundation.md`.
Week 2 plan: `docs/plans/2026-05-19-kb-mvp-week2-pipeline.md`.

## What's implemented (Weeks 1–5)

**Week 1 (Foundation):**

- Types, content store, canonicalizers, chunker, redb schema, file IO primitives.

**Week 2 (Persistence + Pipeline):**

- **redb accessors** (`store/docs`, `store/chunks`, `store/seen`,
  `store/ledger`, `store/jobs`) — composable inside a single
  `WriteTransaction` so the pipeline can write doc + ledger + job +
  seen atomically. Each table has `*_in_wtx` reader variants for the
  race-safe NOOP re-check inside the ingest pipeline.
- **`KbStore` facade** — owns the `redb::Database`, exposes
  `begin_write` / `begin_read`.
- **`KbEmbedder` trait + `StubEmbedder`** — deterministic 1024-dim
  vectors for tests; real BGE-M3 embedder lands as a self-contained
  follow-up behind the same trait.
- **`ingest_canonicalized()`** — single-tx atomic pipeline. Fast-path
  NOOP read, file staging, then one `WriteTransaction` does the race-safe
  NOOP re-check + version compute + 5-table write + commit. Returns
  `doc_id` synchronously.
- **`WorkerPool`** — single tokio task that claims `Ready` jobs from
  `kb_jobs_by_status_priority`, dispatches to `JobHandler`, marks
  `Done` / `Failed` / requeues. `reclaim_stale` interleaved every
  `reclaim_interval` for expired claims. `mark_done` / `mark_failed`
  verify the claim's fencing token so zombie workers can't clobber
  the new claimant's state. Requires multi-threaded tokio runtime
  (uses `tokio::task::block_in_place`).
- **`ChunkAndEmbed` handler** — reads staged markdown, runs the
  Week 1 chunker, embeds via `KbEmbedder`, writes chunks + advances
  ledger to `IndexingComplete`. Idempotent on rerun (deterministic
  `chunk_id`); drops stale chunks from prior `doc_version`s before
  inserting the new set.
- **Crash recovery** — stalled-claim reclaim path tested; process
  restart resumes the queue.

**Week 3 (Retrieval):**

- **HnswCache** (`index/hnsw.rs`) — `RwLock<Hnsw<f32, DistCosine>>`
  with rebuild from redb on startup. Append-only `insert` (re-inserting
  an id orphans the old vertex; compactor reaps via rebuild).
- **TantivyIndex** (`index/tantivy.rs`) — BM25 with `chunk_id`-keyed
  delete-then-add upsert + `delete_all_documents` for full rebuild.
  Uses the `JiebaTokenizer` (`index/cjk.rs`) so Chinese queries
  match against jieba-segmented tokens; ASCII queries round-trip
  identically.
- **KbIndex composite** (`index/mod.rs`) — single handle wraps both
  layers; `upsert_chunk` + `commit` are the worker write path.
- **Worker integration**: `ChunkAndEmbed` handler writes to both
  indexes after the redb commit lands; failures propagate so the
  worker retries (chunks are already durable in redb).
- **Filter** (`search/filter.rs`) — visibility + status + version +
  tags + source_kind + doc_ids. Single source of truth for "can this
  caller see this hit".
- **RRF + MMR** (`search/rrf.rs`, `search/mmr.rs`) — pure-function
  fusion + diversity selector.
- **Pipeline** (`search/pipeline.rs`) — `SearchCtx::search` composes
  dense + sparse → filter → fuse → MMR → lazy text fetch.
- **Tools** (`tools/`) — `kb_search`, `kb_fetch`, `kb_list_docs`,
  `kb_similar`, `kb_search_entities`. JSON-shaped IO; `CallerScope`
  is a separate function arg (runtime-injected, agent cannot supply).
- **Entity store** (`store/entities.rs`) — put_entity + get_entity +
  find_by_surface scan + chunks_for_entity edges. Inverted index is a
  Week 4 optimisation once entity extraction emits non-trivial counts.
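
The fusion step listed above is a pure function, so it can be illustrated in isolation. The following is a minimal sketch of reciprocal rank fusion, not the code in `search/rrf.rs`: the name `rrf_fuse`, its signature, and the smoothing constant `k = 60` are assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Reciprocal rank fusion over several ranked lists of chunk ids.
/// Each list contributes 1 / (k + rank) per id, with 1-based ranks.
pub fn rrf_fuse(lists: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in lists {
        for (idx, id) in list.iter().enumerate() {
            let rank = idx as f64 + 1.0;
            *scores.entry((*id).to_string()).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    // Score descending, then id ascending, so ties resolve deterministically.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}
```

With a dense list `[c1, c2, c3]` and a sparse list `[c2, c4]`, `c2` ranks first because it is the only id present in both lists.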

**Week 4 (Syncers + Compactor + CLI):**

- **`KbSourceSyncer` trait** (`sync/mod.rs`) — generic interface for
  source-specific ingest. Async via `async-trait`.
- **`ManualUploadSyncer`** (`sync/manual.rs`) — file path → bytes →
  `canonicalize_by_mime` → `ingest_canonicalized`. Used by
  `rsclaw kb add <path>`.
- **`UrlSyncer`** (`sync/url.rs`) — `reqwest::get` with
  ETag/Last-Modified conditional headers; falls back to content-hash
  dedupe via `seen_items`. Cursor persisted as
  `etag:` / `lastmod:` / `contenthash:` in `SyncState`.
- **`SyncRegistry`** (`sync/state.rs`) — load/save SyncState per
  `source_id`; a wrapper around `store::seen::{get,put}_sync_state`.
- **Compactor** (`compactor/mod.rs`) — orphan file scan with
  grace-period guard + ledger advancement
  `IndexingComplete → CleanupPending → Done`. Single
  `run_compactor_tick` function, idempotent.
- **CLI** (`src/cli/kb.rs` + `src/cmd/kb.rs`) — `rsclaw kb add | ls |
  rm | search | show | visibility | compact | stats`. `kb add`
  synchronously drains the worker pool so follow-up `kb search`
  sees fresh chunks immediately.
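
The `etag:` / `lastmod:` / `contenthash:` cursor prefixes used by `UrlSyncer` suggest a small tagged encoding. The sketch below shows one way such a cursor could be parsed and mapped to a conditional request header. The `UrlCursor` type and both method names are hypothetical and are not taken from `sync/url.rs`.

```rust
/// Hypothetical parsed form of a `SyncState` cursor string.
#[derive(Debug, PartialEq, Eq)]
pub enum UrlCursor {
    ETag(String),
    LastModified(String),
    ContentHash(String),
}

impl UrlCursor {
    /// Parse a prefixed cursor; unknown prefixes yield `None`.
    pub fn parse(raw: &str) -> Option<Self> {
        if let Some(v) = raw.strip_prefix("etag:") {
            Some(Self::ETag(v.to_string()))
        } else if let Some(v) = raw.strip_prefix("lastmod:") {
            Some(Self::LastModified(v.to_string()))
        } else {
            raw.strip_prefix("contenthash:")
                .map(|v| Self::ContentHash(v.to_string()))
        }
    }

    /// Conditional request header `(name, value)`, if the cursor maps to one.
    pub fn conditional_header(&self) -> Option<(&'static str, &str)> {
        match self {
            Self::ETag(v) => Some(("If-None-Match", v.as_str())),
            Self::LastModified(v) => Some(("If-Modified-Since", v.as_str())),
            // A content hash has no HTTP equivalent; it is compared after download.
            Self::ContentHash(_) => None,
        }
    }
}
```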

**Week 5 (Polish):**

- **CJK tokenizer** (`index/cjk.rs`) — `JiebaTokenizer` registered
  as tantivy's `cjk` analyzer. The schema applies it to
  `indexed_text` so Chinese BM25 queries actually match.
- **Regex entity extractor** (`entities/extract.rs`) — pulls URLs /
  emails / hashtags / @-mentions out of each chunk's
  `indexed_text`. The worker handler upserts `KbEntity` +
  `KbEntityIndex` edges per chunk, activating
  `kb_search_entities`. CJK hashtags supported.
- **`require_entities` + `boost_entities` in search pipeline**:
  intersect/multiply on the fused result set against entity edges.
  Covered by `tests/kb_entities_e2e.rs::require_entities_filters_to_chunks_with_mention`.
- **CLI completeness for spec §5 v1**: `kb show <doc_id>` lists
  the doc's chunks; `kb rm --tag <name>` bulk-tombstones every
  Active doc with that tag; `kb export <doc_id> --to <path>` writes
  the canonical markdown body to disk; `kb stats` now reports
  per-status doc counts + `kb_entities` / `kb_entity_index` /
  `disk_bytes`; `kb add --recursive <dir>` ingests a directory
  tree with an `--ext` filter (default `md,txt,html,pdf`).
- **HNSW snapshot persistence** (`index/hnsw.rs::{snapshot,restore}`):
  `kb compact` dumps `<paths.root>/hnsw/snapshot.*` via
  `hnsw_rs::file_dump` plus a JSON sidecar with the `id_to_chunk`
  map. `KbIndex::open_and_rebuild` tries `restore()` first, falling
  back to `rebuild()` from redb. Eliminates startup cost on
  re-open of large stores.
- **`kb sync-all`** — refresh every Active URL doc whose
  `SyncState.last_sync_at` is older than `--interval-min`
  (default 20). Supports `--max` cap and `--dry-run`. Acts as a
  manual scheduler tick until gateway-resident syncer ticks ship.
- **`kb search --json`** + `entity_alignment` + `warnings` in
  every kb_search response — the same regex extractor that runs
  on chunk text runs on the query so the agent can spot
  cross-entity hallucinations (`query mentions [伊利] but none
  of the chunks containing it appear in results`).
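
The `entity_alignment` check described in the last bullet can be pictured as a set comparison between query entities and the entities attached to returned chunks. The sketch below is a simplified reading of that idea: the function name, the input shapes, and the warning text are all assumptions, and the real response format may differ.

```rust
use std::collections::HashSet;

/// For each entity mentioned in the query, emit a warning when no
/// returned chunk carries that entity.
pub fn entity_warnings(
    query_entities: &[&str],
    result_chunk_entities: &[HashSet<&str>],
) -> Vec<String> {
    query_entities
        .iter()
        .filter(|e| !result_chunk_entities.iter().any(|set| set.contains(**e)))
        .map(|e| format!("query mentions [{}] but no returned chunk contains it", e))
        .collect()
}
```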

## What's NOT in Weeks 1–5

- BGE-M3 real embedder — Week 6 (StubEmbedder today)
- Gateway-resident scheduler for syncer ticks — Week 6 (today:
  manual `kb add <url>` and `kb sync-all` both work; user/cron
  drives the cadence)
- LocalFolderSyncer, MailSyncer, ChatSyncer — V2 (post-MVP)
- `kb_explain` retrieval trace — V2 (post-MVP)
- Tauri admin UI — V2 (post-MVP)
- ML-based NER (replaces regex extractor) — V2 (post-MVP)

## Architecture invariants (verify after every code change)

1. **`chunk_id` depends on `logical_source_id`, never on `doc_id` or
   `doc_version`**: re-ingesting the same file produces identical
   `chunk_id`s. Covered by
   `kb::model::chunk::tests::reingest_same_file_same_chunk_ids`,
   `kb::chunker::tests::idempotent_chunk_ids`, and
   `tests/kb_week1_e2e.rs::reingest_same_file_same_chunk_ids`.
2. **`KbDoc.visible_to(scope)` is the only visibility entry point**:
   never call `KbVisibility::visible_to(scope, owner)` directly from
   retrieval code — pairing the wrong owner is the most likely
   scope-leak. Covered by
   `kb::model::doc::tests::visibility_private_requires_matching_owner`
   and `kbdoc_visible_to_pairs_owner_with_visibility`.
3. **`write_if_new` is truly atomic no-clobber**: never replace it
   with `path.exists()` + `rename()` — that's a TOCTOU race AND Unix
   `rename(2)` overwrites. Covered by
   `kb::content_store::atomic::tests::write_if_new_concurrent_no_clobber`
   (20-iteration thread race).
4. **Markdown paths are content-addressed**: layout is
   `md/<kind>/<slug>--<lsid8>--<md8>.md` where `lsid8` =
   `sha256(logical_source_id)[:8]` and `md8` =
   `sha256(body)[:8]`. Same lsid + same content → same path
   (idempotent re-ingest). Same lsid + new content (v2 ingest under a
   stable seed) → different path; both versions coexist until the
   Week 4 compactor reaps the old file. `stage_doc` still errors on
   any body mismatch at a same-path hit (full 64-bit suffix
   collision, ~2^-32). Covered by
   `kb::content_store::paths::tests::markdown_rel_same_lsid_different_body_different_path`
   and
   `kb::content_store::tests::stage_same_lsid_different_body_lands_at_different_paths`.
5. **Files are stage-only**: nothing in `canonicalize/` or
   `content_store/` deletes files. Deletion happens via the compactor
   + ledger reconciliation in Week 4.
6. **No SQL pretense**: redb queries are KV / range-scan only; never
   use SQL terminology (no "partial unique index", no "UPDATE …
   RETURNING").
7. **PII in logs goes through `util::redact`**: source ids and
   content previews emit only `redact(s)` (first 8 hex of sha256).
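
Invariant 3 is easiest to see next to the standard-library primitive that avoids the race. The sketch below is not the `write_if_new` implementation in `content_store::atomic`; it only illustrates atomic no-clobber creation with `OpenOptions::create_new`, and the name `write_if_new_sketch` is hypothetical.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Returns `Ok(true)` if this call created the file and `Ok(false)` if
/// the file already existed (in which case it is left untouched).
pub fn write_if_new_sketch(path: &Path, bytes: &[u8]) -> io::Result<bool> {
    // `create_new(true)` makes the existence check and the creation a
    // single atomic step, so there is no window between "check" and "use".
    match OpenOptions::new().write(true).create_new(true).open(path) {
        Ok(mut file) => {
            file.write_all(bytes)?;
            file.sync_all()?;
            Ok(true)
        }
        Err(e) if e.kind() == io::ErrorKind::AlreadyExists => Ok(false),
        Err(e) => Err(e),
    }
}
```

Under concurrent callers exactly one sees `Ok(true)`. A reader can still observe a partially written file with this approach, so a production version would likely need an extra step such as writing to a temporary file first.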

### Added in Week 2

8. **All ingest writes happen in one redb tx**: `ingest_canonicalized`
   commits `KbDoc` + `VersionPointer` + `IngestLedgerEntry` + `Job` +
   `SeenItems` together. Splitting any of these into separate txs
   reintroduces the Outbox bug: a doc visible to readers but no job
   queued for chunking. Covered by
   `kb::pipeline::ingest::tests::fresh_ingest_writes_all_tables`.
9. **NOOP re-check + version compute happen INSIDE the wtx** — these
   reads use `*_in_wtx` accessor variants so a concurrent ingest with
   the same `(lsid, raw_sha)` cannot pass NOOP-miss in both threads and
   produce duplicate docs. redb's single-writer guarantee plus the
   in-wtx re-check is the correctness hinge. Covered by
   `kb::pipeline::ingest::tests::concurrent_ingest_same_bytes_produces_one_doc`.
10. **`ChunkAndEmbed` handler is idempotent** — re-running on the same
    `doc_id` produces identical chunks (deterministic `chunk_id`) and
    identical vectors. Re-runs after the ledger already advanced are
    safe no-ops, not errors. Covered by
    `kb::worker::handlers::chunk_embed::tests::idempotent_rerun_produces_same_chunks`
    and `rerun_after_ledger_advanced_does_not_error`.
11. **Job dedupe is keyed on `JobKind::dedupe_key()`, not `job_id`**:
    enqueueing the same logical work twice while a job is `Ready` or
    `Running` returns the existing `job_id` without writing a duplicate.
    Covered by `kb::store::jobs::tests::enqueue_dedupes_active_jobs`.
12. **`mark_done` / `mark_failed` verify the claim's fencing token**:
    a zombie worker whose claim was reclaimed cannot transition the
    job and clobber the new claimant. Covered by
    `kb::store::jobs::tests::mark_done_with_wrong_token_errors` and
    `mark_done_after_reclaim_errors`.
13. **Stalled claims auto-reclaim** — workers that crash mid-job leave
    a claim with `expires_at` in the past; the next `reclaim_stale`
    sweep resets the job to `Ready` (or fails it once `max_attempts` is
    hit) and another worker re-runs it. Both the `WorkerPool` (tokio,
    CLI/tests) and the gateway's `KnowledgeService::spawn_worker`
    (std::thread, sweeps every 30s) drive this. Covered by
    `tests/kb_week2_recovery.rs::stalled_claim_is_reclaimed_and_rerun`
    and `kb::store::jobs::tests::reclaim_stale_fails_job_past_max_attempts`.
14. **`WorkerPool::shutdown()` exits in bounded time** — the AtomicBool
    is checked at the top of each loop iteration and on every wake
    from the idle sleep. Long-running handlers delay shutdown only
    until they return. Covered by
    `kb::worker::pool::tests::shutdown_exits_within_poll_idle_plus_margin`.
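
Invariants 12 and 13 interact: a reclaim invalidates the old claim, and the fencing token is what stops the old worker from finishing the job afterwards. The in-memory sketch below models that interaction under stated assumptions. The `Job` struct, its fields, and the integer clock are hypothetical stand-ins for the redb-backed job rows, and `max_attempts` handling is omitted.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Ready,
    Running,
    Done,
}

#[derive(Debug)]
pub struct Job {
    pub status: Status,
    pub claim_token: Option<u64>,
    pub expires_at: u64, // logical clock, e.g. unix seconds
    pub last_error: Option<String>,
    next_token: u64,
}

impl Job {
    pub fn new() -> Self {
        Job { status: Status::Ready, claim_token: None, expires_at: 0, last_error: None, next_token: 1 }
    }

    /// Claim a Ready job; the returned token must be presented on completion.
    pub fn claim(&mut self, now: u64, ttl: u64) -> Option<u64> {
        if self.status != Status::Ready {
            return None;
        }
        let token = self.next_token;
        self.next_token += 1;
        self.status = Status::Running;
        self.claim_token = Some(token);
        self.expires_at = now + ttl;
        Some(token)
    }

    /// Reset an expired claim so that another worker can pick the job up.
    pub fn reclaim_stale(&mut self, now: u64) -> bool {
        if self.status == Status::Running && self.expires_at <= now {
            self.status = Status::Ready;
            self.claim_token = None;
            self.last_error = Some("claim_token_expired".to_string());
            return true;
        }
        false
    }

    /// Only the holder of the current token may finish the job.
    pub fn mark_done(&mut self, token: u64) -> Result<(), &'static str> {
        if self.status != Status::Running || self.claim_token != Some(token) {
            return Err("stale or unknown claim token");
        }
        self.status = Status::Done;
        self.claim_token = None;
        Ok(())
    }
}
```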

### Added in Week 3

15. **Visibility filter runs on every retrieval call** — every
    `tools/kb_*` entry point goes through `search::filter::keep_doc` +
    `is_latest_version`. There is no caller-supplied bypass. Covered
    by
    `kb::search::pipeline::tests::search_filter_by_visibility_hides_private`.
16. **HNSW + tantivy are caches over redb** — losing either is a
    rebuild, not data loss. `KbIndex::open_and_rebuild` reconstructs
    both from `kb_chunks` on startup. Covered by
    `kb::index::hnsw::tests::rebuild_then_search_returns_hits` and
    `kb::index::tests::open_and_rebuild_recovers_both_layers`.
17. **Tantivy upsert deletes-by-term before add** — re-running
    `chunk_embed` on the same chunk_id replaces the indexed text
    rather than producing a duplicate match. Covered by
    `kb::index::tantivy::tests::upsert_replaces_previous`.
18. **`CallerScope` is injected by the runtime, not by tool input**:
    `kb_search::KbSearchInput` deliberately has no `caller_scope`
    field; the runtime constructs scope from auth context and passes
    it as a separate function argument to `tools::*::run`.
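
Invariants 15 and 18 together mean that the scope reaches the filter through a channel the agent cannot write to. A minimal sketch of that shape follows. All types here (`Visibility`, `CallerScope`, `Doc`, `SearchInput`) and the `run_search` function are simplified, hypothetical stand-ins, and the `Public` variant in particular is an assumption.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Visibility {
    Public,
    Private,
}

/// Built by the runtime from auth context; never deserialized from tool input.
#[derive(Debug, Clone)]
pub struct CallerScope {
    pub user_id: String,
}

pub struct Doc {
    pub doc_id: String,
    pub owner_user_id: Option<String>,
    pub visibility: Visibility,
}

impl Doc {
    /// Single entry point: the doc pairs its own owner with its own visibility.
    pub fn visible_to(&self, scope: &CallerScope) -> bool {
        match self.visibility {
            Visibility::Public => true,
            Visibility::Private => {
                self.owner_user_id.as_deref() == Some(scope.user_id.as_str())
            }
        }
    }
}

/// Agent-supplied input: deliberately has no scope field.
pub struct SearchInput {
    pub query: String,
}

/// The scope is a separate argument, so tool input cannot widen access.
pub fn run_search<'a>(
    input: &SearchInput,
    scope: &CallerScope,
    docs: &'a [Doc],
) -> Vec<&'a str> {
    docs.iter()
        .filter(|d| d.visible_to(scope))
        .filter(|d| d.doc_id.contains(input.query.as_str())) // stand-in for retrieval
        .map(|d| d.doc_id.as_str())
        .collect()
}
```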

### Added in Week 4

19. **All syncers go through `ingest_canonicalized`**: `ManualUpload`
    and `Url` syncers both terminate in `ingest_canonicalized(...)`,
    so spec §J's atomicity contract holds for every ingest path. No
    syncer ever writes to redb directly.
20. **UrlSyncer conditional-get uses SyncState.cursor** — every
    304 NOT_MODIFIED response counts as `docs_skipped`, never
    `docs_added`. Covered by the `manual_syncer_dedupes_identical_bytes`
    pattern (UrlSyncer integration deferred to Week 6 with a
    `wiremock` dep).
21. **Compactor never deletes files referenced by any `KbDoc`**:
    `referenced_paths` unions over every doc's
    `markdown_path` + `raw_path` plus every Pending/IndexingComplete
    ledger entry's `new_paths`. The grace period (default 1h) guards
    against in-flight ingest. Covered by
    `kb::compactor::tests::referenced_file_preserved`.
22. **CLI is a thin wrapper over the library surface** — every
    `rsclaw kb` subcommand calls into Week 2–3's tool surface
    (`ingest_canonicalized`, `kb_search`, `kb_list_docs`, `kb_fetch`)
    or the new Week 4 syncer/compactor functions. `kb add` drains
    the worker pool synchronously so an immediate `kb search` sees
    fresh chunks.
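
The compactor rule in invariant 21 reduces to two conditions that must both hold before a file is deleted: it is unreferenced, and it is older than the grace period. The sketch below shows that predicate on plain data. `DiskFile` and `orphans_to_delete` are hypothetical names, and the real compactor reads these inputs from redb and the filesystem.

```rust
use std::collections::HashSet;
use std::time::Duration;

/// A file found on disk, with the time elapsed since its last modification.
pub struct DiskFile {
    pub rel_path: String,
    pub age: Duration,
}

/// Files that are safe to delete: unreferenced AND at least `grace` old.
pub fn orphans_to_delete<'a>(
    on_disk: &'a [DiskFile],
    referenced: &HashSet<String>,
    grace: Duration,
) -> Vec<&'a str> {
    on_disk
        .iter()
        .filter(|f| !referenced.contains(&f.rel_path))
        // Young files may belong to an ingest that has not committed yet.
        .filter(|f| f.age >= grace)
        .map(|f| f.rel_path.as_str())
        .collect()
}
```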

### Added in Week 5 (Polish)

23. **CJK BM25 search works**: `JiebaTokenizer` is registered as
    tantivy's `cjk` analyzer and applied to the `indexed_text`
    field's `TextOptions`. The default whitespace+lowercase
    analyzer reduced Chinese sentences to a single un-searchable
    token; jieba splits them into searchable terms. Covered by
    `kb::index::tantivy::tests::chinese_query_matches_chinese_doc`.
24. **Entity edges land on every chunk write** — the regex
    extractor (`entities/extract.rs`) runs inside the same
    `wtx` as the chunk insert, so `KbEntityIndex` rows are
    consistent with chunks. `kb_search_entities` returns these
    edges; `require_entities` / `boost_entities` filters in
    `search::pipeline` are wired against them. Covered by
    `tests/kb_entities_e2e.rs::entities_extracted_and_queryable`
    and `require_entities_filters_to_chunks_with_mention`.
25. **CLI fully covers spec §5 v1**: `add | ls | rm | search |
    show | visibility | compact | stats | export`. `rm` accepts
    either a `doc_id` or `--tag <name>` for bulk tombstone;
    `show` resolves doc_ids to a chunk list and chunk_ids to a
    single-chunk fetch with neighbors. `stats` reports per-status
    doc counts and on-disk bytes. `add --recursive <dir>` ingests
    a directory tree.
26. **HNSW snapshot survives process restart**: `kb compact`
    dumps the dense layer to `hnsw/snapshot.*`. Subsequent
    `KbIndex::open_and_rebuild` calls restore in-place rather than
    re-inserting every chunk. Empty caches still write a meta
    sidecar so restore is symmetric. Covered by
    `kb::index::hnsw::tests::snapshot_roundtrip_preserves_search`
    and `snapshot_empty_cache_writes_meta_only`.
27. **Tombstoned docs resurrect on same-content re-ingest** — spec
    §6 keeps Tombstoned docs for 30 days. Re-adding the same file
    within that window flips status back to Active rather than
    silently NOOP-returning the hidden doc. Both the read-only
    fast path and the wtx-scoped re-check honour this. Covered by
    `kb::pipeline::ingest::tests::tombstoned_doc_resurrects_on_reingest`.
28. **CLI smoke tests**: `tests/kb_cli_smoke.rs` invokes the
    compiled `rsclaw` binary via `CARGO_BIN_EXE_rsclaw`. Ten
    tests covering the full `kb` subcommand surface guard against
    arg-parsing and output-format regressions.
29. **Retrieval output is byte-deterministic**: `search::pipeline`
    sorts the post-MMR result by `(score desc, chunk_id asc)` so
    the wire bytes are stable across calls with the same inputs.
    Spec §3 "KV cache friendly" (KV cache 友好): identical search inputs must produce
    identical agent context across turns or the cache fragments.
30. **HNSW snapshot has a `schema_version`**: `HnswMeta.schema_version`
    bumps on format changes. Restore errors instead of panicking
    on mismatch; the operator can delete the `hnsw/` directory
    to force a rebuild from redb (cache, not source of truth).
31. **`reclaim_stale` leaves an audit trail** — every job reset
    from Running→Ready gets `last_error =
    "claim_token_expired"` inside the same wtx. Operators reading
    `kb_jobs_by_id` see exactly why each job came back.
32. **`UrlSyncer` classifies HTTP failures** — 401/403 →
    `AuthFailed`, 429 (with Retry-After parsed) → `RateLimited`,
    other 4xx → `Permanent` (no point retrying), 5xx →
    `Network` (transient). `SyncError` variants are usable
    end-to-end now.
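
The status mapping in invariant 32 can be written as a single `match`. The sketch below mirrors the four classes named there, but the enum `SyncErrorKind`, the function `classify_status`, and the choice to parse only the delta-seconds form of `Retry-After` are assumptions rather than the actual `SyncError` definition.

```rust
use std::time::Duration;

#[derive(Debug, PartialEq, Eq)]
pub enum SyncErrorKind {
    AuthFailed,
    RateLimited { retry_after: Option<Duration> },
    Permanent,
    Network,
}

/// Map an HTTP status to an error class; `None` means "not an error here"
/// (for example 200 or 304).
pub fn classify_status(status: u16, retry_after_header: Option<&str>) -> Option<SyncErrorKind> {
    match status {
        401 | 403 => Some(SyncErrorKind::AuthFailed),
        429 => {
            // Only the delta-seconds form is handled; HTTP-date values yield None.
            let retry_after = retry_after_header
                .and_then(|v| v.trim().parse::<u64>().ok())
                .map(Duration::from_secs);
            Some(SyncErrorKind::RateLimited { retry_after })
        }
        400..=499 => Some(SyncErrorKind::Permanent), // retrying will not help
        500..=599 => Some(SyncErrorKind::Network),   // treated as transient
        _ => None,
    }
}
```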

## Quick start

### CLI (everyday flow)

```bash
# Add a file (synchronously chunks + indexes in CLI-only mode)
rsclaw kb add ~/Documents/manual.md --tags personal

# Add a directory recursively
rsclaw kb add ~/Documents/notes --recursive --ext md,txt --tags wiki

# Add a URL (conditional GET via ETag/Last-Modified on re-run)
rsclaw kb add https://example.com/changelog.html --tags changelog

# Search (hybrid: HNSW + tantivy BM25 + RRF + MMR)
rsclaw kb search "brown fox" -k 5
rsclaw kb search "brown fox" --json | jq

# List + filter
rsclaw kb ls --tag wiki --limit 20
rsclaw kb show <doc_id>           # metadata + chunk list
rsclaw kb show <chunk_id>         # single chunk + neighbors
rsclaw kb visibility <doc_id> private

# Maintenance
rsclaw kb compact                  # orphan-file scan + HNSW snapshot
rsclaw kb sync-all --dry-run       # refresh stale URL docs
rsclaw kb stats                    # per-status counts + disk_bytes
rsclaw kb export <doc_id> --to ./out.md

# Delete (tombstone — kept 30 days for recovery)
rsclaw kb rm <doc_id> --yes
rsclaw kb rm --tag stale --yes     # bulk by tag
# Re-add the same file within 30 days resurrects the doc.
```

### Rust API (embedders + tests)

```rust
use rsclaw::kb::{
    canonicalize_by_mime, detect_mime, ingest_canonicalized,
    CanonicalizeInput, HandlerCtx, IngestInput, KbEmbedder, KbIndex,
    KbPaths, KbStore, StubEmbedder, WorkerConfig, WorkerPool,
};
use std::sync::Arc;

# async fn demo() -> anyhow::Result<()> {
let tmp = tempfile::TempDir::new()?;
let store = Arc::new(KbStore::open(&tmp.path().join("kb.redb"))?);
let paths = Arc::new(KbPaths::new(tmp.path().join("kb")));
paths.ensure_layout()?;
let embedder: Arc<dyn KbEmbedder> = Arc::new(StubEmbedder::default());
let index = Arc::new(KbIndex::open(&paths)?);

// Start the worker pool (requires multi-threaded tokio runtime).
let ctx = HandlerCtx {
    store: store.clone(),
    paths: paths.clone(),
    embedder: embedder.clone(),
    index: index.clone(),
};
let pool = WorkerPool::start(ctx, WorkerConfig::default());

// Ingest a doc.
let bytes = std::fs::read("manual.md")?;
let mime = detect_mime(&bytes, Some("manual.md"));
let canon = canonicalize_by_mime(CanonicalizeInput {
    bytes: &bytes,
    mime: &mime,
    hint_title: Some("manual.md"),
    logical_source_id_seed: None,
})?
.unwrap();

let out = ingest_canonicalized(
    &store,
    IngestInput {
        canon: &canon,
        raw_bytes: &bytes,
        raw_ext: "md",
        visibility: None,
        owner_user_id: None,
        seen_key: None,
        source: None,
        paths: &paths,
    },
)?;
println!("doc_id: {}", out.doc_id);

// Worker pool picks up the ChunkAndEmbed job asynchronously and
// writes chunks + vectors into kb_chunks. See
// `tests/kb_week2_pipeline.rs` for the full async wait pattern.

pool.shutdown().await;
# Ok(()) }
```

## Testing

```bash
cargo test -p rsclaw --lib kb::          # unit tests (~200)
cargo test --test kb_week1_e2e           # Week 1 integration (6)
cargo test --test kb_week2_pipeline      # Week 2 async e2e (1)
cargo test --test kb_week2_recovery      # Week 2 crash recovery (2)
cargo test --test kb_week3_search        # Week 3 retrieval e2e (1)
cargo test --test kb_week4_syncers       # Week 4 syncer e2e (2)
cargo test --test kb_week4_compactor     # Week 4 compactor integration (2)
cargo test --test kb_entities_e2e        # Week 5 entity extraction (2)
cargo test --test kb_cli_smoke           # CLI smoke (11)
cargo test --test kb_tools_e2e           # kb_fetch/similar/list_docs (7)
```

End-to-end CLI smoke:

```bash
# printf interprets \n portably; plain `echo` in bash would write the backslashes literally.
printf '# Hello\n\nThe quick brown fox.\n' > /tmp/doc.md
rsclaw --base-dir /tmp/kbdemo kb add /tmp/doc.md --tags demo
rsclaw --base-dir /tmp/kbdemo kb search "brown fox"
rsclaw --base-dir /tmp/kbdemo kb stats
```