# `src/kb/` — Knowledge Base
User-managed RAG knowledge base. See `docs/specs/2026-05-19-knowledge-base.md`
for the full design and `docs/adr/0001-knowledge-base.md` for the decision
record. Week 1 plan: `docs/plans/2026-05-19-kb-mvp-week1-foundation.md`.
Week 2 plan: `docs/plans/2026-05-19-kb-mvp-week2-pipeline.md`.
## What's implemented (Weeks 1–5)
**Week 1 (Foundation):**
- Types, content store, canonicalizers, chunker, redb schema, file IO primitives.
**Week 2 (Persistence + Pipeline):**
- **redb accessors** (`store/docs`, `store/chunks`, `store/seen`,
`store/ledger`, `store/jobs`) — composable inside a single
`WriteTransaction` so the pipeline can write doc + ledger + job +
seen atomically. Each table has `*_in_wtx` reader variants for the
race-safe NOOP re-check inside the ingest pipeline.
- **`KbStore` facade** — owns the `redb::Database`, exposes
`begin_write` / `begin_read`.
- **`KbEmbedder` trait + `StubEmbedder`** — deterministic 1024-dim
vectors for tests; real BGE-M3 embedder lands as a self-contained
follow-up behind the same trait.
- **`ingest_canonicalized()`** — single-tx atomic pipeline. Fast-path
NOOP read, file staging, then one `WriteTransaction` does the race-safe
NOOP re-check + version compute + 5-table write + commit. Returns
`doc_id` synchronously.
- **`WorkerPool`** — single tokio task that claims `Ready` jobs from
`kb_jobs_by_status_priority`, dispatches to `JobHandler`, marks
`Done` / `Failed` / requeues. `reclaim_stale` interleaved every
`reclaim_interval` for expired claims. `mark_done` / `mark_failed`
verify the claim's fencing token so zombie workers can't clobber
the new claimant's state. Requires multi-threaded tokio runtime
(uses `tokio::task::block_in_place`).
- **`ChunkAndEmbed` handler** — reads staged markdown, runs the
Week 1 chunker, embeds via `KbEmbedder`, writes chunks + advances
ledger to `IndexingComplete`. Idempotent on rerun (deterministic
`chunk_id`); drops stale chunks from prior `doc_version`s before
inserting the new set.
- **Crash recovery** — stalled-claim reclaim path tested; process
restart resumes the queue.
**Week 3 (Retrieval):**
- **HnswCache** (`index/hnsw.rs`) — `RwLock<Hnsw<f32, DistCosine>>`
with rebuild from redb on startup. Append-only `insert` (re-inserting
an id orphans the old vertex; compactor reaps via rebuild).
- **TantivyIndex** (`index/tantivy.rs`) — BM25 with `chunk_id`-keyed
delete-then-add upsert + `delete_all_documents` for full rebuild.
Uses the `JiebaTokenizer` (`index/cjk.rs`) so Chinese queries
match against jieba-segmented tokens; ASCII queries round-trip
identically.
- **KbIndex composite** (`index/mod.rs`) — single handle wraps both
layers; `upsert_chunk` + `commit` are the worker write path.
- **Worker integration** — `ChunkAndEmbed` handler writes to both
indexes after the redb commit lands; failures propagate so the
worker retries (chunks are already durable in redb).
- **Filter** (`search/filter.rs`) — visibility + status + version +
tags + source_kind + doc_ids. Single source of truth for "can this
caller see this hit".
- **RRF + MMR** (`search/rrf.rs`, `search/mmr.rs`) — pure-function
fusion + diversity selector.
- **Pipeline** (`search/pipeline.rs`) — `SearchCtx::search` composes
dense + sparse → filter → fuse → MMR → lazy text fetch.
- **Tools** (`tools/`) — `kb_search`, `kb_fetch`, `kb_list_docs`,
`kb_similar`, `kb_search_entities`. JSON-shaped IO; `CallerScope`
is a separate function arg (runtime-injected, agent cannot supply).
- **Entity store** (`store/entities.rs`) — put_entity + get_entity +
find_by_surface scan + chunks_for_entity edges. Inverted index is a
Week 4 optimisation once entity extraction emits non-trivial counts.
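The fusion step can be pictured with a small reciprocal rank fusion sketch. This is not the code in `search/rrf.rs`: the function name `rrf_fuse`, the `String` chunk ids, and the caller-supplied constant `k` are assumptions made for illustration. Ties are broken by ascending chunk id so equal scores always come out in the same order.

```rust
use std::collections::HashMap;

/// Reciprocal rank fusion over several ranked lists of chunk ids.
/// Each list contributes 1 / (k + rank) per id, with rank starting at 1.
pub fn rrf_fuse(lists: &[Vec<String>], k: f32) -> Vec<(String, f32)> {
    let mut scores: HashMap<String, f32> = HashMap::new();
    for list in lists {
        for (i, id) in list.iter().enumerate() {
            let rank = i as f32 + 1.0;
            *scores.entry(id.clone()).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    let mut fused: Vec<(String, f32)> = scores.into_iter().collect();
    // Score descending, then chunk id ascending, so ties order deterministically.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}
```

A chunk that appears in both the dense and the sparse list accumulates two contributions, so it outranks a chunk that appears in only one list at a similar position.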
**Week 4 (Syncers + Compactor + CLI):**
- **`KbSourceSyncer` trait** (`sync/mod.rs`) — generic interface for
source-specific ingest. Async via `async-trait`.
- **`ManualUploadSyncer`** (`sync/manual.rs`) — file path → bytes →
`canonicalize_by_mime` → `ingest_canonicalized`. Used by
`rsclaw kb add <path>`.
- **`UrlSyncer`** (`sync/url.rs`) — `reqwest::get` with
ETag/Last-Modified conditional headers; falls back to content-hash
dedupe via `seen_items`. Cursor persisted as
`etag:` / `lastmod:` / `contenthash:` in `SyncState`.
- **`SyncRegistry`** (`sync/state.rs`) — loads and saves one `SyncState`
per `source_id`; a thin wrapper around `store::seen::{get,put}_sync_state`.
- **Compactor** (`compactor/mod.rs`) — orphan file scan with
grace-period guard + ledger advancement
`IndexingComplete → CleanupPending → Done`. Single
`run_compactor_tick` function, idempotent.
- **CLI** (`src/cli/kb.rs` + `src/cmd/kb.rs`) — `rsclaw kb add | ls |
rm | search | show | visibility | compact | stats`. `kb add`
synchronously drains the worker pool so follow-up `kb search`
sees fresh chunks immediately.
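As a rough illustration of the `UrlSyncer` cursor format, the sketch below parses the three prefixes and maps each to the HTTP conditional request header it would drive. The `Cursor` enum and both function names are hypothetical and not taken from `sync/url.rs`; only the prefixes come from this README, and the header names are the standard HTTP ones.

```rust
/// Hypothetical decoded form of a persisted cursor string.
#[derive(Debug, PartialEq)]
pub enum Cursor {
    ETag(String),
    LastModified(String),
    ContentHash(String),
}

/// Split a cursor such as `etag:"abc"` into its kind and value.
pub fn parse_cursor(s: &str) -> Option<Cursor> {
    if let Some(v) = s.strip_prefix("etag:") {
        Some(Cursor::ETag(v.to_string()))
    } else if let Some(v) = s.strip_prefix("lastmod:") {
        Some(Cursor::LastModified(v.to_string()))
    } else if let Some(v) = s.strip_prefix("contenthash:") {
        Some(Cursor::ContentHash(v.to_string()))
    } else {
        None
    }
}

/// Header to send on the next GET, if the cursor supports a conditional request.
pub fn conditional_header(cursor: &Cursor) -> Option<(&'static str, &str)> {
    match cursor {
        Cursor::ETag(v) => Some(("If-None-Match", v.as_str())),
        Cursor::LastModified(v) => Some(("If-Modified-Since", v.as_str())),
        // A content hash cannot be checked by the server; dedupe happens after download.
        Cursor::ContentHash(_) => None,
    }
}
```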
**Week 5 (Polish):**
- **CJK tokenizer** (`index/cjk.rs`) — `JiebaTokenizer` registered
as tantivy's `cjk` analyzer. The schema applies it to
`indexed_text` so Chinese BM25 queries actually match.
- **Regex entity extractor** (`entities/extract.rs`) — pulls URLs /
emails / hashtags / @-mentions out of each chunk's
`indexed_text`. The worker handler upserts `KbEntity` +
`KbEntityIndex` edges per chunk, activating
`kb_search_entities`. CJK hashtags supported.
- **`require_entities` + `boost_entities` in search pipeline** —
intersect/multiply on the fused result set against entity edges.
Covered by `tests/kb_entities_e2e.rs::require_entities_filters_to_chunks_with_mention`.
- **CLI completeness for spec §5 v1** — `kb show <doc_id>` lists
the doc's chunks; `kb rm --tag <name>` bulk-tombstones every
Active doc with that tag; `kb export <doc_id> --to <path>` writes
the canonical markdown body to disk; `kb stats` now reports
per-status doc counts + `kb_entities` / `kb_entity_index` /
`disk_bytes`; `kb add --recursive <dir>` ingests a directory
tree with an `--ext` filter (default `md,txt,html,pdf`).
- **HNSW snapshot persistence** (`index/hnsw.rs::{snapshot,restore}`)
— `kb compact` dumps `<paths.root>/hnsw/snapshot.*` via
`hnsw_rs::file_dump` plus a JSON sidecar with the `id_to_chunk`
map. `KbIndex::open_and_rebuild` tries `restore()` first, falling
back to `rebuild()` from redb. When a valid snapshot exists, this
avoids re-inserting every chunk on re-open of large stores.
- **`kb sync-all`** — refresh every Active URL doc whose
`SyncState.last_sync_at` is older than `--interval-min`
(default 20). Supports `--max` cap and `--dry-run`. Acts as a
manual scheduler tick until gateway-resident syncer ticks ship.
- **`kb search --json`** + `entity_alignment` + `warnings` in
every kb_search response — the same regex extractor that runs
on chunk text runs on the query so the agent can spot
cross-entity hallucinations (`query mentions [伊利] but none
of the chunks containing it appear in results`).
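The `entity_alignment` idea can be sketched in a few lines. This is a deliberately simplified stand-in, not the extractor in `entities/extract.rs`: it splits on whitespace instead of using a regex, handles only `#hashtag` and `@mention` tokens, and warns whenever a query entity is missing from the returned chunks. The function names and the warning text are assumptions.

```rust
use std::collections::BTreeSet;

/// Collect `#hashtag` and `@mention` tokens, dropping trailing ASCII punctuation.
pub fn extract_tags(text: &str) -> BTreeSet<String> {
    text.split_whitespace()
        .filter(|t| t.starts_with('#') || t.starts_with('@'))
        .map(|t| t.trim_end_matches(|c: char| c.is_ascii_punctuation()).to_string())
        // A bare "#" or "@" trims down to nothing useful; skip it.
        .filter(|t| t.chars().count() > 1)
        .collect()
}

/// One warning per query entity that no returned chunk mentions.
pub fn alignment_warnings(query: &str, result_chunks: &[&str]) -> Vec<String> {
    let in_results: BTreeSet<String> = result_chunks
        .iter()
        .flat_map(|chunk| extract_tags(chunk))
        .collect();
    extract_tags(query)
        .into_iter()
        .filter(|entity| !in_results.contains(entity))
        .map(|entity| format!("query mentions [{entity}] but no returned chunk contains it"))
        .collect()
}
```

Running the same extractor over the query and the chunk text is what makes the comparison meaningful: both sides normalise entities the same way.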
## What's NOT in Weeks 1–5
- BGE-M3 real embedder — Week 6 (`StubEmbedder` today; self-contained behind the `KbEmbedder` trait)
- Gateway-resident scheduler for syncer ticks — Week 6 (today:
manual `kb add <url>` and `kb sync-all` both work; user/cron
drives the cadence)
- LocalFolderSyncer, MailSyncer, ChatSyncer — V2 (post-MVP)
- `kb_explain` retrieval trace — V2 (post-MVP)
- Tauri admin UI — V2 (post-MVP)
- ML-based NER (replaces regex extractor) — V2 (post-MVP)
## Architecture invariants (verify after every code change)
1. **`chunk_id` depends on `logical_source_id`, never on `doc_id` or
`doc_version`**: re-ingesting the same file produces identical
`chunk_id`s. Covered by
`kb::model::chunk::tests::reingest_same_file_same_chunk_ids`,
`kb::chunker::tests::idempotent_chunk_ids`, and
`tests/kb_week1_e2e.rs::reingest_same_file_same_chunk_ids`.
2. **`KbDoc.visible_to(scope)` is the only visibility entry point**:
never call `KbVisibility::visible_to(scope, owner)` directly from
retrieval code — pairing the wrong owner is the most likely
scope-leak. Covered by
`kb::model::doc::tests::visibility_private_requires_matching_owner`
and `kbdoc_visible_to_pairs_owner_with_visibility`.
3. **`write_if_new` is truly atomic no-clobber**: never replace it
with `path.exists()` + `rename()`. That pair is a TOCTOU race, and
Unix `rename(2)` silently replaces an existing destination. Covered by
`kb::content_store::atomic::tests::write_if_new_concurrent_no_clobber`
(20-iteration thread race).
4. **Markdown paths are content-addressed**: layout is
`md/<kind>/<slug>--<lsid8>--<md8>.md` where `lsid8` =
`sha256(logical_source_id)[:8]` and `md8` =
`sha256(body)[:8]`. Same lsid + same content → same path
(idempotent re-ingest). Same lsid + new content (v2 ingest under a
stable seed) → different path; both versions coexist until the
Week 4 compactor reaps the old file. `stage_doc` still errors on
any body mismatch at a same-path hit (full 64-bit suffix
collision, ~2^-32). Covered by
`kb::content_store::paths::tests::markdown_rel_same_lsid_different_body_different_path`
and
`kb::content_store::tests::stage_same_lsid_different_body_lands_at_different_paths`.
5. **Files are stage-only**: nothing in `canonicalize/` or
`content_store/` deletes files. Deletion happens via the compactor
+ ledger reconciliation in Week 4.
6. **No SQL pretense**: redb queries are KV / range-scan only; never
use SQL terminology (no "partial unique index", no "UPDATE …
RETURNING").
7. **PII in logs goes through `util::redact`**: source ids and
content previews emit only `redact(s)` (first 8 hex of sha256).
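To make the no-clobber requirement of invariant 3 concrete, here is a minimal sketch built on `OpenOptions::create_new`, which asks the OS to fail if the path already exists. The name `write_if_new_sketch` is hypothetical, and the real `write_if_new` may use a different mechanism (for example a temp file plus a no-replace link); this sketch also does not hide a partially written file from concurrent readers.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Returns Ok(true) if this call created the file, Ok(false) if it already existed.
pub fn write_if_new_sketch(path: &Path, bytes: &[u8]) -> io::Result<bool> {
    // create_new makes "check for existence" and "create" a single OS-level step,
    // so two racing writers cannot both succeed.
    match OpenOptions::new().write(true).create_new(true).open(path) {
        Ok(mut file) => {
            file.write_all(bytes)?;
            file.sync_all()?;
            Ok(true)
        }
        Err(e) if e.kind() == io::ErrorKind::AlreadyExists => Ok(false),
        Err(e) => Err(e),
    }
}
```

The point of the design is that the loser of a race learns it lost, instead of overwriting the winner's bytes.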
### Added in Week 2
8. **All ingest writes happen in one redb tx** — `ingest_canonicalized`
commits `KbDoc` + `VersionPointer` + `IngestLedgerEntry` + `Job` +
`SeenItems` together. Splitting any of these into separate txs
reintroduces the Outbox bug: a doc visible to readers but no job
queued for chunking. Covered by
`kb::pipeline::ingest::tests::fresh_ingest_writes_all_tables`.
9. **NOOP re-check + version compute happen INSIDE the wtx** — these
reads use `*_in_wtx` accessor variants so a concurrent ingest with
the same `(lsid, raw_sha)` cannot pass NOOP-miss in both threads and
produce duplicate docs. redb's single-writer guarantee plus the
in-wtx re-check is the correctness hinge. Covered by
`kb::pipeline::ingest::tests::concurrent_ingest_same_bytes_produces_one_doc`.
10. **`ChunkAndEmbed` handler is idempotent** — re-running on the same
`doc_id` produces identical chunks (deterministic `chunk_id`) and
identical vectors. Re-runs after the ledger already advanced are
safe no-ops, not errors. Covered by
`kb::worker::handlers::chunk_embed::tests::idempotent_rerun_produces_same_chunks`
and `rerun_after_ledger_advanced_does_not_error`.
11. **Job dedupe is keyed on `JobKind::dedupe_key()`, not job_id** —
enqueueing the same logical work twice while a job is `Ready` or
`Running` returns the existing `job_id` without writing a duplicate.
Covered by `kb::store::jobs::tests::enqueue_dedupes_active_jobs`.
12. **`mark_done` / `mark_failed` verify the claim's fencing token** —
a zombie worker whose claim was reclaimed cannot transition the
job and clobber the new claimant. Covered by
`kb::store::jobs::tests::mark_done_with_wrong_token_errors` and
`mark_done_after_reclaim_errors`.
13. **Stalled claims auto-reclaim** — workers that crash mid-job leave
a claim with `expires_at` in the past; the next `reclaim_stale`
sweep resets the job to `Ready` (or fails it once `max_attempts` is
hit) and another worker re-runs it. Both the `WorkerPool` (tokio,
CLI/tests) and the gateway's `KnowledgeService::spawn_worker`
(std::thread, sweeps every 30s) drive this. Covered by
`tests/kb_week2_recovery.rs::stalled_claim_is_reclaimed_and_rerun`
and `kb::store::jobs::tests::reclaim_stale_fails_job_past_max_attempts`.
14. **`WorkerPool::shutdown()` exits in bounded time** — the AtomicBool
is checked at the top of each loop iteration and on every wake
from the idle sleep. Long-running handlers delay shutdown only
until they return. Covered by
`kb::worker::pool::tests::shutdown_exits_within_poll_idle_plus_margin`.
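The interaction between fencing tokens and stale-claim reclaim (invariants 12 and 13) can be modelled in memory. The `Job` struct, its fields, and the integer clock below are illustrative assumptions, not the redb-backed types in `store/jobs`; only the `claim_token_expired` error string is taken from this README.

```rust
/// In-memory stand-in for one job row; names are illustrative.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum JobStatus {
    Ready,
    Running { token: u64, expires_at: u64 },
    Done,
    Failed,
}

#[derive(Debug)]
pub struct Job {
    pub status: JobStatus,
    pub attempts: u32,
    pub last_error: Option<String>,
    next_token: u64,
}

impl Job {
    pub fn new() -> Self {
        Job { status: JobStatus::Ready, attempts: 0, last_error: None, next_token: 0 }
    }

    /// Claim a Ready job; every claim gets a fresh, strictly larger token.
    pub fn claim(&mut self, now: u64, ttl: u64) -> Option<u64> {
        if self.status != JobStatus::Ready {
            return None;
        }
        self.next_token += 1;
        self.attempts += 1;
        self.status = JobStatus::Running { token: self.next_token, expires_at: now + ttl };
        Some(self.next_token)
    }

    /// Reset an expired claim to Ready, or fail the job once attempts run out.
    pub fn reclaim_stale(&mut self, now: u64, max_attempts: u32) -> bool {
        match self.status {
            JobStatus::Running { expires_at, .. } if expires_at <= now => {
                self.last_error = Some("claim_token_expired".to_string());
                self.status = if self.attempts >= max_attempts {
                    JobStatus::Failed
                } else {
                    JobStatus::Ready
                };
                true
            }
            _ => false,
        }
    }

    /// Only the holder of the current claim's token may finish the job.
    pub fn mark_done(&mut self, token: u64) -> Result<(), &'static str> {
        match self.status {
            JobStatus::Running { token: current, .. } if current == token => {
                self.status = JobStatus::Done;
                Ok(())
            }
            _ => Err("claim token does not match the current claim"),
        }
    }
}
```

A worker whose claim expired and was handed to someone else still holds the old token, so its late `mark_done` is rejected rather than overwriting the new claimant's state.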
### Added in Week 3
15. **Visibility filter runs on every retrieval call** — every
`tools/kb_*` entry point goes through `search::filter::keep_doc` +
`is_latest_version`. There is no caller-supplied bypass. Covered
by
`kb::search::pipeline::tests::search_filter_by_visibility_hides_private`.
16. **HNSW + tantivy are caches over redb** — losing either is a
rebuild, not data loss. `KbIndex::open_and_rebuild` reconstructs
both from `kb_chunks` on startup. Covered by
`kb::index::hnsw::tests::rebuild_then_search_returns_hits` and
`kb::index::tests::open_and_rebuild_recovers_both_layers`.
17. **Tantivy upsert deletes-by-term before add** — re-running
`chunk_embed` on the same chunk_id replaces the indexed text
rather than producing a duplicate match. Covered by
`kb::index::tantivy::tests::upsert_replaces_previous`.
18. **CallerScope is injected by the runtime, not by tool input** —
`kb_search::KbSearchInput` deliberately has no `caller_scope`
field; the runtime constructs scope from auth context and passes
it as a separate function argument to `tools::*::run`.
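The visibility rules (invariants 2, 15 and 18) are easier to see in a toy model. Everything below is an assumption for illustration: the two-variant `KbVisibility`, the `CallerScope` with a single optional user id, and the `filter_visible` helper are simplified stand-ins, not the real definitions in `kb::model` or `search/filter.rs`.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum KbVisibility {
    Public,
    Private,
}

/// Built by the runtime from auth context; never deserialized from tool input.
pub struct CallerScope {
    pub user_id: Option<String>,
}

impl KbVisibility {
    // Kept private so retrieval code cannot pair a visibility with the wrong owner.
    fn visible_to(self, scope: &CallerScope, owner: Option<&str>) -> bool {
        match self {
            KbVisibility::Public => true,
            KbVisibility::Private => match (scope.user_id.as_deref(), owner) {
                (Some(caller), Some(owner)) => caller == owner,
                _ => false,
            },
        }
    }
}

pub struct KbDoc {
    pub visibility: KbVisibility,
    pub owner_user_id: Option<String>,
}

impl KbDoc {
    /// Single entry point: always pairs this doc's own owner with its visibility.
    pub fn visible_to(&self, scope: &CallerScope) -> bool {
        self.visibility.visible_to(scope, self.owner_user_id.as_deref())
    }
}

/// The scope is a separate argument, so a search request body cannot override it.
pub fn filter_visible<'a>(docs: &'a [KbDoc], scope: &CallerScope) -> Vec<&'a KbDoc> {
    docs.iter().filter(|doc| doc.visible_to(scope)).collect()
}
```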
### Added in Week 4
19. **All syncers go through `ingest_canonicalized`** — `ManualUpload`
and `Url` syncers both terminate in `ingest_canonicalized(...)`,
so spec §J's atomicity contract holds for every ingest path. No
syncer ever writes to redb directly.
20. **UrlSyncer conditional-get uses SyncState.cursor** — every
304 Not Modified response counts as `docs_skipped`, never
`docs_added`. There is no dedicated UrlSyncer test yet; the closest
coverage is `manual_syncer_dedupes_identical_bytes`, which exercises
the same dedupe pattern. The UrlSyncer integration test is deferred
to Week 6 because it needs a `wiremock` dependency.
21. **Compactor never deletes files referenced by any KbDoc** —
`referenced_paths` unions over every doc's
`markdown_path` + `raw_path` plus every Pending/IndexingComplete
ledger entry's `new_paths`. The grace period (default 1h) guards
against in-flight ingest. Covered by
`kb::compactor::tests::referenced_file_preserved`.
22. **CLI is a thin wrapper over the library surface** — every
`rsclaw kb` subcommand calls into Week 2–3's tool surface
(`ingest_canonicalized`, `kb_search`, `kb_list_docs`, `kb_fetch`)
or the new Week 4 syncer/compactor functions. `kb add` drains
the worker pool synchronously so an immediate `kb search` sees
fresh chunks.
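The compactor's safety rule (invariant 21) boils down to two filters: skip anything referenced, and skip anything younger than the grace period. The sketch below assumes a hypothetical `FileInfo` record and function name; the real `run_compactor_tick` reads its inputs from redb and the filesystem.

```rust
use std::collections::HashSet;
use std::path::PathBuf;
use std::time::{Duration, SystemTime};

/// Hypothetical record for one file found by the orphan scan.
pub struct FileInfo {
    pub path: PathBuf,
    pub modified: SystemTime,
}

/// Files that are safe to delete: unreferenced AND older than the grace period.
pub fn orphans_to_delete(
    files: &[FileInfo],
    referenced: &HashSet<PathBuf>,
    now: SystemTime,
    grace: Duration,
) -> Vec<PathBuf> {
    files
        .iter()
        .filter(|f| !referenced.contains(&f.path))
        .filter(|f| match now.duration_since(f.modified) {
            Ok(age) => age >= grace,
            // A modification time in the future is treated as "too new to delete".
            Err(_) => false,
        })
        .map(|f| f.path.clone())
        .collect()
}
```

The grace period matters because an in-flight ingest stages its file before the referencing rows are committed; a freshly staged file can look like an orphan for a short time.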
### Added in Week 5 (Polish)
23. **CJK BM25 search works** — `JiebaTokenizer` is registered as
tantivy's `cjk` analyzer and applied to the `indexed_text`
field's `TextOptions`. The default whitespace+lowercase
analyzer reduced Chinese sentences to a single un-searchable
token; jieba splits them into searchable terms. Covered by
`kb::index::tantivy::tests::chinese_query_matches_chinese_doc`.
24. **Entity edges land on every chunk write** — the regex
extractor (`entities/extract.rs`) runs inside the same
`wtx` as the chunk insert, so `KbEntityIndex` rows are
consistent with chunks. `kb_search_entities` returns these
edges; `require_entities` / `boost_entities` filters in
`search::pipeline` are wired against them. Covered by
`tests/kb_entities_e2e.rs::entities_extracted_and_queryable`
and `require_entities_filters_to_chunks_with_mention`.
25. **CLI fully covers spec §5 v1** — `add | ls | rm | search |
show | visibility | compact | stats | export`. `rm` accepts
either a `doc_id` or `--tag <name>` for bulk tombstone;
`show` resolves doc_ids to a chunk list and chunk_ids to a
single-chunk fetch with neighbors. `stats` reports per-status
doc counts and on-disk bytes. `add --recursive <dir>` ingests
a directory tree.
26. **HNSW snapshot survives process restart** — `kb compact`
dumps the dense layer to `hnsw/snapshot.*`. Subsequent
`KbIndex::open_and_rebuild` calls restore in-place rather than
re-inserting every chunk. Empty caches still write a meta
sidecar so restore is symmetric. Covered by
`kb::index::hnsw::tests::snapshot_roundtrip_preserves_search`
and `snapshot_empty_cache_writes_meta_only`.
27. **Tombstoned docs resurrect on same-content re-ingest** — spec
§6 keeps Tombstoned docs for 30 days. Re-adding the same file
within that window flips status back to Active rather than
silently NOOP-returning the hidden doc. Both the read-only
fast path and the wtx-scoped re-check honour this. Covered by
`kb::pipeline::ingest::tests::tombstoned_doc_resurrects_on_reingest`.
28. **CLI smoke tests** — `tests/kb_cli_smoke.rs` invokes the
compiled `rsclaw` binary via `CARGO_BIN_EXE_rsclaw`. The suite
covers the full `kb` subcommand surface and guards against
arg-parsing and output-format regressions.
29. **Retrieval output is byte-deterministic** — `search::pipeline`
sorts the post-MMR result by `(score desc, chunk_id asc)` so
the wire bytes are stable across calls with the same inputs.
Spec §3 "KV cache 友好" (KV-cache friendly): identical search inputs
must produce identical agent context across turns or the cache fragments.
30. **HNSW snapshot has a schema_version** — `HnswMeta.schema_version`
bumps on format changes. Restore errors instead of panicking
on mismatch; the operator can delete the `hnsw/` directory
to force a rebuild from redb (cache, not source of truth).
31. **`reclaim_stale` leaves an audit trail** — every job reset
from Running→Ready gets `last_error =
"claim_token_expired"` inside the same wtx. Operators reading
`kb_jobs_by_id` see exactly why each job came back.
32. **`UrlSyncer` classifies HTTP failures** — 401/403 →
`AuthFailed`, 429 (with Retry-After parsed) → `RateLimited`,
other 4xx → `Permanent` (no point retrying), 5xx →
`Network` (transient). `SyncError` variants are usable
end-to-end now.
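The status-code mapping in invariant 32 can be written as a single match. The enum below is a hypothetical mirror of the `SyncError` variants named above, and `classify_status` is an assumed name; the sketch only parses the delta-seconds form of `Retry-After`, not the HTTP-date form.

```rust
use std::time::Duration;

/// Hypothetical mirror of the failure classes named in invariant 32.
#[derive(Debug, PartialEq)]
pub enum SyncErrorKind {
    AuthFailed,
    RateLimited { retry_after: Option<Duration> },
    Permanent,
    Network,
}

/// Map an HTTP status to a failure class; `None` means the status is not a failure.
pub fn classify_status(status: u16, retry_after: Option<&str>) -> Option<SyncErrorKind> {
    match status {
        401 | 403 => Some(SyncErrorKind::AuthFailed),
        429 => Some(SyncErrorKind::RateLimited {
            // Only the "number of seconds" form of Retry-After is handled here.
            retry_after: retry_after
                .and_then(|v| v.trim().parse::<u64>().ok())
                .map(Duration::from_secs),
        }),
        // Remaining 4xx: retrying the same request will not help.
        400..=499 => Some(SyncErrorKind::Permanent),
        // 5xx: treated as transient, so the caller may retry later.
        500..=599 => Some(SyncErrorKind::Network),
        _ => None,
    }
}
```

The specific arms (401, 403, 429) must come before the `400..=499` range, since Rust takes the first matching arm.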
## Quick start
### CLI (everyday flow)
```bash
# Add a file (synchronously chunks + indexes in CLI-only mode)
rsclaw kb add ~/Documents/manual.md --tags personal
# Add a directory recursively
rsclaw kb add ~/Documents/notes --recursive --ext md,txt --tags wiki
# Add a URL (conditional GET via ETag/Last-Modified on re-run)
rsclaw kb add https://example.com/changelog.html --tags changelog
# Search (hybrid: HNSW + tantivy BM25 + RRF + MMR)
rsclaw kb search "brown fox" -k 5
# List + filter
rsclaw kb ls --tag wiki --limit 20
rsclaw kb show <doc_id> # metadata + chunk list
rsclaw kb show <chunk_id> # single chunk + neighbors
rsclaw kb visibility <doc_id> private
# Maintenance
rsclaw kb compact # orphan-file scan + HNSW snapshot
rsclaw kb sync-all --dry-run # refresh stale URL docs
rsclaw kb stats # per-status counts + disk_bytes
rsclaw kb export <doc_id> --to ./out.md
# Delete (tombstone — kept 30 days for recovery)
rsclaw kb rm <doc_id> --yes
rsclaw kb rm --tag stale --yes # bulk by tag
# Re-add the same file within 30 days resurrects the doc.
```
### Rust API (embedders + tests)
```rust
use rsclaw::kb::{
    canonicalize_by_mime, detect_mime, ingest_canonicalized,
    CanonicalizeInput, HandlerCtx, IngestInput, KbEmbedder, KbIndex,
    KbPaths, KbStore, StubEmbedder, WorkerConfig, WorkerPool,
};
use std::sync::Arc;
# async fn demo() -> anyhow::Result<()> {
let tmp = tempfile::TempDir::new()?;
let store = Arc::new(KbStore::open(&tmp.path().join("kb.redb"))?);
let paths = Arc::new(KbPaths::new(tmp.path().join("kb")));
paths.ensure_layout()?;
let embedder: Arc<dyn KbEmbedder> = Arc::new(StubEmbedder::default());
let index = Arc::new(KbIndex::open(&paths)?);
// Start the worker pool (requires multi-threaded tokio runtime).
let ctx = HandlerCtx {
    store: store.clone(),
    paths: paths.clone(),
    embedder: embedder.clone(),
    index: index.clone(),
};
let pool = WorkerPool::start(ctx, WorkerConfig::default());
// Ingest a doc.
let bytes = std::fs::read("manual.md")?;
let mime = detect_mime(&bytes, Some("manual.md"));
let canon = canonicalize_by_mime(CanonicalizeInput {
    bytes: &bytes,
    mime: &mime,
    hint_title: Some("manual.md"),
    logical_source_id_seed: None,
})?
.unwrap();
let out = ingest_canonicalized(
    &store,
    IngestInput {
        canon: &canon,
        raw_bytes: &bytes,
        raw_ext: "md",
        visibility: None,
        owner_user_id: None,
        seen_key: None,
        source: None,
        paths: &paths,
    },
)?;
println!("doc_id: {}", out.doc_id);
// Worker pool picks up the ChunkAndEmbed job asynchronously and
// writes chunks + vectors into kb_chunks. See
// `tests/kb_week2_pipeline.rs` for the full async wait pattern.
pool.shutdown().await;
# Ok(()) }
```
## Testing
```bash
cargo test -p rsclaw --lib kb:: # unit tests (~200)
cargo test --test kb_week1_e2e # Week 1 integration (6)
cargo test --test kb_week2_pipeline # Week 2 async e2e (1)
cargo test --test kb_week2_recovery # Week 2 crash recovery (2)
cargo test --test kb_week3_search # Week 3 retrieval e2e (1)
cargo test --test kb_week4_syncers # Week 4 syncer e2e (2)
cargo test --test kb_week4_compactor # Week 4 compactor integration (2)
cargo test --test kb_entities_e2e # Week 5 entity extraction (2)
cargo test --test kb_cli_smoke # CLI smoke (11)
cargo test --test kb_tools_e2e # kb_fetch/similar/list_docs (7)
```
End-to-end CLI smoke:
```bash
printf '# Hello\n\nThe quick brown fox.\n' > /tmp/doc.md
rsclaw --base-dir /tmp/kbdemo kb add /tmp/doc.md --tags demo
rsclaw --base-dir /tmp/kbdemo kb search "brown fox"
rsclaw --base-dir /tmp/kbdemo kb stats
```