basemind 0.2.2

Full AI context layer over MCP — tree-sitter code-map, document RAG (PDF/Office/HTML/email + OCR + reranker), shared agent memory, on-demand web crawl, git history + blame + per-symbol diff. 300+ languages, 8 coding-agent harnesses, content-addressed Fjall + LanceDB.
# basemind

Full AI context layer for coding agents — code-map, document RAG, shared memory, web crawl,
git history. 300+ languages, one MCP server.

[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)
[![crates.io](https://img.shields.io/crates/v/basemind.svg)](https://crates.io/crates/basemind)
[![npm](https://img.shields.io/npm/v/basemind.svg)](https://www.npmjs.com/package/basemind)
[![PyPI](https://img.shields.io/pypi/v/basemind.svg)](https://pypi.org/project/basemind/)
<!-- markdownlint-disable-next-line MD013 -->
[![CI](https://github.com/Goldziher/basemind/actions/workflows/ci.yaml/badge.svg)](https://github.com/Goldziher/basemind/actions/workflows/ci.yaml)

<!-- markdownlint-disable-next-line MD057 -->
![statusline](docs/assets/statusline.png)

<!-- TODO: replace with screenshot after Commit 2 ships and user provides screenshot -->

---

## The four pillars

**Code** — Tree-sitter outlines, symbol search, reference + caller + implementation graphs,
call chains, git history per symbol, blame at symbol-level resolution.

**Documents** — Ingest + semantic search over PDFs, Office (Word/Excel/iWork), HTML, email,
archives. Built-in OCR, layout detection, keyword + NER extraction, cross-encoder reranking.
All ONNX bundled — no system install needed.

**Memory** — Per-repo scoped key-value + semantic vector storage. Clones of the same git
origin automatically share memory; unrelated repos stay isolated.

**Web** — On-demand HTTP scrape + follow-link crawl. Pages chunk, embed, and land in the
documents store under scope `web:<host>` for unified search.
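
The two scoping rules above can be sketched roughly as follows. This is an illustration, not basemind's implementation: `memory_scope` and `web_scope` are hypothetical names, and the URL normalization (treating SSH and HTTPS remotes of one repository as the same origin, dropping a trailing `.git`) is an assumption about how "same git origin" might be decided.

```python
import re
from urllib.parse import urlsplit


def memory_scope(origin_url: str) -> str:
    """Map a git origin URL to a scope key (hypothetical normalization)."""
    url = origin_url.strip()
    # scp-style SSH remote: user@host:owner/repo(.git)
    match = re.match(r"^[\w.-]+@([\w.-]+):(.+)$", url)
    if match is None:
        # URL-style remote: scheme://[user@]host[:port]/owner/repo(.git)
        match = re.match(r"^\w+://(?:[^@/]+@)?([^/:]+)(?::\d+)?/(.+)$", url)
    if match is None:
        raise ValueError(f"unrecognized origin URL: {origin_url!r}")
    host, path = match.group(1).lower(), match.group(2).rstrip("/")
    if path.endswith(".git"):
        path = path[: -len(".git")]
    return f"{host}/{path}"


def web_scope(page_url: str) -> str:
    """Scope for a crawled page, in the `web:<host>` form described above."""
    host = urlsplit(page_url).hostname  # lowercased by urlsplit, port stripped
    if not host:
        raise ValueError(f"URL has no host: {page_url!r}")
    return f"web:{host}"
```

Under these assumptions, `git@github.com:Goldziher/basemind.git` and `https://github.com/Goldziher/basemind` resolve to one memory scope, and every page crawled from `docs.rs` lands under `web:docs.rs`.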

---

## Feature table

<!-- markdownlint-disable MD013 -->

| Pillar | What it does | MCP tools | Backend |
|---|---|---|---|
| **Code intelligence** | Outlines, symbol search, refs/callers/callees, call graphs, impl lookup, dependents, in-tree regex | `outline`, `search_symbols`, `workspace_grep`, `find_references`, `find_callers`, `call_graph`, `find_implementations`, `dependents`, `list_files`, `status`, `repo_info` | tree-sitter × 300+ langs · Fjall LSM index · content-addressed blob store |
| **Git intelligence** | Symbol-level history, blame, churn, recent changes, structural diffs across revs | `symbol_history`, `blame_file`, `blame_symbol`, `hot_files`, `recent_changes`, `commits_touching`, `find_commits_by_path`, `diff_outline`, `diff_file`, `working_tree_status` | gix + sha-keyed disk cache |
| **Document RAG** | Ingest + semantic search over PDFs, Office (Excel/Word/HWP/iWork), HTML, XML, email, archives. Adds OCR (Tesseract + PaddleOCR), cross-encoder reranker, keyword extraction (YAKE/RAKE), NER (gline-rs ONNX + LLM), extractive + abstractive summarization, layout detection, page auto-rotate, redaction, language detection. All ONNX models bundled — no system install needed. | `search_documents` | kreuzberg + LanceDB |
| **Shared memory** | Per-repo scoped key-value + semantic memory. Clones of the same git origin URL automatically share memory; unrelated repos isolated. | `memory_put`, `memory_get`, `memory_list`, `memory_search`, `memory_delete` | LanceDB + Fjall, scope-keyed |
| **Web crawl** | On-demand HTTP scrape + link-following crawl. Crawled pages route through the documents pipeline (chunk → embed → LanceDB) under scope `web:<host>`. | `web_scrape`, `web_crawl`, `web_map` | kreuzcrawl (native HTTP, no chromium) |
| **Admin** | Live rescan + telemetry dashboard | `rescan`, `telemetry_summary` ||

<!-- markdownlint-enable MD013 -->

---

## Quickstart

### Claude Code

```text
/plugin marketplace add Goldziher/basemind
/plugin install basemind@basemind
```

Restart the session. The basemind binary installs automatically on first use (via npx, uvx, or
direct download with checksum verification). To enable a live statusline, run `/bm-statusline`
once — this wires the statusline into `~/.claude/settings.json` (plugins cannot set it
automatically).

Output: `◆ basemind  ●  1,247 files · 23m ago  │  47 calls · 14k saved`. Counts render bright; the
state dot is green (serve active / scan < 1 h), amber (idle or scan 1–24 h), or red (no serve and
stale index). When a document/memory/web index is present, a third segment appears: `│  312 docs ·
18 mem · 4 sites`. Narrow terminals collapse to `◆ basemind ● 1.2k · 23m │ 47c · 14k saved`.
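
One possible reading of the state-dot and count-compaction rules, as a small sketch. The function names are hypothetical, and the exact thresholds and rounding basemind uses may differ; in particular, "idle" is interpreted here as "no active serve".

```python
def state_dot(serve_active: bool, scan_age_hours: float) -> str:
    """Pick the dot color from serve state and index age (assumed rules)."""
    if serve_active or scan_age_hours < 1:
        return "green"   # serve active, or last scan under 1 h ago
    if scan_age_hours <= 24:
        return "amber"   # no serve, last scan 1-24 h ago
    return "red"         # no serve and a stale index


def compact_count(n: int) -> str:
    """Abbreviate counts for narrow terminals, e.g. 1247 -> '1.2k'."""
    if n < 1000:
        return str(n)
    # One decimal place, with a trailing '.0' dropped.
    return f"{n / 1000:.1f}".rstrip("0").rstrip(".") + "k"
```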

### Any MCP client

```bash
cargo install basemind --features full --locked
```

Then add to your MCP config:

```json
{
  "mcpServers": {
    "basemind": {
      "command": "basemind",
      "args": ["serve"]
    }
  }
}
```

Supported harnesses: Claude Code · Cursor · Codex (CLI + App) · Gemini · OpenCode · Factory Droid ·
GitHub Copilot CLI · Continue · Cline. Install instructions are in the
[Harness-specific setup](#harness-specific-setup) section below; a harness without its own row there
uses the generic MCP config above.

### CLI only

```bash
basemind scan                     # index the working tree
basemind query outline path/file.rs  # inspect structure
basemind query symbol "parseQuery"   # find by name
basemind watch                    # live re-index on file change
```

---

## Why basemind, specifically

### vs grep / ripgrep

**What ripgrep does well:** blazing-fast line matching. **What it misses:**

- Grep returns 50+ hits in docs, tests, comments, variable names — agent wastes context filtering noise.
- No scope awareness: `parseQuery()` and `parseQuery` string both match; semantic signals lost.
- Every query re-scans the disk; no pre-computed structures to leverage.

basemind: semantic-quality answers at grep speed via tree-sitter + indexed call sites.

### vs vector-only RAG (LangChain / LlamaIndex DIY stacks)

**What vector RAG does well:** fuzzy document semantic search. **What it misses:**

- Pure embeddings lose exact structure — which function calls which, which class implements which interface.
- No line/column resolution — agent can't map vector hits back to code symbols.
- No git history integration — "what changed recently?" and "who wrote this?" require separate systems.

basemind: code structure + git history + vector memory + document search all in one, unified scope.

### vs context7 / openai-codex / Aider's repo-map

**What these do well:** generate code-map summaries. **What they miss:**

- Static snapshots — stale after the first edit.
- No semantic indexing — every lookup re-parses or re-scans.
- Human-focused output (markdown) instead of agent-facing structure (JSON tools).

basemind: live-updated index with sub-millisecond code-map lookups, built for agents, not humans.

### vs GitHub native search

**What GitHub does well:** repository-wide fuzzy text search. **What it misses:**

- Cloud-only — your code leaves the machine, latency is network-bound.
- No local-editor integration — agent can't query in-progress edits before commit.
- No cross-language polyglot support — each language's search tuned separately.

basemind: local-only, always-fresh index of your working tree, 300+ languages in one sweep.

---

## Performance

Measured on Apple Silicon, release build, `--features full`, default `eager_l2 = true`. Cold
filesystem cache adds ~50% to first scan; numbers below are warm steady-state.

### Scan throughput

| Repo | Files | Language mix | Time |
|---|---|---|---|
| tokio | 859 | Rust | 0.2 s |
| react | 7 061 | TS / JSX | 2.2 s |
| django | 7 061 | Python | 2.5 s |
| requests | 2 195 | Python | 0.7 s |
| gin | 1 217 | Go | 1.0 s |
| ripgrep | 12 851 | Rust | 4.0 s |
| ripgrep-shallow | 12 851 | Rust | 0.16 s |
| TypeScript compiler | 81 324 | TS / JS / JSON | ~22 s |

The TypeScript compiler is the worst case — 81k files scanned in 22 seconds. Most real repos sit
between tokio and ripgrep. Re-scans skip unchanged content hashes, so warm rescans on edited
working trees are typically dominated by the changed-set size, not repo size.

### Per-tool MCP latency

Against the 81k-file TypeScript index:

<!-- markdownlint-disable MD013 -->

| Latency | Tools |
|---|---|
| < 1 ms | `outline`, `list_files`, `find_references`, `find_callers`, `find_implementations`, `hot_files`, `repo_info` |
| 3–6 ms | `search_symbols`, `call_graph` |
| 4–10 ms | `recent_changes`, `commits_touching`, `find_commits_by_path`, `symbol_history`, `diff_outline`, `diff_file` |
| 20–25 ms | `status` |
| 30–40 ms | `blame_file`, `blame_symbol` |
| 40–200 ms | `workspace_grep` |
| ~200 ms | `search_documents` |
| 350–600 ms | `working_tree_status` |

<!-- markdownlint-enable MD013 -->

basemind preloads L1 outlines into RAM on `serve` start, so code-map queries hit no disk. The Fjall
LSM inverted index handles ref/caller/impl lookups without scanning blobs. Git tool latency tracks
the `gix` walk cost, which dominates only on enormous histories.
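
To illustrate why an inverted index avoids scanning blobs, here is a minimal in-memory sketch. It stands in for the Fjall keyspace with a plain dictionary; `CallIndex` and its methods are hypothetical names, not basemind's API.

```python
from collections import defaultdict


class CallIndex:
    """Toy inverted index: callee name -> list of (file, line) call sites."""

    def __init__(self):
        self._callers = defaultdict(list)

    def add_file(self, path, calls):
        # `calls` is the per-file list of (callee, line) pairs found at scan time.
        for callee, line in calls:
            self._callers[callee].append((path, line))

    def find_callers(self, callee):
        # A single key lookup; no per-file data is re-read at query time.
        return sorted(self._callers.get(callee, []))


index = CallIndex()
index.add_file("src/a.rs", [("parse_query", 10), ("render", 42)])
index.add_file("src/b.rs", [("parse_query", 7)])
callers = index.find_callers("parse_query")
```

The work is paid once per file at scan time, so query cost depends on the number of hits rather than on the size of the repository.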

---

## Configuration

The full config schema lives at `schema/basemind-config-v1.schema.json`. Minimal example:

```toml
# .basemind/basemind.toml
file_watch_glob = "**/*.{rs,ts,tsx,py,go}"
eager_l2 = true

[documents]
enabled = true
```

Per-query MCP overrides:

```json
{
  "query": "what does kreuzberg do?",
  "reranker_enabled": true,
  "reranker_preset": "bge-reranker-base"
}
```
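
For clients that speak MCP directly, a request carrying these overrides might be built as below. The `tools/call` method and the `name`/`arguments` envelope come from the MCP specification; passing the fields above as the `arguments` of `search_documents` is an assumption, and `tools_call` is a hypothetical helper.

```python
import json


def tools_call(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize one MCP `tools/call` request as a single JSON-RPC 2.0 line."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message)


line = tools_call(
    1,
    "search_documents",
    {
        "query": "what does kreuzberg do?",
        "reranker_enabled": True,
        "reranker_preset": "bge-reranker-base",
    },
)
```

Over the stdio transport, each such message is written to the server's stdin as one newline-terminated line.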

Environment variables map mechanically: `--llm-api-key` ↔ `BASEMIND_LLM_API_KEY`. Every MCP tool
accepts per-query overrides that win over file/env/CLI layers.
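
A sketch of the flag-to-variable mapping and the layering described here, using the precedence order listed under Differentiators (MCP > CLI > env > TOML > defaults). `flag_to_env` and `resolve` are hypothetical helpers written for illustration; the real resolution lives in basemind's Rust config code.

```python
PRECEDENCE = ["mcp", "cli", "env", "toml", "defaults"]  # highest priority first


def flag_to_env(flag: str, prefix: str = "BASEMIND_") -> str:
    """Map a CLI flag to its environment variable name."""
    return prefix + flag.lstrip("-").replace("-", "_").upper()


def resolve(key: str, layers: dict) -> tuple:
    """Return (value, source) from the highest-priority layer that sets `key`."""
    for source in PRECEDENCE:
        if key in layers.get(source, {}):
            return layers[source][key], source
    raise KeyError(key)


layers = {
    "defaults": {"reranker_enabled": False, "eager_l2": True},
    "toml": {"reranker_enabled": False},
    "env": {"reranker_enabled": True},
    "mcp": {"reranker_enabled": True},
}
value, source = resolve("reranker_enabled", layers)
```

Returning the winning layer alongside the value is what makes it possible to answer "where did this setting come from?" when debugging.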

---

## Architecture

```text
source files
  → tree-sitter parsers (300+ langs, pack name dispatch)
  → L1 outlines + L2 calls + L3 structural hash blobs (content-addressed)
  → Fjall LSM inverted index (symbols / calls / imports / impls)
  → MCP server (rmcp) + documents pipeline (kreuzberg) → LanceDB
  → 32 MCP tools across 8 coding-agent harnesses
```

- **Scanner** (`src/scanner.rs`) — rayon-parallel walker over the gitignore-aware file set.
  Extracts L1 (symbols + imports + implementations), L2 (calls + docs), L3 (structural hashes)
  per file.
- **Content-addressed blobs** (`src/store.rs`) — msgpack at
  `.basemind/blobs/<blake3>.{l1,l2,l3}.msgpack`. Two files with identical content share the
  same blob.
- **Inverted index** (`src/index/`) — Fjall LSM keyspace at
  `.basemind/views/<view>/index.fjall/`. Nine partitions drive symbol search, references,
  implementations, and dependents.
- **MCP surface** (`src/mcp/`) — stdio JSON-RPC via rmcp. Tool descriptions are the routing surface
  for agents, so they state semantics honestly (substring vs prefix, scope-aware vs name-only, capped).
- **Git layer** (`src/git.rs`, `src/git_cache.rs`) — gix-backed blame, log, diff, status.
  Sha-keyed disk cache makes warm queries free.
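
The content-addressing idea can be sketched in a few lines. This is a simplified illustration: it uses `hashlib.blake2b` from the Python standard library as a stand-in for BLAKE3, keeps blobs in a dictionary instead of msgpack files, and `BlobStore` is a hypothetical name.

```python
import hashlib


class BlobStore:
    """Toy content-addressed store keyed by a hash of the file bytes."""

    def __init__(self):
        self.blobs = {}
        self.extractions = 0  # how many times content actually had to be parsed

    def put(self, content: bytes) -> str:
        # blake2b is a stand-in here; basemind itself uses BLAKE3.
        digest = hashlib.blake2b(content, digest_size=32).hexdigest()
        if digest not in self.blobs:
            self.extractions += 1  # unseen content: extract and store once
            self.blobs[digest] = f"<outline of {len(content)} bytes>"
        return digest

    @staticmethod
    def path(digest: str, layer: str = "l1") -> str:
        # Mirrors the on-disk layout named above.
        return f".basemind/blobs/{digest}.{layer}.msgpack"


store = BlobStore()
first = store.put(b"fn main() {}\n")
copy = store.put(b"fn main() {}\n")      # identical content elsewhere in the tree
edited = store.put(b"fn main() { run(); }\n")
```

Because the key depends only on content, an identical file or an unchanged file on rescan resolves to an existing blob and skips extraction.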

---

## Installation

<!-- markdownlint-disable MD013 -->

| Channel | Command | Platforms | Features |
|---|---|---|---|
| Homebrew | `brew install Goldziher/tap/basemind` | macOS, Linux | base |
| npm | `npm install -g basemind` | any Node 14+ platform | base |
| pip | `pip install basemind` | any Python 3.8+ platform | base |
| cargo | `cargo install basemind --locked` | any Rust platform | base |
| cargo (full) | `cargo install basemind --features full --locked` | any Rust platform | documents + memory + crawl |
| GH releases | Download binary from [releases](https://github.com/Goldziher/basemind/releases) | macOS · Linux · Windows | base |

<!-- markdownlint-enable MD013 -->

### Harness-specific setup

| Harness | Install command |
|---|---|
| Claude Code | `/plugin marketplace add Goldziher/basemind` then `/plugin install basemind@basemind` |
| Cursor | See Cursor docs for plugin install flow; `basemind` manifest at `.cursor-plugin/plugin.json` |
| Codex CLI | `/plugins` then search for `basemind` |
| Codex App | Plugins panel → Coding category → basemind → `+` |
| Gemini CLI | `gemini extensions install https://github.com/Goldziher/basemind` |
| OpenCode | Add `{ "plugin": ["basemind-opencode@latest"] }` to `opencode.json` |
| Factory Droid | `droid plugin --help` (manifest at `.claude-plugin/marketplace.json`) |
| GitHub Copilot CLI | `copilot plugin --help` (same manifest) |
| Generic MCP | See "Any MCP client" section above |

---

## Differentiators

- **Content-addressed dedup** — Blake3-hashed L1/L2/L3 blobs deduplicated across files and
  views. Edit a file, rescan, skip unchanged hashes.
- **Secret-masking `SecretString`** — api_key fields redacted in Debug/Display/Serialize.
  Tracing spans and panic messages never leak the value.
- **Provenance ledger** — every config value's origin tracked via `ConfigSource` (MCP > CLI >
  env > TOML > defaults). Audit trail for debugging.
- **Schema-driven config** — Rust types in `src/config/` drive
  `schema/basemind-config-v1.schema.json` via `schemars`; snapshot is asserted byte-equal.
  Config is code.
- **Zero-system-dep ONNX** — `ort-bundled` ships the runtime in the binary. No
  `apt install onnxruntime`, no system complexity.
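
The secret-masking behaviour can be pictured with a small Python analog. This is not the Rust `SecretString` type itself; the class below, the `expose` method, and the `"***"` mask are assumptions chosen for illustration.

```python
import json


class SecretString:
    """Wrapper whose value never appears in str(), repr(), or serialized output."""

    def __init__(self, value: str) -> None:
        self._value = value

    def expose(self) -> str:
        # The only way to read the real value; call sites are easy to audit.
        return self._value

    def __repr__(self) -> str:
        return 'SecretString("***")'

    __str__ = __repr__


def _mask(obj):
    # json.dumps calls this for objects it cannot serialize natively.
    if isinstance(obj, SecretString):
        return "***"
    raise TypeError(f"not serializable: {type(obj).__name__}")


def to_json(config: dict) -> str:
    return json.dumps(config, default=_mask)


api_key = SecretString("sk-live-123")
dumped = to_json({"llm_api_key": api_key, "eager_l2": True})
```

Logging, formatting, or serializing the wrapper yields only the mask, so a stray debug print cannot leak the key.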

---

## Project state

- **Real-OSS hardening:** `tests/harden.rs` runs the full tool sweep against 8 upstream repos
  (ripgrep, tokio, TypeScript, React, Django, requests, gin, ripgrep-shallow) on every release.
  Canary assertions catch regressions.
- **[CHANGELOG.md](CHANGELOG.md)** — release history and migration notes.
- **[Contributing guide](CONTRIBUTING.md)** — development workflow: `task setup`, `task check`,
  `task build`. Pre-commit hooks via [prek](https://github.com/j178/prek).
- **[License: MIT](LICENSE)**