# CLAUDE.md
Datasphere — background daemon that distills knowledge from Claude Code sessions.
Named after the AI knowledge network in Dan Simmons' Hyperion Cantos.
## Build & Test
```bash
cargo build # Development build
cargo build --release # Release build
cargo test # Run tests
```
## Architecture
```
[All Claude projects] ──▶ AllProjectsWatcher ──▶ Queue ──▶ Processor
(~/.claude/ │
projects/) ▼
┌─────────────────┐
│ For each job: │
│ 1. Read JSONL │
│ 2. Distill │
│ 3. Embed │
│ 4. Store │
└─────────────────┘
```
### Components
| AllProjectsWatcher | Watch `~/.claude/projects/` recursively for `.jsonl` changes |
| Queue | Persistent JSONL queue at `~/.datasphere/queue.jsonl` |
| Distiller | Extract insights via LLM (chunked, no synthesis) |
| Embedder | Text embeddings (1536 dims) |
| Store | LanceDB (nodes, processed tables) |
## Module Structure
```
src/
├── main.rs # CLI: scan, start, query, queue, stats, show, add, reset
├── lib.rs # Public exports
├── core/
│ └── node.rs # Node type (EMBEDDING_DIM = 1536)
├── store/
│ ├── lance.rs # LanceDB wrapper + vector search
│ └── schema.rs # Arrow schemas
├── distill.rs # LLM knowledge extraction (chunked)
├── embed.rs # Embeddings + text chunking
├── llm.rs # LLM CLI wrapper
├── queue.rs # Persistent job queue
├── session.rs # Session discovery + project listing
├── transcript/
│ ├── reader.rs # JSONL parsing, context formatting
│ └── types.rs # TranscriptEntry types
└── watch/
├── session.rs # Single-project watcher
└── all_projects.rs # All-projects watcher for daemon
```
## Storage
```
~/.datasphere/
├── db/ # LanceDB database
│ ├── nodes.lance/
│ └── processed.lance/
└── queue.jsonl # Job queue (append-only)
```
## Key Constants
| `EMBEDDING_DIM` | 1536 | text-embedding-3-small dimensions |
| `SIMHASH_CHANGE_THRESHOLD` | 10 | Bits changed to trigger re-process |
| `JOB_DELAY_MS` | 500 | Rate limiting between jobs |
## CLI Commands
```bash
ds scan # One-shot scan current project
ds start # Daemon: watch all projects
ds query "text" # Search knowledge graph
ds query -f json "x" # JSON output for MCP
ds queue # Show queue counts
ds queue pending # List pending jobs
ds queue clear # Remove completed jobs
ds queue nuke # Delete all jobs
ds stats # Database statistics
ds show # Display nodes
ds add <file> # Add file (no LLM, direct embed)
ds related <id> # Find nodes similar to a node
ds reset # Delete db + nuke queue (fresh start)
```
## MCP Server
`mcp/` contains Node.js MCP server that shells out to `ds` CLI:
```bash
cd mcp && npm install
claude mcp add datasphere -s user -- node /path/to/mcp/index.js
```
No build step required — pure JavaScript (ES modules).
**Tools exposed:**
| `datasphere_query(query, limit?)` | Search knowledge graph by text |
| `datasphere_related(node_id, limit?)` | Find nodes similar to a given node |
## Design Principles
1. **Background first**: Daemon watches all projects, processes incrementally
2. **Explicit queue**: All jobs are queued for introspection and rate limiting
3. **No synthesis**: Each chunk becomes its own node (avoids recency bias)
4. **Vector search**: Nodes retrieved by embedding similarity at query time