# cartog — Technology Stack
## Language
**Rust** — Performance, tree-sitter native bindings, single binary distribution.
## Core Dependencies
| `tree-sitter` 0.24 | Incremental parsing, CST traversal | Pinned — grammar crates lag by one minor |
| `tree-sitter-{lang}` 0.23 | Per-language grammars (Python, TS/JS, Rust, Go, Ruby, Java) | Each ~1-2 MB of generated C |
| `rusqlite` (bundled) | SQLite storage, zero external deps | `bundled` compiles SQLite from C source — no system `libsqlite3-dev` required. Critical for cross-compilation to 5 targets |
| `clap` (derive) | CLI argument parsing | `ValueEnum` derive for type-safe `--kind` filters with shell completion |
| `serde` + `serde_json` | JSON serialization for `--json` output | `to_string_pretty` for readability in both terminal and agent contexts |
| `walkdir` | Recursive directory traversal | `follow_links(true)` — enables symlinked monorepo layouts |
| `sha2` | SHA-256 content hashing | Change detection fallback when git is unavailable |
| `notify` + `notify-debouncer-mini` | Filesystem event watching | "mini" debouncer chosen over full — fewer deps, simpler API, sufficient for single-project use |
| `ctrlc` | Graceful Ctrl+C handling | Flushes pending RAG embeddings before exit |
| `anyhow` | Error propagation | `Result<T>` everywhere, no `unwrap()` in library code |
| `rmcp` (server + transport-io) | MCP server over stdio | Server-only — cartog is never an MCP client. stdio transport matches how agents launch subprocesses |
| `tokio` (rt-multi-thread) | Async runtime for MCP server only | Multi-thread for `spawn_blocking` throughput. Runtime created on-demand — sync commands skip it entirely |
| `tracing` + `tracing-subscriber` | Structured logging to stderr | Logs to stderr so stdout stays clean for output and MCP protocol |
| `fastembed` | ONNX Runtime inference for embeddings + re-ranking | `default-features = false` drops image models (CLIP etc.) we don't use. `rustls-tls` avoids OpenSSL system dependency |
| `sqlite-vec` | Vector similarity search (KNN) in SQLite | `vec0` virtual table, requires integer rowids (bridged via `symbol_embedding_map`) |
| `criterion` (dev) | Micro-benchmarks | Query latency benchmarks (µs-level) |
## Build Profiles
| `dev.opt-level` | `1` | Tree-sitter C grammars are machine-generated huge files that compile very slowly at opt-level 0. Level 1 also makes indexing usably fast during development |
| `release.lto` | `"thin"` | Most binary size / performance benefits of full LTO at a fraction of the link time |
| `release.strip` | `"debuginfo"` | Removes DWARF sections (~50% binary size reduction) but keeps function names in panic backtraces for diagnosable crash reports |
## Architecture Decisions
| Parser | tree-sitter (CST + query API) | Incremental, multi-language, structural. JS/TS and Python use declarative tree-sitter queries for call/throw/type-ref extraction; other languages use cursor walks. Handles 90% of what LSP provides without running a language server |
| Storage | SQLite (single `.cartog.db`) | Zero infra, ~1 MB, persists across sessions. WAL mode enables concurrent readers (watcher + MCP server) |
| Packaging | Skill (primary) | Changes agent workflow, not just adds a tool. Works with any LLM that has bash access |
| MCP server | `cartog serve` (stdio) | Skill remains primary; MCP as secondary for zero-context-cost tool access. 1:1 mapping with CLI commands — same `db.*()` code paths |
| Change detection | Git-based + SHA-256 + Merkle diff + `--force` | Git diff covers committed + staged + unstaged + untracked files. SHA-256 double-checks to skip touched-but-unmodified files. Merkle-tree hashing (content_hash + subtree_hash per symbol) enables surgical symbol-level updates within changed files — only added/modified/removed symbols touch the DB |
| Edge resolution | Name-based, scope-aware, multi-pass, scoped | 6-tier priority: same file > import-path > same dir > parent scope > unique global > kind disambiguation (type def > function > method). Two passes so import edges resolved in pass 1 feed pass 2. Incremental mode: invalidates dangling edges after symbol changes, then re-resolves only affected edges |
| Symbol ID | `file_path:kind:qualified_name` | Stable across line movements. `kind` = function/class/method/etc. Qualified name encodes parent chain: `auth.py:method:TokenService.validate`. Deterministic, human-readable, no UUIDs |
| Ignore strategy | Hardcoded 18 dirs + `starts_with('.')` | No `.gitignore` parsing — simpler, faster, predictable. Covers node_modules, \_\_pycache\_\_, target, venv, dist, build, .next, vendor, etc. |
| Content truncation | 2048 bytes per symbol | ~512 tokens at code's ~2-3 chars/token ratio. Captures signature + leading body. Below 50 bytes → excluded (noise) |
| Name normalization | camelCase/snake_case splitting for FTS5 | `validateToken` → `"validate token"`, `get_http_response` → `"get http response"`. Stored in FTS5 alongside original name |
| Serialization (embeddings) | Raw little-endian f32 bytes | Zero-overhead, directly compatible with sqlite-vec `MATCH` operator. 384 dims × 4 bytes = 1536 bytes per symbol |
| Language detection | File extension | Simple, predictable, covers 99%. TSX is a separate key from TS (different grammar for JSX nodes) |
| Async boundary | Manual `tokio::Runtime` for `serve` only | 95% of commands are sync. Avoids async overhead for index/search/refs. `spawn_blocking` offloads sync SQLite calls from the async MCP handler |
| DB concurrency | `Arc<Mutex<Database>>` | Single connection, not a pool. MCP serves one agent session — contention is negligible. `std::sync::Mutex` (not tokio) because lock is never held across `.await` |
| Path security | Canonical CWD validation | MCP tool parameters come from LLM agents. Rejects paths outside CWD subtree via `canonicalize` + `starts_with`. Defense-in-depth against prompt injection |
| Watch mode | Debounced re-index + deferred RAG | 2s debounce, 30s RAG delay. Embedding only fires after editing stops — avoids embedding code that changes seconds later |
| Vector search | sqlite-vec (opt-in) | Embedded in SQLite, no external infra. Models downloaded via `cartog rag setup` |
| Model cache | `~/.cache/cartog/models` | XDG-compliant shared cache avoids downloading ~1.2 GB of models per project. Precedence: `FASTEMBED_CACHE_DIR` > `XDG_CACHE_HOME/cartog/models` > `~/.cache/cartog/models` |
| Output format | Human default + `--json` flag (global) | Readable for humans, parseable for scripts. Both `cartog --json stats` and `cartog stats --json` work |
| Distribution | `cargo install` + pre-built binaries | GitHub Releases for 5 targets (Linux x86/ARM, macOS x86/ARM, Windows), crates.io publish |
| LSP | Auto-detected (`lsp` feature) | Index-time refinement for edges unresolved by heuristics. Auto-detects language servers on PATH (rust-analyzer, pyright, typescript-language-server, gopls), sends `textDocument/definition`, shuts down after. Silently skips when no server found. Disable with `--no-lsp`. Install: `cargo install cartog --features lsp` |
| Monorepo | Deferred | Index from CWD, user can `cd` into subproject |
## RAG Pipeline Design
### Why small models
BERT attention is **O(n²) in sequence length**. Keeping input short is the single biggest performance lever — more impactful than model size, quantization, or batching.
This drives two key decisions:
1. **Small embedding model** — BGE-small-en-v1.5 quantized (384 dimensions). 2-3x faster than full precision with negligible quality loss for code symbol matching. Outputs are L2-normalized, enabling L2 distance in sqlite-vec (equivalent to cosine ranking). Trade-off: English-only model — non-English identifiers/comments get degraded embeddings.
2. **AST-aware embedding text** — Header + signature + significant body lines (skipping blanks, comments, closing braces) up to ~200 tokens (~800 bytes):
```
// File: auth/tokens.py | function validate_token
def validate_token(token: str) -> bool:
if token.is_expired():
raise TokenError('expired')
return lookup_session(token.session_id)
```
This captures the "what does this function do" signal (~100-200 tokens) while staying within the model's 512-token window. Full source content is still stored separately for FTS5 keyword search and cross-encoder re-ranking. Decorators/annotations are kept (they carry semantic meaning like `@login_required`).
### What gets embedded (and what doesn't)
- **Functions, classes, methods**: embedded with AST-aware text (header + significant body lines)
- **Variables**: excluded — too numerous, low signal for semantic search
- **Imports**: excluded at content extraction time — they exist as graph edges, not search targets
### Search pipeline
```
Query
│
├─→ FTS5 keyword search (BM25 ranking)
│ 3-tier fallback: phrase → AND → OR
│ Searches: symbol_name + normalized_name + full content
│
├─→ Vector KNN search (sqlite-vec, L2 distance)
│ L2-normalized embeddings → L2 distance ≡ cosine ranking
│ Query embedded with same BGE-small-en-v1.5 model
│
├─→ Reciprocal Rank Fusion (RRF, k=60)
│ Merges both ranked lists: score = Σ 1/(k + rank + 1)
│ k=60 from Cormack et al. 2009
│ Over-retrieval: max(limit × 3, 20) per source
│
└─→ Cross-encoder re-ranking (optional)
BGE-reranker-base, scores (query, full_content) pairs jointly
Capped at 50 candidates to bound latency
Graceful degradation: tri-state cache (not attempted / failed / ready)
If model unavailable → search works with RRF-only ordering
```
### FTS5 three-tier fallback (precision-first)
| 1 | Phrase | `"validate token"` — exact adjacent match |
| 2 | AND | `"validate" AND "token"` — all terms, any order |
| 3 | OR | `"validate" OR "token"` — any term present |
Returns the first non-empty result. Only FTS5 syntax errors trigger fallback — real DB errors propagate.
### Key constants
| `EMBEDDING_DIM` | 384 | BGE-small-en-v1.5 output dimension |
| `EMBED_BATCH_SIZE` | 64 | Limits ONNX padding waste when text lengths vary |
| `CHUNK_SIZE` | 512 | Symbols per embedding engine call |
| `DB_BATCH_LIMIT` | 256 | Pending DB writes before flush |
| `MAX_CONTENT_BYTES` | 2048 | ~512 tokens at code's char/token ratio (stored for FTS5 + reranker) |
| `MIN_CONTENT_BYTES` | 50 | Below this → noise, not embedded |
| `MAX_EMBED_TEXT_BYTES` | 800 | ~200 tokens for bi-encoder input (AST-aware significant lines) |
| `EMBEDDING_FORMAT_VERSION` | 2 | Auto-triggers re-embed when embedding strategy changes |
| `RERANK_MAX` | 50 | Cross-encoder candidate cap |
| RRF `k` | 60.0 | Standard constant from Cormack et al. 2009 |
| Over-retrieval | `limit × 3` (min 20) | Enough candidates for effective RRF merge |
| `MAX_SEARCH_LIMIT` | 100 | Hard cap on returned results |
## SQLite Tuning
The database is a regenerable index — crash-recovery safety is traded for throughput.
| `journal_mode` | WAL | Write-Ahead Logging enables concurrent readers. Watch thread and MCP server read while indexer writes |
| `foreign_keys` | ON | Enforce referential integrity |
| `synchronous` | NORMAL | Reduced fsync frequency. Safe with WAL for a regenerable index — power failure loses at most the last transaction, recoverable via `cartog index --force` |
| `cache_size` | -65536 (64 MB) | Large page cache for repeated queries in MCP sessions |
| `temp_store` | MEMORY | Temp tables in RAM, not disk |
| `mmap_size` | 268435456 (256 MB) | Memory-mapped I/O for large index scans |
### Schema overview
```
┌──────────────────────────────────────────────────────────┐
│ Core tables │
│ │
│ symbols ──────────── edges ──────── files metadata │
│ (id, name, kind, (source_id, (path, (key, │
│ file_path, lines, target_name, hash, value) │
│ signature, target_id, lang) │
│ content_hash, kind, line) │
│ subtree_hash, ...) │
├──────────────────────────────────────────────────────────┤
│ RAG tables │
│ │
│ symbol_content ──→ symbol_fts (FTS5 virtual table) │
│ (symbol_id, (symbol_name, normalized_name, │
│ content, content) │
│ header, Synced via INSERT/DELETE triggers │
│ normalized_name) │
│ │
│ symbol_embedding_map ──→ symbol_vec (vec0 virtual table)│
│ (id INTEGER ↔ (embedding float[384]) │
│ symbol_id TEXT) KNN via MATCH operator │
│ Bridge: vec0 requires integer rowids, │
│ symbol IDs are text strings │
└──────────────────────────────────────────────────────────┘
Indexes (9): symbols(name, kind, file, parent),
edges(source, target, target_id, kind),
embedding_map(symbol_id)
```
## Minimum Supported Rust Version
1.70+ (edition 2021). Declared in `Cargo.toml` as `rust-version = "1.70"`.
Note: `fastembed` and its transitive dependencies (ONNX Runtime) may require a higher MSRV in practice. The declared 1.70 is the project's intent — verify with `cargo build` on older toolchains if targeting minimum compatibility.
## Further Reading
- [Product Overview](product.md) — purpose, target users, differentiation
- [Project Structure](structure.md) — module layout, conventions
- [Usage](usage.md) — CLI commands, agent skill setup, MCP server per client