git-semantic
Semantic search and spatial navigation for Git repositories — so AI coding agents orient in one turn and retrieve exactly what they need.
git-semantic parses every tracked file with tree-sitter, generates vector embeddings per chunk, and stores them on a dedicated orphan Git branch. At index time it also builds a spatial map of the codebase — grouping files into semantically coherent subsystems using Leiden community detection, labeling them by their key functions, and tracking cross-file call edges.
Search is hybrid: BM25 (SQLite FTS5) + semantic embeddings + graph proximity, merged via Reciprocal Rank Fusion. Exact identifier lookups score higher when they appear in both ranked lists; files connected via call edges to top results are boosted automatically.
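Reciprocal Rank Fusion itself fits in a few lines. The sketch below is illustrative only and is not taken from git-semantic's source: the function name `rrf_merge`, the sample paths, and the constant `k = 60` (a commonly used default) are assumptions.

```python
from collections import defaultdict

def rrf_merge(ranked_lists, k=60):
    """Fuse several best-first ranked lists into one scored list."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, item in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every item it contains.
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by name for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical rankings from the three signals.
bm25 = ["src/db.rs", "src/main.rs", "src/embed.rs"]
semantic = ["src/embed.rs", "src/db.rs", "src/map.rs"]
graph = ["src/main.rs", "src/db.rs"]

fused = rrf_merge([bm25, semantic, graph])
for path, score in fused:
    print(f"{score:.4f}  {path}")
```

A file that appears in all three lists (here `src/db.rs`) accumulates three contributions and ends up first, which is the effect the paragraph above describes.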
Motivation
Good agents don't need to explore — they need to know where to look and how much to read.
git-semantic gives agents a spatial model of the codebase. Instead of searching and accumulating, an agent can orient with map, read a file's structure with get --mode outline (~96% token reduction), pull the declaration with --mode signatures (~86% reduction), and fetch the exact body with get file:start-end only when it needs to. A well-structured session stays flat because the agent fetches surgically from the start rather than accumulating everything that matched.
The index lives on a Git branch. One person indexes, the whole team benefits — no re-embedding, no API keys per developer. The map is shared state: every agent session starts with the same orientation, not a cold rediscovery of the codebase.
git-semantic benchmark measures this concretely on your own repo: token savings per language, read mode comparison, and a navigation replay that shows grep precision vs map+outline+get precision across sampled subsystems.
See BENCHMARKS.md for results on real codebases.
How it works
main branch semantic branch (orphan)
────────────────── ──────────────────────────
src/main.rs → src/main.rs ← [{start_line, end_line, text, embedding}, ...]
src/db.rs → src/db.rs ← [{...}, ...]
src/chunking/mod.rs → src/chunking/mod.rs
.semantic-map.json ← subsystems + edges
- git-semantic index parses all tracked files, embeds each chunk, clusters files into subsystems using Leiden community detection, builds the spatial map, and commits everything to the semantic orphan branch.
- git push origin semantic shares the embeddings and map with the team.
- Everyone else runs git fetch origin semantic + git-semantic hydrate to populate their local SQLite search index (vector + FTS5); no re-embedding needed.
- Agents use map to orient, get --mode outline to read cheaply, get file:start-end to retrieve exactly, and grep only when the map is insufficient.
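The per-file chunk list shown in the diagram can be modelled as a small record type. This is a sketch under assumptions: the field names come from the diagram, but JSON as the on-branch serialization, the `Chunk` class, and `load_chunks` are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import List

@dataclass
class Chunk:
    start_line: int
    end_line: int
    text: str
    embedding: List[float]

def load_chunks(payload: str) -> List[Chunk]:
    """Parse one file's chunk list from a JSON string."""
    return [Chunk(**record) for record in json.loads(payload)]

# Hypothetical content of the semantic-branch entry for one source file.
payload = json.dumps([
    {"start_line": 1, "end_line": 12, "text": "fn main() { ... }", "embedding": [0.1, 0.2]},
    {"start_line": 14, "end_line": 30, "text": "fn run() { ... }", "embedding": [0.3, 0.4]},
])

chunks = load_chunks(payload)
print(len(chunks), chunks[0].start_line, chunks[0].end_line)
```

Because each record carries its own embedding, a consumer can rebuild a search index from these records without calling an embedding provider.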
Getting started
- Quickstart: install, index, and share in under 5 minutes
- Navigation guide: map / get / grep workflow with examples
- Repo health: reading the heatmap, drilling into communities
- CI setup: keep the index fresh automatically
- MCP setup: connect to Claude Code, Cursor, Windsurf
Installation
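Prerequisites: Rust 1.65+, Git 2.0+

Install with cargo install gitsem, the same command the GitHub Actions workflow in this README uses.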
Commands
git-semantic index
Parses and embeds all tracked files, builds the spatial map, and commits to the semantic branch.
- First run: full index
- Subsequent runs: incremental — only changed files are re-embedded
- Respects .gitignore
- Skips binary files
git-semantic hydrate
Reads the semantic branch and populates the local .git/semantic.db index. Fetches origin/semantic first, falls back to local.
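To make the FTS5 half of the local index concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It assumes an SQLite build with FTS5 enabled; the table name `chunks_fts`, its columns, and the sample rows are hypothetical and do not describe the real schema of .git/semantic.db.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one FTS5 row per chunk.
conn.execute("CREATE VIRTUAL TABLE chunks_fts USING fts5(path, text)")
conn.executemany(
    "INSERT INTO chunks_fts (path, text) VALUES (?, ?)",
    [
        ("src/db.rs", "fn search_hybrid(query: &str) -> Vec<Hit>"),
        ("src/embed.rs", "fn create_provider(config: &EmbeddingConfig)"),
        ("src/main.rs", "fn main() calls hydrate_from_branch"),
    ],
)

def bm25_search(term, limit=5):
    """Return (path, score) pairs; FTS5's bm25() is lower for better matches."""
    # Quote the term so identifiers are treated as a phrase, not query syntax.
    phrase = '"' + term.replace('"', '""') + '"'
    return conn.execute(
        "SELECT path, bm25(chunks_fts) FROM chunks_fts "
        "WHERE chunks_fts MATCH ? ORDER BY bm25(chunks_fts) LIMIT ?",
        (phrase, limit),
    ).fetchall()

hits = bm25_search("search_hybrid")
print(hits)
```

The ranked paths from a query like this would form the BM25 list that is later fused with the semantic and graph rankings.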
git-semantic map [query]
Show the spatial map of the codebase, or find the subsystem relevant to a task. Subsystems are built by Leiden community detection — files are grouped by embedding similarity, not filesystem location, so semantically related files cluster together even in flat repos.
Without a query, map lists all subsystems with their key functions and entry points.
With a query, map returns the most relevant subsystem with file locations and call edges.
Output:
## embeddings — gemma: GemmaProvider, EmbeddingConfig, cache_dir, TextEmbedding
entry points:
src/embed.rs (via create_provider, EmbeddingConfig)
src/main.rs (via EmbeddingConfig, load_or_default)
src/embeddings/gemma.rs:1-45
src/embeddings/config.rs:0-47
...
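Grouping by embedding similarity rather than by directory can be illustrated with the graph a community detection step would consume. The sketch below only builds that weighted graph and does not implement Leiden; the per-file vectors, the 0.8 threshold, and the function names are assumptions.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def similarity_edges(file_vectors, threshold=0.8):
    """Weighted edges (a, b, similarity) between files at or above the threshold."""
    edges = []
    for a, b in combinations(sorted(file_vectors), 2):
        sim = cosine(file_vectors[a], file_vectors[b])
        if sim >= threshold:
            edges.append((a, b, round(sim, 3)))
    return edges

# Hypothetical per-file vectors, e.g. the mean of each file's chunk embeddings.
file_vectors = {
    "lib/gemma.rs": [0.9, 0.1, 0.0],
    "src/config.rs": [0.8, 0.2, 0.1],
    "src/db.rs": [0.0, 0.1, 0.9],
}

edges = similarity_edges(file_vectors)
print(edges)
```

The two files that end up connected live in different directories, which is why clusters built this way can cut across the filesystem layout.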
git-semantic get <file> [--mode outline|signatures|full]
Retrieve a file by path or a specific chunk by line range.
File-level retrieval (three modes powered by tree-sitter): pass a file path and pick outline, signatures, or full with --mode.
Output includes callers — files outside this one that reference it via edges:
// src/db.rs
// callers:
// src/main.rs (via hydrate_from_branch, grep_semantic)
L1-126 init_with_dimension
L128-140 clear
L142-161 insert_subsystem
L463-497 search_hybrid
Chunk-level retrieval (exact or overlapping range): pass file:start-end, for example src/db.rs:463-497 to fetch search_hybrid from the outline above.
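Matching a requested range against stored chunks is a standard interval-overlap test. The following sketch shows one way to do it; `parse_spec` and `overlapping` are hypothetical helpers, and the chunk ranges are copied from the outline example for src/db.rs.

```python
def parse_spec(spec):
    """Split 'path:start-end' into (path, start, end)."""
    path, _, span = spec.rpartition(":")
    start, _, end = span.partition("-")
    return path, int(start), int(end)

def overlapping(chunks, start, end):
    """Chunks whose inclusive line range intersects [start, end]."""
    return [c for c in chunks if c[0] <= end and c[1] >= start]

# (start_line, end_line, name) triples from the outline example.
chunks = [
    (1, 126, "init_with_dimension"),
    (128, 140, "clear"),
    (142, 161, "insert_subsystem"),
    (463, 497, "search_hybrid"),
]

path, start, end = parse_spec("src/db.rs:130-150")
selected = overlapping(chunks, start, end)
print(path, [name for _, _, name in selected])
```

An exact range such as 463-497 selects a single chunk, while a range that straddles two chunks returns both.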
Read modes for file-level retrieval:

| mode | mechanism | typical savings vs raw |
|---|---|---|
| outline | tree-sitter extracts identifier name only | ~96% |
| signatures | tree-sitter cuts at body node, keeps full declaration | ~86% |
| full (default) | all chunks concatenated | ~4% |
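The difference between outline and signatures is easier to see on a concrete file. git-semantic uses tree-sitter; the sketch below swaps in Python's standard `ast` module purely to stay dependency-free, so it illustrates the idea on Python source and is not the tool's implementation. All function names are hypothetical.

```python
import ast

SOURCE = '''\
def load(path, strict=True):
    data = open(path).read()
    return data

class Index:
    def search(self, query):
        return []
'''

def declarations(source):
    """Function and class nodes in source order."""
    kinds = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
    nodes = [n for n in ast.walk(ast.parse(source)) if isinstance(n, kinds)]
    return sorted(nodes, key=lambda n: n.lineno)

def outline(source):
    """Identifier names with line ranges, in the 'L1-3 load' style."""
    return [f"L{n.lineno}-{n.end_lineno} {n.name}" for n in declarations(source)]

def signatures(source):
    """Declaration text, cut where the body begins."""
    lines = source.splitlines()
    result = []
    for n in declarations(source):
        # Stop at the first body statement; always keep the declaration line.
        cut = max(n.body[0].lineno - 1, n.lineno)
        result.append("\n".join(lines[n.lineno - 1:cut]).strip())
    return result

print(outline(SOURCE))
print(signatures(SOURCE))
```

Outline keeps only names and ranges, signatures keeps the full declaration line, and both drop the bodies, which is where most of the tokens are.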
git-semantic grep <query>
Search code using three-signal hybrid search: BM25 (FTS5) + semantic embeddings + graph proximity, merged via Reciprocal Rank Fusion. Files connected via call edges to top semantic results are boosted automatically. Higher score = more relevant; a result scoring 2x the next is an unambiguous match.
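How the graph-proximity signal is computed is not spelled out here, so the following is only one plausible reading: files are ranked by how many call edges link them to the current top results. The 2x rule of thumb for an unambiguous match is included as a small helper. Both function names and the sample edges are assumptions.

```python
from collections import Counter

def graph_neighbors(top_results, call_edges):
    """Rank files outside the top results by their call edges into them."""
    top = set(top_results)
    counts = Counter()
    for caller, callee in call_edges:
        if caller in top and callee not in top:
            counts[callee] += 1
        elif callee in top and caller not in top:
            counts[caller] += 1
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [path for path, _ in ranked]

def is_unambiguous(scores, ratio=2.0):
    """True when the best score is at least `ratio` times the runner-up."""
    ordered = sorted(scores, reverse=True)
    if len(ordered) == 1:
        return True
    return len(ordered) > 1 and ordered[0] >= ratio * ordered[1]

# Hypothetical (caller, callee) call edges.
call_edges = [
    ("src/main.rs", "src/db.rs"),
    ("src/main.rs", "src/embed.rs"),
    ("src/embed.rs", "src/embeddings/gemma.rs"),
]

neighbors = graph_neighbors(["src/db.rs", "src/embed.rs"], call_edges)
print(neighbors)
```

The resulting list can be treated as a third ranking and fused with the BM25 and semantic rankings.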
git-semantic health [--community <name>]
Show a cohesion/coupling heatmap of all semantic communities. Use --community to drill into a specific one — shows files, top dependents, and top dependencies.
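The exact formulas behind the heatmap are not given here. As a rough illustration, the sketch below assumes cohesion is the share of a community's call edges that stay inside it and coupling is the remainder; the community names `storage` and `cli`, the edge list, and `community_health` are hypothetical.

```python
def community_health(membership, edges):
    """Per-community cohesion and coupling from a list of (src, dst) edges."""
    stats = {name: {"internal": 0, "external": 0} for name in set(membership.values())}
    for src, dst in edges:
        a, b = membership[src], membership[dst]
        if a == b:
            stats[a]["internal"] += 1
        else:
            # A cross-community edge counts against both endpoints.
            stats[a]["external"] += 1
            stats[b]["external"] += 1
    report = {}
    for name, s in stats.items():
        total = s["internal"] + s["external"]
        cohesion = s["internal"] / total if total else 0.0
        coupling = s["external"] / total if total else 0.0
        report[name] = {"cohesion": round(cohesion, 2), "coupling": round(coupling, 2)}
    return report

membership = {
    "src/embed.rs": "embeddings",
    "src/embeddings/gemma.rs": "embeddings",
    "src/embeddings/config.rs": "embeddings",
    "src/db.rs": "storage",
    "src/main.rs": "cli",
}
edges = [
    ("src/embed.rs", "src/embeddings/gemma.rs"),
    ("src/embed.rs", "src/embeddings/config.rs"),
    ("src/main.rs", "src/embed.rs"),
    ("src/main.rs", "src/db.rs"),
]

report = community_health(membership, edges)
print(report["embeddings"])
```

Under these definitions a community with high cohesion and low coupling is self-contained, while the reverse suggests a subsystem that mostly exists to call into others.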
git-semantic benchmark [--json]
Measure token savings across read modes for every indexed file, and replay actual navigation queries to compare grep vs map+get strategies.
Output includes:
- Token savings by language (outline / signatures vs raw)
- Read mode comparison table
- Session cost simulation
- Navigation comparison: grep precision vs map+outline+get precision across sampled subsystems
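The savings figures are ratios of token counts between a reduced view and the raw file. A minimal sketch of that arithmetic follows; it approximates tokens by whitespace-separated words, which is an assumption and will not match the counts a real tokenizer produces.

```python
def approx_tokens(text):
    """Crude token estimate: whitespace-separated words."""
    return len(text.split())

def savings(raw, reduced):
    """Percentage of tokens saved by reading `reduced` instead of `raw`."""
    raw_tokens = approx_tokens(raw)
    if raw_tokens == 0:
        return 0.0
    return 100.0 * (1 - approx_tokens(reduced) / raw_tokens)

raw = "fn add(a: i32, b: i32) -> i32 {\n    let sum = a + b;\n    sum\n}\n"
signature_view = "fn add(a: i32, b: i32) -> i32"
outline_view = "L1-4 add"

print(f"outline:    {savings(raw, outline_view):.1f}%")
print(f"signatures: {savings(raw, signature_view):.1f}%")
```

On a four-line function the gap between modes is small in absolute terms; on large files the same ratio is what keeps a session's context flat.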
git-semantic mcp
Starts the MCP server (JSON-RPC over stdio). Exposes map, get, grep, and health as tools to any MCP-compatible client — Claude Code, Cursor, Codex, Windsurf, and others.
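For readers unfamiliar with the wire format, an MCP tool invocation is a JSON-RPC 2.0 request with the method `tools/call`, written as a single line on the server's stdin. The sketch below builds such a message; the argument name `query` is an assumption, so check the input schema the server advertises via `tools/list`.

```python
import json

def tool_call(request_id, tool, arguments):
    """Serialize an MCP tools/call request as one line of JSON-RPC 2.0."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    # json.dumps emits no raw newlines, as the stdio transport requires.
    return json.dumps(message)

# "query" is a hypothetical argument name for the map tool.
line = tool_call(1, "map", {"query": "embedding provider"})
print(line)
```

A real client sends an `initialize` request before any tool call; MCP-compatible editors handle that handshake themselves.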
Register it in your client's config:
Claude Code (.claude/settings.json):
Cursor (.cursor/mcp.json):
git-semantic config
Configure the embedding provider. Stored in .git/config, per-repository.
Navigation workflow
Once registered as an MCP server, any client can call the tools directly. The intended workflow:
Step 1 — orient
Call map with a short description of the task and read the output. If it names the function or file needed, go to step 2 immediately.
Step 2 — read cheaply
Call get with --mode outline first, then --mode signatures if you need the declaration. If the declaration alone is enough, stop. If you need the body, go to step 3.
Step 3 — retrieve exactly
Call get file:start-end with the line ranges from the outline output. Maximum 3 calls per task.
Step 4 — search (last resort)
Use grep only when the map was genuinely insufficient. Search is hybrid (BM25 + semantic + graph proximity). For exact identifier lookups, prefer grep over map: BM25 will find the identifier precisely.
Orient once, read cheaply, retrieve exactly, never re-search what the map already answered.
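The first three steps map onto CLI invocations like the ones below. This is a sketch of the command sequence only: `plan` is a hypothetical helper, and in a real session the path and line range would come from the map and outline output rather than being known up front.

```python
def plan(query, path, start, end, need_body=True):
    """Ordered git-semantic invocations for one task."""
    steps = [
        ["git-semantic", "map", query],                       # step 1: orient
        ["git-semantic", "get", path, "--mode", "outline"],   # step 2: read cheaply
    ]
    if need_body:
        # Step 3: retrieve exactly the range reported by the outline.
        steps.append(["git-semantic", "get", f"{path}:{start}-{end}"])
    return steps

steps = plan("hybrid search", "src/db.rs", 463, 497)
for argv in steps:
    print(" ".join(argv))
```

Each list could be passed to `subprocess.run` as-is; step 4 (grep) is left out on purpose because it is the fallback, not part of the normal path.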
Sharing embeddings
Indexing only needs to happen once. Push the semantic branch and the whole team benefits — no API keys, no re-embedding.
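- Once, by whoever has an API key: run git-semantic index, then git push origin semantic.
- Everyone else: run git fetch origin semantic, then git-semantic hydrate.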
Automated via GitHub Actions
name: Semantic Index
on:
  push:
    branches:
jobs:
  index:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
          token: ${{ secrets.GITHUB_TOKEN }}
      - name: Install git-semantic
        run: cargo install gitsem
      - name: Index codebase
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: git-semantic index
      - name: Push semantic branch
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git push origin semantic
Configuration
Gemma embeddings (default, no API key required)
Model files are cached at ~/.cache/fastembed by default. Override with FASTEMBED_CACHE_DIR.
OpenAI embeddings
Available keys
| Key | Default | Description |
|---|---|---|
| provider | gemma | Embedding provider: gemma or openai |
| openai.model | text-embedding-3-small | OpenAI model |
| gemma.embeddingDim | 768 | Gemma embedding dimension |
Supported languages
Rust, Python, JavaScript, TypeScript, Java, C, C++, Go
Project structure
git-semantic/
├── src/
│ ├── main.rs # CLI and command handlers
│ ├── map.rs # Subsystem and edge data types
│ ├── clustering.rs # Leiden community detection and edge extraction
│ ├── models.rs # CodeChunk data structure
│ ├── db.rs # SQLite + sqlite-vec + FTS5 hybrid search index
│ ├── embed.rs # Embedding dispatch
│ ├── semantic_branch.rs # Orphan branch read/write via git worktree
│ ├── embeddings/ # OpenAI, ONNX, and Gemma provider implementations
│ └── chunking/ # tree-sitter parsing and language detection
└── Cargo.toml
License
MIT OR Apache-2.0