gitsem 0.1.0

gitsem

A semantic search layer for Git repositories that augments commits with vector embeddings, enabling AI agents and developers to search code by meaning rather than text patterns.

Features

  • Semantic Commit Notes: Automatically attach embeddings and context to commits
  • Vector Search: Search code using natural language queries
  • Git-Native: Uses Git notes (refs/notes/semantic) for storage
  • Team Collaboration: Share semantic indexes via git push/pull
  • Retroactive Indexing: Add semantic notes to existing commit history
  • Idempotent Operations: Safe to run after regular git commands

Installation

Prerequisites

  • Rust 1.65 or higher
  • Git 2.0 or higher

From crates.io

cargo install gitsem

Build from Source

git clone https://github.com/ccherrad/gitsem.git
cd gitsem
cargo install --path .

The gitsem and git-semantic binaries are installed to ~/.cargo/bin/.

Verify Installation

gitsem help
# OR use as git subcommand
git semantic help

Both gitsem and git-semantic binaries are installed, so you can use either:

  • gitsem <command> - Standalone command
  • git semantic <command> - Git subcommand style

If the command isn't found, ensure ~/.cargo/bin is in your PATH:

export PATH="$HOME/.cargo/bin:$PATH"

Add this to your ~/.bashrc or ~/.zshrc to make it permanent.

How It Works

Architecture

┌─────────────────┐
│   Git Commits   │
└────────┬────────┘
         │ gitsem commit/reindex
         ▼
┌─────────────────────────────────┐
│   Git Notes (refs/notes/semantic)│  ← Source of Truth
│   - Commit metadata              │
│   - Diffs                        │
│   - Vector embeddings (768-dim)  │
└────────┬────────────────────────┘
         │ gitsem pull
         ▼
┌─────────────────┐
│  SQLite (.git/  │  ← Search Index
│  semantic.db)   │
│  - vec0 virtual │
│    table        │
└─────────────────┘
         │
         ▼ gitsem grep
┌─────────────────┐
│  Vector Search  │
│  Results        │
└─────────────────┘

Data Flow

  1. Create Semantic Notes: gitsem commit or gitsem reindex generates embeddings and stores them as Git notes
  2. Sync Across Team: git push origin refs/notes/semantic shares the notes with teammates
  3. Build Search Index: gitsem pull fetches the notes and populates the local SQLite database
  4. Search: gitsem grep performs KNN vector similarity search
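
Because the notes are the source of truth, each one has to carry everything needed to rebuild the index. The sketch below shows one way such a payload could be written to and read back from note text. The SemanticNote struct, the "key: value" layout, and the encode/decode helpers are all hypothetical; the actual gitsem note format is not described here.

```rust
// Hypothetical note layout: "key: value" header lines, a blank line, then the diff.
// This is an illustration only, not the format gitsem writes.

#[derive(Debug, Clone, PartialEq)]
struct SemanticNote {
    commit: String,      // id of the commit the note is attached to
    embedding: Vec<f32>, // 768 values in gitsem; shortened in the demo
    diff: String,        // textual diff of the commit
}

fn encode(note: &SemanticNote) -> String {
    let values: Vec<String> = note.embedding.iter().map(|v| v.to_string()).collect();
    format!(
        "commit: {}\nembedding: {}\n\n{}",
        note.commit,
        values.join(","),
        note.diff
    )
}

fn decode(text: &str) -> Option<SemanticNote> {
    // Headers end at the first blank line; everything after it is the diff.
    let (headers, diff) = text.split_once("\n\n")?;
    let mut commit = None;
    let mut embedding = None;
    for line in headers.lines() {
        let (key, value) = line.split_once(": ")?;
        match key {
            "commit" => commit = Some(value.to_string()),
            "embedding" => {
                let parsed: Result<Vec<f32>, _> =
                    value.split(',').map(|v| v.trim().parse::<f32>()).collect();
                embedding = Some(parsed.ok()?);
            }
            _ => {} // unknown keys are ignored
        }
    }
    Some(SemanticNote {
        commit: commit?,
        embedding: embedding?,
        diff: diff.to_string(),
    })
}

fn main() {
    let note = SemanticNote {
        commit: "abc123".to_string(),
        embedding: vec![0.0, 0.5, 1.0],
        diff: "+fn login() {}\n".to_string(),
    };
    let text = encode(&note);
    // Decoding the encoded text gives back the same note.
    assert_eq!(decode(&text), Some(note));
    println!("{}", text);
}
```

A text layout keeps the note readable with plain git notes tooling; a binary or JSON encoding would work equally well for the same round trip.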

Commands

gitsem commit

Create a commit with semantic notes attached.

# Commit with all changes
gitsem commit -a -m "Add user authentication"

# Commit staged changes
git add .
gitsem commit -m "Fix login bug"

# Interactive (prompts for message)
gitsem commit

What it does:

  • Creates a regular Git commit
  • Generates embeddings from the diff
  • Attaches semantic note to the commit in refs/notes/semantic

gitsem reindex <range>

Add semantic notes to existing commits retroactively.

# Index last 3 commits
gitsem reindex HEAD~3..HEAD

# Index commits on the current branch that are not on main
gitsem reindex main..HEAD

# Index specific range
gitsem reindex abc123..def456

What it does:

  • Walks every commit in the range
  • Generates embeddings for each commit's diff
  • Attaches semantic notes to existing commits
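
The range argument follows Git's two-dot syntax, where A..B selects commits reachable from B but not from A. A minimal sketch of how a reindex step might split the range and skip commits that already carry a note is shown below. Both parse_range and commits_to_index are hypothetical helpers, and skipping already-noted commits is an assumption about how idempotence could be achieved, not a description of gitsem's code.

```rust
use std::collections::HashSet;

/// Split a two-dot revision range such as "HEAD~3..HEAD" into (from, to).
/// An omitted side defaults to "HEAD", as it does in Git itself.
fn parse_range(range: &str) -> Option<(String, String)> {
    // Three-dot (symmetric difference) ranges are out of scope for this sketch.
    if range.contains("...") {
        return None;
    }
    let (from, to) = range.split_once("..")?;
    let or_head = |s: &str| {
        if s.is_empty() {
            "HEAD".to_string()
        } else {
            s.to_string()
        }
    };
    Some((or_head(from), or_head(to)))
}

/// Keep only the commits that do not yet have a semantic note,
/// so running the same reindex twice does no extra work.
fn commits_to_index<'a>(in_range: &[&'a str], already_noted: &HashSet<&str>) -> Vec<&'a str> {
    in_range
        .iter()
        .copied()
        .filter(|c| !already_noted.contains(*c))
        .collect()
}

fn main() {
    let (from, to) = parse_range("HEAD~3..HEAD").expect("expected a two-dot range");
    println!("commits reachable from {} but not from {}", to, from);

    let mut noted = HashSet::new();
    noted.insert("b2");
    let todo = commits_to_index(&["a1", "b2", "c3"], &noted);
    println!("to index: {:?}", todo);
}
```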

gitsem pull [remote]

Pull code changes and sync semantic notes.

# Pull from origin (default)
gitsem pull

# Pull from upstream
gitsem pull upstream

What it does:

  • Executes git pull
  • Fetches refs/notes/semantic from remote
  • Rebuilds local SQLite database from notes
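
Rebuilding the index from notes means the database can always be thrown away and recreated. The sketch below illustrates that idea with an in-memory map standing in for SQLite; SearchIndex and rebuild are hypothetical names, and the real schema in .git/semantic.db may differ.

```rust
use std::collections::BTreeMap;

/// In-memory stand-in for the search index in .git/semantic.db (hypothetical type).
#[derive(Debug, Default, PartialEq)]
struct SearchIndex {
    rows: BTreeMap<String, Vec<f32>>, // commit id -> embedding
}

impl SearchIndex {
    /// Discard the current rows and reload them from the notes.
    /// The result depends only on `notes`, so repeating the rebuild is harmless.
    fn rebuild(&mut self, notes: &[(String, Vec<f32>)]) {
        self.rows.clear();
        for (commit, embedding) in notes {
            self.rows.insert(commit.clone(), embedding.clone());
        }
    }
}

fn main() {
    let notes = vec![
        ("abc123".to_string(), vec![0.1_f32, 0.2]),
        ("def456".to_string(), vec![0.3_f32, 0.4]),
    ];
    let mut index = SearchIndex::default();
    index.rebuild(&notes);
    index.rebuild(&notes); // a second run leaves the index unchanged
    println!("{} commits indexed", index.rows.len());
}
```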

gitsem grep <query>

Search code semantically using natural language.

# Basic search
gitsem grep "authentication logic"

# Limit results
gitsem grep "error handling" -n 5   # return at most 5 results

What it does:

  • Generates embedding for the query
  • Performs KNN vector similarity search
  • Returns semantically similar code chunks
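
For readers unfamiliar with KNN search, here is a brute-force sketch of the idea: score every stored vector against the query vector and keep the k best. It assumes cosine similarity as the metric and uses tiny 2-dimensional vectors; the metric gitsem's vec0 table uses is not stated here, and the cosine and knn functions are illustrative names only.

```rust
/// Cosine similarity between two vectors of equal length; 0.0 if either has zero norm.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Return the `k` chunks most similar to `query`, best match first.
fn knn<'a>(query: &[f32], chunks: &[(&'a str, Vec<f32>)], k: usize) -> Vec<(&'a str, f32)> {
    let mut scored: Vec<(&'a str, f32)> = chunks
        .iter()
        .map(|(id, v)| (*id, cosine(query, v)))
        .collect();
    // Sort by descending similarity.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}

fn main() {
    let chunks = vec![
        ("auth.rs", vec![1.0_f32, 0.0]),
        ("README.md", vec![0.0, 1.0]),
        ("login.rs", vec![1.0, 1.0]),
    ];
    // `k` plays the role of the -n option of gitsem grep.
    for (id, score) in knn(&[1.0, 0.0], &chunks, 2) {
        println!("{:.3}  {}", score, id);
    }
}
```

A linear scan like this is fine for small histories; a vector index exists to avoid scanning every row on large ones.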

gitsem show [commit]

View the semantic note attached to a commit.

# Show note for HEAD
gitsem show

# Show note for specific commit
gitsem show abc123

# Show note for HEAD~2
gitsem show HEAD~2

What it does:

  • Displays formatted semantic note
  • Shows embedding dimensions
  • Previews commit content and diff

Examples

Example 1: New Repository Setup

# Clone a repository
git clone https://github.com/example/myproject.git
cd myproject

# Index the last 10 commits (use either style)
gitsem reindex HEAD~10..HEAD
# OR: git semantic reindex HEAD~10..HEAD

# Share semantic notes with team
git push origin refs/notes/semantic

Example 2: Daily Development Workflow

# Make changes
vim src/auth.rs

# Create commit with semantic notes
gitsem commit -a -m "feat: add JWT token validation"
# OR: git semantic commit -a -m "feat: add JWT token validation"

# Pull teammate's changes and sync semantics
gitsem pull
# OR: git semantic pull

# Search for related code
gitsem grep "token validation logic"
# OR: git semantic grep "token validation logic"

Example 3: Code Review

# View semantic context of a commit
gitsem show HEAD~2

# Search for similar patterns
gitsem grep "similar authentication pattern"

Example 4: Team Collaboration

# Developer A: Create semantic commits
gitsem commit -m "refactor: simplify error handling"
git push origin main refs/notes/semantic

# Developer B: Pull and sync
gitsem pull
gitsem grep "error handling patterns"

Configuration

Environment Variables

OPENAI_API_KEY (Required for real embeddings)

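The embedding generator is currently a placeholder, so the key will only matter once a real embedding API is integrated (see Current Limitations). To prepare for real embeddings:

  1. Set your OpenAI API key for the current shell session:

    export OPENAI_API_KEY="sk-..."

  2. To make it permanent, add the same line to your shell config (~/.bashrc or ~/.zshrc):

    export OPENAI_API_KEY="sk-..."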

Git Configuration

Semantic notes are stored in refs/notes/semantic. To fetch them automatically on every git fetch or git pull from origin, add a fetch refspec:

git config --add remote.origin.fetch "+refs/notes/semantic:refs/notes/semantic"
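
The refspec above has the form +<src>:<dst>, where the leading + allows non-fast-forward updates and both sides name the same notes ref. A small sketch of building that configuration for any remote is shown below; notes_fetch_refspec and config_args are hypothetical helpers that only assemble the arguments and do not run Git.

```rust
/// Build a fetch refspec that mirrors a notes ref from the remote into the same local ref.
/// The leading '+' allows non-fast-forward updates.
fn notes_fetch_refspec(notes_ref: &str) -> String {
    format!("+{0}:{0}", notes_ref)
}

/// Arguments for `git config --add remote.<remote>.fetch <refspec>`.
fn config_args(remote: &str, notes_ref: &str) -> Vec<String> {
    vec![
        "config".to_string(),
        "--add".to_string(),
        format!("remote.{}.fetch", remote),
        notes_fetch_refspec(notes_ref),
    ]
}

fn main() {
    let args = config_args("upstream", "refs/notes/semantic");
    // To apply the setting, pass `args` to std::process::Command::new("git")
    // from inside a repository.
    println!("git {}", args.join(" "));
}
```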

Current Limitations

  1. Placeholder Embeddings: The current implementation uses dummy embeddings (768-dimensional vectors with sequential values). Real LLM API integration (OpenAI, Cohere, etc.) needs to be implemented in src/embed.rs.

  2. SQLite-vec Integration: The vec0 virtual table is defined but requires the sqlite-vec extension to be loaded at runtime for production vector search.

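  3. No Automatic Sync: Semantic notes are not pushed automatically; share them with git push origin refs/notes/semantic. On the receiving side, gitsem pull fetches them, but a plain git pull does not unless the fetch refspec from Git Configuration is set.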
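
To make the first limitation concrete, the sketch below shows what a sequential-value placeholder could look like behind a small trait that a real provider might later implement. The Embedder trait, the SequentialPlaceholder type, and the exact values (i / 768) are assumptions for illustration; the contents of src/embed.rs are not shown here.

```rust
const DIM: usize = 768;

/// Hypothetical interface that a real embedding provider could implement later.
trait Embedder {
    fn embed(&self, text: &str) -> Vec<f32>;
}

/// Ignores the input and returns 0/768, 1/768, ..., 767/768.
/// The exact values used by gitsem's placeholder are an assumption here.
struct SequentialPlaceholder;

impl Embedder for SequentialPlaceholder {
    fn embed(&self, _text: &str) -> Vec<f32> {
        (0..DIM).map(|i| i as f32 / DIM as f32).collect()
    }
}

fn main() {
    let embedder: Box<dyn Embedder> = Box::new(SequentialPlaceholder);
    let a = embedder.embed("fix login bug");
    let b = embedder.embed("update README");
    // Every input maps to the same vector, so similarity cannot rank results.
    assert_eq!(a, b);
    println!("dimensions: {}", a.len());
}
```

If the placeholder ignores its input in this way, every commit gets the same vector, which is why search results are not meaningful until a real embedding backend is plugged in.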

Development

Project Structure

gitsem/
├── src/
│   ├── main.rs       # CLI and command handlers
│   ├── models.rs     # CodeChunk data structure
│   ├── db.rs         # SQLite database with vec0 table
│   ├── git.rs        # Git notes read/write operations
│   └── embed.rs      # Embedding generation (placeholder)
├── Cargo.toml
└── README.md

Building

cargo build --release

Testing

cargo test

Installing Locally

cargo install --path .

Roadmap

  • Real embedding API integration (OpenAI, Cohere, local models)
  • Load sqlite-vec extension for production vector search
  • Automatic note syncing on push/pull
  • Support for multiple embedding models
  • Web UI for browsing semantic history
  • VS Code extension
  • GitHub Action for CI/CD integration

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Submit a pull request

License

MIT OR Apache-2.0

Acknowledgments

  • Built with gix - Pure Rust Git implementation
  • Inspired by the need for semantic code search in AI-assisted development