<div align="center">
<h1>ColGREP</h1>
<p>Semantic code search for your terminal and your coding agents, built on NextPlaid.<br/>
Combines regex filtering with semantic ranking with LateOn-Code-edge multi-vector embeddings.<br/>
A single Rust binary. No server. No API. 100% local, your code never leaves your machine.</p>
<img width="680" src="../docs/colgrep-demo.gif" alt="ColGREP demo"/>
<p>
<a href="#quick-start"><b>Quick Start</b></a>
·
<a href="#search-modes"><b>Search Modes</b></a>
·
<a href="#agent-integrations"><b>Agent Integrations</b></a>
·
<a href="#how-it-works"><b>How It Works</b></a>
·
<a href="#installation"><b>Installation</b></a>
</p>
</div>
## Quick Start
**Install:**
Homebrew (macOS / Linux)
```bash
brew install lightonai/tap/colgrep
```
Shell installer (macOS / Linux)
```bash
Windows (PowerShell)
```bash
> **macOS** binaries ship with **Apple Accelerate + CoreML** enabled — full hardware acceleration out of the box.
>
> **Linux & Windows** binaries work immediately but run on CPU only. For hardware acceleration, install via Cargo — see [Installation](#installation).
**Build the index:**
```bash
colgrep init # current directory
colgrep init /path/to/project # or a specific project
colgrep init -y # auto-confirm for large codebases (>10K code units)
```
**Search:**
```bash
colgrep "database connection pooling"
```
No setup, no config, no dependencies. `colgrep init` builds the index for the first time. After that, every search detects file changes and updates the index automatically before returning results. Supports `--model` to override the ColBERT model and `--pool-factor` to control embedding compression.
---
## Search Modes
ColGREP supports three search modes: **semantic**, **regex**, and **hybrid** (both combined).
### Semantic Search
Find code by meaning, even when keywords don't match exactly:
```bash
colgrep "function that retries HTTP requests"
colgrep "error handling in API layer"
colgrep "authentication middleware" ./src
```
### Regex Search
Use `-e` for traditional pattern matching (ERE syntax by default):
```bash
colgrep -e "async fn\s+\w+"
```
### Hybrid Search
Combine regex filtering with semantic ranking. Regex narrows the candidates, semantics ranks them:
```bash
# Find async functions, rank by "error handling"
colgrep -e "async fn" "error handling"
# Find Result types, rank by "database operations"
colgrep -e "Result<" "database operations" --include="*.rs"
# Find TODOs, rank by relevance to "security"
colgrep -e "TODO" "security concerns"
```
---
## CLI Reference
### Search Options
| `-e` | `--pattern` | Regex pre-filter (ERE syntax) |
| `-E` | `--extended-regexp` | ERE mode (default, kept for grep compat) |
| `-F` | `--fixed-strings` | Treat `-e` as literal string |
| `-w` | `--word-regexp` | Whole-word match for `-e` |
| `-k` | `--results` | Number of results (default: 15) |
| `-n` | `--lines` | Context lines to show (default: 6) |
| `-l` | `--files-only` | List matching files only |
| `-c` | `--content` | Show full function/class content |
| `-r` | `--recursive` | Recursive (default, for grep compat) |
| `-y` | `--yes` | Auto-confirm indexing |
| | `--json` | JSON output |
| | `--code-only` | Skip docs/config files |
| | `--include` | Filter by glob (e.g., `"*.rs"`) |
| | `--exclude` | Exclude files by glob |
| | `--exclude-dir` | Exclude directories |
| | `--model` | Override ColBERT model |
| | `--no-pool` | Disable embedding pooling |
| | `--pool-factor` | Set pool factor (default: 2) |
### Filtering
```bash
# By file extension
colgrep --include="*.py" "database query"
colgrep --include="*.{ts,tsx}" "React component"
# By path pattern
colgrep --include="src/**/*.rs" "config parsing"
colgrep --include="**/tests/**" "test helper"
# Exclude files or directories
colgrep --exclude="*.test.ts" "component"
colgrep --exclude-dir="vendor" --exclude-dir="node_modules" "import"
# Search specific paths
colgrep "error handling" ./src/api ./src/auth
# Code-only (skip markdown, yaml, json, etc.)
colgrep --code-only "authentication logic"
```
**Glob pattern syntax:**
| `*.py` | All Python files |
| `*.{ts,tsx}` | TypeScript and TSX files |
| `src/**/*.rs` | Rust files under `src/` |
| `**/tests/**` | Files in any `tests/` directory |
| `*_test.go` | Go test files |
### Output Modes
```bash
# Default: filepath:lines with context
colgrep "authentication"
# Files only (like grep -l)
colgrep -l "database queries"
# Full content with syntax highlighting
colgrep -c "authentication handler" -k 5
# JSON for scripting
### Subcommands
| `colgrep init` | Build or update the index |
| `colgrep status` | Show index status for current project |
| `colgrep clear` | Clear index for current project |
| `colgrep clear --all` | Clear all indexes |
| `colgrep set-model <ID>` | Change the default ColBERT model |
| `colgrep settings` | View or modify configuration |
| `colgrep settings --ignore` | Add extra ignore patterns (persistent) |
| `colgrep settings --force-include` | Force-include normally ignored paths |
| `colgrep --stats` | Show search statistics for all indexes |
---
## Configuration
```bash
# Show current config
colgrep settings
# Set default results count
colgrep settings --k 20
# Set default context lines
colgrep settings --n 10
# Use INT8 quantized model (faster inference)
colgrep settings --int8
# Use FP32 full precision (more accurate)
colgrep settings --fp32
# Set embedding pool factor (2 = 50% smaller index, 1 = full precision)
colgrep settings --pool-factor 2
# Set parallel encoding sessions (default: CPU count, max 16)
colgrep settings --parallel 8
# Set batch size per session (default: 1 for CPU, 64 for CUDA)
colgrep settings --batch-size 2
# Set parser recursion depth guard (default: 1024)
colgrep settings --max-recursion-depth 1024
# Enable verbose output by default
colgrep settings --verbose
# Reset a value to default (pass 0)
colgrep settings --k 0 --n 0
```
### Relative Paths
By default, colgrep displays absolute file paths in search output. You can switch to relative paths (relative to the current working directory) to improve readability and save ~35% tokens when used with LLMs:
```bash
# Enable relative paths
colgrep settings --relative-paths
# Revert to absolute paths (default)
colgrep settings --no-relative-paths
```
> **Note:** JSON output (`--json`) always uses absolute paths regardless of this setting.
### Custom Ignore & Force-Include
By default, colgrep ignores common non-source directories (`node_modules`, `.git`, `target`, `vendor`, `build`, etc.) and hidden files. You can customize this behavior persistently across sessions:
```bash
# Add extra patterns to ignore (on top of built-in defaults)
colgrep settings --ignore generated --ignore "*.pb.go"
# Remove an extra ignore pattern
colgrep settings --no-ignore generated
# Clear all extra ignore patterns (revert to defaults only)
colgrep settings --clear-ignore
# Force-include files/dirs that are normally ignored
colgrep settings --force-include .vscode
colgrep settings --force-include vendor/internal
# Remove a force-include pattern
colgrep settings --no-force-include .vscode
# Clear all force-include patterns
colgrep settings --clear-force-include
```
**Pattern syntax:**
| `generated` | Any directory or file named `generated` |
| `*.pb.go` | Files ending in `.pb.go` |
| `.vscode` | The `.vscode` directory (normally hidden = ignored) |
| `vendor/internal` | The `vendor/internal` path prefix specifically |
Force-include takes priority over both built-in and extra ignore rules, so you can selectively un-ignore specific paths. These settings are stored in `~/.config/colgrep/config.json` and apply to all projects.
### Change Model
```bash
# Temporary (single query)
colgrep "query" --model lightonai/LateOn-Code
# Permanent (clears existing indexes)
colgrep set-model lightonai/LateOn-Code
# Private HuggingFace model
HF_TOKEN=hf_xxx colgrep set-model myorg/private-model
```
Config stored at `~/.config/colgrep/config.json`.
---
## Agent Integrations
| Claude Code | `colgrep --install-claude-code` | `colgrep --uninstall-claude-code` |
| OpenCode | `colgrep --install-opencode` | `colgrep --uninstall-opencode` |
| Codex | `colgrep --install-codex` | `colgrep --uninstall-codex` |
> Restart your agent after installing.
### Claude Code Integration
The Claude Code integration installs session and task hooks that:
- Inject colgrep usage instructions into the agent's system prompt
- Check index health before activating (skips if >3000 chunks need indexing or index is desynced)
- Propagate colgrep instructions to spawned sub-agents via task hooks
This means Claude Code automatically uses `colgrep` as its primary search tool when the index is ready.
### Complete Uninstall
Remove colgrep from all AI tools, clear all indexes, and delete all data:
```bash
colgrep --uninstall
```
---
## How It Works
```mermaid
flowchart TD
A["Your codebase"] --> B["Tree-sitter"]
B --> C["Structured representation"]
C --> D["LateOn-Code-edge · 17M"]
D --> E["NextPlaid"]
E --> F["Search"]
B -.- B1["Parse functions, methods, classes"]
C -.- C1["Signature, params, calls, docstring, code"]
D -.- D1["Multi-vector embedding per code unit · runs on CPU"]
E -.- E1["Rust index binary · quantized · memory-mapped · incremental"]
F -.- F1["grep-compatible flags · SQLite filtering · semantic ranking
100% local, your code never leaves your machine"]
style A fill:#4a90d9,stroke:#357abd,color:#fff
style B fill:#50b86c,stroke:#3d9956,color:#fff
style C fill:#50b86c,stroke:#3d9956,color:#fff
style D fill:#e8913a,stroke:#d07a2e,color:#fff
style E fill:#e8913a,stroke:#d07a2e,color:#fff
style F fill:#9b59b6,stroke:#8445a0,color:#fff
style B1 fill:none,stroke:#888,stroke-dasharray:5 5,color:#888
style C1 fill:none,stroke:#888,stroke-dasharray:5 5,color:#888
style D1 fill:none,stroke:#888,stroke-dasharray:5 5,color:#888
style E1 fill:none,stroke:#888,stroke-dasharray:5 5,color:#888
style F1 fill:none,stroke:#888,stroke-dasharray:5 5,color:#888
```
### 1. Parse
[Tree-sitter](https://tree-sitter.github.io/) parses source files into ASTs and extracts code units: **functions**, **methods**, **classes**, **constants**, and **raw code blocks** (module-level statements not covered by other units). This gives 100% file coverage.
### 2. Analyze (5 Layers)
Each code unit is enriched with five layers of analysis:
| **AST** | Signature, parameters, return type, docstring, parent class | `def fetch(url: str) -> Response` |
| **Call Graph** | Outgoing calls + reverse `called_by` | `Calls: range, client.get` |
| **Control Flow** | Loops, branches, error handling, cyclomatic complexity | `has_error_handling: true` |
| **Data Flow** | Variable declarations and assignments | `Variables: i, e` |
| **Dependencies** | Imports used within the function | `Uses: client, RequestError` |
### 3. Build Structured Text
Each unit is converted to a structured text representation before embedding. This gives the model richer signal than raw code alone:
```
Function: fetch_with_retry
Signature: def fetch_with_retry(url: str, max_retries: int = 3) -> Response
Description: Fetches data from a URL with retry logic.
Parameters: url, max_retries
Returns: Response
Calls: range, client.get
Variables: i, e
Uses: client, RequestError
Code:
def fetch_with_retry(url: str, max_retries: int = 3) -> Response:
"""Fetches data from a URL with retry logic."""
for i in range(max_retries):
try:
return client.get(url)
except RequestError as e:
if i == max_retries - 1:
raise e
File: src / utils / http client http_client.py
```
File paths are normalized for better semantic matching: separators become spaces, `snake_case` and `CamelCase` are split (e.g., `HttpClient` → `http client`).
### 4. Encode with ColBERT
The [ColBERT](https://github.com/stanford-futuredata/ColBERT) model produces **multi-vector embeddings**: ~300 token-level vectors of dimension 128 per code unit (instead of a single vector). At query time, each query token finds its best match across all document tokens (**MaxSim** scoring). This preserves fine-grained information that single-vector models lose.
The default model is [LateOn-Code-edge](https://huggingface.co/lightonai/LateOn-Code-edge) (17M parameters), optimized for code search and fast enough to run on CPU.
### 5. Index with PLAID
The [PLAID](https://arxiv.org/abs/2205.09707) algorithm compresses multi-vector embeddings with **product quantization** (2-bit or 4-bit) and stores them in a **memory-mapped** index. Embedding pooling (default factor: 2) further reduces index size by ~50%. Indexes support **incremental updates** so only changed files are re-encoded.
### 6. Search
The search pipeline:
1. **Encode** the query with ColBERT (single ONNX session, fast)
2. **Pre-filter** by metadata if `--include`, `--exclude`, `--exclude-dir` or `--code-only` are set (SQLite)
3. If `-e` pattern is provided: **regex filter** candidates, then score semantically
4. **MaxSim** scoring against the PLAID index
5. **Demote** test functions by -1 unless the query mentions "test"
6. **Find representative lines** using weighted token matching with a sliding window
---
## Index Management
### `colgrep init`
Build or incrementally update the index for a project without running a search. If the index already exists, only changed files are re-encoded.
```bash
colgrep init # current directory
colgrep init ~/projects/myapp # specific project
colgrep init -y # auto-confirm for large codebases (>10K code units)
colgrep init --model lightonai/LateOn-Code # use a specific model
colgrep init --pool-factor 1 # disable embedding pooling (more precise)
```
This is useful for:
- **Pre-warming** the index so the first search is instant
- **CI/dev setup** scripts where you want indexing to happen ahead of time
- **Updating** the index after pulling new code
```bash
# Check index status
colgrep status
# Clear current project index
colgrep clear
# Clear all indexes
colgrep clear --all
# Show statistics
colgrep --stats
```
Indexes are stored outside the project directory:
| Linux | `~/.local/share/colgrep/indices/` |
| macOS | `~/Library/Application Support/colgrep/indices/` |
| Windows | `%APPDATA%\colgrep\indices\` |
Each project gets a directory named `{project}-{hash8}`. Inside:
- `index/` — PLAID vector index + SQLite metadata
- `state.json` — File hashes for incremental updates
- `project.json` — Canonical project path
ColGREP automatically detects and repairs index/metadata desync from interrupted operations.
---
## Supported Languages
### Code (25 languages, tree-sitter AST parsing)
| Python | `.py` |
| TypeScript | `.ts`, `.tsx` |
| JavaScript | `.js`, `.jsx`, `.mjs` |
| Go | `.go` |
| Rust | `.rs` |
| Java | `.java` |
| C | `.c`, `.h` |
| C++ | `.cpp`, `.cc`, `.cxx`, `.hpp`, `.hxx` |
| C# | `.cs` |
| Ruby | `.rb` |
| Kotlin | `.kt`, `.kts` |
| Swift | `.swift` |
| Scala | `.scala`, `.sc` |
| PHP | `.php` |
| Lua | `.lua` |
| Elixir | `.ex`, `.exs` |
| Haskell | `.hs` |
| OCaml | `.ml`, `.mli` |
| R | `.r`, `.rmd` |
| Zig | `.zig` |
| Julia | `.jl` |
| SQL | `.sql` |
| Vue | `.vue` |
| Svelte | `.svelte` |
| HTML | `.html`, `.htm` |
### Text & Config (11 formats, document-level extraction)
| Markdown | `.md` |
| Plain text | `.txt`, `.rst` |
| AsciiDoc | `.adoc` |
| Org | `.org` |
| YAML | `.yaml`, `.yml` |
| TOML | `.toml` |
| JSON | `.json` |
| Dockerfile | `Dockerfile` |
| Makefile | `Makefile` |
| Shell | `.sh`, `.bash`, `.zsh` |
| PowerShell | `.ps1` |
---
## Installation
The pre-built binaries from [Quick Start](#quick-start) are the fastest way to get started. For hardware acceleration on Linux/Windows, or to build from source, use Cargo.
### Cargo
Install from [crates.io](https://crates.io/crates/colgrep):
```bash
# CPU-only (all platforms)
cargo install colgrep
# macOS with full acceleration (same as pre-built binary)
cargo install colgrep --features "accelerate,coreml"
# Linux with OpenBLAS
cargo install colgrep --features openblas
# Linux with CUDA
cargo install colgrep --features cuda
# Linux with CUDA + TensorRT
cargo install colgrep --features "cuda,tensorrt"
# Windows with DirectML
cargo install colgrep --features directml
```
### Build from Source
```bash
git clone https://github.com/lightonai/next-plaid.git
cd next-plaid
cargo install --path colgrep --features "accelerate,coreml" # or your preferred features
```
### Hardware Acceleration Features
| `accelerate` | macOS | Apple Accelerate for vector operations |
| `coreml` | macOS | CoreML for model inference |
| `openblas` | Linux | OpenBLAS for vector operations |
| `cuda` | Linux/Windows | NVIDIA CUDA for model inference |
| `tensorrt` | Linux | NVIDIA TensorRT for model inference |
| `directml` | Windows | DirectML for model inference |
<details>
<summary><b>OpenBLAS setup (Linux)</b></summary>
```bash
# Debian/Ubuntu
sudo apt install libopenblas-dev
# Fedora/RHEL
sudo dnf install openblas-devel
# Arch
sudo pacman -S openblas
```
Then: `cargo install colgrep --features openblas`
</details>
### ONNX Runtime
ONNX Runtime is downloaded automatically on first use. No manual installation required.
Lookup order:
1. `ORT_DYLIB_PATH` environment variable
2. Python environments (pip/conda/venv)
3. System paths
4. Auto-download to `~/.cache/onnxruntime/`
---
## Environment Variables
| `ORT_DYLIB_PATH` | Path to ONNX Runtime library |
| `XDG_DATA_HOME` | Override data directory |
| `XDG_CONFIG_HOME` | Override config directory |
| `HF_TOKEN` | HuggingFace token for private models |
| `HUGGING_FACE_HUB_TOKEN` | Alternative HF token variable |
---
## Citation
```bibtex
@software{next-plaid,
title = {NextPlaid, ColGREP: Multi-vector search, from database to coding agents.},
url = {https://github.com/lightonai/next-plaid},
author = {Sourty, Rapha\"{e}l},
contributors = {Dinaburg, Artem and Carron, Igor and Hsu, Chao-Chun (Joe) and Weitekamp, Raymond and R\k{a}czka, Szymon and Motliuk, Mark},
year = {2026},
}
@misc{LateOn-Code,
title = {LateOn-Code: a Family of State-Of-The-Art Late Interaction Code Retrieval Models},
author = {Chaffin, Antoine},
url = {https://huggingface.co/collections/lightonai/lateon-code},
year = {2026}
}
```
## License
Apache-2.0
## See Also
- [llm-tldr](https://github.com/parcadei/llm-tldr)
- [mgrep](https://github.com/mixedbread-ai/mgrep)
- [cgrep](https://github.com/awgn/cgrep)