colgrep 1.1.1

Semantic code search powered by ColBERT
<div align="center">
  <h1>ColGREP</h1>
  <p>Semantic code search for your terminal and your coding agents, built on NextPlaid.<br/>
  Combines regex filtering with semantic ranking, powered by LateOn-Code-edge multi-vector embeddings.<br/>
  A single Rust binary. No server. No API. 100% local, your code never leaves your machine.</p>

  <img width="680" src="../docs/colgrep-demo.gif" alt="ColGREP demo"/>

  <p>
    <a href="#quick-start"><b>Quick Start</b></a>
    &middot;
    <a href="#search-modes"><b>Search Modes</b></a>
    &middot;
    <a href="#agent-integrations"><b>Agent Integrations</b></a>
    &middot;
    <a href="#how-it-works"><b>How It Works</b></a>
    &middot;
    <a href="#installation"><b>Installation</b></a>
  </p>
</div>

## Quick Start

**Install:**

Homebrew (macOS / Linux)
```bash
brew install lightonai/tap/colgrep
```

Shell installer (macOS / Linux)
```bash
curl --proto '=https' --tlsv1.2 -LsSf https://github.com/lightonai/next-plaid/releases/latest/download/colgrep-installer.sh | sh
```

Windows (PowerShell)
```powershell
powershell -c "irm https://github.com/lightonai/next-plaid/releases/latest/download/colgrep-installer.ps1 | iex"
```

> **macOS** binaries ship with **Apple Accelerate + CoreML** enabled — full hardware acceleration out of the box.
>
> **Linux & Windows** binaries work immediately but run on CPU only. For hardware acceleration, install via Cargo (see [Installation](#installation)).

**Build the index:**

```bash
colgrep init              # current directory
colgrep init /path/to/project  # or a specific project
colgrep init -y  # auto-confirm for large codebases (>10K code units)
```

**Search:**

```bash
colgrep "database connection pooling"
```

No setup, no config, no dependencies. `colgrep init` builds the index the first time. After that, every search detects file changes and updates the index automatically before returning results. Both `colgrep init` and searches accept `--model` to override the ColBERT model and `--pool-factor` to control embedding compression.

---

## Search Modes

ColGREP supports three search modes: **semantic**, **regex**, and **hybrid** (both combined).

### Semantic Search

Find code by meaning, even when keywords don't match exactly:

```bash
colgrep "function that retries HTTP requests"
colgrep "error handling in API layer"
colgrep "authentication middleware" ./src
```

### Regex Search

Use `-e` for traditional pattern matching (ERE syntax by default):

```bash
colgrep -e "async fn\s+\w+"
colgrep -e "TODO|FIXME|HACK"
colgrep -e "impl\s+Display" --include="*.rs"
```

### Hybrid Search

Combine regex filtering with semantic ranking. Regex narrows the candidates, semantics ranks them:

```bash
# Find async functions, rank by "error handling"
colgrep -e "async fn" "error handling"

# Find Result types, rank by "database operations"
colgrep -e "Result<" "database operations" --include="*.rs"

# Find TODOs, rank by relevance to "security"
colgrep -e "TODO" "security concerns"
```

---

## CLI Reference

### Search Options

| Flag | Long                | Description                              |
| ---- | ------------------- | ---------------------------------------- |
| `-e` | `--pattern`         | Regex pre-filter (ERE syntax)            |
| `-E` | `--extended-regexp` | ERE mode (default, kept for grep compat) |
| `-F` | `--fixed-strings`   | Treat `-e` as literal string             |
| `-w` | `--word-regexp`     | Whole-word match for `-e`                |
| `-k` | `--results`         | Number of results (default: 15)          |
| `-n` | `--lines`           | Context lines to show (default: 6)       |
| `-l` | `--files-only`      | List matching files only                 |
| `-c` | `--content`         | Show full function/class content         |
| `-r` | `--recursive`       | Recursive (default, for grep compat)     |
| `-y` | `--yes`             | Auto-confirm indexing                    |
|      | `--json`            | JSON output                              |
|      | `--code-only`       | Skip docs/config files                   |
|      | `--include`         | Filter by glob (e.g., `"*.rs"`)          |
|      | `--exclude`         | Exclude files by glob                    |
|      | `--exclude-dir`     | Exclude directories                      |
|      | `--model`           | Override ColBERT model                   |
|      | `--no-pool`         | Disable embedding pooling                |
|      | `--pool-factor`     | Set pool factor (default: 2)             |

### Filtering

```bash
# By file extension
colgrep --include="*.py" "database query"
colgrep --include="*.{ts,tsx}" "React component"

# By path pattern
colgrep --include="src/**/*.rs" "config parsing"
colgrep --include="**/tests/**" "test helper"

# Exclude files or directories
colgrep --exclude="*.test.ts" "component"
colgrep --exclude-dir="vendor" --exclude-dir="node_modules" "import"

# Search specific paths
colgrep "error handling" ./src/api ./src/auth

# Code-only (skip markdown, yaml, json, etc.)
colgrep --code-only "authentication logic"
```

**Glob pattern syntax:**

| Pattern       | Matches                         |
| ------------- | ------------------------------- |
| `*.py`        | All Python files                |
| `*.{ts,tsx}`  | TypeScript and TSX files        |
| `src/**/*.rs` | Rust files under `src/`         |
| `**/tests/**` | Files in any `tests/` directory |
| `*_test.go`   | Go test files                   |

### Output Modes

```bash
# Default: filepath:lines with context
colgrep "authentication"

# Files only (like grep -l)
colgrep -l "database queries"

# Full content with syntax highlighting
colgrep -c "authentication handler" -k 5

# JSON for scripting
colgrep --json "auth" | jq '.[] | .unit.file'
```
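
For scripting beyond `jq`, the JSON output can also be consumed from Python. The sketch below is a minimal example that relies only on what the `jq` filter above implies: the output is an array whose items carry a `unit.file` field. The function name `matched_files` is hypothetical, and any other fields in the real output are not assumed.

```python
import json


def matched_files(json_text: str) -> list[str]:
    """Return the unique file paths from `colgrep --json` output, in rank order.

    Assumes only the shape implied by `jq '.[] | .unit.file'`:
    a JSON array of objects, each with a `unit.file` string.
    """
    results = json.loads(json_text)
    files = (result["unit"]["file"] for result in results)
    # dict.fromkeys removes duplicates while preserving first-seen order
    return list(dict.fromkeys(files))
```

One way to use it is to capture the output of `colgrep --json "auth"` (for example with `subprocess.run(..., capture_output=True, text=True)`) and pass the resulting `stdout` string to `matched_files`.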

### Subcommands

| Command                  | Description                            |
| ------------------------ | -------------------------------------- |
| `colgrep init`           | Build or update the index              |
| `colgrep status`         | Show index status for current project  |
| `colgrep clear`          | Clear index for current project        |
| `colgrep clear --all`    | Clear all indexes                      |
| `colgrep set-model <ID>` | Change the default ColBERT model       |
| `colgrep settings`       | View or modify configuration           |
| `colgrep settings --ignore` | Add extra ignore patterns (persistent) |
| `colgrep settings --force-include` | Force-include normally ignored paths |
| `colgrep --stats`        | Show search statistics for all indexes |

---

## Configuration

```bash
# Show current config
colgrep settings

# Set default results count
colgrep settings --k 20

# Set default context lines
colgrep settings --n 10

# Use INT8 quantized model (faster inference)
colgrep settings --int8

# Use FP32 full precision (more accurate)
colgrep settings --fp32

# Set embedding pool factor (2 = 50% smaller index, 1 = full precision)
colgrep settings --pool-factor 2

# Set parallel encoding sessions (default: CPU count, max 16)
colgrep settings --parallel 8

# Set batch size per session (default: 1 for CPU, 64 for CUDA)
colgrep settings --batch-size 2

# Set parser recursion depth guard (default: 1024)
colgrep settings --max-recursion-depth 1024

# Enable verbose output by default
colgrep settings --verbose

# Reset a value to default (pass 0)
colgrep settings --k 0 --n 0
```

### Relative Paths

By default, colgrep displays absolute file paths in search output. You can switch to relative paths (relative to the current working directory) to improve readability and save ~35% tokens when used with LLMs:

```bash
# Enable relative paths
colgrep settings --relative-paths

# Revert to absolute paths (default)
colgrep settings --no-relative-paths
```

> **Note:** JSON output (`--json`) always uses absolute paths regardless of this setting.

### Custom Ignore & Force-Include

By default, colgrep ignores common non-source directories (`node_modules`, `.git`, `target`, `vendor`, `build`, etc.) and hidden files. You can customize this behavior persistently across sessions:

```bash
# Add extra patterns to ignore (on top of built-in defaults)
colgrep settings --ignore generated --ignore "*.pb.go"

# Remove an extra ignore pattern
colgrep settings --no-ignore generated

# Clear all extra ignore patterns (revert to defaults only)
colgrep settings --clear-ignore

# Force-include files/dirs that are normally ignored
colgrep settings --force-include .vscode
colgrep settings --force-include vendor/internal

# Remove a force-include pattern
colgrep settings --no-force-include .vscode

# Clear all force-include patterns
colgrep settings --clear-force-include
```

**Pattern syntax:**

| Pattern            | Matches                                                |
| ------------------ | ------------------------------------------------------ |
| `generated`        | Any directory or file named `generated`                |
| `*.pb.go`          | Files ending in `.pb.go`                               |
| `.vscode`          | The `.vscode` directory (normally hidden = ignored)    |
| `vendor/internal`  | The `vendor/internal` path prefix specifically         |

Force-include takes priority over both built-in and extra ignore rules, so you can selectively un-ignore specific paths. These settings are stored in `~/.config/colgrep/config.json` and apply to all projects.
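
The precedence described above can be pictured with a small Python sketch. It is an illustration under stated assumptions, not ColGREP's actual matcher: the helper names `matches` and `is_indexed` are hypothetical, patterns containing `/` are treated as path prefixes, and all other patterns are matched against each path component with shell-style globbing.

```python
from fnmatch import fnmatchcase


def matches(path: str, pattern: str) -> bool:
    """Hypothetical matcher for the pattern syntax in the table above."""
    if "/" in pattern:
        # Path-prefix pattern such as "vendor/internal"
        return path == pattern or path.startswith(pattern + "/")
    # Name pattern such as "generated" or "*.pb.go": test every component
    return any(fnmatchcase(part, pattern) for part in path.split("/"))


def is_indexed(path: str, ignore: list[str], force_include: list[str]) -> bool:
    """Force-include wins; then hidden files and ignore patterns exclude."""
    if any(matches(path, p) for p in force_include):
        return True
    if any(part.startswith(".") for part in path.split("/")):
        return False  # hidden files and directories are ignored by default
    return not any(matches(path, p) for p in ignore)


# A subset of the built-in defaults plus two extra ignore patterns
IGNORE = ["node_modules", ".git", "target", "vendor", "build", "generated", "*.pb.go"]
FORCE = [".vscode", "vendor/internal"]
```

With these lists, `vendor/other/lib.go` is skipped while `vendor/internal/lib.go` and `.vscode/settings.json` are indexed, because the force-include check runs first.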

### Change Model

```bash
# Temporary (single query)
colgrep "query" --model lightonai/LateOn-Code

# Permanent (clears existing indexes)
colgrep set-model lightonai/LateOn-Code

# Private HuggingFace model
HF_TOKEN=hf_xxx colgrep set-model myorg/private-model
```

The configuration is stored at `~/.config/colgrep/config.json`.

---

## Agent Integrations

| Agent       | Install                         | Uninstall                         |
| ----------- | ------------------------------- | --------------------------------- |
| Claude Code | `colgrep --install-claude-code` | `colgrep --uninstall-claude-code` |
| OpenCode    | `colgrep --install-opencode`    | `colgrep --uninstall-opencode`    |
| Codex       | `colgrep --install-codex`       | `colgrep --uninstall-codex`       |

> Restart your agent after installing.

### Claude Code Integration

The Claude Code integration installs session and task hooks that:

- Inject colgrep usage instructions into the agent's system prompt
- Check index health before activating (skips if >3000 chunks need indexing or index is desynced)
- Propagate colgrep instructions to spawned sub-agents via task hooks

This means Claude Code automatically uses `colgrep` as its primary search tool when the index is ready.

### Complete Uninstall

Remove colgrep from all AI tools, clear all indexes, and delete all data:

```bash
colgrep --uninstall
```

---

## How It Works

```mermaid
flowchart TD
    A["Your codebase"] --> B["Tree-sitter"]
    B --> C["Structured representation"]
    C --> D["LateOn-Code-edge · 17M"]
    D --> E["NextPlaid"]
    E --> F["Search"]

    B -.- B1["Parse functions, methods, classes"]
    C -.- C1["Signature, params, calls, docstring, code"]
    D -.- D1["Multi-vector embedding per code unit · runs on CPU"]
    E -.- E1["Rust index binary · quantized · memory-mapped · incremental"]
    F -.- F1["grep-compatible flags · SQLite filtering · semantic ranking<br/>100% local, your code never leaves your machine"]

    style A fill:#4a90d9,stroke:#357abd,color:#fff
    style B fill:#50b86c,stroke:#3d9956,color:#fff
    style C fill:#50b86c,stroke:#3d9956,color:#fff
    style D fill:#e8913a,stroke:#d07a2e,color:#fff
    style E fill:#e8913a,stroke:#d07a2e,color:#fff
    style F fill:#9b59b6,stroke:#8445a0,color:#fff
    style B1 fill:none,stroke:#888,stroke-dasharray:5 5,color:#888
    style C1 fill:none,stroke:#888,stroke-dasharray:5 5,color:#888
    style D1 fill:none,stroke:#888,stroke-dasharray:5 5,color:#888
    style E1 fill:none,stroke:#888,stroke-dasharray:5 5,color:#888
    style F1 fill:none,stroke:#888,stroke-dasharray:5 5,color:#888
```

### 1. Parse

[Tree-sitter](https://tree-sitter.github.io/) parses source files into ASTs and extracts code units: **functions**, **methods**, **classes**, **constants**, and **raw code blocks** (module-level statements not covered by other units). This gives 100% file coverage.

### 2. Analyze (5 Layers)

Each code unit is enriched with five layers of analysis:

| Layer            | Extracts                                                    | Example                           |
| ---------------- | ----------------------------------------------------------- | --------------------------------- |
| **AST**          | Signature, parameters, return type, docstring, parent class | `def fetch(url: str) -> Response` |
| **Call Graph**   | Outgoing calls + reverse `called_by`                        | `Calls: range, client.get`        |
| **Control Flow** | Loops, branches, error handling, cyclomatic complexity      | `has_error_handling: true`        |
| **Data Flow**    | Variable declarations and assignments                       | `Variables: i, e`                 |
| **Dependencies** | Imports used within the function                            | `Uses: client, RequestError`      |
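
To make the Control Flow layer concrete, here is a rough Python sketch that derives similar facts for a Python function. It uses Python's standard `ast` module as a stand-in for Tree-sitter, and the function name `control_flow_summary`, the output keys, and the simplified complexity rule (one plus the number of decision points) are assumptions for illustration, not ColGREP's implementation.

```python
import ast

SOURCE = '''
def fetch_with_retry(url, max_retries=3):
    for i in range(max_retries):
        try:
            return client.get(url)
        except RequestError as e:
            if i == max_retries - 1:
                raise e
'''

# Node types counted as decision points in this simplified metric
DECISION_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.BoolOp, ast.IfExp)


def control_flow_summary(source: str) -> dict:
    """Summarize loops, error handling, and an approximate cyclomatic complexity."""
    nodes = list(ast.walk(ast.parse(source)))
    decisions = sum(isinstance(node, DECISION_NODES) for node in nodes)
    return {
        "has_loops": any(isinstance(n, (ast.For, ast.While)) for n in nodes),
        "has_error_handling": any(isinstance(n, ast.Try) for n in nodes),
        "cyclomatic_complexity": 1 + decisions,
    }
```

For `SOURCE`, the `for`, the `except` handler, and the `if` each count as one decision point, so the sketch reports a complexity of 4 with both flags set to `True`.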

### 3. Build Structured Text

Each unit is converted to a structured text representation before embedding. This gives the model richer signal than raw code alone:

```
Function: fetch_with_retry
Signature: def fetch_with_retry(url: str, max_retries: int = 3) -> Response
Description: Fetches data from a URL with retry logic.
Parameters: url, max_retries
Returns: Response
Calls: range, client.get
Variables: i, e
Uses: client, RequestError
Code:
def fetch_with_retry(url: str, max_retries: int = 3) -> Response:
    """Fetches data from a URL with retry logic."""
    for i in range(max_retries):
        try:
            return client.get(url)
        except RequestError as e:
            if i == max_retries - 1:
                raise e
File: src / utils / http client http_client.py
```

File paths are normalized for better semantic matching: separators become spaces, `snake_case` and `CamelCase` are split (e.g., `HttpClient` &rarr; `http client`).
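
The normalization can be approximated in a few lines of Python. This is a sketch that reproduces the `File:` line of the example above, not the actual Rust code; the function names `split_identifier` and `normalize_path` are hypothetical, and keeping ` / ` between directories and appending the original file name are inferred from that single example.

```python
import re


def split_identifier(name: str) -> str:
    """Split snake_case, kebab-case, and CamelCase into lowercase words."""
    # Insert a space at lower-to-upper boundaries: "HttpClient" -> "Http Client"
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    # Treat underscores and hyphens as word separators
    spaced = re.sub(r"[_\-]+", " ", spaced)
    return " ".join(spaced.lower().split())


def normalize_path(path: str) -> str:
    """Hypothetical normalizer mirroring the `File:` line in the example above."""
    *dirs, filename = path.replace("\\", "/").split("/")
    stem = filename.rsplit(".", 1)[0]
    parts = [split_identifier(d) for d in dirs if d]
    # Keep the original file name after its split form, as in the example
    parts.append(f"{split_identifier(stem)} {filename}")
    return " / ".join(parts)
```

Here `split_identifier("HttpClient")` yields `http client`, and `normalize_path("src/utils/http_client.py")` yields the `File:` value shown in the structured text.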

### 4. Encode with ColBERT

The [ColBERT](https://github.com/stanford-futuredata/ColBERT) model produces **multi-vector embeddings**: ~300 token-level vectors of dimension 128 per code unit (instead of a single vector). At query time, each query token finds its best match across all document tokens (**MaxSim** scoring). This preserves fine-grained information that single-vector models lose.

The default model is [LateOn-Code-edge](https://huggingface.co/lightonai/LateOn-Code-edge) (17M parameters), optimized for code search and fast enough to run on CPU.
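
MaxSim scoring, as described above, can be written out in plain Python. This is a minimal sketch for intuition only: it uses tiny 2-dimensional vectors instead of 128-dimensional ones, cosine similarity as the token-level similarity, and a hypothetical function name `maxsim`. The real index scores compressed embeddings rather than raw float lists.

```python
import math


def _unit(vec: list[float]) -> list[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def maxsim(query_vecs: list[list[float]], doc_vecs: list[list[float]]) -> float:
    """Sum, over query tokens, of the best cosine similarity to any document token."""
    queries = [_unit(v) for v in query_vecs]
    docs = [_unit(v) for v in doc_vecs]
    return sum(
        max(sum(q * d for q, d in zip(qv, dv)) for dv in docs)
        for qv in queries
    )
```

A document whose tokens cover both query tokens scores higher than one that covers only the first: `maxsim([[1, 0], [0, 1]], [[1, 0], [0, 1]])` is 2.0, while `maxsim([[1, 0], [0, 1]], [[1, 0]])` is 1.0. A single pooled vector per document cannot express that per-token distinction.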

### 5. Index with PLAID

The [PLAID](https://arxiv.org/abs/2205.09707) algorithm compresses multi-vector embeddings with **product quantization** (2-bit or 4-bit) and stores them in a **memory-mapped** index. Embedding pooling (default factor: 2) further reduces index size by ~50%. Indexes support **incremental updates** so only changed files are re-encoded.
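
A back-of-the-envelope calculation shows why quantization and pooling matter. The sketch below uses the figures quoted in this README (about 300 vectors of dimension 128 per code unit, 2-bit or 4-bit codes, pool factor 2) and counts only the quantized vector payload. The function name is hypothetical, and centroid assignments, metadata, and other index overhead are deliberately ignored, so real index sizes will differ.

```python
import math


def estimated_payload_bytes(
    num_units: int,
    tokens_per_unit: int = 300,  # approximate figure from this README
    dim: int = 128,
    bits: int = 2,
    pool_factor: int = 2,
) -> int:
    """Rough size of the quantized vectors alone, excluding all index overhead."""
    vectors_per_unit = math.ceil(tokens_per_unit / pool_factor)
    bytes_per_vector = dim * bits // 8
    return num_units * vectors_per_unit * bytes_per_vector


# Uncompressed float32 baseline for one code unit: 300 * 128 * 4 bytes
FLOAT32_BYTES_PER_UNIT = 300 * 128 * 4
```

Under these assumptions, one code unit takes 153,600 bytes as raw float32, 9,600 bytes with 2-bit codes and no pooling, and 4,800 bytes with 2-bit codes and pool factor 2, which matches the roughly 50% reduction attributed to pooling.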

### 6. Search

The search pipeline:

1. **Encode** the query with ColBERT (single ONNX session, fast)
2. **Pre-filter** by metadata if `--include`, `--exclude`, `--exclude-dir` or `--code-only` are set (SQLite)
3. If `-e` pattern is provided: **regex filter** candidates, then score semantically
4. **MaxSim** scoring against the PLAID index
5. **Demote** test functions by subtracting 1 from their score, unless the query mentions "test"
6. **Find representative lines** using weighted token matching with a sliding window
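
Steps 3 to 5 of the pipeline can be sketched in Python as follows. This is an illustration under assumptions, not the actual implementation: semantic scores are passed in as a precomputed mapping, a unit counts as a test when its name starts with `test` (a hypothetical heuristic), and the names `rank`, `units`, and `scores` are invented for the example.

```python
import re


def rank(query, units, scores, pattern=None, test_penalty=1.0):
    """Filter units by regex, then order them by penalty-adjusted semantic score.

    units:  list of dicts with "name" and "code" keys (hypothetical shape)
    scores: mapping from unit name to its semantic (MaxSim) score
    """
    candidates = units
    if pattern is not None:
        regex = re.compile(pattern)
        # Step 3: the regex narrows the candidate set before ranking
        candidates = [u for u in units if regex.search(u["code"])]

    wants_tests = "test" in query.lower()

    def final_score(unit):
        score = scores[unit["name"]]
        # Step 5: demote tests unless the query itself mentions "test"
        if not wants_tests and unit["name"].lower().startswith("test"):
            score -= test_penalty
        return score

    return sorted(candidates, key=final_score, reverse=True)
```

With a test scoring 10.5 and a non-test function scoring 10.0, the query `"retry http"` ranks the non-test function first, while `"test retry"` leaves the original order intact.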

---

## Index Management

### `colgrep init`

Build or incrementally update the index for a project without running a search. If the index already exists, only changed files are re-encoded.

```bash
colgrep init                                        # current directory
colgrep init ~/projects/myapp                       # specific project
colgrep init -y                                     # auto-confirm for large codebases (>10K code units)
colgrep init --model lightonai/LateOn-Code          # use a specific model
colgrep init --pool-factor 1                        # disable embedding pooling (more precise)
```

This is useful for:

- **Pre-warming** the index so the first search is instant
- **CI/dev setup** scripts where you want indexing to happen ahead of time
- **Updating** the index after pulling new code

```bash
# Check index status
colgrep status

# Clear current project index
colgrep clear

# Clear all indexes
colgrep clear --all

# Show statistics
colgrep --stats
```

Indexes are stored outside the project directory:

| Platform | Location                                         |
| -------- | ------------------------------------------------ |
| Linux    | `~/.local/share/colgrep/indices/`                |
| macOS    | `~/Library/Application Support/colgrep/indices/` |
| Windows  | `%APPDATA%\colgrep\indices\`                     |

Each project gets a directory named `{project}-{hash8}`. Inside:

- `index/` &mdash; PLAID vector index + SQLite metadata
- `state.json` &mdash; File hashes for incremental updates
- `project.json` &mdash; Canonical project path

ColGREP automatically detects and repairs index/metadata desync from interrupted operations.
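
For scripts that need to locate the index root, the platform table above can be expressed as a small Python helper. This is a sketch under assumptions: the function name `index_root` is hypothetical, `XDG_DATA_HOME` is assumed to apply on Linux only, and the per-project `{project}-{hash8}` directory is not computed because the hashing scheme is not documented here.

```python
from pathlib import PurePosixPath, PureWindowsPath


def index_root(platform: str, home: str, env: dict) -> str:
    """Return the documented index root for "linux", "macos", or "windows"."""
    if platform == "windows":
        return str(PureWindowsPath(env["APPDATA"]) / "colgrep" / "indices")
    if platform == "macos":
        base = PurePosixPath(home) / "Library" / "Application Support"
        return str(base / "colgrep" / "indices")
    # Linux: honor XDG_DATA_HOME when set (assumed behavior), else ~/.local/share
    data_home = env.get("XDG_DATA_HOME") or str(PurePosixPath(home) / ".local" / "share")
    return str(PurePosixPath(data_home) / "colgrep" / "indices")
```

When in doubt, `colgrep status` remains the more reliable way to inspect the index for the current project.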

---

## Supported Languages

### Code (25 languages, tree-sitter AST parsing)

| Language   | Extensions                            |
| ---------- | ------------------------------------- |
| Python     | `.py`                                 |
| TypeScript | `.ts`, `.tsx`                         |
| JavaScript | `.js`, `.jsx`, `.mjs`                 |
| Go         | `.go`                                 |
| Rust       | `.rs`                                 |
| Java       | `.java`                               |
| C          | `.c`, `.h`                            |
| C++        | `.cpp`, `.cc`, `.cxx`, `.hpp`, `.hxx` |
| C#         | `.cs`                                 |
| Ruby       | `.rb`                                 |
| Kotlin     | `.kt`, `.kts`                         |
| Swift      | `.swift`                              |
| Scala      | `.scala`, `.sc`                       |
| PHP        | `.php`                                |
| Lua        | `.lua`                                |
| Elixir     | `.ex`, `.exs`                         |
| Haskell    | `.hs`                                 |
| OCaml      | `.ml`, `.mli`                         |
| R          | `.r`, `.rmd`                          |
| Zig        | `.zig`                                |
| Julia      | `.jl`                                 |
| SQL        | `.sql`                                |
| Vue        | `.vue`                                |
| Svelte     | `.svelte`                             |
| HTML       | `.html`, `.htm`                       |

### Text & Config (11 formats, document-level extraction)

| Format     | Extensions             |
| ---------- | ---------------------- |
| Markdown   | `.md`                  |
| Plain text | `.txt`, `.rst`         |
| AsciiDoc   | `.adoc`                |
| Org        | `.org`                 |
| YAML       | `.yaml`, `.yml`        |
| TOML       | `.toml`                |
| JSON       | `.json`                |
| Dockerfile | `Dockerfile`           |
| Makefile   | `Makefile`             |
| Shell      | `.sh`, `.bash`, `.zsh` |
| PowerShell | `.ps1`                 |

---

## Installation

The pre-built binaries from [Quick Start](#quick-start) are the fastest way to get started. For hardware acceleration on Linux/Windows, or to build from source, use Cargo.

### Cargo

Install from [crates.io](https://crates.io/crates/colgrep):

```bash
# CPU-only (all platforms)
cargo install colgrep

# macOS with full acceleration (same as pre-built binary)
cargo install colgrep --features "accelerate,coreml"

# Linux with OpenBLAS
cargo install colgrep --features openblas

# Linux with CUDA
cargo install colgrep --features cuda

# Linux with CUDA + TensorRT
cargo install colgrep --features "cuda,tensorrt"

# Windows with DirectML
cargo install colgrep --features directml
```

### Build from Source

```bash
git clone https://github.com/lightonai/next-plaid.git
cd next-plaid
cargo install --path colgrep --features "accelerate,coreml"  # or your preferred features
```

### Hardware Acceleration Features

| Feature      | Platform      | Description                            |
| ------------ | ------------- | -------------------------------------- |
| `accelerate` | macOS         | Apple Accelerate for vector operations |
| `coreml`     | macOS         | CoreML for model inference             |
| `openblas`   | Linux         | OpenBLAS for vector operations         |
| `cuda`       | Linux/Windows | NVIDIA CUDA for model inference        |
| `tensorrt`   | Linux         | NVIDIA TensorRT for model inference    |
| `directml`   | Windows       | DirectML for model inference           |

<details>
<summary><b>OpenBLAS setup (Linux)</b></summary>

```bash
# Debian/Ubuntu
sudo apt install libopenblas-dev

# Fedora/RHEL
sudo dnf install openblas-devel

# Arch
sudo pacman -S openblas
```

Then: `cargo install colgrep --features openblas`

</details>

### ONNX Runtime

ONNX Runtime is downloaded automatically on first use. No manual installation required.

Lookup order:

1. `ORT_DYLIB_PATH` environment variable
2. Python environments (pip/conda/venv)
3. System paths
4. Auto-download to `~/.cache/onnxruntime/`

---

## Environment Variables

| Variable                 | Description                          |
| ------------------------ | ------------------------------------ |
| `ORT_DYLIB_PATH`         | Path to ONNX Runtime library         |
| `XDG_DATA_HOME`          | Override data directory              |
| `XDG_CONFIG_HOME`        | Override config directory            |
| `HF_TOKEN`               | HuggingFace token for private models |
| `HUGGING_FACE_HUB_TOKEN` | Alternative HF token variable        |

---

## Citation

```bibtex
@software{next-plaid,
  title        = {NextPlaid, ColGREP: Multi-vector search, from database to coding agents.},
  url          = {https://github.com/lightonai/next-plaid},
  author       = {Sourty, Rapha\"{e}l},
  contributors = {Dinaburg, Artem and Carron, Igor and Hsu, Chao-Chun (Joe) and Weitekamp, Raymond and R\k{a}czka, Szymon and Motliuk, Mark},
  year         = {2026},
}

@misc{LateOn-Code,
  title  = {LateOn-Code: a Family of State-Of-The-Art Late Interaction Code Retrieval Models},
  author = {Chaffin, Antoine},
  url    = {https://huggingface.co/collections/lightonai/lateon-code},
  year   = {2026}
}
```

## License

Apache-2.0

## See Also

- [llm-tldr](https://github.com/parcadei/llm-tldr)
- [mgrep](https://github.com/mixedbread-ai/mgrep)
- [cgrep](https://github.com/awgn/cgrep)