ripvec-core 1.0.4

Semantic code + document search engine. Cacheless static-embedding + cross-encoder rerank by default; optional ModernBERT/BGE transformer engines with GPU backends. Tree-sitter chunking, hybrid BM25 + PageRank, composable ranking layers.
# ripvec

[![CI](https://github.com/fnordpig/ripvec/actions/workflows/ci.yml/badge.svg)](https://github.com/fnordpig/ripvec/actions/workflows/ci.yml)
[![crates.io](https://img.shields.io/crates/v/ripvec.svg)](https://crates.io/crates/ripvec)
[![docs.rs](https://docs.rs/ripvec-core/badge.svg)](https://docs.rs/ripvec-core)
[![downloads](https://img.shields.io/crates/d/ripvec.svg)](https://crates.io/crates/ripvec)
[![plugin](https://img.shields.io/badge/dynamic/json?url=https%3A%2F%2Fraw.githubusercontent.com%2Ffnordpig%2Fripvec%2Fmain%2Fplugins%2F.claude-plugin%2Fmarketplace.json&query=%24.metadata.version&label=plugin&color=blue)](plugins/)
[![License: MIT/Apache-2.0](https://img.shields.io/badge/license-MIT%2FApache--2.0-blue.svg)](LICENSE-MIT)

**Cacheless semantic code + document search that ties or beats transformer baselines. One binary, 19 grammars, three retrieval engines, zero setup.**

ripvec finds code and documents by meaning, provides structural code intelligence across every language it knows, and ranks results by how important each file is in your project. The default engine runs CPU-only, holds no on-disk index, and matches or exceeds ModernBERT-class transformers on our benchmark matrix across code and prose. Transformer engines remain available, opt-in, for users who want a persistent index and the best coherent top-K on long-form narrative.

```sh
$ ripvec "retry logic with exponential backoff" ~/src/my-project

 1. retry_handler.rs:42-78                                        [0.91]
    pub async fn with_retry<F, T>(f: F, max_attempts: u32) -> Result<T>
    where F: Fn() -> Future<Output = Result<T>> {
        let mut delay = Duration::from_millis(100);
        for attempt in 0..max_attempts {
            match f().await {
                Ok(v) => return Ok(v),
                Err(e) if attempt < max_attempts - 1 => {
                    sleep(delay).await;
                    delay *= 2;  // exponential backoff
    ...

 2. http_client.rs:156-189                                        [0.84]
    impl HttpClient {
        async fn request_with_backoff(&self, req: Request) -> Response {
    ...
```

The function is called `with_retry` and the variable is `delay`. Apart from one trailing comment, "exponential backoff" appears nowhere in the source, and the full query string matches nothing, so grep can't find this. ripvec can, because it embeds both your query and the code into the same vector space, fuses semantic scores with path-enriched BM25, layers a structural-importance signal from a PageRank percentile boost, and reranks the top candidates through a cross-encoder.

## When to use what

ripvec has three interfaces. Here's when each one matters:

| Interface | When to use it | Who uses it |
|-----------|---------------|-------------|
| **CLI** (`ripvec "query" .`) | Terminal search, interactive TUI, one-shot queries | You, directly |
| **MCP server** (`ripvec-mcp`) | AI agent needs to search or understand your codebase | Claude Code, Cursor, any MCP client |
| **LSP server** (`ripvec-mcp --lsp`) | Editor/agent needs symbols, definitions, diagnostics | Claude Code's LSP tool, editors |

The MCP server gives AI agents 8 semantic + structural tools plus 9 LSP tools. The LSP server gives editors structural intelligence (outlines, go-to-definition, syntax diagnostics) for all 19 languages from one binary. The CLI is for humans. The MCP and LSP servers are the same `ripvec-mcp` binary (the `--lsp` flag selects the protocol); the CLI ships as `ripvec`.

If you're using **Claude Code**, install the plugin. It sets up both MCP and LSP automatically; Claude will use `search_code` when you ask conceptual questions and the LSP for symbol navigation.

## Engines

Three retrieval engines share the same CLI/MCP/LSP surface. Pick at runtime with `--model`:

```mermaid
graph TB
    Q["Query"] --> S["Shared surface<br/>CLI / MCP / LSP"]
    S --> R["--model ripvec<br/>(default)"]
    S --> M["--model modernbert"]
    S --> B["--model bert"]
    R --> RP["Model2Vec 32M bi-encoder<br/>+ path-enriched BM25<br/>+ PageRank percentile boost<br/>+ MiniLM-L-12 cross-encoder rerank<br/>= in-memory only"]
    M --> MP["ModernBERT 768-dim transformer<br/>+ BM25 + PageRank<br/>= persistent on-disk index"]
    B --> BP["BGE-small 384-dim transformer<br/>+ BM25 + PageRank<br/>= persistent on-disk index"]
```

| Engine | Pipeline | Cache | When to pick |
|---|---|---|---|
| **`ripvec`** (default) | Model2Vec 32M + L-12 cross-encoder rerank + path-enriched BM25 + PageRank | none (in-memory per session) | Default. Sub-MCPs, fresh worktrees, fan-out agents, document archives, anywhere first-query latency matters. |
| **`modernbert`** | ModernBERT 768d transformer + BM25 + PageRank | `~/.cache/ripvec/` or `.ripvec/cache/` | Workstation with a persistent index. Best coherent top-10 on long-form narrative prose. GPU-capable (Metal/MLX/CUDA). |
| **`bert`** | BGE-small 384d transformer + BM25 + PageRank | `~/.cache/ripvec/` or `.ripvec/cache/` | Lighter transformer alternative. Untested in our 1.0 benchmark matrix; likely worse than ModernBERT, lower memory footprint. |

The MCP daemon picks the engine at startup via `RIPVEC_MCP_ENGINE` (defaults to `ripvec`); the transformer engines require building `ripvec-mcp` with `--features legacy-transformer-mcp`. The CLI accepts `--model` per-invocation with no build-time gating.

## Quality

Measured target-hit count in top-3 results across four corpora, 10 queries each. A "target hit" is a result from the file or book a human evaluator considered correct for the query. Per corpus the maximum is 30 (10 queries × top-3 each); the **TOTAL** column has a maximum of 120.

| Engine | Gutenberg (prose) | Tokio (Rust) | Flask (Python) | ripvec (Rust+MD) | **TOTAL** |
|---|:-:|:-:|:-:|:-:|:-:|
| ModernBERT (`--model modernbert --index`) | 17 | 6 | 14 | 15 | **52 / 120** |
| Ripvec (`--model ripvec`, default) | **22** | **12** | 11 | 11 | **56 / 120** |

The ripvec engine edges out ModernBERT on total top-3 hits across this matrix (56 vs. 52): it leads on Gutenberg and Tokio and trails on Flask and the ripvec repo itself. ModernBERT retains an edge on top-10 coherence for long-form narrative prose: on 3 of 10 Gutenberg queries, every ModernBERT result is from the right book, where ripvec spreads positions 2-10 across noise. That gap is real and is the structural cost of static-embedding retrieval. Most code and document-search workloads care about top-3 far more than top-10 coherence.

The Tokio result is worth flagging. Under ModernBERT, 7/10 dev-flavored queries put test files in top-5 ahead of canonical implementations. The ripvec engine's PageRank percentile boost flips that: implementations rank above the tests that reference them, because tests have near-zero structural PageRank and implementations don't.

## Workflow: orient, search, navigate

```mermaid
graph LR
    A["🗺️ Orient<br/>get_repo_map"] --> B["🔍 Search<br/>search_code"]
    B --> C["🧭 Navigate<br/>LSP operations"]
    C -->|"need more context"| B
    C -->|"found it"| D["✏️ Edit"]
```

**Orient.** `get_repo_map` returns a structural overview ranked by function-level importance. One tool call replaces 10+ sequential file reads. Start here when working on unfamiliar code.

**Search.** `search_code "authentication middleware"` finds implementations by meaning across all 19 languages simultaneously. Results are ranked by relevance and structural importance.

**Navigate.** LSP `documentSymbol` shows the file outline. `goToDefinition` jumps to the likely definition. `findReferences` shows usage sites. `incomingCalls`/`outgoingCalls` trace the call graph.

## Semantic search

You describe behavior, ripvec finds the implementation:

| What you want | grep / ripgrep | ripvec |
|---------------|----------------|--------|
| "retry with backoff" | Nothing (code says `delay *= 2`) | Finds the retry handler |
| "database connection pool" | Comments mentioning "pool" | The pool implementation |
| "authentication middleware" | `// TODO: add auth` | The auth guard |
| "WebSocket lifecycle" | String "WebSocket" | Connect/disconnect handlers |

Search modes: `--mode hybrid` (default, semantic + BM25 fusion), `--mode semantic` (pure vector similarity), `--mode keyword` (pure BM25). Hybrid is usually best.

## Multi-language LSP

ripvec serves LSP from a single binary for all 19 grammars. No per-language server installs. It provides:

- **`documentSymbol`**: file outline (functions, fields, enum variants, constants, types, headings)
- **`workspaceSymbol`**: cross-language symbol search with PageRank boost
- **`goToDefinition`**: name-based resolution ranked by structural importance
- **`findReferences`**: usage sites via hybrid search + content filtering
- **`hover`**: scope chain, signature, enriched context
- **`publishDiagnostics`**: tree-sitter syntax error detection after every edit
- **`incomingCalls` / `outgoingCalls`**: function-level call graph

For languages with dedicated LSPs (Rust, Python, Go, TypeScript), ripvec runs alongside them: the dedicated server handles types, and ripvec handles semantic search and cross-language features. For languages where a dedicated server is often not installed (Bash, HCL, Ruby, Kotlin, Swift, Scala), ripvec is the primary code intelligence.

JSON, YAML, TOML, and Markdown get structural outlines (keys, mappings, headings) and syntax diagnostics. Useful for navigating large config files, not comparable to language-aware intelligence.

## Architecture: the ripvec engine

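The default engine is a composite pipeline built from four signals (static bi-encoder, path-enriched BM25, PageRank boost, cross-encoder rerank), with Reciprocal Rank Fusion merging the first two. Each signal is fast and cheap to rebuild; together they match or exceed a single transformer on total top-3 hits in the benchmark above.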

```mermaid
graph TB
    Q["Query"] --> EMB["Bi-encoder embed<br/>(Model2Vec 32M)"]
    Q --> BM["BM25 score<br/>(path-enriched tokenization)"]
    EMB --> SEM["Cosine similarity<br/>top-N candidates"]
    BM --> LEX["Lexical ranking<br/>top-N candidates"]
    SEM --> RRF["Reciprocal Rank Fusion<br/>(k=60)"]
    LEX --> RRF
    RRF --> PR["× PageRank percentile boost<br/>(sigmoid curve, α=0.5)"]
    PR --> RR["Cross-encoder rerank<br/>(ms-marco-MiniLM-L-12-v2)<br/>top-100 candidates"]
    RR --> OUT["Top-k results"]
```

**Static bi-encoder retrieval (Model2Vec).** The bi-encoder is a lookup-and-mean-pool over a pretrained 256-dim embedding table (`minishlab/potion-base-32M`). No transformer forward pass; encoding cost is dominated by memory bandwidth, not FLOPs. About 5ms per query on a single CPU thread; ~250K chunks per second when indexing in parallel.

**Path-enriched BM25.** Lexical scoring with a code-aware tokenizer that splits `parseJsonConfig` into `[parse, json, config]` and `my_func_name` into `[my, func, name]`. Chunk text is enriched with the file stem (doubled) and the last three directory components before tokenization, so a query like "session encoding" hits both content and `sessions.py` paths.
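
As a rough illustration of the two ideas above, the sketch below splits identifiers at camelCase and snake_case boundaries and prepends path terms to a chunk before tokenization. The names `split_identifier` and `enrich_with_path` are hypothetical, and the exact splitting rules (for example, keeping an acronym run such as `HTTPServer` as one term) are assumptions, not ripvec's actual tokenizer.

```rust
use std::path::Path;

/// Split an identifier into lowercase terms at non-alphanumeric
/// separators and at lower-to-upper camelCase boundaries.
/// Assumption: acronym runs such as `HTTPServer` stay as one term.
fn split_identifier(ident: &str) -> Vec<String> {
    let mut terms = Vec::new();
    let mut current = String::new();
    let mut prev_lower_or_digit = false;
    for ch in ident.chars() {
        if !ch.is_alphanumeric() {
            // `_`, `-`, `.` and similar characters end the current term.
            if !current.is_empty() {
                terms.push(std::mem::take(&mut current));
            }
            prev_lower_or_digit = false;
            continue;
        }
        if ch.is_uppercase() && prev_lower_or_digit {
            // camelCase boundary: "parseJson" -> "parse" | "json".
            terms.push(std::mem::take(&mut current));
        }
        current.extend(ch.to_lowercase());
        prev_lower_or_digit = ch.is_lowercase() || ch.is_numeric();
    }
    if !current.is_empty() {
        terms.push(current);
    }
    terms
}

/// Prepend the file stem (twice) and up to the last three directory
/// components to the chunk text, so path terms take part in BM25.
fn enrich_with_path(path: &Path, chunk: &str) -> String {
    let stem = path.file_stem().and_then(|s| s.to_str()).unwrap_or("");
    let dirs = path
        .parent()
        .map(|p| p.iter().filter_map(|c| c.to_str()).collect::<Vec<&str>>())
        .unwrap_or_default();
    let tail = &dirs[dirs.len().saturating_sub(3)..];
    format!("{stem} {stem} {} {chunk}", tail.join(" "))
}
```

With these helpers, a chunk from `src/flask/sessions.py` carries the term `sessions` twice before any of its own text, which is why a query containing "session" can match on the path alone.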

**Reciprocal Rank Fusion.** Combines the semantic and lexical rankings via Cormack et al.'s rank-based fusion (k=60). Handles the scale mismatch between cosine similarity and BM25 without tuning.
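
A minimal sketch of rank-based fusion, assuming chunk ids are plain `u32` values and each input list is ordered best-first. The function name and the tie-break on id are illustrative choices, not taken from the ripvec source.

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion: each list contributes 1 / (k + rank) for
/// every id it contains, where rank is 1-based. Scores are summed
/// across lists and sorted in descending order.
fn reciprocal_rank_fusion(rankings: &[Vec<u32>], k: f64) -> Vec<(u32, f64)> {
    let mut scores: HashMap<u32, f64> = HashMap::new();
    for ranking in rankings {
        for (index, id) in ranking.iter().enumerate() {
            let rank = (index + 1) as f64;
            *scores.entry(*id).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    let mut fused: Vec<(u32, f64)> = scores.into_iter().collect();
    // Highest fused score first; ties broken by id for determinism.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    fused
}
```

Only ranks enter the formula, never the raw cosine or BM25 values, which is why no score normalization is needed. An id that appears near the top of both lists beats one that tops a single list.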

**PageRank percentile boost.** A structural-importance signal on top of relevance. See the next section.

**Cross-encoder rerank.** The top 100 candidates are re-scored by a cross-encoder (`ms-marco-MiniLM-L-12-v2`) that runs full attention across the concatenated query+document pair. Per-pair cost ~10ms on CPU; total ~1s for 100 candidates. Final scores are blended (0.7 × cross-encoder + 0.3 × bi-encoder) so when the cross-encoder is uncertain (sigmoid outputs clustered near 0.5), the bi-encoder's confident ordering is preserved.
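
The effect of the blend can be sketched in a few lines. This assumes both scores are already on a comparable 0 to 1 scale; the tuple layout and function names are hypothetical.

```rust
/// Blend weights from the text: 0.7 cross-encoder, 0.3 bi-encoder.
fn blend(cross_encoder: f64, bi_encoder: f64) -> f64 {
    0.7 * cross_encoder + 0.3 * bi_encoder
}

/// Order candidates `(name, cross_encoder_score, bi_encoder_score)`
/// by blended score, best first.
fn rerank<'a>(mut candidates: Vec<(&'a str, f64, f64)>) -> Vec<&'a str> {
    candidates.sort_by(|a, b| blend(b.1, b.2).total_cmp(&blend(a.1, a.2)));
    candidates.into_iter().map(|c| c.0).collect()
}
```

When the cross-encoder scores two candidates 0.51 and 0.50, the 0.3 bi-encoder term decides the order; when it scores them 0.95 and 0.20, the cross-encoder dominates.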

### Function-level PageRank

```mermaid
graph LR
    subgraph "Call Graph"
        A["main()"] --> B["handle_request()"]
        A --> C["init_db()"]
        B --> D["authenticate()"]
        B --> E["dispatch()"]
        D --> F["verify_token()"]
        E --> D
    end
    subgraph "PageRank"
        D2["authenticate() ★★★"]
        B2["handle_request() ★★"]
        E2["dispatch() ★"]
    end
```

ripvec extracts call expressions from every function body using tree-sitter, resolves callee names to definitions, and computes PageRank on the resulting call graph. Functions called by many others rank higher. `authenticate()` in the example above is more structurally important than `dispatch()` because more code depends on it.
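
The ranking step can be sketched as a plain power iteration over caller-to-callee edges. The damping factor, the iteration count, and the uniform redistribution of rank from functions that call nothing are conventional PageRank choices assumed here, not values confirmed by the ripvec source.

```rust
/// PageRank by power iteration. `edges` are (caller, callee) pairs
/// over node indices 0..n. Rank flows from caller to callee.
fn pagerank(n: usize, edges: &[(usize, usize)], damping: f64, iters: usize) -> Vec<f64> {
    let mut out_degree = vec![0usize; n];
    for &(caller, _) in edges {
        out_degree[caller] += 1;
    }
    let mut rank = vec![1.0 / n as f64; n];
    for _ in 0..iters {
        // Functions with no outgoing calls spread their rank uniformly.
        let dangling: f64 = (0..n)
            .filter(|&i| out_degree[i] == 0)
            .map(|i| rank[i])
            .sum();
        let base = (1.0 - damping) / n as f64 + damping * dangling / n as f64;
        let mut next = vec![base; n];
        for &(caller, callee) in edges {
            next[callee] += damping * rank[caller] / out_degree[caller] as f64;
        }
        rank = next;
    }
    rank
}
```

Numbering the example graph as `main`=0, `handle_request`=1, `init_db`=2, `authenticate`=3, `dispatch`=4, `verify_token`=5 and running this sketch gives `authenticate` a higher rank than `dispatch`, since `authenticate` receives rank from both `handle_request` and `dispatch`.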

The bi-encoder is structurally weaker than a transformer. Model2Vec doesn't model cross-token interactions and can't reliably distinguish a 1500-char canonical implementation from a 3-line example stub by dense similarity alone. Without a corrective signal, the engine ranks `tests/hello_world.py` competitively with `src/auth/handler.py` on a query like "register a route." PageRank carries the missing signal: implementations are imported by tests and callers; stubs are imported by nothing.

ripvec applies the structural prior as a **sigmoid-on-percentile boost**: `boost(p) = 1 + α × sigmoid((p − 0.5) / s)` where `p` is the file's PR percentile within the corpus, `α=0.5` is the ceiling lift, and `s=0.15` controls steepness.

| PR percentile | Example file | Boost (α=0.5) |
|---|---|---:|
| 0 (not in graph) | isolated leaf file | 1.00× (no boost) |
| 0.10 (bottom decile) | rarely-imported impl | 1.03× |
| 0.25 (lower quartile) | hub of one small module | 1.08× |
| **0.50 (median)** | typical impl file | **1.25×** |
| 0.75 (upper quartile) | heavily-imported module | 1.42× |
| 0.95 (near top) | central trait / API surface | 1.48× |
| 1.00 (graph root) | e.g. `tokio/src/lib.rs` | ~1.48× (asymptote 1.5×) |

Two design constraints fall out of this curve:

1. **At-or-above-median PR gets a meaningfully different boost from low-PR.** A median-importance impl with cosine 0.84 ends at 0.84 × 1.25 = 1.05; a near-zero-PR test with cosine 0.85 ends at 0.85 × 1.02 = 0.867. The impl flips above the test by ~21%, enough to reorder reliably when the bi-encoder is uncertain.
2. **The ceiling caps centers-of-universe.** A graph-root file at p=1.0 gets at most 1.5×. It can't dominate when the query genuinely matches a less-central file.
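
The boost curve and both constraints can be checked with a short sketch. Treating a file that is absent from the graph as `None` with no boost is an assumption based on the first table row; the function names are illustrative.

```rust
/// Logistic sigmoid.
fn sigmoid(x: f64) -> f64 {
    1.0 / (1.0 + (-x).exp())
}

/// boost(p) = 1 + alpha * sigmoid((p - 0.5) / steepness).
/// `percentile` is `None` for files outside the graph, which are
/// assumed here to receive no boost at all.
fn pagerank_boost(percentile: Option<f64>, alpha: f64, steepness: f64) -> f64 {
    match percentile {
        None => 1.0,
        Some(p) => 1.0 + alpha * sigmoid((p - 0.5) / steepness),
    }
}

/// Relevance score after applying the structural prior.
fn boosted_score(cosine: f64, percentile: Option<f64>) -> f64 {
    cosine * pagerank_boost(percentile, 0.5, 0.15)
}
```

With `alpha = 0.5` and `steepness = 0.15`, `boosted_score(0.84, Some(0.5))` comes out above `boosted_score(0.85, Some(0.0))`, which is the reordering described in the first constraint, and no percentile can push the multiplier to 1.5 or beyond.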

The boost is applied via a composable [`RankingLayer`](crates/ripvec-core/src/ranking.rs) chain shared across CLI, MCP, and LSP code paths. Adding a new ranking signal (recency, file-saturation diversification) is a single new `impl RankingLayer`.
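
To give a feel for what a composable layer chain could look like, here is a self-contained sketch of a file-saturation layer. The real trait is defined in `ranking.rs` and its signature is not reproduced here; the `RankingLayer` trait, `Candidate` struct, and `run_chain` helper below are stand-ins invented for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical candidate record: a file path and a relevance score.
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    file: String,
    score: f64,
}

/// Hypothetical trait shape: a layer takes candidates and returns
/// them re-scored, re-ordered, or filtered.
trait RankingLayer {
    fn apply(&self, candidates: Vec<Candidate>) -> Vec<Candidate>;
}

/// Keeps at most `max_per_file` results from any single file.
struct FileSaturation {
    max_per_file: usize,
}

impl RankingLayer for FileSaturation {
    fn apply(&self, mut candidates: Vec<Candidate>) -> Vec<Candidate> {
        // Best first, so the survivors per file are the strongest ones.
        candidates.sort_by(|a, b| b.score.total_cmp(&a.score));
        let mut seen: HashMap<String, usize> = HashMap::new();
        candidates.retain(|c| {
            let count = seen.entry(c.file.clone()).or_insert(0);
            *count += 1;
            *count <= self.max_per_file
        });
        candidates
    }
}

/// Run layers in order, feeding each the previous layer's output.
fn run_chain(layers: &[Box<dyn RankingLayer>], candidates: Vec<Candidate>) -> Vec<Candidate> {
    layers.iter().fold(candidates, |acc, layer| layer.apply(acc))
}
```

Under this shape, a new signal is one struct plus one `impl`, and the chain itself never changes.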

## Architecture: transformer engines

ripvec retains two transformer engines for users who want a persistent on-disk index and the most coherent top-K on long-form prose. Both share the cache architecture, the BM25/RRF/PageRank ranking layers, and the GPU backends; they differ only in the embedding model.

### Cache layout

```mermaid
graph TD
    subgraph "~/.cache/ripvec/&lt;project_hash&gt;/v3-modernbert/"
        M["manifest.json<br/>file entries + Merkle hashes"]
        L["manifest.lock<br/>advisory fd-lock"]
        subgraph "objects/ (content-addressed)"
            O1["ab/cdef12...<br/>(zstd-compressed FileCache)"]
            O2["3f/a891bc...<br/>(zstd-compressed FileCache)"]
        end
    end
```

Each file's chunks and embeddings are serialized into a `FileCache` object, compressed with zstd (~8x), and stored by blake3 content hash in a git-style `xx/hash` sharded object store. The manifest tracks metadata: mtime, size, content hash, chunk count per file, plus Merkle directory hashes.

The ripvec engine never builds any of this. It holds the in-memory index across an MCP session lifetime, drops it on `reindex` or process exit, and rebuilds on next query.

### Change detection: two-level diff

```mermaid
graph TD
    F["File on disk"] --> M{"mtime + size<br/>match manifest?"}
    M -->|"yes"| SKIP["Unchanged<br/>(fast path, no I/O)"]
    M -->|"no"| HASH{"blake3 content hash<br/>matches manifest?"}
    HASH -->|"yes"| TOUCH["Touched but identical<br/>(heal mtime in manifest)"]
    HASH -->|"no"| DIRTY["Dirty → re-embed"]
```

Level 1 (mtime+size) is a stat call (microseconds). Level 2 (blake3 hash) reads the file but avoids re-embedding if content hasn't changed. After `git clone` (where all mtimes are wrong), the first run hashes everything but re-embeds nothing, then heals the manifest mtimes for fast-path on subsequent runs.
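
The decision logic can be sketched as a pure function. The struct fields and names are hypothetical, and the hash is passed in as a closure so that the sketch stays free of external crates (the real engine uses blake3) and so that the fast path visibly never reads the file.

```rust
/// Outcome of comparing a file on disk with its manifest entry.
#[derive(Debug, PartialEq)]
enum FileState {
    /// mtime and size match: skip without reading the file.
    Unchanged,
    /// Metadata differs but content is identical: heal the mtime.
    Touched,
    /// Content changed: re-embed.
    Dirty,
}

/// Hypothetical manifest entry for one file.
struct ManifestEntry {
    mtime_secs: u64,
    size: u64,
    content_hash: [u8; 32],
}

/// Two-level diff. `hash_file` is only invoked when level 1 fails,
/// so unchanged files cost one stat call and no read.
fn classify(
    entry: &ManifestEntry,
    mtime_secs: u64,
    size: u64,
    hash_file: impl FnOnce() -> [u8; 32],
) -> FileState {
    if entry.mtime_secs == mtime_secs && entry.size == size {
        return FileState::Unchanged;
    }
    if hash_file() == entry.content_hash {
        FileState::Touched
    } else {
        FileState::Dirty
    }
}
```

After a fresh `git clone`, every file takes the second branch once and lands in `Touched`; once the manifest mtimes are healed, later runs return `Unchanged` from the first branch.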

### Serialization: two formats

| Format | Used for | Portable? |
|--------|----------|-----------|
| **rkyv** (zero-copy) | User-level cache (~/.cache) | No (architecture-dependent) |
| **bitcode** | Repo-level cache (.ripvec/) | Yes (cross-architecture) |

Auto-detected on read via magic bytes: `0x42 0x43` = bitcode, otherwise rkyv. Both are zstd-compressed. Repo-level indices use bitcode so they can be committed to git and shared between x86 CI and ARM developer machines.
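
A minimal sketch of the detection rule as stated: two leading magic bytes select bitcode, anything else is treated as rkyv. Whether the check runs before or after zstd decompression is not specified above, so the function simply takes whatever byte slice carries the magic; the names are hypothetical.

```rust
/// Serialization format of a cache object.
#[derive(Debug, PartialEq)]
enum CacheFormat {
    Bitcode,
    Rkyv,
}

/// `0x42 0x43` (ASCII "BC") marks bitcode; everything else is rkyv.
fn detect_format(bytes: &[u8]) -> CacheFormat {
    if bytes.starts_with(&[0x42, 0x43]) {
        CacheFormat::Bitcode
    } else {
        CacheFormat::Rkyv
    }
}
```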

### Concurrency and locking

```mermaid
sequenceDiagram
    participant MCP as MCP Server
    participant Watcher as File Watcher
    participant Lock as manifest.lock
    participant Cache as Object Store

    Note over MCP: Query arrives
    MCP->>Lock: acquire read lock
    MCP->>Cache: load objects
    Lock-->>MCP: release

    Note over Watcher: File change detected (2s debounce)
    Watcher->>Lock: acquire write lock
    Watcher->>Cache: re-embed dirty files
    Watcher->>Cache: write new objects
    Watcher->>Cache: save manifest + GC
    Lock-->>Watcher: release
```

The file watcher debounces for 2 seconds of quiet before triggering re-indexing. Advisory `fd-lock` on `manifest.lock` prevents readers from seeing a half-written manifest. Multiple readers can proceed concurrently; writers block all readers.

Garbage collection runs after each incremental update; unreferenced objects (from deleted or re-embedded files) are removed from the store.

### Repo-level indexing

```sh
ripvec --model modernbert --index --repo-level "query"
git add .ripvec/ && git commit -m "add search index"
```

Creates `.ripvec/config.toml` (pins model + version) and `.ripvec/cache/` (manifest + objects). Teammates who clone get instant search. The config is validated on load. If the model doesn't match the runtime model, ripvec falls back to the user-level cache with a warning.

Repo config can also exclude files from the index using `.gitignore` syntax:

```toml
[ignore]
patterns = [
  "*.jsonl",
  "*.md",
  "docs/generated/**",
  "!docs/README.md",
]
```

These patterns apply to CLI indexing, incremental cache diffing, MCP reindexing, and repo-map file discovery. The command-line `--exclude-extensions=jsonl,md` flag is useful for one-off extension filters.

### Cache resolution

```mermaid
graph TD
    A["--cache-dir override"] -->|"highest priority"| R["Resolved cache dir"]
    B[".ripvec/config.toml<br/>(repo-local)"] -->|"if model matches"| R
    C["RIPVEC_CACHE env var"] --> R
    D["~/.cache/ripvec/<br/>(XDG default)"] -->|"lowest priority"| R
```
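
The priority order in the diagram reads naturally as an `Option` chain. This is a sketch under stated assumptions: `RepoConfig` and the parameter names are hypothetical, and the caller is assumed to have already read `--cache-dir`, `.ripvec/config.toml`, and the `RIPVEC_CACHE` environment variable.

```rust
use std::path::PathBuf;

/// Hypothetical view of the repo-local config: pinned model and cache path.
struct RepoConfig {
    model: String,
    cache_dir: PathBuf,
}

/// Resolve the cache directory, highest priority first:
/// CLI override, repo-local cache (only if the model matches),
/// environment variable, then the XDG default.
fn resolve_cache_dir(
    cli_override: Option<PathBuf>,
    repo: Option<&RepoConfig>,
    runtime_model: &str,
    env_value: Option<PathBuf>,
    xdg_default: PathBuf,
) -> PathBuf {
    cli_override
        .or_else(|| {
            repo.filter(|r| r.model == runtime_model)
                .map(|r| r.cache_dir.clone())
        })
        .or(env_value)
        .unwrap_or(xdg_default)
}
```

A repo config pinned to a different model is skipped rather than treated as an error, so resolution continues down the chain to the environment variable or the user-level default.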

### Embedding pipeline

```mermaid
graph LR
    subgraph "Stage 1: Chunk (rayon)"
        F["Files"] --> TS["Tree-sitter<br/>parse"]
        TS --> C["Semantic<br/>chunks"]
    end
    subgraph "Stage 2: Tokenize"
        C --> T["Tokenizer<br/>(BPE / WordPiece)"]
        T --> B["Padded<br/>batches"]
    end
    subgraph "Stage 3: Embed (GPU)"
        B --> FW["Forward pass<br/>(22 layers ModernBERT,<br/>12 layers BGE-small)"]
        FW --> P["Mean / CLS pool<br/>+ L2 norm"]
        P --> V["Embedding<br/>vectors"]
    end
    C -.->|"bounded channel<br/>backpressure"| T
    T -.->|"bounded channel<br/>backpressure"| FW
```

For large corpora (1000+ files), stages run concurrently as a streaming pipeline with bounded channels for backpressure. The GPU starts embedding after the first batch (~50ms), not after all files are chunked.
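
Bounded-channel backpressure can be demonstrated with the standard library alone. In this sketch, blank-line splitting stands in for tree-sitter chunking, whitespace splitting stands in for the tokenizer, and counting tokens stands in for the GPU forward pass; none of this is ripvec's actual pipeline code.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Three-stage streaming pipeline. Each `sync_channel` holds at most
/// `capacity` items, so a fast producer blocks on `send` until the
/// slower downstream stage catches up.
fn run_pipeline(files: Vec<String>, capacity: usize) -> Vec<usize> {
    let (chunk_tx, chunk_rx) = sync_channel::<String>(capacity);
    let (batch_tx, batch_rx) = sync_channel::<Vec<String>>(capacity);

    // Stage 1: chunk (stand-in: split on blank lines).
    let chunker = thread::spawn(move || {
        for file in files {
            for chunk in file.split("\n\n") {
                if chunk_tx.send(chunk.to_string()).is_err() {
                    return; // downstream hung up
                }
            }
        }
    });

    // Stage 2: tokenize (stand-in: split on whitespace).
    let tokenizer = thread::spawn(move || {
        for chunk in chunk_rx {
            let tokens: Vec<String> = chunk.split_whitespace().map(str::to_string).collect();
            if batch_tx.send(tokens).is_err() {
                return;
            }
        }
    });

    // Stage 3: "embed" on the calling thread (stand-in: token count).
    // It starts consuming as soon as the first batch arrives.
    let embedded: Vec<usize> = batch_rx.iter().map(|tokens| tokens.len()).collect();
    chunker.join().unwrap();
    tokenizer.join().unwrap();
    embedded
}
```

Dropping a sender closes its channel, so each stage ends when the stage before it finishes, without any explicit shutdown signal.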

### Driver / Architecture split

The core design insight for the transformer engines: the forward pass is written ONCE as a generic `ModernBertArch<D: Driver>`, and each backend implements the `Driver` trait with platform-specific operations. Same model, same math, different hardware.

```mermaid
graph TB
    subgraph "Architecture (written once)"
        FP["ModernBertArch&lt;D: Driver&gt;<br/>forward()"]
        FP --> L1["Layer 1: Attention + FFN"]
        L1 --> L2["Layer 2: Attention + FFN"]
        L2 --> LN["...22 layers..."]
        LN --> Pool["Mean pool + L2 norm"]
    end
    subgraph "Driver trait implementations"
        FP -.->|"D = Metal"| M["MetalDriver<br/>MPS GEMMs + custom MSL kernels"]
        FP -.->|"D = CUDA"| CU["CudaDriver<br/>cuBLAS tensor cores + NVRTC kernels"]
        FP -.->|"D = CPU"| CP["CpuDriver<br/>ndarray + Accelerate/OpenBLAS"]
        FP -.->|"D = MLX"| ML["MlxDriver<br/>lazy eval → auto-fused Metal"]
    end
```
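
The pattern in the diagram is ordinary trait-based static dispatch. The sketch below is a toy version: the two-method `Driver` trait, `ScalarCpuDriver`, and `ToyArch` are invented for illustration and are far smaller than the real `Driver` trait and `ModernBertArch`, whose signatures are not shown here.

```rust
/// Hypothetical minimal driver: the hardware-specific primitives.
trait Driver {
    /// Row-major (m x k) times (k x n) matrix product.
    fn gemm(&self, a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32>;
    /// Element-wise a + b.
    fn add(&self, a: &[f32], b: &[f32]) -> Vec<f32>;
}

/// Scalar CPU backend; a GPU backend would implement the same trait.
struct ScalarCpuDriver;

impl Driver for ScalarCpuDriver {
    fn gemm(&self, a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
        let mut out = vec![0.0; m * n];
        for i in 0..m {
            for p in 0..k {
                for j in 0..n {
                    out[i * n + j] += a[i * k + p] * b[p * n + j];
                }
            }
        }
        out
    }

    fn add(&self, a: &[f32], b: &[f32]) -> Vec<f32> {
        a.iter().zip(b).map(|(x, y)| x + y).collect()
    }
}

/// Toy architecture written once against `D: Driver`.
struct ToyArch<D: Driver> {
    driver: D,
    dim: usize,
    /// One dim x dim weight matrix per layer, row-major.
    layer_weights: Vec<Vec<f32>>,
}

impl<D: Driver> ToyArch<D> {
    /// Each layer computes hidden + hidden * W (projection plus residual).
    fn forward(&self, input: &[f32]) -> Vec<f32> {
        let mut hidden = input.to_vec();
        for weights in &self.layer_weights {
            let projected = self.driver.gemm(&hidden, weights, 1, self.dim, self.dim);
            hidden = self.driver.add(&hidden, &projected);
        }
        hidden
    }
}
```

Because `forward` only calls trait methods, swapping the backend means changing the type parameter, and the compiler monomorphizes a separate copy of the forward pass per driver.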

### What each backend actually does per layer

Each of the 22 ModernBERT layers runs attention + FFN. Here's how the same operations map to different hardware:

```mermaid
graph LR
    subgraph "Attention"
        LN1["LayerNorm"] --> QKV["QKV projection<br/>(GEMM)"]
        QKV --> PAD["Pad + Split"]
        PAD --> ROPE["RoPE rotation"]
        ROPE --> ATTN["Q @ K^T<br/>(batched GEMM)"]
        ATTN --> SM["Scale + Mask<br/>+ Softmax"]
        SM --> AV["Scores @ V<br/>(batched GEMM)"]
        AV --> UNPAD["Reshape + Unpad"]
        UNPAD --> OPROJ["Output proj<br/>(GEMM)"]
        OPROJ --> RES1["Residual add"]
    end
    subgraph "FFN"
        RES1 --> LN2["LayerNorm"]
        LN2 --> WI["Wi projection<br/>(GEMM)"]
        WI --> GEGLU["Split + GeGLU"]
        GEGLU --> WO["Wo projection<br/>(GEMM)"]
        WO --> RES2["Residual add"]
    end
```

| Operation | Metal | CUDA | CPU | MLX |
|-----------|-------|------|-----|-----|
| **GEMM** | MPS (AMX) | cuBLAS FP16 tensor cores | Accelerate / OpenBLAS | Auto-fused |
| **Softmax+Scale+Mask** | Fused MSL kernel | Fused NVRTC kernel | Scalar loop | Auto-fused |
| **RoPE** | Custom MSL kernel | Custom NVRTC kernel | Scalar loop | Lazy ops |
| **GeGLU (split+gelu+gate)** | Fused MSL kernel | Fused NVRTC kernel | Scalar loop | Auto-fused |
| **Pad/Unpad/Reshape** | Custom MSL kernels | Custom NVRTC kernels | Rust loops | Free (metadata) |
| **FP16 support** | Yes (all kernels) | Yes (all kernels) | No | No |

Metal and CUDA have hand-written fused kernels for softmax, GeGLU, and attention reshape. These eliminate intermediate buffers and reduce memory bandwidth. MLX gets fusion automatically via lazy evaluation (the entire forward pass typically compiles to 2-3 Metal kernel dispatches). CPU uses explicit scalar loops for everything except GEMM.

### Dual search index (transformer engines)

```mermaid
graph LR
    subgraph "HybridIndex"
        subgraph "SearchIndex (dense vectors)"
            EMB["embeddings<br/>(TurboQuant 4-bit compressed)"]
            EMB --> CS["Cosine similarity scan"]
        end
        subgraph "Bm25Index (tantivy)"
            TAN["Inverted index<br/>(code-aware tokenizer)"]
            TAN --> BM["BM25 scoring<br/>(name 3× / path 1.5× / body 1×)"]
        end
        CS --> RRF["RRF fusion (k=60)"]
        BM --> RRF
    end
```

The transformer BM25 index uses a code-aware tokenizer that splits `parseJsonConfig` into `[parse, json, config]` and `my_func_name` into `[my, func, name]`. Keyword search finds `json config parser` even if the function is named in camelCase. Function names are boosted 3x over body text.

TurboQuant compresses 768-dim vectors from 3KB to ~380 bytes (4-bit) with a rotation matrix for better quantization. This enables ~5x faster scanning for large indices while maintaining ranking quality through exact re-ranking of the top candidates.
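
The size figures follow from packing two 4-bit codes into each byte: 768 × 4 bytes = 3072 bytes as `f32`, and 768 × 4 bits = 384 bytes packed. The sketch below shows only that storage arithmetic with plain nibble packing. It is not TurboQuant: the rotation matrix and the choice of quantization levels are outside its scope, and the function names are hypothetical.

```rust
/// Pack 4-bit codes (values 0..=15) two per byte, low nibble first.
fn pack_nibbles(codes: &[u8]) -> Vec<u8> {
    codes
        .chunks(2)
        .map(|pair| {
            let low = pair[0] & 0x0F;
            let high = pair.get(1).copied().unwrap_or(0) & 0x0F;
            low | (high << 4)
        })
        .collect()
}

/// Recover `len` 4-bit codes from packed bytes.
fn unpack_nibbles(packed: &[u8], len: usize) -> Vec<u8> {
    packed
        .iter()
        .flat_map(|&byte| [byte & 0x0F, byte >> 4])
        .take(len)
        .collect()
}
```

A 768-dimension vector of 4-bit codes packs into 384 bytes, an 8× reduction from the 3072-byte `f32` form, which matches the "3KB to ~380 bytes" figure above.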

## Performance

**Cacheless (ripvec engine, the default).** Wall time for a single query, end-to-end including model load on cold start:

| Corpus | First query (cold) | Warm | Notes |
|---|---|---|---|
| Small repo (~500 files) | ~7s | 0.3s | Model download + index build dominate cold path |
| Medium repo (~5K files, e.g. Tokio) | ~12s | 0.8s | |
| Large repo (~50K files) | ~50s | 8s | Linear in file count for indexing |
| Linux kernel (~92K files, 1.7 GB) | ~75s | n/a (in-memory drops between processes) | |

The MCP daemon holds the in-memory index for the session lifetime, so warm latency dominates after the first query. For sub-MCPs and agent fan-out where each spawn starts fresh, the cold-path numbers are what to budget against. Roughly 100× faster cold-path than ModernBERT cacheless (33s/79s on Gutenberg/Tokio respectively).

**Indexed (transformer engines).** Time to build the persistent index on first run; subsequent queries against the cached index are milliseconds.

| Hardware | Throughput | Time (Flask corpus, 2383 chunks) |
|----------|-----------|----------------------------------|
| RTX 4090 (CUDA) | 435 chunks/s | ~5s |
| M2 Max (Metal) | 73.8 chunks/s | ~32s |
| M2 Max (CPU/Accelerate) | 73.5 chunks/s | ~32s |

Metal and CPU show similar throughput on M2 Max because macOS Accelerate routes BLAS operations through the AMX coprocessor regardless of backend. The Metal backend has headroom on larger batches and non-BLAS operations.

**Memory.** Ripvec engine: ~200 MB for a typical project (embedding table + chunks + BM25). Transformer engines: ~500 MB during embedding (model weights + batch buffers), ~100 MB for query-time.

**Where CPU goes on the ripvec engine (linux/92K corpus, sampled).**

| Component | % of CPU-time |
|---|---:|
| rayon worker synchronization (intrinsic par_iter joins) | ~38% |
| tokenizer Unicode normalization (upstream `tokenizers` crate) | ~10% |
| file I/O (read + open syscalls) | ~5% |
| pool_ids (SIMD f32x8, our kernel) | ~2% |
| tree-sitter parse | ~3% |
| BM25 build + interner | ~3% |
| useful work | ~36% |

The 38% sync floor is structural: rayon's `par_iter` join semantics require parking workers between stages. We've shipped what's worth shipping past that floor (mimalloc, hand-vectorized pool_ids, bounded-queue streaming pipeline, lasso term interning). Further compression would require restructuring around an async stage scheduler.

## How it compares

| Tool | Type | Key difference from ripvec |
|------|------|--------------------------|
| ripgrep | Text search | No semantic understanding |
| Sourcegraph | Cloud AI platform | $49-59/user/month, code leaves your machine |
| grepai | Local semantic search | Requires Ollama for embeddings |
| mgrep | Semantic search | Uses cloud embeddings (Mixedbread AI) |
| Serena | MCP symbol navigation | Requires per-language LSP servers installed |
| Bloop | Was semantic + navigation | Archived Jan 2025 |
| VS Code anycode | Tree-sitter outlines | Editor-only, no cross-file search |
| Cursor @Codebase | IDE semantic search | Cursor-only, sends embeddings to cloud |

ripvec is self-contained (no Ollama, no cloud, no per-language setup), runs locally, and combines search + LSP + structural ranking in one binary. The cacheless default fits sub-MCP / fan-out / fresh-worktree workflows where a persistent index isn't viable.

## Install

### Pre-built binaries (fastest)

```sh
cargo binstall ripvec ripvec-mcp
```

Requires [cargo-binstall](https://github.com/cargo-bins/cargo-binstall). Downloads a pre-built binary for your platform; no compilation.

### From source

```sh
cargo install ripvec ripvec-mcp
```

For CUDA (Linux with NVIDIA GPU, transformer engines only):

```sh
cargo install ripvec ripvec-mcp --features cuda
```

To enable transformer engines on the MCP daemon:

```sh
cargo install ripvec-mcp --features legacy-transformer-mcp
```

(The default `ripvec-mcp` build ships only the ripvec engine. The CLI binary `ripvec` accepts all engines without feature gating.)

### Claude Code plugin

```sh
claude plugin install ripvec@fnordpig-my-claude-plugins
```

The plugin auto-downloads the binary for your platform on first use and configures both MCP and LSP servers. It includes 3 skills (codebase orientation, semantic discovery, change impact analysis), 3 commands (`/map`, `/find`, `/repo-index`), and a code exploration agent. CUDA is auto-detected via `nvidia-smi`.

### Platforms

| Platform | Backends | GPU |
|----------|----------|-----|
| macOS Apple Silicon | Metal + MLX + CPU (Accelerate) | Metal auto-enabled |
| Linux x86_64 | CPU (OpenBLAS) | CUDA with `--features cuda` |
| Linux ARM64 (Graviton) | CPU (OpenBLAS) | CUDA with `--features cuda` |

Model weights download automatically on first run: ~33MB (potion-base-32M, default ripvec engine) or ~100MB (ModernBERT). The cross-encoder reranker (`ms-marco-MiniLM-L-12-v2`, 33MB) downloads on first use of the ripvec engine.

## Usage

### CLI

```sh
ripvec "error handling" .                              # Default ripvec engine
ripvec "form validation hooks" -n 5                    # Top 5 results
ripvec "database migration" --mode keyword             # BM25 only
ripvec "session encoding" --model modernbert --index   # ModernBERT with persistent index
ripvec --model modernbert --index --exclude-extensions=jsonl,md  # Skip noisy extensions
ripvec -i --model modernbert --index .                 # Interactive TUI (transformer engines only)
```

### MCP server

```json
{ "mcpServers": { "ripvec": { "command": "ripvec-mcp" } } }
```

Tools (8 retrieval + 9 LSP):

| Category | Tools |
|---|---|
| Retrieval | `search_code`, `search_text`, `find_similar`, `find_duplicates`, `get_repo_map`, `reindex`, `index_status`, `up_to_date` |
| LSP | `lsp_document_symbols`, `lsp_workspace_symbols`, `lsp_hover`, `lsp_goto_definition`, `lsp_goto_implementation`, `lsp_references`, `lsp_prepare_call_hierarchy`, `lsp_incoming_calls`, `lsp_outgoing_calls` |
| Diagnostics | `debug_log`, `log_level` |

Engine selection is per-daemon via `RIPVEC_MCP_ENGINE={ripvec,modernbert,bert}`; default is `ripvec`. Tool schemas are stable across engines: `index_status` reports `engine: "ripvec"` and `cache_location: "in-memory"` under the ripvec engine, `engine: "modernbert"` and an on-disk path under transformer engines.

### LSP server

```sh
ripvec-mcp --lsp   # serves LSP over stdio
```

The same `ripvec-mcp` binary serves both protocols; the `--lsp` flag switches it from MCP to LSP.
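For readers wiring up a client by hand: LSP over stdio uses the base-protocol framing from the LSP specification, a `Content-Length` header followed by a blank line and the JSON-RPC body. The sketch below shows that framing only. The `frame` helper is a hypothetical name, and the `initialize` parameters are a minimal illustrative payload, not a statement of what ripvec requires.

```rust
/// Wrap a JSON-RPC body in the LSP base-protocol header.
/// `frame` is a hypothetical helper for illustration.
fn frame(body: &str) -> String {
    // Content-Length counts bytes, not characters; `str::len` returns bytes.
    format!("Content-Length: {}\r\n\r\n{}", body.len(), body)
}

fn main() {
    // Minimal `initialize` request; the params shown are illustrative.
    let body = r#"{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"processId":null,"rootUri":null,"capabilities":{}}}"#;
    // These are the bytes a client would write to the server's stdin.
    print!("{}", frame(body));
}
```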

## Supported languages

19 tree-sitter grammars, 30 file extensions:

| Language | Extensions | Extracted elements |
|----------|-----------|-------------------|
| Rust | `.rs` | functions, structs, enums, variants, fields, impls, traits, consts, mods |
| Python | `.py` | functions, classes, assignments |
| JavaScript | `.js` `.jsx` | functions, classes, methods, variables |
| TypeScript | `.ts` `.tsx` | functions, classes, interfaces, type aliases, enums |
| Go | `.go` | functions, methods, types, constants |
| Java | `.java` | methods, classes, interfaces, enums, fields, constructors |
| C | `.c` `.h` | functions, structs, enums, typedefs |
| C++ | `.cpp` `.cc` `.cxx` `.hpp` | functions, classes, namespaces, enums, fields |
| Bash | `.sh` `.bash` `.bats` | functions, variables |
| Ruby | `.rb` | methods, classes, modules, constants |
| HCL / Terraform | `.tf` `.tfvars` `.hcl` | blocks (resources, data, variables) |
| Kotlin | `.kt` `.kts` | functions, classes, objects, properties |
| Swift | `.swift` | functions, classes, protocols, properties |
| Scala | `.scala` | functions, classes, traits, objects, vals, types |
| TOML | `.toml` | tables, key-value pairs |
| JSON | `.json` | object keys |
| YAML | `.yaml` `.yml` | mapping keys |
| Markdown | `.md` | headings |

Unsupported file types get sliding-window plain-text chunking. The embedding model handles any language; tree-sitter just provides better chunk boundaries.
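The fallback described above can be sketched as an extension lookup followed by a line-based sliding window. This is a minimal sketch under stated assumptions: `ChunkStrategy`, `strategy_for`, and `sliding_window` are hypothetical names, only a few rows of the table are included, and the window and overlap sizes are placeholders rather than ripvec's actual parameters.

```rust
use std::path::Path;

/// How a file would be chunked. Hypothetical type for illustration.
#[derive(Debug, PartialEq, Eq)]
enum ChunkStrategy {
    /// A tree-sitter grammar exists; the payload is the language name.
    TreeSitter(&'static str),
    /// No grammar; fall back to plain-text windows.
    SlidingWindow,
}

/// Pick a strategy from the file extension (subset of the table above).
fn strategy_for(path: &str) -> ChunkStrategy {
    let ext = Path::new(path).extension().and_then(|e| e.to_str());
    match ext {
        Some("rs") => ChunkStrategy::TreeSitter("Rust"),
        Some("py") => ChunkStrategy::TreeSitter("Python"),
        Some("tf" | "tfvars" | "hcl") => ChunkStrategy::TreeSitter("HCL / Terraform"),
        Some("yaml" | "yml") => ChunkStrategy::TreeSitter("YAML"),
        _ => ChunkStrategy::SlidingWindow,
    }
}

/// Split text into overlapping windows of `window` lines.
/// Consecutive windows share `overlap` lines (assumed sizes).
fn sliding_window(text: &str, window: usize, overlap: usize) -> Vec<String> {
    assert!(window > overlap, "window must exceed overlap");
    let lines: Vec<&str> = text.lines().collect();
    let step = window - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < lines.len() {
        let end = (start + window).min(lines.len());
        chunks.push(lines[start..end].join("\n"));
        if end == lines.len() {
            break; // the last window reached the end of the file
        }
        start += step;
    }
    chunks
}

fn main() {
    let text = "a\nb\nc\nd\ne";
    println!("{:?}", strategy_for("notes.txt"));
    println!("{:?}", sliding_window(text, 3, 1));
}
```

The overlap means a passage that straddles a window boundary still appears whole in at least one chunk, which is the usual reason to prefer overlapping windows over fixed splits.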

## Acknowledgments

ripvec's static bi-encoder uses [Model2Vec](https://github.com/MinishLab/model2vec) embeddings (`potion-base-32M`, `potion-code-16M`) from MinishLab, whose [semble](https://github.com/MinishLab/semble) pipeline inspired the path-enriched BM25 and query-shape boosting design we ported to Rust and extended. Cross-encoder rerank uses [`ms-marco-MiniLM-L-12-v2`](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2). See [CREDITS.md](CREDITS.md) for the full ledger of what we used, what we ported, and what we built on top.

## Limitations

- **goToDefinition is best-effort**: resolves by name matching and structural importance, not by type system analysis. Use dedicated LSPs (rust-analyzer, pyright, gopls) when you need exact resolution for overloaded symbols.
- **Call graph is approximate**: common names like `new`, `run`, `render` may resolve to the wrong definition. Cross-crate resolution limited to workspace members.
- **Ripvec engine top-10 coherence is weaker on long-form prose**: ModernBERT retains an edge on narrative corpora where a single source document is the right answer for every position in the top-10. Top-3 quality is competitive; coherent top-10 is not. If you're searching a legal archive or a book collection and need 10 contiguous hits from the same source, `--model modernbert --index` is the better tool.
- **Cacheless cold start scales linearly**: first-query indexing on the ripvec engine is O(files). At 92K files (Linux kernel) it's ~75s. Persistent transformer engines amortize this across runs but pay model-download and disk-cache costs.
- **English-centric**: the models behind both engine families (ripvec and transformer) were trained primarily on English text, so queries and code comments in other languages will have lower recall.
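To put the cold-start figure in the list above in perspective, the sketch below turns the single quoted data point (~75s at 92K files) into a per-file cost and a linear estimate for other repository sizes. It is a back-of-the-envelope extrapolation only: it assumes cost depends on file count alone and ignores file size, language mix, and hardware. The function names are hypothetical.

```rust
/// Reference data point quoted in the README (Linux kernel, approximate).
const REF_FILES: f64 = 92_000.0;
const REF_SECS: f64 = 75.0;

/// Average indexing cost per file, in milliseconds, at the reference point.
fn per_file_ms() -> f64 {
    REF_SECS / REF_FILES * 1000.0
}

/// Linear (O(files)) estimate of first-query indexing time in seconds.
/// Assumes the same per-file cost as the reference point.
fn estimate_cold_start_secs(files: u64) -> f64 {
    files as f64 * REF_SECS / REF_FILES
}

fn main() {
    println!("per file: {:.2} ms", per_file_ms());
    for files in [1_000u64, 10_000, 92_000] {
        println!("{files} files: ~{:.1} s", estimate_cold_start_secs(files));
    }
}
```

At roughly 0.8 ms per file, a repository with a few thousand files would index in a few seconds under this assumption, which is why the cacheless default is practical for most projects.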

## Development

```sh
cargo fmt --check && cargo clippy --all-targets -- -D warnings && cargo test --workspace
```

See [CLAUDE.md](CLAUDE.md) for detailed development conventions, architecture notes, and MCP tool namespace resolution.

### Architecture

Cargo workspace with three crates:

| Crate | Role |
|-------|------|
| [`ripvec-core`](crates/ripvec-core) | Engines (ripvec + transformer), backends, chunking, embedding, search, repo map, cache, call graph, ranking layers |
| [`ripvec`](crates/ripvec) | CLI binary (clap + ratatui TUI) |
| [`ripvec-mcp`](crates/ripvec-mcp) | MCP + LSP server binary (rmcp + tower-lsp-server) |

### Docs

- [CREDITS.md](CREDITS.md): full attribution for models, libraries, and design inspiration
- [Metal/MPS Architecture](docs/METAL_MPS_ARCHITECTURE.md)
- [CUDA Architecture](docs/CUDA_ARCHITECTURE.md)
- [Development Learnings](docs/LEARNINGS.md)

## License

Licensed under either of [Apache-2.0](LICENSE-APACHE) or [MIT](LICENSE-MIT) at your option.