omengrep 0.0.2

Semantic code search — multi-vector embeddings + BM25
# omengrep (og)

**Semantic code search — multi-vector embeddings + BM25**

## Quick Reference

```bash
cargo build --release                 # Build
cargo install --path .                # Install binary
og build ./src                        # Build index (required first)
og "query" ./src                      # Semantic search (text query)
og file.rs#func                       # Find similar code (by name)
og file.rs:42                         # Find similar code (by line)
cargo test                            # Run tests
```

## Architecture

```
Build:  Scan (ignore crate) -> Extract (tree-sitter, 25 langs) -> Embed (ort, LateOn-Code-edge INT8) -> Store (omendb multi-vector) + index_text (BM25)
Search: Embed query -> search_multi_with_text (BM25 candidates + MuVERA MaxSim rerank) -> Code-aware boost -> Results

Index hierarchy:
- Building: checks parent (refuses if exists), merges subdirs (fast vector copy)
- Searching: walks up to find index, filters results to search scope
```

| Component  | Implementation                                        |
| ---------- | ----------------------------------------------------- |
| Scanner    | `ignore` crate (gitignore-aware, binary detection)    |
| Extraction | Tree-sitter AST (25 languages)                        |
| Embeddings | `ort` (LateOn-Code-edge INT8 ONNX, 48d/token)        |
| Vector DB  | omendb (MuVERA multi-vector + BM25 hybrid)            |
| Boosting   | Code-aware heuristics (name match, type, path)        |

## Project Structure

```
src/
├── main.rs                 # Entry point
├── lib.rs                  # Re-exports
├── types.rs                # Block, SearchResult, FileRef
├── boost.rs                # Code-aware ranking boosts
├── tokenize.rs             # BM25 identifier splitting
├── cli/
│   ├── mod.rs              # Command dispatch (clap)
│   ├── search.rs           # Search command + file ref parsing
│   ├── build.rs            # Build/update index
│   ├── status.rs           # Index status
│   ├── clean.rs            # Delete index
│   ├── list.rs             # List indexes
│   ├── model.rs            # Model management
│   └── output.rs           # Result formatting (default, json, compact, files-only)
├── embedder/
│   ├── mod.rs              # Embedder trait + factory
│   ├── onnx.rs             # ORT ONNX inference (LateOn-Code-edge)
│   └── tokenizer.rs        # HuggingFace tokenizer wrapper
├── extractor/
│   ├── mod.rs              # Tree-sitter extraction coordinator
│   ├── languages.rs        # Language registry (extension -> parser + query)
│   ├── queries.rs          # Tree-sitter query definitions per language
│   └── text.rs             # Markdown/prose chunking
└── index/
    ├── mod.rs              # SemanticIndex (omendb multi-vector)
    ├── manifest.rs         # Manifest v8 (JSON, tracks files/hashes/blocks)
    └── walker.rs           # File walker (ignore crate, gitignore-aware)
Cargo.toml
```

## Technology Stack

| Component    | Version                  | Notes                                 |
| ------------ | ------------------------ | ------------------------------------- |
| Rust         | nightly-2025-12-04       | Required for omendb (portable_simd)   |
| ort          | 2.0.0-rc.11              | ONNX Runtime inference                |
| omendb       | 0.0.27 (path dep)        | Multi-vector + BM25 hybrid search     |
| tree-sitter  | 0.25                     | AST parsing (25 languages)            |
| Embeddings   | LateOn-Code-edge INT8    | 17M params, 48d/token, ~17MB          |

## Code Standards

| Aspect     | Standard                                           |
| ---------- | -------------------------------------------------- |
| Edition    | 2021 (moving to 2024)                              |
| Errors     | `anyhow` (app), `thiserror` (lib boundaries)       |
| Imports    | `crate::` over `super::`, stdlib -> external -> local |
| Parallelism| `rayon` for CPU-bound extraction                   |
| Strings    | `&str` > `String`, `&[T]` > `Vec<T>` where possible |

## Verification

| Check  | Command                          | Pass Criteria   |
| ------ | -------------------------------- | --------------- |
| Build  | `cargo build --release`          | Zero errors     |
| Test   | `cargo test`                     | All pass        |
| Smoke  | `og "test" ./src`                | Returns results |
| Lint   | `cargo clippy`                   | No warnings     |

## Key Behaviors

- `OG_AUTO_BUILD=1` — auto-build index on search if missing
- Auto-update: search detects stale files and re-indexes before searching
- Exit codes: 0 = match found, 1 = no match, 2 = error
- File refs: `file#name` (by block name), `file:line` (by line number)
- Output formats: default (colored), `--json`, `--json --compact`, `-l` (files only)

## AI Context

**Read order:** `ai/STATUS.md` -> `ai/DECISIONS.md`

| File              | Purpose                          |
| ----------------- | -------------------------------- |
| `ai/STATUS.md`    | Current state, blockers, roadmap |
| `ai/DECISIONS.md` | Architectural decisions          |