# search-semantically

Embeddable semantic code search with multi-signal POEM ranking.
A Rust library crate that provides local, incremental code search combining BM25 full-text search, vector similarity via ONNX embeddings, path matching, symbol matching, import graph propagation, and git recency — ranked using the Pareto-optimal Election Method (POEM).
> [!NOTE]
> I (@Fizzizist) cannot take credit for the data science that went into this crate. All credit goes to the hard work done by @aebrer and colleagues.
## Features
- Semantic search — ONNX-powered `all-MiniLM-L6-v2` embeddings with cosine similarity
- Full-text search — BM25 scoring over chunk content
- Tree-sitter chunking — language-aware splitting into functions, structs, impls, etc.
- Multi-signal ranking — six independent signals fused via POEM
- Incremental indexing — only re-indexes files changed since last run (mtime diffing)
- Git-aware — recency signal based on commit history
- Import graph propagation — follows import/usage relationships to boost relevant code
- Zero external services — runs entirely locally, model downloaded and cached on first use
- Embeddable — designed as a library crate, not a CLI tool
## Supported Languages
| Language | Feature Flag |
|---|---|
| Rust | `tree-sitter-rust` (default) |
| TypeScript | `ts-typescript` |
| Python | `ts-python` |
| Go | `ts-go` |
| Java | `ts-java` |
| C | `ts-c` |
| C++ | `ts-cpp` |
| Markdown | built-in (text chunker) |
Plain text and other formats fall back to line/paragraph chunking automatically.
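For example, additional languages are enabled through the corresponding feature flags in `Cargo.toml` (flag names are taken from the table above; the crate name and version mirror the Quick Start below):

```toml
[dependencies]
search-semantically = { version = "0.1", features = ["ts-python", "ts-typescript"] }
```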
## Quick Start
Add to your `Cargo.toml`:

```toml
[dependencies]
search-semantically = "0.1"
```
Then use it:
use SearchEngine;
use PathBuf;
let engine = new?;
engine.index?; // incremental — only processes new/changed files
let results = engine.search?;
for result in results
On first run, the ONNX model (`all-MiniLM-L6-v2`, 384-dim) is downloaded and cached at `$XDG_CACHE_DIR/search-semantically/models/Xenova/all-MiniLM-L6-v2/`.
## Architecture

```mermaid
graph TD
    subgraph SearchEngine ["SearchEngine (top-level API)"]
        scanner["scanner (walk)"]
        chunker["chunker (ts/txt)"]
        embedder["embedder (ONNX)"]
        metrics["metrics (6 signals)"]
        ranker["ranker (POEM)"]
    end
    db["db (SQLite)<br/>files · chunks · symbols · imports"]
    scanner --> chunker --> embedder --> metrics --> ranker
    scanner --- db
    chunker --- db
    embedder --- db
    metrics --- db
    ranker --- db
```
## Data Flow
- `SearchEngine::search()` opens/creates `.search-index/search.db` in the project root
- Scanner walks the project, diffing against indexed files by mtime
- New/changed files are chunked, embedded, and stored in SQLite
- Query is classified (`Identifier` / `NaturalLanguage` / `PathLike`); one possible heuristic is sketched below
- Six metric signals are computed per candidate (up to 1000 candidates)
- Results are ranked via POEM and returned as formatted output
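The classification heuristics themselves are not documented in this README; the sketch below is only a hypothetical illustration of how a query string might map onto the three `QueryType` variants (the variant names come from the Key Types section; everything else is an assumption):

```rust
// Hypothetical classifier sketch: only the three variant names are from the crate.
enum QueryType {
    Identifier,
    NaturalLanguage,
    PathLike,
}

fn classify(query: &str) -> QueryType {
    let q = query.trim();
    // Assumption: slashes or a dotted single token look like a path.
    if q.contains('/') || q.contains('\\') || (q.contains('.') && !q.contains(' ')) {
        QueryType::PathLike
    // Assumption: one identifier-like token (snake_case, CamelCase, or a `::` path).
    } else if !q.contains(' ')
        && q.chars().all(|c| c.is_alphanumeric() || c == '_' || c == ':')
    {
        QueryType::Identifier
    } else {
        QueryType::NaturalLanguage
    }
}
```

Under these assumed rules, `classify("src/db.rs")` returns `PathLike`, `classify("SearchEngine::new")` returns `Identifier`, and `classify("how does indexing work")` falls through to `NaturalLanguage`.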
## The Six Signals
| Signal | Description |
|---|---|
| BM25 | Full-text relevance over chunk content |
| Cosine | Vector similarity between query and chunk embeddings |
| Path | Match strength between query and file path |
| Symbol | Match against defined symbol names (functions, structs, etc.) |
| Import Graph | Propagation through import/usage relationships |
| Git Recency | How recently the file was modified in git history |
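This README does not spell out how POEM fuses the six signals, but the Pareto idea it is named for can be sketched: a candidate that scores no better than another on every signal, and worse on at least one, should not outrank it. The struct fields, dominance test, and sorting below are illustrative assumptions, not the crate's implementation (the crate exposes its own `MetricScores` type; see Key Types).

```rust
// Illustrative only: one way to order candidates by Pareto dominance over six
// "higher is better" signal scores. Field names are assumptions.
#[derive(Clone, Copy)]
struct MetricScores {
    bm25: f64,
    cosine: f64,
    path: f64,
    symbol: f64,
    import_graph: f64,
    git_recency: f64,
}

impl MetricScores {
    fn as_array(&self) -> [f64; 6] {
        [self.bm25, self.cosine, self.path, self.symbol, self.import_graph, self.git_recency]
    }
}

/// `a` dominates `b` if `a` is at least as good on every signal and strictly better on one.
fn dominates(a: &MetricScores, b: &MetricScores) -> bool {
    let (a, b) = (a.as_array(), b.as_array());
    a.iter().zip(b.iter()).all(|(x, y)| x >= y) && a.iter().zip(b.iter()).any(|(x, y)| x > y)
}

/// Candidates dominated by fewer others sort first; `usize` is a stand-in candidate id.
fn rank_by_dominance(mut candidates: Vec<(usize, MetricScores)>) -> Vec<usize> {
    let scores: Vec<MetricScores> = candidates.iter().map(|(_, s)| *s).collect();
    candidates.sort_by_key(|(_, s)| scores.iter().filter(|&other| dominates(other, s)).count());
    candidates.into_iter().map(|(id, _)| id).collect()
}
```

Non-dominated candidates (the Pareto front) end up at the top without any hand-tuned weights, which is the property that makes this style of multi-signal fusion attractive.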
## Key Types
| Type | Purpose |
|---|---|
| `SearchEngine` | Main entry point, constructed with a project root `PathBuf` |
| `StoredChunk` | A chunk row from the DB (id, file_id, path, lines, kind, content) |
| `TextChunk` | In-memory chunk produced by chunkers (content, line range, kind, optional name) |
| `MetricScores` | Six `f64` scores per candidate |
| `QueryType` | `Identifier` / `NaturalLanguage` / `PathLike` |
| `FileType` | Enum of supported languages and formats |
Building & Testing
All tests use tempfile::TempDir for full isolation — no setup required.
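A minimal sketch of that pattern, assuming the Quick Start API (the file contents, query, result assertion, and error type here are placeholders, and the first run will still download the ONNX model):

```rust
#[cfg(test)]
mod tests {
    use search_semantically::SearchEngine;
    use tempfile::TempDir;

    #[test]
    fn indexes_and_searches_a_throwaway_project() -> Result<(), Box<dyn std::error::Error>> {
        let dir = TempDir::new()?; // deleted automatically when dropped
        std::fs::write(dir.path().join("lib.rs"), "fn parse_config() {}")?;

        // The index lives in .search-index/ inside the temp dir, not your repo.
        let engine = SearchEngine::new(dir.path().to_path_buf())?;
        engine.index()?;
        let results = engine.search("parse_config")?; // illustrative query
        assert!(!results.is_empty()); // assumes results is a collection of hits
        Ok(())
    }
}
```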
## Index Storage
- Index database: `<project_root>/.search-index/search.db`
- ONNX model cache: `$XDG_CACHE_DIR/search-semantically/models/Xenova/all-MiniLM-L6-v2/`