search-semantically 0.1.5

Embeddable semantic code search with multi-signal POEM ranking.

A Rust library crate that provides local, incremental code search. It combines BM25 full-text search, vector similarity via ONNX embeddings, path matching, symbol matching, import-graph propagation, and git recency, and ranks results with the Pareto-optimal Election Method (POEM).

Features

  • Semantic search — ONNX-powered all-MiniLM-L6-v2 embeddings with cosine similarity
  • Full-text search — BM25 scoring over chunk content
  • Tree-sitter chunking — language-aware splitting into functions, structs, impls, etc.
  • Multi-signal ranking — six independent signals fused via POEM
  • Incremental indexing — only re-indexes files changed since last run (mtime diffing)
  • Git-aware — recency signal based on commit history
  • Import graph propagation — follows import/usage relationships to boost relevant code
  • Zero external services — runs entirely locally, model downloaded and cached on first use
  • Embeddable — designed as a library crate, not a CLI tool
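The incremental-indexing bullet above boils down to an mtime diff between what the index recorded last run and what a fresh scan sees on disk. A minimal sketch of that idea (not the crate's actual implementation, which stores mtimes in SQLite):

```rust
use std::collections::HashMap;
use std::time::SystemTime;

/// Return the paths that need (re-)indexing: files whose current mtime
/// differs from the one recorded at the last index run, plus files the
/// index has never seen. Deleted files would be handled separately.
fn changed_files(
    indexed: &HashMap<String, SystemTime>, // path -> mtime stored in the index
    on_disk: &HashMap<String, SystemTime>, // path -> mtime from a fresh scan
) -> Vec<String> {
    on_disk
        .iter()
        .filter(|(path, mtime)| indexed.get(*path) != Some(*mtime))
        .map(|(path, _)| path.clone())
        .collect()
}
```

Everything not listed by such a diff can be skipped entirely, which is what makes re-runs cheap on large trees.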

Supported Languages

Language     Feature Flag
Rust         tree-sitter-rust (default)
TypeScript   ts-typescript
Python       ts-python
Go           ts-go
Java         ts-java
C            ts-c
C++          ts-cpp
Markdown     built-in (text chunker)

Plain text and other formats fall back to line/paragraph chunking automatically.
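Assuming the flags in the table follow the usual Cargo feature conventions, a project that also wants, say, Python and Go support would enable them in its dependency declaration:

```toml
[dependencies]
search-semantically = { version = "0.1", features = ["ts-python", "ts-go"] }
```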

Quick Start

Add to your Cargo.toml:

[dependencies]
search-semantically = "0.1"

Then use it:

use search_semantically::SearchEngine;
use std::path::PathBuf;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let engine = SearchEngine::new(PathBuf::from("/path/to/project"))?;
    engine.index()?; // incremental: only processes new/changed files
    let results = engine.search("function that parses HTTP headers")?;
    for result in results {
        println!("{}", result);
    }
    Ok(())
}

On first run, the ONNX model (all-MiniLM-L6-v2, 384-dim) is downloaded and cached at $XDG_CACHE_HOME/search-semantically/models/Xenova/all-MiniLM-L6-v2/.

Architecture

graph TD
    subgraph SearchEngine ["SearchEngine (top-level API)"]
        scanner["scanner (walk)"]
        chunker["chunker (ts/txt)"]
        embedder["embedder (ONNX)"]
        metrics["metrics (6 signals)"]
        ranker["ranker (POEM)"]
    end
    db["db (SQLite)<br/>files · chunks · symbols · imports"]
    scanner --> chunker --> embedder --> metrics --> ranker
    scanner --- db
    chunker --- db
    embedder --- db
    metrics --- db
    ranker --- db

Data Flow

  1. SearchEngine::search() opens/creates .search-index/search.db in the project root
  2. Scanner walks the project, diffing against indexed files by mtime
  3. New/changed files are chunked, embedded, and stored in SQLite
  4. Query is classified (Identifier / NaturalLanguage / PathLike)
  5. Six metric signals are computed per candidate (up to 1000 candidates)
  6. Results are ranked via POEM and returned as formatted output
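Step 4's query classification maps directly onto the QueryType enum listed under Key Types. The crate's actual heuristics are not documented here, but a hypothetical sketch of the kind of rule it implies might look like:

```rust
/// The three query classes named in the data flow. (The variant names
/// come from the docs; the classification rules below are illustrative
/// guesses, not the crate's real logic.)
#[derive(Debug, PartialEq)]
enum QueryType {
    Identifier,
    NaturalLanguage,
    PathLike,
}

fn classify(query: &str) -> QueryType {
    if query.contains('/') || query.ends_with(".rs") {
        // Path separators or a file extension suggest a path lookup.
        QueryType::PathLike
    } else if !query.contains(' ')
        && query.chars().all(|c| c.is_alphanumeric() || c == '_' || c == ':')
    {
        // A single identifier-shaped token, e.g. `parse_headers`.
        QueryType::Identifier
    } else {
        QueryType::NaturalLanguage
    }
}
```

The class plausibly steers signal weighting: a PathLike query would lean on the path signal, an Identifier on symbol matching, and a NaturalLanguage query on embeddings and BM25.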

The Six Signals

Signal         Description
BM25           Full-text relevance over chunk content
Cosine         Vector similarity between query and chunk embeddings
Path           Match strength between query and file path
Symbol         Match against defined symbol names (functions, structs, etc.)
Import Graph   Propagation through import/usage relationships
Git Recency    How recently the file was modified in git history
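The README does not spell out POEM itself, but Pareto-optimal ranking generally rests on a dominance test over the signal vector: candidate A dominates candidate B if A scores at least as high on every signal and strictly higher on at least one. A minimal sketch under that reading (the MetricScores field names are assumptions; only the type name and "six f64 scores" come from the docs):

```rust
/// Six per-candidate signal scores, mirroring the table above.
/// Field names are illustrative; the crate's actual layout may differ.
struct MetricScores {
    bm25: f64,
    cosine: f64,
    path: f64,
    symbol: f64,
    import_graph: f64,
    git_recency: f64,
}

impl MetricScores {
    fn as_array(&self) -> [f64; 6] {
        [self.bm25, self.cosine, self.path,
         self.symbol, self.import_graph, self.git_recency]
    }
}

/// `a` Pareto-dominates `b` if it is >= on every signal
/// and strictly > on at least one.
fn dominates(a: &MetricScores, b: &MetricScores) -> bool {
    let (a, b) = (a.as_array(), b.as_array());
    a.iter().zip(b.iter()).all(|(x, y)| x >= y)
        && a.iter().zip(b.iter()).any(|(x, y)| x > y)
}
```

One appeal of a dominance-based fusion is that the six raw scores live on incompatible scales (BM25 weights vs. cosine similarities vs. timestamps), and Pareto comparison never has to add them together.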

Key Types

Type           Purpose
SearchEngine   Main entry point, constructed with a project-root PathBuf
StoredChunk    A chunk row from the DB (id, file_id, path, lines, kind, content)
TextChunk      In-memory chunk produced by chunkers (content, line range, kind, optional name)
MetricScores   Six f64 scores per candidate
QueryType      Identifier / NaturalLanguage / PathLike
FileType       Enum of supported languages and formats

Building & Testing

cargo build                        # debug build (downloads ONNX model on first embed)
cargo test                         # run all tests (uses tempfile, no external deps needed)
cargo test -- --nocapture          # run tests with stdout visible

All tests use tempfile::TempDir for full isolation — no setup required.

Index Storage

  • Index database: <project_root>/.search-index/search.db
  • ONNX model cache: $XDG_CACHE_HOME/search-semantically/models/Xenova/all-MiniLM-L6-v2/