# Semantic Search Module Specification
**Version:** 1.0
**Created:** 2026-02-03
**Author:** architect-agent
**Status:** Approved for Implementation
## Overview
This specification defines the semantic code search module for tldr-rs, providing AI-powered code search using dense embeddings. The module enables natural language queries to find semantically related code, similarity detection between code fragments, and embedding generation for downstream tools.
## Table of Contents
1. [Module Structure](#1-module-structure)
2. [Public API Types](#2-public-api-types)
3. [Core Components](#3-core-components)
4. [CLI Commands](#4-cli-commands)
5. [Behavioral Contracts](#5-behavioral-contracts)
6. [Error Handling](#6-error-handling)
7. [Integration Points](#7-integration-points)
8. [Performance Considerations](#8-performance-considerations)
9. [Testing Strategy](#9-testing-strategy)
---
## 1. Module Structure
```
tldr-core/src/semantic/
├── mod.rs # Module exports and re-exports
├── spec.md # This specification
├── embedder.rs # Embedding generation (fastembed-rs wrapper)
├── chunker.rs # Code chunking via tree-sitter
├── index.rs # In-memory embedding index
├── cache.rs # JSON-based embedding cache
├── similarity.rs # Cosine similarity and search
└── types.rs # Shared types (CodeChunk, EmbeddingResult, etc.)
tldr-cli/src/commands/
├── semantic.rs # `tldr semantic <query> [path]`
├── embed.rs # `tldr embed <file|path>`
├── similar.rs # `tldr similar <file>`
└── explain.rs # `tldr explain <file> <function>` (DEFERRED)
```
### Module Dependencies
```
┌─────────────────────────────────────────────────────────────────┐
│ tldr-cli │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ semantic │ │ embed │ │ similar │ │ explain │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ │ │ │ │ (DEFERRED) │
└───────┼────────────┼────────────┼────────────┼──────────────────┘
│ │ │ │
▼ ▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────┐
│ tldr-core/semantic │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ index.rs │ │
│ │ SemanticIndex::search() / SemanticIndex::find_similar() │ │
│ └────────────────────────┬─────────────────────────────────┘ │
│ │ │
│ ┌────────────┐ ┌────────┴───────┐ ┌────────────┐ │
│ │ embedder.rs│◄─┤ similarity.rs │──► cache.rs │ │
│ └─────┬──────┘ └────────────────┘ └─────┬──────┘ │
│ │ │ │
│ ┌─────▼──────┐ ┌─────▼──────┐ │
│ │ chunker.rs │ │ types.rs │ │
│ └─────┬──────┘ └────────────┘ │
│ │ │
└────────┼────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────┐ ┌─────────────────────┐
│ tldr-core/ast │ │ fastembed-rs │
│ (tree-sitter) │ │ (embedding models) │
└─────────────────────┘ └─────────────────────┘
```
---
## 2. Public API Types
### 2.1 Core Types (`types.rs`)
```rust
use std::path::PathBuf;
use serde::{Deserialize, Serialize};
/// A chunk of code that can be embedded
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct CodeChunk {
/// Source file path (relative to project root)
pub file_path: PathBuf,
/// Function/method name (None for file-level chunks)
pub function_name: Option<String>,
/// Class/struct name containing this function (if any)
pub class_name: Option<String>,
/// Start line number (1-indexed)
pub line_start: u32,
/// End line number (1-indexed, inclusive)
pub line_end: u32,
/// The source code text
pub content: String,
/// Content hash for cache invalidation (MD5 or SHA-256)
pub content_hash: String,
/// Language of the code
pub language: Language,
}
/// Embedding result for a code chunk
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct EmbeddedChunk {
/// The original code chunk
pub chunk: CodeChunk,
/// Dense embedding vector (768 dimensions for Arctic-M)
pub embedding: Vec<f32>,
/// Model used to generate embedding
pub model: EmbeddingModel,
/// Timestamp when embedding was generated (Unix epoch seconds)
pub embedded_at: u64,
}
/// Supported embedding models (Snowflake Arctic family)
#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize, Default)]
#[serde(rename_all = "kebab-case")]
pub enum EmbeddingModel {
/// 384 dims, 30MB, 512 context
ArcticXS,
/// 384 dims, 90MB, 512 context
ArcticS,
/// 768 dims, 110MB, 512 context (DEFAULT)
#[default]
ArcticM,
/// 768 dims, 110MB, 8192 context
ArcticMLong,
/// 1024 dims, 335MB, 512 context
ArcticL,
}
impl EmbeddingModel {
/// Get embedding dimension for this model
pub fn dimensions(&self) -> usize {
match self {
Self::ArcticXS | Self::ArcticS => 384,
Self::ArcticM | Self::ArcticMLong => 768,
Self::ArcticL => 1024,
}
}
/// Get max context length (tokens)
pub fn max_context(&self) -> usize {
match self {
Self::ArcticMLong => 8192,
_ => 512,
}
}
/// Get fastembed model enum variant
pub fn fastembed_model(&self) -> fastembed::EmbeddingModel {
match self {
Self::ArcticXS => fastembed::EmbeddingModel::SnowflakeArcticEmbedXS,
Self::ArcticS => fastembed::EmbeddingModel::SnowflakeArcticEmbedS,
Self::ArcticM => fastembed::EmbeddingModel::SnowflakeArcticEmbedM,
Self::ArcticMLong => fastembed::EmbeddingModel::SnowflakeArcticEmbedMLong,
Self::ArcticL => fastembed::EmbeddingModel::SnowflakeArcticEmbedL,
}
}
}
/// Granularity for code chunking
#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize, Default)]
#[serde(rename_all = "lowercase")]
pub enum ChunkGranularity {
/// One chunk per file
File,
/// One chunk per function/method (DEFAULT)
#[default]
Function,
}
/// Semantic search result
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct SemanticSearchResult {
/// File path
pub file_path: PathBuf,
/// Function name (if function-level)
pub function_name: Option<String>,
/// Class name (if method)
pub class_name: Option<String>,
/// Cosine similarity score (0.0 to 1.0)
pub score: f64,
/// Start line
pub line_start: u32,
/// End line
pub line_end: u32,
/// Code snippet (truncated for display)
pub snippet: String,
}
/// Report from semantic search
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct SemanticSearchReport {
/// Search results sorted by score (descending)
pub results: Vec<SemanticSearchResult>,
/// Original query
pub query: String,
/// Model used for query embedding
pub model: EmbeddingModel,
/// Total chunks searched
pub total_chunks: usize,
/// Results above threshold
pub matches_above_threshold: usize,
/// Search latency in milliseconds
pub latency_ms: u64,
/// Whether cache was used
pub cache_hit: bool,
}
/// Report from embedding generation
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct EmbedReport {
/// Path that was embedded
pub path: PathBuf,
/// Model used
pub model: EmbeddingModel,
/// Granularity used
pub granularity: ChunkGranularity,
/// Number of chunks embedded
pub chunks_embedded: usize,
/// Number of chunks loaded from cache
pub chunks_cached: usize,
/// Embedded chunks (if output requested)
#[serde(skip_serializing_if = "Option::is_none")]
pub chunks: Option<Vec<EmbeddedChunk>>,
/// Total embedding time in milliseconds
pub latency_ms: u64,
}
/// Report from similarity search
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct SimilarityReport {
/// Source file/function being compared
pub source: CodeChunk,
/// Similar code fragments
pub similar: Vec<SemanticSearchResult>,
/// Model used
pub model: EmbeddingModel,
/// Total chunks compared
pub total_compared: usize,
/// Whether self was excluded
pub exclude_self: bool,
}
```
### 2.2 Embedder Types (`embedder.rs`)
```rust
use fastembed::TextEmbedding;
/// Embedding service using fastembed-rs
pub struct Embedder {
/// The fastembed TextEmbedding instance
model: TextEmbedding,
/// Model variant being used
model_type: EmbeddingModel,
/// Whether model is loaded
initialized: bool,
}
/// Options for embedding generation
#[derive(Debug, Clone, Default)]
pub struct EmbedOptions {
/// Model to use (default: ArcticM)
pub model: EmbeddingModel,
/// Show progress during embedding
pub show_progress: bool,
/// Batch size for embedding (default: 32)
pub batch_size: usize,
}
```
### 2.3 Chunker Types (`chunker.rs`)
```rust
/// Code chunking options
#[derive(Debug, Clone, Default)]
pub struct ChunkOptions {
/// Granularity (file or function)
pub granularity: ChunkGranularity,
/// Maximum chunk size in characters (0 = no limit)
pub max_chunk_size: usize,
/// Include docstrings/comments in chunks
pub include_docs: bool,
/// Languages to process (None = auto-detect)
pub languages: Option<Vec<Language>>,
}
```
### 2.4 Index Types (`index.rs`)
```rust
/// In-memory semantic index for fast similarity search
pub struct SemanticIndex {
/// All embedded chunks
chunks: Vec<EmbeddedChunk>,
/// Model used for all embeddings
model: EmbeddingModel,
/// Project root path
root: PathBuf,
}
/// Options for similarity search
#[derive(Debug, Clone)]
pub struct SearchOptions {
/// Number of results to return
pub top_k: usize,
/// Minimum similarity threshold (0.0 to 1.0)
pub threshold: f64,
/// Model to use for query embedding
pub model: EmbeddingModel,
/// Exclude exact matches (for similarity search)
pub exclude_self: bool,
}
impl Default for SearchOptions {
fn default() -> Self {
Self {
top_k: 10,
threshold: 0.5,
model: EmbeddingModel::default(),
exclude_self: false,
}
}
}
```
### 2.5 Cache Types (`cache.rs`)
```rust
/// JSON-based embedding cache
pub struct EmbeddingCache {
/// Cache file path
path: PathBuf,
/// In-memory cache entries
entries: HashMap<String, CacheEntry>,
/// Dirty flag for lazy writes
dirty: bool,
}
/// A single cache entry
#[derive(Debug, Clone, Serialize, Deserialize)]
struct CacheEntry {
/// Content hash of the source code
content_hash: String,
/// Model used to generate embedding
model: EmbeddingModel,
/// The embedding vector
embedding: Vec<f32>,
/// Timestamp when cached
cached_at: u64,
}
/// Cache configuration
#[derive(Debug, Clone)]
pub struct CacheConfig {
/// Cache directory (default: ~/.cache/tldr/embeddings/)
pub cache_dir: PathBuf,
/// Maximum cache size in MB (default: 500)
pub max_size_mb: usize,
/// Cache entry TTL in days (default: 30)
pub ttl_days: u32,
}
```
---
## 3. Core Components
### 3.1 Embedder (`embedder.rs`)
#### Primary Function
```rust
impl Embedder {
/// Create a new embedder with the specified model
///
/// # Arguments
/// * `model` - The embedding model to use
///
/// # Returns
/// * `TldrResult<Self>` - Initialized embedder or error
///
/// # Errors
/// * `TldrError::ModelLoadError` - Failed to load model
/// * `TldrError::IoError` - Cache directory inaccessible
///
/// # Example
/// ```rust
/// let embedder = Embedder::new(EmbeddingModel::ArcticM)?;
/// ```
pub fn new(model: EmbeddingModel) -> TldrResult<Self>;
/// Embed a single text
///
/// # Arguments
/// * `text` - Text to embed
///
/// # Returns
/// * `TldrResult<Vec<f32>>` - Embedding vector
///
/// # Invariants
/// * Output length == model.dimensions()
/// * Output is normalized (L2 norm == 1.0)
pub fn embed_text(&self, text: &str) -> TldrResult<Vec<f32>>;
/// Embed multiple texts in batch
///
/// # Arguments
/// * `texts` - Texts to embed
/// * `show_progress` - Whether to show progress bar
///
/// # Returns
/// * `TldrResult<Vec<Vec<f32>>>` - Embedding vectors
///
/// # Performance
/// * Batching reduces overhead for multiple texts
/// * Default batch size: 32
pub fn embed_batch(
&self,
texts: &[String],
show_progress: bool
) -> TldrResult<Vec<Vec<f32>>>;
/// Get the model being used
pub fn model(&self) -> EmbeddingModel;
}
```
#### Behavioral Contract
| Empty string | Zero vector | All zeros, normalized |
| Text > max_context | Truncated embedding | First N tokens embedded |
| Valid text | Normalized embedding | L2 norm == 1.0 |
| Invalid UTF-8 | Error | `TldrError::EncodingError` |
### 3.2 Chunker (`chunker.rs`)
#### Primary Function
```rust
/// Extract code chunks from a file or directory
///
/// # Arguments
/// * `path` - File or directory path
/// * `options` - Chunking options
///
/// # Returns
/// * `TldrResult<Vec<CodeChunk>>` - Extracted chunks
///
/// # Errors
/// * `TldrError::PathNotFound` - Path doesn't exist
/// * `TldrError::ParseError` - Syntax error in file
/// * `TldrError::UnsupportedLanguage` - No parser for file type
///
/// # Example
/// ```rust
/// let chunks = chunk_code(
/// Path::new("src/"),
/// ChunkOptions { granularity: ChunkGranularity::Function, ..Default::default() }
/// )?;
/// ```
pub fn chunk_code(path: &Path, options: ChunkOptions) -> TldrResult<Vec<CodeChunk>>;
/// Extract chunks from a single file
pub fn chunk_file(path: &Path, options: &ChunkOptions) -> TldrResult<Vec<CodeChunk>>;
/// Extract function-level chunks using tree-sitter
fn extract_function_chunks(
tree: &Tree,
source: &str,
path: &Path,
language: Language
) -> Vec<CodeChunk>;
```
#### Behavioral Contract
| Single file | File | 1 chunk (whole file) |
| Single file | Function | N chunks (one per function/method) |
| Directory | Function | All functions from all files |
| Empty file | Any | 0 chunks |
| Binary file | Any | Skipped (0 chunks) |
| Parse error | Any | File skipped with warning |
#### Integration with Existing AST Module
The chunker reuses `tldr_core::ast::extractor` for function extraction:
```rust
use crate::ast::extractor::{extract_functions, extract_methods, extract_classes};
use crate::ast::parser::parse_file;
fn extract_function_chunks(path: &Path, language: Language) -> TldrResult<Vec<CodeChunk>> {
let (tree, source, _) = parse_file(path)?;
// Reuse existing function extraction
let functions = extract_functions(&tree, &source, language);
let methods = extract_methods(&tree, &source, language);
// Build chunks with line ranges
// ...
}
```
### 3.3 Index (`index.rs`)
#### Primary Functions
```rust
impl SemanticIndex {
/// Build an index from a project directory
///
/// # Arguments
/// * `root` - Project root directory
/// * `options` - Chunking and embedding options
/// * `cache` - Optional embedding cache
///
/// # Returns
/// * `TldrResult<Self>` - Built index
pub fn build(
root: &Path,
chunk_options: ChunkOptions,
embed_options: EmbedOptions,
cache: Option<&mut EmbeddingCache>,
) -> TldrResult<Self>;
/// Search for chunks matching a natural language query
///
/// # Arguments
/// * `query` - Natural language search query
/// * `options` - Search options
///
/// # Returns
/// * `TldrResult<SemanticSearchReport>` - Search results
///
/// # Example
/// ```rust
/// let index = SemanticIndex::build(root, opts, embed_opts, None)?;
/// let results = index.search("parse configuration file", SearchOptions::default())?;
/// ```
pub fn search(&self, query: &str, options: SearchOptions) -> TldrResult<SemanticSearchReport>;
/// Find chunks similar to a given chunk
///
/// # Arguments
/// * `chunk` - Source chunk to find similar code for
/// * `options` - Search options
///
/// # Returns
/// * `TldrResult<SimilarityReport>` - Similar chunks
pub fn find_similar(
&self,
chunk: &CodeChunk,
options: SearchOptions,
) -> TldrResult<SimilarityReport>;
/// Get a chunk by file path and function name
pub fn get_chunk(&self, file: &Path, function: Option<&str>) -> Option<&EmbeddedChunk>;
/// Get total number of chunks in the index
pub fn len(&self) -> usize;
/// Check if index is empty
pub fn is_empty(&self) -> bool;
}
```
### 3.4 Similarity (`similarity.rs`)
#### Primary Functions
```rust
/// Compute cosine similarity between two vectors
///
/// # Arguments
/// * `a` - First vector (must be normalized)
/// * `b` - Second vector (must be normalized)
///
/// # Returns
/// * `f64` - Cosine similarity (0.0 to 1.0 for normalized vectors)
///
/// # Panics
/// * If vectors have different lengths
///
/// # Performance
/// * O(n) where n = vector dimension
/// * ~768 multiplications for ArcticM
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f64;
/// Find top-K most similar vectors using linear scan
///
/// # Arguments
/// * `query` - Query vector
/// * `candidates` - Candidate vectors to search
/// * `k` - Number of results to return
/// * `threshold` - Minimum similarity threshold
///
/// # Returns
/// * `Vec<(usize, f64)>` - (index, score) pairs sorted by score descending
///
/// # Performance
/// * O(n * d) where n = candidates, d = dimensions
/// * ~7.68M operations for 10K functions with 768-dim embeddings
pub fn top_k_similar(
query: &[f32],
candidates: &[Vec<f32>],
k: usize,
threshold: f64,
) -> Vec<(usize, f64)>;
/// Normalize a vector to unit length (L2 norm = 1)
pub fn normalize(v: &mut [f32]);
/// Check if a vector is normalized (L2 norm ≈ 1.0)
pub fn is_normalized(v: &[f32]) -> bool;
```
### 3.5 Cache (`cache.rs`)
#### Primary Functions
```rust
impl EmbeddingCache {
/// Open or create a cache at the given directory
///
/// # Arguments
/// * `config` - Cache configuration
///
/// # Returns
/// * `TldrResult<Self>` - Cache instance
pub fn open(config: CacheConfig) -> TldrResult<Self>;
/// Get cached embedding for a chunk
///
/// # Arguments
/// * `chunk` - Code chunk to look up
/// * `model` - Model that was used
///
/// # Returns
/// * `Option<Vec<f32>>` - Cached embedding if valid
///
/// # Cache Key
/// Key = SHA256(file_path + content_hash + model_name)
pub fn get(&self, chunk: &CodeChunk, model: EmbeddingModel) -> Option<Vec<f32>>;
/// Store embedding in cache
///
/// # Arguments
/// * `chunk` - Code chunk
/// * `embedding` - Embedding vector
/// * `model` - Model used
pub fn put(&mut self, chunk: &CodeChunk, embedding: Vec<f32>, model: EmbeddingModel);
/// Flush dirty entries to disk
pub fn flush(&mut self) -> TldrResult<()>;
/// Evict stale entries older than TTL
pub fn evict_stale(&mut self) -> usize;
/// Get cache statistics
pub fn stats(&self) -> CacheStats;
}
#[derive(Debug, Clone, Serialize)]
pub struct CacheStats {
/// Number of entries
pub entries: usize,
/// Total size in bytes
pub size_bytes: usize,
/// Hit rate (0.0 to 1.0)
pub hit_rate: f64,
}
```
#### Cache File Format
```json
{
"version": 1,
"model": "arctic-m",
"entries": {
"sha256:abc123...": {
"content_hash": "md5:def456...",
"embedding": [0.123, -0.456, ...],
"cached_at": 1706918400
}
}
}
```
---
## 4. CLI Commands
### 4.1 `tldr semantic` Command
```rust
/// Semantic code search using embeddings
#[derive(Debug, Args)]
pub struct SemanticArgs {
/// Natural language search query
pub query: String,
/// Path to search (default: current directory)
#[arg(default_value = ".")]
pub path: PathBuf,
/// Number of results to return
#[arg(long, short = 'n', default_value = "10")]
pub top: usize,
/// Minimum similarity threshold (0.0 to 1.0)
#[arg(long, short = 't', default_value = "0.5")]
pub threshold: f64,
/// Embedding model (xs, s, m, m-long, l)
#[arg(long, short = 'm', default_value = "m")]
pub model: EmbeddingModel,
/// Programming language (auto-detect if not specified)
#[arg(long, short = 'l')]
pub lang: Option<Language>,
/// Disable cache
#[arg(long)]
pub no_cache: bool,
}
```
#### Example Usage
```bash
# Basic semantic search
$ tldr semantic "parse config file" src/
# With options
$ tldr semantic "error handling retry logic" --top 5 --threshold 0.7 --model m
# Specific language
$ tldr semantic "database connection pool" --lang python src/
```
#### Example Output (JSON)
```json
{
"results": [
{
"file_path": "src/config.rs",
"function_name": "parse_config",
"class_name": null,
"score": 0.89,
"line_start": 10,
"line_end": 45,
"snippet": "fn parse_config(path: &Path) -> Result<Config> {"
},
{
"file_path": "src/loader.rs",
"function_name": "load_config",
"class_name": null,
"score": 0.82,
"line_start": 20,
"line_end": 35,
"snippet": "fn load_config() -> Config {"
}
],
"query": "parse config file",
"model": "arctic-m",
"total_chunks": 150,
"matches_above_threshold": 8,
"latency_ms": 245,
"cache_hit": true
}
```
#### Example Output (Text)
```
Semantic search: "parse config file"
Results (8 matches):
1. src/config.rs:parse_config (score: 0.89)
Lines 10-45
fn parse_config(path: &Path) -> Result<Config> {
2. src/loader.rs:load_config (score: 0.82)
Lines 20-35
fn load_config() -> Config {
3. src/settings.rs:read_settings (score: 0.71)
Lines 5-22
fn read_settings(file: &str) -> Settings {
Search completed in 245ms (cache hit)
```
### 4.2 `tldr embed` Command
```rust
/// Generate embeddings for code
#[derive(Debug, Args)]
pub struct EmbedArgs {
/// File or directory to embed
pub path: PathBuf,
/// Output file (stdout if not specified)
#[arg(long, short = 'o')]
pub output: Option<PathBuf>,
/// Chunking granularity (file or function)
#[arg(long, short = 'g', default_value = "function")]
pub granularity: ChunkGranularity,
/// Embedding model
#[arg(long, short = 'm', default_value = "m")]
pub model: EmbeddingModel,
/// Programming language
#[arg(long, short = 'l')]
pub lang: Option<Language>,
/// Include embedding vectors in output
#[arg(long)]
pub include_vectors: bool,
}
```
#### Example Usage
```bash
# Embed a file
$ tldr embed src/config.rs
# Embed a directory with function-level chunks
$ tldr embed src/ --granularity function --output embeddings.json
# Include vectors in output
$ tldr embed src/config.rs --include-vectors
```
#### Example Output (JSON)
```json
{
"path": "src/config.rs",
"model": "arctic-m",
"granularity": "function",
"chunks_embedded": 5,
"chunks_cached": 3,
"latency_ms": 120,
"chunks": [
{
"chunk": {
"file_path": "src/config.rs",
"function_name": "parse_config",
"class_name": null,
"line_start": 10,
"line_end": 45,
"content": "fn parse_config(path: &Path) -> Result<Config> {...}",
"content_hash": "md5:abc123",
"language": "rust"
},
"embedding": [0.123, -0.456, ...],
"model": "arctic-m",
"embedded_at": 1706918400
}
]
}
```
### 4.3 `tldr similar` Command
```rust
/// Find similar code fragments
#[derive(Debug, Args)]
pub struct SimilarArgs {
/// Source file to find similar code for
pub file: PathBuf,
/// Specific function to compare (whole file if not specified)
#[arg(long, short = 'f')]
pub function: Option<String>,
/// Number of similar results
#[arg(long, short = 'n', default_value = "5")]
pub top: usize,
/// Minimum similarity threshold
#[arg(long, short = 't', default_value = "0.7")]
pub threshold: f64,
/// Exclude self from results
#[arg(long, default_value = "true")]
pub exclude_self: bool,
/// Search path (default: current directory)
#[arg(long, short = 'p', default_value = ".")]
pub path: PathBuf,
/// Embedding model
#[arg(long, short = 'm', default_value = "m")]
pub model: EmbeddingModel,
}
```
#### Example Usage
```bash
# Find code similar to a function
$ tldr similar src/config.rs --function parse_config
# Find similar files
$ tldr similar src/config.rs --top 10 --threshold 0.6
# Search in specific directory
$ tldr similar src/config.rs --function parse_config --path lib/
```
#### Example Output (JSON)
```json
{
"source": {
"file_path": "src/config.rs",
"function_name": "parse_config",
"line_start": 10,
"line_end": 45
},
"similar": [
{
"file_path": "src/loader.rs",
"function_name": "load_config",
"score": 0.85,
"line_start": 20,
"line_end": 35,
"snippet": "fn load_config() -> Config {"
},
{
"file_path": "src/settings.rs",
"function_name": "read_settings",
"score": 0.72,
"line_start": 5,
"line_end": 22,
"snippet": "fn read_settings(file: &str) -> Settings {"
}
],
"model": "arctic-m",
"total_compared": 150,
"exclude_self": true
}
```
### 4.4 `tldr explain` Command (DEFERRED)
```rust
/// LLM-powered code explanation
#[derive(Debug, Args)]
pub struct ExplainArgs {
/// File containing the function
pub file: PathBuf,
/// Function name to explain
pub function: String,
/// Detail level
#[arg(long, short = 'd', default_value = "brief")]
pub detail: DetailLevel,
}
#[derive(Debug, Clone, Copy, clap::ValueEnum)]
pub enum DetailLevel {
/// One-sentence summary
Brief,
/// Full explanation with examples
Full,
}
```
**Note:** This command is deferred to a future session as it requires LLM integration.
---
## 5. Behavioral Contracts
### 5.1 Embedding Invariants
| Normalization | All embeddings have L2 norm = 1.0 (tolerance: 1e-6) |
| Dimensions | Output dimensions match model.dimensions() exactly |
| Determinism | Same input text + model produces identical embeddings |
| Empty input | Returns zero vector (normalized) |
### 5.2 Similarity Invariants
| Range | Cosine similarity ∈ [-1.0, 1.0] |
| Self-similarity | A · A = 1.0 for normalized vectors |
| Symmetry | A · B = B · A |
| Triangle inequality | Not guaranteed (embeddings are semantic, not metric) |
### 5.3 Caching Invariants
| Key uniqueness | Key = hash(file_path + content_hash + model) |
| Invalidation | Cache miss if content_hash differs |
| Model isolation | Different models have separate cache entries |
| Concurrent safety | Cache file locked during writes |
### 5.4 Search Invariants
| Ordering | Results sorted by score descending |
| Threshold | All results have score >= threshold |
| Top-K | At most K results returned |
| Completeness | All chunks above threshold are candidates |
---
## 6. Error Handling
### 6.1 New Error Types
Add to `tldr_core::error::TldrError`:
```rust
// =========================================================================
// Semantic Search Errors (Session 16)
// =========================================================================
/// Model loading/initialization failed
#[error("Failed to load embedding model '{model}': {detail}")]
ModelLoadError {
model: String,
detail: String,
},
/// Embedding generation failed
#[error("Embedding failed for {file}: {detail}")]
EmbeddingError {
file: PathBuf,
detail: String,
},
/// Cache corruption or version mismatch
#[error("Embedding cache corrupted: {0}")]
CacheCorrupted(String),
/// No chunks found to embed
#[error("No embeddable code chunks found in {0}")]
NoChunksFound(PathBuf),
/// Index not built
#[error("Semantic index not initialized. Run 'tldr embed' first.")]
IndexNotBuilt,
```
### 6.2 Error Scenarios
| Model not downloaded | `ModelLoadError` | Automatic download on first use |
| ONNX runtime failure | `ModelLoadError` | User retry with --offline |
| Parse error in file | `ParseError` | Skip file, continue processing |
| Cache file corrupted | `CacheCorrupted` | Delete and rebuild cache |
| Empty directory | `NoChunksFound` | Return empty results |
| Out of memory | `EmbeddingError` | Reduce batch size |
### 6.3 Graceful Degradation
Following the existing M8 pattern from `embedding_client.rs`:
```rust
/// Perform semantic search with graceful degradation
pub fn search_with_fallback(
query: &str,
root: &Path,
options: SearchOptions,
) -> TldrResult<SemanticSearchReport> {
match SemanticIndex::build(root, ChunkOptions::default(), EmbedOptions::default(), None) {
Ok(index) => index.search(query, options),
Err(TldrError::ModelLoadError { .. }) => {
// Fall back to BM25 search
eprintln!("Warning: Embedding model unavailable, falling back to keyword search");
let bm25_results = Bm25Index::from_project(root, Language::Python)?
.search(query, options.top_k);
// Convert BM25 results to SemanticSearchReport format
Ok(convert_bm25_to_semantic(bm25_results, query))
}
Err(e) => Err(e),
}
}
```
---
## 7. Integration Points
### 7.1 With Existing AST Module
The semantic module reuses tree-sitter parsing from `tldr_core::ast`:
```rust
// In chunker.rs
use crate::ast::parser::parse_file;
use crate::ast::extractor::{extract_functions, extract_methods};
pub fn chunk_file(path: &Path, options: &ChunkOptions) -> TldrResult<Vec<CodeChunk>> {
let (tree, source, _) = parse_file(path)?;
let language = Language::from_path(path).ok_or_else(|| {
TldrError::UnsupportedLanguage(path.extension().unwrap_or_default().to_string_lossy().to_string())
})?;
match options.granularity {
ChunkGranularity::File => {
Ok(vec![CodeChunk {
file_path: path.to_path_buf(),
function_name: None,
class_name: None,
line_start: 1,
line_end: source.lines().count() as u32,
content: source.to_string(),
content_hash: hash_content(&source),
language,
}])
}
ChunkGranularity::Function => {
extract_function_chunks_from_tree(&tree, &source, path, language)
}
}
}
```
### 7.2 With Existing Search Module
The semantic module complements the existing `search` module:
```rust
// In lib.rs
pub mod search; // BM25, hybrid search
pub mod semantic; // Dense embeddings, semantic search
// Re-exports
pub use search::{Bm25Index, hybrid_search, HybridResult};
pub use semantic::{SemanticIndex, SemanticSearchResult, Embedder, EmbeddingModel};
```
### 7.3 With Existing CLI Patterns
Commands follow the established pattern from `imports.rs`, `cfg.rs`, etc.:
```rust
// In semantic.rs
impl SemanticArgs {
pub fn run(&self, format: OutputFormat, quiet: bool) -> Result<()> {
let writer = OutputWriter::new(format, quiet);
// Auto-detect language if not specified
let language = self.lang.or_else(|| {
Language::from_path(&self.path)
});
writer.progress(&format!(
"Building semantic index for {}...",
self.path.display()
));
// Build index
let cache = if self.no_cache {
None
} else {
Some(EmbeddingCache::open(CacheConfig::default())?)
};
let index = SemanticIndex::build(
&self.path,
ChunkOptions {
granularity: ChunkGranularity::Function,
..Default::default()
},
EmbedOptions { model: self.model, ..Default::default() },
cache.as_mut(),
)?;
writer.progress(&format!(
"Searching {} chunks for '{}'...",
index.len(),
self.query
));
// Search
let report = index.search(
&self.query,
SearchOptions {
top_k: self.top,
threshold: self.threshold,
model: self.model,
..Default::default()
},
)?;
// Output
if writer.is_text() {
let text = format_semantic_text(&report);
writer.write_text(&text)?;
} else {
writer.write(&report)?;
}
Ok(())
}
}
```
### 7.4 With Hybrid Search
The semantic module can be integrated with the existing hybrid search:
```rust
// Future enhancement: Hybrid RRF with semantic embeddings
pub fn hybrid_semantic_search(
query: &str,
root: &Path,
options: HybridSearchOptions,
) -> TldrResult<HybridSearchReport> {
// BM25 results
let bm25_index = Bm25Index::from_project(root, options.language)?;
let bm25_results = bm25_index.search(query, options.top_k * 2);
// Semantic results
let semantic_index = SemanticIndex::build(root, ...)?;
let semantic_results = semantic_index.search(query, ...)?;
// Fuse with RRF
fuse_rrf(&bm25_results, &semantic_results.results, options.k_constant, options.top_k)
}
```
---
## 8. Performance Considerations
### 8.1 Time Complexity
| Model load | O(1) | ~2s (first time), ~100ms (cached) |
| Embed single text | O(d) | ~10ms |
| Embed batch (n texts) | O(n * d / batch) | ~50ms for 100 texts |
| Build index (n files) | O(n * (parse + embed)) | ~5s for 1000 functions |
| Search (n chunks) | O(n * d) | ~10ms for 10K chunks |
| Cache lookup | O(1) | ~1ms |
### 8.2 Space Complexity
| Model (ArcticM) | ~110MB on disk |
| Single embedding | 768 * 4 = 3KB |
| 10K embeddings | ~30MB in memory |
| Cache entry | ~3.5KB per chunk |
| Cache (10K) | ~35MB on disk |
### 8.3 Optimization Strategies
1. **Lazy model loading**: Model loaded on first embed call
2. **Batch embedding**: Process multiple chunks in single forward pass
3. **Cache-first**: Check cache before embedding
4. **Incremental updates**: Only re-embed changed files
5. **Parallel chunking**: Use rayon for file processing
### 8.4 Memory Limits
```rust
/// Maximum chunks to hold in memory
pub const MAX_INDEX_SIZE: usize = 100_000;
/// Maximum batch size for embedding
pub const MAX_BATCH_SIZE: usize = 64;
/// Maximum file size to embed (10MB)
pub const MAX_FILE_SIZE: usize = 10 * 1024 * 1024;
```
---
## 9. Testing Strategy
### 9.1 Unit Tests
```rust
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_cosine_similarity_identical() {
let v = vec![0.5, 0.5, 0.5, 0.5];
let normalized = normalize(&v);
assert!((cosine_similarity(&normalized, &normalized) - 1.0).abs() < 1e-6);
}
#[test]
fn test_cosine_similarity_orthogonal() {
let a = vec![1.0, 0.0, 0.0];
let b = vec![0.0, 1.0, 0.0];
assert!((cosine_similarity(&a, &b)).abs() < 1e-6);
}
#[test]
fn test_chunk_file_granularity() {
let tmp = create_test_file("fn foo() {}\nfn bar() {}");
let chunks = chunk_file(&tmp, &ChunkOptions {
granularity: ChunkGranularity::Function,
..Default::default()
}).unwrap();
assert_eq!(chunks.len(), 2);
}
#[test]
fn test_cache_invalidation() {
let mut cache = EmbeddingCache::open(CacheConfig::default()).unwrap();
let chunk1 = CodeChunk { content_hash: "abc".into(), ..};
let chunk2 = CodeChunk { content_hash: "xyz".into(), ..};
cache.put(&chunk1, vec![0.1; 768], EmbeddingModel::ArcticM);
assert!(cache.get(&chunk1, EmbeddingModel::ArcticM).is_some());
assert!(cache.get(&chunk2, EmbeddingModel::ArcticM).is_none());
}
#[test]
fn test_top_k_ordering() {
let query = vec![1.0, 0.0];
let candidates = vec![
vec![0.9, 0.1], // high similarity
vec![0.1, 0.9], // low similarity
vec![0.7, 0.3], // medium similarity
];
let results = top_k_similar(&query, &candidates, 3, 0.0);
assert_eq!(results[0].0, 0); // highest first
assert_eq!(results[1].0, 2);
assert_eq!(results[2].0, 1);
}
}
```
### 9.2 Integration Tests
```rust
#[test]
fn test_semantic_search_e2e() {
let tmp_dir = tempdir().unwrap();
// Create test files
fs::write(tmp_dir.path().join("config.rs"), "fn parse_config() {}").unwrap();
fs::write(tmp_dir.path().join("loader.rs"), "fn load_settings() {}").unwrap();
// Build index and search
let index = SemanticIndex::build(
tmp_dir.path(),
ChunkOptions::default(),
EmbedOptions::default(),
None,
).unwrap();
let results = index.search("configuration parser", SearchOptions::default()).unwrap();
// parse_config should rank higher than load_settings
assert!(!results.results.is_empty());
assert!(results.results[0].function_name.as_ref().unwrap().contains("config"));
}
```
### 9.3 Benchmark Tests
```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};
fn bench_similarity_search(c: &mut Criterion) {
let query = vec![0.5; 768];
let candidates: Vec<Vec<f32>> = (0..10000)
.map(|_| (0..768).map(|_| rand::random()).collect())
.collect();
c.bench_function("top_k_10k_chunks", |b| {
b.iter(|| top_k_similar(black_box(&query), black_box(&candidates), 10, 0.5))
});
}
criterion_group!(benches, bench_similarity_search);
criterion_main!(benches);
```
---
## 10. Dependencies
### 10.1 New Cargo Dependencies
Add to `tldr-core/Cargo.toml`:
```toml
[dependencies]
# Semantic search (Session 16)
fastembed = "5.8"
# Content hashing for cache keys
md5 = "0.7"
sha2 = "0.10"
# Progress reporting
indicatif = { version = "0.17", optional = true }
[features]
default = []
progress = ["indicatif"]
```
### 10.2 Feature Flags
| `progress` | Show progress bars during embedding | Off |
| `cache` | Enable embedding cache | On |
---
## 11. Migration Path
### Phase 1: Foundation (This Session)
- [ ] Create `semantic/` module structure
- [ ] Implement `types.rs` with all types
- [ ] Implement `embedder.rs` with fastembed wrapper
- [ ] Implement `similarity.rs` with cosine similarity
### Phase 2: Core Features
- [ ] Implement `chunker.rs` with tree-sitter integration
- [ ] Implement `index.rs` with in-memory search
- [ ] Implement `cache.rs` with JSON persistence
### Phase 3: CLI Commands
- [ ] Implement `semantic` command
- [ ] Implement `embed` command
- [ ] Implement `similar` command
### Phase 4: Integration
- [ ] Add to `lib.rs` exports
- [ ] Add to CLI command registry
- [ ] Update `--help` documentation
- [ ] Add integration tests
### Phase 5: Optimization (Future)
- [ ] SPLADE sparse embeddings (hybrid search)
- [ ] Cross-encoder reranking
- [ ] Incremental index updates
- [ ] `explain` command with LLM
---
## 12. Open Questions
1. **Model download behavior**: Should first-time model download be automatic or require explicit `--download` flag?
- **Decision**: Automatic with warning message
2. **Cache location**: Should cache be per-project or global?
- **Decision**: Global (`~/.cache/tldr/embeddings/`) with project-specific entries
3. **Hybrid search integration**: Should semantic search automatically include BM25?
- **Decision**: Separate commands; hybrid search is future enhancement
4. **Large codebase handling**: What's the cutoff for in-memory vs. external index?
- **Decision**: 100K chunks in-memory; beyond that, warn and suggest filtering
---
## 13. Success Criteria
The semantic search module is complete when:
1. **Functional**: All three commands (`semantic`, `embed`, `similar`) work end-to-end
2. **Correct**: Similarity scores are accurate and rankings are sensible
3. **Performant**: Search completes in <1s for 10K function index
4. **Integrated**: Commands follow existing CLI patterns and output formats
5. **Tested**: >80% code coverage with unit and integration tests
6. **Documented**: All public APIs have doc comments with examples
---
*End of Specification*