Crate turboprop


§TurboProp - Fast Semantic Code Search and Indexing

TurboProp is a Rust library and CLI tool that enables fast semantic search across codebases using machine learning embeddings. It indexes your code files and allows you to search for functionality using natural language queries.

Now includes MCP server support for real-time integration with coding agents.

§Features

  • Semantic Search: Find code by meaning, not just keywords
  • Git Integration: Automatically respects .gitignore and only indexes tracked files
  • Watch Mode: Monitor file changes and automatically update the index
  • File Filtering: Filter by file type, size, and custom patterns
  • Multiple Output Formats: JSON for tools, human-readable text for reading
  • Performance Optimized: Handles codebases ranging from about 50 files to more than 10,000 files efficiently
  • Configurable Models: Use any HuggingFace sentence-transformer model
  • MCP Server: Real-time integration with coding agents via Model Context Protocol

§Quick Start

§CLI Usage

# Index your codebase
tp index --repo . --max-filesize 2mb

# Search for code
tp search "jwt authentication" --repo .

# Filter by file type
tp search --filetype .js "error handling" --repo .

# Get human-readable output
tp search "database queries" --repo . --output text

§Library Usage

The library provides both high-level convenience functions and low-level components for building custom search solutions.

use turboprop::{config::TurboPropConfig, build_persistent_index, search_with_config};
use std::path::Path;

// Build an index with default settings
let config = TurboPropConfig::default();
let index = build_persistent_index(Path::new("./src"), &config).await?;

// Search the index
let results = search_with_config(
    "error handling patterns",
    Path::new("./src"),
    Some(10),  // limit results
    Some(0.7)  // similarity threshold
).await?;

for result in results {
    println!("{}: {}", result.location_display(), result.content_preview(80));
}

§Custom Configuration

use turboprop::{
    config::TurboPropConfig,
    embeddings::EmbeddingConfig,
    types::FileDiscoveryConfig,
    build_persistent_index
};
use std::path::Path;

// Configure embedding model
let embedding_config = EmbeddingConfig::with_model("sentence-transformers/all-mpnet-base-v2")
    .with_batch_size(16);

// Configure file discovery
let file_config = FileDiscoveryConfig::default()
    .with_max_filesize(5_000_000)  // 5MB limit
    .with_gitignore_respect(true)
    .with_untracked(false);

// Create complete configuration
let config = TurboPropConfig {
    embedding: embedding_config,
    file_discovery: file_config,
    ..Default::default()
};

// Build index with custom configuration
let index = build_persistent_index(Path::new("./project"), &config).await?;

§Incremental Updates

use turboprop::{config::TurboPropConfig, update_persistent_index, index_exists};
use std::path::Path;

let path = Path::new("./src");
let config = TurboPropConfig::default();

if index_exists(path) {
    // Update existing index incrementally
    let (updated_index, update_result) = update_persistent_index(path, &config).await?;
     
    println!("Index updated: {} files added, {} files modified, {} files removed",
             update_result.added_files,
             update_result.updated_files,
             update_result.removed_files);
} else {
    println!("No existing index found, create one first");
}

§Architecture

TurboProp uses a multi-stage pipeline for indexing:

  1. File Discovery: Finds files to index based on git status and filters
  2. Content Processing: Reads and preprocesses file content
  3. Chunking: Breaks large files into smaller, searchable chunks
  4. Embedding Generation: Creates vector embeddings using ML models
  5. Index Storage: Stores embeddings and metadata for fast retrieval
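
The chunking stage can be illustrated with a minimal sketch. It assumes a simple line-based strategy with a fixed overlap between consecutive chunks; the `Chunk` struct and the `chunk_by_lines` function are hypothetical names for illustration and are not part of the TurboProp API, whose actual chunking rules may differ.

```rust
/// Hypothetical chunk type: a 1-based, inclusive line range plus its text.
#[derive(Debug, Clone, PartialEq)]
pub struct Chunk {
    pub start_line: usize,
    pub end_line: usize,
    pub text: String,
}

/// Split `content` into chunks of at most `max_lines` lines.
/// Consecutive chunks share `overlap` lines so that code spanning a
/// chunk boundary still appears intact in at least one chunk.
pub fn chunk_by_lines(content: &str, max_lines: usize, overlap: usize) -> Vec<Chunk> {
    assert!(max_lines > 0, "max_lines must be positive");
    assert!(overlap < max_lines, "overlap must be smaller than max_lines");

    let lines: Vec<&str> = content.lines().collect();
    let step = max_lines - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;

    while start < lines.len() {
        let end = (start + max_lines).min(lines.len());
        chunks.push(Chunk {
            start_line: start + 1,
            end_line: end,
            text: lines[start..end].join("\n"),
        });
        // Stop once the final line has been covered.
        if end == lines.len() {
            break;
        }
        start += step;
    }
    chunks
}
```

Each chunk keeps its line range so that a search hit can be reported with a file location rather than only a text fragment.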

For searching, it:

  1. Query Embedding: Converts the search query to a vector
  2. Similarity Search: Finds the most similar code chunks using cosine similarity
  3. Result Ranking: Sorts results by relevance score
  4. Output Formatting: Presents results in the requested format
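
The similarity search and ranking steps can be sketched as a brute-force scan. This is an illustration under stated assumptions, not TurboProp's implementation: `cosine_similarity` and `top_matches` are hypothetical names, embeddings are plain `f32` slices, and the `limit` and `threshold` parameters mirror the result limit and similarity threshold shown in the Library Usage example.

```rust
/// Cosine similarity between two vectors of equal dimension.
/// Returns 0.0 if either vector has zero length (norm).
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Score every chunk embedding against the query, drop scores below
/// `threshold`, sort by descending score, and keep at most `limit` hits.
/// Returns (chunk index, score) pairs.
pub fn top_matches(
    query: &[f32],
    chunks: &[Vec<f32>],
    limit: usize,
    threshold: f32,
) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = chunks
        .iter()
        .enumerate()
        .map(|(i, v)| (i, cosine_similarity(query, v)))
        .filter(|&(_, score)| score >= threshold)
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(limit);
    scored
}
```

A linear scan like this costs one dot product per indexed chunk, which is usually acceptable for indexes in the size range described above; larger indexes typically call for an approximate nearest-neighbor structure instead.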

§Performance Characteristics

  • Indexing Speed: ~100-500 files/second (varies by file size and hardware)
  • Search Speed: ~10-50ms per query (after model loading)
  • Memory Usage: ~50-200MB (varies with model and index size)
  • Index Size: Typically 10-30% of source code size

§Supported Models

TurboProp supports any HuggingFace sentence-transformer model, for example:

  • sentence-transformers/all-MiniLM-L6-v2 (default, 384 dims, ~90MB)
  • sentence-transformers/all-MiniLM-L12-v2 (384 dims, ~130MB)
  • sentence-transformers/all-mpnet-base-v2 (768 dims, ~420MB, highest quality)
  • sentence-transformers/paraphrase-MiniLM-L6-v2 (384 dims, ~90MB)
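
The embedding dimension directly affects how much space the vectors occupy. As a rough sketch, assuming each component is stored as an uncompressed 32-bit float and ignoring metadata, the raw size is chunks × dimensions × 4 bytes. The function name `raw_embedding_bytes` is hypothetical, and actual on-disk sizes may differ, for instance when vector compression is applied.

```rust
/// Bytes needed for `num_chunks` embeddings of `dims` f32 components each,
/// ignoring metadata and any compression (an assumption of this sketch).
pub fn raw_embedding_bytes(num_chunks: u64, dims: u64) -> u64 {
    num_chunks * dims * std::mem::size_of::<f32>() as u64
}
```

Under these assumptions, 10,000 chunks take about 15.4 MB at 384 dimensions and about 30.7 MB at 768 dimensions, so moving to a 768-dimension model roughly doubles vector storage.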

§Error Handling

Fallible functions return anyhow::Result<T> for comprehensive error handling. Common error categories include:

  • I/O Errors: File access, permission issues
  • Model Errors: Download failures, model loading issues
  • Configuration Errors: Invalid settings, malformed config files
  • Index Errors: Corrupted index, version mismatches
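
Callers that want to react differently to each category can inspect the underlying error type. The sketch below uses only the standard library so that it stands alone: it downcasts a `dyn Error` the same way `anyhow::Error::downcast_ref` allows. The `IndexCorrupted` type and the `describe` function are hypothetical and do not correspond to TurboProp's actual error types.

```rust
use std::error::Error;
use std::fmt;
use std::io;

/// Hypothetical error type standing in for an index-level failure.
#[derive(Debug)]
pub struct IndexCorrupted(pub String);

impl fmt::Display for IndexCorrupted {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "index corrupted: {}", self.0)
    }
}

impl Error for IndexCorrupted {}

/// Map an error to a short, user-facing category by downcasting
/// to the concrete types this sketch knows about.
pub fn describe(err: &(dyn Error + 'static)) -> &'static str {
    if let Some(io_err) = err.downcast_ref::<io::Error>() {
        match io_err.kind() {
            io::ErrorKind::NotFound => "path not found",
            io::ErrorKind::PermissionDenied => "permission denied",
            _ => "other I/O error",
        }
    } else if err.downcast_ref::<IndexCorrupted>().is_some() {
        "index corrupted; rebuild it"
    } else {
        "unclassified error"
    }
}
```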

§Thread Safety

Most operations are thread-safe and designed for concurrent use:

  • Index building uses multiple worker threads for parallel processing
  • Search operations are read-only and fully concurrent
  • File watching runs in a separate background thread
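
The read-only, concurrent nature of searching can be sketched with scoped threads from the standard library. This is an illustration only: the index is modeled as a plain slice of `f32` vectors, scoring uses a bare dot product for brevity, and `best_match_per_query` and its helpers are hypothetical names rather than TurboProp functions.

```rust
use std::thread;

/// Dot product of two vectors (extra components of the longer one are ignored).
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Index of the vector in `index` with the highest dot product against
/// `query`, or `None` if the index is empty.
fn best_match(index: &[Vec<f32>], query: &[f32]) -> Option<usize> {
    index
        .iter()
        .enumerate()
        .map(|(i, v)| (i, dot(v, query)))
        .max_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(i, _)| i)
}

/// Run every query on its own thread against the same shared index.
/// The index is only borrowed immutably, so no locking is needed.
pub fn best_match_per_query(index: &[Vec<f32>], queries: &[Vec<f32>]) -> Vec<Option<usize>> {
    thread::scope(|s| {
        // Spawn all searches first, then join them in order.
        let handles: Vec<_> = queries
            .iter()
            .map(|q| s.spawn(move || best_match(index, q)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("search thread panicked"))
            .collect()
    })
}
```

Because every thread holds only a shared reference, the compiler enforces that nothing mutates the index while searches are in flight; updating the index concurrently would require additional synchronization.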

§Module Organization

  • cli: Command-line interface definitions
  • commands: CLI command implementations
  • config: Configuration structures and loading
  • embeddings: ML embedding generation
  • files: File discovery and git integration
  • index: Core indexing and storage functionality
  • search: Search algorithms and result processing
  • types: Common data structures and utilities

Re-exports§

pub use mcp::McpServer;

Modules§

backends
Backend implementations for different embedding model types.
chunking
cli
commands
Command implementations for the TurboProp CLI.
compression
Vector compression algorithms for efficient index storage.
config
Configuration management for TurboProp.
constants
Constants used throughout the TurboProp codebase.
content
embeddings
Embedding generation module for converting text chunks to vector representations.
error
Structured error types for TurboProp application.
error_classification
Error classification and user-friendly error handling.
error_utils
Common error handling utilities for consistent error contexts across the codebase.
files
filters
Search result filtering functionality.
git
incremental
Incremental index update logic for efficient file change processing.
index
Vector index management with persistence capabilities.
mcp
Model Context Protocol (MCP) implementation for TurboProp
metrics
Performance metrics collection for embedding operations.
model_validation
Model validation utilities for command execution.
models
Model management and caching functionality.
output
Output formatting for search results.
parallel
Parallel processing utilities for high-performance file operations.
pipeline
Processing pipeline coordination for indexing operations.
progress
Progress reporting utilities for indexing operations.
query
Query processing and embedding generation for search functionality.
recovery
Index recovery and validation functionality.
retry
Retry logic with exponential backoff for handling transient failures.
search
Main similarity search engine implementation.
storage
Persistent storage operations for vector indexes.
streaming
Streaming operations for memory-efficient processing of large datasets.
types
Type definitions for TurboProp.
validation
Configuration validation module for TurboProp.
warnings
Resource usage warnings and recommendations system.
watcher
File system watching implementation for incremental index updates.

Constants§

DEFAULT_INDEX_PATH
Default path for indexing when no path is specified

Functions§

build_persistent_index
Build a persistent vector index for the specified path.
index_exists
Check if a persistent index exists at the specified path.
index_files
Index files in the specified path for fast searching.
index_files_with_config
Index files with embedding generation using the provided configuration.
load_persistent_index
Load an existing persistent vector index from disk.
search_files
Search through indexed files using the specified query.
search_with_config
Advanced search function with configurable parameters.
update_persistent_index
Update an existing persistent index incrementally based on file changes.