turboprop 0.1.2

Fast semantic code search and indexing tool
# TurboProp

[![Crates.io](https://img.shields.io/crates/v/turboprop.svg)](https://crates.io/crates/turboprop)
[![Documentation](https://docs.rs/turboprop/badge.svg)](https://docs.rs/turboprop)
[![License](https://img.shields.io/badge/license-MIT%2FApache--2.0-blue.svg)](https://github.com/glamp/turboprop-rust)

**TurboProp** (`tp`) is a fast semantic code search and indexing tool written in Rust. It uses machine learning embeddings to enable intelligent code search across your codebase, making it easy to find relevant code snippets based on natural language queries.

## Key Features

- **Semantic Search**: Find code by meaning, not just keywords
- **Git Integration**: Respects `.gitignore` and only indexes files under source control
- **Watch Mode**: Automatically updates the index when files change
- **File Type Filtering**: Search within specific file types
- **Multiple Output Formats**: JSON for tools, human-readable text for reading
- **Performance Optimized**: Handles codebases from 50 to 10,000+ files
- **Easy Configuration**: Optional `.turboprop.yml` configuration file
- **MCP Server Integration**: Built-in MCP server for coding agents like Claude Code, Cursor, and Windsurf

## MCP Server for Coding Agents

**What is MCP?** MCP (Model Context Protocol) is a standard way for AI coding agents to access external tools. Think of it as a bridge that lets your AI assistant search through your code in real-time.

**Before MCP**: "Find JWT authentication code" → Agent can only see files you've shared  
**With MCP**: "Find JWT authentication code" → Agent searches your entire codebase semantically

TurboProp's MCP server works like a librarian for your codebase - it catalogs all your code, keeps it up-to-date, and helps agents find relevant code instantly.

### Quick Start (< 2 minutes)

1. **Start the MCP server**:
   ```bash
   tp mcp --repo .
   ```

2. **Configure your coding agent** (see integration examples below)

3. **Ask your agent**: "Find the JWT authentication implementation"

That's it! Your agent can now search your entire codebase semantically.

### Agent Integration

**Claude Code** - Add to `.mcp.json` in your project root:
```json
{
  "mcpServers": {
    "turboprop": {
      "command": "tp",
      "args": ["mcp", "--repo", "."]
    }
  }
}
```

**Cursor** - Add to `.cursor/mcp.json` in your project:
```json
{
  "mcpServers": {
    "turboprop": {
      "command": "tp", 
      "args": ["mcp", "--repo", "."],
      "cwd": "."
    }
  }
}
```

**Other Agents** (GitHub Copilot, Windsurf, etc.) - Use these settings:
- **Command**: `tp`
- **Arguments**: `["mcp", "--repo", "."]`

**✓ Verify Setup**: Restart your agent and ask: "Search for error handling code"

### What You Can Ask Your Agent

Once configured, you can ask natural language questions like:

- **"Find the JWT authentication implementation"** - Locates authentication code
- **"Show me error handling patterns"** - Finds error handling across the codebase  
- **"Where is database connection logic?"** - Discovers database-related code
- **"Find all tests for user login"** - Locates relevant test files
- **"How does the API rate limiting work?"** - Finds rate limiting implementation

### Advanced Search Options

Your agent can also use these parameters to refine searches:

- **`limit`**: Maximum results (default: 10)
- **`filetype`**: Filter by extension (`.rs`, `.js`, `.py`)
- **`filter`**: Glob pattern (`src/**/*.rs`, `tests/**`)
- **`threshold`**: Similarity threshold (0.0-1.0)

Example: *"Find authentication code, limit to 5 results, only in Rust files"*

### Configuration & Advanced Usage

**Custom Model & Settings**:
```bash
tp mcp --repo . --model sentence-transformers/all-MiniLM-L12-v2 --max-filesize 5mb
```

**Project Configuration** (`.turboprop.yml`):
```yaml
model: "sentence-transformers/all-MiniLM-L6-v2"
max_filesize: "2mb" 
similarity_threshold: 0.3
```

**📖 Complete Guide**: [MCP User Guide](docs/MCP_GUIDE.md)  
**🔧 Troubleshooting**: [Common Issues & Solutions](TROUBLESHOOTING.md#mcp-server-troubleshooting)  
**⚡ Performance**: Tips for large repositories and team usage

## Quick Start

### Installation

#### Via Cargo (Recommended)
```bash
cargo install turboprop
```

#### From Source
```bash
git clone https://github.com/glamp/turboprop-rust
cd turboprop-rust
cargo build --release
# Binary will be in target/release/tp
```

### Basic Usage

1. **Index your codebase**:
   ```bash
   tp index --repo . --max-filesize 2mb
   ```

2. **Search for code**:
   ```bash
   tp search "jwt authentication" --repo .
   ```

3. **Filter by file type**:
   ```bash
   tp search --filetype .js "jwt authentication" --repo .
   ```

4. **Get human-readable output**:
   ```bash
   tp search "jwt authentication" --repo . --output text
   ```

## Model Support

TurboProp now supports multiple embedding models to optimize for different use cases:

### Available Models

#### Sentence Transformer Models (FastEmbed)
- `sentence-transformers/all-MiniLM-L6-v2` (default)
  - Fast and lightweight, good for general use
  - 384 dimensions, ~23MB
  - Automatic download and caching

- `sentence-transformers/all-MiniLM-L12-v2`
  - Better accuracy with slightly more compute
  - 384 dimensions, ~44MB

#### Specialized Code Models
- `nomic-embed-code.Q5_K_S.gguf`
  - Specialized for code search and retrieval
  - 768 dimensions, ~2.5GB
  - Supports multiple programming languages
  - Quantized for efficient inference

#### Multilingual Models
- `Qwen/Qwen3-Embedding-0.6B`
  - State-of-the-art multilingual support (100+ languages)
  - 1024 dimensions, ~600MB
  - Supports instruction-based embeddings
  - Excellent for code and text retrieval
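
The dimension counts listed above give a rough feel for how index size scales with the model. The following is a minimal sketch, assuming each indexed chunk is stored as one uncompressed float32 vector; TurboProp's actual on-disk format is not described here, and `raw_vector_bytes` and the chunk count are hypothetical names and values chosen for illustration.

```python
# Rough estimate of raw embedding storage, assuming one uncompressed
# float32 vector per indexed chunk (an assumption, not a documented format).
BYTES_PER_FLOAT32 = 4

# Dimensions as listed in the model descriptions above.
MODEL_DIMENSIONS = {
    "sentence-transformers/all-MiniLM-L6-v2": 384,
    "sentence-transformers/all-MiniLM-L12-v2": 384,
    "nomic-embed-code.Q5_K_S.gguf": 768,
    "Qwen/Qwen3-Embedding-0.6B": 1024,
}


def raw_vector_bytes(model: str, num_chunks: int) -> int:
    """Bytes needed to hold num_chunks float32 vectors for the given model."""
    return MODEL_DIMENSIONS[model] * BYTES_PER_FLOAT32 * num_chunks


if __name__ == "__main__":
    # 50,000 chunks is an arbitrary illustrative figure.
    for model in MODEL_DIMENSIONS:
        mib = raw_vector_bytes(model, 50_000) / (1024 * 1024)
        print(f"{model}: {mib:.1f} MiB of raw vectors")
```

Under these assumptions a 1024-dimension model needs roughly 2.7 times the vector storage of a 384-dimension model for the same number of chunks.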

### Model Selection Guide

Choose your model based on your use case:

| Use Case | Recommended Model | Why |
|----------|-------------------|-----|
| General code search | `sentence-transformers/all-MiniLM-L6-v2` | Fast, reliable, good balance |
| Specialized code search | `nomic-embed-code.Q5_K_S.gguf` | Optimized for code understanding |
| Multilingual projects | `Qwen/Qwen3-Embedding-0.6B` | Best multilingual support |
| Low resource environments | `sentence-transformers/all-MiniLM-L6-v2` | Smallest memory footprint |
| Maximum accuracy | `Qwen/Qwen3-Embedding-0.6B` | State-of-the-art performance |

### Usage Examples

#### Basic Model Selection
```bash
# List available models
tp model list

# Get model information
tp model info "Qwen/Qwen3-Embedding-0.6B"

# Download a model before use
tp model download "nomic-embed-code.Q5_K_S.gguf"
```

#### Indexing with Different Models
```bash
# Use default model
tp index --repo ./my-project

# Use specialized code model
tp index --repo ./my-project --model "nomic-embed-code.Q5_K_S.gguf"

# Use multilingual model with instruction
tp index --repo ./my-project \
  --model "Qwen/Qwen3-Embedding-0.6B" \
  --instruction "Represent this code for semantic search"
```

#### Searching with Model Consistency
```bash
# Search using the same model used for indexing
tp search "jwt authentication" --model "nomic-embed-code.Q5_K_S.gguf"

# Use instruction for context-aware search (Qwen3 only)
tp search "error handling" \
  --model "Qwen/Qwen3-Embedding-0.6B" \
  --instruction "Find code related to error handling and exceptions"
```
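
Searching with the model used at index time matters because embeddings from different models are not comparable: they may differ in dimension (384, 768, or 1024 for the models above), and even equal-sized vectors live in unrelated spaces. The sketch below illustrates the comparison step, assuming cosine similarity as the metric; the README does not name the metric TurboProp uses, and `cosine_similarity` is a hypothetical helper.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; refuses vectors of different size."""
    if len(a) != len(b):
        # This is what a query embedded with a different model would hit.
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Convention chosen here for zero vectors.
    return dot / (norm_a * norm_b)
```

A dimension mismatch is the easy case to detect. Two different 384-dimension models would pass this check yet still produce meaningless scores, which is why re-indexing after a model change is the safer habit.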

### Configuration File Support

Create `.turboprop.yml` in your project root:
```yaml
# Default model for all operations
default_model: "sentence-transformers/all-MiniLM-L6-v2"

# Model-specific configurations
models:
  "Qwen/Qwen3-Embedding-0.6B":
    instruction: "Represent this code for semantic search"
    cache_dir: "~/.turboprop/qwen3-cache"
  
  "nomic-embed-code.Q5_K_S.gguf":
    cache_dir: "~/.turboprop/nomic-cache"

# Performance settings
embedding:
  batch_size: 32
  cache_embeddings: true
  
# Resource limits
max_memory_usage: "8GB"
warn_large_models: true
```

## Complete Usage Guide

### Indexing Command

The `index` command creates a searchable index of your codebase:

```bash
tp index [OPTIONS] --repo <REPO>
```

#### Options:
- `--repo <PATH>`: Repository path to index (default: current directory)
- `--max-filesize <SIZE>`: Maximum file size to index (e.g., "2mb", "500kb", "1gb")
- `--watch`: Monitor file changes and update index automatically
- `--model <MODEL>`: Embedding model to use (default: "sentence-transformers/all-MiniLM-L6-v2")
- `--cache-dir <DIR>`: Cache directory for models and data
- `--worker-threads <N>`: Number of worker threads for processing
- `--batch-size <N>`: Batch size for embedding generation (default: 32)
- `--verbose`: Enable verbose output
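
To make the `--max-filesize` values concrete, here is a small sketch of how strings such as `"2mb"` or `"500kb"` could be turned into byte counts. It assumes binary multiples (1 kb = 1024 bytes); whether TurboProp uses binary or decimal units is not stated here, and `parse_size` and `should_index` are hypothetical helpers, not part of the tool.

```python
import re

# Assumed multipliers (binary units). TurboProp may use decimal units instead.
_UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3}


def parse_size(text: str) -> int:
    """Convert a size string such as '2mb' into a number of bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([a-zA-Z]+)\s*", text)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    unit = unit.lower()
    if unit not in _UNITS:
        raise ValueError(f"unknown unit in size: {text!r}")
    return int(float(number) * _UNITS[unit])


def should_index(file_size_bytes: int, max_filesize: str) -> bool:
    """True when a file is at or below the configured size limit."""
    return file_size_bytes <= parse_size(max_filesize)
```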

#### Examples:

```bash
# Basic indexing
tp index --repo .

# Index with size limit and watch mode
tp index --repo . --max-filesize 2mb --watch

# Use custom model and cache directory
tp index --repo . --model "sentence-transformers/all-MiniLM-L12-v2" --cache-dir ~/.turboprop-cache

# Index with custom performance settings
tp index --repo . --worker-threads 8 --batch-size 64
```

### Search Command

The `search` command finds relevant code using semantic similarity:

```bash
tp search <QUERY> [OPTIONS]
```

#### Options:
- `<QUERY>`: Search query (natural language or keywords)
- `--repo <PATH>`: Repository path to search in (default: current directory)
- `--limit <N>`: Maximum number of results to return (default: 10)
- `--threshold <FLOAT>`: Minimum similarity threshold (0.0 to 1.0)
- `--output <FORMAT>`: Output format: 'json' (default) or 'text'
- `--filetype <EXT>`: Filter results by file extension (e.g., '.rs', '.js', '.py')
- `--filter <PATTERN>`: Filter results by glob pattern (e.g., '*.rs', 'src/**/*.js')

#### Examples:

```bash
# Basic search
tp search "user authentication" --repo .

# Search with filters and limits
tp search "database connection" --repo . --filetype .rs --limit 5

# Get human-readable output
tp search "error handling" --repo . --output text

# High-precision search
tp search "jwt token validation" --repo . --threshold 0.8

# Search in specific directory
tp search "api routes" --repo ./backend

# Filter by glob pattern
tp search "authentication" --repo . --filter "src/*.js"

# Recursive glob patterns
tp search "error handling" --repo . --filter "**/*.{rs,py}"

# Combine filters
tp search "database" --repo . --filetype .rs --filter "src/**/*.rs"
```

## Glob Pattern Filtering

TurboProp supports powerful glob pattern filtering to search within specific files or directories. Glob patterns use Unix shell-style wildcards to match file paths.

### Basic Wildcards

| Wildcard | Description | Example |
|----------|-------------|---------|
| `*` | Match any characters within a directory | `*.rs` matches all Rust files |
| `?` | Match exactly one character | `file?.rs` matches `file1.rs`, `fileA.rs` |
| `**` | Match any characters across directories | `**/*.js` matches JS files anywhere |
| `[abc]` | Match any character in the set | `file[123].rs` matches `file1.rs`, `file2.rs`, `file3.rs` |
| `[!abc]` | Match any character NOT in the set | `file[!0-9].rs` matches `filea.rs` but not `file1.rs` |
| `{a,b}` | Match any of the alternatives | `*.{js,ts}` matches both `.js` and `.ts` files |

### Common Pattern Examples

#### File Type Filtering
```bash
# All Rust files anywhere in the codebase
tp search "async function" --filter "*.rs"

# All JavaScript and TypeScript files
tp search "react component" --filter "*.{js,ts,jsx,tsx}"

# All configuration files
tp search "database" --filter "*.{json,yaml,yml,toml,ini}"
```

#### Directory-Specific Filtering
```bash
# Files only in the src directory
tp search "main function" --filter "src/*.rs"

# Files only in tests directory
tp search "test case" --filter "tests/*.py"

# Files in specific subdirectories
tp search "handler" --filter "src/api/*.js"
```

#### Recursive Directory Filtering
```bash
# Python files anywhere in the project
tp search "authentication" --filter "**/*.py"

# Test files in any subdirectory
tp search "unit test" --filter "**/test_*.rs"

# Source files in src and all subdirectories
tp search "database connection" --filter "src/**/*.{rs,py,js}"

# Handler files in nested API directories
tp search "request handler" --filter "**/api/**/handlers/*.rs"
```

#### Advanced Pattern Examples
```bash
# Test files with specific naming patterns
tp search "integration test" --filter "tests/**/*_{test,spec}.{js,ts}"

# Source files excluding certain directories
tp search "function definition" --filter "src/**/*.rs" --filter "!**/target/**"

# Files in multiple specific directories
tp search "configuration" --filter "{src,config,scripts}/**/*.{json,yaml}"

# Files with numeric suffixes
tp search "version" --filter "**/*[0-9].{js,py,rs}"
```

### Pattern Behavior

**Path Matching**: Patterns match against the entire file path, not just the filename:
- `*.rs` matches `main.rs`, `src/main.rs`, and `lib/nested/file.rs`
- `src/*.rs` matches `src/main.rs` but not `src/nested/file.rs`
- `src/**/*.rs` matches both `src/main.rs` and `src/nested/file.rs`

**Case Sensitivity**: Patterns are case-sensitive by default:
- `*.RS` matches `FILE.RS` but not `file.rs`
- `*.rs` matches `file.rs` but not `FILE.RS`

**Path Separators**: Always use forward slashes (`/`) in patterns:
- `src/api/*.js` (correct)
- `src\api\*.js` (incorrect)

**Combining with File Type Filter**: You can use both `--filter` and `--filetype` together:
```bash
# Search for Rust files in src directory only
tp search "async" --filetype .rs --filter "src/**/*"
```
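
The matching rules in this section can be reproduced with a short translation from glob to regular expression. This is one reading of the documented behavior, not TurboProp's actual matcher: it assumes that a pattern without a `/` is tested against the file name only, while a pattern containing `/` is tested against the whole path. `glob_to_regex` and `matches` are hypothetical helpers, and the sketch covers only `*`, `**`, `?`, and `{a,b}` (no `[abc]` classes or `!` negation).

```python
import re
from pathlib import PurePosixPath


def glob_to_regex(pattern: str) -> str:
    """Translate a subset of glob syntax into a regular expression."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append("(?:.*/)?")  # zero or more whole directories
            i += 3
        elif pattern.startswith("**", i):
            out.append(".*")  # anything, including slashes
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")  # anything within one path component
            i += 1
        elif pattern[i] == "?":
            out.append("[^/]")  # exactly one non-separator character
            i += 1
        elif pattern[i] == "{":
            end = pattern.index("}", i)  # raises ValueError if unclosed
            alternatives = pattern[i + 1:end].split(",")
            out.append("(?:" + "|".join(re.escape(a) for a in alternatives) + ")")
            i = end + 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return "".join(out)


def matches(pattern: str, path: str) -> bool:
    """Case-sensitive match of a forward-slash path against a glob pattern."""
    target = path if "/" in pattern else PurePosixPath(path).name
    return re.fullmatch(glob_to_regex(pattern), target) is not None
```

With this reading, `matches("src/*.rs", "src/nested/file.rs")` is false while `matches("src/**/*.rs", "src/nested/file.rs")` is true, which mirrors the path-matching examples above.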

### Performance Tips

- **Simple patterns are faster**: `*.rs` is faster than `**/*.rs`
- **Be specific when possible**: `src/*.js` is faster than `**/*.js` if you know files are in `src/`
- **Avoid excessive wildcards**: Patterns with many `**` can be slower on large codebases
- **Use file type filter for extensions**: `--filetype .rs` is optimized for extension matching, so prefer it over `--filter "*.rs"`

### Troubleshooting Glob Patterns

**Pattern doesn't match expected files**:
- Check case sensitivity: `*.RS` vs `*.rs`
- Verify path structure: `src/*.js` only matches direct children of `src/`
- Use `**` for recursive matching: `src/**/*.js` matches nested files

**Pattern matching too many files**:
- Be more specific: use `src/*.js` instead of `*.js`
- Add more path components: `src/components/*.jsx`
- Use character classes: `test_[0-9]*.rs` instead of `test_*.rs`

**Complex patterns not working**:
- Test simpler patterns first: start with `*.ext` then add complexity
- Check for typos in braces: `{js,ts}` not `{js, ts}` (no spaces)
- Validate bracket expressions: `[a-z]` not `[a-Z]`

For more pattern examples and troubleshooting, see the `TROUBLESHOOTING.md` file.

## Configuration

TurboProp supports optional configuration via a `.turboprop.yml` file in your repository root:

```yaml
# .turboprop.yml
max_filesize: "2mb"
model: "sentence-transformers/all-MiniLM-L6-v2"
cache_dir: "~/.turboprop-cache"
worker_threads: 4
batch_size: 32
default_output: "json"
similarity_threshold: 0.3
```

## Output Formats

### JSON Output (Default)
```json
{
  "file": "src/auth.rs",
  "score": 0.8234,
  "content": "fn authenticate_user(token: &str) -> Result<User, AuthError> { ... }"
}
```
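
For tools that consume the JSON output, a parser along these lines may be a reasonable starting point. The README shows a single result object, so whether multiple results arrive as a JSON array or as one object per line is an assumption either way; the sketch accepts both. `parse_results` and `SAMPLE` are hypothetical names, and the sample text is the object shown above.

```python
import json


def parse_results(raw: str) -> list[dict]:
    """Parse search output given as one object, an array, or one object per line."""
    raw = raw.strip()
    if not raw:
        return []
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Fall back to newline-delimited JSON (one result object per line).
        return [json.loads(line) for line in raw.splitlines() if line.strip()]
    return data if isinstance(data, list) else [data]


SAMPLE = (
    '{"file": "src/auth.rs", "score": 0.8234, '
    '"content": "fn authenticate_user(token: &str) -> Result<User, AuthError> { ... }"}'
)

if __name__ == "__main__":
    for result in parse_results(SAMPLE):
        print(f'{result["score"]:.2f}  {result["file"]}')
```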

### Text Output
```
Score: 0.82 | src/auth.rs
fn authenticate_user(token: &str) -> Result<User, AuthError> {
    // JWT token validation logic
    ...
}
```

## Performance Characteristics

- **Indexing Speed**: ~100-500 files/second (depending on file size and hardware)
- **Search Speed**: ~10-50ms per query (after initial model loading)
- **Memory Usage**: ~50-200MB (varies with model and index size)
- **Storage**: Index size is typically 10-30% of source code size

### Recommended Limits
- **File Count**: Up to 10,000 files (tested)
- **File Size**: Up to 2MB per file (configurable)
- **Total Codebase**: Up to 500MB of source code
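
The figures above can be turned into quick back-of-the-envelope estimates. The sketch below only restates the quoted ranges (100-500 files/second, index at 10-30% of source size) as arithmetic; the function names are hypothetical and real numbers will depend on hardware, model, and file sizes.

```python
def indexing_seconds(num_files: int) -> tuple[float, float]:
    """(best, worst) indexing time in seconds at 500 and 100 files/second."""
    return num_files / 500, num_files / 100


def index_size_mb(source_mb: float) -> tuple[float, float]:
    """(low, high) index size in MB at 10% and 30% of the source size."""
    return source_mb * 10 / 100, source_mb * 30 / 100


if __name__ == "__main__":
    best, worst = indexing_seconds(10_000)
    low, high = index_size_mb(500)
    print(f"10,000 files: about {best:.0f} to {worst:.0f} seconds to index")
    print(f"500 MB of source: about {low:.0f} to {high:.0f} MB of index")
```

At the recommended upper limits this works out to roughly 20 to 100 seconds of indexing and 50 to 150 MB of index data, excluding the one-time model download.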

## Supported File Types

TurboProp works with any text-based file but is optimized for common programming languages:

- **Web**: `.js`, `.ts`, `.jsx`, `.tsx`, `.html`, `.css`, `.scss`, `.vue`
- **Backend**: `.py`, `.rs`, `.go`, `.java`, `.kt`, `.scala`, `.rb`, `.php`
- **Systems**: `.c`, `.cpp`, `.h`, `.hpp`, `.cs`, `.swift`
- **Data**: `.sql`, `.json`, `.yaml`, `.yml`, `.xml`, `.toml`
- **Docs**: `.md`, `.txt`, `.rst`
- **Config**: `.env`, `.ini`, `.conf`, `.cfg`

## Integration Examples

### With Git Hooks
Add to `.git/hooks/post-commit`:
```bash
#!/bin/bash
tp index --repo . --max-filesize 2mb
```

### With IDEs
Many IDEs can be configured to run external tools. Add TurboProp as a custom search tool.

### With CI/CD
```bash
# In your CI script
tp index --repo . --max-filesize 2mb
tp search "security vulnerability" --repo . --output json > security-search-results.json
```
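
When the CI step is driven from a script rather than a shell, the same commands can be assembled programmatically. The following is a sketch that uses only the `search` flags documented in this README; `build_search_command` and `run_search` are hypothetical helpers, and the assumption that `tp` exits with a non-zero status on failure is not confirmed here.

```python
import subprocess
from typing import Optional


def build_search_command(
    query: str,
    repo: str = ".",
    output: str = "json",
    limit: Optional[int] = None,
    filetype: Optional[str] = None,
) -> list[str]:
    """Assemble the argument list for `tp search` from documented flags."""
    command = ["tp", "search", query, "--repo", repo, "--output", output]
    if limit is not None:
        command += ["--limit", str(limit)]
    if filetype is not None:
        command += ["--filetype", filetype]
    return command


def run_search(query: str, **options) -> str:
    """Run `tp search` and return its standard output (requires tp on PATH)."""
    completed = subprocess.run(
        build_search_command(query, **options),
        capture_output=True,
        text=True,
        check=True,  # Assumes tp signals failure through its exit code.
    )
    return completed.stdout
```

Passing the arguments as a list avoids shell quoting problems with multi-word queries such as `"security vulnerability"`.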

## Troubleshooting

### Common Issues

**Index not found**
```bash
Error: No index found in repository
```
Solution: Run `tp index --repo .` first to create an index.

**Model download fails**
```bash
Error: Failed to download model
```
Solution: Check internet connection or specify a local cache directory with `--cache-dir`.

**Large files skipped**
```bash
Warning: Skipping large file (>2MB)
```
Solution: Increase limit with `--max-filesize 5mb` or exclude large files.

**Out of memory**
```bash
Error: Out of memory during indexing
```
Solution: Reduce `--batch-size` or `--worker-threads`, or exclude large files.

### Getting Help

```bash
tp --help              # General help
tp index --help        # Index command help
tp search --help       # Search command help
```

## Development

### Building from Source
```bash
git clone https://github.com/glamp/turboprop-rust
cd turboprop-rust
cargo build --release
```

### Running Tests
```bash
cargo test                    # Run all tests
cargo test --test integration # Run integration tests only
cargo bench                   # Run benchmarks
```

### Dependencies
- **clap**: CLI parsing and help generation
- **tokio**: Async runtime for I/O operations  
- **serde**: JSON serialization
- **fastembed**: Machine learning embeddings
- **git2**: Git repository integration
- **notify**: File system watching
- **walkdir**: Directory traversal

## See Also

For more detailed information:

- **[Installation Guide](INSTALLATION.md)** - Comprehensive installation instructions for all platforms
- **[Model Documentation](MODELS.md)** - Complete guide to available embedding models and selection criteria
- **[Configuration Guide](CONFIGURATION.md)** - Advanced configuration options and `.turboprop.yml` setup
- **[API Reference](docs/API.md)** - Library API documentation for programmatic usage
- **[Troubleshooting Guide](TROUBLESHOOTING.md)** - Solutions to common issues and performance problems
- **[Migration Guide](MIGRATION.md)** - Upgrading from previous versions

## Contributing

1. Fork the repository
2. Create a feature branch
3. Add tests for your changes
4. Ensure all tests pass: `cargo test`
5. Submit a pull request

## License

Licensed under either of:
- MIT License ([LICENSE-MIT](LICENSE-MIT))
- Apache License, Version 2.0 ([LICENSE-APACHE](LICENSE-APACHE))

at your option.