vyctor 0.1.0

# Vyctor

A fast CLI tool for semantic file search using vector embeddings.

## Features

- **Semantic Search**: Find files using natural language queries
- **AST-Aware Chunking**: Code is split at function/class boundaries for better results
- **Fast Indexing**: Parallel processing with incremental updates
- **Multiple Embedding Providers**: OpenAI, Voyage AI, or local models
- **Optional Reranking**: Second-stage ranking with Voyage AI for improved precision
- **File Watching**: Auto-sync on file changes, optionally via a background daemon
- **Index Browser**: Inspect indexed files and chunks for debugging
- **Local Storage**: DuckDB with HNSW vector index for fast similarity search

## Installation

### Prerequisites

- Rust 1.70+ (install via [rustup](https://rustup.rs/))
- DuckDB VSS extension (automatically installed on first run)

### Build from source

```bash
# Clone the repository
git clone https://github.com/antonmagnus/vyctor.git
cd vyctor

# Build release binary
cargo build --release

# The binary will be at target/release/vyctor
# Optionally, install it
cargo install --path .
```

### With local embeddings support

```bash
cargo build --release --features local-embeddings
```

## Quick Start

```bash
# Initialize vyctor in your project
cd your-project
vyctor init

# Set your API key (or use local embeddings which require no key)
# You can export it or add it to a .env file in your project
export VOYAGE_API_KEY=your-key-here

# Search for files
vyctor lookup "where do we handle user authentication"

# Search in a specific folder
vyctor lookup "database queries" --folder src/db

# Get more results
vyctor lookup "error handling" -n 10

# Show full content instead of preview
vyctor lookup "main function" --full
```

## Commands

### `vyctor init`

Initialize vyctor in the current directory. Creates `vyctor.config.toml` and starts initial indexing.

```bash
vyctor init           # Initialize
vyctor init --force   # Re-initialize (overwrites existing config)
```

### `vyctor lookup`

Search for files matching a natural language query.

```bash
vyctor lookup "query"                    # Basic search
vyctor lookup "query" --folder src/      # Search in folder
vyctor lookup "query" -n 10              # Return 10 results
vyctor lookup "query" --full             # Show full content
vyctor lookup "query" --verbose          # Show detailed timing and model info
```

### `vyctor sync`

Synchronize the index with current files.

```bash
vyctor sync           # Incremental sync (only changed files)
vyctor sync --force   # Force re-index all files
```
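
The difference between an incremental and a forced sync can be pictured as a hash comparison. The following is a minimal sketch in Python, not vyctor's actual implementation: the names `content_hash` and `plan_sync`, the choice of SHA-256, the dictionary standing in for the index, and the handling of deleted files are all assumptions made for illustration.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Hash file contents. SHA-256 is an assumption, not necessarily vyctor's choice."""
    return hashlib.sha256(data).hexdigest()


def plan_sync(indexed, current, force=False):
    """Decide which paths to (re-)index and which to drop.

    indexed: dict of path -> hash recorded at the last sync (hypothetical index state)
    current: dict of path -> file contents found on disk now
    Returns (to_index, to_remove) as sorted lists of paths.
    """
    to_index = []
    for path, data in current.items():
        # Re-index when forced, when the file is new, or when its hash changed.
        if force or indexed.get(path) != content_hash(data):
            to_index.append(path)
    # Assumption: files that were indexed but no longer exist are removed.
    to_remove = [path for path in indexed if path not in current]
    return sorted(to_index), sorted(to_remove)


indexed = {
    "a.rs": content_hash(b"fn a() {}"),
    "old.rs": content_hash(b"fn old() {}"),
}
current = {"a.rs": b"fn a() {}", "b.rs": b"fn b() {}"}

print(plan_sync(indexed, current))              # unchanged a.rs is skipped
print(plan_sync(indexed, current, force=True))  # force: every current file is re-indexed
```

Skipping unchanged files is what keeps repeated syncs cheap, since embedding is typically the most expensive step.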

### `vyctor watch`

Watch for file changes and automatically sync. Can run in foreground or as a background daemon.

```bash
# Foreground mode (Ctrl+C to stop)
vyctor watch                # Start watching
vyctor watch --debounce 500 # Custom debounce (ms)

# Daemon mode (runs in background) - macOS/Linux only
vyctor watch --daemon       # Start background watcher
vyctor watch --status       # Check daemon status
vyctor watch --logs         # Show recent logs
vyctor watch --logs -f      # Follow logs (like tail -f)
vyctor watch --stop         # Stop daemon
```

> **Note:** Daemon mode (`--daemon`) is supported on macOS and Linux. Windows users should use foreground mode (`vyctor watch`).
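
To illustrate what the `--debounce` value controls, here is a small Python sketch of trailing-edge debouncing: a burst of file events triggers a single sync once no new event has arrived for the debounce interval. Whether vyctor debounces exactly this way is an assumption, and the function name `debounce` is hypothetical.

```python
def debounce(event_times_ms, debounce_ms=300):
    """Return the times (in ms) at which a sync would fire.

    Events closer together than `debounce_ms` belong to the same burst;
    the sync fires `debounce_ms` after the last event of each burst.
    """
    fire_times = []
    last_event = None
    for t in sorted(event_times_ms):
        # A quiet gap of at least debounce_ms closes the previous burst.
        if last_event is not None and t - last_event >= debounce_ms:
            fire_times.append(last_event + debounce_ms)
        last_event = t
    if last_event is not None:
        fire_times.append(last_event + debounce_ms)
    return fire_times


events = [0, 100, 250, 600, 650]  # file-change timestamps in ms
print(debounce(events, 300))  # two bursts, so two syncs
print(debounce(events, 500))  # a longer interval merges them into one sync
```

A larger interval means fewer syncs during heavy editing, at the cost of a slightly staler index.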

### `vyctor status`

Show index status and statistics.

```bash
vyctor status
```

### `vyctor config`

Show or edit configuration.

```bash
vyctor config         # Show current config
vyctor config --edit  # Open config in editor
```

### `vyctor browse`

Browse and analyze indexed files and chunks. Useful for debugging and understanding how your code is being indexed.

```bash
# List indexed files
vyctor browse files              # List all indexed files
vyctor browse files --filter src # Filter by path pattern
vyctor browse files --hash       # Show content hashes

# Browse chunks
vyctor browse chunks             # Browse chunks (paginated)
vyctor browse chunks --file path # Show chunks for a specific file
vyctor browse chunks --id 123    # Show a specific chunk by ID
vyctor browse chunks --page 2    # Go to page 2
vyctor browse chunks -n 20       # Show 20 chunks per page
vyctor browse chunks --full      # Show full chunk content

# Index statistics
vyctor browse stats              # Show statistics by file extension
```

> **Note:** The browse commands require exclusive access to the database. If the watcher daemon is running, stop it first with `vyctor watch --stop`.

## Configuration

The configuration file, `vyctor.config.toml`, lives in the project root and is meant to be tracked by git.

**Environment Variables**: Vyctor automatically loads `.env` and `.env.local` files from your project directory, so you can store API keys there instead of exporting them.

```toml
[indexing]
# Glob patterns for files to include
include = ["**/*.rs", "**/*.ts", "**/*.py", "**/*.md"]

# Glob patterns for files to exclude
exclude = ["**/node_modules/**", "**/target/**", "**/.git/**"]

# Chunk size in characters
chunk_size = 1000

# Overlap between chunks
chunk_overlap = 200

# AST-aware semantic chunking (splits at function/class boundaries)
semantic_chunking = true

# Maximum chunk size before splitting large functions/classes
max_chunk_size = 3000

[embedding]
# Provider: "openai", "voyage", or "local"
provider = "local"

# Embedding dimensions (must match your model)
dimensions = 384

# Batch size for API requests
batch_size = 100

[embedding.local]
model = "sentence-transformers/all-MiniLM-L6-v2"

[embedding.openai]
model = "text-embedding-3-small"
api_key_env = "OPENAI_API_KEY"

[embedding.voyage]
model = "voyage-3-lite"
api_key_env = "VOYAGE_API_KEY"

[reranker]
# Optional second-stage ranking for better results
# Provider: "voyage" or "none" (disabled)
provider = "none"
top_k = 30  # Candidates to retrieve before reranking

[reranker.voyage]
api_key_env = "VOYAGE_API_KEY"
model = "rerank-2"

[watch]
# Auto-start daemon when running init/sync/lookup
auto_start = false

# Debounce interval in milliseconds
debounce_ms = 300
```
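
To make the interaction between `chunk_size` and `chunk_overlap` concrete, the Python sketch below splits text with a plain character-based sliding window. This is an illustration only: the function name `split_with_overlap` is hypothetical, and how vyctor combines these settings with semantic chunking is not shown here.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Split text into chunks of at most chunk_size characters,
    where each chunk repeats the last chunk_overlap characters of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


text = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(text, chunk_size=1000, chunk_overlap=200)
print([len(c) for c in chunks])
print(chunks[0][-200:] == chunks[1][:200])  # neighbouring chunks share 200 characters
```

With the values above the window advances 800 characters at a time, so a larger overlap produces more chunks (and more embedding requests) for the same file.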

## Embedding Providers

### Local (default)

No API key needed. Models are downloaded from HuggingFace on first use.

```toml
[embedding]
provider = "local"
dimensions = 384

[embedding.local]
model = "sentence-transformers/all-MiniLM-L6-v2"
```

### OpenAI

```bash
export OPENAI_API_KEY=your-key-here
```

```toml
[embedding]
provider = "openai"
dimensions = 1536

[embedding.openai]
model = "text-embedding-3-small"  # or "text-embedding-3-large" (3072 dims)
api_key_env = "OPENAI_API_KEY"
```

### Voyage AI

```bash
export VOYAGE_API_KEY=your-key-here
```

```toml
[embedding]
provider = "voyage"
dimensions = 512

[embedding.voyage]
model = "voyage-3-lite"  # or "voyage-code-3" (1024 dims)
api_key_env = "VOYAGE_API_KEY"
```
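
Because `dimensions` must match the chosen model, a mismatch is an easy mistake when switching providers. The Python sketch below checks a model/dimension pair against the values listed in this README. It is a hypothetical helper (`check_dimensions` is not a vyctor command or API), and the table only covers the models mentioned above.

```python
from typing import Optional

# Model -> embedding size, taken from the provider sections of this README.
KNOWN_DIMENSIONS = {
    "sentence-transformers/all-MiniLM-L6-v2": 384,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "voyage-3-lite": 512,
    "voyage-code-3": 1024,
}


def check_dimensions(model: str, dimensions: int) -> Optional[str]:
    """Return an error message if the pair is known to be inconsistent, else None."""
    expected = KNOWN_DIMENSIONS.get(model)
    if expected is None:
        return None  # unknown model: nothing to compare against
    if expected != dimensions:
        return f"{model} produces {expected}-dimensional vectors, but dimensions = {dimensions}"
    return None


print(check_dimensions("voyage-3-lite", 512))           # consistent, so None
print(check_dimensions("text-embedding-3-large", 1536))  # mismatch message
```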

## How It Works

1. **Indexing**: Vyctor walks your directory, reads matching files, and splits them into semantic chunks.

2. **Embedding**: Each chunk is converted to a vector embedding using your configured provider.

3. **Storage**: Embeddings are stored in DuckDB with an HNSW index for fast similarity search.

4. **Search**: Your query is embedded and matched against the stored chunk embeddings by cosine similarity, using the HNSW index.

5. **Reranking** (optional): Results are reranked using a cross-encoder model for improved relevance.

6. **Incremental Updates**: File hashes are tracked so only changed files are re-indexed.
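
The search step can be sketched in a few lines of Python. This is an exhaustive scan over toy three-dimensional vectors, written only to show what ranking by cosine similarity means; vyctor itself relies on DuckDB's HNSW index rather than a loop like this. The chunk identifiers, the vectors, and the function names are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_n(query_vec, chunk_vecs, n=5):
    """Return the ids of the n chunks most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), chunk_id)
              for chunk_id, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk_id for _, chunk_id in scored[:n]]


# Hypothetical chunk embeddings; real ones have hundreds of dimensions.
chunk_vecs = {
    "auth.rs#login": [0.9, 0.1, 0.0],
    "db.rs#query": [0.0, 1.0, 0.2],
    "main.rs#main": [0.3, 0.3, 0.9],
}
query_vec = [1.0, 0.2, 0.0]

print(top_n(query_vec, chunk_vecs, n=2))
```

Cosine similarity depends only on direction, not vector length, which is why it is a common choice for comparing embeddings.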

## Semantic Chunking

By default, vyctor uses AST-aware chunking to split code at meaningful boundaries:

- **Functions, classes, and methods** are kept together as single chunks
- **Tree-sitter** is used to parse code and identify semantic boundaries
- **Fallback to regex patterns** when tree-sitter isn't available for a language
- **Large functions** are split with the signature preserved in each sub-chunk

This produces much better search results than naive character-based splitting, which often cuts through the middle of functions.

Supported languages for AST parsing: Rust, TypeScript, JavaScript, Python, Go, Java, C, C++

```toml
[indexing]
# Enable/disable semantic chunking (default: true)
semantic_chunking = true

# Maximum size before splitting large functions (default: 3000)
max_chunk_size = 3000
```
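
To show the idea of splitting at function and class boundaries, the sketch below uses Python's built-in `ast` module on Python source. This is a stand-in chosen so the example runs without extra dependencies; vyctor uses tree-sitter, and the function name `chunk_python_source` is hypothetical.

```python
import ast


def chunk_python_source(source):
    """Return (name, text) for each top-level function or class in the source."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno is 1-based and end_lineno is inclusive.
            text = "\n".join(lines[node.lineno - 1:node.end_lineno])
            chunks.append((node.name, text))
    return chunks


source = '''\
import os

def load(path):
    with open(path) as f:
        return f.read()

class Index:
    def add(self, item):
        self.items.append(item)
'''

for name, text in chunk_python_source(source):
    print(f"--- {name} ---")
    print(text)
```

Each chunk is a complete definition, so an embedding never mixes the tail of one function with the head of the next. Code outside any definition (the `import` line here) is simply ignored in this sketch.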

## Reranking

For improved search quality, you can enable a reranker that performs a second-stage ranking of results:

```toml
[reranker]
provider = "voyage"  # Enable Voyage AI reranker
top_k = 30           # Retrieve 30 candidates, then rerank to return top N

[reranker.voyage]
api_key_env = "VOYAGE_API_KEY"
model = "rerank-2"
```

The reranker uses a cross-encoder model that scores query-document pairs more accurately than vector similarity alone. This is especially useful for:

- Complex queries with multiple concepts
- Finding exact matches within semantically similar results
- Improving precision when you need the most relevant results
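
The two-stage flow can be sketched as follows in Python. The word-overlap scorer is a toy stand-in for a cross-encoder, used only so the example runs offline; it is not how Voyage AI's reranker scores pairs, and the function names and candidate data are invented for illustration.

```python
def rerank(query, candidates, score_pair, top_k=30, n=5):
    """Two-stage ranking.

    candidates: list of (document, vector_score) pairs from the first stage
    score_pair: function scoring a (query, document) pair, higher is better
    """
    # Stage 1: keep the top_k candidates by vector similarity.
    shortlist = sorted(candidates, key=lambda c: c[1], reverse=True)[:top_k]
    # Stage 2: rescore each query-document pair and keep the best n.
    rescored = sorted(shortlist, key=lambda c: score_pair(query, c[0]), reverse=True)
    return [doc for doc, _ in rescored[:n]]


def word_overlap(query, doc):
    """Toy pair scorer: fraction of query words that appear in the document."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0


candidates = [
    ("parse config file", 0.82),
    ("validate user login token", 0.80),
    ("render login page", 0.78),
    ("compress log archive", 0.40),
]

print(rerank("user login", candidates, word_overlap, top_k=3, n=2))
```

Note that the candidate with the highest vector score drops out after reranking, and that anything outside the first `top_k` candidates can never be recovered by the second stage.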

## Performance Tips

- Use a larger `batch_size` for faster initial indexing, at the cost of higher memory use
- Exclude large generated files and dependencies
- Use `vyctor watch --daemon` instead of frequent `vyctor sync`
- For large codebases, consider using `text-embedding-3-small` for speed
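
As a rough picture of why `batch_size` matters, the Python sketch below groups chunks into batches, where each batch would correspond to one embedding request. The helper name `batches` is hypothetical, and the exact way vyctor groups requests is an assumption.

```python
def batches(items, batch_size=100):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


chunks = [f"chunk-{i}" for i in range(250)]

print([len(b) for b in batches(chunks, 100)])  # fewer, larger requests
print([len(b) for b in batches(chunks, 50)])   # more, smaller requests
```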

## Auto-Start Daemon

Enable `auto_start` in your config to automatically start the watcher daemon:

```toml
[watch]
auto_start = true
debounce_ms = 300
```

With this enabled:

- `vyctor init` starts the daemon after initial indexing
- `vyctor sync` starts the daemon after syncing
- `vyctor lookup` starts the daemon before searching (if not already running)

This ensures your index stays fresh without manually running `vyctor watch`.

Each project gets its own independent daemon. Use `vyctor watch --status` to check if it's running.

## License

MIT