opencodesearch 0.1.0

Largescale MCP server for codebase search with background indexing and automatic updating to git commits in rust
# opencodesearch

`opencodesearch` is an asynchronous Rust code search system with a Model Context Protocol (MCP) server.
It indexes large repositories into vector + keyword backends and serves search results through MCP tools.

## Features
- Fully async runtime (`tokio`)
- 4 isolated processes:
  - orchestrator (state machine + supervision)
  - background ingestor
  - MCP server process
  - git watchdog process
- Required crates integrated and used in runtime code:
  - `opencodesearchparser`
  - `qdrant-client`
  - `ollama-rs`
  - `rmcp`
- Hybrid retrieval:
  - semantic search (Qdrant vectors)
  - keyword search (Quickwit HTTP + local shadow fallback)
- MCP stdio server compatible with MCP clients (opencode / Codex / Claude Code style stdio transport)

## Architecture
State machine in orchestrator:
- `SPINUP`: load `config.json`
- `NORMAL`: run `ingestor` + `mcp` + `watchdog`
- `UPDATE`: keep `watchdog`, stop `ingestor` + `mcp` during update window
- `CLOSING`: stop all children gracefully

Update flow:
- watchdog tracks git commits since last sync
- when threshold (`commit_threshold`) is reached:
  - send `UPDATE_START` to orchestrator
  - pull + compute changed/deleted files
  - remove stale docs
  - reindex changed files
  - send `UPDATE_END`

## Requirements
- Rust stable toolchain
- Docker + Docker Compose
- Local network access to:
  - Ollama (`11434`)
  - Qdrant (`6333` HTTP, `6334` gRPC)
  - Quickwit (`7280`)

## Configuration
`config.json` schema:

```json
{
  "codebase": {
    "directory_path": "/path/to/massive/repo",
    "git_branch": "main",
    "commit_threshold": 50,
    "mcp_server": "stdio"
  },
  "ollama": {
    "server_url": "http://localhost:11434",
    "embedding_model": "qwen3-embedding:0.6b",
    "context_size": 5000
  },
  "qdrant": {
    "server_url": "http://localhost:6334",
    "api_key": null
  },
  "quickwit": {
    "quickwit_url": "http://localhost:7280",
    "quickwit_index_id": "opencodesearch-code-chunks"
  }
}
```

Important:
- `qdrant.server_url` should target the gRPC endpoint port (`6334`) for `qdrant-client`.
- `quickwit.quickwit_url` should target HTTP (`7280`).

## Start Backend Services
Run all local dependencies:

```bash
docker compose up -d
```

Check containers:

```bash
docker ps
```

## Running the System

### 1) Orchestrator mode (recommended)
Starts and supervises all child processes.

```bash
cargo run -- orchestrator --config config.json
```

### 2) Individual process modes
You can run each process directly for debugging.

Ingestor:
```bash
cargo run -- ingestor --config config.json
```

MCP server:
```bash
cargo run -- mcp --config config.json
```

Watchdog (requires orchestrator IPC env):
```bash
OPENCODESEARCH_IPC_SOCKET=/tmp/opencodesearch.sock cargo run -- watchdog --config config.json
```

## MCP Server Usage
The MCP server runs over stdio using `rmcp` transport.

Implemented MCP tool:
- `search_code`
  - input:
    - `query: string`
    - `limit?: number` (default 8, max 50)
  - output (JSON string): array of objects with
    - `snippet`
    - `path`
    - `start_line`
    - `end_line`
    - `score`
    - `source`

### Example tool input

```json
{
  "query": "which function changes obj variable",
  "limit": 5
}
```

### Result shape

```json
[
  {
    "path": "/repo/module.py",
    "snippet": "def mutate(obj): ...",
    "start_line": 10,
    "end_line": 22,
    "score": 0.92,
    "source": "qdrant"
  }
]
```

## Using With MCP Clients
Any MCP client that supports stdio transport can launch this server binary.

Client command pattern:
- executable: `opencodesearch`
- args: `mcp --config /abs/path/config.json`

If using `cargo run` in development:
- command: `cargo`
- args: `run -- mcp --config /abs/path/config.json`

## Rust API Documentation
The crate exposes reusable modules for embedding, indexing, MCP serving, and process control.

### Modules
- `config`: parse typed app config (`AppConfig`)
- `chunking`: parse/split source files into chunks (`chunk_file`)
- `indexing`: indexing runtime (`IndexingRuntime`)
- `qdrant_store`: vector storage + semantic query (`QdrantStore`)
- `quickwit`: keyword storage/query (`QuickwitStore`)
- `mcp`: MCP server type (`OpenCodeSearchMcpServer`)
- `watchdog`: git update monitor (`WatchdogProcess`)
- `orchestrator`: multi-process supervisor (`Orchestrator`)

### Minimal Rust indexing example

```rust
use opencodesearch::config::AppConfig;
use opencodesearch::indexing::IndexingRuntime;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let config = AppConfig::from_path("config.json")?;
    let runtime = IndexingRuntime::from_config(config)?;

    runtime.index_entire_codebase().await?;
    Ok(())
}
```

### Minimal Rust semantic search example

```rust
use opencodesearch::config::AppConfig;
use opencodesearch::indexing::IndexingRuntime;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let config = AppConfig::from_path("config.json")?;
    let runtime = IndexingRuntime::from_config(config)?;

    let query_vec = runtime.embed_query("where is object mutated") .await?;
    let hits = runtime.qdrant.semantic_search(query_vec, 5).await?;

    for hit in hits {
        println!("{}:{}-{}", hit.path, hit.start_line, hit.end_line);
    }

    Ok(())
}
```

### Minimal Rust MCP server embedding

```rust
use opencodesearch::config::AppConfig;
use opencodesearch::indexing::IndexingRuntime;
use opencodesearch::mcp::OpenCodeSearchMcpServer;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let config = AppConfig::from_path("config.json")?;
    let runtime = IndexingRuntime::from_config(config)?;
    OpenCodeSearchMcpServer::new(runtime).run_stdio().await
}
```

## Testing
### Standard tests
```bash
cargo test
```

### Live container integration tests
Requires running Docker services and local git:

```bash
cargo test -- --ignored
```

Current ignored integration tests validate:
- Ollama connectivity
- Quickwit + Qdrant connectivity
- full indexing flow on generated Python project
- retrieval through MCP search path with non-exact query phrasing
- 100-commit refactor scenario for watchdog threshold behavior

## Troubleshooting
- Quickwit health endpoint: use `http://localhost:7280/health/livez`
- If embeddings fail, confirm Ollama model availability:
  - `qwen3-embedding:0.6b`
- Qdrant client requires gRPC port (`6334`) in config
- If integration tests fail on startup race, rerun after a short container warmup