ruve-db 0.1.1

# RuVe

A hybrid vector + full-text search database written in Rust.

RuVe combines an **HNSW** approximate nearest-neighbour graph with a **BM25** keyword index.

---

## Features

- **HNSW index** — sub-linear approximate nearest-neighbour search with configurable exploration factor
- **BM25 index** — IDF-weighted full-text ranking with tokenisation and stop-word filtering
- **Append-only binary storage** — fast sequential writes, O(1) random-access reads via stored offsets
- **Fully persistent** — all data, graph edges, and indices are written to disk; no in-memory-only state

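The append-only storage idea can be illustrated with a small sketch. This is not RuVe's on-disk format: the `AppendLog` type, the 4-byte length prefix, and the in-memory `HashMap` are assumptions made for illustration (RuVe persists its offset map to disk). It shows why inserts are sequential writes and why a read costs one map lookup plus one seek, independent of file size.

```rust
use std::collections::HashMap;
use std::fs::{File, OpenOptions};
use std::io::{self, Read, Seek, SeekFrom, Write};
use std::path::Path;

/// Hypothetical append-only store: each record is a little-endian u32
/// length followed by the payload bytes. Not RuVe's actual layout.
struct AppendLog {
    file: File,
    /// id -> byte offset where the record starts (kept in memory here;
    /// a real store would persist this map as well).
    offsets: HashMap<String, u64>,
}

impl AppendLog {
    fn open(path: &Path) -> io::Result<Self> {
        let file = OpenOptions::new()
            .read(true)
            .append(true)
            .create(true)
            .open(path)?;
        Ok(Self { file, offsets: HashMap::new() })
    }

    /// Append a record at the end of the file and remember its offset.
    fn put(&mut self, id: &str, payload: &[u8]) -> io::Result<u64> {
        let offset = self.file.seek(SeekFrom::End(0))?;
        self.file.write_all(&(payload.len() as u32).to_le_bytes())?;
        self.file.write_all(payload)?;
        self.offsets.insert(id.to_string(), offset);
        Ok(offset)
    }

    /// Jump straight to the stored offset and read one record.
    fn get(&mut self, id: &str) -> io::Result<Option<Vec<u8>>> {
        let Some(&offset) = self.offsets.get(id) else {
            return Ok(None);
        };
        self.file.seek(SeekFrom::Start(offset))?;
        let mut len = [0u8; 4];
        self.file.read_exact(&mut len)?;
        let mut payload = vec![0u8; u32::from_le_bytes(len) as usize];
        self.file.read_exact(&mut payload)?;
        Ok(Some(payload))
    }
}
```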
---

## Installation

Add RuVe to your `Cargo.toml`:

```toml
[dependencies]
ruve-db = "0.1"
```

The default feature set includes the embedder (OpenAI / Ollama). If you only need the core library and will supply your own vectors, disable the default features:

```toml
[dependencies]
ruve-db = { version = "0.1", default-features = false }
```

---

## Library Usage

### Insert and search with your own vectors

```rust
use ruve::database::Database;

let mut db = Database::new(
    "data/data.bin",     // binary record store
    "data/index.json",   // UUID → file-offset map
    "data/bm25.json",    // BM25 term statistics
    "data/hnsw.json",    // HNSW graph metadata
    "data/graph.bin",    // HNSW edge data
);

// Insert with an auto-generated UUID
let vector: Vec<f32> = vec![0.1, 0.2, 0.3, /* ... */];
db.insert_raw(vector, "The quick brown fox", None);

// Insert with a custom key
let vector2: Vec<f32> = vec![0.4, 0.5, 0.6, /* ... */];
db.insert_raw(vector2, "Jumps over the lazy dog", Some("my-doc-id"));

// HNSW approximate nearest-neighbour search
// ef is the exploration factor — higher = better recall, slower
let query: Vec<f32> = vec![0.15, 0.25, 0.35, /* ... same dimension as the stored vectors */];
let records = db.search_hnsw(&query, 20);
for record in &records {
    println!("{} — {:?}", record.id, record.metadata);
}

// BM25 full-text search
let results = db.text_search("quick fox", 5);
for record in &results {
    println!("{} — {:?}", record.id, record.metadata);
}

// Delete by id
db.delete("my-doc-id");

// Wipe everything
db.wipe();
```
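For readers unfamiliar with BM25, the ranking behind a call like `text_search` can be sketched as follows. This is a generic Okapi BM25 scorer, not RuVe's implementation: the function names, the tiny stop-word list, the tokeniser, and the `k1` / `b` parameters are all assumptions for illustration.

```rust
use std::collections::HashMap;

/// Lowercase, split on non-alphanumeric characters, and drop a tiny
/// illustrative stop-word list (not RuVe's actual list).
fn tokenize(text: &str) -> Vec<String> {
    const STOP_WORDS: [&str; 4] = ["the", "a", "an", "of"];
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .filter(|t| !STOP_WORDS.contains(&t.as_str()))
        .collect()
}

/// Score every document against the query with Okapi BM25.
fn bm25_scores(docs: &[&str], query: &str, k1: f64, b: f64) -> Vec<f64> {
    let tokenized: Vec<Vec<String>> = docs.iter().map(|d| tokenize(d)).collect();
    let n = tokenized.len() as f64;
    let avg_len = tokenized.iter().map(|d| d.len() as f64).sum::<f64>() / n.max(1.0);

    // Document frequency: the number of documents containing each term.
    let mut df: HashMap<&str, f64> = HashMap::new();
    for doc in &tokenized {
        let mut seen: Vec<&str> = doc.iter().map(String::as_str).collect();
        seen.sort_unstable();
        seen.dedup();
        for term in seen {
            *df.entry(term).or_insert(0.0) += 1.0;
        }
    }

    let query_terms = tokenize(query);
    tokenized
        .iter()
        .map(|doc| {
            let len = doc.len() as f64;
            query_terms
                .iter()
                .map(|term| {
                    let tf = doc.iter().filter(|t| *t == term).count() as f64;
                    if tf == 0.0 {
                        return 0.0; // term absent: contributes nothing
                    }
                    let n_t = df.get(term.as_str()).copied().unwrap_or(0.0);
                    // Rare terms get a larger IDF weight.
                    let idf = (1.0 + (n - n_t + 0.5) / (n_t + 0.5)).ln();
                    // Term frequency saturates, normalised by document length.
                    idf * tf * (k1 + 1.0) / (tf + k1 * (1.0 - b + b * len / avg_len))
                })
                .sum::<f64>()
        })
        .collect()
}
```

Calling `bm25_scores(&docs, "quick fox", 1.2, 0.75)` returns one score per document; sorting by descending score and keeping the first `k` gives the ranked result list. The values `1.2` and `0.75` are commonly used defaults, not values confirmed for RuVe.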

### Embedder (feature = `"embedder"`)

Use RuVe's built-in embedding backends to turn text into vectors automatically.

**OpenAI** (requires `OPENAI_API_KEY` in your environment or a `.env` file):

```rust
use ruve::embedder::Embedder;

let embedder = Embedder::openai();
let vector = embedder.embed("The quick brown fox");
db.insert_raw(vector, "The quick brown fox", None);
```

**Ollama** (requires `ollama` running locally with `nomic-embed-text` pulled):

```rust
use ruve::embedder::Embedder;

let embedder = Embedder::ollama();
let vector = embedder.embed("The quick brown fox");
db.insert_raw(vector, "The quick brown fox", None);
```

---

## CLI

An interactive REPL for exploring your database directly from the terminal.

```bash
# default features already include embedder + cli
cargo run --bin ruve

# or explicitly
cargo run --bin ruve --features cli
```

```
RuVe v0.1.0 — type help for available commands, quit to exit
ruve> insert The quick brown fox jumps over the lazy dog
inserted
ruve> search text quick fox 3
01980... — Some("The quick brown fox jumps over the lazy dog")
ruve> search vec The quick brown fox 3
query vector dim: 3072
0.9821 | dim=3072 | 01980... — Some("The quick brown fox jumps over the lazy dog")
ruve> list
01980... — Some("The quick brown fox jumps over the lazy dog")
ruve> delete 01980...
deleted
ruve> wipe
wiped
```

### Commands

| Command | Description |
|---------|-------------|
| `insert <text>` | Embed the text and insert a record |
| `insert raw [1.0, 2.0, ...] <text>` | Insert with a pre-computed vector |
| `search vec <query> <k>` | Embed the query and run HNSW vector search |
| `search text <query> <k>` | BM25 full-text search |
| `delete <id>` | Delete a record by UUID |
| `wipe` | Delete all records and indices |
| `load <filename>` | Batch-embed and index every line from `books/<filename>` |
| `list` | List all stored records |
| `help` | Show this help text |
| `quit` / `exit` | Exit the REPL |

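The `insert raw [1.0, 2.0, ...] <text>` form packs a vector literal and free text into one line. A minimal sketch of how such a line could be split is shown here; `parse_insert_raw` is a hypothetical helper written for illustration and is not taken from RuVe's REPL.

```rust
/// Hypothetical parser for `insert raw [1.0, 2.0, ...] <text>`.
/// Returns the vector and the trailing text, or `None` if the line is
/// malformed (missing brackets, non-numeric entry, or empty text).
fn parse_insert_raw(line: &str) -> Option<(Vec<f32>, String)> {
    let rest = line.trim().strip_prefix("insert raw")?.trim_start();
    let rest = rest.strip_prefix('[')?;
    // Everything before the first `]` is the vector, the rest is text.
    let (inside, text) = rest.split_once(']')?;
    let vector = inside
        .split(',')
        .map(|n| n.trim().parse::<f32>().ok())
        .collect::<Option<Vec<f32>>>()?;
    let text = text.trim();
    if vector.is_empty() || text.is_empty() {
        return None;
    }
    Some((vector, text.to_string()))
}
```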
---


## Benchmark

The `benchmark` binary measures insert throughput, vector search latency, text search latency, and Recall@k against brute-force ground truth.

```bash
# run the two smallest scenarios by default
cargo run --bin benchmark

# pick specific scenarios
cargo run --bin benchmark -- xs small medium large highdim
```
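Recall@k is the fraction of the exact top-k neighbours that the approximate search also returns. The sketch below shows one way to compute it. It is not the benchmark's code: the function names, the use of squared Euclidean distance, and the representation of results as index lists are assumptions.

```rust
use std::collections::HashSet;

/// Squared Euclidean distance (an assumed metric for this sketch).
fn squared_l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Exact top-k by scanning every vector: the brute-force ground truth.
fn brute_force_top_k(data: &[Vec<f32>], query: &[f32], k: usize) -> Vec<usize> {
    let mut ids: Vec<usize> = (0..data.len()).collect();
    ids.sort_by(|&i, &j| {
        squared_l2(&data[i], query).total_cmp(&squared_l2(&data[j], query))
    });
    ids.truncate(k);
    ids
}

/// Fraction of the exact top-k that the approximate result also contains.
/// Assumes `approx` holds no duplicate ids.
fn recall_at_k(approx: &[usize], exact: &[usize]) -> f64 {
    if exact.is_empty() {
        return 0.0;
    }
    let truth: HashSet<usize> = exact.iter().copied().collect();
    let hits = approx.iter().filter(|id| truth.contains(*id)).count();
    hits as f64 / exact.len() as f64
}
```

In a benchmark loop, `approx` would come from the HNSW search and `exact` from `brute_force_top_k` for the same query, with the per-query recall averaged over all queries.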

| Scenario | Nodes | Dims |
|----------|------:|-----:|
| `xxs` | 200 | 128 |
| `xs` | 1 K | 128 |
| `small` | 10 K | 128 |
| `medium` | 50 K | 128 |
| `large` | 100 K | 128 |
| `highdim` | 10 K | 768 |

### Results — `small` (10 K × 128d, k=10)

| Operation | Throughput | p50 | p95 | p99 |
|-----------|----------:|----:|----:|----:|
| Insert | 54 ops/s | | | |
| HNSW vector search | 125 qps | 7.89 ms | 9.27 ms | 9.73 ms |
| Brute-force vector search | 13 qps | 79.20 ms | 83.07 ms | 85.80 ms |
| BM25 text search | 1 011 qps | 1.05 ms | 1.28 ms | 2.03 ms |

---

## Visualizer

An interactive 3-D viewer for the HNSW graph. Click any node to inspect it.

![HNSW graph visualizer](https://i.imgur.com/1INg2Jk.png)

```bash
# populate a small graph with the benchmark, then open the viewer
cargo run --release --bin benchmark -- xxs
cargo run --release --bin visualize
```

The viewer opens as a self-contained HTML file in your browser. You can also pass a specific scenario or point it at any CLI database directory:

```bash
# different benchmark scenario
cargo run --release --bin visualize -- xs

# a database you built through the CLI (stored in ./data)
cargo run --release --bin visualize -- ./data
```

## Running tests

```bash
cargo test
```

## License

MIT