hnsw_vector_search 0.1.0

High-performance semantic search using HNSW and ONNX embeddings
# Architecture Documentation

## Overview

This document describes the internal architecture of the HNSW Vector Search library.

## Module Hierarchy

```
hnsw_vector_search
├── simd              → Vector operations (NEON optimized)
├── hnsw              → HNSW algorithm
│   ├── graph         → Data structures
│   ├── search        → Search algorithm
│   └── insert        → Insertion algorithm
├── embeddings        → ONNX Runtime integration
├── persistence       → Serialization
└── cli               → Command-line interface
```

## Data Flow

### Building an Index

```
Text Documents
  → Tokenizer → [token IDs]
  → ONNX Model → [embeddings (384-dim)]
  → HNSW Insert → Graph structure
  → Serialize → Binary file
```

### Searching

```
Query Text
  → Tokenizer → [token IDs]
  → ONNX Model → [query embedding]
  → HNSW Search → [(node_id, distance), ...]
  → Lookup Documents → Results
```

## HNSW Graph Structure

### In-Memory Representation

```rust
HnswGraph {
    vectors: Arc<RwLock<Vec<Vec<f32>>>>,    // Actual embeddings
    layers: Vec<Layer>,                      // Hierarchical graphs
    entry_point: AtomicUsize,                // Top-level node
    // ... parameters
}

Layer {
    neighbors: Arc<RwLock<Vec<Vec<usize>>>>, // Adjacency list
}
```

### Layer Selection

Each node is assigned a maximum layer so that the probability of reaching layer l or above decays geometrically:

```
P(level ≥ l) = (1/M)^l
```

For M=16, this creates:

- Layer 0: 100% of nodes (all)
- Layer 1: 6.25% of nodes (1/16)
- Layer 2: 0.39% of nodes (1/256)
- Layer 3: 0.024% of nodes (1/4096)
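
The sampling rule above can be sketched as a geometric coin-flip; `sample_level` is an illustrative name, not the library's API:

```python
import random

def sample_level(m: int, rng: random.Random) -> int:
    # Each success (probability 1/m) promotes the node one layer,
    # giving P(level >= l) = (1/m)^l.
    level = 0
    while rng.random() < 1.0 / m:
        level += 1
    return level

rng = random.Random(42)
levels = [sample_level(16, rng) for _ in range(100_000)]
frac_l1 = sum(1 for lv in levels if lv >= 1) / len(levels)
print(f"fraction at layer >= 1: {frac_l1:.4f}")  # close to 1/16 = 0.0625
```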

### Connections

Each node has up to M connections per layer:

- **Layer 0**: Up to M0 = 2×M connections (denser for accuracy)
- **Other layers**: Up to M connections

Connections are bidirectional and maintained via neighbor selection heuristic.

## Search Algorithm

### Two-Phase Search

**Phase 1: Navigation (upper layers)**

```python
current = entry_point
for layer in reversed(range(target_layer + 1, max_layer + 1)):
    current = greedy_search(layer, current, ef=1)
```
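
A minimal runnable version of the `greedy_search` step, using a toy adjacency dict rather than the library's graph type:

```python
def greedy_search(neighbors, vectors, entry, query):
    # Hill-climb: repeatedly move to the closest neighbor until no improvement.
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    current, best = entry, dist(vectors[entry], query)
    improved = True
    while improved:
        improved = False
        for n in neighbors[current]:
            d = dist(vectors[n], query)
            if d < best:
                current, best, improved = n, d, True
    return current

# Tiny 1-D example: node 2 is closest to the query 2.1.
vectors = {0: [0.0], 1: [1.0], 2: [2.0]}
neighbors = {0: [1], 1: [0, 2], 2: [1]}
print(greedy_search(neighbors, vectors, 0, [2.1]))  # → 2
```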

**Phase 2: Beam Search (layer 0)**

```python
candidates = priority_queue()  # min-heap: pop the closest candidate first
results = priority_queue()     # max-heap: pop the worst result first
visited = {entry_point}
candidates.push(entry_point)
results.push(entry_point)

while candidates:
    current = candidates.pop()
    if current farther than worst result:
        break  # no remaining candidate can improve the result set
    for neighbor in get_neighbors(current):
        if neighbor not in visited and neighbor closer than worst result:
            visited.add(neighbor)
            candidates.push(neighbor)
            results.push(neighbor)
            if len(results) > ef:
                results.pop_worst()

return results.top_k(k)
```
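
The same search as runnable Python, using `heapq` with negated distances standing in for the max-heap; the toy graph is illustrative:

```python
import heapq

def beam_search_layer0(neighbors, vectors, entry, query, ef):
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    d0 = dist(vectors[entry], query)
    candidates = [(d0, entry)]   # min-heap: closest candidate first
    results = [(-d0, entry)]     # max-heap via negation: worst result on top
    visited = {entry}
    while candidates:
        d, current = heapq.heappop(candidates)
        if d > -results[0][0]:
            break  # nothing left can improve the result set
        for n in neighbors[current]:
            if n in visited:
                continue
            visited.add(n)
            dn = dist(vectors[n], query)
            if len(results) < ef or dn < -results[0][0]:
                heapq.heappush(candidates, (dn, n))
                heapq.heappush(results, (-dn, n))
                if len(results) > ef:
                    heapq.heappop(results)  # drop the current worst
    return sorted((-d, n) for d, n in results)

# 1-D graph of 6 nodes; the two closest to query 4.2 are nodes 4 and 5.
vectors = {i: [float(i)] for i in range(6)}
neighbors = {0: [1, 2], 1: [0, 3], 2: [0, 4], 3: [1, 5], 4: [2, 5], 5: [3, 4]}
print([n for _, n in beam_search_layer0(neighbors, vectors, 0, [4.2], ef=2)])  # → [4, 5]
```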

### Complexity Analysis

- **Navigation phase**: O(M × log N) — one greedy descent per layer, examining up to M neighbors per step
- **Beam search phase**: O(M × ef) — up to M neighbors examined for each of ef candidates
- **Total**: O(M × (log N + ef)) distance computations

With M=16 and ef=50: roughly 16 log N + 800 distance computations per query.

## Insertion Algorithm

### Steps

1. **Layer selection**: Choose random layer l with P(l) = (1/M)^l
2. **Search phase**: Find entry points at each layer from top to l
3. **Connection phase**: For each layer from l down to 0:
   - Search for ef_construction candidates
   - Select M best using heuristic
   - Connect bidirectionally
   - Prune existing nodes if needed

### Neighbor Selection Heuristic

Goal: Balance proximity and diversity

```python
def select_neighbors(candidates, M):
    selected = []
    for candidate in sorted(candidates, by=distance):
        is_redundant = False
        for already_selected in selected:
            if distance(already_selected, candidate) < distance(query, candidate):
                is_redundant = True  # candidate is covered by a closer selected node
                break
        if not is_redundant:
            selected.append(candidate)
            if len(selected) == M:
                break
    return selected
```

This prevents local clustering and maintains navigability.
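
A runnable version with a concrete 1-D example; `dist` here is plain absolute difference:

```python
def select_neighbors(query, candidates, m, dist):
    # Diversity heuristic: skip a candidate if some already-selected node is
    # closer to it than the query is (that node already covers its region).
    selected = []
    for c in sorted(candidates, key=lambda c: dist(query, c)):
        if all(dist(s, c) >= dist(query, c) for s in selected):
            selected.append(c)
            if len(selected) == m:
                break
    return selected

dist = lambda a, b: abs(a[0] - b[0])
query = (0.0,)
candidates = [(1.0,), (1.1,), (-3.0,)]
# (1.1,) is pruned: (1.0,) is closer to it than the query is.
print(select_neighbors(query, candidates, 2, dist))  # → [(1.0,), (-3.0,)]
```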

## SIMD Optimization

### NEON Implementation

ARM NEON processes 4 floats per instruction:

```rust
// Scalar: one multiply-add per element
let mut sum = 0.0f32;
for i in 0..384 {
    sum += a[i] * b[i];
}

// NEON (aarch64): four f32 lanes per fused multiply-add
use std::arch::aarch64::*;
let sum = unsafe {
    let mut sum_vec = vdupq_n_f32(0.0);
    for i in (0..384).step_by(4) {
        let va = vld1q_f32(a.as_ptr().add(i));
        let vb = vld1q_f32(b.as_ptr().add(i));
        sum_vec = vfmaq_f32(sum_vec, va, vb);
    }
    vaddvq_f32(sum_vec) // horizontal sum across the 4 lanes
};
```

### Horizontal Sum

Reducing 4-wide NEON register to scalar:

```
[a, b, c, d]
    ↓ pairwise add
[a+b, c+d, -, -]
    ↓ pairwise add
[a+b+c+d, -, -, -]
    ↓ extract lane 0
a+b+c+d
```
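
In scalar code, the same tree-shaped reduction looks like this (the library performs it with NEON pairwise-add instructions):

```python
def horizontal_sum(lanes):
    # Pairwise (tree) reduction: halve the lane count each round.
    while len(lanes) > 1:
        lanes = [lanes[i] + lanes[i + 1] for i in range(0, len(lanes), 2)]
    return lanes[0]

print(horizontal_sum([1.0, 2.0, 3.0, 4.0]))  # → 10.0
```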

## ONNX Integration

### Model Loading

```rust
Session::builder()?
    .with_optimization_level(GraphOptimizationLevel::Level3)?  // graph fusion, constant folding
    .with_intra_threads(4)?                                    // parallelism within ops
    .commit_from_file(model_path)?
```

### Inference Pipeline

```
Input text
  → Tokenize → [input_ids, attention_mask]
  → Prepare tensors → ndarray::Array2<i64>
  → Run model → last_hidden_state [batch, seq_len, 384]
  → Mean pooling → [batch, 384]
  → Normalize → unit vectors
```

### Mean Pooling

```python
def mean_pool(hidden_states, attention_mask):
    # hidden_states: seq_len × dim; attention_mask: seq_len of 0/1
    dim = len(hidden_states[0])
    n_real = sum(attention_mask)
    sums = [0.0] * dim
    for token, mask in zip(hidden_states, attention_mask):
        if mask == 1:  # skip padding tokens
            for j in range(dim):
                sums[j] += token[j]
    # Average over real tokens only
    return [s / n_real for s in sums]
```
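
The pipeline's final step, L2 normalization, makes the dot product of two embeddings equal to their cosine similarity; a plain-Python sketch:

```python
import math

def normalize(vec):
    # Divide by the L2 norm so the result is a unit vector.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0.0 else vec

print(normalize([3.0, 4.0]))  # → [0.6, 0.8]
```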

## Persistence

### Serialization Format

```
File structure:
┌─────────────────┐
│ Magic number    │ 4 bytes
│ Version         │ 4 bytes
├─────────────────┤
│ Metadata        │
│  - m_max        │
│  - ef_const     │
│  - entry_point  │
│  - max_layer    │
├─────────────────┤
│ Vectors         │
│  - count        │
│  - dimension    │
│  - data [][]    │
├─────────────────┤
│ Layers          │
│  - count        │
│  - neighbors[][]│
└─────────────────┘
```
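
The header portion of this layout can be sketched with Python's `struct`; the magic bytes and field widths below are assumptions for illustration, not the library's actual encoding:

```python
import io
import struct

MAGIC = b"HNSW"  # hypothetical 4-byte magic number

def write_header(buf, version, m_max, ef_const, entry_point, max_layer):
    # Little-endian: u32 version, then four u64 metadata fields (widths assumed).
    buf.write(MAGIC)
    buf.write(struct.pack("<IQQQQ", version, m_max, ef_const, entry_point, max_layer))

buf = io.BytesIO()
write_header(buf, 1, 16, 200, 0, 3)
data = buf.getvalue()
print(len(data))  # → 40 (4 magic + 4 version + 4 × 8 metadata)
```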

### Memory Mapping

```
load_index_mmap():
    file = open(path)
    mmap = Mmap::map(file)   // OS maps the file into virtual memory
    deserialize(mmap)        // builds zero-copy views over the mapping
```

Benefits:

- Instant startup (no deserialization time)
- OS manages paging (only load what's needed)
- Shared memory across processes
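
A minimal demonstration of the idea using Python's stdlib `mmap` (the Rust side would use a crate such as `memmap2`; this sketch only shows the zero-copy read pattern):

```python
import mmap
import os
import struct
import tempfile

# Write four little-endian f32 values to a scratch file.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(struct.pack("<4f", 1.5, 2.5, 3.5, 4.5))

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
# Reads go straight through the page cache; nothing is parsed upfront.
first = struct.unpack_from("<f", mm, 0)[0]
print(first)  # → 1.5
mm.close()
os.remove(path)
```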

## Concurrency Model

### Read-Heavy Workload

```rust
Arc<RwLock<T>>  // Multiple readers, single writer

During search: Multiple threads read simultaneously
During insert: One thread writes, others blocked
```

### Lock Granularity

- **Fine-grained**: One lock per layer
- **Benefit**: Insertions at different layers don't block each other
- **Trade-off**: More complex but better concurrency

## Performance Characteristics

### Time Complexity

| Operation | Complexity          | Actual (1K vecs) | Actual (100K vecs) |
| --------- | ------------------- | ---------------- | ------------------ |
| Insert    | O(log N × M × ef_c) | ~90 µs           | ~150 µs            |
| Search    | O(log N × M × ef)   | ~12 µs           | ~35 µs             |
| Embedding | O(seq_len × d²)     | ~18 ms           | ~18 ms             |

### Space Complexity

Per node:

- Vector: 384 floats × 4 bytes = 1,536 bytes
- Layer 0 edges: 32 usize × 8 bytes = 256 bytes
- Layer 1+ edges: ~16 usize × 8 bytes = 128 bytes (if present)

Total for 1M vectors: ~1.8 GB
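
The estimate above, as quick arithmetic (upper-layer edges add only ~1% more, since just 1/16 of nodes have them):

```python
# Back-of-envelope memory per node, using the figures above.
vector_bytes = 384 * 4        # f32 embedding
layer0_edges = 32 * 8         # M0 = 32 neighbor IDs as usize
per_node = vector_bytes + layer0_edges
total_gb = per_node * 1_000_000 / 1e9
print(per_node, round(total_gb, 2))  # → 1792 1.79
```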

## Optimization Opportunities

### Future Work

1. **Quantization**: Store vectors as INT8 → 4x smaller
2. **Product Quantization**: Further compression with minimal recall loss
3. **GPU acceleration**: ONNX with Metal backend on Apple Silicon
4. **Parallel insertion**: Lock-free techniques for concurrent builds
5. **Filtered search**: Metadata filtering during search
6. **Dynamic updates**: Efficient deletion and modification

## References

- [HNSW Paper](https://arxiv.org/abs/1603.09320)
- [FAISS Implementation](https://github.com/facebookresearch/faiss)
- [ARM NEON Intrinsics](https://developer.arm.com/architectures/instruction-sets/intrinsics/)