# Architecture Documentation
## Overview
This document describes the internal architecture of the HNSW Vector Search library.
## Module Hierarchy
```
hnsw_vector_search
├── simd → Vector operations (NEON optimized)
├── hnsw → HNSW algorithm
│ ├── graph → Data structures
│ ├── search → Search algorithm
│ └── insert → Insertion algorithm
├── embeddings → ONNX Runtime integration
├── persistence → Serialization
└── cli → Command-line interface
```
## Data Flow
### Building an Index
```
Text Documents
↓
Tokenizer → [token IDs]
↓
ONNX Model → [embeddings (384-dim)]
↓
HNSW Insert → Graph structure
↓
Serialize → Binary file
```
### Searching
```
Query Text
↓
Tokenizer → [token IDs]
↓
ONNX Model → [query embedding]
↓
HNSW Search → [(node_id, distance), ...]
↓
Lookup Documents → Results
```
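The search flow can be sketched end-to-end in Python, with a deterministic stub standing in for the tokenizer/ONNX stages and a brute-force scan standing in for the HNSW traversal (all names here are illustrative, not the library's API):

```python
import numpy as np

def embed(text, dim=384):
    # Stub for the Tokenizer + ONNX model stages: a deterministic
    # pseudo-embedding seeded from the text bytes, normalized to unit length
    rng = np.random.default_rng(sum(text.encode()))
    v = rng.standard_normal(dim).astype(np.float32)
    return v / np.linalg.norm(v)

def search(query, doc_vecs, k=3):
    # Stand-in for HNSW search: an exact scan returning (node_id, distance)
    q = embed(query)
    dists = np.linalg.norm(doc_vecs - q, axis=1)
    order = np.argsort(dists)[:k]
    return [(int(i), float(dists[i])) for i in order]

docs = ["alpha", "beta", "gamma"]
index = np.stack([embed(d) for d in docs])
results = search("alpha", index, k=2)
```

The real pipeline differs only in the middle stages: the stub `embed` is replaced by the tokenizer plus ONNX model, and the exact scan by the graph search described below.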
## HNSW Graph Structure
### In-Memory Representation
```rust
HnswGraph {
    vectors: Arc<RwLock<Vec<Vec<f32>>>>,  // Actual embeddings
    layers: Vec<Layer>,                   // Hierarchical graphs
    entry_point: AtomicUsize,             // Top-level node
    // ... parameters
}

Layer {
    neighbors: Arc<RwLock<Vec<Vec<usize>>>>,  // Adjacency list
}
```
### Layer Selection
Each node is assigned a maximum level at random; the probability that a node appears on layer l (i.e., its level is at least l) decays geometrically:
```
P(level ≥ l) = (1/M)^l
```
For M=16, this creates:
- Layer 0: 100% of nodes (all)
- Layer 1: 6.25% of nodes (1/16)
- Layer 2: 0.39% of nodes (1/256)
- Layer 3: 0.024% of nodes (1/4096)
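The standard way to draw levels with this distribution (a sketch, not necessarily the library's exact code) is to take the floor of an exponential sample:

```python
import math
import random

def random_level(m: int, rng: random.Random) -> int:
    # floor(-ln(U) / ln(M)) with U uniform in (0, 1]
    # gives P(level >= l) = (1/M)^l, matching the table above
    u = 1.0 - rng.random()  # shift into (0, 1] to avoid log(0)
    return int(-math.log(u) / math.log(m))

rng = random.Random(42)
levels = [random_level(16, rng) for _ in range(100_000)]
frac_layer1 = sum(lvl >= 1 for lvl in levels) / len(levels)  # expect about 1/16
```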
### Connections
Each node has up to M connections per layer:
- **Layer 0**: Up to M0 = 2×M connections (denser for accuracy)
- **Other layers**: Up to M connections
Connections are bidirectional and are maintained by the neighbor-selection heuristic described under Insertion below.
## Search Algorithm
### Two-Phase Search
**Phase 1: Navigation (upper layers)**
```python
current = entry_point
for layer in range(max_layer, target_layer, -1):
    current = greedy_search(layer, current, ef=1)
```
**Phase 2: Beam Search (layer 0)**
```python
candidates = priority_queue()  # min-heap: closest candidate first
results = priority_queue()     # max-heap: worst result on top
visited = {entry_point}

while candidates:
    current = candidates.pop_closest()
    if current is farther than worst result and len(results) >= ef:
        break  # no remaining candidate can improve the result set
    for neighbor in get_neighbors(current):
        if neighbor in visited:
            continue
        visited.add(neighbor)
        if len(results) < ef or neighbor is closer than worst result:
            candidates.push(neighbor)
            results.push(neighbor)
            if len(results) > ef:
                results.pop_worst()
return results.top_k(k)
```
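A self-contained version of the layer-0 beam search can be run on a toy graph of 1-D points (a sketch using `heapq`; the real implementation operates on the adjacency lists shown earlier):

```python
import heapq

def beam_search(graph, coords, entry, query, ef):
    """graph: node -> neighbor list; coords: node -> 1-D position."""
    dist = lambda n: abs(coords[n] - query)
    visited = {entry}
    candidates = [(dist(entry), entry)]   # min-heap: closest candidate first
    results = [(-dist(entry), entry)]     # max-heap: worst result on top
    while candidates:
        d, current = heapq.heappop(candidates)
        if d > -results[0][0] and len(results) >= ef:
            break  # no remaining candidate can improve the result set
        for nb in graph[current]:
            if nb in visited:
                continue
            visited.add(nb)
            if len(results) < ef or dist(nb) < -results[0][0]:
                heapq.heappush(candidates, (dist(nb), nb))
                heapq.heappush(results, (-dist(nb), nb))
                if len(results) > ef:
                    heapq.heappop(results)
    return sorted((-d, n) for d, n in results)  # (distance, node) pairs

# Toy chain graph 0-1-2-3-4 at positions 0.0 .. 4.0
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
coords = {0: 0.0, 1: 1.0, 2: 2.0, 3: 3.0, 4: 4.0}
found = beam_search(graph, coords, entry=0, query=3.2, ef=3)
```

Starting from entry node 0, the search walks down the chain and keeps the ef=3 nodes nearest to the query.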
### Complexity Analysis
- **Navigation phase**: O(M × log N): one greedy step per layer, each examining up to M neighbors
- **Beam search phase**: O(M × ef): explore up to M neighbors for each of the ef candidates
- **Total**: O(M × (log N + ef)) expected; the product form O(log N × M × ef) is a loose upper bound
With M=16, ef=50, the layer-0 beam search dominates for small indexes, and the navigation term grows only logarithmically with index size.
## Insertion Algorithm
### Steps
1. **Layer selection**: Draw a random top layer l, so that P(level ≥ l) = (1/M)^l
2. **Search phase**: Find entry points at each layer from top to l
3. **Connection phase**: For each layer from l down to 0:
- Search for ef_construction candidates
- Select M best using heuristic
- Connect bidirectionally
- Prune any neighbor whose connection list now exceeds its limit (M0 on layer 0, M elsewhere)
### Neighbor Selection Heuristic
Goal: Balance proximity and diversity
```python
def select_neighbors(query, candidates, M):
    selected = []
    for candidate in sorted(candidates, key=lambda c: distance(query, c)):
        # Redundant if some already-selected node is closer to the
        # candidate than the query is: that node already "covers" it
        is_redundant = any(
            distance(already_selected, candidate) < distance(query, candidate)
            for already_selected in selected
        )
        if not is_redundant:
            selected.append(candidate)
            if len(selected) == M:
                break
    return selected
```
This prevents local clustering and maintains navigability.
## SIMD Optimization
### NEON Implementation
ARM NEON processes 4 floats per instruction:
```rust
// Scalar: one multiply-accumulate per element
for i in 0..384 {
    sum += a[i] * b[i]; // 384 iterations
}

// NEON (schematic): 4 f32 lanes per instruction
for i in (0..384).step_by(4) {
    // vfmaq_f32 performs a fused multiply-add on 4 lanes at once
    sum_vec = vfmaq_f32(sum_vec, vld1q_f32(&a[i]), vld1q_f32(&b[i])); // 96 iterations
}
```
### Horizontal Sum
Reducing 4-wide NEON register to scalar:
```
[a, b, c, d]
↓ pairwise add
[a+b, c+d, -, -]
↓ pairwise add
[a+b+c+d, -, -, -]
↓ extract lane 0
a+b+c+d
```
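The accumulate-then-reduce pattern can be mimicked in plain Python to check that four independent lanes plus a pairwise reduction compute the same dot product (purely illustrative; the real code uses the NEON intrinsics above):

```python
def dot_simd_style(a, b):
    # Four independent accumulators play the role of one NEON register
    assert len(a) == len(b) and len(a) % 4 == 0
    acc = [0.0, 0.0, 0.0, 0.0]
    for i in range(0, len(a), 4):
        for lane in range(4):  # corresponds to one vfmaq_f32 in real code
            acc[lane] += a[i + lane] * b[i + lane]
    # Horizontal sum: pairwise add, pairwise add, extract
    s1 = [acc[0] + acc[1], acc[2] + acc[3]]
    return s1[0] + s1[1]

a = [float(i) for i in range(8)]
b = [1.0] * 8
```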
## ONNX Integration
### Model Loading
```rust
Session::builder()?
    .with_optimization_level(GraphOptimizationLevel::Level3)? // Graph fusion, constant folding
    .with_intra_threads(4)?                                   // Parallel ops within a run
    .commit_from_file(model_path)?
```
### Inference Pipeline
```
Input text
↓
Tokenize → [input_ids, attention_mask]
↓
Prepare tensors → ndarray::Array2<i64>
↓
Run model → last_hidden_state [batch, seq_len, 384]
↓
Mean pooling → [batch, 384]
↓
Normalize → unit vectors
```
### Mean Pooling
```python
def mean_pool(hidden_states, attention_mask):
    # hidden_states: (seq_len, dim); attention_mask: (seq_len,) of 0/1
    mask = attention_mask[:, None].astype(hidden_states.dtype)
    # Sum embeddings where mask == 1, then divide by the number of real tokens
    return (hidden_states * mask).sum(axis=0) / mask.sum()
```
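A quick numeric check of masked mean pooling with NumPy: padded positions must not affect the result.

```python
import numpy as np

hidden = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [99.0, 99.0]])  # last row is a padding token
mask = np.array([1, 1, 0])

# Sum embeddings where mask == 1, divide by the number of real tokens
pooled = (hidden * mask[:, None]).sum(axis=0) / mask.sum()
```

Only the two real tokens contribute, so the pooled vector is the mean of [1, 2] and [3, 4] regardless of what the padding row contains.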
## Persistence
### Serialization Format
```
File structure:
┌─────────────────┐
│ Magic number │ 4 bytes
│ Version │ 4 bytes
├─────────────────┤
│ Metadata │
│ - m_max │
│ - ef_const │
│ - entry_point │
│ - max_layer │
├─────────────────┤
│ Vectors │
│ - count │
│ - dimension │
│ - data [][] │
├─────────────────┤
│ Layers │
│ - count │
│ - neighbors[][]│
└─────────────────┘
```
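The header portion of such a layout can be sketched with Python's `struct` module (the field order follows the diagram above, but the magic value and exact field widths are illustrative, not the library's actual on-disk constants):

```python
import struct

MAGIC = b"HNSW"  # hypothetical 4-byte magic value
# magic, version, m_max, ef_const, entry_point (u64), max_layer
HEADER = struct.Struct("<4sIIIQI")

def write_header(version, m_max, ef_const, entry_point, max_layer):
    return HEADER.pack(MAGIC, version, m_max, ef_const, entry_point, max_layer)

def read_header(buf):
    magic, *fields = HEADER.unpack_from(buf, 0)
    if magic != MAGIC:
        raise ValueError("not an index file")
    return tuple(fields)

blob = write_header(1, 16, 200, 42, 3)
```

Checking the magic number before anything else lets the loader fail fast on files that were never an index.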
### Memory Mapping
```
load_index_mmap():
    file = open(path)
    mmap = Mmap::map(file)  // OS virtual-memory mapping
    deserialize(&mmap)      // No copy up front: reads go straight to the mapped pages
```
Benefits:
- Instant startup (no deserialization time)
- OS manages paging (only load what's needed)
- Shared memory across processes
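The same load path can be illustrated in Python's `mmap` module (a sketch of the mechanism only; the file layout here is a toy, not the index format above):

```python
import mmap
import os
import struct
import tempfile

# Write a small "index" file: a u32 count followed by that many f32 values
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(struct.pack("<I3f", 3, 1.0, 2.0, 3.0))
    path = f.name

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Reads hit the mapped pages directly; nothing is copied up front
    count = struct.unpack_from("<I", mm, 0)[0]
    values = struct.unpack_from(f"<{count}f", mm, 4)
    mm.close()
os.unlink(path)
```

`unpack_from` reads straight out of the mapping without first materializing the file in a buffer, which is what makes startup effectively instant for large indexes.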
## Concurrency Model
### Read-Heavy Workload
```rust
Arc<RwLock<T>> // Multiple readers, single writer
```

During search, multiple threads read simultaneously; during an insert, one thread holds the write lock and other threads block.
### Lock Granularity
- **Fine-grained**: One lock per layer
- **Benefit**: Insertions at different layers don't block each other
- **Trade-off**: More complex but better concurrency
## Performance Characteristics
### Time Complexity
| Operation | Complexity | Typical latency |
|-----------|------------|-----------------|
| Insert | O(log N × M × ef_c) | ~90–150 µs |
| Search | O(log N × M × ef) | ~12–35 µs |
| Embedding | O(seq_len × d²) | ~18 ms |
### Space Complexity
Per node:
- Vector: 384 floats × 4 bytes = 1,536 bytes
- Layer 0 edges: 32 usize × 8 bytes = 256 bytes
- Layer 1+ edges: ~16 usize × 8 bytes = 128 bytes (if present)
Total for 1M vectors: ~1.8 GB
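The per-node arithmetic above can be reproduced directly:

```python
DIM, M = 384, 16
vector_bytes = DIM * 4      # 384 f32 values at 4 bytes each
layer0_bytes = (2 * M) * 8  # M0 = 32 neighbor ids stored as usize
upper_bytes = M * 8         # per upper layer, present for only ~1/M of nodes

per_node = vector_bytes + layer0_bytes      # dominant per-node cost
total_gb_1m = per_node * 1_000_000 / 1e9    # upper-layer edges add only a few MB
```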
## Optimization Opportunities
### Future Work
1. **Quantization**: Store vectors as INT8 → 4x smaller
2. **Product Quantization**: Further compression with minimal recall loss
3. **GPU acceleration**: ONNX with Metal backend on Apple Silicon
4. **Parallel insertion**: Lock-free techniques for concurrent builds
5. **Filtered search**: Metadata filtering during search
6. **Dynamic updates**: Efficient deletion and modification
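A sketch of the first item, symmetric INT8 quantization, with NumPy (illustrative only; production systems often calibrate the scale per dimension or per subspace rather than per vector):

```python
import numpy as np

def quantize_int8(v):
    # Symmetric per-vector scale: map the largest magnitude to 127
    scale = float(np.abs(v).max()) / 127.0
    q = np.clip(np.round(v / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
v = rng.standard_normal(384).astype(np.float32)
q, scale = quantize_int8(v)
max_err = float(np.abs(dequantize(q, scale) - v).max())
# 384 bytes per vector instead of 1,536, at the cost of small rounding error
```

Rounding to the nearest step bounds the per-component error by half the scale, which is what keeps recall loss small in practice.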
## References
- [HNSW Paper](https://arxiv.org/abs/1603.09320)
- [FAISS Implementation](https://github.com/facebookresearch/faiss)
- [ARM NEON Intrinsics](https://developer.arm.com/architectures/instruction-sets/intrinsics/)