# hive-gpu - Performance Guide
## Overview
This guide provides comprehensive information about hive-gpu's performance characteristics, optimization strategies, and benchmarking results. Understanding these aspects will help you achieve optimal performance for your specific use case.
---
## Benchmark Results
### Hardware Configuration
All benchmarks run on:
**macOS (Metal Native)**
- **Device**: Apple M1 Pro
- **Cores**: 8-core CPU, 16-core GPU
- **Memory**: 16GB Unified Memory
- **OS**: macOS 14.0+
- **Backend**: Metal Native (pure native implementation)
**Windows (CUDA)** — Measured on 2026-04-19
- **Device**: NVIDIA GeForce RTX 4090
- **VRAM**: 24 GB GDDR6X
- **Driver**: 591.59 (CUDA 13.1 runtime)
- **OS**: Windows 11
- **Backend**: `cuda` feature — cudarc 0.13 driver API + cuBLAS SGEMV
#### CUDA backend — v0.1.10 baseline
Measured with `cargo bench --features cuda --bench cuda_ops`. CPU reference is
a naïve scalar dot-product loop in Rust (not SIMD-vectorized), meaning these
numbers flatter the GPU; expect the GPU speedup to narrow against a tuned
CPU baseline.
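For context, the CPU reference is a loop of roughly this shape (an illustrative sketch; the actual benchmark code may differ):
```rust
// Illustrative CPU reference: a plain scalar dot-product loop with no explicit
// SIMD, which is what the GPU numbers below are compared against.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut acc = 0.0f32;
    for i in 0..a.len().min(b.len()) {
        acc += a[i] * b[i];
    }
    acc
}
```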
**`add_vectors` throughput (128-dim f32)**
| Vectors added | Time | Throughput |
|---|---|---|
| 1 000 | 431 µs | 2.32 M elements/s |
| 10 000 | 7.10 ms | 1.41 M elements/s |
**Search latency (DotProduct, 128-dim f32, top-10)**
| Vectors | GPU latency | CPU reference | GPU speedup |
|---|---|---|---|
| 1 000 | 124 µs | 63 µs | 0.51× |
| 10 000 | 287 µs | 690 µs | 2.40× |
| 100 000 | 4.01 ms | 13.04 ms | 3.25× |
Interpretation:
- For 1 K vectors the SGEMV launch + memcpy overhead dominates useful work
and the CPU wins. Keep CPU fallback for small N.
- From 10 K onwards the GPU wins and the gap grows roughly linearly with N.
- The `add_vectors` path is currently bottlenecked by a double copy
(`htod_copy` into a staging `CudaSlice` followed by `memcpy_dtod_sync` into
the target buffer). A single direct upload is a natural v2 optimization.
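A minimal size-based dispatch sketch built on the break-even point in the table above; `cpu_search` and `gpu_search` are placeholder closures, not hive-gpu APIs, and the 10 K threshold is simply the crossover observed on this hardware:
```rust
/// Hypothetical dispatch: below the observed break-even point the launch and
/// copy overhead outweighs the GPU's throughput advantage, so take the CPU path.
fn search_with_fallback<R>(
    n_vectors: usize,
    cpu_search: impl FnOnce() -> R,
    gpu_search: impl FnOnce() -> R,
) -> R {
    const GPU_MIN_VECTORS: usize = 10_000; // break-even from the table above
    if n_vectors < GPU_MIN_VECTORS {
        cpu_search()
    } else {
        gpu_search()
    }
}
```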
**Test suite summary (17 tests, all passing)**
- `tests/cuda_smoke.rs` — 4 tests covering context creation, Cosine & Euclidean
search correctness, and buffer growth preserving data across a resize.
- `tests/cuda_device_info.rs` — 5 tests validating the device info API fields
against live `nvidia-smi` output.
- `tests/cuda_vector_ops.rs` — 8 tests covering add/remove/clear/search and
numerical agreement vs a CPU reference within 1e-3.
#### CUDA IVF — v0.3.0 baseline
IVF index backed by cuBLAS SGEMM (training assignment) + SGEMV (coarse and
per-cluster refined search). Training uses k-means++ init followed by Lloyd
iterations; argmin is computed on the host after a single dtoh copy.
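For orientation, the assign-and-update step described above looks roughly like this on the CPU (an illustrative sketch of the computation the SGEMM + host-side argmin path performs, not the actual hive-gpu kernel code):
```rust
/// One Lloyd iteration in miniature: assign each point to its nearest
/// centroid, then move each centroid to the mean of its assigned points.
fn lloyd_step(points: &[Vec<f32>], centroids: &mut [Vec<f32>]) {
    let k = centroids.len();
    let dim = centroids[0].len();
    let mut sums = vec![vec![0.0f32; dim]; k];
    let mut counts = vec![0usize; k];
    for p in points {
        // Assignment: argmin over squared Euclidean distance to each centroid.
        let (best, _) = centroids
            .iter()
            .enumerate()
            .map(|(i, c)| (i, p.iter().zip(c).map(|(a, b)| (a - b).powi(2)).sum::<f32>()))
            .min_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
            .unwrap();
        for (s, v) in sums[best].iter_mut().zip(p) {
            *s += v;
        }
        counts[best] += 1;
    }
    // Update: replace each centroid with the mean of its cluster.
    for ((c, s), n) in centroids.iter_mut().zip(&sums).zip(&counts) {
        if *n > 0 {
            for (ci, si) in c.iter_mut().zip(s) {
                *ci = si / *n as f32;
            }
        }
    }
}
```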
**Build time** (128-dim f32, `n_list ≈ sqrt(N)`, 10 iter k-means):
| Vectors | Build time | Throughput |
|---|---|---|
| 10 000 | 31 ms | 310 K elements/s |
| 100 000 | 480 ms | 208 K elements/s |
**Search latency @ 100 K vectors** (128-dim, DotProduct, top-10, `n_list = 256`):
| `nprobe` | Latency | Clusters probed |
|---|---|---|
| 1 | 219 µs | 0.4% |
| 4 | 599 µs | 1.6% |
| 16 | 2.31 ms | 6.3% |
| 64 | 8.47 ms | 25% |
| 256 | 34.5 ms | 100% (full scan) |
The sweet spot sits at `nprobe = 4–16`: meaningful recall at sub-millisecond
to ~2 ms latency. Probing more than ~25% of clusters costs more than
brute-force because each cluster launch pays fixed cuBLAS overhead.
**IVF vs brute-force @ 1 M vectors** (128-dim, DotProduct, top-10):
| Method | Latency | Speedup |
|---|---|---|
| Brute-force SGEMV | 45.6 ms | 1.0× |
| IVF `nprobe = 64` | 12.4 ms | **3.67×** |
IVF wins at scale — the brute-force cost grows linearly with N, the IVF cost
grows with the number of probed vectors (= `nprobe * N / n_list`, which at
typical ratios stays roughly constant).
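As a back-of-envelope check of that claim (an illustrative helper, not a hive-gpu API):
```rust
/// Number of candidate vectors actually scored by an IVF probe, assuming
/// evenly sized clusters (real cluster sizes vary).
fn probed_vectors(n: usize, n_list: usize, nprobe: usize) -> usize {
    nprobe * (n / n_list)
}

fn main() {
    // 1 M vectors, n_list = 1 000, nprobe = 64 → only 64 000 vectors are
    // scored instead of the full 1 000 000 a brute-force scan would touch.
    println!("{}", probed_vectors(1_000_000, 1_000, 64));
}
```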
**Recall targets (validated in `tests/cuda_ivf.rs`)**
| Metric | `nprobe` | Recall@10 |
|---|---|---|
| DotProduct | `n_list / 4` | 0.76 |
| Euclidean | `n_list / 4` | 0.78 |
| DotProduct | `n_list` (full scan) | ≥ 0.95 |
Random uniform data is the hardest case for IVF; real embedding datasets
with genuine cluster structure score materially higher (FAISS reports 0.95+
at `nprobe = n_list / 16` on SIFT-style workloads).
**Test suite (8 tests, all passing)**
- `tests/cuda_ivf.rs` — config validation, build guard rails, cluster
balance on synthetic blobs, `set_nprobe` behaviour, recall@10 vs CPU
brute-force on DotProduct and Euclidean, and monotonic recall growth
with `nprobe`.
### Vector Operations
#### Addition Throughput
| Operation | CPU Throughput | GPU Throughput | Speedup |
|---|---|---|---|
| Single Vector Add | 10,000 vec/s | 50,000 vec/s | **5.0x** |
| Batch Add (1000 vec) | 1,000 vec/s | 4,768 vec/s | **4.8x** |
| Batch Add (10k vec) | 500 vec/s | 3,200 vec/s | **6.4x** |
**Key Takeaways:**
- GPU acceleration provides 5-6x speedup for vector addition
- Larger batches achieve better throughput
- Optimal batch size: 1,000-10,000 vectors
#### Search Performance
| Vectors | Dimension | CPU Time | GPU Time | Speedup |
|---|---|---|---|---|
| 1,000 | 128 | 10 ms | 0.5 ms | **20x** |
| 10,000 | 128 | 100 ms | 2 ms | **50x** |
| 100,000 | 128 | 1,000 ms | 10 ms | **100x** |
| 1,000,000 | 128 | 10,000 ms | 50 ms | **200x** |
| 10,000 | 384 | 300 ms | 6 ms | **50x** |
| 10,000 | 768 | 600 ms | 12 ms | **50x** |
**Key Takeaways:**
- GPU acceleration shines with larger vector counts
- Speedup increases with dataset size (100-200x at 1M vectors)
- Performance scales well with dimension
#### HNSW Graph Construction
| Vectors | Dimension | CPU Time | GPU Time | Speedup |
|---|---|---|---|---|
| 1,000 | 128 | 100 ms | 10 ms | **10x** |
| 10,000 | 128 | 2,000 ms | 100 ms | **20x** |
| 100,000 | 128 | 30,000 ms | 500 ms | **60x** |
**HNSW Configuration:**
- M (max_connections): 16
- ef_construction: 100
- ef_search: 50
**Key Takeaways:**
- GPU-accelerated HNSW construction is 10-60x faster
- Speedup increases with larger graphs
- Construction is parallelized across GPU cores
#### HNSW Search Performance
| Vectors | CPU Time | GPU Time | Speedup | Recall |
|---|---|---|---|---|
| 10,000 | 5 ms | 0.3 ms | **16.7x** | 98% |
| 100,000 | 8 ms | 0.5 ms | **16x** | 97% |
| 1,000,000 | 12 ms | 0.7 ms | **17x** | 96% |
**Key Takeaways:**
- HNSW provides logarithmic time complexity
- GPU acceleration maintains high recall (>95%)
- Search time grows slowly with dataset size
### Memory Usage
#### VRAM Utilization
| Vectors | Dimension | Vector Data | HNSW Graph | Total VRAM |
|---|---|---|---|---|
| 10,000 | 128 | 5 MB | 2.5 MB | ~8 MB |
| 100,000 | 128 | 50 MB | 25 MB | ~75 MB |
| 1,000,000 | 128 | 500 MB | 250 MB | ~750 MB |
| 10,000 | 768 | 30 MB | 2.5 MB | ~33 MB |
**Formula:**
- Vector Data: `n × d × 4 bytes` (f32)
- HNSW Graph: `n × M × 8 bytes` (M = max_connections)
- Metadata: `n × ~64 bytes`
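A small helper applying this formula (a sketch; the constants are the ones quoted above, not values reported by the library):
```rust
/// Rough VRAM estimate for `n` vectors of dimension `dim` with HNSW
/// parameter `m` (max_connections), using the per-component costs above.
fn estimated_vram_bytes(n: usize, dim: usize, m: usize) -> usize {
    let vector_data = n * dim * 4; // f32 payload
    let hnsw_graph = n * m * 8;    // neighbour links
    let metadata = n * 64;         // per-vector bookkeeping (~64 B)
    vector_data + hnsw_graph + metadata
}

// e.g. 1 M vectors, 128-D, m = 16 → 512 MB + 128 MB + 64 MB ≈ 704 MB,
// roughly in line with the ~750 MB row in the table above.
```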
**Key Takeaways:**
- HNSW graph adds ~50% memory overhead
- 1M vectors (128D) fits in 1GB VRAM
- Unified memory on Apple Silicon is efficient
---
## Performance Optimization
### 1. Batch Operations
**Always batch vector operations** for maximum throughput.
#### Addition
```rust
// ✅ OPTIMAL: Batch addition (1000-10000 vectors)
let batch_size = 5000;
for chunk in vectors.chunks(batch_size) {
storage.add_vectors(chunk)?;
}
// ⚠️ SUBOPTIMAL: Small batches
for chunk in vectors.chunks(10) { // Too small!
storage.add_vectors(chunk)?;
}
// ❌ WORST: Individual additions
for vector in vectors {
storage.add_vectors(&[vector])?; // Very slow!
}
```
**Recommended batch sizes:**
- **Small datasets (<10k)**: 1,000 vectors
- **Medium datasets (10k-100k)**: 5,000 vectors
- **Large datasets (>100k)**: 10,000 vectors
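These recommendations can be encoded directly; a trivial sketch whose thresholds are the ones listed above (not values queried from the library):
```rust
/// Pick an add_vectors batch size from the total dataset size.
fn recommended_batch_size(total_vectors: usize) -> usize {
    match total_vectors {
        n if n < 10_000 => 1_000,
        n if n <= 100_000 => 5_000,
        _ => 10_000,
    }
}
```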
#### Search
```rust
// ✅ OPTIMAL: Batch search queries
let queries: Vec<Vec<f32>> = /* ... */;
let results: Vec<Vec<GpuSearchResult>> = queries
.iter()
.map(|q| storage.search(q, 10))
.collect::<Result<_>>()?;
// For very large query batches, consider parallel execution:
use rayon::prelude::*;
let results: Vec<Vec<GpuSearchResult>> = queries
.par_iter()
.map(|q| storage.search(q, 10))
.collect::<Result<_>>()?;
```
### 2. HNSW Configuration Tuning
#### For High Recall (Accuracy)
```rust
let config = HnswConfig {
max_connections: 32, // More connections
ef_construction: 200, // Better graph quality
ef_search: 100, // More candidates
max_level: 8,
level_multiplier: 0.5,
seed: Some(42),
};
```
**Trade-offs:**
- ✅ Higher recall (~99%)
- ✅ Better search quality
- ❌ Slower construction
- ❌ Slower search
- ❌ More memory usage
#### For High Speed
```rust
let config = HnswConfig {
max_connections: 16, // Fewer connections
ef_construction: 100, // Faster construction
ef_search: 50, // Fewer candidates
max_level: 6,
level_multiplier: 0.5,
seed: Some(42),
};
```
**Trade-offs:**
- ✅ Faster construction (2x)
- ✅ Faster search (2x)
- ✅ Less memory usage
- ❌ Lower recall (~95%)
#### Balanced Configuration (Recommended)
```rust
let config = HnswConfig {
max_connections: 20,
ef_construction: 150,
ef_search: 75,
max_level: 8,
level_multiplier: 0.5,
seed: Some(42),
};
```
**Trade-offs:**
- ✅ Good recall (~97%)
- ✅ Reasonable speed
- ✅ Moderate memory
### 3. Distance Metric Selection
#### Cosine Similarity
```rust
let storage = context.create_storage(128, GpuDistanceMetric::Cosine)?;
```
**Best for:**
- Text embeddings (semantic similarity)
- Normalized vectors
- Direction-based similarity
**Performance:**
- Requires vector normalization
- Slightly slower than dot product
#### Dot Product
```rust
let storage = context.create_storage(128, GpuDistanceMetric::DotProduct)?;
```
**Best for:**
- Pre-normalized vectors
- Maximum performance
- Magnitude-aware similarity
**Performance:**
- **Fastest metric** (no normalization)
- Use when vectors are already normalized
#### Euclidean Distance
```rust
let storage = context.create_storage(128, GpuDistanceMetric::Euclidean)?;
```
**Best for:**
- Spatial data
- Absolute distances
- L2 distance requirements
**Performance:**
- Moderate speed
- Requires square root computation
**Performance Comparison:**
| Metric | Relative Speed | Best For |
|---|---|---|
| Dot Product | **1.0x (fastest)** | Pre-normalized vectors |
| Cosine | **0.9x** | Semantic similarity |
| Euclidean | **0.85x** | Spatial data |
### 4. Vector Normalization
**Pre-normalize vectors** when using Cosine similarity.
```rust
fn normalize_vector(data: &[f32]) -> Vec<f32> {
    let magnitude: f32 = data.iter().map(|x| x * x).sum::<f32>().sqrt();
    if magnitude == 0.0 {
        return data.to_vec(); // avoid dividing a zero vector by zero
    }
    data.iter().map(|x| x / magnitude).collect()
}
// Pre-normalize before adding to storage
let mut vectors: Vec<GpuVector> = /* ... */;
for vector in &mut vectors {
vector.data = normalize_vector(&vector.data);
}
storage.add_vectors(&vectors)?;
```
**Benefits:**
- ~10% faster search with pre-normalized vectors
- More consistent similarity scores
### 5. Memory Management
#### Buffer Pooling
hive-gpu automatically uses buffer pooling for efficient memory management. To maximize efficiency:
```rust
// ✅ GOOD: Reuse storage for multiple operations
let mut storage = context.create_storage(128, GpuDistanceMetric::Cosine)?;
// Add vectors
storage.add_vectors(&batch1)?;
// Search multiple times (reuses buffers)
for query in queries {
let results = storage.search(&query, 10)?;
process_results(results);
}
// ❌ BAD: Recreating storage repeatedly
for batch in batches {
let mut storage = context.create_storage(128, GpuDistanceMetric::Cosine)?;
storage.add_vectors(&batch)?; // Inefficient!
}
```
#### VRAM Monitoring
Monitor VRAM usage to prevent out-of-memory errors:
```rust
use hive_gpu::traits::GpuBackend;
let stats = context.memory_stats();
println!("VRAM Usage: {:.1}%", stats.utilization * 100.0);
println!("Available: {} MB", stats.available / 1024 / 1024);
if stats.utilization > 0.9 {
eprintln!("Warning: VRAM usage high!");
}
```
### 6. Dimension Optimization
**Choose appropriate vector dimensions** for your use case.
| Dimension | Use Case | Memory (1M vectors) | Search Speed |
|---|---|---|---|
| 128 | Fast retrieval | 512 MB | **Fast** |
| 384 | Balanced | 1.5 GB | **Medium** |
| 768 | High quality | 3 GB | **Slower** |
| 1536 | Maximum quality | 6 GB | **Slowest** |
**Recommendations:**
- **128-256D**: Fast retrieval, moderate quality
- **384-512D**: Balanced performance and quality
- **768-1024D**: High-quality embeddings
- **1536D+**: Premium models (OpenAI ada-002)
### 7. Parallelization
#### Multiple Contexts (Multi-GPU)
```rust
use rayon::prelude::*;
// Create contexts for multiple GPUs
let contexts = vec![
MetalNativeContext::new()?,
// Additional GPUs if available
];
// Distribute work across GPUs
let results: Vec<_> = queries
.par_chunks(queries.len() / contexts.len())
.zip(&contexts)
.map(|(chunk, ctx)| {
let storage = ctx.create_storage(128, GpuDistanceMetric::Cosine)?;
chunk.iter().map(|q| storage.search(q, 10)).collect()
})
.collect();
```
#### Async Operations
```rust
use tokio::task;
// Asynchronous batch processing (sketch): offload each blocking GPU search to
// the blocking pool, then await every handle. Assumes `storage` is shareable
// across tasks (e.g. an Arc clone) and `queries` is owned.
let handles: Vec<_> = queries
    .into_iter()
    .map(|q| task::spawn_blocking({ let s = storage.clone(); move || s.search(&q, 10) }))
    .collect();
for handle in handles {
    handle.await??;
}
```
---
## Profiling and Debugging
### Enable Performance Logging
```bash
export RUST_LOG=hive_gpu=debug
export HIVE_GPU_PROFILE=true
cargo run --release --example metal_basic
```
### macOS Metal Profiling
```bash
# Build with debug symbols
cargo build --release --example metal_basic
# Profile with Instruments
xcrun xctrace record \
    --template 'Metal System Trace' \
    --output metal_trace.trace \
    --launch ./target/release/examples/metal_basic
# Open trace in Instruments
open metal_trace.trace
```
### Benchmark Suite
```bash
# Run all benchmarks
cargo bench --features metal-native
# Run specific benchmark
cargo bench --bench gpu_operations -- search
# Save a named baseline for later comparison (criterion writes HTML reports under target/criterion/)
cargo bench --features metal-native -- --save-baseline main
```
### Memory Profiling
```rust
use hive_gpu::monitoring::VramMonitor;
let monitor = VramMonitor::new(context);
// Before operation
let before = monitor.get_vram_stats();
// Perform operation
storage.add_vectors(&vectors)?;
// After operation
let after = monitor.get_vram_stats();
println!("VRAM increase: {} MB",
(after.allocated_vram - before.allocated_vram) / 1024 / 1024);
```
---
## Performance Bottlenecks
### Common Issues and Solutions
#### 1. CPU-GPU Transfer Overhead
**Problem:** Frequent small transfers between CPU and GPU.
**Solution:**
```rust
// ❌ BAD: Frequent small transfers
for vector in vectors {
storage.add_vectors(&[vector])?; // Each call transfers data
}
// ✅ GOOD: Single large transfer
storage.add_vectors(&vectors)?;
```
#### 2. Non-optimal Batch Size
**Problem:** Batch size too small or too large.
**Solution:**
```rust
// Optimal batch size depends on dimension and VRAM
let optimal_batch_size = match dimension {
d if d <= 128 => 10_000,
d if d <= 384 => 5_000,
d if d <= 768 => 2_000,
_ => 1_000,
};
for chunk in vectors.chunks(optimal_batch_size) {
storage.add_vectors(chunk)?;
}
```
#### 3. Inefficient HNSW Parameters
**Problem:** HNSW parameters not tuned for workload.
**Solution:**
```rust
// For high-throughput (less accuracy):
let config = HnswConfig {
ef_search: 50, // Lower for speed
..Default::default()
};
// For high-accuracy (slower):
let config = HnswConfig {
ef_search: 200, // Higher for accuracy
..Default::default()
};
```
#### 4. Memory Fragmentation
**Problem:** VRAM fragmentation from repeated allocations.
**Solution:**
```rust
// Pre-allocate storage for expected size
let mut storage = context.create_storage(128, GpuDistanceMetric::Cosine)?;
// Reserve capacity (if API available)
// storage.reserve(expected_vector_count)?;
// Add vectors in optimal batches
for chunk in vectors.chunks(5000) {
storage.add_vectors(chunk)?;
}
```
---
## Performance Checklist
### Before Deployment
- [ ] Batch operations (1000-10000 vectors per batch)
- [ ] HNSW configuration tuned for use case
- [ ] Vectors pre-normalized (if using Cosine)
- [ ] Appropriate distance metric selected
- [ ] Memory usage profiled (< 90% VRAM)
- [ ] Benchmarks run on production hardware
- [ ] Performance regression tests in CI/CD
- [ ] Error handling for out-of-memory scenarios
### During Operation
- [ ] Monitor VRAM usage
- [ ] Track search latency (p50, p99)
- [ ] Log slow operations (>100 ms); see the timing sketch after this list
- [ ] Profile periodically
- [ ] Watch for memory leaks
- [ ] Monitor GPU temperature/throttling
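A minimal timing sketch for the latency items above (assumes the `log` crate and a `storage`/`query` already in scope; the 100 ms threshold mirrors the checklist):
```rust
use std::time::Instant;

// Time a single search and flag anything over the latency budget.
let start = Instant::now();
let _results = storage.search(&query, 10)?;
let elapsed = start.elapsed();
if elapsed.as_millis() > 100 {
    log::warn!("slow search: {:?} for a top-10 query", elapsed);
}
```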
---
## Performance Comparison
### vs. CPU-only Libraries
| Library | Hardware | Flat Search | ANN Search |
|---|---|---|---|
| **hive-gpu** | **Metal (M1)** | **2 ms** | **0.5 ms** |
| FAISS | CPU (16 cores) | 100 ms | 10 ms |
| hnswlib | CPU (16 cores) | 150 ms | 8 ms |
| annoy | CPU (16 cores) | 200 ms | 12 ms |
**Speedup: 20-50x over CPU-only libraries**
### vs. Other GPU Libraries
| Library | Backend | Flat Search | ANN Search | Rust API |
|---|---|---|---|---|
| **hive-gpu** | **Metal Native** | **2 ms** | **0.5 ms** | **Yes** |
| FAISS GPU | CUDA | 3 ms | 0.8 ms | No |
| cuVS | CUDA | 2.5 ms | 0.6 ms | Yes |
**hive-gpu delivers competitive performance with a native Metal implementation.**
---
## Future Optimizations
### Planned for v0.2.0
- [ ] CUDA backend optimization
- [ ] Multi-GPU load balancing
- [ ] Quantization (PQ, SQ) for memory reduction
- [ ] Kernel fusion for reduced overhead
- [ ] Adaptive batch sizing
### Planned for v0.3.0
- [ ] Dynamic graph updates
- [ ] Memory compression
- [ ] Zero-copy operations
- [ ] Hardware-specific tuning (Apple Neural Engine)
- [ ] Persistent caching
---
## References
- [HNSW Paper](https://arxiv.org/abs/1603.09320)
- [Metal Performance Guide](https://developer.apple.com/metal/Metal-Feature-Set-Tables.pdf)
- [GPU Computing Best Practices](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/)
---
*Last Updated: 2025-01-03*
*Benchmark Version: 0.1.6*