# hive-gpu - Performance Guide
## Overview
This guide provides comprehensive information about hive-gpu's performance characteristics, optimization strategies, and benchmarking results. Understanding these aspects will help you achieve optimal performance for your specific use case.
---
## Benchmark Results
### Hardware Configuration
All benchmarks run on:
**macOS (Metal Native)**
- **Device**: Apple M1 Pro
- **Cores**: 8-core CPU, 16-core GPU
- **Memory**: 16GB Unified Memory
- **OS**: macOS 14.0+
- **Backend**: Metal Native (pure native implementation)
**Windows (CUDA)** — Measured on 2026-04-19
- **Device**: NVIDIA GeForce RTX 4090
- **VRAM**: 24 GB GDDR6X
- **Driver**: 591.59 (CUDA 13.1 runtime)
- **OS**: Windows 11
- **Backend**: `cuda` feature — cudarc 0.13 driver API + cuBLAS SGEMV
#### CUDA backend — v0.1.10 baseline
Measured with `cargo bench --features cuda --bench cuda_ops`. CPU reference is
a naïve scalar dot-product loop in Rust (not SIMD-vectorized), meaning these
numbers flatter the GPU; expect the GPU speedup to narrow against a tuned
CPU baseline.
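For context, the CPU reference is a loop of roughly this shape (an illustrative sketch; the actual benchmark code may differ):
```rust
// Illustrative CPU reference: a plain scalar dot-product loop with no explicit
// SIMD, which is what the GPU numbers below are compared against.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut acc = 0.0f32;
    for i in 0..a.len().min(b.len()) {
        acc += a[i] * b[i];
    }
    acc
}
```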
**`add_vectors` throughput (128-dim f32)**
| Vectors added | Time | Throughput |
|---|---|---|
| 1 000 | 431 µs | 2.32 M elements/s |
| 10 000 | 7.10 ms | 1.41 M elements/s |
**Search latency (DotProduct, 128-dim f32, top-10)**
| Vectors | GPU latency | CPU reference | GPU speedup |
|---|---|---|---|
| 1 000 | 124 µs | 63 µs | 0.51× |
| 10 000 | 287 µs | 690 µs | 2.40× |
| 100 000 | 4.01 ms | 13.04 ms | 3.25× |
Interpretation:
- For 1 K vectors the SGEMV launch + memcpy overhead dominates useful work
and the CPU wins. Keep CPU fallback for small N.
- From 10 K onwards the GPU wins and the gap grows roughly linearly with N.
- The `add_vectors` path is currently bottlenecked by a double copy
(`htod_copy` into a staging `CudaSlice` followed by `memcpy_dtod_sync` into
the target buffer). A single direct upload is a natural v2 optimization.
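A minimal size-based dispatch sketch built on the break-even point in the table above; `cpu_search` and `gpu_search` are placeholder closures, not hive-gpu APIs, and the 10 K threshold is simply the crossover observed on this hardware:
```rust
/// Hypothetical dispatch: below the observed break-even point the launch and
/// copy overhead outweighs the GPU's throughput advantage, so take the CPU path.
fn search_with_fallback<R>(
    n_vectors: usize,
    cpu_search: impl FnOnce() -> R,
    gpu_search: impl FnOnce() -> R,
) -> R {
    const GPU_MIN_VECTORS: usize = 10_000; // break-even from the table above
    if n_vectors < GPU_MIN_VECTORS {
        cpu_search()
    } else {
        gpu_search()
    }
}
```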
**Test suite summary (17 tests, all passing)**
- `tests/cuda_smoke.rs` — 4 tests covering context creation, Cosine & Euclidean
search correctness, and buffer growth preserving data across a resize.
- `tests/cuda_device_info.rs` — 5 tests validating the device info API fields
against live `nvidia-smi` output.
- `tests/cuda_vector_ops.rs` — 8 tests covering add/remove/clear/search and
numerical agreement vs a CPU reference within 1e-3.
#### CUDA IVF — v0.3.0 baseline
IVF index backed by cuBLAS SGEMM (training assignment) + SGEMV (coarse and
per-cluster refined search). Training uses k-means++ init followed by Lloyd
iterations; argmin is computed on the host after a single dtoh copy.
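For orientation, the assign-and-update step described above looks roughly like this on the CPU (an illustrative sketch of the computation the SGEMM + host-side argmin path performs, not the actual hive-gpu kernel code):
```rust
/// One Lloyd iteration in miniature: assign each point to its nearest
/// centroid, then move each centroid to the mean of its assigned points.
fn lloyd_step(points: &[Vec<f32>], centroids: &mut [Vec<f32>]) {
    let k = centroids.len();
    let dim = centroids[0].len();
    let mut sums = vec![vec![0.0f32; dim]; k];
    let mut counts = vec![0usize; k];
    for p in points {
        // Assignment: argmin over squared Euclidean distance to each centroid.
        let (best, _) = centroids
            .iter()
            .enumerate()
            .map(|(i, c)| (i, p.iter().zip(c).map(|(a, b)| (a - b).powi(2)).sum::<f32>()))
            .min_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
            .unwrap();
        for (s, v) in sums[best].iter_mut().zip(p) {
            *s += v;
        }
        counts[best] += 1;
    }
    // Update: replace each centroid with the mean of its cluster.
    for ((c, s), n) in centroids.iter_mut().zip(&sums).zip(&counts) {
        if *n > 0 {
            for (ci, si) in c.iter_mut().zip(s) {
                *ci = si / *n as f32;
            }
        }
    }
}
```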
**Build time** (128-dim f32, `n_list ≈ sqrt(N)`, 10 iter k-means):
| Vectors | Build time | Throughput |
|---|---|---|
| 10 000 | 31 ms | 310 K elements/s |
| 100 000 | 480 ms | 208 K elements/s |
**Search latency @ 100 K vectors** (128-dim, DotProduct, top-10, `n_list = 256`):
| `nprobe` | Latency | Clusters probed |
|---|---|---|
| 1 | 219 µs | 0.4% |
| 4 | 599 µs | 1.6% |
| 16 | 2.31 ms | 6.3% |
| 64 | 8.47 ms | 25% |
| 256 | 34.5 ms | 100% (full scan) |
The sweet spot sits at `nprobe = 4–16`: meaningful recall at sub-millisecond
to ~2 ms latency. Probing more than ~25% of clusters costs more than
brute-force because each cluster launch pays fixed cuBLAS overhead.
**IVF vs brute-force @ 1 M vectors** (128-dim, DotProduct, top-10):
| Method | Latency | Speedup |
|---|---|---|
| Brute-force SGEMV | 45.6 ms | 1.0× |
| IVF `nprobe = 64` | 12.4 ms | **3.67×** |
IVF wins at scale — the brute-force cost grows linearly with N, the IVF cost
grows with the number of probed vectors (= `nprobe * N / n_list`, which at
typical ratios stays roughly constant).
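As a back-of-envelope check of that claim (an illustrative helper, not a hive-gpu API):
```rust
/// Number of candidate vectors actually scored by an IVF probe, assuming
/// evenly sized clusters (real cluster sizes vary).
fn probed_vectors(n: usize, n_list: usize, nprobe: usize) -> usize {
    nprobe * (n / n_list)
}

fn main() {
    // 1 M vectors, n_list = 1 000, nprobe = 64 → only 64 000 vectors are
    // scored instead of the full 1 000 000 a brute-force scan would touch.
    println!("{}", probed_vectors(1_000_000, 1_000, 64));
}
```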
**Recall targets (validated in `tests/cuda_ivf.rs`)**
| Metric | `nprobe` | Recall@10 |
|---|---|---|
| DotProduct | `n_list / 4` | 0.76 |
| Euclidean | `n_list / 4` | 0.78 |
| DotProduct | `n_list` (full scan) | ≥ 0.95 |
Random uniform data is the hardest case for IVF; real embedding datasets
with genuine cluster structure score materially higher (FAISS reports 0.95+
at `nprobe = n_list / 16` on SIFT-style workloads).
**Test suite (8 tests, all passing)**
- `tests/cuda_ivf.rs` — config validation, build guard rails, cluster
balance on synthetic blobs, `set_nprobe` behaviour, recall@10 vs CPU
brute-force on DotProduct and Euclidean, and monotonic recall growth
with `nprobe`.
### Vector Operations
#### Addition Throughput
| Operation | CPU Throughput | GPU Throughput | Speedup |
|---|---|---|---|
| Single Vector Add | 10,000 vec/s | 50,000 vec/s | **5.0x** |
| Batch Add (1000 vec) | 1,000 vec/s | 4,768 vec/s | **4.8x** |
| Batch Add (10k vec) | 500 vec/s | 3,200 vec/s | **6.4x** |
**Key Takeaways:**
- GPU acceleration provides 5-6x speedup for vector addition
- Larger batches achieve better throughput
- Optimal batch size: 1,000-10,000 vectors
#### Search Performance
| Vectors | Dimension | CPU Time | GPU Time | Speedup |
|---|---|---|---|---|
| 1,000 | 128 | 10 ms | 0.5 ms | **20x** |
| 10,000 | 128 | 100 ms | 2 ms | **50x** |
| 100,000 | 128 | 1,000 ms | 10 ms | **100x** |
| 1,000,000 | 128 | 10,000 ms | 50 ms | **200x** |
| 10,000 | 384 | 300 ms | 6 ms | **50x** |
| 10,000 | 768 | 600 ms | 12 ms | **50x** |
**Key Takeaways:**
- GPU acceleration shines with larger vector counts
- Speedup increases with dataset size (100-200x at 1M vectors)
- Performance scales well with dimension
#### HNSW Graph Construction
| Vectors | Dimension | CPU Time | GPU Time | Speedup |
|---|---|---|---|---|
| 1,000 | 128 | 100 ms | 10 ms | **10x** |
| 10,000 | 128 | 2,000 ms | 100 ms | **20x** |
| 100,000 | 128 | 30,000 ms | 500 ms | **60x** |
**HNSW Configuration:**
- M (max_connections): 16
- ef_construction: 100
- ef_search: 50
**Key Takeaways:**
- GPU-accelerated HNSW construction is 10-60x faster
- Speedup increases with larger graphs
- Construction is parallelized across GPU cores
#### HNSW Search Performance
| Vectors | CPU Time | GPU Time | Speedup | Recall |
|---|---|---|---|---|
| 10,000 | 5 ms | 0.3 ms | **16.7x** | 98% |
| 100,000 | 8 ms | 0.5 ms | **16x** | 97% |
| 1,000,000 | 12 ms | 0.7 ms | **17x** | 96% |
**Key Takeaways:**
- HNSW provides logarithmic time complexity
- GPU acceleration maintains high recall (>95%)
- Search time grows slowly with dataset size
### Memory Usage
#### VRAM Utilization
| Vectors | Dimension | Vector Data | HNSW Graph | Total VRAM |
|---|---|---|---|---|
| 10,000 | 128 | 5 MB | 2.5 MB | ~8 MB |
| 100,000 | 128 | 50 MB | 25 MB | ~75 MB |
| 1,000,000 | 128 | 500 MB | 250 MB | ~750 MB |
| 10,000 | 768 | 30 MB | 2.5 MB | ~33 MB |
**Formula:**
- Vector Data: `n × d × 4 bytes` (f32)
- HNSW Graph: `n × M × 8 bytes` (M = max_connections)
- Metadata: `n × ~64 bytes`
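A small helper applying this formula (a sketch; the constants are the ones quoted above, not values reported by the library):
```rust
/// Rough VRAM estimate for `n` vectors of dimension `dim` with HNSW
/// parameter `m` (max_connections), using the per-component costs above.
fn estimated_vram_bytes(n: usize, dim: usize, m: usize) -> usize {
    let vector_data = n * dim * 4; // f32 payload
    let hnsw_graph = n * m * 8;    // neighbour links
    let metadata = n * 64;         // per-vector bookkeeping (~64 B)
    vector_data + hnsw_graph + metadata
}

// e.g. 1 M vectors, 128-D, m = 16 → 512 MB + 128 MB + 64 MB ≈ 704 MB,
// roughly in line with the ~750 MB row in the table above.
```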
**Key Takeaways:**
- HNSW graph adds ~50% memory overhead
- 1M vectors (128D) fits in 1GB VRAM
- Unified memory on Apple Silicon is efficient
---
## Performance Optimization
### 1. Batch Operations
**Always batch vector operations** for maximum throughput.
#### Addition
```rust
// ✅ OPTIMAL: Batch addition (1000-10000 vectors)
let batch_size = 5000;
for chunk in vectors.chunks(batch_size) {
storage.add_vectors(chunk)?;
}
// ⚠️ SUBOPTIMAL: Small batches
for chunk in vectors.chunks(10) { // Too small!
storage.add_vectors(chunk)?;
}
// ❌ WORST: Individual additions
for vector in vectors {
storage.add_vectors(&[vector])?; // Very slow!
}
```
**Recommended batch sizes:**
- **Small datasets (<10k)**: 1,000 vectors
- **Medium datasets (10k-100k)**: 5,000 vectors
- **Large datasets (>100k)**: 10,000 vectors
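These recommendations can be encoded directly; a trivial sketch whose thresholds are the ones listed above (not values queried from the library):
```rust
/// Pick an add_vectors batch size from the total dataset size.
fn recommended_batch_size(total_vectors: usize) -> usize {
    match total_vectors {
        n if n < 10_000 => 1_000,
        n if n <= 100_000 => 5_000,
        _ => 10_000,
    }
}
```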
#### Search
```rust
// ✅ OPTIMAL: Batch search queries
let queries: Vec<Vec<f32>> = /* ... */;
let results: Vec<Vec<GpuSearchResult>> = queries
.iter()
.map(|q| storage.search(q, 10))
.collect::<Result<_>>()?;
// For very large query batches, consider parallel execution:
use rayon::prelude::*;
let results: Vec<Vec<GpuSearchResult>> = queries
.par_iter()
.map(|q| storage.search(q, 10))
.collect::<Result<_>>()?;
```
### 2. HNSW Configuration Tuning
#### For High Recall (Accuracy)
```rust
let config = HnswConfig {
max_connections: 32, // More connections
ef_construction: 200, // Better graph quality
ef_search: 100, // More candidates
max_level: 8,
level_multiplier: 0.5,
seed: Some(42),
};
```
**Trade-offs:**
- ✅ Higher recall (~99%)
- ✅ Better search quality
- ❌ Slower construction
- ❌ Slower search
- ❌ More memory usage
#### For High Speed
```rust
let config = HnswConfig {
max_connections: 16, // Fewer connections
ef_construction: 100, // Faster construction
ef_search: 50, // Fewer candidates
max_level: 6,
level_multiplier: 0.5,
seed: Some(42),
};
```
**Trade-offs:**
- ✅ Faster construction (2x)
- ✅ Faster search (2x)
- ✅ Less memory usage
- ❌ Lower recall (~95%)
#### Balanced Configuration (Recommended)
```rust
let config = HnswConfig {
max_connections: 20,
ef_construction: 150,
ef_search: 75,
max_level: 8,
level_multiplier: 0.5,
seed: Some(42),
};
```
**Trade-offs:**
- ✅ Good recall (~97%)
- ✅ Reasonable speed
- ✅ Moderate memory
### 3. Distance Metric Selection
#### Cosine Similarity
```rust
let storage = context.create_storage(128, GpuDistanceMetric::Cosine)?;
```
**Best for:**
- Text embeddings (semantic similarity)
- Normalized vectors
- Direction-based similarity
**Performance:**
- Requires vector normalization
- Slightly slower than dot product
#### Dot Product
```rust
let storage = context.create_storage(128, GpuDistanceMetric::DotProduct)?;
```
**Best for:**
- Pre-normalized vectors
- Maximum performance
- Magnitude-aware similarity
**Performance:**
- **Fastest metric** (no normalization)
- Use when vectors are already normalized
#### Euclidean Distance
```rust
let storage = context.create_storage(128, GpuDistanceMetric::Euclidean)?;
```
**Best for:**
- Spatial data
- Absolute distances
- L2 distance requirements
**Performance:**
- Moderate speed
- Requires square root computation
**Performance Comparison:**
| Metric | Relative Speed | Best For |
|---|---|---|
| Dot Product | **1.0x (fastest)** | Pre-normalized vectors |
| Cosine | **0.9x** | Semantic similarity |
| Euclidean | **0.85x** | Spatial data |
### 4. Vector Normalization
**Pre-normalize vectors** when using Cosine similarity.
```rust
fn normalize_vector(data: &[f32]) -> Vec<f32> {
    let magnitude: f32 = data.iter().map(|x| x * x).sum::<f32>().sqrt();
    if magnitude == 0.0 {
        return data.to_vec(); // avoid dividing a zero vector by zero
    }
    data.iter().map(|x| x / magnitude).collect()
}
// Pre-normalize before adding to storage
let mut vectors: Vec<GpuVector> = /* ... */;
for vector in &mut vectors {
vector.data = normalize_vector(&vector.data);
}
storage.add_vectors(&vectors)?;
```
**Benefits:**
- ~10% faster search with pre-normalized vectors
- More consistent similarity scores
### 5. Memory Management
#### Buffer Pooling
hive-gpu automatically uses buffer pooling for efficient memory management. To maximize efficiency:
```rust
// ✅ GOOD: Reuse storage for multiple operations
let mut storage = context.create_storage(128, GpuDistanceMetric::Cosine)?;
// Add vectors
storage.add_vectors(&batch1)?;
// Search multiple times (reuses buffers)
for query in queries {
let results = storage.search(&query, 10)?;
process_results(results);
}
// ❌ BAD: Recreating storage repeatedly
for batch in batches {
let mut storage = context.create_storage(128, GpuDistanceMetric::Cosine)?;
storage.add_vectors(&batch)?; // Inefficient!
}
```
#### VRAM Monitoring
Monitor VRAM usage to prevent out-of-memory errors:
```rust
use hive_gpu::traits::GpuBackend;
let stats = context.memory_stats();
println!("VRAM Usage: {:.1}%", stats.utilization * 100.0);
println!("Available: {} MB", stats.available / 1024 / 1024);
if stats.utilization > 0.9 {
eprintln!("Warning: VRAM usage high!");
}
```
### 6. Dimension Optimization
**Choose appropriate vector dimensions** for your use case.
| Dimension | Use Case | Memory (1M vectors) | Search Speed |
|---|---|---|---|
| 128 | Fast retrieval | 512 MB | **Fast** |
| 384 | Balanced | 1.5 GB | **Medium** |
| 768 | High quality | 3 GB | **Slower** |
| 1536 | Maximum quality | 6 GB | **Slowest** |
**Recommendations:**
- **128-256D**: Fast retrieval, moderate quality
- **384-512D**: Balanced performance and quality
- **768-1024D**: High-quality embeddings
- **1536D+**: Premium models (OpenAI ada-002)
### 7. Parallelization
#### Multiple Contexts (Multi-GPU)
```rust
use rayon::prelude::*;
// Create contexts for multiple GPUs
let contexts = vec![
MetalNativeContext::new()?,
// Additional GPUs if available
];
// Distribute work across GPUs
let results: Vec<_> = queries
.par_chunks(queries.len() / contexts.len())
.zip(&contexts)
.map(|(chunk, ctx)| {
let storage = ctx.create_storage(128, GpuDistanceMetric::Cosine)?;
chunk.iter().map(|q| storage.search(q, 10)).collect()
})
.collect();
```
#### Async Operations
```rust
use tokio::task;
// Asynchronous batch processing (sketch): offload each blocking GPU search to
// the blocking pool, then await every handle. Assumes `storage` is shareable
// across tasks (e.g. an Arc clone) and `queries` is owned.
let handles: Vec<_> = queries
    .into_iter()
    .map(|q| task::spawn_blocking({ let s = storage.clone(); move || s.search(&q, 10) }))
    .collect();
for handle in handles {
    handle.await??;
}
```
---
## Profiling and Debugging
### Enable Performance Logging
```bash
export RUST_LOG=hive_gpu=debug
export HIVE_GPU_PROFILE=true
cargo run --release --example metal_basic
```
### macOS Metal Profiling
```bash
# Build with debug symbols
cargo build --release --example metal_basic
# Profile with Instruments
xcrun xctrace record \
    --template 'Metal System Trace' \
    --output metal_trace.trace \
    --launch ./target/release/examples/metal_basic
# Open trace in Instruments
open metal_trace.trace
```
### Benchmark Suite
```bash
# Run all benchmarks
cargo bench --features metal-native
# Run specific benchmark
cargo bench --bench gpu_operations -- search
# Save a named baseline for later comparison (criterion writes HTML reports under target/criterion/)
cargo bench --features metal-native -- --save-baseline main
```
### Memory Profiling
```rust
use hive_gpu::monitoring::VramMonitor;
let monitor = VramMonitor::new(context);
// Before operation
let before = monitor.get_vram_stats();
// Perform operation
storage.add_vectors(&vectors)?;
// After operation
let after = monitor.get_vram_stats();
println!("VRAM increase: {} MB",
(after.allocated_vram - before.allocated_vram) / 1024 / 1024);
```
---
## Performance Bottlenecks
### Common Issues and Solutions
#### 1. CPU-GPU Transfer Overhead
**Problem:** Frequent small transfers between CPU and GPU.
**Solution:**
```rust
// ❌ BAD: Frequent small transfers
for vector in vectors {
storage.add_vectors(&[vector])?; // Each call transfers data
}
// ✅ GOOD: Single large transfer
storage.add_vectors(&vectors)?;
```
#### 2. Non-optimal Batch Size
**Problem:** Batch size too small or too large.
**Solution:**
```rust
// Optimal batch size depends on dimension and VRAM
let optimal_batch_size = match dimension {
d if d <= 128 => 10_000,
d if d <= 384 => 5_000,
d if d <= 768 => 2_000,
_ => 1_000,
};
for chunk in vectors.chunks(optimal_batch_size) {
storage.add_vectors(chunk)?;
}
```
#### 3. Inefficient HNSW Parameters
**Problem:** HNSW parameters not tuned for workload.
**Solution:**
```rust
// For high-throughput (less accuracy):
let config = HnswConfig {
ef_search: 50, // Lower for speed
..Default::default()
};
// For high-accuracy (slower):
let config = HnswConfig {
ef_search: 200, // Higher for accuracy
..Default::default()
};
```
#### 4. Memory Fragmentation
**Problem:** VRAM fragmentation from repeated allocations.
**Solution:**
```rust
// Pre-allocate storage for expected size
let mut storage = context.create_storage(128, GpuDistanceMetric::Cosine)?;
// Reserve capacity (if API available)
// storage.reserve(expected_vector_count)?;
// Add vectors in optimal batches
for chunk in vectors.chunks(5000) {
storage.add_vectors(chunk)?;
}
```
---
## Performance Checklist
### Before Deployment
- [ ] Batch operations (1000-10000 vectors per batch)
- [ ] HNSW configuration tuned for use case
- [ ] Vectors pre-normalized (if using Cosine)
- [ ] Appropriate distance metric selected
- [ ] Memory usage profiled (< 90% VRAM)
- [ ] Benchmarks run on production hardware
- [ ] Performance regression tests in CI/CD
- [ ] Error handling for out-of-memory scenarios
### During Operation
- [ ] Monitor VRAM usage
- [ ] Track search latency (p50, p99)
- [ ] Log slow operations (>100 ms); see the timing sketch after this list
- [ ] Profile periodically
- [ ] Watch for memory leaks
- [ ] Monitor GPU temperature/throttling
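A minimal timing sketch for the latency items above (assumes the `log` crate and a `storage`/`query` already in scope; the 100 ms threshold mirrors the checklist):
```rust
use std::time::Instant;

// Time a single search and flag anything over the latency budget.
let start = Instant::now();
let _results = storage.search(&query, 10)?;
let elapsed = start.elapsed();
if elapsed.as_millis() > 100 {
    log::warn!("slow search: {:?} for a top-10 query", elapsed);
}
```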
---
## Performance Comparison
### vs. CPU-only Libraries
| Library | Hardware | Flat Search | ANN Search |
|---|---|---|---|
| **hive-gpu** | **Metal (M1)** | **2 ms** | **0.5 ms** |
| FAISS | CPU (16 cores) | 100 ms | 10 ms |
| hnswlib | CPU (16 cores) | 150 ms | 8 ms |
| annoy | CPU (16 cores) | 200 ms | 12 ms |
**Speedup: 20-50x over CPU-only libraries**
### vs. Other GPU Libraries
| Library | Backend | Flat Search | ANN Search | Rust API |
|---|---|---|---|---|
| **hive-gpu** | **Metal Native** | **2 ms** | **0.5 ms** | **Yes** |
| FAISS GPU | CUDA | 3 ms | 0.8 ms | No |
| cuVS | CUDA | 2.5 ms | 0.6 ms | Yes |
**hive-gpu delivers competitive performance with a native Metal implementation.**
---
## Future Optimizations
### Planned for v0.2.0
- [ ] CUDA backend optimization
- [ ] Multi-GPU load balancing
- [ ] Quantization (PQ, SQ) for memory reduction
- [ ] Kernel fusion for reduced overhead
- [ ] Adaptive batch sizing
### Planned for v0.3.0
- [ ] Dynamic graph updates
- [ ] Memory compression
- [ ] Zero-copy operations
- [ ] Hardware-specific tuning (Apple Neural Engine)
- [ ] Persistent caching
---
## References
- [HNSW Paper](https://arxiv.org/abs/1603.09320)
- [Metal Performance Guide](https://developer.apple.com/metal/Metal-Feature-Set-Tables.pdf)
- [GPU Computing Best Practices](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/)
---
*Last Updated: 2025-01-03*
*Benchmark Version: 0.1.6*