hive-gpu 0.2.0 - Docs.rs

# hive-gpu - Architecture


## Overview


`hive-gpu` is a high-performance GPU acceleration library for vector operations, specifically designed for vector similarity search workloads. It provides a unified API across multiple native GPU backends (Metal for Apple Silicon, CUDA for NVIDIA GPUs, and ROCm for AMD GPUs) with optimized implementations for HNSW (Hierarchical Navigable Small World) graph-based approximate nearest neighbor search.

## System Architecture


```
┌─────────────────────────────────────────────────────────────┐
│                     Application Layer                        │
│  (User Code, Vectorizer Integration, Custom Implementations) │
└────────────────────────────┬────────────────────────────────┘
                             │
┌────────────────────────────▼────────────────────────────────┐
│                      Core API Layer                          │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │ GpuContext   │  │ GpuVector    │  │ GpuSearch    │      │
│  │    Trait     │  │  Storage     │  │   Result     │      │
│  └──────────────┘  └──────────────┘  └──────────────┘      │
└────────────────────────────┬────────────────────────────────┘
                             │
┌────────────────────────────▼────────────────────────────────┐
│                   Backend Abstraction                        │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │  GpuBackend  │  │ GpuBuffer    │  │  GpuMonitor  │      │
│  │    Trait     │  │   Manager    │  │     Trait    │      │
│  └──────────────┘  └──────────────┘  └──────────────┘      │
└────────────────────────────┬────────────────────────────────┘
                             │
┌────────────────────────────▼────────────────────────────────┐
│              GPU Backend Implementations                      │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │    Metal     │  │     CUDA     │  │     ROCm     │      │
│  │   Native     │  │    Native    │  │    Native    │      │
│  └──────────────┘  └──────────────┘  └──────────────┘      │
└────────────────────────────┬────────────────────────────────┘
                             │
┌────────────────────────────▼────────────────────────────────┐
│                  Hardware Layer (GPU)                         │
│      Apple Silicon / NVIDIA GPUs / AMD Radeon/Instinct      │
└─────────────────────────────────────────────────────────────┘
```

## Core Modules


### 1. **Core Types Module** (`src/types.rs`)


**Responsibility:** Defines fundamental data structures used throughout the system.

**Key Types:**
- `GpuVector`: Vector representation with ID, data, and metadata
- `GpuDistanceMetric`: Distance metric enum (Cosine, Euclidean, DotProduct)
- `GpuSearchResult`: Search result with ID, score, and index
- `GpuDeviceInfo`: GPU device information
- `GpuMemoryStats`: Memory usage statistics
- `HnswConfig`: HNSW graph configuration

**Dependencies:** None (foundation module)

### 2. **Core Traits Module** (`src/traits.rs`)


**Responsibility:** Defines interfaces for GPU operations and backend abstraction.

**Key Traits:**
- `GpuBackend`: Core backend interface for device info and capabilities
- `GpuContext`: Factory trait for creating vector storage instances
- `GpuVectorStorage`: Vector operations interface (add, search, remove)
- `GpuBufferManager`: Buffer allocation and management
- `GpuMonitor`: VRAM monitoring and validation

**Dependencies:** `types` module

### 3. **Backend Detection Module** (`src/backends/detector.rs`)


**Responsibility:** Detects available GPU backends and selects the optimal one.

**Key Functions:**
- `detect_available_backends()`: Returns list of available backends
- `select_best_backend()`: Selects best backend based on priority (Metal > CUDA > CPU)
- `get_backend_info()`: Retrieves backend-specific information
- `get_backend_performance_info()`: Returns performance characteristics

**Dependencies:** Feature flags (`metal-native`, `cuda`, `rocm`)

### 4. **Metal Backend Module** (`src/metal/`)


**Responsibility:** Metal-specific GPU acceleration for Apple Silicon.

**Submodules:**
- `context.rs`: Metal device and command queue management
- `vector_storage.rs`: Vector storage implementation using Metal buffers
- `hnsw_graph.rs`: HNSW graph construction and search on GPU
- `buffer_pool.rs`: Metal buffer pool for efficient memory management
- `vram_monitor.rs`: VRAM usage monitoring and validation
- `helpers.rs`: Metal-specific utility functions

**Key Features:**
- Pure Metal Native implementation
- Optimized for Apple Silicon (M1/M2/M3/M4)
- Support for Metal Performance Shaders (MPS)
- VRAM-only storage for maximum performance
- Modern type-safe Rust bindings via objc2-metal (v0.1.8+)

**Dependencies:** 
- `objc2-metal` v0.3 - Metal framework bindings (actively maintained)
- `objc2-foundation` v0.3 - Foundation framework support  
- `objc2` v0.6 - Modern Objective-C interop

**Migration Note (v0.1.8):**
Migrated from discontinued `metal-rs` to actively maintained `objc2-metal` for:
- Active maintenance and security updates
- Modern type-safe bindings with `ProtocolObject<dyn Trait>` pattern
- Full Metal API coverage and integrated Foundation support

See `docs/guides/MIGRATION_METAL_OBJC2.md` for migration details.

### 5. **CUDA Backend Module** (`src/cuda/`)


**Responsibility:** CUDA-specific GPU acceleration for NVIDIA GPUs.

**Submodules:**
- `context.rs`: CUDA device and stream management
- `vector_storage.rs`: Vector storage using CUDA memory
- `hnsw_graph.rs`: HNSW implementation with CUDA kernels
- `buffer_pool.rs`: CUDA memory pool
- `vram_monitor.rs`: CUDA memory monitoring
- `helpers.rs`: CUDA-specific utility functions

**Key Features:**
- Native CUDA implementation
- Optimized for NVIDIA GPUs (Compute Capability 7.0+)
- Support for Tensor Cores (A100+)
- Multi-GPU support

**Status:** Implementation planned (v0.2.0)

**Dependencies:** `cudarc` crate

### 6. **ROCm Backend Module** (`src/rocm/`)


**Responsibility:** ROCm-specific GPU acceleration for AMD GPUs.

**Submodules:**
- `context.rs`: ROCm device and queue management
- `vector_storage.rs`: Vector storage using ROCm memory
- `hnsw_graph.rs`: HNSW implementation with HIP kernels
- `buffer_pool.rs`: ROCm memory pool
- `vram_monitor.rs`: ROCm memory monitoring
- `helpers.rs`: ROCm-specific utility functions

**Key Features:**
- Native ROCm/HIP implementation
- Optimized for AMD Radeon and Instinct GPUs
- Support for MI200/MI300 series
- Multi-GPU support

**Status:** Implementation planned (v0.2.0)

**Dependencies:** `hip-sys` crate

### 7. **Shader Management Module** (`src/shaders/`)


**Responsibility:** GPU shader source code and compilation.

**Key Files:**
- `metal_hnsw.metal`: Metal Shading Language for HNSW operations
- `metal_shaders.rs`: Rust bindings for Metal shaders
- `cuda_kernels.cu`: CUDA kernels for HNSW operations
- `cuda_shaders.rs`: Rust bindings for CUDA kernels
- `hip_kernels.hip`: HIP kernels for ROCm
- `rocm_shaders.rs`: Rust bindings for ROCm kernels

**Dependencies:** Backend-specific shader/kernel compilation

### 8. **Monitoring Module** (`src/monitoring/`)


**Responsibility:** Performance and memory monitoring.

**Submodules:**
- `performance_monitor.rs`: Operation timing and throughput metrics
- `vram_monitor.rs`: Unified VRAM monitoring interface

**Dependencies:** `types`, `traits`

### 9. **Utilities Module** (`src/utils/`)


**Responsibility:** Common utility functions.

**Submodules:**
- `math.rs`: Mathematical operations (normalize, distance calculations)
- `memory.rs`: Memory alignment and buffer utilities
- `timing.rs`: Performance measurement utilities

**Dependencies:** None

### 10. **Error Handling Module** (`src/error.rs`)


**Responsibility:** Unified error handling across all backends.

**Key Types:**
- `HiveGpuError`: Main error enum
- `Result<T>`: Type alias for Result<T, HiveGpuError>

**Error Categories:**
- Device errors (NoDeviceAvailable, DeviceNotSupported)
- Memory errors (OutOfMemory, AllocationFailed)
- Operation errors (InvalidDimension, InvalidOperation)
- Backend errors (ShaderCompilationFailed, BackendError)

## Data Flow


### Vector Addition Flow


```
User Code
  │
  ├─► GpuContext::create_storage()
  │     └─► MetalNativeContext::new()
  │           └─► Device + CommandQueue + Library initialization
  │
  ├─► GpuVectorStorage::add_vectors(vectors)
  │     │
  │     ├─► Validate vector dimensions
  │     ├─► Allocate GPU buffers
  │     ├─► Upload vectors to VRAM
  │     ├─► Update HNSW graph (if configured)
  │     └─► Return vector indices
  │
  └─► Result<Vec<usize>>
```

### Vector Search Flow


```
User Code
  │
  ├─► GpuVectorStorage::search(query, limit)
  │     │
  │     ├─► Upload query vector to GPU
  │     ├─► Compute distances (GPU kernel)
  │     │     └─► Parallel similarity computation
  │     ├─► HNSW graph traversal (if configured)
  │     │     └─► GPU-accelerated graph search
  │     ├─► Top-K selection on GPU
  │     ├─► Download results from VRAM
  │     └─► Return sorted results
  │
  └─► Result<Vec<GpuSearchResult>>
```

## Technology Stack


### Language

- **Primary**: Rust (Edition 2024)
- **Shader Languages**: Metal Shading Language (MSL), CUDA C++, HIP (ROCm)

### GPU Frameworks

- **Metal**: Apple's GPU framework (version 0.27+)
- **CUDA**: NVIDIA's parallel computing platform (version 12.0+)
- **ROCm/HIP**: AMD's GPU computing platform (version 5.0+)

### Core Dependencies

- `metal`: Metal framework bindings
- `objc`: Objective-C runtime for Metal interop
- `thiserror`: Error handling
- `serde`: Serialization/deserialization
- `tokio`: Async runtime
- `tracing`: Structured logging
- `bytemuck`: Safe transmutation between types
- `half`: Half-precision float support

### Development Tools

- `criterion`: Benchmarking framework
- `cargo-nextest`: Fast test runner
- `clippy`: Linter
- `rustfmt`: Code formatter

## Design Patterns


### 1. **Trait-Based Abstraction**


All GPU backends implement common traits (`GpuBackend`, `GpuContext`, `GpuVectorStorage`) enabling polymorphic usage and easy backend switching.

```rust
pub trait GpuContext {
    fn create_storage(&self, dimension: usize, metric: GpuDistanceMetric) 
        -> Result<Box<dyn GpuVectorStorage>>;
}
```

### 2. **Builder Pattern**


HNSW configuration uses builder pattern for flexible parameter tuning:

```rust
let config = HnswConfig {
    max_connections: 16,
    ef_construction: 100,
    ef_search: 50,
    ..Default::default()
};
```

### 3. **Buffer Pooling**


Memory management uses buffer pooling to reduce allocation overhead:

```rust
pub trait GpuBufferManager {
    fn allocate_buffer(&mut self, size: usize) -> Result<GpuBuffer>;
    fn deallocate_buffer(&mut self, buffer: GpuBuffer) -> Result<()>;
}
```

### 4. **RAII (Resource Acquisition Is Initialization)**


GPU resources are automatically cleaned up when dropped:

```rust
impl Drop for MetalNativeVectorStorage {
    fn drop(&mut self) {
        // Automatic cleanup of GPU buffers
    }
}
```

### 5. **Zero-Copy Architecture**


Data remains in VRAM throughout operations, minimizing CPU-GPU transfers:

```
CPU Memory                    VRAM (GPU Memory)
    ───►  Initial Upload  ───►  Storage
                                    │
                                    ├─► Similarity Computation
                                    ├─► HNSW Graph Traversal
                                    └─► Top-K Selection
    ◄───  Final Results  ◄────
```

## Security Considerations


### Memory Safety

- **Rust Guarantees**: Memory safety enforced by Rust's ownership system
- **Buffer Validation**: All buffer operations validated before GPU execution
- **No Unsafe Code**: Minimized use of `unsafe` blocks, carefully audited when necessary

### Data Protection

- **VRAM Isolation**: Vector data isolated in GPU memory
- **No Disk I/O**: All operations in memory (no persistence by default)
- **Metadata Sanitization**: User-provided metadata validated before storage

### Resource Limits

- **Memory Bounds**: Maximum buffer sizes enforced
- **Batch Limits**: Maximum batch sizes to prevent resource exhaustion
- **Timeout Protection**: Long-running operations have configurable timeouts

## Performance Characteristics


### Expected Throughput


| Operation | CPU Baseline | Metal (Apple Silicon) | Speedup |
|-----------|--------------|----------------------|---------|
| Vector Addition | 1,000 vec/s | 4,768 vec/s | **4.8x** |
| Similarity Search | 1 ms | 0.668 ms | **1.5x** |
| HNSW Construction | 100 ms | <1 ms | **100x+** |
| Batch Search (1000 queries) | 10 s | 0.001 s | **10,000x** |

### Latency Targets


- **Single Vector Search**: < 1ms (with HNSW)
- **Batch Search (100 queries)**: < 10ms
- **Vector Addition**: < 0.1ms per vector (batched)
- **HNSW Graph Construction**: < 100ms (10k vectors)

### Memory Efficiency


- **VRAM-Only Storage**: Zero CPU memory overhead
- **Sparse Indexing**: HNSW graph reduces memory footprint
- **Buffer Pooling**: 90%+ buffer reuse rate
- **Compression**: Optional f16 support for 50% memory reduction

### Scalability


- **Vector Count**: Tested up to 10M vectors
- **Batch Size**: Up to 10,000 vectors per batch
- **Dimensions**: Optimized for 128-768 dimensions
- **Concurrent Searches**: Multiple queries in parallel

## Optimization Techniques


### 1. **SIMD Vectorization**

GPU shaders utilize SIMD instructions for parallel distance computation.

### 2. **Shared Memory Usage**

Local memory in GPU workgroups for fast intermediate storage.

### 3. **Memory Coalescing**

Aligned memory access patterns for optimal memory bandwidth.

### 4. **Kernel Fusion**

Multiple operations combined in single GPU kernel to reduce overhead.

### 5. **Asynchronous Execution**

Non-blocking GPU operations with async/await support.

## Backend-Specific Optimizations


### Metal (Apple Silicon)


- **Tile Memory**: Utilizes Apple GPU tile-based architecture
- **Metal Performance Shaders (MPS)**: Hardware-accelerated operations
- **Unified Memory**: Shared memory architecture reduces overhead
- **Apple GPU Families**: Optimized for Apple7+ GPU families

### CUDA (NVIDIA)


- **Tensor Cores**: Hardware-accelerated matrix operations (A100+)
- **Shared Memory Banks**: Optimized bank access patterns
- **Warp-Level Primitives**: Collective thread operations
- **Stream Compaction**: Efficient filtering on GPU
- **Unified Memory**: Automatic memory management
- **Multi-GPU**: NCCL for multi-GPU communication

### ROCm (AMD)


- **HIP Programming**: Portable GPU code compatible with CUDA
- **Wavefront Optimization**: AMD GPU wavefront-specific tuning
- **Infinity Fabric**: High-bandwidth GPU-GPU communication
- **Matrix Cores**: Hardware-accelerated matrix operations (MI200+)
- **Memory Pooling**: Efficient HIP memory management
- **Multi-GPU**: RCCL for multi-GPU communication

## Testing Strategy


### Unit Tests

- Individual function testing
- Mock GPU contexts for CPU-only tests
- Edge case validation

### Integration Tests

- End-to-end workflows
- Multiple backend testing
- Real GPU hardware validation

### Benchmarks

- Performance regression detection
- Comparative benchmarking across backends
- Memory usage profiling

### Coverage Requirements

- **Minimum**: 95% code coverage
- **Critical Paths**: 100% coverage (search, add_vectors)
- **Error Handling**: All error paths tested

## Future Enhancements


### Phase 1 (v0.2.0)

- [ ] Complete CUDA backend implementation
- [ ] Add ROCm backend for AMD GPU support
- [ ] Implement f16 (half-precision) support
- [ ] Advanced HNSW optimizations

### Phase 2 (v0.3.0)

- [ ] GPU-based quantization (PQ, SQ)
- [ ] Multi-GPU support
- [ ] Distributed vector search
- [ ] Real-time index updates

### Phase 3 (v1.0.0)

- [ ] Production-grade stability
- [ ] Comprehensive benchmarking suite
- [ ] Advanced monitoring and profiling
- [ ] Official integrations (Qdrant, Milvus)

---
*Last Updated: 2025-01-03*