# Implementation Recommendations for hive-gpu
**Date:** 2025-01-07
**Version:** 1.0
**Author:** Vectorizer Team
**Target:** hive-gpu Library (v0.2.0+)
---
## 📋 Executive Summary
This document details the **critical recommendations** for expanding the `hive-gpu` library, needed to support GPU acceleration on **all major platforms** targeted by Vectorizer.
### Current Status
| Backend | Status | Platforms | Server Share |
|---|---|---|---|
| **Metal** | ✅ Implemented | macOS | ~5% of servers |
| **CUDA** | ❌ Missing | Linux/Windows (NVIDIA) | ~70% of servers |
| **ROCm** | ❌ Missing | Linux (AMD) | ~15% of servers |
| **WebGPU** | ❌ Missing | Cross-platform | ~10% general use |
### Implementation Impact
**If CUDA is implemented:**
- ✅ **~70% of production servers** (≈71% total coverage combined with Metal) will be able to use GPU acceleration
- ✅ **10-50x speedup** on vector operations
- ✅ **Reduced latency:** 10-30ms down to 0.5-3ms
- ✅ **Very high ROI** for only 1-2 weeks of development
---
## 🎯 Main Objectives
### 1. **CUDA Backend** (Priority: 🔥 CRITICAL)
- Support NVIDIA GPUs on Linux and Windows
- Reach 70% of the ML/AI server market
- Performance competitive with native CUDA implementations
### 2. **Device Info API** (Priority: 🔥 HIGH)
- Provide detailed GPU information
- Support querying available/total VRAM
- Identify GPU model and driver version
### 3. **ROCm Backend** (Priority: ⚡ MEDIUM)
- Support AMD GPUs on Linux
- Reach an additional 15% of the market
### 4. **WebGPU Backend** (Priority: 📦 LOW)
- Universal cross-platform fallback
- Support browser deployments
### 5. **Memory Pooling** (Priority: ⚡ MEDIUM)
- Optimize memory allocation/deallocation
- Reduce the overhead of batch operations
### 6. **Async Operations** (Priority: 📦 LOW)
- Asynchronous GPU operations
- Better integration with the Tokio runtime
---
## 🏗️ Recommended Architecture
### Proposed Directory Structure
```
hive-gpu/
├── src/
│ ├── lib.rs # Core traits and types
│ ├── error.rs # Error types
│ ├── types.rs # Common types (GpuVector, etc.)
│ │
│ ├── metal/ # ✅ Existing
│ │ ├── mod.rs
│ │ ├── context.rs
│ │ └── storage.rs
│ │
│ ├── cuda/ # ❌ TO IMPLEMENT (Phase 2)
│ │ ├── mod.rs
│ │ ├── context.rs
│ │ ├── storage.rs
│ │ ├── kernels.cu # CUDA kernels
│ │ └── utils.rs
│ │
│ ├── rocm/ # ❌ TO IMPLEMENT (Phase 3)
│ │ ├── mod.rs
│ │ ├── context.rs
│ │ ├── storage.rs
│ │ └── kernels.hip # HIP kernels
│ │
│ ├── wgpu/ # ❌ TO IMPLEMENT (Phase 4)
│ │ ├── mod.rs
│ │ ├── context.rs
│ │ ├── storage.rs
│ │ └── shaders.wgsl # WGSL shaders
│ │
│ └── pool/ # ❌ TO IMPLEMENT (Phase 5)
│ ├── mod.rs
│ └── memory_pool.rs
│
├── benches/ # Benchmarks per backend
│ ├── metal_bench.rs
│ ├── cuda_bench.rs
│ └── comparison.rs
│
├── tests/ # Integration tests
│ ├── metal_tests.rs
│ ├── cuda_tests.rs
│ └── cross_backend.rs
│
├── build.rs # Build script (CUDA/ROCm compilation)
├── Cargo.toml
└── README.md
```
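The multi-backend layout above implies a runtime selection order: try the highest-priority available backend and fall through to a CPU path. A minimal sketch of that priority chain (CUDA → ROCm → Metal → WebGPU → CPU); the availability flags are stubbed here, since in `hive-gpu` they would come from feature-gated calls such as `CudaContext::is_available()`:

```rust
// Hypothetical backend selector; the boolean flags stand in for the
// real, feature-gated availability checks (e.g. CudaContext::is_available()).
#[derive(Debug, PartialEq)]
enum Backend {
    Cuda,
    Rocm,
    Metal,
    Wgpu,
    Cpu,
}

fn select_backend(cuda: bool, rocm: bool, metal: bool, wgpu: bool) -> Backend {
    // Priority order mirrors the phase priorities in this document
    if cuda {
        Backend::Cuda
    } else if rocm {
        Backend::Rocm
    } else if metal {
        Backend::Metal
    } else if wgpu {
        Backend::Wgpu
    } else {
        Backend::Cpu // fallback when no GPU backend is available
    }
}

fn main() {
    // On a Linux NVIDIA server, only CUDA reports available
    assert_eq!(select_backend(true, false, false, false), Backend::Cuda);
    // With no GPU at all, fall back to CPU
    assert_eq!(select_backend(false, false, false, false), Backend::Cpu);
    println!("backend selection ok");
}
```

The real factory would return `Box<dyn GpuContext>`; the point here is only the fall-through order.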
---
## 📦 Phase 1: Device Info API (Priority: 🔥 HIGH)
**Estimated Time:** 1-2 days
**Complexity:** Low
**Impact:** High (used by all backends)
### API Specification
#### 1.1 New Struct: `GpuDeviceInfo`
```rust
// src/types.rs
/// Detailed GPU device information
#[derive(Debug, Clone)]
pub struct GpuDeviceInfo {
/// GPU device name (e.g., "Apple M1 Pro", "NVIDIA RTX 4090")
pub name: String,
/// Backend type
pub backend: String,
/// Total VRAM in bytes
pub vram_total: usize,
/// Available VRAM in bytes
pub vram_available: usize,
/// Currently used VRAM in bytes
pub vram_used: usize,
/// Driver version string
pub driver_version: String,
/// Compute capability (CUDA) or equivalent
pub compute_capability: Option<String>,
/// Maximum threads per block (for parallel operations)
pub max_threads_per_block: Option<usize>,
/// Maximum shared memory per block
pub max_shared_memory: Option<usize>,
/// Device ID (for multi-GPU systems)
pub device_id: usize,
/// PCI Bus ID (for identification)
pub pci_bus_id: Option<String>,
}
impl GpuDeviceInfo {
/// Calculate VRAM usage percentage (0.0 when total is unknown)
pub fn vram_usage_percent(&self) -> f32 {
if self.vram_total == 0 {
return 0.0; // e.g. WebGPU, which does not expose VRAM size
}
(self.vram_used as f32 / self.vram_total as f32) * 100.0
}
/// Check if device has enough available VRAM
pub fn has_available_vram(&self, required_bytes: usize) -> bool {
self.vram_available >= required_bytes
}
}
```
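A typical caller-side use of `has_available_vram` is a pre-flight check before a batch insert: the required size is simply `n_vectors * dimension * size_of::<f32>()`. A standalone sketch of that arithmetic (the `vram_available` value is made up for illustration):

```rust
// Bytes needed on the device for a batch of f32 vectors.
fn batch_bytes(n_vectors: usize, dimension: usize) -> usize {
    // f32 vectors: 4 bytes per component
    n_vectors * dimension * std::mem::size_of::<f32>()
}

fn main() {
    let required = batch_bytes(1_000_000, 512); // 1M vectors of dim 512
    assert_eq!(required, 2_048_000_000); // ~1.9 GiB

    let vram_available: usize = 8 * 1024 * 1024 * 1024; // hypothetical 8 GiB free
    assert!(vram_available >= required); // mirrors has_available_vram()
    println!("batch fits in VRAM");
}
```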
#### 1.2 Adding to the `GpuContext` Trait
```rust
// src/lib.rs
pub trait GpuContext: Send + Sync {
/// Get detailed device information
fn device_info(&self) -> Result<GpuDeviceInfo, HiveGpuError>;
/// Get current VRAM usage in bytes
fn vram_usage(&self) -> Result<usize, HiveGpuError> {
Ok(self.device_info()?.vram_used)
}
/// Check if device has enough available VRAM
fn has_available_vram(&self, required_bytes: usize) -> Result<bool, HiveGpuError> {
Ok(self.device_info()?.has_available_vram(required_bytes))
}
// ... existing methods (create_storage, etc.)
}
```
### Metal Implementation
```rust
// src/metal/context.rs
impl GpuContext for MetalNativeContext {
fn device_info(&self) -> Result<GpuDeviceInfo, HiveGpuError> {
let device = self.device();
// Query Metal device properties
let name = device.name().to_string();
let vram_total = device.recommended_max_working_set_size() as usize;
let vram_used = device.current_allocated_size() as usize;
let vram_available = vram_total.saturating_sub(vram_used);
// Get macOS version as "driver version"
let driver_version = Self::get_macos_version();
Ok(GpuDeviceInfo {
name,
backend: "Metal".to_string(),
vram_total,
vram_available,
vram_used,
driver_version,
compute_capability: None, // Metal doesn't have this
max_threads_per_block: Some(1024), // Metal typical
max_shared_memory: Some(32 * 1024), // 32KB typical
device_id: 0,
pci_bus_id: None,
})
}
}
impl MetalNativeContext {
fn get_macos_version() -> String {
use std::process::Command;
if let Ok(output) = Command::new("sw_vers").arg("-productVersion").output() {
String::from_utf8_lossy(&output.stdout).trim().to_string()
} else {
"Unknown".to_string()
}
}
}
```
### Tests
```rust
// tests/device_info_tests.rs
#[test]
fn test_metal_device_info() {
#[cfg(all(feature = "metal-native", target_os = "macos"))]
{
use hive_gpu::metal::MetalNativeContext;
use hive_gpu::GpuContext;
let ctx = MetalNativeContext::new().expect("Metal should be available");
let info = ctx.device_info().expect("Should get device info");
assert!(!info.name.is_empty());
assert_eq!(info.backend, "Metal");
assert!(info.vram_total > 0);
assert!(info.vram_available <= info.vram_total);
assert!(!info.driver_version.is_empty());
println!("Device: {}", info.name);
println!("VRAM: {:.2} GB total, {:.2} GB available",
info.vram_total as f64 / 1024.0 / 1024.0 / 1024.0,
info.vram_available as f64 / 1024.0 / 1024.0 / 1024.0
);
}
}
```
---
## 🚀 Phase 2: CUDA Backend (Priority: 🔥 CRITICAL)
**Estimated Time:** 1-2 weeks
**Complexity:** High
**Impact:** Critical (70% of the market)
### 2.1 Required Dependencies
```toml
# Cargo.toml
[dependencies]
# Existing...
# CUDA support (optional)
cuda-runtime-sys = { version = "0.3", optional = true }
cuda-driver-sys = { version = "0.3", optional = true }
cublas-sys = { version = "0.3", optional = true }
[build-dependencies]
cc = "1.0" # For compiling CUDA kernels
[features]
default = ["metal-native"]
metal-native = []
cuda = ["cuda-runtime-sys", "cuda-driver-sys", "cublas-sys"]
[lib]
name = "hive_gpu"
crate-type = ["rlib", "cdylib"]
```
### 2.2 Build Script
```rust
// build.rs
fn main() {
#[cfg(feature = "cuda")]
{
build_cuda_kernels();
}
}
#[cfg(feature = "cuda")]
fn build_cuda_kernels() {
use std::env;
use std::path::PathBuf;
let cuda_path = env::var("CUDA_PATH")
.or_else(|_| env::var("CUDA_HOME"))
.unwrap_or_else(|_| "/usr/local/cuda".to_string());
println!("cargo:rerun-if-changed=src/cuda/kernels.cu");
println!("cargo:rustc-link-search=native={}/lib64", cuda_path);
println!("cargo:rustc-link-lib=cudart");
println!("cargo:rustc-link-lib=cublas");
// Compile CUDA kernels
cc::Build::new()
.cuda(true)
.flag("-gencode")
.flag("arch=compute_70,code=sm_70") // Volta
.flag("-gencode")
.flag("arch=compute_75,code=sm_75") // Turing
.flag("-gencode")
.flag("arch=compute_80,code=sm_80") // Ampere
.flag("-gencode")
.flag("arch=compute_86,code=sm_86") // Ampere
.flag("-gencode")
.flag("arch=compute_89,code=sm_89") // Ada Lovelace
.flag("-gencode")
.flag("arch=compute_90,code=sm_90") // Hopper
.file("src/cuda/kernels.cu")
.compile("hive_gpu_cuda_kernels");
}
```
### 2.3 CUDA Context Implementation
```rust
// src/cuda/context.rs
use cuda_runtime_sys::*;
use cublas_sys::*;
use std::ptr;
use crate::{GpuContext, GpuDeviceInfo, GpuVectorStorage, HiveGpuError};
/// CUDA GPU Context
pub struct CudaContext {
device_id: i32,
stream: cudaStream_t,
cublas_handle: cublasHandle_t,
}
// Raw handles are not auto-Send/Sync, but the GpuContext trait requires both.
// Safety: CUDA streams and cuBLAS handles may be used from any thread as long
// as access is externally synchronized.
unsafe impl Send for CudaContext {}
unsafe impl Sync for CudaContext {}
impl CudaContext {
/// Create a new CUDA context
pub fn new() -> Result<Self, HiveGpuError> {
Self::new_with_device(0)
}
/// Create a new CUDA context with specific device
pub fn new_with_device(device_id: i32) -> Result<Self, HiveGpuError> {
unsafe {
// Check device count
let mut device_count = 0;
cuda_check(cudaGetDeviceCount(&mut device_count))?;
if device_count == 0 {
return Err(HiveGpuError::NoDevice);
}
if device_id >= device_count {
return Err(HiveGpuError::InvalidDeviceId(device_id));
}
// Set device
cuda_check(cudaSetDevice(device_id))?;
// Create stream
let mut stream = ptr::null_mut();
cuda_check(cudaStreamCreate(&mut stream))?;
// Create cuBLAS handle
let mut cublas_handle = ptr::null_mut();
cublas_check(cublasCreate_v2(&mut cublas_handle))?;
cublas_check(cublasSetStream_v2(cublas_handle, stream))?;
Ok(Self {
device_id,
stream,
cublas_handle,
})
}
}
/// Get device count
pub fn device_count() -> Result<i32, HiveGpuError> {
unsafe {
let mut count = 0;
cuda_check(cudaGetDeviceCount(&mut count))?;
Ok(count)
}
}
/// Check if CUDA is available
pub fn is_available() -> bool {
Self::device_count().map(|c| c > 0).unwrap_or(false)
}
}
impl GpuContext for CudaContext {
fn device_info(&self) -> Result<GpuDeviceInfo, HiveGpuError> {
unsafe {
let mut props: cudaDeviceProp = std::mem::zeroed();
cuda_check(cudaGetDeviceProperties(&mut props, self.device_id))?;
// Get memory info
let mut free_mem: usize = 0;
let mut total_mem: usize = 0;
cuda_check(cudaMemGetInfo(&mut free_mem, &mut total_mem))?;
let name = std::ffi::CStr::from_ptr(props.name.as_ptr())
.to_string_lossy()
.into_owned();
// Get driver version
let mut driver_version = 0;
cuda_check(cudaDriverGetVersion(&mut driver_version))?;
Ok(GpuDeviceInfo {
name,
backend: "CUDA".to_string(),
vram_total: total_mem,
vram_available: free_mem,
vram_used: total_mem - free_mem,
driver_version: format!("{}.{}", driver_version / 1000, (driver_version % 1000) / 10),
compute_capability: Some(format!("{}.{}", props.major, props.minor)),
max_threads_per_block: Some(props.maxThreadsPerBlock as usize),
max_shared_memory: Some(props.sharedMemPerBlock as usize),
device_id: self.device_id as usize,
pci_bus_id: Some(format!("{:04x}:{:02x}:{:02x}.0",
props.pciDomainID, props.pciBusID, props.pciDeviceID)),
})
}
}
fn create_storage(
&self,
dimension: usize,
metric: crate::GpuDistanceMetric,
) -> Result<Box<dyn GpuVectorStorage>, HiveGpuError> {
Ok(Box::new(CudaVectorStorage::new(
self.device_id,
self.stream,
self.cublas_handle,
dimension,
metric,
)?))
}
}
impl Drop for CudaContext {
fn drop(&mut self) {
unsafe {
if !self.cublas_handle.is_null() {
cublasDestroy_v2(self.cublas_handle);
}
if !self.stream.is_null() {
cudaStreamDestroy(self.stream);
}
}
}
}
// Helper functions (pub(crate) so storage.rs and the memory pool can reuse them)
pub(crate) unsafe fn cuda_check(result: cudaError_t) -> Result<(), HiveGpuError> {
if result != cudaError::cudaSuccess {
let error_str = std::ffi::CStr::from_ptr(cudaGetErrorString(result))
.to_string_lossy()
.into_owned();
return Err(HiveGpuError::CudaError(error_str));
}
Ok(())
}
pub(crate) unsafe fn cublas_check(result: cublasStatus_t) -> Result<(), HiveGpuError> {
if result != cublasStatus_t::CUBLAS_STATUS_SUCCESS {
return Err(HiveGpuError::CublasError(format!("{:?}", result)));
}
Ok(())
}
```
### 2.4 CUDA Vector Storage
```rust
// src/cuda/storage.rs
use cuda_runtime_sys::*;
use cublas_sys::*;
use std::ffi::c_void;
use std::ptr;
use super::context::{cuda_check, cublas_check}; // assumes these are exposed by the cuda module
use crate::{GpuVector, GpuSearchResult, GpuVectorStorage, HiveGpuError, GpuDistanceMetric};
pub struct CudaVectorStorage {
device_id: i32,
stream: cudaStream_t,
cublas_handle: cublasHandle_t,
dimension: usize,
metric: GpuDistanceMetric,
vectors: Vec<GpuVector>,
d_vectors: *mut f32, // Device pointer
capacity: usize,
}
impl CudaVectorStorage {
pub fn new(
device_id: i32,
stream: cudaStream_t,
cublas_handle: cublasHandle_t,
dimension: usize,
metric: GpuDistanceMetric,
) -> Result<Self, HiveGpuError> {
let capacity = 10000; // Initial capacity
let bytes = capacity * dimension * std::mem::size_of::<f32>();
unsafe {
let mut d_vectors = ptr::null_mut();
cuda_check(cudaMalloc(&mut d_vectors as *mut *mut c_void, bytes))?;
Ok(Self {
device_id,
stream,
cublas_handle,
dimension,
metric,
vectors: Vec::new(),
d_vectors: d_vectors as *mut f32,
capacity,
})
}
}
fn ensure_capacity(&mut self, new_size: usize) -> Result<(), HiveGpuError> {
if new_size > self.capacity {
let new_capacity = new_size.next_power_of_two();
let bytes = new_capacity * self.dimension * std::mem::size_of::<f32>();
unsafe {
let mut new_d_vectors = ptr::null_mut();
cuda_check(cudaMalloc(&mut new_d_vectors as *mut *mut c_void, bytes))?;
// Copy old data
if !self.d_vectors.is_null() && self.vectors.len() > 0 {
let old_bytes = self.vectors.len() * self.dimension * std::mem::size_of::<f32>();
cuda_check(cudaMemcpyAsync(
new_d_vectors as *mut c_void,
self.d_vectors as *const c_void,
old_bytes,
cudaMemcpyKind::cudaMemcpyDeviceToDevice,
self.stream,
))?;
cuda_check(cudaStreamSynchronize(self.stream))?;
cuda_check(cudaFree(self.d_vectors as *mut c_void))?;
}
self.d_vectors = new_d_vectors as *mut f32;
self.capacity = new_capacity;
}
}
Ok(())
}
}
impl GpuVectorStorage for CudaVectorStorage {
fn add_vector(&mut self, vector: GpuVector) -> Result<usize, HiveGpuError> {
if vector.data.len() != self.dimension {
return Err(HiveGpuError::DimensionMismatch {
expected: self.dimension,
got: vector.data.len(),
});
}
let index = self.vectors.len();
self.ensure_capacity(index + 1)?;
unsafe {
// Copy vector to GPU
let offset = index * self.dimension;
let bytes = self.dimension * std::mem::size_of::<f32>();
cuda_check(cudaMemcpyAsync(
self.d_vectors.add(offset) as *mut c_void,
vector.data.as_ptr() as *const c_void,
bytes,
cudaMemcpyKind::cudaMemcpyHostToDevice,
self.stream,
))?;
}
self.vectors.push(vector);
Ok(index)
}
fn add_vectors(&mut self, vectors: &[GpuVector]) -> Result<(), HiveGpuError> {
if vectors.is_empty() {
return Ok(());
}
let start_index = self.vectors.len();
self.ensure_capacity(start_index + vectors.len())?;
// Flatten all vectors into single buffer
let mut flat_vectors = Vec::with_capacity(vectors.len() * self.dimension);
for vector in vectors {
if vector.data.len() != self.dimension {
return Err(HiveGpuError::DimensionMismatch {
expected: self.dimension,
got: vector.data.len(),
});
}
flat_vectors.extend_from_slice(&vector.data);
}
unsafe {
// Batch copy to GPU
let offset = start_index * self.dimension;
let bytes = flat_vectors.len() * std::mem::size_of::<f32>();
cuda_check(cudaMemcpyAsync(
self.d_vectors.add(offset) as *mut c_void,
flat_vectors.as_ptr() as *const c_void,
bytes,
cudaMemcpyKind::cudaMemcpyHostToDevice,
self.stream,
))?;
cuda_check(cudaStreamSynchronize(self.stream))?;
}
self.vectors.extend_from_slice(vectors);
Ok(())
}
fn search(&self, query: &[f32], k: usize) -> Result<Vec<GpuSearchResult>, HiveGpuError> {
if query.len() != self.dimension {
return Err(HiveGpuError::DimensionMismatch {
expected: self.dimension,
got: query.len(),
});
}
if self.vectors.is_empty() {
return Ok(Vec::new());
}
let k = k.min(self.vectors.len());
unsafe {
// Allocate device memory for query
let mut d_query = ptr::null_mut();
let query_bytes = self.dimension * std::mem::size_of::<f32>();
cuda_check(cudaMalloc(&mut d_query as *mut *mut c_void, query_bytes))?;
cuda_check(cudaMemcpyAsync(
d_query,
query.as_ptr() as *const c_void,
query_bytes,
cudaMemcpyKind::cudaMemcpyHostToDevice,
self.stream,
))?;
// Allocate device memory for distances
let mut d_distances = ptr::null_mut();
let distances_bytes = self.vectors.len() * std::mem::size_of::<f32>();
cuda_check(cudaMalloc(&mut d_distances as *mut *mut c_void, distances_bytes))?;
// Compute scores using cuBLAS (matrix-vector multiply).
// Vectors are stored row-major (one vector per row), which cuBLAS
// sees as a column-major [dimension x n_vectors] matrix, so we
// multiply with CUBLAS_OP_T and lda = dimension.
match self.metric {
GpuDistanceMetric::Cosine | GpuDistanceMetric::DotProduct => {
// SGEMV: y = alpha * A^T * x + beta * y
// Note: this equals cosine similarity only if vectors are
// L2-normalized before insertion.
let alpha = 1.0f32;
let beta = 0.0f32;
cublas_check(cublasSgemv_v2(
self.cublas_handle,
cublasOperation_t::CUBLAS_OP_T,
self.dimension as i32, // rows of A in the column-major view
self.vectors.len() as i32, // columns of A
&alpha,
self.d_vectors,
self.dimension as i32, // leading dimension
d_query as *const f32,
1,
&beta,
d_distances as *mut f32,
1,
))?;
}
GpuDistanceMetric::Euclidean => {
// Call custom CUDA kernel for L2 distance; cuda_l2_distance
// returns a raw cudaError_t, so wrap it in cuda_check
cuda_check(cuda_l2_distance(
d_query as *const f32,
self.d_vectors as *const f32,
d_distances as *mut f32,
self.vectors.len(),
self.dimension,
self.stream,
))?;
}
}
// Copy distances back to host
let mut h_distances = vec![0.0f32; self.vectors.len()];
cuda_check(cudaMemcpyAsync(
h_distances.as_mut_ptr() as *mut c_void,
d_distances as *const c_void,
distances_bytes,
cudaMemcpyKind::cudaMemcpyDeviceToHost,
self.stream,
))?;
cuda_check(cudaStreamSynchronize(self.stream))?;
// Free device memory
cuda_check(cudaFree(d_query))?;
cuda_check(cudaFree(d_distances))?;
// Find top-k
let mut results: Vec<(usize, f32)> = h_distances
.into_iter()
.enumerate()
.collect();
// Sort by distance (higher is better for cosine/dot, lower for euclidean)
match self.metric {
GpuDistanceMetric::Cosine | GpuDistanceMetric::DotProduct => {
results.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
}
GpuDistanceMetric::Euclidean => {
results.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap());
}
}
results.truncate(k);
Ok(results
.into_iter()
.map(|(idx, distance)| GpuSearchResult {
id: self.vectors[idx].id.clone(),
distance,
metadata: self.vectors[idx].metadata.clone(),
})
.collect())
}
}
fn get_vector(&self, id: &str) -> Result<Option<GpuVector>, HiveGpuError> {
Ok(self.vectors.iter().find(|v| v.id == id).cloned())
}
fn remove_vector(&mut self, id: &str) -> Result<bool, HiveGpuError> {
if let Some(pos) = self.vectors.iter().position(|v| v.id == id) {
self.vectors.remove(pos);
// TODO: Compact GPU memory
Ok(true)
} else {
Ok(false)
}
}
fn update_vector(&mut self, vector: GpuVector) -> Result<bool, HiveGpuError> {
if let Some(pos) = self.vectors.iter().position(|v| v.id == vector.id) {
self.vectors[pos] = vector.clone();
unsafe {
// Update on GPU
let offset = pos * self.dimension;
let bytes = self.dimension * std::mem::size_of::<f32>();
cuda_check(cudaMemcpyAsync(
self.d_vectors.add(offset) as *mut c_void,
vector.data.as_ptr() as *const c_void,
bytes,
cudaMemcpyKind::cudaMemcpyHostToDevice,
self.stream,
))?;
cuda_check(cudaStreamSynchronize(self.stream))?;
}
Ok(true)
} else {
Ok(false)
}
}
fn vector_count(&self) -> usize {
self.vectors.len()
}
fn clear(&mut self) -> Result<(), HiveGpuError> {
self.vectors.clear();
// The GPU buffer is kept allocated for reuse; it is freed on drop
Ok(())
}
}
impl Drop for CudaVectorStorage {
fn drop(&mut self) {
unsafe {
if !self.d_vectors.is_null() {
cudaFree(self.d_vectors as *mut c_void);
}
}
}
}
// External CUDA kernel functions
extern "C" {
fn cuda_l2_distance(
query: *const f32,
vectors: *const f32,
distances: *mut f32,
n_vectors: usize,
dimension: usize,
stream: cudaStream_t,
) -> cudaError_t;
}
```
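The host-side top-k above sorts all N distances (O(N log N)); for large collections a partial selection is cheaper. A sketch using `select_nth_unstable_by`, shown for the lower-is-better Euclidean case:

```rust
// Partial top-k selection: O(N) average instead of a full sort.
fn top_k_smallest(mut scored: Vec<(usize, f32)>, k: usize) -> Vec<(usize, f32)> {
    let k = k.min(scored.len());
    if k == 0 {
        return Vec::new();
    }
    if k < scored.len() {
        // Partition so the k smallest distances end up in scored[..k]
        scored.select_nth_unstable_by(k - 1, |a, b| a.1.partial_cmp(&b.1).unwrap());
        scored.truncate(k);
    }
    // Only the k survivors need a full sort
    scored.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap());
    scored
}

fn main() {
    let distances = vec![(0, 3.0), (1, 0.5), (2, 2.0), (3, 1.0)];
    let top = top_k_smallest(distances, 2);
    assert_eq!(top, vec![(1, 0.5), (3, 1.0)]);
    println!("{:?}", top);
}
```

For the higher-is-better cosine/dot-product case, the comparator is simply reversed.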
### 2.5 CUDA Kernels
```cuda
// src/cuda/kernels.cu
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
extern "C" {
// Kernel for computing L2 (Euclidean) distances
__global__ void l2_distance_kernel(
const float* query,
const float* vectors,
float* distances,
int n_vectors,
int dimension
) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n_vectors) {
float sum = 0.0f;
const float* vector = vectors + idx * dimension;
for (int d = 0; d < dimension; d++) {
float diff = query[d] - vector[d];
sum += diff * diff;
}
distances[idx] = sqrtf(sum);
}
}
// Wrapper function
cudaError_t cuda_l2_distance(
const float* query,
const float* vectors,
float* distances,
size_t n_vectors,
size_t dimension,
cudaStream_t stream
) {
int threads_per_block = 256;
int num_blocks = (int)((n_vectors + threads_per_block - 1) / threads_per_block);
l2_distance_kernel<<<num_blocks, threads_per_block, 0, stream>>>(
query, vectors, distances, (int)n_vectors, (int)dimension
);
return cudaGetLastError();
}
} // extern "C"
```
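As noted in the storage code, the SGEMV path yields cosine similarity only if vectors are L2-normalized at insert time, since dot(a/|a|, b/|b|) = cos(a, b). A host-side sketch of that normalization step (the helper name is hypothetical; hive-gpu might instead normalize on the GPU):

```rust
// L2-normalize a vector in place so that a later dot product equals
// cosine similarity. Hypothetical helper for illustration.
fn l2_normalize(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    let mut a = vec![3.0, 4.0]; // norm 5
    let mut b = vec![4.0, 3.0]; // norm 5
    l2_normalize(&mut a);
    l2_normalize(&mut b);
    // cos = (3*4 + 4*3) / (5*5) = 24/25 = 0.96
    assert!((dot(&a, &b) - 0.96).abs() < 1e-6);
    println!("cosine = {}", dot(&a, &b));
}
```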
### 2.6 Error Types
```rust
// src/error.rs
#[derive(Debug, thiserror::Error)]
pub enum HiveGpuError {
// Existing errors...
#[error("CUDA error: {0}")]
CudaError(String),
#[error("cuBLAS error: {0}")]
CublasError(String),
#[error("No CUDA device available")]
NoDevice,
#[error("Invalid device ID: {0}")]
InvalidDeviceId(i32),
}
```
### 2.7 CUDA Tests
```rust
// tests/cuda_tests.rs
#[cfg(feature = "cuda")]
mod cuda_tests {
use hive_gpu::cuda::CudaContext;
use hive_gpu::{GpuContext, GpuVector, GpuDistanceMetric, GpuVectorStorage};
#[test]
fn test_cuda_availability() {
if CudaContext::is_available() {
println!("CUDA is available");
println!("Device count: {}", CudaContext::device_count().unwrap());
} else {
println!("CUDA not available, skipping test");
}
}
#[test]
fn test_cuda_device_info() {
if !CudaContext::is_available() {
println!("CUDA not available, skipping test");
return;
}
let ctx = CudaContext::new().expect("Failed to create CUDA context");
let info = ctx.device_info().expect("Failed to get device info");
println!("Device: {}", info.name);
println!("VRAM: {:.2} GB", info.vram_total as f64 / 1024.0 / 1024.0 / 1024.0);
println!("Driver: {}", info.driver_version);
println!("Compute Capability: {:?}", info.compute_capability);
assert_eq!(info.backend, "CUDA");
assert!(info.vram_total > 0);
}
#[test]
fn test_cuda_vector_operations() {
if !CudaContext::is_available() {
return;
}
let ctx = CudaContext::new().unwrap();
let mut storage = ctx.create_storage(128, GpuDistanceMetric::Cosine).unwrap();
// Add vectors
let v1 = GpuVector {
id: "v1".to_string(),
data: vec![1.0; 128],
metadata: Default::default(),
};
let v2 = GpuVector {
id: "v2".to_string(),
data: vec![0.5; 128],
metadata: Default::default(),
};
storage.add_vector(v1).unwrap();
storage.add_vector(v2).unwrap();
assert_eq!(storage.vector_count(), 2);
// Search
let query = vec![1.0; 128];
let results = storage.search(&query, 2).unwrap();
assert_eq!(results.len(), 2);
assert_eq!(results[0].id, "v1");
}
#[test]
fn test_cuda_batch_operations() {
if !CudaContext::is_available() {
return;
}
let ctx = CudaContext::new().unwrap();
let mut storage = ctx.create_storage(128, GpuDistanceMetric::Cosine).unwrap();
// Create 1000 vectors
let vectors: Vec<GpuVector> = (0..1000)
.map(|i| GpuVector {
id: format!("v{}", i),
data: vec![i as f32 / 1000.0; 128],
metadata: Default::default(),
})
.collect();
// Batch add
let start = std::time::Instant::now();
storage.add_vectors(&vectors).unwrap();
let duration = start.elapsed();
println!("Batch add 1000 vectors: {:?}", duration);
assert_eq!(storage.vector_count(), 1000);
// Search
let query = vec![0.5; 128];
let start = std::time::Instant::now();
let results = storage.search(&query, 10).unwrap();
let duration = start.elapsed();
println!("Search in 1000 vectors: {:?}", duration);
assert_eq!(results.len(), 10);
}
}
```
---
## 📦 Phase 3: ROCm Backend (Priority: ⚡ MEDIUM)
**Estimated Time:** 1-2 weeks
**Complexity:** High
**Impact:** Medium (15% of the market)
### Specification
Similar to CUDA, but using HIP (Heterogeneous-compute Interface for Portability):
```rust
// src/rocm/mod.rs
// src/rocm/context.rs
// src/rocm/storage.rs
```
**Dependencies:**
```toml
hip-runtime-sys = { version = "0.3", optional = true }
rocblas-sys = { version = "0.3", optional = true }
```
**HIP Kernels:**
```hip
// src/rocm/kernels.hip
// Syntax similar to CUDA, but using the HIP APIs
```
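To mirror the CUDA feature wiring, the ROCm backend would likely be gated behind its own Cargo feature; the feature name and wiring below are an assumption, not existing configuration:

```toml
# Cargo.toml (sketch)
[features]
rocm = ["hip-runtime-sys", "rocblas-sys"]
```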
---
## 📦 Phase 4: WebGPU Backend (Priority: 📦 LOW)
**Estimated Time:** 1 week
**Complexity:** Medium
**Impact:** Low-Medium (10% general use, universal fallback)
### Specification
```toml
# Cargo.toml
[dependencies]
wgpu = { version = "0.19", optional = true }
pollster = { version = "0.3", optional = true }
[features]
wgpu = ["dep:wgpu", "dep:pollster"]
```
```rust
// src/wgpu/context.rs
use wgpu::*;
use crate::{GpuContext, GpuDeviceInfo, HiveGpuError};
pub struct WgpuContext {
device: Device,
queue: Queue,
adapter: Adapter,
}
impl WgpuContext {
pub async fn new() -> Result<Self, HiveGpuError> {
let instance = Instance::new(InstanceDescriptor {
backends: Backends::all(),
..Default::default()
});
let adapter = instance
.request_adapter(&RequestAdapterOptions {
power_preference: PowerPreference::HighPerformance,
..Default::default()
})
.await
.ok_or(HiveGpuError::NoDevice)?;
let (device, queue) = adapter
.request_device(&DeviceDescriptor::default(), None)
.await
.map_err(|e| HiveGpuError::InitializationFailed(e.to_string()))?;
Ok(Self { device, queue, adapter })
}
pub fn is_available() -> bool {
pollster::block_on(async {
let instance = Instance::new(InstanceDescriptor::default());
instance.request_adapter(&RequestAdapterOptions::default())
.await
.is_some()
})
}
}
impl GpuContext for WgpuContext {
fn device_info(&self) -> Result<GpuDeviceInfo, HiveGpuError> {
let info = self.adapter.get_info();
let limits = self.device.limits();
Ok(GpuDeviceInfo {
name: info.name.clone(),
backend: format!("{:?}", info.backend),
vram_total: 0, // WebGPU doesn't expose this
vram_available: 0,
vram_used: 0,
driver_version: info.driver_info.clone(),
compute_capability: None,
max_threads_per_block: Some(limits.max_compute_workgroup_size_x as usize),
max_shared_memory: Some(limits.max_compute_workgroup_storage_size as usize),
device_id: 0,
pci_bus_id: None,
})
}
// ... rest of implementation
}
```
---
## 📦 Phase 5: Memory Pooling (Priority: ⚡ MEDIUM)
**Estimated Time:** 3-5 days
**Complexity:** Medium
**Impact:** Medium (performance optimization)
### Specification
```rust
// src/pool/memory_pool.rs
use std::collections::VecDeque;
use std::ffi::c_void;
use cuda_runtime_sys::{cudaFree, cudaMalloc};
use crate::cuda::context::cuda_check; // assumes cuda_check is exposed by the cuda module
use crate::HiveGpuError;
/// GPU memory pool for reducing allocation overhead
pub struct GpuMemoryPool {
free_buffers: VecDeque<GpuBuffer>,
buffer_size: usize,
max_buffers: usize,
allocated: usize, // total buffers created (free + in use)
}
pub struct GpuBuffer {
pub ptr: *mut c_void,
pub size: usize,
}
impl GpuMemoryPool {
pub fn new(buffer_size: usize, max_buffers: usize) -> Self {
Self {
free_buffers: VecDeque::new(),
buffer_size,
max_buffers,
allocated: 0,
}
}
pub fn allocate(&mut self) -> Result<GpuBuffer, HiveGpuError> {
// Try to reuse an existing free buffer first
if let Some(buffer) = self.free_buffers.pop_front() {
return Ok(buffer);
}
// Create a new buffer only while under the pool limit,
// counting buffers currently in use, not just free ones
if self.allocated < self.max_buffers {
let mut ptr = std::ptr::null_mut();
unsafe {
cuda_check(cudaMalloc(&mut ptr, self.buffer_size))?;
}
self.allocated += 1;
return Ok(GpuBuffer {
ptr,
size: self.buffer_size,
});
}
Err(HiveGpuError::OutOfMemory)
}
pub fn deallocate(&mut self, buffer: GpuBuffer) {
// Return the buffer to the free list for reuse
self.free_buffers.push_back(buffer);
}
pub fn clear(&mut self) {
while let Some(buffer) = self.free_buffers.pop_front() {
unsafe {
cudaFree(buffer.ptr);
}
self.allocated -= 1;
}
}
}
```
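The reuse behaviour can be exercised without a GPU by swapping the device allocation for a host allocation; this sketch shows that a returned buffer is handed back out instead of triggering a new allocation, and that the limit counts in-use buffers too:

```rust
use std::collections::VecDeque;

// Host-memory stand-in for the GPU pool, to illustrate the reuse pattern.
struct HostPool {
    free: VecDeque<Vec<u8>>,
    buffer_size: usize,
    max_buffers: usize,
    allocated: usize, // total buffers created (free + in use)
}

impl HostPool {
    fn new(buffer_size: usize, max_buffers: usize) -> Self {
        Self { free: VecDeque::new(), buffer_size, max_buffers, allocated: 0 }
    }
    fn allocate(&mut self) -> Option<Vec<u8>> {
        if let Some(buf) = self.free.pop_front() {
            return Some(buf); // reuse: no new allocation
        }
        if self.allocated < self.max_buffers {
            self.allocated += 1;
            return Some(vec![0u8; self.buffer_size]);
        }
        None // pool exhausted
    }
    fn deallocate(&mut self, buf: Vec<u8>) {
        self.free.push_back(buf);
    }
}

fn main() {
    let mut pool = HostPool::new(1024, 1); // at most one buffer
    let buf = pool.allocate().expect("first allocation succeeds");
    assert!(pool.allocate().is_none()); // limit reached while buffer is in use
    pool.deallocate(buf);
    assert!(pool.allocate().is_some()); // reused, not re-allocated
    assert_eq!(pool.allocated, 1); // still only one real allocation
    println!("pool reuse ok");
}
```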
---
## 📦 Phase 6: Async Operations (Priority: 📦 LOW)
**Estimated Time:** 3-5 days
**Complexity:** Medium
**Impact:** Low-Medium (better integration with async Rust)
### Specification
```rust
// src/lib.rs
#[async_trait::async_trait]
pub trait GpuVectorStorageAsync: Send + Sync {
async fn add_vector_async(&mut self, vector: GpuVector) -> Result<usize, HiveGpuError>;
async fn search_async(&self, query: &[f32], k: usize) -> Result<Vec<GpuSearchResult>, HiveGpuError>;
async fn add_vectors_async(&mut self, vectors: &[GpuVector]) -> Result<(), HiveGpuError>;
}
```
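One common way to implement this trait without backend changes is to run the blocking GPU call on a worker thread and hand the result back through a channel, which is essentially what `tokio::task::spawn_blocking` does. A dependency-free sketch of the pattern using std threads (the search function is a stand-in, not the real backend call):

```rust
use std::sync::mpsc;
use std::thread;

// Stand-in for a blocking GPU search call.
fn blocking_search(query: Vec<f32>, k: usize) -> Vec<(usize, f32)> {
    // Pretend the GPU returned k scored results
    (0..k).map(|i| (i, query[0] * i as f32)).collect()
}

// Offload the blocking call to a worker thread; the caller can await the
// receiver instead of blocking its own thread (the spawn_blocking pattern).
fn search_async(query: Vec<f32>, k: usize) -> mpsc::Receiver<Vec<(usize, f32)>> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let _ = tx.send(blocking_search(query, k));
    });
    rx
}

fn main() {
    let rx = search_async(vec![2.0; 128], 3);
    let results = rx.recv().unwrap(); // with Tokio this would be an .await
    assert_eq!(results, vec![(0, 0.0), (1, 2.0), (2, 4.0)]);
    println!("{:?}", results);
}
```

With Tokio the same shape becomes `spawn_blocking(move || storage.search(&query, k)).await`, keeping GPU synchronization off the async executor's threads.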
---
## 🧪 Testing Strategy
### Unit Tests
- ✅ Device detection per backend
- ✅ Context creation/destruction
- ✅ Memory allocation/deallocation
- ✅ Vector CRUD operations
- ✅ Search correctness
### Integration Tests
- ✅ Cross-backend consistency
- ✅ Batch operations
- ✅ Large dataset handling (1M+ vectors)
- ✅ Multi-GPU support
### Performance Tests
- ✅ Benchmark vs CPU
- ✅ Compare Metal vs CUDA vs ROCm
- ✅ Memory usage tracking
- ✅ Latency measurements
### CI/CD
```yaml
# .github/workflows/test.yml
name: Test
on: [push, pull_request]
jobs:
test-metal:
runs-on: macos-latest
steps:
- uses: actions/checkout@v3
- run: cargo test --features metal-native
test-cuda:
runs-on: ubuntu-latest # hosted runners have no GPU; CUDA tests skip at runtime
container: nvidia/cuda:12.0.0-devel-ubuntu22.04
steps:
- uses: actions/checkout@v3
- uses: dtolnay/rust-toolchain@stable # the CUDA image ships no Rust toolchain
- run: cargo test --features cuda
test-cpu:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- run: cargo test --no-default-features
```
---
## 📊 Roadmap and Priorities
### Q1 2025 - Foundational (Critical)
- [x] Metal backend (✅ Done)
- [ ] Device Info API (🔥 Critical - 1-2 days)
- [ ] CUDA backend (🔥 Critical - 1-2 weeks)
### Q2 2025 - Expansion (High Priority)
- [ ] ROCm backend (⚡ Medium - 1-2 weeks)
- [ ] Memory Pooling (⚡ Medium - 3-5 days)
- [ ] Performance benchmarks
- [ ] Documentation improvements
### Q3 2025 - Polish (Low Priority)
- [ ] WebGPU backend (📦 Low - 1 week)
- [ ] Async operations (📦 Low - 3-5 days)
- [ ] Multi-GPU support
- [ ] Advanced optimizations
---
## 📈 Expected Impact
### Before (Current State)
| Platform | GPU Support | Market Share |
|---|---|---|
| macOS | ✅ Metal | ~5% |
| Linux | ❌ None | ~75% |
| Windows | ❌ None | ~20% |
| **Total GPU Coverage** | **5%** | |
### After (With CUDA)
| Platform | GPU Support | Market Share |
|---|---|---|
| macOS | ✅ Metal | ~5% |
| Linux | ✅ CUDA | ~52% (NVIDIA) |
| Windows | ✅ CUDA | ~14% (NVIDIA) |
| **Total GPU Coverage** | **~71%** | |
### After (With CUDA + ROCm)
| Platform | GPU Support | Market Share |
|---|---|---|
| macOS | ✅ Metal | ~5% |
| Linux | ✅ CUDA + ROCm | ~67% |
| Windows | ✅ CUDA | ~14% |
| **Total GPU Coverage** | **~86%** | |
### After (Full Implementation)
| Platform | GPU Support | Market Share |
|---|---|---|
| macOS | ✅ Metal | ~5% |
| Linux | ✅ CUDA + ROCm + WebGPU | ~75% |
| Windows | ✅ CUDA + WebGPU | ~20% |
| **Total GPU Coverage** | **~100%** | |
---
## 💰 ROI Analysis
### Investment
- **Device Info API:** 1-2 days (~$2K)
- **CUDA Backend:** 1-2 weeks (~$10-20K)
- **ROCm Backend:** 1-2 weeks (~$10-20K)
- **WebGPU Backend:** 1 week (~$5-10K)
- **Total:** 4-6 weeks (~$27-52K)
### Returns
- **10-50x speedup** on GPU operations
- **~71% market coverage** with CUDA alone (Metal + CUDA combined)
- **Competitive advantage** vs CPU-only solutions
- **Reduced latency:** 10-30ms → 0.5-3ms
- **Higher throughput:** 100x more queries/second
### Break-even
- With just **10 enterprise customers** using GPU acceleration
- Or **1000+ community users** benefiting from GPU speedup
- **Estimated ROI:** 10-20x within first year
---
## 📞 Support & Contact
For questions or assistance with implementation:
- **GitHub Issues:** https://github.com/hivellm/hive-gpu/issues
- **Email:** dev@hivellm.com
- **Discord:** https://discord.gg/hivellm
---
## 📚 References
### CUDA
- CUDA Programming Guide: https://docs.nvidia.com/cuda/cuda-c-programming-guide/
- cuBLAS Documentation: https://docs.nvidia.com/cuda/cublas/
- cuda-runtime-sys crate: https://crates.io/crates/cuda-runtime-sys
### ROCm
- ROCm Documentation: https://rocmdocs.amd.com/
- HIP Programming Guide: https://rocm.docs.amd.com/projects/HIP/
- rocBLAS: https://rocm.docs.amd.com/projects/rocBLAS/
### WebGPU
- WebGPU Spec: https://www.w3.org/TR/webgpu/
- wgpu Rust: https://wgpu.rs/
- WGSL Shading Language: https://www.w3.org/TR/WGSL/
### Metal
- Metal Programming Guide: https://developer.apple.com/metal/
- Metal Shading Language: https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf
---
## ✅ Implementation Checklist
### Device Info API
- [ ] Define `GpuDeviceInfo` struct
- [ ] Add `device_info()` to `GpuContext` trait
- [ ] Implement for Metal
- [ ] Add tests
- [ ] Update documentation
### CUDA Backend
- [ ] Add CUDA dependencies to Cargo.toml
- [ ] Create build.rs for kernel compilation
- [ ] Implement `CudaContext`
- [ ] Implement `CudaVectorStorage`
- [ ] Write CUDA kernels
- [ ] Add error handling
- [ ] Write unit tests
- [ ] Write integration tests
- [ ] Benchmark performance
- [ ] Update documentation
### ROCm Backend
- [ ] Add ROCm/HIP dependencies
- [ ] Implement `RocmContext`
- [ ] Implement `RocmVectorStorage`
- [ ] Write HIP kernels
- [ ] Add tests
- [ ] Benchmark performance
- [ ] Update documentation
### WebGPU Backend
- [ ] Add wgpu dependency
- [ ] Implement `WgpuContext`
- [ ] Implement `WgpuVectorStorage`
- [ ] Write WGSL shaders
- [ ] Add tests
- [ ] Update documentation
### Memory Pooling
- [ ] Implement `GpuMemoryPool`
- [ ] Integrate with existing backends
- [ ] Add configuration options
- [ ] Benchmark improvements
- [ ] Update documentation
### Async Operations
- [ ] Define async trait
- [ ] Implement for all backends
- [ ] Add async tests
- [ ] Update examples
- [ ] Update documentation
---
**Last Updated:** 2025-01-07
**Version:** 1.0
**Status:** Ready for Implementation
**Next Steps:** Review with hive-gpu team, prioritize phases, begin implementation