# hive-gpu 0.2.0
GPU acceleration for vector similarity search, written in Rust.
Four GPU backends, all feature-gated and target-gated so the crate builds cleanly on every host (a gating sketch follows the list):

- Metal — Apple Silicon (M-series), built on `objc2-metal`. Brute-force search and IVF via custom compute kernels. IVF and the new real-kernel brute-force path are authored but not yet validated on macOS — see `.rulebook/tasks/phase4d_validate-metal-backend-on-mac`.
- CUDA — NVIDIA (Volta / sm_70+) on Linux and Windows, built on the `cudarc` driver API + cuBLAS SGEMV/SGEMM. Brute-force + IVF. Validated on an RTX 4090 (3.67× over brute-force at 1 M vectors).
- ROCm — AMD (gfx900 through gfx1100) on Linux, via hand-rolled HIP FFI + rocBLAS. Brute-force + IVF. Authored blind — see `.rulebook/tasks/phase4e_validate-rocm-backend-on-amd`.
- Intel / Vulkan Compute — Intel Arc / Battlemage (preferred) on Linux and Windows, with a `HIVE_GPU_VULKAN_UNIVERSAL=1` fallback for any Vulkan 1.2 GPU. Built on `ash` + WGSL shaders compiled to SPIR-V via `naga`. Brute-force + IVF. Authored blind — see `.rulebook/tasks/phase4f_validate-intel-backend-on-vulkan`.
Design notes for each backend live in `docs/analysis/`.
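Concretely, the feature-and-target gating comes down to `cfg` attributes on each backend module. A minimal `lib.rs`-shaped sketch (module names are illustrative, not the crate's actual layout):

```rust
// Each backend compiles only when both its feature flag and its target
// match; on any other host the module simply does not exist.
#[cfg(all(feature = "metal-native", target_os = "macos"))]
pub mod metal;
#[cfg(all(feature = "cuda", any(target_os = "linux", target_os = "windows")))]
pub mod cuda;
#[cfg(all(feature = "rocm", target_os = "linux"))]
pub mod rocm;
#[cfg(all(feature = "intel", any(target_os = "linux", target_os = "windows")))]
pub mod intel;
```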
## What's new in 0.2.0
- 🔥 CUDA backend is functional. `CudaContext`, `CudaVectorStorage`, GPU-accelerated search (cuBLAS SGEMV for Cosine/DotProduct, derived L2), and a full IVF index (`CudaIvfIndex` — k-means++ + cuBLAS SGEMM) all run against a real driver. 3.67× over brute-force at 1 M vectors, recall ≥ 0.95 on clustered data.
- Metal gets a real brute-force compute kernel (replacing the prior CPU shim) plus `MetalIvfIndex`. Authored blind — awaits Apple Silicon validation.
- ROCm / HIP backend for AMD GPUs on Linux — `RocmContext` + `RocmVectorStorage` + `RocmIvfIndex`, hand-rolled HIP FFI over `libamdhip64` + `librocblas`. Authored blind — awaits AMD validation.
- Intel / Vulkan Compute backend — `IntelContext` + `IntelVectorStorage` + `IntelIvfIndex` on `ash`, WGSL shaders compiled to SPIR-V at build time via `naga` (pure Rust, no CMake / C++ toolchain). Works on any Vulkan 1.2 GPU under `HIVE_GPU_VULKAN_UNIVERSAL=1`. Authored blind — awaits Vulkan-host validation.
- Real device-info API on CUDA — compute capability, total/free VRAM, driver version, PCI bus id — all queried live from the driver.
- Dynamic buffer growth with device-to-device copy, mirroring the Metal backend's shape (2× → 1.5× → 1.2× adaptive factor; see the sketch after this list).
- Criterion benchmarks under `benches/cuda_ops.rs` and `benches/cuda_ivf.rs`.
- CI job (`.github/workflows/cuda-build.yml`) builds against the official `nvidia/cuda:12.4.1-devel-ubuntu22.04` image.
- Project-wide `#![allow(warnings)]` removed; clippy runs with `-D warnings` on all feature combinations.
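The buffer-growth bullet above describes an adaptive factor; here is a sketch of that policy, with byte thresholds invented for illustration:

```rust
// Grow 2x while small, 1.5x mid-size, 1.2x once large: small stores grow
// cheaply, large ones stop over-allocating VRAM. Thresholds are made up.
fn grown_capacity(current_bytes: usize, required_bytes: usize) -> usize {
    let factor = if current_bytes < 256 << 20 {
        2.0
    } else if current_bytes < 1 << 30 {
        1.5
    } else {
        1.2
    };
    let grown = (current_bytes as f64 * factor).ceil() as usize;
    grown.max(required_bytes)
}
```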
Only CUDA is validated on real hardware (RTX 4090). Metal, ROCm, and Intel are code-complete but pending maintainer validation — see `.rulebook/tasks/phase4{d,e,f}_*` for the validation checklists.
Full changelog in `CHANGELOG.md`.
## Installation

```toml
[dependencies]

# macOS — Metal backend (default)
hive-gpu = "0.2.0"

# Linux / Windows — CUDA backend
hive-gpu = { version = "0.2.0", default-features = false, features = ["cuda"] }

# Linux — AMD ROCm / HIP backend
hive-gpu = { version = "0.2.0", default-features = false, features = ["rocm"] }

# Linux / Windows — Intel / Vulkan Compute backend (also works as a universal
# Vulkan fallback on any Vulkan 1.2 GPU under HIVE_GPU_VULKAN_UNIVERSAL=1)
hive-gpu = { version = "0.2.0", default-features = false, features = ["intel"] }

# Everything (cross-platform crate — each cfg is gated internally)
hive-gpu = { version = "0.2.0", features = ["metal-native", "cuda", "rocm", "intel"] }
```
Runtime requirements:

- CUDA — NVIDIA driver (no CUDA Toolkit required — `cudarc` is built with `dynamic-linking`).
- ROCm — ROCm 6.x runtime with `libamdhip64.so` and `librocblas.so` on the dynamic linker path (a loading sketch follows this list).
- Intel — Vulkan 1.2 loader (`libvulkan.so.1` / `vulkan-1.dll`), shipped with any recent GPU driver.
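The ROCm requirement works because the backend resolves HIP at runtime rather than at link time. A sketch of that pattern with `libloading` (`hipGetDeviceCount` is real HIP API; the wrapper itself is illustrative, not crate code):

```rust
use libloading::{Library, Symbol};

fn hip_device_count() -> Result<i32, Box<dyn std::error::Error>> {
    unsafe {
        // dlopen the HIP runtime and resolve one symbol from it.
        let hip = Library::new("libamdhip64.so")?;
        let get_count: Symbol<unsafe extern "C" fn(*mut i32) -> i32> =
            hip.get(b"hipGetDeviceCount")?;
        let mut n = 0;
        match get_count(&mut n) {
            0 => Ok(n), // hipSuccess
            e => Err(format!("hipGetDeviceCount returned status {e}").into()),
        }
    }
}
```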
For a development checkout you also need a reachable driver so integration tests can hit real hardware; without one, each suite runs as a no-op.
## Quick start

### Metal (macOS)

A minimal sketch; the import path and constructor shape are assumptions, with `examples/metal_basic.rs` as the canonical version:

```rust
use hive_gpu::metal::MetalNativeContext; // import path assumed

let ctx = MetalNativeContext::new()?; // constructor shape assumed
```
### CUDA (Linux / Windows)

Same caveat; `examples/cuda_basic.rs` is the canonical version:

```rust
use hive_gpu::cuda::{CudaContext, CudaVectorStorage}; // import paths assumed
```
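For a feel of the end-to-end flow, a hypothetical sketch; every constructor and method name here is an assumption about the API, and `examples/cuda_basic.rs` remains authoritative:

```rust
use hive_gpu::cuda::{CudaContext, CudaVectorStorage}; // paths assumed

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open device 0 and a 128-dim f32 store (names and signatures assumed).
    let ctx = CudaContext::new(0)?;
    let mut store = CudaVectorStorage::new(&ctx, 128)?;

    // 1 000 dummy vectors, flattened row-major, then a top-10 query.
    store.add_vectors(&vec![0.5f32; 1_000 * 128])?;
    let hits = store.search(&vec![0.5f32; 128], 10)?;
    println!("{hits:?}");
    Ok(())
}
```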
See `examples/cuda_basic.rs` and `examples/metal_basic.rs` for runnable variants.
## Performance

Two data points captured on real hardware. All numbers are median wall-clock times from Criterion benchmarks (`cargo bench`).
### CUDA — NVIDIA GeForce RTX 4090 (24 GB, driver 591.59, CUDA 13.1)
Search latency — DotProduct, 128-dim f32, top-10 (from `benches/cuda_ops.rs`):
| N | GPU (cuBLAS SGEMV) | CPU naïve reference | GPU speedup |
|---|---|---|---|
| 1 000 | 124 µs | 63 µs | 0.51× |
| 10 000 | 287 µs | 690 µs | 2.40× |
| 100 000 | 4.01 ms | 13.04 ms | 3.25× |
For N < 10 K the SGEMV launch plus host-to-device copy dominates the useful work, so a scalar CPU loop wins. From 10 K onward the GPU pulls ahead, and the margin widens roughly linearly with N.
Add-vectors throughput (128-dim f32):

| Batch size | Wall-clock | Throughput |
|---|---|---|
| 1 000 | 431 µs | 2.32 M vectors/s |
| 10 000 | 7.10 ms | 1.41 M vectors/s |
### Metal — Apple M3 Pro
| Operation | CPU baseline | Metal | Speedup |
|---|---|---|---|
| Vector addition (sustained) | 1 000 vec/s | 3 728 vec/s | 3.7× |
| Vector addition (peak 10 K) | 1 000 vec/s | 4 250 vec/s | 4.25× |
| Search latency (k = 10) | ~1 ms | 0.92 µs | ~1 000× |
| Search throughput | — | 1.08 M qps | — |
Full methodology, hardware matrix, and historical runs live in `docs/benchmarks/PERFORMANCE.md`.
## GPU backend matrix
| OS | Metal | CUDA | ROCm | Intel | CPU |
|---|---|---|---|---|---|
| macOS (Apple Silicon) | 🟡 | ❌ | ❌ | ❌ | ✅ |
| Linux x86_64 + NVIDIA | ❌ | ✅ | ❌ | 🟡 | ✅ |
| Linux x86_64 + AMD | ❌ | ❌ | 🟡 | 🟡 | ✅ |
| Linux x86_64 + Intel Arc | ❌ | ❌ | ❌ | 🟡 | ✅ |
| Windows x86_64 + NVIDIA | ❌ | ✅ | ❌ | 🟡 | ✅ |
| Windows x86_64 + AMD | ❌ | ❌ | ❌ | 🟡 | ✅ |
| Windows x86_64 + Intel Arc | ❌ | ❌ | ❌ | 🟡 | ✅ |
Legend: ✅ shipping and validated · 🟡 code-complete, pending hardware validation · ❌ unsupported.
On Linux/Windows the Intel / Vulkan backend doubles as a universal Vulkan-Compute fallback when `HIVE_GPU_VULKAN_UNIVERSAL=1` is set.

Backend-selection order at runtime is Metal > CUDA > ROCm > Intel > CPU (sketched below). Override via the `HIVE_GPU_BACKEND` env var (planned).
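A sketch of what that fallback chain can look like; the enum and the probe functions are hypothetical stand-ins, not the crate's public API (a real probe would attempt to open a device, not just test the compile-time flag):

```rust
enum Backend { Metal, Cuda, Rocm, Intel, Cpu }

// Stand-in probes; real ones would try to open an actual device.
fn metal_probe() -> bool { cfg!(all(feature = "metal-native", target_os = "macos")) }
fn cuda_probe() -> bool { cfg!(feature = "cuda") }
fn rocm_probe() -> bool { cfg!(feature = "rocm") }
fn vulkan_probe() -> bool { cfg!(feature = "intel") }

// Walk the priority order Metal > CUDA > ROCm > Intel, else fall back to CPU.
fn pick_backend() -> Backend {
    if metal_probe() { return Backend::Metal; }
    if cuda_probe() { return Backend::Cuda; }
    if rocm_probe() { return Backend::Rocm; }
    if vulkan_probe() { return Backend::Intel; }
    Backend::Cpu
}
```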
## Feature flags

| Feature | Target OS | Pulls in |
|---|---|---|
| `metal-native` | macOS | `objc2-metal`, `objc2-foundation`, `objc2` |
| `cuda` | Linux / Windows | `cudarc` (driver + cublas + `dynamic-linking`) |
| `rocm` | Linux | `libloading` (dlopens `libamdhip64.so` + `librocblas.so`) |
| `intel` | Linux / Windows | `ash` (Vulkan loader) + `naga` (WGSL → SPIR-V at build time) |
`metal-native` is the default. On non-macOS hosts the default feature contributes nothing (its deps are target-gated), so the crate compiles cleanly everywhere with default features.
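That works because each backend dependency sits in a target-specific table and is optional — roughly this shape in `Cargo.toml` (illustrative, version numbers included, not the crate's actual manifest):

```toml
# Only considered when compiling for macOS, and only if the feature is on.
[target.'cfg(target_os = "macos")'.dependencies]
objc2-metal = { version = "0.2", optional = true }

[features]
default = ["metal-native"]
# On non-macOS hosts this feature resolves to nothing, so builds stay clean.
metal-native = ["dep:objc2-metal"]
```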
## Testing and benchmarks

Each backend's suite runs with the matching feature from the table above:

```sh
# Metal (macOS)
cargo test --features metal-native

# CUDA (Linux / Windows with an NVIDIA driver installed)
cargo test --no-default-features --features cuda

# ROCm (Linux with ROCm 6.x installed)
cargo test --no-default-features --features rocm

# Intel / Vulkan (Linux / Windows with a Vulkan 1.2 GPU)
# On a non-Intel Vulkan GPU, set HIVE_GPU_VULKAN_UNIVERSAL=1 first.
cargo test --no-default-features --features intel
```
Every suite is a no-op on hosts without a reachable driver, so they stay green on CI runners that lack GPU hardware.
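That guard amounts to an early return when no device opens; a sketch, with the import path and constructor name assumed:

```rust
use hive_gpu::cuda::CudaContext; // import path assumed

#[test]
fn gpu_search_matches_cpu_reference() {
    // Skip (and stay green) when no CUDA driver is reachable on this host.
    let Ok(ctx) = CudaContext::new(0) else {
        eprintln!("no CUDA device reachable; skipping");
        return;
    };
    // ... real assertions against the live device go here ...
    let _ = ctx;
}
```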
## Roadmap

- v0.2.x — hardware validation pass for the three blind backends (Metal, ROCm, Intel). Each ships its own point release once the matching `phase4{d,e,f}` task lands benchmarks and test results on real hardware.
- v0.4 — GPU HNSW construction and search across all four backends, quantization (PQ / SQ), GPU-side top-K (radix select).

Detailed roadmap in `docs/ROADMAP.md`.
## Project documentation

- `docs/analysis/` — backend implementation analyses (CUDA, ROCm, Intel) with gap analysis, architectural decisions, and phased plans.
- `docs/benchmarks/PERFORMANCE.md` — full performance guide and historical numbers.
- `docs/ROADMAP.md` — release plan.
- `CHANGELOG.md` — release notes.
- `CONTRIBUTING.md` — contribution guide.
## License

Apache 2.0 — see `LICENSE`.