hive-gpu 0.2.0

High-performance GPU acceleration for vector operations with Device Info API (Metal, CUDA, ROCm)
Documentation
# Changelog


All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.2.0] - 2026-04-19


### Added


- **Functional CUDA backend** for NVIDIA GPUs (Volta / sm_70+) on Linux and
  Windows, built on `cudarc 0.13` driver API + cuBLAS:
  - `CudaContext` wraps `cudarc::driver::CudaDevice` plus a cuBLAS handle;
    live device info via `cuDeviceGetAttribute`, `cuMemGetInfo`, and
    `cuDriverGetVersion`, plus PCI bus id formatting.
  - `CudaVectorStorage` backed by a single contiguous `CudaSlice<f32>` with
    batched `htod_copy` + `memcpy_dtod_sync` uploads. Adaptive capacity
    growth (2× / 1.5× / 1.2×) matching the Metal backend, soft-delete via
    `HashSet<usize>`, host-cached squared norms.
  - GPU search via cuBLAS SGEMV (trans=T on the column-major view): works
    for DotProduct, Cosine (SGEMV + norm normalisation), and Euclidean
    (L2² derived from dots + norms). Top-K on the CPU after a single
    `dtoh_sync_copy`.
- **CUDA IVF index** (`CudaIvfIndex`, phase4b) — inverted-file index on
  NVIDIA. k-means++ initialisation + Lloyd iterations for training,
  cuBLAS SGEMM for the assignment step (query × centroids), per-list
  residual refinement, and an adjustable `nprobe`. Validated on RTX
  4090: **3.67× faster than the brute-force baseline at 1 M vectors**,
  recall ≥ 0.95 on clustered data. Tests in `tests/cuda_ivf.rs`,
  benchmarks in `benches/cuda_ivf.rs`.
- **CUDA integration tests** (17 brute-force + IVF tests, all passing on
  RTX 4090): `tests/cuda_smoke.rs`, `tests/cuda_device_info.rs`,
  `tests/cuda_vector_ops.rs`, `tests/cuda_ivf.rs`. Each test is a no-op
  on hosts without a reachable driver, keeping CI green on GPU-less
  runners.
- **CUDA benchmarks** `benches/cuda_ops.rs` and `benches/cuda_ivf.rs`
  comparing GPU throughput against a naïve CPU reference.
- **GitHub Actions workflow** `.github/workflows/cuda-build.yml` runs
  check, clippy, fmt, and the test suite against the
  `nvidia/cuda:12.4.1-devel-ubuntu22.04` container image.
- **Metal brute-force via a real compute kernel** (phase4a, *authored
  blind*). Replaces the prior CPU-fallback shim with a
  `sgemv_dot.metal` compute shader driven through
  `MTLComputeCommandEncoder`. Supports DotProduct, Cosine, and
  Euclidean on Apple Silicon.
- **Metal IVF index** (`MetalIvfIndex`, phase4c, *authored blind*) —
  mirrors `CudaIvfIndex`, wired to two custom Metal compute kernels
  (`sgemv_dot.metal`, `sgemm_dot.metal`). Tests in
  `tests/metal_bruteforce.rs` and `tests/metal_ivf.rs`.
- **ROCm / HIP backend** (phase3b, *authored blind*) for AMD GPUs on
  Linux (gfx900–gfx1100). `RocmContext` + `RocmVectorStorage` +
  `RocmIvfIndex`, hand-rolled HIP FFI via `libloading`
  (`libamdhip64.so` + `librocblas.so`). Mirrors the CUDA architecture
  one-for-one. Tests in `tests/rocm_smoke.rs` and
  `tests/rocm_ivf.rs`.
- **Intel / Vulkan Compute backend** (phase3c, *authored blind*) for
  Intel Arc / Battlemage on Linux and Windows, with
  `HIVE_GPU_VULKAN_UNIVERSAL=1` fallback for any Vulkan 1.2 GPU.
  `IntelContext` + `IntelVectorStorage` + `IntelIvfIndex` built on
  `ash 0.38`. WGSL compute shaders (`sgemv_dot.wgsl`,
  `sgemm_dot.wgsl`) compiled to SPIR-V at build time via `naga`
  (pure-Rust, no CMake / C++ toolchain). Tests in
  `tests/intel_smoke.rs` and `tests/intel_ivf.rs`.
- `HiveGpuError` variants for every new backend: `CudaError`,
  `CublasError`, `HipError`, `RocblasError`, `RocmError`,
  `VulkanError`, `IntelError`, `SpirvCompileError`.
- `GpuBackendType::{Rocm, Intel}` in `src/backends/detector.rs`. New
  priority order is `Metal > CUDA > ROCm > Intel > CPU`, each probed
  with a real loader check — `is_rocm_available()` via `libloading`,
  `is_intel_available()` via Vulkan `vkEnumeratePhysicalDevices`.
- `IvfConfig` in `src/types.rs` — shared across all four IVF
  implementations (CUDA / Metal / ROCm / Intel).
- `build.rs` compiles the Intel WGSL shaders when the `intel` feature
  is active and emits rerun hints for CUDA kernel assets.
- Multi-backend analyses under `docs/analysis/{cuda,gcn,intel}/`
  documenting state, gaps, and phased plans.

### Changed


- Detection in `src/backends/detector.rs` now uses
  `cudarc::driver::result::init` + `get_count` instead of env-var
  inspection. Target-gated to Linux / Windows; macOS is unaffected.
- `cuda` Cargo feature now actually pulls in its dependency: `cuda =
  ["dep:cudarc"]` with `cudarc` declared in a target-gated
  `[target.'cfg(any(target_os = "linux", target_os = "windows"))']`
  block carrying `driver`, `cublas`, `cuda-12040`, and
  `dynamic-linking` features.
- `default-features` resolves to nothing on non-macOS hosts — every
  backend dep is target- and feature-gated, so the crate builds
  clean everywhere with default features.
- Removed project-wide `#![allow(warnings)]` from `src/lib.rs`.
  Cleaned up 24 latent warnings (unused imports, underscore-prefixed
  params, scoped `#[allow(dead_code)]` on struct fields still being
  populated by follow-up phases). `cargo clippy --all-features --lib
  --tests --benches -- -D warnings` is now part of the quality gate.
- `docs/benchmarks/PERFORMANCE.md` updated with RTX 4090 baseline
  numbers, CUDA IVF head-to-head vs. brute-force, and the CUDA test
  suite summary.
- `docs/ROADMAP.md` reflects the actual ship order with explicit
  validated-vs-blind status per backend.

### Breaking


- None at the public-API level; existing Metal code paths and trait
  signatures are unchanged. The `cuda` feature behaviour changed from
  a compile-time no-op in 0.1.x to a fully functional backend in
  0.2.0.

### Status notes


Only the CUDA path (brute-force + IVF) has been validated on real
hardware (RTX 4090) in the 0.2.0 release window. Metal (real kernel +
IVF), ROCm, and Intel ship as **authored blind** — the code
cross-compiles, passes `clippy -D warnings`, and has a complete test
suite, but has never executed against the target hardware. Follow-up
validation tasks are live in `.rulebook/tasks/` and will ship a
minor bump each once the corresponding maintainer runs them:

- `phase4d_validate-metal-backend-on-mac` — Metal brute-force + IVF
- `phase4e_validate-rocm-backend-on-amd` — ROCm brute-force + IVF
- `phase4f_validate-intel-backend-on-vulkan` — Intel / Vulkan
  brute-force + IVF

## [0.1.10] - 2025-11-04


### Fixed


- **Clippy Build Error**: Moved `Duration` and `Instant` imports inside conditional compilation block in `tests/gpu_stress_tests.rs`
  - Imports were causing unused import warnings when compiled without `metal-native` feature
  - Now properly scoped within `#[cfg(all(target_os = "macos", feature = "metal-native"))]` module
  - All clippy checks passing with `-D warnings`

## [0.1.9] - 2025-11-04


### Added


- **Comprehensive GPU Test Suite (72 tests total)**
  - GPU Detection Tests (9 tests): Metal device availability, name retrieval, capabilities, multiple contexts, VRAM query, backend detection, performance info
  - Vector Operations Tests (11 tests): Small/medium/large vector addition, cosine similarity, orthogonal vectors, Euclidean distance, batch operations, search accuracy, edge cases
  - Memory Management Tests (10 tests): Small/medium/large buffer allocation, multiple allocations, deallocation, repeated cycles, memory reuse, clear vectors, stress tests
  - VRAM Monitoring Tests (10 tests): Tracking accuracy, percentage calculation, available VRAM checks, usage during allocation, consistency, monitoring over time, pressure detection
  - Integration Tests (9 tests): Basic operations, HNSW construction, error handling, distance metrics, VRAM monitoring, cross-backend compatibility
  - Device Info Tests (4 tests): Metal device info query, VRAM usage percent, availability checks, convenience methods
  - **Performance Benchmarks (10 tests)**: 
    - Vector addition throughput: 3,740 vectors/sec
    - Search latency: 0.92 μs (k=10), 1.08M queries/sec
    - Memory bandwidth: 8+ MB/s
    - Dimension scaling: 64D to 1024D
    - Vector count scaling: 100 to 5000
    - Cold vs warm performance comparison
    - Distance metric performance comparison
    - Concurrent operations testing
    - Performance baseline validation
  - **Stress Tests (9 tests)**:
    - Sustained load: 5000 vectors, 3728 vec/sec
    - Maximum capacity: 10K vectors, 4250 vec/sec
    - Memory pressure scenarios
    - Rapid allocation cycles: 50 cycles
    - Sustained search load: 2000+ QPS
    - Mixed read/write workload
    - Long-running stability (5 seconds)
    - Error recovery validation
    - Concurrent high load: 10 storages

- **Test Infrastructure**
  - `scripts/run-gpu-tests.sh`: Comprehensive test runner with detailed output
  - Enhanced Git hooks with test suite information in pre-commit and pre-push
  - All tests run on real Metal backend (Apple M3 Pro)
  - Tests validate real GPU operations, memory management, VRAM usage, and performance

- **Documentation**
  - `docs/guides/DEVICE_INFO_IMPLEMENTATION.md`: Complete implementation guide for Device Info API
    - Architecture and design decisions
    - Platform-specific details (Metal, CUDA)
    - Real-world usage examples
    - Common pitfalls and best practices
    - Performance considerations
    - Migration guide from old code
    - Test coverage overview
  - `docs/guides/GIT_HOOKS_TESTING.md`: Git hooks configuration and testing guide (created earlier)

### Changed


- Updated Git pre-commit hook to display detailed test suite information
- Updated Git pre-push hook with comprehensive test count (78 tests total)
- Enhanced test output with performance metrics and real GPU data

### Performance


Real-world benchmarks on Apple M3 Pro:
- **Search Performance**: 1.08M queries/sec (k=10), 0.92 μs latency
- **Vector Addition**: 3,740 vectors/sec sustained, 4,250 vectors/sec peak (10K vectors)
- **Memory Bandwidth**: 8+ MB/s effective (including Metal overhead)
- **Scalability**: Linear scaling from 100 to 5000 vectors
- **Sustained Load**: Stable operation at 3,728 vec/sec over 5 seconds
- **Concurrent Load**: 10 storages with 500 vectors each, all successful

### Testing


- Total test count: 78 tests (72 functional + 6 doc tests)
- All tests passing on Apple M3 Pro with Metal backend
- Stress tests validate system stability under extreme load
- Performance benchmarks establish baseline for CI/CD validation

## [0.1.8] - 2025-11-04


### Changed


- **BREAKING: Migrated from discontinued `metal-rs` to `objc2-metal` ecosystem**
  - Replaced `metal 0.27` with `objc2-metal 0.3.2` (actively maintained)
  - Replaced `objc 0.2` with `objc2 0.6.3` (modern, type-safe bindings)
  - Added `objc2-foundation 0.3.2` for Foundation framework support
  - Updated all Metal bindings to use `ProtocolObject<dyn MTLDevice>` pattern
  - Migrated buffer operations to objc2-metal API (camelCase method names)
  - All Metal-specific code now uses objc2-metal traits (MTLDevice, MTLCommandQueue, MTLCommandEncoder, etc.)
  - Complete migration of `src/metal/context.rs`, `src/metal/vector_storage.rs`, `src/metal/buffer_pool.rs`, `src/backends/detector.rs`
  - All 21 tests passing (unit, integration, doc tests)
  - Zero clippy warnings
  - **Security**: Removed dependency on discontinued library with no security updates
  - **Maintenance**: Now using actively maintained crates from objc2 ecosystem
  - **Type Safety**: Improved type safety with modern Objective-C bindings
  - **Future-Proof**: Foundation for continued macOS/Metal development

### Fixed


- Metal device detection now uses `MTLCreateSystemDefaultDevice()` from objc2-metal
- Buffer creation uses proper `MTLResourceOptions` and type-safe methods
- Command buffer and blit encoder creation using objc2-metal patterns

### Internal


- OpenSpec change `migrate-to-objc2-metal` tracking migration progress
- Created rollback tag `pre-objc2-migration` for safety
- Comprehensive migration documentation in `docs/guides/MIGRATION_METAL_OBJC2.md`

## [0.1.7] - 2025-11-03


### Added


- **Device Info API** (Phase 2) - Comprehensive GPU device information API
  - New `GpuDeviceInfo` struct with detailed hardware information:
    - VRAM tracking (total, available, used bytes)
    - Driver version and compute capability
    - Hardware limits (max threads per block, max shared memory)
    - Backend identification (Metal, CUDA, ROCm, wgpu)
    - Device ID and PCI bus ID (where applicable)
  - New `device_info()` method on `GpuContext` trait (returns `Result<GpuDeviceInfo>`)
  - Helper methods:
    - `vram_usage_percent()` - Calculate VRAM usage percentage
    - `has_available_vram(bytes)` - Check if sufficient VRAM available
    - `total_vram_mb()` / `available_vram_mb()` - Convenient MB conversions
  - Full Metal backend implementation with macOS version detection
  - Placeholder implementations for CUDA and wgpu backends
- Comprehensive test suite for Device Info API (5 tests, 100% passing)
- OpenSpec changes for future implementations:
  - `add-device-info-api` - Device Info API specification
  - `add-cuda-backend` - CUDA backend specification (43 tasks)
  - `add-rocm-backend` - ROCm backend specification (46 tasks)
  - `add-memory-pooling` - Memory pooling optimization (33 tasks)
- Complete project documentation:
  - `docs/API_REFERENCE.md` - API documentation
  - `docs/ARCHITECTURE.md` - System architecture
  - `docs/DEVELOPMENT.md` - Development guide
  - `docs/ROADMAP.md` - Project roadmap
  - `docs/DAG.md` - Component dependencies
  - `docs/PERFORMANCE.md` - Performance benchmarks
  - `docs/INTEGRATION_GUIDE.md` - Integration examples
- CI/CD workflows:
  - Rust testing workflow
  - Rust linting workflow
  - Codespell workflow
- Project governance files:
  - `CODE_OF_CONDUCT.md`
  - `CONTRIBUTING.md`
  - `SECURITY.md`
  - `AGENTS.md` - AI assistant rules

### Changed


- Updated `GpuContext` trait to return `Result<GpuDeviceInfo>` instead of `GpuDeviceInfo`
- Improved error handling across all backends
- Enhanced documentation with comprehensive examples

### Fixed


- Fixed unused imports in benchmarks
- Fixed clippy warnings in test files
- Fixed doctest compilation errors

## [0.1.6] - Previous Release


### Added


- Initial Metal Native backend implementation
- Basic CUDA and wgpu placeholder implementations
- Vector storage and HNSW graph operations
- Core traits and types

[0.1.7]: https://github.com/hivellm/hive-gpu/compare/v0.1.6...v0.1.7
[0.1.6]: https://github.com/hivellm/hive-gpu/releases/tag/v0.1.6