pub struct GpuAccelerator { /* private fields */ }
GPU-accelerated vector search engine.
Upload vectors once, then run many searches against them. If GPU initialization fails, callers should fall back to CPU SIMD.
Vectors are automatically split into chunks when they exceed the device’s
max_storage_buffer_binding_size (typically 128 MB). Searches dispatch
against each chunk and merge results transparently.
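As a rough sketch of the chunking arithmetic (an illustration, not the crate's actual code): each chunk holds as many whole vectors as fit under the binding limit, so a chunk boundary never splits a vector.

```rust
// Sketch (assumed behavior, not the crate's internals): split a flat [N×D]
// f32 array into index ranges that each fit under `max_binding_size` bytes,
// cutting only at whole-vector boundaries.
fn chunk_ranges(num_vectors: usize, dim: usize, max_binding_size: usize) -> Vec<(usize, usize)> {
    let bytes_per_vector = dim * std::mem::size_of::<f32>();
    let vectors_per_chunk = (max_binding_size / bytes_per_vector).max(1);
    let mut ranges = Vec::new();
    let mut start = 0;
    while start < num_vectors {
        let end = (start + vectors_per_chunk).min(num_vectors);
        ranges.push((start, end));
        start = end;
    }
    ranges
}
```

With a 128 MB limit and 1024-dimensional f32 vectors (4 KB each), one chunk would hold 32,768 vectors.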
Implementations
impl GpuAccelerator
pub fn is_available() -> bool
Check if any GPU is available on this system.
pub fn new() -> Result<Self>
Initialize the best available GPU device.
Requests the adapter’s maximum buffer limits so that chunking only kicks in when truly necessary.
pub fn device_info(&self) -> &DeviceInfo
Returns information about the active GPU device.
pub fn max_storage_buffer_binding_size(&self) -> u32
Returns the device's maximum storage buffer binding size in bytes; uploads larger than this are split into chunks.
pub fn upload_vectors(&mut self, vectors: &[f32], dim: usize) -> Result<()>
Upload a flat array of vectors to GPU memory. Automatically splits into chunks when data exceeds the binding limit.
pub fn upload_norms(&mut self, norms: &[f32]) -> Result<()>
Upload pre-computed L2 norms, split to match vector chunk layout.
pub fn cosine_search(
    &self,
    query: &[f32],
    k: usize,
) -> Result<Vec<(usize, f32)>>
Cosine similarity search: returns top-k (index, score) pairs, highest first. Dispatches against each vector chunk and merges results.
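A CPU reference (`cosine_topk` is a hypothetical name; this is not the crate's code) for the documented result: top-k `(index, score)` pairs, highest score first, with per-chunk indices offset back into the global vector array when results are merged.

```rust
// CPU sketch of cosine top-k across chunks. `chunks` is a list of flat
// [n_i×dim] slices; returned indices are global across all chunks.
fn cosine_topk(chunks: &[&[f32]], dim: usize, query: &[f32], k: usize) -> Vec<(usize, f32)> {
    let qn = query.iter().map(|x| x * x).sum::<f32>().sqrt();
    let mut all = Vec::new();
    let mut offset = 0;
    for chunk in chunks {
        for (i, v) in chunk.chunks_exact(dim).enumerate() {
            let dot: f32 = v.iter().zip(query).map(|(a, b)| a * b).sum();
            let vn = v.iter().map(|x| x * x).sum::<f32>().sqrt();
            all.push((offset + i, dot / (qn * vn)));
        }
        offset += chunk.len() / dim;
    }
    // Highest similarity first, then keep the top k.
    all.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    all.truncate(k);
    all
}
```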
pub fn batch_cosine_search(
    &self,
    queries: &[Vec<f32>],
    k: usize,
) -> Result<Vec<Vec<(usize, f32)>>>
Batch cosine search: multiple queries at once.
pub fn l2_search(&self, query: &[f32], k: usize) -> Result<Vec<(usize, f32)>>
L2 distance search: returns top-k (index, distance) pairs, smallest first.
pub fn compute_norms(&self) -> Result<Vec<f32>>
Compute L2 norms for all uploaded vectors on GPU.
pub fn compute_norms_gpu(&self, vectors: &[f32], dim: usize) -> Result<Vec<f32>>
Compute L2 norms from raw vectors (not previously uploaded). Handles chunking automatically for large inputs.
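The CPU equivalent of what both norm methods compute is one L2 norm per row of the flat `[N×D]` array — a sketch:

```rust
// CPU sketch of per-vector L2 norms over a flat [N×D] f32 array.
fn l2_norms(vectors: &[f32], dim: usize) -> Vec<f32> {
    vectors
        .chunks_exact(dim)
        .map(|v| v.iter().map(|x| x * x).sum::<f32>().sqrt())
        .collect()
}
```

These are the values `upload_norms` expects, letting `cosine_search` skip recomputing them per query.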
pub fn batch_dot_product(
    &self,
    queries_flat: &[f32],
    num_queries: usize,
) -> Result<Vec<f32>>
Batch dot product: queries [Q×D] × vectors [N×D] -> flat [Q×N] scores.
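A CPU reference for the flat `[Q×N]` layout described above (a sketch, not the crate's code): the score for query `q` against vector `n` lives at index `q * N + n`.

```rust
// CPU sketch of the batched dot product: queries [Q×D] × vectors [N×D]
// -> flat [Q×N], row-major by query.
fn batch_dot(queries: &[f32], vectors: &[f32], dim: usize) -> Vec<f32> {
    let nq = queries.len() / dim;
    let nv = vectors.len() / dim;
    let mut out = vec![0.0f32; nq * nv];
    for q in 0..nq {
        for n in 0..nv {
            out[q * nv + n] = (0..dim)
                .map(|d| queries[q * dim + d] * vectors[n * dim + d])
                .sum();
        }
    }
    out
}
```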
pub fn distance_matrix(
    &self,
    queries: &[f32],
    vectors: &[f32],
    dim: usize,
) -> Result<Vec<Vec<f32>>>
Compute L2 distance matrix: queries × vectors -> Q×N distances. Uses 16×16 workgroup tiling for cache efficiency.
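A CPU sketch of the Q×N result shape (the 16×16 workgroup tiling is a GPU-side optimization and is not reproduced here): row `q` holds the L2 distances from query `q` to every vector.

```rust
// CPU sketch of the L2 distance matrix: one row per query, one column
// per vector, each entry sqrt(sum((q_d - v_d)^2)).
fn l2_distance_matrix(queries: &[f32], vectors: &[f32], dim: usize) -> Vec<Vec<f32>> {
    queries
        .chunks_exact(dim)
        .map(|q| {
            vectors
                .chunks_exact(dim)
                .map(|v| {
                    q.iter()
                        .zip(v)
                        .map(|(a, b)| (a - b) * (a - b))
                        .sum::<f32>()
                        .sqrt()
                })
                .collect()
        })
        .collect()
}
```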
pub fn f16_to_f32_batch(&self, f16_bits: &[u16]) -> Result<Vec<f32>>
Convert f16 values (as raw u16 bits) to f32 on the GPU.
pub fn f32_to_f16_batch(&self, values: &[f32]) -> Result<Vec<u16>>
Convert f32 values to f16 (as raw u16 bits) on the GPU.
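The `u16` bit pattern is standard IEEE 754 binary16: 1 sign bit, 5 exponent bits (bias 15), 10 mantissa bits. A scalar CPU decoder for that layout (an illustration of the format, not the crate's kernel):

```rust
// Decode one IEEE 754 binary16 value from its raw u16 bits into an f32.
fn f16_bits_to_f32(bits: u16) -> f32 {
    let sign = ((bits >> 15) & 1) as u32;
    let exp = ((bits >> 10) & 0x1f) as u32;
    let frac = (bits & 0x3ff) as u32;
    let f32_bits = match exp {
        0 if frac == 0 => sign << 31, // signed zero
        0 => {
            // Subnormal: renormalize into f32's normal range.
            let shift = frac.leading_zeros() - 22;
            let frac = (frac << (shift + 1)) & 0x3ff;
            (sign << 31) | ((127 - 15 - shift) << 23) | (frac << 13)
        }
        0x1f => (sign << 31) | 0x7f80_0000 | (frac << 13), // inf / NaN
        _ => (sign << 31) | ((exp + 127 - 15) << 23) | (frac << 13), // normal
    };
    f32::from_bits(f32_bits)
}
```

Every f16 value is exactly representable as an f32, so this direction is lossless; the reverse (`f32_to_f16_batch`) necessarily rounds.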