Module centroid_chunk

Expand description

CentroidChunk value encoding/decoding.

Stores a chunk of cluster centroids for SPANN-style approximate nearest neighbor search.

§Background: SPANN Index Architecture

SPANN (Scalable Partition-based Approximate Nearest Neighbor) is a disk-based indexing strategy that enables billion-scale vector search with bounded memory. The key insight is to keep only cluster centroids in memory while storing the actual vectors on disk in posting lists.

The index works by clustering vectors into groups, where each group has a representative centroid (typically the cluster’s mean vector). During search:

Find the k nearest centroids to the query vector using an in-memory index (HNSW)
Load the posting lists for those centroids from disk
Compute exact distances for vectors in those posting lists
Return the top results

This approach trades some recall for massive memory savings—instead of keeping all vectors in memory, only 0.1-1% (the centroids) need to be loaded.

§Centroid Chunking

Centroids are split across multiple CentroidChunk records rather than stored individually. This chunking strategy:

Reduces key count: Storing one centroid per key would explode the number of keys fetched on startup, increasing latency.
Enables block-level caching: SlateDB can cache entire chunks efficiently.
Supports partial reads: Only touched chunks need to be read from disk.

All centroid chunks are fully scanned on startup (scan(version|0x02|*)) to load the in-memory HNSW graph. The chunk_id exists solely to give each chunk a distinct record key.

§Centroid IDs

Each centroid carries an explicit centroid_id (u64) stored alongside its vector. These IDs:

Start at 1 (0 is reserved for the deleted vectors bitmap in PostingList)
Are never reassigned once issued
Remain stable when chunks are rewritten or reordered
Are referenced by PostingList keys to map centroids to their assigned vectors

§Sizing Guidelines

From RFC 0001:

CHUNK_TARGET centroids per chunk (default 4096, ~25 MB per chunk at 1536 dims)
Centroid ratio typically 0.1-1% of total vectors
Example: 10M vectors → 10K-100K centroids → 3-25 chunks
Higher ratios improve recall at the cost of memory

Structs§

CentroidChunkValue: CentroidChunk value storing a chunk of cluster centroids.
CentroidEntry: A single centroid entry with its ID and vector.

Functions§

decode_centroid_entry: Decode a centroid entry with known dimensionality.