hannoy is a key-value backed HNSW implementation based on arroy.
## Motivation
Many popular HNSW libraries are built in memory, meaning you need enough RAM to store all the vectors you're indexing. Instead, hannoy uses LMDB, a memory-mapped KV store, as its storage backend. This is better suited to machines running multiple programs, or to cases where the dataset you're indexing won't fit in memory. LMDB also supports non-blocking concurrent reads by design, so it's safe to query the index from multiple threads.
## Features
- Supported metrics: Euclidean, cosine, Manhattan, and Hamming, as well as their quantized counterparts.
- Multithreaded builds using rayon
- Small memory usage thanks to LMDB
- Compressed bitmaps to store graph edges, adding only ~200 bytes of overhead per vector
## Usage
Here's a quick demo (the import paths are a sketch; since hannoy is based on arroy, its API mirrors arroy's):

```rust
use hannoy::{distances::Cosine, Database, Reader, Writer};
use heed::EnvOpenOptions;
use rand::{rngs::StdRng, SeedableRng};
```
## 🚀 Roadmap
- add hnsw entry points to the db (`Node::Metadata`)
- update the edge bitmap of re-indexed nodes
- handle the re-indexing case where a new node may be on a higher level
- parallelize indexing
- implement heuristic edge selection (mandatory; improves performance non-trivially -> Sparse Neighborhood Graph condition)
- use papaya for `NodeStates`? (n_reads >> n_writes), e.g. `papaya::HashMap::<NoHash>`
- add explanations to the README (KV rationale, pic of HNSW, etc.)
- LRU cache for recently accessed vectors? -> effectively solved by the frozen reader
- remove the hardcoded `lmdb_index=0` in the builder
- either make `Writer<R, D, M, M0>`, remove `SmallVec`, or make `Writer<R, D, M>` (`M0 = 2*M`)
- make hannoy work on vector-relevancy-benchmark
- see if we can memoize `<p,q>` in a cache during the search heuristic
- check to make sure each node only has at most M links (data races in the parallel build might violate this); using `tinyvec` enforces this
- add a metrics struct to the build to track the number of link-add ops
- add tests back to hannoy
- add other distances to hannoy
- check to see if filtered search is feasible
- add search stats? figure out how many "jumps" are being done, to see if the ideas below can help
- add tracing
## Ideas for improvement
- keep a counter of the most frequently accessed nodes during the build and make those the entry points (e.g. use something centroid-like)
- merge upper layers of the graph if they only have one element
- product quantization (`UnalignedVectorCodec`)
- cache layers 1->L in RAM (speeds up M*(L-1) reads) using a hash table storing raw byte offsets and lengths
- threadpool for `Reader` to parallelize searching neighbours
- change `Metadata.entry_points` from `Vec<u32>` to a `RoaringBitmap` to avoid manually deduplicating entries
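The upper-layer cache idea could look something like this (a hypothetical sketch: `LayerCache`, the `(layer, item)` key layout, and treating the LMDB memory map as a plain byte slice are all assumptions, not hannoy's actual design):

```rust
use std::collections::HashMap;

/// Hypothetical RAM cache for the small upper layers (1..=L) of the graph:
/// maps (layer, item_id) to the raw byte offset and length of that node's
/// record inside the memory map, skipping one KV lookup per hop.
struct LayerCache {
    slots: HashMap<(u8, u32), (usize, usize)>,
}

impl LayerCache {
    fn new() -> Self {
        Self { slots: HashMap::new() }
    }

    /// Record where a node's bytes live in the mapped region.
    fn insert(&mut self, layer: u8, item: u32, offset: usize, len: usize) {
        self.slots.insert((layer, item), (offset, len));
    }

    /// Resolve a node's bytes straight from the mapped region,
    /// or `None` if the node is not cached (or out of bounds).
    fn get<'a>(&self, mmap: &'a [u8], layer: u8, item: u32) -> Option<&'a [u8]> {
        let &(off, len) = self.slots.get(&(layer, item))?;
        mmap.get(off..off + len)
    }
}
```

Since the upper layers hold only a small fraction of the nodes, such a table stays tiny while saving roughly M*(L-1) reads per search, as the bullet above estimates.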
## TODO: ask kero
- Currently I made `ImmutableItems` read in ALL items from the db. In theory we could postpone distance calcs with on-disk vectors until the end of the build using a new `ImmutableItems` reader.
  - then we'd store another `HnswBuilder`-like list of hashmaps for the work we need to do later!
- actually, just get his take/views on my approach for incremental indexing
- I'm suspicious of the FreshDiskANN paper's 5% insert + 5% delete cycle -- I've noticed that at 50% inserts the performance deteriorates significantly!
  - wait, never mind: apparently they do do that
- TODO: check if using α-SNG improves recall on incremental builds, e.g. with alpha = 1.2 or something (single pass, not twice over)
  - it does, but it also increases build time (if used for the entire build); also not a magic bullet
- ask: what's wrong with a global pool for doing vector-vector ops and sending results back to the search thread?
- could we also re-index points on levels > 0 during an incremental build?
- need to try building the whole index, then deleting & inserting, instead of the 2-phase build