[−][src]Crate lsh_rs
lsh-rs (Locality Sensitive Hashing)
Locality sensitive hashing can help retrieving Approximate Nearest Neighbors in sub-linear time.
For more information on the subject see:
- Introduction on LSH
- Section 2. describes the hash families used in this crate
- LSH and neural networks
Implementations
- Base LSH
- Signed Random Projections (Cosine similarity)
- L2 distance
- MIPS (Dot products/ Maximum Inner Product Search)
- MinHash (Jaccard Similarity)
- Multi Probe LSH
- Step wise probing
- SRP (only bit shifts)
- Query directed probing
- L2
- MIPS
- Step wise probing
- Generic numeric types
Features
- "blas"
- "sqlite"
Getting started
use lsh_rs::prelude::*; // 2 rows w/ dimension 3. let p = &[vec![1., 1.5, 2.], vec![2., 1.1, -0.3]]; // Do one time expensive preprocessing. let n_projections = 9; let n_hash_tables = 30; let dim = 10; let dim = 3; let mut lsh = LshMem::new(n_projections, n_hash_tables, dim) .srp() .unwrap(); lsh.store_vecs(p); // Query in sublinear time. let query = &[1.1, 1.2, 1.2]; lsh.query_bucket(query);
Signed Random Projections
LSH for maximum cosine similarity search.
let mut lsh = LshMem::<_, f32>::new(n_projections, n_hash_tables, dim) .srp() .unwrap();
L2
LSH for minimal L2 distance search.
// hyper parameter r in https://arxiv.org/pdf/1411.3787.pdf (eq. 8) let mut lsh = LshMem::<_, f32>::new(n_projections, n_hash_tables, dim) .l2(bucket_width) .unwrap();
Jaccard Index
LSH for the Jaccard Index
// length of the shingles vector let dim = 2500; let mut lsh = LshSqlMem::<_, u16>::new(n_projections, n_hash_tables, dim) .minhash() .unwrap();
Maximum Inner Product (MIPS)
LSH for maximum inner product search.
let bucket_width = 2.2; // l2(x) < U < 1.0 let U = 0.83; let r = 4.; // number of concatenations let m = 3; let n_projections = 15; let n_hash_tables = 10; let dim = 10; let mut lsh = LshMem::<_, f32>::new(n_projections, n_hash_tables, dim) .mips(r, U, m) .unwrap();
Seed
Random projections are used to generate the hash functions. The default seeding of randomness is taken from the system. If you want to have reproducable outcomes, you can set a manual seed.
let mut lsh = LshMem::<_, f32>::new(n_projections, n_hash_tables, dim) .seed(12) .srp() .unwrap();
Unique indexes
Instead of storing data points as vectors. Storing L
copies of the data points (one in every
hash table). You can choose to only store unique indexes of the data points. The index ids are
assigned in chronological order. This will drastically decrease the required memory.
let mut lsh = LshMem::<_, f32>::new(n_projections, n_hash_tables, dim) .only_index() .srp() .unwrap();
Builder pattern methods
The following methods can be used to change internal state during object initialization:
Backends
The LSH struct is exposed with multiple backends that store the hashes.
- in memory (fastest / can save state with serialization) LshMem
- SQLite (slower due to disk io, but automatic state preservation between sessions) LshSql
- in memory SQLite (can backup to SQLite when processing is done) LshSqlMem
Hash primitives
The hashers in this crate will produces hashes of type Vec<T>
. Where T
should be one of i8
,
i16
, i32
or i64
. This concrete primitive value can be set by choosing on of the utillity types
in the following sub-modules:
Using smaller primitives for the hash values, will result in less space requirements and greater performance. However this may lead to panics if the hash value doesn't fit the chosen primitive due to buffer overflow.
Note: the hash primitive cannot be set for every Hash family that has implemented
VecHash. For instance, SignRandomProjections
will allways use i8
as hash primitive.
// use i8 hash values: let lsh_i8 = hi8::LshMem::<_, u8>::new(n_projections, n_hash_tables, dim) .minhash() .unwrap(); // use i64 hash values: let lhs_i8 = hi64::LshMem::<_, u8>::new(n_projections, n_hash_tables, dim) .minhash() .unwrap();
BLAS support
Utilizing BLAS will heavily increase
performance. To make use of BLAS, install lsh-rs
with "blas"
feature and reinstall ndarray
with "blas"
support.
Cargo.toml:
lsh-rs = {version ="x.x"}, features=["blas"]}
ndarray = {version = "0.13", features=["blas"]}
# Or any other blas backend.
blas-src = { version = "0.6", defeault-features = false, features = ["openblas"]}
Need your own hashers?
The LSH struct can easily be extended with your own hashers. Your own hasher structs need
to implement VecHash<N, K>. N
and K
are generic types of the input
and output numbers respectively.
Need you own backend?
If you need another backend, you can extend you backend with the HashTables<N, K> trait.
Modules
data | Generic traits for numeric input and hash outputs. |
dist | Distance/ similarity functions. |
prelude | Re-export of the public api of lsh-rs. |
stats | Some utilities to help choose LSH parameters. |
Structs
MemoryTable | In memory backend for LSH. |
SqlTable | Sqlite backend for LSH. |
SqlTableMem | In memory Sqlite backend for LSH. |
Traits
HashTables | Hashtable consisting of |
QueryDirectedProbe | Query directed probing |
StepWiseProbe | Step wise probing |
VecHash | Implement this trait to create your own custom hashers.
In case of a symmetrical hash function, only |