[][src]Crate lsh_rs

lsh-rs (Locality Sensitive Hashing)

Locality sensitive hashing can help retrieving Approximate Nearest Neighbors in sub-linear time.

For more information on the subject see:

Implementations

  • Base LSH
    • Signed Random Projections (Cosine similarity)
    • L2 distance
    • MIPS (Dot products/ Maximum Inner Product Search)
    • MinHash (Jaccard Similarity)
  • Multi Probe LSH
    • Step wise probing
      • SRP (only bit shifts)
    • Query directed probing
      • L2
      • MIPS
  • Generic numeric types

Features

  • "blas"
  • "sqlite"

Getting started

use lsh_rs::prelude::*;
// 2 rows w/ dimension 3.
let p = &[vec![1., 1.5, 2.],
        vec![2., 1.1, -0.3]];

// Do one time expensive preprocessing.
let n_projections = 9;
let n_hash_tables = 30;
let dim = 10;
let dim = 3;
let mut lsh = LshMem::new(n_projections, n_hash_tables, dim)
    .srp()
    .unwrap();
lsh.store_vecs(p);

// Query in sublinear time.
let query = &[1.1, 1.2, 1.2];
lsh.query_bucket(query);

Signed Random Projections

LSH for maximum cosine similarity search.

let mut lsh = LshMem::<_, f32>::new(n_projections, n_hash_tables, dim)
    .srp()
    .unwrap();

L2

LSH for minimal L2 distance search.

// hyper parameter r in https://arxiv.org/pdf/1411.3787.pdf (eq. 8)
let mut lsh = LshMem::<_, f32>::new(n_projections, n_hash_tables, dim)
    .l2(bucket_width)
    .unwrap();

Jaccard Index

LSH for the Jaccard Index

// length of the shingles vector
let dim = 2500;
let mut lsh = LshSqlMem::<_, u16>::new(n_projections, n_hash_tables, dim)
    .minhash()
    .unwrap();

Maximum Inner Product (MIPS)

LSH for maximum inner product search.

let bucket_width = 2.2;
// l2(x) < U < 1.0
let U = 0.83;
let r = 4.;
// number of concatenations
let m = 3;
let n_projections = 15;
let n_hash_tables = 10;
let dim = 10;
let mut lsh = LshMem::<_, f32>::new(n_projections, n_hash_tables, dim)
    .mips(r, U, m)
    .unwrap();

Seed

Random projections are used to generate the hash functions. The default seeding of randomness is taken from the system. If you want to have reproducable outcomes, you can set a manual seed.

let mut lsh = LshMem::<_, f32>::new(n_projections, n_hash_tables, dim)
    .seed(12)
    .srp()
    .unwrap();

Unique indexes

Instead of storing data points as vectors. Storing L copies of the data points (one in every hash table). You can choose to only store unique indexes of the data points. The index ids are assigned in chronological order. This will drastically decrease the required memory.

let mut lsh = LshMem::<_, f32>::new(n_projections, n_hash_tables, dim)
    .only_index()
    .srp()
    .unwrap();

Builder pattern methods

The following methods can be used to change internal state during object initialization:

Backends

The LSH struct is exposed with multiple backends that store the hashes.

  • in memory (fastest / can save state with serialization) LshMem
  • SQLite (slower due to disk io, but automatic state preservation between sessions) LshSql
  • in memory SQLite (can backup to SQLite when processing is done) LshSqlMem

Hash primitives

The hashers in this crate will produces hashes of type Vec<T>. Where T should be one of i8, i16, i32 or i64. This concrete primitive value can be set by choosing on of the utillity types in the following sub-modules:

Using smaller primitives for the hash values, will result in less space requirements and greater performance. However this may lead to panics if the hash value doesn't fit the chosen primitive due to buffer overflow.

Note: the hash primitive cannot be set for every Hash family that has implemented VecHash. For instance, SignRandomProjections will allways use i8 as hash primitive.

// use i8 hash values:
let lsh_i8 = hi8::LshMem::<_, u8>::new(n_projections, n_hash_tables, dim)
    .minhash()
    .unwrap();
// use i64 hash values:
let lhs_i8 = hi64::LshMem::<_, u8>::new(n_projections, n_hash_tables, dim)
    .minhash()
    .unwrap();

BLAS support

Utilizing BLAS will heavily increase performance. To make use of BLAS, install lsh-rs with "blas" feature and reinstall ndarray with "blas" support.

Cargo.toml:

lsh-rs = {version ="x.x"}, features=["blas"]}
ndarray = {version = "0.13", features=["blas"]}
# Or any other blas backend.
blas-src = { version = "0.6", defeault-features = false, features = ["openblas"]}

Need your own hashers?

The LSH struct can easily be extended with your own hashers. Your own hasher structs need to implement VecHash<N, K>. N and K are generic types of the input and output numbers respectively.

Need you own backend?

If you need another backend, you can extend you backend with the HashTables<N, K> trait.

Modules

data

Generic traits for numeric input and hash outputs.

dist

Distance/ similarity functions.

prelude

Re-export of the public api of lsh-rs.

stats

Some utilities to help choose LSH parameters.

Structs

MemoryTable

In memory backend for LSH.

SqlTable

Sqlite backend for LSH.

SqlTableMem

In memory Sqlite backend for LSH.

Traits

HashTables

Hashtable consisting of L Hash tables.

QueryDirectedProbe

Query directed probing

StepWiseProbe

Step wise probing

VecHash

Implement this trait to create your own custom hashers. In case of a symmetrical hash function, only hash_vec_query needs to be implemented.