Crate simd_sketch

Source
Expand description

§SimdSketch

This library provides two types of sequence sketches:

  • the classic bottom-s sketch;
  • the newer bucket sketch, returning the smallest hash in each of s buckets.

See the corresponding blogpost for more background and an evaluation.

§Hash function

All internal hashes are 32 bits. Either a forward-only hash or reverse-complement-aware (canonical) hash can be used.

TODO: Current we use (canonical) ntHash. This causes some hash-collisions for k <= 16, which could be avoided.

§BucketSketch

For classic bottom-sketch, evaluating the similarity is slow because a merge-sort must be done between the two lists.

The bucket sketch solves this by partitioning the hashes into s partitions. Previous methods partition into ranges of size u32::MAX/s, but here we partition by remainder mod s instead.

We find the smallest hash for each remainder as the sketch. To compute the similarity, we can simply use the hamming distance between two sketches, which is significantly faster.

The bucket sketch similarity has a very strong one-to-one correlation with the classic bottom-sketch.

TODO: A drawback of this method is that some buckets may remain empty when the input sequences are not long enough. In that case, densification could be applied, but this is not currently implemented. If you need this, please reach out.

§Jaccard similarity

For the bottom sketch, we conceptually estimate similarity as follows:

  1. Find the smallest s distinct k-mer hashes in the union of two sketches.
  2. Return the fraction of these k-mers that occurs in both sketches.

For the bucket sketch, we simply return the fraction of partitions that have the same k-mer for both sequences.

§b-bit sketches

Instead of storing the full 32-bit hashes, it is sufficient to only store the low bits of each hash. In practice, b=8 is usually fine. When extra fast comparisons are needed, use b=1 in combination with a 3 to 4x larger s.

§Usage

The main entrypoint of this library is the Sketcher object. Construct it in either the forward or canonical variant, and give k and s. Then call either Sketcher::bottom_sketch or Sketcher::sketch on it, and use the similarity functions on the returned BottomSketch and BucketSketch objects.

use packed_seq::SeqVec;

let k = 31;   // Hash all k-mers.
let s = 8192; // Sample 8192 hashes
let b = 8;    // Store the bottom 8 bits of each hash.

// Use `new_rc` for a canonical (reverse-complement aware) hash.
// `new_fwd` uses a plain forward hash instead.
let sketcher = simd_sketch::Sketcher::new_rc(k, s, b);

// Generate two random sequences of 2M characters.
let n = 2_000_000;
let seq1 = packed_seq::PackedSeqVec::random(n);
let seq2 = packed_seq::PackedSeqVec::random(n);

// Bottom-sketch variant

let sketch1: simd_sketch::BottomSketch = sketcher.bottom_sketch(seq1.as_slice());
let sketch2: simd_sketch::BottomSketch = sketcher.bottom_sketch(seq2.as_slice());

// Value between 0 and 1, estimating the fraction of shared k-mers.
let similarity = sketch1.similarity(&sketch2);

// Bucket sketch variant

let sketch1: simd_sketch::BucketSketch = sketcher.sketch(seq1.as_slice());
let sketch2: simd_sketch::BucketSketch = sketcher.sketch(seq2.as_slice());

// Value between 0 and 1, estimating the fraction of shared k-mers.
let similarity: f32 = sketch1.similarity(&sketch2);

TODO: Currently there is no support yet for merging sketches, or for sketching multiple sequences into one sketch. It’s not hard, I just need to find a good API. Please reach out if you’re interested in this.

TODO: If you would like a binary instead of a library, again, please reach out :)

§Implementation notes

This library works by partitioning the input sequence into 8 chunks, and processing those in parallel using SIMD. This is based on the packed-seq and simd-minimizers crates.

For bottom sketch, the largest hash should be around target = u32::MAX * s / n (ignoring duplicates). To ensure a branch-free algorithm, we first collect all hashes up to bound = 1.5 * target. Then we sort the collected hashes. If there are at least s left after deduplicating, we return the bottom s. Otherwise, we double the 1.5 multiplication factor and retry. This factor is cached to make the sketching of multiple genomes more efficient.

For bucket sketch, we use the same approach, and increase the factor until we find a k-mer hash in every bucket. In expectation, this needs to collect a fraction around log(n) * s / n of hashes, rather than s / n. In practice this doesn’t matter much, as the hashing of all input k-mers is the bottleneck, and the sorting of the small sample of k-mers is relatively fast.

For bucket sketch we assign each element to its bucket via its remainder modulo s. We compute this efficiently using fast-mod.

§Performance

The sketching throughput of this library is around 2 seconds for a 3GB human genome (once the scaling factor is large enough to avoid a second pass). That’s typically a few times faster than parsing a Fasta file.

BinDash instead takes 180s (90x more), when running on a single thread.

Comparing sketches is relatively fast, but can become a bottleneck when there are many input sequences, since the number of comparisons grows quadratically. In this case, prefer bucket sketch. As an example, when sketching 5MB bacterial genomes using s=10000, each sketch takes 4ms. Comparing two sketches takes 1.6us. This starts to be the dominant factor when the number of input sequences is more than 5000.

Structs§

BottomSketch
A sketch containing the s smallest k-mer hashes.
BucketSketch
A sketch containing the smallest k-mer hash for each remainder mod s.
Sketcher
An object containing the sketch parameters.