Expand description
§SimdSketch
This library provides two types of sequence sketches:
- the classic bottom-
s
sketch; - the newer bucket sketch, returning the smallest hash in each of
s
buckets.
See the corresponding blogpost for more background and an evaluation.
§Hash function
All internal hashes are 32 bits. Either a forward-only hash or reverse-complement-aware (canonical) hash can be used.
TODO: Current we use (canonical) ntHash. This causes some hash-collisions
for k <= 16
, which could be avoided.
§BucketSketch
For classic bottom-sketch, evaluating the similarity is slow because a merge-sort must be done between the two lists.
The bucket sketch solves this by partitioning the hashes into s
partitions.
Previous methods partition into ranges of size u32::MAX/s
, but here we
partition by remainder mod s
instead.
We find the smallest hash for each remainder as the sketch. To compute the similarity, we can simply use the hamming distance between two sketches, which is significantly faster.
The bucket sketch similarity has a very strong one-to-one correlation with the classic bottom-sketch.
TODO: A drawback of this method is that some buckets may remain empty when the input sequences are not long enough. In that case, densification could be applied, but this is not currently implemented. If you need this, please reach out. Instead, we currently simply keep a bitvector indicating empty buckets.
§Jaccard similarity
For the bottom sketch, we conceptually estimate similarity as follows:
- Find the smallest
s
distinct k-mer hashes in the union of two sketches. - Return the fraction of these k-mers that occurs in both sketches.
For the bucket sketch, we simply return the fraction of parts that have the same k-mer for both sequences (out of those that are not both empty).
§b-bit sketches
Instead of storing the full 32-bit hashes, it is sufficient to only store the low bits of each hash.
In practice, b=8
is usually fine.
When extra fast comparisons are needed, use b=1
in combination with a 3 to 4x larger s
.
This causes around 1/2^b
matches because of collisions in the lower bits.
We correct for this via j = (j0 - 1/2^b) / (1 - 1/2^b)
.
When the fraction of matches is less than 1/2^b
, this is negative, which we explicitly correct to 0
.
§Mash distance
We compute the mash distance as -log( 2*j / (1+j) ) / k
.
This is always >=0, but can be as large as inf
when j=0
(as is the case for disjoint input sets).
§Usage
The main entrypoint of this library is the Sketcher
object.
Construct it in either the forward or canonical variant, and give k
and s
.
Then call either Sketcher::bottom_sketch
or Sketcher::sketch
on it, and use the
similarity
functions on the returned BottomSketch
and BucketSketch
objects.
use packed_seq::SeqVec;
let sketcher = simd_sketch::SketchParams {
alg: simd_sketch::SketchAlg::Bucket,
rc: false, // Set to `true` for a canonical (reverse-complement-aware) hash.
k: 31, // Hash 31-mers
s: 8192, // Sample 8192 hashes
b: 8, // Store the bottom 8 bits of each hash.
filter_empty: true, // Explicitly filter out empty buckets for BucketSketch.
}.build();
// Generate two random sequences of 2M characters.
let n = 2_000_000;
let seq1 = packed_seq::PackedSeqVec::random(n);
let seq2 = packed_seq::PackedSeqVec::random(n);
// Sketch using given algorithm:
let sketch1: simd_sketch::Sketch = sketcher.sketch(seq1.as_slice());
let sketch2: simd_sketch::Sketch = sketcher.sketch(seq2.as_slice());
// Value between 0 and 1, estimating the fraction of shared k-mers.
let j = sketch1.jaccard_similarity(&sketch2);
assert!(0.0 <= j && j <= 1.0);
let d = sketch1.mash_distance(&sketch2);
assert!(0.0 <= d);
TODO: Currently there is no support yet for merging sketches, or for sketching multiple sequences into one sketch. It’s not hard, I just need to find a good API. Please reach out if you’re interested in this.
TODO: If you would like a binary instead of a library, again, please reach out :)
§Implementation notes
This library works by partitioning the input sequence into 8 chunks,
and processing those in parallel using SIMD.
This is based on the packed-seq
and simd-minimizers
crates.
For bottom sketch, the largest hash should be around target = u32::MAX * s / n
(ignoring duplicates).
To ensure a branch-free algorithm, we first collect all hashes up to bound = 1.5 * target
.
Then we sort the collected hashes. If there are at least s
left after deduplicating, we return the bottom s
.
Otherwise, we double the 1.5
multiplication factor and retry. This
factor is cached to make the sketching of multiple genomes more efficient.
For bucket sketch, we use the same approach, and increase the factor until we find a k-mer hash in every bucket.
In expectation, this needs to collect a fraction around log(n) * s / n
of hashes, rather than s / n
.
In practice this doesn’t matter much, as the hashing of all input k-mers is the bottleneck,
and the sorting of the small sample of k-mers is relatively fast.
For bucket sketch we assign each element to its bucket via its remainder modulo s
.
We compute this efficiently using fast-mod.
§Performance
The sketching throughput of this library is around 2 seconds for a 3GB human genome (once the scaling factor is large enough to avoid a second pass). That’s typically a few times faster than parsing a Fasta file.
BinDash instead takes 180s (90x more), when running on a single thread.
Comparing sketches is relatively fast, but can become a bottleneck when there are many input sequences,
since the number of comparisons grows quadratically. In this case, prefer bucket sketch.
As an example, when sketching 5MB bacterial genomes using s=10000
, each sketch takes 4ms.
Comparing two sketches takes 1.6us.
This starts to be the dominant factor when the number of input sequences is more than 5000.
Structs§
- Bottom
Sketch - A sketch containing the
s
smallest k-mer hashes. - Bucket
Sketch - A sketch containing the smallest k-mer hash for each remainder mod
s
. - Sketch
Params - Sketcher
- An object containing the sketch parameters.