pub struct MinHash { /* private fields */ }
Bottom-k MinHash sketch for rapid genome comparison.
Keeps the sketch_size smallest canonical k-mer hash values from a DNA
sequence. Jaccard similarity between two sketches is estimated from the
overlap of their bottom-k hash sets.
Implementations§
impl MinHash
pub fn from_sequence(seq: &[u8], k: usize, sketch_size: usize) -> Result<Self>
Build a MinHash sketch from a DNA sequence.
Non-ACGT bases act as k-mer break points (k-mers spanning them are skipped). Input is case-insensitive.
§Errors
Returns an error if k == 0 or sketch_size == 0.
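The construction described above can be illustrated with a minimal, self-contained sketch. It does not reproduce this crate's internals: the helper names (`complement`, `canonical_kmer`, `hash_kmer`, `bottom_k_sketch`), the use of `DefaultHasher`, and the choice of "lexicographically smaller of the k-mer and its reverse complement" as the canonical form are all assumptions made for illustration. Unlike `from_sequence`, this sketch returns an empty vector instead of an error when `k` or `sketch_size` is zero.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeSet;
use std::hash::{Hash, Hasher};

/// Complement of an uppercase base; `None` for anything outside ACGT.
fn complement(base: u8) -> Option<u8> {
    match base {
        b'A' => Some(b'T'),
        b'C' => Some(b'G'),
        b'G' => Some(b'C'),
        b'T' => Some(b'A'),
        _ => None,
    }
}

/// Canonical form of a k-mer (assumed here): the lexicographically smaller of
/// the uppercased k-mer and its reverse complement.
/// Returns `None` if the window contains a non-ACGT base, so that window is skipped.
fn canonical_kmer(window: &[u8]) -> Option<Vec<u8>> {
    let fwd: Vec<u8> = window.iter().map(|b| b.to_ascii_uppercase()).collect();
    let rev: Option<Vec<u8>> = fwd.iter().rev().map(|&b| complement(b)).collect();
    let rev = rev?;
    Some(if fwd <= rev { fwd } else { rev })
}

/// Hypothetical hash function; the real crate may use a different one.
fn hash_kmer(kmer: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    kmer.hash(&mut hasher);
    hasher.finish()
}

/// Keep the `sketch_size` smallest canonical k-mer hashes, in ascending order.
fn bottom_k_sketch(seq: &[u8], k: usize, sketch_size: usize) -> Vec<u64> {
    if k == 0 || sketch_size == 0 {
        return Vec::new(); // illustration only; the documented API reports an error
    }
    let mut kept: BTreeSet<u64> = BTreeSet::new();
    for window in seq.windows(k) {
        if let Some(kmer) = canonical_kmer(window) {
            kept.insert(hash_kmer(&kmer));
            if kept.len() > sketch_size {
                // Evict the largest hash so only the bottom-k values remain.
                let largest = *kept.iter().next_back().unwrap();
                kept.remove(&largest);
            }
        }
    }
    kept.into_iter().collect()
}
```

Because every k-mer is reduced to its canonical form before hashing, a sequence and its reverse complement produce the same sketch under these assumptions.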
pub fn add_sequence(&mut self, seq: &[u8])
Add k-mers from a DNA sequence to this sketch.
Non-ACGT bases act as k-mer break points. The sketch is maintained as a sorted bottom-k set.
pub fn jaccard(&self, other: &MinHash) -> Result<f64>
Estimate Jaccard similarity between this sketch and another.
Uses the merge-based estimator: merge both sorted hash arrays, take the sketch_size smallest values of the union, and return the fraction of those values that appear in both sketches.
§Errors
Returns an error if the sketches have different k values.
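A merge-based estimator of this kind can be sketched as follows. This is an illustration rather than the crate's code: the function name `jaccard_estimate` and its signature are hypothetical, the inputs are assumed to be sorted and deduplicated hash slices, and returning 0.0 when both sketches are empty is an arbitrary choice made here.

```rust
/// Estimate Jaccard similarity from two sorted, deduplicated bottom-k sketches.
/// Walks both slices in merge order and stops after `sketch_size` distinct
/// union values have been seen.
fn jaccard_estimate(a: &[u64], b: &[u64], sketch_size: usize) -> f64 {
    let (mut i, mut j) = (0usize, 0usize);
    let mut seen = 0usize; // distinct hashes taken from the merged union so far
    let mut shared = 0usize; // how many of those occur in both sketches

    while seen < sketch_size && (i < a.len() || j < b.len()) {
        if j == b.len() || (i < a.len() && a[i] < b[j]) {
            i += 1; // next-smallest union value is only in `a`
        } else if i == a.len() || b[j] < a[i] {
            j += 1; // next-smallest union value is only in `b`
        } else {
            shared += 1; // same hash in both sketches
            i += 1;
            j += 1;
        }
        seen += 1;
    }

    if seen == 0 {
        0.0 // assumption: two empty sketches are reported as dissimilar
    } else {
        shared as f64 / seen as f64
    }
}
```

For example, with sketches `[1, 3, 5, 7]` and `[1, 2, 5, 8]` and a sketch size of 4, the four smallest union values are 1, 2, 3, and 5; two of them (1 and 5) are shared, so the estimate is 2/4.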
pub fn containment(&self, other: &MinHash) -> Result<f64>
Estimate containment of self in other.
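Containment C(A, B) = |A ∩ B| / |A|, where A is the k-mer set of self and B is the k-mer set of other. Unlike Jaccard similarity, containment is asymmetric: C(A, B) and C(B, A) generally differ.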
§Errors
Returns an error if the sketches have different k values.
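The definition can be made concrete with a small example over exact hash sets. This shows the formula only, not how the crate estimates containment from sketches; the function name `containment_exact` and the 0.0 result for an empty A are assumptions made for illustration.

```rust
use std::collections::HashSet;

/// Exact containment C(A, B) = |A ∩ B| / |A| over two sets of k-mer hashes.
fn containment_exact(a: &HashSet<u64>, b: &HashSet<u64>) -> f64 {
    if a.is_empty() {
        return 0.0; // assumption: containment of an empty set is reported as 0
    }
    let shared = a.intersection(b).count();
    shared as f64 / a.len() as f64
}
```

If A = {1, 2} and B = {1, 2, 3, 4}, then C(A, B) = 2/2 = 1.0 because all of A lies in B, while C(B, A) = 2/4 = 0.5. This asymmetry is what makes containment useful when one genome is much smaller than the other.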
pub fn ani(&self, other: &MinHash) -> Result<f64>
Estimate average nucleotide identity (ANI) from Jaccard similarity.
Uses the Mash formula: ANI = 1 - D, where D = -(1/k) * ln(2J / (1 + J)) is the Mash distance, so ANI = 1 + (1/k) * ln(2J / (1 + J)).
§Errors
Returns an error if the sketches have different k values, or if the
Jaccard similarity is zero (ANI undefined).
pub fn sketch_size(&self) -> usize
The target sketch size (bottom-k parameter).