Expand description
A library to quickly compute (canonical) minimizers of DNA and text sequences.
The main functions are:
minimizer_positions
: compute the positions of all minimizers of a sequence.canonical_minimizer_positions
: compute the positions of all canonical minimizers of a sequence. Adjacent equal positions are deduplicated, but since the canonical minimizer is not forward, a position could appear more than once.
The implementation uses SIMD by splitting each sequence into 8 chunks and processing those in parallel.
The minimizer_positions_scalar
and canonical_minimizer_positions_scalar
versions can be more efficient on short sequences where the overhead of chunking is large.
The minimizer of a single window can be found using one_minimizer
and one_canonical_minimizer
.
§Minimizers
The code is explained in detail in our preprint:
SimdMinimizers: Computing random minimizers, fast. Ragnar Groot Koerkamp, Igor Martayan
Briefly, minimizers are defined using two parameters k
and w
.
Given a sequence of characters, all k-mers (substrings of length k
) are hashed,
and for each window of k
consecutive k-mers (of length l = w + k - 1
characters),
(the position of) the smallest k-mer is sampled.
Minimizers are found as follows:
- Split the input to 8 chunks that are processed in parallel using SIMD.
- Compute a 32-bit ntHash rolling hash of the k-mers.
- Use the ‘two stacks’ sliding window minimum on the top 16 bits of each hash.
- Break ties towards the leftmost position by storing the position in the bottom 16 bits.
- Compute 8 consecutive minimizer positions, and dedup them.
- Collect the deduplicated minimizer positions from all 8 chunks into a single vector.
§Canonical minimizers
Canonical minimizers have the property that the sampled k-mers of a DNA sequence are the same as those sampled from the reverse complement sequence.
This works as follows:
- ntHash is modified to use the canonical version that computes the xor of the hash of the forward and reverse complement k-mer.
- Compute the leftmost and rightmost minimal k-mer.
- Compute the ‘preferred’ strand of the current window as the one with more
TG
characters. This requiresl=w+k-1
to be odd for proper tie-breaking. - Return either the leftmost or rightmost smallest k-mer, depending on the preferred strand.
§Input types
This crate depends on [packed-seq
] to handle generic types of input sequences.
Most commonly, one should use packed_seq::PackedSeqVec
for packed DNA sequences, but one can also simply wrap a sequence of ACTGactg
characters in packed_seq::AsciiSeqVec
.
Additionally, [simd-minimizers
] works on general (ASCII) &[u8]
text.
The main function provided by packed_seq
is packed_seq::Seq::iter_bp
, which splits the input into 8 chunks and iterates them in parallel using SIMD.
When dealing with ASCII input, use the AsciiSeq
and AsciiSeqVec
types.
§Hash function
By default, the library uses the ntHash
hash function, which maps each DNA base ACTG
to a pseudo-random value using a table lookup.
This hash function is specifically designed to be fast for hashing DNA sequences with input type packed_seq::PackedSeq
and packed_seq::AsciiSeq
.
For general ASCII sequences (&[u8]
), mulHash
is used instead, which instead multiplies each character value by a pseudo-random constant.
The mul_hash
module provides functions that always use mulHash, also for DNA sequences.
§Performance
This library depends on AVX2 or NEON SIMD instructions to achieve good performance.
Make sure to compile with -C target-cpu=native
to enable these instructions.
All functions take a out_vec: &mut Vec<u32>
parameter to which positions are appended.
For best performance, re-use the same out_vec
between invocations, and Vec::clear
it before or after each call.
§Features
hide-simd-warning
: If your system does not support AVX2 or NEON, enable this feature to disable the compile warning that will be shown.
§Examples
// Scalar ASCII version.
use packed_seq::{Seq, AsciiSeq};
let seq = b"ACGTGCTCAGAGACTCAG";
let ascii_seq = AsciiSeq(seq);
let k = 5;
let w = 7;
let mut out_vec = Vec::new();
simd_minimizers::minimizer_positions_scalar(ascii_seq, k, w, &mut out_vec);
assert_eq!(out_vec, vec![0, 6, 8, 10, 12]);
// Packed SIMD version.
use packed_seq::{complement_char, PackedSeqVec, SeqVec};
let seq = b"ACGTGCTCAGAGACTCAG";
let k = 5;
let w = 7;
let packed_seq = PackedSeqVec::from_ascii(seq);
let mut fwd_pos = Vec::new();
// Unfortunately, `PackedSeqVec` can not `Deref` into a `PackedSeq`.
simd_minimizers::canonical_minimizer_positions(packed_seq.as_slice(), k, w, &mut fwd_pos);
assert_eq!(fwd_pos, vec![3, 5, 12]);
// Check that reverse complement sequence has minimizers at 'reverse' positions.
let rc_seq = seq.iter().rev().map(|&b| complement_char(b)).collect::<Vec<_>>();
let rc_packed_seq = PackedSeqVec::from_ascii(&rc_seq);
let mut rc_pos = Vec::new();
simd_minimizers::canonical_minimizer_positions(rc_packed_seq.as_slice(), k, w, &mut rc_pos);
assert_eq!(rc_pos, vec![1, 8, 10]);
for (fwd, &rc) in std::iter::zip(fwd_pos, rc_pos.iter().rev()) {
assert_eq!(fwd as usize, seq.len() - k - rc as usize);
}
Re-exports§
pub use packed_seq;
Modules§
- mul_
hash - Variants that always use mulHash, instead of the default ntHash for DNA and mulHash for text.
- private
- Re-exported internals. Used for benchmarking, and not part of the semver-compatible stable API.
Functions§
- canonical_
minimizer_ positions - Deduplicated positions of all canonical minimizers in the sequence, using SIMD.
- canonical_
minimizer_ positions_ scalar - Deduplicated positions of all canonical minimizers in the sequence. This scalar version can be faster for short sequences.
- minimizer_
positions - Deduplicated positions of all minimizers in the sequence, using SIMD.
- minimizer_
positions_ scalar - Deduplicated positions of all minimizers in the sequence. This scalar version can be faster for short sequences.
- one_
canonical_ minimizer - Canonical minimizer position of a single window.
- one_
minimizer - Minimizer position of a single window.