Expand description
A library to quickly compute (canonical) minimizers of DNA and text sequences.
The main functions are:
minimizer_positions
: compute the positions of all minimizers of a sequence.canonical_minimizer_positions
: compute the positions of all canonical minimizers of a sequence. Adjacent equal positions are deduplicated, but since the canonical minimizer is not forward, a position could appear more than once.
The implementation uses SIMD by splitting each sequence into 8 chunks and processing those in parallel.
When using super-k-mers, use the _and_superkmer
variants to additionally return a vector containing the index of the first window the minimizer is minimal.
The minimizer of a single window can be found using one_minimizer
and [one_canonical_minimizer
], but note that these functions are not nearly as efficient.
The [scalar
] versions are mostly for testing only, and basically always slower.
Only for short sequences with length up to 100 is [scalar::minimizer_positions_scalar
] faster than the SIMD version.
§Minimizers
The code is explained in detail in our paper:
SimdMinimizers: Computing random minimizers, fast. Ragnar Groot Koerkamp, Igor Martayan, SEA 2025
Briefly, minimizers are defined using two parameters k
and w
.
Given a sequence of characters, all k-mers (substrings of length k
) are hashed,
and for each window of k
consecutive k-mers (of length l = w + k - 1
characters),
(the position of) the smallest k-mer is sampled.
Minimizers are found as follows:
- Split the input to 8 chunks that are processed in parallel using SIMD.
- Compute a 32-bit ntHash rolling hash of the k-mers.
- Use the ‘two stacks’ sliding window minimum on the top 16 bits of each hash.
- Break ties towards the leftmost position by storing the position in the bottom 16 bits.
- Compute 8 consecutive minimizer positions, and dedup them.
- Collect the deduplicated minimizer positions from all 8 chunks into a single vector.
§Canonical minimizers
Canonical minimizers have the property that the sampled k-mers of a DNA sequence are the same as those sampled from the reverse complement sequence.
This works as follows:
- ntHash is modified to use the canonical version that computes the xor of the hash of the forward and reverse complement k-mer.
- Compute the leftmost and rightmost minimal k-mer.
- Compute the ‘preferred’ strand of the current window as the one with more
TG
characters. This requiresl=w+k-1
to be odd for proper tie-breaking. - Return either the leftmost or rightmost smallest k-mer, depending on the preferred strand.
§Input types
This crate depends on [packed-seq
] to handle generic types of input sequences.
Most commonly, one should use packed_seq::PackedSeqVec
for packed DNA sequences, but one can also simply wrap a sequence of ACTGactg
characters in packed_seq::AsciiSeqVec
.
Additionally, [simd-minimizers
] works on general (ASCII) &[u8]
text.
The main function provided by packed_seq
is packed_seq::Seq::iter_bp
, which splits the input into 8 chunks and iterates them in parallel using SIMD.
When dealing with ASCII input, use the AsciiSeq
and AsciiSeqVec
types.
§Hash function
By default, the library uses the ntHash
hash function, which maps each DNA base ACTG
to a pseudo-random value using a table lookup.
This hash function is specifically designed to be fast for hashing DNA sequences with input type packed_seq::PackedSeq
and packed_seq::AsciiSeq
.
For general ASCII sequences (&[u8]
), mulHash
is used instead, which instead multiplies each character value by a pseudo-random constant.
The mul_hash
module provides functions that always use mulHash, also for DNA sequences.
§Performance
This library depends on AVX2 or NEON SIMD instructions to achieve good performance.
Make sure to compile with -C target-cpu=native
to enable these instructions.
All functions take a out_vec: &mut Vec<u32>
parameter to which positions are appended.
For best performance, re-use the same out_vec
between invocations, and Vec::clear
it before or after each call.
§Examples
§Scalar AsciiSeq
// Scalar ASCII version.
use packed_seq::{SeqVec, AsciiSeq};
let seq = b"ACGTGCTCAGAGACTCAG";
let ascii_seq = AsciiSeq(seq);
let k = 5;
let w = 7;
let positions = simd_minimizers::minimizer_positions(ascii_seq, k, w);
assert_eq!(positions, vec![4, 5, 8, 13]);
§SIMD PackedSeq
// Packed SIMD version.
use packed_seq::{PackedSeqVec, SeqVec, Seq};
let seq = b"ACGTGCTCAGAGACTCAGAGGA";
let packed_seq = PackedSeqVec::from_ascii(seq);
let k = 5;
let w = 7;
// Unfortunately, `PackedSeqVec` can not `Deref` into a `PackedSeq`, so `as_slice` is needed.
// Since we also need the values, this uses the Builder API.
let mut fwd_pos = vec![];
let fwd_vals: Vec<_> = simd_minimizers::canonical_minimizers(k, w).run(packed_seq.as_slice(), &mut fwd_pos).values_u64().collect();
assert_eq!(fwd_pos, vec![0, 7, 9, 15]);
assert_eq!(fwd_vals, vec![
// T G C A C, CACGT is rc of ACGTG at pos 0
0b10_11_01_00_01,
// G A G A C, CAGAG is at pos 7
0b11_00_11_00_01,
// C A G A G, GAGAC is at pos 9
0b01_00_11_00_11,
// G A G A C, CAGAG is at pos 15
0b11_00_11_00_01
]);
// Check that reverse complement sequence has minimizers at 'reverse' positions.
let rc_packed_seq = packed_seq.as_slice().to_revcomp();
let mut rc_pos = Vec::new();
let mut rc_vals: Vec<_> = simd_minimizers::canonical_minimizers(k, w).run(rc_packed_seq.as_slice(), &mut rc_pos).values_u64().collect();
assert_eq!(rc_pos, vec![2, 8, 10, 17]);
for (fwd, &rc) in std::iter::zip(fwd_pos, rc_pos.iter().rev()) {
assert_eq!(fwd as usize, seq.len() - k - rc as usize);
}
rc_vals.reverse();
assert_eq!(rc_vals, fwd_vals);
§Seeded hasher
// Packed SIMD version with seeded hashes.
use packed_seq::{PackedSeqVec, SeqVec};
let seq = b"ACGTGCTCAGAGACTCAG";
let packed_seq = PackedSeqVec::from_ascii(seq);
let k = 5;
let w = 7;
let seed = 101010;
// Canonical by default. Use `NtHasher<false>` for forward-only.
let hasher = <seq_hash::NtHasher>::new_with_seed(k, seed);
let fwd_pos = simd_minimizers::canonical_minimizers(k, w).hasher(&hasher).run_once(packed_seq.as_slice());
Re-exports§
pub use packed_seq;
pub use seq_hash;
Modules§
- collect
- Collect (and dedup) SIMD-iterator values into a flat
Vec<u32>
. - private
- Re-exported internals. Used for benchmarking, and not part of the semver-compatible stable API.
Structs§
Functions§
- canonical_
minimizer_ positions - Positions of all canonical minimizers in the sequence.
- canonical_
minimizers - minimizer_
positions - Positions of all minimizers in the sequence.
- minimizers
- one_
minimizer - Minimizer position of a single window.