Crate simd_minimizers

Source
Expand description

A library to quickly compute (canonical) minimizers of DNA and text sequences.

The main functions are:

  • minimizer_positions: compute the positions of all minimizers of a sequence.
  • canonical_minimizer_positions: compute the positions of all canonical minimizers of a sequence. Adjacent equal positions are deduplicated, but since the canonical minimizer is not forward, a position could appear more than once.

The implementation uses SIMD by splitting each sequence into 8 chunks and processing those in parallel. The minimizer_positions_scalar and canonical_minimizer_positions_scalar versions can be more efficient on short sequences where the overhead of chunking is large.

The minimizer of a single window can be found using one_minimizer and one_canonical_minimizer.

§Minimizers

The code is explained in detail in our preprint:

SimdMinimizers: Computing random minimizers, fast. Ragnar Groot Koerkamp, Igor Martayan

Briefly, minimizers are defined using two parameters k and w. Given a sequence of characters, all k-mers (substrings of length k) are hashed, and for each window of k consecutive k-mers (of length l = w + k - 1 characters), (the position of) the smallest k-mer is sampled.

Minimizers are found as follows:

  1. Split the input to 8 chunks that are processed in parallel using SIMD.
  2. Compute a 32-bit ntHash rolling hash of the k-mers.
  3. Use the ‘two stacks’ sliding window minimum on the top 16 bits of each hash.
  4. Break ties towards the leftmost position by storing the position in the bottom 16 bits.
  5. Compute 8 consecutive minimizer positions, and dedup them.
  6. Collect the deduplicated minimizer positions from all 8 chunks into a single vector.

§Canonical minimizers

Canonical minimizers have the property that the sampled k-mers of a DNA sequence are the same as those sampled from the reverse complement sequence.

This works as follows:

  1. ntHash is modified to use the canonical version that computes the xor of the hash of the forward and reverse complement k-mer.
  2. Compute the leftmost and rightmost minimal k-mer.
  3. Compute the ‘preferred’ strand of the current window as the one with more TG characters. This requires l=w+k-1 to be odd for proper tie-breaking.
  4. Return either the leftmost or rightmost smallest k-mer, depending on the preferred strand.

§Input types

This crate depends on [packed-seq] to handle generic types of input sequences. Most commonly, one should use packed_seq::PackedSeqVec for packed DNA sequences, but one can also simply wrap a sequence of ACTGactg characters in packed_seq::AsciiSeqVec. Additionally, [simd-minimizers] works on general (ASCII) &[u8] text.

The main function provided by packed_seq is packed_seq::Seq::iter_bp, which splits the input into 8 chunks and iterates them in parallel using SIMD.

When dealing with ASCII input, use the AsciiSeq and AsciiSeqVec types.

§Hash function

By default, the library uses the ntHash hash function, which maps each DNA base ACTG to a pseudo-random value using a table lookup. This hash function is specifically designed to be fast for hashing DNA sequences with input type packed_seq::PackedSeq and packed_seq::AsciiSeq.

For general ASCII sequences (&[u8]), mulHash is used instead, which instead multiplies each character value by a pseudo-random constant. The mul_hash module provides functions that always use mulHash, also for DNA sequences.

§Performance

This library depends on AVX2 or NEON SIMD instructions to achieve good performance. Make sure to compile with -C target-cpu=native to enable these instructions.

All functions take a out_vec: &mut Vec<u32> parameter to which positions are appended. For best performance, re-use the same out_vec between invocations, and Vec::clear it before or after each call.

§Features

  • hide-simd-warning: If your system does not support AVX2 or NEON, enable this feature to disable the compile warning that will be shown.

§Examples

// Scalar ASCII version.
use packed_seq::{Seq, AsciiSeq};
let seq = b"ACGTGCTCAGAGACTCAG";
let ascii_seq = AsciiSeq(seq);
let k = 5;
let w = 7;
let mut out_vec = Vec::new();
simd_minimizers::minimizer_positions_scalar(ascii_seq, k, w, &mut out_vec);
assert_eq!(out_vec, vec![0, 6, 8, 10, 12]);
// Packed SIMD version.
use packed_seq::{complement_char, PackedSeqVec, SeqVec};
let seq = b"ACGTGCTCAGAGACTCAG";
let k = 5;
let w = 7;

let packed_seq = PackedSeqVec::from_ascii(seq);
let mut fwd_pos = Vec::new();
// Unfortunately, `PackedSeqVec` can not `Deref` into a `PackedSeq`.
simd_minimizers::canonical_minimizer_positions(packed_seq.as_slice(), k, w, &mut fwd_pos);
assert_eq!(fwd_pos, vec![3, 5, 12]);

// Check that reverse complement sequence has minimizers at 'reverse' positions.
let rc_seq = seq.iter().rev().map(|&b| complement_char(b)).collect::<Vec<_>>();
let rc_packed_seq = PackedSeqVec::from_ascii(&rc_seq);
let mut rc_pos = Vec::new();
simd_minimizers::canonical_minimizer_positions(rc_packed_seq.as_slice(), k, w, &mut rc_pos);
assert_eq!(rc_pos, vec![1, 8, 10]);
for (fwd, &rc) in std::iter::zip(fwd_pos, rc_pos.iter().rev()) {
    assert_eq!(fwd as usize, seq.len() - k - rc as usize);
}

Re-exports§

pub use packed_seq;

Modules§

mul_hash
Variants that always use mulHash, instead of the default ntHash for DNA and mulHash for text.
private
Re-exported internals. Used for benchmarking, and not part of the semver-compatible stable API.

Functions§

canonical_minimizer_positions
Deduplicated positions of all canonical minimizers in the sequence, using SIMD.
canonical_minimizer_positions_scalar
Deduplicated positions of all canonical minimizers in the sequence. This scalar version can be faster for short sequences.
minimizer_positions
Deduplicated positions of all minimizers in the sequence, using SIMD.
minimizer_positions_scalar
Deduplicated positions of all minimizers in the sequence. This scalar version can be faster for short sequences.
one_canonical_minimizer
Canonical minimizer position of a single window.
one_minimizer
Minimizer position of a single window.