simd_minimizers/
lib.rs

//! A library to quickly compute (canonical) minimizers of DNA and text sequences.
//!
//! The main functions are:
//! - [`minimizer_positions`]: compute the positions of all minimizers of a sequence.
//! - [`canonical_minimizer_positions`]: compute the positions of all _canonical_ minimizers of a sequence.
//! Adjacent equal positions are deduplicated, but since the canonical minimizer is not a _forward_ scheme, a position can appear more than once.
//!
//! The implementation uses SIMD by splitting each sequence into 8 chunks and processing those in parallel.
//!
//! When using super-k-mers, use the `_and_superkmer` variants to additionally return a vector containing, for each minimizer, the index of the first window in which it is minimal.
//!
//! The minimizer of a single window can be found using [`one_minimizer`] and [`one_canonical_minimizer`], but note that these functions are not nearly as efficient.
//!
//! The [`scalar`] versions are mostly for testing, and almost always slower.
//! Only for short sequences up to around 100 characters is [`scalar::minimizer_positions_scalar`] faster than the SIMD version.
//!
//! ## Minimizers
//!
//! The code is explained in detail in our [paper](https://doi.org/10.4230/LIPIcs.SEA.2025.20):
//!
//! > SimdMinimizers: Computing random minimizers, fast.
//! > Ragnar Groot Koerkamp, Igor Martayan, SEA 2025
//!
//! Briefly, minimizers are defined using two parameters `k` and `w`.
//! Given a sequence of characters, all k-mers (substrings of length `k`) are hashed,
//! and for each _window_ of `w` consecutive k-mers (spanning `l = w + k - 1` characters),
//! (the position of) the smallest k-mer is sampled.
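//!
//! As a reference point, this definition can be spelled out naively for a single window.
//! This is an illustrative sketch only: `toy_hash` (FNV-1a) is a stand-in for the crate's actual hash functions.
//!
//! ```
//! // Naive reference for the definition: sample the position of the smallest
//! // k-mer by (hash, position), so ties break towards the leftmost k-mer.
//! fn toy_hash(kmer: &[u8]) -> u64 {
//!     kmer.iter().fold(0xcbf2_9ce4_8422_2325u64, |h, &c| {
//!         (h ^ c as u64).wrapping_mul(0x100_0000_01b3)
//!     })
//! }
//!
//! fn naive_minimizer_pos(window: &[u8], k: usize) -> usize {
//!     (0..=window.len() - k)
//!         .min_by_key(|&i| (toy_hash(&window[i..i + k]), i))
//!         .unwrap()
//! }
//!
//! // With all k-mers equal, the leftmost position wins.
//! assert_eq!(naive_minimizer_pos(b"AAAAAA", 3), 0);
//! ```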
//!
//! Minimizers are found as follows:
//! 1. Split the input into 8 chunks that are processed in parallel using SIMD.
//! 2. Compute a 32-bit ntHash rolling hash of the k-mers.
//! 3. Use the 'two stacks' sliding window minimum on the top 16 bits of each hash.
//! 4. Break ties towards the leftmost position by storing the position in the bottom 16 bits.
//! 5. Compute 8 consecutive minimizer positions, and dedup them.
//! 6. Collect the deduplicated minimizer positions from all 8 chunks into a single vector.
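//!
//! Steps 3–5 can be sketched in scalar code. The sketch below uses the block-based
//! formulation of the sliding-window minimum (equivalent to the 'two stacks' technique)
//! and is illustrative only, not the crate's SIMD implementation:
//!
//! ```
//! // Pack (hash top bits | position) so that a single `min` compares hashes
//! // first and breaks ties towards the leftmost position, then take
//! // sliding-window minima and dedup adjacent equal positions.
//! fn sketch_minimizer_positions(hashes: &[u32], w: usize) -> Vec<u32> {
//!     let n = hashes.len();
//!     assert!(w > 0 && n >= w);
//!     let packed: Vec<u32> = hashes
//!         .iter()
//!         .enumerate()
//!         .map(|(pos, &h)| (h & 0xffff_0000) | (pos as u32 & 0xffff))
//!         .collect();
//!
//!     // Sliding-window minimum via per-block prefix/suffix minima; this is
//!     // equivalent to the amortized-O(1) 'two stacks' technique.
//!     let mut pre = vec![u32::MAX; n];
//!     let mut suf = vec![u32::MAX; n];
//!     for b in (0..n).step_by(w) {
//!         let end = (b + w).min(n);
//!         for i in b..end {
//!             pre[i] = packed[i].min(if i > b { pre[i - 1] } else { u32::MAX });
//!         }
//!         for i in (b..end).rev() {
//!             suf[i] = packed[i].min(if i + 1 < end { suf[i + 1] } else { u32::MAX });
//!         }
//!     }
//!
//!     let mut out = Vec::new();
//!     for i in 0..=n - w {
//!         let pos = suf[i].min(pre[i + w - 1]) & 0xffff;
//!         if out.last() != Some(&pos) {
//!             out.push(pos); // dedup adjacent equal positions
//!         }
//!     }
//!     out
//! }
//!
//! let hashes = [0x5000_0000, 0x3000_0000, 0x4000_0000, 0x3000_0000, 0x6000_0000];
//! assert_eq!(sketch_minimizer_positions(&hashes, 3), vec![1, 3]);
//! ```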
//!
//! ## Canonical minimizers
//!
//! _Canonical_ minimizers have the property that the sampled k-mers of a DNA sequence are the same as those sampled from the _reverse complement_ sequence.
//!
//! This works as follows:
//! 1. ntHash is replaced by its canonical version, which XORs the hashes of the forward and reverse-complement k-mer.
//! 2. Compute the leftmost and rightmost minimal k-mer.
//! 3. Compute the 'preferred' strand of the current window as the one with more `TG` characters. This requires `l = w + k - 1` to be odd for proper tie-breaking.
//! 4. Return either the leftmost or rightmost smallest k-mer, depending on the preferred strand.
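//!
//! Step 3's strand preference can be sketched as follows. `prefers_forward` is a
//! hypothetical helper for illustration, not the crate's API:
//!
//! ```
//! // The preferred strand is the one with more `TG` characters. The reverse
//! // complement maps T/G to A/C and vice versa, so it has `l - tg` such
//! // characters; with `l` odd, `2 * tg != l` and there is never a tie.
//! fn prefers_forward(window: &[u8]) -> bool {
//!     let l = window.len();
//!     assert!(l % 2 == 1, "l = w + k - 1 must be odd");
//!     let tg = window.iter().filter(|&&c| c == b'T' || c == b'G').count();
//!     2 * tg > l
//! }
//!
//! assert!(prefers_forward(b"ACTGT"));  // 3 of 5 characters are T/G
//! assert!(!prefers_forward(b"AACAC")); // 0 of 5
//! ```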
//!
//! ## Input types
//!
//! This crate depends on [`packed-seq`] to handle generic types of input sequences.
//! Most commonly, one should use [`packed_seq::PackedSeqVec`] for packed DNA sequences, but one can also simply wrap a sequence of `ACTGactg` characters in [`packed_seq::AsciiSeqVec`].
//! Additionally, [`simd-minimizers`] works on general (ASCII) `&[u8]` text.
//!
//! The main function provided by [`packed_seq`] is [`packed_seq::Seq::iter_bp`], which splits the input into 8 chunks and iterates them in parallel using SIMD.
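//!
//! The chunked iteration can be pictured with plain arrays standing in for SIMD
//! registers. This is an illustrative sketch, not `packed_seq`'s actual
//! implementation (which also handles lengths that are not a multiple of 8):
//!
//! ```
//! // One value from each of the 8 chunks is produced per step, filling the 8
//! // lanes of a SIMD vector. The tail beyond `8 * chunk` is ignored here.
//! fn iter_lanes(seq: &[u8]) -> impl Iterator<Item = [u8; 8]> + '_ {
//!     let chunk = seq.len() / 8;
//!     (0..chunk).map(move |i| core::array::from_fn(|lane| seq[lane * chunk + i]))
//! }
//!
//! let seq: Vec<u8> = (0..16).collect();
//! let lanes: Vec<[u8; 8]> = iter_lanes(&seq).collect();
//! assert_eq!(lanes[0], [0, 2, 4, 6, 8, 10, 12, 14]);
//! assert_eq!(lanes[1], [1, 3, 5, 7, 9, 11, 13, 15]);
//! ```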
//!
//! When dealing with ASCII input, use the `AsciiSeq` and `AsciiSeqVec` types.
//!
//! ## Hash function
//!
//! By default, the library uses the `ntHash` hash function, which maps each DNA base `ACTG` to a pseudo-random value using a table lookup.
//! This hash function is specifically designed to be fast for hashing DNA sequences with input type [`packed_seq::PackedSeq`] and [`packed_seq::AsciiSeq`].
//!
//! For general ASCII sequences (`&[u8]`), `mulHash` is used instead, which multiplies each character value by a pseudo-random constant.
//! The `mul_hash` module provides functions that _always_ use mulHash, also for DNA sequences.
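//!
//! The appeal of an ntHash-style hash is that the hash rolls in O(1) per shifted
//! position. A sketch of the rotate-and-XOR structure, with made-up table
//! constants (not ntHash's real per-base seeds):
//!
//! ```
//! // Each base contributes a table value, rotated once per position it has
//! // moved through the k-mer. The constants are arbitrary stand-ins.
//! const TABLE: [u64; 4] = [
//!     0x9e37_79b9_7f4a_7c15, // A
//!     0xc2b2_ae3d_27d4_eb4f, // C
//!     0x1656_67b1_9e37_79f9, // T
//!     0x2545_f491_4f6c_dd1d, // G
//! ];
//! fn code(c: u8) -> usize {
//!     match c { b'A' => 0, b'C' => 1, b'T' => 2, b'G' => 3, _ => unreachable!() }
//! }
//! fn hash_kmer(kmer: &[u8]) -> u64 {
//!     kmer.iter().fold(0, |h, &c| h.rotate_left(1) ^ TABLE[code(c)])
//! }
//! // Shift the k-mer by one: drop `out` (rotated k times by now) and add `inc`.
//! fn roll(h: u64, k: usize, out: u8, inc: u8) -> u64 {
//!     h.rotate_left(1) ^ TABLE[code(out)].rotate_left(k as u32) ^ TABLE[code(inc)]
//! }
//!
//! let (seq, k) = (b"ACGTGCTA", 4);
//! let mut h = hash_kmer(&seq[0..k]);
//! for i in 1..=seq.len() - k {
//!     h = roll(h, k, seq[i - 1], seq[i + k - 1]);
//!     assert_eq!(h, hash_kmer(&seq[i..i + k])); // rolling matches from-scratch
//! }
//! ```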
//!
//! ## Performance
//!
//! This library depends on AVX2 or NEON SIMD instructions to achieve good performance.
//! Make sure to compile with `-C target-cpu=native` to enable these instructions.
//!
//! The `run` and `run_scalar` methods append positions to a `&mut Vec<u32>` output parameter, while the `_once` variants allocate a fresh vector.
//! For best performance, re-use the same output vector between invocations, and [`Vec::clear`] it before or after each call.
//!
//! ## Examples
//!
//! #### Scalar `AsciiSeq`
//!
//! ```
//! // Scalar ASCII version.
//! use packed_seq::{SeqVec, AsciiSeq};
//!
//! let seq = b"ACGTGCTCAGAGACTCAG";
//! let ascii_seq = AsciiSeq(seq);
//!
//! let k = 5;
//! let w = 7;
//!
//! let positions = simd_minimizers::minimizer_positions(ascii_seq, k, w);
//! assert_eq!(positions, vec![4, 5, 8, 13]);
//! ```
//!
//! #### SIMD `PackedSeq`
//!
//! ```
//! // Packed SIMD version.
//! use packed_seq::{PackedSeqVec, SeqVec, Seq};
//!
//! let seq = b"ACGTGCTCAGAGACTCAGAGGA";
//! let packed_seq = PackedSeqVec::from_ascii(seq);
//!
//! let k = 5;
//! let w = 7;
//!
//! // Unfortunately, `PackedSeqVec` cannot `Deref` into a `PackedSeq`, so `as_slice` is needed.
//! // Since we also need the k-mer values, this uses the Builder API.
//! let mut fwd_pos = vec![];
//! let fwd_vals: Vec<_> = simd_minimizers::canonical_minimizers(k, w).run(packed_seq.as_slice(), &mut fwd_pos).values_u64().collect();
//! assert_eq!(fwd_pos, vec![0, 7, 9, 15]);
//! assert_eq!(fwd_vals, vec![
//!     // T  G  C  A  C, CACGT is rc of ACGTG at pos 0
//!     0b10_11_01_00_01,
//!     // G  A  G  A  C, CAGAG is at pos 7
//!     0b11_00_11_00_01,
//!     // C  A  G  A  G, GAGAC is at pos 9
//!     0b01_00_11_00_11,
//!     // G  A  G  A  C, CAGAG is at pos 15
//!     0b11_00_11_00_01
//! ]);
//!
//! // Check that the reverse complement sequence has minimizers at 'reversed' positions.
//! let rc_packed_seq = packed_seq.as_slice().to_revcomp();
//! let mut rc_pos = Vec::new();
//! let mut rc_vals: Vec<_> = simd_minimizers::canonical_minimizers(k, w).run(rc_packed_seq.as_slice(), &mut rc_pos).values_u64().collect();
//! assert_eq!(rc_pos, vec![2, 8, 10, 17]);
//! for (fwd, &rc) in std::iter::zip(fwd_pos, rc_pos.iter().rev()) {
//!     assert_eq!(fwd as usize, seq.len() - k - rc as usize);
//! }
//! rc_vals.reverse();
//! assert_eq!(rc_vals, fwd_vals);
//! ```
//!
//! #### Seeded hasher
//!
//! ```
//! // Packed SIMD version with seeded hashes.
//! use packed_seq::{PackedSeqVec, SeqVec};
//!
//! let seq = b"ACGTGCTCAGAGACTCAG";
//! let packed_seq = PackedSeqVec::from_ascii(seq);
//!
//! let k = 5;
//! let w = 7;
//! let seed = 101010;
//! // Canonical by default. Use `NtHasher<false>` for forward-only.
//! let hasher = <seq_hash::NtHasher>::new_with_seed(k, seed);
//!
//! let fwd_pos = simd_minimizers::canonical_minimizers(k, w).hasher(&hasher).run_once(packed_seq.as_slice());
//! ```

#![allow(clippy::missing_transmute_annotations)]

mod canonical;
pub mod collect;
mod minimizers;
mod sliding_min;
mod intrinsics {
    mod dedup;
    pub use dedup::{append_unique_vals, append_unique_vals_2};
}

#[cfg(test)]
mod test;

/// Re-exported internals. Used for benchmarking, and not part of the semver-compatible stable API.
pub mod private {
    pub mod canonical {
        pub use crate::canonical::*;
    }
    pub mod minimizers {
        pub use crate::minimizers::*;
    }
    pub mod sliding_min {
        pub use crate::sliding_min::*;
    }
    pub use packed_seq::u32x8 as S;
}

use collect::CollectAndDedup;
use collect::collect_and_dedup_into_scalar;
use collect::collect_and_dedup_with_index_into_scalar;
use minimizers::canonical_minimizers_skip_ambiguous_windows;
/// Re-export of the `packed-seq` crate.
pub use packed_seq;
use packed_seq::PackedNSeq;
use packed_seq::PackedSeq;
/// Re-export of the `seq-hash` crate.
pub use seq_hash;

use minimizers::{
    canonical_minimizers_seq_scalar, canonical_minimizers_seq_simd, minimizers_seq_scalar,
    minimizers_seq_simd,
};
use packed_seq::Seq;
use packed_seq::u32x8 as S;
use seq_hash::KmerHasher;

pub use minimizers::one_minimizer;
use seq_hash::NtHasher;
pub use sliding_min::Cache;

thread_local! {
    static CACHE: std::cell::RefCell<(Cache, Vec<S>, Vec<S>)> = std::cell::RefCell::new(Default::default());
}

/// Builder for configuring a (canonical) minimizer computation.
///
/// Construct one via [`minimizers`] or [`canonical_minimizers`], optionally set a custom
/// [`Builder::hasher`] or request super-kmer positions via [`Builder::super_kmers`],
/// and then call one of the `run` methods.
pub struct Builder<'h, const CANONICAL: bool, H: KmerHasher, SkPos> {
    k: usize,
    w: usize,
    hasher: Option<&'h H>,
    sk_pos: SkPos,
}

/// Handle returned by the `run` methods, giving access to the k-mer values at the
/// computed minimizer positions.
pub struct Output<'o, const CANONICAL: bool, S> {
    k: usize,
    seq: S,
    min_pos: &'o Vec<u32>,
}

/// Returns a [`Builder`] for forward minimizers, using ntHash by default.
#[must_use]
pub fn minimizers(k: usize, w: usize) -> Builder<'static, false, NtHasher<false>, ()> {
    Builder {
        k,
        w,
        hasher: None,
        sk_pos: (),
    }
}

/// Returns a [`Builder`] for canonical minimizers, using canonical ntHash by default.
///
/// `l = w + k - 1` must be odd to determine the strand of each window.
#[must_use]
pub fn canonical_minimizers(k: usize, w: usize) -> Builder<'static, true, NtHasher<true>, ()> {
    Builder {
        k,
        w,
        hasher: None,
        sk_pos: (),
    }
}

impl<const CANONICAL: bool> Builder<'static, CANONICAL, NtHasher<CANONICAL>, ()> {
    /// Use a custom [`KmerHasher`] instead of the default ntHash.
    #[must_use]
    pub fn hasher<'h, H2: KmerHasher>(&self, hasher: &'h H2) -> Builder<'h, CANONICAL, H2, ()> {
        Builder {
            k: self.k,
            w: self.w,
            sk_pos: (),
            hasher: Some(hasher),
        }
    }
}
impl<'h, const CANONICAL: bool, H: KmerHasher> Builder<'h, CANONICAL, H, ()> {
    /// Additionally collect, for each minimizer, the index of the first window in which it is minimal.
    #[must_use]
    pub fn super_kmers<'o2>(
        &self,
        sk_pos: &'o2 mut Vec<u32>,
    ) -> Builder<'h, CANONICAL, H, &'o2 mut Vec<u32>> {
        Builder {
            k: self.k,
            w: self.w,
            hasher: self.hasher,
            sk_pos,
        }
    }
}

/// Runners without super-kmer positions.
impl<'h, const CANONICAL: bool, H: KmerHasher> Builder<'h, CANONICAL, H, ()> {
    pub fn run_scalar_once<'s, SEQ: Seq<'s>>(&self, seq: SEQ) -> Vec<u32> {
        let mut min_pos = vec![];
        self.run_impl::<false, _>(seq, &mut min_pos);
        min_pos
    }

    pub fn run_once<'s, SEQ: Seq<'s>>(&self, seq: SEQ) -> Vec<u32> {
        let mut min_pos = vec![];
        self.run_impl::<true, _>(seq, &mut min_pos);
        min_pos
    }

    pub fn run_scalar<'s, 'o, SEQ: Seq<'s>>(
        &self,
        seq: SEQ,
        min_pos: &'o mut Vec<u32>,
    ) -> Output<'o, CANONICAL, SEQ> {
        self.run_impl::<false, _>(seq, min_pos)
    }

    pub fn run<'s, 'o, SEQ: Seq<'s>>(
        &self,
        seq: SEQ,
        min_pos: &'o mut Vec<u32>,
    ) -> Output<'o, CANONICAL, SEQ> {
        self.run_impl::<true, _>(seq, min_pos)
    }

    fn run_impl<'s, 'o, const SIMD: bool, SEQ: Seq<'s>>(
        &self,
        seq: SEQ,
        min_pos: &'o mut Vec<u32>,
    ) -> Output<'o, CANONICAL, SEQ> {
        // Fall back to a fresh default hasher when none was configured.
        let default_hasher = self.hasher.is_none().then(|| H::new(self.k));
        let hasher = self
            .hasher
            .unwrap_or_else(|| default_hasher.as_ref().unwrap());

        CACHE.with_borrow_mut(|cache| match (SIMD, CANONICAL) {
            (false, false) => collect_and_dedup_into_scalar(
                minimizers_seq_scalar(seq, hasher, self.w, &mut cache.0),
                min_pos,
            ),
            (false, true) => collect_and_dedup_into_scalar(
                canonical_minimizers_seq_scalar(seq, hasher, self.w, &mut cache.0),
                min_pos,
            ),
            (true, false) => minimizers_seq_simd(seq, hasher, self.w, &mut cache.0)
                .collect_and_dedup_into::<false>(min_pos),
            (true, true) => canonical_minimizers_seq_simd(seq, hasher, self.w, &mut cache.0)
                .collect_and_dedup_into::<false>(min_pos),
        });
        Output {
            k: self.k,
            seq,
            min_pos,
        }
    }
}

impl<'h, H: KmerHasher> Builder<'h, true, H, ()> {
    pub fn run_skip_ambiguous_windows_once<'s>(&self, nseq: PackedNSeq<'s>) -> Vec<u32> {
        let mut min_pos = vec![];
        self.run_skip_ambiguous_windows(nseq, &mut min_pos);
        min_pos
    }
    pub fn run_skip_ambiguous_windows<'s, 'o>(
        &self,
        nseq: PackedNSeq<'s>,
        min_pos: &'o mut Vec<u32>,
    ) -> Output<'o, true, PackedSeq<'s>> {
        CACHE
            .with_borrow_mut(|cache| self.run_skip_ambiguous_windows_with_buf(nseq, min_pos, cache))
    }
    pub fn run_skip_ambiguous_windows_with_buf<'s, 'o>(
        &self,
        nseq: PackedNSeq<'s>,
        min_pos: &'o mut Vec<u32>,
        cache: &mut (Cache, Vec<S>, Vec<S>),
    ) -> Output<'o, true, PackedSeq<'s>> {
        let default_hasher = self.hasher.is_none().then(|| H::new(self.k));
        let hasher = self
            .hasher
            .unwrap_or_else(|| default_hasher.as_ref().unwrap());
        canonical_minimizers_skip_ambiguous_windows(nseq, hasher, self.w, cache)
            .collect_and_dedup_into::<true>(min_pos);
        Output {
            k: self.k,
            seq: nseq.seq,
            min_pos,
        }
    }
}

/// Runners that also collect super-kmer positions.
impl<'h, 'o2, const CANONICAL: bool, H: KmerHasher> Builder<'h, CANONICAL, H, &'o2 mut Vec<u32>> {
    pub fn run_scalar_once<'s, SEQ: Seq<'s>>(self, seq: SEQ) -> Vec<u32> {
        let mut min_pos = vec![];
        self.run_scalar(seq, &mut min_pos);
        min_pos
    }

    pub fn run_scalar<'s, 'o, SEQ: Seq<'s>>(
        self,
        seq: SEQ,
        min_pos: &'o mut Vec<u32>,
    ) -> Output<'o, CANONICAL, SEQ> {
        let default_hasher = self.hasher.is_none().then(|| H::new(self.k));
        let hasher = self
            .hasher
            .unwrap_or_else(|| default_hasher.as_ref().unwrap());

        CACHE.with_borrow_mut(|cache| match CANONICAL {
            false => collect_and_dedup_with_index_into_scalar(
                minimizers_seq_scalar(seq, hasher, self.w, &mut cache.0),
                min_pos,
                self.sk_pos,
            ),
            true => collect_and_dedup_with_index_into_scalar(
                canonical_minimizers_seq_scalar(seq, hasher, self.w, &mut cache.0),
                min_pos,
                self.sk_pos,
            ),
        });
        Output {
            k: self.k,
            seq,
            min_pos,
        }
    }

    pub fn run_once<'s, SEQ: Seq<'s>>(self, seq: SEQ) -> Vec<u32> {
        let mut min_pos = vec![];
        self.run(seq, &mut min_pos);
        min_pos
    }

    pub fn run<'s, 'o, SEQ: Seq<'s>>(
        self,
        seq: SEQ,
        min_pos: &'o mut Vec<u32>,
    ) -> Output<'o, CANONICAL, SEQ> {
        CACHE.with_borrow_mut(|cache| self.run_with_buf(seq, min_pos, &mut cache.0))
    }

    #[inline(always)]
    fn run_with_buf<'s, 'o, SEQ: Seq<'s>>(
        self,
        seq: SEQ,
        min_pos: &'o mut Vec<u32>,
        cache: &mut Cache,
    ) -> Output<'o, CANONICAL, SEQ> {
        let default_hasher = self.hasher.is_none().then(|| H::new(self.k));
        let hasher = self
            .hasher
            .unwrap_or_else(|| default_hasher.as_ref().unwrap());

        match CANONICAL {
            false => minimizers_seq_simd(seq, hasher, self.w, cache)
                .collect_and_dedup_with_index_into(min_pos, self.sk_pos),
            true => canonical_minimizers_seq_simd(seq, hasher, self.w, cache)
                .collect_and_dedup_with_index_into(min_pos, self.sk_pos),
        };
        Output {
            k: self.k,
            seq,
            min_pos,
        }
    }
}

impl<'s, 'o, const CANONICAL: bool, SEQ: Seq<'s>> Output<'o, CANONICAL, SEQ> {
    /// Iterator over (canonical) u64 kmer-values associated with all minimizer positions.
    #[must_use]
    pub fn values_u64(&self) -> impl ExactSizeIterator<Item = u64> {
        self.pos_and_values_u64().map(|(_pos, val)| val)
    }
    /// Iterator over (canonical) u128 kmer-values associated with all minimizer positions.
    #[must_use]
    pub fn values_u128(&self) -> impl ExactSizeIterator<Item = u128> {
        self.pos_and_values_u128().map(|(_pos, val)| val)
    }
    /// Iterator over positions and (canonical) u64 kmer-values associated with all minimizer positions.
    #[must_use]
    pub fn pos_and_values_u64(&self) -> impl ExactSizeIterator<Item = (u32, u64)> {
        self.min_pos.iter().map(
            #[inline(always)]
            move |&pos| {
                let val = if CANONICAL {
                    // The canonical value is the minimum of the k-mer and its reverse complement.
                    let a = self.seq.read_kmer(self.k, pos as usize);
                    let b = self.seq.read_revcomp_kmer(self.k, pos as usize);
                    core::cmp::min(a, b)
                } else {
                    self.seq.read_kmer(self.k, pos as usize)
                };
                (pos, val)
            },
        )
    }
    /// Iterator over positions and (canonical) u128 kmer-values associated with all minimizer positions.
    #[must_use]
    pub fn pos_and_values_u128(&self) -> impl ExactSizeIterator<Item = (u32, u128)> {
        self.min_pos.iter().map(
            #[inline(always)]
            move |&pos| {
                let val = if CANONICAL {
                    let a = self.seq.read_kmer_u128(self.k, pos as usize);
                    let b = self.seq.read_revcomp_kmer_u128(self.k, pos as usize);
                    core::cmp::min(a, b)
                } else {
                    self.seq.read_kmer_u128(self.k, pos as usize)
                };
                (pos, val)
            },
        )
    }
}

/// Positions of all minimizers in the sequence.
///
/// See [`minimizers`], [`canonical_minimizers`], and [`Builder`] for more
/// configurations supporting a custom hasher, super-kmer positions, and
/// returning kmer-values.
///
/// This convenience function returns a freshly allocated vector; to append into a
/// reusable buffer instead, use [`minimizers`] with [`Builder::run`].
pub fn minimizer_positions<'s>(seq: impl Seq<'s>, k: usize, w: usize) -> Vec<u32> {
    minimizers(k, w).run_once(seq)
}

/// Positions of all canonical minimizers in the sequence.
///
/// See [`minimizers`], [`canonical_minimizers`], and [`Builder`] for more
/// configurations supporting a custom hasher, super-kmer positions, and
/// returning kmer-values.
///
/// `l = w + k - 1` must be odd to determine the strand of each window.
///
/// This convenience function returns a freshly allocated vector; to append into a
/// reusable buffer instead, use [`canonical_minimizers`] with [`Builder::run`].
pub fn canonical_minimizer_positions<'s>(seq: impl Seq<'s>, k: usize, w: usize) -> Vec<u32> {
    canonical_minimizers(k, w).run_once(seq)
}