simd_minimizers/lib.rs
//! A library to quickly compute (canonical) minimizers of DNA and text sequences.
//!
//! The main functions are:
//! - [`minimizer_positions`]: compute the positions of all minimizers of a sequence.
//! - [`canonical_minimizer_positions`]: compute the positions of all _canonical_ minimizers of a sequence.
//!   Adjacent equal positions are deduplicated, but since the canonical minimizer is not a _forward_ scheme, a position can appear more than once.
//!
//! The implementation uses SIMD by splitting each sequence into 8 chunks and processing those in parallel.
//!
//! When using super-k-mers, use [`Builder::super_kmers`] to additionally fill a vector with the index of the first window in which each minimizer is minimal, as in the example below.
//!
//! The minimizer of a single window can be found using [`one_minimizer`] and `one_canonical_minimizer`, but note that these functions are not nearly as efficient as the batch versions.
//!
//! The `run_scalar` variants of the [`Builder`] are mostly for testing, and essentially always slower.
//! Only for short sequences, up to around 100 characters, is the scalar version faster than the SIMD version.
//!
//! ## Minimizers
//!
//! The code is explained in detail in our [paper](https://doi.org/10.4230/LIPIcs.SEA.2025.20):
//!
//! > SimdMinimizers: Computing random minimizers, fast.
//! > Ragnar Groot Koerkamp, Igor Martayan, SEA 2025
//!
//! Briefly, minimizers are defined using two parameters `k` and `w`.
//! Given a sequence of characters, all k-mers (substrings of length `k`) are hashed,
//! and for each _window_ of `w` consecutive k-mers (i.e., `l = w + k - 1` characters),
//! (the position of) the smallest k-mer is sampled.
//!
//! Minimizers are found as follows:
//! 1. Split the input into 8 chunks that are processed in parallel using SIMD.
//! 2. Compute a 32-bit ntHash rolling hash of the k-mers.
//! 3. Use the 'two stacks' sliding window minimum on the top 16 bits of each hash.
//! 4. Break ties towards the leftmost position by storing the position in the bottom 16 bits, as sketched below.
//! 5. Compute 8 consecutive minimizer positions, and dedup them.
//! 6. Collect the deduplicated minimizer positions from all 8 chunks into a single vector.
//!
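//! A minimal scalar sketch of the tie-breaking in step 4 (the crate's actual implementation is SIMD and uses the amortized-constant 'two stacks' algorithm rather than the naive minimum below):
//!
//! ```
//! // Pack the 16-bit hash above the 16-bit in-window position, so that comparing
//! // the combined word orders by hash first and breaks ties towards smaller positions.
//! fn window_min_pos(hashes: &[u32]) -> u32 {
//!     let combined = hashes.iter().zip(0u32..).map(|(&h, pos)| (h << 16) | pos);
//!     combined.min().unwrap() & 0xFFFF // low 16 bits: the winning position
//! }
//! // Hashes 3 at positions 1 and 2 tie; the leftmost (position 1) wins.
//! assert_eq!(window_min_pos(&[7, 3, 3, 9]), 1);
//! ```
//!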
//! ## Canonical minimizers
//!
//! _Canonical_ minimizers have the property that the k-mers sampled from a DNA sequence are the same as those sampled from its _reverse complement_.
//!
//! This works as follows:
//! 1. ntHash is replaced by its canonical version, which xors the hashes of the forward and reverse-complement k-mer.
//! 2. Compute the leftmost and rightmost minimal k-mer.
//! 3. Compute the 'preferred' strand of the current window as the one with more `TG` characters, as sketched below. This requires `l = w + k - 1` to be odd for proper tie-breaking.
//! 4. Return either the leftmost or rightmost smallest k-mer, depending on the preferred strand.
//!
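//! A hedged scalar sketch of the strand rule in step 3: complementing maps `A`/`C` to `T`/`G` and vice versa, so a window and its reverse complement could only tie on the `TG` count if `l` were even, which is why odd `l` is required.
//!
//! ```
//! // True when the forward strand of this window has strictly more T/G characters
//! // than its reverse complement, which has `window.len() - tg` of them.
//! fn prefers_forward(window: &[u8]) -> bool {
//!     let tg = window.iter().filter(|&&c| c == b'T' || c == b'G').count();
//!     2 * tg > window.len() // never a tie when the window length is odd
//! }
//! assert!(prefers_forward(b"TTGCA")); // 3 of 5 characters are T/G
//! assert!(!prefers_forward(b"ACACA")); // its reverse complement TGTGT prefers forward
//! ```
//!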
//! ## Input types
//!
//! This crate depends on [`packed_seq`] to handle generic types of input sequences.
//! Most commonly, one should use [`packed_seq::PackedSeqVec`] for packed DNA sequences, but one can also simply wrap a sequence of `ACTGactg` characters in [`packed_seq::AsciiSeqVec`].
//! Additionally, `simd-minimizers` works on general (ASCII) `&[u8]` text.
//!
//! The main function provided by [`packed_seq`] is [`packed_seq::Seq::iter_bp`], which splits the input into 8 chunks and iterates them in parallel using SIMD.
//!
//! When dealing with ASCII DNA input, use the `AsciiSeq` and `AsciiSeqVec` types.
//!
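//! A small sketch of the plain-text case (the exact positions depend on the hash function, so this only checks that some minimizers are found):
//!
//! ```
//! // General ASCII text: a plain `&[u8]` is also accepted as input.
//! let text = b"the quick brown fox jumps over the lazy dog";
//! let positions = simd_minimizers::minimizer_positions(text.as_slice(), 4, 6);
//! assert!(!positions.is_empty());
//! ```
//!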
//! ## Hash function
//!
//! By default, the library uses the `ntHash` hash function, which maps each DNA base `ACTG` to a pseudo-random value using a table lookup.
//! This hash function is specifically designed to be fast for hashing DNA sequences with input types [`packed_seq::PackedSeq`] and [`packed_seq::AsciiSeq`].
//!
//! For general ASCII sequences (`&[u8]`), `mulHash` is used instead, which multiplies each character value by a pseudo-random constant.
//! A mulHash-based hasher can also be used for DNA sequences by passing it explicitly via [`Builder::hasher`].
//!
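//! To illustrate the mulHash idea only (the multiplier below is made up, and the actual rolling implementation in `seq-hash` differs):
//!
//! ```
//! // Hypothetical mulHash-style k-mer hash: rotate-and-xor of per-character products.
//! fn mul_hash_kmer(kmer: &[u8]) -> u32 {
//!     const C: u32 = 0x9E37_79B9; // an arbitrary odd multiplier, for illustration
//!     kmer.iter()
//!         .fold(0u32, |h, &c| h.rotate_left(1) ^ (c as u32).wrapping_mul(C))
//! }
//! // The rotations make the hash rollable, as in ntHash: after k steps, the oldest
//! // character's contribution can be xored out again.
//! assert_ne!(mul_hash_kmer(b"ACGT"), mul_hash_kmer(b"ACGA"));
//! ```
//!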
//! ## Performance
//!
//! This library relies on AVX2 or NEON SIMD instructions to achieve good performance.
//! Make sure to compile with `-C target-cpu=native` to enable these instructions.
//!
//! The `run` functions take a `min_pos: &mut Vec<u32>` parameter to which positions are _appended_.
//! For best performance, re-use the same vector between invocations, and [`Vec::clear`] it before or after each call, as in the sketch below.
//!
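//! A minimal sketch of that reuse pattern:
//!
//! ```
//! use packed_seq::{PackedSeqVec, SeqVec};
//!
//! let minimizers = simd_minimizers::minimizers(5, 7);
//! let mut min_pos = Vec::new();
//! for seq in [b"ACGTGCTCAGAGACTCAG".as_slice(), b"TTTGGGCCCAAATTTGGG"] {
//!     min_pos.clear(); // keep the allocation, drop the old positions
//!     let packed = PackedSeqVec::from_ascii(seq);
//!     minimizers.run(packed.as_slice(), &mut min_pos);
//!     // ... use `min_pos` ...
//! }
//! ```
//!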
//! ## Examples
//!
//! #### Scalar `AsciiSeq`
//!
//! ```
//! // Scalar ASCII version.
//! use packed_seq::{SeqVec, AsciiSeq};
//!
//! let seq = b"ACGTGCTCAGAGACTCAG";
//! let ascii_seq = AsciiSeq(seq);
//!
//! let k = 5;
//! let w = 7;
//!
//! let positions = simd_minimizers::minimizer_positions(ascii_seq, k, w);
//! assert_eq!(positions, vec![4, 5, 8, 13]);
//! ```
//!
//! #### SIMD `PackedSeq`
//!
//! ```
//! // Packed SIMD version.
//! use packed_seq::{PackedSeqVec, SeqVec, Seq};
//!
//! let seq = b"ACGTGCTCAGAGACTCAGAGGA";
//! let packed_seq = PackedSeqVec::from_ascii(seq);
//!
//! let k = 5;
//! let w = 7;
//!
//! // Unfortunately, `PackedSeqVec` cannot `Deref` into a `PackedSeq`, so `as_slice` is needed.
//! // Since we also need the values, this uses the builder API.
//! let mut fwd_pos = vec![];
//! let fwd_vals: Vec<_> = simd_minimizers::canonical_minimizers(k, w)
//!     .run(packed_seq.as_slice(), &mut fwd_pos)
//!     .values_u64()
//!     .collect();
//! assert_eq!(fwd_pos, vec![0, 7, 9, 15]);
//! assert_eq!(fwd_vals, vec![
//!     // T G C A C: CACGT is the reverse complement of ACGTG at position 0.
//!     0b10_11_01_00_01,
//!     // G A G A C: CAGAG is at position 7.
//!     0b11_00_11_00_01,
//!     // C A G A G: GAGAC is at position 9.
//!     0b01_00_11_00_11,
//!     // G A G A C: CAGAG is at position 15.
//!     0b11_00_11_00_01,
//! ]);
//!
//! // Check that the reverse complement sequence has minimizers at 'reversed' positions.
//! let rc_packed_seq = packed_seq.as_slice().to_revcomp();
//! let mut rc_pos = Vec::new();
//! let mut rc_vals: Vec<_> = simd_minimizers::canonical_minimizers(k, w)
//!     .run(rc_packed_seq.as_slice(), &mut rc_pos)
//!     .values_u64()
//!     .collect();
//! assert_eq!(rc_pos, vec![2, 8, 10, 17]);
//! for (fwd, &rc) in std::iter::zip(fwd_pos, rc_pos.iter().rev()) {
//!     assert_eq!(fwd as usize, seq.len() - k - rc as usize);
//! }
//! rc_vals.reverse();
//! assert_eq!(rc_vals, fwd_vals);
//! ```
//!
//! #### Seeded hasher
//!
//! ```
//! // Packed SIMD version with a seeded hash function.
//! use packed_seq::{PackedSeqVec, SeqVec};
//!
//! let seq = b"ACGTGCTCAGAGACTCAG";
//! let packed_seq = PackedSeqVec::from_ascii(seq);
//!
//! let k = 5;
//! let w = 7;
//! let seed = 101010;
//! // Canonical by default. Use `NtHasher<false>` for forward-only hashing.
//! let hasher = <seq_hash::NtHasher>::new_with_seed(k, seed);
//!
//! let fwd_pos = simd_minimizers::canonical_minimizers(k, w)
//!     .hasher(&hasher)
//!     .run_once(packed_seq.as_slice());
//! assert!(!fwd_pos.is_empty());
//! ```

#![allow(clippy::missing_transmute_annotations)]

mod canonical;
pub mod collect;
mod minimizers;
mod sliding_min;
mod intrinsics {
    mod dedup;
    pub use dedup::{append_unique_vals, append_unique_vals_2};
}

#[cfg(test)]
mod test;

/// Re-exported internals. Used for benchmarking, and not part of the semver-compatible stable API.
pub mod private {
    pub mod canonical {
        pub use crate::canonical::*;
    }
    pub mod minimizers {
        pub use crate::minimizers::*;
    }
    pub mod sliding_min {
        pub use crate::sliding_min::*;
    }
    pub use packed_seq::u32x8 as S;
}

use collect::CollectAndDedup;
use collect::collect_and_dedup_into_scalar;
use collect::collect_and_dedup_with_index_into_scalar;
use minimizers::canonical_minimizers_skip_ambiguous_windows;
/// Re-export of the `packed-seq` crate.
pub use packed_seq;
use packed_seq::PackedNSeq;
use packed_seq::PackedSeq;
/// Re-export of the `seq-hash` crate.
pub use seq_hash;

use minimizers::{
    canonical_minimizers_seq_scalar, canonical_minimizers_seq_simd, minimizers_seq_scalar,
    minimizers_seq_simd,
};
use packed_seq::Seq;
use packed_seq::u32x8 as S;
use seq_hash::KmerHasher;

pub use minimizers::one_minimizer;
use seq_hash::NtHasher;
pub use sliding_min::Cache;

thread_local! {
    // Scratch buffers that are reused between calls to avoid repeated allocations.
    static CACHE: std::cell::RefCell<(Cache, Vec<S>, Vec<S>)> = std::cell::RefCell::new(Default::default());
}

/// Builder for minimizer computations, created by [`minimizers`] or [`canonical_minimizers`].
///
/// Optionally takes a custom hasher ([`Builder::hasher`]) and a super-k-mer output vector ([`Builder::super_kmers`]).
pub struct Builder<'h, const CANONICAL: bool, H: KmerHasher, SkPos> {
    k: usize,
    w: usize,
    hasher: Option<&'h H>,
    sk_pos: SkPos,
}

/// Result of a [`Builder`] run; gives access to the k-mer values at the minimizer positions.
pub struct Output<'o, const CANONICAL: bool, S> {
    k: usize,
    seq: S,
    min_pos: &'o Vec<u32>,
}

/// Start building a (forward) minimizer computation with the given `k` and `w`.
#[must_use]
pub fn minimizers(k: usize, w: usize) -> Builder<'static, false, NtHasher<false>, ()> {
    Builder {
        k,
        w,
        hasher: None,
        sk_pos: (),
    }
}

/// Start building a canonical minimizer computation with the given `k` and `w`.
///
/// `l = w + k - 1` must be odd to determine the strand of each window.
#[must_use]
pub fn canonical_minimizers(k: usize, w: usize) -> Builder<'static, true, NtHasher<true>, ()> {
    Builder {
        k,
        w,
        hasher: None,
        sk_pos: (),
    }
}

impl<const CANONICAL: bool> Builder<'static, CANONICAL, NtHasher<CANONICAL>, ()> {
    /// Use the given hasher instead of the default (seedless) `NtHasher`.
    #[must_use]
    pub fn hasher<'h, H2: KmerHasher>(&self, hasher: &'h H2) -> Builder<'h, CANONICAL, H2, ()> {
        Builder {
            k: self.k,
            w: self.w,
            sk_pos: (),
            hasher: Some(hasher),
        }
    }
}
impl<'h, const CANONICAL: bool, H: KmerHasher> Builder<'h, CANONICAL, H, ()> {
    /// Additionally fill `sk_pos` with the index of the first window in which each minimizer is minimal.
    #[must_use]
    pub fn super_kmers<'o2>(
        &self,
        sk_pos: &'o2 mut Vec<u32>,
    ) -> Builder<'h, CANONICAL, H, &'o2 mut Vec<u32>> {
        Builder {
            k: self.k,
            w: self.w,
            hasher: self.hasher,
            sk_pos,
        }
    }
}

/// Without-superkmer versions.
impl<'h, const CANONICAL: bool, H: KmerHasher> Builder<'h, CANONICAL, H, ()> {
    /// Run the scalar version and return the positions as a fresh vector.
    pub fn run_scalar_once<'s, SEQ: Seq<'s>>(&self, seq: SEQ) -> Vec<u32> {
        let mut min_pos = vec![];
        self.run_impl::<false, _>(seq, &mut min_pos);
        min_pos
    }

    /// Run the SIMD version and return the positions as a fresh vector.
    pub fn run_once<'s, SEQ: Seq<'s>>(&self, seq: SEQ) -> Vec<u32> {
        let mut min_pos = vec![];
        self.run_impl::<true, _>(seq, &mut min_pos);
        min_pos
    }

    /// Run the scalar version, appending the positions to `min_pos`.
    pub fn run_scalar<'s, 'o, SEQ: Seq<'s>>(
        &self,
        seq: SEQ,
        min_pos: &'o mut Vec<u32>,
    ) -> Output<'o, CANONICAL, SEQ> {
        self.run_impl::<false, _>(seq, min_pos)
    }

    /// Run the SIMD version, appending the positions to `min_pos`.
    pub fn run<'s, 'o, SEQ: Seq<'s>>(
        &self,
        seq: SEQ,
        min_pos: &'o mut Vec<u32>,
    ) -> Output<'o, CANONICAL, SEQ> {
        self.run_impl::<true, _>(seq, min_pos)
    }

    fn run_impl<'s, 'o, const SIMD: bool, SEQ: Seq<'s>>(
        &self,
        seq: SEQ,
        min_pos: &'o mut Vec<u32>,
    ) -> Output<'o, CANONICAL, SEQ> {
        // Construct a default hasher only when none was provided.
        let default_hasher = self.hasher.is_none().then(|| H::new(self.k));
        let hasher = self
            .hasher
            .unwrap_or_else(|| default_hasher.as_ref().unwrap());

        CACHE.with_borrow_mut(|cache| match (SIMD, CANONICAL) {
            (false, false) => collect_and_dedup_into_scalar(
                minimizers_seq_scalar(seq, hasher, self.w, &mut cache.0),
                min_pos,
            ),
            (false, true) => collect_and_dedup_into_scalar(
                canonical_minimizers_seq_scalar(seq, hasher, self.w, &mut cache.0),
                min_pos,
            ),
            (true, false) => minimizers_seq_simd(seq, hasher, self.w, &mut cache.0)
                .collect_and_dedup_into::<false>(min_pos),
            (true, true) => canonical_minimizers_seq_simd(seq, hasher, self.w, &mut cache.0)
                .collect_and_dedup_into::<false>(min_pos),
        });
        Output {
            k: self.k,
            seq,
            min_pos,
        }
    }
}

impl<'h, H: KmerHasher> Builder<'h, true, H, ()> {
    /// Like [`Builder::run_once`], but skips windows containing ambiguous (non-`ACGT`) bases.
    pub fn run_skip_ambiguous_windows_once<'s>(&self, nseq: PackedNSeq<'s>) -> Vec<u32> {
        let mut min_pos = vec![];
        self.run_skip_ambiguous_windows(nseq, &mut min_pos);
        min_pos
    }
    /// Like [`Builder::run`], but skips windows containing ambiguous (non-`ACGT`) bases.
    pub fn run_skip_ambiguous_windows<'s, 'o>(
        &self,
        nseq: PackedNSeq<'s>,
        min_pos: &'o mut Vec<u32>,
    ) -> Output<'o, true, PackedSeq<'s>> {
        CACHE
            .with_borrow_mut(|cache| self.run_skip_ambiguous_windows_with_buf(nseq, min_pos, cache))
    }
    /// As [`Builder::run_skip_ambiguous_windows`], with caller-provided scratch buffers.
    pub fn run_skip_ambiguous_windows_with_buf<'s, 'o>(
        &self,
        nseq: PackedNSeq<'s>,
        min_pos: &'o mut Vec<u32>,
        cache: &mut (Cache, Vec<S>, Vec<S>),
    ) -> Output<'o, true, PackedSeq<'s>> {
        let default_hasher = self.hasher.is_none().then(|| H::new(self.k));
        let hasher = self
            .hasher
            .unwrap_or_else(|| default_hasher.as_ref().unwrap());
        canonical_minimizers_skip_ambiguous_windows(nseq, hasher, self.w, cache)
            .collect_and_dedup_into::<true>(min_pos);
        Output {
            k: self.k,
            seq: nseq.seq,
            min_pos,
        }
    }
}

/// With-superkmer versions.
impl<'h, 'o2, const CANONICAL: bool, H: KmerHasher> Builder<'h, CANONICAL, H, &'o2 mut Vec<u32>> {
    /// Run the scalar version and return the positions as a fresh vector.
    pub fn run_scalar_once<'s, SEQ: Seq<'s>>(self, seq: SEQ) -> Vec<u32> {
        let mut min_pos = vec![];
        self.run_scalar(seq, &mut min_pos);
        min_pos
    }

    /// Run the scalar version, appending the positions to `min_pos`.
    pub fn run_scalar<'s, 'o, SEQ: Seq<'s>>(
        self,
        seq: SEQ,
        min_pos: &'o mut Vec<u32>,
    ) -> Output<'o, CANONICAL, SEQ> {
        let default_hasher = self.hasher.is_none().then(|| H::new(self.k));
        let hasher = self
            .hasher
            .unwrap_or_else(|| default_hasher.as_ref().unwrap());

        CACHE.with_borrow_mut(|cache| match CANONICAL {
            false => collect_and_dedup_with_index_into_scalar(
                minimizers_seq_scalar(seq, hasher, self.w, &mut cache.0),
                min_pos,
                self.sk_pos,
            ),
            true => collect_and_dedup_with_index_into_scalar(
                canonical_minimizers_seq_scalar(seq, hasher, self.w, &mut cache.0),
                min_pos,
                self.sk_pos,
            ),
        });
        Output {
            k: self.k,
            seq,
            min_pos,
        }
    }

    /// Run the SIMD version and return the positions as a fresh vector.
    pub fn run_once<'s, SEQ: Seq<'s>>(self, seq: SEQ) -> Vec<u32> {
        let mut min_pos = vec![];
        self.run(seq, &mut min_pos);
        min_pos
    }

    /// Run the SIMD version, appending the positions to `min_pos`.
    pub fn run<'s, 'o, SEQ: Seq<'s>>(
        self,
        seq: SEQ,
        min_pos: &'o mut Vec<u32>,
    ) -> Output<'o, CANONICAL, SEQ> {
        CACHE.with_borrow_mut(|cache| self.run_with_buf(seq, min_pos, &mut cache.0))
    }

    #[inline(always)]
    fn run_with_buf<'s, 'o, SEQ: Seq<'s>>(
        self,
        seq: SEQ,
        min_pos: &'o mut Vec<u32>,
        cache: &mut Cache,
    ) -> Output<'o, CANONICAL, SEQ> {
        let default_hasher = self.hasher.is_none().then(|| H::new(self.k));
        let hasher = self
            .hasher
            .unwrap_or_else(|| default_hasher.as_ref().unwrap());

        match CANONICAL {
            false => minimizers_seq_simd(seq, hasher, self.w, cache)
                .collect_and_dedup_with_index_into(min_pos, self.sk_pos),
            true => canonical_minimizers_seq_simd(seq, hasher, self.w, cache)
                .collect_and_dedup_with_index_into(min_pos, self.sk_pos),
        };
        Output {
            k: self.k,
            seq,
            min_pos,
        }
    }
}

impl<'s, 'o, const CANONICAL: bool, SEQ: Seq<'s>> Output<'o, CANONICAL, SEQ> {
    /// Iterator over the (canonical) u64 k-mer values at all minimizer positions.
    #[must_use]
    pub fn values_u64(&self) -> impl ExactSizeIterator<Item = u64> {
        self.pos_and_values_u64().map(|(_pos, val)| val)
    }
    /// Iterator over the (canonical) u128 k-mer values at all minimizer positions.
    #[must_use]
    pub fn values_u128(&self) -> impl ExactSizeIterator<Item = u128> {
        self.pos_and_values_u128().map(|(_pos, val)| val)
    }
    /// Iterator over the positions and (canonical) u64 k-mer values at all minimizer positions.
    #[must_use]
    pub fn pos_and_values_u64(&self) -> impl ExactSizeIterator<Item = (u32, u64)> {
        self.min_pos.iter().map(
            #[inline(always)]
            move |&pos| {
                let val = if CANONICAL {
                    // The canonical value is the minimum of the k-mer and its reverse complement.
                    let a = self.seq.read_kmer(self.k, pos as usize);
                    let b = self.seq.read_revcomp_kmer(self.k, pos as usize);
                    core::cmp::min(a, b)
                } else {
                    self.seq.read_kmer(self.k, pos as usize)
                };
                (pos, val)
            },
        )
    }
    /// Iterator over the positions and (canonical) u128 k-mer values at all minimizer positions.
    #[must_use]
    pub fn pos_and_values_u128(&self) -> impl ExactSizeIterator<Item = (u32, u128)> {
        self.min_pos.iter().map(
            #[inline(always)]
            move |&pos| {
                let val = if CANONICAL {
                    let a = self.seq.read_kmer_u128(self.k, pos as usize);
                    let b = self.seq.read_revcomp_kmer_u128(self.k, pos as usize);
                    core::cmp::min(a, b)
                } else {
                    self.seq.read_kmer_u128(self.k, pos as usize)
                };
                (pos, val)
            },
        )
    }
}

/// Positions of all minimizers in the sequence, returned as a fresh vector.
///
/// See [`minimizers`], [`canonical_minimizers`], and [`Builder`] for more
/// configurations, supporting a custom hasher, super-k-mer positions, and
/// returning k-mer values.
///
/// To append positions to a reusable vector instead, use [`minimizers`] with [`Builder::run`].
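///
/// For example (matching the crate-level example above):
///
/// ```
/// use packed_seq::AsciiSeq;
/// let positions = simd_minimizers::minimizer_positions(AsciiSeq(b"ACGTGCTCAGAGACTCAG"), 5, 7);
/// assert_eq!(positions, vec![4, 5, 8, 13]);
/// ```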
pub fn minimizer_positions<'s>(seq: impl Seq<'s>, k: usize, w: usize) -> Vec<u32> {
    minimizers(k, w).run_once(seq)
}

/// Positions of all canonical minimizers in the sequence, returned as a fresh vector.
///
/// See [`minimizers`], [`canonical_minimizers`], and [`Builder`] for more
/// configurations, supporting a custom hasher, super-k-mer positions, and
/// returning k-mer values.
///
/// `l = w + k - 1` must be odd to determine the strand of each window.
///
/// To append positions to a reusable vector instead, use [`canonical_minimizers`] with [`Builder::run`].
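///
/// For example (matching the crate-level example above):
///
/// ```
/// use packed_seq::{PackedSeqVec, SeqVec};
/// let packed_seq = PackedSeqVec::from_ascii(b"ACGTGCTCAGAGACTCAGAGGA");
/// let positions = simd_minimizers::canonical_minimizer_positions(packed_seq.as_slice(), 5, 7);
/// assert_eq!(positions, vec![0, 7, 9, 15]);
/// ```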
pub fn canonical_minimizer_positions<'s>(seq: impl Seq<'s>, k: usize, w: usize) -> Vec<u32> {
    canonical_minimizers(k, w).run_once(seq)
}