Types and traits to iterate over (packed) input data.
The main type is PackedSeqVec, which holds a sequence of 2-bit packed DNA bases. PackedSeq is the corresponding non-owned slice of packed data.
To make libraries depending on this crate more generic, logic is encapsulated in the Seq and SeqVec traits.
Seq is a non-owned slice of characters that can be iterated, while SeqVec is the corresponding owned type.
These traits serve two purposes:
- They encapsulate the packing/unpacking of characters between ASCII and the possibly different in-memory format.
- They allow efficiently iterating over 8 chunks of a sequence in parallel using SIMD instructions.
§Sequence types
The traits are implemented for three types.
§Plain ASCII sequences
With &[u8]: Seq and Vec<u8>: SeqVec, the ASCII characters (or arbitrary
u8 values, really) of any input slice can be iterated.
§ASCII-encoded DNA sequences
The AsciiSeq: Seq and AsciiSeqVec: SeqVec types store a DNA sequence of ACTGactg characters.
When iterated, these are returned as 0123 values, with the mapping A=0, C=1, T=2, G=3.
Any other characters are silently mapped to 0123 using (c>>1) & 3, but this should not be relied upon.
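The mapping above can be sketched in a few lines. This is a minimal illustration of the (c>>1) & 3 expression only; the names pack_base_sketch and pack_all_sketch are hypothetical and not part of this crate's API.

```rust
/// Hypothetical helper (not the crate's API): map one ASCII character to a
/// 2-bit value with `(c >> 1) & 3`, giving A=0, C=1, T=2, G=3 in either case.
fn pack_base_sketch(c: u8) -> u8 {
    (c >> 1) & 3
}

/// Hypothetical helper: apply the mapping to every character of a slice.
fn pack_all_sketch(seq: &[u8]) -> Vec<u8> {
    seq.iter().map(|&c| pack_base_sketch(c)).collect()
}
```

Because only two bits of the character are inspected, a non-ACTG character such as N also lands on one of the four values (here 3), which is why the result for other characters should not be relied upon.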
§Packed DNA
The PackedSeq: Seq and PackedSeqVec: SeqVec types store a packed DNA sequence, encoded as 4 bases per byte.
Each ACTG base is stored as 0123 as above and four of these 2-bit values fill a byte.
Use PackedSeqVec::from_ascii to construct a PackedSeqVec.
Currently this relies on the pext instruction for good performance on x86.
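To illustrate the 4-bases-per-byte layout, here is a plain scalar sketch. It does not use pext and is not how PackedSeqVec::from_ascii is implemented; the bit order (base i at bit offset 2 * (i % 4) of byte i / 4) and both function names are assumptions made for the example.

```rust
/// Hypothetical sketch: pack ASCII DNA into 2 bits per base, 4 bases per byte.
/// Assumption: base `i` is stored at bit offset `2 * (i % 4)` of byte `i / 4`.
fn pack_sketch(seq: &[u8]) -> Vec<u8> {
    // Round up so that a trailing partial group still gets a byte.
    let mut packed = vec![0u8; (seq.len() + 3) / 4];
    for (i, &c) in seq.iter().enumerate() {
        let base = (c >> 1) & 3; // A=0, C=1, T=2, G=3
        packed[i / 4] |= base << (2 * (i % 4));
    }
    packed
}

/// Hypothetical sketch: read back the 2-bit base at position `i`.
fn get_base_sketch(packed: &[u8], i: usize) -> u8 {
    (packed[i / 4] >> (2 * (i % 4))) & 3
}
```

Under this assumed layout, the four bases ACTG occupy a single byte, and a fifth base starts a new byte.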
§Parallel iterators
This library enables iterating 8 chunks of a sequence at the same time using SIMD instructions.
Seq::par_iter_bp returns an iterator of wide::u32x8 values (together with a padding value), where each of the 8 SIMD lanes holds the 2-bit or 8-bit value of the next character of its chunk in a u32.
This is used in the simd-minimizers crate, and explained in more detail in the corresponding preprint.
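The lane layout can be pictured with a scalar stand-in that uses a plain [u32; 8] instead of a SIMD vector. This is only a sketch: lanes_sketch and LANES are hypothetical names, the chunks are disjoint, and the input length is assumed to be a multiple of 8 so that no padding is needed.

```rust
const LANES: usize = 8;

/// Hypothetical scalar stand-in for an 8-lane iterator over disjoint chunks.
/// Assumption: `seq.len()` is a multiple of 8, so no padding is required.
/// Step `i` yields the `i`-th character of each of the 8 chunks.
fn lanes_sketch(seq: &[u8]) -> Vec<[u32; LANES]> {
    assert!(seq.len() % LANES == 0);
    let chunk_len = seq.len() / LANES;
    (0..chunk_len)
        .map(|i| std::array::from_fn(|lane| seq[lane * chunk_len + i] as u32))
        .collect()
}
```

For a 16-character input, each chunk has 2 characters, so the sketch yields 2 steps of 8 values each.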
§Context
The context parameter determines how much adjacent chunks overlap. When context=1, they are disjoint.
When context=k, adjacent chunks overlap by k-1 characters, so that each k-mer is present in exactly one chunk.
Thus, this can be used to iterate all k-mers, where the first k-1 characters in each chunk are used to initialize the first k-mer.
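The effect of the context parameter can be sketched by computing the character range of each chunk. The function below is a hypothetical illustration, not the crate's implementation: it assumes that k-mer start positions are divided evenly over the 8 chunks (rounding up) and that trailing chunks are simply cut short at the end of the sequence.

```rust
const NUM_CHUNKS: usize = 8;

/// Hypothetical sketch: half-open `(start, end)` character ranges of 8 chunks,
/// where adjacent chunks overlap by `context - 1` characters.
/// Assumption: k-mer start positions are split evenly, rounding up, and
/// trailing chunks may be short or empty.
fn chunk_ranges_sketch(len: usize, context: usize) -> Vec<(usize, usize)> {
    assert!(context >= 1);
    // Number of k-mers (with k = context) in a sequence of `len` characters.
    let num_kmers = len.saturating_sub(context - 1);
    let per_chunk = (num_kmers + NUM_CHUNKS - 1) / NUM_CHUNKS;
    (0..NUM_CHUNKS)
        .map(|c| {
            let start = (c * per_chunk).min(len);
            // Each chunk needs `context - 1` extra characters for its last k-mer.
            let end = (start + per_chunk + context - 1).min(len);
            (start, end)
        })
        .collect()
}
```

With context=1 on 16 characters the ranges are disjoint pairs. With context=3 on 10 characters, each chunk covers 3 characters and starts one position after the previous one, so each of the 8 3-mers falls in exactly one chunk.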
§Delayed iteration
This crate also provides Seq::par_iter_bp_delayed and Seq::par_iter_bp_delayed_2 functions. Like Seq::par_iter_bp, these split the input into 8 chunks and stream over the chunks in parallel.
But instead of returning just a single character, they also return a second (and third) character that lags delay positions behind the new character (at index idx - delay).
This way, k-mers can be enumerated by setting delay=k and then mapping e.g. |(add, remove)| kmer = (kmer<<2) ^ add ^ (remove << (2*k)).
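The k-mer update above can be checked with a scalar sketch over a single chunk. It is a hypothetical illustration rather than the crate's API, and it assumes that the input already holds 2-bit values, that 2*k fits comfortably in a u64, and that the delayed character is 0 while fewer than k characters have been seen.

```rust
/// Hypothetical scalar sketch of the delayed-iteration update with delay = k.
/// Assumptions: `bases` holds 2-bit values (0..4), `2 * k < 64`, and the
/// delayed character is 0 while fewer than `k` characters have been seen.
fn kmers_sketch(bases: &[u8], k: usize) -> Vec<u64> {
    assert!(k >= 1 && 2 * k < 64);
    let mut kmer = 0u64;
    let mut out = Vec::new();
    for (idx, &add) in bases.iter().enumerate() {
        // The character `k` positions behind the new one, or 0 at the start.
        let remove = if idx >= k { bases[idx - k] } else { 0 };
        // Shift in the new base and cancel the base that left the window.
        kmer = (kmer << 2) ^ (add as u64) ^ ((remove as u64) << (2 * k));
        // The first complete k-mer is available once k characters were seen.
        if idx + 1 >= k {
            out.push(kmer);
        }
    }
    out
}
```

After the shift, the outgoing base sits exactly at bit offset 2*k, so XOR-ing it with remove << (2*k) clears it and leaves the k most recent bases in the low 2*k bits.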
§Example
use packed_seq::{SeqVec, Seq, AsciiSeqVec, PackedSeqVec, pack_char};
// Plain ASCII sequence.
let seq = b"ACTGCAGCGCATATGTAGT";
// ASCII DNA sequence.
let ascii_seq = AsciiSeqVec::from_ascii(seq);
// Packed DNA sequence.
let packed_seq = PackedSeqVec::from_ascii(seq);
assert_eq!(ascii_seq.len(), packed_seq.len());
// Iterate the ASCII characters.
let characters: Vec<u8> = seq.iter_bp().collect();
assert_eq!(characters, seq);
// Iterate the bases with 0..4 values.
let bases: Vec<u8> = seq.iter().copied().map(pack_char).collect();
assert_eq!(bases, vec![0,1,2,3,1,0,3,1,3,1,0,2,0,2,3,2,0,3,2]);
let ascii_bases: Vec<u8> = ascii_seq.as_slice().iter_bp().collect();
assert_eq!(ascii_bases, bases);
let packed_bases: Vec<u8> = packed_seq.as_slice().iter_bp().collect();
assert_eq!(packed_bases, bases);
// Iterate over 8 chunks at the same time.
let seq = b"AAAACCTTGGTTACTG"; // plain ASCII sequence
// chunks: ^ ^ ^ ^ ^ ^ ^ ^
let (par_iter, padding) = seq.as_slice().par_iter_bp(1);
let mut par_iter_u8 = par_iter.map(|x| x.as_array_ref().map(|c| c as u8));
assert_eq!(par_iter_u8.next(), Some(*b"AACTGTAT"));
assert_eq!(par_iter_u8.next(), Some(*b"AACTGTCG"));
assert_eq!(par_iter_u8.next(), None);
§Feature flags
- epserde enables derive(epserde::Epserde) for PackedSeqVec and AsciiSeqVec, and adds its SerializeInner and DeserializeInner traits to SeqVec.
- pyo3 enables derive(pyo3::pyclass) for PackedSeqVec and AsciiSeqVec.
Structs§
- AsciiSeq: A &[u8] representing an ASCII-encoded DNA sequence of ACGTacgt.
- AsciiSeqVec: A Vec<u8> representing an ASCII-encoded DNA sequence of ACGTacgt.
- PackedSeq: A 2-bit packed non-owned slice of DNA bases.
- PackedSeqVec: A 2-bit packed owned sequence of DNA bases.
- u32x8: A SIMD vector containing 8 u32s.
Constants§
- L: The number of lanes in a u32x8.
Traits§
- Seq: A non-owned slice of characters.
- SeqVec: An owned sequence. Can be constructed from either ASCII input or the underlying non-owning Seq type.
Functions§
- complement_base: Complement a 2-bit base: 0<>2 and 1<>3.
- complement_base_simd: Complement 8 lanes of 2-bit bases: 0<>2 and 1<>3.
- complement_char: Complement an ASCII character: A<>T and C<>G.
- pack_char: Pack an ASCII ACTGactg character into its 2-bit representation, and panic for anything else.
- unpack_base: Unpack a 2-bit DNA base into the corresponding ACTG character.