bio-seq
Bit-packed and well-typed biological sequences
use *;
let seq = dna!;
for kmer in seq.
- bio_seq::Dna: DNA uses the lexicographically ordered 2-bit representation.
- bio_seq::Iupac: IUPAC nucleotide ambiguity codes are represented with 4 bits:

        A C G T
        -------
      S 0 1 1 0
      - 0 0 0 0
      C 0 1 0 0
      N 1 1 1 1
      B 0 1 1 1
      ... etc.

  This supports membership resolution with bitwise operations:
assert_eq!; assert_eq!;
The Iupac struct implements From<Dna>
- bio_seq::Amino: Amino acid sequences are represented with 6 bits. The representation of amino acids is designed to be easy to coerce from sequences of 2-bit encoded DNA (a codon of three 2-bit bases is also 6 bits wide). TODO: deal with alternate (e.g. mammalian mitochondrial) translation codes
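The membership resolution described for the 4-bit IUPAC codes can be illustrated without the crate at all. The following is a minimal sketch on plain u8 masks, assuming the A C G T column order of the table above; the names iupac_mask and is_subset are hypothetical and are not part of bio-seq.

```rust
// Hypothetical 4-bit masks in A C G T column order (A is the high bit).
const A: u8 = 0b1000;
const C: u8 = 0b0100;
const G: u8 = 0b0010;
const T: u8 = 0b0001;

/// Map an IUPAC character to its 4-bit mask; `None` for unknown symbols.
fn iupac_mask(symbol: char) -> Option<u8> {
    let mask = match symbol {
        '-' => 0,
        'A' => A,
        'C' => C,
        'G' => G,
        'T' => T,
        'R' => A | G,
        'Y' => C | T,
        'S' => C | G,
        'W' => A | T,
        'K' => G | T,
        'M' => A | C,
        'B' => C | G | T,
        'D' => A | G | T,
        'H' => A | C | T,
        'V' => A | C | G,
        'N' => A | C | G | T,
        _ => return None,
    };
    Some(mask)
}

/// True if every base allowed by `inner` is also allowed by `outer`.
fn is_subset(inner: u8, outer: u8) -> bool {
    (inner & outer) == inner
}
```

With this layout, bitwise OR is the union of two ambiguity codes and bitwise AND is their intersection, so a result of 0 (the gap code) means the two codes share no base.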
Kmers
Kmers are sequences with a fixed size. These are implemented with const generics.
K * Codec::WIDTH must fit in a usize (i.e. at most 64 bits on a 64-bit target). For larger Kmers use bigk::Kmer (TODO)
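To make the width constraint concrete, here is a rough sketch of a fixed-size k-mer whose length is a const generic parameter. PackedKmer, its new constructor and the WIDTH constant are hypothetical names for illustration and are not the crate's actual types; the check is done at run time here for simplicity.

```rust
/// Bits per symbol in this sketch (2 for DNA).
const WIDTH: usize = 2;

/// Hypothetical k-mer of `K` symbols packed into a single `usize`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct PackedKmer<const K: usize> {
    bits: usize,
}

impl<const K: usize> PackedKmer<K> {
    /// Wrap raw bits. Returns `None` if `K * WIDTH` does not fit in a
    /// `usize`, or if `bits` has anything set above the low `K * WIDTH` bits.
    fn new(bits: usize) -> Option<Self> {
        let total = K.checked_mul(WIDTH)?;
        let word = usize::BITS as usize;
        if total > word {
            return None;
        }
        // Mask covering the low `total` bits; a full-width shift would overflow.
        let mask = if total == word {
            usize::MAX
        } else {
            (1usize << total) - 1
        };
        if bits & !mask != 0 {
            return None;
        }
        Some(Self { bits })
    }

    /// The length is part of the type, so it costs nothing to store.
    const fn len(&self) -> usize {
        K
    }
}
```

Because K is part of the type, PackedKmer<8> and PackedKmer<11> are distinct types and cannot be mixed up by accident.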
Minimisers for free
The 2-bit representation of DNA sequences is lexicographically ordered:
// find the lexicographically minimum 8-mer
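As an illustration of why an ordered 2-bit code makes minimisers cheap, the sketch below finds the lexicographically smallest k-mer of a string by comparing packed integers. It is standalone Rust, not the crate's API: the functions code and minimiser are hypothetical, and the sketch shifts earlier bases toward the high bits so that plain integer comparison matches string order, which is a layout chosen for this example only.

```rust
/// 2-bit code for a base, in lexicographic order: A < C < G < T.
fn code(base: u8) -> Option<u64> {
    match base {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None,
    }
}

/// Return the lexicographically smallest k-mer of `seq` (first one on ties).
/// `None` if `k` is 0 or above 32, the sequence is too short, or it
/// contains a symbol other than A, C, G, T.
fn minimiser(seq: &str, k: usize) -> Option<&str> {
    if k == 0 || k > 32 || seq.len() < k {
        return None;
    }
    let mask: u64 = if k == 32 { u64::MAX } else { (1u64 << (2 * k)) - 1 };
    let mut window: u64 = 0;
    let mut best: Option<(u64, usize)> = None;
    for (i, b) in seq.bytes().enumerate() {
        // Slide the window: push the new base into the low bits and
        // drop the oldest base off the top with the mask.
        window = ((window << 2) | code(b)?) & mask;
        if i + 1 >= k {
            let start = i + 1 - k;
            if best.map_or(true, |(value, _)| window < value) {
                best = Some((window, start));
            }
        }
    }
    best.map(|(_, start)| &seq[start..start + k])
}
```

Each window update is a shift, an OR and a mask, so scanning a sequence of length n takes O(n) integer operations regardless of k.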
Derived codecs
Alphabet coding/decoding is derived from the variant names and discriminants of enum types:
The width attribute specifies how many bits the encoding requires per symbol.
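For readers who want to see what such a derivation amounts to, the following hand-written sketch spells out one plausible result for a DNA alphabet: the bits come from the enum discriminants and the characters from the variant names. The Codec trait shown here is a hypothetical stand-in with assumed method names; the crate's real trait and derive output may differ.

```rust
/// Hypothetical stand-in for a codec trait; not the crate's definition.
trait Codec: Sized + Copy {
    /// Bits needed per symbol.
    const WIDTH: u8;
    fn to_bits(self) -> u8;
    fn from_bits(bits: u8) -> Option<Self>;
    fn to_char(self) -> char;
    fn from_char(c: char) -> Option<Self>;
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
enum Dna {
    A = 0b00,
    C = 0b01,
    G = 0b10,
    T = 0b11,
}

impl Codec for Dna {
    const WIDTH: u8 = 2;

    // The encoding is the enum discriminant.
    fn to_bits(self) -> u8 {
        self as u8
    }

    fn from_bits(bits: u8) -> Option<Self> {
        match bits {
            0b00 => Some(Dna::A),
            0b01 => Some(Dna::C),
            0b10 => Some(Dna::G),
            0b11 => Some(Dna::T),
            _ => None,
        }
    }

    // The character is the variant name.
    fn to_char(self) -> char {
        match self {
            Dna::A => 'A',
            Dna::C => 'C',
            Dna::G => 'G',
            Dna::T => 'T',
        }
    }

    fn from_char(c: char) -> Option<Self> {
        match c {
            'A' => Some(Dna::A),
            'C' => Some(Dna::C),
            'G' => Some(Dna::G),
            'T' => Some(Dna::T),
            _ => None,
        }
    }
}
```

A derive macro would generate the four match blocks from the enum definition, which is why only the variant names, discriminants and width need to be written by hand.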
Little endian
Kmers are stored as usizes with the least significant bit first.
dna! == 0b01 // not 0b0100_0000
dna! == 0b11_01
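The bit order can be reproduced with a few lines of plain Rust. This is a sketch under the assumption that A, C, G, T map to 00, 01, 10, 11 and that the first base occupies the lowest two bits; pack_le is a hypothetical helper, not a function from the crate.

```rust
/// Pack a DNA string with the first base in the least significant bits.
/// Returns `None` for symbols other than A, C, G, T, or if the sequence
/// needs more bits than a `usize` holds.
fn pack_le(seq: &str) -> Option<usize> {
    if seq.len() * 2 > usize::BITS as usize {
        return None;
    }
    let mut bits = 0usize;
    for (i, b) in seq.bytes().enumerate() {
        let code: usize = match b {
            b'A' => 0b00,
            b'C' => 0b01,
            b'G' => 0b10,
            b'T' => 0b11,
            _ => return None,
        };
        // Base i lands at bit offset 2 * i, so later bases sit higher.
        bits |= code << (2 * i);
    }
    Some(bits)
}
```

Under these assumptions a single C packs to 0b01 and the two bases C, T pack to 0b11_01: the T, which comes second in the sequence, ends up in the higher pair of bits.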
Conversion with From and Into
Iupac from Dna; Seq<Iupac> from Seq<Dna>
Amino from Kmer<3>; Seq<Amino> from Seq<Dna> (TODO)
- A sequence whose length is not a multiple of 3 is an error
Seq<Iupac> from Amino; Seq<Iupac> from Seq<Amino> (TODO)
Vec<Seq<Dna>> from Seq<Iupac>: A sequence of IUPAC codes can generate a list of DNA sequences of the same length. (TODO)
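The last conversion in the list above, from a sequence of IUPAC codes to every DNA sequence it can stand for, is a Cartesian product over the bases each code allows. Since that conversion is marked TODO, the following is only a sketch on plain strings; bases and expand are hypothetical names, and treating the gap symbol as unsupported is an assumption of this example.

```rust
/// Bases allowed by an IUPAC code, in A, C, G, T order.
/// `None` for gaps and unknown symbols (an assumption of this sketch).
fn bases(code: char) -> Option<&'static str> {
    Some(match code {
        'A' => "A",
        'C' => "C",
        'G' => "G",
        'T' => "T",
        'R' => "AG",
        'Y' => "CT",
        'S' => "CG",
        'W' => "AT",
        'K' => "GT",
        'M' => "AC",
        'B' => "CGT",
        'D' => "AGT",
        'H' => "ACT",
        'V' => "ACG",
        'N' => "ACGT",
        _ => return None,
    })
}

/// Expand an IUPAC string into all matching DNA strings of the same length.
fn expand(iupac: &str) -> Option<Vec<String>> {
    let mut out = vec![String::new()];
    for code in iupac.chars() {
        let choices = bases(code)?;
        let mut next = Vec::with_capacity(out.len() * choices.len());
        // Extend every prefix built so far with every allowed base.
        for prefix in &out {
            for b in choices.chars() {
                let mut s = prefix.clone();
                s.push(b);
                next.push(s);
            }
        }
        out = next;
    }
    Some(out)
}
```

The output grows multiplicatively with each ambiguous position (a run of n N codes yields 4^n sequences), so a real implementation would likely want an iterator rather than a fully materialised Vec.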
Deref coercion
TODO: find out if Kmer<Dna, K> -> Kmer<Amino, K/3> is possible
Drop-in compatibility with rust-bio
Meant to replace rust-bio's Text/TextSlice.
TODO
- benchmarking
- macros for defining alphabet codecs more concisely
- wider SIMD-sized Kmers