bio-seq
Bit-packed and well-typed biological sequences
use ;
use ;
Contents
- Codec: Encoding scheme for the 'characters' of a biological sequence
- Seq: A sequence of encoded characters
- Kmer: A fixed size sequence of length k
- Derivable codecs: This crate offers utilities for defining your own bit-level encodings
- Safe conversion between sequences
Codecs
The Codec trait describes the coding/decoding process for the characters of a biological sequence. This trait can be derived procedurally. There are three built-in codecs:
codec::Dna
DNA nucleotides are encoded using the lexicographically ordered 2-bit representation.
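To illustrate what a lexicographically ordered 2-bit encoding looks like, here is a minimal standalone sketch. It does not use this crate's API: the function names `encode_base` and `decode_base` are hypothetical, and the bit assignment (A=00, C=01, G=10, T=11) is an assumption consistent with the ordering described above.

```rust
/// Encode a nucleotide as 2 bits so that A < C < G < T numerically.
/// (Assumed bit assignment, for illustration only.)
fn encode_base(base: u8) -> Option<u8> {
    match base {
        b'A' => Some(0b00),
        b'C' => Some(0b01),
        b'G' => Some(0b10),
        b'T' => Some(0b11),
        _ => None, // anything else is not representable in 2 bits
    }
}

/// Decode the low two bits back into a nucleotide character.
fn decode_base(bits: u8) -> char {
    ['A', 'C', 'G', 'T'][(bits & 0b11) as usize]
}
```

Because the codes increase in alphabetical order, comparing encoded values compares the bases themselves.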
codec::Iupac
IUPAC nucleotide ambiguity codes are represented with 4 bits. This supports membership resolution with bitwise operations. Logical or is the union:
assert_eq!;
Logical and is the intersection of two IUPAC sequences:
assert_eq!;
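The set semantics can be sketched without the crate by giving each unambiguous base its own bit, so that an ambiguity code is the bitwise or of the bases it may stand for. The constants and the `iupac_bits` function below are hypothetical, the exact bit positions are an assumption, and treating `-` as the empty set is also an assumption.

```rust
// Assumed layout: one bit per unambiguous base.
const A: u8 = 0b1000;
const C: u8 = 0b0100;
const G: u8 = 0b0010;
const T: u8 = 0b0001;

/// Map an IUPAC nucleotide code to a 4-bit membership mask.
fn iupac_bits(code: u8) -> Option<u8> {
    Some(match code {
        b'A' => A,
        b'C' => C,
        b'G' => G,
        b'T' => T,
        b'R' => A | G,
        b'Y' => C | T,
        b'S' => C | G,
        b'W' => A | T,
        b'K' => G | T,
        b'M' => A | C,
        b'B' => C | G | T,
        b'D' => A | G | T,
        b'H' => A | C | T,
        b'V' => A | C | G,
        b'N' => A | C | G | T,
        b'-' => 0, // assumed: gap as the empty set
        _ => return None,
    })
}
```

With this layout, `S | Y` yields the mask for `B` (C, G or T), and `S & W` yields the empty set because the two codes share no base.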
codec::Amino
Amino acid sequences are represented with 6 bits. The representation of amino acids is designed to be easy to coerce from sequences of 2-bit encoded DNA.
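One way to see why 6 bits pair naturally with 2-bit DNA: a codon of three 2-bit bases occupies exactly 6 bits, giving 64 possible values. The sketch below only packs a codon into a 6-bit index. It does not reproduce the crate's actual amino acid assignment, and `base_bits` and `codon_index` are hypothetical names.

```rust
/// 2-bit code for a base (assumed: A=00, C=01, G=10, T=11).
fn base_bits(base: u8) -> Option<u8> {
    match base {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None,
    }
}

/// Pack a codon into a 6-bit index, first base in the lowest two bits.
fn codon_index(codon: [u8; 3]) -> Option<u8> {
    let mut index = 0u8;
    for (i, &base) in codon.iter().enumerate() {
        index |= base_bits(base)? << (2 * i);
    }
    Some(index)
}
```

A translation table indexed by this value would need 64 entries, one per codon.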
Sequences
Strings of encoded biological characters are packed into Seqs. Slicing, chunking, and windowing return SeqSlices. Seq<A: Codec>/&SeqSlice<A: Codec> are analogous to String/&str.
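As a rough picture of what "packed" means here, the following sketch stores 2-bit codes back to back in 64-bit words. `PackedSeq` is a hypothetical type written for illustration and is not the crate's `Seq`. The storage layout is an assumption.

```rust
/// A growable sequence of 2-bit codes, 32 codes per u64 word.
struct PackedSeq {
    words: Vec<u64>,
    len: usize, // number of stored codes, not bits
}

impl PackedSeq {
    fn new() -> Self {
        PackedSeq { words: Vec::new(), len: 0 }
    }

    /// Append one 2-bit code, allocating a new word when the last is full.
    fn push(&mut self, code: u8) {
        let bit = 2 * self.len;
        if bit / 64 == self.words.len() {
            self.words.push(0);
        }
        self.words[bit / 64] |= u64::from(code & 0b11) << (bit % 64);
        self.len += 1;
    }

    /// Read the code at position `i`, or None if out of bounds.
    fn get(&self, i: usize) -> Option<u8> {
        if i >= self.len {
            return None;
        }
        let bit = 2 * i;
        Some(((self.words[bit / 64] >> (bit % 64)) & 0b11) as u8)
    }
}
```

Forty bases occupy 80 bits, so they fit in two words rather than the 40 bytes an ASCII string would need.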
Kmers
Kmers are sequences with a fixed size that can fit into a register. These are implemented with const generics.
k * Codec::width must fit in a usize (i.e. 64 bits on a 64-bit target). For larger kmers use bigk::kmer: TODO
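The size constraint can be sketched with a const generic parameter and a compile-time assertion. `PackedKmer` below is a hypothetical standalone type, not the crate's `Kmer`. It assumes 2 bits per base and stores the first base in the lowest bits.

```rust
/// A DNA k-mer packed into a u64, 2 bits per base.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct PackedKmer<const K: usize>(u64);

impl<const K: usize> PackedKmer<K> {
    /// Checked at compile time for each K that is actually used.
    const FITS: () = assert!(2 * K <= 64, "k-mer does not fit in a u64");

    /// Pack exactly K bases; returns None on a wrong length or unknown base.
    fn from_bytes(bases: &[u8]) -> Option<Self> {
        let () = Self::FITS; // force evaluation of the size check
        if bases.len() != K {
            return None;
        }
        let mut bits = 0u64;
        for (i, &base) in bases.iter().enumerate() {
            let code: u64 = match base {
                b'A' => 0,
                b'C' => 1,
                b'G' => 2,
                b'T' => 3,
                _ => return None,
            };
            bits |= code << (2 * i);
        }
        Some(PackedKmer(bits))
    }

    /// Unpack the K bases back into text.
    fn decode(&self) -> String {
        (0..K)
            .map(|i| ['A', 'C', 'G', 'T'][((self.0 >> (2 * i)) & 0b11) as usize])
            .collect()
    }
}
```

With this sketch, `PackedKmer::<33>::from_bytes(..)` would fail to compile, since 66 bits do not fit in a `u64`.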
Dense encodings
For dense encodings, a lookup table can be populated and indexed in constant time with the usize representation:
TODO: finish example
let mut histogram = vec!;
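A standalone sketch of the lookup-table idea, independent of the crate's types: each k-mer's packed integer value is used directly as an index into a table with 4^k slots. The names `base_bits` and `kmer_histogram` are hypothetical, and the cap of k <= 12 is an arbitrary choice to keep the table small.

```rust
/// 2-bit code for a base (assumed: A=00, C=01, G=10, T=11).
fn base_bits(base: u8) -> Option<usize> {
    match base {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None,
    }
}

/// Count every k-mer of `seq` in a dense table indexed by its packed value.
fn kmer_histogram(seq: &[u8], k: usize) -> Option<Vec<u32>> {
    if k == 0 || k > 12 {
        return None; // arbitrary cap so the table stays small
    }
    let mut histogram = vec![0u32; 1 << (2 * k)];
    for window in seq.windows(k) {
        let mut index = 0usize;
        for (i, &base) in window.iter().enumerate() {
            index |= base_bits(base)? << (2 * i);
        }
        histogram[index] += 1; // constant-time update, no hashing
    }
    Some(histogram)
}
```

The table has one slot per possible k-mer, so this approach is only practical for small k.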
Hashing
The Hash trait is implemented for Kmers.
Canonical Kmers
Depending on the application, it may be permissible to superimpose the forward and reverse complements of a kmer:
k = kmer!;
let canonical = k ^ k.revcomp; // TODO: implement ReverseComplement for Kmer
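Since the snippet above depends on functionality marked TODO, here is a standalone sketch of the same idea on plain integers. It assumes the 2-bit encoding A=00, C=01, G=10, T=11, under which the complement of a code `c` is `3 - c`. The functions `revcomp` and `superimpose` are hypothetical names.

```rust
/// Reverse complement of a k-mer packed 2 bits per base in a u64.
fn revcomp(kmer: u64, k: usize) -> u64 {
    let mut forward = kmer;
    let mut rc = 0u64;
    for _ in 0..k {
        // Take the next base, complement it, and append it so that the
        // order of bases ends up reversed.
        rc = (rc << 2) | (3 - (forward & 0b11));
        forward >>= 2;
    }
    rc
}

/// Strand-independent key: XOR of a k-mer and its reverse complement.
fn superimpose(kmer: u64, k: usize) -> u64 {
    kmer ^ revcomp(kmer, k)
}
```

A k-mer and its reverse complement always produce the same key. The mapping is lossy, though: every k-mer that equals its own reverse complement maps to 0, so distinct k-mers can collide.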
Kmer minimisers
The 2-bit representation of DNA sequences is lexicographically ordered:
Example: Hashing minimiser of canonical Kmers
for ckmer in seq.window.map
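A hedged, standalone sketch of a hashing minimiser over canonical k-mers, using only the standard library. It is not the crate's API: `pack`, `revcomp` and `hash_minimiser` are hypothetical helpers, the canonical form is the XOR superimposition described in the previous section, and `DefaultHasher` stands in for whatever hash function an application would choose.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Pack bases 2 bits each (assumed A=00, C=01, G=10, T=11), first base lowest.
fn pack(bases: &[u8]) -> Option<u64> {
    let mut bits = 0u64;
    for (i, &base) in bases.iter().enumerate() {
        let code: u64 = match base {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None,
        };
        bits |= code << (2 * i);
    }
    Some(bits)
}

/// Reverse complement of a packed k-mer.
fn revcomp(kmer: u64, k: usize) -> u64 {
    let (mut forward, mut rc) = (kmer, 0u64);
    for _ in 0..k {
        rc = (rc << 2) | (3 - (forward & 0b11));
        forward >>= 2;
    }
    rc
}

/// Canonical key with the smallest hash among all k-mers of `seq`.
fn hash_minimiser(seq: &[u8], k: usize) -> Option<u64> {
    if k == 0 || k > 32 {
        return None;
    }
    let mut best: Option<(u64, u64)> = None; // (hash, canonical key)
    for window in seq.windows(k) {
        let forward = pack(window)?;
        let canonical = forward ^ revcomp(forward, k);
        let mut hasher = DefaultHasher::new();
        canonical.hash(&mut hasher);
        let candidate = (hasher.finish(), canonical);
        if best.map_or(true, |b| candidate < b) {
            best = Some(candidate);
        }
    }
    best.map(|(_, canonical)| canonical)
}
```

Because the canonical key is strand-independent, a sequence and its reverse complement yield the same minimiser.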
Derivable codecs
Sequence coding/decoding is derived from the variant names and discriminants of enum types:
use Codec;
use ;
The width attribute specifies how many bits the encoding requires per symbol. The maximum supported is 8.
Kmers are stored as usizes with the least significant bit first.
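To make the idea concrete, the following is a hand-written sketch of the kind of code such a derive could produce from an enum's variant names and discriminants. It is not the macro's actual output: the enum `Nuc`, the constant `WIDTH` and the method names are all hypothetical.

```rust
/// Variant names give the characters; discriminants give the bit patterns.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
enum Nuc {
    A = 0b00,
    C = 0b01,
    G = 0b10,
    T = 0b11,
}

impl Nuc {
    /// Bits per symbol (what a width attribute would declare).
    const WIDTH: u32 = 2;

    fn from_char(c: char) -> Option<Self> {
        match c {
            'A' => Some(Nuc::A),
            'C' => Some(Nuc::C),
            'G' => Some(Nuc::G),
            'T' => Some(Nuc::T),
            _ => None,
        }
    }

    fn to_char(self) -> char {
        match self {
            Nuc::A => 'A',
            Nuc::C => 'C',
            Nuc::G => 'G',
            Nuc::T => 'T',
        }
    }

    /// The discriminant is the encoded value.
    fn to_bits(self) -> u8 {
        self as u8
    }

    fn from_bits(bits: u8) -> Option<Self> {
        match bits {
            0b00 => Some(Nuc::A),
            0b01 => Some(Nuc::C),
            0b10 => Some(Nuc::G),
            0b11 => Some(Nuc::T),
            _ => None,
        }
    }
}
```

Every discriminant must be smaller than `1 << WIDTH`, otherwise a symbol would not fit in its slot.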
Sequence conversions
Iupac from Dna; Seq<Iupac> from Seq<Dna>
Amino from Kmer<3>; Seq<Amino> from Seq<Dna> (TODO)
- Sequence length not a multiple of 3 is an error
Seq<Iupac> from Amino; Seq<Iupac> from Seq<Amino> (TODO)
Vec<Seq<Dna>> from Seq<Iupac>: A sequence of IUPAC codes can generate a list of DNA sequences of the same length. (TODO)
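The IUPAC-to-DNA expansion is marked TODO, so the following is only a standalone sketch of the idea on plain strings: each ambiguity code contributes the set of bases it stands for, and the result is the Cartesian product over positions. `iupac_bases` and `expand` are hypothetical names, not crate functions.

```rust
/// The unambiguous bases an IUPAC nucleotide code may stand for.
fn iupac_bases(code: char) -> Option<&'static str> {
    Some(match code {
        'A' => "A",
        'C' => "C",
        'G' => "G",
        'T' => "T",
        'R' => "AG",
        'Y' => "CT",
        'S' => "CG",
        'W' => "AT",
        'K' => "GT",
        'M' => "AC",
        'B' => "CGT",
        'D' => "AGT",
        'H' => "ACT",
        'V' => "ACG",
        'N' => "ACGT",
        _ => return None,
    })
}

/// Enumerate every DNA sequence matched by an IUPAC sequence.
fn expand(iupac: &str) -> Option<Vec<String>> {
    let mut sequences = vec![String::new()];
    for code in iupac.chars() {
        let bases = iupac_bases(code)?;
        let mut next = Vec::with_capacity(sequences.len() * bases.len());
        for prefix in &sequences {
            for base in bases.chars() {
                let mut extended = prefix.clone();
                extended.push(base);
                next.push(extended);
            }
        }
        sequences = next;
    }
    Some(sequences)
}
```

The output grows multiplicatively with each ambiguous position (for example, `NN` expands to 16 sequences), so a real implementation may prefer a lazy iterator.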
TODO: deal with alternate (e.g. mammalian mitochondrial) translation codes