bio-seq-0.3.0 has been yanked.

`bio-seq`

Bit packed and well-typed biological sequences

use bio_seq::*;

let seq = dna!("ACTGCTAGCA");

for kmer in seq.kmers::<8>() {
	println!("{}", kmer);
}

bio_seq::Dna: DNA uses a lexicographically ordered 2-bit representation

bio_seq::Iupac: IUPAC nucleotide ambiguity codes are represented with 4 bits

   A C G T
   -------
 S 0 1 1 0
 - 0 0 0 0
 C 0 1 0 0
 N 1 1 1 1
 B 0 1 1 1
   ... etc.

This supports membership resolution with bitwise operations:

 assert_eq!(
     format!("{}", iupac!("AS-GYTNA") | iupac!("ANTGCAT-")),
     "ANTGYWNA"
 );
 assert_eq!(
     format!("{}", iupac!("ACGTSWKM") & iupac!("WKMSTNNA")),
     "A----WKA"
 );
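The membership logic above can be sketched without the crate. This is a minimal standalone illustration, not the crate's implementation: the bit assignment (A as the highest bit, following the table's column order) and the helper names `encode`, `decode` and `combine` are assumptions made for this example.

```rust
// Hypothetical 4-bit masks, one bit per base in the column order A C G T.
// The exact bit assignment is an assumption for illustration.
const A: u8 = 0b1000;
const C: u8 = 0b0100;
const G: u8 = 0b0010;
const T: u8 = 0b0001;

/// Map an IUPAC character to its 4-bit membership mask.
fn encode(c: char) -> Option<u8> {
    Some(match c {
        '-' => 0,
        'A' => A,
        'C' => C,
        'G' => G,
        'T' => T,
        'R' => A | G,
        'Y' => C | T,
        'S' => C | G,
        'W' => A | T,
        'K' => G | T,
        'M' => A | C,
        'B' => C | G | T,
        'D' => A | G | T,
        'H' => A | C | T,
        'V' => A | C | G,
        'N' => A | C | G | T,
        _ => return None,
    })
}

/// Map a 4-bit mask back to its IUPAC character.
fn decode(mask: u8) -> char {
    // The index is the mask value itself (A is the high bit, T the low bit).
    const SYMBOLS: [char; 16] = [
        '-', 'T', 'G', 'K', 'C', 'Y', 'S', 'B',
        'A', 'W', 'R', 'D', 'M', 'H', 'V', 'N',
    ];
    SYMBOLS[(mask & 0b1111) as usize]
}

/// Combine two equal-length IUPAC strings position by position with `op`.
/// Returns `None` on a length mismatch or an unknown character.
fn combine(x: &str, y: &str, op: fn(u8, u8) -> u8) -> Option<String> {
    if x.chars().count() != y.chars().count() {
        return None;
    }
    x.chars()
        .zip(y.chars())
        .map(|(a, b)| Some(decode(op(encode(a)?, encode(b)?))))
        .collect()
}
```

Under these assumptions, `combine("AS-GYTNA", "ANTGCAT-", |a, b| a | b)` yields `Some("ANTGYWNA")`, matching the first assertion above: a union of code sets is a bitwise OR and an intersection is a bitwise AND.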

The Iupac struct implements From<Dna>

bio_seq::Amino: Amino acid sequences are represented with 6 bits.

The representation of amino acids is designed to be easy to coerce from sequences of 2-bit encoded DNA. TODO: deal with alternate (e.g. mammalian mitochondrial) translation codes

Kmers

Kmers are sequences with a fixed size. These are implemented with const generics.

K * Codec::WIDTH must fit in a usize (i.e. at most 64 bits on a 64-bit target). For larger Kmers use bigk::Kmer: (TODO)
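As a rough sketch of how the size constraint interacts with const generics, the following standalone example checks whether `K` symbols fit in one `usize`. The names `PackedKmer`, `WIDTH` and `FITS` are hypothetical and are not the crate's API.

```rust
/// Hypothetical stand-in for a codec's bit width (2 bits per DNA base).
const WIDTH: usize = 2;

/// A k-mer packed into a single `usize`; the length `K` is part of the type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct PackedKmer<const K: usize>(usize);

impl<const K: usize> PackedKmer<K> {
    /// True when `K` symbols of `WIDTH` bits each fit in one `usize`.
    const FITS: bool = K * WIDTH <= usize::BITS as usize;

    /// Wrap raw bits, refusing any `K` that cannot fit in a `usize`.
    fn new(bits: usize) -> Option<Self> {
        if Self::FITS {
            Some(PackedKmer(bits))
        } else {
            None
        }
    }
}
```

With a 2-bit codec, `PackedKmer::<16>::new(0)` succeeds on any 32-bit or 64-bit target, while `PackedKmer::<33>::new(0)` returns `None` because 66 bits do not fit.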

Minimisers for free

The 2-bit representation of DNA sequences is lexicographically ordered:

// find the lexicographically minimum 8-mer
fn minimise(seq: Seq<Dna>) -> Option<Kmer<8>> {
    seq.kmers::<8>().min()
}
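To illustrate why an ordered 2-bit code gives minimisers cheaply, here is a standalone sketch that finds the smallest k-mer by comparing packed integers. It is an assumption-laden example, not the crate's code: the functions `code` and `min_kmer` are hypothetical, and this sketch places the first base in the most significant position so that integer order equals string order, which may differ from the crate's own layout.

```rust
/// Encode one base with the ordered 2-bit code A < C < G < T.
fn code(base: u8) -> Option<u64> {
    match base {
        b'A' => Some(0b00),
        b'C' => Some(0b01),
        b'G' => Some(0b10),
        b'T' => Some(0b11),
        _ => None,
    }
}

/// Return the lexicographically smallest k-mer of `seq` as a string slice.
/// Each window is packed into a `u64` with its first base in the most
/// significant position, so comparing integers compares the strings.
fn min_kmer(seq: &str, k: usize) -> Option<&str> {
    if k == 0 || k > 32 || seq.len() < k {
        return None;
    }
    // Fails with `None` on any character outside A, C, G, T.
    let codes: Vec<u64> = seq.bytes().map(code).collect::<Option<_>>()?;
    let mask = if k == 32 { u64::MAX } else { (1u64 << (2 * k)) - 1 };
    let mut packed = 0u64;
    let mut best: Option<(u64, usize)> = None;
    for (i, &c) in codes.iter().enumerate() {
        // Slide the window: shift in the new base, mask off the oldest.
        packed = ((packed << 2) | c) & mask;
        if i + 1 >= k {
            let start = i + 1 - k;
            if best.map_or(true, |(b, _)| packed < b) {
                best = Some((packed, start));
            }
        }
    }
    best.map(|(_, start)| &seq[start..start + k])
}
```

Because each window is updated with one shift, one OR and one mask, the scan is linear in the sequence length and never compares strings character by character.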

Derived codecs

Alphabet coding/decoding is derived from the variant names and discriminants of enum types:

#[derive(Clone, Copy, Debug, PartialEq, Codec)]
#[width = 2]
#[repr(u8)]
pub enum Dna {
    A = 0b00,
    C = 0b01,
    G = 0b10,
    T = 0b11,
}

The width attribute specifies how many bits the encoding requires per symbol.
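For intuition, the following standalone sketch writes out by hand roughly what such a derive could generate for `Dna`. The trait shown here and its method names (`to_bits`, `from_bits`, `from_char`, `to_char`) are assumptions for illustration; the crate's real `Codec` trait may look different.

```rust
/// Hypothetical codec trait; the crate's real `Codec` trait may differ.
trait Codec: Sized + Copy {
    /// Number of bits needed to encode one symbol.
    const WIDTH: u8;
    fn to_bits(self) -> u8;
    fn from_bits(bits: u8) -> Option<Self>;
    fn from_char(c: char) -> Option<Self>;
    fn to_char(self) -> char;
}

#[derive(Clone, Copy, Debug, PartialEq)]
#[repr(u8)]
enum Dna {
    A = 0b00,
    C = 0b01,
    G = 0b10,
    T = 0b11,
}

// Hand-written equivalent of what a derive could produce: the character
// mapping comes from the variant names, the bits from the discriminants.
impl Codec for Dna {
    const WIDTH: u8 = 2;

    fn to_bits(self) -> u8 {
        self as u8
    }

    fn from_bits(bits: u8) -> Option<Self> {
        match bits {
            0b00 => Some(Dna::A),
            0b01 => Some(Dna::C),
            0b10 => Some(Dna::G),
            0b11 => Some(Dna::T),
            _ => None,
        }
    }

    fn from_char(c: char) -> Option<Self> {
        match c {
            'A' => Some(Dna::A),
            'C' => Some(Dna::C),
            'G' => Some(Dna::G),
            'T' => Some(Dna::T),
            _ => None,
        }
    }

    fn to_char(self) -> char {
        match self {
            Dna::A => 'A',
            Dna::C => 'C',
            Dna::G => 'G',
            Dna::T => 'T',
        }
    }
}
```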

Little endian

Kmers are stored as usizes with the least significant bit first.

dna!("C") == 0b01 // not 0b0100_0000
dna!("CT") == 0b11_01
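The layout can be reproduced with a few lines of standalone Rust. This is only a sketch of the bit arithmetic; `pack_le` is a hypothetical helper, not a function exported by the crate.

```rust
/// Pack a DNA string into a `usize`, with the first base in the lowest
/// two bits. Returns `None` for unknown characters or if the sequence
/// has more bases than a `usize` can hold.
fn pack_le(seq: &str) -> Option<usize> {
    let max_bases = (usize::BITS / 2) as usize;
    if seq.len() > max_bases {
        return None;
    }
    let mut bits = 0usize;
    for (i, base) in seq.bytes().enumerate() {
        let code: usize = match base {
            b'A' => 0b00,
            b'C' => 0b01,
            b'G' => 0b10,
            b'T' => 0b11,
            _ => return None,
        };
        // Base i occupies bits 2*i and 2*i + 1.
        bits |= code << (2 * i);
    }
    Some(bits)
}
```

Here `pack_le("C")` gives `Some(0b01)` and `pack_le("CT")` gives `Some(0b11_01)`: the second base, T, lands in the higher pair of bits.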

Conversion with `From` and `Into`

Iupac from Dna; Seq<Iupac> from Seq<Dna>

Amino from Kmer<3>; Seq<Amino> from Seq<Dna> (TODO)

A sequence length that is not a multiple of 3 is an error

Seq<Iupac> from Amino; Seq<Iupac> from Seq<Amino> (TODO)

Vec<Seq<Dna>> from Seq<Iupac>: A sequence of IUPAC codes can generate a list of DNA sequences of the same length. (TODO)
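The expansion idea can be sketched on plain strings. Since the conversion is still marked TODO, this does not reflect any crate API: `bases` and `expand` are hypothetical helpers, and treating the gap symbol `-` as an error is an assumption of this example.

```rust
/// Bases matched by an IUPAC code, listed in A, C, G, T order.
/// The gap symbol '-' matches no base and is rejected in this sketch.
fn bases(code: char) -> Option<&'static str> {
    Some(match code {
        'A' => "A",
        'C' => "C",
        'G' => "G",
        'T' => "T",
        'R' => "AG",
        'Y' => "CT",
        'S' => "CG",
        'W' => "AT",
        'K' => "GT",
        'M' => "AC",
        'B' => "CGT",
        'D' => "AGT",
        'H' => "ACT",
        'V' => "ACG",
        'N' => "ACGT",
        _ => return None,
    })
}

/// Expand an IUPAC string into every DNA string it can stand for.
/// Each output has the same length as the input.
fn expand(iupac: &str) -> Option<Vec<String>> {
    let mut out = vec![String::new()];
    for code in iupac.chars() {
        let choices = bases(code)?;
        let mut next = Vec::with_capacity(out.len() * choices.len());
        // Extend every prefix built so far with every allowed base.
        for prefix in &out {
            for b in choices.chars() {
                let mut s = prefix.clone();
                s.push(b);
                next.push(s);
            }
        }
        out = next;
    }
    Some(out)
}
```

The number of results is the product of the set sizes at each position, so a run of `N` codes grows as a power of 4; a real implementation would likely want a lazy iterator rather than a `Vec`.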

Deref coercion

TODO: find out if Kmer<Dna, K> -> Kmer<Amino, K/3> is possible

Drop-in compatibility with `rust-bio`

Meant to replace rust-bio's Text/TextSlice

TODO

benchmarking
macros for defining alphabet codecs more concisely
wider SIMD-sized Kmers
