bio-seq 0.3.0 - Docs.rs

<div class="title-block" style="text-align: center;" align="center">

# `bio-seq`

### Bit packed and well-typed biological sequences
</div>

```rust
use bio_seq::*;

let seq = dna!("ACTGCTAGCA");

for kmer in seq.kmers::<8>() {
	println!("{}", kmer);
}
```

* `bio_seq::Dna`: DNA uses a lexicographically ordered 2-bit representation

* `bio_seq::Iupac`: IUPAC nucleotide ambiguity codes are represented with 4 bits

	```
	  A C G T
	  -------
	S 0 1 1 0
	- 0 0 0 0
	C 0 1 0 0
	N 1 1 1 1
	B 0 1 1 1
	  ... etc.
	```
	This supports membership resolution with bitwise operations:

	```rust
    assert_eq!(
        format!("{}", iupac!("AS-GYTNA") | iupac!("ANTGCAT-")),
        "ANTGYWNA"
    );
    assert_eq!(
        format!("{}", iupac!("ACGTSWKM") & iupac!("WKMSTNNA")),
        "A----WKA"
    );
	```
The `Iupac` type implements `From<Dna>`.

* `bio_seq::Amino`: Amino acid sequences are represented with 6 bits.

   The representation of amino acids is designed to be easy to coerce from sequences of 2-bit encoded DNA.
   TODO: deal with alternate (e.g. mammalian mitochondrial) translation codes
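To illustrate why the 4-bit IUPAC encoding makes set operations cheap, here is a minimal standalone sketch that uses only the standard library. It is not the crate's implementation: the function names (`iupac_bits`, `iupac_char`, `dna_to_iupac_bits`, `union`, `intersection`) are hypothetical, and the bit order (A as the high bit, matching the column order of the table above) is an assumption.

```rust
/// Map an IUPAC symbol to a 4-bit mask with columns A, C, G, T (A is the high bit).
/// Assumed layout for illustration; the crate's actual bit order may differ.
fn iupac_bits(c: char) -> Option<u8> {
    let bits = match c {
        '-' => 0b0000,
        'A' => 0b1000,
        'C' => 0b0100,
        'G' => 0b0010,
        'T' => 0b0001,
        'M' => 0b1100, // A or C
        'R' => 0b1010, // A or G
        'W' => 0b1001, // A or T
        'S' => 0b0110, // C or G
        'Y' => 0b0101, // C or T
        'K' => 0b0011, // G or T
        'V' => 0b1110, // A, C or G
        'H' => 0b1101, // A, C or T
        'D' => 0b1011, // A, G or T
        'B' => 0b0111, // C, G or T
        'N' => 0b1111, // any base
        _ => return None,
    };
    Some(bits)
}

/// Inverse mapping: the mask itself indexes a 16-entry symbol table.
fn iupac_char(bits: u8) -> char {
    const SYMBOLS: &[u8; 16] = b"-TGKCYSBAWRDMHVN";
    SYMBOLS[(bits & 0b1111) as usize] as char
}

/// A 2-bit DNA code (A=0, C=1, G=2, T=3) becomes a one-hot IUPAC mask.
fn dna_to_iupac_bits(code: u8) -> u8 {
    0b1000 >> (code & 0b11)
}

/// Combine two equal-length IUPAC strings position by position.
fn combine(a: &str, b: &str, op: fn(u8, u8) -> u8) -> Option<String> {
    if a.len() != b.len() {
        return None;
    }
    a.chars()
        .zip(b.chars())
        .map(|(x, y)| Some(iupac_char(op(iupac_bits(x)?, iupac_bits(y)?))))
        .collect()
}

/// Bitwise OR: the set of bases allowed by either sequence.
fn union(a: &str, b: &str) -> Option<String> {
    combine(a, b, |x, y| x | y)
}

/// Bitwise AND: the set of bases allowed by both sequences.
fn intersection(a: &str, b: &str) -> Option<String> {
    combine(a, b, |x, y| x & y)
}
```

With this layout, `union("AS-GYTNA", "ANTGCAT-")` and `intersection("ACGTSWKM", "WKMSTNNA")` reproduce the strings asserted in the `iupac!` example above.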

## Kmers

Kmers are fixed-length sequences whose length `K` is a const generic parameter.

`K * Codec::WIDTH` must fit in a `usize` (i.e. 64 bits on a 64-bit target). For larger Kmers use `bigk::Kmer` (TODO).
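The size constraint can be made concrete with a small standalone sketch. The type `PackedKmer` and the module-level `WIDTH` constant are hypothetical stand-ins (they are not the crate's `Kmer` or `Codec::WIDTH`); the sketch only shows one way to check that `K` symbols fit in a machine word and to mask a raw word down to the payload bits.

```rust
/// Bits per symbol for a 2-bit DNA alphabet (stand-in for `Codec::WIDTH`).
const WIDTH: usize = 2;

/// A k-mer packed into a single machine word. Hypothetical type for illustration.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct PackedKmer<const K: usize> {
    bits: usize,
}

impl<const K: usize> PackedKmer<K> {
    /// True when K symbols of WIDTH bits each fit in one `usize`.
    const FITS: bool = K * WIDTH <= usize::BITS as usize;

    /// Build a k-mer from a raw word, discarding any bits above the payload.
    /// Returns `None` when K is too large for a single word.
    fn new(bits: usize) -> Option<Self> {
        if !Self::FITS {
            return None;
        }
        // `checked_shl` returns None when the payload fills the whole word,
        // in which case every bit is significant.
        let mask = 1usize
            .checked_shl((K * WIDTH) as u32)
            .map_or(usize::MAX, |m| m - 1);
        Some(Self { bits: bits & mask })
    }
}
```

A runtime `Option` is used here for simplicity; a real implementation might prefer to reject oversized `K` at compile time.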

### Minimisers for free

The 2-bit representation of DNA is lexicographically ordered (`A < C < G < T`), so the minimum Kmer can be found with a plain `min()`:

```rust
use bio_seq::*;

// Find the lexicographically minimum 8-mer.
// Returns `None` if the sequence is shorter than 8 bases.
fn minimise(seq: Seq<Dna>) -> Option<Kmer<8>> {
    seq.kmers::<8>().min()
}
```
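As a rough standalone illustration of the same idea without the crate, the sketch below slides a 2-bit-packed window over a byte string and keeps the smallest integer. Note the assumption: this sketch places the first base of each window in the most significant bits so that integer order coincides with string order; the crate's own bit layout may differ. The names `encode` and `min_kmer` are hypothetical.

```rust
/// Encode a base as 2 bits so that A < C < G < T, matching alphabetical order.
fn encode(base: u8) -> Option<u64> {
    match base {
        b'A' => Some(0b00),
        b'C' => Some(0b01),
        b'G' => Some(0b10),
        b'T' => Some(0b11),
        _ => None,
    }
}

/// Return the lexicographically smallest k-mer of `seq`, or `None` if the
/// sequence is shorter than `K` or contains a byte other than A, C, G, T.
fn min_kmer<const K: usize>(seq: &[u8]) -> Option<String> {
    assert!((1..=32).contains(&K), "K bases must fit in a u64");
    // Mask keeping the low 2 * K bits (all 64 bits when K == 32).
    let mask = 1u64.checked_shl(2 * K as u32).map_or(u64::MAX, |m| m - 1);
    let mut window = 0u64;
    let mut best: Option<u64> = None;
    for (i, &b) in seq.iter().enumerate() {
        // Shift the oldest base out of the top and the new base into the bottom.
        window = ((window << 2) | encode(b)?) & mask;
        if i + 1 >= K {
            best = Some(best.map_or(window, |m| m.min(window)));
        }
    }
    // Decode the winning integer back to text, most significant base first.
    best.map(|w| {
        (0..K)
            .rev()
            .map(|j| b"ACGT"[((w >> (2 * j)) & 0b11) as usize] as char)
            .collect()
    })
}
```

Because each window is compared as a single integer, finding the minimum needs no per-base string comparison.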

## Derived codecs

Alphabet coding/decoding is derived from the variant names and discriminants of enum types:

```rust
#[derive(Clone, Copy, Debug, PartialEq, Codec)]
#[width = 2]
#[repr(u8)]
pub enum Dna {
    A = 0b00,
    C = 0b01,
    G = 0b10,
    T = 0b11,
}
```

The `width` attribute specifies how many bits the encoding requires per symbol.
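To give a feel for what such a derive might generate, here is a hand-written equivalent for a 2-bit DNA alphabet. This is only a sketch: the trait `SymbolCodec` and its method names are hypothetical and are not the crate's actual `Codec` trait.

```rust
/// Hypothetical stand-in for the kind of trait a `Codec` derive could implement.
trait SymbolCodec: Sized + Copy {
    /// Number of bits needed per symbol.
    const WIDTH: u8;
    fn to_bits(self) -> u8;
    fn from_bits(bits: u8) -> Option<Self>;
    fn to_char(self) -> char;
    fn from_char(c: char) -> Option<Self>;
}

#[derive(Clone, Copy, Debug, PartialEq)]
#[repr(u8)]
enum Dna {
    A = 0b00,
    C = 0b01,
    G = 0b10,
    T = 0b11,
}

impl SymbolCodec for Dna {
    const WIDTH: u8 = 2;

    /// The discriminant is the encoding.
    fn to_bits(self) -> u8 {
        self as u8
    }

    fn from_bits(bits: u8) -> Option<Self> {
        match bits {
            0b00 => Some(Dna::A),
            0b01 => Some(Dna::C),
            0b10 => Some(Dna::G),
            0b11 => Some(Dna::T),
            _ => None,
        }
    }

    /// The variant name is the printed symbol.
    fn to_char(self) -> char {
        match self {
            Dna::A => 'A',
            Dna::C => 'C',
            Dna::G => 'G',
            Dna::T => 'T',
        }
    }

    fn from_char(c: char) -> Option<Self> {
        match c {
            'A' => Some(Dna::A),
            'C' => Some(Dna::C),
            'G' => Some(Dna::G),
            'T' => Some(Dna::T),
            _ => None,
        }
    }
}
```

Every match arm here follows mechanically from a variant name or discriminant, which is what makes the mapping a good candidate for a derive macro.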

## Little endian

Kmers are stored as `usize`s with the least significant bit first.

```rust
dna!("C") == 0b01 // not 0b0100_0000
dna!("CT") == 0b11_01
```
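The layout above can be reproduced with a few lines of standard-library Rust. The function `pack_lsb_first` is a hypothetical helper written for this illustration, not part of the crate: base `i` of the sequence is placed in bits `2*i` and `2*i + 1`.

```rust
/// Pack an ACGT string into a `usize`, first base in the lowest two bits.
/// Returns `None` for unknown symbols or sequences too long for one word.
fn pack_lsb_first(seq: &str) -> Option<usize> {
    let mut word = 0usize;
    for (i, c) in seq.chars().enumerate() {
        let code: usize = match c {
            'A' => 0b00,
            'C' => 0b01,
            'G' => 0b10,
            'T' => 0b11,
            _ => return None,
        };
        // Base i occupies bits 2*i and 2*i + 1; `checked_shl` fails once
        // the shift reaches the word size.
        let shift = u32::try_from(2 * i).ok()?;
        word |= code.checked_shl(shift)?;
    }
    Some(word)
}
```

Under this sketch, `pack_lsb_first("C")` gives `0b01` and `pack_lsb_first("CT")` gives `0b11_01`, matching the two lines above.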

## Conversion with `From` and `Into`

`Iupac` from `Dna`; `Seq<Iupac>` from `Seq<Dna>`

`Amino` from `Kmer<3>`; `Seq<Amino>` from `Seq<Dna>` (TODO)
  * A sequence whose length is not a multiple of 3 is an error
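Since the `Seq<Dna>` to `Seq<Amino>` conversion is still marked TODO, the following is only a sketch of the intended behaviour on plain strings. It assumes the standard genetic code (NCBI translation table 1) with stop codons written as `*`, and it uses a lookup table rather than the 6-bit amino acid encoding described earlier. The function name `translate` is hypothetical.

```rust
/// Translate a DNA string into one-letter amino acid codes.
/// Errors if the length is not a multiple of 3 or a base is not A, C, G or T.
fn translate(dna: &str) -> Result<String, String> {
    // Standard genetic code, codons ordered with bases T, C, A, G
    // (first base varies slowest).
    const TABLE: &[u8; 64] =
        b"FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

    let bases = dna.as_bytes();
    if bases.len() % 3 != 0 {
        return Err(format!("length {} is not a multiple of 3", bases.len()));
    }

    // Position of a base in the T, C, A, G ordering used by TABLE.
    let index = |b: u8| -> Result<usize, String> {
        match b {
            b'T' => Ok(0),
            b'C' => Ok(1),
            b'A' => Ok(2),
            b'G' => Ok(3),
            other => Err(format!("unexpected base {:?}", other as char)),
        }
    };

    bases
        .chunks(3)
        .map(|codon| -> Result<char, String> {
            let i = 16 * index(codon[0])? + 4 * index(codon[1])? + index(codon[2])?;
            Ok(TABLE[i] as char)
        })
        .collect()
}
```

For example, `translate("ATGGCC")` yields `Ok("MA")`, while `translate("ATGG")` returns an error because of the trailing base.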

`Seq<Iupac>` from `Amino`; `Seq<Iupac>` from `Seq<Amino>` (TODO)

`Vec<Seq<Dna>>` from `Seq<Iupac>`: A sequence of IUPAC codes can generate a list of DNA sequences of the same length. (TODO)
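The expansion from IUPAC codes to concrete DNA sequences is a Cartesian product over the bases each code allows. As this conversion is marked TODO, the sketch below works on plain strings; `expand_iupac` is a hypothetical name, and rejecting the gap symbol `-` is a simplifying assumption of this sketch.

```rust
/// List every DNA string matched by an IUPAC string, in alphabetical order.
/// Returns `None` for symbols outside the IUPAC nucleotide codes
/// (the gap symbol `-` is also rejected in this sketch).
fn expand_iupac(seq: &str) -> Option<Vec<String>> {
    let mut out = vec![String::new()];
    for c in seq.chars() {
        // Bases permitted by this ambiguity code.
        let options: &str = match c {
            'A' => "A",
            'C' => "C",
            'G' => "G",
            'T' => "T",
            'M' => "AC",
            'R' => "AG",
            'W' => "AT",
            'S' => "CG",
            'Y' => "CT",
            'K' => "GT",
            'V' => "ACG",
            'H' => "ACT",
            'D' => "AGT",
            'B' => "CGT",
            'N' => "ACGT",
            _ => return None,
        };
        // Extend every sequence built so far with every permitted base.
        let mut next = Vec::with_capacity(out.len() * options.len());
        for prefix in &out {
            for base in options.chars() {
                let mut s = prefix.clone();
                s.push(base);
                next.push(s);
            }
        }
        out = next;
    }
    Some(out)
}
```

The number of results is the product of the option counts at each position, so it grows exponentially with the number of ambiguous symbols; a sequence of ten `N`s already expands to 4^10 strings.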

### Deref coercion

TODO: find out if `Kmer<Dna, K>` -> `Kmer<Amino, K/3>` is possible

## Drop-in compatibility with `rust-bio`

Intended to replace rust-bio's `Text` and `TextSlice` types.

## TODO

* benchmarking
* macros for defining alphabet codecs more concisely
* wider SIMD-sized Kmers