Crate bio_seq

Bit-packed and well-typed biological sequences

The strength of Rust is that we can safely separate the science (well-typed) and the engineering (bit-packed) of bioinformatics. An incremental benchmark improvement in the reverse complement algorithm should benefit the user of a succinct data structure without anyone unwillingly learning about endianness.

Contributions are very welcome. There is plenty of low-hanging fruit for optimisation, and ideally each optimisation should only have to be written once!

§Sequences

A Seq is a heap-allocated sequence of symbols that owns its data. A SeqSlice is a read-only window into a Seq. Static SeqArrays can be declared with the dna! and iupac! macros, but these should be dereferenced as &'static SeqSlices.

Kmers are shorter, fixed-length sequences. They generally fit in a single register and implement Copy. They are used for optimised algorithms on sequences and succinct data structures. The default implementation uses a usize for storage. Using the 2-bit Dna encoding, a Kmer<Dna, 32> occupies 64 bits.
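The arithmetic behind that last figure (32 symbols at 2 bits each is 64 bits) can be sketched with the standard library alone. This is an illustration, not the crate's implementation: the helper names `encode_base`, `pack_kmer` and `unpack_kmer` are hypothetical, the bit assignment (A=00, C=01, G=10, T=11) and the choice to put the first base in the lowest bits are assumptions, and a `u64` is used in place of `usize` so the width is explicit.

```rust
/// Hypothetical 2-bit code for one base. The bit assignment
/// (A=00, C=01, G=10, T=11) is an assumption of this sketch.
fn encode_base(b: u8) -> Option<u64> {
    match b {
        b'A' => Some(0b00),
        b'C' => Some(0b01),
        b'G' => Some(0b10),
        b'T' => Some(0b11),
        _ => None,
    }
}

/// Pack up to 32 bases into one u64, first base in the lowest two bits.
/// Returns None for longer inputs or for characters other than A, C, G, T.
fn pack_kmer(s: &str) -> Option<u64> {
    if s.len() > 32 {
        return None; // 32 bases * 2 bits = 64 bits is all a u64 can hold
    }
    let mut word = 0u64;
    for (i, b) in s.bytes().enumerate() {
        word |= encode_base(b)? << (2 * i);
    }
    Some(word)
}

/// Unpack `k` bases (k <= 32) from a word produced by `pack_kmer`.
fn unpack_kmer(word: u64, k: usize) -> String {
    assert!(k <= 32, "a u64 holds at most 32 two-bit symbols");
    (0..k)
        .map(|i| match (word >> (2 * i)) & 0b11 {
            0b00 => 'A',
            0b01 => 'C',
            0b10 => 'G',
            _ => 'T',
        })
        .collect()
}
```

Under these assumptions, `pack_kmer("ACGT")` yields `Some(0b11_10_01_00)`, and packing then unpacking an 18-mer returns the original string. Because the packed value is a plain integer, it is `Copy` and fits in one 64-bit register.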

These sequence types are parameterised with Codecs (e.g. Seq<Dna>, Seq<Amino>, etc.) that define how symbols are encoded into strings of bits and decoded as readable strings.

§Quick start

Add bio-seq to Cargo.toml:

[dependencies]
bio-seq = "0.13"

use bio_seq::prelude::*;

let seq = dna!("ATACGATCGATCGATCGATCCGT");

// iterate over the 8-mers of the reverse complement
for kmer in seq.to_revcomp().kmers::<8>() {
    println!("{kmer}");
}

// ACGGATCG
// CGGATCGA
// GGATCGAT
// GATCGATC
// ATCGATCG
// ...
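As a cross-check on the output listed above, the same computation can be sketched on plain ASCII strings without the crate. The functions `revcomp` and `kmers` here are hypothetical stand-ins written for this illustration; they operate on `&str` rather than on bit-packed data and assume ASCII input.

```rust
/// Reverse complement of a plain ASCII DNA string (A<->T, C<->G).
/// Returns None if any other character is present.
fn revcomp(s: &str) -> Option<String> {
    s.bytes()
        .rev()
        .map(|b| match b {
            b'A' => Some('T'),
            b'C' => Some('G'),
            b'G' => Some('C'),
            b'T' => Some('A'),
            _ => None,
        })
        .collect()
}

/// All overlapping substrings of length `k`, left to right.
/// Assumes `s` is ASCII, so byte offsets are character offsets.
fn kmers(s: &str, k: usize) -> Vec<&str> {
    if k == 0 || k > s.len() {
        return Vec::new();
    }
    (0..=s.len() - k).map(|i| &s[i..i + k]).collect()
}
```

For the 23-base input used above, `revcomp` gives `ACGGATCGATCGATCGATCGTAT`, and `kmers(&rc, 8)` yields 16 windows beginning `ACGGATCG`, `CGGATCGA`, `GGATCGAT`, matching the listing.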

Sequences are analogous to Rust’s string types and follow similar dereferencing conventions:

// The `dna!` macro packs a static sequence with 2 bits per symbol at compile time:
let s: &'static str = "hello!";
let seq: &'static SeqSlice<Dna> = dna!("CGCTAGCTACGATCGCAT");

// Sequences can also be copied into `Kmer`s:
let kmer: Kmer<Dna, 18> = dna!("CGCTAGCTACGATCGCAT").try_into().unwrap();
// or with the kmer! macro:
let kmer = kmer!("CGCTAGCTACGATCGCAT");

// `Seq`s can be allocated on the heap like `String`s are:
let s: String = "hello!".into();
let seq: Seq<Dna> = dna!("CGCTAGCTACGATCGCAT").into();

// Alternatively, a `Seq` can be fallibly encoded at runtime:
let seq: Seq<Dna> = "CGCTAGCTACGATCGCAT".try_into().unwrap();

// `&SeqSlice` is analogous to `&str`:
let slice: &str = &s[1..3];
let seqslice: &SeqSlice<Dna> = &seq[2..4];

§Bit-packed encodings

Encodings of genomic symbols are implemented as Codecs. This crate provides four common ones:

codec::dna: 2-bit encoding of the four nucleotides
codec::text: 8-bit ASCII encoding of nucleotides, meant to be compatible with plaintext sequencing data formats
codec::iupac: 4-bit encoding of ambiguous nucleotide identities (the IUPAC ambiguity codes)
codec::amino: 6-bit encoding of amino acids

Each of these encodings is designed to facilitate common bioinformatics tasks, such as minimising k-mers and implementing succinct data structures. The translation module provides traits and methods for translating between nucleotide and amino acid sequences.
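To illustrate why a 4-bit encoding suits the IUPAC ambiguity codes, here is a minimal standard-library sketch in which each bit stands for one possible base, so an ambiguous symbol is the union of its bases and compatibility is a bitwise AND. The bit assignment (A=1000, C=0100, G=0010, T=0001) and the function names `iupac_bits` and `compatible` are assumptions of this sketch and may not match the layout used by codec::iupac.

```rust
/// Encode an IUPAC nucleotide code as a 4-bit set of possible bases.
/// The bit assignment below is an assumption of this sketch.
fn iupac_bits(c: char) -> Option<u8> {
    const A: u8 = 0b1000;
    const C: u8 = 0b0100;
    const G: u8 = 0b0010;
    const T: u8 = 0b0001;
    Some(match c {
        'A' => A,
        'C' => C,
        'G' => G,
        'T' => T,
        'R' => A | G, // purine
        'Y' => C | T, // pyrimidine
        'S' => C | G,
        'W' => A | T,
        'K' => G | T,
        'M' => A | C,
        'B' => C | G | T, // not A
        'D' => A | G | T, // not C
        'H' => A | C | T, // not G
        'V' => A | C | G, // not T
        'N' => A | C | G | T,
        _ => return None,
    })
}

/// Two codes are compatible when their sets of possible bases intersect.
fn compatible(x: char, y: char) -> Option<bool> {
    Some((iupac_bits(x)? & iupac_bits(y)?) != 0)
}
```

With this layout, `R` (A or G) is compatible with `A` but not with `Y` (C or T), and the test is a single AND on the packed values rather than a table lookup.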

Custom codecs can also be implemented with the Codec trait and derived on specially crafted enums.
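The general shape of a custom encoding can be sketched as follows. The `SketchCodec` trait here is a hypothetical stand-in invented for this illustration; it is not the crate's Codec trait, whose actual methods and derive macro should be taken from the codec module documentation. The two-symbol `Strand` alphabet and the `pack` helper are likewise made up.

```rust
/// Hypothetical stand-in for a codec trait; NOT the crate's `Codec` definition.
trait SketchCodec: Sized + Copy {
    /// Number of bits used per symbol.
    const BITS: u32;
    fn from_char(c: char) -> Option<Self>;
    fn to_char(self) -> char;
    fn to_bits(self) -> u8;
}

/// A made-up two-symbol alphabet, packed with 1 bit per symbol.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Strand {
    Forward = 0b0,
    Reverse = 0b1,
}

impl SketchCodec for Strand {
    const BITS: u32 = 1;

    fn from_char(c: char) -> Option<Self> {
        match c {
            '+' => Some(Strand::Forward),
            '-' => Some(Strand::Reverse),
            _ => None,
        }
    }

    fn to_char(self) -> char {
        match self {
            Strand::Forward => '+',
            Strand::Reverse => '-',
        }
    }

    fn to_bits(self) -> u8 {
        // The enum discriminant doubles as the bit pattern.
        self as u8
    }
}

/// Pack a string of symbols into a u64, first symbol in the lowest bits.
/// Returns None on an unknown symbol or if the string does not fit in 64 bits.
fn pack<C: SketchCodec>(s: &str) -> Option<u64> {
    let mut word = 0u64;
    for (i, ch) in s.chars().enumerate() {
        let shift = C::BITS * i as u32;
        if shift + C::BITS > 64 {
            return None;
        }
        word |= (C::from_char(ch)?.to_bits() as u64) << shift;
    }
    Some(word)
}
```

Here `pack::<Strand>("+-+-")` produces `Some(0b1010)`. Keeping the bit width as an associated constant lets generic code such as `pack` work out the layout at compile time for any alphabet.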

Modules§

codec: Coding/Decoding trait for bit-packable enums representing sets of genomic symbols
error
kmer: Encoded sequences of static length
prelude
seq: Arbitrary length sequences of bit-packed genomic data
translation: Amino acid translation tables

Macros§

__bio_seq_count_words
dna: Static DNA sequences encoded at compile time
iupac: Static degenerate nucleotide codes encoded at compile time
kmer: Convenient compile time kmer constructor

Traits§

Complement
ComplementMut: Nucleotide bases and sequences can be complemented
Maskable
MaskableMut: Some sequence types may be maskable
Reverse
ReverseComplement
ReverseComplementMut: A reversible sequence that can be complemented can be reverse complemented
ReverseMut: A reversible sequence