Crate bio_seq

Bit-packed and well-typed biological sequences

A Seq is a heap-allocated sequence of variable length that owns its data. A SeqSlice is a read-only window into a Seq.

Kmers are short, fixed-length sequences. They generally implement Copy and are used for optimised algorithms on sequences. The default implementation uses a usize for storage.
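
To illustrate why a short, fixed-length sequence can live in a single machine word and be cheap to copy, here is a minimal standalone sketch that packs nucleotides into a usize at two bits per base. It does not use this crate's types: the helper names (encode_base, pack_kmer, unpack_kmer), the bit assignment A=00, C=01, G=10, T=11, and the choice of placing the first base in the lowest bits are all assumptions made for the example.

```rust
/// Map a nucleotide to a 2-bit code.
/// The assignment A=00, C=01, G=10, T=11 is an assumption of this sketch.
fn encode_base(b: u8) -> Option<usize> {
    match b {
        b'A' => Some(0b00),
        b'C' => Some(0b01),
        b'G' => Some(0b10),
        b'T' => Some(0b11),
        _ => None,
    }
}

/// Pack a short sequence into one usize, first base in the lowest two bits.
/// Returns None for unknown characters or if the sequence does not fit.
fn pack_kmer(s: &str) -> Option<usize> {
    if s.len() > (usize::BITS / 2) as usize {
        return None;
    }
    let mut packed = 0usize;
    for (i, b) in s.bytes().enumerate() {
        packed |= encode_base(b)? << (2 * i);
    }
    Some(packed)
}

/// Unpack `k` bases from a packed word.
/// `k` must not exceed usize::BITS / 2.
fn unpack_kmer(packed: usize, k: usize) -> String {
    (0..k)
        .map(|i| match (packed >> (2 * i)) & 0b11 {
            0b00 => 'A',
            0b01 => 'C',
            0b10 => 'G',
            _ => 'T',
        })
        .collect()
}
```

Under these assumptions, pack_kmer("ACGT") yields Some(0b11_10_01_00), and unpacking that word with k = 4 gives back "ACGT". Because the packed value is a plain integer, it is Copy and can be compared or hashed in a single operation.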

Binary encodings of genomic data types are implemented as “codecs.” Custom codecs can be defined, and this crate has four built in:

  • codec::dna: 2-bit encoding of the four nucleotides
  • codec::text: 8-bit ASCII encoding of nucleotides, meant to be compatible with plaintext sequencing data formats
  • codec::iupac: 4-bit encoding of ambiguous nucleotide identities (the IUPAC ambiguity codes)
  • codec::amino: 6-bit encoding of amino acids
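
As a rough illustration of what a 4-bit ambiguity encoding makes possible, the sketch below treats each IUPAC code as a set of nucleotides with one bit per base, so that two codes are compatible exactly when their bitwise AND is non-zero. This is a standalone example, not this crate's API: the bit assignment (A=1000, C=0100, G=0010, T=0001) and the function names iupac_to_bits and compatible are assumptions.

```rust
// One bit per nucleotide. This particular assignment is an assumption
// of the sketch, not necessarily the layout used by the crate.
const A: u8 = 0b1000;
const C: u8 = 0b0100;
const G: u8 = 0b0010;
const T: u8 = 0b0001;

/// Encode an IUPAC nucleotide code as a 4-bit set of possible bases.
fn iupac_to_bits(c: char) -> Option<u8> {
    Some(match c {
        'A' => A,
        'C' => C,
        'G' => G,
        'T' => T,
        'R' => A | G,
        'Y' => C | T,
        'S' => C | G,
        'W' => A | T,
        'K' => G | T,
        'M' => A | C,
        'B' => C | G | T,
        'D' => A | G | T,
        'H' => A | C | T,
        'V' => A | C | G,
        'N' => A | C | G | T,
        _ => return None,
    })
}

/// Two codes are compatible if they share at least one possible base.
/// Returns None if either character is not a recognised code.
fn compatible(x: char, y: char) -> Option<bool> {
    Some((iupac_to_bits(x)? & iupac_to_bits(y)?) != 0)
}
```

With this layout, R (A or G) is compatible with A but not with Y (C or T), and N is compatible with every code.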

Each of these encodings is designed to facilitate common bioinformatics tasks, such as minimising k-mers and implementing succinct data structures. The translation module provides traits and methods for translating between nucleotide and amino acid sequences.
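
One way to see how a compact integer encoding helps with minimising k-mers: if each k-mer is packed so that integer order matches lexicographic order, finding the minimiser of a window reduces to an integer comparison. The following is a hedged, standalone sketch of that idea, not the crate's implementation; pack_ordered and minimiser are hypothetical names, and the 2-bit assignment A=0, C=1, G=2, T=3 is assumed.

```rust
/// Pack a k-mer (k <= 32) into a u64 with the first base in the highest
/// occupied bits, so that integer order equals lexicographic order.
/// The code assignment A=0, C=1, G=2, T=3 is an assumption of this sketch.
fn pack_ordered(kmer: &[u8]) -> Option<u64> {
    let mut v = 0u64;
    for &b in kmer {
        let code = match b {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None,
        };
        v = (v << 2) | code;
    }
    Some(v)
}

/// Return the start offset of the smallest k-mer in `seq`, comparing
/// packed integer values (leftmost offset wins on ties).
/// Returns None for invalid `k` or unknown characters.
fn minimiser(seq: &str, k: usize) -> Option<usize> {
    if k == 0 || k > 32 || k > seq.len() {
        return None;
    }
    let mut best: Option<(u64, usize)> = None;
    for (i, window) in seq.as_bytes().windows(k).enumerate() {
        let v = pack_ordered(window)?;
        if best.map_or(true, |(bv, _)| v < bv) {
            best = Some((v, i));
        }
    }
    best.map(|(_, i)| i)
}
```

For "GATTACA" with k = 3, the 3-mers are GAT, ATT, TTA, TAC and ACA, and the smallest is ACA at offset 4. This sketch re-packs every window for clarity; a rolling update that shifts in one base at a time would avoid the repeated work.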

Add bio-seq to Cargo.toml:

[dependencies]
bio-seq = "0.12"

use bio_seq::prelude::*;

let seq = dna!("ATACGATCGATCGATCGATCCGT");

// iterate over the 8-mers of the reverse complement
for kmer in seq.revcomp().kmers::<8>() {
    println!("{kmer}");
}

// ACGGATCG
// CGGATCGA
// GGATCGAT
// GATCGATC
// ATCGATCG
// ...

Modules

  • codec: Coding/decoding trait for bit-packable enums representing biological alphabets
  • kmer: Short sequences of fixed length
  • seq: Arbitrary-length sequences of bit-packed genomic data, stored on the heap
  • translation: Genetic code translation

Macros