Crate bio_seq


Bit-packed and well-typed biological sequences

  • seq: A Seq is a heap-allocated sequence of variable length that owns its own data. A SeqSlice is a read-only window into a Seq.
  • kmer: Kmers are short, fixed-length sequences. They generally implement Copy and can be passed efficiently on the stack.
  • codec: Encodings of genomic data types to be packed into sequences.
  • translation: Amino acid translation tables
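
To illustrate why a short, fixed-length sequence can be a cheap Copy value, here is a minimal standard-library sketch that packs DNA into a single u64 at two bits per base. The function names pack_dna and unpack_dna and the A=00, C=01, G=10, T=11 mapping are assumptions made for this illustration; they are not the crate's API and may not match its internal layout.

```rust
/// Pack up to 32 nucleotides into a u64, two bits per base.
/// ASSUMPTION: the A=00, C=01, G=10, T=11 mapping is chosen for this sketch
/// and is not necessarily the layout bio-seq uses internally.
fn pack_dna(seq: &str) -> Option<u64> {
    if seq.len() > 32 {
        return None; // does not fit in 64 bits
    }
    let mut packed = 0u64;
    for (i, base) in seq.bytes().enumerate() {
        let bits: u64 = match base {
            b'A' => 0b00,
            b'C' => 0b01,
            b'G' => 0b10,
            b'T' => 0b11,
            _ => return None, // not an unambiguous DNA base
        };
        // Base i occupies bits 2i and 2i + 1.
        packed |= bits << (2 * i);
    }
    Some(packed)
}

/// Decode the first `len` bases (at most 32) of a packed word.
fn unpack_dna(packed: u64, len: usize) -> String {
    (0..len.min(32))
        .map(|i| match (packed >> (2 * i)) & 0b11 {
            0b00 => 'A',
            0b01 => 'C',
            0b10 => 'G',
            _ => 'T',
        })
        .collect()
}
```

Under these assumptions an 8-mer occupies 16 bits of one machine word, so copying it costs no more than copying an integer, and unpack_dna(pack_dna("GATTACA").unwrap(), 7) round-trips to the original string.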

This crate is designed to facilitate common bioinformatics tasks, including amino acid translation, k-mer minimisation and hashing, and nucleotide sequence manipulation.

Add bio-seq to Cargo.toml:

[dependencies]
bio-seq = "0.12"

use bio_seq::prelude::*;

let seq = dna!("ATACGATCGATCGATCGATCCGT");

// iterate over the 8-mers of the reverse complement
for kmer in seq.revcomp().kmers::<8>() {
    println!("{kmer}");
}

// ACGGATCG
// CGGATCGA
// GGATCGAT
// GATCGATC
// ATCGATCG
// ...

The 4-bit encoding of IUPAC nucleotide ambiguity codes naturally represents a set of bases at each position (0001: A, 1111: N, 0000: *, …):

use bio_seq::prelude::*;

let seq = iupac!("AGCTNNCAGTCGACGTATGTA");
let pattern = iupac!("AYG");

for slice in seq.windows(pattern.len()) {
    if pattern.contains(slice) {
        println!("{slice} matches pattern");
    }
}

// ACG matches pattern
// ATG matches pattern
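
The set interpretation can be sketched without the crate, using one bit per base. This is an illustration only: the names iupac_bits, pattern_contains and find_matches are hypothetical, the bit assignment (A=0001, C=0010, G=0100, T=1000) is an assumption chosen to agree with 0001: A and 1111: N, the empty set is written '-' as in the crate's other examples, and "contains" is assumed to mean that each window position is a subset of the corresponding pattern position.

```rust
/// Map an IUPAC nucleotide code to a 4-bit set of bases.
/// ASSUMPTION: one bit per base (A=0001, C=0010, G=0100, T=1000).
fn iupac_bits(code: u8) -> Option<u8> {
    const A: u8 = 0b0001;
    const C: u8 = 0b0010;
    const G: u8 = 0b0100;
    const T: u8 = 0b1000;
    Some(match code {
        b'A' => A,
        b'C' => C,
        b'G' => G,
        b'T' => T,
        b'R' => A | G,
        b'Y' => C | T,
        b'S' => C | G,
        b'W' => A | T,
        b'K' => G | T,
        b'M' => A | C,
        b'B' => C | G | T,
        b'D' => A | G | T,
        b'H' => A | C | T,
        b'V' => A | C | G,
        b'N' => A | C | G | T,
        b'-' => 0, // empty set
        _ => return None,
    })
}

/// True if both strings have equal length and, at every position,
/// the window's base set is a subset of the pattern's base set.
fn pattern_contains(pattern: &str, window: &str) -> bool {
    pattern.len() == window.len()
        && pattern.bytes().zip(window.bytes()).all(|(p, w)| {
            match (iupac_bits(p), iupac_bits(w)) {
                (Some(p), Some(w)) => w & !p == 0,
                _ => false,
            }
        })
}

/// Collect every window of `seq` that the pattern contains.
fn find_matches<'a>(seq: &'a str, pattern: &str) -> Vec<&'a str> {
    let n = pattern.len();
    // Byte-indexed slicing below is only safe for ASCII input.
    if n == 0 || n > seq.len() || !seq.is_ascii() {
        return Vec::new();
    }
    (0..=seq.len() - n)
        .map(|i| &seq[i..i + n])
        .filter(|w| pattern_contains(pattern, w))
        .collect()
}
```

With these definitions, find_matches("AGCTNNCAGTCGACGTATGTA", "AYG") yields the same two windows as the example above, ACG and ATG. Windows containing N are rejected because N is not a subset of any narrower code.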

The | operator (logical or) is the union of two IUPAC sequences:

assert_eq!(iupac!("AS-GYTNA") | iupac!("ANTGCAT-"), iupac!("ANTGYWNA"));

The & operator (logical and) is the intersection of two IUPAC sequences:

assert_eq!(iupac!("ACGTSWKM") & iupac!("WKMSTNNA"), iupac!("A----WKA"));
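
Both identities can be checked by hand with a 16-entry lookup table, where the index of each code is its bit set. The sketch below uses only the standard library; the names CODES, combine, union and intersection are hypothetical, and the one-bit-per-base ordering (A=1, C=2, G=4, T=8) is an assumption rather than the crate's documented layout.

```rust
/// IUPAC codes indexed by their 4-bit base set.
/// ASSUMPTION: A=1, C=2, G=4, T=8, so index 0 is the empty set '-'
/// and index 15 is 'N'.
const CODES: &[u8; 16] = b"-ACMGRSVTWYHKDBN";

/// Apply a bitwise operation position by position.
/// Returns None on a length mismatch or an unknown code.
fn combine(a: &str, b: &str, op: fn(u8, u8) -> u8) -> Option<String> {
    if a.len() != b.len() {
        return None;
    }
    a.bytes()
        .zip(b.bytes())
        .map(|(x, y)| {
            // The position of a code in the table is its bit set.
            let x = CODES.iter().position(|&c| c == x)? as u8;
            let y = CODES.iter().position(|&c| c == y)? as u8;
            Some(CODES[op(x, y) as usize & 0xF] as char)
        })
        .collect()
}

/// Union of the base sets at each position.
fn union(a: &str, b: &str) -> Option<String> {
    combine(a, b, |x, y| x | y)
}

/// Intersection of the base sets at each position.
fn intersection(a: &str, b: &str) -> Option<String> {
    combine(a, b, |x, y| x & y)
}
```

For example, S (C or G) united with N gives N, and C intersected with K (G or T) gives the empty set, which matches the second and second-position results in the two assertions above.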

Modules

  • codec: Coding/Decoding trait for bit-packable enums representing biological alphabets
  • kmer: Kmers
  • seq: Arbitrary length sequences of bit-packed genomic data, stored on the heap.
  • translation: Genetic Code Translation

Macros