Bit-packed and well-typed biological sequences
The strength of Rust is that we can safely separate the science (well-typed) from the engineering (bit-packed) of bioinformatics. An incremental benchmark improvement in the reverse complement algorithm should benefit the user of a succinct data structure without anyone unwillingly learning about endianness.
Contributions are very welcome. There’s lots of low-hanging fruit for optimisation, and ideally each optimisation should only have to be written once!
§Sequences
A `Seq` is a heap-allocated sequence of symbols that owns its data. A `SeqSlice` is a read-only window into a `Seq`. Static `SeqArray`s can be declared with the `dna!` and `iupac!` macros, but these should be dereferenced as `&'static SeqSlice`s.
`Kmer`s are shorter, fixed-length sequences. They generally fit in a single register and implement `Copy`. They are used for optimised algorithms on sequences and succinct data structures. The default implementation uses a `usize` for storage. With the 2-bit `Dna` encoding, a `Kmer<Dna, 32>` occupies 64 bits.
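To make the arithmetic concrete, here is a minimal std-only sketch of how 32 bases fit in one 64-bit word at 2 bits each. It is not the crate's implementation: the helper names `encode_base`, `pack_kmer` and `unpack_kmer`, the code assignment (A=00, C=01, G=10, T=11) and the choice to store the first base in the lowest bits are all assumptions made for illustration.

```rust
/// Map a nucleotide to a 2-bit code.
/// Assumption for this sketch: A=00, C=01, G=10, T=11.
fn encode_base(b: u8) -> Option<u64> {
    match b {
        b'A' => Some(0b00),
        b'C' => Some(0b01),
        b'G' => Some(0b10),
        b'T' => Some(0b11),
        _ => None,
    }
}

/// Pack up to 32 bases into one u64, first base in the lowest two bits.
/// Returns None for unknown symbols or inputs longer than 32 bases.
fn pack_kmer(seq: &[u8]) -> Option<u64> {
    if seq.len() > 32 {
        return None; // 32 bases * 2 bits = 64 bits, a full word
    }
    let mut word = 0u64;
    for (i, &b) in seq.iter().enumerate() {
        word |= encode_base(b)? << (2 * i);
    }
    Some(word)
}

/// Decode the first `k` bases of a packed word back into text.
fn unpack_kmer(word: u64, k: usize) -> String {
    assert!(k <= 32, "at most 32 bases fit in a u64");
    (0..k)
        .map(|i| match (word >> (2 * i)) & 0b11 {
            0b00 => 'A',
            0b01 => 'C',
            0b10 => 'G',
            _ => 'T',
        })
        .collect()
}
```

Under these assumptions `pack_kmer(b"ACGT")` yields `Some(0b11_10_01_00)`, and a 33-base input returns `None` because it no longer fits in a single word.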
These sequence types are parameterised with `Codec`s (e.g. `Seq<Dna>`, `Seq<Amino>`, etc.) that define how symbols are encoded into strings of bits and decoded as readable strings.
§Quick start
Add `bio-seq` to `Cargo.toml`:
[dependencies]
bio-seq = "0.13"
use bio_seq::prelude::*;
let seq = dna!("ATACGATCGATCGATCGATCCGT");
// iterate over the 8-mers of the reverse complement
for kmer in seq.to_revcomp().kmers::<8>() {
    println!("{kmer}");
}
// ACGGATCG
// CGGATCGA
// GGATCGAT
// GATCGATC
// ATCGATCG
// ...
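Reverse complementing is the kind of operation that benefits from bit-packing, because a whole k-mer can be processed with a handful of word-level instructions. The following std-only sketch shows one well-known approach; it is not taken from the crate's source. The helpers `pack`, `unpack` and `revcomp_packed` are hypothetical, and the sketch assumes the codes A=00, C=01, G=10, T=11 with the first base stored in the lowest bits, so that complementing a base is the same as flipping both of its bits.

```rust
/// Pack up to 32 bases into a u64, first base in the lowest two bits.
/// Assumed codes for this sketch: A=00, C=01, G=10, T=11.
fn pack(seq: &str) -> u64 {
    assert!(seq.len() <= 32, "at most 32 bases fit in a u64");
    seq.bytes().enumerate().fold(0u64, |word, (i, b)| {
        let code = match b {
            b'A' => 0u64,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => panic!("not a nucleotide: {}", b as char),
        };
        word | (code << (2 * i))
    })
}

/// Decode the first `k` bases of a packed word.
fn unpack(word: u64, k: usize) -> String {
    assert!(k <= 32, "at most 32 bases fit in a u64");
    (0..k)
        .map(|i| ['A', 'C', 'G', 'T'][((word >> (2 * i)) & 0b11) as usize])
        .collect()
}

/// Reverse complement a packed k-mer without looping over its bases.
fn revcomp_packed(word: u64, k: usize) -> u64 {
    assert!((1..=32).contains(&k));
    // Complement: with the assumed codes, A<->T and C<->G is a bitwise NOT.
    let mut w = !word;
    // Reverse the order of the 32 two-bit groups in the word:
    // swap neighbouring 2-bit groups, then nibbles, then whole bytes.
    w = ((w >> 2) & 0x3333_3333_3333_3333) | ((w & 0x3333_3333_3333_3333) << 2);
    w = ((w >> 4) & 0x0F0F_0F0F_0F0F_0F0F) | ((w & 0x0F0F_0F0F_0F0F_0F0F) << 4);
    w = w.swap_bytes();
    // The k bases now occupy the top 2k bits; shift them back down,
    // which also discards the complemented padding.
    w >> (64 - 2 * k)
}
```

For instance, the last 8-mer of the sequence above is `CGATCCGT`, and `unpack(revcomp_packed(pack("CGATCCGT"), 8), 8)` gives `ACGGATCG`, the first k-mer printed by the loop.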
Sequences are analogous to rust’s string types and follow similar dereferencing conventions:
// The `dna!` macro packs a static sequence with 2 bits per symbol at compile time:
let s: &'static str = "hello!";
let seq: &'static SeqSlice<Dna> = dna!("CGCTAGCTACGATCGCAT");
// Sequences can also be copied into `Kmer`s:
let kmer: Kmer<Dna, 18> = dna!("CGCTAGCTACGATCGCAT").try_into().unwrap();
// or with the kmer! macro:
let kmer = kmer!("CGCTAGCTACGATCGCAT");
// `Seq`s can be allocated on the heap like `String`s are:
let s: String = "hello!".into();
let seq: Seq<Dna> = dna!("CGCTAGCTACGATCGCAT").into();
// Alternatively, a `Seq` can be fallibly encoded at runtime:
let seq: Seq<Dna> = "CGCTAGCTACGATCGCAT".try_into().unwrap();
// `&SeqSlice` is analogous to `&str`:
let slice: &str = &s[1..3];
let seqslice: &SeqSlice<Dna> = &seq[2..4];
§Bit-packed encodings
Encodings of genomic symbols are implemented as `Codec`s. This crate provides four common ones:
- `codec::dna`: 2-bit encoding of the four nucleotides
- `codec::text`: 8-bit ASCII encoding of nucleotides, meant to be compatible with plaintext sequencing data formats
- `codec::iupac`: 4-bit encoding of ambiguous nucleotide identities (the IUPAC ambiguity codes)
- `codec::amino`: 6-bit encoding of amino acids
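As an illustration of why a 4-bit width suits the IUPAC ambiguity codes, the sketch below gives each of the four nucleotides its own bit, so that an ambiguity code is the union of the bases it may stand for and compatibility becomes a bitwise AND. This is a std-only sketch, not the crate's `codec::iupac`: the specific bit assignment (A=1000, C=0100, G=0010, T=0001) and the helper names `iupac_bits`, `compatible` and `pack_iupac` are assumptions.

```rust
// One bit per nucleotide (assumed layout for this sketch).
const A: u8 = 0b1000;
const C: u8 = 0b0100;
const G: u8 = 0b0010;
const T: u8 = 0b0001;

/// Encode an IUPAC nucleotide symbol as the set of bases it can represent.
fn iupac_bits(symbol: u8) -> Option<u8> {
    Some(match symbol {
        b'A' => A,
        b'C' => C,
        b'G' => G,
        b'T' => T,
        b'R' => A | G, // purine
        b'Y' => C | T, // pyrimidine
        b'S' => C | G,
        b'W' => A | T,
        b'K' => G | T,
        b'M' => A | C,
        b'B' => C | G | T, // not A
        b'D' => A | G | T, // not C
        b'H' => A | C | T, // not G
        b'V' => A | C | G, // not T
        b'N' => A | C | G | T,
        _ => return None,
    })
}

/// Two symbols are compatible if they share at least one possible base.
fn compatible(x: u8, y: u8) -> Option<bool> {
    Some((iupac_bits(x)? & iupac_bits(y)?) != 0)
}

/// Pack two 4-bit symbols per byte, first symbol in the low nibble.
fn pack_iupac(seq: &[u8]) -> Option<Vec<u8>> {
    let mut out = vec![0u8; (seq.len() + 1) / 2];
    for (i, &s) in seq.iter().enumerate() {
        out[i / 2] |= iupac_bits(s)? << (4 * (i % 2));
    }
    Some(out)
}
```

With this layout `R` (A or G) is compatible with `A` but not with `Y` (C or T), and the three symbols `RYN` pack into the two bytes `0x5A, 0x0F`.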
Each of these encodings is designed to facilitate common bioinformatics tasks, such as minimising k-mers and implementing succinct data structures. The translation module provides traits and methods for translating between nucleotide and amino acid sequences.
Custom codecs can also be implemented with the `Codec` trait and derived on specially crafted enums.
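The idea of parameterising storage by an encoding can be sketched without the crate. The example below defines a hypothetical `ToyCodec` trait, which is deliberately not the crate's `Codec` trait and makes no claim about its methods, plus two toy alphabets: a 2-bit nucleotide codec `Dna2` and a 1-bit purine/pyrimidine codec `PurPyr`. Marker structs are used here for brevity instead of enums. The generic `pack` and `unpack` functions are written once and work for any alphabet that states its bit width.

```rust
/// Hypothetical stand-in for a codec trait (not the crate's `Codec`).
trait ToyCodec {
    /// Number of bits used per symbol.
    const BITS: u32;
    fn encode(c: char) -> Option<u8>;
    fn decode(bits: u8) -> Option<char>;
}

/// Toy 2-bit nucleotide alphabet: A=0, C=1, G=2, T=3 (assumed order).
struct Dna2;
impl ToyCodec for Dna2 {
    const BITS: u32 = 2;
    fn encode(c: char) -> Option<u8> {
        "ACGT".find(c).map(|i| i as u8)
    }
    fn decode(bits: u8) -> Option<char> {
        "ACGT".chars().nth(bits as usize)
    }
}

/// Toy 1-bit alphabet: R (purine) = 0, Y (pyrimidine) = 1.
struct PurPyr;
impl ToyCodec for PurPyr {
    const BITS: u32 = 1;
    fn encode(c: char) -> Option<u8> {
        "RY".find(c).map(|i| i as u8)
    }
    fn decode(bits: u8) -> Option<char> {
        "RY".chars().nth(bits as usize)
    }
}

/// Pack text into a single u64 using whichever codec is requested.
/// Returns None on unknown symbols or if the text does not fit.
fn pack<C: ToyCodec>(text: &str) -> Option<u64> {
    let mut word = 0u64;
    for (i, c) in text.chars().enumerate() {
        let shift = C::BITS * i as u32;
        if shift + C::BITS > 64 {
            return None;
        }
        word |= (C::encode(c)? as u64) << shift;
    }
    Some(word)
}

/// Decode `len` symbols from a packed word.
fn unpack<C: ToyCodec>(word: u64, len: usize) -> Option<String> {
    if (len as u32).saturating_mul(C::BITS) > 64 {
        return None;
    }
    let mask = (1u64 << C::BITS) - 1;
    (0..len)
        .map(|i| C::decode(((word >> (C::BITS * i as u32)) & mask) as u8))
        .collect()
}
```

In this sketch `pack::<Dna2>("ACGT")` and `pack::<PurPyr>("RYYR")` share the same packing code, and only the bit width and symbol table differ: a 64-bit word holds 32 `Dna2` symbols or 64 `PurPyr` symbols.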
Modules§
- codec
- Coding/Decoding trait for bit-packable enums representing sets of genomic symbols
- error
- kmer
- Encoded sequences of static length
- prelude
- seq
- Arbitrary length sequences of bit-packed genomic data
- translation
- Amino acid translation tables
Macros§
- __bio_seq_count_words
- dna
- Static DNA sequences encoded at compile time
- iupac
- Static degenerate nucleotide codes encoded at compile time
- kmer
- Convenient compile time kmer constructor
Traits§
- Complement
- ComplementMut
- Nucleotide bases and sequences can be complemented
- Maskable
- MaskableMut
- Some sequence types may be maskable
- Reverse
- ReverseComplement
- ReverseComplementMut
- A reversible sequence that can be complemented can be reverse complemented
- ReverseMut
- A reversible sequence