Bit-packed and well-typed biological sequences
A Seq is a heap-allocated sequence of variable length that owns its data. A SeqSlice is a read-only window into a Seq.
Kmers are short, fixed-length sequences. They generally implement Copy and are used for optimised algorithms on sequences. The default implementation uses a usize for storage.
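To illustrate why a usize-backed k-mer can be Copy, here is a minimal standalone sketch that uses only the standard library. The PackedKmer type, its method names, and the bit layout (A=00, C=01, G=10, T=11, first base in the least significant bits) are assumptions made for this example and are not the crate's own Kmer type or API.

```rust
/// Hypothetical k-mer packed into one machine word, 2 bits per base.
/// Because the only field is a usize, the type can derive Copy.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct PackedKmer<const K: usize>(usize);

impl<const K: usize> PackedKmer<K> {
    /// Pack exactly K ASCII bases; returns None on a length mismatch,
    /// a non-ACGT byte, or when 2*K bits do not fit in a usize.
    fn from_ascii(bases: &str) -> Option<Self> {
        if bases.len() != K || 2 * K > usize::BITS as usize {
            return None;
        }
        let mut word = 0usize;
        for (i, b) in bases.bytes().enumerate() {
            // Assumed layout: A=00, C=01, G=10, T=11.
            let code: usize = match b {
                b'A' => 0b00,
                b'C' => 0b01,
                b'G' => 0b10,
                b'T' => 0b11,
                _ => return None,
            };
            // The first base occupies the least significant two bits.
            word |= code << (2 * i);
        }
        Some(PackedKmer(word))
    }

    /// Unpack the word back into an ASCII string of length K.
    fn to_ascii(self) -> String {
        (0..K)
            .map(|i| match (self.0 >> (2 * i)) & 0b11 {
                0b00 => 'A',
                0b01 => 'C',
                0b10 => 'G',
                _ => 'T',
            })
            .collect()
    }
}
```

Under these assumptions, PackedKmer::<4>::from_ascii("ACGT") stores the integer 228 (0b11_10_01_00), and comparing or hashing two k-mers reduces to a single integer operation.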
Binary encodings of genomic data types are implemented as “codecs.” Custom codecs can be defined, and this crate has four built in:
- codec::dna: 2-bit encoding of the four nucleotides
- codec::text: 8-bit ASCII encoding of nucleotides, meant to be compatible with plaintext sequencing data formats
- codec::iupac: 4-bit encoding of ambiguous nucleotide identities (the IUPAC ambiguity codes)
- codec::amino: 6-bit encoding of amino acids
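As a rough illustration of what a 4-bit ambiguity encoding makes possible, the sketch below represents each IUPAC symbol as a set of the four bases, one bit per base, so that two symbols are compatible exactly when their bit patterns intersect. The function names and the specific bit assignment (A=1000, C=0100, G=0010, T=0001) are assumptions for this example; the crate's codec::iupac may lay out its bits differently.

```rust
/// Map an IUPAC nucleotide symbol to a 4-bit set of possible bases.
/// The bit assignment here is an assumption made for this sketch.
fn iupac_bits(symbol: u8) -> Option<u8> {
    const A: u8 = 0b1000;
    const C: u8 = 0b0100;
    const G: u8 = 0b0010;
    const T: u8 = 0b0001;
    Some(match symbol {
        b'A' => A,
        b'C' => C,
        b'G' => G,
        b'T' => T,
        b'R' => A | G,
        b'Y' => C | T,
        b'S' => C | G,
        b'W' => A | T,
        b'K' => G | T,
        b'M' => A | C,
        b'B' => C | G | T,
        b'D' => A | G | T,
        b'H' => A | C | T,
        b'V' => A | C | G,
        b'N' => A | C | G | T,
        _ => return None,
    })
}

/// Two symbols are compatible if they share at least one possible base.
/// Returns None if either byte is not a recognised IUPAC symbol.
fn compatible(x: u8, y: u8) -> Option<bool> {
    Some((iupac_bits(x)? & iupac_bits(y)?) != 0)
}
```

With this layout, compatible(b'R', b'A') is Some(true) because R stands for A or G, while compatible(b'R', b'C') is Some(false).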
Each of these encodings is designed to facilitate common bioinformatics tasks, such as minimising k-mers and implementing succinct data structures. The translation module provides traits and methods for translating between nucleotide and amino acid sequences.
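For readers unfamiliar with translation, the following standalone sketch shows the underlying idea: each codon of three bases indexes into a 64-entry table of amino acids. It uses the standard genetic code in TCAG order and plain ASCII strings. The names STANDARD_CODE, base_index, and translate are hypothetical and do not reflect the traits or methods of the crate's translation module.

```rust
/// Standard genetic code, indexed by 16*b1 + 4*b2 + b3 with T=0, C=1, A=2, G=3.
/// '*' marks a stop codon.
const STANDARD_CODE: &[u8; 64] =
    b"FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

/// Index of a base in TCAG order; None for anything other than A, C, G, T.
fn base_index(b: u8) -> Option<usize> {
    match b {
        b'T' => Some(0),
        b'C' => Some(1),
        b'A' => Some(2),
        b'G' => Some(3),
        _ => None,
    }
}

/// Translate an ASCII DNA string codon by codon.
/// Trailing bases that do not form a full codon are ignored;
/// any non-ACGT byte inside a codon makes the result None.
fn translate(dna: &str) -> Option<String> {
    dna.as_bytes()
        .chunks_exact(3)
        .map(|codon| {
            let idx = 16 * base_index(codon[0])?
                + 4 * base_index(codon[1])?
                + base_index(codon[2])?;
            Some(STANDARD_CODE[idx] as char)
        })
        .collect()
}
```

For example, translate("ATGGCCTAA") yields Some("MA*"): methionine, alanine, then a stop codon.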
Add bio-seq to Cargo.toml:
[dependencies]
bio-seq = "0.12"
use bio_seq::prelude::*;
let seq = dna!("ATACGATCGATCGATCGATCCGT");
// iterate over the 8-mers of the reverse complement
for kmer in seq.revcomp().kmers::<8>() {
    println!("{kmer}");
}
// ACGGATCG
// CGGATCGA
// GGATCGAT
// GATCGATC
// ATCGATCG
// ...
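To make the computation in the example above concrete, here is a standard-library-only sketch of the same two steps on plain ASCII strings: reverse-complement the sequence, then slide a window of length k across it. The free functions revcomp and kmers below are hypothetical stand-ins written for this illustration. They are not the crate's methods and do no bit-packing.

```rust
/// Reverse complement of an ASCII DNA string; None if a byte is not A, C, G, or T.
fn revcomp(dna: &str) -> Option<String> {
    dna.bytes()
        .rev()
        .map(|b| match b {
            b'A' => Some('T'),
            b'C' => Some('G'),
            b'G' => Some('C'),
            b'T' => Some('A'),
            _ => None,
        })
        .collect()
}

/// All overlapping substrings of length k, in order of position.
/// Assumes ASCII input, so byte offsets coincide with base positions.
fn kmers(seq: &str, k: usize) -> Vec<&str> {
    if k == 0 || k > seq.len() {
        return Vec::new();
    }
    (0..=seq.len() - k).map(|i| &seq[i..i + k]).collect()
}
```

Calling revcomp("ATACGATCGATCGATCGATCCGT") gives "ACGGATCGATCGATCGATCGTAT", and kmers(&rc, 8) on that result starts with ACGGATCG, CGGATCGA, GGATCGAT, matching the output listed in the comments above.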
Modules
- codec: Coding/Decoding trait for bit-packable enums representing biological alphabets
- kmer: Short sequences of fixed length.
- seq: Arbitrary length sequences of bit-packed genomic data, stored on the heap.
- translation: Genetic Code Translation