Module bio_seq::codec

Coding/Decoding trait for bit-packable enums representing sets of genomic symbols

The dna, iupac, text, and amino alphabets are built in.

The Codec trait translates between the text (ASCII/UTF-8) representation of an alphabet's symbols and their compact bit-packed encoding. The BITS associated constant gives the number of bits used to encode each symbol.

use bio_seq::prelude::{Dna, Codec};
use bio_seq::codec::text;
assert_eq!(Dna::BITS, 2);
assert_eq!(text::Dna::BITS, 8);
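
To make the idea of bit-packing concrete, here is a minimal standalone sketch that packs 2-bit bases into a single `u64`. It does not use bio_seq: `encode_base`, `pack_dna`, and `base_at` are hypothetical helpers, and storing the first base in the least significant bits is an assumption of this sketch, not a statement about the crate's internal layout.

```rust
/// Hypothetical helper (not part of bio_seq): map an ASCII base to a 2-bit
/// code using the layout A = 00, C = 01, G = 10, T = 11.
fn encode_base(chr: u8) -> Option<u64> {
    match chr {
        b'A' => Some(0b00),
        b'C' => Some(0b01),
        b'G' => Some(0b10),
        b'T' => Some(0b11),
        _ => None,
    }
}

/// Pack up to 32 bases into one u64, first base in the lowest two bits
/// (an assumed ordering). Returns None if the input is longer than 32
/// bases or contains an unrecognised character.
fn pack_dna(seq: &[u8]) -> Option<u64> {
    if seq.len() > 32 {
        return None;
    }
    let mut word = 0u64;
    for (i, &chr) in seq.iter().enumerate() {
        // Each base occupies two bits, so base i starts at bit 2 * i.
        word |= encode_base(chr)? << (2 * i);
    }
    Some(word)
}

/// Read back the 2-bit code stored at position `index` (index < 32).
fn base_at(word: u64, index: usize) -> u8 {
    ((word >> (2 * index)) & 0b11) as u8
}
```

With this layout, four bases fit in a single byte, whereas the 8-bit text encoding needs one byte per base.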

§Deriving custom Codecs

Custom encodings can be easily defined on enums using the derivable Codec trait.

use bio_seq::prelude;
use bio_seq::prelude::Codec;

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash, Codec)]
pub enum Dna {
    A = 0b00,
    C = 0b01,
    G = 0b10,
    T = 0b11,
}

§Implementing custom Codecs

Custom encodings can be defined on enums by implementing the Codec trait.

use bio_seq::prelude;
use bio_seq::prelude::Codec;

#[derive(Copy, Clone, Eq, PartialEq, Hash, Debug)]
pub enum Dna {
    A = 0b00,
    C = 0b01,
    G = 0b10,
    T = 0b11,
}

impl From<Dna> for u8 {
    fn from(base: Dna) -> u8 {
        match base {
            Dna::A => 0b00,
            Dna::C => 0b01,
            Dna::G => 0b10,
            Dna::T => 0b11,
        }
    }
}

impl Codec for Dna {
    const BITS: u8 = 2;

    fn unsafe_from_bits(bits: u8) -> Self {
        if let Some(base) = Self::try_from_bits(bits) {
            base
        } else {
            panic!("Unrecognised bit pattern!")
        }
    }

    fn try_from_bits(bits: u8) -> Option<Self> {
        match bits {
            0b00 => Some(Dna::A),
            0b01 => Some(Dna::C),
            0b10 => Some(Dna::G),
            0b11 => Some(Dna::T),
            _ => None,
        }
    }

    fn unsafe_from_ascii(chr: u8) -> Self {
        if let Some(base) = Self::try_from_ascii(chr) {
            base
        } else {
            panic!("Unrecognised character!")
        }
    }

    fn try_from_ascii(chr: u8) -> Option<Self> {
        match chr {
            b'A' => Some(Dna::A),
            b'C' => Some(Dna::C),
            b'G' => Some(Dna::G),
            b'T' => Some(Dna::T),
            _ => None,
        }
    }

    fn to_char(self) -> char {
        match self {
            Dna::A => 'A',
            Dna::C => 'C',
            Dna::G => 'G',
            Dna::T => 'T',
        }
    }

    fn to_bits(self) -> u8 {
        self as u8
    }

    fn items() -> impl Iterator<Item = Self> {
        [Dna::A, Dna::C, Dna::G, Dna::T].into_iter()
    }
}
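
The following sketch shows how generic code might consume such an implementation by round-tripping ASCII text through the bit encoding. So that it compiles without the crate, it uses `MiniCodec`, a hypothetical stand-in that copies only one constant and four method signatures from the implementation above; `round_trip` is likewise an illustrative helper, not a bio_seq function.

```rust
/// Hypothetical stand-in for the trait above, reduced to one constant and
/// four methods so that this sketch is self-contained.
trait MiniCodec: Copy + Sized {
    const BITS: u8;
    fn try_from_ascii(chr: u8) -> Option<Self>;
    fn try_from_bits(bits: u8) -> Option<Self>;
    fn to_bits(self) -> u8;
    fn to_char(self) -> char;
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Dna {
    A = 0b00,
    C = 0b01,
    G = 0b10,
    T = 0b11,
}

impl MiniCodec for Dna {
    const BITS: u8 = 2;

    fn try_from_ascii(chr: u8) -> Option<Self> {
        match chr {
            b'A' => Some(Dna::A),
            b'C' => Some(Dna::C),
            b'G' => Some(Dna::G),
            b'T' => Some(Dna::T),
            _ => None,
        }
    }

    fn try_from_bits(bits: u8) -> Option<Self> {
        match bits {
            0b00 => Some(Dna::A),
            0b01 => Some(Dna::C),
            0b10 => Some(Dna::G),
            0b11 => Some(Dna::T),
            _ => None,
        }
    }

    fn to_bits(self) -> u8 {
        // The enum discriminants are the bit patterns.
        self as u8
    }

    fn to_char(self) -> char {
        match self {
            Dna::A => 'A',
            Dna::C => 'C',
            Dna::G => 'G',
            Dna::T => 'T',
        }
    }
}

/// ASCII -> symbol -> bits -> symbol -> char, for any implementor.
/// Returns None as soon as one character is not in the alphabet.
fn round_trip<C: MiniCodec>(text: &[u8]) -> Option<String> {
    text.iter()
        .map(|&chr| {
            let bits = C::try_from_ascii(chr)?.to_bits();
            C::try_from_bits(bits).map(C::to_char)
        })
        .collect()
}
```

Because the fallible `try_` methods return `Option`, invalid input surfaces as `None` instead of a panic, which is why the sketch avoids the `unsafe_` variants.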

Modules§

  • amino: 6-bit representation of amino acids
  • dna: 2-bit DNA representation: A: 00, C: 01, G: 10, T: 11
  • iupac: 4-bit IUPAC nucleotide ambiguity codes
  • text: 8-bit ASCII representation of nucleotides
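
The bit widths listed above determine how densely each alphabet can be stored. The arithmetic can be sketched as follows, assuming storage in whole 64-bit words with no symbol straddling a word boundary; both the word size and the helper names `symbols_per_word` and `packed_bytes` are assumptions of this sketch, not details taken from the crate.

```rust
/// Number of whole symbols of width `bits` that fit in one 64-bit word.
/// Assumes symbols never straddle a word boundary.
const fn symbols_per_word(bits: u32) -> u32 {
    u64::BITS / bits
}

/// Bytes needed to store `len` symbols of width `bits` in whole 64-bit words.
const fn packed_bytes(len: usize, bits: u32) -> usize {
    let per_word = symbols_per_word(bits) as usize;
    // Round up to a whole number of words, 8 bytes each.
    ((len + per_word - 1) / per_word) * 8
}
```

Under these assumptions a word holds 32 symbols at 2 bits, 16 at 4 bits, 10 at 6 bits (with 4 bits unused), and 8 at 8 bits.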

Traits§

  • The bit encodings of an alphabet’s symbols can be represented with any type. Encoding from ASCII bytes and decoding the representation is implemented through the Codec trait.
  • Nucleotides and nucleotide sequences can be complemented
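
One consequence of the 2-bit layout A: 00, C: 01, G: 10, T: 11 is that complementary bases have inverted bit patterns, so complementing reduces to a bitwise NOT. The sketch below illustrates that property on raw bits. It is not the crate's implementation of complementing, and `complement_bits` and `complement_word` are hypothetical helpers; the packed word is assumed to hold its first base in the lowest two bits.

```rust
/// Complement of a single 2-bit base: A (00) <-> T (11), C (01) <-> G (10).
fn complement_bits(bits: u8) -> u8 {
    // Invert both bits, then discard everything above the low two bits.
    !bits & 0b11
}

/// Complement every base in a packed word holding `len` bases (len <= 32).
/// This complements in place; it does not reverse the sequence.
fn complement_word(word: u64, len: usize) -> u64 {
    // Mask covering only the 2 * len bits that hold real bases.
    let mask = if len >= 32 {
        u64::MAX
    } else {
        (1u64 << (2 * len)) - 1
    };
    !word & mask
}
```

For example, the packed word for ACGT, `0b11_10_01_00`, complements to `0b00_01_10_11`, which reads as TGCA from the lowest bits upward.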

Derive Macros§