Module sequence

Sequence encoding and manipulation utilities.

This module provides functions for encoding DNA sequences into compact bitmap representations and performing sequence analysis operations.

§Overview

DNA sequences are encoded using a 2-bit representation where:

  • A (adenine): 00
  • C (cytosine): 01
  • G (guanine): 10
  • T/U (thymine/uracil): 11

This encoding reduces memory usage by 75% compared to ASCII (2 bits per base instead of 8) and enables fast bitwise operations for sequence analysis.
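The packing scheme above can be sketched in a few lines. This is a minimal illustration, not the crate's implementation: the helper names (`base_to_code`, `pack_2bit`, `code_at`), the choice to store the first base in the lowest two bits of each byte, and the rejection of non-ACGTU bytes are all assumptions made for the example.

```rust
/// Map a nucleotide byte to its 2-bit code (A=00, C=01, G=10, T/U=11).
/// Returns None for any other byte (e.g. N); how such bases are handled
/// is left out of this sketch.
fn base_to_code(base: u8) -> Option<u8> {
    match base.to_ascii_uppercase() {
        b'A' => Some(0b00),
        b'C' => Some(0b01),
        b'G' => Some(0b10),
        b'T' | b'U' => Some(0b11),
        _ => None,
    }
}

/// Pack a sequence four bases per byte, first base in the lowest two bits.
/// The bit order within a byte is an assumption of this sketch.
fn pack_2bit(seq: &[u8]) -> Option<Vec<u8>> {
    let mut packed = vec![0u8; (seq.len() + 3) / 4];
    for (i, &base) in seq.iter().enumerate() {
        let code = base_to_code(base)?;
        packed[i / 4] |= code << ((i % 4) * 2);
    }
    Some(packed)
}

/// Read back the 2-bit code stored at base position `i`.
fn code_at(packed: &[u8], i: usize) -> u8 {
    (packed[i / 4] >> ((i % 4) * 2)) & 0b11
}
```

With this layout a 24-base sequence occupies 6 bytes instead of 24, which is where the 75% figure comes from.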

§Modules

  • encoded: Encoded sequence structures with forward and reverse-complement strands
  • io: FASTA file reading and parsing
  • processing: Sequence analysis functions (GC content, codon detection)

§Examples

§Encode a sequence

use orphos_core::sequence::encoded::EncodedSequence;

let sequence = b"ATGAAACGCATTAGCACCACCATT";
let encoded = EncodedSequence::without_masking(sequence);

println!("Length: {} bp", encoded.sequence_length);
println!("GC content: {:.2}%", encoded.gc_content * 100.0);
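For readers unfamiliar with the metric printed above, GC content is the fraction of bases that are G or C. The following stand-alone sketch works on raw ASCII bytes rather than on the encoded bitmap; `gc_fraction` is a hypothetical helper, and dividing by the full sequence length (including any ambiguous bases) is an assumption that may differ from how the crate computes `gc_content`.

```rust
/// Fraction of bases in `seq` that are G or C, in the range 0.0..=1.0.
/// Case-insensitive; an empty sequence yields 0.0 to avoid dividing by zero.
fn gc_fraction(seq: &[u8]) -> f64 {
    if seq.is_empty() {
        return 0.0;
    }
    let gc = seq
        .iter()
        .filter(|b| matches!(b.to_ascii_uppercase(), b'G' | b'C'))
        .count();
    gc as f64 / seq.len() as f64
}
```

Multiplying the result by 100.0, as the example above does, gives a percentage.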

§Test for specific nucleotides

use orphos_core::sequence::{is_a, is_gc};
use orphos_core::sequence::encoded::EncodedSequence;

let sequence = b"ATGC";
let encoded = EncodedSequence::without_masking(sequence);

assert!(is_a(&encoded.forward_sequence, 0)); // Position 0 is 'A'
assert!(is_gc(&encoded.forward_sequence, 3)); // Position 3 is 'C'

Re-exports§

pub use io::*;
pub use processing::*;

Modules§

encoded
io
processing

Functions§

calculate_background_mer_frequencies
Calculate background k-mer frequencies for both strands
calculate_kmer_index
Calculate k-mer index from sequence position for frequency analysis
char_to_nuc
Converts nucleotide character to 2-bit encoding for bitmap storage.
create_reverse_complement_sequence
Generate the reverse complement of an encoded DNA sequence
encode_sequence
Encode a DNA sequence into compact 2-bit representation
encode_sequence_simd_wide
SIMD-accelerated encoding using the wide crate with u8x32
encode_sequence_simd_wide_packed
Optimized packed encoding version with u8x32 and batch bit operations
find_max_reading_frame
Determine which of three reading frames has the highest score
gc_content
Calculate the GC content of a sequence region
is_a
Test if nucleotide at given position is adenine (A)
is_atg
Test if codon at position is ATG (methionine start codon)
is_c
Test if nucleotide at given position is cytosine (C)
is_g
Test if nucleotide at given position is guanine (G)
is_gc
Test if nucleotide at given position is G or C (counts toward GC content)
is_gtg
Test if codon at position is GTG (valine start codon)
is_n
Test if position contains an unknown nucleotide (N)
is_start
Test if codon at given position is a valid start codon
is_stop
Test if codon at given position is a stop codon
is_t
Test if nucleotide at given position is thymine (T)
is_ttg
Test if codon at position is TTG (leucine start codon)
mer_text
Convert k-mer index back to nucleotide sequence representation
min_of_two_integers
Return the minimum of two integers (utility function)
reverse_strand_reading_frame
Convert a forward strand reading frame to its corresponding reverse strand frame