Sequence encoding and manipulation utilities.
This module provides functions for encoding DNA sequences into compact bitmap representations and performing sequence analysis operations.
§Overview
DNA sequences are encoded using a 2-bit representation where:
- A (adenine): 00
- C (cytosine): 01
- G (guanine): 10
- T/U (thymine/uracil): 11
This encoding stores each nucleotide in 2 bits instead of the 8 bits of an ASCII character, which reduces memory usage by 75% and enables fast bitwise operations for sequence analysis.
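The mapping above can be sketched in a few lines of plain Rust. This is only an illustration of the idea, not the crate's implementation: the helper names `nuc_to_bits`, `pack_2bit`, and `get_2bit` are hypothetical, and the choice to store four bases per byte with the lowest bits first is an assumption. Bases outside A/C/G/T/U (such as N) are rejected here, whereas a real encoder needs some masking policy for them.

```rust
/// Hypothetical helper: map a nucleotide byte to its 2-bit code
/// (A=00, C=01, G=10, T/U=11). Returns None for anything else, e.g. N.
fn nuc_to_bits(base: u8) -> Option<u8> {
    match base.to_ascii_uppercase() {
        b'A' => Some(0b00),
        b'C' => Some(0b01),
        b'G' => Some(0b10),
        b'T' | b'U' => Some(0b11),
        _ => None,
    }
}

/// Pack a sequence into bytes, four bases per byte.
/// Assumed layout: base i occupies bits (i % 4) * 2 of byte i / 4.
fn pack_2bit(seq: &[u8]) -> Option<Vec<u8>> {
    let mut packed = vec![0u8; (seq.len() + 3) / 4];
    for (i, &base) in seq.iter().enumerate() {
        let bits = nuc_to_bits(base)?;
        packed[i / 4] |= bits << ((i % 4) * 2);
    }
    Some(packed)
}

/// Read back the 2-bit code stored at position `pos`.
fn get_2bit(packed: &[u8], pos: usize) -> u8 {
    (packed[pos / 4] >> ((pos % 4) * 2)) & 0b11
}

fn main() {
    let seq = b"ATGC";
    let packed = pack_2bit(seq).expect("only A/C/G/T/U expected in this sketch");
    // Four ASCII bytes fit into a single packed byte.
    println!("{} bases -> {} byte(s)", seq.len(), packed.len());
    println!("code at position 2: {:02b}", get_2bit(&packed, 2));
}
```

Because each base is a fixed-width field, a lookup at any position is a shift and a mask, which is what makes per-position tests such as "is this base G or C" cheap.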
§Modules
- encoded: Encoded sequence structures with forward and reverse-complement
- io: FASTA file reading and parsing
- processing: Sequence analysis functions (GC content, codon detection)
§Examples
§Encode a sequence
use orphos_core::sequence::encoded::EncodedSequence;
let sequence = b"ATGAAACGCATTAGCACCACCATT";
let encoded = EncodedSequence::without_masking(sequence);
println!("Length: {} bp", encoded.sequence_length);
println!("GC content: {:.2}%", encoded.gc_content * 100.0);
§Test for specific nucleotides
use orphos_core::sequence::{is_a, is_gc};
use orphos_core::sequence::encoded::EncodedSequence;
let sequence = b"ATGC";
let encoded = EncodedSequence::without_masking(sequence);
assert!(is_a(&encoded.forward_sequence, 0)); // Position 0 is 'A'
assert!(is_gc(&encoded.forward_sequence, 3)); // Position 3 is 'C'
Re-exports§
pub use io::*;
pub use processing::*;
Modules§
Functions§
- calculate_background_mer_frequencies: Calculate background k-mer frequencies for both strands
- calculate_kmer_index: Calculate k-mer index from sequence position for frequency analysis
- char_to_nuc: Convert a nucleotide character to its 2-bit encoding for bitmap storage
- create_reverse_complement_sequence: Generate the reverse complement of an encoded DNA sequence
- encode_sequence: Encode a DNA sequence into compact 2-bit representation
- encode_sequence_simd_wide: SIMD-accelerated encoding using the wide crate with u8x32
- encode_sequence_simd_wide_packed: Optimized packed encoding version with u8x32 and batch bit operations
- find_max_reading_frame: Determine which of three reading frames has the highest score
- gc_content: Calculate the GC content of a sequence region
- is_a: Test if nucleotide at given position is adenine (A)
- is_atg: Test if codon at position is ATG (methionine start codon)
- is_c: Test if nucleotide at given position is cytosine (C)
- is_g: Test if nucleotide at given position is guanine (G)
- is_gc: Test if nucleotide at given position is G or C (high GC content indicator)
- is_gtg: Test if codon at position is GTG (valine start codon)
- is_n: Test if position contains an unknown nucleotide (N)
- is_start: Test if codon at given position is a valid start codon
- is_stop: Test if codon at given position is a stop codon
- is_t: Test if nucleotide at given position is thymine (T)
- is_ttg: Test if codon at position is TTG (leucine start codon)
- mer_text: Convert k-mer index back to nucleotide sequence representation
- min_of_two_integers: Return the minimum of two integers (utility function)
- reverse_strand_reading_frame: Convert a forward strand reading frame to its corresponding reverse strand frame
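Several of the functions listed above (k-mer indexing, converting an index back to text, and reverse complementing) follow directly from the 2-bit codes. The sketch below shows one way these ideas can work on plain byte strings. It does not use or reproduce the crate's signatures: `code`, `kmer_index`, `index_to_kmer`, and `reverse_complement` are hypothetical names, and treating the first base as the most significant base-4 digit is an assumption that may differ from the library's ordering.

```rust
/// Hypothetical helper: 2-bit codes as in the overview (A=0, C=1, G=2, T=3).
fn code(base: u8) -> u8 {
    match base {
        b'A' => 0,
        b'C' => 1,
        b'G' => 2,
        _ => 3, // sketch only: every other byte is treated as T
    }
}

/// Index of a k-mer read as a base-4 number, first base most significant
/// (an assumed ordering).
fn kmer_index(kmer: &[u8]) -> usize {
    kmer.iter().fold(0, |acc, &b| acc * 4 + code(b) as usize)
}

/// Inverse of `kmer_index`: rebuild the k-mer text from its index.
fn index_to_kmer(mut index: usize, k: usize) -> String {
    let mut out = vec![b'A'; k];
    // Fill from the last base backwards, peeling off base-4 digits.
    for slot in out.iter_mut().rev() {
        *slot = b"ACGT"[index % 4];
        index /= 4;
    }
    String::from_utf8(out).expect("only ASCII bases are written")
}

/// Reverse complement. With A=0, C=1, G=2, T=3 the complement of a code
/// is simply `3 - code` (A<->T, C<->G).
fn reverse_complement(seq: &[u8]) -> String {
    seq.iter()
        .rev()
        .map(|&b| b"ACGT"[(3 - code(b)) as usize] as char)
        .collect()
}

fn main() {
    let idx = kmer_index(b"ATG");
    println!("ATG -> index {}", idx);
    println!("index {} -> {}", idx, index_to_kmer(idx, 3));
    println!("revcomp(ATGC) = {}", reverse_complement(b"ATGC"));
}
```

A k-mer of length k maps to an index in the range 0 to 4^k - 1, so a frequency table for all k-mers can be a flat array of that size.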