Training algorithms for gene prediction models.
This module implements the unsupervised machine learning algorithms that train Orphos’s statistical models from genome sequences.
§Overview
Training extracts statistical patterns from genes predicted in an initial pass:
- Initial gene finding: Find high-confidence genes using basic models
- Codon usage: Calculate dicodon frequencies in predicted genes
- Start codon preference: Learn ATG/GTG/TTG usage patterns
- RBS detection: Identify ribosome binding site motifs (Shine-Dalgarno)
- Upstream composition: Analyze nucleotide patterns near start codons
- GC bias: Detect reading frame preferences based on GC content
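As a rough illustration of the codon usage and start codon steps above, the sketch below tallies start codons and in-frame dicodon frequencies over a set of predicted genes. It does not use the Orphos API: `start_codon_usage`, `dicodon_frequencies`, and the plain byte-slice gene representation are hypothetical names chosen for this example, and the real training code may weight or normalize these counts differently.

```rust
use std::collections::HashMap;

/// Counts how often each start codon (the first three bases) opens a gene.
/// `genes` is a hypothetical list of predicted gene sequences, read 5' to 3'.
fn start_codon_usage(genes: &[&[u8]]) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for gene in genes {
        // Skip fragments too short to contain a full codon.
        if gene.len() >= 3 {
            let codon = String::from_utf8_lossy(&gene[..3]).to_ascii_uppercase();
            *counts.entry(codon).or_insert(0) += 1;
        }
    }
    counts
}

/// Computes the relative frequency of each in-frame dicodon (six bases),
/// sliding the window forward one codon (three bases) at a time.
fn dicodon_frequencies(genes: &[&[u8]]) -> HashMap<String, f64> {
    let mut counts: HashMap<String, usize> = HashMap::new();
    let mut total = 0usize;
    for gene in genes {
        let mut i = 0;
        while i + 6 <= gene.len() {
            let dicodon = String::from_utf8_lossy(&gene[i..i + 6]).to_ascii_uppercase();
            *counts.entry(dicodon).or_insert(0) += 1;
            total += 1;
            i += 3;
        }
    }
    // An empty map means `total` is zero and no division takes place.
    counts
        .into_iter()
        .map(|(dicodon, n)| (dicodon, n as f64 / total as f64))
        .collect()
}
```

Calling `start_codon_usage` on the high-confidence genes from the initial pass yields the raw ATG/GTG/TTG counts, while `dicodon_frequencies` gives a simple coding-sequence profile that could be compared against background frequencies.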
§Training Modes
- Shine-Dalgarno (SD): For organisms with canonical RBS motifs
- Non-SD: For organisms without RBS or with alternative start recognition
The mode is auto-detected based on the strength of SD signals in the training data.
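To make the idea of auto-detection concrete, here is a minimal sketch of one way a mode could be chosen from SD signal strength. It is an assumption for illustration only, not the criterion Orphos applies: the motif list, the `TrainingMode` enum, `has_sd_motif`, `choose_mode`, and the `min_fraction` threshold are all hypothetical.

```rust
/// Hypothetical core Shine-Dalgarno motifs, used only for this illustration.
const SD_MOTIFS: [&[u8]; 2] = [b"AGGAG", b"GGAGG"];

#[derive(Debug, PartialEq)]
enum TrainingMode {
    ShineDalgarno,
    NonSd,
}

/// Returns true if the upstream region contains any of the example motifs
/// (case-insensitive exact match).
fn has_sd_motif(upstream: &[u8]) -> bool {
    SD_MOTIFS.iter().any(|motif| {
        upstream
            .windows(motif.len())
            .any(|window| window.eq_ignore_ascii_case(motif))
    })
}

/// Picks SD mode when at least `min_fraction` of the upstream regions of the
/// training genes carry an SD-like motif; otherwise falls back to non-SD.
fn choose_mode(upstream_regions: &[&[u8]], min_fraction: f64) -> TrainingMode {
    if upstream_regions.is_empty() {
        return TrainingMode::NonSd;
    }
    let mut hits = 0usize;
    for region in upstream_regions {
        if has_sd_motif(region) {
            hits += 1;
        }
    }
    let fraction = hits as f64 / upstream_regions.len() as f64;
    if fraction >= min_fraction {
        TrainingMode::ShineDalgarno
    } else {
        TrainingMode::NonSd
    }
}
```

A real implementation would more likely score motifs by type and spacer length rather than by exact match; the function `should_use_sd` listed below is the place to look for the actual decision.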
§Modules
- sd_training: Shine-Dalgarno motif training
- non_sd_training: Alternative start recognition training
- common: Shared training utilities
§Examples
Training is normally performed automatically by the OrphosAnalyzer, but
can be done manually for advanced use cases:
use orphos_core::engine::UntrainedOrphos;
use orphos_core::config::OrphosConfig;
use orphos_core::sequence::encoded::EncodedSequence;
let mut orphos = UntrainedOrphos::new();
let sequence = b"ATGAAACGCATTAGCACCACCATT...";
let encoded = EncodedSequence::without_masking(sequence);
// Train on the genome
let trained = orphos.train_single_genome(&encoded)?;
// Training data is now stored in the TrainedOrphos instance
Modules§
Functions§
- load_training_file
- should_use_sd: Checks if training should use Shine-Dalgarno motifs.
- write_training_file