Module training

Training algorithms for gene prediction models.

This module implements the unsupervised machine learning algorithms that train Orphos’s statistical models from genome sequences.

§Overview

Training extracts statistical patterns from genes predicted in an initial pass:

  1. Initial gene finding: Find high-confidence genes using basic models
  2. Codon usage: Calculate dicodon frequencies in predicted genes
  3. Start codon preference: Learn ATG/GTG/TTG usage patterns
  4. RBS detection: Identify ribosome binding site motifs (Shine-Dalgarno)
  5. Upstream composition: Analyze nucleotide patterns near start codons
  6. GC bias: Detect reading frame preferences based on GC content
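As a rough illustration of steps 2, 3, and 6, the sketch below tallies in-frame dicodon (hexamer) frequencies, start codon usage, and GC content per codon position over a set of predicted genes. The function names and the plain byte-slice gene representation are assumptions made for this example; they are not part of the Orphos API, and the crate's actual training code may weight or normalize these statistics differently.

```rust
use std::collections::HashMap;

/// Hypothetical helper: relative frequency of each in-frame dicodon
/// (two adjacent codons, 6 nt) across all genes.
fn dicodon_frequencies(genes: &[&[u8]]) -> HashMap<[u8; 6], f64> {
    let mut counts: HashMap<[u8; 6], usize> = HashMap::new();
    let mut total = 0usize;
    for &gene in genes {
        // Advance one codon (3 nt) at a time so every window stays in frame.
        let mut i = 0;
        while i + 6 <= gene.len() {
            let mut key = [0u8; 6];
            key.copy_from_slice(&gene[i..i + 6]);
            *counts.entry(key).or_insert(0) += 1;
            total += 1;
            i += 3;
        }
    }
    // If no windows were seen, `counts` is empty and no division happens.
    counts
        .into_iter()
        .map(|(k, c)| (k, c as f64 / total as f64))
        .collect()
}

/// Hypothetical helper: how often each of ATG/GTG/TTG opens a gene.
fn start_codon_counts(genes: &[&[u8]]) -> HashMap<&'static str, usize> {
    let mut counts = HashMap::from([("ATG", 0), ("GTG", 0), ("TTG", 0)]);
    for &gene in genes {
        if gene.len() < 3 {
            continue;
        }
        let key = match &gene[..3] {
            b"ATG" => "ATG",
            b"GTG" => "GTG",
            b"TTG" => "TTG",
            _ => continue, // not one of the three start codons tracked here
        };
        *counts.get_mut(key).unwrap() += 1;
    }
    counts
}

/// Hypothetical helper: fraction of G/C bases at codon positions 1, 2, and 3.
fn gc_by_codon_position(gene: &[u8]) -> [f64; 3] {
    let mut gc = [0usize; 3];
    let mut seen = [0usize; 3];
    for (i, &base) in gene.iter().enumerate() {
        seen[i % 3] += 1;
        if base == b'G' || base == b'C' {
            gc[i % 3] += 1;
        }
    }
    let mut out = [0.0; 3];
    for p in 0..3 {
        if seen[p] > 0 {
            out[p] = gc[p] as f64 / seen[p] as f64;
        }
    }
    out
}
```

Calling these helpers on the genes found in step 1 would give simple frequency tables of the kind a scoring model could consume; a production implementation would likely also compare them against background genome frequencies.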

§Training Modes

  • Shine-Dalgarno (SD): For organisms with canonical RBS motifs
  • Non-SD: For organisms that lack a canonical RBS or use an alternative start recognition mechanism

The mode is auto-detected based on the strength of SD signals in the training data.
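One way to picture this decision is as a threshold on how many training genes carry a recognizable SD motif. The sketch below is only an illustration under that assumption: `RbsSummary`, `TrainingMode`, `choose_mode`, and the `min_sd_fraction` parameter are hypothetical names invented for this example, and the actual criteria used by Orphos are not described on this page.

```rust
/// Hypothetical summary of RBS evidence collected from the training genes.
struct RbsSummary {
    genes_with_sd_motif: usize,
    total_genes: usize,
}

/// Hypothetical enum mirroring the two training modes described above.
#[derive(Debug, PartialEq)]
enum TrainingMode {
    ShineDalgarno,
    NonSd,
}

/// Pick a mode by comparing the fraction of genes with an SD motif
/// against an assumed, caller-supplied threshold.
fn choose_mode(summary: &RbsSummary, min_sd_fraction: f64) -> TrainingMode {
    // With no training genes there is no SD evidence at all.
    if summary.total_genes == 0 {
        return TrainingMode::NonSd;
    }
    let fraction = summary.genes_with_sd_motif as f64 / summary.total_genes as f64;
    if fraction >= min_sd_fraction {
        TrainingMode::ShineDalgarno
    } else {
        TrainingMode::NonSd
    }
}
```

Under these assumptions, a genome where most high-confidence genes show an upstream SD motif would be trained in SD mode, and one with weak or absent SD signal would fall back to non-SD mode.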

§Examples

Training is normally performed automatically by OrphosAnalyzer, but it can also be run manually for advanced use cases:

use orphos_core::engine::UntrainedOrphos;
use orphos_core::config::OrphosConfig;
use orphos_core::sequence::encoded::EncodedSequence;

let mut orphos = UntrainedOrphos::new();
let sequence = b"ATGAAACGCATTAGCACCACCATT...";
let encoded = EncodedSequence::without_masking(sequence);

// Train on the genome
let trained = orphos.train_single_genome(&encoded)?;

// Training data is now stored in the TrainedOrphos instance

Modules§

common
non_sd_training
sd_training

Functions§

load_training_file
should_use_sd
Checks if training should use Shine-Dalgarno motifs.
write_training_file