Orphos Core
Core library for Orphos, a high-performance Rust implementation of Prodigal (prokaryotic gene prediction algorithms). This crate provides the foundational gene-finding capabilities for identifying protein-coding genes in microbial genomes.
Overview
orphos-core implements an unsupervised machine learning approach for finding genes in prokaryotic genomes. It uses dynamic programming and statistical models trained on genomic features to predict gene locations with high accuracy.
Key Features
- 🚀 High Performance: Multi-threaded processing using Rayon for parallel analysis
- 🔒 Type Safety: Compile-time guarantees using type-state pattern for training states
- 🧬 Dual Modes: Single genome mode for complete genomes and metagenomic mode for fragments
- 📊 Multiple Output Formats: GenBank, GFF3, GCA, and SCO formats
- 🎯 Accurate: Advanced start codon recognition with Shine-Dalgarno detection
- 💾 Memory Efficient: Optimized data structures for large-scale genomic analysis
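The type-state idea mentioned above can be illustrated with a minimal, self-contained sketch. Every name here (`Analyzer`, `Untrained`, `Trained`, `Model`, `train`) is hypothetical and chosen for illustration; the actual types in orphos-core may look quite different. The "training" step is a stand-in that only computes GC content.

```rust
// Hypothetical stand-in for a trained statistical model.
#[derive(Debug, Clone, PartialEq)]
pub struct Model {
    pub gc_content: f64,
}

// State types: an untrained analyzer carries no model,
// a trained one always carries exactly one.
pub struct Untrained;
pub struct Trained {
    model: Model,
}

// The analyzer is generic over its training state.
pub struct Analyzer<S> {
    state: S,
}

impl Analyzer<Untrained> {
    pub fn new() -> Self {
        Analyzer { state: Untrained }
    }

    // Consumes the untrained analyzer and returns a trained one,
    // so the old value cannot be used by accident afterwards.
    pub fn train(self, sequence: &[u8]) -> Analyzer<Trained> {
        let gc = sequence
            .iter()
            .filter(|b| matches!(b.to_ascii_uppercase(), b'G' | b'C'))
            .count();
        let gc_content = if sequence.is_empty() {
            0.0
        } else {
            gc as f64 / sequence.len() as f64
        };
        Analyzer { state: Trained { model: Model { gc_content } } }
    }
}

// Methods that need a model exist only on the trained state.
impl Analyzer<Trained> {
    pub fn model(&self) -> &Model {
        &self.state.model
    }
}
```

With this layout, calling `model()` on an `Analyzer<Untrained>` is a compile error rather than a runtime failure, which is the kind of guarantee the feature list refers to.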
Installation
Add to your Cargo.toml:
[dependencies]
orphos-core = "0.1.0"
Quick Start
Basic Usage
use ;
Analyzing Sequences Directly
use ;
Custom Configuration
use ;
let config = OrphosConfig ;
let mut analyzer = new;
Metagenomic Mode
For analyzing short contigs or mixed community samples:
use OrphosConfig;
let config = OrphosConfig ;
let mut analyzer = new;
let results = analyzer.analyze_file?;
Module Organization
- config: Configuration options and output format settings
- engine: Main analysis engine with training and prediction logic
- types: Core data structures (Gene, Training, error types)
- results: Gene prediction results and sequence information
- sequence: Sequence encoding, I/O, and processing utilities
- algorithms: Gene-finding algorithms including:
  - Dynamic programming for gene prediction
  - Gene optimization and overlap resolution
  - Scoring functions for connections
- node: Gene node management, creation, and scoring
- training: Training algorithms for Shine-Dalgarno and non-SD models
- output: Output formatters for GenBank, GFF, GCA, and SCO
- bitmap: Efficient sequence encoding utilities
- metagenomic: Metagenomic mode presets and models
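To give a feel for what dynamic programming over gene candidates means, here is a deliberately simplified sketch. It is classic weighted interval scheduling (choose the highest-scoring set of non-overlapping intervals), not the algorithm used in the algorithms module, which this README does not describe in detail. The `Candidate` type and `best_non_overlapping` function are hypothetical.

```rust
// Hypothetical candidate gene: inclusive coordinates plus a score.
// Assumes start <= end.
#[derive(Debug, Clone, PartialEq)]
pub struct Candidate {
    pub start: usize,
    pub end: usize,
    pub score: f64,
}

/// Returns the highest-scoring set of mutually non-overlapping
/// candidates, ordered by end coordinate.
pub fn best_non_overlapping(mut cands: Vec<Candidate>) -> Vec<Candidate> {
    cands.sort_by_key(|c| c.end);
    let n = cands.len();

    // prev[i] = how many candidates (in sorted order) end strictly
    // before candidate i starts, i.e. are compatible with it.
    let prev: Vec<usize> = cands
        .iter()
        .map(|c| cands.partition_point(|p| p.end < c.start))
        .collect();

    // best[i] = best total score achievable using the first i candidates.
    let mut best = vec![0.0_f64; n + 1];
    for i in 0..n {
        let take = cands[i].score + best[prev[i]];
        best[i + 1] = best[i].max(take);
    }

    // Walk backwards to recover which candidates were taken.
    let mut chosen = Vec::new();
    let mut i = n;
    while i > 0 {
        let take = cands[i - 1].score + best[prev[i - 1]];
        if take > best[i - 1] {
            chosen.push(cands[i - 1].clone());
            i = prev[i - 1];
        } else {
            i -= 1;
        }
    }
    chosen.reverse();
    chosen
}
```

A real gene finder has to handle both strands, permitted overlaps, and connection scores between neighbouring genes, so treat this only as an illustration of the table-filling and traceback pattern.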
Output Formats
GenBank (.gbk)
Rich annotation format with full feature information:
LOCUS MyGenome 4641652 bp DNA linear BCT
FEATURES Location/Qualifiers
CDS 190..255
/gene="1"
/protein_id="MyGenome_1"
/translation="MTKRSAAAAAAVAAGMTSA"
GFF3 (.gff)
Standard genome annotation format:
##gff-version 3
MyGenome Orphos CDS 190 255 . + 0 ID=MyGenome_1;
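GFF3 feature lines consist of nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes). The sketch below builds a line like the one above from a small record. `GeneRecord` and `gff3_line` are hypothetical helpers for illustration, not the crate's output API, and the `ID=<seqid>_<index>;` attribute layout is assumed from the sample line.

```rust
// Hypothetical minimal gene record; not the crate's actual Gene type.
pub struct GeneRecord<'a> {
    pub seqid: &'a str,
    pub start: usize, // 1-based, inclusive
    pub end: usize,   // 1-based, inclusive
    pub forward: bool,
    pub index: usize, // running gene number within the sequence
}

// Formats one CDS feature as a single GFF3 line (no trailing newline).
pub fn gff3_line(g: &GeneRecord<'_>) -> String {
    let strand = if g.forward { '+' } else { '-' };
    format!(
        "{}\tOrphos\tCDS\t{}\t{}\t.\t{}\t0\tID={}_{};",
        g.seqid, g.start, g.end, strand, g.seqid, g.index
    )
}
```

The score column is written as `.` and the phase as `0` to match the sample; a real formatter would fill these from the prediction.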
GCA (.gca)
Tab-delimited gene coordinate annotation.
SCO (.sco)
Simple coordinate output with minimal information.
Configuration Options
| Option | Type | Default | Description |
|---|---|---|---|
| metagenomic | bool | false | Enable metagenomic mode for fragments |
| closed_ends | bool | false | Treat sequences as complete genomes |
| mask_n_runs | bool | false | Mask runs of N characters |
| force_non_sd | bool | false | Disable Shine-Dalgarno detection |
| quiet | bool | false | Suppress informational output |
| output_format | OutputFormat | Genbank | Output format selection |
| translation_table | Option<u8> | None | NCBI genetic code table (1-25) |
| num_threads | Option<usize> | None | Number of parallel threads |
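The defaults in the table map naturally onto Rust's `Default` trait. The following is a sketch that mirrors the documented options in a stand-alone struct. `SketchConfig` is a hypothetical name, the `OutputFormat` variants other than `Genbank` are guesses, and the `validate` method is an illustration of the documented 1-25 range, not something the crate is known to expose.

```rust
// Assumed variant names; only `Genbank` appears in the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum OutputFormat {
    #[default]
    Genbank,
    Gff,
    Gca,
    Sco,
}

// Stand-alone mirror of the documented options. Deriving Default
// yields false / None / Genbank, matching the table.
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct SketchConfig {
    pub metagenomic: bool,
    pub closed_ends: bool,
    pub mask_n_runs: bool,
    pub force_non_sd: bool,
    pub quiet: bool,
    pub output_format: OutputFormat,
    pub translation_table: Option<u8>,
    pub num_threads: Option<usize>,
}

impl SketchConfig {
    // Hypothetical check for the documented table range (1-25).
    pub fn validate(&self) -> Result<(), String> {
        match self.translation_table {
            Some(t) if !(1..=25).contains(&t) => {
                Err(format!("translation table {t} is outside 1-25"))
            }
            _ => Ok(()),
        }
    }
}
```

Struct update syntax then overrides only the fields of interest, for example `SketchConfig { translation_table: Some(11), ..Default::default() }`.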
Error Handling
All operations return Result<T, OrphosError> with detailed error types:
use OrphosError;
match analyzer.analyze_file
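As a general illustration of matching on a dedicated error enum, here is a small self-contained sketch. `SketchError`, its variants, and both functions are hypothetical; they are not the variants of `OrphosError`, which this README does not list.

```rust
use std::fmt;

// Hypothetical error type used only for this illustration.
#[derive(Debug, PartialEq)]
pub enum SketchError {
    EmptySequence,
    InvalidBase { position: usize, base: char },
}

impl fmt::Display for SketchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SketchError::EmptySequence => write!(f, "sequence is empty"),
            SketchError::InvalidBase { position, base } => {
                write!(f, "invalid base '{base}' at position {position}")
            }
        }
    }
}

impl std::error::Error for SketchError {}

// Accepts A, C, G, T, and N in either case; returns the sequence length.
// `position` in the error is a 0-based character index.
pub fn validate_sequence(seq: &str) -> Result<usize, SketchError> {
    if seq.is_empty() {
        return Err(SketchError::EmptySequence);
    }
    for (position, base) in seq.chars().enumerate() {
        if !matches!(base.to_ascii_uppercase(), 'A' | 'C' | 'G' | 'T' | 'N') {
            return Err(SketchError::InvalidBase { position, base });
        }
    }
    Ok(seq.len())
}

// Each variant gets its own arm, so callers can react differently
// to different failure modes.
pub fn describe(seq: &str) -> String {
    match validate_sequence(seq) {
        Ok(len) => format!("valid sequence of {len} bases"),
        Err(SketchError::EmptySequence) => "nothing to analyze".to_string(),
        Err(e @ SketchError::InvalidBase { .. }) => format!("rejected: {e}"),
    }
}
```

Because the enum implements `std::error::Error`, it also works with the `?` operator in functions returning `Box<dyn std::error::Error>`.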
Contributing
Contributions are welcome! Please see the main Orphos repository for contribution guidelines.
License
This project is licensed under the GNU General Public License v3.0 or later - see the LICENSE file for details.
Citation
If you use Orphos in your research, please cite:
Related Projects
- orphos-cli: Command-line interface for gene prediction
- orphos-python: Python bindings via PyO3
- orphos-wasm: WebAssembly module for browser/Node.js
Acknowledgments
This implementation is based on Prodigal, originally developed by Doug Hyatt. Orphos provides a modern, type-safe Rust implementation while maintaining compatibility with the original algorithms.