orfm 0.1.0


orfm-rs

A pure-Rust port of OrfM, a simple and not slow open reading frame (ORF) caller. Finds all open reading frames (ORFs) in FASTA/FASTQ sequences by searching all six reading frames for continuous stretches of codons without stop codons, then translates them to amino acid sequences.
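To make the six-frame search concrete, here is a naive sketch of the idea. It is not the crate's actual implementation. It assumes the standard genetic code (NCBI table 1), requires no start codon, translates codons with ambiguous bases to X, and numbers frames 1-3 on the forward strand and 4-6 on the reverse strand. The names `six_frame_orfs`, `translate_codon` and `reverse_complement` are hypothetical and are not part of the orfm API.

```rust
/// Standard genetic code (NCBI table 1), indexed with bases ordered T, C, A, G.
const TABLE_1: &[u8; 64] =
    b"FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

/// Map a base to its position in the T, C, A, G ordering used by TABLE_1.
fn base_index(b: u8) -> Option<usize> {
    match b.to_ascii_uppercase() {
        b'T' | b'U' => Some(0),
        b'C' => Some(1),
        b'A' => Some(2),
        b'G' => Some(3),
        _ => None,
    }
}

/// Translate one codon; a codon with any non-ACGT base becomes 'X' (an assumption).
fn translate_codon(codon: &[u8]) -> u8 {
    match (base_index(codon[0]), base_index(codon[1]), base_index(codon[2])) {
        (Some(a), Some(b), Some(c)) => TABLE_1[16 * a + 4 * b + c],
        _ => b'X',
    }
}

/// Reverse-complement a DNA sequence; unknown bases become 'N'.
fn reverse_complement(seq: &[u8]) -> Vec<u8> {
    seq.iter()
        .rev()
        .map(|&b| match b.to_ascii_uppercase() {
            b'A' => b'T',
            b'C' => b'G',
            b'G' => b'C',
            b'T' => b'A',
            _ => b'N',
        })
        .collect()
}

/// Hypothetical helper: return (frame, protein) for every stop-free stretch
/// of at least `min_len_nt` nucleotides in any of the six reading frames.
fn six_frame_orfs(seq: &[u8], min_len_nt: usize) -> Vec<(usize, String)> {
    let rc = reverse_complement(seq);
    let strands: [&[u8]; 2] = [seq, &rc];
    let mut out = Vec::new();
    for (strand_idx, strand) in strands.iter().copied().enumerate() {
        for offset in 0..3 {
            let frame = strand_idx * 3 + offset + 1;
            let start = offset.min(strand.len());
            let mut protein = String::new();
            for codon in strand[start..].chunks_exact(3) {
                let aa = translate_codon(codon);
                if aa == b'*' {
                    // A stop codon closes the current stretch.
                    if !protein.is_empty() && protein.len() * 3 >= min_len_nt {
                        out.push((frame, protein.clone()));
                    }
                    protein.clear();
                } else {
                    protein.push(aa as char);
                }
            }
            // A stretch running to the end of the sequence also counts.
            if !protein.is_empty() && protein.len() * 3 >= min_len_nt {
                out.push((frame, protein));
            }
        }
    }
    out
}
```

With this sketch, `six_frame_orfs(b"ATGGATGCTGAATAAATG", 12)` reports `MDAE` in frame 1 and drops the single codon after the stop, because it is shorter than 12 nucleotides.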

Install

From source (requires Rust toolchain)

cargo install orfm

Or build without installing:

cargo build --release
# binary at target/release/orfm

Usage

orfm [OPTIONS] [INPUT]

Reads from stdin if no input file is given. Accepts FASTA or FASTQ, gzipped or uncompressed.

Options

| Flag | Description | Default |
|------|-------------|---------|
| -m <LENGTH> | Minimum ORF length in nucleotides (must be a multiple of 3) | 96 |
| -c <TABLE_ID> | NCBI codon table for translation (1–25) | 1 |
| -l <LENGTH> | Ignore sequence beyond this position | none |
| -t <FILE> | Write nucleotide transcripts to this file | none |

Examples

# Basic usage
orfm input.fasta > orfs.faa

# From gzipped FASTQ, shorter minimum ORF length
orfm -m 30 reads.fastq.gz > orfs.faa

# Pipe from stdin, write transcripts
cat input.fasta | orfm -m 60 -t transcripts.fna > orfs.faa

# Use mitochondrial codon table
orfm -c 2 mito.fasta > orfs.faa

Library usage

orfm-rs can be used as a Rust library. Add to your Cargo.toml:

[dependencies]
orfm = '*'

use orfm::OrfCaller;

let caller = OrfCaller::new(1, 96, None).unwrap(); // table_id, min_length, position_limit

// Iterate over ORFs from a file
for orf in caller.call_from_file("input.fasta") {
    println!("{}", orf.header());
    println!("{}", std::str::from_utf8(&orf.protein).unwrap());
}

// Or call on a single sequence
let orfs = caller.find_orfs("seq1", "", b"ATGGATGCTGAA...");
for orf in &orfs {
    let transcript = orf.transcript(b"ATGGATGCTGAA...");
    // ...
}

Benchmarking

A Snakefile is included for comparing performance against the original C OrfM. It generates random sequences, runs both tools, checks output correctness, and collects wall-clock time and memory usage.

snakemake -j4
cat benchmark/results.tsv
cat benchmark/correctness.txt

When I ran it, orfm-rs was about 5% faster in wall-clock time, with similar peak memory:

| tool | replicate | wall_clock_s | max_rss_kb |
|------|-----------|--------------|------------|
| orfm_c | 1 | 2.11 | 16088 |
| orfm_rs | 1 | 1.99 | 15572 |
| orfm_c | 2 | 2.1 | 15432 |
| orfm_rs | 2 | 2.0 | 16116 |
| orfm_c | 3 | 2.12 | 16164 |
| orfm_rs | 3 | 2.02 | 16148 |

Requires the C OrfM binary at ~/git/OrfM/orfm (this path can be changed in the Snakefile).
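
The roughly 5% figure can be checked from the wall-clock column of the table above. The snippet below is only that arithmetic, with the three replicate timings per tool copied in by hand; `mean` and `percent_faster` are illustrative helper names, not part of the benchmark workflow.

```rust
/// Wall-clock seconds for the three replicates, copied from the results table.
const ORFM_C: [f64; 3] = [2.11, 2.10, 2.12];
const ORFM_RS: [f64; 3] = [1.99, 2.00, 2.02];

/// Arithmetic mean of a slice of measurements.
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Percentage by which `candidate` is faster than `baseline`
/// (positive means the candidate took less time).
fn percent_faster(baseline: f64, candidate: f64) -> f64 {
    (baseline - candidate) / baseline * 100.0
}
```

Calling `percent_faster(mean(&ORFM_C), mean(&ORFM_RS))` gives a value just over 5. Three replicates are a small sample, so the difference is best read as indicative.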

Differences from OrfM

  • Pure Rust, no C dependencies
  • Exposes a library API with an iterator over translated ORFs
  • Supports all codon tables that OrfM supports (NCBI tables 1–25)
  • Uses needletail for sequence parsing and aho-corasick for stop codon detection
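
As an illustration of why multi-pattern matching suits stop codon detection: stop codons on the reverse strand appear on the forward strand as their reverse complements, so one pass over the sequence can locate stops for all six frames. The sketch below assumes the standard code and uses a plain sliding window from the standard library as a stand-in for the aho-corasick crate. It is an assumption about the general approach, not code taken from orfm-rs, and `StopHit` and `scan_stops` are hypothetical names.

```rust
/// Stop codons of the standard genetic code on the forward strand.
const FORWARD_STOPS: [&[u8; 3]; 3] = [b"TAA", b"TAG", b"TGA"];
/// Their reverse complements: how reverse-strand stops read on the forward strand.
const REVERSE_STOPS: [&[u8; 3]; 3] = [b"TTA", b"CTA", b"TCA"];

/// One stop codon occurrence (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
struct StopHit {
    /// 0-based offset of the codon's first base on the forward strand.
    position: usize,
    /// True for a forward-strand stop, false for a reverse-strand stop.
    forward: bool,
}

/// Scan the sequence once and report every stop codon on either strand.
/// `position % 3` tells which of the three frames on that strand is affected.
fn scan_stops(seq: &[u8]) -> Vec<StopHit> {
    let mut hits = Vec::new();
    for (position, window) in seq.windows(3).enumerate() {
        let codon = [
            window[0].to_ascii_uppercase(),
            window[1].to_ascii_uppercase(),
            window[2].to_ascii_uppercase(),
        ];
        if FORWARD_STOPS.iter().any(|s| **s == codon) {
            hits.push(StopHit { position, forward: true });
        } else if REVERSE_STOPS.iter().any(|s| **s == codon) {
            hits.push(StopHit { position, forward: false });
        }
    }
    hits
}
```

An automaton-based matcher such as Aho-Corasick finds the same six patterns without re-comparing each window against every pattern, which is where the real speed benefit would come from.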

License

orfm-rs is licensed under the GNU Lesser General Public License v3.0 (LGPL-3.0).

Citation

Since the algorithm was devised for the original version of OrfM, it is best to cite that paper:

Ben J. Woodcroft, Joel A. Boyd, and Gene W. Tyson. OrfM: A fast open reading frame predictor for metagenomic data. (2016). Bioinformatics. doi:10.1093/bioinformatics/btw241.