
GECCO-rs

Overview

GECCO-rs is a Rust reimplementation of GECCO (Gene Cluster prediction with Conditional Random Fields), a fast and scalable method that uses conditional random fields (CRFs) to identify putative novel Biosynthetic Gene Clusters (BGCs) in genomic and metagenomic data.

  • 2026-04-30: The original GECCO data can now be downloaded at build time. This is likely what you want, but it must be enabled explicitly via the bundled-data feature, since it deviates from typical build expectations
  • 2026-04-29: Validated to give the same output on real-life data. Still an early translation; compare against the original GECCO on your own data before switching. Runtime is on par with, or up to roughly 50% faster than, the original

This is an LLM-mediated faithful (hopefully) translation, not the original code!

Most users should probably first check whether the original code works for them, unless they have a specific reason otherwise. The original source may have newer features and has received more attention to bug fixing. In fact, for the sake of reproducibility, we aim to replicate bugs where they are present (though we may have added a few of our own in the process).

There are, however, cases where you might prefer this Rust version. We generally agree with this manifesto, but more specifically:

  • We have had many issues ensuring that our software works across existing container systems (Docker, Podman, Singularity). One size does not fit all, and trying to keep up with every way of delivering software eats our resources
  • Common package managers do not work well for us. It was manageable when there were a few Linux distributions with stable procedures, but now there are too many ecosystems (Homebrew, Conda). Conda's dependency resolver solves an NP-complete problem and does not scale, Homebrew is only moderately stable, and our Python dependencies still break. We no longer consider these serious, professional options. Cargo, meanwhile, allows multiple versions of a package to be available, even within the same program(!)
  • The future is the web. We deploy software in the web browser, and until now that has meant JavaScript, a language where even the == operator is broken. TypeScript is a step up, but the game changer is the ability to compile Rust into WebAssembly, which brings performance and lets code be shared with the backend. Translating code to Rust enables new ways of deployment, and running code in the browser has particular benefits for science: researchers do not have deep pockets to run servers, so pushing compute to the user enables deployments that would otherwise be impossible
  • Old CLI-based utilities are bad for the environment(!). A large amount of compute is spent creating and communicating through small files, which can be bypassed by using the code as a library. Even better, frequent reloading of databases can be avoided by hoisting that stage, with up to 100x speedups in some cases. Less compute means faster results and less electricity wasted
  • LLM-mediated translations may actually be safer to use than the original code. This article shows that running the same code on different operating systems can give somewhat different answers, a gap that Rust and Cargo can narrow. Type-safe interfaces also reduce coding mistakes and improve error handling compared with typical command-line scripting

But:

  • This approach should still be considered experimental. The LLM technology is immature and has sharp corners, but there are opportunities to reap, and the genie is not going back into the bottle. This translation is as much about learning how to improve the technology and gathering feedback on the results as it is about the end product.
  • Translations are not endorsed by the original authors unless otherwise noted. Do not send bug reports to the original developers. Use our GitHub issues page instead.
  • Do not trust the benchmarks on this page. They are used to help evaluate the translation. If you want improved performance, you generally have to use this code as a library, and use the additional tricks it offers. We generally accept performance losses in order to reduce our dependency issues
  • Check the original GitHub pages for information about the package. This README is kept sparse on purpose. It is not meant to be the primary source of information
  • If you are the author of the original code and wish to move to Rust, you can obtain ownership of this repository and crate. Until then, our commitment is to offer an as-faithful-as-possible translation of a snapshot of your code. If we find serious bugs, we will report them to you. Otherwise we will just replicate them, to ensure comparability across studies that claim to use package XYZ v.666. Think of this like a fancy Ubuntu .deb-package of your software - that is how we treat it

This blurb might be out of date. See this page for the latest information and for more on how we approach translation.

Building

Build the command-line tool with:

$ cargo build --release

Before first use, download the data files (Pfam HMM, InterPro metadata):

$ gecco build-data

This creates a gecco_data/ directory in the current working directory. At runtime, GECCO looks for data files next to the binary (gecco_data/ alongside the gecco executable). The data directory also contains GECCO's exported type classifier (type_classifier.rf.json) and its domain order (domains.tsv). You can override this with:

  • --data-dir /path/to/data on any command
  • The GECCO_DATA_DIR environment variable
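
For example, the data directory can be set per invocation via the environment variable (the path shown is illustrative):

$ GECCO_DATA_DIR=/opt/gecco_data gecco run --genome genome.fna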

Alternatively, enable the bundled-data feature to embed GECCO data in the binary:

$ cargo build --release --features bundled-data

or when installing from crates.io:

$ cargo install gecco --features bundled-data

With this feature enabled, gecco build-data and --data-dir are not required for the default GECCO model/database. The crate includes the Rust-specific converted models (model.crfsuite, type_classifier.rf.json), and the build script downloads the remaining original GECCO v0.10.3 assets from zellerlab/GECCO into Cargo's OUT_DIR, verifies their SHA256 checksums, and embeds them at compile time. This includes the large Pfam.h3m.gz release asset, so building with this feature requires network access.

Command-Line Usage

Run the full pipeline on a genome:

$ gecco run --genome genome.fna --output-dir output_dir

To use data files from a custom location:

$ gecco run --genome genome.fna --data-dir /opt/gecco_data

Use gecco <command> --help for the full current option list.

Command Description
gecco run Full pipeline: gene finding, HMMER annotation, CRF prediction, clustering, and type classification
gecco annotate Gene finding and HMMER annotation only
gecco predict Predict clusters from pre-annotated feature/gene tables
gecco train Train a new CRF model from labeled data
gecco cv K-fold or leave-one-type-out cross-validation on labeled data
gecco convert Format conversion (GenBank, FASTA, GFF3)
gecco build-data Download and prepare the default data directory
gecco update-interpro Rebuild InterPro metadata from upstream databases

Global Options

Flag Description Default
-v, --verbose Increase verbosity (repeat for more, e.g. -vv)
-q, --quiet Reduce or disable console output

The -f, --features arguments below refer to GECCO domain feature tables, not Cargo feature flags.

gecco run

Run gene finding, domain annotation, CRF prediction, cluster refinement, and type classification.

$ gecco run --genome genome.fna --output-dir output_dir
Flag Description Default
-g, --genome Input genome file (FASTA or GenBank) required
-o, --output-dir Output directory .
--data-dir Data directory (HMM, CRF model, InterPro files) gecco_data/ next to binary
-j, --jobs Number of threads (0 = auto-detect) 0
-M, --mask Mask ambiguous nucleotides off
--hmm Additional HMM file path; repeat for multiple databases from data dir
-e, --e-filter E-value cutoff for protein domains
-p, --p-filter P-value cutoff for protein domains 1e-9
--disentangle Disentangle overlapping domains off
--model Alternative CRF model file from data dir
--no-pad Disable padding of short gene sequences off
-c, --cds Minimum coding sequences per cluster 3
-m, --threshold Probability threshold for cluster membership 0.8
-E, --edge-distance Minimum genes separating a cluster from a sequence edge 0
--no-trim Disable trimming genes without domain annotations off
--force-tsv Write TSV files even when empty off
--merge-gbk Write one merged GenBank file instead of one file per cluster off
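
For example, a run on eight threads with a lower probability threshold and a single merged GenBank output (values are illustrative; flags as in the table above):

$ gecco run --genome genome.fna --output-dir output_dir --jobs 8 --threshold 0.6 --merge-gbk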

gecco annotate

Run only gene finding and domain annotation.

$ gecco annotate --genome genome.fna --output-dir annotations
Flag Description Default
-g, --genome Input genome file required
-o, --output-dir Output directory .
--data-dir Data directory (HMM, InterPro files) gecco_data/ next to binary
-j, --jobs Number of threads (0 = auto-detect) 0
-M, --mask Mask ambiguous nucleotides off
--hmm Additional HMM file path; repeat for multiple databases from data dir
-e, --e-filter E-value cutoff
-p, --p-filter P-value cutoff 1e-9
--disentangle Disentangle overlapping domains off
--force-tsv Write TSV files even when empty off

gecco predict

Predict clusters from pre-annotated feature/gene tables.

$ gecco predict --genome genome.fna --genes genome.genes.tsv --features genome.features.tsv --output-dir output_dir
Flag Description Default
--genome Input genome file (for GenBank output) required
-g, --genes Gene coordinate table (TSV) required
-f, --features Domain annotation table(s); accepts multiple values optional
-o, --output-dir Output directory .
--data-dir Data directory (CRF model and type classifier data) gecco_data/ next to binary
-j, --jobs Number of threads (0 = auto-detect) 0
-e, --e-filter E-value cutoff
-p, --p-filter P-value cutoff 1e-9
--model Alternative CRF model from data dir
--no-pad Disable padding of short gene sequences off
-c, --cds Minimum coding sequences per cluster 3
-m, --threshold Probability threshold 0.8
-E, --edge-distance Minimum genes from sequence edge 0
--no-trim Disable trimming genes without domain annotations off
--force-tsv Write TSV files even when empty off
--merge-gbk Single GenBank file for all clusters off
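
A typical split workflow first annotates and then predicts from the resulting tables. The paths below assume that gecco annotate writes the {genome}.genes.tsv and {genome}.features.tsv tables into its output directory:

$ gecco annotate --genome genome.fna --output-dir annotations
$ gecco predict --genome genome.fna --genes annotations/genome.genes.tsv --features annotations/genome.features.tsv --output-dir output_dir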

gecco train

Train a new CRF model from labeled annotation tables.

$ gecco train --genes training.genes.tsv --features training.features.tsv --clusters training.clusters.tsv --output-dir model_dir
Flag Description Default
-g, --genes Gene coordinate table (TSV) required
-f, --features Domain annotation table(s); accepts multiple values optional
-c, --clusters Cluster annotation table (TSV) required
-o, --output-dir Output directory .
-e, --e-filter E-value cutoff
-p, --p-filter P-value cutoff 1e-9
--no-shuffle Disable data shuffling before fitting off
--seed Random number generator seed 42
-W, --window-size CRF sliding window length 5
--window-step CRF sliding window step 1
--c1 L1 regularization strength 0.15
--c2 L2 regularization strength 0.15
--feature-type Feature extraction level (protein or domain) protein
--select Fraction of most significant features to select (0.0–1.0) all

gecco cv

Cross-validation for model evaluation. Supports K-fold and Leave-One-Type-Out.

$ gecco cv --genes training.genes.tsv --features training.features.tsv --clusters training.clusters.tsv --output cv.tsv
Flag Description Default
-g, --genes Gene coordinate table (TSV) required
-f, --features Domain annotation table(s); accepts multiple values optional
-c, --clusters Cluster annotation table (TSV) required
-o, --output Output file path cv.tsv
-e, --e-filter E-value cutoff
-p, --p-filter P-value cutoff 1e-9
--no-shuffle Disable data shuffling off
--seed Random number generator seed 42
-W, --window-size CRF sliding window length 5
--window-step CRF sliding window step 1
--c1 L1 regularization strength 0.15
--c2 L2 regularization strength 0.15
--feature-type Feature extraction level (protein or domain) protein
--select Fraction of features to select all
--loto Use Leave-One-Type-Out instead of K-folds off
--splits Number of K-fold splits 10
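
For example, leave-one-type-out evaluation instead of the default 10-fold split (file names are illustrative):

$ gecco cv --genes training.genes.tsv --features training.features.tsv --clusters training.clusters.tsv --loto --output cv_loto.tsv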

gecco convert

Convert output files to other formats.

gecco convert gbk — Convert GenBank cluster files:

Flag Description Default
-i, --input-dir Input directory containing .gbk files required
-o, --output-dir Output directory same as input
-f, --format Output format: bigslice, fna, or faa required

gecco convert clusters — Convert cluster tables:

Flag Description Default
-i, --input-dir Input directory containing .clusters.tsv files required
-o, --output-dir Output directory same as input
-f, --format Output format: gff required
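
For example, to extract nucleotide FASTA files from the cluster GenBank output of gecco run, and a GFF from the cluster tables (directories are illustrative):

$ gecco convert gbk --input-dir output_dir --format fna
$ gecco convert clusters --input-dir output_dir --format gff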

gecco build-data

Download HMM databases and prepare data files for the pipeline.

$ gecco build-data --output-dir gecco_data
Flag Description Default
-o, --output-dir Output directory for data files gecco_data
-f, --force Force re-download even if files exist off

Library Usage

GECCO-rs can be used as a Rust library. Add it to your Cargo.toml:

[dependencies]
gecco = { version = "0.5", default-features = false }

This pulls in only the core library without CLI dependencies (clap, ureq, etc.).
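
If you also want the embedded default data described under Building, the bundled-data feature can presumably be combined with this in the usual Cargo way (assuming the feature also applies to library use; as noted above, it requires network access at build time):

[dependencies]
gecco = { version = "0.5", default-features = false, features = ["bundled-data"] }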

Then use the Gecco API to scan sequences for biosynthetic gene clusters:

use std::fs::File;
use std::path::Path;
use gecco::Gecco;
use gecco::io::genbank::read_sequences;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let pipeline = Gecco::builder()
        .threshold(0.8)
        .build()?;

    let records = read_sequences(Path::new("genome.fna"))?;

    // Runs gene finding, domain annotation, CRF prediction, and clustering.
    let results = pipeline.scan(&records)?;

    for cluster in &results.clusters {
        println!(
            "{}: {} genes, {}..{}",
            cluster.id,
            cluster.genes.len(),
            cluster.start(),
            cluster.end()
        );
    }

    results.write_gene_table(File::create("output.genes.tsv")?)?;
    results.write_feature_table(File::create("output.features.tsv")?)?;
    results.write_cluster_table(File::create("output.clusters.tsv")?)?;
    std::fs::create_dir_all("output_dir")?;
    results.write_cluster_gbks(Path::new("output_dir"))?;

    Ok(())
}

For more control, run individual pipeline stages separately:

// `pipeline` and `records` are the same as in the example above.
let mut genes = pipeline.find_genes(&records)?;       // gene finding
pipeline.annotate_domains(&mut genes)?;               // HMMER domain annotation
let genes = pipeline.predict_probabilities(&genes)?;  // per-gene CRF probabilities
let clusters = pipeline.extract_clusters(&genes);     // cluster extraction and refinement

The builder supports many options — see the GeccoBuilder source for the full list:

let pipeline = Gecco::builder()
    .data_dir("/opt/gecco_data")
    .threshold(0.6)          // lower threshold → more clusters
    .jobs(4)                 // parallel threads
    .p_filter(1e-6)          // relaxed domain filtering
    .mask(true)              // mask ambiguous nucleotides
    .build()?;

Results

GECCO-rs produces the same output files as Python GECCO:

  • {genome}.genes.tsv -- Predicted genes with per-gene BGC probabilities
  • {genome}.features.tsv -- Identified protein domains in tabular format
  • {genome}.clusters.tsv -- Predicted cluster coordinates and biosynthetic types
  • {genome}_cluster_{N}.gbk -- GenBank file per cluster with annotated proteins and domains

Benchmarks

Benchmarked on a 5.3 Mbp bacterial genome (Streptomyces sp., GenBank CP157504.1, 5,401 predicted genes). Both tools were run with -j 4 on the same machine (Linux, x86_64).

Performance

Stage Rust Python Speedup
Gene finding 5s 9s 1.8x
HMM annotation 17s 25s 1.5x
CRF + clustering 2s 8s 4.0x
Total 25s 42s 1.7x

Running Benchmarks

# Rust pipeline benchmark (per-stage timing)
$ cargo run --release --features bench --bin bench_pipeline

# Rust full pipeline benchmark (end-to-end)
$ cargo run --release --features bench --bin bench_full

Reference

GECCO can be cited using the following publication:

Accurate de novo identification of biosynthetic gene clusters with GECCO. Laura M Carroll, Martin Larralde, Jonas Simon Fleck, Ruby Ponnudurai, Alessio Milanese, Elisa Cappio Barazzone, Georg Zeller. bioRxiv 2021.05.03.442509; doi:10.1101/2021.05.03.442509

License

This software is provided under the GNU General Public License v3.0 or later.