gecco-0.4.0 has been yanked.

GECCO-RS

Overview

GECCO-RS is a Rust reimplementation of GECCO (Gene Cluster prediction with Conditional Random Fields), a fast and scalable method for identifying putative novel Biosynthetic Gene Clusters (BGCs) in genomic and metagenomic data using Conditional Random Fields (CRFs).

This port replaces all Python dependencies with native Rust crates, yielding significant speedups (see Benchmarks below) while preserving compatibility with the original GECCO models and data formats.

This is not the authorative version of GECCO. It is designed to give bit-wise the same answer as the Python version.

Advantages of GECCO-RS over the Python version are:

Generally faster
Especially faster if you need to keep calling it in a loop, since the database is only loaded once
Possible to embed in webpages (webassembly)
Type-safe interface when integrating in bigger projects; no need for intermediate files

Building

GECCO-RS requires a Rust toolchain (1.70+). Build with:

$ cargo build --release

Before first use, download the data files (Pfam HMM, InterPro metadata):

$ gecco build-data

This creates a gecco_data/ directory in the current working directory. At runtime, GECCO looks for data files next to the binary (gecco_data/ alongside the gecco executable). You can override this with:

--data-dir /path/to/data on any command
The GECCO_DATA_DIR environment variable

Usage

Run the full pipeline (gene finding, domain annotation, CRF prediction, cluster refinement, and type classification) on a genome:

$ gecco run --genome some_genome.fna -o output_dir --model model.crfsuite

To use data files from a custom location:

$ gecco run --genome some_genome.fna --data-dir /opt/gecco_data

Commands

Command	Description
`gecco run`	Full pipeline: gene finding -> annotation -> prediction -> clustering
`gecco annotate`	Domain annotation only (gene finding + HMMER)
`gecco predict`	Predict clusters from pre-annotated feature/gene tables
`gecco train`	Train a new CRF model from labeled data
`gecco cv`	K-fold or leave-one-type-out cross-validation
`gecco convert`	Format conversion (GenBank, FASTA, GFF3)

Global Options

Flag	Description	Default
`-v, --verbose`	Increase verbosity (repeat for more, e.g. `-vv`)	—
`-q, --quiet`	Reduce or disable console output	—

`gecco run`

Full pipeline: gene finding, domain annotation, CRF prediction, cluster refinement, and type classification.

Input:

Flag	Description	Default
`-g, --genome`	Input genome file (FASTA or GenBank)	required
`--data-dir`	Data directory (HMM, CRF model, InterPro files)	`gecco_data/` next to binary

Gene calling:

Flag	Description	Default
`-j, --jobs`	Number of threads (0 = auto-detect)	`0`
`-M, --mask`	Mask ambiguous nucleotides to prevent genes stretching across them	off
`--cds-feature`	Extract genes from annotated features instead of de-novo prediction	—
`--locus-tag`	Feature qualifier for naming extracted genes	`locus_tag`

Domain annotation:

Flag	Description	Default
`--hmm`	Path to alternative HMM file(s) (can be repeated)	from data dir
`-e, --e-filter`	E-value cutoff for protein domains	—
`-p, --p-filter`	P-value cutoff for protein domains	`1e-9`
`--disentangle`	Disentangle overlapping domains	off

Cluster detection:

Flag	Description	Default
`--model`	Path to alternative CRF model (`.crfsuite` format)	from data dir
`--no-pad`	Disable padding of short gene sequences	off
`-c, --cds`	Minimum coding sequences per cluster	`3`
`-m, --threshold`	Probability threshold for cluster membership	`0.8`
`-E, --edge-distance`	Minimum annotated genes separating cluster from sequence edge	`0`
`--no-trim`	Disable trimming genes without domain annotations at cluster edges	off

Output:

Flag	Description	Default
`-o, --output-dir`	Output directory	`.`
`--force-tsv`	Write TSV files even when empty	off
`--merge-gbk`	Single GenBank file for all clusters instead of one per cluster	off
`--antismash-sideload`	Write AntiSMASH v6 sideload JSON file	off

`gecco annotate`

Run only gene finding and domain annotation (no cluster prediction).

Flag	Description	Default
`-g, --genome`	Input genome file	required
`-o, --output-dir`	Output directory	`.`
`--data-dir`	Data directory (HMM, InterPro files)	`gecco_data/` next to binary
`-j, --jobs`	Number of threads (0 = auto-detect)	`0`
`-M, --mask`	Mask ambiguous nucleotides	off
`--cds-feature`	Extract genes from annotated features	—
`--locus-tag`	Feature qualifier for naming genes	`locus_tag`
`--hmm`	Alternative HMM file(s)	from data dir
`-e, --e-filter`	E-value cutoff	—
`-p, --p-filter`	P-value cutoff	`1e-9`
`--disentangle`	Disentangle overlapping domains	off
`--force-tsv`	Write TSV files even when empty	off

`gecco predict`

Predict clusters from pre-annotated feature/gene tables.

Flag	Description	Default
`--genome`	Input genome file (for GenBank output)	required
`-g, --genes`	Gene coordinate table (TSV)	required
`-f, --features`	Domain annotation table(s) (can be repeated)	required
`-o, --output-dir`	Output directory	`.`
`--data-dir`	Data directory (CRF model)	`gecco_data/` next to binary
`-j, --jobs`	Number of threads (0 = auto-detect)	`0`
`-e, --e-filter`	E-value cutoff	—
`-p, --p-filter`	P-value cutoff	`1e-9`
`--model`	Alternative CRF model	from data dir
`--no-pad`	Disable feature padding	off
`-c, --cds`	Minimum coding sequences per cluster	`3`
`-m, --threshold`	Probability threshold	`0.8`
`-E, --edge-distance`	Minimum genes from sequence edge	`0`
`--no-trim`	Disable edge gene trimming	off
`--force-tsv`	Write TSV files even when empty	off
`--merge-gbk`	Single GenBank file for all clusters	off
`--antismash-sideload`	Write AntiSMASH sideload JSON	off

`gecco train`

Train a new CRF model from labeled annotation tables.

Flag	Description	Default
`-g, --genes`	Gene coordinate table (TSV)	required
`-f, --features`	Domain annotation table(s) (can be repeated)	required
`-c, --clusters`	Cluster annotation table (TSV)	required
`-o, --output-dir`	Output directory	`.`
`-e, --e-filter`	E-value cutoff	—
`-p, --p-filter`	P-value cutoff	`1e-9`
`--no-shuffle`	Disable data shuffling before fitting	off
`--seed`	Random number generator seed	`42`
`-W, --window-size`	CRF sliding window length	`5`
`--window-step`	CRF sliding window step	`1`
`--c1`	L1 regularization strength	`0.15`
`--c2`	L2 regularization strength	`0.15`
`--feature-type`	Feature extraction level (`protein` or `domain`)	`protein`
`--select`	Fraction of most significant features to select (0.0–1.0)	all

`gecco cv`

Cross-validation for model evaluation. Supports K-fold and Leave-One-Type-Out.

Flag	Description	Default
`-g, --genes`	Gene coordinate table (TSV)	required
`-f, --features`	Domain annotation table(s) (can be repeated)	required
`-c, --clusters`	Cluster annotation table (TSV)	required
`-o, --output`	Output file path	`cv.tsv`
`-e, --e-filter`	E-value cutoff	—
`-p, --p-filter`	P-value cutoff	`1e-9`
`--no-shuffle`	Disable data shuffling	off
`--seed`	Random number generator seed	`42`
`-W, --window-size`	CRF sliding window length	`5`
`--window-step`	CRF sliding window step	`1`
`--c1`	L1 regularization strength	`0.15`
`--c2`	L2 regularization strength	`0.15`
`--feature-type`	Feature extraction level (`protein` or `domain`)	`protein`
`--select`	Fraction of features to select	all
`--loto`	Use Leave-One-Type-Out instead of K-folds	off
`--splits`	Number of K-fold splits	`10`

`gecco convert`

Convert output files to other formats.

gecco convert gbk — Convert GenBank cluster files:

Flag	Description	Default
`-i, --input-dir`	Input directory containing `.gbk` files	required
`-o, --output-dir`	Output directory	same as input
`-f, --format`	Output format: `bigslice`, `fna`, or `faa`	required

gecco convert clusters — Convert cluster tables:

Flag	Description	Default
`-i, --input-dir`	Input directory containing `.clusters.tsv` files	required
`-o, --output-dir`	Output directory	same as input
`-f, --format`	Output format: `gff`	required

`gecco build-data`

Download HMM databases and prepare data files for the pipeline.

Flag	Description	Default
`-o, --output-dir`	Output directory for data files	`gecco_data`
`-f, --force`	Force re-download even if files exist	off

Library Usage

GECCO-RS can be used as a Rust library. Add it to your Cargo.toml:

[dependencies]
gecco = { version = "0.4", default-features = false }

This pulls in only the core library without CLI dependencies (clap, ureq, etc.).

Then use the Gecco API to scan sequences for biosynthetic gene clusters:

use gecco::Gecco;
use gecco::orf::SeqRecord;

fn main() -> anyhow::Result<()> {
    // Build a pipeline — data directory is resolved automatically:
    //   1. GECCO_DATA_DIR env var
    //   2. gecco_data/ next to the binary
    //   3. gecco_data/ in the current working directory
    let pipeline = Gecco::builder()
        .threshold(0.8)
        .build()?;

    // Load your sequences (e.g. from a FASTA file)
    let records = vec![SeqRecord {
        id: "contig_1".into(),
        seq: std::fs::read_to_string("genome.fna")?,
    }];

    // Run the full pipeline: gene finding → annotation → CRF → clustering
    let clusters = pipeline.scan(&records)?;
    for cluster in &clusters {
        println!("{}: {} genes, type={}",
            cluster.id, cluster.genes.len(), cluster.cluster_type);
    }

    Ok(())
}

For more control, use scan_detailed to also get per-gene probabilities, or run individual stages separately:

// Get both genes (with probabilities) and clusters
let (genes, clusters) = pipeline.scan_detailed(&records)?;

for gene in &genes {
    println!("{}: p={:.3}", gene.id, gene.average_p.unwrap_or(0.0));
}

// Or run stages individually:
let mut genes = pipeline.find_genes(&records)?;
pipeline.annotate_domains(&mut genes)?;
let genes = pipeline.predict_probabilities(&genes)?;
let clusters = pipeline.extract_clusters(&genes);

To write output files (same formats as the CLI), use the I/O functions:

use std::fs::File;
use gecco::io::tables::{ClusterTable, GeneTable, FeatureTable};
use gecco::io::genbank::write_cluster_gbk;

let (genes, clusters) = pipeline.scan_detailed(&records)?;

// Write TSV tables
GeneTable::write_from_genes(File::create("output.genes.tsv")?, &genes)?;
FeatureTable::write_from_genes(File::create("output.features.tsv")?, &genes)?;
ClusterTable::write_from_clusters(File::create("output.clusters.tsv")?, &clusters)?;

// Write GenBank files (one per cluster)
for cluster in &clusters {
    let f = File::create(format!("{}.gbk", cluster.id))?;
    write_cluster_gbk(f, cluster, None, env!("CARGO_PKG_VERSION"))?;
}

The builder supports many options — see the GeccoBuilder source for the full list:

let pipeline = Gecco::builder()
    .data_dir("/opt/gecco_data")
    .threshold(0.6)          // lower threshold → more clusters
    .jobs(4)                 // parallel threads
    .p_filter(1e-6)          // relaxed domain filtering
    .mask(true)              // mask ambiguous nucleotides
    .build()?;

Results

GECCO-RS produces the same output files as Python GECCO:

{genome}.genes.tsv -- Predicted genes with per-gene BGC probabilities
{genome}.features.tsv -- Identified protein domains in tabular format
{genome}.clusters.tsv -- Predicted cluster coordinates and biosynthetic types
{genome}_cluster_{N}.gbk -- GenBank file per cluster with annotated proteins and domains

Architecture

The pipeline flows through 5 stages:

Gene Finding -- Predicts ORFs using a pure Rust port of Prodigal with rayon-parallel metagenomic model evaluation
HMM Annotation -- Annotates protein domains against Pfam/TIGRFAM using a pure Rust HMMER implementation (SIMD-accelerated SSE/AVX2)
CRF Prediction -- Sliding-window feature extraction + CRF marginal inference for per-gene biosynthetic probability
Cluster Refinement -- Groups contiguous high-probability genes into clusters, trims edges
Type Classification -- Random Forest predicts biosynthetic type (Polyketide, NRP, Terpene, RiPP, etc.)

Key Differences from Python GECCO

Component	Python GECCO	GECCO-RS
Gene finding	pyrodigal (Cython/C)	Pure Rust Prodigal + rayon
HMMER	pyhmmer (C bindings)	Pure Rust HMMER (SSE/AVX2)
CRF	sklearn-crfsuite (pickle)	crfsuite-compliant-rs (CRFsuite binary format)
Random Forest	scikit-learn	smartcore
DataFrames	Polars	csv + serde
Parallelism	multiprocessing	rayon

Benchmarks

Benchmarked on a 5.3 Mbp bacterial genome (Streptomyces sp., GenBank CP157504.1, 5,401 predicted genes). Both tools run with -j 4 on the same machine (Linux, x86_64).

Performance

Stage	Rust	Python	Speedup
Gene finding	5s	9s	1.8x
HMM annotation	17s	25s	1.5x
CRF + clustering	2s	8s	4.0x
Total	25s	42s	1.7x

Output Comparison

Metric	Rust	Python	Match?
Predicted genes	5,401	5,401	Identical
Gene coordinates	--	--	Identical
Domain hits	6,839	6,455	+6%
Unique domain types	1,655	1,623	+2%
Clusters found	9	3	See below

Gene finding produces identical results between the two implementations (same gene count, same coordinates). Rust finds slightly more domain hits due to minor numerical differences in the HMMER implementation.

Rust detects more clusters because its CRF marginal probabilities (from crfsuite-compliant-rs) run 5-15% higher than Python's sklearn-crfsuite, pushing additional regions above the default 0.8 threshold. All 3 Python clusters overlap with Rust clusters at matching genomic coordinates:

Region	Rust	Python
623 kbp	cluster_2 (avg 0.93)	cluster_1 (avg 0.91, Terpene)
3,953 kbp	cluster_12 (avg 0.97)	cluster_3 (avg 0.97, Saccharide)
4,138 kbp	cluster_14 (avg 0.98)	cluster_4 (avg 0.98, Unknown)

Running Benchmarks

# Rust pipeline benchmark (per-stage timing)
$ cargo run --release --bin bench_pipeline

# Rust full pipeline benchmark (end-to-end)
$ cargo run --release --bin bench_full

Dependencies

Core Algorithm

crfsuite-compliant-rs -- CRF model loading, training, and marginal inference
prodigal -- Gene prediction (pure Rust port with rayon parallelism)
hmmer-pure-rs -- HMMER3 domain search (SIMD-accelerated)
smartcore_proba -- Random Forest type classifier

Bio I/O

bio -- Bioinformatics utilities
noodles-fasta -- FASTA parsing
gb-io -- GenBank I/O

Reference

GECCO can be cited using the following publication:

Accurate de novo identification of biosynthetic gene clusters with GECCO. Laura M Carroll, Martin Larralde, Jonas Simon Fleck, Ruby Ponnudurai, Alessio Milanese, Elisa Cappio Barazzone, Georg Zeller. bioRxiv 2021.05.03.442509; doi:10.1101/2021.05.03.442509

License

This software is provided under the GNU General Public License v3.0 or later.

gecco 0.4.0