gecco 0.4.0

Gene Cluster prediction with Conditional Random Fields
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
# GECCO-RS

## Overview

GECCO-RS is a Rust reimplementation of [GECCO](https://github.com/zellerlab/GECCO)
(Gene Cluster prediction with Conditional Random Fields), a fast and scalable
method for identifying putative novel Biosynthetic Gene Clusters (BGCs) in
genomic and metagenomic data using Conditional Random Fields (CRFs).

This port replaces all Python dependencies with native Rust crates, yielding
significant speedups (see [Benchmarks](#benchmarks) below) while preserving
compatibility with the original GECCO models and data formats.

This is not the authorative version of GECCO. It is designed to give bit-wise the same answer as the Python version.

Advantages of GECCO-RS over the Python version are:
* Generally faster
* Especially faster if you need to keep calling it in a loop, since the database is only loaded once
* Possible to embed in webpages (webassembly)
* Type-safe interface when integrating in bigger projects; no need for intermediate files


## Building

GECCO-RS requires a Rust toolchain (1.70+). Build with:

```console
$ cargo build --release
```

Before first use, download the data files (Pfam HMM, InterPro metadata):

```console
$ gecco build-data
```

This creates a `gecco_data/` directory in the current working directory. At
runtime, GECCO looks for data files next to the binary (`gecco_data/` alongside
the `gecco` executable). You can override this with:

- `--data-dir /path/to/data` on any command
- The `GECCO_DATA_DIR` environment variable

## Usage

Run the full pipeline (gene finding, domain annotation, CRF prediction,
cluster refinement, and type classification) on a genome:

```console
$ gecco run --genome some_genome.fna -o output_dir --model model.crfsuite
```

To use data files from a custom location:

```console
$ gecco run --genome some_genome.fna --data-dir /opt/gecco_data
```

### Commands

| Command | Description |
|---------|-------------|
| `gecco run` | Full pipeline: gene finding -> annotation -> prediction -> clustering |
| `gecco annotate` | Domain annotation only (gene finding + HMMER) |
| `gecco predict` | Predict clusters from pre-annotated feature/gene tables |
| `gecco train` | Train a new CRF model from labeled data |
| `gecco cv` | K-fold or leave-one-type-out cross-validation |
| `gecco convert` | Format conversion (GenBank, FASTA, GFF3) |

### Global Options

| Flag | Description | Default |
|------|-------------|---------|
| `-v, --verbose` | Increase verbosity (repeat for more, e.g. `-vv`) ||
| `-q, --quiet` | Reduce or disable console output ||

### `gecco run`

Full pipeline: gene finding, domain annotation, CRF prediction, cluster
refinement, and type classification.

**Input:**

| Flag | Description | Default |
|------|-------------|---------|
| `-g, --genome` | Input genome file (FASTA or GenBank) | *required* |
| `--data-dir` | Data directory (HMM, CRF model, InterPro files) | `gecco_data/` next to binary |

**Gene calling:**

| Flag | Description | Default |
|------|-------------|---------|
| `-j, --jobs` | Number of threads (0 = auto-detect) | `0` |
| `-M, --mask` | Mask ambiguous nucleotides to prevent genes stretching across them | off |
| `--cds-feature` | Extract genes from annotated features instead of de-novo prediction ||
| `--locus-tag` | Feature qualifier for naming extracted genes | `locus_tag` |

**Domain annotation:**

| Flag | Description | Default |
|------|-------------|---------|
| `--hmm` | Path to alternative HMM file(s) (can be repeated) | from data dir |
| `-e, --e-filter` | E-value cutoff for protein domains ||
| `-p, --p-filter` | P-value cutoff for protein domains | `1e-9` |
| `--disentangle` | Disentangle overlapping domains | off |

**Cluster detection:**

| Flag | Description | Default |
|------|-------------|---------|
| `--model` | Path to alternative CRF model (`.crfsuite` format) | from data dir |
| `--no-pad` | Disable padding of short gene sequences | off |
| `-c, --cds` | Minimum coding sequences per cluster | `3` |
| `-m, --threshold` | Probability threshold for cluster membership | `0.8` |
| `-E, --edge-distance` | Minimum annotated genes separating cluster from sequence edge | `0` |
| `--no-trim` | Disable trimming genes without domain annotations at cluster edges | off |

**Output:**

| Flag | Description | Default |
|------|-------------|---------|
| `-o, --output-dir` | Output directory | `.` |
| `--force-tsv` | Write TSV files even when empty | off |
| `--merge-gbk` | Single GenBank file for all clusters instead of one per cluster | off |
| `--antismash-sideload` | Write AntiSMASH v6 sideload JSON file | off |

### `gecco annotate`

Run only gene finding and domain annotation (no cluster prediction).

| Flag | Description | Default |
|------|-------------|---------|
| `-g, --genome` | Input genome file | *required* |
| `-o, --output-dir` | Output directory | `.` |
| `--data-dir` | Data directory (HMM, InterPro files) | `gecco_data/` next to binary |
| `-j, --jobs` | Number of threads (0 = auto-detect) | `0` |
| `-M, --mask` | Mask ambiguous nucleotides | off |
| `--cds-feature` | Extract genes from annotated features ||
| `--locus-tag` | Feature qualifier for naming genes | `locus_tag` |
| `--hmm` | Alternative HMM file(s) | from data dir |
| `-e, --e-filter` | E-value cutoff ||
| `-p, --p-filter` | P-value cutoff | `1e-9` |
| `--disentangle` | Disentangle overlapping domains | off |
| `--force-tsv` | Write TSV files even when empty | off |

### `gecco predict`

Predict clusters from pre-annotated feature/gene tables.

| Flag | Description | Default |
|------|-------------|---------|
| `--genome` | Input genome file (for GenBank output) | *required* |
| `-g, --genes` | Gene coordinate table (TSV) | *required* |
| `-f, --features` | Domain annotation table(s) (can be repeated) | *required* |
| `-o, --output-dir` | Output directory | `.` |
| `--data-dir` | Data directory (CRF model) | `gecco_data/` next to binary |
| `-j, --jobs` | Number of threads (0 = auto-detect) | `0` |
| `-e, --e-filter` | E-value cutoff ||
| `-p, --p-filter` | P-value cutoff | `1e-9` |
| `--model` | Alternative CRF model | from data dir |
| `--no-pad` | Disable feature padding | off |
| `-c, --cds` | Minimum coding sequences per cluster | `3` |
| `-m, --threshold` | Probability threshold | `0.8` |
| `-E, --edge-distance` | Minimum genes from sequence edge | `0` |
| `--no-trim` | Disable edge gene trimming | off |
| `--force-tsv` | Write TSV files even when empty | off |
| `--merge-gbk` | Single GenBank file for all clusters | off |
| `--antismash-sideload` | Write AntiSMASH sideload JSON | off |

### `gecco train`

Train a new CRF model from labeled annotation tables.

| Flag | Description | Default |
|------|-------------|---------|
| `-g, --genes` | Gene coordinate table (TSV) | *required* |
| `-f, --features` | Domain annotation table(s) (can be repeated) | *required* |
| `-c, --clusters` | Cluster annotation table (TSV) | *required* |
| `-o, --output-dir` | Output directory | `.` |
| `-e, --e-filter` | E-value cutoff ||
| `-p, --p-filter` | P-value cutoff | `1e-9` |
| `--no-shuffle` | Disable data shuffling before fitting | off |
| `--seed` | Random number generator seed | `42` |
| `-W, --window-size` | CRF sliding window length | `5` |
| `--window-step` | CRF sliding window step | `1` |
| `--c1` | L1 regularization strength | `0.15` |
| `--c2` | L2 regularization strength | `0.15` |
| `--feature-type` | Feature extraction level (`protein` or `domain`) | `protein` |
| `--select` | Fraction of most significant features to select (0.0–1.0) | all |

### `gecco cv`

Cross-validation for model evaluation. Supports K-fold and Leave-One-Type-Out.

| Flag | Description | Default |
|------|-------------|---------|
| `-g, --genes` | Gene coordinate table (TSV) | *required* |
| `-f, --features` | Domain annotation table(s) (can be repeated) | *required* |
| `-c, --clusters` | Cluster annotation table (TSV) | *required* |
| `-o, --output` | Output file path | `cv.tsv` |
| `-e, --e-filter` | E-value cutoff ||
| `-p, --p-filter` | P-value cutoff | `1e-9` |
| `--no-shuffle` | Disable data shuffling | off |
| `--seed` | Random number generator seed | `42` |
| `-W, --window-size` | CRF sliding window length | `5` |
| `--window-step` | CRF sliding window step | `1` |
| `--c1` | L1 regularization strength | `0.15` |
| `--c2` | L2 regularization strength | `0.15` |
| `--feature-type` | Feature extraction level (`protein` or `domain`) | `protein` |
| `--select` | Fraction of features to select | all |
| `--loto` | Use Leave-One-Type-Out instead of K-folds | off |
| `--splits` | Number of K-fold splits | `10` |

### `gecco convert`

Convert output files to other formats.

**`gecco convert gbk`** — Convert GenBank cluster files:

| Flag | Description | Default |
|------|-------------|---------|
| `-i, --input-dir` | Input directory containing `.gbk` files | *required* |
| `-o, --output-dir` | Output directory | same as input |
| `-f, --format` | Output format: `bigslice`, `fna`, or `faa` | *required* |

**`gecco convert clusters`** — Convert cluster tables:

| Flag | Description | Default |
|------|-------------|---------|
| `-i, --input-dir` | Input directory containing `.clusters.tsv` files | *required* |
| `-o, --output-dir` | Output directory | same as input |
| `-f, --format` | Output format: `gff` | *required* |

### `gecco build-data`

Download HMM databases and prepare data files for the pipeline.

| Flag | Description | Default |
|------|-------------|---------|
| `-o, --output-dir` | Output directory for data files | `gecco_data` |
| `-f, --force` | Force re-download even if files exist | off |

## Library Usage

GECCO-RS can be used as a Rust library. Add it to your `Cargo.toml`:

```toml
[dependencies]
gecco = { version = "0.4", default-features = false }
```

This pulls in only the core library without CLI dependencies (`clap`, `ureq`, etc.).

Then use the `Gecco` API to scan sequences for biosynthetic gene clusters:

```rust
use gecco::Gecco;
use gecco::orf::SeqRecord;

fn main() -> anyhow::Result<()> {
    // Build a pipeline — data directory is resolved automatically:
    //   1. GECCO_DATA_DIR env var
    //   2. gecco_data/ next to the binary
    //   3. gecco_data/ in the current working directory
    let pipeline = Gecco::builder()
        .threshold(0.8)
        .build()?;

    // Load your sequences (e.g. from a FASTA file)
    let records = vec![SeqRecord {
        id: "contig_1".into(),
        seq: std::fs::read_to_string("genome.fna")?,
    }];

    // Run the full pipeline: gene finding → annotation → CRF → clustering
    let clusters = pipeline.scan(&records)?;
    for cluster in &clusters {
        println!("{}: {} genes, type={}",
            cluster.id, cluster.genes.len(), cluster.cluster_type);
    }

    Ok(())
}
```

For more control, use `scan_detailed` to also get per-gene probabilities, or run individual stages separately:

```rust
// Get both genes (with probabilities) and clusters
let (genes, clusters) = pipeline.scan_detailed(&records)?;

for gene in &genes {
    println!("{}: p={:.3}", gene.id, gene.average_p.unwrap_or(0.0));
}

// Or run stages individually:
let mut genes = pipeline.find_genes(&records)?;
pipeline.annotate_domains(&mut genes)?;
let genes = pipeline.predict_probabilities(&genes)?;
let clusters = pipeline.extract_clusters(&genes);
```

To write output files (same formats as the CLI), use the I/O functions:

```rust
use std::fs::File;
use gecco::io::tables::{ClusterTable, GeneTable, FeatureTable};
use gecco::io::genbank::write_cluster_gbk;

let (genes, clusters) = pipeline.scan_detailed(&records)?;

// Write TSV tables
GeneTable::write_from_genes(File::create("output.genes.tsv")?, &genes)?;
FeatureTable::write_from_genes(File::create("output.features.tsv")?, &genes)?;
ClusterTable::write_from_clusters(File::create("output.clusters.tsv")?, &clusters)?;

// Write GenBank files (one per cluster)
for cluster in &clusters {
    let f = File::create(format!("{}.gbk", cluster.id))?;
    write_cluster_gbk(f, cluster, None, env!("CARGO_PKG_VERSION"))?;
}
```

The builder supports many options — see the [GeccoBuilder](src/pipeline.rs) source for the full list:

```rust
let pipeline = Gecco::builder()
    .data_dir("/opt/gecco_data")
    .threshold(0.6)          // lower threshold → more clusters
    .jobs(4)                 // parallel threads
    .p_filter(1e-6)          // relaxed domain filtering
    .mask(true)              // mask ambiguous nucleotides
    .build()?;
```

## Results

GECCO-RS produces the same output files as Python GECCO:

- `{genome}.genes.tsv` -- Predicted genes with per-gene BGC probabilities
- `{genome}.features.tsv` -- Identified protein domains in tabular format
- `{genome}.clusters.tsv` -- Predicted cluster coordinates and biosynthetic types
- `{genome}_cluster_{N}.gbk` -- GenBank file per cluster with annotated proteins and domains

## Architecture

The pipeline flows through 5 stages:

1. **Gene Finding** -- Predicts ORFs using a pure Rust port of Prodigal with rayon-parallel metagenomic model evaluation
2. **HMM Annotation** -- Annotates protein domains against Pfam/TIGRFAM using a pure Rust HMMER implementation (SIMD-accelerated SSE/AVX2)
3. **CRF Prediction** -- Sliding-window feature extraction + CRF marginal inference for per-gene biosynthetic probability
4. **Cluster Refinement** -- Groups contiguous high-probability genes into clusters, trims edges
5. **Type Classification** -- Random Forest predicts biosynthetic type (Polyketide, NRP, Terpene, RiPP, etc.)

### Key Differences from Python GECCO

| Component | Python GECCO | GECCO-RS |
|-----------|-------------|----------|
| Gene finding | pyrodigal (Cython/C) | Pure Rust Prodigal + rayon |
| HMMER | pyhmmer (C bindings) | Pure Rust HMMER (SSE/AVX2) |
| CRF | sklearn-crfsuite (pickle) | crfsuite-compliant-rs (CRFsuite binary format) |
| Random Forest | scikit-learn | smartcore |
| DataFrames | Polars | csv + serde |
| Parallelism | multiprocessing | rayon |

## Benchmarks

Benchmarked on a 5.3 Mbp bacterial genome (*Streptomyces* sp., GenBank CP157504.1, 5,401 predicted genes). Both tools run with `-j 4` on the same machine (Linux, x86_64).

### Performance

| Stage | Rust | Python | Speedup |
|-------|-----:|-------:|--------:|
| Gene finding | 5s | 9s | 1.8x |
| HMM annotation | 17s | 25s | 1.5x |
| CRF + clustering | 2s | 8s | 4.0x |
| **Total** | **25s** | **42s** | **1.7x** |

### Output Comparison

| Metric | Rust | Python | Match? |
|--------|-----:|-------:|--------|
| Predicted genes | 5,401 | 5,401 | Identical |
| Gene coordinates | -- | -- | Identical |
| Domain hits | 6,839 | 6,455 | +6% |
| Unique domain types | 1,655 | 1,623 | +2% |
| Clusters found | 9 | 3 | See below |

Gene finding produces identical results between the two implementations (same gene count, same coordinates). Rust finds slightly more domain hits due to minor numerical differences in the HMMER implementation.

Rust detects more clusters because its CRF marginal probabilities (from `crfsuite-compliant-rs`) run 5-15% higher than Python's `sklearn-crfsuite`, pushing additional regions above the default 0.8 threshold. All 3 Python clusters overlap with Rust clusters at matching genomic coordinates:

| Region | Rust | Python |
|--------|------|--------|
| 623 kbp | cluster_2 (avg 0.93) | cluster_1 (avg 0.91, Terpene) |
| 3,953 kbp | cluster_12 (avg 0.97) | cluster_3 (avg 0.97, Saccharide) |
| 4,138 kbp | cluster_14 (avg 0.98) | cluster_4 (avg 0.98, Unknown) |

### Running Benchmarks

```console
# Rust pipeline benchmark (per-stage timing)
$ cargo run --release --bin bench_pipeline

# Rust full pipeline benchmark (end-to-end)
$ cargo run --release --bin bench_full
```

## Dependencies

### Core Algorithm
- [crfsuite-compliant-rs]https://crates.io/crates/crfsuite-compliant-rs -- CRF model loading, training, and marginal inference
- [prodigal]https://github.com/henriksson-lab/prodigal-rs -- Gene prediction (pure Rust port with rayon parallelism)
- [hmmer-pure-rs]https://github.com/henriksson-lab/hmmer-pure-rs -- HMMER3 domain search (SIMD-accelerated)
- [smartcore_proba]https://crates.io/crates/smartcore_proba -- Random Forest type classifier

### Bio I/O
- [bio]https://crates.io/crates/bio -- Bioinformatics utilities
- [noodles-fasta]https://crates.io/crates/noodles-fasta -- FASTA parsing
- [gb-io]https://crates.io/crates/gb-io -- GenBank I/O

## Reference

GECCO can be cited using the following publication:

> **Accurate de novo identification of biosynthetic gene clusters with GECCO**.
> Laura M Carroll, Martin Larralde, Jonas Simon Fleck, Ruby Ponnudurai, Alessio Milanese, Elisa Cappio Barazzone, Georg Zeller.
> bioRxiv 2021.05.03.442509; [doi:10.1101/2021.05.03.442509]https://doi.org/10.1101/2021.05.03.442509

## License

This software is provided under the [GNU General Public License v3.0 or later](https://choosealicense.com/licenses/gpl-3.0/).