ARGenus 0.2.1

ARG detection and genus-level classification using flanking sequence analysis

ARGenus is a bioinformatics tool that simultaneously detects antibiotic resistance genes (ARGs) and identifies their source bacterial genera from metagenomic sequencing data. Unlike existing tools that only detect ARGs, ARGenus provides direct ARG-to-genus linkage through flanking sequence analysis.

Features

  • Direct ARG-genus linkage: Identifies the bacterial source of each detected ARG
  • Targeted assembly: Efficient processing through read filtering and localized assembly
  • SNP verification: Filters false positives by confirming resistance-conferring mutations
  • Dual database modes: 1,000 bp (high coverage) and 5,000 bp (high resolution) flanking databases
  • High compression: Custom FDB format with ~22x compression ratio
  • Memory-efficient: Streaming FDB builder for large datasets (works with 8-16 GB RAM)
  • Fast processing: 5-10 minutes per sample with 16 threads

Installation

From crates.io

cargo install argenus

From source

git clone https://github.com/necoli1822/argenus.git
cd argenus
cargo build --release

Dependencies

ARGenus requires the following tools in your PATH:

For building the 5,000 bp flanking database:

  • BLAST+ - blastn and blastdbcmd
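
Before starting a long-mode build, it can help to confirm that the BLAST+ executables are discoverable. The following is a minimal sketch using only Python's standard library; the helper name `missing_tools` is illustrative and not part of ARGenus.

```python
import shutil


def missing_tools(tools):
    """Return the tool names that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    # blastn and blastdbcmd are the BLAST+ programs listed above.
    missing = missing_tools(["blastn", "blastdbcmd"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All BLAST+ tools found.")
```

If the tools live outside PATH, the `--blastn` and `--blastdbcmd` options shown in the long-mode build command accept explicit paths instead.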

Database Setup

ARGenus requires a flanking sequence database for genus classification.

Pre-built Databases

Database            Size      Coverage   Genus Resolution   Best For
flanking_1kbp.fdb   ~50 MB    97.6%      83.9%              High-throughput screening
flanking_5kbp.fdb   ~8.7 GB   91.5%      92.8%              Epidemiological studies

Building Flanking Database

Short mode (1,000 bp) - from GenBank/PLSDB

argenus -b fdb -m short -o databases/flanking_1kbp.fdb

Long mode (5,000 bp) - from NCBI nt_prok

argenus -b fdb -m long \
    -o databases/flanking_5kbp.fdb \
    --blastn /path/to/blastn \
    --blastdbcmd /path/to/blastdbcmd \
    --nt-prok /path/to/nt_prok \
    --taxdump ./taxonomy

Streaming mode (for large datasets)

For datasets exceeding available RAM, use the streaming mode with external sorting:

# Step 1: External sort on the first column ($'\t' is bash/zsh syntax for a literal tab)
sort -t$'\t' -k1,1 -S 8G --parallel=8 flanking.tsv > flanking_sorted.tsv

# Step 2: Streaming FDB build
argenus -b fdb --sorted -i flanking_sorted.tsv -o flanking.fdb
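
To illustrate why pre-sorting enables a streaming build, the sketch below groups consecutive rows that share the same first column, so only one gene's records are held in memory at a time. This is not ARGenus's actual implementation; it assumes the first TSV column is the gene name, and the function name and demo rows are made up for illustration.

```python
import csv
import io
from itertools import groupby


def iter_gene_groups(handle):
    """Yield (gene, rows) for each run of consecutive rows sharing column 1.

    Only correct when the input is already sorted by the first column.
    """
    reader = csv.reader(handle, delimiter="\t")
    non_empty = (row for row in reader if row)  # skip blank lines
    for gene, rows in groupby(non_empty, key=lambda row: row[0]):
        yield gene, list(rows)


# Tiny invented example; real flanking TSV columns may differ.
demo = io.StringIO("geneA\tseq1\ngeneA\tseq2\ngeneB\tseq3\n")
for gene, rows in iter_gene_groups(demo):
    print(gene, len(rows))
```

On unsorted input the same gene would appear in several separate groups, which is why the external sort step comes first.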

Usage

Basic usage

argenus run \
    --r1 sample_R1.fastq.gz \
    --r2 sample_R2.fastq.gz \
    --db databases/AMR_PanRes.mmi \
    --fdb databases/flanking_1kbp.fdb \
    --output results/sample_argenus.tsv
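
For several samples, the same command can be assembled programmatically. This is a hedged sketch, not an official wrapper: it uses only the flags documented in this README, and the function names, directory layout, and `<sample>_R1.fastq.gz` naming pattern are assumptions modeled on the example above.

```python
import subprocess


def build_run_command(r1, r2, db, fdb, output, threads=16):
    """Assemble the argument list for `argenus run` from the documented flags."""
    return [
        "argenus", "run",
        "--r1", r1,
        "--r2", r2,
        "--db", db,
        "--fdb", fdb,
        "--output", output,
        "--threads", str(threads),
    ]


def run_sample(sample, reads_dir="reads", out_dir="results"):
    """Run ARGenus for one sample (hypothetical directory layout)."""
    cmd = build_run_command(
        r1=f"{reads_dir}/{sample}_R1.fastq.gz",
        r2=f"{reads_dir}/{sample}_R2.fastq.gz",
        db="databases/AMR_PanRes.mmi",
        fdb="databases/flanking_1kbp.fdb",
        output=f"{out_dir}/{sample}_argenus.tsv",
    )
    # check=True raises CalledProcessError if argenus exits with an error.
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    for sample in ["sample"]:  # replace with your own sample names
        run_sample(sample)
```

Passing the arguments as a list avoids shell quoting problems with unusual file names.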

Options

argenus run [OPTIONS]

Required:
    --r1 <FILE>         Forward reads (FASTQ/FASTQ.gz)
    --r2 <FILE>         Reverse reads (FASTQ/FASTQ.gz)
    --db <FILE>         ARG database index (.mmi)
    --fdb <FILE>        Flanking sequence database (.fdb)
    --output <FILE>     Output TSV file

Optional:
    --threads <N>         Number of threads [default: 16]
    --min-identity <F>    Minimum identity for ARG matching [default: 0.8]
    --min-coverage <F>    Minimum coverage for ARG matching [default: 0.7]
    --flank-identity <F>  Minimum identity for genus classification [default: 0.9]
    --include-wildtype    Include wild-type alleles in output

Flanking Database Building Options

argenus -b fdb [OPTIONS]

Options:
    -m, --mode <MODE>      Database mode: short (1000bp) or long (5000bp)
    -o, --output <PATH>    Output FDB path
    --sorted               Input TSV is pre-sorted by gene name (streaming mode)
    --taxdump <PATH>       Path to NCBI taxdump directory
    --threads <N>          Number of threads [default: available CPUs]
    --buffer-mb <MB>       Sort buffer size in MB [default: 1024]

Output Format

ARGenus produces a tab-delimited file with the following columns:

Column             Description
sample             Sample identifier
contig_id          Contig identifier (e.g., contig_1)
gene               ARG gene name
drug_class         Antimicrobial drug class
genus              Assigned source genus
confidence         Classification confidence (mean identity)
specificity        Gene-genus association strength
identity           ARG sequence identity
coverage           ARG sequence coverage
contig_length      Assembled contig length
upstream_len       Upstream flanking sequence length
downstream_len     Downstream flanking sequence length
extension_method   Extension method used (strict/flexible)
snp_status         SNP verification status
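
The output can be summarized with a few lines of Python. The sketch below counts, for each gene, how many contigs were assigned to each genus. It assumes the TSV begins with a header row using the column names listed above; the function name and the demo rows are invented for illustration.

```python
import csv
import io
from collections import Counter, defaultdict


def genus_counts_by_gene(handle):
    """Map each gene to a Counter of assigned genera."""
    reader = csv.DictReader(handle, delimiter="\t")
    counts = defaultdict(Counter)
    for row in reader:
        counts[row["gene"]][row["genus"]] += 1
    return counts


# Invented rows with only a subset of the columns, for demonstration.
demo = io.StringIO(
    "sample\tcontig_id\tgene\tgenus\n"
    "S1\tcontig_1\tgeneA\tEscherichia\n"
    "S1\tcontig_2\tgeneA\tEscherichia\n"
    "S1\tcontig_3\tgeneA\tKlebsiella\n"
    "S1\tcontig_4\tgeneB\tAcinetobacter\n"
)
for gene, genera in genus_counts_by_gene(demo).items():
    print(gene, dict(genera))
```

For a real result file, open it with `open(path, newline="")` and pass the file handle instead of the demo buffer.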

Performance

  • Processing speed: 5-10 minutes per sample (10-20M reads, 16 threads)
  • Memory usage: ~700 MB for FDB building (streaming mode)
  • Classification rate: ~73% genus-level assignment
  • False positive rate: <5% (compared to 15% for KMA)

Database Statistics

Flanking Database (v3)

Metric               1,000 bp     5,000 bp
File size            ~50 MB       ~8.7 GB
Total records        1,069,848    23,184,244
Gene count           11,835       11,092
Gene coverage        97.6%        91.5%
Genus resolution     83.9%        92.8%
Species resolution   74.7%        85.2%
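
As a rough way to compare the two modes, the record and gene counts above give the average number of flanking records per gene. This is simple arithmetic on the published figures, not a statistic reported by ARGenus, and the function name is illustrative.

```python
def records_per_gene(records, genes):
    """Average number of flanking records per gene."""
    return records / genes


# Figures copied from the table above.
stats = {
    "1,000 bp": {"records": 1_069_848, "genes": 11_835},
    "5,000 bp": {"records": 23_184_244, "genes": 11_092},
}

for mode, s in stats.items():
    print(f"{mode}: {records_per_gene(**s):.1f} records per gene")
```

The 5,000 bp database holds roughly 2,090 records per gene against roughly 90 for the 1,000 bp database, which is consistent with its much larger file size.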

Data Sources

Database              Records
NCBI nt_prok          8,722,761 sequences
GenBank prokaryotic   85,269 genomes
PLSDB                 14,635 plasmids
PanRes                13,280 ARGs

Citation

If you use ARGenus in your research, please cite:

[Citation information to be added upon publication]

License

MIT License - see LICENSE for details.

Contact

For questions and bug reports, please open an issue on GitHub.