Deacon
A minimizer-based filter for nucleotide sequences in FASTA or FASTQ format, built for efficient host depletion. By default, query sequences with two or more minimizers present in the index are removed. Filters at ~50 Mbp/s using a single Apple M1 core and indexes the human genome in under 60 s. Peak memory usage is ~2.5 GB for a human genome with default parameters. Accuracy benchmarks will be published soon.
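The filtering rule described above can be illustrated with a small Python sketch. It is a toy model, not Deacon's implementation: it picks the lexicographically smallest k-mer per window instead of hashing canonical k-mers, stores the index as a plain set, and uses tiny values of k and w. All function names (`minimizers`, `build_index`, `is_match`, `deplete`) are hypothetical.

```python
def minimizers(seq, k=5, w=3):
    """Return the set of window minimizers of seq.

    Assumption: a window spans w consecutive k-mers and its minimizer is the
    lexicographically smallest k-mer (real tools usually order by hash).
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return {min(kmers[i:i + w]) for i in range(len(kmers) - w + 1)}


def build_index(references, k=5, w=3):
    """Collect the minimizers of every reference sequence into one set."""
    index = set()
    for ref in references:
        index |= minimizers(ref, k, w)
    return index


def is_match(query, index, k=5, w=3, min_hits=2):
    """True if at least min_hits distinct query minimizers are in the index."""
    return len(minimizers(query, k, w) & index) >= min_hits


def deplete(queries, index, k=5, w=3, min_hits=2):
    """Keep only the queries that do NOT match the index."""
    return [q for q in queries if not is_match(q, index, k, w, min_hits)]


if __name__ == "__main__":
    host = "ACGTACGTTGCAAGCTTAGGCTAACGT"
    index = build_index([host])
    reads = [host[3:20], "TTTTTTTTTTTTTTTT"]
    # The first read is a substring of the host and is removed;
    # the poly-T read shares no minimizers with the host and is kept.
    print(deplete(reads, index))
```

Because every window of a host-derived read is also a window of the host, all of that read's minimizers are guaranteed to be in the index, which is why a small threshold is enough to catch it.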
The sensitivity/specificity/memory tradeoff can be tuned using k-mer length (-k), minimizer window length (-w), and match threshold (-m). Filtering can be sped up by considering only the first n bases of each query sequence (-n). Minimizer computation is accelerated using simd-minimizers. This project is currently unstable and under active development.
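To give a feel for how these parameters interact, the following Python sketch counts distinct minimizers on a random sequence for several window lengths and shows the effect of looking only at a read prefix. It is an illustration under assumptions: the window is taken to be w consecutive k-mers, minimizers are chosen lexicographically, and the helper names are hypothetical. It does not reproduce Deacon's actual numbers.

```python
import random


def minimizers(seq, k, w):
    """Set of lexicographically smallest k-mers over windows of w k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return {min(kmers[i:i + w]) for i in range(len(kmers) - w + 1)}


def query_minimizers(seq, k, w, n=None):
    """Minimizers of the first n bases only (all bases if n is None)."""
    prefix = seq if n is None else seq[:n]
    return minimizers(prefix, k, w)


def random_sequence(length, seed=0):
    """Reproducible random nucleotide sequence for the demonstration."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))


if __name__ == "__main__":
    genome = random_sequence(5000)
    for w in (5, 15, 30):
        # A larger window samples fewer minimizers, so the index is smaller
        # but each query also has fewer chances to hit it.
        print(f"w={w}: {len(minimizers(genome, 15, w))} distinct minimizers")
    read = genome[100:400]
    print(len(query_minimizers(read, 15, 15)),
          len(query_minimizers(read, 15, 15, n=100)))
```

In this model a larger w can only shrink the minimizer set, which is the memory side of the tradeoff, while truncating queries reduces the work per read at the cost of fewer minimizers available to reach the match threshold.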
Install
conda/mamba/pixi 
cargo 
Usage
Indexing
Supports FASTA[.gz] input files and outputs to stdout or file (-o).
Filtering
Supports FASTA or FASTQ input from stdin or file and outputs to stdout or file. Paired sequences are supported as either separate files or interleaved stdin, and are written in interleaved format to either stdout or file. Gzip (.gz) and Zstandard (.zst) compression formats are detected automatically. To avoid a compression bottleneck when writing gzip output directly, pipe uncompressed FASTA/Q output to pigz instead.
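For readers unfamiliar with interleaved output, here is a minimal Python sketch of what interleaving two paired FASTQ streams means: each record from the first file is followed immediately by its mate from the second. It assumes strict four-line FASTQ records, and the function names are hypothetical; Deacon performs this step itself, so this is only an illustration of the format.

```python
from itertools import islice, zip_longest


def fastq_records(lines):
    """Yield (header, sequence, separator, quality) tuples from FASTQ lines."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield tuple(record)


def interleave(r1_lines, r2_lines):
    """Yield records alternately from two paired inputs: R1, R2, R1, R2, ..."""
    pairs = zip_longest(fastq_records(r1_lines), fastq_records(r2_lines))
    for rec1, rec2 in pairs:
        if rec1 is None or rec2 is None:
            raise ValueError("paired inputs have different numbers of records")
        yield rec1
        yield rec2
```

Both arguments can be any iterable of lines, for example open file handles or lists of strings.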
Reports
Use --log results.json to save a filtering summary to a JSON file.
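A saved report can be inspected with any JSON reader. The Python sketch below loads the file and lists whatever top-level fields it contains. It deliberately assumes nothing about the field names, only that the file holds a JSON object; `load_report` and `summarise` are hypothetical helpers.

```python
import json


def load_report(path):
    """Parse the JSON summary written via --log."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarise(report):
    """Return one 'key: value' line per top-level field, sorted by key."""
    if not isinstance(report, dict):
        raise TypeError("expected a JSON object at the top level")
    return [f"{key}: {value}" for key, value in sorted(report.items())]


if __name__ == "__main__":
    for line in summarise(load_report("results.json")):
        print(line)
```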
Set operations on indexes
- Use `deacon index union 1.idx 2.idx > 1+2.idx` to nonredundantly combine two (or more) deacon minimizer indexes.
- Use `deacon index diff 1.idx 2.idx > 1-2.idx` to subtract minimizers in 2.idx from 1.idx. Useful for masking.
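Conceptually, these operations behave like ordinary set union and set difference. The Python sketch below models an index as a set of integer minimizer hashes, which is an assumption made for illustration and says nothing about the on-disk .idx format; the function names are hypothetical.

```python
def index_union(*indexes):
    """Combine any number of indexes, keeping each minimizer once."""
    combined = set()
    for index in indexes:
        combined |= index
    return combined


def index_diff(first, second):
    """Return the minimizers of `first` that are absent from `second`."""
    return set(first) - set(second)


if __name__ == "__main__":
    host = {11, 22, 33, 44}      # toy minimizer hashes for a host genome
    extra = {44, 55}             # a second index sharing one minimizer
    keep = {22, 33}              # minimizers to mask out of the host index
    print(sorted(index_union(host, extra)))
    print(sorted(index_diff(host, keep)))
```

In this model, masking means that a read whose only hits were on the subtracted minimizers no longer reaches the match threshold against the resulting index.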