Deacon
Fast minimizer-based search and depletion of FASTA/FASTQ files and streams. Default parameters balance sensitivity and specificity for microbial (meta)genomic host depletion, for which a validated prebuilt index is available. Classification sensitivity, specificity and memory requirements can be tuned by varying k-mer length (`-k`), minimizer window size (`-w`), and the per-query match thresholds (`-a` and `-r`). Minimizer k and w are chosen at index time, while the match thresholds can be varied at filter time. Sequences must meet both an absolute threshold (`-a`, default 2 minimizer hits) and a relative threshold (`-r`, default 0.01, i.e. 1% of minimizers) to be considered a match. Short and/or paired reads are supported: a match in either mate causes both mates in the pair to be retained or discarded. Sequences can optionally be renamed for privacy and smaller file sizes. Deacon reports filtering performance during execution and optionally writes a JSON summary on completion. Gzip, Zstandard (zst) and xz compression formats are natively supported and detected by file extension.
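To make the roles of k and w concrete, here is a minimal Python sketch of (k, w) minimizer selection. It is an illustration only and not Deacon's implementation: the function name `minimizers` is hypothetical, CRC32 stands in for whatever hash simd-minimizers actually uses, and reverse-complement (canonical) handling is omitted. It also assumes that w counts the k-mers in each window.

```python
import zlib


def minimizers(seq: str, k: int = 31, w: int = 15) -> set[str]:
    """Return the distinct (k, w) minimizer k-mers of seq.

    Every window of w consecutive k-mers contributes the k-mer with the
    smallest hash. CRC32 is an illustrative stand-in for a real hash.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    hashes = [zlib.crc32(kmer.encode()) for kmer in kmers]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        # Smallest hash wins; the leftmost position breaks ties.
        best = min(window, key=lambda i: (hashes[i], i))
        selected.add(kmers[best])
    return selected
```

Larger w samples fewer k-mers per sequence, which shrinks the index but leaves fewer minimizers per read to match against.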
Building on simd-minimizers, Deacon is capable of filtering compressed long reads at >500Mbp/s and indexing a human genome in <30s (Apple M1). Filtering at >1Gbp/s is possible with uncompressed input. Peak memory usage during filtering is 5GB for the default panhuman index. Use Zstandard (zst) compression and/or pipe output to an external compressor such as pigz for best performance.
Benchmarks for panhuman host depletion of complex microbial metagenomes are described in a preprint. Among tested approaches, Deacon with the panhuman-1 (k=31, w=15) index exhibited the highest balanced accuracy for both long and short simulated reads. Deacon was, however, less specific than Hostile for short reads.
> [!IMPORTANT]
> Deacon is still unstable, so please carefully review the CHANGELOG when updating. Version 0.7.0, for instance, introduced a new index format (version 2) that is not backwards compatible. Please report any problems you encounter by creating an issue or using the email address in my profile.
Install
conda/mamba/pixi 
cargo 
Usage
Indexing
Build indexes with `deacon index build`. For human host depletion, the prebuilt validated panhuman index is recommended; it can be downloaded from either Zenodo or fast object storage using the links in the table below. Object storage is provided by the ModMedMicro research unit at the University of Oxford.
```bash
# Build an index from a reference FASTA
deacon index build chm13v2.fa > human.k31w15.idx
# Discard very low entropy minimizers
deacon index build -e 0.5 chm13v2.fa > human.k31w15e5.idx
```
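As a rough intuition for what an entropy threshold such as `-e 0.5` filters out, the Python sketch below scores a k-mer by the Shannon entropy of its nucleotide composition, scaled to the range 0 to 1. This is an assumption made for illustration: the exact entropy definition and scaling Deacon applies are not stated here, and `scaled_entropy` and `keep_minimizer` are hypothetical names.

```python
import math
from collections import Counter


def scaled_entropy(kmer: str) -> float:
    """Shannon entropy of base composition, divided by 2 bits (the DNA maximum)."""
    counts = Counter(kmer)
    total = len(kmer)
    bits = sum(-(n / total) * math.log2(n / total) for n in counts.values())
    return bits / 2.0


def keep_minimizer(kmer: str, threshold: float = 0.5) -> bool:
    """Assumed rule: drop k-mers whose scaled entropy is below the threshold."""
    return scaled_entropy(kmer) >= threshold
```

Under this definition a homopolymer such as `AAAA` scores 0 and would be discarded, while a k-mer using all four bases equally scores 1.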
Prebuilt indexes
| Name/URL | Composition | Minimizers | Subtracted minimizers | Size | Date |
|---|---|---|---|---|---|
| panhuman-1 (k=31, w=15) Cloud, Zenodo | (HPRC Year 1 ∪ CHM13v2.0 ∪ GRCh38.p14) - bacteria (FDA-ARGOS) - viruses (RefSeq) | 409,913,780 | 20,781 (0.0051%) | 3.7GB | 2025-07 |
Filtering
The command `deacon filter` accepts an index path followed by up to two query FASTA/FASTQ file paths, depending on whether query sequences originate from stdin, a single file, or paired input files. Paired queries are supported as either separate files or interleaved stdin, and are written either interleaved to stdout or a file, or to separate paired output files. For paired reads, distinct minimizer hits originating from either mate are counted. By default, query sequences must meet both an absolute threshold of 2 minimizer hits (`-a 2`) and a relative threshold of 1% of minimizers (`-r 0.01`) to pass the filter. Filtering can be inverted, e.g. for host depletion, using the `--deplete` (`-d`) flag. Gzip, Zstandard, and xz compression formats are detected automatically by file extension. Use Zstandard compression rather than Gzip where possible for best performance.
Examples
# Keep only human sequences
# Host depletion using the panhuman-1 index and default thresholds
# Maximum sensitivity with absolute threshold of 1 and relative threshold of 0
# More specific 10% relative match threshold
# Stdin and stdout
# Faster Zstandard compression
# Fast gzip with pigz
# Paired reads
# Save summary JSON
# Replace read headers with incrementing integers
# Only look for minimizer hits inside the first 1000bp per record
# Debug mode: see sequences with minimizer hits in stderr
Command line reference
Filtering
<INDEX> Path
Indexing
<INPUT> Path
Building custom indexes
Building custom Deacon indexes is quite fast. Nevertheless, when indexing many large genomes, it may be worthwhile to index them separately and then combine the resulting indexes into one succinct index. Combine distinct minimizers from multiple indexes using `deacon index union`. Similarly, use `deacon index diff` to subtract the minimizers contained in one index from another. This can be helpful, e.g., for eliminating shared minimizers between the target and host genomes when building custom (non-human) indexes for host depletion.
- Use `deacon index union 1.idx 2.idx 3.idx… > 1+2+3.idx` to succinctly combine two (or more!) deacon indexes.
- Use `deacon index diff 1.idx 2.idx > 1-2.idx` to subtract the minimizers in `2.idx` from `1.idx`. Useful for masking out shared minimizer content between e.g. target and host genomes.
- In version `0.7.0` and above, `deacon index diff` also supports subtracting minimizers from an index using a fastx file or stream, e.g. `deacon index diff 1.idx 2.fa.gz > 1-2.idx` or `zcat *.fa.gz | deacon index diff 1.idx - > 1-2.idx`.
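Conceptually, `union` and `diff` behave like set operations on the minimizers each index contains. The Python sketch below models indexes as plain sets to show the effect of masking shared content; the function names and toy values are hypothetical, and the real on-disk index format is not modelled.

```python
def index_union(*indexes: set) -> set:
    """Combine the distinct minimizers of two or more indexes."""
    return set().union(*indexes)


def index_diff(index: set, other: set) -> set:
    """Remove from index every minimizer that also occurs in other."""
    return index - other


# Toy example: mask minimizers shared between a host and a target organism.
host = {"m1", "m2", "m3", "m4"}
target = {"m3", "m4", "m5"}
masked_host = index_diff(host, target)  # {"m1", "m2"}
```

After subtraction, no target minimizer can hit the masked host index, so target reads are less likely to be discarded during host depletion.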
For best performance, set the `--capacity` argument of `deacon index build` to a number of minimizers, in millions, greater than the number you expect your index to contain. Setting this too low slows indexing because the hash table must be resized as it grows.
Filtering summary statistics
Use `-s summary.json` to save detailed filtering statistics as JSON.
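The saved summary can be inspected with any JSON tooling. The Python sketch below loads the file and flattens nested objects into dotted keys for quick printing or logging. It deliberately assumes no field names, since the summary's schema is not reproduced here; the only assumption is that the top level is a JSON object. The helper names are hypothetical.

```python
import json


def load_summary(path: str) -> dict:
    """Read a JSON summary file written with -s."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def flatten(obj, prefix: str = ""):
    """Yield (dotted_key, value) pairs; nested dicts are expanded, lists are not."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from flatten(value, f"{prefix}{key}.")
    else:
        yield prefix.rstrip("."), obj


def print_summary(path: str) -> None:
    """Print every field of the summary, one per line."""
    for key, value in flatten(load_summary(path)):
        print(f"{key}\t{value}")
```

Calling `print_summary("summary.json")` after a filtering run lists every reported statistic without hard-coding key names.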
Citation
Bede Constantinides, John Lees, Derrick W Crook. "Deacon: fast sequence filtering and contaminant depletion" bioRxiv 2025.06.09.658732, https://doi.org/10.1101/2025.06.09.658732
Please also consider citing the SimdMinimizers paper:
Ragnar Groot Koerkamp, Igor Martayan. "SimdMinimizers: Computing random minimizers, fast" bioRxiv 2025.01.27.634998, https://doi.org/10.1101/2025.01.27.634998