Deacon
A fast, general-purpose, minimizer-based filter for nucleotide sequences in FASTA or FASTQ format, built for rapid and accurate host depletion. Default parameters have been selected to maximise balanced accuracy for both short and long reads. Sensitivity, specificity and memory use may nevertheless be tuned by varying k-mer length (-k), minimizer window length (-w), and the number of required index matches (-m) per query. The minimizer parameters k and w are fixed at index time, while the number of required matches m can be specified at filter time.
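To make the roles of k and w concrete, here is a minimal Python sketch of minimizer selection. It is not Deacon's implementation, which builds on simd-minimizers. The hash function, the function names `kmer_hash` and `minimizers`, and the convention that a window spans w consecutive k-mers are all assumptions made for illustration.

```python
import hashlib


def kmer_hash(kmer):
    """Stable 64-bit hash of a k-mer (illustrative choice, not Deacon's hash)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizers(seq, k, w):
    """Return the distinct minimizer k-mers of seq.

    Assumption: a window spans w consecutive k-mers, and the k-mer with the
    smallest hash in each window is selected (leftmost on ties).
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        selected.add(min(kmers[start:start + w], key=kmer_hash))
    return selected


if __name__ == "__main__":
    read = "ACGTTGCATGCCGATAGCTAGGCTTAACGGAT"  # made-up sequence
    for w in (2, 8):
        print(f"k=5 w={w}: {len(minimizers(read, 5, w))} distinct minimizers")
```

Under this scheme a larger w never selects more minimizers than a smaller one, which is why w trades index size and memory against the number of opportunities for a query to hit the index.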
Building on simd-minimizers, Deacon is capable of filtering at >200Mbp/s (Apple M1) and indexing a human genome in under 60s. Peak memory usage is ~4.5GB for the default panhuman index. Partial query matching can be used to further increase speed for long queries by considering only the first -n bases per query. Stay tuned for comprehensive validation and benchmarks. This project is currently unstable.
Install
conda/mamba/pixi (recommended) 
cargo 
Usage
Indexing
Custom indexes can be built using deacon index build. For human host depletion, the prebuilt validated panhuman index is recommended, available for download below. Object storage is provided by the ModMedMicro research unit at the University of Oxford.
deacon index build chm13v2.fa > human.k31w15.idx
Prebuilt indexes
| Name/URL | Composition | Minimizers | Masked minimizers | Size | Date |
|---|---|---|---|---|---|
| panhuman-1 | (HPRC Year 1 ∪ CHM13v2.0 ∪ GRCh38.p14) - bacteria (FDA-ARGOS) - viruses (RefSeq) | 409,914,298 (k=31, w=15) | 20,741 (0.0051%) | 3.7GB | 2025-04 |
Filtering
The command deacon filter accepts an index path followed by up to two query FASTA/FASTQ file paths, depending on whether query sequences originate from stdin, a single file, or paired input files. Paired queries are supported as either separate files or interleaved stdin, and written interleaved to either stdout or file, or else to paired output files. For paired reads, distinct minimizer hits originating from either mate are counted. By default, query sequences with fewer than two minimizer hits to the index (-m 2) pass the filter. Filtering can be inverted using the --invert flag. Gzip (.gz) and Zstandard (.zst) compression formats are detected automatically by file extension. Since (de)compression can be rate limiting, consider using Zstandard rather than Gzip for best performance.
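The filtering rule described above can be sketched in a few lines of Python. This is an illustration of the decision logic only, not Deacon's code: the function names `count_hits` and `passes_filter`, the use of plain Python sets for the index, and the placeholder minimizer strings are all hypothetical.

```python
def count_hits(index, *mates):
    """Count distinct index minimizers found in any mate of a query."""
    seen = set()
    for mate_minimizers in mates:
        seen |= set(mate_minimizers)
    return len(seen & index)


def passes_filter(index, *mates, min_matches=2, invert=False):
    """Decide whether a query (one read, or both mates of a pair) is kept.

    By default a query passes when it has fewer than min_matches distinct
    hits to the index; invert=True flips that decision.
    """
    keep = count_hits(index, *mates) < min_matches
    return not keep if invert else keep


if __name__ == "__main__":
    host_index = {"m1", "m2", "m3"}  # placeholder minimizers
    print(passes_filter(host_index, {"m1", "x9"}))               # single read, one hit
    print(passes_filter(host_index, {"m1"}, {"m2"}))             # pair, one hit per mate
    print(passes_filter(host_index, {"m1", "m2"}, invert=True))  # inverted filter
```

Because hits are pooled across mates before thresholding, a pair is discarded together even when each mate alone would fall below the threshold.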
Examples
Reports
Use --report results.json to save a filtering report as JSON.
Composing indexes with set operations
- Use deacon index union 1.idx 2.idx > 1+2.idx to succinctly combine two (or more) deacon minimizer indexes.
- Use deacon index diff 1.idx 2.idx > 1-2.idx to subtract minimizers in 2.idx from 1.idx. Useful for masking.
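Conceptually, these operations behave like set union and set difference over minimizers, which is also how the composition of the panhuman-1 index is written in the table above. The Python sketch below models indexes as plain sets to show the idea. The function names `index_union` and `index_diff` and the placeholder minimizers are hypothetical, and the sketch says nothing about Deacon's on-disk format. It also assumes the indexes being combined were built with the same k and w.

```python
def index_union(*indexes):
    """Combine two or more minimizer sets into one."""
    combined = set()
    for idx in indexes:
        combined |= idx
    return combined


def index_diff(base, mask):
    """Remove from base every minimizer that also occurs in mask."""
    return base - mask


if __name__ == "__main__":
    # Placeholder minimizers standing in for real index contents.
    assembly_a = {"m1", "m2", "m3"}
    assembly_b = {"m3", "m4"}
    microbial = {"m2", "m9"}

    host = index_union(assembly_a, assembly_b)
    masked_host = index_diff(host, microbial)
    print(sorted(masked_host))
```

Masking in this way removes minimizers shared between host and microbial references, so that microbial reads are less likely to be discarded as host.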