deacon 0.14.0

Accelerated DNA sequence search and [host] depletion using minimizers

# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

## [0.14.0] - 2026-02-25

### Added

- `deacon filter` accepts `-f`/`--fasta` to force FASTA output regardless of input format.

## [0.13.2] - 2025-11-21

### Added

- `deacon filter` accepts `--rename-random` for anonymising read names using both incrementing and random (64-bit) integers to practically assure uniqueness. This mitigates a reported issue in which identical read names across separate files caused problems during upload to the European Nucleotide Archive.

### Changed

- Groundwork for ensuring Bioconda binaries are always built for the most portable `x86-64-v3` (AVX2) target supporting all AMD and Intel CPUs released in the last decade.
- The crate `minreq` now uses the `rustls` backend enabling native compilation on a wider range of Linux systems. Implements a workaround for the previously described `rustls` & `ring` issue on MacOS ARM runners.

## [0.13.1] - 2025-11-14

### Changed

- `deacon index fetch` now uses `minreq` rather than `ureq` to download indexes, removing dependency on `rustls` & `ring`, which caused a curious build error on Bioconda arm64 MacOS runners, meaning that 0.13.0 was released via Cargo only.

## [0.13.0] - 2025-11-11

### Added

- Command `deacon index fetch` for downloading prebuilt indexes by name. If no index name is specified, the `panhuman-1` index is downloaded.
- Parallel gzip compression of output files with automatic ~1:1 thread allocation between filtering and compression tasks if .gz extensions for `--output` (`-o`) and `--output2` (`-O`) arguments are detected.
  - Automatic thread allocation can be overridden using the new `--compression-threads` argument.
  - ~3x faster filtering when reading and writing gzip-compressed Illumina FASTQs.

## [0.12.0] - 2025-10-16

### Added

- Command `deacon index intersect` for finding the intersection of two or more minimizer indexes.
- Command `deacon index dump` for extracting minimizers from an index as plain text (FASTA).
- Command `deacon cite` showing citation info.

### Changed

- Graceful handling of empty compressed input files.
- Fixes bug where `--debug` incorrectly showed the complement of the hitting minimizer.
- Uses paraseq 0.4.3, addressing a bug identified in paraseq 0.4.2 causing FASTQ records without a trailing newline byte to be ignored.
- 2x increase in filtering throughput on arm64 / MacOS systems, enabled by a series of optimisations in the latest versions of the libraries packed-seq, seq-hash and simd-minimizers.

## [0.11.0] - 2025-10-07

### Added

- Local (socket) server mode, enabling successive filter commands to be handled by a persistent server process for low latency filtering.
- Support for longer k-mers of up to length 61, where k+w ≤ 96 (packed-seq 4.1.1). 

### Changed

- Much faster paired read filtering, particularly from separate input files which are now decompressed in parallel.
- Faster indexing and index loading.
- While minimizers containing non-ACGT nucleotides were already discarded, minimizer selection could still be influenced by non-ACGT nucleotides present in the window, occasionally impacting results. Enabled by changes in simd-minimizers ≥ 2.0, entire windows containing non-ACGT nucleotides are discarded. Records containing non-ACGT nucleotides may therefore be classified differently in this release.
- Redesigned index format (v3).
  - Index now stores 'concrete' k-mers using 2*k bits rather than 64-bit `xxh3` k-mer hashes.
  - Eliminates [tiny] risk of false positive matches caused by xxh3 collisions.
  - Serialised k-mers are byte-aligned, balancing efficient storage and deserialisation speed.
  - Index disk footprint reduced by 10%.
  - Paves way for painless future adoption of faster HashSet implementations.
  - Paves way for future index introspection functionality.
- `RapidHashSet` (`rapidhash::fast`) replaces the combined use of `xxHash` (`xxh3`) and `FxHashSet`.
- Fails gracefully given empty input files.
- Bugfix for paired read I/O.
- Feature gating for reduced compile times.
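
To illustrate what storing 'concrete' k-mers at 2 bits per base with byte alignment can look like, here is a minimal Python sketch. The encoding table, the little-endian byte order and the function names `pack_kmer` and `unpack_kmer` are assumptions made for illustration; this changelog does not describe Deacon's actual on-disk layout.

```python
# Hypothetical 2-bit nucleotide encoding; Deacon's real encoding may differ.
ENCODING = {"A": 0, "C": 1, "G": 2, "T": 3}
DECODING = "ACGT"


def pack_kmer(kmer: str) -> bytes:
    """Pack a k-mer into 2*k bits, padded up to a whole number of bytes."""
    value = 0
    for base in kmer:
        value = (value << 2) | ENCODING[base]
    n_bytes = (2 * len(kmer) + 7) // 8  # byte-aligned length
    return value.to_bytes(n_bytes, "little")


def unpack_kmer(data: bytes, k: int) -> str:
    """Recover the k-mer string from its packed representation."""
    value = int.from_bytes(data, "little")
    bases = []
    for _ in range(k):
        bases.append(DECODING[value & 3])
        value >>= 2
    return "".join(reversed(bases))
```

Under these assumptions a 31-mer occupies 8 bytes and a 61-mer occupies 16 bytes. Because the k-mer itself is stored rather than a hash of it, two different k-mers can never share a representation.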

### Removed

- Removed `--capacity` argument, which was easily misused for little performance benefit.

## [0.10.0] - 2025-09-01

### Added

- Support for k-mer length up to 57 (previously 32).

## [0.9.0] - 2025-08-15

### Changed

- Performance optimisations deliver up to 80% faster filtering with unchanged accuracy.
  - \>2Gbp/s with uncompressed long read input.
  - \>500Mbp/s with gzip-compressed long read input.

## [0.8.1] - 2025-08-14

### Changed

- Fixes bug handling multiline FASTA input introduced in 0.8.0.
- Fixes bug handling paired reads introduced in 0.8.0 which could lead to mispaired read output.

## [0.8.0] - 2025-08-11

### Added

- Added new independent absolute (`-a`) and relative (`-r`) match thresholds with respective default values of 2 and 0.01 (1%). The new default relative threshold improves specificity for long sequences over the previous absolute-only default threshold without affecting short read accuracy. These replace the previous dual-purpose `-m` parameter, which could accept _either_ an absolute (integer) threshold _or_ a relative (float) threshold.
- `deacon index` now offers the ability to discard minimizers with information content below a specified scaled Shannon `--entropy` (`-e`) threshold. This is disabled by default.
- `deacon filter` now has a `--debug` mode which prints all records with minimizer matches to stderr including the matched minimizer sequence(s).
- The default worst-case hash table capacity preallocation used in `deacon index union` operations can now be overridden with the new `--capacity` (`-c`) argument, in similar fashion to `deacon index build`.
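
As a rough illustration of the `--entropy` option above, the following Python sketch computes a Shannon entropy over nucleotide frequencies and scales it to the range 0 to 1 by dividing by log2(4). This is one plausible reading of "scaled Shannon entropy"; the exact definition Deacon uses is not given here, and the function names are hypothetical.

```python
import math
from collections import Counter


def scaled_shannon_entropy(minimizer: str) -> float:
    """Nucleotide-level Shannon entropy, scaled so ACGT-uniform gives 1.0."""
    n = len(minimizer)
    if n == 0:
        return 0.0
    counts = Counter(minimizer.upper())
    # Sum of p * log2(1/p) over the observed nucleotides.
    entropy = sum((c / n) * math.log2(n / c) for c in counts.values())
    return entropy / 2.0  # log2(4) = 2 bits is the maximum for four symbols


def keep_informative(minimizers, threshold: float):
    """Keep only minimizers whose scaled entropy meets the threshold."""
    return [m for m in minimizers if scaled_shannon_entropy(m) >= threshold]
```

Under this definition a homopolymer such as `AAAA` scores 0.0 and would be discarded by any positive threshold, while a sequence with equal counts of all four bases scores 1.0.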

### Changed

- Filtering performance has improved dramatically on multicore systems due to improved work allocation using the Paraseq library. Filtering at >1Gbp/s is possible with uncompressed long sequences, and >500Mbp/s is achievable on many systems with Gzip-compressed long reads.
- Minimizers containing ambiguous nucleotides are now ignored.

### Removed

- The filtering argument `--matches` (`-m`) has been removed and replaced with `--abs-threshold` (`-a`) and `--rel-threshold` (`-r`).
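
The interplay of the two thresholds that replace `--matches` can be sketched in Python as follows. The sketch assumes that a record counts as a match only when both thresholds are met, and that the relative threshold is measured against the number of minimizers in the record; both points are assumptions rather than statements from this changelog, and `is_match` is a hypothetical name.

```python
def is_match(
    hits: int,
    total_minimizers: int,
    abs_threshold: int = 2,       # default of --abs-threshold (-a)
    rel_threshold: float = 0.01,  # default of --rel-threshold (-r)
) -> bool:
    """Return True if a record satisfies both match thresholds (assumed rule)."""
    if total_minimizers == 0:
        return False
    meets_absolute = hits >= abs_threshold
    meets_relative = hits / total_minimizers >= rel_threshold
    return meets_absolute and meets_relative
```

With the defaults, a short read with 2 hits among 10 minimizers matches, whereas a long read with 2 hits among 1000 minimizers does not, which is the specificity gain for long sequences described above.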




## [0.7.0] - 2025-07-08

### Added

- `deacon index diff` optionally accepts a fastx file or stream in place of a second index. This enables index masking using massive sequence collections without the need to first index them.

### Changed

- Deacon uses the recently added `simd-minimizers::iter_canonical_minimizer_values()`, increasing filtering speed by up to 50% on Linux/x86_64 systems. Speeds of 1Gbp/s are now possible with uncompressed FASTA input.
  - Index format is now version 2. Existing indexes must be rebuilt for use with this version. A new version of the panhuman-1 index is available from Zenodo and object storage. Attempting to load an incompatible index throws an error.
- Position-dependent IUPAC ambiguous base canonicalisation was replaced with a simpler and faster fixed mapping, meaning that records containing ambiguous IUPAC bases may be classified differently to before.
- `deacon index union` now automatically preallocates the required hash table capacity, eliminating slowdowns when combining indexes.
- Compatibility of minimizer _k_ and _w_ is now validated (k+w-1 must be odd) prior to indexing.
- Default index capacity is now 400M (was 500M).
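
The _k_ and _w_ compatibility rule above (k+w-1 must be odd) can be expressed as a small check. This is a minimal Python sketch with a hypothetical function name, not Deacon's own validation code.

```python
def validate_minimizer_params(k: int, w: int) -> None:
    """Raise ValueError unless k and w are positive and k + w - 1 is odd."""
    if k < 1 or w < 1:
        raise ValueError("k and w must be positive integers")
    window_length = k + w - 1  # number of bases spanned by one window
    if window_length % 2 == 0:
        raise ValueError(
            f"k + w - 1 must be odd, got {window_length} for k={k}, w={w}"
        )
```

For example, k=31 and w=15 give a window length of 45 and pass the check, while k=31 and w=16 give 46 and are rejected.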


## [0.6.0] - 2025-06-25

### Added

- Support for the .xz compression format via liblzma.
- Adjustable filter output compression level with `--compression-level`.
- Report fields `seqs_out_proportion` and `bp_out_proportion`.

### Changed

- Use zlib-rs for much faster gzip decompression.
- Displays number and proportion of _retained_ reads and base pairs during filtering.

## [0.5.0] - 2025-06-11

### Added

- `--deplete` (`-d`) flag to remove index matches.
- `--matches` now also accepts a relative threshold (a float between 0.0 and 1.0) for the required proportion of minimizer hits.
- `-O` short argument name for `--output2`.
- Tests.

### Changed

- Default filtering behaviour now _passes_ index matches. Use `--deplete` (`-d`) to remove matches.
- Renamed `--nucleotides` (`-n`) to `--prefix-length` (`-p`).
- Renamed `--report` to `--summary`.

### Removed

- `--invert` argument has been removed.

## [0.4.0] - 2025-05-23

### Added

- Non-interleaved paired output file support (`--output2`).

### Changed

- Faster indexing.
- Renamed `--log` argument to `--report`.
- Filter stats are now always sent to stderr (a JSON report can be written wherever one chooses).
- For paired input sequences, identical minimizer hits in both mates of a read pair are now counted only once.
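
The paired-read counting rule above can be illustrated with set operations. In this minimal Python sketch, minimizers are represented as plain integers and the index as a set; the function name and data representation are assumptions for illustration only.

```python
def count_pair_hits(mate1_minimizers, mate2_minimizers, index: set) -> int:
    """Count distinct index hits across both mates of a read pair."""
    # Taking the union first means a minimizer seen in both mates counts once.
    pair_minimizers = set(mate1_minimizers) | set(mate2_minimizers)
    return len(pair_minimizers & index)
```

For an index `{1, 2, 3}` with mates containing `[1, 2, 9]` and `[2, 3, 8]`, this counts 3 hits, whereas summing per-mate hits would count the shared minimizer `2` twice and give 4.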

## [0.3.0] - 2025-05-09

### Added

- Parallel filtering.
  - Up to 10x faster from initial testing.
  - Configurable with new `--threads` parameter.
  - Uses available CPU cores by default (`--threads 0`).
- Tests.

### Changed

- Default minimizer parameters changed to k=31 and w=15.

## [0.2.0] - 2025-04-23

### Added

- Paired read support using either a third positional argument or interleaved stdin.

### Changed

- Faster indexing.
- More accurate default parameters.
- Optional argument changes.
- Refactored lib.rs.
- Dependency updates.
  - Bincode2.

## [0.1.0] - 2025-03-14

### Added

- Initial experimental release.