# FastDedup
FastDedup (FDedup) is a fast and memory-efficient FASTX PCR deduplication tool written in Rust. It utilizes needletail for high-performance sequence parsing, xxh3 for rapid hashing, and fxhash for a low-overhead memory cache.
A paper is in preparation; you can check it here.
## Features
- Fast & Memory Efficient: Uses zero-allocation sequence parsing and a non-cryptographic high-speed hashing cache, which automatically scales based on the estimated input file size.
- Supports Compressed Formats: Transparently reads and writes both uncompressed and GZIP-compressed (`.gz`) FASTQ/FASTA files.
- Incremental Deduplication & Auto-Recovery: By default, FDedup appends new sequences to an existing output file, safely pre-loading its existing hashes to prevent duplicates. If an uncompressed output file is corrupted due to a previous crash, FDedup automatically truncates it to the last valid sequence and resumes safely.
## Requirements
To build FDedup from source, you need the following dependencies installed:
## Installation
### From pre-compiled binaries or container images
You can download the latest pre-compiled binaries from the releases page.
### From Cargo
You can install FastDedup directly from Cargo:
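For example, assuming the crate is published on crates.io under the name `fastdedup` (an assumption; check crates.io for the exact crate name):

```shell
# Build and install the release binary from crates.io (crate name assumed)
cargo install fastdedup
```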
### From bioconda
Bioconda approval is in progress.
## Usage
- `-1, --input <INPUT>`: Path to the input FASTA/FASTQ/GZ file (R1 or single-end).
- `-2, --input-r2 <INPUT_R2>`: Path to the input R2 file (optional; enables paired-end mode).
- `-o, --output <OUTPUT>`: Path to the output file (R1 or single-end). Defaults to `output_R1.fastq.gz`.
- `-p, --output-r2 <OUTPUT_R2>`: Path to the output R2 file (required if `-2` is provided).
- `-f, --force`: Overwrite the output file if it exists (instead of pre-loading hashes and appending).
- `-v, --verbose`: Print processing stats, such as execution time, number of sequences, and duplication rates.
- `-s, --dry-run`: Calculate the duplication rate without creating an output file.
- `-t, --threshold <THRESHOLD>`: Threshold for automatic hash size selection$^1$ (default: 0.01).
- `-H, --hash <HASH>`: Manually specify the hash size (64 or 128 bits).
$^1$: The collision probability $p$ is estimated as $p = \frac{x^2}{2 \cdot 2^{64}}$, where $x$ is the estimated number of hashes. If this probability exceeds the specified threshold, FDedup automatically switches to 128-bit hashing to make the risk of collisions negligible.
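Rearranging this formula gives the number of sequences at which a given threshold $p$ is reached; for instance, at the 1‰ level:

```latex
p = \frac{x^2}{2 \cdot 2^{64}}
\;\Longrightarrow\;
x = \sqrt{2 \cdot 2^{64} \cdot p},
\qquad
p = 10^{-3} \Rightarrow x \approx 1.9 \times 10^{8}.
```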
Note: you need $\sqrt{2 \cdot 2^{64} \cdot 10^{-3}} \approx 0.19 \cdot 10^9$ sequences to have a 1‰ chance of collision with 64-bit hashing, and $0.28 \cdot 10^{17}$ sequences to have the same chance with 128-bit hashing.
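As an illustration, assuming the installed binary is named `fastdedup` (adjust to the actual executable name), typical invocations using the flags above might look like:

```shell
# Single-end: deduplicate R1 reads, print stats, write gzipped output
fastdedup -1 reads_R1.fastq.gz -o dedup_R1.fastq.gz -v

# Dry run: report the duplication rate without writing any output
fastdedup -1 reads_R1.fastq.gz --dry-run -v

# Force 128-bit hashing instead of automatic selection
fastdedup -1 reads_R1.fastq.gz -o dedup_R1.fastq.gz -H 128
```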
### Run it from Cargo
You can run it directly from Cargo:
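For example (again assuming the crate name; flags after `--` are passed through to FDedup):

```shell
# Run a release build straight from the source checkout
cargo run --release -- -1 reads_R1.fastq.gz -o dedup_R1.fastq.gz -v
```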
### Run with Pixi
You can also use Pixi to run it:
### Run with Singularity / Apptainer
You can download the latest release and run the containerized version of FDedup:
Using Apptainer:
Using Singularity:
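A sketch of both invocations, assuming the downloaded image is named `fdedup.sif` (check the release assets for the actual file name):

```shell
# Apptainer: arguments after the image are passed to FDedup
apptainer run fdedup.sif -1 reads_R1.fastq.gz -o dedup_R1.fastq.gz -v

# Singularity: same syntax with the older runtime
singularity run fdedup.sif -1 reads_R1.fastq.gz -o dedup_R1.fastq.gz -v
```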
Note: `--force` is very slow when used in a Singularity container. We recommend simply deleting the output file before running the container if you want to start from scratch.
You can build the container yourself using pixitainer:
- Install pixitainer:
- Build the container:

```shell
pixi containerize
```
## Recommendations
If you are using FDedup as a pre-processing step, we recommend not exporting your output to `.gz` format. After a crash, FDedup cannot resume from a compressed file, and you will lose all progress: a corrupted gzip stream makes the file unreadable, forcing you to start from scratch with `--force`. If you output to an uncompressed format instead, FDedup will automatically detect any crash-induced corruption, safely truncate the file to the last valid sequence, and seamlessly resume deduplication.
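Concretely (binary name assumed, and assuming output compression is inferred from the file extension), an interruptible pre-processing run would write uncompressed output:

```shell
# Uncompressed output lets FDedup truncate and resume after a crash;
# re-running the same command pre-loads existing hashes and continues
fastdedup -1 reads_R1.fastq.gz -o dedup_R1.fastq -v
```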
## To-Do List
- Support for Paired-End read deduplication.
- Add Multithreading to parallelize sequence hashing and processing.
- Support tracking sequence abundances (counts) instead of naive discarding.
- Add an option to export sequences as FASTA.
- Improve error handling.
## License
This project is licensed under the MIT License. See the LICENSE file for details.
## Authors
Raphaël Ribes (coding and design)
Céline Mandier (design)
## Acknowledgements
Computations were performed on the ISDM-MESO HPC platform, funded in the framework of State-region planning contracts (Contrat de plan État-région – CPER) by the French Government, the Occitanie/Pyrénées-Méditerranée Region, Montpellier Méditerranée Métropole, and the University of Montpellier.