kractor
kraken extractor
Kractor extracts sequencing reads from FASTQ files (uncompressed, .gz, or .bz2) using taxonomic classifications obtained via Kraken2.
It supports single or paired-end reads, can optionally include taxonomic parents or children, and uses minimal memory (~4.5 MB for a 17 GB FASTQ file).
The end result is a FASTQ or FASTA file containing all reads classified as the specified taxon(s).
Kractor is significantly faster than KrakenTools for both paired and unpaired reads.
Performance vs KrakenTools:
- Paired compressed FASTQ: ~21× faster
- Paired uncompressed FASTQ: ~10× faster
- Unpaired: ~4× faster (compressed or uncompressed)
For additional details, refer to the benchmarks.
Motivation
Kractor provides functionality similar to the KrakenTools extract_kraken_reads Python script.
However, the main motivation was to improve speed when processing multiple large FASTQ files, and to serve as a way to learn Rust.
Installation
Binaries:
Precompiled binaries for Linux, macOS and Windows are attached to the latest release.
Conda:
conda install -c bioconda kractor
Docker:
A Docker image is available on quay.io:
docker pull quay.io/biocontainers/kractor
Cargo:
Requires cargo
cargo install kractor
Build from source:
Install the Rust toolchain:
To install, please refer to the Rust documentation.
Clone the repository:
Build and add to path:
All executables will be in the directory Kractor/target/release.
Usage
Examples:
# Extract reads classified as E. coli from single end reads
# Extract from paired end reads
# Extract multiple taxids (Bacillaceae and Listeriaceae)
# Extract all children of Enterobacteriaceae family (requires kraken report)
# Extract everything EXCEPT viral reads (using --exclude)
# Output FASTA format instead of FASTQ
Summary statistics
Use --summary to get summary statistics as a JSON report (written to stdout on completion).
Arguments:
Required:
Input
-i, --input
Specifies one or two input FASTQ files to extract reads from. Files may be uncompressed or compressed (gz, bz2).
Paired end reads can be specified by:
Using --input twice: -i <R1_fastq_file> -i <R2_fastq_file>
Using --input once but passing both files: -i <R1_fastq_file> <R2_fastq_file>
Output
-o, --output
Specifies the output file(s) for extracted reads, matching the order of the input files. Compression type is inferred from the file extension (.gz, .bz2). If not recognised, output will be uncompressed.
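The extension-based inference described above can be illustrated with a minimal Python sketch. This is not Kractor's actual (Rust) implementation; the function name `infer_compression` is hypothetical, and the return values simply mirror the `gz`, `bz2` and `none` values accepted by --compression-format.

```python
from pathlib import Path


def infer_compression(output_path: str) -> str:
    """Guess the output compression from the file extension.

    Hypothetical helper for illustration only: returns "gz" or "bz2" for
    recognised extensions and "none" (uncompressed) for anything else.
    """
    suffix = Path(output_path).suffix.lower()
    if suffix == ".gz":
        return "gz"
    if suffix == ".bz2":
        return "bz2"
    # Unrecognised extension: fall back to uncompressed output.
    return "none"
```

Under this sketch, `out_R1.fastq.gz` maps to gzip, `out_R1.fastq.bz2` to bzip2, and `out_R1.fastq` to uncompressed output.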
Kraken Output
-k, --kraken
Path to the Standard Kraken Output Format file, containing taxonomic classification of read IDs.
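To show what information this file contributes, here is a minimal Python sketch that collects the read IDs assigned to a set of taxids. It assumes the standard tab-separated Kraken2 output (classification status, read ID, taxid, length, LCA mapping) with a plain numeric taxid column, i.e. Kraken2 run without --use-names. The function name `read_ids_for_taxids` and the example records are made up for illustration; this is not how Kractor itself is implemented.

```python
import io
from typing import Set, TextIO


def read_ids_for_taxids(kraken_output: TextIO, taxids: Set[int]) -> Set[str]:
    """Return the IDs of reads whose assigned taxid is in `taxids`."""
    read_ids = set()
    for line in kraken_output:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        read_id, taxid = fields[1], fields[2]
        # Assumption: the taxid column is a bare integer.
        if taxid.isdigit() and int(taxid) in taxids:
            read_ids.add(read_id)
    return read_ids


# Made-up example records in standard Kraken output format.
example = (
    "C\tread1\t562\t150\t562:116\n"
    "U\tread2\t0\t150\t0:116\n"
    "C\tread3\t1386\t150\t1386:116\n"
)
e_coli_reads = read_ids_for_taxids(io.StringIO(example), {562})
```

The resulting set of read IDs is what decides which FASTQ records are written to the output.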
Taxid
-t, --taxid
One or more taxonomic IDs to extract.
For example: -t 1 2 10
Each taxid is affected by --exclude, --parents, and --children if those options are used.
Optional:
Output type
--compression-format
Manually set output compression format, overriding what is inferred from file names.
Valid values:
- gz: gzip compression
- bz2: bzip2 compression
- none: no compression
Compression level
--compression-level
Set compression level (1–9).
- 1 = fastest, largest file
- 9 = slowest, smallest file
Default: 2 (balance of speed and size)
Output fasta
--output-fasta
Output sequences in FASTA format instead of FASTQ.
Kraken Report
-r, --report
Path to the Kraken2 report file. Required if using --parents or --children.
Parents
--parents
Include reads classified between the root and the specified --taxid. Requires --report.
Children
--children
Include reads classified at the given taxid and all its descendant taxa. Requires --report.
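The reason --children needs the report is that the report encodes the taxonomy tree. The following Python sketch shows one way descendant taxids could be gathered from it. It assumes the standard Kraken2 report layout, where the last two tab-separated columns are the taxid and the scientific name, the name is indented by two spaces per tree level, and rows are listed depth-first. The function name `collect_children` and the toy report are hypothetical; Kractor's own implementation may differ.

```python
from typing import Iterable, Set


def collect_children(report_lines: Iterable[str], target_taxid: int) -> Set[int]:
    """Return `target_taxid` plus every taxid nested beneath it."""
    collected = set()
    target_depth = None
    for line in report_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        taxid, name = int(fields[-2]), fields[-1]
        # Depth in the tree is encoded as two leading spaces per level.
        depth = (len(name) - len(name.lstrip(" "))) // 2
        if target_depth is None:
            if taxid == target_taxid:
                target_depth = depth
                collected.add(taxid)
        elif depth > target_depth:
            collected.add(taxid)  # still inside the target's subtree
        else:
            break  # left the subtree; rows are depth-first, so stop here
    return collected


# Toy report with made-up counts (the real taxonomy has more levels).
toy_report = [
    "100.00\t100\t0\tR\t1\troot",
    "100.00\t100\t0\tD\t2\t  Bacteria",
    " 80.00\t80\t5\tF\t543\t    Enterobacteriaceae",
    " 50.00\t50\t10\tG\t561\t      Escherichia",
    " 40.00\t40\t40\tS\t562\t        Escherichia coli",
    " 25.00\t25\t25\tG\t590\t      Salmonella",
    " 20.00\t20\t20\tF\t186817\t    Bacillaceae",
]
family_taxids = collect_children(toy_report, 543)
```

Here the family taxid 543 expands to itself plus the genera and species listed beneath it, while the sibling family is left out.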
Exclude
--exclude
Extract all reads except those matching the given taxids. Can be combined with --parents or --children.
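Conceptually, --exclude inverts the selection. The sketch below assumes that the taxid set is first expanded by --parents or --children and then inverted; that ordering is an assumption made for illustration, and the function name `keep_read` is hypothetical.

```python
from typing import Set


def keep_read(read_taxid: int, selected: Set[int], exclude: bool = False) -> bool:
    """Decide whether a read with `read_taxid` is written to the output.

    `selected` is assumed to already contain any parent or child taxids.
    """
    matches = read_taxid in selected
    # With exclude set, keep exactly the reads that do NOT match.
    return not matches if exclude else matches
```

For example, with `selected = {10239}` and `exclude=True`, a read assigned to taxid 562 is kept and a read assigned to 10239 is dropped.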
JSON report
--summary
Write a JSON report to stdout after processing.
Citation
Sam Sims. (2025). Sam-Sims/kractor: kractor-1.0.1 (kractor-1.0.1). Zenodo. https://doi.org/10.5281/zenodo.15761838