k2tools
A fast Rust CLI toolkit for working with kraken2 outputs, particularly for filtering input FASTQs down to one or more taxa based on kraken2's classification.
Motivation
k2tools is largely aimed at those using the excellent kraken2 toolkit for non-metagenomic purposes! kraken2 is frequently used in this context to separate host reads from contaminants, e.g. in human saliva samples, which can contain food particles as well as microbial cells and DNA. Such a workflow tends to look like:
flowchart TD
A[k2 classify]-->B[filter fastqs]
B-->alignment;
alignment-->etc.
However, going from the kraken2 report to a set of filtered FASTQs can be a little challenging. It requires parsing the kraken2 report, which is well structured for human readers but not very machine-readable, and then reading and filtering the kraken2 per-read output and the FASTQs in sync. The KrakenTools repo has a Python script to do this, but it is single-threaded and many times slower than running kraken2 itself.
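To illustrate why the report is awkward to parse, here is a minimal sketch (not k2tools' actual implementation) that reads lines of a standard 6-column kraken2 report. It assumes the usual layout of percentage, clade count, direct count, rank code, taxon ID and name, with the name indented two spaces per level of the tree. The type and function names (`ReportRow`, `parse_report_line`, `parent_ids`) are hypothetical, and the 8-column report with minimizer data is not handled.

```rust
/// One row of a standard 6-column kraken2 report (hypothetical type).
#[derive(Debug, PartialEq)]
pub struct ReportRow {
    pub clade_count: u64,
    pub direct_count: u64,
    pub rank: String,
    pub tax_id: u64,
    pub name: String,
    /// Depth in the tree, recovered from the indentation of the name.
    pub level: usize,
}

/// Parse one tab-separated report line; returns None if it is malformed.
pub fn parse_report_line(line: &str) -> Option<ReportRow> {
    let line = line.trim_end_matches(|c| c == '\r' || c == '\n');
    let fields: Vec<&str> = line.split('\t').collect();
    if fields.len() != 6 {
        return None;
    }
    // The hierarchy is only encoded as leading spaces on the name column.
    let raw_name = fields[5];
    let indent = raw_name.len() - raw_name.trim_start_matches(' ').len();
    Some(ReportRow {
        clade_count: fields[1].trim().parse().ok()?,
        direct_count: fields[2].trim().parse().ok()?,
        rank: fields[3].trim().to_string(),
        tax_id: fields[4].trim().parse().ok()?,
        name: raw_name.trim().to_string(),
        level: indent / 2,
    })
}

/// Recover each row's parent taxon ID by tracking the current path from the root.
/// Rows must be in report order. Level-0 rows (root, unclassified) get None.
pub fn parent_ids(rows: &[ReportRow]) -> Vec<Option<u64>> {
    let mut path: Vec<u64> = Vec::new();
    let mut parents = Vec::with_capacity(rows.len());
    for row in rows {
        // Drop everything at this depth or deeper; what remains ends at the parent.
        path.truncate(row.level);
        parents.push(path.last().copied());
        path.push(row.tax_id);
    }
    parents
}
```

Because the parent of a row is implied only by the rows above it, the report has to be read in order and cannot be processed one line at a time in isolation.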
During our earlier work on fqtk, we developed the pooled-writer crate for multi-file parallel compression and writing. With it, performance scales easily with threads, and the filtering step runs in a fraction of the time kraken2 takes to do the classification!
Available Commands
| Command | Description |
|---|---|
| `filter` | Extract classified/unclassified reads from FASTQ files based on kraken2 results |
| `report-to-tsv` | Convert a kraken2 report to a clean, header-bearing TSV with derived columns |
Installation
Installing with Conda (bioconda)
Installing with Cargo
Building from source
Clone the repository and build:
The binary will be at target/release/k2tools.
Usage
For a list of all commands:
For detailed usage of any command:
filter
Extracts reads from FASTQ files that were classified to one or more taxa by kraken2. Requires three inputs from the same kraken2 run: the report file (-r), the per-read classification output (-k), and the FASTQ file(s) (-i). Supports single-end and paired-end reads, gzip/bgzf-compressed inputs, and writes bgzf-compressed output.
Extract all reads classified as E. coli (taxon 562):
Extract an entire family (taxon 543, Enterobacteriaceae) including all genera, species and strains beneath it:
Extract unclassified reads from a paired-end run:
Combine taxon extraction with unclassified reads in a single pass:
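The selection logic behind these examples can be sketched roughly as follows. This is an illustration under stated assumptions, not k2tools' actual code: the function names `clade_of` and `keep_read` are hypothetical, the child-to-parent map is assumed to have been built from the report beforehand, and the per-read kraken2 output is assumed to carry a plain numeric taxon ID in its third column (i.e. kraken2 was run without `--use-names`).

```rust
use std::collections::{HashMap, HashSet};

/// Collect `root` plus every taxon beneath it, given a child -> parent map.
pub fn clade_of(root: u64, parent_of: &HashMap<u64, u64>) -> HashSet<u64> {
    // Invert the map so we can walk from a parent down to its children.
    let mut children: HashMap<u64, Vec<u64>> = HashMap::new();
    for (&child, &parent) in parent_of {
        children.entry(parent).or_default().push(child);
    }
    let mut clade = HashSet::new();
    let mut stack = vec![root];
    while let Some(tax_id) = stack.pop() {
        // `insert` returns false for already-visited taxa, which guards against cycles.
        if clade.insert(tax_id) {
            if let Some(kids) = children.get(&tax_id) {
                stack.extend(kids.iter().copied());
            }
        }
    }
    clade
}

/// Decide whether one line of per-read kraken2 output selects its read.
/// Columns: C/U status, read ID, taxon ID, length, LCA mapping.
/// Returns None for lines that cannot be parsed.
pub fn keep_read(line: &str, wanted: &HashSet<u64>, keep_unclassified: bool) -> Option<bool> {
    let mut fields = line.split('\t');
    let status = fields.next()?;
    let _read_id = fields.next()?;
    let tax_id: u64 = fields.next()?.trim().parse().ok()?;
    match status {
        "U" => Some(keep_unclassified),
        "C" => Some(wanted.contains(&tax_id)),
        _ => None,
    }
}
```

Since kraken2 writes its per-read output in the same order as the input reads, a filter can walk that output and the FASTQ file(s) in lockstep, applying a decision like `keep_read` to each record.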
report-to-tsv
Converts a kraken2 report (standard 6-column or extended 8-column format) into a clean TSV with clearly named columns, derived parent information, taxonomy level, descendant counts, and sequence fraction columns.
# Write to a file
# Write to stdout for piping
Output columns:
| Column | Description |
|---|---|
| `tax_id` | NCBI taxonomy ID |
| `name` | Scientific name |
| `rank` | Taxonomic rank code (e.g. `S`, `G`, `D1`) |
| `level` | Depth in the taxonomy tree (0 for root and unclassified) |
| `parent_tax_id` | Parent taxon ID (empty for root/unclassified) |
| `parent_rank` | Parent rank code (empty for root/unclassified) |
| `clade_count` | Fragments in the clade rooted at this taxon |
| `direct_count` | Fragments assigned directly to this taxon |
| `descendant_count` | `clade_count` minus `direct_count` |
| `frac_clade` | `clade_count` / `total_sequences` |
| `frac_direct` | `direct_count` / `total_sequences` |
| `frac_descendant` | `descendant_count` / `total_sequences` |
| `minimizer_count` | Minimizers in clade (empty if not in report) |
| `distinct_minimizer_count` | Distinct minimizers (empty if not in report) |
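The arithmetic behind the derived count and fraction columns is simple enough to sketch. The snippet below is an illustration only: the `Derived` struct and `derive` function are hypothetical names, and it assumes `total_sequences` is supplied by the caller (for example as the clade count of the root plus the unclassified count, which is an assumption rather than something stated here). Returning 0.0 for an empty report is likewise a choice made for this sketch.

```rust
/// Derived columns for one taxon (hypothetical type mirroring the table above).
#[derive(Debug)]
pub struct Derived {
    pub descendant_count: u64,
    pub frac_clade: f64,
    pub frac_direct: f64,
    pub frac_descendant: f64,
}

/// Compute the derived columns from the two counts kraken2 reports.
pub fn derive(clade_count: u64, direct_count: u64, total_sequences: u64) -> Derived {
    // Fragments assigned to taxa strictly below this one.
    let descendant_count = clade_count.saturating_sub(direct_count);
    // Fractions are in the range 0..=1 rather than percentages.
    let frac = |n: u64| {
        if total_sequences == 0 {
            0.0
        } else {
            n as f64 / total_sequences as f64
        }
    };
    Derived {
        descendant_count,
        frac_clade: frac(clade_count),
        frac_direct: frac(direct_count),
        frac_descendant: frac(descendant_count),
    }
}
```

For a taxon with a clade count of 90 and a direct count of 5 in a run of 100 sequences, this yields a descendant count of 85 and fractions of 0.9, 0.05 and 0.85.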
Output Format
k2tools produces clean TSV output designed for easy downstream consumption:
- Lowercase `snake_case` headers (e.g. `clade_count`, `frac_direct`)
- Tab-separated with no metadata or comment lines
- Fractions use `frac_` prefix (e.g. `frac_clade` not `pct_clade`)
Resources
- Releases
- Issues: Report a bug or request a feature
- Pull requests: Submit a patch or new feature
- Contributors guide
- License: Released under the MIT license
Authors
Disclaimer
This software is under active development. While we make a best effort to test this software and to fix issues as they are reported, this software is provided as-is without any warranty (see the license for details). Please submit an issue, and better yet a pull request as well, if you discover a bug or identify a missing feature.