k2tools 0.1.0

Rust CLI toolkit for working with kraken2 outputs

k2tools

A fast Rust CLI toolkit for working with kraken2 outputs, particularly for filtering input FASTQs down to one or more taxa based on kraken2's classification.

Motivation

k2tools is largely aimed at those using the excellent kraken2 toolkit for non-metagenomic purposes! kraken2 is frequently used in this context to separate host reads from contaminants, e.g. in human saliva samples, which can contain food particles as well as microbial cells and DNA. Such a workflow tends to follow a process like:

flowchart TD
    A[k2 classify]-->B[filter fastqs]
    B-->alignment;
    alignment-->etc.

However, going from the kraken2 report to a set of filtered FASTQs can be a little challenging. It requires parsing the kraken2 report, which is well structured for humans but not very machine-readable, and then reading and filtering the kraken2 output and the FASTQs in sync. The KrakenTools repo has a Python script to do this, but it is single-threaded and many times slower than running kraken2 itself.
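To illustrate why the report is awkward to parse, here is a minimal Python sketch (not k2tools' actual Rust implementation) that recovers each taxon's depth and parent from the indentation of the name column. It assumes the usual kraken2 report layout: tab-separated fields, two spaces of indentation per tree level, and rank code, taxon ID and name as the last three columns. The function name `parse_report` and the toy report are hypothetical, and the toy tree is much shallower than a real taxonomy.

```python
def parse_report(lines):
    """Parse kraken2 report lines, deriving depth and parent from name indentation."""
    rows = []
    stack = []  # stack[d] holds the tax_id of the latest taxon seen at depth d
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        fields = line.split("\t")
        # Counts are columns 2 and 3; rank, tax_id and name are the last three
        # columns in both the 6-column and the 8-column layout.
        clade_count, direct_count = int(fields[1]), int(fields[2])
        rank, tax_id, name = fields[-3], int(fields[-2]), fields[-1]
        # Assumption: two leading spaces per level of the taxonomy tree.
        depth = (len(name) - len(name.lstrip(" "))) // 2
        del stack[depth:]  # drop taxa that are not ancestors of this row
        parent = stack[-1] if stack else None
        stack.append(tax_id)
        rows.append({
            "tax_id": tax_id,
            "name": name.strip(),
            "rank": rank,
            "level": depth,
            "parent_tax_id": parent,
            "clade_count": clade_count,
            "direct_count": direct_count,
        })
    return rows


# Toy report for illustration only (made-up counts, simplified tree).
REPORT = """\
 10.00\t10\t10\tU\t0\tunclassified
 90.00\t90\t5\tR\t1\troot
 60.00\t60\t0\tD\t2\t  Bacteria
 60.00\t60\t20\tG\t561\t    Escherichia
 40.00\t40\t40\tS\t562\t      Escherichia coli
 25.00\t25\t25\tD\t2759\t  Eukaryota
"""

rows = parse_report(REPORT.splitlines())
parents = {row["tax_id"]: row["parent_tax_id"] for row in rows}
```

The stack is what makes this a little fiddly: the parent of a row is only known from the rows that came before it, so the report has to be read in order.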

Thanks to our earlier work on fqtk, we developed the pooled-writer crate for multi-file parallel compression and writing. Using that crate, it's easy to scale performance with threads and to perform the filtering step in a fraction of the time it takes kraken2 to do the classification!

Available Commands

Command        Description
filter         Extract classified/unclassified reads from FASTQ files based on kraken2 results
report-to-tsv  Convert a kraken2 report to a clean, header-bearing TSV with derived columns

Installation

Installing with Conda (bioconda)

conda install -c bioconda k2tools

Installing with Cargo

cargo install k2tools

Building from source

Clone the repository and build:

git clone https://github.com/fulcrumgenomics/k2tools.git
cd k2tools
cargo build --release

The binary will be at target/release/k2tools.

Usage

For a list of all commands:

k2tools --help

For detailed usage of any command:

k2tools <command> --help

filter

Extracts reads from FASTQ files that were classified to one or more taxa by kraken2. Requires three inputs from the same kraken2 run: the report file (-r), the per-read classification output (-k), and the FASTQ file(s) (-i). Supports single-end and paired-end reads, gzip/bgzf-compressed inputs, and writes bgzf-compressed output.

Extract all reads classified as E. coli (taxon 562):

k2tools filter \
    -r report.txt -k kraken_output.txt \
    -i reads.fq.gz -o ecoli.fq.gz \
    -t 562

Extract an entire family (Enterobacteriaceae, taxon 543) including all genera, species and strains beneath it:

k2tools filter \
    -r report.txt -k kraken_output.txt \
    -i reads.fq.gz -o entero.fq.gz \
    -t 543 -d

Extract unclassified reads from a paired-end run:

k2tools filter \
    -r report.txt -k kraken_output.txt \
    -i r1.fq.gz r2.fq.gz -o unclass_r1.fq.gz unclass_r2.fq.gz \
    -u

Combine taxon extraction with unclassified reads in a single pass:

k2tools filter \
    -r report.txt -k kraken_output.txt \
    -i reads.fq.gz -o human_and_unclass.fq.gz \
    -t 9606 -d -u
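The selection logic behind these invocations can be sketched in a few lines of Python. This is an illustration of the idea only, not how k2tools is implemented: it is single-threaded, single-end, and handles no compression. It assumes the standard kraken2 per-read output (tab-separated, with C/U status, read ID and taxon ID as the first three columns) and that reads appear in the same order as in the FASTQ. The names `clade`, `fastq_records` and `filter_reads`, and all of the toy data, are hypothetical.

```python
from itertools import islice, zip_longest


def clade(parent_of, tax_id):
    """Return tax_id plus all of its descendants (the effect described for -d)."""
    children = {}
    for child, parent in parent_of.items():
        children.setdefault(parent, []).append(child)
    members, todo = set(), [tax_id]
    while todo:
        node = todo.pop()
        members.add(node)
        todo.extend(children.get(node, []))
    return members


def fastq_records(lines):
    """Yield FASTQ records as lists of four lines."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def filter_reads(kraken_lines, fastq_lines, wanted_taxa, keep_unclassified=False):
    """Walk the kraken2 output and the FASTQ in lockstep, yielding kept records."""
    pairs = zip_longest(kraken_lines, fastq_records(fastq_lines))
    for k_line, record in pairs:
        if k_line is None or record is None:
            raise ValueError("kraken2 output and FASTQ have different lengths")
        status, read_id, tax_id = k_line.rstrip("\n").split("\t")[:3]
        if record[0][1:].split()[0] != read_id:
            raise ValueError(f"out of sync at read {read_id}")
        if status == "U":
            keep = keep_unclassified
        else:
            keep = int(tax_id) in wanted_taxa
        if keep:
            yield record


# Toy inputs for illustration only.
PARENT_OF = {1: None, 543: 1, 561: 543, 562: 561, 9606: 1}
KRAKEN = [
    "C\tread1\t562\t4\t562:1",
    "U\tread2\t0\t4\t0:1",
    "C\tread3\t9606\t4\t9606:1",
]
FASTQ = [
    "@read1 sample=a", "ACGT", "+", "IIII",
    "@read2", "GGCC", "+", "IIII",
    "@read3", "TTAA", "+", "IIII",
]

wanted = clade(PARENT_OF, 543)
kept = list(filter_reads(KRAKEN, FASTQ, wanted))
kept_with_unclassified = list(filter_reads(KRAKEN, FASTQ, wanted, keep_unclassified=True))
```

Checking the read ID on every record is cheap and catches the most common failure mode, which is pairing a kraken2 output with a FASTQ from a different run.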

report-to-tsv

Converts a kraken2 report (standard 6-column or extended 8-column format) into a clean TSV with clearly named columns, derived parent information, taxonomy level, descendant counts, and sequence fraction columns.

# Write to a file
k2tools report-to-tsv -r kraken2_report.txt -o report.tsv

# Write to stdout for piping
k2tools report-to-tsv -r kraken2_report.txt | head

Output columns:

Column                    Description
tax_id                    NCBI taxonomy ID
name                      Scientific name
rank                      Taxonomic rank code (e.g. S, G, D1)
level                     Depth in the taxonomy tree (0 for root and unclassified)
parent_tax_id             Parent taxon ID (empty for root/unclassified)
parent_rank               Parent rank code (empty for root/unclassified)
clade_count               Fragments in the clade rooted at this taxon
direct_count              Fragments assigned directly to this taxon
descendant_count          clade_count minus direct_count
frac_clade                clade_count / total_sequences
frac_direct               direct_count / total_sequences
frac_descendant           descendant_count / total_sequences
minimizer_count           Minimizers in clade (empty if not in report)
distinct_minimizer_count  Distinct minimizers (empty if not in report)
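The arithmetic behind the derived columns is simple enough to write out. The Python sketch below follows the definitions in the table; how k2tools itself determines total_sequences is not stated here, so the sketch takes it as a parameter. The function name `derive_columns` and the numbers are hypothetical.

```python
def derive_columns(clade_count, direct_count, total_sequences):
    """Compute the derived count and fraction columns for one taxon."""
    descendant_count = clade_count - direct_count
    return {
        "descendant_count": descendant_count,
        "frac_clade": clade_count / total_sequences,
        "frac_direct": direct_count / total_sequences,
        "frac_descendant": descendant_count / total_sequences,
    }


# Made-up example: 90 fragments in the clade, 5 assigned directly, 100 in total.
derived = derive_columns(clade_count=90, direct_count=5, total_sequences=100)
```

Because the fractions share a denominator, frac_clade equals frac_direct plus frac_descendant (up to floating-point rounding).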

Output Format

k2tools produces clean TSV output designed for easy downstream consumption:

  • Lowercase snake_case headers (e.g. clade_count, frac_direct)
  • Tab-separated with no metadata or comment lines
  • Fractions use frac_ prefix (e.g. frac_clade not pct_clade)
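Because the output is a plain headed TSV, it can be read with a standard CSV reader and no preprocessing. The Python sketch below uses the column names listed above; the helper `load_report_tsv`, the type sets and the two toy rows are hypothetical, and empty cells are mapped to None.

```python
import csv
import io

INT_COLUMNS = {
    "tax_id", "level", "parent_tax_id", "clade_count", "direct_count",
    "descendant_count", "minimizer_count", "distinct_minimizer_count",
}
FLOAT_COLUMNS = {"frac_clade", "frac_direct", "frac_descendant"}


def load_report_tsv(handle):
    """Read a report-to-tsv style file into typed dicts (empty cells become None)."""
    rows = []
    for raw in csv.DictReader(handle, delimiter="\t"):
        row = {}
        for key, value in raw.items():
            if not value:
                row[key] = None
            elif key in INT_COLUMNS:
                row[key] = int(value)
            elif key in FLOAT_COLUMNS:
                row[key] = float(value)
            else:
                row[key] = value
        rows.append(row)
    return rows


# Toy TSV for illustration only (made-up values).
HEADER = [
    "tax_id", "name", "rank", "level", "parent_tax_id", "parent_rank",
    "clade_count", "direct_count", "descendant_count",
    "frac_clade", "frac_direct", "frac_descendant",
    "minimizer_count", "distinct_minimizer_count",
]
TOY_ROWS = [
    ["1", "root", "R", "0", "", "", "90", "5", "85", "0.9", "0.05", "0.85", "", ""],
    ["562", "Escherichia coli", "S", "3", "561", "G", "40", "40", "0", "0.4", "0.4", "0.0", "", ""],
]
TSV_TEXT = "\n".join("\t".join(r) for r in [HEADER] + TOY_ROWS) + "\n"

table = load_report_tsv(io.StringIO(TSV_TEXT))
# Species that account for at least 10% of all sequences.
abundant_species = [
    row["name"] for row in table
    if row["rank"] == "S" and row["frac_clade"] >= 0.1
]
```

For a real file, pass an open file handle instead of the `io.StringIO` wrapper.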

Disclaimer

This software is under active development. While we make a best effort to test this software and to fix issues as they are reported, this software is provided as-is without any warranty (see the license for details). Please submit an issue, and better yet a pull request as well, if you discover a bug or identify a missing feature.