kmerust
A fast, parallel k-mer counter for DNA sequences in FASTA files.
Features
- Fast parallel processing using rayon and dashmap
- Canonical k-mers - outputs the lexicographically smaller of each k-mer and its reverse complement
- Flexible k-mer lengths from 1 to 32
- Handles N bases by skipping invalid k-mers
- Jellyfish-compatible output format for easy integration with existing pipelines
- Tested for accuracy against Jellyfish
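The canonical form described above can be sketched in a few lines of standard-library Rust. This is an illustrative sketch only, not kmerust's implementation: the function names `reverse_complement` and `canonical` are hypothetical, and it assumes uppercase input, returning `None` for any k-mer that contains a base other than A, C, G, or T.

```rust
/// Returns the reverse complement of a DNA string, or `None` if it
/// contains a base other than A, C, G, or T (for example, N).
/// Hypothetical helper for illustration; not part of the kmerust API.
fn reverse_complement(kmer: &str) -> Option<String> {
    kmer.chars()
        .rev()
        .map(|base| match base {
            'A' => Some('T'),
            'C' => Some('G'),
            'G' => Some('C'),
            'T' => Some('A'),
            _ => None,
        })
        // Collecting into Option<String> yields None if any base was invalid.
        .collect()
}

/// Returns the canonical form: the lexicographically smaller of the
/// k-mer and its reverse complement.
fn canonical(kmer: &str) -> Option<String> {
    let rc = reverse_complement(kmer)?;
    Some(if rc.as_str() < kmer { rc } else { kmer.to_string() })
}
```

Under these assumptions, `ATGCC` and its reverse complement `GGCAT` both map to `ATGCC`, so the two strands are counted as a single k-mer.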
Installation
From crates.io
From source
Usage
Arguments
- <k> - K-mer length (1-32)
- <path> - Path to a FASTA file (use - or omit for stdin)
Options
- -f, --format <FORMAT> - Output format: fasta (default), tsv, or json
- -m, --min-count <N> - Minimum count threshold (default: 1)
- -q, --quiet - Suppress informational output
- -h, --help - Print help information
- -V, --version - Print version information
Examples
Count 21-mers in a FASTA file:
Count 5-mers:
Unix Pipeline Integration
kmerust reads from stdin when the path is - or omitted, so it can be used in Unix pipelines:
# Pipe from another command
|
# Decompress and count
|
# Sample reads and count
|
# Explicit stdin marker
|
Output Formats
Use --format to choose the output format:
# TSV format (tab-separated)
# JSON format
# FASTA-like format (default)
FASTA Readers
kmerust supports two FASTA readers via feature flags:
- rust-bio (default) - Uses the rust-bio library
- needletail - Uses the needletail library
To use needletail instead:
Production Features
Enable production features for additional capabilities:
Or enable individual features:
- gzip - Read gzip-compressed FASTA files (.fa.gz)
- mmap - Memory-mapped I/O for large files
- tracing - Structured logging and diagnostics
Gzip Compressed Input
With the gzip feature, kmerust can directly read gzip-compressed files:
Tracing/Logging
With the tracing feature, use the RUST_LOG environment variable for diagnostic output:
RUST_LOG=kmerust=debug
Output Format
By default, output is written to stdout in a FASTA-like format, with the count on the header line and the canonical k-mer on the next line:
>{count}
{canonical_kmer}
Example output:
>114928
ATGCC
>289495
AATCA
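For downstream tools that consume this output, the two-line record layout can be read back with a small parser. The following is a minimal sketch that assumes the layout shown above (a `>` header carrying the count, followed by one k-mer line); `parse_fasta_like` is a hypothetical helper name and not part of kmerust.

```rust
/// Parses FASTA-like k-mer output into (k-mer, count) pairs.
/// Hypothetical helper for illustration; not part of the kmerust API.
fn parse_fasta_like(text: &str) -> Result<Vec<(String, u64)>, String> {
    let mut records = Vec::new();
    let mut lines = text.lines();
    while let Some(header) = lines.next() {
        // Tolerate blank lines between records.
        if header.trim().is_empty() {
            continue;
        }
        // The header line holds the count after a leading '>'.
        let count = header
            .strip_prefix('>')
            .ok_or_else(|| format!("expected '>' header, got {header:?}"))?
            .trim()
            .parse::<u64>()
            .map_err(|e| format!("bad count in {header:?}: {e}"))?;
        // The line after the header holds the k-mer itself.
        let kmer = lines
            .next()
            .ok_or_else(|| format!("missing k-mer line after {header:?}"))?;
        records.push((kmer.trim().to_string(), count));
    }
    Ok(records)
}
```

Returning a `Result` lets the caller decide how to handle truncated or malformed input instead of silently dropping records.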
Library Usage
kmerust can also be used as a library:
use count_kmers;
use std::path::PathBuf;
Progress Reporting
Monitor progress during long-running operations:
use count_kmers_with_progress;
Memory-Mapped I/O
For large files, use memory-mapped I/O (requires mmap feature):
use count_kmers_mmap;
Streaming API
For memory-efficient processing:
use count_kmers_streaming;
Reading from Any Source
Count k-mers from any BufRead source, including stdin or in-memory data:
use count_kmers_from_reader;
use std::io::BufReader;
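To illustrate the idea of counting from any BufRead source, here is a standard-library sketch of a reader-based counter. It is not the kmerust API: `count_from_reader` and `tally` are hypothetical names, the counter is single-threaded, it does not canonicalize k-mers, and it assumes uppercase bases. K-mers containing N or any other non-ACGT byte are skipped, mirroring the behavior listed under Features.

```rust
use std::collections::HashMap;
use std::io::{self, BufRead};

/// Counts k-mers across all records read from a FASTA-formatted reader.
/// Hypothetical sketch; not part of the kmerust API.
fn count_from_reader<R: BufRead>(reader: R, k: usize) -> io::Result<HashMap<String, u64>> {
    let mut counts = HashMap::new();
    let mut seq = String::new();
    for line in reader.lines() {
        let line = line?;
        if line.starts_with('>') {
            // A new header ends the previous record, so count it now.
            tally(&seq, k, &mut counts);
            seq.clear();
        } else {
            // Sequence data may span several lines; join them.
            seq.push_str(line.trim());
        }
    }
    // Count the final record, which has no header after it.
    tally(&seq, k, &mut counts);
    Ok(counts)
}

/// Adds every valid k-mer in `seq` to `counts`, skipping any window
/// that contains a byte other than A, C, G, or T.
fn tally(seq: &str, k: usize, counts: &mut HashMap<String, u64>) {
    if k == 0 {
        return; // `windows(0)` would panic.
    }
    for window in seq.as_bytes().windows(k) {
        if window.iter().all(|&b| matches!(b, b'A' | b'C' | b'G' | b'T')) {
            let kmer = String::from_utf8_lossy(window).into_owned();
            *counts.entry(kmer).or_insert(0) += 1;
        }
    }
}
```

Because `&[u8]` and `std::io::StdinLock` both implement `BufRead`, the same function accepts in-memory data or locked stdin without modification.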
Performance
kmerust uses parallel processing to efficiently count k-mers:
- Sequences are processed in parallel using rayon
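- A concurrent hash map (dashmap) lets threads update counts concurrently with low contention (it uses sharded locks internally rather than being strictly lock-free)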
- FxHash provides fast hashing for 64-bit packed k-mers
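The 64-bit packing mentioned above also explains the upper limit of k = 32: at 2 bits per base, 32 bases fill a `u64` exactly. The sketch below shows one common encoding (A=0, C=1, G=2, T=3); the specific bit assignment and the names `pack_kmer` and `unpack_kmer` are assumptions for illustration, not necessarily what kmerust uses internally.

```rust
/// Packs a k-mer into a u64 using 2 bits per base (A=0, C=1, G=2, T=3).
/// Returns `None` if the k-mer is empty, longer than 32 bases, or
/// contains a byte other than A, C, G, or T.
/// Hypothetical helper; the encoding is an assumption, not kmerust's API.
fn pack_kmer(kmer: &[u8]) -> Option<u64> {
    if kmer.is_empty() || kmer.len() > 32 {
        return None;
    }
    let mut packed = 0u64;
    for &base in kmer {
        let bits = match base {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None, // e.g. N: the k-mer is invalid
        };
        // Shift earlier bases left and append the new base in the low bits.
        packed = (packed << 2) | bits;
    }
    Some(packed)
}

/// Unpacks a value produced by `pack_kmer` back into a string of length `k`.
fn unpack_kmer(mut packed: u64, k: usize) -> String {
    let mut bases = vec![b'A'; k];
    // The last base sits in the lowest two bits, so fill from the end.
    for slot in bases.iter_mut().rev() {
        *slot = b"ACGT"[(packed & 3) as usize];
        packed >>= 2;
    }
    String::from_utf8(bases).expect("only ASCII bases are written")
}
```

A packed k-mer is a fixed-size integer key, so hashing and comparing it is cheaper than hashing a heap-allocated string, which is where a fast integer hasher helps.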
License
MIT License - see LICENSE for details.