Needletail
Needletail is an MIT-licensed, minimal-copying FASTA/FASTQ parser and k-mer processing library for Rust.
The goal is to write a fast and well-tested set of functions that more specialized bioinformatics programs can use. Needletail's goal is to be as fast as the readfq C library at parsing FASTX files and much (i.e. 25 times) faster than equivalent Python implementations at k-mer counting.
For example, a simple Needletail script can count all the bases in a 2.1 gigabyte HiSeq 2500 FASTQ file in 1.1 seconds, while a comparable parser with readfq takes 2.6 seconds and Biopython takes over one minute (see the bench folder; measured with %timeit -r 3 -n 3, or %timeit -r 3 -n 1 for Biopython). These speed improvements hold for large FASTQ files as well.
Machine | needletail | readfq | Biopython |
---|---|---|---|
Mid 2012 MacBook Pro (2GB FASTQ) | 1.83s | 2.48s | 2m43s |
AWS EC2 r3.xlarge (2GB FASTQ) | 1.10s | 2.59s | 1m47s |
AWS EC2 d2.2xlarge (2GB FASTQ) | 0.93s | 2.56s | 1m24s |
AWS EC2 d2.2xlarge (55GB FASTQ) | 34.7s | 1m6s | — |
Note: gcc with the -O3 flag was used for readfq (clang -O3 was slower on all tested machines and not used). rustc 1.15.1 was used on all machines.
Example
extern crate needletail;
use std::env;
use ;
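As a rough illustration of the base-counting task used in the benchmark, here is a minimal sketch that relies only on the Rust standard library. It does not use Needletail's API: `count_bases` is a hypothetical helper, and it assumes strict four-line FASTQ records (no wrapped sequences, no FASTA input), which a real parser would not require.

```rust
use std::io::{self, BufRead, BufReader, Read};

/// Hypothetical helper (not part of Needletail): counts the sequence bases
/// in FASTQ data, assuming every record spans exactly four lines.
pub fn count_bases<R: Read>(input: R) -> io::Result<usize> {
    let reader = BufReader::new(input);
    let mut n_bases = 0;
    for (i, line) in reader.lines().enumerate() {
        let line = line?;
        // The second line of each four-line record (index 1, 5, 9, ...)
        // holds the bases; the other lines are header, '+', and qualities.
        if i % 4 == 1 {
            n_bases += line.trim_end().len();
        }
    }
    Ok(n_bases)
}

fn main() -> io::Result<()> {
    // Two small in-memory records stand in for a real FASTQ file.
    let data = "@r1\nACGT\n+\nIIII\n@r2\nGGC\n+\nIII\n";
    let n_bases = count_bases(data.as_bytes())?;
    println!("There are {} bases in your file.", n_bases);
    Ok(())
}
```

To run it on a file instead, the same function should accept a `std::fs::File`, since it only needs something that implements `Read`.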
Installation
Needletail requires rust and cargo to be installed.
Please use either your local package manager (homebrew, apt-get, pacman, etc.) or install these via rustup.
Once you have Rust set up, you can include needletail in your Cargo.toml file like:
[dependencies]
needletail = "^0.1.0"
To install needletail itself for development:
git clone https://github.com/bovee/needletail
cd needletail
cargo test # to run tests
Getting Help
Questions are best filed as GitHub issues. We plan to add more documentation soon, but in the meantime "doc" comments are included in the source.
Contributing
Please do! We're happy to discuss possible additions and/or accept pull requests.