Crate fastq [−] [src]
A fast parser for fastq.
The records in a file can be accessed through three different interfaces:
Parser::each
: This function takes a closure that is executed for each fastq record. It is the fastest way to iterate over the records, since no copying of the records is needed and we don't allocate anything during parsing.Parser::record_sets
. This function returns an iterator over record sets. All records in a record set share the same data array, so we only need one allocation per record set.Parser::parallel_each
. This is a convenience function that wrapsParser::record_sets
but passes the record sets to a number of background threads. A closure is executed on each thread with an iterator over record sets. Results from the threads are collected and returned to the caller.
Since fastq file are usually compressed, this crate also includes a function
thread_reader
to offload the decompression to a different core.
The FastQ standard
This library supports Windows and Unix line endings, it does not support
the old MAC line ending \r
. It allows arbitrary data on the separator
line between sequence and quality as long as it starts with a +
(some
fastq files repeat the id on this line). It does not validate that the
sequence or the quality contain only allowed characters. Sequence and
quality must have the same length. They are not allowed to contain
newline characters.
At the moment it does not make any effort to pair reads. This means that pairs that belong together might end up on different cores in a multithreaded setup. (TODO This should change it the future!).
Examples
A minimal program that reads uncompressed fastq from stdin and counts the
number of fastq records. Since we do not need ownership of the records we
can use the fastest Parser::each
.
use std::io::stdin; use fastq::Parser; let mut parser = Parser::new(stdin()); let mut total: usize = 0; parser.each(|_| { total += 1 }).expect("Invalid fastq file"); println!("{}", total);
If the file is already in the system cache, this runs on my laptop about
10% slower than wc -l
(at about 2GB/s), and at disk speed (~500MB/s) otherwise.
If we add lz4 decompression in a background thread:
extern crate lz4; extern crate fastq; use std::io::stdin; use fastq::{Parser, thread_reader}; // lz4 decompression is faster with a large 4MB buffer let BUFSIZE: usize = 1<<22; let QUEUELEN: usize = 2; let reader = lz4::Decoder::new(stdin()).expect("Input is not lz4 compressed"); let total = thread_reader(BUFSIZE, QUEUELEN, reader, |reader| { let mut total: u64 = 0; let mut parser = Parser::new(reader); parser.each(|_| { total += 1; }).expect("Invalid fastq file"); total }).expect("Reader thread paniced"); println!("{}", total);
This gets us up to ~1.6GB/s on a fresh page cache.
If we want to do more than just count the number of records (in this example count how many sequences align to ATTAATCCAT with a score better than 7), we probably want to use more cores:
extern crate parasailors; use fastq::{Parser, thread_reader, Record}; use parasailors as align; const N_THREADS: usize = 2; let results = thread_reader(BUFSIZE, 3, reader, |reader| { let parser = Parser::new(reader); let results: Vec<u64> = parser.parallel_each(N_THREADS, |record_sets| { let matrix = align::Matrix::new(align::MatrixType::Identity); let profile = align::Profile::new(b"ATTAATCCAT", &matrix); let mut thread_total: u64 = 0; for record_set in record_sets { for record in record_set.iter() { let score = align::local_alignment_score(&profile, record.seq(), 2, 1); if score > 7 { thread_total += 1; } } } thread_total }).expect("Invalid fastq file"); results }).expect("Reader thread paniced"); println!("{}", results.iter().sum::<u64>());
On my feeeble 2 core laptop this ends up being bound by the alignment at ~300MB/s, but it should scale well to a larger number of cores.
Structs
OwnedRecord |
A fastq record that ownes its data arrays. |
Parser |
Parser for fastq files. |
RecordSet |
A collection of fastq records used to iterate over records in chunks. |
RecordSetItems | |
RecordSetIter | |
RefRecord |
A fastq record that borrows data from an array. |
Traits
Record |
Trait to be implemented by types that represent fastq records. |
Functions
thread_reader |
Wrap a reader in a background thread. |