Crate fastq [] [src]

A fast parser for fastq.

The records in a file can be accessed through three different interfaces:

  • Parser::each: This function takes a closure that is executed for each fastq record. It is the fastest way to iterate over the records, since no copying of the records is needed and we don't allocate anything during parsing.
  • Parser::record_sets. This function returns an iterator over record sets. All records in a record set share the same data array, so we only need one allocation per record set.
  • Parser::parallel_each. This is a convenience function that wraps Parser::record_sets but passes the record sets to a number of background threads. A closure is executed on each thread with an iterator over record sets. Results from the threads are collected and returned to the caller.

Since fastq file are usually compressed, this crate also includes a function thread_reader to offload the decompression to a different core.

The FastQ standard

This library supports Windows and Unix line endings, it does not support the old MAC line ending \r. It allows arbitrary data on the separator line between sequence and quality as long as it starts with a + (some fastq files repeat the id on this line). It does not validate that the sequence or the quality contain only allowed characters. Sequence and quality must have the same length. They are not allowed to contain newline characters.

At the moment it does not make any effort to pair reads. This means that pairs that belong together might end up on different cores in a multithreaded setup. (TODO This should change it the future!).

Examples

A minimal program that reads uncompressed fastq from stdin and counts the number of fastq records. Since we do not need ownership of the records we can use the fastest Parser::each.

use std::io::stdin;
use fastq::Parser;

let mut parser = Parser::new(stdin());
let mut total: usize = 0;
parser.each(|_| {
    total += 1
}).expect("Invalid fastq file");
println!("{}", total);

If the file is already in the system cache, this runs on my laptop about 10% slower than wc -l (at about 2GB/s), and at disk speed (~500MB/s) otherwise.

If we add lz4 decompression in a background thread:

extern crate lz4;
extern crate fastq;

use std::io::stdin;
use fastq::{Parser, thread_reader};

// lz4 decompression is faster with a large 4MB buffer
let BUFSIZE: usize = 1<<22;
let QUEUELEN: usize = 2;

let reader = lz4::Decoder::new(stdin()).expect("Input is not lz4 compressed");
let total = thread_reader(BUFSIZE, QUEUELEN, reader, |reader| {
    let mut total: u64 = 0;
    let mut parser = Parser::new(reader);
    parser.each(|_| {
        total += 1;
     }).expect("Invalid fastq file");
    total
}).expect("Reader thread paniced");
println!("{}", total);

This gets us up to ~1.6GB/s on a fresh page cache.

If we want to do more than just count the number of records (in this example count how many sequences align to ATTAATCCAT with a score better than 7), we probably want to use more cores:

extern crate parasailors;

use fastq::{Parser, thread_reader, Record};
use parasailors as align;

const N_THREADS: usize = 2;

let results = thread_reader(BUFSIZE, 3, reader, |reader| {
    let parser = Parser::new(reader);
    let results: Vec<u64> = parser.parallel_each(N_THREADS, |record_sets| {
        let matrix = align::Matrix::new(align::MatrixType::Identity);
        let profile = align::Profile::new(b"ATTAATCCAT", &matrix);
        let mut thread_total: u64 = 0;
        for record_set in record_sets {
            for record in record_set.iter() {
                let score = align::local_alignment_score(&profile, record.seq(), 2, 1);
                if score > 7 {
                    thread_total += 1;
                }
            }
        }
        thread_total
    }).expect("Invalid fastq file");
    results
}).expect("Reader thread paniced");
println!("{}", results.iter().sum::<u64>());

On my feeeble 2 core laptop this ends up being bound by the alignment at ~300MB/s, but it should scale well to a larger number of cores.

Structs

OwnedRecord

A fastq record that ownes its data arrays.

Parser

Parser for fastq files.

RecordSet

A collection of fastq records used to iterate over records in chunks.

RecordSetItems
RecordSetIter
RefRecord

A fastq record that borrows data from an array.

Traits

Record

Trait to be implemented by types that represent fastq records.

Functions

thread_reader

Wrap a reader in a background thread.