Crate fastq[][src]

A fast parser for fastq.

The records in a file can be accessed through three different interfaces:

  • Parser::each: This function takes a closure that is executed for each fastq record. It is the fastest way to iterate over the records, since no copying of the records is needed and we don't allocate anything during parsing.
  • Parser::record_sets. This function returns an iterator over record sets. All records in a record set share the same data array, so we only need one allocation per record set.
  • Parser::parallel_each. This is a convenience function that wraps Parser::record_sets but passes the record sets to a number of background threads. A closure is executed on each thread with an iterator over record sets. Results from the threads are collected and returned to the caller.

Since fastq file are usually compressed, this crate also includes a function thread_reader to offload the decompression to a different core, and from_path to guess the compression.

The FastQ standard

This library supports Windows and Unix line endings, it does not support the old MAC line ending \r. It allows arbitrary data on the separator line between sequence and quality as long as it starts with a + (some fastq files repeat the id on this line). It does not validate that the sequence or the quality contain only allowed characters. Sequence and quality must have the same length. They are not allowed to contain newline characters.

At the moment it does not make any effort to pair reads. This means that pairs that belong together might end up on different cores in a multithreaded setup. (TODO This should change it the future!).

Examples

A minimal program that reads uncompressed fastq from stdin and counts the number of fastq records. Since we do not need ownership of the records we can use the fastest Parser::each.

use std::io::stdin;
use fastq::Parser;

let mut parser = Parser::new(stdin());
let mut total: usize = 0;
parser.each(|_| {
    total += 1;
    // return `true` if you want to continue iterating
    true
}).expect("Invalid fastq file");
println!("{}", total);

If we want to do more than just count the number of records (in this example, count how many sequences align to an illumina adapter with a score better than 10), we probably want to use more cores:

use fastq::{parse_path, Record};
use std::env::args;
use parasailors as align;

extern crate fastq;
extern crate parasailors;

fn main() {
    let filename = args().nth(1);
    // Treat "-" as stdin
    let path = match filename.as_ref().map(String::as_ref) {
        None | Some("-") => { None },
        Some(name) => Some(name)
    };

    parse_path(path, |parser| {
        let nthreads = 4;
        let results: Vec<usize> = parser.parallel_each(nthreads, |record_sets| {
            // we can initialize thread local variables here.
            let adapter = b"AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT";
            let matrix = align::Matrix::new(align::MatrixType::Identity);
            let profile = align::Profile::new(adapter, &matrix);
            let mut thread_total = 0;

            for record_set in record_sets {
                for record in record_set.iter() {
                    let score = align::local_alignment_score(&profile, record.seq(), 5, 1);
                    if score > 8 {
                        thread_total += 1;
                    }
                }
            }

            // The values we return (it can be any type implementing `Send`)
            // are collected from the different threads by
            // `parser.parallel_each` and returned. See doc for a description of
            // error handling.
            thread_total
        }).expect("Invalid fastq file");
        println!("{}", results.iter().sum::<usize>());
    }).expect("Invalid compression");
}

On my feeble 2 core laptop this ends up being bound by the alignment at ~300MB/s, but it should scale well to a larger number of cores.

Structs

OwnedRecord

A fastq record that ownes its data arrays.

Parser

Parser for fastq files.

RecordRefIter
RecordSet

A collection of fastq records used to iterate over records in chunks.

RecordSetItems
RefRecord

A fastq record that borrows data from an array.

Traits

Record

Trait to be implemented by types that represent fastq records.

Functions

each_zipped

Step through two fastq files and call a callback for pairs of Records.

parse_path

Create a parser and guess the compression.

thread_reader

Wrap a reader in a background thread.