This library provides an(other) attempt at high-performance FASTA and FASTQ parsing and writing. The FASTA parser can read and write multi-line files. The FASTQ parser supports only records with single-line sequence and quality strings.
By default, the parsers avoid allocations and copying as much as possible.
fasta::RefRecord and fastq::RefRecord borrow from the underlying buffered reader. In addition, fasta::RefRecord offers the seq_lines() method, which allows iterating over the individual sequence lines of a multi-line FASTA record without the need to copy the data.
By default, both parsers use a buffer of 64 KiB size. If a record with a longer sequence is encountered, the buffer will automatically grow. How it grows can be configured. See below for more information.
More detailed documentation
Please refer to the module docs (fasta and fastq) for more information on how to use the reading and writing functions, as well as information on the exact parsing behaviour.
Example FASTQ parser:
This code prints the ID string from each FASTQ record.
use seq_io::fastq::{Reader, Record};

let mut reader = Reader::from_path("seqs.fastq").unwrap();
while let Some(record) = reader.next() {
    let record = record.expect("Error reading record");
    println!("{}", record.id().unwrap());
}
Example FASTA parser calculating mean sequence length:
The FASTA reader works just the same. One challenge with the FASTA
format is that the sequence can be broken into multiple lines.
Therefore, it is not always possible to get a slice of the whole sequence
without copying the data. But it is possible to use seq_lines()
to iterate efficiently over each sequence line:
use seq_io::fasta::{Reader, Record};

let mut reader = Reader::from_path("seqs.fasta").unwrap();
let mut n = 0;
let mut sum = 0;
while let Some(record) = reader.next() {
    let record = record.expect("Error reading record");
    for s in record.seq_lines() {
        sum += s.len();
    }
    n += 1;
}
println!("mean sequence length of {} records: {:.1} bp", n, sum as f32 / n as f32);
If the whole sequence is required at once, there is the full_seq() method, which only needs to allocate the sequence if it spans multiple lines.
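The borrow-or-allocate behaviour of full_seq() can be illustrated with a small stdlib-only sketch (the join_lines helper below is hypothetical and not part of seq_io):

```rust
use std::borrow::Cow;

// Hypothetical helper mirroring full_seq()'s strategy: borrow the single
// sequence line when possible, concatenate into a new Vec only when needed.
fn join_lines<'a>(lines: &[&'a [u8]]) -> Cow<'a, [u8]> {
    if lines.len() == 1 {
        Cow::Borrowed(lines[0])
    } else {
        Cow::Owned(lines.concat())
    }
}

fn main() {
    // Single-line record: no allocation, just a borrow.
    let single = join_lines(&[&b"ACGT"[..]]);
    assert!(matches!(single, Cow::Borrowed(_)));

    // Multi-line record: lines are concatenated into an owned Vec.
    let multi = join_lines(&[&b"ACGT"[..], &b"TTGA"[..]]);
    assert!(matches!(multi, Cow::Owned(_)));
    assert_eq!(&*multi, b"ACGTTTGA");
}
```

Returning Cow<[u8]> lets callers treat both cases uniformly while paying the allocation cost only for multi-line records.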
Large sequences
Due to the design of the parsers, each sequence record must fit into the underlying
buffer as a whole. There are different ways to deal with large sequences:
It is possible to configure the initial buffer size using Reader::with_capacity().
However, the buffer will also automatically double its size if a record doesn’t fit.
How it grows can be configured by applying another policy.
For example, the readers can be configured to return fasta::Error::BufferLimit / fastq::Error::BufferLimit if the buffer grows too large. This is done using set_policy():
use seq_io::fasta::Reader;
use seq_io::policy::DoubleUntilLimited;

// The buffer doubles its size until 128 MiB, then grows by steps
// of 128 MiB. If it reaches 1 GiB, there will be an error.
let policy = DoubleUntilLimited::new(1 << 27, 1 << 30);
let mut reader = Reader::from_path("input.fasta").unwrap()
    .set_policy(policy);
// (...)
For information on how to create a custom policy, refer to the policy module docs.
Owned records
Both readers also provide iterators similar to Rust-Bio, which return owned data. This is slower, but can make sense, e.g. if the records are to be collected into a vector:
use seq_io::fasta::Reader;
let mut reader = Reader::from_path("input.fasta").unwrap();
let records: Result<Vec<_>, _> = reader.records().collect();
Parallel processing
Functions for parallel processing can be found in the parallel module.
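The general pattern that module implements — records read on one thread, processed on others — can be sketched with a stdlib-only example (this is a conceptual illustration, not the seq_io::parallel API):

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    // Channel carrying owned "records" (here just sequence bytes).
    let (tx, rx) = mpsc::channel::<Vec<u8>>();

    // "Reader" thread: sends records as they are parsed.
    let reader = thread::spawn(move || {
        for seq in [b"ACGT".to_vec(), b"GGCC".to_vec()] {
            tx.send(seq).unwrap();
        }
        // tx is dropped here, which ends the receiving iterator below.
    });

    // "Worker" side: process records as they arrive.
    let total: usize = rx.iter().map(|seq| seq.len()).sum();
    reader.join().unwrap();
    println!("total bases: {}", total);
}
```

The real parallel module handles buffering and record recycling for you; refer to its docs for the actual functions and their signatures.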
Modules
- fasta: Efficient FASTA reading and writing
- fastq: Efficient FASTQ reading and writing
- parallel: Experiments with parallel processing
- policy: Defines the BufPolicy trait, which configures how the internal buffer of the parsers should grow upon encountering large sequences that don’t fit into the buffer.