Module seq_io::fasta

FASTA reading and writing

Flavours

There are two flavours of this parser:

  • fasta::Reader in this module parses standard multi-line FASTA.
  • Reader in fasta::single_line accepts only single-line FASTA. It may not be of general use; its only advantage is that it runs faster than the multi-line parser.

Multi-line FASTQ is problematic since the quality scores may also contain @.

Example

The following example shows how to use Reader.

use seq_io::prelude::*;  // needed to import necessary traits
use seq_io::fasta::Reader;

let seq = b">id1 some description
SEQUENCE
.ANOTHER.LINE
>id2
SEQUENCE
";

// Construct the reader
let mut reader = Reader::new(&seq[..]);

// We'll write the records back to this vector
let mut output = vec![];

while let Some(result) = reader.next() {
    let rec = result.unwrap();

    // Access the ID and the description parts of the header (separated by a space)
    let id = rec.id().unwrap();
    let desc = rec.desc().transpose().unwrap();
    println!("ID: '{}', description: {:?}", id, desc);

    // Print the sequence
    println!("{}", std::str::from_utf8(rec.seq()).unwrap());

    // Write the record to 'output'
    rec.write(&mut output).unwrap();
}

println!("\nWritten:\n{}", std::str::from_utf8(&output).unwrap());

The output will be:

ID: 'id1', description: Some("some description")
SEQUENCE
.ANOTHER.LINE
ID: 'id2', description: None
SEQUENCE

Written:
>id1 some description
SEQUENCE.ANOTHER.LINE
>id2
SEQUENCE

As the record returned by the next() method borrows its data from the underlying buffer, it is not possible to use a for loop for iterating. Therefore, we use the while let ... construct.
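The borrowing constraint described above can be illustrated without the crate. The following is a minimal sketch using only the standard library; LineReader and BorrowedLine are hypothetical stand-ins, not types from seq_io. The record returned by next() borrows the reader's internal buffer, so it has to be dropped before the next call, which is exactly what while let allows and what the Iterator trait cannot express.

```rust
/// Hypothetical stand-in for a streaming reader: it owns one buffer and
/// hands out records that borrow from it.
struct LineReader<'a> {
    input: &'a [u8],
    buf: Vec<u8>,
}

/// A record that only borrows from the reader's buffer.
struct BorrowedLine<'r> {
    data: &'r [u8],
}

impl<'a> LineReader<'a> {
    fn new(input: &'a [u8]) -> Self {
        LineReader { input, buf: Vec::new() }
    }

    /// Not `Iterator::next`: the returned record borrows `self` mutably,
    /// so it must go out of scope before the next call.
    fn next(&mut self) -> Option<BorrowedLine<'_>> {
        if self.input.is_empty() {
            return None;
        }
        let end = self
            .input
            .iter()
            .position(|&b| b == b'\n')
            .unwrap_or(self.input.len());
        // Reuse the same buffer for every record
        self.buf.clear();
        self.buf.extend_from_slice(&self.input[..end]);
        // Skip the line and its terminator (if any)
        self.input = &self.input[(end + 1).min(self.input.len())..];
        Some(BorrowedLine { data: &self.buf })
    }
}

/// Collects the length of every line using the `while let` pattern.
fn collect_lengths(input: &[u8]) -> Vec<usize> {
    let mut reader = LineReader::new(input);
    let mut lengths = Vec::new();
    while let Some(line) = reader.next() {
        lengths.push(line.data.len());
    }
    lengths
}

fn main() {
    println!("{:?}", collect_lengths(b"ACGT\nAC\n"));
}
```

Trying to push the BorrowedLine values themselves into a vector would be rejected by the borrow checker, which is why owned copies are needed for storage.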

The ID of the record can be accessed using record.id(), which first has to check the validity of the UTF-8. record.desc() returns everything after the space in the header, if present. The whole header line can be accessed using record.head(). This returns a byte slice, which needs to be manually converted to a string.
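To make the relationship between the header, the ID and the description concrete, here is a small standard-library sketch. split_head is a hypothetical helper written for illustration; it is not how the crate implements id() and desc(), only one plausible way to split a header byte slice at the first space and validate UTF-8.

```rust
use std::str::Utf8Error;

/// Splits a header (without the leading `>`) into the ID and an optional
/// description, validating UTF-8 for both parts.
fn split_head(head: &[u8]) -> Result<(&str, Option<&str>), Utf8Error> {
    match head.iter().position(|&b| b == b' ') {
        Some(pos) => {
            let id = std::str::from_utf8(&head[..pos])?;
            let desc = std::str::from_utf8(&head[pos + 1..])?;
            Ok((id, Some(desc)))
        }
        // No space: the whole header is the ID
        None => Ok((std::str::from_utf8(head)?, None)),
    }
}

fn main() {
    match split_head(b"id1 some description") {
        Ok((id, desc)) => println!("ID: '{}', description: {:?}", id, desc),
        Err(e) => eprintln!("invalid UTF-8 in header: {}", e),
    }
}
```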

Sequence record types

  • The type of record obtained by calling Reader::next() is RefRecord, which is a simple wrapper borrowing its data from the reader and its buffer. Essentially, the position of the record in the buffer is stored without any further copying necessary.

  • OwnedRecord owns its data (see Owned records below).

Both record types implement BaseRecord, the common trait for all records in this crate. FASTA-specific methods are additionally defined in the fasta::Record trait. It is easiest to simply import these traits (along with others) with use seq_io::prelude::*.

Accessing the sequence

There are several ways to access the sequence in RefRecord. The fastest are presented first:

Taking a slice of everything

In the above example, printing the output of rec.seq() results in two sequence lines, like in the input. This is on purpose, not a bug. The record.seq() method simply takes a slice from the buffer without transforming the sequence in any way. This may make sense in some cases, but if multiple lines are expected in the FASTA input, the sequence will be interrupted by line breaks.

Iterating over sequence lines

Since the position of each line is remembered by the record, iterating over sequence lines is also very fast:

for line in rec.seq_lines() {
    println!("line: {}", std::str::from_utf8(&line)?);
}

This would be the output for the first record from the above example:

line: SEQUENCE
line: .ANOTHER.LINE
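The idea of remembering line positions can be sketched as follows. SeqView is a hypothetical type, not part of seq_io, and for brevity it assumes LF line endings only. It stores the byte range of each line once, so iterating over lines later only slices the buffer and never copies.

```rust
/// Hypothetical record view: the raw sequence bytes plus the byte range
/// of every line (terminators excluded).
struct SeqView<'a> {
    raw: &'a [u8],
    lines: Vec<(usize, usize)>,
}

impl<'a> SeqView<'a> {
    /// Scans the buffer once and records where each line starts and ends.
    fn new(raw: &'a [u8]) -> Self {
        let mut lines = Vec::new();
        let mut start = 0;
        for (i, &b) in raw.iter().enumerate() {
            if b == b'\n' {
                lines.push((start, i));
                start = i + 1;
            }
        }
        // Last line without a terminator
        if start < raw.len() {
            lines.push((start, raw.len()));
        }
        SeqView { raw, lines }
    }

    /// Yields each line as a slice of the original buffer (no copying).
    fn seq_lines(&self) -> impl Iterator<Item = &[u8]> + '_ {
        self.lines.iter().map(move |&(s, e)| &self.raw[s..e])
    }
}

fn main() {
    let view = SeqView::new(b"SEQUENCE\n.ANOTHER.LINE\n");
    for line in view.seq_lines() {
        println!("line: {}", String::from_utf8_lossy(line));
    }
}
```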

Obtaining the full sequence

If the whole, contiguous sequence is required in one slice, there are two possibilities:

  • full_seq() returns Cow<[u8]>, meaning that if there is only one sequence line, no copying will be done. Multiple lines are instead copied into a newly allocated Vec<u8>.
  • full_seq_given() allows reusing allocations in different calls by letting the user supply a Vec<u8>:
let mut seq = vec![];
rec.full_seq_given(|| &mut seq);
println!("{}", std::str::from_utf8(&seq)?);

This would print SEQUENCE.ANOTHER.LINE given the first sequence record. The reasoning behind the closure was to also allow it to be used with arena allocators. The closure is only called if needed due to multiple sequence lines.
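The copy-only-if-needed behaviour of a Cow return value can be sketched with the standard library alone. The function below is a hypothetical free function that takes the sequence lines as a slice of slices; the crate's own full_seq() is a method on the record, so treat this only as an illustration of the technique.

```rust
use std::borrow::Cow;

/// Returns the contiguous sequence: borrowed if there is exactly one line,
/// otherwise concatenated into a newly allocated Vec<u8>.
fn full_seq<'a>(lines: &[&'a [u8]]) -> Cow<'a, [u8]> {
    if let [single] = lines {
        // One line: hand out the existing slice, no copy
        Cow::Borrowed(*single)
    } else {
        // Zero or several lines: build an owned, contiguous copy
        Cow::Owned(lines.concat())
    }
}

fn main() {
    let lines: [&[u8]; 2] = [b"SEQUENCE", b".ANOTHER.LINE"];
    let seq = full_seq(&lines);
    println!("{}", String::from_utf8_lossy(&seq));
}
```

Callers can treat both cases uniformly, since Cow<[u8]> dereferences to a byte slice.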

Owned records

With the method shown above, it is not possible to iterate using a for loop, since Reader is a streaming iterator and does not implement Iterator. For the same reason, records cannot be stored in a vector for later reuse. If this is needed, it is necessary to create an owned copy, which stores the header and sequence in allocated vectors:

let mut records = vec![];
while let Some(result) = reader.next() {
    let rec = result?;
    records.push(rec.to_owned_record());
}

An even easier way is to use the records() or into_records() iterators:

// Collect into a Result so that the first parsing error is propagated by `?`
let records: Vec<_> = reader.records().collect::<Result<_, _>>()?;

Of course, this slows down everything because allocations take time. However, it is also possible to use clone_into_owned in order to reuse the OwnedRecords:

let mut records = vec![];
while let Some(result) = reader.next() {
    let rec = result?;
    // Obtain the record from somewhere or create a new one using
    // OwnedRecord::default()
    let mut owned_record = ...
    // Update it with new data
    rec.clone_into_owned(&mut owned_record);
    records.push(owned_record);
}
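Why reusing an owned record saves allocations can be shown with a small standalone sketch. OwnedRec and this clone_into_owned function are hypothetical and only mirror the idea: the destination's vectors are cleared and refilled, so their existing capacity is kept instead of allocating again.

```rust
/// Hypothetical owned record with one allocation per field.
#[derive(Default, Debug, Clone, PartialEq)]
struct OwnedRec {
    head: Vec<u8>,
    seq: Vec<u8>,
}

/// Copies header and sequence into `dest`, reusing its allocations.
fn clone_into_owned(head: &[u8], seq: &[u8], dest: &mut OwnedRec) {
    // clear() keeps the capacity, so no reallocation happens as long as
    // the new data fits into the existing buffers
    dest.head.clear();
    dest.head.extend_from_slice(head);
    dest.seq.clear();
    dest.seq.extend_from_slice(seq);
}

fn main() {
    let mut scratch = OwnedRec::default();
    clone_into_owned(b"id1", b"ACGTACGT", &mut scratch);
    println!("{} bytes of sequence", scratch.seq.len());
    clone_into_owned(b"id2", b"AC", &mut scratch);
    println!("{} bytes of sequence", scratch.seq.len());
}
```

This pays off mostly when a single scratch record is processed and then overwritten; records that are kept in a vector still need their own storage.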

Writing FASTA

Records can be written to output using BaseRecord::write(). fasta::Record::write_wrap() writes FASTA, wrapping lines at a given width. RefRecord additionally has the method write_unchanged(), which should be faster, but neither removes existing line wrapping nor re-wraps lines.
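As an illustration of what wrapping at a given width means, here is a minimal sketch on top of std::io::Write. write_wrapped is a hypothetical function with an assumed signature; it is not the crate's write_wrap and only shows the general technique of emitting fixed-width chunks with UNIX line endings.

```rust
use std::io::{self, Write};

/// Writes one FASTA record, breaking the sequence into lines of at most
/// `wrap` bytes. Assumes `wrap > 0`.
fn write_wrapped<W: Write>(mut out: W, head: &[u8], seq: &[u8], wrap: usize) -> io::Result<()> {
    assert!(wrap > 0, "wrap width must be positive");
    // Header line
    out.write_all(b">")?;
    out.write_all(head)?;
    out.write_all(b"\n")?;
    // Sequence lines; an empty sequence produces no sequence line at all
    for chunk in seq.chunks(wrap) {
        out.write_all(chunk)?;
        out.write_all(b"\n")?;
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let stdout = io::stdout();
    write_wrapped(stdout.lock(), b"id1 some description", b"ACGTACGTAC", 4)
}
```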

It is also possible to write data that is not part of a FASTA record directly, using the set of functions listed under Functions below.

Details on parsing and writing

Valid FASTA records require the header line to start with > and to be terminated by a line terminator (UNIX-style \n or Windows-style \r\n). Sequence lines are optional. Therefore, the following is also accepted and interpreted as two records with an empty header and no sequence:

>
>

Accordingly, BaseRecord::num_seq_lines() returns 0 and RefRecord::seq_lines() an empty iterator.

More details:

  • Like all parsers in this crate, fasta::Reader handles UNIX (LF) and Windows (CRLF) line endings, but not old Mac-style (CR) endings. LF and CRLF may be mixed within the same file.
  • FASTA writing currently always uses UNIX line endings.
  • The first non-empty line should start with >, indicating the first header. If not, an error with ErrorKind::InvalidStart is returned.
  • Whitespace at the end of header and sequence lines is never removed.
  • Empty input will result in None being returned immediately by fasta::Reader::next() and in empty iterators for RecordsIter / RecordsIntoIter.
  • Comment lines starting with ; are not recognized. If at the start of a file, there will be an error, since > is expected. Intermediate comments are interpreted as belonging to the sequence.
  • The last record header must be terminated by a line ending. If not, an error of ErrorKind::UnexpectedEnd is returned.
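Some of the rules above (the required >, LF or CRLF terminators, and the mandatory line ending after a header) can be sketched as a tiny header parser. HeadError and parse_head are hypothetical names chosen to mirror the error kinds mentioned in the list; the actual reader works differently internally. The sketch assumes that empty input and leading empty lines have already been handled by the caller.

```rust
/// Hypothetical error type mirroring two of the error kinds named above.
#[derive(Debug, PartialEq)]
enum HeadError {
    InvalidStart,
    UnexpectedEnd,
}

/// Parses one header line. Returns the header (without `>` and without the
/// line terminator) and the remaining input.
fn parse_head(input: &[u8]) -> Result<(&[u8], &[u8]), HeadError> {
    if input.first() != Some(&b'>') {
        return Err(HeadError::InvalidStart);
    }
    // The header must be terminated by a line ending
    let nl = input
        .iter()
        .position(|&b| b == b'\n')
        .ok_or(HeadError::UnexpectedEnd)?;
    let mut head = &input[1..nl];
    // Accept CRLF as well: drop a single trailing '\r'
    if head.last() == Some(&b'\r') {
        head = &head[..head.len() - 1];
    }
    Ok((head, &input[nl + 1..]))
}

fn main() {
    println!("{:?}", parse_head(b">id1 desc\r\nACGT\n"));
    println!("{:?}", parse_head(b"ACGT\n"));
}
```

Other trailing whitespace is deliberately left in place, in line with the rule that whitespace at the end of header lines is never removed.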

Modules

single_line

FASTA reader that requires the sequence to be on a single line; it must not be wrapped to multiple lines. This is an unusually strict definition of the format and may not be of general use. This parser was mostly created for exploring performance optimizations.

Structs

Error

Parsing error

LineStore
OwnedRecord

A FASTA record that owns its data (requiring two allocations). It implements the traits BaseRecord and Record.

Reader

FASTA parser

RecordSet

Set of sequence records that owns its buffer and knows the position of each record.

RecordSetIter

Iterator over record sets

RecordsIntoIter

Iterator of OwnedRecord that owns the underlying reader

RecordsIter

Borrowed iterator of OwnedRecord

RefRecord

A FASTA record that borrows data from a buffer. It implements the traits BaseRecord and Record.

Enums

ErrorKind

Traits

Record

FASTA record trait implemented by both RefRecord and OwnedRecord, which adds more methods to BaseRecord.

Functions

write

Helper function for writing data (not necessarily stored in a Record instance) to the FASTA format.

write_head

Writes only the header line.

write_iter

Helper function for writing data (not necessarily stored in a Record instance) to the FASTA format. In contrast to write, this function accepts a sequence iterator.

write_seq

Writes only the sequence line.

write_seq_iter

Writes the sequence line from an iterator of lines.

write_wrap

Writes data to the FASTA format. Wraps the sequence to produce multi-line FASTA with a maximum width specified by the wrap parameter.

write_wrap_iter

Writes data to the FASTA format. Wraps the sequence to produce multi-line FASTA with a maximum width specified by the wrap parameter. Accepts a sequence iterator.

write_wrap_seq

Writes the sequence line, and wraps the output to a maximum width specified by wrap.

write_wrap_seq_iter

Writes the sequence line from an iterator (such as SeqLines) and wraps the output to a maximum width specified by wrap.

Type Definitions

Result