Crate seq_io

This library provides high-performance parsing and writing of FASTA, FASTQ and FASTX.

For detailed documentation of the components, please refer to the module pages listed here:

  • FASTA: See fasta module for an introduction.
  • FASTQ: See fastq module. A separate parser supporting multi-line FASTQ is found in fastq::multiline.
  • FASTX: There are two approaches:

Special features

The following features are not covered in the module docs for the individual formats:

Parallel processing

All readers allow reading sequence records into chunks called record sets. Effectively, they are just copies of the internal buffer with associated positional information. These record sets can be sent around in channels without the overhead that sending single records would have. The idea for this was borrowed from fastq-rs.

The parallel module offers a few functions for processing sequence records and record sets in a worker pool and then sending them along with the processing results to the main thread. The functions work; the API design may not be optimal yet.
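The record set idea can be illustrated without the crate itself. The following is a minimal sketch using only the standard library: `RecordSet`, `pack` and `gc_counts` are hypothetical names invented for this example and are not part of the seq_io API. It assumes a single worker thread instead of a pool, and it shows why sending one buffer with positional information per channel message is cheaper than sending each record separately.

```rust
use std::ops::Range;
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for a record set: one contiguous buffer plus the
/// byte range of every sequence stored in it.
struct RecordSet {
    buffer: Vec<u8>,
    seqs: Vec<Range<usize>>,
}

/// Packs several sequences into one buffer and remembers their positions.
fn pack(seqs: &[&str]) -> RecordSet {
    let mut buffer = Vec::new();
    let mut ranges = Vec::new();
    for s in seqs {
        let start = buffer.len();
        buffer.extend_from_slice(s.as_bytes());
        ranges.push(start..buffer.len());
    }
    RecordSet { buffer, seqs: ranges }
}

/// Sends whole record sets to one worker thread (one channel message per
/// set) and collects the number of G/C bases of every record, in order.
fn gc_counts(sets: Vec<RecordSet>) -> Vec<usize> {
    let (set_tx, set_rx) = mpsc::channel::<RecordSet>();
    let (res_tx, res_rx) = mpsc::channel::<Vec<usize>>();

    let worker = thread::spawn(move || {
        // The loop ends once the sending side has been dropped.
        for set in set_rx {
            let counts: Vec<usize> = set
                .seqs
                .iter()
                .map(|r| {
                    set.buffer[r.clone()]
                        .iter()
                        .filter(|&&b| b == b'G' || b == b'C')
                        .count()
                })
                .collect();
            res_tx.send(counts).unwrap();
        }
    });

    for set in sets {
        set_tx.send(set).unwrap();
    }
    drop(set_tx); // signal the worker that no more sets will arrive

    let out: Vec<usize> = res_rx.iter().flatten().collect();
    worker.join().unwrap();
    out
}
```

With several workers the results would arrive out of order, so a real worker pool needs to tag each set with an index or otherwise restore the input order.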

Position tracking and seeking

All readers keep track of the byte offset, line number and record number while parsing. The current position can be stored and used later for seeking back to the same position. See here for an example.

It is not yet possible to restore a record completely given positional information (such as from a .fai file). All that is done currently is to set the position, so next() will return the correct record.
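The store-and-seek pattern described above can be sketched with the standard library alone. `Pos` and `TrackingReader` are hypothetical types written for this illustration and do not mirror the seq_io reader API; the sketch also assumes a plain line reader rather than a record parser. As in the crate, seeking only sets the position, so the next read returns the data starting there.

```rust
use std::io::{self, BufRead, BufReader, Cursor, Read, Seek, SeekFrom};

/// Hypothetical position: byte offset and number of lines read so far.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Pos {
    byte: u64,
    line: u64,
}

/// Line reader that tracks where the next line starts.
struct TrackingReader<R> {
    inner: BufReader<R>,
    pos: Pos,
}

impl<R: Read + Seek> TrackingReader<R> {
    fn new(inner: R) -> Self {
        TrackingReader {
            inner: BufReader::new(inner),
            pos: Pos { byte: 0, line: 0 },
        }
    }

    /// Position of the line that the next call to `next_line` returns.
    fn position(&self) -> Pos {
        self.pos
    }

    /// Reads one line (without the line terminator) and updates the position.
    fn next_line(&mut self) -> io::Result<Option<String>> {
        let mut line = String::new();
        let n = self.inner.read_line(&mut line)?;
        if n == 0 {
            return Ok(None);
        }
        self.pos.byte += n as u64;
        self.pos.line += 1;
        Ok(Some(line.trim_end().to_string()))
    }

    /// Jumps back to a stored position; only the position is restored.
    fn seek(&mut self, pos: Pos) -> io::Result<()> {
        self.inner.seek(SeekFrom::Start(pos.byte))?;
        self.pos = pos;
        Ok(())
    }
}

/// Skips the first record, stores the position, reads on, seeks back and
/// reads the same line again. Returns both reads for comparison.
fn reread_after_seek(data: &[u8]) -> io::Result<(Option<String>, Option<String>)> {
    let mut reader = TrackingReader::new(Cursor::new(data));
    reader.next_line()?; // header of the first record
    reader.next_line()?; // sequence of the first record
    let saved = reader.position();
    let first = reader.next_line()?;
    reader.next_line()?;
    reader.seek(saved)?;
    let again = reader.next_line()?;
    Ok((first, again))
}
```

Storing the line number together with the byte offset means that line-based error messages stay correct after a seek, which a bare byte offset could not guarantee.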

Design notes

Apart from R: io::Read, all readers have two additional generic parameters. It is normally not necessary to change the defaults, but in some cases this may be relevant.

Buffer growth policy

The parsers avoid allocations and copying as much as possible. To achieve this, each sequence record must fit into the underlying buffer as a whole. A large sequence may not fit into the buffer at its current size. In that case, the internal buffer of the reader grows automatically until the whole sequence record fits again. The buffer may grow until it reaches 1 GiB; larger records will cause an error.

The behaviour of buffer growth can be further configured by applying a different policy. This is documented in the policy module.
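To make the idea of a growth policy concrete, here is a minimal sketch. The trait `GrowthPolicy`, the struct `DoubleUntil` and the function `size_for_record` are hypothetical and are not the types defined in the policy module; see that module for the actual trait. The doubling strategy is an assumption chosen for illustration, and only the 1 GiB limit is taken from the text above.

```rust
/// 1 GiB, the limit mentioned in the text above.
const GIB: usize = 1 << 30;

/// Hypothetical policy trait: given the current buffer size, return the
/// next size, or `None` if the buffer must not grow any further.
trait GrowthPolicy {
    fn grow_to(&self, current: usize) -> Option<usize>;
}

/// Doubles the buffer on every step, capped at a hard limit.
struct DoubleUntil {
    limit: usize,
}

impl GrowthPolicy for DoubleUntil {
    fn grow_to(&self, current: usize) -> Option<usize> {
        if current >= self.limit {
            return None; // already at the limit: refuse to grow
        }
        // `max(1)` lets an empty buffer grow; `min` enforces the cap.
        Some(current.saturating_mul(2).max(1).min(self.limit))
    }
}

/// Returns the buffer size needed to hold a record of `record_len` bytes,
/// or `None` if the policy refuses to grow that far.
fn size_for_record<P: GrowthPolicy>(
    policy: &P,
    mut size: usize,
    record_len: usize,
) -> Option<usize> {
    while size < record_len {
        size = policy.grow_to(size)?;
    }
    Some(size)
}
```

Under these assumptions, a 64 KiB buffer that encounters a 200,000-byte record is doubled twice to 256 KiB, while a record larger than 1 GiB yields `None`, which a reader would report as an error.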

Position stores

At the core of the different parsers is the same code, which is called with different parameters. While searching the buffer for sequence records, the positions of the different features are stored. This makes it possible to later access the header, sequence and quality features directly as slices taken from the internal buffer. Which positional information needs to be stored depends on the format. For example, the fasta reader stores the position of every sequence line in order to allow fast iteration over the lines later. The fastq reader needs to remember the position of the quality scores, but doesn't need to store information about multiple lines, which allows for a simpler data structure. In turn, the fastx reader needs to store FASTA lines and quality scores.

Therefore, all readers have a third generic parameter, which allows assigning a specific "storage backend" implementing the core::PositionStore trait. Usually, it is not necessary to deal with this parameter since each parser has a reasonable default. The only case where it is changed in this crate is with the trait object approach implemented in fastx::dynamic.

Note that not all combinations of readers and PositionStore types have currently been tested, and some combinations are known to be problematic. Others just don't make sense. For example, the API does not prohibit combining fastq::Reader with fasta::LineStore, but this will return everything after the header as sequence, and no quality scores are stored. TODO: document possible combinations
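The following sketch illustrates the position store idea for a multi-line FASTA record. `LinePositions` and `scan_fasta_record` are hypothetical names made up for this example and are unrelated to the actual core::PositionStore trait or fasta::LineStore. The sketch assumes that the buffer holds exactly one record with `\n` line endings: the buffer is searched once, only byte ranges are stored, and header and sequence lines are later returned as slices of the buffer without copying.

```rust
use std::ops::Range;

/// Hypothetical position store for one multi-line FASTA record:
/// the range of the header (without '>') and the range of each sequence line.
#[derive(Debug, Default)]
struct LinePositions {
    head_range: Range<usize>,
    line_ranges: Vec<Range<usize>>,
}

impl LinePositions {
    /// Header as a slice of the buffer (no copy).
    fn head<'a>(&self, buf: &'a [u8]) -> &'a [u8] {
        &buf[self.head_range.clone()]
    }

    /// Sequence lines as slices of the buffer; no searching is needed here
    /// because the line positions were stored while scanning.
    fn seq_lines<'a>(&'a self, buf: &'a [u8]) -> impl Iterator<Item = &'a [u8]> + 'a {
        self.line_ranges.iter().map(move |r| &buf[r.clone()])
    }
}

/// Searches the buffer once and stores where the header and every sequence
/// line are located. Returns `None` if the record does not start with '>'.
fn scan_fasta_record(buf: &[u8]) -> Option<LinePositions> {
    if buf.first() != Some(&b'>') {
        return None;
    }
    let mut store = LinePositions::default();
    let mut start = 0;
    let mut is_head = true;
    while start < buf.len() {
        // End of the current line: next '\n' or end of the buffer.
        let end = buf[start..]
            .iter()
            .position(|&b| b == b'\n')
            .map_or(buf.len(), |p| start + p);
        if is_head {
            store.head_range = start + 1..end; // skip the leading '>'
            is_head = false;
        } else {
            store.line_ranges.push(start..end);
        }
        start = end + 1;
    }
    Some(store)
}
```

A FASTQ-style store would instead keep a fixed set of ranges (header, sequence, quality) and no vector of lines, which is the simpler data structure mentioned above.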

Modules

core

Contains core routines and types. The types defined in this module are subject to change and the API should not be relied on.

fasta

FASTA reading and writing

fastq

FASTQ reading and writing

fastx

FASTX reading and writing

parallel

Functions for parallel processing of record sets and records.

policy

This module defines the BufPolicy trait, which configures how the internal buffer of the parsers should grow upon encountering large sequences that don't fit into the buffer.

prelude

Macros

parallel_record_impl

Allows generating functions equivalent to the read_process_xy_records functions in this crate for your own types. This is more of a workaround, because the generic approach (read_process_records_init) does not currently work.

Structs

ErrorOffset

Position of a parsing error within the sequence record

ErrorPosition

Position of a parsing error within the file

LinePositionIter

Iterator over the lines of a text slice, whose positions have been searched before and stored. Iterating is very fast.

LineSearchIter

Iterator over the lines of a text slice. The line endings are searched in the text while iterating, except for the case where it is known that there is only one line.

Position

Holds line number and byte offset of a FASTQ record

Traits

BaseRecord
HeadWriter

Helper trait used to allow supplying either the whole head or separate ID and description parts to the write_...() functions in the fasta and fastq modules.

PositionStore

Trait for objects storing the coordinates of sequence records in the buffer.