Crate fire_fasta


§Fire-Fasta

Ultra-fast, lightweight, zero-copy, lazy Multi-FASTA parser.

The parser is intended for high-performance applications where the input is expected to be well-formed. It therefore trades input validation and support for deprecated features for parsing speed.

§Sequence Characters

The parser makes no assumptions about the sequence alphabet: it is explicitly intended for custom sequences with characters that do not conform to NCBI specifications. Only two characters cannot appear as sequence data: the unix-style newline (LF), which is ignored, and the greater-than sign (>), which starts a new sequence descriptor in Multi-FASTA files. Note that the parser does not validate whether a sequence descriptor starts at the beginning of a new line.

The parser expects input data that is compatible with ASCII. Multibyte UTF-8 codepoints are processed as separate ASCII characters.
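To illustrate what byte-wise processing means for non-ASCII input, here is a small standard-library sketch. It does not call fire_fasta; byte_and_char_counts is a hypothetical helper that only compares how many bytes and how many Unicode characters a string holds.

```rust
/// Hypothetical helper (not part of fire_fasta): returns the number of bytes
/// and the number of Unicode scalar values in `sequence`.
fn byte_and_char_counts(sequence: &str) -> (usize, usize) {
    (sequence.len(), sequence.chars().count())
}
```

For "ACGT" both counts are 4. For "A\u{e9}" (an A followed by é) the character count is 2 but the byte count is 3, so a byte-oriented parser would see three separate sequence characters.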

Windows-style newlines (CRLF) are not supported. Instead, the parser treats the LF as a unix-style newline and preserves the CR as a valid sequence character. Old FASTA comments starting with ; are also not supported; they are treated as part of the sequence.
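If the input may carry CRLF line endings, one option is to normalize it before parsing. The following is a minimal sketch using only the standard library; strip_carriage_returns is a hypothetical helper, not part of fire_fasta. It assumes that no CR byte is meant as sequence data, and it copies the input whenever a CR is present, which gives up the zero-copy property for that input.

```rust
use std::borrow::Cow;

/// Hypothetical pre-processing helper (not part of fire_fasta): removes every
/// CR byte so that CRLF line endings become plain LF.
/// Input without any CR is returned borrowed, without copying.
fn strip_carriage_returns(input: &[u8]) -> Cow<'_, [u8]> {
    if input.contains(&b'\r') {
        // At least one CR found: build an owned copy without CR bytes.
        Cow::Owned(input.iter().copied().filter(|&b| b != b'\r').collect())
    } else {
        // Nothing to remove: hand back the original slice.
        Cow::Borrowed(input)
    }
}
```

The result dereferences to a &[u8], so it could then be handed to a byte-slice parser.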

§Usage and Lazy Parsing

Calling the parser makes a single pass over the entire input, separating the individual FASTA sequences from each other. No further processing is done and no data is copied.

let seq = ">example\nMSTIL\nAATIL\n\n";
let fasta = parse_fasta_str(&seq)?;
// or parse_fasta(&data) for &[u8] slices

assert_eq!(fasta.sequences.len(), 1);

// Iterating over a sequence removes newlines from the iterator on the fly:
assert_eq!(
    String::from_utf8(fasta.sequences[0].iter().copied().collect::<Vec<_>>())?,
    "MSTILAATIL"
);

// If you want to iterate over a sequence multiple times, it may be faster to first copy the full sequence into its own buffer:
let copied: Box<[u8]> = fasta.sequences[0].copy_sequential();
assert_eq!(copied.as_ref(), b"MSTILAATIL");

Parsing and copying use the memchr crate, and thus operations use SIMD instructions when available.
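The lazy, zero-copy idea can be sketched with the standard library alone. The code below is an illustration under stated assumptions, not the crate's implementation: LazyRecord and split_records are hypothetical names, plain slice methods stand in for memchr, and bytes before the first > are simply dropped, which the crate may handle differently.

```rust
/// Hypothetical record type for illustration; not the crate's FastaSequence.
/// Both fields borrow from the input buffer, so nothing is copied.
struct LazyRecord<'a> {
    /// Descriptor line without the leading '>' and without the trailing LF.
    description: &'a [u8],
    /// Raw sequence bytes, still containing LF line breaks.
    raw: &'a [u8],
}

impl<'a> LazyRecord<'a> {
    /// Yields the sequence bytes, skipping LF on the fly.
    fn residues(&self) -> impl Iterator<Item = u8> + '_ {
        self.raw.iter().copied().filter(|&b| b != b'\n')
    }
}

/// One pass over the input: split at every '>' and record slice boundaries.
/// As in the documented parser, '>' is not required to start a line.
fn split_records(input: &[u8]) -> Vec<LazyRecord<'_>> {
    let mut records = Vec::new();
    // Assumption of this sketch: anything before the first '>' is ignored.
    for chunk in input.split(|&b| b == b'>').skip(1) {
        // The descriptor runs up to the first LF (or to the end of the chunk).
        let line_end = chunk
            .iter()
            .position(|&b| b == b'\n')
            .unwrap_or(chunk.len());
        let description = &chunk[..line_end];
        // Everything after that LF is the raw, unprocessed sequence.
        let raw = chunk.get(line_end + 1..).unwrap_or(&[]);
        records.push(LazyRecord { description, raw });
    }
    records
}
```

Calling split_records on b">example\nMSTIL\nAATIL\n\n" gives one record whose description is b"example"; collecting residues() on it yields the bytes of MSTILAATIL. Newlines are filtered on every iteration, which is why copying the sequence into its own buffer can pay off when it is traversed repeatedly.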

Structs§

Fasta: A Multi FASTA file containing zero, one, or more FastaSequences. Access the sequences through its sequences field.
FastaSequence: A FASTA sequence with a description from a FASTA file. The sequence is not processed in any way, meaning accessing it performs further parsing when necessary.

Enums§

ParseError: FASTA parsing error returned by the initial parsing step in parse_fasta.

Functions§

parse_fasta: Parse a FASTA or Multi FASTA file from a &[u8] slice. Sequence descriptions are expected to start with ‘>’. The deprecated comment character ‘;’ is not parsed, neither for sequence descriptors nor for additional comment lines. Parsing is done lazily: sequence descriptions and sequences are identified, but are not further processed.
parse_fasta_str: Parse a FASTA or Multi FASTA file from a string slice. Sequence descriptions are expected to start with ‘>’. The deprecated comment character ‘;’ is not parsed, neither for sequence descriptors nor for additional comment lines. Parsing is done lazily: sequence descriptions and sequences are identified, but are not further processed.
