A vectorized library for FASTA/FASTQ parsing and bitpacking.
§Requirements
This library requires AVX2, SSE3, or NEON instruction sets. Enable target-cpu=native when
building:
RUSTFLAGS="-C target-cpu=native" cargo run --release
If your CPU has poor support for the PDEP instruction (e.g. AMD CPUs prior to 2020), use the no-pdep feature:
RUSTFLAGS="-C target-cpu=native" cargo run --release -F no-pdep
§Minimal example
To get started, define a configuration via ParserOptions
and build a FastxParser from it.
use helicase::input::*;
use helicase::*;
// set the options of the parser (at compile-time)
const CONFIG: Config = ParserOptions::default().config();
fn main() {
    let path = "...";
    // create a parser with the desired options
    let mut parser = FastxParser::<CONFIG>::from_file(&path).expect("Cannot open file");
    // iterate over records
    while let Some(_event) = parser.next() {
        // get a reference to the header
        let header = parser.get_header();
        // get a reference to the sequence (without newlines)
        let seq = parser.get_dna_string();
        // ...
    }
}
§Adjusting the configuration
The parser is configured at compile-time via ParserOptions.
For example, to ignore headers and split non-ACTG bases:
use helicase::*;
const CONFIG: Config = ParserOptions::default()
    .ignore_headers()
    .split_non_actg()
    .config();
§Bitpacked DNA formats
The parser can output a bitpacked representation of the sequence in two formats:
- PackedDNA maps each base to two bits and packs them (compatible with packed-seq via the packed-seq feature).
- ColumnarDNA separates the high bit and the low bit of each base into two bitmasks.
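To make the difference between the two layouts concrete, here is a minimal standalone sketch that does not use this crate. It assumes the mapping A=0, C=1, T=2, G=3 obtained from `(b >> 1) & 3`, four bases per byte for the packed layout, and 64 bases per word for the columnar layout. The helper names `encode_base`, `pack_2bit` and `columnar` are hypothetical, and the crate's actual in-memory layout may differ.

```rust
/// Map an ASCII base to a 2-bit code (assumed mapping: A=0, C=1, T=2, G=3).
fn encode_base(b: u8) -> u8 {
    (b >> 1) & 3
}

/// Packed layout: four bases per byte, first base in the lowest two bits.
fn pack_2bit(seq: &[u8]) -> Vec<u8> {
    let mut out = vec![0u8; (seq.len() + 3) / 4];
    for (i, &b) in seq.iter().enumerate() {
        out[i / 4] |= encode_base(b) << (2 * (i % 4));
    }
    out
}

/// Columnar layout: two bitmasks (high bits, low bits), bit i describes base i.
fn columnar(seq: &[u8]) -> (Vec<u64>, Vec<u64>) {
    let words = (seq.len() + 63) / 64;
    let mut high = vec![0u64; words];
    let mut low = vec![0u64; words];
    for (i, &b) in seq.iter().enumerate() {
        let code = encode_base(b) as u64;
        high[i / 64] |= (code >> 1) << (i % 64);
        low[i / 64] |= (code & 1) << (i % 64);
    }
    (high, low)
}
```

Under these assumptions, `pack_2bit(b"ACTG")` yields the single byte `0xE4`, while `columnar(b"ACTG")` yields the masks `0b1100` (high bits) and `0b1010` (low bits).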
Since each base is encoded in two bits, non-ACTG bases must be handled explicitly. Three
options are available via ParserOptions:
- split_non_actg splits the sequence at non-ACTG bases, yielding one DnaChunk event per contiguous ACTG run (default for bitpacked formats).
- skip_non_actg skips non-ACTG bases and merges the remaining chunks, yielding one Record event per record.
- keep_non_actg keeps non-ACTG bases and encodes them lossily, yielding one Record event per record (default for string format).
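The effect of the three strategies can be illustrated on a plain byte string. The sketch below is standalone and does not call this crate. The function names `split_runs`, `skip_and_merge` and `keep_lossy` are hypothetical, and the lossy encoding shown, `(b >> 1) & 3`, is an assumption rather than the crate's documented behavior.

```rust
/// True for the four bases that have an exact 2-bit code (case-insensitive).
fn is_actg(b: u8) -> bool {
    matches!(b.to_ascii_uppercase(), b'A' | b'C' | b'T' | b'G')
}

/// "Split" strategy: one slice per contiguous ACTG run.
fn split_runs(seq: &[u8]) -> Vec<&[u8]> {
    seq.split(|&b| !is_actg(b))
        .filter(|chunk| !chunk.is_empty())
        .collect()
}

/// "Skip" strategy: drop non-ACTG bases and merge what remains.
fn skip_and_merge(seq: &[u8]) -> Vec<u8> {
    seq.iter().copied().filter(|&b| is_actg(b)).collect()
}

/// "Keep" strategy: encode every byte to two bits, so non-ACTG bases
/// collide with a regular base (assumed mapping, for illustration only).
fn keep_lossy(seq: &[u8]) -> Vec<u8> {
    seq.iter().map(|&b| (b >> 1) & 3).collect()
}
```

For the input `ACNNGT`, `split_runs` returns the two runs `AC` and `GT`, and `skip_and_merge` returns `ACGT`. With the assumed mapping, `keep_lossy` encodes `N` with the same code as `G`, which is what makes that option lossy.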
§Events
The parser is an iterator that yields Event values.
An event signals a record boundary or a contiguous DNA chunk,
but the data is always read from the parser itself via get_header, get_dna_string, etc.
There are two kinds of event:
- Event::Record emitted once per record, after all of its DNA chunks. Enabled by return_record (on by default).
- Event::DnaChunk emitted for each contiguous ACTG run. Enabled by return_dna_chunk (on by default with dna_packed and dna_columnar).
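As a rough mental model of the event order for a single record, the following standalone sketch emits one chunk event per ACTG run, followed by a record event. `ToyEvent` and `toy_events` are hypothetical names and are not part of this crate; the two boolean parameters only mimic the return_dna_chunk and return_record options described above.

```rust
use std::ops::Range;

#[derive(Debug, PartialEq)]
enum ToyEvent {
    /// One contiguous ACTG run, as a byte range into the record's sequence.
    DnaChunk(Range<usize>),
    /// End of the record, emitted after all of its chunks.
    Record,
}

/// Events that a toy parser would yield for one record's sequence.
fn toy_events(seq: &[u8], return_dna_chunk: bool, return_record: bool) -> Vec<ToyEvent> {
    let mut events = Vec::new();
    if return_dna_chunk {
        let mut start = None;
        for (i, &b) in seq.iter().enumerate() {
            let actg = matches!(b, b'A' | b'C' | b'T' | b'G');
            match (start, actg) {
                // a new ACTG run begins
                (None, true) => start = Some(i),
                // the current run ends at a non-ACTG base
                (Some(s), false) => {
                    events.push(ToyEvent::DnaChunk(s..i));
                    start = None;
                }
                _ => {}
            }
        }
        // flush the run that reaches the end of the sequence
        if let Some(s) = start {
            events.push(ToyEvent::DnaChunk(s..seq.len()));
        }
    }
    if return_record {
        events.push(ToyEvent::Record);
    }
    events
}
```

With both flags set, the sequence `ACNNGT` produces two chunk events (ranges `0..2` and `4..6`) and then one record event; with only one flag set, a single kind of event remains.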
When both are active you need to match on the event to distinguish them:
use helicase::input::*;
use helicase::parser::Event;
use helicase::*;
// dna_packed enables DnaChunk events; Record events stay enabled by default.
const CONFIG: Config = ParserOptions::default().dna_packed().config();
fn main() {
    let path = "...";
    let mut parser = FastxParser::<CONFIG>::from_file(&path).expect("Cannot open file");
    while let Some(event) = parser.next() {
        match event {
            Event::Record(_) => {
                // all chunks of this record have been processed
            }
            Event::DnaChunk(_) => {
                // one contiguous ACTG run is ready
                let seq = parser.get_dna_packed();
            }
        }
    }
}
When only one type of event is active, the event value can be safely ignored:
use helicase::input::*;
use helicase::*;
// Default config: only Record events, one per record.
const CONFIG: Config = ParserOptions::default().config();
fn main() {
    let path = "...";
    let mut parser = FastxParser::<CONFIG>::from_file(&path).expect("Cannot open file");
    while let Some(_event) = parser.next() {
        let header = parser.get_header();
        let seq = parser.get_dna_string();
    }
}
It is even possible to disable all events to process the entire file in one go, for instance if you simply want to count bases.
§Iterating over chunks of packed DNA
use helicase::input::*;
use helicase::*;
const CONFIG: Config = ParserOptions::default()
    // by default, dna_packed splits at non-ACTG bases and stops after each chunk
    .dna_packed()
    // don't stop the iterator at the end of a record
    .return_record(false)
    .config();
fn main() {
    let path = "...";
    let mut parser = FastxParser::<CONFIG>::from_file(&path).expect("Cannot open file");
    // iterate over each chunk of ACTG bases
    while let Some(_event) = parser.next() {
        // headers are still accessible between chunks
        let header = parser.get_header();
        // get a reference to the packed sequence
        let seq = parser.get_dna_packed();
        // or directly get a PackedSeq (requires the packed-seq feature)
        // let packed_seq = parser.get_packed_seq();
    }
}
§Crate features
| Feature | Default | Description |
|---|---|---|
| packed-seq | no | conversion to packed-seq types |
| no-pdep | no | disable PDEP instruction (recommended for AMD CPUs prior to 2020) |
| gz | yes | gzip decompression |
| zstd | yes | zstd decompression |
| bz2 | no | bzip2 decompression |
| xz | no | xz decompression |
Re-exports§
pub use config::Config;
pub use config::ParserOptions;
pub use parser::FastaParser;
pub use parser::FastqParser;
pub use parser::FastxParser;
pub use parser::HelicaseParser;
Modules§
- config
- Compile-time configuration of the parser.
- dna_format
- Bitpacked DNA formats.
- input
- Input types and helpers.
- parser
- Parser types and traits.