Helicase
Helicase is a carefully optimized FASTA/FASTQ parser that extensively uses vectorized instructions.
It is designed for three main goals: being highly configurable, handling non-ACTG bases and computing bitpacked representations of DNA.
Requirements
This library requires the AVX2, SSE3 or NEON instruction set. Make sure to enable target-cpu=native when building with it:
RUSTFLAGS="-C target-cpu=native"
Note: if your CPU has poor support for the PDEP instruction (e.g. AMD CPUs prior to 2020), it is recommended to enable the no-pdep feature:
RUSTFLAGS="-C target-cpu=native"
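If you are unsure which of these instruction sets your machine exposes, a small standalone check can help. The sketch below uses only the standard library's runtime feature detection macros; the function name `detected_simd` is hypothetical and not part of this library. Note that detecting BMI2 only tells you that PDEP exists, not whether it is fast on your CPU.

```rust
/// Return the names of the relevant CPU features detected at runtime.
/// `detected_simd` is an illustrative helper, not part of this library.
fn detected_simd() -> Vec<&'static str> {
    #[allow(unused_mut)]
    let mut found: Vec<&'static str> = Vec::new();

    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            found.push("avx2");
        }
        if std::arch::is_x86_feature_detected!("sse3") {
            found.push("sse3");
        }
        // PDEP is part of the BMI2 extension.
        if std::arch::is_x86_feature_detected!("bmi2") {
            found.push("bmi2");
        }
    }

    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            found.push("neon");
        }
    }

    found
}
```

Printing the result with `println!("{:?}", detected_simd());` lists the features available on the current machine.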
Usage
Minimal example
use *;
use *;
// set the options of the parser (at compile-time)
const CONFIG: Config = default.config;
Adjusting the configuration
The parser is configured at compile-time via ParserOptions.
For example, to ignore headers and split non-ACTG bases:
const CONFIG: Config = default
.ignore_headers
.split_non_actg
.config;
Bitpacked DNA formats
The parser can output a bitpacked representation of the sequence in two different formats:
- PackedDNA, which maps each base to two bits and packs them (compatible with packed-seq using the corresponding feature).
- ColumnarDNA, which separates the high bit and the low bit of each base and stores them in two bitmasks.
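To make the two layouts concrete, here is a minimal scalar sketch. It assumes the mapping A=0, C=1, T=2, G=3 and a little-endian bit order, neither of which is confirmed by this README, and the function names are hypothetical. The library's vectorized implementation will look quite different.

```rust
/// Map an ASCII base to a 2-bit code, assuming A=0, C=1, T=2, G=3.
/// Bits 1 and 2 of the ASCII code happen to distinguish the four bases.
fn encode_base(b: u8) -> u8 {
    (b >> 1) & 3
}

/// Packed layout: 2 bits per base, 4 bases per byte, first base in the low bits.
fn pack_2bit(seq: &[u8]) -> Vec<u8> {
    let mut out = vec![0u8; (seq.len() + 3) / 4];
    for (i, &b) in seq.iter().enumerate() {
        out[i / 4] |= encode_base(b) << (2 * (i % 4));
    }
    out
}

/// Columnar layout: one bitmask for the high bits and one for the low bits,
/// with bit `i` of each mask belonging to base `i`.
fn columnar(seq: &[u8]) -> (Vec<u64>, Vec<u64>) {
    let words = (seq.len() + 63) / 64;
    let mut high = vec![0u64; words];
    let mut low = vec![0u64; words];
    for (i, &b) in seq.iter().enumerate() {
        let code = encode_base(b) as u64;
        high[i / 64] |= (code >> 1) << (i % 64);
        low[i / 64] |= (code & 1) << (i % 64);
    }
    (high, low)
}
```

Under these assumptions, `pack_2bit(b"ACTG")` yields the single byte `0b1110_0100`, while `columnar(b"ACTG")` yields the masks `0b1100` (high bits) and `0b1010` (low bits).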
Since each base is encoded using two bits, we have to handle non-ACTG bases differently.
Three options are available via ParserOptions:
- split_non_actg splits the sequence at non-ACTG bases, yielding one DnaChunk event per contiguous ACTG run (default for bitpacked formats).
- skip_non_actg skips non-ACTG bases and merges the remaining chunks, yielding one Record event per record.
- keep_non_actg keeps the non-ACTG bases and encodes them lossily, yielding one Record event per record (default for string format).
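The difference between splitting and skipping can be illustrated on a plain byte string. This is a standalone sketch of the two behaviours, not the library's code; the helpers `is_actg`, `split_runs` and `skip_non_actg_bases` are hypothetical names introduced here.

```rust
/// True for the four unambiguous bases (case-insensitive).
fn is_actg(b: u8) -> bool {
    matches!(b.to_ascii_uppercase(), b'A' | b'C' | b'T' | b'G')
}

/// Splitting: one slice per contiguous ACTG run, non-ACTG bases dropped.
fn split_runs(seq: &[u8]) -> Vec<&[u8]> {
    seq.split(|&b| !is_actg(b))
        .filter(|run| !run.is_empty())
        .collect()
}

/// Skipping: non-ACTG bases dropped and the remaining runs merged.
fn skip_non_actg_bases(seq: &[u8]) -> Vec<u8> {
    seq.iter().copied().filter(|&b| is_actg(b)).collect()
}
```

For the input `ACGNNTTANG`, splitting gives the three runs `ACG`, `TTA` and `G`, whereas skipping gives the single sequence `ACGTTAG`.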
Events
The parser is an iterator that yields Event values.
An event signals a record boundary or a contiguous DNA chunk,
but the data is always read from the parser itself via get_header, get_dna_string, etc.
There are two kinds of event:
- Event::Record: emitted once per record, after all of its DNA chunks. Enabled by return_record (on by default).
- Event::DnaChunk: emitted for each contiguous ACTG run. Enabled by return_dna_chunk (on by default with split_non_actg).
When both are active you need to match on the event to distinguish them:
use *;
use Event;
use *;
// dna_packed enables DnaChunk events; Record events are also kept by default.
const CONFIG: Config = default.dna_packed.config;
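The ordering of the two event kinds can be mimicked with a small mock that does not depend on this library. Everything below (`MockEvent`, `events_for`, `count_events`) is hypothetical and only illustrates the pattern described above: one chunk event per ACTG run, followed by one record event, with the events themselves carrying no data.

```rust
/// Stand-in for the parser's event type; carries no data, like a signal.
#[derive(Debug, PartialEq, Clone, Copy)]
enum MockEvent {
    DnaChunk,
    Record,
}

/// Events a splitting parser would emit for one record:
/// one DnaChunk per contiguous ACTG run, then a final Record.
fn events_for(seq: &[u8]) -> Vec<MockEvent> {
    let is_actg = |b: u8| matches!(b, b'A' | b'C' | b'T' | b'G');
    let mut events: Vec<MockEvent> = seq
        .split(|&b| !is_actg(b))
        .filter(|run| !run.is_empty())
        .map(|_| MockEvent::DnaChunk)
        .collect();
    events.push(MockEvent::Record);
    events
}

/// Match on each event to tell chunk boundaries from record boundaries.
fn count_events(events: &[MockEvent]) -> (usize, usize) {
    let (mut chunks, mut records) = (0, 0);
    for event in events {
        match event {
            MockEvent::DnaChunk => chunks += 1,
            MockEvent::Record => records += 1,
        }
    }
    (chunks, records)
}
```

With the real parser, the body of each match arm would read the current header or DNA chunk from the parser rather than from the event.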
When only one type of event is active, the event value can be safely ignored:
use *;
use *;
// Default config: only Record events, one per record.
const CONFIG: Config = default.config;
It is even possible to disable all events to process the entire file in one go, for instance if you simply want to count bases.
Iterating over chunks of packed DNA
use *;
use *;
const CONFIG: Config = default
// by default, dna_packed splits non-ACTG bases and stops after each chunk
.dna_packed
// don't stop the iterator at the end of a record
.return_record
.config;
Crate features
Packed-seq
The PackedDNA format is compatible with packed-seq and can be converted when the packed-seq feature is enabled (disabled by default).
This can be useful for hashing k-mers or computing minimizers & syncmers.
No PDEP
By default, this library uses PDEP to compute the PackedDNA format.
However, this instruction can be very slow on some CPUs (especially AMD CPUs prior to 2020).
If you want an efficient implementation for these CPUs, we recommend using the no-pdep feature.
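For readers unfamiliar with PDEP (parallel bit deposit, part of the BMI2 extension), the following portable sketch shows what the instruction computes: it scatters the low bits of a source word to the positions selected by a mask. This is a reference definition for illustration only, with the hypothetical name `pdep_soft`; it is not how the no-pdep feature is implemented, which this README does not describe.

```rust
/// Software model of PDEP: deposit the low bits of `src`, in order,
/// into the positions of the set bits of `mask`.
fn pdep_soft(src: u64, mut mask: u64) -> u64 {
    let mut out = 0u64;
    let mut bit = 0u32; // index of the next source bit to deposit
    while mask != 0 {
        // Isolate the lowest set bit of the mask.
        let lowest = mask & mask.wrapping_neg();
        if (src >> bit) & 1 == 1 {
            out |= lowest;
        }
        // Clear that bit and move on to the next source bit.
        mask &= mask - 1;
        bit += 1;
    }
    out
}
```

The loop runs once per set bit of the mask, which gives an idea of why a hardware PDEP is attractive for bit packing and why a slow, microcoded PDEP can hurt.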
Decompression
This library supports transparent file decompression using deko. You can choose the supported formats using the following features:
- bz2 for bzip2
- gz for gzip (default)
- xz for xz
- zstd for zstd (default)
Benchmarks
Benchmarks against needletail and paraseq are available in the bench directory.
You can run them on any (possibly compressed) FASTA/FASTQ file using:
RUSTFLAGS="-C target-cpu=native"
For instance, you can run it on this human genome, these short reads or these long reads.
Note that the FASTQ files can easily be converted to FASTA using:
RUSTFLAGS="-C target-cpu=native"
More information in the bench README.
Acknowledgements
This project was initially started by Loup Lobet during his internship with Charles Paperman.