seq_io_parallel
A parallel processing extension for the seq_io crate, providing an ergonomic API for parallel FASTA/FASTQ file processing.
Overview
While seq_io includes parallel implementations for both FASTQ and FASTA readers, this library offers an alternative approach with a potentially more ergonomic API that is not reliant on closures.
The implementation follows a Map-Reduce style of parallelism that emphasizes clarity and ease of use.
Key Features
- Single-producer multi-consumer parallel processing pipeline
- Map-Reduce style processing architecture
- Support for both FASTA and FASTQ formats
- Thread-safe stateful processing
- Efficient memory management with reusable record sets
Architecture
The library implements a parallel processing pipeline with the following components:
- Reader Thread: A dedicated thread that continuously fills a limited set of RecordSets until EOF
- Worker Threads: Multiple threads that process ready RecordSets in parallel
- Record Processing: While RecordSets may be processed out of order, records within each set maintain their sequence
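The pipeline described above can be sketched with nothing but the standard library. This is a minimal illustration under stated assumptions, not this crate's implementation: `RecordSet` here is a hypothetical stand-in (a plain `Vec<String>`), `run_pipeline` is an invented name, and the "work" is simply counting bases. It shows one reader filling a limited pool of reusable sets, several workers draining them, and a final reduce step.

```rust
use std::sync::mpsc::{channel, sync_channel};
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for a record set: a reusable buffer of sequences.
type RecordSet = Vec<String>;

/// Runs a single-producer, multi-consumer pipeline over `records` and
/// returns the total number of bases counted by all workers.
fn run_pipeline(records: Vec<String>, set_size: usize, n_workers: usize) -> usize {
    // Guard against degenerate arguments that would stall the pipeline.
    let set_size = set_size.max(1);
    let n_workers = n_workers.max(1);

    // Bounded channel of filled sets (reader -> workers).
    let (full_tx, full_rx) = sync_channel::<RecordSet>(n_workers);
    // Channel of emptied sets recycled back to the reader (workers -> reader).
    let (empty_tx, empty_rx) = channel::<RecordSet>();

    // Pre-allocate the limited pool of reusable sets.
    for _ in 0..2 * n_workers {
        empty_tx.send(Vec::with_capacity(set_size)).unwrap();
    }

    // Reader thread: fill recycled sets until the input is exhausted.
    let reader = thread::spawn(move || {
        let mut source = records.into_iter().peekable();
        while source.peek().is_some() {
            let mut set = match empty_rx.recv() {
                Ok(set) => set,
                Err(_) => break,
            };
            set.extend(source.by_ref().take(set_size));
            if full_tx.send(set).is_err() {
                break;
            }
        }
        // `full_tx` is dropped here, which tells the workers to stop.
    });

    // Worker threads: share the receiving end behind a mutex.
    let full_rx = Arc::new(Mutex::new(full_rx));
    let mut workers = Vec::new();
    for _ in 0..n_workers {
        let full_rx = Arc::clone(&full_rx);
        let empty_tx = empty_tx.clone();
        workers.push(thread::spawn(move || {
            let mut local_bases = 0usize;
            loop {
                // The lock is released at the end of this statement.
                let next = full_rx.lock().unwrap().recv();
                let mut set = match next {
                    Ok(set) => set,
                    Err(_) => break,
                };
                // Map step: records inside one set are visited in order.
                for seq in &set {
                    local_bases += seq.len();
                }
                set.clear();
                // Recycle the buffer; ignore the error if the reader is done.
                let _ = empty_tx.send(set);
            }
            local_bases
        }));
    }

    reader.join().unwrap();
    // Reduce step: combine the per-worker totals.
    workers.into_iter().map(|w| w.join().unwrap()).sum()
}
```

Calling `run_pipeline(sequences, 1024, 4)` would process the input with four workers and sets of up to 1024 records. Recycling the buffers through the second channel is what keeps the number of live sets bounded.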
Implementation
The ParallelProcessor Trait
To use parallel processing, implement the ParallelProcessor trait:
Record Access
Both FASTA and FASTQ records are accessed through the MinimalRefRecord trait:
Usage Example
Here's a simple example that performs parallel processing of a FASTQ file:
use Result;
use fastq;
use ;
use ;
Performance Considerations
FASTA/FASTQ processing is typically I/O-bound, so parallel processing benefits may vary:
- Best for computationally expensive operations (e.g., alignment, k-mer counting)
- Performance gains depend on the ratio of I/O to processing time
- Consider using Arc for processor state with heavy initialization costs
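To illustrate the last point, here is a small sketch of a processor whose expensive, read-only state sits behind an `Arc`, so cloning it for each worker copies a pointer rather than the data. `KmerFilter`, its fields, and its `process` method are hypothetical names made up for this example and are not part of this crate.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Hypothetical processor whose lookup table is costly to build.
#[derive(Clone)]
struct KmerFilter {
    /// Shared, read-only data: a clone copies the pointer, not the set.
    allowed: Arc<HashSet<Vec<u8>>>,
    /// Per-clone counter: every worker gets its own copy.
    hits: usize,
}

impl KmerFilter {
    /// Builds the table once, before any worker clones exist.
    fn new(kmers: &[&str]) -> Self {
        let allowed: HashSet<Vec<u8>> =
            kmers.iter().map(|k| k.as_bytes().to_vec()).collect();
        Self { allowed: Arc::new(allowed), hits: 0 }
    }

    /// Counts the k-mers of `seq` that appear in the shared table.
    fn process(&mut self, seq: &[u8], k: usize) {
        if k == 0 {
            return; // `windows(0)` would panic
        }
        let found = seq
            .windows(k)
            .filter(|w| self.allowed.contains(*w))
            .count();
        self.hits += found;
    }
}
```

Because the table is never mutated after construction, no lock is needed; only the `hits` counter differs between clones.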
Implementation Notes
- Each worker thread receives a Clone of the ParallelProcessor
- Thread-local state can be maintained without locks
- Global state should use appropriate synchronization (e.g., Arc<AtomicUsize>)
- Heavy initialization costs can be mitigated by wrapping in Arc
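The split between thread-local and global state can be sketched as follows. This is an illustration only: `CountingProcessor`, `process_record`, `flush`, and `count_in_parallel` are hypothetical names, the struct does not implement this crate's trait, and one thread per batch stands in for the real worker pool. Each clone counts locally without locks and publishes its total to a shared `Arc<AtomicUsize>` once.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical processor mixing per-worker and shared state.
#[derive(Clone, Default)]
struct CountingProcessor {
    /// Thread-local after cloning: updated without any synchronization.
    local_records: usize,
    /// Global: every clone points at the same atomic counter.
    global_records: Arc<AtomicUsize>,
}

impl CountingProcessor {
    /// Per-record work touches only the local counter.
    fn process_record(&mut self, _seq: &[u8]) {
        self.local_records += 1;
    }

    /// Publishes the local count to the shared counter and resets it.
    fn flush(&mut self) {
        self.global_records
            .fetch_add(self.local_records, Ordering::Relaxed);
        self.local_records = 0;
    }
}

/// Spawns one worker per batch, each owning its own clone of the processor.
fn count_in_parallel(batches: Vec<Vec<Vec<u8>>>) -> usize {
    let processor = CountingProcessor::default();
    let handles: Vec<_> = batches
        .into_iter()
        .map(|batch| {
            let mut worker = processor.clone();
            thread::spawn(move || {
                for seq in &batch {
                    worker.process_record(seq);
                }
                worker.flush();
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    // All workers have been joined, so their updates are visible here.
    processor.global_records.load(Ordering::Relaxed)
}
```

Flushing once per batch rather than once per record keeps contention on the atomic low, which matters when many workers run at the same time.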
Future Work
This library currently uses anyhow for all error handling.
That is not ideal for libraries that need custom error types, but it works well for many CLI tools.
This may change in the future.