grepq
quickly filter fastq files by matching sequences to a set of regex patterns
Performance
grepq is fast.
On a Mac Studio with 32GB RAM and an Apple M1 Max chip, grepq processed a 104GB fastq file in 88 seconds, about 1.2GB of fastq data per second. For an 874MB fastq file, it was around 4.8 and 450 times faster than the general-purpose regex tools ripgrep and grep, respectively, on the same hardware.
Furthermore, grepq matches regex patterns only against the sequence part of each fastq record, which is the most common use case. In contrast, ripgrep and grep match the patterns against the entire fastq record: the record ID, sequence, separator, and quality. This can lead to false positives and slows down the filtering process.
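The difference between matching the whole record and matching only the sequence can be illustrated with a minimal Python sketch. This is not grepq's implementation; the record, the motif, and the function names are made up for illustration, and the sketch assumes the common four-line fastq layout (ID, sequence, separator, quality).

```python
import re

# A made-up four-line fastq record: ID, sequence, separator, quality.
RECORD = [
    "@read1 sample=GATTACA",  # the record ID happens to contain the motif
    "ACGTACGTTTGACC",         # the sequence itself does not
    "+",
    "IIIIIIIIIIIIII",
]

PATTERN = re.compile(r"GATTACA")


def matches_whole_record(record, pattern):
    """Search every line of the record, as a line-oriented tool would."""
    return any(pattern.search(line) for line in record)


def matches_sequence_only(record, pattern):
    """Search only the second line of the record, i.e. the sequence."""
    return pattern.search(record[1]) is not None


print(matches_whole_record(RECORD, PATTERN))   # True: a false positive from the ID line
print(matches_sequence_only(RECORD, PATTERN))  # False: the sequence has no match
```

Searching one line out of four also means less text is scanned per record, which is consistent with the speed difference described above.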
Usage
- <PATTERNS>: the file containing the regex patterns
- <FILE>: the fastq file to filter
- tips
- order your regex patterns from those that are most likely to match to those that are least likely to match. This will speed up the filtering process.
- ensure you have enough storage space for the output file.
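Why pattern order can matter is sketched below in Python. The sketch assumes patterns are tried one after another and that checking stops at the first match; grepq's internal matching strategy may differ. The patterns, reads, and function names are hypothetical.

```python
import re


def first_match(sequence, patterns):
    """Return (index of the first matching pattern or None, patterns tried)."""
    for tried, pattern in enumerate(patterns, start=1):
        if pattern.search(sequence):
            return tried - 1, tried
    return None, len(patterns)


def total_tries(sequences, patterns):
    """Count how many pattern evaluations are needed across all sequences."""
    return sum(first_match(seq, patterns)[1] for seq in sequences)


common = re.compile(r"ACGT")    # hypothetical pattern that matches most reads
rare = re.compile(r"TTTTTTTT")  # hypothetical pattern that rarely matches

reads = ["AACGTA", "CCACGTCC", "ACGTACGT", "GGTTTTTTTTGG"]

print(total_tries(reads, [common, rare]))  # 5 evaluations
print(total_tries(reads, [rare, common]))  # 7 evaluations
```

With the likely pattern first, most reads are accepted after a single evaluation, so the total work is lower for the same set of matches.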
Requirements
grepq has been tested on Linux and macOS. It might work on Windows, but it has not been tested.
- ensure that Rust is installed on your system (https://www.rust-lang.org/tools/install)
Installation
- from source
  - clone the repository and cd into the grepq directory
  - run cargo build --release
  - relative to the cloned parent directory, the executable will be located in ./grepq/target/release
- from crates.io
  - run cargo install grepq
Checksums to verify grepq is working correctly, using the regex file regex.txt and the small fastq file small.fastq, both located in the test directory:
License
MIT