Quickly filter FASTQ files by matching sequences to a set of regex patterns
Table of Contents
- Feature set
- Features and performance in detail
- Usage
- Cookbook
- Requirements
- Installation
- Examples and tests
- Further testing
- Citation
- Update changes
- License
Feature set
- very fast and scales to large FASTQ files
- IUPAC ambiguity code support
- gzip support
- JSON support for pattern file input and tune command output, allowing named regex sets and named regex patterns
- use predicates to filter on header field (using a regex), minimum sequence length, and minimum average quality score (supports Phred+33 and Phred+64)
- does not match false positives
- output matched sequences to one of three formats
- tune your pattern file with the tune command
- supports inverted matching with the inverted command
- plays nicely with your unix workflows
- comprehensive help, examples and testing script
Features and performance in detail
1. Very fast and scales to large FASTQ files
| tool | time (s) | × grep speedup | × ripgrep speedup |
|---|---|---|---|
| grepq | 0.20 | 1731 | 18 |
| ripgrep | 3.57 | 97 | NA |
| grep | 345.07 | NA | NA |
2. Reads and writes regular or gzip-compressed FASTQ files
Use the --best option for best compression, or the --fast option for faster compression.
| tool | time (s) | × grep speedup | × ripgrep speedup |
|---|---|---|---|
| grepq | 2.39 | 145 | 1.5 |
| ripgrep | 3.64 | 95 | NA |
| grep | 345.68 | NA | NA |
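To illustrate what transparent handling of regular and gzip-compressed input can look like, here is a minimal Python sketch. It is not grepq's implementation (grepq is a Rust application); the names `as_text` and `read_records` are hypothetical, and the sketch assumes every record occupies exactly four lines.

```python
import gzip
import io
from typing import BinaryIO, Iterator, Tuple

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def as_text(stream: BinaryIO) -> io.TextIOWrapper:
    """Wrap a binary stream as text, decompressing it if it starts with the gzip magic."""
    buffered = io.BufferedReader(stream)
    # peek() looks at the next bytes without consuming them
    if buffered.peek(2)[:2] == GZIP_MAGIC:
        return io.TextIOWrapper(gzip.GzipFile(fileobj=buffered, mode="rb"), encoding="ascii")
    return io.TextIOWrapper(buffered, encoding="ascii")


def read_records(stream: BinaryIO) -> Iterator[Tuple[str, str, str, str]]:
    """Yield (header, sequence, separator, quality) for each four-line FASTQ record."""
    text = as_text(stream)
    while True:
        header = text.readline()
        if not header:  # end of input
            return
        sequence = text.readline()
        separator = text.readline()
        quality = text.readline()
        yield (
            header.rstrip("\n"),
            sequence.rstrip("\n"),
            separator.rstrip("\n"),
            quality.rstrip("\n"),
        )
```

Passing `open(path, "rb")` to `read_records` works the same way for `.fastq` and `.fastq.gz` files. On the writing side, Python's `gzip.compress(data, compresslevel=9)` and `compresslevel=1` expose the same size-versus-speed trade-off that the --best and --fast options describe; how grepq maps those options to compression levels is not stated here.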
3. Predicates
Predicates can be used to filter on the header field (using a regex), minimum sequence length, and minimum average quality score (supports Phred+33 and Phred+64).
[!NOTE] A regex supplied to filter on the header field is first passed as a string to the regex engine, and then the regex engine is used to match the header field. If you get an error message, be sure to escape any special characters in the regex pattern.
Predicates are specified in a JSON pattern file. For an example, see 16S-iupac-and-predicates.json in the examples directory.
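The three predicates can be sketched in Python as follows. This is an illustration only: the function and parameter names are hypothetical and do not reflect the keys used in grepq's JSON pattern files, and the sketch assumes the average quality is the arithmetic mean of the per-base Phred scores.

```python
import re
from typing import Optional


def mean_quality(quality: str, offset: int = 33) -> float:
    """Arithmetic mean of the Phred scores in a quality string.

    offset is 33 for Phred+33 and 64 for Phred+64.
    """
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def passes_predicates(
    header: str,
    sequence: str,
    quality: str,
    header_regex: Optional[str] = None,  # hypothetical parameter names
    min_length: int = 0,
    min_quality: float = 0.0,
    offset: int = 33,
) -> bool:
    """Return True only if the record satisfies every predicate that is set."""
    if header_regex is not None and re.search(header_regex, header) is None:
        return False
    if len(sequence) < min_length:
        return False
    return mean_quality(quality, offset) >= min_quality
```

For example, the quality string `IIII` has a mean score of 40 under Phred+33, because `I` is ASCII 73 and 73 - 33 = 40; the string `hhhh` gives the same score under Phred+64.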
4. Does not match false positives
grepq matches regex patterns only against the sequence field of a FASTQ record, which is the most common use case. ripgrep and grep, by contrast, match the patterns against the entire FASTQ record, which includes the record ID, sequence, separator, and quality fields. This can lead to false positives and slows down the filtering process.
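A small Python sketch shows how a line-oriented search can report a false positive. The record below is made up for illustration: its quality line happens to contain the letters GGCA, which are valid Phred+33 quality characters, while its sequence does not. The helper names are hypothetical.

```python
import re

# A made-up record: the sequence lacks GGCA, but the quality line contains it.
RECORD = "@read1\nTTTTACGTTTTT\n+\nIIGGCAIIIIII\n"


def line_matches(record_text: str, pattern: str) -> list:
    """Line-oriented search: return every line of the record that the pattern hits."""
    rx = re.compile(pattern)
    return [line for line in record_text.splitlines() if rx.search(line)]


def sequence_matches(record_text: str, pattern: str) -> bool:
    """Sequence-only search: test just the second line of the four-line record."""
    sequence = record_text.splitlines()[1]
    return re.search(pattern, sequence) is not None
```

Here `line_matches(RECORD, "GGCA")` returns the quality line `IIGGCAIIIIII`, a false positive, whereas `sequence_matches(RECORD, "GGCA")` returns `False`.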
5. Output matched sequences to one of three formats
- sequences only (default)
- sequences and their corresponding record IDs (-I option)
- FASTQ format (-R option)
6. Will tune your pattern file with the tune command
Use the tune command to analyze matched substrings and update the number and/or order of regex patterns in your pattern file according to their matched frequency. This can speed up the filtering process.
Specifying the -c option to the tune command will output the matched substrings and their frequencies, ranked from highest to lowest.
When the patterns file is given in JSON format, then specifying the -c, --names and --json-matches options to the tune command will output the matched substrings and their frequencies in JSON format to a file called matches.json, allowing named regex sets and named regex patterns. See examples/16S-iupac.json for an example of a JSON pattern file and examples/matches.json for an example of the output of the tune command in JSON format.
[!NOTE] When the count option (-c) is given with the tune command, grepq will count the number of FASTQ records containing a sequence that is matched, for each matching regex in the pattern file. If, however, there are multiple occurrences of a given regex within a FASTQ record sequence field, grepq will count this as one match. When the count option (-c) is not given with the tune command, grepq provides the total number of matching FASTQ records for the set of regex patterns in the pattern file.
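The counting rule in the note above can be sketched in Python. This is an illustration under stated assumptions, not grepq's code: the function name is hypothetical, and patterns that match no record are simply left out of the result.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple


def count_matching_records(
    patterns: Iterable[str], sequences: Iterable[str]
) -> List[Tuple[str, int]]:
    """Count, for each pattern, how many sequences it matches at least once.

    Several occurrences of a pattern within one sequence count as one match.
    The result is ranked from most to least frequent.
    """
    compiled = [(pattern, re.compile(pattern)) for pattern in patterns]
    counts = Counter()
    for sequence in sequences:
        for pattern, rx in compiled:
            if rx.search(sequence):  # search() stops at the first hit
                counts[pattern] += 1
    return counts.most_common()
```

Writing the ranked patterns back to a pattern file, optionally keeping only the top few, gives a tuned file in which frequently matching patterns come first and never-matching patterns are dropped.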
7. Supports inverted matching with the inverted command
Use the inverted command to output sequences that do not match any of the regex patterns in your pattern file.
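As a minimal Python sketch of the idea (the function name is hypothetical and this is not grepq's implementation), inverted matching keeps a sequence only when none of the patterns is found in it:

```python
import re
from typing import Iterable, List


def inverted_filter(patterns: Iterable[str], sequences: Iterable[str]) -> List[str]:
    """Return the sequences that match none of the patterns."""
    compiled = [re.compile(pattern) for pattern in patterns]
    return [
        sequence
        for sequence in sequences
        if not any(rx.search(sequence) for rx in compiled)
    ]
```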
8. Plays nicely with your unix workflows
For example, see tune.sh in the examples directory. This simple script will filter a FASTQ file using grepq, tune the pattern file on a user-specified number of FASTQ records, and then filter the FASTQ file again using the tuned pattern file for a user-specified number of the most frequent regex pattern matches.
Usage
Run grepq -h for instructions and examples. For more information on the tune and inverted commands, run grepq tune -h and grepq inverted -h, respectively.
[!NOTE] Pattern files must contain one regex pattern per line or be provided in JSON format, and patterns are case-sensitive. You can supply an empty pattern file to count the total number of records in the FASTQ file. The regex patterns should only include the DNA sequence characters (A, C, G, T), or IUPAC ambiguity codes (N, R, Y, etc.). See 16S-no-iupac.txt, 16S-iupac.json and 16S-iupac-and-predicates.json in the examples directory for examples of valid pattern files.
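To show what IUPAC ambiguity codes mean in regex terms, the Python sketch below expands each code into the character class of bases it stands for. The mapping is the standard IUPAC nucleotide code table; whether grepq expands codes in exactly this way is an assumption, and the function name is hypothetical. The sketch also assumes the codes appear as plain characters, not inside an existing character class.

```python
import re

# Standard IUPAC nucleotide codes and the bases each one represents.
IUPAC_CLASSES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def iupac_to_regex(pattern: str) -> str:
    """Replace every IUPAC code with a regex character class.

    Characters that are not IUPAC codes (regex operators, for example)
    are passed through unchanged. Matching stays case-sensitive.
    """
    return "".join(IUPAC_CLASSES.get(ch, ch) for ch in pattern)
```

For example, `iupac_to_regex("ACNR")` produces `AC[ACGT][AG]`, which matches `ACTG` but not `ACTC`, because R stands for A or G only.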
Requirements
- grepq has been tested on Linux and macOS. It might work on Windows, but it has not been tested.
- Ensure that Rust is installed on your system (https://www.rust-lang.org/tools/install)
- If the build fails, make sure you have the latest version of the Rust compiler by running rustup update
- To run the test.sh and cookbook.sh scripts in the examples directory, you will need yq (v4.44.6 or later), gunzip and version 4 or later of bash.
- To run "test-10" in commands-1.yaml, commands-2.yaml, commands-3.yaml and commands-4.yaml, you will need to download the file SRX26365298.fastq.gz from the SRA and place it in the examples directory. You can download the file with fastq-dump --accession SRX26365298. Obtain fastq-dump from the SRA Toolkit, available at NCBI.
Installation
- From crates.io (easiest method, but will not install the examples directory): cargo install grepq
- From source (will install the examples directory)
  - Clone the repository and cd into the grepq directory
  - Run cargo build --release
  - Relative to the cloned parent directory, the executable will be located in ./target/release
  - Make sure the executable is in your PATH or use the full path to the executable
Examples and tests
Run grepq -h for instructions and examples, and grepq tune -h and grepq inverted -h for more information on the tune and inverted commands, respectively. See the examples directory for examples of pattern files and FASTQ files.
File sizes of outfiles to verify grepq is working correctly, using the regex file 16S-no-iupac.txt and the small fastq file small.fastq, both located in the examples directory:
For the curious-minded, note that the regex patterns in 16S-no-iupac.txt, 16S-iupac.json, and 16S-iupac-and-predicates.json are from Table 3 of Martinez-Porchas, Marcel, et al. "How conserved are the conserved 16S-rRNA regions?." PeerJ 5 (2017): e3036.
For more examples, see the examples directory and the cookbook, available also as a shell script in the examples directory.
Test script
You may also run the test script (test.sh) in the examples directory to more fully test grepq. From the examples directory, run the following command:
; ; ;
If all tests pass, there will be no orange (warning) text in the output, and no test will report a failure.
Example of failing test output:
SARS-CoV-2 example
Count of the top five most frequently matched patterns found in SRX26602697.fastq using the pattern file SARS-CoV-2.txt (this pattern file contains 64 sequences of length 60 from Table II of this preprint):
Obtain SRX26602697.fastq from the SRA using fastq-dump --accession SRX26602697.
Further testing
grepq can be tested using tools that generate synthetic FASTQ files, such as spikeq (https://github.com/Rbfinch/spikeq).
Citation
If you use grepq in your research, please cite as follows:
Crosbie, N.D. (2024). grepq: A Rust application that quickly filters FASTQ files by matching sequences to a set of regex patterns. 10.5281/zenodo.14031703
Update changes
See CHANGELOG
License
MIT