xsra
A performant and storage-efficient CLI tool to extract sequences from an SRA archive with support for FASTA, FASTQ, and BINSEQ outputs.
Overview
The NCBI Sequence Read Archive (SRA) is a repository of raw sequencing data.
The file format used by the SRA is a complicated binary database format that isn't directly readable by most bioinformatics tools.
This tool uses the ncbi_vdb C library, through ncbi-vdb-sys, to interact with SRA archives via safe abstractions.
This tool is designed to be a fast, storage-efficient, and more convenient replacement for the fastq-dump and fasterq-dump tools provided by the NCBI.
However, it is not a complete feature-for-feature replacement, and some functionality may be missing.
Features
- Multi-threaded extraction to FASTA, FASTQ, and BINSEQ records.
- Optional built-in compression of output files (FASTA, FASTQ) - [gzip, bgzip, zstd]
- Choice of BINSEQ output format (*.bq and *.vbq)
- Minimum read length filtering
- Technical / biological read segment selection
- Spot subsetting
- Stream directly from NCBI without intermediate prefetch
- Prefetch SRA records for faster IO
Limitations
- May not support every possible SRA archive layout (let us know if you encounter one that fails)
- Does not support all the options provided by fastq-dump or fasterq-dump
- Will not output sequence identifiers in the same format as fastq-dump or fasterq-dump
- Spot ordering is not guaranteed to be the same as the SRA archive
  - Read segments within a spot are kept in order so that paired-end reads stay together, but the order of spots depends on the order in which the threads complete.
- Installation bundles ncbi-vdb source code and builds it as a static library
  - This may not work on all systems
  - The resulting builds will likely be system-specific, and the resulting binary may not be portable.
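The spot-ordering limitation above can be illustrated with a small Python sketch. It is not how xsra schedules its work internally. It only shows the general effect: when several threads each finish a unit of work at different times, results arrive in completion order, while the segments inside each unit stay together. The `SPOTS` data and the `extract_spot` and `extract_all` helpers are hypothetical names made up for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import random
import time

# Hypothetical data: each spot holds its read segments in order (e.g. R1, R2).
SPOTS = {i: [f"spot{i}/seg0", f"spot{i}/seg1"] for i in range(8)}


def extract_spot(spot_id):
    """Pretend to decode one spot; the random sleep mimics uneven work."""
    time.sleep(random.uniform(0, 0.01))
    return spot_id, SPOTS[spot_id]


def extract_all(num_threads=4):
    """Collect spots in the order the worker threads finish them."""
    results = []
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [pool.submit(extract_spot, i) for i in SPOTS]
        for future in as_completed(futures):
            results.append(future.result())
    return results


records = extract_all()
for spot_id, segments in records:
    print(spot_id, segments)
```

Running this several times typically prints the spots in a different order each time, but every spot is present exactly once and its two segments always appear together and in order.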
Installation
You will need to install cargo, the Rust package manager, first.
# install using cargo
# validate installation
Usage
xsra can either read accessions from disk or stream them directly from SRA.
# Write all records to stdout (defaults to fastq)
# Write all records to stdout (as fasta)
# Write all records to stdout (as fastq)
# Split records into multiple files (will create an output directory and write files there)
# Split records into multiple files and compress them (gzip)
# Split records into multiple files, compress them (zstd), and filter out reads shorter than 11bp
# Write all records to stdout but only use 4 threads and compress the output (bgzip)
# Write only the first 100 spots to stdout
# Write only segments 1 and 2 to stdout
# Describe the SRA file (spot statistics)
# Download an accession to disk
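Because records can be written to stdout, a downstream script can consume them from a pipe. The following is a minimal Python sketch of such a consumer, assuming uncompressed FASTQ with exactly four lines per record and no blank lines. `read_fastq` and `filter_min_length` are hypothetical helpers written for this example and are not part of xsra. The length filter mirrors the idea of dropping reads shorter than 11bp, applied here after the fact rather than by the tool.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of stream
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        # Drop the leading '@' from the header line.
        yield header.rstrip("\n")[1:], sequence, quality


def filter_min_length(records, min_len):
    """Keep only records whose sequence has at least min_len bases."""
    return (rec for rec in records if len(rec[1]) >= min_len)


# In-memory stand-in for piped input; made-up records for illustration.
example = io.StringIO(
    "@r1\nACGTACGTACGT\n+\nIIIIIIIIIIII\n"
    "@r2\nACGT\n+\nIIII\n"
)
kept = list(filter_min_length(read_fastq(example), 11))
print(kept)
```

To read from a real pipe, pass `sys.stdin` in place of the `io.StringIO` object.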
You can also write BINSEQ and VBINSEQ files directly from SRA without an intermediate FASTA or FASTQ file. These operations also support multiple threads for faster processing, using the same arguments as above.
# Write a BINSEQ file to (output.bq) selecting segments 1 and 2 (zero-indexed) as primary and extended.
# Write a BINSEQ file to (output.bq) selecting segment 3 (zero-indexed) as primary.
# Write a VBINSEQ file to (output.vbq) selecting segments 3 and 1 (zero-indexed) as primary and extended.
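The zero-indexed segment selection described above can be sketched in Python. This is only an illustration of the indexing, under the assumption that a spot can be viewed as an ordered list of segments and that "primary" and "extended" name the first and second sequence of each output record. The `select_segments` helper and the example spot are hypothetical and not part of xsra.

```python
def select_segments(spot, primary, extended=None):
    """Return (primary, extended) sequences picked by zero-based segment index."""
    primary_seq = spot[primary]
    extended_seq = spot[extended] if extended is not None else None
    return primary_seq, extended_seq


# Hypothetical spot with four segments, indexed 0 through 3.
spot = ["AAAA", "CCCCCCCC", "GGGGGGGG", "TTTT"]

pair = select_segments(spot, 1, 2)      # segments 1 and 2 as primary and extended
single = select_segments(spot, 3)       # segment 3 as primary only
swapped = select_segments(spot, 3, 1)   # segments 3 and 1 as primary and extended

print(pair, single, swapped)
```

Note that the selection order matters: choosing segments 3 and 1 makes segment 3 the primary sequence, not segment 1.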
You can also use alternative data providers such as GCP.
You will need to provide a project ID.
Contributing
Please feel free to open an issue or pull request if you have any suggestions or improvements.