multiseqex
MULTI SEQuence EXtractor — a fast, parallel CLI tool for extracting
sequences from FASTA files using .fai indexing.
Similar to samtools faidx but built for bulk extraction. Supports multiple
input formats (BED, VCF, GFF, CSV/TSV tables, inline regions), sequence
transforms (reverse complement, RNA conversion, translation), masking,
interval arithmetic, and parallel output across multiple CPU cores.
Installation
From crates.io
From GitHub
Build from source
Prerequisites
- Rust 1.87+ (edition 2024)
- samtools (optional, for pre-building .fai indexes)
If the FASTA file lacks a .fai index, multiseqex builds one automatically
(unless --no-build-fai is set).
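For readers unfamiliar with why a .fai index makes random access cheap, here is a minimal Python sketch of the offset arithmetic implied by the standard five-column .fai layout (name, length, offset, bases per line, bytes per line). It illustrates the index format only and is not taken from the multiseqex source; `FaiEntry`, `parse_fai_line`, and `byte_offset` are hypothetical names, and the sample index line is made up.

```python
from typing import NamedTuple


class FaiEntry(NamedTuple):
    name: str
    length: int      # total number of bases in the contig
    offset: int      # byte offset of the contig's first base in the FASTA
    linebases: int   # bases per sequence line
    linewidth: int   # bytes per sequence line, including the line ending


def parse_fai_line(line: str) -> FaiEntry:
    """Parse one tab-separated .fai line (first five columns)."""
    name, length, offset, linebases, linewidth = line.rstrip("\n").split("\t")[:5]
    return FaiEntry(name, int(length), int(offset), int(linebases), int(linewidth))


def byte_offset(entry: FaiEntry, pos0: int) -> int:
    """Return the file offset of the base at 0-based position pos0."""
    if not 0 <= pos0 < entry.length:
        raise ValueError(f"position {pos0} is outside contig {entry.name}")
    full_lines, within_line = divmod(pos0, entry.linebases)
    return entry.offset + full_lines * entry.linewidth + within_line


# Made-up index line: 60 bases per line, 61 bytes per line (one newline byte).
entry = parse_fai_line("chr1\t1000\t6\t60\t61\n")
print(byte_offset(entry, 0))    # 6
print(byte_offset(entry, 60))   # 67
```

Because the offset is a closed-form expression, a reader can seek straight to any region without scanning the file, which is what makes bulk extraction of many regions practical.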
Quick start
# Single region to stdout
# Multiple regions from a BED file
# VCF variant context extraction
# GFF gene extraction
# Region statistics (GC%, length, masking)
# Translate extracted sequences to amino acids
# K-mer tiling with 100bp windows, 50bp step
Options reference
Input formats
| Flag | Description |
|---|---|
| <FASTA>... | One or more reference FASTA files (positional). Supports bgzip/gzip (transparent decompression). Use - for stdin with --no-index. |
| --regions | Comma-separated regions: chr:start-end, chr:pos+flank |
| --list | File with one region per line. Use - for stdin. |
| --bed | BED file (0-based half-open). Supports optional name (col 4) and strand (col 6). |
| --table | CSV/TSV with header. Columns: CHROM, START, END (range) or CHROM, POS (position + --flank). Optional: NAME, STRAND. |
| --sv-table | SV paired-region table. Columns: CHROM_LEFT/RIGHT, START/END_LEFT/RIGHT or POS_LEFT/RIGHT. |
| --vcf | VCF file. Extracts REF span per record. ID used as name; REF/ALT in header. |
| --gff | GFF3/GTF annotation file. Use --gff-feature to filter (default: gene). |
| --contigs | Comma-separated contig names to extract in full. |
| --contig-list | File with one contig name per line to extract in full. |
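The two inline region forms accepted by --regions can be illustrated with a short Python sketch. This is not the tool's parser: it assumes that chr:pos+flank expands symmetrically to pos-flank..pos+flank, clamps the start at 1, and leaves clamping to the contig length to a later step. `Region`, `parse_region`, and `parse_regions` are hypothetical names.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


_RANGE = re.compile(r"^(?P<chrom>.+):(?P<start>\d+)-(?P<end>\d+)$")
_POINT = re.compile(r"^(?P<chrom>.+):(?P<pos>\d+)\+(?P<flank>\d+)$")


def parse_region(text: str) -> Region:
    """Parse 'chr:start-end' or 'chr:pos+flank' into a 1-based inclusive region."""
    m = _RANGE.match(text)
    if m:
        start, end = int(m["start"]), int(m["end"])
        if start < 1 or end < start:
            raise ValueError(f"invalid range: {text!r}")
        return Region(m["chrom"], start, end)
    m = _POINT.match(text)
    if m:
        pos, flank = int(m["pos"]), int(m["flank"])
        # Assumption: symmetric expansion, start clamped to the first base.
        return Region(m["chrom"], max(1, pos - flank), pos + flank)
    raise ValueError(f"unrecognised region: {text!r}")


def parse_regions(arg: str) -> list[Region]:
    """Split a comma-separated list of regions and parse each one."""
    return [parse_region(part.strip()) for part in arg.split(",") if part.strip()]


print(parse_regions("chr1:100-200,chr2:500+50"))
```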
Region manipulation
| Flag | Description |
|---|---|
| --flank | Symmetric flank size for position-mode regions. |
| --flank-left | Left-side flank (must pair with --flank-right). |
| --flank-right | Right-side flank (must pair with --flank-left). |
| --dedup | Remove duplicate regions (same chr, start, end). |
| --sort | Sort by natural chromosome order then start position. |
| --merge | Merge overlapping/book-ended regions. Implies --sort. |
| --merge-distance | Maximum gap for merging (default: 0). Requires --merge. |
| --subtract | BED file of intervals to subtract from input regions. |
| --intersect | BED file of intervals to intersect with input regions. |
| --tile | Tile each region into windows of this size (bases). |
| --step | Step size for tiling (default: same as --tile). |
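To make the semantics of --merge, --merge-distance, --tile, and --step concrete, the following Python sketch shows one plausible reading of them on 0-based half-open intervals. It is an illustration, not the tool's code. Two assumptions are worth flagging: the sort here is plain lexicographic (the table above says multiseqex uses natural chromosome order), and tiling keeps only full-width windows, dropping any trailing partial window. `merge_intervals` and `tile` are hypothetical names.

```python
def merge_intervals(intervals, max_gap=0):
    """Merge overlapping or near-adjacent (chrom, start, end) intervals.

    Intervals are 0-based half-open. Two intervals on the same contig are
    merged when the gap between them is at most max_gap bases, so the
    default of 0 merges overlapping and book-ended intervals.
    """
    merged = []
    for chrom, start, end in sorted(intervals):  # lexicographic, not natural order
        last = merged[-1] if merged else None
        if last and last[0] == chrom and start - last[2] <= max_gap:
            last[2] = max(last[2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(m) for m in merged]


def tile(start, end, size, step=None):
    """Split [start, end) into windows of `size` bases, advancing by `step`.

    Assumption: only full-width windows are returned.
    """
    step = size if step is None else step
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    return [(s, s + size) for s in range(start, end - size + 1, step)]


regions = [("chr1", 0, 10), ("chr1", 10, 20), ("chr1", 25, 30), ("chr2", 0, 5)]
print(merge_intervals(regions))             # book-ended intervals merge
print(merge_intervals(regions, max_gap=5))  # the 5-base gap is bridged too
print(tile(0, 250, 100, 50))                # 100 bp windows, 50 bp step
```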
Output
| Flag | Description |
|---|---|
| -o, --output | Write all sequences to a single file (default: stdout). |
| --output-dir | Write one file per region (or per SV pair). |
| --line-width | FASTA line width (default: 60). Set to 0 to disable wrapping. |
| --no-wrap | Disable FASTA line wrapping (shorthand for --line-width 0). |
| --tab-out | TSV output: chr, start, end, name, sequence. |
| --fastq | FASTQ output with constant quality character. |
| --qual | Quality character for FASTQ (default: I, phred 40). |
| --stats | Print per-region statistics (TSV) instead of sequences. |
| --name-template | Custom header format. Placeholders: {chr}, {start}, {end}, {name}, {length}, {index}, {strand}. |
| --rc | Reverse complement all extracted sequences. |
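The interaction between --name-template and --line-width can be pictured with a small Python sketch. It assumes the placeholders behave like ordinary named format fields and that a width of 0 means "no wrapping"; how multiseqex actually renders headers may differ. `wrap` and `fasta_record` are hypothetical helper names.

```python
def wrap(seq: str, width: int = 60) -> str:
    """Break a sequence into lines of `width` bases; 0 disables wrapping."""
    if width <= 0:
        return seq
    return "\n".join(seq[i:i + width] for i in range(0, len(seq), width))


def fasta_record(template: str, seq: str, line_width: int = 60, **fields) -> str:
    """Render one FASTA record; {length} is filled in from the sequence."""
    header = template.format(length=len(seq), **fields)
    return f">{header}\n{wrap(seq, line_width)}\n"


record = fasta_record(
    "{name}|{chr}:{start}-{end}|{length}",
    "ACGTACGTACGT",
    line_width=5,
    name="x", chr="chr1", start=1, end=12,
)
print(record, end="")
# >x|chr1:1-12|12
# ACGTA
# CGTAC
# GT
```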
Transforms
| Flag | Description |
|---|---|
| --to-rna | Convert T to U (DNA to RNA). |
| --translate | Translate to amino acids (standard genetic code). Stop codons as *. |
| --uppercase | Force all output bases to uppercase. |
| --lowercase | Force all output bases to lowercase. |
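As a reference for what the sequence transforms compute, here is a compact Python sketch of reverse complement, DNA-to-RNA conversion, and translation with the standard genetic code (NCBI table 1). It is not the tool's implementation. The assumptions: IUPAC ambiguity codes other than N pass through the complement unchanged, a trailing partial codon is dropped, and codons containing unknown bases become X. multiseqex may handle these cases differently.

```python
from itertools import product

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

# Standard genetic code, codons enumerated with base order T, C, A, G.
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product("TCAG", repeat=3), _AMINO_ACIDS)
}


def reverse_complement(seq: str) -> str:
    """Complement each base, preserving case, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def to_rna(seq: str) -> str:
    """Replace T with U, preserving case."""
    return seq.replace("T", "U").replace("t", "u")


def translate(seq: str) -> str:
    """Translate in frame 0; stop codons are emitted as '*'."""
    dna = seq.upper().replace("U", "T")
    usable = len(dna) - len(dna) % 3  # assumption: drop a trailing partial codon
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


print(reverse_complement("AACG"))  # CGTT
print(to_rna("ACGT"))              # ACGU
print(translate("ATGGCCTAA"))      # MA*
```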
Masking
| Flag | Description |
|---|---|
| --mask-bed | BED file defining regions to mask within extracted sequences. |
| --hard-mask | Replace masked bases with N (default when --mask-bed is given). |
| --soft-mask | Lowercase masked bases instead of replacing with N. |
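The difference between hard and soft masking is easiest to see on a small example. The Python sketch below applies contig-coordinate mask intervals to an already extracted sequence; it assumes 0-based half-open coordinates throughout and is only an illustration of the idea. `apply_mask` is a hypothetical name.

```python
def apply_mask(seq, region_start, mask_intervals, soft=False):
    """Mask the parts of `seq` that overlap `mask_intervals`.

    seq            -- bases extracted from [region_start, region_start + len(seq))
    mask_intervals -- (start, end) pairs on the same contig, 0-based half-open
    soft           -- lowercase masked bases instead of replacing them with N
    """
    bases = list(seq)
    region_end = region_start + len(seq)
    for mask_start, mask_end in mask_intervals:
        # Clip the mask to the extracted region, then shift to sequence indices.
        lo = max(mask_start, region_start) - region_start
        hi = min(mask_end, region_end) - region_start
        for i in range(lo, hi):  # empty when the mask does not overlap
            bases[i] = bases[i].lower() if soft else "N"
    return "".join(bases)


seq = "ACGTACGTAC"  # pretend this was extracted from positions 100..110
print(apply_mask(seq, 100, [(102, 105)]))             # ACNNNCGTAC
print(apply_mask(seq, 100, [(102, 105)], soft=True))  # ACgtaCGTAC
```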
Other
| Flag | Description |
|---|---|
| --delimiter | Override delimiter for --table / --sv-table. Accepts tab, comma, or a single character. |
| --threads | Number of worker threads (default: all available CPUs). |
| --no-build-fai | Error if .fai is missing instead of building one. |
| --no-index | Scan FASTA sequentially without an FAI index. Loads into memory. Required for stdin (-). |
| -q, --quiet | Suppress progress messages, warnings, and the progress bar. |
Feature highlights
- Multiple FASTA files: pass several FASTA files as positional arguments. Contigs are looked up across all files (each contig must appear in exactly one file).
- Bgzip support: bgzipped and gzipped FASTA files are decompressed transparently.
- Progress bar: shown on stderr when writing to a file (unless --quiet).
- Bulk-read optimisation: nearby regions on the same contig are read in a single I/O operation, reducing seek overhead.
- Streaming output: stdout and single-file output buffer results in memory to preserve input order while extracting in parallel.
- Interval arithmetic: --subtract and --intersect apply set operations against a BED file before extraction.
- K-mer tiling: --tile and --step break regions into fixed-width windows for downstream analysis.
- Coordinate systems: inline regions and tables use 1-based inclusive coordinates. BED uses 0-based half-open (converted internally).
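Mixing the two coordinate systems is a common source of off-by-one errors, so a tiny Python sketch of the conversion may help. The arithmetic is standard; the function names are hypothetical and not part of multiseqex.

```python
def bed_to_one_based(start0: int, end0: int) -> tuple[int, int]:
    """Convert 0-based half-open [start0, end0) to 1-based inclusive."""
    return start0 + 1, end0


def one_based_to_bed(start1: int, end1: int) -> tuple[int, int]:
    """Convert 1-based inclusive [start1, end1] to 0-based half-open."""
    return start1 - 1, end1


# The first 100 bases of a contig in both systems:
print(bed_to_one_based(0, 100))   # (1, 100)
print(one_based_to_bed(1, 100))   # (0, 100)

# Length is end - start in BED and end - start + 1 in 1-based inclusive.
start0, end0 = 0, 100
start1, end1 = bed_to_one_based(start0, end0)
print(end0 - start0, end1 - start1 + 1)  # 100 100
```

Only the start coordinate shifts; the end value is numerically the same in both systems.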
Documentation
See the docs/ folder for detailed guides:
- Usage guide — full walkthrough of all input and output modes
- Testing and benchmarking — how to run tests and measure performance
Version
Current release: 0.2.0
MSRV: 1.87 (Rust edition 2024)
Licence
MIT