multiseqex

MULTI SEQuence EXtractor — a fast, parallel CLI tool for extracting sequences from FASTA files using .fai indexing.

Similar to samtools faidx but built for bulk extraction. Supports multiple input formats (BED, VCF, GFF, CSV/TSV tables, inline regions), sequence transforms (reverse complement, RNA conversion, translation), masking, interval arithmetic, and parallel output across multiple CPU cores.

Installation

From crates.io

cargo install multiseqex

From GitHub

cargo install --git https://github.com/trentzz/multiseqex

Build from source

git clone https://github.com/trentzz/multiseqex.git
cd multiseqex
cargo build --release
cp target/release/multiseqex ~/.local/bin/

Prerequisites

Rust 1.87+ (edition 2024)
samtools (optional, for pre-building .fai indexes)

If the FASTA file lacks a .fai index, multiseqex builds one automatically (unless --no-build-fai is set).

Quick start

# Single region to stdout
multiseqex ref.fa --regions chr1:1000-2000

# Multiple regions from a BED file
multiseqex ref.fa --bed regions.bed -o out.fa

# VCF variant context extraction
multiseqex ref.fa --vcf variants.vcf --flank 100 -o out.fa

# GFF gene extraction
multiseqex ref.fa --gff annotations.gff3 --gff-feature gene -o genes.fa

# Region statistics (GC%, length, masking)
multiseqex ref.fa --bed regions.bed --stats

# Translate extracted sequences to amino acids
multiseqex ref.fa --regions chr1:1000-2000 --translate

# K-mer tiling with 100bp windows, 50bp step
multiseqex ref.fa --bed regions.bed --tile 100 --step 50 -o tiles.fa

Options reference

Input formats

Flag	Description
`<FASTA>...`	One or more reference FASTA files (positional). Supports bgzip/gzip (transparent decompression). Use `-` for stdin with `--no-index`.
`--regions`	Comma-separated regions: `chr:start-end`, `chr:pos+flank`
`--list`	File with one region per line. Use `-` for stdin.
`--bed`	BED file (0-based half-open). Supports optional name (col 4) and strand (col 6).
`--table`	CSV/TSV with header. Columns: CHROM, START, END (range) or CHROM, POS (position + `--flank`). Optional: NAME, STRAND.
`--sv-table`	SV paired-region table. Columns: CHROM_LEFT/RIGHT, START/END_LEFT/RIGHT or POS_LEFT/RIGHT.
`--vcf`	VCF file. Extracts REF span per record. ID used as name; REF/ALT in header.
`--gff`	GFF3/GTF annotation file. Use `--gff-feature` to filter (default: `gene`).
`--contigs`	Comma-separated contig names to extract in full.
`--contig-list`	File with one contig name per line to extract in full.

Region manipulation

Flag	Description
`--flank`	Symmetric flank size for position-mode regions.
`--flank-left`	Left-side flank (must pair with `--flank-right`).
`--flank-right`	Right-side flank (must pair with `--flank-left`).
`--dedup`	Remove duplicate regions (same chr, start, end).
`--sort`	Sort by natural chromosome order then start position.
`--merge`	Merge overlapping/book-ended regions. Implies `--sort`.
`--merge-distance`	Maximum gap for merging (default: 0). Requires `--merge`.
`--subtract`	BED file of intervals to subtract from input regions.
`--intersect`	BED file of intervals to intersect with input regions.
`--tile`	Tile each region into windows of this size (bases).
`--step`	Step size for tiling (default: same as `--tile`).
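
The `--merge`, `--merge-distance`, `--tile` and `--step` rows above describe interval operations. The following Python sketch illustrates those operations and is not the tool's implementation. It assumes 0-based half-open `(chrom, start, end)` tuples, plain lexicographic chromosome sorting rather than natural order, and that a trailing window shorter than the tile size is dropped; the source does not state how multiseqex treats such a partial window. The function names are hypothetical.

```python
def merge_regions(regions, max_gap=0):
    """Merge overlapping or book-ended half-open intervals.

    Two intervals on the same chromosome are merged when the gap between
    them is at most max_gap bases (0 merges overlapping and adjacent ones).
    """
    merged = []
    # Sorting first mirrors "--merge implies --sort"; note that this is a
    # plain tuple sort, not natural chromosome order.
    for chrom, start, end in sorted(regions):
        if merged and merged[-1][0] == chrom and start - merged[-1][2] <= max_gap:
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


def tile_region(chrom, start, end, size, step=None):
    """Split one half-open interval into windows of `size` bases.

    `step` defaults to `size` (non-overlapping windows). Assumption: a
    trailing window shorter than `size` is dropped.
    """
    step = size if step is None else step
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    return [(chrom, s, s + size) for s in range(start, end - size + 1, step)]
```

For example, under these assumptions a 250-base region tiled with a size of 100 and a step of 50 yields four overlapping windows starting at offsets 0, 50, 100 and 150.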

Output

Flag	Description
`-o, --output`	Write all sequences to a single file (default: stdout).
`--output-dir`	Write one file per region (or per SV pair).
`--line-width`	FASTA line width (default: 60). Set to 0 to disable wrapping.
`--no-wrap`	Disable FASTA line wrapping (shorthand for `--line-width 0`).
`--tab-out`	TSV output: chr, start, end, name, sequence.
`--fastq`	FASTQ output with constant quality character.
`--qual`	Quality character for FASTQ (default: `I`, phred 40).
`--stats`	Print per-region statistics (TSV) instead of sequences.
`--name-template`	Custom header format. Placeholders: `{chr}`, `{start}`, `{end}`, `{name}`, `{length}`, `{index}`, `{strand}`.
`--rc`	Reverse complement all extracted sequences.
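
As an illustration of the `--fastq` and `--qual` rows above, the sketch below builds a standard four-line FASTQ record with a constant quality character and shows why `I` corresponds to Phred 40 under the Phred+33 encoding. The helper names and the record name used in the example are hypothetical; the header text multiseqex actually writes is not specified here.

```python
def phred_score(qual_char):
    """Return the Phred score of a quality character (Phred+33 encoding)."""
    return ord(qual_char) - 33


def fastq_record(name, seq, qual_char="I"):
    """Format one FASTQ record whose quality string repeats one character."""
    if len(qual_char) != 1:
        raise ValueError("qual_char must be a single character")
    return f"@{name}\n{seq}\n+\n{qual_char * len(seq)}\n"


# 'I' is ASCII 73, and 73 - 33 = 40.
print(phred_score("I"))
print(fastq_record("chr1:1-4", "ACGT"), end="")
```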

Transforms

Flag	Description
`--to-rna`	Convert T to U (DNA to RNA).
`--translate`	Translate to amino acids (standard genetic code). Stop codons as `*`.
`--uppercase`	Force all output bases to uppercase.
`--lowercase`	Force all output bases to lowercase.
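
The transforms listed above (together with `--rc` from the Output table) can be sketched in a few lines of Python. This is a minimal illustration, not the tool's code. The codon table is the standard genetic code (NCBI translation table 1). Two behaviours are assumptions because the source does not specify them: a trailing incomplete codon is dropped, and a codon containing a base other than A, C, G, T or U is translated as `X`. All function names are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement each base and reverse the sequence; case is preserved."""
    return seq.translate(COMPLEMENT)[::-1]


def to_rna(seq):
    """Convert T to U, preserving case."""
    return seq.replace("T", "U").replace("t", "u")


def translate(seq):
    """Translate DNA or RNA to amino acids; stop codons become '*'."""
    dna = seq.upper().replace("U", "T")
    usable = len(dna) - len(dna) % 3  # assumption: drop a trailing partial codon
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")  # assumption: unknown codons become X
        for i in range(0, usable, 3)
    )
```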

Masking

Flag	Description
`--mask-bed`	BED file defining regions to mask within extracted sequences.
`--hard-mask`	Replace masked bases with N (default when `--mask-bed` is given).
`--soft-mask`	Lowercase masked bases instead of replacing with N.

Other

Flag	Description
`--delimiter`	Override delimiter for `--table` / `--sv-table`. Accepts `tab`, `comma`, or a single character.
`--threads`	Number of worker threads (default: all available CPUs).
`--no-build-fai`	Error if `.fai` is missing instead of building one.
`--no-index`	Scan FASTA sequentially without an FAI index. Loads into memory. Required for stdin (`-`).

`-q, --quiet`	Suppress progress messages, warnings, and the progress bar.

Feature highlights

Multiple FASTA files: pass several FASTA files as positional arguments. Contigs are looked up across all files (each contig must appear in exactly one file).
Bgzip support: bgzipped and gzipped FASTA files are decompressed transparently.
Progress bar: shown on stderr when writing to a file (unless --quiet).
Bulk-read optimisation: nearby regions on the same contig are read in a single I/O operation, reducing seek overhead.
Streaming output: when writing to stdout or a single file, results are buffered in memory so that output order matches input order even though extraction runs in parallel.
Interval arithmetic: --subtract and --intersect apply set operations against a BED file before extraction.
K-mer tiling: --tile and --step break regions into fixed-width windows for downstream analysis.
Coordinate systems: inline regions and tables use 1-based inclusive coordinates. BED uses 0-based half-open (converted internally).
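
To make the coordinate note above concrete, this Python sketch converts a BED interval (0-based, half-open) into the 1-based inclusive `chr:start-end` form used for inline regions. It also expands a position with a symmetric flank, under the assumption that `chr:pos+flank` means `pos - flank` through `pos + flank` with the start clamped at 1; clamping at the contig end would need the contig length and is left out. Both function names are hypothetical.

```python
def bed_to_region(chrom, bed_start, bed_end):
    """Convert a 0-based half-open BED interval to a 1-based inclusive region.

    Only the start shifts by one; the half-open end already equals the
    inclusive 1-based end.
    """
    return f"{chrom}:{bed_start + 1}-{bed_end}"


def position_with_flank(chrom, pos, flank):
    """Expand a 1-based position by a symmetric flank (assumed semantics)."""
    start = max(1, pos - flank)  # assumption: never extend before base 1
    return f"{chrom}:{start}-{pos + flank}"


print(bed_to_region("chr1", 999, 2000))
print(position_with_flank("chr1", 1500, 100))
```

The BED line `chr1  999  2000` and the inline region `chr1:1000-2000` therefore describe the same 1001 bases.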

Documentation

See the docs/ folder for detailed guides:

Usage guide — full walkthrough of all input and output modes
Testing and benchmarking — how to run tests and measure performance

Version

Current release: 0.2.0
MSRV: 1.87 (Rust edition 2024)

Licence

MIT
