multiseqex 0.2.0

Multi-sequence extractor from FASTA using FAI indexing, with parallelism and flexible region input formats.

multiseqex

MULTI SEQuence EXtractor — a fast, parallel CLI tool for extracting sequences from FASTA files using .fai indexing.

Similar to samtools faidx but built for bulk extraction. Supports multiple input formats (BED, VCF, GFF, CSV/TSV tables, inline regions), sequence transforms (reverse complement, RNA conversion, translation), masking, interval arithmetic, and parallel output across multiple CPU cores.

Installation

From crates.io

cargo install multiseqex

From GitHub

cargo install --git https://github.com/trentzz/multiseqex

Build from source

git clone https://github.com/trentzz/multiseqex.git
cd multiseqex
cargo build --release
cp target/release/multiseqex ~/.local/bin/

Prerequisites

  • Rust 1.87+ (edition 2024)
  • samtools (optional, for pre-building .fai indexes)

If the FASTA file lacks a .fai index, multiseqex builds one automatically (unless --no-build-fai is set).
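
For readers unfamiliar with .fai indexes, here is a minimal Python sketch of how a five-column index line (name, length, offset, line bases, line width, as written by samtools faidx) maps a base position to a byte offset in the FASTA file. The names `FaiEntry`, `parse_fai_line` and `byte_offset` are illustrative only; this is not multiseqex's own code, which is written in Rust.

```python
from typing import NamedTuple


class FaiEntry(NamedTuple):
    name: str
    length: int     # number of bases in the contig
    offset: int     # byte offset of the contig's first base
    linebases: int  # bases per sequence line
    linewidth: int  # bytes per sequence line, including the line terminator


def parse_fai_line(line: str) -> FaiEntry:
    """Parse one tab-separated line of a .fai index."""
    name, length, offset, linebases, linewidth = line.rstrip("\n").split("\t")[:5]
    return FaiEntry(name, int(length), int(offset), int(linebases), int(linewidth))


def byte_offset(entry: FaiEntry, pos0: int) -> int:
    """Byte offset in the FASTA file of the 0-based base position `pos0`."""
    if not 0 <= pos0 < entry.length:
        raise ValueError(f"position {pos0} is outside contig {entry.name}")
    full_lines, within_line = divmod(pos0, entry.linebases)
    return entry.offset + full_lines * entry.linewidth + within_line
```

Because the offset is pure arithmetic, a reader can seek straight to any region without scanning the file, which is what makes indexed bulk extraction cheap.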

Quick start

# Single region to stdout
multiseqex ref.fa --regions chr1:1000-2000

# Multiple regions from a BED file
multiseqex ref.fa --bed regions.bed -o out.fa

# VCF variant context extraction
multiseqex ref.fa --vcf variants.vcf --flank 100 -o out.fa

# GFF gene extraction
multiseqex ref.fa --gff annotations.gff3 --gff-feature gene -o genes.fa

# Region statistics (GC%, length, masking)
multiseqex ref.fa --bed regions.bed --stats

# Translate extracted sequences to amino acids
multiseqex ref.fa --regions chr1:1000-2000 --translate

# K-mer tiling with 100bp windows, 50bp step
multiseqex ref.fa --bed regions.bed --tile 100 --step 50 -o tiles.fa
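
To make the tiling example concrete, the following Python sketch computes the windows for one region. It assumes 0-based half-open coordinates, and the `keep_partial` switch is hypothetical: how multiseqex treats a trailing window shorter than the tile size is not stated here, so both behaviours are shown.

```python
from typing import List, Optional, Tuple


def tile(start: int, end: int, size: int, step: Optional[int] = None,
         keep_partial: bool = False) -> List[Tuple[int, int]]:
    """Split the half-open region [start, end) into windows of `size` bases."""
    if size <= 0:
        raise ValueError("size must be positive")
    step = size if step is None else step  # default step equals the tile size
    if step <= 0:
        raise ValueError("step must be positive")
    windows = []
    pos = start
    while pos < end:
        stop = pos + size
        if stop <= end:
            windows.append((pos, stop))
        elif keep_partial:
            # Trailing window truncated at the region end (assumed option).
            windows.append((pos, end))
        pos += step
    return windows
```

With a 250 bp region, `tile(0, 250, 100, 50)` yields four overlapping 100 bp windows starting at 0, 50, 100 and 150.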

Options reference

Input formats

Flag Description
<FASTA>... One or more reference FASTA files (positional). Supports bgzip/gzip (transparent decompression). Use - for stdin with --no-index.
--regions Comma-separated regions: chr:start-end, chr:pos+flank
--list File with one region per line. Use - for stdin.
--bed BED file (0-based half-open). Supports optional name (col 4) and strand (col 6).
--table CSV/TSV with header. Columns: CHROM, START, END (range) or CHROM, POS (position + --flank). Optional: NAME, STRAND.
--sv-table SV paired-region table. Columns: CHROM_LEFT/RIGHT, START/END_LEFT/RIGHT or POS_LEFT/RIGHT.
--vcf VCF file. Extracts REF span per record. ID used as name; REF/ALT in header.
--gff GFF3/GTF annotation file. Use --gff-feature to filter (default: gene).
--contigs Comma-separated contig names to extract in full.
--contig-list File with one contig name per line to extract in full.
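
As an illustration of the two inline region syntaxes and of the BED conversion, here is a small Python sketch. It assumes that chr:pos+flank means a symmetric window around pos, clipped at position 1; clipping at the contig end is omitted. The function names are hypothetical and this is not the tool's parser.

```python
import re
from typing import List, Tuple

Region = Tuple[str, int, int]  # (contig, start, end), 1-based inclusive

_RANGE = re.compile(r"^(?P<chr>.+):(?P<start>\d+)-(?P<end>\d+)$")
_POINT = re.compile(r"^(?P<chr>.+):(?P<pos>\d+)\+(?P<flank>\d+)$")


def parse_region(text: str) -> Region:
    """Parse 'chr:start-end' or 'chr:pos+flank' into a 1-based inclusive region."""
    text = text.strip()
    m = _RANGE.match(text)
    if m:
        start, end = int(m["start"]), int(m["end"])
        if start < 1 or end < start:
            raise ValueError(f"invalid range: {text!r}")
        return m["chr"], start, end
    m = _POINT.match(text)
    if m:
        pos, flank = int(m["pos"]), int(m["flank"])
        # Assumed semantics: symmetric window, clipped at the contig start.
        return m["chr"], max(1, pos - flank), pos + flank
    raise ValueError(f"unrecognised region: {text!r}")


def parse_regions(arg: str) -> List[Region]:
    """Parse a comma-separated list of regions."""
    return [parse_region(part) for part in arg.split(",") if part.strip()]


def bed_to_one_based(chrom: str, start0: int, end0: int) -> Region:
    """Convert a 0-based half-open BED interval to 1-based inclusive."""
    return chrom, start0 + 1, end0
```

Note that the BED interval `chr1 999 2000` and the inline region `chr1:1000-2000` describe the same 1001 bases.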

Region manipulation

Flag Description
--flank Symmetric flank size for position-mode regions.
--flank-left Left-side flank (must pair with --flank-right).
--flank-right Right-side flank (must pair with --flank-left).
--dedup Remove duplicate regions (same chr, start, end).
--sort Sort by natural chromosome order then start position.
--merge Merge overlapping/book-ended regions. Implies --sort.
--merge-distance Maximum gap for merging (default: 0). Requires --merge.
--subtract BED file of intervals to subtract from input regions.
--intersect BED file of intervals to intersect with input regions.
--tile Tile each region into windows of this size (bases).
--step Step size for tiling (default: same as --tile).
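
The sorting and merging behaviour described above can be sketched in Python as follows. This is an approximation under stated assumptions: regions are 0-based half-open tuples, "natural chromosome order" is taken to mean that digit runs compare numerically, and `max_gap` plays the role of --merge-distance. The tool's exact ordering rules may differ.

```python
import re
from typing import List, Tuple

Region = Tuple[str, int, int]  # (contig, start, end), 0-based half-open


def natural_key(name: str) -> list:
    """Sort key in which digit runs compare as numbers (chr2 before chr10)."""
    parts = re.split(r"([0-9]+)", name)
    # re.split with a capture group puts the digit runs at odd indices.
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]


def merge_regions(regions: List[Region], max_gap: int = 0) -> List[Region]:
    """Merge regions on the same contig that overlap or lie within max_gap bases."""
    ordered = sorted(regions, key=lambda r: (natural_key(r[0]), r[1], r[2]))
    merged: List[Region] = []
    for chrom, start, end in ordered:
        if merged and merged[-1][0] == chrom and start - merged[-1][2] <= max_gap:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged
```

With `max_gap=0`, book-ended regions such as (100, 200) and (200, 300) are joined, while a 10-base gap is only closed once `max_gap` is at least 10.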

Output

Flag Description
-o, --output Write all sequences to a single file (default: stdout).
--output-dir Write one file per region (or per SV pair).
--line-width FASTA line width (default: 60). Set to 0 to disable wrapping.
--no-wrap Disable FASTA line wrapping (shorthand for --line-width 0).
--tab-out TSV output: chr, start, end, name, sequence.
--fastq FASTQ output with constant quality character.
--qual Quality character for FASTQ (default: I, phred 40).
--stats Print per-region statistics (TSV) instead of sequences.
--name-template Custom header format. Placeholders: {chr}, {start}, {end}, {name}, {length}, {index}, {strand}.
--rc Reverse complement all extracted sequences.
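
To show how a header template and a line width combine into a FASTA record, here is a rough Python sketch. It uses Python's `str.format` as a stand-in for the placeholder syntax listed under --name-template; `format_record` is a hypothetical helper, and the tool's own substitution rules and default header are not reproduced here.

```python
def format_record(template: str, seq: str, line_width: int = 60, **fields) -> str:
    """Build one FASTA record from a header template and a sequence.

    `fields` supplies placeholder values such as chr, start, end, name,
    index and strand; {length} is filled in from the sequence itself.
    """
    header = template.format(length=len(seq), **fields)
    if line_width <= 0:
        # A width of 0 disables wrapping: the sequence stays on one line.
        lines = [seq]
    else:
        lines = [seq[i:i + line_width] for i in range(0, len(seq), line_width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"
```

For example, `format_record("{chr}:{start}-{end}", seq, chr="chr1", start=1, end=16)` produces a header of `>chr1:1-16` followed by the sequence wrapped at 60 columns.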

Transforms

Flag Description
--to-rna Convert T to U (DNA to RNA).
--translate Translate to amino acids (standard genetic code). Stop codons as *.
--uppercase Force all output bases to uppercase.
--lowercase Force all output bases to lowercase.
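
The sequence transforms can be illustrated with a few lines of Python. The codon table below is the standard genetic code (NCBI translation table 1). Two details are assumptions rather than documented behaviour of multiseqex: codons containing ambiguous bases are written as X, and a trailing incomplete codon is dropped.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product("TCAG", repeat=3), _AMINO)}

_COMPLEMENT = str.maketrans("ACGTacgtNn", "TGCAtgcaNn")


def reverse_complement(seq: str) -> str:
    """Reverse complement a DNA sequence, preserving case."""
    return seq.translate(_COMPLEMENT)[::-1]


def to_rna(seq: str) -> str:
    """Convert T to U, preserving case."""
    return seq.replace("T", "U").replace("t", "u")


def translate(seq: str) -> str:
    """Translate DNA (or RNA) in frame 1; stop codons become '*'."""
    dna = seq.upper().replace("U", "T")
    usable = len(dna) - len(dna) % 3  # drop a trailing partial codon (assumption)
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))
```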

Masking

Flag Description
--mask-bed BED file defining regions to mask within extracted sequences.
--hard-mask Replace masked bases with N (default when --mask-bed is given).
--soft-mask Lowercase masked bases instead of replacing with N.
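
The difference between hard and soft masking is easiest to see on a short sequence. The Python sketch below assumes 0-based half-open coordinates for both the extracted region and the mask intervals, all on the same contig; `apply_mask` is an illustrative name, not part of the tool.

```python
from typing import Iterable, Tuple


def apply_mask(seq: str, region_start: int,
               mask_intervals: Iterable[Tuple[int, int]],
               soft: bool = False) -> str:
    """Mask the parts of `seq` that overlap any mask interval.

    `seq` is the sequence of the region starting at `region_start`.
    Hard masking writes N; soft masking lowercases the base.
    """
    bases = list(seq)
    region_end = region_start + len(seq)
    for m_start, m_end in mask_intervals:
        # Clip the mask to the region and convert to sequence indices.
        lo = max(m_start, region_start) - region_start
        hi = min(m_end, region_end) - region_start
        for i in range(lo, hi):  # empty when there is no overlap
            bases[i] = bases[i].lower() if soft else "N"
    return "".join(bases)
```

For a region starting at 100 with sequence ACGTACGT and a mask covering 102 to 105, hard masking gives ACNNNCGT and soft masking gives ACgtaCGT.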

Other

Flag Description
--delimiter Override delimiter for --table / --sv-table. Accepts tab, comma, or a single character.
--threads Number of worker threads (default: all available CPUs).
--no-build-fai Error if .fai is missing instead of building one.
--no-index Scan FASTA sequentially without an FAI index. Loads the sequences into memory. Required for stdin (-).
-q, --quiet Suppress progress messages, warnings, and the progress bar.

Feature highlights

  • Multiple FASTA files: pass several FASTA files as positional arguments. Contigs are looked up across all files (each contig must appear in exactly one file).
  • Bgzip support: bgzipped and gzipped FASTA files are decompressed transparently.
  • Progress bar: shown on stderr when writing to a file (unless --quiet).
  • Bulk-read optimisation: nearby regions on the same contig are read in a single I/O operation, reducing seek overhead.
  • Streaming output: for stdout and single-file output, results are buffered in memory so that input order is preserved while extraction runs in parallel.
  • Interval arithmetic: --subtract and --intersect apply set operations against a BED file before extraction.
  • K-mer tiling: --tile and --step break regions into fixed-width windows for downstream analysis.
  • Coordinate systems: inline regions and tables use 1-based inclusive coordinates. BED uses 0-based half-open (converted internally).
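
The interval arithmetic mentioned above amounts to standard set operations on intervals. The Python sketch below shows one plausible reading for a single region against a list of intervals on the same contig, using 0-based half-open coordinates. Whether multiseqex clips regions exactly this way is an assumption, and `intersect` and `subtract` are illustrative names.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 0-based half-open


def intersect(region: Interval, others: List[Interval]) -> List[Interval]:
    """Return the parts of `region` covered by each interval in `others`."""
    start, end = region
    pieces = []
    for o_start, o_end in sorted(others):
        lo, hi = max(start, o_start), min(end, o_end)
        if lo < hi:
            pieces.append((lo, hi))
    return pieces


def subtract(region: Interval, others: List[Interval]) -> List[Interval]:
    """Return the parts of `region` not covered by any interval in `others`."""
    start, end = region
    pieces = []
    cursor = start  # leftmost position not yet known to be removed
    for o_start, o_end in sorted(others):
        if o_end <= cursor:
            continue  # interval lies entirely before the cursor
        if o_start >= end:
            break     # interval lies entirely after the region
        if o_start > cursor:
            pieces.append((cursor, o_start))
        cursor = max(cursor, o_end)
    if cursor < end:
        pieces.append((cursor, end))
    return pieces
```

Subtracting (120, 130) and (150, 160) from the region (100, 200) leaves three pieces: (100, 120), (130, 150) and (160, 200).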

Documentation

See the docs/ folder for detailed guides.

Version

Current release: 0.2.0
MSRV: 1.87 (Rust edition 2024)

Licence

MIT