seqx
seqx is an agent-friendly CLI for FASTA/FASTQ sequence processing.
It is designed around streaming I/O, predictable command behavior, and low-memory execution for large files.
Repository Layout
seqx/
├── .github/
│   └── workflows/
│       └── release.yml
├── scripts/
│   ├── bench_packed_io.sh
│   └── gen_random_fasta.py
├── src/
│   ├── main.rs
│   ├── lib.rs
│   ├── cmd/
│   │   ├── mod.rs
│   │   ├── compress.rs
│   │   ├── convert.rs
│   │   ├── dedup.rs
│   │   ├── extract.rs
│   │   ├── filter.rs
│   │   ├── merge.rs
│   │   ├── modify.rs
│   │   ├── sample.rs
│   │   ├── search.rs
│   │   ├── sort.rs
│   │   ├── split.rs
│   │   └── stats.rs
│   └── common/
│       ├── mod.rs
│       ├── parser.rs
│       ├── packed_seq_io.rs
│       ├── record.rs
│       ├── writer.rs
│       └── README.md
├── Cargo.toml
├── Cargo.lock
├── README.md
├── QUICKREF.md
├── DEVELOPMENT.md
├── SKILL.md
├── rustfmt.toml
└── target/  # build artifacts (generated)
Build
Binary path:
Quick Start
# Show help
# Basic stats
# Convert FASTA -> FASTQ
# Filter short sequences
Commands
stats
convert
filter
extract
search
modify
sample
sort
dedup
merge
split
compress
# Compress using pigz if available, otherwise built-in
# Decompress
# Use stdin/stdout
|
|
# Force built-in implementation
Behavior Notes
- Input defaults to stdin where supported.
- Output defaults to stdout where supported.
- Format detection is extension-based (.fa/.fasta/.fq/.fastq, optional .gz).
- FASTA/FASTQ parsing uses noodles.
- extract currently supports FASTA extraction only.
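As an illustration of extension-based detection, here is a minimal sketch. The `SeqFormat` enum and `detect_format` function are hypothetical names chosen for this example and are not taken from the seqx sources; lowercasing the file name before matching is also an assumption.

```rust
use std::path::Path;

/// Formats recognised by this sketch (hypothetical type, not seqx's own).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum SeqFormat {
    Fasta,
    Fastq,
}

/// Guess the format and a gzip flag from the file name alone.
/// Returns None when the extension is not one of the listed ones.
pub fn detect_format(path: &Path) -> Option<(SeqFormat, bool)> {
    // Assumption: matching is case-insensitive.
    let name = path.file_name()?.to_str()?.to_ascii_lowercase();

    // Peel off an optional trailing ".gz" before looking at the format suffix.
    let (stem, gz) = match name.strip_suffix(".gz") {
        Some(s) => (s, true),
        None => (name.as_str(), false),
    };

    let format = if stem.ends_with(".fa") || stem.ends_with(".fasta") {
        SeqFormat::Fasta
    } else if stem.ends_with(".fq") || stem.ends_with(".fastq") {
        SeqFormat::Fastq
    } else {
        return None;
    };
    Some((format, gz))
}
```

Because this kind of detection never inspects file contents, data arriving on stdin carries no extension to inspect, so a caller of such a function would need some other way to state the format.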
Nucleotide vs Protein Behavior
- Protein FASTA records are supported by all commands.
- Nucleotide-only operations are explicitly guarded:
  - filter --gc-min/--gc-max
  - modify --reverse-complement
  - reverse-complement matching in search (enabled only when both record and pattern are nucleotide)
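One way to picture such a guard is sketched below. The names `is_nucleotide` and `reverse_complement` are hypothetical, and the accepted alphabet (A, C, G, T, U, N in either case) is an assumption; the actual check in seqx may differ, for example by accepting more IUPAC codes.

```rust
/// Heuristic check (assumed alphabet): every byte is A/C/G/T/U/N, any case.
/// A protein made only of these letters would be misclassified.
pub fn is_nucleotide(seq: &[u8]) -> bool {
    !seq.is_empty()
        && seq.iter().all(|b| {
            matches!(
                b.to_ascii_uppercase(),
                b'A' | b'C' | b'G' | b'T' | b'U' | b'N'
            )
        })
}

/// Complement a single base, preserving case; unknown bytes pass through.
fn complement(b: u8) -> u8 {
    match b {
        b'A' => b'T',
        b'C' => b'G',
        b'G' => b'C',
        b'T' | b'U' => b'A',
        b'a' => b't',
        b'c' => b'g',
        b'g' => b'c',
        b't' | b'u' => b'a',
        other => other, // e.g. N stays N
    }
}

/// Guarded reverse complement: returns None instead of silently
/// transforming a sequence that does not look like nucleotides.
pub fn reverse_complement(seq: &[u8]) -> Option<Vec<u8>> {
    if !is_nucleotide(seq) {
        return None;
    }
    Some(seq.iter().rev().map(|&b| complement(b)).collect())
}
```

Returning `Option` keeps the guard explicit at the call site: the caller decides whether a non-nucleotide record is skipped, passed through unchanged, or reported as an error.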
Performance Model
- sort: external chunk sort + mmap merge, configurable with --max-memory and --threads.
- dedup: disk bucket partitioning + per-bucket dedup + stable merge, configurable with --buckets and --threads.
- split --parts: two-pass streaming split (stdin may be materialized to a temp file).
- Temp binary record paths use packed_seq_io (2-bit packing for A/C/G/T when applicable).
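To make the 2-bit packing idea concrete, here is a minimal sketch. It is not the `packed_seq_io` implementation: the function names, the bit order within a byte, and the uppercase-only rule are all assumptions made for illustration.

```rust
/// Pack an uppercase A/C/G/T-only sequence into 2 bits per base
/// (4 bases per byte, first base in the lowest bits; assumed layout).
/// Returns None if any other byte appears, so the caller can fall
/// back to storing raw bytes.
pub fn pack_2bit(seq: &[u8]) -> Option<Vec<u8>> {
    let mut out = vec![0u8; (seq.len() + 3) / 4];
    for (i, &b) in seq.iter().enumerate() {
        let code = match b {
            b'A' => 0u8,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None, // not applicable: N, lowercase, protein, ...
        };
        out[i / 4] |= code << ((i % 4) * 2);
    }
    Some(out)
}

/// Inverse of `pack_2bit`. The original length must be stored separately,
/// because the last byte may be only partially used.
/// Panics if `len` exceeds `packed.len() * 4`.
pub fn unpack_2bit(packed: &[u8], len: usize) -> Vec<u8> {
    const BASES: [u8; 4] = *b"ACGT";
    (0..len)
        .map(|i| BASES[((packed[i / 4] >> ((i % 4) * 2)) & 0b11) as usize])
        .collect()
}
```

Under this layout a 200-base read occupies 50 bytes instead of 200, which is the kind of saving that matters for temporary files written during sort and dedup.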
Bench Script
# Custom workload
N_RECORDS=1000000 SEQ_LEN=200 DUP_RATE=40
Developer Docs
License
MIT