# seqtable

🧬 High-performance parallel FASTA/FASTQ sequence counter with multiple output formats
## Features
- ✨ Fast: Parallel processing with Rayon (5-10x speedup on multi-core systems)
- 💾 Memory Efficient: Streaming I/O with constant memory usage
- 📊 Multiple Formats: Parquet, CSV, TSV output
- 📈 RPM Calculation: Optional Reads Per Million normalization
- 🗜️ Compression: Native support for `.gz` files
- 🎯 Simple: Single binary with no runtime dependencies
## Installation

### Using Nix (Recommended)
```bash
# Install from this repository (placeholder URL; substitute the actual repo)
nix profile install github:OWNER/seqtable

# Or run directly
nix run github:OWNER/seqtable -- reads.fastq
```
### From Source
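A standard Cargo build should work; a minimal sketch, assuming a conventional Rust project layout (the clone URL is a placeholder):

```bash
# Clone the repository (placeholder URL)
git clone https://github.com/OWNER/seqtable
cd seqtable

# Build and install the release binary
cargo install --path .
```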
## Quick Start

### Basic Usage
```bash
# Count sequences in a FASTQ file
seqtable reads.fastq

# Specify output directory
seqtable reads.fastq -o results/

# Use CSV format with RPM
seqtable reads.fastq -f csv --rpm
```
### Multiple Files
Use GNU parallel for processing multiple files:
```bash
# Process all FASTQ files in parallel (4 jobs)
parallel -j 4 seqtable {} ::: *.fastq.gz

# Memory-aware processing: only start a job when 2 GB of RAM is free
parallel -j 4 --memfree 2G seqtable {} ::: *.fastq.gz
```
## Usage

```text
seqtable [OPTIONS] <INPUT>...

Arguments:
  <INPUT>...  Input file path(s) - FASTA/FASTQ/FASTQ.gz

Options:
  -o, --output-dir <DIR>    Output directory [default: .]
  -s, --suffix <SUFFIX>     Output filename suffix [default: _counts]
  -f, --format <FORMAT>     Output format [default: parquet]
                            [possible values: parquet, csv, tsv]
  -c, --chunk-size <SIZE>   Chunk size for parallel processing [default: 50000]
  -t, --threads <N>         Number of threads (0 = auto) [default: 0]
  -q, --quiet               Disable progress bar
      --compression <TYPE>  Parquet compression [default: snappy]
                            [possible values: none, snappy, gzip, brotli, zstd]
      --rpm                 Calculate RPM (Reads Per Million)
  -h, --help                Print help
  -V, --version             Print version
```
## Examples

### Output Formats
```bash
# Parquet (default, best for data analysis)
seqtable reads.fastq

# CSV (spreadsheet-friendly)
seqtable reads.fastq -f csv

# TSV (tab-separated)
seqtable reads.fastq -f tsv
```
### With RPM Calculation
```bash
# Add RPM column for normalization
seqtable reads.fastq --rpm

# Output includes:
# sequence,count,rpm
# ATCGATCG,1000000,50000.00
# GCTAGCTA,500000,25000.00
```
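RPM is `count / total_reads × 1,000,000`; the sample output above implies 20 million total reads, so a sequence counted 1,000,000 times gets 1,000,000 / 20,000,000 × 10⁶ = 50,000 RPM.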
### Custom Output
```bash
# Custom output name and location
seqtable sample.fastq -o results/ -s .counts

# Output: results/sample.counts.parquet
```
### Performance Tuning
```bash
# Use 8 threads
seqtable reads.fastq -t 8

# Larger chunks for big files (reduces per-chunk overhead)
seqtable big.fastq.gz -c 200000

# Smaller chunks for memory-constrained systems
seqtable reads.fastq -c 10000
```
## Output Format

### Parquet (default)
Columnar format optimized for analytics:
- Efficient compression
- Fast queries with tools like DuckDB, Polars
- Schema preservation
```python
# Read in Python (Polars shown; pandas and DuckDB work too)
import polars as pl

df = pl.read_parquet("reads_counts.parquet")
```
### CSV/TSV

Human-readable text formats:
```text
sequence,count,rpm
ATCGATCGATCG,1500000,75000.00
GCTAGCTAGCTA,1000000,50000.00
TTAATTAATTAA,500000,25000.00
```
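Like the Parquet output, these load directly into standard tabular tools, e.g. `pl.read_csv("reads_counts.csv")` with Polars.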
## Performance
Typical performance on a 16-core system:
| File Size | Reads | Time | Memory |
|---|---|---|---|
| 1 GB | 10M | ~15s | ~500MB |
| 10 GB | 100M | ~60s | ~2GB |
| 100 GB | 1B | ~600s | ~2GB |
Key performance characteristics:
- Near-linear scaling with CPU cores
- Memory usage bounded by unique sequences, not total file size
- Efficient handling of gzip-compressed files
## File Format Support

| Format | Extension | Compression | Streaming |
|---|---|---|---|
| FASTA | `.fa`, `.fasta` | ❌ | ✅ |
| FASTQ | `.fq`, `.fastq` | ❌ | ✅ |
| FASTA.gz | `.fa.gz` | ✅ | ✅ |
| FASTQ.gz | `.fq.gz` | ✅ | ✅ |
## Architecture

### Processing Pipeline
```text
Input File(s)
      ↓
Streaming Reader (needletail)
      ↓
Chunking (50K sequences)
      ↓
Parallel Counting (Rayon + AHashMap)
      ↓
Parallel Merge
      ↓
Optional RPM Calculation
      ↓
Output (Parquet/CSV/TSV)
```
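The counting and merge stages amount to a map-reduce over chunks. Below is a minimal sketch of that scheme using the same `rayon` and `ahash` crates; `count_chunks` is an illustrative simplification, not seqtable's actual API:

```rust
use ahash::AHashMap;
use rayon::prelude::*;

// Hypothetical simplification of the counting stage: each chunk is
// counted into its own local map, then the per-chunk maps are merged.
fn count_chunks(chunks: Vec<Vec<String>>) -> AHashMap<String, u64> {
    chunks
        .into_par_iter()
        .map(|chunk| {
            // Per-chunk local counting avoids lock contention.
            let mut local: AHashMap<String, u64> = AHashMap::new();
            for seq in chunk {
                *local.entry(seq).or_insert(0) += 1;
            }
            local
        })
        // Parallel merge of the per-chunk maps.
        .reduce(AHashMap::new, |mut acc, local| {
            for (seq, n) in local {
                *acc.entry(seq).or_insert(0) += n;
            }
            acc
        })
}
```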
### Memory Usage

- Base: ~100MB of program overhead
- Chunks: `chunk_size × threads × ~80 bytes`
- HashMap: `unique_sequences × ~100 bytes`
- Total: typically 1-3GB for large files (see the worked example below)
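As an illustration (assuming 8 threads and 10 million unique sequences, both hypothetical figures): the default chunk size of 50,000 gives 50,000 × 8 × 80 B ≈ 32 MB for in-flight chunks, and the hashmap adds 10,000,000 × 100 B ≈ 1 GB, which lands in the 1-3GB range above.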
### Key Optimizations
- Streaming I/O: Files processed incrementally
- Parallel Hashing: Multi-threaded counting with AHash
- Zero-Copy: Minimal data duplication
- Adaptive Chunking: Optimal chunk size selection
## Development

### Building
```bash
# Debug build
cargo build

# Release build with optimizations
cargo build --release

# With the mold linker (faster linking)
mold -run cargo build --release
```
### Testing
```bash
# Run tests
cargo test

# Generate test data (illustrative one-liner; use the repo's own tooling if provided)
printf '@read1\nATCGATCGATCG\n+\nIIIIIIIIIIII\n' > test.fastq
```
### Benchmarking
```bash
# Time comparison
time seqtable large.fastq.gz

# Memory profiling (GNU time reports peak RSS)
/usr/bin/time -v seqtable large.fastq.gz
```
## Troubleshooting

### Out of Memory
```bash
# Reduce chunk size
seqtable reads.fastq -c 10000

# Use fewer threads
seqtable reads.fastq -t 4
```
### Slow Performance
```bash
# Increase threads
seqtable reads.fastq -t 16

# Larger chunks (for large files)
seqtable reads.fastq -c 200000

# Check for an I/O bottleneck
iostat -x 1
```
### File Format Issues

```bash
# Verify file format
zcat reads.fastq.gz | head -n 8

# Test with a small sample
zcat reads.fastq.gz | head -n 4000 > sample.fastq
seqtable sample.fastq
```
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## License
MIT License - see LICENSE file for details.
## Citation
If you use this tool in your research, please cite:
## Acknowledgments
- [needletail](https://github.com/onecodex/needletail) - Fast FASTA/FASTQ parsing
- [rayon](https://github.com/rayon-rs/rayon) - Data parallelism
- [arrow-rs](https://github.com/apache/arrow-rs) - Parquet support