# seqtable

🧬 High-performance parallel FASTA/FASTQ sequence counter with multiple output formats
## Features
- ✨ Fast: Parallel processing with Rayon (5-10x speedup on multi-core systems)
- 💾 Memory Efficient: Streaming I/O with constant memory usage
- 📊 Multiple Formats: Parquet, CSV, TSV output
- 📈 RPM Calculation: Optional Reads Per Million normalization
- 🗜️ Compression: Native support for `.gz` files
- 🎯 Simple: Single binary with no runtime dependencies
## Installation

### Using Nix (Recommended)
```bash
# Install from this repository (placeholder URL; substitute the actual repo)
nix profile install github:OWNER/seqtable

# Or run directly
nix run github:OWNER/seqtable -- reads.fastq
```
### From Source
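A standard Cargo build should work; a minimal sketch, assuming a conventional Rust project layout (the clone URL is a placeholder):

```bash
# Clone the repository (placeholder URL)
git clone https://github.com/OWNER/seqtable
cd seqtable

# Build and install the release binary
cargo install --path .
```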
## Quick Start

### Basic Usage
```bash
# Count sequences in a FASTQ file
seqtable reads.fastq

# Specify output directory
seqtable reads.fastq -o results/

# Use CSV format with RPM
seqtable reads.fastq -f csv --rpm
```
### Multiple Files
Use GNU parallel for processing multiple files:
```bash
# Process all FASTQ files in parallel (4 jobs)
parallel -j 4 seqtable {} ::: *.fastq.gz

# Memory-aware processing: only start a job when 2 GB of RAM is free
parallel -j 4 --memfree 2G seqtable {} ::: *.fastq.gz
```
## Usage

```text
seqtable [OPTIONS] <INPUT>...

Arguments:
  <INPUT>...  Input file path(s) - FASTA/FASTQ/FASTQ.gz

Options:
  -o, --output-dir <DIR>    Output directory [default: .]
  -s, --suffix <SUFFIX>     Output filename suffix [default: _counts]
  -f, --format <FORMAT>     Output format [default: parquet]
                            [possible values: parquet, csv, tsv]
  -c, --chunk-size <SIZE>   Chunk size for parallel processing [default: 50000]
  -t, --threads <N>         Number of threads (0 = auto) [default: 0]
  -q, --quiet               Disable progress bar
      --compression <TYPE>  Parquet compression [default: snappy]
                            [possible values: none, snappy, gzip, brotli, zstd]
      --rpm                 Calculate RPM (Reads Per Million)
  -h, --help                Print help
  -V, --version             Print version
```
## Examples

### Output Formats
```bash
# Parquet (default, best for data analysis)
seqtable reads.fastq

# CSV (spreadsheet-friendly)
seqtable reads.fastq -f csv

# TSV (tab-separated)
seqtable reads.fastq -f tsv
```
### With RPM Calculation
```bash
# Add RPM column for normalization
seqtable reads.fastq --rpm

# Output includes:
# sequence,count,rpm
# ATCGATCG,1000000,50000.00
# GCTAGCTA,500000,25000.00
```
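RPM is `count / total_reads × 1,000,000`; the sample output above implies 20 million total reads, so a sequence counted 1,000,000 times gets 1,000,000 / 20,000,000 × 10⁶ = 50,000 RPM.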
### Custom Output
```bash
# Custom output name and location
seqtable sample.fastq -o results/ -s .counts

# Output: results/sample.counts.parquet
```
### Performance Tuning
```bash
# Use 8 threads
seqtable reads.fastq -t 8

# Larger chunks for big files (reduces per-chunk overhead)
seqtable big.fastq.gz -c 200000

# Smaller chunks for memory-constrained systems
seqtable reads.fastq -c 10000
```
## Output Format

### Parquet (default)
Columnar format optimized for analytics:
- Efficient compression
- Fast queries with tools like DuckDB, Polars
- Schema preservation
```python
# Read in Python (Polars shown; pandas and DuckDB work too)
import polars as pl

df = pl.read_parquet("reads_counts.parquet")
```
### CSV/TSV

Human-readable text formats:
```text
sequence,count,rpm
ATCGATCGATCG,1500000,75000.00
GCTAGCTAGCTA,1000000,50000.00
TTAATTAATTAA,500000,25000.00
```
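Like the Parquet output, these load directly into standard tabular tools, e.g. `pl.read_csv("reads_counts.csv")` with Polars.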
## Performance
Typical performance on a 16-core system:
| File Size | Reads | Time | Memory |
|---|---|---|---|
| 1 GB | 10M | ~15s | ~500MB |
| 10 GB | 100M | ~60s | ~2GB |
| 100 GB | 1B | ~600s | ~2GB |
Key performance characteristics:
- Near-linear scaling with CPU cores
- Memory usage bounded by unique sequences, not total file size
- Efficient handling of gzip-compressed files
## File Format Support

| Format | Extension | Compression | Streaming |
|---|---|---|---|
| FASTA | `.fa`, `.fasta` | ❌ | ✅ |
| FASTQ | `.fq`, `.fastq` | ❌ | ✅ |
| FASTA.gz | `.fa.gz` | ✅ | ✅ |
| FASTQ.gz | `.fq.gz` | ✅ | ✅ |
## Architecture

### Processing Pipeline
```text
Input File(s)
      ↓
Streaming Reader (needletail)
      ↓
Chunking (50K sequences)
      ↓
Parallel Counting (Rayon + AHashMap)
      ↓
Parallel Merge
      ↓
Optional RPM Calculation
      ↓
Output (Parquet/CSV/TSV)
```
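The counting and merge stages amount to a map-reduce over chunks. Below is a minimal sketch of that scheme using the same `rayon` and `ahash` crates; `count_chunks` is an illustrative simplification, not seqtable's actual API:

```rust
use ahash::AHashMap;
use rayon::prelude::*;

// Hypothetical simplification of the counting stage: each chunk is
// counted into its own local map, then the per-chunk maps are merged.
fn count_chunks(chunks: Vec<Vec<String>>) -> AHashMap<String, u64> {
    chunks
        .into_par_iter()
        .map(|chunk| {
            // Per-chunk local counting avoids lock contention.
            let mut local: AHashMap<String, u64> = AHashMap::new();
            for seq in chunk {
                *local.entry(seq).or_insert(0) += 1;
            }
            local
        })
        // Parallel merge of the per-chunk maps.
        .reduce(AHashMap::new, |mut acc, local| {
            for (seq, n) in local {
                *acc.entry(seq).or_insert(0) += n;
            }
            acc
        })
}
```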
### Memory Usage

- Base: ~100MB of program overhead
- Chunks: `chunk_size × threads × ~80 bytes`
- HashMap: `unique_sequences × ~100 bytes`
- Total: typically 1-3GB for large files (see the worked example below)
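As an illustration (assuming 8 threads and 10 million unique sequences, both hypothetical figures): the default chunk size of 50,000 gives 50,000 × 8 × 80 B ≈ 32 MB for in-flight chunks, and the hashmap adds 10,000,000 × 100 B ≈ 1 GB, which lands in the 1-3GB range above.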
### Key Optimizations
- Streaming I/O: Files processed incrementally
- Parallel Hashing: Multi-threaded counting with AHash
- Zero-Copy: Minimal data duplication
- Adaptive Chunking: Optimal chunk size selection
## Development

### Building
```bash
# Debug build
cargo build

# Release build with optimizations
cargo build --release

# With the mold linker (faster linking)
mold -run cargo build --release
```
### Testing
```bash
# Run tests
cargo test

# Generate test data (illustrative one-liner; use the repo's own tooling if provided)
printf '@read1\nATCGATCGATCG\n+\nIIIIIIIIIIII\n' > test.fastq
```
### Benchmarking
```bash
# Time comparison
time seqtable large.fastq.gz

# Memory profiling (GNU time reports peak RSS)
/usr/bin/time -v seqtable large.fastq.gz
```
## Troubleshooting

### Out of Memory
```bash
# Reduce chunk size
seqtable reads.fastq -c 10000

# Use fewer threads
seqtable reads.fastq -t 4
```
### Slow Performance
```bash
# Increase threads
seqtable reads.fastq -t 16

# Larger chunks (for large files)
seqtable reads.fastq -c 200000

# Check for an I/O bottleneck
iostat -x 1
```
### File Format Issues

```bash
# Verify file format
zcat reads.fastq.gz | head -n 8

# Test with a small sample
zcat reads.fastq.gz | head -n 4000 > sample.fastq
seqtable sample.fastq
```
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## License
MIT License - see LICENSE file for details.
## Citation
If you use this tool in your research, please cite:
## Acknowledgments
- [needletail](https://github.com/onecodex/needletail) - Fast FASTA/FASTQ parsing
- [rayon](https://github.com/rayon-rs/rayon) - Data parallelism
- [arrow-rs](https://github.com/apache/arrow-rs) - Parquet support