# seqtable

High-performance FASTQ sequence counter.

Counts unique sequences in FASTQ files and outputs sorted results in Parquet, CSV, or TSV format.
## Installation

```sh
# Run directly (no install)
# Temporary shell with seqtable available
# Persistent install
# From source
```
## Usage

```text
seqtable [OPTIONS] <INPUT>...

Arguments:
  <INPUT>...  Input FASTQ file path(s), or "-" for stdin

Options:
  -o, --output-dir <DIR>           Output directory [default: .]
  -f, --format <FORMAT>            Output format [default: parquet]
                                   [possible values: parquet, csv, tsv]
  -t, --threads <THREADS>          Number of threads (0 = auto) [default: 0]
  -q, --quiet                      Suppress all status output
      --compression <COMPRESSION>  Parquet compression [default: zstd]
                                   [possible values: none, snappy, gzip, brotli, zstd]
      --rpm                        Include RPM (Reads Per Million) column
  -h, --help                       Print help
  -V, --version                    Print version
```
### Basic
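A minimal run, assuming an input file named `input.fastq.gz` (the file name is illustrative), counts unique sequences and writes a Parquet file to the current directory:

```shell
seqtable input.fastq.gz
```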
### Multiple files
Files are processed in parallel automatically:
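For example (file names illustrative), passing several inputs produces one output table per file:

```shell
seqtable sample1.fastq.gz sample2.fastq.gz sample3.fastq.gz
```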
### Stdin

Pass `-` to read from stdin. Useful for piping from other tools without intermediate files:
```sh
# Pipe from fastp (no intermediate file)
fastp -i reads.fastq.gz --stdout 2> /dev/null | seqtable -

# Decompress with external tools
zcat reads.fastq.gz | seqtable -

# Sample first 1M reads (4 FASTQ lines per read)
zcat reads.fastq.gz | head -n 4000000 | seqtable -
```
## Output

Results are sorted by count (descending). The output filename is derived from the input (e.g., `input.csv`, `stdin.parquet`). With `--rpm`, an RPM column is included:

```csv
sequence,count,rpm
ACGTACGTACGTACGT,150000,75000.00
GCTAGCTAGCTAGCTA,100000,50000.00
TTAATTAATTAATTAA,50000,25000.00
```
### Reading Parquet

```sql
-- DuckDB
SELECT * FROM 'input.parquet' LIMIT 10;
```
## Performance

Benchmarked with hyperfine (warmup=3, runs=5) against `awk 'NR%4==2{a[$0]++}END{...}' | sort -rn`:
| Data | Reads | seqtable | awk + sort | Speedup |
|---|---|---|---|---|
| 22bp, plain FASTQ | 1M | 0.17s | 1.04s | 6.2x |
| 22bp, gzip | 20M | 4.7s | 25.0s | 5.3x |
| 100-300bp, gzip | 20M | 18.8s | 37.2s | 2.0x |
| 151bp, gzip (real) | 13.6M | 12.9s | 55.8s | 4.3x |
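A run like the plain-FASTQ row can be reproduced with hyperfine. The exact awk `END` block is abbreviated above; the one below is a plausible completion, and `reads.fastq` is an illustrative file name:

```shell
hyperfine --warmup 3 --runs 5 \
  'seqtable reads.fastq' \
  "awk 'NR%4==2{a[\$0]++}END{for(s in a)print a[s],s}' reads.fastq | sort -rn > /dev/null"
```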
## Memory

Memory usage is proportional to the number of unique sequences, not file size:

```text
RSS = ~35 MB (base) + unique_count * bytes_per_unique

bytes_per_unique:
  ACGT-only, <=160bp:  ~140 bytes (2-bit packed)
  other (N, >160bp):   ~150 + seq_len bytes
```
| Scenario | Unique sequences | RSS |
|---|---|---|
| Illumina 151bp, 13M unique | 13M | 1.9 GB |
| Amplicon 200bp, 1M unique | 1M | 0.3 GB |
| Short 22bp, 181K unique | 181K | 43 MB |
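Plugging the Illumina row into the formula above reproduces the table's estimate; this is a sketch of the arithmetic, not code from seqtable:

```rust
/// Rough RSS estimate in bytes, following the formula above:
/// RSS = ~35 MB (base) + unique_count * bytes_per_unique.
fn estimate_rss(unique_count: u64, bytes_per_unique: u64) -> u64 {
    35_000_000 + unique_count * bytes_per_unique
}

fn main() {
    // 13M unique packed (ACGT-only) reads at ~140 bytes each:
    let bytes = estimate_rss(13_000_000, 140);
    println!("{:.1} GB", bytes as f64 / 1e9); // ~1.9 GB, matching the table
}
```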
## How it works

```mermaid
flowchart TD
    A[".fq / .fq.gz / stdin"] --> B["needletail parser\n(gzip via zlib-ng)"]
    B --> C{"ACGT-only\nand <=160bp?"}
    C -- yes --> D["PackedDna\n[u64;5], 48 bytes"]
    C -- no --> E["Vec<u8> fallback"]
    D --> F["DualSeqCounts\nAHashMap counting"]
    E --> F
    F --> G["sort by count desc"]
    G --> H["lazy unpack to\nParquet / CSV / TSV"]
```
Key optimizations:

- 2-bit DNA encoding: ACGT bases are packed 2 bits each into a `[u64; 5]`, covering up to 160bp with 48-byte fixed-size keys (vs ~190 bytes for `Vec<u8>`). This eliminates heap allocation for 99.9% of Illumina reads.
- Lazy unpack: packed sequences stay as `PackedDna` through sorting and are only decoded to ASCII during output, avoiding a full `Vec<u8>` copy per unique sequence.
- `get_mut` probe pattern: check for an existing key first, insert only on a miss. Faster than the `entry()` API for this workload (benchmarked).
- zlib-ng backend: flate2 compiled with zlib-ng for faster gzip decompression.
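The 2-bit packing idea can be sketched as follows. This is an illustrative reimplementation, not seqtable's actual `PackedDna` (the 48-byte key presumably stores the sequence length alongside the 40 bytes of packed words):

```rust
/// Pack an ACGT-only sequence of up to 160 bases into five u64 words,
/// 2 bits per base: A=00, C=01, G=10, T=11.
/// Returns None for longer reads or non-ACGT bases (e.g. N),
/// which would fall back to a plain byte vector.
fn pack_dna(seq: &[u8]) -> Option<[u64; 5]> {
    if seq.len() > 160 {
        return None;
    }
    let mut words = [0u64; 5];
    for (i, &b) in seq.iter().enumerate() {
        let code: u64 = match b {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None, // N or ambiguity codes: fallback path
        };
        // 32 bases per u64; base i lives in word i/32 at bit offset (i%32)*2.
        words[i / 32] |= code << ((i % 32) * 2);
    }
    Some(words)
}

/// Decode `len` bases back to ASCII (the "lazy unpack" done at output time).
fn unpack_dna(words: &[u64; 5], len: usize) -> Vec<u8> {
    (0..len)
        .map(|i| {
            let code = (words[i / 32] >> ((i % 32) * 2)) & 0b11;
            [b'A', b'C', b'G', b'T'][code as usize]
        })
        .collect()
}

fn main() {
    let seq = b"GATTACAGATTACA";
    let packed = pack_dna(seq).unwrap();
    assert_eq!(unpack_dna(&packed, seq.len()), seq.to_vec());
    assert!(pack_dna(b"ACGN").is_none()); // N takes the Vec<u8> fallback
}
```

Because the packed form is a fixed-size array, it can serve directly as a cheap, allocation-free hash-map key.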
## Supported formats

| Format | Extensions | Compression |
|---|---|---|
| FASTQ | .fastq, .fq | - |
| FASTQ gzip | .fastq.gz, .fq.gz | gzip |
| stdin | - | auto-detected |
FASTA files are not supported.
## Development

### Generate test fixtures

### Benchmark
## License
MIT
## Acknowledgments
- needletail - FASTQ parsing
- ahash - Fast hashing
- arrow-rs - Parquet support