seqtable 0.2.0

High-performance FASTQ sequence counter

Counts unique sequences in FASTQ files and outputs sorted results in Parquet, CSV, or TSV format.

License: MIT

Installation

# Run directly (no install)
nix run github:mulatta/seqtable -- input.fq.gz -o results/

# Temporary shell with seqtable available
nix shell github:mulatta/seqtable
seqtable input.fq.gz

# Persistent install
nix profile install github:mulatta/seqtable

# From source
git clone https://github.com/mulatta/seqtable
cd seqtable
cargo build --release

Usage

seqtable [OPTIONS] <INPUT>...

Arguments:
  <INPUT>...  Input FASTQ file path(s), or "-" for stdin

Options:
  -o, --output-dir <DIR>          Output directory [default: .]
  -f, --format <FORMAT>           Output format [default: parquet]
                                  [possible values: parquet, csv, tsv]
  -t, --threads <THREADS>         Number of threads (0 = auto) [default: 0]
  -q, --quiet                     Suppress all status output
      --compression <COMPRESSION> Parquet compression [default: zstd]
                                  [possible values: none, snappy, gzip, brotli, zstd]
      --rpm                       Include RPM (Reads Per Million) column
  -h, --help                      Print help
  -V, --version                   Print version

Basic

seqtable input.fq.gz
seqtable input.fq.gz -o results/ -f csv --rpm
seqtable input.fq.gz -f tsv -t 4

Multiple files

Files are processed in parallel automatically:

seqtable sample1.fq.gz sample2.fq.gz sample3.fq.gz -o results/

Stdin

Pass "-" as the input path to read from stdin. This is useful for piping from other tools without intermediate files:

# Pipe from fastp (no intermediate file)
fastp -i raw.fq.gz -o /dev/stdout | seqtable - -o results/ -f csv

# Decompress with external tools
zcat input.fq.gz | seqtable - -o results/

# Sample first 1M reads
head -n 4000000 input.fq | seqtable - -o results/

Output

Results are sorted by count in descending order. The output filename is derived from the input name (e.g., input.csv, or stdin.parquet when reading from stdin).

sequence,count,rpm
ACGTACGTACGTACGT,150000,75000.00
GCTAGCTAGCTAGCTA,100000,50000.00
TTAATTAATTAATTAA,50000,25000.00
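
The rpm column can be reproduced with a few lines of Rust. This is a minimal sketch, not seqtable's own code: the function names `rpm` and `csv_row` are hypothetical, the two-decimal formatting is inferred from the sample above, and the total of 2,000,000 reads used in the usage note is an assumption (the sample does not state it).

```rust
/// Reads per million: the share of all reads carried by one sequence,
/// scaled to 1,000,000. Returns 0.0 for an empty input to avoid dividing by zero.
fn rpm(count: u64, total_reads: u64) -> f64 {
    if total_reads == 0 {
        return 0.0;
    }
    count as f64 / total_reads as f64 * 1_000_000.0
}

/// Formats one row in the sequence,count,rpm layout.
/// Two decimal places is an assumption based on the sample output.
fn csv_row(seq: &str, count: u64, total_reads: u64) -> String {
    format!("{},{},{:.2}", seq, count, rpm(count, total_reads))
}
```

With an assumed total of 2,000,000 reads, `csv_row("ACGTACGTACGTACGT", 150_000, 2_000_000)` yields the first data row of the sample.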

Reading Parquet

# Python (Polars)
import polars as pl
df = pl.read_parquet("input.parquet")

-- DuckDB
SELECT * FROM 'input.parquet' LIMIT 10;

Performance

Benchmarked with hyperfine (warmup=3, runs=5) against awk 'NR%4==2{a[$0]++}END{...}' | sort -rn:

| Data | Reads | seqtable | awk + sort | Speedup |
|---|---|---|---|---|
| 22bp, plain FASTQ | 1M | 0.17s | 1.04s | 6.2x |
| 22bp, gzip | 20M | 4.7s | 25.0s | 5.3x |
| 100-300bp, gzip | 20M | 18.8s | 37.2s | 2.0x |
| 151bp, gzip (real) | 13.6M | 12.9s | 55.8s | 4.3x |

Memory

Memory usage is proportional to the number of unique sequences, not file size:

RSS = ~35 MB (base) + unique_count * bytes_per_unique

bytes_per_unique:
  ACGT-only, <=160bp:  ~140 bytes  (2-bit packed)
  other (N, >160bp):   ~150 + seq_len bytes

| Scenario | Unique sequences | RSS |
|---|---|---|
| Illumina 151bp, 13M unique | 13M | 1.9 GB |
| Amplicon 200bp, 1M unique | 1M | 0.3 GB |
| Short 22bp, 181K unique | 181K | 43 MB |
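
For capacity planning, the formula above can be turned into a small estimator. This is a rough sketch under stated assumptions: the names `bytes_per_unique` and `estimate_rss_bytes` are hypothetical, 1 MB is taken as 10^6 bytes, and every unique sequence is assumed to have the same length. Measured RSS will differ from the estimate.

```rust
/// Approximate fixed overhead from the formula above (~35 MB, taken as 10^6 bytes per MB).
const BASE_RSS_BYTES: u64 = 35 * 1_000_000;

/// Approximate per-unique cost: ~140 bytes on the packed path
/// (ACGT-only, <=160bp), otherwise ~150 bytes plus the sequence length.
fn bytes_per_unique(seq_len: usize, acgt_only: bool) -> u64 {
    if acgt_only && seq_len <= 160 {
        140
    } else {
        150 + seq_len as u64
    }
}

/// Estimated resident memory in bytes, assuming all unique sequences share one length.
fn estimate_rss_bytes(unique_count: u64, seq_len: usize, acgt_only: bool) -> u64 {
    BASE_RSS_BYTES + unique_count * bytes_per_unique(seq_len, acgt_only)
}
```

For 13M unique ACGT-only reads of 151bp, `estimate_rss_bytes(13_000_000, 151, true)` gives about 1.86 GB, in line with the Illumina row.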

How it works

flowchart TD
    A[".fq / .fq.gz / stdin"] --> B["needletail parser\n(gzip via zlib-ng)"]
    B --> C{ACGT-only\nand <=160bp?}
    C -- yes --> D["PackedDna\n[u64;5], 48 bytes"]
    C -- no --> E["Vec &lt;u8&gt; fallback"]
    D --> F["DualSeqCounts\nAHashMap counting"]
    E --> F
    F --> G["sort by count desc"]
    G --> H["lazy unpack to\nParquet / CSV / TSV"]
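
The packed/fallback decision in the flow above can be illustrated with a short sketch. It is not seqtable's actual `PackedDna` type: the struct name `Packed`, its field layout, the base-to-code mapping (A=0, C=1, G=2, T=3), and the rejection of lowercase bases are all assumptions made for illustration.

```rust
/// Hypothetical stand-in for a packed key: five 64-bit words hold up to
/// 160 bases at 2 bits each, plus a length so trailing A's are not lost.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct Packed {
    words: [u64; 5],
    len: u8,
}

/// Packs an uppercase ACGT-only sequence of at most 160 bases.
/// Returns None for anything else, signalling the byte-vector fallback path.
fn pack(seq: &[u8]) -> Option<Packed> {
    if seq.len() > 160 {
        return None;
    }
    let mut words = [0u64; 5];
    for (i, &base) in seq.iter().enumerate() {
        let code: u64 = match base {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None, // N or any other symbol: not packable
        };
        // 32 bases per word; base i occupies bits (i % 32) * 2 and the bit above it.
        words[i / 32] |= code << ((i % 32) * 2);
    }
    Some(Packed { words, len: seq.len() as u8 })
}

/// Decodes a packed key back to ASCII, as would be done only at output time.
fn unpack(p: &Packed) -> Vec<u8> {
    (0..p.len as usize)
        .map(|i| b"ACGT"[((p.words[i / 32] >> ((i % 32) * 2)) & 3) as usize])
        .collect()
}
```

Because the key is a fixed-size `Copy` value, it can be hashed, compared, and sorted without touching the heap; a caller would fall back to an owned byte vector whenever `pack` returns `None`.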

Key optimizations:

  • 2-bit DNA encoding: ACGT bases packed 2 bits each into [u64;5]. Covers up to 160bp with 48-byte fixed-size keys (vs ~190 bytes for Vec<u8>). Eliminates heap allocation for 99.9% of Illumina reads.
  • Lazy unpack: Packed sequences stay as PackedDna through sorting and are only decoded to ASCII during output, avoiding a full Vec<u8> copy per unique sequence.
  • get_mut probe pattern: Check existing key first, insert only on miss. Faster than entry() API for this workload (benchmarked).
  • zlib-ng backend: flate2 compiled with zlib-ng for faster gzip decompression.
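
The get_mut probe pattern and the final sort can be sketched as follows. This is an illustration rather than the project's code: it uses the standard library `HashMap` in place of AHashMap, plain byte vectors in place of packed keys, and a hypothetical function name `count_sequences`. Breaking ties by sequence is also an assumption, added only to make the output deterministic.

```rust
use std::collections::HashMap;

/// Counts identical sequences and returns them sorted by count, descending.
fn count_sequences<'a, I>(reads: I) -> Vec<(Vec<u8>, u64)>
where
    I: IntoIterator<Item = &'a [u8]>,
{
    let mut counts: HashMap<Vec<u8>, u64> = HashMap::new();
    for seq in reads {
        // Probe with the borrowed slice first: a hit needs no owned key,
        // so repeated sequences cause no allocation.
        if let Some(n) = counts.get_mut(seq) {
            *n += 1;
        } else {
            // Miss: only now pay for an owned copy of the key.
            counts.insert(seq.to_vec(), 1);
        }
    }
    let mut rows: Vec<(Vec<u8>, u64)> = counts.into_iter().collect();
    // Highest count first; ties ordered by sequence (an assumption, see above).
    rows.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    rows
}
```

The `entry()` API would require an owned key on every lookup with this key type, which is one plausible reason the probe-first form wins when most reads repeat an existing sequence.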

Supported formats

| Format | Extensions | Compression |
|---|---|---|
| FASTQ | .fastq, .fq | - |
| FASTQ gzip | .fastq.gz, .fq.gz | gzip |
| stdin | - | auto-detected |

FASTA files are not supported.

Development

nix develop
cargo test
cargo bench

Generate test fixtures

cargo run --example generate_fixtures --release -- --size small

Benchmark

nix run .#benchmark -- small

License

MIT

Acknowledgments