cf1-rs

A fast, parallel Rust implementation of the Cuttlefish algorithm for constructing compacted reference de Bruijn graphs (cdBGs).

Overview

cf1-rs builds a compacted de Bruijn graph from reference sequences (genomes, transcriptomes) using a 5-phase pipeline:

1. Minimizer counting: SIMD-accelerated canonical minimizer extraction
2. Super k-mer routing: parallel partitioning of sequences into minimizer-based bins using 2-bit packed encoding
3. MPHF construction: global minimal perfect hash function via fingerprint-based hashing (BBHash), built from disk with multiple threads
4. DFA classification: lock-free vertex classification using atomic compare-and-swap on a compact state vector
5. Unitig extraction: parallel traversal and output in GFA-reduced format
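
To illustrate the compare-and-swap idea behind the DFA classification phase, here is a minimal sketch. It is not the cf1-rs implementation: it tracks only one side of each vertex, uses one `AtomicU8` per vertex instead of a bit-packed state vector, and the names `NO_EDGE`, `MULTI`, `next_state`, and `observe_edge` are hypothetical. Each state records whether a vertex has seen no edge, exactly one distinct edge base, or several.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

// Hypothetical state encoding for ONE side of a vertex (an assumption for
// illustration; the real state vector is more compact and covers both sides):
//   0      = no edge seen yet
//   1..=4  = exactly one distinct edge seen, labelled with base code 0..=3
//   5      = more than one distinct edge seen (branching)
const NO_EDGE: u8 = 0;
const MULTI: u8 = 5;

/// Pure transition function: the state after observing an edge labelled `base`.
fn next_state(cur: u8, base: u8) -> u8 {
    let unique = base + 1;
    match cur {
        NO_EDGE => unique,      // first edge on this side
        s if s == unique => s,  // same edge seen again, nothing changes
        _ => MULTI,             // a different edge, or already branching
    }
}

/// Record an observed edge for `vertex` without taking a lock.
/// If another thread changes the slot between the load and the
/// compare-and-swap, the transition is recomputed from the fresh value.
fn observe_edge(states: &[AtomicU8], vertex: usize, base: u8) {
    debug_assert!(base < 4);
    let slot = &states[vertex];
    let mut cur = slot.load(Ordering::Relaxed);
    loop {
        let next = next_state(cur, base);
        if next == cur {
            return; // state already reflects this observation
        }
        match slot.compare_exchange_weak(cur, next, Ordering::AcqRel, Ordering::Relaxed) {
            Ok(_) => return,
            Err(actual) => cur = actual, // lost the race (or spurious failure); retry
        }
    }
}
```

Because every transition only moves a state "forward" (no edge, then unique, then branching), the final state does not depend on the order in which threads apply their observations, which is what makes a retry loop sufficient here.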

cf1-rs supports odd k-mer lengths from 1 to 63, using const-generic k-mer types backed by u64 storage (k <= 32) or u128 storage (larger k, up to 63).
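
As a rough sketch of what a const-generic, 2-bit packed k-mer can look like, consider the following. The type `Kmer64` and the helper `is_valid_k` are hypothetical names, not the cf1-rs API, and the base codes A=0, C=1, G=2, T=3 are an assumed encoding. Only the u64-backed case is shown; a u128-backed variant for larger k would follow the same pattern.

```rust
/// Hypothetical 2-bit packed k-mer for K <= 32, stored in a single u64.
/// The first base of the k-mer occupies the highest-order used bits.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Kmer64<const K: usize>(u64);

impl<const K: usize> Kmer64<K> {
    /// Encode exactly K bases; returns None for a wrong length,
    /// an unsupported K, or any character other than A/C/G/T.
    fn from_bytes(seq: &[u8]) -> Option<Self> {
        if K == 0 || K > 32 || seq.len() != K {
            return None;
        }
        let mut bits = 0u64;
        for &b in seq {
            let code = match b.to_ascii_uppercase() {
                b'A' => 0u64,
                b'C' => 1,
                b'G' => 2,
                b'T' => 3,
                _ => return None,
            };
            bits = (bits << 2) | code;
        }
        Some(Self(bits))
    }

    /// Reverse complement: with this encoding the complement of code c is 3 - c.
    fn reverse_complement(self) -> Self {
        let mut fwd = self.0;
        let mut rc = 0u64;
        for _ in 0..K {
            rc = (rc << 2) | (3 - (fwd & 3));
            fwd >>= 2;
        }
        Self(rc)
    }

    /// Canonical form: the smaller of the k-mer and its reverse complement.
    fn canonical(self) -> Self {
        self.min(self.reverse_complement())
    }
}

/// Mirrors the documented constraint on `-k`: odd, between 1 and 63.
/// An odd k guarantees that no k-mer equals its own reverse complement.
fn is_valid_k(k: usize) -> bool {
    k % 2 == 1 && (1..=63).contains(&k)
}
```

For example, under these assumptions `ACG` and its reverse complement `CGT` map to the same canonical value, so both strands of a sequence refer to the same graph vertex.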

Installation

cargo install cf1-rs

Or build from source:

git clone https://github.com/COMBINE-lab/cf1-rs.git
cd cf1-rs
cargo build --release

Usage

# Build a cdBG from a single FASTA file
cf1-rs build -s genome.fa.gz -k 31 -t 8 -o output/k31_dbg

# Build from a transcriptome with short sequence tracking
cf1-rs build -s transcripts.fa.gz -k 31 -t 4 -o output/k31_dbg \
    --track-short-seqs --poly-N-stretch

# Specify working directory and memory budget
cf1-rs build -s genome.fa.gz -k 31 -t 8 -o output/k31_dbg \
    -w /tmp/cf1_work --memory-budget 8.0

Options

Flag	Description	Default
`-s`	Input FASTA/FASTQ file (optionally gzipped)	—
`-l`	File listing input paths (one per line)	—
`-d`	Directory containing input sequence files	—
`-k`	K-mer length (odd, 1–63)	31
`-t`	Number of threads	1
`-o`	Output file prefix	—
`-f`	Output format (0=FASTA, 1=GFA1, 2=GFA2, 3=GFA-reduced)	3
`-w`	Working directory for temporary files	output dir
`--memory-budget`	Memory budget in GB for MPHF construction	4.0
`--track-short-seqs`	Report sequences shorter than k	off
`--poly-N-stretch`	Handle poly-N gaps in tiling output	off
`--collate-output-in-mem`	Buffer output per-thread for ordered writes	off
`--num-bins`	Number of minimizer partition bins	128

Output

cf1-rs produces three output files:

<prefix>.cf_seg — Unitig segments (GFA S-lines)
<prefix>.cf_seq — Per-sequence unitig tilings (GFA P-lines)
<prefix>.json — Summary statistics (vertex count, unitig count, length distribution)

Performance

cf1-rs is designed for high performance on both small transcriptomes and large genomes. The benchmarks below were run on an Apple M3 Max with 4 threads and k=31.

Dataset	Vertices	Unitigs	Wall time	Peak RSS
Transcriptome (GENCODE v49 PC)	96,334,486	1,077,924	~35 s	~2.4 GB
Genome (GRCh38 primary)	2,501,077,719	35,776,107	~8 m	~7.0 GB

Phase 5 (unitig extraction) uses within-sequence parallelism: large chromosomes are split into per-thread k-mer chunks that are processed concurrently, giving a ~3.4× speedup for that phase on 4 threads.
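
The chunking idea can be sketched as follows. This is an illustration under stated assumptions, not the cf1-rs code: `kmer_chunks` and `chunk_bases` are hypothetical helpers, and the sketch shows only how a sequence's k-mer positions might be divided evenly. It does not show how unitigs that cross a chunk boundary are handled. A sequence of length n contains n - k + 1 k-mers, and a chunk covering k-mers [start, end) needs bases [start, end + k - 1), so neighbouring chunks overlap by k - 1 bases.

```rust
use std::ops::Range;

/// Split the k-mer positions of a sequence into at most `threads`
/// contiguous, near-equal ranges of k-mer start indices.
fn kmer_chunks(seq_len: usize, k: usize, threads: usize) -> Vec<Range<usize>> {
    if k == 0 || threads == 0 || seq_len < k {
        return Vec::new();
    }
    let n_kmers = seq_len - k + 1;
    let chunks = threads.min(n_kmers); // never create empty chunks
    let base = n_kmers / chunks;
    let extra = n_kmers % chunks; // the first `extra` chunks get one more k-mer
    let mut out = Vec::with_capacity(chunks);
    let mut start = 0;
    for i in 0..chunks {
        let len = base + usize::from(i < extra);
        out.push(start..start + len);
        start += len;
    }
    out
}

/// The bases a worker must read to see every k-mer in its range.
/// Assumes `kmers` is a non-empty range produced by `kmer_chunks` for `seq`.
fn chunk_bases<'a>(seq: &'a [u8], kmers: &Range<usize>, k: usize) -> &'a [u8] {
    &seq[kmers.start..kmers.end + k - 1]
}
```

Each range could then be handed to its own thread, for instance with `std::thread::scope`, since the ranges are disjoint in k-mer space even though their base slices overlap.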

License

BSD-3-Clause

cf1-rs 0.3.1