cf1-rs 0.3.1

Fast construction of compacted reference de Bruijn graphs
Documentation

cf1-rs

A fast, parallel Rust implementation of the Cuttlefish algorithm for constructing compacted reference de Bruijn graphs (cdBGs).

Overview

cf1-rs builds a compacted de Bruijn graph from reference sequences (genomes, transcriptomes) using a 5-phase pipeline:

  1. Minimizer counting — SIMD-accelerated canonical minimizer extraction
  2. Super k-mer routing — Parallel partitioning of sequences into minimizer-based bins using 2-bit packed encoding
  3. MPHF construction — Global minimal perfect hash function via fingerprint-based hashing (BBHash), with parallel multi-threaded construction from disk
  4. DFA classification — Lock-free vertex classification using atomic compare-and-swap on a compact state vector
  5. Unitig extraction — Parallel traversal and output in GFA-reduced format

cf1-rs supports k-mer lengths from 1 to 63 (odd values) using const-generic k-mer types with u64 storage (k <= 32) or u128 storage (k <= 63).

Installation

cargo install cf1-rs

Or build from source:

git clone https://github.com/COMBINE-lab/cf1-rs.git
cd cf1-rs
cargo build --release

Usage

# Build a cdBG from a single FASTA file
cf1-rs build -s genome.fa.gz -k 31 -t 8 -o output/k31_dbg

# Build from a transcriptome with short sequence tracking
cf1-rs build -s transcripts.fa.gz -k 31 -t 4 -o output/k31_dbg \
    --track-short-seqs --poly-N-stretch

# Specify working directory and memory budget
cf1-rs build -s genome.fa.gz -k 31 -t 8 -o output/k31_dbg \
    -w /tmp/cf1_work --memory-budget 8.0

Options

Flag Description Default
-s Input FASTA/FASTQ file (optionally gzipped)
-l File listing input paths (one per line)
-d Directory containing input sequence files
-k K-mer length (odd, 1–63) 31
-t Number of threads 1
-o Output file prefix
-f Output format (0=FASTA, 1=GFA1, 2=GFA2, 3=GFA-reduced) 3
-w Working directory for temporary files output dir
--memory-budget Memory budget in GB for MPHF construction 4.0
--track-short-seqs Report sequences shorter than k off
--poly-N-stretch Handle poly-N gaps in tiling output off
--collate-output-in-mem Buffer output per-thread for ordered writes off
--num-bins Number of minimizer partition bins 128

Output

cf1-rs produces three output files:

  • <prefix>.cf_seg — Unitig segments (GFA S-lines)
  • <prefix>.cf_seq — Per-sequence unitig tilings (GFA P-lines)
  • <prefix>.json — Summary statistics (vertex count, unitig count, length distribution)

Performance

cf1-rs is designed for high performance on both small transcriptomes and large genomes. Benchmarks below were run on an Apple M3 Max with 4 threads and k=31.

Dataset Vertices Unitigs Wall time Peak RSS
Transcriptome (GENCODE v49 PC) 96,334,486 1,077,924 ~35 s ~2.4 GB
Genome (GRCh38 primary) 2,501,077,719 35,776,107 ~8 m ~7.0 GB

Phase 5 (unitig extraction) uses within-sequence parallelism — large chromosomes are split into per-thread k-mer chunks processed concurrently — giving a ~3.4× speedup for that phase on 4 threads.

License

BSD-3-Clause