cf1-rs
A fast, parallel Rust implementation of the Cuttlefish algorithm for constructing compacted reference de Bruijn graphs (cdBGs).
Overview
cf1-rs builds a compacted de Bruijn graph from reference sequences (genomes, transcriptomes) using a 5-phase pipeline:
- Minimizer counting — SIMD-accelerated canonical minimizer extraction
- Super k-mer routing — Parallel partitioning of sequences into minimizer-based bins using 2-bit packed encoding
- MPHF construction — Global minimal perfect hash function via fingerprint-based hashing (BBHash), with parallel multi-threaded construction from disk
- DFA classification — Lock-free vertex classification using atomic compare-and-swap on a compact state vector
- Unitig extraction — Parallel traversal and output in GFA-reduced format
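To illustrate the lock-free classification step, here is a minimal sketch of a packed state vector updated with compare-and-swap. It is not taken from the cf1-rs source: the 4-bit state width, the `StateVector` type, and the toy `saturating_visit` transition are assumptions made for the example, and the real DFA has its own state set and transition table.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Bits per packed state (hypothetical width chosen for this sketch).
const BITS: usize = 4;
const PER_WORD: usize = 64 / BITS;
const MASK: u64 = (1u64 << BITS) - 1;

/// A compact vector of small states, several packed into each atomic word.
pub struct StateVector {
    words: Vec<AtomicU64>,
}

impl StateVector {
    /// Create `n` slots, all starting in state 0.
    pub fn new(n: usize) -> Self {
        let n_words = (n + PER_WORD - 1) / PER_WORD;
        Self { words: (0..n_words).map(|_| AtomicU64::new(0)).collect() }
    }

    /// Read the state stored in slot `idx`.
    pub fn get(&self, idx: usize) -> u8 {
        let shift = (idx % PER_WORD) * BITS;
        ((self.words[idx / PER_WORD].load(Ordering::Acquire) >> shift) & MASK) as u8
    }

    /// Apply `transition` to slot `idx` and return the resulting state.
    /// The CAS loop retries whenever another thread changed the same word,
    /// so neighbouring slots in that word are never clobbered.
    pub fn update(&self, idx: usize, transition: impl Fn(u8) -> u8) -> u8 {
        let word = &self.words[idx / PER_WORD];
        let shift = (idx % PER_WORD) * BITS;
        let mut cur = word.load(Ordering::Acquire);
        loop {
            let old = ((cur >> shift) & MASK) as u8;
            let new = transition(old) & MASK as u8;
            if new == old {
                return old; // nothing to write
            }
            let next = (cur & !(MASK << shift)) | ((new as u64) << shift);
            match word.compare_exchange_weak(cur, next, Ordering::AcqRel, Ordering::Acquire) {
                Ok(_) => return new,
                Err(actual) => cur = actual, // lost the race, retry on the fresh value
            }
        }
    }
}

/// Toy transition: 0 = unseen, 1 = seen once, 2 = seen more than once.
pub fn saturating_visit(state: u8) -> u8 {
    state.saturating_add(1).min(2)
}
```

Because each update is a pure function of the previous state, any number of threads can call `update` on overlapping slots without a mutex.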
cf1-rs supports k-mer lengths from 1 to 63 (odd values) using const-generic k-mer types with u64 storage (k <= 32) or u128 storage (k <= 63).
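The following sketch shows what a const-generic, 2-bit packed k-mer with `u64` storage could look like. The names `Kmer64`, `from_ascii`, `canonical`, and `storage_bits` are hypothetical and do not come from the cf1-rs API; a `u128` variant would follow the same pattern. Odd k guarantees that no k-mer equals its own reverse complement, which keeps the canonical form unambiguous.

```rust
/// A k-mer of length K packed at 2 bits per base (A=0, C=1, G=2, T=3).
/// Hypothetical type for illustration; requires K <= 32.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct Kmer64<const K: usize>(u64);

/// Map an ASCII base to its 2-bit code; anything else is rejected.
fn encode(b: u8) -> Option<u64> {
    match b {
        b'A' | b'a' => Some(0),
        b'C' | b'c' => Some(1),
        b'G' | b'g' => Some(2),
        b'T' | b't' => Some(3),
        _ => None,
    }
}

impl<const K: usize> Kmer64<K> {
    /// Mask that keeps the low 2*K bits.
    const MASK: u64 = if K == 32 { u64::MAX } else { (1u64 << (2 * K)) - 1 };

    /// Pack exactly K bases; returns None on a wrong length or a non-ACGT base.
    pub fn from_ascii(seq: &[u8]) -> Option<Self> {
        if seq.len() != K {
            return None;
        }
        let mut bits = 0u64;
        for &b in seq {
            bits = (bits << 2) | encode(b)?;
        }
        Some(Self(bits & Self::MASK))
    }

    /// Reverse the base order and complement each base (3 - code).
    pub fn reverse_complement(self) -> Self {
        let mut fwd = self.0;
        let mut rc = 0u64;
        for _ in 0..K {
            rc = (rc << 2) | (3 - (fwd & 3));
            fwd >>= 2;
        }
        Self(rc)
    }

    /// The smaller of the k-mer and its reverse complement.
    pub fn canonical(self) -> Self {
        self.min(self.reverse_complement())
    }

    /// Decode back to an ASCII string, most significant base first.
    pub fn to_ascii(self) -> String {
        (0..K)
            .rev()
            .map(|i| b"ACGT"[((self.0 >> (2 * i)) & 3) as usize] as char)
            .collect()
    }
}

/// Storage width for a given k, following the limits stated above
/// (odd k only; u64 up to 32, u128 up to 63). Hypothetical helper.
pub fn storage_bits(k: usize) -> Option<u32> {
    match k {
        _ if k % 2 == 0 => None,
        1..=32 => Some(64),
        33..=63 => Some(128),
        _ => None,
    }
}
```

For example, `Kmer64::<5>::from_ascii(b"ACGTT")` has reverse complement `AACGT`, which is also its canonical form.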
Installation
Or build from source:
Usage
# Build a cdBG from a single FASTA file
# Build from a transcriptome with short sequence tracking
# Specify working directory and memory budget
Options
| Flag | Description | Default |
|---|---|---|
| `-s` | Input FASTA/FASTQ file (optionally gzipped) | — |
| `-l` | File listing input paths (one per line) | — |
| `-d` | Directory containing input sequence files | — |
| `-k` | K-mer length (odd, 1–63) | 31 |
| `-t` | Number of threads | 1 |
| `-o` | Output file prefix | — |
| `-f` | Output format (0=FASTA, 1=GFA1, 2=GFA2, 3=GFA-reduced) | 3 |
| `-w` | Working directory for temporary files | output dir |
| `--memory-budget` | Memory budget in GB for MPHF construction | 4.0 |
| `--track-short-seqs` | Report sequences shorter than k | off |
| `--poly-N-stretch` | Handle poly-N gaps in tiling output | off |
| `--collate-output-in-mem` | Buffer output per-thread for ordered writes | off |
| `--num-bins` | Number of minimizer partition bins | 128 |
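To make the role of `--num-bins` concrete, the sketch below computes a canonical minimizer for a sequence fragment and maps it to a bin. It is a scalar illustration only (cf1-rs describes its extraction as SIMD-accelerated), and several details are assumptions: the minimizer length `m`, the use of a splitmix64-style mixer to order minimizers, and the modulo mapping in `bin_of` are all hypothetical choices, not the tool's documented behavior.

```rust
/// Map an ASCII base to its 2-bit code; anything else is rejected.
fn encode(b: u8) -> Option<u64> {
    match b {
        b'A' | b'a' => Some(0),
        b'C' | b'c' => Some(1),
        b'G' | b'g' => Some(2),
        b'T' | b't' => Some(3),
        _ => None,
    }
}

/// 64-bit mixer (the splitmix64 finalizer), used here as a stand-in hash.
fn mix(mut x: u64) -> u64 {
    x = (x ^ (x >> 30)).wrapping_mul(0xbf58476d1ce4e5b9);
    x = (x ^ (x >> 27)).wrapping_mul(0x94d049bb133111eb);
    x ^ (x >> 31)
}

/// Smallest mixed hash over all canonical m-mers of `seq`.
/// Returns None if `seq` is shorter than `m` or contains a non-ACGT base.
pub fn canonical_minimizer(seq: &[u8], m: usize) -> Option<u64> {
    assert!(m >= 1 && m <= 32, "m must fit in a u64 at 2 bits per base");
    if seq.len() < m {
        return None;
    }
    let mask = if m == 32 { u64::MAX } else { (1u64 << (2 * m)) - 1 };
    let (mut fwd, mut rev) = (0u64, 0u64);
    let mut best: Option<u64> = None;
    for (i, &b) in seq.iter().enumerate() {
        let code = encode(b)?;
        // Roll the forward m-mer left and the reverse-complement m-mer right.
        fwd = ((fwd << 2) | code) & mask;
        rev = (rev >> 2) | ((3 - code) << (2 * (m - 1)));
        if i + 1 >= m {
            let h = mix(fwd.min(rev)); // canonical m-mer, then hash
            best = Some(best.map_or(h, |cur| cur.min(h)));
        }
    }
    best
}

/// Route a minimizer hash to one of `num_bins` partitions.
pub fn bin_of(minimizer_hash: u64, num_bins: usize) -> usize {
    (minimizer_hash % num_bins as u64) as usize
}
```

Because the minimizer is canonical, a fragment and its reverse complement land in the same bin, which is what lets each bin be processed independently.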
Output
cf1-rs produces three output files:
- `<prefix>.cf_seg` — Unitig segments (GFA S-lines)
- `<prefix>.cf_seq` — Per-sequence unitig tilings (GFA P-lines)
- `<prefix>.json` — Summary statistics (vertex count, unitig count, length distribution)
Performance
cf1-rs is designed for high performance on both small transcriptomes and large genomes. Benchmarks below were run on an Apple M3 Max with 4 threads and k=31.
| Dataset | Vertices | Unitigs | Wall time | Peak RSS |
|---|---|---|---|---|
| Transcriptome (GENCODE v49 PC) | 96,334,486 | 1,077,924 | ~35 s | ~2.4 GB |
| Genome (GRCh38 primary) | 2,501,077,719 | 35,776,107 | ~8 m | ~7.0 GB |
Phase 5 (unitig extraction) uses within-sequence parallelism — large chromosomes are split into per-thread k-mer chunks processed concurrently — giving a ~3.4× speedup for that phase on 4 threads.
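The chunking idea can be sketched as follows. A sequence of length n contains n - k + 1 k-mers, so the k-mer index range is split evenly across threads and each chunk reads k - 1 extra bases past its last k-mer start. The function names are hypothetical, and the sketch does not cover how results are joined when a unitig straddles a chunk boundary.

```rust
use std::ops::Range;

/// Split the k-mer indices of a sequence into at most `threads` contiguous chunks.
/// Each range is over k-mer start positions, not bases.
pub fn kmer_chunks(seq_len: usize, k: usize, threads: usize) -> Vec<Range<usize>> {
    if k == 0 || threads == 0 || seq_len < k {
        return Vec::new();
    }
    let n_kmers = seq_len - k + 1;
    let chunk = (n_kmers + threads - 1) / threads; // ceiling division, always >= 1
    (0..n_kmers)
        .step_by(chunk)
        .map(|start| start..(start + chunk).min(n_kmers))
        .collect()
}

/// Bases a worker must read to see every k-mer in its chunk.
/// Adjacent chunks therefore overlap by k - 1 bases.
pub fn base_range(kmers: &Range<usize>, k: usize) -> Range<usize> {
    kmers.start..kmers.end + k - 1
}
```

With a 100-base sequence, k = 31, and 4 threads, this yields k-mer ranges `0..18`, `18..36`, `36..54`, and `54..70`; the first worker reads bases `0..48` and the last reads `54..100`.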
License
BSD-3-Clause