fgumi
⚠️ ALPHA SOFTWARE - USE AT YOUR OWN RISK
This software is currently in ALPHA. While we have extensively tested these tools across a wide variety of vendor-provided data, no guarantees are made regarding correctness or stability.
We are targeting June 1, 2026 to recommend fgumi over fgbio for production use.
Fulcrum Genomics Unique Molecular Indexing (UMI) Tools - a suite of high-performance tools for working with UMI-tagged sequencing data.
Overview
fgumi provides comprehensive functionality for:
- UMI extraction from FASTQ files
- Read grouping by UMI with multiple assignment strategies
- UMI-aware deduplication for marking/removing PCR duplicates
- Consensus calling (simplex, duplex, and CODEC)
- Quality filtering of consensus reads
- Read clipping for overlapping pairs
- Metrics collection for QC and analysis
Pipeline Overview
The diagram shows the workflow from FASTQ files to filtered consensus reads:
- Red: Simplex (single-strand) consensus
- Blue: Duplex (double-strand) consensus
- Green: CODEC consensus
- Orange: Optional UMI correction for fixed UMI sets
Resources
- Documentation
- Best Practice Pipeline: Recommended workflow from FASTQ to consensus
- Performance Tuning Guide: Threading, memory, and compression optimization
- Snakemake Pipeline: Reference implementation
- Metrics: Output metrics documentation
- Developing: Developer guide
- Compare CLI: Compare command documentation (feature-gated)
- Simulate CLI: Simulate command documentation (feature-gated)
- Releases
- Issues: Report a bug or request a feature
- Pull requests: Submit a patch or new feature
- Discussions: Ask a question
- Contributors guide
- License: Released under the MIT license
Installation
Downloading a pre-built binary
Pre-built binaries for the most common operating systems and CPU architectures are attached to each release for this project.
Installing with Cargo
cargo install fgumi
Building from source
Clone the repository:
git clone https://github.com/fulcrumgenomics/fgumi
Build the release version:
cd fgumi
cargo build --release
Optional Features
| Feature | Description |
|---|---|
compare |
Developer tools for comparing BAMs and metrics |
simulate |
Commands for generating synthetic test data |
profile-adjacency |
Enable profiling output for adjacency UMI assigner |
Enable with: cargo build --release --features <feature> |
Available Tools
| Command | Description | Equivalent Tool(s) |
|---|---|---|
extract |
Extract UMIs from FASTQ files | fgbio ExtractUmisFromBam |
correct |
Correct UMIs based on sequence similarity | fgbio CorrectUmis |
zipper |
Restore original FASTQ from unaligned BAM | fgbio ZipperBams, picard MergeBamAlignment |
fastq |
Convert BAM to FASTQ format | samtools fastq |
sort |
Sort BAM by coordinate/queryname/template | — |
group |
Group reads by UMI | fgbio GroupReadsByUmi |
dedup |
Mark/remove UMI-aware duplicates | gatk UmiAwareMarkDuplicatesWithMateCigar, umi-tools dedup |
simplex |
Call single-strand consensus reads | fgbio CallMolecularConsensusReads |
duplex |
Call duplex consensus reads | fgbio CallDuplexConsensusReads |
codec |
Call CODEC consensus | fgbio CallCodecConsensusReads |
filter |
Filter consensus reads | fgbio FilterConsensusReads |
clip |
Clip overlapping read pairs | fgbio ClipBam |
duplex-metrics |
Collect duplex metrics | fgbio CollectDuplexSeqMetrics |
review |
Review consensus variants | fgbio ReviewConsensusVariants |
downsample |
Downsample BAM by UMI family | N/A |
compare <cmd> |
Compare files (feature-gated) | N/A |
simulate <cmd> |
Generate test data (feature-gated) | N/A |
Usage
For detailed usage of each command, run:
Basic Workflow
- Extract UMIs from FASTQ:
- (Optional) Correct UMIs for fixed UMI sets:
- Align and sort reads using fgumi fastq + zipper + sort pipeline:
| | |
- Group reads by UMI:
# or --strategy adjacency for simplex/codec workflows
- Call consensus reads:
# Simplex consensus
# Or duplex consensus
# Or codec consensus
- (Optional) Collect duplex metrics:
- Filter consensus reads:
Performance Options
fgumi supports multi-threading and memory management for optimal performance:
- Threading:
--threads Nfor parallel processing - Memory:
--queue-memory 768(plain numbers are MB; supports human-readable formats like2GB) - Compression:
--compression-level 1-12for speed vs size trade-offs
Memory Model — Important for fgbio users: Unlike fgbio's JVM
-Xmxwhich sets a hard ceiling on total process memory, fgumi's--queue-memorycontrols pipeline queue backpressure only. Two things to be aware of:
- Per-thread scaling (default):
--queue-memory 768 --threads 8allocates 768 MB per thread = ~6 GB total queue memory. Use--queue-memory-per-thread falsefor a fixed total budget.- Queue memory < total process memory: Actual RSS will be higher due to UMI data structures, decompressors, thread stacks, and working buffers.
See the Performance Tuning Guide for detailed guidance, including scenario-based configurations and troubleshooting.
Performance
fgumi is written in Rust for maximum performance.
Command-Level Optimizations
| Command | Key Optimizations |
|---|---|
extract |
Work-stealing thread pool, streaming I/O |
correct |
N-gram indexing with pigeonhole principle, BK-tree for k>1 |
group |
2-bit UMI encoding, N-gram/BK-tree indexing, directed adjacency graph |
simplex |
Fast-path for unanimous consensus, parallel processing |
duplex |
Parallel duplex calling, efficient strand matching |
codec |
Parallel CODEC consensus |
filter |
Streaming filter with parallel processing |
clip |
Parallel overlap detection and clipping |
sort |
External merge sort, configurable memory limit |
General Optimizations
- 2-bit DNA encoding: 4 bases in 1 byte, 32 bases in u64
- CPU intrinsics: XOR + popcount for Hamming distance
- Work-stealing scheduler: Unified pipeline with dynamic load balancing
- libdeflate: Fast BGZF compression
Acknowledgements
fgumi's UMI grouping algorithms are inspired by:
- UMI-tools (Smith et al. 2017) - The directed adjacency method for UMI deduplication with count gradient constraints.
- UMICollapse (Liu 2019) - N-gram and BK-tree indexing strategies for efficient similarity search in UMI deduplication.
Authors
Sponsors
Development of fgumi is supported by Fulcrum Genomics.
Disclaimer
This software is under active development. While we make a best effort to test this software and to fix issues as they are reported, this software is provided as-is without any warranty (see the license for details). Please submit an issue, and better yet a pull request as well, if you discover a bug or identify a missing feature. Please contact Fulcrum Genomics if you are considering using this software or are interested in sponsoring its development.