§fgumi - Fulcrum Genomics UMI Tools Library
This library provides core functionality for working with Unique Molecular Identifiers (UMIs) in sequencing data, including grouping, consensus calling, and quality filtering.
§Overview
The fgumi library is organized into several key modules:
§Core Functionality
- umi - UMI assignment strategies (identity, edit-distance, adjacency, paired)
- consensus - Consensus calling algorithms (simplex, duplex, vanilla)
- sam - SAM/BAM file utilities and alignment tag manipulation
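To make the "identity" strategy in the list above concrete, here is a minimal standalone sketch of the idea: every distinct UMI string receives its own molecule ID. The function name `assign_identity` and the choice of `usize` IDs numbered by first appearance are assumptions made for illustration; this is not fgumi's `IdentityUmiAssigner`, whose internals are not shown on this page.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not part of fgumi): assigns one molecule ID per
/// distinct UMI string, numbering IDs in order of first appearance.
fn assign_identity(umis: &[&str]) -> Vec<usize> {
    let mut ids: HashMap<&str, usize> = HashMap::new();
    let mut assignments = Vec::with_capacity(umis.len());
    for &umi in umis {
        // A UMI seen for the first time gets the next unused ID.
        let next_id = ids.len();
        let id = *ids.entry(umi).or_insert(next_id);
        assignments.push(id);
    }
    assignments
}
```

Under these assumptions, calling `assign_identity(&["ACGTACGT", "ACGTACGT", "TGCATGCA"])` gives the two identical UMIs the same ID and the third a different one. The other strategies (edit-distance, adjacency, paired) tolerate mismatches between UMIs and are not covered by this sketch.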
§Utilities
- bam_io - BAM file I/O helpers for reading and writing
- validation - Input validation utilities for parameters and files
- progress - Progress tracking and logging
- logging - Enhanced logging utilities with formatting
- metrics - Structured metrics types and file writing utilities
- rejection - Rejection reason tracking and statistics
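As a rough illustration of what interval-based progress tracking involves, the sketch below counts records and reports the running total whenever an interval boundary is crossed, plus a final total when the count does not land exactly on a boundary. `IntervalCounter` and its methods are hypothetical names invented for this example. It is not fgumi's `ProgressTracker`, and it returns values instead of writing log lines so that the behavior is easy to check.

```rust
/// Hypothetical interval counter; not fgumi's `ProgressTracker`.
struct IntervalCounter {
    interval: u64,
    count: u64,
}

impl IntervalCounter {
    fn new(interval: u64) -> Self {
        assert!(interval > 0, "interval must be positive");
        Self { interval, count: 0 }
    }

    /// Adds `n` records and returns the new total if this addition
    /// crossed at least one interval boundary.
    fn add(&mut self, n: u64) -> Option<u64> {
        let before = self.count / self.interval;
        self.count += n;
        let after = self.count / self.interval;
        if after > before { Some(self.count) } else { None }
    }

    /// Returns the total unless it sits exactly on an interval boundary,
    /// in which case `add` has already reported it.
    fn finish(&self) -> Option<u64> {
        if self.count % self.interval != 0 { Some(self.count) } else { None }
    }
}
```

A caller would log whenever `add` returns `Some(total)` and once more if `finish` returns `Some(total)`, which mirrors the "log final count if not exactly on interval" behavior described in the Quick Start.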
§Specialized Modules
- clipper - Read clipping for overlapping pairs
- template - Template-based read grouping
- reference - Reference genome handling
§Quick Start
§Reading and Writing BAM Files
use fgumi_lib::bam_io::{create_bam_reader, create_bam_writer};
// Open input BAM and get header (path, threads)
let (mut reader, header) = create_bam_reader("input.bam", 1)?;
// Create output BAM writer (path, header, threads, compression_level)
let mut writer = create_bam_writer("output.bam", &header, 1, 6)?;
§Validating Input Files
use fgumi_lib::validation::validate_file_exists;
// Validate input files exist with clear error messages
validate_file_exists("input.bam", "Input BAM")?;
validate_file_exists("reference.fa", "Reference FASTA")?;
§Progress Tracking
use fgumi_lib::progress::ProgressTracker;
let tracker = ProgressTracker::new("Processing records")
    .with_interval(100);
for _i in 0..1000 {
    // Process one record...
    tracker.log_if_needed(1); // Track incremental progress
}
tracker.log_final(); // Log final count if not exactly on interval
§UMI Assignment
use fgumi_lib::umi::{IdentityUmiAssigner, UmiAssigner};
let assigner = IdentityUmiAssigner::default();
let umis = vec!["ACGTACGT".to_string(), "ACGTACGT".to_string(), "TGCATGCA".to_string()];
let assignments = assigner.assign(&umis);
// With identity assignment, each unique UMI gets its own molecule ID
// So we have 2 unique molecule IDs (ACGTACGT and TGCATGCA)
assert_eq!(assignments.iter().collect::<std::collections::HashSet<_>>().len(), 2);
§Feature Highlights
- Type-safe BAM I/O - Headers always paired with readers
- Consistent validation - Standardized error messages
- Progress tracking - Uniform logging across tools
- Module organization - Related functionality grouped logically
- Comprehensive testing - Extensive test suite ensuring correctness
§Architecture
The library follows these design principles:
- Separation of concerns - Modules have clear, focused responsibilities
- Backward compatibility - Re-exports maintain existing APIs
- Testability - Comprehensive unit and integration tests
- Documentation - All public items documented with examples
§Contributing
When adding new functionality:
- Add to appropriate module group (sam, umi, consensus, etc.)
- Include comprehensive documentation and examples
- Add unit tests covering edge cases
- Maintain backward compatibility via re-exports
§See Also
Modules§
- alignment_tags - Alignment tag regeneration (NM, UQ, MD) after base masking.
- assigner - UMI Assignment Strategies
- bam_io - BAM file I/O utilities.
- batched_sam_reader - Adaptive buffered SAM reader that grows based on observed batch sizes.
- bgzf_reader - Raw BGZF block reading and decompression.
- bgzf_writer - BGZF compression utilities for BAM output.
- bitenc - A 2-bit DNA encoding for fast UMI comparison.
- clipper - Read clipping utilities for BAM/SAM records.
- consensus - Consensus calling and filtering for UMI-based molecular consensus reads.
- consensus_caller - Consensus Calling Infrastructure
- consensus_filter - Consensus read filtering logic.
- consensus_tags - Consensus-related SAM tags for reads generated by consensus calling tools.
- dna - DNA sequence utilities.
- duplex_consensus_caller - Duplex Consensus Calling
- errors - Custom error types for fgumi operations.
- fastq - FASTQ file parsing and read structure handling.
- grouper - Grouper implementations for the 9-step pipeline.
- header - Utilities for adding @PG (program) records to SAM headers.
- logging - Enhanced logging utilities for formatted output.
- metrics - Metrics collection and reporting for fgumi operations.
- mi_group - Molecular Identifier (MI) group utilities for streaming BAM processing.
- overlapping_consensus - Overlapping bases consensus caller for paired-end reads.
- phred - Phred score utilities and probability calculations.
- progress - Progress tracking utilities
- read_info - Data structures for tracking read position information.
- reference - Reference genome FASTA reading with all sequences loaded into memory.
- rejection - Rejection reason tracking for reads and templates.
- reorder_buffer - Reordering buffer for out-of-order batch completion.
- sam - SAM/BAM file utilities and header manipulation.
- simple_umi_consensus - Simple UMI consensus calling for metrics collection.
- sort - High-performance BAM sorting module.
- tag_reversal - Per-base tag reversal for negative-strand reads.
- template - Template data structure for grouping reads by query name.
- umi - UMI (Unique Molecular Identifier) utilities
- unified_pipeline - Unified thread pool pipeline for --threads N mode.
- validation - Input validation utilities
- vanilla_consensus_caller - Vanilla UMI consensus calling implementation.
- variant_review - Support for reviewing consensus variants
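The bitenc entry in the module list mentions a 2-bit DNA encoding for fast UMI comparison. The sketch below shows the general technique only: each of A, C, G, T is packed into two bits of a `u64`, and mismatches between two encoded UMIs of equal length are counted with an XOR and a population count. The names `encode_2bit` and `mismatches`, the A=0/C=1/G=2/T=3 mapping, and the 32-base limit are all assumptions for this example and are not taken from the module's actual API.

```rust
/// Hypothetical helper: packs an A/C/G/T string into a u64, two bits per base.
/// Returns None for sequences longer than 32 bases or containing other bytes.
fn encode_2bit(seq: &str) -> Option<u64> {
    if seq.len() > 32 {
        return None;
    }
    let mut packed = 0u64;
    for b in seq.bytes() {
        let code = match b {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None,
        };
        // Shift previous bases left and append the new 2-bit code.
        packed = (packed << 2) | code;
    }
    Some(packed)
}

/// Hypothetical helper: counts mismatching bases between two encoded
/// sequences that were built from strings of the same length.
fn mismatches(a: u64, b: u64) -> u32 {
    let x = a ^ b;
    // Fold each 2-bit pair onto its low bit: set if either bit differs.
    let collapsed = (x | (x >> 1)) & 0x5555_5555_5555_5555;
    collapsed.count_ones()
}
```

Because leading A bases encode as zero bits, the comparison is only meaningful when both inputs have the same length, so a real implementation would need to track length separately or enforce a fixed UMI length.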
Enums§
- RejectionReason - Reasons why a read or template was rejected during processing.