cgDist 🧬
Ultra-fast SNP/indel-level distance calculator for core genome MLST analysis
cgDist is a high-performance Rust implementation for calculating genetic distances in bacterial genomics, specifically designed for epidemiological outbreak investigations and phylogenetic analysis.
Features
- Ultra-fast: Parallel processing with optimized algorithms
- Precision: SNP/indel-level distance calculation
- Flexible: Multiple hashing algorithms (CRC32, MD5, SHA256)
- Comprehensive: Built-in comparison tools and statistical analysis
- Recombination-candidate flagging: Per-locus mutation-density screen to flag loci as recombination candidates for downstream phylogenetic confirmation
- Efficient: LZ4 compression for fast caching
- Scalable: Memory-efficient processing of large datasets
Table of Contents
- Features
- Installation
- Quick Start
- Usage
- Recombination-Candidate Flagging
- Cache Inspector
- Custom Hashers Plugin System
- API Documentation
- Citation
- Support
- License
Installation
Prerequisites
- Rust 1.70 or later (the minimum supported Rust version, MSRV, is also declared in Cargo.toml). The easiest way to install or update Rust is via rustup.rs: install rustup if it is not already present, or, if rustup is installed but Rust is older than 1.70, update the toolchain.
- Python 3.8+ (only for the validation scripts in validation_test/)
- System build dependencies for parasail-rs: the alignment backend is built from C source via CMake, which requires a C compiler and zlib development headers. Install these once per machine on Debian / Ubuntu / WSL, on RHEL / AlmaLinux / Rocky / CentOS / Fedora, or on macOS (Homebrew; the Xcode Command Line Tools provide the compiler and zlib).
Windows users are encouraged to use the Docker image or WSL2, which provide a ready-to-build Linux environment. Native Windows builds additionally require zlib via vcpkg (vcpkg install zlib) or MSYS2.
From Source
# Clone the repository
# Build release version
# The binary will be available at ./target/release/cgdist
Install from crates.io (recommended)
This fetches the latest published release from
crates.io, builds it locally with
your stable Rust toolchain, and installs the cgdist, inspector,
and recombination_candidate_analyzer binaries to ~/.cargo/bin/
(which should already be on your PATH after a default rustup
install). The deprecated recombination_analyzer binary is also
installed and forwards every argument to
recombination_candidate_analyzer with a deprecation notice, so existing
scripts continue to work.
To pin a specific published version:
Install from GitHub (specific tag or unreleased commits)
To install directly from the GitHub repository, which is useful for installing an unreleased commit or for fully self-contained reproducibility when citing the manuscript:
# Specific release tag
# Latest state on the default branch
cgdist is a binary crate, so its Cargo.lock is committed to the
repository to guarantee reproducible builds; this is the convention
recommended in the
official Cargo FAQ
for binary crates.
Docker
A multi-arch (linux/amd64 + linux/arm64) image is published to GitHub Container Registry on every release:
# Pull the public image (no authentication required)
# or pin to the minor / major series:
# docker pull ghcr.io/genpat-it/cgdist:0.1
# docker pull ghcr.io/genpat-it/cgdist:latest # tracks master HEAD
# Run with the image (mount your working directory at /data).
# The image's ENTRYPOINT is `cgdist`, so flags are passed directly:
To build the image locally instead of pulling (useful for development):
Quick Start
Basic Distance Calculation
# Calculate SNP distances from cgMLST profiles
# Use different distance mode
# Use different hashing algorithm
# Enable cache for faster recomputation
# Specify number of threads
Validation / Smoke Test
A self-contained validation suite with a small embedded test dataset
(3 loci, 10 samples, ~3 KB) is provided in
validation_test/. It verifies algorithmic
correctness across all four distance modes (Hamming, SNPs,
SNPs+InDel-events, SNPs+InDel-bases), checks the mathematical invariant
cgDist ≥ Hamming, and confirms Parasail alignment integration.
# After building cgDist (see Installation)
# Verify expected results
Expected output: ALL VALIDATION TESTS PASSED! See
validation_test/README.md for details on
the test design, expected distances, and how to regenerate the fixture
from scratch.
The validation suite also runs automatically in CI on every push and
pull request (see .github/workflows/ci-and-docker.yml).
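The invariant mentioned above can also be checked on any pair of output matrices. The following is a minimal sketch, assuming both matrices have already been loaded into memory in the same sample order; the function name `first_violation` and the example values are hypothetical and are not part of the validation suite.

```rust
/// Returns the first (row, column) position where the alignment-based
/// distance is smaller than the Hamming distance, or None if the
/// invariant cgDist >= Hamming holds for every compared cell.
fn first_violation(cgdist: &[Vec<u32>], hamming: &[Vec<u32>]) -> Option<(usize, usize)> {
    for (i, (cg_row, ham_row)) in cgdist.iter().zip(hamming).enumerate() {
        for (j, (cg, ham)) in cg_row.iter().zip(ham_row).enumerate() {
            if cg < ham {
                return Some((i, j));
            }
        }
    }
    None
}

fn main() {
    // Hypothetical 3x3 matrices, not taken from the validation fixture.
    let hamming = vec![vec![0, 1, 2], vec![1, 0, 2], vec![2, 2, 0]];
    let cgdist = vec![vec![0, 3, 5], vec![3, 0, 2], vec![5, 2, 0]];
    match first_violation(&cgdist, &hamming) {
        None => println!("invariant holds"),
        Some((i, j)) => println!("violation at ({}, {})", i, j),
    }
}
```

Returning the offending position rather than a plain boolean makes a failed check easier to trace back to a specific sample pair.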
Configuration File
A configuration file is optional: every parameter accepted by cgdist also has
a CLI flag. The configuration file simply lets you persist commonly-used
settings without retyping them. CLI flags always override TOML values
when both are provided.
A canonical example is shipped at
examples/cgdist-config.toml; a
Hamming-mode variant is at
examples/hamming-config.toml. Both
files use the same flat key structure (no [sections]), and the same
key names as the corresponding CLI flags (the only normalization is
that CLI flag dashes become underscores in TOML, e.g. --hasher-type
becomes hasher_type).
You can also generate a fresh annotated sample with:
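A minimal example (alignment-based mode) sets, in order: the profiles file (profiles.tsv), the schema directory (schema/), the output path (distances.tsv), the hasher type (crc32), the distance mode (snps; legacy alias snps-indel-events == snps-indel-contiguous, deprecated), the output format (tsv), the missing-data character (-), the thread count (1, the default; set to 0 for auto-detect), and the Hamming fallback switch (false; it is opt-in, see the Hamming Fallback section below).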
# Use a configuration file
# CLI overrides example: config says threads=1, but the CLI wins: 16 threads used
CLI vs TOML precedence
When both the configuration file and the command line specify the same
parameter, the command-line value wins. Internally this is
implemented by loading the TOML first, then overlaying any CLI flag
that the user explicitly set. The same rule applies to switches: e.g.
if the TOML says hamming_fallback = false but you pass
--hamming-fallback on the command line, the fallback will be enabled
for that run.
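The overlay rule described above can be sketched as follows. This is an illustration under stated assumptions, not cgDist's actual implementation: the `Settings` and `CliOverrides` types and the `resolve` function are hypothetical, and only two parameters are modeled.

```rust
/// Settings as loaded from the TOML file (already filled with defaults).
#[derive(Debug, Clone, PartialEq)]
struct Settings {
    threads: usize,
    hamming_fallback: bool,
}

/// Values the user set explicitly on the command line; None means "not given".
#[derive(Debug, Default)]
struct CliOverrides {
    threads: Option<usize>,
    hamming_fallback: Option<bool>,
}

/// Start from the file settings, then overlay every explicitly set CLI value.
fn resolve(file: Settings, cli: &CliOverrides) -> Settings {
    Settings {
        threads: cli.threads.unwrap_or(file.threads),
        hamming_fallback: cli.hamming_fallback.unwrap_or(file.hamming_fallback),
    }
}

fn main() {
    // The file says threads = 1 and hamming_fallback = false ...
    let file = Settings { threads: 1, hamming_fallback: false };
    // ... but the command line explicitly sets both.
    let cli = CliOverrides { threads: Some(16), hamming_fallback: Some(true) };
    println!("{:?}", resolve(file, &cli));
}
```

Modeling CLI values as `Option` keeps "flag not given" distinct from "flag given with the default value", which is what lets the file value survive when the flag is absent.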
Usage
Command Line Options
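Every parameter that can be set in the configuration file is also available as a CLI flag; flag names match the TOML keys with underscores replaced by dashes (for example, --hasher-type and --hamming-fallback).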
Supported Input Formats
Schema (FASTA directory):
- Individual FASTA files per locus
- Each file contains allele sequences
- File names correspond to locus names
Profiles (allelic profiles):
- TSV: Tab-separated values
- CSV: Comma-separated values
- Format: Sample name | Locus1 | Locus2 | ... | LocusN
- Missing data represented by configurable character (default: -)
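To make the profile layout concrete, here is a minimal sketch that parses a tab-separated profile table and counts differing loci between two samples. It assumes a header line and assumes that loci with a missing call in either sample are skipped; cgDist's own handling of missing data may differ. The function names and the example table are hypothetical.

```rust
/// One sample row: name plus one allele call per locus (None = missing).
type Profile = (String, Vec<Option<String>>);

/// Parse a tab-separated profile table whose first line is a header.
fn parse_profiles(text: &str, missing: &str) -> Vec<Profile> {
    text.lines()
        .skip(1) // header: sample column followed by locus names
        .filter(|line| !line.trim().is_empty())
        .map(|line| {
            let mut fields = line.split('\t');
            let name = fields.next().unwrap_or("").to_string();
            let alleles: Vec<Option<String>> = fields
                .map(|f| if f == missing { None } else { Some(f.to_string()) })
                .collect();
            (name, alleles)
        })
        .collect()
}

/// Count loci where both samples have a call and the calls differ.
fn hamming(a: &[Option<String>], b: &[Option<String>]) -> usize {
    let mut differing = 0;
    for (x, y) in a.iter().zip(b) {
        if let (Some(p), Some(q)) = (x, y) {
            if p != q {
                differing += 1;
            }
        }
    }
    differing
}

fn main() {
    // Hypothetical two-sample, three-locus table using "-" for missing data.
    let table = "sample\tlocus1\tlocus2\tlocus3\nS1\t1\t4\t-\nS2\t1\t7\t2\n";
    let profiles = parse_profiles(table, "-");
    let d = hamming(&profiles[0].1, &profiles[1].1);
    println!("{} vs {}: {}", profiles[0].0, profiles[1].0, d);
}
```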
Cache files:
- LZ4: Compressed cache files (.lz4 or .bin extension)
- Automatic compression/decompression
Output Formats
- TSV: Tab-separated distance matrix (default)
- CSV: Comma-separated distance matrix
- PHYLIP: Phylogenetic analysis format
- NEXUS: Nexus format for phylogenetic tools
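As an illustration of the PHYLIP layout, the sketch below renders a square distance matrix with the sample count on the first line followed by one row per sample. It is an assumption about the general shape of a PHYLIP distance file, not a description of the exact bytes cgdist writes; the function name `to_phylip` is hypothetical.

```rust
use std::fmt::Write;

/// Render a square distance matrix in a PHYLIP-style layout: the number of
/// samples on the first line, then one row per sample.
fn to_phylip(names: &[&str], matrix: &[Vec<u32>]) -> String {
    let mut out = String::new();
    writeln!(out, "{}", names.len()).unwrap();
    for (name, row) in names.iter().zip(matrix) {
        // Pad names to at least 10 characters (longer names are left untouched).
        write!(out, "{:<10}", name).unwrap();
        for value in row {
            write!(out, " {}", value).unwrap();
        }
        out.push('\n');
    }
    out
}

fn main() {
    // Hypothetical two-sample matrix.
    let text = to_phylip(&["S1", "S2"], &[vec![0, 3], vec![3, 0]]);
    print!("{}", text);
}
```

Strict PHYLIP readers expect names of exactly 10 characters, so longer sample names may need truncating or a relaxed-format reader.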
Recombination-Candidate Flagging
cgDist includes a companion screen that flags candidate recombinant loci based on per-locus mutation density. This is not a recombination detector: confirmation of recombination requires downstream phylogeny-aware tools (e.g. Gubbins, ClonalFrameML, fastGEAR). The flagging output identifies which loci warrant that follow-up.
Features
- Mutation Density Analysis: Flags loci with high SNP/indel density per alignment as recombination candidates
- Hamming Distance Filtering: Focuses analysis on genetically related sample pairs
- Pairwise Flagging Summary: Per sample-pair count of flagged loci
- EFSA Loci Support: Compatible with standardized loci sets for food safety applications
- Distance Matrix Correction: Recomputes distances excluding flagged loci
Tool 1: Built-in Flagging
The main cgdist binary can flag candidate recombinant loci during distance calculation:
# Flag candidate loci with default threshold (20 SNPs+InDel bases)
# Custom threshold (e.g., 30 SNPs+InDel bases)
Note: --recombination-log and --recombination-threshold are kept as deprecated aliases of the canonical --candidate-recombination-* flags (a deprecation warning is printed when used). Existing scripts continue to work.
Output: CSV log with locus, sample pairs, divergence percentages, and sequence lengths
Tool 2: Recombination-Candidate Analyzer (Post-processing)
For advanced flagging with Hamming filtering and EFSA loci support:
# Build the candidate analyzer
# Step 1: Create enriched cache with sequence lengths
# Step 2: Run the analyzer
# Custom threshold (5% mutation density)
Note: the binary recombination_analyzer and the flag --recombination-log are kept as deprecated aliases for backward compatibility (a deprecation notice is printed when invoked). Existing scripts continue to work.
Input Requirements
cgDist consumes standard cgMLST outputs. Profiles and schemas can be generated, for example, with ChewBBACA (Silva et al. 2018) or downloaded from Chewie-NS (Mamede et al. 2020).
For Tool 1 (Built-in Flagging):
- Schema: FASTA directory with allele sequences (e.g. ChewBBACA schema directory)
- Profiles: TSV/CSV file with sample-locus-allele matrix (e.g. ChewBBACA results_alleles.tsv)
For Tool 2 (Recombination-Candidate Analyzer):
- Enriched Cache: .bin file generated with the --enrich-lengths option
- Allelic Profiles: TSV file with sample-locus-allele matrix
- Distance Matrix: Original distance matrix from cgdist
- EFSA Loci (optional): TSV file listing loci of interest
Output Files
Tool 1 Output: recombination_events.csv
- Locus name
- Sample pairs
- Divergence percentage
- Sequence lengths
- SNPs and InDel counts
Tool 2 Outputs:
- Corrected Distance Matrix: Distance matrix with flagged candidate loci excluded
- Flagging Log: Detailed list of flagged candidate loci with:
  - Sample pairs
  - Locus information
  - Mutation statistics (SNPs, InDels)
  - Density percentages
  - Sequence lengths
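The idea behind the corrected distance matrix can be sketched for a single sample pair: sum the per-locus distances while skipping loci that were flagged. This is a simplified illustration; the function name, the input shape, and the example values are hypothetical, and the analyzer's actual data structures are not shown in this document.

```rust
use std::collections::HashSet;

/// Sum per-locus distances for one sample pair, skipping flagged loci.
fn corrected_distance(per_locus: &[(&str, u32)], flagged: &HashSet<&str>) -> u32 {
    per_locus
        .iter()
        .filter(|(locus, _)| !flagged.contains(locus))
        .map(|(_, d)| *d)
        .sum()
}

fn main() {
    // Hypothetical per-locus distances for one pair of samples.
    let per_locus = [("locusA", 2), ("locusB", 25), ("locusC", 1)];
    // locusB was flagged as a recombination candidate.
    let flagged: HashSet<&str> = ["locusB"].iter().copied().collect();

    let raw: u32 = per_locus.iter().map(|(_, d)| *d).sum();
    let corrected = corrected_distance(&per_locus, &flagged);
    println!("raw = {}, corrected = {}", raw, corrected);
}
```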
Parameters
Tool 1 (Built-in):
- --candidate-recombination-threshold: SNPs + InDel bases threshold (default: 20). Legacy alias --recombination-threshold is also accepted.
- --candidate-recombination-log: output flagging log path. Legacy alias --recombination-log is also accepted.
Tool 2 (Recombination-Candidate Analyzer):
- --threshold: Mutation density percentage (default: 3.0%)
- --candidate-recombination-log: output flagging log path. Legacy alias --recombination-log is also accepted.
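The two tools use different kinds of threshold: an absolute count of SNPs plus InDel bases for Tool 1 and a percentage density for Tool 2. The sketch below contrasts them under two assumptions that this document does not confirm: density is computed as (SNPs + InDel bases) divided by the aligned length, and a locus is flagged when the value is strictly greater than the threshold. All names are hypothetical.

```rust
/// Per-locus mutation counts for one pair of alleles.
struct LocusComparison {
    snps: u32,
    indel_bases: u32,
    aligned_length: u32,
}

impl LocusComparison {
    /// Absolute criterion (Tool 1 style): SNPs plus InDel bases.
    fn mutation_count(&self) -> u32 {
        self.snps + self.indel_bases
    }

    /// Relative criterion (Tool 2 style): mutations per 100 aligned bases.
    fn density_percent(&self) -> f64 {
        if self.aligned_length == 0 {
            return 0.0;
        }
        100.0 * f64::from(self.mutation_count()) / f64::from(self.aligned_length)
    }
}

/// Assumed comparison: strictly greater than the threshold.
fn flagged_by_count(c: &LocusComparison, threshold: u32) -> bool {
    c.mutation_count() > threshold
}

fn flagged_by_density(c: &LocusComparison, threshold_percent: f64) -> bool {
    c.density_percent() > threshold_percent
}

fn main() {
    // Hypothetical locus: 24 mutated bases over a 1200 bp alignment.
    let c = LocusComparison { snps: 18, indel_bases: 6, aligned_length: 1200 };
    println!("count = {}, density = {:.1}%", c.mutation_count(), c.density_percent());
    println!("flagged at count 20: {}", flagged_by_count(&c, 20));
    println!("flagged at density 3.0%: {}", flagged_by_density(&c, 3.0));
}
```

In this example the locus exceeds the default count threshold but not the default density threshold, which shows why the two screens can disagree on long loci.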
Complete Workflow Example
# Option A: Quick flagging during distance calculation
# Option B: Advanced flagging with corrected distances
# Step 1: Create enriched cache
# Step 2: Flag and correct
Interpretation Guidelines
- High SNP Density: > 3% flags a locus as a recombination candidate (confirm with phylogeny-aware tools)
- High Indel Events: May indicate mobile genetic elements; warrants downstream inspection
- Pairwise Patterns: Multiple flagged loci between the same sample pair suggest related strains
- Hamming Filtering: Ensures focus on epidemiologically relevant comparisons
Performance Considerations
- Memory Usage: ~4-8GB for typical bacterial datasets (1000+ samples)
- Processing Time: 2-5 minutes for 21M cache entries on modern hardware
- Scalability: Linear with cache size, efficient for large epidemiological studies
Scientific Applications
- Outbreak Investigation: Flag candidate recombination loci in transmission chains for downstream confirmation
- Evolutionary Analysis: Identify candidate horizontal gene transfer events
- Food Safety: Screen for recombination signatures in foodborne pathogens
- Antimicrobial Resistance: Flag candidate resistance gene transfer events
- Population Genomics: Identify loci that may bias clonal-frame distance estimates
Cache Inspector
The inspector tool provides detailed analysis of cgDist cache files, including validation, statistics, and compatibility checks.
Building the Inspector
Basic Usage
# Show cache summary
# Detailed information including all loci
# Show entries for specific locus
# Validate cache integrity
# Export cache summary to TSV
# Check top N loci by entry count
Advanced Features
# Detect alignment mode from parameters
# Check compatibility with specific alignment parameters
# Quiet mode for scripting
Use Cases
- Cache Validation: Verify cache file integrity before reuse
- Troubleshooting: Diagnose cache compatibility issues
- Statistics: Understand cache size and loci distribution
- Auditing: Track which alignment parameters were used
- Quality Control: Ensure cache matches expected schema
Custom Hashers Plugin System
cgDist provides a powerful plugin architecture for implementing custom hashing algorithms. This is particularly useful for specialized applications or compatibility with other tools.
Implementing a Custom Hasher
Create a new hasher by implementing the AlleleHasher trait. The project's example is a simple nucleotide composition hasher.
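As a rough sketch of what a composition hasher could look like, the block below defines a stand-in trait and one implementation. The trait shape shown here (method names `name` and `hash_sequence`, the `Result<String, String>` return type) is an assumption for illustration only; the real AlleleHasher trait in cgDist may declare different methods and types.

```rust
/// Hypothetical stand-in for cgDist's AlleleHasher trait; the real trait's
/// methods and signatures may differ.
trait AlleleHasher {
    fn name(&self) -> &str;
    fn hash_sequence(&self, sequence: &str) -> Result<String, String>;
}

/// Hashes a sequence to its nucleotide counts, e.g. "A3T3G3C3".
struct CompositionHasher;

impl AlleleHasher for CompositionHasher {
    fn name(&self) -> &str {
        "composition"
    }

    fn hash_sequence(&self, sequence: &str) -> Result<String, String> {
        let (mut a, mut t, mut g, mut c) = (0u32, 0u32, 0u32, 0u32);
        for base in sequence.chars() {
            match base.to_ascii_uppercase() {
                'A' => a += 1,
                'T' => t += 1,
                'G' => g += 1,
                'C' => c += 1,
                other => return Err(format!("unexpected character '{}'", other)),
            }
        }
        Ok(format!("A{}T{}G{}C{}", a, t, g, c))
    }
}

fn main() {
    let hasher = CompositionHasher;
    for seq in ["ATCGATCGATCG", "AAATTTGGGCCC", "ATGCATGCATGC"].iter() {
        println!("{}: {} -> {:?}", hasher.name(), seq, hasher.hash_sequence(seq));
    }
}
```

All three example sequences map to the same value, which illustrates why a composition hash is not collision-resistant: it ignores base order entirely.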
Registering Your Custom Hasher
Custom hashers are registered through HasherRegistry.
Advanced Custom Hasher Examples
1. K-mer Based Hasher
2. Custom Numeric Hasher
Integration with cgdist CLI
To use custom hashers with the cgdist command-line tool, you can:
- Fork and modify: Add your hasher to the registry in src/main.rs
- Configuration file: Load hashers from a configuration file
- Dynamic loading: Use Rust's plugin system (advanced)
Example integration in main.rs:
Use Cases for Custom Hashers
- Legacy Compatibility: Match existing tool formats
- Domain-Specific: Specialized algorithms for specific organisms
- Research: Experimental hashing strategies
- Performance: Optimized for specific hardware or datasets
- Compliance: Meet specific regulatory or institutional requirements
Best Practices
- Deterministic: Ensure same sequence always produces same hash
- Collision-Resistant: Minimize hash collisions for your use case
- Performance: Consider computational overhead
- Validation: Implement robust input validation
- Documentation: Provide clear usage examples and limitations
The plugin architecture makes cgDist highly extensible while maintaining backward compatibility with existing workflows.
Running the Custom Hasher Example
See the complete working example:
# Run the custom hasher demonstration
# Output shows different hashers applied to test sequences:
# cgDist Custom Hasher Examples
# ===================================
#
# Available Hashers:
# • crc32: Fast CRC32 checksum (chewBBACA compatible)
# • composition: Nucleotide composition-based hasher (A/T/G/C counts)
# • kmer3: K-mer composition hasher (sorted k-mers)
# • polynomial: Polynomial rolling hash for sequences
#
# Testing hasher: composition
# Description: Nucleotide composition-based hasher (A/T/G/C counts)
# ATCGATCGATCG → A3T3G3C3
# AAATTTGGGCCC → A3T3G3C3
# ATGCATGCATGC → A3T3G3C3
This example demonstrates practical implementation patterns for:
- Composition-based hashing: Count nucleotide frequencies
- K-mer analysis: Extract and sort sequence k-mers
- Polynomial hashing: Mathematical sequence encoding
- Error handling: Validation and missing data management
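The k-mer and polynomial patterns listed above can be sketched with the standard library alone. The functions below are illustrative assumptions rather than the code shipped in the example: the base-to-number mapping, the multiplier 31, the comma-joined output, and the function names are all hypothetical choices.

```rust
/// Collect all overlapping k-mers of a sequence, sort them, and join them.
/// Returns None for k = 0, sequences shorter than k, or non-ASCII input.
fn sorted_kmers(sequence: &str, k: usize) -> Option<String> {
    if k == 0 || sequence.len() < k || !sequence.is_ascii() {
        return None;
    }
    let mut kmers: Vec<&str> = (0..=sequence.len() - k)
        .map(|i| &sequence[i..i + k])
        .collect();
    kmers.sort_unstable();
    Some(kmers.join(","))
}

/// Polynomial rolling hash: h = h * BASE + code(base), with wrapping arithmetic.
fn polynomial_hash(sequence: &str) -> Option<u64> {
    const BASE: u64 = 31;
    let mut hash: u64 = 0;
    for base in sequence.chars() {
        let code = match base.to_ascii_uppercase() {
            'A' => 1,
            'C' => 2,
            'G' => 3,
            'T' => 4,
            _ => return None, // reject anything outside ACGT
        };
        hash = hash.wrapping_mul(BASE).wrapping_add(code);
    }
    Some(hash)
}

fn main() {
    println!("{:?}", sorted_kmers("ATCGA", 3));
    println!("{:?}", polynomial_hash("ACGT"));
    println!("{:?}", polynomial_hash("ACNT"));
}
```

Unlike a composition hash, the polynomial hash depends on base order, so reordered sequences generally produce different values.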
API Documentation
Rust API
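The Rust API follows a builder pattern: a configuration value is created and customized through its hasher, threads, and cache_enabled setters; a calculator is then constructed with that custom configuration, and its calculate_from_file method returns the distances, with errors propagated through the ? operator.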
Python Integration
cgdist can be driven from Python in three steps: run the cgdist binary, check for errors, and load the results.
Citation
If you use cgDist in your research, please cite our preprint:
de Ruvo, A.; Castelli, P.; Bucciacchio, A.; Mangone, I.; Mixao, V.; Borges, V.; Radomski, N.; Di Pasquale, A. (2025). cgDist: An Enhanced Algorithm for Efficient Calculation of pairwise SNP and InDel differences from Core Genome Multilocus Sequence Typing. bioRxiv. DOI: 10.1101/2025.10.16.682749
Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: a.deruvo@izs.it
License
This project is licensed under the MIT License - see the LICENSE file for details.
Made with ❤️ for the bioinformatics community