cgdist 0.1.1

Ultra-fast SNP/indel-level distance calculator for core genome MLST analysis
# cgDist Validation Test Suite

This directory contains a comprehensive validation test suite for cgDist that demonstrates the algorithm's correctness in calculating SNPs, InDel events, and InDel bases from controlled sequences.

## Purpose

This validation suite serves to:
1. **Verify algorithmic correctness**: Ensure cgDist accurately counts mutations at the nucleotide level
2. **Validate Parasail integration**: Confirm sequence alignment produces expected results
3. **Check mathematical invariants**: Ensure the relationship cgDist ≥ Hamming distance holds for every pair of samples
4. **Provide scientific confidence**: Give the research community controlled test cases with known answers

## Test Design

### Controlled Schema
The test uses 3 loci with carefully designed allelic variants:

- **Locus 1**: Tests various SNP patterns and InDel types
- **Locus 2**: Tests deletions and insertions of different lengths  
- **Locus 3**: Tests complex patterns and large InDels

### Sample Profiles
10 test samples with specific mutation patterns:
- `Sample_Ref`: All reference alleles (baseline)
- `Sample_Identical`: Identical to reference (distance = 0)
- `Sample_SNPs_Only`: Only SNPs (4 total: 1+1+2 across loci)
- `Sample_Dels_Only`: Only deletions (4 total bases deleted)
- `Sample_Ins_Only`: Only insertions (7 total bases inserted)
- `Sample_Mixed*`: Complex combinations
- `Sample_Large_*`: Large InDels for stress testing

### Distance Modes Tested
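- **Hamming** (`--mode hamming`): Traditional allelic differences, one per locus whose alleles differ
- **SNPs** (`--mode snps`): Single nucleotide polymorphisms only
- **SNPs + InDel events** (`--mode snps-indel-contiguous`): SNPs plus the number of InDel events, where each contiguous run of gaps counts once
- **SNPs + InDel bases** (`--mode snps-indel-bases`): SNPs plus the total number of inserted or deleted bases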
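
The difference between these counts is easiest to see on a single aligned pair. The sketch below is illustrative only: it works on two sequences that are already aligned with `-` as the gap character, whereas cgDist obtains its alignments from Parasail. The function name `count_mutations` and the example sequences are hypothetical.

```python
def count_mutations(aligned_a: str, aligned_b: str, gap: str = "-") -> dict:
    """Count SNPs, InDel events, and InDel bases in a gapped pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    snps = indel_events = indel_bases = 0
    prev_gap_side = None  # which sequence carried the gap in the previous column

    for a, b in zip(aligned_a, aligned_b):
        if a == gap and b == gap:
            continue  # gap-vs-gap columns carry no information; skip them
        if a == gap or b == gap:
            side = "a" if a == gap else "b"
            indel_bases += 1              # every gapped column is one InDel base
            if side != prev_gap_side:
                indel_events += 1         # a new contiguous gap run starts here
            prev_gap_side = side
        else:
            if a != b:
                snps += 1                 # mismatch between two real bases
            prev_gap_side = None

    return {"snps": snps, "indel_events": indel_events, "indel_bases": indel_bases}


# One substitution (G/C) and one two-base deletion.
stats = count_mutations("ACGTTAGC", "ACCT--GC")
print(stats)
```

For this pair the SNP count is 1, the event-based total is 1 + 1 = 2, and the base-based total is 1 + 2 = 3, while the Hamming contribution of the locus is 1 because the two alleles simply differ.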

## Key Validation Results ✅

### Test Cases
1. **Identical sequences**: Distance = 0 in all modes
2. **SNPs only**: Hamming=3, SNPs=4, Events=4, Bases=4
3. **Deletions only**: Hamming=3, SNPs=3 (fallback), Events=3, Bases=4
4. **Insertions only**: Hamming=3, SNPs=3 (fallback), Events=3, Bases=7
5. **Mathematical invariant**: cgDist ≥ Hamming for all pairs
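
The invariant in item 5 can be checked mechanically by comparing two distance matrices cell by cell. The following is a minimal sketch, not the logic of `run_validation.py`. It assumes that each matrix is a square TSV with a header row of sample names, one row label per line, and integer distances; the helper names are hypothetical. The example values (Hamming=3, Bases=4) are taken from test case 3 and are assumed to be measured against `Sample_Ref`.

```python
def parse_matrix(tsv_text: str) -> dict:
    """Parse a square TSV distance matrix into {(row_sample, col_sample): distance}."""
    rows = [line.split("\t") for line in tsv_text.splitlines() if line.strip()]
    columns = rows[0][1:]  # first header cell is the corner label (may be empty)
    return {
        (row[0], col): int(value)
        for row in rows[1:]
        for col, value in zip(columns, row[1:])
    }


def invariant_violations(hamming: dict, cgdist: dict) -> list:
    """Return the sample pairs where the cgDist value is below the Hamming value."""
    return sorted(pair for pair, h in hamming.items() if cgdist[pair] < h)


def to_tsv(rows: list) -> str:
    """Join a list of rows into tab-separated text (used only to build the example)."""
    return "\n".join("\t".join(row) for row in rows)


samples = ["Sample_Ref", "Sample_Dels_Only"]
hamming_tsv = to_tsv([[""] + samples, ["Sample_Ref", "0", "3"], ["Sample_Dels_Only", "3", "0"]])
bases_tsv = to_tsv([[""] + samples, ["Sample_Ref", "0", "4"], ["Sample_Dels_Only", "4", "0"]])

violations = invariant_violations(parse_matrix(hamming_tsv), parse_matrix(bases_tsv))
print(violations)  # an empty list means the invariant holds for every pair
```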

### Key Insights
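- **Hamming fallback**: When the alleles at a locus differ only by InDels (SNPs=0), `snps` mode run with `--hamming-fallback` counts that locus as 1, so the SNP distance does not drop below the Hamming distance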
- **Event vs Base counting**: Algorithm correctly distinguishes between number of InDel events and total bases affected
- **Parasail alignment**: Global alignment produces expected mutation counts
- **Cache efficiency**: Unified cache stores all statistics for rapid mode switching
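
The fallback rule and the event/base distinction can be summarized as a per-locus function over cached alignment statistics. This is a sketch under stated assumptions rather than cgDist's actual code: `LocusStats` and `locus_distance` are hypothetical names, the mode strings mirror the `--mode` values used in the commands of this README, and the per-locus split for the deletions-only sample is invented so that it adds up to the documented totals (3 events, 4 bases).

```python
from typing import NamedTuple


class LocusStats(NamedTuple):
    """Alignment statistics for one locus whose two alleles differ."""
    snps: int
    indel_events: int
    indel_bases: int


def locus_distance(stats: LocusStats, mode: str, hamming_fallback: bool = False) -> int:
    """Distance contributed by one differing locus under the given mode."""
    if mode == "hamming":
        return 1
    if mode == "snps":
        # Fallback: alleles differ only by InDels, so count the locus once.
        if stats.snps == 0 and hamming_fallback:
            return 1
        return stats.snps
    if mode == "snps-indel-contiguous":
        return stats.snps + stats.indel_events
    if mode == "snps-indel-bases":
        return stats.snps + stats.indel_bases
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical per-locus breakdown for a deletions-only sample vs the reference.
dels_only = [LocusStats(0, 1, 1), LocusStats(0, 1, 1), LocusStats(0, 1, 2)]

totals = {
    "hamming": sum(locus_distance(s, "hamming") for s in dels_only),
    "snps": sum(locus_distance(s, "snps", hamming_fallback=True) for s in dels_only),
    "events": sum(locus_distance(s, "snps-indel-contiguous") for s in dels_only),
    "bases": sum(locus_distance(s, "snps-indel-bases") for s in dels_only),
}
print(totals)
```

Because all four values are derived from the same three numbers per locus, storing those numbers once is enough to answer any mode, which is the idea behind the unified cache described above.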

## Files in this Directory

### Core Test Files
- `setup_validation_test.py`: Generates CRC32-based schema and profiles
- `run_validation.py`: Main validation script with correct expected values
- `validate_cache.py`: Validates cache integrity and consistency
- `test_new_features.py`: Tests cache-only mode and recombination detection
- `schema_crc32/`: FASTA files with controlled sequences
- `profiles/test_profiles_crc32.tsv`: Sample-to-allele mappings
- `results/`: Output distance matrices from cgDist

### Documentation
- `VALIDATION_SUMMARY.md`: Summary of validation results
- `README.md`: This documentation

## Running the Validation

There are two ways to run the validation suite from a fresh clone:

- **Quick smoke test (recommended for first-time users)**: the input fixture
  `profiles/test_profiles_crc32.tsv` and the schema FASTA files are
  committed to the repository, so you can skip the setup step and go
  straight to running `cgdist`. Use this path if you just want to verify
  that the tool installs and runs correctly.
- **Regenerate from scratch**: run `setup_validation_test.py` to
  regenerate the input fixture, the FASTA schema, and the
  `EXPECTED_RESULTS_CRC32.md` documentation. Use this path if you want
  to verify that the test generator itself is reproducible, or if you
  modify the test scenarios. The regenerated fixture is byte-identical
  to the committed one (CRC32 hashes are deterministic).
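
The reproducibility claim rests on two properties that are easy to check by hand: CRC32 always yields the same value for the same bytes, and two files can be compared byte for byte. The sketch below uses only the Python standard library. It assumes that an allele identifier is the CRC32 of the raw sequence string, which may differ from what `setup_validation_test.py` does; both function names are hypothetical.

```python
import filecmp
import zlib


def crc32_allele_id(sequence: str) -> int:
    """CRC32 of the sequence bytes (assumed hashing scheme, for illustration only)."""
    return zlib.crc32(sequence.encode("ascii"))


def fixtures_identical(committed_path: str, regenerated_path: str) -> bool:
    """Compare two files byte for byte rather than by size and timestamp."""
    return filecmp.cmp(committed_path, regenerated_path, shallow=False)


# The same input always hashes to the same identifier.
print(crc32_allele_id("ATGCATGC") == crc32_allele_id("ATGCATGC"))
```

To use `fixtures_identical`, copy the committed `profiles/test_profiles_crc32.tsv` to a temporary location before regenerating, then compare the copy with the regenerated file.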

### Step 1: Build cgDist
```bash
# Starting from validation_test/, move up to the crate root and build
cd ..
RUSTFLAGS="-C target-cpu=native" cargo build --release
```

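### Step 2: Generate Test Data (optional; fixture already committed)
```bash
# Required even for the quick smoke test: Steps 3-6 run from this directory
cd validation_test
# Optional: regenerate the fixture, schema, and expected-results file
python3 setup_validation_test.py
```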

### Step 3: Run cgDist Tests
```bash
# Test all distance modes
../target/release/cgdist --schema schema_crc32 --profiles profiles/test_profiles_crc32.tsv --output results/crc32_hamming.tsv --mode hamming --hasher-type crc32

../target/release/cgdist --schema schema_crc32 --profiles profiles/test_profiles_crc32.tsv --output results/crc32_snps.tsv --mode snps --hasher-type crc32 --hamming-fallback

../target/release/cgdist --schema schema_crc32 --profiles profiles/test_profiles_crc32.tsv --output results/crc32_snps_indel_contiguous.tsv --mode snps-indel-contiguous --hasher-type crc32

../target/release/cgdist --schema schema_crc32 --profiles profiles/test_profiles_crc32.tsv --output results/crc32_snps_indel_bases.tsv --mode snps-indel-bases --hasher-type crc32

# Test alignment saving with gaps
../target/release/cgdist --schema schema_crc32 --profiles profiles/test_profiles_crc32.tsv --output results/test_alignments.tsv --mode snps-indel-bases --hasher-type crc32 --save-alignments results/alignments_with_gaps.tsv --force-recompute
```
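
The four mode runs above differ only in the `--mode` value, the output file name, and the `--hamming-fallback` flag on the `snps` run, so they can be generated in a loop. This is an optional convenience sketch, not part of the suite; the function names are hypothetical, and every flag and path is copied from the commands above.

```python
import subprocess

CGDIST = "../target/release/cgdist"
MODES = ["hamming", "snps", "snps-indel-contiguous", "snps-indel-bases"]


def build_command(mode: str) -> list:
    """Build the cgdist argument list for one distance mode."""
    output = f"results/crc32_{mode.replace('-', '_')}.tsv"
    cmd = [
        CGDIST,
        "--schema", "schema_crc32",
        "--profiles", "profiles/test_profiles_crc32.tsv",
        "--output", output,
        "--mode", mode,
        "--hasher-type", "crc32",
    ]
    if mode == "snps":
        cmd.append("--hamming-fallback")  # only the snps run uses the fallback
    return cmd


def run_all_modes() -> None:
    """Run cgdist once per mode; raises CalledProcessError if a run fails."""
    for mode in MODES:
        subprocess.run(build_command(mode), check=True)
```

Call `run_all_modes()` from inside `validation_test/` after the build step so that the relative paths resolve.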

### Step 4: Validate Results
```bash
python3 run_validation.py
```

### Step 5: Validate Cache Integrity
```bash
# Generate cache file
../target/release/cgdist --schema schema_crc32 --profiles profiles/test_profiles_crc32.tsv --output results/cache_test.tsv --mode snps-indel-bases --hasher-type crc32 --cache-file results/validation_cache.lz4 --cache-note "Validation test cache" --force-recompute

# Validate cache consistency
python3 validate_cache.py
```

### Step 6: Test New Features (Cache-Only & Recombination Detection)
```bash
# Test cache-only mode and recombination detection
python3 test_new_features.py
```

Expected output (the final summary lines printed by Steps 4, 5, and 6, in that order):
```
🎉 ALL VALIDATION TESTS PASSED!
✅ cgDist is correctly calculating SNPs, InDel events, and InDel bases
✅ Mathematical invariants are maintained
✅ Parasail alignment integration is working correctly

🎉 ALL CACHE VALIDATION TESTS PASSED!
✅ Cache is consistent across all distance modes
✅ Cache metadata is correct and complete
✅ Cache provides expected performance benefits

🎉 ALL NEW FEATURE TESTS PASSED!
✅ Cache-only mode working correctly
✅ Recombination detection functional
✅ CSV output format validated
```

## New Features (2025-09-02)

### Cache-Only Mode
- **Purpose**: Pre-compute alignments without generating distance matrix
- **Usage**: `--cache-only --cache-file cache.lz4`
- **Benefit**: Separate alignment computation from distance matrix generation for large datasets

### Recombination Detection
- **Purpose**: Identify potential recombination events between alleles
- **Usage**: `--recombination-log events.csv --recombination-threshold 20`
- **Default threshold**: 20 SNPs+InDel bases (based on 3-4% divergence for typical MLST loci)
- **Output**: CSV log with locus, sample pairs, divergence percentages, and sequence lengths
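
The relationship between the absolute threshold and the quoted divergence range can be worked through with simple arithmetic. The sketch below is an illustration under assumptions, not cgDist's implementation: it assumes divergence is the mutation count divided by a sequence length, and that a pair is flagged when the count reaches the threshold (whether the real comparison is inclusive is not stated here). Both function names are hypothetical.

```python
def divergence_percent(snps: int, indel_bases: int, length: int) -> float:
    """Mutations (SNPs + InDel bases) as a percentage of the sequence length."""
    if length <= 0:
        raise ValueError("length must be positive")
    return 100.0 * (snps + indel_bases) / length


def exceeds_threshold(snps: int, indel_bases: int, threshold: int = 20) -> bool:
    """Flag an allele pair whose mutation count reaches the threshold (assumed inclusive)."""
    return snps + indel_bases >= threshold


# 20 mutations on a 500 bp locus is 4% divergence; on a 640 bp locus it is about 3%.
for length in (500, 640):
    print(length, round(divergence_percent(20, 0, length), 2))
```

Under these assumptions, a threshold of 20 corresponds to the stated 3-4% range for loci of roughly 500 to 670 bp; shorter or longer loci would warrant a different `--recombination-threshold` value.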

## Scientific Significance

This validation demonstrates that:

1. **cgDist accurately counts mutations**: The algorithm correctly distinguishes between SNPs, InDel events, and InDel bases
2. **Parasail integration works correctly**: Global sequence alignment produces biologically meaningful results
3. **Mathematical properties are preserved**: The fundamental ordering relationship (cgDist ≥ Hamming) is maintained
4. **Cache architecture is sound**: Unified cache enables rapid switching between distance modes without recomputation
5. **Recombination detection is functional**: Potentially recombined alleles that may skew phylogenetic analysis can be identified and logged

## For the Scientific Community

This validation suite provides:
- **Reproducible test cases** with known ground truth
- **Transparent methodology** for verifying algorithmic correctness
- **Confidence in results** through controlled, testable scenarios
- **Foundation for further testing** with organism-specific data

The successful validation supports the use of cgDist for epidemiological analysis, outbreak investigation, and phylogenetic studies that require nucleotide-level resolution.

---

**Note**: This validation uses synthetic sequences for controlled testing. For organism-specific validation, consider testing with known outbreak datasets where epidemiological relationships are well-established.