determine_splitters

Function determine_splitters 

pub fn determine_splitters(
    contigs: &[Contig],
    k: usize,
    segment_size: usize,
) -> (AHashSet<u64>, AHashSet<u64>, AHashSet<u64>)

Builds the splitter set from the reference contigs.

This follows the three-pass algorithm of the C++ AGC implementation:

  1. Find all singleton k-mers in the reference (the candidates).
  2. Scan the reference to find which candidates are actually used as splitters.
  3. Return only the splitters that are actually used.

This ensures that all genomes are split at the same positions.
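
The passes above can be illustrated with a minimal sketch. It is not this crate's implementation and makes several simplifying assumptions: `sketch_splitters` and `kmers` are hypothetical names, `Vec<u8>` of ASCII bases stands in for `Contig`, the standard library `HashSet` stands in for `AHashSet`, k-mers are encoded forward-only at 2 bits per base with no canonicalization, and a candidate counts as "actually used" when it is the first singleton k-mer seen once the current segment has reached `segment_size` bases.

```rust
use std::collections::{HashMap, HashSet};

/// Encode every k-mer of `seq` as a 2-bit-packed u64 (A=0, C=1, G=2, T=3).
/// Returns (start_position, code) pairs; windows with other bytes are skipped.
fn kmers(seq: &[u8], k: usize) -> Vec<(usize, u64)> {
    assert!(k >= 1 && k <= 32, "k must fit in a u64 at 2 bits per base");
    let mut out = Vec::new();
    if seq.len() < k {
        return out;
    }
    for start in 0..=seq.len() - k {
        let mut code = 0u64;
        let mut valid = true;
        for &b in &seq[start..start + k] {
            let bits = match b {
                b'A' | b'a' => 0,
                b'C' | b'c' => 1,
                b'G' | b'g' => 2,
                b'T' | b't' => 3,
                _ => {
                    valid = false;
                    break;
                }
            };
            code = (code << 2) | bits;
        }
        if valid {
            out.push((start, code));
        }
    }
    out
}

/// Hypothetical stand-in for `determine_splitters`; returns
/// (splitters, singletons, duplicates).
fn sketch_splitters(
    contigs: &[Vec<u8>],
    k: usize,
    segment_size: usize,
) -> (HashSet<u64>, HashSet<u64>, HashSet<u64>) {
    // Pass 1: count every k-mer, then separate singletons from duplicates.
    let mut counts: HashMap<u64, usize> = HashMap::new();
    for contig in contigs {
        for (_, code) in kmers(contig, k) {
            *counts.entry(code).or_insert(0) += 1;
        }
    }
    let mut singletons = HashSet::new();
    let mut duplicates = HashSet::new();
    for (code, n) in counts {
        if n == 1 {
            singletons.insert(code);
        } else {
            duplicates.insert(code);
        }
    }

    // Pass 2: rescan the reference. Once the current segment is at least
    // `segment_size` bases long, the next singleton k-mer becomes a splitter
    // and a new segment starts right after it (assumed rule, see lead-in).
    let mut splitters = HashSet::new();
    for contig in contigs {
        let mut segment_start = 0usize;
        for (pos, code) in kmers(contig, k) {
            let end = pos + k;
            if end - segment_start >= segment_size && singletons.contains(&code) {
                splitters.insert(code);
                segment_start = end;
            }
        }
    }

    // Pass 3: return only the splitters that were actually used.
    (splitters, singletons, duplicates)
}
```

Under these assumptions, every returned splitter is also a singleton, and the singleton and duplicate sets never overlap.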

§Arguments

  • contigs - Slice of reference contigs
  • k - K-mer length
  • segment_size - Minimum segment size

§Returns

Tuple of (splitters, singletons, duplicates), each an AHashSet<u64>:

  • splitters: Actually-used splitter k-mers (for segmentation)
  • singletons: All singleton k-mers from reference (for adaptive mode exclusion)
  • duplicates: All duplicate k-mers from reference (for adaptive mode exclusion)