determine_splitters_streaming

Function determine_splitters_streaming 

pub fn determine_splitters_streaming(
    fasta_path: &Path,
    k: usize,
    segment_size: usize,
) -> Result<(AHashSet<u64>, AHashSet<u64>, AHashSet<u64>)>

Build a splitter set by streaming through a FASTA file (memory-efficient!)

This matches the C++ AGC approach, but streams the file twice instead of loading all contigs into memory. For a yeast genome (about 12 MB of sequence):

  • Peak memory when streaming: ~100MB (a Vec of ~12M k-mers, 8 bytes per u64 k-mer)
  • Peak memory when loading all contigs: ~2.8GB
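
The two-pass idea can be illustrated with a minimal sketch. It is not this crate's implementation: `encode_kmers` and `sketch_splitters` are hypothetical names, `std::collections::HashSet` stands in for `AHashSet`, in-memory byte slices stand in for the two reads of the FASTA file, reverse complements are ignored, and the rule for choosing a splitter (the first singleton k-mer once the current segment has reached `segment_size` bases) is an assumption.

```rust
use std::collections::HashSet;

/// Hypothetical helper: 2-bit-pack every k-mer of `seq` into a u64 and return
/// it with the position just past its last base. Requires 1 <= k <= 32.
fn encode_kmers(seq: &[u8], k: usize) -> Vec<(usize, u64)> {
    assert!((1..=32).contains(&k), "k must fit in a u64");
    let mask = if k == 32 { u64::MAX } else { (1u64 << (2 * k)) - 1 };
    let (mut kmer, mut valid) = (0u64, 0usize);
    let mut out = Vec::new();
    for (i, &base) in seq.iter().enumerate() {
        let code: u64 = match base.to_ascii_uppercase() {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => {
                // Ambiguous base (e.g. N): restart the window.
                valid = 0;
                continue;
            }
        };
        kmer = ((kmer << 2) | code) & mask;
        valid += 1;
        if valid >= k {
            out.push((i + 1, kmer));
        }
    }
    out
}

/// Hypothetical sketch returning (splitters, singletons, duplicates).
fn sketch_splitters(
    contigs: &[&[u8]],
    k: usize,
    segment_size: usize,
) -> (HashSet<u64>, HashSet<u64>, HashSet<u64>) {
    // Pass 1: gather every k-mer into one Vec, sort it, and classify each
    // distinct value by how many times it occurs.
    let mut all: Vec<u64> = contigs
        .iter()
        .flat_map(|c| encode_kmers(c, k))
        .map(|(_, kmer)| kmer)
        .collect();
    all.sort_unstable();

    let (mut singletons, mut duplicates) = (HashSet::new(), HashSet::new());
    let mut i = 0;
    while i < all.len() {
        let mut j = i + 1;
        while j < all.len() && all[j] == all[i] {
            j += 1;
        }
        if j - i == 1 {
            singletons.insert(all[i]);
        } else {
            duplicates.insert(all[i]);
        }
        i = j;
    }
    drop(all); // the large Vec is no longer needed for the second pass

    // Pass 2: walk each contig again and cut a segment at the first
    // singleton k-mer found once `segment_size` bases have accumulated.
    let mut splitters = HashSet::new();
    for contig in contigs {
        let mut segment_start = 0;
        for (end, kmer) in encode_kmers(contig, k) {
            if end - segment_start >= segment_size && singletons.contains(&kmer) {
                splitters.insert(kmer);
                segment_start = end;
            }
        }
    }
    (splitters, singletons, duplicates)
}
```

In this sketch only the sorted k-mer Vec and the resulting sets are held at once, which is where the ~8 bytes per k-mer figure comes from; the contig sequences themselves are only visited one at a time.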

§Arguments

  • fasta_path - Path to reference FASTA file (can be gzipped)
  • k - K-mer length
  • segment_size - Minimum segment size

§Returns

A Result wrapping a tuple of three AHashSet<u64> sets: (splitters, singletons, duplicates)