ragc-core 0.1.1

Core compression and decompression algorithms for the AGC genome compression format
Documentation

Core compression and decompression algorithms for the AGC genome compression format.

This crate implements the complete AGC compression pipeline with full C++ AGC format compatibility. Archives created by this library can be read by the C++ implementation and vice versa.

Features

  • Compression - Create AGC archives from FASTA files
  • Decompression - Extract genomes from AGC archives
  • C++ Compatibility - Bidirectional format interoperability
  • Multi-sample support - Handle multiple genomes in one archive
  • LZ differential encoding - Efficient encoding against reference sequences
  • ZSTD compression - High-ratio compression of segments

Examples

Compressing genomes

use ragc_core::{Compressor, CompressorConfig};
use std::path::Path;

# fn main() -> Result<(), Box<dyn std::error::Error>> {
// Create a compressor
let config = CompressorConfig::default();
let mut compressor = Compressor::new("output.agc", config)?;

// Add FASTA files
compressor.add_fasta_file("sample1", Path::new("genome1.fasta"))?;
compressor.add_fasta_file("sample2", Path::new("genome2.fasta"))?;

// Finalize the archive
compressor.finalize()?;
# Ok(())
# }

Decompressing genomes

use ragc_core::{Decompressor, DecompressorConfig};

# fn main() -> Result<(), Box<dyn std::error::Error>> {
// Open an archive
let config = DecompressorConfig::default();
let mut decompressor = Decompressor::open("archive.agc", config)?;

// List available samples
let samples = decompressor.list_samples();
println!("Found {} samples", samples.len());

// Extract a sample
let contigs = decompressor.get_sample("sample1")?;
for (name, sequence) in contigs {
    println!(">{}",  name);
    // sequence is Vec<u8> with numeric encoding (A=0, C=1, G=2, T=3)
}
# Ok(())
# }

Working with k-mers

use ragc_core::{Kmer, KmerMode};

// Create a canonical k-mer
let mut kmer = Kmer::new(21, KmerMode::Canonical);

// Insert bases (0=A, 1=C, 2=G, 3=T)
kmer.insert(0); // A
kmer.insert(1); // C
kmer.insert(2); // G

if kmer.is_full() {
    let value = kmer.data();
    println!("K-mer value: {}", value);
}

Custom compression settings

use ragc_core::CompressorConfig;

let config = CompressorConfig {
    kmer_length: 25,        // Use 25-mers instead of default 21
    segment_size: 2000,     // Larger segments
    min_match_len: 20,      // Minimum LZ match length
    verbosity: 2,           // More verbose output
};

Archive Format

The AGC format organizes data into streams:

  • file_type_info - Version and producer metadata
  • params - Compression parameters (k-mer length, segment size)
  • splitters - Singleton k-mers used for segmentation (future)
  • seg-NN or seg_dNN - Compressed genome segments
  • collection - Sample and contig metadata

Compatibility

This implementation is tested for compatibility with C++ AGC:

  • Archives created by ragc can be read by C++ AGC
  • Archives created by C++ AGC can be read by ragc
  • Format version 3.0 support
  • SHA256-verified roundtrip testing