Core compression and decompression algorithms for the AGC genome compression format.
This crate implements the complete AGC compression pipeline with full C++ AGC format compatibility. Archives created by this library can be read by the C++ implementation and vice versa.
Features
- Compression - Create AGC archives from FASTA files
- Decompression - Extract genomes from AGC archives
- C++ Compatibility - Bidirectional format interoperability
- Multi-sample support - Handle multiple genomes in one archive
- LZ differential encoding - Efficient encoding against reference sequences
- ZSTD compression - High-ratio compression of segments
Examples
Compressing genomes
use ragc_core::{Compressor, CompressorConfig};
use std::path::Path;
# fn main() -> Result<(), Box<dyn std::error::Error>> {
// Create a compressor
let config = CompressorConfig::default();
let mut compressor = Compressor::new("output.agc", config)?;
// Add FASTA files
compressor.add_fasta_file("sample1", Path::new("genome1.fasta"))?;
compressor.add_fasta_file("sample2", Path::new("genome2.fasta"))?;
// Finalize the archive
compressor.finalize()?;
# Ok(())
# }
Decompressing genomes
use ragc_core::{Decompressor, DecompressorConfig};
# fn main() -> Result<(), Box<dyn std::error::Error>> {
// Open an archive
let config = DecompressorConfig::default();
let mut decompressor = Decompressor::open("archive.agc", config)?;
// List available samples
let samples = decompressor.list_samples();
println!("Found {} samples", samples.len());
// Extract a sample
let contigs = decompressor.get_sample("sample1")?;
for (name, sequence) in contigs {
    println!(">{}", name);
    // sequence is Vec<u8> with numeric encoding (A=0, C=1, G=2, T=3)
}
# Ok(())
# }
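Because extracted sequences use the numeric encoding noted in the comment above (A=0, C=1, G=2, T=3), they usually need to be mapped back to letters before being written as FASTA. The following is a minimal sketch of such a conversion. `decode_bases` is a hypothetical helper, not part of the `ragc_core` API, and rendering any code outside 0..=3 as `N` is an assumption of this sketch.

```rust
/// Map numeric base codes back to ASCII nucleotides.
/// Codes 0..=3 follow the A=0, C=1, G=2, T=3 convention.
/// Any other code is rendered as 'N' (an assumption of this sketch,
/// not documented behaviour of the crate).
fn decode_bases(sequence: &[u8]) -> String {
    sequence
        .iter()
        .map(|&code| match code {
            0 => 'A',
            1 => 'C',
            2 => 'G',
            3 => 'T',
            _ => 'N',
        })
        .collect()
}
```

Inside the loop above, `println!("{}", decode_bases(&sequence));` would then print the contig as text.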
Working with k-mers
use ;
// Create a canonical k-mer
let mut kmer = new;
// Insert bases (0=A, 1=C, 2=G, 3=T)
kmer.insert(0); // A
kmer.insert(1); // C
kmer.insert(2); // G
if kmer.is_full
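To illustrate what "canonical" means here, the sketch below packs bases into a `u64` two bits at a time and takes the smaller of the forward k-mer and its reverse complement, which is the common definition of a canonical k-mer. It is a standalone illustration under that assumption. The function names are hypothetical and do not describe the crate's own k-mer type or its internal representation.

```rust
/// Pack 2-bit base codes (A=0, C=1, G=2, T=3) into a u64.
/// The first base ends up in the most significant occupied bits.
/// Only valid for k-mers of at most 32 bases.
fn pack_kmer(bases: &[u8]) -> u64 {
    bases
        .iter()
        .fold(0u64, |acc, &b| (acc << 2) | u64::from(b & 3))
}

/// Reverse complement of a packed k-mer of length `k`.
/// With this encoding the complement of base code b is 3 - b.
fn reverse_complement(kmer: u64, k: usize) -> u64 {
    let mut fwd = kmer;
    let mut rc = 0u64;
    for _ in 0..k {
        // Take the last base of the forward strand, complement it,
        // and append it to the reverse-complement strand.
        rc = (rc << 2) | (3 - (fwd & 3));
        fwd >>= 2;
    }
    rc
}

/// Canonical form: the numerically smaller of the two strands,
/// so a k-mer and its reverse complement map to the same value.
fn canonical_kmer(bases: &[u8]) -> u64 {
    let fwd = pack_kmer(bases);
    fwd.min(reverse_complement(fwd, bases.len()))
}
```

With this definition, `canonical_kmer(&[0, 1, 2])` (ACG) and `canonical_kmer(&[1, 2, 3])` (CGT, its reverse complement) return the same value.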
Custom compression settings
use ragc_core::CompressorConfig;
let config = CompressorConfig {
    kmer_length: 25, // Use 25-mers instead of default 21
    segment_size: 2000, // Larger segments
    min_match_len: 20, // Minimum LZ match length
    verbosity: 2, // More verbose output
};
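To give a feel for what a minimum LZ match length controls, here is a toy sketch of differential encoding against a reference: runs of the target that also occur in the reference are replaced by `(position, length)` pairs, and only runs of at least `min_match_len` bases qualify. This is a simplified illustration, not the encoder used by this crate or by C++ AGC. `Token`, `lz_diff_encode`, and `lz_diff_decode` are hypothetical names introduced for the example.

```rust
use std::collections::HashMap;

/// One unit of the toy encoding: a raw base or a copy from the reference.
#[derive(Debug, PartialEq)]
enum Token {
    Literal(u8),
    Match { ref_pos: usize, len: usize },
}

/// Encode `target` as literals and matches against `reference`.
/// Matches shorter than `min_match_len` are never emitted.
fn lz_diff_encode(reference: &[u8], target: &[u8], min_match_len: usize) -> Vec<Token> {
    assert!(min_match_len > 0, "min_match_len must be positive");

    // Index the first occurrence of every reference window of the seed length.
    let mut index: HashMap<&[u8], usize> = HashMap::new();
    for (pos, window) in reference.windows(min_match_len).enumerate() {
        index.entry(window).or_insert(pos);
    }

    let mut tokens = Vec::new();
    let mut i = 0;
    while i < target.len() {
        // Look up the next seed-length window of the target, if one remains.
        let seed = target.get(i..i + min_match_len).and_then(|w| index.get(w));
        match seed {
            Some(&ref_pos) => {
                // Extend the seed match as far as both sequences agree.
                let mut len = min_match_len;
                while i + len < target.len()
                    && ref_pos + len < reference.len()
                    && target[i + len] == reference[ref_pos + len]
                {
                    len += 1;
                }
                tokens.push(Token::Match { ref_pos, len });
                i += len;
            }
            None => {
                tokens.push(Token::Literal(target[i]));
                i += 1;
            }
        }
    }
    tokens
}

/// Rebuild the target from the reference and the token stream.
fn lz_diff_decode(reference: &[u8], tokens: &[Token]) -> Vec<u8> {
    let mut out = Vec::new();
    for token in tokens {
        match *token {
            Token::Literal(b) => out.push(b),
            Token::Match { ref_pos, len } => {
                out.extend_from_slice(&reference[ref_pos..ref_pos + len])
            }
        }
    }
    out
}
```

In this toy model, raising `min_match_len` suppresses short, likely spurious matches at the cost of emitting more literals, while lowering it does the opposite.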
Archive Format
The AGC format organizes data into streams:
- file_type_info - Version and producer metadata
- params - Compression parameters (k-mer length, segment size)
- splitters - Singleton k-mers used for segmentation (future)
- seg-NN or seg_dNN - Compressed genome segments
- collection - Sample and contig metadata
Compatibility
This implementation is tested for compatibility with C++ AGC:
- Archives created by ragc can be read by C++ AGC
- Archives created by C++ AGC can be read by ragc
- Format version 3.0 is supported
- Round-trip tests are verified with SHA-256 checksums