Core compression and decompression algorithms for the AGC genome compression format.
This crate implements the complete AGC compression pipeline with full C++ AGC format compatibility. Archives created by this library can be read by the C++ implementation and vice versa.
§Features
- Compression - Create AGC archives from FASTA files
- Decompression - Extract genomes from AGC archives
- C++ Compatibility - Bidirectional format interoperability
- Multi-sample support - Handle multiple genomes in one archive
- LZ differential encoding - Efficient encoding against reference sequences
- ZSTD compression - High-ratio compression of segments
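As a rough illustration of the idea behind LZ differential encoding, the sketch below describes a target sequence as copies from a reference plus literal symbols. It is a toy, quadratic-time greedy matcher written for this page only; the `Op` type and the `encode` and `decode` functions are hypothetical and are not the crate's `LZDiff` API or the AGC on-disk encoding.

```rust
/// One step of a toy differential encoding (hypothetical type, not part of ragc_core).
#[derive(Debug, PartialEq)]
enum Op {
    /// Copy `len` symbols from the reference, starting at `pos`.
    Match { pos: usize, len: usize },
    /// Emit one symbol for which no sufficiently long match was found.
    Literal(u8),
}

/// Greedy encoder: at each target position, find the longest match in the
/// reference and emit it if it is at least `min_match_len` symbols long.
fn encode(reference: &[u8], target: &[u8], min_match_len: usize) -> Vec<Op> {
    let mut ops = Vec::new();
    let mut i = 0;
    while i < target.len() {
        let mut best_pos = 0;
        let mut best_len = 0;
        for pos in 0..reference.len() {
            // Length of the common prefix of reference[pos..] and target[i..].
            let len = reference[pos..]
                .iter()
                .zip(&target[i..])
                .take_while(|(a, b)| a == b)
                .count();
            if len > best_len {
                best_pos = pos;
                best_len = len;
            }
        }
        if best_len > 0 && best_len >= min_match_len {
            ops.push(Op::Match { pos: best_pos, len: best_len });
            i += best_len;
        } else {
            ops.push(Op::Literal(target[i]));
            i += 1;
        }
    }
    ops
}

/// Rebuild the target from the reference and the list of operations.
fn decode(reference: &[u8], ops: &[Op]) -> Vec<u8> {
    let mut out = Vec::new();
    for op in ops {
        match op {
            Op::Match { pos, len } => out.extend_from_slice(&reference[*pos..*pos + *len]),
            Op::Literal(b) => out.push(*b),
        }
    }
    out
}
```

Calling `encode(reference, target, 4)` and then `decode(reference, &ops)` should return the original target. A target that closely resembles the reference collapses into a few long `Match` operations, which is why encoding against a reference pays off for related genomes.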
§Examples
§Compressing genomes
use ragc_core::{Compressor, CompressorConfig};
use std::path::Path;
// Create a compressor
let config = CompressorConfig::default();
let mut compressor = Compressor::new("output.agc", config)?;
// Add FASTA files
compressor.add_fasta_file("sample1", Path::new("genome1.fasta"))?;
compressor.add_fasta_file("sample2", Path::new("genome2.fasta"))?;
// Finalize the archive
compressor.finalize()?;
§Decompressing genomes
use ragc_core::{Decompressor, DecompressorConfig};
// Open an archive
let config = DecompressorConfig::default();
let mut decompressor = Decompressor::open("archive.agc", config)?;
// List available samples
let samples = decompressor.list_samples();
println!("Found {} samples", samples.len());
// Extract a sample
let contigs = decompressor.get_sample("sample1")?;
for (name, sequence) in contigs {
    println!(">{}", name);
    // sequence is Vec<u8> with numeric encoding (A=0, C=1, G=2, T=3)
}
§Working with k-mers
use ragc_core::{Kmer, KmerMode};
// Create a canonical k-mer
let mut kmer = Kmer::new(21, KmerMode::Canonical);
// Insert bases (0=A, 1=C, 2=G, 3=T)
kmer.insert(0); // A
kmer.insert(1); // C
kmer.insert(2); // G
if kmer.is_full() {
    let value = kmer.data();
    println!("K-mer value: {}", value);
}
§Custom compression settings
use ragc_core::CompressorConfig;
let config = CompressorConfig {
    kmer_length: 25,    // Use 25-mers instead of default 21
    segment_size: 2000, // Larger segments
    min_match_len: 20,  // Minimum LZ match length
    verbosity: 2,       // More verbose output
};
§Archive Format
The AGC format organizes data into streams:
- file_type_info - Version and producer metadata
- params - Compression parameters (k-mer length, segment size)
- splitters - Singleton k-mers used for segmentation (future)
- seg-NN or seg_dNN - Compressed genome segments
- collection - Sample and contig metadata
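To make "singleton k-mers" concrete, here is a minimal sketch that packs k-mers into integers using the numeric base encoding from the examples above (A=0, C=1, G=2, T=3) and keeps those that occur exactly once. The helper names `pack_kmers` and `singletons` are hypothetical. The sketch uses forward-strand k-mers only and skips windows containing values above 3, both simplifying assumptions, and it is not the crate's splitter selection logic.

```rust
use std::collections::HashMap;

/// Pack every k-mer of `seq` (numeric bases 0..=3) into a u64, two bits per base.
/// Windows containing any value above 3 (for example an N) are skipped.
fn pack_kmers(seq: &[u8], k: usize) -> Vec<u64> {
    assert!(k >= 1 && k <= 32, "k must fit in 64 bits at 2 bits per base");
    let mask = if k == 32 { u64::MAX } else { (1u64 << (2 * k)) - 1 };
    let mut out = Vec::new();
    let mut value = 0u64;
    let mut filled = 0usize; // number of consecutive valid bases seen so far
    for &base in seq {
        if base > 3 {
            // Invalid symbol: restart the window.
            value = 0;
            filled = 0;
            continue;
        }
        value = ((value << 2) | base as u64) & mask;
        filled += 1;
        if filled >= k {
            out.push(value);
        }
    }
    out
}

/// Return, in sorted order, the k-mers that occur exactly once.
fn singletons(kmers: &[u64]) -> Vec<u64> {
    let mut counts: HashMap<u64, u32> = HashMap::new();
    for &km in kmers {
        *counts.entry(km).or_insert(0) += 1;
    }
    let mut once: Vec<u64> = counts
        .into_iter()
        .filter(|&(_, n)| n == 1)
        .map(|(km, _)| km)
        .collect();
    once.sort_unstable();
    once
}
```

For the sequence ACACG with k = 2, the 2-mer AC appears twice and is dropped, while CA and CG each appear once and are kept.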
§Compatibility
This implementation is tested for compatibility with C++ AGC:
- Archives created by ragc can be read by C++ AGC
- Archives created by C++ AGC can be read by ragc
- Format version 3.0 support
- SHA256-verified roundtrip testing
Re-exports§
pub use agc_compressor::QueueStats;
pub use agc_compressor::StreamingQueueCompressor;
pub use agc_compressor::StreamingQueueConfig;
pub use contig_iterator::MultiFileIterator;
pub use contig_iterator::PansnFileIterator;
pub use decompressor::Decompressor;
pub use decompressor::DecompressorConfig;
pub use genome_io::GenomeIO;
pub use genome_io::GenomeWriter;
pub use kmer::canonical_kmer;
pub use kmer::decode_base;
pub use kmer::encode_base;
pub use kmer::reverse_complement;
pub use kmer::reverse_complement_kmer;
pub use kmer::Kmer;
pub use kmer::KmerMode;
pub use kmer_extract::enumerate_kmers;
pub use kmer_extract::find_candidate_kmers;
pub use kmer_extract::remove_non_singletons;
pub use lz_diff::LZDiff;
pub use memory_bounded_queue::MemoryBoundedQueue;
pub use segment::split_at_splitters;
pub use segment::split_at_splitters_with_size;
pub use segment::Segment;
pub use segment_compression::compress_reference_segment;
pub use segment_compression::compress_segment;
pub use segment_compression::compress_segment_configured;
pub use segment_compression::decompress_segment;
pub use segment_compression::decompress_segment_with_marker;
pub use splitters::determine_splitters;
pub use splitters::determine_splitters_streaming;
pub use splitters::determine_splitters_streaming_first_sample;
pub use splitters::find_candidate_kmers_multi;
pub use splitters::is_hard_contig;
pub use splitters::is_splitter;
pub use splitters::two_pass_splitter_discovery;
pub use worker::create_agc_archive;
Modules§
- agc_compress_ffi
- agc_compressor
- agc_index_ffi
- base_validation_ffi
- bloom_filter
- contig_compression - Contig compression with inline segmentation
- contig_iterator
- decompressor
- env_cache - Cached environment variable lookups for debug flags. These are checked once at startup and cached, avoiding ~30% CPU overhead from getenv calls.
- find_splitters_in_contig_ffi
- genome_io
- kmer
- kmer_extract
- kmer_helpers_ffi
- kmer_pair_ffi
- lz_diff
- lz_matcher
- memory_bounded_queue
- preprocessing
- preprocessing_ffi
- priority_queue
- reverse_complement_ffi
- segment
- segment_boundary_ffi
- segment_buffer
- segment_compression
- segment_helpers_ffi
- segment_split_ffi
- splitter_check_ffi
- splitters
- splitters_ffi
- task
- tuple_packing
- worker
- zstd_pool