Crate ragc_core

Core compression and decompression algorithms for the AGC genome compression format.

This crate implements the complete AGC compression pipeline with full C++ AGC format compatibility. Archives created by this library can be read by the C++ implementation and vice versa.

§Features

  • Compression - Create AGC archives from FASTA files
  • Decompression - Extract genomes from AGC archives
  • C++ Compatibility - Bidirectional format interoperability
  • Multi-sample support - Handle multiple genomes in one archive
  • LZ differential encoding - Efficient encoding against reference sequences
  • ZSTD compression - High-ratio compression of segments

§Examples

§Compressing genomes

use ragc_core::{Compressor, CompressorConfig};
use std::path::Path;

// Create a compressor
let config = CompressorConfig::default();
let mut compressor = Compressor::new("output.agc", config)?;

// Add FASTA files
compressor.add_fasta_file("sample1", Path::new("genome1.fasta"))?;
compressor.add_fasta_file("sample2", Path::new("genome2.fasta"))?;

// Finalize the archive
compressor.finalize()?;

§Decompressing genomes

use ragc_core::{Decompressor, DecompressorConfig};

// Open an archive
let config = DecompressorConfig::default();
let mut decompressor = Decompressor::open("archive.agc", config)?;

// List available samples
let samples = decompressor.list_samples();
println!("Found {} samples", samples.len());

// Extract a sample
let contigs = decompressor.get_sample("sample1")?;
for (name, sequence) in contigs {
    println!(">{}", name);
    // sequence is Vec<u8> with numeric encoding (A=0, C=1, G=2, T=3)
}
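Because the extracted sequences use the numeric encoding noted above (A=0, C=1, G=2, T=3), they usually need to be mapped back to letters before printing or writing FASTA. The following is a minimal sketch of that step using only the standard library. The helper names `code_to_ascii`, `decode_sequence`, and `to_fasta_record` are hypothetical and not part of ragc_core, and mapping any other code to `N` is an assumption; the crate's own re-exported `decode_base` may behave differently.

```rust
/// Hypothetical helper (not part of ragc_core): map one numeric base code
/// to its ASCII letter, assuming A=0, C=1, G=2, T=3.
fn code_to_ascii(code: u8) -> u8 {
    match code {
        0 => b'A',
        1 => b'C',
        2 => b'G',
        3 => b'T',
        _ => b'N', // assumption: treat any other code as an unknown base
    }
}

/// Convert a numerically encoded contig into a printable string.
fn decode_sequence(codes: &[u8]) -> String {
    codes.iter().map(|&c| code_to_ascii(c) as char).collect()
}

/// Format one contig as a FASTA record, wrapping the sequence at `width` columns.
fn to_fasta_record(name: &str, codes: &[u8], width: usize) -> String {
    let seq = decode_sequence(codes);
    let mut out = format!(">{}\n", name);
    // `width.max(1)` guards against a zero chunk size, which would panic.
    for chunk in seq.as_bytes().chunks(width.max(1)) {
        out.push_str(std::str::from_utf8(chunk).expect("decoded bases are ASCII"));
        out.push('\n');
    }
    out
}
```

Inside the loop above, `print!("{}", to_fasta_record(&name, &sequence, 80))` would emit one record per contig, assuming `name` is string-like and `sequence` is a `Vec<u8>` as the comment states.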

§Working with k-mers

use ragc_core::{Kmer, KmerMode};

// Create a canonical k-mer
let mut kmer = Kmer::new(21, KmerMode::Canonical);

// Insert bases (0=A, 1=C, 2=G, 3=T)
kmer.insert(0); // A
kmer.insert(1); // C
kmer.insert(2); // G

// With k = 21, the k-mer is full only after 21 bases have been inserted
if kmer.is_full() {
    let value = kmer.data();
    println!("K-mer value: {}", value);
}
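To illustrate what a canonical k-mer value can look like, here is a standalone sketch that packs bases two bits each into a `u64` and takes the smaller of the forward and reverse-complement values. This is the common convention for canonical k-mers, assumed here for illustration; it is not taken from the crate's source, and `Kmer::data()` may use a different bit layout. The function names are hypothetical.

```rust
/// Pack bases (0=A, 1=C, 2=G, 3=T) into a u64, two bits per base,
/// with the first base in the most significant position. Requires k <= 32.
fn pack_kmer(bases: &[u8]) -> u64 {
    assert!(bases.len() <= 32, "a u64 holds at most 32 two-bit bases");
    bases
        .iter()
        .fold(0u64, |acc, &b| (acc << 2) | u64::from(b & 3))
}

/// Reverse complement of a packed k-mer of length `k`.
/// With this encoding the complement of base b is 3 - b (A<->T, C<->G).
fn reverse_complement_packed(kmer: u64, k: usize) -> u64 {
    let mut fwd = kmer;
    let mut rc = 0u64;
    for _ in 0..k {
        // Take the last base of the forward strand, complement it,
        // and append it to the reverse-complement strand.
        rc = (rc << 2) | (3 - (fwd & 3));
        fwd >>= 2;
    }
    rc
}

/// Canonical form under the assumed convention: the numerically smaller
/// of the k-mer and its reverse complement.
fn canonical_packed(kmer: u64, k: usize) -> u64 {
    kmer.min(reverse_complement_packed(kmer, k))
}
```

Under this layout a k-mer and its reverse complement map to the same canonical value, so both strands of a sequence produce the same key.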

§Custom compression settings

use ragc_core::CompressorConfig;

let config = CompressorConfig {
    kmer_length: 25,        // Use 25-mers instead of default 21
    segment_size: 2000,     // Larger segments
    min_match_len: 20,      // Minimum LZ match length
    verbosity: 2,           // More verbose output
};

§Archive Format

The AGC format organizes data into streams:

  • file_type_info - Version and producer metadata
  • params - Compression parameters (k-mer length, segment size)
  • splitters - Singleton k-mers used for segmentation (future)
  • seg-NN or seg_dNN - Compressed genome segments
  • collection - Sample and contig metadata
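The splitters stream refers to singleton k-mers. As a rough illustration of that idea, the sketch below counts every k-mer in a numerically encoded sequence and keeps those that occur exactly once. It is an assumption-laden toy: it reads "singleton" as "occurs exactly once", ignores canonical forms, and the names `enumerate_packed_kmers` and `singleton_kmers` are hypothetical. It does not reproduce the crate's `enumerate_kmers`, `remove_non_singletons`, or `determine_splitters` functions.

```rust
use std::collections::HashMap;

/// Enumerate all k-mers of a numerically encoded sequence (0=A, 1=C, 2=G, 3=T)
/// as packed u64 values, two bits per base. Returns an empty list when
/// k is 0, k exceeds 32, or the sequence is shorter than k.
fn enumerate_packed_kmers(seq: &[u8], k: usize) -> Vec<u64> {
    if k == 0 || k > 32 || seq.len() < k {
        return Vec::new();
    }
    seq.windows(k)
        .map(|w| w.iter().fold(0u64, |acc, &b| (acc << 2) | u64::from(b & 3)))
        .collect()
}

/// Return the k-mers that occur exactly once, sorted ascending.
fn singleton_kmers(seq: &[u8], k: usize) -> Vec<u64> {
    let mut counts: HashMap<u64, usize> = HashMap::new();
    for kmer in enumerate_packed_kmers(seq, k) {
        *counts.entry(kmer).or_insert(0) += 1;
    }
    let mut singles: Vec<u64> = counts
        .into_iter()
        .filter(|&(_, n)| n == 1)
        .map(|(kmer, _)| kmer)
        .collect();
    // HashMap iteration order is unspecified, so sort for a stable result.
    singles.sort_unstable();
    singles
}
```

For the sequence ACACG with k = 2, the 2-mer AC occurs twice and is dropped, while CA and CG occur once and are kept.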

§Compatibility

This implementation is tested for compatibility with C++ AGC:

  • Archives created by ragc can be read by C++ AGC
  • Archives created by C++ AGC can be read by ragc
  • Format version 3.0 support
  • SHA256-verified roundtrip testing

§Re-exports

pub use agc_compressor::QueueStats;
pub use agc_compressor::StreamingQueueCompressor;
pub use agc_compressor::StreamingQueueConfig;
pub use contig_iterator::MultiFileIterator;
pub use contig_iterator::PansnFileIterator;
pub use decompressor::Decompressor;
pub use decompressor::DecompressorConfig;
pub use genome_io::GenomeIO;
pub use genome_io::GenomeWriter;
pub use kmer::canonical_kmer;
pub use kmer::decode_base;
pub use kmer::encode_base;
pub use kmer::reverse_complement;
pub use kmer::reverse_complement_kmer;
pub use kmer::Kmer;
pub use kmer::KmerMode;
pub use kmer_extract::enumerate_kmers;
pub use kmer_extract::find_candidate_kmers;
pub use kmer_extract::remove_non_singletons;
pub use lz_diff::LZDiff;
pub use memory_bounded_queue::MemoryBoundedQueue;
pub use segment::split_at_splitters;
pub use segment::split_at_splitters_with_size;
pub use segment::Segment;
pub use segment_compression::compress_reference_segment;
pub use segment_compression::compress_segment;
pub use segment_compression::compress_segment_configured;
pub use segment_compression::decompress_segment;
pub use segment_compression::decompress_segment_with_marker;
pub use splitters::determine_splitters;
pub use splitters::determine_splitters_streaming;
pub use splitters::determine_splitters_streaming_first_sample;
pub use splitters::find_candidate_kmers_multi;
pub use splitters::is_hard_contig;
pub use splitters::is_splitter;
pub use splitters::two_pass_splitter_discovery;
pub use worker::create_agc_archive;

§Modules

agc_compress_ffi
agc_compressor
agc_index_ffi
base_validation_ffi
bloom_filter
contig_compression
Contig compression with inline segmentation
contig_iterator
decompressor
env_cache
Cached environment variable lookups for debug flags. These are checked once at startup and cached, avoiding ~30% CPU overhead from getenv calls.
find_splitters_in_contig_ffi
genome_io
kmer
kmer_extract
kmer_helpers_ffi
kmer_pair_ffi
lz_diff
lz_matcher
memory_bounded_queue
preprocessing
preprocessing_ffi
priority_queue
reverse_complement_ffi
segment
segment_boundary_ffi
segment_buffer
segment_compression
segment_helpers_ffi
segment_split_ffi
splitter_check_ffi
splitters
splitters_ffi
task
tuple_packing
worker
zstd_pool