
Crate rosalind


§Rosalind — a deterministic, low-memory genomics engine

Call variants across a whole genome on a laptop, with memory you can predict and verify, and results that are byte-for-byte reproducible. Rosalind treats memory as a contract: you declare a RAM budget, rosalind plan tells you up front whether the job fits, the run honors it (fits-or-refuses cleanly — never a silent OOM-kill), and rosalind verify re-checks a BLAKE3 receipt proving the realized peak landed inside your budget.
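The plan / run / verify contract described above can be illustrated with a minimal sketch. The names here (`Budget`, `PlanError`, `plan`, `verify_peak`) are hypothetical stand-ins invented for illustration; they are not the crate's `MemoryBudget` API or the `rosalind plan` / `rosalind verify` implementations, whose signatures this page does not show.

```rust
/// Hypothetical RAM budget in bytes (illustrative; not rosalind's `MemoryBudget`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Budget {
    pub bytes: u64,
}

/// Hypothetical refusal reason returned before any work starts.
#[derive(Debug, PartialEq, Eq)]
pub enum PlanError {
    WouldExceed { predicted: u64, budget: u64 },
}

/// "Plan" step: admit the job only if the predicted peak fits the budget.
/// A job that does not fit is refused up front instead of being started.
pub fn plan(predicted_peak: u64, budget: Budget) -> Result<(), PlanError> {
    if predicted_peak <= budget.bytes {
        Ok(())
    } else {
        Err(PlanError::WouldExceed {
            predicted: predicted_peak,
            budget: budget.bytes,
        })
    }
}

/// "Verify" step: check that the realized peak landed inside the budget.
pub fn verify_peak(realized_peak: u64, budget: Budget) -> bool {
    realized_peak <= budget.bytes
}
```

The point of the sketch is the shape of the contract: refusal is an ordinary, typed error raised before the run, and verification is a separate check against the recorded peak.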

The kernel is a streaming, CIGAR-aware pileup column stream bounded by local coverage, not input size — a substrate you can compute arbitrary per-locus analytics on. Variant calling is the first consumer, not the whole product.

use std::sync::Arc;
use rosalind::{PileupEngine, PileupParams, SliceSource};
use rosalind::core::{AlignedRead, CigarOp, CigarOpKind, Position, SamFlags};

// One 4bp read "ACGT" aligned at chr0:0 over the reference "ACGT".
let read = AlignedRead {
    contig: 0,
    pos: Position(0),
    mapq: 60,
    flags: SamFlags(0),
    cigar: vec![CigarOp::new(CigarOpKind::Match, 4)],
    seq: Arc::from(b"ACGT".to_vec().into_boxed_slice()),
    qual: Arc::from(vec![40u8; 4].into_boxed_slice()),
};
let reference: Arc<[u8]> = Arc::from(b"ACGT".to_vec().into_boxed_slice());

// The bounded pileup substrate: one PileupColumn per covered position.
let mut engine =
    PileupEngine::new(SliceSource::new(vec![read]), reference, 0, 0..4, PileupParams::default());
let first = engine.next().unwrap().unwrap();
assert_eq!(first.depth(), 1);
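To see why a pileup stream can be bounded by local coverage rather than input size, here is a simplified, crate-independent sketch. It assumes reads arrive sorted by start position and are match-only intervals, so it ignores CIGAR handling and filtering entirely. `Span` and `sweep_depth` are hypothetical names, not part of rosalind.

```rust
use std::ops::Range;

/// Hypothetical simplified read: a half-open reference interval [start, end).
#[derive(Debug, Clone, Copy)]
pub struct Span {
    pub start: u32,
    pub end: u32,
}

/// Sweep `region` left to right, calling `emit(pos, depth)` for each covered
/// position. Assumes `reads` is sorted by `start`. Returns the largest number
/// of reads held in memory at once, which equals the peak depth in the region.
pub fn sweep_depth(
    reads: &[Span],
    region: Range<u32>,
    mut emit: impl FnMut(u32, usize),
) -> usize {
    let mut active: Vec<Span> = Vec::new();
    let mut next = 0;
    let mut peak_active = 0;
    for pos in region {
        // Retire reads that no longer overlap this position.
        active.retain(|r| r.end > pos);
        // Admit reads that have started, skipping any that already ended.
        while next < reads.len() && reads[next].start <= pos {
            if reads[next].end > pos {
                active.push(reads[next]);
            }
            next += 1;
        }
        peak_active = peak_active.max(active.len());
        if !active.is_empty() {
            emit(pos, active.len());
        }
    }
    peak_active
}
```

Only reads overlapping the current position are retained, so the working set tracks depth at the locus; the total number of input reads affects run time but not the size of `active`.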

§Research direction (Phase D)

Rosalind is also a research vehicle for space-bounded genomics — sublinear-space index construction along a ~√t space/time curve, extending the memory contract to the index build itself (today’s build is O(reference)). That is a direction, not yet shipped; it is tracked in docs/OPEN_PROBLEMS.md.

Re-exports§

pub use io::bam::StreamingBamSource;
pub use pileup::Obs;
pub use pileup::PileupColumn;
pub use pileup::PileupEngine;
pub use pileup::PileupParams;
pub use pileup::ReadSource;
pub use pileup::SliceSource;
pub use call::call_germline_region_streaming;
pub use call::call_germline_whole_genome;
pub use call::GermlineCall;
pub use call::GermlineParams;
pub use call::run_bounded_whole_genome;
pub use call::ColumnAnalyzer;
pub use call::FeatureAnalyzer;
pub use call::estimate_variants_working_set;
pub use call::first_fit_decreasing;
pub use call::predicted_peak_rss_bytes;
pub use call::PackJob;
pub use call::PackOutcome;
pub use core::MemoryBudget;
pub use core::WorkingSet;
pub use genomics::GenomeIndex;
pub use genomics::IndexReader;
pub use genomics::ReferenceView;
pub use rosalind_receipt as provenance;
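The re-exported `first_fit_decreasing`, `PackJob`, and `PackOutcome` names suggest bin packing of jobs under a memory budget. Their actual signatures are not shown on this page, so the following is only a generic sketch of the textbook first-fit-decreasing heuristic, using hypothetical `Job` and `ffd_pack` names rather than the crate's types.

```rust
/// Hypothetical job: an id plus its predicted working-set size in bytes.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Job {
    pub id: usize,
    pub bytes: u64,
}

/// Generic first-fit-decreasing packing (not rosalind's implementation).
/// Returns `(bins, rejected)`: every bin's total is <= `capacity`, and
/// `rejected` holds jobs that cannot fit in any bin on their own.
pub fn ffd_pack(jobs: &[Job], capacity: u64) -> (Vec<Vec<Job>>, Vec<Job>) {
    let mut sorted: Vec<Job> = jobs.to_vec();
    // Largest first; ties broken by id so the result is deterministic.
    sorted.sort_by(|a, b| b.bytes.cmp(&a.bytes).then(a.id.cmp(&b.id)));

    // Each bin tracks (bytes used so far, members).
    let mut bins: Vec<(u64, Vec<Job>)> = Vec::new();
    let mut rejected: Vec<Job> = Vec::new();

    for job in sorted {
        if job.bytes > capacity {
            rejected.push(job);
            continue;
        }
        // First fit: place the job in the earliest bin with enough room.
        match bins.iter_mut().find(|bin| bin.0 + job.bytes <= capacity) {
            Some(bin) => {
                bin.0 += job.bytes;
                bin.1.push(job);
            }
            None => bins.push((job.bytes, vec![job])),
        }
    }
    (bins.into_iter().map(|(_, members)| members).collect(), rejected)
}
```

Sorting with an explicit tie-break keeps the packing reproducible across runs, which matters when downstream results are meant to be byte-for-byte identical.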

Modules§

call
The calling layer: turns the PileupColumn stream into probabilistically grounded, abstention-aware variant calls. Built on crate::core and crate::pileup only; no VCF writing or CLI wiring (those are later phases).
core
Core types: the lingua franca shared by every Rosalind layer (io, index, align, pileup, call).
genomics
Genomics primitives: the FM-index, the persisted memory-mapped index, alignment, sort, and eval. These genomics-specific utilities and data structures are built on top of the O(√t) engine.
io
IO layer: standards-compliant readers and writers, covering a spec-valid VCF writer and streaming FASTA/FASTQ/BAM readers. Phase A4 lands the spec-valid VCF writer; readers (FASTA/FASTQ/BAM) migrate here in later phases.
pileup
The streaming pileup kernel: one CIGAR-aware, filtered, bounded-memory engine.
reproduce
Third-party byte re-derivation from a receipt (the reproduce verb): re-derive a recorded result and compare it byte-for-byte. Standalone: filesystem + blake3 + subprocess only (no htslib). The verdict is over OUTPUT bytes; code, inputs, and resource are diagnostic context. v1 supports the deterministic text outputs Rosalind emits (VCF, TSV); BAM/bgzf is reported INCONCLUSIVE, never a false DIVERGED.
util
Helper utilities: read-only mmap and peak-RSS measurement.

Structs§

CommandCapture
Accumulates an invocation, then writes it into a RunManifest (claim) and/or reconstructs an argv for re-execution.
RunManifest
A reproducibility receipt for a single run.
VerifyOpts
Options for verify_receipt beyond the receipt text itself.
VerifyReport
The outcome of verify_receipt: failures (empty == ok), informational notes, and the parsed manifest (when parsing succeeded).

Functions§

verify_receipt
Check a receipt’s internal integrity — self-hashes, cross-field consistency, optional budget + expected-code — and, when opts.rehash_files, re-hash recorded files. The single source of truth shared by the verify CLI and reproduce (and a future WASM verifier) so they cannot drift.