§Rosalind — a deterministic, low-memory genomics engine
Call variants across a whole genome on a laptop, with memory you can predict
and verify, and results that are byte-for-byte reproducible. Rosalind
treats memory as a contract: you declare a RAM budget, rosalind plan tells
you up front whether the job fits, the run honors it (fits-or-refuses cleanly —
never a silent OOM-kill), and rosalind verify re-checks a BLAKE3 receipt
proving the realized peak landed inside your budget.
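The contract described above (plan first, refuse cleanly, re-check afterwards) can be illustrated with a minimal sketch. Every name here (`Budget`, `Plan`, `plan`, `verify_peak`) is hypothetical and chosen for illustration only; this is not Rosalind's API, and the real `rosalind plan` / `rosalind verify` commands may work quite differently.

```rust
// Hypothetical sketch of a "fits-or-refuses" memory contract.
// None of these names come from Rosalind; they only illustrate the idea.

/// A declared RAM budget in bytes (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Budget {
    pub bytes: u64,
}

/// Outcome of planning a job against a budget.
#[derive(Debug, PartialEq, Eq)]
pub enum Plan {
    /// The predicted peak fits; `headroom` is the unused part of the budget.
    Fits { headroom: u64 },
    /// The predicted peak exceeds the budget; refuse before doing any work.
    Refuses { over_by: u64 },
}

/// Decide up front whether a predicted peak fits the declared budget.
pub fn plan(budget: Budget, predicted_peak: u64) -> Plan {
    if predicted_peak <= budget.bytes {
        Plan::Fits { headroom: budget.bytes - predicted_peak }
    } else {
        Plan::Refuses { over_by: predicted_peak - budget.bytes }
    }
}

/// After the run, re-check that the realized peak landed inside the budget.
pub fn verify_peak(budget: Budget, realized_peak: u64) -> Result<(), String> {
    if realized_peak <= budget.bytes {
        Ok(())
    } else {
        Err(format!(
            "realized peak {} B exceeds budget {} B",
            realized_peak, budget.bytes
        ))
    }
}
```

The point of the sketch is the ordering: the refusal happens in `plan`, before any allocation, so an over-budget job surfaces as an explicit `Refuses` value rather than as a process killed mid-run.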
The kernel is a streaming, CIGAR-aware pileup column stream bounded by local coverage, not input size — a substrate you can compute arbitrary per-locus analytics on. Variant calling is the first consumer, not the whole product.
use std::sync::Arc;
use rosalind::{PileupEngine, PileupParams, SliceSource};
use rosalind::core::{AlignedRead, CigarOp, CigarOpKind, Position, SamFlags};

// One 4bp read "ACGT" aligned at chr0:0 over the reference "ACGT".
let read = AlignedRead {
    contig: 0,
    pos: Position(0),
    mapq: 60,
    flags: SamFlags(0),
    cigar: vec![CigarOp::new(CigarOpKind::Match, 4)],
    seq: Arc::from(b"ACGT".to_vec().into_boxed_slice()),
    qual: Arc::from(vec![40u8; 4].into_boxed_slice()),
};
let reference: Arc<[u8]> = Arc::from(b"ACGT".to_vec().into_boxed_slice());

// The bounded pileup substrate: one PileupColumn per covered position.
let mut engine =
    PileupEngine::new(SliceSource::new(vec![read]), reference, 0, 0..4, PileupParams::default());
let first = engine.next().unwrap().unwrap();
assert_eq!(first.depth(), 1);

§Research direction (Phase D)
Rosalind is also a research vehicle for space-bounded genomics — sublinear-space
index construction along a ~√t space/time curve, extending the memory contract to the
index build itself (today’s build is O(reference)). That is a direction, not yet shipped;
it is tracked in docs/OPEN_PROBLEMS.md.
Re-exports§
pub use io::bam::StreamingBamSource;
pub use pileup::Obs;
pub use pileup::PileupColumn;
pub use pileup::PileupEngine;
pub use pileup::PileupParams;
pub use pileup::ReadSource;
pub use pileup::SliceSource;
pub use call::call_germline_region_streaming;
pub use call::call_germline_whole_genome;
pub use call::GermlineCall;
pub use call::GermlineParams;
pub use call::run_bounded_whole_genome;
pub use call::ColumnAnalyzer;
pub use call::FeatureAnalyzer;
pub use call::estimate_variants_working_set;
pub use call::first_fit_decreasing;
pub use call::predicted_peak_rss_bytes;
pub use call::PackJob;
pub use call::PackOutcome;
pub use core::MemoryBudget;
pub use core::WorkingSet;
pub use genomics::GenomeIndex;
pub use genomics::IndexReader;
pub use genomics::ReferenceView;
pub use rosalind_receipt as provenance;
Modules§
- call
  The calling layer: probabilistically-grounded, abstention-aware variant calls from pileup columns. It turns the PileupColumn stream into variant calls and is built on crate::core + crate::pileup only; no VCF writing or CLI wiring (those are later phases).
- core
  Core types: the lingua franca shared by every Rosalind layer (io, index, align, pileup, call).
- genomics
  Genomics primitives: the FM-index, persisted memory-mapped index, alignment, sort, eval. Genomics-specific utilities and data structures built on top of the O(√t) engine.
- io
  IO layer: spec-valid VCF writer + streaming FASTA/FASTQ/BAM readers. Standards-compliant readers and writers. Phase A4 lands the spec-valid VCF writer; readers (FASTA/FASTQ/BAM) migrate here in later phases.
- pileup
  The streaming pileup kernel: one CIGAR-aware, filtered, bounded-memory engine.
- reproduce
  Third-party byte re-derivation from a receipt (the reproduce verb): re-derive a recorded result and compare it byte-for-byte. Standalone: filesystem + blake3 + subprocess only (no htslib). The verdict is over OUTPUT bytes; code / inputs / resource are diagnostic context. v1 supports the deterministic text outputs Rosalind emits (VCF, TSV); BAM/bgzf is reported INCONCLUSIVE, never a false DIVERGED.
- util
  Helper utilities: read-only mmap + peak-RSS measurement.
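The verdict rule stated for the reproduce module (compare OUTPUT bytes; report BAM/bgzf as INCONCLUSIVE rather than risk a false DIVERGED) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the module's implementation: `Verdict`, `OutputKind`, and `compare_outputs` are hypothetical names, `Reproduced` is an assumed label for the matching case, and reporting the first differing offset is an added convenience.

```rust
// Hypothetical sketch of a byte-for-byte reproduction verdict.
// These names are illustrative and are not taken from Rosalind.

/// Result of comparing recorded output bytes against re-derived bytes.
#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    /// Every byte matched (assumed label for the success case).
    Reproduced,
    /// The outputs differ; `first_diff` is the first differing byte offset.
    Diverged { first_diff: usize },
    /// The output kind cannot be judged by raw byte comparison.
    Inconclusive,
}

/// The kind of output being compared.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OutputKind {
    /// Deterministic text such as VCF or TSV.
    Text,
    /// Block-compressed output such as BAM/bgzf.
    Bgzf,
}

/// Compare recorded and re-derived output bytes and return a verdict.
pub fn compare_outputs(kind: OutputKind, recorded: &[u8], rederived: &[u8]) -> Verdict {
    // Compressed containers are never judged on raw bytes in this sketch,
    // so they can never produce a false divergence.
    if kind == OutputKind::Bgzf {
        return Verdict::Inconclusive;
    }
    if recorded == rederived {
        return Verdict::Reproduced;
    }
    // Find the first mismatching offset; if one input is a strict prefix of
    // the other, the difference starts at the shorter length.
    let first_diff = recorded
        .iter()
        .zip(rederived)
        .position(|(a, b)| a != b)
        .unwrap_or_else(|| recorded.len().min(rederived.len()));
    Verdict::Diverged { first_diff }
}
```

Keeping a third `Inconclusive` outcome, instead of forcing every comparison into pass or fail, is what lets the verdict stay trustworthy for formats where byte equality is not a meaningful test.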
Structs§
- CommandCapture
  Accumulates an invocation, then writes it into a RunManifest (claim) and/or reconstructs an argv for re-execution.
- RunManifest
  A reproducibility receipt for a single run.
- VerifyOpts
  Options for verify_receipt beyond the receipt text itself.
- VerifyReport
  The outcome of verify_receipt: failures (empty == ok), informational notes, and the parsed manifest (when parsing succeeded).
Functions§
- verify_receipt
  Check a receipt's internal integrity (self-hashes, cross-field consistency, optional budget + expected-code) and, when opts.rehash_files is set, re-hash recorded files. The single source of truth shared by the verify CLI and reproduce (and a future WASM verifier) so they cannot drift.