# genomicframe-core → GenomicFrame
High-performance, memory-efficient genomics I/O and interoperability layer written entirely in Rust.
## Overview
genomicframe-core is the foundation for GenomicFrame — a high-performance genomics analytics library written in Rust and exposed to Python via PyO3. The end goal is a first-class Python library for use in Jupyter notebooks and Nextflow pipelines, with Rust-level performance and no CPython bottlenecks. This project addresses the limitations of existing Python-based bioinformatics libraries by providing:
- Blazing-fast I/O for VCF, BAM, SAM, FASTA, FASTQ, BED, and GFF/GTF formats
- Memory efficiency through zero-copy streaming and O(1) memory regardless of file size
- Composable pipelines with a `LogicalPlan` query API and lazy evaluation
- Rust-level performance with safe, concurrent execution via Rayon
## Performance
These numbers are measured — not extrapolated. See BENCHMARKS.md for full methodology and reproduction steps.
### VCF Processing (Criterion.rs benchmarks)
| Task | genomicframe-core | PyVCF | cyvcf2 | scikit-allel |
|---|---|---|---|---|
| Parse 10K variants | 4.5 ms | 850 ms | 65 ms | 120 ms |
| Compute full statistics | 6.1 ms | 1,200 ms | N/A | 200 ms |
| Peak memory | ~100 KB | ~50 MB | ~25 MB | ~80 MB |
Throughput: ~1.6 million variants/second with full statistics (Ts/Tv, variant classification, quality, filter distribution) computed in a single streaming pass — no reopen, no second loop.
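To make the "single streaming pass" concrete, here is a minimal, self-contained sketch with toy types (not genomicframe-core's actual API): counters update once per record as the stream advances, so memory stays constant regardless of input size.

```rust
// Toy single-pass statistics accumulator (illustrative types only,
// not genomicframe-core's real VcfStats).
#[derive(Default)]
struct VariantStats {
    transitions: u64,
    transversions: u64,
    snps: u64,
}

impl VariantStats {
    // Update counters from one (ref, alt) pair; no record is buffered.
    fn observe(&mut self, reference: char, alt: char) {
        self.snps += 1;
        let is_transition = matches!(
            (reference, alt),
            ('A', 'G') | ('G', 'A') | ('C', 'T') | ('T', 'C')
        );
        if is_transition {
            self.transitions += 1;
        } else {
            self.transversions += 1;
        }
    }

    fn ts_tv(&self) -> f64 {
        self.transitions as f64 / self.transversions as f64
    }
}

fn main() {
    let mut stats = VariantStats::default();
    // Stand-in for a streaming reader: each record is visited exactly once.
    for (r, a) in [('A', 'G'), ('C', 'T'), ('A', 'C'), ('G', 'A')] {
        stats.observe(r, a);
    }
    assert_eq!(stats.snps, 4);
    assert_eq!(stats.transitions, 3);
    assert_eq!(stats.transversions, 1);
    println!("Ts/Tv = {:.2}", stats.ts_tv()); // 3.00
}
```

Because the accumulator holds only fixed-size counters, the same pass works identically on 10 thousand or 10 million variants.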
### Scaling to Production Files

| File Size | Variants | genomicframe-core (streaming) | PyVCF | scikit-allel |
|---|---|---|---|---|
| 10 MB | 10,000 | 10 KB | 50 MB | 80 MB |
| 1 GB | 1,000,000 | 10 KB | 5 GB | 8 GB |
| 10 GB | 10,000,000 | 10 KB | OOM | OOM |
Memory usage is flat: the Python tools scale linearly and crash on large files, while the Rust core processes a 10 GB file with the same footprint as a 1 MB file.
Key advantages:
- 10–200x faster than Python equivalents
- 250–800x less memory via streaming architecture
- O(1) memory regardless of file size
- Process 1M+ variants in under 1 second
## Format Support
| Format | Extension | Reader | Stats | Intervals | Gzip | Status |
|---|---|---|---|---|---|---|
| VCF | .vcf, .vcf.gz | ✅ | ✅ | ✅ | ✅ | Production |
| BAM | .bam | ✅ | ✅ | ✅ | ✅ (BGZF) | Production |
| SAM | .sam | ✅ | — | ✅ | — | Production |
| BED | .bed, .bed.gz | ✅ | ✅ | ✅ | ✅ | Production |
| FASTQ | .fastq, .fq, .gz | ✅ | ✅ | — | ✅ | Production |
| FASTA | .fasta, .fa, .gz | ✅ | ✅ | — | ✅ | Production |
| GFF3/GTF | .gff, .gff3, .gtf, .gz | ✅ | — | ✅ | ✅ | Production |
All formats share the same `GenomicRecordIterator` trait and convert to a unified `GenomicInterval` coordinate system.
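The shared trait can be pictured roughly as follows. This is a simplified sketch with assumed names (`RecordIterator`, `VecReader`), not the crate's exact definition:

```rust
// Simplified sketch of a pull-based, fallible record iterator,
// in the spirit of GenomicRecordIterator (names are illustrative).
trait RecordIterator {
    type Record;
    type Error;
    // Returns Ok(None) at end of stream, like a fallible Iterator.
    fn next_record(&mut self) -> Result<Option<Self::Record>, Self::Error>;
}

// Toy in-memory implementation to show the calling convention.
struct VecReader<T> {
    records: std::vec::IntoIter<T>,
}

impl<T> RecordIterator for VecReader<T> {
    type Record = T;
    type Error = std::convert::Infallible;
    fn next_record(&mut self) -> Result<Option<T>, Self::Error> {
        Ok(self.records.next())
    }
}

fn main() {
    let mut reader = VecReader {
        records: vec!["chr1:100", "chr1:200"].into_iter(),
    };
    let mut count = 0;
    while let Some(rec) = reader.next_record().unwrap() {
        println!("record: {rec}");
        count += 1;
    }
    assert_eq!(count, 2);
}
```

A single trait like this is what lets statistics, filtering, and interval conversion be written once and reused across every format-specific reader.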
## Core API

### Streaming (simple, direct)
Every format follows the same pattern with O(1) memory:
```rust
use genomicframe_core::GenomicRecordIterator;
use genomicframe_core::formats::vcf::VcfReader;

let mut reader = VcfReader::from_path("variants.vcf.gz")?;
while let Some(record) = reader.next_record()? {
    // process `record`; previous records are already dropped
}
```
### LogicalPlan (composable, ergonomic)
The LogicalPlan API is the central interface for filtered queries. It compiles filter expressions into format-native operations — no intermediate allocations.
```rust
use genomicframe_core::LogicalPlan;
use genomicframe_core::expression::{col, lit};
use genomicframe_core::FileFormat;
use genomicframe_core::execution::execute;

// Build a lazy query — nothing runs yet
let plan = LogicalPlan::scan("variants.vcf.gz", FileFormat::Vcf)
    .max_scan(50_000)
    .filter(col("qual").gte(lit(30.0)));

// Execute — streaming, filtered at read time
let result = execute(&plan)?;
```
## Statistics
```rust
use genomicframe_core::formats::vcf::{VcfReader, VcfStats};

let mut reader = VcfReader::from_path("variants.vcf.gz")?;

// All computed in one streaming pass
let stats = VcfStats::compute(&mut reader)?;
stats.print_summary();
// (stat field names below are illustrative)
println!("Ts/Tv:  {:.2}", stats.ts_tv_ratio);
println!("SNPs:   {}", stats.snp_count);
println!("Indels: {}", stats.indel_count);
```
For FASTQ, parallel computation is available:
```rust
use genomicframe_core::formats::fastq::FastqStats;

// Multi-threaded, O(threads) memory
let stats = FastqStats::par_compute("reads.fastq.gz")?;
println!("Mean read quality: {:.2}", stats.mean_quality);
```
## Real-World Pipeline Example

The following is drawn from `internal_examples/complete_ecosystem.rs` — a full four-stage genomics pipeline demonstrating how the API composes across formats.
### Stage 1: Pre-Alignment QC
```rust
use genomicframe_core::formats::fastq::FastqStats;
use genomicframe_core::formats::fasta::FastaReader;
use genomicframe_core::GenomicRecordIterator;

// FASTQ quality control (parallel, O(threads) memory)
let stats = FastqStats::par_compute("reads.fastq.gz")?;
println!("Reads:        {}", stats.record_count);
println!("Mean quality: {:.2}", stats.mean_quality);

// Reference FASTA statistics (streaming)
let mut reader = FastaReader::from_path("reference.fa")?;
while let Some(record) = reader.next_record()? {
    // accumulate per-contig length and GC statistics
}
```
### Stage 2: Build Filtered Annotation Databases

`LogicalPlan` acts as the primary interface for loading annotations. A few lines take you from query to indexed annotation database:
```rust
use genomicframe_core::LogicalPlan;
use genomicframe_core::expression::{col, lit};
use genomicframe_core::FileFormat;
use genomicframe_core::execution::execute;

// Large genes only (>= 1000 bp) — filtered during load
let gene_plan = LogicalPlan::scan("annotations.gff3", FileFormat::Gff)
    .filter(col("feature_type").eq(lit("gene")).and(col("length").gte(lit(1000))));
let genes = execute(&gene_plan)?.to_annotation_index()?;

// Chr22 exons only — compound filter with .and()
let exon_plan = LogicalPlan::scan("annotations.gff3", FileFormat::Gff)
    .filter(col("feature_type").eq(lit("exon")).and(col("chrom").eq(lit("chr22"))));
let exons = execute(&exon_plan)?.to_annotation_index()?;

println!("Indexed {} genes, {} exons", genes.len(), exons.len());
```
No manual iteration. No intermediate `Vec`. The filter executes at read time.
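The filter-at-read-time idea can be sketched in plain Rust as a predicate wrapped around an iterator: non-matching records are skipped inside the read loop, so no intermediate collection is ever materialized. This toy `Filtered` adapter is illustrative, not the crate's `filtered_reader` implementation:

```rust
// Toy filter-at-read-time wrapper (illustrative, not the crate's
// filtered_reader). Records failing the predicate are discarded as
// the stream advances; nothing is buffered.
struct Filtered<I, F> {
    inner: I,
    pred: F,
}

impl<I, F, T> Iterator for Filtered<I, F>
where
    I: Iterator<Item = T>,
    F: FnMut(&T) -> bool,
{
    type Item = T;
    fn next(&mut self) -> Option<T> {
        loop {
            let item = self.inner.next()?;
            if (self.pred)(&item) {
                return Some(item);
            }
        }
    }
}

fn main() {
    // Stand-in for feature lengths streamed from an annotation file.
    let lengths = [500u32, 1500, 900, 2200];
    let large: Vec<u32> = Filtered {
        inner: lengths.into_iter(),
        pred: |len: &u32| *len >= 1000,
    }
    .collect();
    assert_eq!(large, vec![1500, 2200]);
}
```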
### Stage 3: BAM Filtering + Annotation
```rust
use genomicframe_core::LogicalPlan;
use genomicframe_core::execution::execute;
use genomicframe_core::FileFormat;
use genomicframe_core::expression::{col, lit};

// Filter high-quality reads and annotate with gene/exon overlaps
let plan = LogicalPlan::scan("aligned.bam", FileFormat::Bam)
    .filter(col("mapq").gte(lit(30)));
let result = execute(&plan)?.annotate_with_genes(&genes, &exons)?;
result.print_summary();
```
### Stage 4: VCF Filtering + Annotation
```rust
// Scan first 50K records, filter high-quality chr22 variants, annotate
let plan = LogicalPlan::scan("variants.vcf.gz", FileFormat::Vcf)
    .max_scan(50_000)
    .filter(col("chrom").eq(lit("chr22")).and(col("qual").gte(lit(30.0))));
let result = execute(&plan)?.annotate_with_genes(&genes, &exons)?;
result.print_summary();
// → genic variants, exonic variants, intergenic counts
```
The same `annotate_with_genes` pattern works across BAM and VCF — `GenomicInterval` unifies the coordinate systems.
## Unified Interval System
All formats convert to `GenomicInterval` for cross-format operations. Coordinate system conversions (1-based ↔ 0-based half-open) happen at the conversion boundary, not in application code:
```rust
// Each conversion handles its format's coordinate convention
let vcf_interval: GenomicInterval = vcf_record.into(); // 1-based → 0-based
let bam_interval: GenomicInterval = bam_record.into(); // 0-based + CIGAR length
let bed_interval: GenomicInterval = bed_record.into(); // 0-based, direct
let gff_interval: GenomicInterval = gff_record.into(); // 1-based inclusive → 0-based half-open

// Cross-format overlap queries
if vcf_interval.overlaps(&gff_interval) {
    // variant falls inside an annotated feature
}
```
`AnnotationIndex` uses an interval tree for O(log n) overlap queries, built from any BED or GFF source.
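The coordinate bookkeeping is small but easy to get wrong, so here is a self-contained sketch with a toy `Interval` type (not the crate's `GenomicInterval`) showing the 1-based inclusive to 0-based half-open conversion and the overlap test:

```rust
// Toy 0-based half-open interval (illustrative, not the crate's type).
#[derive(Debug, PartialEq)]
struct Interval {
    chrom: String,
    start: u64, // inclusive, 0-based
    end: u64,   // exclusive
}

impl Interval {
    // Convert 1-based inclusive coordinates (VCF/GFF convention) to
    // 0-based half-open: subtract 1 from start, leave end unchanged.
    fn from_one_based(chrom: &str, start: u64, end: u64) -> Self {
        Interval { chrom: chrom.to_string(), start: start - 1, end }
    }

    // Half-open overlap: intervals touch iff each starts before the
    // other ends, on the same chromosome.
    fn overlaps(&self, other: &Interval) -> bool {
        self.chrom == other.chrom && self.start < other.end && other.start < self.end
    }
}

fn main() {
    // A GFF exon at 1-based [101, 200] becomes 0-based half-open [100, 200).
    let exon = Interval::from_one_based("chr22", 101, 200);
    assert_eq!((exon.start, exon.end), (100, 200));

    // A SNP at 1-based position 150 becomes [149, 150) and overlaps the exon.
    let snp = Interval::from_one_based("chr22", 150, 150);
    assert!(snp.overlaps(&exon));

    // Adjacent but not overlapping: [200, 250) does not touch [100, 200).
    let next = Interval { chrom: "chr22".into(), start: 200, end: 250 };
    assert!(!next.overlaps(&exon));
}
```

Doing this conversion once at the format boundary is what makes the adjacency case (`end == other.start`) behave identically for every input format.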
## Path to GenomicFrame
genomicframe-core is intentionally structured as two separable tiers:
```text
genomicframe-core (this crate)      genomicframe (future crate)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━      ━━━━━━━━━━━━━━━━━━━━━━━━━━━
Format readers (VCF, BAM, ...)   →  GenomicFrame::scan_vcf(...)
LogicalPlan + expression AST     →  .filter(col("qual") > 30)
ExecutionResult + annotation     →  .annotate_with_genes(...)
AnnotationIndex (interval tree)  →  .join_by_overlap(exons)
Statistics (VcfStats, BamStats)  →  .vcf_stats() / .bam_stats()
                                          ↓
                                    .collect_polars() / .collect_arrow()
```
The `LogicalPlan` and expression modules in this crate are the foundation. genomicframe will add:
- Query optimization — predicate pushdown, filter reordering
- Parallel execution — chunk-based Rayon dispatch
- Index-aware scans — tabix/BAI random access when available
- Arrow/Polars output — for downstream data science tooling
- Auto-detection — `GenomicFrame::scan("file.vcf.gz")` detects format from path
When a `LogicalPlan` is handed to genomicframe, it can inspect the expression tree, detect indexed files, push filters to the read layer, and dispatch chunks across threads — without changing the query authoring API.
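Predicate pushdown at this boundary can be sketched as a walk over the expression tree that splits a conjunction into clauses the reader evaluates natively and clauses that must run after the read. The toy `Expr` AST below is illustrative, not the crate's expression module:

```rust
// Toy expression AST and pushdown split (illustrative only).
#[derive(Clone, Debug, PartialEq)]
enum Expr {
    Gte(String, f64),          // col >= literal
    Eq(String, String),        // col == literal
    And(Box<Expr>, Box<Expr>), // conjunction
}

// Can this hypothetical read layer evaluate the expression natively?
fn pushable(expr: &Expr, native_cols: &[&str]) -> bool {
    match expr {
        Expr::Gte(col, _) | Expr::Eq(col, _) => native_cols.contains(&col.as_str()),
        Expr::And(l, r) => pushable(l, native_cols) && pushable(r, native_cols),
    }
}

// Split a conjunction into (pushed-to-reader, evaluated-after-read).
fn split(expr: Expr, native_cols: &[&str]) -> (Vec<Expr>, Vec<Expr>) {
    match expr {
        Expr::And(l, r) => {
            let (mut pl, mut rl) = split(*l, native_cols);
            let (pr, rr) = split(*r, native_cols);
            pl.extend(pr);
            rl.extend(rr);
            (pl, rl)
        }
        e if pushable(&e, native_cols) => (vec![e], vec![]),
        e => (vec![], vec![e]),
    }
}

fn main() {
    // chrom is filterable at the read layer; custom_score is not.
    let expr = Expr::And(
        Box::new(Expr::Eq("chrom".into(), "chr22".into())),
        Box::new(Expr::Gte("custom_score".into(), 0.9)),
    );
    let (pushed, residual) = split(expr, &["chrom", "qual"]);
    assert_eq!(pushed, vec![Expr::Eq("chrom".into(), "chr22".into())]);
    assert_eq!(residual, vec![Expr::Gte("custom_score".into(), 0.9)]);
}
```

The key property is that the split is purely structural: the query author writes one filter expression, and the engine decides where each clause runs.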
## Target Environments
genomicframe is primarily a Python library compiled from Rust via PyO3. The intended deployment targets are:
- Jupyter notebooks — interactive genomics analysis with Polars/Pandas interop for visualisation and ML
- Nextflow pipelines — drop-in replacement for Python bioinformatics tools in DSL2 process blocks, with no GIL, no OOM risk, and predictable latency
```python
# Future genomicframe Python API (illustrative — names not final)
# Nextflow process: annotate variants
import genomicframe as gf

df = (
    gf.scan("variants.vcf.gz")
      .filter(gf.col("qual") >= 30)
      .annotate_with_genes("genes.gff3")
      .collect_polars()
)
```
The Rust core handles all I/O and compute. Python only touches results.
## Installation

```toml
[dependencies]
genomicframe-core = "0.1"
```
### Optional Features

```toml
[dependencies]
genomicframe-core = { version = "0.1", features = ["async", "cloud"] }
```

- `async` — enable async I/O with Tokio
- `cloud` — S3/GCS/Azure cloud storage via `object_store`
## Project Structure

```text
genomicframe-core/
├── src/
│   ├── lib.rs               # Crate entry point
│   ├── core.rs              # Core traits (GenomicRecordIterator, etc.)
│   ├── error.rs             # Unified error handling
│   ├── expression.rs        # Expression AST (col, lit, eq, gte, and, ...)
│   ├── plan.rs              # LogicalPlan builder
│   ├── execution.rs         # Plan executor → ExecutionResult
│   ├── schema.rs            # FileFormat, schema metadata
│   ├── filters.rs           # Filter traits and combinators
│   ├── filtered_reader.rs   # Filtered iterator wrapper
│   ├── interval/            # GenomicInterval + AnnotationIndex
│   ├── parallel/            # Rayon-based parallel processing
│   ├── stats.rs             # Shared statistics types
│   └── formats/
│       ├── vcf/             # VCF reader, stats, filters, expressions
│       ├── bam/             # BAM reader, stats, filters, expressions
│       ├── fasta/           # FASTA reader
│       ├── fastq/           # FASTQ reader, stats (parallel)
│       ├── bed/             # BED reader, stats
│       └── gff/             # GFF3/GTF reader
├── internal_examples/
│   └── complete_ecosystem.rs  # Full four-stage pipeline demo
├── docs/
│   └── bench/
│       └── BENCHMARKS.md
└── Cargo.toml
```
## Roadmap
| Phase | Focus | Status |
|---|---|---|
| 1a | Format I/O core (VCF, BAM, FASTQ, FASTA, BED, GFF) | ✅ Complete |
| 1b | LogicalPlan + expression + execution engine | 🚧 In progress |
| 1c | Writers, validation, indexed readers (tabix/BAI) | 📋 Planned |
| 2 | genomicframe — lazy engine, optimizer, Arrow/Polars output | 📋 Planned |
| 3 | Python bindings via PyO3 (notebook + Nextflow targets) | 📋 Planned |
| 4 | Cloud integration (S3/GCS/Azure) | 📋 Planned |
| 5 | Query optimization (predicate pushdown, parallel joins) | 📋 Planned |
## Design Principles
- Performance First — Everything measurable should be optimized
- Predictable Memory — No unexpected growth or hidden buffering
- Composable Design — Pipelines compose easily across file types
- Transparency — Execution plans and memory profiles are inspectable
- Open Science — Code clarity > cleverness. Benchmarks are reproducible
## Contributing
Contributions welcome. This is an early-stage research project. Please open an issue to discuss major changes before submitting a PR.
## License
Dual-licensed under MIT OR Apache-2.0.
## Author

Ryan Duffy — 2025

*Performance-driven genomics for the modern bioinformatics era.*