# genomicframe-core → GenomicFrame
High-performance, memory-efficient genomics I/O and interoperability layer written entirely in Rust.
## Overview
genomicframe-core is the foundation for GenomicFrame — a high-performance genomics analytics library written in Rust and exposed to Python via PyO3. The end goal is a first-class Python library for use in Jupyter notebooks and Nextflow pipelines, with Rust-level performance and no CPython bottlenecks. This project addresses the limitations of existing Python-based bioinformatics libraries by providing:
- Blazing-fast I/O for VCF, BAM, SAM, FASTA, FASTQ, BED, and GFF/GTF formats
- Memory efficiency through zero-copy streaming and O(1) memory regardless of file size
- Composable pipelines with a `LogicalPlan` query API and lazy evaluation
- Rust-level performance with safe, concurrent execution via Rayon
## Performance
These numbers are measured — not extrapolated. See BENCHMARKS.md for full methodology and reproduction steps.
### VCF Processing (Criterion.rs benchmarks)
| Task | genomicframe-core | PyVCF | cyvcf2 | scikit-allel |
|---|---|---|---|---|
| Parse 10K variants | 4.5 ms | 850 ms | 65 ms | 120 ms |
| Compute full statistics | 6.1 ms | 1,200 ms | N/A | 200 ms |
| Peak memory | ~100 KB | ~50 MB | ~25 MB | ~80 MB |
Throughput: ~1.6 million variants/second with full statistics (Ts/Tv, variant classification, quality, filter distribution) computed in a single streaming pass — no reopen, no second loop.
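To make the "single streaming pass" concrete, here is a minimal, self-contained sketch with toy types (not genomicframe-core's actual API): counters update once per record as the stream advances, so memory stays constant regardless of input size.

```rust
// Toy single-pass statistics accumulator (illustrative types only,
// not genomicframe-core's real VcfStats).
#[derive(Default)]
struct VariantStats {
    transitions: u64,
    transversions: u64,
    snps: u64,
}

impl VariantStats {
    // Update counters from one (ref, alt) pair; no record is buffered.
    fn observe(&mut self, reference: char, alt: char) {
        self.snps += 1;
        let is_transition = matches!(
            (reference, alt),
            ('A', 'G') | ('G', 'A') | ('C', 'T') | ('T', 'C')
        );
        if is_transition {
            self.transitions += 1;
        } else {
            self.transversions += 1;
        }
    }

    fn ts_tv(&self) -> f64 {
        self.transitions as f64 / self.transversions as f64
    }
}

fn main() {
    let mut stats = VariantStats::default();
    // Stand-in for a streaming reader: each record is visited exactly once.
    for (r, a) in [('A', 'G'), ('C', 'T'), ('A', 'C'), ('G', 'A')] {
        stats.observe(r, a);
    }
    assert_eq!(stats.snps, 4);
    assert_eq!(stats.transitions, 3);
    assert_eq!(stats.transversions, 1);
    println!("Ts/Tv = {:.2}", stats.ts_tv()); // 3.00
}
```

Because the accumulator holds only fixed-size counters, the same pass works identically on 10 thousand or 10 million variants.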
### Scaling to Production Files

| File Size | Variants | genomicframe-core (streaming) | PyVCF | scikit-allel |
|---|---|---|---|---|
| 10 MB | 10,000 | 10 KB | 50 MB | 80 MB |
| 1 GB | 1,000,000 | 10 KB | 5 GB | 8 GB |
| 10 GB | 10,000,000 | 10 KB | OOM | OOM |
Memory usage is flat: the Python tools scale linearly and crash on large files, while the Rust core processes a 10 GB file with the same footprint as a 1 MB file.
Key advantages:
- 10–200x faster than Python equivalents
- 250–800x less memory via streaming architecture
- O(1) memory regardless of file size
- Process 1M+ variants in under 1 second
## Format Support
| Format | Extension | Reader | Stats | Intervals | Gzip | Status |
|---|---|---|---|---|---|---|
| VCF | .vcf, .vcf.gz | ✅ | ✅ | ✅ | ✅ | Production |
| BAM | .bam | ✅ | ✅ | ✅ | ✅ (BGZF) | Production |
| SAM | .sam | ✅ | — | ✅ | — | Production |
| BED | .bed, .bed.gz | ✅ | ✅ | ✅ | ✅ | Production |
| FASTQ | .fastq, .fq, .gz | ✅ | ✅ | — | ✅ | Production |
| FASTA | .fasta, .fa, .gz | ✅ | ✅ | — | ✅ | Production |
| GFF3/GTF | .gff, .gff3, .gtf, .gz | ✅ | — | ✅ | ✅ | Production |
All formats share the same `GenomicRecordIterator` trait and convert to a unified `GenomicInterval` coordinate system.
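The shared trait can be pictured roughly as follows. This is a simplified sketch with assumed names (`RecordIterator`, `VecReader`), not the crate's exact definition:

```rust
// Simplified sketch of a pull-based, fallible record iterator,
// in the spirit of GenomicRecordIterator (names are illustrative).
trait RecordIterator {
    type Record;
    type Error;
    // Returns Ok(None) at end of stream, like a fallible Iterator.
    fn next_record(&mut self) -> Result<Option<Self::Record>, Self::Error>;
}

// Toy in-memory implementation to show the calling convention.
struct VecReader<T> {
    records: std::vec::IntoIter<T>,
}

impl<T> RecordIterator for VecReader<T> {
    type Record = T;
    type Error = std::convert::Infallible;
    fn next_record(&mut self) -> Result<Option<T>, Self::Error> {
        Ok(self.records.next())
    }
}

fn main() {
    let mut reader = VecReader {
        records: vec!["chr1:100", "chr1:200"].into_iter(),
    };
    let mut count = 0;
    while let Some(rec) = reader.next_record().unwrap() {
        println!("record: {rec}");
        count += 1;
    }
    assert_eq!(count, 2);
}
```

A single trait like this is what lets statistics, filtering, and interval conversion be written once and reused across every format-specific reader.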
## Core API

### Streaming (simple, direct)
Every format follows the same pattern with O(1) memory:
```rust
use genomicframe_core::GenomicRecordIterator;
use genomicframe_core::formats::vcf::VcfReader;

let mut reader = VcfReader::from_path("variants.vcf.gz")?;
while let Some(record) = reader.next_record()? {
    // process `record`; previous records are already dropped
}
```
### LogicalPlan (composable, ergonomic)
The LogicalPlan API is the central interface for filtered queries. It compiles filter expressions into format-native operations — no intermediate allocations.
```rust
use genomicframe_core::LogicalPlan;
use genomicframe_core::expression::{col, lit};
use genomicframe_core::FileFormat;
use genomicframe_core::execution::execute;

// Build a lazy query — nothing runs yet
let plan = LogicalPlan::scan("variants.vcf.gz", FileFormat::Vcf)
    .max_scan(50_000)
    .filter(col("qual").gte(lit(30.0)));

// Execute — streaming, filtered at read time
let result = execute(&plan)?;
```
## Statistics
```rust
use genomicframe_core::formats::vcf::{VcfReader, VcfStats};

let mut reader = VcfReader::from_path("variants.vcf.gz")?;

// All computed in one streaming pass
let stats = VcfStats::compute(&mut reader)?;
stats.print_summary();
// (stat field names below are illustrative)
println!("Ts/Tv:  {:.2}", stats.ts_tv_ratio);
println!("SNPs:   {}", stats.snp_count);
println!("Indels: {}", stats.indel_count);
```
For FASTQ, parallel computation is available:
```rust
use genomicframe_core::formats::fastq::FastqStats;

// Multi-threaded, O(threads) memory
let stats = FastqStats::par_compute("reads.fastq.gz")?;
println!("Mean read quality: {:.2}", stats.mean_quality);
```
## Real-World Pipeline Example

The following is drawn from `internal_examples/complete_ecosystem.rs` — a full four-stage genomics pipeline demonstrating how the API composes across formats.
### Stage 1: Pre-Alignment QC
```rust
use genomicframe_core::formats::fastq::FastqStats;
use genomicframe_core::formats::fasta::FastaReader;
use genomicframe_core::GenomicRecordIterator;

// FASTQ quality control (parallel, O(threads) memory)
let stats = FastqStats::par_compute("reads.fastq.gz")?;
println!("Reads:        {}", stats.record_count);
println!("Mean quality: {:.2}", stats.mean_quality);

// Reference FASTA statistics (streaming)
let mut reader = FastaReader::from_path("reference.fa")?;
while let Some(record) = reader.next_record()? {
    // accumulate per-contig length and GC statistics
}
```
### Stage 2: Build Filtered Annotation Databases

`LogicalPlan` acts as the primary interface for loading annotations. A few lines take you from query to indexed annotation database:
```rust
use genomicframe_core::LogicalPlan;
use genomicframe_core::expression::{col, lit};
use genomicframe_core::FileFormat;
use genomicframe_core::execution::execute;

// Large genes only (>= 1000 bp) — filtered during load
let gene_plan = LogicalPlan::scan("annotations.gff3", FileFormat::Gff)
    .filter(col("feature_type").eq(lit("gene")).and(col("length").gte(lit(1000))));
let genes = execute(&gene_plan)?.to_annotation_index()?;

// Chr22 exons only — compound filter with .and()
let exon_plan = LogicalPlan::scan("annotations.gff3", FileFormat::Gff)
    .filter(col("feature_type").eq(lit("exon")).and(col("chrom").eq(lit("chr22"))));
let exons = execute(&exon_plan)?.to_annotation_index()?;

println!("Indexed {} genes, {} exons", genes.len(), exons.len());
```
No manual iteration. No intermediate `Vec`. The filter executes at read time.
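The filter-at-read-time idea can be sketched in plain Rust as a predicate wrapped around an iterator: non-matching records are skipped inside the read loop, so no intermediate collection is ever materialized. This toy `Filtered` adapter is illustrative, not the crate's `filtered_reader` implementation:

```rust
// Toy filter-at-read-time wrapper (illustrative, not the crate's
// filtered_reader). Records failing the predicate are discarded as
// the stream advances; nothing is buffered.
struct Filtered<I, F> {
    inner: I,
    pred: F,
}

impl<I, F, T> Iterator for Filtered<I, F>
where
    I: Iterator<Item = T>,
    F: FnMut(&T) -> bool,
{
    type Item = T;
    fn next(&mut self) -> Option<T> {
        loop {
            let item = self.inner.next()?;
            if (self.pred)(&item) {
                return Some(item);
            }
        }
    }
}

fn main() {
    // Stand-in for feature lengths streamed from an annotation file.
    let lengths = [500u32, 1500, 900, 2200];
    let large: Vec<u32> = Filtered {
        inner: lengths.into_iter(),
        pred: |len: &u32| *len >= 1000,
    }
    .collect();
    assert_eq!(large, vec![1500, 2200]);
}
```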
### Stage 3: BAM Filtering + Annotation
```rust
use genomicframe_core::LogicalPlan;
use genomicframe_core::execution::execute;
use genomicframe_core::FileFormat;
use genomicframe_core::expression::{col, lit};

// Filter high-quality reads and annotate with gene/exon overlaps
let plan = LogicalPlan::scan("aligned.bam", FileFormat::Bam)
    .filter(col("mapq").gte(lit(30)));
let result = execute(&plan)?.annotate_with_genes(&genes, &exons)?;
result.print_summary();
```
### Stage 4: VCF Filtering + Annotation
```rust
// Scan first 50K records, filter high-quality chr22 variants, annotate
let plan = LogicalPlan::scan("variants.vcf.gz", FileFormat::Vcf)
    .max_scan(50_000)
    .filter(col("chrom").eq(lit("chr22")).and(col("qual").gte(lit(30.0))));
let result = execute(&plan)?.annotate_with_genes(&genes, &exons)?;
result.print_summary();
// → genic variants, exonic variants, intergenic counts
```
The same `annotate_with_genes` pattern works across BAM and VCF — `GenomicInterval` unifies the coordinate systems.
## Unified Interval System
All formats convert to `GenomicInterval` for cross-format operations. Coordinate system conversions (1-based ↔ 0-based half-open) happen at the conversion boundary, not in application code:
```rust
// Each conversion handles its format's coordinate convention
let vcf_interval: GenomicInterval = vcf_record.into(); // 1-based → 0-based
let bam_interval: GenomicInterval = bam_record.into(); // 0-based + CIGAR length
let bed_interval: GenomicInterval = bed_record.into(); // 0-based, direct
let gff_interval: GenomicInterval = gff_record.into(); // 1-based inclusive → 0-based half-open

// Cross-format overlap queries
if vcf_interval.overlaps(&gff_interval) {
    // variant falls inside an annotated feature
}
```
`AnnotationIndex` uses an interval tree for O(log n) overlap queries, built from any BED or GFF source.
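The coordinate bookkeeping is small but easy to get wrong, so here is a self-contained sketch with a toy `Interval` type (not the crate's `GenomicInterval`) showing the 1-based inclusive to 0-based half-open conversion and the overlap test:

```rust
// Toy 0-based half-open interval (illustrative, not the crate's type).
#[derive(Debug, PartialEq)]
struct Interval {
    chrom: String,
    start: u64, // inclusive, 0-based
    end: u64,   // exclusive
}

impl Interval {
    // Convert 1-based inclusive coordinates (VCF/GFF convention) to
    // 0-based half-open: subtract 1 from start, leave end unchanged.
    fn from_one_based(chrom: &str, start: u64, end: u64) -> Self {
        Interval { chrom: chrom.to_string(), start: start - 1, end }
    }

    // Half-open overlap: intervals touch iff each starts before the
    // other ends, on the same chromosome.
    fn overlaps(&self, other: &Interval) -> bool {
        self.chrom == other.chrom && self.start < other.end && other.start < self.end
    }
}

fn main() {
    // A GFF exon at 1-based [101, 200] becomes 0-based half-open [100, 200).
    let exon = Interval::from_one_based("chr22", 101, 200);
    assert_eq!((exon.start, exon.end), (100, 200));

    // A SNP at 1-based position 150 becomes [149, 150) and overlaps the exon.
    let snp = Interval::from_one_based("chr22", 150, 150);
    assert!(snp.overlaps(&exon));

    // Adjacent but not overlapping: [200, 250) does not touch [100, 200).
    let next = Interval { chrom: "chr22".into(), start: 200, end: 250 };
    assert!(!next.overlaps(&exon));
}
```

Doing this conversion once at the format boundary is what makes the adjacency case (`end == other.start`) behave identically for every input format.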
## Path to GenomicFrame
genomicframe-core is intentionally structured as two separable tiers:
```text
genomicframe-core (this crate)      genomicframe (future crate)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━      ━━━━━━━━━━━━━━━━━━━━━━━━━━━
Format readers (VCF, BAM, ...)   →  GenomicFrame::scan_vcf(...)
LogicalPlan + expression AST     →  .filter(col("qual") > 30)
ExecutionResult + annotation     →  .annotate_with_genes(...)
AnnotationIndex (interval tree)  →  .join_by_overlap(exons)
Statistics (VcfStats, BamStats)  →  .vcf_stats() / .bam_stats()
                                          ↓
                                    .collect_polars() / .collect_arrow()
```
The `LogicalPlan` and expression modules in this crate are the foundation. genomicframe will add:
- Query optimization — predicate pushdown, filter reordering
- Parallel execution — chunk-based Rayon dispatch
- Index-aware scans — tabix/BAI random access when available
- Arrow/Polars output — for downstream data science tooling
- Auto-detection — `GenomicFrame::scan("file.vcf.gz")` detects format from path
When a `LogicalPlan` is handed to genomicframe, it can inspect the expression tree, detect indexed files, push filters to the read layer, and dispatch chunks across threads — without changing the query authoring API.
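Predicate pushdown at this boundary can be sketched as a walk over the expression tree that splits a conjunction into clauses the reader evaluates natively and clauses that must run after the read. The toy `Expr` AST below is illustrative, not the crate's expression module:

```rust
// Toy expression AST and pushdown split (illustrative only).
#[derive(Clone, Debug, PartialEq)]
enum Expr {
    Gte(String, f64),          // col >= literal
    Eq(String, String),        // col == literal
    And(Box<Expr>, Box<Expr>), // conjunction
}

// Can this hypothetical read layer evaluate the expression natively?
fn pushable(expr: &Expr, native_cols: &[&str]) -> bool {
    match expr {
        Expr::Gte(col, _) | Expr::Eq(col, _) => native_cols.contains(&col.as_str()),
        Expr::And(l, r) => pushable(l, native_cols) && pushable(r, native_cols),
    }
}

// Split a conjunction into (pushed-to-reader, evaluated-after-read).
fn split(expr: Expr, native_cols: &[&str]) -> (Vec<Expr>, Vec<Expr>) {
    match expr {
        Expr::And(l, r) => {
            let (mut pl, mut rl) = split(*l, native_cols);
            let (pr, rr) = split(*r, native_cols);
            pl.extend(pr);
            rl.extend(rr);
            (pl, rl)
        }
        e if pushable(&e, native_cols) => (vec![e], vec![]),
        e => (vec![], vec![e]),
    }
}

fn main() {
    // chrom is filterable at the read layer; custom_score is not.
    let expr = Expr::And(
        Box::new(Expr::Eq("chrom".into(), "chr22".into())),
        Box::new(Expr::Gte("custom_score".into(), 0.9)),
    );
    let (pushed, residual) = split(expr, &["chrom", "qual"]);
    assert_eq!(pushed, vec![Expr::Eq("chrom".into(), "chr22".into())]);
    assert_eq!(residual, vec![Expr::Gte("custom_score".into(), 0.9)]);
}
```

The key property is that the split is purely structural: the query author writes one filter expression, and the engine decides where each clause runs.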
## Target Environments
genomicframe is primarily a Python library compiled from Rust via PyO3. The intended deployment targets are:
- Jupyter notebooks — interactive genomics analysis with Polars/Pandas interop for visualisation and ML
- Nextflow pipelines — drop-in replacement for Python bioinformatics tools in DSL2 process blocks, with no GIL, no OOM risk, and predictable latency
```python
# Future genomicframe Python API (illustrative — names not final)
# Nextflow process: annotate variants
import genomicframe as gf

df = (
    gf.scan("variants.vcf.gz")
      .filter(gf.col("qual") >= 30)
      .annotate_with_genes("genes.gff3")
      .collect_polars()
)
```
The Rust core handles all I/O and compute. Python only touches results.
## Installation

```toml
[dependencies]
genomicframe-core = "0.1"
```
### Optional Features

```toml
[dependencies]
genomicframe-core = { version = "0.1", features = ["async", "cloud"] }
```

- `async` — enable async I/O with Tokio
- `cloud` — S3/GCS/Azure cloud storage via `object_store`
## Project Structure

```text
genomicframe-core/
├── src/
│   ├── lib.rs               # Crate entry point
│   ├── core.rs              # Core traits (GenomicRecordIterator, etc.)
│   ├── error.rs             # Unified error handling
│   ├── expression.rs        # Expression AST (col, lit, eq, gte, and, ...)
│   ├── plan.rs              # LogicalPlan builder
│   ├── execution.rs         # Plan executor → ExecutionResult
│   ├── schema.rs            # FileFormat, schema metadata
│   ├── filters.rs           # Filter traits and combinators
│   ├── filtered_reader.rs   # Filtered iterator wrapper
│   ├── interval/            # GenomicInterval + AnnotationIndex
│   ├── parallel/            # Rayon-based parallel processing
│   ├── stats.rs             # Shared statistics types
│   └── formats/
│       ├── vcf/             # VCF reader, stats, filters, expressions
│       ├── bam/             # BAM reader, stats, filters, expressions
│       ├── fasta/           # FASTA reader
│       ├── fastq/           # FASTQ reader, stats (parallel)
│       ├── bed/             # BED reader, stats
│       └── gff/             # GFF3/GTF reader
├── internal_examples/
│   └── complete_ecosystem.rs  # Full four-stage pipeline demo
├── docs/
│   └── bench/
│       └── BENCHMARKS.md
└── Cargo.toml
```
## Roadmap
| Phase | Focus | Status |
|---|---|---|
| 1a | Format I/O core (VCF, BAM, FASTQ, FASTA, BED, GFF) | ✅ Complete |
| 1b | LogicalPlan + expression + execution engine | 🚧 In progress |
| 1c | Writers, validation, indexed readers (tabix/BAI) | 📋 Planned |
| 2 | genomicframe — lazy engine, optimizer, Arrow/Polars output | 📋 Planned |
| 3 | Python bindings via PyO3 (notebook + Nextflow targets) | 📋 Planned |
| 4 | Cloud integration (S3/GCS/Azure) | 📋 Planned |
| 5 | Query optimization (predicate pushdown, parallel joins) | 📋 Planned |
## Design Principles
- Performance First — Everything measurable should be optimized
- Predictable Memory — No unexpected growth or hidden buffering
- Composable Design — Pipelines compose easily across file types
- Transparency — Execution plans and memory profiles are inspectable
- Open Science — Code clarity > cleverness. Benchmarks are reproducible
## Contributing
Contributions welcome. This is an early-stage research project. Please open an issue to discuss major changes before submitting a PR.
## License
Dual-licensed under MIT OR Apache-2.0.
## Author

Ryan Duffy — 2025

*Performance-driven genomics for the modern bioinformatics era.*