# genomicframe-core → GenomicFrame
High-performance, memory-efficient genomics I/O and interoperability layer written entirely in **Rust**.
## Overview
`genomicframe-core` is the foundation for **GenomicFrame** — a high-performance genomics analytics library written in Rust and exposed to Python via PyO3. The end goal is a first-class Python library for use in Jupyter notebooks and Nextflow pipelines, with Rust-level performance and no CPython bottlenecks. This project addresses the limitations of existing Python-based bioinformatics libraries by providing:
- **Blazing-fast I/O** for VCF, BAM, SAM, FASTA, FASTQ, BED, and GFF/GTF formats
- **Memory efficiency** through zero-copy streaming and O(1) memory regardless of file size
- **Composable pipelines** with a `LogicalPlan` query API and lazy evaluation
- **Rust-level performance** with safe, concurrent execution via Rayon
---
## Performance
These numbers are measured — not extrapolated. See [BENCHMARKS.md](docs/bench/BENCHMARKS.md) for full methodology and reproduction steps.
### VCF Processing (Criterion.rs benchmarks)
| Operation | genomicframe-core | Baseline A | Baseline B | Baseline C |
|---|---|---|---|---|
| Parse 10K variants | **4.5 ms** | 850 ms | 65 ms | 120 ms |
| Compute full statistics | **6.1 ms** | 1,200 ms | N/A | 200 ms |
| Peak memory | **~100 KB** | ~50 MB | ~25 MB | ~80 MB |
Throughput: **~1.6 million variants/second** with full statistics (Ts/Tv, variant classification, quality, filter distribution) computed in a **single streaming pass** — no reopen, no second loop.
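That single pass can be pictured with a minimal accumulator. The sketch below is illustrative only (the real `VcfStats` tracks more, e.g. the filter distribution); the point is that each record updates every statistic exactly once, so nothing is buffered or re-read:

```rust
/// Minimal single-pass statistics accumulator (illustrative sketch,
/// not the crate's actual `VcfStats` internals).
#[derive(Default)]
struct StreamingStats {
    transitions: u64,
    transversions: u64,
    qual_sum: f64,
    count: u64,
}

impl StreamingStats {
    /// Fold one SNV into every statistic at once — O(1) work per record.
    fn update(&mut self, ref_base: char, alt_base: char, qual: f64) {
        let is_purine = |b: char| b == 'A' || b == 'G';
        if is_purine(ref_base) == is_purine(alt_base) {
            self.transitions += 1; // A<->G or C<->T
        } else {
            self.transversions += 1;
        }
        self.qual_sum += qual;
        self.count += 1;
    }

    fn ts_tv_ratio(&self) -> f64 {
        self.transitions as f64 / self.transversions as f64
    }

    fn mean_quality(&self) -> f64 {
        self.qual_sum / self.count as f64
    }
}
```

Because every statistic is an incremental fold, the file is read exactly once and peak memory stays constant no matter how many variants stream through.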
### Scaling to Production Files
| File Size | Variants | genomicframe-core (peak) | Baseline A (peak) | Baseline B (peak) |
|---|---|---|---|---|
| 10 MB | 10,000 | **10 KB** | 50 MB | 80 MB |
| 1 GB | 1,000,000 | **10 KB** | 5 GB | 8 GB |
| 10 GB | 10,000,000 | **10 KB** | OOM | OOM |
Memory usage is **flat**. The Python baselines scale linearly with file size and eventually OOM. The Rust readers process a 10 GB file with the same footprint as a 1 MB file.
**Key advantages:**
- **10–200x faster** than Python equivalents
- **250–800x less memory** via streaming architecture
- **O(1) memory** regardless of file size
- Process **1M+ variants in under 1 second**
---
## Format Support
| Format | Extensions | Read | Stats | LogicalPlan | Compressed | Status |
|---|---|---|---|---|---|---|
| **VCF** | `.vcf`, `.vcf.gz` | ✅ | ✅ | ✅ | ✅ | Production |
| **BAM** | `.bam` | ✅ | ✅ | ✅ | ✅ (BGZF) | Production |
| **SAM** | `.sam` | ✅ | — | ✅ | — | Production |
| **BED** | `.bed`, `.bed.gz` | ✅ | ✅ | ✅ | ✅ | Production |
| **FASTQ** | `.fastq`, `.fq`, `.gz` | ✅ | ✅ | — | ✅ | Production |
| **FASTA** | `.fasta`, `.fa`, `.gz` | ✅ | ✅ | — | ✅ | Production |
| **GFF3/GTF** | `.gff`, `.gff3`, `.gtf`, `.gz` | ✅ | — | ✅ | ✅ | Production |
All formats share the same `GenomicRecordIterator` trait and convert to a unified `GenomicInterval` coordinate system.
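Because every reader implements the same trait, format-generic utilities fall out for free. The sketch below uses a simplified stand-in for the trait (the crate's actual `GenomicRecordIterator` has a richer record type and error type); the shape of the idea is what matters:

```rust
/// Simplified stand-in for the crate's `GenomicRecordIterator` trait.
trait RecordIterator {
    type Record;
    fn next_record(&mut self) -> Result<Option<Self::Record>, String>;
}

/// Works unchanged for VCF, BAM, BED, ... — anything implementing the trait.
fn count_records<I: RecordIterator>(reader: &mut I) -> Result<u64, String> {
    let mut n = 0;
    while let Some(_record) = reader.next_record()? {
        n += 1;
    }
    Ok(n)
}

/// Toy in-memory "reader" used to exercise the generic function.
struct VecReader {
    records: Vec<u32>,
}

impl RecordIterator for VecReader {
    type Record = u32;
    fn next_record(&mut self) -> Result<Option<u32>, String> {
        Ok(self.records.pop())
    }
}
```

Any function written against the trait — counting, filtering, statistics — automatically works across all seven formats.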
---
## Core API
### Streaming (simple, direct)
Every format follows the same pattern with O(1) memory:
```rust
use genomicframe_core::core::GenomicRecordIterator;
use genomicframe_core::formats::vcf::VcfReader;
let mut reader = VcfReader::from_path("variants.vcf.gz")?;
while let Some(record) = reader.next_record()? {
    if record.is_pass() && record.qual.unwrap_or(0.0) > 30.0 {
        println!("{}\t{}\t{} -> {:?}", record.chrom, record.pos, record.reference, record.alt);
    }
}
```
### LogicalPlan (composable, ergonomic)
The `LogicalPlan` API is the central interface for filtered queries. It compiles filter expressions into format-native operations — no intermediate allocations.
```rust
use genomicframe_core::plan::LogicalPlan;
use genomicframe_core::expression::{col, lit};
use genomicframe_core::schema::FileFormat;
use genomicframe_core::execution::execute;
// Build a lazy query — nothing runs yet
let plan = LogicalPlan::scan("variants.vcf.gz", FileFormat::Vcf)
    .max_scan(100_000)
    .filter(
        col("qual").gte(lit(30.0))
            .and(col("chrom").eq(lit("chr22")))
    );
// Execute — streaming, filtered at read time
let result = execute(plan)?;
```
### Statistics
```rust
use genomicframe_core::formats::vcf::{VcfReader, VcfStats};
let mut reader = VcfReader::from_path("variants.vcf.gz")?;
// All computed in one streaming pass
let stats = VcfStats::compute(&mut reader)?;
stats.print_summary();
println!("Ts/Tv ratio: {:.3}", stats.ts_tv_ratio().unwrap());
println!("SNP count: {}", stats.snp_count());
println!("Mean quality: {:.2}", stats.mean_quality().unwrap());
```
For FASTQ, parallel computation is available:
```rust
use genomicframe_core::formats::fastq::FastqStats;
// Multi-threaded, O(threads) memory
let stats = FastqStats::par_compute("reads.fastq.gz", None)?;
println!("Q30+ reads: {} ({:.1}%)", stats.high_quality_reads_q30,
    (stats.high_quality_reads_q30 as f64 / stats.total_reads as f64) * 100.0);
```
---
## Real-World Pipeline Example
The following is drawn from [`internal_examples/complete_ecosystem.rs`](internal_examples/complete_ecosystem.rs) — a full four-stage genomics pipeline demonstrating how the API composes across formats.
### Stage 1: Pre-Alignment QC
```rust
use genomicframe_core::formats::fastq::FastqStats;
use genomicframe_core::formats::fasta::reader::FastaReader;
use genomicframe_core::core::GenomicRecordIterator;
// FASTQ quality control (parallel, O(threads) memory)
let stats = FastqStats::par_compute("reads.fastq.gz", None)?;
println!("Mean quality: {:.1}", stats.mean_quality().unwrap_or(0.0));
println!("Q30+ reads: {:.1}%",
    (stats.high_quality_reads_q30 as f64 / stats.total_reads as f64) * 100.0);

// Reference FASTA statistics (streaming)
let mut reader = FastaReader::from_path("reference.fa")?;
while let Some(record) = reader.next_record()? {
    // Compute GC%, N-content per sequence
}
```
### Stage 2: Build Filtered Annotation Databases
`LogicalPlan` acts as the primary interface for loading annotations. A few lines take you from raw file to a filtered, indexed annotation set:
```rust
use genomicframe_core::plan::LogicalPlan;
use genomicframe_core::expression::{col, lit};
use genomicframe_core::schema::FileFormat;
use genomicframe_core::execution::execute;
// Large genes only (>= 1000 bp) — filtered during load
let genes = execute(
    LogicalPlan::scan("genes.bed", FileFormat::Bed)
        .filter(col("length").gte(lit(1000)))
)?;

// Chr22 exons only — compound filter with .and()
let exons = execute(
    LogicalPlan::scan("exons.bed", FileFormat::Bed)
        .filter(col("chrom").eq(lit("chr22")).and(col("length").gte(lit(50))))
)?;

println!("Loaded {} genes, {} exons", genes.len(), exons.len());
```
No manual iteration. No intermediate `Vec`. The filter executes at read time.
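One way to picture read-time filtering is as an iterator adapter that applies the compiled predicate as each record is produced, so rejected records are discarded before anything is collected. This is a hedged sketch of the idea behind `filtered_reader.rs`, not its actual code:

```rust
/// Applies a predicate while streaming — rejected records are dropped
/// at the source and never reach a collection.
struct FilteredIter<I, F> {
    inner: I,
    predicate: F,
}

impl<I, T, F> Iterator for FilteredIter<I, F>
where
    I: Iterator<Item = T>,
    F: Fn(&T) -> bool,
{
    type Item = T;

    fn next(&mut self) -> Option<T> {
        // Pull from the source until one record passes the predicate.
        while let Some(item) = self.inner.next() {
            if (self.predicate)(&item) {
                return Some(item);
            }
        }
        None
    }
}
```

The memory cost is one record at a time regardless of how many records the source file holds, which is why the filtered load above never builds an intermediate `Vec`.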
### Stage 3: BAM Filtering + Annotation
```rust
use genomicframe_core::plan::LogicalPlan;
use genomicframe_core::execution::execute;
use genomicframe_core::schema::FileFormat;
use genomicframe_core::expression::{col, lit};
// Filter high-quality reads and annotate with gene/exon overlaps
let result = execute(
    LogicalPlan::scan("alignments.bam", FileFormat::Bam)
        .filter(
            col("mapq").gte(lit(30))
                .and(col("refid").eq(lit("22")))
        )
)?.annotate_with_genes(&genes, &exons)?;

result.print_summary();
```
### Stage 4: VCF Filtering + Annotation
```rust
// Scan first 50K records, filter high-quality chr22 variants, annotate
let result = execute(
    LogicalPlan::scan("variants.vcf.gz", FileFormat::Vcf)
        .max_scan(50_000)
        .filter(
            col("qual").gte(lit(30.0))
                .and(col("chrom").eq(lit("22")))
        )
)?.annotate_with_genes(&genes, &exons)?;

result.print_summary();
// → genic variants, exonic variants, intergenic counts
```
The same `annotate_with_genes` pattern works across BAM and VCF — `GenomicInterval` unifies the coordinate systems.
---
## Unified Interval System
All formats convert to `GenomicInterval` for cross-format operations. Coordinate system conversions (1-based ↔ 0-based half-open) happen at the conversion boundary, not in application code:
```rust
// Each conversion handles its format's coordinate convention
let vcf_interval: GenomicInterval = (&vcf_record).into(); // 1-based → 0-based
let bam_interval: GenomicInterval = (&bam_record).into(); // 0-based + CIGAR length
let bed_interval: GenomicInterval = (&bed_record).into(); // 0-based, direct
let gff_interval: GenomicInterval = (&gff_record).into(); // 1-based inclusive → 0-based half-open
// Cross-format overlap queries
if vcf_interval.overlaps(&bed_interval) {
    println!("Variant falls in annotated region");
}
```
`AnnotationIndex` uses an interval tree for O(log n) overlap queries, built from any BED or GFF source.
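The overlap semantics above rest on normalizing everything to 0-based half-open coordinates first. A self-contained sketch of that convention (field and method names here are illustrative, not the crate's exact `GenomicInterval` API):

```rust
/// 0-based, half-open interval — the convention all formats normalize to.
#[derive(Debug, PartialEq)]
struct Interval {
    chrom: String,
    start: u64, // inclusive
    end: u64,   // exclusive
}

impl Interval {
    /// Half-open overlap: same chromosome and strictly crossing ranges.
    /// Intervals that merely touch at an endpoint do NOT overlap.
    fn overlaps(&self, other: &Interval) -> bool {
        self.chrom == other.chrom && self.start < other.end && other.start < self.end
    }

    /// Convert a 1-based inclusive pair (VCF/GFF style) to 0-based half-open:
    /// start shifts down by one, end stays (inclusive end == exclusive end - 1 + 1).
    fn from_one_based(chrom: &str, start: u64, end: u64) -> Interval {
        Interval { chrom: chrom.to_string(), start: start - 1, end }
    }
}
```

Pushing the conversion into one constructor is what keeps off-by-one errors out of application code: every downstream comparison sees a single convention.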
---
## Path to GenomicFrame
`genomicframe-core` is intentionally structured as two separable tiers:
```
genomicframe-core (this crate)       genomicframe (future crate)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━       ━━━━━━━━━━━━━━━━━━━━━━━━━━━
Format readers (VCF, BAM, ...)   →   GenomicFrame::scan_vcf(...)
LogicalPlan + expression AST     →   .filter(col("qual") > 30)
ExecutionResult + annotation     →   .annotate_with_genes(...)
AnnotationIndex (interval tree)  →   .join_by_overlap(exons)
Statistics (VcfStats, BamStats)  →   .vcf_stats() / .bam_stats()
                                     ↓
                                     .collect_polars() / .collect_arrow()
```
The `LogicalPlan` and `expression` modules in this crate are the foundation. `genomicframe` will add:
- **Query optimization** — predicate pushdown, filter reordering
- **Parallel execution** — chunk-based Rayon dispatch
- **Index-aware scans** — tabix/BAI random access when available
- **Arrow/Polars output** — for downstream data science tooling
- **Auto-detection** — `GenomicFrame::scan("file.vcf.gz")` detects format from path
When a `LogicalPlan` is handed to `genomicframe`, it can inspect the expression tree, detect indexed files, push filters to the read layer, and dispatch chunks across threads — without changing the query authoring API.
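The key enabler is that a filter expression is plain data. A minimal sketch of how an optimizer might split a compound filter into independently pushable predicates (this enum is illustrative, not the crate's real `expression` AST):

```rust
/// Tiny expression AST in the spirit of the crate's `expression` module
/// (names and shape are illustrative, not the real API).
#[derive(Debug, PartialEq)]
enum Expr {
    Col(String),
    And(Box<Expr>, Box<Expr>),
    Gte(Box<Expr>, f64),
}

/// Flatten an `And` tree into its conjuncts, so an executor can decide
/// per-predicate whether the read layer (e.g. a tabix index) can apply it.
fn conjuncts(expr: &Expr) -> Vec<&Expr> {
    match expr {
        Expr::And(l, r) => {
            let mut out = conjuncts(l);
            out.extend(conjuncts(r));
            out
        }
        other => vec![other],
    }
}
```

With the conjuncts in hand, an index-aware scan can claim the predicates it supports (say, a chromosome equality) and leave the rest for the streaming filter, all without changing how the query was authored.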
### Target Environments
`genomicframe` is primarily a **Python library** compiled from Rust via PyO3. The intended deployment targets are:
- **Jupyter notebooks** — interactive genomics analysis with Polars/Pandas interop for visualization and ML
- **Nextflow pipelines** — drop-in replacement for Python bioinformatics tools in DSL2 process blocks, with no GIL, no OOM risk, and predictable latency
```python
# Future genomicframe Python API
import genomicframe as gf
# Nextflow process: annotate variants
variants = (
    gf.scan("variants.vcf.gz")               # auto-detects VCF
      .filter("qual >= 30 AND chrom == '22'")
      .annotate(gf.scan("exons.bed"))        # interval overlap join
      .to_polars()                           # hand off to Polars for reporting
)
```
The Rust core handles all I/O and compute. Python only touches results.
---
## Installation
```toml
[dependencies]
genomicframe-core = "0.1"
```
### Optional Features
```toml
[dependencies]
genomicframe-core = { version = "0.1", features = ["async", "cloud"] }
```
- `async` — Enable async I/O with Tokio
- `cloud` — S3/GCS/Azure cloud storage via `object_store`
---
## Project Structure
```
genomicframe-core/
├── src/
│   ├── lib.rs               # Crate entry point
│   ├── core.rs              # Core traits (GenomicRecordIterator, etc.)
│   ├── error.rs             # Unified error handling
│   ├── expression.rs        # Expression AST (col, lit, eq, gte, and, ...)
│   ├── plan.rs              # LogicalPlan builder
│   ├── execution.rs         # Plan executor → ExecutionResult
│   ├── schema.rs            # FileFormat, schema metadata
│   ├── filters.rs           # Filter traits and combinators
│   ├── filtered_reader.rs   # Filtered iterator wrapper
│   ├── interval/            # GenomicInterval + AnnotationIndex
│   ├── parallel/            # Rayon-based parallel processing
│   ├── stats.rs             # Shared statistics types
│   └── formats/
│       ├── vcf/             # VCF reader, stats, filters, expressions
│       ├── bam/             # BAM reader, stats, filters, expressions
│       ├── fasta/           # FASTA reader
│       ├── fastq/           # FASTQ reader, stats (parallel)
│       ├── bed/             # BED reader, stats
│       └── gff/             # GFF3/GTF reader
├── internal_examples/
│   └── complete_ecosystem.rs   # Full four-stage pipeline demo
├── docs/
│   └── bench/
│       └── BENCHMARKS.md
└── Cargo.toml
```
---
## Roadmap
| Phase | Scope | Status |
|---|---|---|
| 1a | Format I/O core (VCF, BAM, FASTQ, FASTA, BED, GFF) | ✅ Complete |
| 1b | LogicalPlan + expression + execution engine | 🚧 In progress |
| 1c | Writers, validation, indexed readers (tabix/BAI) | 📋 Planned |
| 2 | `genomicframe` — lazy engine, optimizer, Arrow/Polars output | 📋 Planned |
| 3 | Python bindings via PyO3 (notebook + Nextflow targets) | 📋 Planned |
| 4 | Cloud integration (S3/GCS/Azure) | 📋 Planned |
| 5 | Query optimization (predicate pushdown, parallel joins) | 📋 Planned |
---
## Design Principles
1. **Performance First** — Everything measurable should be optimized
2. **Predictable Memory** — No unexpected growth or hidden buffering
3. **Composable Design** — Pipelines compose easily across file types
4. **Transparency** — Execution plans and memory profiles are inspectable
5. **Open Science** — Code clarity > cleverness. Benchmarks are reproducible
---
## Contributing
Contributions welcome. This is an early-stage research project. Please open an issue to discuss major changes before submitting a PR.
## License
Dual-licensed under MIT OR Apache-2.0.
## Author
Ryan Duffy — 2025
*Performance-driven genomics for the modern bioinformatics era.*