# GenomicFrame Development Roadmap
## Vision
Build a two-tier architecture for high-performance genomics:
1. **`genomicframe-core`** - Low-level I/O and domain logic (this repository)
2. **`genomicframe`** - High-level query engine and ergonomic API (future repository)
This separation ensures:
- ✅ `genomicframe-core` stays lean, focused on I/O performance
- ✅ `genomicframe` provides convenience without bloating core library
- ✅ Users can choose: fast I/O only OR full query engine
- ✅ Clear upgrade path: start with `genomicframe-core` for fast I/O, add `genomicframe` when needed
---
## Phase 1: Complete `genomicframe-core` Foundation
**Goal:** Production-ready I/O library with comprehensive format support and domain-aware optimizations.
### 1.1 VCF Completion ✅→🚧
**Current Status:**
- ✅ Streaming reader with O(1) memory
- ✅ Header parsing and metadata extraction
- ✅ Comprehensive statistics (Ts/Tv, variant types, quality, etc.)
- ✅ Multi-sample support
- ✅ Gzip compression support
**Remaining Work:**
#### A. Validation (`src/formats/vcf/validation.rs`)
```rust
pub struct VcfValidator {
/// Validation rules to apply
rules: Vec<ValidationRule>,
/// Whether to fail fast or collect all errors
strict: bool,
}
pub enum ValidationRule {
/// Check chromosome names match contig headers
ValidateChromosomes,
/// Ensure positions are sorted within chromosomes
CheckSorted,
/// Validate REF/ALT alleles are valid DNA sequences
ValidateAlleles,
/// Check quality scores are in valid range
ValidateQuality,
/// Ensure sample genotypes match FORMAT field
ValidateGenotypes,
/// Check INFO field values match their declared types
ValidateInfoTypes,
}
impl VcfValidator {
pub fn validate_record(&self, record: &VcfRecord) -> ValidationResult;
pub fn validate_stream(&self, reader: &mut VcfReader) -> Vec<ValidationError>;
}
```
**Use Cases:**
- Pre-publication data QC
- Pipeline debugging (catch malformed files early)
- Format compliance checking
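A minimal sketch of what the `ValidateAlleles` rule might check; the helper names are illustrative, and this only covers unambiguous bases plus the missing marker and symbolic alleles:

```rust
/// Sketch of the `ValidateAlleles` rule. REF must be a non-empty run of
/// unambiguous DNA bases (case-insensitive); ALT additionally allows the
/// missing-allele marker "." and symbolic alleles like "<DEL>".
fn is_valid_ref_allele(allele: &str) -> bool {
    !allele.is_empty()
        && allele
            .bytes()
            .all(|b| matches!(b.to_ascii_uppercase(), b'A' | b'C' | b'G' | b'T' | b'N'))
}

fn is_valid_alt_allele(allele: &str) -> bool {
    allele == "."                                           // missing allele
        || (allele.starts_with('<') && allele.ends_with('>')) // symbolic allele
        || is_valid_ref_allele(allele)                        // plain sequence
}
```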
#### B. Filtering (`src/formats/vcf/filters.rs`)
```rust
/// Composable filter predicates
pub trait VcfFilter: Send + Sync {
fn test(&self, record: &VcfRecord) -> bool;
}
// Quality-based filters
pub struct QualityFilter { min_qual: f64 }
pub struct PassFilter; // Only PASS variants
pub struct DepthFilter { min_dp: u32 }
// Genomic region filters
pub struct RegionFilter { intervals: Vec<GenomicInterval> }
pub struct ChromosomeFilter { chroms: HashSet<String> }
// Variant type filters
pub struct SnpOnlyFilter;
pub struct IndelOnlyFilter;
pub struct BiAllelicFilter; // Exclude multi-allelic
// Allele frequency filters
pub struct MinorAlleleFrequency { min_maf: f64, max_maf: f64 }
// Combinator filters
pub struct AndFilter { filters: Vec<Box<dyn VcfFilter>> }
pub struct OrFilter { filters: Vec<Box<dyn VcfFilter>> }
pub struct NotFilter { inner: Box<dyn VcfFilter> }
/// Ergonomic filter builder
impl VcfReader {
pub fn with_filter<F: VcfFilter + 'static>(self, filter: F) -> FilteredReader<Self>;
}
```
**Example Usage:**
```rust
let filtered = VcfReader::from_path("variants.vcf.gz")?
.with_filter(PassFilter)
.with_filter(QualityFilter { min_qual: 30.0 })
.with_filter(RegionFilter::from_bed("exons.bed")?);
// Filters applied during iteration - no extra memory
for record in filtered {
// Only see records matching all filters
}
```
**Implementation Note:**
- Filters compose via trait objects; per-record dynamic dispatch is cheap relative to parsing, and static-dispatch combinators can be added later where profiling justifies it
- Apply during parsing when possible (skip records before allocating)
- Enable **predicate pushdown** for Phase 2
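To make the composition concrete, here is a self-contained sketch of the combinator pattern on a pared-down record type (`MiniRecord` and its fields are illustrative stand-ins, not the real `VcfRecord`):

```rust
/// Pared-down record for illustration only.
struct MiniRecord {
    qual: f64,
    filter: String,
}

trait VcfFilter: Send + Sync {
    fn test(&self, record: &MiniRecord) -> bool;
}

struct QualityFilter { min_qual: f64 }
impl VcfFilter for QualityFilter {
    fn test(&self, r: &MiniRecord) -> bool { r.qual >= self.min_qual }
}

struct PassFilter;
impl VcfFilter for PassFilter {
    fn test(&self, r: &MiniRecord) -> bool { r.filter == "PASS" }
}

/// A record passes `AndFilter` only if every child filter accepts it.
struct AndFilter { filters: Vec<Box<dyn VcfFilter>> }
impl VcfFilter for AndFilter {
    fn test(&self, r: &MiniRecord) -> bool {
        self.filters.iter().all(|f| f.test(r))
    }
}
```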
#### C. Writer Enhancement (`src/formats/vcf/writer.rs`)
```rust
pub struct VcfWriter {
writer: Box<dyn Write>,
header: VcfHeader,
compression: Option<CompressionLevel>,
}
impl VcfWriter {
pub fn from_path<P: AsRef<Path>>(path: P) -> Result<Self>;
pub fn with_compression(self, level: CompressionLevel) -> Self;
pub fn write_header(&mut self) -> Result<()>;
pub fn write_record(&mut self, record: &VcfRecord) -> Result<()>;
pub fn write_records(&mut self, records: &[VcfRecord]) -> Result<()>;
}
```
**Key Features:**
- Automatic bgzip compression for `.vcf.bgz`
- Header validation before writing
- Efficient batch writing
---
### 1.2 BAM/SAM/CRAM Support
**Priority:** High (critical for alignment data)
#### Reader (`src/formats/bam/reader.rs`)
```rust
pub struct BamRecord {
pub qname: String, // Query name
pub flag: u16, // SAM flags
pub rname: String, // Reference name
pub pos: u64, // 1-based position
pub mapq: u8, // Mapping quality
pub cigar: Vec<CigarOp>, // CIGAR string
pub sequence: Vec<u8>, // DNA sequence
pub qualities: Vec<u8>, // Phred quality scores
pub tags: HashMap<String, TagValue>, // Optional fields
}
pub struct BamReader {
reader: Box<dyn BufRead>,
header: BamHeader,
index: Option<BaiIndex>, // .bai file for random access
}
impl BamReader {
pub fn from_path<P: AsRef<Path>>(path: P) -> Result<Self>;
pub fn with_index<P: AsRef<Path>>(self, index: P) -> Result<Self>;
// Random access (requires index)
pub fn fetch(&mut self, interval: &GenomicInterval) -> Result<BamIterator>;
}
```
#### Statistics (`src/formats/bam/stats.rs`)
```rust
pub struct BamStats {
pub total_reads: usize,
pub mapped_reads: usize,
pub unmapped_reads: usize,
pub properly_paired: usize,
pub duplicates: usize,
// Quality metrics
pub mean_mapq: f64,
pub mapq_distribution: CategoryCounter<u8>,
// Coverage statistics
pub mean_coverage: f64,
pub coverage_histogram: Vec<(u32, usize)>, // (depth, count)
// Per-chromosome stats
pub reads_per_chrom: CategoryCounter<String>,
}
```
**Domain-Specific Filters:**
- Mapping quality threshold
- Proper pair filtering
- Duplicate removal
- Primary alignment only
- Read length filters
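Most of these filters reduce to bit tests on the SAM `flag` field. A sketch using the standard flag bits from the SAM specification (the helper names are illustrative):

```rust
// Flag bits as defined in the SAM specification.
const FLAG_PAIRED: u16 = 0x1;
const FLAG_PROPER_PAIR: u16 = 0x2;
const FLAG_UNMAPPED: u16 = 0x4;
const FLAG_SECONDARY: u16 = 0x100;
const FLAG_DUPLICATE: u16 = 0x400;
const FLAG_SUPPLEMENTARY: u16 = 0x800;

fn is_mapped(flag: u16) -> bool { flag & FLAG_UNMAPPED == 0 }

fn is_properly_paired(flag: u16) -> bool {
    // The proper-pair bit is only meaningful when the paired bit is set.
    flag & FLAG_PAIRED != 0 && flag & FLAG_PROPER_PAIR != 0
}

fn is_primary(flag: u16) -> bool {
    // Primary = neither secondary nor supplementary.
    flag & (FLAG_SECONDARY | FLAG_SUPPLEMENTARY) == 0
}

fn is_duplicate(flag: u16) -> bool { flag & FLAG_DUPLICATE != 0 }
```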
---
### 1.3 FASTQ Support
**Priority:** Medium (sequence data preprocessing)
#### Reader (`src/formats/fastq/reader.rs`)
```rust
pub struct FastqRecord {
pub id: String, // Sequence identifier
pub description: Option<String>,
pub sequence: Vec<u8>, // DNA sequence
pub qualities: Vec<u8>, // Phred quality scores
}
pub struct FastqReader {
reader: Box<dyn BufRead>,
compression: Compression,
}
// Streaming by default
impl GenomicRecordIterator for FastqReader {
type Record = FastqRecord;
fn next_record(&mut self) -> Result<Option<Self::Record>>;
}
```
#### Statistics (`src/formats/fastq/stats.rs`)
```rust
pub struct FastqStats {
pub total_reads: usize,
pub total_bases: usize,
pub mean_length: f64,
pub length_distribution: Vec<(usize, usize)>, // (length, count)
// Quality metrics
pub mean_quality: f64,
pub quality_per_position: Vec<RunningStats>, // Per-base quality
// Sequence composition
pub gc_content: f64,
pub n_content: f64,
pub base_distribution: CategoryCounter<u8>,
}
```
**Domain-Specific Operations:**
- Quality trimming
- Adapter removal
- Length filtering
- N-base filtering
- Quality score recalibration
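As one example, quality trimming from the 3' end can be sketched as follows, assuming the stored qualities are ASCII-encoded with the standard Phred+33 offset (the function name is illustrative):

```rust
/// Sketch of 3'-end quality trimming: drop trailing bases whose Phred
/// score (ASCII, +33 offset as in standard FASTQ) falls below `min_q`.
/// Returns the trimmed length; slicing the record is left to the caller.
fn trim_3prime_len(qualities: &[u8], min_q: u8) -> usize {
    let mut len = qualities.len();
    while len > 0 && qualities[len - 1].saturating_sub(33) < min_q {
        len -= 1;
    }
    len
}
```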
---
### 1.4 GFF/GTF Support
**Priority:** Medium (gene annotations)
#### Reader (`src/formats/gff/reader.rs`)
```rust
pub struct GffRecord {
pub seqid: String, // Chromosome
pub source: String, // Annotation source
pub feature: String, // Feature type (gene, exon, CDS, etc.)
pub start: u64, // 1-based start
pub end: u64, // 1-based end (inclusive)
pub score: Option<f64>,
pub strand: Strand, // +, -, .
pub phase: Option<u8>, // CDS frame
pub attributes: HashMap<String, String>, // Key-value pairs
}
impl GffRecord {
pub fn interval(&self) -> GenomicInterval;
pub fn get_attribute(&self, key: &str) -> Option<&str>;
}
```
#### Operations
- Filter by feature type (gene, exon, etc.)
- Extract transcript/gene relationships
- Interval overlap queries
- Convert to BED format
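Attribute handling underpins most of these operations. A sketch of GFF3 column-9 parsing, assuming well-formed semicolon-separated `key=value` pairs (percent-decoding of reserved characters is omitted here):

```rust
use std::collections::HashMap;

/// Sketch of GFF3 attribute parsing: "ID=exon00001;Parent=transcript1"
/// becomes a key -> value map. Malformed pairs are silently skipped.
fn parse_gff3_attributes(field: &str) -> HashMap<String, String> {
    field
        .split(';')
        .filter_map(|pair| {
            let (k, v) = pair.trim().split_once('=')?;
            Some((k.to_string(), v.to_string()))
        })
        .collect()
}
```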
---
### 1.5 Cross-Format Optimizations
**Shared Infrastructure:**
#### A. Parallel Iteration with Rayon
```rust
// Blanket extension: available on every reader that implements
// GenomicRecordIterator (a bare `impl GenomicRecordIterator` would not
// compile, so this is expressed as an extension trait)
pub trait ParallelRecords: GenomicRecordIterator + Sized + Send
where
    Self::Record: Send,
{
    fn par_chunks(self, chunk_size: usize) -> ParallelChunkedIterator<Self> {
        ParallelChunkedIterator {
            inner: self,
            chunk_size,
            pool: rayon::ThreadPoolBuilder::new().build().unwrap(),
        }
    }
}

impl<I> ParallelRecords for I
where
    I: GenomicRecordIterator + Sized + Send,
    I::Record: Send,
{}
// Usage: Process VCF in parallel
let stats: Vec<VcfStats> = reader
.par_chunks(10_000)
.map(|chunk| {
let mut stats = VcfStats::new();
for record in chunk {
stats.update(&record);
}
stats
})
.collect();
// Merge thread-local stats
let final_stats = merge_stats(stats);
```
**Key Insight:** This enables **data parallelism** without changing the reader API.
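The chunked pattern relies on statistics that merge associatively; `merge_stats` is left undefined above. A toy sketch of the idea (the `RunningStats` shape here is illustrative, not the real stats types):

```rust
/// Toy running statistics that merge associatively, which is the property
/// the `merge_stats` step in the parallel example depends on.
#[derive(Default, Clone, Copy)]
struct RunningStats {
    count: usize,
    sum: f64,
}

impl RunningStats {
    fn update(&mut self, x: f64) {
        self.count += 1;
        self.sum += x;
    }
    /// Combining two partial results gives the same answer as one pass.
    fn merge(mut self, other: RunningStats) -> RunningStats {
        self.count += other.count;
        self.sum += other.sum;
        self
    }
    fn mean(&self) -> f64 { self.sum / self.count as f64 }
}

fn merge_stats(parts: Vec<RunningStats>) -> RunningStats {
    parts.into_iter().fold(RunningStats::default(), RunningStats::merge)
}
```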
#### B. Predicate Pushdown Foundation
```rust
/// Trait for filters that can be applied during parsing
pub trait PushdownFilter<R>: VcfFilter {
/// Can this filter be evaluated during parsing?
fn is_pushdown_eligible(&self) -> bool;
/// Apply filter to raw line before full parse
fn test_raw(&self, line: &str) -> bool;
}
// Example: a quality filter can check the QUAL column without a full parse
impl PushdownFilter<VcfRecord> for QualityFilter {
    fn is_pushdown_eligible(&self) -> bool { true }
    fn test_raw(&self, line: &str) -> bool {
        // QUAL is the 6th tab-separated column; iterate instead of
        // collecting into a Vec to avoid an allocation per line
        match line.split('\t').nth(5) {
            Some(q) => q.parse::<f64>().map_or(true, |qual| qual >= self.min_qual),
            // Missing/malformed column: keep the record and let the full
            // parser decide what to do with it
            None => true,
        }
    }
}
```
**Impact:** Skip parsing records that will be filtered anyway → **2-3x speedup** on filtered queries
#### C. Index Support Infrastructure
```rust
// Shared index trait
pub trait GenomicIndex {
fn fetch_offsets(&self, interval: &GenomicInterval) -> Result<Vec<u64>>;
}
// Tabix for VCF/GFF/BED
pub struct TabixIndex {
path: PathBuf,
index_data: HashMap<String, IntervalTree>,
}
// BAI for BAM
pub struct BaiIndex {
path: PathBuf,
bins: HashMap<u32, Vec<Chunk>>,
}
// Add to readers
impl VcfReader {
pub fn with_index<P: AsRef<Path>>(self, index: P) -> Result<IndexedVcfReader>;
}
impl IndexedVcfReader {
pub fn fetch(&mut self, interval: &GenomicInterval) -> Result<VcfIterator>;
}
```
**Use Case:**
```rust
let mut reader = VcfReader::from_path("variants.vcf.bgz")?
.with_index("variants.vcf.bgz.tbi")?;
// Seek directly to chr1:1000000-2000000
for record in reader.fetch(&GenomicInterval::new("chr1", 1_000_000, 2_000_000)?)? {
// Only reads ~1 MB instead of entire 10 GB file
}
```
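Both tabix and BAI use the same hierarchical binning scheme; `reg2bin` from the SAM specification maps an interval to its smallest enclosing bin, which the index then uses to look up candidate file offsets:

```rust
/// `reg2bin` from the SAM/tabix binning scheme: maps a 0-based, half-open,
/// non-empty interval [beg, end) to the smallest bin that fully contains
/// it (bin widths of 16 kb, 128 kb, 1 Mb, 8 Mb, 64 Mb, 512 Mb).
fn reg2bin(beg: u32, end: u32) -> u32 {
    let end = end - 1; // last covered position
    if beg >> 14 == end >> 14 { return ((1 << 15) - 1) / 7 + (beg >> 14); }
    if beg >> 17 == end >> 17 { return ((1 << 12) - 1) / 7 + (beg >> 17); }
    if beg >> 20 == end >> 20 { return ((1 << 9) - 1) / 7 + (beg >> 20); }
    if beg >> 23 == end >> 23 { return ((1 << 6) - 1) / 7 + (beg >> 23); }
    if beg >> 26 == end >> 26 { return ((1 << 3) - 1) / 7 + (beg >> 26); }
    0 // interval spans multiple 64 Mb bins -> root bin
}
```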
---
### 1.6 Cloud Storage Integration
**Using `object_store` crate for unified cloud I/O**
```rust
// src/io.rs
pub enum StorageLocation {
Local(PathBuf),
S3 { bucket: String, key: String, region: Option<String> },
Gcs { bucket: String, key: String },
Azure { container: String, blob: String },
Http(String),
}
pub struct CloudReader {
store: Arc<dyn ObjectStore>,
path: Path,
buffer: Vec<u8>,
}
impl CloudReader {
pub async fn from_location(location: &StorageLocation) -> Result<Self>;
}
// All readers support cloud paths
impl VcfReader {
pub async fn from_s3(bucket: &str, key: &str) -> Result<Self> {
let location = StorageLocation::S3 {
bucket: bucket.to_string(),
key: key.to_string(),
region: None,
};
let cloud_reader = CloudReader::from_location(&location).await?;
Self::from_reader(cloud_reader)
}
}
```
**Example Usage:**
```rust
use genomicframe_core::formats::vcf::VcfReader;
#[tokio::main]
async fn main() -> Result<()> {
// Read directly from S3
let mut reader = VcfReader::from_s3(
"1000genomes",
"release/20130502/ALL.chr1.phase3.vcf.gz"
).await?;
let stats = VcfStats::compute(&mut reader)?;
stats.print_summary();
Ok(())
}
```
**Implementation Strategy:**
- Use `object_store` crate (same as Polars/DataFusion)
- Support credentials from environment variables
- Implement range requests for indexed access
- Cache frequently accessed chunks
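Routing a user-supplied path to the right backend is the first step. A sketch of URL-scheme dispatch onto a simplified `StorageLocation` (the two-variant-plus-local enum and the parsing rules here are assumptions for illustration):

```rust
/// Simplified version of the StorageLocation enum for illustration.
#[derive(Debug, PartialEq)]
enum StorageLocation {
    Local(String),
    S3 { bucket: String, key: String },
    Gcs { bucket: String, key: String },
}

/// Dispatch on the URL scheme; anything without a known scheme is
/// treated as a local path.
fn parse_location(path: &str) -> StorageLocation {
    // Split "bucket/key/parts" at the first slash.
    let split = |rest: &str| -> (String, String) {
        match rest.split_once('/') {
            Some((b, k)) => (b.to_string(), k.to_string()),
            None => (rest.to_string(), String::new()),
        }
    };
    if let Some(rest) = path.strip_prefix("s3://") {
        let (bucket, key) = split(rest);
        StorageLocation::S3 { bucket, key }
    } else if let Some(rest) = path.strip_prefix("gs://") {
        let (bucket, key) = split(rest);
        StorageLocation::Gcs { bucket, key }
    } else {
        StorageLocation::Local(path.to_string())
    }
}
```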
---
## Phase 2: Launch `genomicframe` Package
**Goal:** High-level query engine with Polars-style ergonomics
### 2.1 Package Structure
```
genomicframe/
├── Cargo.toml
├── src/
│ ├── lib.rs # Public API
│ ├── frame.rs # GenomicFrame struct
│ ├── plan/
│ │ ├── logical.rs # LogicalPlan AST
│ │ ├── physical.rs # PhysicalPlan (execution)
│ │ └── optimizer.rs # Query optimization
│ ├── expr.rs # Expression types (filters, projections)
│ ├── executor.rs # Parallel execution engine
│ ├── arrow.rs # Arrow conversion
│ └── polars.rs # Polars interop (optional feature)
└── examples/
└── complex_query.rs
```
### 2.2 Core Dependencies
```toml
[dependencies]
genomicframe-core = "0.1" # I/O layer
arrow = "53" # Columnar format
rayon = "1.10" # Parallelism
# Optional
polars = { version = "0.43", optional = true }
duckdb = { version = "1.0", optional = true }
[features]
default = ["arrow"]
polars = ["dep:polars"]
sql = ["duckdb"]
```
**Key Design Decision:** Arrow is **required**, Polars is **optional**
### 2.3 GenomicFrame API
**Design Philosophy:** GenomicFrame operates in **two modes**:
1. **Genomic Mode** - Stay in genomic domain, delegate to `genomicframe-core` for domain operations
2. **Analytics Mode** - Convert to Arrow/Polars for general-purpose data science
```rust
pub struct GenomicFrame {
plan: LogicalPlan, // Lazy execution plan
schema: GenomicSchema, // Column metadata
}
impl GenomicFrame {
// ================================================
// CONSTRUCTORS (Lazy - just build plan)
// ================================================
pub fn scan_vcf<P: AsRef<Path>>(path: P) -> Result<Self>;
pub fn scan_bam<P: AsRef<Path>>(path: P) -> Result<Self>;
pub fn scan_gff<P: AsRef<Path>>(path: P) -> Result<Self>;
pub fn scan_fastq<P: AsRef<Path>>(path: P) -> Result<Self>;
// ================================================
// TRANSFORMATIONS (Lazy - modify plan)
// ================================================
// General filters
pub fn filter(self, predicate: Expr) -> Self;
pub fn select(self, columns: &[&str]) -> Self;
pub fn with_column(self, name: &str, expr: Expr) -> Self;
// Genomic-specific filters
pub fn filter_by_region(self, interval: GenomicInterval) -> Self;
pub fn filter_by_regions(self, intervals: &[GenomicInterval]) -> Self;
pub fn filter_by_quality(self, min_qual: f64) -> Self;
pub fn filter_snps_only(self) -> Self;
pub fn filter_transitions_only(self) -> Self;
// Joins
pub fn join(self, other: Self, on: &[&str]) -> Self;
pub fn join_by_overlap(self, other: Self) -> Self; // Interval-based join
// ================================================
// GENOMIC OPERATIONS (Eager - use genomicframe-core)
// ================================================
// These stay in genomic domain, never convert to Arrow!
// Statistics (delegates to genomicframe-core::VcfStats, BamStats, etc.)
pub fn statistics(self) -> Result<Statistics>;
pub fn vcf_stats(self) -> Result<VcfStats>; // VCF-specific
pub fn bam_stats(self) -> Result<BamStats>; // BAM-specific
// Genomic-specific counts/aggregations
pub fn count(self) -> Result<usize>;
pub fn count_transitions(self) -> Result<usize>;
pub fn count_transversions(self) -> Result<usize>;
pub fn ts_tv_ratio(self) -> Result<f64>;
// Interval operations
pub fn to_intervals(self) -> Result<Vec<GenomicInterval>>;
pub fn interval_coverage(self) -> Result<Vec<(GenomicInterval, usize)>>;
// Group by genomic properties
pub fn group_by_chromosome(self) -> Result<HashMap<String, usize>>;
pub fn group_by_variant_type(self) -> Result<HashMap<VariantType, usize>>;
// Extraction operations
pub fn unique_alleles(self) -> Result<HashSet<String>>;
pub fn extract_genotypes(self, sample: &str) -> Result<Vec<String>>;
// ================================================
// ARROW/POLARS CONVERSION (When leaving genomic domain)
// ================================================
// Use these when you need general-purpose analytics
pub fn collect_arrow(self) -> Result<arrow::RecordBatch>;
pub fn collect_polars(self) -> Result<polars::DataFrame>; // Feature-gated
pub fn to_arrow_stream(self) -> Result<ArrowStreamReader>;
pub fn head(self, n: usize) -> Result<arrow::RecordBatch>;
// ================================================
// OPTIMIZATION & DEBUGGING
// ================================================
pub fn explain(self) -> String; // Show query plan
pub fn explain_optimized(self) -> String; // Show optimized plan
pub fn optimize(self) -> Result<Self>; // Manual optimization
}
// Statistics enum that wraps format-specific stats
pub enum Statistics {
Vcf(VcfStats), // From genomicframe-core::formats::vcf::VcfStats
Bam(BamStats), // From genomicframe-core::formats::bam::BamStats
Fastq(FastqStats), // From genomicframe-core::formats::fastq::FastqStats
}
```
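The Ts/Tv operations above reduce to classifying each SNP by its base change: transitions stay within purines (A<->G) or within pyrimidines (C<->T); every other single-base change is a transversion. A sketch:

```rust
#[derive(Debug, PartialEq)]
enum SnpClass { Transition, Transversion }

/// Classify a REF -> ALT single-base change; returns None when the pair
/// is not a valid SNP (identical bases or non-ACGT input).
fn classify_snp(r: u8, a: u8) -> Option<SnpClass> {
    let purine = |b: u8| b == b'A' || b == b'G';
    let pyrimidine = |b: u8| b == b'C' || b == b'T';
    if r == a || !(purine(r) || pyrimidine(r)) || !(purine(a) || pyrimidine(a)) {
        return None;
    }
    if purine(r) == purine(a) {
        Some(SnpClass::Transition)   // A<->G or C<->T
    } else {
        Some(SnpClass::Transversion) // purine <-> pyrimidine
    }
}
```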
### 2.4 Expression System
```rust
pub enum Expr {
// Columns and literals
Column(String),
Literal(ScalarValue),
// Comparisons
Eq(Box<Expr>, Box<Expr>),
Neq(Box<Expr>, Box<Expr>),
Gt(Box<Expr>, Box<Expr>),
Lt(Box<Expr>, Box<Expr>),
// Boolean logic
And(Vec<Expr>),
Or(Vec<Expr>),
Not(Box<Expr>),
// Genomic predicates
IsTransition,
IsTransversion,
IsSnp,
IsIndel,
InRegion(GenomicInterval),
// Aggregations
Count,
Mean(Box<Expr>),
Sum(Box<Expr>),
Min(Box<Expr>),
Max(Box<Expr>),
// Genomic aggregations
TsTvRatio,
AlleleFrequency,
}
// Ergonomic helpers
pub fn col(name: &str) -> Expr {
Expr::Column(name.to_string())
}
pub fn lit<T: Into<ScalarValue>>(value: T) -> Expr {
Expr::Literal(value.into())
}
```
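To show how such an expression tree is consumed, here is a minimal sketch of evaluating numeric predicates against one row (the `HashMap` row representation and the reduced `Expr` are illustrative simplifications of the enum above):

```rust
use std::collections::HashMap;

/// Reduced expression type covering just enough to evaluate a predicate.
enum Expr {
    Column(String),
    Literal(f64),
    Gt(Box<Expr>, Box<Expr>),
    And(Vec<Expr>),
}

fn eval_scalar(expr: &Expr, row: &HashMap<String, f64>) -> f64 {
    match expr {
        Expr::Column(name) => row[name.as_str()],
        Expr::Literal(v) => *v,
        _ => panic!("not a scalar expression"),
    }
}

fn eval_predicate(expr: &Expr, row: &HashMap<String, f64>) -> bool {
    match expr {
        Expr::Gt(l, r) => eval_scalar(l, row) > eval_scalar(r, row),
        Expr::And(children) => children.iter().all(|c| eval_predicate(c, row)),
        _ => panic!("not a predicate"),
    }
}

// Ergonomic helpers mirroring the ones above.
fn col(name: &str) -> Expr { Expr::Column(name.to_string()) }
fn lit(v: f64) -> Expr { Expr::Literal(v) }
```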
### 2.5 Example: Complete Workflow
```rust
use genomicframe::prelude::*;
// Complex population genetics query
let result = GenomicFrame::scan_vcf("s3://1000genomes/ALL.chr1.vcf.bgz")
// Filter to exonic regions (using overlap join)
.join_by_overlap(
GenomicFrame::scan_gff("gencode.v45.gff3.gz")
.filter(col("feature").eq(lit("exon")))
)
// Quality filters
.filter(col("QUAL").gt(lit(30.0)))
.filter(col("FILTER").eq(lit("PASS")))
.filter(col("is_snp"))
// Select relevant columns
.select(&["CHROM", "POS", "REF", "ALT", "samples"])
// Compute per-population allele frequencies
.with_column(
"AF_AFR",
col("samples").filter_by_population("AFR").allele_frequency()
)
.with_column(
"AF_EUR",
col("samples").filter_by_population("EUR").allele_frequency()
)
// Find variants with large frequency differences
.filter((col("AF_AFR") - col("AF_EUR")).abs().gt(lit(0.3)))
// Execute and convert to Polars
.collect_polars()?;
// Now use Polars for visualization, ML, etc.
result.write_csv("population_differences.csv")?;
```
### 2.6 Query Optimization Pipeline
```rust
impl Optimizer {
pub fn optimize(plan: LogicalPlan) -> Result<LogicalPlan> {
let plan = Self::push_down_predicates(plan)?;
let plan = Self::push_down_projections(plan)?;
let plan = Self::detect_indexed_scans(plan)?;
let plan = Self::parallel_execution_planning(plan)?;
Ok(plan)
}
fn push_down_predicates(plan: LogicalPlan) -> Result<LogicalPlan> {
// Move filters as close to scan as possible
// Use PushdownFilter trait from genomicframe-core
}
fn push_down_projections(plan: LogicalPlan) -> Result<LogicalPlan> {
// Only read columns that are actually used
// VCF: Skip INFO parsing if not selected
// BAM: Skip tag parsing if not needed
}
fn detect_indexed_scans(plan: LogicalPlan) -> Result<LogicalPlan> {
// Look for region filters
// Check if .tbi/.bai index exists
// Use indexed reader instead of full scan
}
}
```
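The predicate-pushdown step can be illustrated on a toy two-node plan: filters sitting above a scan are folded into the scan node so they can run during parsing. The plan shape and string predicates here are deliberately simplified:

```rust
/// Toy two-node logical plan for illustrating predicate pushdown.
#[derive(Debug, PartialEq)]
enum LogicalPlan {
    Scan { path: String, pushed_filters: Vec<String> },
    Filter { predicate: String, input: Box<LogicalPlan> },
}

/// Fold every Filter that sits directly above a Scan into the scan node.
fn push_down_predicates(plan: LogicalPlan) -> LogicalPlan {
    match plan {
        LogicalPlan::Filter { predicate, input } => match push_down_predicates(*input) {
            LogicalPlan::Scan { path, mut pushed_filters } => {
                pushed_filters.push(predicate);
                LogicalPlan::Scan { path, pushed_filters }
            }
            other => LogicalPlan::Filter { predicate, input: Box::new(other) },
        },
        other => other,
    }
}
```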
---
## Phase 3: Advanced Features
### 3.1 Window Functions
```rust
// Sliding window over genome
let coverage_windows = GenomicFrame::scan_bam("alignments.bam")
.coverage()
    .window(1_000, 100) // window_size = 1 kb, step = 100 bp
.agg(vec![
mean(col("depth")).alias("mean_coverage"),
std(col("depth")).alias("coverage_std"),
])
.collect_arrow()?;
```
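The windowing above can be sketched on a flat per-base depth vector, assuming single-chromosome data (a real implementation would reset windows at chromosome boundaries):

```rust
/// Sketch of the sliding-window primitive: emit (window start, mean depth)
/// for each full window of `size` bases, advancing by `step`.
fn window_mean(depth: &[u32], size: usize, step: usize) -> Vec<(usize, f64)> {
    let mut out = Vec::new();
    let mut start = 0;
    while start + size <= depth.len() {
        let sum: u32 = depth[start..start + size].iter().sum();
        out.push((start, sum as f64 / size as f64));
        start += step;
    }
    out
}
```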
### 3.2 Python Bindings
```rust
// genomicframe-py (separate repo)
use pyo3::prelude::*;
#[pyclass]
struct PyGenomicFrame {
inner: GenomicFrame,
}
#[pymethods]
impl PyGenomicFrame {
#[staticmethod]
fn scan_vcf(path: &str) -> PyResult<Self> {
Ok(Self {
inner: GenomicFrame::scan_vcf(path)?,
})
}
fn filter(&self, predicate: &str) -> PyResult<Self> {
// Parse string to Expr
// Apply filter
}
fn to_polars(&self) -> PyResult<PyObject> {
// Return Polars DataFrame to Python
}
}
```
**Python Usage:**
```python
import genomicframe as gf
df = (gf.scan_vcf("variants.vcf.gz")
.filter("QUAL > 30")
.filter("CHROM == 'chr1'")
.to_polars())
# Now it's a Polars DataFrame - use normal Polars API
df.write_csv("filtered.csv")
```
---
## Timeline & Milestones
### Milestone 1: Complete `genomicframe-core` v0.1
- ✅ VCF reader + stats (DONE)
- ⬜ VCF validation + filtering
- ⬜ VCF writer
- ⬜ BAM/SAM reader + stats
- ⬜ FASTQ reader + stats
- ⬜ GFF/GTF reader
- ⬜ Cloud storage (S3/GCS/Azure)
- ⬜ Documentation + tutorials
**Deliverable:** Stable I/O library, 10-200x faster than Python
### Milestone 2: Launch `genomicframe` v0.1 (the key piece for a unique product in bio)
- ⬜ Core query engine (logical/physical plans)
- ⬜ Arrow integration
- ⬜ Basic optimizer (predicate pushdown)
- ⬜ Parallel execution
- ⬜ Polars conversion (optional feature)
- ⬜ Python bindings (genomicframe-py)
- ⬜ Comprehensive benchmarks
- ⬜ Examples + tutorials
**Deliverable:** Working query engine with lazy evaluation
### Milestone 3: Advanced Features
- ⬜ Index support (tabix/BAI)
- ⬜ Window functions
- ⬜ Complex joins (overlap-based)
- ⬜ Performance tuning
**Deliverable:** Production-ready analytics platform
### Milestone 4: Ecosystem & Adoption
- ⬜ WebAssembly builds
- ⬜ Cloud-native optimizations
- ⬜ Integration guides (Nextflow, Snakemake)
- ⬜ Academic publication
**Deliverable:** Ecosystem adoption, benchmark paper
---
## Success Metrics
### Performance Targets
| Operation | Target | Python baseline | Improvement |
|---|---|---|---|
| Parse 1M VCF variants | <1 sec | 120 sec (PyVCF) | 120x |
| Compute VCF stats | <1 sec | 200 sec (scikit-allel) | 200x |
| Filter + aggregate | <2 sec | 300 sec (Pandas) | 150x |
| Memory usage (10M variants) | <100 MB | 8 GB (scikit-allel) | 80x |
| Query latency (indexed) | <100 ms | N/A | New capability |
## Technical Risks & Mitigations
### Risk 1: Arrow Overhead
**Risk:** Arrow conversion adds latency
**Mitigation:** Benchmark carefully; provide streaming API that bypasses Arrow for simple cases
### Risk 2: Query Optimization Complexity
**Risk:** Optimizer bugs lead to incorrect results
**Mitigation:** Extensive testing; compare against Python/pandas results; fuzzing
### Risk 3: Index Fragmentation
**Risk:** Too many index formats (tabix, BAI, custom)
**Mitigation:** Start with tabix/BAI only; unified trait abstracts differences
### Risk 4: Ecosystem Lock-in
**Risk:** Too tightly coupled to Polars/Arrow specifics
**Mitigation:** Keep core format-agnostic; Polars/Arrow as optional features
---
## Why This Plan Works
### Clear Separation of Concerns
- **`genomicframe-core`**: Pure I/O, no query engine complexity
- **`genomicframe`**: Pure query logic, no I/O complexity
- Users choose their level: fast I/O OR full analytics
### Incremental Value
- Phase 1 delivers immediate value (10-200x speedups)
- Phase 2 adds convenience without breaking Phase 1 users
- Phase 3 extends ecosystem without bloating core
### Proven Patterns
- Polars: Lazy query planning works
- DataFusion: Arrow-based execution works
- DuckDB: SQL on custom formats works
- We're applying these patterns to genomics
### Rust Advantages
- Zero-cost abstractions → fast + ergonomic
- Trait system → pluggable formats
- Rayon → trivial parallelism
- Type safety → catch bugs at compile time
---
## Next Steps (Immediate)
1. **Complete VCF validation** (1 week)
2. **Add VCF filtering traits** (1 week)
3. **Implement BAM reader** (2 weeks)
4. **Add Rayon parallel iterators** (1 week)
5. **Cloud storage integration** (2 weeks)
**Total:** ~7 weeks to complete Phase 1
Then: Start `genomicframe` package with basic query engine.
---
## Conclusion
This roadmap balances:
- ✅ **Pragmatism**: Deliver value quickly (Phase 1)
- ✅ **Ambition**: Build transformative tool (Phase 2-3)
- ✅ **Focus**: Keep packages lean and purposeful
- ✅ **Openness**: Integrate with existing ecosystems
**The end result:** A genomics analytics platform that is 10-200x faster than Python tooling while remaining ergonomic and interoperable.
Let's build the future of bioinformatics! 🧬🚀
---
*Last Updated: 2025-11-05*
*Author: Ryan Duffy*
*Status: Living Document*