genomicframe-core 0.2.0

High-performance genomics I/O and interoperability layer
# GenomicFrame Development Roadmap

## Vision

Build a two-tier architecture for high-performance genomics:

1. **`genomicframe-core`** - Low-level I/O and domain logic (this repository)
2. **`genomicframe`** - High-level query engine and ergonomic API (future repository)

This separation ensures:
- ✅ `genomicframe-core` stays lean, focused on I/O performance
- ✅ `genomicframe` provides convenience without bloating the core library
- ✅ Users can choose: fast I/O only OR the full query engine
- ✅ Clear upgrade path: start with interop, add the frame layer when needed

---

## Phase 1: Complete `genomicframe-core` Foundation

**Goal:** Production-ready I/O library with comprehensive format support and domain-aware optimizations.

### 1.1 VCF Completion ✅→🚧

**Current Status:**
- ✅ Streaming reader with O(1) memory
- ✅ Header parsing and metadata extraction
- ✅ Comprehensive statistics (Ts/Tv, variant types, quality, etc.)
- ✅ Multi-sample support
- ✅ Gzip compression support

**Remaining Work:**

#### A. Validation (`src/formats/vcf/validation.rs`)
```rust
pub struct VcfValidator {
    /// Validation rules to apply
    rules: Vec<ValidationRule>,
    /// Whether to fail fast or collect all errors
    strict: bool,
}

pub enum ValidationRule {
    /// Check chromosome names match contig headers
    ValidateChromosomes,
    /// Ensure positions are sorted within chromosomes
    CheckSorted,
    /// Validate REF/ALT alleles are valid DNA sequences
    ValidateAlleles,
    /// Check quality scores are in valid range
    ValidateQuality,
    /// Ensure sample genotypes match FORMAT field
    ValidateGenotypes,
    /// Check INFO field values match their declared types
    ValidateInfoTypes,
}

impl VcfValidator {
    pub fn validate_record(&self, record: &VcfRecord) -> ValidationResult;
    pub fn validate_stream(&self, reader: &mut VcfReader) -> Vec<ValidationError>;
}
```

**Use Cases:**
- Pre-publication data QC
- Pipeline debugging (catch malformed files early)
- Format compliance checking
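As a concrete illustration, the `ValidateAlleles` rule could reduce to a byte-level check like this (a sketch, not the final rule set; the symbolic `.`/`*` handling is an assumption):

```rust
/// Sketch of the ValidateAlleles rule: REF/ALT must be non-empty
/// strings over the DNA alphabet, or a symbolic placeholder
/// ("." for missing, "*" for a spanning deletion).
fn is_valid_allele(allele: &str) -> bool {
    if allele == "." || allele == "*" {
        return true; // symbolic alleles pass as-is
    }
    !allele.is_empty()
        && allele
            .bytes()
            .all(|b| matches!(b, b'A' | b'C' | b'G' | b'T' | b'N'))
}

fn main() {
    assert!(is_valid_allele("ACGT"));
    assert!(is_valid_allele("."));
    assert!(!is_valid_allele("AXGT"));
    assert!(!is_valid_allele(""));
}
```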

#### B. Filtering (`src/formats/vcf/filters.rs`)
```rust
/// Composable filter predicates
pub trait VcfFilter: Send + Sync {
    fn test(&self, record: &VcfRecord) -> bool;
}

// Quality-based filters
pub struct QualityFilter { min_qual: f64 }
pub struct PassFilter;  // Only PASS variants
pub struct DepthFilter { min_dp: u32 }

// Genomic region filters
pub struct RegionFilter { intervals: Vec<GenomicInterval> }
pub struct ChromosomeFilter { chroms: HashSet<String> }

// Variant type filters
pub struct SnpOnlyFilter;
pub struct IndelOnlyFilter;
pub struct BiAllelicFilter;  // Exclude multi-allelic

// Allele frequency filters
pub struct MinorAlleleFrequency { min_maf: f64, max_maf: f64 }

// Combinator filters
pub struct AndFilter { filters: Vec<Box<dyn VcfFilter>> }
pub struct OrFilter { filters: Vec<Box<dyn VcfFilter>> }
pub struct NotFilter { inner: Box<dyn VcfFilter> }

/// Ergonomic filter builder
impl VcfReader {
    pub fn with_filter<F: VcfFilter + 'static>(self, filter: F) -> FilteredReader<Self>;
}
```

**Example Usage:**
```rust
let filtered = VcfReader::from_path("variants.vcf.gz")?
    .with_filter(PassFilter)
    .with_filter(QualityFilter { min_qual: 30.0 })
    .with_filter(RegionFilter::from_bed("exons.bed")?);

// Filters applied during iteration - no extra memory
for record in filtered {
    // Only see records matching all filters
}
```

**Implementation Note:**
- Filters compose via trait objects; the dynamic-dispatch cost is negligible next to parsing
- Apply filters during parsing when possible (skip records before allocation)
- Lay the groundwork for **predicate pushdown** in Phase 2
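The composable design can be exercised end-to-end with a toy record type (a sketch: `Record` stands in for the real `VcfRecord`, and only two filters plus the AND-combinator are shown):

```rust
/// Stand-in for VcfRecord, carrying just the fields the filters need.
struct Record {
    qual: f64,
    filter_field: String,
}

trait VcfFilter {
    fn test(&self, record: &Record) -> bool;
}

struct QualityFilter { min_qual: f64 }
impl VcfFilter for QualityFilter {
    fn test(&self, r: &Record) -> bool { r.qual >= self.min_qual }
}

struct PassFilter;
impl VcfFilter for PassFilter {
    fn test(&self, r: &Record) -> bool { r.filter_field == "PASS" }
}

/// AND-combinator: passes only if every inner filter passes.
struct AndFilter { filters: Vec<Box<dyn VcfFilter>> }
impl VcfFilter for AndFilter {
    fn test(&self, r: &Record) -> bool {
        self.filters.iter().all(|f| f.test(r))
    }
}

fn main() {
    let combined = AndFilter {
        filters: vec![
            Box::new(PassFilter),
            Box::new(QualityFilter { min_qual: 30.0 }),
        ],
    };
    let good = Record { qual: 45.0, filter_field: "PASS".into() };
    let bad = Record { qual: 10.0, filter_field: "PASS".into() };
    assert!(combined.test(&good));
    assert!(!combined.test(&bad));
}
```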

#### C. Writer Enhancement (`src/formats/vcf/writer.rs`)
```rust
pub struct VcfWriter {
    writer: Box<dyn Write>,
    header: VcfHeader,
    compression: Option<CompressionLevel>,
}

impl VcfWriter {
    pub fn from_path<P: AsRef<Path>>(path: P) -> Result<Self>;
    pub fn with_compression(self, level: CompressionLevel) -> Self;

    pub fn write_header(&mut self) -> Result<()>;
    pub fn write_record(&mut self, record: &VcfRecord) -> Result<()>;
    pub fn write_records(&mut self, records: &[VcfRecord]) -> Result<()>;
}
```

**Key Features:**
- Automatic bgzip compression for `.vcf.bgz`
- Header validation before writing
- Efficient batch writing

---

### 1.2 BAM/SAM/CRAM Support

**Priority:** High (critical for alignment data)

#### Reader (`src/formats/bam/reader.rs`)
```rust
pub struct BamRecord {
    pub qname: String,          // Query name
    pub flag: u16,              // SAM flags
    pub rname: String,          // Reference name
    pub pos: u64,               // 1-based position
    pub mapq: u8,               // Mapping quality
    pub cigar: Vec<CigarOp>,    // CIGAR string
    pub sequence: Vec<u8>,      // DNA sequence
    pub qualities: Vec<u8>,     // Phred quality scores
    pub tags: HashMap<String, TagValue>,  // Optional fields
}

pub struct BamReader {
    reader: Box<dyn BufRead>,
    header: BamHeader,
    index: Option<BaiIndex>,    // .bai file for random access
}

impl BamReader {
    pub fn from_path<P: AsRef<Path>>(path: P) -> Result<Self>;
    pub fn with_index<P: AsRef<Path>>(self, index: P) -> Result<Self>;

    // Random access (requires index)
    pub fn fetch(&mut self, interval: &GenomicInterval) -> Result<BamIterator>;
}
```

#### Statistics (`src/formats/bam/stats.rs`)
```rust
pub struct BamStats {
    pub total_reads: usize,
    pub mapped_reads: usize,
    pub unmapped_reads: usize,
    pub properly_paired: usize,
    pub duplicates: usize,

    // Quality metrics
    pub mean_mapq: f64,
    pub mapq_distribution: CategoryCounter<u8>,

    // Coverage statistics
    pub mean_coverage: f64,
    pub coverage_histogram: Vec<(u32, usize)>,  // (depth, count)

    // Per-chromosome stats
    pub reads_per_chrom: CategoryCounter<String>,
}
```

**Domain-Specific Filters:**
- Mapping quality threshold
- Proper pair filtering
- Duplicate removal
- Primary alignment only
- Read length filters
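Several of these filters reduce to SAM flag-bit tests; a sketch using the flag values defined in the SAM specification (the combined predicate and `min_mapq` threshold are illustrative choices):

```rust
// SAM flag bits, per the SAM specification.
const FLAG_UNMAPPED: u16 = 0x4;
const FLAG_SECONDARY: u16 = 0x100;
const FLAG_DUPLICATE: u16 = 0x400;
const FLAG_SUPPLEMENTARY: u16 = 0x800;

/// Keep only mapped, non-duplicate, primary alignments above a MAPQ cutoff.
fn passes_default_filters(flag: u16, mapq: u8, min_mapq: u8) -> bool {
    flag & FLAG_UNMAPPED == 0
        && flag & FLAG_DUPLICATE == 0
        && flag & (FLAG_SECONDARY | FLAG_SUPPLEMENTARY) == 0
        && mapq >= min_mapq
}

fn main() {
    assert!(passes_default_filters(0x2 | 0x1, 60, 30)); // paired, properly paired, mapped
    assert!(!passes_default_filters(FLAG_UNMAPPED, 60, 30)); // unmapped
    assert!(!passes_default_filters(FLAG_DUPLICATE, 60, 30)); // PCR/optical duplicate
    assert!(!passes_default_filters(0x0, 10, 30)); // low mapping quality
}
```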

---

### 1.3 FASTQ Support

**Priority:** Medium (sequence data preprocessing)

#### Reader (`src/formats/fastq/reader.rs`)
```rust
pub struct FastqRecord {
    pub id: String,          // Sequence identifier
    pub description: Option<String>,
    pub sequence: Vec<u8>,   // DNA sequence
    pub qualities: Vec<u8>,  // Phred quality scores
}

pub struct FastqReader {
    reader: Box<dyn BufRead>,
    compression: Compression,
}

// Streaming by default
impl GenomicRecordIterator for FastqReader {
    type Record = FastqRecord;
    fn next_record(&mut self) -> Result<Option<Self::Record>>;
}
```

#### Statistics (`src/formats/fastq/stats.rs`)
```rust
pub struct FastqStats {
    pub total_reads: usize,
    pub total_bases: usize,
    pub mean_length: f64,
    pub length_distribution: Vec<(usize, usize)>,  // (length, count)

    // Quality metrics
    pub mean_quality: f64,
    pub quality_per_position: Vec<RunningStats>,  // Per-base quality

    // Sequence composition
    pub gc_content: f64,
    pub n_content: f64,
    pub base_distribution: CategoryCounter<u8>,
}
```

**Domain-Specific Operations:**
- Quality trimming
- Adapter removal
- Length filtering
- N-base filtering
- Quality score recalibration
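Quality trimming, for instance, reduces to decoding Phred+33 scores and cutting the 3' tail; a fixed-threshold sketch (real trimmers often use sliding-window averages instead):

```rust
/// Return how many bases to keep after trimming the 3' end:
/// walk backwards from the end, dropping bases while the Phred+33
/// quality stays below `min_q`, and stop at the first good base.
fn trim_3prime(seq: &[u8], quals_ascii: &[u8], min_q: u8) -> usize {
    let mut keep = seq.len();
    for (i, &q) in quals_ascii.iter().enumerate().rev() {
        let phred = q - 33; // Phred+33 encoding: ASCII 33 == Q0
        if phred >= min_q {
            break;
        }
        keep = i;
    }
    keep
}

fn main() {
    let seq = b"ACGTACGT";
    let quals = b"IIIIII##"; // 'I' = Q40, '#' = Q2 in Phred+33
    let keep = trim_3prime(seq, quals, 20);
    assert_eq!(keep, 6);
    assert_eq!(&seq[..keep], b"ACGTAC");
}
```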

---

### 1.4 GFF/GTF Support

**Priority:** Medium (gene annotations)

#### Reader (`src/formats/gff/reader.rs`)
```rust
pub struct GffRecord {
    pub seqid: String,       // Chromosome
    pub source: String,      // Annotation source
    pub feature: String,     // Feature type (gene, exon, CDS, etc.)
    pub start: u64,          // 1-based start
    pub end: u64,            // 1-based end (inclusive)
    pub score: Option<f64>,
    pub strand: Strand,      // +, -, .
    pub phase: Option<u8>,   // CDS frame
    pub attributes: HashMap<String, String>,  // Key-value pairs
}

impl GffRecord {
    pub fn interval(&self) -> GenomicInterval;
    pub fn get_attribute(&self, key: &str) -> Option<&str>;
}
```

#### Operations
- Filter by feature type (gene, exon, etc.)
- Extract transcript/gene relationships
- Interval overlap queries
- Convert to BED format
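The `attributes` map above presumes parsing column 9; a GFF3-style sketch (GTF's `key "value";` syntax would need a separate parser, and percent-decoding of reserved characters is omitted here):

```rust
use std::collections::HashMap;

/// Parse a GFF3 attribute column ("key=value;key=value") into a map.
fn parse_gff3_attributes(field: &str) -> HashMap<String, String> {
    field
        .split(';')
        .filter_map(|pair| {
            // Split each "key=value" pair at the first '='.
            let mut kv = pair.trim().splitn(2, '=');
            Some((kv.next()?.to_string(), kv.next()?.to_string()))
        })
        .collect()
}

fn main() {
    let attrs = parse_gff3_attributes("ID=exon00001;Parent=transcript42");
    assert_eq!(attrs.get("ID").map(String::as_str), Some("exon00001"));
    assert_eq!(attrs.get("Parent").map(String::as_str), Some("transcript42"));
    assert_eq!(attrs.get("Name"), None);
}
```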

---

### 1.5 Cross-Format Optimizations

**Shared Infrastructure:**

#### A. Parallel Iteration with Rayon
```rust
// Add to all readers (sketched as an extension trait with a default method,
// since a bare `impl Trait { .. }` block is not valid Rust)
pub trait ParChunks: GenomicRecordIterator + Sized + Send {
    fn par_chunks(self, chunk_size: usize) -> ParallelChunkedIterator<Self>
    where
        Self::Record: Send,
    {
        ParallelChunkedIterator {
            inner: self,
            chunk_size,
            pool: rayon::ThreadPoolBuilder::new().build().unwrap(),
        }
    }
}

// Usage: Process VCF in parallel
let stats: Vec<VcfStats> = reader
    .par_chunks(10_000)
    .map(|chunk| {
        let mut stats = VcfStats::new();
        for record in chunk {
            stats.update(&record);
        }
        stats
    })
    .collect();

// Merge thread-local stats
let final_stats = merge_stats(stats);
```

**Key Insight:** This enables **data parallelism** without changing API!
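The merge step (`merge_stats` above is hypothetical) follows a standard map-then-reduce shape; a self-contained sketch with plain `std::thread` standing in for Rayon and a toy `Stats` type standing in for `VcfStats`:

```rust
use std::thread;

/// Toy stand-in for VcfStats: mergeable partial aggregates.
#[derive(Default)]
struct Stats {
    count: usize,
    qual_sum: f64,
}

impl Stats {
    fn update(&mut self, qual: f64) {
        self.count += 1;
        self.qual_sum += qual;
    }
    /// Combine two partial results; associative, so chunk order is irrelevant.
    fn merge(mut self, other: Stats) -> Stats {
        self.count += other.count;
        self.qual_sum += other.qual_sum;
        self
    }
}

fn main() {
    let quals: Vec<f64> = (0..10_000).map(|i| (i % 60) as f64).collect();
    let merged: Stats = quals
        .chunks(2_500)
        .map(|chunk| chunk.to_vec())
        .map(|chunk| {
            thread::spawn(move || {
                let mut s = Stats::default();
                for q in chunk {
                    s.update(q);
                }
                s
            })
        })
        .collect::<Vec<_>>() // spawn every worker before joining any
        .into_iter()
        .map(|h| h.join().unwrap())
        .fold(Stats::default(), Stats::merge);
    assert_eq!(merged.count, 10_000);
}
```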

#### B. Predicate Pushdown Foundation
```rust
/// Trait for filters that can be applied during parsing
pub trait PushdownFilter<R>: VcfFilter {
    /// Can this filter be evaluated during parsing?
    fn is_pushdown_eligible(&self) -> bool;

    /// Apply filter to raw line before full parse
    fn test_raw(&self, line: &str) -> bool;
}

// Example: Quality filter can check QUAL column without full parse
impl PushdownFilter<VcfRecord> for QualityFilter {
    fn is_pushdown_eligible(&self) -> bool { true }

    fn test_raw(&self, line: &str) -> bool {
        // Grab the QUAL column (index 5) without allocating a Vec
        line.split('\t')
            .nth(5)
            .and_then(|q| q.parse::<f64>().ok())
            .map_or(false, |qual| qual >= self.min_qual)
    }
}
```

**Impact:** Skipping the full parse of records that will be filtered anyway can yield a **2-3x speedup** on selective queries

#### C. Index Support Infrastructure
```rust
// Shared index trait
pub trait GenomicIndex {
    fn fetch_offsets(&self, interval: &GenomicInterval) -> Result<Vec<u64>>;
}

// Tabix for VCF/GFF/BED
pub struct TabixIndex {
    path: PathBuf,
    index_data: HashMap<String, IntervalTree>,
}

// BAI for BAM
pub struct BaiIndex {
    path: PathBuf,
    bins: HashMap<u32, Vec<Chunk>>,
}

// Add to readers
impl VcfReader {
    pub fn with_index<P: AsRef<Path>>(self, index: P) -> Result<IndexedVcfReader>;
}

impl IndexedVcfReader {
    pub fn fetch(&mut self, interval: &GenomicInterval) -> Result<VcfIterator>;
}
```

**Use Case:**
```rust
let mut reader = VcfReader::from_path("variants.vcf.bgz")?
    .with_index("variants.vcf.bgz.tbi")?;

// Seek directly to chr1:1000000-2000000
for record in reader.fetch(&GenomicInterval::new("chr1", 1_000_000, 2_000_000)?)? {
    // Only reads ~1 MB instead of entire 10 GB file
}
```
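Under the hood, `fetch()` reduces to interval-overlap tests against index bins; a minimal sketch (coordinates assumed 0-based and end-exclusive here, which the real `GenomicInterval` may or may not adopt):

```rust
/// Half-open genomic interval: [start, end) on one chromosome.
struct GenomicInterval {
    chrom: String,
    start: u64,
    end: u64,
}

impl GenomicInterval {
    /// Two half-open intervals overlap iff each starts before the other ends.
    fn overlaps(&self, other: &GenomicInterval) -> bool {
        self.chrom == other.chrom && self.start < other.end && other.start < self.end
    }
}

fn main() {
    let query = GenomicInterval { chrom: "chr1".into(), start: 1_000_000, end: 2_000_000 };
    let hit = GenomicInterval { chrom: "chr1".into(), start: 1_500_000, end: 1_500_100 };
    let wrong_chrom = GenomicInterval { chrom: "chr2".into(), start: 1_500_000, end: 1_500_100 };
    let adjacent = GenomicInterval { chrom: "chr1".into(), start: 2_000_000, end: 2_000_100 };
    assert!(query.overlaps(&hit));
    assert!(!query.overlaps(&wrong_chrom));
    assert!(!query.overlaps(&adjacent)); // touching endpoints do not overlap
}
```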

---

### 1.6 Cloud Storage Integration

**Using `object_store` crate for unified cloud I/O**

```rust
// src/io.rs
pub enum StorageLocation {
    Local(PathBuf),
    S3 { bucket: String, key: String, region: Option<String> },
    Gcs { bucket: String, key: String },
    Azure { container: String, blob: String },
    Http(String),
}

pub struct CloudReader {
    store: Arc<dyn ObjectStore>,
    path: Path,
    buffer: Vec<u8>,
}

impl CloudReader {
    pub async fn from_location(location: &StorageLocation) -> Result<Self>;
}

// All readers support cloud paths
impl VcfReader {
    pub async fn from_s3(bucket: &str, key: &str) -> Result<Self> {
        let location = StorageLocation::S3 {
            bucket: bucket.to_string(),
            key: key.to_string(),
            region: None,
        };
        let cloud_reader = CloudReader::from_location(&location).await?;
        Self::from_reader(cloud_reader)
    }
}
```

**Example Usage:**
```rust
use genomicframe_core::formats::vcf::VcfReader;

#[tokio::main]
async fn main() -> Result<()> {
    // Read directly from S3
    let mut reader = VcfReader::from_s3(
        "1000genomes",
        "release/20130502/ALL.chr1.phase3.vcf.gz"
    ).await?;

    let stats = VcfStats::compute(&mut reader)?;
    stats.print_summary();

    Ok(())
}
```

**Implementation Strategy:**
- Use `object_store` crate (same as Polars/DataFusion)
- Support credentials from environment variables
- Implement range requests for indexed access
- Cache frequently accessed chunks

---

## Phase 2: Launch `genomicframe` Package

**Goal:** High-level query engine with Polars-style ergonomics

### 2.1 Package Structure

```
genomicframe/
├── Cargo.toml
├── src/
│   ├── lib.rs           # Public API
│   ├── frame.rs         # GenomicFrame struct
│   ├── plan/
│   │   ├── logical.rs   # LogicalPlan AST
│   │   ├── physical.rs  # PhysicalPlan (execution)
│   │   └── optimizer.rs # Query optimization
│   ├── expr.rs          # Expression types (filters, projections)
│   ├── executor.rs      # Parallel execution engine
│   ├── arrow.rs         # Arrow conversion
│   └── polars.rs        # Polars interop (optional feature)
└── examples/
    └── complex_query.rs
```

### 2.2 Core Dependencies

```toml
[dependencies]
genomicframe-core = "0.1"   # I/O layer
arrow = "53"               # Columnar format
rayon = "1.10"             # Parallelism

# Optional
polars = { version = "0.43", optional = true }
duckdb = { version = "1.0", optional = true }

[features]
default = ["arrow"]
polars = ["dep:polars"]
sql = ["dep:duckdb"]
```

**Key Design Decision:** Arrow is **required**, Polars is **optional**

### 2.3 GenomicFrame API

**Design Philosophy:** GenomicFrame operates in **two modes**:

1. **Genomic Mode** - Stay in genomic domain, delegate to `genomicframe-core` for domain operations
2. **Analytics Mode** - Convert to Arrow/Polars for general-purpose data science

```rust
pub struct GenomicFrame {
    plan: LogicalPlan,     // Lazy execution plan
    schema: GenomicSchema, // Column metadata
}

impl GenomicFrame {
    // ================================================
    // CONSTRUCTORS (Lazy - just build plan)
    // ================================================
    pub fn scan_vcf<P: AsRef<Path>>(path: P) -> Result<Self>;
    pub fn scan_bam<P: AsRef<Path>>(path: P) -> Result<Self>;
    pub fn scan_gff<P: AsRef<Path>>(path: P) -> Result<Self>;
    pub fn scan_fastq<P: AsRef<Path>>(path: P) -> Result<Self>;

    // ================================================
    // TRANSFORMATIONS (Lazy - modify plan)
    // ================================================

    // General filters
    pub fn filter(self, predicate: Expr) -> Self;
    pub fn select(self, columns: &[&str]) -> Self;
    pub fn with_column(self, name: &str, expr: Expr) -> Self;

    // Genomic-specific filters
    pub fn filter_by_region(self, interval: GenomicInterval) -> Self;
    pub fn filter_by_regions(self, intervals: &[GenomicInterval]) -> Self;
    pub fn filter_by_quality(self, min_qual: f64) -> Self;
    pub fn filter_snps_only(self) -> Self;
    pub fn filter_transitions_only(self) -> Self;

    // Joins
    pub fn join(self, other: Self, on: &[&str]) -> Self;
    pub fn join_by_overlap(self, other: Self) -> Self;  // Interval-based join

    // ================================================
    // GENOMIC OPERATIONS (Eager - use genomicframe-core)
    // ================================================
    // These stay in genomic domain, never convert to Arrow!

    // Statistics (delegates to genomicframe-core::VcfStats, BamStats, etc.)
    pub fn statistics(self) -> Result<Statistics>;
    pub fn vcf_stats(self) -> Result<VcfStats>;  // VCF-specific
    pub fn bam_stats(self) -> Result<BamStats>;  // BAM-specific

    // Genomic-specific counts/aggregations
    pub fn count(self) -> Result<usize>;
    pub fn count_transitions(self) -> Result<usize>;
    pub fn count_transversions(self) -> Result<usize>;
    pub fn ts_tv_ratio(self) -> Result<f64>;

    // Interval operations
    pub fn to_intervals(self) -> Result<Vec<GenomicInterval>>;
    pub fn interval_coverage(self) -> Result<Vec<(GenomicInterval, usize)>>;

    // Group by genomic properties
    pub fn group_by_chromosome(self) -> Result<HashMap<String, usize>>;
    pub fn group_by_variant_type(self) -> Result<HashMap<VariantType, usize>>;

    // Extraction operations
    pub fn unique_alleles(self) -> Result<HashSet<String>>;
    pub fn extract_genotypes(self, sample: &str) -> Result<Vec<String>>;

    // ================================================
    // ARROW/POLARS CONVERSION (When leaving genomic domain)
    // ================================================
    // Use these when you need general-purpose analytics

    pub fn collect_arrow(self) -> Result<arrow::RecordBatch>;
    pub fn collect_polars(self) -> Result<polars::DataFrame>;  // Feature-gated
    pub fn to_arrow_stream(self) -> Result<ArrowStreamReader>;
    pub fn head(self, n: usize) -> Result<arrow::RecordBatch>;

    // ================================================
    // OPTIMIZATION & DEBUGGING
    // ================================================
    pub fn explain(self) -> String;           // Show query plan
    pub fn explain_optimized(self) -> String; // Show optimized plan
    pub fn optimize(self) -> Result<Self>;    // Manual optimization
}

// Statistics enum that wraps format-specific stats
pub enum Statistics {
    Vcf(VcfStats),        // From genomicframe-core::formats::vcf::VcfStats
    Bam(BamStats),        // From genomicframe-core::formats::bam::BamStats
    Fastq(FastqStats),    // From genomicframe-core::formats::fastq::FastqStats
}
```
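The genomic aggregations above (`count_transitions`, `ts_tv_ratio`, etc.) rest on a simple base-pair classification; a runnable sketch (the real logic lives in `genomicframe-core`):

```rust
/// Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine
/// (C<->T); every other biallelic SNP is a transversion.
fn is_transition(ref_base: u8, alt_base: u8) -> bool {
    matches!(
        (ref_base, alt_base),
        (b'A', b'G') | (b'G', b'A') | (b'C', b'T') | (b'T', b'C')
    )
}

fn main() {
    let snps = [(b'A', b'G'), (b'C', b'T'), (b'A', b'C'), (b'G', b'T')];
    let ts = snps.iter().filter(|&&(r, a)| is_transition(r, a)).count();
    let tv = snps.len() - ts;
    assert_eq!((ts, tv), (2, 2));
    let ts_tv_ratio = ts as f64 / tv as f64;
    assert_eq!(ts_tv_ratio, 1.0);
}
```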

### 2.4 Expression System

```rust
pub enum Expr {
    // Columns and literals
    Column(String),
    Literal(ScalarValue),

    // Comparisons
    Eq(Box<Expr>, Box<Expr>),
    Neq(Box<Expr>, Box<Expr>),
    Gt(Box<Expr>, Box<Expr>),
    Lt(Box<Expr>, Box<Expr>),

    // Boolean logic
    And(Vec<Expr>),
    Or(Vec<Expr>),
    Not(Box<Expr>),

    // Genomic predicates
    IsTransition,
    IsTransversion,
    IsSnp,
    IsIndel,
    InRegion(GenomicInterval),

    // Aggregations
    Count,
    Mean(Box<Expr>),
    Sum(Box<Expr>),
    Min(Box<Expr>),
    Max(Box<Expr>),

    // Genomic aggregations
    TsTvRatio,
    AlleleFrequency,
}

// Ergonomic helpers
pub fn col(name: &str) -> Expr {
    Expr::Column(name.to_string())
}

pub fn lit<T: Into<ScalarValue>>(value: T) -> Expr {
    Expr::Literal(value.into())
}
```
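To make the expression tree concrete, here is a toy evaluator over a simplified subset (`Qual` and `Row` are stand-ins for illustration, not the real API; the full engine would evaluate against Arrow columns instead of single rows):

```rust
/// Simplified expression subset: one column, literals, Gt, And.
enum Expr {
    Qual, // stands in for Column("QUAL")
    Lit(f64),
    Gt(Box<Expr>, Box<Expr>),
    And(Vec<Expr>),
}

/// Toy row in place of a real record batch.
struct Row {
    qual: f64,
}

fn eval_num(e: &Expr, row: &Row) -> f64 {
    match e {
        Expr::Qual => row.qual,
        Expr::Lit(v) => *v,
        _ => panic!("not a numeric expression"),
    }
}

fn eval_bool(e: &Expr, row: &Row) -> bool {
    match e {
        Expr::Gt(a, b) => eval_num(a, row) > eval_num(b, row),
        Expr::And(es) => es.iter().all(|e| eval_bool(e, row)),
        _ => panic!("not a boolean expression"),
    }
}

fn main() {
    // QUAL > 30.0, wrapped in an And to show nesting
    let pred = Expr::And(vec![Expr::Gt(
        Box::new(Expr::Qual),
        Box::new(Expr::Lit(30.0)),
    )]);
    assert!(eval_bool(&pred, &Row { qual: 45.0 }));
    assert!(!eval_bool(&pred, &Row { qual: 12.0 }));
}
```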

### 2.5 Example: Complete Workflow

```rust
use genomicframe::prelude::*;

// Complex population genetics query
let result = GenomicFrame::scan_vcf("s3://1000genomes/ALL.chr1.vcf.bgz")
    // Filter to exonic regions (using overlap join)
    .join_by_overlap(
        GenomicFrame::scan_gff("gencode.v45.gff3.gz")
            .filter(col("feature").eq(lit("exon")))
    )
    // Quality filters
    .filter(col("QUAL").gt(lit(30.0)))
    .filter(col("FILTER").eq(lit("PASS")))
    .filter(col("is_snp"))
    // Select relevant columns
    .select(&["CHROM", "POS", "REF", "ALT", "samples"])
    // Compute per-population allele frequencies
    .with_column(
        "AF_AFR",
        col("samples").filter_by_population("AFR").allele_frequency()
    )
    .with_column(
        "AF_EUR",
        col("samples").filter_by_population("EUR").allele_frequency()
    )
    // Find variants with large frequency differences
    .filter((col("AF_AFR") - col("AF_EUR")).abs().gt(lit(0.3)))
    // Execute and convert to Polars
    .collect_polars()?;

// Now use Polars for visualization, ML, etc.
result.write_csv("population_differences.csv")?;
```

### 2.6 Query Optimization Pipeline

```rust
impl Optimizer {
    pub fn optimize(plan: LogicalPlan) -> Result<LogicalPlan> {
        let plan = Self::push_down_predicates(plan)?;
        let plan = Self::push_down_projections(plan)?;
        let plan = Self::detect_indexed_scans(plan)?;
        let plan = Self::parallel_execution_planning(plan)?;
        Ok(plan)
    }

    fn push_down_predicates(plan: LogicalPlan) -> Result<LogicalPlan> {
        // Move filters as close to scan as possible
        // Use PushdownFilter trait from genomicframe-core
    }

    fn push_down_projections(plan: LogicalPlan) -> Result<LogicalPlan> {
        // Only read columns that are actually used
        // VCF: Skip INFO parsing if not selected
        // BAM: Skip tag parsing if not needed
    }

    fn detect_indexed_scans(plan: LogicalPlan) -> Result<LogicalPlan> {
        // Look for region filters
        // Check if .tbi/.bai index exists
        // Use indexed reader instead of full scan
    }
}
```

---

## Phase 3: Advanced Features

### 3.1 Window Functions

```rust
// Sliding window over genome
let coverage_windows = GenomicFrame::scan_bam("alignments.bam")
    .coverage()
    .window(1_000, 100)  // window size, step
    .agg(vec![
        mean(col("depth")).alias("mean_coverage"),
        std(col("depth")).alias("coverage_std"),
    ])
    .collect_arrow()?;
```
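The windowed aggregation can be sketched without the query engine, with a plain slice of per-base depths standing in for the coverage track:

```rust
/// Mean depth per fixed-size window, advancing by `step` bases.
/// Trailing bases that do not fill a whole window are dropped.
fn window_means(depths: &[u32], window: usize, step: usize) -> Vec<f64> {
    (0..)
        .map(|i| i * step)
        .take_while(|&start| start + window <= depths.len())
        .map(|start| {
            let w = &depths[start..start + window];
            w.iter().sum::<u32>() as f64 / window as f64
        })
        .collect()
}

fn main() {
    let depths = [10, 10, 20, 20, 30, 30];
    // windows of 4 with step 2: [10,10,20,20] and [20,20,30,30]
    assert_eq!(window_means(&depths, 4, 2), vec![15.0, 25.0]);
}
```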

### 3.2 Python Bindings

```rust
// genomicframe-py (separate repo)
use pyo3::prelude::*;

#[pyclass]
struct PyGenomicFrame {
    inner: GenomicFrame,
}

#[pymethods]
impl PyGenomicFrame {
    #[staticmethod]
    fn scan_vcf(path: &str) -> PyResult<Self> {
        Ok(Self {
            inner: GenomicFrame::scan_vcf(path)?,
        })
    }

    fn filter(&self, predicate: &str) -> PyResult<Self> {
        // Parse string to Expr
        // Apply filter
    }

    fn to_polars(&self) -> PyResult<PyObject> {
        // Return Polars DataFrame to Python
    }
}
```

**Python Usage:**
```python
import genomicframe as gf

df = (gf.scan_vcf("variants.vcf.gz")
    .filter("QUAL > 30")
    .filter("CHROM == 'chr1'")
    .to_polars())

# Now it's a Polars DataFrame - use normal Polars API
df.write_csv("filtered.csv")
```

---

## Timeline & Milestones

### 1: Complete `genomicframe-core` v0.1
- ✅ VCF reader + stats (DONE)
- ⬜ VCF validation + filtering
- ⬜ VCF writer
- ⬜ BAM/SAM reader + stats
- ⬜ FASTQ reader + stats
- ⬜ GFF/GTF reader
- ⬜ Cloud storage (S3/GCS/Azure)
- ⬜ Documentation + tutorials

**Deliverable:** Stable I/O library, 10-200x faster than Python

### 2: Launch `genomicframe` v0.1 (the key differentiator for a unique product in bio)
- ⬜ Core query engine (logical/physical plans)
- ⬜ Arrow integration
- ⬜ Basic optimizer (predicate pushdown)
- ⬜ Parallel execution
- ⬜ Polars conversion (optional feature)
- ⬜ Python bindings (genomicframe-py)
- ⬜ Comprehensive benchmarks
- ⬜ Examples + tutorials

**Deliverable:** Working query engine with lazy evaluation

### 3: Advanced Features
- ⬜ Index support (tabix/BAI)
- ⬜ Window functions
- ⬜ Complex joins (overlap-based)
- ⬜ Performance tuning

**Deliverable:** Production-ready analytics platform

### 4: Ecosystem & Adoption
- ⬜ WebAssembly builds
- ⬜ Cloud-native optimizations
- ⬜ Integration guides (Nextflow, Snakemake)
- ⬜ Academic publication

**Deliverable:** Ecosystem adoption, benchmark paper

---

## Success Metrics

### Performance Targets

| Metric | Target | Baseline (Python) | Improvement |
|--------|--------|-------------------|-------------|
| Parse 1M VCF variants | <1 sec | 120 sec (PyVCF) | 120x |
| Compute VCF stats | <1 sec | 200 sec (scikit-allel) | 200x |
| Filter + aggregate | <2 sec | 300 sec (Pandas) | 150x |
| Memory usage (10M variants) | <100 MB | 8 GB (scikit-allel) | 80x |
| Query latency (indexed) | <100 ms | N/A | New capability |


## Technical Risks & Mitigations

### Risk 1: Arrow Overhead
**Risk:** Arrow conversion adds latency
**Mitigation:** Benchmark carefully; provide streaming API that bypasses Arrow for simple cases

### Risk 2: Query Optimization Complexity
**Risk:** Optimizer bugs lead to incorrect results
**Mitigation:** Extensive testing; compare against Python/pandas results; fuzzing

### Risk 3: Index Fragmentation
**Risk:** Too many index formats (tabix, BAI, custom)
**Mitigation:** Start with tabix/BAI only; unified trait abstracts differences

### Risk 4: Ecosystem Lock-in
**Risk:** Too tightly coupled to Polars/Arrow specifics
**Mitigation:** Keep core format-agnostic; Polars/Arrow as optional features

---

## Why This Plan Works

### Clear Separation of Concerns
- **`genomicframe-core`**: Pure I/O, no query engine complexity
- **`genomicframe`**: Pure query logic, no I/O complexity
- Users choose their level: fast I/O OR full analytics

### Incremental Value
- Phase 1 delivers immediate value (10-200x speedups)
- Phase 2 adds convenience without breaking Phase 1 users
- Phase 3 extends ecosystem without bloating core

### Proven Patterns
- Polars: Lazy query planning works
- DataFusion: Arrow-based execution works
- DuckDB: SQL on custom formats works
- We're applying these patterns to genomics

### Rust Advantages
- Zero-cost abstractions → fast + ergonomic
- Trait system → pluggable formats
- Rayon → trivial parallelism
- Type safety → catch bugs at compile time

---

## Next Steps (Immediate)

1. **Complete VCF validation** (1 week)
2. **Add VCF filtering traits** (1 week)
3. **Implement BAM reader** (2 weeks)
4. **Add Rayon parallel iterators** (1 week)
5. **Cloud storage integration** (2 weeks)

**Total:** ~7 weeks to complete Phase 1

Then: Start `genomicframe` package with basic query engine.

---

## Conclusion

This roadmap balances:
- **Pragmatism**: Deliver value quickly (Phase 1)
- **Ambition**: Build a transformative tool (Phases 2-3)
- **Focus**: Keep packages lean and purposeful
- **Openness**: Integrate with existing ecosystems

**The end result:** a genomics analytics platform that is 10-200x faster than Python while remaining ergonomic and interoperable.

Let's build the future of bioinformatics! 🧬🚀

---

*Last Updated: 2025-11-05*
*Author: Ryan Duffy*
*Status: Living Document*