cyanea-io 0.1.0

File format parsing for the Cyanea bioinformatics ecosystem
Documentation
# cyanea-io Architecture

## Module Map

```
cyanea-io
 +-- csv.rs              [csv] CSV metadata, preview
 +-- vcf.rs              [vcf] VCF variant parsing, stats
 +-- vcf_header.rs       [vcf] Structured VCF 4.3 header construction/parsing
 +-- vcf_ops.rs          [vcf] Normalization, multi-allelic split/join, filtering, set ops
 +-- bed.rs              [bed] BED3-BED6 parsing, interval conversion
 +-- bedpe.rs            [bed] BEDPE paired-end intervals
 +-- bedgraph.rs         [bed] bedGraph/Wiggle signal tracks
 +-- gff.rs              [gff] GFF3 hierarchical gene parsing
 +-- gtf.rs              [gtf] GTF (GFF2) gene parsing
 +-- sam.rs              [sam] SAM record parsing, flag helpers, paired-end stats
 +-- pileup.rs           [sam] Pileup generation from SAM, mpileup output
 +-- bgzf.rs             [bam] BGZF block decompression, virtual offsets
 +-- bam.rs              [bam] BAM binary alignment parsing
 +-- bam_ops.rs          [bam] Sort, merge, mark duplicates, flagstat, depth
 +-- indexed_bam.rs      [bam] Random-access BAM via BAI/CSI index (noodles)
 +-- indexed_vcf.rs      [vcf] Random-access VCF via tabix (noodles)
 +-- cram.rs             [cram] CRAM format via noodles
 +-- bcf.rs              [bcf] BCF2 binary VCF reader
 +-- bcf_write.rs        [bcf] BCF2 binary VCF writer
 +-- variant_call.rs     [variant-calling] Bayesian genotype caller
 +-- blast.rs            [blast] BLAST tabular output (-outfmt 6/7)
 +-- blast_xml.rs        [blast] BLAST XML output (-outfmt 5)
 +-- maf.rs              [maf] Multiple Alignment Format
 +-- genbank.rs          [genbank] GenBank flat file
 +-- embl.rs             [genbank] EMBL/ENA format
 +-- stockholm.rs        [genbank] Stockholm MSA format
 +-- clustal.rs          [genbank] ClustalW/Omega format
 +-- phylip.rs           [genbank] PHYLIP interleaved/sequential
 +-- pir.rs              [genbank] PIR/NBRF protein format
 +-- abi.rs              [genbank] ABI Sanger chromatogram binary
 +-- gfa.rs              [genbank] GFA v1 sequence graph
 +-- bigwig.rs           [bigwig] bigWig/bigBed Kent binary format
 +-- parquet.rs          [parquet] Apache Parquet via arrow/parquet crates
 +-- fetch.rs            [fetch] URL builders for NCBI/UniProt/KEGG/htsget/refget
```

## Design Decisions

### Feature Flag Architecture

Every I/O module is behind a Cargo feature flag. This is essential because:

1. **Dependency minimization**: Each format parser may pull in large dependencies (noodles for BAM/CRAM, arrow/parquet for Parquet). Users only pay for formats they use.
2. **Compile time**: A full build with all features is significantly slower than the default (CSV-only) build.
3. **WASM compatibility**: Some formats depend on filesystem or system libraries that are unavailable in WASM.

Feature implications enforce logical groupings:
- `bam` implies `sam` (BAM records are returned as `SamRecord`)
- `bcf` implies `vcf` (BCF records are returned as `Variant`)
- `cram` implies `sam`
- `parquet` implies `vcf` + `bed` (uses `Variant` and `GenomicInterval` types)
- `variant-calling` implies `sam` + `vcf` (reads pileup data, outputs variants)

### BGZF: Block Gzip with Virtual Offsets

BGZF (Blocked GNU Zip Format) is the compression layer for BAM, BCF, and tabix-indexed VCF. Key properties:

- Series of concatenated gzip blocks, each at most 64 KiB uncompressed.
- **Virtual offset** = `(block_offset << 16) | within_block_offset`, enabling random access by seeking to the compressed block and then to the byte within the decompressed block.
- BAI/TBI/CSI indices store virtual offsets for genomic intervals.

The `bgzf.rs` module provides shared decompression used by both `bam.rs` and `bcf.rs`.

### BAI/TBI/CSI Index Reading for Random Access

Indexed readers (`indexed_bam.rs`, `indexed_vcf.rs`) use noodles for index parsing:

- **BAI** (BAM index): bins of virtual offset ranges, linear index of minimum virtual offsets per 16kb window.
- **TBI** (tabix): generalized index for tab-delimited, coordinate-sorted, bgzipped files (VCF, BED, GFF).
- **CSI** (Coordinate-Sorted Index): supports arbitrary bin sizes for very large genomes.

The workflow: parse the index file, look up bins overlapping the query region, seek to each bin's virtual offset in the BGZF file, decompress blocks, and filter records by coordinate.

### Pileup Generation from SAM Records

`pileup.rs` walks SAM records in coordinate order, expanding CIGAR strings to track which reads cover each reference position. For each position, it maintains:

- Base counts (A, C, G, T, N, deletions, insertions)
- Quality score sums per base
- Depth of coverage
- Forward/reverse strand counts

The pileup can be formatted as samtools mpileup output or fed to the variant caller.

### Variant Calling: Bayesian Genotype Likelihoods

`variant_call.rs` implements a diploid Bayesian genotype caller:

1. For each pileup position with sufficient depth, enumerate candidate alleles.
2. Compute genotype likelihoods P(data | genotype) using base quality error probabilities.
3. Apply diploid genotype priors (homozygous ref, heterozygous, homozygous alt).
4. Compute posterior probabilities and report the MAP genotype.
5. Apply quality filters: minimum depth, minimum allele count, strand bias (Fisher exact test).

### Parquet: Arrow Record Batches with Predicate Pushdown

The Parquet integration uses the Apache Arrow ecosystem:

- Variants and intervals are converted to Arrow record batches with typed columns.
- Region queries use row-group statistics (min/max chromosome, min/max position) for predicate pushdown, skipping entire row groups that cannot contain matches.
- Expression matrices use a columnar layout with one column per sample, enabling efficient per-sample access.

### noodles Integration

noodles provides Rust-native BAM, CRAM, BCF, and tabix parsers. cyanea-io uses noodles for:

- **BAM indexed reading**: `noodles-bam::io::IndexedReader` for BAI/CSI-backed random access.
- **CRAM**: `noodles-cram` for reference-based compression format.
- **BCF**: Header parsing and typed-value decoding (though the core BCF reader is custom).
- **Indexed VCF**: `noodles-vcf::io::IndexedReader` for tabix-backed random access.

All noodles records are converted to cyanea-io's own types (`SamRecord`, `Variant`) at the boundary, keeping the public API independent of noodles version.