cyanea-io 0.1.0

File format parsing for the Cyanea bioinformatics ecosystem
# cyanea-io API Reference

Unified file format parsing for bioinformatics and tabular data. Each parser is behind a feature flag to keep the dependency tree minimal.

## Public API

### CSV (`csv.rs`, `csv` feature)

| Type/Function | Description |
|---------------|-------------|
| `CsvInfo` | `delimiter`, `rows`, `columns`, `sample_preview` |
| `parse_csv_info(path) -> Result<CsvInfo>` | Extract CSV metadata |
| `csv_preview(path, limit) -> Result<String>` | First N rows as a string |

### VCF (`vcf.rs`, `vcf` feature)

| Type/Function | Description |
|---------------|-------------|
| `parse_vcf(path) -> Result<Vec<Variant>>` | Parse VCF data lines into `Variant` records |
| `VcfStats` | `variant_count`, `snv_count`, `indel_count`, `pass_count`, `chromosomes` |
| `vcf_stats(path) -> Result<VcfStats>` | Streaming VCF summary statistics |
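
The `snv_count` and `indel_count` fields suggest a per-record classification by allele length. The sketch below shows one common way to do that on raw VCF text. It is self-contained and does not call cyanea-io; `VariantClass`, `classify`, and `count_snv_indel` are hypothetical names, and the crate's own rules (for example for multi-allelic or symbolic alleles) may differ.

```rust
/// Coarse variant class for one REF/ALT pair (illustrative, not the crate's type).
#[derive(Debug, PartialEq, Eq)]
pub enum VariantClass {
    Snv,
    Indel,
    Other,
}

/// Classify a single REF/ALT pair by allele length.
pub fn classify(ref_allele: &str, alt_allele: &str) -> VariantClass {
    // Symbolic alleles such as <DEL> or * are not plain sequence.
    let is_plain = |s: &str| !s.is_empty() && s.bytes().all(|b| b"ACGTNacgtn".contains(&b));
    if !is_plain(ref_allele) || !is_plain(alt_allele) {
        return VariantClass::Other;
    }
    match (ref_allele.len(), alt_allele.len()) {
        (1, 1) => VariantClass::Snv,
        (r, a) if r != a => VariantClass::Indel,
        // Equal-length multi-base substitutions are neither SNV nor indel.
        _ => VariantClass::Other,
    }
}

/// Count (SNVs, indels) across VCF data lines, one count per ALT allele.
pub fn count_snv_indel(vcf_text: &str) -> (usize, usize) {
    let mut counts = (0, 0);
    for line in vcf_text.lines().filter(|l| !l.is_empty() && !l.starts_with('#')) {
        let fields: Vec<&str> = line.split('\t').collect();
        if fields.len() < 5 {
            continue; // not a complete data line
        }
        // Column 4 is REF, column 5 is a comma-separated ALT list.
        for alt in fields[4].split(',') {
            match classify(fields[3], alt) {
                VariantClass::Snv => counts.0 += 1,
                VariantClass::Indel => counts.1 += 1,
                VariantClass::Other => {}
            }
        }
    }
    counts
}
```

Counting per ALT allele rather than per record is a choice made for this sketch; a per-record count would treat `A -> G,AT` as a single mixed site.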

### VCF header (`vcf_header.rs`, `vcf` feature)

| Type | Description |
|------|-------------|
| `VcfHeader` | Structured VCF 4.3 header with contigs, INFO, FORMAT, FILTER definitions |
| `ContigLine` | Contig definition: `id`, `length` |
| `FieldDef` | INFO/FORMAT field: `id`, `number`, `field_type`, `description` |
| `FilterDef` | FILTER definition: `id`, `description` |

| Method | Description |
|--------|-------------|
| `VcfHeader::new() -> Self` | Create empty header |
| `VcfHeader::add_contig(id, length) -> &mut Self` | Add contig definition |
| `VcfHeader::add_info(field) -> &mut Self` | Add INFO field |
| `VcfHeader::add_format(field) -> &mut Self` | Add FORMAT field |
| `VcfHeader::add_filter(filter) -> &mut Self` | Add FILTER definition |
| `VcfHeader::to_string() -> String` | Serialize to VCF header text |
| `VcfHeader::parse(text) -> Result<Self>` | Parse VCF header text |

### VCF operations (`vcf_ops.rs`, `vcf` feature)

| Function | Description |
|----------|-------------|
| `split_multiallelic(variant) -> Vec<Variant>` | Split multi-allelic into biallelic records |
| `join_biallelic(variants) -> Option<Variant>` | Join biallelic variants at same position |
| `left_align(variant, ref_seq) -> Variant` | Left-align indels against reference |
| `normalize_variant(variant, ref_seq) -> Variant` | Full normalization (left-align + trim) |
| `filter_variants(variants, expression) -> Result<Vec<Variant>>` | Filter with expression parsing |
| `intersect_variants(a, b) -> Vec<Variant>` | Set intersection (matching chrom/pos/ref/alt) |
| `subtract_variants(a, b) -> Vec<Variant>` | Set subtraction |
| `concordance(a, b) -> VariantConcordance` | Concordance statistics (TP, FP, FN) |
| `detailed_vcf_stats(variants) -> DetailedVcfStats` | Ti/Tv, het/hom ratio, per-chrom counts |
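
The set operations above match variants on chrom/pos/ref/alt, and concordance reports TP, FP, and FN. A minimal sketch of that idea using plain tuples follows. It is independent of the crate: `VariantKey`, `Concordance`, `key`, and `concordance_sketch` are hypothetical, and treating the first argument as the call set and the second as the truth set is an assumption, since the table does not say which side is the reference.

```rust
use std::collections::HashSet;

/// Identity key for a variant: (chrom, pos, ref, alt).
pub type VariantKey = (String, u64, String, String);

/// Concordance counts between a call set and a truth set.
#[derive(Debug, PartialEq, Eq)]
pub struct Concordance {
    pub true_positives: usize,
    pub false_positives: usize,
    pub false_negatives: usize,
}

/// Convenience constructor for a key.
pub fn key(chrom: &str, pos: u64, ref_allele: &str, alt_allele: &str) -> VariantKey {
    (chrom.to_string(), pos, ref_allele.to_string(), alt_allele.to_string())
}

/// Compare calls against truth by exact key match.
pub fn concordance_sketch(calls: &[VariantKey], truth: &[VariantKey]) -> Concordance {
    let call_set: HashSet<&VariantKey> = calls.iter().collect();
    let truth_set: HashSet<&VariantKey> = truth.iter().collect();
    Concordance {
        // Present in both sets.
        true_positives: call_set.intersection(&truth_set).count(),
        // Called but absent from truth.
        false_positives: call_set.difference(&truth_set).count(),
        // In truth but not called.
        false_negatives: truth_set.difference(&call_set).count(),
    }
}
```

Exact key matching only works well on normalized input, which is presumably why normalization and left-alignment sit in the same module.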

### BED (`bed.rs`, `bed` feature)

| Type/Function | Description |
|---------------|-------------|
| `BedRecord` | BED3-BED6 fields: `chrom`, `start`, `end`, `name`, `score`, `strand` |
| `parse_bed(path) -> Result<Vec<BedRecord>>` | Parse BED file |
| `parse_bed_intervals(path) -> Result<Vec<GenomicInterval>>` | Parse into `GenomicInterval` |
| `BedStats` | `record_count`, `total_bases`, `chromosomes` |
| `bed_stats(path) -> Result<BedStats>` | Summary statistics |
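
BED coordinates are 0-based and half-open, so an interval covers `end - start` bases. The sketch below parses BED3 fields from text and sums interval lengths. It does not use the crate; `parse_bed3` and `total_bases` are hypothetical helpers, and whether the crate's `total_bases` merges overlapping intervals is not stated (this sketch does not merge).

```rust
/// Parse the first three BED columns: chrom, start (0-based), end (exclusive).
pub fn parse_bed3(line: &str) -> Option<(String, u64, u64)> {
    let mut fields = line.split_whitespace();
    let chrom = fields.next()?.to_string();
    let start: u64 = fields.next()?.parse().ok()?;
    let end: u64 = fields.next()?.parse().ok()?;
    if end < start {
        return None; // malformed interval
    }
    Some((chrom, start, end))
}

/// Sum interval lengths over all data lines, skipping comments and track lines.
pub fn total_bases(bed_text: &str) -> u64 {
    bed_text
        .lines()
        .filter(|l| {
            !l.is_empty()
                && !l.starts_with('#')
                && !l.starts_with("track")
                && !l.starts_with("browser")
        })
        .filter_map(parse_bed3)
        .map(|(_, start, end)| end - start) // half-open: length is end - start
        .sum()
}
```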

### BEDPE (`bedpe.rs`, `bed` feature)

| Type/Function | Description |
|---------------|-------------|
| `BedpeRecord` | Paired-end record: `interval1`, `interval2`, `name`, `score` |
| `parse_bedpe(path) -> Result<Vec<BedpeRecord>>` | Parse BEDPE file |
| `BedpeStats` | `record_count`, `total_span`, `inter_chromosomal`, `intra_chromosomal` |
| `bedpe_stats(path) -> Result<BedpeStats>` | Summary statistics |

### GFF3 (`gff.rs`, `gff` feature)

| Type/Function | Description |
|---------------|-------------|
| `parse_gff3(path) -> Result<Vec<Gene>>` | Hierarchical assembly: Gene -> Transcript -> Exon |
| `GffStats` | `gene_count`, `transcript_count`, `exon_count`, `protein_coding_count` |
| `gff3_stats(path) -> Result<GffStats>` | Summary statistics |

### GTF (`gtf.rs`, `gtf` feature)

| Type/Function | Description |
|---------------|-------------|
| `parse_gtf(path) -> Result<Vec<Gene>>` | Hierarchical assembly from GTF (GFF2) format |
| `GtfStats` | `gene_count`, `transcript_count`, `exon_count`, `cds_count` |
| `gtf_stats(path) -> Result<GtfStats>` | Summary statistics |

### SAM (`sam.rs`, `sam` feature)

| Type/Function | Description |
|---------------|-------------|
| `SamRecord` | Full SAM record: `qname`, `flag`, `rname`, `pos`, `mapq`, `cigar`, etc. |
| `parse_sam(path) -> Result<Vec<SamRecord>>` | Parse SAM text format |
| `SamStats` | `total_reads`, `mapped`, `unmapped`, `avg_mapq`, `mapq_distribution` |
| `sam_stats(records) -> SamStats` | Compute stats from records |
| `sam_stats_from_path(path) -> Result<SamStats>` | Streaming SAM statistics |
| `SamPair` | Paired reads: `r1`, `r2` with `insert_size()` |
| `PairedSamStats` | `base`, `paired_count`, `proper_pair_count`, `singletons`, `avg_insert_size` |
| `pair_sam_records(records) -> Vec<SamPair>` | Group records into mate pairs |
| `filter_proper_pairs(records) -> Vec<&SamRecord>` | Filter proper pairs (FLAG 0x2) |
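
The proper-pair filter is keyed on SAM FLAG bit 0x2. As a rough illustration of how FLAG bits are tested, the following standalone sketch reads the FLAG column from a SAM line and checks a few bits defined by the SAM specification. The helper names are hypothetical and the crate's own filter may apply additional conditions; requiring 0x1 alongside 0x2 is a choice made here because 0x2 is only meaningful for paired reads.

```rust
/// SAM FLAG bits (values from the SAM specification).
pub const FLAG_PAIRED: u16 = 0x1;
pub const FLAG_PROPER_PAIR: u16 = 0x2;
pub const FLAG_UNMAPPED: u16 = 0x4;

/// Extract the FLAG field (second column) from a SAM alignment line.
pub fn flag_of(sam_line: &str) -> Option<u16> {
    sam_line.split('\t').nth(1)?.parse().ok()
}

/// True when the read is paired and the aligner marked the pair as proper.
pub fn is_proper_pair(flag: u16) -> bool {
    (flag & FLAG_PAIRED) != 0 && (flag & FLAG_PROPER_PAIR) != 0
}

/// True when the read itself is unmapped.
pub fn is_unmapped(flag: u16) -> bool {
    (flag & FLAG_UNMAPPED) != 0
}
```

For example, FLAG 99 (0x63) is a paired, properly paired, first-in-pair read whose mate is on the reverse strand, so `is_proper_pair(99)` holds, while FLAG 65 (0x41) is paired but not proper.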

### Pileup (`pileup.rs`, `sam` feature)

| Type/Function | Description |
|---------------|-------------|
| `PileupColumn` | Per-position column: `chrom`, `pos`, `ref_base`, `depth`, base counts, qualities |
| `Pileup` | Collection of pileup columns across a region |
| `generate_pileup(records, ref_name) -> Result<Pileup>` | Generate pileup from SAM records |
| `pileup_to_mpileup(pileup) -> String` | Format as samtools mpileup output |

### BGZF (`bgzf.rs`, `bam` feature)

| Type/Function | Description |
|---------------|-------------|
| `VirtualOffset` | BGZF virtual offset: block offset (48 bits) + within-block offset (16 bits) |
| `read_bgzf_block(reader) -> Result<Option<Vec<u8>>>` | Read and decompress next block |
| `decompress_bgzf(data) -> Result<Vec<u8>>` | Decompress full BGZF-compressed data |
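
A BGZF virtual offset packs the compressed block offset into the upper 48 bits of a `u64` and the offset within the decompressed block into the lower 16 bits. The sketch below shows that packing on a hypothetical `VOffset` type; it is not the crate's `VirtualOffset`, whose constructor and accessors are not documented here.

```rust
/// Illustrative virtual offset: (block_offset << 16) | within_block.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct VOffset(pub u64);

impl VOffset {
    /// Pack the two parts; returns None if the block offset needs more than 48 bits.
    pub fn new(block_offset: u64, within_block: u16) -> Option<Self> {
        if (block_offset >> 48) != 0 {
            return None;
        }
        Some(VOffset((block_offset << 16) | u64::from(within_block)))
    }

    /// File offset of the start of the compressed block.
    pub fn block_offset(self) -> u64 {
        self.0 >> 16
    }

    /// Byte offset inside the decompressed block.
    pub fn within_block(self) -> u16 {
        (self.0 & 0xFFFF) as u16
    }
}
```

Because the block offset occupies the high bits, comparing two packed values as plain integers orders them by file position, which is what index lookups rely on.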

### BAM (`bam.rs`, `bam` feature)

| Type/Function | Description |
|---------------|-------------|
| `parse_bam(path) -> Result<Vec<SamRecord>>` | Parse BAM (BGZF-compressed) format |
| `bam_stats(path) -> Result<SamStats>` | Streaming BAM statistics |
| `BamReference` | Reference sequence metadata from BAM header |

### BAM operations (`bam_ops.rs`, `bam` feature)

| Type/Function | Description |
|---------------|-------------|
| `SortOrder` | Enum: `Coordinate`, `Queryname`, `Unsorted` |
| `coordinate_sort(records, ref_order)` | Sort by reference + position |
| `queryname_sort(records)` | Sort by query name |
| `merge_sorted(a, b, ref_order) -> Vec<SamRecord>` | Merge two sorted record lists |
| `mark_duplicates(records) -> DuplicateReport` | Mark PCR/optical duplicates |
| `DuplicateReport` | Duplicate counts: `total`, `duplicates`, `optical`, `rate` |
| `depth_of_coverage(records, ref_name) -> Vec<u32>` | Per-position depth |
| `flagstat(records) -> FlagstatReport` | samtools flagstat-equivalent |
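
Per-position depth can be computed from aligned intervals with a difference array: add 1 where a read starts, subtract 1 where it ends, then take a running sum. The sketch below assumes each read has already been reduced to a 0-based half-open `(start, end)` span on one reference. `depth_from_intervals` is a hypothetical helper, and the crate's `depth_of_coverage` presumably also has to interpret CIGAR strings, which this sketch ignores.

```rust
/// Per-position depth over a reference of `ref_len` bases.
/// Intervals are 0-based, half-open, and clipped to the reference length.
pub fn depth_from_intervals(intervals: &[(usize, usize)], ref_len: usize) -> Vec<u32> {
    // delta[i] holds the change in depth when moving onto position i.
    let mut delta = vec![0i64; ref_len + 1];
    for &(start, end) in intervals {
        let s = start.min(ref_len);
        let e = end.min(ref_len);
        if s < e {
            delta[s] += 1;
            delta[e] -= 1;
        }
    }
    // The running sum of deltas is the depth at each position.
    let mut depth = Vec::with_capacity(ref_len);
    let mut running = 0i64;
    for d in &delta[..ref_len] {
        running += d;
        depth.push(running as u32);
    }
    depth
}
```

This runs in time linear in the number of reads plus the reference length, instead of touching every covered base of every read.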

### Indexed BAM (`indexed_bam.rs`, `bam` feature)

| Type | Description |
|------|-------------|
| `IndexedBamReader` | Random-access BAM reader using BAI/CSI index |

| Method | Description |
|--------|-------------|
| `open(bam_path) -> Result<Self>` | Open BAM with auto-detected `.bai` index |
| `open_with_index(bam_path, bai_path) -> Result<Self>` | Open with explicit index path |
| `query(region) -> Result<Vec<SamRecord>>` | Fetch records overlapping a genomic region |
| `references() -> &[BamReference]` | Reference sequence metadata |
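
Region queries are often written in samtools style, `chrom:start-end`, with 1-based inclusive coordinates. The reference does not say whether `query` takes a string or a struct, so the following is only a sketch of parsing such a string and testing overlap; `Region`, `parse_region`, and `overlaps` are hypothetical.

```rust
/// A 1-based, inclusive genomic region.
#[derive(Debug, PartialEq, Eq)]
pub struct Region {
    pub chrom: String,
    pub start: u64,
    pub end: u64,
}

/// Parse "chrom:start-end"; thousands separators (commas) are tolerated.
pub fn parse_region(text: &str) -> Option<Region> {
    // Split on the last ':' because contig names may themselves contain ':'.
    let (chrom, range) = text.rsplit_once(':')?;
    let (start, end) = range.split_once('-')?;
    let start: u64 = start.replace(',', "").parse().ok()?;
    let end: u64 = end.replace(',', "").parse().ok()?;
    if chrom.is_empty() || start == 0 || end < start {
        return None;
    }
    Some(Region { chrom: chrom.to_string(), start, end })
}

/// True when a 1-based inclusive feature [start, end] on `chrom` overlaps the region.
pub fn overlaps(region: &Region, chrom: &str, start: u64, end: u64) -> bool {
    region.chrom == chrom && start <= region.end && end >= region.start
}
```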

### Indexed VCF (`indexed_vcf.rs`, `vcf` feature)

| Type | Description |
|------|-------------|
| `IndexedVcfReader` | Random-access VCF reader using tabix (.tbi) index |

| Method | Description |
|--------|-------------|
| `open(vcf_gz_path) -> Result<Self>` | Open bgzipped VCF with auto-detected `.tbi` index |
| `open_with_index(vcf_gz_path, index_path) -> Result<Self>` | Open with explicit index |
| `query(region) -> Result<Vec<Variant>>` | Fetch variants overlapping a region |

### CRAM (`cram.rs`, `cram` feature)

| Type/Function | Description |
|---------------|-------------|
| `CramConfig` | Configuration: `reference_path: Option<PathBuf>` |
| `parse_cram(path, config) -> Result<Vec<SamRecord>>` | Parse CRAM format |
| `parse_cram_default(path) -> Result<Vec<SamRecord>>` | Parse without external reference |
| `cram_stats(path, config) -> Result<SamStats>` | CRAM summary statistics |

### BCF (`bcf.rs`, `bcf` feature)

| Type/Function | Description |
|---------------|-------------|
| `parse_bcf(path) -> Result<Vec<Variant>>` | Parse BCF2 binary VCF |
| `bcf_stats(path) -> Result<VcfStats>` | BCF summary statistics |

### BCF writer (`bcf_write.rs`, `bcf` feature)

| Function | Description |
|----------|-------------|
| `write_bcf(header, variants, path) -> Result<()>` | Write variants in BCF2 format to file |
| `write_bcf_bytes(header, variants) -> Result<Vec<u8>>` | Write to byte vector (BGZF-compressed) |

### Variant calling (`variant_call.rs`, `variant-calling` feature)

| Type | Description |
|------|-------------|
| `VariantCallConfig` | Configuration: min_depth, min_base_quality, min_mapq, priors, strand bias threshold |
| `VariantCallResult` | Called variant with genotype, quality, depth, allele counts |

| Function | Description |
|----------|-------------|
| `call_variants(pileup, config) -> Vec<VariantCallResult>` | Bayesian genotype calling from pileup |
| `pileup_to_vcf(results, header) -> String` | Format calls as VCF text |
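
To make "Bayesian genotype calling" concrete, here is a deliberately simple diploid, biallelic model: each base is treated as independent, a sequencing error flips the observed allele with probability `error_rate`, and the posterior is likelihood times prior, normalized. This is a textbook-style sketch under those assumptions, not the crate's algorithm. `Genotype` and `call_genotype` are hypothetical, and base qualities, mapping quality, and strand bias are all ignored.

```rust
/// Diploid genotype at a biallelic site.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Genotype {
    HomRef,
    Het,
    HomAlt,
}

/// Return the most probable genotype and the posteriors [HomRef, Het, HomAlt].
/// Assumes 0 < error_rate < 1 and strictly positive priors.
pub fn call_genotype(
    ref_count: u32,
    alt_count: u32,
    error_rate: f64,
    priors: [f64; 3],
) -> (Genotype, [f64; 3]) {
    let r = f64::from(ref_count);
    let a = f64::from(alt_count);
    let e = error_rate;
    // Log-likelihood of the observed counts under each genotype.
    let log_lik = [
        r * (1.0 - e).ln() + a * e.ln(), // hom-ref: alt bases are errors
        (r + a) * 0.5_f64.ln(),          // het: either allele with probability 1/2
        r * e.ln() + a * (1.0 - e).ln(), // hom-alt: ref bases are errors
    ];
    let mut log_post = [0.0_f64; 3];
    for i in 0..3 {
        log_post[i] = log_lik[i] + priors[i].ln();
    }
    // Normalize in log space (log-sum-exp) to avoid underflow at high depth.
    let max = log_post.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let total: f64 = log_post.iter().map(|lp| (lp - max).exp()).sum();
    let mut post = [0.0_f64; 3];
    let mut best = 0;
    for i in 0..3 {
        post[i] = (log_post[i] - max).exp() / total;
        if post[i] > post[best] {
            best = i;
        }
    }
    let genotype = [Genotype::HomRef, Genotype::Het, Genotype::HomAlt][best];
    (genotype, post)
}
```

A strongly skewed prior toward `HomRef` keeps isolated error bases from being called as variants, which is one reason a caller exposes priors in its configuration.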

### BLAST tabular (`blast.rs`, `blast` feature)

| Type/Function | Description |
|---------------|-------------|
| `BlastRecord` | 12-column tabular: `query_id`, `subject_id`, `pct_identity`, `evalue`, etc. |
| `parse_blast(path) -> Result<Vec<BlastRecord>>` | Parse `-outfmt 6`/`7` output |
| `BlastStats` | `hit_count`, `unique_queries`, `unique_subjects`, `avg_identity`, `avg_evalue` |
| `blast_stats(path) -> Result<BlastStats>` | Summary statistics |
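
The default BLAST+ tabular layout has twelve tab-separated columns: query id, subject id, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, e-value, and bit score. The sketch below parses one such line into a hypothetical `BlastHit` struct. Only `query_id`, `subject_id`, `pct_identity`, and `evalue` are field names taken from this reference; the rest are assumptions, and custom `-outfmt` column lists are not handled.

```rust
/// One hit from default 12-column BLAST tabular output (illustrative type).
#[derive(Debug, PartialEq)]
pub struct BlastHit {
    pub query_id: String,
    pub subject_id: String,
    pub pct_identity: f64,
    pub length: u64,
    pub mismatches: u64,
    pub gap_opens: u64,
    pub q_start: u64,
    pub q_end: u64,
    pub s_start: u64,
    pub s_end: u64,
    pub evalue: f64,
    pub bit_score: f64,
}

/// Parse a data line; returns None for comments (`#`, as in outfmt 7) or bad rows.
pub fn parse_blast_line(line: &str) -> Option<BlastHit> {
    let line = line.trim_end();
    if line.is_empty() || line.starts_with('#') {
        return None;
    }
    let f: Vec<&str> = line.split('\t').collect();
    if f.len() < 12 {
        return None;
    }
    Some(BlastHit {
        query_id: f[0].to_string(),
        subject_id: f[1].to_string(),
        pct_identity: f[2].parse().ok()?,
        length: f[3].parse().ok()?,
        mismatches: f[4].parse().ok()?,
        gap_opens: f[5].parse().ok()?,
        q_start: f[6].parse().ok()?,
        q_end: f[7].parse().ok()?,
        s_start: f[8].parse().ok()?,
        s_end: f[9].parse().ok()?,
        evalue: f[10].parse().ok()?,
        bit_score: f[11].parse().ok()?,
    })
}
```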

### BLAST XML (`blast_xml.rs`, `blast` feature)

| Type | Description |
|------|-------------|
| `BlastXmlResult` | Top-level result: program, version, db, query, iterations |
| `BlastXmlIteration` | Search iteration with hits |
| `BlastXmlHit` | Database hit with HSPs |
| `BlastXmlHsp` | High-scoring segment pair: scores, positions, sequences |

| Function | Description |
|----------|-------------|
| `parse_blast_xml(input) -> Result<BlastXmlResult>` | Parse BLAST `-outfmt 5` XML output |

### MAF (`maf.rs`, `maf` feature)

| Type/Function | Description |
|---------------|-------------|
| `MafBlock` | `score`, `sequences: Vec<MafSequence>` |
| `MafSequence` | `src`, `start`, `size`, `strand`, `src_size`, `text` |
| `parse_maf(path) -> Result<Vec<MafBlock>>` | Parse MAF alignment blocks |
| `MafStats` | `block_count`, `total_aligned_bases`, `species` |
| `maf_stats(path) -> Result<MafStats>` | Summary statistics |

### GenBank (`genbank.rs`, `genbank` feature)

| Type/Function | Description |
|---------------|-------------|
| `GenbankRecord` | `locus`, `definition`, `accession`, `version`, `organism`, `sequence`, `features` |
| `GenbankFeature` | `feature_type`, `location`, `qualifiers` |
| `parse_genbank(path) -> Result<Vec<GenbankRecord>>` | Parse multi-record GenBank flat file |
| `GenbankStats` | `record_count`, `total_bases`, `feature_counts` |
| `genbank_stats(path) -> Result<GenbankStats>` | Summary statistics |

### EMBL (`embl.rs`, `genbank` feature)

| Type | Description |
|------|-------------|
| `EmblRecord` | `id`, `accession`, `description`, `sequence`, `features` |

| Function | Description |
|----------|-------------|
| `parse_embl(input) -> Result<Vec<EmblRecord>>` | Parse EMBL/ENA format records |
| `write_embl(records) -> String` | Write EMBL format |

### bigWig/bigBed (`bigwig.rs`, `bigwig` feature)

| Type/Function | Description |
|---------------|-------------|
| `BigWigHeader` | `version`, `zoom_levels`, `chrom_count`, `total_summary` |
| `BigWigSummary` | `bases_covered`, `min_val`, `max_val`, `sum`, `sum_squares` |
| `BigWigInterval` | `chrom`, `start`, `end`, `value` |
| `BigBedRecord` | `chrom`, `start`, `end`, `rest` |
| `read_bigwig_header(path) -> Result<BigWigHeader>` | Read bigWig header + summary |
| `read_bigwig_intervals(path, chrom, start, end) -> Result<Vec<BigWigInterval>>` | Query intervals |
| `read_bigbed_header(path) -> Result<BigWigHeader>` | Read bigBed header |
| `read_bigbed_records(path, chrom, start, end) -> Result<Vec<BigBedRecord>>` | Query records |

### Parquet (`parquet.rs`, `parquet` feature)

| Type/Function | Description |
|---------------|-------------|
| `ParquetInfo` | `num_rows`, `num_columns`, `column_names`, `num_row_groups`, `created_by` |
| `parquet_info(path) -> Result<ParquetInfo>` | Extract Parquet metadata |
| `write_variants_parquet(variants, path) -> Result<()>` | Write variants to Parquet |
| `read_variants_parquet(path) -> Result<Vec<Variant>>` | Read variants from Parquet |
| `read_variants_parquet_region(path, chrom, start, end) -> Result<Vec<Variant>>` | Region query with pushdown |
| `write_intervals_parquet(intervals, path) -> Result<()>` | Write intervals to Parquet |
| `read_intervals_parquet(path) -> Result<Vec<GenomicInterval>>` | Read intervals |
| `ExpressionMatrix` | `gene_ids`, `sample_ids`, `values` |
| `write_expression_parquet(matrix, path) -> Result<()>` | Write expression matrix |
| `read_expression_parquet(path) -> Result<ExpressionMatrix>` | Read expression matrix |

### Stockholm (`stockholm.rs`, `genbank` feature)

| Type | Description |
|------|-------------|
| `StockholmAlignment` | Aligned sequences, GC/GS/GR annotations |

| Function | Description |
|----------|-------------|
| `parse_stockholm(input) -> Result<Vec<StockholmAlignment>>` | Parse Stockholm 1.0 format |
| `write_stockholm(alignments) -> String` | Write Stockholm format |

### Clustal (`clustal.rs`, `genbank` feature)

| Type | Description |
|------|-------------|
| `ClustalAlignment` | Aligned sequences with conservation string |

| Function | Description |
|----------|-------------|
| `parse_clustal(input) -> Result<ClustalAlignment>` | Parse ClustalW/Omega format |
| `write_clustal(alignment) -> String` | Write Clustal format |

### Phylip (`phylip.rs`, `genbank` feature)

| Type | Description |
|------|-------------|
| `PhylipAlignment` | Aligned sequences with `n_taxa`, `n_sites` |

| Function | Description |
|----------|-------------|
| `parse_phylip(input) -> Result<PhylipAlignment>` | Parse PHYLIP interleaved format |
| `parse_phylip_sequential(input) -> Result<PhylipAlignment>` | Parse PHYLIP sequential format |
| `write_phylip(alignment) -> String` | Write PHYLIP interleaved format |

### PIR/NBRF (`pir.rs`, `genbank` feature)

| Type | Description |
|------|-------------|
| `PirRecord` | `entry_type`, `name`, `description`, `sequence` |

| Function | Description |
|----------|-------------|
| `parse_pir(input) -> Result<Vec<PirRecord>>` | Parse PIR/NBRF format |
| `write_pir(records) -> String` | Write PIR format |

### ABI chromatogram (`abi.rs`, `genbank` feature)

| Type | Description |
|------|-------------|
| `AbiRecord` | `sequence`, `quality`, `traces`, `sample_name`, `peak_positions` |
| `AbiTraces` | Raw fluorescence: `a`, `c`, `g`, `t` channels |

| Function | Description |
|----------|-------------|
| `parse_abi(data) -> Result<AbiRecord>` | Parse ABI (.ab1) binary trace file |

### bedGraph/Wiggle (`bedgraph.rs`, `bed` feature)

| Type | Description |
|------|-------------|
| `BedGraphRecord` | `chrom`, `start`, `end`, `value` |

| Function | Description |
|----------|-------------|
| `parse_bedgraph_str(input) -> Result<Vec<BedGraphRecord>>` | Parse bedGraph format from string |
| `parse_bedgraph(path) -> Result<Vec<BedGraphRecord>>` | Parse bedGraph file |
| `parse_wiggle_str(input) -> Result<Vec<BedGraphRecord>>` | Parse WIG format (variableStep/fixedStep) |
| `write_bedgraph(records) -> String` | Write bedGraph format |
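
Wiggle and bedGraph use different coordinate conventions: WIG `fixedStep` positions are 1-based, while bedGraph intervals are 0-based and half-open. The sketch below converts one `fixedStep` block into bedGraph-style tuples to show the arithmetic. `fixed_step_to_bedgraph` is a hypothetical helper operating on already-parsed header values, not the crate's parser.

```rust
/// Convert a WIG fixedStep block to (chrom, start0, end0, value) tuples.
/// `start` is the 1-based position from the fixedStep header line.
pub fn fixed_step_to_bedgraph(
    chrom: &str,
    start: u64,
    step: u64,
    span: u64,
    values: &[f64],
) -> Vec<(String, u64, u64, f64)> {
    if start == 0 {
        return Vec::new(); // WIG positions are 1-based, so 0 is invalid
    }
    values
        .iter()
        .enumerate()
        .map(|(i, &value)| {
            // Shift to 0-based, then advance by `step` for each value.
            let start0 = (start - 1) + (i as u64) * step;
            (chrom.to_string(), start0, start0 + span, value)
        })
        .collect()
}
```

For a header of `fixedStep chrom=chr1 start=101 step=10 span=5`, the first value maps to `chr1 100 105` and the second to `chr1 110 115`.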

### GFA (`gfa.rs`, `genbank` feature)

| Type | Description |
|------|-------------|
| `GfaSegment` | Segment: `name`, `sequence`, `length` |
| `GfaLink` | Link: `from_segment`, `from_orient`, `to_segment`, `to_orient`, `overlap` |
| `GfaPath` | Path: `name`, `segment_names`, `overlaps` |
| `GfaGraph` | Complete GFA graph: segments, links, paths |

| Function | Description |
|----------|-------------|
| `parse_gfa(input) -> Result<GfaGraph>` | Parse GFA v1 format |
| `write_gfa(graph) -> String` | Write GFA v1 format |
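
GFA v1 is line-oriented: the first tab-separated field is a record type, with `S` lines carrying a segment name and sequence and `L` lines carrying two oriented segment names plus an overlap CIGAR. A minimal sketch of reading those two record types is shown below. `MiniGfa` and `parse_gfa_sketch` are hypothetical and skip headers, paths, and optional tags; the crate's `GfaGraph` is richer.

```rust
/// Segments as (name, sequence); links as (from, from_forward, to, to_forward, overlap).
#[derive(Debug, Default, PartialEq, Eq)]
pub struct MiniGfa {
    pub segments: Vec<(String, String)>,
    pub links: Vec<(String, bool, String, bool, String)>,
}

/// Collect S and L records from GFA v1 text, ignoring every other line type.
pub fn parse_gfa_sketch(text: &str) -> MiniGfa {
    let mut graph = MiniGfa::default();
    for line in text.lines() {
        let fields: Vec<&str> = line.split('\t').collect();
        match fields.as_slice() {
            // S <name> <sequence> [tags...]
            ["S", name, seq, ..] => graph.segments.push((name.to_string(), seq.to_string())),
            // L <from> <from_orient> <to> <to_orient> <overlap> [tags...]
            ["L", from, from_orient, to, to_orient, overlap, ..] => graph.links.push((
                from.to_string(),
                *from_orient == "+",
                to.to_string(),
                *to_orient == "+",
                overlap.to_string(),
            )),
            _ => {}
        }
    }
    graph
}
```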

### Fetch helpers (`fetch.rs`, `fetch` feature)

URL builders and response parsers for bioinformatics APIs. No HTTP client -- pure URL construction and text parsing.

| Type | Description |
|------|-------------|
| `EntrezUrl` | NCBI Entrez E-utilities URL builder (esearch, efetch, esummary) |
| `UniProtUrl` | UniProt REST API URL builder |
| `KeggUrl` | KEGG API URL builder |
| `HtsgetUrl` | htsget protocol URL builder |
| `RefgetUrl` | refget protocol URL builder |
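
Since these helpers only build URLs, the core work is query-string construction and percent-encoding. The sketch below builds an NCBI E-utilities `esearch` URL by hand to show the shape of the result. `percent_encode` and `esearch_url` are hypothetical functions, not the `EntrezUrl` API, whose methods are not listed in this reference.

```rust
/// Percent-encode everything except RFC 3986 unreserved characters.
pub fn percent_encode(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for b in input.bytes() {
        match b {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'_' | b'.' | b'~' => {
                out.push(b as char)
            }
            _ => out.push_str(&format!("%{:02X}", b)),
        }
    }
    out
}

/// Build an esearch URL for a database and search term.
pub fn esearch_url(db: &str, term: &str, retmax: u32) -> String {
    format!(
        "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db={}&term={}&retmax={}",
        percent_encode(db),
        percent_encode(term),
        retmax
    )
}
```

The returned string can be passed to whichever HTTP client the application already uses, which keeps network dependencies out of the parsing crate.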

## Feature Flags

| Flag | Default | Description |
|------|---------|-------------|
| `csv` | Yes | CSV parsing (`csv`, `serde`, `serde_json` deps) |
| `vcf` | No | VCF variant parsing (requires `cyanea-omics`) |
| `bed` | No | BED/BEDPE/bedGraph interval parsing (requires `cyanea-omics`) |
| `gff` | No | GFF3 gene structure parsing (requires `cyanea-omics`) |
| `gtf` | No | GTF (GFF2) parsing (requires `cyanea-omics`) |
| `sam` | No | SAM text alignment parsing, pileup generation |
| `bam` | No | BAM binary parsing, BAM ops, indexed BAM (implies `sam`, requires `flate2`) |
| `cram` | No | CRAM format (implies `sam`, requires noodles) |
| `bcf` | No | BCF binary VCF parsing and writing (implies `vcf`, requires `flate2`) |
| `blast` | No | BLAST tabular and XML output parsing |
| `maf` | No | MAF (Multiple Alignment Format) parsing |
| `genbank` | No | GenBank, EMBL, Stockholm, Clustal, Phylip, PIR, ABI, GFA |
| `bigwig` | No | bigWig/bigBed binary format (requires `flate2`) |
| `parquet` | No | Apache Parquet columnar format (implies `vcf` + `bed`, arrow/parquet deps) |
| `variant-calling` | No | Bayesian variant caller (implies `sam` + `vcf`) |
| `fetch` | No | URL builders for NCBI/UniProt/KEGG/htsget/refget |
| `parallel` | No | Rayon parallelism |
| `wasm` | No | WASM target |
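
As a sketch of how the flags combine, a `Cargo.toml` entry that drops the default `csv` feature and enables VCF and BAM support might look like the following. It assumes the crate is resolvable as `cyanea-io` at version 0.1.0 from your registry or workspace; adjust the source (for example a `path` or `git` key) to match your setup.

```toml
[dependencies]
# Assumed dependency source; `bam` pulls in `sam` per the table above.
cyanea-io = { version = "0.1.0", default-features = false, features = ["vcf", "bam"] }
```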

## Dependencies

- `cyanea-core` -- error types
- `cyanea-omics` -- genomic types (`Variant`, `GenomicInterval`, `Gene`), optional
- `csv`, `serde`, `serde_json` -- CSV parsing, optional
- `flate2` -- BAM/BCF/bigWig BGZF/zlib decompression, optional
- `noodles-*` -- CRAM, indexed BAM/VCF parsing, optional
- `parquet`, `arrow` -- Apache Parquet/Arrow, optional
- `cyanea-stats` -- Fisher exact test for variant calling, optional

## Tests

357 unit + 7 doc tests with all features enabled. Breakdown by module: CSV (3), VCF (25), VCF ops (32), BED (9), BEDPE (10), GFF3 (9), GTF (6), SAM (32), pileup (36), BAM (35), indexed BAM/VCF (8), BGZF (4), CRAM (7), Parquet (14), BCF (12), BCF write (included in BCF), variant calling (21), BLAST tabular (5), BLAST XML (5), MAF (5), GenBank (6), bigWig (6), Stockholm (6), Clustal (5), Phylip (5), EMBL (5), PIR (6), ABI (5), bedGraph (5), GFA (5), fetch (12).