# Import bioinformatic formats
- Protein FASTA
- List (string, AA_Seq)
- DNA FASTA
- List (string, DNA_Seq)
- FASTQ
- List (string, record, quality)
- GB
- (string, )
- BED or BED-like: BED, bedGraph, bigBed
- List (Coordinates, [ other info ])
- Coords: (chrom, start, stop)
- VCF probably falls into this category
- not sure how bedGraph fits in here
- GFF3 probably falls into the above category
- how to think about SAM?
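
The shapes listed above could be written down as plain Rust types. This is only a sketch: the names `FastaRecord`, `FastqRecord`, `Coords`, `Feature`, and `from_line` are assumptions made for illustration, `S` stands in for the AA_Seq or DNA_Seq types, and the "other info" columns are kept as raw strings because their meaning differs between BED, VCF, and GFF3.

```rust
/// A FASTA entry: a header string plus a sequence.
/// `S` stands in for AA_Seq or DNA_Seq (both hypothetical here).
#[derive(Debug, Clone, PartialEq)]
pub struct FastaRecord<S> {
    pub header: String,
    pub seq: S,
}

/// A FASTQ entry: header, sequence record, and quality string.
#[derive(Debug, Clone, PartialEq)]
pub struct FastqRecord {
    pub header: String,
    pub record: String,
    pub quality: String,
}

/// Coords: (chrom, start, stop).
#[derive(Debug, Clone, PartialEq)]
pub struct Coords {
    pub chrom: String,
    pub start: u64,
    pub stop: u64,
}

/// A BED-like feature: coordinates plus any remaining columns as raw text.
#[derive(Debug, Clone, PartialEq)]
pub struct Feature {
    pub coords: Coords,
    pub other: Vec<String>,
}

impl Feature {
    /// Parses one tab-separated BED-like line. Returns None if any of the
    /// first three columns is missing or start/stop is not an integer.
    pub fn from_line(line: &str) -> Option<Feature> {
        let mut cols = line.trim_end().split('\t');
        let chrom = cols.next()?.to_string();
        let start: u64 = cols.next()?.parse().ok()?;
        let stop: u64 = cols.next()?.parse().ok()?;
        let other = cols.map(str::to_string).collect();
        Some(Feature { coords: Coords { chrom, start, stop }, other })
    }
}
```

Whether VCF and GFF3 really fit `Feature` is still open: both carry a position in different columns, so each would need its own `from_line` equivalent.
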
General principles:
- ~~Design a grammar for each spec~~
- There seem to be two cases here
1) The file format has a spec (BAM, VCF)
2) The file format does not have a spec
- A grammar for #1 is thus unnecessary, and a grammar for #2 violates Postel's Law (seemingly)
- The other issue I'm concerned about is whether the error recovery will be good enough
- A grammar that fails in some obscure way is not useful
- Finally, these formats need to be streamed in most cases
- If something goes off-spec, try to flag it but proceed
- Postel's Law, Postel's Law
- Stream it if you can
- Read the first 1000 lines (or X number of bytes, whichever comes first) or so, and then go from there
- Trait-based polymorphism
- Reference
- Features (iterable)
- Records
- Reads? (How is this different than a sequence?)
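
A minimal sketch of the "flag it but proceed" and "stream it" principles above, using FASTA as the example. The names `LenientFasta` and `warnings` are hypothetical, and the choice to skip sequence lines that appear before any header (rather than abort) is an assumption about what lenient handling should mean here.

```rust
use std::io::BufRead;

/// Streams (header, sequence) pairs from any buffered reader.
/// Off-spec input is recorded in `warnings` instead of aborting the parse.
pub struct LenientFasta<R: BufRead> {
    lines: std::io::Lines<R>,
    pending_header: Option<String>,
    line_no: usize,
    pub warnings: Vec<String>,
}

impl<R: BufRead> LenientFasta<R> {
    pub fn new(reader: R) -> Self {
        LenientFasta {
            lines: reader.lines(),
            pending_header: None,
            line_no: 0,
            warnings: Vec::new(),
        }
    }
}

impl<R: BufRead> Iterator for LenientFasta<R> {
    type Item = (String, String);

    fn next(&mut self) -> Option<Self::Item> {
        let mut header = self.pending_header.take();
        let mut seq = String::new();
        while let Some(line) = self.lines.next() {
            self.line_no += 1;
            let line = match line {
                Ok(l) => l,
                Err(e) => {
                    // Flag the read error and stop here rather than
                    // spin on a reader that keeps failing.
                    self.warnings.push(format!("line {}: {}", self.line_no, e));
                    break;
                }
            };
            let line = line.trim_end();
            if let Some(h) = line.strip_prefix('>') {
                if header.is_some() {
                    // The next record starts: stash its header, emit the current one.
                    self.pending_header = Some(h.to_string());
                    break;
                }
                header = Some(h.to_string());
            } else if line.is_empty() {
                continue;
            } else if header.is_some() {
                seq.push_str(line);
            } else {
                // Off-spec: sequence data with no header. Flag it and proceed.
                self.warnings.push(format!(
                    "line {}: sequence data before any header, skipped",
                    self.line_no
                ));
            }
        }
        header.map(|h| (h, seq))
    }
}
```

Only one record is held in memory at a time, and the caller decides what to do with `warnings` once iteration ends.
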
FileData is a &str for now, but it could become something else later (e.g. an iterable that caches the first N lines or bytes).
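
One way the cached prefix could be collected is sketched below: read lines until either a line limit or a byte limit is reached, whichever comes first. The function name `read_prefix` and both parameters are assumptions, not part of the current design.

```rust
use std::io::{self, BufRead, Read};

/// Returns the start of `reader`: at most `max_lines` lines and at most
/// `max_bytes` bytes, whichever limit is hit first.
pub fn read_prefix<R: BufRead>(reader: R, max_lines: usize, max_bytes: u64) -> io::Result<Vec<u8>> {
    // `take` enforces the byte cap even when a single line is very long.
    let mut limited = reader.take(max_bytes);
    let mut prefix = Vec::new();
    for _ in 0..max_lines {
        // read_until appends up to and including the newline;
        // 0 bytes read means end of input or the byte cap was reached.
        if limited.read_until(b'\n', &mut prefix)? == 0 {
            break;
        }
    }
    Ok(prefix)
}
```

This version consumes the reader. A FileData that streams the rest of the file after peeking would presumably need to keep the reader and replay the cached prefix first.
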
FileData -> peek -> FileData, Format -> Box<dyn BioFile>
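
The pipeline above might look roughly like this in code. Everything here is a sketch under assumptions: the `Format` variants, the `BioFile` methods, the `FastaFile`/`FastqFile` structs, and `load` are all hypothetical, and `peek` is deliberately crude.

```rust
/// Formats that peek can infer (illustrative subset).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Format {
    Fasta,
    Fastq,
}

/// Hypothetical common interface for a loaded file.
pub trait BioFile {
    fn format(&self) -> Format;
    fn record_count(&self) -> usize;
}

pub struct FastaFile {
    headers: Vec<String>,
}

pub struct FastqFile {
    headers: Vec<String>,
}

impl BioFile for FastaFile {
    fn format(&self) -> Format { Format::Fasta }
    fn record_count(&self) -> usize { self.headers.len() }
}

impl BioFile for FastqFile {
    fn format(&self) -> Format { Format::Fastq }
    fn record_count(&self) -> usize { self.headers.len() }
}

/// Inference only: classify by the first character of the first non-blank line.
/// Crude on purpose; for example, SAM header lines also start with '@'.
pub fn peek(data: &str) -> Option<Format> {
    let first = data.lines().map(str::trim).find(|l| !l.is_empty())?;
    if first.starts_with('>') {
        Some(Format::Fasta)
    } else if first.starts_with('@') {
        Some(Format::Fastq)
    } else {
        None
    }
}

/// FileData -> peek -> (FileData, Format) -> Box<dyn BioFile>
pub fn load(data: &str) -> Option<Box<dyn BioFile>> {
    let lines = data.lines().map(str::trim).filter(|l| !l.is_empty());
    let file: Box<dyn BioFile> = match peek(data)? {
        Format::Fasta => Box::new(FastaFile {
            headers: lines.filter(|l| l.starts_with('>')).map(str::to_string).collect(),
        }),
        // Assumes strict four-line FASTQ records, so every fourth line is a header.
        Format::Fastq => Box::new(FastqFile {
            headers: lines.step_by(4).map(str::to_string).collect(),
        }),
    };
    Some(file)
}
```
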
Is there a way to try to convert a Box<dyn BioFile> to a concrete type at run time?
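
One standard-library option for the run-time conversion question is `std::any::Any`: the trait exposes itself as `&dyn Any`, and callers attempt `downcast_ref` to a concrete type. The sketch below assumes a hypothetical `as_any` method on `BioFile` and two placeholder structs; it is not the only approach (an enum of known formats would avoid the downcast entirely).

```rust
use std::any::Any;

pub trait BioFile {
    /// Exposes the concrete value as `Any` so callers can attempt a downcast.
    fn as_any(&self) -> &dyn Any;
}

// Placeholder concrete types (hypothetical fields).
pub struct FastaFile {
    pub n_records: usize,
}

pub struct BedFile {
    pub n_features: usize,
}

impl BioFile for FastaFile {
    fn as_any(&self) -> &dyn Any { self }
}

impl BioFile for BedFile {
    fn as_any(&self) -> &dyn Any { self }
}

/// Tries to view a type-erased file as FASTA; None if it is something else.
pub fn as_fasta(file: &dyn BioFile) -> Option<&FastaFile> {
    file.as_any().downcast_ref::<FastaFile>()
}
```

The downcast fails cleanly with `None`, so a caller can try several concrete types in turn.
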
fn align<T: Reference, U: Sequence>()
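
A sketch of how trait bounds like these could work in practice. The `bases` methods and the `Contig` and `ShortRead` types are assumptions, and exact substring search stands in for a real alignment algorithm purely to give the function a body.

```rust
/// Something reads can be aligned against (hypothetical method).
pub trait Reference {
    fn bases(&self) -> &[u8];
}

/// Something that can be aligned (hypothetical method).
pub trait Sequence {
    fn bases(&self) -> &[u8];
}

pub struct Contig {
    pub bases: Vec<u8>,
}

pub struct ShortRead {
    pub bases: Vec<u8>,
}

impl Reference for Contig {
    fn bases(&self) -> &[u8] { &self.bases }
}

impl Sequence for ShortRead {
    fn bases(&self) -> &[u8] { &self.bases }
}

/// Placeholder "alignment": offset of the first exact match of the query
/// in the reference, or None if there is no match.
pub fn align<T: Reference, U: Sequence>(reference: &T, query: &U) -> Option<usize> {
    let (r, q) = (reference.bases(), query.bases());
    if q.is_empty() {
        return None; // `windows(0)` would panic
    }
    r.windows(q.len()).position(|w| w == q)
}
```

Because `align` only depends on the two traits, any format that can produce a reference or a sequence can be passed in without the function knowing which file it came from.
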
PEEK functions are used for _inference_ purposes.
For instance, peek_gb(file) returning true does not mean the file is syntactically valid GenBank; it only checks the LOCUS line.
Similarly, peek_bed(file) only returns true if the file is indisputably a BED file. (Blank files and comment-only files trivially return true for BED.)
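
A minimal sketch of the two peek functions as described. The exact rules are assumptions: here `peek_gb` looks only at the first non-blank line, and `peek_bed` treats a file as BED when every line that is not blank, a `#` comment, or a `track`/`browser` line has an integer start and stop with start <= stop.

```rust
/// True if the first non-blank line starts with "LOCUS". Nothing else is
/// checked, so a true result does not mean the file is valid GenBank.
pub fn peek_gb(data: &str) -> bool {
    data.lines()
        .find(|l| !l.trim().is_empty())
        .map_or(false, |l| l.starts_with("LOCUS"))
}

/// Lines that carry no feature data: blanks, comments, track/browser lines.
fn is_skippable(line: &str) -> bool {
    line.is_empty()
        || line.starts_with('#')
        || matches!(line.split_whitespace().next(), Some("track") | Some("browser"))
}

/// True if every data line looks like a BED feature. A file with no data
/// lines (blank or comment-only) is trivially true, since `all` on an
/// empty iterator returns true.
pub fn peek_bed(data: &str) -> bool {
    data.lines()
        .map(str::trim)
        .filter(|l| !is_skippable(l))
        .all(|l| {
            let mut cols = l.split_whitespace();
            let _chrom = cols.next();
            let start = cols.next().and_then(|s| s.parse::<u64>().ok());
            let stop = cols.next().and_then(|s| s.parse::<u64>().ok());
            matches!((start, stop), (Some(a), Some(b)) if a <= b)
        })
}
```

Because an empty file passes `peek_bed`, a caller combining several peek functions probably wants to try the stricter ones (such as `peek_gb`) first.
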