bioform 0.1.0 - Docs.rs

# Import bioinformatic formats


- Protein FASTA
    - List (string, AA_Seq)
- DNA FASTA
    - List (string, DNA_Seq)
- FASTQ
    - List (string, record, quality)
- GB
    - (string, )
- BED or BED like: BED, BED graph, BigBed
    - List (Coordinates, [ other info ])
    - Coords: (chrom, start, stop)
    - VCF probably falls into this category
    - not sure how bedgraph fits in there
- GFF3 probably falls into the above category

- how to think about SAM?
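
As a rough illustration of the "BED-like" shape above (coordinates plus a bag of other columns), here is a minimal sketch. The names `Coords`, `BedLike`, and `parse_bed_line` are hypothetical and not part of the crate's documented API, and the sketch assumes tab-separated input with unsigned integer positions.

```rust
/// Genomic interval, following the (chrom, start, stop) shape from the notes.
#[derive(Debug, Clone, PartialEq)]
pub struct Coords {
    pub chrom: String,
    pub start: u64,
    pub stop: u64,
}

/// A BED-like record: coordinates plus whatever other columns follow.
#[derive(Debug, Clone, PartialEq)]
pub struct BedLike {
    pub coords: Coords,
    pub other: Vec<String>,
}

/// Parse one tab-separated BED-like line (hypothetical helper).
/// Returns None if the first three columns are missing or the
/// positions are not unsigned integers.
pub fn parse_bed_line(line: &str) -> Option<BedLike> {
    let mut cols = line.trim_end().split('\t');
    let chrom = cols.next()?.to_string();
    let start: u64 = cols.next()?.parse().ok()?;
    let stop: u64 = cols.next()?.parse().ok()?;
    Some(BedLike {
        coords: Coords { chrom, start, stop },
        // Everything after the third column is kept verbatim as "other info".
        other: cols.map(str::to_string).collect(),
    })
}
```

Keeping the extra columns as untyped strings is one way to let BED, bedGraph, and similar formats share a record type. VCF fits less cleanly, since it carries a single 1-based position rather than a 0-based half-open interval, so a conversion step would be needed.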



General principles:
- ~~Design a grammar for each spec~~
    - There seem to be two cases here
        1) The file format has a spec (BAM, vcf)
        2) The file format does not have a spec
    - A grammar for #1 is thus unnecessary, and a grammar for #2 violates Postel's Law (seemingly)
    - The other issue I'm concerned about is whether the error recovery will be good enough
    - A grammar that fails in some obscure way is not useful
    - Finally, these formats need to be streamed in most cases
- If something goes off-spec, try to flag it but proceed
    - Postel's Law, Postel's Law
- Stream it if you can
    - Read the first 1000 lines (or X number of bytes, whichever comes first) or so, and then go from there
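
The "flag it but proceed" and "stream it" principles can be sketched together for FASTA. This is only an illustration under stated assumptions: `FastaRecord` and `stream_fasta` are hypothetical names, records are handed to a callback one at a time instead of being collected, and off-spec lines are pushed onto a warnings list rather than causing an error.

```rust
use std::io::BufRead;

#[derive(Debug, PartialEq)]
pub struct FastaRecord {
    pub id: String,
    pub seq: String,
}

/// Stream FASTA records from any buffered reader (hypothetical sketch).
/// Each completed record is passed to `on_record`; off-spec input is
/// noted in `warnings` and parsing continues.
pub fn stream_fasta<R: BufRead>(
    reader: R,
    mut on_record: impl FnMut(FastaRecord),
    warnings: &mut Vec<String>,
) -> std::io::Result<()> {
    let mut current: Option<FastaRecord> = None;
    for (i, line) in reader.lines().enumerate() {
        let line = line?;
        let line = line.trim_end();
        if let Some(header) = line.strip_prefix('>') {
            // A new header closes the previous record, if any.
            if let Some(done) = current.take() {
                on_record(done);
            }
            current = Some(FastaRecord {
                id: header.trim().to_string(),
                seq: String::new(),
            });
        } else if line.is_empty() {
            // Blank lines are tolerated silently.
        } else if let Some(rec) = current.as_mut() {
            rec.seq.push_str(line);
        } else {
            // Off-spec: sequence data before any header. Flag it, keep going.
            warnings.push(format!(
                "line {}: sequence data before any '>' header, skipped",
                i + 1
            ));
        }
    }
    if let Some(done) = current.take() {
        on_record(done);
    }
    Ok(())
}
```

Only one record is held in memory at a time, so the same function works for a `&[u8]` in tests and a `BufReader<File>` on large inputs.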


- Trait-based polymorphism
    - Reference
    - Features (iterable)
    - Records
    - Reads? (How is this different than a sequence?)

FileData is a &str for now, but it can be something different in the future (e.g. it can be iterable with the first N lines or bytes cached)
FileData -> peek -> FileData, Format -> Box<dyn BioFile>
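
The pipeline above could look roughly like the following. This is a sketch, not the crate's actual API: the `Format` variants, the `BioFile` methods, the `Fasta` and `GenBank` structs, and the `load` function are all assumptions made for illustration, and the sniffing rules are deliberately shallow.

```rust
/// FileData is a &str for now, as in the notes.
pub type FileData<'a> = &'a str;

/// Formats the sniffer can recognise (hypothetical subset).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Format {
    Fasta,
    GenBank,
}

/// Minimal common interface (hypothetical methods).
pub trait BioFile {
    fn format(&self) -> Format;
    fn record_count(&self) -> usize;
}

pub struct Fasta {
    records: usize,
}

pub struct GenBank {
    records: usize,
}

impl BioFile for Fasta {
    fn format(&self) -> Format { Format::Fasta }
    fn record_count(&self) -> usize { self.records }
}

impl BioFile for GenBank {
    fn format(&self) -> Format { Format::GenBank }
    fn record_count(&self) -> usize { self.records }
}

/// Guess the format from the first non-blank line only.
pub fn peek(data: FileData<'_>) -> Option<Format> {
    let first = data.lines().find(|l| !l.trim().is_empty())?;
    if first.starts_with('>') {
        Some(Format::Fasta)
    } else if first.starts_with("LOCUS") {
        Some(Format::GenBank)
    } else {
        None
    }
}

/// FileData -> peek -> (FileData, Format) -> Box<dyn BioFile>.
pub fn load(data: FileData<'_>) -> Option<Box<dyn BioFile>> {
    let file: Box<dyn BioFile> = match peek(data)? {
        // Record counting here is a naive stand-in for real parsing.
        Format::Fasta => Box::new(Fasta {
            records: data.lines().filter(|l| l.starts_with('>')).count(),
        }),
        Format::GenBank => Box::new(GenBank {
            records: data.lines().filter(|l| l.starts_with("LOCUS")).count(),
        }),
    };
    Some(file)
}
```

Returning `Option` keeps "format not recognised" separate from parse errors, which a fuller version would probably report through a `Result`.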

Box<dyn BioFile>: is there a way to try to convert this at run time?
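
One possible answer to the run-time conversion question is downcasting through `std::any::Any`. The sketch below is self-contained and uses its own stripped-down `BioFile` trait. The `as_any` method, the `FastaFile` and `BedFile` structs, and the `as_fasta` helper are hypothetical. Only `Any::downcast_ref` is a real standard-library call.

```rust
use std::any::Any;

/// Stripped-down trait for this sketch; `as_any` is the hook that
/// makes run-time downcasting possible.
pub trait BioFile {
    fn name(&self) -> &'static str;
    fn as_any(&self) -> &dyn Any;
}

pub struct FastaFile {
    pub n_records: usize,
}

pub struct BedFile {
    pub n_intervals: usize,
}

impl BioFile for FastaFile {
    fn name(&self) -> &'static str { "fasta" }
    fn as_any(&self) -> &dyn Any { self }
}

impl BioFile for BedFile {
    fn name(&self) -> &'static str { "bed" }
    fn as_any(&self) -> &dyn Any { self }
}

/// Try to view a trait object as a concrete FastaFile.
/// Returns None when the underlying type is something else.
pub fn as_fasta(file: &dyn BioFile) -> Option<&FastaFile> {
    file.as_any().downcast_ref::<FastaFile>()
}
```

Given `let boxed: Box<dyn BioFile> = Box::new(FastaFile { n_records: 3 });`, calling `as_fasta(&*boxed)` yields `Some(..)`, while the same call on a boxed `BedFile` yields `None`. An alternative design is to return an enum with one variant per format, which avoids `Any` at the cost of a closed set of formats.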

fn align<T: Reference, U: Sequence>()



PEEK functions are used for _inference_ purposes. 

For instance, peek_gb(file) returning true does not mean the file is completely syntactically valid (it only checks the LOCUS line).
Similarly, peek_bed(file) returns true only if the file is indisputably in BED format. (Blank files and comment-only files trivially return true for BED.)
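
To make the behaviour described above concrete, here is one way the two peek functions might be written. The bodies are guesses for illustration, not the crate's implementation. `PEEK_LINES`, the treatment of `#` as the comment marker, and the exact BED checks (at least three tab-separated columns, integer start and stop, start not after stop) are all assumptions.

```rust
/// How many lines to inspect before deciding (assumed value).
const PEEK_LINES: usize = 1000;

/// True if the first line starts with "LOCUS". Nothing else is checked,
/// so a true result does not imply a valid GenBank file.
pub fn peek_gb(file: &str) -> bool {
    file.lines()
        .next()
        .map_or(false, |l| l.starts_with("LOCUS"))
}

/// True if every data line in the first PEEK_LINES lines looks like BED.
/// Blank lines and '#' comments are skipped, so an empty or comment-only
/// file passes trivially (`all` on an empty iterator is true).
pub fn peek_bed(file: &str) -> bool {
    file.lines()
        .take(PEEK_LINES)
        .filter(|l| {
            let t = l.trim();
            !t.is_empty() && !t.starts_with('#')
        })
        .all(|l| {
            let cols: Vec<&str> = l.trim_end().split('\t').collect();
            cols.len() >= 3
                && !cols[0].is_empty()
                && matches!(
                    (cols[1].parse::<u64>(), cols[2].parse::<u64>()),
                    (Ok(start), Ok(stop)) if start <= stop
                )
        })
}
```

Because the empty case is vacuously true, a caller that dispatches on several peek functions would likely want to try `peek_bed` last, or handle empty input before sniffing at all.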