sbol-fasta
Pure-Rust FASTA → SBOL 3 importer for the sbol-rs ecosystem.
FASTA is the lowest-common-denominator sequence exchange format —
NCBI BLAST, UniProt downloads, every genome project, and most
bioinformatics tools either emit or accept it. This crate lets
sbol-rs ingest that data with zero new transitive dependencies.
Each >header record becomes one sbol::Component paired with one
sbol::Sequence. The component's biological type (DNA / RNA /
protein) and the sequence's EDAM encoding are auto-detected from the
alphabet of the sequence itself; the detection can be overridden
with FastaImporter::with_alphabet when the data is ambiguous.
FASTA carries no feature annotations — what you get back is a
Component with no SequenceFeatures. For annotated data, use
sbol-genbank instead.
Quickstart
use FastaImporter;
let =
new?.read_path?;
println!;
document.check?;
# Ok::
CLI
# Override alphabet detection for ambiguous sequences:
Accepted extensions: .fasta, .fa, .fna, .faa.
Alphabet detection
| Heuristic | Result |
|---|---|
| Sequence contains U or u | RNA |
| Sequence contains protein-only letters (E, F, I, L, P, Q, Z) | Protein |
| Anything else | DNA |
The DNA fallback covers the ambiguous case of FASTA files whose
sequence is pure A/C/G/T: these are also valid protein letters, but
in practice they overwhelmingly mean DNA. Override with
--alphabet protein on the CLI or .with_alphabet(Alphabet::Protein)
in the SDK when the data is genuinely a peptide.
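
The heuristic table can be illustrated with a short standalone sketch. This is not the crate's actual implementation: the `Alphabet` enum here is a local stand-in, `detect_alphabet` and `contains_any` are hypothetical names, and the sketch assumes that the rules are checked in table order and that letter matching is case-insensitive.

```rust
/// Local stand-in for the crate's alphabet type (assumed shape).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Alphabet {
    Dna,
    Rna,
    Protein,
}

/// Letters the table lists as protein-only.
const PROTEIN_ONLY: &[u8] = b"EFILPQZ";

/// True if any byte of `sequence`, uppercased, is in `set`.
fn contains_any(sequence: &str, set: &[u8]) -> bool {
    sequence
        .bytes()
        .any(|byte| set.contains(&byte.to_ascii_uppercase()))
}

/// Hypothetical detector: applies the three table rows in order.
pub fn detect_alphabet(sequence: &str) -> Alphabet {
    if contains_any(sequence, b"U") {
        // Row 1: any U or u means RNA.
        Alphabet::Rna
    } else if contains_any(sequence, PROTEIN_ONLY) {
        // Row 2: a protein-only letter means protein.
        Alphabet::Protein
    } else {
        // Row 3: everything else, including pure A/C/G/T, is DNA.
        Alphabet::Dna
    }
}
```

Under these assumptions a short peptide made only of A, C, G, and T is classified as DNA, which is the case the alphabet override exists for.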
Dependencies
sbol-fasta does not depend on a third-party FASTA parser; the
~100-line parser lives in
src/parser.rs. Compared to pulling in
noodles-fasta, this saves four transitive dependencies (bstr,
memchr, noodles-bgzf, noodles-core) for what amounts to "split
on >, concatenate continuation lines."
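
To make "split on >, concatenate continuation lines" concrete, here is a minimal sketch of that approach. It is not the contents of src/parser.rs: `FastaRecord` and `parse_fasta` are hypothetical names, and the handling of blank lines and of text before the first header is an assumption of this sketch.

```rust
/// One FASTA record: the header text after '>' and the joined sequence.
#[derive(Debug, PartialEq, Eq)]
pub struct FastaRecord {
    pub header: String,
    pub sequence: String,
}

/// Hypothetical parser: starts a record at each '>' line and appends
/// every following line to that record's sequence.
pub fn parse_fasta(input: &str) -> Vec<FastaRecord> {
    let mut records: Vec<FastaRecord> = Vec::new();
    for line in input.lines() {
        if let Some(header) = line.strip_prefix('>') {
            // A header line opens a new record.
            records.push(FastaRecord {
                header: header.trim().to_string(),
                sequence: String::new(),
            });
        } else if let Some(current) = records.last_mut() {
            // Continuation line: append it, dropping any whitespace.
            current
                .sequence
                .extend(line.chars().filter(|c| !c.is_whitespace()));
        }
        // Assumption: lines before the first header are ignored.
    }
    records
}
```

With this sketch, the input `">seq1 first\nACGT\nTTAA\n"` yields one record whose header is `seq1 first` and whose sequence is `ACGTTTAA`.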