§A Genbank to GFF parser
This module parses GenBank files and saves them in GFF (GFF3) format. It can also extract the DNA sequence of each record, the DNA sequences of the genes (ffn) and the protein sequences in FASTA format (faa).
You can also create new records and save them in GenBank (gbk) format.
§Detailed Explanation
The Genbank parser contains:
Records - a top level structure which consists of either one record (single genbank) or multiple instances of record (multi-genbank).
Each Record contains:
- A source, SourceAttributes, construct(enum) of counter (source name), start, stop [of source or contig], organism, mol_type, strain, type_material, db_xref
- Features, FeatureAttributes, construct(enum) of counter (locus tag), gene (if present), product, codon start, strand, start, stop [of cds/gene]
- Sequence features, SequenceAttributes, construct(enum) of counter (locus tag), sequence_ffn (DNA gene sequence), sequence_faa (protein translation), strand, codon start, start, stop [cds/gene]
- The DNA sequence of the whole record (or contig)
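The layout above can be pictured with a small stand-alone model. This is only a sketch under assumptions: the `Toy*` types, their fields and the `sequence_faa` helper are hypothetical simplifications written for illustration, not the crate's actual definitions, and the source attributes are left out for brevity. It shows the idea of attributes stored as enum values and looked up by locus tag.

```rust
use std::collections::BTreeMap;

// Hypothetical, simplified stand-ins for the attribute enums described above.
#[derive(Debug, Clone, PartialEq)]
enum ToyFeatureAttribute {
    Gene(String),
    Product(String),
    CodonStart(u8),
    Strand(i8),
    Start(u32),
    Stop(u32),
}

#[derive(Debug, Clone, PartialEq)]
enum ToySequenceAttribute {
    SequenceFfn(String),
    SequenceFaa(String),
}

// One record (source or contig): attributes are grouped per locus tag.
#[derive(Debug, Default)]
struct ToyRecord {
    id: String,
    cds: BTreeMap<String, Vec<ToyFeatureAttribute>>,
    seq_features: BTreeMap<String, Vec<ToySequenceAttribute>>,
    sequence: String,
}

impl ToyRecord {
    /// Look up the protein translation stored under a locus tag, if any.
    fn sequence_faa(&self, locus_tag: &str) -> Option<&str> {
        self.seq_features.get(locus_tag)?.iter().find_map(|attr| match attr {
            ToySequenceAttribute::SequenceFaa(seq) => Some(seq.as_str()),
            _ => None,
        })
    }
}

/// Build a tiny record with one coding sequence (values shortened for the sketch).
fn example_record() -> ToyRecord {
    let mut record = ToyRecord { id: "toy_record".to_string(), ..Default::default() };
    record.cds.insert(
        "b3304".to_string(),
        vec![
            ToyFeatureAttribute::Gene("rplR".to_string()),
            ToyFeatureAttribute::Start(1),
            ToyFeatureAttribute::Stop(354),
            ToyFeatureAttribute::Strand(-1),
        ],
    );
    record.seq_features.insert(
        "b3304".to_string(),
        vec![
            ToySequenceAttribute::SequenceFfn("ATGGATAAGAAA".to_string()),
            ToySequenceAttribute::SequenceFaa("MDKK".to_string()),
        ],
    );
    record
}
```

A multi-genbank would then correspond to several such records read one after another, which is the role of Records.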
Example to extract and print all the protein sequences in FASTA format, using the getters (get_ functionality)
use clap::Parser;
use std::fs::File;
use microBioRust::gbk::Reader;
#[derive(Parser, Debug)]
#[clap(author, version, about)]
struct Arguments {
    #[clap(short, long)]
    filename: String,
}
pub fn genbank_to_faa() -> Result<(), anyhow::Error> {
    let args = Arguments::parse();
    let file_gbk = File::open(args.filename)?;
    let mut reader = Reader::new(file_gbk);
    let mut records = reader.records();
    loop {
        // advance one record at a time and print every CDS that has a protein translation
        match records.next() {
            Some(Ok(record)) => {
                for (k, _v) in &record.cds.attributes {
                    match record.seq_features.get_sequence_faa(k) {
                        Some(value) => {
                            let seq_faa = value.to_string();
                            println!(">{}|{}\n{}", &record.id, k, seq_faa);
                        },
                        _ => (),
                    };
                }
            },
            Some(Err(e)) => { println!("Error encountered - an err {:?}", e); },
            None => break,
        }
    }
    return Ok(());
}
Example to extract the protein sequences with the simplified genbank! macro
use clap::Parser;
use std::{
    fs::File,
    io,
};
use microBioRust::{
    gbk::Reader,
    genbank,
};
#[derive(Parser, Debug)]
#[clap(author, version, about)]
struct Arguments {
    #[clap(short, long)]
    filename: String,
}
pub fn genbank_to_faa() -> Result<(), anyhow::Error> {
    let args = Arguments::parse();
    let records = genbank!(&args.filename);
    for record in records {
        for (k, _v) in &record.cds.attributes {
            if let Some(seq) = record.seq_features.get_sequence_faa(k) {
                println!(">{}|{}\n{}", &record.id, k, seq);
            }
        }
    }
    return Ok(());
}
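The header layout printed by the two examples above (`>record_id|locus_tag`, then the sequence) can be isolated into a small helper that does not depend on the crate. This is a sketch: `fasta_record` is a hypothetical name, and the line wrapping at `width` characters is an added assumption, since the examples print each sequence as stored.

```rust
/// Build one FASTA record with a `>record_id|locus_tag` header,
/// wrapping the sequence at `width` characters per line.
fn fasta_record(record_id: &str, locus_tag: &str, seq: &str, width: usize) -> String {
    let mut out = format!(">{}|{}\n", record_id, locus_tag);
    // Protein and DNA sequences are ASCII, so splitting on bytes is safe here.
    // `width.max(1)` guards against a zero width, which `chunks` would reject.
    for chunk in seq.as_bytes().chunks(width.max(1)) {
        out.push_str(&String::from_utf8_lossy(chunk));
        out.push('\n');
    }
    out
}
```

Calling `fasta_record(&record.id, k, &seq.to_string(), 60)` inside the loop would give 60 residues per line, a common convention for FASTA files.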
Example to save a provided multi- or single genbank file as a GFF file (by joining any multi-genbank)
use microBioRust::gbk::{gff_write, Reader, Record};
use std::collections::BTreeMap;
use std::{
    fs::File,
    io,
};
use clap::Parser;
#[derive(Parser, Debug)]
#[clap(author, version, about)]
struct Arguments {
    #[clap(short, long)]
    filename: String,
}
pub fn genbank_to_gff() -> io::Result<()> {
    let args = Arguments::parse();
    let file_gbk = File::open(&args.filename)?;
    // running total of the lengths of the records seen so far
    let mut prev_end: u32 = 0;
    let mut reader = Reader::new(file_gbk);
    let mut records = reader.records();
    let mut read_counter: u32 = 0;
    let mut seq_region: BTreeMap<String, (u32,u32)> = BTreeMap::new();
    let mut record_vec: Vec<Record> = Vec::new();
    loop {
        match records.next() {
            Some(Ok(mut record)) => {
                println!("next record");
                println!("Record id: {:?}", record.id);
                let source = record.source_map.source_name.clone().expect("issue collecting source name");
                let beginning = match record.source_map.get_start(&source) {
                    Some(value) => value.get_value(),
                    _ => 0,
                };
                let ending = match record.source_map.get_stop(&source) {
                    Some(value) => value.get_value(),
                    _ => 0,
                };
                if ending + prev_end < beginning + prev_end {
                    println!("debug: end value is smaller than the start {:?}", beginning);
                }
                // shift the coordinates by the combined length of the previous records
                seq_region.insert(source, (beginning + prev_end, ending + prev_end));
                record_vec.push(record);
                // Add additional fields to print if needed
                read_counter+=1;
                prev_end+=ending; // create the joined record if there are multiple
            },
            Some(Err(e)) => { println!("there's an err {:?}", e); },
            None => {
                println!("finished iteration");
                break;
            },
        }
    }
    let output_file = format!("{}.gff", &args.filename);
    if std::path::Path::new(&output_file).exists() {
        println!("Deleting existing file: {}", &output_file);
        std::fs::remove_file(&output_file)?;
    }
    gff_write(seq_region.clone(), record_vec, &output_file, true);
    println!("Total records processed: {}", read_counter);
    return Ok(());
}
Example to create a completely new record, using the setters (set_ functionality)
Writing in GFF format requires gff_write(seq_region, record_vec, filename, true or false)
The seq_region is the region of interest to save, given as a name and DNA coordinates, such as seq_region.insert("source_1".to_string(), (1,897))
This makes it possible to save the whole file or a subset of it
record_vec is a list of the records. If there is only one record, wrap it in a vec using vec![record]
The boolean true/false sets whether the DNA sequence should be included in the GFF3 file
Writing in genbank format requires gbk_write(seq_region, record_vec, filename); there is no true or false since the genbank format always includes the DNA sequence
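Because seq_region is a plain BTreeMap from a name to a (start, stop) pair, it can be prepared independently of the parser. The sketch below is illustrative only: `joined_seq_region` is a hypothetical helper, not part of the crate, and it assumes every contig starts at position 1 in its own record. It reproduces the coordinate bookkeeping used when a multi-genbank is joined, where each contig is shifted by the combined length of the contigs before it.

```rust
use std::collections::BTreeMap;

/// Given contig names and lengths in file order, return their coordinates
/// on the joined sequence, keyed by contig name.
fn joined_seq_region(contigs: &[(&str, u32)]) -> BTreeMap<String, (u32, u32)> {
    let mut seq_region = BTreeMap::new();
    // combined length of all contigs placed so far
    let mut prev_end: u32 = 0;
    for &(name, length) in contigs {
        // a contig spans 1..=length on its own; shift both ends by prev_end
        seq_region.insert(name.to_string(), (1 + prev_end, length + prev_end));
        prev_end += length;
    }
    seq_region
}
```

For two contigs of 897 and 100 bases this gives (1, 897) and (898, 997). Keeping only some of the entries in the map is one way to describe a subset of the file.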
use microBioRust::gbk::{gff_write, RangeValue, Record};
use std::fs::File;
use std::collections::BTreeMap;
pub fn create_new_record() -> Result<(), anyhow::Error> {
    let filename = String::from("new_record.gff");
    if std::path::Path::new(&filename).exists() {
        std::fs::remove_file(&filename)?;
    }
    let mut record = Record::new();
    let mut seq_region: BTreeMap<String, (u32,u32)> = BTreeMap::new();
    //example from E.coli K12
    seq_region.insert("source_1".to_string(), (1,897));
    //Add the source into SourceAttributes
    record.source_map
        .set_counter("source_1".to_string())
        .set_start(RangeValue::Exact(1))
        .set_stop(RangeValue::Exact(897))
        .set_organism("Escherichia coli".to_string())
        .set_mol_type("DNA".to_string())
        .set_strain("K-12 substr. MG1655".to_string())
        .set_type_material("type strain of Escherichia coli K12".to_string())
        .set_db_xref("PRJNA57779".to_string());
    //Add the features into FeatureAttributes, here we are setting two features, i.e. coding sequences or genes
    record.cds
        .set_counter("b3304".to_string())
        .set_start(RangeValue::Exact(1))
        .set_stop(RangeValue::Exact(354))
        .set_gene("rplR".to_string())
        .set_product("50S ribosomal subunit protein L18".to_string())
        .set_codon_start(1)
        .set_strand(-1);
    record.cds
        .set_counter("b3305".to_string())
        .set_start(RangeValue::Exact(364))
        .set_stop(RangeValue::Exact(897))
        .set_gene("rplF".to_string())
        .set_product("50S ribosomal subunit protein L6".to_string())
        .set_codon_start(1)
        .set_strand(-1);
    //Add the sequences for the coding sequence (CDS) into SequenceAttributes
    record.seq_features
        .set_counter("b3304".to_string())
        .set_start(RangeValue::Exact(1))
        .set_stop(RangeValue::Exact(354))
.set_sequence_ffn("ATGGATAAGAAATCTGCTCGTATCCGTCGTGCGACCCGCGCACGCCGCAAGCTCCAGGAG
CTGGGCGCAACTCGCCTGGTGGTACATCGTACCCCGCGTCACATTTACGCACAGGTAATT
GCACCGAACGGTTCTGAAGTTCTGGTAGCTGCTTCTACTGTAGAAAAAGCTATCGCTGAA
CAACTGAAGTACACCGGTAACAAAGACGCGGCTGCAGCTGTGGGTAAAGCTGTCGCTGAA
CGCGCTCTGGAAAAAGGCATCAAAGATGTATCCTTTGACCGTTCCGGGTTCCAATATCAT
GGTCGTGTCCAGGCACTGGCAGATGCTGCCCGTGAAGCTGGCCTTCAGTTCTAA".to_string())
.set_sequence_faa("MDKKSARIRRATRARRKLQELGATRLVVHRTPRHIYAQVIAPNGSEVLVAASTVEKAIAE
QLKYTGNKDAAAAVGKAVAERALEKGIKDVSFDRSGFQYHGRVQALADAAREAGLQF".to_string())
        .set_codon_start(1)
        .set_strand(-1);
    record.seq_features
        .set_counter("b3305".to_string())
        .set_start(RangeValue::Exact(364))
        .set_stop(RangeValue::Exact(897))
.set_sequence_ffn("ATGTCTCGTGTTGCTAAAGCACCGGTCGTTGTTCCTGCCGGCGTTGACGTAAAAATCAAC
GGTCAGGTTATTACGATCAAAGGTAAAAACGGCGAGCTGACTCGTACTCTCAACGATGCT
GTTGAAGTTAAACATGCAGATAATACCCTGACCTTCGGTCCGCGTGATGGTTACGCAGAC
GGTTGGGCACAGGCTGGTACCGCGCGTGCCCTGCTGAACTCAATGGTTATCGGTGTTACC
GAAGGCTTCACTAAGAAGCTGCAGCTGGTTGGTGTAGGTTACCGTGCAGCGGTTAAAGGC
AATGTGATTAACCTGTCTCTGGGTTTCTCTCATCCTGTTGACCATCAGCTGCCTGCGGGT
ATCACTGCTGAATGTCCGACTCAGACTGAAATCGTGCTGAAAGGCGCTGATAAGCAGGTG
ATCGGCCAGGTTGCAGCGGATCTGCGCGCCTACCGTCGTCCTGAGCCTTATAAAGGCAAG
GGTGTTCGTTACGCCGACGAAGTCGTGCGTACCAAAGAGGCTAAGAAGAAGTAA".to_string())
.set_sequence_faa("MSRVAKAPVVVPAGVDVKINGQVITIKGKNGELTRTLNDAVEVKHADNTLTFGPRDGYAD
GWAQAGTARALLNSMVIGVTEGFTKKLQLVGVGYRAAVKGNVINLSLGFSHPVDHQLPAG
ITAECPTQTEIVLKGADKQVIGQVAADLRAYRRPEPYKGKGVRYADEVVRTKEAKKK".to_string())
        .set_codon_start(1)
        .set_strand(-1);
    //Add the full sequence of the entire record into the record.sequence
record.sequence = "TTAGAACTGAAGGCCAGCTTCACGGGCAGCATCTGCCAGTGCCTGGACACGACCATGATA
TTGGAACCCGGAACGGTCAAAGGATACATCTTTGATGCCTTTTTCCAGAGCGCGTTCAGC
GACAGCTTTACCCACAGCTGCAGCCGCGTCTTTGTTACCGGTGTACTTCAGTTGTTCAGC
GATAGCTTTTTCTACAGTAGAAGCAGCTACCAGAACTTCAGAACCGTTCGGTGCAATTAC
CTGTGCGTAAATGTGACGCGGGGTACGATGTACCACCAGGCGAGTTGCGCCCAGCTCCTG
GAGCTTGCGGCGTGCGCGGGTCGCACGACGGATACGAGCAGATTTCTTATCCATAGTGTT
ACCTTACTTCTTCTTAGCCTCTTTGGTACGCACGACTTCGTCGGCGTAACGAACACCCTT
GCCTTTATAAGGCTCAGGACGACGGTAGGCGCGCAGATCCGCTGCAACCTGGCCGATCAC
CTGCTTATCAGCGCCTTTCAGCACGATTTCAGTCTGAGTCGGACATTCAGCAGTGATACC
CGCAGGCAGCTGATGGTCAACAGGATGAGAGAAACCCAGAGACAGGTTAATCACATTGCC
TTTAACCGCTGCACGGTAACCTACACCAACCAGCTGCAGCTTCTTAGTGAAGCCTTCGGT
AACACCGATAACCATTGAGTTCAGCAGGGCACGCGCGGTACCAGCCTGTGCCCAACCGTC
TGCGTAACCATCACGCGGACCGAAGGTCAGGGTATTATCTGCATGTTTAACTTCAACAGC
ATCGTTGAGAGTACGAGTCAGCTCGCCGTTTTTACCTTTGATCGTAATAACCTGACCGTT
GATTTTTACGTCAACGCCGGCAGGAACAACGACCGGTGCTTTAGCAACACGAGACAT".to_string();
    gff_write(seq_region, vec![record], &filename, true);
    return Ok(());
}
Re-exports§
pub use crate::record::RangeValue;
Structs§
- Config
- FeatureAttributeBuilder - builder for the feature information on a per coding sequence (CDS) basis
- GFFInner - GFF3 field9 construct
- GFFOuter - The main GFF3 construct
- Reader - per line reader for the file
- Record - internal record containing data from a single source or contig. Has multiple features.
- Records - A Gbk reader.
- SequenceAttributeBuilder - builder for the sequence information on a per coding sequence (CDS) basis
- SourceAttributeBuilder - builder for the source information on a per record basis
Enums§
- FeatureAttributes - attributes for each feature, cds or gene
- SequenceAttributes - stores the sequences of the coding sequences (genes) and proteins. Also stores start, stop, codon_start and strand information
- SourceAttributes
Traits§
Functions§
- format_translation - formats the translation string which can be multiple lines, for gbk
- gbk_write - saves the parsed data in genbank format
- gff_write - saves the parsed data in gff3 format
- orig_gff_write - saves the parsed data in gff3 format
- substitute_odd_punctuation - product lines can contain punctuation that is difficult to parse, such as biochemical symbols like unclosed single quotes, superscripts, single and double brackets etc. Here we substitute these for an underscore
- write_gbk_format_sequence - writes the DNA sequence in gbk format with numbering
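As an illustration of what "gbk format with numbering" means for the DNA sequence, the sketch below lays out a sequence the way the ORIGIN section of a GenBank flat file conventionally does: 60 bases per line in blocks of 10, each line led by the right-aligned 1-based position of its first base. `format_origin_lines` is a hypothetical stand-alone function written for this page, not the crate's write_gbk_format_sequence, whose exact signature and behaviour are not shown here. Lower-casing the bases and dropping whitespace are assumptions of the sketch.

```rust
/// Lay out a DNA sequence as numbered lines in the style of a GenBank
/// ORIGIN section: 60 bases per line, in space-separated blocks of 10.
fn format_origin_lines(seq: &str) -> String {
    // Keep letters only (drops newlines and spaces) and lower-case them.
    let cleaned: String = seq
        .chars()
        .filter(|c| c.is_ascii_alphabetic())
        .map(|c| c.to_ascii_lowercase())
        .collect();
    let mut out = String::new();
    for (line_idx, line) in cleaned.as_bytes().chunks(60).enumerate() {
        // 1-based coordinate of the first base on this line, right-aligned to 9 columns
        out.push_str(&format!("{:>9}", line_idx * 60 + 1));
        for block in line.chunks(10) {
            out.push(' ');
            out.push_str(&String::from_utf8_lossy(block));
        }
        out.push('\n');
    }
    out
}
```

In a full GenBank record these lines would sit between the `ORIGIN` keyword and the closing `//` line.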