
Module gbk


§A Genbank to GFF parser

This module parses GenBank files and saves them in GFF (GFF3) format. It can also extract the DNA sequence of each record, the gene DNA sequences (ffn) and the protein FASTA sequences (faa).

You can also create new records and save them in GenBank (gbk) format.

§Detailed Explanation

The Genbank parser contains:

Records - a top-level structure that consists of either a single Record (single GenBank) or multiple Record instances (multi-GenBank).

Each Record contains:

  1. A source (SourceAttributes, an enum): counter (source name), start, stop [of source or contig], organism, mol_type, strain, type_material, db_xref
  2. Features (FeatureAttributes, an enum): counter (locus tag), gene (if present), product, codon start, strand, start, stop [of cds/gene]
  3. Sequence features (SequenceAttributes, an enum): counter (locus tag), sequence_ffn (DNA gene sequence), sequence_faa (protein translation), strand, codon start, start, stop [of cds/gene]
  4. The DNA sequence of the whole record (or contig)
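
The layout above can be pictured with a small, self-contained model. This is only a sketch under stated assumptions, not the crate's definitions: SketchSource, SketchFeature, SketchSequence, SketchRecord, protein_for and example_record are hypothetical names invented here, and the sequences are toy data. In the real module the attributes are enums reached through builders. The point illustrated is that features and sequence features are both keyed by the same locus tag.

```rust
use std::collections::BTreeMap;

// Hypothetical, simplified stand-ins for illustration only.
// They are NOT the types defined by the gbk module.
#[derive(Debug, Clone, PartialEq)]
struct SketchSource {
    name: String, // the "counter", e.g. "source_1"
    start: u32,
    stop: u32,
    organism: String,
}

#[derive(Debug, Clone, PartialEq)]
struct SketchFeature {
    gene: Option<String>, // only present if the CDS has a gene name
    product: String,
    start: u32,
    stop: u32,
    strand: i8,
    codon_start: u8,
}

#[derive(Debug, Clone, PartialEq)]
struct SketchSequence {
    sequence_ffn: String, // DNA sequence of the gene
    sequence_faa: String, // protein translation
}

#[derive(Debug, Clone, PartialEq)]
struct SketchRecord {
    source: SketchSource,
    features: BTreeMap<String, SketchFeature>,   // keyed by locus tag
    sequences: BTreeMap<String, SketchSequence>, // keyed by the same locus tag
    sequence: String,                            // DNA of the whole record or contig
}

/// Look up the protein translation stored for a locus tag, if any.
fn protein_for<'a>(record: &'a SketchRecord, locus_tag: &str) -> Option<&'a str> {
    record.sequences.get(locus_tag).map(|s| s.sequence_faa.as_str())
}

/// Build a tiny record holding one CDS (toy sequences, not real data).
fn example_record() -> SketchRecord {
    let mut features = BTreeMap::new();
    features.insert(
        "locus_1".to_string(),
        SketchFeature {
            gene: None,
            product: "toy protein".to_string(),
            start: 1,
            stop: 12,
            strand: 1,
            codon_start: 1,
        },
    );
    let mut sequences = BTreeMap::new();
    sequences.insert(
        "locus_1".to_string(),
        SketchSequence {
            sequence_ffn: "ATGGATAAGTAA".to_string(),
            sequence_faa: "MDK".to_string(),
        },
    );
    SketchRecord {
        source: SketchSource {
            name: "source_1".to_string(),
            start: 1,
            stop: 12,
            organism: "toy organism".to_string(),
        },
        features,
        sequences,
        sequence: "ATGGATAAGTAA".to_string(),
    }
}
```

Because both maps share the locus tag as key, a caller can iterate over the features and fetch the matching sequence with a single lookup, which is the pattern the examples in this module follow.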

Example: extract and print every protein sequence in FASTA format, using the getters (get_ methods)

 use clap::Parser;
 use std::fs::File;
 use microBioRust::gbk::Reader;

 #[derive(Parser, Debug)]
 #[clap(author, version, about)]
 struct Arguments {
     #[clap(short, long)]
     filename: String,
 }

 pub fn genbank_to_faa() -> Result<(), anyhow::Error> {
     let args = Arguments::parse();
     let file_gbk = File::open(args.filename)?;
     let mut reader = Reader::new(file_gbk);
     let mut records = reader.records();
     // advance one record at a time until the reader is exhausted
     loop {
         match records.next() {
             Some(Ok(record)) => {
                 // each key of the cds attributes is a locus tag
                 for (k, _v) in &record.cds.attributes {
                     if let Some(value) = record.seq_features.get_sequence_faa(k) {
                         println!(">{}|{}\n{}", &record.id, k, value);
                     }
                 }
             }
             Some(Err(e)) => eprintln!("Error encountered while parsing a record: {:?}", e),
             None => break,
         }
     }
     Ok(())
 }

Example: extract the protein sequences using the simplified genbank! macro

 use clap::Parser;
 use std::{
     fs::File,
     io,
 };
 use microBioRust::{
     gbk::Reader,
     genbank,
 };

 #[derive(Parser, Debug)]
 #[clap(author, version, about)]
 struct Arguments {
     #[clap(short, long)]
     filename: String,
 }

 pub fn genbank_to_faa() -> Result<(), anyhow::Error> {
     let args = Arguments::parse();
     let records = genbank!(&args.filename);
     for record in records {
         // each key of the cds attributes is a locus tag
         for (k, _v) in &record.cds.attributes {
             if let Some(seq) = record.seq_features.get_sequence_faa(k) {
                 println!(">{}|{}\n{}", &record.id, k, seq);
             }
         }
     }
     Ok(())
 }
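
Both examples above print each protein as a header line followed by the sequence. If wrapped output is wanted, the formatting step can be sketched with the standard library alone. This is an illustration, not part of microBioRust: format_fasta_entry is a hypothetical helper, and the line width is a parameter because the source does not state one (60 characters per line is a common convention).

```rust
/// Build one FASTA entry with a ">record_id|locus_tag" header,
/// wrapping the sequence at `width` characters per line.
/// Hypothetical helper for illustration; not provided by the gbk module.
fn format_fasta_entry(record_id: &str, locus_tag: &str, seq: &str, width: usize) -> String {
    // Drop embedded whitespace (for example newlines) so only residues are wrapped
    let residues: Vec<char> = seq.chars().filter(|c| !c.is_whitespace()).collect();
    let mut out = format!(">{}|{}\n", record_id, locus_tag);
    // Guard against a zero width, which would make `chunks` panic
    for line in residues.chunks(width.max(1)) {
        out.extend(line.iter());
        out.push('\n');
    }
    out
}
```

A call such as format_fasta_entry(&record.id, k, &seq.to_string(), 60) could replace the println! arguments in the loops above, assuming the value returned by the getter implements Display as the examples suggest.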

Example: save a single- or multi-record GenBank file as a GFF file (the records of a multi-GenBank file are joined)

   use microBioRust::gbk::{gff_write, Reader, Record};
   use std::collections::BTreeMap;
   use std::{
       fs::File,
       io,
   };
   use clap::Parser;

   #[derive(Parser, Debug)]
   #[clap(author, version, about)]
   struct Arguments {
       #[clap(short, long)]
       filename: String,
   }

   pub fn genbank_to_gff() -> io::Result<()> {
       let args = Arguments::parse();
       let file_gbk = File::open(&args.filename)?;
       // running offset: the summed stop coordinates of the records already read
       let mut prev_end: u32 = 0;
       let mut reader = Reader::new(file_gbk);
       let mut records = reader.records();
       let mut read_counter: u32 = 0;
       let mut seq_region: BTreeMap<String, (u32, u32)> = BTreeMap::new();
       let mut record_vec: Vec<Record> = Vec::new();
       loop {
           match records.next() {
               Some(Ok(mut record)) => {
                   println!("next record");
                   println!("Record id: {:?}", record.id);
                   let source = record.source_map.source_name.clone().expect("issue collecting source name");
                   let beginning = match record.source_map.get_start(&source) {
                       Some(value) => value.get_value(),
                       _ => 0,
                   };
                   let ending = match record.source_map.get_stop(&source) {
                       Some(value) => value.get_value(),
                       _ => 0,
                   };
                   if ending < beginning {
                       println!("debug: end value {:?} is smaller than the start {:?}", ending, beginning);
                   }
                   // shift the coordinates by the offset so that multiple records are joined end to end
                   seq_region.insert(source, (beginning + prev_end, ending + prev_end));
                   record_vec.push(record);
                   // Add additional fields to print if needed
                   read_counter += 1;
                   prev_end += ending;
               }
               Some(Err(e)) => { println!("there's an err {:?}", e); }
               None => {
                   println!("finished iteration");
                   break;
               }
           }
       }
       let output_file = format!("{}.gff", &args.filename);
       if std::path::Path::new(&output_file).exists() {
           println!("Deleting existing file: {}", &output_file);
           std::fs::remove_file(&output_file)?;
       }
       gff_write(seq_region, record_vec, &output_file, true);
       println!("Total records processed: {}", read_counter);
       Ok(())
   }
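
The joining step in the example above is plain coordinate arithmetic: each record's start and stop are shifted by the summed stop values of the records that came before it. A minimal sketch of that step in isolation, using only the standard library, is given below. join_regions is a hypothetical helper written for this illustration, and the second source name and its length are made-up inputs.

```rust
use std::collections::BTreeMap;

/// Shift each record's (start, stop) by the summed stop values of the
/// records before it, mirroring the prev_end bookkeeping in the example.
/// Hypothetical helper for illustration; not provided by the gbk module.
fn join_regions(records: &[(&str, u32, u32)]) -> BTreeMap<String, (u32, u32)> {
    let mut seq_region = BTreeMap::new();
    let mut prev_end: u32 = 0;
    for &(name, start, stop) in records {
        seq_region.insert(name.to_string(), (start + prev_end, stop + prev_end));
        // the next record begins where this one ended
        prev_end += stop;
    }
    seq_region
}
```

With inputs ("source_1", 1, 897) and ("source_2", 1, 500), the first region stays at (1, 897) and the second becomes (898, 1397), so the two records occupy one continuous coordinate space.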

Example: create a completely new record using the setters (set_ methods)

Writing GFF format requires gff_write(seq_region, record_vec, filename, true or false).

seq_region is the region of interest to save, given as a name and DNA coordinates, for example seq_region.insert("source_1".to_string(), (1,897)). This makes it possible to save the whole file or a subset of it.

record_vec is a list of the records. If there is only one record, pass it as a vec using vec![record].

The boolean describes whether the DNA sequence should be included in the GFF3 file.

Writing GenBank format requires gbk_write(seq_region, record_vec, filename). There is no boolean because GenBank format always includes the DNA sequence.

   use microBioRust::gbk::{gff_write, RangeValue, Record};
   use std::collections::BTreeMap;

    pub fn create_new_record() -> Result<(), anyhow::Error> {
        let filename = "new_record.gff".to_string();
        // remove any output left over from a previous run
        if std::path::Path::new(&filename).exists() {
            std::fs::remove_file(&filename)?;
        }
       let mut record = Record::new();
       let mut seq_region: BTreeMap<String, (u32,u32)> = BTreeMap::new();
        //example from E.coli K12
       seq_region.insert("source_1".to_string(), (1,897));
        //Add the source into SourceAttributes
        record.source_map
            .set_counter("source_1".to_string())
            .set_start(RangeValue::Exact(1))
            .set_stop(RangeValue::Exact(897))
            .set_organism("Escherichia coli".to_string())
            .set_mol_type("DNA".to_string())
            .set_strain("K-12 substr. MG1655".to_string())
   	 .set_type_material("type strain of Escherichia coli K12".to_string())
            .set_db_xref("PRJNA57779".to_string());
        //Add the features into FeatureAttributes, here we are setting two features, i.e. coding sequences or genes
       record.cds
                 .set_counter("b3304".to_string())
                 .set_start(RangeValue::Exact(1))
                 .set_stop(RangeValue::Exact(354))
                 .set_gene("rplR".to_string())
                 .set_product("50S ribosomal subunit protein L18".to_string())
                 .set_codon_start(1)
                 .set_strand(-1);
       record.cds
                 .set_counter("b3305".to_string())
                 .set_start(RangeValue::Exact(364))
                 .set_stop(RangeValue::Exact(897))
                 .set_gene("rplF".to_string())
                 .set_product("50S ribosomal subunit protein L6".to_string())
                 .set_codon_start(1)
                 .set_strand(-1);
        //Add the sequences for the coding sequence (CDS) into SequenceAttributes
       record.seq_features
            .set_counter("b3304".to_string())
   	 .set_start(RangeValue::Exact(1))
                .set_stop(RangeValue::Exact(354))
                .set_sequence_ffn("ATGGATAAGAAATCTGCTCGTATCCGTCGTGCGACCCGCGCACGCCGCAAGCTCCAGGAG
CTGGGCGCAACTCGCCTGGTGGTACATCGTACCCCGCGTCACATTTACGCACAGGTAATT
GCACCGAACGGTTCTGAAGTTCTGGTAGCTGCTTCTACTGTAGAAAAAGCTATCGCTGAA
CAACTGAAGTACACCGGTAACAAAGACGCGGCTGCAGCTGTGGGTAAAGCTGTCGCTGAA
CGCGCTCTGGAAAAAGGCATCAAAGATGTATCCTTTGACCGTTCCGGGTTCCAATATCAT
GGTCGTGTCCAGGCACTGGCAGATGCTGCCCGTGAAGCTGGCCTTCAGTTCTAA".to_string())
                .set_sequence_faa("MDKKSARIRRATRARRKLQELGATRLVVHRTPRHIYAQVIAPNGSEVLVAASTVEKAIAE
QLKYTGNKDAAAAVGKAVAERALEKGIKDVSFDRSGFQYHGRVQALADAAREAGLQF".to_string())
                .set_codon_start(1)
                .set_strand(-1);
       record.seq_features
            .set_counter("b3305".to_string())
   	 .set_start(RangeValue::Exact(364))
                .set_stop(RangeValue::Exact(897))
                .set_sequence_ffn("ATGTCTCGTGTTGCTAAAGCACCGGTCGTTGTTCCTGCCGGCGTTGACGTAAAAATCAAC
GGTCAGGTTATTACGATCAAAGGTAAAAACGGCGAGCTGACTCGTACTCTCAACGATGCT
GTTGAAGTTAAACATGCAGATAATACCCTGACCTTCGGTCCGCGTGATGGTTACGCAGAC
GGTTGGGCACAGGCTGGTACCGCGCGTGCCCTGCTGAACTCAATGGTTATCGGTGTTACC
GAAGGCTTCACTAAGAAGCTGCAGCTGGTTGGTGTAGGTTACCGTGCAGCGGTTAAAGGC
AATGTGATTAACCTGTCTCTGGGTTTCTCTCATCCTGTTGACCATCAGCTGCCTGCGGGT
ATCACTGCTGAATGTCCGACTCAGACTGAAATCGTGCTGAAAGGCGCTGATAAGCAGGTG
ATCGGCCAGGTTGCAGCGGATCTGCGCGCCTACCGTCGTCCTGAGCCTTATAAAGGCAAG
GGTGTTCGTTACGCCGACGAAGTCGTGCGTACCAAAGAGGCTAAGAAGAAGTAA".to_string())
                .set_sequence_faa("MSRVAKAPVVVPAGVDVKINGQVITIKGKNGELTRTLNDAVEVKHADNTLTFGPRDGYAD
GWAQAGTARALLNSMVIGVTEGFTKKLQLVGVGYRAAVKGNVINLSLGFSHPVDHQLPAG
ITAECPTQTEIVLKGADKQVIGQVAADLRAYRRPEPYKGKGVRYADEVVRTKEAKKK".to_string())
                .set_codon_start(1)
                .set_strand(-1);
        //Add the full sequence of the entire record into the record.sequence
       record.sequence = "TTAGAACTGAAGGCCAGCTTCACGGGCAGCATCTGCCAGTGCCTGGACACGACCATGATA
TTGGAACCCGGAACGGTCAAAGGATACATCTTTGATGCCTTTTTCCAGAGCGCGTTCAGC
GACAGCTTTACCCACAGCTGCAGCCGCGTCTTTGTTACCGGTGTACTTCAGTTGTTCAGC
GATAGCTTTTTCTACAGTAGAAGCAGCTACCAGAACTTCAGAACCGTTCGGTGCAATTAC
CTGTGCGTAAATGTGACGCGGGGTACGATGTACCACCAGGCGAGTTGCGCCCAGCTCCTG
GAGCTTGCGGCGTGCGCGGGTCGCACGACGGATACGAGCAGATTTCTTATCCATAGTGTT
ACCTTACTTCTTCTTAGCCTCTTTGGTACGCACGACTTCGTCGGCGTAACGAACACCCTT
GCCTTTATAAGGCTCAGGACGACGGTAGGCGCGCAGATCCGCTGCAACCTGGCCGATCAC
CTGCTTATCAGCGCCTTTCAGCACGATTTCAGTCTGAGTCGGACATTCAGCAGTGATACC
CGCAGGCAGCTGATGGTCAACAGGATGAGAGAAACCCAGAGACAGGTTAATCACATTGCC
TTTAACCGCTGCACGGTAACCTACACCAACCAGCTGCAGCTTCTTAGTGAAGCCTTCGGT
AACACCGATAACCATTGAGTTCAGCAGGGCACGCGCGGTACCAGCCTGTGCCCAACCGTC
TGCGTAACCATCACGCGGACCGAAGGTCAGGGTATTATCTGCATGTTTAACTTCAACAGC
ATCGTTGAGAGTACGAGTCAGCTCGCCGTTTTTACCTTTGATCGTAATAACCTGACCGTT
GATTTTTACGTCAACGCCGGCAGGAACAACGACCGGTGCTTTAGCAACACGAGACAT".to_string();
        gff_write(seq_region, vec![record], &filename, true);
        Ok(())
    }

Re-exports§

pub use crate::record::RangeValue;

Structs§

Config
FeatureAttributeBuilder
builder for the feature information on a per coding sequence (CDS) basis
GFFInner
GFF3 field9 construct
GFFOuter
The main GFF3 construct
Reader
per line reader for the file
Record
internal record containing data from a single source or contig. Has multiple features.
Records
A Gbk reader.
SequenceAttributeBuilder
builder for the sequence information on a per coding sequence (CDS) basis
SourceAttributeBuilder
builder for the source information on a per record basis

Enums§

FeatureAttributes
attributes for each feature, cds or gene
SequenceAttributes
stores the sequences of the coding sequences (genes) and proteins. Also stores start, stop, codon_start and strand information
SourceAttributes

Traits§

GbkRead

Functions§

format_translation
formats the translation string which can be multiple lines, for gbk
gbk_write
saves the parsed data in genbank format
gff_write
saves the parsed data in gff3 format
orig_gff_write
saves the parsed data in gff3 format
substitute_odd_punctuation
product lines can contain punctuation that is difficult to parse, such as unclosed single quotes, superscripts, and single and double brackets in biochemical names. These characters are substituted with an underscore
write_gbk_format_sequence
writes the DNA sequence in gbk format with numbering
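
For orientation, the numbered sequence layout of a GenBank flat file puts the position of the first base of each line right-aligned in nine columns, followed by up to six blocks of ten lowercase bases. The sketch below reproduces that layout with the standard library only. It is an illustration of the format, not the implementation of write_gbk_format_sequence: format_origin_lines is a hypothetical name, and its signature is an assumption.

```rust
/// Format a DNA sequence as numbered GenBank-style sequence lines:
/// a 9-column right-aligned start position, then blocks of 10 bases,
/// 60 bases per line, in lowercase.
/// Hypothetical helper for illustration; not the crate's function.
fn format_origin_lines(seq: &str) -> String {
    // Normalise the input: strip whitespace and lowercase the bases
    let bases: Vec<char> = seq
        .chars()
        .filter(|c| !c.is_whitespace())
        .map(|c| c.to_ascii_lowercase())
        .collect();
    let mut out = String::new();
    for (line_idx, line) in bases.chunks(60).enumerate() {
        // 1-based position of the first base on this line
        let start = line_idx * 60 + 1;
        out.push_str(&format!("{:>9}", start));
        for block in line.chunks(10) {
            out.push(' ');
            out.extend(block.iter());
        }
        out.push('\n');
    }
    out
}
```

In a complete GenBank file these lines sit between an ORIGIN line and the terminating // line, which the sketch leaves to the caller.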

Type Aliases§

GenericRecordGbk