![Crates.io](https://img.shields.io/crates/v/bed2gff?color=green)
![GitHub](https://img.shields.io/github/license/alejandrogzi/bed2gff?color=blue)

# **bed2gff**

A Rust BED-to-GFF3 translator.


Translates
```
chr7 56766360 56805692 ENST00000581852.25 1000 + 56766360 56805692 0,0,200 3 3,135,81, 0,496,39251,
```
into
```
chr7 bed2gff gene 56399404 56805692 . + . ID=ENSG00000166960;gene_id=ENSG00000166960

chr7 bed2gff transcript 56766361 56805692 . + . ID=ENST00000581852.25;Parent=ENSG00000166960;gene_id=ENSG00000166960;transcript_id=ENST00000581852.25

chr7 bed2gff exon 56766361 56766363 . + . ID=exon:ENST00000581852.25.1;Parent=ENST00000581852.25;gene_id=ENSG00000166960;transcript_id=ENST00000581852.25,exon_number=1

chr7 bed2gff CDS 56766361 56766363 . + 0 ID=CDS:ENST00000581852.25.1;Parent=ENST00000581852.25;gene_id=ENSG00000166960;transcript_id=ENST00000581852.25,exon_number=1

...

chr7 bed2gff start_codon 56766361 56766363 . + 0 ID=start_codon:ENST00000581852.25.1;Parent=ENST00000581852.25;gene_id=ENSG00000166960;transcript_id=ENST00000581852.25,exon_number=1

chr7 bed2gff stop_codon 56805690 56805692 . + 0 ID=stop_codon:ENST00000581852.25.3;Parent=ENST00000581852.25;gene_id=ENSG00000166960;transcript_id=ENST00000581852.25,exon_number=3

...
```

in a few seconds.
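The exon coordinates in the example can be checked by hand. BED starts are 0-based and its ends are exclusive, while GFF3 uses 1-based inclusive coordinates, so each block start is shifted by one and each block end stays the same. The sketch below illustrates that arithmetic only; `exon_coords` is a hypothetical helper written for this README, not a function exported by bed2gff.

```rust
/// Convert BED12 block fields into 1-based, inclusive exon coordinates.
/// `block_sizes` and `block_starts` are BED columns 11 and 12 as raw text.
/// NOTE: hypothetical helper for illustration, not part of the bed2gff API.
fn exon_coords(chrom_start: u64, block_sizes: &str, block_starts: &str) -> Vec<(u64, u64)> {
    // Parse a comma-separated list, skipping the empty item left by a trailing comma.
    let parse = |s: &str| -> Vec<u64> {
        s.split(',')
            .filter(|x| !x.is_empty())
            .map(|x| x.trim().parse().expect("block field must be an integer"))
            .collect()
    };
    let sizes = parse(block_sizes);
    let starts = parse(block_starts);

    sizes
        .iter()
        .zip(starts.iter())
        .map(|(size, offset)| {
            let start = chrom_start + offset + 1; // 0-based start -> 1-based start
            let end = chrom_start + offset + size; // exclusive end == 1-based inclusive end
            (start, end)
        })
        .collect()
}
```

Calling `exon_coords(56766360, "3,135,81,", "0,496,39251,")` with the fields of the BED line above gives `(56766361, 56766363)` for the first exon, which matches the first `exon` line in the output.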

## Usage
```
Usage: bed2gff[EXE] --bed <BED> --isoforms <ISOFORMS> --output <OUTPUT>

Arguments:
    --bed <BED>: a .bed file
    --isoforms <ISOFORMS>: a tab-delimited file
    --output <OUTPUT>: path to output file

Options:
    --help: print help
    --version: print version
```

>**Warning** 
>
>All the transcripts in the .bed file should appear in the isoforms file.
#### crate: [https://crates.io/crates/bed2gff](https://crates.io/crates/bed2gff)

<details>
<summary>click for detailed formats</summary>
<p>
bed2gff just needs two files:

1. a .bed file

    a tab-delimited file with 3 required and 9 optional fields:

    ```
    chrom   chromStart  chromEnd      name    ...
      |         |           |           |
    chr20   50222035    50222038    ENST00000595977    ...
    ```

    see [BED format](https://genome.ucsc.edu/FAQ/FAQformat.html#format1) for more information

2. a tab-delimited .txt/.tsv/.csv/... file with genes/isoforms (all the transcripts in .bed file should appear in the isoforms file):

    ```
    > cat isoforms.txt

    ENSG00000198888 ENST00000361390
    ENSG00000198763 ENST00000361453
    ENSG00000198804 ENST00000361624
    ENSG00000188868 ENST00000595977
    ```

    you can build a custom file for your preferred species using [Ensembl BioMart](https://www.ensembl.org/biomart/martview). 

</p>
</details>
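Since every transcript in the .bed file should appear in the isoforms file, it can be useful to check this before converting. The following is a minimal sketch, assuming the gene/transcript column order shown in the example above; `parse_isoforms` and `missing_transcripts` are hypothetical names and do not belong to the bed2gff API.

```rust
use std::collections::HashMap;

/// Build a transcript -> gene lookup from the isoforms text
/// (one gene/transcript pair per line).
/// NOTE: hypothetical helper for illustration, not part of the bed2gff API.
fn parse_isoforms(text: &str) -> HashMap<String, String> {
    text.lines()
        .filter_map(|line| {
            // split_whitespace also accepts tabs; lines with fewer than
            // two columns are skipped.
            let mut cols = line.split_whitespace();
            let gene = cols.next()?;
            let transcript = cols.next()?;
            Some((transcript.to_string(), gene.to_string()))
        })
        .collect()
}

/// Return the BED transcript names that have no gene in the lookup.
fn missing_transcripts<'a>(
    names: &[&'a str],
    isoforms: &HashMap<String, String>,
) -> Vec<&'a str> {
    names
        .iter()
        .copied()
        .filter(|name| !isoforms.contains_key(*name))
        .collect()
}
```

If `missing_transcripts` returns a non-empty list, those transcripts need to be added to the isoforms file before running the conversion.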

## Installation
to install bed2gff on your system follow these steps:
1. get rust: `curl https://sh.rustup.rs -sSf | sh` on unix, or go [here](https://www.rust-lang.org/tools/install) for other options
2. run `cargo install bed2gff` (make sure `~/.cargo/bin` is in your `$PATH` before running it)
3. use `bed2gff` with the required arguments
4. enjoy!


## Library
to include bed2gff as a library and use it within your project follow these steps:
1. include `bed2gff = "0.1.0"` under `[dependencies]` in the `Cargo.toml` file
2. the library name is `bed2gff`, to use it just write:

    ``` rust
    use bed2gff::bed2gff; 
    ```
    or 
    ``` rust
    use bed2gff::*;
    ```
3. invoke
    ``` rust
    // signature: bed2gff(bed: &String, isoforms: &String, output: &String)
    let gff = bed2gff(&bed, &isoforms, &output); // bed, isoforms, output: String paths
    ```

## Build
to build bed2gff from this repo, do:

1. get rust (as described above)
2. run `git clone https://github.com/alejandrogzi/bed2gff.git && cd bed2gff`
3. run `cargo run --release <BED> <ISOFORMS> <OUTPUT>` (arguments are positional, so you do not need to specify --bed/--isoforms)


## Output

bed2gff writes the output to the path you specify, for example next to the .bed file:

```
bed2gff annotation.bed isoforms.txt output.gff3

.
├── ...
├── isoforms.txt
├── annotation.bed
└── output.gff3
```
where `output.gff3` is the result.

## FAQ
### Why?

Converting between formats is a daily practice in bioinformatics, and it is especially common when working with gene annotations, because tools differ in their input/output layouts. GTF, GFF and BED are the most widely used formats for storing gene-related annotations, yet the conversion needs are not well covered by available software.

A considerable portion of genomic tools accept only GTF/GFF3 files, which forces BED users to translate their files into other formats. While some of these issues have already been covered for GTF files (e.g. [bed2gtf](https://github.com/alejandrogzi/bed2gtf)), the GFF3 layout lacks stable conversion tools (1, 2).

bed2gff is presented as a straightforward option to convert BED files into ready-to-use GFF3 files, closing that gap.  


### How?
bed2gff takes the base code of [bed2gtf](https://github.com/alejandrogzi/bed2gtf), which is basically a reimplementation of UCSC's C binaries merged into one step (bedToGenePred + genePredToGtf). Before any conversion, the tool sorts the .bed file internally using an algorithmic approach similar to the one in [gtfsort](https://github.com/alejandrogzi/gtfsort). This step allows bed2gff to write the output file already sorted in a natural and convenient order. It then evaluates the position of exons and other features (CDS, start/stop codons, UTRs), preserving reading frames and adjusting the coordinate indexing (BED starts are 0-based, GFF3 coordinates are 1-based).

Following the rationale of [bed2gtf](https://github.com/alejandrogzi/bed2gtf), bed2gff produces a ready-to-use GFF3 file by using an isoforms file, which works like the refTable in the C binaries and maps each transcript to its respective gene.
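To give an idea of what a "natural" order means for the sorting step, here is a minimal sketch that orders records by chromosome (chr2 before chr10, named chromosomes last) and then by start position. It is an assumption about the general idea, not the algorithm used by gtfsort or bed2gff; `chrom_key` and `sort_records` are hypothetical names.

```rust
/// Sort key for a chromosome name: numbered chromosomes first (in numeric
/// order), then everything else (chrX, chrM, scaffolds) alphabetically.
/// NOTE: illustrative only, not the ordering code used by gtfsort/bed2gff.
fn chrom_key(chrom: &str) -> (u8, u32, String) {
    let stripped = chrom.strip_prefix("chr").unwrap_or(chrom);
    match stripped.parse::<u32>() {
        Ok(n) => (0, n, String::new()),
        Err(_) => (1, 0, stripped.to_string()),
    }
}

/// Sort (chrom, start, name) records by chromosome, then by start position.
fn sort_records(records: &mut [(String, u64, String)]) {
    records.sort_by_key(|(chrom, start, _)| (chrom_key(chrom), *start));
}
```

A plain string sort would place `chr10` before `chr2`; the numeric key avoids that, which is the main point of a natural ordering.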

### To Do's

- [ ] Allow users to input compressed files (e.g. .gz, .bgzip)
- [x] Test GFF3 with different types of aligners
- [ ] Improve the error module
- [ ] Add test modules for most of the scripts
- [ ] Allow users to specify their parent/child relationships (?)


## References

1. https://bioinformatics.stackexchange.com/questions/2242/how-to-convert-bed-to-gff3
2. https://www.biostars.org/p/2/