atg 0.8.7

Convert transcripts between different file formats
# ATG

Convert your genomic reference data between formats with a single tool. _ATG_ handles the conversion from and to GTF, GenePred(ext) and Refgene. You can generate bed files, fasta sequences or custom feature sequences. A single tool for all your conversion.

| File format | Can be used as source | Can be created |
| ----------- | ------------- | -------- |
| GTF | Yes | Yes |
| GenePred (extended) | Yes | Yes |
| RefGene | Yes | Yes |
| GenePred (simple) | No | Yes |
| Bed | No | Yes |
| Fasta | No | Yes (multiple options) |
| SpliceAI gene annotation | No | Yes |
| Quality Checks | No | Yes |


**Reasons to use _ATG_**
* No need to maintain multiple tools for one-way conversions (`gtfToGenePred`, `genePredToGtf`, etc). _ATG_ handles many formats and can convert in both directions.
* Speed: _ATG_ is really fast - almost twice as fast as `gtfToGenePred`.
* Robust parser: It handles GTF, GenePred with all extras according to spec.
* Low memory footprint: It also runs on machines with little RAM.
* Extra features, such as quality control and correctness checks.
* Open for contributions: Every help is welcome improve ATG or to add more functionality.
* You can also use _ATG_ as a library for your own Rust projects.


## ATG command line tool

### Install
There are currently 3 different options how to install _ATG_:

##### cargo
The easiest way to install _ATG_ is to use `cargo` (if you have `cargo` and `rust` installed)
```bash
cargo install atg
```

##### Pre-built binaries
You can download pre-built binaries for Linux and Mac from [Github](https://github.com/anergictcell/atg/releases).

##### From source
You can also build _ATG_ from source (if you have the rust toolchains installed):

```bash
git clone https://github.com/anergictcell/atg.git
cd atg
cargo build --release
```


### Usage
The main CLI arguments are 
- `-f`, `--from`: Specify the file format of the source (e.g. `gtf`, `genepredext`, `refgene`)
- `-t`, `--to`: Specify the target file format (e.g. `gtf`, `genepred`, `bed`, `fasta` etc)
- `-i`, `--input`: Path to source file. (Use `/dev/stdin` if you are using _atg_ in a pipe)
- `-o`, `--output`: Path to target file. Existing files will be overwritten. (Use `/dev/stdout` if you are using _atg_ in a pipe)
- `-v`, `-vv`, `-vvv`: Verbosity (info, debug, trace)
- `-h`, `--help`: Print the help dialog with detailed usage instructions.

Additional, optional arguments:
- `-g`, `--gtf-source`: Specify the source for GTF output files. Defaults to `atg`
- `-r`, `--reference`: Path of a reference genome fasta file. Required for fasta output
- `-c`, `--genetic-code`: Specify which genetic code to use for translating the transcripts. Genetic codes can be specified per chromosome by specifying the chromsome and the code, separated by `:` (e.g. `-c chrM:vertebrate mitochondrial`). They can also be specified for all chromsomes by omitting the chromosome (e.g. `-c vertebrate mitochondrial`). The argument can be specified multiple times (e.g: `-c "standard" -c "chrM:vertebrate mitochondrial" -c "chrAYN:alternative yeast nuclear"`). The code names are based on the `name` field from the [NCBI specs](https://www.ncbi.nlm.nih.gov/IEB/ToolBox/C_DOC/lxr/source/data/gc.prt) but all lowercase characters. Alternatively, you can also specify the amino acid lookup table directly: `-c "chrM:FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"`. Defaults to `standard`.
- `-q`, `--qc-check`: Specify QC-checks for removing transcripts from the output

#### Examples:
```bash
## Convert a GTF file to a RefGene file
atg --from gtf --to refgene --input /path/to/input.gtf --output /path/to/output.refgene

## Convert a GTF file to a GenePred file
atg --from gtf --to genepred --input /path/to/input.gtf --output /path/to/output.genepred

## Convert a GTF file to a GenePredExt file
atg --from gtf --to genepredext --input /path/to/input.gtf --output /path/to/output.genepredext

## Convert RefGene to GTF
atg --from refgene --to gtf --input /path/to/input.refgene --output /path/to/output.gtf

## Convert RefGene to bed
atg --from refgene --to bed --input /path/to/input.refgene --output /path/to/output.bed


## Convert a GTF file to a RefGene file, remove all transcript without proper start and stop codons
atg --from gtf --to refgene --input /path/to/input.gtf --output /path/to/output.refgene --qc-check start --qc-check stop --reference /path/to/fasta.fa
```

### Supported `--output` formats

#### gtf
Output in [GTF](http://genome.ucsc.edu/FAQ/FAQformat.html#format4) format.

```text
chr9    ncbiRefSeq.2021-05-17   transcript  74526555    74600974    .   +   .   gene_id "C9orf85"; transcript_id "NM_001365057.2";
chr9    ncbiRefSeq.2021-05-17   exon        74526555    74526752    .   +   .   gene_id "C9orf85"; transcript_id "NM_001365057.2";
chr9    ncbiRefSeq.2021-05-17   5UTR        74526555    74526650    .   +   .   gene_id "C9orf85"; transcript_id "NM_001365057.2";
chr9    ncbiRefSeq.2021-05-17   CDS         74526651    74526752    .   +   0   gene_id "C9orf85"; transcript_id "NM_001365057.2";
chr9    ncbiRefSeq.2021-05-17   exon        74561922    74562028    .   +   .   gene_id "C9orf85"; transcript_id "NM_001365057.2";
chr9    ncbiRefSeq.2021-05-17   CDS         74561922    74562026    .   +   0   gene_id "C9orf85"; transcript_id "NM_001365057.2";
...
```

You can specify the value of the `source` column manually using the `--gtf-source`/`-g` option. Defaults to `atg`

#### refgene
Output in the [refGene](http://rohsdb.cmb.usc.edu/GBshape/cgi-bin/hgTables?hgsid=583_AkEae6dMkhjf5kd9BxNksFo9ySiK&hgta_doSchemaDb=mm10&hgta_doSchemaTable=refGene) format, as used by some UCSC and NCBI RefSeq services 

```text
0   NM_001101.5     chr7    -   5566778     5570232     5567378     5569288    6   5566778,5567634,5567911,5568791,5569165,5570154,    5567522,5567816,5568350,5569031,5569294,5570232,    0   ACTB    cmpl    cmpl    0,1,0,0,0,-1,
0   NM_001203247.2  chr7    -   148504474   148581383   148504737   148544390  20  148504474,148506162,148506401,148507424,148508716,148511050,148512005,148512597,148513775,148514313,148514968,148516687,148523560,148524255,148525831,148526819,148529725,148543561,148544273,148581255,    148504798,148506247,148506482,148507506,148508812,148511229,148512131,148512638,148513870,148514483,148515209,148516779,148523724,148524358,148525972,148526940,148529842,148543690,148544397,148581383,    0   EZH2    cmpl    cmpl    2,1,1,0,0,1,1,2,0,1,0,1,2,1,1,0,0,0,0,-1,
0   NM_001203248.2  chr7    -   148504474   148581383   148504737   148544390  20  148504474,148506162,148506401,148507424,148508716,148511050,148512005,148512597,148513775,148514313,148514968,148516687,148523560,148524255,148525831,148526819,148529725,148543588,148544273,148581255,    148504798,148506247,148506482,148507506,148508812,148511229,148512131,148512638,148513870,148514483,148515209,148516779,148523724,148524358,148525972,148526940,148529842,148543690,148544397,148581383,    0   EZH2    cmpl    cmpl    2,1,1,0,0,1,1,2,0,1,0,1,2,1,1,0,0,0,0,-1,
0   NM_001354750.2  chr11   +   113930432   114127487   113934022   114121277  7   113930432,113933932,114027058,114057673,114112888,114117919,114121047,  113930864,113935290,114027156,114057760,114113059,114118087,114127487,  0   ZBTB16  cmpl    cmpl    -1,0,2,1,1,1,1,
```

#### genepred(ext)
Output in the [GenePred(Ext)](http://genome.ucsc.edu/FAQ/FAQformat#format9) format, as used by some UCSC and NCBI RefSeq services 

**GenePred:**
```text
NM_001101.5     chr7    -   5566778     5570232     5567378     5569288     6   5566778,5567634,5567911,5568791,5569165,5570154,    5567522,5567816,5568350,5569031,5569294,5570232,
NM_001203247.2  chr7    -   148504474   148581383   148504737   148544390   20  148504474,148506162,148506401,148507424,148508716,148511050,148512005,148512597,148513775,148514313,148514968,148516687,148523560,148524255,148525831,148526819,148529725,148543561,148544273,148581255,    148504798,148506247,148506482,148507506,148508812,148511229,148512131,148512638,148513870,148514483,148515209,148516779,148523724,148524358,148525972,148526940,148529842,148543690,148544397,148581383,
NM_001203248.2  chr7    -   148504474   148581383   148504737   148544390   20  148504474,148506162,148506401,148507424,148508716,148511050,148512005,148512597,148513775,148514313,148514968,148516687,148523560,148524255,148525831,148526819,148529725,148543588,148544273,148581255,    148504798,148506247,148506482,148507506,148508812,148511229,148512131,148512638,148513870,148514483,148515209,148516779,148523724,148524358,148525972,148526940,148529842,148543690,148544397,148581383,
NM_001354750.2  chr11   +   113930432   114127487   113934022   114121277   7   113930432,113933932,114027058,114057673,114112888,114117919,114121047,  113930864,113935290,114027156,114057760,114113059,114118087,114127487,
```

**GenePredExt**
```text
NM_001101.5     chr7    -   5566778     5570232     5567378     5569288     6   5566778,5567634,5567911,5568791,5569165,5570154,    5567522,5567816,5568350,5569031,5569294,5570232,    0   ACTB    cmpl    cmpl    0,1,0,0,0,-1,
NM_001203247.2  chr7    -   148504474   148581383   148504737   148544390   20  148504474,148506162,148506401,148507424,148508716,148511050,148512005,148512597,148513775,148514313,148514968,148516687,148523560,148524255,148525831,148526819,148529725,148543561,148544273,148581255,    148504798,148506247,148506482,148507506,148508812,148511229,148512131,148512638,148513870,148514483,148515209,148516779,148523724,148524358,148525972,148526940,148529842,148543690,148544397,148581383,    0   EZH2    cmpl    cmpl    2,1,1,0,0,1,1,2,0,1,0,1,2,1,1,0,0,0,0,-1,
NM_001203248.2  chr7    -   148504474   148581383   148504737   148544390   20  148504474,148506162,148506401,148507424,148508716,148511050,148512005,148512597,148513775,148514313,148514968,148516687,148523560,148524255,148525831,148526819,148529725,148543588,148544273,148581255,    148504798,148506247,148506482,148507506,148508812,148511229,148512131,148512638,148513870,148514483,148515209,148516779,148523724,148524358,148525972,148526940,148529842,148543690,148544397,148581383,    0   EZH2    cmpl    cmpl    2,1,1,0,0,1,1,2,0,1,0,1,2,1,1,0,0,0,0,-1,
NM_001354750.2  chr11   +   113930432   114127487   113934022   114121277   7   113930432,113933932,114027058,114057673,114112888,114117919,114121047,  113930864,113935290,114027156,114057760,114113059,114118087,114127487,  0   ZBTB16  cmpl    cmpl    -1,0,2,1,1,1,1,
```

#### bed
Output in [bed](http://genome.ucsc.edu/FAQ/FAQformat#format1) format.

```text
chr7    5566778     5570232     ACTB:NM_001101.5       -   5567378    5569288    212,16,48   6   744,182,439,240,129,78  0,856,1133,2013,2387,3376
chr11   113930432   114127487   ZBTB16:NM_001354750.2  +   113934022  114121277  212,16,48   7   432,1358,98,87,171,168,6440 0,3500,96626,127241,182456,187487,190615
chr17   40852292    40897058    EZH1:NM_001321082.2    -   40854549   40880959   212,16,48   20  2318,85,81,82,96,179,126,41,92,197,181,92,164,103,177,121,129,128,91,30 0,2602,3465,4327,4813,5732,7683,8601,9571,12014,12934,17701,18179,18830,19998,22520,27360,28550,30553,44736
```

#### fasta
Writes the cDNA sequence of all transcripts into one file. Please note that the sequence is stranded.

This target format requires a reference genome fasta file that must be specified using `--reference`/`-r`.

*This output allows different `--fasta-format` options:*
- `transcript`: The full transcript sequence (from the genomic start to end position, including introns)
- `exons`: The cDNA sequence of the processed transcript, i.e. the sequence of all exons, including non-coding exons.
- `cds` (default): The CDS of the transcript

```text
>NM_007298.3 BRCA1
ATGGATTTATCTGCTCTTCGCGTTGAAGAAGTACAAAATGTCATTAATGC
TATGCAGAAAATCTTAGAGTGTCCCATCTGTCTGGAGTTGATCAAGGAAC
CTGTCTCCACAAAGTGTGACCACATATTTTGCAAATTTTGCATGCTGAAA
CTTCTCAACCAGAAGAAAGGGCCTTCACAGTGTCCTTTATGTAAGAATGA
TATAACCAAAAGGAGCCTACAAGAAAGTACGAGATTTAGTCAACTTGTTG
...
>NM_001365057.2 C9orf85
ATGAGCTCCCAGAAAGGCAACGTGGCTCGTTCCAGACCTCAGAAGCACCA
GAATACGTTTAGCTTCAAAAATGACAAGTTCGATAAAAGTGTGCAGACCA
AGAAAATTAATGCAAAACTTCATGATGGAGTATGTCAGCGCTGTAAAGAA
GTTCTTGAGTGGCGTGTAAAATACAGCAAATACAAACCATTATCAAAACC
TAAAAAGTGA
...
```

#### fasta-split
Like `fasta` above, but one file for each transcript. Instead of an output file, you must specify an output directory, _ATG_ will save each transcript as `<Transcript_name>.fasta`, e.g.: `NM_001365057.2.fasta`.

This target format requires a reference genome fasta file that must be specified using `--reference`/`-r`.

*This output allows different `--fasta-format` options:*
- `transcript`: The full transcript sequence (from the genomic start to end position, including introns)
- `exons`: The cDNA sequence of the processed transcript, i.e. the sequence of all exons, including non-coding exons.
- `cds` (default): The CDS of the transcript

#### feature-sequence
cDNA sequence of each feature (5' UTR, CDS, 3'UTR), each in a separate row.

This target format requires a reference genome fasta file that must be specified using `--reference`/`-r`.

```text
BRCA1   NM_007298.3     chr17   41196311    41197694    -   3UTR    CTGCAGCCAGCCAC...
BRCA1   NM_007298.3     chr17   41197694    41197819    -   CDS     CAATTGGGCAGATGTGTG...
BRCA1   NM_007298.3     chr17   41199659    41199720    -   CDS     GGTGTCCACCCAATTGTG...
BRCA1   NM_007298.3     chr17   41201137    41201211    -   CDS     ATCAACTGGAATGGATGG...
BRCA1   NM_007298.3     chr17   41203079    41203134    -   CDS     ATCTTCAGGGGGCTAGAA...
BRCA1   NM_007298.3     chr17   41209068    41209152    -   CDS     CATGATTTTGAAGTCAGA...
BRCA1   NM_007298.3     chr17   41215349    41215390    -   CDS     GGGTGACCCAGTCTATTA...
BRCA1   NM_007298.3     chr17   41215890    41215968    -   CDS     ATGCTGAGTTTGTGTGTG...
BRCA1   NM_007298.3     chr17   41219624    41219712    -   CDS     ATGCTCGTGTACAAGTTT...
BRCA1   NM_007298.3     chr17   41222944    41223255    -   CDS     AGGGAACCCCTTACCTGG...
C9orf85 NM_001365057.2  chr9    74526555    74526650    +   5UTR    ATTGACAGAA...
C9orf85 NM_001365057.2  chr9    74526651    74526752    +   CDS     ATGAGCTCCCAGAA...
C9orf85 NM_001365057.2  chr9    74561922    74562028    +   CDS     AAAATTAATGCAAA...
C9orf85 NM_001365057.2  chr9    74597573    74597573    +   CDS     A
C9orf85 NM_001365057.2  chr9    74597574    74600974    +   3UTR    TGGAGTCTCC...
```

#### spliceai
This is a custom format useful for [SpliceAI](https://github.com/Illumina/SpliceAI)
splice predictions. The repo lists [example files](https://github.com/Illumina/SpliceAI/tree/master/spliceai/annotations).
The output has one gene per row, each gene record contains a consensus transcript, created by merging overlapping exons.

```text
#NAME       CHROM   STRAND  TX_START    TX_END  EXON_START      EXON_END
OR4F5       1       +       69090       70008   69090,          70008,
AL627309.1  1       -       134900      139379  134900,137620,  135802,139379,
```

#### qc
Runs some basic consistency checks on the transcripts:

| QC check | Explanation | Non-Coding vs Coding | requires Fasta File |
| --- | --- | --- | --- |
| Exon | Contains at least one exon | all | no |
| Correct CDS Length | The length of the CDS is divisible by 3 | Coding | no |
| Correct Start Codon | The CDS starts with `ATG` | Coding | yes |
| Correct Stop Codon | The CDS ends with a Stop codon `TAG`, `TAA`, or `TGA` | Coding | yes |
| No upstream Start Codon | The 5'UTR does not contain another start codon `ATG` (This test do not make sense biologically. It is totally fine for a transcript to have upstream `ATG` start cordons that are not utilized but the ribosome.) | Coding | yes |
| No upstream Stop Codon| The CDS does not contain another in-frame stop-codon | Coding | yes |
| No Start codon | The full exon sequence does not contain a start codon `ATG` (Biologically speaking, a non-coding transcript could have `ATG` start codons that are not utilized) | Non-Coding | yes |
| Correct Coordinates | The transcript is within the coordinates of the reference genome | all | yes |

**Test results:**
- `NA` Test could not be performed (e.g. CDS-length for non-coding transcripts), so no conclusion could be drawn
- `OK` The test succeeded with an OK results
- `NOK` The test failed and gave a NOT OK result

```text
Gene     transcript     Exon  CDS Length  Correct Start Codon  Correct Stop Codon  No upstream Start Codon  No upstream Stop Codon  Correct Coordinates
FAM239A  NR_146581.1    OK    N/A         N/A                  N/A                 OK                       N/A                     OK
OR5H2    NM_001005482.1 OK    OK          OK                   OK                  OK                       OK                      OK
SNX20    NM_001144972.2 OK    OK          OK                   OK                  NOK                      OK                      OK
```

#### raw
This is mainly useful for debugging, as it gives a quick glimpse into the Exons and CDS coordinates of the transcripts.

#### bin
Save Transcripts in _ATG_ binary format for faster re-reading.


## ATG as library
_ATG_ uses the _atglib_ library, which is documented inline and available on [docs.rs](https://docs.rs/atglib)


## Known issues
### GTF parsing
- [ ] NM_001371720.1 has two book-ended exons (155160639-155161619 || 155161620-155162101). During input parsing, book-ended features are merged into one exon