prot_translate
Translate nucleotide sequence (dna or rna) to protein.
Usage
Add this to your Cargo.toml
:
[]
= "0.1.0"
Example
use *;
Benchmarks
The current algorithm is inspired by seqan's implementation which uses array indexing. Here is how it performs vs other methods (tested on 2012 macbook pro).
Method | 10 bp* | 100 bp | 1,000 bp | 10,000 bp | 100,000 bp | 1 million bp |
---|---|---|---|---|---|---|
prot_translate | 91 ns | 0.29 μs | 2.28 μs | 23 μs | 215 μs | 2.25 ms |
fnv hashmap | 111 ns | 0.37 μs | 3.58 μs | 37 μs | 366 us | 3.86 ms |
std hashmap | 160 ns | 1.03 μs | 9.65 μs | 100 μs | 943 μs | 9.40 ms |
phf_map | 177 ns | 1.04 μs | 9.47 μs | 100 μs | 936 μs | 9.91 |
match statement | 259 ns | 1.77 μs | 17.9 μs | 163 μs | 1941 μs | 19.1 ms |
prot_translate (unchecked) | 90 ns | 0.26 μs | 2.02 μs | 20 μs | 197 μs | 1.92 ms |
*bp = "base pairs"
To benchmark yourself (have to use nightly because of phf_map macro).
cargo +nightly bench
Thoughts
- FNV seems to be a great option, but I have chosen to use the current implementation due to being slightly faster and not required any dependencies.
- There was originally a function called
translate_unchecked
that did not validate each byte for valid ASCII, but since the performance gain was negligible, it was removed.
Todo
- Add other Codon tables (e.g. Vertebrate Mitochondrial, Yeast Mitochondrial, Mold Mitochondrial, etc.)
- Add support for ambiguous nucleotides (right now, only supports A, U, T, C, G)
Tests
To test
cargo test
To can also generate new test data (requires python3 and biopython).
# Generate 500 random sequences and their peptides