[](https://github.com/rderelle/kamino/actions/workflows/ci.yml)
[](https://github.com/rderelle/kamino/actions/workflows/clippy.yml)
[](https://codecov.io/github/rderelle/kamino)
[](https://bioconda.github.io/recipes/kamino/README.html)
<br><br>
<p align="center">
<img src="logo_kamino.svg" alt="kamino logo" width="400">
</p>
<br><br>
From the Spanish word for *path* (not a Star Wars planet!).
Builds an amino-acid alignment in a reference-free, alignment-free manner from a set of proteomes.
Not ‘better’ than traditional marker-based pipelines, but simpler and faster to run.
Typical use cases range from between-species to within-family (bacteria and archaea) or within-phylum (eukaryotes) phylogenetic analyses.
<br>
---
## under the hood
kamino performs the following successive steps:
- lists proteome files from the input directory (-i or -I)
- recodes proteins with a 6-letter recoding scheme (-r)
- simplifies proteomes by pre-filtering proteins and discarding out-branching k-mers
- builds a global assembly graph and identifies variant groups as described <a href="https://academic.oup.com/mbe/article/42/4/msaf077/8103706">here</a> (-d)
- converts variant group paths back to amino acids using a sliding window
- masks long polymorphism runs within variant groups (-m)
- filters variant groups by missing data and middle-length thresholds (-f and -l)
- extracts middle positions and incorporates 'constant' positions (-c)
- outputs the final amino acid alignment (-o)
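
To make the recoding step more concrete, here is a minimal Python sketch of a 6-letter recoding. It assumes the classic Dayhoff grouping (AGPST, C, DENQ, FWY, HKR, ILMV); the scheme or schemes actually selectable with `-r` may differ, and the names `DAYHOFF6` and `recode` are purely illustrative, not part of kamino. The point it shows is that two proteins differing only by conservative substitutions collapse to the same recoded string.

```python
# Dayhoff 6-state groups, each mapped to one representative letter.
# ASSUMPTION: this is a textbook scheme, not necessarily the one kamino uses.
DAYHOFF6 = {
    "AGPST": "A",
    "C": "C",
    "DENQ": "D",
    "FWY": "F",
    "HKR": "H",
    "ILMV": "I",
}

# Flatten the groups into a per-residue lookup table.
RECODE = {aa: code for group, code in DAYHOFF6.items() for aa in group}


def recode(protein: str, unknown: str = "X") -> str:
    """Replace each amino acid by its group letter; other symbols become `unknown`."""
    return "".join(RECODE.get(aa, unknown) for aa in protein.upper())


if __name__ == "__main__":
    # Two sequences with conservative substitutions (K/R, T/S, L/I, V/L).
    print(recode("MKTLLV"))  # IHAIII
    print(recode("MRSIIL"))  # IHAIII
```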
---
## installation
You can either compile the code locally using rustc, or install a precompiled binary from Bioconda:
```bash
conda install bioconda::kamino
```
---
## running kamino
Input consists of proteome files in FASTA format (gzipped or not), with one file per sample. Files can be placed in a single directory (specified with the `-i` argument), or their paths can be provided in a tab-delimited file using `-I`.
A basic run using four threads can be performed with either of the following commands:
```bash
kamino -i <input_dir> -t 4
kamino -I <tabular_file> -t 4
```
In directory mode, files are recognized by their extension (`.fas`, `.fasta`, `.faa`, `.fa`, `.fna`; gzipped or not).
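
The extension rule can be sketched in Python as follows, for instance to check which files of a directory would be picked up before launching a run. This is an illustration rather than kamino's own code: it assumes that gzipped files carry a trailing `.gz` and that matching is case-sensitive, and the helper names `is_proteome_file` and `list_proteomes` are hypothetical.

```python
from pathlib import Path

# Extensions listed in this README for directory mode.
FASTA_EXTENSIONS = (".fas", ".fasta", ".faa", ".fa", ".fna")


def is_proteome_file(name: str) -> bool:
    """Return True if `name` ends with a listed extension, optionally followed by .gz."""
    # ASSUMPTION: gzipped files are named <file>.<ext>.gz
    if name.endswith(".gz"):
        name = name[: -len(".gz")]
    return name.endswith(FASTA_EXTENSIONS)


def list_proteomes(input_dir: str) -> list[Path]:
    """List the files in `input_dir` that match the extension rule, sorted by name."""
    return sorted(
        p for p in Path(input_dir).iterdir()
        if p.is_file() and is_proteome_file(p.name)
    )
```

Calling `list_proteomes("<input_dir>")` returns the matching paths, so an unexpectedly short list points to files with an unlisted extension.
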
For **bacterial** isolates, the phylogenomic alignment can also be generated directly from genome assemblies by selecting the option `--genomes` (using either `-i` or `-I`). In this case, an ultra-fast but approximate protein prediction is performed, with predicted proteomes stored in a temporary directory.
```bash
kamino -i <input_dir> -t 4 --genomes
```
Finally, a Neighbour-joining tree can be generated in addition to the other output files by selecting the option `--NJ`.
```bash
kamino -i <input_dir> -t 4 --NJ
```
---
## examples
All analyses were performed on a MacBook *M4 Pro* using v0.6.1 and 4 threads (other parameters set to default):

| dataset | taxonomic range | | | |
|---|---|---|---|---|
| 50 *Mycobacterium* | within-genus | 0.1 | 0.7 | 24,678 |
| 400 *Mycobacterium* | within-genus | 0.5 | 2.5 | 21,011 |
| 50 Polyporales (fungi) | within-order | 0.3 | 3.8 | 29,483 |
| 165 Arthropoda | within-phylum | 1.1 | 8.5 | 13,002 |
| 55 Mammalia | within-class | 1.6 | 8.1 | 334,108 |
---
## FAQ
- **When not to use kamino?**
  * low-diversity datasets (i.e., within-species), for which genome-based approaches will be more powerful
  * highly divergent datasets (e.g., animal kingdom)
  * a distant outgroup composed of only a few isolates: these might have disproportionately more missing data
  * isolates with a low number of proteins (viruses and prokaryotes with fewer than 1,000 proteins)
- **Is the output reproducible?**
<p>Yes, kamino is fully deterministic, so it will produce the exact same alignment for a given version, set of parameters and input proteomes.</p>
- **How to get more phylogenetic positions?**
<p>Increase the maximum depth of the graph traversal (<code>-d</code>) or lower the minimum proportion of isolates with an amino acid at each position (<code>-f</code>), if that is acceptable for downstream analyses.</p>
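
As a rough illustration of how a missing-data threshold trades alignment length against completeness, the Python sketch below keeps only the columns of a toy alignment in which the proportion of isolates with an amino acid reaches a minimum value. It assumes that `-` marks missing data and it is not kamino's implementation (kamino applies its filters to variant groups, as listed under "under the hood"); `filter_columns` and the toy sequences are hypothetical.

```python
def filter_columns(alignment: dict[str, str], min_prop: float, missing: str = "-") -> dict[str, str]:
    """Keep columns where the fraction of non-missing residues is >= min_prop."""
    names = list(alignment)
    length = len(alignment[names[0]])
    keep = [
        i for i in range(length)
        # ASSUMPTION: any character listed in `missing` counts as missing data.
        if sum(alignment[n][i] not in missing for n in names) / len(names) >= min_prop
    ]
    return {n: "".join(alignment[n][i] for i in keep) for n in names}


if __name__ == "__main__":
    toy = {"s1": "MK-L", "s2": "M--L", "s3": "MR-I", "s4": "-RQL"}
    # Column occupancy is 0.75, 0.75, 0.25 and 1.0.
    print(filter_columns(toy, 0.8))  # strict threshold: 1 column kept
    print(filter_columns(toy, 0.5))  # relaxed threshold: 3 columns kept
```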
---
This codebase is provided under the MIT License. Some parts of the code were drafted using AI assistance.