From the Spanish word for path (not a Star Wars planet!).
Builds an amino-acid alignment in a reference-free, alignment-free manner from a set of proteomes.
Not ‘better’ than traditional marker-based pipelines, but simpler and faster to run.
Typical usages range from between-species to within-family (bacteria and archaea) or within-phylum (eukaryotes) phylogenetic analyses.
under the hood
kamino performs the following successive steps:
- lists proteome files from the input directory (-i or -I)
- recodes proteins with a 6-letter recoding scheme (-r)
- simplifies proteomes by pre-filtering proteins and discarding out-branching k-mers
- builds a global assembly graph and identifies variant groups as described here (-d)
- converts variant group paths back to amino acids using a sliding window
- masks long polymorphism runs within variant groups (-m)
- filters variant groups by missing data and middle-length thresholds (-f and -l)
- extracts middle positions and incorporates 'constant' positions (-c)
- outputs the final amino acid alignment (-o)
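
To make the recoding step more concrete, here is a minimal Python sketch that collapses amino acids into six classes. It assumes the classic Dayhoff grouping and arbitrary digit labels; the scheme(s) that kamino actually offers through -r may differ, and `recode` is a hypothetical helper, not part of kamino.

```python
# Assumption: Dayhoff six-class grouping, labelled 0-5 for illustration only.
DAYHOFF6 = {
    "AGPST": "0",
    "DENQ": "1",
    "HKR": "2",
    "ILMV": "3",
    "FWY": "4",
    "C": "5",
}

# Flatten the groups into a per-residue lookup table.
RECODE = {aa: code for group, code in DAYHOFF6.items() for aa in group}


def recode(protein: str, unknown: str = "X") -> str:
    """Map each residue to its class label; unrecognised symbols become `unknown`."""
    return "".join(RECODE.get(aa, unknown) for aa in protein.upper())


if __name__ == "__main__":
    # Isoleucine and leucine fall in the same class, so they become indistinguishable.
    print(recode("MKIL"))
```

Recoding of this kind reduces the alphabet so that conservative substitutions no longer break k-mer matches, at the cost of some resolution.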
installation
You can either compile the code locally using rustc, or install a precompiled binary from Bioconda:
running kamino
Input consists of proteome files in FASTA format (gzipped or not), with one file per sample. Files can be placed in a single directory (specified with the -i argument), or their paths can be provided in a tab-delimited file using -I.
A basic run using four threads can be performed with either of the following commands:
In directory mode, files are recognized by their extension (.fas, .fasta, .faa, .fa, .fna; gzipped or not).
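
As a quick way to preview which files a directory would contribute, the sketch below applies the extension rule stated above. The helper names are hypothetical, and case-sensitive matching is an assumption; kamino's own matching logic may differ in such details.

```python
from pathlib import Path

# Extensions listed in this README for directory mode.
FASTA_EXTENSIONS = (".fas", ".fasta", ".faa", ".fa", ".fna")


def is_recognized(filename: str) -> bool:
    """Return True if the name ends with a listed extension, optionally followed by .gz."""
    if filename.endswith(".gz"):
        filename = filename[: -len(".gz")]
    return filename.endswith(FASTA_EXTENSIONS)


def list_input_files(directory: str) -> list[Path]:
    """List the regular files in `directory` that pass the extension check, sorted by name."""
    return sorted(
        p for p in Path(directory).iterdir() if p.is_file() and is_recognized(p.name)
    )
```

Running `list_input_files("my_proteomes")` on a hypothetical input directory before launching an analysis can help spot files with an unexpected extension.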
For bacterial isolates, the phylogenomic alignment can also be generated directly from genome assemblies by selecting the option --genomes (using either -i or -I). In this case, an ultra-fast but approximate protein prediction is performed, with predicted proteomes stored in a temporary directory.
Finally, a Neighbour-joining tree can be generated in addition to the other output files by selecting the option --NJ.
examples
All analyses were performed on a MacBook M4 Pro using v0.6.1 and 4 threads (other parameters set to default):
| dataset | taxonomic diversity | runtime (min) | memory (GB) | alignment size (aa) |
|---|---|---|---|---|
| 50 Mycobacterium | within-genus | 0.1 | 0.7 | 24,678 |
| 400 Mycobacterium | within-genus | 0.5 | 2.5 | 21,011 |
| 50 Polyporales (fungi) | within-order | 0.3 | 3.8 | 29,483 |
| 165 Arthropoda | within-phylum | 1.1 | 8.5 | 13,002 |
| 55 Mammalia | within-class | 1.6 | 8.1 | 334,108 |
FAQ
- When not to use kamino?
  - low-diversity datasets (i.e., within-species), for which genome-based approaches will be more powerful
  - highly divergent datasets (e.g., animal kingdom)
  - distant outgroups composed of a few isolates: these might have disproportionately more missing data
  - isolates with a low number of proteins (viruses and prokaryotes with fewer than 1,000 proteins)
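
For the last point, a rough pre-check is to count the FASTA headers in each proteome file. The sketch below is a hypothetical helper, not part of kamino; it assumes one protein per `>` header and uses the 1,000-protein guideline mentioned above as its default threshold.

```python
import gzip


def count_proteins(path: str) -> int:
    """Count FASTA records (lines starting with '>') in a plain or gzipped file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))


def has_enough_proteins(path: str, minimum: int = 1000) -> bool:
    """Return True if the file holds at least `minimum` proteins."""
    return count_proteins(path) >= minimum
```

Samples that fail this check are not necessarily unusable, but they are the ones most likely to end up with a large share of missing data.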
- Is the output reproducible?
- How to get more phylogenetic positions?
This codebase is provided under the MIT License. Some parts of the code were drafted using AI assistance.