From the Spanish word for path.
Builds an amino-acid alignment in a reference-free, alignment-free manner from a set of proteomes.
Not ‘better’ than traditional marker-based pipelines, but simpler and faster to run.
Typical usages range from between-species to within-phylum phylogenetic analyses (bacteria, archaea and eukaryotes).
under the hood
kamino performs the following successive steps:
- lists proteome files from the input directory (-i or -I)
- recodes proteins with a 6-letter recoding scheme (-r)
- simplifies proteomes by discarding out-branching k-mers
- builds a global assembly graph and identifies variant groups as described here (-d)
- converts variant group paths back to amino acids using a sliding window
- masks long polymorphism runs within variant groups (-m)
- filters variant groups by missing data and middle-length thresholds (-m and -l)
- extracts middle positions and incorporates 'constant' positions (-c)
- outputs the final amino acid alignment (-o)
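To make the recoding and k-mer steps above more concrete, here is a minimal sketch in Rust. It assumes Dayhoff-style groups for the 6-letter alphabet, which may differ from the scheme kamino actually applies with -r, and the function names `recode` and `recoded_kmers` are hypothetical. It only illustrates the idea of collapsing amino acids into classes before extracting k-mers, not kamino's implementation.

```rust
/// Map one amino acid to one of six classes.
/// The grouping below is Dayhoff-style and is an assumption of this sketch.
fn recode(aa: u8) -> Option<u8> {
    match aa.to_ascii_uppercase() {
        b'A' | b'G' | b'P' | b'S' | b'T' => Some(0),
        b'D' | b'E' | b'N' | b'Q' => Some(1),
        b'H' | b'K' | b'R' => Some(2),
        b'I' | b'L' | b'M' | b'V' => Some(3),
        b'F' | b'W' | b'Y' => Some(4),
        b'C' => Some(5),
        _ => None, // ambiguous or unknown residue (X, B, Z, *, ...)
    }
}

/// Recode a protein and return every k-mer over the reduced alphabet.
/// Windows that contain an unrecognised residue are skipped.
fn recoded_kmers(protein: &[u8], k: usize) -> Vec<Vec<u8>> {
    if k == 0 {
        return Vec::new(); // `windows(0)` would panic
    }
    let codes: Vec<Option<u8>> = protein.iter().map(|&aa| recode(aa)).collect();
    codes
        .windows(k)
        .filter_map(|w| w.iter().copied().collect::<Option<Vec<u8>>>())
        .collect()
}
```

For example, `recoded_kmers(b"MKVLA", 3)` yields three k-mers, one per window of three residues. Because several amino acids share a class, substitutions within a class produce identical recoded k-mers.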
installation
You can either compile the code locally using rustc, or install a precompiled binary from Bioconda:
running kamino
Input consists of proteome files in FASTA format (gzipped or not), with one file per sample. Files can be placed in a single directory (specified with the -i argument), or their paths can be provided in a tab-delimited file using -I.
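As an illustration of the "one file per sample" convention, the following sketch derives a sample name from a proteome file name, treating gzipped and plain files alike. The list of accepted extensions and the function name `sample_name` are assumptions made for this example; kamino's own rules for recognising files and naming samples may differ.

```rust
use std::path::Path;

/// FASTA extensions accepted by this sketch (an assumption, not kamino's actual list).
const FASTA_EXTS: [&str; 3] = ["fa", "faa", "fasta"];

/// Return the sample name for a proteome file, or None if it does not look like FASTA.
/// A trailing `.gz` is ignored so that gzipped and plain files are handled the same way.
fn sample_name(path: &Path) -> Option<String> {
    let file = path.file_name()?.to_str()?;
    let plain = file.strip_suffix(".gz").unwrap_or(file);
    let (stem, ext) = plain.rsplit_once('.')?;
    let ext_lower = ext.to_ascii_lowercase();
    if !stem.is_empty() && FASTA_EXTS.contains(&ext_lower.as_str()) {
        Some(stem.to_string())
    } else {
        None
    }
}
```

Under these assumptions, both `sampleA.faa.gz` and `sampleA.faa` map to the sample name `sampleA`, while a file such as `notes.txt` is ignored.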
A basic run using four threads can be performed with either of the following commands:
examples
All analyses were performed on a MacBook M4 Pro using 4 threads (other parameters set to default):
| dataset | taxonomic diversity | runtime (min) | memory (GB) | alignment size (aa) |
|---|---|---|---|---|
| 50 Mycobacterium | within-genus | 0.1 | 2 | 16,088 |
| 400 Mycobacterium | within-genus | 0.9 | 8 | 11,745 |
| 50 Polyporales (fungi) | within-order | 0.5 | 7 | 17,512 |
| 46 Drosophila | within-genus | 0.7 | 5 | 196,212 |
| 55 Mammalia | within-class | 1.5 | 8 | 328,205 |
The 400 Mycobacterium dataset was also used to examine how parameter choices affect the analysis (still with 4 threads):
| parameters | runtime (min) | memory (GB) | alignment size (aa) |
|---|---|---|---|
| [default: k=13, d=4] | 0.9 | 8 | 11,745 |
| --k 14 | 0.9 | 9 | 17,515 |
| --depth 5 | 1 | 8 | 17,003 |
| --k 14 --depth 5 | 1 | 9 | 23,706 |
FAQ
- When not to use kamino?
  - low-diversity datasets (i.e., within-species), for which genome-based approaches will be more powerful
  - very large datasets (e.g., thousands of bacterial proteomes or hundreds of vertebrate proteomes)
  - very divergent datasets (e.g., the animal kingdom)
  - datasets with a distant outgroup composed of only a few isolates, which might have disproportionately more missing data
  - list to be completed ...
- Is the output reproducible?
- How to get more phylogenetic positions?
  - increase the k-mer size (-k), although this can substantially raise memory usage
  - increase the maximum depth of the graph traversal (-d), although this increases the runtime
  - lower the minimum proportion of isolates per position (-m), if that is acceptable for downstream analyses
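The effect of the minimum proportion of isolates per position (-m) can be illustrated with a small sketch that keeps only the alignment columns in which enough samples have a residue. It assumes that `-` and `X` mark missing data and that positions are filtered column by column; the function name `keep_columns` is hypothetical. kamino's actual filtering operates on variant groups and may differ in detail.

```rust
/// Return the indices of alignment columns where at least `min_prop` of the
/// samples carry a residue. `alignment` holds one row per sample.
/// Treating `-` and `X` as missing is an assumption of this sketch.
fn keep_columns(alignment: &[&[u8]], min_prop: f64) -> Vec<usize> {
    let n_samples = alignment.len();
    if n_samples == 0 {
        return Vec::new();
    }
    let n_cols = alignment[0].len();
    (0..n_cols)
        .filter(|&col| {
            // Count samples with an informative residue at this column;
            // rows shorter than `col` count as missing.
            let present = alignment
                .iter()
                .filter(|row| matches!(row.get(col), Some(&c) if c != b'-' && c != b'X'))
                .count();
            present as f64 / n_samples as f64 >= min_prop
        })
        .collect()
}
```

Lowering `min_prop` keeps more columns at the cost of more missing data per column, which is the trade-off described in the last item above.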
This codebase is provided under the MIT License. Some parts of the code were drafted using AI assistance.