From the Spanish word for path.
Builds an amino-acid alignment in a reference-free, alignment-free manner from a set of proteomes.
Not ‘better’ than traditional marker-based pipelines, but simpler and faster to run.
Typical usages range from between-species to within-phylum phylogenetic analyses (bacteria, archaea and eukaryotes).
under the hood
kamino performs the following successive steps:
- lists proteome files from the input directory (-i or -I)
- recodes proteins with a 6-letter recoding scheme (-r)
- simplifies proteomes by pre-filtering proteins and discarding out-branching k-mers
- builds a global assembly graph and identifies variant groups as described here (-d)
- converts variant group paths back to amino acids using a sliding window
- masks long polymorphism runs within variant groups (-m)
- filters variant groups by missing data and middle-length thresholds (-f and -l)
- extracts middle positions and incorporates 'constant' positions (-c)
- outputs the final amino acid alignment (-o)
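To make the recoding and k-mer steps above more concrete, here is a minimal Python sketch. It assumes a Dayhoff-style six-group scheme, which is only one common choice: the scheme kamino actually applies with -r may differ, and the names `DAYHOFF6_GROUPS`, `recode` and `kmers` are hypothetical and not part of kamino.

```python
# Illustrative only: a Dayhoff-style 6-group recoding (an assumption,
# not necessarily the scheme used by kamino).
DAYHOFF6_GROUPS = ["AGPST", "C", "DENQ", "FWY", "HKR", "ILMV"]

# Map each amino acid to the index of its group, stored as a one-character code.
RECODE = {aa: str(i) for i, group in enumerate(DAYHOFF6_GROUPS) for aa in group}


def recode(protein: str, unknown: str = "X") -> str:
    """Recode a protein sequence into the reduced 6-letter alphabet.

    Residues outside the 20 standard amino acids are replaced by `unknown`.
    """
    return "".join(RECODE.get(aa, unknown) for aa in protein.upper())


def kmers(sequence: str, k: int) -> list[str]:
    """Return all overlapping k-mers of a sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


if __name__ == "__main__":
    # Two different peptides that collapse to the same recoded string.
    print(recode("MKV"), recode("LRI"))
    print(kmers(recode("ACDFHI"), 4))
```

Under this scheme, substitutions between biochemically similar residues collapse to the same code, so related proteins share more k-mers after recoding than before.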
installation
You can either compile the code locally using rustc, or install a precompiled binary from Bioconda:
running kamino
Input consists of proteome files in FASTA format (gzipped or not), with one file per sample. Files can be placed in a single directory (specified with the -i argument), or their paths can be provided in a tab-delimited file using -I.
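Before a run, it can help to check that the input directory looks as expected. The following Python sketch lists candidate proteome files and counts the proteins in each. It is not part of kamino: the helper names and the list of accepted file extensions are assumptions, so adjust them to your own files.

```python
import gzip
from pathlib import Path

# Assumption: proteome files use one of these extensions.
FASTA_SUFFIXES = (".fa", ".faa", ".fasta", ".fa.gz", ".faa.gz", ".fasta.gz")


def list_proteomes(directory):
    """Return the sorted FASTA files (gzipped or not) found in a directory."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.name.endswith(FASTA_SUFFIXES)
    )


def count_proteins(path):
    """Count FASTA records (header lines starting with '>') in one file."""
    opener = gzip.open if path.name.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))


if __name__ == "__main__":
    import sys

    # One file per sample: print the file name and its protein count.
    for proteome in list_proteomes(sys.argv[1]):
        print(proteome.name, count_proteins(proteome), sep="\t")
```

A sample with a very low protein count compared with the others may indicate a truncated or incomplete proteome.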
A basic run using four threads can be performed with either of the following commands:
examples
All analyses were performed on a MacBook M4 Pro using v0.6.0 and 4 threads (other parameters set to default):
| dataset | taxonomic diversity | runtime (min) | memory (GB) | alignment size (aa) |
|---|---|---|---|---|
| 50 Mycobacterium | within-genus | 0.1 | 1 | 24,678 |
| 400 Mycobacterium | within-genus | 0.7 | 3 | 19,485 |
| 50 Polyporales (fungi) | within-order | 0.4 | 4 | 28,373 |
| 46 Drosophila | within-genus | 0.7 | 5 | 223,600 |
| 55 Mammalia | within-class | 1.7 | 8 | 333,410 |
FAQ
- When not to use kamino?
  - low-diversity datasets (i.e., within-species), for which genome-based approaches will be more powerful
  - very large datasets (e.g., thousands of bacterial proteomes or hundreds of vertebrate proteomes)
  - very divergent datasets (e.g., animal kingdom)
  - distant outgroups composed of a few isolates: these might have disproportionately more missing data
  - list to be completed ...
- Is the output reproducible?
- How to get more phylogenetic positions?
This codebase is provided under the MIT License. Some parts of the code were drafted using AI assistance.