kamino-cli 0.3.0

From the Spanish word for path.

Builds an amino-acid alignment in a reference-free, alignment-free manner from a set of proteomes.
Not ‘better’ than traditional marker-based pipelines, but simpler and faster to run.

Typical use cases range from between-species to within-phylum phylogenetic analyses (bacteria, archaea and eukaryotes).

under the hood

kamino performs the following successive steps:

lists proteome files from the input directory (-i or -I)
recodes proteins with a 6-letter recoding scheme (-r)
simplifies proteomes by discarding out-branching k-mers
builds a global assembly graph and identifies variant groups as described here (-d)
converts variant group paths back to amino acids using a sliding window
masks long polymorphism runs within variant groups (-m)
filters variant groups by missing data and middle-length thresholds (-m and -l)
extracts middle positions and incorporates 'constant' positions (-c)
outputs the final amino acid alignment (-o)
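
To make the recoding and k-mer steps more concrete, here is a minimal Python sketch. It is not kamino's implementation: the Dayhoff-6 grouping is only assumed as one common 6-letter scheme (the scheme selected with -r may differ), and the function names `recode` and `kmers` are hypothetical.

```python
# Assumption: Dayhoff-6 groups, a widely used 6-letter recoding.
# kamino's actual scheme (-r) may group amino acids differently.
DAYHOFF6_GROUPS = ["AGPST", "C", "DENQ", "FWY", "HKR", "ILMV"]
RECODE = {aa: str(i) for i, group in enumerate(DAYHOFF6_GROUPS) for aa in group}


def recode(protein: str) -> str:
    """Map each amino acid to its group digit; unknown residues become 'X'."""
    return "".join(RECODE.get(aa, "X") for aa in protein.upper())


def kmers(seq: str, k: int) -> list[str]:
    """Return all overlapping k-mers that contain no unknown symbol."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1) if "X" not in seq[i:i + k]]


if __name__ == "__main__":
    # Substitutions within a group collapse to the same recoded string.
    print(recode("MKV"), recode("LRI"))  # 545 545
    print(kmers(recode("ACDE"), 3))      # ['012', '122']
```

Because similar amino acids share a symbol, k-mers from moderately diverged proteins can still match exactly, which is what makes a k-mer graph usable across species.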

installation

You can either compile the code locally using rustc, or install a precompiled binary from Bioconda:

conda install bioconda::kamino

running kamino

Input consists of proteome files in FASTA format (gzipped or not), with one file per sample. Files can be placed in a single directory (specified with the -i argument), or their paths can be provided in a tab-delimited file using -I.
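
Before a run, it can help to check that every file in the input directory is a readable FASTA proteome. The Python sketch below is an optional helper and not part of kamino; the function names are hypothetical, and it assumes that every regular file in the directory is a proteome.

```python
import gzip
from pathlib import Path


def open_maybe_gzip(path):
    """Open a text file, handling gzip transparently (detected by magic bytes)."""
    with open(path, "rb") as fh:
        is_gz = fh.read(2) == b"\x1f\x8b"
    return gzip.open(path, "rt") if is_gz else open(path, "rt")


def count_proteins(path):
    """Count FASTA records (header lines starting with '>') in one proteome file."""
    with open_maybe_gzip(path) as fh:
        return sum(1 for line in fh if line.startswith(">"))


def summarize(input_dir):
    """Return {file name: number of proteins} for every regular file in the directory."""
    return {
        p.name: count_proteins(p)
        for p in sorted(Path(input_dir).iterdir())
        if p.is_file()
    }
```

A file reporting zero proteins is probably not a valid FASTA proteome and is worth inspecting before running kamino.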

A basic run using four threads can be performed with either of the following commands:

kamino -i <input_dir> -t 4
kamino -I <tabular_file> -t 4

examples

All analyses were run on a MacBook M4 Pro using 4 threads (all other parameters left at their defaults):

dataset	taxonomic diversity	runtime (min)	memory (GB)	alignment size (aa)
50 Mycobacterium	within-genus	0.1	2	16,088
400 Mycobacterium	within-genus	0.9	8	11,745
50 Polyporales (fungi)	within-order	0.5	7	17,512
46 Drosophila	within-genus	0.7	5	196,212
55 Mammalia	within-class	1.5	8	328,205

The 400 Mycobacterium dataset was also used to examine how parameter choices affect runtime, memory and alignment size (still with 4 threads):

parameters	runtime (min)	memory (GB)	alignment size (aa)
[default: k=13, d=4]	0.9	8	11,745
--k 14	0.9	9	17,515
--depth 5	1	8	17,003
--k 14 --depth 5	1	9	23,706

FAQ

When not to use kamino?
- low-diversity datasets (i.e., within-species), for which genome-based approaches will be more powerful
- very large datasets (e.g., thousands of bacterial proteomes or hundreds of vertebrate proteomes)
- very divergent datasets (e.g., animal kingdom)
- distant outgroup composed of a few isolates: these might have disproportionately more missing data
- list to be completed ...

Is the output reproducible?

How to get more phylogenetic positions?
- increase the k-mer size (-k), though this can substantially raise memory usage
- increase the maximum depth of the graph traversal (-d), though this increases the runtime
- lower the minimum proportion of isolates per position (-m), if that is acceptable for downstream analyses
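
To illustrate the trade-off behind the missing-data threshold, here is a minimal Python sketch that filters the columns of a finished alignment by the proportion of samples with data. It only mimics the idea and is not how kamino applies -m internally; the function name `filter_columns` and the choice of '-' and 'X' as missing symbols are assumptions.

```python
def filter_columns(alignment: dict[str, str], min_prop: float, missing: str = "-X") -> dict[str, str]:
    """Keep alignment columns where at least min_prop of the samples have data.

    alignment maps sample names to equal-length, non-empty aligned sequences.
    Characters listed in `missing` are treated as missing data (an assumption).
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    keep = []
    for col in range(length):
        present = sum(alignment[n][col] not in missing for n in names)
        if present / len(names) >= min_prop:
            keep.append(col)
    return {n: "".join(alignment[n][c] for c in keep) for n in names}


if __name__ == "__main__":
    aln = {"a": "MK-V", "b": "MR-V", "c": "M-LV", "d": "MKLV"}
    # Column coverage is 4/4, 3/4, 2/4 and 4/4 samples.
    print(filter_columns(aln, 0.75))  # the third column (2/4) is dropped
    print(filter_columns(aln, 0.5))   # all four columns are kept
```

Lowering the threshold retains more positions, but each retained position carries more missing data into the downstream tree inference.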

This codebase is provided under the MIT License. Some parts of the code were drafted using AI assistance.