kamino-cli 0.9.0

From the Spanish word for path (not a Star Wars planet!).

Builds an amino-acid alignment in a reference-free, alignment-free manner from a set of proteomes.
Not ‘better’ than traditional marker-based pipelines, but simpler and faster to run.

Typical use cases range from between-species to within-family (bacteria and archaea) or within-phylum (eukaryotes) phylogenetic analyses.

under the hood

kamino performs the following successive steps:

lists proteome files from the input directory (-i or -I)
recodes proteins with a 6-letter recoding scheme (-r)
simplifies proteomes by pre-filtering proteins and discarding out-branching k-mers
builds a global assembly graph and identifies variant groups as described here (-d)
converts variant group paths back to amino acids using a sliding window
masks long polymorphism runs within variant groups (-m)
filters variant groups by missing data and middle-length thresholds (-f and -l)
extracts middle positions and incorporates 'constant' positions (-c)
outputs the final amino acid alignment (-o)
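
To make the recoding step more concrete, here is a minimal Python sketch of a 6-letter recoding followed by k-mer extraction. It assumes the classic Dayhoff-style grouping (AGPST, C, DENQ, FWY, HKR, ILMV); the scheme actually selected by -r may use different groups, and the names recode and kmers are hypothetical helpers, not part of kamino.

```python
# ASSUMPTION: Dayhoff-style 6-group recoding; kamino's -r scheme may differ.
# Each key is the letter used to represent every amino acid in its group.
DAYHOFF6 = {
    "A": "AGPST",
    "C": "C",
    "D": "DENQ",
    "F": "FWY",
    "H": "HKR",
    "I": "ILMV",
}

# Map each of the 20 standard amino acids to its group representative.
RECODE = {aa: rep for rep, group in DAYHOFF6.items() for aa in group}


def recode(protein: str) -> str:
    """Recode a protein; anything outside the 20 standard residues becomes 'X'."""
    return "".join(RECODE.get(aa, "X") for aa in protein.upper())


def kmers(seq: str, k: int) -> list[str]:
    """Return all overlapping k-mers of seq, in order (empty if seq is shorter than k)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]
```

Recoding shrinks the alphabet from 20 letters to 6, so proteins that differ only by substitutions within a group yield identical k-mers. This is the general motivation for recoding before k-mer based comparison; the details of how kamino uses the recoded k-mers are not shown here.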

installation

You can either compile the code locally using rustc, or install a precompiled binary from Bioconda:

conda install bioconda::kamino

running kamino

Input consists of proteome files in FASTA format (gzipped or not), with one file per sample. Files can be placed in a single directory (specified with the -i argument), or their paths can be provided in a tab-delimited file using -I.

A basic run using four threads can be performed with either of the following commands:

kamino -i <input_dir> -t 4
kamino -I <tabular_file> -t 4

In directory mode, files are recognized by their extension (.fas, .fasta, .faa, .fa, .fna; gzipped or not).
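
To check in advance which files in a directory would match the extensions listed above, a small Python sketch along these lines can help. It is only an approximation of the rule described here: the function names are hypothetical, matching is done exactly as the extensions are written (lowercase), and kamino's own matching may differ in details such as case handling.

```python
from pathlib import Path

# Extensions listed in this README; a trailing ".gz" is accepted on top of them.
FASTA_EXTENSIONS = {".fas", ".fasta", ".faa", ".fa", ".fna"}


def is_proteome_file(path: Path) -> bool:
    """True if the file name ends with a listed extension, optionally followed by .gz."""
    suffixes = path.suffixes
    if suffixes and suffixes[-1] == ".gz":
        suffixes = suffixes[:-1]  # look at the extension underneath the .gz
    return bool(suffixes) and suffixes[-1] in FASTA_EXTENSIONS


def list_proteomes(input_dir: str) -> list[Path]:
    """Return the matching regular files of input_dir, sorted by path."""
    return sorted(
        p for p in Path(input_dir).iterdir()
        if p.is_file() and is_proteome_file(p)
    )
```

Calling list_proteomes on the directory intended for -i gives a quick view of the files that are expected to be picked up, one per sample.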

For bacterial isolates, the phylogenomic alignment can also be generated directly from genome assemblies by selecting the option --genomes (using either -i or -I). In this case, an ultra-fast but approximate protein prediction is performed, with predicted proteomes stored in a temporary directory.

kamino -i <input_dir> -t 4 --genomes

Finally, a neighbour-joining tree can be generated in addition to the other output files by selecting the option --NJ.

kamino -i <input_dir> -t 4 --NJ
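
When kamino is called from a script, the commands above can be assembled programmatically. The sketch below uses only the flags mentioned in this section (-i, -I, -t, --genomes, --NJ); the function name and its parameters are hypothetical, and combining --genomes with --NJ in one run is an assumption that has not been checked against the tool.

```python
def kamino_command(input_dir=None, table=None, threads=4, genomes=False, nj=False):
    """Build the argument list for a kamino run from the flags shown in this README."""
    # Exactly one input mode is expected: a directory (-i) or a tab-delimited file (-I).
    if (input_dir is None) == (table is None):
        raise ValueError("provide exactly one of input_dir (-i) or table (-I)")
    cmd = ["kamino"]
    if input_dir is not None:
        cmd += ["-i", str(input_dir)]
    else:
        cmd += ["-I", str(table)]
    cmd += ["-t", str(threads)]
    if genomes:
        cmd.append("--genomes")  # inputs are genome assemblies, not proteomes
    if nj:
        cmd.append("--NJ")  # ASSUMPTION: can be combined with the other flags
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True), which avoids shell quoting issues with unusual directory names.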

examples

All analyses were performed on a MacBook M4 Pro using v0.6.1 and 4 threads (other parameters set to default):

dataset	taxonomic diversity	runtime (min)	memory (GB)	alignment size (aa)
50 Mycobacterium	within-genus	0.1	0.7	24,678
400 Mycobacterium	within-genus	0.5	2.5	21,011
50 Polyporales (fungi)	within-order	0.3	3.8	29,483
165 Arthropoda	within-phylum	1.1	8.5	13,002
55 Mammalia	within-class	1.6	8.1	334,108

FAQ

When not to use kamino?
- low-diversity datasets (i.e., within-species), for which genome-based approaches will be more powerful
- highly divergent datasets (e.g., animal kingdom)
- distant outgroup composed of a few isolates: these might have disproportionately more missing data
- isolates with a low number of proteins (viruses and prokaryotes with fewer than 1,000 proteins)
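
To see whether some samples (for instance a distant outgroup) carry disproportionately more missing data, the fraction of missing positions per sample can be computed from the output alignment. The Python sketch below is a rough check under stated assumptions: the alignment is read as FASTA text, and '-', '?' and 'X' are treated as missing; the symbols kamino actually writes for missing data may differ, and both function names are hypothetical.

```python
# ASSUMPTION: these characters denote missing data in the alignment.
MISSING = set("-?Xx")


def read_fasta(text: str) -> dict[str, str]:
    """Parse FASTA text into {sample name: sequence}; the name is the first header word."""
    records: dict[str, list[str]] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(chunks) for n, chunks in records.items()}


def missing_fraction(alignment: dict[str, str]) -> dict[str, float]:
    """Return, for each sample, the fraction of alignment positions that are missing."""
    return {
        n: (sum(c in MISSING for c in s) / len(s) if s else 0.0)
        for n, s in alignment.items()
    }
```

Samples whose fraction stands well above the rest are candidates for removal before tree inference, or a sign that the dataset is too divergent for this approach.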

Is the output reproducible?

How to get more phylogenetic positions?

This codebase is provided under the MIT License. Some parts of the code were drafted using AI assistance.