kamino-cli 0.5.1

From the Spanish word for path.

Builds an amino-acid alignment in a reference-free, alignment-free manner from a set of proteomes.
Not ‘better’ than traditional marker-based pipelines, but simpler and faster to run.

Typical use cases range from between-species to within-phylum phylogenetic analyses (bacteria, archaea and eukaryotes).

under the hood

kamino performs the following successive steps:

lists proteome files from the input directory (-i or -I)
recodes proteins with a 6-letter recoding scheme (-r)
simplifies proteomes by discarding out-branching k-mers
builds a global assembly graph and identifies variant groups as described here (-d)
converts variant group paths back to amino acids using a sliding window
masks long polymorphism runs within variant groups (-m)
filters variant groups by missing data and middle-length thresholds (-f and -l)
extracts middle positions and incorporates 'constant' positions (-c)
outputs the final amino acid alignment (-o)
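
To make the recoding step (-r) more concrete, here is a minimal Python sketch of a 6-letter recoding followed by k-mer extraction. It assumes the widely used Dayhoff-6 groupings; the scheme kamino actually applies, and how it treats ambiguous residues, may differ. The names DAYHOFF6, recode and kmers are illustrative and are not part of kamino.

```python
# Dayhoff-6 groups (an assumption: kamino's own recoding scheme may differ).
DAYHOFF6 = {
    "AGPST": "0",
    "DENQ": "1",
    "HKR": "2",
    "ILMV": "3",
    "FWY": "4",
    "C": "5",
}

# Symbol used for residues that belong to no group (e.g. B, Z, X, *).
UNKNOWN = "X"

# Flatten the groups into a residue -> class lookup table.
RESIDUE_TO_CLASS = {aa: code for group, code in DAYHOFF6.items() for aa in group}


def recode(protein: str) -> str:
    """Recode an amino-acid string into the reduced 6-letter alphabet."""
    return "".join(RESIDUE_TO_CLASS.get(aa, UNKNOWN) for aa in protein.upper())


def kmers(seq: str, k: int) -> list[str]:
    """Return the overlapping k-mers of seq, skipping those with an unknown symbol."""
    windows = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return [w for w in windows if UNKNOWN not in w]


if __name__ == "__main__":
    recoded = recode("MKVLAAGC")
    print(recoded)
    print(kmers(recoded, 4))
```

Recoding collapses residues with similar properties into one symbol, so k-mers from related proteins are more likely to match exactly even when the underlying amino acids differ.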

installation

You can either compile the code locally using rustc, or install a precompiled binary from Bioconda:

conda install bioconda::kamino

running kamino

Input consists of proteome files in FASTA format (gzipped or not), with one file per sample. Files can be placed in a single directory (specified with the -i argument), or their paths can be provided in a tab-delimited file using -I.
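
Before launching a run, it can help to confirm that every file in the input directory is a readable FASTA file. The Python sketch below counts the records in each file, plain or gzipped. It is a hypothetical pre-flight check, not part of kamino: the function names are illustrative, and gzip is detected from the file's magic bytes because the file extensions kamino expects are not stated here.

```python
import gzip
import tempfile
from pathlib import Path


def open_text(path: Path):
    """Open a plain or gzipped file as text, detecting gzip by its magic bytes."""
    with open(path, "rb") as handle:
        is_gzip = handle.read(2) == b"\x1f\x8b"
    return gzip.open(path, "rt") if is_gzip else open(path, "rt")


def count_records(path: Path) -> int:
    """Count FASTA records, i.e. lines starting with '>'."""
    with open_text(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def summarize(input_dir: str) -> dict[str, int]:
    """Map each regular file in input_dir to its number of FASTA records."""
    return {
        p.name: count_records(p)
        for p in sorted(Path(input_dir).iterdir())
        if p.is_file()
    }


if __name__ == "__main__":
    # Toy demonstration on two tiny made-up proteomes, one per sample.
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "sample_a.faa").write_text(">p1\nMKV\n>p2\nLAG\n")
        with gzip.open(Path(tmp, "sample_b.faa.gz"), "wt") as fh:
            fh.write(">p1\nMKV\n")
        for name, n in summarize(tmp).items():
            print(f"{name}\t{n}")
```

A file reporting zero records is probably not a FASTA file and is worth inspecting before it is passed to -i or listed in the -I file.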

A basic run using four threads can be performed with either of the following commands:

kamino -i <input_dir> -t 4
kamino -I <tabular_file> -t 4

examples

All analyses were performed on a MacBook "M4 Pro" using v0.4.0 and 4 threads (other parameters set to default unless specified):

dataset	taxonomic diversity	runtime (min)	memory (GB)	alignment size (aa)
50 Mycobacterium	within-genus	0.1	2	19,283
400 Mycobacterium	within-genus	0.9	8	13,753
50 Polyporales (fungi)	within-order	0.5	8	21,808
46 Drosophila	within-genus	0.7	7	194,021
55 Mammalia	within-class	1.6	14	291,437
55 Mammalia `-k 13`	within-class	1.9	8	191,962

FAQ

When not to use kamino?
- low-diversity datasets (i.e., within-species), for which genome-based approaches will be more powerful
- very large datasets (e.g., thousands of bacterial proteomes or hundreds of vertebrate proteomes)
- very divergent datasets (e.g., the whole animal kingdom)
- datasets with a distant outgroup composed of only a few isolates, which may carry disproportionately more missing data
- list to be completed ...
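
To check whether an outgroup carries disproportionately more missing data, one option is to measure the fraction of missing positions per sample in the final alignment. The Python sketch below is illustrative only: it assumes a FASTA-formatted alignment and treats '-', 'X' and '?' as missing, which may not match how kamino encodes missing data. The function names are hypothetical.

```python
def read_fasta(text: str) -> dict[str, str]:
    """Parse FASTA text into a {name: sequence} dictionary."""
    parts: dict[str, list[str]] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


def missing_fraction(alignment: dict[str, str], missing: str = "-X?") -> dict[str, float]:
    """Return, for each sample, the fraction of positions that are missing."""
    return {
        name: (sum(c in missing for c in seq.upper()) / len(seq)) if seq else 0.0
        for name, seq in alignment.items()
    }


if __name__ == "__main__":
    # Made-up four-column alignment with one ingroup and one outgroup sample.
    toy = ">ingroup_1\nMKVA\n>outgroup\nM--A\n"
    for name, frac in missing_fraction(read_fasta(toy)).items():
        print(f"{name}\t{frac:.2f}")
```

Samples whose fraction stands well above the rest are candidates for removal, or a sign that the dataset is too divergent for this approach.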

Is the output reproducible?

How to get more phylogenetic positions?

This codebase is provided under the MIT License. Some parts of the code were drafted using AI assistance.