kamino-cli 0.8.1

Build phylogenomic datasets in seconds.

From the Spanish word for path (not a Star Wars planet!).

Builds an amino-acid alignment in a reference-free, alignment-free manner from a set of proteomes.
Not ‘better’ than traditional marker-based pipelines, but simpler and faster to run.

Typical usages range from between-species to within-family (bacteria and archaea) or within-phylum (eukaryotes) phylogenetic analyses.


under the hood

kamino performs the following successive steps:

  • lists proteome files from the input directory (-i or -I)
  • recodes proteins with a 6-letter recoding scheme (-r)
  • simplifies proteomes by pre-filtering proteins and discarding out-branching k-mers
  • builds a global assembly graph and identifies variant groups as described here (-d)
  • converts variant group paths back to amino acids using a sliding window
  • masks long polymorphism runs within variant groups (-m)
  • filters variant groups by missing data and middle-length thresholds (-f and -l)
  • extracts middle positions and incorporates 'constant' positions (-c)
  • outputs the final amino acid alignment (-o)
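
As an illustration of the recoding step, the sketch below collapses the 20 amino acids into six classes using the Dayhoff grouping. This is an assumption made for the example: the schemes kamino accepts through -r are not listed here, and the names DAYHOFF6 and recode are hypothetical, not part of kamino.

```python
# Dayhoff-style 6-class grouping, used only as an illustration;
# the scheme(s) kamino offers via -r may differ.
DAYHOFF6 = {
    "AGPST": "A",
    "C": "C",
    "DENQ": "D",
    "FWY": "F",
    "HKR": "H",
    "ILMV": "I",
}

# Flatten the groups into a per-residue lookup table.
RECODE = {aa: code for group, code in DAYHOFF6.items() for aa in group}


def recode(protein: str, unknown: str = "X") -> str:
    """Map each residue to its class letter; anything else becomes `unknown`."""
    return "".join(RECODE.get(aa, unknown) for aa in protein.upper())


if __name__ == "__main__":
    for seq in ("MKTAYIAK", "LRSAYIAK"):
        print(seq, "->", recode(seq))
```

Sequences that differ only by substitutions within a class (M/L, K/R and T/S in the two inputs above) recode to the same string, which is the usual motivation for recoding before comparing k-mers across divergent proteomes.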

installation

You can either compile the code locally using rustc, or install a precompiled binary from Bioconda:

conda install bioconda::kamino

running kamino

Input consists of proteome files in FASTA format (gzipped or not), with one file per sample. Files can be placed in a single directory (specified with the -i argument), or their paths can be provided in a tab-delimited file using -I.
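
Because the input is one FASTA file per sample, a quick pre-flight check can count the proteins in each file before running kamino; the FAQ notes that samples with fewer than 1,000 proteins are poor candidates. The sketch below is a hypothetical helper, not part of kamino, and it assumes gzipped files end in .gz.

```python
import gzip
from pathlib import Path


def count_proteins(fasta_path):
    """Count FASTA records (lines starting with '>') in a plain or gzipped file."""
    path = Path(fasta_path)
    # Assumption: gzipped input is identified by a trailing .gz suffix.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))


def small_proteomes(paths, min_proteins=1000):
    """Return {path: protein count} for files below `min_proteins`."""
    flagged = {}
    for path in paths:
        n = count_proteins(path)
        if n < min_proteins:
            flagged[str(path)] = n
    return flagged
```

Calling small_proteomes on the list of input files returns only the samples that fall under the threshold, so an empty dictionary means every proteome passed.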

A basic run using four threads can be performed with either of the following commands:

kamino -i <input_dir> -t 4
kamino -I <tabular_file> -t 4

In directory mode, files are recognized by their extension (.fas, .fasta, .faa, .fa, .fna; gzipped or not).
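
A minimal sketch of how extension-based recognition could work, assuming gzipped files carry a trailing .gz and that matching is case-insensitive (neither detail is confirmed above). The function name is hypothetical and this is not kamino's own code.

```python
from pathlib import Path

# Extensions listed in the text above.
FASTA_EXTENSIONS = {".fas", ".fasta", ".faa", ".fa", ".fna"}


def is_recognized_fasta(path) -> bool:
    """True if the file name ends in a listed extension, optionally followed by .gz."""
    suffixes = [s.lower() for s in Path(path).suffixes]
    # Assumption: compression is signalled by a final .gz suffix.
    if suffixes and suffixes[-1] == ".gz":
        suffixes = suffixes[:-1]
    return bool(suffixes) and suffixes[-1] in FASTA_EXTENSIONS


if __name__ == "__main__":
    for name in ("sample1.faa", "sample2.fasta.gz", "notes.txt", "archive.gz"):
        print(name, is_recognized_fasta(name))
```

Under these assumptions, a file such as notes.txt or a bare archive.gz would be skipped, while sample2.fasta.gz would be picked up.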

For bacterial isolates, the phylogenomic alignment can also be generated directly from genome assemblies by selecting the option --genomes (using either -i or -I). In this case, an ultra-fast but approximate protein prediction is performed, with predicted proteomes stored in a temporary directory.

kamino -i <input_dir> -t 4 --genomes

Finally, a neighbour-joining tree can be generated in addition to the other output files by selecting the option --NJ.

kamino -i <input_dir> -t 4 --NJ

examples

All analyses were performed on a MacBook M4 Pro using v0.6.1 and 4 threads (other parameters set to default):

dataset                   taxonomic diversity   runtime (min)   memory (GB)   alignment size (aa)
50 Mycobacterium          within-genus          0.1             0.7           24,678
400 Mycobacterium         within-genus          0.5             2.5           21,011
50 Polyporales (fungi)    within-order          0.3             3.8           29,483
165 Arthropoda            within-phylum         1.1             8.5           13,002
55 Mammalia               within-class          1.6             8.1           334,108

FAQ

  • When not to use kamino?

    • low-diversity datasets (i.e., within-species), for which genome-based approaches will be more powerful
    • highly divergent datasets (e.g., animal kingdom)
    • distant outgroup composed of a few isolates: these might have disproportionately more missing data
    • isolates with a low number of proteins (viruses and prokaryotes with fewer than 1,000 proteins)
  • Is the output reproducible?

  • How to get more phylogenetic positions?

This codebase is provided under the MIT License. Some parts of the code were drafted using AI assistance.