kamino-cli 0.5.1

Build phylogenomic datasets in seconds.

From the Spanish word for path.

Builds an amino-acid alignment in a reference-free, alignment-free manner from a set of proteomes.
It is not ‘better’ than traditional marker-based pipelines, but it is simpler and faster to run.

Typical use cases range from between-species to within-phylum phylogenetic analyses (bacteria, archaea and eukaryotes).


under the hood

kamino performs the following successive steps:

  • lists proteome files from the input directory (-i or -I)
  • recodes proteins with a 6-letter recoding scheme (-r)
  • simplifies proteomes by discarding out-branching k-mers
  • builds a global assembly graph and identifies variant groups as described here (-d)
  • converts variant group paths back to amino acids using a sliding window
  • masks long polymorphism runs within variant groups (-m)
  • filters variant groups by missing data and middle-length thresholds (-f and -l)
  • extracts middle positions and incorporates 'constant' positions (-c)
  • outputs the final amino acid alignment (-o)
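
To make the recoding and k-mer steps above more concrete, here is a minimal Python sketch. It is not kamino's code: the Dayhoff six-class grouping is an assumption chosen for illustration (the schemes actually available through -r may differ), and the names `recode` and `kmers` are hypothetical.

```python
# Illustrative only: Dayhoff six-class recoding is an ASSUMPTION here;
# kamino's -r option may use a different grouping of amino acids.
DAYHOFF6_CLASSES = ["AGPST", "DENQ", "HKR", "ILMV", "FWY", "C"]

# Map every amino acid to the index of its class, e.g. "M" -> "3".
RECODE = {
    aa: str(index)
    for index, group in enumerate(DAYHOFF6_CLASSES)
    for aa in group
}


def recode(protein: str) -> str:
    """Replace each residue by its class label; unknown residues become 'X'."""
    return "".join(RECODE.get(aa, "X") for aa in protein.upper())


def kmers(sequence: str, k: int) -> list[str]:
    """Return all overlapping k-mers of a (recoded) sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


if __name__ == "__main__":
    # Two different peptides collapse to the same recoded string.
    print(recode("MKV"), recode("MRI"))
    print(kmers(recode("ACDW"), 3))
```

With this grouping, `MKV` and `MRI` both recode to `323`. A reduced alphabet works this way in general: conservative substitutions disappear, so k-mers are shared across more divergent proteomes.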

installation

You can either compile the code locally using rustc, or install a precompiled binary from Bioconda:

conda install bioconda::kamino

running kamino

Input consists of proteome files in FASTA format (gzipped or not), with one file per sample. Files can be placed in a single directory (specified with the -i argument), or their paths can be provided in a tab-delimited file using -I.
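
Before launching a run, it can help to check that every file in the input directory is a readable FASTA file, gzipped or not. The Python sketch below is a hypothetical helper, not part of kamino. It assumes that gzipped files end in `.gz` and that each sequence header starts with `>`. The names `open_fasta`, `count_sequences` and `summarize` are illustrative.

```python
import gzip
from pathlib import Path


def open_fasta(path):
    """Open a FASTA file as text, transparently handling .gz files."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "rt")


def count_sequences(path) -> int:
    """Count FASTA records by counting header lines (those starting with '>')."""
    with open_fasta(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def summarize(input_dir) -> dict[str, int]:
    """Map each regular file in input_dir to its number of protein sequences."""
    return {
        path.name: count_sequences(path)
        for path in sorted(Path(input_dir).iterdir())
        if path.is_file()
    }
```

A file reporting zero sequences is probably empty or not in FASTA format, and is worth inspecting before it is passed to kamino.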

A basic run using four threads can be performed with either of the following commands:

kamino -i <input_dir> -t 4
kamino -I <tabular_file> -t 4

examples

All analyses were performed on a MacBook "M4 Pro" using v0.4.0 and 4 threads (other parameters set to default unless specified):

dataset                 taxonomic diversity  runtime (min)  memory (GB)  alignment size (aa)
50 Mycobacterium        within-genera        0.1            2            19,283
400 Mycobacterium       within-genera        0.9            8            13,753
50 Polyporales (fungi)  within-order         0.5            8            21,808
46 Drosophila           within-genera        0.7            7            194,021
55 Mammalia             within-class         1.6            14           291,437
55 Mammalia -k 13       within-class         1.9            8            191,962

FAQ

  • When not to use kamino?

    • low-diversity datasets (i.e., within-species), for which genome-based approaches will be more powerful
    • very large datasets (e.g., thousands of bacterial proteomes or hundreds of vertebrate proteomes)
    • very divergent datasets (e.g., animal kingdom)
    • distant outgroup composed of a few isolates: these might have disproportionately more missing data
    • list to be completed ...
  • Is the output reproducible?

  • How to get more phylogenetic positions?

This codebase is provided under the MIT License. Some parts of the code were drafted using AI assistance.