kamino-cli 0.3.0

[![Cargo Build & Test](https://github.com/rderelle/kamino/actions/workflows/ci.yml/badge.svg)](https://github.com/rderelle/kamino/actions/workflows/ci.yml)
[![Clippy check](https://github.com/rderelle/kamino/actions/workflows/clippy.yml/badge.svg)](https://github.com/rderelle/kamino/actions/workflows/clippy.yml)
[![codecov](https://codecov.io/github/rderelle/kamino/graph/badge.svg?token=6B8WIGZL2F)](https://codecov.io/github/rderelle/kamino)
[![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](https://bioconda.github.io/recipes/kamino/README.html)

<br><br>
<p align="center">
  <img src="logo_kamino.svg" alt="kamino logo" width="400">
</p>

<br><br>

From the Spanish word for *path*.  

Builds an amino-acid alignment in a reference-free, alignment-free manner from a set of proteomes.  
Not ‘better’ than traditional marker-based pipelines, but simpler and faster to run.  
  
Typical usages range from between-species to within-phylum phylogenetic analyses (bacteria, archaea and eukaryotes).

<br>

---
## under the hood
kamino performs the following successive steps:
- lists proteome files from the input directory (-i or -I)
- recodes proteins with a 6-letters recoding scheme (-r)
- simplifies proteomes by discarding out-branching k-mers 
- builds a global assembly graph and identifies variant groups as described <a href="https://academic.oup.com/mbe/article/42/4/msaf077/8103706">here</a> (-d)
- converts variant group paths back to amino acids using a sliding window
- mask long polymorphism runs within variant groups (-m)
- filters variant groups by missing data and middle-length thresholds (-m and -l)
- extracts middle positions and incorporate 'constant' positions (-c)
- outputs the final amino acid alignment (-o)

---

## installation
You can either compile the code locally using rustc, or install a precompiled binary from Bioconda:

```bash
conda install bioconda::kamino
```

---

## running kamino
Input consists of proteome files in FASTA format (gzipped or not), with one file per sample. Files can be placed in a single directory (specified with the -i argument), or their paths can be provided in a tab-delimited file using -I.

A basic run using four threads can be performed with either of the following commands:
```bash
kamino -i <input_dir> -t 4
kamino -I <tabular_file> -t 4
```
---

## examples

All analyses were performed on a MacBook M4 pro using 4 threads (other parameters set to default):  

| dataset                     | taxonomic diversity  | runtime (min) | memory (GB) | alignment size (aa) |
|-----------------------------|----------------------|---------------|-------------|---------------------|
| 50 *Mycobacterium*          | within-genera        | 0.1           | 2           | 16.088              |
| 400 *Mycobacterium*         | within-genera        | 0.9           | 8           | 11.745              |
| 50 Polyporales (fungi)      | within-order         | 0.5           | 7           | 17.512              |
| 46 *Drosophila*             | within-genera        | 0.7           | 5           | 196.212             |
| 55 Mammalia                 | within-class         | 1.5           | 8           | 328.205             |  


And using the 400 Mycobacterium dataset to examine how parameter choices affect the analyses (still with 4 threads):

| parameters                | runtime (min) | memory (GB) | alignment size (aa) |
|---------------------------|---------------|-------------|---------------------|
| [default: k=13, d=4]      | 0.9           | 8           | 11.745              |
| --k 14                    | 0.9           | 9           | 17.515              |
| --depth 5                 | 1             | 8           | 17.003              |
| --k 14 --depth 5          | 1             | 9           | 23.706              |


---

## FAQ

- **When not to use kamino?**
    * low diversity datasets (ie, within-species), for which genome-based approaches will be more powerful 
    * very large datasets (eg, thousands of bacterial proteomes or hundreds of vertebrate proteomes)
    * very divergent datasets (eg, animal kingdom)
    * distant outgroup composed of a few isolates: these might have disproportionately more missing data
    * list to be completed ...

- **Is the output reproducible?**
<p>Yes, kamino is fully deterministic so will produce the exact same alignment for a given a specific version, a set of parameters and input proteomes.</p>

- **How to get more phylogenetic positions?**
    * increase the k-mer size (-k), but can substantially raise memory usage
    * increase the maximum depth of the graph traversal (-d), but increases the runtime
    * lower the minimum proportion of isolates per position (-m), if that is acceptable for downstream analyses

 

---

This codebase is provided under the MIT License. Some parts of the code were drafted using AI assistance.