§kamino
kamino builds an amino-acid alignment in a reference-free, alignment-free manner from a set of proteomes. It is not “better” than traditional marker-based pipelines, but it is simpler and faster to use.
Typical usage ranges from between-species to within-phylum phylogenetic analyses (bacteria, archaea, and eukaryotes).
§Input modes
kamino accepts proteome files as input in one of two modes:
- Directory mode (--input-directory): a directory containing FASTA proteomes (plain text or .gz compressed). Each file represents one isolate. Filenames minus the extension become sequence names in the final amino-acid alignment.
- Table mode (--input-file): a tab-delimited file mapping a species/sample name to a proteome path (one name + path pair per line). This is useful when file names do not encode the sample name or when proteomes are located in multiple directories.
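To make the table-mode layout concrete, here is a minimal sketch of a reader for such a file. The function name `parse_table`, the error type, and the decision to skip blank lines are assumptions made for illustration; kamino's own parser may behave differently.

```rust
use std::path::PathBuf;

/// Hypothetical reader for a table-mode file: one `name<TAB>path` pair per line.
/// This is an illustration, not kamino's actual parser.
fn parse_table(contents: &str) -> Result<Vec<(String, PathBuf)>, String> {
    let mut entries = Vec::new();
    for (idx, line) in contents.lines().enumerate() {
        // Assumption of this sketch: blank lines are ignored.
        if line.trim().is_empty() {
            continue;
        }
        // Split on the first tab: sample name on the left, proteome path on the right.
        match line.split_once('\t') {
            Some((name, path)) if !name.trim().is_empty() && !path.trim().is_empty() => {
                entries.push((name.trim().to_string(), PathBuf::from(path.trim())));
            }
            _ => return Err(format!("line {}: expected <name><TAB><path>", idx + 1)),
        }
    }
    Ok(entries)
}
```

A two-line table such as `sampleA<TAB>/data/run1/a.faa.gz` followed by `sampleB<TAB>/data/run2/b.fasta` would yield two (name, path) pairs, which is what lets proteomes live in different directories.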
In directory mode, files are recognized by their extension (.fas, .fasta, .faa, .fa, .fna; gzipped or not).
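The naming rule for directory mode (filename minus the extension) can be sketched as follows. The helper `sample_name` is hypothetical, and details such as case-sensitive extension matching and keeping inner dots in the name are assumptions, not documented kamino behavior.

```rust
use std::path::Path;

/// Extensions listed above for directory mode.
const FASTA_EXTENSIONS: [&str; 5] = ["fas", "fasta", "faa", "fa", "fna"];

/// Hypothetical helper: derive a sample name from a proteome file name.
/// Returns `None` when the extension is not one of the recognized ones.
fn sample_name(path: &Path) -> Option<String> {
    let file_name = path.file_name()?.to_str()?;
    // Drop an optional trailing ".gz" first.
    let without_gz = file_name.strip_suffix(".gz").unwrap_or(file_name);
    // Split on the last dot: stem on the left, extension on the right.
    let (stem, ext) = without_gz.rsplit_once('.')?;
    if !stem.is_empty() && FASTA_EXTENSIONS.contains(&ext) {
        Some(stem.to_string())
    } else {
        None
    }
}
```

Under these assumptions, `data/ecoli_K12.faa.gz` maps to `ecoli_K12`, while a file such as `notes.txt` is ignored.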
For bacterial isolates, the phylogenomic alignment can also be generated directly from genome assemblies
by selecting the option --genomes (using either -i or -I). In this case, an ultra-fast but approximate
protein prediction is performed, and the predicted proteomes are written to a temporary directory.
§Arguments
- -i, --input-directory: input directory with FASTA proteomes (plain or .gz)
- -I, --input-file <INPUT_FILE>: tab-delimited file mapping species name to proteome path
- --genomes: treat input files as bacterial genomes and predict proteomes before analysis
- -k, --k: k-mer length (default: 14; must be within the valid range for the selected recoding scheme).
- -f, --min-freq: minimum fraction of samples with an amino-acid per position (default: 0.85; must be ≥ 0.6).
- -d, --depth: maximum traversal depth from each start node (default: 8).
- -o, --output: output prefix for generated files (default: kamino).
- -c, --constant: number of constant positions retained from in-bubble k-mers (default: 3; must be ≤ k-1).
- -l, --length-middle: maximum number of middle positions per variant group (default: 35; must be ≥ 1).
- -m, --mask: mask middle segments with long mismatch runs (default: 5).
- -t, --threads: number of threads used for graph construction and analysis (default: 1).
- -r, --recode: amino-acid recoding scheme (default: sr6).
- --nj: generate an NJ tree from the kamino alignment (default: false).
- -v, --version: print version information and exit.
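The numeric constraints listed above can be summarized in a small validation sketch. The `Params` struct, its field names, and the `validate` function are hypothetical and cover only the bounds stated in the list. The upper bound of 1.0 on the minimum frequency is an assumption, and the valid range of k per recoding scheme is not documented here, so it is left out.

```rust
/// Hypothetical subset of the parameters; field names are illustrative only.
struct Params {
    k: usize,
    min_freq: f64,
    constant: usize,
    length_middle: usize,
}

impl Default for Params {
    // Defaults copied from the argument list above.
    fn default() -> Self {
        Params { k: 14, min_freq: 0.85, constant: 3, length_middle: 35 }
    }
}

/// Check the documented bounds: min-freq >= 0.6, constant <= k-1, length-middle >= 1.
fn validate(p: &Params) -> Result<(), String> {
    // The upper bound of 1.0 is an assumption (a fraction of samples).
    if p.min_freq < 0.6 || p.min_freq > 1.0 {
        return Err(format!("--min-freq must be in [0.6, 1.0], got {}", p.min_freq));
    }
    if p.k == 0 || p.constant > p.k - 1 {
        return Err(format!("--constant must be <= k-1 (k = {})", p.k));
    }
    if p.length_middle < 1 {
        return Err("--length-middle must be >= 1".to_string());
    }
    Ok(())
}
```

With the defaults, `validate(&Params::default())` succeeds, while `-f 0.5` or `-c 14` with `-k 14` would be rejected under these rules.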
§Important things to optimize
The main parameters governing the number of phylogenetic positions in the final alignment are the k-mer size (-k), the depth of the recursive graph traversal (-d), and the minimum sample frequency (-f).
The default k-mer size has already been chosen to maximise the final alignment length, and increasing it usually does not substantially increase the number of variant groups. It may, however, be useful to decrease the k-mer size from 14 to 13 if memory consumption is too high.
Increasing the depth of the recursive graph traversal (e.g. from 8 to 10) generally increases the size of the final alignment, as kamino detects more variant groups during graph traversal. This is typically the most effective approach if the alignment is deemed too short.
Finally, larger alignments can also be produced by decreasing the minimum fraction of samples required to carry an amino acid (e.g. from 0.85 to 0.8), at the cost of increased missing data in the final alignment. Missing data are represented by ‘-’ (missing amino acid) and ‘X’ (ambiguous or masked amino acid).
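When comparing runs with different -f values, it can help to quantify missing data per sequence. The sketch below counts '-' and 'X' as missing, following the description above. The function name is hypothetical, and the exact formula kamino uses for its missingness report is not stated here, so treat this as an approximation.

```rust
/// Hypothetical helper: percentage of positions in an aligned sequence that are
/// missing ('-') or ambiguous/masked ('X'). Only uppercase 'X' is counted,
/// which is an assumption of this sketch.
fn missing_percent(seq: &str) -> f64 {
    let total = seq.chars().count();
    if total == 0 {
        return 0.0;
    }
    let missing = seq.chars().filter(|&c| c == '-' || c == 'X').count();
    100.0 * missing as f64 / total as f64
}
```

For instance, a four-column sequence with one '-' and one 'X' has 50% missing data under this definition.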
§Less important parameters
Outside of testing or benchmarking, I would not recommend modifying these parameter values.
The number of constant positions in the final alignment can be adjusted with the --constant parameter. These are taken from the left flank of the end k-mer in each variant group, next to the middle positions. Because these positions are recoded, some may become polymorphic once converted back to amino acids. Using the default c=3, constant positions represent 50 to 60% of the alignment.
The --mask parameter controls the amino-acid masking performed by kamino to prevent long runs of polymorphism from being
retained in the final alignment. These correspond to genuine but unwanted polymorphisms (e.g., micro-inversions) or,
less frequently, errors made by kamino (“misaligned” paths due to the presence of two consecutive indels). Decreasing this
parameter lowers the minimum run length that gets masked, which makes the masking more stringent.
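As a rough illustration of run-based masking, the sketch below replaces every run of at least `min_run` consecutive mismatches between a sequence and a reference string with 'X'. How kamino defines a mismatch, and what each sequence is compared against, is not described here. The comparison against a single reference and the function name are assumptions made purely for this example.

```rust
/// Hypothetical sketch: mask runs of >= `min_run` consecutive mismatches with 'X'.
/// `reference` stands in for whatever kamino compares against (an assumption).
fn mask_mismatch_runs(seq: &str, reference: &str, min_run: usize) -> String {
    let s: Vec<char> = seq.chars().collect();
    let r: Vec<char> = reference.chars().collect();
    let n = s.len().min(r.len());
    let mut out = s.clone();
    let mut i = 0;
    while i < n {
        if s[i] != r[i] {
            // Extend the run of consecutive mismatches starting at `start`.
            let start = i;
            while i < n && s[i] != r[i] {
                i += 1;
            }
            // Mask the whole run only if it is long enough.
            if i - start >= min_run {
                for c in &mut out[start..i] {
                    *c = 'X';
                }
            }
        } else {
            i += 1;
        }
    }
    out.into_iter().collect()
}
```

With `min_run = 5`, an isolated substitution is kept, whereas five substitutions in a row are replaced by `XXXXX`. A lower threshold masks shorter runs.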
The --length-middle parameter is used to filter out long variant groups. Increase this parameter to allow longer
variant groups to be retained in the final alignment.
Finally, the 6-letter recoding scheme can be modified using the --recode parameter, although the default sr6 recoding scheme performed best in most of my tests (sr6 ≥ dayhoff6 ≫ kgb6).
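For readers unfamiliar with 6-letter recoding, the sketch below maps amino acids to six classes using the commonly cited Dayhoff grouping (AGPST, C, DENQ, FWY, HKR, ILMV). The class indices 0 to 5 are arbitrary, and the function names are hypothetical. kamino's internal representation is not documented here and may differ, and the sr6 and kgb6 groupings are not reproduced.

```rust
/// Map an amino acid to one of six Dayhoff classes (indices are arbitrary).
/// Returns `None` for gaps, 'X', and any other unrecognized symbol.
fn dayhoff6(aa: char) -> Option<u8> {
    match aa.to_ascii_uppercase() {
        'A' | 'G' | 'P' | 'S' | 'T' => Some(0),
        'C' => Some(1),
        'D' | 'E' | 'N' | 'Q' => Some(2),
        'F' | 'W' | 'Y' => Some(3),
        'H' | 'K' | 'R' => Some(4),
        'I' | 'L' | 'M' | 'V' => Some(5),
        _ => None,
    }
}

/// Recode a whole sequence, position by position.
fn recode(seq: &str) -> Vec<Option<u8>> {
    seq.chars().map(dayhoff6).collect()
}
```

Because several amino acids share a class (for example I, L, M, and V), a position that is constant in the recoded alphabet can still vary once converted back to amino acids.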
§Output files
The names of the output files are controlled by a prefix (-o; default: kamino). The prefix
may include a directory path (e.g. -o my_analyses/taxon1). Note that the output directory is not
created by kamino and must already exist.
The three output files are:
- <prefix>_alignment.fas: FASTA amino acid alignment of all samples.
- <prefix>_missing.tsv: Tab-delimited per-sample missingness percentages.
- <prefix>_partitions.tsv: Tab-delimited variant group coordinates (0-based) in the FASTA alignment, along with consensus protein names when the input proteomes are annotated.
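Since the output directory must exist before kamino runs, a wrapper script or pipeline step may want to check this up front. The sketch below derives the three file names from a prefix and tests whether the directory part of the prefix exists. Both function names are hypothetical and not part of kamino.

```rust
use std::path::{Path, PathBuf};

/// Build the three output paths from a prefix, using the suffixes listed above.
fn output_paths(prefix: &str) -> [PathBuf; 3] {
    ["_alignment.fas", "_missing.tsv", "_partitions.tsv"]
        .map(|suffix| PathBuf::from(format!("{prefix}{suffix}")))
}

/// Return true when the directory component of the prefix already exists.
/// A prefix without a directory component refers to the current directory.
fn output_dir_exists(prefix: &str) -> bool {
    match Path::new(prefix).parent() {
        Some(dir) if !dir.as_os_str().is_empty() => dir.is_dir(),
        _ => true,
    }
}
```

With `-o my_analyses/taxon1`, the alignment would be written to `my_analyses/taxon1_alignment.fas`, and the check fails until `my_analyses` has been created (for example with `mkdir -p`).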
Additionally, a Neighbor-Joining (NJ) tree can be produced from the amino acid alignment
when the --nj argument is specified. Pairwise distances are computed using an F81
correction with LG stationary amino-acid frequencies. The resulting tree provides an
overview of isolate relationships and is not intended for detailed phylogenetic inference.
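The F81-style correction can be sketched with the textbook formula d = -B ln(1 - p/B), where p is the proportion of differing sites and B = 1 - Σ πᵢ² is computed from the stationary frequencies πᵢ. The code below is an illustration under assumptions: the function names are hypothetical, sites with '-' or 'X' in either sequence are skipped, and the frequencies are passed in by the caller (the LG values are not reproduced here). kamino's actual implementation may differ in these details.

```rust
/// Treat gaps and ambiguous/masked residues as missing (assumption of this sketch).
fn is_missing(c: char) -> bool {
    c == '-' || c == 'X'
}

/// F81-style distance between two aligned sequences.
/// `freqs` holds the stationary amino-acid frequencies (should sum to 1).
/// Returns `None` when no site can be compared or the distance is saturated.
fn f81_distance(a: &str, b: &str, freqs: &[f64]) -> Option<f64> {
    let mut compared = 0usize;
    let mut diffs = 0usize;
    for (x, y) in a.chars().zip(b.chars()) {
        if is_missing(x) || is_missing(y) {
            continue;
        }
        compared += 1;
        if x != y {
            diffs += 1;
        }
    }
    if compared == 0 {
        return None;
    }
    let p = diffs as f64 / compared as f64;
    // B = 1 - sum of squared frequencies.
    let b_const = 1.0 - freqs.iter().map(|f| f * f).sum::<f64>();
    if p >= b_const {
        return None; // the logarithm is undefined: distance is saturated
    }
    Some(-b_const * (1.0 - p / b_const).ln())
}
```

With uniform frequencies (1/20 each, used here only as a placeholder), B = 0.95, and the corrected distance is always at least as large as the raw proportion of differences p.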
Structs§
- Args: Build a node-based, colored de Bruijn graph from amino-acid proteomes and analyze bubbles.