Skip to main content

kamino_cli/
lib.rs

1//! # kamino
2//!
3//! kamino builds an amino-acid alignment in a reference-free, alignment-free manner from
4//! a set of proteomes. It is not “better” than traditional marker-based pipelines, but it is
5//! simpler and faster to use.
6//!
7//! Typical usage ranges from between-species to within-phylum phylogenetic analyses (bacteria,
8//! archaea, and eukaryotes).
9//!
10//! ## Input modes
11//! kamino accepts proteome files as input in one of two modes:
12//! - **Directory mode** (`--input-directory`): a directory containing FASTA proteomes
13//!   (plain text or `.gz` compressed). Each file represents one isolate. Filenames minus the
14//!   extension become sequence names in the final amino-acid alignment.
15//! - **Table mode** (`--input-file`): a tab-delimited file mapping a species/sample name
16//!   to a proteome path (one name + path pair per line). This is useful when file names
17//!   do not encode the sample name or when proteomes are located in multiple directories.
18//!
19//! In the directory mode, files are recognized by their extension (.fas, .fasta, .faa, .fa, .fna; gzipped ot not).
20//!
21//! For **bacterial** isolates, the phylogenomic alignment can also be generated directly from genome assemblies 
22//! by selecting the option `--genomes` (using either `-i` or `-I`). In this case, an ultra-fast but approximate 
23//! protein prediction is performed, and the predicted proteomes are written to a temporary directory.
24//!
25//! ## Arguments
26//! - `-i`, --input-directory <INPUT>: input directory with FASTA proteomes (plain or .gz)
27//! - `-I`, --input-file <INPUT_FILE>: tab-delimited file mapping species name to proteome path
28//! - `--genomes`: treat input files as bacterial genomes and predict proteomes before analysis
29//! - `-k`, `--k`: k-mer length (default: 14; must be within the valid range for the
30//!   selected recoding scheme).
31//! - `-f`, `--min-freq`: minimum fraction of samples with an amino-acid per position
32//!   (default: 0.85; must be ≥ 0.6).
33//! - `-d`, `--depth`: maximum traversal depth from each start node (default: 8).
34//! - `-o`, `--output`: output prefix for generated files (default: `kamino`).
35//! - `-c`, `--constant`: number of constant positions retained from in-bubble k-mers
36//!   (default: 3; must be ≤ k-1).
37//! - `-l`, `--length-middle`: maximum number of middle positions per variant group
38//!   (default: 35; must be ≥ 1).
39//! - `-m`, `--mask`: mask middle segments with long mismatch runs (default: 5).
40//! - `-t`, `--threads`: number of threads used for graph construction and analysis
41//!   (default: 1).
42//! - `-r`, `--recode`: amino-acid recoding scheme (default: `sr6`).
43//! - `--nj`: generate a NJ tree from kamino alignment [nj=false]
44//! - `-v`, `--version`: print version information and exit.
45//!
46//!
47//! ## Important things to optimize
48//! The main parameters governing the number of phylogenetic positions in the final alignment are
49//! the k-mer size (-k), the depth of the recursive graph traversal (-d), and the minimum sample
50//! frequency (-f).
51//!
52//! The default k-mer size has already been chosen to maximise the final alignment length, and
53//! increasing it usually does not substantially increase the number of variant groups. It may,
54//! however, be useful to decrease the k-mer size from 14 to 13 if memory consumption is too high.
55//!
56//! Increasing the depth of the recursive graph traversal (e.g. from 8 to 10) generally increases
57//! the size of the final alignment, as kamino detects more variant groups during graph traversal.
58//! This is typically the most effective approach if the alignment is deemed too short.
59//!
60//! Finally, larger alignments can also be produced by decreasing the minimum fraction of samples
61//! required to carry an amino acid (e.g. from 0.85 to 0.8), at the cost of increased missing data
62//! in the final alignment. Missing data are represented by '-' (missing amino acid) and 'X'
63//! (ambiguous or masked amino acid).
64//!
65//!
66//! ## Less important parameters
67//! Besides testing/benchmarking, I would not recommend modifying these parameter values.
68//!
69//! The number of constant positions in the final alignment can be adjusted with the --constant parameter. These are
70//! taken from the left flank of the end k-mer in each variant group, next to the middle positions. Because these
71//! positions are recoded, some may become polymorphic once converted back to amino acids. Using the default c=3,
72//! constant positions represent 50 to 60% of the alignment.
73//!
74//! The `--mask parameter` controls the amino-acid masking performed by kamino to prevent long runs of polymorphism from being
75//! retained in the final alignment. These correspond to genuine but unwanted polymorphisms (e.g., micro-inversions) or,
76//! less frequently, errors made by kamino (“misaligned” paths due to the presence of two consecutive indels). The minimum length
77//! of polymorphism runs to be masked can be decreased using this parameter to be more stringent.
78//!
79//! The `--length-middle` parameter is used to filter out long variant groups. Increase this parameter to allow longer
80//! variant groups to be retained in the final alignment.
81//!
82//! Finally, the 6-letter recoding scheme can be modified using the --recode parameter, although the default sr6 recoding
83//! scheme performed the best in most of my tests (sr6 ≥ dayhoff6 ≫ kgb6).
84//!
85//!
86//! ## Output files
87//! The names of the output files are controlled by a prefix (-o; default: `kamino`). The prefix
88//! may include a directory path (e.g. `-o my_analyses/taxon1`). Note that the output directory is not
89//! created by kamino and must already exist.
90//!
91//! The three output files are:
92//! - `<prefix>_alignment.fas`: FASTA amino acid alignment of all samples.
93//! - `<prefix>_missing.tsv`: Tab-delimited per-sample missingness percentages.
94//! - `<prefix>_partitions.tsv`: Tab-delimited variant group coordinates (0-based) in the FASTA
95//!   alignment, along with consensus protein names when the input proteomes are annotated.
96//!
97//! Additionally, a Neighbor-Joining (NJ) tree can be produced from the amino acid alignment
98//! when the `--nj` argument is specified. Pairwise distances are computed using an F81
99//! correction with LG stationary amino-acid frequencies. The resulting tree provides an
100//! overview of isolate relationships and is not intended for detailed phylogenetic inference.
101//!
102use anyhow::Context;
103use clap::{ArgAction, Parser};
104
105mod bubbles;
106mod filter_groups;
107mod graph;
108mod io;
109mod output;
110mod phylo;
111mod proba_filter;
112mod protein_prediction;
113mod recode;
114mod revert_aminoacid;
115mod traverse;
116//mod util;
117
118pub use recode::RecodeScheme;
119
120struct TraversalArtifacts {
121    groups: traverse::VariantGroups,
122    species_names: Vec<String>,
123    n_species: usize,
124    recode_scheme: RecodeScheme,
125}
126
127#[allow(clippy::too_many_arguments)]
128fn run_traverse(
129    species_inputs: &[io::SpeciesInput],
130    k: usize,
131    min_freq: f32,
132    depth: usize,
133    length_middle: usize,
134    num_threads: usize,
135    recode_scheme: RecodeScheme,
136) -> anyhow::Result<TraversalArtifacts> {
137    // Build global graph
138    let mut g = graph::Graph::new(k, recode_scheme);
139    io::build_graph_from_inputs(
140        species_inputs,
141        k,
142        min_freq,
143        length_middle,
144        &mut g,
145        num_threads,
146    )?;
147
148    // Basic stats
149    io::print_graph_size(&g);
150    //io::print_outdegree_histogram(&g);
151
152    // Bubble endpoints
153    let (start_kmers, end_kmers) = bubbles::find_bubble_endpoints(&g, min_freq, num_threads);
154    eprintln!(
155        "bubble endpoints: start={} end={}",
156        start_kmers.len(),
157        end_kmers.len()
158    );
159
160    // Traverse VGs
161    let groups = traverse::find_variant_groups(
162        &g,
163        &start_kmers,
164        &end_kmers,
165        depth,
166        min_freq,
167        length_middle,
168        num_threads,
169    );
170    let total_paths: usize = groups.values().map(|v| v.len()).sum();
171    eprintln!(
172        "variant groups: groups={} paths={}",
173        groups.len(),
174        total_paths
175    );
176
177    // Emit amino-acid sequences per path for each variant group (call disabled, keep for debugging)
178    // traverse::print_variant_group_sequences(&g, &groups);
179
180    Ok(TraversalArtifacts {
181        groups,
182        species_names: g.species_names.clone(),
183        n_species: g.n_species,
184        recode_scheme: g.recode_scheme,
185    })
186}
187
188/// Build a node-based, colored de Bruijn graph from amino-acid proteomes and analyze bubbles.
189#[derive(Parser, Debug)]
190#[command(name = "kamino", author, version, about, disable_version_flag = true)]
191#[command(
192    group = clap::ArgGroup::new("input_source")
193        .required(true)
194        .multiple(true)
195        .args(["input", "input_file"])
196)]
197pub struct Args {
198    /// Input directory with FASTA proteomes (plain or .gz)
199    #[arg(short, long = "input-directory")]
200    pub input: Option<std::path::PathBuf>,
201
202    /// Tab-delimited file mapping species name to proteome path
203    #[arg(short = 'I', long = "input-file")]
204    pub input_file: Option<std::path::PathBuf>,
205
206    /// Treat input files as bacterial genomes and predict proteomes before analysis
207    #[arg(long = "genomes", default_value_t = false, hide_default_value = true)]
208    pub genomes: bool,
209
210    /// K-mer length [k=14]
211    #[arg(short, long)]
212    pub k: Option<usize>,
213
214    /// Minimal fraction of samples with an amino-acid per position [m=0.85]
215    #[arg(
216        short = 'f',
217        long = "min-freq",
218        default_value_t = 0.85,
219        hide_default_value = true
220    )]
221    pub min_freq: f32,
222
223    /// Maximum traversal depth from each start node [d=8]
224    #[arg(short, long, default_value_t = 8, hide_default_value = true)]
225    pub depth: usize,
226
227    /// Output prefix [o=kamino]
228    #[arg(short, long, default_value = "kamino")]
229    pub output: std::path::PathBuf,
230
231    /// Number of constant positions to keep from the in-bubble k-mer [c=3]
232    #[arg(short, long)]
233    pub constant: Option<usize>,
234
235    /// Maximum number of middle positions per variant group [l=35]
236    #[arg(short = 'l', long = "length-middle", default_value_t = 35, hide_default_value = true)]
237    pub length_middle: usize,
238
239    /// Mask middle segments with long mismatch runs [m=5]
240    #[arg(
241        short = 'm',
242        long = "mask",
243        default_value_t = 5,
244        hide_default_value = true
245    )]
246    pub mask: usize,
247
248    /// Number of threads [t=1]
249    #[arg(short = 't', long, default_value_t = 1, hide_default_value = true)]
250    pub threads: usize,
251
252    /// Recoding scheme [r=sr6]
253    #[arg(short = 'r', long = "recode", value_enum, default_value_t = RecodeScheme::SR6, hide_default_value = true)]
254    pub recode: RecodeScheme,
255
256    /// Generate a NJ tree from kamino alignment [nj=false]
257    #[arg(long = "nj", default_value_t = false, hide_default_value = true)]
258    pub nj: bool,
259
260    /// Display version information.
261    #[arg(short = 'v', long = "version", action = ArgAction::Version)]
262    pub version: (),
263}
264
265pub fn run_with_args(args: Args) -> anyhow::Result<()> {
266    // k defaults and limits for SR6
267    let default_k = 14usize;
268    let max_k = 21usize;
269    let default_constant = 3usize;
270
271    anyhow::ensure!(
272        args.min_freq >= 0.6,
273        "min_freq ({}) must be ≥ 0.6.",
274        args.min_freq
275    );
276    let min_freq = args.min_freq;
277
278    let k = args.k.unwrap_or(default_k);
279    anyhow::ensure!(
280        (2..=max_k).contains(&k),
281        "k={} is invalid for {}: allowed range is 2..={} (default {})",
282        k,
283        args.recode,
284        max_k,
285        default_k
286    );
287
288    // Validate constant against k-1
289    let k1 = k - 1;
290    let constant = args.constant.unwrap_or_else(|| default_constant.min(k1));
291    anyhow::ensure!(
292        constant <= k1,
293        "constant ({}) must be ≤ k-1 ({}).",
294        constant,
295        k1
296    );
297
298    anyhow::ensure!(
299        args.length_middle >= 1,
300        "length_middle ({}) must be ≥ 1.",
301        args.length_middle
302    );
303
304    // Decide how many threads to use (default: 1)
305    anyhow::ensure!(args.threads >= 1, "threads must be ≥ 1");
306
307    eprintln!("kamino v{}", env!("CARGO_PKG_VERSION"));
308    let mut species_inputs = Vec::new();
309    let mut input_labels: Vec<String> = Vec::new();
310
311    if let Some(table) = args.input_file.as_ref() {
312        input_labels.push(format!("input_file={}", table.display()));
313        species_inputs.extend(io::collect_species_inputs_from_table(table)?);
314    }
315
316    if let Some(dir) = args.input.as_ref() {
317        input_labels.push(format!("input={}", dir.display()));
318        species_inputs.extend(io::collect_species_inputs_from_dir(dir)?);
319    }
320
321    let output_dir = args
322        .output
323        .parent()
324        .map(std::path::Path::to_path_buf)
325        .unwrap_or_else(|| std::path::PathBuf::from("."));
326    let mut genomes_tmpdir = None;
327
328    let input_label = input_labels.join(" ");
329
330    eprintln!(
331        "parameters: k={} constant={} min_freq={} depth={} length_middle={} mask={} threads={} recode={} {} output={}",
332        k,
333        constant,
334        min_freq,
335        args.depth,
336        args.length_middle,
337        args.mask,
338        args.threads,
339        args.recode,
340        input_label,
341        args.output.display()
342    );
343
344    if args.genomes {
345        eprintln!("genome files: {}", species_inputs.len());
346
347        let tmpdir = tempfile::Builder::new()
348            .prefix("tmp_kamino_")
349            .rand_bytes(6)
350            .tempdir_in(&output_dir)
351            .with_context(|| {
352                format!(
353                    "create temporary directory for predicted proteomes in {}",
354                    output_dir.display()
355                )
356            })?;
357
358        let predicted =
359            protein_prediction::predict_proteomes(&species_inputs, tmpdir.path(), args.threads)?;
360        species_inputs = predicted;
361        input_labels.push("genomes=true".to_string());
362        genomes_tmpdir = Some(tmpdir);
363    }
364
365    anyhow::ensure!(
366        !species_inputs.is_empty(),
367        "At least one input source with files must be provided."
368    );
369
370    let traversal = run_traverse(
371        &species_inputs,
372        k,
373        min_freq,
374        args.depth,
375        args.length_middle,
376        args.threads,
377        args.recode,
378    )?;
379
380    let (alignment_len, alignment_missing_pct) = output::write_outputs_with_head(
381        &species_inputs,
382        &args.output,
383        &traversal.species_names,
384        traversal.n_species,
385        traversal.recode_scheme,
386        &traversal.groups,
387        k,
388        constant,
389        min_freq,
390        args.mask,
391        args.threads,
392        args.nj,
393    )?;
394
395    let (fas_path, tsv_path, partitions_path, tree_path) = output::output_paths(&args.output);
396    eprintln!(
397        "alignment: length={} missing={:.1}%",
398        alignment_len, alignment_missing_pct
399    );
400
401    if args.nj {
402        eprintln!(
403            "output files:  {}, {}, {}, and {}",
404            fas_path.display(),
405            tsv_path.display(),
406            partitions_path.display(),
407            tree_path.display()
408        );
409    } else {
410        eprintln!(
411            "output files:  {}, {}, and {}",
412            fas_path.display(),
413            tsv_path.display(),
414            partitions_path.display()
415        );
416    }
417
418    drop(genomes_tmpdir);
419
420    Ok(())
421}