Skip to main content

kamino_cli/
lib.rs

1//! # kamino
2//!
3//! kamino builds an amino-acid alignment in a reference-free, alignment-free manner from
4//! a set of proteomes. It is not “better” than traditional marker-based pipelines, but it is
5//! simpler and faster to use.
6//!
7//! Typical usage ranges from between-species to within-phylum phylogenetic analyses (bacteria,
8//! archaea, and eukaryotes).
9//!
10//! ## Input modes
11//! kamino accepts proteome files as input in one of two modes:
12//! - **Directory mode** (`--input-directory`): a directory containing FASTA proteomes
13//!   (plain text or `.gz` compressed). Each file represents one isolate. Filenames minus the
14//!   extension become sequence names in the final amino-acid alignment.
15//! - **Table mode** (`--input-file`): a tab-delimited file mapping a species/sample name
16//!   to a proteome path (one name + path pair per line). This is useful when file names
17//!   do not encode the sample name or when proteomes are located in multiple directories.
18//!
19//! In the directory mode, files are recognized by their extension (.fas, .fasta, .faa, .fa, .fna; gzipped ot not).
20//!
21//! For **bacterial** isolates, the phylogenomic alignment can also be generated directly from genome assemblies 
22//! by selecting the option `--genomes` (using either `-i` or `-I`). In this case, an ultra-fast but approximate 
23//! protein prediction is performed, and the predicted proteomes are written to a temporary directory.
24//!
25//! ## Arguments
26//! - `-i`, --input-directory <INPUT>: input directory with FASTA proteomes (plain or .gz)
27//! - `-I`, --input-file <INPUT_FILE>: tab-delimited file mapping species name to proteome path
28//! - `--genomes`: treat input files as bacterial genomes and predict proteomes before analysis
29//! - `-k`, `--k`: k-mer length (default: 14; must be within the valid range for the
30//!   selected recoding scheme).
31//! - `-f`, `--min-freq`: minimum fraction of samples with an amino-acid per position
32//!   (default: 0.85; must be ≥ 0.6).
33//! - `-d`, `--depth`: maximum traversal depth from each start node (default: 12).
34//! - `-o`, `--output`: output prefix for generated files (default: `kamino`).
35//! - `-c`, `--constant`: number of constant positions retained from in-bubble k-mers
36//!   (default: 3; must be ≤ k-1).
37//! - `-l`, `--length-middle`: maximum number of middle positions per variant group
38//!   (default: 35; must be ≥ 1).
39//! - `-m`, `--mask`: mask middle segments with long mismatch runs (default: 5).
40//! - `-t`, `--threads`: number of threads used for graph construction and analysis
41//!   (default: 1).
42//! - `-r`, `--recode`: amino-acid recoding scheme (default: `sr6`).
43//! - `--nj`: generate a NJ tree from kamino alignment [nj=false]
44//! - `-v`, `--version`: print version information and exit.
45//!
46//!
47//! ## Important things to optimize
48//! The main parameters governing the number of phylogenetic positions in the final alignment are
49//! the k-mer size (-k), the depth of the recursive graph traversal (-d), and the minimum sample
50//! frequency (-f).
51//!
52//! The default k-mer size has already been chosen to maximise the final alignment length, and
53//! increasing it usually does not substantially increase the number of variant groups. It may,
54//! however, be useful to decrease the k-mer size from 14 to 13 if memory consumption is too high.
55//!
56//! Increasing the depth of the recursive graph traversal (e.g. from 12 to 16) generally increases
57//! the size of the final alignment, as kamino detects more variant groups during graph traversal.
58//! This is typically the most effective approach if the alignment is deemed too short, but results 
59//! in increased runtimes as the program spends more time traversing the graph.
60//!
61//! Finally, larger alignments can also be produced by decreasing the minimum fraction of samples
62//! required to carry an amino acid (e.g. from 0.85 to 0.8), at the cost of increased missing data
63//! in the final alignment. Missing data are represented by '-' (missing amino acid) and 'X'
64//! (ambiguous or masked amino acid).
65//!
66//!
67//! ## Less important parameters
68//! Besides testing/benchmarking, I would not recommend modifying these parameter values.
69//!
70//! The number of constant positions in the final alignment can be adjusted with the --constant parameter. These are
71//! taken from the left flank of the end k-mer in each variant group, next to the middle positions. Because these
72//! positions are recoded, some may become polymorphic once converted back to amino acids. Using the default c=3,
73//! constant positions represent 50 to 60% of the alignment.
74//!
75//! The `--mask parameter` controls the amino-acid masking performed by kamino to prevent long runs of polymorphism from being
76//! retained in the final alignment. These correspond to genuine but unwanted polymorphisms (e.g., micro-inversions) or,
77//! less frequently, errors made by kamino (“misaligned” paths due to the presence of two consecutive indels). The minimum length
78//! of polymorphism runs to be masked can be decreased using this parameter to be more stringent.
79//!
80//! The `--length-middle` parameter is used to filter out long variant groups. Increase this parameter to allow longer
81//! variant groups to be retained in the final alignment.
82//!
83//! Finally, the 6-letter recoding scheme can be modified using the --recode parameter, although the default sr6 recoding
84//! scheme performed the best in most of my tests (sr6 ≥ dayhoff6 ≫ kgb6).
85//!
86//!
87//! ## Output files
88//! The names of the output files are controlled by a prefix (-o; default: `kamino`). The prefix
89//! may include a directory path (e.g. `-o my_analyses/taxon1`). Note that the output directory is not
90//! created by kamino and must already exist.
91//!
92//! The three output files are:
93//! - `<prefix>_alignment.fas`: FASTA amino acid alignment of all samples.
94//! - `<prefix>_missing.tsv`: Tab-delimited per-sample missingness percentages.
95//! - `<prefix>_partitions.tsv`: Tab-delimited variant group coordinates (0-based) in the FASTA
96//!   alignment, along with consensus protein names when the input proteomes are annotated.
97//!
98//! Additionally, a Neighbor-Joining (NJ) tree can be produced from the amino acid alignment
99//! when the `--nj` argument is specified. Pairwise distances are computed using an F81
100//! correction with LG stationary amino-acid frequencies. The resulting tree provides an
101//! overview of isolate relationships and is not intended for detailed phylogenetic inference.
102//!
103use anyhow::Context;
104use clap::{ArgAction, Parser};
105
106mod bubbles;
107mod filter_groups;
108mod graph;
109mod io;
110mod output;
111mod phylo;
112mod proba_filter;
113mod protein_prediction;
114mod recode;
115mod revert_aminoacid;
116mod traverse;
117//mod util;
118
119pub use recode::RecodeScheme;
120
121struct TraversalArtifacts {
122    groups: traverse::VariantGroups,
123    species_names: Vec<String>,
124    n_species: usize,
125    recode_scheme: RecodeScheme,
126}
127
128#[allow(clippy::too_many_arguments)]
129fn run_traverse(
130    species_inputs: &[io::SpeciesInput],
131    k: usize,
132    min_freq: f32,
133    depth: usize,
134    length_middle: usize,
135    num_threads: usize,
136    recode_scheme: RecodeScheme,
137) -> anyhow::Result<TraversalArtifacts> {
138    // Build global graph
139    let mut g = graph::Graph::new(k, recode_scheme);
140    io::build_graph_from_inputs(
141        species_inputs,
142        k,
143        min_freq,
144        length_middle,
145        &mut g,
146        num_threads,
147    )?;
148
149    // Basic stats
150    io::print_graph_size(&g);
151    //io::print_outdegree_histogram(&g);
152
153    // Bubble endpoints
154    let (start_kmers, end_kmers) = bubbles::find_bubble_endpoints(&g, min_freq, num_threads);
155    eprintln!(
156        "bubble endpoints: start={} end={}",
157        start_kmers.len(),
158        end_kmers.len()
159    );
160
161    // Traverse VGs
162    let groups = traverse::find_variant_groups(
163        &g,
164        &start_kmers,
165        &end_kmers,
166        depth,
167        min_freq,
168        length_middle,
169        num_threads,
170    );
171    let total_paths: usize = groups.values().map(|v| v.len()).sum();
172    eprintln!(
173        "variant groups: groups={} paths={}",
174        groups.len(),
175        total_paths
176    );
177
178    // Emit amino-acid sequences per path for each variant group (call disabled, keep for debugging)
179    // traverse::print_variant_group_sequences(&g, &groups);
180
181    Ok(TraversalArtifacts {
182        groups,
183        species_names: g.species_names.clone(),
184        n_species: g.n_species,
185        recode_scheme: g.recode_scheme,
186    })
187}
188
189/// Build a node-based, colored de Bruijn graph from amino-acid proteomes and analyze bubbles.
190#[derive(Parser, Debug)]
191#[command(name = "kamino", author, version, about, disable_version_flag = true)]
192#[command(
193    group = clap::ArgGroup::new("input_source")
194        .required(true)
195        .multiple(true)
196        .args(["input", "input_file"])
197)]
198pub struct Args {
199    /// Input directory with FASTA proteomes (plain or .gz)
200    #[arg(short, long = "input-directory")]
201    pub input: Option<std::path::PathBuf>,
202
203    /// Tab-delimited file mapping species name to proteome path
204    #[arg(short = 'I', long = "input-file")]
205    pub input_file: Option<std::path::PathBuf>,
206
207    /// Treat input files as bacterial genomes and predict proteomes before analysis
208    #[arg(long = "genomes", default_value_t = false, hide_default_value = true)]
209    pub genomes: bool,
210
211    /// K-mer length [k=14]
212    #[arg(short, long)]
213    pub k: Option<usize>,
214
215    /// Minimal fraction of samples with an amino-acid per position [m=0.85]
216    #[arg(
217        short = 'f',
218        long = "min-freq",
219        default_value_t = 0.85,
220        hide_default_value = true
221    )]
222    pub min_freq: f32,
223
224    /// Maximum traversal depth from each start node [d=12]
225    #[arg(short, long, default_value_t = 12, hide_default_value = true)]
226    pub depth: usize,
227
228    /// Output prefix [o=kamino]
229    #[arg(short, long, default_value = "kamino")]
230    pub output: std::path::PathBuf,
231
232    /// Number of constant positions to keep from the in-bubble k-mer [c=3]
233    #[arg(short, long)]
234    pub constant: Option<usize>,
235
236    /// Maximum number of middle positions per variant group [l=35]
237    #[arg(short = 'l', long = "length-middle", default_value_t = 35, hide_default_value = true)]
238    pub length_middle: usize,
239
240    /// Mask middle segments with long mismatch runs [m=5]
241    #[arg(
242        short = 'm',
243        long = "mask",
244        default_value_t = 5,
245        hide_default_value = true
246    )]
247    pub mask: usize,
248
249    /// Number of threads [t=1]
250    #[arg(short = 't', long, default_value_t = 1, hide_default_value = true)]
251    pub threads: usize,
252
253    /// Recoding scheme [r=sr6]
254    #[arg(short = 'r', long = "recode", value_enum, default_value_t = RecodeScheme::SR6, hide_default_value = true)]
255    pub recode: RecodeScheme,
256
257    /// Generate a NJ tree from kamino alignment [nj=false]
258    #[arg(long = "nj", default_value_t = false, hide_default_value = true)]
259    pub nj: bool,
260
261    /// Display version information.
262    #[arg(short = 'v', long = "version", action = ArgAction::Version)]
263    pub version: (),
264}
265
266pub fn run_with_args(args: Args) -> anyhow::Result<()> {
267    // k defaults and limits for SR6
268    let default_k = 14usize;
269    let max_k = 21usize;
270    let default_constant = 3usize;
271
272    anyhow::ensure!(
273        args.min_freq >= 0.6,
274        "min_freq ({}) must be ≥ 0.6.",
275        args.min_freq
276    );
277    let min_freq = args.min_freq;
278
279    let k = args.k.unwrap_or(default_k);
280    anyhow::ensure!(
281        (2..=max_k).contains(&k),
282        "k={} is invalid for {}: allowed range is 2..={} (default {})",
283        k,
284        args.recode,
285        max_k,
286        default_k
287    );
288
289    // Validate constant against k-1
290    let k1 = k - 1;
291    let constant = args.constant.unwrap_or_else(|| default_constant.min(k1));
292    anyhow::ensure!(
293        constant <= k1,
294        "constant ({}) must be ≤ k-1 ({}).",
295        constant,
296        k1
297    );
298
299    anyhow::ensure!(
300        args.length_middle >= 1,
301        "length_middle ({}) must be ≥ 1.",
302        args.length_middle
303    );
304
305    // Decide how many threads to use (default: 1)
306    anyhow::ensure!(args.threads >= 1, "threads must be ≥ 1");
307
308    eprintln!("kamino v{}", env!("CARGO_PKG_VERSION"));
309    let mut species_inputs = Vec::new();
310    let mut input_labels: Vec<String> = Vec::new();
311
312    if let Some(table) = args.input_file.as_ref() {
313        input_labels.push(format!("input_file={}", table.display()));
314        species_inputs.extend(io::collect_species_inputs_from_table(table)?);
315    }
316
317    if let Some(dir) = args.input.as_ref() {
318        input_labels.push(format!("input={}", dir.display()));
319        species_inputs.extend(io::collect_species_inputs_from_dir(dir)?);
320    }
321
322    let output_dir = args
323        .output
324        .parent()
325        .map(std::path::Path::to_path_buf)
326        .unwrap_or_else(|| std::path::PathBuf::from("."));
327    let mut genomes_tmpdir = None;
328
329    let input_label = input_labels.join(" ");
330
331    eprintln!(
332        "parameters: k={} constant={} min_freq={} depth={} length_middle={} mask={} threads={} recode={} {} output={}",
333        k,
334        constant,
335        min_freq,
336        args.depth,
337        args.length_middle,
338        args.mask,
339        args.threads,
340        args.recode,
341        input_label,
342        args.output.display()
343    );
344
345    if args.genomes {
346        eprintln!("genome files: {}", species_inputs.len());
347
348        let tmpdir = tempfile::Builder::new()
349            .prefix("tmp_kamino_")
350            .rand_bytes(6)
351            .tempdir_in(&output_dir)
352            .with_context(|| {
353                format!(
354                    "create temporary directory for predicted proteomes in {}",
355                    output_dir.display()
356                )
357            })?;
358
359        let predicted =
360            protein_prediction::predict_proteomes(&species_inputs, tmpdir.path(), args.threads)?;
361        species_inputs = predicted;
362        input_labels.push("genomes=true".to_string());
363        genomes_tmpdir = Some(tmpdir);
364    }
365
366    anyhow::ensure!(
367        !species_inputs.is_empty(),
368        "At least one input source with files must be provided."
369    );
370
371    let traversal = run_traverse(
372        &species_inputs,
373        k,
374        min_freq,
375        args.depth,
376        args.length_middle,
377        args.threads,
378        args.recode,
379    )?;
380
381    let (alignment_len, alignment_missing_pct) = output::write_outputs_with_head(
382        &species_inputs,
383        &args.output,
384        &traversal.species_names,
385        traversal.n_species,
386        traversal.recode_scheme,
387        &traversal.groups,
388        k,
389        constant,
390        min_freq,
391        args.mask,
392        args.threads,
393        args.nj,
394    )?;
395
396    let (fas_path, tsv_path, partitions_path, tree_path) = output::output_paths(&args.output);
397    eprintln!(
398        "alignment: length={} missing={:.1}%",
399        alignment_len, alignment_missing_pct
400    );
401
402    if args.nj {
403        eprintln!(
404            "output files:  {}, {}, {}, and {}",
405            fas_path.display(),
406            tsv_path.display(),
407            partitions_path.display(),
408            tree_path.display()
409        );
410    } else {
411        eprintln!(
412            "output files:  {}, {}, and {}",
413            fas_path.display(),
414            tsv_path.display(),
415            partitions_path.display()
416        );
417    }
418
419    drop(genomes_tmpdir);
420
421    Ok(())
422}