kamino_cli/lib.rs
1//! # kamino
2//!
3//! kamino builds an amino-acid alignment in a reference-free, alignment-free manner from
4//! a set of proteomes. It is not “better” than traditional marker-based pipelines, but it is
5//! simpler and faster to use.
6//!
7//! Typical usage ranges from between-species to within-phylum phylogenetic analyses (bacteria,
8//! archaea, and eukaryotes).
9//!
10//! ## Input modes
11//! kamino accepts proteome files as input in one of two modes:
12//! - **Directory mode** (`--input-directory`): a directory containing FASTA proteomes
13//! (plain text or `.gz` compressed). Each file represents one isolate. Filenames minus the
14//! extension become sequence names in the final amino-acid alignment.
15//! - **Table mode** (`--input-file`): a tab-delimited file mapping a species/sample name
16//! to a proteome path (one name + path pair per line). This is useful when file names
17//! do not encode the sample name or when proteomes are located in multiple directories.
18//!
19//! In the directory mode, files are recognized by their extension (.fas, .fasta, .faa, .fa, .fna; gzipped ot not).
20//!
21//! For **bacterial** isolates, the phylogenomic alignment can also be generated directly from genome assemblies
22//! by selecting the option `--genomes` (using either `-i` or `-I`). In this case, an ultra-fast but approximate
23//! protein prediction is performed, and the predicted proteomes are written to a temporary directory.
24//!
25//! ## Arguments
26//! - `-i`, --input-directory <INPUT>: input directory with FASTA proteomes (plain or .gz)
27//! - `-I`, --input-file <INPUT_FILE>: tab-delimited file mapping species name to proteome path
28//! - `--genomes`: treat input files as bacterial genomes and predict proteomes before analysis
29//! - `-k`, `--k`: k-mer length (default: 14; must be within the valid range for the
30//! selected recoding scheme).
31//! - `-f`, `--min-freq`: minimum fraction of samples with an amino-acid per position
32//! (default: 0.85; must be ≥ 0.6).
33//! - `-d`, `--depth`: maximum traversal depth from each start node (default: 8).
34//! - `-o`, `--output`: output prefix for generated files (default: `kamino`).
35//! - `-c`, `--constant`: number of constant positions retained from in-bubble k-mers
36//! (default: 3; must be ≤ k-1).
37//! - `-l`, `--length-middle`: maximum number of middle positions per variant group
38//! (default: 35; must be ≥ 1).
39//! - `-m`, `--mask`: mask middle segments with long mismatch runs (default: 5).
40//! - `-t`, `--threads`: number of threads used for graph construction and analysis
41//! (default: 1).
42//! - `-r`, `--recode`: amino-acid recoding scheme (default: `sr6`).
43//! - `--nj`: generate a NJ tree from kamino alignment [nj=false]
44//! - `-v`, `--version`: print version information and exit.
45//!
46//!
47//! ## Important things to optimize
48//! The main parameters governing the number of phylogenetic positions in the final alignment are
49//! the k-mer size (-k), the depth of the recursive graph traversal (-d), and the minimum sample
50//! frequency (-f).
51//!
52//! The default k-mer size has already been chosen to maximise the final alignment length, and
53//! increasing it usually does not substantially increase the number of variant groups. It may,
54//! however, be useful to decrease the k-mer size from 14 to 13 if memory consumption is too high.
55//!
56//! Increasing the depth of the recursive graph traversal (e.g. from 8 to 10) generally increases
57//! the size of the final alignment, as kamino detects more variant groups during graph traversal.
58//! This is typically the most effective approach if the alignment is deemed too short.
59//!
60//! Finally, larger alignments can also be produced by decreasing the minimum fraction of samples
61//! required to carry an amino acid (e.g. from 0.85 to 0.8), at the cost of increased missing data
62//! in the final alignment. Missing data are represented by '-' (missing amino acid) and 'X'
63//! (ambiguous or masked amino acid).
64//!
65//!
66//! ## Less important parameters
67//! Besides testing/benchmarking, I would not recommend modifying these parameter values.
68//!
69//! The number of constant positions in the final alignment can be adjusted with the --constant parameter. These are
70//! taken from the left flank of the end k-mer in each variant group, next to the middle positions. Because these
71//! positions are recoded, some may become polymorphic once converted back to amino acids. Using the default c=3,
72//! constant positions represent 50 to 60% of the alignment.
73//!
74//! The `--mask parameter` controls the amino-acid masking performed by kamino to prevent long runs of polymorphism from being
75//! retained in the final alignment. These correspond to genuine but unwanted polymorphisms (e.g., micro-inversions) or,
76//! less frequently, errors made by kamino (“misaligned” paths due to the presence of two consecutive indels). The minimum length
77//! of polymorphism runs to be masked can be decreased using this parameter to be more stringent.
78//!
79//! The `--length-middle` parameter is used to filter out long variant groups. Increase this parameter to allow longer
80//! variant groups to be retained in the final alignment.
81//!
82//! Finally, the 6-letter recoding scheme can be modified using the --recode parameter, although the default sr6 recoding
83//! scheme performed the best in most of my tests (sr6 ≥ dayhoff6 ≫ kgb6).
84//!
85//!
86//! ## Output files
87//! The names of the output files are controlled by a prefix (-o; default: `kamino`). The prefix
88//! may include a directory path (e.g. `-o my_analyses/taxon1`). Note that the output directory is not
89//! created by kamino and must already exist.
90//!
91//! The three output files are:
92//! - `<prefix>_alignment.fas`: FASTA amino acid alignment of all samples.
93//! - `<prefix>_missing.tsv`: Tab-delimited per-sample missingness percentages.
94//! - `<prefix>_partitions.tsv`: Tab-delimited variant group coordinates (0-based) in the FASTA
95//! alignment, along with consensus protein names when the input proteomes are annotated.
96//!
97//! Additionally, a Neighbor-Joining (NJ) tree can be produced from the amino acid alignment
98//! when the `--nj` argument is specified. Pairwise distances are computed using an F81
99//! correction with LG stationary amino-acid frequencies. The resulting tree provides an
100//! overview of isolate relationships and is not intended for detailed phylogenetic inference.
101//!
102use anyhow::Context;
103use clap::{ArgAction, Parser};
104
105mod bubbles;
106mod filter_groups;
107mod graph;
108mod io;
109mod output;
110mod phylo;
111mod proba_filter;
112mod protein_prediction;
113mod recode;
114mod revert_aminoacid;
115mod traverse;
116//mod util;
117
118pub use recode::RecodeScheme;
119
120struct TraversalArtifacts {
121 groups: traverse::VariantGroups,
122 species_names: Vec<String>,
123 n_species: usize,
124 recode_scheme: RecodeScheme,
125}
126
127#[allow(clippy::too_many_arguments)]
128fn run_traverse(
129 species_inputs: &[io::SpeciesInput],
130 k: usize,
131 min_freq: f32,
132 depth: usize,
133 length_middle: usize,
134 num_threads: usize,
135 recode_scheme: RecodeScheme,
136) -> anyhow::Result<TraversalArtifacts> {
137 // Build global graph
138 let mut g = graph::Graph::new(k, recode_scheme);
139 io::build_graph_from_inputs(
140 species_inputs,
141 k,
142 min_freq,
143 length_middle,
144 &mut g,
145 num_threads,
146 )?;
147
148 // Basic stats
149 io::print_graph_size(&g);
150 //io::print_outdegree_histogram(&g);
151
152 // Bubble endpoints
153 let (start_kmers, end_kmers) = bubbles::find_bubble_endpoints(&g, min_freq, num_threads);
154 eprintln!(
155 "bubble endpoints: start={} end={}",
156 start_kmers.len(),
157 end_kmers.len()
158 );
159
160 // Traverse VGs
161 let groups = traverse::find_variant_groups(
162 &g,
163 &start_kmers,
164 &end_kmers,
165 depth,
166 min_freq,
167 length_middle,
168 num_threads,
169 );
170 let total_paths: usize = groups.values().map(|v| v.len()).sum();
171 eprintln!(
172 "variant groups: groups={} paths={}",
173 groups.len(),
174 total_paths
175 );
176
177 // Emit amino-acid sequences per path for each variant group (call disabled, keep for debugging)
178 // traverse::print_variant_group_sequences(&g, &groups);
179
180 Ok(TraversalArtifacts {
181 groups,
182 species_names: g.species_names.clone(),
183 n_species: g.n_species,
184 recode_scheme: g.recode_scheme,
185 })
186}
187
188/// Build a node-based, colored de Bruijn graph from amino-acid proteomes and analyze bubbles.
189#[derive(Parser, Debug)]
190#[command(name = "kamino", author, version, about, disable_version_flag = true)]
191#[command(
192 group = clap::ArgGroup::new("input_source")
193 .required(true)
194 .multiple(true)
195 .args(["input", "input_file"])
196)]
197pub struct Args {
198 /// Input directory with FASTA proteomes (plain or .gz)
199 #[arg(short, long = "input-directory")]
200 pub input: Option<std::path::PathBuf>,
201
202 /// Tab-delimited file mapping species name to proteome path
203 #[arg(short = 'I', long = "input-file")]
204 pub input_file: Option<std::path::PathBuf>,
205
206 /// Treat input files as bacterial genomes and predict proteomes before analysis
207 #[arg(long = "genomes", default_value_t = false, hide_default_value = true)]
208 pub genomes: bool,
209
210 /// K-mer length [k=14]
211 #[arg(short, long)]
212 pub k: Option<usize>,
213
214 /// Minimal fraction of samples with an amino-acid per position [m=0.85]
215 #[arg(
216 short = 'f',
217 long = "min-freq",
218 default_value_t = 0.85,
219 hide_default_value = true
220 )]
221 pub min_freq: f32,
222
223 /// Maximum traversal depth from each start node [d=8]
224 #[arg(short, long, default_value_t = 8, hide_default_value = true)]
225 pub depth: usize,
226
227 /// Output prefix [o=kamino]
228 #[arg(short, long, default_value = "kamino")]
229 pub output: std::path::PathBuf,
230
231 /// Number of constant positions to keep from the in-bubble k-mer [c=3]
232 #[arg(short, long)]
233 pub constant: Option<usize>,
234
235 /// Maximum number of middle positions per variant group [l=35]
236 #[arg(short = 'l', long = "length-middle", default_value_t = 35, hide_default_value = true)]
237 pub length_middle: usize,
238
239 /// Mask middle segments with long mismatch runs [m=5]
240 #[arg(
241 short = 'm',
242 long = "mask",
243 default_value_t = 5,
244 hide_default_value = true
245 )]
246 pub mask: usize,
247
248 /// Number of threads [t=1]
249 #[arg(short = 't', long, default_value_t = 1, hide_default_value = true)]
250 pub threads: usize,
251
252 /// Recoding scheme [r=sr6]
253 #[arg(short = 'r', long = "recode", value_enum, default_value_t = RecodeScheme::SR6, hide_default_value = true)]
254 pub recode: RecodeScheme,
255
256 /// Generate a NJ tree from kamino alignment [nj=false]
257 #[arg(long = "nj", default_value_t = false, hide_default_value = true)]
258 pub nj: bool,
259
260 /// Display version information.
261 #[arg(short = 'v', long = "version", action = ArgAction::Version)]
262 pub version: (),
263}
264
265pub fn run_with_args(args: Args) -> anyhow::Result<()> {
266 // k defaults and limits for SR6
267 let default_k = 14usize;
268 let max_k = 21usize;
269 let default_constant = 3usize;
270
271 anyhow::ensure!(
272 args.min_freq >= 0.6,
273 "min_freq ({}) must be ≥ 0.6.",
274 args.min_freq
275 );
276 let min_freq = args.min_freq;
277
278 let k = args.k.unwrap_or(default_k);
279 anyhow::ensure!(
280 (2..=max_k).contains(&k),
281 "k={} is invalid for {}: allowed range is 2..={} (default {})",
282 k,
283 args.recode,
284 max_k,
285 default_k
286 );
287
288 // Validate constant against k-1
289 let k1 = k - 1;
290 let constant = args.constant.unwrap_or_else(|| default_constant.min(k1));
291 anyhow::ensure!(
292 constant <= k1,
293 "constant ({}) must be ≤ k-1 ({}).",
294 constant,
295 k1
296 );
297
298 anyhow::ensure!(
299 args.length_middle >= 1,
300 "length_middle ({}) must be ≥ 1.",
301 args.length_middle
302 );
303
304 // Decide how many threads to use (default: 1)
305 anyhow::ensure!(args.threads >= 1, "threads must be ≥ 1");
306
307 eprintln!("kamino v{}", env!("CARGO_PKG_VERSION"));
308 let mut species_inputs = Vec::new();
309 let mut input_labels: Vec<String> = Vec::new();
310
311 if let Some(table) = args.input_file.as_ref() {
312 input_labels.push(format!("input_file={}", table.display()));
313 species_inputs.extend(io::collect_species_inputs_from_table(table)?);
314 }
315
316 if let Some(dir) = args.input.as_ref() {
317 input_labels.push(format!("input={}", dir.display()));
318 species_inputs.extend(io::collect_species_inputs_from_dir(dir)?);
319 }
320
321 let output_dir = args
322 .output
323 .parent()
324 .map(std::path::Path::to_path_buf)
325 .unwrap_or_else(|| std::path::PathBuf::from("."));
326 let mut genomes_tmpdir = None;
327
328 let input_label = input_labels.join(" ");
329
330 eprintln!(
331 "parameters: k={} constant={} min_freq={} depth={} length_middle={} mask={} threads={} recode={} {} output={}",
332 k,
333 constant,
334 min_freq,
335 args.depth,
336 args.length_middle,
337 args.mask,
338 args.threads,
339 args.recode,
340 input_label,
341 args.output.display()
342 );
343
344 if args.genomes {
345 eprintln!("genome files: {}", species_inputs.len());
346
347 let tmpdir = tempfile::Builder::new()
348 .prefix("tmp_kamino_")
349 .rand_bytes(6)
350 .tempdir_in(&output_dir)
351 .with_context(|| {
352 format!(
353 "create temporary directory for predicted proteomes in {}",
354 output_dir.display()
355 )
356 })?;
357
358 let predicted =
359 protein_prediction::predict_proteomes(&species_inputs, tmpdir.path(), args.threads)?;
360 species_inputs = predicted;
361 input_labels.push("genomes=true".to_string());
362 genomes_tmpdir = Some(tmpdir);
363 }
364
365 anyhow::ensure!(
366 !species_inputs.is_empty(),
367 "At least one input source with files must be provided."
368 );
369
370 let traversal = run_traverse(
371 &species_inputs,
372 k,
373 min_freq,
374 args.depth,
375 args.length_middle,
376 args.threads,
377 args.recode,
378 )?;
379
380 let (alignment_len, alignment_missing_pct) = output::write_outputs_with_head(
381 &species_inputs,
382 &args.output,
383 &traversal.species_names,
384 traversal.n_species,
385 traversal.recode_scheme,
386 &traversal.groups,
387 k,
388 constant,
389 min_freq,
390 args.mask,
391 args.threads,
392 args.nj,
393 )?;
394
395 let (fas_path, tsv_path, partitions_path, tree_path) = output::output_paths(&args.output);
396 eprintln!(
397 "alignment: length={} missing={:.1}%",
398 alignment_len, alignment_missing_pct
399 );
400
401 if args.nj {
402 eprintln!(
403 "output files: {}, {}, {}, and {}",
404 fas_path.display(),
405 tsv_path.display(),
406 partitions_path.display(),
407 tree_path.display()
408 );
409 } else {
410 eprintln!(
411 "output files: {}, {}, and {}",
412 fas_path.display(),
413 tsv_path.display(),
414 partitions_path.display()
415 );
416 }
417
418 drop(genomes_tmpdir);
419
420 Ok(())
421}