kamino_cli/lib.rs
1//! # kamino
2//!
3//! kamino builds an amino-acid alignment in a reference-free, alignment-free manner from
4//! a set of proteomes. It is not “better” than traditional marker-based pipelines, but it is
5//! simpler and faster to use.
6//!
7//! Typical usage ranges from between-species to within-phylum phylogenetic analyses (bacteria,
8//! archaea, and eukaryotes).
9//!
10//! ## Input modes
11//! kamino accepts proteome files as input in one of two modes:
12//! - **Directory mode** (`--input-directory`): a directory containing FASTA proteomes
13//! (plain text or `.gz` compressed). Each file represents one isolate. Filenames minus the
14//! extension become sequence names in the final amino-acid alignment.
15//! - **Table mode** (`--input-file`): a tab-delimited file mapping a species/sample name
16//! to a proteome path (one name + path pair per line). This is useful when file names
17//! do not encode the sample name or when proteomes are located in multiple directories.
18//!
19//! In the directory mode, files are recognized by their extension (.fas, .fasta, .faa, .fa, .fna; gzipped ot not).
20//!
21//! For **bacterial** isolates, the phylogenomic alignment can also be generated directly from genome assemblies
22//! by selecting the option `--genomes` (using either `-i` or `-I`). In this case, an ultra-fast but approximate
23//! protein prediction is performed, and the predicted proteomes are written to a temporary directory.
24//!
25//! ## Arguments
26//! - `-i`, --input-directory <INPUT>: input directory with FASTA proteomes (plain or .gz)
27//! - `-I`, --input-file <INPUT_FILE>: tab-delimited file mapping species name to proteome path
28//! - `--genomes`: treat input files as bacterial genomes and predict proteomes before analysis
29//! - `-k`, `--k`: k-mer length (default: 14; must be within the valid range for the
30//! selected recoding scheme).
31//! - `-f`, `--min-freq`: minimum fraction of samples with an amino-acid per position
32//! (default: 0.85; must be ≥ 0.6).
33//! - `-d`, `--depth`: maximum traversal depth from each start node (default: 12).
34//! - `-o`, `--output`: output prefix for generated files (default: `kamino`).
35//! - `-c`, `--constant`: number of constant positions retained from in-bubble k-mers
36//! (default: 3; must be ≤ k-1).
37//! - `-l`, `--length-middle`: maximum number of middle positions per variant group
38//! (default: 35; must be ≥ 1).
39//! - `-m`, `--mask`: mask middle segments with long mismatch runs (default: 5).
40//! - `-t`, `--threads`: number of threads used for graph construction and analysis
41//! (default: 1).
42//! - `-r`, `--recode`: amino-acid recoding scheme (default: `sr6`).
43//! - `--nj`: generate a NJ tree from kamino alignment [nj=false]
44//! - `-v`, `--version`: print version information and exit.
45//!
46//!
47//! ## Important things to optimize
48//! The main parameters governing the number of phylogenetic positions in the final alignment are
49//! the k-mer size (-k), the depth of the recursive graph traversal (-d), and the minimum sample
50//! frequency (-f).
51//!
52//! The default k-mer size has already been chosen to maximise the final alignment length, and
53//! increasing it usually does not substantially increase the number of variant groups. It may,
54//! however, be useful to decrease the k-mer size from 14 to 13 if memory consumption is too high.
55//!
56//! Increasing the depth of the recursive graph traversal (e.g. from 12 to 16) generally increases
57//! the size of the final alignment, as kamino detects more variant groups during graph traversal.
58//! This is typically the most effective approach if the alignment is deemed too short, but results
59//! in increased runtimes as the program spends more time traversing the graph.
60//!
61//! Finally, larger alignments can also be produced by decreasing the minimum fraction of samples
62//! required to carry an amino acid (e.g. from 0.85 to 0.8), at the cost of increased missing data
63//! in the final alignment. Missing data are represented by '-' (missing amino acid) and 'X'
64//! (ambiguous or masked amino acid).
65//!
66//!
67//! ## Less important parameters
68//! Besides testing/benchmarking, I would not recommend modifying these parameter values.
69//!
70//! The number of constant positions in the final alignment can be adjusted with the --constant parameter. These are
71//! taken from the left flank of the end k-mer in each variant group, next to the middle positions. Because these
72//! positions are recoded, some may become polymorphic once converted back to amino acids. Using the default c=3,
73//! constant positions represent 50 to 60% of the alignment.
74//!
75//! The `--mask parameter` controls the amino-acid masking performed by kamino to prevent long runs of polymorphism from being
76//! retained in the final alignment. These correspond to genuine but unwanted polymorphisms (e.g., micro-inversions) or,
77//! less frequently, errors made by kamino (“misaligned” paths due to the presence of two consecutive indels). The minimum length
78//! of polymorphism runs to be masked can be decreased using this parameter to be more stringent.
79//!
80//! The `--length-middle` parameter is used to filter out long variant groups. Increase this parameter to allow longer
81//! variant groups to be retained in the final alignment.
82//!
83//! Finally, the 6-letter recoding scheme can be modified using the --recode parameter, although the default sr6 recoding
84//! scheme performed the best in most of my tests (sr6 ≥ dayhoff6 ≫ kgb6).
85//!
86//!
87//! ## Output files
88//! The names of the output files are controlled by a prefix (-o; default: `kamino`). The prefix
89//! may include a directory path (e.g. `-o my_analyses/taxon1`). Note that the output directory is not
90//! created by kamino and must already exist.
91//!
92//! The three output files are:
93//! - `<prefix>_alignment.fas`: FASTA amino acid alignment of all samples.
94//! - `<prefix>_missing.tsv`: Tab-delimited per-sample missingness percentages.
95//! - `<prefix>_partitions.tsv`: Tab-delimited variant group coordinates (0-based) in the FASTA
96//! alignment, along with consensus protein names when the input proteomes are annotated.
97//!
98//! Additionally, a Neighbor-Joining (NJ) tree can be produced from the amino acid alignment
99//! when the `--nj` argument is specified. Pairwise distances are computed using an F81
100//! correction with LG stationary amino-acid frequencies. The resulting tree provides an
101//! overview of isolate relationships and is not intended for detailed phylogenetic inference.
102//!
103use anyhow::Context;
104use clap::{ArgAction, Parser};
105
106mod bubbles;
107mod filter_groups;
108mod graph;
109mod io;
110mod output;
111mod phylo;
112mod proba_filter;
113mod protein_prediction;
114mod recode;
115mod revert_aminoacid;
116mod traverse;
117//mod util;
118
119pub use recode::RecodeScheme;
120
121struct TraversalArtifacts {
122 groups: traverse::VariantGroups,
123 species_names: Vec<String>,
124 n_species: usize,
125 recode_scheme: RecodeScheme,
126}
127
128#[allow(clippy::too_many_arguments)]
129fn run_traverse(
130 species_inputs: &[io::SpeciesInput],
131 k: usize,
132 min_freq: f32,
133 depth: usize,
134 length_middle: usize,
135 num_threads: usize,
136 recode_scheme: RecodeScheme,
137) -> anyhow::Result<TraversalArtifacts> {
138 // Build global graph
139 let mut g = graph::Graph::new(k, recode_scheme);
140 io::build_graph_from_inputs(
141 species_inputs,
142 k,
143 min_freq,
144 length_middle,
145 &mut g,
146 num_threads,
147 )?;
148
149 // Basic stats
150 io::print_graph_size(&g);
151 //io::print_outdegree_histogram(&g);
152
153 // Bubble endpoints
154 let (start_kmers, end_kmers) = bubbles::find_bubble_endpoints(&g, min_freq, num_threads);
155 eprintln!(
156 "bubble endpoints: start={} end={}",
157 start_kmers.len(),
158 end_kmers.len()
159 );
160
161 // Traverse VGs
162 let groups = traverse::find_variant_groups(
163 &g,
164 &start_kmers,
165 &end_kmers,
166 depth,
167 min_freq,
168 length_middle,
169 num_threads,
170 );
171 let total_paths: usize = groups.values().map(|v| v.len()).sum();
172 eprintln!(
173 "variant groups: groups={} paths={}",
174 groups.len(),
175 total_paths
176 );
177
178 // Emit amino-acid sequences per path for each variant group (call disabled, keep for debugging)
179 // traverse::print_variant_group_sequences(&g, &groups);
180
181 Ok(TraversalArtifacts {
182 groups,
183 species_names: g.species_names.clone(),
184 n_species: g.n_species,
185 recode_scheme: g.recode_scheme,
186 })
187}
188
189/// Build a node-based, colored de Bruijn graph from amino-acid proteomes and analyze bubbles.
190#[derive(Parser, Debug)]
191#[command(name = "kamino", author, version, about, disable_version_flag = true)]
192#[command(
193 group = clap::ArgGroup::new("input_source")
194 .required(true)
195 .multiple(true)
196 .args(["input", "input_file"])
197)]
198pub struct Args {
199 /// Input directory with FASTA proteomes (plain or .gz)
200 #[arg(short, long = "input-directory")]
201 pub input: Option<std::path::PathBuf>,
202
203 /// Tab-delimited file mapping species name to proteome path
204 #[arg(short = 'I', long = "input-file")]
205 pub input_file: Option<std::path::PathBuf>,
206
207 /// Treat input files as bacterial genomes and predict proteomes before analysis
208 #[arg(long = "genomes", default_value_t = false, hide_default_value = true)]
209 pub genomes: bool,
210
211 /// K-mer length [k=14]
212 #[arg(short, long)]
213 pub k: Option<usize>,
214
215 /// Minimal fraction of samples with an amino-acid per position [m=0.85]
216 #[arg(
217 short = 'f',
218 long = "min-freq",
219 default_value_t = 0.85,
220 hide_default_value = true
221 )]
222 pub min_freq: f32,
223
224 /// Maximum traversal depth from each start node [d=12]
225 #[arg(short, long, default_value_t = 12, hide_default_value = true)]
226 pub depth: usize,
227
228 /// Output prefix [o=kamino]
229 #[arg(short, long, default_value = "kamino")]
230 pub output: std::path::PathBuf,
231
232 /// Number of constant positions to keep from the in-bubble k-mer [c=3]
233 #[arg(short, long)]
234 pub constant: Option<usize>,
235
236 /// Maximum number of middle positions per variant group [l=35]
237 #[arg(short = 'l', long = "length-middle", default_value_t = 35, hide_default_value = true)]
238 pub length_middle: usize,
239
240 /// Mask middle segments with long mismatch runs [m=5]
241 #[arg(
242 short = 'm',
243 long = "mask",
244 default_value_t = 5,
245 hide_default_value = true
246 )]
247 pub mask: usize,
248
249 /// Number of threads [t=1]
250 #[arg(short = 't', long, default_value_t = 1, hide_default_value = true)]
251 pub threads: usize,
252
253 /// Recoding scheme [r=sr6]
254 #[arg(short = 'r', long = "recode", value_enum, default_value_t = RecodeScheme::SR6, hide_default_value = true)]
255 pub recode: RecodeScheme,
256
257 /// Generate a NJ tree from kamino alignment [nj=false]
258 #[arg(long = "nj", default_value_t = false, hide_default_value = true)]
259 pub nj: bool,
260
261 /// Display version information.
262 #[arg(short = 'v', long = "version", action = ArgAction::Version)]
263 pub version: (),
264}
265
266pub fn run_with_args(args: Args) -> anyhow::Result<()> {
267 // k defaults and limits for SR6
268 let default_k = 14usize;
269 let max_k = 21usize;
270 let default_constant = 3usize;
271
272 anyhow::ensure!(
273 args.min_freq >= 0.6,
274 "min_freq ({}) must be ≥ 0.6.",
275 args.min_freq
276 );
277 let min_freq = args.min_freq;
278
279 let k = args.k.unwrap_or(default_k);
280 anyhow::ensure!(
281 (2..=max_k).contains(&k),
282 "k={} is invalid for {}: allowed range is 2..={} (default {})",
283 k,
284 args.recode,
285 max_k,
286 default_k
287 );
288
289 // Validate constant against k-1
290 let k1 = k - 1;
291 let constant = args.constant.unwrap_or_else(|| default_constant.min(k1));
292 anyhow::ensure!(
293 constant <= k1,
294 "constant ({}) must be ≤ k-1 ({}).",
295 constant,
296 k1
297 );
298
299 anyhow::ensure!(
300 args.length_middle >= 1,
301 "length_middle ({}) must be ≥ 1.",
302 args.length_middle
303 );
304
305 // Decide how many threads to use (default: 1)
306 anyhow::ensure!(args.threads >= 1, "threads must be ≥ 1");
307
308 eprintln!("kamino v{}", env!("CARGO_PKG_VERSION"));
309 let mut species_inputs = Vec::new();
310 let mut input_labels: Vec<String> = Vec::new();
311
312 if let Some(table) = args.input_file.as_ref() {
313 input_labels.push(format!("input_file={}", table.display()));
314 species_inputs.extend(io::collect_species_inputs_from_table(table)?);
315 }
316
317 if let Some(dir) = args.input.as_ref() {
318 input_labels.push(format!("input={}", dir.display()));
319 species_inputs.extend(io::collect_species_inputs_from_dir(dir)?);
320 }
321
322 let output_dir = args
323 .output
324 .parent()
325 .map(std::path::Path::to_path_buf)
326 .unwrap_or_else(|| std::path::PathBuf::from("."));
327 let mut genomes_tmpdir = None;
328
329 let input_label = input_labels.join(" ");
330
331 eprintln!(
332 "parameters: k={} constant={} min_freq={} depth={} length_middle={} mask={} threads={} recode={} {} output={}",
333 k,
334 constant,
335 min_freq,
336 args.depth,
337 args.length_middle,
338 args.mask,
339 args.threads,
340 args.recode,
341 input_label,
342 args.output.display()
343 );
344
345 if args.genomes {
346 eprintln!("genome files: {}", species_inputs.len());
347
348 let tmpdir = tempfile::Builder::new()
349 .prefix("tmp_kamino_")
350 .rand_bytes(6)
351 .tempdir_in(&output_dir)
352 .with_context(|| {
353 format!(
354 "create temporary directory for predicted proteomes in {}",
355 output_dir.display()
356 )
357 })?;
358
359 let predicted =
360 protein_prediction::predict_proteomes(&species_inputs, tmpdir.path(), args.threads)?;
361 species_inputs = predicted;
362 input_labels.push("genomes=true".to_string());
363 genomes_tmpdir = Some(tmpdir);
364 }
365
366 anyhow::ensure!(
367 !species_inputs.is_empty(),
368 "At least one input source with files must be provided."
369 );
370
371 let traversal = run_traverse(
372 &species_inputs,
373 k,
374 min_freq,
375 args.depth,
376 args.length_middle,
377 args.threads,
378 args.recode,
379 )?;
380
381 let (alignment_len, alignment_missing_pct) = output::write_outputs_with_head(
382 &species_inputs,
383 &args.output,
384 &traversal.species_names,
385 traversal.n_species,
386 traversal.recode_scheme,
387 &traversal.groups,
388 k,
389 constant,
390 min_freq,
391 args.mask,
392 args.threads,
393 args.nj,
394 )?;
395
396 let (fas_path, tsv_path, partitions_path, tree_path) = output::output_paths(&args.output);
397 eprintln!(
398 "alignment: length={} missing={:.1}%",
399 alignment_len, alignment_missing_pct
400 );
401
402 if args.nj {
403 eprintln!(
404 "output files: {}, {}, {}, and {}",
405 fas_path.display(),
406 tsv_path.display(),
407 partitions_path.display(),
408 tree_path.display()
409 );
410 } else {
411 eprintln!(
412 "output files: {}, {}, and {}",
413 fas_path.display(),
414 tsv_path.display(),
415 partitions_path.display()
416 );
417 }
418
419 drop(genomes_tmpdir);
420
421 Ok(())
422}