//! Sequence alignment abstraction for gapsmith.
//!
//! Every aligner (blast, diamond, mmseqs2, precomputed TSV) exposes a common
//! [`Aligner`] trait that takes a query FASTA and a target FASTA and returns
//! a vector of [`Hit`]. Internally the shell-out implementations manage
//! their own temp work-directories so callers just see FASTA-in, hits-out.
//!
//! # Example
//!
//! ```ignore
//! use gapsmith_align::{AlignOpts, Aligner, DiamondAligner};
//! use std::path::Path;
//!
//! let aligner = DiamondAligner;
//! let hits = aligner.run(
//!     Path::new("query.faa"),
//!     &[Path::new("reference.faa")],
//!     &AlignOpts::default(),
//! ).unwrap();
//! for h in hits.iter().take(5) {
//!     println!("{}\t{}\t{}", h.qseqid, h.pident, h.bitscore);
//! }
//! ```
//!
//! # Backend selection
//!
//! - [`BlastpAligner`] — protein-vs-protein; always available if NCBI BLAST+
//! is on `PATH`. Slow on large genomes, but it is gapseq's reference backend.
//! - [`TblastnAligner`] — protein query vs nucleotide subject (rare; used
//! for nucleotide-based reference FASTAs).
//! - [`DiamondAligner`] — 5-20× faster than BLASTp on large proteomes;
//! comparable sensitivity at `--more-sensitive` (which we enable by default).
//! - [`Mmseqs2Aligner`] — fast k-mer-based alternative; we replicate
//! gapseq's 4-command pipeline (createdb ×2 → search → convertalis) rather
//! than `easy-search`, because the latter reports full-alignment
//! identities instead of the k-mer prefilter identities gapseq
//! calibrates against.
//! - [`PrecomputedTsvAligner`] — skips the aligner entirely; reads a TSV
//! the caller produced with their own tool. Used by `gapsmith`'s
//! `--aligner precomputed` mode and by [`BatchClusterAligner`].
//! - [`BatchClusterAligner`] — new in gapsmith. mmseqs2-clusters N
//! genomes, runs one alignment against the reference, then expands the
//! cluster membership to per-genome TSVs. Amortises aligner cost over
//! many genomes.
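//!
//! A minimal sketch of the cluster-expansion idea behind
//! [`BatchClusterAligner`], under stated assumptions: `expand_hits`, its map
//! layout, and the rule that every cluster member inherits its
//! representative's hit row are hypothetical illustrations, not this crate's
//! API or its actual expansion logic.
//!
//! ```rust
//! use std::collections::BTreeMap;
//!
//! /// Copy each representative's hit to every cluster member, grouped by genome.
//! ///
//! /// `clusters`: representative id -> (genome, member protein id) pairs.
//! /// `rep_hits`: (representative id, remaining TSV columns) pairs.
//! /// Both layouts are assumptions made for this sketch.
//! fn expand_hits(
//!     clusters: &BTreeMap<&str, Vec<(&str, &str)>>,
//!     rep_hits: &[(&str, &str)],
//! ) -> BTreeMap<String, Vec<String>> {
//!     let mut per_genome: BTreeMap<String, Vec<String>> = BTreeMap::new();
//!     for &(rep, rest) in rep_hits {
//!         // Representatives missing from the cluster map are skipped.
//!         let Some(members) = clusters.get(rep) else { continue };
//!         for &(genome, protein) in members {
//!             // Rewrite the query id to the member's id; keep the other columns.
//!             per_genome
//!                 .entry(genome.to_string())
//!                 .or_default()
//!                 .push(format!("{protein}\t{rest}"));
//!         }
//!     }
//!     per_genome
//! }
//!
//! fn main() {
//!     let mut clusters = BTreeMap::new();
//!     clusters.insert("g1_p1", vec![("genome1", "g1_p1"), ("genome2", "g2_p7")]);
//!     let hits = [("g1_p1", "87.5\t1e-50\t210.3")];
//!     for (genome, rows) in expand_hits(&clusters, &hits) {
//!         println!("{genome}: {rows:?}");
//!     }
//! }
//! ```
//!
//! One alignment row per representative fans out to one row per member, which
//! is where the amortisation comes from: the aligner runs once, and the
//! per-genome TSVs are produced by a cheap lookup.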
//!
//! Columns always emitted by our wrappers (matching gapseq's convention):
//!
//! | column   | meaning                                               |
//! |----------|-------------------------------------------------------|
//! | qseqid   | query identifier (FASTA header up to the first space) |
//! | pident   | percent identity (0–100)                              |
//! | evalue   | BLAST-style e-value                                   |
//! | bitscore | bit score                                             |
//! | qcov     | query coverage (0–100)                                |
//! | stitle   | subject title (may contain spaces)                    |
//! | sstart   | subject start                                         |
//! | send     | subject end                                           |
//!
//! This keeps parity with `src/gapseq_find.sh` lines 249–255.
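//!
//! A minimal sketch of reading one row with these columns, assuming
//! tab-separated fields in the order listed in the table. `HitRow` and
//! `parse_row` are hypothetical names used for illustration; they are not
//! the crate's [`Hit`] type or its TSV parser.
//!
//! ```rust
//! /// Hypothetical stand-in for one row of the eight-column TSV.
//! #[derive(Debug, PartialEq)]
//! struct HitRow {
//!     qseqid: String,
//!     pident: f64,
//!     evalue: f64,
//!     bitscore: f64,
//!     qcov: f64,
//!     stitle: String,
//!     sstart: u64,
//!     send: u64,
//! }
//!
//! /// Parse one line; `None` on a wrong column count or an unparsable number.
//! fn parse_row(line: &str) -> Option<HitRow> {
//!     // Split on tabs only, so spaces inside `stitle` survive intact.
//!     let cols: Vec<&str> = line
//!         .trim_end_matches(|c: char| c == '\r' || c == '\n')
//!         .split('\t')
//!         .collect();
//!     if cols.len() != 8 {
//!         return None;
//!     }
//!     Some(HitRow {
//!         qseqid: cols[0].to_string(),
//!         pident: cols[1].parse().ok()?,
//!         evalue: cols[2].parse().ok()?,
//!         bitscore: cols[3].parse().ok()?,
//!         qcov: cols[4].parse().ok()?,
//!         stitle: cols[5].to_string(),
//!         sstart: cols[6].parse().ok()?,
//!         send: cols[7].parse().ok()?,
//!     })
//! }
//!
//! fn main() {
//!     let line = "prot1\t87.5\t1e-50\t210.3\t92.0\tsome enzyme EC 1.1.1.1\t12\t340";
//!     let row = parse_row(line).expect("eight well-formed columns");
//!     println!("{} {} {}", row.qseqid, row.pident, row.stitle);
//! }
//! ```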
pub use ;
// Re-exported for external parity tests that need to reuse the TSV parser.
// Not part of the stable public API; prefer the per-backend aligners.
pub use ;
pub use DiamondAligner;
pub use AlignError;
pub use Hit;
pub use Mmseqs2Aligner;
pub use PrecomputedTsvAligner;
use std::path::Path;
/// Options tuning an alignment run. Sensible gapseq defaults: coverage 75%,
/// use all detected cores, no extra user args.
/// Common trait implemented by every aligner backend.