Expand description
Sequence alignment abstraction for gapsmith.
Every aligner (blast, diamond, mmseqs2, precomputed TSV) exposes a common
Aligner trait that takes a query FASTA and a target FASTA and returns
a vector of Hit. Internally the shell-out implementations manage
their own temp work-directories so callers just see FASTA-in, hits-out.
§Example
ⓘ
use gapsmith_align::{AlignOpts, Aligner, DiamondAligner};
use std::path::Path;
let aligner = DiamondAligner;
let hits = aligner.run(
Path::new("query.faa"),
&[Path::new("reference.faa")],
&AlignOpts::default(),
).unwrap();
for h in hits.iter().take(5) {
println!("{}\t{}\t{}", h.qseqid, h.pident, h.bitscore);
}§Backend selection
BlastpAligner— protein-vs-protein; always available if NCBI BLAST+ is onPATH. Slow on large genomes but the gapseq reference.TblastnAligner— protein query vs nucleotide subject (rare; used for nucleotide-based reference FASTAs).DiamondAligner— 5-20× faster than BLASTp on large proteomes; comparable sensitivity at--more-sensitive(which we default on).Mmseqs2Aligner— fast k-mer-based alternative; we replicate gapseq’s 4-command pipeline (createdb → search → convertalis) rather thaneasy-search, because the latter reports full-alignment identities instead of the k-mer prefilter identities gapseq calibrates against.PrecomputedTsvAligner— skips the aligner entirely; reads a TSV the caller produced with their own tool. Used bygapsmith’s--aligner precomputedmode and byBatchClusterAligner.BatchClusterAligner— new in gapsmith. mmseqs2-clusters N genomes, runs one alignment against the reference, then expands the cluster membership to per-genome TSVs. Amortises aligner cost over many genomes.
Columns always emitted by our wrappers (matching gapseq’s convention):
| column | meaning |
|---|---|
| qseqid | query identifier (full FASTA header, up to a space) |
| pident | percent identity (0–100) |
| evalue | BLAST-style e-value |
| bitscore | bit score |
| qcov | query coverage (0–100) |
| stitle | subject title (may contain spaces) |
| sstart | subject start |
| send | subject end |
This keeps parity with src/gapseq_find.sh lines 249–255.
Re-exports§
pub use batch::BatchClusterAligner;pub use batch::ClusterResult;pub use batch::GenomeHitSet;pub use batch::GenomeInput;pub use blast::BlastpAligner;pub use blast::TblastnAligner;pub use diamond::DiamondAligner;pub use error::AlignError;pub use hit::Hit;pub use mmseqs2::Mmseqs2Aligner;pub use precomputed::PrecomputedTsvAligner;
Modules§
- batch
- Batch-cluster alignment across N genomes.
- blast
- BLAST+ aligners:
blastp(protein-protein) andtblastn(protein query vs. nucleotide database, translated in all 6 frames). - diamond
- Diamond aligner (protein-protein).
- error
- Error type shared by every aligner backend.
- hit
- The canonical hit structure produced by every aligner backend.
- mmseqs2
- MMseqs2 aligner (protein-protein).
- precomputed
- Pre-computed alignment-TSV aligner.
- tsv
- TSV parser shared by every aligner backend.
Structs§
- Align
Opts - Options tuning an alignment run. Sensible gapseq defaults: coverage 75%, use all detected cores, no extra user args.
Traits§
- Aligner
- Common trait implemented by every aligner backend.