Crate gapsmith_align

Sequence alignment abstraction for gapsmith.

Every aligner backend (BLAST+, Diamond, MMseqs2, precomputed TSV) implements a common Aligner trait that takes a query FASTA and one or more target FASTAs and returns a vector of Hit. The shell-out implementations manage their own temporary work directories internally, so callers only see FASTA in, hits out.
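To make the FASTA-in, hits-out contract concrete, the following is a minimal sketch of how caller code can depend on the trait alone. The `Hit` and `Aligner` definitions here are simplified, hypothetical stand-ins (options are omitted and the error type is a plain `String`), and `CannedAligner` and `best_bitscore` are invented for illustration; none of them is the crate's real signature.

```rust
use std::path::Path;

/// Hypothetical stand-in for the crate's `Hit`; only three fields are shown.
#[derive(Debug, Clone, PartialEq)]
pub struct Hit {
    pub qseqid: String,
    pub pident: f32,
    pub bitscore: f32,
}

/// Hypothetical stand-in for the crate's `Aligner` trait (options omitted).
pub trait Aligner {
    fn run(&self, query: &Path, targets: &[&Path]) -> Result<Vec<Hit>, String>;
}

/// Toy backend that returns canned hits instead of shelling out to a tool.
pub struct CannedAligner {
    pub hits: Vec<Hit>,
}

impl Aligner for CannedAligner {
    fn run(&self, _query: &Path, _targets: &[&Path]) -> Result<Vec<Hit>, String> {
        Ok(self.hits.clone())
    }
}

/// Caller code sees only the trait, so any backend can be swapped in.
/// Returns the highest bit score, or `None` when there are no hits.
pub fn best_bitscore(
    aligner: &dyn Aligner,
    query: &Path,
    targets: &[&Path],
) -> Result<Option<f32>, String> {
    let hits = aligner.run(query, targets)?;
    let best = hits
        .iter()
        .map(|h| h.bitscore)
        .fold(None::<f32>, |best, s| match best {
            Some(b) if b >= s => Some(b),
            _ => Some(s),
        });
    Ok(best)
}
```

Because `best_bitscore` takes `&dyn Aligner`, a test can pass a canned backend while production code passes one that actually invokes an external tool.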

§Example

use gapsmith_align::{AlignOpts, Aligner, DiamondAligner};
use std::path::Path;

let aligner = DiamondAligner;
let hits = aligner.run(
    Path::new("query.faa"),
    &[Path::new("reference.faa")],
    &AlignOpts::default(),
).unwrap();
for h in hits.iter().take(5) {
    println!("{}\t{}\t{}", h.qseqid, h.pident, h.bitscore);
}

§Backend selection

BlastpAligner: protein-vs-protein; always available if NCBI BLAST+ is on PATH. Slow on large genomes, but it is the backend gapseq uses as its reference.
TblastnAligner: protein query vs nucleotide subject (rare; used for nucleotide-based reference FASTAs).
DiamondAligner: 5-20× faster than BLASTp on large proteomes; comparable sensitivity at --more-sensitive (which we enable by default).
Mmseqs2Aligner: fast k-mer-based alternative; we replicate gapseq’s 4-command pipeline (createdb → search → convertalis) rather than easy-search, because the latter reports full-alignment identities instead of the k-mer prefilter identities gapseq calibrates against.
PrecomputedTsvAligner: skips the aligner entirely; reads a TSV the caller produced with their own tool. Used by gapsmith’s --aligner precomputed mode and by BatchClusterAligner.
BatchClusterAligner: new in gapsmith. Clusters N genomes with mmseqs2, runs a single alignment against the reference, then expands the cluster membership into per-genome TSVs. This amortises aligner cost over many genomes.
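The cluster-then-expand idea behind BatchClusterAligner can be sketched as follows. This is an illustration under stated assumptions, not the crate's implementation: the types `RepHit` and `Member` and the function `expand_hits` are hypothetical, and the sketch assumes every cluster member simply inherits its representative's hits unchanged.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified hit reported for a cluster representative.
#[derive(Debug, Clone, PartialEq)]
pub struct RepHit {
    pub rep_id: String,
    pub stitle: String,
    pub bitscore: f32,
}

/// One cluster member: the genome it came from and its own sequence id.
#[derive(Debug, Clone)]
pub struct Member {
    pub genome: String,
    pub seq_id: String,
}

/// Copy each representative's hits to every member of its cluster and
/// group the results by genome. Rows are (member id, subject title, bit score).
/// Assumption: members inherit the representative's scores as-is.
pub fn expand_hits(
    clusters: &BTreeMap<String, Vec<Member>>, // representative id -> members
    rep_hits: &[RepHit],
) -> BTreeMap<String, Vec<(String, String, f32)>> {
    let mut per_genome: BTreeMap<String, Vec<(String, String, f32)>> = BTreeMap::new();
    for hit in rep_hits {
        // Hits whose representative is unknown are skipped in this sketch.
        if let Some(members) = clusters.get(&hit.rep_id) {
            for m in members {
                per_genome
                    .entry(m.genome.clone())
                    .or_default()
                    .push((m.seq_id.clone(), hit.stitle.clone(), hit.bitscore));
            }
        }
    }
    per_genome
}
```

The alignment runs once per representative rather than once per sequence per genome, which is where the amortisation comes from; the expansion step is only map lookups.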

Columns always emitted by our wrappers (matching gapseq’s convention):

column	meaning
qseqid	query identifier (FASTA header up to the first space)
pident	percent identity (0–100)
evalue	BLAST-style e-value
bitscore	bit score
qcov	query coverage (0–100)
stitle	subject title (may contain spaces)
sstart	subject start
send	subject end

This keeps parity with src/gapseq_find.sh lines 249–255.
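As an illustration of consuming such a row, here is a minimal parsing sketch. It assumes the columns are tab-separated and appear in the order listed above; the struct `HitRow`, its field types, and the functions `field` and `parse_hit_row` are hypothetical and are not the crate's `Hit` type or `tsv` parser.

```rust
use std::str::FromStr;

/// Hypothetical record holding the eight columns from the table above.
#[derive(Debug, Clone, PartialEq)]
pub struct HitRow {
    pub qseqid: String,
    pub pident: f32,
    pub evalue: f64,
    pub bitscore: f32,
    pub qcov: f32,
    pub stitle: String,
    pub sstart: u64,
    pub send: u64,
}

/// Parse one column, naming it in the error message on failure.
fn field<T: FromStr>(cols: &[&str], idx: usize, name: &str) -> Result<T, String> {
    cols[idx]
        .trim()
        .parse::<T>()
        .map_err(|_| format!("column {}: cannot parse {:?}", name, cols[idx]))
}

/// Parse one tab-separated line. Splitting on tabs only means that
/// spaces inside `stitle` are preserved.
pub fn parse_hit_row(line: &str) -> Result<HitRow, String> {
    let line = line.trim_end_matches(|c: char| c == '\r' || c == '\n');
    let cols: Vec<&str> = line.split('\t').collect();
    if cols.len() != 8 {
        return Err(format!("expected 8 columns, found {}", cols.len()));
    }
    Ok(HitRow {
        qseqid: cols[0].to_string(),
        pident: field(&cols, 1, "pident")?,
        evalue: field(&cols, 2, "evalue")?,
        bitscore: field(&cols, 3, "bitscore")?,
        qcov: field(&cols, 4, "qcov")?,
        stitle: cols[5].to_string(),
        sstart: field(&cols, 6, "sstart")?,
        send: field(&cols, 7, "send")?,
    })
}
```

Rejecting rows with the wrong column count up front keeps a malformed precomputed TSV from silently shifting values into the wrong fields.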

Re-exports§

pub use batch::BatchClusterAligner;
pub use batch::ClusterResult;
pub use batch::GenomeHitSet;
pub use batch::GenomeInput;
pub use blast::BlastpAligner;
pub use blast::TblastnAligner;
pub use diamond::DiamondAligner;
pub use error::AlignError;
pub use hit::Hit;
pub use mmseqs2::Mmseqs2Aligner;
pub use precomputed::PrecomputedTsvAligner;

Modules§

batch: Batch-cluster alignment across N genomes.
blast: BLAST+ aligners: blastp (protein-protein) and tblastn (protein query vs. nucleotide database, translated in all 6 frames).
diamond: Diamond aligner (protein-protein).
error: Error type shared by every aligner backend.
hit: The canonical hit structure produced by every aligner backend.
mmseqs2: MMseqs2 aligner (protein-protein).
precomputed: Pre-computed alignment-TSV aligner.
tsv: TSV parser shared by every aligner backend.

Structs§

AlignOpts: Options tuning an alignment run. Sensible gapseq defaults: coverage 75%, use all detected cores, no extra user args.
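The defaults described above could be expressed with a `Default` impl along these lines. This is only a sketch: the struct name `OptsSketch` and its field names are hypothetical, and the real AlignOpts may hold different or additional fields.

```rust
use std::thread;

/// Hypothetical options struct mirroring the documented defaults.
#[derive(Debug, Clone, PartialEq)]
pub struct OptsSketch {
    /// Minimum query coverage in percent.
    pub coverage_pct: f32,
    /// Number of worker threads handed to the aligner.
    pub threads: usize,
    /// Extra command-line arguments passed through to the tool.
    pub extra_args: Vec<String>,
}

impl Default for OptsSketch {
    fn default() -> Self {
        // Fall back to a single thread if the core count cannot be detected.
        let threads = thread::available_parallelism()
            .map(|n| n.get())
            .unwrap_or(1);
        Self {
            coverage_pct: 75.0,
            threads,
            extra_args: Vec::new(),
        }
    }
}
```

With a `Default` impl, callers can override a single setting using struct-update syntax, for example `OptsSketch { threads: 2, ..OptsSketch::default() }`.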

Traits§

Aligner: Common trait implemented by every aligner backend.
