Crate gapsmith_align

Sequence alignment abstraction for gapsmith.

Every aligner backend (BLAST, DIAMOND, MMseqs2, precomputed TSV) implements a common Aligner trait that takes a query FASTA and one or more target FASTAs and returns a vector of Hit. The shell-out implementations manage their own temporary work directories internally, so callers see only FASTA in, hits out.

§Example

use gapsmith_align::{AlignOpts, Aligner, DiamondAligner};
use std::path::Path;

let aligner = DiamondAligner;
let hits = aligner.run(
    Path::new("query.faa"),
    &[Path::new("reference.faa")],
    &AlignOpts::default(),
).unwrap();
for h in hits.iter().take(5) {
    println!("{}\t{}\t{}", h.qseqid, h.pident, h.bitscore);
}

§Backend selection

  • BlastpAligner: protein vs. protein; available whenever NCBI BLAST+ is on PATH. Slow on large genomes, but it is the backend gapseq uses as its reference.
  • TblastnAligner: protein query vs. nucleotide subject (rare; used for nucleotide-based reference FASTAs).
  • DiamondAligner: 5–20× faster than BLASTp on large proteomes, with comparable sensitivity at --more-sensitive (which we enable by default).
  • Mmseqs2Aligner: fast k-mer-based alternative. We replicate gapseq’s 4-command pipeline (createdb for the query, createdb for the target, then search → convertalis) rather than using easy-search, because easy-search reports full-alignment identities instead of the k-mer prefilter identities gapseq calibrates against.
  • PrecomputedTsvAligner: skips the aligner entirely and reads a TSV the caller produced with their own tool. Used by gapsmith’s --aligner precomputed mode and by BatchClusterAligner.
  • BatchClusterAligner: new in gapsmith. Clusters N genomes with mmseqs2, runs one alignment against the reference, then expands the cluster membership into per-genome TSVs. This amortises the aligner cost over many genomes.
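
The expansion step that BatchClusterAligner performs can be illustrated with a minimal sketch. It is not the crate's implementation: the types RepHit, Member and MemberHit, the function expand_hits, and the assumption that each hit carries the id of a cluster representative are all hypothetical, chosen only to show how one alignment result can be fanned out to every genome in a cluster.

```rust
use std::collections::HashMap;

/// Hypothetical hit reported for a cluster representative.
#[derive(Debug, Clone, PartialEq)]
pub struct RepHit {
    pub rep_id: String, // id of the representative sequence (assumed field)
    pub bitscore: f64,
}

/// Hypothetical cluster member: the genome it came from and its own sequence id.
#[derive(Debug, Clone, PartialEq)]
pub struct Member {
    pub genome: String,
    pub seq_id: String,
}

/// Hypothetical per-genome hit produced by the expansion.
#[derive(Debug, Clone, PartialEq)]
pub struct MemberHit {
    pub seq_id: String,
    pub bitscore: f64,
}

/// Copy each representative's hits to every member of its cluster and
/// group the copies by genome. Hits whose representative has no cluster
/// entry are skipped.
pub fn expand_hits(
    rep_hits: &[RepHit],
    clusters: &HashMap<String, Vec<Member>>,
) -> HashMap<String, Vec<MemberHit>> {
    let mut per_genome: HashMap<String, Vec<MemberHit>> = HashMap::new();
    for hit in rep_hits {
        if let Some(members) = clusters.get(&hit.rep_id) {
            for m in members {
                per_genome
                    .entry(m.genome.clone())
                    .or_default()
                    .push(MemberHit {
                        seq_id: m.seq_id.clone(),
                        bitscore: hit.bitscore,
                    });
            }
        }
    }
    per_genome
}
```

Under these assumptions the aligner runs once per cluster representative, and the per-genome cost is reduced to a hash-map lookup and a copy.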

Columns always emitted by our wrappers (matching gapseq’s convention):

column    meaning
qseqid    query identifier (full FASTA header, up to a space)
pident    percent identity (0–100)
evalue    BLAST-style e-value
bitscore  bit score
qcov      query coverage (0–100)
stitle    subject title (may contain spaces)
sstart    subject start
send      subject end

This keeps parity with src/gapseq_find.sh lines 249–255.
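
As an illustration of the column layout, here is a minimal sketch of parsing one tab-separated line into a record. It assumes the columns appear in the order listed above and that numeric fields parse as f64 or u64. HitRow and parse_hit_line are hypothetical names; the crate's real Hit type and tsv module may differ.

```rust
/// Hypothetical mirror of the eight columns listed above; the real `Hit`
/// may use different field types or carry extra fields.
#[derive(Debug, Clone, PartialEq)]
pub struct HitRow {
    pub qseqid: String,
    pub pident: f64,
    pub evalue: f64,
    pub bitscore: f64,
    pub qcov: f64,
    pub stitle: String,
    pub sstart: u64,
    pub send: u64,
}

/// Parse one TSV line. Splitting on tabs only keeps spaces inside
/// `stitle` intact.
pub fn parse_hit_line(line: &str) -> Result<HitRow, String> {
    let trimmed = line.trim_end_matches(|c: char| c == '\n' || c == '\r');
    let cols: Vec<&str> = trimmed.split('\t').collect();
    if cols.len() != 8 {
        return Err(format!("expected 8 columns, found {}", cols.len()));
    }
    // Small helpers that attach the column name to any parse error.
    let float = |i: usize, name: &str| -> Result<f64, String> {
        cols[i]
            .trim()
            .parse::<f64>()
            .map_err(|e| format!("bad {} {:?}: {}", name, cols[i], e))
    };
    let int = |i: usize, name: &str| -> Result<u64, String> {
        cols[i]
            .trim()
            .parse::<u64>()
            .map_err(|e| format!("bad {} {:?}: {}", name, cols[i], e))
    };
    Ok(HitRow {
        qseqid: cols[0].to_string(),
        pident: float(1, "pident")?,
        evalue: float(2, "evalue")?,
        bitscore: float(3, "bitscore")?,
        qcov: float(4, "qcov")?,
        stitle: cols[5].to_string(),
        sstart: int(6, "sstart")?,
        send: int(7, "send")?,
    })
}
```

Rejecting lines with the wrong column count up front gives a clearer error than an index panic when a precomputed TSV was produced with a different output format.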

Re-exports§

pub use batch::BatchClusterAligner;
pub use batch::ClusterResult;
pub use batch::GenomeHitSet;
pub use batch::GenomeInput;
pub use blast::BlastpAligner;
pub use blast::TblastnAligner;
pub use diamond::DiamondAligner;
pub use error::AlignError;
pub use hit::Hit;
pub use mmseqs2::Mmseqs2Aligner;
pub use precomputed::PrecomputedTsvAligner;

Modules§

batch
Batch-cluster alignment across N genomes.
blast
BLAST+ aligners: blastp (protein-protein) and tblastn (protein query vs. nucleotide database, translated in all 6 frames).
diamond
Diamond aligner (protein-protein).
error
Error type shared by every aligner backend.
hit
The canonical hit structure produced by every aligner backend.
mmseqs2
MMseqs2 aligner (protein-protein).
precomputed
Pre-computed alignment-TSV aligner.
tsv
TSV parser shared by every aligner backend.

Structs§

AlignOpts
Options tuning an alignment run. Sensible gapseq defaults: coverage 75%, use all detected cores, no extra user args.
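
The defaults described above could be expressed roughly as follows. This is a sketch under assumptions, not the actual AlignOpts definition: the struct name SketchOpts and the field names min_coverage, threads and extra_args are hypothetical.

```rust
use std::thread;

/// Hypothetical stand-in for `AlignOpts`; field names are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct SketchOpts {
    /// Minimum query coverage, in percent.
    pub min_coverage: f64,
    /// Number of worker threads handed to the aligner.
    pub threads: usize,
    /// Extra command-line arguments passed through to the aligner.
    pub extra_args: Vec<String>,
}

impl Default for SketchOpts {
    fn default() -> Self {
        // Use every core the OS reports; fall back to 1 if detection fails.
        let threads = thread::available_parallelism()
            .map(|n| n.get())
            .unwrap_or(1);
        Self {
            min_coverage: 75.0,
            threads,
            extra_args: Vec::new(),
        }
    }
}
```

With a Default implementation like this, a caller can override a single field with struct update syntax, for example SketchOpts { threads: 4, ..Default::default() }, and keep the remaining defaults.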

Traits§

Aligner
Common trait implemented by every aligner backend.
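
To show why a shared trait is convenient, the following sketch defines a simplified trait with the same FASTA-in, hits-out shape and a canned backend that calls no external tool. Everything here is hypothetical (SketchAligner, SketchHit, SketchError, CannedAligner, best_bitscore); the real Aligner trait also takes an AlignOpts argument, and its exact signature may differ.

```rust
use std::path::Path;

/// Hypothetical, reduced hit record.
#[derive(Debug, Clone, PartialEq)]
pub struct SketchHit {
    pub qseqid: String,
    pub pident: f64,
    pub bitscore: f64,
}

/// Hypothetical error type for the sketch.
#[derive(Debug)]
pub struct SketchError(pub String);

/// Simplified stand-in for the `Aligner` trait.
pub trait SketchAligner {
    fn run(&self, query: &Path, targets: &[&Path]) -> Result<Vec<SketchHit>, SketchError>;
}

/// In-memory backend that returns canned hits without running any tool.
pub struct CannedAligner {
    pub hits: Vec<SketchHit>,
}

impl SketchAligner for CannedAligner {
    fn run(&self, _query: &Path, targets: &[&Path]) -> Result<Vec<SketchHit>, SketchError> {
        if targets.is_empty() {
            return Err(SketchError("no target FASTA given".to_string()));
        }
        Ok(self.hits.clone())
    }
}

/// Backend-agnostic helper: works with any implementor via a trait object
/// and returns the highest bit score, or `None` if there were no hits.
pub fn best_bitscore(
    aligner: &dyn SketchAligner,
    query: &Path,
    targets: &[&Path],
) -> Result<Option<f64>, SketchError> {
    let hits = aligner.run(query, targets)?;
    Ok(hits.iter().map(|h| h.bitscore).fold(None, |best, s| match best {
        Some(b) if b >= s => Some(b),
        _ => Some(s),
    }))
}
```

Because downstream code depends only on the trait, a canned backend like this one can stand in for BLAST or DIAMOND in unit tests without those tools being installed.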