Batch-cluster alignment across N genomes.
When annotating many genomes against the same gapseq reference seqdb, the per-genome loop is dominated by redundant alignments — closely related genomes share a lot of proteins. The batch-cluster path amortizes that cost:
- concat_genomes merges every genome’s proteome into one FASTA, rewriting each header as <GENOMEID>|<orig_header> so we can trace any downstream hit back to its genome of origin.
- cluster_with_mmseqs calls mmseqs easy-cluster to produce (a) a representative FASTA and (b) a two-column <rep>\t<member> TSV.
- BatchClusterAligner::align_genomes then runs a single alignment of the gapseq query FASTA against the representatives and expands each rep-hit into its cluster members, bucketing the results by genome.
The expansion is an approximation: a member inherits its representative’s bitscore/identity. Users who need per-member precision can re-align the affected members with any of the standard aligners — typically a tiny fraction of the original N-genome cost.
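The expansion step can be pictured with a minimal sketch. The `Hit` struct and `expand_hits` function below are hypothetical stand-ins, not the crate's actual types or API, and the sketch assumes the cluster map lists each representative among its own members, as in the two-column TSV.

```rust
use std::collections::{BTreeMap, HashMap};

/// Hypothetical hit record; the real hit type likely carries more fields.
#[derive(Debug, Clone, PartialEq)]
struct Hit {
    query: String,
    /// For representative hits: the concatenated header `GENOMEID|orig_header`.
    target: String,
    bitscore: f64,
    identity: f64,
}

/// Expand hits against representatives into hits against every cluster
/// member, bucketed by genome ID. Each member inherits the bitscore and
/// identity of its representative (the approximation described above).
fn expand_hits(
    rep_hits: &[Hit],
    clusters: &HashMap<String, Vec<String>>,
) -> BTreeMap<String, Vec<Hit>> {
    let mut by_genome: BTreeMap<String, Vec<Hit>> = BTreeMap::new();
    for hit in rep_hits {
        // Representatives without a cluster entry are skipped in this sketch.
        let Some(members) = clusters.get(&hit.target) else { continue };
        for member in members {
            // Split on the first '|' only; skip headers without a genome prefix.
            let Some((genome, orig)) = member.split_once('|') else { continue };
            by_genome.entry(genome.to_string()).or_default().push(Hit {
                query: hit.query.clone(),
                target: orig.to_string(),
                bitscore: hit.bitscore,
                identity: hit.identity,
            });
        }
    }
    by_genome
}
```

A member whose inherited score sits close to a decision threshold is the natural candidate for the per-member re-alignment mentioned above.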
Structs§
- BatchClusterAligner - Driver for the batch-cluster mode.
- ClusterResult - Result of an mmseqs easy-cluster run.
- GenomeHitSet - Hits for one genome after batch-cluster expansion. Emitted in original input order so downstream code can pair them up with GenomeInput by index without extra lookups.
- GenomeInput - A genome’s protein FASTA plus a short identifier used to prefix every header on concatenation.
Constants§
- GENOME_SEP - Separator between the genome ID and the original FASTA header. A pipe | matches the convention used in the R prepare_batch_alignments.R pipeline; the same character also appears in UniProt-style sp|P12345|NAME accessions, so parsers must split on the first | only.
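To illustrate why only the first separator may be used, here is a small sketch. The constant `SEP` and the helper `split_first` are hypothetical names for this example; the actual type and value of GENOME_SEP are assumed, not confirmed.

```rust
/// Assumed separator value for this sketch (the real constant may be a `&str`).
const SEP: char = '|';

/// Split a concatenated header into (genome ID, original header) at the
/// first separator only, leaving later pipes in the original header intact.
fn split_first(header: &str) -> Option<(&str, &str)> {
    header.split_once(SEP)
}

/// Naive alternative that splits on every separator. Shown only to make
/// the failure mode visible: a UniProt-style accession is torn apart.
fn split_all(header: &str) -> Vec<&str> {
    header.split(SEP).collect()
}
```

With a header such as `ecoli|sp|P12345|NAME`, `split_first` keeps `sp|P12345|NAME` whole, whereas `split_all` yields four fragments and loses the boundary between genome ID and accession.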
Functions§
- cluster_with_mmseqs
- concat_genomes - Produce a single FASTA at out whose headers are rewritten to >GENOMEID|ORIGHEADER. Fails if any genome ID contains | (because the downstream split would be ambiguous).
- parse_cluster_tsv
- split_genome_prefix - Pull GENOMEID out of a concatenated header (first |-separated token). Returns None on a malformed header.
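
As a rough illustration of the header rewriting and the cluster TSV layout described on this page, the sketch below works on in-memory strings rather than files. `prefix_headers` and `parse_cluster_lines` are hypothetical helpers; their signatures and error types are assumptions and do not mirror the real concat_genomes or parse_cluster_tsv.

```rust
use std::collections::HashMap;

/// Rewrite every FASTA header in `fasta` to `>GENOMEID|ORIGHEADER`.
/// Rejects genome IDs containing '|', since the later split on the first
/// '|' could no longer recover the ID unambiguously.
fn prefix_headers(genome_id: &str, fasta: &str) -> Result<String, String> {
    if genome_id.contains('|') {
        return Err(format!("genome ID {genome_id:?} contains '|'"));
    }
    let mut out = String::with_capacity(fasta.len());
    for line in fasta.lines() {
        match line.strip_prefix('>') {
            // Header line: insert the genome ID and separator after '>'.
            Some(header) => {
                out.push('>');
                out.push_str(genome_id);
                out.push('|');
                out.push_str(header);
            }
            // Sequence line: copy unchanged.
            None => out.push_str(line),
        }
        out.push('\n');
    }
    Ok(out)
}

/// Parse two-column `rep\tmember` lines into a representative -> members map.
/// Lines without a tab are ignored in this sketch.
fn parse_cluster_lines(tsv: &str) -> HashMap<String, Vec<String>> {
    let mut clusters: HashMap<String, Vec<String>> = HashMap::new();
    for line in tsv.lines() {
        if let Some((rep, member)) = line.split_once('\t') {
            clusters
                .entry(rep.to_string())
                .or_default()
                .push(member.to_string());
        }
    }
    clusters
}
```

Rejecting the ID up front keeps the failure close to its cause, instead of surfacing later as hits attributed to the wrong genome.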