Module batch

Batch-cluster alignment across N genomes.

When annotating many genomes against the same gapseq reference seqdb, the per-genome loop is dominated by redundant alignments, because closely related genomes share many proteins. The batch-cluster path amortizes that cost:

  1. concat_genomes merges every genome’s proteome into one FASTA, rewriting each header as <GENOMEID>|<orig_header> so we can trace any downstream hit back to its genome of origin.
  2. cluster_with_mmseqs calls mmseqs easy-cluster to produce (a) a representative FASTA and (b) a two-column <rep>\t<member> TSV.
  3. BatchClusterAligner::align_genomes then runs a single alignment of the gapseq query FASTA against the representatives and expands each rep-hit into its cluster members, bucketing the results by genome.
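The two-column TSV from step 2 can be read into a representative-to-members map. The following is a minimal sketch, not the crate's `parse_cluster_tsv` (whose signature is not shown here): the name `parse_rep_member_tsv`, the string-based input, and the `String` error type are all assumptions made for illustration. The sample data in the usage note also assumes that each representative is listed as a member of its own cluster.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper (not the crate's `parse_cluster_tsv`): parse
/// `<rep>\t<member>` lines into a map from representative to its members.
fn parse_rep_member_tsv(text: &str) -> Result<BTreeMap<String, Vec<String>>, String> {
    let mut clusters: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for (idx, line) in text.lines().enumerate() {
        // Tolerate blank lines, e.g. a trailing newline at end of file.
        if line.trim().is_empty() {
            continue;
        }
        // Exactly one split: column 1 is the representative, column 2 the member.
        let (rep, member) = line
            .split_once('\t')
            .ok_or_else(|| format!("line {}: expected two tab-separated columns", idx + 1))?;
        clusters
            .entry(rep.to_string())
            .or_default()
            .push(member.to_string());
    }
    Ok(clusters)
}
```

For example, the input `"g1|p1\tg1|p1\ng1|p1\tg2|p7\n"` yields a single cluster keyed by `g1|p1` with members `g1|p1` and `g2|p7`.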

The expansion is an approximation: each member inherits its representative’s bitscore and identity rather than being scored itself. Users who need per-member precision can re-align the affected members with any of the standard aligners, which typically costs a small fraction of the original N-genome run.
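The expansion and its approximation can be sketched as follows. This is an illustration under stated assumptions, not the body of `BatchClusterAligner::align_genomes`: the `Hit` struct, its fields, and `expand_hits` are hypothetical names, and the cluster map is assumed to be keyed by representative header.

```rust
use std::collections::BTreeMap;

/// Hypothetical hit record; the crate's real hit type may differ.
#[derive(Debug, Clone, PartialEq)]
struct Hit {
    query: String,
    target: String,
    bitscore: f64,
    identity: f64,
}

/// Hypothetical sketch: copy each representative hit onto every member of
/// its cluster and bucket the copies by genome ID.
fn expand_hits(
    rep_hits: &[Hit],
    clusters: &BTreeMap<String, Vec<String>>,
) -> BTreeMap<String, Vec<Hit>> {
    let mut by_genome: BTreeMap<String, Vec<Hit>> = BTreeMap::new();
    for hit in rep_hits {
        // Skip hits whose target is not a known representative.
        let members = match clusters.get(&hit.target) {
            Some(members) => members,
            None => continue,
        };
        for member in members {
            // The genome ID is the token before the FIRST '|' only.
            if let Some((genome, _rest)) = member.split_once('|') {
                by_genome.entry(genome.to_string()).or_default().push(Hit {
                    target: member.clone(),
                    // The approximation: bitscore and identity are inherited
                    // from the representative, not recomputed for the member.
                    ..hit.clone()
                });
            }
        }
    }
    by_genome
}
```

A member copied this way differs from the representative hit only in its `target` field, which is why a later per-member re-alignment is needed when exact scores matter.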

Structs§

BatchClusterAligner
Driver for the batch-cluster mode.
ClusterResult
Result of an mmseqs easy-cluster run.
GenomeHitSet
Hits for one genome after batch-cluster expansion. Emitted in original input order so downstream code can pair them up with GenomeInput by index without extra lookups.
GenomeInput
A genome’s protein FASTA plus a short identifier used to prefix every header on concatenation.

Constants§

GENOME_SEP
Separator between the genome ID and the original FASTA header. A pipe | matches the convention used in the R prepare_batch_alignments.R pipeline; the same character also appears in UniProt-style sp|P12345|NAME accessions, so parsers must split on the first | only.

Functions§

cluster_with_mmseqs
concat_genomes
Produce a single FASTA at out whose headers are rewritten to >GENOMEID|ORIGHEADER. Fails if any genome ID contains | (because the downstream split would be ambiguous).
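The header rewrite and the ID check can be sketched on in-memory strings. This is a simplified illustration, not the crate's `concat_genomes` (which writes to a file at `out`): the function `prefix_fasta`, the constant `SEP` standing in for `GENOME_SEP`, and the `String` error type are assumptions.

```rust
/// Assumed to mirror `GENOME_SEP`.
const SEP: char = '|';

/// Hypothetical helper: prefix every FASTA header in `fasta` with
/// `genome_id` followed by the separator. Sequence lines pass through unchanged.
fn prefix_fasta(genome_id: &str, fasta: &str) -> Result<String, String> {
    // An ID containing the separator would make the later split ambiguous.
    if genome_id.contains(SEP) {
        return Err(format!("genome ID {:?} contains {:?}", genome_id, SEP));
    }
    let mut out = String::with_capacity(fasta.len());
    for line in fasta.lines() {
        match line.strip_prefix('>') {
            // Header line: emit ">GENOMEID|ORIGHEADER".
            Some(header) => {
                out.push('>');
                out.push_str(genome_id);
                out.push(SEP);
                out.push_str(header);
            }
            // Sequence line: copy as-is.
            None => out.push_str(line),
        }
        out.push('\n');
    }
    Ok(out)
}
```

With genome ID `g1`, the header `>sp|P12345|NAME desc` becomes `>g1|sp|P12345|NAME desc`; the original pipes are preserved after the prefix.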
parse_cluster_tsv
split_genome_prefix
Pull GENOMEID out of a concatenated header (first |-separated token). Returns None on a malformed header.
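Splitting on the first `|` only can be done with `str::split_once`. The sketch below is not the crate's `split_genome_prefix`: the name `split_prefix`, the tuple return type, the tolerance for a leading `>`, and the choice to treat an empty genome ID as malformed are assumptions.

```rust
/// Hypothetical helper: split a concatenated header into
/// (genome ID, original header) at the FIRST '|'.
/// Returns `None` if there is no separator or the genome ID is empty.
fn split_prefix(header: &str) -> Option<(&str, &str)> {
    // Accept headers with or without the leading FASTA '>'.
    let header = header.strip_prefix('>').unwrap_or(header);
    // `split_once` splits at the first occurrence, so any later '|'
    // (as in UniProt-style accessions) stays in the original header.
    let (genome, rest) = header.split_once('|')?;
    if genome.is_empty() {
        return None;
    }
    Some((genome, rest))
}
```

Under these assumptions, `split_prefix("g1|sp|P12345|NAME")` returns `Some(("g1", "sp|P12345|NAME"))`, and a header with no `|` returns `None`.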