Module batch

Batch-cluster alignment across N genomes.

When many genomes are annotated against the same gapseq reference seqdb, the per-genome loop is dominated by redundant alignments, because closely related genomes share many proteins. The batch-cluster path amortizes that cost in three steps:

1. concat_genomes merges every genome’s proteome into one FASTA, rewriting each header as <GENOMEID>|<orig_header> so that any downstream hit can be traced back to its genome of origin.
2. cluster_with_mmseqs calls mmseqs easy-cluster to produce (a) a representative FASTA and (b) a two-column <rep>\t<member> TSV.
3. BatchClusterAligner::align_genomes then runs a single alignment of the gapseq query FASTA against the representatives and expands each rep-hit into its cluster members, bucketing the results by genome.
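The header rewriting in the first step can be illustrated with a minimal sketch. It works on in-memory strings rather than files, and the names `SEP`, `prefix_line`, and `concat_fasta` are hypothetical stand-ins, not this module's actual API; only the `<GENOMEID>|<orig_header>` layout and the rejection of IDs containing `|` come from the documentation above.

```rust
/// Hypothetical stand-in for the crate's `GENOME_SEP` constant.
const SEP: char = '|';

/// Rewrite one FASTA line: header lines gain a `<GENOMEID>|` prefix,
/// sequence lines pass through unchanged.
fn prefix_line(genome_id: &str, line: &str) -> Result<String, String> {
    // An ID containing the separator would make the later split ambiguous.
    if genome_id.contains(SEP) {
        return Err(format!("genome ID {:?} contains '{}'", genome_id, SEP));
    }
    match line.strip_prefix('>') {
        Some(header) => Ok(format!(">{}{}{}", genome_id, SEP, header)),
        None => Ok(line.to_string()),
    }
}

/// Concatenate several (genome ID, FASTA text) pairs into one FASTA text.
/// This is an in-memory simplification; the real function writes to a file.
fn concat_fasta(genomes: &[(&str, &str)]) -> Result<String, String> {
    let mut out = String::new();
    for (id, fasta) in genomes {
        for line in fasta.lines() {
            out.push_str(&prefix_line(id, line)?);
            out.push('\n');
        }
    }
    Ok(out)
}
```

Under these assumptions, a header such as `>sp|P12345|NAME` from genome `g2` becomes `>g2|sp|P12345|NAME`, and a genome ID like `bad|id` is rejected before anything is written.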

The expansion is an approximation: each member inherits its representative’s bitscore and identity instead of receiving an alignment of its own. Users who need per-member precision can re-align the affected members with any of the standard aligners, typically at a tiny fraction of the original N-genome cost.
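The expansion and bucketing can be sketched as follows. The `Hit` struct, its field names, and `expand_hits` are hypothetical and chosen for illustration; the sketch also assumes that each representative appears in its own member list, so that the representative's genome receives the hit as well.

```rust
use std::collections::HashMap;

/// Hypothetical hit record; field names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
struct Hit {
    query: String,
    /// Concatenated sequence ID of the form `<GENOMEID>|<orig_id>`.
    target: String,
    bitscore: f64,
    identity: f64,
}

/// Expand hits against representatives into hits against every cluster
/// member, grouped by the genome ID taken from the member's prefix.
fn expand_hits(
    rep_hits: &[Hit],
    clusters: &HashMap<String, Vec<String>>,
) -> HashMap<String, Vec<Hit>> {
    let mut by_genome: HashMap<String, Vec<Hit>> = HashMap::new();
    for hit in rep_hits {
        // Hits whose target is not a known representative are skipped.
        if let Some(members) = clusters.get(&hit.target) {
            for member in members {
                // Split on the FIRST '|' only; members without one are skipped.
                if let Some((genome, _rest)) = member.split_once('|') {
                    by_genome.entry(genome.to_string()).or_default().push(Hit {
                        target: member.clone(),
                        // The member inherits the representative's scores.
                        ..hit.clone()
                    });
                }
            }
        }
    }
    by_genome
}
```

Because the scores are copied rather than recomputed, a member that diverges from its representative is reported with the representative's bitscore and identity, which is the approximation described above.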

Structs§

BatchClusterAligner: Driver for the batch-cluster mode.
ClusterResult: Result of an mmseqs easy-cluster run.
GenomeHitSet: Hits for one genome after batch-cluster expansion. Emitted in original input order so downstream code can pair them up with GenomeInput by index without extra lookups.
GenomeInput: A genome’s protein FASTA plus a short identifier used to prefix every header on concatenation.

Constants§

GENOME_SEP: Separator between the genome ID and the original FASTA header. A pipe | matches the convention used in the R prepare_batch_alignments.R pipeline; the same character also appears in UniProt-style sp|P12345|NAME accessions, so parsers must split on the first | only.
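To show why splitting on the first `|` matters, here is a small sketch using `str::split_once` from the standard library. The function name `split_first` and the local `SEP` constant are hypothetical, and treating an empty genome ID as malformed is an assumption of this sketch rather than documented behavior.

```rust
/// Hypothetical stand-in for the crate's `GENOME_SEP` constant.
const SEP: char = '|';

/// Split a concatenated header into (genome ID, original header),
/// cutting at the first separator only. A leading '>' is tolerated.
fn split_first(header: &str) -> Option<(&str, &str)> {
    let header = header.strip_prefix('>').unwrap_or(header);
    match header.split_once(SEP) {
        // Assumption: an empty genome ID counts as malformed.
        Some((genome, rest)) if !genome.is_empty() => Some((genome, rest)),
        _ => None,
    }
}

/// Count the tokens a naive split on every separator would produce.
fn naive_token_count(header: &str) -> usize {
    header.split(SEP).count()
}
```

For `g2|sp|P12345|NAME`, `split_first` yields `("g2", "sp|P12345|NAME")` and leaves the UniProt-style accession intact, whereas a naive split on every `|` produces four tokens and loses the boundary between the genome ID and the original header.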

Functions§

cluster_with_mmseqs
concat_genomes: Produce a single FASTA at out whose headers are rewritten to >GENOMEID|ORIGHEADER. Fails if any genome ID contains | (because the downstream split would be ambiguous).
parse_cluster_tsv
split_genome_prefix: Pull GENOMEID out of a concatenated header (first |-separated token). Returns None on a malformed header.
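As an illustration of the two-column `<rep>\t<member>` format that mmseqs easy-cluster emits, the sketch below groups members under their representative. It is not the signature or behavior of `parse_cluster_tsv`, which is undocumented here; the name `parse_tsv`, the in-memory input, the `BTreeMap` return type, and the choice to skip blank lines are all assumptions.

```rust
use std::collections::BTreeMap;

/// Parse `<rep>\t<member>` lines into a map from representative ID to
/// its members, preserving member order within each cluster.
fn parse_tsv(text: &str) -> Result<BTreeMap<String, Vec<String>>, String> {
    let mut clusters: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for (idx, line) in text.lines().enumerate() {
        // Assumption: blank lines are ignored rather than treated as errors.
        if line.trim().is_empty() {
            continue;
        }
        let (rep, member) = line
            .split_once('\t')
            .ok_or_else(|| format!("line {}: expected <rep>\\t<member>", idx + 1))?;
        clusters
            .entry(rep.to_string())
            .or_default()
            .push(member.to_string());
    }
    Ok(clusters)
}
```

A `BTreeMap` is used here only so that iteration order is deterministic; a `HashMap` would serve equally well if order does not matter.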
