Module complex


Port of src/complex_detection.R.

Input: a vector of FASTA sequence descriptors (the full header text after >) taken from one reaction’s reference files, plus the reaction id (so per-reaction subunit dictionaries apply).

Output: a parallel vector of Option<String> giving each sequence’s canonical subunit assignment, or None when the extractor decided the sequence is unlabeled.

Algorithm (faithful to the R original, line numbers from src/complex_detection.R):

1. Apply 8 alternating regex patterns (com.pat1..com.pat8) to extract a raw subunit phrase.
2. Normalize “subunit/chain/polypeptide/component” → “Subunit”.
3. Reorder so the label comes after Subunit (alpha Subunit → Subunit alpha).
4. Subunit-dict translation (dat/complex_subunit_dict.tsv): Subunit <synonym> → Subunit <canonical>.
5. Numeral mapping: Roman numerals I..XV → 1..15, single letters A..Z → 1..26, Greek alpha..sigma → 1..17, small/medium/large → 1..3.
6. Strip any trailing [A-z] on Subunit N<letter> (R drops sub-sub-complexes).
7. Low-count filter: if the mean count is ≥ 10, drop subunits with count < 5.
8. High-quality selection: if ≥ 66% of hits are numbered subunits, drop everything else.
9. Global cutoff: if ≤ 20% of inputs produced any hit at all, blank the whole vector.
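
The three count-based filters that close the algorithm (low-count filter, high-quality selection, global cutoff) can be sketched as follows. This is a minimal illustration, not the module's actual code. The names `apply_count_filters` and `is_numbered` are hypothetical, and the sketch makes several assumptions the description above does not confirm: the "mean count" is taken over distinct subunit labels, a "numbered" subunit is a label of the form `Subunit <digits>`, and each percentage is computed on the vector as left by the previous filter.

```rust
use std::collections::HashMap;

/// Hypothetical helper: true for labels of the form "Subunit <digits>".
fn is_numbered(label: &str) -> bool {
    label
        .strip_prefix("Subunit ")
        .map_or(false, |rest| !rest.is_empty() && rest.chars().all(|c| c.is_ascii_digit()))
}

/// Sketch of the low-count filter, high-quality selection and global cutoff.
/// Thresholds come from the algorithm description; the details are assumptions.
fn apply_count_filters(mut labels: Vec<Option<String>>) -> Vec<Option<String>> {
    let n_inputs = labels.len();
    if n_inputs == 0 {
        return labels;
    }

    // Low-count filter: count occurrences of each distinct label.
    let mut counts: HashMap<String, usize> = HashMap::new();
    for label in labels.iter().flatten() {
        *counts.entry(label.clone()).or_insert(0) += 1;
    }
    if !counts.is_empty() {
        // Assumption: the mean is taken over distinct labels.
        let mean = counts.values().sum::<usize>() as f64 / counts.len() as f64;
        if mean >= 10.0 {
            for slot in labels.iter_mut() {
                if slot.as_ref().map_or(false, |s| counts[s] < 5) {
                    *slot = None;
                }
            }
        }
    }

    // High-quality selection: keep only numbered subunits when they dominate.
    let hits = labels.iter().flatten().count();
    let numbered = labels.iter().flatten().filter(|s| is_numbered(s)).count();
    if hits > 0 && numbered as f64 / hits as f64 >= 0.66 {
        for slot in labels.iter_mut() {
            if slot.as_ref().map_or(false, |s| !is_numbered(s)) {
                *slot = None;
            }
        }
    }

    // Global cutoff: blank everything when too few inputs produced a hit.
    let hits = labels.iter().flatten().count();
    if hits as f64 / n_inputs as f64 <= 0.20 {
        labels.iter_mut().for_each(|slot| *slot = None);
    }
    labels
}
```

Under these assumptions, a vector with a single hit among ten inputs comes back entirely `None`, while a vector in which three of four hits are numbered keeps the numbered entries and blanks the fourth.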

Functions

detect_subunits: Detect one subunit per sequence descriptor for a given reaction. A None entry means “no subunit recognized”.
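
Because the output is parallel to the input, a caller can pair each descriptor with its assignment by position. The sketch below shows one way to consume such a pair of vectors. It does not call `detect_subunits` (its exact signature is not shown on this page); `group_by_subunit` is a hypothetical helper, and the descriptors in the usage note are made up for illustration.

```rust
use std::collections::BTreeMap;

/// Hypothetical consumer: group sequence descriptors by assigned subunit,
/// skipping entries whose assignment is `None` (no subunit recognized).
fn group_by_subunit<'a>(
    descriptors: &'a [String],
    assignments: &'a [Option<String>],
) -> BTreeMap<&'a str, Vec<&'a str>> {
    let mut groups: BTreeMap<&str, Vec<&str>> = BTreeMap::new();
    // The two slices are parallel: index i of one describes index i of the other.
    for (descriptor, assignment) in descriptors.iter().zip(assignments) {
        if let Some(label) = assignment {
            groups
                .entry(label.as_str())
                .or_default()
                .push(descriptor.as_str());
        }
    }
    groups
}
```

For example, given the illustrative descriptors `"example protein, alpha subunit"`, `"example protein, unlabeled"` and `"example protein, subunit A"` with assignments `Some("Subunit 1")`, `None` and `Some("Subunit 1")`, the result holds a single key, `Subunit 1`, with the first and third descriptors.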
