Module complex

Port of src/complex_detection.R.

Input: a vector of FASTA sequence descriptors (the full header text after >) taken from one reaction’s reference files, plus the reaction id (so per-reaction subunit dictionaries apply).

Output: a parallel vector of Option<String> giving each sequence’s canonical subunit assignment, or None when the extractor decided the sequence is unlabeled.
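Because the output is parallel to the input, the two can be zipped index by index. The sketch below is illustrative only: `group_by_subunit` is a hypothetical helper, not part of this module, and the assignments are passed in by hand rather than produced by `detect_subunits`, whose exact signature is not shown here.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper (not part of this module): groups descriptors by
/// their assigned subunit and skips unlabeled (`None`) sequences.
fn group_by_subunit<'a>(
    descriptors: &[&'a str],
    assignments: &[Option<String>],
) -> BTreeMap<String, Vec<&'a str>> {
    // The two slices are parallel, so they must have the same length.
    assert_eq!(descriptors.len(), assignments.len());

    let mut groups: BTreeMap<String, Vec<&'a str>> = BTreeMap::new();
    for (desc, label) in descriptors.iter().zip(assignments) {
        if let Some(label) = label {
            groups.entry(label.clone()).or_insert_with(Vec::new).push(*desc);
        }
    }
    groups
}
```

A `BTreeMap` is used here only so that the subunits come out in a stable, sorted order.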

Algorithm (faithful to the R original, line numbers from src/complex_detection.R):

  1. Apply 8 alternating regex patterns (com.pat1..com.pat8) to extract a raw subunit phrase.
  2. Normalize “subunit/chain/polypeptide/component” → “Subunit”.
  3. Reorder so the label comes after Subunit (alpha Subunit → Subunit alpha).
  4. Subunit-dict translation (dat/complex_subunit_dict.tsv): Subunit <synonym> → Subunit <canonical>.
  5. Numeral mapping: Roman numerals I..XV → 1..15, single letters A..Z → 1..26, Greek alpha..sigma → 1..17, small/medium/large → 1..3.
  6. Strip any trailing letter ([A-z]) from Subunit N<letter> (the R original drops sub-sub-complexes).
  7. Low-count filter: if mean count ≥ 10, drop subunits with count < 5.
  8. High-quality selection: if ≥ 66% of hits are numbered subunits, drop everything else.
  9. Global cutoff: if ≤ 20% of inputs produced any hit at all, blank the whole vector.
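The three count-based filters (steps 7 to 9) can be sketched as follows. This is a minimal illustration under stated assumptions, not the module's actual code: `apply_count_filters` and `is_numbered` are hypothetical names, "mean count" is taken to be the mean over distinct labels, a "numbered subunit" is taken to be `Subunit <digits>`, and the global cutoff is evaluated on the hits that survive steps 7 and 8. The R original may differ on any of these points.

```rust
use std::collections::HashMap;

/// Hypothetical helper: true for labels of the form "Subunit <digits>".
fn is_numbered(label: &str) -> bool {
    label
        .strip_prefix("Subunit ")
        .map_or(false, |rest| !rest.is_empty() && rest.chars().all(|c| c.is_ascii_digit()))
}

/// Sketch of steps 7-9 over already-normalized labels (assumed semantics).
fn apply_count_filters(mut labels: Vec<Option<String>>) -> Vec<Option<String>> {
    // Step 7: if the mean count per distinct label is >= 10,
    // drop labels that occur fewer than 5 times.
    let mut counts: HashMap<String, usize> = HashMap::new();
    for label in labels.iter().flatten() {
        *counts.entry(label.clone()).or_insert(0) += 1;
    }
    if !counts.is_empty() {
        let mean = counts.values().sum::<usize>() as f64 / counts.len() as f64;
        if mean >= 10.0 {
            for slot in labels.iter_mut() {
                if slot.as_ref().map_or(false, |l| counts[l] < 5) {
                    *slot = None;
                }
            }
        }
    }

    // Step 8: if >= 66% of the remaining hits are numbered subunits,
    // drop every hit that is not numbered.
    let hits = labels.iter().flatten().count();
    let numbered = labels.iter().flatten().filter(|l| is_numbered(l.as_str())).count();
    if hits > 0 && numbered as f64 / hits as f64 >= 0.66 {
        for slot in labels.iter_mut() {
            if slot.as_deref().map_or(false, |l| !is_numbered(l)) {
                *slot = None;
            }
        }
    }

    // Step 9: if <= 20% of all inputs still carry a hit, blank everything.
    let hits = labels.iter().flatten().count();
    if hits as f64 / labels.len().max(1) as f64 <= 0.20 {
        labels.iter_mut().for_each(|slot| *slot = None);
    }
    labels
}
```

The filters only ever replace entries with `None`, so the result stays parallel to the input descriptors.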

Functions

detect_subunits
Detect one subunit per sequence descriptor for a given reaction. A None entry means “no subunit recognized”.