Port of src/complex_detection.R.
Input: a vector of FASTA sequence descriptors (the full header text after
>) taken from one reaction’s reference files, plus the reaction id (so
per-reaction subunit dictionaries apply).
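To make the input shape concrete, here is a minimal sketch of collecting descriptors from raw FASTA text. The helper name `fasta_descriptors` is hypothetical and not part of this module. It assumes one header per record, starting with `>` in the first column.

```rust
/// Hypothetical helper (not part of this module): collect the descriptor,
/// i.e. everything after the leading '>', of each FASTA record.
fn fasta_descriptors(fasta: &str) -> Vec<String> {
    fasta
        .lines()
        // Only header lines start with '>'; sequence lines are skipped.
        .filter_map(|line| line.strip_prefix('>'))
        // Assumption: surrounding whitespace is not meaningful.
        .map(|desc| desc.trim().to_string())
        .collect()
}
```

The resulting vector keeps the order of the records, which matters because the output is parallel to the input.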
Output: a parallel vector of Option<String> giving each sequence’s
canonical subunit assignment, or None when the extractor decided the
sequence is unlabeled.
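Because the output is parallel to the input, index `i` of the result refers to descriptor `i`. The sketch below shows one way a caller might consume such a vector. `group_by_subunit` is a hypothetical name used only for illustration. It groups input indices by assigned subunit and reports the unlabeled indices separately.

```rust
use std::collections::BTreeMap;

/// Hypothetical consumer (not part of this module): group input indices by
/// their assigned subunit. Indices whose entry is `None` are returned apart.
fn group_by_subunit(labels: &[Option<String>]) -> (BTreeMap<&str, Vec<usize>>, Vec<usize>) {
    let mut grouped: BTreeMap<&str, Vec<usize>> = BTreeMap::new();
    let mut unlabeled = Vec::new();
    for (i, label) in labels.iter().enumerate() {
        match label {
            // A labeled sequence: record its index under the canonical name.
            Some(name) => grouped.entry(name.as_str()).or_default().push(i),
            // `None` means the extractor left this sequence unlabeled.
            None => unlabeled.push(i),
        }
    }
    (grouped, unlabeled)
}
```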
Algorithm (faithful to the R original, line numbers from
src/complex_detection.R):
- Apply 8 alternating regex patterns (com.pat1..com.pat8) to extract a raw subunit phrase.
- Normalize “subunit/chain/polypeptide/component” → “Subunit”.
- Reorder so the label comes after Subunit (alpha Subunit → Subunit alpha).
- Subunit-dict translation (dat/complex_subunit_dict.tsv): Subunit <synonym> → Subunit <canonical>.
- Numeral mapping: Latin I..XV → 1..15, single letters A..Z → 1..26, Greek alpha..sigma → 1..17, small/medium/large → 1..3.
- Strip any trailing [A-z] on Subunit N<letter> (R drops sub-sub-complexes).
- Low-count filter: if mean count ≥ 10, drop subunits with count < 5.
- High-quality selection: if ≥ 66% of hits are numbered subunits, drop everything else.
- Global cutoff: if ≤ 20% of inputs produced any hit at all, blank the whole vector.
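The last three steps operate on the whole vector rather than on a single descriptor. The sketch below illustrates them and is not the module's actual code. It rests on several assumptions the text does not state:

- "Dropping" a subunit means setting its entries to `None`.
- "Mean count" is the mean number of occurrences per distinct label.
- A "numbered subunit" is a label of the form `Subunit <digits>`.
- The filters run in the listed order, and hits are recounted after each stage.

The names `apply_filters` and `is_numbered` are hypothetical.

```rust
use std::collections::HashMap;

/// Assumption: a "numbered" label has the form "Subunit <digits>".
fn is_numbered(label: &str) -> bool {
    label
        .strip_prefix("Subunit ")
        .map_or(false, |rest| !rest.is_empty() && rest.chars().all(|c| c.is_ascii_digit()))
}

/// Sketch of the three vector-level filters (hypothetical name and signature).
fn apply_filters(mut labels: Vec<Option<String>>) -> Vec<Option<String>> {
    // Low-count filter: if the mean count per distinct label is >= 10,
    // blank every label that occurs fewer than 5 times.
    let mut counts: HashMap<String, usize> = HashMap::new();
    for label in labels.iter().flatten() {
        *counts.entry(label.clone()).or_insert(0) += 1;
    }
    if !counts.is_empty() {
        let mean = counts.values().sum::<usize>() as f64 / counts.len() as f64;
        if mean >= 10.0 {
            for slot in labels.iter_mut() {
                if slot.as_ref().map_or(false, |l| counts[l] < 5) {
                    *slot = None;
                }
            }
        }
    }

    // High-quality selection: if >= 66% of the remaining hits are numbered,
    // blank every hit that is not numbered.
    let hits = labels.iter().flatten().count();
    let numbered = labels.iter().flatten().filter(|l| is_numbered(l)).count();
    if hits > 0 && numbered as f64 / hits as f64 >= 0.66 {
        for slot in labels.iter_mut() {
            if slot.as_ref().map_or(false, |l| !is_numbered(l)) {
                *slot = None;
            }
        }
    }

    // Global cutoff: if <= 20% of the inputs still carry a hit, blank all.
    let hits = labels.iter().flatten().count();
    if !labels.is_empty() && hits as f64 / labels.len() as f64 <= 0.20 {
        labels.iter_mut().for_each(|slot| *slot = None);
    }
    labels
}
```

With inputs `Subunit 1`, `Subunit 1`, `Subunit 2`, `Subunit alpha`, three of the four hits are numbered (75%). Under these assumptions the `Subunit alpha` entry becomes `None` and the other three are kept.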
Functions
- detect_subunits: Detect one subunit per sequence descriptor for a given reaction. A None entry means “no subunit recognized”.