{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Reference Dataset Curation\n",
    "\n",
    "This notebook documents the curation of four reference datasets of `.mol` files used in this repository: `gdb13_1201`, `gdb17_200`, `coconut_55`, and `checks`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import necessary packages.\n",
    "from collections import defaultdict\n",
    "import gzip\n",
    "import numpy as np\n",
    "import os\n",
    "import os.path as osp\n",
    "import pandas as pd\n",
    "from rdkit import Chem\n",
    "import shutil\n",
    "import tarfile\n",
    "from urllib.request import urlretrieve\n",
    "import zipfile\n",
    "\n",
    "# Set up random number generation.\n",
    "rng = np.random.default_rng(137)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conversion between molecular representation formats\n",
    "\n",
    "The `assembly-theory` package primarily uses the `.mol` molecular representation.\n",
    "The GDB-13 and GDB-17 databases contain molecules in `.smi` format, which is for the SMILES molecular representation.\n",
    "Additionally, the COCONUT database contains molecules in `.sdf` format, which is closely related to the `.mol` format.\n",
    "The following functions utilize RDKit to convert between representations using the [`rdkit.Chem.rdchem.Mol`](https://www.rdkit.org/docs/source/rdkit.Chem.rdchem.html#rdkit.Chem.rdchem.Mol) class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def smiles_to_sdf(smiles_list, sdf_output_name):\n",
    "    \"\"\"\n",
    "    Converts a list of SMILES molecule strings into a single .sdf file. \n",
    "\n",
    "    :param smiles_list: a list of SMILES molecule strings\n",
    "    :param sdf_output_name: a string filename of the output .sdf file\n",
    "    \"\"\"\n",
    "    writer = Chem.SDWriter(sdf_output_name)\n",
    "    for smiles in smiles_list:\n",
    "        writer.write(Chem.MolFromSmiles(smiles))\n",
    "\n",
    "\n",
    "def mols_to_sdf(mol_list, sdf_output_name):\n",
    "    \"\"\"\n",
    "    Writes a list of RDKit Mol objects to a single .sdf file.\n",
    "\n",
    "    :param mol_list: a list of RDKit Mol objects\n",
    "    :param sdf_output_name: a string filename of the output .sdf file\n",
    "    \"\"\"\n",
    "    writer = Chem.SDWriter(sdf_output_name)\n",
    "    for mol_object in mol_list:\n",
    "        writer.write(mol_object)\n",
    "\n",
    "\n",
    "def sdf_to_molfiles(sdf_filename, output_directory):\n",
    "    \"\"\"\n",
    "    Splits all molecules in a .sdf file into individual .mol files in a specified output directory. \n",
    "\n",
    "    :param sdf_filename: a string filename of the input .sdf file\n",
    "    :param output_directory: a string name of the output directory\n",
    "    \"\"\"\n",
    "    # Ensure the output directory exists.\n",
    "    os.makedirs(output_directory, exist_ok=True)\n",
    "    \n",
    "    # Load each molecule in the .sdf file as a Mol object and write it out as a numbered .mol file.\n",
    "    supplier = Chem.SDMolSupplier(sdf_filename)\n",
    "    num_digits = int(np.log10(len(supplier))) + 1\n",
    "    for i, mol in enumerate(supplier):\n",
    "        if mol is not None:\n",
    "            Chem.MolToMolFile(mol, osp.join(output_directory, f\"{str(i+1).zfill(num_digits)}.mol\"))"
   ]
  },
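  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an optional, quick check of these helpers, the cell below round-trips two illustrative SMILES strings (ethanol and benzene) through a temporary `.sdf` file and back out to individual `.mol` files.\n",
    "The molecules and paths are examples only, not part of any curated dataset, and the temporary directory is removed automatically."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional: round-trip two example SMILES strings through the conversion helpers.\n",
    "# Everything here is written to a temporary directory and discarded afterwards.\n",
    "import tempfile\n",
    "\n",
    "with tempfile.TemporaryDirectory() as tmp_dir:\n",
    "    demo_sdf = osp.join(tmp_dir, \"demo.sdf\")\n",
    "    smiles_to_sdf([\"CCO\", \"c1ccccc1\"], sdf_output_name=demo_sdf)  # ethanol, benzene\n",
    "    sdf_to_molfiles(demo_sdf, output_directory=osp.join(tmp_dir, \"mols\"))\n",
    "    print(sorted(os.listdir(osp.join(tmp_dir, \"mols\"))))  # expect ['1.mol', '2.mol']"
   ]
  },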
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Sampling molecules from databases\n",
    "\n",
    "Our reference datasets contain curated subsets of larger databases such as GDB-13, GDB-17, and COCONUT.\n",
    "The following function controls this (possibly-random) sampling."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def smiles_subset(smi_filename, num_molecules=None, random_sample=False):\n",
    "    \"\"\"\n",
    "    Collect a fixed-size subset of SMILES strings from the given .smi file.\n",
    "\n",
    "    :param smi_filename: a string filename of the input .smi file\n",
    "    :param num_molecules: an int number of molecules to collect, or None to collect all\n",
    "    :param random_sample: True iff the SMILES strings should be sampled uniformly at random\n",
    "    :returns: a list containing the specified number of SMILES strings\n",
    "    \"\"\"\n",
    "    # Read every line in the .smi file.\n",
    "    with open(smi_filename, 'r') as file:\n",
    "        smiles = [line.strip() for line in file]\n",
    "\n",
    "    # If num_molecules is None, get ready to collect all molecules in the file.\n",
    "    if num_molecules is None:\n",
    "        num_molecules = len(smiles)\n",
    "\n",
    "    # If random, then do sampling uniformly at random; otherwise, take the first num_molecules strings in order.\n",
    "    if random_sample:\n",
    "        return [str(s) for s in rng.choice(smiles, num_molecules, replace=False)]\n",
    "    else:\n",
    "        return smiles[:num_molecules]"
   ]
  },
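  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The cell below is an optional illustration of `smiles_subset` on a small throwaway `.smi` file of arbitrary example SMILES strings.\n",
    "Note that the random-sampling mode draws from the shared `rng` seeded above, so invoking it here would advance the generator and perturb the reproducible samples drawn later in this notebook; it is shown but left commented out."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional: illustrate smiles_subset on a small throwaway .smi file.\n",
    "import tempfile\n",
    "\n",
    "demo_smiles = [\"C\", \"CC\", \"CCC\", \"CCCC\", \"CCCCC\"]\n",
    "with tempfile.NamedTemporaryFile('w', suffix=\".smi\", delete=False) as f_smi:\n",
    "    f_smi.write(\"\\n\".join(demo_smiles))\n",
    "\n",
    "# Deterministic mode: the first num_molecules strings, in file order.\n",
    "print(smiles_subset(f_smi.name, num_molecules=3))  # ['C', 'CC', 'CCC']\n",
    "\n",
    "# Random mode (commented out): it draws from the shared rng seeded above, and running\n",
    "# it here would change the reproducible samples collected later in this notebook.\n",
    "# print(smiles_subset(f_smi.name, num_molecules=3, random_sample=True))\n",
    "\n",
    "os.remove(f_smi.name)"
   ]
  },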
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## `gdb13_1201`\n",
    "\n",
    "This dataset contains 1,201 small, organic molecular structures sampled from [GDB-13](https://zenodo.org/record/5172018), a database of enumerated chemical structures containing Carbon, Hydrogen, Nitrogen, Oxygen, Sulfur, and Chlorine that are constrained only by valence rules and quantum mechanics but may not be chemically stable or synthesizable (Blum & Reymond, 2009).\n",
    "Our sample includes all 201 molecules in GDB-13 with 4–5 heavy atoms and 200 randomly sampled molecules for each number of heavy atoms from 6–10.\n",
    "\n",
    "The following code downloads and extracts GDB-13 and obtains our desired subset.\n",
    "GDB-13 is organized into `.smi` files by numbers of heavy atoms.\n",
    "For example, file `4.smi` contains only molecules with four heavy atoms.\n",
    "Within these individual `.smi` files, individual molecules are sorted by their molecular complexity, meaning molecules containing primarily Carbon are listed first, and molecules containing a greater variety of atoms are listed later."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Download GDB-13 in its entirety.\n",
    "print(\"Downloading GDB-13 from Zenodo...\")\n",
    "gdb13_url = \"https://zenodo.org/records/5172018/files/gdb13.tgz?download=1\"\n",
    "urlretrieve(gdb13_url, \"gdb13.tgz\")\n",
    "print(\"Done downloading.\")\n",
    "\n",
    "# Extract GDB-13 from .tgz to a new directory called gdb13/.\n",
    "gdb13_dir = \"gdb13\"\n",
    "with tarfile.open(\"gdb13.tgz\", 'r') as f_tar:\n",
    "    f_tar.extractall(path=gdb13_dir)\n",
    "\n",
    "# Collect all molecules with 4-5 heavy atoms (43 four-heavy, 158 five-heavy).\n",
    "print(\"Collecting the gdb13_1201 subset...\")\n",
    "for i in [4, 5]:\n",
    "    subset = smiles_subset(osp.join(gdb13_dir, f\"{i}.smi\"))\n",
    "    smiles_to_sdf(subset, sdf_output_name=f\"{i}.sdf\")\n",
    "\n",
    "# For each number of heavy atoms from 6-10, collect 200 molecules uniformly at random.\n",
    "for i in range(6, 11):\n",
    "    subset = smiles_subset(osp.join(gdb13_dir, f\"{i}.smi\"), num_molecules=200, random_sample=True)\n",
    "    smiles_to_sdf(subset, sdf_output_name=f\"{i}.sdf\")\n",
    "\n",
    "# Concatenate these .sdf files into one gdb13_1201.sdf file for our reference dataset.\n",
    "with open(\"gdb13_1201.sdf\", 'wb') as f_out:\n",
    "    for i in range(4, 11):\n",
    "        with open(f\"{i}.sdf\", 'rb') as f_in:\n",
    "            shutil.copyfileobj(f_in, f_out)\n",
    "\n",
    "# Split this .sdf file into individual .mol files.\n",
    "sdf_to_molfiles(\"gdb13_1201.sdf\", output_directory=osp.join(\"..\", \"data\", \"gdb13_1201\"))\n",
    "\n",
    "# Clean up all intermediate files.\n",
    "print(\"Deleting intermediate files...\")\n",
    "os.remove(\"gdb13.tgz\")\n",
    "shutil.rmtree(gdb13_dir)\n",
    "for i in range(4, 11):\n",
    "    os.remove(f\"{i}.sdf\")\n",
    "os.remove(\"gdb13_1201.sdf\")"
   ]
  },
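  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an optional sanity check (assuming the setup and curation cells above have been run), the following cell re-reads the generated `.mol` files and tallies them by number of heavy atoms.\n",
    "The totals should match the composition described above: 43 molecules with four heavy atoms, 158 with five, and 200 each for six through ten, for 1,201 in total."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional sanity check: tally the curated gdb13_1201 .mol files by heavy-atom count.\n",
    "gdb13_out = osp.join(\"..\", \"data\", \"gdb13_1201\")\n",
    "gdb13_counts = defaultdict(int)\n",
    "for molfile in sorted(os.listdir(gdb13_out)):\n",
    "    if molfile.endswith(\".mol\"):\n",
    "        mol = Chem.MolFromMolFile(osp.join(gdb13_out, molfile))\n",
    "        if mol is not None:\n",
    "            gdb13_counts[mol.GetNumHeavyAtoms()] += 1\n",
    "\n",
    "print(f\"Total molecules: {sum(gdb13_counts.values())}\")  # expect 1201\n",
    "print(dict(sorted(gdb13_counts.items())))  # expect {4: 43, 5: 158, 6: 200, ..., 10: 200}"
   ]
  },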
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## `gdb17_200`\n",
    "\n",
    "This dataset contains 200 organic molecular structures sampled from the larger [GDB-17](https://zenodo.org/record/5172018) database, which includes additional elements beyond those in GDB-13, such as the halogens Fluorine and Iodine (Ruddigkeit et al., 2012).\n",
    "Compared to GDB-13, these molecules are typically larger and represent more structural diversity.\n",
    "Our sample includes 50 randomly sampled molecules for each number of heavy atoms from 14–17.\n",
    "\n",
    "GDB-17 downloads as a single `.smi` file containing 50 million molecules.\n",
    "Obtaining our sample of 50 molecules per number of heavy atoms requires slightly more effort than in GDB-13, where molecules were already split out by number of heavy atoms."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Download GDB-17 in its entirety.\n",
    "print(\"Downloading GDB-17 from Zenodo...\")\n",
    "gdb17_url = \"https://zenodo.org/records/5172018/files/GDB17.50000000.smi.gz?download=1\"\n",
    "urlretrieve(gdb17_url, \"gdb17.gz\")\n",
    "print(\"Done downloading.\")\n",
    "\n",
    "# Extract GDB-17 from .gz to a .smi file.\n",
    "with gzip.open(\"gdb17.gz\", 'rb') as f_in:\n",
    "    with open(\"gdb17.smi\", 'wb') as f_out:\n",
    "        shutil.copyfileobj(f_in, f_out)\n",
    "\n",
    "# Downsample from 50M to 100K molecules uniformly at random.\n",
    "print(\"Collecting the gdb17_200 subset...\")\n",
    "sample = smiles_subset(\"gdb17.smi\", num_molecules=100000, random_sample=True)\n",
    "\n",
    "# Write a quick function to get numbers of heavy atoms from SMILES strings.\n",
    "def heavy(smiles):\n",
    "    return Chem.MolFromSmiles(smiles).GetNumHeavyAtoms()\n",
    "\n",
    "# Create a pandas dataframe with the SMILES string and number of heavy atoms.\n",
    "smiles_df = pd.DataFrame({'smiles': sample})\n",
    "smiles_df['heavy'] = smiles_df['smiles'].map(heavy)\n",
    "\n",
    "# For each number of heavy atoms from 14-17, collect 50 molecules into an .sdf file.\n",
    "for i in range(14, 18):\n",
    "    subset = smiles_df[smiles_df['heavy'] == i]['smiles'].tolist()\n",
    "    smiles_to_sdf(subset[:50], sdf_output_name=f\"{i}.sdf\")\n",
    "\n",
    "# Concatenate these .sdf files into one gdb17_200.sdf file for our reference dataset.\n",
    "with open(\"gdb17_200.sdf\", 'wb') as f_out:\n",
    "    for i in range(14, 18):\n",
    "        with open(f\"{i}.sdf\", 'rb') as f_in:\n",
    "            shutil.copyfileobj(f_in, f_out)\n",
    "\n",
    "# Split this .sdf file into individual .mol files.\n",
    "sdf_to_molfiles(\"gdb17_200.sdf\", output_directory=osp.join(\"..\", \"data\", \"gdb17_200\"))\n",
    "\n",
    "# Clean up all intermediate files.\n",
    "print(\"Deleting intermediate files...\")\n",
    "os.remove(\"gdb17.gz\")\n",
    "os.remove(\"gdb17.smi\")\n",
    "for i in range(14, 18):\n",
    "    os.remove(f\"{i}.sdf\")\n",
    "os.remove(\"gdb17_200.sdf\")"
   ]
  },
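  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An analogous optional check for `gdb17_200`: assuming the cell above has been run, the curated directory should contain 50 molecules for each number of heavy atoms from 14–17, 200 in total."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional sanity check: tally the curated gdb17_200 .mol files by heavy-atom count.\n",
    "gdb17_out = osp.join(\"..\", \"data\", \"gdb17_200\")\n",
    "gdb17_counts = defaultdict(int)\n",
    "for molfile in sorted(os.listdir(gdb17_out)):\n",
    "    if molfile.endswith(\".mol\"):\n",
    "        mol = Chem.MolFromMolFile(osp.join(gdb17_out, molfile))\n",
    "        if mol is not None:\n",
    "            gdb17_counts[mol.GetNumHeavyAtoms()] += 1\n",
    "\n",
    "print(f\"Total molecules: {sum(gdb17_counts.values())}\")  # expect 200\n",
    "print(dict(sorted(gdb17_counts.items())))  # expect 50 each for 14-17"
   ]
  },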
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## `coconut_55`\n",
    "\n",
    "This dataset contains 55 natural products sampled from the COCONUT database (Sorokina et al., 2021), [accessed in late 2024](https://zenodo.org/records/13897048) prior to COCONUT 2.0 (Chandrasekhar et al., 2025).\n",
    "Natural products (or secondary metabolites) are a rich source of evolved chemical complexity, often exhibiting drug-like properties.\n",
    "Subsets of this database were used to benchmark recent algorithmic progress by Seet et al. (2024).\n",
    "Our sample includes 5 randomly sampled molecules for each number of heavy atoms from 15–25.\n",
    "\n",
    "COCONUT downloads as a single `.sdf` file containing all molecules in the database.\n",
    "As with GDB-17, some effort is required to identify each molecule's number of heavy atoms before sampling."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Download COCONUT in its entirety.\n",
    "print(\"Downloading COCONUT from Zenodo...\")\n",
    "coconut_url = \"https://zenodo.org/records/13897048/files/coconut_complete-10-2024.sdf.zip?download=1\"\n",
    "urlretrieve(coconut_url, \"coconut.zip\")\n",
    "print(\"Done downloading.\")\n",
    "\n",
    "# Extract COCONUT from .zip to a new directory called coconut/.\n",
    "with zipfile.ZipFile(\"coconut.zip\", 'r') as f_zip:\n",
    "    f_zip.extractall(path='coconut')\n",
    "\n",
    "# Read molecules from .sdf, filtering and sorting by number of heavy atoms.\n",
    "print(\"Collecting the coconut_55 subset...\")\n",
    "mols = defaultdict(list)\n",
    "for mol in Chem.SDMolSupplier(osp.join('coconut', 'coconut_complete-10-2024.sdf')):\n",
    "    if mol is not None:\n",
    "        num_heavy = mol.GetNumHeavyAtoms()\n",
    "        if num_heavy >= 15 and num_heavy <= 25:\n",
    "            mols[num_heavy].append(mol)\n",
    "        \n",
    "# For each number of heavy atoms from 15-25, collect 5 molecules uniformly at random.\n",
    "for i in range(15, 26):\n",
    "    subset = list(rng.choice(mols[i], 5, replace=False))\n",
    "    mols_to_sdf(subset, sdf_output_name=f\"{i}.sdf\")\n",
    "\n",
    "# Concatenate these .sdf files into one coconut_55.sdf file for our reference dataset.\n",
    "with open(\"coconut_55.sdf\", 'wb') as f_out:\n",
    "    for i in range(15, 26):\n",
    "        with open(f\"{i}.sdf\", 'rb') as f_in:\n",
    "            shutil.copyfileobj(f_in, f_out)\n",
    "\n",
    "\n",
    "# Split this .sdf file into individual .mol files.\n",
    "sdf_to_molfiles(\"coconut_55.sdf\", output_directory=osp.join(\"..\", \"data\", \"coconut_55\"))\n",
    "\n",
    "# Clean up all intermediate files.\n",
    "print(\"Deleting intermediate files...\")\n",
    "os.remove(\"coconut.zip\")\n",
    "shutil.rmtree(\"coconut\")\n",
    "for i in range(15, 26):\n",
    "    os.remove(f\"{i}.sdf\")\n",
    "os.remove(\"coconut_55.sdf\")"
   ]
  },
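  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An analogous optional check for `coconut_55`: assuming the cell above has been run, the curated directory should contain 5 molecules for each number of heavy atoms from 15–25, 55 in total."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional sanity check: tally the curated coconut_55 .mol files by heavy-atom count.\n",
    "coconut_out = osp.join(\"..\", \"data\", \"coconut_55\")\n",
    "coconut_counts = defaultdict(int)\n",
    "for molfile in sorted(os.listdir(coconut_out)):\n",
    "    if molfile.endswith(\".mol\"):\n",
    "        mol = Chem.MolFromMolFile(osp.join(coconut_out, molfile))\n",
    "        if mol is not None:\n",
    "            coconut_counts[mol.GetNumHeavyAtoms()] += 1\n",
    "\n",
    "print(f\"Total molecules: {sum(coconut_counts.values())}\")  # expect 55\n",
    "print(dict(sorted(coconut_counts.items())))  # expect 5 each for 15-25"
   ]
  },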
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## `checks`\n",
    "\n",
    "This dataset contains 15 named molecules (e.g., anthracene, aspirin, caffeine, morphine) from [KEGG COMPOUND](https://www.genome.jp/kegg/compound/) (Kanehisa, 2019; Kanehisa et al., 2023; Kanehisa & Goto, 2000).\n",
    "These molecules are primarily used for rapid testing and have numbers of heavy atoms ranging from 5–21.\n",
    "The code below simply documents where these molecules were downloaded from and automates the downloads."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set the checks dataset directory and ensure it exists.\n",
    "checks_dir = osp.join(\"..\", \"data\", \"checks\")\n",
    "os.makedirs(checks_dir, exist_ok=True)\n",
    "\n",
    "# Define molecule names and download URLs.\n",
    "mols = [\n",
    "    (\"anthracene\", \"https://www.kegg.jp/entry/-f+m+C14315\"),\n",
    "    (\"aspartic\", \"https://www.kegg.jp/entry/-f+m+C00049\"),\n",
    "    (\"aspirin\", \"https://www.kegg.jp/entry/-f+m+C01405\"),\n",
    "    (\"benzene\", \"https://www.kegg.jp/entry/-f+m+C01407\"),\n",
    "    (\"caffeine\", \"https://www.kegg.jp/entry/-f+m+C07481\"),\n",
    "    (\"chrysene\", \"https://www.kegg.jp/entry/-f+m+C14222\"),\n",
    "    (\"cyanidin\", \"https://www.kegg.jp/entry/-f+m+C05905\"),\n",
    "    (\"ethylenethiourea\", \"https://www.kegg.jp/entry/-f+m+C14398\"),\n",
    "    (\"fosfructose\", \"https://www.kegg.jp/entry/-f+m+C00354\"),\n",
    "    (\"morphine\", \"https://www.kegg.jp/entry/-f+m+C01516\"),\n",
    "    (\"naphthalene\", \"https://www.kegg.jp/entry/-f+m+C00829\"),\n",
    "    (\"pyrazole\", \"https://www.kegg.jp/entry/-f+m+C00481\"),\n",
    "    (\"pyrene\", \"https://www.kegg.jp/entry/-f+m+C14335\"),\n",
    "    (\"thiouracil\", \"https://www.kegg.jp/entry/-f+m+C19304\"),\n",
    "    (\"xylopyranose\", \"https://www.kegg.jp/entry/-f+m+C02205\")\n",
    "]\n",
    "\n",
    "# Download the checks dataset.\n",
    "print(\"Downloading checks molecules from KEGG COMPOUND...\")\n",
    "for mol_name, mol_url in mols:\n",
    "    urlretrieve(mol_url, osp.join(checks_dir, f\"{mol_name}.mol\"))\n",
    "print(\"Done downloading.\")"
   ]
  },
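  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a final optional check, the cell below lists each downloaded `checks` molecule alongside its heavy-atom count; there should be 15 `.mol` files with heavy-atom counts ranging from 5–21."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional sanity check: list each checks molecule with its heavy-atom count.\n",
    "check_files = sorted(f for f in os.listdir(checks_dir) if f.endswith(\".mol\"))\n",
    "print(f\"{len(check_files)} .mol files downloaded.\")  # expect 15\n",
    "for molfile in check_files:\n",
    "    mol = Chem.MolFromMolFile(osp.join(checks_dir, molfile))\n",
    "    if mol is not None:\n",
    "        print(f\"{molfile}: {mol.GetNumHeavyAtoms()} heavy atoms\")"
   ]
  },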
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References\n",
    "\n",
    "Blum, L. C., & Reymond, J.-L. (2009). 970 Million Druglike Small Molecules for Virtual Screening in the Chemical Universe Database GDB-13. *Journal of the American Chemical Society*, *131*(25), 8732–8733. https://doi.org/10.1021/ja902302h\n",
    "\n",
    "Chandrasekhar, V., Rajan, K., Kanakam, S. R. S., Sharma, N., Weißenborn, V., Schaub, J., & Steinbeck, C. (2025). COCONUT 2.0: A comprehensive overhaul and curation of the collection of open natural products database. *Nucleic Acids Research*, *53*(D1), D634–D643. https://doi.org/10.1093/nar/gkae1063\n",
    "\n",
    "Kanehisa, M. (2019). Toward understanding the origin and evolution of cellular organisms. *Protein Science*, *28*(11), 1947–1951. https://doi.org/10.1002/pro.3715\n",
    "\n",
    "Kanehisa, M., Furumichi, M., Sato, Y., Kawashima, M., & Ishiguro-Watanabe, M. (2023). KEGG for taxonomy-based analysis of pathways and genomes. *Nucleic Acids Research*, *51*(D1), D587–D592. https://doi.org/10.1093/nar/gkac963\n",
    "\n",
    "Kanehisa, M., & Goto, S. (2000). KEGG: Kyoto Encyclopedia of Genes and Genomes. *Nucleic Acids Research*, *28*(1), 27–30. https://doi.org/10.1093/nar/28.1.27\n",
    "\n",
    "Ruddigkeit, L., Van Deursen, R., Blum, L. C., & Reymond, J.-L. (2012). Enumeration of 166 Billion Organic Small Molecules in the Chemical Universe Database GDB-17. *Journal of Chemical Information and Modeling*, *52*(11), 2864–2875. https://doi.org/10.1021/ci300415d\n",
    "\n",
    "Seet, I., Patarroyo, K. Y., Siebert, G., Walker, S. I., & Cronin, L. (2024). *Rapid Computation of the Assembly Index of Molecular Graphs* (No. 2410.09100). arXiv. https://doi.org/10.48550/arXiv.2410.09100\n",
    "\n",
    "Sorokina, M., Merseburger, P., Rajan, K., Yirik, M. A., & Steinbeck, C. (2021). COCONUT online: Collection of Open Natural Products database. *Journal of Cheminformatics*, *13*(1), 2. https://doi.org/10.1186/s13321-020-00478-9"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}