{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Reference Dataset Curation\n",
    "\n",
    "This notebook documents the curation of four reference datasets of `.mol` files used in this repository: `gdb13_1201`, `gdb17_200`, `coconut_55`, and `checks`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import necessary packages.\n",
    "from collections import defaultdict\n",
    "import gzip\n",
    "import numpy as np\n",
    "import os\n",
    "import os.path as osp\n",
    "import pandas as pd\n",
    "from rdkit import Chem\n",
    "import shutil\n",
    "import tarfile\n",
    "from urllib.request import urlretrieve\n",
    "import zipfile\n",
    "\n",
    "# Set up random number generation.\n",
    "rng = np.random.default_rng(137)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conversion between molecular representation formats\n",
    "\n",
    "The `assembly-theory` package primarily uses the `.mol` molecular representation.\n",
    "The GDB-13 and GDB-17 databases contain molecules in `.smi` format, which is for the SMILES molecular representation.\n",
    "Additionally, the COCONUT database contains molecules in `.sdf` format, which is closely related to the `.mol` format.\n",
    "The following functions utilize RDKit to convert between representations using the [`rdkit.Chem.rdchem.Mol`](https://www.rdkit.org/docs/source/rdkit.Chem.rdchem.html#rdkit.Chem.rdchem.Mol) class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def smiles_to_sdf(smiles_list, sdf_output_name):\n",
    "    \"\"\"\n",
    "    Converts a list of SMILES molecule strings into a single .sdf file. \n",
    "\n",
    "    :param smiles_list: a list of SMILES molecule strings\n",
    "    :param sdf_output_name: a string filename of the output .sdf file\n",
    "    \"\"\"\n",
    "    writer = Chem.SDWriter(sdf_output_name)\n",
    "    for smiles in smiles_list:\n",
    "        writer.write(Chem.MolFromSmiles(smiles))\n",
    "\n",
    "\n",
    "def mols_to_sdf(mol_list, sdf_output_name):\n",
    "    \"\"\"\n",
    "    Writes a list of RDKit Mol objects to a single .sdf file.\n",
    "\n",
    "    :param mol_list: a list of RDKit Mol objects\n",
    "    :param sdf_output_name: a string filename of the output .sdf file\n",
    "    \"\"\"\n",
    "    writer = Chem.SDWriter(sdf_output_name)\n",
    "    for mol_object in mol_list:\n",
    "        writer.write(mol_object)\n",
    "\n",
    "\n",
    "def sdf_to_molfiles(sdf_filename, output_directory):\n",
    "    \"\"\"\n",
    "    Splits all molecules in a .sdf file into individual .mol files in a specified output directory. \n",
    "\n",
    "    :param sdf_filename: a string filename of the input .sdf file\n",
    "    :param output_directory: a string name of the output directory\n",
    "    \"\"\"\n",
    "    # Ensure the output directory exists.\n",
    "    os.makedirs(output_directory, exist_ok=True)\n",
    "    \n",
    "    # Load each molecule in the .sdf file as a Mol object and write it out as a numbered .mol file.\n",
    "    supplier = Chem.SDMolSupplier(sdf_filename)\n",
    "    num_digits = int(np.log10(len(supplier))) + 1\n",
    "    for i, mol in enumerate(supplier):\n",
    "        if mol is not None:\n",
    "            Chem.MolToMolFile(mol, osp.join(output_directory, f\"{str(i+1).zfill(num_digits)}.mol\"))"
   ]
  },
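  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an optional, quick check of these helpers, the cell below round-trips two illustrative SMILES strings (ethanol and benzene) through a temporary `.sdf` file and back out to individual `.mol` files.\n",
    "The molecules and paths are examples only, not part of any curated dataset, and the temporary directory is removed automatically."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional: round-trip two example SMILES strings through the conversion helpers.\n",
    "# Everything here is written to a temporary directory and discarded afterwards.\n",
    "import tempfile\n",
    "\n",
    "with tempfile.TemporaryDirectory() as tmp_dir:\n",
    "    demo_sdf = osp.join(tmp_dir, \"demo.sdf\")\n",
    "    smiles_to_sdf([\"CCO\", \"c1ccccc1\"], sdf_output_name=demo_sdf)  # ethanol, benzene\n",
    "    sdf_to_molfiles(demo_sdf, output_directory=osp.join(tmp_dir, \"mols\"))\n",
    "    print(sorted(os.listdir(osp.join(tmp_dir, \"mols\"))))  # expect ['1.mol', '2.mol']"
   ]
  },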
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Sampling molecules from databases\n",
    "\n",
    "Our reference datasets contain curated subsets of larger databases such as GDB-13, GDB-17, and COCONUT.\n",
    "The following function controls this (possibly-random) sampling."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def smiles_subset(smi_filename, num_molecules=None, random_sample=False):\n",
    "    \"\"\"\n",
    "    Collect a fixed-size subset of SMILES strings from the given .smi file.\n",
    "\n",
    "    :param smi_filename: a string filename of the input .smi file\n",
    "    :param num_molecules: an int number of molecules to collect, or None to collect all\n",
    "    :param random_sample: True iff the SMILES strings should be sampled uniformly at random\n",
    "    :returns: a list containing the specified number of SMILES strings\n",
    "    \"\"\"\n",
    "    # Read every line in the .smi file.\n",
    "    with open(smi_filename, 'r') as file:\n",
    "        smiles = [line.strip() for line in file]\n",
    "\n",
    "    # If num_molecules is None, get ready to collect all molecules in the file.\n",
    "    if num_molecules is None:\n",
    "        num_molecules = len(smiles)\n",
    "\n",
    "    # If random, then do sampling uniformly at random; otherwise, take the first num_molecules strings in order.\n",
    "    if random_sample:\n",
    "        return [str(s) for s in rng.choice(smiles, num_molecules, replace=False)]\n",
    "    else:\n",
    "        return smiles[:num_molecules]"
   ]
  },
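  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The cell below is an optional illustration of `smiles_subset` on a small throwaway `.smi` file of arbitrary example SMILES strings.\n",
    "Note that the random-sampling mode draws from the shared `rng` seeded above, so invoking it here would advance the generator and perturb the reproducible samples drawn later in this notebook; it is shown but left commented out."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional: illustrate smiles_subset on a small throwaway .smi file.\n",
    "import tempfile\n",
    "\n",
    "demo_smiles = [\"C\", \"CC\", \"CCC\", \"CCCC\", \"CCCCC\"]\n",
    "with tempfile.NamedTemporaryFile('w', suffix=\".smi\", delete=False) as f_smi:\n",
    "    f_smi.write(\"\\n\".join(demo_smiles))\n",
    "\n",
    "# Deterministic mode: the first num_molecules strings, in file order.\n",
    "print(smiles_subset(f_smi.name, num_molecules=3))  # ['C', 'CC', 'CCC']\n",
    "\n",
    "# Random mode (commented out): it draws from the shared rng seeded above, and running\n",
    "# it here would change the reproducible samples collected later in this notebook.\n",
    "# print(smiles_subset(f_smi.name, num_molecules=3, random_sample=True))\n",
    "\n",
    "os.remove(f_smi.name)"
   ]
  },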
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## `gdb13_1201`\n",
    "\n",
    "This dataset contains 1,201 small, organic molecular structures sampled from [GDB-13](https://zenodo.org/record/5172018), a database of enumerated chemical structures containing Carbon, Hydrogen, Nitrogen, Oxygen, Sulfur, and Chlorine that are constrained only by valence rules and quantum mechanics but may not be chemically stable or synthesizable (Blum & Reymond, 2009).\n",
    "Our sample includes all 201 molecules in GDB-13 with 4–5 heavy atoms and 200 randomly sampled molecules for each number of heavy atoms from 6–10.\n",
    "\n",
    "The following code downloads and extracts GDB-13 and obtains our desired subset.\n",
    "GDB-13 is organized into `.smi` files by numbers of heavy atoms.\n",
    "For example, file `4.smi` contains only molecules with four heavy atoms.\n",
    "Within these individual `.smi` files, individual molecules are sorted by their molecular complexity, meaning molecules containing primarily Carbon are listed first, and molecules containing a greater variety of atoms are listed later."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Download GDB-13 in its entirety.\n",
    "print(\"Downloading GDB-13 from Zenodo...\")\n",
    "gdb13_url = \"https://zenodo.org/records/5172018/files/gdb13.tgz?download=1\"\n",
    "urlretrieve(gdb13_url, \"gdb13.tgz\")\n",
    "print(\"Done downloading.\")\n",
    "\n",
    "# Extract GDB-13 from .tgz to a new directory called gdb13/.\n",
    "gdb13_dir = \"gdb13\"\n",
    "with tarfile.open(\"gdb13.tgz\", 'r') as f_tar:\n",
    "    f_tar.extractall(path=gdb13_dir)\n",
    "\n",
    "# Collect all molecules with 4-5 heavy atoms (43 four-heavy, 158 five-heavy).\n",
    "print(\"Collecting the gdb13_1201 subset...\")\n",
    "for i in [4, 5]:\n",
    "    subset = smiles_subset(osp.join(gdb13_dir, f\"{i}.smi\"))\n",
    "    smiles_to_sdf(subset, sdf_output_name=f\"{i}.sdf\")\n",
    "\n",
    "# For each number of heavy atoms from 6-10, collect 200 molecules uniformly at random.\n",
    "for i in range(6, 11):\n",
    "    subset = smiles_subset(osp.join(gdb13_dir, f\"{i}.smi\"), num_molecules=200, random_sample=True)\n",
    "    smiles_to_sdf(subset, sdf_output_name=f\"{i}.sdf\")\n",
    "\n",
    "# Concatenate these .sdf files into one gdb13_1201.sdf file for our reference dataset.\n",
    "with open(\"gdb13_1201.sdf\", 'wb') as f_out:\n",
    "    for i in range(4, 11):\n",
    "        with open(f\"{i}.sdf\", 'rb') as f_in:\n",
    "            shutil.copyfileobj(f_in, f_out)\n",
    "\n",
    "# Split this .sdf file into individual .mol files.\n",
    "sdf_to_molfiles(\"gdb13_1201.sdf\", output_directory=osp.join(\"..\", \"data\", \"gdb13_1201\"))\n",
    "\n",
    "# Clean up all intermediate files.\n",
    "print(\"Deleting intermediate files...\")\n",
    "os.remove(\"gdb13.tgz\")\n",
    "shutil.rmtree(gdb13_dir)\n",
    "for i in range(4, 11):\n",
    "    os.remove(f\"{i}.sdf\")\n",
    "os.remove(\"gdb13_1201.sdf\")"
   ]
  },
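  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an optional sanity check (assuming the setup and curation cells above have been run), the following cell re-reads the generated `.mol` files and tallies them by number of heavy atoms.\n",
    "The totals should match the composition described above: 43 molecules with four heavy atoms, 158 with five, and 200 each for six through ten, for 1,201 in total."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional sanity check: tally the curated gdb13_1201 .mol files by heavy-atom count.\n",
    "gdb13_out = osp.join(\"..\", \"data\", \"gdb13_1201\")\n",
    "gdb13_counts = defaultdict(int)\n",
    "for molfile in sorted(os.listdir(gdb13_out)):\n",
    "    if molfile.endswith(\".mol\"):\n",
    "        mol = Chem.MolFromMolFile(osp.join(gdb13_out, molfile))\n",
    "        if mol is not None:\n",
    "            gdb13_counts[mol.GetNumHeavyAtoms()] += 1\n",
    "\n",
    "print(f\"Total molecules: {sum(gdb13_counts.values())}\")  # expect 1201\n",
    "print(dict(sorted(gdb13_counts.items())))  # expect {4: 43, 5: 158, 6: 200, ..., 10: 200}"
   ]
  },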
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## `gdb17_200`\n",
    "\n",
    "This dataset contains 200 organic molecular structures sampled from the larger [GDB-17](https://zenodo.org/record/5172018) database, which includes additional elements beyond those in GDB-13, such as the halogens Fluorine and Iodine (Ruddigkeit et al., 2012).\n",
    "Compared to GDB-13, these molecules are typically larger and represent more structural diversity.\n",
    "Our sample includes 50 randomly sampled molecules for each number of heavy atoms from 14–17.\n",
    "\n",
    "GDB-17 downloads as a single `.smi` file containing 50 million molecules.\n",
    "Obtaining our sample of 50 molecules per number of heavy atoms requires slightly more effort than in GDB-13, where molecules were already split out by number of heavy atoms."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Download GDB-17 in its entirety.\n",
    "print(\"Downloading GDB-17 from Zenodo...\")\n",
    "gdb17_url = \"https://zenodo.org/records/5172018/files/GDB17.50000000.smi.gz?download=1\"\n",
    "urlretrieve(gdb17_url, \"gdb17.gz\")\n",
    "print(\"Done downloading.\")\n",
    "\n",
    "# Extract GDB-17 from .gz to a .smi file.\n",
    "with gzip.open(\"gdb17.gz\", 'rb') as f_in:\n",
    "    with open(\"gdb17.smi\", 'wb') as f_out:\n",
    "        shutil.copyfileobj(f_in, f_out)\n",
    "\n",
    "# Downsample from 50M to 100K molecules uniformly at random.\n",
    "print(\"Collecting the gdb17_200 subset...\")\n",
    "sample = smiles_subset(\"gdb17.smi\", num_molecules=100000, random_sample=True)\n",
    "\n",
    "# Write a quick function to get numbers of heavy atoms from SMILES strings.\n",
    "def heavy(smiles):\n",
    "    return Chem.MolFromSmiles(smiles).GetNumHeavyAtoms()\n",
    "\n",
    "# Create a pandas dataframe with the SMILES string and number of heavy atoms.\n",
    "smiles_df = pd.DataFrame({'smiles': sample})\n",
    "smiles_df['heavy'] = smiles_df['smiles'].map(heavy)\n",
    "\n",
    "# For each number of heavy atoms from 14-17, collect 50 molecules into an .sdf file.\n",
    "for i in range(14, 18):\n",
    "    subset = smiles_df[smiles_df['heavy'] == i]['smiles'].tolist()\n",
    "    smiles_to_sdf(subset[:50], sdf_output_name=f\"{i}.sdf\")\n",
    "\n",
    "# Concatenate these .sdf files into one gdb17_200.sdf file for our reference dataset.\n",
    "with open(\"gdb17_200.sdf\", 'wb') as f_out:\n",
    "    for i in range(14, 18):\n",
    "        with open(f\"{i}.sdf\", 'rb') as f_in:\n",
    "            shutil.copyfileobj(f_in, f_out)\n",
    "\n",
    "# Split this .sdf file into individual .mol files.\n",
    "sdf_to_molfiles(\"gdb17_200.sdf\", output_directory=osp.join(\"..\", \"data\", \"gdb17_200\"))\n",
    "\n",
    "# Clean up all intermediate files.\n",
    "print(\"Deleting intermediate files...\")\n",
    "os.remove(\"gdb17.gz\")\n",
    "os.remove(\"gdb17.smi\")\n",
    "for i in range(14, 18):\n",
    "    os.remove(f\"{i}.sdf\")\n",
    "os.remove(\"gdb17_200.sdf\")"
   ]
  },
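  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An analogous optional check for `gdb17_200`: assuming the cell above has been run, the curated directory should contain 50 molecules for each number of heavy atoms from 14–17, 200 in total."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional sanity check: tally the curated gdb17_200 .mol files by heavy-atom count.\n",
    "gdb17_out = osp.join(\"..\", \"data\", \"gdb17_200\")\n",
    "gdb17_counts = defaultdict(int)\n",
    "for molfile in sorted(os.listdir(gdb17_out)):\n",
    "    if molfile.endswith(\".mol\"):\n",
    "        mol = Chem.MolFromMolFile(osp.join(gdb17_out, molfile))\n",
    "        if mol is not None:\n",
    "            gdb17_counts[mol.GetNumHeavyAtoms()] += 1\n",
    "\n",
    "print(f\"Total molecules: {sum(gdb17_counts.values())}\")  # expect 200\n",
    "print(dict(sorted(gdb17_counts.items())))  # expect 50 each for 14-17"
   ]
  },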
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## `coconut_55`\n",
    "\n",
    "This dataset contains 55 natural products sampled from the COCONUT database (Sorokina et al., 2021), [accessed in late 2024](https://zenodo.org/records/13897048) prior to COCONUT 2.0 (Chandrasekhar et al., 2025).\n",
    "Natural products (or secondary metabolites) are a rich source of evolved chemical complexity, often exhibiting drug-like properties.\n",
    "Subsets of this database were used to benchmark recent algorithmic progress by Seet et al. (2024).\n",
    "Our sample includes 5 randomly sampled molecules for each number of heavy atoms from 15–25.\n",
    "\n",
    "COCONUT downloads as a single `.sdf` file containing all molecules in the database.\n",
    "As with GDB-17, some effort is required to identify each molecule's number of heavy atoms before sampling."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Download COCONUT in its entirety.\n",
    "print(\"Downloading COCONUT from Zenodo...\")\n",
    "coconut_url = \"https://zenodo.org/records/13897048/files/coconut_complete-10-2024.sdf.zip?download=1\"\n",
    "urlretrieve(coconut_url, \"coconut.zip\")\n",
    "print(\"Done downloading.\")\n",
    "\n",
    "# Extract COCONUT from .zip to a new directory called coconut/.\n",
    "with zipfile.ZipFile(\"coconut.zip\", 'r') as f_zip:\n",
    "    f_zip.extractall(path='coconut')\n",
    "\n",
    "# Read molecules from .sdf, filtering and sorting by number of heavy atoms.\n",
    "print(\"Collecting the coconut_55 subset...\")\n",
    "mols = defaultdict(list)\n",
    "for mol in Chem.SDMolSupplier(osp.join('coconut', 'coconut_complete-10-2024.sdf')):\n",
    "    if mol is not None:\n",
    "        num_heavy = mol.GetNumHeavyAtoms()\n",
    "        if num_heavy >= 15 and num_heavy <= 25:\n",
    "            mols[num_heavy].append(mol)\n",
    "        \n",
    "# For each number of heavy atoms from 15-25, collect 5 molecules uniformly at random.\n",
    "for i in range(15, 26):\n",
    "    subset = list(rng.choice(mols[i], 5, replace=False))\n",
    "    mols_to_sdf(subset, sdf_output_name=f\"{i}.sdf\")\n",
    "\n",
    "# Concatenate these .sdf files into one coconut_55.sdf file for our reference dataset.\n",
    "with open(\"coconut_55.sdf\", 'wb') as f_out:\n",
    "    for i in range(15, 26):\n",
    "        with open(f\"{i}.sdf\", 'rb') as f_in:\n",
    "            shutil.copyfileobj(f_in, f_out)\n",
    "\n",
    "\n",
    "# Split this .sdf file into individual .mol files.\n",
    "sdf_to_molfiles(\"coconut_55.sdf\", output_directory=osp.join(\"..\", \"data\", \"coconut_55\"))\n",
    "\n",
    "# Clean up all intermediate files.\n",
    "print(\"Deleting intermediate files...\")\n",
    "os.remove(\"coconut.zip\")\n",
    "shutil.rmtree(\"coconut\")\n",
    "for i in range(15, 26):\n",
    "    os.remove(f\"{i}.sdf\")\n",
    "os.remove(\"coconut_55.sdf\")"
   ]
  },
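  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An analogous optional check for `coconut_55`: assuming the cell above has been run, the curated directory should contain 5 molecules for each number of heavy atoms from 15–25, 55 in total."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional sanity check: tally the curated coconut_55 .mol files by heavy-atom count.\n",
    "coconut_out = osp.join(\"..\", \"data\", \"coconut_55\")\n",
    "coconut_counts = defaultdict(int)\n",
    "for molfile in sorted(os.listdir(coconut_out)):\n",
    "    if molfile.endswith(\".mol\"):\n",
    "        mol = Chem.MolFromMolFile(osp.join(coconut_out, molfile))\n",
    "        if mol is not None:\n",
    "            coconut_counts[mol.GetNumHeavyAtoms()] += 1\n",
    "\n",
    "print(f\"Total molecules: {sum(coconut_counts.values())}\")  # expect 55\n",
    "print(dict(sorted(coconut_counts.items())))  # expect 5 each for 15-25"
   ]
  },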
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## `checks`\n",
    "\n",
    "This dataset contains 15 named molecules (e.g., anthracene, aspirin, caffeine, morphine) from [KEGG COMPOUND](https://www.genome.jp/kegg/compound/) (Kanehisa, 2019; Kanehisa et al., 2023; Kanehisa & Goto, 2000).\n",
    "These molecules are primarily used for rapid testing and have numbers of heavy atoms ranging from 5–21.\n",
    "The code below simply documents where these molecules were downloaded from and automates the downloads."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set the checks dataset directory and ensure it exists.\n",
    "checks_dir = osp.join(\"..\", \"data\", \"checks\")\n",
    "os.makedirs(checks_dir, exist_ok=True)\n",
    "\n",
    "# Define molecule names and download URLs.\n",
    "mols = [\n",
    "    (\"anthracene\", \"https://www.kegg.jp/entry/-f+m+C14315\"),\n",
    "    (\"aspartic\", \"https://www.kegg.jp/entry/-f+m+C00049\"),\n",
    "    (\"aspirin\", \"https://www.kegg.jp/entry/-f+m+C01405\"),\n",
    "    (\"benzene\", \"https://www.kegg.jp/entry/-f+m+C01407\"),\n",
    "    (\"caffeine\", \"https://www.kegg.jp/entry/-f+m+C07481\"),\n",
    "    (\"chrysene\", \"https://www.kegg.jp/entry/-f+m+C14222\"),\n",
    "    (\"cyanidin\", \"https://www.kegg.jp/entry/-f+m+C05905\"),\n",
    "    (\"ethylenethiourea\", \"https://www.kegg.jp/entry/-f+m+C14398\"),\n",
    "    (\"fosfructose\", \"https://www.kegg.jp/entry/-f+m+C00354\"),\n",
    "    (\"morphine\", \"https://www.kegg.jp/entry/-f+m+C01516\"),\n",
    "    (\"naphthalene\", \"https://www.kegg.jp/entry/-f+m+C00829\"),\n",
    "    (\"pyrazole\", \"https://www.kegg.jp/entry/-f+m+C00481\"),\n",
    "    (\"pyrene\", \"https://www.kegg.jp/entry/-f+m+C14335\"),\n",
    "    (\"thiouracil\", \"https://www.kegg.jp/entry/-f+m+C19304\"),\n",
    "    (\"xylopyranose\", \"https://www.kegg.jp/entry/-f+m+C02205\")\n",
    "]\n",
    "\n",
    "# Download the checks dataset.\n",
    "print(\"Downloading checks molecules from KEGG COMPOUND...\")\n",
    "for mol_name, mol_url in mols:\n",
    "    urlretrieve(mol_url, osp.join(checks_dir, f\"{mol_name}.mol\"))\n",
    "print(\"Done downloading.\")"
   ]
  },
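  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a final optional check, the cell below lists each downloaded `checks` molecule alongside its heavy-atom count; there should be 15 `.mol` files with heavy-atom counts ranging from 5–21."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional sanity check: list each checks molecule with its heavy-atom count.\n",
    "check_files = sorted(f for f in os.listdir(checks_dir) if f.endswith(\".mol\"))\n",
    "print(f\"{len(check_files)} .mol files downloaded.\")  # expect 15\n",
    "for molfile in check_files:\n",
    "    mol = Chem.MolFromMolFile(osp.join(checks_dir, molfile))\n",
    "    if mol is not None:\n",
    "        print(f\"{molfile}: {mol.GetNumHeavyAtoms()} heavy atoms\")"
   ]
  },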
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References\n",
    "\n",
    "Blum, L. C., & Reymond, J.-L. (2009). 970 Million Druglike Small Molecules for Virtual Screening in the Chemical Universe Database GDB-13. *Journal of the American Chemical Society*, *131*(25), 8732–8733. https://doi.org/10.1021/ja902302h\n",
    "\n",
    "Chandrasekhar, V., Rajan, K., Kanakam, S. R. S., Sharma, N., Weißenborn, V., Schaub, J., & Steinbeck, C. (2025). COCONUT 2.0: A comprehensive overhaul and curation of the collection of open natural products database. *Nucleic Acids Research*, *53*(D1), D634–D643. https://doi.org/10.1093/nar/gkae1063\n",
    "\n",
    "Kanehisa, M. (2019). Toward understanding the origin and evolution of cellular organisms. *Protein Science*, *28*(11), 1947–1951. https://doi.org/10.1002/pro.3715\n",
    "\n",
    "Kanehisa, M., Furumichi, M., Sato, Y., Kawashima, M., & Ishiguro-Watanabe, M. (2023). KEGG for taxonomy-based analysis of pathways and genomes. *Nucleic Acids Research*, *51*(D1), D587–D592. https://doi.org/10.1093/nar/gkac963\n",
    "\n",
    "Kanehisa, M., & Goto, S. (2000). KEGG: Kyoto Encyclopedia of Genes and Genomes. *Nucleic Acids Research*, *28*(1), 27–30. https://doi.org/10.1093/nar/28.1.27\n",
    "\n",
    "Ruddigkeit, L., Van Deursen, R., Blum, L. C., & Reymond, J.-L. (2012). Enumeration of 166 Billion Organic Small Molecules in the Chemical Universe Database GDB-17. *Journal of Chemical Information and Modeling*, *52*(11), 2864–2875. https://doi.org/10.1021/ci300415d\n",
    "\n",
    "Seet, I., Patarroyo, K. Y., Siebert, G., Walker, S. I., & Cronin, L. (2024). *Rapid Computation of the Assembly Index of Molecular Graphs* (No. 2410.09100). arXiv. https://doi.org/10.48550/arXiv.2410.09100\n",
    "\n",
    "Sorokina, M., Merseburger, P., Rajan, K., Yirik, M. A., & Steinbeck, C. (2021). COCONUT online: Collection of Open Natural Products database. *Journal of Cheminformatics*, *13*(1), 2. https://doi.org/10.1186/s13321-020-00478-9"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}