cosmolkit-core 0.2.3

Redesigned COSMolKit core with value-style molecule state and explicit topology operation contracts
Documentation

COSMolKit

COSMolKit is a Python molecular toolkit backed by a Rust core. It provides value-style molecule operations, SMILES and SDF workflows, 2D depiction, fingerprints, batch processing, and protein-focused structural biology APIs.

The library is built around explicit behavior: supported operations return structured results, unsupported behavior fails visibly, and public molecule transforms are explicit about whether they return new values or mutate in place.

COSMolKit is designed for array-oriented structural data access, so that molecular data can be consumed efficiently and naturally from NumPy, PyTorch, and model-building workflows.

Installation

pip install cosmolkit

Core Concepts

  • Value-style molecules: methods such as with_hydrogens(), without_hydrogens(), with_kekulized_bonds(), and with_2d_coords() return new molecule values.
  • Explicit mutation: in-place Molecule operations always end with _. The trailing underscore has no other meaning in the public Molecule API.
  • Explicit errors: invalid input and unsupported behavior are surfaced as errors instead of silent fallbacks.
  • Batch-native processing: MoleculeBatch keeps input order, supports structured per-record failures, and can run batch transforms and exports with configurable parallelism.
  • Array-friendly data access: coordinates, bounds matrices, fingerprints, and graph features are exposed in forms that fit Python numerical workflows.

Value-Style Transformations

Normal molecule operations return new objects and do not mutate their inputs. This follows the same explicit-dataflow direction as modern dataframe libraries: users can reason about each transformation as a new value while COSMolKit can share unchanged internal storage efficiently.

from cosmolkit import Molecule

mol = Molecule.from_smiles("CCO")
mol_h = mol.with_hydrogens()

assert mol is not mol_h
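
The idea of returning a new value while sharing unchanged storage can be illustrated with a small standalone sketch. ToyMolecule and its methods are hypothetical stand-ins written for this example; they are not COSMolKit classes and say nothing about how the Rust core stores molecules. The sketch only shows the general pattern: immutable fields, transforms that build a new object, and untouched fields reused by reference.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyMolecule:
    """Hypothetical value-style molecule (illustration only, not COSMolKit)."""

    atoms: tuple[str, ...]
    bonds: tuple[tuple[int, int], ...]

    def with_atom(self, symbol: str, bonded_to: int) -> ToyMolecule:
        # Build new tuples instead of modifying the existing ones.
        new_index = len(self.atoms)
        return replace(
            self,
            atoms=self.atoms + (symbol,),
            bonds=self.bonds + ((bonded_to, new_index),),
        )

    def with_bonds(self, bonds: tuple[tuple[int, int], ...]) -> ToyMolecule:
        # Only the bond table changes; the atom tuple is reused as-is.
        return replace(self, bonds=bonds)


ethane = ToyMolecule(atoms=("C", "C"), bonds=((0, 1),))
ethanol = ethane.with_atom("O", bonded_to=1)
rebonded = ethane.with_bonds(())

print(ethane.atoms)                    # the input is unchanged
print(ethanol.atoms, ethanol.bonds)    # the result is a new value
print(rebonded.atoms is ethane.atoms)  # unchanged storage is shared
```

Because the fields are immutable tuples, sharing them between values is safe: no later operation on one value can alter another.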

Python Quick Start

from cosmolkit import Molecule, MoleculeBatch

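# Parse phenol and generate 2D coordinates as a new molecule value.
mol = Molecule.from_smiles("c1ccccc1O")
mol_2d = mol.with_2d_coords()

print(mol_2d.to_smiles())
print(mol_2d.coords_2d())

# Render an SVG and write a PNG file of the 2D depiction.
svg = mol_2d.to_svg(width=400, height=300)
mol_2d.write_png("phenol.png", width=400, height=300)

# Compute a Morgan fingerprint and print its on bits.
fp = mol.fingerprint_morgan(radius=2, n_bits=2048)
print(fp.on_bits())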

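# Build an ordered batch; errors="keep" records failures per record
# instead of aborting the whole batch.
batch = (
    MoleculeBatch.from_smiles_list(
        ["CCO", "c1ccccc1", "CC(=O)O"],
        sanitize=True,
        errors="keep",
    )
    .with_parallel_jobs(8)
    .with_progress_bar(False)
)

# Chain batch transforms, then inspect which records are still valid.
prepared = batch.add_hydrogens(errors="keep").compute_2d_coords(errors="keep")
print(prepared.valid_mask())
print(prepared.to_smiles_list())

# Export one image per record; filenames follow input order.
prepared.to_images(
    "molecule_images",
    format="png",
    size=(300, 300),
    errors="keep",
    filenames=["ethanol", "benzene", "acetic_acid"],
)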
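
The batch behavior used above (input order preserved, failures kept per record, a validity mask) can be sketched with the standard library alone. Everything in this block is hypothetical: Record, run_batch, and toy_parse are illustrative names, the "raise" policy is an assumption (this document only shows errors="keep"), and the parenthesis check is a toy stand-in for real SMILES parsing. It is a sketch of the pattern, not of MoleculeBatch internals.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    """One batch slot: either a value or the error that replaced it."""

    value: Optional[str] = None
    error: Optional[str] = None


def run_batch(inputs, transform, errors="keep", n_jobs=4):
    """Apply transform to every input, keeping results in input order."""

    def one(item):
        try:
            return Record(value=transform(item))
        except ValueError as exc:
            if errors == "keep":
                # Keep the failure in the same slot as its input.
                return Record(error=str(exc))
            raise  # hypothetical "raise" policy: fail the whole batch

    # Executor.map yields results in input order, regardless of which
    # worker finishes first.
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        return list(pool.map(one, inputs))


def toy_parse(smiles):
    # Toy validity check (balanced parentheses only), not real SMILES parsing.
    if smiles.count("(") != smiles.count(")"):
        raise ValueError(f"unbalanced parentheses in {smiles!r}")
    return smiles


records = run_batch(["CCO", "CC(=O", "c1ccccc1"], toy_parse)
valid_mask = [r.error is None for r in records]
print(valid_mask)
print([r.value for r in records])
```

Keeping a failed record in its original slot means downstream arrays, labels, and filenames stay aligned with the input list, which is the main reason to prefer a mask over silently dropping bad records.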

Protein Structures

Use Protein when the workflow is focused on protein chains rather than the full structural table.

from cosmolkit import Protein

protein = Protein.from_pdb("1crn.pdb")

print(protein.num_chains())
print(protein.num_residues())
print(protein.num_atoms())

for chain in protein.chains():
    print(chain.index(), chain.kind(), len(chain))
    for residue in chain.residues():
        print(residue.name(), residue.kind(), len(residue))
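
For readers unfamiliar with the chain, residue, and atom hierarchy that the loop above walks, the following standalone sketch groups PDB ATOM records by their fixed columns. It uses only the standard library; pdb_atom_line and group_atoms are hypothetical helpers, the coordinates are made up, and the grouping rule (chain ID, then residue number, insertion code, and residue name) is a simplification that may differ from how Protein defines chains and residues.

```python
def pdb_atom_line(serial, name, res_name, chain, res_seq, x, y, z):
    # Build a fixed-column ATOM record so the sample data is column-correct:
    # name in cols 13-16, resName 18-20, chainID 22, resSeq 23-26, x/y/z 31-54.
    return (
        f"ATOM  {serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}    "
        f"{x:>8.3f}{y:>8.3f}{z:>8.3f}"
    )


def group_atoms(pdb_text):
    """Group ATOM records as chain -> residue -> list of atom names."""
    chains = {}
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        chain_id = line[21]
        # Residue identity: sequence number, insertion code, residue name.
        residue_key = (int(line[22:26]), line[26], line[17:20].strip())
        atom_name = line[12:16].strip()
        chains.setdefault(chain_id, {}).setdefault(residue_key, []).append(atom_name)
    return chains


# Made-up sample: two residues in chain A, one residue in chain B.
sample = "\n".join(
    [
        pdb_atom_line(1, "N", "THR", "A", 1, 1.0, 2.0, 3.0),
        pdb_atom_line(2, "CA", "THR", "A", 1, 2.0, 3.0, 4.0),
        pdb_atom_line(3, "N", "THR", "A", 2, 3.0, 4.0, 5.0),
        pdb_atom_line(4, "N", "GLY", "B", 1, 4.0, 5.0, 6.0),
    ]
)

hierarchy = group_atoms(sample)
num_chains = len(hierarchy)
num_residues = sum(len(residues) for residues in hierarchy.values())
num_atoms = sum(
    len(atoms) for residues in hierarchy.values() for atoms in residues.values()
)
print(num_chains, num_residues, num_atoms)
```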

SDF and Dataset Workflows

SdfDataset builds a lightweight index of SDF record byte ranges, so individual records and chunks can be read without loading an entire file into memory.

from cosmolkit import SdfDataset

dataset = SdfDataset.open("library.sdf")
print(len(dataset))

record = dataset[0]
mol = record.molecule()

for batch in dataset.batches(size=1024, errors="keep", n_jobs=8):
    smiles = batch.to_smiles_list()
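
The byte-range index described above can be sketched in a few lines of standard-library Python. This is an illustration of the general technique, not SdfDataset's implementation: index_sdf and read_record are hypothetical helpers, and the sample records are placeholders rather than valid MOL blocks. The only format fact relied on is that each SDF record ends with a line containing $$$$.

```python
import io


def index_sdf(stream):
    """Return (start, end) byte offsets for each record in a binary SDF stream."""
    ranges = []
    start = stream.tell()
    while True:
        line = stream.readline()
        if not line:
            break  # a trailing fragment without a terminator is ignored
        if line.strip() == b"$$$$":
            end = stream.tell()
            ranges.append((start, end))
            start = end
    return ranges


def read_record(stream, byte_range):
    """Seek to one record and read only its bytes."""
    start, end = byte_range
    stream.seek(start)
    return stream.read(end - start).decode("utf-8")


# Placeholder records: only the $$$$ delimiters matter for indexing.
sdf_bytes = (
    b"ethanol\n(mol block omitted)\n$$$$\n"
    b"benzene\n(mol block omitted)\n$$$$\n"
)

stream = io.BytesIO(sdf_bytes)
index = index_sdf(stream)
second = read_record(stream, index[1])
print(len(index))
print(second.splitlines()[0])
```

The index holds two integers per record, so it stays small even for very large files, and any record can be fetched with a single seek and read.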

Feature Areas

  • Molecular graph construction and inspection
  • SMILES parsing and writing
  • MOL/SDF reading and writing
  • XYZ block reading
  • Hydrogen transforms and Kekulization
  • Sanitization and chemistry problem detection
  • 2D coordinate generation and SVG/PNG depiction
  • Morgan and Avalon fingerprints
  • Distance-geometry bounds matrices
  • Substructure matching and SMARTS parsing
  • Ordered batch transforms and exports
  • PDB/mmCIF molecule-block parsing and protein projection APIs
  • Support-status metadata for public features

Design Principles

COSMolKit aims to be Python-friendly, batch-friendly, and suitable for model-building workflows.

  • Correctness comes before breadth.
  • Public transforms use value semantics.
  • Mutation-capable workflows are explicit.
  • Unsupported chemistry should fail clearly.
  • RDKit-parity behavior is the correctness floor for supported cheminformatics features.
  • High-throughput APIs should preserve input order and expose per-record failures.

Examples

Python examples live in python/examples/.

Roadmap

Status labels:

  • ✅ available in the public Python API
  • 🧪 implemented or partially available, still being hardened
  • 🚧 planned / not yet public

Chemistry Core

Goal: keep the supported molecular core correct before expanding breadth.

  • ✅ Molecule, atom, and bond graph model
  • ✅ SMILES parsing
  • ✅ SMILES writing with RDKit-style writer options for supported branches
  • ✅ Ring perception, valence handling, aromaticity, and Kekulization
  • ✅ Hydrogen addition and removal
  • ✅ Sanitization for supported chemistry workflows
  • ✅ Stereochemistry inspection for supported atom and bond states
  • ✅ Distance-geometry bounds matrices
  • ✅ Morgan fingerprints and Tanimoto similarity
  • 🧪 Avalon fingerprints
  • 🧪 Substructure matching and SMARTS parsing
  • 🚧 Broader descriptor APIs such as formula, molecular weight, and ring statistics
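
As a reminder of what the Tanimoto similarity listed above measures, here is a minimal pure-Python sketch over sets of on-bit positions. It assumes a fingerprint can be reduced to an iterable of integer bit indices (for example, something like the output of on_bits() in the quick start, which is an assumption about its return type). The tanimoto function is a hypothetical helper, not the library's own similarity API.

```python
def tanimoto(on_bits_a, on_bits_b):
    """Tanimoto similarity: |A & B| / |A | B| over sets of on-bit positions."""
    a, b = set(on_bits_a), set(on_bits_b)
    union = len(a | b)
    if union == 0:
        # Both fingerprints are empty. Returning 0.0 is a convention chosen
        # for this sketch; a library may define this case differently.
        return 0.0
    return len(a & b) / union


# Two shared bits (5, 9) out of four distinct bits (1, 5, 9, 12).
similarity = tanimoto([1, 5, 9], [5, 9, 12])
print(similarity)
```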

File I/O and Depiction

Goal: make common molecule import, export, and visualization workflows usable from Python.

  • ✅ MOL/SDF reading
  • ✅ XYZ block reading
  • ✅ SDF dataset indexing for large files
  • ✅ SDF writing for supported V2000/V3000 branches
  • ✅ PDB block to molecule conversion
  • ✅ mmCIF block to molecule conversion through the same molecule-conversion profile
  • ✅ 2D coordinate generation
  • ✅ SVG drawing
  • ✅ PNG export
  • 🧪 RDKit-style visual parity testing for supported depiction output
  • 🚧 Annotation overlays and richer drawing customization
  • 🚧 3D conformer generation and embedding APIs

Batch-Native Workflows

Goal: make high-throughput molecule preparation and export a first-class capability of the library.

  • ✅ Ordered MoleculeBatch.from_smiles_list()
  • ✅ Batch transforms for sanitization, hydrogens, Kekulization, and 2D coordinates
  • ✅ Configurable parallelism with with_parallel_jobs()
  • ✅ Configurable progress display with with_progress_bar()
  • ✅ Per-record errors, valid masks, and error reports
  • ✅ Batch SMILES, image, and SDF export paths
  • 🧪 Golden parity tests for parallel batch behavior
  • 🚧 More streaming and chunked dataset workflows

Protein and Structural Biology

Goal: provide practical Biopython-like structure workflows without forcing users through low-level structural tables.

  • ✅ Protein.from_pdb() / Protein.from_mmcif() high-level entry points
  • ✅ Protein chain, residue, and atom iteration
  • ✅ Protein-only projection from broader structural data
  • 🧪 PDB/mmCIF structural parsing
  • 🚧 Selection utilities for chains, residues, atoms, and neighborhoods
  • 🚧 Ligand, nucleic-acid, and mixed-structure ergonomic APIs

Python API and ML Readiness

Goal: expose verified molecular behavior through a practical Python interface.

  • ✅ Value-style molecule transformations
  • ✅ Graph, coordinate, fingerprint, and bounds-matrix accessors
  • ✅ Python examples for drawing, SDF-to-SMILES, batch processing, and proteins
  • 🧪 Type stubs and documentation coverage
  • 🚧 Stable model-ready graph exports
  • 🚧 NumPy / PyTorch oriented adapters
  • 🚧 Molecular tokenization and AI-native geometry helpers

Browser and Deployment

Goal: support lightweight chemistry workflows outside native Python processes.

  • 🚧 WASM compilation target
  • 🚧 JavaScript bindings
  • 🚧 Browser-native SMILES/SDF parsing and depiction

Respect for RDKit

COSMolKit is developed with deep respect for RDKit and the broader open-source cheminformatics community. The goal is an independent Rust-native implementation that preserves interoperability and RDKit-parity behavior where appropriate, while offering a deterministic Python API and an AI-native extension surface.