COSMolKit
COSMolKit is a Python molecular toolkit backed by a Rust core. It provides value-style molecule operations, SMILES/SDF/MOL2/XYZ workflows, 2D depiction, native 3D conformer generation, UFF/MMFF optimization, fingerprints, batch processing, and protein-focused structural biology APIs.
The library is built around explicit behavior: supported operations return structured results, unsupported behavior fails visibly, and public molecule transforms are explicit about whether they return new values or mutate in place.
COSMolKit is designed for array-oriented structural data access, keeping molecular data efficient and natural for NumPy, PyTorch, and model-building workflows.
Documentation
- Python documentation: https://kit.cosmol.org/
- Rust crate notes: crates/cosmolkit/README.md
Installation
Core Concepts
- Value-style molecules: methods such as with_hydrogens(), without_hydrogens(), with_kekulized_bonds(), and with_2d_coordinates() return new molecule values.
- Explicit mutation: in-place Molecule operations always end with _. The trailing underscore has no other public Molecule meaning.
- Explicit errors: invalid input and unsupported behavior are surfaced as errors instead of silent fallbacks.
- Batch-native processing: MoleculeBatch keeps input order, supports structured per-record failures, and can run batch transforms and exports with configurable parallelism.
- Array-friendly data access: coordinates, bounds matrices, fingerprints, and graph features are exposed in forms that fit Python numerical workflows.
- Source-backed 3D workflows: conformer generation and UFF/MMFF optimization are available through the public Python API.
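The batch-native concept above combines three properties: input order is kept, failures are recorded per record, and parallelism is configurable. The sketch below illustrates that pattern with the Python standard library only. It is not COSMolKit's implementation or API: RecordResult, run_batch, and toy_parse are hypothetical names, and toy_parse is only a stand-in for a real parser.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class RecordResult:
    """Outcome for one input record (hypothetical type, for illustration only)."""
    index: int
    value: Optional[str] = None
    error: Optional[str] = None

    @property
    def ok(self) -> bool:
        return self.error is None


def run_batch(records: List[str], fn: Callable[[str], str], jobs: int = 2) -> List[RecordResult]:
    """Apply fn to every record in parallel while keeping input order."""

    def guarded(item):
        index, record = item
        try:
            return RecordResult(index, value=fn(record))
        except Exception as exc:
            # Record the failure for this entry instead of aborting the whole batch.
            return RecordResult(index, error=f"{type(exc).__name__}: {exc}")

    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # Executor.map yields results in input order, whatever the completion order.
        return list(pool.map(guarded, enumerate(records)))


def toy_parse(smiles: str) -> str:
    """Stand-in for a real parser: rejects empty input and unbalanced parentheses."""
    text = smiles.strip()
    if not text or text.count("(") != text.count(")"):
        raise ValueError(f"cannot parse {smiles!r}")
    return text


results = run_batch(["CCO", "C(C", "c1ccccc1"], toy_parse, jobs=2)
valid_mask = [r.ok for r in results]                            # one flag per input, same order
error_report = {r.index: r.error for r in results if not r.ok}  # failures keyed by input position
```

Because every input produces exactly one result at the same position, downstream code can filter with the mask and still map failures back to the original records.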
Value-Style Transformations
Normal molecule operations return new objects and do not mutate their inputs. This follows the same explicit-dataflow direction as modern dataframe libraries: users can reason about each transformation as a new value while COSMolKit can share unchanged internal storage efficiently.
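As a minimal sketch of this convention, the toy class below pairs a value-style transform with an in-place operation marked by a trailing underscore. ToyMolecule, with_charge, and set_charge_ are hypothetical names chosen for illustration; they are not part of COSMolKit's API, and the sharing of the atoms tuple only mimics the idea of reusing unchanged storage.

```python
from typing import Tuple


class ToyMolecule:
    """Toy stand-in (hypothetical; not COSMolKit's Molecule class)."""

    def __init__(self, atoms: Tuple[str, ...], charge: int = 0):
        self.atoms = atoms    # immutable tuple, safe to share between values
        self.charge = charge

    def with_charge(self, charge: int) -> "ToyMolecule":
        # Value-style transform: return a new object and leave self untouched.
        # The unchanged atoms tuple is shared by reference, not copied.
        return ToyMolecule(self.atoms, charge)

    def set_charge_(self, charge: int) -> None:
        # In-place operation: the trailing underscore signals mutation of self.
        self.charge = charge


mol = ToyMolecule(("C", "C", "O"))
charged = mol.with_charge(-1)

assert charged is not mol           # a new value was returned
assert mol.charge == 0              # the input was not mutated
assert charged.atoms is mol.atoms   # unchanged storage is shared

mol.set_charge_(1)                  # explicit in-place mutation
assert mol.charge == 1
```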
Python Quick Start
Protein Structures
Use Protein when the workflow is focused on protein chains rather than the
full structural table.
SDF and Dataset Workflows
SdfDataset builds a lightweight index of SDF record byte ranges, so individual
records and chunks can be read without loading an entire file into memory.
Molfile-only readers such as Molecule.read_mol() follow RDKit
MolFromMolBlock boundaries: they stop after the first M END line and leave
trailing SDF data fields to the SDF APIs.
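The two ideas in this section, a byte-range index over SDF records and a molfile reader that stops at the terminator line, can be sketched with the standard library alone. This is an illustration of the general technique under stated assumptions, not COSMolKit's implementation: build_sdf_index, read_record, and molblock_only are hypothetical helpers, and SAMPLE is made-up data. The sketch relies on two facts of the file format: SDF records end with a $$$$ line, and a molfile block ends with an M  END line (two spaces between M and END).

```python
import io
from typing import BinaryIO, List, Tuple

RECORD_END = b"$$$$"


def build_sdf_index(stream: BinaryIO) -> List[Tuple[int, int]]:
    """Scan the stream once and return (start, end) byte offsets per SDF record."""
    index = []
    start = stream.tell()
    while True:
        line = stream.readline()
        if not line:
            break  # end of file; trailing text without a $$$$ line is ignored here
        if line.strip() == RECORD_END:
            index.append((start, stream.tell()))
            start = stream.tell()
    return index


def read_record(stream: BinaryIO, index: List[Tuple[int, int]], i: int) -> str:
    """Seek directly to record i and read only its bytes."""
    start, end = index[i]
    stream.seek(start)
    return stream.read(end - start).decode("utf-8")


def molblock_only(record: str) -> str:
    """Keep text up to and including the first 'M  END' line; drop SDF data fields."""
    kept = []
    for line in record.splitlines():
        kept.append(line)
        if line.rstrip() == "M  END":
            break
    return "\n".join(kept) + "\n"


# Made-up two-record SDF content, used only to exercise the helpers.
SAMPLE = (
    b"water\n  toy\n\n"
    b"  1  0  0  0  0  0  0  0  0  0999 V2000\n"
    b"    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0\n"
    b"M  END\n"
    b">  <NAME>\nwater\n\n"
    b"$$$$\n"
    b"methane\n  toy\n\n"
    b"  1  0  0  0  0  0  0  0  0  0999 V2000\n"
    b"    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n"
    b"M  END\n"
    b">  <NAME>\nmethane\n\n"
    b"$$$$\n"
)

stream = io.BytesIO(SAMPLE)
index = build_sdf_index(stream)           # one pass over the file, offsets only
second = read_record(stream, index, 1)    # random access to the second record
molblock = molblock_only(second)          # data fields after M  END are left out
```

Only the offsets are held in memory, so the cost of the index grows with the number of records rather than with the file size.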
Conformer Generation And Optimization
with_3d_conformer() follows RDKit's ETKDG behavior for trusted molecular
graphs: molecules without explicit hydrogens are embedded as heavy-atom-only
conformers instead of failing or automatically adding hydrogens. Calling
with_hydrogens() first is recommended for all-atom geometry, force-field
optimization, and hydrogen-bond-sensitive workflows. Coordinate-only inputs
such as XYZ blocks do not contain a bond topology and are not valid ETKDG
inputs until a trusted graph has been constructed.
Feature Areas
- Molecular graph construction and inspection
- SMILES parsing and writing
- MOL/SDF reading and writing
- MOL2 reading with RDKit-style Mol2ParserParams
- XYZ block reading
- Hydrogen transforms and Kekulization
- Sanitization and chemistry problem detection
- 2D coordinate generation and SVG/PNG depiction
- Native 3D conformer generation with DG/KDG/ETDG/ETKDG parameter presets
- UFF/MMFF optimization of generated or imported 3D conformers
- Morgan and Avalon fingerprints
- Distance-geometry bounds matrices
- Substructure matching and SMARTS parse metadata
- Ordered batch transforms and exports
- PDB/mmCIF molecule-block parsing and protein projection APIs
- Support-status metadata for public features
Design Principles
COSMolKit aims to be Python-friendly, batch-friendly, and suitable for model-building workflows.
- Correctness comes before breadth.
- Public transforms use value semantics.
- Mutation-capable workflows are explicit.
- Unsupported chemistry should fail clearly.
- RDKit-parity behavior is the correctness floor for supported cheminformatics features.
- High-throughput APIs should preserve input order and expose per-record failures.
Examples
Python examples live in python/examples/.
Roadmap
Status labels:
- ✅ available in the public Python API
- 🧪 implemented or partially available, still being hardened
- 🚧 planned / not yet public
Chemistry Core
Goal: keep the supported molecular core correct before expanding breadth.
- ✅ Molecule, atom, and bond graph model
- ✅ SMILES parsing
- ✅ SMILES writing with RDKit-style writer options for supported branches
- ✅ Ring perception, valence handling, aromaticity, and Kekulization
- ✅ Hydrogen addition and removal
- ✅ Sanitization for supported chemistry workflows
- ✅ Stereochemistry inspection for supported atom and bond states
- ✅ Distance-geometry bounds matrices
- ✅ Native 3D conformer generation and UFF/MMFF post-optimization for supported molecules
- 🧪 Morgan fingerprints and Tanimoto similarity
- 🧪 Avalon fingerprints
- 🧪 Substructure matching and Python SMARTS parse metadata
- 🚧 Broader descriptor APIs such as formula, molecular weight, and ring statistics
File I/O and Depiction
Goal: make common molecule import, export, and visualization workflows usable from Python.
- ✅ MOL/SDF reading
- ✅ MOL2 reading
- ✅ XYZ block reading
- ✅ SDF dataset indexing for large files
- ✅ SDF writing for supported V2000/V3000 branches
- ✅ PDB block to molecule conversion
- ✅ mmCIF block to molecule conversion through the same molecule-conversion profile
- ✅ 2D coordinate generation
- ✅ SVG drawing
- ✅ PNG export
- 🧪 RDKit-style visual parity testing for supported depiction output
- 🚧 Annotation overlays and richer drawing customization
- ✅ 3D conformer generation and embedding APIs
Batch-Native Workflows
Goal: make high-throughput molecule preparation and export a core product identity.
- ✅ Ordered MoleculeBatch.from_smiles_list()
- ✅ Batch transforms for sanitization, hydrogens, Kekulization, and 2D coordinates
- ✅ Configurable parallelism with with_parallel_jobs()
- ✅ Configurable progress display with with_progress_bar()
- ✅ Per-record errors, valid masks, and error reports
- ✅ Batch SMILES, image, and SDF export paths
- 🧪 Golden parity tests for parallel batch behavior
- 🚧 More streaming and chunked dataset workflows
Protein and Structural Biology
Goal: provide practical Biopython-like structure workflows without forcing users through low-level structural tables.
- ✅ Protein.from_pdb() / Protein.from_mmcif() high-level entry points
- ✅ Protein chain, residue, and atom iteration
- ✅ Protein-only projection from broader structural data
- 🧪 PDB/mmCIF structural parsing
- 🚧 Selection utilities for chains, residues, atoms, and neighborhoods
- 🚧 Ligand, nucleic-acid, and mixed-structure ergonomic APIs
Python API and ML Readiness
Goal: expose verified molecular behavior through a practical Python interface.
- ✅ Value-style molecule transformations
- ✅ Graph, coordinate, fingerprint, and bounds-matrix accessors
- ✅ Python examples for drawing, SDF-to-SMILES, batch processing, and proteins
- 🧪 Type stubs and documentation coverage
- 🚧 Stable model-ready graph exports
- 🚧 NumPy / PyTorch oriented adapters
- 🚧 Molecular tokenization and AI-native geometry helpers
Browser and Deployment
Goal: support lightweight chemistry workflows outside native Python processes.
- 🚧 WASM compilation target
- 🚧 JavaScript bindings
- 🚧 Browser-native SMILES/SDF parsing and depiction
Respect for RDKit
COSMolKit is developed with deep respect for RDKit and the broader open-source cheminformatics community. The goal is an independent Rust-native implementation that preserves interoperability and RDKit-parity behavior where appropriate, while offering a deterministic Python API and AI-native extension surface.