COSMolKit
COSMolKit is a Python molecular toolkit backed by a Rust core. It provides value-style molecule operations, SMILES and SDF workflows, 2D depiction, fingerprints, batch processing, and protein-focused structural biology APIs.
The library is built around explicit behavior: supported operations return structured results, unsupported behavior fails visibly, and public molecule transforms return new values instead of mutating their inputs.
COSMolKit is designed for array-oriented structural data access, keeping molecular data efficient and natural for NumPy, PyTorch, and model-building workflows.
Documentation
- Python documentation: https://kit.cosmol.org/
- Rust and development notes: crates/cosmolkit/README.md
Installation
Core Concepts
- Value-style molecules: methods such as with_hydrogens(), without_hydrogens(), with_kekulized_bonds(), and with_2d_coords() return new molecule values.
- Explicit errors: invalid input and unsupported behavior are surfaced as errors instead of silent fallbacks.
- Batch-native processing: MoleculeBatch keeps input order, supports structured per-record failures, and can run batch transforms and exports with configurable parallelism.
- Array-friendly data access: coordinates, bounds matrices, fingerprints, and graph features are exposed in forms that fit Python numerical workflows.
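The batch contract described above (input order preserved, per-record failures kept as data, configurable parallelism) can be sketched with the standard library alone. This is an illustrative stand-in, not the MoleculeBatch API: RecordResult, run_batch, and toy_transform are hypothetical names invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class RecordResult:
    """Hypothetical per-record outcome: either a value or an error message."""

    index: int
    value: Optional[str] = None
    error: Optional[str] = None

    @property
    def ok(self) -> bool:
        return self.error is None


def run_batch(records, transform: Callable[[str], str], jobs: int = 2):
    """Apply transform to every record, keeping input order and capturing failures."""

    def one(item):
        index, record = item
        try:
            return RecordResult(index, value=transform(record))
        except ValueError as exc:
            # A bad record becomes a structured failure instead of aborting the batch.
            return RecordResult(index, error=str(exc))

    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # Executor.map yields results in input order, whatever the completion order.
        return list(pool.map(one, enumerate(records)))


def toy_transform(smiles: str) -> str:
    # Hypothetical placeholder transform: rejects empty input, strips whitespace.
    if not smiles.strip():
        raise ValueError("empty record")
    return smiles.strip()


results = run_batch([" CCO ", "", "c1ccccc1"], toy_transform)
valid_mask = [r.ok for r in results]
errors = [(r.index, r.error) for r in results if not r.ok]
```

Keeping failures as values in the result list is what makes a valid mask and an error report cheap to derive afterwards; both are simple projections of the same ordered results.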
Value-Style Transformations
Normal molecule operations return new objects and do not mutate their inputs. This follows the same explicit-dataflow direction as modern dataframe libraries: users can reason about each transformation as a new value while the Rust core can share unchanged internal storage efficiently.
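The value-style contract can be illustrated independently of COSMolKit. The sketch below uses a hypothetical frozen dataclass, ToyMolecule; its with_hydrogens() borrows the method name listed above but takes a made-up count argument and is not the library's implementation. It only demonstrates the pattern: a transform returns a new value, the input stays unchanged, and untouched storage can be shared.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyMolecule:
    """Hypothetical stand-in for a value-style molecule (not the COSMolKit type)."""

    atoms: tuple          # heavy-atom symbols, stored immutably
    explicit_h: int = 0   # number of explicit hydrogens

    def with_hydrogens(self, count: int) -> "ToyMolecule":
        # Build and return a new value; frozen=True forbids in-place mutation.
        return replace(self, explicit_h=count)


mol = ToyMolecule(atoms=("C", "C", "O"))
mol_h = mol.with_hydrogens(6)

assert mol_h is not mol          # a new object comes back
assert mol.explicit_h == 0       # the input is unchanged
assert mol_h.atoms is mol.atoms  # unchanged storage is shared, not copied
```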
Python Quick Start
Protein Structures
Use Protein when the workflow is focused on protein chains rather than the
full structural table.
For lower-level structural workflows, COSMolKit also exposes BioStructure
types in Rust and Python.
SDF and Dataset Workflows
SdfDataset builds a lightweight index of SDF record byte ranges, so individual
records and chunks can be read without loading an entire file into memory.
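The byte-range idea can be sketched in plain Python. This is not SdfDataset's implementation: index_sdf_records and read_record are hypothetical helpers, and the two records are toy content rather than valid molfiles. The only format fact relied on is that each SDF record ends with a line containing "$$$$".

```python
import io


def index_sdf_records(stream):
    """Return (start, end) byte offsets for each record in a binary SDF stream."""
    ranges = []
    start = stream.tell()
    while True:
        line = stream.readline()
        if not line:
            break  # end of file; trailing bytes without a "$$$$" line are ignored
        if line.strip() == b"$$$$":
            end = stream.tell()
            ranges.append((start, end))
            start = end
    return ranges


def read_record(stream, byte_range):
    """Seek to one record and read only its bytes."""
    start, end = byte_range
    stream.seek(start)
    return stream.read(end - start)


# Toy data standing in for a large SDF file on disk.
sdf_bytes = b"mol-A\n  toy\n\nM  END\n$$$$\nmol-B\n  toy\n\nM  END\n$$$$\n"
stream = io.BytesIO(sdf_bytes)
ranges = index_sdf_records(stream)
second = read_record(stream, ranges[1])
```

The index holds two integers per record, so it stays small even for very large files, and any record or contiguous chunk can be fetched with a single seek and read.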
Feature Areas
- Molecular graph construction and inspection
- SMILES parsing and writing
- MOL/SDF reading and writing
- Hydrogen transforms and Kekulization
- Sanitization and chemistry problem detection
- 2D coordinate generation and SVG/PNG depiction
- Morgan and Avalon fingerprints
- Distance-geometry bounds matrices
- Substructure matching and SMARTS parsing
- Ordered batch transforms and exports
- PDB/mmCIF parsing and protein projection APIs
- Support-status metadata for public features
Design Principles
COSMolKit aims to be Python-friendly, batch-friendly, and suitable for model-building workflows.
- Correctness comes before breadth.
- Public transforms use value semantics.
- Mutation-capable workflows are explicit.
- Unsupported chemistry should fail clearly.
- RDKit-compatible behavior is the correctness floor for supported cheminformatics features.
- High-throughput APIs should preserve input order and expose per-record failures.
Examples
Python examples live in python/examples/.
Roadmap
Status labels:
- ✅ available in the public Python API
- 🧪 implemented or partially available, still being hardened
- 🚧 planned / not yet public
Chemistry Core
Goal: keep the supported molecular core correct before expanding breadth.
- ✅ Molecule, atom, and bond graph model
- ✅ SMILES parsing
- ✅ SMILES writing with RDKit-style writer options for supported branches
- ✅ Ring perception, valence handling, aromaticity, and Kekulization
- ✅ Hydrogen addition and removal
- ✅ Sanitization for supported chemistry workflows
- ✅ Stereochemistry inspection for supported atom and bond states
- ✅ Distance-geometry bounds matrices
- ✅ Morgan fingerprints and Tanimoto similarity
- 🧪 Avalon fingerprints
- 🧪 Substructure matching and SMARTS parsing
- 🚧 Broader descriptor APIs such as formula, molecular weight, and ring statistics
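For reference, the Tanimoto similarity named in the list above is the size of the intersection of two fingerprints' set bits divided by the size of their union. The sketch below is a plain-Python illustration over sets of bit indices, not COSMolKit's API; the tanimoto function name, the example bit indices, and the choice to return 0.0 for two empty fingerprints are assumptions made here.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two collections of set-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        # Convention chosen for this sketch: two empty fingerprints score 0.0.
        return 0.0
    return len(a & b) / union


fp_a = {1, 5, 9, 12}   # made-up bit indices, not real Morgan output
fp_b = {5, 9, 20}
similarity = tanimoto(fp_a, fp_b)  # 2 shared bits out of 5 distinct bits
```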
File I/O and Depiction
Goal: make common molecule import, export, and visualization workflows usable from Python.
- ✅ MOL/SDF reading
- ✅ SDF dataset indexing for large files
- ✅ SDF writing for supported V2000/V3000 branches
- ✅ 2D coordinate generation
- ✅ SVG drawing
- ✅ PNG export
- 🧪 RDKit-style visual parity testing for supported depiction output
- 🚧 Annotation overlays and richer drawing customization
- 🚧 3D conformer generation and embedding APIs
Batch-Native Workflows
Goal: make high-throughput molecule preparation and export a core product identity.
- ✅ Ordered MoleculeBatch.from_smiles_list()
- ✅ Batch transforms for sanitization, hydrogens, Kekulization, and 2D coordinates
- ✅ Configurable parallelism with with_parallel_jobs()
- ✅ Configurable progress display with with_progress_bar()
- ✅ Per-record errors, valid masks, and error reports
- ✅ Batch SMILES, image, and SDF export paths
- 🧪 Golden parity tests for parallel batch behavior
- 🚧 More streaming and chunked dataset workflows
Protein and Structural Biology
Goal: provide practical Biopython-like structure workflows without forcing users through low-level structural tables.
- ✅ Protein.from_pdb() / Protein.from_mmcif() high-level entry points
- ✅ Protein chain, residue, and atom iteration
- ✅ Protein-only projection from broader structural data
- 🧪 PDB/mmCIF structural parsing
- 🧪 Lower-level BioStructure access for advanced workflows
- 🚧 Selection utilities for chains, residues, atoms, and neighborhoods
- 🚧 Ligand, nucleic-acid, and mixed-structure ergonomic APIs
Python API and ML Readiness
Goal: expose verified Rust-backed behavior through a practical Python interface.
- ✅ Value-style molecule transformations
- ✅ Graph, coordinate, fingerprint, and bounds-matrix accessors
- ✅ Python examples for drawing, SDF-to-SMILES, batch processing, and proteins
- 🧪 Type stubs and documentation coverage
- 🚧 Stable model-ready graph exports
- 🚧 NumPy / PyTorch oriented adapters
- 🚧 Molecular tokenization and AI-native geometry helpers
Browser and Deployment
Goal: support lightweight chemistry workflows outside native Python processes.
- 🚧 WASM compilation target
- 🚧 JavaScript bindings
- 🚧 Browser-native SMILES/SDF parsing and depiction
Respect for RDKit
COSMolKit is developed with deep respect for RDKit and the broader open-source cheminformatics community. The goal is an independent Rust-native implementation that preserves interoperability and behavioral compatibility where appropriate, while offering a more deterministic Python API and AI-native extension surface.