sdfrust
A fast, pure-Rust parser for SDF (Structure Data File), MOL2, and XYZ chemical structure files, with Python bindings.
Features
- SDF V2000/V3000: Full read/write support for both SDF format versions
- MOL2 (TRIPOS): Full read/write support for MOL2 format
- XYZ: Read support for XYZ coordinate files (single and multi-molecule)
- Gzip Support: Transparent decompression of .gz files (optional gzip feature)
- Python Bindings: First-class Python API with NumPy integration
- Streaming Parsing: Memory-efficient iteration over large files
- Molecular Descriptors: MW, exact mass, ring count, rotatable bonds, and more
- ML-Ready Featurization: OGB-compatible GNN features, ECFP fingerprints, Gasteiger charges
- Chemical Perception: SSSR rings, aromaticity (Hückel 4n+2), hybridization, conjugation
- 3D GNN Support: Cutoff-based neighbor lists, bond/dihedral angles for SchNet/DimeNet/GemNet
- High Performance: ~220,000 molecules/sec (4-7x faster than RDKit)
- Real-World Validated: 100% parse success rate on the PDBbind 2024 dataset (27,670 ligand SDF files)
Installation
Rust
Add to your Cargo.toml:
[dependencies]
sdfrust = "0.6"
Python
# From source (requires Rust toolchain)
Quick Start
Rust
use ;
// Parse a single molecule (any format)
let mol = parse_sdf_file?;
let mol = parse_mol2_file?;
let mol = parse_xyz_file?;
let mol = parse_auto_file?; // Auto-detect
println!;
println!;
println!;
println!;
// Parse multiple molecules
let molecules = parse_sdf_file_multi?;
// Iterate over large files (memory efficient)
for result in iter_sdf_file?
// Write molecules
write_sdf_file?;
Python
# Parse molecules
=
=
# Access properties
# Molecular descriptors
# NumPy integration
= # (N, 3) array
= # (N,) array
# Iterate over large files
# Write molecules
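Memory-efficient iteration over large files is possible because each SDF record ends with a line containing only $$$$, so a reader needs to hold just one record at a time. The following is a minimal pure-Python sketch of that idea under stated assumptions, not sdfrust's implementation; iter_sdf_records and open_maybe_gzip are hypothetical helper names introduced only for illustration.

```python
import gzip
from typing import Iterable, Iterator, TextIO


def open_maybe_gzip(path: str) -> TextIO:
    """Open a text file, decompressing on the fly when the name ends in .gz."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


def iter_sdf_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield one SDF record at a time; a '$$$$' line terminates each record."""
    buffer = []
    for line in lines:
        if line.strip() == "$$$$":
            yield "".join(buffer)
            buffer = []
        else:
            buffer.append(line)
    # A trailing record without a terminator is still yielded if it is non-empty.
    if any(part.strip() for part in buffer):
        yield "".join(buffer)
```

Because the generator consumes any iterable of lines, it can be fed an open file handle directly (for example the one returned by open_maybe_gzip), and only the current record is kept in memory.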
Data Model
Molecule
The main container for molecular data:
Atom
Represents an atom with 3D coordinates:
Bond
Represents a bond between two atoms:
BondOrder
Molecule Methods
// Atom/bond counts
mol.atom_count
mol.bond_count
mol.is_empty
// Connectivity
mol.neighbors // Get connected atom indices
mol.bonds_for_atom // Get bonds for an atom
// Properties
mol.get_property
mol.set_property
// Chemistry
mol.formula // "C2H6O"
mol.total_charge // Sum of formal charges
mol.element_counts // HashMap of element counts
mol.has_aromatic_bonds
mol.has_charges
// Geometry
mol.centroid // Geometric center
mol.center // Move centroid to origin
mol.translate
// Descriptors
mol.molecular_weight // IUPAC 2021 atomic weights
mol.exact_mass // Monoisotopic mass
mol.heavy_atom_count // Non-hydrogen atoms
mol.ring_count // Using Euler formula
mol.rotatable_bond_count // RDKit-compatible definition
mol.is_atom_in_ring
mol.is_bond_in_ring
// Filtering
mol.atoms_by_element
mol.bonds_by_order
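The "Euler formula" noted for ring_count refers to the standard graph identity: the number of independent rings equals bonds minus atoms plus connected components. Below is a small standalone sketch of that formula; the function name and the bond-list representation are assumptions made for illustration and may not match how sdfrust computes it internally.

```python
from typing import List, Tuple


def ring_count(n_atoms: int, bonds: List[Tuple[int, int]]) -> int:
    """Number of independent rings: bonds - atoms + connected components."""
    parent = list(range(n_atoms))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    components = n_atoms
    for a, b in bonds:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            components -= 1
    return len(bonds) - n_atoms + components


# Benzene heavy atoms: 6 atoms, 6 bonds, one ring.
benzene = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
print(ring_count(6, benzene))  # 1
```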
ML-Ready Features
sdfrust provides chemical perception and featurization for molecular ML: features are computed in Rust and returned as NumPy arrays ready for PyTorch or JAX.
OGB GNN Featurization
use ogb;
let graph = ogb_graph_features;
// graph.atom_features: [N, 9] - atomic_num, chirality, degree, charge, num_hs, radical, hybridization, aromatic, in_ring
// graph.bond_features: [2E, 3] - bond_type, stereo, conjugated
// graph.edge_src, graph.edge_dst: directed edge index
# Python: direct NumPy arrays for PyTorch Geometric
= # np.array [N, 9]
= # np.array [E, 3]
= # dict with edge_src, edge_dst
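The [2E, 3] shape and the directed edge index in the Rust comments follow a common graph-learning convention (used by OGB and PyTorch Geometric) in which each undirected bond is stored twice, once per direction. A minimal sketch of that expansion is shown here; directed_edge_index is a hypothetical helper, and the exact edge ordering sdfrust produces is an assumption.

```python
from typing import List, Tuple


def directed_edge_index(bonds: List[Tuple[int, int]]) -> Tuple[List[int], List[int]]:
    """Expand E undirected bonds into 2E directed edges (a->b and b->a)."""
    edge_src: List[int] = []
    edge_dst: List[int] = []
    for a, b in bonds:
        edge_src += [a, b]
        edge_dst += [b, a]
    return edge_src, edge_dst
```

With this layout, the per-bond feature row is simply repeated for both directions, which is why E bonds yield 2E feature rows.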
ECFP/Morgan Fingerprints
use ecfp;
let fp = ecfp; // ECFP4, 2048 bits
let similarity = fp.tanimoto;
= # List[bool]
= # np.array (2048,)
= # float
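For reference, the Tanimoto coefficient of two bit-vector fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below works on plain Python lists of booleans; it illustrates the formula only and does not call sdfrust, and the handling of two all-zero fingerprints is a convention chosen here.

```python
from typing import Sequence


def tanimoto(fp_a: Sequence[bool], fp_b: Sequence[bool]) -> float:
    """Tanimoto coefficient |A and B| / |A or B| of two equal-length bit vectors."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    # Two all-zero fingerprints are treated as identical here (a convention choice).
    return both / either if either else 1.0
```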
Gasteiger Partial Charges
use gasteiger_charges;
let charges = gasteiger_charges; // Vec<f64>
= # List[float]
= # np.array (N,)
Chemical Perception
use ;
let rings = sssr; // SSSR ring perception
let aromatic = all_aromatic_atoms; // Hückel 4n+2 aromaticity
let hybs = all_hybridizations; // SP/SP2/SP3
let conjugated = all_conjugated_bonds; // Bond conjugation
= # List of ring atom lists
= # List[bool]
= # List[str]
= # List[bool]
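The Hückel rule referenced above states that a planar, fully conjugated ring is aromatic when it holds 4n+2 pi electrons for some non-negative integer n. The sketch below shows only that counting step; how many pi electrons each atom contributes, and whether the ring is conjugated at all, are separate perception steps that are not reproduced here.

```python
def satisfies_huckel(pi_electrons: int) -> bool:
    """True if the count equals 4n + 2 for some integer n >= 0."""
    return pi_electrons >= 2 and (pi_electrons - 2) % 4 == 0


print(satisfies_huckel(6))  # benzene: True
print(satisfies_huckel(4))  # cyclobutadiene: False
```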
3D GNN Features (geometry feature)
use ;
let nl = neighbor_list; // Cutoff-based neighbor list
let angles = all_bond_angles; // Bond angles for DimeNet
let dihedrals = all_dihedral_angles; // Dihedral angles for GemNet
= # dict: edge_src, edge_dst, distances
= # dict: triplets, angles
= # dict: quadruplets, angles
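To make the geometry features concrete, here is a brute-force sketch of a cutoff-based neighbor list and a bond-angle calculation in plain Python. It is an illustration under stated assumptions, not sdfrust's algorithm: the function names, the inclusive cutoff, and the O(N^2) pair loop are all choices made for this example.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def neighbor_list(coords: Sequence[Point], cutoff: float) -> Tuple[List[int], List[int], List[float]]:
    """Brute-force O(N^2) directed neighbor list of all atom pairs within cutoff."""
    edge_src, edge_dst, distances = [], [], []
    for i, p in enumerate(coords):
        for j, q in enumerate(coords):
            if i == j:
                continue
            d = math.dist(p, q)
            if d <= cutoff:
                edge_src.append(i)
                edge_dst.append(j)
                distances.append(d)
    return edge_src, edge_dst, distances


def bond_angle(a: Point, b: Point, c: Point) -> float:
    """Angle a-b-c in radians, with b at the vertex."""
    v1 = [x - y for x, y in zip(a, b)]
    v2 = [x - y for x, y in zip(c, b)]
    cos_theta = sum(x * y for x, y in zip(v1, v2)) / (math.hypot(*v1) * math.hypot(*v2))
    return math.acos(max(-1.0, min(1.0, cos_theta)))  # clamp against rounding error
```

For large structures a cell list or k-d tree would typically replace the double loop; the output layout (source indices, destination indices, distances) stays the same.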
Supported Formats
| Format | Read | Write |
|---|---|---|
| SDF V2000 | ✅ | ✅ |
| SDF V3000 | ✅ | ✅ |
| MOL2 (TRIPOS) | ✅ | ✅ |
| XYZ | ✅ | - |
| Gzip (.gz) | ✅ | - |
Format Auto-Detection
// Automatically detect V2000 vs V3000
let mol = parse_sdf_auto_file?;
// Automatically choose format based on molecule requirements
write_sdf_auto_file?;
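As background for auto-detection: in a mol block the fourth line is the counts line, and it carries a V2000 or V3000 tag; V2000 counts use three-character fields, so a molecule with more than 999 atoms or bonds cannot be written as V2000. The sketch below encodes those two facts. The function names are hypothetical, and the actual criteria sdfrust applies when choosing an output format may include more than the size limit.

```python
def detect_sdf_version(mol_block: str) -> str:
    """Return 'V3000' or 'V2000' based on the counts line (4th line) of a mol block."""
    lines = mol_block.splitlines()
    if len(lines) < 4:
        raise ValueError("mol block needs a 3-line header followed by a counts line")
    counts_line = lines[3]
    if "V3000" in counts_line:
        return "V3000"
    # A counts line without a V3000 tag is treated as V2000 in this sketch.
    return "V2000"


def needs_v3000(n_atoms: int, n_bonds: int) -> bool:
    """V2000 counts are 3-character fields, so 999 is the largest writable value."""
    return n_atoms > 999 or n_bonds > 999
```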
Performance
The library is designed for high performance:
| Tool | Throughput | vs sdfrust |
|---|---|---|
| sdfrust | ~220,000 mol/s | baseline |
| RDKit | ~30,000-50,000 mol/s | 4-7x slower |
| Pure Python | ~3,000-5,000 mol/s | 40-70x slower |
- Streaming parser for memory-efficient processing of large files
- Minimal allocations during parsing
- Zero-copy where possible
Python API Reference
Parsing Functions
# SDF (V2000/V3000)
=
=
= # Auto-detect format
=
# MOL2
=
=
=
# Iterators (memory-efficient)
Writing Functions
# Auto-select V2000/V3000
ML Feature Methods (on Molecule)
# Chemical perception
# SSSR rings
# Aromaticity
# Hybridization
# Conjugation
# ML features
# [N, 9] OGB features
# [E, 3] OGB features
# Full graph dict
# ECFP bit vector
# Tanimoto coefficient
# Partial charges
# NumPy arrays (with numpy feature)
# np.array [N, 9]
# np.array [E, 3]
# np.array (n_bits,)
# np.array (N,)
# 3D features (with geometry feature)
# Neighbor list dict
# Angle in radians
# Dihedral in radians
Classes
- Molecule: Main molecular container
- Atom: Atom with coordinates
- Bond: Bond between atoms
- BondOrder: Bond type enum
- BondStereo: Stereochemistry enum
- SdfFormat: Format version enum
Citation
If you use sdfrust in your research, please cite:
License
MIT License - Copyright (c) 2025-2026 Hosein Fooladi