sdfrust
A fast, pure-Rust parser for SDF (Structure Data File), MOL2, and XYZ chemical structure files, with Python bindings.
Features
- SDF V2000/V3000: Full read/write support for both SDF format versions
- MOL2 (TRIPOS): Full read/write support for MOL2 format
- XYZ: Read support for XYZ coordinate files (single and multi-molecule)
- Gzip Support: Transparent decompression of .gz files (optional gzip feature)
- Python Bindings: First-class Python API with NumPy integration
- Streaming Parsing: Memory-efficient iteration over large files
- Molecular Descriptors: MW, exact mass, ring count, rotatable bonds, and more
- High Performance: ~220,000 molecules/sec (4-7x faster than RDKit)
- Real-World Validated: 100% success rate on PDBbind 2024 dataset (27,670 ligand SDF files)
Installation
Rust
Add to your Cargo.toml:
[dependencies]
sdfrust = "0.5"
Python
# From source (requires Rust toolchain)
Quick Start
Rust
use ;
// Parse a single molecule (any format)
let mol = parse_sdf_file?;
let mol = parse_mol2_file?;
let mol = parse_xyz_file?;
let mol = parse_auto_file?; // Auto-detect
println!;
println!;
println!;
println!;
// Parse multiple molecules
let molecules = parse_sdf_file_multi?;
// Iterate over large files (memory efficient)
for result in iter_sdf_file?
// Write molecules
write_sdf_file?;
Python
# Parse molecules
=
=
# Access properties
# Molecular descriptors
# NumPy integration
= # (N, 3) array
= # (N,) array
# Iterate over large files
# Write molecules
Data Model
Molecule
The main container for molecular data:
Atom
Represents an atom with 3D coordinates:
Bond
Represents a bond between two atoms:
BondOrder
Molecule Methods
// Atom/bond counts
mol.atom_count
mol.bond_count
mol.is_empty
// Connectivity
mol.neighbors // Get connected atom indices
mol.bonds_for_atom // Get bonds for an atom
// Properties
mol.get_property
mol.set_property
// Chemistry
mol.formula // "C2H6O"
mol.total_charge // Sum of formal charges
mol.element_counts // HashMap of element counts
mol.has_aromatic_bonds
mol.has_charges
// Geometry
mol.centroid // Geometric center
mol.center // Move centroid to origin
mol.translate
// Descriptors
mol.molecular_weight // IUPAC 2021 atomic weights
mol.exact_mass // Monoisotopic mass
mol.heavy_atom_count // Non-hydrogen atoms
mol.ring_count // Using Euler formula
mol.rotatable_bond_count // RDKit-compatible definition
mol.is_atom_in_ring
mol.is_bond_in_ring
// Filtering
mol.atoms_by_element
mol.bonds_by_order
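The ring_count entry above mentions the Euler formula. As a rough illustration of that idea, and not the library's actual implementation, the number of independent rings in a molecular graph equals bonds minus atoms plus connected components. The sketch below uses a hypothetical standalone function, `euler_ring_count`, that takes an atom count and a list of bond index pairs. Neither the name nor the signature comes from sdfrust, and it assumes each bond is listed only once.

```rust
/// Hypothetical helper (not part of sdfrust): counts independent rings
/// as bonds - atoms + connected components.
/// Assumes every bond appears once and all indices are < atom_count.
fn euler_ring_count(atom_count: usize, bonds: &[(usize, usize)]) -> usize {
    // Union-find parent table; each atom starts in its own component.
    let mut parent: Vec<usize> = (0..atom_count).collect();

    // Find the representative of the component containing atom `i`.
    fn find(parent: &mut [usize], mut i: usize) -> usize {
        while parent[i] != i {
            parent[i] = parent[parent[i]]; // path halving
            i = parent[i];
        }
        i
    }

    let mut components = atom_count;
    for &(a, b) in bonds {
        let (ra, rb) = (find(&mut parent, a), find(&mut parent, b));
        if ra != rb {
            // This bond joins two previously separate components.
            parent[ra] = rb;
            components -= 1;
        }
    }

    // Euler formula for graphs: rings = E - V + C.
    bonds.len() + components - atom_count
}
```

For the six carbons of benzene (6 atoms, 6 bonds, 1 component) this gives 1, and for an acyclic chain it gives 0. Fused ring systems count each independent cycle once.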
Supported Formats
| Format | Read | Write |
|---|---|---|
| SDF V2000 | ✅ | ✅ |
| SDF V3000 | ✅ | ✅ |
| MOL2 (TRIPOS) | ✅ | ✅ |
| XYZ | ✅ | - |
| Gzip (.gz) | ✅ | - |
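For readers curious how gzip input can be recognized before parsing, here is a minimal sketch. It is an assumption about one reasonable approach, not a description of how sdfrust does it: gzip streams begin with the magic bytes 0x1f 0x8b, and compressed files conventionally carry a .gz extension. The function names `looks_gzipped` and `has_gz_extension` are hypothetical.

```rust
use std::path::Path;

/// Hypothetical helper (not part of sdfrust): true if the buffer starts
/// with the two-byte gzip magic number.
fn looks_gzipped(header: &[u8]) -> bool {
    header.starts_with(&[0x1f, 0x8b])
}

/// Hypothetical helper (not part of sdfrust): true if the final
/// extension of the path is "gz", compared case-insensitively.
fn has_gz_extension(path: &Path) -> bool {
    path.extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| ext.eq_ignore_ascii_case("gz"))
        .unwrap_or(false)
}
```

Checking the magic bytes is more robust than trusting the file name, since a compressed file may have been renamed.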
Format Auto-Detection
// Automatically detect V2000 vs V3000
let mol = parse_sdf_auto_file?;
// Automatically choose format based on molecule requirements
write_sdf_auto_file?;
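To make the idea behind auto-detection concrete, here is a small sketch under stated assumptions. It does not reproduce the library's code. In the molfile layout, the fourth line of a record is the counts line, which ends in V2000 or V3000, and V2000 stores atom and bond counts in three-character fields, so it cannot represent more than 999 of either. The names `DetectedVersion`, `detect_sdf_version`, and `needs_v3000` are hypothetical, and the library may apply additional criteria when choosing an output format.

```rust
/// Hypothetical enum (not sdfrust's own type) for the detected version.
#[derive(Debug, PartialEq)]
enum DetectedVersion {
    V2000,
    V3000,
}

/// Hypothetical helper: inspects the counts line (fourth line) of one record.
/// Returns None if the record has fewer than four lines.
fn detect_sdf_version(record: &str) -> Option<DetectedVersion> {
    let counts_line = record.lines().nth(3)?;
    if counts_line.trim_end().ends_with("V3000") {
        Some(DetectedVersion::V3000)
    } else {
        // Either an explicit V2000 tag or a legacy file with no tag.
        Some(DetectedVersion::V2000)
    }
}

/// Hypothetical helper: V2000 count fields are three characters wide,
/// so anything above 999 atoms or bonds requires V3000.
fn needs_v3000(atom_count: usize, bond_count: usize) -> bool {
    atom_count > 999 || bond_count > 999
}
```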
Performance
The library is designed for high performance:
| Tool | Throughput | vs sdfrust |
|---|---|---|
| sdfrust | ~220,000 mol/s | baseline |
| RDKit | ~30,000-50,000 mol/s | 4-7x slower |
| Pure Python | ~3,000-5,000 mol/s | 40-70x slower |
- Streaming parser for memory-efficient processing of large files
- Minimal allocations during parsing
- Zero-copy where possible
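The streaming approach can be illustrated with a short sketch. It is not sdfrust's implementation, and `RecordIter` is a hypothetical name. SDF files separate records with a line containing only `$$$$`, so an iterator can read one record at a time from any buffered reader and never hold the whole file in memory.

```rust
use std::io::{self, BufRead};

/// Hypothetical iterator (not sdfrust's API): yields one raw SDF record
/// at a time, without the "$$$$" delimiter line.
struct RecordIter<R: BufRead> {
    reader: R,
}

impl<R: BufRead> Iterator for RecordIter<R> {
    type Item = io::Result<String>;

    fn next(&mut self) -> Option<Self::Item> {
        let mut record = String::new();
        let mut line = String::new();
        loop {
            line.clear();
            match self.reader.read_line(&mut line) {
                Err(e) => return Some(Err(e)),
                // End of input: emit a trailing record only if it has content.
                Ok(0) => {
                    return if record.trim().is_empty() {
                        None
                    } else {
                        Some(Ok(record))
                    };
                }
                Ok(_) => {
                    // "$$$$" on its own line terminates the current record.
                    if line.trim_end() == "$$$$" {
                        return Some(Ok(record));
                    }
                    record.push_str(&line);
                }
            }
        }
    }
}
```

Wrapping a file in `std::io::BufReader` and passing it as `reader` keeps memory use proportional to the largest single record rather than to the file size.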
Python API Reference
Parsing Functions
# SDF (V2000/V3000)
=
=
= # Auto-detect format
=
# MOL2
=
=
=
# Iterators (memory-efficient)
Writing Functions
# Auto-select V2000/V3000
Classes
- Molecule: Main molecular container
- Atom: Atom with coordinates
- Bond: Bond between atoms
- BondOrder: Bond type enum
- BondStereo: Stereochemistry enum
- SdfFormat: Format version enum
Citation
If you use sdfrust in your research, please cite:
License
MIT License - Copyright (c) 2025-2026 Hosein Fooladi