sdfrust 0.6.0

A fast, pure-Rust parser for SDF, MOL2, and XYZ chemical structure files
# sdfrust

[![Crates.io](https://img.shields.io/crates/v/sdfrust.svg)](https://crates.io/crates/sdfrust)
[![docs.rs](https://docs.rs/sdfrust/badge.svg)](https://docs.rs/sdfrust)
[![CI](https://github.com/hfooladi/sdfrust/actions/workflows/rust.yml/badge.svg)](https://github.com/hfooladi/sdfrust/actions/workflows/rust.yml)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

A fast, pure-Rust parser for SDF (Structure Data File), MOL2, and XYZ chemical structure files, with Python bindings.

## Features

- **SDF V2000/V3000**: Full read/write support for both SDF format versions
- **MOL2 (TRIPOS)**: Full read/write support for MOL2 format
- **XYZ**: Read support for XYZ coordinate files (single and multi-molecule)
- **Gzip Support**: Transparent decompression of `.gz` files (optional `gzip` feature)
- **Python Bindings**: First-class Python API with NumPy integration
- **Streaming Parsing**: Memory-efficient iteration over large files
- **Molecular Descriptors**: MW, exact mass, ring count, rotatable bonds, and more
- **ML-Ready Featurization**: OGB-compatible GNN features, ECFP fingerprints, Gasteiger charges
- **Chemical Perception**: SSSR rings, aromaticity (Hückel 4n+2), hybridization, conjugation
- **3D GNN Support**: Cutoff-based neighbor lists, bond/dihedral angles for SchNet/DimeNet/GemNet
- **High Performance**: ~220,000 molecules/sec (4-7x faster than RDKit)
- **Real-World Validated**: 100% success rate on PDBbind 2024 dataset (27,670 ligand SDF files)

## Installation

### Rust

Add to your `Cargo.toml`:

```toml
[dependencies]
sdfrust = "0.6"
```

### Python

```bash
# From source (requires Rust toolchain)
cd sdfrust-python
pip install maturin
maturin develop --features numpy
```

## Quick Start

### Rust

```rust
use sdfrust::{
    iter_sdf_file, parse_auto_file, parse_mol2_file, parse_sdf_file, parse_sdf_file_multi,
    parse_xyz_file, write_sdf_file,
};

// Parse a single molecule (any format)
let mol = parse_sdf_file("molecule.sdf")?;
let mol = parse_mol2_file("molecule.mol2")?;
let mol = parse_xyz_file("coords.xyz")?;
let mol = parse_auto_file("unknown_format.sdf")?; // Auto-detect
println!("Name: {}", mol.name);
println!("Atoms: {}", mol.atom_count());
println!("Formula: {}", mol.formula());
println!("MW: {:.2}", mol.molecular_weight().unwrap());

// Parse multiple molecules
let molecules = parse_sdf_file_multi("database.sdf")?;

// Iterate over large files (memory efficient)
for result in iter_sdf_file("large_database.sdf")? {
    let mol = result?;
    // Process each molecule
}

// Write molecules
write_sdf_file("output.sdf", &mol)?;
```

### Python

```python
import sdfrust

# Parse molecules
mol = sdfrust.parse_sdf_file("molecule.sdf")
mol = sdfrust.parse_mol2_file("molecule.mol2")

# Access properties
print(f"Name: {mol.name}")
print(f"Atoms: {mol.num_atoms}")
print(f"Formula: {mol.formula()}")
print(f"MW: {mol.molecular_weight():.2f}")

# Molecular descriptors
print(f"Rings: {mol.ring_count()}")
print(f"Rotatable bonds: {mol.rotatable_bond_count()}")
print(f"Heavy atoms: {mol.heavy_atom_count()}")

# NumPy integration
import numpy as np
coords = mol.get_coords_array()  # (N, 3) array
atomic_nums = mol.get_atomic_numbers()  # (N,) array

# Iterate over large files
for mol in sdfrust.iter_sdf_file("large_database.sdf"):
    print(f"{mol.name}: MW={mol.molecular_weight():.2f}")

# Write molecules
sdfrust.write_sdf_file(mol, "output.sdf")
```

## Data Model

### Molecule

The main container for molecular data:

```rust
pub struct Molecule {
    pub name: String,
    pub atoms: Vec<Atom>,
    pub bonds: Vec<Bond>,
    pub properties: HashMap<String, String>,
    pub format_version: SdfFormat,  // V2000 or V3000
}
```

### Atom

Represents an atom with 3D coordinates:

```rust
pub struct Atom {
    pub index: usize,
    pub element: String,
    pub x: f64, pub y: f64, pub z: f64,
    pub formal_charge: i8,
    pub mass_difference: i8,
    // ... additional fields
}
```

### Bond

Represents a bond between two atoms:

```rust
pub struct Bond {
    pub atom1: usize,
    pub atom2: usize,
    pub order: BondOrder,
    pub stereo: BondStereo,
}
```

### BondOrder

```rust
pub enum BondOrder {
    Single, Double, Triple, Aromatic,
    SingleOrDouble, SingleOrAromatic, DoubleOrAromatic, Any,
    Coordination, Hydrogen,  // V3000 only
}
```
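For orientation, the variants above line up with the numeric bond-type codes used in CTfile bond blocks. The lookup below is a small Python sketch of that correspondence. `bond_order_from_code` and the `v3000` flag are hypothetical names written for illustration and are not part of the sdfrust API.

```python
# Bond-type codes as they appear in CTfile bond blocks, keyed to the
# BondOrder variant names listed above.
BOND_ORDER_BY_CODE = {
    1: "Single",
    2: "Double",
    3: "Triple",
    4: "Aromatic",
    5: "SingleOrDouble",
    6: "SingleOrAromatic",
    7: "DoubleOrAromatic",
    8: "Any",
    9: "Coordination",
    10: "Hydrogen",
}

# Codes that the enum marks as "V3000 only".
V3000_ONLY_CODES = {9, 10}


def bond_order_from_code(code, v3000=False):
    """Hypothetical helper: map a numeric bond-type code to a variant name."""
    if code not in BOND_ORDER_BY_CODE:
        raise ValueError(f"unknown bond type code: {code}")
    if code in V3000_ONLY_CODES and not v3000:
        raise ValueError(f"bond type {code} is only valid in V3000 files")
    return BOND_ORDER_BY_CODE[code]
```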

## Molecule Methods

```rust
// Atom/bond counts
mol.atom_count()
mol.bond_count()
mol.is_empty()

// Connectivity
mol.neighbors(atom_index)      // Get connected atom indices
mol.bonds_for_atom(atom_index) // Get bonds for an atom

// Properties
mol.get_property("MW")
mol.set_property("SMILES", "CCO")

// Chemistry
mol.formula()         // "C2H6O"
mol.total_charge()    // Sum of formal charges
mol.element_counts()  // HashMap of element counts
mol.has_aromatic_bonds()
mol.has_charges()

// Geometry
mol.centroid()        // Geometric center
mol.center()          // Move centroid to origin
mol.translate(x, y, z)

// Descriptors
mol.molecular_weight()     // IUPAC 2021 atomic weights
mol.exact_mass()           // Monoisotopic mass
mol.heavy_atom_count()     // Non-hydrogen atoms
mol.ring_count()           // Using Euler formula
mol.rotatable_bond_count() // RDKit-compatible definition
mol.is_atom_in_ring(idx)
mol.is_bond_in_ring(idx)

// Filtering
mol.atoms_by_element("C")
mol.bonds_by_order(BondOrder::Double)
```
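The `ring_count()` comment above refers to the Euler formula. For a molecular graph this is: rings = bonds - atoms + connected components. The pure-Python sketch below illustrates that formula only. It is not sdfrust's implementation, and it assumes 0-based atom indices and no duplicate bonds.

```python
def ring_count(num_atoms, bonds):
    """Count independent rings as bonds - atoms + connected components."""
    parent = list(range(num_atoms))

    def find(i):
        # Follow parent links to the set representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    components = num_atoms
    for a, b in bonds:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge two fragments
            components -= 1
    return len(bonds) - num_atoms + components


# Heavy-atom graphs: benzene is a six-membered cycle, ethanol is a chain.
benzene = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
ethanol = [(0, 1), (1, 2)]
print(ring_count(6, benzene), ring_count(3, ethanol))
```

Counting connected components matters for multi-fragment records such as salts, where a plain `bonds - atoms + 1` would undercount.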

## ML-Ready Features

sdfrust provides chemical perception and featurization for molecular ML: features are computed in Rust and returned as NumPy arrays ready for PyTorch or JAX.

### OGB GNN Featurization

```rust
use sdfrust::featurize::ogb;

let graph = ogb::ogb_graph_features(&mol);
// graph.atom_features: [N, 9] - atomic_num, chirality, degree, charge, num_hs, radical, hybridization, aromatic, in_ring
// graph.bond_features: [2E, 3] - bond_type, stereo, conjugated
// graph.edge_src, graph.edge_dst: directed edge index
```

```python
# Python: direct NumPy arrays for PyTorch Geometric
atom_feats = mol.get_ogb_atom_features_array()  # np.array [N, 9]
bond_feats = mol.get_ogb_bond_features_array()  # np.array [E, 3]
graph = mol.ogb_graph_features()                 # dict with edge_src, edge_dst
```
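The `[2E, 3]` shape in the Rust comment suggests that each undirected bond is stored as two directed edges, as is conventional for PyTorch Geometric. The sketch below shows that expansion in plain Python. The interleaved ordering (forward edge, then reverse edge) and the `directed_edges` helper are assumptions made for illustration, not sdfrust's documented layout.

```python
def directed_edges(bonds, bond_features):
    """Expand E undirected bonds into 2E directed edges with copied features."""
    edge_src, edge_dst, edge_features = [], [], []
    for (a, b), feats in zip(bonds, bond_features):
        # Forward direction a -> b.
        edge_src.append(a)
        edge_dst.append(b)
        edge_features.append(list(feats))
        # Reverse direction b -> a carries the same bond features.
        edge_src.append(b)
        edge_dst.append(a)
        edge_features.append(list(feats))
    return edge_src, edge_dst, edge_features


# Two bonds (0-1 and 1-2), each with three illustrative feature values.
src, dst, feats = directed_edges([(0, 1), (1, 2)], [[0, 0, 0], [1, 0, 1]])
print(src, dst, len(feats))
```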

### ECFP/Morgan Fingerprints

```rust
use sdfrust::fingerprints::ecfp;

let fp = ecfp::ecfp(&mol, 2, 2048);  // ECFP4, 2048 bits
let similarity = fp.tanimoto(&other_fp);
```

```python
fp = mol.ecfp(radius=2, n_bits=2048)       # List[bool]
fp_array = mol.get_ecfp_array()             # np.array (2048,)
sim = mol.tanimoto_similarity(other_mol)    # float
```
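The Tanimoto coefficient used above is the number of bits set in both fingerprints divided by the number of bits set in either. The following pure-Python sketch works on bit lists such as the `List[bool]` returned by `mol.ecfp(...)`. Returning `0.0` when both fingerprints are empty is an assumption here; sdfrust may handle that case differently.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two equal-length bit vectors."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    # Bits set in both fingerprints (intersection).
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    # Bits set in at least one fingerprint (union).
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    return both / either if either else 0.0


fp1 = [True, True, False, False]
fp2 = [True, False, True, False]
print(tanimoto(fp1, fp2))  # one shared bit out of three set bits
```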

### Gasteiger Partial Charges

```rust
use sdfrust::descriptors::gasteiger_charges;

let charges = gasteiger_charges(&mol);  // Vec<f64>
```

```python
charges = mol.gasteiger_charges()           # List[float]
charges_array = mol.get_gasteiger_charges_array()  # np.array (N,)
```

### Chemical Perception

```rust
use sdfrust::descriptors::{sssr, all_aromatic_atoms, all_hybridizations, all_conjugated_bonds};

let rings = sssr(&mol);                         // SSSR ring perception
let aromatic = all_aromatic_atoms(&mol);         // Hückel 4n+2 aromaticity
let hybs = all_hybridizations(&mol);             // SP/SP2/SP3
let conjugated = all_conjugated_bonds(&mol);     // Bond conjugation
```

```python
rings = mol.sssr()                    # List of ring atom lists
aromatic = mol.all_aromatic_atoms()   # List[bool]
hybs = mol.all_hybridizations()       # List[str]
conj = mol.all_conjugated_bonds()     # List[bool]
```
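The Hückel rule mentioned above says a planar, fully conjugated ring is aromatic when it holds 4n+2 π electrons for some non-negative integer n. The check itself is one line, sketched here in Python. Counting the π electrons per ring is the harder part and is not shown, and `satisfies_huckel` is a hypothetical helper rather than an sdfrust function.

```python
def satisfies_huckel(pi_electrons):
    """True when pi_electrons == 4n + 2 for some integer n >= 0."""
    return pi_electrons >= 2 and (pi_electrons - 2) % 4 == 0


# Benzene (6) and naphthalene (10) pass; cyclobutadiene (4) and
# cyclooctatetraene (8) do not.
for count in (4, 6, 8, 10):
    print(count, satisfies_huckel(count))
```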

### 3D GNN Features (requires the `geometry` feature)

```rust
use sdfrust::geometry::{neighbor_list, all_bond_angles, all_dihedral_angles};

let nl = neighbor_list(&mol, 5.0);         // Cutoff-based neighbor list
let angles = all_bond_angles(&mol);         // Bond angles for DimeNet
let dihedrals = all_dihedral_angles(&mol);  // Dihedral angles for GemNet
```

```python
nl = mol.neighbor_list(cutoff=5.0)         # dict: edge_src, edge_dst, distances
angles = mol.all_bond_angles()              # dict: triplets, angles
dihedrals = mol.all_dihedral_angles()       # dict: quadruplets, angles
```
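A cutoff-based neighbor list connects every ordered pair of atoms whose distance is within the cutoff, regardless of bonding. The brute-force Python sketch below returns the same three keys shown in the comment above. It is an O(N²) illustration, not sdfrust's algorithm, and treating the cutoff as inclusive is an assumption.

```python
import math


def neighbor_list(coords, cutoff):
    """Directed pairs (i, j), i != j, with distance <= cutoff."""
    edge_src, edge_dst, distances = [], [], []
    for i, pos_i in enumerate(coords):
        for j, pos_j in enumerate(coords):
            if i == j:
                continue  # no self-edges
            dist = math.dist(pos_i, pos_j)
            if dist <= cutoff:
                edge_src.append(i)
                edge_dst.append(j)
                distances.append(dist)
    return {"edge_src": edge_src, "edge_dst": edge_dst, "distances": distances}


# Atoms 0 and 1 are 1 Å apart; atom 2 is outside the 5 Å cutoff of both.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
nl = neighbor_list(coords, cutoff=5.0)
print(nl["edge_src"], nl["edge_dst"])
```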

## Supported Formats

| Format | Read | Write |
|--------|------|-------|
| SDF V2000 | ✓ | ✓ |
| SDF V3000 | ✓ | ✓ |
| MOL2 (TRIPOS) | ✓ | ✓ |
| XYZ | ✓ | - |
| Gzip (`.gz`) | ✓ | - |

### Format Auto-Detection

```rust
// Automatically detect V2000 vs V3000
let mol = parse_sdf_auto_file("molecule.sdf")?;

// Automatically choose format based on molecule requirements
write_sdf_auto_file("output.sdf", &mol)?;
```
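In a molfile, the version tag sits at the end of the counts line, which is the fourth line of the record, and V2000 count fields are three characters wide (so at most 999 atoms or bonds). The Python sketch below uses only those two facts. Both helpers are hypothetical, and the actual rules sdfrust applies when reading or writing (for example, V3000-only bond types) may be broader.

```python
def detect_sdf_version(mol_block):
    """Read the version tag from the counts line (4th line of the record)."""
    lines = mol_block.splitlines()
    if len(lines) < 4:
        raise ValueError("record is too short to contain a counts line")
    if "V3000" in lines[3]:
        return "V3000"
    # Assumption: records without an explicit tag are treated as V2000.
    return "V2000"


def choose_write_version(num_atoms, num_bonds):
    """Pick V3000 only when the counts overflow V2000's 3-character fields."""
    return "V3000" if num_atoms > 999 or num_bonds > 999 else "V2000"


v2000_block = "\n".join([
    "methane",
    "  sketch",
    "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
])
v3000_block = "\n".join([
    "methane",
    "  sketch",
    "",
    "  0  0  0     0  0            999 V3000",
    "M  V30 BEGIN CTAB",
])
print(detect_sdf_version(v2000_block), detect_sdf_version(v3000_block))
```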

## Performance

The library is designed for high performance:

| Tool | Throughput | vs sdfrust |
|------|------------|------------|
| **sdfrust** | ~220,000 mol/s | baseline |
| RDKit | ~30,000-50,000 mol/s | 4-7x slower |
| Pure Python | ~3,000-5,000 mol/s | 40-70x slower |

- Streaming parser for memory-efficient processing of large files
- Minimal allocations during parsing
- Zero-copy where possible
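
The streaming idea can be sketched in a few lines of Python: SDF records end with a `$$$$` line, so a generator can hand back one record at a time without holding the whole file in memory. This illustrates the technique only. `iter_sdf_records` is a hypothetical helper, and sdfrust's own iterator parses each record into a `Molecule` rather than yielding raw text.

```python
import io


def iter_sdf_records(handle):
    """Yield the text of one SDF record at a time from a line iterator."""
    record = []
    for line in handle:
        if line.rstrip() == "$$$$":
            # Record terminator: emit what we have and start a new record.
            yield "".join(record)
            record = []
        else:
            record.append(line)
    # Emit a trailing record that lacks the terminator, ignoring blank tails.
    if any(line.strip() for line in record):
        yield "".join(record)


sample = io.StringIO("mol1\nline\n$$$$\nmol2\n$$$$\n")
for text in iter_sdf_records(sample):
    print(repr(text))
```

Only the lines of the current record are buffered, so peak memory scales with the largest molecule rather than with file size.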

## Python API Reference

### Parsing Functions

```python
# SDF (V2000/V3000)
mol = sdfrust.parse_sdf_file(path)
mol = sdfrust.parse_sdf_string(content)
mol = sdfrust.parse_sdf_auto_file(path)  # Auto-detect format
mols = sdfrust.parse_sdf_file_multi(path)

# MOL2
mol = sdfrust.parse_mol2_file(path)
mol = sdfrust.parse_mol2_string(content)
mols = sdfrust.parse_mol2_file_multi(path)

# Iterators (memory-efficient)
for mol in sdfrust.iter_sdf_file(path):
    process(mol)
```

### Writing Functions

```python
sdfrust.write_sdf_file(mol, path)
sdfrust.write_sdf_string(mol)
sdfrust.write_sdf_auto_file(mol, path)  # Auto-select V2000/V3000
sdfrust.write_sdf_file_multi(mols, path)
```

### ML Feature Methods (on Molecule)

```python
# Chemical perception
mol.sssr()                           # SSSR rings
mol.all_aromatic_atoms()             # Aromaticity
mol.all_hybridizations()             # Hybridization
mol.all_conjugated_bonds()           # Conjugation

# ML features
mol.ogb_atom_features()              # [N, 9] OGB features
mol.ogb_bond_features()              # [E, 3] OGB features
mol.ogb_graph_features()             # Full graph dict
mol.ecfp(radius=2, n_bits=2048)      # ECFP bit vector
mol.tanimoto_similarity(other)       # Tanimoto coefficient
mol.gasteiger_charges()              # Partial charges

# NumPy arrays (with numpy feature)
mol.get_ogb_atom_features_array()    # np.array [N, 9]
mol.get_ogb_bond_features_array()    # np.array [E, 3]
mol.get_ecfp_array()                 # np.array (n_bits,)
mol.get_gasteiger_charges_array()    # np.array (N,)

# 3D features (with geometry feature)
mol.neighbor_list(cutoff=5.0)        # Neighbor list dict
mol.bond_angle(i, j, k)              # Angle in radians
mol.dihedral_angle(i, j, k, l)       # Dihedral in radians
```
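For reference, the two angle quantities listed above can be computed from coordinates as follows. This is a plain-Python sketch of the standard formulas, returning radians. It takes positions directly rather than atom indices, and the sign convention of the dihedral is an assumption that may differ from sdfrust's.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def bond_angle(p_i, p_j, p_k):
    """Angle at p_j between the directions to p_i and p_k."""
    u, v = _sub(p_i, p_j), _sub(p_k, p_j)
    cos_theta = _dot(u, v) / math.sqrt(_dot(u, u) * _dot(v, v))
    return math.acos(max(-1.0, min(1.0, cos_theta)))  # clamp rounding error


def dihedral_angle(p0, p1, p2, p3):
    """Torsion about the p1-p2 axis, in the range (-pi, pi]."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.atan2(y, x)


right_angle = bond_angle((1, 0, 0), (0, 0, 0), (0, 1, 0))
trans = dihedral_angle((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
cis = dihedral_angle((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
print(math.degrees(right_angle), math.degrees(abs(trans)), math.degrees(cis))
```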

### Classes

- `Molecule`: Main molecular container
- `Atom`: Atom with coordinates
- `Bond`: Bond between atoms
- `BondOrder`: Bond type enum
- `BondStereo`: Stereochemistry enum
- `SdfFormat`: Format version enum

## Citation

If you use sdfrust in your research, please cite:

```bibtex
@software{fooladi2025sdfrust,
  author = {Fooladi, Hosein},
  title = {sdfrust: A fast, pure-Rust parser for SDF, MOL2, and XYZ chemical structure files},
  year = {2025},
  url = {https://github.com/hfooladi/sdfrust},
  license = {MIT}
}
```

## License

MIT License - Copyright (c) 2025-2026 Hosein Fooladi