sci-form 0.13.0

High-performance 3D molecular conformer generation using ETKDG distance geometry
Documentation
# Roadmap: sci-form Core Conformer Engine and Orbital Platform

**Primary goal:** Build a native Rust molecular engine that reproduces RDKit-grade 3D conformer generation without heavy C++ dependencies, ships clean Rust, Python, and WASM APIs, and grows into an electronic-structure and volumetric-visualization platform for the web.

**Execution principles:**
- Single-conformer, RDKit-style embedding comes first. Accuracy should come from the algorithm, not brute-force seed farming.
- Port mathematical logic, not C++ object hierarchies.
- Keep the public API usable from CLI, Python, and WebAssembly.
- Optimize for browser deployment once correctness is stable.

**Technology stack:**
- **Language:** Rust, targeting `wasm32-unknown-unknown`
- **Linear algebra:** `nalgebra` today, with room to evaluate `ndarray-linalg` where generalized eigensolvers become a better fit
- **Graph layer:** `petgraph`
- **Reference oracle:** Python + native RDKit
- **Parallel execution:** `rayon` for batch and volumetric workloads

---

## Current Implementation Snapshot (March 2026)

- [x] Rust core crate for SMILES parsing, molecular graph construction, distance geometry, ETKDG-style refinement, and force-field minimization
- [x] Differential testing and benchmark coverage against RDKit across GDB-20 and ChemBL-style datasets
- [x] Public Rust API plus Python and WASM bindings
- [x] Release optimization with `lto = true` and `strip = true`
- [x] Reach RDKit-class throughput on the practical single-conformer path (optimized: pre-computed pair lists, stack-allocated diffs, adaptive restart limits)

**Current measured status:**
- Single-conformer internal validation: max RMSD 0.477 A, avg RMSD 0.069 A on the current 100-molecule pipeline test set
- Large-batch heavy-atom validation: 0.024 A average pairwise-distance RMSD on GDB-20, with 0.00% above 0.5 A in the multi-seed cross-check
- ChemBL 10K practical benchmark: 97.54% embed success, 97.18% perfect-geometry rate, about 2.1 mol/s
- Main remaining conformer-engine gap: throughput. The current single-conformer path is still around 106 ms/mol versus the 11 ms/mol target.

**Experimental engine snapshot:**
- `src/experimental_2/` is now wired into the crate as an isolated namespace
- Phase 1 GPU infrastructure, Phase 2 quantum engine, and Phase 3 SCF engine are implemented and compile cleanly
- Phase 4 spectroscopy and Phase 5 GPU rendering exist as prototype tracks behind the experimental namespace
- Verified regression status for the experimental stack: `54 passed` in `test_experimental_comparison` and `14 passed, 7 ignored` in `test_extended_molecules`

---

## Base Data Model Direction

The core data structures need to support both the current conformer engine and the future orbital, electrostatics, and materials stack.

```rust
use nalgebra::Vector3;

pub struct Atom {
    pub element: u8,
    pub position: Vector3<f64>,
    pub formal_charge: i8,
    pub partial_charge: Option<f64>,
    pub hybridization: Hybridization,
}

pub struct Bond {
    pub start: usize,
    pub end: usize,
    pub order: BondOrder,
}

pub struct AtomicOrbital {
    pub atom_index: usize,
    pub principal_n: u8,
    pub angular_l: u8,
    pub magnetic_m: i8,
    pub label: String,
    pub vsip: f64,
    pub zeta: f64,
}

pub struct Molecule {
    pub atoms: Vec<Atom>,
    pub bonds: Vec<Bond>,
    pub metadata: MoleculeMeta,
    pub unit_cell: Option<UnitCell>,
    pub basis: Option<Vec<AtomicOrbital>>,
}

pub struct VolumetricGrid {
    pub origin: Vector3<f64>,
    pub spacing: f64,
    pub dims: [usize; 3],
    pub values: Vec<f32>,
}
```

**Development note:** do not translate RDKit file-by-file. Extract the numerical logic and rebuild it in Rust with composition-friendly modules and data structures.

---

## Track A: 3D Conformer Engine

### Phase A1: Infrastructure and Molecular Data Model

- [x] Initialize the Rust workspace and split it into core, CLI, Python, and WASM crates
- [x] Configure `wasm-bindgen`, PyO3, packaging, and CI build coverage
- [x] Build the differential-testing pipeline against RDKit reference outputs
- [x] Implement the molecular graph, atom and bond enums, radii tables, and SMILES parsing
- [x] Expand parser coverage: 60+ bracket elements, expanded implicit H valence table, error on unknown elements

### Phase A2: Distance Geometry and Bounds Matrix

- [x] Implement 1-2, 1-3, and 1-4 bounds-matrix generation
- [x] Implement bounds smoothing with shortest-path propagation
- [x] Implement RDKit-compatible RNG plumbing and distance sampling variants
- [x] Implement metric-matrix construction, centering, and `SymmetricEigen` projection to 3D and 4D coordinates
- [x] Reduce random-coordinate fallback cost for very large molecules (adaptive thresholds, scaled box size)

### Phase A3: Chemical Refinement and Stereochemistry

- [x] Detect and enforce tetrahedral chirality during embedding
- [x] Validate chiral volume, planarity, and double-bond geometry after embedding
- [x] Reproduce the RDKit-style retry-on-failure embedding loop
- [x] Add ETKDG-lite and ETKDGv3-style torsion knowledge, including macrocycle pattern support
- [x] Close remaining fidelity gaps: added amide, sulfonamide, sulfoxide, sulfone, phosphonate, allylic torsion patterns

### Phase A4: Energy Minimization and Force Fields

- [x] Translate UFF parameter tables and atom typing for the supported organic subset
- [x] Implement bonded and non-bonded energy terms plus gradients
- [x] Implement bounds-space BFGS and L-BFGS minimizers
- [x] Implement the ETKDG 3D force field and BFGS refinement path
- [x] Move the numerically sensitive pipeline to `f64` for stability
- [x] Expand parameter coverage: 18 new UFF atom types (transition metals, main-group), MMFF94 with bond/angle/torsion/vdW terms and builder

### Phase A5: Public API, WASM, and Performance Delivery

- [x] Expose stable Rust entry points for single and batch embedding
- [x] Ship WASM bindings for full results, compact coordinates, batch embedding, and parsing
- [x] Ship Python bindings for single and batch embedding
- [x] Publish benchmark documentation and CI build steps for `wasm-pack`
- [x] Enable release optimizations in Cargo profiles
- [x] Benchmark browser-side performance against RDKit MinimalLib (browser benchmark HTML page)
- [x] Replace JSON-heavy transfer paths with typed-array oriented APIs (Float64Array/Float32Array WASM functions)
- [x] Tune final WASM size with `wasm-opt` and dedicated size profiling (wasm_opt_tune.sh script)

---

## Track B: Electronic Structure and Orbital Visualization (EHT)

This track adds semiempirical electronic-structure support and 3D orbital rendering on top of the conformer engine. The intended first method is Extended Huckel Theory (EHT), using Slater-type orbitals conceptually and Gaussian expansions for practical integral evaluation.

**Core orbital form:**

$$
\phi(r, \theta, \varphi) = N r^{n-1} e^{-\zeta r} Y_l^m(\theta, \varphi)
$$

### Phase B1: Parameterization and Orbital Basis

- [x] Build an EHT parameter table for each supported element, including valence-shell orbital definitions
- [x] Store per-orbital VSIP values to populate the diagonal Hamiltonian terms $H_{ii}$
- [x] Store per-orbital Slater exponents $\zeta$ for orbital extent and contraction
- [x] Define a Rust representation for Slater-type orbitals that is stable enough for evaluation and rendering
- [x] Add contracted Gaussian expansions such as STO-3G or STO-6G for overlap-integral evaluation

### Phase B2: Matrix Construction ($S$ and $H$)

- [x] Assemble the symmetric overlap matrix $S$ for all valence orbitals in the molecule
- [x] Enforce the diagonal condition $S_{ii} = 1$ and validate symmetry and conditioning
- [x] Fill the diagonal Hamiltonian with VSIP values for each atomic orbital
- [x] Compute off-diagonal Hamiltonian elements with the Wolfsberg-Helmholtz approximation
- [x] Make the empirical constant $K$ configurable, with 1.75 as the default starting point

**Wolfsberg-Helmholtz approximation:**

$$
H_{ij} = \frac{1}{2} K S_{ij} \left(H_{ii} + H_{jj}\right)
$$

### Phase B3: Generalized Eigenproblem and Orbital Coefficients

- [x] Solve the generalized Schrodinger equation $HC = SCE$
- [x] Implement Lowdin orthogonalization by diagonalizing $S$ and building $S^{-1/2}$
- [x] Build the orthogonalized Hamiltonian $H' = S^{-1/2} H S^{-1/2}$
- [x] Run the standard symmetric eigensolve on $H'$ to obtain orbital energies and transformed coefficients
- [x] Back-transform the coefficients with $C = S^{-1/2} C'$ and expose orbital occupations, HOMO, and LUMO metadata

### Phase B4: 3D Volumetric Mapping Engine

- [x] Build a padded bounding box around the optimized molecular geometry
- [x] Generate a configurable 3D voxel grid for orbital and density sampling
- [x] Parallelize grid evaluation with `rayon` so millions of points are tractable
- [x] Evaluate molecular-orbital amplitudes pointwise with the linear combination of atomic orbitals
- [x] Add chunked or streamed evaluation modes: x-slab chunked evaluation with VolumeSlab iterator

**Orbital evaluation in space:**

$$
\Psi_i(x,y,z) = \sum_{\mu=1}^{N} C_{\mu i} \phi_{\mu}(x,y,z)
$$

### Phase B5: Output Structures and Rendering Paths

- [x] Export raw volumetric data as `Vec<f32>` or `Float32Array` plus grid metadata for GPU volume rendering
- [x] Define a WASM-friendly transfer format for dimensions, bounds, spacing, and orbital-energy metadata
- [x] Implement a Marching Cubes path for isosurface extraction in Rust
- [x] Return mesh vertices, normals, and indices for fast WebGL or WebGPU rendering when volume textures are too heavy
- [x] Benchmark transfer cost, render cost, and memory usage for raw volumes versus isosurfaces (integration test)

---

## Track C: Post-Core Platform Expansion

### Phase C1: Electrostatics and Surface Analysis

- [x] Implement a native partial-charge solver, either Gasteiger-Marsili or an EHT-derived charge workflow
- [x] Add solvent-accessible surface area and cavity analysis with a Shrake-Rupley style algorithm
- [x] Connect electrostatic descriptors and surface outputs back into the conformer and orbital pipelines

### Phase C2: Population Analysis (Mulliken & Löwdin Charges)

- [x] Compute population matrices from density matrix $P$ and overlap matrix $S$
- [x] Implement Mulliken and Löwdin charge partitioning to extract per-atom partial charges from MO coefficients
- [x] Validate against PySCF/Multiwfn with charge-sum invariants ($\sum_i q_i = Q_{total}$)
- [x] Expose charge arrays via Rust lib, CLI, WASM, and Python bindings

### Phase C3: Dipole Moments

- [x] Calculate molecular dipole vector $\vec{\mu} = (\mu_x, \mu_y, \mu_z)$ from nuclear and electron-density contributions
- [x] Validate magnitude (in Debyes) and spatial orientation against PySCF/ORCA/Gaussian
- [x] Support conformer-dependent dipole calculations

### Phase C4: Electrostatic Potential Maps (ESP)

- [x] Generate volumetric ESP on 3D grids: $\Phi(\vec{r}) = \sum_A \frac{Z_A}{|\vec{r} - \vec{R}_A|} - \int \frac{\rho(\vec{r}')}{|\vec{r} - \vec{r}'|} d\vec{r}'$
- [x] Export/import `.cube` format for validation against PySCF cubegen or Gaussian
- [x] Color-code ESP visualization: `esp_color_map()` and `esp_grid_to_colors()` (red=negative, blue=positive)
- [x] Parallelize grid evaluation with rayon: `compute_esp_grid_parallel()` for million-point grids

### Phase C5: Density of States (DOS) and Projected DOS (PDOS)

- [x] Apply Gaussian smearing to orbital energies to generate smooth DOS curves
- [x] Implement PDOS: separate DOS by atomic contribution
- [x] Validate DOS curves: `dos_mse()` mean-squared error metric for comparison
- [x] Export as graph data for web visualization: `export_dos_json()` with energies, total_dos, pdos

### Phase C6: Molecular Alignment and RMSD (Kabsch/Quaternion)

- [x] Implement Kabsch algorithm for optimal 3D alignment of two molecular geometries
- [x] Support quaternion-based rotation fitting: `align_quaternion()` (Coutsias et al. 2004)
- [x] Compute heavy-atom, all-atom, and atom-type-specific RMSD variants
- [x] Validate: quaternion matches Kabsch RMSD to <1e-8, aligned coords to <1e-6
- [x] Use RMSD as metric for conformer diversity and benchmark comparisons

### Phase C7: Force Fields (UFF / MMFF94) and Strain Energy

- [x] Fully parameterize UFF with bonded (stretch, bend, torsion) and non-bonded (vdW, Coulomb) terms
- [x] Implement MMFF94: atom typing (28 types), bond stretch, angle bend, torsion, vdW, builder for full molecules
- [x] Compute strain energy: sum of all individual energy terms
- [x] Validate gradients: numerical vs analytical validation for all MMFF94 terms (<1e-3 max error)
- [x] Enable geometry optimization with selected force field using L-BFGS

### Phase C8: Complex Materials Assembly

- [x] Add SBU and coordination-aware assembly logic for metal centers and framework nodes
- [x] Extend the data model with periodic unit-cell vectors and periodic boundary conditions
- [x] Support browser-side generation and visualization of repeating crystalline networks

### Phase C9: Web Data Transport and Parallelism

- [x] Add zero-copy Arrow-compatible memory layouts for large coordinate, orbital, and grid datasets
- [x] Move heavy browser workloads to Web Workers with a WASM-safe threading strategy
- [x] Add chunked streaming for large volumetric and ensemble outputs to avoid JSON bottlenecks

---

## Track E: Machine Learning Potentials (ANI Family)

This track provides near-DFT accuracy at force-field speeds using physical neural networks.

- [x] **Step 1.1: Spatial Partitioning:** Implemented `CellList` for O(N) neighbor searching within a $5.2 \text{ \AA}$ cutoff radius (`src/ani/neighbor.rs`).
- [x] **Step 1.2: Atomic Environment Vectors (AEVs):** Encoded Behler-Parrinello radial and angular symmetry functions for translational/rotational invariance (`src/ani/aev.rs`, `src/ani/aev_params.rs`, `src/ani/cutoff.rs`).
- [x] **Step 1.3: Neural Network Engine:** Created a pure-Rust feed-forward inference engine supporting GELU/CELU activations with a binary weight loader (`src/ani/nn.rs`, `src/ani/weights.rs`).
- [x] **Step 1.4: Analytical Gradients:** Implemented exact backpropagation through the AEVs to yield analytical forces respecting Newton's third law (`src/ani/gradients.rs`, `src/ani/api.rs`).

---

## Track F: HF-3c Quantum Engine

This track introduces a specialized minimal-basis Hartree-Fock engine with semi-empirical corrections for rapid geometric and spectroscopic screening.

- [x] **Step 2.1: Gaussian Integrals:** Implemented Obara-Saika recurrence for analytical evaluation of overlap, kinetic, nuclear attraction, and electron repulsion integrals (`src/hf/overlap_kin.rs`, `src/hf/nuclear.rs`, `src/hf/integrals.rs`, `src/hf/basis.rs`).
- [x] **Step 2.2: SCF Cycle:** Created a Roothaan-Hall SCF solver with Löwdin orthogonalization and DIIS convergence acceleration (`src/hf/fock.rs`, `src/hf/scf.rs`).
- [x] **Step 2.3: Empirical Corrections:** Added Grimme D3-BJ dispersion, geometric counterpoise (gCP), and short-range basis (SRB) corrections (`src/hf/d3.rs`, `src/hf/gcp.rs`, `src/hf/srb.rs`).
- [x] **Step 2.4: CIS UV-Vis:** Implemented Configuration Interaction Singles (CIS) for vertical excitation energies and oscillator strengths (`src/hf/cis.rs`, `src/hf/api.rs`).

---

## Track D: Advanced Spectroscopy and Response Properties

For a detailed breakdown of the spectroscopy development plan, see [docs/spectroscopy_roadmap.md](file:///home/lestad/github/sci-form/docs/spectroscopy_roadmap.md).
Specific deep-dive for NMR: [docs/nmr_detailed_roadmap.md](file:///home/lestad/github/sci-form/docs/nmr_detailed_roadmap.md).

- [x] **UV-Vis Spectroscopy:** Vertical excitations via sTDA-xTB and spectral broadening (Phase D1)
- [x] **IR Spectroscopy:** Numerical Hessian, vibrational frequencies, and dipole intensities (Phase D2)
- [x] **NMR Spectroscopy:** Topological/ML-based 1H/13C prediction plus fast relative heteronuclear inference and expanded nucleus registry (Phase D3)

---

## Priority Order From Here

- [x] Reduce conformer-engine runtime: pre-computed active pair lists, stack-allocated diffs, adaptive restart limits
- [x] Harden first-try correctness: adaptive thresholds for large molecules, expanded torsion patterns
- [x] Finish the production-quality browser benchmarking and typed-array transport path
- [x] Extend EHT pipeline with population analysis (Phase C2–C3) for industrial chemical descriptors
- [x] Implement Kabsch/RMSD (Phase C6) as the production alignment and diversity metric
- [x] Complete force-field validation (Phase C7) before enabling conformer geometry optimization
- [x] Evaluate Arrow/WASM-Workers architecture (Phase C9) for browser-side ensemble rendering
- [x] **[NEW]** Implement sTDA-xTB module for ultrafast UV-Vis screening (Phase D1)
- [x] **[NEW]** Build numerical Hessian infrastructure for IR vibrational analysis (Phase D2)
- [x] **[NEW]** Integrate ML models for NMR chemical shift prediction (Phase D3)

---

## Track G: Large-Scale Algorithmic Validation

With the conformer engine, spectroscopic models, ML potentials (ANI), and quantum engines (HF-3c) essentially built, the system needs to be stress-tested against established standard libraries using comprehensive molecule sets.

- [ ] **Step 3.1: Assembly of Representative Molecule Bank:** Curate a diverse set of at least 100 representative molecules spanning different functional groups, stereocenters, and sizes.
- [ ] **Step 3.2: Automated Pipeline Benchmarking:** Construct automated Python-based test pipelines linking `sci-form` binaries against reference implementations (e.g., PySCF for quantum targets, RDKit for geometries/descriptors).
- [ ] **Step 3.3: Metric Evaluation:** Enforce a strict passing threshold (e.g., <1% deviation) across energy evaluations, structure gradients, and physical property predictions.
- [ ] **Step 3.4: Documentation of Scaling Limits:** Profile execution times and memory consumption on the 100-molecule test bank to pinpoint throughput limiters for optimization.