quadprec
Quad-precision linear algebra in Rust for solving ill-conditioned symmetric positive definite systems — specifically the overlap matrix problem that kills SCF convergence in quantum chemistry.
cargo run --release -p demo
What's new in v0.2.0
Core math (crates/quad)
- Jacobi eigensolve — symmetric eigenvalue decomposition in f128, needed for SCF canonical orthogonalization
- exp/ln/pow — transcendental functions via argument reduction + Taylor/Newton
- FromStr — parse decimal strings like
"3.14159265358979323846".parse::<f128>()to full 31-digit precision - GEMM / GEMM_ATB — parallel matrix multiply with Rayon, including $\mathbf{A}^T\mathbf{B}$ for basis transforms
- Blocked Cholesky — cache-friendly NB=64 blocked factorization for large matrices
- NEON SIMD — 2-wide f64 batch dd operations on Apple Silicon (aarch64)
- num-traits — optional
Zero,One,Num,Floattraits (--features num-traits) - serde — optional lossless JSON serialization (
--features serde)
Solver (crates/solver)
- Apple Accelerate FFI — dpotrf/dpotrs for 10-100x f64 Cholesky speedup on macOS (default-on)
- Canonical orthogonalization — $\mathbf{X} = \mathbf{U},\text{diag}(1/\sqrt{\lambda})$ with threshold, solves $\mathbf{F}\mathbf{C} = \mathbf{S}\mathbf{C}\boldsymbol{\varepsilon}$ (Roothaan-Hall)
Test count: 65 base + 12 with num-traits = 77 tests, all passing.
Why your SCF diverges
In Hartree-Fock and DFT, the Roothaan-Hall equation is a generalized eigenvalue problem:
$$\mathbf{F}\mathbf{C} = \mathbf{S}\mathbf{C}\boldsymbol{\varepsilon}$$
where $\mathbf{S}$ is the overlap matrix of atomic orbital basis functions:
$$S_{\mu\nu} = \langle \phi_\mu | \phi_\nu \rangle = \int \phi_\mu(\mathbf{r}), \phi_\nu(\mathbf{r}), d\mathbf{r}$$
The standard approach orthogonalizes the basis via $\mathbf{S} = \mathbf{L}\mathbf{L}T$ (Cholesky) or $\mathbf{X} = \mathbf{S}{-1/2}$ (canonical orthogonalization). Both require inverting or factoring $\mathbf{S}$.
The problem is the condition number $\kappa(\mathbf{S})$. When you use augmented or diffuse basis sets (aug-cc-pVDZ, aug-cc-pVTZ), nearby diffuse Gaussians overlap almost completely:
$$S_{\mu\nu} = \left(\frac{4\alpha_\mu\alpha_\nu}{(\alpha_\mu + \alpha_\nu)2}\right){d/4} \exp!\left(-\frac{\alpha_\mu\alpha_\nu}{\alpha_\mu + \alpha_\nu}|\mathbf{R}\mu - \mathbf{R}\nu|^2\right)$$
For diffuse exponents $\alpha \sim 0.01$, this approaches 1.0 for adjacent atoms, making $\mathbf{S}$ nearly singular. The condition number grows with system size:
| System | $N_\text{basis}$ | $\kappa(\mathbf{S})$ | f64 status |
|---|---|---|---|
| Small molecule | ~100 | $10^4$ | Fine |
| Protein fragment | ~750 | $10^{13}$ | Cholesky fails |
| Periodic slab | ~2000 | $10^{20}$+ | Impossible |
f64 has ~16 decimal digits. When $\kappa(\mathbf{S}) > 10{13}$, Cholesky loses all significant digits. When $\kappa(\mathbf{S}) > 10{16}$, Cholesky produces a negative pivot and refuses to continue. Your SCF never starts.
This is a well-known failure mode. Current workarounds:
- Remove diffuse functions — sacrifices accuracy for the properties you needed them for
- Use MOLOPT basis sets (CP2K) — sidesteps the problem at the theory level, not the compute level
- Canonical orthogonalization with threshold — discards near-null eigenvectors, changes your basis
None of these solve the underlying numerical problem. They all compromise the physics.
Why this fixes it
This crate implements double-double arithmetic: each scalar is stored as a pair of f64 values $(h, \ell)$ where the true value is $h + \ell$ exactly, with $|\ell| \leq \frac{1}{2},\text{ulp}(h)$.
$$\text{f64: } \epsilon \approx 10^{-16} \quad(16\text{ digits})$$
$$\text{f128 (double-double): } \epsilon \approx 10^{-31} \quad(31\text{ digits})$$
Every arithmetic operation uses error-free transformations (Dekker/Knuth):
$$\texttt{TwoSum}(a,b): \quad s + e = a + b \quad\text{exactly}$$
$$\texttt{TwoProd}(a,b): \quad p + e = a \times b \quad\text{exactly (via FMA)}$$
This gives you ~31 decimal digits per scalar. The Cholesky factorization now handles:
| $\kappa(\mathbf{S})$ | f64 | f128 (this crate) |
|---|---|---|
| $10^{13}$ | Fails | $|\mathbf{x} - \mathbf{x}_\text{true}| \sim 10^{-18}$ |
| $10^{15}$ | Fails | $|\mathbf{x} - \mathbf{x}_\text{true}| \sim 10^{-14}$ |
| $10^{20}$ | Fails | $|\mathbf{x} - \mathbf{x}_\text{true}| \sim 10^{-8}$ |
| $10^{22}$ | Fails | $|\mathbf{x} - \mathbf{x}_\text{true}| \sim 10^{-5}$ |
The solver auto-selects the cheapest precision strategy:
- $\kappa < 10^{10}$: Pure f64 Cholesky. No overhead.
- $10{10} < \kappa < 10{16}$: f64 Cholesky + iterative refinement in f128. Factor once in f64 (fast), compute residual $\mathbf{r} = \mathbf{b} - \mathbf{A}\mathbf{x}$ in f128 (exact), solve correction $\mathbf{A},\delta\mathbf{x} = \mathbf{r}$ in f128. Converges in 1–3 iterations.
- $\kappa > 10^{16}$: Full f128 Cholesky. f64 can't even factor; f128 handles it.
How to use it
Drop-in solve (f64 input)
use solve_spd;
let a: = your_overlap_matrix; // row-major, n x n
let b: = your_rhs;
let n = basis_size;
let sol = solve_spd?;
sol.x // Vec<f128> — full quad-precision solution
sol.x_f64 // Vec<f64> — truncated (loses digits when kappa > 10^10)
sol.kappa // f64 — estimated condition number
sol.strategy // Float64 | MixedPrecision | Float128
Note: When $\kappa(\mathbf{S})$ is large, the solution accuracy is limited by the precision of your input. If $\mathbf{A}$ and $\mathbf{b}$ are assembled in f64, you lose $\log_{10}\kappa$ digits before the solve even starts. For the full payoff, assemble in f128.
Full-precision assembly (the real payoff)
use f128;
use solve_spd_f128;
// Assemble S in quad precision — diffuse pairs get full 31-digit accuracy
let mut s = vec!;
for in shell_pairs
let sol = solve_spd_f128?;
// sol.x has ~31 - log10(kappa) correct digits
Using f128 directly
use f128;
let a = from_f64;
let b = from_f64;
let c = a + b;
let d = c - a;
// d.to_f64() == 1e-20 (exact — f64 would give 0.0)
// Arithmetic: + - * / sqrt() recip() abs()
// Transcendentals: exp() ln() pow()
// Parsing: "3.14159265358979323846".parse::<f128>()
// All operator-overloaded, all #[inline(always)]
Eigenvalue decomposition
use ;
// Symmetric matrix (row-major)
let a = vec!;
let = jacobi_eigen?;
// eigenvalues: [1.0, 3.0] (sorted)
// eigenvectors: orthonormal columns
Matrix multiply (GEMM)
use ;
// C = A * B (m×k * k×n → m×n, row-major)
let mut c = vec!;
gemm;
// C = A^T * B (for orthogonalization: X^T S X)
gemm_atb;
Integrating with your SCF code
The precision-critical path in SCF is small. You don't need f128 everywhere:
Your existing code (f64) This crate (f128)
┌───────────────────────┐ ┌────────────────────────┐
│ Electron integrals │ │ │
│ (Coulomb, exchange) │ │ Overlap matrix S │
│ │ │ Cholesky / eigensolve │
│ Fock matrix assembly │◄──────────►│ Iterative refinement │
│ │ │ Orthogonalization │
│ Density matrix P │ │ │
│ SCF convergence check │ │ │
└───────────────────────┘ └────────────────────────┘
~90% of wall time ~10% of wall time
Concretely, to integrate into an existing Hartree-Fock or DFT code:
-
Replace your overlap matrix assembly with f128-precision computation for shell pairs involving diffuse functions. Use
quad::f128arithmetic. -
Replace your Cholesky/orthogonalization with
solve_spd_f128()or usequad::cholesky()directly on the f128 overlap matrix. -
Keep everything else in f64. Coulomb/exchange integrals, Fock matrix construction, density matrix updates — all stay in f64. These are not precision-limited, they're compute-limited.
-
Use iterative refinement (built into
solve_spd()) for the generalized eigenvalue problem: solve in f64, correct in f128, converge.
If your code uses libint2 or libcint for integral evaluation, you can FFI to them for everything except the overlap integrals, which you recompute in f128.
Demo output
(matrix assembled in quad precision — the real payoff)
Hilbert n=12 kappa=1.1e13 ||x-1||=9.89e-18 ||r||=1.18e-31
Hilbert n=13 kappa=1.8e14 ||x-1||=1.05e-16 ||r||=5.70e-32
Hilbert n=14 kappa=2.9e15 ||x-1||=5.10e-14 ||r||=5.59e-32
Hilbert n=15 kappa=4.7e16 ||x-1||=5.22e-12 ||r||=9.14e-32
Hilbert n=18 kappa=1.9e20 ||x-1||=5.46e-8 ||r||=1.09e-31
Hilbert n=20 kappa=4.9e22 ||x-1||=9.52e-5 ||r||=1.07e-31
f64 Cholesky fails at n=13. f128 keeps going through n=20 ($\kappa = 5 \times 10{22}$). Residual stays at $\sim 10{-31}$ throughout — the solver is producing answers accurate to the limits of 31-digit arithmetic.
Why not just use a GPU?
GPUs are going the wrong direction for this problem:
| FP64 | FP16/FP8 | |
|---|---|---|
| H100 (Hopper) | 34 TFLOPS | 1979 TFLOPS |
| B200 (Blackwell) | 40 TFLOPS | 9000+ TFLOPS |
| Ratio shift | 1.2x | 4.5x |
The silicon budget is pouring into low-precision AI modes. FP64 is an afterthought. And Apple Silicon GPUs have no FP64 at all — Metal only supports FP32 and FP16.
Double-double arithmetic runs entirely on the CPU, uses only f64 hardware (which every CPU has), and parallelizes via Rayon work-stealing. On an M-series Mac, you get 10 performance cores doing f128 Cholesky while the GPU sits idle (or handles the precision-tolerant work).
Crate structure
crates/
quad/ # f128 type, BLAS-like ops, eigensolve, GEMM, SIMD kernels
compensated-sum/ # Neumaier compensated summation for f64 and f128
solver/ # solve_spd() — auto-selects precision, iterative refinement
overlap/ # Mixed-precision overlap integral assembly with Schwarz screening
demo/ # cargo run --release -p demo
v0.2.0 features
- Jacobi eigensolve — symmetric eigenvalue decomposition in full f128 precision, needed for canonical orthogonalization
- GEMM / GEMM_ATB — parallel matrix multiply with Rayon, including $\mathbf{A}^T\mathbf{B}$ variant for basis transforms
- Blocked Cholesky — cache-friendly NB=64 blocked factorization, 5-10x faster at $n > 500$
- Transcendentals —
exp(),ln(),pow()in f128 precision via argument reduction + Taylor/Newton - FromStr — parse decimal strings to full 31-digit precision:
"3.14159265358979323846".parse::<f128>() - NEON SIMD — 2-wide f64 batch operations on Apple Silicon (aarch64), ~1.5x throughput for dd arithmetic
- Apple Accelerate — optional FFI to dpotrf/dpotrs for 10-100x f64 Cholesky speedup on macOS
- num-traits — optional
Zero,One,Num,Floattrait implementations (--features num-traits) - serde — optional serialization with lossless {hi, lo} roundtrip (
--features serde) - Canonical orthogonalization — $\mathbf{X} = \mathbf{U},\text{diag}(1/\sqrt{\lambda})$ with threshold for near-null removal
- Mixed-precision overlap assembly — Schwarz screening routes ~5-15% of shell pairs to f128, rest stays f64
Requirements
- Rust 2024 edition
- Any platform with FMA (fused multiply-add) support — all modern x86-64 and ARM processors
- Core: no external dependencies beyond
rayon - Optional:
num-traits,serdevia feature flags - macOS: Apple Accelerate integration via
--features accelerate
References
- T.J. Dekker, "A floating-point technique for extending the available precision" (1971)
- D.E. Knuth, The Art of Computer Programming, Vol 2 — error-free transformations
- SCF convergence failure with aug-cc-pVDZ — the exact failure this solves
- CP2K SCF convergence problems with large basis sets
- Y. Hida, X.S. Li, D.H. Bailey, "Algorithms for Quad-Double Precision Floating Point Arithmetic" (2001)