quadprec

Quad-precision linear algebra in Rust for solving ill-conditioned symmetric positive definite systems — specifically the overlap matrix problem that kills SCF convergence in quantum chemistry.

cargo run --release -p demo

What's new in v0.2.0

Core math (crates/quad)

Jacobi eigensolve — symmetric eigenvalue decomposition in f128, needed for SCF canonical orthogonalization
exp/ln/pow — transcendental functions via argument reduction + Taylor/Newton
FromStr — parse decimal strings like "3.14159265358979323846".parse::<f128>() to full 31-digit precision
GEMM / GEMM_ATB — parallel matrix multiply with Rayon, including $\mathbf{A}^T\mathbf{B}$ for basis transforms
Blocked Cholesky — cache-friendly NB=64 blocked factorization for large matrices
NEON SIMD — 2-wide f64 batch dd operations on Apple Silicon (aarch64)
num-traits — optional Zero, One, Num, Float traits (--features num-traits)
serde — optional lossless JSON serialization (--features serde)

Solver (crates/solver)

Apple Accelerate FFI — dpotrf/dpotrs for 10-100x f64 Cholesky speedup on macOS (default-on)
Canonical orthogonalization — $\mathbf{X} = \mathbf{U},\text{diag}(1/\sqrt{\lambda})$ with threshold, solves $\mathbf{F}\mathbf{C} = \mathbf{S}\mathbf{C}\boldsymbol{\varepsilon}$ (Roothaan-Hall)

Test count: 65 base + 12 with num-traits = 77 tests, all passing.

Why your SCF diverges

In Hartree-Fock and DFT, the Roothaan-Hall equation is a generalized eigenvalue problem:

$$\mathbf{F}\mathbf{C} = \mathbf{S}\mathbf{C}\boldsymbol{\varepsilon}$$

where $\mathbf{S}$ is the overlap matrix of atomic orbital basis functions:

$$S_{\mu\nu} = \langle \phi_\mu | \phi_\nu \rangle = \int \phi_\mu(\mathbf{r}), \phi_\nu(\mathbf{r}), d\mathbf{r}$$

The standard approach orthogonalizes the basis via $\mathbf{S} = \mathbf{L}\mathbf{L}^{T$ (Cholesky) or $\mathbf{X} = \mathbf{S}}{-1/2}$ (canonical orthogonalization). Both require inverting or factoring $\mathbf{S}$.

The problem is the condition number $\kappa(\mathbf{S})$. When you use augmented or diffuse basis sets (aug-cc-pVDZ, aug-cc-pVTZ), nearby diffuse Gaussians overlap almost completely:

$$S_{\mu\nu} = \left(\frac{4\alpha_\mu\alpha_\nu}{(\alpha_\mu + \alpha_\nu)^2}\right){d/4} \exp!\left(-\frac{\alpha_\mu\alpha_\nu}{\alpha_\mu + \alpha_\nu}|\mathbf{R}\mu - \mathbf{R}\nu|^2\right)$$

For diffuse exponents $\alpha \sim 0.01$, this approaches 1.0 for adjacent atoms, making $\mathbf{S}$ nearly singular. The condition number grows with system size:

System	$N_\text{basis}$	$\kappa(\mathbf{S})$	f64 status
Small molecule	~100	$10^4$	Fine
Protein fragment	~750	$10^{13}$	Cholesky fails
Periodic slab	~2000	$10^{20}$+	Impossible

f64 has ~16 decimal digits. When $\kappa(\mathbf{S}) > 10^{{13}$, Cholesky loses all significant digits. When $\kappa(\mathbf{S}) > 10}{16}$, Cholesky produces a negative pivot and refuses to continue. Your SCF never starts.

This is a well-known failure mode. Current workarounds:

Remove diffuse functions — sacrifices accuracy for the properties you needed them for
Use MOLOPT basis sets (CP2K) — sidesteps the problem at the theory level, not the compute level
Canonical orthogonalization with threshold — discards near-null eigenvectors, changes your basis

None of these solve the underlying numerical problem. They all compromise the physics.

Why this fixes it

This crate implements double-double arithmetic: each scalar is stored as a pair of f64 values $(h, \ell)$ where the true value is $h + \ell$ exactly, with $|\ell| \leq \frac{1}{2},\text{ulp}(h)$.

$$\text{f64: } \epsilon \approx 10^{-16} \quad(16\text{ digits})$$

$$\text{f128 (double-double): } \epsilon \approx 10^{-31} \quad(31\text{ digits})$$

Every arithmetic operation uses error-free transformations (Dekker/Knuth):

$$\texttt{TwoSum}(a,b): \quad s + e = a + b \quad\text{exactly}$$

$$\texttt{TwoProd}(a,b): \quad p + e = a \times b \quad\text{exactly (via FMA)}$$

This gives you ~31 decimal digits per scalar. The Cholesky factorization now handles:

$\kappa(\mathbf{S})$	f64	f128 (this crate)
$10^{13}$	Fails	$\|\mathbf{x} - \mathbf{x}_\text{true}\| \sim 10^{-18}$
$10^{15}$	Fails	$\|\mathbf{x} - \mathbf{x}_\text{true}\| \sim 10^{-14}$
$10^{20}$	Fails	$\|\mathbf{x} - \mathbf{x}_\text{true}\| \sim 10^{-8}$
$10^{22}$	Fails	$\|\mathbf{x} - \mathbf{x}_\text{true}\| \sim 10^{-5}$

The solver auto-selects the cheapest precision strategy:

$\kappa < 10^{10}$: Pure f64 Cholesky. No overhead.
$10^{{10} < \kappa < 10}{16}$: f64 Cholesky + iterative refinement in f128. Factor once in f64 (fast), compute residual $\mathbf{r} = \mathbf{b} - \mathbf{A}\mathbf{x}$ in f128 (exact), solve correction $\mathbf{A},\delta\mathbf{x} = \mathbf{r}$ in f128. Converges in 1–3 iterations.
$\kappa > 10^{16}$: Full f128 Cholesky. f64 can't even factor; f128 handles it.

How to use it

Drop-in solve (f64 input)

use quadprec_solver::solve_spd;

let a: Vec<f64> = your_overlap_matrix;  // row-major, n x n
let b: Vec<f64> = your_rhs;
let n = basis_size;

let sol = solve_spd(&a, &b, n)?;

sol.x       // Vec<f128> — full quad-precision solution
sol.x_f64   // Vec<f64>  — truncated (loses digits when kappa > 10^10)
sol.kappa   // f64       — estimated condition number
sol.strategy // Float64 | MixedPrecision | Float128

Note: When $\kappa(\mathbf{S})$ is large, the solution accuracy is limited by the precision of your input. If $\mathbf{A}$ and $\mathbf{b}$ are assembled in f64, you lose $\log_{10}\kappa$ digits before the solve even starts. For the full payoff, assemble in f128.

Full-precision assembly (the real payoff)

use quad::f128;
use quadprec_solver::solve_spd_f128;

// Assemble S in quad precision — diffuse pairs get full 31-digit accuracy
let mut s = vec![f128::ZERO; n * n];
for (i, j) in shell_pairs {
    let alpha_i = f128::from_f64(exponents[i]);
    let alpha_j = f128::from_f64(exponents[j]);
    let r2 = compute_distance_sq_f128(centers[i], centers[j]);
    s[i * n + j] = overlap_integral_f128(alpha_i, alpha_j, r2);
    s[j * n + i] = s[i * n + j];
}

let sol = solve_spd_f128(&s, &b, n)?;
// sol.x has ~31 - log10(kappa) correct digits

Using `f128` directly

use quad::f128;

let a = f128::from_f64(1.0);
let b = f128::from_f64(1e-20);
let c = a + b;
let d = c - a;
// d.to_f64() == 1e-20 (exact — f64 would give 0.0)

// Arithmetic: + - * / sqrt() recip() abs()
// Transcendentals: exp() ln() pow()
// Parsing: "3.14159265358979323846".parse::<f128>()
// All operator-overloaded, all #[inline(always)]

Eigenvalue decomposition

use quad::{f128, jacobi_eigen};

// Symmetric matrix (row-major)
let a = vec![
    f128::from_f64(2.0), f128::from_f64(1.0),
    f128::from_f64(1.0), f128::from_f64(3.0),
];
let (eigenvalues, eigenvectors) = jacobi_eigen(&a, 2, 100)?;
// eigenvalues: [1.0, 3.0] (sorted)
// eigenvectors: orthonormal columns

Matrix multiply (GEMM)

use quad::{f128, gemm, gemm_atb};

// C = A * B  (m×k * k×n → m×n, row-major)
let mut c = vec![f128::ZERO; m * n];
gemm(&a, &b, &mut c, m, n, k);

// C = A^T * B  (for orthogonalization: X^T S X)
gemm_atb(&a, &b, &mut c, m, n, k);

Integrating with your SCF code

The precision-critical path in SCF is small. You don't need f128 everywhere:

Your existing code (f64)             This crate (f128)
┌───────────────────────┐            ┌────────────────────────┐
│ Electron integrals    │            │                        │
│ (Coulomb, exchange)   │            │ Overlap matrix S       │
│                       │            │ Cholesky / eigensolve  │
│ Fock matrix assembly  │◄──────────►│ Iterative refinement   │
│                       │            │ Orthogonalization      │
│ Density matrix P      │            │                        │
│ SCF convergence check │            │                        │
└───────────────────────┘            └────────────────────────┘
        ~90% of wall time                  ~10% of wall time

Concretely, to integrate into an existing Hartree-Fock or DFT code:

Replace your overlap matrix assembly with f128-precision computation for shell pairs involving diffuse functions. Use quad::f128 arithmetic.
Replace your Cholesky/orthogonalization with solve_spd_f128() or use quad::cholesky() directly on the f128 overlap matrix.
Keep everything else in f64. Coulomb/exchange integrals, Fock matrix construction, density matrix updates — all stay in f64. These are not precision-limited, they're compute-limited.
Use iterative refinement (built into solve_spd()) for the generalized eigenvalue problem: solve in f64, correct in f128, converge.

If your code uses libint2 or libcint for integral evaluation, you can FFI to them for everything except the overlap integrals, which you recompute in f128.

Demo output

--- Part 2: f128 input via solve_spd_f128() ---
    (matrix assembled in quad precision — the real payoff)

  Hilbert  n=12   kappa=1.1e13   ||x-1||=9.89e-18   ||r||=1.18e-31
  Hilbert  n=13   kappa=1.8e14   ||x-1||=1.05e-16   ||r||=5.70e-32
  Hilbert  n=14   kappa=2.9e15   ||x-1||=5.10e-14   ||r||=5.59e-32
  Hilbert  n=15   kappa=4.7e16   ||x-1||=5.22e-12   ||r||=9.14e-32
  Hilbert  n=18   kappa=1.9e20   ||x-1||=5.46e-8    ||r||=1.09e-31
  Hilbert  n=20   kappa=4.9e22   ||x-1||=9.52e-5    ||r||=1.07e-31

f64 Cholesky fails at n=13. f128 keeps going through n=20 ($\kappa = 5 \times 10^{{22}$). Residual stays at $\sim 10}{-31}$ throughout — the solver is producing answers accurate to the limits of 31-digit arithmetic.

Why not just use a GPU?

GPUs are going the wrong direction for this problem:

	FP64	FP16/FP8
H100 (Hopper)	34 TFLOPS	1979 TFLOPS
B200 (Blackwell)	40 TFLOPS	9000+ TFLOPS
Ratio shift	1.2x	4.5x

The silicon budget is pouring into low-precision AI modes. FP64 is an afterthought. And Apple Silicon GPUs have no FP64 at all — Metal only supports FP32 and FP16.

Double-double arithmetic runs entirely on the CPU, uses only f64 hardware (which every CPU has), and parallelizes via Rayon work-stealing. On an M-series Mac, you get 10 performance cores doing f128 Cholesky while the GPU sits idle (or handles the precision-tolerant work).

Crate structure

crates/
  quad/              # f128 type, BLAS-like ops, eigensolve, GEMM, SIMD kernels
  compensated-sum/   # Neumaier compensated summation for f64 and f128
  solver/            # solve_spd() — auto-selects precision, iterative refinement
  overlap/           # Mixed-precision overlap integral assembly with Schwarz screening
  demo/              # cargo run --release -p demo

v0.2.0 features

Jacobi eigensolve — symmetric eigenvalue decomposition in full f128 precision, needed for canonical orthogonalization
GEMM / GEMM_ATB — parallel matrix multiply with Rayon, including $\mathbf{A}^T\mathbf{B}$ variant for basis transforms
Blocked Cholesky — cache-friendly NB=64 blocked factorization, 5-10x faster at $n > 500$
Transcendentals — exp(), ln(), pow() in f128 precision via argument reduction + Taylor/Newton
FromStr — parse decimal strings to full 31-digit precision: "3.14159265358979323846".parse::<f128>()
NEON SIMD — 2-wide f64 batch operations on Apple Silicon (aarch64), ~1.5x throughput for dd arithmetic
Apple Accelerate — optional FFI to dpotrf/dpotrs for 10-100x f64 Cholesky speedup on macOS
num-traits — optional Zero, One, Num, Float trait implementations (--features num-traits)
serde — optional serialization with lossless {hi, lo} roundtrip (--features serde)
Canonical orthogonalization — $\mathbf{X} = \mathbf{U},\text{diag}(1/\sqrt{\lambda})$ with threshold for near-null removal
Mixed-precision overlap assembly — Schwarz screening routes ~5-15% of shell pairs to f128, rest stays f64

Requirements

Rust 2024 edition
Any platform with FMA (fused multiply-add) support — all modern x86-64 and ARM processors
Core: no external dependencies beyond rayon
Optional: num-traits, serde via feature flags
macOS: Apple Accelerate integration via --features accelerate

References

T.J. Dekker, "A floating-point technique for extending the available precision" (1971)
D.E. Knuth, The Art of Computer Programming, Vol 2 — error-free transformations
SCF convergence failure with aug-cc-pVDZ — the exact failure this solves
CP2K SCF convergence problems with large basis sets
Y. Hida, X.S. Li, D.H. Bailey, "Algorithms for Quad-Double Precision Floating Point Arithmetic" (2001)

quadprec 0.2.1